Hi, I have been using llmperf for a long time to check the performance of models. Now I am unable to benchmark the newly released Llama 4 models. Could you please release a new version or make the required changes to the llmperf scripts?

Here are the results when I tried to benchmark the `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8` model:
```
root@llmperf-654d88b897-sndcc:/data/llmperf# python3 token_benchmark_ray.py --model "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8" --mean-input-tokens 100 --mean-output-tokens 300 --stddev-output-tokens 0 --stddev-input-tokens 0 --max-num-completed-requests 1 --timeout 600 --num-concurrent-requests 1 --results-dir "mavric-llama4_vllm_100_300_1-4n" --llm-api openai
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
2025-04-07 08:36:55,950 WARNING services.py:2072 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2025-04-07 08:36:57,110 INFO worker.py:1841 -- Started a local Ray instance.
  0%|          | 0/1 [00:00<?, ?it/s]
Exception in thread Thread-2 (launch_request):
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/data/llmperf/token_benchmark_ray.py", line 121, in launch_request
    request_metrics[common_metrics.INTER_TOKEN_LAT] /= request_metrics[common_metrics.NUM_OUTPUT_TOKENS]
ZeroDivisionError: division by zero
  0%|          | 0/1 [00:00<?, ?it/s]
\Results for token benchmark for meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 queried with the openai api.
Traceback (most recent call last):
  File "/data/llmperf/token_benchmark_ray.py", line 478, in <module>
    run_token_benchmark(
  File "/data/llmperf/token_benchmark_ray.py", line 319, in run_token_benchmark
    summary, individual_responses = get_token_throughput_latencies(
  File "/data/llmperf/token_benchmark_ray.py", line 167, in get_token_throughput_latencies
    ret = metrics_summary(completed_requests, start_time, end_time)
  File "/data/llmperf/token_benchmark_ray.py", line 218, in metrics_summary
    df_without_errored_req = df[df[common_metrics.ERROR_CODE].isna()]
  File "/usr/local/lib/python3.10/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexes/range.py", line 417, in get_loc
    raise KeyError(key)
KeyError: 'error_code'
(OpenAIChatCompletionsClient pid=913861) Warning Or Error: 400 Client Error: Bad Request for url: http://llama-4-maverick-instruct-fp8.llms/v1/chat/completions
(OpenAIChatCompletionsClient pid=913861) 400
```
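For anyone hitting the same crash before an official fix lands: the first failure is that the request errors out with HTTP 400 and produces zero output tokens, so the inter-token-latency normalization at `token_benchmark_ray.py` line 121 divides by zero. A minimal defensive sketch of that step (the metric key strings below are illustrative stand-ins for the `common_metrics` constants shown in the traceback, not the library's actual values):

```python
# Illustrative key names; the real ones come from llmperf's common_metrics module.
INTER_TOKEN_LAT = "inter_token_latency"
NUM_OUTPUT_TOKENS = "number_output_tokens"

def normalize_inter_token_latency(request_metrics: dict) -> dict:
    """Divide total inter-token latency by token count, guarding failed requests."""
    num_tokens = request_metrics.get(NUM_OUTPUT_TOKENS, 0)
    if num_tokens > 0:
        request_metrics[INTER_TOKEN_LAT] /= num_tokens
    else:
        # No tokens were produced (the request failed), so the per-token
        # latency is undefined; record 0.0 instead of crashing the thread.
        request_metrics[INTER_TOKEN_LAT] = 0.0
    return request_metrics
```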
I think the script needs to be modified to handle this case. Could someone help me out?
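The second crash, the `KeyError: 'error_code'` in `metrics_summary`, happens because the results DataFrame is indexed by the error-code column even when no request recorded one, so the column can be missing entirely. A hedged sketch of a guard (the column name is taken from the `KeyError` in the traceback; this is a workaround idea, not an official patch):

```python
import pandas as pd

ERROR_CODE = "error_code"  # column name taken from the KeyError above

def drop_errored_requests(df: pd.DataFrame) -> pd.DataFrame:
    """Return only rows without an error code, tolerating a missing column."""
    if ERROR_CODE not in df.columns:
        # No request recorded an error code, so there is nothing to filter out.
        return df
    return df[df[ERROR_CODE].isna()]
```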