
Unable to benchmark the Llama 4 models using llmperf #90

@nskpro-cmd

Description

Hi, I have been using llmperf for a long time to check the performance of models, but now I am unable to benchmark the newly released Llama 4 models. Could you please release a new version or make the required changes in the llmperf scripts?

Here are the results from trying to benchmark the "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8" model:

root@llmperf-654d88b897-sndcc:/data/llmperf# python3 token_benchmark_ray.py \
    --model "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8" \
    --mean-input-tokens 100 \
    --mean-output-tokens 300 \
    --stddev-output-tokens 0 \
    --stddev-input-tokens 0 \
    --max-num-completed-requests 1 \
    --timeout 600 \
    --num-concurrent-requests 1 \
    --results-dir "mavric-llama4_vllm_100_300_1-4n" \
    --llm-api openai

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
2025-04-07 08:36:55,950 WARNING services.py:2072 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2025-04-07 08:36:57,110 INFO worker.py:1841 -- Started a local Ray instance. 0%| | 0/1 [00:00<?, ?it/s]Exception in thread Thread-2 (launch_request): Traceback (most recent call last): File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner self.run() File "/usr/local/lib/python3.10/threading.py", line 953, in run self._target(*self._args, **self._kwargs) File "/data/llmperf/token_benchmark_ray.py", line 121, in launch_request request_metrics[common_metrics.INTER_TOKEN_LAT] /= request_metrics[common_metrics.NUM_OUTPUT_TOKENS] ZeroDivisionError: division by zero 0%| | 0/1 [00:00<?, ?it/s] \Results for token benchmark for meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 queried with the openai api.

Traceback (most recent call last):
  File "/data/llmperf/token_benchmark_ray.py", line 478, in <module>
    run_token_benchmark(
  File "/data/llmperf/token_benchmark_ray.py", line 319, in run_token_benchmark
    summary, individual_responses = get_token_throughput_latencies(
  File "/data/llmperf/token_benchmark_ray.py", line 167, in get_token_throughput_latencies
    ret = metrics_summary(completed_requests, start_time, end_time)
  File "/data/llmperf/token_benchmark_ray.py", line 218, in metrics_summary
    df_without_errored_req = df[df[common_metrics.ERROR_CODE].isna()]
  File "/usr/local/lib/python3.10/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexes/range.py", line 417, in get_loc
    raise KeyError(key)
KeyError: 'error_code'
(OpenAIChatCompletionsClient pid=913861) Warning Or Error: 400 Client Error: Bad Request for url: http://llama-4-maverick-instruct-fp8.llms/v1/chat/completions
(OpenAIChatCompletionsClient pid=913861) 400
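The KeyError looks like a follow-on failure: the launch thread died before recording any request metrics, so the DataFrame in metrics_summary is empty and has no error_code column. A defensive select, again a sketch assuming df is built from the per-request metric dicts, could be:

```python
# Sketch of a guard for metrics_summary in token_benchmark_ray.py (around line 218).
# If no request metrics were recorded, df is empty and lacks the error_code
# column, so only filter on it when it exists.
if common_metrics.ERROR_CODE in df.columns:
    df_without_errored_req = df[df[common_metrics.ERROR_CODE].isna()]
else:
    df_without_errored_req = df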

I think the script needs to be modified to handle this case. Could someone help me out with this?
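Since the root cause is the 400 Bad Request from the endpoint, it may also be worth checking whether the server accepts a plain chat completion for this model outside of llmperf. A minimal check with the openai Python client (base_url and model are taken from the logs above; the api_key value is a placeholder) could look like:

```python
# Minimal sanity check against the endpoint from the error log above.
# base_url and model come from this issue; api_key is a placeholder and may
# need to match however the server is configured.
from openai import OpenAI

client = OpenAI(
    base_url="http://llama-4-maverick-instruct-fp8.llms/v1",
    api_key="EMPTY",
)

resp = client.chat.completions.create(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8",
    messages=[{"role": "user", "content": "Say hello."}],
    max_tokens=16,
)
print(resp.choices[0].message.content)
```

If this also returns a 400, the problem is on the server or model side rather than in the llmperf scripts.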
