Hi, I have been using llmperf for a long time to check the performance of models. Now I am unable to benchmark the newly released Llama 4 models. Could you please release a new version or make the required changes to the llmperf scripts?

Here are the results when I tried to benchmark the `meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8` model:
```
root@llmperf-654d88b897-sndcc:/data/llmperf# python3 token_benchmark_ray.py --model "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8" --mean-input-tokens 100 --mean-output-tokens 300 --stddev-output-tokens 0 --stddev-input-tokens 0 --max-num-completed-requests 1 --timeout 600 --num-concurrent-requests 1 --results-dir "mavric-llama4_vllm_100_300_1-4n" --llm-api openai
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
2025-04-07 08:36:55,950 WARNING services.py:2072 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=10.24gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2025-04-07 08:36:57,110 INFO worker.py:1841 -- Started a local Ray instance.
  0%|          | 0/1 [00:00<?, ?it/s]
Exception in thread Thread-2 (launch_request):
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/data/llmperf/token_benchmark_ray.py", line 121, in launch_request
    request_metrics[common_metrics.INTER_TOKEN_LAT] /= request_metrics[common_metrics.NUM_OUTPUT_TOKENS]
ZeroDivisionError: division by zero
  0%|          | 0/1 [00:00<?, ?it/s]
\Results for token benchmark for meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 queried with the openai api.
Traceback (most recent call last):
  File "/data/llmperf/token_benchmark_ray.py", line 478, in <module>
    run_token_benchmark(
  File "/data/llmperf/token_benchmark_ray.py", line 319, in run_token_benchmark
    summary, individual_responses = get_token_throughput_latencies(
  File "/data/llmperf/token_benchmark_ray.py", line 167, in get_token_throughput_latencies
    ret = metrics_summary(completed_requests, start_time, end_time)
  File "/data/llmperf/token_benchmark_ray.py", line 218, in metrics_summary
    df_without_errored_req = df[df[common_metrics.ERROR_CODE].isna()]
  File "/usr/local/lib/python3.10/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/usr/local/lib/python3.10/site-packages/pandas/core/indexes/range.py", line 417, in get_loc
    raise KeyError(key)
KeyError: 'error_code'
(OpenAIChatCompletionsClient pid=913861) Warning Or Error: 400 Client Error: Bad Request for url: http://llama-4-maverick-instruct-fp8.llms/v1/chat/completions
(OpenAIChatCompletionsClient pid=913861) 400
```
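For anyone hitting the same crash before an official fix lands: the first failure is that the request errors out with HTTP 400 and produces zero output tokens, so the inter-token-latency normalization at `token_benchmark_ray.py` line 121 divides by zero. A minimal defensive sketch of that step (the metric key strings below are illustrative stand-ins for the `common_metrics` constants shown in the traceback, not the library's actual values):

```python
# Illustrative key names; the real ones come from llmperf's common_metrics module.
INTER_TOKEN_LAT = "inter_token_latency"
NUM_OUTPUT_TOKENS = "number_output_tokens"

def normalize_inter_token_latency(request_metrics: dict) -> dict:
    """Divide total inter-token latency by token count, guarding failed requests."""
    num_tokens = request_metrics.get(NUM_OUTPUT_TOKENS, 0)
    if num_tokens > 0:
        request_metrics[INTER_TOKEN_LAT] /= num_tokens
    else:
        # No tokens were produced (the request failed), so the per-token
        # latency is undefined; record 0.0 instead of crashing the thread.
        request_metrics[INTER_TOKEN_LAT] = 0.0
    return request_metrics
```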
I think the script needs to be modified to handle this case. Could someone help me out?
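The second crash, the `KeyError: 'error_code'` in `metrics_summary`, happens because the results DataFrame is indexed by the error-code column even when no request recorded one, so the column can be missing entirely. A hedged sketch of a guard (the column name is taken from the `KeyError` in the traceback; this is a workaround idea, not an official patch):

```python
import pandas as pd

ERROR_CODE = "error_code"  # column name taken from the KeyError above

def drop_errored_requests(df: pd.DataFrame) -> pd.DataFrame:
    """Return only rows without an error code, tolerating a missing column."""
    if ERROR_CODE not in df.columns:
        # No request recorded an error code, so there is nothing to filter out.
        return df
    return df[df[ERROR_CODE].isna()]
```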