
[BUG] Evaluation tracker won't save a task that relies on metrics of two different types #950

@clefourrier

Description


Describe the bug

EvaluationTracker.save() will fail at

```python
dataset = Dataset.from_list([asdict(detail) for detail in task_details])
```

with

```
Exception has occurred: ArrowInvalid
cannot mix list and non-list, non-null values
```

if the launched task uses metrics that require both generative and logprob outputs. The two metric types don't save their detail fields in the same shape (one stores a list where the other stores a non-list value, for example), so Arrow cannot infer a single column type.
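For context, the underlying Arrow failure can be reproduced in isolation. This is a minimal sketch, assuming the mixed-metric details yield a field that is a list in some rows and a scalar in others:

```python
from datasets import Dataset

# Assumption: mixing generative and logprob metrics produces details
# where the same field is a list in one row and a scalar in another.
rows = [
    {"logprobs": [0.1, 0.2]},  # logprob-style detail: list value
    {"logprobs": 0.3},         # generative-style detail: scalar value
]

# Raises pyarrow.lib.ArrowInvalid:
# "cannot mix list and non-list, non-null values"
Dataset.from_list(rows)
```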

To Reproduce

            "name": "launch vllm",
            "type": "debugpy",
            "request": "launch",
            "module": "lighteval",
            "args": [
                "vllm",
                "model_name=Qwen/Qwen3-0.6B,max_num_batched_tokens=100000,max_model_length=38912,generation_parameters={temperature:0.6,top_p:0.95,top_k:20,min_p:0,presence_penalty:1,max_new_tokens:38912},system_prompt='/no_think'",
                "lighteval|mmlu_redux_2:security_studies|0", //lighteval|mmlu_redux_2|0|0",
                "--max-samples",
                "10",
                "--save-details",
            ],
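As an interim workaround (this is a sketch, not lighteval's fix; `normalize_details` is a hypothetical helper), one could coerce scalar values into singleton lists for any field that is a list in at least one row, before calling `Dataset.from_list`:

```python
from datasets import Dataset

def normalize_details(rows: list[dict]) -> list[dict]:
    """Hypothetical helper: any field that is a list in at least one
    row becomes a list in every row, so Arrow infers one column type."""
    list_fields = {k for row in rows for k, v in row.items() if isinstance(v, list)}
    return [
        {k: ([v] if k in list_fields and not isinstance(v, list) else v)
         for k, v in row.items()}
        for row in rows
    ]

# Usage: the mixed rows from the example above now build a list column.
fixed = normalize_details([{"logprobs": [0.1, 0.2]}, {"logprobs": 0.3}])
dataset = Dataset.from_list(fixed)
```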
