Evaluate on OSWorld #642

Open
@abrichr

Description

Feature request

We would like to test OpenAdapt's ability to perform the tasks in https://os-world.github.io/.

This may involve creating recordings of the tasks described in the benchmark, since (as per https://github.com/xlang-ai/OSWorld/tree/main/evaluation_examples) the data samples are formatted as:

{
    "id": "uid", # unique id
    "snapshot": "snapshot_id", # the snapshot id of the environment, with some data already there and apps already opened, or just the desktop
    "instruction": "natural_language_instruction", # the natural language instruction of the task, i.e. what we want the agent to do
    "source": "website_url", # where this example comes from: a forum, a website, or a paper
    "config": {xxx}, # the scripts that set up the initial state of the task, e.g. downloading and opening files
    "trajectory": "trajectory_directory", # the trajectory directory, which contains the action sequence file, the screenshots, and the recording video
    "related_apps": ["app1", "app2", ...], # the related apps, which are opened during the task
    "evaluator": "evaluation_dir", # the directory of the evaluator, which contains the evaluation script for this example
…
}
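
For reference, a minimal sketch of loading and inspecting one such example (the file path below is a placeholder; the actual filenames under evaluation_examples/ vary per domain and task, and the field names are taken from the schema above):

import json
from pathlib import Path

# Hypothetical path to a single OSWorld example file; adjust to the
# actual layout of the evaluation_examples directory.
example_path = Path("evaluation_examples/examples/example.json")

with example_path.open() as f:
    example = json.load(f)

# Field names as documented in the schema above.
print(example["id"], "-", example["instruction"])
print("related apps:", example["related_apps"])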

The ./trajectories directory contains the annotated trajectory for completing the task in each data item under ./examples.

Unfortunately this directory does not appear to be included in the repo. Therefore, completing this evaluation may involve manually re-creating the trajectories via openadapt.record.
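
As a rough sketch of what that could look like (hedged: the directory layout and the openadapt.record invocation are assumptions to be verified against the OSWorld and OpenAdapt repos), one could loop over the example files and start a recording session per instruction, performing each task manually while the recorder captures the trajectory:

import json
import subprocess
from pathlib import Path

# Hypothetical location of the OSWorld examples; adjust to the actual checkout.
examples_dir = Path("OSWorld/evaluation_examples/examples")

for example_file in sorted(examples_dir.rglob("*.json")):
    instruction = json.loads(example_file.read_text())["instruction"]
    # Assumes the recording entry point `python -m openadapt.record
    # "<task description>"`; check the OpenAdapt README for the current CLI.
    subprocess.run(["python", "-m", "openadapt.record", instruction], check=True)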

Motivation

Evaluation
