Bug: Incomplete Dataset for yandex-1B-200-angular/yandex_t2i_gt_100k #250

@yudhiesh

What is the issue?

I am trying to run the benchmark locally after adding a new vector database, LanceDB, specifically to compare filtered-search performance. When I run the benchmark with this command:

poetry run python3 run.py --engines "lancedb-*"

I get the following error:

FileNotFoundError: [Errno 2] No such file or directory: '/Users/Yudhiesh/Projects/vector-db-benchmark/datasets/yandex-1B-200-angular/yandex_t2i_gt_100k/vectors.npy'

I noticed that the dataset that is downloaded is incomplete:

datasets/yandex-1B-200-angular
└── yandex_t2i_gt_100k
    └── tests.jsonl

2 directories, 1 file

Based on the definition of AnnCompoundReader, the dataset directory should contain both vectors.npy and tests.jsonl. Only the .jsonl file was downloaded, so the actual vectors are missing:

class AnnCompoundReader(JSONReader):
    """
    A reader created specifically to read the format used in
    https://github.com/qdrant/ann-filtering-benchmark-datasets, in which vectors
    and their metadata are stored in separate files.
    """

    VECTORS_FILE = "vectors.npy"
    QUERIES_FILE = "tests.jsonl"
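
For anyone trying to reproduce this, a quick way to confirm which files are absent is a small check like the one below. This is only a sketch: the helper name missing_dataset_files is hypothetical and not part of the benchmark repository, the two file names are copied from the AnnCompoundReader constants above, and the path assumes the script runs from the repository root.

```python
from pathlib import Path

# File names copied from the AnnCompoundReader constants quoted above.
REQUIRED_FILES = ("vectors.npy", "tests.jsonl")


def missing_dataset_files(dataset_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from dataset_dir.

    Hypothetical helper for diagnosis only; it is not part of the benchmark.
    """
    root = Path(dataset_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumed to be relative to the repository root, as in the tree listing.
    path = "datasets/yandex-1B-200-angular/yandex_t2i_gt_100k"
    print(missing_dataset_files(path))
```

With the directory layout shown in the tree listing, this should report vectors.npy as the only missing file.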
