wiki_dpr pre-processing performance #1670

Open
@dbarnhart

Description

I've been working with wiki_dpr and noticed that the dataset processing is very slow [1]. It takes about 12 hours to process the entire dataset. Most of that time is spent loading and processing the data, but the actual indexing is also quite slow (3 hours).

I won't repeat the concerns around multiprocessing since they are addressed in other issues (#786), but that is the first obvious thing to do. Using Cython to speed up the text manipulation may also help. Loading and processing a dataset of this size in under 15 minutes does not seem unreasonable on a modern multi-core machine; I have hit such targets myself on similar tasks. Would love to see this improve.
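For illustration, here is a rough sketch of the kind of multiprocessing the library already exposes through `Dataset.map(num_proc=...)`. This doesn't fix the generation step itself, and the `clean_text` helper and its parameters are placeholders, not the actual wiki_dpr processing code:

```python
from datasets import load_dataset

def clean_text(batch):
    # Placeholder text manipulation; the real processing lives in wiki_dpr.py [1]
    batch["text"] = [t.strip().replace("\n", " ") for t in batch["text"]]
    return batch

# Config name taken from the wiki_dpr script; adjust to whichever variant you use
ds = load_dataset("wiki_dpr", "psgs_w100.nq.no_index", split="train")

# Batched, multiprocessed map: 8 workers, 10k rows per batch (both are guesses)
ds = ds.map(clean_text, batched=True, batch_size=10_000, num_proc=8)
```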

The other issue is that it takes 3 hours to construct the FAISS index. It would be great if we could use GPUs with HNSW, but FAISS doesn't support that. My sharded GPU indexing code can build an IVF + PQ index on 20 million vectors in 10 minutes. Still, 3 hours seems slow even for the CPU.
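For reference, a minimal sketch of what sharded GPU IVF + PQ indexing looks like in raw FAISS (not my actual code; the dimension, index string, and training sample size are placeholders):

```python
import faiss
import numpy as np

d = 768  # DPR embedding dimension
cpu_index = faiss.index_factory(d, "IVF4096,PQ64", faiss.METRIC_INNER_PRODUCT)

# Shard the index across all visible GPUs instead of replicating it
co = faiss.GpuMultipleClonerOptions()
co.shard = True
gpu_index = faiss.index_cpu_to_all_gpus(cpu_index, co=co)

xb = np.random.rand(1_000_000, d).astype("float32")  # placeholder embeddings
gpu_index.train(xb[:200_000])  # train the coarse quantizer and PQ on a sample
gpu_index.add(xb)

# Copy back to CPU if the index needs to be written to disk
cpu_index = faiss.index_gpu_to_cpu(gpu_index)
```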

It looks like HF is adding only 1000 vectors at a time by default [2], whereas the FAISS HNSW benchmark effectively adds 1 million vectors at a time [3]. It's possible the runtime could be reduced with a larger batch. Also, it looks like the project dependencies ultimately use OpenBLAS, which is known to have issues when combined with OpenMP, and HNSW uses OpenMP [4]. A workaround is to set the environment variable OMP_WAIT_POLICY=PASSIVE via os.environ or similar.
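A sketch of both ideas together, using raw FAISS rather than the datasets wrapper (M, efConstruction, and the batch size here are guesses, not the values wiki_dpr actually uses):

```python
import os

# Must be set before faiss (and therefore OpenMP) is loaded
os.environ["OMP_WAIT_POLICY"] = "PASSIVE"

import faiss
import numpy as np

d = 768
index = faiss.IndexHNSWFlat(d, 128)
index.hnsw.efConstruction = 200

embeddings = np.random.rand(2_000_000, d).astype("float32")  # placeholder vectors

# Add in large chunks instead of 1000 vectors at a time
batch_size = 1_000_000
for start in range(0, len(embeddings), batch_size):
    index.add(embeddings[start:start + batch_size])
```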

References:
[1] https://github.com/huggingface/datasets/blob/master/datasets/wiki_dpr/wiki_dpr.py
[2] https://github.com/huggingface/datasets/blob/master/src/datasets/search.py
[3] https://github.com/facebookresearch/faiss/blob/master/benchs/bench_hnsw.py
[4] https://github.com/facebookresearch/faiss/issues/422
