wiki_dpr pre-processing performance

I've been working with wiki_dpr and noticed that the dataset processing is seriously impaired in performance [1]. It takes about 12h to process the entire dataset. Most of this time is simply loading and processing the data, but the actual indexing is also quite slow (3h).

I won't repeat the concerns around multiprocessing as they are addressed in other issues (#786), but this is the first obvious thing to do. Using cython to speed up the text manipulation may be also help. Loading and processing a dataset of this size in under 15 minutes does not seem unreasonable on a modern multi-core machine. I have hit such targets myself on similar tasks. Would love to see this improve.

The other issue is that it takes 3h to construct the FAISS index. If only we could use GPUs with HNSW, but we can't. My sharded GPU indexing code can build an IVF + PQ index in 10 minutes on 20 million vectors. Still, 3h seems slow even for the CPU.

It looks like HF is adding only 1000 vectors at a time by default [2], whereas the faiss benchmarks adds 1 million vectors at a time (effectively) [3]. It's possible the runtime could be reduced with a larger batch. Also, it looks like project dependencies ultimately use OpenBLAS, but this is known to have issues when combined with OpenMP, which HNSW does [3]. A workaround is to set the environment variable `OMP_WAIT_POLICY=PASSIVE` via `os.environ` or similar.

References:
[1] https://github.com/huggingface/datasets/blob/master/datasets/wiki_dpr/wiki_dpr.py
[2] https://github.com/huggingface/datasets/blob/master/src/datasets/search.py
[3] https://github.com/facebookresearch/faiss/blob/master/benchs/bench_hnsw.py
[4] https://github.com/facebookresearch/faiss/issues/422

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

wiki_dpr pre-processing performance #1670

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

wiki_dpr pre-processing performance #1670

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions