
[RFC]: Optimize embedding task #21796

@maxdebayser

Description

Motivation.

Since PR #21270 we now have full support for encoder embedding models. These include not only BERT-style models but also models such as Alibaba-NLP/gte-Qwen2-1.5B-instruct, which are decoder models converted to encoders. Encoder models remain essential for information retrieval use cases such as RAG because their bidirectional attention gives them superior performance over decoder models with causal attention.

vLLM's stated mission is "Easy, fast, and cheap LLM serving for everyone". Embedding model serving has been easy since V0, but is it also fast? Snowflake's excellent blog post suggests otherwise: it describes a series of backend optimizations that quadrupled their throughput at a sequence length of 512. While one of those optimizations is specific to their infrastructure, where vLLM runs behind a gRPC frontend, the others suggest that there is room for improvement in vLLM itself.

There aren't many details about the specific vLLM version used as a baseline, but the model was Snowflake/snowflake-arctic-embed-m-v1.5, which is a BERT model, and the publication date was May 2025, so the V0 engine must have been used, since V1 support had not yet been merged into main.

This raises a series of questions: does vLLM V1 perform better than V0 for embedding tasks? How much room for improvement is there in V1? The honest answer is that we don't even know exactly how well V0 performs, since embedding benchmarks were never added to the benchmark scripts directory.

Proposed Change.

This RFC is meant to gather feedback on how we should benchmark embedding workloads and which optimizations we should try.

As an initial roadmap, I propose that we:

  1. Add embedding benchmarking scripts (a rough sketch of what such a measurement could look like follows this list)
  2. Try to reproduce the baseline vLLM V0 Snowflake results
  3. Compare V1 performance with V0
  4. Analyze the V1 results to see whether there are still GPU "bubbles" caused by tokenization and other CPU-side work
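
For item 1, a minimal sketch of what an offline throughput measurement could look like with the current `LLM.embed()` API is shown below. The model name, request count, and synthetic-prompt construction are placeholder assumptions; a real benchmark script would also need realistic input-length distributions and an online-serving counterpart.

```python
# Rough offline embedding throughput sketch (illustration only, not the
# proposed benchmark script). Model, request count and prompt construction
# are assumptions.
import time

from vllm import LLM

MODEL = "Snowflake/snowflake-arctic-embed-m-v1.5"  # BERT model from the Snowflake blog
NUM_REQUESTS = 1024

llm = LLM(model=MODEL, task="embed")

# Build synthetic prompts of roughly 512 tokens; repeating a single word keeps
# tokenization predictable and leaves room for special tokens.
prompts = [" hello" * 510 for _ in range(NUM_REQUESTS)]

start = time.perf_counter()
outputs = llm.embed(prompts)
elapsed = time.perf_counter() - start

total_tokens = sum(len(o.prompt_token_ids) for o in outputs)
print(f"{NUM_REQUESTS / elapsed:.1f} req/s, {total_tokens / elapsed:.0f} tok/s")
```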

As noted in the Snowflake blog, most embedding models are quite small, and since they don't support optimizations such as chunked prefill, GPU utilization can be sub-optimal. In their case they solved this by running multiple model instances on a single GPU. This is an option that can be explored with data parallelism, but we could also look into whether there are scheduler and model-runner optimizations that would improve GPU utilization. A rough illustration of the multi-instance idea is sketched below.
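
To make the multi-instance idea concrete, here is a sketch (not vLLM's data-parallel feature) that runs two independent engines in separate processes on the same GPU by splitting `gpu_memory_utilization`. The memory fraction, prompt sharding, and model name are assumptions for illustration.

```python
# Sketch: approximate Snowflake's "multiple model instances per GPU" setup by
# running two independent vLLM engines in separate processes on one GPU.
import multiprocessing as mp

from vllm import LLM

MODEL = "Snowflake/snowflake-arctic-embed-m-v1.5"  # assumed model under test


def worker(rank: int, prompts: list[str]) -> None:
    # Each process gets roughly half of the GPU memory for its own engine.
    llm = LLM(model=MODEL, task="embed", gpu_memory_utilization=0.45)
    outputs = llm.embed(prompts)
    print(f"instance {rank}: embedded {len(outputs)} prompts")


if __name__ == "__main__":
    mp.set_start_method("spawn")  # required for CUDA in child processes
    all_prompts = [" hello" * 510 for _ in range(1024)]
    shards = [all_prompts[0::2], all_prompts[1::2]]  # naive round-robin split
    procs = [mp.Process(target=worker, args=(i, s)) for i, s in enumerate(shards)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```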

Feedback Period.

Two weeks.

CC List.

@robertgshaw2-redhat , @DarkLight1337 , @noooop , @22quinn

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
