
[RFC]: Optimize embedding task #21796

@maxdebayser

Description

Motivation.

Since PR #21270 we now have full support for encoder embedding models. These include not only BERT-style models but also models such as Alibaba-NLP/gte-Qwen2-1.5B-instruct, which are decoder models converted to encoders. Encoder models remain essential for information retrieval use cases such as RAG because their bidirectional attention gives them superior performance over decoder models with causal attention.

vLLM's stated mission is "Easy, fast, and cheap LLM serving for everyone". Embedding model serving has been easy since V0, but is it also fast? Snowflake's excellent blog post suggests otherwise: it describes a series of backend optimizations that quadrupled their throughput at a sequence length of 512. While one of those optimizations is specific to their infrastructure, where vLLM runs behind a gRPC frontend, the others suggest that there is room for improvement in vLLM itself.

There aren't many details about the specific vLLM version used as a baseline, but the model was Snowflake/snowflake-arctic-embed-m-v1.5, which is a BERT model, and the publication date was May 2025, so the V0 engine must have been used, since V1 support had not yet been merged into main.

This raises a series of questions: does vLLM V1 perform better than V0 for embedding tasks? How much room for improvement is there in V1? The honest answer is that we don't even know exactly how well V0 performs, since embedding benchmarks were never added to the benchmark scripts directory.

Proposed Change.

This RFC is meant to gather feedback on how we should benchmark embedding workloads and which optimizations we should try.

As an initial roadmap, I propose that we:

  1. Add embedding benchmarking scripts (a rough sketch of what such a measurement could look like follows this list)
  2. Try to reproduce the baseline vLLM V0 Snowflake results
  3. Compare V1 performance with V0
  4. Analyze the V1 results to see whether there are still GPU "bubbles" caused by tokenization and other CPU-side work
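
For item 1, a minimal sketch of what an offline throughput measurement could look like with the current `LLM.embed()` API is shown below. The model name, request count, and synthetic-prompt construction are placeholder assumptions; a real benchmark script would also need realistic input-length distributions and an online-serving counterpart.

```python
# Rough offline embedding throughput sketch (illustration only, not the
# proposed benchmark script). Model, request count and prompt construction
# are assumptions.
import time

from vllm import LLM

MODEL = "Snowflake/snowflake-arctic-embed-m-v1.5"  # BERT model from the Snowflake blog
NUM_REQUESTS = 1024

llm = LLM(model=MODEL, task="embed")

# Build synthetic prompts of roughly 512 tokens; repeating a single word keeps
# tokenization predictable and leaves room for special tokens.
prompts = [" hello" * 510 for _ in range(NUM_REQUESTS)]

start = time.perf_counter()
outputs = llm.embed(prompts)
elapsed = time.perf_counter() - start

total_tokens = sum(len(o.prompt_token_ids) for o in outputs)
print(f"{NUM_REQUESTS / elapsed:.1f} req/s, {total_tokens / elapsed:.0f} tok/s")
```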

As noted in the Snowflake blog, most embedding models are quite small, and since they don't support optimizations such as chunked prefill, GPU utilization can be sub-optimal. In their case they solved this by running multiple model instances on a single GPU. This is an option that can be explored with data parallelism, but we could also look into whether there are scheduler and model-runner optimizations that would improve GPU utilization. A rough illustration of the multi-instance idea is sketched below.
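
To make the multi-instance idea concrete, here is a sketch (not vLLM's data-parallel feature) that runs two independent engines in separate processes on the same GPU by splitting `gpu_memory_utilization`. The memory fraction, prompt sharding, and model name are assumptions for illustration.

```python
# Sketch: approximate Snowflake's "multiple model instances per GPU" setup by
# running two independent vLLM engines in separate processes on one GPU.
import multiprocessing as mp

from vllm import LLM

MODEL = "Snowflake/snowflake-arctic-embed-m-v1.5"  # assumed model under test


def worker(rank: int, prompts: list[str]) -> None:
    # Each process gets roughly half of the GPU memory for its own engine.
    llm = LLM(model=MODEL, task="embed", gpu_memory_utilization=0.45)
    outputs = llm.embed(prompts)
    print(f"instance {rank}: embedded {len(outputs)} prompts")


if __name__ == "__main__":
    mp.set_start_method("spawn")  # required for CUDA in child processes
    all_prompts = [" hello" * 510 for _ in range(1024)]
    shards = [all_prompts[0::2], all_prompts[1::2]]  # naive round-robin split
    procs = [mp.Process(target=worker, args=(i, s)) for i, s in enumerate(shards)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```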

Feedback Period.

Two weeks.

CC List.

@robertgshaw2-redhat , @DarkLight1337 , @noooop , @22quinn

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
