
Refactor Separation of embedding logic through the DocumentTransformer #1239

Open
wants to merge 1 commit into
base: main

Conversation

youngmoneee
Contributor

This PR aims to achieve two objectives through the proposed changes:

  1. Separate the common embedding logic that is currently duplicated across the VectorStore implementations behind the DocumentTransformer interface. Isolating the logic that adds embedding data before Documents are inserted into the VectorStore improves maintainability and testability (see the sketch below this list).
  2. Improve batch processing performance by executing the blocking operations asynchronously. The sequential, synchronous Embedding Request tasks are executed on a separate Scheduler using Reactor, which improves throughput.
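For illustration, a minimal sketch of the first objective is shown below. The class name EmbeddingTransformer is hypothetical (it is not part of this PR), and the sketch assumes the current Spring AI ETL contracts as I understand them: DocumentTransformer operating on a List<Document> via apply, and EmbeddingModel exposing embed(Document) returning a float[], as in the snippets further down.

import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.document.DocumentTransformer;
import org.springframework.ai.embedding.EmbeddingModel;

// Hypothetical sketch, not part of this PR: a transformer that fills in missing
// embeddings so that VectorStore implementations no longer need this logic.
public class EmbeddingTransformer implements DocumentTransformer {

    private final EmbeddingModel embeddingModel;

    public EmbeddingTransformer(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    @Override
    public List<Document> apply(List<Document> documents) {
        for (Document document : documents) {
            // Only embed documents that do not already carry an embedding.
            if (document.getEmbedding() == null || document.getEmbedding().length == 0) {
                document.setEmbedding(this.embeddingModel.embed(document));
            }
        }
        return documents;
    }
}

With something along these lines, a VectorStore could assume incoming documents are already embedded, or apply the transformer itself right before writing.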

List<WeaviateObject> weaviateObjects = documents.stream().map(this::toWeaviateObject).toList();

In this code, the map operation processes documents strictly one at a time: each toWeaviateObject call starts only after the previous one has completed.

private WeaviateObject toWeaviateObject(Document document) {
    if (document.getEmbedding() == null || document.getEmbedding().length == 0) {
        float[] embedding = this.embeddingModel.embed(document);
        document.setEmbedding(embedding);
    }
    // ... builds and returns the WeaviateObject for this document
}

default List<float[]> embed(List<String> texts) {
    Assert.notNull(texts, "Texts must not be null");
    return this.call(new EmbeddingRequest(texts, EmbeddingOptionsBuilder.builder().build()))
        .getResults()
        .stream()
        .map(Embedding::getOutput)
        .toList();
}

The call method synchronously requests an EmbeddingResponse object, creating a significant bottleneck due to the sequential execution of these blocking methods.

For comparison, embedding and inserting the same 100 Document objects into a vector database with the original code took 106 seconds, i.e. roughly one second of blocking embedding work per document.

[Screenshot: timing of the original synchronous run, 106 seconds]

The modified code instead performs the embedding step through Reactor:

return Flux.fromIterable(documents).flatMap(document -> {
    if (document.getEmbedding() == null || document.getEmbedding().length == 0) {
        return Mono
            .zip(Mono.just(document), Mono.fromCallable(() -> embeddingModel.embed(document)), (doc, embed) -> {
                doc.setEmbedding(embed);
                return doc;
            })
            .subscribeOn(Schedulers.boundedElastic());
    }
    return Mono.just(document);
}).collectList().block();

To reduce this bottleneck, the code wraps these blocking calls in Reactor types and executes them asynchronously, which keeps the required changes to the existing code small.
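A brief note on the design choice, based on Reactor's documented behavior: Schedulers.boundedElastic() is the scheduler intended for wrapping blocking work, and flatMap subscribes to the per-document Monos concurrently, so the individual embedding requests overlap instead of queueing behind one another.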

[Screenshot: timing of the modified asynchronous run, 8.6 seconds]

After modifying the code to run these tasks on a separate asynchronous scheduler, the execution time dropped to 8.6 seconds, a 92% reduction in processing time.


This PR aimed to optimize performance with minimal changes to the existing code.
However, in the long term, I think expressing the ETL pipeline as a stream, rather than as batch processing over a List, would be more appropriate (a rough sketch of that idea follows below).
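Purely to illustrate that direction (this is not an existing Spring AI API; the variable names documentReader, embeddingModel and vectorStore, and the batch size of 50, are assumptions for the sketch), a stream-shaped pipeline could look roughly like this, using the same Flux/Mono/Schedulers types as the snippet above:

// Hypothetical sketch only: documentReader, embeddingModel and vectorStore are assumed to
// follow the existing ETL contracts (DocumentReader as a Supplier<List<Document>>,
// VectorStore.add(List<Document>)); the streaming shape itself is the proposal, not an existing API.
Flux.fromIterable(documentReader.get())
    .flatMap(document -> Mono.fromCallable(() -> {
            // blocking embedding call moved off the caller thread
            document.setEmbedding(embeddingModel.embed(document));
            return document;
        })
        .subscribeOn(Schedulers.boundedElastic()))
    .buffer(50)                                   // write in small batches instead of one large List
    .concatMap(batch -> Mono.<Void>fromRunnable(() -> vectorStore.add(batch))
        .subscribeOn(Schedulers.boundedElastic()))
    .then()
    .block();

The point is the shape rather than the exact operators: each document flows through the transform stage as it arrives, and batching for the write becomes an explicit, tunable step.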

I have created an issue (#1219) related to this topic. I would appreciate any insights or thoughts you might have.

It would be great if you could take a look at the issue when you have time.

Thanks 🧑🏼‍💻

@markpollack
Member

review in light of 087de16
