Commit f8be4f1

Merge pull request #61 from zc277584121/main
add more built-in functions examples in README
2 parents 613e04a + 104b207 commit f8be4f1

1 file changed: README.md (106 additions & 20 deletions)
@@ -164,6 +164,9 @@ print('RAG answer:', results["generator"]["replies"][0])
 ```
 
 ## Sparse Retrieval
+### Sparse retrieval with Haystack sparse embedder
+This example demonstrates the basic approach to sparse indexing and retrieval using Haystack's sparse embedders.
+
 ```python
 from haystack import Document, Pipeline
 from haystack.components.writers import DocumentWriter
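As a mental model for the sparse embeddings used in this example: a sparse vector stores only a handful of non-zero (token id, weight) pairs (elsewhere in this diff: "vector with 48 non-zero elements"), and retrieval ranks documents by the dot product between the query vector and each document vector. A toy sketch with made-up ids and weights, not the FastEmbed/SPLADE implementation:

```python
# Toy sketch: sparse vectors as {token_id: weight} dicts, ranked by dot product.
# Real sparse embedders (e.g. SPLADE via FastEmbed) learn these weights, and
# Milvus computes the scores server-side; ids and weights here are invented.

def sparse_dot(q, d):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(w * d[t] for t, w in q.items() if t in d)

docs = {
    "doc_berlin": {101: 0.9, 412: 0.7},           # tokens of "Wolfgang ... Berlin"
    "doc_milvus": {733: 1.1, 880: 0.8, 512: 0.5}, # tokens of "full text ... Milvus"
}
query = {733: 1.0, 880: 0.9}  # sparse query vector for "full text search"

ranked = sorted(docs, key=lambda name: sparse_dot(query, docs[name]), reverse=True)
print(ranked[0])  # doc_milvus: 1.1*1.0 + 0.8*0.9 ≈ 1.82 beats doc_berlin's 0.0
```

Only the overlapping token ids contribute to the score, which is why sparse retrieval behaves like a (weighted) keyword search.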
@@ -185,7 +188,7 @@ documents = [
     Document(content="My name is Wolfgang and I live in Berlin"),
     Document(content="I saw a black horse running"),
     Document(content="Germany has many big cities"),
-    Document(content="fastembed is supported by and maintained by Milvus."),
+    Document(content="full text search is supported by Milvus."),
 ]
 
 sparse_document_embedder = FastembedSparseDocumentEmbedder()
@@ -198,22 +201,64 @@ indexing_pipeline.connect("sparse_document_embedder", "writer")
 
 indexing_pipeline.run({"sparse_document_embedder": {"documents": documents}})
 
-query_pipeline = Pipeline()
-query_pipeline.add_component("sparse_text_embedder", FastembedSparseTextEmbedder())
-query_pipeline.add_component("sparse_retriever", MilvusSparseEmbeddingRetriever(document_store=document_store))
-query_pipeline.connect("sparse_text_embedder.sparse_embedding", "sparse_retriever.query_sparse_embedding")
+retrieval_pipeline = Pipeline()
+retrieval_pipeline.add_component("sparse_text_embedder", FastembedSparseTextEmbedder())
+retrieval_pipeline.add_component("sparse_retriever", MilvusSparseEmbeddingRetriever(document_store=document_store))
+retrieval_pipeline.connect("sparse_text_embedder.sparse_embedding", "sparse_retriever.query_sparse_embedding")
 
-query = "Who supports fastembed?"
+query = "who supports full text search?"
 
-result = query_pipeline.run({"sparse_text_embedder": {"text": query}})
+result = retrieval_pipeline.run({"sparse_text_embedder": {"text": query}})
 
 print(result["sparse_retriever"]["documents"][0])
 
-# Document(id=..., content: 'fastembed is supported by and maintained by Milvus.', sparse_embedding: vector with 48 non-zero elements)
+# Document(id=..., content: 'full text search is supported by Milvus.', sparse_embedding: vector with 48 non-zero elements)
+```
+### Sparse retrieval with Milvus built-in BM25 function
+Milvus provides a built-in BM25 function that can generate sparse vectors directly from text fields. This approach simplifies pipeline construction compared to using Haystack's sparse embedders. The main differences are:
+
+1. We need to specify a `BM25BuiltInFunction` in the document store, with some field-specification parameters.
+2. We don't need to use an embedder explicitly, since Milvus computes the sparse embedding on the server side.
+3. The pipeline is simpler, with fewer components and connections.
+
+Below is a complete example using Milvus' built-in BM25 function. The lines marked `+` show the simplified approach using Milvus' built-in functionality, while the lines marked `-` show the original approach that requires explicit sparse embedding:
+
+```diff
++ from milvus_haystack.function import BM25BuiltInFunction
++
+document_store = MilvusDocumentStore(
+    connection_args={"uri": "http://localhost:19530"},
+    sparse_vector_field="sparse_vector",
+    text_field="text",
++    builtin_function=[
++        BM25BuiltInFunction(  # The BM25 function converts the text into a sparse vector.
++            input_field_names="text", output_field_names="sparse_vector",
++        )
++    ],
+    drop_old=True,
+)
+- sparse_document_embedder = FastembedSparseDocumentEmbedder()
+writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.NONE)
+indexing_pipeline = Pipeline()
+- indexing_pipeline.add_component("sparse_document_embedder", sparse_document_embedder)
+indexing_pipeline.add_component("writer", writer)
+- indexing_pipeline.connect("sparse_document_embedder", "writer")
+- indexing_pipeline.run({"sparse_document_embedder": {"documents": documents}})
++ indexing_pipeline.run({"writer": {"documents": documents}})
+retrieval_pipeline = Pipeline()
+- retrieval_pipeline.add_component("sparse_text_embedder", FastembedSparseTextEmbedder())
+retrieval_pipeline.add_component("sparse_retriever", MilvusSparseEmbeddingRetriever(document_store=document_store))
+- retrieval_pipeline.connect("sparse_text_embedder.sparse_embedding", "sparse_retriever.query_sparse_embedding")
+query = "who supports full text search?"
+- result = retrieval_pipeline.run({"sparse_text_embedder": {"text": query}})
++ result = retrieval_pipeline.run({"sparse_retriever": {"query_text": query}})
+print(result["sparse_retriever"]["documents"][0])
 ```
 
-## Hybrid Retrieval
 
+## Hybrid Retrieval
+### Hybrid retrieval with Haystack sparse embedder
+This example demonstrates the basic approach to performing hybrid retrieval using Haystack's sparse embedders.
 ```python
 from haystack import Document, Pipeline
 from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
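The `BM25BuiltInFunction` above delegates sparse scoring to the Milvus server. As a rough illustration of what BM25 computes there, here is a self-contained toy scorer over the example documents, using the standard BM25 formula with common defaults k1 = 1.5, b = 0.75 and naive whitespace tokenization; this is an illustration only, not the milvus_haystack or Milvus implementation:

```python
import math

# Toy BM25 over the example documents (standard formula, naive tokenization).
docs = [
    "My name is Wolfgang and I live in Berlin",
    "I saw a black horse running",
    "Germany has many big cities",
    "full text search is supported by Milvus.",
]
tokenized = [d.lower().split() for d in docs]
N = len(tokenized)
avgdl = sum(len(d) for d in tokenized) / N  # average document length

def idf(term):
    """Smoothed inverse document frequency of a term."""
    n = sum(term in d for d in tokenized)
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

def bm25(query, doc, k1=1.5, b=0.75):
    """BM25 score of one tokenized doc for a query string."""
    score = 0.0
    for term in query.lower().split():
        tf = doc.count(term)
        if tf:
            score += idf(term) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

query = "who supports full text search?"
scores = [bm25(query, d) for d in tokenized]
best = max(range(N), key=scores.__getitem__)
print(docs[best])  # only the Milvus document shares query terms ("full", "text")
```

The term-frequency and document-length statistics live entirely in the index, which is why no client-side embedder is needed in the built-in-function pipeline.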
@@ -236,7 +281,7 @@ documents = [
     Document(content="My name is Wolfgang and I live in Berlin"),
     Document(content="I saw a black horse running"),
     Document(content="Germany has many big cities"),
-    Document(content="fastembed is supported by and maintained by Milvus."),
+    Document(content="full text search is supported by Milvus."),
 ]
 
 writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.NONE)
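The `MilvusHybridRetriever` used in the hybrid examples defaults to `RRFRanker()` for fusing the dense and sparse result lists. Reciprocal rank fusion scores each document by summing 1/(k + rank) over the input rankings; the following is a toy sketch of that idea (k = 60 is the conventional constant; the document ids are invented, and this is not the Milvus ranker implementation):

```python
# Toy reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank(d)).
def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["doc_milvus", "doc_berlin", "doc_horse"]    # ranking from dense retrieval
sparse_hits = ["doc_milvus", "doc_horse", "doc_germany"]  # ranking from sparse/BM25
fused = rrf([dense_hits, sparse_hits])
print(fused[0])  # doc_milvus leads both input rankings, so it tops the fusion
```

Because RRF uses only ranks, not raw scores, it fuses dense and sparse results without any score normalization, which is why it is a convenient default; `WeightedRanker(0.5, 0.5)` is the score-weighted alternative shown commented out in the example.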
@@ -250,34 +295,75 @@ indexing_pipeline.connect("dense_doc_embedder", "writer")
 
 indexing_pipeline.run({"sparse_doc_embedder": {"documents": documents}})
 
-querying_pipeline = Pipeline()
-querying_pipeline.add_component("sparse_text_embedder",
+retrieval_pipeline = Pipeline()
+retrieval_pipeline.add_component("sparse_text_embedder",
                                 FastembedSparseTextEmbedder(model="prithvida/Splade_PP_en_v1"))
 
-querying_pipeline.add_component("dense_text_embedder", OpenAITextEmbedder())
-querying_pipeline.add_component(
+retrieval_pipeline.add_component("dense_text_embedder", OpenAITextEmbedder())
+retrieval_pipeline.add_component(
     "retriever",
     MilvusHybridRetriever(
         document_store=document_store,
         # reranker=WeightedRanker(0.5, 0.5), # Default is RRFRanker()
     )
 )
 
-querying_pipeline.connect("sparse_text_embedder.sparse_embedding", "retriever.query_sparse_embedding")
-querying_pipeline.connect("dense_text_embedder.embedding", "retriever.query_embedding")
+retrieval_pipeline.connect("sparse_text_embedder.sparse_embedding", "retriever.query_sparse_embedding")
+retrieval_pipeline.connect("dense_text_embedder.embedding", "retriever.query_embedding")
 
-question = "Who supports fastembed?"
+question = "who supports full text search?"
 
-results = querying_pipeline.run(
+results = retrieval_pipeline.run(
     {"dense_text_embedder": {"text": question},
      "sparse_text_embedder": {"text": question}}
 )
 
 print(results["retriever"]["documents"][0])
 
-# Document(id=..., content: 'fastembed is supported by and maintained by Milvus.', embedding: vector of size 1536, sparse_embedding: vector with 48 non-zero elements)
-
+# Document(id=..., content: 'full text search is supported by Milvus.', embedding: vector of size 1536, sparse_embedding: vector with 48 non-zero elements)
+```
+### Hybrid retrieval with Milvus built-in BM25 function
+Milvus provides a built-in BM25 function that can generate sparse vectors directly from text fields. This approach simplifies pipeline construction compared to using Haystack's sparse embedders, making it a useful complement to semantic search. The main differences are:
+
+1. We need to specify a `BM25BuiltInFunction` in the document store, with some field-specification parameters.
+2. We don't need to use an embedder explicitly, since Milvus computes the sparse embedding on the server side.
+3. The pipeline is simpler, with fewer components and connections, which is especially beneficial in hybrid retrieval setups.
+
+Below is a complete example using Milvus' built-in BM25 function for hybrid retrieval. The lines marked `+` show the simplified approach using Milvus' built-in functionality, while the lines marked `-` show the original approach that requires explicit sparse embedding:
+
+```diff
++ from milvus_haystack.function import BM25BuiltInFunction
++
+document_store = MilvusDocumentStore(
+    connection_args={"uri": "http://localhost:19530"},
+    sparse_vector_field="sparse_vector",
+    text_field="text",
++    builtin_function=[
++        BM25BuiltInFunction(  # The BM25 function converts the text into a sparse vector.
++            input_field_names="text", output_field_names="sparse_vector",
++        )
++    ],
+    drop_old=True,
+)
+- sparse_document_embedder = FastembedSparseDocumentEmbedder()
+writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.NONE)
+indexing_pipeline = Pipeline()
+- indexing_pipeline.add_component("sparse_document_embedder", sparse_document_embedder)
+indexing_pipeline.add_component("writer", writer)
+- indexing_pipeline.connect("sparse_document_embedder", "writer")
+- indexing_pipeline.run({"sparse_document_embedder": {"documents": documents}})
++ indexing_pipeline.run({"writer": {"documents": documents}})
+retrieval_pipeline = Pipeline()
+- retrieval_pipeline.add_component("sparse_text_embedder", FastembedSparseTextEmbedder())
+retrieval_pipeline.add_component("sparse_retriever", MilvusSparseEmbeddingRetriever(document_store=document_store))
+- retrieval_pipeline.connect("sparse_text_embedder.sparse_embedding", "sparse_retriever.query_sparse_embedding")
+query = "who supports full text search?"
+- result = retrieval_pipeline.run({"sparse_text_embedder": {"text": query}})
++ result = retrieval_pipeline.run({"sparse_retriever": {"query_text": query}})
+print(result["sparse_retriever"]["documents"][0])
 ```
+
+
 ## License
 
 `milvus-haystack` is distributed under the terms of the [Apache-2.0](https://spdx.org/licenses/Apache-2.0.html) license.
