Project URL
https://github.com/cocoindex-io/cocoindex
Category
Python
Project Title
The world's first data indexing framework with custom logic support and built-in incremental updates
Project Description
CocoIndex is the world's first data framework that supports custom logic and ships with built-in incremental updates. It helps you prepare data for AI efficiently (RAG, semantic search): assemble your ETL pipeline like Lego bricks, in the simplest possible form, with incremental updates included.
CocoIndex is a framework plus engine into which you can plug any custom module: PDF parsing, chunking, embedding, and more.
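For example, custom logic can be dropped into a flow as a user-defined function. The sketch below is hedged: it assumes the `cocoindex.op.function()` decorator described in the project's documentation on custom functions, and `clean_markdown` is a hypothetical step made up for illustration, not part of CocoIndex.

import cocoindex

# Hypothetical custom step (an assumption for illustration): normalize markdown
# text before it is chunked and embedded.
@cocoindex.op.function()
def clean_markdown(text: str) -> str:
    return "\n".join(line.rstrip() for line in text.splitlines())

# Inside a flow definition it would chain like any built-in function, e.g.:
#   doc["clean_content"] = doc["content"].transform(clean_markdown)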
🔥 Core features:
- Build your RAG pipeline like Lego bricks.
- Incremental updates: when your source data changes, the CocoIndex engine minimizes recomputation and only updates the required delta (see the sketch after this list).
- Data flow programming: define your data flow in the simplest possible form.
- Efficient and stable: the core is written in Rust 🦀, with a Python 🐍 SDK for Python users.
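To make the incremental-update idea concrete, here is a standalone Python illustration of delta detection by content hashing. It is only a sketch of the concept, not CocoIndex's engine (which handles change tracking for you in its Rust core); the function and its arguments are made up for illustration.

import hashlib
from pathlib import Path

def changed_files(src_dir: str, seen_hashes: dict[str, str]) -> list[Path]:
    """Return source files whose content changed since the hashes stored in `seen_hashes`."""
    delta = []
    for path in sorted(Path(src_dir).glob("**/*.md")):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if seen_hashes.get(str(path)) != digest:
            seen_hashes[str(path)] = digest
            delta.append(path)  # only these need re-chunking / re-embedding
    return delta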
Highlights
Well documented and beginner friendly. Build your RAG pipeline from modular pieces and get started in five minutes 🚀.
Example Code
import cocoindex


@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a directory
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()

    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(
                language="markdown", chunk_size=300, chunk_overlap=100))

        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk, put into `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the chunk into the collector.
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export collected data to a vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_index=[("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
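As a usage note, the flow above still needs to be built and kept up to date. The sketch below is an assumption based on the project's quickstart rather than a verified API for this exact version: `cocoindex.init()` is assumed to read settings (such as the Postgres connection) from the environment, and `text_embedding_flow.update()` is assumed to trigger a one-time incremental build; the repository also documents equivalent CLI commands.

import cocoindex

# Assumed entry point (check the repository docs for the exact API):
def main() -> None:
    cocoindex.init()              # assumed: load settings such as the Postgres URL from the environment
    text_embedding_flow.update()  # assumed: build/refresh the index, recomputing only the changed delta

if __name__ == "__main__":
    main()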
Screenshot or Demo Video

No response