[Open Source Recommendation] CocoIndex 🥥: real-time data indexing for AI

### Project URL

https://github.com/cocoindex-io/cocoindex

### Category

Python

### Project title

The world's first data indexing framework that supports custom logic and ships with built-in incremental updates

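### Project description

CocoIndex is the world's first data framework that supports custom logic and ships with built-in incremental updates. It helps you prepare data for AI workloads (RAG, semantic search) efficiently. You assemble your ETL pipeline in the simplest possible form, like Lego bricks, and CocoIndex keeps the output up to date incrementally.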

![Image](https://github.com/user-attachments/assets/77d0748f-e49b-4a2c-8fc0-45ef29eb8c03)

CocoIndex is a framework plus an engine. You can plug in any custom module: PDF parsing, chunking, embedding, and so on.


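🔥 Core features:

- Build your RAG pipeline like Lego bricks.
- Incremental updates: when the source data changes, the CocoIndex engine minimizes recomputation and data writes, updating only the delta that is actually needed.
- Data flow programming: define your data flow in the simplest possible form.
- Efficient and stable: the core is written in Rust 🦀, with a Python SDK for everyone who loves Python 🐍.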
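To make the incremental update idea concrete, here is a minimal sketch in plain Python. It is not CocoIndex's implementation or API: the `IncrementalIndex` class, the `fingerprint` helper, and the content-hash strategy are all assumptions chosen for illustration. The point is only that a transform is re-run for a document when its content fingerprint changes, and that outputs for deleted documents are dropped.

```python
import hashlib
from typing import Callable, Dict, List


def fingerprint(content: str) -> str:
    """Hypothetical helper: a stable hash of a document's content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class IncrementalIndex:
    """Illustrative only: recompute outputs just for changed documents."""

    def __init__(self, transform: Callable[[str], str]) -> None:
        self._transform = transform
        self._fingerprints: Dict[str, str] = {}
        self.outputs: Dict[str, str] = {}

    def update(self, source: Dict[str, str]) -> List[str]:
        """Sync with `source` and return the keys that were recomputed."""
        recomputed: List[str] = []
        for key, content in source.items():
            fp = fingerprint(content)
            # Skip documents whose content is unchanged since the last run.
            if self._fingerprints.get(key) != fp:
                self.outputs[key] = self._transform(content)
                self._fingerprints[key] = fp
                recomputed.append(key)
        # Drop outputs for documents that no longer exist in the source.
        for key in list(self._fingerprints):
            if key not in source:
                del self._fingerprints[key]
                del self.outputs[key]
        return recomputed
```

With `index = IncrementalIndex(transform=str.upper)`, a first call to `index.update(...)` processes every document, while a second call with one edited file re-runs the transform for that file only.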


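### Highlights

The [documentation](https://cocoindex.io/docs/) is comprehensive and the [quickstart](https://cocoindex.io/docs/getting_started/quickstart) is beginner-friendly. Build your RAG pipeline from modular pieces and get started in five minutes 🚀.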

### Sample code

```python
import cocoindex

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a directory
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()

    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(
                language="markdown", chunk_size=300, chunk_overlap=100))

        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk, put into `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the chunk into the collector.
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export collected data to a vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_index=[("embedding", cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])
```
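The `chunk_size=300, chunk_overlap=100` arguments above mean that neighbouring chunks share some text. The sketch below shows that relationship with a plain sliding window. It is a deliberate simplification and not how `SplitRecursively` works (that function splits along the document's structure). The `split_with_overlap` name is hypothetical, and counting sizes in characters is an assumption, since the post does not state the unit.

```python
from typing import List, Tuple


def split_with_overlap(
    text: str, chunk_size: int = 300, chunk_overlap: int = 100
) -> List[Tuple[int, str]]:
    """Return (start_offset, chunk_text) pairs from a fixed-size sliding window.

    Assumption: sizes are measured in characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    stride = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[Tuple[int, str]] = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += stride
    return chunks
```

With the defaults, each window advances by 200 characters, so the last 100 characters of one chunk are repeated at the start of the next. The start offset plays a role similar to the `location` field used as part of the primary key in the flow above.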

### Screenshots or demo video

<img width="1227" alt="Image" src="https://github.com/user-attachments/assets/d09d4fa7-a065-41a1-ae21-c7e6f45986e4" />
