Commit 593e6f0

Merge pull request #126 from snexus/feature/update
Add support for OpenAI embeddings. Bump all dependencies.
2 parents 0f162ba + 402bd46 commit 593e6f0

10 files changed

+162
-77
lines changed

README.md

Lines changed: 13 additions & 14 deletions
@@ -8,30 +8,36 @@ The purpose of this package is to offer a convenient question-answering (RAG) sy
 
 ## Features
 
-* Supported formats
+* Supported document formats
     * Build-in parsers:
         * `.md` - Divides files based on logical components such as headings, subheadings, and code blocks. Supports additional features like cleaning image links, adding custom metadata, and more.
         * `.pdf` - MuPDF-based parser.
         * `.docx` - custom parser, supports nested tables.
     * Other common formats are supported by `Unstructured` pre-processor:
         * List of formats see [here](https://unstructured-io.github.io/unstructured/core/partition.html).
 
-* Support for table parsing via open-source gmft (https://github.com/conjuncts/gmft) or Azure Document Intelligence.
-
-* Optional support for image parsing using Gemini API.
-
-* Supports multiple collection of documents, and filtering the results by a collection.
+* Allows interaction with embedded documents, internally supporting the following models and methods (including locally hosted):
+    * OpenAI models (ChatGPT 3.5/4 and Azure OpenAI).
+    * HuggingFace models.
+    * Llama cpp supported models - for full list see [here](https://github.com/ggerganov/llama.cpp#description).
 
-* An ability to update the embeddings incrementally, without a need to re-index the entire document base.
+* Interoperability with LiteLLM + Ollama via OpenAI API, supporting hundreds of different models (see [Model configuration for LiteLLM](sample_templates/llm/litellm.yaml))
 
 * Generates dense embeddings from a folder of documents and stores them in a vector database ([ChromaDB](https://github.com/chroma-core/chroma)).
     * The following embedding models are supported:
         * Hugging Face embeddings.
         * Sentence-transformers-based models, e.g., `multilingual-e5-base`.
         * Instructor-based models, e.g., `instructor-large`.
+        * OpenAI embeddings.
 
 * Generates sparse embeddings using SPLADE (https://github.com/naver/splade) to enable hybrid search (sparse + dense).
 
+* An ability to update the embeddings incrementally, without a need to re-index the entire document base.
+
+* Support for table parsing via open-source gmft (https://github.com/conjuncts/gmft) or Azure Document Intelligence.
+
+* Optional support for image parsing using Gemini API.
+
 * Supports the "Retrieve and Re-rank" strategy for semantic search, see [here](https://www.sbert.net/examples/applications/retrieve_rerank/README.html).
     * Besides the originally `ms-marco-MiniLM` cross-encoder, more modern `bge-reranker` is supported.
 
@@ -44,13 +50,6 @@ The purpose of this package is to offer a convenient question-answering (RAG) sy
 
 * Supprts optional chat history with question contextualization
 
-* Allows interaction with embedded documents, internally supporting the following models and methods (including locally hosted):
-    * OpenAI models (ChatGPT 3.5/4 and Azure OpenAI).
-    * HuggingFace models.
-    * Llama cpp supported models - for full list see [here](https://github.com/ggerganov/llama.cpp#description).
-    * AutoGPTQ models (temporarily disabled due to broken dependencies).
-
-* Interoperability with LiteLLM + Ollama via OpenAI API, supporting hundreds of different models (see [Model configuration for LiteLLM](sample_templates/llm/litellm.yaml))
 
 * Other features
     * Simple CLI and web interfaces.

docs/index.rst

Lines changed: 22 additions & 23 deletions
@@ -11,59 +11,58 @@ The purpose of this package is to offer a convenient question-answering system w
 Features
 --------
 
-* Supported formats
+* Supported document formats
     * Build-in parsers:
         * `.md` - Divides files based on logical components such as headings, subheadings, and code blocks. Supports additional features like cleaning image links, adding custom metadata, and more.
         * `.pdf` - MuPDF-based parser.
         * `.docx` - custom parser, supports nested tables.
     * Other common formats are supported by `Unstructured` pre-processor:
-        * List of formats https://unstructured-io.github.io/unstructured/core/partition.html
+        * List of formats see [here](https://unstructured-io.github.io/unstructured/core/partition.html).
 
-* Support for table parsing via open-source gmft (https://github.com/conjuncts/gmft) or Azure Document Intelligence.
-
-* Optional support for image parsing using Gemini API.
-
-* Supports multiple collection of documents, and filtering the results by a collection.
-
-* An ability to update the embeddings incrementally, without a need to re-index the entire document base.
+* Allows interaction with embedded documents, internally supporting the following models and methods (including locally hosted):
+    * OpenAI models (ChatGPT 3.5/4 and Azure OpenAI).
+    * HuggingFace models.
+    * Llama cpp supported models - for full list see [here](https://github.com/ggerganov/llama.cpp#description).
 
-* Generates dense embeddings from a folder of documents and stores them in a vector database (ChromaDB).
+* Interoperability with LiteLLM + Ollama via OpenAI API, supporting hundreds of different models (see [Model configuration for LiteLLM](sample_templates/llm/litellm.yaml))
 
+* Generates dense embeddings from a folder of documents and stores them in a vector database ([ChromaDB](https://github.com/chroma-core/chroma)).
     * The following embedding models are supported:
-
-        * Huggingface embeddings.
+        * Hugging Face embeddings.
         * Sentence-transformers-based models, e.g., `multilingual-e5-base`.
         * Instructor-based models, e.g., `instructor-large`.
+        * OpenAI embeddings.
 
 * Generates sparse embeddings using SPLADE (https://github.com/naver/splade) to enable hybrid search (sparse + dense).
 
-* Supports the "Retrieve and Re-rank" strategy for semantic search, see - https://www.sbert.net/examples/applications/retrieve_rerank/README.html.
+* An ability to update the embeddings incrementally, without a need to re-index the entire document base.
+
+* Support for table parsing via open-source gmft (https://github.com/conjuncts/gmft) or Azure Document Intelligence.
+
+* Optional support for image parsing using Gemini API.
+
+* Supports the "Retrieve and Re-rank" strategy for semantic search, see [here](https://www.sbert.net/examples/applications/retrieve_rerank/README.html).
     * Besides the originally `ms-marco-MiniLM` cross-encoder, more modern `bge-reranker` is supported.
 
-* Supports HyDE (Hypothetical Document Embeddings) - https://arxiv.org/pdf/2212.10496.pdf
+* Supports HyDE (Hypothetical Document Embeddings) - see [here](https://arxiv.org/pdf/2212.10496.pdf).
     * WARNING: Enabling HyDE (via config OR webapp) can significantly alter the quality of the results. Please make sure to read the paper before enabling.
-    * Based on empirical observations, enabling HyDE significantly boosts quality of the output on a topics where user can't formulate the quesiton using domain specific language of the topic - e.g. when learning new topics.
+    * From my own experiments, enabling HyDE significantly boosts quality of the output on a topics where user can't formulate the quesiton using domain specific language of the topic - e.g. when learning new topics.
 
 * Support for multi-querying, inspired by `RAG Fusion` - https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1
     * When multi-querying is turned on (either config or webapp), the original query will be replaced by 3 variants of the same query, allowing to bridge the gap in the terminology and "offer different angles or perspectives" according to the article.
 
 * Supprts optional chat history with question contextualization
 
-* Allows interaction with embedded documents, internally supporting the following models and methods (including locally hosted):
-    * OpenAI models (ChatGPT 3.5/4 and Azure OpenAI).
-    * HuggingFace models.
-    * Llama cpp supported models - for full list see https://github.com/ggerganov/llama.cpp#description
-    * AutoGPTQ models (temporarily disabled due to broken dependencies).
-
-* Interoperability with LiteLLM + Ollama via OpenAI API, supporting hundreds of different models (see [Model configuration for LiteLLM](sample_templates/llm/litellm.yaml))
 
 * Other features
-    * Simple web interface.
+    * Simple CLI and web interfaces.
     * Deep linking into document sections - jump to an individual PDF page or a header in a markdown file.
     * Ability to save responses to an offline database for future analysis.
     * Experimental API
 
 
+
+
 Installation
 ============

requirements.txt

Lines changed: 14 additions & 13 deletions
@@ -1,30 +1,31 @@
 llama-cpp-python==0.2.76
 chromadb~=0.5.5
-langchain~=0.2.14
-langchain-community~=0.2.12
-langchain-openai~=0.1.22
-langchain-huggingface~=0.0.3
+langchain>=0.3,<0.4
+langchain-community>=0.3,<0.4
+langchain-openai>=0.2,<0.3
+langchain-huggingface>=0.1,<0.2
+langchain-chroma>=0.1.4,<0.2
 pydantic~=2.7
-transformers~=4.41
-sentence-transformers==3.0.1
+transformers~=4.47
+sentence-transformers==3.3.1
 pypdf2~=3.0.1
 ebooklib==0.18
 # sentencepiece==0.20
 setuptools==67.7.2
 loguru
 python-dotenv
-accelerate~=0.33
+accelerate~=1.2.0
 protobuf==3.20.2
 termcolor
-openai~=1.41
+openai~=1.57
 einops # required for Mosaic models
 click
 bitsandbytes==0.43.1
 # auto-gptq==0.2.0
 InstructorEmbedding==1.0.1
-unstructured~=0.14.5
-pymupdf==1.24.9
-streamlit~=1.28
+unstructured~=0.16.9
+pymupdf==1.25.0
+streamlit~=1.40
 python-docx~=1.1
 six==1.16.0 ; python_version >= "3.10" and python_version < "4.0"
 sniffio==1.3.0 ; python_version >= "3.10" and python_version < "4.0"
@@ -34,8 +35,8 @@ sympy==1.11.1 ; python_version >= "3.10" and python_version < "4.0"
 tenacity==8.2.3 ; python_version >= "3.10" and python_version < "4.0"
 threadpoolctl==3.1.0 ; python_version >= "3.10" and python_version < "4.0"
 tiktoken==0.7.0 ; python_version >= "3.10" and python_version < "4.0"
-tokenizers==0.19.1; python_version >= "3.10" and python_version < "4.0"
+tokenizers>=0.21,<0.22; python_version >= "3.10" and python_version < "4.0"
 tqdm==4.65.0 ; python_version >= "3.10" and python_version < "4.0"
 # transformers==4.29.2 ; python_version >= "3.10" and python_version < "4.0"
 gmft==0.2.1
-google-generativeai~=0.7
+google-generativeai~=0.8.3

sample_templates/generic/config_template.yaml

Lines changed: 1 addition & 0 deletions
@@ -5,6 +5,7 @@ embeddings:
   embeddings_path: /path/to/embedding/folder ## specify a folder where embeddings will be saved.
 
   embedding_model: # Optional embedding model specification, default is e5-large-v2. Swap to a smaller model if out of CUDA memory
+    # Supported types: "huggingface", "instruct", "openai"
     type: sentence_transformer # other supported types - "huggingface" and "instruct"
     model_name: "intfloat/e5-large-v2"
 

Lines changed: 40 additions & 0 deletions
@@ -0,0 +1,40 @@
+cache_folder: /storage/llm/cache
+
+embeddings:
+  embeddings_path: /storage/llm/embeddings_md2
+
+  embedding_model:
+    type: openai
+    model_name: "text-embedding-3-large"
+    additional_kwargs:
+      dimensions: 1024
+
+  splade_config:
+    n_batch: 5
+
+  chunk_sizes:
+    - 1024
+
+  document_settings:
+  - doc_path: /storage/llm/md_docs2
+    scan_extensions:
+      - md
+      - pdf
+    passage_prefix: "passage: "
+    label: "md"
+
+
+semantic_search:
+  search_type: similarity
+  replace_output_path:
+    - substring_search: "/storage"
+      substring_replace: "okular:///storage"
+
+  append_suffix:
+    append_template: "#page={page}"
+
+  max_char_size: 8192
+  max_k: 15
+  query_prefix: "query: "
+  hyde:
+    enabled: False
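
As a rough illustration of how the `embedding_model` block of this sample config might be consumed, the sketch below turns such a mapping into constructor keyword arguments for `OpenAIEmbeddings` from `langchain_openai`, which accepts `model` and `dimensions`. The helper names `openai_embedding_kwargs` and `make_openai_embeddings` are hypothetical and do not appear in the commit; how llmsearch wires the config to the embedding class is not shown in this diff.

```python
from typing import Any, Dict

# Mirrors the `embedding_model` block of the sample config.
EMBEDDING_MODEL_CFG: Dict[str, Any] = {
    "type": "openai",
    "model_name": "text-embedding-3-large",
    "additional_kwargs": {"dimensions": 1024},
}


def openai_embedding_kwargs(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Translate an `embedding_model` config block into constructor kwargs.

    Hypothetical helper: `model_name` becomes `model`, and everything under
    `additional_kwargs` is passed through unchanged.
    """
    if cfg.get("type") != "openai":
        raise ValueError(f"Expected type 'openai', got {cfg.get('type')!r}")
    kwargs: Dict[str, Any] = {"model": cfg["model_name"]}
    kwargs.update(cfg.get("additional_kwargs") or {})
    return kwargs


def make_openai_embeddings(cfg: Dict[str, Any]):
    """Build the embedding object (requires langchain-openai and an API key)."""
    # Imported lazily so the kwargs helper works without the dependency installed.
    from langchain_openai import OpenAIEmbeddings

    return OpenAIEmbeddings(**openai_embedding_kwargs(cfg))


if __name__ == "__main__":
    print(openai_embedding_kwargs(EMBEDDING_MODEL_CFG))
```

Calling `make_openai_embeddings` needs the `OPENAI_API_KEY` environment variable to be set. The `dimensions: 1024` entry asks the `text-embedding-3` model for shortened vectors, which keeps the index smaller than with the model's full-size output.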

src/llmsearch/chroma.py

Lines changed: 5 additions & 5 deletions
@@ -4,7 +4,7 @@
 from typing import List, Optional, Tuple
 
 import tqdm
-from langchain_community.vectorstores import Chroma
+from langchain_chroma import Chroma
 from loguru import logger
 
 from llmsearch.config import Config
@@ -77,8 +77,8 @@ def create_index_from_documents(
                 metadatas=[doc.metadata for doc in group],
             )
             logger.info("Generated embeddings. Persisting...")
-            if vectordb is not None:
-                vectordb.persist()
+            # if vectordb is not None:
+            #     vectordb.persist()
             vectordb = None
 
     def _load_retriever(self, **kwargs):
@@ -105,13 +105,13 @@ def add_documents(self, docs: List[Document]):
                 metadatas=[doc.metadata for doc in group],
             )
             logger.info("Generated embeddings. Persisting...")
-            self.vectordb.persist()
+            # self.vectordb.persist()
 
     def delete_by_id(self, ids: List[str]):
         logger.warning(f"Deleting {len(ids)} chunks.")
         # vectordb = Chroma(persist_directory=self._persist_folder, embedding_function=self._embeddings)
         self.vectordb.delete(ids=ids)
-        self.vectordb.persist()
+        # self.vectordb.persist()
 
     def get_documents_by_id(self, document_ids: List[str]) -> List[Document]:
         """Retrieves documents by ids
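
The `persist()` calls are commented out rather than replaced. This appears consistent with recent Chroma releases writing to the persist directory automatically, and with the `Chroma` wrapper in `langchain_chroma` no longer offering a manual `persist()` method; both points are assumptions about the libraries, not statements made in the commit. A hedged alternative to commenting the calls out is a small guard that persists only when the store still supports it. The helper `persist_if_supported` and the two stand-in classes below are hypothetical and exist only to make the sketch runnable without Chroma installed.

```python
from typing import Any


def persist_if_supported(vectordb: Any) -> bool:
    """Call `vectordb.persist()` only if the store still exposes it.

    Returns True when a manual persist was performed, False when the store
    has no such method (assumed to persist automatically).
    """
    persist = getattr(vectordb, "persist", None)
    if callable(persist):
        persist()
        return True
    return False


class LegacyStore:
    """Stand-in for an older vector store wrapper with manual persistence."""

    def __init__(self) -> None:
        self.persist_calls = 0

    def persist(self) -> None:
        self.persist_calls += 1


class AutoPersistStore:
    """Stand-in for a newer wrapper without a `persist()` method."""


if __name__ == "__main__":
    legacy = LegacyStore()
    print(persist_if_supported(legacy), legacy.persist_calls)
    print(persist_if_supported(AutoPersistStore()))
```

A guard of this kind keeps one code path working across both wrapper generations, at the cost of hiding which behaviour is in effect; the commit instead pins `langchain-chroma` in `requirements.txt` and drops the calls.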

src/llmsearch/config.py

Lines changed: 1 addition & 0 deletions
@@ -70,6 +70,7 @@ class EmbeddingModelType(str, Enum):
     huggingface = "huggingface"
     instruct = "instruct"
     sentence_transformer = "sentence_transformer"
+    openai = "openai"
 
 
 class EmbeddingModel(BaseModel):
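
To show what the new enum member enables, here is a minimal standalone sketch: because `EmbeddingModelType` subclasses `str`, the string `type: openai` from a YAML config can be converted directly into the enum. The enum body is copied from the diff; the `parse_embedding_type` helper is hypothetical, and the real project validates the value through the pydantic `EmbeddingModel` class, whose body is not shown here.

```python
from enum import Enum


class EmbeddingModelType(str, Enum):
    huggingface = "huggingface"
    instruct = "instruct"
    sentence_transformer = "sentence_transformer"
    openai = "openai"  # member added by this commit


def parse_embedding_type(raw: str) -> EmbeddingModelType:
    """Convert a config string into the enum, listing valid values on failure."""
    try:
        return EmbeddingModelType(raw)
    except ValueError:
        supported = ", ".join(member.value for member in EmbeddingModelType)
        raise ValueError(
            f"Unsupported embedding type {raw!r}; expected one of: {supported}"
        ) from None


if __name__ == "__main__":
    print(parse_embedding_type("openai"))
```

Before this commit, a config with `type: openai` would have been rejected at this step, since the value had no matching member.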
