Merge pull request #126 from snexus/feature/update

snexus · web-flow · commit 593e6f014b55 · 2024-12-08T15:29:33.000+08:00
Add support for OpenAI embeddings. Bump all dependencies.
diff --git a/README.md b/README.md
@@ -8,30 +8,36 @@ The purpose of this package is to offer a convenient question-answering (RAG) sy
 
 ## Features
 
-* Supported formats
+* Supported document formats
     * Build-in parsers:
         * `.md` - Divides files based on logical components such as headings, subheadings, and code blocks. Supports additional features like cleaning image links, adding custom metadata, and more.
         * `.pdf` - MuPDF-based parser.
         * `.docx` - custom parser, supports nested tables.
     * Other common formats are supported by `Unstructured` pre-processor:
         * List of formats see [here](https://unstructured-io.github.io/unstructured/core/partition.html).
 
-* Support for table parsing via open-source gmft (https://github.com/conjuncts/gmft) or Azure Document Intelligence.
-
-* Optional support for image parsing using Gemini API.
-
-* Supports multiple collection of documents, and filtering the results by a collection.
+* Allows interaction with embedded documents, internally supporting the following models and methods (including locally hosted):
+    * OpenAI models (ChatGPT 3.5/4 and Azure OpenAI).
+    * HuggingFace models.
+    * Llama cpp supported models - for full list see [here](https://github.com/ggerganov/llama.cpp#description).
 
-* An ability to update the embeddings incrementally, without a need to re-index the entire document base.
+* Interoperability with LiteLLM + Ollama via OpenAI API, supporting hundreds of different models (see [Model configuration for LiteLLM](sample_templates/llm/litellm.yaml))
 
 * Generates dense embeddings from a folder of documents and stores them in a vector database ([ChromaDB](https://github.com/chroma-core/chroma)).
   * The following embedding models are supported:
     * Hugging Face embeddings.
     * Sentence-transformers-based models, e.g., `multilingual-e5-base`.
     * Instructor-based models, e.g., `instructor-large`.
+    * OpenAI embeddings.
 
 * Generates sparse embeddings using SPLADE (https://github.com/naver/splade) to enable hybrid search (sparse + dense).
 
+* An ability to update the embeddings incrementally, without a need to re-index the entire document base.
+
+* Support for table parsing via open-source gmft (https://github.com/conjuncts/gmft) or Azure Document Intelligence.
+
+* Optional support for image parsing using Gemini API.
+
 * Supports the "Retrieve and Re-rank" strategy for semantic search, see [here](https://www.sbert.net/examples/applications/retrieve_rerank/README.html).
     * Besides the originally `ms-marco-MiniLM` cross-encoder, more modern `bge-reranker` is supported.
 
@@ -44,13 +50,6 @@ The purpose of this package is to offer a convenient question-answering (RAG) sy
 
 * Supprts optional chat history with question contextualization
 
-* Allows interaction with embedded documents, internally supporting the following models and methods (including locally hosted):
-    * OpenAI models (ChatGPT 3.5/4 and Azure OpenAI).
-    * HuggingFace models.
-    * Llama cpp supported models - for full list see [here](https://github.com/ggerganov/llama.cpp#description).
-    * AutoGPTQ models (temporarily disabled due to broken dependencies).
-
-* Interoperability with LiteLLM + Ollama via OpenAI API, supporting hundreds of different models (see [Model configuration for LiteLLM](sample_templates/llm/litellm.yaml))
 
 * Other features
     * Simple CLI and web interfaces.
diff --git a/docs/index.rst b/docs/index.rst
@@ -11,59 +11,58 @@ The purpose of this package is to offer a convenient question-answering system w
 Features
 --------
 
-* Supported formats
+* Supported document formats
     * Build-in parsers:
         * `.md` - Divides files based on logical components such as headings, subheadings, and code blocks. Supports additional features like cleaning image links, adding custom metadata, and more.
         * `.pdf` - MuPDF-based parser.
         * `.docx` - custom parser, supports nested tables.
     * Other common formats are supported by `Unstructured` pre-processor:
-        * List of formats https://unstructured-io.github.io/unstructured/core/partition.html
+        * List of formats see `here <https://unstructured-io.github.io/unstructured/core/partition.html>`__.
 
-* Support for table parsing via open-source gmft (https://github.com/conjuncts/gmft) or Azure Document Intelligence.
-
-* Optional support for image parsing using Gemini API.
-
-* Supports multiple collection of documents, and filtering the results by a collection.
-
-* An ability to update the embeddings incrementally, without a need to re-index the entire document base.
+* Allows interaction with embedded documents, internally supporting the following models and methods (including locally hosted):
+    * OpenAI models (ChatGPT 3.5/4 and Azure OpenAI).
+    * HuggingFace models.
+    * Llama cpp supported models - for full list see `here <https://github.com/ggerganov/llama.cpp#description>`__.
 
-* Generates dense embeddings from a folder of documents and stores them in a vector database (ChromaDB).
+* Interoperability with LiteLLM + Ollama via OpenAI API, supporting hundreds of different models (see `Model configuration for LiteLLM <sample_templates/llm/litellm.yaml>`__)
 
+* Generates dense embeddings from a folder of documents and stores them in a vector database (`ChromaDB <https://github.com/chroma-core/chroma>`__).
   * The following embedding models are supported:
-
-    * Huggingface embeddings.
+    * Hugging Face embeddings.
     * Sentence-transformers-based models, e.g., `multilingual-e5-base`.
     * Instructor-based models, e.g., `instructor-large`.
+    * OpenAI embeddings.
 
 * Generates sparse embeddings using SPLADE (https://github.com/naver/splade) to enable hybrid search (sparse + dense).
 
-* Supports the "Retrieve and Re-rank" strategy for semantic search, see - https://www.sbert.net/examples/applications/retrieve_rerank/README.html.
+* An ability to update the embeddings incrementally, without a need to re-index the entire document base.
+
+* Support for table parsing via open-source gmft (https://github.com/conjuncts/gmft) or Azure Document Intelligence.
+
+* Optional support for image parsing using Gemini API.
+
+* Supports the "Retrieve and Re-rank" strategy for semantic search, see `here <https://www.sbert.net/examples/applications/retrieve_rerank/README.html>`__.
     * Besides the originally `ms-marco-MiniLM` cross-encoder, more modern `bge-reranker` is supported.
 
-* Supports HyDE (Hypothetical Document Embeddings) - https://arxiv.org/pdf/2212.10496.pdf
+* Supports HyDE (Hypothetical Document Embeddings) - see `here <https://arxiv.org/pdf/2212.10496.pdf>`__.
     * WARNING: Enabling HyDE (via config OR webapp) can significantly alter the quality of the results. Please make sure to read the paper before enabling.
-    * Based on empirical observations, enabling HyDE significantly boosts quality of the output on a topics where user can't formulate the quesiton using domain specific language of the topic - e.g. when learning new topics.
+    * From my own experiments, enabling HyDE significantly boosts the quality of the output on topics where the user can't formulate the question using the domain-specific language of the topic, e.g. when learning new topics.
 
 * Support for multi-querying, inspired by `RAG Fusion` - https://towardsdatascience.com/forget-rag-the-future-is-rag-fusion-1147298d8ad1
     * When multi-querying is turned on (either config or webapp), the original query will be replaced by 3 variants of the same query, allowing to bridge the gap in the terminology and "offer different angles or perspectives" according to the article.
 
 * Supprts optional chat history with question contextualization
 
-* Allows interaction with embedded documents, internally supporting the following models and methods (including locally hosted):
-    * OpenAI models (ChatGPT 3.5/4 and Azure OpenAI).
-    * HuggingFace models.
-    * Llama cpp supported models - for full list see https://github.com/ggerganov/llama.cpp#description
-    * AutoGPTQ models (temporarily disabled due to broken dependencies).
-
-* Interoperability with LiteLLM + Ollama via OpenAI API, supporting hundreds of different models (see [Model configuration for LiteLLM](sample_templates/llm/litellm.yaml))
 
 * Other features
-    * Simple web interface.
+    * Simple CLI and web interfaces.
     * Deep linking into document sections - jump to an individual PDF page or a header in a markdown file.
     * Ability to save responses to an offline database for future analysis.
     * Experimental API
 
 
+
+
 Installation
 ============
 
diff --git a/requirements.txt b/requirements.txt
@@ -1,30 +1,31 @@
 llama-cpp-python==0.2.76
 chromadb~=0.5.5
-langchain~=0.2.14
-langchain-community~=0.2.12
-langchain-openai~=0.1.22
-langchain-huggingface~=0.0.3
+langchain>=0.3,<0.4
+langchain-community>=0.3,<0.4
+langchain-openai>=0.2,<0.3
+langchain-huggingface>=0.1,<0.2
+langchain-chroma>=0.1.4,<0.2
 pydantic~=2.7
-transformers~=4.41
-sentence-transformers==3.0.1
+transformers~=4.47
+sentence-transformers==3.3.1
 pypdf2~=3.0.1
 ebooklib==0.18
 # sentencepiece==0.20
 setuptools==67.7.2
 loguru
 python-dotenv
-accelerate~=0.33
+accelerate~=1.2.0
 protobuf==3.20.2
 termcolor
-openai~=1.41
+openai~=1.57
 einops # required for Mosaic models
 click 
 bitsandbytes==0.43.1
 # auto-gptq==0.2.0
 InstructorEmbedding==1.0.1
-unstructured~=0.14.5
-pymupdf==1.24.9
-streamlit~=1.28
+unstructured~=0.16.9
+pymupdf==1.25.0
+streamlit~=1.40
 python-docx~=1.1
 six==1.16.0 ; python_version >= "3.10" and python_version < "4.0"
 sniffio==1.3.0 ; python_version >= "3.10" and python_version < "4.0"
@@ -34,8 +35,8 @@ sympy==1.11.1 ; python_version >= "3.10" and python_version < "4.0"
 tenacity==8.2.3 ; python_version >= "3.10" and python_version < "4.0"
 threadpoolctl==3.1.0 ; python_version >= "3.10" and python_version < "4.0"
 tiktoken==0.7.0 ; python_version >= "3.10" and python_version < "4.0"
-tokenizers==0.19.1; python_version >= "3.10" and python_version < "4.0"
+tokenizers>=0.21,<0.22; python_version >= "3.10" and python_version < "4.0"
 tqdm==4.65.0 ; python_version >= "3.10" and python_version < "4.0"
 # transformers==4.29.2 ; python_version >= "3.10" and python_version < "4.0"
 gmft==0.2.1
-google-generativeai~=0.7
+google-generativeai~=0.8.3
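Reviewer note on the requirements hunks above: the pins mix the compatible-release operator (`~=`) with explicit ranges (`>=0.3,<0.4`). The sketch below is a simplified illustration of how a `~=` pin expands into an explicit range. The helper name `expand_compatible_release` is hypothetical, and the sketch assumes plain numeric release segments only (no pre-releases, post-releases, or epochs).

```python
def expand_compatible_release(version: str) -> str:
    """Expand the version of a '~=' pin into an explicit '>=,<' range.

    Simplified: assumes dot-separated integer segments only.
    """
    parts = version.split(".")
    if len(parts) < 2:
        raise ValueError("'~=' needs at least two release segments")
    # Drop the last segment, then bump the new last segment by one
    # to obtain the exclusive upper bound.
    prefix = parts[:-1]
    upper = prefix[:-1] + [str(int(prefix[-1]) + 1)]
    return f">={version},<{'.'.join(upper)}"


# Versions taken from the pins in the diff above.
for spec in ("4.47", "1.2.0", "0.16.9"):
    print(f"~={spec}  ->  {expand_compatible_release(spec)}")
```

Under these assumptions, `transformers~=4.47` accepts any later 4.x release, while `accelerate~=1.2.0` only accepts 1.2.x patch releases.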
diff --git a/sample_templates/generic/config_template.yaml b/sample_templates/generic/config_template.yaml
@@ -5,6 +5,7 @@ embeddings:
   embeddings_path: /path/to/embedding/folder ## specify a folder where embeddings will be saved.
   
   embedding_model: # Optional embedding model specification, default is e5-large-v2. Swap to a smaller model if out of CUDA memory
+    # Supported types: "sentence_transformer", "huggingface", "instruct", "openai"
     type: sentence_transformer # other supported types - "huggingface" and "instruct"
     model_name: "intfloat/e5-large-v2"
   
diff --git a/sample_templates/openai_embeddings.yaml b/sample_templates/openai_embeddings.yaml
@@ -0,0 +1,40 @@
+cache_folder: /storage/llm/cache
+
+embeddings:
+  embeddings_path: /storage/llm/embeddings_md2
+  
+  embedding_model:
+    type: openai
+    model_name: "text-embedding-3-large"
+    additional_kwargs:
+      dimensions: 1024
+
+  splade_config:
+    n_batch: 5
+
+  chunk_sizes:
+    - 1024
+
+  document_settings:
+  - doc_path: /storage/llm/md_docs2
+    scan_extensions: 
+      - md
+      - pdf
+    passage_prefix: "passage: "
+    label: "md"
+
+
+semantic_search:
+  search_type: similarity 
+  replace_output_path:
+    - substring_search: "/storage"
+      substring_replace: "okular:///storage"
+
+  append_suffix:
+    append_template: "#page={page}"
+
+  max_char_size: 8192
+  max_k: 15
+  query_prefix: "query: "
+  hyde:
+    enabled: False
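Reviewer note on the template above: a minimal sketch of how the `embedding_model` section might be flattened into constructor arguments for an OpenAI embedding client. The helper `openai_embedding_kwargs` is hypothetical and not taken from this PR (the `embeddings.py` changes are not shown in this diff). It assumes that `model_name` maps to a `model` argument and that `additional_kwargs` are passed through unchanged.

```python
from typing import Any, Dict

# Mirrors the embedding_model section of sample_templates/openai_embeddings.yaml.
embedding_model_cfg: Dict[str, Any] = {
    "type": "openai",
    "model_name": "text-embedding-3-large",
    "additional_kwargs": {"dimensions": 1024},
}


def openai_embedding_kwargs(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: flatten the config into keyword arguments."""
    if cfg.get("type") != "openai":
        raise ValueError(f"Expected type 'openai', got {cfg.get('type')!r}")
    kwargs: Dict[str, Any] = {"model": cfg["model_name"]}
    # additional_kwargs is treated as optional in this sketch.
    kwargs.update(cfg.get("additional_kwargs") or {})
    return kwargs


kwargs = openai_embedding_kwargs(embedding_model_cfg)
print(kwargs)

# With langchain-openai installed and OPENAI_API_KEY set, the result could
# be used roughly as follows (an assumption, not the PR's actual code):
#   from langchain_openai import OpenAIEmbeddings
#   embeddings = OpenAIEmbeddings(**kwargs)
```

Requesting `dimensions: 1024` asks the model for shortened vectors, which keeps the index smaller than the model's default output size would.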
diff --git a/src/llmsearch/chroma.py b/src/llmsearch/chroma.py
@@ -4,7 +4,7 @@
 from typing import List, Optional, Tuple
 
 import tqdm
-from langchain_community.vectorstores import Chroma
+from langchain_chroma import Chroma
 from loguru import logger
 
 from llmsearch.config import Config
@@ -77,8 +77,8 @@ def create_index_from_documents(
                     metadatas=[doc.metadata for doc in group],
                 )
         logger.info("Generated embeddings. Persisting...")
-        if vectordb is not None:
-            vectordb.persist()
+        # No explicit persist() call: langchain_chroma's Chroma writes to
+        # persist_directory automatically, and persist() is no longer available.
         vectordb = None
 
     def _load_retriever(self, **kwargs):
@@ -105,13 +105,13 @@ def add_documents(self, docs: List[Document]):
                 metadatas=[doc.metadata for doc in group],
             )
         logger.info("Generated embeddings. Persisting...")
-        self.vectordb.persist()
+        # persist() removed: langchain_chroma's Chroma persists automatically.
 
     def delete_by_id(self, ids: List[str]):
         logger.warning(f"Deleting {len(ids)} chunks.")
         # vectordb = Chroma(persist_directory=self._persist_folder, embedding_function=self._embeddings)
         self.vectordb.delete(ids=ids)
-        self.vectordb.persist()
+        # persist() removed: langchain_chroma's Chroma persists automatically.
 
     def get_documents_by_id(self, document_ids: List[str]) -> List[Document]:
         """Retrieves documents by ids
diff --git a/src/llmsearch/config.py b/src/llmsearch/config.py
@@ -70,6 +70,7 @@ class EmbeddingModelType(str, Enum):
     huggingface = "huggingface"
     instruct = "instruct"
     sentence_transformer = "sentence_transformer"
+    openai = "openai"
 
 
 class EmbeddingModel(BaseModel):
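Reviewer note on the enum hunk above: because `EmbeddingModelType` subclasses both `str` and `Enum`, the string `openai` from a YAML config resolves directly to the new member. The sketch below reproduces the enum from the hunk and adds a hypothetical `parse_embedding_type` helper (not part of this PR) to show the lookup and the failure mode for an unsupported value.

```python
from enum import Enum


class EmbeddingModelType(str, Enum):
    # Members copied from the hunk above.
    huggingface = "huggingface"
    instruct = "instruct"
    sentence_transformer = "sentence_transformer"
    openai = "openai"


def parse_embedding_type(raw: str) -> EmbeddingModelType:
    """Hypothetical helper: resolve a config string to an enum member."""
    try:
        return EmbeddingModelType(raw)
    except ValueError:
        supported = ", ".join(member.value for member in EmbeddingModelType)
        raise ValueError(
            f"Unsupported embedding type {raw!r}; expected one of: {supported}"
        ) from None


selected = parse_embedding_type("openai")
print(selected is EmbeddingModelType.openai)
```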
diff --git a/src/llmsearch/embeddings.py b/src/llmsearch/embeddings.py
diff --git a/src/llmsearch/process.py b/src/llmsearch/process.py
diff --git a/src/llmsearch/utils.py b/src/llmsearch/utils.py