
Commit 7c7b44b

[NeuralChat] Add readme, add content length filter, fix build error (intel#1378)
1 parent: 3775ffe

File tree: 5 files changed (+30, -25 lines)

intel_extension_for_transformers/langchain/vectorstores/chroma.py

Lines changed: 0 additions & 1 deletion
@@ -196,7 +196,6 @@ def build(
                 client_settings=client_settings,
                 client=client,
                 collection_metadata=collection_metadata,
-                **kwargs,
             )
             return chroma_collection
         else:

intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/README.md

Lines changed: 15 additions & 11 deletions
@@ -35,16 +35,16 @@ To ensure a smooth experience, we've made sure this plugin is compatible with co
 | xlsx | ['Questions', 'Answers']<br>['question', 'answer', 'link']<br>['context', 'link'] |
 | csv | ['question', 'correct_answer'] |
 | json/jsonl | {'content':xxx, 'link':xxx}|
-| txt | / |
-| html | / |
-| markdown | / |
-| word | / |
-| pdf | / |
+| txt | No format required |
+| html | No format required |
+| markdown | No format required |
+| word | No format required |
+| pdf | No format required |

 # Usage
-The most convenient way to use is this plugin is via our `build_chatbot` api as introduced in the [example code](https://github.com/intel/intel-extension-for-transformers/tree/main/intel_extension_for_transformers/neural_chat/examples/plugins/retrieval). The user could refer to it for a simple test.
+Before using RAG in NeuralChat, please install the necessary dependencies in [requirements.txt](https://github.com/intel/intel-extension-for-transformers/blob/main/intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/requirements.txt) to avoid import errors. The most convenient way to use this plugin is via our `build_chatbot` API, as introduced in the [example code](https://github.com/intel/intel-extension-for-transformers/tree/main/intel_extension_for_transformers/neural_chat/examples/plugins/retrieval). The user can refer to it for a simple test.

-We support multiple file formats for retrieval, including unstructured file formats such as pdf, docx, html, txt, and markdown, as well as structured file formats like jsonl and xlsx. For structured file formats, they must adhere to predefined structures.
+We support multiple file formats for retrieval, including unstructured file formats such as pdf, docx, html, txt, and markdown, as well as structured file formats like jsonl/json, csv, and xlsx. Structured file formats must adhere to the predefined structures above. We also support uploading the knowledge base via an HTTP web link.

 In the case of jsonl files, they should be formatted as dictionaries, such as: {'content':xxx, 'link':xxx}. The support for xlsx files is specifically designed for Question-Answer (QA) tasks. Users can input QA pairs for retrieval. Therefore, the table's header should include items labeled "Question" and "Answer". The reference files can be found [here](https://github.com/intel/intel-extension-for-transformers/tree/main/intel_extension_for_transformers/neural_chat/assets/docs).
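The updated usage section maps directly onto a short script. A minimal sketch, assuming the `build_chatbot`/`PipelineConfig`/`plugins` entry points exercised by this commit's test file; the document path and question are placeholders:

```python
# Hedged sketch of enabling RAG via the retrieval plugin; mirrors the pattern in
# test_parameters.py from this commit. Path and model are placeholders.
from intel_extension_for_transformers.neural_chat import PipelineConfig, build_chatbot, plugins

plugins.retrieval.enable = True
plugins.retrieval.args["input_path"] = "./docs/sample.txt"  # any supported format
plugins.retrieval.args["persist_directory"] = "./rag_db"    # where the vectorstore is saved

config = PipelineConfig(model_name_or_path="facebook/opt-125m", plugins=plugins)
chatbot = build_chatbot(config)
print(chatbot.predict("What is this document about?"))
```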

@@ -83,11 +83,15 @@ Below are the descriptions for the available parameters in `agent_QA`,
 | embedding_model | str | The name or path of the text embedding model |-|
 | response_template | str | Default response when there are no relevant documents available for RAG |-|
 | mode | str | The RAG behavior for different use cases. Please check [here](#rag-mode) |"accuracy", "general"|
-| retrieval_type | str | The type of the retriever. Please check [here](#retrievers) for more details | "default", "child_parent"|
+| retrieval_type | str | The type of the retriever. Please check [here](#retrievers) for more details | "default", "child_parent", "bm25"|
 | process | bool | Whether to split long documents into small chunks. The size of each chunk is defined by `max_chuck_size` and `min_chuck_size`|True, False|
 | max_chuck_size | int | The max token length for a single chunk in the knowledge base |-|
 | min_chuck_size | int | The min token length for a single chunk in the knowledge base |-|
 | append | bool | Whether to append the new knowledge to the existing knowledge base or directly load the existing knowledge base |True, False|
+| polish | bool | Whether to polish the input query before processing |True, False|
+| enable_rerank | bool | Whether to enable the retrieve-then-rerank pipeline |True, False|
+| reranker_model | str | The name of a reranker model on the Hugging Face Hub, or a local path |-|
+| top_n | int | The number of documents returned by the reranker model |-|

 For more retriever- and vectorstore-related parameters, please check [here](#langchain-extension).
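The four new parameters slot into the same `plugins.retrieval.args` dictionary. An illustrative sketch; the reranker model name below is an example choice, not a project default:

```python
# Illustrative settings for the parameters added in this commit.
plugins.retrieval.args["polish"] = True          # rewrite the query before retrieval
plugins.retrieval.args["enable_rerank"] = True   # retrieve first, then rerank
plugins.retrieval.args["reranker_model"] = "BAAI/bge-reranker-base"  # example HF name or local path
plugins.retrieval.args["top_n"] = 3              # documents the reranker returns
```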

@@ -185,17 +189,17 @@ plugins.retrieval.args["search_kwargs"]=xxx
 ```

 If "search_type"="similarity":
->search_kwargs={"k"=xxx}
+>search_kwargs={"k":xxx}

 "k" is the number of the most similar documents to return.

 If "search_type"="mmr":
->search_kwargs={"k"=xxx, "fetch_k"=xxx, "lamabda_mult"=xxx}
+>search_kwargs={"k":xxx, "fetch_k":xxx, "lamabda_mult":xxx}

 "k" is the number of the most similar documents to return. "fetch_k" is the number of documents fetched and passed to the MMR algorithm. "lamabda_mult" is a number between 0 and 1 that determines the degree of diversity among the results, with 0 corresponding to maximum diversity and 1 to minimum diversity. Defaults to 0.5.

 If "search_type"="similarity_score_threshold":
->search_kwargs={"k"=xxx, "score_threshold"=xxx}
+>search_kwargs={"k":xxx, "score_threshold":xxx}

 "k" is the number of the most similar documents to return. "score_threshold" is the similarity score threshold for the retrieved documents.
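Putting the corrected dict syntax together, one configuration per search type. A sketch; note that LangChain's MMR keyword is spelled `lambda_mult`, so the README's `lamabda_mult` appears to carry over a typo:

```python
# Example search configurations using the corrected dict syntax from this commit.
plugins.retrieval.args["search_type"] = "similarity"
plugins.retrieval.args["search_kwargs"] = {"k": 4}

plugins.retrieval.args["search_type"] = "mmr"
plugins.retrieval.args["search_kwargs"] = {"k": 4, "fetch_k": 20, "lambda_mult": 0.5}

plugins.retrieval.args["search_type"] = "similarity_score_threshold"
plugins.retrieval.args["search_kwargs"] = {"k": 4, "score_threshold": 0.7}
```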

intel_extension_for_transformers/neural_chat/pipeline/plugins/retrieval/retrieval_agent.py

Lines changed: 10 additions & 8 deletions
@@ -39,14 +39,15 @@
     level=logging.INFO
 )

-def document_transfer(data_collection):
+def document_transfer(data_collection, min_length):
     "Transfer the raw document into langchain supported format."
     documents = []
     for data, meta in data_collection:
-        doc_id = str(uuid.uuid4())
-        metadata = {"source": meta, "identify_id":doc_id}
-        doc = Document(page_content=data, metadata=metadata)
-        documents.append(doc)
+        if len(data) > min_length:
+            doc_id = str(uuid.uuid4())
+            metadata = {"source": meta, "identify_id":doc_id}
+            doc = Document(page_content=data, metadata=metadata)
+            documents.append(doc)
     return documents

 def document_append_id(documents):
@@ -84,6 +85,7 @@ def __init__(self,
         self.mode = mode
         self.process = process
         self.retriever = None
+        self.min_chuck_size = min_chuck_size
         self.splitter = RecursiveCharacterTextSplitter(chunk_size= kwargs['child_size'] \
             if 'child_size' in kwargs else 512)
         allowed_retrieval_type: ClassVar[Collection[str]] = (
@@ -162,7 +164,7 @@ def __init__(self,
         data_collection = self.document_parser.load(input=self.input_path, **kwargs)
         logging.info("The parsing for the uploaded files is finished.")

-        langchain_documents = document_transfer(data_collection)
+        langchain_documents = document_transfer(data_collection, self.min_chuck_size)
         logging.info("The format of parsed documents is transferred.")

         if self.vector_database == "Chroma":
@@ -235,7 +237,7 @@ def create(self, input_path, **kwargs):
         Create a new knowledge base based on the uploaded files.
         """
         data_collection = self.document_parser.load(input=input_path, **kwargs)
-        langchain_documents = document_transfer(data_collection)
+        langchain_documents = document_transfer(data_collection, self.min_chuck_size)

         if self.retrieval_type == 'default':
             knowledge_base = self.database.from_documents(documents=langchain_documents, \
@@ -261,7 +263,7 @@ def append_localdb(self, append_path, **kwargs):
         "Append the knowledge instances into a given knowledge base."

         data_collection = self.document_parser.load(input=append_path, **kwargs)
-        langchain_documents = document_transfer(data_collection)
+        langchain_documents = document_transfer(data_collection, self.min_chuck_size)

         if self.retrieval_type == 'default':
             knowledge_base = self.database.from_documents(documents=langchain_documents, \
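The new `min_length` argument is the content length filter from the commit title: fragments at or below the threshold never become `Document` objects. A hedged usage sketch of just this helper, assuming it is importable from the module path of this file:

```python
# Sketch of the content-length filter added to document_transfer in this commit.
from intel_extension_for_transformers.neural_chat.pipeline.plugins.retrieval.retrieval_agent import (
    document_transfer,
)

data_collection = [
    ("Too short", "notes.txt"),  # dropped: len(data) <= min_length
    ("A paragraph comfortably longer than ten characters.", "notes.txt"),  # kept
]
docs = document_transfer(data_collection, 10)  # min_length of 10, as in the updated test
assert len(docs) == 1 and docs[0].metadata["source"] == "notes.txt"
```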

intel_extension_for_transformers/neural_chat/prompts/prompt.py

Lines changed: 4 additions & 4 deletions
@@ -133,7 +133,7 @@
     - Please refer to the search results obtained from the local knowledge base. But be careful to not \
 incorporate the information that you think is not relevant to the question.
     - If you don't know the answer to a question, please don't share false information.\n""" ,
-    roles=("### Question:", "### Search Results:", "### Chat History:", "### Response:"),
+    roles=("### Question: ", "### Search Results: ", "### Chat History: ", "### Response: "),
     sep_style=SeparatorStyle.NO_COLON_SINGLE,
     sep="\n",
 )
@@ -145,7 +145,7 @@
     name="rag_without_context",
     system_message="Have a conversation with a human. " + \
         "You are required to generate suitable response to the user input.\n",
-    roles=("### Input:", "### Response:"),
+    roles=("### Input: ", "### Response: "),
     sep_style=SeparatorStyle.NO_COLON_SINGLE,
     sep="\n",
 )
@@ -157,7 +157,7 @@
     name="rag_without_context_memory",
     system_message="Have a conversation with a human. " + \
         "You are required to generate suitable response to the user input.\n",
-    roles=("### Input:", "### Chat History:", "### Response:"),
+    roles=("### Input: ", "### Chat History: ", "### Response: "),
     sep_style=SeparatorStyle.NO_COLON_SINGLE,
     sep="\n",
 )
@@ -172,7 +172,7 @@
     - Please refer to the search results obtained from the local knowledge base. But be careful to not \
 incorporate the information that you think is not relevant to the question.
     - If you don't know the answer to a question, please don't share false information.\n""",
-    roles=("### Question:", "### Search Results:", "### Chat History:", "### Response:"),
+    roles=("### Question: ", "### Search Results: ", "### Chat History: ", "### Response: "),
     sep_style=SeparatorStyle.NO_COLON_SINGLE,
     sep="\n",
 )
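The prompt change only appends a trailing space to each role tag, but it matters: with `SeparatorStyle.NO_COLON_SINGLE` the role string is concatenated directly with the message text. A sketch of the effect, assuming fastchat-style rendering of `role + message + sep`, which this template style follows:

```python
# Why the trailing space matters under NO_COLON_SINGLE (role + message + sep).
sep = "\n"
for role in ("### Question:", "### Question: "):
    print(repr(role + "What is RAG?" + sep))
# '### Question:What is RAG?\n'   <- old tag runs into the text
# '### Question: What is RAG?\n'  <- new tag renders cleanly
```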

intel_extension_for_transformers/neural_chat/tests/ci/plugins/retrieval/test_parameters.py

Lines changed: 1 addition & 1 deletion
@@ -321,7 +321,7 @@ def test_false_process(self):
         plugins.retrieval.args["input_path"] = "../assets/docs/sample_1.txt"
         plugins.retrieval.args["persist_directory"] = "./false_process"
         plugins.retrieval.args["retrieval_type"] = 'default'
-        plugins.retrieval.args["min_chuck_size"] = 100
+        plugins.retrieval.args["min_chuck_size"] = 10
         plugins.retrieval.args["max_chuck_size"] = 150
         plugins.retrieval.args["process"] = False
         config = PipelineConfig(model_name_or_path="facebook/opt-125m",
