Releases: deepset-ai/haystack
v1.17.0-rc1
v1.16.1
What's changed
- fix: update ImportError for 'metrics' dependency by @bilgeyucel in #4778
Full Changelog: v1.16.0...v1.16.1
v1.16.0
⭐️ Highlights
Using GPT-4 through PromptNode and Agent
Haystack now supports GPT-4 through PromptNode and Agent. This means you can use the latest advancements in large language modeling to make your NLP applications more accurate and efficient.
To get started, create a PromptModel for GPT-4 and plug it into your PromptNode. Just like with ChatGPT, you can use GPT-4 in a chat scenario and ask follow-up questions, as shown in this example:
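from haystack.nodes import PromptModel, PromptNode

# api_key is assumed to hold your OpenAI API key
prompt_model = PromptModel("gpt-4", api_key=api_key)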
prompt_node = PromptNode(prompt_model)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"},
{"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
{"role": "user", "content": "Where was it played?"},
]
result = prompt_node(messages)
More flexible routing of Documents with RouteDocuments
This release includes an enhancement to the RouteDocuments node, which makes Document routing even more flexible.
The RouteDocuments node now not only returns Documents matched by the split_by or metadata_values parameter, but also creates an extra route for unmatched Documents. This means that you won't accidentally filter out any Documents due to missing metadata fields. Additionally, the update adds support for using List[List[str]] as input type to metadata_values, so multiple metadata values can be grouped into a single output.
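The routing behaviour described above can be sketched in plain Python. This is not Haystack's implementation: `route_by_metadata`, the dictionary-based documents, and the `output_N` / `unmatched` route names are hypothetical and chosen only for illustration. The sketch shows grouped metadata values (a list of lists) feeding a single route, and an extra route catching Documents that match nothing or lack the metadata field.

```python
from typing import Dict, List, Union


def route_by_metadata(
    documents: List[dict],
    split_by: str,
    metadata_values: List[Union[str, List[str]]],
) -> Dict[str, List[dict]]:
    """Hypothetical helper: one route per entry of metadata_values, plus 'unmatched'."""
    routes: Dict[str, List[dict]] = {
        f"output_{i + 1}": [] for i in range(len(metadata_values))
    }
    routes["unmatched"] = []
    for doc in documents:
        # A missing metadata field yields None, which matches no group.
        value = doc.get("meta", {}).get(split_by)
        for i, group in enumerate(metadata_values):
            accepted = group if isinstance(group, list) else [group]
            if value in accepted:
                routes[f"output_{i + 1}"].append(doc)
                break
        else:
            # No group accepted the value: keep the document instead of dropping it.
            routes["unmatched"].append(doc)
    return routes


docs = [
    {"content": "a", "meta": {"lang": "en"}},
    {"content": "b", "meta": {"lang": "de"}},
    {"content": "c", "meta": {"lang": "fr"}},
    {"content": "d", "meta": {}},
]
routes = route_by_metadata(docs, split_by="lang", metadata_values=[["en", "de"], ["fr"]])
```

Here the `en` and `de` documents share one route because their values are grouped in the same inner list, while the document without a `lang` field ends up in the extra route rather than being filtered out.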
Deprecating RAGenerator and Seq2SeqGenerator
RAGenerator and Seq2SeqGenerator are deprecated and will be removed in version 1.18. We advise using the more powerful PromptNode instead, which can use RAG and Seq2Seq models as well. The following example shows how to use PromptNode as a replacement for Seq2SeqGenerator:
from haystack import Document
from haystack.nodes import PromptNode, PromptTemplate

p = PromptNode("vblagoje/bart_lfqa")
# Start by defining a question/query
query = "Why does water heated to room temperature feel colder than the air around it?"
# Given the question above, suppose the documents below were found in some document store
documents = [
"when the skin is completely wet. The body continuously loses water by...",
"at greater pressures. There is an ambiguity, however, as to the meaning of the terms 'heating' and 'cooling'...",
"are not in a relation of thermal equilibrium, heat will flow from the hotter to the colder, by whatever pathway...",
"air condition and moving along a line of constant enthalpy toward a state of higher humidity. A simple example ...",
"Thermal contact conductance. In physics, thermal contact conductance is the study of heat conduction between solid ...",
]
# Manually concatenate the question and support documents into BART input
# conditioned_doc = "<P> " + " <P> ".join([d for d in documents])
# query_and_docs = "question: {} context: {}".format(query, conditioned_doc)
# Or use the PromptTemplate as shown here
pt = PromptTemplate("lfqa", "question: {query} context: {join(documents, delimiter='<P>')}")
res = p.prompt(prompt_template=pt, query=query, documents=[Document(d) for d in documents])
⚠️ Breaking Changes
Refactoring of our dependency management
We added the following extras as optional dependencies for Haystack: stats, metrics, preprocessing, file-conversion, and elasticsearch. To keep using certain components, you need to install farm-haystack with these new extras:
| Component | Installation extra |
|---|---|
| `PreProcessor` | `farm-haystack[preprocessing]` |
| `DocxToTextConverter` | `farm-haystack[file-conversion]` |
| `TikaConverter` | `farm-haystack[file-conversion]` |
| `LangdetectDocumentLanguageClassifier` | `farm-haystack[file-conversion]` |
| `ElasticsearchDocumentStore` | `farm-haystack[elasticsearch]` |
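As a rough illustration of how the table translates into an install command, the sketch below collects the extras needed for a set of components and builds one requirement string. The `COMPONENT_EXTRAS` mapping is copied from the table; the `pip_requirement` helper is hypothetical and not part of Haystack.

```python
from typing import Iterable

# Mapping copied from the table above.
COMPONENT_EXTRAS = {
    "PreProcessor": "preprocessing",
    "DocxToTextConverter": "file-conversion",
    "TikaConverter": "file-conversion",
    "LangdetectDocumentLanguageClassifier": "file-conversion",
    "ElasticsearchDocumentStore": "elasticsearch",
}


def pip_requirement(components: Iterable[str]) -> str:
    """Hypothetical helper: one requirement string covering every listed component.

    Raises KeyError for a component that is not in the table.
    """
    extras = sorted({COMPONENT_EXTRAS[name] for name in components})
    return f"farm-haystack[{','.join(extras)}]" if extras else "farm-haystack"


requirement = pip_requirement(
    ["PreProcessor", "TikaConverter", "ElasticsearchDocumentStore"]
)
# Quote the requirement so the shell does not interpret the square brackets.
print(f'pip install "{requirement}"')
```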
Dropping support for Python 3.7
Since Python 3.7 will reach end of life in June 2023, we will no longer support it as of Haystack version 1.16.
Smaller Breaking Changes
- Using `TableCell` instead of `Span` to indicate the coordinates of a table cell (#4616)
- Default `save_dir` for `FARMReader`'s `train` method changed to `f"./saved_models/{self.inferencer.model.language_model.name}"` (#4553)
- Using `PreProcessor` with `split_respect_sentence_boundary` set to `True` might return a different set of Documents than in v1.15 (#4470)
What's Changed
Breaking Changes
- feat: Deduplicate duplicate Answers resulting from overlapping Documents in `FARMReader` by @bogdankostic in #4470
- feat: Change default save_dir for FARMReader.train by @GitIgnoreMaybe in #4553
- feat!: drop Python3.7 support by @ZanSara in #4421
- refactor!: extract evaluation and statistical dependencies by @ZanSara in #4457
- refactor!: extract preprocessing and file conversion deps by @ZanSara in #4605
- feat: Implementation of Table Cell Proposal by @sjrl in #4616
Pipeline
- fix: Fix pipeline config and agent tools hashing for telemetry by @silvanocerza in #4508
- refactor: Adjust WhisperTranscriber to pipeline run methods by @vblagoje in #4510
- Adding filtering support for Weaviate when used for BM25 querying by @zoltan-fedor in #4385
- test: Remove duplicate whisper test by @julian-risch in #4567
- fix: provide a fallback for PyMuPDF by @masci in #4564
- Docs: Shaper API update by @agnieszka-m in #4542
- Docs: Update Whisper API. by @agnieszka-m in #4539
- refactor: remove variadic parameters in `WebSearch` initialization; make new nodes directly importable by @anakin87 in #4581
- test: Add pytest fixture to block requests in unit tests by @silvanocerza in #4433
- test: Rework conftest by @silvanocerza in #4614
- feat: arbitrary `crawler_depth` for `Crawler` class by @benheckmann in #4623
- fix: ParsrConverter list element added by @Namoush in #4562
- fix: make `langdetect` truly optional by @ZanSara in #4686
- feat: More flexible routing for RouteDocuments node by @sjrl in #4690
- docs: Adapt Shaper docstrings regarding dropping metadata by @bogdankostic in #4655
DocumentStores
- fix: Check for date fields in weaviate meta update by @joekitsmith in #4371
- chore: skip Milvus tests by @ZanSara in #4654
- docs: Add deprecation information to doc string of `MilvusDocumentStore` by @bogdankostic in #4658
- Ignore cross-reference properties when loading documents by @masci in #4664
- fix: PineconeDocumentStore error when delete_documents right after initialization by @Namoush in #4609
- fix: remove warnings from the more recent Elasticsearch client by @masci in #4602
- fix: Fixing the Weaviate BM25 query builder bug by @zoltan-fedor in #4703
Documentation
- Docs: Update Seq2SeqGen models and docstrings lg by @agnieszka-m in #4595
- feat: Load documents from remote - helper function by @TuanaCelik in #4545
- refactor: Remove unnecessary literal_eval when parsing env var by @silvanocerza in #4570
- Docs: Fix QuestionGenerator and Summarizer docstrings by @agnieszka-m in #4594
- refactor: Rework prompt tests by @silvanocerza in #4600
- feat: Add util method to make HTTP requests with configurable retry by @silvanocerza in #4627
- refactor: Rework invocation layers by @silvanocerza in #4615
- refactor: Add 503 as status code that triggers retry in request_with_retry by @silvanocerza in #4640
- feat: initial implementation of `MemoryDocumentStore` for new Pipelines by @ZanSara in #4447
- docs: Add PDFToTextOCRConverter to API Docs by @bogdankostic in #4656
- Docs: Add max length unit to PromptNode API docs by @agnieszka-m in #4601
- fix: Add model_max_length model_kwargs parameter to HF PromptNode by @vblagoje in #4651
- feat: Add chatgpt streaming by @vblagoje in #4659
- feat: Add Hugging Face inferencing PromptNode layer by @vblagoje in #4641
- refactor: `node` -> `component` by @ZanSara in #4687
- feat: Add AzureChatGPT Capability using new InvocationLayer style by @recrudesce in #4675
...
v1.15.1
v1.15.1-rc1
v1.15.0
⭐ Highlights
Build Agents Yourself with Open Source
Exciting news! Say hello to LLM-based Agents, the new decision makers for your NLP applications! These agents have the power to answer complex questions by creating a dynamic action plan and using a variety of Tools in a loop. Picture this: your Agent decides to tackle a multi-hop question by retrieving pieces of information through a web search engine again and again. That's just one of the many feats these Agents can accomplish. Excited about the recent ChatGPT plugins? Agents allow you to build similar experiences in an open source way: your own environment, full control and transparency.
But how do you get started? First, wrap your Haystack Pipeline in a Tool and give your Agent a description of what that Tool can do. Then, initialize your Agent with a list of Tools and a PromptNode that decides when to use each Tool.
web_qa_tool = Tool(
name="Search",
pipeline_or_node=WebQAPipeline(retriever=web_retriever, prompt_node=web_qa_pn),
description="useful for when you need to Google questions.",
output_variable="results",
)
agent = Agent(
prompt_node=agent_pn,
prompt_template=prompt_template,
tools=[web_qa_tool],
final_answer_pattern=r"Final Answer\s*:\s*(.*)",
)
agent.run(query="<Your question here!>")
Check out the full example, a stand-alone WebQAPipeline, our new tutorials and the documentation!
Flexible PromptTemplates
Get ready to take your Pipelines to the next level with the revamped PromptNode. Now you have more flexibility when it comes to shaping the PromptNode outputs and inputs to work seamlessly with other nodes. But wait, there's more! You can now apply functions right within prompt_text. Want to concatenate the content of input documents? No problem! It's all possible with the PromptNode. And that's not all! The output_parser converts output into Haystack Document, Answer, or Label formats. Check out the AnswerParser in action, fully loaded and ready to use:
PromptTemplate(
name="question-answering",
prompt_text="Given the context please answer the question.\n"
"Context: {join(documents)}\n"
"Question: {query}\n"
"Answer: ",
output_parser=AnswerParser(),
)
More details here.
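To make the effect of `{join(documents)}` in the template concrete, the following sketch builds the same prompt with plain string operations. It does not use Haystack: `render_prompt` is a hypothetical helper, the example passages are placeholders, and joining the document contents with a single space is an assumption about the default delimiter.

```python
from typing import List


def render_prompt(query: str, documents: List[str], delimiter: str = " ") -> str:
    """Hypothetical helper: fill the question-answering template by hand."""
    # Concatenate the document contents, as join(documents) is described to do.
    context = delimiter.join(documents)
    return (
        "Given the context please answer the question.\n"
        f"Context: {context}\n"
        f"Question: {query}\n"
        "Answer: "
    )


prompt = render_prompt(
    query="What do the passages say?",
    documents=["First passage.", "Second passage."],
)
print(prompt)
```

The model's completion of this prompt is what an output parser such as `AnswerParser` would then convert into Haystack objects.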
Using ChatGPT through PromptModel
A few lines of code are all you need to start chatting with ChatGPT through Haystack! The simple message format distinguishes instructions, user questions, and assistant responses. And with the chat functionality you can ask follow-up questions as in this example:
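from haystack.nodes import PromptModel, PromptNode

# api_key is assumed to hold your OpenAI API key
prompt_model = PromptModel("gpt-3.5-turbo", api_key=api_key)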
prompt_node = PromptNode(prompt_model)
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"},
{"role": "assistant", "content": "The Los Angeles Dodgers won the World Series in 2020."},
{"role": "user", "content": "Where was it played?"},
]
result = prompt_node(messages)
Haystack Extras
We now have another repo haystack-extras with extra Haystack components, like audio nodes AnswerToSpeech and DocumentToSpeech. For example, these two can be installed via:
pip install farm-haystack-text2speech
What's Changed
Breaking Changes
- feat!: Increase Crawler standardization regarding Pipelines by @danielbichuetti in #4122
- feat: Enable PDFToTextConverter multiprocessing, increase general performance and simplify installation by @danielbichuetti in #4226
- build: Use `uvicorn` instead of `gunicorn` as server in REST API's Dockerfile by @bogdankostic in #4304
- chore!: remove deprecated OpenDistroElasticsearchDocumentStore by @masci in #4361
- refactor: Remove AnswerToSpeech and DocumentToSpeech nodes by @silvanocerza in #4391
- fix: Fix debug on PromptNode by @recrudesce in #4483
- feat: PromptTemplate extensions by @tstadel in #4378
Pipeline
- feat: Add JsonConverter node by @bglearning in #4130
- fix: Shaper store all outputs from function by @sjrl in #4223
- refactor: Isolate PDF OCR converter from PDF text converter by @danielbichuetti in #4193
- fix: add option to not override results by `Shaper` by @tstadel in #4231
- feat: reduce and focus telemetry by @ZanSara in #4087
- refactor: Remove deprecated nodes `EvalDocuments` and `EvalAnswers` by @anakin87 in #4194
- refact: mark unit tests under the `test/nodes/**` path by @masci in #4235
- fix: FARMReader produces Answers with negative start and end position by @julian-risch in #4248
- test: replace `ElasticsearchDS` with `InMemoryDS` when it makes sense; support `scale_score` in `InMemoryDS` by @anakin87 in #4283
- test: mock all `Translator` tests and move one to `e2e` by @ZanSara in #4290
- fix: Prevent going past token limit in OpenAI calls in PromptNode by @sjrl in #4179
- feat: Add Azure OpenAI embeddings support by @danielbichuetti in #4332
- test: move tests on standard pipelines in `e2e/` by @ZanSara in #4309
- fix: EvalResult load migration by @tstadel in #4289
- feat: Report execution time for pipeline components in `_debug` by @zoltan-fedor in #4197
- refactor: Use TableQuestionAnsweringPipeline from transformers by @sjrl in #4303
- fix: hf-tiny-roberta model loading from disk and mypy errors by @mayankjobanputra in #4363
- docs: `TransformersImageToText` - inform about supported models, better exception handling by @anakin87 in #4310
- fix: check that `answer` is not `None` before accessing it in `table.py` by @culms in #4376
- feat: add automatic OCR detection mechanism and improve performance by @danielbichuetti in #4329
- Add Whisper node by @vblagoje in #4335
- tests: Mark Crawler tests correctly by @silvanocerza in #4435
- test: Skip flaky test_multimodal_retriever_query by @silvanocerza in #4444
- fix: issue evaluation check for content type by @ju-gu in #4181
- feat: break retry loop for 401 unauthorized errors in promptnode by @FHardow in #4389
- refactor: Remove retry_with_exponential_backoff in favor of tenacity by @silvanocerza in #4460
- refactor: Remove ElasticsearchRetriever and ElasticsearchFilterOnlyRetriever by @silvanocerza in #4499
- refactor: Deprecate BaseKnowledgeGraph, GraphDBKnowledgeGraph, InMemoryKnowledgeGraph and Text2SparqlRetriever by @silvanocerza in #4500
- refactor: remove telemetry v1 by @ZanSara in #4496
- feat: expose prompts to Answer and EvaluationResult by @tstadel in #4341
- feat: Add agent tools by @vblagoje in #4437
- refactor: reduce telemetry events count by @ZanSara in #4501
DocumentStores
- fix: `OpenSearchDocumentStore.delete_index` doesn't raise by @tstadel in #4295
- fix: increase `MetaDocumentORM` value length in `SQLDocumentStore` by @anakin87 in #4333
- fix: when using IVF* indexing, ensure the index is trained first by @kaixuanliu in #4311
- refactor: Mark MilvusDocumentStore as deprecated by @silvanocerza in #4498
Documentation
- feat: add `top_k` to `PromptNode` by @tstadel in #4159
- feat: Add Agent by @julian-risch in #4148
- ci: Automate OpenAPI specs upload to Readme.io by @silvanocerza in #4228
- ci: Refactor docs config and generation by @silvanocerza in #4280
- feat: Add Azure as OpenAI endpoint by @vblagoje in #4170
- refactor: Allow flexible document id generation by @danielbichuetti in https://github.com/deepset-a...
v1.15.0-rc2
v1.15.0-rc1
v1.14.0
⭐ Highlights
PromptNode enhancements
PromptNode just rolled out prompt logging (pipeline debug), run_batch, and model_kwargs support. More updates to PromptNode and PromptTemplates coming soon!
Shaper
We're introducing the Shaper, PromptNode's helper. Shaper unlocks the full potential of PromptNode and ensures its seamless integration with Haystack. But Shaper's scope and functionality are not limited to PromptNode; you can also use it independently, opening up a whole new world of possibilities.
IVF and Product Quantization support for OpenSearchDocumentStore
We've added support for IVF and IVF with Product Quantization to OpenSearchDocumentStore. You can train the IVF index either by calling the train_index method (as in FAISSDocumentStore) or by setting ivf_train_size when initializing OpenSearchDocumentStore.
What's Changed
Breaking Changes
- refactor: Updated rest_api schema for tables to be consistent with Document.to_dict by @sjrl in #3872
- feat: Support multiple document_ids in Answer object (for generative QA) by @tstadel in #4062
- feat: Update OpenAIAnswerGenerator defaults and with learnings from PromptNode by @sjrl in #4038
- build: cache nltk models into the docker image by @mayankjobanputra in #4118
- feat: Add IVF and Product Quantization support for OpenSearchDocumentStore by @bogdankostic in #3850
Pipeline
- feat: add frontmatter to meta in `MarkdownConverter` by @TuanaCelik in #3953
- fix: removing code block in `MarkdownConverter` by @TuanaCelik in #3960
- feat: Add page range support to PDF converters. by @danielbichuetti in #3965
- fix: Update telemetry to not serialize Pipeline if disabled. by @sjrl in #4000
- feat: add `Shaper` by @ZanSara in #3880
- fix: Event sending for `RayPipeline` crashing Haystack by @zoltan-fedor in #3971
- fix: document retrieval metrics for non-document_id document_relevance_criteria by @tstadel in #3885
- fix: make the crawler more robust on Windows by @anakin87 in #4049
- fix: use correct count of outgoing edges in RayPipeline by @zoltan-fedor in #4066
- feat: Allow all training options for training a SentenceTransformers EmbeddingRetriever by @sjrl in #4026
- refactor: replace mutable default arguments by @julian-risch in #4070
- feat: Support multiple `RayPipelines` by @zoltan-fedor in #4078
- Remove double batching in retrieve_batch by @sjrl in #4014
- style: Update black by @silvanocerza in #4101
- fix: Fix `TableTextRetriever` for input consisting of tables only by @jackapbutler in #4048
- fix: Deduplicate same Documents in isolated evaluation of Reader by @bogdankostic in #4114
- Docs: Fix code block formatting by @agnieszka-m in #4162
- refactor: Remove the pin from the espnet module and fix the audio node tests. by @danielbichuetti in #4128
- fix: change tiktoken fallback mechanism to support Windows amd64 by @danielbichuetti in #4175
- feat: Add OpenAIError to retry mechanism by @sjrl in #4178
DocumentStores
- refactor: use weaviate client to build BM25 query by @hsm207 in #3939
- fix: fixed `InMemoryDocumentStore.get_embedding_count` to return correct number by @sjrl in #3980
- fix: Add inner query for mysql compatibility by @julian-risch in #4068
- feat: add support for custom headers by @hsm207 in #4040
- feat: Add BM25 support for tables in InMemoryDocumentStore by @bogdankostic in #4090
- refactor: `InMemoryDocumentStore` - manage documents without embedding & fix mypy errors by @anakin87 in #4113
- refactor: complete the document stores test refactoring by @masci in #4125
- feat: include testing facilities into haystack package by @masci in #4182
Documentation
- Align with the docs install guide + correct lg by @agnieszka-m in #3950
- docs: Update Crawler docstring for correct usage in Google colab by @silvanocerza in #3979
- Docs: Update docstrings by @agnieszka-m in #4119
- docs: Update Annotation Tool README.md by @bogdankostic in #4123
- feat: Add model_kwargs option to PromptNode by @sjrl in #4151
- fix: Remove logging statement of setting ID manually in `Document` by @bogdankostic in #4129
- chore: Fixing PromptNode .prompt() docstring to include the PromptTemplate object as an option by @TuanaCelik in #4135
- chore: de-couple the telemetry events for each tutorial from the dataset on AWS that is used by @TuanaCelik in #4155
- feat: Implement `run_batch` for PromptNode by @sjrl in #4072
Other Changes
- fix: add option to not override results by Shaper #4231
- fix: Shaper store all outputs from function #4223
- fix: allowing file-upload api to write files to disk #4221
- fix: Fix bug in prompt template check of OpenAIAnswerGenerator #4220
- feat: add top_k to PromptNode #4159
- feat: Add JsonConverter node #4130
- feat: adding secure loading of models by default for haystack by @mayankjobanputra in #3901
- fix: add tiktoken fallback mechanism. by @danielbichuetti in #3929
- fix: change model in distillation test by @ZanSara in #3944
- feat: Expose `output_variable` in PromptNode result, adjust unit tests by @vblagoje in #3892
- fix: Fix type in `FARMReader`'s `save_to_remote` by @bogdankostic in #3952
- refactor: Remove PromptNode hash and equality functions by @vblagoje in #3923
- ci: Remove mypy deps install step in python_cache action by @silvanocerza in #3956
- fix: overwrite params with environment variables even if there are no params in the pipeline definition; make `mypy` ignore REST API tests by @anakin87 in #3930
- Docs: Update ImageToText docstrings by @agnieszka-m in #3963
- Docs: Add TransformersImageToText API doc by @agnieszka-m in #3966
- ci: Add Docker images testing by @silvanocerza in #3943
- feat: Allow users to set a timeout for remote APIs by @danielbichuetti in #3949
- ci: Fix docker image testing on release by @silvanocerza in #3976
- Fix: Fix quotation marks by @agnieszka-m in #3973
- fix: PromptNode doesn't have run_batch support (yet) by @vblagoje in #3972
- chore: increased timeout for loading pipelines through API by @mayankjobanputra in #3977
- Missing import for `TransformersImageToText` by @ZanSara in #3984
- test: CI on py3.8 by @ZanSara in #3926
- Simplifies and fix docker images tests on release by @silvanocerza in #3982
- feat: Add `use_prefiltering` parameter to `DeepsetCloudDocumentStore` by @bogdankostic in #3969
- ci: Delete Docker images after testing to prevent workflow failure by @silvanocerza in #4004
- fix: Add a verbose option to PromptNode to let users understand the prompts being used #2 by @zoltan-fedor in #3898
- fix: prevent posthog from sending errors to stderr by @julian-risch in #4008
- fix: extend schema for prompt node results by @tstadel in #3891
- proposal: TableCell by @sjrl in #3875
- refactor: In PromptNode reuse tokenizer instead of loading new one for stop words by @sjrl in #4016
- ci: Automate release on PyPi by @silvanocerza in https://github.co...
v1.14.0rc2
What's Changed
- fix: add option to not override results by Shaper #4231
- fix: Shaper store all outputs from function #4223
- fix: allowing file-upload api to write files to disk #4221
- fix: Fix bug in prompt template check of OpenAIAnswerGenerator #4220
- feat: add top_k to PromptNode #4159
- feat: Add JsonConverter node #4130