Error: Token indices sequence length is longer than the specified maximum sequence length for this model #9101
Replies: 1 comment
Welp, I found this. TL;DR: in the context of the HybridChunker, this is a known and anticipated "false alarm".
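To illustrate how such a false alarm can arise, here is a toy sketch, not Docling's or Hugging Face's actual code. It assumes the chunker tokenizes an oversized piece of text in order to measure it before splitting, and that the tokenizer only warns the first time it sees an over-length input. `ToyTokenizer` and `chunk` are hypothetical names, the whitespace tokenization is a stand-in for a real tokenizer, and the 8192 limit is an assumed value chosen for illustration.

```python
import warnings


class ToyTokenizer:
    """Hypothetical whitespace tokenizer that warns once on over-length input."""

    def __init__(self, max_length):
        self.max_length = max_length
        self._warned = False  # the warning fires only once per tokenizer instance

    def tokenize(self, text):
        tokens = text.split()
        if len(tokens) > self.max_length and not self._warned:
            warnings.warn(
                "Token indices sequence length is longer than the specified "
                f"maximum sequence length ({len(tokens)} > {self.max_length})"
            )
            self._warned = True
        return tokens


def chunk(text, tokenizer, max_tokens):
    """Measure the whole text first, then split it into pieces of <= max_tokens."""
    tokens = tokenizer.tokenize(text)  # measuring the unsplit text triggers the warning
    return [
        " ".join(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


if __name__ == "__main__":
    tok = ToyTokenizer(max_length=8192)       # assumed limit, for illustration only
    text = " ".join(["word"] * 26838)         # sequence length from the question
    chunks = chunk(text, tok, max_tokens=2500)
    print(max(len(c.split()) for c in chunks))  # every chunk stays within max_tokens
```

In this sketch the warning refers to the text that was measured, not to the chunks that come out, and a second call on the same tokenizer instance stays silent.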
Hello everyone, I've been trying to figure out why I get this error/warning when I run my Pipeline. I am not even sure if this is the correct place to add this question, but here we go.
Let me give you a small background on what I want to do first:
I want to create a RAG pipeline using Milvus as the vector database and Docling as the document converter, with Haystack as the backend.
Here's a snippet of the code:
Output
I tried both the OpenAI embedder and the HF embedder and I get the same result.
My main question is, where does it find a sequence of length 26838? Why is there a sequence of that length while my max_tokens param is at 2500? Shouldn't all chunks be at max 2500 tokens?
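One way to test that assumption is to count the tokens of every chunk directly. The following is a minimal sketch, not the code from this pipeline: `oversized_chunks` is a hypothetical helper, and `count_tokens` is whatever token counter matches the tokenizer in use (a whitespace counter stands in for it here).

```python
def oversized_chunks(chunks, count_tokens, max_tokens):
    """Return (index, token_count) for every chunk longer than max_tokens."""
    counts = [(i, count_tokens(text)) for i, text in enumerate(chunks)]
    return [(i, n) for i, n in counts if n > max_tokens]


if __name__ == "__main__":
    # Stand-in counter; with a Hugging Face tokenizer one could instead pass
    # something like `lambda t: len(tokenizer.encode(t))` (assumption: that
    # count includes any special tokens the tokenizer adds).
    count = lambda text: len(text.split())
    sample = ["short chunk", "a much longer chunk of text here"]
    print(oversized_chunks(sample, count, max_tokens=5))
```

An empty result means no chunk exceeds the limit, which would point to the warning coming from a step before the final chunks are produced.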
Also, in the output the sequence warning appears before the first print statement, "Indexing 1 files...". Does that mean the warning comes from somewhere outside that cell?
I am very confused and I've been looking at this for days.
The tokenizer is BAAI/bge-m3 and the embedder is text-embedding-3-large from OpenAI; both have the same maximum input size.
I cannot narrow down which step of the code produces this warning, or how to solve it.
Should I add Haystack's splitter to the pipeline to split the chunks further? I am under the impression that HybridChunker handles that, and I have it set up correctly with the max_tokens parameter.
Checking the tokens of each chunk, I get this:
Sorry for the long text, thank you for reading!
EDIT:
If I run the last cell again without restarting the notebook, I do not get the sequence length warning/error. But if I restart the notebook and run the cell, I get the warning. So it pops up only once per notebook session; running the cell again and again does not make it reappear.
Further down I have a RAG pipeline which works, so I do not know whether to ignore the error or look for a solution.