Fix/litellm tokenizer model name #1925
base: main
Conversation
…del name parsing.
Signed-off-by: Faizan Shaikh <[email protected]>
Please make sure all the checkboxes are checked:
Walkthrough

Updated LiteLLMEmbeddingEngine: removed MAX_RETRIES and retry_count, rewrote get_tokenizer to branch by provider and compute an effective_model (stripping "hosted_vllm/" and "openai/"), returning provider-specific tokenizers and using explicit fallback logic. Added a standalone reproduction script exercising the new tokenizer behavior.
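To make the walkthrough concrete, here is a minimal standalone sketch of the described flow. The function signature, constructor arguments, and exact provider strings are assumptions inferred from this review, not the merged code:

```python
# Hypothetical rendering of the reworked get_tokenizer flow described above;
# argument names and provider strings are assumptions from this review.
from cognee.infrastructure.llm.tokenizer.TikToken.adapter import TikTokenTokenizer
from cognee.infrastructure.llm.tokenizer.Mistral.adapter import MistralTokenizer
from cognee.infrastructure.llm.tokenizer.HuggingFace.adapter import HuggingFaceTokenizer


def get_tokenizer(provider: str, model: str):
    """Pick a tokenizer per provider, as the walkthrough describes."""
    # effective_model strips LiteLLM routing prefixes from the configured name.
    effective_model = model.replace("hosted_vllm/", "").replace("openai/", "")

    if provider == "openai":
        return TikTokenTokenizer(model=model.split("/")[-1])
    if provider == "gemini":
        # Per the review, Gemini gets TikToken without a model-specific encoding.
        return TikTokenTokenizer()
    if provider == "mistral":
        return MistralTokenizer(model=model.split("/")[-1])

    # Custom providers: try HuggingFace first, with an explicit TikToken fallback.
    try:
        return HuggingFaceTokenizer(model=effective_model)
    except Exception:
        return TikTokenTokenizer()
```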
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Hello @faizansk25, thank you for submitting a PR! We will respond as soon as possible.
Actionable comments posted: 1
🧹 Nitpick comments (4)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (2)
188-234: Consider documenting the prefix-stripping logic.

The method handles model name prefixes differently based on provider type:

- Lines 201-218: Known providers (openai, gemini, mistral) use `model.split("/")[-1]` (last segment only)
- Line 222: Custom providers use the full model string with prefix replacement

This inconsistency is intentional (HuggingFace models need the namespace, like "BAAI/bge-m3"), but the logic could benefit from a comment explaining why different approaches are used.
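For reference, the two strategies diverge exactly on namespaced model names:

```python
model = "openai/BAAI/bge-m3"

# Known-provider branch: keeps only the last path segment.
print(model.split("/")[-1])  # bge-m3

# Custom-provider branch: strips routing prefixes but keeps the namespace.
print(model.replace("hosted_vllm/", "").replace("openai/", ""))  # BAAI/bge-m3
```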
🔎 Suggested documentation improvement
```diff
         else:
+            # For custom/HuggingFace models, strip known provider prefixes while preserving
+            # the model namespace (e.g., "openai/BAAI/bge-m3" -> "BAAI/bge-m3")
             try:
                 tokenizer = HuggingFaceTokenizer(
                     model=self.model.replace("hosted_vllm/", "").replace("openai/", ""),
```
222-222: Fix correctly removes the "openai/" prefix for HuggingFace tokenizer initialization.

The addition of `.replace("openai/", "")` correctly sanitizes model names like "openai/BAAI/bge-m3" to "BAAI/bge-m3" for HuggingFace tokenizer compatibility. The code intentionally uses the full model string with prefix stripping only in the custom provider branch (line 219+), while known providers extract only the model name segment on line 199.

Consider whether other LiteLLM provider prefixes (e.g., `azure/`, `bedrock/`, `anthropic/`) might appear with HuggingFace embedding models and need similar handling. If custom providers with such prefixes become common, a more robust approach using a configurable prefix list or regex pattern would be more maintainable; a sketch of that idea follows.
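A minimal sketch, assuming only the prefixes named above; the set is illustrative, not an exhaustive inventory of LiteLLM providers:

```python
# Hypothetical helper; extend KNOWN_ROUTING_PREFIXES as new providers appear.
KNOWN_ROUTING_PREFIXES = ("hosted_vllm/", "openai/", "azure/", "bedrock/", "anthropic/")


def strip_routing_prefix(model: str) -> str:
    """Remove one leading LiteLLM routing prefix, preserving the model namespace."""
    for prefix in KNOWN_ROUTING_PREFIXES:
        if model.startswith(prefix):
            return model[len(prefix):]
    return model


print(strip_routing_prefix("openai/BAAI/bge-m3"))  # BAAI/bge-m3
```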
reproduce_issue_1915.py (2)
16-50: Consider adding a docstring and integrating into the test suite.

The function lacks a docstring, which is recommended per the coding guidelines. Additionally, this appears to be a reproduction script rather than a formal unit test.
Consider:
- Adding a docstring explaining the test's purpose
- Converting this to a proper pytest test in the `tests/` directory with assertions instead of print statements
- If this is temporary for issue reproduction, documenting that in comments or the filename
🔎 Docstring example
```diff
 def test_tokenizer_model_name():
+    """
+    Reproduce issue #1915: Verify that model names with 'openai/' prefix
+    are correctly sanitized when loading HuggingFace tokenizers.
+
+    Expected: 'openai/BAAI/bge-m3' should result in tokenizer.model == 'BAAI/bge-m3'
+    """
     print("--- Starting Reproduction Test ---")
```
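If the script is promoted to the test suite, a pytest version could look like the sketch below. The engine's constructor arguments and the `tokenizer.model` attribute are assumptions carried over from the reproduction script, not verified API:

```python
# Hypothetical pytest port of reproduce_issue_1915.py; constructor arguments
# and attribute names are assumptions, not the project's verified API.
from cognee.infrastructure.databases.vector.embeddings.LiteLLMEmbeddingEngine import (
    LiteLLMEmbeddingEngine,
)
from cognee.infrastructure.llm.tokenizer.HuggingFace.adapter import HuggingFaceTokenizer


def test_tokenizer_model_name():
    """'openai/BAAI/bge-m3' should yield a HuggingFace tokenizer for 'BAAI/bge-m3',
    not a TikToken fallback."""
    engine = LiteLLMEmbeddingEngine(model="openai/BAAI/bge-m3")  # hypothetical args
    tokenizer = engine.get_tokenizer()

    assert isinstance(tokenizer, HuggingFaceTokenizer)
    assert tokenizer.model == "BAAI/bge-m3"
```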
46-46: Minor formatting: Extra space in comment.

```diff
-    #  If it fell back to TikToken, that means HF failed (likely because 'openai/BAAI/bge-m3' was passed)
+    # If it fell back to TikToken, that means HF failed (likely because 'openai/BAAI/bge-m3' was passed)
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (2 hunks)
- reproduce_issue_1915.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code
Files:
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
reproduce_issue_1915.py
⚙️ CodeRabbit configuration file
**/*.py: When reviewing Python code for this project:
- Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
- As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
- As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
- As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
- As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.
Files:
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.pyreproduce_issue_1915.py
cognee/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Use shared logging utilities from cognee.shared.logging_utils in Python code
Files:
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
cognee/{modules,infrastructure,tasks}/**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)
Files:
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
🧬 Code graph analysis (1)
reproduce_issue_1915.py (3)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (1)
- LiteLLMEmbeddingEngine (34-233)
cognee/infrastructure/llm/tokenizer/HuggingFace/adapter.py (1)
- HuggingFaceTokenizer (6-78)
cognee/infrastructure/llm/tokenizer/TikToken/adapter.py (1)
- TikTokenTokenizer (7-123)
```python
# Add project root to path
sys.path.append(os.getcwd())
```
Fragile path manipulation may cause import failures.
Using `sys.path.append(os.getcwd())` assumes the script is run from the repository root. This will fail if it is executed from a different directory.
🔎 More robust alternative
```diff
-# Add project root to path
-sys.path.append(os.getcwd())
+# Add project root to path
+script_dir = os.path.dirname(os.path.abspath(__file__))
+project_root = os.path.dirname(script_dir)
+sys.path.insert(0, project_root)
```

Alternatively, if this is meant to be a permanent test, consider moving it to the `tests/` directory and using proper relative imports or pytest fixtures.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
# Add project root to path
script_dir = os.path.dirname(os.path.abspath(__file__))
project_root = os.path.dirname(script_dir)
sys.path.insert(0, project_root)
```
🤖 Prompt for AI Agents
In reproduce_issue_1915.py around lines 6-7, the use of
sys.path.append(os.getcwd()) is fragile because it assumes the script is run
from the repo root; replace it by computing the repository root relative to this
file (e.g., using os.path.dirname(os.path.abspath(__file__)) and walking up to
the project root) and insert that path at the front of sys.path
(sys.path.insert(0, root_path)) to ensure deterministic imports, guard against
duplicate entries, or—better—move this code into a tests/ location and use
proper package-relative imports or pytest fixtures instead of mutating sys.path.
Signed-off-by: Faizan Shaikh <[email protected]>
Actionable comments posted: 1
🧹 Nitpick comments (1)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (1)
187-195: Enhance docstring to document provider-specific tokenizer selection.

The current docstring doesn't explain the provider-specific branching logic or the fallback behavior. Per coding guidelines, function definitions should include comprehensive docstrings.
Consider documenting:
- The prefix-stripping logic for `hosted_vllm/` and `openai/`
- Provider-specific tokenizer selection (OpenAI → TikToken, Gemini → TikToken with no model, Mistral → MistralTokenizer, others → HuggingFace with TikToken fallback)
- The fallback behavior when HuggingFaceTokenizer initialization fails
As per coding guidelines, undocumented function definitions are assumed incomplete.
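One possible docstring, paraphrasing the behavior this review describes; the exact wording and method signature are assumptions:

```python
def get_tokenizer(self):
    """
    Return a tokenizer matching the configured embedding provider.

    Strips LiteLLM routing prefixes ("hosted_vllm/", "openai/") to compute an
    effective model name. OpenAI uses TikToken, Gemini uses TikToken without a
    model-specific encoding, Mistral uses MistralTokenizer, and all other
    providers try HuggingFaceTokenizer with the effective model name, falling
    back to TikToken if initialization fails.
    """
```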
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (3)
cognee/infrastructure/llm/tokenizer/TikToken/adapter.py (1)
- TikTokenTokenizer (7-123)
cognee/infrastructure/llm/tokenizer/Mistral/adapter.py (1)
- MistralTokenizer (6-89)
cognee/infrastructure/llm/tokenizer/HuggingFace/adapter.py (1)
- HuggingFaceTokenizer (6-78)
🔇 Additional comments (1)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (1)
209-228: Address inconsistent model name extraction between OpenAI/Mistral and HuggingFace branches.

The OpenAI and Mistral branches use `self.model.split("/")[-1]` to extract model names, while the HuggingFace fallback uses `effective_model` (with provider prefixes stripped). This creates a problematic inconsistency. Real-world usage includes models like "openai/BAAI/bge-m3", where splitting by "/" and taking the last part yields just "bge-m3", losing the namespace information required by HuggingFace. The HuggingFace approach with `effective_model` preserves the full identifier as "BAAI/bge-m3".

To fix this inconsistency, ensure model name extraction is uniform across all branches. Consider whether OpenAI and Mistral tokenizers truly expect the full split form or if they should also use `effective_model` to preserve namespace information.
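A concrete reading of that suggestion as a tiny sketch; whether TikToken and Mistral accept namespaced names is exactly the open question raised above, so this is illustrative only:

```python
# Illustrative only: every branch would share one effective_model instead of
# mixing split("/")[-1] with prefix stripping.
def effective_model_name(model: str) -> str:
    return model.replace("hosted_vllm/", "").replace("openai/", "")


print(effective_model_name("openai/BAAI/bge-m3"))  # BAAI/bge-m3
print("openai/BAAI/bge-m3".split("/")[-1])         # bge-m3 (namespace lost)
```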
```diff
     mock: bool
 
-    MAX_RETRIES = 5
 
     mock: bool
```
Remove duplicate mock type annotation.
The `mock: bool` attribute is declared twice (lines 51 and 53). Remove the duplicate on line 53.
🔎 Proposed fix
```diff
     model: str
     dimensions: int
     mock: bool
-
-    mock: bool
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
    model: str
    dimensions: int
    mock: bool
```
🤖 Prompt for AI Agents
In cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
around lines 51 to 53, there is a duplicated type annotation for the attribute
`mock: bool` declared twice; remove the redundant declaration (the second
occurrence on line 53) so the attribute is only declared once, leaving the
single correct `mock: bool` annotation.
Signed-off-by: Faizan Shaikh <[email protected]>
Description
Acceptance Criteria
Type of Change
Screenshots/Videos (if applicable)
Pre-submission Checklist
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.