Conversation

@faizansk25 commented on Dec 19, 2025

Description

Acceptance Criteria

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update
  • Code refactoring
  • Performance improvement
  • Other (please specify):

Screenshots/Videos (if applicable)

Pre-submission Checklist

  • I have tested my changes thoroughly before submitting this PR
  • This PR contains minimal changes necessary to address the issue/feature
  • My code follows the project's coding standards and style guidelines
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if applicable)
  • All new and existing tests pass
  • I have searched existing PRs to ensure this change hasn't been submitted already
  • I have linked any relevant issues in the description
  • My commits have clear and descriptive messages

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

Summary by CodeRabbit

  • Bug Fixes
    • Improved handling of provider prefixes in model identifiers so the correct tokenizer is selected for qualified model names, increasing compatibility across providers.
  • Tests
    • Added a standalone reproduction script to validate tokenizer selection and model-name handling for edge-case provider/model combinations.
  • Chores
    • Narrowed the public API surface to explicit pipeline exports to reduce unintended namespace exposure.

@pull-checklist

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.

coderabbitai bot commented Dec 19, 2025

Walkthrough

Updated LiteLLMEmbeddingEngine: removed MAX_RETRIES and retry_count, rewrote get_tokenizer to branch by provider and compute an effective_model (stripping "hosted_vllm/" and "openai/"), returning provider-specific tokenizers and using explicit fallback logic. Added a standalone reproduction script exercising the new tokenizer behavior.
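
In outline, the described flow looks roughly like the following (a sketch assembled from this walkthrough and the review comments below; import paths come from the code-graph sections, and constructor signatures and provider normalization are assumptions, not repository code):

```python
# Sketch only: reconstructed from the review, not the repository source.
from cognee.infrastructure.llm.tokenizer.TikToken.adapter import TikTokenTokenizer
from cognee.infrastructure.llm.tokenizer.Mistral.adapter import MistralTokenizer
from cognee.infrastructure.llm.tokenizer.HuggingFace.adapter import HuggingFaceTokenizer


def select_tokenizer(model: str, provider: str):
    """Mirror the described branching: known providers get a bare model name,
    everything else goes through HuggingFace with a TikToken fallback."""
    model_name = model.split("/")[-1]  # bare name, e.g. "text-embedding-3-large"

    if provider == "openai":
        return TikTokenTokenizer(model=model_name)
    if provider == "gemini":
        return TikTokenTokenizer(model=None)  # no tiktoken encoding for Gemini
    if provider == "mistral":
        return MistralTokenizer(model=model_name)

    # Custom providers: strip serving prefixes but keep the HF namespace,
    # e.g. "openai/BAAI/bge-m3" -> "BAAI/bge-m3".
    effective_model = model.replace("hosted_vllm/", "").replace("openai/", "")
    try:
        return HuggingFaceTokenizer(model=effective_model)
    except Exception:
        return TikTokenTokenizer(model=None)  # explicit fallback path
```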

Changes

LiteLLMEmbeddingEngine tokenizer rewrite (cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py)
  • Removed the class constant MAX_RETRIES and the self.retry_count initialization.
  • Replaced the prior tokenizer logic with multi-branch provider handling: compute effective_model (stripping hosted_vllm/ and openai/), normalize the provider, and return provider-specific tokenizers (OpenAI → TikTokenTokenizer, Gemini → TikTokenTokenizer(model=None), Mistral → MistralTokenizer, default → HuggingFaceTokenizer with fallback to TikTokenTokenizer). Error logging is preserved; branches return tokenizers directly.

Reproduction test added (reproduce_issue_1915.py)
  • New standalone script that instantiates LiteLLMEmbeddingEngine with model "openai/BAAI/bge-m3", prints engine and tokenizer info, and verifies tokenizer type/model expectations (reporting fallback failures).

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Review provider-branch correctness and that effective_model stripping covers intended prefixes without over-stripping.
  • Verify HuggingFaceTokenizer initialization and fallback behavior are safe and logged.
  • Confirm removal of MAX_RETRIES / self.retry_count has no remaining references or unintended effects.
  • Run the new reproduce_issue_1915.py to validate both success and fallback paths.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 33.33%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve coverage.
  • Description check (⚠️ Warning): The pull request description is incomplete; all sections are empty templates with no actual content from the author. Resolution: fill in the Description section explaining the changes, provide Acceptance Criteria with testing instructions, and complete the applicable Pre-submission Checklist items.

✅ Passed checks (1 passed)
  • Title check (✅ Passed): The title clearly and specifically describes the main change: fixing tokenizer model-name handling in the LiteLLM embedding engine.
✨ Finishing touches
  • 📝 Generate docstrings
  • 🧪 Generate unit tests (beta)
      • Create PR with unit tests
      • Post copyable unit tests in a comment

@github-actions bot left a comment

Hello @faizansk25, thank you for submitting a PR! We will respond as soon as possible.

@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (2)

188-234: Consider documenting the prefix-stripping logic.

The method handles model name prefixes differently based on provider type:

  • Lines 201-218: Known providers (openai, gemini, mistral) use model.split("/")[-1] (last segment only)
  • Line 222: Custom providers use full model with prefix replacement

This inconsistency is intentional (HuggingFace models need a namespace such as "BAAI/bge-m3"), but the logic would benefit from a comment explaining why the two approaches differ.

🔎 Suggested documentation improvement
     else:
+        # For custom/HuggingFace models, strip known provider prefixes while preserving
+        # the model namespace (e.g., "openai/BAAI/bge-m3" -> "BAAI/bge-m3")
         try:
             tokenizer = HuggingFaceTokenizer(
                 model=self.model.replace("hosted_vllm/", "").replace("openai/", ""),

222-222: Fix correctly removes the "openai/" prefix for HuggingFace tokenizer initialization.

The addition of .replace("openai/", "") correctly sanitizes model names like "openai/BAAI/bge-m3" to "BAAI/bge-m3" for HuggingFace tokenizer compatibility. The code intentionally uses the full model string with prefix stripping only in the custom provider branch (line 219+), while known providers extract only the model name segment on line 199.

Consider whether other LiteLLM provider prefixes (e.g., azure/, bedrock/, anthropic/) might appear with HuggingFace embedding models and need similar handling. If custom providers with such prefixes become common, a configurable prefix list or a regex pattern would be more maintainable, as sketched below.
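
One possible shape under that approach (a hypothetical helper; the prefix list below is illustrative and would need to match real LiteLLM usage):

```python
# Hypothetical helper: strip one known provider prefix, if present.
KNOWN_PROVIDER_PREFIXES = ("hosted_vllm/", "openai/", "azure/", "bedrock/", "anthropic/")


def strip_provider_prefix(model: str) -> str:
    for prefix in KNOWN_PROVIDER_PREFIXES:
        if model.startswith(prefix):
            # Remove only the leading prefix so the HF namespace survives,
            # e.g. "openai/BAAI/bge-m3" -> "BAAI/bge-m3".
            return model[len(prefix):]
    return model
```

With that helper, strip_provider_prefix("openai/BAAI/bge-m3") returns "BAAI/bge-m3" while unprefixed names pass through unchanged.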

reproduce_issue_1915.py (2)

16-50: Consider adding a docstring and integrating into test suite.

The function lacks a docstring, which is recommended per the coding guidelines. Additionally, this appears to be a reproduction script rather than a formal unit test.

Consider:

  1. Adding a docstring explaining the test's purpose
  2. Converting this to a proper pytest test in the tests/ directory with assertions instead of print statements (see the sketch after the docstring example below)
  3. If this is temporary for issue reproduction, document that in comments or the filename
🔎 Docstring example
 def test_tokenizer_model_name():
+    """
+    Reproduce issue #1915: Verify that model names with 'openai/' prefix
+    are correctly sanitized when loading HuggingFace tokenizers.
+    
+    Expected: 'openai/BAAI/bge-m3' should result in tokenizer.model == 'BAAI/bge-m3'
+    """
     print("--- Starting Reproduction Test ---")

46-46: Minor formatting: Extra space in comment.

-         # If it fell back to TikToken, that means HF failed (likely because 'openai/BAAI/bge-m3' was passed)
+        # If it fell back to TikToken, that means HF failed (likely because 'openai/BAAI/bge-m3' was passed)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d8d3844 and f637f80.

📒 Files selected for processing (2)
  • cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (2 hunks)
  • reproduce_issue_1915.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
  • reproduce_issue_1915.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
  • reproduce_issue_1915.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
cognee/{modules,infrastructure,tasks}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)

Files:

  • cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
🧬 Code graph analysis (1)
reproduce_issue_1915.py (3)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (1)
  • LiteLLMEmbeddingEngine (34-233)
cognee/infrastructure/llm/tokenizer/HuggingFace/adapter.py (1)
  • HuggingFaceTokenizer (6-78)
cognee/infrastructure/llm/tokenizer/TikToken/adapter.py (1)
  • TikTokenTokenizer (7-123)

Comment on lines +6 to +7
# Add project root to path
sys.path.append(os.getcwd())

⚠️ Potential issue | 🟡 Minor

Fragile path manipulation may cause import failures.

Using sys.path.append(os.getcwd()) assumes the script is run from the repository root. This will fail if executed from a different directory.

🔎 More robust alternative
-# Add project root to path
-sys.path.append(os.getcwd())
+# Add project root to path
+script_dir = os.path.dirname(os.path.abspath(__file__))
+project_root = os.path.dirname(script_dir)
+sys.path.insert(0, project_root)

Alternatively, if this is meant to be a permanent test, consider moving it to the tests/ directory and using proper relative imports or pytest fixtures.

@Vasilije1990 added the community-contribution label on Dec 19, 2025
@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (1)

187-195: Enhance docstring to document provider-specific tokenizer selection.

The current docstring doesn't explain the provider-specific branching logic or the fallback behavior. Per coding guidelines, function definitions should include comprehensive docstrings.

Consider documenting:

  • The prefix stripping logic for hosted_vllm/ and openai/
  • Provider-specific tokenizer selection (OpenAI → TikToken, Gemini → TikToken with no model, Mistral → MistralTokenizer, others → HuggingFace with TikToken fallback)
  • The fallback behavior when HuggingFaceTokenizer initialization fails

As per coding guidelines, undocumented function definitions are assumed incomplete.
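
A possible starting point (suggested wording only, derived from the bullets above, not repository text):

```python
def get_tokenizer(self):
    """
    Return a tokenizer appropriate for the configured embedding model.

    Strips the "hosted_vllm/" and "openai/" prefixes from self.model, then
    selects by provider: OpenAI -> TikTokenTokenizer, Gemini ->
    TikTokenTokenizer(model=None), Mistral -> MistralTokenizer, and any
    other provider -> HuggingFaceTokenizer, falling back to
    TikTokenTokenizer if HuggingFace initialization fails.
    """
```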

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f637f80 and 240d50e.

📒 Files selected for processing (1)
  • cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

**/*.py: Use 4-space indentation in Python code
Use snake_case for Python module and function names
Use PascalCase for Python class names
Use ruff format before committing Python code
Use ruff check for import hygiene and style enforcement with line-length 100 configured in pyproject.toml
Prefer explicit, structured error handling in Python code

Files:

  • cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py

⚙️ CodeRabbit configuration file

**/*.py: When reviewing Python code for this project:

  1. Prioritize portability over clarity, especially when dealing with cross-Python compatibility. However, with the priority in mind, do still consider improvements to clarity when relevant.
  2. As a general guideline, consider the code style advocated in the PEP 8 standard (excluding the use of spaces for indentation) and evaluate suggested changes for code style compliance.
  3. As a style convention, consider the code style advocated in CEP-8 and evaluate suggested changes for code style compliance.
  4. As a general guideline, try to provide any relevant, official, and supporting documentation links to any tool's suggestions in review comments. This guideline is important for posterity.
  5. As a general rule, undocumented function definitions and class definitions in the project's Python code are assumed incomplete. Please consider suggesting a short summary of the code for any of these incomplete definitions as docstrings when reviewing.

Files:

  • cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
cognee/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Use shared logging utilities from cognee.shared.logging_utils in Python code

Files:

  • cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
cognee/{modules,infrastructure,tasks}/**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Co-locate feature-specific helpers under their respective package (modules/, infrastructure/, or tasks/)

Files:

  • cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py
🧬 Code graph analysis (1)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (3)
cognee/infrastructure/llm/tokenizer/TikToken/adapter.py (1)
  • TikTokenTokenizer (7-123)
cognee/infrastructure/llm/tokenizer/Mistral/adapter.py (1)
  • MistralTokenizer (6-89)
cognee/infrastructure/llm/tokenizer/HuggingFace/adapter.py (1)
  • HuggingFaceTokenizer (6-78)
🔇 Additional comments (1)
cognee/infrastructure/databases/vector/embeddings/LiteLLMEmbeddingEngine.py (1)

209-228: Address inconsistent model name extraction between OpenAI/Mistral and HuggingFace branches.

The OpenAI and Mistral branches use self.model.split("/")[-1] to extract model names, while the HuggingFace fallback uses effective_model (with provider prefixes stripped). This creates a problematic inconsistency. Real-world usage includes models like "openai/BAAI/bge-m3", where splitting by "/" and taking the last part yields just "bge-m3", losing the namespace information required by HuggingFace. The HuggingFace approach with effective_model preserves the full identifier as "BAAI/bge-m3".

To fix this inconsistency, make model-name extraction uniform across all branches. Consider whether the OpenAI and Mistral tokenizers truly expect only the last path segment, or whether they should also use effective_model to preserve namespace information.
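
A small illustration of the difference on this PR's model name (plain Python, not repository code):

```python
# The two extraction strategies applied to the same qualified model name.
model = "openai/BAAI/bge-m3"

last_segment = model.split("/")[-1]  # "bge-m3" (namespace lost)
effective_model = model.replace("hosted_vllm/", "").replace("openai/", "")

assert last_segment == "bge-m3"
assert effective_model == "BAAI/bge-m3"  # the form HuggingFace needs
```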

Comment on lines 51 to +53
mock: bool

MAX_RETRIES = 5
mock: bool

⚠️ Potential issue | 🟡 Minor

Remove duplicate mock type annotation.

The mock: bool attribute is declared twice (lines 51 and 53). Remove the duplicate on line 53.

🔎 Proposed fix
     model: str
     dimensions: int
     mock: bool
-
-    mock: bool
