Skip to content

feat: added support for modernbert embeddings #9

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 21, 2025
Merged

Conversation

ErikBjare
Copy link
Member

@ErikBjare ErikBjare commented Jan 21, 2025

Probably needs a lot of cleanup, I kinda just let gptme run with it.


Important

Added support for ModernBERT embeddings, updated CLI and indexing functionalities, and enhanced documentation and testing.

  • Behavior:
    • Added support for ModernBERT embeddings in gptme_rag/embeddings.py.
    • Updated CLI in cli.py to include options for embedding-function and device.
    • Enhanced index() and search() functions in cli.py to utilize ModernBERT.
    • Introduced ChunkMerger and SearchOutputFormatter classes in cli.py for improved chunk handling and output formatting.
  • Indexer:
    • Modified Indexer class in indexer.py to integrate ModernBERT embeddings.
    • Adjusted chunk size and overlap defaults based on embedding model.
    • Added logic to handle collection recreation if embedding model changes.
  • Document Processing:
    • Updated DocumentProcessor in document_processor.py to log chunk creation details.
  • Configuration:
    • Updated .pre-commit-config.yaml to comment out local pytest hook.
    • Added sentence-transformers to dependencies in pyproject.toml.
  • Documentation:
    • Updated README.md to include instructions for using ModernBERT embeddings.
  • Testing:
    • Added tests in test_context_assembler.py to ensure context assembly handles token limits and includes all components.

This description was created by Ellipsis for c5a5e11. It will automatically update as commits are pushed.

Copy link

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Reviewed everything up to 1862dfd in 1 minute and 18 seconds

More details
  • Looked at 853 lines of code in 7 files
  • Skipped 1 files when reviewing.
  • Skipped posting 6 drafted comments based on config settings.
1. .pre-commit-config.yaml:20
  • Draft comment:
    The pytest pre-commit hook is commented out. Consider enabling it to ensure tests are run before commits, which helps catch issues early.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The PR introduces a new feature for ModernBERT embeddings, but the pre-commit hook for pytest is commented out. This could lead to untested code being merged.
2. gptme_rag/cli.py:28
  • Draft comment:
    Consider moving the default_persist_dir to a configuration file or environment variable for better flexibility and maintainability.
  • Reason this comment was not posted:
    Confidence changes required: 30%
    The PR introduces a new feature for ModernBERT embeddings, but the pre-commit hook for pytest is commented out. This could lead to untested code being merged.
3. gptme_rag/cli.py:207
  • Draft comment:
    The warning about changing the embedding model is useful, but consider logging it as well for better traceability.
  • Reason this comment was not posted:
    Confidence changes required: 20%
    The PR introduces a new feature for ModernBERT embeddings, but the pre-commit hook for pytest is commented out. This could lead to untested code being merged.
4. gptme_rag/embeddings.py:42
  • Draft comment:
    Consider making the batch size configurable to allow users to optimize for their specific hardware capabilities.
  • Reason this comment was not posted:
    Confidence changes required: 20%
    The PR introduces a new feature for ModernBERT embeddings, but the pre-commit hook for pytest is commented out. This could lead to untested code being merged.
5. gptme_rag/indexing/indexer.py:55
  • Draft comment:
    The embedding_function parameter is not type-annotated. Consider adding type annotations for better code clarity and type checking.
  • Reason this comment was not posted:
    Confidence changes required: 20%
    The PR introduces a new feature for ModernBERT embeddings, but the pre-commit hook for pytest is commented out. This could lead to untested code being merged.
6. gptme_rag/indexing/indexer.py:144
  • Draft comment:
    The logic for determining if a collection needs to be recreated is complex. Consider refactoring this into a separate method for clarity and maintainability.
  • Reason this comment was not posted:
    Confidence changes required: 30%
    The PR introduces a new feature for ModernBERT embeddings, but the pre-commit hook for pytest is commented out. This could lead to untested code being merged.

Workflow ID: wflow_kzTR8OfiVC2gW1WH


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

Copy link

@ellipsis-dev ellipsis-dev bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks good to me! Incremental review on c5a5e11 in 18 seconds

More details
  • Looked at 18 lines of code in 1 files
  • Skipped 0 files when reviewing.
  • Skipped posting 0 drafted comments based on config settings.

Workflow ID: wflow_nUhRL9vtds2H2f56


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@ErikBjare ErikBjare merged commit 2265bbe into master Jan 21, 2025
1 check passed
@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 22.48804% with 162 lines in your changes missing coverage. Please review.

Project coverage is 46.60%. Comparing base (4c9a7dd) to head (c5a5e11).
Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
gptme_rag/cli.py 0.00% 157 Missing ⚠️
gptme_rag/indexing/indexer.py 87.80% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master       #9      +/-   ##
==========================================
- Coverage   50.39%   46.60%   -3.80%     
==========================================
  Files           9       10       +1     
  Lines        1127     1311     +184     
==========================================
+ Hits          568      611      +43     
- Misses        559      700     +141     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants