fix: case-insensitive file extension detection in RAG data-type auto-detection #6400
…detection DataTypes.from_content() matched file extensions case-sensitively, so files and URLs with uppercase extensions (.PDF, .CSV, .DOCX, ...) were misrouted to the plain-text loader, feeding raw binary into the RAG index. Lowercase the path before comparison and add regression tests for mixed-case file/URL extensions. Fixes crewAIInc#6399
Walkthrough
Changes: Case-Insensitive Extension Detection
No sequence diagram is warranted: the change is a single-function bug fix with a single component path (extension string matching) and no multi-actor interaction.
Pre-merge checks: 4 passed, 1 failed (1 warning).
Summary
DataTypes.from_content() matched file extensions case-sensitively, so files and URLs with uppercase extensions (.PDF, .CSV, .DOCX, …) were misrouted to the plain-text loader, feeding raw binary into the RAG index instead of the parsed document text. This PR lowercases the path before the extension comparison in get_file_type().

Fixes #6399
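For readers who want the shape of the fix without opening the diff, here is a minimal sketch of case-insensitive extension routing. It is not the code in rag/data_types.py: the function name `guess_type`, the `EXTENSION_TYPES` table, and the returned type labels are all hypothetical, chosen only to illustrate lowercasing the path before comparing.

```python
from urllib.parse import urlparse

# Hypothetical extension table; the real mapping in rag/data_types.py may differ.
EXTENSION_TYPES = {
    ".pdf": "pdf_file",
    ".csv": "csv",
    ".docx": "docx",
    ".mdx": "mdx",
    ".md": "mdx",
}


def guess_type(content: str) -> str:
    """Pick a loader type from a local path or URL, ignoring extension case."""
    parsed = urlparse(content)
    is_url = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    # For URLs, compare only the path component so query strings are ignored.
    path = parsed.path if is_url else content
    # The essence of the fix: normalise case before matching extensions.
    path = path.lower()
    for ext, data_type in EXTENSION_TYPES.items():
        if path.endswith(ext):
            return data_type
    # Fallbacks are illustrative only.
    return "website" if is_url else "text"
```

Without the `path.lower()` call, `Report.PDF` would fall through to the plain-text fallback, which is the misrouting described above.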
Changes
- rag/data_types.py: lowercase the path before extension matching in get_file_type() (covers both local files and URLs).
- tests/rag/test_data_types.py: new regression tests for mixed-case file and URL extensions.

Testing
- The new tests fail on main (12 failures across .PDF/.CSV/.DOCX/.MDX/.MD and URL cases) and pass with the fix (22 passed).
- tests/tools/rag/test_rag_tool_add_data_type.py still passes (39 passed).
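The regression tests themselves live in tests/rag/test_data_types.py and are not reproduced here. As a rough illustration of the kind of check involved, the sketch below uses the standard-library unittest module with a hypothetical stand-in, `has_extension`, in place of the real detection function; the sample paths are invented for the example.

```python
import unittest


def has_extension(path: str, ext: str) -> bool:
    """Hypothetical stand-in for the fixed comparison: lowercase first."""
    return path.lower().endswith(ext)


class MixedCaseExtensionTest(unittest.TestCase):
    # Invented sample inputs covering local files and a URL.
    CASES = [
        ("report.PDF", ".pdf"),
        ("Data.Csv", ".csv"),
        ("letter.DOCX", ".docx"),
        ("README.MD", ".md"),
        ("https://example.com/files/guide.MDX", ".mdx"),
    ]

    def test_mixed_case_matches(self):
        for path, ext in self.CASES:
            with self.subTest(path=path):
                self.assertTrue(has_extension(path, ext))

    def test_case_sensitive_match_would_miss(self):
        # Documents the original bug: a plain endswith() misses every case above.
        for path, ext in self.CASES:
            with self.subTest(path=path):
                self.assertFalse(path.endswith(ext))
```

Pairing the positive check with one that shows the case-sensitive comparison missing the same inputs makes it clear that the test would have failed before the fix.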