Fixes to LLM output parsing when using LLM-based ranking #16
This pull request introduces several improvements and bug fixes to the schema filtering and LLM schema ranking pipeline. The main focus is on making schema item parsing and sanitization more robust, improving the handling of LLM output formatting errors, and providing clearer error handling and logging. Additionally, the prompt template for schema ranking has been rewritten for clarity and strict output formatting.
**Schema item parsing and filtering improvements:**
- Added helper functions (`_parse_schema_item`, `_parse_column_ref`, `_get_foreign_key`) in `add_schema.py` to robustly parse and validate schema item references and foreign key relationships, improving the correctness of schema filtering.
- Refactored `filter_schema` to use the new parsing helpers, handle empty input, and ensure that only valid and related schema items (including foreign keys) are included in the filtered schema.

**LLM output extraction and sanitization:**
- Added `_preprocess_json_string` in `llm_util.py` to automatically repair common LLM JSON formatting errors before parsing, reducing extraction failures.
- Updated `extract_json` to apply the new preprocessing step, improving resilience to LLM output quirks.
- Lowered the log level in `extract_object` to debug, avoiding noisy logs when extraction fails but is handled gracefully.

**Schema ranking pipeline enhancements:**
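A sketch of the kind of sanitization such a preprocessing step typically performs. The function names match the PR, but the specific repairs shown (markdown fences, trailing commas) are assumptions about which "common LLM JSON formatting errors" are handled:

```python
import json
import re

def _preprocess_json_string(raw: str) -> str:
    """Repair common LLM JSON quirks before handing the string to json.loads.

    This sketch handles two frequent issues: a surrounding markdown code
    fence and trailing commas before a closing brace or bracket.
    """
    text = raw.strip()
    # Strip a surrounding ```json ... ``` fence if present.
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    # Remove trailing commas before a closing brace/bracket.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return text

def extract_json(raw: str):
    """Parse LLM output as JSON after light sanitization."""
    return json.loads(_preprocess_json_string(raw))
```

Doing the repairs as plain string transforms (rather than attempting a lenient parse) keeps `json.loads` as the single source of truth for what counts as valid output.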
- Updated `RankSchemaItems._process_output` to filter and repair LLM outputs, falling back to all schema items when the output is invalid or empty, with enhanced logging for these cases.
- Updated `rank_schema_llm.py` to support the new logging statements.

**Prompt template improvements:**
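The filter-and-fall-back behaviour described above can be sketched as a standalone function. This is a hypothetical simplification; the PR implements the logic inside `RankSchemaItems._process_output`, whose actual signature is not shown here:

```python
import logging

logger = logging.getLogger(__name__)

def process_ranked_output(ranked: list[str], all_items: list[str]) -> list[str]:
    """Keep only ranked items that exist in the schema; fall back to the
    full item list when the LLM output is empty or entirely invalid.
    """
    known = set(all_items)
    valid = [item for item in ranked if item in known]
    dropped = len(ranked) - len(valid)
    if dropped:
        logger.debug("Dropped %d unknown schema items from LLM ranking", dropped)
    if not valid:
        logger.warning("LLM ranking empty or invalid; falling back to all schema items")
        return list(all_items)
    return valid
```

The fallback trades ranking quality for robustness: a bad LLM response degrades to an unranked schema rather than an empty one, so downstream stages always have items to work with.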
- Rewrote the prompt template (`RANK_SCHEMA_ITEMS_V1`) for clarity, explicit output requirements, and stricter JSON formatting, reducing ambiguity for the LLM and improving downstream parsing reliability.

**Other minor changes:**
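To illustrate what "explicit output requirements and stricter JSON formatting" can look like in a ranking prompt: the template below is invented for this example and is not the actual `RANK_SCHEMA_ITEMS_V1` text, whose contents are not shown in this PR summary:

```python
# Illustrative only: the real RANK_SCHEMA_ITEMS_V1 is defined in the PR.
RANK_SCHEMA_ITEMS_EXAMPLE = """\
Given the question and the database schema items below, rank the schema
items by relevance to the question.

Question: {question}
Schema items: {schema_items}

Respond with ONLY a JSON array of strings, most relevant first, e.g.:
["orders.customer_id", "customers.id"]
Do not include markdown fences, comments, or any other text.
"""

prompt = RANK_SCHEMA_ITEMS_EXAMPLE.format(
    question="Which customers placed orders last month?",
    schema_items='["orders.customer_id", "orders.date", "customers.id"]',
)
```

Stating the exact output shape (a bare JSON array, no fences, no commentary) is what makes the sanitization and parsing steps earlier in the pipeline succeed far more often.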
- In `main.py`, removed the early return after cleaning so that pipeline configuration always runs, even when only cleaning is requested.

These changes collectively make the schema filtering and LLM ranking pipeline more robust, maintainable, and user-friendly.