@amrit110 commented Nov 6, 2025

This pull request introduces several improvements and bug fixes to the schema filtering and LLM schema ranking pipeline. The main focus is on making schema item parsing and sanitization more robust, improving the handling of LLM output formatting errors, and providing clearer error handling and logging. Additionally, the prompt template for schema ranking has been rewritten for clarity and strict output formatting.

Schema item parsing and filtering improvements:

  • Added helper functions (_parse_schema_item, _parse_column_ref, _get_foreign_key) in add_schema.py to robustly parse and validate schema item references and foreign key relationships, improving the correctness of schema filtering.
  • Refactored filter_schema to use the new parsing helpers, handle empty input, and ensure only valid and related schema items (including foreign keys) are included in the filtered schema.
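
For reviewers, here is a minimal sketch of the parse-then-filter idea described in the two bullets above. It is not the code in add_schema.py: the schema layout (a dict of tables with "columns" and "foreign_keys"), the "table" / "table.column" reference format, and the function names without leading underscores are all assumptions made for illustration.

```python
from typing import Optional


def parse_schema_item(item: object) -> Optional[tuple[str, Optional[str]]]:
    """Parse 'table' or 'table.column' into (table, column); None if malformed."""
    if not isinstance(item, str):
        return None
    # Tolerate surrounding whitespace and stray quotes/backticks around each part.
    parts = [p.strip().strip('"`') for p in item.strip().split(".")]
    if len(parts) == 1 and parts[0]:
        return parts[0], None
    if len(parts) == 2 and all(parts):
        return parts[0], parts[1]
    return None


def filter_schema(schema: dict, items: list) -> dict[str, list[str]]:
    """Keep only valid referenced items, plus foreign key columns linking kept tables.

    Assumed (hypothetical) schema shape:
    {table: {"columns": [...], "foreign_keys": [(column, ref_table, ref_column), ...]}}
    """
    if not items:  # empty input yields an empty filtered schema
        return {}
    keep: dict[str, set[str]] = {}
    for raw in items:
        parsed = parse_schema_item(raw)
        if parsed is None:
            continue  # malformed reference
        table, column = parsed
        if table not in schema:
            continue  # unknown table
        columns = schema[table]["columns"]
        if column is None:
            keep.setdefault(table, set()).update(columns)
        elif column in columns:
            keep.setdefault(table, set()).add(column)
    # Add both ends of every foreign key whose two tables were kept.
    for table in list(keep):
        for column, ref_table, ref_column in schema[table].get("foreign_keys", []):
            if ref_table in keep:
                keep[table].add(column)
                keep[ref_table].add(ref_column)
    return {table: sorted(cols) for table, cols in keep.items()}
```

In this sketch, unknown tables and malformed references are dropped silently; the actual helpers may log or raise instead.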

LLM output extraction and sanitization:

  • Introduced _preprocess_json_string in llm_util.py to automatically fix common LLM JSON formatting errors before parsing, reducing extraction failures.
  • Updated extract_json to use the new preprocessing step, improving resilience to LLM output quirks.
  • Changed error logging in extract_object to debug level to avoid noisy logs when extraction fails but is handled gracefully.
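
To make the preprocessing step concrete, the sketch below shows one way such a repair pass could work. The specific fixes (stripping a Markdown code fence, trimming surrounding chatter, removing trailing commas) are assumptions about what "common LLM JSON formatting errors" covers; the actual _preprocess_json_string in llm_util.py may handle a different set.

```python
import json
import re
from typing import Any, Optional

# Built programmatically so this example contains no literal fence marker.
FENCE = "`" * 3
FENCE_RE = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), flags=re.DOTALL
)


def preprocess_json_string(text: str) -> str:
    """Apply a few conservative repairs to LLM output before JSON parsing."""
    text = text.strip()
    # 1. If the payload sits inside a Markdown code fence, keep only its body.
    fenced = FENCE_RE.search(text)
    if fenced:
        text = fenced.group(1).strip()
    # 2. Drop any chatter before the first bracket and after the last one.
    starts = [i for i in (text.find("{"), text.find("[")) if i != -1]
    end = max(text.rfind("}"), text.rfind("]"))
    if starts and end > min(starts):
        text = text[min(starts): end + 1]
    # 3. Remove trailing commas before a closing bracket.
    #    Caveat: this would also rewrite ",}" or ",]" inside a string value.
    return re.sub(r",\s*([}\]])", r"\1", text)


def extract_json(text: str) -> Optional[Any]:
    """Return the parsed JSON value, or None if it still cannot be parsed."""
    try:
        return json.loads(preprocess_json_string(text))
    except json.JSONDecodeError:
        return None
```

Returning None on failure, instead of raising, matches the description of extraction failures being handled gracefully by the caller, which is presumably why the log level in extract_object could be lowered to debug.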

Schema ranking pipeline enhancements:

  • Added a schema item sanitization step in RankSchemaItems._process_output that filters and repairs LLM output and falls back to all schema items when the output is invalid or empty, with enhanced logging for these cases.
  • Imported the logger in rank_schema_llm.py to support the new logging statements.
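
The sanitize-with-fallback behavior can be sketched as follows. This is an illustration only: the function name, the case-insensitive matching, and the choice of warning-level logging are assumptions, not the actual RankSchemaItems._process_output implementation.

```python
import logging

logger = logging.getLogger(__name__)


def sanitize_ranked_items(llm_items: object, valid_items: list[str]) -> list[str]:
    """Keep only items that exist in the schema, preserving the LLM's ranking order.

    Falls back to all schema items when the output is unusable or nothing survives.
    """
    if not isinstance(llm_items, list):
        logger.warning("LLM output is not a list; falling back to all schema items")
        return list(valid_items)

    lookup = {item.lower(): item for item in valid_items}
    cleaned: list[str] = []
    seen: set[str] = set()
    for item in llm_items:
        if not isinstance(item, str):
            continue  # skip non-string entries
        # Repair: strip whitespace and stray quotes, match case-insensitively.
        canonical = lookup.get(item.strip().strip('"`').lower())
        if canonical is not None and canonical not in seen:
            seen.add(canonical)
            cleaned.append(canonical)

    if not cleaned:
        logger.warning("No valid schema items in LLM output; falling back to all schema items")
        return list(valid_items)
    return cleaned
```

Falling back to the full item list trades ranking precision for recall: a bad LLM response degrades to an unfiltered schema rather than an empty one.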

Prompt template improvements:

  • Rewrote the schema ranking prompt (RANK_SCHEMA_ITEMS_V1) for clarity, explicit output requirements, and stricter JSON formatting, reducing ambiguity for the LLM and improving downstream parsing reliability.

Other minor changes:

  • In main.py, removed the early return after cleaning to ensure pipeline configuration always runs, even if only cleaning is requested.
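
The control-flow effect of this change can be illustrated with a toy example. The function and step names below are hypothetical and do not reflect the real main.py.

```python
def run(clean: bool) -> list[str]:
    """Return the steps executed, in order (toy model of the entry point)."""
    steps: list[str] = []
    if clean:
        steps.append("clean")
        # Before this change, an early `return` here would have skipped
        # the configuration step whenever cleaning was requested.
    steps.append("configure_pipeline")
    return steps
```

With the early return removed, a cleaning request runs both steps, and a normal invocation still runs configuration alone.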

Together, these changes make the schema filtering and LLM ranking pipeline more robust to malformed LLM output and easier to maintain and debug.

@amrit110 self-assigned this Nov 6, 2025
@amrit110 added the bug and enhancement labels Nov 6, 2025