Fixes to LLM output parsing when using LLM-based ranking #16
This pull request introduces several improvements and bug fixes to the schema filtering and LLM schema ranking pipeline. The main focus is on making schema item parsing and sanitization more robust, improving the handling of LLM output formatting errors, and providing clearer error handling and logging. Additionally, the prompt template for schema ranking has been rewritten for clarity and strict output formatting.
**Schema item parsing and filtering improvements:**
- Added helper functions (`_parse_schema_item`, `_parse_column_ref`, `_get_foreign_key`) in `add_schema.py` to robustly parse and validate schema item references and foreign key relationships, improving the correctness of schema filtering.
- Refactored `filter_schema` to use the new parsing helpers, handle empty input, and ensure that only valid and related schema items (including foreign keys) are included in the filtered schema.

**LLM output extraction and sanitization:**
- Added `_preprocess_json_string` in `llm_util.py` to automatically repair common LLM JSON formatting errors before parsing, reducing extraction failures.
- Updated `extract_json` to apply the new preprocessing step, improving resilience to LLM output quirks.
- Lowered the log level in `extract_object` to debug, avoiding noisy logs when extraction fails but is handled gracefully.

**Schema ranking pipeline enhancements:**
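A sketch of the kind of sanitization such a preprocessing step typically performs. The function names match the PR, but the specific repairs shown (markdown fences, trailing commas) are assumptions about which "common LLM JSON formatting errors" are handled:

```python
import json
import re

def _preprocess_json_string(raw: str) -> str:
    """Repair common LLM JSON quirks before handing the string to json.loads.

    This sketch handles two frequent issues: a surrounding markdown code
    fence and trailing commas before a closing brace or bracket.
    """
    text = raw.strip()
    # Strip a surrounding ```json ... ``` fence if present.
    fence = re.match(r"^```(?:json)?\s*(.*?)\s*```$", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    # Remove trailing commas before a closing brace/bracket.
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return text

def extract_json(raw: str):
    """Parse LLM output as JSON after light sanitization."""
    return json.loads(_preprocess_json_string(raw))
```

Doing the repairs as plain string transforms (rather than attempting a lenient parse) keeps `json.loads` as the single source of truth for what counts as valid output.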
- Updated `RankSchemaItems._process_output` to filter and repair LLM outputs, falling back to all schema items when the output is invalid or empty, with enhanced logging for these cases.
- Updated `rank_schema_llm.py` to support the new logging statements.

**Prompt template improvements:**
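The filter-and-fall-back behaviour described above can be sketched as a standalone function. This is a hypothetical simplification; the PR implements the logic inside `RankSchemaItems._process_output`, whose actual signature is not shown here:

```python
import logging

logger = logging.getLogger(__name__)

def process_ranked_output(ranked: list[str], all_items: list[str]) -> list[str]:
    """Keep only ranked items that exist in the schema; fall back to the
    full item list when the LLM output is empty or entirely invalid.
    """
    known = set(all_items)
    valid = [item for item in ranked if item in known]
    dropped = len(ranked) - len(valid)
    if dropped:
        logger.debug("Dropped %d unknown schema items from LLM ranking", dropped)
    if not valid:
        logger.warning("LLM ranking empty or invalid; falling back to all schema items")
        return list(all_items)
    return valid
```

The fallback trades ranking quality for robustness: a bad LLM response degrades to an unranked schema rather than an empty one, so downstream stages always have items to work with.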
- Rewrote the prompt template (`RANK_SCHEMA_ITEMS_V1`) for clarity, explicit output requirements, and stricter JSON formatting, reducing ambiguity for the LLM and improving downstream parsing reliability.

**Other minor changes:**
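To illustrate what "explicit output requirements and stricter JSON formatting" can look like in a ranking prompt: the template below is invented for this example and is not the actual `RANK_SCHEMA_ITEMS_V1` text, whose contents are not shown in this PR summary:

```python
# Illustrative only: the real RANK_SCHEMA_ITEMS_V1 is defined in the PR.
RANK_SCHEMA_ITEMS_EXAMPLE = """\
Given the question and the database schema items below, rank the schema
items by relevance to the question.

Question: {question}
Schema items: {schema_items}

Respond with ONLY a JSON array of strings, most relevant first, e.g.:
["orders.customer_id", "customers.id"]
Do not include markdown fences, comments, or any other text.
"""

prompt = RANK_SCHEMA_ITEMS_EXAMPLE.format(
    question="Which customers placed orders last month?",
    schema_items='["orders.customer_id", "orders.date", "customers.id"]',
)
```

Stating the exact output shape (a bare JSON array, no fences, no commentary) is what makes the sanitization and parsing steps earlier in the pipeline succeed far more often.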
- In `main.py`, removed the early return after cleaning so that pipeline configuration always runs, even when only cleaning is requested.

These changes collectively make the schema filtering and LLM ranking pipeline more robust, maintainable, and user-friendly.