
Fix PDF text ordering in ScenarioList.from_pdf #2384

Open
Swapnil-jain wants to merge 1 commit into expectedparrot:main from Swapnil-jain:fix/issue-957-scenariolist-from-pdf-ordering

Conversation

@Swapnil-jain

Summary

Reasoning

PdfTools.extract_text_from_pdf() was calling page.get_text(), which returns text in the order it is stored in the PDF's internal structure rather than in visual reading order. As a result, content could come out misordered (e.g., question texts out of sequence with their options).

Scenario.from_pdf() uses PdfExtractor, which extracts text blocks via get_text("blocks") and sorts them by vertical position (y0), then by horizontal position (x0), to preserve reading order.

Updated PdfTools.extract_text_from_pdf() to use the same block-based extraction and sorting approach.
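
For reviewers, here is a minimal sketch of the block-sorting idea described above. It is not the code in this PR. The helper names `blocks_to_text` and `extract_page_text` are hypothetical, and skipping image blocks is an assumption. The tuple layout follows what PyMuPDF's `page.get_text("blocks")` returns: `(x0, y0, x1, y1, text, block_no, block_type)`.

```python
from typing import Iterable, Tuple

# Shape of one entry from PyMuPDF's page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type)
Block = Tuple[float, float, float, float, str, int, int]


def blocks_to_text(blocks: Iterable[Block]) -> str:
    """Join text blocks top-to-bottom (y0), then left-to-right (x0)."""
    # block_type 0 is a text block, 1 is an image block.
    # Dropping image blocks is an assumption of this sketch.
    text_blocks = [b for b in blocks if b[6] == 0]
    ordered = sorted(text_blocks, key=lambda b: (b[1], b[0]))
    return "\n".join(b[4].strip() for b in ordered if b[4].strip())


def extract_page_text(page) -> str:
    """Hypothetical wrapper: `page` is assumed to be a PyMuPDF page object."""
    return blocks_to_text(page.get_text("blocks"))


# Blocks listed out of reading order, as they might be stored in a PDF.
sample = [
    (72.0, 200.0, 300.0, 220.0, "A) Yes  B) No\n", 0, 0),
    (72.0, 100.0, 300.0, 120.0, "Q1. Do you agree?\n", 1, 0),
    (72.0, 150.0, 300.0, 180.0, "<image>", 2, 1),
]
print(blocks_to_text(sample))
```

A plain `(y0, x0)` sort can still interleave columns in multi-column layouts, so this approach is best suited to single-column documents such as the questionnaires in the linked issue.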

PdfTools.extract_text_from_pdf was using page.get_text() which returns
text in arbitrary order. Updated to use get_text("blocks") and sort
blocks by vertical then horizontal position, matching the behavior of
Scenario.from_pdf via PdfExtractor.
@Swapnil-jain
Author

A review please @johnjosephhorton @rbyh


Development

Successfully merging this pull request may close these issues.

ScenarioList.from_pdf() is dropping/misordering content that loads fine with Scenario.from_pdf()

1 participant