Fix PDF text ordering in ScenarioList.from_pdf by Swapnil-jain · Pull Request #2384 · expectedparrot/edsl

Swapnil-jain · 2026-02-03T04:32:44Z

Summary

Fixes #957: ScenarioList.from_pdf() is dropping/misordering content that loads fine with Scenario.from_pdf()
ScenarioList.from_pdf() now extracts text in proper reading order, matching Scenario.from_pdf() behavior

Reasoning

PdfTools.extract_text_from_pdf() was using page.get_text(), which returns text in the order it is stored in the PDF's internal structure. That order need not match the visual reading order, so content could appear misordered (e.g., question texts out of order with their options).

Scenario.from_pdf() uses PdfExtractor which extracts text blocks via get_text("blocks") and sorts them by vertical position (y0) then horizontal position (x0) to maintain proper reading order.

Updated PdfTools.extract_text_from_pdf() to use the same block-based extraction and sorting approach.
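For reviewers, here is a minimal sketch of the block-based approach described above. It is not the code in this PR: the function names `blocks_to_text` and `extract_text_in_reading_order` are hypothetical, and the filtering of image blocks is an added assumption. It relies only on PyMuPDF's documented `get_text("blocks")` tuple layout `(x0, y0, x1, y1, text, block_no, block_type)`.

```python
from typing import List, Tuple

# Tuple layout returned by PyMuPDF's page.get_text("blocks"):
# (x0, y0, x1, y1, text, block_no, block_type)
Block = Tuple[float, float, float, float, str, int, int]


def blocks_to_text(blocks: List[Block]) -> str:
    """Join text blocks sorted top-to-bottom (y0), then left-to-right (x0)."""
    # Assumption (not stated in the PR): skip image blocks, block_type == 1.
    text_blocks = [b for b in blocks if b[6] == 0]
    ordered = sorted(text_blocks, key=lambda b: (b[1], b[0]))
    # Drop blocks that are empty after stripping whitespace.
    return "\n".join(b[4].strip() for b in ordered if b[4].strip())


def extract_text_in_reading_order(pdf_path: str) -> List[str]:
    """Hypothetical helper: return one reading-order string per page."""
    import fitz  # PyMuPDF, imported lazily so blocks_to_text works without it

    with fitz.open(pdf_path) as doc:
        return [blocks_to_text(page.get_text("blocks")) for page in doc]
```

A plain (y0, x0) sort suits single-column layouts such as questionnaires. On multi-column pages it can interleave columns, since blocks at similar heights in different columns sort next to each other.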

PdfTools.extract_text_from_pdf was using page.get_text() which returns text in arbitrary order. Updated to use get_text("blocks") and sort blocks by vertical then horizontal position, matching the behavior of Scenario.from_pdf via PdfExtractor.

Swapnil-jain · 2026-02-03T04:35:15Z

A review please @johnjosephhorton @rbyh

Swapnil-jain wants to merge 1 commit into expectedparrot:main from Swapnil-jain:fix/issue-957-scenariolist-from-pdf-ordering
