Different block extraction results between Windows (local) and Docker using DocstrumBoundingBoxes.Instance #1200

@Edouard-Tby

Description

I am experiencing inconsistent results when extracting text blocks from a PDF using DocstrumBoundingBoxes.Instance as the page segmenter. The issue occurs when running the same PDF processing code locally on Windows and in production on Docker.

Expected behavior
The block should be extracted as five lines, as it is when running locally on Windows:

D*****************S
5 RUE P*****L
93200 SAINT-DENIS
REPRENTE PAR D******* I**********
SIRET : 989 288 774 00015

Actual behavior
When running inside Docker, the same block is split into seven lines and some words are reordered (see the postcode and SIRET lines):

D*****************S
5 RUE P*****L
93200
SAINT-DENIS
REPRENTE PAR D******* I**********
SIRET : 288 00015
989 774

Environment

  • UglyToad.PdfPig version: 0.1.11
  • OS (local): Windows 11
  • Docker base image: mcr.microsoft.com/dotnet/aspnet:9.0
  • .NET runtime version: 9.0

Additional information

Both environments use the same code and dependencies.
The attached PDF exhibits the problem reproducibly.
I tried both DefaultWordExtractor and NearestNeighbourWordExtractor.
Since the code and dependencies are identical, the discrepancy suggests a possible difference in floating-point precision, font rendering, or locale behavior between the Windows and Linux environments.
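
For reference, a pipeline of the kind described above can be sketched as follows. This is a minimal illustration, not the exact production code: the file name `example.pdf` is a placeholder, and the output format (line position plus text, formatted with the invariant culture) is only an assumption chosen so that the Windows and Docker outputs can be diffed directly.

```csharp
using System;
using System.Globalization;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

// Placeholder path (assumption); point this at the attached PDF.
const string path = "example.pdf";

using var document = PdfDocument.Open(path);
foreach (var page in document.GetPages())
{
    // Group letters into words, then segment the words into blocks.
    var words = NearestNeighbourWordExtractor.Instance.GetWords(page.Letters);
    var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);

    foreach (var block in blocks)
    {
        Console.WriteLine("--- block ---");
        foreach (var line in block.TextLines)
        {
            // Invariant culture keeps number formatting identical on both machines.
            var box = line.BoundingBox;
            Console.WriteLine(string.Format(CultureInfo.InvariantCulture,
                "{0:F3},{1:F3} | {2}", box.Left, box.Bottom, line.Text));
        }
    }
}
```

Running the same program in both environments and comparing the printed coordinates would show whether the difference already exists at the letter and word level or only appears during block segmentation.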

EXAMPLE (2) (1).pdf
