Description
I am experiencing inconsistent results when extracting text blocks from a PDF using DocstrumBoundingBoxes.Instance as the page segmenter. The issue occurs when running the same PDF processing code locally on Windows and in production on Docker.
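For reference, a minimal sketch of the kind of extraction code described above. The file path `sample.pdf` is a placeholder, and the choice of `NearestNeighbourWordExtractor` here is an assumption; the production code may differ in details.

```csharp
using System;
using UglyToad.PdfPig;
using UglyToad.PdfPig.DocumentLayoutAnalysis.PageSegmenter;
using UglyToad.PdfPig.DocumentLayoutAnalysis.WordExtractor;

// "sample.pdf" is a placeholder path, not the attached file's real name.
using PdfDocument document = PdfDocument.Open("sample.pdf");

foreach (var page in document.GetPages())
{
    // Group letters into words, then words into blocks with Docstrum.
    var words = page.GetWords(NearestNeighbourWordExtractor.Instance);
    var blocks = DocstrumBoundingBoxes.Instance.GetBlocks(words);

    foreach (var block in blocks)
    {
        Console.WriteLine($"--- block on page {page.Number} ---");

        // Print each line of the block separately so line splits are visible.
        foreach (var line in block.TextLines)
        {
            Console.WriteLine(line.Text);
        }
    }
}
```

Running this against the same PDF on Windows and inside the Docker container should make the differing line splits directly comparable.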
Expected behavior
The block should be extracted as five lines, as it is locally on Windows:
D*****************S
5 RUE P*****L
93200 SAINT-DENIS
REPRENTE PAR D******* I**********
SIRET : 989 288 774 00015
Actual behavior
When running inside Docker, the extracted blocks are incorrectly split and partially scrambled:
D*****************S
5 RUE P*****L
93200
SAINT-DENIS
REPRENTE PAR D******* I**********
SIRET : 288 00015
989 774
Environment
- UglyToad.PdfPig version: 0.1.11
- OS (local): Windows 11
- Docker base image: mcr.microsoft.com/dotnet/aspnet:9.0
- .NET runtime version: 9.0
Additional information
Both environments use the same code and dependencies.
The attached PDF exhibits the problem reproducibly.
I tried both DefaultWordExtractor and NearestNeighbourWordExtractor.
My guess is that the cause is a difference in floating-point precision, font rendering, or locale behavior between the Windows and Linux environments.
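One way to narrow this down would be to dump the raw letter geometry in both environments and diff the output. The sketch below is only a diagnostic idea, not a confirmed explanation of the bug; `sample.pdf` is a placeholder path, and the tab-separated output format is an arbitrary choice.

```csharp
using System;
using System.Globalization;
using System.Runtime.InteropServices;
using UglyToad.PdfPig;

var inv = CultureInfo.InvariantCulture;

// Record the environment so the two dumps can be told apart.
Console.WriteLine($"OS: {RuntimeInformation.OSDescription}");
Console.WriteLine($"Culture: '{CultureInfo.CurrentCulture.Name}'");

// "sample.pdf" is a placeholder path.
using PdfDocument document = PdfDocument.Open("sample.pdf");

foreach (var page in document.GetPages())
{
    foreach (var letter in page.Letters)
    {
        var box = letter.GlyphRectangle;

        // "R" keeps full double precision; the invariant culture keeps the
        // decimal separator identical on both machines.
        Console.WriteLine(string.Join("\t",
            page.Number.ToString(inv),
            letter.Value,
            letter.FontName,
            box.Left.ToString("R", inv),
            box.Bottom.ToString("R", inv),
            box.Width.ToString("R", inv),
            box.Height.ToString("R", inv)));
    }
}
```

If the two dumps are identical, the letters are being read the same way in both environments and the divergence would have to come from a later stage (word extraction or page segmentation). If they differ, the difference is already present before `DocstrumBoundingBoxes` runs.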