Commit 48f4d1b

PeterStaar-IBM, cau-git, and dolfim-ibm authored
fix: Add unit tests (#51)
* add the pytests
  Signed-off-by: Peter Staar <[email protected]>
* renamed the test folder and added the toplevel test
  Signed-off-by: Peter Staar <[email protected]>
* updated the toplevel function test
  Signed-off-by: Peter Staar <[email protected]>
* need to start running all tests successfully
  Signed-off-by: Peter Staar <[email protected]>
* added the reference converted documents
  Signed-off-by: Peter Staar <[email protected]>
* added first test for json and md output
  Signed-off-by: Peter Staar <[email protected]>
* ran pre-commit
  Signed-off-by: Peter Staar <[email protected]>
* replaced deprecated json function with model_dump_json
  Signed-off-by: Peter Staar <[email protected]>
* replaced deprecated json function with model_dump_json
  Signed-off-by: Peter Staar <[email protected]>
* reformatted code
  Signed-off-by: Peter Staar <[email protected]>
* Fix backend tests
  Signed-off-by: Christoph Auer <[email protected]>
* commented out the drawing
  Signed-off-by: Peter Staar <[email protected]>
* ci: avoid duplicate runs
  Signed-off-by: Michele Dolfi <[email protected]>
* commented out json verification for now
  Signed-off-by: Peter Staar <[email protected]>
* added verification of input cells
  Signed-off-by: Peter Staar <[email protected]>
* reformat code
  Signed-off-by: Peter Staar <[email protected]>
* added test to verify the cells in the pages
  Signed-off-by: Peter Staar <[email protected]>
* added test to verify the cells in the pages (2)
  Signed-off-by: Peter Staar <[email protected]>
* added test to verify the cells in the pages (3)
  Signed-off-by: Peter Staar <[email protected]>
* run all examples in CI
  Signed-off-by: Michele Dolfi <[email protected]>
* make sure examples return failures
  Signed-off-by: Michele Dolfi <[email protected]>
* raise a failure if examples fail
  Signed-off-by: Michele Dolfi <[email protected]>
* fix examples
  Signed-off-by: Michele Dolfi <[email protected]>
* run examples after tests
  Signed-off-by: Michele Dolfi <[email protected]>
* Add tests and update top_level_tests using only datamodels
  Signed-off-by: Christoph Auer <[email protected]>
* Remove unnecessary code
  Signed-off-by: Christoph Auer <[email protected]>
* Validate conversion status on e2e test
  Signed-off-by: Christoph Auer <[email protected]>
* package verify utils and add more tests
  Signed-off-by: Michele Dolfi <[email protected]>
* reduce docs in example, since they are already in the tests
  Signed-off-by: Michele Dolfi <[email protected]>
* skip batch_convert
  Signed-off-by: Michele Dolfi <[email protected]>
* pin docling-parse 1.1.2
  Signed-off-by: Michele Dolfi <[email protected]>
* updated the error messages
  Signed-off-by: Peter Staar <[email protected]>
* commented out the json verification for now
  Signed-off-by: Peter Staar <[email protected]>
* bumped GLM version
  Signed-off-by: Peter Staar <[email protected]>
* Fix lockfile
  Signed-off-by: Christoph Auer <[email protected]>
* Pin new docling-parse v1.1.3
  Signed-off-by: Christoph Auer <[email protected]>

---------

Signed-off-by: Peter Staar <[email protected]>
Signed-off-by: Christoph Auer <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Co-authored-by: Christoph Auer <[email protected]>
Co-authored-by: Michele Dolfi <[email protected]>
Co-authored-by: Michele Dolfi <[email protected]>
1 parent 256f4d5 commit 48f4d1b

43 files changed: +4999 -353 lines changed

.github/workflows/checks.yml

Lines changed: 19 additions & 0 deletions
@@ -14,3 +14,22 @@ jobs:
           python-version: ${{ matrix.python-version }}
       - name: Run styling check
         run: poetry run pre-commit run --all-files
+      - name: Install with poetry
+        run: poetry install --all-extras
+      - name: Testing
+        run: |
+          poetry run pytest -v tests
+      - name: Run examples
+        run: |
+          for file in examples/*.py; do
+            # Skip batch_convert.py
+            if [[ "$(basename "$file")" == "batch_convert.py" ]]; then
+                echo "Skipping $file"
+                continue
+            fi
+
+            echo "Running example $file"
+            poetry run python "$file" || exit 1
+          done
+      - name: Build with poetry
+        run: poetry build
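
Note: the "Run examples" step above runs each script in `examples/`, skips `batch_convert.py`, and aborts on the first non-zero exit. For running the same check locally, a minimal Python sketch of that loop could look like the following. The function name `run_examples` is hypothetical, and the sketch calls the current interpreter through `sys.executable` instead of `poetry run python`, which is an assumption about the local setup.

```python
import subprocess
import sys
from pathlib import Path
from typing import Iterable, List


def run_examples(
    examples_dir: Path, skip: Iterable[str] = ("batch_convert.py",)
) -> List[str]:
    """Run every *.py file in examples_dir, stopping at the first failure.

    Returns the names of the scripts that were run. A script that exits
    non-zero raises subprocess.CalledProcessError, which plays the role of
    `|| exit 1` in the shell loop.
    """
    skipped = set(skip)
    ran: List[str] = []
    for script in sorted(examples_dir.glob("*.py")):
        if script.name in skipped:
            print(f"Skipping {script}")
            continue
        print(f"Running example {script}")
        # check=True turns a non-zero exit status into an exception.
        subprocess.run([sys.executable, str(script)], check=True)
        ran.append(script.name)
    return ran
```

Calling `run_examples(Path("examples"))` from the repository root would then approximate the CI step, provided the examples' dependencies are installed in the active environment.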

.github/workflows/ci.yml

Lines changed: 2 additions & 2 deletions
@@ -2,7 +2,7 @@ name: "Run CI"

 on:
   pull_request:
-    types: [opened, reopened, synchronize, ready_for_review]
+    types: [opened, reopened]
   push:
     branches:
       - "**"
@@ -25,4 +25,4 @@ jobs:
   # - uses: ./.github/actions/setup-poetry
   # - name: Build docs
   # run: poetry run mkdocs build --verbose --clean
-
+

.pre-commit-config.yaml

Lines changed: 2 additions & 2 deletions
@@ -4,15 +4,15 @@ repos:
     hooks:
       - id: system
         name: Black
-        entry: poetry run black docling examples
+        entry: poetry run black docling examples tests
         pass_filenames: false
         language: system
         files: '\.py$'
   - repo: local
     hooks:
       - id: system
         name: isort
-        entry: poetry run isort docling examples
+        entry: poetry run isort docling examples tests
         pass_filenames: false
         language: system
         files: '\.py$'

docling/datamodel/base_models.py

Lines changed: 3 additions & 3 deletions
@@ -238,9 +238,9 @@ class EquationPrediction(BaseModel):

 class PagePredictions(BaseModel):
     layout: LayoutPrediction = None
-    tablestructure: TableStructurePrediction = None
-    figures_classification: FigureClassificationPrediction = None
-    equations_prediction: EquationPrediction = None
+    tablestructure: Optional[TableStructurePrediction] = None
+    figures_classification: Optional[FigureClassificationPrediction] = None
+    equations_prediction: Optional[EquationPrediction] = None


 PageElement = Union[TextElement, TableElement, FigureElement]
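
Note: the change above makes three annotations admit `None` explicitly, while `layout` is left as it was. Annotation-driven validators such as Pydantic generally inspect the annotation, not the default, to decide whether `None` is an acceptable value, so an explicit `None` (for instance from a reloaded JSON dump) would presumably be rejected under the old declarations. The standard-library sketch below only illustrates the difference between the two annotations. The classes `Before` and `After`, the empty stand-in `TableStructurePrediction`, and the helper `admits_none` are all hypothetical and not part of docling.

```python
from typing import Optional, get_args


class TableStructurePrediction:
    """Hypothetical empty stand-in for the real docling model."""


class Before:
    # Annotation says "always a TableStructurePrediction", yet defaults to None.
    tablestructure: TableStructurePrediction = None


class After:
    # Annotation and default now agree: the field may be None.
    tablestructure: Optional[TableStructurePrediction] = None


def admits_none(cls: type, field: str) -> bool:
    """Return True if the field's annotation lists NoneType as an option."""
    annotation = cls.__annotations__[field]
    # get_args() is empty for a plain class and (X, NoneType) for Optional[X].
    return type(None) in get_args(annotation)


print(admits_none(Before, "tablestructure"))
print(admits_none(After, "tablestructure"))
```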

docling/models/ds_glm_model.py

Lines changed: 5 additions & 1 deletion
@@ -16,8 +16,12 @@
 class GlmModel:
     def __init__(self, config):
         self.config = config
+        self.model_names = self.config.get(
+            "model_names", ""
+        )  # "language;term;reference"
         load_pretrained_nlp_models()
-        model = init_nlp_model(model_names="language;term;reference")
+        # model = init_nlp_model(model_names="language;term;reference")
+        model = init_nlp_model(model_names=self.model_names)
         self.model = model

     def __call__(self, conv_res: ConversionResult) -> DsDocument:
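
Note: with this change the model names come from `config`, and the fallback is an empty string instead of the previously hard-coded `"language;term;reference"`. A caller that wants the old behaviour therefore has to pass the key explicitly. The sketch below isolates just that lookup. `resolve_model_names` and `PREVIOUS_MODEL_NAMES` are hypothetical names, and what `init_nlp_model` does with an empty string is not shown in this diff.

```python
from typing import Any, Mapping

# The value that was hard-coded before this commit (taken from the diff above).
PREVIOUS_MODEL_NAMES = "language;term;reference"


def resolve_model_names(config: Mapping[str, Any]) -> str:
    """Mirror the lookup in GlmModel.__init__: "" when the key is absent."""
    return config.get("model_names", "")


# No key: the new default is the empty string, not the old hard-coded value.
print(repr(resolve_model_names({})))

# Explicit key: reproduces the pre-commit behaviour.
print(repr(resolve_model_names({"model_names": PREVIOUS_MODEL_NAMES})))
```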

docling/models/table_structure_model.py

Lines changed: 10 additions & 1 deletion
@@ -44,7 +44,16 @@ def draw_table_and_cells(self, page: Page, tbl_list: List[TableElement]):

             for tc in table_element.table_cells:
                 x0, y0, x1, y1 = tc.bbox.as_tuple()
-                draw.rectangle([(x0, y0), (x1, y1)], outline="blue")
+                if tc.column_header:
+                    width = 3
+                else:
+                    width = 1
+                draw.rectangle([(x0, y0), (x1, y1)], outline="blue", width=width)
+                draw.text(
+                    (x0 + 3, y0 + 3),
+                    text=f"{tc.start_row_offset_idx}, {tc.start_col_offset_idx}",
+                    fill="black",
+                )

         image.show()

examples/batch_convert.py

Lines changed: 14 additions & 6 deletions
@@ -49,17 +49,18 @@ def export_documents(
         f"of which {failure_count} failed "
         f"and {partial_success_count} were partially converted."
     )
+    return success_count, partial_success_count, failure_count


 def main():
     logging.basicConfig(level=logging.INFO)

     input_doc_paths = [
-        Path("./test/data/2206.01062.pdf"),
-        Path("./test/data/2203.01017v2.pdf"),
-        Path("./test/data/2305.03393v1.pdf"),
-        Path("./test/data/redp5110.pdf"),
-        Path("./test/data/redp5695.pdf"),
+        Path("./tests/data/2206.01062.pdf"),
+        Path("./tests/data/2203.01017v2.pdf"),
+        Path("./tests/data/2305.03393v1.pdf"),
+        Path("./tests/data/redp5110.pdf"),
+        Path("./tests/data/redp5695.pdf"),
     ]

     # buf = BytesIO(Path("./test/data/2206.01062.pdf").open("rb").read())
@@ -73,12 +74,19 @@ def main():
     start_time = time.time()

     conv_results = doc_converter.convert(input)
-    export_documents(conv_results, output_dir=Path("./scratch"))
+    success_count, partial_success_count, failure_count = export_documents(
+        conv_results, output_dir=Path("./scratch")
+    )

     end_time = time.time() - start_time

     _log.info(f"All documents were converted in {end_time:.2f} seconds.")

+    if failure_count > 0:
+        raise RuntimeError(
+            f"The example failed converting {failure_count} on {len(input_doc_paths)}."
+        )
+

 if __name__ == "__main__":
     main()
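
Note: the pattern introduced here (return the counters from `export_documents`, then raise if any conversion failed) is what lets the CI loop detect a broken example through its exit status. A self-contained sketch of the same idea follows. The `Status` enum is a hypothetical stand-in for docling's `ConversionStatus`, and `count_statuses` and `raise_on_failures` are illustrative names, not functions from the repository.

```python
from enum import Enum
from typing import Iterable, Tuple


class Status(Enum):
    """Hypothetical stand-in for docling's ConversionStatus."""

    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"
    FAILURE = "failure"


def count_statuses(statuses: Iterable[Status]) -> Tuple[int, int, int]:
    """Return (success_count, partial_success_count, failure_count)."""
    success = partial = failure = 0
    for status in statuses:
        if status is Status.SUCCESS:
            success += 1
        elif status is Status.PARTIAL_SUCCESS:
            partial += 1
        else:
            failure += 1
    return success, partial, failure


def raise_on_failures(failure_count: int, total: int) -> None:
    """Raise so that the script exits non-zero when any document failed."""
    if failure_count > 0:
        raise RuntimeError(
            f"The example failed converting {failure_count} on {total}."
        )
```

An uncaught `RuntimeError` makes the Python process exit with a non-zero status, which is what a `|| exit 1` guard in a shell loop reacts to.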

examples/custom_convert.py

Lines changed: 11 additions & 4 deletions
@@ -42,14 +42,14 @@ def export_documents(
         f"Processed {success_count + failure_count} docs, of which {failure_count} failed"
     )

+    return success_count, failure_count
+

 def main():
     logging.basicConfig(level=logging.INFO)

     input_doc_paths = [
-        Path("./test/data/2206.01062.pdf"),
-        Path("./test/data/2203.01017v2.pdf"),
-        Path("./test/data/2305.03393v1.pdf"),
+        Path("./tests/data/2206.01062.pdf"),
     ]

     ###########################################################################
@@ -114,12 +114,19 @@ def main():
     start_time = time.time()

     conv_results = doc_converter.convert(input)
-    export_documents(conv_results, output_dir=Path("./scratch"))
+    success_count, failure_count = export_documents(
+        conv_results, output_dir=Path("./scratch")
+    )

     end_time = time.time() - start_time

     _log.info(f"All documents were converted in {end_time:.2f} seconds.")

+    if failure_count > 0:
+        raise RuntimeError(
+            f"The example failed converting {failure_count} on {len(input_doc_paths)}."
+        )
+

 if __name__ == "__main__":
     main()

examples/export_figures.py

Lines changed: 11 additions & 1 deletion
@@ -22,7 +22,7 @@ def main():
     logging.basicConfig(level=logging.INFO)

     input_doc_paths = [
-        Path("./test/data/2206.01062.pdf"),
+        Path("./tests/data/2206.01062.pdf"),
     ]
     output_dir = Path("./scratch")

@@ -41,10 +41,13 @@ def main():

     conv_results = doc_converter.convert(input_files)

+    success_count = 0
+    failure_count = 0
     output_dir.mkdir(parents=True, exist_ok=True)
     for conv_res in conv_results:
         if conv_res.status != ConversionStatus.SUCCESS:
             _log.info(f"Document {conv_res.input.file} failed to convert.")
+            failure_count += 1
             continue

         doc_filename = conv_res.input.file.stem
@@ -66,10 +69,17 @@ def main():
             with element_image_filename.open("wb") as fp:
                 image.save(fp, "PNG")

+        success_count += 1
+
     end_time = time.time() - start_time

     _log.info(f"All documents were converted in {end_time:.2f} seconds.")

+    if failure_count > 0:
+        raise RuntimeError(
+            f"The example failed converting {failure_count} on {len(input_doc_paths)}."
+        )
+

 if __name__ == "__main__":
     main()
