
Commit 5458a88

dolfim-ibm and ceberam authored
ci: add coverage and ruff (#1383)
* add coverage calculation and push
  Signed-off-by: Michele Dolfi <[email protected]>
* new codecov version and usage of token
  Signed-off-by: Michele Dolfi <[email protected]>
* enable ruff formatter instead of black and isort
  Signed-off-by: Michele Dolfi <[email protected]>
* apply ruff lint fixes
  Signed-off-by: Michele Dolfi <[email protected]>
* apply ruff unsafe fixes
  Signed-off-by: Michele Dolfi <[email protected]>
* add removed imports
  Signed-off-by: Michele Dolfi <[email protected]>
* runs 1 on linter issues
  Signed-off-by: Michele Dolfi <[email protected]>
* finalize linter fixes
  Signed-off-by: Michele Dolfi <[email protected]>
* Update pyproject.toml
  Co-authored-by: Cesar Berrospi Ramis <[email protected]>
  Signed-off-by: Michele Dolfi <[email protected]>

---------

Signed-off-by: Michele Dolfi <[email protected]>
Signed-off-by: Michele Dolfi <[email protected]>
Co-authored-by: Cesar Berrospi Ramis <[email protected]>
1 parent 293c28c commit 5458a88

104 files changed, +665 -633 lines changed


.github/codecov.yml

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+codecov:
+  # https://docs.codecov.io/docs/comparing-commits
+  allow_coverage_offsets: true
+coverage:
+  status:
+    project:
+      default:
+        informational: true
+        target: auto # auto compares coverage to the previous base commit
+        flags:
+          - docling
+comment:
+  layout: "reach, diff, flags, files"
+  behavior: default
+  require_changes: false # if true: only post the comment if coverage changes
+  branches: # branch names that can post comment
+    - "main"

.github/workflows/cd.yml

Lines changed: 2 additions & 0 deletions
@@ -10,6 +10,8 @@ env:
 jobs:
   code-checks:
     uses: ./.github/workflows/checks.yml
+    with:
+      push_coverage: false
   pre-release-check:
     runs-on: ubuntu-latest
     outputs:

.github/workflows/checks.yml

Lines changed: 15 additions & 1 deletion
@@ -1,5 +1,13 @@
 on:
   workflow_call:
+    inputs:
+      push_coverage:
+        type: boolean
+        description: "If true, the coverage results are pushed to codecov.io."
+        default: true
+    secrets:
+      CODECOV_TOKEN:
+        required: false

 env:
   HF_HUB_DOWNLOAD_TIMEOUT: "60"
@@ -32,7 +40,13 @@ jobs:
         run: poetry install --all-extras
       - name: Testing
         run: |
-          poetry run pytest -v tests
+          poetry run pytest -v --cov=docling --cov-report=xml tests
+      - name: Upload coverage to Codecov
+        if: inputs.push_coverage
+        uses: codecov/codecov-action@v5
+        with:
+          token: ${{ secrets.CODECOV_TOKEN }}
+          file: ./coverage.xml
       - name: Run examples
         run: |
           for file in docs/examples/*.py; do
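The `--cov-report=xml` flag added to the test step makes pytest-cov write a Cobertura-style `coverage.xml`, which is the file the upload step passes to Codecov. As a rough sketch of what that report carries (the sample XML string and the helper name `read_line_rate` are invented for illustration and are not part of this commit), the overall line coverage can be read from the root element's `line-rate` attribute:

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for a real coverage.xml report.
SAMPLE_REPORT = """<?xml version="1.0" ?>
<coverage line-rate="0.8125" branch-rate="0" version="7.0">
  <packages/>
</coverage>
"""


def read_line_rate(xml_text: str) -> float:
    """Return the overall line coverage as a fraction between 0 and 1."""
    root = ET.fromstring(xml_text)
    return float(root.attrib["line-rate"])


rate = read_line_rate(SAMPLE_REPORT)
print(f"line coverage: {rate:.1%}")
```

In a local checkout, the same helper could be pointed at the contents of the generated `./coverage.xml` instead of the sample string.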

.github/workflows/ci.yml

Lines changed: 2 additions & 0 deletions
@@ -17,3 +17,5 @@ jobs:
   code-checks:
     if: ${{ github.event_name == 'push' || (github.event.pull_request.head.repo.full_name != 'docling-project/docling' && github.event.pull_request.head.repo.full_name != 'docling-project/docling') }}
     uses: ./.github/workflows/checks.yml
+    secrets:
+      CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}

.pre-commit-config.yaml

Lines changed: 13 additions & 30 deletions
@@ -1,43 +1,26 @@
 fail_fast: true
 repos:
+  - repo: https://github.com/astral-sh/ruff-pre-commit
+    rev: v0.11.5
+    hooks:
+      # Run the Ruff formatter.
+      - id: ruff-format
+        name: "Ruff formatter"
+        args: [--config=pyproject.toml]
+        files: '^(docling|tests|docs/examples).*\.(py|ipynb)$'
+      # Run the Ruff linter.
+      - id: ruff
+        name: "Ruff linter"
+        args: [--exit-non-zero-on-fix, --fix, --config=pyproject.toml]
+        files: '^(docling|tests|docs/examples).*\.(py|ipynb)$'
   - repo: local
     hooks:
-      - id: black
-        name: Black
-        entry: poetry run black docling docs/examples tests
-        pass_filenames: false
-        language: system
-        files: '\.py$'
-      - id: isort
-        name: isort
-        entry: poetry run isort docling docs/examples tests
-        pass_filenames: false
-        language: system
-        files: '\.py$'
-      # - id: flake8
-      #   name: flake8
-      #   entry: poetry run flake8 docling
-      #   pass_filenames: false
-      #   language: system
-      #   files: '\.py$'
       - id: mypy
         name: MyPy
         entry: poetry run mypy docling
         pass_filenames: false
         language: system
         files: '\.py$'
-      - id: nbqa_black
-        name: nbQA Black
-        entry: poetry run nbqa black docs/examples
-        pass_filenames: false
-        language: system
-        files: '\.ipynb$'
-      - id: nbqa_isort
-        name: nbQA isort
-        entry: poetry run nbqa isort docs/examples
-        pass_filenames: false
-        language: system
-        files: '\.ipynb$'
       - id: poetry
         name: Poetry check
         entry: poetry check --lock
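Both Ruff hooks restrict themselves with the same `files` regular expression. A small sketch of which paths that pattern selects may help when reading the config; the candidate paths below are hypothetical and only serve to exercise the regex:

```python
import re

# Pattern copied from the `files:` entries of the two Ruff hooks.
RUFF_FILES = re.compile(r"^(docling|tests|docs/examples).*\.(py|ipynb)$")

# Hypothetical repository paths, used only to exercise the pattern.
candidates = [
    "docling/backend/csv_backend.py",
    "docs/examples/minimal.ipynb",
    "docs/index.md",
    "scripts/release.py",
]

# Keep only the paths the hooks would be run on.
matched = [path for path in candidates if RUFF_FILES.search(path)]
print(matched)
```

Because the prefix group is not followed by `/`, a path such as `docling_tools/x.py` would also match this pattern.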

docling/backend/asciidoc_backend.py

Lines changed: 7 additions & 15 deletions
@@ -34,7 +34,7 @@ def __init__(self, in_doc: InputDocument, path_or_stream: Union[BytesIO, Path]):
                 text_stream = self.path_or_stream.getvalue().decode("utf-8")
                 self.lines = text_stream.split("\n")
             if isinstance(self.path_or_stream, Path):
-                with open(self.path_or_stream, "r", encoding="utf-8") as f:
+                with open(self.path_or_stream, encoding="utf-8") as f:
                     self.lines = f.readlines()
             self.valid = True

@@ -75,14 +75,12 @@ def convert(self) -> DoclingDocument:

         return doc

-    def _parse(self, doc: DoclingDocument):
+    def _parse(self, doc: DoclingDocument):  # noqa: C901
         """
         Main function that orchestrates the parsing by yielding components:
         title, section headers, text, lists, and tables.
         """

-        content = ""
-
         in_list = False
         in_table = False

@@ -95,7 +93,7 @@ def _parse(self, doc: DoclingDocument):
         # indents: dict[int, Union[DocItem, GroupItem, None]] = {}
         indents: dict[int, Union[GroupItem, None]] = {}

-        for i in range(0, 10):
+        for i in range(10):
             parents[i] = None
             indents[i] = None

@@ -125,7 +123,6 @@ def _parse(self, doc: DoclingDocument):

             # Lists
             elif self._is_list_item(line):
-
                 _log.debug(f"line: {line}")
                 item = self._parse_list_item(line)
                 _log.debug(f"parsed list-item: {item}")
@@ -147,7 +144,6 @@ def _parse(self, doc: DoclingDocument):
                     indents[level + 1] = item["indent"]

                 elif in_list and item["indent"] < indents[level]:
-
                     # print(item["indent"], " => ", indents[level])
                     while item["indent"] < indents[level]:
                         # print(item["indent"], " => ", indents[level])
@@ -176,7 +172,6 @@ def _parse(self, doc: DoclingDocument):
             elif in_table and (
                 (not self._is_table_line(line)) or line.strip() == "|==="
             ):  # end of table
-
                 caption = None
                 if len(caption_data) > 0:
                     caption = doc.add_text(
@@ -195,7 +190,6 @@ def _parse(self, doc: DoclingDocument):

             # Picture
             elif self._is_picture(line):
-
                 caption = None
                 if len(caption_data) > 0:
                     caption = doc.add_text(
@@ -250,7 +244,6 @@ def _parse(self, doc: DoclingDocument):
                 text_data = []

             elif len(line.strip()) > 0:  # allow multiline texts
-
                 item = self._parse_text(line)
                 text_data.append(item["text"])

@@ -273,14 +266,14 @@ def _parse(self, doc: DoclingDocument):

     def _get_current_level(self, parents):
         for k, v in parents.items():
-            if v == None and k > 0:
+            if v is None and k > 0:
                 return k - 1

         return 0

     def _get_current_parent(self, parents):
         for k, v in parents.items():
-            if v == None and k > 0:
+            if v is None and k > 0:
                 return parents[k - 1]

         return None
@@ -328,15 +321,15 @@ def _parse_list_item(self, line):
                     "marker": marker,
                     "text": text.strip(),
                     "numbered": False,
-                    "indent": 0 if indent == None else len(indent),
+                    "indent": 0 if indent is None else len(indent),
                 }
             else:
                 return {
                     "type": "list_item",
                     "marker": marker,
                     "text": text.strip(),
                     "numbered": True,
-                    "indent": 0 if indent == None else len(indent),
+                    "indent": 0 if indent is None else len(indent),
                 }
         else:
             # Fallback if no match
@@ -357,7 +350,6 @@ def _parse_table_line(self, line):
         return [cell.strip() for cell in line.split("|") if cell.strip()]

     def _populate_table_as_grid(self, table_data):
-
         num_rows = len(table_data)

         # Adjust the table data into a grid format
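Several hunks in this file swap `== None` for `is None`. The two usually agree, but `==` dispatches to `__eq__`, which a class is free to override, whereas `is` checks identity and cannot be customized. A minimal sketch of the difference, using a made-up class `AlwaysEqual` that does not exist in docling:

```python
class AlwaysEqual:
    """Hypothetical class whose __eq__ claims equality with everything."""

    def __eq__(self, other: object) -> bool:
        return True


value = AlwaysEqual()

# Equality asks the object, which here answers True for any operand.
equality_says_none = value == None  # noqa: E711 - deliberate, to show the pitfall
# Identity only checks whether `value` is the None singleton itself.
identity_says_none = value is None

print(equality_says_none, identity_says_none)
```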

docling/backend/csv_backend.py

Lines changed: 1 addition & 1 deletion
@@ -58,7 +58,7 @@ def convert(self) -> DoclingDocument:
         head = self.content.readline()
         dialect = csv.Sniffer().sniff(head, ",;\t|:")
         _log.info(f'Parsing CSV with delimiter: "{dialect.delimiter}"')
-        if not dialect.delimiter in {",", ";", "\t", "|", ":"}:
+        if dialect.delimiter not in {",", ";", "\t", "|", ":"}:
             raise RuntimeError(
                 f"Cannot convert csv with unknown delimiter {dialect.delimiter}."
             )
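The change from `not x in s` to `x not in s` is purely stylistic: `in` binds tighter than `not`, so both spellings evaluate to the same result. A quick sketch that checks this for a few sample delimiters (the name `ALLOWED` and the sample values are illustrative only):

```python
# Same delimiter set as in the hunk above, bound to an illustrative name.
ALLOWED = {",", ";", "\t", "|", ":"}

results = {}
for delimiter in [";", "\t", "#"]:
    old_style = not delimiter in ALLOWED  # noqa: E713 - pre-change spelling
    new_style = delimiter not in ALLOWED
    # Both spellings parse as `not (delimiter in ALLOWED)`.
    assert old_style == new_style
    results[delimiter] = new_style

print(results)
```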

docling/backend/docling_parse_backend.py

Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,9 @@
 import logging
 import random
+from collections.abc import Iterable
 from io import BytesIO
 from pathlib import Path
-from typing import Iterable, List, Optional, Union
+from typing import List, Optional, Union

 import pypdfium2 as pdfium
 from docling_core.types.doc import BoundingBox, CoordOrigin, Size
@@ -156,7 +157,6 @@ def get_bitmap_rects(self, scale: float = 1) -> Iterable[BoundingBox]:
     def get_page_image(
         self, scale: float = 1, cropbox: Optional[BoundingBox] = None
     ) -> Image.Image:
-
         page_size = self.get_size()

         if not cropbox:
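The import change moves `Iterable` from `typing` to `collections.abc`, where it can serve both as a type annotation and as a runtime abstract base class (subscripting it requires Python 3.9 or later). A small sketch under that assumption, with a hypothetical generator `scaled` standing in for methods such as `get_bitmap_rects`:

```python
from collections.abc import Iterable


def scaled(values: Iterable[float], scale: float = 1) -> Iterable[float]:
    """Lazily yield each value multiplied by `scale` (hypothetical helper)."""
    for value in values:
        yield value * scale


result = list(scaled([1.0, 2.5], scale=2))
print(result)

# The same name also works for runtime checks: generators are iterables.
print(isinstance(scaled([]), Iterable))
```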

docling/backend/docling_parse_v2_backend.py

Lines changed: 2 additions & 2 deletions
@@ -1,8 +1,9 @@
 import logging
 import random
+from collections.abc import Iterable
 from io import BytesIO
 from pathlib import Path
-from typing import TYPE_CHECKING, Iterable, List, Optional, Union
+from typing import TYPE_CHECKING, List, Optional, Union

 import pypdfium2 as pdfium
 from docling_core.types.doc import BoundingBox, CoordOrigin
@@ -172,7 +173,6 @@ def get_bitmap_rects(self, scale: float = 1) -> Iterable[BoundingBox]:
     def get_page_image(
         self, scale: float = 1, cropbox: Optional[BoundingBox] = None
     ) -> Image.Image:
-
         page_size = self.get_size()

         if not cropbox:

docling/backend/docling_parse_v4_backend.py

Lines changed: 3 additions & 4 deletions
@@ -1,14 +1,14 @@
 import logging
-import random
+from collections.abc import Iterable
 from io import BytesIO
 from pathlib import Path
-from typing import TYPE_CHECKING, Iterable, List, Optional, Union
+from typing import TYPE_CHECKING, Optional, Union

 import pypdfium2 as pdfium
 from docling_core.types.doc import BoundingBox, CoordOrigin
 from docling_core.types.doc.page import SegmentedPdfPage, TextCell
 from docling_parse.pdf_parser import DoclingPdfParser, PdfDocument
-from PIL import Image, ImageDraw
+from PIL import Image
 from pypdfium2 import PdfPage

 from docling.backend.pdf_backend import PdfDocumentBackend, PdfPageBackend
@@ -93,7 +93,6 @@ def get_bitmap_rects(self, scale: float = 1) -> Iterable[BoundingBox]:
     def get_page_image(
         self, scale: float = 1, cropbox: Optional[BoundingBox] = None
     ) -> Image.Image:
-
         page_size = self.get_size()

         if not cropbox:

docling/backend/docx/latex/latex_dict.py

Lines changed: 0 additions & 5 deletions
@@ -1,12 +1,8 @@
-# -*- coding: utf-8 -*-
-
 """
 Adapted from https://github.com/xiilei/dwml/blob/master/dwml/latex_dict.py
 On 23/01/2025
 """

-from __future__ import unicode_literals
-
 CHARS = ("{", "}", "_", "^", "#", "&", "$", "%", "~")

 BLANK = ""
@@ -79,7 +75,6 @@
 }

 T = {
-    "\u2192": "\\rightarrow ",
     # Greek letters
     "\U0001d6fc": "\\alpha ",
     "\U0001d6fd": "\\beta ",

docling/backend/docx/latex/omml.py

Lines changed: 4 additions & 7 deletions
@@ -76,8 +76,7 @@ def get_val(key, default=None, store=CHR):
         return default


-class Tag2Method(object):
-
+class Tag2Method:
     def call_method(self, elm, stag=None):
         getmethod = self.tag2meth.get
         if stag is None:
@@ -130,7 +129,6 @@ def process_unknow(self, elm, stag):


 class Pr(Tag2Method):
-
     text = ""

     __val_tags = ("chr", "pos", "begChr", "endChr", "type")
@@ -159,7 +157,7 @@ def do_brk(self, elm):
     def do_common(self, elm):
         stag = elm.tag.replace(OMML_NS, "")
         if stag in self.__val_tags:
-            t = elm.get("{0}val".format(OMML_NS))
+            t = elm.get(f"{OMML_NS}val")
             self.__innerdict[stag] = t
         return None

@@ -248,7 +246,6 @@ def do_spre(self, elm):
         """
         the Pre-Sub-Superscript object -- Not support yet
         """
-        pass

     def do_sub(self, elm):
         text = self.process_children(elm)
@@ -331,7 +328,7 @@ def do_limlow(self, elm):
         t_dict = self.process_children_dict(elm, include=("e", "lim"))
         latex_s = LIM_FUNC.get(t_dict["e"])
         if not latex_s:
-            raise NotSupport("Not support lim %s" % t_dict["e"])
+            raise RuntimeError("Not support lim {}".format(t_dict["e"]))
         else:
             return latex_s.format(lim=t_dict.get("lim"))

@@ -413,7 +410,7 @@ def do_r(self, elm):
         """
         _str = []
         _base_str = []
-        found_text = elm.findtext("./{0}t".format(OMML_NS))
+        found_text = elm.findtext(f"./{OMML_NS}t")
         if found_text:
             for s in found_text:
                 out_latex_str = self.process_unicode(s)
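The `str.format` to f-string rewrites in this file build namespace-qualified names for ElementTree lookups. As a sketch of why the two spellings are interchangeable, the snippet below assumes that `OMML_NS` holds the Office math namespace in Clark notation (its value is not shown in this diff) and uses a hand-written, hypothetical run element:

```python
import xml.etree.ElementTree as ET

# Assumed value of OMML_NS: the Office math namespace in Clark notation.
OMML_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/math}"

# Hypothetical minimal OMML run element with a single text child.
xml_text = (
    '<m:r xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math">'
    "<m:t>x</m:t></m:r>"
)
elm = ET.fromstring(xml_text)

# Both spellings produce the same qualified path string.
old_path = "./{0}t".format(OMML_NS)
new_path = f"./{OMML_NS}t"

found = elm.findtext(new_path)
print(old_path == new_path, found)
```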
