Releases: ispras/dedoc
v2.2.3
- Attached images are now shown, and attached files can be downloaded, in the HTML output representation (API usage, `return_format="html"`).
- Added hierarchy level information and annotations to `PptxReader`.
v2.2.2
- Added image extraction to `ArticleReader`.
- Added attachments and references to them in the HTML output representation (`return_format="html"`).
- Fixed the behavior of the `need_content_analysis` parameter.
- Fixed `CSVReader` (the BOM character is excluded from the output).
- Added handling of files with a wrong or missing extension to `DedocManager` (the file type is detected from the file content).
- Updated `README.md`.
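
The BOM fix noted above can be illustrated with the standard library alone. This is a minimal sketch, not `CSVReader`'s actual implementation: the helper name `read_csv_rows` is hypothetical, and it assumes the input is UTF-8 encoded.

```python
import csv
import io
from typing import List


def read_csv_rows(raw: bytes) -> List[List[str]]:
    """Decode CSV bytes and parse them, dropping a leading UTF-8 BOM if present.

    Hypothetical helper for illustration; it is not part of dedoc.
    """
    # "utf-8-sig" decodes like "utf-8" but skips a leading BOM (U+FEFF).
    text = raw.decode("utf-8-sig")
    # newline="" leaves line endings untouched so the csv module handles them.
    return list(csv.reader(io.StringIO(text, newline="")))


with_bom = b"\xef\xbb\xbfname,value\r\nalpha,1\r\n"
without_bom = b"name,value\r\nalpha,1\r\n"

# Both inputs yield the same rows; the BOM does not leak into the first cell.
rows = read_csv_rows(with_bom)
```

Decoding the same bytes with plain `"utf-8"` would leave `"\ufeff"` at the start of the first cell, which is the kind of artifact the fix removes.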
v2.2.1
- Added the `fintoc` structure type for parsing financial prospectuses according to the FinTOC 2022 Shared task (`FintocStructureExtractor`).
- Fixed small bugs in `ArticleReader`: colspan for tables, keywords, section numbering, etc.
- Added references to nodes and fixed small bugs in the HTML output representation (`return_format="html"`).
- Removed `other_fields` from `LineMetadata` and `DocumentMetadata`.
- Updated `README.md`.
v2.2
- `PdfTabbyReader` improved: bug fixes, faster partial PDF extraction (with the `pages` parameter).
- Added benchmarks for evaluating the performance of PDF readers.
- Added the `ReferenceAnnotation` class.
- Fixed a bug in the `can_read` method for all readers.
- Added the `article` structure type for parsing scientific articles using GROBID (`ArticleReader`, `ArticleStructureExtractor`).
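
Partial extraction with `pages` comes down to turning a page range into the set of pages that actually get processed. The sketch below assumes a slice-like `start:end` string with 1-based, inclusive bounds and optional ends. This changelog does not specify the format, and `parse_page_slice` is a hypothetical helper, not a dedoc function.

```python
from typing import Tuple


def parse_page_slice(pages: str, page_count: int) -> Tuple[int, int]:
    """Convert a 'start:end' string into a zero-based half-open range.

    Assumed syntax (not confirmed by the changelog): bounds are 1-based and
    inclusive; an omitted bound means "from the first" or "to the last" page.
    """
    start_text, separator, end_text = pages.partition(":")
    if not separator:
        raise ValueError(f"expected 'start:end', got {pages!r}")
    start = int(start_text) if start_text.strip() else 1
    end = int(end_text) if end_text.strip() else page_count
    first = max(start, 1) - 1        # zero-based index of the first page
    last = min(end, page_count)      # exclusive upper bound, clamped to the document
    return first, max(last, first)   # never return a reversed range


# Pages outside the returned range can be skipped entirely.
first, last = parse_page_slice("2:5", page_count=10)
selected = list(range(first, last))
```

Returning a half-open range keeps the result directly usable with `range()` and list slicing, and an out-of-bounds request collapses to an empty range instead of raising.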
v2.1.1
v2.1
- Custom loggers deleted (the common logger is used for all dedoc classes).
- Do not change the document image if it has a correct orientation (orientation correction function changed).
- Use only `PdfTabbyReader` during detection of a textual layer in PDF files.
- Code related to the labeling mode refactored and removed from the library package (it is located in a separate directory).
- Added `BoldAnnotation` for words in `PdfImageReader`.
- More benchmarks added: parsing of table images, postprocessing of Tesseract OCR.
- Some fixes made in the Dedoc web form.
- Added a tutorial on how to add a new structure type to Dedoc.
- Parsing of EML and HTML files fixed.
v2.0
- Fix table extraction from `PDF` using empty config (see issue).
- Add more benchmarks for Tesseract.
- Fix extension extraction for file names with several dots.
- Change names of some methods and their parameters for all main classes (attachments extractors, converters, readers, metadata extractors, structure extractors, structure constructors). See the Package reference section of the documentation for more details.
- Add `AttachAnnotation` and `TableAnnotation` to `PPTX` (see discussion).
- Fix bugs in `DOCX` handling (see issues 378, 379).
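
The extension fix for file names with several dots can be illustrated with the standard library. This is a minimal sketch of the general technique, not dedoc's own code, and `get_extension` is a hypothetical helper name.

```python
import os


def get_extension(file_name: str) -> str:
    """Return the lower-cased extension, including the leading dot.

    Hypothetical helper for illustration; it is not part of dedoc.
    """
    # os.path.splitext splits at the last dot of the final path component,
    # so earlier dots in the name (or in directory names) are ignored.
    _, extension = os.path.splitext(file_name)
    return extension.lower()


name = "report.v2.final.PDF"
extension = get_extension(name)

# A naive split on the first dot picks up the wrong part of the name.
naive = name.split(".")[1]
```

Here `extension` is `".pdf"` while `naive` is `"v2"`, the kind of mismatch that appears when a name contains several dots.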
v1.1.1
- Use an older `pydantic` version to improve compatibility with other libraries.
- Add support for the `RTF` format.
- Fix bug in handling file names with dots and spaces.
- Fix bug with non-integer values of text formatting in `DocxReader`.
- Add support for the `on_gpu` parameter in `config`.
- Add attached images extraction for `PdfTabbyReader`.
- Fix partial file reading for `PdfTabbyReader`.
- Add a tutorial on how to create dedoc's basic data structures.
- Fix `attachments_dir` parameter for readers and attachments extractors.
v1.1.0
- Add `BBoxAnnotation` to table cells for `PdfTabbyReader`.
- Fix swagger, add api schema classes, remove `to_dict` method from `ParsedDocument`.
- Improve parsing PDF by `PdfTxtlayerReader`, add benchmarks.
- Fix `BBoxAnnotation` extraction for tables in `PdfImageReader` using `table_type=split_last_column` parameter.
- Change base method of metadata extractors, rename it to `extract_metadata`.
- Unify `BBoxAnnotation` extraction for all PDF readers - return only words bboxes.
- Increase timeout value for all converters.
v1.0
- Remove `is_one_column_document_list` parameter.
- Add tutorial about support for a new document type to the documentation.
- Improve textual layer correctness classifier.
- Improve orientation and columns classifier.
- Change table's output structure - added `CellWithMeta` instead of a textual string.
- Add `BBoxAnnotation` to table cells for `PdfTxtlayerReader` and `PdfImageReader`.
- Add `ConfidenceAnnotation` to table cells for `PdfImageReader`.
- Remove `insert_table` parameter.
- Added information about table and page rotation to the table and document metadata respectively.
- Use dedoc-utils library for document images preprocessing.
- Change web interface, fix online-examples of document processing.
- Add comparison operator to `LineWithMeta`.