This is part of the Mangostten Dataset project.
To use this repo:
- Install `uv`
- Run `bash setup.sh`
- Check that the `documents` directory exists at the project root
- Put PDF documents in `documents/inputs`. Each PDF source should have its own folder, like this:
    documents
    |____inputs/
         |___<source_1>
         |___<source_2>
         ...
         |___<source_n>
- Run `run_marker_loop.py` in a terminal
- Run the code in `notebook/nb.ipynb`
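
Before starting the conversion, it can help to confirm that the input folders are laid out as described in the steps above. The following is a minimal sketch and not part of this repo: the function name `list_pdf_sources` is hypothetical, and it assumes only the `documents/inputs/<source>` layout, with PDF files placed directly inside each source folder.

```python
from pathlib import Path


def list_pdf_sources(root="."):
    """Map each source folder under documents/inputs to its PDF files.

    Hypothetical helper: assumes the layout documents/inputs/<source>/*.pdf.
    """
    inputs_dir = Path(root) / "documents" / "inputs"
    if not inputs_dir.is_dir():
        raise FileNotFoundError(f"Expected directory not found: {inputs_dir}")

    sources = {}
    # Each sub-folder of documents/inputs is treated as one PDF source.
    for source_dir in sorted(p for p in inputs_dir.iterdir() if p.is_dir()):
        pdfs = sorted(
            f for f in source_dir.iterdir()
            if f.is_file() and f.suffix.lower() == ".pdf"
        )
        sources[source_dir.name] = pdfs
    return sources


if __name__ == "__main__":
    try:
        for name, pdfs in list_pdf_sources().items():
            print(f"{name}: {len(pdfs)} PDF file(s)")
    except FileNotFoundError as err:
        print(err)
```

Run from the project root, this prints one line per source folder with its PDF count. A source that reports zero files probably has its PDFs nested one level too deep or saved with a different extension.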
We encourage users to implement the PDF downloading pipeline and the export-to-JSON pipeline themselves. This repo only shows experimental code.
To learn more about marker, see https://github.com/datalab-to/marker