Open document layer and doc→dataset pipeline for LLMs, with NumGuard numeric integrity and multi-framework exports.
3DCF/doc2dataset ingests PDFs, Markdown, plain text and other text-like formats into a normalized index (documents.jsonl, pages.jsonl, cells.jsonl), extracts NumGuard hashes for numeric cells, and generates QA/Summary/RAG datasets plus exports for HuggingFace, LLaMA-Factory, Axolotl, OpenAI, and custom RAG stacks. The workspace bundles the Rust core, CLI, doc2dataset pipeline, HTTP service + UI, and Python/Node bindings.
- Research paper (PDF)
- Technical Report / Spec
- CLI guide
- Configuration guide
- Data format reference
- Installation notes
- Evaluation data (GitHub Releases) – evaluation corpora and metrics are distributed as a GitHub Release asset (see the latest release for a `3dcf-eval-*.tar.*` archive). Download the archive, unpack it at the repo root (it recreates the `eval/` tree), and then follow `eval/README.md`.
- Document layer standard – deterministic macro-cells with `kind`/`bbox`/`importance` stored in three JSONL files: `documents.jsonl`, `pages.jsonl`, and `cells.jsonl`.
- NumGuard numeric integrity – per-cell number hashes with A/B/C/D coverage; in our evaluation, all A-bucket corruptions are detected (recall 1.0).
- Token-efficient contexts – macro-cell contexts are typically 3–6× smaller in tokens than naive pdfminer/Unstructured baselines on our micro-corpora, while maintaining or improving QA accuracy and numeric faithfulness (see Technical Report).
- doc→dataset tasks – reusable `qa.jsonl`, `summary.jsonl`, and `rag.jsonl` samples with metrics for observability.
- Multi-framework exports – ready-to-use datasets for HuggingFace (text/chat), LLaMA-Factory (Alpaca/ShareGPT), Axolotl (text/chat), OpenAI finetune (`messages` JSONL), and a generic RAG JSONL.
- Rust-native core + bindings – CLI (`three_dcf_cli`) and HTTP service (`three_dcf_service`), plus Python (`three_dcf_py`) and Node (`three_dcf_node`) bindings for easy integration.
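To get a feel for what the document layer looks like on disk, the sketch below walks `cells.jsonl` and tallies cell kinds and NumGuard coverage. Field names beyond `kind`/`bbox`/`importance` (in particular the `numguard` key used here) are assumptions; the data format reference has the authoritative schema.

```python
# Sketch: inspect a 3DCF index (field names beyond kind/bbox/importance are assumed).
import json
from collections import Counter
from pathlib import Path

index_dir = Path("datasets/default/index")
kinds = Counter()
numguard_cells = 0

with (index_dir / "cells.jsonl").open(encoding="utf-8") as f:
    for line in f:
        cell = json.loads(line)
        kinds[cell.get("kind", "unknown")] += 1
        # "numguard" is an assumed field name for the per-cell number hashes.
        if cell.get("numguard"):
            numguard_cells += 1

print("cell kinds:", dict(kinds))
print("cells with NumGuard hashes:", numguard_cells)
```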
- ML / AI platform teams – need a reproducible document layer that feeds RAG and fine-tuning pipelines.
- Fintech / regulatory / analytics teams – care about numeric correctness in reports, filings, and policies (NumGuard).
- LLM researchers / OSS devs – want an open, inspectable doc→dataset standard with a realistic evaluation suite.
```bash
git clone https://github.com/3DCF-Labs/3dcf.git
cd 3dcf

# Build and run the main CLI without installing
cargo run -p three_dcf_cli -- --help

# Run the doc2dataset pipeline directly via Cargo
cargo run -p doc2dataset -- run --config examples/doc2dataset/openai-finetune/doc2dataset.yaml

# Inspect the generated dataset
tree datasets/default -L 2
```

You should see `datasets/default/{index,raw/3dcf,samples,exports}` populated with QA/Summary/RAG samples and finetune exports.
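To peek at the generated samples without assuming their schema, a few lines of Python are enough (assuming the QA samples land at `datasets/default/samples/qa.jsonl`, as in the layout above):

```python
# Sketch: peek at the first QA samples produced by the quick-start run.
import json
from itertools import islice
from pathlib import Path

qa_path = Path("datasets/default/samples/qa.jsonl")
with qa_path.open(encoding="utf-8") as f:
    for line in islice(f, 3):
        sample = json.loads(line)
        print(sorted(sample.keys()))  # inspect the schema without assuming field names
```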
```bash
# Encode a sample Markdown report
cargo run -p three_dcf_cli -- encode datasets/sample/sample_report.md --preset reports --budget 256 --out sample.3dcf --text-out sample.3dcf.txt --json-out sample.3dcf.json

# Serialize context and compute token stats
cargo run -p three_dcf_cli -- serialize sample.3dcf --out sample.context.txt --preview 96
cargo run -p three_dcf_cli -- stats sample.3dcf --tokenizer cl100k_base
```

If you prefer installing binaries instead of `cargo run`, you can do:

```bash
cargo install --path crates/cli --force          # installs `three_dcf_cli` as `3dcf` on $PATH
cargo install --path crates/doc2dataset --force
3dcf --help
doc2dataset --help
```

```text
3dcf/
  Cargo.toml              # workspace definition
  proto/3dcf.proto        # Protobuf schema
  crates/
    core/                 # encode/decode/serializer/stats/NumGuard
    cli/                  # CLI (`three_dcf_cli` → `3dcf`)
    doc2dataset/          # doc→dataset pipeline (`doc2dataset`)
    service/              # HTTP service + UI
    index/, llm/, rag/    # index/LLM/RAG helpers
    ffi-node/, ffi-py/    # Node/Python bindings
  datasets/               # sample corpora + README
  docs/                   # CLI/config/format guides
  eval/                   # local evaluation runs (downloaded from GitHub Releases; not tracked in git)
  examples/, recipes/     # integration examples and recipes
```
Note: `eval/` is intended to be populated from a GitHub Release archive (e.g., `3dcf-eval-v0.1.tar.gz`); see the evaluation data section below.
doc2dataset is a separate CLI that orchestrates ingest, task generation, and exports based on a YAML config.
Environment variables:

- `DOC2DATASET_PROVIDER` – LLM provider (`openai`, `anthropic`, `local`, etc.).
- `DOC2DATASET_MODEL` – model name (`gpt-4.1-mini`, `claude-3.5-sonnet`, ...).
- `DOC2DATASET_LANG` – language code (e.g., `en`).

Ingest options (in configs or flags) mirror the core encoder:

- `preset` – encoder preset (`reports`, `news`, etc.).
- `enable_ocr` / `force_ocr` – control OCR usage.
- `ocr_langs` – Tesseract language codes (e.g., `["eng"]`).
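As an illustration of how these environment variables and a config-driven run fit together, here is a small sketch that sets provider/model/language and invokes `doc2dataset run` from Python; the config path is a placeholder, and in most setups you would simply export the variables in your shell.

```python
# Sketch: drive the doc2dataset CLI with provider/model/language set via env vars.
# The config path below is a placeholder; point it at your own doc2dataset.yaml.
import os
import subprocess

env = os.environ.copy()
env["DOC2DATASET_PROVIDER"] = "openai"
env["DOC2DATASET_MODEL"] = "gpt-4.1-mini"
env["DOC2DATASET_LANG"] = "en"

subprocess.run(
    ["doc2dataset", "run", "--config", "doc2dataset.yaml"],
    env=env,
    check=True,
)
```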
FileFormat::from_path in crates/doc2dataset/src/model.rs normalizes file extensions to a small enum, and convert::prepare_document either passes the original file to 3DCF or converts it to temporary Markdown before ingest.
Currently supported conversions (see crates/doc2dataset/src/convert/):
- HTML / XML – `*.html`, `*.htm`, `*.xml`, `*.xhtml`, `*.rss`, `*.atom` → converted to Markdown via a simple HTML-to-text pass (`convert/html.rs`).
- JSON / YAML / TOML / INI – `*.json`, `*.yaml`, `*.yml`, `*.toml`, `*.ini`, `*.cfg`, `*.conf` → parsed into a normalized JSON structure and rendered as nested headings + key/value sections; simple arrays of objects are rendered as Markdown tables when keys align (`convert/structured.rs`).
- CSV / TSV / compressed variants – `*.csv`, `*.tsv`, `*.csv.gz`, `*.tsv.gz` → parsed with the `csv` crate and emitted as Markdown tables, chunked at 50 rows per table by default (`convert/tabular.rs`).
- TeX / Bib / Bbl – `*.tex`, `*.bib`, `*.bbl` → flattened into headings and text; `tabular` blocks are rendered as Markdown tables (`convert/tex.rs`, `convert/bib.rs`).
- Logs / RTF – `*.log`, `*.rtf` → read as UTF-8 and wrapped as simple text blocks with a top-level heading based on the file stem (`convert/log.rs`, `convert/rtf.rs`).
- PDF / Markdown / plain text – `*.pdf`, `*.md`, `*.markdown`, `*.txt` → passed directly to 3DCF core ingest.
- Images – `*.png`, `*.jpg`, `*.jpeg`, `*.gif`, `*.tif`, `*.tiff`, `*.bmp`, `*.webp` → treated as `FileFormat::Image` and passed to core ingest; OCR is applied if the preset and flags enable OCR (see `three_dcf_core::ocr`).
Unsupported or unknown extensions are ingested as-is (if possible) or skipped with a log entry.
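As a rough illustration (not the crate's actual code), the routing described above boils down to an extension → handler mapping along these lines; the real dispatch lives in `FileFormat::from_path` and `convert::prepare_document`.

```python
# Toy illustration of the documented extension routing (abridged: compressed
# variants and a few extensions omitted). This is not the crate's code.
from pathlib import Path

CONVERTED_TO_MARKDOWN = {
    ".html": "html", ".htm": "html", ".xml": "html", ".xhtml": "html",
    ".json": "structured", ".yaml": "structured", ".yml": "structured",
    ".toml": "structured", ".ini": "structured",
    ".csv": "tabular", ".tsv": "tabular",
    ".tex": "tex", ".bib": "bib", ".bbl": "bib",
    ".log": "log", ".rtf": "rtf",
}
DIRECT_INGEST = {".pdf", ".md", ".markdown", ".txt",
                 ".png", ".jpg", ".jpeg", ".gif", ".tif", ".tiff", ".bmp", ".webp"}

def route(path: str) -> str:
    ext = Path(path).suffix.lower()
    if ext in DIRECT_INGEST:
        return "core ingest"
    if ext in CONVERTED_TO_MARKDOWN:
        return f"convert via {CONVERTED_TO_MARKDOWN[ext]} → Markdown → core ingest"
    return "ingest as-is if possible, otherwise skip with a log entry"

print(route("report.pdf"))
print(route("metrics.csv"))
```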
```yaml
dataset_root: ./datasets/company

sources:
  - path: ./docs/policies
    pattern: "*.pdf"
  - path: ./docs/wiki_export
    pattern: "*.md,*.html,*.json,*.csv"

tasks: [qa, summary]

exports:
  hf: true
  llama_factory:
    format: sharegpt
  openai: true
  axolotl:
    mode: chat
  rag_jsonl: true

ingest:
  preset: reports
  enable_ocr: false
  ocr_langs: ["eng"]
```

Running:

```bash
doc2dataset run --config doc2dataset.yaml
```

(or `cargo run -p doc2dataset -- run --config doc2dataset.yaml`) will:
- ingest all matching files,
- build a 3DCF index under `datasets/company/index/`,
- generate `samples/qa.jsonl` and `samples/summary.jsonl`,
- emit the selected exports under `datasets/company/exports/`.
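Before uploading the OpenAI finetune export, it is worth a quick structural check; the exact filename under `exports/` depends on your config, so the path below is a placeholder.

```python
# Sketch: validate that an OpenAI finetune export is well-formed `messages` JSONL.
# The path is a placeholder; adjust it to the file doc2dataset actually emitted.
import json
from pathlib import Path

export_path = Path("datasets/company/exports/openai/train.jsonl")  # placeholder path

n = 0
with export_path.open(encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        messages = record["messages"]
        assert isinstance(messages, list) and messages, "each record needs a messages list"
        assert all("role" in m and "content" in m for m in messages)
        n += 1

print(f"{export_path}: {n} finetune records look structurally valid")
```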
The three_dcf_cli binary exposes helper commands for manual experiments:
```bash
# Build a compressed context from a PDF
cargo run -p three_dcf_cli -- context input.pdf --preset reports --budget 256 --tokenizer cl100k_base

# Ask different providers with the same compressed context
cargo run -p three_dcf_cli -- ask-openai input.pdf --preset reports --budget 256 --model gpt-4.1-mini
cargo run -p three_dcf_cli -- ask-anthropic input.pdf --preset reports --budget 256 --model claude-3-5-sonnet
cargo run -p three_dcf_cli -- ask-gemini input.pdf --preset reports --budget 256 --model gemini-1.5-flash
cargo run -p three_dcf_cli -- ask-deepseek input.pdf --preset reports --budget 256 --model deepseek-chat
```

All of these subcommands share the same encoder options and metrics (tokens, savings, NumGuard coverage), and can be used in CI with `--quiet` to suppress summaries.
The three_dcf_service crate exposes the core functionality over HTTP:
```bash
cargo run -p three_dcf_service
```

By default it starts an Axum server on `0.0.0.0:8000` with:

- `POST /encode` – multipart upload (PDF + options) returning encoded 3DCF documents and stats.
- `GET /` – a simple bundled UI that lets you upload documents from a browser and inspect contexts.
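A minimal client against `POST /encode` could look like the sketch below; the multipart field name and option names are assumptions, so check the bundled UI or service docs for the exact parameters.

```python
# Sketch: upload a PDF to the running service; field and option names are assumptions.
import requests

with open("input.pdf", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/encode",
        files={"file": ("input.pdf", f, "application/pdf")},  # assumed field name
        data={"preset": "reports", "budget": "256"},           # assumed option names
        timeout=120,
    )

resp.raise_for_status()
print(resp.json())  # assumed JSON body with encoded 3DCF document(s) and stats
```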
The root Dockerfile builds a static image with the same binary. You can run it with:
```bash
docker build -t three-dcf-service .
docker run -p 8000:8000 three-dcf-service
```

`3dcf bench` emits JSONL with CER/WER, numeric guard mismatches, throughput, and RSS per document. Feed it to `3dcf report` for an HTML dashboard, or into your monitoring stack (Prometheus/Grafana, etc.).
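If you want to push bench results into your own monitoring instead of (or alongside) `3dcf report`, a small aggregation script is enough; the metric field names used below (`cer`, `wer`, `numguard_mismatches`) are hypothetical placeholders for whatever `3dcf bench` actually emits.

```python
# Sketch: aggregate `3dcf bench` JSONL output; field names are hypothetical placeholders.
import json
import sys

with open(sys.argv[1], encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

def avg(key):
    vals = [r[key] for r in records if key in r]
    return sum(vals) / len(vals) if vals else None

print("documents:", len(records))
print("mean CER:", avg("cer"))
print("mean WER:", avg("wer"))
print("NumGuard mismatches:", sum(r.get("numguard_mismatches", 0) for r in records))
```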
The canonical evaluation corpora and metrics are distributed via GitHub Releases:
- Go to the Releases page of this repository.
- Download the latest `3dcf-eval-*.tar.*` archive (for example, `3dcf-eval-v0.1.tar.gz`).
- From the repo root, unpack it: `tar -xzf 3dcf-eval-v0.1.tar.gz` (or the filename you downloaded).
- You should now have an `eval/` directory with `README.md`, raw documents, and JSONL metrics.
- `examples/doc2dataset/openai-finetune/` – OpenAI finetune JSONL export.
- `examples/doc2dataset/llama-factory-sharegpt/` – LLaMA-Factory ShareGPT config.
- `examples/doc2dataset/axolotl/` – Axolotl chat/text export.
- `recipes/3dcf-recipes/langchain-rag/` – LangChain loader/compressor/reader.
- `recipes/3dcf-recipes/openai-qa/` – Python helper that calls `three_dcf_py`, then OpenAI `/responses`.
- `recipes/3dcf-recipes/numguard-demo/` – NumGuard corruption demo.
- Deterministic containers – every macro-cell carries hashes, coordinates, and NumGuard metadata, making it easy to diff, audit, and replay pipelines.
- Token-aware pruning – headings, tables, and numeric-heavy cells are prioritized to meet strict budgets without losing critical context.
- Prompt-friendly previews – `.3dcf.txt` mirrors layout with table sketches that RAG prompts can use directly.
- Observability baked in – `3dcf bench` + `3dcf report` track CER/WER, numeric guards, throughput, and memory.
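Because the containers are deterministic, two runs over the same corpus can be diffed cell by cell. The sketch below compares two `cells.jsonl` files by cell identifier and hash; both field names (`cell_id`, `hash`) are assumptions, so adjust them to the actual schema.

```python
# Sketch: diff two deterministic index runs by cell hash.
# "cell_id" and "hash" are assumed field names; adjust to the actual schema.
import json

def load(path):
    with open(path, encoding="utf-8") as f:
        return {c.get("cell_id"): c.get("hash") for c in map(json.loads, f)}

old, new = load("run_a/cells.jsonl"), load("run_b/cells.jsonl")
changed = [cid for cid in old.keys() & new.keys() if old[cid] != new[cid]]
added = new.keys() - old.keys()
removed = old.keys() - new.keys()

print(f"changed: {len(changed)}, added: {len(added)}, removed: {len(removed)}")
```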
```bash
# Default tests
cargo test

# Full surface (PDFium + OCR, macOS example)
export PDFIUM_LIB_DIR=~/opt/pdfium/lib
export PDFIUM_INCLUDE_DIR=~/opt/pdfium/include
export RUSTFLAGS='-L native=/opt/homebrew/opt/leptonica/lib -L native=/opt/homebrew/lib'
cargo test --all --all-features
```

- Node.js (`crates/ffi-node`): `npm install && npm run build` (or `cargo build -p three_dcf_node`).
- Python (`crates/ffi-py`): `maturin develop -m crates/ffi-py/Cargo.toml` (or `cargo build -p three_dcf_py`).
Both bindings expose `encode`, `decode_text`, `stats`, and related helpers using the same tokenizer names as the CLI (`cl100k_base`, `o200k`, `anthropic`).
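A minimal Python sketch against the bindings might look like the following; the function names come from the list above, but the exact signatures and return shapes are assumptions rather than the documented API.

```python
# Sketch against the Python bindings; exact signatures and return types are assumptions.
import three_dcf_py as tdcf

# Encode a document into a 3DCF container (arguments assumed to mirror the CLI flags).
doc = tdcf.encode("datasets/sample/sample_report.md", preset="reports", budget=256)

# Serialize a prompt-ready context and compute token stats.
context = tdcf.decode_text(doc)
print(context[:500])
print(tdcf.stats(doc, tokenizer="cl100k_base"))
```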