
3DCF / doc2dataset

Open document layer and doc→dataset pipeline for LLMs, with NumGuard numeric integrity and multi-framework exports.

3DCF/doc2dataset ingests PDFs, Markdown, plain text and other text-like formats into a normalized index (documents.jsonl, pages.jsonl, cells.jsonl), extracts NumGuard hashes for numeric cells, and generates QA/Summary/RAG datasets plus exports for HuggingFace, LLaMA-Factory, Axolotl, OpenAI, and custom RAG stacks. The workspace bundles the Rust core, CLI, doc2dataset pipeline, HTTP service + UI, and Python/Node bindings.

Documentation

Features

  • Document layer standard – deterministic macro-cells with kind / bbox / importance stored in three JSONL files: documents.jsonl, pages.jsonl, and cells.jsonl (illustrative record after this list).
  • NumGuard numeric integrity – per-cell number hashes with A/B/C/D coverage; in our evaluation, all A-bucket corruptions are detected (recall 1.0).
  • Token-efficient contexts – macro-cell contexts are typically 3–6× smaller in tokens than naive pdfminer/Unstructured baselines on our micro-corpora, while maintaining or improving QA accuracy and numeric faithfulness (see Technical Report).
  • doc→dataset tasks – reusable qa.jsonl, summary.jsonl, and rag.jsonl samples with metrics for observability.
  • Multi-framework exports – ready-to-use datasets for HuggingFace (text/chat), LLaMA-Factory (Alpaca/ShareGPT), Axolotl (text/chat), OpenAI finetune (messages JSONL), and a generic RAG JSONL.
  • Rust-native core + bindings – CLI (three_dcf_cli) and HTTP service (three_dcf_service), plus Python (three_dcf_py) and Node (three_dcf_node) bindings for easy integration.
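
For illustration, a single cells.jsonl record could look like the sketch below. Only kind, bbox, and importance are documented above; the other field names are assumptions, so treat proto/3dcf.proto and the format guides in docs/ as authoritative.

{"id": "cell-0042", "kind": "table", "bbox": [72.0, 120.0, 540.0, 310.0], "importance": 0.9, "text": "Revenue Q1: 1,250", "numguard": ["<number-hash>"]}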

Who is this for?

  • ML / AI platform teams – need a reproducible document layer that feeds RAG and fine-tuning pipelines.
  • Fintech / regulatory / analytics teams – care about numeric correctness in reports, filings, and policies (NumGuard).
  • LLM researchers / OSS devs – want an open, inspectable doc→dataset standard with a realistic evaluation suite.

Quickstart

1. Build the CLI

git clone https://github.com/3DCF-Labs/3dcf.git
cd 3dcf

# Build and run the main CLI without installing
cargo run -p three_dcf_cli -- --help

2. Run doc2dataset on an example config

# Run the doc2dataset pipeline directly via Cargo
cargo run -p doc2dataset -- run \
  --config examples/doc2dataset/openai-finetune/doc2dataset.yaml

# Inspect the generated dataset
tree datasets/default -L 2

You should see datasets/default/{index,raw/3dcf,samples,exports} populated with QA/Summary/RAG samples and finetune exports.

3. Minimal macro-cell smoke test

# Encode a sample Markdown report
cargo run -p three_dcf_cli -- encode \
  datasets/sample/sample_report.md \
  --preset reports \
  --budget 256 \
  --out sample.3dcf \
  --text-out sample.3dcf.txt \
  --json-out sample.3dcf.json

# Serialize context and compute token stats
cargo run -p three_dcf_cli -- serialize sample.3dcf \
  --out sample.context.txt \
  --preview 96

cargo run -p three_dcf_cli -- stats sample.3dcf \
  --tokenizer cl100k_base

If you prefer installed binaries over cargo run, install them with:

cargo install --path crates/cli --force       # installs `three_dcf_cli` as `3dcf` on $PATH
cargo install --path crates/doc2dataset --force

3dcf --help
doc2dataset --help

Layout

3dcf/
  Cargo.toml                # workspace definition
  proto/3dcf.proto          # Protobuf schema
  crates/
    core/                   # encode/decode/serializer/stats/NumGuard
    cli/                    # CLI (`three_dcf_cli` → `3dcf`)
    doc2dataset/            # doc→dataset pipeline (`doc2dataset`)
    service/                # HTTP service + UI
    index/, llm/, rag/      # index/LLM/RAG helpers
    ffi-node/, ffi-py/      # Node/Python bindings
  datasets/                 # sample corpora + README
  docs/                     # CLI/config/format guides
  eval/                     # local evaluation runs (downloaded from GitHub Releases; not tracked in git)
  examples/, recipes/       # integration examples and recipes

Note: eval/ is intended to be populated from a GitHub Release archive (e.g., 3dcf-eval-v0.1.tar.gz); see "Eval data from GitHub Releases" below.

doc2dataset in practice

doc2dataset is a separate CLI that orchestrates ingest, task generation, and exports based on a YAML config.

Environment variables:

  • DOC2DATASET_PROVIDER – LLM provider (openai, anthropic, local, etc.).
  • DOC2DATASET_MODEL – model name (gpt-4.1-mini, claude-3-5-sonnet, ...).
  • DOC2DATASET_LANG – language code (e.g., en).
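
For example, to generate with OpenAI in English:

export DOC2DATASET_PROVIDER=openai
export DOC2DATASET_MODEL=gpt-4.1-mini
export DOC2DATASET_LANG=en
doc2dataset run --config doc2dataset.yaml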

Ingest options (in configs or flags) mirror the core encoder:

  • preset – encoder preset (reports, news, etc.).
  • enable_ocr / force_ocr – control OCR usage.
  • ocr_langs – Tesseract language codes (e.g., ["eng"]).

Supported formats and automatic conversions

FileFormat::from_path in crates/doc2dataset/src/model.rs normalizes file extensions to a small enum, and convert::prepare_document either passes the original file to 3DCF or converts it to temporary Markdown before ingest.

Currently supported conversions (see crates/doc2dataset/src/convert/):

  • HTML / XML – *.html, *.htm, *.xml, *.xhtml, *.rss, *.atom
    → converted to Markdown via a simple HTML-to-text pass (convert/html.rs).

  • JSON / YAML / TOML / INI – *.json, *.yaml, *.yml, *.toml, *.ini, *.cfg, *.conf
    → parsed into a normalized JSON structure and rendered as nested headings + key/value sections; simple arrays of objects are rendered as Markdown tables when keys align (convert/structured.rs). See the sketch after this list.

  • CSV / TSV / compressed variants – *.csv, *.tsv, *.csv.gz, *.tsv.gz
    → parsed with the csv crate and emitted as Markdown tables, chunked at 50 rows per table by default (convert/tabular.rs).

  • TeX / Bib / Bbl – *.tex, *.bib, *.bbl
    → flattened into headings and text; tabular blocks are rendered as Markdown tables (convert/tex.rs, convert/bib.rs).

  • Logs / RTF – *.log, *.rtf
    → read as UTF-8 and wrapped as simple text blocks with a top-level heading based on the file stem (convert/log.rs, convert/rtf.rs).

  • PDF / Markdown / plain text – *.pdf, *.md, *.markdown, *.txt
    → passed directly to 3DCF core ingest.

  • Images – *.png, *.jpg, *.jpeg, *.gif, *.tif, *.tiff, *.bmp, *.webp
    → treated as FileFormat::Image and passed to core ingest; OCR is applied if the preset and flags enable OCR (see three_dcf_core::ocr).

Unsupported or unknown extensions are ingested as-is (if possible) or skipped with a log entry.
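
As a sketch of the structured-data path: an aligned array of objects such as [{"name": "cpu", "value": 8}, {"name": "ram", "value": 32}] would come out roughly as the Markdown table below (the exact rendering in convert/structured.rs may differ):

| name | value |
| ---- | ----- |
| cpu  | 8     |
| ram  | 32    |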

Example doc2dataset.yaml

dataset_root: ./datasets/company

sources:
  - path: ./docs/policies
    pattern: "*.pdf"
  - path: ./docs/wiki_export
    pattern: "*.md,*.html,*.json,*.csv"

tasks: [qa, summary]

exports:
  hf: true
  llama_factory:
    format: sharegpt
  openai: true
  axolotl:
    mode: chat
  rag_jsonl: true

ingest:
  preset: reports
  enable_ocr: false
  ocr_langs: ["eng"]

Running:

doc2dataset run --config doc2dataset.yaml

(or cargo run -p doc2dataset -- run --config doc2dataset.yaml) will:

  • ingest all matching files,
  • build a 3DCF index under datasets/company/index/,
  • generate samples/qa.jsonl and samples/summary.jsonl,
  • emit the selected exports under datasets/company/exports/ (layout sketch below).
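
The resulting layout looks roughly like this (the export subfolder names are illustrative; compare with the datasets/default tree from the quickstart):

datasets/company/
  index/      # documents.jsonl, pages.jsonl, cells.jsonl
  raw/        # ingested source artifacts
  samples/    # qa.jsonl, summary.jsonl
  exports/    # HuggingFace / LLaMA-Factory / OpenAI / RAG outputs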

Context + ask commands (CLI)

The three_dcf_cli binary exposes helper commands for manual experiments:

# Build a compressed context from a PDF
cargo run -p three_dcf_cli -- context input.pdf \
  --preset reports \
  --budget 256 \
  --tokenizer cl100k_base

# Ask different providers with the same compressed context
cargo run -p three_dcf_cli -- ask-openai    input.pdf --preset reports --budget 256 --model gpt-4.1-mini
cargo run -p three_dcf_cli -- ask-anthropic input.pdf --preset reports --budget 256 --model claude-3-5-sonnet
cargo run -p three_dcf_cli -- ask-gemini    input.pdf --preset reports --budget 256 --model gemini-1.5-flash
cargo run -p three_dcf_cli -- ask-deepseek  input.pdf --preset reports --budget 256 --model deepseek-chat

All of these subcommands share the same encoder options and metrics (tokens, savings, NumGuard coverage), and can be used in CI with --quiet to suppress summaries.

HTTP service + Docker

The three_dcf_service crate exposes the core functionality over HTTP:

cargo run -p three_dcf_service

By default it starts an Axum server on 0.0.0.0:8000 with:

  • POST /encode – multipart upload (PDF + options) returning encoded 3DCF documents and stats (curl sketch below).
  • GET / – a simple bundled UI that lets you upload documents from a browser and inspect contexts.
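
A minimal curl sketch against a local instance (the multipart field names are assumptions; the bundled UI shows the real ones):

curl -X POST http://localhost:8000/encode \
  -F "file=@input.pdf" \
  -F "preset=reports"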

The root Dockerfile builds a static image with the same binary. You can run it with:

docker build -t three-dcf-service .
docker run -p 8000:8000 three-dcf-service

Evaluation & observability

3dcf bench emits JSONL with CER/WER, numeric guard mismatches, throughput, and RSS per document. Feed it to 3dcf report for an HTML dashboard, or into your monitoring stack (Prometheus/Grafana, etc.).
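
A typical local loop, with illustrative flags (check 3dcf bench --help and 3dcf report --help for the actual options):

3dcf bench eval/raw --out bench.jsonl
3dcf report bench.jsonl --out report.html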

Eval data from GitHub Releases

The canonical evaluation corpora and metrics are distributed via GitHub Releases:

  1. Go to the Releases page of this repository.

  2. Download the latest 3dcf-eval-*.tar.* archive (for example, 3dcf-eval-v0.1.tar.gz).

  3. From the repo root, unpack it:

    tar -xzf 3dcf-eval-v0.1.tar.gz   # or the filename you downloaded

  4. You should now have an eval/ directory with README.md, raw documents, and JSONL metrics.

Examples & recipes

  • examples/doc2dataset/openai-finetune/ – OpenAI finetune JSONL export.
  • examples/doc2dataset/llama-factory-sharegpt/ – LLaMA-Factory ShareGPT config.
  • examples/doc2dataset/axolotl/ – Axolotl chat/text export.
  • recipes/3dcf-recipes/langchain-rag/ – LangChain loader/compressor/reader.
  • recipes/3dcf-recipes/openai-qa/ – Python helper that calls three_dcf_py then OpenAI /responses.
  • recipes/3dcf-recipes/numguard-demo/ – NumGuard corruption demo.

Why 3DCF vs. plain OCR/Markdown?

  • Deterministic containers – every macro-cell carries hashes, coordinates, and NumGuard metadata, making it easy to diff, audit, and replay pipelines.
  • Token-aware pruning – headings, tables, and numeric-heavy cells are prioritized to meet strict budgets without losing critical context.
  • Prompt-friendly previews – .3dcf.txt mirrors layout with table sketches that RAG prompts can use directly.
  • Observability baked in – 3dcf bench + 3dcf report track CER/WER, numeric guards, throughput, and memory.

Testing

# Default tests
cargo test

# Full surface (PDFium + OCR, macOS example)
export PDFIUM_LIB_DIR=~/opt/pdfium/lib
export PDFIUM_INCLUDE_DIR=~/opt/pdfium/include
export RUSTFLAGS='-L native=/opt/homebrew/opt/leptonica/lib -L native=/opt/homebrew/lib'
cargo test --all --all-features

Bindings

  • Node.js (crates/ffi-node): npm install && npm run build (or cargo build -p three_dcf_node).
  • Python (crates/ffi-py): maturin develop -m crates/ffi-py/Cargo.toml (or cargo build -p three_dcf_py).

Both bindings expose encode, decode_text, stats, and related helpers using the same tokenizer names as the CLI (cl100k_base, o200k, anthropic).
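
A minimal Python sketch of the three_dcf_py binding, assuming keyword arguments that mirror the CLI flags (encode, decode_text, and stats are the documented helpers; the exact signatures live in crates/ffi-py):

# Hypothetical signatures: argument names mirror the CLI flags and may differ.
import three_dcf_py as tdcf

doc = tdcf.encode("datasets/sample/sample_report.md", preset="reports", budget=256)
print(tdcf.stats(doc, tokenizer="cl100k_base"))
print(tdcf.decode_text(doc))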