VLM Integration for Enhanced PDF Understanding
Background
The current PDF pipeline extracts text only. Figures, charts, tables, and diagrams carry semantic information that text extraction cannot capture. Vision Language Models (VLMs) can interpret these visual elements and make their content available for search and retrieval.
How VLM Summaries Fit Into the Pipeline
The core idea: use a VLM to generate text descriptions of visual elements during ingest. These descriptions are embedded and stored alongside regular text chunks, making visual content searchable through the existing vector + FTS pipeline.
The consuming LLM (Claude, etc.) receives richer context without any change to the search interface.
Data Storage Options
Where to store VLM-generated descriptions determines how they interact with search. Three options are on the table:
Option A: Append to chunk text
Concatenate VLM description into the text chunk that corresponds to the same page.
text: "Design patterns improve... [Figure: bar chart comparing Singleton vs Factory, 30% throughput gain]"
- Searchable via both vector and FTS
- May dilute embedding focus — a chunk about design patterns now also contains chart description, which could weaken precision for both topics
Option B: Store as metadata
Attach VLM description to the chunk's metadata field.
metadata: { fileName, fileSize, fileType, visualSummary: "Bar chart comparing..." }
- NOT searchable (metadata is not embedded or FTS-indexed in current schema)
- Returned with search results, so the consuming LLM sees it as additional context
- Zero impact on existing search behavior
- Useful when the visual content supplements nearby text rather than standing on its own
Option C: Store as separate chunks
Create independent chunks for VLM descriptions with their own embeddings.
{ text: "Figure on page 12: bar chart comparing...", chunkIndex: N, chunkType: "visual", ... }
- Searchable independently via vector and FTS
- No embedding dilution
- Granularity mismatch: text chunks are 2-5 sentences (semantic chunker output), VLM chunks describe an entire page's visuals. Different granularities in the same table may affect grouping filters and score distribution
- Needs a way to link visual chunks back to their source page/context
Comparison
| Criteria | A: In text | B: In metadata | C: Separate chunk |
| --- | --- | --- | --- |
| Visual content searchable (vector) | Yes | No | Yes |
| Visual content searchable (FTS) | Yes | No | Yes |
| Embedding quality preserved | Risk of dilution | Yes | Yes |
| Granularity consistency | Same chunk | Same chunk | Mixed granularity |
| Implementation complexity | Low | Low | Medium |
| Schema change | None | Metadata extension | Chunk type field |
Each option has trade-offs. The right choice depends on how VLM summaries actually perform in retrieval — prototyping with real PDFs is needed before committing to one.
Candidate Technologies
VLM Models (local inference, no API calls)
| Model | Params | Runtime | Notes |
| --- | --- | --- | --- |
| Qwen2-VL-2B | 2B | transformers.js (ONNX Q4), Ollama | ONNX version available, same runtime as existing embedder |
| GLM-OCR | 0.9B | Ollama, vLLM, SGLang | Top score on OmniDocBench (94.6), 1.86 pages/sec, tables and formulas to Markdown |
| ColSmol | 500M | Python (colpali-engine) | Visual retrieval embeddings, different paradigm (page-level vector search) |
Qwen2-VL-2B fits the current stack: it runs on @huggingface/transformers with ONNX, same as the embedding model. Model auto-downloads on first use (~1.5GB). No Python, no external service.
GLM-OCR scores higher on benchmarks but requires Ollama. Could serve as an optional backend for users who already have it.
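To ground the zero-setup claim, here is a minimal sketch of the Qwen2-VL-2B path via @huggingface/transformers, modeled on the library's Qwen2-VL usage; the onnx-community model id, the q4 dtype, and the retrieval-oriented prompt are assumptions, and exact method names may vary by library version.

```typescript
import {
  AutoProcessor,
  Qwen2VLForConditionalGeneration,
  RawImage,
} from "@huggingface/transformers";

// Model id is an assumption; any ONNX export of Qwen2-VL-2B-Instruct should work.
const MODEL_ID = "onnx-community/Qwen2-VL-2B-Instruct";

const processor = await AutoProcessor.from_pretrained(MODEL_ID);
const model = await Qwen2VLForConditionalGeneration.from_pretrained(MODEL_ID, {
  dtype: "q4", // quantized weights; ~1.5GB downloaded on first use
});

// Turn a rendered page image (PNG bytes) into a retrieval-oriented description.
export async function describePage(png: Uint8Array): Promise<string> {
  const image = await RawImage.fromBlob(new Blob([png], { type: "image/png" }));

  // Prompt tuned for search retrieval rather than generic captioning.
  const conversation = [
    {
      role: "user",
      content: [
        { type: "image" },
        {
          type: "text",
          text: "Describe the figures, charts and tables on this page for search. Include axis labels, key numbers and trends.",
        },
      ],
    },
  ];
  const prompt = processor.apply_chat_template(conversation, {
    add_generation_prompt: true,
  });

  const inputs = await processor(prompt, image);
  const output = await model.generate({ ...inputs, max_new_tokens: 256 });

  // Decode only the newly generated tokens (skip the prompt portion).
  const generated = output.slice(null, [inputs.input_ids.dims.at(-1), null]);
  return processor.batch_decode(generated, { skip_special_tokens: true })[0];
}
```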
Page Image Rendering
mupdf (npm, WebAssembly) provides page.toPixmap() for rendering PDF pages to images. If we migrate text extraction to mupdf, the same package handles image rendering with no additional dependency.
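A sketch of that rendering step, assuming the documented mupdf.js Document/Page/Pixmap API; the 2x scale factor is an arbitrary choice to keep chart labels legible for the VLM.

```typescript
import * as fs from "node:fs";
import * as mupdf from "mupdf";

// Render one PDF page to PNG bytes for the VLM.
export function renderPageToPNG(pdfPath: string, pageIndex: number): Uint8Array {
  const doc = mupdf.Document.openDocument(fs.readFileSync(pdfPath), "application/pdf");
  const page = doc.loadPage(pageIndex);

  // 2x scale for legibility; RGB, no alpha.
  const pixmap = page.toPixmap(
    mupdf.Matrix.scale(2, 2),
    mupdf.ColorSpace.DeviceRGB,
    false,
  );
  return pixmap.asPNG();
}
```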
Design Considerations
1. Selective processing
Not every page has visual content worth summarizing. Candidate heuristics:
- Pages where mupdf reports image blocks in structured text output
- Pages with low text-to-area ratio
- Pages with detected table structures
Skipping text-only pages keeps VLM inference cost manageable.
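A sketch of the first two heuristics, assuming the structured-text JSON from mupdf exposes per-block type and bounding-box fields; the exact JSON shape and the 0.2 coverage threshold are assumptions to verify against real PDFs.

```typescript
import * as mupdf from "mupdf";

// Heuristic: a page is worth sending to the VLM if it contains image blocks,
// or if text covers only a small fraction of the page area.
export function isVisualCandidate(page: mupdf.Page): boolean {
  // Block shape ({ type, bbox: { x, y, w, h } }) is an assumption about the
  // structured-text JSON; verify against the mupdf.js version in use.
  const blocks: any[] =
    JSON.parse(page.toStructuredText("preserve-images").asJSON()).blocks ?? [];

  if (blocks.some((b) => b.type === "image")) return true;

  const [x0, y0, x1, y1] = page.getBounds();
  const pageArea = (x1 - x0) * (y1 - y0);
  const textArea = blocks
    .filter((b) => b.type === "text")
    .reduce((sum, b) => sum + (b.bbox?.w ?? 0) * (b.bbox?.h ?? 0), 0);

  return textArea / pageArea < 0.2; // low text coverage; threshold to tune
}
```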
2. Inference cost
VLM inference is 10-100x slower than text extraction per page. Prompt design matters — a prompt tuned for "describe this figure for search retrieval" produces different (and likely more useful) output than a generic "describe this image" prompt.
3. Page tracking
Current chunks do not track which PDF page they came from. VLM summaries are inherently page-scoped. To connect visual context with text context, chunks need page number metadata.
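One possible shape for the extended chunk row; field names beyond text and metadata are illustrative, not the current schema.

```typescript
// Illustrative row shape; chunkType and pageNumber are the proposed additions.
interface ChunkRow {
  text: string;                 // chunk text or VLM description (embedded + FTS-indexed)
  vector: number[];             // embedding of `text`
  chunkIndex: number;
  chunkType: "text" | "visual"; // new: distinguishes VLM description chunks (Option C)
  pageNumber: number;           // new: source page, shared by text and visual chunks
  metadata: {
    fileName: string;
    fileSize: number;
    fileType: string;
    visualSummary?: string;     // only if Option B is chosen instead
  };
}
```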
4. Embedding compatibility
VLM summaries are natural language text, so they embed in the same vector space as regular chunks. No multimodal embedding model is required. If retrieval quality for visual queries turns out to be weak, prompt tuning for the summary is the first lever to pull.
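A sketch of that step, assuming the embedder is exposed as a transformers.js feature-extraction pipeline; the model id below is a placeholder for whatever the ingest pipeline already loads.

```typescript
import { pipeline } from "@huggingface/transformers";

// Placeholder model id; reuse the embedding model the pipeline already uses.
const embedder = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

// A VLM description is embedded exactly like a regular text chunk.
const description =
  "Figure on page 12: bar chart comparing Singleton vs Factory, 30% throughput gain";
const output = await embedder(description, { pooling: "mean", normalize: true });
const vector = Array.from(output.data as Float32Array); // stored alongside the text
```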
5. Ingest modes
Two modes: text (default, zero-setup) and text+visual (requires ~1.5GB model download). Switching modes on re-ingest should not corrupt existing data.
6. Storage
A 100-page PDF with ~30 pages containing figures adds ~30 extra chunks. This is modest compared to ColPali-style approaches (627 vectors per page).
Architecture Sketch
ingest_file(filePath, mode = "text" | "text+visual")
text mode (default):
PDF -> mupdf -> text chunks -> embedder -> LanceDB
text+visual mode:
PDF -> mupdf -> text chunks -> embedder -> LanceDB
PDF -> mupdf (toPixmap) -> select visual pages -> VLM -> description chunks -> embedder -> LanceDB
Search logic stays the same: vector search + FTS against one table.
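Putting the pieces together, a hypothetical sketch of the text+visual branch under Option C; describePage and isVisualCandidate refer to the sketches above, while embed() and the LanceDB table stand in for the existing pipeline.

```typescript
import * as fs from "node:fs";
import * as mupdf from "mupdf";
// describePage and isVisualCandidate come from the earlier sketches;
// embed() and `table` (a LanceDB table) are placeholders for the existing pipeline.

async function ingestVisualChunks(pdfPath: string, fileName: string, table: any) {
  const doc = mupdf.Document.openDocument(fs.readFileSync(pdfPath), "application/pdf");
  const rows = [];

  for (let i = 0; i < doc.countPages(); i++) {
    const page = doc.loadPage(i);
    if (!isVisualCandidate(page)) continue; // selective processing

    const png = page
      .toPixmap(mupdf.Matrix.scale(2, 2), mupdf.ColorSpace.DeviceRGB, false)
      .asPNG();
    const description = `Figure(s) on page ${i + 1}: ${await describePage(png)}`;

    rows.push({
      text: description,                // searchable via vector + FTS (Option C)
      vector: await embed(description), // same embedder as text chunks
      chunkType: "visual",
      pageNumber: i + 1,
      metadata: { fileName },
    });
  }

  if (rows.length > 0) await table.add(rows); // same LanceDB table as text chunks
}
```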
Open Questions
- Which storage option (A/B/C) works best in practice? Needs prototype comparison with real PDFs.
- How does chunk granularity mismatch (Option C) affect the grouping filter?
- What VLM prompt produces the most search-effective descriptions?
- Should users be able to plug in an Ollama endpoint for a stronger VLM?
- How to handle re-ingest when switching between text and text+visual modes?