VLM integration for enhanced PDF understanding #66

@shinpr


Background

The current PDF pipeline extracts text only. Figures, charts, tables, and diagrams carry semantic information that text extraction cannot capture. Vision Language Models (VLMs) can interpret these visual elements and make their content available for search and retrieval.

How VLM Summaries Fit Into the Pipeline

The core idea: use a VLM to generate text descriptions of visual elements during ingest. These descriptions are embedded and stored alongside regular text chunks, making visual content searchable through the existing vector + FTS pipeline.

The consuming LLM (Claude, etc.) receives richer context without any change to the search interface.

Data Storage Options

Where to store VLM-generated descriptions determines how they interact with search. Three options are on the table:

Option A: Append to chunk text

Concatenate the VLM description into the text chunk that corresponds to the same page.

text: "Design patterns improve... [Figure: bar chart comparing Singleton vs Factory, 30% throughput gain]"
  • Searchable via both vector and FTS
  • May dilute embedding focus — a chunk about design patterns now also contains chart description, which could weaken precision for both topics

Option B: Store as metadata

Attach VLM description to the chunk's metadata field.

metadata: { fileName, fileSize, fileType, visualSummary: "Bar chart comparing..." }
  • NOT searchable (metadata is not embedded or FTS-indexed in current schema)
  • Returned with search results, so the consuming LLM sees it as additional context
  • Zero impact on existing search behavior
  • Useful when the visual content supplements nearby text rather than standing on its own

Option C: Store as separate chunks

Create independent chunks for VLM descriptions with their own embeddings.

{ text: "Figure on page 12: bar chart comparing...", chunkIndex: N, chunkType: "visual", ... }
  • Searchable independently via vector and FTS
  • No embedding dilution
  • Granularity mismatch: text chunks are 2-5 sentences (semantic chunker output), while a single VLM chunk describes an entire page's visuals. Mixing granularities in one table may affect grouping filters and score distribution
  • Needs a way to link visual chunks back to their source page/context
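A sketch of Option C's chunk shape. The field names (`chunkType`, `sourcePage`) are illustrative placeholders, not the actual schema:

```typescript
// Option C sketch: visual descriptions become independent, searchable chunks.
// Field names below are hypothetical, for illustration only.
interface VisualChunk {
  text: string;
  chunkIndex: number;
  chunkType: "visual";
  sourcePage: number; // links the description back to the page it came from
}

function makeVisualChunk(summary: string, page: number, nextIndex: number): VisualChunk {
  return {
    text: `Figure on page ${page}: ${summary}`,
    chunkIndex: nextIndex,
    chunkType: "visual",
    sourcePage: page,
  };
}
```

Once text chunks also carry page numbers (see Design Consideration 3 below), a field like `sourcePage` lets the retriever pull the same page's text chunks to restore context around a matched visual chunk.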

Comparison

| Criteria | A: In text | B: In metadata | C: Separate chunk |
|---|---|---|---|
| Visual content searchable (vector) | Yes | No | Yes |
| Visual content searchable (FTS) | Yes | No | Yes |
| Embedding quality preserved | Risk of dilution | Yes | Yes |
| Granularity consistency | Same chunk | Same chunk | Mixed granularity |
| Implementation complexity | Low | Low | Medium |
| Schema change | None | Metadata extension | Chunk type field |

Each option has trade-offs. The right choice depends on how VLM summaries actually perform in retrieval — prototyping with real PDFs is needed before committing to one.

Candidate Technologies

VLM Models (local inference, no API calls)

| Model | Params | Runtime | Notes |
|---|---|---|---|
| Qwen2-VL-2B | 2B | transformers.js (ONNX Q4), Ollama | ONNX version available; same runtime as the existing embedder |
| GLM-OCR | 0.9B | Ollama, vLLM, SGLang | Top score on OmniDocBench (94.6); 1.86 pages/sec; tables and formulas to Markdown |
| ColSmol | 500M | Python (colpali-engine) | Visual retrieval embeddings; different paradigm (page-level vector search) |

Qwen2-VL-2B fits the current stack: it runs on @huggingface/transformers with ONNX, same as the embedding model. Model auto-downloads on first use (~1.5GB). No Python, no external service.

GLM-OCR scores higher on benchmarks but requires Ollama. Could serve as an optional backend for users who already have it.
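One way to keep the VLM swappable is a small backend interface. This is a sketch under assumed names (`VlmBackend`, `describePage`), not the project's actual API; the request fields follow Ollama's documented REST endpoint:

```typescript
// Hypothetical backend abstraction: default local inference, optional Ollama endpoint.
interface VlmBackend {
  name: string;
  describePage(pngImage: Uint8Array, prompt: string): Promise<string>;
}

// Default backend would wrap Qwen2-VL-2B via @huggingface/transformers (omitted here).
// Optional backend for users who already run Ollama:
function makeOllamaBackend(endpoint: string, model: string): VlmBackend {
  return {
    name: `ollama:${model}`,
    async describePage(pngImage, prompt) {
      // Ollama's /api/generate accepts base64-encoded images alongside the prompt.
      const res = await fetch(`${endpoint}/api/generate`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          model,
          prompt,
          images: [Buffer.from(pngImage).toString("base64")],
          stream: false,
        }),
      });
      const json = (await res.json()) as { response: string };
      return json.response;
    },
  };
}
```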

Page Image Rendering

mupdf (npm, WebAssembly) provides page.toPixmap() for rendering PDF pages to images. If we migrate text extraction to mupdf, the same package handles image rendering with no additional dependency.

Design Considerations

1. Selective processing

Not every page has visual content worth summarizing. Candidate heuristics:

  • Pages where mupdf reports image blocks in structured text output
  • Pages with low text-to-area ratio
  • Pages with detected table structures

Skipping text-only pages keeps VLM inference cost manageable.
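The heuristics above could combine into a single predicate. The thresholds here are illustrative starting points to be tuned against real PDFs, and the field names are hypothetical:

```typescript
// Decide whether a page is worth sending to the VLM.
// All thresholds are placeholder values, not tuned.
interface PageStats {
  imageBlockCount: number;   // image blocks reported by structured text extraction
  textCoverageRatio: number; // fraction of page area covered by text (0..1)
  hasDetectedTable: boolean;
}

function shouldSummarizePage(stats: PageStats): boolean {
  if (stats.imageBlockCount > 0) return true;     // embedded figures/charts
  if (stats.hasDetectedTable) return true;        // tables often lose structure in text extraction
  if (stats.textCoverageRatio < 0.2) return true; // mostly-visual page (diagram, full-page figure)
  return false;                                   // text-only page: skip, save inference cost
}
```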

2. Inference cost

VLM inference is 10-100x slower than text extraction per page. Prompt design matters — a prompt tuned for "describe this figure for search retrieval" produces different (and likely more useful) output than a generic "describe this image" prompt.
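As an illustration of that difference, a retrieval-tuned prompt might look like this. The wording is a hypothetical starting point, to be validated by prototyping:

```typescript
// Generic prompt vs. a retrieval-oriented prompt. Wording is illustrative only.
const GENERIC_PROMPT = "Describe this image.";

const RETRIEVAL_PROMPT = [
  "Describe this figure so it can be found by keyword and semantic search.",
  "State the chart type, what is being compared, axis labels and units,",
  "and the single most important quantitative takeaway in one sentence.",
].join(" ");
```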

3. Page tracking

Current chunks do not track which PDF page they came from. VLM summaries are inherently page-scoped. To connect visual context with text context, chunks need page number metadata.

4. Embedding compatibility

VLM summaries are natural language text, so they embed in the same vector space as regular chunks. No multimodal embedding model is required. If retrieval quality for visual queries turns out to be weak, prompt tuning for the summary is the first lever to pull.

5. Ingest modes

Two modes: text (default, zero-setup) and text+visual (requires ~1.5GB model download). Switching modes on re-ingest should not corrupt existing data.

6. Storage

A 100-page PDF with ~30 pages containing figures adds ~30 extra chunks. This is modest compared to ColPali-style approaches (627 vectors per page).
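The arithmetic behind that comparison, using the figures stated above:

```typescript
// Back-of-envelope storage comparison for a 100-page PDF with ~30 visual pages,
// vs. a ColPali-style approach that stores 627 patch vectors per page.
const pages = 100;
const visualPages = 30;

const extraChunksOptionC = visualPages; // one description chunk per visual page
const colpaliVectors = pages * 627;     // every page embedded as 627 patch vectors

console.log(extraChunksOptionC); // 30
console.log(colpaliVectors);     // 62700
```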

Architecture Sketch

ingest_file(filePath, mode = "text" | "text+visual")

text mode (default):
  PDF -> mupdf -> text chunks -> embedder -> LanceDB

text+visual mode:
  PDF -> mupdf -> text chunks -> embedder -> LanceDB
  PDF -> mupdf (toPixmap) -> select visual pages -> VLM -> description chunks -> embedder -> LanceDB

Search logic stays the same: vector search + FTS against one table.
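The sketch above as a TypeScript skeleton with the pipeline stages injected as functions, so the orchestration can be tested without mupdf, the VLM, or LanceDB. All names (`ingestFile`, `Deps`, etc.) are illustrative, not the actual module API:

```typescript
type IngestMode = "text" | "text+visual";

// Pipeline stages injected as functions. All names are hypothetical.
interface Deps {
  extractTextChunks(filePath: string): Promise<string[]>;
  renderVisualPages(filePath: string): Promise<{ page: number; png: Uint8Array }[]>;
  describeImage(png: Uint8Array): Promise<string>;
  embedAndStore(texts: string[]): Promise<void>;
}

async function ingestFile(filePath: string, mode: IngestMode, deps: Deps): Promise<void> {
  // Text path runs in both modes.
  const textChunks = await deps.extractTextChunks(filePath);
  await deps.embedAndStore(textChunks);

  if (mode === "text+visual") {
    // Visual path: render -> select visual pages -> VLM -> description chunks -> store.
    const pages = await deps.renderVisualPages(filePath);
    const descriptions: string[] = [];
    for (const { page, png } of pages) {
      descriptions.push(`Figure on page ${page}: ${await deps.describeImage(png)}`);
    }
    if (descriptions.length > 0) await deps.embedAndStore(descriptions);
  }
}
```

Both chunk kinds land in the same table, which is what keeps the search side untouched.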

Open Questions

  • Which storage option (A/B/C) works best in practice? Needs prototype comparison with real PDFs.
  • How does chunk granularity mismatch (Option C) affect the grouping filter?
  • What VLM prompt produces the most search-effective descriptions?
  • Should users be able to plug in an Ollama endpoint for a stronger VLM?
  • How to handle re-ingest when switching between text and text+visual modes?

Labels: enhancement (New feature or request)