Veritas

Can you trust your AI's output?

LLMs hallucinate. RAG pipelines fabricate facts that are not in the documents. AI agents take wrong actions based on flawed reasoning. In 2026, the best published hallucination detector still scores only 58% balanced accuracy on FaithBench. AI-generated code has 37% more vulnerabilities than human-written code. And 40-60% of enterprise RAG deployments fail to reach production because nobody can diagnose why.

Veritas is a research project exploring multi-agent verification for AI outputs — with real benchmarks, tested hypotheses, and documented findings on what works and what doesn't.

pip install git+https://github.com/riaz-sana/veritas.git

What It Does

Three tools. One interface. Every AI architecture.

from veritas import verify, diagnose_rag, before_action

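# These calls are awaited: run them inside an async function (e.g. via asyncio.run)
# or in a REPL that supports top-level await.

# 1. Verify any claim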
result = await verify("The first iPhone was released in 2006")
# REFUTED (0.98) — Released June 2007, not 2006.

# 2. Diagnose WHY a RAG pipeline failed
result = await diagnose_rag(
    query="What is our refund policy?",
    retrieved_docs=["Policy: 30-day return window..."],
    generated_answer="Our refund window is 90 days.",
)
# generation_contradiction — Answer says '90 days' but doc says '30 days'
# Retrieval: 85% relevant  ← docs are fine
# Generation:  0% faithful  ← THIS is where it broke
# Each ungrounded claim is listed with the contradicting source quote

# 3. Verify agent actions BEFORE execution
@before_action
async def transfer_funds(account: str, amount: float):
    ...
# BLOCKED (0.99) — Amount $500K is 100x the $5K invoice. 12 risks identified.

The Research Question

Does using multiple isolated AI agents produce better verification than a single well-crafted prompt?

We ran 7 experiments to find out. Some confirmed our hypotheses. Some didn't.

What We Proved

Finding	Evidence
Multi-agent is more thorough than single-prompt	+1.6 completeness, +1.0 specificity in blind evaluation (9 cases)
Multi-agent does NOT improve binary accuracy	Both score 9.1/10 on getting the core diagnosis right
Isolation is 2-3x faster than shared-context debate	Consistent across all benchmarks
Isolation produces fewer false positives on RAG tasks	3 vs 6 false alarms on 25 grounding tests

What We Disproved

Hypothesis	Result
"Information asymmetry prevents confirmation bias"	Full-context evaluation outperforms isolated agents 97.1% vs 91.4% on bias-triggering cases

Full experiment details, raw data, and methodology: docs/research/FINDINGS.md

Benchmark Results

Ablation: Multi-Agent vs Single-Prompt

9 test cases (5 RAG + 4 action verification). Blind LLM judge, randomized order.

Dimension	Multi-Agent	Single-Prompt	Winner
Accuracy	9.1	9.1	Tie
Completeness	9.7	8.1	Multi-Agent (+1.6)
Specificity	9.4	8.4	Multi-Agent (+1.0)
Overall	9.3	8.6	Multi-Agent (+0.7)
Cost	4.4x	1x	Single-Prompt
Speed	22.6s	13.7s	Single-Prompt

Takeaway: Multi-agent finds more issues and cites better evidence. Single-prompt reaches the same accuracy score (9.1) at roughly a quarter of the cost (1x vs 4.4x) and with about 40% less latency (13.7s vs 22.6s). Choose based on whether you need thoroughness or speed.

FaithBench (NAACL 2025 — hardest hallucination benchmark)

Metric	Veritas	Published SOTA (o3-mini)
Balanced Accuracy	58%	58%
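
Balanced accuracy is the mean of the recall on each class, so a detector cannot score well by always predicting the majority label. The sketch below shows the calculation; the confusion counts are made up for illustration and are not FaithBench results.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of the true-positive rate and the true-negative rate."""
    tpr = tp / (tp + fn)  # recall on the positive (hallucinated) class
    tnr = tn / (tn + fp)  # recall on the negative (faithful) class
    return (tpr + tnr) / 2


def plain_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Fraction of all samples labelled correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


# Hypothetical detector that flags every sample as hallucinated,
# on an illustrative 30/20 class split.
counts = dict(tp=30, fn=0, tn=0, fp=20)
print(plain_accuracy(**counts))     # looks acceptable on the surface
print(balanced_accuracy(**counts))  # exposes chance-level behaviour
```

With these counts plain accuracy is 0.6 while balanced accuracy is 0.5, which is why the benchmark reports the latter.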

RAG Grounding (25 doc-answer pairs)

Metric	Isolation Mode	Debate Mode
F1	89.7%	81.3%
Precision	81.3%	68.4%
Recall	100%	100%
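
The F1 figures follow directly from precision and recall. The sketch below reproduces them from confusion counts. The false-positive counts (3 and 6) come from the findings table above; the 13 true positives and 0 false negatives are inferred from the reported precision and recall, not read from the raw data, so treat them as an assumption.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard precision, recall, and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed counts: 13 true positives and 0 false negatives in both modes.
isolation = precision_recall_f1(tp=13, fp=3, fn=0)
debate = precision_recall_f1(tp=13, fp=6, fn=0)

for name, (p, r, f1) in {"isolation": isolation, "debate": debate}.items():
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

Because recall is 100% in both modes, the whole F1 gap comes from the three extra false alarms in debate mode.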

RAGVUE Head-to-Head (bias-triggering cases)

Metric	Full-Context (RAGVUE-style)	Isolated Agents (Veritas)
Claim Accuracy	97.1%	91.4%
False Positives	1	3

Honest conclusion: Full-context single-pass evaluation beats our isolated multi-agent approach for claim-level accuracy. We adopted this finding.

How It Works

                    ┌─────────────────┐
                    │  Input (claim,  │
                    │  docs, action)  │
                    └────────┬────────┘
                             │
          ┌──────────────────┼──────────────────┐
          │                  │                  │
          ▼                  ▼                  ▼
  ┌────────────────┐ ┌────────────────┐ ┌────────────────┐
  │   Agent 1      │ │   Agent 2      │ │   Agent 3      │
  │                │ │                │ │                │
  │ Logic /        │ │ Facts /        │ │ Adversary /    │
  │ Retrieval      │ │ Generation     │ │ Risk           │
  └───────┬────────┘ └───────┬────────┘ └───────┬────────┘
          │                  │                  │
          └──────────────────┼──────────────────┘
                             ▼
                  ┌─────────────────────┐
                  │    Synthesiser      │
                  │                     │
                  │ Verdict + evidence  │
                  │ + failure modes     │
                  └─────────────────────┘

Each tool uses specialized agents:

Tool	Agents	What they check
`verify()`	Logic, Source, Adversary, Calibration	Consistency, facts, counterexamples, confidence
`diagnose_rag()`	Retrieval, Generation, Coverage	Docs relevant? Answer faithful? KB has the info?
`verify_action()`	Reasoning, Parameters, Risk, Scope	Logic sound? Params correct? Risks? Matches goal?
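
The fan-out and synthesis flow in the diagram can be sketched with plain asyncio. This is not the Veritas implementation: `Finding`, the three stub agents, `synthesise`, and `run_isolated` are hypothetical names, and the stubs return canned results where a real agent would call a model.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Finding:
    agent: str
    verdict: str   # "pass" or "fail"
    evidence: str


async def logic_agent(claim: str) -> Finding:
    # Stub: a real agent would call an LLM here.
    return Finding("logic", "pass", "no internal contradiction")


async def facts_agent(claim: str) -> Finding:
    # Stub fact check: the first iPhone was released in 2007.
    ok = "2007" in claim
    return Finding("facts", "pass" if ok else "fail", "release year is 2007")


async def adversary_agent(claim: str) -> Finding:
    return Finding("adversary", "pass", "no counterexample found")


def synthesise(findings: list[Finding]) -> dict:
    """Combine independent findings into one verdict with evidence."""
    failed = [f for f in findings if f.verdict == "fail"]
    return {
        "verdict": "REFUTED" if failed else "VERIFIED",
        "evidence": [f"{f.agent}: {f.evidence}" for f in findings],
        "failed_agents": [f.agent for f in failed],
    }


async def run_isolated(claim: str) -> dict:
    # Isolation: each agent receives only the input, never another
    # agent's output. gather() runs them concurrently.
    findings = await asyncio.gather(
        logic_agent(claim), facts_agent(claim), adversary_agent(claim)
    )
    return synthesise(list(findings))


result = asyncio.run(run_isolated("The first iPhone was released in 2006"))
print(result["verdict"], result["failed_agents"])
```

Running the agents concurrently with no shared context is what distinguishes this layout from a debate round, where agents read and respond to each other.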

Real Test Output

RAG Diagnostics — caught 3 fabricated claims with source evidence

Input:  "Our refund window is 90 days for all items. Refunds processed instantly."
Source: "30-day return window. Sale items final sale. 5-7 business days."

Diagnosis: generation_contradiction
Retrieval:  85% relevant  ← docs are correct
Generation:  0% faithful  ← LLM ignored the documents

Claims:
  [UNGROUNDED] "90 days" → doc says 30 days
  [UNGROUNDED] "all items including sale" → doc says sale items final sale
  [UNGROUNDED] "processed instantly" → doc says 5-7 business days

Fix: Add system prompt constraint to only use facts from documents.
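
One way to read the "0% faithful" figure is as the share of extracted claims that are grounded in the retrieved documents. The sketch below assumes that definition, which this README does not confirm; `Claim` and `faithfulness` are hypothetical names, and the claim data is copied from the output above.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    grounded: bool
    source_quote: str


def faithfulness(claims: list[Claim]) -> float:
    """Fraction of claims supported by the source (assumed definition)."""
    if not claims:
        return 1.0  # nothing asserted, nothing contradicted
    return sum(c.grounded for c in claims) / len(claims)


claims = [
    Claim("90 days", False, "30-day return window"),
    Claim("all items including sale", False, "Sale items final sale"),
    Claim("processed instantly", False, "5-7 business days"),
]

score = faithfulness(claims)
for c in claims:
    tag = "GROUNDED" if c.grounded else "UNGROUNDED"
    print(f"[{tag}] {c.text!r} -> doc says {c.source_quote!r}")
print(f"Generation: {score:.0%} faithful")
```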

Action Verification — blocked a $500K fraud-pattern transfer

Action:  transfer_funds($500,000 → unknown_external_789)
Goal:    Pay vendor invoice #INV-2025-001 for $5,000

BLOCKED (0.99 confidence)

Risks identified (12):
  [CRITICAL] Amount is 100x the invoice ($500K vs $5K)
  [CRITICAL] Recipient 'unknown_external_789' is unverified
  [CRITICAL] Pattern matches Business Email Compromise fraud
  [CRITICAL] $500K triggers mandatory AML reporting
  [HIGH]     Wire transfer is irreversible
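
The general pattern behind a pre-execution check is a decorator that inspects the call arguments and refuses to run the wrapped function when a risk is found. The sketch below illustrates that pattern only. `guard_action`, `ActionBlocked`, and `invoice_check` are hypothetical names, the rules are hard-coded where Veritas uses agents, and the 2x threshold is arbitrary.

```python
import asyncio
import functools


class ActionBlocked(Exception):
    """Raised when a pre-execution check rejects the call."""


def guard_action(check):
    """Run check(**kwargs) first; block the call if it reports any risk."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(**kwargs):
            risks = check(**kwargs)
            if risks:
                raise ActionBlocked("; ".join(risks))
            return await func(**kwargs)
        return wrapper
    return decorator


def invoice_check(account: str, amount: float, invoice_amount: float) -> list[str]:
    risks = []
    if amount > 2 * invoice_amount:  # arbitrary illustrative threshold
        risks.append(f"amount is {amount / invoice_amount:.0f}x the invoice")
    if account.startswith("unknown_"):
        risks.append(f"recipient {account!r} is unverified")
    return risks


@guard_action(invoice_check)
async def transfer_funds(account: str, amount: float, invoice_amount: float) -> str:
    return f"sent {amount} to {account}"


try:
    asyncio.run(transfer_funds(
        account="unknown_external_789", amount=500_000, invoice_amount=5_000
    ))
    outcome = "executed"
except ActionBlocked as exc:
    outcome = f"blocked: {exc}"
print(outcome)
```

Raising an exception rather than returning a flag means the side effect cannot happen unless the caller explicitly handles the block.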

Install & Use

pip install git+https://github.com/riaz-sana/veritas.git
export ANTHROPIC_API_KEY="sk-ant-..."

Python:

from veritas import verify, diagnose_rag, verify_action, before_action

CLI:

veritas check "Any claim"
veritas check "..." --verbose --json
veritas shell

Claude Code: /verify The RAG says our policy is 90 days

MCP Server (Claude Desktop, Cursor, any AI tool):

{"mcpServers": {"veritas": {"command": "python", "args": ["-m", "veritas.mcp_server"]}}}

Works with everything: LangChain, LlamaIndex, CrewAI, AutoGen, FastAPI, CI/CD, batch eval. Same verify(claim, context) interface for all. See docs/USAGE.md.

Enterprise Features

from veritas import Config, AgentModels

# Economy mode — Haiku for simple agents, Sonnet for critical (~60% cheaper)
config = Config(agent_models=AgentModels.economy())

# Caching — zero cost on repeat queries
config = Config(cache_enabled=True)

# Confidence routing — skip verification for high-confidence outputs
config = Config(confidence_routing=True, confidence_threshold=0.8)

# Domain-specific — code, schema, medical, legal, financial, scientific
result = await verify(claim=generated_code, context=spec, domain="code")
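
Caching and confidence routing both reduce how often the verifier is called. A minimal sketch of the two ideas combined is shown below. `RoutedVerifier` is a hypothetical class, not part of the Veritas API, and it assumes that "high-confidence" means an upstream confidence at or above the threshold.

```python
import hashlib
from typing import Callable


class RoutedVerifier:
    """Skip high-confidence outputs and memoise the rest (illustrative only)."""

    def __init__(self, verifier: Callable[[str], str], threshold: float = 0.8):
        self.verifier = verifier
        self.threshold = threshold
        self.cache: dict[str, str] = {}
        self.calls = 0  # number of real verifier invocations

    def check(self, claim: str, confidence: float) -> str:
        # Confidence routing: trust outputs the upstream model is sure about.
        if confidence >= self.threshold:
            return "SKIPPED"
        # Caching: identical claims are verified once.
        key = hashlib.sha256(claim.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.verifier(claim)
        return self.cache[key]


router = RoutedVerifier(lambda claim: "VERIFIED")
print(router.check("Water boils at 100 C at sea level", confidence=0.95))
print(router.check("Refund window is 30 days", confidence=0.40))
print(router.check("Refund window is 30 days", confidence=0.40))
print(router.calls)
```

The trade-off is that routing trusts the upstream confidence score, so a confidently wrong output is never checked.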

Verdicts & Failure Modes

Verdict	Meaning
`VERIFIED`	Evidence supports the claim
`PARTIAL`	Some parts correct, some not
`UNCERTAIN`	Insufficient evidence
`DISPUTED`	Conflicting evidence
`REFUTED`	Evidence contradicts the claim

When something fails, Veritas classifies WHY:

Type	What it means
`factual_error`	A fact is wrong
`logical_inconsistency`	Reasoning contradicts itself
`unsupported_inference`	Claim exceeds the evidence
`temporal_error`	Information is outdated
`scope_error`	Overgeneralization
`source_conflict`	Sources disagree
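
Downstream code often branches on the verdict and failure type, so it can help to model them as enums. The sketch below mirrors the two tables above; the class names and the `Result` shape are assumptions and may not match the library's actual data models.

```python
from dataclasses import dataclass, field
from enum import Enum


class Verdict(str, Enum):
    VERIFIED = "VERIFIED"
    PARTIAL = "PARTIAL"
    UNCERTAIN = "UNCERTAIN"
    DISPUTED = "DISPUTED"
    REFUTED = "REFUTED"


class FailureMode(str, Enum):
    FACTUAL_ERROR = "factual_error"
    LOGICAL_INCONSISTENCY = "logical_inconsistency"
    UNSUPPORTED_INFERENCE = "unsupported_inference"
    TEMPORAL_ERROR = "temporal_error"
    SCOPE_ERROR = "scope_error"
    SOURCE_CONFLICT = "source_conflict"


@dataclass
class Result:
    verdict: Verdict
    confidence: float
    failure_modes: list[FailureMode] = field(default_factory=list)

    @property
    def needs_review(self) -> bool:
        # Anything short of VERIFIED is worth a second look.
        return self.verdict is not Verdict.VERIFIED


result = Result(Verdict.REFUTED, 0.98, [FailureMode.FACTUAL_ERROR])
print(result.verdict.value, result.needs_review)
```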

Project Structure

veritas/
  core/           # verify(), config, cache, data models
  agents/         # 5 verification agents + domain prompts
  diagnostics/    # RAG diagnostic engine
  agentic/        # Pre-action verification + @before_action
  orchestration/  # Parallel runner + challenge round
  providers/      # Claude, Brave Search, Tavily
  cli/            # check, shell, benchmark commands
  ablation/       # Multi-agent vs single-prompt comparison code
  benchmarks/     # FaithBench, RAG grounding, adversarial datasets
  mcp_server.py   # MCP server for any AI tool
skills/verify/    # Claude Code skill
docs/
  research/       # All experiment data, findings, methodology
  USAGE.md        # Integration patterns
  ENTERPRISE-REALITY.md  # Honest deployment assessment

110 tests. Python 3.10+. MIT License.

Research Documents

Document	What's in it
FINDINGS.md	All 7 experiments — what we proved, what we disproved, raw data
Ablation Study	Multi-agent vs single-prompt — methodology, 9 test cases, blind evaluation
Honest Assessment	Competitive landscape — RAGVUE, Superagent, Galileo Luna, who does what
Enterprise Reality	Where Veritas works, where it doesn't, cost/latency analysis
Benchmark Methodology	Why each benchmark, dataset design, evaluation principles

Raw Benchmark Data (JSON)

Dataset	Samples	Key Result
Ablation	9 cases	MA 9.3 vs SP 8.6 overall
FaithBench	50	58% balanced accuracy
RAG Grounding	25	89.7% F1
Adversarial	50	100% detection (too easy)
RAGVUE H2H	33 claims	Tied at 100%
Bias H2H	35 claims	RAGVUE 97.1% vs Veritas 91.4%

References

Papers

Du et al. "Improving Factuality through Multiagent Debate" — ICML 2024
"Emergent social conventions and collective bias in LLM populations" — Science Advances 2025
"Cross-Context Verification" — 2026
"RAGVUE" — 2026
"Agent-as-a-Judge" — ICML 2025
"Semantic Entropy" — Nature 2024
"FaithBench" — NAACL 2025
Amazon "Enhancing LLM-as-a-Judge via Multi-Agent Collaboration"

Competing Tools

RAGVUE — claim-level RAG evaluation
RAGAS — RAG evaluation metrics
Superagent — agentic AI safety
Galileo Luna — fast hallucination detection
Axiom — formal verification for Lean

License

MIT
