LLMs hallucinate. RAG pipelines fabricate facts that are not in the retrieved documents. AI agents take wrong actions based on flawed reasoning. In 2026, the best published hallucination detector (o3-mini) still reaches only 58% balanced accuracy on FaithBench. AI-generated code has 37% more vulnerabilities than human-written code. And 40-60% of enterprise RAG deployments fail to reach production because nobody can diagnose why.
Veritas is a research project exploring multi-agent verification for AI outputs — with real benchmarks, tested hypotheses, and documented findings on what works and what doesn't.

    pip install git+https://github.com/riaz-sana/veritas.git

Three tools. One interface. Every AI architecture.

    from veritas import verify, diagnose_rag, before_action

    # 1. Verify any claim
    result = await verify("The first iPhone was released in 2006")
    # REFUTED (0.98): Released June 2007, not 2006.

    # 2. Diagnose WHY a RAG pipeline failed
    result = await diagnose_rag(
        query="What is our refund policy?",
        retrieved_docs=["Policy: 30-day return window..."],
        generated_answer="Our refund window is 90 days.",
    )
    # generation_contradiction: Answer says '90 days' but doc says '30 days'
    # Retrieval: 85% relevant  ← docs are fine
    # Generation: 0% faithful  ← THIS is where it broke
    # 3 ungrounded claims identified with source quotes

    # 3. Verify agent actions BEFORE execution
    @before_action
    async def transfer_funds(account: str, amount: float):
        ...
    # BLOCKED (0.99): Amount $500K is 100x the $5K invoice. 12 risks identified.

Does using multiple isolated AI agents produce better verification than a single well-crafted prompt?
We ran 7 experiments to find out. Some confirmed our hypotheses. Some didn't.
| Finding | Evidence |
|---|---|
| Multi-agent is more thorough than single-prompt | +1.6 completeness, +1.0 specificity in blind evaluation (9 cases) |
| Multi-agent does NOT improve binary accuracy | Both score 9.1/10 on getting the core diagnosis right |
| Isolation is 2-3x faster than shared-context debate | Consistent across all benchmarks |
| Isolation produces fewer false positives on RAG tasks | 3 vs 6 false alarms on 25 grounding tests |

What we disproved:

| Hypothesis | Result |
|---|---|
| "Information asymmetry prevents confirmation bias" | Full-context evaluation outperforms isolated agents 97.1% vs 91.4% on bias-triggering cases |
Full experiment details, raw data, and methodology: docs/research/FINDINGS.md
9 test cases (5 RAG + 4 action verification). Blind LLM judge, randomized order. Quality dimensions are scored out of 10.

| Dimension | Multi-Agent | Single-Prompt | Winner |
|---|---|---|---|
| Accuracy | 9.1 | 9.1 | Tie |
| Completeness | 9.7 | 8.1 | Multi-Agent (+1.6) |
| Specificity | 9.4 | 8.4 | Multi-Agent (+1.0) |
| Overall | 9.3 | 8.6 | Multi-Agent (+0.7) |
| Cost | 4.4x | 1x | Single-Prompt |
| Speed | 22.6s | 13.7s | Single-Prompt |
Takeaway: Multi-agent finds more issues and cites better evidence. Single-prompt reaches the same accuracy score (9.1) at roughly a quarter of the cost and in about 60% of the time. Choose based on whether you need thoroughness or speed.
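
The trade-off can be checked with a few lines of arithmetic. The sketch below only restates the numbers reported in the ablation table; the variable names are illustrative and not part of Veritas.

```python
# Scores copied from the ablation table: (multi-agent, single-prompt).
scores = {
    "accuracy": (9.1, 9.1),
    "completeness": (9.7, 8.1),
    "specificity": (9.4, 8.4),
    "overall": (9.3, 8.6),
}

# Quality gain of multi-agent over single-prompt, per dimension.
deltas = {name: round(ma - sp, 1) for name, (ma, sp) in scores.items()}

# Overhead of multi-agent relative to single-prompt.
cost_ratio = 4.4 / 1.0
latency_ratio = round(22.6 / 13.7, 2)

print(deltas)
print(f"cost: {cost_ratio}x, latency: {latency_ratio}x")
```

The accuracy delta is zero while cost and latency both rise, which is why the choice comes down to how much the extra completeness and specificity are worth for a given workload.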

FaithBench (50 samples):

| Metric | Veritas | Published SOTA (o3-mini) |
|---|---|---|
| Balanced Accuracy | 58% | 58% |
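
Balanced accuracy is the mean of the true-positive rate and the true-negative rate, so a detector that always gives the same answer scores 50%. A minimal sketch of the metric follows. The confusion-matrix counts are made up for illustration and are NOT Veritas's FaithBench results.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of recall on the positive class and recall on the negative class."""
    tpr = tp / (tp + fn)  # sensitivity: hallucinations correctly flagged
    tnr = tn / (tn + fp)  # specificity: faithful outputs correctly passed
    return (tpr + tnr) / 2


# Hypothetical counts, not taken from any benchmark run.
always_flag = balanced_accuracy(tp=25, fn=0, tn=0, fp=25)
example = balanced_accuracy(tp=18, fn=7, tn=12, fp=13)

print(f"always flag: {always_flag:.2f}, example detector: {example:.2f}")
```

Read this way, a 58% score sits only modestly above the 50% floor that a trivial detector would reach.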

RAG grounding (25 tests), isolation vs debate mode:

| Metric | Isolation Mode | Debate Mode |
|---|---|---|
| F1 | 89.7% | 81.3% |
| Precision | 81.3% | 68.4% |
| Recall | 100% | 100% |
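
These rows are consistent with the false-alarm counts reported in the findings (3 vs 6 on 25 grounding tests). The sketch below recomputes them under one assumption: 13 truly ungrounded cases, all caught by both modes. That count is inferred from the reported precision and recall; it is not stated in the results.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard definitions; F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# ASSUMPTION: 13 true positives and 0 misses in both modes (inferred, see above).
isolation = precision_recall_f1(tp=13, fp=3, fn=0)
debate = precision_recall_f1(tp=13, fp=6, fn=0)

for name, (p, r, f1) in {"isolation": isolation, "debate": debate}.items():
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

With recall fixed at 100% in both modes, the whole F1 gap comes from the three extra false alarms in debate mode.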

Bias head-to-head against RAGVUE (35 claims on bias-triggering cases):

| Metric | Full-Context (RAGVUE-style) | Isolated Agents (Veritas) |
|---|---|---|
| Claim Accuracy | 97.1% | 91.4% |
| False Positives | 1 | 3 |
Honest conclusion: Full-context single-pass evaluation beats our isolated multi-agent approach for claim-level accuracy. We adopted this finding.

                      ┌─────────────────┐
                      │  Input (claim,  │
                      │  docs, action)  │
                      └────────┬────────┘
                               │
            ┌──────────────────┼──────────────────┐
            │                  │                  │
            ▼                  ▼                  ▼
    ┌────────────────┐ ┌────────────────┐ ┌────────────────┐
    │    Agent 1     │ │    Agent 2     │ │    Agent 3     │
    │                │ │                │ │                │
    │    Logic /     │ │    Facts /     │ │  Adversary /   │
    │   Retrieval    │ │   Generation   │ │      Risk      │
    └───────┬────────┘ └───────┬────────┘ └───────┬────────┘
            │                  │                  │
            └──────────────────┼──────────────────┘
                               ▼
                    ┌─────────────────────┐
                    │     Synthesiser     │
                    │                     │
                    │ Verdict + evidence  │
                    │   + failure modes   │
                    └─────────────────────┘

Each tool uses specialized agents:
| Tool | Agents | What they check |
|---|---|---|
| `verify()` | Logic, Source, Adversary, Calibration | Consistency, facts, counterexamples, confidence |
| `diagnose_rag()` | Retrieval, Generation, Coverage | Docs relevant? Answer faithful? KB has the info? |
| `verify_action()` | Reasoning, Parameters, Risk, Scope | Logic sound? Params correct? Risks? Matches goal? |
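
The fan-out and fan-in shape in the diagram maps naturally onto `asyncio.gather`. The following is a rough sketch of that pattern only: the agent functions, the `Finding` type, and the unanimity-based synthesiser are hypothetical stand-ins, not Veritas's actual agents or synthesis logic.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Finding:
    agent: str
    supports_claim: bool
    note: str


# Hypothetical stand-in agents. Each receives only the input and never sees
# another agent's output, which is what "isolated" means in this sketch.
async def logic_agent(claim: str) -> Finding:
    return Finding("logic", True, "claim is internally consistent")


async def facts_agent(claim: str) -> Finding:
    # Toy check standing in for a real source lookup.
    return Finding("facts", "2006" not in claim, "checked release year")


async def adversary_agent(claim: str) -> Finding:
    # Toy check standing in for a counterexample search.
    return Finding("adversary", "2006" not in claim, "looked for counterexamples")


def synthesise(findings: list[Finding]) -> str:
    # Simplest possible fan-in: any dissent refutes the claim. A real
    # synthesiser would weigh evidence and confidence instead.
    return "VERIFIED" if all(f.supports_claim for f in findings) else "REFUTED"


async def verify_sketch(claim: str) -> str:
    # Fan out: run all agents concurrently on the same input.
    findings = await asyncio.gather(
        logic_agent(claim), facts_agent(claim), adversary_agent(claim)
    )
    return synthesise(list(findings))


verdict = asyncio.run(verify_sketch("The first iPhone was released in 2006"))
print(verdict)
```

Because the agents share no state, total latency is bounded by the slowest agent rather than the sum of all of them.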

    Input:  "Our refund window is 90 days for all items. Refunds processed instantly."
    Source: "30-day return window. Sale items final sale. 5-7 business days."

    Diagnosis: generation_contradiction
    Retrieval:  85% relevant  ← docs are correct
    Generation: 0% faithful   ← LLM ignored the documents

    Claims:
      [UNGROUNDED] "90 days" → doc says 30 days
      [UNGROUNDED] "all items including sale" → doc says sale items final sale
      [UNGROUNDED] "processed instantly" → doc says 5-7 business days

    Fix: Add system prompt constraint to only use facts from documents.

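
One way to read the percentages in this report: if faithfulness is the share of answer claims grounded in the retrieved documents, three ungrounded claims out of three gives 0%. The sketch below assumes that definition plus a simple decision rule. The `Claim` type, the 0.5 thresholds, and every diagnosis label other than `generation_contradiction` are illustrative guesses, not Veritas's scoring.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    grounded: bool


def faithfulness(claims: list[Claim]) -> float:
    """Fraction of answer claims supported by the retrieved documents."""
    if not claims:
        return 1.0
    return sum(c.grounded for c in claims) / len(claims)


def diagnose(retrieval_relevance: float, claims: list[Claim]) -> str:
    # Hypothetical rule: irrelevant docs point at retrieval; relevant docs
    # with an unfaithful answer point at the generation step.
    if retrieval_relevance < 0.5:
        return "retrieval_failure"  # label invented for this sketch
    if faithfulness(claims) < 0.5:
        return "generation_contradiction"
    return "ok"  # label invented for this sketch


claims = [
    Claim("90 days", grounded=False),
    Claim("all items including sale", grounded=False),
    Claim("processed instantly", grounded=False),
]
print(f"{faithfulness(claims):.0%} faithful ->", diagnose(0.85, claims))
```

Scoring retrieval and generation separately is what lets the diagnosis say which stage to fix.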

    Action: transfer_funds($500,000 → unknown_external_789)
    Goal:   Pay vendor invoice #INV-2025-001 for $5,000

    BLOCKED (0.99 confidence)

    Risks identified (12):
      [CRITICAL] Amount is 100x the invoice ($500K vs $5K)
      [CRITICAL] Recipient 'unknown_external_789' is unverified
      [CRITICAL] Pattern matches Business Email Compromise fraud
      [CRITICAL] $500K triggers mandatory AML reporting
      [HIGH]     Wire transfer is irreversible

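
A pre-execution guard like `@before_action` can be pictured as a decorator that inspects a call's arguments and refuses to run the wrapped coroutine when a check fails. The sketch below illustrates only that control flow. The `guard` decorator, `ActionBlocked`, the `invoice_total` parameter, and the 10x rule are hypothetical; Veritas's own decorator relies on its verification agents and may have a different signature.

```python
import asyncio
import functools


class ActionBlocked(Exception):
    """Raised instead of executing a risky action (hypothetical name)."""


def guard(check):
    """Run check(**kwargs) first; execute the action only if no risks come back."""
    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(**kwargs):  # keyword-only, to keep the sketch short
            risks = check(**kwargs)
            if risks:
                raise ActionBlocked("; ".join(risks))
            return await fn(**kwargs)
        return wrapper
    return decorator


def amount_matches_invoice(account: str, amount: float, invoice_total: float) -> list[str]:
    # Illustrative rule only: flag payments far above the invoice total.
    if amount > 10 * invoice_total:
        return [f"amount ${amount:,.0f} is {amount / invoice_total:.0f}x the invoice"]
    return []


@guard(amount_matches_invoice)
async def transfer_funds(account: str, amount: float, invoice_total: float) -> str:
    return f"sent ${amount:,.0f} to {account}"


async def main() -> tuple[str, str]:
    ok = await transfer_funds(account="vendor_42", amount=5_000, invoice_total=5_000)
    try:
        await transfer_funds(
            account="unknown_external_789", amount=500_000, invoice_total=5_000
        )
        blocked = "executed"
    except ActionBlocked as exc:
        blocked = f"BLOCKED: {exc}"
    return ok, blocked


ok, blocked = asyncio.run(main())
print(ok)
print(blocked)
```

Raising before the wrapped coroutine is awaited matters for irreversible actions such as wire transfers, where a check after the fact is too late.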

    pip install git+https://github.com/riaz-sana/veritas.git
    export ANTHROPIC_API_KEY="sk-ant-..."

Python:

    from veritas import verify, diagnose_rag, verify_action, before_action

CLI:

    veritas check "Any claim"
    veritas check "..." --verbose --json
    veritas shell

Claude Code:

    /verify The RAG says our policy is 90 days

MCP Server (Claude Desktop, Cursor, any AI tool):

    {"mcpServers": {"veritas": {"command": "python", "args": ["-m", "veritas.mcp_server"]}}}

Works with everything: LangChain, LlamaIndex, CrewAI, AutoGen, FastAPI, CI/CD, batch eval. Same verify(claim, context) interface for all. See docs/USAGE.md.

    from veritas import Config, AgentModels, verify

    # Economy mode: Haiku for simple agents, Sonnet for critical (~60% cheaper)
    config = Config(agent_models=AgentModels.economy())

    # Caching: zero cost on repeat queries
    config = Config(cache_enabled=True)

    # Confidence routing: skip verification for high-confidence outputs
    config = Config(confidence_routing=True, confidence_threshold=0.8)

    # Domain-specific: code, schema, medical, legal, financial, scientific
    result = await verify(claim=generated_code, context=spec, domain="code")

| Verdict | Meaning |
|---|---|
| VERIFIED | Evidence supports the claim |
| PARTIAL | Some parts correct, some not |
| UNCERTAIN | Insufficient evidence |
| DISPUTED | Conflicting evidence |
| REFUTED | Evidence contradicts the claim |
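
The `cache_enabled` and `confidence_routing` options in the configuration snippet both avoid paying for verification that is unlikely to change the outcome. The sketch below shows one plausible way the two could combine. It is not Veritas's implementation: `expensive_verify`, the in-memory dict cache, the `SKIPPED` marker, and the idea that the caller supplies an upstream confidence score are all assumptions made for illustration.

```python
import hashlib

calls = 0  # counts how often the expensive path actually runs


def expensive_verify(claim: str, context: str) -> str:
    """Stand-in for a multi-agent verification call (hypothetical)."""
    global calls
    calls += 1
    return "VERIFIED" if claim in context else "UNCERTAIN"


_cache: dict[str, str] = {}


def routed_verify(claim: str, context: str, upstream_confidence: float,
                  threshold: float = 0.8) -> str:
    # Confidence routing: skip outputs the upstream model is already sure of.
    if upstream_confidence >= threshold:
        return "SKIPPED"  # sketch-only marker, not one of the verdicts above
    # Caching: an identical (claim, context) pair is verified only once.
    key = hashlib.sha256(f"{claim}\x00{context}".encode()).hexdigest()
    if key not in _cache:
        _cache[key] = expensive_verify(claim, context)
    return _cache[key]


doc = "Policy: 30-day return window."
first = routed_verify("30-day return window", doc, upstream_confidence=0.4)
second = routed_verify("30-day return window", doc, upstream_confidence=0.4)
skipped = routed_verify("anything", doc, upstream_confidence=0.95)
print(first, second, skipped, calls)
```

Routing on upstream confidence trades some coverage for cost, so the threshold is worth tuning against how often high-confidence outputs turn out to be wrong.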
When something fails, Veritas classifies WHY:
| Type | What it means |
|---|---|
| `factual_error` | A fact is wrong |
| `logical_inconsistency` | Reasoning contradicts itself |
| `unsupported_inference` | Claim exceeds the evidence |
| `temporal_error` | Information is outdated |
| `scope_error` | Overgeneralization |
| `source_conflict` | Sources disagree |
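
Downstream code usually wants to branch on the verdict and failure type rather than parse text. Below is a sketch of how the two tables could be modelled as enums, with a hypothetical gating policy on top. The enum classes and the `should_block` rule are assumptions; Veritas's real result objects may look different.

```python
from enum import Enum
from typing import Optional


class Verdict(str, Enum):
    VERIFIED = "VERIFIED"
    PARTIAL = "PARTIAL"
    UNCERTAIN = "UNCERTAIN"
    DISPUTED = "DISPUTED"
    REFUTED = "REFUTED"


class FailureType(str, Enum):
    FACTUAL_ERROR = "factual_error"
    LOGICAL_INCONSISTENCY = "logical_inconsistency"
    UNSUPPORTED_INFERENCE = "unsupported_inference"
    TEMPORAL_ERROR = "temporal_error"
    SCOPE_ERROR = "scope_error"
    SOURCE_CONFLICT = "source_conflict"


def should_block(verdict: Verdict, failure: Optional[FailureType],
                 confidence: float) -> bool:
    """Hypothetical CI gate: block confident refutations and any factual error."""
    if verdict is Verdict.REFUTED and confidence >= 0.8:
        return True
    return failure is FailureType.FACTUAL_ERROR


# Enum lookups by value accept the strings shown in the tables.
block_refuted = should_block(Verdict("REFUTED"), FailureType("factual_error"), 0.98)
block_partial = should_block(Verdict.PARTIAL, FailureType.SCOPE_ERROR, 0.6)
print(block_refuted, block_partial)
```

Looking values up through the enum means a misspelled verdict or failure type raises a `ValueError` immediately instead of silently passing a gate.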

    veritas/
      core/            # verify(), config, cache, data models
      agents/          # 5 verification agents + domain prompts
      diagnostics/     # RAG diagnostic engine
      agentic/         # Pre-action verification + @before_action
      orchestration/   # Parallel runner + challenge round
      providers/       # Claude, Brave Search, Tavily
      cli/             # check, shell, benchmark commands
      ablation/        # Multi-agent vs single-prompt comparison code
      benchmarks/      # FaithBench, RAG grounding, adversarial datasets
      mcp_server.py    # MCP server for any AI tool
    skills/verify/     # Claude Code skill
    docs/
      research/              # All experiment data, findings, methodology
      USAGE.md               # Integration patterns
      ENTERPRISE-REALITY.md  # Honest deployment assessment

110 tests. Python 3.10+. MIT License.
| Document | What's in it |
|---|---|
| FINDINGS.md | All 7 experiments — what we proved, what we disproved, raw data |
| Ablation Study | Multi-agent vs single-prompt — methodology, 9 test cases, blind evaluation |
| Honest Assessment | Competitive landscape — RAGVUE, Superagent, Galileo Luna, who does what |
| Enterprise Reality | Where Veritas works, where it doesn't, cost/latency analysis |
| Benchmark Methodology | Why each benchmark, dataset design, evaluation principles |
| Dataset | Samples | Key Result |
|---|---|---|
| Ablation | 9 cases | MA 9.3 vs SP 8.6 overall |
| FaithBench | 50 | 58% balanced accuracy |
| RAG Grounding | 25 | 89.7% F1 |
| Adversarial | 50 | 100% detection (too easy) |
| RAGVUE H2H | 33 claims | Tied at 100% |
| Bias H2H | 35 claims | RAGVUE 97.1% vs Veritas 91.4% |
- Du et al., "Improving Factuality and Reasoning in Language Models through Multiagent Debate", ICML 2024
- "Emergent social conventions and collective bias in LLM populations", Science Advances 2025
- "Cross-Context Verification", 2026
- "RAGVUE", 2026
- "Agent-as-a-Judge", ICML 2025
- "Semantic Entropy", Nature 2024
- "FaithBench", NAACL 2025
- Amazon, "Enhancing LLM-as-a-Judge via Multi-Agent Collaboration"
- RAGVUE — claim-level RAG evaluation
- RAGAS — RAG evaluation metrics
- Superagent — agentic AI safety
- Galileo Luna — fast hallucination detection
- Axiom — formal verification for Lean
MIT