What we can build matters less than what we can prove.
AI writes code. The engineer defines "working," measures it, enforces it. Eval-Driven Development: every probabilistic system starts with a correctness spec. Nothing ships without automated proof it passes.
Build evals first. Code is generated. Evals are engineered.
Can't express "correct" as a deterministic function? Not ready to build. Every task needs an eval. Every eval needs a threshold. Every threshold needs a justification.
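The triad above — eval, threshold, justification — fits in a plain data structure. A minimal sketch; the names (`Eval`, `run`, the dataset shape) are hypothetical, not any framework's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Eval:
    name: str
    dataset: list[tuple[str, str]]       # (input, expected) pairs
    correct: Callable[[str, str], bool]  # deterministic correctness function
    threshold: float                     # minimum pass rate to ship
    justification: str                   # why this threshold, in writing

def run(ev: Eval, system: Callable[[str], str]) -> float:
    # Returns the pass rate; the gate is rate >= ev.threshold.
    return sum(ev.correct(system(x), y) for x, y in ev.dataset) / len(ev.dataset)
```

If `correct` can't be written as a deterministic function, the task fails the readiness test in the paragraph above.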
One passing test proves nothing about a stochastic system. Sample sizes, confidence intervals, regression baselines. Distributions, not anecdotes.
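One way to report a distribution instead of an anecdote: a Wilson score interval on the pass rate, stdlib only. A sketch, not tied to any eval library:

```python
import math

def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    # 95% Wilson score interval for a binomial pass rate. Judge the
    # system on the interval, not the point estimate of a single run.
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half
```

87/100 passes is not "87%"; it is roughly (0.79, 0.92) — wide enough to change a ship decision.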
Evals that don't run on every change don't exist. They belong next to lint, type-check, and build.
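In practice that means an eval is just another test the pipeline can fail. A sketch; `measured_pass_rate` is a hypothetical stand-in for running the real suite:

```python
THRESHOLD = 0.85  # illustrative; the justification lives with the eval

def measured_pass_rate() -> float:
    # Stand-in: in CI this would execute the eval suite on every change.
    return 0.91

def test_eval_gate() -> None:
    # Fails the build exactly like a lint error or a type error would.
    rate = measured_pass_rate()
    assert rate >= THRESHOLD, f"eval regression: {rate:.0%} < {THRESHOLD:.0%}"
```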
Can't independently evaluate a component? Can't independently trust it. Design for measurability.
Token spend, latency, compute. Correct but unaffordable is a failed eval.
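Cost belongs inside the pass/fail function, not in a dashboard next to it. A sketch with hypothetical field names and budgets:

```python
from dataclasses import dataclass

@dataclass
class Result:
    correct: bool
    tokens: int
    latency_s: float

def passes_budget(r: Result, max_tokens: int = 2000, max_latency_s: float = 3.0) -> bool:
    # "Correct but unaffordable" fails exactly like "wrong."
    return r.correct and r.tokens <= max_tokens and r.latency_s <= max_latency_s
```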
Every manual review is a missing eval. Extract judgment into a rubric, automate it, evaluate the evaluator.
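A sketch of both halves: a manual review extracted into rubric checks, and the evaluator itself scored against human verdicts. The rubric items are illustrative, not a real review policy:

```python
from typing import Callable

# Hypothetical rubric: each item is a deterministic check on an output.
RUBRIC: dict[str, Callable[[str], bool]] = {
    "cites_source": lambda out: "[source]" in out,
    "under_200_words": lambda out: len(out.split()) <= 200,
}

def judge(output: str) -> bool:
    # The automated stand-in for the manual reviewer.
    return all(check(output) for check in RUBRIC.values())

def judge_agreement(labeled: list[tuple[str, bool]]) -> float:
    # Evaluate the evaluator: fraction of human verdicts it reproduces.
    return sum(judge(out) == verdict for out, verdict in labeled) / len(labeled)
```

A judge whose agreement with human labels is unmeasured is just another manual review with extra steps.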
Demos prove something works once. Evals prove it works under distribution shift.
Definitions, datasets, thresholds, results. Version control. Changelogs. Document why.
"Works on my machine" vs. "passes at p < 0.05." That gap is where defensible products get built.