What we can build matters less than what we can prove.
AI writes code. The engineer defines "working," measures it, enforces it. Eval-Driven Development: every probabilistic system starts with a correctness spec. Nothing ships without automated proof it passes.
Build evals first. Code is generated. Evals are engineered.
Can't express "correct" as a deterministic function? Not ready to build. Every task needs an eval. Every eval needs a threshold. Every threshold needs a justification.
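The triad above — eval, threshold, justification — fits in a plain data structure. A minimal sketch; the names (`Eval`, `run`, the dataset shape) are hypothetical, not any framework's API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Eval:
    name: str
    dataset: list[tuple[str, str]]       # (input, expected) pairs
    correct: Callable[[str, str], bool]  # deterministic correctness function
    threshold: float                     # minimum pass rate to ship
    justification: str                   # why this threshold, in writing

def run(ev: Eval, system: Callable[[str], str]) -> float:
    # Returns the pass rate; the gate is rate >= ev.threshold.
    return sum(ev.correct(system(x), y) for x, y in ev.dataset) / len(ev.dataset)
```

If `correct` can't be written as a deterministic function, the task fails the readiness test in the paragraph above.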
One passing test proves nothing about a stochastic system. Sample sizes, confidence intervals, regression baselines. Distributions, not anecdotes.
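One way to report a distribution instead of an anecdote: a Wilson score interval on the pass rate, stdlib only. A sketch, not tied to any eval library:

```python
import math

def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    # 95% Wilson score interval for a binomial pass rate. Judge the
    # system on the interval, not the point estimate of a single run.
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half
```

87/100 passes is not "87%"; it is roughly (0.79, 0.92) — wide enough to change a ship decision.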
Evals that don't run on every change don't exist. They belong next to lint, type-check, and build.
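In practice that means an eval is just another test the pipeline can fail. A sketch; `measured_pass_rate` is a hypothetical stand-in for running the real suite:

```python
THRESHOLD = 0.85  # illustrative; the justification lives with the eval

def measured_pass_rate() -> float:
    # Stand-in: in CI this would execute the eval suite on every change.
    return 0.91

def test_eval_gate() -> None:
    # Fails the build exactly like a lint error or a type error would.
    rate = measured_pass_rate()
    assert rate >= THRESHOLD, f"eval regression: {rate:.0%} < {THRESHOLD:.0%}"
```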
Can't independently evaluate a component? Can't independently trust it. Design for measurability.
Token spend, latency, compute. Correct but unaffordable is a failed eval.
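Cost belongs inside the pass/fail function, not in a dashboard next to it. A sketch with hypothetical field names and budgets:

```python
from dataclasses import dataclass

@dataclass
class Result:
    correct: bool
    tokens: int
    latency_s: float

def passes_budget(r: Result, max_tokens: int = 2000, max_latency_s: float = 3.0) -> bool:
    # "Correct but unaffordable" fails exactly like "wrong."
    return r.correct and r.tokens <= max_tokens and r.latency_s <= max_latency_s
```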
Every manual review is a missing eval. Extract judgment into a rubric, automate it, evaluate the evaluator.
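A sketch of both halves: a manual review extracted into rubric checks, and the evaluator itself scored against human verdicts. The rubric items are illustrative, not a real review policy:

```python
from typing import Callable

# Hypothetical rubric: each item is a deterministic check on an output.
RUBRIC: dict[str, Callable[[str], bool]] = {
    "cites_source": lambda out: "[source]" in out,
    "under_200_words": lambda out: len(out.split()) <= 200,
}

def judge(output: str) -> bool:
    # The automated stand-in for the manual reviewer.
    return all(check(output) for check in RUBRIC.values())

def judge_agreement(labeled: list[tuple[str, bool]]) -> float:
    # Evaluate the evaluator: fraction of human verdicts it reproduces.
    return sum(judge(out) == verdict for out, verdict in labeled) / len(labeled)
```

A judge whose agreement with human labels is unmeasured is just another manual review with extra steps.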
Demos prove something works once. Evals prove it works under distribution shift.
Definitions, datasets, thresholds, results. Version control. Changelogs. Document why.
"Works on my machine" vs. "passes at p < 0.05." That gap is where defensible products get built.