cankun.me

The Verifier Is the Whole Game

Jun 14, 2026

There is a puzzle in how AI capability has developed, and once you see it clearly, a lot of confusing things snap into place. The puzzle is the lopsidedness. Models have become extraordinary at coding, not gradually, but on a steep curve that keeps steepening, while remaining merely competent at most other forms of serious intellectual work. The usual explanation is that coding is somehow easier or more structured. That explanation is wrong, and the right one is far more consequential.

The asymmetry

Reinforcement learning needs a reward. A reward needs to be checkable. Coding comes with a checker that is nearly free: the test suite, the compiler, the sandbox that either runs the program or does not. So RL can iterate on coding with abandon: millions of attempts, each one scored by an oracle that does not lie and does not get tired. The capability curve is steep because the feedback loop is tight, cheap, and trustworthy.
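To make the "nearly free checker" concrete, here is a minimal sketch of a test suite acting as a reward. Everything in it is an assumption made for illustration: the `TESTS` contract, the `verifier_reward` function, and the three hand-written attempts standing in for model outputs. It is not how any particular RL system is built, only the shape of the loop.

```python
from typing import Callable, List, Tuple

# Hypothetical contract, written in advance: (input, expected output) pairs
# for a function that should square its argument.
TESTS: List[Tuple[int, int]] = [(0, 0), (1, 1), (2, 4), (3, 9)]


def verifier_reward(candidate: Callable[[int], int]) -> float:
    """Return 1.0 if the candidate passes every test, otherwise 0.0."""
    for x, expected in TESTS:
        try:
            if candidate(x) != expected:
                return 0.0
        except Exception:
            # A crash counts as a failure, like a program that does not run.
            return 0.0
    return 1.0


# Three hypothetical attempts standing in for model outputs.
def attempt_double(x: int) -> int:
    return x * 2


def attempt_square(x: int) -> int:
    return x * x


def attempt_crash(x: int) -> int:
    return x // 0


# Scoring is cheap and mechanical, so it can be repeated millions of times.
rewards = [verifier_reward(f) for f in (attempt_double, attempt_square, attempt_crash)]
print(rewards)
```

Only the attempt that satisfies the whole contract earns a reward. The checker costs a few function calls and needs no human in the loop, which is the property the rest of the argument depends on.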

Most other domains have no such oracle. The reward, if it exists at all, is slow, expensive, noisy, or contested. RL starves. Whatever competence the model has in those domains comes mostly from passive absorption during pretraining, not from the active sharpening that RL provides where a verifier exists.

So the model is not "smarter" in some uniform way. It is honed razor-sharp wherever a cheap verifier happens to exist, and left blunt everywhere else. The shape of its capability is the shape of where verification is cheap.

What the verifier actually is

Here is the part that the easy explanation misses. Coding does not have a cheap verifier because coding is simple. Coding has a cheap verifier because software engineering, as a social practice, manufactured one. A test is not a fact of nature. It is a human-written contract that says, in advance: satisfy these conditions and you count as correct. The discipline compressed the question "is this right?" into a machine-checkable agreement, and then handed the agreement to a machine to enforce.

This is a remarkable thing to have done, and it is easy to forget how unusual it is. In most domains, "what counts as correct" cannot be written down in advance as a contract. It is a continuous, social, perpetually-reopenable negotiation. You cannot freeze it into a step-level reward function because it is not the kind of thing that holds still.

That is why RL cannot be fed in those domains. The problem is not that the signal is noisy. It is that the notion of "correct" is the product of a process, and a process cannot be compressed into a reward at each step. Where the model is sharp, it is sharp because humans already did the work of turning a social judgment into a machine contract. Where it is blunt, it is blunt because no such contract is possible.

The dangerous move

This sets up the failure that defines the current moment. Take the architecture that works gloriously where a verifier exists (autonomous agents, recursive self-improvement, humans pushed up to oversight) and transplant it into a domain with no cheap verifier. The architecture still runs. The agents still execute. But the thing that made the whole apparatus work, the oracle filtering every output for correctness, is simply absent.

What you get is not a slightly worse version of the same thing. You get a machine that produces, at superhuman scale and speed, outputs that look correct and have no relationship to truth. The recursive loop that meant "get better" where a verifier existed now means "get better at seeming right," because seeming-right is the only signal left when the oracle is gone. Scale, pointed at a domain without a verifier, amplifies plausibility rather than truth.
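A toy sketch of that substitution, under loudly stated assumptions: the `Candidate` type, the pool, and the `polish` scores are all invented for illustration, and `correct` stands for a ground truth that only a verifier could observe. The same selection step is run twice, once with the oracle and once without.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    label: str
    correct: bool   # ground truth; visible only to a verifier
    polish: float   # how convincing the output looks (made-up score)


# Hypothetical pool of outputs for one task.
POOL: List[Candidate] = [
    Candidate("terse but right", correct=True, polish=0.3),
    Candidate("fluent and wrong", correct=False, polish=0.9),
    Candidate("sloppy and wrong", correct=False, polish=0.1),
]


def select_with_verifier(pool: List[Candidate]) -> Candidate:
    """Filter by correctness first, then prefer the more polished output.

    Assumes the pool contains at least one correct candidate.
    """
    verified = [c for c in pool if c.correct]
    return max(verified, key=lambda c: c.polish)


def select_by_plausibility(pool: List[Candidate]) -> Candidate:
    """With no oracle, polish is the only signal left to optimize."""
    return max(pool, key=lambda c: c.polish)


with_oracle = select_with_verifier(POOL)
without_oracle = select_by_plausibility(POOL)
print(with_oracle.label)
print(without_oracle.label)
```

The selection pressure is identical in both functions. What changes is what it is pointed at: with the filter removed, the most convincing output wins whether or not it is right.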

And the most capable systems are the most dangerous here, because their outputs look the most fluent and well-grounded. They will produce conclusions that are internally consistent, well-cited, methodologically tidy, and wrong, and the polish that should signal quality instead provides cover.

Why this is not a passing phase

It would be comforting to think the blunt domains are just waiting their turn, that better models will eventually be sharp everywhere. For some domains, that is true; the verifier just has not been built yet. But for the domains where "correct" is genuinely a social, process-bound thing, the blunt edge is not a temporary lag. It is a structural boundary. You cannot give a step-level verifier to something whose correctness is, by its nature, the slow product of a collective process. The model can get arbitrarily good at the parts that can be contracted, and it will keep hitting the same wall on the part that cannot.

The consequence

If the verifier is the whole game, then the frontier is not where the algorithms are. The reinforcement learning literature has more or less converged on this already: the algorithms have become commodities, and the scarce, decisive resource is reward design, which is to say, verifier design. The teams that build the strongest systems are the ones that can specify and measure quality, not the ones with a cleverer optimizer.

Which means the hardest and most valuable problem is not building another autonomous system. It is the question everyone is routing around: in a domain with no cheap verifier, where does a trustworthy reward come from, and how do you keep optimization from collapsing into the production of plausible nonsense?

That question gets dressed up as a technical detail. It is the central problem. Everywhere a verifier is missing (serious science, genuine writing, real judgment), optimizing for "what a human finds good" quietly substitutes for "what is true," and the cost of that substitution is the texture of the output: confident, safe, fluent, and hollow. The whole game is whether you can find something to optimize against that is not just the satisfaction of the reader.
