The Verifier Is the Whole Game

Jun 14, 2026

There is a puzzle in how AI capability has developed, and once you see it clearly, a lot of confusing things snap into place. The puzzle is the lopsidedness. Models have become extraordinary at coding, not gradually, but on a steep curve that keeps steepening, while remaining merely competent at most other forms of serious intellectual work. The usual explanation is that coding is somehow easier or more structured. That explanation is wrong, and the right one is far more consequential.

The asymmetry

Reinforcement learning needs a reward. A reward needs to be checkable. Coding comes with a checker that is nearly free: the test suite, the compiler, the sandbox that either runs the program or does not. So RL can iterate on coding with abandon: millions of attempts, each one scored by an oracle that does not lie and does not get tired. The capability curve is steep because the feedback loop is tight, cheap, and trustworthy.
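To make the "nearly free checker" concrete, here is a minimal sketch of a test suite acting as a reward function. The names `verifier_reward`, `ADD_TESTS`, `good_attempt`, and `bad_attempt` are hypothetical, invented for illustration. Real RL pipelines run candidates in a sandbox and are far more involved; this only shows the shape of the signal.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical test contract: (input arguments, expected output) pairs.
TestCase = Tuple[tuple, object]


def verifier_reward(candidate: Callable, tests: Sequence[TestCase]) -> float:
    """Return the fraction of test cases the candidate passes (0.0 to 1.0)."""
    if not tests:
        return 0.0
    passed = 0
    for args, expected in tests:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            # A crashing candidate simply fails this test case.
            pass
    return passed / len(tests)


# Illustrative contract for "add two integers" (an assumption, not from the essay).
ADD_TESTS = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0), ((2, 2), 4)]


def good_attempt(a, b):
    return a + b


def bad_attempt(a, b):
    # Wrong operation; it still passes the two cases where a * b == a + b.
    return a * b


print(verifier_reward(good_attempt, ADD_TESTS))  # 1.0
print(verifier_reward(bad_attempt, ADD_TESTS))   # 0.5
```

Scoring an attempt costs a few function calls and needs no human in the loop, which is what lets the loop run millions of times.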

Most other domains have no such oracle. The reward, if it exists at all, is slow, expensive, noisy, or contested. RL starves. Whatever competence the model has in those domains comes mostly from passive absorption during pretraining, not from the active sharpening that RL provides where a verifier exists.

So the model is not "smarter" in some uniform way. It is honed razor-sharp wherever a cheap verifier happens to exist, and left blunt everywhere else. The shape of its capability is the shape of where verification is cheap.

What the verifier actually is

Here is the part that the easy explanation misses. Coding does not have a cheap verifier because coding is simple. Coding has a cheap verifier because software engineering, as a social practice, manufactured one. A test is not a fact of nature. It is a human-written contract that says, in advance: satisfy these conditions and you count as correct. The discipline compressed the question "is this right?" into a machine-checkable agreement, and then handed the agreement to a machine to enforce.

This is a remarkable thing to have done, and it is easy to forget how unusual it is. In most domains, "what counts as correct" cannot be written down in advance as a contract. It is a continuous, social, perpetually-reopenable negotiation. You cannot freeze it into a step-level reward function because it is not the kind of thing that holds still.

That is why RL cannot be fed in those domains: not because the signal is noisy, but because the notion of "correct" is the product of a process, and a process cannot be compressed into a reward at each step. Where the model is sharp, it is sharp because humans already did the work of turning a social judgment into a machine contract. Where it is blunt, it is blunt because no such contract is possible.

The dangerous move

This sets up the failure that defines the current moment. Take the architecture that works gloriously where a verifier exists (autonomous agents, recursive self-improvement, humans pushed up to oversight) and transplant it into a domain with no cheap verifier. The architecture still runs. The agents still execute. But the thing that made the whole apparatus work, the oracle filtering every output for correctness, is simply absent.

What you get is not a slightly worse version of the same thing. You get a machine that produces, at superhuman scale and speed, outputs that look correct and have no relationship to truth. The recursive loop that meant "get better" where a verifier existed now means "get better at seeming right," because seeming-right is the only signal left when the oracle is gone. Scale, pointed at a domain without a verifier, amplifies plausibility rather than truth.
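A toy simulation can illustrate the claim about scale. It is a sketch under a loud assumption: each candidate output has a "truth" score and a "polish" score drawn independently, meaning that how right an output is has no bearing on how right it looks. That independence is the extreme case chosen for illustration, not a measured property of any real system, and all names here are hypothetical.

```python
import random


def sample_candidates(n, seed=0):
    """Draw n candidates as (truth, polish) pairs, independent and uniform in [0, 1)."""
    rng = random.Random(seed)
    return [(rng.random(), rng.random()) for _ in range(n)]


def truth(candidate):
    return candidate[0]


def polish(candidate):
    return candidate[1]


def select(candidates, score):
    """Best-of-n selection: keep the candidate with the highest score."""
    return max(candidates, key=score)


for n in (10, 1000, 100000):
    pool = sample_candidates(n)
    by_verifier = select(pool, truth)   # an oracle scores truth directly
    by_reader = select(pool, polish)    # only "seems right" is available
    print(
        n,
        round(truth(by_verifier), 3),
        round(truth(by_reader), 3),
        round(polish(by_reader), 3),
    )
```

Under these assumptions, a larger pool pushes the polish of the reader-selected output toward its maximum, while its truth remains an arbitrary draw. More search buys more plausibility and nothing else.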

And the most capable systems are the most dangerous here, because their outputs look the most well-grounded. They will produce conclusions that are internally consistent, well-cited, methodologically tidy, and wrong, and the polish that should signal quality instead provides cover.

Why this is not a passing phase

It would be comforting to think the blunt domains are just waiting their turn, that better models will eventually be sharp everywhere. For some domains, that is true; the verifier just has not been built yet. But for the domains where "correct" is genuinely a social, process-bound thing, the blunt edge is not a temporary lag. It is a structural boundary. You cannot give a step-level verifier to something whose correctness is, by its nature, the slow product of a collective process. The model can get arbitrarily good at the parts that can be contracted, and it will keep hitting the same wall on the part that cannot.

The consequence

If the verifier is the whole game, then the frontier is not where the algorithms are. The reinforcement learning literature has more or less converged on this already: the algorithms have become commodities, and the scarce, decisive resource is reward design, which is to say, verifier design. The teams that build the strongest systems are the ones that can specify and measure quality, not the ones with a cleverer optimizer.

Which means the hardest and most valuable problem is not building another autonomous system. It is the question everyone is routing around: in a domain with no cheap verifier, where does a trustworthy reward come from, and how do you keep optimization from collapsing into the production of plausible nonsense?

That question gets dressed up as a technical detail. It is the central problem. Everywhere a verifier is missing (serious science, genuine writing, real judgment), optimizing for "what a human finds good" quietly substitutes for "what is true," and the cost of that substitution is the texture of the output: confident, safe, fluent, and hollow. The whole game is whether you can find something to optimize against that is not just the satisfaction of the reader.
