Querying a Black Box

Jun 14, 2026

Some kinds of work amount to debugging a system whose source code exists. Other kinds amount to interrogating a system that has no source code at all, that you can only poke from the outside, and that answers you through an instrument that may itself be lying. The two situations feel similar from the inside: both involve uncertainty, and both involve careful reasoning toward a conclusion. But they are epistemically different, in a way that explains almost everything about why automation works beautifully for one and sinks into the other like quicksand.

It is worth building the distinction slowly, because the philosophy that illuminates it is old, and the old version is sharper than the casual version that gets passed around.

Hume's problem, in its strong form

The induction problem is usually told softly: you have seen a thousand white swans, you cannot conclude the next is white. Told that way it sounds like a warning against overgeneralizing. Hume's actual version is much harder.

Why do you believe the future will resemble the past at all? Say "because, so far, the future has always resembled the past," and that is itself an inductive inference: induction used to justify induction, which is circular. Say "because nature is uniform," and why believe nature is uniform? Only because it has been so far, which is induction again, circular again. Hume's conclusion is not that induction is somewhat unreliable. It is that induction has no rational justification, none. Every "the sun will rise tomorrow," every "this drug will still work next week," is, as a matter of logic, exactly as groundless as its denial. We make these inferences anyway, Hume says, not because we have a reason, but out of habit: a psychological fact, not a logical entitlement.

This matters because the entire enterprise of reasoning from data to conclusion rests on the step Hume dissolved. When you go from "in my sample, X and Y are associated" to "X and Y are really associated," there is no logical bridge. The work is not useless, but logical validity was never the thing holding it up. Something else is. What that something else might be is the question the next two centuries of philosophy argue about.

Popper's escape, and what it costs

Popper accepts that Hume won (induction has no logical justification) and says the mistake was thinking science runs on induction at all. It does not.

You can never verify a universal claim by observation, because the next observation might break it. But you can falsify it with a single counterexample. Verification and falsification are logically asymmetric: no number of white swans confirms "all swans are white," one black swan refutes it. So science is not the accumulation of supporting evidence until you are sure. It is the proposal of bold, falsifiable conjectures, and the attempt to kill them. What survives the attempt is provisionally kept. A theory's scientific status is not how much evidence supports it but whether it sticks its neck out: whether it says, in advance, what observation would prove it wrong.
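The asymmetry can be made concrete with a small sketch. The function and variable names below are illustrative assumptions, not anything from Popper: a universal claim is modeled as a predicate, and the only thing a check can ever return is a counterexample or nothing.

```python
from typing import Callable, Iterable, Optional


def find_counterexample(
    claim: Callable[[str], bool], observations: Iterable[str]
) -> Optional[str]:
    """Return the first observation that violates the claim, or None."""
    for obs in observations:
        if not claim(obs):
            return obs
    return None


# Hypothetical universal claim: "all swans are white".
def all_swans_are_white(color: str) -> bool:
    return color == "white"


thousand_white = ["white"] * 1000

# A thousand agreeing observations: no counterexample found.
# That means "not yet refuted", never "verified".
survived = find_counterexample(all_swans_are_white, thousand_white)

# One disagreeing observation is enough to refute the claim.
refuted = find_counterexample(all_swans_are_white, thousand_white + ["black"])
```

Note that the function has no return value meaning "confirmed." The best a claim can do is come back with `None`, and it has to earn that again on every new batch of observations.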

There is a quiet, important consequence here. For a conjecture to have any falsifying power, the prediction must come before the evidence. State what you bet you will see, then look. If you look first, see the result, and then construct an explanation, you have not tested anything, because any result can be fit with some story after the fact, and a story that explains everything has been refuted by nothing. The temporal order is not bookkeeping. It is the whole difference between a test and a rationalization.

Why falsification is never clean

Popper is beautiful and incomplete, and the incompleteness has a name: the Duhem-Quine problem. When an experiment contradicts a theory, what exactly has been falsified?

A prediction is never derived from a theory alone. It comes from the theory plus a large cloud of auxiliary assumptions: the instrument works, the reagents are pure, the sample is clean, the statistical model is appropriate, that one confounder you never thought of is absent. When the result conflicts, logic tells you only that something in the whole bundle is wrong. It does not tell you which. So you can always rescue the theory you love by blaming an auxiliary: "must be the equipment, run it again." Falsification is never the clean single stroke Popper imagined. A black swan can always be explained away as "not really a swan" or "I misread."
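A rough way to see how little a failed prediction pins down is to count. The sketch below assumes, purely for illustration, that the prediction follows from the conjunction of the theory and four named auxiliaries (the names are hypothetical labels taken from the list above), and enumerates every way the bundle could be wrong.

```python
from itertools import product

# Hypothetical bundle: the theory plus four auxiliary assumptions.
ASSUMPTIONS = [
    "theory",
    "instrument_works",
    "reagents_pure",
    "sample_clean",
    "model_appropriate",
]


def consistent_with_failure(names):
    """All truth assignments in which at least one assumption is false.

    A failed prediction only tells us the conjunction is false,
    so every such assignment remains logically possible.
    """
    worlds = []
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        if not all(world.values()):
            worlds.append(world)
    return worlds


worlds = consistent_with_failure(ASSUMPTIONS)

# Assignments in which the theory is still true and some auxiliary takes the blame.
theory_survives = [w for w in worlds if w["theory"]]
```

With five assumptions, 31 assignments remain open after the failure, and the theory is still true in 15 of them. Logic alone eliminates exactly one possibility: that everything in the bundle was right.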

In domains where the instruments are transparent, this rarely bites: when you read a variable in a debugger, the debugger does not lie, so a failed test usually means the code is wrong, and blame is locatable. But where the instrument is itself opaque, where the measuring apparatus is its own black box you don't fully understand, the blame for a contradiction cannot be cleanly assigned. You wanted to interrogate one black box, and your interrogation tool turned out to be another.

Map and territory

All of this resolves into a single picture. Every model, every claim, every hypothesis is a map. The territory is the real world, existing independently of any map. Hume's problem, Popper's asymmetry, and the Duhem-Quine underdetermination all flow from one fact: we only ever get to compare maps against maps. We never get to step outside all maps and grab the territory directly to check.

You want to verify a claim. With what? Another observation, which is itself a map, mediated by your instrument, your method, your conceptual frame. You are forever on the map side, and the territory intervenes only in a few places where it is forced to answer directly: the genuine experiment, and above all the experiment that puts a question to the irreducible complexity of the real world over real time. Those are the rare moments the territory talks back. Everywhere else, what you take for a fact is a map agreeing with a map.

The unifying frame: querying a black box

Put it together and the distinction at the start becomes precise. Some work is debugging a white box: a system with source code, whose state you can read directly, whose measurements are transparent. Other work is querying a black box: a system with no source code, that you cannot read but only perturb, whose every answer comes back through a noisy, possibly-deceptive instrument, and which never even promised that its underlying rules are simple or stable.

This one frame explains the whole asymmetry. Why one kind of work has a cheap verifier and the other does not: the white box's state can be read directly and compared, while the black box's state can only be perturbed and inferred through noise. Why automation soars on one and stalls on the other: querying the white box is instant, certain, infinitely repeatable; querying the black box is slow, noisy, and never exactly reproducible. Why plausibility is so dangerous in one and not the other: in the white box a plausible error is quickly exposed by transparent measurement, while in the black box a plausible error can hide forever behind "maybe it was the instrument." And why the deepest verifier is irreducible: to query the ultimate black box, the real world, in its full diversity, over real time, there is no cheaper query that substitutes, because every cheaper query is a different black box, separated from the real one by another map-territory gap that cannot be cleanly crossed.
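The difference in query cost can be sketched as a toy model. Everything here is an assumption made for illustration: the class names, the Gaussian noise, and the constant `bias` standing in for an instrument that misreports. Real black boxes are under no obligation to be this well behaved.

```python
import random
import statistics


class WhiteBox:
    """State is readable directly: one read, exact, repeatable."""

    def __init__(self, state: float) -> None:
        self.state = state


class BlackBox:
    """State is hidden; each query returns it through a noisy instrument."""

    def __init__(self, hidden: float, noise_sd: float, seed: int, bias: float = 0.0) -> None:
        self._hidden = hidden
        self._noise_sd = noise_sd
        self._bias = bias  # hypothetical systematic error in the instrument
        self._rng = random.Random(seed)

    def query(self) -> float:
        return self._hidden + self._bias + self._rng.gauss(0.0, self._noise_sd)


def estimate(box: BlackBox, n: int) -> float:
    """Infer the hidden state by averaging n noisy queries."""
    return statistics.fmean(box.query() for _ in range(n))


white = WhiteBox(0.3)
honest = BlackBox(hidden=0.3, noise_sd=1.0, seed=0)
biased = BlackBox(hidden=0.3, noise_sd=1.0, seed=1, bias=0.5)

exact = white.state                      # one read, no uncertainty
approx = estimate(honest, 10_000)        # many queries, still only approximate
misled = estimate(biased, 10_000)        # converges, but to the wrong value
```

Averaging shrinks random noise, so the honest instrument gets close to the hidden value after thousands of queries. It does nothing about the biased instrument, whose estimate settles confidently near 0.8 instead of 0.3, and nothing inside the black box's own answers reveals the difference.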

The frame does not solve the problem. Nothing solves it; it is the structure of being on the map side. But it tells you what kind of problem you are in, which is the prerequisite for not lying to yourself about what your conclusions are worth. The discipline that follows is simple to state and hard to keep: never let "I have a map of it" pass for "I have touched the territory." The map can be excellent. The territory still has not spoken.

