cankun.me

The Garden, the Multiverse, and What History Cannot Teach

Jun 14, 2026

There is a curse at the heart of data analysis that, once you see it clearly, turns out to contain its own partial cure; and the cure, pushed far enough, runs into a wall that is not an engineering limit but a fact about what discovery is. The whole arc is worth walking, because it connects a famous statistical worry to the question of what a machine could ever be trained to do.

The garden of forking paths

The worry is Andrew Gelman's, and its power is in how it survives every defense you would normally raise. The naive concern about untrustworthy results is that someone went fishing: tried analysis after analysis until something crossed the significance threshold. The forking-paths argument is that even a completely honest analyst, who fixed their hypothesis in advance and ran exactly one analysis, can produce an untrustworthy result.

Why? Because the analysis is full of small decisions made after seeing the data: how to filter outliers, which normalization, how to group, whether to drop a sample, which model, where to set a threshold. Each decision, on its own, is reasonable. But each is data-dependent: had the data looked different, the analyst would have decided differently. So the paths not taken still bear on the credibility of the result. The analyst walked one route through a garden of thousands of branching paths, believing it was the only sensible route, when in fact a slightly different dataset would have sent them down another, to a different conclusion. Every untraveled branch is an invisible deduction from how much the single result should be believed. No conscious cheating required.

Turning the curse into a resource

Traditionally this is a curse precisely because you cannot walk all the paths. There is one analyst, one lifetime, one route. But that constraint is exactly the one that cheap, capable agents dissolve. When running an analysis is nearly free, you can actually walk most of the paths. You are no longer choosing one pipeline and praying it is robust; you can run the field's whole space of reasonable pipelines and observe how a conclusion behaves across them.

This is the industrialized form of the cure: turn the garden of forking paths from an enemy into a resource. A conclusion is no longer a binary that holds or does not. It has a robustness profile: a map of how stable it is across the space of tool choices, parameter settings, and data subsets. Does it survive switching the differential-expression method? Survive moving the threshold? Survive a bootstrap of the data? The profile is rich, and, crucially, it requires no verifier. You do not need to know which pipeline is correct. You only need to observe the distribution of the conclusion across all reasonable pipelines. Which is why it works in a domain with no cheap ground truth, and why it is suited to a setting where compute is abundant and truth is scarce.

Three things robustness can reward, and which two matter

Once you can run the whole space, you can extract a training signal from it; but there are three different things you could reward, and they are not the same.

You could reward robust conclusions directly: a result that holds across ninety percent of reasonable pipelines gets a high score. This trains the model to produce robust conclusions; but it punishes true-but-fragile findings. Some real discoveries are visible only under one specific method, because only that method had the power to see them. Rewarding convergence breeds a conservative parrot that only reports what holds no matter how you look, which is the same as only reporting the bland.

You could instead reward honest calibration: not "the conclusion is robust," but "the model's stated confidence in its conclusion's robustness was accurate." The model claims something is robust; you run the multiverse to check; if it was right, reward, if it overclaimed, penalize. This trains the model to know, honestly, how fragile its own conclusions are. It does not punish fragile discoveries; it punishes calling a fragile thing robust. This is the anti-laundering principle as a reward.

And you could reward noticing disagreement: when two pipelines diverge, that divergence is not noise, it is information, pointing at a methodologically sensitive spot that might be real biological heterogeneity or might be one tool's artifact. A good system notices and explains the divergence rather than silently picking one. This turns disagreement into a source of value, specifically, into a guide for which next experiment would resolve it.

The calibration reward and the disagreement reward are the ones worth having, because they train honesty and judgment. The convergence reward is a trap that trains a coward. And all three are free of any verifier; they need only the multiverse and the compute to run it.

The trap inside the cure

But there is a flaw at the heart of all this, and it is the same one the whole map-territory picture predicts. "All the mainstream tools agree" does not mean "close to the truth."

The field's mainstream tools share enormous amounts of structure: the same assumptions, the same statistical frame, sometimes the same underlying bugs. They may agree not because the conclusion is right but because they are wrong in the same way. If the entire field's methodology assumes some false premise, then "runs through all the mainstream tools and stays stable" yields a high robustness score for a conclusion that is wholly wrong. This is the map-territory gap in its exact local form: walk all the maps, and you get the maps' consensus, not the territory. Multiverse agreement verifies stability within the field's methodological consensus, not truth. The gap between those two is precisely the space in which an entire field can be collectively mistaken.

So the robustness signal must honestly mark itself as "stable within methodological consensus", agreed, not verified. The only thing that closes the gap is the territory talking back: the rare ground-truth anchors, used not as dense training signal but to calibrate how well consensus-robustness actually tracks truth. Discovering that gap, measuring how often a high-robustness conclusion later turns out false, is itself among the most valuable things one could produce, because it quantifies the distance between the consensus of maps and the territory.

What history can and cannot teach

There is a seductive way to manufacture a verifier for free: use past scientific discoveries whose answers history has already revealed. Give a model the situation a scientist faced before the discovery, let it infer, and reward it against the answer that was later confirmed. The slow, expensive verifier is replaced by one history already ran. It is clever, and it has two failure modes, of very different severity.

The shallow one is answer leakage. The model read every textbook in pretraining; it does not infer the answer, it retrieves it, and the reward then trains "remembering" rather than "inferring." This is partly fixable: use very new or unpublished findings the model could not have seen; or reward the quality of the reasoning path rather than the answer; or, most interestingly, use historical cases that were later overturned, and reward not reproducing the old conclusion but identifying its fragility, the experiment that would break it. That last move turns leakage from a bug into a feature: the model knows the later truth, but it never memorized how to recognize, at the time, that the old consensus would fall; and that recognition cannot be retrieved.

The deep failure mode cannot be fixed, and it is the more important one. Hindsight does not just reveal the answer; it reshapes the question. The scientist before the discovery faced an un-conceptualized mess: they did not know which variables to measure, what to ask, what was even relevant. The discovery revealed, simultaneously, how to see the problem. Constructing the training environment today, you stand on the far side of the answer; the data you hand the model, the variables, the very framing of the task, already leak the discovery. The hardest part, realizing what to ask, inventing the new concept the old frame could not hold, has been pre-solved by your environment. The model only fills in the last step on a stage you already set.

This points at the unfixable thing, and it is the same wall everything else in this terrain runs into. You can train inference within an already-correct framing. You cannot train the creation of a new framing, because training requires ground truth, and the ground truth for a new frame does not exist until the frame has been created. Once the answer is known, the new concept already exists, and the act of discovering it has vanished. This is not an engineering limit. It is the logical structure of discovery: the irreducible part of a real discovery, realizing the old vocabulary cannot hold what you are seeing, is precisely the part no environment with a known answer can ever contain.

The corner that remains

What survives all of this is narrow and, for that reason, valuable. The overturned consensus of the past is a badly underpriced resource, because its reward, was later refuted, is a definite historical fact, while the capability it would train, recognizing, at the time, that it would be refuted, cannot be gotten by retrieving an answer. You cannot train a machine to reproduce science's successes without it cheating. You might be able to train it to reproduce science's self-correction; and self-correction, the organized skepticism that is the real source of trustworthy knowledge, is the one thing history offers in enormous, well-labeled supply. The goal worth aiming at may not be a machine that discovers, but a machine that doubts well. And of doubt, done honestly and then vindicated by history, there is no shortage of training data at all.

← Back to Writing