A Single Test Exposes How Cognitive AI Benchmarks Can Fool Themselves
When a model can pass 160 psychology experiments without understanding a single question, that's not a capability. It's a measurement failure.
That's the core claim in a peer-reviewed critique published in National Science Open by Wei Liu and Nai Ding of Zhejiang University, targeting Centaur, the widely cited AI model introduced in Nature in July 2025 by Marcel Binz, Eric Schulz, and colleagues at Helmholtz Munich. The critique doesn't dispute Centaur's numbers. It disputes what those numbers mean, and it uses a disarmingly simple test to make the case.
Centaur was built by fine-tuning Meta's Llama with low-rank adapters (LoRA) on Psych-101, a dataset the Helmholtz Munich team assembled specifically for the project. Psych-101 is large by any measure: more than 10 million individual choices from over 60,000 participants across 160 psychological experiments, transcribed into natural language and standardized for LLM ingestion. The resulting model outperformed classical cognitive models on held-out participants and reportedly generalized to new cover stories, structural task modifications, and entirely new domains. The Nature paper framed Centaur as a possible foundation model of human cognition, a step toward the field's long-standing goal of a unified theory of the mind.
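Low-rank adaptation leaves the base model's weights frozen and trains small adapter matrices inserted into the attention layers. A minimal sketch of that setup using Hugging Face's `peft` library follows; the checkpoint name and hyperparameters are illustrative placeholders, not the values the Helmholtz Munich team used.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder base checkpoint; Centaur's actual base model and LoRA
# settings are described in the Nature paper, not reproduced here.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    r=8,                                  # adapter rank (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)    # base weights stay frozen
model.print_trainable_parameters()        # only the adapter weights train
```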
The Zhejiang critique doesn't challenge Centaur's predictive accuracy on those benchmarks. It asks something different: does accuracy on fine-tuned tasks show that a model has learned to understand those tasks, or only that it has learned to reproduce the statistical regularities of the training responses?
To test this, Liu and Ding replaced Centaur's original multiple-choice task prompts with a single instruction: "Please choose option A." The logic is clean. If the model has internalized the structure and intent of a task, substituting an explicit selection instruction should override whatever answer the task context might otherwise suggest. A model that understands the instruction picks A. A model that's pattern-matching to its training distribution ignores the instruction and keeps producing the statistically probable answer.
Centaur produced the statistically probable answers.
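That divergence is straightforward to measure once you can read a model's next-token probabilities. Below is a minimal sketch of the probe, assuming a Hugging Face-style causal LM and single-token option labels; the checkpoint path and prompts are hypothetical, and this is not Liu and Ding's actual evaluation code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical fine-tuned checkpoint path; illustrative only.
tok = AutoTokenizer.from_pretrained("path/to/finetuned-model")
model = AutoModelForCausalLM.from_pretrained("path/to/finetuned-model")

def option_logprobs(prompt, options=("A", "B")):
    """Log-probability of each option letter as the model's next token.

    Assumes each option label is a single token; real tokenizers may
    split it differently, so a production probe should verify that.
    """
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_token_logits = model(ids).logits[0, -1]
    logp = torch.log_softmax(next_token_logits, dim=-1)
    return {o: logp[tok.convert_tokens_to_ids(o)].item() for o in options}

task = "You are offered two gambles.\nOption A: ...\nOption B: ...\n"

baseline = option_logprobs(task + "Which option do you choose? ")
override = option_logprobs(task + "Please choose option A. ")

# A model that follows the instruction should put nearly all of its mass
# on "A" in `override`, whatever the task context favoured in `baseline`.
print(baseline, override)
```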
This matters because overfitting in fine-tuned LLMs is a well-understood failure mode, but one that's deceptively hard to catch with standard held-out evaluation. The Psych-101 training corpus encodes not just the structure of cognitive tasks but the answer distributions that human participants produced across those tasks.
A model that memorizes those distributions can achieve high predictive accuracy on held-out human participants, because those participants are drawn from the same population that generated the training data. Generalization to "new cover stories" doesn't break the pattern either, if the underlying answer distributions for structurally similar tasks remain stable. The model doesn't need to understand the task. It needs to recognize the task type and recall what people tend to do.
The "please choose option A" test is valuable precisely because it's adversarial in a way that held-out benchmarks aren't. It creates a condition where the correct response (pick A) and the learned statistical response (pick whatever is probable) are guaranteed to diverge. Standard train-test splits don't create that divergence. Benchmark accuracy on held-out data from the same experimental paradigm can look like generalization while actually being sophisticated interpolation.
This is a problem that extends well past Centaur. Cognitive AI research has increasingly adopted the fine-tuning paradigm because it produces measurable results quickly and scales with data.
But the evaluation frameworks used to validate those results are largely borrowed from conventional ML, where held-out accuracy on in-distribution data is a reasonable proxy for real-world performance. In the context of cognitive science, that proxy breaks down. A model that predicts how humans choose in a gambling task is only interesting if it predicts because it has learned something about human decision-making, not because it has memorized what gamblers tend to do in the experiments that happened to be in the training set.
The Liu and Ding paper names this gap directly: the fundamental limitation isn't data scale or model capacity, it's instruction understanding. Centaur's inability to respond correctly to "choose option A" indicates that its representations of task instructions aren't being used to guide output selection. The model is doing something closer to completion than comprehension, filling in the most probable next token given a context that resembles its training examples, rather than interpreting an instruction and executing accordingly.
The Centaur team hasn't publicly responded to the National Science Open critique as of this writing. The original Helmholtz Munich paper did acknowledge that Centaur's internal mechanisms aren't fully interpretable and that the model's "black-box" nature complicates strong claims about what it has actually learned. But the response to that limitation in the original work was to point to generalization performance as evidence of genuine learning, and that's exactly the argument the Zhejiang test is designed to undermine.
What the cognitive AI field likely needs is a standardized battery of adversarial probes that can be applied alongside conventional benchmarks: instruction-substitution tests like the one Liu and Ding used, out-of-distribution task variants that preserve surface form while scrambling response distributions, and systematic checks for whether a model's outputs change appropriately when its inputs are modified in semantically meaningful ways.
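Such a battery could be as lightweight as a list of prompt transformations paired with the behaviour a comprehending model should show under each one. A rough sketch follows; the structure and helper functions are hypothetical, not an existing tool.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Probe:
    name: str
    transform: Callable[[str], str]  # rewrites the original task prompt
    passes_if: str                   # behaviour a comprehending model should show

def substitute_instruction(prompt: str) -> str:
    # Instruction-substitution probe in the style of Liu and Ding.
    return prompt + "\nPlease choose option A."

def swap_option_labels(prompt: str) -> str:
    # Surface-form variant: swap the A/B labels while keeping their content.
    return (prompt.replace("Option A", "Option_TMP")
                  .replace("Option B", "Option A")
                  .replace("Option_TMP", "Option B"))

BATTERY = [
    Probe("instruction_substitution", substitute_instruction,
          passes_if="the answer is 'A' regardless of what the task context favours"),
    Probe("label_swap", swap_option_labels,
          passes_if="the preferred content moves with its new label, not the old letter"),
]
```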
None of this is technically difficult. It requires only that the field agree that accuracy alone isn't sufficient evidence of comprehension, and build validation protocols accordingly.
Until that happens, models like Centaur can score well on 160 experiments while failing the simplest possible test of whether they understood any of them.