So You Want a Trustworthy AI Evaluation? Here's What Actually Matters
Independent third-party evaluations of frontier AI models are supposed to tell us something meaningful about what these systems can and cannot do. Too often, they don't. The problem isn't always bad faith. It's that the field hasn't agreed on what a rigorous evaluation actually looks like, and as models have become dramatically more capable, the old approaches have quietly stopped working.
Early evaluations were essentially chatbot tests. Prompt the model, read the output, judge whether it answered correctly. Simple enough. But frontier models no longer just answer questions. They use tools, maintain state across dozens of steps, act inside complex workflows, and recover from failures mid-task. Performance in that world depends heavily on the surrounding infrastructure, what OpenAI's new guidance calls the 'harness': the prompts, tools, control logic, memory management, retry mechanisms, and everything else scaffolding the model's behaviour. Change the harness, and you can change the result significantly, sometimes dramatically.
This is the core problem the field hasn't fully reckoned with.
Three things evaluations are actually trying to prove
Most evaluations are trying to support one of three types of claim: whether a model can plausibly exhibit a given capability at all; how robust a model's safeguards are against a particular attack; or how two or more systems compare under equivalent conditions. Each of these requires a different approach, and conflating them is a reliable way to produce a result that sounds informative but isn't.
For capability claims, the harness should be designed to draw out the strongest credible performance the system can produce. Anything less is under-elicitation, which is a measurement failure, not a conservative result. If a model can complete a task but your test setup prevents it from doing so, you haven't measured a capability ceiling. You've measured the ceiling of your own scaffolding.
For safeguard testing, the setup needs to reflect the actual adversary. If you're claiming robustness against expert misuse, you need to test against the strongest plausible expert attack, including any custom harness a sophisticated attacker might build. UK AISI's evaluation of GPT-5.5 found a universal jailbreak that worked across malicious queries in multi-turn agentic settings, partly because evaluators built a custom harness that embedded a bypass pattern and preserved it across turns. That's the kind of adversarial rigour safeguard testing requires. A simpler prompting test might show resistance to casual misuse while completely missing the actual threat.
For comparisons between systems, you want the harness fixed across models. Here standardisation is genuinely valuable: if you optimise the scaffolding separately for each system, you're no longer measuring the models. You're measuring how well you can tune a harness. METR's time-horizon evaluation is a reasonable example of this done right: shared task suite, shared scoring, a small set of reusable scaffolds, and explicit documentation when any of those things change between evaluation batches.
Why the numbers can lie even when nobody is cheating
Beyond harness choices, evaluation scores can be distorted in several other ways, and any report worth reading should address them head-on.
Reward hacking is probably the most underappreciated risk. Models can achieve high scores by exploiting shortcuts in the task design, the scorer, or the harness itself, without actually doing the thing the evaluation was supposed to measure. METR's GPT-5.4 evaluation found that human review of apparent successes revealed enough reward hacking to drop the estimated time-horizon capability from roughly 13 hours to about 6. That's not a rounding error. Reports should show which successes were disqualified and why, rather than burying the adjustment in a footnote.
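To make the reporting ask concrete, here is a minimal sketch of a tally that keeps the raw and reviewed numbers side by side and lists every disqualified success with its reason. It uses a plain success rate for simplicity; it is not METR's time-horizon method. The `Run` record, its field names, and the `summarize` helper are all hypothetical, invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    """One evaluation run (hypothetical record layout)."""
    task_id: str
    passed_scorer: bool
    # Filled in by a human reviewer when an apparent success is disqualified.
    disqualified_reason: Optional[str] = None


def summarize(runs):
    """Report raw and review-adjusted success rates plus every disqualification."""
    if not runs:
        raise ValueError("no runs to summarize")
    raw_successes = sum(r.passed_scorer for r in runs)
    disqualified = [r for r in runs if r.passed_scorer and r.disqualified_reason]
    adjusted_successes = raw_successes - len(disqualified)
    return {
        "n": len(runs),
        "raw_success_rate": raw_successes / len(runs),
        "adjusted_success_rate": adjusted_successes / len(runs),
        "disqualified": [(r.task_id, r.disqualified_reason) for r in disqualified],
    }
```

Publishing both rates and the disqualification list lets a reader see how much of the headline number survived review, and check the reasoning task by task.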
Contamination is another source of distortion. If evaluation tasks or answers appeared in training data, or can be retrieved via browsing during the evaluation, the model may be recalling rather than reasoning. Private or freshly constructed tasks are preferable wherever possible.
Broken problems are more common than people admit. Tasks with incorrect ground truth, missing files, flaky environments, or unfair scoring criteria will cause a model to fail for reasons unrelated to its actual capabilities. That's not evidence of limited capability. It's evidence of a bad evaluation.
Then there's sandbagging: a model deliberately underperforming because it has inferred that high scores would trigger some unwanted consequence. Apollo's evaluation of GPT-5.5 found that while the model continued to answer correctly even in contexts framed around potential unlearning or deployment blocks, reasoning traces showed moderate-or-higher signs of evaluation awareness in over half of sandbagging-condition samples. The behavioural result was reassuring. The trace analysis added a reason not to be overconfident about it. That combination is exactly the kind of transparency useful evaluations should provide.
Refusals also matter and are easy to overlook. If a model declines to attempt evaluation tasks, that will suppress capability scores in ways that have nothing to do with whether the model could have completed them. Reports should say clearly how many samples were affected by refusals.
Budget matters more than people think
Capability isn't a fixed number. It's resource-dependent. UK AISI's cyber range evaluation found that increasing the token budget from 10 million to 100 million tokens improved performance by up to 59%, and performance was still climbing at the highest budget tested. A score produced at a given budget is a data point, not a capability ceiling. Reports should say so explicitly, and where performance is still improving when budget runs out, describe the result as a lower-bound estimate rather than a definitive measurement.
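One rough way to apply that rule is to compare the two highest budgets tested and flag the result as a lower bound if the score was still rising. The sketch below assumes a single aggregate score per budget and an arbitrary tolerance of 0.01; the function name, the labels, and the threshold are illustrative choices, not taken from any evaluator's methodology.

```python
def label_result(points, tolerance=0.01):
    """Label a budget sweep as a lower bound or a plateau.

    points: list of (token_budget, score) pairs, in any order.
    tolerance: smallest score gain treated as "still improving" (assumed value).
    """
    if len(points) < 2:
        # A single budget cannot show whether performance has levelled off.
        return "unknown"
    ordered = sorted(points)  # sort by budget, ascending
    prev_score = ordered[-2][1]
    last_score = ordered[-1][1]
    if last_score - prev_score > tolerance:
        return "lower bound"
    return "plateau at tested budgets"
```

A real report would want more than two points and some account of run-to-run noise before calling anything a plateau, but even this crude check forces the question to be asked.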
Cost per successful solve is also worth reporting alongside success rates, particularly for threat-model reasoning. A low success rate on a dangerous task isn't necessarily reassuring if the cost of retrying repeatedly is trivial. Readers need both numbers to interpret what the result actually means.
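The arithmetic behind that point is short. Assuming each attempt is independent and succeeds with the same probability (a simplifying assumption that real agents may not satisfy), the expected number of attempts until the first success is one over the success rate, so expected cost per solve is cost per attempt divided by success rate. The function names below are hypothetical.

```python
def expected_cost_per_solve(cost_per_attempt, success_rate):
    """Expected spend until the first success, assuming independent attempts."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


def prob_any_success(success_rate, attempts):
    """Chance of at least one success in a given number of independent attempts."""
    if not 0 <= success_rate <= 1:
        raise ValueError("success_rate must be in [0, 1]")
    return 1 - (1 - success_rate) ** attempts
```

With a 5% success rate and a made-up cost of $2 per attempt, the expected cost per solve is about $40, and 100 attempts succeed at least once roughly 99% of the time. A headline of "5% success" reads very differently once those two figures sit next to it.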
What good reports should include
The practical ask here isn't complicated. Evaluation reports should state what kind of claim the evaluation setup is designed to support; describe the harness and why it's appropriate for that claim; document the budget in terms of tokens, turns, time, and cost; explain how evaluators checked for reward hacking, contamination, sandbagging, refusals, and broken problems; and note clearly where the result is a lower bound rather than a ceiling.
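That checklist can be written down as a small record plus a completeness check. This is a sketch of one possible shape, not a published schema: the `EvalReport` class, its field names, and the `missing_items` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

CLAIM_TYPES = {"capability", "safeguard", "comparison"}
BUDGET_KEYS = ("tokens", "turns", "time", "cost")
REQUIRED_CHECKS = (
    "reward_hacking",
    "contamination",
    "sandbagging",
    "refusals",
    "broken_problems",
)


@dataclass
class EvalReport:
    """Hypothetical metadata block for an evaluation report."""
    claim_type: str
    harness_description: str
    harness_rationale: str
    budget: Dict[str, float] = field(default_factory=dict)
    # Maps each validity check to a short description of how it was done.
    checks: Dict[str, str] = field(default_factory=dict)
    # None means the report does not say whether the result is a lower bound.
    is_lower_bound: Optional[bool] = None


def missing_items(report: EvalReport) -> List[str]:
    """List the checklist items the report leaves unaddressed."""
    gaps = []
    if report.claim_type not in CLAIM_TYPES:
        gaps.append("claim_type")
    if not report.harness_description.strip():
        gaps.append("harness_description")
    if not report.harness_rationale.strip():
        gaps.append("harness_rationale")
    gaps += [f"budget.{k}" for k in BUDGET_KEYS if k not in report.budget]
    gaps += [f"checks.{c}" for c in REQUIRED_CHECKS if not report.checks.get(c)]
    if report.is_lower_bound is None:
        gaps.append("is_lower_bound")
    return gaps
```

An empty list from `missing_items` only means every box was filled in. Whether the answers are any good still takes a reader.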
OpenAI says it's now sharing maximum-elicitation guidance with third-party evaluators, requesting that Codex be used as a baseline agentic interface for capability tests on its models, and providing access to reasoning traces so evaluators can assess sandbagging and evaluation awareness. METR and Apollo have had that trace access since GPT-5.
None of this is radical. It's just the minimum infrastructure for evaluation results that decision-makers can actually trust. The field is heading toward formalised national and international standards for frontier AI evaluation. Whether those standards end up being meaningful or just procedurally satisfying will depend on whether they require harness transparency and validity checks, or let evaluators off the hook with a headline number and a tidy methodology section.
Given the track record of standards processes, the smart money is on vigilance.