Claude Code Wrote a Better AI Scaling Algorithm Than Humans Could — For $40
Test-time scaling means spending extra compute at inference to get better answers out of a model. The standard approach goes something like this: researchers hand-design rules that tell a model when to spin up new solution paths, when to push further down an existing one, and when to give up. It works, more or less, but it's slow, expensive, and heavily dependent on human intuition about a search space that's genuinely hard to reason about.
A team spanning UMD, UVA, WUSTL, UNC, Google, and Meta decided to skip the hand-crafting entirely. Their system, AutoTTS, doesn't ask humans to write the algorithm. It asks humans to build the environment where an AI agent can go find one.
The idea hinges on a neat observation: most existing test-time scaling methods are really just different paths through the same two-dimensional space. Width is how many solution paths you run at once. Depth is how far each one goes. Beam search, self-consistency, and sequential refinement each trace their own route through it. So instead of plotting another route manually, why not let a machine search for a better one?
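To make the width-depth picture concrete, here is a minimal Python sketch. It treats a strategy as a single fixed point in that space, which is a simplification: real methods move through the space as they run. The Allocation class and the budget figures are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    """A fixed point in the width-depth space (hypothetical helper)."""
    width: int  # number of solution paths run in parallel
    depth: int  # rounds of reasoning spent on each path

    @property
    def cost(self) -> int:
        # Total rounds paid for, assuming every path runs to full depth.
        return self.width * self.depth


# Illustrative budgets only: three ways to spend the same 64 rounds.
self_consistency = Allocation(width=64, depth=1)       # many short, independent paths
sequential_refinement = Allocation(width=1, depth=64)  # one path, revised repeatedly
balanced = Allocation(width=8, depth=8)                # somewhere in between

for name, alloc in [("self-consistency", self_consistency),
                    ("sequential refinement", sequential_refinement),
                    ("balanced", balanced)]:
    print(f"{name}: width={alloc.width}, depth={alloc.depth}, cost={alloc.cost}")
```

All three spend the same budget. The open question, and the thing AutoTTS searches over, is how to move between such points adaptively rather than fixing one in advance.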
To keep costs sensible, AutoTTS runs offline. The team pre-generates solution paths from the target language model and stores them. A candidate control algorithm then decides how to allocate compute against that cached data, which means you can evaluate thousands of variants without running the model live each time.
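A rough sketch of what replaying a controller against cached paths could look like follows. The cache layout (each path as a list of token-cost and answer-so-far checkpoints), the action vocabulary, and the replay function are all assumptions made for illustration. The paper's actual data format isn't described here.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple, Union

# A cached path is a non-empty list of checkpoints: (tokens spent on this step, answer so far).
Checkpoint = Tuple[int, Optional[str]]
Path = List[Checkpoint]
Action = Union[str, int]  # "widen", "stop", or the index of an open path to deepen
Controller = Callable[[List[Optional[str]], int], Action]


def replay(cache: List[Path], controller: Controller) -> Tuple[Optional[str], int]:
    """Run a controller against cached paths and return (majority answer, tokens spent)."""
    progress: List[int] = []  # progress[i] = checkpoints consumed from cache[i]
    tokens = 0
    while True:
        answers = [cache[i][n - 1][1] for i, n in enumerate(progress)]
        action = controller(answers, tokens)
        if action == "widen" and len(progress) < len(cache):
            # Open the next unused cached path and pay for its first checkpoint.
            tokens += cache[len(progress)][0][0]
            progress.append(1)
        elif (isinstance(action, int) and 0 <= action < len(progress)
              and progress[action] < len(cache[action])):
            # Advance an open path by one cached checkpoint.
            tokens += cache[action][progress[action]][0]
            progress[action] += 1
        else:
            break  # "stop", or the requested move is not available in the cache
    votes = Counter(a for a in answers if a is not None)
    return (votes.most_common(1)[0][0] if votes else None), tokens


def four_wide(answers: List[Optional[str]], tokens: int) -> Action:
    """Toy controller: open four paths, never deepen, then stop."""
    return "widen" if len(answers) < 4 else "stop"


# Made-up cache: four paths, the first of which needs a second step to reach an answer.
cache: List[Path] = [
    [(100, None), (80, "42")],
    [(120, "42")],
    [(90, "17")],
    [(110, "42")],
]
print(replay(cache, four_wide))
```

Because the language model is never called inside replay, scoring a new controller costs only a pass over stored data, which is what makes evaluating thousands of variants affordable.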
The searcher is Claude Code. Over multiple iterations, it reads logs from previous attempts, identifies where earlier algorithms wasted compute or fell apart, and writes a new controller directly as code. Each proposal is kept deliberately constrained — one high-level controller exposed externally, which then sets all the lower-level thresholds itself. This prevents the search from dissolving into a mess of individually tunable knobs.
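One way to picture that constraint is a single exposed setting from which every internal threshold is derived. Reading it this way is an assumption, and the sketch below is purely illustrative: the effort knob, the threshold names, and all the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Internal settings the caller never touches directly (hypothetical names)."""
    max_width: int          # most paths allowed open at once
    stop_confidence: float  # confidence at which the run stops early
    prune_patience: int     # rounds of disagreement tolerated before cutting a path


def derive_thresholds(effort: float) -> Thresholds:
    """Map one externally visible knob in [0, 1] to all internal thresholds."""
    if not 0.0 <= effort <= 1.0:
        raise ValueError("effort must be in [0, 1]")
    return Thresholds(
        max_width=4 + round(60 * effort),    # ranges from 4 to 64
        stop_confidence=0.6 + 0.35 * effort,  # ranges from 0.6 to roughly 0.95
        prune_patience=1 + round(3 * effort),  # ranges from 1 to 4
    )


print(derive_thresholds(0.0))
print(derive_thresholds(1.0))
```

With only one dial exposed, a search agent can't quietly tune each threshold to the quirks of the evaluation data, which fits the overfitting concern the constraint is meant to address.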
What came out the other end is unusual, in a good way. The discovered algorithm doesn't just check whether a majority of answers agree. It tracks how the model's confidence shifts across multiple rounds. If confidence is climbing steadily, it stops spawning new paths. If it's barely moving, it opens more. Solution paths that agree with the current majority get a larger share of the compute. Paths that diverge are cut only if they keep heading in the wrong direction over several consecutive rounds, not at the first sign of disagreement.
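The behaviour described above can be sketched in a few small functions. This is not the discovered algorithm itself, just a minimal illustration under stated assumptions: confidence is taken to be the majority answer's vote share, "heading in the wrong direction" is simplified to disagreeing with the majority, and the window, gain, and patience values are invented.

```python
from collections import Counter
from typing import List, Optional, Tuple


def majority_share(answers: List[Optional[str]]) -> Tuple[Optional[str], float]:
    """Return the leading answer and its share of the non-empty votes."""
    votes = Counter(a for a in answers if a is not None)
    if not votes:
        return None, 0.0
    answer, count = votes.most_common(1)[0]
    return answer, count / sum(votes.values())


def should_widen(confidence_history: List[float],
                 min_gain: float = 0.02, window: int = 3) -> bool:
    """Open new paths only while confidence has stalled over the last `window` rounds."""
    if len(confidence_history) < window + 1:
        return True  # too little history to judge a trend, so keep exploring
    gain = confidence_history[-1] - confidence_history[-1 - window]
    return gain < min_gain


def update_strikes(strikes: List[int], answers: List[Optional[str]],
                   majority: Optional[str]) -> List[int]:
    """Count consecutive rounds each path has disagreed with the majority."""
    return [0 if a is None or a == majority else s + 1
            for s, a in zip(strikes, answers)]


def paths_to_prune(strikes: List[int], patience: int = 3) -> List[int]:
    """Cut a path only after `patience` consecutive rounds of disagreement."""
    return [i for i, s in enumerate(strikes) if s >= patience]


# Toy run: path 1 disagrees for three rounds in a row and is then marked for pruning.
strikes = [0, 0, 0]
for _ in range(3):
    answers = ["42", "17", "42"]
    majority, _share = majority_share(answers)
    strikes = update_strikes(strikes, answers, majority)
print(strikes, paths_to_prune(strikes))
```

The strike counter resets whenever a path rejoins the majority, so a single round of disagreement never gets a path cut.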
The authors describe this kind of dynamic, momentum-aware coordination as something that would have been nearly impossible to arrive at through manual design. Looking at the behaviour in hindsight, it's easy to see why it works. Designing it from scratch? Much less obvious.
On the AIME and HMMT maths benchmarks, the agent-discovered algorithm matches or beats established methods in accuracy while using far less compute. Compared to vanilla self-consistency, which just generates 64 answers in parallel and picks the majority, the AutoTTS algorithm cuts token usage by around 70 percent with no meaningful accuracy drop. It also transfers to DeepSeek-R1-Distill-Llama-8B and holds up on GPQA-Diamond, a non-maths benchmark, which eases the obvious worry that it just overfit to one task type.
The entire discovery run cost roughly $40 and finished in under three hours.
The ablation results are worth paying attention to. Remove the single high-level controller constraint and the agent finds shortcuts that look efficient during search but fall apart on new tasks — classic overfitting to the evaluation setup. Strip out the detailed logs and force the agent to work from bare final results alone, and it produces algorithms that use more compute and perform worse. The logging isn't just bookkeeping; it's the mechanism by which the agent actually learns what went wrong.
AutoTTS sits in a growing cluster of work — FunSearch, AlphaEvolve, ADAS — that treats language models as program searchers rather than answer generators. Applying that idea to test-time scaling is the novel bit here, since TTS has historically been a domain where humans stayed firmly in the driver's seat.
There are real limitations. The current framework only handles the width-depth trade-off and can't represent more complex structures like tree search. The quality of what gets discovered also depends heavily on which coding agent does the searching — the paper doesn't address whether cheaper or open-source alternatives would perform comparably.
But the broader shift the paper points to is interesting regardless. The human role moves from designing strategies to designing the space in which strategies get discovered. Whether that generalises cleanly beyond test-time scaling is the obvious next question.