Two AI Systems Claim Drug Repurposing Wins. Here's What That Actually Means.
Nature dropped two papers this week covering AI systems built to assist with scientific research. One is Google's Co-Scientist. The other is Robin, from a nonprofit called FutureHouse. Both were tested on drug repurposing tasks in biology. Both found something that worked in cell culture. Neither is about to replace a research department.
Let's not oversell this.
The actual problem both systems are trying to solve is mundane but genuine: there is simply too much published science for any human to track. The number of journals has exploded alongside online publishing, and keeping up with your own field is hard enough without also monitoring adjacent ones. If you study eye disease, there's a decent chance something published in a nephrology journal is quietly relevant to your work. You'll probably never see it.
This is where AI actually earns its keep. Not by being smarter than scientists, but by being tireless enough to read everything and flag connections a person would never have time to find. FutureHouse puts it plainly: their system targets non-obvious links between unrelated fields, the kind of thing that falls through the cracks when knowledge gets siloed.
Google's system runs on Gemini. Researchers feed it a description of what they're investigating, and it goes off to search the literature, form hypotheses, and run them through what the paper calls a tournament: hypotheses compete against each other, a Reflection agent scores the survivors, and an Evolution agent tries to improve the ones still standing. Then the cycle repeats.
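To make that loop concrete, here is a toy sketch of a compete, score, evolve cycle. It is not Google's implementation: the function names (`run_tournament`, `compare`, `reflect`, `evolve`) are hypothetical. In the real system each step would be an LLM call rather than the string-length stand-ins used here.

```python
from typing import Callable, List


def run_tournament(
    hypotheses: List[str],
    compare: Callable[[str, str], str],
    reflect: Callable[[str], float],
    evolve: Callable[[str], str],
    rounds: int = 2,
) -> List[str]:
    """Toy compete/score/evolve loop (hypothetical, for illustration only)."""
    pool = list(hypotheses)
    for _ in range(rounds):
        # 1. Competition: pair hypotheses off; the winner of each match survives.
        survivors = [compare(pool[i], pool[i + 1]) for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2 == 1:
            survivors.append(pool[-1])  # the odd one out gets a bye
        # 2. Reflection: score the survivors and order them best-first.
        scored = sorted(survivors, key=reflect, reverse=True)
        # 3. Evolution: try to improve each survivor, keeping the original too.
        pool = scored + [evolve(h) for h in scored]
    return sorted(pool, key=reflect, reverse=True)


# Stand-in "agents": longer strings win, and evolving appends a marker.
ranked = run_tournament(
    ["a", "bbb", "cc", "dddd"],
    compare=lambda x, y: max(x, y, key=len),
    reflect=len,
    evolve=lambda h: h + "+",
)
```

The point of the structure is that weak ideas are dropped early, and effort goes into refining whatever survives each round.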
Scientists stay involved throughout. When Co-Scientist was applied to acute myeloid leukemia, its drug suggestions were filtered by a panel of human experts before anything went near a lab. The results were appropriately complicated: some drugs worked, but only on certain cell lines. That's not a failure; it's how cancer biology works. Different cells reach uncontrolled growth via different routes, and a drug blocking one route won't necessarily touch another.
Google notes that the system is model-agnostic, so it can be updated as better models come along. The company is also candid that it inherits the flaws of whatever model sits underneath it, including the ever-present risk of hallucination.
FutureHouse's Robin is structured similarly but comes with a few notable additions. It has dedicated literature tools: Crow for quick summaries, Falcon for deeper dives. The headline figure is that Robin can analyse 551 papers in 30 minutes. A human doing the same job properly would need around 540 hours.
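For a sense of scale, those two figures can be turned into a speed-up and a time per paper. The short calculation below uses only the numbers quoted above; the variable names are illustrative.

```python
papers = 551
robin_hours = 0.5   # 30 minutes
human_hours = 540

# How many times faster Robin is than the quoted human estimate.
speedup = human_hours / robin_hours

# Average time spent per paper under each figure.
human_minutes_per_paper = human_hours * 60 / papers
robin_seconds_per_paper = robin_hours * 3600 / papers

print(f"speed-up: {speedup:.0f}x")
print(f"human: {human_minutes_per_paper:.1f} min/paper")
print(f"Robin: {robin_seconds_per_paper:.1f} s/paper")
```

In other words, the human estimate works out to roughly an hour per paper, against a few seconds for Robin, a difference of about three orders of magnitude.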
Applied to macular degeneration, Robin generated hypotheses about disease mechanisms, ranked them using an LLM-based comparison system, proposed cell line models, and produced reports on 30 candidate drugs. Humans then decided which experiments to actually run. Robin also suggested which assays to use, and in most cases researchers went with variants of those suggestions.
The most interesting bit is Finch, a tool that can directly process results from standard biological screening assays such as flow cytometry and RNA-seq. It's a small but meaningful step: the system isn't just generating ideas, it's also doing some of the grunt work of reading experimental output.
Robin's novel hypothesis was that boosting retinal cells' ability to clear extracellular debris might protect against macular degeneration, and it identified a drug that appeared to do exactly that in the proposed experiments.
On hallucinations, FutureHouse ran a comparison that should give pause to anyone defaulting to off-the-shelf models for literature work. Replacing their specialist search tool with OpenAI's o4-mini pushed the rate of hallucinated references from zero to 45 percent. They also tested OpenAI's research-focused product and found that every drug it suggested that Robin hadn't already flagged failed to show any effect on the cells.
Drug repurposing is not the hardest problem in pharmaceutical research. These systems weren't designing novel molecules. They were identifying known drugs with existing safety profiles, many of them off-patent and cheap. That's genuinely useful, but it's also the part of drug development where the bar is lowest. The brutal attrition happens in animal studies and clinical trials, not cell culture.
The hypotheses being tested here are also among the more tractable in biology: mechanism X underlies disease Y, therefore drug Z might help. That's a concrete, falsifiable claim. Many scientific questions aren't structured that cleanly. Figuring out why a single genetic mutation causes defects across a dozen different tissue types, or what's happening at the boundary of a gene expression domain, isn't the kind of thing you can easily frame as a literature search problem.
That said, literature overload is a real constraint on scientific progress. The scenario where a crucial connection sat in the published record for a decade, unnoticed, is not hypothetical. Tools that can systematically surface those connections have genuine value.
Having two independent systems doing this, built by different organisations with different architectures, is probably the right approach for now. Run both, compare outputs, treat disagreements as signal. Nobody should be handing either of these systems the keys to a research programme just yet, but as one layer in a larger process, they're starting to look like something more than a demo.
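As a rough illustration of "run both, compare outputs", the sketch below splits two candidate lists into what both systems agree on and what only one of them proposed. The function name and the drug names are hypothetical placeholders, not results from either paper.

```python
from typing import Dict, Iterable, List


def compare_candidates(system_a: Iterable[str], system_b: Iterable[str]) -> Dict[str, List[str]]:
    """Split two candidate lists into agreements and disagreements."""
    a, b = set(system_a), set(system_b)
    return {
        "agreed": sorted(a & b),   # proposed by both systems
        "only_a": sorted(a - b),   # proposed by the first system only
        "only_b": sorted(b - a),   # proposed by the second system only
    }


# Placeholder names only; real inputs would be each system's shortlist.
report = compare_candidates(
    ["drug_1", "drug_2", "drug_3"],
    ["drug_2", "drug_3", "drug_4"],
)
```

Agreement is weak evidence that a candidate is worth a closer look. The "only" lists are where a human reviewer would want to ask why one system saw something the other missed.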