AI Safety: 5 articles
So You Want a Trustworthy AI Evaluation? Here's What Actually Matters
OpenAI argues that as frontier AI models have become more capable agentic systems, traditional evaluation methods are no longer sufficient, and third-party evaluations must now carefully account for the "harness" — the tools, scaffolding, and setup surrounding a model — since harness choices can significantly change measured performance. Evaluations should clearly specify what type of claim they are testing (capability elicitation, safeguard performance, or comparison) and provide evidence addressing validity risks such as reward hacking, sandbagging, contamination, refusals, and broken problems. The article recommends that evaluation reports include detailed documentation of harness choices, budgets, elicitation methods, and validity checks, and calls for these practices to be incorporated into emerging national and international AI evaluation standards.
Anthropic Plans Public Release of Mythos Bug-Hunter, Admits Nobody Has the Safeguards to Do It Yet
Anthropic has announced plans to eventually make its Mythos AI model, which excels at finding security vulnerabilities in code, publicly available, but only once sufficient safeguards exist; the company admits they do not yet. In the meantime, access is being expanded through its "Project Glasswing" programme to additional partners, including allied governments. Mythos has already identified over 23,000 flaws across 1,000+ open-source projects, though the flood of reports is straining an already overloaded security ecosystem, and many maintainers are struggling to keep pace.
Three Phone Calls and America's AI Safety Order Was Dead
President Trump cancelled a planned executive order on AI safety at the last minute after phone calls from Elon Musk, Mark Zuckerberg, and former AI advisor David Sacks, who warned that the proposed measures could slow AI development and jeopardise America's competitive edge over China. The draft order would have established a voluntary system under which AI companies would submit frontier models to federal agencies for safety testing up to 90 days before release. The order has been shelved for reworking, with critics inside the administration dismissing it as unnecessary fearmongering pushed by AI "doomers."
Anthropic's Claude Mythos Is Finding Bugs Faster Than Anyone Can Fix Them
Anthropic's Claude Mythos Preview AI model, working with around 50 partners through Project Glasswing, identified over 10,000 critical security vulnerabilities in system-critical software within just one month, with some partners reporting a tenfold increase in bug discovery rates. However, discovery far outpaces the ability of organizations to verify and patch the flaws: only 97 of the 23,019 open-source vulnerabilities found (about 0.4%) have been fixed so far. Anthropic warns this creates a dangerous transition period in which AI models can find, and potentially exploit, vulnerabilities faster than defenders can respond, and acknowledges that no company currently has safeguards strong enough to prevent misuse of such capabilities.
SpaceX Tells IPO Investors That Grok's 'Unhinged' Mode Is, Officially, A Risk
In its IPO filing, SpaceX warned investors that Grok's "Spicy" and "Unhinged" AI modes pose significant reputational and regulatory risks, including ongoing investigations over allegations that Grok was used to generate sexualized imagery of apparent minors and several class action lawsuits. These risks emerged after SpaceX acquired Elon Musk's xAI startup in February, with the company setting aside $530 million for potential litigation losses. SpaceX's AI division, which includes X and xAI, recorded an operating loss of over $6.3 billion last year, though subscription revenues for Grok and X are growing steadily.