TAG

benchmarks (3 articles)

So You Want a Trustworthy AI Evaluation? Here's What Actually Matters

OpenAI argues that as frontier AI models have become more capable agentic systems, traditional evaluation methods are no longer sufficient, and third-party evaluations must now carefully account for the "harness" — the tools, scaffolding, and setup surrounding a model — since harness choices can significantly change measured performance. Evaluations should clearly specify what type of claim they are testing (capability elicitation, safeguard performance, or comparison) and provide evidence addressing validity risks such as reward hacking, sandbagging, contamination, refusals, and broken problems. The article recommends that evaluation reports include detailed documentation of harness choices, budgets, elicitation methods, and validity checks, and calls for these practices to be incorporated into emerging national and international AI evaluation standards.

30 May 2026

Gemini 3.5 Flash Is Faster and Smarter Than Its Predecessor — And Considerably More Expensive

Google has released Gemini 3.5 Flash, its fastest model in its intelligence class at over 280 output tokens per second, but it comes at 5.5 times the operating cost of its predecessor due to tripled token prices and significantly higher token consumption on agentic tasks. Despite strong improvements in agentic and multimodal benchmarks, the model notably underperforms competitors like GPT-5.5 and Claude Opus 4.7 in coding, one of the most important use cases for agentic AI. The price hike mirrors a broader industry trend, with Anthropic and OpenAI also raising effective costs on newer models, signalling that AI pricing is increasingly driven by complex, multi-step task demands rather than simple per-token rates.

23 May 2026

NousCoder-14B: Open-Source Coding Model Arrives Just as Everyone's Losing Their Minds Over Claude Code

Nous Research has released NousCoder-14B, an open-source coding model trained in just four days on 48 Nvidia B200 GPUs, achieving 67.87% accuracy on the LiveCodeBench v6 benchmark, a 7-percentage-point improvement over its base model. The release stands out for its radical transparency, with Nous publishing not only the model weights but also the full training environment and reinforcement learning framework, enabling others to reproduce the work. However, the researchers flag a significant concern: the training dataset approached the limits of available competitive programming problems, pointing to data scarcity as a key obstacle to further progress in AI coding and highlighting synthetic data generation as a critical area for future research.

17 May 2026