No Priors: Artificial Intelligence | Technology | Startups

Why Traditional Benchmarks Fail Modern AI Models with OpenAI Research Scientist Noam Brown

36 min

OpenAI researcher Noam Brown argues that traditional AI benchmarks fail to capture the true capabilities of modern models because they don't account for test-time compute scaling. He explains that models can improve dramatically with more inference time or budget, making single-number benchmarks misleading. The conversation also covers the need for new evaluation frameworks, the potential of large-scale test-time compute, and the implications for safety evaluations.

Summarized by Podsumo

✨ Key Takeaways

1. Noam argues that AI benchmarks should be evaluated with an x-axis of tokens, cost, or time, as models like GPT-5.5 can improve with higher inference budgets.
2. He warns that existing safety policies (e.g., Responsible Scaling Policies) don't account for how models' capabilities scale with test-time compute, potentially understating risks.
3. Noam used LLMs to build a poker solver, observing that models like GPT-5.5 can now complete tasks that previously required a PhD, but still lack 'research taste' for novel algorithmic breakthroughs.
4. He suggests the AI community is in a 'bad equilibrium' where benchmark grids are published by convention, even though researchers know they are insufficient for comparing modern models.
5. Noam discusses the challenge of evaluating models when the only way to fully measure their ceiling is to run them for months, far longer than current release cycles.
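
To make the first takeaway concrete, here is a minimal sketch of reporting a benchmark as a curve over inference budget instead of a single number. Everything in it is an assumption for illustration: the function name `score_at_budgets`, the two models, and the per-task costs are made up and do not come from the episode or from any real benchmark.

```python
from bisect import bisect_right


def score_at_budgets(solve_costs, budgets):
    """Return (budget, fraction of tasks solved within that budget) pairs.

    solve_costs: one entry per task, the cost (tokens, dollars, or seconds)
    at which the model first solved it, or None if it never did.
    """
    solved = sorted(c for c in solve_costs if c is not None)
    total = len(solve_costs)
    # bisect_right counts how many tasks were solved at or under the budget.
    return [(b, bisect_right(solved, b) / total) for b in budgets]


# Hypothetical per-task solve costs in dollars (not real data).
model_a = [5, 8, 40, 900, None]
model_b = [20, 30, 35, 60, 7000]
budgets = [10, 100, 1000, 10000]

curve_a = score_at_budgets(model_a, budgets)
curve_b = score_at_budgets(model_b, budgets)

for (budget, a), (_, b) in zip(curve_a, curve_b):
    print(f"budget ${budget}: model A {a:.0%}, model B {b:.0%}")
```

With these invented numbers, model A leads at a budget of 10 while model B leads at 100 and at 10,000, so which model "wins" depends on where along the x-axis the comparison is made.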

💬 Notable Quotes

"The capability of the model is a function of how much money you put into it. If you give it a budget of $10,000, it can do a lot more than with a budget of $10. At what budget should you evaluate these models? The policies that exist today don't address that question."
"The point at which performance plateaus is actually really far out these days. We're in a world now where the models can think for weeks before having performance plateau on some benchmarks."
"I think one of the reasons why the benchmark results don't show a huge improvement is because they're not controlling for the amount of test time compute. GPT-5.5 is much more efficient with its thinking."