Study accuses LM Arena of helping top AI labs game its benchmark
A new paper from AI lab Cohere, Stanford, MIT, and Ai2 accuses LM Arena, the organization behind the popular crowdsourced AI benchmark Chatbot Arena, of helping a select group of…
A discrepancy between first- and third-party benchmark results for OpenAI’s o3 AI model is raising questions about the company’s transparency and model testing practices. When OpenAI unveiled o3 in December,…
Earlier this week, Meta landed in hot water for using an experimental, unreleased version of its Llama 4 Maverick model to achieve a high score on a crowdsourced benchmark, LM…
A Meta exec on Monday denied a rumor that the company trained its new AI models to present well on specific benchmarks while concealing the models’ weaknesses. The executive, Ahmad…
One of the new flagship AI models Meta released on Saturday, Maverick, ranks second on LM Arena, a test that has human raters compare the outputs of models and choose…
Alright, so you’ve read all the reviews of Nvidia’s newest generation of graphics cards: the RTX 5090, 5080, 5070 Ti, and 5070. You know how powerful they are (or aren’t),…
Thought Pokémon was a tough benchmark for AI? One group of researchers argues that Super Mario Bros. is even tougher. Hao AI Lab, a research org at the University of…
Every Sunday, NPR host Will Shortz, The New York Times’ crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the Sunday Puzzle. While written to…
Shiva Suri gained a unique perspective into how radiologists work when he quarantined and shared a home office with his mom, a well-regarded radiologist. “I watched her work day in…
There isn’t a shortage of AI-powered coding assistance startups. They include Augment, Codeium, Magic, and Poolside. However, Cursor has become one of the most popular. Its developer, Anysphere, has seen…