Monday, March 31, 2025

These researchers used NPR Sunday Puzzle questions to benchmark AI 'reasoning' models

Every Sunday, NPR host Will Shortz, The New York Times' crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without too much foreknowledge, the brainteasers are usually challenging even for skilled contestants.

That's why some experts think they're a promising way to test the limits of AI's problem-solving abilities.

In a recent study, a team of researchers hailing from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says their test revealed surprising insights, such as that reasoning models (OpenAI's o1, among others) sometimes "give up" and provide answers they know aren't correct.

"We wanted to develop a benchmark with problems that humans can understand with only general knowledge," Arjun Guha, a computer science faculty member at Northeastern and one of the co-authors on the study, told TechCrunch.

The AI industry is in a bit of a benchmarking quandary at the moment. Most of the tests commonly used to evaluate AI models probe for skills, like competency on PhD-level math and science questions, that aren't relevant to the average user. Meanwhile, many benchmarks, even ones released relatively recently, are quickly approaching the saturation point.

The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn't test for esoteric knowledge, and the challenges are phrased such that models can't draw on "rote memory" to solve them, explained Guha.

"I think what makes these problems hard is that it's really difficult to make meaningful progress on a problem until you solve it; that's when everything clicks together all at once," Guha said. "That requires a combination of insight and a process of elimination."

No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English-only. And because the quizzes are publicly available, it's possible that models trained on them can "cheat" in a sense, although Guha says he hasn't seen evidence of this.

"New questions are released every week, and we can expect the latest questions to be truly unseen," he added. "We intend to keep the benchmark fresh and track how model performance changes over time."

On the researchers' benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek's R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before giving out results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take a little longer to arrive at solutions, typically seconds to minutes longer.

At least one model, DeepSeek's R1, gives solutions it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim "I give up," followed by an incorrect answer chosen seemingly at random, behavior this human can certainly relate to.

The models make other bizarre choices, like giving a wrong answer only to immediately retract it, attempt to tease out a better one, and fail again. They also get stuck "thinking" forever and give nonsensical explanations for answers, or they arrive at a correct answer right away but then go on to consider alternative answers for no obvious reason.

"On hard problems, R1 literally says that it's getting 'frustrated,'" Guha said. "It was funny to see how a model emulates what a human might say. It remains to be seen how 'frustration' in reasoning can affect the quality of model results."

R1 getting "frustrated" on a question in the Sunday Puzzle challenge set. Image Credits: Guha et al.

The current best-performing model on the benchmark is o1 with a score of 59%, followed by the recently released o3-mini set to high "reasoning effort" (47%). (R1 scored 35%.) As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help identify areas where these models might be improved.

The scores of the models the team tested on their benchmark. Image Credits: Guha et al.

"You don't need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don't require PhD-level knowledge," Guha said. "A benchmark with broader access allows a wider set of researchers to comprehend and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are, and aren't, capable of."
