A well-known test of artificial general intelligence is closer to being solved. But the test’s creators say this points to flaws in the benchmark’s design rather than a genuine research breakthrough.
In 2019, François Chollet, a prominent figure in the AI world, introduced the ARC-AGI benchmark, short for “Abstraction and Reasoning Corpus for Artificial General Intelligence.” The benchmark is designed to gauge whether an AI system can efficiently acquire new skills outside the data it was trained on.
Until this year, the best-performing AI could solve just under a third of the tasks in ARC-AGI. Chollet blamed the industry’s focus on large language models (LLMs), which he argues are incapable of genuine “reasoning.”
In a series of posts on X in February, he attributed LLMs’ weakness at generalization to their total reliance on memorization, arguing that they break down on anything that wasn’t in their training data.
In Chollet’s view, LLMs are statistical machines. Trained on vast amounts of data, they learn patterns in that data to make predictions, such as how “to whom” in an email typically precedes “it may concern.”
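To make that statistical intuition concrete, here is a minimal sketch in Python: a toy bigram counter that predicts the next word from frequencies observed in training text. The tiny corpus and the predict_next helper are illustrative assumptions, not how any production model is built; real LLMs learn such patterns with neural networks at vastly larger scale.

```python
from collections import Counter, defaultdict

# Toy training text; real models train on trillions of tokens.
corpus = "to whom it may concern to whom it may concern please reply".split()

# Count how often each word follows each preceding word (a bigram model).
follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(word: str) -> str:
    """Return the most frequent continuation seen during training."""
    return follows[word].most_common(1)[0][0]

print(predict_next("whom"))  # -> "it": a memorized pattern, not reasoning
```

In Chollet’s framing, the point is that a system like this can only reproduce correlations it has already seen; the open question is whether scaled-up versions do anything fundamentally different.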
Chollet believes that LLMs can memorize “reasoning patterns,” but that they are unlikely to generate genuinely new reasoning in situations they haven’t encountered before. As he put it, if a model has to internalize many examples of a pattern, explicit or implicit, in order to arrive at a reusable representation, it is memorizing.
To push the field beyond LLMs, Chollet and Zapier co-founder Mike Knoop launched a $1 million competition in June to build open-source AI capable of beating ARC-AGI. Out of 17,789 submissions, the best scored 55.5 percent, roughly 20 percentage points higher than 2023’s top scorer, though still short of the 85 percent “human-level” threshold required to win.
That doesn’t mean we’re 20 percent closer to AGI, however, Knoop says.
Today we’re announcing the winners of ARC Prize 2024. We’re also releasing an extensive technical report on what we learned from the competition (link: [insert link]).
State-of-the-art performance jumped from 33% to 55.5%, the largest single-year increase we’ve seen since 2020. The…
— François Chollet (@fchollet)
Knoop observed that many of the submissions to ARC-AGI relied on “brute force” to arrive at a solution, suggesting that a large fraction of ARC-AGI tasks don’t carry a useful signal toward measuring general intelligence.
ARC-AGI consists of puzzle-like problems in which an AI, given a grid made up of differently colored squares, has to generate the correct “answer” grid. The problems were designed to force an AI to adapt to novel situations it hasn’t seen before. But it’s not clear they’re achieving that.
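For readers unfamiliar with the format, below is a minimal sketch of what an ARC-style task looks like, written in Python. The grids and the hidden mirror rule here are invented for illustration; real ARC-AGI tasks encode colors as integers 0 through 9 and hide transformations a solver must infer from just a few example pairs.

```python
# A grid is a 2D array of color codes (0-9), as in real ARC-AGI tasks.
Grid = list[list[int]]

def hidden_rule(grid: Grid) -> Grid:
    """The made-up rule for this toy task: mirror the grid left-to-right."""
    return [row[::-1] for row in grid]

# Example pair shown to the solver, which must infer the rule from it.
train_input: Grid = [
    [1, 0, 0],
    [1, 2, 0],
    [1, 2, 3],
]
train_output = hidden_rule(train_input)  # [[0, 0, 1], [0, 2, 1], [3, 2, 1]]

# Test input: the solver must produce the correct "answer" grid.
test_input: Grid = [
    [4, 4, 0],
    [0, 4, 0],
]
print(hidden_rule(test_input))  # expected answer: [[0, 4, 4], [0, 4, 0]]
```

A brute-force solver of the kind Knoop describes would simply search over a large library of candidate transformations until one fits the example pairs, rather than reasoning its way to the rule.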

ARC-AGI has been unchanged since 2019 and isn’t perfect, Knoop conceded.
Chollet and Knoop have also faced criticism for overselling ARC-AGI as a benchmark toward AGI, at a time when the very definition of AGI is hotly contested.
One OpenAI employee recently argued that AGI has already been achieved, provided one defines it as AI surpassing human capabilities in most tasks.
Knoop and Chollet plan to release a second-generation ARC-AGI benchmark designed to address these issues, alongside a new competition in 2025. In a post on X, Chollet wrote that his team will continue directing the research community’s attention toward what it sees as the most important unsolved problems in AI, and work to accelerate the timeline to AGI.
Fixes likely won’t come easy, though. If the first ARC-AGI test’s shortcomings are any indication, defining intelligence for AI will prove just as intractable as defining it for humans has long been.