Two key players in San Francisco’s artificial intelligence ecosystem have challenged the public to come up with questions capable of testing the capabilities of large language models (LLMs) such as Google Gemini and OpenAI’s o1. Scale AI, which specialises in preparing the vast datasets on which LLMs are trained, has partnered with the Center for AI Safety (CAIS) to launch the initiative, “Humanity’s Last Exam”.
There are prizes of $5,000 for those who come up with the top 50 questions selected for the test, with Scale and CAIS saying the goal is to gauge how close we are to achieving “expert-level AI systems” using the “largest, broadest coalition of experts in history”.
Why do that? The leading LLMs already do well on many established tests of intelligence, yet there is still uncertainty about how meaningful those results are. In many cases, the models may effectively have seen the answers already, given the vast amounts of data in their training sets, which include a substantial proportion of everything published online.
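One way researchers probe for this kind of contamination is to look for verbatim overlap between benchmark questions and the training corpus. The snippet below is a minimal, illustrative sketch of such a check, assuming a toy in-memory corpus and simple word n-gram matching rather than any particular lab’s pipeline; the function names and sample strings are invented for the example.

```python
# Minimal sketch of an n-gram contamination check (illustrative only).
# Assumes a tiny in-memory "training corpus"; real checks scan terabytes.

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_score(question: str, corpus: list, n: int = 8) -> float:
    """Fraction of the question's n-grams that also appear in the corpus."""
    q_grams = ngrams(question, n)
    if not q_grams:
        return 0.0
    corpus_grams = set().union(*(ngrams(doc, n) for doc in corpus))
    return len(q_grams & corpus_grams) / len(q_grams)

# Hypothetical usage: a high score suggests the model may have "seen" the answer.
corpus = ["the bar exam asks which doctrine governs adverse possession of land"]
print(contamination_score("which doctrine governs adverse possession of land", corpus, n=4))
```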
Data is central to this whole area. It has powered the shift from conventional computing, where machines are explicitly told what to do, to AI, where they are shown examples and learn from them. That requires excellent training datasets, but also rigorous tests: developers typically evaluate models on data that has not already been used for training, known as test datasets.
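In practice, “good tests” comes down to holding data out: the model is fitted on a training split and its headline score is computed only on examples it never saw during training. A minimal sketch of that convention, with model_fn standing in for whatever system is being evaluated:

```python
import random

# Minimal sketch of the train/test convention (illustrative): hold some
# examples out of training and score the model only on those unseen items.
examples = [{"question": f"Q{i}", "answer": f"A{i}"} for i in range(1000)]

random.seed(0)
random.shuffle(examples)
split = int(0.8 * len(examples))
train_set, test_set = examples[:split], examples[split:]   # 80/20 split

def accuracy(model_fn, dataset):
    """Share of held-out questions the model answers exactly right."""
    correct = sum(model_fn(ex["question"]) == ex["answer"] for ex in dataset)
    return correct / len(dataset)

# Training uses train_set only; the reported number comes from test_set.
```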
As the technology rapidly advances, LLMs will likely soon be able to learn from and prepare for established tests, such as bar exams, with unprecedented speed and accuracy. AI analysts estimate that 2028 will mark the point at which AIs have effectively read everything humans have ever written, a major turning point in the evolution of artificial intelligence. A crucial challenge is how to keep evaluating AIs once that threshold has been crossed.
The internet keeps growing at a remarkable rate, with tens of thousands of new items added every day. Could that keep these problems at bay?
Perhaps, but the proliferation of AI-generated content raises another insidious issue. As AI-written material spreads online and is fed back into future training sets, it can create self-reinforcing loops in which flawed or biased outputs become the training data for the next generation of models, with potentially unforeseen consequences. To mitigate this, many developers are already collecting data from their AIs’ interactions with humans, adding fresh material for training and evaluation.
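This feedback loop can be seen even in a toy setting: if each “generation” of a model is fitted only to samples drawn from the previous generation, its parameters drift away from the original human data. The sketch below is a deliberately simplified illustration using a one-dimensional Gaussian as a stand-in for a trained model; it is not meant to describe any real LLM training pipeline.

```python
import random
import statistics

# Toy illustration of a self-reinforcing training loop: each generation is
# fitted only to synthetic data sampled from the previous generation.
random.seed(42)

real_data = [random.gauss(0.0, 1.0) for _ in range(1000)]   # "human" data
mu, sigma = statistics.mean(real_data), statistics.stdev(real_data)

for generation in range(1, 11):
    # Fresh human data is scarce, so the next "model" trains on a smaller
    # sample of the current model's own output.
    synthetic = [random.gauss(mu, sigma) for _ in range(100)]
    mu, sigma = statistics.mean(synthetic), statistics.stdev(synthetic)
    print(f"gen {generation:2d}: mean={mu:+.3f} std={sigma:.3f}")

# Over generations the estimates wander away from the original distribution,
# a toy analogue of degraded, less diverse model outputs.
```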
Some specialists argue that AIs also need to become embodied: literally moving around in the physical world and accumulating their own experiences, much as humans do. That might sound far-fetched until you realise Tesla has been doing something similar with its vehicles for years. Another option is human wearables, such as Meta’s smart glasses, which are equipped with cameras and microphones and can capture vast quantities of human-centric video and audio data.
Narrow tests
But even if such products can guarantee enough training data in future, the challenge persists of how to define and measure intelligence – particularly artificial general intelligence (AGI), meaning an AI that equals or surpasses human cognition.
Traditional human IQ tests have long been criticised for failing to capture the full scope of intelligence, which spans everything from language and arithmetic to emotional intelligence and common sense.
There is an analogous problem with the tests used on AIs. There are many well-established benchmarks covering tasks such as summarising text, comprehending it, drawing correct inferences from information, recognising human poses and gestures, and machine vision.
Some of these tests are being retired, usually because the AIs are doing so well at them, but such task-specific assessments barely scratch the surface of measuring cognitive ability. The chess engine Stockfish, for instance, is far ahead of Magnus Carlsen, the highest-rated human player of all time, on the Elo rating scale, making it a formidable opponent for the greatest human players. Yet Stockfish is incapable of other tasks, such as understanding natural language, so it would clearly be misguided to equate its chess prowess with broader, human-level intelligence.
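For context, chess strength is compared on the Elo scale, where the gap between two ratings maps to an expected score through a simple logistic formula. A quick sketch, using illustrative ratings rather than official figures:

```python
# Expected score in the Elo system: a 400-point gap gives the stronger
# player roughly a 10:1 edge. The ratings below are illustrative, not official.

def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (between 0 and 1) of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

engine, human = 3500, 2850   # hypothetical engine vs top-human ratings
print(f"{elo_expected_score(engine, human):.3f}")   # ~0.98: near-certain win
```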
As AI systems demonstrate increasingly sophisticated behaviour, the pressing challenge is to devise new benchmarks for measuring and comparing their progress. One notable approach comes from François Chollet, a French engineer at Google. He argues that true intelligence lies in the capacity to learn, adapt and generalise knowledge to novel, unseen situations. In 2019, he devised the “Abstraction and Reasoning Corpus” (ARC), a collection of puzzles in the form of simple visual grids designed to test an AI’s ability to infer and apply abstract rules.
I’ve just released a fairly long paper on defining & measuring intelligence, as well as a new AI evaluation dataset, the “Abstraction and Reasoning Corpus”. I’ve been working on this for the past 2 years, on & off.
Paper:
ARC:
— François Chollet (@fchollet)
Unlike training an AI on vast collections of photographs annotated with descriptions of the objects they contain, ARC gives its test-takers only minimal prior knowledge, in the form of a few worked examples. The AI must figure out each puzzle’s underlying logic rather than simply learning all the possible answers.
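Concretely, an ARC task supplies a handful of input/output grid pairs plus a test input, and a solver must infer the transformation and apply it. The sketch below shows the general shape of such a task and how a candidate rule can be checked against the worked examples; the task and the swap_1_and_2 rule are invented for illustration and are far simpler than real ARC puzzles.

```python
# Illustrative ARC-style task: grids are small lists of lists of colour codes.
# This toy task's hidden rule is "swap colours 1 and 2"; real ARC rules are
# far more varied and must be inferred, not hard-coded.
task = {
    "train": [
        {"input": [[1, 0], [0, 2]], "output": [[2, 0], [0, 1]]},
        {"input": [[2, 2], [1, 0]], "output": [[1, 1], [2, 0]]},
    ],
    "test": {"input": [[0, 1], [2, 1]]},
}

def swap_1_and_2(grid):
    """One candidate transformation a solver might hypothesise."""
    swap = {1: 2, 2: 1}
    return [[swap.get(cell, cell) for cell in row] for row in grid]

def consistent(rule, pairs):
    """Keep a rule only if it reproduces every worked example exactly."""
    return all(rule(p["input"]) == p["output"] for p in pairs)

if consistent(swap_1_and_2, task["train"]):
    print(swap_1_and_2(task["test"]["input"]))   # predicted output grid
```

Nothing in the task spells out the rule; it has to be induced from just two worked examples, which is what makes the format hard to game by memorisation.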
ARC offers a substantial incentive to participants: a $600,000 prize awaits the first AI system that can achieve a score of 85%. At the time of writing, we are still a long way from that point.
Two of the current leading LLMs, OpenAI’s o1-preview and Anthropic’s Claude 3.5 Sonnet, each score 21% on the ARC public leaderboard.
Another attempt used OpenAI’s GPT-4o, albeit controversially, as it generated hundreds of possible answers before selecting the best one for the test. Even then, the result was still reassuringly far from triggering the prize, and from matching human-level performance.
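That approach amounts to a generate-and-filter strategy: sample a large number of candidate rules or answers, keep only those consistent with the worked examples, and submit a survivor. The sketch below shows the selection loop only; generate_candidates is a stub proposing random colour permutations, standing in for whatever an LLM would actually produce.

```python
import random

# Rough sketch of a generate-and-filter strategy: propose many candidate
# rules, keep those that reproduce the worked examples, submit a survivor.

def make_colour_map_rule(mapping):
    """Build a grid transformation that recolours cells via `mapping`."""
    return lambda grid: [[mapping.get(c, c) for c in row] for row in grid]

def generate_candidates(n, seed=0):
    """Stub generator: random colour permutations (purely illustrative)."""
    rng = random.Random(seed)
    colours = list(range(10))
    rules = []
    for _ in range(n):
        shuffled = colours[:]
        rng.shuffle(shuffled)
        rules.append(make_colour_map_rule(dict(zip(colours, shuffled))))
    return rules

def solve(task, n_candidates=500):
    """Return a prediction for the test grid, or None if nothing fits."""
    survivors = [r for r in generate_candidates(n_candidates)
                 if all(r(p["input"]) == p["output"] for p in task["train"])]
    return survivors[0](task["test"]["input"]) if survivors else None
```

Whether this toy version solves a given task depends entirely on whether the stub happens to propose the right rule, which is exactly why the real attempt leaned on generating candidates at scale.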
While ARC remains one of the most respected attempts to benchmark genuine intelligence in AI, the Scale and CAIS initiative shows that the search for compelling alternatives continues. Intriguingly, we may never get to see the prize-winning questions: the exam papers won’t be published online, to prevent AI systems from getting a look at them.
Will we be able to tell when machines are approaching human-level intelligence? Knowing that will raise pressing questions about safety, ethics and morality. And once that era arrives, we will likely face an even more daunting challenge: developing methods to audit and scrutinise the workings of a superintelligence. Working out how to do that will demand profound insight and meticulous attention to detail.