One key selling point of Google’s flagship generative AI models is their purported ability to process and analyze vast amounts of data. In press briefings and product demos, Google has repeatedly claimed that, thanks to their “long context,” the models can tackle previously impossible tasks, such as summarizing hundreds of pages of documents or searching across scenes in film footage.
But new research suggests the models aren’t nearly as good at this as claimed.
Two recent studies investigated how well Google’s Gemini models, among others, make sense of enormous amounts of data (think works as long as “War and Peace”). Both found that Gemini 1.5 Pro and 1.5 Flash struggle: in one series of document-based tests, the models answered correctly only 40-50% of the time.
“While models like Gemini 1.5 Pro can technically process long contexts, our research shows that they often don’t genuinely understand the content,” Marzena Karpinska, a postdoc at UMass Amherst and co-author on one of the studies, told TechCrunch.
Gemini’s context window falls short
A model’s context, or context window, refers to the input data (e.g., text) the model considers before generating output (e.g., more text). A simple question such as “Who won the 2020 U.S. presidential election?” can serve as context, as can a movie script, a show or an audio clip. And as context windows grow, so does the size of the documents that can be fit into them.
The newest versions of Gemini can take in up to 2 million tokens as context.
“Tokens” are subdivided bits of raw data, like the syllables “fan,” “tas” and “tic” in the word “fantastic.” Two million tokens is equivalent to roughly 1.4 million words, two hours of video or 22 hours of audio, the largest context window of any commercially available model.
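The arithmetic behind these figures can be sketched with a back-of-the-envelope converter. The ratio of roughly 0.7 English words per token is a common rule of thumb, not an official Gemini figure, and real tokenizers vary by model and input:

```python
# Back-of-the-envelope conversions between tokens and words, using the
# ~0.7 words-per-token rule of thumb for English text. This is an
# approximation; actual tokenizers vary by model and by input.

WORDS_PER_TOKEN = 0.7  # assumed average, not a Gemini-specific constant

def tokens_to_words(tokens: int) -> int:
    """Estimate how many English words fit in a given token budget."""
    return round(tokens * WORDS_PER_TOKEN)

def words_to_tokens(words: int) -> int:
    """Estimate the token cost of a given word count."""
    return round(words / WORDS_PER_TOKEN)

print(tokens_to_words(2_000_000))  # → 1400000, i.e., about 1.4 million words
print(words_to_tokens(260_000))    # a ~520-page book fits well under 1M tokens
```

By this estimate, even a 260,000-word book consumes well under half of a 1-million-token context window, which is what makes the studies’ results notable.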
In a briefing earlier this year, Google showed several pre-recorded demos meant to illustrate the potential of Gemini’s long-context capabilities. One had Gemini 1.5 Pro search the transcript of the Apollo 11 moon landing telecast, around 402 pages, for quotes containing jokes, and then find a scene in the telecast that resembled a pencil sketch.
Oriol Vinyals, VP of research at Google DeepMind, who led the briefing, described the model as “magical.”
“[1.5 Pro] performs these sorts of reasoning tasks across every single page, every single word,” he said.
That statement may have been an overstatement.
In one of the two studies, co-authored by Karpinska along with researchers from the Allen Institute for AI and Princeton, the models were asked to evaluate true/false statements about fiction books written in English. The researchers chose recent works so that the models couldn’t “cheat” by relying on prior knowledge, and they peppered the statements with references to specific details and plot points that would be impossible to grasp without reading the books in their entirety.
Presented with a statement like “By using her skills as an Apothecary, Nusis is able to reverse engineer the type of portal opened by the reagents key found in Rona’s wooden chest,” the models had to say whether the statement was true or false and explain their reasoning.

Tested on one book around 260,000 words (~520 pages) in length, 1.5 Pro answered the true/false statements correctly 46.7% of the time, while Flash answered correctly only 20% of the time. In other words, a coin flip would answer questions about the book more accurately than Google’s latest machine learning models. Averaging across all the benchmark results, neither model managed to do better than random chance on question-answering accuracy.
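The coin-flip comparison can be made concrete: on balanced true/false questions, random guessing scores 50% in expectation, so the reported accuracies sit at or below chance. A hypothetical simulation (the 46.7% and 20% figures come from the study; the simulation itself is illustrative only):

```python
import random

# Illustrative comparison of the reported accuracies against a coin-flip
# baseline on true/false questions. Only the 46.7% and 20% figures come
# from the study described above; everything else is a sketch.

random.seed(0)
n_questions = 100_000
truths = [random.choice([True, False]) for _ in range(n_questions)]
coin_answers = [random.choice([True, False]) for _ in range(n_questions)]

# A fair coin agrees with the ground truth about half the time.
coin_accuracy = sum(a == t for a, t in zip(coin_answers, truths)) / n_questions
print(f"coin flip:        {coin_accuracy:.1%}")  # ≈50% in expectation
print("Gemini 1.5 Pro:   46.7% (reported)")
print("Gemini 1.5 Flash: 20.0% (reported)")
```

Flash’s 20% is especially striking because it is far below what guessing blindly would achieve on a balanced true/false set.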
“We’ve noticed that the models have more difficulty verifying claims that require considering larger portions of the book, or even the entire book, compared to claims that can be resolved by retrieving sentence-level evidence,” Karpinska said. “Qualitatively, we also observed that the models struggle with verifying claims about implicit information that is clear to a human reader but not explicitly stated in the text.”
The second study, co-authored by researchers at UC Santa Barbara, tested the ability of Gemini 1.5 Flash (but not 1.5 Pro) to “reason over” videos, that is, to search through and answer questions about their content.
The co-authors created a dataset of images (e.g., a photo of a birthday cake) paired with questions for the model to answer about the objects depicted in them (e.g., “What cartoon character is on this cake?”). To evaluate the model, they picked one of the images at random and inserted “distractor” images before and after it to create slideshow-like footage.
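The setup the authors describe, hiding one question-bearing target image among distractors, can be sketched abstractly. The function and file names here are hypothetical, not taken from the paper’s code:

```python
import random

# Hypothetical sketch of the evaluation setup described above: place one
# "target" image at a random position in a slideshow padded with distractor
# images, and record where it landed so a model's answer can be scored later.

def build_slideshow(target, distractors, length=25, seed=None):
    """Return a slideshow of `length` frames with `target` hidden at a
    random position among frames sampled from `distractors`."""
    rng = random.Random(seed)
    frames = rng.sample(distractors, length - 1)  # pick unique distractors
    position = rng.randrange(length)              # where the target goes
    frames.insert(position, target)
    return frames, position

distractors = [f"distractor_{i}.jpg" for i in range(40)]
slideshow, pos = build_slideshow("birthday_cake.jpg", distractors,
                                 length=25, seed=1)
assert len(slideshow) == 25 and slideshow[pos] == "birthday_cake.jpg"
```

The model is then shown the whole sequence and asked about the target frame; scoring checks whether it can locate and describe the right image despite the surrounding distractors.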
Flash didn’t perform all that well. In a test that had the model transcribe six handwritten digits from a “slideshow” of 25 images, Flash got around 50% of the transcriptions right. The accuracy dropped to around 30% with eight digits.
“On real question-answering tasks over images, it appears to be particularly hard for all the models we tested,” Michael Saxon, a PhD student at UC Santa Barbara and co-author of the study, told TechCrunch. “That small amount of reasoning, recognizing that a number is in a frame and reading it, might be what is breaking the model.”
Google is overpromising with Gemini
Neither study has been peer-reviewed, nor do they probe the releases of Gemini 1.5 Pro and 1.5 Flash with 2-million-token contexts. (Both tested the 1-million-token context releases.) And Flash isn’t meant to be as capable as Pro; Google advertises it as a low-cost alternative.
Still, both studies add fuel to the charge that Google has been overpromising, and underdelivering, with Gemini from the start. None of the models the researchers tested, including models from OpenAI and Anthropic, performed well. But Google is the only model provider to give the context window top billing in its advertisements.
“There’s nothing wrong with the simple claim, ‘Our model can take X number of tokens,’ based on objective technical details,” Saxon said. “But the question is, what useful thing can you do with it?”
Generative AI, broadly speaking, is coming under increased scrutiny as businesses and consumers grow frustrated with the technology’s limitations.
In a recent Boston Consulting Group survey, about half of the respondents, all C-suite executives, said they don’t expect generative AI to bring about substantial productivity gains, and that they’re worried about the potential for mistakes and data compromises arising from generative AI-powered tools. PitchBook recently reported that early-stage generative AI dealmaking has declined for two consecutive quarters, plummeting 76% from its Q3 2023 peak.
With meeting-summarizing tools that conjure up fictional details about people and AI search platforms that basically amount to plagiarism generators, customers are on the hunt for promising differentiators. Google, which has raced, at times clumsily, to catch up to its generative AI rivals, was eager to make Gemini’s context one of those differentiators.
But the bet appears to have been premature.
“We haven’t settled on a way to really show that ‘reasoning’ or ‘understanding’ over long documents is taking place, and basically every group releasing these models is cobbling together their own ad hoc evals to make these claims,” Karpinska said. “Without knowing how long-context processing is implemented, and companies do not share these details, it is hard to say how realistic these claims are.”
Google didn’t respond to a request for comment.
Both Saxon and Karpinska believe the antidotes to hyped-up claims around generative AI are better benchmarks and, in the same vein, a greater emphasis on third-party critique. Saxon notes that one of the more common tests for long context, “needle in a haystack,” liberally cited by Google in its marketing materials, only measures a model’s ability to retrieve particular information, such as names and numbers, from datasets, not to answer complex questions about that information.
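The “needle in a haystack” test Saxon describes can be sketched in a few lines: bury a known fact in filler text and check whether the model repeats it back. The model call below is a placeholder stand-in, not a real LLM API; a genuine run would swap in an actual model:

```python
import random

# Minimal "needle in a haystack" harness. A known fact (the needle) is
# inserted at a random depth into filler text (the haystack); the test
# passes if the model's answer contains the needle's payload.

NEEDLE = "The magic number is 48721."
FILLER = "The quick brown fox jumps over the lazy dog."

def build_haystack(n_sentences=1000, seed=0):
    """Bury NEEDLE at a random position among n_sentences of filler."""
    rng = random.Random(seed)
    sentences = [FILLER] * n_sentences
    sentences.insert(rng.randrange(n_sentences), NEEDLE)
    return " ".join(sentences)

def query_model(context, question):
    # Placeholder: a perfect retriever that finds the needle verbatim.
    # Replace with a real LLM call to actually run the benchmark.
    return NEEDLE if NEEDLE in context else "I don't know."

haystack = build_haystack()
answer = query_model(haystack, "What is the magic number?")
print("retrieved" if "48721" in answer else "missed")  # → retrieved
```

As Saxon’s criticism implies, passing this kind of test demonstrates only retrieval, not the ability to reason about or synthesize the surrounding information.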
Scientists and engineers who use these models widely agree that the existing benchmark culture is broken. “It’s important for the public to take giant reports claiming numbers like ‘general intelligence across benchmarks’ with a massive grain of salt,” Saxon said.