Matt Shumer, co-founder and CEO of OthersideAI, maker of the AI writing assistant HyperWrite, broke nearly two days of silence after third-party researchers failed to replicate the top-tier performance he had touted for his company's new AI model, Reflection 70B.
On his X social media account, he admitted he "got ahead of himself," acknowledging that many followers are both excited about the model's potential and skeptical of it.
Still, his latest statements do not fully explain the discrepancy between his original claims and the subsequent independent evaluations, which have failed to replicate the performance of Reflection 70B, a variant of Meta's Llama 3.1. Despite numerous requests, Shumer has yet to specify precisely what went wrong. Here's a timeline:
Thursday, Sept. 5: Reflection 70B debuts with claims of top benchmark performance
If you're just catching up: Shumer unveiled Reflection 70B on Thursday, calling it "the world's top open-source model" and sharing a chart showcasing its impressive performance on third-party benchmarks.
According to Shumer, that performance was due to "Reflection Tuning," a technique that enables the model to assess and correct its own responses for accuracy before presenting them to users.
We reported those benchmarks as provided, with credit to Shumer, since we lack the resources and access to independently verify model benchmarks; moreover, most model providers we have covered have so far been honest in their performance claims.
Friday, Sept. 6 - Monday, Sept. 9: Replication failures mount as Shumer faces allegations of fraud
Almost immediately after the model's debut, independent evaluators and members of the AI community began scrutinizing its performance over the weekend, struggling to replicate the claimed results on their own. Several uncovered evidence suggesting that the model's apparent intelligence was a thin veneer over other models' work.
Criticism mounted after Artificial Analysis, an independent benchmarking organization, posted scores for Reflection 70B that were significantly lower than the company's initial claims.
It also emerged that Shumer had invested in Glaive AI, the startup whose platform was used to generate synthetic data for training the model, a relationship he failed to disclose when launching Reflection 70B.
Shumer attributed the discrepancies to problems with the model's upload to Hugging Face and vowed last week to update the model weights, but he has yet to make good on that promise.
On Sunday, Sept. 8, one critic openly accused Shumer of "fraud in the AI research community." Shumer declined to directly address the accusation.
After posting and reposting various messages related to Reflection 70B, Shumer went quiet on Sunday evening, neither responding to VentureBeat's request for comment nor posting publicly until tonight, Tuesday, September 10.
Critics also pointed out that it is surprisingly easy to train even less capable models to score well on third-party benchmarks.
Tuesday, Sept. 10: Shumer apologizes as Glaive AI's founder responds
Tonight, Shumer finally posted a statement, apologizing in part for how the release was handled.
Sahil Chaudhary, founder of Glaive AI, the startup whose platform was previously said to have been used to generate synthetic data and train Reflection 70B, also weighed in with his own statement.
Chaudhary said he was puzzled by claims that Reflection 70B's responses indicate it is actually a variant of Anthropic's Claude. He also acknowledged that the benchmark scores he initially shared with Shumer have proven unreliable, as attempts to reproduce them have so far been unsuccessful.
However, Shumer's and Chaudhary's responses failed to alleviate the concerns of skeptics and critics, among them Yuchen Jin, co-founder and chief technology officer of an open-source AI cloud provider.
Jin recounted the arduous process of hosting Reflection 70B on his company's platform and meticulously troubleshooting its apparent errors. He said the experience left him emotionally drained after he and others invested considerable time, energy, and resources; his frustration over the weekend's turmoil was palpable in his tweets.
In response to Shumer's statement, he penned a sharp-tongued reply: "Matt, it's disheartening to see that after we invested significant time, energy, and computational resources in hosting your model, you've remained silent for over 30 hours. I think it would be more productive if you were transparent about what happened, particularly why your private API outperforms the released model."
The pseudonymous critic Shin Megami Boson, along with several other detractors, likewise remained unconvinced by the account Shumer and Chaudhary presented tonight, which frames the episode as a series of unexplained errors born of overeagerness rather than deliberate deception.
As the critic put it: "As far as I can tell, either Matt Shumer is lying, or you are, or both of you are." Nor were members of the Local Llama subreddit persuaded by Shumer's and Chaudhary's assertions.
Time will tell whether Shumer and Chaudhary can effectively answer their detractors and doubters, an increasingly vocal online community that includes some of the most prominent figures in the generative AI space.