A recent article argued that the output from generative AI systems, such as GPT and Gemini, isn't as good as it used to be. This isn't the first time I've heard that complaint, though I don't know how widespread the opinion is. But I wonder: is it correct? And if so, why?
I think a few things are happening in the AI world. Developers of AI systems are trying to improve the output of their models, and they appear to be prioritizing enterprise customers who can sign large contracts over individual subscribers paying a modest $20 a month. If I were doing that, I would tune my model to produce more formal business prose. (That's not good prose, but it is what it is.) We can warn people not to paste AI-generated output into their reports as often as we like, but that doesn't guarantee they won't do it, and it does mean that AI developers will try to give them what they want.
AI developers are also trying to build models that are more accurate. The error rate has dropped noticeably, but it is far from zero. Tuning a model, though, largely means limiting its ability to give unconventional answers, the ones we find brilliant, insightful, or surprising. That is useful up to a point: reducing the standard deviation cuts off the tails. The price of minimizing hallucinations and other errors is also minimizing the good outliers. I won't argue that developers shouldn't reduce hallucinations, but you do have to pay the price.
The "AI blues" has also been attributed to model collapse. I find model collapse an intriguing idea, but it is probably too early to observe it in today's large language models, and I haven't seen it myself. The models aren't retrained frequently enough, and the amount of AI-generated content in their training data is still relatively small, especially if their creators are engaged in large-scale copyright violation.
However, another possibility is very human and has nothing to do with the language models themselves. ChatGPT has been around for almost two years. When it came out, we were all amazed at how good it was. A few people quoted Samuel Johnson's 18th-century remark about a dog walking on its hind legs: "It is not done well; but you are surprised to find it done at all."1 We were all amazed, errors, hallucinations, and all. We were astonished to find that a computer could actually hold a conversation, and do so reasonably fluently, even those of us who had tried GPT-2.
Two years on, however, it's a different story. We've grown accustomed to ChatGPT and its fellows: Gemini, Claude, Llama, Mistral, and the rest. As we start using generative AI for real work, the initial amazement has worn off. We've become less tolerant of its obsessive wordiness; we no longer find it insightful or original, though we don't really know whether it ever was. While it's possible that language model output has gotten somewhat worse over the past two years, I suspect the reality is that we have become less forgiving.
I'm sure many people have tested this more rigorously than I have, but I have run two informal tests on most language models since their early days:
- Writing a Petrarchan sonnet. (A Petrarchan sonnet has a different rhyme scheme from a Shakespearean sonnet.)
- Implementing a well-known but nontrivial algorithm correctly in Python. (I usually ask for the Miller-Rabin test for checking whether a number is prime.)
The results of the two tests are surprisingly similar. Until a few months ago, the major LLMs could not write a proper Petrarchan sonnet. They could describe one lucidly, but when asked to write one they would botch the rhyme scheme, usually producing a Shakespearean sonnet instead. They failed even when the prompt included the Petrarchan rhyme scheme. They failed even in Italian, an experiment one of my colleagues ran. Then, quite suddenly, around the time of Claude 3, models learned to do Petrarch correctly. Just the other day, I tried two more demanding poetic forms: the sestina and the villanelle. A villanelle repeats two of its lines in clever ways in addition to following a rhyme scheme; a sestina requires reusing the same line-ending words. They succeeded. They're no match for a Provençal troubadour, but they did it.
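The distinction the models kept missing is easy to write down concretely. Here's a minimal Python sketch of the two rhyme schemes, one letter per line of the 14-line poem; note that the Petrarchan sestet admits several variants (CDECDE is shown, CDCDCD is another), so this is an illustration, not an exhaustive definition:

```python
# Rhyme schemes as letter patterns, one letter per line of a 14-line sonnet.
# The Petrarchan sestet has several accepted variants; CDECDE is one of them.
PETRARCHAN = "ABBAABBA" + "CDECDE"                # octave + sestet
SHAKESPEAREAN = "ABAB" + "CDCD" + "EFEF" + "GG"   # three quatrains + a couplet

def schemes_match(a: str, b: str) -> bool:
    """True if two schemes impose the same rhyme pattern, up to renaming letters."""
    def canonical(scheme: str) -> tuple:
        # Map each letter to the order in which it first appears,
        # so "ABBA" and "CDDC" both canonicalize to (0, 1, 1, 0).
        seen: dict = {}
        return tuple(seen.setdefault(ch, len(seen)) for ch in scheme)
    return canonical(a) == canonical(b)
```

A model that answers a Petrarchan prompt with `SHAKESPEAREAN` fails exactly this check: the two patterns diverge from the third line onward.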
I got the same results when I asked the models to write a program implementing the Miller-Rabin algorithm to test whether large numbers are prime. When GPT-3 first came out, this was a failure: it would generate code that ran without errors, but the code would tell me that numbers like 21 were prime. Gemini was no better; after several tries, it blamed its failure on Python's libraries for numerical computation, which it claimed could not handle large numbers. (I gather it doesn't like users who keep saying, "Sorry, that's wrong again. What are you doing that's incorrect?") Now the models implement the algorithm correctly, at least the last time I tried. (Your mileage may vary.)
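For what it's worth, a correct implementation is short. Here is a minimal sketch of the Miller-Rabin test in Python (the function and parameter names are mine, not taken from any model's output); it correctly classifies 21, the number early models got wrong, as composite:

```python
import random

def is_probable_prime(n: int, rounds: int = 40) -> bool:
    """Miller-Rabin probabilistic primality test.

    Returns False if n is definitely composite, True if n is prime
    with overwhelming probability (error < 4**-rounds).
    """
    if n < 2:
        return False
    # Trial division by small primes handles small n and cheap composites.
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    # Write n - 1 as d * 2**s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)          # modular exponentiation, built into Python
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False          # a witnesses that n is composite
    return True
```

Because Python integers are arbitrary precision and three-argument `pow` does modular exponentiation natively, "large numbers" pose no special problem, contrary to Gemini's excuse.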
That success doesn't mean there's no room for frustration. I've asked models how to improve programs that ran correctly but that I knew weren't optimal. Sometimes I knew both the problem and its solution; sometimes I understood the problem but not how to fix it. The first time you try it, you're likely to be pleasantly surprised: the advice, such as "add error checking and use more descriptive variable names," sounds helpful. By the second or third time, you realize you're getting the same well-intentioned boilerplate, and that it rarely offers real insight. "Surprised to find it done at all" decayed rapidly into "it's not done well."
This experience probably reflects a fundamental limitation of language models. After all, they aren't intelligent in any deep sense. Until we know otherwise, they are just predicting what should come next, extrapolating from their training data. How much of the open-source code on GitHub and Stack Overflow really demonstrates best practice? Plenty of it shows questionable design choices, inefficient algorithms, and even security vulnerabilities. And how much of it is merely pedestrian, like my own code? I'd bet the latter group dominates, and that is what's reflected in an LLM's output. Thinking back to Johnson's dog, I am indeed surprised to find it done at all, though perhaps not for the reason most people would expect. Clearly, much of what's online is not wrong. But a lot of it isn't as good as it could be, and that should surprise no one. What's unfortunate is that the volume of "pretty good, but inferior to the best" content tends to dominate a language model's output, which can make it unimpressive and unoriginal.
That is the big predicament facing developers of language models. How do we get answers that are insightful, delightful, and better than the average of what's already online? The initial surprise has worn off, and AI is now being judged on its merits. Will AI deliver on its promise, or will we collectively shrug and call it "boring" as its output creeps into every aspect of our lives? There may be some truth to the idea that we're trading off delightful answers for reliable ones, and that's not necessarily a bad thing. But we want delight and insight as well as reliability. How will AI deliver that?
Footnotes
1. The remark is recorded in Boswell's Life of Samuel Johnson, published in 1791.