Thursday, September 4, 2025

Beyond the benchmarks: Understanding the coding personalities of different LLMs

Most reports comparing AI models are based on performance benchmarks, but a recent research report from Sonar takes a different approach: grouping models by their coding personalities and looking at the downsides of each when it comes to code quality.

The researchers studied five different LLMs using the SonarQube Enterprise static analysis engine on over 4,000 Java assignments. The LLMs reviewed were Claude Sonnet 4, OpenCoder-8B, Llama 3.2 90B, GPT-4o, and Claude 3.7 Sonnet.

They found that the models had different traits, such as Claude Sonnet 4 being very verbose in its outputs, producing over 3x as many lines of code as OpenCoder-8B for the same problem.

Based on these traits, the researchers divided the five models into coding archetypes. Claude Sonnet 4 was the “senior architect,” writing sophisticated, complex code, but introducing high-severity bugs. “Because of the level of technical difficulty attempted, there were more of these issues,” said Donald Fischer, a VP at Sonar.

OpenCoder-8B was the “rapid prototyper”: the fastest and most concise of the group, which makes it ideal for proofs of concept, but also prone to creating technical debt. It had the highest issue density of all the models, with 32.45 issues per thousand lines of code.
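Issue density is the issue count normalized by code size. The arithmetic can be sketched as follows, assuming the usual per-thousand-lines definition; the function name and the sample counts are hypothetical and are not figures from the Sonar report.

```python
def issue_density(issue_count: int, lines_of_code: int) -> float:
    """Return the number of issues per thousand lines of code."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return issue_count * 1000 / lines_of_code


# Hypothetical counts, chosen only to illustrate the calculation.
print(issue_density(50, 2_000))  # 25.0
```

Normalizing by size is what makes a terse model and a verbose one comparable: a raw issue count would penalize the model that simply writes more code.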

Llama 3.2 90B was the “unfulfilled promise,” as its scale and backing imply it should be a top-tier model, but it only had a pass rate of 61.47%. Additionally, 70.73% of the vulnerabilities it created were “BLOCKER” severity, the most severe type of bug, which prevents testing from continuing.

GPT-4o was an “efficient generalist,” a jack-of-all-trades that is a common choice for general-purpose coding assistance. Its code wasn’t as verbose as the senior architect’s or as concise as the rapid prototyper’s, but somewhere in the middle. It also avoided producing severe bugs for the most part, but 48.15% of its bugs were control-flow errors.

“This paints a picture of a coder who correctly grasps the main objective but often fumbles the details required to make the code robust. The code is likely to function for the intended scenario but will be plagued by persistent problems that compromise quality and reliability over time,” the report states.

Finally, Claude 3.7 Sonnet was a “balanced predecessor.” The researchers found that it was a capable developer that produced well-documented code, but still introduced a large number of severe vulnerabilities.

Though the models did have these distinct personalities, they also shared similar strengths and weaknesses. The common strengths were that they quickly produced syntactically correct code, had solid algorithmic and data structure fundamentals, and efficiently translated code to different languages. The common weaknesses were that they all produced a high percentage of high-severity vulnerabilities, introduced severe bugs like resource leaks or API contract violations, and had an inherent bias towards messy code.

“Like humans, they become susceptible to subtle issues in the code they generate, and so there’s this correlation between capability and risk introduction, which I think is amazingly human,” said Fischer.

Another interesting finding of the report is that newer models may be more technically capable, but are also more likely to generate risky code. For example, Claude Sonnet 4 has a 6.3% improvement over Claude 3.7 Sonnet on benchmark pass rates, but the issues it generated were 93% more likely to be “BLOCKER” severity.

“If you think the newer model is superior, think about it one more time because newer isn’t actually superior; it’s injecting more and more issues,” said Prasenjit Sarkar, solutions marketing manager at Sonar.

How reasoning modes impact GPT-5

The researchers followed up their report this week with new data on GPT-5 and how the four available reasoning modes (minimal, low, medium, and high) impact performance, security, and code quality.

They found that increasing reasoning has a diminishing return on functional performance. Bumping up from minimal to low results in the model’s pass rate rising from 75% to 80%, but medium and high only had pass rates of 81.96% and 81.68%, respectively.

In terms of security, high and low reasoning modes eliminate common attacks like path-traversal and injection, but replace them with harder-to-detect flaws, like inadequate I/O error-handling. The low reasoning mode had the highest percentage of that issue at 51%, followed by high (44%), medium (36%), and minimal (30%).
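To make the two flaw classes concrete, here is a minimal sketch of a file read that guards against path traversal and also handles I/O errors explicitly. It is an illustration only, not code from the Sonar report (which analyzed Java); the function name `read_user_file` and its parameters are hypothetical.

```python
from pathlib import Path


def read_user_file(base_dir: str, user_path: str) -> str:
    """Read a text file under base_dir, rejecting paths that escape it."""
    base = Path(base_dir).resolve()
    target = (base / user_path).resolve()
    # Path-traversal guard: the resolved target must stay inside base.
    if not target.is_relative_to(base):  # requires Python 3.9+
        raise PermissionError(f"path escapes base directory: {user_path}")
    try:
        # The context manager closes the file even if reading fails.
        with target.open(encoding="utf-8") as handle:
            return handle.read()
    except OSError as exc:
        # Report I/O failures with context instead of silently ignoring them.
        raise RuntimeError(f"could not read {target}") from exc
```

The traversal check is the kind of flaw the report says disappeared in some modes; the `try`/`except` around the read is the kind of handling whose absence would count as inadequate I/O error-handling.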

“We’ve seen the path-traversal and injection become zero percent,” said Sarkar. “We can see that they are trying to solve one sector, and what is happening is that while they are trying to solve code quality, they are somewhere doing this trade-off. Inadequate I/O error-handling is another problem that has skyrocketed. If you look at 4o, it has gone to 15-20% more in the newer model.”

There was a similar pattern with bugs, with control-flow errors decreasing beyond minimal reasoning, but advanced bugs like concurrency / threading increasing alongside the reasoning difficulty.

“The trade-offs are the key thing here,” said Fischer. “It’s not as simple as to say, which is the best model? The way this has been seen in the horse race between different models is which ones complete the most number of solutions on the SWE-bench benchmark. As we’ve demonstrated, the models that can do more, that push the boundaries, they also introduce more security vulnerabilities, they introduce more maintainability issues.”
