With its latest stable release dated January 28, 2025, Qwen2.5-Max is a Mixture-of-Experts (MoE) language model developed by Alibaba. Like other language models, Qwen2.5-Max is capable of generating text, understanding different languages, and performing advanced reasoning. According to recent benchmarks, it is also more secure than DeepSeek-V3-0324.
Using Recon to scan for vulnerabilities
A team of analysts with Protect AI, the company behind a red teaming and security vulnerability scanning tool called Recon, recently used their platform to compare the security of Qwen2.5-Max against that of DeepSeek-V3.
The team’s analysis reads, in part: “We observed that DeepSeek-V3-0324 is more vulnerable than Qwen2.5-Max, with Recon achieving an almost 25% higher attack success rate (ASR).”
While it may be more secure than its competition, Qwen2.5-Max isn’t exactly perfect. According to their tests, the AI model is most susceptible to prompt injection attacks, as these represented almost 48% of all successful cyberattacks against Qwen2.5-Max. Evasion and jailbreak attacks proved to be less successful, with an approximate ASR of 40% for both.
Exposing vulnerabilities in DeepSeek-V3
Recon uses a comprehensive Attack Library to scan current-gen AI models and identify vulnerabilities across six specific categories:
- Evasion techniques
- System prompt leaks
- Prompt injection attacks
- AI jailbreak attempts
- General safety controls
- Adversarial suffix resistance
In addition to simulated cyberattacks, Recon also assesses the AI models’ resistance to generating potentially harmful or illegal content. For example, during adversarial suffix resistance tests, Recon attempts to manipulate the AI model into generating harmful or illegal content.
The Protect AI team ran Recon against both Qwen2.5-Max and DeepSeek-V3, with the former showing a lower attack success rate (ASR) across a variety of attacks, including jailbreaks, prompt injection, and evasion techniques.
Qwen2.5-Max had a 47% ASR against prompt injection attacks, compared to DeepSeek-V3’s notably higher 77%. Against evasion techniques, Qwen2.5-Max scored a 39.4% ASR, while DeepSeek-V3 scored 69.2%. Both AI models displayed similar results across other simulated cyberattacks.
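To make the ASR figures concrete, here is a minimal Python sketch. It assumes the common definition of ASR as the share of attack attempts that succeed; the article does not describe how Recon computes the metric internally, so the function and the attempt counts in the usage line are illustrative only. The per-category percentages are the ones reported above.

```python
def attack_success_rate(successes: int, attempts: int) -> float:
    """Return ASR as a percentage (assumed definition: successes / attempts)."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * successes / attempts


# ASR percentages as reported in the article.
reported_asr = {
    "prompt injection": {"Qwen2.5-Max": 47.0, "DeepSeek-V3": 77.0},
    "evasion": {"Qwen2.5-Max": 39.4, "DeepSeek-V3": 69.2},
}


def asr_gap(category: str) -> float:
    """Percentage-point gap between DeepSeek-V3 and Qwen2.5-Max for a category."""
    scores = reported_asr[category]
    return round(scores["DeepSeek-V3"] - scores["Qwen2.5-Max"], 1)


if __name__ == "__main__":
    # Hypothetical counts, chosen only to show how the formula works.
    print(attack_success_rate(successes=47, attempts=100))
    for category in reported_asr:
        print(category, asr_gap(category))
```

Because a lower ASR means fewer attacks got through, a positive gap here favors Qwen2.5-Max.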
Analyzing DeepSeek-V3’s strengths
Despite its security weaknesses, DeepSeek-V3-0324 still outperforms Qwen2.5-Max in several different benchmarks. Unlike the ASR, a higher score in these tests actually indicates better performance.
Benchmark | DeepSeek-V3-0324 | Qwen2.5-Max |
---|---|---|
MMLU-Pro | 81.2 | 75.9 |
GPQA Diamond | 68.4 | 59.1 |
MATH-500 | 94.0 | 90.2 |
AIME 2024 | 59.4 | 39.6 |
LiveCodeBench | 49.2 | 39.2 |
According to these benchmarks, DeepSeek-V3-0324’s strengths include general language understanding (MMLU-Pro), advanced topics such as biology, physics, and chemistry (GPQA Diamond), mathematics (MATH-500), competition-level mathematics (AIME 2024, the American Invitational Mathematics Examination), and coding (LiveCodeBench).
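As a quick way to see where the gap is widest, the following sketch computes DeepSeek-V3-0324’s lead on each benchmark from the scores in the table. The dictionary layout and function name are illustrative choices, not part of any benchmark tooling.

```python
# Scores from the table: (DeepSeek-V3-0324, Qwen2.5-Max).
benchmarks = {
    "MMLU-Pro": (81.2, 75.9),
    "GPQA Diamond": (68.4, 59.1),
    "MATH-500": (94.0, 90.2),
    "AIME 2024": (59.4, 39.6),
    "LiveCodeBench": (49.2, 39.2),
}


def margins(scores: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Return DeepSeek-V3-0324's lead over Qwen2.5-Max, in points, per benchmark."""
    return {name: round(deepseek - qwen, 1) for name, (deepseek, qwen) in scores.items()}


if __name__ == "__main__":
    leads = margins(benchmarks)
    # Print benchmarks from the largest lead to the smallest.
    for name, lead in sorted(leads.items(), key=lambda item: item[1], reverse=True):
        print(f"{name}: +{lead}")
```

On these numbers the largest lead is on AIME 2024 and the smallest on MATH-500.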