Microsoft raises the bar: A smarter way to measure AI for cybersecurity

October 15, 2025


ExCyTIn-Bench is Microsoft's newest open-source benchmarking tool, designed to evaluate how well AI systems perform real-world cybersecurity investigations.¹ It helps business leaders assess language models by simulating realistic cyberthreat scenarios and providing clear, actionable insights into how those tools reason through complex problems. In contrast to previous benchmarks that focused on threat intelligence trivia or static knowledge, this benchmark evaluates AI agents on multistep investigations of data-rich, multistage cyberattack scenarios within a simulated security operations center (SOC) in Microsoft Azure. It incorporates 57 log tables from Microsoft Sentinel and related services to reflect the scale, noise, and complexity of real incidents and SOC operations.²

Why ExCyTIn-Bench matters for business

For chief information security officers (CISOs), IT leaders, and buyers, ExCyTIn-Bench offers a clear, objective way to assess AI capabilities for security. It's not just about accuracy on cyberthreat reports, trivia, or toy simulations, but about how well AI can investigate, adapt, and explain its findings in the face of real-world cyberthreats. As cyberattacks grow in sophistication, tools like ExCyTIn-Bench help organizations select solutions that truly enhance detection, response, and resilience.

Microsoft uses this framework internally to strengthen its AI-powered security features and test their ability to withstand real-world cyberattacks. Our security-focused in-house models rely on feedback from ExCyTIn to uncover weaknesses in detection logic, tool capabilities, and data navigation. For broader integration, we're also collaborating with security products such as Microsoft Security Copilot, Microsoft Sentinel, and Microsoft Defender to evaluate and provide feedback on their AI features. Additionally, Microsoft Security product owners can monitor how different models perform and what they cost, allowing them to choose appropriate models for specific features.

How ExCyTIn-Bench improves upon traditional benchmarks

Unlike traditional benchmarks³,⁴ that rely on multiple-choice questions, which are often susceptible to guesswork, ExCyTIn-Bench adopts an innovative, principled methodology for generating questions and answers from threat investigation graphs. Human analysts conceptualize threat investigations using incident graphs, specifically bipartite alert-entity graphs.⁵ These serve as ground truth, supporting the creation of explainable question-answer pairs grounded in authentic security data. This enables rigorous assessment of strategy quality, not just final answers. Even recent industry publications, such as CyberSOCEval,³ focus on packaging realistic SOC scenarios and evaluating how models investigate static evidence in them. ExCyTIn takes a different approach in both design and technical implementation by positioning the agent inside a controlled Azure SOC environment, where the agent queries live log tables, transitions across data sources, and plans multistep investigations.

As a result, ExCyTIn evaluates complete reasoning processes, including goal decomposition, tool usage, and evidence synthesis, under constraints that simulate an analyst's workflow. By defining rigorous ground truths and extensible frameworks, ExCyTIn-Bench enables realistic, multiturn, agent-based experimentation, collaboration, and continuous self-improvement, all reinforced by verifiable, fine-grained reward mechanisms for AI-powered cyber defense.⁶
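To make the idea of a bipartite alert-entity graph concrete, here is a minimal Python sketch. It is not the benchmark's implementation: the node names, the `EDGES` list, and the helpers `build_graph`, `shortest_path`, and `make_qa` are all hypothetical, and the rule used to derive a question (treat one entity as the given context, another as the answer, and the shortest path between them as the explainable solution) is a simplifying assumption for illustration.

```python
from collections import deque

# Hypothetical toy incident. Every edge links one alert to one entity,
# which is what makes the graph bipartite. Names are illustrative only.
EDGES = [
    ("alert:phishing_email", "entity:user_alice"),
    ("alert:phishing_email", "entity:url_bad_link"),
    ("alert:malicious_download", "entity:url_bad_link"),
    ("alert:malicious_download", "entity:host_ws01"),
    ("alert:suspicious_process", "entity:host_ws01"),
    ("alert:suspicious_process", "entity:file_payload_exe"),
]


def build_graph(edges):
    """Build an undirected adjacency map from (alert, entity) pairs."""
    graph = {}
    for alert, entity in edges:
        graph.setdefault(alert, set()).add(entity)
        graph.setdefault(entity, set()).add(alert)
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the node list from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in sorted(graph[node]):  # sorted for deterministic output
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def make_qa(graph, context_entity, answer_entity):
    """Assumed rule: the path between two entities is the ground-truth solution."""
    path = shortest_path(graph, context_entity, answer_entity)
    if path is None:
        return None
    return {
        "question": f"Starting from {context_entity}, which entity does the attack reach last?",
        "answer": answer_entity,
        "solution_path": path,
    }


graph = build_graph(EDGES)
qa = make_qa(graph, "entity:user_alice", "entity:file_payload_exe")
for step in qa["solution_path"]:
    print(step)
```

Because the solution path alternates between alerts and entities, each hop corresponds to something an analyst (or agent) would have to look up in the logs, which is what lets a graph like this serve as ground truth for the investigation strategy and not only for the final answer.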

ExCyTIn-Bench innovations that deliver strategic value

Realistic security evaluation. Unlike most open-source benchmarks,³,⁴ ExCyTIn-Bench captures the complexity and ambiguity of actual cyber investigations. AI agents are challenged to analyze noisy, multitable security data, construct advanced queries, and uncover indicators of compromise (IoCs), mirroring the work of human SOC analysts.
Transparent, actionable metrics. The benchmark provides fine-grained, step-by-step reward signals for each investigative action, rather than the basic binary success-or-failure metrics found in existing benchmarks. This transparency helps organizations understand not just what a model can do, but how it arrives at its conclusions, which is critical for actionability, trust, and compliance.
Accelerating innovation. ExCyTIn-Bench is open-source and designed for collaboration. Researchers and vendors worldwide can use it to test, compare, and improve new models, driving rapid progress in automated cyber defense.
Personalized benchmarks (coming soon). Create tailored cyberthreat investigation benchmarks specific to the threats occurring in each customer tenant.
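The difference between a binary metric and a fine-grained, step-by-step reward can be illustrated with a short Python sketch. The scoring rule below is an assumption made for illustration, not the benchmark's actual formula: full credit for reaching the final answer, otherwise partial credit that shrinks by a hypothetical `decay` factor for each step the agent's furthest finding is from the answer. The function names and the example path are likewise hypothetical.

```python
def binary_reward(found, solution_steps):
    """1.0 only if the final answer was found, otherwise 0.0."""
    return 1.0 if solution_steps[-1] in found else 0.0


def stepwise_reward(found, solution_steps, decay=0.5):
    """Assumed partial-credit rule (not the benchmark's actual formula).

    Walk backward from the answer; the first ground-truth step the agent
    reached earns decay ** (number of steps it is away from the answer).
    """
    for steps_from_answer, node in enumerate(reversed(solution_steps)):
        if node in found:
            return decay ** steps_from_answer
    return 0.0


# Hypothetical ground-truth steps, ending at the answer entity.
solution_steps = [
    "alert:phishing_email",
    "entity:url_bad_link",
    "alert:malicious_download",
    "entity:host_ws01",
]

# An agent that traced the phishing link but never reached the host.
partial_findings = {"alert:phishing_email", "entity:url_bad_link"}

partial_binary = binary_reward(partial_findings, solution_steps)
partial_stepwise = stepwise_reward(partial_findings, solution_steps)
full_stepwise = stepwise_reward(set(solution_steps), solution_steps)

print(partial_binary, partial_stepwise, full_stepwise)
```

Under the binary metric, the partially successful agent is indistinguishable from one that found nothing; the step-wise score shows how far along the investigation it got.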

Latest results: language models are getting smarter

Recent evaluations show that the latest models are making significant strides:

Table comparing average rewards of different AI models across several incidents. GPT-5 (Reasoning=High) shows the highest average reward.

GPT-5 (High Reasoning) leads with a 56.2% average reward, outperforming previous models and demonstrating the value of advanced reasoning for security tasks.
Smaller models with effective chain-of-thought (CoT) reasoning, like GPT-5-mini, are now rivaling larger models, offering strong performance at lower cost.
Explicit reasoning matters. Lower reasoning settings in GPT-5 drop performance by nearly 19%, highlighting that deep, step-by-step reasoning is essential for complex investigations.
Open-source models are closing the gap with proprietary solutions, making high-quality security automation more accessible.
New models are getting close to top CoT techniques (ReAct, reflection, and best-of-N (BoN) at 56.3%) but don't surpass them, suggesting comparable reasoning during inference.

Get involved

ExCyTIn-Bench is open-source and free to access. Model developers and security teams are invited to contribute, benchmark, and share results through the official GitHub repository. For questions or partnership opportunities, reach out to the team at msecaimrbenchmarking@microsoft.com.

Thanks to the MSECAI Benchmarking team for helping make this a reality.

To learn more about Microsoft Security solutions, visit our website. Bookmark the Security blog to keep up with our expert coverage on security matters. Also, follow us on LinkedIn (Microsoft Security) and X (@MSFTSecurity) for the latest news and updates on cybersecurity.

¹Benchmarking LLM agents on Cyber Threat Investigation

²https://huggingface.co/datasets/anandmudgerikar/excytin-bench

³CyberSOCEval: Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning

⁴[2406.07599] CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence

⁵Incident or threat investigation graphs portray multistage attacks by linking alerts, events, and indicators of compromise (IoCs) into a unified view. Nodes denote alerts (e.g., suspicious file downloads) or entities (e.g., user accounts), while edges capture their relationships (e.g., a phishing email that triggers a malicious download).

⁶[2507.14201] ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation
