Saturday, October 18, 2025

Accelerating open-source infrastructure innovation for frontier AI at scale

Microsoft is contributing new standards across power, cooling, sustainability, security, networking, and fleet resiliency to advance innovation.

In the transition from building computing infrastructure for cloud scale to building cloud and AI infrastructure for frontier scale, the world of computing has experienced tectonic shifts in innovation. Throughout this journey, Microsoft has shared its learnings and best practices, optimizing our cloud infrastructure stack in cross-industry forums such as the Open Compute Project (OCP) Global Foundation.

Today, we see that the next phase of cloud infrastructure innovation is poised to be the most consequential period of transformation yet. In just the last year, Microsoft has added more than 2 gigawatts of new capacity and launched the world's most powerful AI datacenter, which delivers 10x the performance of today's fastest supercomputer. Yet, this is just the beginning.

Delivering AI infrastructure at the highest performance and lowest cost requires a systems approach, with optimizations across the stack to drive quality, speed, and resiliency at a level that can provide a consistent experience to our customers. In the quest to deliver resilient, sustainable, secure, and broadly scalable technology to address the breadth of AI workloads, we are embarking on an ambitious new journey: one not just of redefining infrastructure innovation at every layer of execution from silicon to systems, but one of tightly integrated industry alignment on standards that offer a model for global interoperability and standardization.

At this year's OCP Global Summit, Microsoft is contributing new standards across power, cooling, sustainability, security, networking, and fleet resiliency to further advance innovation in the industry.

Redefining power distribution for the AI era

As AI workloads scale globally, hyperscale datacenters are experiencing unprecedented power density and distribution challenges.

Last year, at the OCP Global Summit, we partnered with Meta and Google on the development of Mt. Diablo, a disaggregated power architecture. This year, we're building on this innovation with the next step in our full-stack transformation of datacenter power systems: solid-state transformers. Solid-state transformers simplify the power chain with new conversion technologies and protection schemes that can accommodate future rack voltage requirements.

Training large models across thousands of GPUs also introduces variable and intense power draw patterns that can strain the grid, the utility, and traditional power delivery systems. These fluctuations not only risk hardware reliability and operational efficiency but also create challenges for capacity planning and sustainability goals.

Together with key industry partners, Microsoft is leading a power stabilization initiative to address this challenge. In a recently published paper with OpenAI and NVIDIA, "Power Stabilization for AI Training Datacenters," we describe how full-stack innovations spanning rack-level hardware, firmware orchestration, predictive telemetry, and facility integration can smooth power spikes, reduce power overshoot by 40%, and mitigate operational risk and costs to enable predictable, scalable power delivery for AI training clusters.
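To make the problem concrete: synchronous training alternates compute bursts with communication or checkpoint lulls, producing large, correlated power swings. The sketch below is a minimal, hypothetical illustration of one mitigation idea, a rack-level ramp-rate (slew) limiter; the load shape, limits, and method are our simplification, not taken from the paper.

```python
# Illustrative sketch: smoothing synchronized AI-training power swings with a
# rack-level slew-rate limiter. All numbers are hypothetical.

def training_load(steps, high=100.0, low=20.0, period=10):
    """Square-wave rack power (kW): compute bursts alternating with
    communication/idle phases, as seen in synchronous training."""
    return [high if (t // period) % 2 == 0 else low for t in range(steps)]

def slew_limit(load, max_step=5.0):
    """Limit how fast delivered power may move per tick, emulating a
    firmware ramp-rate control that shaves spikes and fills troughs."""
    out = [load[0]]
    for target in load[1:]:
        delta = max(-max_step, min(max_step, target - out[-1]))
        out.append(out[-1] + delta)
    return out

raw = training_load(60)
smoothed = slew_limit(raw)
swing_raw = max(raw) - min(raw)            # 80 kW peak-to-trough, unmitigated
swing_smooth = max(smoothed) - min(smoothed)
print(swing_raw, swing_smooth)             # 80.0 50.0
```

The grid sees the slew-limited trace, so the peak-to-trough excursion (and its frequency content) shrinks; real systems pair this with predictive telemetry so ramps begin before the burst arrives.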

This year, at the OCP Global Summit, Microsoft is joining forces with industry partners to launch a dedicated power stabilization workgroup. Our goal is to foster open collaboration across hyperscalers and hardware partners, sharing our learnings from full-stack innovation and inviting the community to co-develop new methodologies that address the unique power challenges of AI training datacenters. By building on the insights from our recently published white paper, we aim to accelerate industry-wide adoption of resilient, scalable power delivery solutions for the next generation of AI infrastructure. Read more about our power stabilization efforts.

Cooling innovations for resiliency

As the power profile for AI infrastructure changes, we're also continuing to rearchitect our cooling infrastructure to support evolving needs around energy consumption, space optimization, and overall datacenter sustainability. A variety of cooling solutions must be deployed to support the scale of our expansion: as we build new AI-scale datacenters, we're also using Heat Exchanger Unit (HXU)-based liquid cooling to rapidly deploy new AI capacity within our existing air-cooled datacenter footprint.

Microsoft's next-generation HXU is an upcoming OCP contribution that enables liquid cooling for high-performance AI systems in air-cooled datacenters, supporting global scalability and rapid deployment. The modular HXU design delivers 2x the performance of current models and maintains >99.9% cooling service availability for AI workloads. No datacenter modifications are required, allowing seamless integration and expansion. Learn more about the next-generation HXU here.
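As a quick sense check of what an availability target like >99.9% implies in practice, the arithmetic below converts availability into an annual downtime budget; the 99.9% figure is from the contribution, but the translation to hours is ours.

```python
# Back-of-the-envelope: annual downtime budget implied by an availability target.
hours_per_year = 365 * 24  # 8760, ignoring leap years

for availability in (0.999, 0.9999):
    downtime_h = hours_per_year * (1 - availability)
    print(f"{availability:.2%} available -> {downtime_h:.2f} h/year downtime budget")
# 99.90% available -> 8.76 h/year downtime budget
# 99.99% available -> 0.88 h/year downtime budget
```

Under nine hours a year of cooling interruption is a tight budget once maintenance windows are counted, which is why the design emphasis falls on modularity and servicing without facility changes.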

Meanwhile, we're continuing to innovate across multiple layers of the stack to address changes in power and heat dissipation: using facility water cooling at datacenter scale, circulating liquid in closed loops from server to chiller, and exploring on-chip cooling innovations like microfluidics to efficiently remove heat directly from the silicon.

Unified networking solutions for growing infrastructure demands

Scaling hundreds of thousands of GPUs to operate as a single, coherent system poses significant challenges in creating rack-scale interconnects that deliver low-latency, high-bandwidth fabrics that are both efficient and interoperable. As AI workloads grow exponentially and infrastructure demands intensify, we're exploring networking optimizations that can support these needs. To that end, we have developed solutions leveraging scale-up, scale-out, and Wide Area Network (WAN) technologies to enable large-scale distributed training.

We partner closely with standards bodies, such as UEC (Ultra Ethernet Consortium) and UALink, focused on innovation in networking technologies for this critical element of AI systems. We're also driving forward adoption of Ethernet for scale-up networking across the ecosystem and are excited to see the Ethernet for Scale-up Networking (ESUN) workstream launch under the OCP Networking Project. We look forward to promoting adoption of cutting-edge networking solutions and enabling a multi-vendor ecosystem based on open standards.

Security, sustainability, and quality: Fundamental pillars for resilient AI operations

Defense in depth: Trust at every layer

Our comprehensive approach to scaling AI systems responsibly includes embedding trust and security into every layer of our platform. This year, we're introducing new security contributions that build on our existing body of work in hardware security and introduce new protocols uniquely suited to support the new scientific breakthroughs that have been accelerated by the introduction of AI:

  • Building on prior years' contributions and Microsoft's collaboration with AMD, Google, and NVIDIA, we have further enhanced Caliptra, our open-source silicon root of trust. The introduction of Caliptra 2.1 extends the hardware root of trust to a full security subsystem. Learn more about Caliptra 2.1 here.
  • We have also added Adams Bridge 2.0 to Caliptra, extending support for quantum-resilient cryptographic algorithms in the root of trust.
  • Finally, we're contributing OCP Layered Open-source Cryptographic Key Management (L.O.C.K.), a key management block for storage devices that secures media encryption keys in hardware. L.O.C.K. was developed through collaboration between Google, Kioxia, Microsoft, Samsung, and Solidigm.

Advancing datacenter-scale sustainability

Sustainability continues to be a major area of opportunity for industry collaboration and standardization through communities such as the Open Compute Project. Working collaboratively as an ecosystem of hyperscalers and hardware partners is one catalyst to address the need for sustainable datacenter infrastructure that can effectively scale as compute demands continue to evolve. This year, we're pleased to continue our collaborations as part of OCP's Sustainability workgroup across areas such as carbon reporting, accounting, and circularity:

  • Announced at this year's Global Summit, we're partnering with AWS, Google, and Meta to fund the Product Category Rule initiative under the OCP Sustainability workgroup, with the goal of standardizing carbon measurement methodology for devices and datacenter equipment.
  • Together with Google, Meta, OCP, Schneider Electric, and the iMasons Climate Accord, we're establishing the Embodied Carbon Disclosure Base Specification to create a common framework for reporting the carbon impact of datacenter equipment.
  • Microsoft is advancing the adoption of waste heat reuse (WHR). In partnership with the NetZero Innovation Hub, NREL, and EU and US collaborators, Microsoft has published heat reuse reference designs and is developing an economic modeling tool that gives datacenter operators and waste heat off-takers the cost of building WHR infrastructure under conditions such as the size and capacity of the WHR system, season, location, and the WHR mandates and subsidies in place. These region-specific solutions help operators convert excess heat into usable energy, meeting regulatory requirements and unlocking new capacity, especially in regions like Europe where heat reuse is becoming mandatory.
  • We have developed an open methodology for Life Cycle Assessment (LCA) at scale across large IT hardware fleets to drive toward a "gold standard" in sustainable cloud infrastructure.
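To give a flavor of the kind of question a WHR economic model answers, here is a deliberately toy levelized-cost calculation. The real Microsoft/NREL tool and its inputs are not public here; the function name, parameters, and numbers below are illustrative assumptions only.

```python
# Hypothetical toy WHR cost model: levelized cost of delivered waste heat,
# ignoring discounting, degradation, and seasonal variation.

def whr_cost_per_mwh(capex_eur, lifetime_years, annual_heat_mwh,
                     opex_eur_per_year=0.0, subsidy_eur=0.0):
    """EUR per MWh of heat delivered over the system lifetime."""
    net_capex = capex_eur - subsidy_eur          # mandates/subsidies shift this
    total_cost = net_capex + opex_eur_per_year * lifetime_years
    total_heat_mwh = annual_heat_mwh * lifetime_years
    return total_cost / total_heat_mwh

# Example: a 2 MEUR heat-pump loop, 15-year life, 10 GWh/year of heat
# sold to a district-heating off-taker, with a 250 kEUR regional subsidy.
cost = whr_cost_per_mwh(2_000_000, 15, 10_000,
                        opex_eur_per_year=50_000, subsidy_eur=250_000)
print(round(cost, 2))  # 16.67 EUR/MWh
```

A real tool would layer in region-specific factors (climate/season, off-taker demand profiles, local mandates), but the core output, cost per unit of reused heat, is what lets an operator compare WHR against simply rejecting the heat.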

Rethinking node management: Fleet operational resiliency for the frontier era

As AI infrastructure scales at an unprecedented pace, Microsoft is investing in standardizing how diverse compute nodes are deployed, updated, monitored, and serviced across hyperscale datacenters. In collaboration with AMD, Arm, Google, Intel, Meta, and NVIDIA, we're driving a series of Open Compute Project (OCP) contributions focused on streamlining fleet operations, unifying firmware management and manageability interfaces, and enhancing diagnostics, debug, and RAS (Reliability, Availability, and Serviceability) capabilities. This standardized approach to lifecycle management lays the foundation for consistent, scalable node operations across this period of rapid expansion. Read more about our approach to resilient fleet operations.
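One way to picture standardized lifecycle management is as a shared state machine that every node, regardless of vendor, moves through. The states and transitions below are our simplified illustration, not the OCP contribution's actual model.

```python
# Illustrative node lifecycle state machine: deploy -> update -> serve,
# with quarantine/servicing paths for RAS events. States are hypothetical.

ALLOWED = {
    "provisioned": {"updating", "in_service"},
    "updating":    {"in_service", "quarantined"},  # failed update -> quarantine
    "in_service":  {"updating", "quarantined"},    # RAS fault -> quarantine
    "quarantined": {"servicing"},
    "servicing":   {"provisioned"},                # repaired nodes re-enter
}

class Node:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.state = "provisioned"

    def transition(self, new_state: str) -> str:
        """Apply a lifecycle transition, rejecting anything off the map."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state} -> {new_state} not allowed")
        self.state = new_state
        return self.state

node = Node("gpu-node-17")
node.transition("updating")     # unified firmware rollout
node.transition("in_service")
print(node.state)               # in_service
```

Agreeing on the map itself, which states exist and which transitions are legal, is what lets a fleet controller drive firmware rollouts and diagnostics identically across heterogeneous hardware.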

Paving the way for frontier-scale AI computing

As we enter a new era of frontier-scale AI development, Microsoft takes pride in leading the advancement of standards that will drive the future of globally deployable AI supercomputing. Our commitment is reflected in our active role in shaping the ecosystem that enables scalable, secure, and reliable AI infrastructure across the globe. We invite attendees of this year's OCP Global Summit to connect with Microsoft at booth #B53 to discover our latest cloud hardware demonstrations. These demonstrations showcase our ongoing collaborations with partners throughout the OCP community, highlighting innovations that support the evolution of AI and cloud technologies.

Connect with Microsoft at the OCP Global Summit 2025 and beyond
