Saturday, August 16, 2025

Dion: the distributed orthonormal update revolution is here


Training AI models requires choosing an optimizer, and for nearly a decade, AdamW has been the optimizer of choice. Given that durability and success, it was fair to doubt that any further improvement was possible. And yet, last December, a new optimizer called Muon showed serious promise by powering a nanoGPT speedrun. This proved out, with multiple AI labs (e.g., Kimi-AI and Essential-AI) reporting 2x scale improvements and the release of the 1T-parameter Kimi K2 model. Restated: you can train a model to comparable performance with half as many GPUs.

There’s one fly in the ointment: Muon requires large matrix multiplications in the optimizer, which demand heavy communication in large models at the scale where FSDP and TP parallelization become desirable. Going back to the inspiration for Muon, the key idea is an orthonormal update, which sparked the search for more scalable alternative linear algebra realizing the same goal. That’s exactly what Dion is. We have open-sourced this new optimizer to enable anyone to train large models more efficiently at scale.

What’s an orthonormal update?

Figure 1. Illustration of matrix parameters

At the core of Transformers, a set of input activations is multiplied by a learned weight matrix to produce a new set of output activations. When the weight matrix is updated during training, the resulting change in the output activations generally depends on the direction of the input activations. As a result, the learning rate must be chosen conservatively to accommodate the input direction that induces the largest change. Orthonormalized updates alter this behavior by (roughly) making the change in output activations invariant to the direction of the input. This is achieved by enforcing orthonormality on the update matrix, thereby equalizing its effect across all input directions.
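As a rough illustration, an update matrix can be orthonormalized by replacing it with the nearest semi-orthogonal matrix. The NumPy sketch below shows the idea via an explicit SVD; it is not how Muon or Dion actually compute the orthonormalization in practice.

```python
import numpy as np

def orthonormalize(update):
    # Replace the update with the orthonormal factor of its polar
    # decomposition: all singular values become 1, so the update
    # changes output activations by the same amount in every input
    # direction it acts on.
    u, _, vt = np.linalg.svd(update, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
raw = rng.standard_normal((4, 3))
ortho = orthonormalize(raw)

# Every singular value of the orthonormalized update equals 1.
print(np.allclose(np.linalg.svd(ortho, compute_uv=False), 1.0))  # True
```

Because every singular value of the result equals 1, no input direction is amplified more than another, which is what allows a less conservative learning rate.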

What’s Dion?

While Muon has shown strong empirical results, scaling it to very large models poses challenges. As reported by Essential AI, applying Muon to large architectures like LLaMA-3 becomes compute-bound, and potentially communication-bound, due to the cost of the Newton–Schulz orthonormalization steps.

Figure 2. Pseudocode of the centralized version of Dion

This is where Dion enters. At a high level, Dion introduces a new axis for scalability: the rank. Specifically, for a given rank r, Dion orthonormalizes only the top r directions of the singular vector space, reducing communication and compute overhead while preserving performance. Empirically, we observe that the rank necessary for good performance grows far more slowly than the number of parameters in larger models.

Dion implements orthonormalization using amortized power iteration. Power iteration typically pulls out the largest singular value by repeated matrix multiplication. By amortizing this process over optimization steps, applied to the slowly evolving momentum matrix, we reduce the cost to just two matrix multiplications per step. Incorporating a QR decomposition lets us extract an approximate orthonormal basis spanning the top singular directions, rather than just the leading one. This amortized power iteration is fully compatible with standard distributed training strategies such as FSDP and tensor parallelism. Here, we show a simple centralized version, but the technique works for more complex forms of parallelization, as presented in the paper. In other words, we can orthogonalize a matrix without ever seeing a full row or column of it.
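One step of this kind of subspace power iteration with a QR refresh can be sketched as follows. This is an illustrative NumPy toy under our own naming, not the open-source Dion implementation; it shows how two matrix multiplications plus a QR decomposition, repeated across steps, recover an orthonormal basis for the top singular directions.

```python
import numpy as np

def amortized_power_step(momentum, Q):
    # Two matrix multiplications per optimizer step: project the
    # momentum matrix onto the current rank-r basis, then refresh
    # the basis with a QR decomposition. Repeated over steps, Q
    # converges toward the top-r right singular vectors of the
    # slowly evolving momentum matrix.
    P = momentum @ Q                           # (m, r)
    Q_next, _ = np.linalg.qr(momentum.T @ P)   # (n, r), orthonormal columns
    return P, Q_next

# Toy momentum matrix with known singular directions.
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.standard_normal((6, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))
M = U @ np.diag([3.0, 2.0, 1.0, 0.5]) @ V.T

Q = np.linalg.qr(rng.standard_normal((4, 2)))[0]
for _ in range(50):                            # amortized over many steps
    P, Q = amortized_power_step(M, Q)

# Q now spans the top-2 right singular subspace of M.
top2 = V[:, :2]
print(np.allclose(Q @ Q.T @ top2, top2, atol=1e-8))  # True
```

Because each step touches the momentum matrix only through matrix products, the same computation can be sharded across devices, which is why the technique composes with FSDP and tensor parallelism.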

Low-rank approximation would ordinarily introduce error, but Dion overcomes this through an error feedback mechanism. It keeps the residual of the low-rank approximation in the momentum matrix, so that any systematic gradient structure not initially captured accumulates and is eventually applied in a future update.
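The general error-feedback pattern can be sketched as follows. This is a generic NumPy illustration (using an explicit SVD for clarity), not Dion's exact update rule, which also incorporates momentum decay and the amortized basis described above.

```python
import numpy as np

def error_feedback_step(buffer, grad, rank):
    # Accumulate the new gradient, apply only a rank-r piece of the
    # buffer as the update, and keep the unapplied residual in the
    # buffer so systematic structure is never lost, only delayed.
    buffer = buffer + grad
    u, s, vt = np.linalg.svd(buffer, full_matrices=False)
    applied = (u[:, :rank] * s[:rank]) @ vt[:rank]
    residual = buffer - applied
    return applied, residual

rng = np.random.default_rng(0)
buf = np.zeros((4, 4))
g = rng.standard_normal((4, 4))
applied, buf = error_feedback_step(buf, g, rank=1)

# Nothing is discarded: the applied low-rank update plus the
# retained residual reconstruct the full accumulated gradient.
print(np.allclose(applied + buf, g))  # True
```

Any structure left in the residual keeps accumulating across steps, so a direction too weak to make the top-r cut today is applied once it has built up.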


How does it work?

Something very strange happened in our experiments. Usually, adding an extra constraint on the way an algorithm works can be expected to decrease overall performance. And indeed, at the 120M-parameter scale of the speedrun, we see Dion's update taking more time than Muon's, while not yielding any significant gains. But at larger scales, we observed a different trend: Dion began to outperform Muon.

Figure 3. Wall-clock time speedup of Dion for 3B model training

Why would adding a constraint improve the update rule? The answer lies in what the constraint enforces. Dion achieves a much closer approximation to true orthonormalization than Muon. This precision, initially subtle, becomes increasingly important as the number of singular vectors grows. Over increasing model scale and training steps, this small advantage accumulates, leading to a measurable improvement in performance.

This edge grows further with batch size: with larger batches, update quality tends to degrade, but notably more slowly with Dion than with Muon (and Muon is already a significant improvement over AdamW).

Figure 4. Scaling of Dion across different batch sizes

Here you can see how the number of steps needed to reach a pretraining loss, relative to AdamW, varies as batch size grows, with full-rank and ¼-rank Dion (in orange) and Muon (in blue).

In our experiments, these benefits extend to various post-training regimes as well.

We also experimented with the rank, finding empirically that larger models tolerate smaller ranks well.

Figure 5. Low-rank Dion across different model sizes

Projecting this trend out to the scale of the LLaMA-3 405B-parameter model suggests that Dion remains fully effective even with rank fractions as low as 1/16 or 1/64 for large dense models like LLaMA-3.

Using hardware timings of the individual update steps suggests a story that looks like this:

Figure 6. Estimated wall-clock time of each optimizer step for Llama 3 405B. Lower is better. Muon is highlighted in orange as our baseline, next to Dion with varying rank fractions. Suggested rank fractions for a 405B-parameter model are shown in blue. Using Dion with rank fraction 1/16 or lower offers an order-of-magnitude speedup over Muon.

We’ve open-sourced a PyTorch FSDP2 + Tensor Parallel (TP) implementation of Dion, available via a simple pip install. Our goal is to make faster training with Dion accessible to everyone. As a bonus, the repository also includes a PyTorch FSDP2 implementation of Muon.

Acknowledgements

We thank Riashat Islam and Pratyusha Sharma for their helpful feedback on the writing and presentation.

