leloykun / adaptive-muon
A single-line modification to any dualizer-based optimizer that lets the optimizer adapt to the scale of the gradients as they change during training.
☆17 · Updated 10 months ago
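The idea behind the "single-line modification" can be sketched as follows. This is a hypothetical reconstruction based only on the description above, not the repository's actual code: a Muon-style dualizer (Newton-Schulz orthogonalization, using the coefficients published with Muon) produces an update whose magnitude is fixed, and the sketched one-line change rescales that update by its Frobenius inner product with the raw gradient so the step size tracks the gradient scale. The function names `newton_schulz` and `adaptive_muon_update` are assumptions for illustration.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximate dualizer: map G to a near-orthogonal matrix of the same shape."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic coefficients used by Muon
    X = G / (np.linalg.norm(G) + eps)  # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def adaptive_muon_update(G, lr=0.02):
    O = newton_schulz(G)
    # The hypothesized "single-line modification": rescale the dualized update
    # by its alignment with the raw gradient, so the effective step size grows
    # and shrinks with the gradient scale instead of being fixed.
    scale = np.sum(G * O)  # Frobenius inner product <G, O>
    return -lr * scale * O
```

Because Newton-Schulz normalizes its input first, the dualized direction is scale-invariant, so the update's magnitude comes entirely from the `<G, O>` factor; doubling the gradient doubles the step.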
Alternatives and similar repositories for adaptive-muon
Users interested in adaptive-muon are comparing it to the repositories listed below.
- Supporting PyTorch FSDP for optimizers ☆84 · Updated 11 months ago
- Efficient optimizers ☆276 · Updated 2 weeks ago
- ☆224 · Updated 11 months ago
- The simplest, fastest repository for training/finetuning medium-sized GPTs. ☆173 · Updated 5 months ago
- Accelerated First Order Parallel Associative Scan ☆192 · Updated last year
- 🧱 Modula software package ☆307 · Updated 3 months ago
- CIFAR-10 speedruns: 94% in 2.6 seconds and 96% in 27 seconds ☆326 · Updated 2 weeks ago
- Research implementation of Native Sparse Attention (2502.11089) ☆63 · Updated 9 months ago
- 🔥 A minimal training framework for scaling FLA models ☆311 · Updated 2 weeks ago
- Supporting code for the blog post on modular manifolds. ☆103 · Updated 2 months ago
- JAX bindings for Flash Attention v2 ☆99 · Updated 3 weeks ago
- Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI ☆293 · Updated 5 months ago
- [ICLR 2025] Official PyTorch implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule ☆379 · Updated 2 months ago
- Flash-Muon: An Efficient Implementation of the Muon Optimizer ☆206 · Updated 5 months ago
- ☆256 · Updated 5 months ago
- The AdEMAMix Optimizer: Better, Faster, Older. ☆186 · Updated last year
- Understand and test language model architectures on synthetic tasks. ☆240 · Updated 2 months ago
- Code accompanying the paper "Generalized Interpolating Discrete Diffusion" ☆108 · Updated 5 months ago
- Dion optimizer algorithm ☆388 · Updated last week
- Landing repository for the paper "Softpick: No Attention Sink, No Massive Activations with Rectified Softmax" ☆85 · Updated 2 months ago
- Minimal yet performant LLM examples in pure JAX ☆202 · Updated 2 months ago
- Normalized Transformer (nGPT) ☆194 · Updated last year
- WIP ☆93 · Updated last year
- Physics of Language Models, Part 4 ☆260 · Updated 4 months ago
- ☆68 · Updated last year
- Some preliminary explorations of Mamba's context scaling. ☆217 · Updated last year
- Efficient Triton implementation of Native Sparse Attention. ☆250 · Updated 6 months ago
- Fast and memory-efficient exact attention ☆74 · Updated 8 months ago
- FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores ☆333 · Updated 11 months ago
- Annotated version of the Mamba paper ☆491 · Updated last year