KellerJordan / Muon
Muon is an optimizer for the hidden layers of neural networks.
★2,179 · Updated last month
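The core of Muon, per its writeup, is to replace each hidden-layer weight update with an approximately orthogonalized version of the SGD-momentum update, computed by a quintic Newton-Schulz iteration. Below is a minimal sketch of that step, assuming the coefficients published in the writeup; the function name, hyperparameters, and the toy update at the end are illustrative, not the repo's actual API.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration that pushes the singular values of a
    # 2D matrix toward 1, i.e. approximately replaces G with the nearest
    # semi-orthogonal matrix. The (a, b, c) coefficients are the ones from
    # the Muon writeup; check the repo for the current values.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G.bfloat16()             # the iteration tolerates low precision
    X = X / (X.norm() + 1e-7)    # Frobenius norm >= spectral norm, so singular values land in (0, 1]
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.mT                 # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.mT
        X = a * X + (b * A + c * (A @ A)) @ X
    if transposed:
        X = X.mT
    return X.to(G.dtype)

# Hypothetical single-parameter update, just to show where the step fits
# (the repo ships a full torch.optim-style optimizer class instead):
W = torch.randn(256, 128)
grad = torch.randn_like(W)
buf = torch.zeros_like(W)        # momentum buffer
buf = 0.95 * buf + grad          # SGD-momentum accumulation
W = W - 0.02 * newton_schulz_orthogonalize(buf)
```

Running the iteration in bfloat16 is what keeps the orthogonalization cheap relative to an exact SVD, which is why the step is practical inside a training loop.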
Alternatives and similar repositories for Muon
Users interested in Muon are comparing it to the repositories listed below.
- Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper · ★791 · Updated 5 months ago
- 🚀 Efficient implementations of state-of-the-art linear attention models · ★4,209 · Updated last week
- [ICLR 2025 Spotlight 🔥] Official implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters · ★581 · Updated 11 months ago
- Muon is Scalable for LLM Training · ★1,397 · Updated 5 months ago
- Helpful tools and examples for working with flex-attention · ★1,108 · Updated this week
- Code for the BLT research paper · ★2,024 · Updated 2 months ago
- Official PyTorch implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States · ★1,300 · Updated last year
- [ICLR 2025 Oral] Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models · ★940 · Updated 6 months ago
- Implementing DeepSeek R1's GRPO algorithm from scratch · ★1,740 · Updated 8 months ago
- H-Net: Hierarchical Network with Dynamic Chunking · ★801 · Updated last month
- Dream 7B, a large diffusion language model · ★1,139 · Updated last month
- Official PyTorch implementation for "Large Language Diffusion Models" · ★3,473 · Updated 2 months ago
- Minimalistic 4D-parallelism distributed training framework for educational purposes · ★1,947 · Updated 4 months ago
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" · ★952 · Updated 9 months ago
- Unofficial implementation of Titans, SOTA memory for transformers, in PyTorch · ★1,864 · Updated last week
- [ICML 2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation · ★920 · Updated last year
- Code for Adam-mini: Use Fewer Learning Rates To Gain More (https://arxiv.org/abs/2406.16793) · ★449 · Updated 8 months ago
- Pretraining and inference code for a large-scale depth-recurrent language model · ★859 · Updated 2 weeks ago
- Training Large Language Models to Reason in a Continuous Latent Space · ★1,449 · Updated 5 months ago
- Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States · ★446 · Updated 2 months ago
- Schedule-Free Optimization in PyTorch · ★2,251 · Updated 7 months ago
- [NeurIPS 2025 Spotlight] TPA: Tensor ProducT ATTenTion Transformer (T6) (https://arxiv.org/abs/2501.06425) · ★445 · Updated this week
- [ICLR 2025] Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling · ★939 · Updated last month
- ★578 · Updated 3 months ago
- ★655 · Updated 9 months ago
- Understanding R1-Zero-Like Training: A Critical Perspective · ★1,186 · Updated 4 months ago
- Code release for DynamicTanh (DyT) · ★1,032 · Updated 9 months ago
- ★949 · Updated 2 months ago
- PyTorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from Meta AI · ★1,310 · Updated last week
- dLLM: Simple Diffusion Language Modeling · ★1,566 · Updated last week