KellerJordan / Muon
Muon is an optimizer for hidden layers in neural networks
★2,056 · Updated last week
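At its core, the Muon update is small: take a momentum-SGD step for each 2D weight matrix, but orthogonalize the update with a few Newton-Schulz iterations before applying it, so every direction of the update gets roughly unit scale. A minimal sketch under that reading of the repo (the `muon_step` wrapper, its defaults, and the pre-scaling epsilon are illustrative assumptions; the quintic coefficients are the ones the repo uses):

```python
import torch

@torch.no_grad()
def zeropower_via_newtonschulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration that pushes the singular values of G
    # toward 1, approximating the nearest semi-orthogonal matrix to G.
    a, b, c = (3.4445, -4.7750, 2.0315)
    X = G / (G.norm() + 1e-7)      # scale so the spectral norm is <= 1
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T                    # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

@torch.no_grad()
def muon_step(param, grad, buf, lr=0.02, momentum=0.95):
    # Illustrative single-matrix update: accumulate momentum, form a
    # Nesterov-style lookahead, orthogonalize it, then apply.
    buf.mul_(momentum).add_(grad)
    update = grad.add(buf, alpha=momentum)
    param.add_(zeropower_via_newtonschulz5(update), alpha=-lr)
```

The orthogonalization is also why the tagline says "hidden layers": it only makes sense for 2D weight matrices, so the repo leaves embeddings, the output head, and 1D parameters (biases, norms) to AdamW.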
Alternatives and similar repositories for Muon
Users interested in Muon are comparing it to the repositories listed below.
- Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper ★785 · Updated 3 months ago
- 🚀 Efficient implementations of state-of-the-art linear attention models ★3,937 · Updated last week
- Helpful tools and examples for working with flex-attention ★1,072 · Updated this week
- [ICLR 2025 Spotlight 🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters ★580 · Updated 9 months ago
- Muon is Scalable for LLM Training ★1,372 · Updated 4 months ago
- H-Net: Hierarchical Network with Dynamic Chunking ★788 · Updated 2 weeks ago
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ★928 · Updated 8 months ago
- Unofficial implementation of Titans, SOTA memory for transformers, in Pytorch ★1,533 · Updated last week
- Implementing DeepSeek R1's GRPO algorithm from scratch (see the GRPO sketch after this list) ★1,682 · Updated 7 months ago
- Code for the BLT research paper ★2,010 · Updated last month
- Official PyTorch implementation for "Large Language Diffusion Models" ★3,333 · Updated 3 weeks ago
- Official PyTorch implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States ★1,282 · Updated last year
- Minimalistic 4D-parallelism distributed training framework for educational purposes ★1,911 · Updated 3 months ago
- [NeurIPS 2025 Spotlight] TPA: Tensor ProducT ATTenTion Transformer (T6) (https://arxiv.org/abs/2501.06425) ★427 · Updated last month
- [ICLR 2025 Oral] Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models ★899 · Updated 4 months ago
- Dream 7B, a large diffusion language model ★1,094 · Updated 2 weeks ago
- [ICML 2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation ★886 · Updated last year
- Schedule-Free Optimization in PyTorch ★2,237 · Updated 6 months ago
- Code for Adam-mini: Use Fewer Learning Rates To Gain More (https://arxiv.org/abs/2406.16793) ★445 · Updated 6 months ago
- ★555 · Updated 2 months ago
- [ICLR 2025] Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling ★932 · Updated 2 weeks ago
- Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States ★432 · Updated last month
- Pretraining and inference code for a large-scale depth-recurrent language model ★850 · Updated last month
- Implementation of Rotary Embeddings, from the RoFormer paper, in Pytorch (see the rotary sketch after this list) ★780 · Updated 4 months ago
- Code release for DynamicTanh (DyT) (see the DyT sketch after this list) ★1,026 · Updated 8 months ago
- ★634 · Updated 7 months ago
- Understanding R1-Zero-Like Training: A Critical Perspective ★1,164 · Updated 3 months ago
- Training Large Language Models to Reason in a Continuous Latent Space ★1,367 · Updated 3 months ago
- ★917 · Updated last month
- PyTorch native quantization and sparsity for training and inference ★2,543 · Updated this week
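For the GRPO-from-scratch entry above, the core idea fits in a few lines: sample a group of completions per prompt, score each one, and use the group-normalized reward as the advantage in a clipped policy-gradient objective, with no learned critic. A sketch of just the advantage step (function name and epsilon are illustrative):

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: shape (group_size,), one scalar reward per completion
    # sampled for the same prompt. The advantage is each reward
    # normalized against the group's own mean and std, which replaces
    # the learned value function used by PPO.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)
```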
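The rotary-embedding entry likewise reduces to a short function: rotate consecutive channel pairs of queries and keys by position-dependent angles, so that their dot product depends only on relative position. A sketch, not the lucidrains API (the function name and `base` default are conventional assumptions; `dim` must be even):

```python
import torch

def apply_rotary(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (..., seq_len, dim) queries or keys with an even channel dim.
    seq_len, dim = x.shape[-2], x.shape[-1]
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]         # channel pairs
    out = torch.stack((x1 * cos - x2 * sin,     # 2D rotation per pair
                       x1 * sin + x2 * cos), dim=-1)
    return out.flatten(-2)                      # re-interleave the pairs
```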
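And the DynamicTanh entry: the paper swaps each LayerNorm for an elementwise tanh with a learnable scalar slope, DyT(x) = γ · tanh(αx) + β. A minimal module sketch (the α init of 0.5 follows the paper's stated default; everything else here is an assumption):

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    # Drop-in LayerNorm replacement: squash activations with a tanh at a
    # learnable scale instead of normalizing by per-token statistics.
    def __init__(self, dim: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), init_alpha))  # scalar slope
        self.gamma = nn.Parameter(torch.ones(dim))                # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))                # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```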