KellerJordan / Muon
Muon is an optimizer for hidden layers in neural networks
★ 2,267 · Updated 3 weeks ago
Alternatives and similar repositories for Muon
Users who are interested in Muon are comparing it to the repositories listed below
- Implementation of the sparse attention pattern proposed by the Deepseek team in their "Native Sparse Attention" paper ★ 797 · Updated 5 months ago
- 🚀 Efficient implementations of state-of-the-art linear attention models ★ 4,352 · Updated last week
- Helpful tools and examples for working with flex-attention ★ 1,118 · Updated 3 weeks ago
- H-Net: Hierarchical Network with Dynamic Chunking ★ 810 · Updated 2 months ago
- [ICLR2025 Spotlight🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters ★ 587 · Updated 11 months ago
- Muon is Scalable for LLM Training ★ 1,426 · Updated 6 months ago
- Unofficial implementation of Titans, SOTA memory for transformers, in Pytorch ★ 1,924 · Updated last week
- Official PyTorch implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States ★ 1,318 · Updated last year
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ★ 964 · Updated this week
- Code for BLT research paper ★ 2,027 · Updated 3 months ago
- Implementing DeepSeek R1's GRPO algorithm from scratch ★ 1,762 · Updated 9 months ago
- [ICLR 2025 Oral] Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models ★ 950 · Updated 6 months ago
- Schedule-Free Optimization in PyTorch ★ 2,256 · Updated 8 months ago
- Minimalistic 4D-parallelism distributed training framework for education purpose ★ 2,058 · Updated 5 months ago
- Pretraining and inference code for a large-scale depth-recurrent language model ★ 863 · Updated last month
- Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States ★ 452 · Updated 3 months ago
- [ICML2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation ★ 930 · Updated last year
- Official PyTorch implementation for "Large Language Diffusion Models" ★ 3,554 · Updated 2 months ago
- Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793 ★ 452 · Updated 8 months ago
- Dream 7B, a large diffusion language model ★ 1,164 · Updated 2 months ago
- Pytorch implementation of Transfusion, "Predict the Next Token and Diffuse Images with One Multi-Modal Model", from MetaAI ★ 1,322 · Updated last week
- [NeurIPS 2025 Spotlight] TPA: Tensor ProducT ATTenTion Transformer (T6) (https://arxiv.org/abs/2501.06425) ★ 445 · Updated 2 weeks ago
- When it comes to optimizers, it's always better to be safe than sorry ★ 402 · Updated 4 months ago
- Implementation of Rotary Embeddings, from the Roformer paper, in Pytorch ★ 802 · Updated last week
- ★ 661 · Updated 9 months ago
- Training Large Language Model to Reason in a Continuous Latent Space ★ 1,496 · Updated 5 months ago
- ★ 579 · Updated 4 months ago
- PyTorch implementation of FractalGen https://arxiv.org/abs/2502.17437 ★ 1,220 · Updated 11 months ago
- [ICML 2024 Best Paper] Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution (https://arxiv.org/abs/2310.16834) ★ 700 · Updated last year
- Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation (NeurIPS 2025) ★ 542 · Updated 4 months ago