KellerJordan / Muon
Muon is an optimizer for the hidden layers of neural networks.
☆1,218 · Updated 2 weeks ago
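The core idea behind Muon is to orthogonalize each weight matrix's momentum-averaged gradient via a Newton-Schulz iteration before applying it as an update. Below is an illustrative NumPy sketch, assuming the quintic-iteration coefficients published in the repository; the function and parameter names are my own, not the repo's PyTorch API, and the momentum step is simplified.

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately map g to the nearest semi-orthogonal matrix.

    Sketch only: coefficients are taken from the Muon repository's
    quintic Newton-Schulz routine; everything else is illustrative.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)   # scale so all singular values <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:                        # iterate on the wide orientation
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x  # pushes singular values toward 1
    return x.T if transposed else x

def muon_step(weight, grad, buf, lr=0.02, momentum=0.95):
    """Simplified Muon-style update for one weight matrix (hypothetical helper)."""
    buf = momentum * buf + grad                    # heavy-ball momentum
    update = newton_schulz_orthogonalize(buf)      # orthogonalized direction
    return weight - lr * update, buf
```

After a few iterations the update's singular values cluster near 1, so every direction in the momentum matrix contributes at roughly equal magnitude; this is what distinguishes Muon from elementwise optimizers like Adam.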
Alternatives and similar repositories for Muon
Users that are interested in Muon are comparing it to the libraries listed below
- Helpful tools and examples for working with flex-attention ☆896 · Updated last week
- [ICLR 2025 Spotlight] Official implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters ☆565 · Updated 5 months ago
- Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper ☆672 · Updated last month
- Muon is Scalable for LLM Training ☆1,211 · Updated 3 months ago
- Code for Adam-mini: Use Fewer Learning Rates To Gain More (https://arxiv.org/abs/2406.16793) ☆431 · Updated 2 months ago
- Efficient implementations of state-of-the-art linear attention models ☆2,928 · Updated last week
- Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ☆729 · Updated 4 months ago
- Code for the BLT research paper ☆1,740 · Updated 2 months ago
- Minimalistic 4D-parallelism distributed training framework for educational purposes ☆1,607 · Updated 2 weeks ago
- Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States ☆417 · Updated 11 months ago
- ☆497 · Updated 2 weeks ago
- Implementation of Ring Attention, from Liu et al. at Berkeley AI, in PyTorch ☆532 · Updated 2 months ago
- [ICLR 2025] Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling ☆897 · Updated 2 months ago
- Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models ☆737 · Updated 2 weeks ago
- H-Net: Hierarchical Network with Dynamic Chunking ☆518 · Updated 2 weeks ago
- Implementing DeepSeek R1's GRPO algorithm from scratch ☆1,488 · Updated 3 months ago
- When it comes to optimizers, it's always better to be safe than sorry ☆330 · Updated last week
- The official implementation of TPA: Tensor ProducT ATTenTion Transformer (T6) (https://arxiv.org/abs/2501.06425) ☆378 · Updated last week
- [ICML 2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation ☆817 · Updated 9 months ago
- Dream 7B, a large diffusion language model ☆848 · Updated last month
- Unofficial implementation of Titans, SOTA memory for transformers, in PyTorch ☆1,414 · Updated last month
- Official PyTorch implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States ☆1,236 · Updated last year
- ☆577 · Updated 3 months ago
- Understanding R1-Zero-Like Training: A Critical Perspective ☆1,039 · Updated 3 weeks ago
- ☆293 · Updated 3 months ago
- Annotated version of the Mamba paper ☆487 · Updated last year
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton ☆562 · Updated last week
- Pretraining and inference code for a large-scale depth-recurrent language model ☆803 · Updated last week
- Recipes to scale inference-time compute of open models ☆1,108 · Updated 2 months ago
- Training Large Language Models to Reason in a Continuous Latent Space ☆1,199 · Updated 6 months ago