KellerJordan / MuonLinks

Muon is an optimizer for hidden layers in neural networks

☆2,056

Alternatives and similar repositories for Muon

Users that are interested in Muon are comparing it to the libraries listed below

Sorting:

lucidrains / native-sparse-attention-pytorch
Implementation of the sparse attention pattern proposed by the Deepseek team in their "Native Sparse Attention" paper
☆785Updated 3 months ago
fla-org / flash-linear-attention
🚀 Efficient implementations of state-of-the-art linear attention models
☆3,937Updated last week
meta-pytorch / attention-gym
Helpful tools and examples for working with flex-attention
☆1,072Updated this week
Haiyang-W / TokenFormer
[ICLR2025 Spotlight🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
☆580Updated 9 months ago
MoonshotAI / Moonlight
Muon is Scalable for LLM Training
☆1,372Updated 4 months ago
goombalab / hnet
H-Net: Hierarchical Network with Dynamic Chunking
☆788Updated 2 weeks ago
fla-org / native-sparse-attention
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
☆928Updated 8 months ago
lucidrains / titans-pytorch
Unofficial implementation of Titans, SOTA memory for transformers, in Pytorch
☆1,533Updated last week
policy-gradient / GRPO-Zero
Implementing DeepSeek R1's GRPO algorithm from scratch
☆1,682Updated 7 months ago
facebookresearch / blt
Code for BLT research paper
☆2,010Updated last month
ML-GSAI / LLaDA
Official PyTorch implementation for "Large Language Diffusion Models"
☆3,333Updated 3 weeks ago
test-time-training / ttt-lm-pytorch
Official PyTorch implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States
☆1,282Updated last year
huggingface / picotron
Minimalistic 4D-parallelism distributed training framework for education purpose
☆1,911Updated 3 months ago
tensorgi / TPA
[NeurIPS 2025 Spotlight] TPA: Tensor ProducT ATTenTion Transformer (T6) (https://arxiv.org/abs/2501.06425)
☆427Updated last month
kuleshov-group / bd3lms
[ICLR 2025 Oral] Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models
☆899Updated 4 months ago
DreamLM / Dream
Dream 7B, a large diffusion language model
☆1,094Updated 2 weeks ago
NVlabs / DoRA
[ICML2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation
☆886Updated last year
facebookresearch / schedule_free
Schedule-Free Optimization in PyTorch
☆2,237Updated 6 months ago
zyushun / Adam-mini
Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793
☆445Updated 6 months ago
apple / ml-cross-entropy
☆555Updated 2 months ago
microsoft / Samba
[ICLR 2025] Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling
☆932Updated 2 weeks ago
test-time-training / ttt-lm-jax
Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States
☆432Updated last month
seal-rg / recurrent-pretraining
Pretraining and inference code for a large-scale depth-recurrent language model
☆850Updated last month
lucidrains / rotary-embedding-torch
Implementation of Rotary Embeddings, from the Roformer paper, in Pytorch
☆780Updated 4 months ago
jiachenzhu / DyT
Code release for DynamicTanh (DyT)
☆1,026Updated 8 months ago
minyoungg / platonic-rep
☆634Updated 7 months ago
sail-sg / understand-r1-zero
Understanding R1-Zero-Like Training: A Critical Perspective
☆1,164Updated 3 months ago
facebookresearch / coconut
Training Large Language Model to Reason in a Continuous Latent Space
☆1,367Updated 3 months ago
thinking-machines-lab / batch_invariant_ops
☆917Updated last month
pytorch / ao
PyTorch native quantization and sparsity for training and inference
☆2,543Updated this week