MoonshotAI / Moonlight
Muon is Scalable for LLM Training
☆974 · Updated last month
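Since this listing gives no detail on what Muon actually does: public descriptions of the optimizer characterize it as momentum SGD for 2D weight matrices, where each update is approximately orthogonalized by a few Newton-Schulz iterations before being applied. The sketch below illustrates that update rule in PyTorch; the iteration coefficients, default hyperparameters, and function names are illustrative assumptions, not Moonlight's exact implementation.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2D update matrix via Newton-Schulz iteration.
    Coefficients follow the quintic iteration seen in public Muon implementations;
    treat them as illustrative, not as Moonlight's exact values."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)              # scale so the iteration converges
    transposed = X.shape[0] > X.shape[1]  # iterate on the wide orientation
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X

def muon_step(param: torch.Tensor, grad: torch.Tensor, momentum_buf: torch.Tensor,
              lr: float = 0.02, momentum: float = 0.95) -> None:
    """One Muon-style update for a single 2D weight matrix (in-place); hypothetical helper."""
    momentum_buf.mul_(momentum).add_(grad)            # classic momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)
    param.add_(update, alpha=-lr)
```

Non-matrix parameters (embeddings, norms, biases) are typically handled by a standard optimizer such as AdamW alongside updates like this.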
Alternatives and similar repositories for Moonlight:
Users interested in Moonlight are comparing it to the libraries listed below.
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ☆590 · Updated last week
- Official Repo for Open-Reasoner-Zero ☆1,667 · Updated 3 weeks ago
- MoBA: Mixture of Block Attention for Long-Context LLMs ☆1,687 · Updated 3 weeks ago
- Scalable RL solution for advanced reasoning of language models ☆1,419 · Updated last week
- Large Reasoning Models ☆800 · Updated 3 months ago
- Official PyTorch implementation for "Large Language Diffusion Models" ☆1,313 · Updated 2 weeks ago
- An Open Large Reasoning Model for Real-World Solutions ☆1,475 · Updated 3 weeks ago
- An Open-source RL System from ByteDance Seed and Tsinghua AIR ☆767 · Updated last week
- Understanding R1-Zero-Like Training: A Critical Perspective ☆568 · Updated this week
- Muon optimizer: >30% sample efficiency with <3% wallclock overhead ☆529 · Updated this week
- OLMoE: Open Mixture-of-Experts Language Models ☆693 · Updated 2 weeks ago
- Pretraining code for a large-scale depth-recurrent language model ☆697 · Updated 2 weeks ago
- [NeurIPS'24 Spotlight, ICLR'25] To speed up long-context LLMs' inference, approximate and dynamic sparse calculation of the attention, which r… ☆945 · Updated last month
- Democratizing Reinforcement Learning for LLMs ☆2,113 · Updated last month
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… (a minimal sketch of this idea follows this list) ☆310 · Updated 3 months ago
- LIMO: Less is More for Reasoning ☆864 · Updated last month
- Explore the Multimodal “Aha Moment” on 2B Model ☆524 · Updated last week
- FlashInfer: Kernel Library for LLM Serving ☆2,483 · Updated this week
- ☆1,348 · Updated 4 months ago
- Ring attention implementation with flash attention ☆717 · Updated last month
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3. ☆1,094 · Updated this week
- Training Large Language Model to Reason in a Continuous Latent Space ☆998 · Updated 2 months ago
- Next-Token Prediction is All You Need ☆2,042 · Updated last week
- ☆910 · Updated 2 months ago
- Minimalistic large language model 3D-parallelism training ☆1,715 · Updated this week
- ☆485 · Updated last week
- Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware. ☆706 · Updated 6 months ago
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models ☆1,609 · Updated last year
- ☆518 · Updated last week
- Fast, Flexible and Portable Structured Generation ☆818 · Updated this week
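On the memory-layers entry above: it describes a trainable key-value lookup that grows parameter count without growing per-token FLOPs, because only a few selected value rows participate in each forward pass. Below is a minimal sketch of that idea, assuming a flat key table with top-k selection; real implementations typically use product keys and sharded value tables so that even the key scoring avoids a full scan. All class and parameter names here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryLayer(nn.Module):
    """Sketch of a trainable key-value memory: only the top-k selected value rows
    are gathered per token, so the large `values` table adds parameters cheaply.
    Note: the dense key scoring below still scans all slots; product-key memories
    factorize the keys to avoid that."""

    def __init__(self, d_model: int, num_slots: int = 4096, k: int = 8):
        super().__init__()
        self.query_proj = nn.Linear(d_model, d_model)
        self.keys = nn.Parameter(torch.randn(num_slots, d_model) * 0.02)
        self.values = nn.Embedding(num_slots, d_model)    # large, sparsely accessed
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, d_model)
        q = self.query_proj(x)                             # (batch, seq, d_model)
        scores = q @ self.keys.T                           # (batch, seq, num_slots)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)           # (batch, seq, k)
        selected = self.values(topk_idx)                   # (batch, seq, k, d_model)
        return x + (weights.unsqueeze(-1) * selected).sum(dim=-2)
```

Example usage: `layer = MemoryLayer(d_model=512); y = layer(torch.randn(2, 16, 512))`.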