MoonshotAI / Moonlight
Muon is Scalable for LLM Training
☆974 · Updated last month
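Since this listing gives no detail on what Muon actually does: public descriptions of the optimizer characterize it as momentum SGD for 2D weight matrices, where each update is approximately orthogonalized by a few Newton-Schulz iterations before being applied. The sketch below illustrates that update rule in PyTorch; the iteration coefficients, default hyperparameters, and function names are illustrative assumptions, not Moonlight's exact implementation.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Approximately orthogonalize a 2D update matrix via Newton-Schulz iteration.
    Coefficients follow the quintic iteration seen in public Muon implementations;
    treat them as illustrative, not as Moonlight's exact values."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)              # scale so the iteration converges
    transposed = X.shape[0] > X.shape[1]  # iterate on the wide orientation
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    return X.T if transposed else X

def muon_step(param: torch.Tensor, grad: torch.Tensor, momentum_buf: torch.Tensor,
              lr: float = 0.02, momentum: float = 0.95) -> None:
    """One Muon-style update for a single 2D weight matrix (in-place); hypothetical helper."""
    momentum_buf.mul_(momentum).add_(grad)            # classic momentum accumulation
    update = newton_schulz_orthogonalize(momentum_buf)
    param.add_(update, alpha=-lr)
```

Non-matrix parameters (embeddings, norms, biases) are typically handled by a standard optimizer such as AdamW alongside updates like this.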
Alternatives and similar repositories for Moonlight:
Users interested in Moonlight are comparing it to the libraries listed below.
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ☆590 · Updated last week
- Official Repo for Open-Reasoner-Zero ☆1,667 · Updated 3 weeks ago
- MoBA: Mixture of Block Attention for Long-Context LLMs ☆1,687 · Updated 3 weeks ago
- Scalable RL solution for advanced reasoning of language models ☆1,419 · Updated last week
- Large Reasoning Models ☆800 · Updated 3 months ago
- Official PyTorch implementation for "Large Language Diffusion Models" ☆1,313 · Updated 2 weeks ago
- An Open Large Reasoning Model for Real-World Solutions ☆1,475 · Updated 3 weeks ago
- An Open-source RL System from ByteDance Seed and Tsinghua AIR ☆767 · Updated last week
- Understanding R1-Zero-Like Training: A Critical Perspective ☆568 · Updated this week
- Muon optimizer: >30% sample efficiency with <3% wallclock overhead ☆529 · Updated this week
- OLMoE: Open Mixture-of-Experts Language Models ☆693 · Updated 2 weeks ago
- Pretraining code for a large-scale depth-recurrent language model ☆697 · Updated 2 weeks ago
- [NeurIPS'24 Spotlight, ICLR'25] To speed up long-context LLMs' inference, approximate and dynamic sparse calculation of the attention, which r… ☆945 · Updated last month
- Democratizing Reinforcement Learning for LLMs ☆2,113 · Updated last month
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… (a minimal sketch of this idea follows this list) ☆310 · Updated 3 months ago
- LIMO: Less is More for Reasoning ☆864 · Updated last month
- Explore the Multimodal “Aha Moment” on 2B Model ☆524 · Updated last week
- FlashInfer: Kernel Library for LLM Serving ☆2,483 · Updated this week
- ☆1,348 · Updated 4 months ago
- Ring attention implementation with flash attention ☆717 · Updated last month
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3. ☆1,094 · Updated this week
- Training Large Language Model to Reason in a Continuous Latent Space ☆998 · Updated 2 months ago
- Next-Token Prediction is All You Need ☆2,042 · Updated last week
- ☆910 · Updated 2 months ago
- Minimalistic large language model 3D-parallelism training ☆1,715 · Updated this week
- ☆485 · Updated last week
- Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware. ☆706 · Updated 6 months ago
- DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models ☆1,609 · Updated last year
- ☆518 · Updated last week
- Fast, Flexible and Portable Structured Generation ☆818 · Updated this week
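On the memory-layers entry above: it describes a trainable key-value lookup that grows parameter count without growing per-token FLOPs, because only a few selected value rows participate in each forward pass. Below is a minimal sketch of that idea, assuming a flat key table with top-k selection; real implementations typically use product keys and sharded value tables so that even the key scoring avoids a full scan. All class and parameter names here are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryLayer(nn.Module):
    """Sketch of a trainable key-value memory: only the top-k selected value rows
    are gathered per token, so the large `values` table adds parameters cheaply.
    Note: the dense key scoring below still scans all slots; product-key memories
    factorize the keys to avoid that."""

    def __init__(self, d_model: int, num_slots: int = 4096, k: int = 8):
        super().__init__()
        self.query_proj = nn.Linear(d_model, d_model)
        self.keys = nn.Parameter(torch.randn(num_slots, d_model) * 0.02)
        self.values = nn.Embedding(num_slots, d_model)    # large, sparsely accessed
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq, d_model)
        q = self.query_proj(x)                             # (batch, seq, d_model)
        scores = q @ self.keys.T                           # (batch, seq, num_slots)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)           # (batch, seq, k)
        selected = self.values(topk_idx)                   # (batch, seq, k, d_model)
        return x + (weights.unsqueeze(-1) * selected).sum(dim=-2)
```

Example usage: `layer = MemoryLayer(d_model=512); y = layer(torch.randn(2, 16, 512))`.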