MoonshotAI / Moonlight
Muon is Scalable for LLM Training
⭐1,052 · Updated 2 months ago
Alternatives and similar repositories for Moonlight
Users interested in Moonlight are comparing it to the libraries listed below.
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ⭐686 · Updated 2 months ago
- Official Repo for Open-Reasoner-Zero ⭐1,939 · Updated last month
- ⭐773 · Updated last month
- Understanding R1-Zero-Like Training: A Critical Perspective ⭐956 · Updated last week
- MoBA: Mixture of Block Attention for Long-Context LLMs ⭐1,781 · Updated last month
- Muon optimizer: >30% sample efficiency with <3% wallclock overhead ⭐661 · Updated this week
- Scalable RL solution for advanced reasoning of language models ⭐1,587 · Updated 2 months ago
- An Open-source RL System from ByteDance Seed and Tsinghua AIR ⭐1,284 · Updated 3 weeks ago
- OLMoE: Open Mixture-of-Experts Language Models ⭐764 · Updated 2 months ago
- Dream 7B, a large diffusion language model ⭐703 · Updated 3 weeks ago
- Large Reasoning Models ⭐804 · Updated 5 months ago
- MMaDA - Open-Sourced Multimodal Large Diffusion Language Models ⭐819 · Updated last week
- Ring attention implementation with flash attention ⭐771 · Updated last week
- An Open Large Reasoning Model for Real-World Solutions ⭐1,494 · Updated this week
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up Long-context LLMs' inference, approximate and dynamic sparse calculate the attention… ⭐1,035 · Updated last week
- Parallel Scaling Law for Language Model – Beyond Parameter and Inference Time Scaling ⭐345 · Updated 2 weeks ago
- Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware. ⭐725 · Updated 8 months ago
- Official Implementation of EAGLE-1 (ICML'24), EAGLE-2 (EMNLP'24), and EAGLE-3. ⭐1,266 · Updated 2 weeks ago
- Fast, Flexible and Portable Structured Generation ⭐980 · Updated last week
- O1 Replication Journey ⭐1,990 · Updated 4 months ago
- A fork to add multimodal model training to open-r1 ⭐1,281 · Updated 3 months ago
- ⭐706 · Updated this week
- Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities ⭐868 · Updated last month
- Official PyTorch implementation for "Large Language Diffusion Models" ⭐2,036 · Updated this week
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ⭐463 · Updated 3 months ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… ⭐333 · Updated 5 months ago
- RAGEN leverages reinforcement learning to train LLM reasoning agents in interactive, stochastic environments. ⭐1,908 · Updated this week
- Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models ⭐660 · Updated last month
- Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper ⭐637 · Updated 2 weeks ago
- Recipes to scale inference-time compute of open models ⭐1,087 · Updated last week
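Moonlight's headline technique, the Muon optimizer, replaces the elementwise Adam update for 2-D weight matrices with an orthogonalized momentum step: momentum is accumulated as usual, then approximately mapped to the nearest semi-orthogonal matrix via a Newton-Schulz iteration before being applied. A minimal NumPy sketch of the idea, assuming the quintic iteration coefficients published with Muon; the function names here are illustrative, not the repo's API:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately map G to the nearest semi-orthogonal matrix
    using the quintic Newton-Schulz iteration associated with Muon."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic coefficients (assumed from the Muon repo)
    X = G / (np.linalg.norm(G) + eps)  # normalize so the iteration converges
    tall = G.shape[0] > G.shape[1]
    if tall:                           # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if tall else X

def muon_step(W, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style update for a single 2-D weight matrix (sketch)."""
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(momentum)
    return W - lr * update, momentum
```

In practice the orthogonalized update is applied only to the hidden 2-D weight matrices, with AdamW retained for embeddings, gains, and biases; the learning rate and momentum values above are placeholders, not tuned settings.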