MoonshotAI / Moonlight
Muon is Scalable for LLM Training
☆1,077 · Updated 2 months ago
Alternatives and similar repositories for Moonlight
Users interested in Moonlight are comparing it to the repositories listed below.
- MoBA: Mixture of Block Attention for Long-Context LLMs ☆1,798 · Updated 2 months ago
- ☆789 · Updated last week
- Muon: An optimizer for hidden layers in neural networks ☆897 · Updated last week
- Dream 7B, a large diffusion language model ☆764 · Updated last week
- Understanding R1-Zero-Like Training: A Critical Perspective ☆988 · Updated 3 weeks ago
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ☆700 · Updated 3 months ago
- Official repo for Open-Reasoner-Zero ☆1,967 · Updated 2 weeks ago
- OLMoE: Open Mixture-of-Experts Language Models ☆785 · Updated 3 months ago
- Scalable RL solution for advanced reasoning of language models ☆1,615 · Updated 3 months ago
- An open-source RL system from ByteDance Seed and Tsinghua AIR ☆1,349 · Updated last month
- Recipes to scale inference-time compute of open models ☆1,095 · Updated last month
- Nano vLLM ☆1,659 · Updated this week
- Ring attention implementation with flash attention ☆789 · Updated last week
- Large Reasoning Models ☆804 · Updated 6 months ago
- Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models ☆698 · Updated 2 months ago
- Parallel Scaling Law for Language Models: Beyond Parameter and Inference Time Scaling ☆395 · Updated last month
- [NeurIPS'24 Spotlight, ICLR'25, ICML'25] To speed up long-context LLM inference, approximate and dynamic sparse calculation of the attention… ☆1,055 · Updated this week
- Official PyTorch implementation for "Large Language Diffusion Models" ☆2,332 · Updated this week
- Unleashing the Power of Reinforcement Learning for Math and Code Reasoners ☆632 · Updated 2 weeks ago
- Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper ☆653 · Updated last week
- ☆714 · Updated 3 weeks ago
- ☆773 · Updated last month
- MMaDA: Open-Sourced Multimodal Large Diffusion Language Models ☆1,109 · Updated last week
- LIMO: Less Is More for Reasoning ☆960 · Updated 2 months ago
- Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities ☆893 · Updated 2 months ago
- Pretraining code for a large-scale depth-recurrent language model ☆782 · Updated last week
- An Open Large Reasoning Model for Real-World Solutions ☆1,498 · Updated 3 weeks ago
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… ☆337 · Updated 6 months ago
- TransMLA: Multi-Head Latent Attention Is All You Need ☆302 · Updated this week
- Training Large Language Models to Reason in a Continuous Latent Space ☆1,155 · Updated 4 months ago