jxiw / M1
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
☆39 · Updated last month
Alternatives and similar repositories for M1
Users interested in M1 are comparing it to the libraries listed below.
- ☆57 · Updated 2 months ago
- Kinetics: Rethinking Test-Time Scaling Laws ☆80 · Updated 2 months ago
- [ICLR 2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM. ☆91 · Updated 8 months ago
- ☆100 · Updated 4 months ago
- ☆84 · Updated 6 months ago
- [ICLR 2025 & COLM 2025] Official PyTorch implementation of the Forgetting Transformer and Adaptive Computation Pruning ☆125 · Updated last month
- An efficient implementation of the NSA (Native Sparse Attention) kernel ☆115 · Updated 2 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆93 · Updated 2 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆185 · Updated 3 months ago
- Stick-breaking attention ☆60 · Updated 2 months ago
- ☆91 · Updated last month
- [ICLR 2025] When Attention Sink Emerges in Language Models: An Empirical View (Spotlight) ☆124 · Updated 2 months ago
- Here we will test various linear attention designs. ☆62 · Updated last year
- Flash-Linear-Attention models beyond language ☆17 · Updated 2 weeks ago
- [NeurIPS 2024] 📈 Scaling Laws with Vocabulary: Larger Models Deserve Larger Vocabularies https://arxiv.org/abs/2407.13623 ☆86 · Updated 11 months ago
- The official code implementation for the paper "R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing" ☆45 · Updated last week
- ☆118 · Updated 3 months ago
- Code for "Reasoning to Learn from Latent Thoughts" ☆118 · Updated 5 months ago
- Code for the ICLR 2025 paper "What is Wrong with Perplexity for Long-context Language Modeling?" ☆99 · Updated last month
- ☆35 · Updated 6 months ago
- ☆51 · Updated 2 months ago
- [ICML 2024] Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs ☆92 · Updated 9 months ago
- The official implementation of "Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free" ☆54 · Updated 4 months ago
- ☆126 · Updated 3 months ago
- ☆54 · Updated 3 months ago
- Efficient Triton implementation of Native Sparse Attention. ☆215 · Updated 3 months ago
- 🔥 A minimal training framework for scaling FLA models ☆239 · Updated this week
- The official GitHub repo for "Diffusion Language Models are Super Data Learners". ☆109 · Updated last month
- TraceRL - Revolutionizing Reinforcement Learning Framework for Diffusion Large Language Models ☆128 · Updated last week
- Implementation of FP8/INT8 rollout for RL training without performance drop. ☆200 · Updated this week