kyleliang919 / Super_Muon
☆67 · Mar 21, 2025 · Updated 10 months ago
Alternatives and similar repositories for Super_Muon
Users interested in Super_Muon are comparing it to the repositories listed below.
- Code for the paper "Function-Space Learning Rates" — ☆25 · Jun 3, 2025 · Updated 8 months ago
- Flash-Muon: An Efficient Implementation of the Muon Optimizer — ☆233 · Jun 15, 2025 · Updated 7 months ago
- ☆19 · Dec 4, 2025 · Updated 2 months ago
- RWKV-7 mini — ☆12 · Mar 29, 2025 · Updated 10 months ago
- ☆14 · Mar 2, 2025 · Updated 11 months ago
- Combining SOAP and Muon — ☆19 · Feb 11, 2025 · Updated last year
- [NeurIPS 2024] Low-rank, memory-efficient optimizer without SVD — ☆33 · Jul 1, 2025 · Updated 7 months ago
- [Oral, NeurIPS OPT 2024] μLO: Compute-Efficient Meta-Generalization of Learned Optimizers — ☆14 · Mar 18, 2025 · Updated 10 months ago
- Checkpointable dataset utilities for foundation model training — ☆32 · Jan 29, 2024 · Updated 2 years ago
- Official repo for "Error-Free Linear Attention is a Free Lunch: Exact Solution from Continuous-Time Dynamics" — ☆71 · Jan 13, 2026 · Updated last month
- H-Net dynamic hierarchical architecture — ☆81 · Sep 11, 2025 · Updated 5 months ago
- DeMo: Decoupled Momentum Optimization — ☆198 · Dec 2, 2024 · Updated last year
- ☆34 · Sep 10, 2024 · Updated last year
- Fast modular code to create and train cutting-edge LLMs — ☆68 · May 16, 2024 · Updated last year
- ☆20 · May 30, 2024 · Updated last year
- Official PyTorch implementation of the Longhorn deep state space model — ☆56 · Dec 4, 2024 · Updated last year
- Repository for the Q-Filters method (https://arxiv.org/pdf/2503.02812) — ☆35 · Mar 7, 2025 · Updated 11 months ago
- Supporting code for the blog post on modular manifolds — ☆115 · Sep 26, 2025 · Updated 4 months ago
- Efficient PScan implementation in PyTorch — ☆17 · Jan 2, 2024 · Updated 2 years ago
- Tests of various linear attention designs — ☆62 · Apr 25, 2024 · Updated last year
- ☆44 · Nov 1, 2025 · Updated 3 months ago
- Unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" — ☆36 · Jun 7, 2024 · Updated last year
- Tiled Flash Linear Attention library for fast and efficient mLSTM kernels — ☆85 · Nov 25, 2025 · Updated 2 months ago
- ☆27 · Jul 28, 2025 · Updated 6 months ago
- [Preprint] GMem: A Modular Approach for Ultra-Efficient Generative Models — ☆42 · Mar 11, 2025 · Updated 11 months ago
- Official Chinese documentation for RWKV (RWKV官方中文文档) — ☆14 · Feb 6, 2026 · Updated last week
- [ICLR 2025 & COLM 2025] Official PyTorch implementation of the Forgetting Transformer and Adaptive Computation Pruning — ☆137 · Dec 19, 2025 · Updated last month
- NanoGPT speedrunning for the poor T4 enjoyers — ☆73 · Apr 22, 2025 · Updated 9 months ago
- Code for the ICLR 2025 paper "What is Wrong with Perplexity for Long-context Language Modeling?" — ☆109 · Oct 11, 2025 · Updated 4 months ago
- RADLADS training code — ☆36 · May 7, 2025 · Updated 9 months ago
- Positional Skip-wise Training for Efficient Context Window Extension of LLMs to Extreme Lengths (ICLR 2024) — ☆209 · May 20, 2024 · Updated last year
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) — ☆24 · Jun 6, 2024 · Updated last year
- Mini Model Daemon — ☆12 · Nov 9, 2024 · Updated last year
- ☆11 · Oct 11, 2023 · Updated 2 years ago
- Experiments on the impact of depth in transformers and SSMs — ☆40 · Oct 23, 2025 · Updated 3 months ago
- Code for the NeurIPS 2024 Spotlight "Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations" — ☆89 · Oct 30, 2024 · Updated last year
- EvaByte: Efficient Byte-level Language Models at Scale — ☆115 · Apr 22, 2025 · Updated 9 months ago
- GoldFinch and other hybrid transformer components — ☆12 · Dec 9, 2025 · Updated 2 months ago
- Code for the paper "Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns" — ☆18 · Mar 15, 2024 · Updated last year