kyleliang919 / Super_Muon
☆65 · Updated 9 months ago
Alternatives and similar repositories for Super_Muon
Users interested in Super_Muon are comparing it to the libraries listed below.
- RWKV-7: Surpassing GPT (☆103, updated last year)
- DPO, but faster (☆46, updated last year)
- https://x.com/BlinkDL_AI/status/1884768989743882276 (☆28, updated 8 months ago)
- Repository for the Q-Filters method (https://arxiv.org/pdf/2503.02812) (☆35, updated 10 months ago)
- ☆91, updated last year
- Work in progress. (☆77, updated last month)
- Flash-Muon: An Efficient Implementation of Muon Optimizer (☆225, updated 6 months ago)
- A repository for research on medium sized language models. (☆77, updated last year)
- Tiny re-implementation of MDM in style of LLaDA and nano-gpt speedrun (☆56, updated 10 months ago)
- A collection of tricks and tools to speed up transformer models (☆193, updated 3 weeks ago)
- RWKV-X is a Linear Complexity Hybrid Language Model based on the RWKV architecture, integrating Sparse Attention to improve the model's l… (☆53, updated this week)
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters (☆131, updated last year)
- Fast modular code to create and train cutting edge LLMs (☆68, updated last year)
- Esoteric Language Models (☆107, updated last month)
- ☆53, updated last year
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" (☆249, updated 11 months ago)
- ☆39, updated last year
- [ICML 2025] From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories and Applications (☆52, updated 2 months ago)
- An open source reproduction of NVIDIA's nGPT (Normalized Transformer with Representation Learning on the Hypersphere) (☆109, updated 10 months ago)
- Here we will test various linear attention designs. (☆62, updated last year)
- Normalized Transformer (nGPT) (☆195, updated last year)
- EvaByte: Efficient Byte-level Language Models at Scale (☆113, updated 8 months ago)
- Code accompanying the paper "Generalized Interpolating Discrete Diffusion" (☆112, updated 7 months ago)
- GoldFinch and other hybrid transformer components (☆45, updated last year)
- Research implementation of Native Sparse Attention (2502.11089) (☆63, updated 10 months ago)
- PyTorch implementation of the PEER block from the paper "Mixture of A Million Experts" by Xu Owen He at DeepMind (☆132, updated 2 months ago)
- Official implementation of GRAPE: Group Representational Position Encoding (https://arxiv.org/abs/2512.07805) (☆71, updated last week)
- The evaluation framework for training-free sparse attention in LLMs (☆108, updated 3 months ago)
- NanoGPT-speedrunning for the poor T4 enjoyers (☆73, updated 8 months ago)
- [NeurIPS 2024] Low rank memory efficient optimizer without SVD (☆32, updated 6 months ago)