kyleliang919 / Super_Muon
☆66 · Updated 8 months ago
Alternatives and similar repositories for Super_Muon
Users interested in Super_Muon are comparing it to the libraries listed below.
- RWKV-7: Surpassing GPT · ☆101 · Updated last year
- DPO, but faster · ☆46 · Updated 11 months ago
- https://x.com/BlinkDL_AI/status/1884768989743882276 · ☆28 · Updated 7 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters · ☆130 · Updated last year
- Repository for the Q-Filters method (https://arxiv.org/pdf/2503.02812) · ☆35 · Updated 8 months ago
- Fast modular code to create and train cutting-edge LLMs · ☆68 · Updated last year
- [ICML 2025] From Low Rank Gradient Subspace Stabilization to Low-Rank Weights: Observations, Theories and Applications · ☆51 · Updated last month
- The evaluation framework for training-free sparse attention in LLMs · ☆106 · Updated last month
- Flash-Muon: An Efficient Implementation of Muon Optimizer · ☆212 · Updated 5 months ago
- Work in progress. · ☆75 · Updated last week
- ☆53 · Updated last year
- ☆89 · Updated last year
- A repository for research on medium-sized language models. · ☆78 · Updated last year
- QuIP quantization · ☆61 · Updated last year
- My Implementation of Q-Sparse: All Large Language Models can be Fully Sparsely-Activated · ☆33 · Updated last year
- Tiny re-implementation of MDM in style of LLaDA and nano-gpt speedrun · ☆57 · Updated 8 months ago
- ☆64 · Updated 5 months ago
- Here we will test various linear attention designs. · ☆62 · Updated last year
- GoldFinch and other hybrid transformer components · ☆45 · Updated last year
- An open-source reproduction of NVIDIA's nGPT (Normalized Transformer with Representation Learning on the Hypersphere) · ☆108 · Updated 8 months ago
- A collection of tricks and tools to speed up transformer models · ☆189 · Updated last month
- Research impl of Native Sparse Attention (2502.11089) · ☆63 · Updated 9 months ago
- RWKV-X is a Linear Complexity Hybrid Language Model based on the RWKV architecture, integrating Sparse Attention to improve the model's l… · ☆51 · Updated 4 months ago
- [NeurIPS 2024] Low rank memory efficient optimizer without SVD · ☆31 · Updated 5 months ago
- Memory optimized Mixture of Experts · ☆69 · Updated 4 months ago
- EvaByte: Efficient Byte-level Language Models at Scale · ☆111 · Updated 7 months ago
- Pytorch implementation of the PEER block from the paper, Mixture of A Million Experts, by Xu Owen He at Deepmind · ☆131 · Updated last month
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" · ☆249 · Updated 10 months ago
- Normalized Transformer (nGPT) · ☆194 · Updated last year
- ☆55 · Updated 5 months ago