kyleliang919 / Super_MuonLinks

☆64

Alternatives and similar repositories for Super_Muon

Users that are interested in Super_Muon are comparing it to the libraries listed below

Sorting:

BlinkDL / modded-nanogpt-rwkv
RWKV-7: Surpassing GPT
☆98Updated 11 months ago
frankxwang / dpo-prefix-sharing
DPO, but faster 🚀
☆45Updated 10 months ago
IST-DASLab / QuEST
Work in progress.
☆74Updated 3 months ago
RWKV / ZeroCoT
https://x.com/BlinkDL_AI/status/1884768989743882276
☆28Updated 5 months ago
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆195Updated 4 months ago
fal-ai-community / NativeSparseAttention
research impl of Native Sparse Attention (2502.11089)
☆62Updated 8 months ago
TRI-ML / linear_open_lm
A repository for research on medium sized language models.
☆78Updated last year
Zyphra / tree_attention
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
☆130Updated 10 months ago
chu-tianxiang / QuIP-for-all
QuIP quantization
☆59Updated last year
wdlctc / mini-s
☆52Updated 11 months ago
fal-ai-community / nano-mdm
Tiny re-implementation of MDM in style of LLaDA and nano-gpt speedrun
☆56Updated 7 months ago
Cornell-RelaxML / yaqa-quantization
☆60Updated 4 months ago
PiotrNawrot / sparse-frontier
The evaluation framework for training-free sparse attention in LLMs
☆101Updated last week
VITA-Group / WeLore
From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients. Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu,…
☆49Updated 6 months ago
NathanGodey / qfilters
Repository for the Q-Filters method (https://arxiv.org/pdf/2503.02812)
☆35Updated 7 months ago
RobertCsordas / moeut
☆86Updated last year
joey00072 / Multi-Head-Latent-Attention-MLA-
working implimention of deepseek MLA
☆44Updated 9 months ago
OpenEvaByte / evabyte
EvaByte: Efficient Byte-level Language Models at Scale
☆110Updated 6 months ago
HazyResearch / lolcats
Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models"
☆248Updated 8 months ago
s-sahoo / Eso-LMs
Esoteric Language Models
☆101Updated 2 weeks ago
bloc97 / DeMo
DeMo: Decoupled Momentum Optimization
☆194Updated 10 months ago
RobertCsordas / moe_attention
Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention"
☆99Updated last year
NX-AI / mlstm_kernels
Tiled Flash Linear Attention library for fast and efficient mLSTM Kernels.
☆73Updated 2 weeks ago
VatsaDev / NanoPoor
NanoGPT-speedrunning for the poor T4 enjoyers
☆72Updated 6 months ago
OpenMachine-ai / transformer-tricks
A collection of tricks and tools to speed up transformer models
☆182Updated 2 weeks ago
kyleliang919 / Online-Subspace-Descent
[NeurIPS 2024] Low rank memory efficient optimizer without SVD
☆30Updated 3 months ago
NVIDIA / ngpt
Normalized Transformer (nGPT)
☆192Updated 11 months ago
lucaslingle / mu_transformer
Transformer with Mu-Parameterization, implemented in Jax/Flax. Supports FSDP on TPU pods.
☆32Updated 4 months ago
tilde-research / MoMoE-impl
Memory optimized Mixture of Experts
☆68Updated 3 months ago
evanatyourservice / llm-jax
Train a SmolLM-style llm on fineweb-edu in JAX/Flax with an assortment of optimizers.
☆18Updated 3 months ago