lucidrains / mixture-of-expertsLinks

A Pytorch implementation of Sparsely-Gated Mixture of Experts, for massively increasing the parameter count of language models

☆815

Alternatives and similar repositories for mixture-of-experts

Users that are interested in mixture-of-experts are comparing it to the libraries listed below

Sorting:

davidmrau / mixture-of-experts
PyTorch Re-Implementation of "The Sparsely-Gated Mixture-of-Experts Layer" by Noam Shazeer et al. https://arxiv.org/abs/1701.06538
☆1,188Updated last year
lucidrains / st-moe-pytorch
Implementation of ST-Moe, the latest incarnation of MoE after years of research at Brain, in Pytorch
☆363Updated last year
google-research / vmoe
☆683Updated 2 months ago
codecaution / Awesome-Mixture-of-Experts-Papers
A curated reading list of research in Mixture-of-Experts(MoE).
☆648Updated 11 months ago
lucidrains / linear-attention-transformer
Transformer based on a variant of attention that is linear complexity in respect to sequence length
☆801Updated last year
lucidrains / rotary-embedding-torch
Implementation of Rotary Embeddings, from the Roformer paper, in Pytorch
☆769Updated 2 months ago
microsoft / Tutel
Tutel MoE: Optimized Mixture-of-Experts Library, Support GptOss/DeepSeek/Kimi-K2/Qwen3 using FP8/NVFP4/MXFP4
☆934Updated 2 weeks ago
laekov / fastmoe
A fast MoE impl for PyTorch
☆1,806Updated 8 months ago
XueFuzhao / awesome-mixture-of-experts
A collection of AWESOME things about mixture-of-experts
☆1,216Updated 10 months ago
lucidrains / local-attention
An implementation of local windowed attention for language modeling
☆483Updated 3 months ago
ofirpress / attention_with_linear_biases
Code for the ALiBi method for transformer language models (ICLR 2022)
☆544Updated last year
lucidrains / FLASH-pytorch
Implementation of the Transformer variant proposed in "Transformer Quality in Linear Time"
☆369Updated 2 years ago
lucidrains / linformer
Implementation of Linformer for Pytorch
☆299Updated last year
lucidrains / soft-moe-pytorch
Implementation of Soft MoE, proposed by Brain's Vision team, in Pytorch
☆331Updated 6 months ago
ZhuiyiTechnology / roformer
Rotary Transformer
☆1,039Updated 3 years ago
mlfoundations / model-soups
Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time
☆488Updated last year
facebookresearch / multimodal
TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale.
☆1,656Updated this week
test-time-training / ttt-lm-pytorch
Official PyTorch implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States
☆1,263Updated last year
lucidrains / memory-efficient-attention-pytorch
Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"
☆383Updated 2 years ago
google-research / long-range-arena
Long Range Arena for Benchmarking Efficient Transformers
☆765Updated last year
QingruZhang / AdaLoRA
AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning (ICLR 2023).
☆353Updated 2 years ago
bobby-he / simplified_transformers
☆292Updated 10 months ago
idiap / fast-transformers
Pytorch library for fast transformer implementations
☆1,745Updated 2 years ago
lucidrains / ring-attention-pytorch
Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch
☆542Updated 5 months ago
Haiyang-W / TokenFormer
[ICLR2025 Spotlight🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
☆574Updated 8 months ago
fkodom / grouped-query-attention-pytorch
(Unofficial) PyTorch implementation of grouped-query attention (GQA) from "GQA: Training Generalized Multi-Query Transformer Models from …
☆181Updated last year
NVlabs / DoRA
[ICML2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation
☆866Updated last year
meta-pytorch / attention-gym
Helpful tools and examples for working with flex-attention
☆1,029Updated this week
test-time-training / ttt-lm-jax
Official JAX implementation of Learning to (Learn at Test Time): RNNs with Expressive Hidden States
☆423Updated last year
bzhangGo / rmsnorm
Root Mean Square Layer Normalization
☆256Updated 2 years ago