lucidrains / mixture-of-experts
A PyTorch implementation of Sparsely-Gated Mixture of Experts, for massively increasing the parameter count of language models
☆848 · Sep 13, 2023 · Updated 2 years ago
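For context before the comparison list, here is a minimal usage sketch of how a sparsely-gated MoE layer from this repo is typically dropped into a model. The import path, constructor arguments, and return signature below are assumptions drawn from the repo's README and may differ from the current release; check the README before relying on them.

```python
# Minimal sketch for lucidrains/mixture-of-experts (pip install mixture-of-experts).
# Import path and argument names are assumptions based on the repo's README and
# may not match the current API exactly.
import torch
from mixture_of_experts import MoE  # assumed module/class names

moe = MoE(
    dim = 512,          # token (model) dimension
    num_experts = 16,   # more experts -> more parameters at roughly constant compute per token
)

x = torch.randn(4, 1024, 512)   # (batch, sequence, dim)
out, aux_loss = moe(x)          # output of the same shape, plus a load-balancing auxiliary loss
# scale aux_loss and add it to the main loss so the gate spreads tokens across experts
```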
Alternatives and similar repositories for mixture-of-experts
Users interested in mixture-of-experts are comparing it to the libraries listed below.
- PyTorch Re-Implementation of "The Sparsely-Gated Mixture-of-Experts Layer" by Noam Shazeer et al. (https://arxiv.org/abs/1701.06538) ☆1,228 · Apr 19, 2024 · Updated last year
- Implementation of Soft MoE, proposed by Brain's Vision team, in Pytorch ☆344 · Apr 2, 2025 · Updated 10 months ago
- A fast MoE implementation for PyTorch ☆1,834 · Feb 10, 2025 · Updated last year
- A collection of AWESOME things about mixture-of-experts ☆1,262 · Dec 8, 2024 · Updated last year
- Tutel MoE: Optimized Mixture-of-Experts Library, supporting GptOss/DeepSeek/Kimi-K2/Qwen3 with FP8/NVFP4/MXFP4 ☆965 · Dec 21, 2025 · Updated last month
- ☆705 · Dec 6, 2025 · Updated 2 months ago
- Some personal experiments around routing tokens to different autoregressive attention, akin to mixture-of-experts ☆123 · Oct 17, 2024 · Updated last year
- A curated reading list of research in Mixture-of-Experts (MoE) ☆660 · Oct 30, 2024 · Updated last year
- A family of open-sourced Mixture-of-Experts (MoE) Large Language Models ☆1,660 · Mar 8, 2024 · Updated last year
- 🦁 Lion, a new optimizer discovered by Google Brain using genetic algorithms that is purportedly better than Adam(w), in Pytorch ☆2,184 · Nov 27, 2024 · Updated last year
- ⛷️ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training (EMNLP 2024) ☆1,003 · Dec 6, 2024 · Updated last year
- This package implements THOR: Transformer with Stochastic Experts. ☆65 · Oct 7, 2021 · Updated 4 years ago
- ☆89 · Apr 2, 2022 · Updated 3 years ago
- Implementation of the specific Transformer architecture from PaLM - Scaling Language Modeling with Pathways ☆828 · Nov 9, 2022 · Updated 3 years ago
- Fast and memory-efficient exact attention ☆22,231 · Updated this week
- Implementation of RETRO, DeepMind's retrieval-based attention net, in Pytorch ☆879 · Oct 30, 2023 · Updated 2 years ago
- PyTorch extensions for high performance and large scale training. ☆3,399 · Apr 26, 2025 · Updated 9 months ago
- TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale. ☆1,699 · Updated this week
- Transformer-related optimization, including BERT, GPT ☆6,392 · Mar 27, 2024 · Updated last year
- Implementation of Rotary Embeddings, from the Roformer paper, in Pytorch ☆804 · Jan 30, 2026 · Updated 2 weeks ago
- 🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (i… ☆9,491 · Feb 6, 2026 · Updated last week
- Accessible large language models via k-bit quantization for PyTorch. ☆7,952 · Updated this week
- Flexible and powerful tensor operations for readable and reliable code (for pytorch, jax, TF and others) ☆9,395 · Jan 26, 2026 · Updated 3 weeks ago
- Ongoing research training transformer models at scale ☆15,213 · Updated this week
- Vector (and Scalar) Quantization, in Pytorch ☆3,870 · Updated this week
- Mamba SSM architecture ☆17,186 · Jan 12, 2026 · Updated last month
- ☆143 · Jul 21, 2024 · Updated last year
- Implementation of MEGABYTE, Predicting Million-byte Sequences with Multiscale Transformers, in Pytorch ☆655 · Dec 27, 2024 · Updated last year
- ☆273 · Oct 31, 2023 · Updated 2 years ago
- maximal update parametrization (µP) ☆1,676 · Jul 17, 2024 · Updated last year
- Implementation of Vision Transformer, a simple way to achieve SOTA in vision classification with only a single transformer encoder, in Py… ☆24,993 · Updated this week
- Foundation Architecture for (M)LLMs ☆3,133 · Apr 11, 2024 · Updated last year
- An open source implementation of CLIP. ☆13,383 · Updated this week
- Reformer, the efficient Transformer, in Pytorch ☆2,193 · Jun 21, 2023 · Updated 2 years ago
- 🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning. ☆20,619 · Feb 9, 2026 · Updated last week
- [COLM 2024] LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition ☆668 · Jul 22, 2024 · Updated last year
- Implementation of fused cosine similarity attention in the same style as Flash Attention ☆220 · Feb 13, 2023 · Updated 3 years ago
- [NeurIPS 2022] “M³ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design”, Hanxue … ☆136 · Nov 30, 2022 · Updated 3 years ago
- Official PyTorch Implementation of "Scalable Diffusion Models with Transformers" ☆8,352 · May 31, 2024 · Updated last year