fkodom / soft-mixture-of-experts
PyTorch implementation of Soft MoE by Google Brain in "From Sparse to Soft Mixtures of Experts" (https://arxiv.org/pdf/2308.00951.pdf)
☆71 · Updated last year
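For context on what the repositories below implement: the paper's core idea is a fully differentiable router. Each expert owns a fixed number of slots, every slot is computed as a softmax-weighted mixture over all input tokens (dispatch), and every token's output is a softmax-weighted mixture over all slot outputs (combine), so no tokens are dropped and no discrete routing is needed. The snippet below is a minimal sketch of that dispatch/combine step in PyTorch; the class and argument names (`SoftMoE`, `num_experts`, `slots_per_expert`) are illustrative and are not the actual API of this repository.

```python
import torch
import torch.nn as nn

class SoftMoE(nn.Module):
    """Minimal Soft MoE sketch: tokens are softly dispatched to expert slots,
    and expert outputs are softly combined back into per-token outputs."""

    def __init__(self, dim: int, num_experts: int = 4, slots_per_expert: int = 1):
        super().__init__()
        self.num_experts = num_experts
        self.slots_per_expert = slots_per_expert
        # One learnable embedding per slot (phi in the paper).
        self.slot_embeds = nn.Parameter(torch.randn(dim, num_experts * slots_per_expert))
        # Each expert is a small MLP here, purely for illustration.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        logits = x @ self.slot_embeds                       # (b, n, e*s)
        dispatch = logits.softmax(dim=1)                    # normalize over tokens, per slot
        combine = logits.softmax(dim=-1)                    # normalize over slots, per token
        slots = torch.einsum("bnd,bns->bsd", x, dispatch)   # (b, e*s, dim)
        slots = slots.view(x.size(0), self.num_experts, self.slots_per_expert, -1)
        outs = torch.stack(
            [expert(slots[:, i]) for i, expert in enumerate(self.experts)], dim=1
        )                                                   # (b, e, s, dim)
        outs = outs.flatten(1, 2)                           # (b, e*s, dim)
        return torch.einsum("bns,bsd->bnd", combine, outs)  # (b, n, dim)

# Usage sketch:
# layer = SoftMoE(dim=256, num_experts=4, slots_per_expert=1)
# y = layer(torch.randn(2, 128, 256))   # -> (2, 128, 256)
```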
Alternatives and similar repositories for soft-mixture-of-experts:
Users interested in soft-mixture-of-experts are comparing it to the libraries listed below:
- PyTorch implementation of "From Sparse to Soft Mixtures of Experts" ☆52 · Updated last year
- Implementation of 🌻 Mirasol, SOTA Multimodal Autoregressive model out of Google Deepmind, in Pytorch ☆88 · Updated last year
- Towards Understanding the Mixture-of-Experts Layer in Deep Learning ☆24 · Updated last year
- Implementation of Soft MoE, proposed by Brain's Vision team, in Pytorch ☆270 · Updated 10 months ago
- Yet another random morning idea to be quickly tried and architecture shared if it works; to allow the transformer to pause for any amount… ☆53 · Updated last year
- [ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal… ☆48 · Updated 2 years ago
- Implementation of Infini-Transformer in Pytorch ☆109 · Updated 2 months ago
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" ☆97 · Updated 5 months ago
- Code accompanying the paper "Massive Activations in Large Language Models" ☆150 · Updated last year
- [NeurIPS 2023] Make Your Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning ☆31 · Updated last year
- Randomized Positional Encodings Boost Length Generalization of Transformers ☆80 · Updated last year
- [NeurIPS 2023 spotlight] Official implementation of HGRN in our NeurIPS 2023 paper - Hierarchically Gated Recurrent Neural Network for Se… ☆64 · Updated 10 months ago
- Explorations into the recently proposed Taylor Series Linear Attention ☆95 · Updated 7 months ago
- ☆37 · Updated 11 months ago
- Some personal experiments around routing tokens to different autoregressive attention, akin to mixture-of-experts ☆116 · Updated 5 months ago
- Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023) ☆79 · Updated last year
- Model Stock: All we need is just a few fine-tuned models ☆106 · Updated 6 months ago
- Mixture of A Million Experts ☆42 · Updated 7 months ago
- Language Quantized AutoEncoders ☆102 · Updated 2 years ago
- This is the implementation of the paper AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning (https://arxiv.org/abs/2205.1…) ☆130 · Updated last year
- [NAACL 2025] A Closer Look into Mixture-of-Experts in Large Language Models ☆45 · Updated last month
- ☆52 · Updated 8 months ago
- Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024] ☆63 · Updated 5 months ago
- Implementation of Zorro, Masked Multimodal Transformer, in Pytorch ☆97 · Updated last year
- Official implementation of Phi-Mamba. A MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Mode… ☆100 · Updated 6 months ago
- Inference Speed Benchmark for Learning to (Learn at Test Time): RNNs with Expressive Hidden States ☆63 · Updated 8 months ago
- Language models scale reliably with over-training and on downstream tasks ☆96 · Updated 11 months ago
- ☆101 · Updated last year
- ☆127 · Updated 2 years ago
- Official repository for the paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers" ☆36 · Updated last year