fkodom / soft-mixture-of-experts
PyTorch implementation of Soft MoE, proposed by Google Brain in "From Sparse to Soft Mixtures of Experts" (https://arxiv.org/pdf/2308.00951.pdf)
☆66 · Updated last year
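The paper's core idea is to replace hard, discrete token-to-expert routing with soft routing: each expert slot receives a learned convex combination of all input tokens, and each token's output is a learned convex combination of all slot outputs, so the layer stays fully differentiable with a fixed compute budget. Below is a minimal PyTorch sketch of that mechanism for orientation only; it illustrates the paper's approach and is not this repository's API, and the names (`SoftMoE`, `dim`, `num_experts`, `slots_per_expert`) are placeholder choices.

```python
import torch
import torch.nn as nn


class SoftMoE(nn.Module):
    """Minimal Soft MoE layer sketch (illustrative, not this repo's API)."""

    def __init__(self, dim: int, num_experts: int = 4, slots_per_expert: int = 1):
        super().__init__()
        self.num_experts = num_experts
        self.slots_per_expert = slots_per_expert
        # One learnable query vector per slot; there are num_experts * slots_per_expert slots.
        self.slot_embeds = nn.Parameter(torch.randn(num_experts * slots_per_expert, dim))
        # Each expert is an ordinary feed-forward block applied only to its own slots.
        self.experts = nn.ModuleList(
            [
                nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
                for _ in range(num_experts)
            ]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        logits = torch.einsum("btd,sd->bts", x, self.slot_embeds)
        dispatch = logits.softmax(dim=1)   # normalize over tokens: each slot mixes all tokens
        combine = logits.softmax(dim=-1)   # normalize over slots: each token mixes all slot outputs
        slots = torch.einsum("bts,btd->bsd", dispatch, x)           # (batch, slots, dim)
        slots = slots.reshape(x.shape[0], self.num_experts, self.slots_per_expert, -1)
        expert_out = torch.stack(
            [expert(slots[:, i]) for i, expert in enumerate(self.experts)], dim=1
        ).flatten(1, 2)                                             # (batch, slots, dim)
        return torch.einsum("bts,bsd->btd", combine, expert_out)    # (batch, tokens, dim)
```

As a quick shape check, `SoftMoE(dim=256)(torch.randn(2, 128, 256))` should return a tensor of shape `(2, 128, 256)`: sequence length and width are preserved, while expert compute is decoupled from sequence length by the fixed number of slots.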
Related projects
Alternatives and complementary repositories for soft-mixture-of-experts
- PyTorch implementation of "From Sparse to Soft Mixtures of Experts" ☆47 · Updated last year
- [ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal… ☆44 · Updated last year
- Official implementation of Phi-Mamba. A MOHAWK-distilled model (Transformers to SSMs: Distilling Quadratic Knowledge to Subquadratic Mode… ☆79 · Updated 2 months ago
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" ☆93 · Updated last month
- Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023) ☆79 · Updated last year
- Implementation of 🌻 Mirasol, SOTA Multimodal Autoregressive model out of Google Deepmind, in Pytorch ☆88 · Updated 11 months ago
- Randomized Positional Encodings Boost Length Generalization of Transformers ☆78 · Updated 8 months ago
- Implementation of Soft MoE, proposed by Brain's Vision team, in Pytorch ☆248 · Updated 7 months ago
- Yet another random morning idea to be quickly tried and architecture shared if it works; to allow the transformer to pause for any amount… ☆49 · Updated last year
- Official repository for the paper "Approximating Two-Layer Feedforward Networks for Efficient Transformers" ☆36 · Updated last year
- [NeurIPS 2023] Make Your Pre-trained Model Reversible: From Parameter to Memory Efficient Fine-Tuning ☆29 · Updated last year
- Code for NOLA, an implementation of "NOLA: Compressing LoRA using Linear Combination of Random Basis" ☆49 · Updated 2 months ago
- The official repository for our paper "The Dual Form of Neural Networks Revisited: Connecting Test Time Predictions to Training Patterns … ☆16 · Updated last year
- Implementation of Zorro, Masked Multimodal Transformer, in Pytorch ☆95 · Updated last year
- Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024] ☆52 · Updated last month
- Some personal experiments around routing tokens to different autoregressive attention, akin to mixture-of-experts ☆109 · Updated last month
- Towards Understanding the Mixture-of-Experts Layer in Deep Learning ☆21 · Updated 11 months ago
- Implementation of Infini-Transformer in Pytorch ☆104 · Updated last month
- Recycling diverse models ☆44 · Updated last year
- A Closer Look into Mixture-of-Experts in Large Language Models ☆40 · Updated 3 months ago
- HGRN2: Gated Linear RNNs with State Expansion ☆49 · Updated 3 months ago
- Code and benchmark for the paper: "A Practitioner's Guide to Continual Multimodal Pretraining" [NeurIPS'24] ☆35 · Updated 2 months ago
- Code accompanying the paper "Massive Activations in Large Language Models" ☆123 · Updated 8 months ago
- Implementation of the paper: "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆74 · Updated this week
- [NeurIPS 2023 spotlight] Official implementation of HGRN in our NeurIPS 2023 paper - Hierarchically Gated Recurrent Neural Network for Se… ☆61 · Updated 7 months ago
- This is a PyTorch implementation of the paper "ViP: A Differentially Private Foundation Model for Computer Vision" ☆37 · Updated last year