uclaml / MoE
Towards Understanding the Mixture-of-Experts Layer in Deep Learning
☆25 · Updated last year
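For readers new to the topic, the sketch below illustrates the kind of sparsely gated Mixture-of-Experts layer this line of work studies: a learned gating network scores the experts for each token, only the top-k experts are evaluated, and their outputs are combined with the renormalized gate weights. The expert architecture, dimensions, and top-k routing shown here are illustrative assumptions, not code from this repository.

```python
# Minimal sketch of a sparsely gated MoE layer with top-k routing.
# Expert width, number of experts, and the renormalization over the
# selected gates are illustrative choices, not uclaml/MoE's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Linear gating network: one logit per expert.
        self.gate = nn.Linear(d_model, num_experts)
        # Each expert is a small two-layer feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        logits = self.gate(x)                            # (num_tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)   # keep the top-k experts per token
        weights = F.softmax(weights, dim=-1)             # renormalize over the selected experts
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            rows, slots = (idx == e).nonzero(as_tuple=True)  # tokens routed to expert e
            if rows.numel() == 0:
                continue
            out[rows] += weights[rows, slots].unsqueeze(-1) * expert(x[rows])
        return out


if __name__ == "__main__":
    layer = MoELayer(d_model=16, d_hidden=32, num_experts=4, top_k=2)
    tokens = torch.randn(8, 16)
    print(layer(tokens).shape)  # torch.Size([8, 16])
```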
Alternatives and similar repositories for MoE:
Users interested in MoE are comparing it to the libraries listed below:
- HGRN2: Gated Linear RNNs with State Expansion ☆53 · Updated 7 months ago
- Official repository of "LiNeS: Post-training Layer Scaling Prevents Forgetting and Enhances Model Merging" ☆24 · Updated 4 months ago
- PyTorch implementation of Soft MoE by Google Brain in "From Sparse to Soft Mixtures of Experts" (https://arxiv.org/pdf/2308.00951.pdf) ☆71 · Updated last year
- Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023) ☆79 · Updated last year
- ☆18 · Updated 8 months ago
- Official repo of Progressive Data Expansion: data, code and evaluation ☆28 · Updated last year
- ☆30 · Updated 2 months ago
- ☆18 · Updated this week
- ☆74 · Updated 7 months ago
- Code and benchmark for the paper: "A Practitioner's Guide to Continual Multimodal Pretraining" [NeurIPS'24] ☆54 · Updated 3 months ago
- Official implementation of "DAPE: Data-Adaptive Positional Encoding for Length Extrapolation" ☆37 · Updated 5 months ago
- Recycling diverse models ☆44 · Updated 2 years ago
- Implementation of Griffin from the paper: "Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models" ☆51 · Updated 2 months ago
- [WACV 2025] Official implementation of "Online-LoRA: Task-free Online Continual Learning via Low Rank Adaptation" by Xiwen Wei, Guihong L… ☆35 · Updated 4 months ago
- LoRA-XS: Low-Rank Adaptation with Extremely Small Number of Parameters ☆31 · Updated 3 weeks ago
- Official PyTorch implementation for NeurIPS'24 paper "Knowledge Composition using Task Vectors with Learned Anisotropic Scaling" ☆19 · Updated last month
- ☆27 · Updated last month
- [ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal… ☆48 · Updated 2 years ago
- DeciMamba: Exploring the Length Extrapolation Potential of Mamba (ICLR 2025) ☆24 · Updated 8 months ago
- [NeurIPS 2023] Official repository for "Distilling Out-of-Distribution Robustness from Vision-Language Foundation Models" ☆12 · Updated 9 months ago
- ☆30 · Updated 5 months ago
- Official repository of the paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval" ☆26 · Updated 11 months ago
- One Initialization to Rule them All: Fine-tuning via Explained Variance Adaptation ☆38 · Updated 5 months ago
- [ACL 2023] Code for the paper “Tailoring Instructions to Student’s Learning Levels Boosts Knowledge Distillation” (https://arxiv.org/abs/2305.…) ☆38 · Updated last year
- ☆21 · Updated 2 years ago
- Code for "Merging Text Transformers from Different Initializations" ☆19 · Updated last month
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" ☆97 · Updated 5 months ago
- [ICLR 2025] Official code release for "Explaining Modern Gated-Linear RNNs via a Unified Implicit Attention Formulation" ☆41 · Updated 3 weeks ago
- Official code for the paper "Attention as a Hypernetwork" ☆25 · Updated 9 months ago
- Why Do We Need Weight Decay in Modern Deep Learning? [NeurIPS 2024] ☆63 · Updated 6 months ago