microsoft / SparseMixer
Sparse Backpropagation for Mixture-of-Expert Training
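The repository description names the problem it targets: backpropagating through the discrete expert-routing decision when training a mixture-of-experts layer. The sketch below is not the SparseMixer algorithm; it is a minimal PyTorch illustration of that setting, using a plain straight-through estimator as a generic stand-in for the router gradient. All names (`Top1MoE`, `d_model`, `num_experts`) are illustrative assumptions, not identifiers from this repo.

```python
# Minimal sketch of a top-1 mixture-of-experts layer with a non-differentiable
# routing decision. The straight-through gate below is a generic baseline,
# NOT the SparseMixer estimator proposed by this repository.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                    # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)            # soft routing scores
        idx = probs.argmax(dim=-1)                           # discrete top-1 expert choice
        hard = F.one_hot(idx, probs.size(-1)).to(probs)      # one-hot expert mask
        # Straight-through: forward pass uses the hard mask, backward pass
        # routes gradients through the soft probabilities instead.
        gate = hard + probs - probs.detach()
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = idx == e
            if sel.any():
                out[sel] = expert(x[sel]) * gate[sel, e].unsqueeze(-1)
        return out
```

A straight-through gate is only one way to approximate the gradient of the discrete routing step; the repository's paper proposes its own estimator for this purpose.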
Related projects:
- Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding"
- Repo for the ICML 2023 paper "Why do Nearest Neighbor Language Models Work?"
- Parameter-Efficient Transfer Learning with Diff Pruning
- Using FlexAttention to compute attention with different masking patterns
- A Kernel-Based View of Language Model Fine-Tuning https://arxiv.org/abs/2210.05643
- Code for "Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes"
- This package implements THOR: Transformer with Stochastic Experts.
- Code for the ACL 2022 paper "StableMoE: Stable Routing Strategy for Mixture of Experts"
- [ICLR 2023] "Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers" by Tianlong Chen*, Zhenyu Zhang*, Ajay Jaiswal…
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory-efficient Transformers.
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long)
- Block-sparse movement pruning
- Code for "Training Neural Networks with Fixed Sparse Masks" (NeurIPS 2021).
- Code for the paper "Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning"
- Repository for "Propagating Knowledge Updates to LMs Through Distillation" (NeurIPS 2023).☆23Updated 3 weeks ago
- ☆18Updated 3 months ago
- DEMix Layers for Modular Language Modeling☆51Updated 3 years ago
- A Closer Look into Mixture-of-Experts in Large Language Models☆38Updated last month
- [EVA ICLR'23; LARA ICML'22] Efficient attention mechanisms via control variates, random features, and importance sampling☆78Updated last year
- Revisiting Efficient Training Algorithms For Transformer-based Language Models (NeurIPS 2023)☆77Updated last year
- Repository for sparse fine-tuning of LLMs via a modified version of the MosaicML llmfoundry