thu-ml / ReMoE
[ICLR2025] Codebase for "ReMoE: Fully Differentiable Mixture-of-Experts with ReLU Routing", built on Megatron-LM.
☆67 · Updated 4 months ago
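ReLU routing replaces the softmax top-k router of a standard MoE with per-expert ReLU gates: experts whose gate is exactly zero are skipped, while every active expert's gate is a differentiable function of the router weights, so no discrete selection step breaks the gradient. Below is a minimal PyTorch sketch of that idea, not the repo's actual Megatron-LM code; the class name `ReluRouter`, the dimensions, and the fixed L1 penalty (the paper tunes its coefficient adaptively toward a target sparsity) are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ReluRouter(nn.Module):
    """Sketch of ReLU routing: gates are ReLU(W x) instead of softmax top-k.

    ReLU produces exact zeros, so routing stays sparse, yet each active
    expert's gate remains differentiable with respect to the input.
    """

    def __init__(self, hidden_size: int, num_experts: int):
        super().__init__()
        # Hypothetical single linear gate projection; the repo's real
        # Megatron-LM implementation is organized differently.
        self.gate_proj = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: [num_tokens, hidden_size] -> gates: [num_tokens, num_experts]
        gates = torch.relu(self.gate_proj(x))
        # The paper controls sparsity with an L1-style regularizer on the
        # gates, adapting its coefficient toward a target number of active
        # experts; a fixed penalty term stands in for that here.
        l1_penalty = gates.sum(dim=-1).mean()
        return gates, l1_penalty


# Toy usage: only experts with nonzero gates need to run; each expert's
# output would be weighted by its gate value.
router = ReluRouter(hidden_size=512, num_experts=8)
tokens = torch.randn(4, 512)
gates, l1 = router(tokens)
active_fraction = (gates > 0).float().mean()  # fraction of active token-expert pairs
```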
Alternatives and similar repositories for ReMoE:
Users interested in ReMoE are comparing it to the repositories listed below
- 🔥 A minimal training framework for scaling FLA models ☆101 · Updated last week
- Efficient Triton implementation of Native Sparse Attention ☆136 · Updated last week
- Activation-aware Singular Value Decomposition for Compressing Large Language Models ☆62 · Updated 5 months ago
- 16-fold memory access reduction with nearly no loss ☆89 · Updated 3 weeks ago
- "Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding" Zhenyu Zhang, Runjin Chen, Shiw…☆29Updated 11 months ago
- The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression" ☆123 · Updated 4 months ago
- The official implementation of the paper "SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction" ☆45 · Updated 6 months ago
- [ICLR 2024 Spotlight] Code for the paper "Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy" ☆79 · Updated 10 months ago
- Code for "Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes"☆27Updated last year
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆159 · Updated 9 months ago
- XAttention: Block Sparse Attention with Antidiagonal Scoring ☆140 · Updated 3 weeks ago
- The official implementation of the paper "Towards Efficient Mixture of Experts: A Holistic Study of Compression Techniques (TMLR)".☆66Updated last month
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training ☆182 · Updated last week
- [ICML 2024 Oral] The official implementation of the paper "Accurate LoRA-Finetuning Quantization of LLMs via Information Retention" ☆64 · Updated last year
- [ICLR 2025 Oral] Code for the paper "FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference" ☆85 · Updated this week
- Unofficial implementation of the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆157 · Updated 10 months ago
- An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆35 · Updated 10 months ago
- [ICML'24] The official implementation of "Rethinking Optimization and Architecture for Tiny Language Models" ☆121 · Updated 3 months ago
- [EMNLP 2024] Quantize LLMs to extremely low bit-widths and finetune the quantized models ☆12 · Updated 9 months ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆101 · Updated last month