pprp / Awesome-Efficient-MoE
Efficient Mixture of Experts for LLM Paper List
☆51 · Updated 3 months ago
Alternatives and similar repositories for Awesome-Efficient-MoE:
Users who are interested in Awesome-Efficient-MoE are comparing it to the libraries listed below.
- The Official Implementation of Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference ☆68 · Updated 2 months ago
- [ICML'24] The official implementation of "Rethinking Optimization and Architecture for Tiny Language Models" ☆121 · Updated 2 months ago
- Official implementation of ICML 2024 paper "ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking" ☆47 · Updated 8 months ago
- Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024) ☆52 · Updated last week
- [ICLR 2024] Official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models" ☆35 · Updated last year
- Source code for the paper "LongGenBench: Long-context Generation Benchmark" ☆16 · Updated 5 months ago
- This repo contains the source code for "Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs" ☆35 · Updated 7 months ago
- [ICLR 2024 Spotlight] Code for the paper "Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy" ☆77 · Updated 9 months ago
- [AAAI 2024] Fluctuation-based Adaptive Structured Pruning for Large Language Models ☆45 · Updated last year
- qwen-nsa ☆44 · Updated 2 weeks ago
- [ICLR 2025 Oral] Code for the paper "FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference" ☆66 · Updated last week
- Squeezed Attention: Accelerating Long Prompt LLM Inference ☆45 · Updated 4 months ago
- LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification ☆42 · Updated last month
- [ICLR 2025] Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models ☆82 · Updated last month
- [ATC'23] SmartMoE: a MoE implementation for PyTorch ☆61 · Updated last year
- [ICLR 2025] MiniPLM: Knowledge Distillation for Pre-Training Language Models ☆34 · Updated 4 months ago
- More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression ☆11 · Updated 2 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆68 · Updated 9 months ago
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models ☆130 · Updated 9 months ago
- Implementation of the paper "CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference" ☆17 · Updated 3 weeks ago
- Reproducing R1 for Code with Reliable Rewards ☆140 · Updated last month
- TokenSkip: Controllable Chain-of-Thought Compression in LLMs ☆103 · Updated 3 weeks ago
- Official PyTorch implementation of "IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact" ☆43 · Updated 10 months ago
- An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆34 · Updated 9 months ago
- Activation-aware Singular Value Decomposition for Compressing Large Language Models ☆60 · Updated 5 months ago
- XAttention: Block Sparse Attention with Antidiagonal Scoring ☆118 · Updated this week
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length ☆67 · Updated 3 weeks ago