pprp / Awesome-Efficient-MoELinks

Efficient Mixture of Experts for LLM Paper List

☆143

Alternatives and similar repositories for Awesome-Efficient-MoE

Users that are interested in Awesome-Efficient-MoE are comparing it to the libraries listed below

Sorting:

mdy666 / Qwen-Native-Sparse-Attention
qwen-nsa
☆83Updated last month
smart-lty / ParallelSpeculativeDecoding
[ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length
☆128Updated 2 weeks ago
FFY0 / AdaKV
The Official Implementation of Ada-KV [NeurIPS 2025]
☆110Updated last month
lliai / D2MoE
D^2-MoE: Delta Decompression for MoE-based LLMs Compression
☆69Updated 7 months ago
OpenSparseLLMs / Linear-MoE
☆120Updated 5 months ago
ByteDance-Seed / FlexPrefill
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆152Updated last month
Equationliu / Kangaroo
[NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin…
☆61Updated last year
mit-han-lab / x-attention
[ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring
☆245Updated 4 months ago
microsoft / SeerAttention
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
☆169Updated last month
NVlabs / COAT
[ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training
☆245Updated 3 months ago
mdy666 / mdy_triton
☆148Updated 4 months ago
yaof20 / Flash-RL
Implementation for FP8/INT8 Rollout for RL training without performence drop.
☆269Updated last week
LINs-lab / DynMoE
[ICLR 2025] Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models
☆143Updated 4 months ago
ClubieDong / QAQ-KVCacheQuantization
QAQ: Quality Adaptive Quantization for LLM KV Cache
☆54Updated last year
thunlp / FR-Spec
[ACL 2025 main] FR-Spec: Frequency-Ranked Speculative Sampling
☆47Updated 4 months ago
maomaocun / dLLM-cache
Official PyTorch implementation of the paper "dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching" (dLLM-Cache…
☆176Updated 2 months ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆106Updated 7 months ago
JieShibo / MoLE
[ICML 2025 Oral] Mixture of Lookup Experts
☆54Updated 6 months ago
liangyuwang / Tiny-DeepSpeed
Tiny-DeepSpeed, a minimalistic re-implementation of the DeepSpeed library
☆48Updated 2 months ago
FasterDecoding / SnapKV
☆287Updated 4 months ago
dilab-zju / self-speculative-decoding
Code associated with the paper **Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding**
☆208Updated 9 months ago
dhcode-cpp / NSA-pytorch
DeepSeek Native Sparse Attention pytorch implementation
☆107Updated last week
Gaffey / ExCP
Official implementation of ICML 2024 paper "ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking".
☆48Updated last year
htqin / IR-QLoRA
[ICML 2024 Oral] This project is the official implementation of our Accurate LoRA-Finetuning Quantization of LLMs via Information Retenti…
☆67Updated last year
mutonix / pyramidinfer
☆48Updated 11 months ago
Dominic789654 / LongGenBench
Source code for the paper "LongGenBench: Long-context Generation Benchmark"
☆23Updated last year
ruikangliu / IntactKV
[ACL 2024] Official PyTorch implementation of "IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact"
☆48Updated last year
Hsu1023 / DuQuant
[NeurIPS 2024 Oral🔥] DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs.
☆174Updated last year
ruikangliu / Quantized-Reasoning-Models
[COLM 2025] Official PyTorch implementation of "Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models"
☆57Updated 4 months ago
step-law / steplaw
☆205Updated 2 weeks ago