UNITES-Lab / C2R-MoE

[NAACL'25 🏆 SAC Award] Official code for "Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert Parallelism Design"

☆10

Alternatives and similar repositories for C2R-MoE

Users who are interested in C2R-MoE are comparing it to the repositories listed below.

pprp / Pruner-Zero
[ICML24] Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs
☆94 Updated 10 months ago
JarvisPei / CMoE
Implementation for the paper: CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference
☆25 Updated 7 months ago
SqueezeAILab / SqueezedAttention
[ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference
☆54 Updated 10 months ago
Linking-ai / SCOPE
(ACL 2025 oral) SCOPE: Optimizing KV Cache Compression in Long-context Generation
☆33 Updated 4 months ago
dongwonjo / FastKV
Official Implementation of FastKV: KV Cache Compression for Fast Long-Context Processing with Token-Selective Propagation
☆24 Updated 5 months ago
Doraemonzzz / xmixers
Xmixers: A collection of SOTA efficient token/channel mixers
☆29 Updated last month
TianjinYellow / StableSPAM
☆25 Updated 6 months ago
li-plus / flash-preference
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆46 Updated 3 months ago
GATECH-EIC / Linearized-LLM
[ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models
☆35 Updated last year
AkideLiu / MiniCache
☆10 Updated last year
imagination-research / LCSC
[ICLR 2025] Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better
☆16 Updated 8 months ago
ArminAzizi98 / LaMDA
☆14 Updated 11 months ago
sail-sg / SimLayerKV
The official implementation of paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction.
☆49 Updated last year
z-lab / sparselora
[ICML 2025] SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity
☆58 Updated 3 months ago
ziplab / QLLM
[ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod…
☆30 Updated last year
Lucky-Lance / SPP
[ICML 2024] SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models
☆21 Updated last year
Anonymous1252022 / Megatron-DeepSpeed
☆14 Updated last year
lliai / D2MoE
D^2-MoE: Delta Decompression for MoE-based LLMs Compression
☆68 Updated 6 months ago
sail-sg / LongSpec
LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification
☆64 Updated 3 months ago
OpenSparseLLMs / Linearization
☆61 Updated 3 months ago
metacarbon / shareAtt
Beyond KV Caching: Shared Attention for Efficient LLMs
☆19 Updated last year
ModelTC / QLLM
[ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod…
☆39 Updated last year
shoaibahmed / llm_depth_pruning
Official implementation of the paper: "A deeper look at depth pruning of LLMs"
☆15 Updated last year
Infini-AI-Lab / Multiverse
☆96 Updated last month
zyxxmu / cam
Pytorch implementation of our paper accepted by ICML 2024 -- CaM: Cache Merging for Memory-efficient LLMs Inference
☆47 Updated last year
NonvolatileMemory / flash_tree_attn
☆18 Updated 9 months ago
feifeibear / ChituAttention
Quantized Attention on GPU
☆44 Updated 10 months ago
thunlp / SparsingLaw
The open-source materials for paper "Sparsing Law: Towards Large Language Models with Greater Activation Sparsity".
☆26 Updated 11 months ago
Mind4Compiler / Compiler-R1
Compiler-R1: Towards Agentic Compiler Auto-tuning with Reinforcement Learning
☆17 Updated 3 months ago
ilur98 / DGQ
Official Code For Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM
☆14 Updated last year