UNITES-Lab / HEXA-MoELinks

Official code for the paper "HEXA-MoE: Efficient and Heterogeneous-Aware MoE Acceleration with Zero Computation Redundancy"

☆13

Alternatives and similar repositories for HEXA-MoE

Users that are interested in HEXA-MoE are comparing it to the libraries listed below

Sorting:

zyxxmu / cam
Pytorch implementation of our paper accepted by ICML 2024 -- CaM: Cache Merging for Memory-efficient LLMs Inference
☆47Updated last year
DerrickYLJ / TidalDecode
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
☆48Updated 3 months ago
z-lab / sparselora
[ICML 2025] SparseLoRA: Accelerating LLM Fine-Tuning with Contextual Sparsity
☆61Updated 4 months ago
ruikangliu / Quantized-Reasoning-Models
[COLM 2025] Official PyTorch implementation of "Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models"
☆57Updated 4 months ago
Aaronhuang-778 / Mixture-Compressor-MoE
[ICLR 2025] Mixture Compressor for Mixture-of-Experts LLMs Gains More
☆61Updated 8 months ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆106Updated 7 months ago
Qualcomm-AI-research / gptvq
☆36Updated last year
tsinghua-ideal / Twilight
[NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning
☆63Updated last week
Aaronhuang-778 / SliM-LLM
[ICML 2025] SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models
☆45Updated last year
zhangsichengsjtu / AFPQ
AFPQ code implementation
☆23Updated 2 years ago
SqueezeAILab / SqueezedAttention
[ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference
☆54Updated 11 months ago
NonvolatileMemory / flash_tree_attn
☆18Updated 10 months ago
zhuzilin / flash-attention-with-sink
☆39Updated 3 months ago
ThisisBillhe / torch_quantizer
torch_quantizer is a out-of-box quantization tool for PyTorch models on CUDA backend, specially optimized for Diffusion Models.
☆22Updated last year
machilusZ / FastGen
This repo contains the source code for: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
☆41Updated last year
pzs19 / TokenSelect
☆15Updated 7 months ago
HuangOwen / RoLoRA
[EMNLP 2024] RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization
☆38Updated last year
hahnyuan / ASVD4LLM
Activation-aware Singular Value Decomposition for Compressing Large Language Models
☆80Updated last year
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆59Updated 7 months ago
UNITES-Lab / C2R-MoE
[NAACL'25 🏆 SAC Award] Official code for "Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert…
☆13Updated 9 months ago
cat538 / SKVQ
[COLM 2024] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models
☆24Updated last year
SNU-ARC / any-precision-llm
[ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs
☆121Updated 4 months ago
PingchengDong / GQA-LUT
The official implementation of the DAC 2024 paper GQA-LUT
☆20Updated 10 months ago
aiha-lab / MX-QLLM
LLM Inference with Microscaling Format
☆32Updated 11 months ago
ruipeterpan / specreason
PoC for "SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning" [NeurIPS '25]
☆57Updated last month
thu-nics / qllm-eval
Code Repository of Evaluating Quantized Large Language Models
☆135Updated last year
TUDa-HWAI / Basis_Sharing
☆17Updated last year
imagination-research / LCSC
[ICLR 2025] Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better
☆16Updated 8 months ago
Raincleared-Song / sparse_gpu_operator
GPU operators for sparse tensor operations
☆35Updated last year
microsoft / SeerAttention
SeerAttention: Learning Intrinsic Sparse Attention in Your LLMs
☆168Updated last month