PKU-SEC-Lab / HybriMoELinks

[DAC'25] Official implement of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference"

☆93

Alternatives and similar repositories for HybriMoE

Users that are interested in HybriMoE are comparing it to the libraries listed below

Sorting:

DD-DuDa / BitDecoding
[HPCA 2026] A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.
☆66Updated 2 weeks ago
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆59Updated 8 months ago
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆97Updated 11 months ago
DeepLink-org / DLSlime
DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit
☆82Updated this week
hao-ai-lab / MuxServe
☆79Updated last month
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆190Updated last month
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆146Updated 2 months ago
INT-FlashAttention2024 / INT-FlashAttention
☆83Updated 10 months ago
flashinfer-ai / cutlass-viz
☆65Updated 7 months ago
d-matrix-ai / keyformer-llm
☆58Updated last year
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆142Updated 10 months ago
AlibabaPAI / FLASHNN
☆102Updated last year
pku-liang / ArkVale
ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction (NIPS'24)
☆47Updated 11 months ago
tile-ai / tilescale
Tile-based language built for AI computation across all scales
☆82Updated this week
microsoft / chunk-attention
☆82Updated 7 months ago
infinigence / SpecEE
Repo for SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting (ISCA25)
☆67Updated 7 months ago
infinigence / Semi-PD
A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation.
☆118Updated 6 months ago
EfficientMoE / MoE-Infinity
PyTorch library for cost-effective, fast and easy serving of MoE models.
☆262Updated last month
madsys-dev / deepseekv2-profile
☆153Updated 9 months ago
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
☆45Updated 5 months ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆134Updated 6 months ago
WukLab / preble
Stateful LLM Serving
☆89Updated 8 months ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆108Updated 8 months ago
ByteDance-Seed / cudaLLM
☆125Updated 3 months ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆73Updated 7 months ago
microsoft / SparTA
☆161Updated last year
tile-ai / AttentionEngine
☆51Updated 6 months ago
flexflow / flexflow-serve
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
☆64Updated 2 months ago
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆227Updated 2 years ago
YJHMITWEB / ExFlow
Explore Inter-layer Expert Affinity in MoE Model Inference
☆16Updated last year