microsoft / tokenweave

Efficient Compute-Communication Overlap for Distributed LLM Inference

☆62

Alternatives and similar repositories for tokenweave

Users who are interested in tokenweave are comparing it to the repositories listed below.

DeepLink-org / DLSlime
DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit
☆82 Updated this week
ByteDance-Seed / StragglerAnalysis
☆43 Updated 6 months ago
hao-ai-lab / MuxServe
☆79 Updated last month
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆143 Updated 2 months ago
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆187 Updated last month
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆73 Updated 6 months ago
Azure / msccl-executor-nccl
☆46 Updated 11 months ago
WukLab / preble
Stateful LLM Serving
☆88 Updated 8 months ago
Infrawaves / DeepEP_ibrc_dual-ports_multiQP
Aims to implement dual-port and multi-QP solutions in the DeepEP IBRC transport
☆66 Updated 6 months ago
gty111 / gLLM
gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
☆49 Updated this week
tile-ai / tilescale
Tile-based language built for AI computation across all scales
☆80 Updated last week
flashinfer-ai / cutlass-viz
☆65 Updated 6 months ago
nex-agi / NexVenusCL
Nex Venus Communication Library
☆50 Updated this week
CalvinXKY / mfu_calculation
A simple calculation for LLM MFU.
☆50 Updated 2 months ago
tile-ai / TileOPs
☆57 Updated last week
HPMLL / NVIDIA-Hopper-Benchmark
☆64 Updated 5 months ago
toyaix / triton-runner
Multi-Level Triton Runner supporting Python, IR, PTX, and cubin.
☆76 Updated last week
PKU-SEC-Lab / HybriMoE
[DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference"
☆89 Updated 5 months ago
xxyux / SpInfer
SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs
☆59 Updated 7 months ago
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well-known deployments.
☆85 Updated 2 months ago
xinhao-luo / ClusterFusion
[NeurIPS 2025] ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive
☆49 Updated last month
microsoft / FractalTensor
FractalTensor is a programming framework that introduces a novel approach to organizing data in deep neural networks (DNNs) as a list of …
☆29 Updated 11 months ago
infinigence / Semi-PD
A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation.
☆115 Updated 6 months ago
microsoft / chunk-attention
☆81 Updated 7 months ago
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆39 Updated last year
microsoft / RetrievalAttention
Scalable long-context LLM decoding that leverages sparsity—by treating the KV cache as a vector storage system.
☆98 Updated 2 months ago
Azure / msccl
Microsoft Collective Communication Library
☆66 Updated 11 months ago
shenh10 / DeepSeek_Simulator
☆90 Updated 7 months ago
LeiWang1999 / Stream-k.tvm
☆19 Updated last year
NEO-MLSys25 / NEO
NEO is an LLM inference engine built to ease the GPU memory crisis through CPU offloading
☆69 Updated 5 months ago