PluralisResearch / AsyncPP
Asynchronous pipeline parallel optimization
☆18 · Updated 5 months ago
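For context, the idea behind asynchronous pipeline parallelism is to keep every pipeline stage busy by letting forward passes of later microbatches proceed before earlier microbatches' gradient updates have landed, at the cost of bounded gradient staleness. The sketch below illustrates that scheduling idea only; it is not AsyncPP's implementation, and the `Stage` model, the toy error signal, and the `max_staleness` bound are all illustrative assumptions.

```python
# Minimal conceptual sketch of asynchronous pipeline-parallel training.
# NOTE: this is NOT AsyncPP's code. Stage, max_staleness, and the toy
# "gradient" below are illustrative assumptions, not the repo's API.
from collections import deque


class Stage:
    """One pipeline stage, reduced to a single scalar weight."""

    def __init__(self, weight: float = 0.9, lr: float = 0.05):
        self.weight = weight
        self.lr = lr

    def forward(self, x: float) -> float:
        return self.weight * x

    def apply_gradient(self, grad: float) -> None:
        # Asynchrony shows up here: grad may have been computed against
        # an older weight version (bounded staleness).
        self.weight -= self.lr * grad


def async_pipeline(stages, microbatches, max_staleness: int = 2):
    in_flight = deque()  # (microbatch index, pipeline output)
    for i, x in enumerate(microbatches):
        # Forward passes keep flowing; the pipeline is never drained to
        # wait for earlier microbatches' updates.
        act = x
        for s in stages:
            act = s.forward(act)
        in_flight.append((i, act))
        # Once the staleness bound is hit, consume the oldest in-flight
        # microbatch and update with its (now stale) gradient signal.
        if len(in_flight) > max_staleness:
            _apply_update(stages, microbatches, in_flight.popleft())
    while in_flight:  # drain remaining microbatches at the end
        _apply_update(stages, microbatches, in_flight.popleft())
    return [s.weight for s in stages]


def _apply_update(stages, microbatches, item):
    j, out = item
    grad = out - microbatches[j]  # toy error signal (fit the identity map)
    for s in stages:
        s.apply_gradient(grad / len(stages))


if __name__ == "__main__":
    print(async_pipeline([Stage(), Stage()], [1.0, 2.0, 3.0, 4.0]))
```

Raising `max_staleness` in this toy model increases pipeline utilization but widens the version gap between the weights a gradient was computed against and the weights it updates, which is the central trade-off asynchronous schemes must manage.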
Alternatives and similar repositories for AsyncPP
Users interested in AsyncPP are comparing it to the libraries listed below.
- ☆41 · Updated 8 months ago
- ☆39 · Updated 3 months ago
- APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation. A system-level optimization for scalable LLM tra… ☆43 · Updated last month
- 🔥 LLM-powered GPU kernel synthesis: train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation… ☆98 · Updated last week
- Triton-based Symmetric Memory operators and examples ☆62 · Updated last month
- A suite for parallel inference of Diffusion Transformers (DiTs) on multi-GPU clusters ☆51 · Updated last year
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆131 · Updated 11 months ago
- Debug print operator for CUDA graph debugging ☆14 · Updated last year
- Student version of Assignment 2 for Stanford CS336 - Language Modeling From Scratch ☆111 · Updated 3 months ago
- Ring-attention experiments ☆155 · Updated last year
- 16-fold memory-access reduction with nearly no loss ☆106 · Updated 7 months ago
- An auxiliary analysis of the characteristics of KV in DiT attention. ☆32 · Updated 11 months ago
- DeeperGEMM: a heavily optimized version ☆73 · Updated 6 months ago
- Fast and memory-efficient exact attention ☆14 · Updated this week
- ☆13 · Updated 2 weeks ago
- ☆50 · Updated 5 months ago
- AI model training on heterogeneous, geo-distributed resources ☆19 · Updated last month
- Learn CUDA with PyTorch ☆104 · Updated last week
- Mixed-precision training from scratch with Tensors and CUDA ☆28 · Updated last year
- An efficient implementation of the NSA (Native Sparse Attention) kernel ☆124 · Updated 4 months ago
- ☆36 · Updated last week
- ☆106 · Updated 5 months ago
- Automatic differentiation for Triton kernels ☆30 · Updated 3 months ago
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference. ☆45 · Updated 5 months ago
- Hydragen: High-Throughput LLM Inference with Shared Prefixes ☆44 · Updated last year
- An experimental communicating attention kernel based on DeepEP. ☆34 · Updated 3 months ago
- Transformer components, but in Triton ☆34 · Updated 6 months ago
- ☆65 · Updated 6 months ago
- ☆253 · Updated 5 months ago
- ☆48 · Updated last year