PluralisResearch / AsyncPP
Asynchronous pipeline parallel optimization
☆18 · Updated 5 months ago
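For context, the idea behind asynchronous pipeline parallelism is to keep every pipeline stage busy by letting forward passes of later microbatches proceed before earlier microbatches' gradient updates have landed, at the cost of bounded gradient staleness. The sketch below illustrates that scheduling idea only; it is not AsyncPP's implementation, and the `Stage` model, the toy error signal, and the `max_staleness` bound are all illustrative assumptions.

```python
# Minimal conceptual sketch of asynchronous pipeline-parallel training.
# NOTE: this is NOT AsyncPP's code. Stage, max_staleness, and the toy
# "gradient" below are illustrative assumptions, not the repo's API.
from collections import deque


class Stage:
    """One pipeline stage, reduced to a single scalar weight."""

    def __init__(self, weight: float = 0.9, lr: float = 0.05):
        self.weight = weight
        self.lr = lr

    def forward(self, x: float) -> float:
        return self.weight * x

    def apply_gradient(self, grad: float) -> None:
        # Asynchrony shows up here: grad may have been computed against
        # an older weight version (bounded staleness).
        self.weight -= self.lr * grad


def async_pipeline(stages, microbatches, max_staleness: int = 2):
    in_flight = deque()  # (microbatch index, pipeline output)
    for i, x in enumerate(microbatches):
        # Forward passes keep flowing; the pipeline is never drained to
        # wait for earlier microbatches' updates.
        act = x
        for s in stages:
            act = s.forward(act)
        in_flight.append((i, act))
        # Once the staleness bound is hit, consume the oldest in-flight
        # microbatch and update with its (now stale) gradient signal.
        if len(in_flight) > max_staleness:
            _apply_update(stages, microbatches, in_flight.popleft())
    while in_flight:  # drain remaining microbatches at the end
        _apply_update(stages, microbatches, in_flight.popleft())
    return [s.weight for s in stages]


def _apply_update(stages, microbatches, item):
    j, out = item
    grad = out - microbatches[j]  # toy error signal (fit the identity map)
    for s in stages:
        s.apply_gradient(grad / len(stages))


if __name__ == "__main__":
    print(async_pipeline([Stage(), Stage()], [1.0, 2.0, 3.0, 4.0]))
```

Raising `max_staleness` in this toy model increases pipeline utilization but widens the version gap between the weights a gradient was computed against and the weights it updates, which is the central trade-off asynchronous schemes must manage.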
Alternatives and similar repositories for AsyncPP
Users interested in AsyncPP are comparing it to the libraries listed below.
- ☆41 · Updated 8 months ago
- ☆39 · Updated 3 months ago
- APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation. A system-level optimization for scalable LLM tra… ☆43 · Updated last month
- 🔥 LLM-powered GPU kernel synthesis: train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation… ☆98 · Updated last week
- Triton-based Symmetric Memory operators and examples ☆62 · Updated last month
- A suite for parallel inference of Diffusion Transformers (DiTs) on multi-GPU clusters ☆51 · Updated last year
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆131 · Updated 11 months ago
- Debug print operator for CUDA graph debugging ☆14 · Updated last year
- Student version of Assignment 2 for Stanford CS336 - Language Modeling From Scratch ☆111 · Updated 3 months ago
- Ring-attention experiments ☆155 · Updated last year
- 16-fold memory-access reduction with nearly no loss ☆106 · Updated 7 months ago
- An auxiliary analysis of the characteristics of KV in DiT attention. ☆32 · Updated 11 months ago
- DeeperGEMM: a heavily optimized version ☆73 · Updated 6 months ago
- Fast and memory-efficient exact attention ☆14 · Updated this week
- ☆13 · Updated 2 weeks ago
- ☆50 · Updated 5 months ago
- AI model training on heterogeneous, geo-distributed resources ☆19 · Updated last month
- Learn CUDA with PyTorch ☆104 · Updated last week
- Mixed-precision training from scratch with Tensors and CUDA ☆28 · Updated last year
- An efficient implementation of the NSA (Native Sparse Attention) kernel ☆124 · Updated 4 months ago
- ☆36 · Updated last week
- ☆106 · Updated 5 months ago
- Automatic differentiation for Triton kernels ☆30 · Updated 3 months ago
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference. ☆45 · Updated 5 months ago
- Hydragen: High-Throughput LLM Inference with Shared Prefixes ☆44 · Updated last year
- An experimental communicating attention kernel based on DeepEP. ☆34 · Updated 3 months ago
- Transformer components, but in Triton ☆34 · Updated 6 months ago
- ☆65 · Updated 6 months ago
- ☆253 · Updated 5 months ago
- ☆48 · Updated last year