PluralisResearch / AsyncPP
Asynchronous pipeline parallel optimization
☆19 Updated 7 months ago
Alternatives and similar repositories for AsyncPP
Users who are interested in AsyncPP compare it to the libraries listed below.
- ☆44 Updated 9 months ago
- Efficient Long-context Language Model Training by Core Attention Disaggregation ☆55 Updated this week
- AI model training on heterogeneous, geo-distributed resources ☆29 Updated last month
- Expert Specialization MoE Solution based on CUTLASS ☆23 Updated last week
- ring-attention experiments ☆160 Updated last year
- AI-Driven Research Systems (ADRS) ☆103 Updated last week
- An efficient implementation of the NSA (Native Sparse Attention) kernel ☆126 Updated 6 months ago
- ☆39 Updated 4 months ago
- Easy, Fast, and Scalable Multimodal AI ☆81 Updated this week
- Parallel framework for training and fine-tuning deep neural networks ☆71 Updated last month
- DeeperGEMM: crazy optimized version ☆73 Updated 7 months ago
- 🔥 LLM-powered GPU kernel synthesis: Train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation… ☆109 Updated last month
- Fast and memory-efficient exact attention ☆14 Updated this week
- Ship correct and fast LLM kernels to PyTorch ☆126 Updated last week
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆135 Updated last year
- Transformers components but in Triton ☆34 Updated 7 months ago
- The evaluation framework for training-free sparse attention in LLMs ☆106 Updated 2 months ago
- ☆260 Updated 6 months ago
- Triton-based Symmetric Memory operators and examples ☆67 Updated 2 months ago
- Learn CUDA with PyTorch ☆138 Updated this week
- A bunch of kernels that might make stuff slower 😉 ☆69 Updated this week
- [NeurIPS 2025] Scaling Speculative Decoding with Lookahead Reasoning ☆56 Updated last month
- ☆52 Updated 7 months ago
- ☆27 Updated last year
- APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation. A system-level optimization for scalable LLM tra… ☆44 Updated 2 months ago
- Autonomous GPU Kernel Generation via Deep Agents ☆192 Updated last week
- Mixed precision training from scratch with Tensors and CUDA ☆28 Updated last year
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆52 Updated last year
- Automatic differentiation for Triton Kernels ☆29 Updated 4 months ago
- Debug print operator for cudagraph debugging ☆14 Updated last year