AMD-AGI / Primus-Turbo
☆38Updated last week
Alternatives and similar repositories for Primus-Turbo
Users who are interested in Primus-Turbo are comparing it to the libraries listed below.
- ☆151Updated 11 months ago
- ☆253Updated last year
- A lightweight design for computation-communication overlap.☆196Updated 2 months ago
- ☆163Updated 7 months ago
- ☆155Updated last month
- Examples of CUDA implementations by Cutlass CuTe☆260Updated 5 months ago
- NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer☆146Updated 3 months ago
- ☆112Updated 7 months ago
- An Easy-to-understand TensorOp Matmul Tutorial☆395Updated 2 months ago
- GitHub mirror of the triton-lang/triton repo.☆105Updated this week
- [HPCA 2026] A GPU-optimized system for efficient long-context LLM decoding with low-bit KV cache.☆71Updated this week
- ☆103Updated last year
- Building the Virtuous Cycle for AI-driven LLM Systems☆98Updated this week
- AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming☆133Updated this week
- ☆329Updated last month
- nnScaler: Compiling DNN models for Parallel Training☆120Updated 2 months ago
- Artifact from "Hardware Compute Partitioning on NVIDIA GPUs". THIS IS A FORK OF BAKITAS REPO. I AM NOT ONE OF THE AUTHORS OF THE PAPER.☆46Updated 3 weeks ago
- QuickReduce is a performant all-reduce library designed for AMD ROCm that supports inline compression.☆36Updated 3 months ago
- High performance Transformer implementation in C++.☆146Updated 11 months ago
- Dynamic Memory Management for Serving LLMs without PagedAttention☆448Updated 6 months ago
- ☆90Updated 8 months ago
- MSCCL++: A GPU-driven communication stack for scalable AI applications☆444Updated this week
- CUTLASS and CuTe Examples☆112Updated 2 weeks ago
- Accelerating MoE with IO and Tile-aware Optimizations☆351Updated this week
- flash attention tutorial written in python, triton, cuda, cutlass☆459Updated 7 months ago
- Tile-based language built for AI computation across all scales☆98Updated this week
- Utility scripts for PyTorch (e.g. Make Perfetto show some disappearing kernels, Memory profiler that understands more low-level allocatio…☆72Updated 3 months ago
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.☆137Updated 7 months ago
- Allow torch tensor memory to be released and resumed later☆187Updated 2 weeks ago
- Chimera: bidirectional pipeline parallelism for efficiently training large-scale models.☆69Updated 9 months ago