mk1-project / quickreduce
QuickReduce is a performant all-reduce library designed for AMD ROCm that supports inline compression.
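QuickReduce's own API is not shown on this page; as a hypothetical illustration of the all-reduce semantics it accelerates, here is a minimal pure-Python sketch (the simulated `all_reduce_sum` helper is an assumption for illustration, and the in-flight compression QuickReduce performs is omitted):

```python
# Hypothetical sketch of all-reduce semantics (not QuickReduce's actual API):
# every rank contributes a buffer, and every rank receives an identical copy
# of the elementwise reduction (here, a sum) of all contributions.

def all_reduce_sum(rank_buffers):
    """Simulate a sum all-reduce across a list of per-rank buffers."""
    reduced = [sum(vals) for vals in zip(*rank_buffers)]
    # After the collective, every rank holds the same reduced buffer.
    return [list(reduced) for _ in rank_buffers]

# Example: 3 "ranks", each holding a 4-element gradient buffer.
buffers = [[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]
result = all_reduce_sum(buffers)
```

A real library performs this reduction over the network (e.g. ring or tree algorithms on ROCm GPUs); inline compression shrinks the buffers exchanged between ranks to cut bandwidth.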
☆27 · Updated 2 months ago
Alternatives and similar repositories for quickreduce
Users interested in quickreduce are comparing it to the libraries listed below.
- ☆88 · Updated 5 months ago
- Applied AI experiments and examples for PyTorch ☆274 · Updated last week
- ☆208 · Updated 10 months ago
- An experimental CPU backend for Triton ☆119 · Updated this week
- ☆110 · Updated last month
- Fast low-bit matmul kernels in Triton ☆311 · Updated this week
- NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves the … ☆173 · Updated this week
- High-speed GEMV kernels, up to a 2.7x speedup over the PyTorch baseline. ☆109 · Updated 10 months ago
- OpenAI Triton backend for Intel® GPUs ☆189 · Updated this week
- ☆97 · Updated last year
- CUDA Matrix Multiplication Optimization ☆189 · Updated 10 months ago
- ☆80 · Updated 7 months ago
- Collection of kernels written in the Triton language ☆127 · Updated 2 months ago
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate. ☆153 · Updated this week
- MLIR-based partitioning system ☆87 · Updated this week
- Extensible collectives library in Triton ☆87 · Updated 2 months ago
- ☆215 · Updated this week
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆183 · Updated last month
- Shared Middle-Layer for Triton Compilation ☆252 · Updated this week
- PyTorch emulation library for Microscaling (MX)-compatible data formats ☆241 · Updated last week
- Development repository for the Triton language and compiler ☆122 · Updated this week
- ☆36 · Updated this week
- Ahead-of-Time (AOT) Triton Math Library ☆64 · Updated last week
- A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser") ☆331 · Updated this week
- Experimental projects related to TensorRT ☆105 · Updated last week
- Cataloging released Triton kernels. ☆229 · Updated 4 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). ☆251 · Updated 7 months ago
- oneCCL Bindings for PyTorch* ☆97 · Updated last month
- ☆107 · Updated 2 months ago
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆43 · Updated 2 months ago