mk1-project / quickreduce

QuickReduce is a performant all-reduce library designed for AMD ROCm that supports inline compression.

☆28

Alternatives and similar repositories for quickreduce

Users who are interested in quickreduce are comparing it to the libraries listed below.

yifuwang / symm-mem-recipes
☆90 Updated 6 months ago
mobiusml / gemlite
Fast low-bit matmul kernels in Triton
☆323 Updated last week
pytorch-labs / applied-ai
Applied AI experiments and examples for PyTorch
☆277 Updated last month
ColfaxResearch / cutlass-kernels
☆212 Updated 11 months ago
ColfaxResearch / cfx-article-src
☆117 Updated last month
cchan / tccl
Extensible collectives library in Triton
☆86 Updated 2 months ago
MekkCyber / CutlassAcademy
A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS
☆189 Updated last month
triton-lang / kernels
☆81 Updated 7 months ago
NVIDIA / TensorRT-Incubator
Experimental projects related to TensorRT
☆105 Updated last week
wangsiping97 / FastGEMV
High-speed GEMV kernels, with up to 2.7x speedup compared to the PyTorch baseline.
☆109 Updated 11 months ago
NVIDIA / nvidia-resiliency-ext
NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the …
☆179 Updated 3 weeks ago
triton-lang / triton-cpu
An experimental CPU backend for Triton
☆127 Updated 3 weeks ago
sunlex0717 / DissectingTensorCores
☆98 Updated last year
pranjalssh / fast.cu
Fastest kernels written from scratch
☆284 Updated 2 months ago
pytorch-labs / helion
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
☆170 Updated this week
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆143 Updated last week
Deep-Learning-Profiling-Tools / triton-viz
☆221 Updated this week
microsoft / microxcaling
PyTorch emulation library for Microscaling (MX)-compatible data formats
☆251 Updated last week
pytorch-labs / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆167 Updated this week
NVIDIA / online-softmax
Benchmark code for the "Online normalizer calculation for softmax" paper
☆94 Updated 6 years ago
ppl-ai / pplx-kernels
Perplexity GPU Kernels
☆375 Updated 2 weeks ago
gpu-mode / triton-index
Cataloging released Triton kernels.
☆238 Updated 5 months ago
ROCm / aotriton
Ahead of Time (AOT) Triton Math Library
☆67 Updated last week
bertmaher / simplegemm
☆109 Updated 3 months ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆82 Updated last month
usyd-fsalab / fp6_llm
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
☆252 Updated 8 months ago
microsoft / triton-shared
Shared Middle-Layer for Triton Compilation
☆256 Updated this week
HazyResearch / Megakernels
kernels, of the mega variety
☆406 Updated 3 weeks ago
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆183 Updated 5 months ago
AlibabaPAI / FLASHNN
☆97 Updated 9 months ago