ROCm / iris
AMD RAD's experimental RMA library for Triton.
☆79Updated this week
Alternatives and similar repositories for iris
Users who are interested in iris are comparing it to the libraries listed below.
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com…☆324Updated 2 weeks ago
- Extensible collectives library in Triton☆88Updated 6 months ago
- ☆90Updated 10 months ago
- GitHub mirror of the triton-lang/triton repo.☆78Updated this week
- ☆121Updated 9 months ago
- TritonParse: A Compiler Tracer, Visualizer, and mini-Reproducer (WIP) for Triton Kernels☆151Updated last week
- High-speed GEMV kernels, up to 2.7x speedup compared to the PyTorch baseline.☆115Updated last year
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆97Updated 3 months ago
- An experimental CPU backend for Triton☆153Updated 4 months ago
- ☆238Updated last year
- Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.☆369Updated this week
- This repository contains companion software for the Colfax Research paper "Categorical Foundations for CuTe Layouts".☆48Updated last week
- Ahead of Time (AOT) Triton Math Library☆76Updated 2 weeks ago
- ☆144Updated 4 months ago
- Fastest kernels written from scratch☆366Updated 2 weeks ago
- MLIR-based partitioning system☆135Updated this week
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!☆97Updated this week
- High-Performance SGEMM on CUDA devices☆103Updated 8 months ago
- A lightweight design for computation-communication overlap.☆177Updated 2 weeks ago
- ☆118Updated 6 months ago
- ☆108Updated last year
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.☆318Updated last week
- QuickReduce is a performant all-reduce library designed for AMD ROCm that supports inline compression.☆33Updated last month
- Fast low-bit matmul kernels in Triton☆373Updated last week
- Evaluating Large Language Models for CUDA Code Generation. ComputeEval is a framework designed to generate and evaluate CUDA code from Lar…☆66Updated 3 months ago
- ☆43Updated 4 months ago
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning☆99Updated 3 weeks ago
- An extension library of WMMA API (Tensor Core API)☆106Updated last year
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).☆265Updated 2 months ago
- ☆240Updated last week