NVIDIA / cuda-tile
CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-based computation patterns and optimizations targeting NVIDIA tensor core units.
☆195Updated this week
Alternatives and similar repositories for cuda-tile
Users who are interested in cuda-tile are comparing it to the libraries listed below.
- TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels☆178Updated this week
- Helpful kernel tutorials and examples for tile-based GPU programming☆456Updated this week
- Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.☆427Updated last week
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com…☆415Updated last month
- An experimental CPU backend for Triton☆167Updated last month
- AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming☆133Updated last week
- MLIR-based partitioning system☆151Updated this week
- Open ABI and FFI for Machine Learning Systems☆256Updated this week
- This repository contains companion software for the Colfax Research paper "Categorical Foundations for CuTe Layouts".☆83Updated 2 months ago
- GitHub mirror of the triton-lang/triton repo.☆105Updated last week
- Shared Middle-Layer for Triton Compilation☆318Updated 2 weeks ago
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.☆691Updated this week
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆47Updated 4 months ago
- Unofficial description of the CUDA assembly (SASS) instruction sets.☆182Updated 5 months ago
- torchcomms: a modern PyTorch communications API☆309Updated this week
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!☆174Updated 2 weeks ago
- AI Tensor Engine for ROCm☆322Updated this week
- Fastest kernels written from scratch☆499Updated 3 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆105Updated 5 months ago
- ☆81Updated 2 weeks ago
- Fast and Furious AMD Kernels☆321Updated this week
- ☆99Updated last year
- Extensible collectives library in Triton☆91Updated 8 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.☆301Updated this week
- ☆52Updated 7 months ago
- The missing pieces (as far as boilerplate reduction goes) of the upstream MLIR python bindings.☆116Updated last month
- kernels, of the mega variety☆631Updated 2 months ago
- IREE's PyTorch Frontend, based on Torch Dynamo.☆102Updated last week
- Perplexity open source garden for inference technology☆306Updated 2 weeks ago
- ☆127Updated 2 months ago