moritztng / grayskull-attention
Attention in SRAM on Tenstorrent Grayskull
☆38Updated last year
Alternatives and similar repositories for grayskull-attention
Users who are interested in grayskull-attention are comparing it to the libraries listed below.
- High-Performance SGEMM on CUDA devices☆101Updated 7 months ago
- Tenstorrent's MLIR Based Compiler. We aim to enable developers to run AI on all configurations of Tenstorrent hardware, through an open-s…☆104Updated this week
- Tenstorrent MLIR compiler☆183Updated this week
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆45Updated 3 weeks ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆97Updated 2 months ago
- ☆42Updated 4 months ago
- Custom PTX Instruction Benchmark☆126Updated 6 months ago
- An experimental CPU backend for Triton☆148Updated 3 months ago
- General Matrix Multiplication using NVIDIA Tensor Cores☆21Updated 7 months ago
- ☆88Updated 10 months ago
- ☆50Updated 8 months ago
- The TT-Forge FE is a graph compiler designed to optimize and transform computational graphs for deep learning models, enhancing their per…☆51Updated this week
- TritonParse: A Compiler Tracer, Visualizer, and mini-Reproducer Generator (WIP) for Triton Kernels☆150Updated this week
- ☆60Updated last week
- High-speed GEMV kernels, up to 2.7x speedup compared to the PyTorch baseline.☆114Updated last year
- A framework that supports executing unmodified CUDA source code on non-NVIDIA devices.☆135Updated 8 months ago
- ☆117Updated 5 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best!☆55Updated this week
- ☆39Updated last month
- Unofficial description of the CUDA assembly (SASS) instruction sets.☆142Updated last month
- ☆234Updated this week
- Buda Compiler Backend for Tenstorrent devices☆30Updated 5 months ago
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!☆85Updated this week
- Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.☆342Updated this week
- AMD RAD's experimental RMA library for Triton.☆30Updated this week
- ☆43Updated this week
- Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X☆69Updated last month
- TVM for Tenstorrent ASICs☆26Updated last week
- Framework to reduce autotune overhead to zero for well known deployments.☆81Updated last week
- Super fast FP32 matrix multiplication on RDNA3☆73Updated 5 months ago