moritztng / grayskull-attention

Attention in SRAM on Tenstorrent Grayskull

☆37

Alternatives and similar repositories for grayskull-attention

Users who are interested in grayskull-attention are comparing it to the libraries listed below.

pytorch-labs / triton-cpu
An experimental CPU backend for Triton (https://github.com/openai/triton)
☆43Updated 4 months ago
tenstorrent / tt-mlir
Tenstorrent MLIR compiler
☆165Updated this week
salykova / sgemm.cu
High-Performance SGEMM on CUDA devices
☆98Updated 6 months ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆93Updated last month
tenstorrent / tt-forge
Tenstorrent's MLIR Based Compiler. We aim to enable developers to run AI on all configurations of Tenstorrent hardware, through an open-s…
☆96Updated this week
pytorch-labs / tritonparse
TritonParse: A Compiler Tracer, Visualizer, and mini-Reproducer(WIP) for Triton Kernels
☆138Updated this week
triton-lang / triton-cpu
An experimental CPU backend for Triton
☆138Updated 2 months ago
ademeure / cuda-side-boost
☆20Updated 3 months ago
gpu-mode / popcorn-cli
☆33Updated 2 weeks ago
LaurieWired / BenchmarkCustomPTX
Custom PTX Instruction Benchmark
☆126Updated 5 months ago
cupbop / CuPBoP
A framework that supports executing unmodified CUDA source code on non-NVIDIA devices.
☆132Updated 7 months ago
AMDResearch / Riallto
The Riallto Open Source Project from AMD
☆82Updated 3 months ago
seb-v / fp32_sgemm_amd
Super fast FP32 matrix multiplication on RDNA3
☆70Updated 4 months ago
openxla / shardy
MLIR-based partitioning system
☆115Updated this week
bertmaher / simplegemm
☆110Updated 4 months ago
tenstorrent / tt-budabackend
Buda Compiler Backend for Tenstorrent devices
☆29Updated 4 months ago
RadeonFlow / RadeonFlow_Kernels
Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X
☆60Updated this week
triton-lang / kernels
☆85Updated 8 months ago
makslevental / nelli
A lightweight, Pythonic, frontend for MLIR
☆80Updated last year
ROCm / rocMLIR
☆148Updated this week
0xD0GF00D / DocumentSASS
Unofficial description of the CUDA assembly (SASS) instruction sets.
☆124Updated 2 weeks ago
tenstorrent / tt-tvm
TVM for Tenstorrent ASICs
☆24Updated last week
spcl / daceml
A Data-Centric Compiler for Machine Learning
☆84Updated last year
gpu-mode / reference-kernels
Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!
☆69Updated 2 weeks ago
microsoft / cusync
☆27Updated last year
daniel-geon-park / triton_bwd
Automatic differentiation for Triton Kernels
☆11Updated last week
wangsiping97 / FastGEMV
High-speed GEMV kernels, with up to 2.7x speedup compared to the PyTorch baseline.
☆113Updated last year
ROCm / TransformerEngine
☆41Updated this week
north-numerical-computing / tensor-cores-numerical-behavior
Test suite for probing the numerical behavior of NVIDIA tensor cores
☆40Updated last year
ROCm / amd_matrix_instruction_calculator
A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators
☆110Updated 2 months ago