Libraries-Openly-Fused / FusedKernelLibrary
Implementation of a methodology that allows all sorts of user-defined GPU kernel fusion, for non-CUDA programmers.
☆16Updated this week
Alternatives and similar repositories for FusedKernelLibrary
Users who are interested in FusedKernelLibrary compare it to the repositories listed below.
- LLM training in simple, raw C/CUDA☆104Updated last year
- ☆53Updated this week
- A parallel framework for training deep neural networks☆63Updated 5 months ago
- Test suite for probing the numerical behavior of NVIDIA tensor cores☆40Updated last year
- Evaluating Large Language Models for CUDA Code Generation. ComputeEval is a framework designed to generate and evaluate CUDA code from Lar…☆60Updated 2 months ago
- High-Performance SGEMM on CUDA devices☆97Updated 7 months ago
- SYCL implementation of Fused MLPs for Intel GPUs☆47Updated 2 months ago
- A faster implementation of OpenCV-CUDA that uses OpenCV objects, and more!☆51Updated last month
- A bunch of kernels that might make stuff slower 😉☆58Updated last week
- ☆163Updated last year
- Write a fast kernel and run it on Discord. See how you compare against the best!☆51Updated this week
- ☆41Updated this week
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆44Updated last week
- This repository contains the experimental PyTorch native float8 training UX☆224Updated last year
- A safetensors extension to efficiently store sparse quantized tensors on disk☆153Updated last week
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning☆71Updated last month
- ☆74Updated 5 months ago
- ☆82Updated 2 months ago
- Extensible collectives library in Triton☆88Updated 4 months ago
- ☆88Updated 9 months ago
- ☆232Updated last week
- Fast low-bit matmul kernels in Triton☆353Updated last week
- Effective transpose on Hopper GPU☆23Updated 3 months ago
- Cataloging released Triton kernels.☆252Updated 7 months ago
- Applied AI experiments and examples for PyTorch☆291Updated this week
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆95Updated 2 months ago
- Explore training for quantized models☆22Updated last month
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!☆74Updated this week
- Collection of kernels written in Triton language☆147Updated 4 months ago
- Fast Hadamard transform in CUDA, with a PyTorch interface☆224Updated last year