Libraries-Openly-Fused / FusedKernelLibrary

Implementation of a methodology that allows all sorts of user-defined GPU kernel fusion, for non-CUDA programmers.

☆28

Alternatives and similar repositories for FusedKernelLibrary

Users that are interested in FusedKernelLibrary are comparing it to the libraries listed below

microsoft / AttentionEngine
☆109Updated 6 months ago
sandyresearch / chipmunk
🎬 3.7× faster video generation E2E 🖼️ 1.6× faster image generation E2E ⚡ ColumnSparseAttn 9.3× vs FlashAttn‑3 💨 ColumnSparseGEMM 2.5× …
☆90Updated 2 months ago
tile-ai / AttentionEngine
☆50Updated 6 months ago
RadeonFlow / RadeonFlow_Kernels
Efficient implementation of DeepSeek Ops (Blockwise FP8 GEMM, MoE, and MLA) for AMD Instinct MI300X
☆71Updated 2 weeks ago
cherichy / tilecute
☆31Updated 4 months ago
feifeibear / ChituAttention
Quantized Attention on GPU
☆44Updated last year
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆85Updated 2 months ago
NVIDIA / compute-eval
Evaluating Large Language Models for CUDA Code Generation ComputeEval is a framework designed to generate and evaluate CUDA code from Lar…
☆74Updated last month
meta-pytorch / tritonparse
TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels
☆171Updated last week
meta-pytorch / BackendBench
How to ensure correctness and ship LLM generated kernels in PyTorch
☆121Updated last week
meta-pytorch / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆286Updated this week
flashinfer-ai / cubloaty
a size profiler for cuda binary
☆52Updated last month
li-plus / flash-preference
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆51Updated 4 months ago
open-lm-engine / accelerated-model-architectures
A bunch of kernels that might make stuff slower 😉
☆64Updated this week
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆134Updated last week
fw-ai / llama-cuda-graph-example
Example of applying CUDA graphs to LLaMA-v2
☆12Updated 2 years ago
PiotrNawrot / sparse-frontier
The evaluation framework for training-free sparse attention in LLMs
☆103Updated last month
IST-DASLab / Quartet
☆107Updated this week
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆73Updated 6 months ago
axonn-ai / axonn
Parallel framework for training and fine-tuning deep neural networks
☆68Updated 2 weeks ago
meta-pytorch / triton-cpu
An experimental CPU backend for Triton (https://github.com/openai/triton)
☆47Updated 3 months ago
Dao-AILab / grouped-latent-attention
☆130Updated 5 months ago
Libraries-Openly-Fused / cvGPUSpeedup
A faster implementation of OpenCV-CUDA that uses OpenCV objects, and more!
☆54Updated last week
vllm-project / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆210Updated this week
huggingface / hf-rocm-kernels
☆22Updated 4 months ago
xdit-project / DiTCacheAnalysis
An auxiliary project analysis of the characteristics of KV in DiT Attention.
☆32Updated 11 months ago
ByteDance-Seed / cudaLLM
☆121Updated 3 months ago
cchan / tccl
extensible collectives library in triton
☆91Updated 7 months ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆100Updated 4 months ago
deepspeedai / DeepSpeed-Kernels
☆71Updated 7 months ago