vedantroy / gpu_kernels

☆27

Alternatives and similar repositories for gpu_kernels

Users that are interested in gpu_kernels are comparing it to the libraries listed below

IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆85 Updated last year
SqueezeBits / QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆118 Updated last year
meta-pytorch / KernelAgent
Autonomous GPU Kernel Generation via Deep Agents
☆137 Updated this week
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆85 Updated 2 months ago
microsoft / AttentionEngine
☆109 Updated 6 months ago
meta-pytorch / kraken
Triton-based Symmetric Memory operators and examples
☆63 Updated last month
stanford-futuredata / stk
☆113 Updated last year
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆130 Updated 5 months ago
ankan-ban / llama_cu_awq
llama INT4 cuda inference with AWQ
☆55 Updated 10 months ago
tile-ai / AttentionEngine
☆50 Updated 6 months ago
deepspeedai / DeepSpeed-Kernels
☆71 Updated 7 months ago
cchan / tccl
extensible collectives library in triton
☆91 Updated 7 months ago
triton-lang / kernels
☆93 Updated last year
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆73 Updated 6 months ago
INT-FlashAttention2024 / INT-FlashAttention
☆83 Updated 9 months ago
efeslab / fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆243 Updated last year
wangsiping97 / FastGEMV
High-speed GEMV kernels, up to 2.7x speedup over the PyTorch baseline.
☆122 Updated last year
Dao-AILab / grouped-latent-attention
☆130 Updated 5 months ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆100 Updated 4 months ago
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆39 Updated last year
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆112 Updated last year
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
☆45 Updated 5 months ago
flashinfer-ai / cutlass-viz
☆65 Updated 6 months ago
opengear-project / GEAR
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM
☆170 Updated last year
li-plus / flash-preference
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆51 Updated 4 months ago
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆134 Updated last week
gpu-mode / ring-attention
ring-attention experiments
☆155 Updated last year
feifeibear / ChituAttention
Quantized Attention on GPU
☆44 Updated last year
Infini-AI-Lab / MagicDec
[ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
☆131 Updated 11 months ago
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆223 Updated 2 years ago