caibucai22 / awesome-cuda

Awesome code, projects, books, etc. related to CUDA

☆25

Alternatives and similar repositories for awesome-cuda

Users who are interested in awesome-cuda are comparing it to the repositories listed below.

Bruce-Lee-LY / cutlass_gemm
Multiple GEMM operators are constructed with cutlass to support LLM inference.
☆20 Updated 2 months ago
caijixueIT / CUDA_Learning_for_Freshman
☆14 Updated 7 months ago
simveit / persistent_dense_gemm
Persistent dense gemm for Hopper in `CuTeDSL`
☆15 Updated 2 months ago
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
☆45 Updated 4 months ago
IST-DASLab / gemm-fp8
High Performance FP8 GEMM Kernels for SM89 and later GPUs.
☆20 Updated 9 months ago
luliyucoordinate / flash-attention-minimal
Flash Attention in ~100 lines of CUDA (forward pass only)
☆10 Updated last year
BBuf / tensorrt-llm-moe
☆33 Updated 8 months ago
yester31 / Cutlass_EX
study of cutlass
☆22 Updated 11 months ago
li199603 / sgemm_with_cuda
SGEMM optimization with cuda step by step
☆21 Updated last year
FeiGeChuanShu / trt2023
NVIDIA TensorRT Hackathon 2023 second-round topic: building and optimizing the Tongyi Qianwen Qwen-7B model with TensorRT-LLM
☆43 Updated 2 years ago
weishengying / cutlass_flash_atten_fp8
FP8 flash attention implemented on the Ada architecture using the cutlass repository
☆76 Updated last year
caiwanxianhust / FasterLLaMA
A llama model inference framework implemented in CUDA C++
☆62 Updated 11 months ago
jundaf2 / CUDA-INT8-GEMM
CUDA 8-bit Tensor Core Matrix Multiplication based on m16n16k16 WMMA API
☆33 Updated 2 years ago
triple-Mu / HunyuanDiT-TensorRT-libtorch
HunyuanDiT with TensorRT and libtorch
☆18 Updated last year
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆112 Updated last year
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆121 Updated 5 months ago
pzhao-eng / FlashMLA
☆56 Updated 3 months ago
KuangjuX / CUDAKernels
🎉My Collections of CUDA Kernels~
☆11 Updated last year
Qingrenn / mmdeploy-summer-camp
🐱 ncnn int8 model quantization evaluation
☆13 Updated 3 years ago
Syencil / Programming_Massively_Parallel_Processors
Code and notes for the 6 major CUDA parallel computing patterns
☆61 Updated 5 years ago
JieRen98 / SGEMM-SASS-Annotation
☆21 Updated 4 years ago
OpenPPL / ppl.kernel.cuda
☆37 Updated last year
LeiWang1999 / TVM.CMakeExtend
Tutorials of Extending and importing TVM with CMAKE Include dependency.
☆14 Updated last year
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆96 Updated 10 months ago
hova88 / CUDA-MatMul-Practice
☆17 Updated last year
Chtholly-Boss / swizzle
A practical way of learning Swizzle
☆29 Updated 8 months ago
latentCall145 / channels-last-groupnorm
A CUDA kernel for NHWC GroupNorm for PyTorch
☆21 Updated 11 months ago
weishengying / cute_gemm
☆17 Updated last year
zeroine / cutlass-cute-sample
☆44 Updated last year
InfiniTensor / RefactorGraph
A layered, decoupled deep learning inference engine
☆76 Updated 8 months ago