yester31/Cutlass_EX

study of cutlass

☆22

Alternatives and similar repositories for Cutlass_EX

Users who are interested in Cutlass_EX are comparing it to the libraries listed below.

SkyworkAI / vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆17 · Jun 3, 2024 · Updated 2 years ago
weishengying / cutlass_flash_atten_fp8
Implements FP8 flash attention on the Ada architecture using the CUTLASS repository
☆82 · Aug 12, 2024 · Updated last year
billmuch / matmul_perf_test
☆15 · Apr 15, 2022 · Updated 4 years ago
flashinfer-ai / debug-print
Debug print operator for cudagraph debugging
☆18 · Aug 2, 2024 · Updated last year
TiledTensor / TiledBench
Benchmark tests supporting the TiledCUDA library.
☆19 · Nov 19, 2024 · Updated last year
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆113 · Sep 10, 2024 · Updated last year
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆96 · Jun 21, 2026 · Updated last month
vllm-project / tml-fa4
FA4-based Relative Attention Kernel developed by TML and Colfax
☆17 · Updated this week
Dao-AILab / gemm-cublas
☆22 · May 5, 2025 · Updated last year
flashinfer-ai / cutlass-viz
☆65 · Apr 26, 2025 · Updated last year
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆108 · Dec 17, 2024 · Updated last year
frozein / QuickMathHPP
a single-header math library
☆17 · Nov 7, 2025 · Updated 8 months ago
ColfaxResearch / cutlass-kernels
☆269 · Jul 11, 2024 · Updated 2 years ago
L1aoXingyu / llm-infer-bench
☆12 · Sep 1, 2023 · Updated 2 years ago
flashinfer-ai / cubloaty
a size profiler for cuda binary
☆71 · Jan 15, 2026 · Updated 6 months ago
Adlik / model_zoo
☆11 · Dec 26, 2025 · Updated 6 months ago
FindHao / drgpu
A Top-Down Profiler for GPU Applications
☆23 · Feb 29, 2024 · Updated 2 years ago
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆45 · Feb 27, 2025 · Updated last year
Oneflow-Inc / oneflow_convert
OneFlow->ONNX
☆42 · Apr 19, 2023 · Updated 3 years ago
HydraQYH / hp_rms_norm
High-performance RMSNorm implemented using SM core storage (registers and shared memory)
☆30 · Jan 22, 2026 · Updated 5 months ago
IST-DASLab / GSQ
Gumbel-Softmax post-training quantization for LLMs (1–3 bit scalar, INT/GGUF-compatible).
☆15 · Jul 11, 2026 · Updated last week
daquexian / faster-rwkv
☆126 · Dec 15, 2023 · Updated 2 years ago
ProjectPhysX / PTXprofiler
A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.
☆59 · Mar 20, 2025 · Updated last year
xlite-dev / netron-vscode-extension
☕️ A vscode extension for netron, support *.pdmodel, *.nb, *.onnx, *.pb, *.h5, *.tflite, *.pth, *.pt, *.mnn, *.param, etc.
☆14 · Jun 4, 2023 · Updated 3 years ago
Guanbin-Huang / camera_calibration_cpp
☆19 · Aug 23, 2022 · Updated 3 years ago
blackjack2015 / NV-DVFS-Benchmark
☆10 · Aug 21, 2023 · Updated 2 years ago
carefree0910 / carefree-flow
Deep Learning ❤️ OneFlow
☆19 · Aug 26, 2021 · Updated 4 years ago
KuangjuX / cuda-evolve-oss
Autonomous GPU kernel optimization system driven by AI agents.
☆31 · Mar 29, 2026 · Updated 3 months ago
eyalroz / gpu-kernel-runner
Runs a single CUDA/OpenCL kernel, taking its source from a file and arguments from the command-line
☆26 · Jun 10, 2026 · Updated last month
YJMSTR / flash-linear-attention
FLA but cuTile
☆27 · Apr 17, 2026 · Updated 3 months ago
wyg1997 / neovimplus
auto deploy neovim like chxuan/vimplus
☆12 · Apr 22, 2025 · Updated last year
bytedance / ByteTransformer
optimized BERT transformer inference on NVIDIA GPU. https://arxiv.org/abs/2210.03052
☆479 · Mar 15, 2024 · Updated 2 years ago
OpenPPL / ppl.llm.kernel.cuda
☆150 · Jan 9, 2025 · Updated last year
shaoshitong / diffusion-model-learning
Document the demo and a series of documents for learning the diffusion model.
☆41 · Jun 29, 2023 · Updated 3 years ago
neuralmagic / AutoFP8
☆210 · May 5, 2025 · Updated last year
JiangLiSJTU / token-ring
☆13 · Jan 7, 2025 · Updated last year
sgl-project / DeepGEMM
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
☆32 · Updated this week
manishucsd / py-codegen
☆16 · Sep 24, 2024 · Updated last year
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆154 · May 29, 2025 · Updated last year