NVIDIA / cutile-python
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs.
☆1,903 · Updated this week
Alternatives and similar repositories for cutile-python
Users interested in cutile-python are comparing it to the libraries listed below.
- Distributed Compiler based on Triton for Parallel Systems ☆1,332 · Updated last week
- Helpful kernel tutorials and examples for tile-based GPU programming ☆630 · Updated this week
- Mirage Persistent Kernel: Compiling LLMs into a MegaKernel ☆2,104 · Updated last week
- A fast communication-overlapping library for tensor/expert parallelism on GPUs ☆1,235 · Updated 5 months ago
- A Quirky Assortment of CuTe Kernels ☆781 · Updated this week
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate ☆739 · Updated this week
- kernels, of the mega variety ☆665 · Updated last week
- CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-base… ☆823 · Updated 3 weeks ago
- Step-by-step optimization of CUDA SGEMM ☆428 · Updated 3 years ago
- Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels ☆4,863 · Updated this week
- Fastest kernels written from scratch ☆532 · Updated 4 months ago
- LeetGPU Challenges ☆613 · Updated this week
- Fast CUDA matrix multiplication from scratch ☆1,040 · Updated 5 months ago
- Perplexity GPU Kernels ☆554 · Updated 3 months ago
- A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems ☆3,297 · Updated 2 weeks ago
- KernelBench: Can LLMs Write GPU Kernels? Benchmark + toolkit with Torch -> CUDA (+ more DSLs) ☆781 · Updated 2 weeks ago
- Tilus is a tile-level kernel programming language with explicit control over shared memory and registers ☆440 · Updated last month
- FlagGems is an operator library for large language models implemented in the Triton language ☆893 · Updated this week
- A unified library of SOTA model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresse… ☆1,925 · Updated this week
- Flash Attention in ~100 lines of CUDA (forward pass only) ☆1,067 · Updated last year
- CUDA Kernel Benchmarking Library ☆806 · Updated last week
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com… ☆461 · Updated last month
- Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores with the WMMA API and MMA PTX instruct… ☆520 · Updated last year
- Puzzles for learning Triton, playable with minimal environment configuration ☆613 · Updated last month
- depyf is a tool to help you understand and adapt to the PyTorch compiler, torch.compile ☆782 · Updated 3 months ago
- A series of GPU optimization topics introducing in detail how to optimize CUDA kernels. I will introduce several… ☆1,233 · Updated 2 years ago
- Tile-Based Runtime for Ultra-Low-Latency LLM Inference ☆564 · Updated last week
- PyTorch Single Controller ☆957 · Updated this week
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆200 · Updated last week
- Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O ☆550 · Updated 4 months ago
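Most of the repositories above (cuTile itself, the CuTe kernels, the SGEMM tutorials, Tilus) revolve around the same core idea: decomposing a computation into fixed-size tiles so each tile maps onto a GPU thread block and its shared memory. A minimal pure-Python CPU sketch of that tiling idea, using a tiled matrix multiply (this is an illustration of the general concept, not cuTile's actual API; `TILE` and `tiled_matmul` are hypothetical names):

```python
# CPU sketch of the tiling idea behind tile-based GPU kernels:
# partition the output matrix C into TILE x TILE blocks and accumulate
# each block from matching tiles of A and B. On a GPU, each (i0, j0)
# tile would map to one thread block working out of shared memory.

TILE = 2  # tile edge length; real kernels pick this to fit shared memory

def tiled_matmul(a, b):
    """Multiply dense matrices (lists of lists) tile by tile."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, TILE):            # tile row of C
        for j0 in range(0, m, TILE):        # tile column of C
            for k0 in range(0, k, TILE):    # reduction over K in tiles
                for i in range(i0, min(i0 + TILE, n)):
                    for j in range(j0, min(j0 + TILE, m)):
                        for kk in range(k0, min(k0 + TILE, k)):
                            c[i][j] += a[i][kk] * b[kk][j]
    return c
```

The payoff on real hardware is locality: each A/B tile is loaded once into fast memory and reused across the whole output tile, which is the optimization the SGEMM/HGEMM tutorials in this list build up step by step.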