NVIDIA / cutlass

CUDA Templates and Python DSLs for High-Performance Linear Algebra

☆8,828

Alternatives and similar repositories for cutlass

Users who are interested in cutlass are comparing it to the repositories listed below.

NVIDIA / CUDALibrarySamples
CUDA Library Samples
☆2,203 Updated this week
openxla / xla
A machine learning compiler for GPUs, CPUs, and ML accelerators
☆3,687 Updated this week
NVIDIA / TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit and 4-bit floating point (FP8 and FP4) precision on H…
☆2,925 Updated this week
NVIDIA / nccl
Optimized primitives for collective multi-GPU communication
☆4,236 Updated last week
NVIDIA / cuda-samples
Samples for CUDA developers that demonstrate features in the CUDA Toolkit
☆8,470 Updated 2 months ago
HazyResearch / ThunderKittens
Tile primitives for speedy kernels
☆2,937 Updated this week
NVIDIA / cccl
CUDA Core Compute Libraries
☆2,029 Updated last week
triton-lang / triton
Development repository for the Triton language and compiler
☆17,585 Updated this week
flashinfer-ai / flashinfer
FlashInfer: Kernel Library for LLM Serving
☆4,099 Updated this week
NVIDIA / cuda-python
CUDA Python: Performance meets Productivity
☆3,044 Updated this week
tile-ai / tilelang
Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
☆3,945 Updated this week
NVIDIA / FasterTransformer
Transformer related optimization, including BERT, GPT
☆6,354 Updated last year
gpu-mode / lectures
Material for gpu-mode lectures
☆5,310 Updated 2 months ago
iree-org / iree
A retargetable MLIR-based machine learning compiler and runtime toolkit.
☆3,456 Updated last week
BBuf / how-to-optim-algorithm-in-cuda
How to optimize some algorithms in CUDA.
☆2,622 Updated 2 weeks ago
NVIDIA / cub
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
☆1,806 Updated 2 years ago
mirage-project / mirage
Mirage Persistent Kernel: Compiling LLMs into a MegaKernel
☆1,951 Updated this week
pytorch / ao
PyTorch native quantization and sparsity for training and inference
☆2,511 Updated this week
llvm / torch-mlir
The Torch-MLIR project aims to provide first class support from the PyTorch ecosystem to the MLIR ecosystem.
☆1,678 Updated last week
NVIDIA / TensorRT
NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source compone…
☆12,369 Updated last week
ai-dynamo / dynamo
A Datacenter Scale Distributed Inference Serving Framework
☆5,490 Updated this week
deepseek-ai / DeepGEMM
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
☆5,890 Updated last week
flame / how-to-optimize-gemm
☆1,947 Updated 2 years ago
Liu-xiandong / How_to_optimize_in_GPU
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several…
☆1,188 Updated 2 years ago
gpu-mode / resource-stream
GPU programming related news and material links
☆1,795 Updated 2 months ago
NVIDIA / TensorRT-LLM
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizat…
☆12,203 Updated this week
kvcache-ai / Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
☆4,283 Updated this week
PacktPublishing / Learn-CUDA-Programming
Learn CUDA Programming, published by Packt
☆1,211 Updated last year
siboehm / SGEMM_CUDA
Fast CUDA matrix multiplication from scratch
☆946 Updated 2 months ago
srush / Triton-Puzzles
Puzzles for learning Triton
☆2,116 Updated last year