Codes & examples for "CUDA - From Correctness to Performance"
☆123 · Updated Oct 24, 2024
Alternatives and similar repositories for CUDA-From-Correctness-To-Performance-Code
Users who are interested in CUDA-From-Correctness-To-Performance-Code are comparing it to the libraries listed below.
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of … ☆314 · Updated Jun 10, 2025
- Awesome code, projects, books, etc. related to CUDA ☆31 · Updated Feb 3, 2026
- 🦙🦙.🦀 ☆28 · Updated Sep 24, 2023
- A simplified flash-attention implementation using cutlass, intended for teaching purposes ☆58 · Updated Aug 12, 2024
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios. ☆44 · Updated Feb 27, 2025
- Implement Flash Attention using Cute. ☆101 · Updated Dec 17, 2024
- ☆27 · Updated Jan 8, 2024
- CUDA 8-bit Tensor Core Matrix Multiplication based on m16n16k16 WMMA API ☆35 · Updated Sep 15, 2023
- flash attention tutorial written in python, triton, cuda, cutlass ☆488 · Updated Jan 20, 2026
- Multiple GEMM operators are constructed with cutlass to support LLM inference. ☆20 · Updated Aug 3, 2025
- ☆16 · Updated Apr 22, 2025
- ☆38 · Updated Aug 7, 2025
- [NeurIPS 2025] ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive ☆66 · Updated Dec 11, 2025
- Hydragen: High-Throughput LLM Inference with Shared Prefixes ☆48 · Updated May 10, 2024
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference. ☆46 · Updated Jun 11, 2025
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆96 · Updated Feb 20, 2026
- A lightweight design for computation-communication overlap. ☆223 · Updated Jan 20, 2026
- 🔥 LLM-powered GPU kernel synthesis: Train models to convert PyTorch ops into optimized Triton kernels via SFT+RL. Multi-turn compilation… ☆127 · Updated Nov 10, 2025
- Xiao's CUDA Optimization Guide [NO LONGER ADDING NEW CONTENT] ☆322 · Updated Nov 8, 2022
- learning how CUDA works ☆376 · Updated Mar 3, 2025
- Inference deployment of the llama3 ☆11 · Updated Apr 21, 2024
- End-to-end accelerated YOLOv12 model inference with TensorRT and an INT8 quantization implementation ☆13 · Updated Mar 5, 2025
- ☆13 · Updated Jan 7, 2025
- ☆166 · Updated Feb 5, 2026
- An easy-to-understand TensorOp Matmul Tutorial ☆410 · Updated Feb 11, 2026
- We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel … ☆193 · Updated Jan 28, 2025
- High performance Transformer implementation in C++. ☆152 · Updated Jan 18, 2025
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod… ☆31 · Updated Mar 12, 2024
- Materials for learning SGLang ☆753 · Updated Jan 5, 2026
- Flash Attention in ~100 lines of CUDA (forward pass only) ☆1,079 · Updated Dec 30, 2024
- Flash Attention in ~100 lines of CUDA (forward pass only) ☆11 · Updated Jun 10, 2024
- High-performance RMSNorm implemented using SM core storage (registers and shared memory) ☆27 · Updated Jan 22, 2026
- Standalone Flash Attention v2 kernel without libtorch dependency ☆114 · Updated Sep 10, 2024
- A selective knowledge distillation algorithm for efficient speculative decoders ☆36 · Updated Nov 27, 2025
- Simple RAM benchmark for Linux. ☆11 · Updated Aug 4, 2021
- Open-sourcing code associated with the AAAI-25 paper "On the Expressiveness and Length Generalization of Selective State-Space Models on … ☆14 · Updated Sep 18, 2025
- Expert Specialization MoE Solution based on CUTLASS ☆27 · Updated Jan 19, 2026
- ☆11 · Updated Apr 5, 2021
- AI based singing voice synthesis database generator ☆13 · Updated Aug 12, 2022