NVIDIA / cutile-python
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs.
☆1,903 · Updated this week
Alternatives and similar repositories for cutile-python
Users interested in cutile-python are comparing it to the libraries listed below.
- Distributed Compiler based on Triton for Parallel Systems ☆1,332 · Updated last week
- Helpful kernel tutorials and examples for tile-based GPU programming ☆630 · Updated this week
- Mirage Persistent Kernel: Compiling LLMs into a MegaKernel ☆2,104 · Updated last week
- A fast communication-overlapping library for tensor/expert parallelism on GPUs ☆1,235 · Updated 5 months ago
- A Quirky Assortment of CuTe Kernels ☆781 · Updated this week
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate ☆739 · Updated this week
- kernels, of the mega variety ☆665 · Updated last week
- CUDA Tile IR is an MLIR-based intermediate representation and compiler infrastructure for CUDA kernel optimization, focusing on tile-base… ☆823 · Updated 3 weeks ago
- Step-by-step optimization of CUDA SGEMM ☆428 · Updated 3 years ago
- Domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels ☆4,863 · Updated this week
- Fastest kernels written from scratch ☆532 · Updated 4 months ago
- LeetGPU Challenges ☆613 · Updated this week
- Fast CUDA matrix multiplication from scratch ☆1,040 · Updated 5 months ago
- Perplexity GPU Kernels ☆554 · Updated 3 months ago
- A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems ☆3,297 · Updated 2 weeks ago
- KernelBench: Can LLMs Write GPU Kernels? Benchmark + toolkit with Torch -> CUDA (+ more DSLs) ☆781 · Updated 2 weeks ago
- Tilus is a tile-level kernel programming language with explicit control over shared memory and registers ☆440 · Updated last month
- FlagGems is an operator library for large language models implemented in the Triton language ☆893 · Updated this week
- A unified library of SOTA model optimization techniques like quantization, pruning, distillation, speculative decoding, etc. It compresse… ☆1,925 · Updated this week
- Flash Attention in ~100 lines of CUDA (forward pass only) ☆1,067 · Updated last year
- CUDA Kernel Benchmarking Library ☆806 · Updated last week
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com… ☆461 · Updated last month
- Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores with the WMMA API and MMA PTX instruct… ☆520 · Updated last year
- Puzzles for learning Triton, playable with minimal environment configuration ☆613 · Updated last month
- depyf is a tool to help you understand and adapt to the PyTorch compiler, torch.compile ☆782 · Updated 3 months ago
- A series of GPU optimization topics introducing in detail how to optimize CUDA kernels. I will introduce several… ☆1,233 · Updated 2 years ago
- Tile-Based Runtime for Ultra-Low-Latency LLM Inference ☆564 · Updated last week
- PyTorch Single Controller ☆957 · Updated this week
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆200 · Updated last week
- Yet Another Language Model: LLM inference in C++/CUDA, no libraries except for I/O ☆550 · Updated 4 months ago
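Most of the repositories above (cuTile itself, the CuTe kernels, the SGEMM tutorials, Tilus) revolve around the same core idea: decomposing a computation into fixed-size tiles so each tile maps onto a GPU thread block and its shared memory. A minimal pure-Python CPU sketch of that tiling idea, using a tiled matrix multiply (this is an illustration of the general concept, not cuTile's actual API; `TILE` and `tiled_matmul` are hypothetical names):

```python
# CPU sketch of the tiling idea behind tile-based GPU kernels:
# partition the output matrix C into TILE x TILE blocks and accumulate
# each block from matching tiles of A and B. On a GPU, each (i0, j0)
# tile would map to one thread block working out of shared memory.

TILE = 2  # tile edge length; real kernels pick this to fit shared memory

def tiled_matmul(a, b):
    """Multiply dense matrices (lists of lists) tile by tile."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, TILE):            # tile row of C
        for j0 in range(0, m, TILE):        # tile column of C
            for k0 in range(0, k, TILE):    # reduction over K in tiles
                for i in range(i0, min(i0 + TILE, n)):
                    for j in range(j0, min(j0 + TILE, m)):
                        for kk in range(k0, min(k0 + TILE, k)):
                            c[i][j] += a[i][kk] * b[kk][j]
    return c
```

The payoff on real hardware is locality: each A/B tile is loaded once into fast memory and reused across the whole output tile, which is the optimization the SGEMM/HGEMM tutorials in this list build up step by step.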