Load compute kernels from the Hub
⭐ 439 · Feb 21, 2026 · Updated last week
Alternatives and similar repositories for kernels
Users interested in kernels are comparing it to the libraries listed below.
- Build compute kernels ⭐ 215 · Jan 27, 2026 · Updated last month
- A Quirky Assortment of CuTe Kernels ⭐ 814 · Updated this week
- Framework to reduce autotune overhead to zero for well-known deployments ⭐ 96 · Sep 19, 2025 · Updated 5 months ago
- KernelBench: Can LLMs Write GPU Kernels? Benchmark and toolkit with Torch -> CUDA (+ more DSLs) ⭐ 820 · Updated this week
- Efficient Triton Kernels for LLM Training ⭐ 6,162 · Updated this week
- ⭐ 206 · May 5, 2025 · Updated 9 months ago
- Minimalistic large language model 3D-parallelism training ⭐ 2,569 · Feb 19, 2026 · Updated last week
- Tritonbench: a collection of PyTorch custom operators with example inputs to measure their performance ⭐ 327 · Updated this week
- FlashInfer: Kernel Library for LLM Serving ⭐ 5,009 · Updated this week
- ⭐ 26 · Nov 18, 2025 · Updated 3 months ago
- ⭐ 32 · Jul 2, 2025 · Updated 7 months ago
- Hugging Face Jobs ⭐ 19 · Jul 11, 2025 · Updated 7 months ago
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate ⭐ 766 · Updated this week
- [ICLR'25] Code for KaSA, an official implementation of "KaSA: Knowledge-Aware Singular-Value Adaptation of Large Language Models" ⭐ 20 · Jan 16, 2025 · Updated last year
- Tile primitives for speedy kernels ⭐ 3,183 · Updated this week
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing ⭐ 106 · Jun 28, 2025 · Updated 8 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ⭐ 237 · Jun 15, 2025 · Updated 8 months ago
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ⭐ 30 · Jan 28, 2026 · Updated last month
- Applied AI experiments and examples for PyTorch ⭐ 318 · Aug 22, 2025 · Updated 6 months ago
- Efficient implementations of state-of-the-art linear attention models ⭐ 4,428 · Updated this week
- ⭐ 52 · May 19, 2025 · Updated 9 months ago
- PyTorch native quantization and sparsity for training and inference ⭐ 2,696 · Updated this week
- A minimal training framework for scaling FLA models ⭐ 350 · Nov 15, 2025 · Updated 3 months ago
- FlexAttention w/ FlashAttention3 Support ⭐ 27 · Oct 5, 2024 · Updated last year
- Minimalistic 4D-parallelism distributed training framework for education purposes ⭐ 2,090 · Aug 26, 2025 · Updated 6 months ago
- FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens ⭐ 1,018 · Sep 4, 2024 · Updated last year
- Distributed Compiler based on Triton for Parallel Systems ⭐ 1,361 · Feb 13, 2026 · Updated 2 weeks ago
- Fast low-bit matmul kernels in Triton ⭐ 433 · Feb 1, 2026 · Updated 3 weeks ago
- Helpful tools and examples for working with flex-attention ⭐ 1,136 · Feb 8, 2026 · Updated 2 weeks ago
- ⭐ 261 · Jul 11, 2024 · Updated last year
- Dynamic Memory Management for Serving LLMs without PagedAttention ⭐ 463 · May 30, 2025 · Updated 8 months ago
- A PyTorch native platform for training generative AI models ⭐ 5,084 · Updated this week
- Mirage Persistent Kernel: Compiling LLMs into a MegaKernel ⭐ 2,141 · Feb 19, 2026 · Updated last week
- Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernelsβ5,236Feb 20, 2026Updated last week
- Benchmark tests supporting the TiledCUDA library.β18Nov 19, 2024Updated last year
- Quantized Attention on GPUβ44Nov 22, 2024Updated last year
- β‘οΈWrite HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peakβ‘οΈ Performance.β147May 10, 2025Updated 9 months ago
- A bunch of kernels that might make stuff slower πβ75Feb 18, 2026Updated last week
- Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backendsβ2,311Feb 20, 2026Updated last week