huggingface / kernels
Load compute kernels from the Hub
⭐172 · Updated this week
Alternatives and similar repositories for kernels
Users interested in kernels are comparing it to the repositories listed below.
- Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ⭐253 · Updated this week
- This repository contains the experimental PyTorch native float8 training UX ⭐224 · Updated 10 months ago
- ring-attention experiments ⭐144 · Updated 8 months ago
- ⭐108 · Updated last year
- Triton-based implementation of Sparse Mixture of Experts ⭐219 · Updated 6 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ⭐131 · Updated last week
- Applied AI experiments and examples for PyTorch ⭐277 · Updated 3 weeks ago
- Fast low-bit matmul kernels in Triton ⭐322 · Updated this week
- ⭐88 · Updated last year
- Experiment of using Tangent to autodiff Triton ⭐79 · Updated last year
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference ⭐70 · Updated last week
- Build compute kernels ⭐64 · Updated this week
- A safetensors extension to efficiently store sparse quantized tensors on disk ⭐126 · Updated this week
- ⭐114 · Updated 3 weeks ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance ⭐136 · Updated this week
- ⭐78 · Updated 11 months ago
- PyTorch per-step fault tolerance (actively under development) ⭐329 · Updated this week
- ⭐219 · Updated this week
- Collection of kernels written in the Triton language ⭐128 · Updated 2 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ⭐79 · Updated 9 months ago
- ⭐193 · Updated 4 months ago
- [ICLR 2025] Breaking the Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ⭐116 · Updated 6 months ago
- Cataloging released Triton kernels ⭐236 · Updated 5 months ago
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ⭐186 · Updated last month
- ⭐63 · Updated this week
- Write a fast kernel and run it on Discord. See how you compare against the best! ⭐46 · Updated this week
- Fast and memory-efficient exact attention ⭐68 · Updated 3 months ago
- A library for unit scaling in PyTorch ⭐125 · Updated 6 months ago
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ⭐126 · Updated 6 months ago
- Extensible collectives library in Triton ⭐86 · Updated 2 months ago