huggingface / kernels
Load compute kernels from the Hub
⭐326 · Updated this week
Alternatives and similar repositories for kernels
Users interested in kernels are comparing it to the libraries listed below.
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ⭐271 · Updated last week
- FlexAttention-based, minimal vLLM-style inference engine for fast Gemma 2 inference. ⭐302 · Updated last week
- 👷 Build compute kernels ⭐171 · Updated this week
- ⭐225 · Updated 3 weeks ago
- This repository contains the experimental PyTorch native float8 training UX ⭐223 · Updated last year
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ⭐195 · Updated 5 months ago
- ⭐121 · Updated last year
- Triton-based implementation of Sparse Mixture of Experts. ⭐248 · Updated last month
- Flash-Muon: An Efficient Implementation of Muon Optimizer ⭐206 · Updated 4 months ago
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ⭐446 · Updated this week
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ⭐216 · Updated this week
- Applied AI experiments and examples for PyTorch ⭐302 · Updated 2 months ago
- ring-attention experiments ⭐155 · Updated last year
- ⭐176 · Updated last year
- A safetensors extension to efficiently store sparse quantized tensors on disk ⭐187 · Updated last week
- Cataloging released Triton kernels. ⭐265 · Updated 2 months ago
- PyTorch-native distributed training library for LLMs/VLMs with out-of-the-box Hugging Face support ⭐147 · Updated this week
- Fast low-bit matmul kernels in Triton ⭐392 · Updated 2 weeks ago
- ⭐246 · Updated this week
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ⭐582 · Updated 3 months ago
- Cold Compress is a hackable, lightweight, open-source toolkit for creating and benchmarking cache compression methods built on top of… ⭐147 · Updated last year
- Efficient LLM Inference over Long Sequences ⭐390 · Updated 4 months ago
- The evaluation framework for training-free sparse attention in LLMs ⭐102 · Updated 3 weeks ago
- A bunch of kernels that might make stuff slower ⭐64 · Updated this week
- Learn CUDA with PyTorch ⭐104 · Updated this week
- An extension of the nanoGPT repository for training small MoE models. ⭐210 · Updated 8 months ago
- ⭐89 · Updated last year
- PyTorch-native post-training at scale ⭐509 · Updated this week
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ⭐344 · Updated 6 months ago
- Scalable and Performant Data Loading ⭐331 · Updated last week