huggingface / kernels
Load compute kernels from the Hub
⭐244 · Updated this week
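For context, `kernels` lets a Python project fetch pre-built, versioned compute kernels directly from the Hugging Face Hub instead of compiling them locally. A minimal usage sketch following the project's README example; the `kernels-community/activation` repo and its `gelu_fast` function come from that example and are illustrative, not the only entry point:

```python
import torch
from kernels import get_kernel

# Download an optimized kernel repo from the Hugging Face Hub;
# it is exposed as a module whose functions can be called directly.
activation = get_kernel("kernels-community/activation")

x = torch.randn((10, 10), dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)  # writes GELU(x) into the preallocated output y
print(y)
```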
Alternatives and similar repositories for kernels
Users interested in kernels are comparing it to the libraries listed below.
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ⭐260 · Updated 3 weeks ago
- 👷 Build compute kernels ⭐106 · Updated last week
- Flash-Muon: An Efficient Implementation of Muon Optimizer (see the update-step sketch after this list) ⭐160 · Updated 2 months ago
- ⭐118 · Updated last year
- FlexAttention-based, minimal vLLM-style inference engine for fast Gemma 2 inference (see the FlexAttention sketch after this list) ⭐250 · Updated 2 weeks ago
- This repository contains the experimental PyTorch native float8 training UX ⭐224 · Updated last year
- ⭐211 · Updated 6 months ago
- ring-attention experiments ⭐149 · Updated 10 months ago
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ⭐383 · Updated last week
- The evaluation framework for training-free sparse attention in LLMs ⭐90 · Updated 2 months ago
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ⭐190 · Updated 2 months ago
- Triton-based implementation of Sparse Mixture of Experts ⭐233 · Updated 8 months ago
- ⭐162 · Updated last year
- A safetensors extension to efficiently store sparse quantized tensors on disk ⭐149 · Updated last week
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters ⭐128 · Updated 8 months ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference ⭐84 · Updated last month
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS … ⭐61 · Updated 10 months ago
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components ⭐208 · Updated last week
- ⭐237 · Updated 2 months ago
- Normalized Transformer (nGPT) ⭐186 · Updated 9 months ago
- Official implementation for Training LLMs with MXFP4 ⭐75 · Updated 3 months ago
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) ⭐200 · Updated last week
- ⭐88 · Updated last year
- An extension of the nanoGPT repository for training small MoE models ⭐178 · Updated 5 months ago
- PyTorch Single Controller ⭐361 · Updated last week
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ⭐141 · Updated last year
- FlashRNN: Fast RNN Kernels with I/O Awareness ⭐94 · Updated 2 months ago
- Google TPU optimizations for transformers models ⭐118 · Updated 7 months ago
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ⭐244 · Updated 6 months ago
- Efficient LLM Inference over Long Sequences ⭐389 · Updated last month
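The Flash-Muon entry above references Muon, whose core update is compact enough to sketch: keep a momentum buffer for each 2-D weight and orthogonalize the buffered gradient with a few Newton-Schulz iterations before applying it. Below is a simplified plain-PyTorch sketch following the reference Muon implementation (the quintic coefficients come from that code; Nesterov momentum and shape-based learning-rate scaling are omitted). Flash-Muon's point is to accelerate exactly this iteration with custom kernels, which this sketch does not attempt:

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2-D matrix G (push its singular
    values toward 1) with a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the reference Muon code
    X = G.bfloat16()
    if G.size(0) > G.size(1):   # work on the wide orientation
        X = X.T
    X = X / (X.norm() + 1e-7)   # spectral norm <= 1 so the iteration converges
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * A @ A
        X = a * X + B @ X
    if G.size(0) > G.size(1):
        X = X.T
    return X

@torch.no_grad()
def muon_step(param, grad, momentum_buf, lr=0.02, momentum=0.95):
    # Momentum accumulation, then the orthogonalized update.
    momentum_buf.mul_(momentum).add_(grad)
    param.add_(newton_schulz5(momentum_buf).type_as(param), alpha=-lr)
```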
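Similarly, the Gemma 2 engine entry above builds on FlexAttention, the PyTorch (2.5+) API that expresses attention variants such as causal masking, sliding windows, or the logit soft-capping Gemma 2 uses as a small `score_mod` function that `torch.compile` can fuse into a single kernel. A minimal causal-masking sketch; the shapes are illustrative and the call is normally wrapped in `torch.compile` for speed:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, b, h, q_idx, kv_idx):
    # score_mod hook: keep scores where the query may attend, mask the rest.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

B, H, S, D = 1, 4, 128, 64
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))
out = flex_attention(q, k, v, score_mod=causal)  # compile in practice
print(out.shape)  # torch.Size([1, 4, 128, 64])
```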