huggingface / kernels
Load compute kernels from the Hub
★337 · Updated last week
Alternatives and similar repositories for kernels
Users interested in kernels are comparing it to the libraries listed below.
- Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ★271 · Updated last week
- Build compute kernels ★190 · Updated this week
- FlexAttention-based, minimal vLLM-style inference engine for fast Gemma 2 inference ★313 · Updated last month
- ★224 · Updated last week
- This repository contains the experimental PyTorch native float8 training UX ★226 · Updated last year
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ★454 · Updated 3 weeks ago
- Ring-attention experiments ★160 · Updated last year
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ★196 · Updated 6 months ago
- PyTorch-native distributed training library for LLMs/VLMs with out-of-the-box Hugging Face support ★187 · Updated this week
- A safetensors extension to efficiently store sparse quantized tensors on disk ★210 · Updated 2 weeks ago
- Flash-Muon: An Efficient Implementation of the Muon Optimizer ★212 · Updated 5 months ago
- Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components ★217 · Updated last week
- Triton-based implementation of Sparse Mixture of Experts ★253 · Updated 2 months ago
- ★121 · Updated last year
- Applied AI experiments and examples for PyTorch ★307 · Updated 3 months ago
- ★555 · Updated 2 months ago
- Efficient LLM Inference over Long Sequences ★392 · Updated 5 months ago
- TPU inference for vLLM, with unified JAX and PyTorch support ★170 · Updated this week
- Scalable and Performant Data Loading ★345 · Updated this week
- An extension of the nanoGPT repository for training small MoE models ★215 · Updated 8 months ago
- Fast low-bit matmul kernels in Triton ★401 · Updated last week
- ★90 · Updated last year
- ★177 · Updated last year
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ★347 · Updated 7 months ago
- Normalized Transformer (nGPT) ★194 · Updated last year
- Learn CUDA with PyTorch ★117 · Updated last week
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton ★584 · Updated 3 months ago
- Memory-optimized Mixture of Experts ★69 · Updated 4 months ago
- LLM KV cache compression made easy ★701 · Updated this week
- Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU Clusters ★130 · Updated last year