huggingface / kernels
Load compute kernels from the Hub
⭐381 · Updated this week
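For context, the library's purpose is to fetch pre-built compute kernels from the Hugging Face Hub and expose them as ordinary Python callables. The sketch below illustrates that workflow; the `get_kernel` entry point, the `kernels-community/activation` repo id, and the `gelu_fast` function are taken from the project's published examples and should be treated as illustrative assumptions, not a guarantee of the current API.

```python
# Minimal sketch: fetch a compiled compute kernel from the Hugging Face Hub
# and call it on a CUDA tensor. The kernel repo id ("kernels-community/activation")
# and the exported function name ("gelu_fast") are assumptions based on the
# project's example docs; swap in whichever kernel you actually want.
import torch
from kernels import get_kernel

# Downloads (and caches) a kernel build compatible with this environment.
activation = get_kernel("kernels-community/activation")

x = torch.randn((8, 8), dtype=torch.float16, device="cuda")
y = torch.empty_like(x)

# The loaded module exposes the kernel's functions as plain Python callables.
activation.gelu_fast(y, x)
print(y)
```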
Alternatives and similar repositories for kernels
Users interested in kernels are comparing it to the libraries listed below.
- Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ⭐280 · Updated 2 months ago
- FlexAttention based, minimal vllm-style inference engine for fast Gemma 2 inference. ⭐333 · Updated 2 months ago
- Build compute kernels ⭐214 · Updated last week
- ⭐229 · Updated 2 months ago
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ⭐472 · Updated 2 weeks ago
- This repository contains the experimental PyTorch native float8 training UX ⭐227 · Updated last year
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ⭐195 · Updated 7 months ago
- ⭐178 · Updated last year
- ring-attention experiments ⭐163 · Updated last year
- Normalized Transformer (nGPT) ⭐197 · Updated last year
- Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ⭐218 · Updated this week
- Triton-based implementation of Sparse Mixture of Experts. ⭐262 · Updated 3 months ago
- ⭐124 · Updated last year
- Pytorch Distributed native training library for LLMs/VLMs with OOTB Hugging Face support ⭐259 · Updated this week
- Efficient LLM Inference over Long Sequences ⭐394 · Updated 7 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ⭐229 · Updated 7 months ago
- Applied AI experiments and examples for PyTorch ⭐314 · Updated 5 months ago
- An extension of the nanoGPT repository for training small MOE models. ⭐231 · Updated 10 months ago
- ⭐578 · Updated 4 months ago
- Implementation of Ring Attention, from Liu et al. at Berkeley AI, in Pytorch ⭐549 · Updated 8 months ago
- ⭐273 · Updated this week
- Fast low-bit matmul kernels in Triton ⭐423 · Updated last month
- TPU inference for vLLM, with unified JAX and PyTorch support. ⭐216 · Updated this week
- Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024 ⭐355 · Updated 8 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ⭐237 · Updated this week
- Simple & Scalable Pretraining for Neural Architecture Research ⭐306 · Updated last month
- Accelerating MoE with IO and Tile-aware Optimizations ⭐553 · Updated last week
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ⭐594 · Updated 5 months ago
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ⭐146 · Updated last year
- Dion optimizer algorithm ⭐420 · Updated last week