huggingface / kernels

Load compute kernels from the Hub

☆337

Alternatives and similar repositories for kernels

Users who are interested in kernels are comparing it to the libraries listed below.

foundation-model-stack / fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash…
☆271 Updated last week
huggingface / kernel-builder
👷 Build compute kernels
☆190 Updated this week
changjonathanc / flex-nano-vllm
FlexAttention-based, minimal vLLM-style inference engine for fast Gemma 2 inference.
☆313 Updated last month
huggingface / picotron_tutorial
☆224 Updated last week
meta-pytorch / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆226 Updated last year
meta-pytorch / torchft
Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)
☆454 Updated 3 weeks ago
gpu-mode / ring-attention
ring-attention experiments
☆160 Updated last year
MekkCyber / TritonAcademy
A repository to unravel the language of GPUs, making their kernel conversations easy to understand
☆196 Updated 6 months ago
NVIDIA-NeMo / Automodel
PyTorch Distributed native training library for LLMs/VLMs with OOTB Hugging Face support
☆187 Updated this week
vllm-project / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆210 Updated 2 weeks ago
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆212 Updated 5 months ago
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆217 Updated last week
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆253 Updated 2 months ago
mgmalek / efficient_cross_entropy
☆121 Updated last year
meta-pytorch / applied-ai
Applied AI experiments and examples for PyTorch
☆307 Updated 3 months ago
apple / ml-cross-entropy
☆555 Updated 2 months ago
NVIDIA / Star-Attention
Efficient LLM Inference over Long Sequences
☆392 Updated 5 months ago
vllm-project / tpu-inference
TPU inference for vLLM, with unified JAX and PyTorch support.
☆170 Updated this week
facebookresearch / spdl
Scalable and Performant Data Loading
☆345 Updated this week
wolfecameron / nanoMoE
An extension of the nanoGPT repository for training small MoE models.
☆215 Updated 8 months ago
dropbox / gemlite
Fast low-bit matmul kernels in Triton
☆401 Updated last week
vdesai2014 / inference-optimization-blog-post
☆90 Updated last year
gpu-mode / profiling-cuda-in-torch
☆177 Updated last year
facebookresearch / LayerSkip
Code for "LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding", ACL 2024
☆347 Updated 7 months ago
NVIDIA / ngpt
Normalized Transformer (nGPT)
☆194 Updated last year
gau-nernst / learn-cuda
Learn CUDA with PyTorch
☆117 Updated last week
BobMcDear / attorch
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
☆584 Updated 3 months ago
tilde-research / MoMoE-impl
Memory optimized Mixture of Experts
☆69 Updated 4 months ago
NVIDIA / kvpress
LLM KV cache compression made easy
☆701 Updated this week
Zyphra / tree_attention
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
☆130 Updated last year