huggingface / kernel-builder
👷 Build compute kernels
★201 · Updated this week
Alternatives and similar repositories for kernel-builder
Users who are interested in kernel-builder are comparing it to the libraries listed below.
- Load compute kernels from the Hub (see the `kernels` sketch after this list) ★359 · Updated this week
- Write a fast kernel and run it on Discord. See how you compare against the best! ★66 · Updated last week
- PTX tutorial written purely by AIs (OpenAI Deep Research and Claude 3.7) ★66 · Updated 9 months ago
- FlexAttention-based, minimal vLLM-style inference engine for fast Gemma 2 inference (see the FlexAttention sketch after this list) ★328 · Updated 2 months ago
- Simple & Scalable Pretraining for Neural Architecture Research ★306 · Updated last month
- PCCL (Prime Collective Communications Library) implements fault-tolerant collective communications over IP ★141 · Updated 4 months ago
- Google TPU optimizations for transformers models ★132 · Updated 3 weeks ago
- ★219 · Updated 11 months ago
- MoE training for Me and You and maybe other people ★319 · Updated last week
- ★224 · Updated last month
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ★195 · Updated 7 months ago
- mHC kernels implemented in CUDA ★196 · Updated last week
- PyTorch Distributed native training library for LLMs/VLMs with out-of-the-box Hugging Face support ★245 · Updated this week
- ring-attention experiments ★161 · Updated last year
- A safetensors extension to efficiently store sparse quantized tensors on disk ★233 · Updated this week
- ★114 · Updated last week
- Official implementation for "Training LLMs with MXFP4" ★116 · Updated 8 months ago
- Ship correct and fast LLM kernels to PyTorch ★132 · Updated this week
- TPU inference for vLLM, with unified JAX and PyTorch support. ★213 · Updated this week
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ★278 · Updated last month
- Memory-optimized Mixture of Experts ★72 · Updated 5 months ago
- The evaluation framework for training-free sparse attention in LLMs ★108 · Updated 3 months ago
- vLLM adapter for a TGIS-compatible gRPC server. ★47 · Updated this week
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ★249 · Updated 11 months ago
- Where GPUs get cooked 👩‍🍳🔥 ★347 · Updated 3 months ago
- Train, tune, and run inference with the Bamba model ★137 · Updated 7 months ago
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) ★269 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs (see the vLLM sketch after this list) ★267 · Updated last month
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning ★277 · Updated 2 months ago
- Lightweight toolkit package to train and fine-tune 1.58-bit language models ★106 · Updated 7 months ago
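
The first entry above is the `kernels` package, the Hub loader that kernel-builder produces artifacts for. A minimal usage sketch follows; the repo ID `kernels-community/activation` and the `gelu_fast` entry point are taken from the project's published examples and may change, so treat them as assumptions rather than a stable API.

```python
# Minimal sketch: load a prebuilt compute kernel from the Hugging Face Hub
# with the `kernels` package (pip install kernels). Assumes a CUDA device and
# that `kernels-community/activation` still exposes a `gelu_fast` function.
import torch
from kernels import get_kernel

# Downloads the binary matching the local torch/CUDA setup and imports it.
activation = get_kernel("kernels-community/activation")

x = torch.randn((1024, 1024), dtype=torch.float16, device="cuda")
y = torch.empty_like(x)
activation.gelu_fast(y, x)  # writes the activation into the preallocated output
print(y[:2, :4])
```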
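The FlexAttention-based Gemma 2 engine listed above builds on PyTorch's `torch.nn.attention.flex_attention` API (PyTorch 2.5+). The sketch below shows that primitive with a causal `score_mod`; shapes and dtypes are arbitrary, and none of this is the repository's own code.

```python
# Minimal FlexAttention sketch: attention whose scores are rewritten by a
# user-defined score_mod, here a simple causal mask.
import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, batch, head, q_idx, kv_idx):
    # Keep the score where the query may attend to the key; otherwise push it
    # to -inf so softmax zeroes that position out.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# torch.compile fuses the score_mod into a single kernel (needs Triton);
# calling flex_attention eagerly also works, just slower.
flex = torch.compile(flex_attention)
out = flex(q, k, v, score_mod=causal)
print(out.shape)  # (1, 8, 128, 64)
```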
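Several entries (the serving engine, the TPU backend, and the TGIS gRPC adapter) revolve around vLLM. Its offline-generation quickstart is short; the model ID below is only a small placeholder, not something any of the listed repos prescribe.

```python
# Minimal vLLM offline-inference sketch (pip install vllm). Swap the placeholder
# model ID for any Hub model the engine supports.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

outputs = llm.generate(["The key idea behind paged attention is"], params)
for out in outputs:
    print(out.prompt, "->", out.outputs[0].text)
```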