huggingface / kernel-builder
👷 Build compute kernels
⭐ 163 · Updated this week
Alternatives and similar repositories for kernel-builder
Users interested in kernel-builder are comparing it to the libraries listed below.
- Load compute kernels from the Hub (see the usage sketch after this list) ⭐ 304 · Updated last week
- PTX tutorial written purely by AIs (OpenAI Deep Research and Claude 3.7) ⭐ 66 · Updated 7 months ago
- FlexAttention-based, minimal vLLM-style inference engine for fast Gemma 2 inference. ⭐ 296 · Updated 2 months ago
- ⭐ 222 · Updated 3 weeks ago
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ⭐ 193 · Updated 4 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ⭐ 180 · Updated this week
- Write a fast kernel and run it on Discord. See how you compare against the best! ⭐ 58 · Updated last week
- Simple & Scalable Pretraining for Neural Architecture Research ⭐ 297 · Updated 2 months ago
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ⭐ 270 · Updated 3 months ago
- PCCL (Prime Collective Communications Library) implements fault-tolerant collective communications over IP ⭐ 133 · Updated last month
- ring-attention experiments ⭐ 154 · Updated last year
- Google TPU optimizations for transformers models ⭐ 120 · Updated 9 months ago
- ⭐ 218 · Updated 9 months ago
- Efficient LLM Inference over Long Sequences ⭐ 390 · Updated 3 months ago
- ⭐ 102 · Updated this week
- ⭐ 89 · Updated last year
- A high-throughput and memory-efficient inference and serving engine for LLMs ⭐ 266 · Updated last year
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ⭐ 248 · Updated 8 months ago
- Learn CUDA with PyTorch ⭐ 92 · Updated last month
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ⭐ 420 · Updated last week
- Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton. ⭐ 147 · Updated last week
- Memory-optimized Mixture of Experts ⭐ 68 · Updated 2 months ago
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning ⭐ 192 · Updated 2 months ago
- PyTorch implementation of models from the Zamba2 series. ⭐ 185 · Updated 9 months ago
- Where GPUs get cooked 👩‍🍳🔥 ⭐ 293 · Updated last month
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ⭐ 202 · Updated last year
- Train, tune, and infer Bamba model ⭐ 134 · Updated 4 months ago
- Fast low-bit matmul kernels in Triton ⭐ 381 · Updated 3 weeks ago
- Simple high-throughput inference library ⭐ 147 · Updated 5 months ago
- How to ensure correctness and ship LLM-generated kernels in PyTorch ⭐ 66 · Updated last week
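For the first entry above (loading compute kernels from the Hub), here is a minimal usage sketch based on the Hugging Face `kernels` Python package. The repository id `kernels-community/activation` and its `gelu_fast` entry point are taken as assumptions from that project's published examples, and running it requires a CUDA-capable GPU.

```python
# Minimal sketch: fetch a precompiled compute kernel from the Hugging Face Hub
# with the `kernels` package (pip install kernels). The repo id and the
# gelu_fast entry point are assumptions drawn from the project's examples;
# a CUDA-capable GPU and a float16 tensor are assumed.
import torch
from kernels import get_kernel

# Downloads the kernel build matching the local torch/CUDA setup and loads it.
activation = get_kernel("kernels-community/activation")

x = torch.randn(8, 16, dtype=torch.float16, device="cuda")
out = torch.empty_like(x)

# The kernel writes its result into the preallocated output tensor.
activation.gelu_fast(out, x)
print(out.shape)
```

The kernel binary is pulled from the Hub rather than compiled locally, which is the workflow kernel-builder targets when producing those Hub artifacts.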