tinygrad / open-gpu-kernel-modules
NVIDIA Linux open GPU with P2P support
☆1,058 · Updated 3 months ago
Alternatives and similar repositories for open-gpu-kernel-modules:
Users interested in open-gpu-kernel-modules are comparing it to the libraries listed below.
- ☆1,030 · Updated 4 months ago
- Distributed Training Over-The-Internet ☆891 · Updated 3 months ago
- Tile primitives for speedy kernels ☆2,184 · Updated this week
- Official implementation of Half-Quadratic Quantization (HQQ) ☆772 · Updated this week
- Flash Attention in ~100 lines of CUDA (forward pass only) ☆747 · Updated 2 months ago
- ☆529 · Updated 5 months ago
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆775 · Updated 6 months ago
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆620 · Updated this week
- FlashInfer: Kernel Library for LLM Serving ☆2,483 · Updated this week
- Serving multiple LoRA-finetuned LLMs as one ☆1,042 · Updated 10 months ago
- Large-scale LLM inference engine ☆1,366 · Updated this week
- ☆438 · Updated 2 weeks ago
- FlashAttention (Metal Port) ☆463 · Updated 6 months ago
- Muon optimizer: >30% sample efficiency with <3% wallclock overhead ☆529 · Updated this week
- Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA ☆771 · Updated this week
- Tutorials on tinygrad ☆357 · Updated last month
- A throughput-oriented high-performance serving framework for LLMs ☆782 · Updated 6 months ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆1,113 · Updated last week
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆524 · Updated last month
- PyTorch native quantization and sparsity for training and inference ☆1,920 · Updated this week
- Stateful load balancer custom-tailored for llama.cpp 🏓🦙 ☆732 · Updated last week
- llama.cpp fork with additional SOTA quants and improved performance ☆222 · Updated this week
- Puzzles for learning Triton ☆1,540 · Updated 4 months ago
- prime is a framework for efficient, globally distributed training of AI models over the internet. ☆682 · Updated last week
- OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training ☆474 · Updated 2 months ago
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs… ☆2,299 · Updated this week
- Open weights language model from Google DeepMind, based on Griffin. ☆628 · Updated last month
- llama3.np is a pure NumPy implementation of the Llama 3 model. ☆977 · Updated 9 months ago
- Scalable and robust tree-based speculative decoding algorithm ☆340 · Updated 2 months ago
- Pipeline Parallelism for PyTorch ☆760 · Updated 7 months ago