tinygrad / open-gpu-kernel-modules
NVIDIA Linux open GPU with P2P support
☆978 · Updated last month
Alternatives and similar repositories for open-gpu-kernel-modules:
Users interested in open-gpu-kernel-modules are comparing it to the libraries listed below.
- Tile primitives for speedy kernels ☆1,923 · Updated this week
- ☆1,012 · Updated last month
- Distributed Training Over-The-Internet ☆857 · Updated last month
- Flash Attention in ~100 lines of CUDA (forward pass only) ☆681 · Updated 2 weeks ago
- ☆411 · Updated last month
- Official implementation of Half-Quadratic Quantization (HQQ) ☆732 · Updated this week
- ☆515 · Updated 2 months ago
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆561 · Updated this week
- prime is a framework for efficient, globally distributed training of AI models over the internet. ☆619 · Updated this week
- Stateful load balancer custom-tailored for llama.cpp 🏓🦙 ☆666 · Updated last week
- FlashAttention (Metal port) ☆425 · Updated 3 months ago
- Llama 2 Everywhere (L2E) ☆1,507 · Updated this week
- Open-weights language model from Google DeepMind, based on Griffin. ☆614 · Updated 6 months ago
- Serving multiple LoRA-finetuned LLMs as one ☆1,012 · Updated 8 months ago
- OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training ☆417 · Updated this week
- llama3.np is a pure NumPy implementation of the Llama 3 model. ☆975 · Updated 7 months ago
- FlashInfer: Kernel Library for LLM Serving ☆1,797 · Updated this week
- An implementation of bucketMul LLM inference ☆214 · Updated 6 months ago
- FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆680 · Updated 4 months ago
- ☆237 · Updated 9 months ago
- Training LLMs with QLoRA + FSDP ☆1,436 · Updated 2 months ago
- Stop messing around with finicky sampling parameters and just use DRµGS! ☆336 · Updated 7 months ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆505 · Updated 2 months ago
- Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA ☆714 · Updated this week
- Minimalistic 4D-parallelism distributed training framework for education purposes ☆644 · Updated this week
- A complete end-to-end pipeline for LLM interpretability with sparse autoencoders (SAEs) using Llama 3.2, written in pure PyTorch and full… ☆605 · Updated last month
- A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations ☆832 · Updated 2 months ago
- Port of MiniGPT4 in C++ (4-bit, 5-bit, 6-bit, 8-bit, 16-bit CPU inference with GGML) ☆562 · Updated last year
- ☆180 · Updated 4 months ago
- Tensor parallelism is all you need. Run LLMs on an AI cluster at home using any device. Distribute the workload, divide RAM usage, and in… ☆1,622 · Updated last week