tinygrad / open-gpu-kernel-modules
NVIDIA Linux open GPU with P2P support
☆1,316 · Updated 8 months ago
Alternatives and similar repositories for open-gpu-kernel-modules
Users interested in open-gpu-kernel-modules are comparing it to the libraries listed below.
- ☆1,074 · Updated 8 months ago
- Distributed Training Over-The-Internet ☆975 · Updated 3 months ago
- ☆451 · Updated 10 months ago
- Tile primitives for speedy kernels ☆3,120 · Updated this week
- llama.cpp fork with additional SOTA quants and improved performance ☆1,587 · Updated this week
- Mirage Persistent Kernel: Compiling LLMs into a MegaKernel ☆2,104 · Updated last week
- Serving multiple LoRA-finetuned LLMs as one ☆1,140 · Updated last year
- Official implementation of Half-Quadratic Quantization (HQQ) ☆912 · Updated last month
- ☆577 · Updated last year
- Large-scale LLM inference engine ☆1,647 · Updated 2 weeks ago
- Juice Community Version Public Release ☆628 · Updated 8 months ago
- Flash Attention in ~100 lines of CUDA (forward pass only); the online-softmax tiling it relies on is sketched after this list ☆1,067 · Updated last year
- A fast inference library for running LLMs locally on modern consumer-class GPUs ☆4,431 · Updated last month
- OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training ☆562 · Updated last year
- Open weights language model from Google DeepMind, based on Griffin. ☆661 · Updated 2 weeks ago
- prime is a framework for efficient, globally distributed training of AI models over the internet. ☆850 · Updated 2 months ago
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆674 · Updated 9 months ago
- 🎯 An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality degradation across Weight-Only Quantiza… ☆839 · Updated this week
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens; the group-wise INT4 weight format such kernels consume is sketched after this list ☆1,005 · Updated last year
- AI Tensor Engine for ROCm ☆348 · Updated this week
- TinyChatEngine: On-Device LLM Inference Library ☆939 · Updated last year
- NVIDIA Linux open GPU with P2P support ☆126 · Updated 2 months ago
- Fast and memory-efficient exact attention ☆214 · Updated this week
- PyTorch native quantization and sparsity for training and inference ☆2,657 · Updated last week
- FlashAttention (Metal Port) ☆579 · Updated last year
- An implementation of bucketMul LLM inference ☆224 · Updated last year
- Llama 2 Everywhere (L2E) ☆1,526 · Updated 5 months ago
- ☆592 · Updated last year
- kernels, of the mega variety ☆665 · Updated last week
- An innovative library for efficient LLM inference via low-bit quantization ☆352 · Updated last year
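
Several entries above are attention kernels (Flash Attention in ~100 lines of CUDA, the Metal port, the exact-attention implementation). What they share is the online-softmax tiling trick: scores are computed one key/value tile at a time, so the full N×N score matrix is never materialized. Below is a minimal NumPy sketch of that idea; the function name, shapes, and tile size are illustrative and not taken from any of the listed repos.

```python
import numpy as np

def attention_forward_tiled(Q, K, V, tile=64):
    """Single-head attention forward pass using online softmax over K/V tiles.

    Processes K and V in blocks so the full (N x N) score matrix is never
    materialized -- the core memory-saving idea behind FlashAttention.
    Shapes: Q, K, V are (N, d); returns (N, d).
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))              # running (unnormalized) output
    m = np.full(N, -np.inf)           # running row maxima of the scores
    l = np.zeros(N)                   # running softmax denominators

    for start in range(0, N, tile):
        Kb = K[start:start + tile]    # (t, d) tile of keys
        Vb = V[start:start + tile]    # (t, d) tile of values
        S = (Q @ Kb.T) * scale        # (N, t) scores for this tile only

        m_new = np.maximum(m, S.max(axis=1))
        correction = np.exp(m - m_new)       # rescale previous accumulators
        P = np.exp(S - m_new[:, None])       # tile-local exponentials
        l = l * correction + P.sum(axis=1)
        O = O * correction[:, None] + P @ Vb
        m = m_new

    return O / l[:, None]

# Check against the naive reference that builds the full score matrix.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 32)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(attention_forward_tiled(Q, K, V), ref)
```

The CUDA and Metal versions do the same accumulation per thread block in on-chip memory; the tiling is what removes the O(N²) memory traffic, not any change to the math.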
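
A second cluster is weight-only quantization (the HQQ implementation, VPTQ, the FP16xINT4 inference kernel, the PyTorch-native quantization library, the accuracy-first toolkit). The common substrate is group-wise low-bit storage: weights are kept as 4-bit integers plus one scale per small group and dequantized on the fly inside the matmul. Here is a minimal NumPy sketch of a symmetric round-trip; the group size and function names are illustrative, not any listed library's API.

```python
import numpy as np

def quantize_int4(W, group_size=64):
    """Symmetric group-wise 4-bit quantization of a weight matrix.

    Each row is split into groups of `group_size` weights sharing one
    scale; values are rounded to integers in [-7, 7]. This is the generic
    scheme behind weight-only INT4 kernels, not any specific repo's format.
    """
    out_f, in_f = W.shape
    groups = W.reshape(out_f, in_f // group_size, group_size)
    scale = np.abs(groups).max(axis=-1, keepdims=True) / 7.0
    scale = np.maximum(scale, 1e-8)            # guard all-zero groups
    q = np.clip(np.round(groups / scale), -7, 7).astype(np.int8)
    return q, scale.astype(np.float16)

def dequantize_int4(q, scale):
    """Reconstruct approximate FP32 weights from codes and per-group scales."""
    W_hat = q.astype(np.float32) * scale.astype(np.float32)
    return W_hat.reshape(q.shape[0], -1)

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512)).astype(np.float32)
q, scale = quantize_int4(W)
W_hat = dequantize_int4(q, scale)
# 4 bits per weight plus one FP16 scale per 64 weights ~= 4.25 bits/weight.
print("max abs error:", np.abs(W - W_hat).max())
```

The listed projects differ mainly in how the scales are chosen (HQQ, for instance, optimizes the quantization parameters rather than taking the group's absolute maximum) and in how the packed integers are laid out so the GPU kernel can dequantize them cheaply mid-GEMM.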