tinygrad / open-gpu-kernel-modules
NVIDIA Linux open GPU with P2P support
☆1,019 · Updated 2 months ago
Alternatives and similar repositories for open-gpu-kernel-modules:
Users interested in open-gpu-kernel-modules are comparing it to the libraries listed below.
- ☆1,018 · Updated 2 months ago
- Tile primitives for speedy kernels ☆2,032 · Updated this week
- Distributed Training Over-The-Internet ☆877 · Updated 2 months ago
- Flash Attention in ~100 lines of CUDA (forward pass only) ☆699 · Updated last month
- Stateful load balancer custom-tailored for llama.cpp 🏓🦙 ☆704 · Updated 3 weeks ago
- ☆426 · Updated 2 months ago
- Official implementation of Half-Quadratic Quantization (HQQ) ☆747 · Updated this week
- Serving multiple LoRA fine-tuned LLMs as one ☆1,025 · Updated 9 months ago
- FlashInfer: Kernel Library for LLM Serving ☆2,030 · Updated this week
- ☆238 · Updated 10 months ago
- Large-scale LLM inference engine ☆1,290 · Updated this week
- FlashAttention (Metal Port) ☆436 · Updated 4 months ago
- Open weights language model from Google DeepMind, based on Griffin. ☆620 · Updated 7 months ago
- A complete end-to-end pipeline for LLM interpretability with sparse autoencoders (SAEs) using Llama 3.2, written in pure PyTorch and full… ☆604 · Updated 2 months ago
- Puzzles for learning Triton ☆1,393 · Updated 3 months ago
- An implementation of bucketMul LLM inference ☆215 · Updated 7 months ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆514 · Updated this week
- PyTorch native quantization and sparsity for training and inference ☆1,835 · Updated this week
- ☆522 · Updated 3 months ago
- A fast inference library for running LLMs locally on modern consumer-class GPUs ☆3,951 · Updated this week
- NanoGPT (124M) in 3 minutes ☆2,278 · Updated this week
- Tutorials on tinygrad ☆341 · Updated last week
- Source code for RULER: What’s the Real Context Size of Your Long-Context Language Models? ☆918 · Updated 2 weeks ago
- A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights. ☆2,823 · Updated last year
- TinyChatEngine: On-Device LLM Inference Library ☆811 · Updated 7 months ago
- llama3.cuda: a pure C/CUDA implementation of the Llama 3 model ☆326 · Updated 8 months ago
- CUDA/Metal accelerated language model inference ☆510 · Updated 2 months ago
- OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training ☆440 · Updated last month
- Multiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference? ☆1,384 · Updated 9 months ago
- Make PyTorch models up to 40% faster! Thunder is a source-to-source compiler for PyTorch. It enables using different hardware executors a… ☆1,280 · Updated this week