tinygrad / open-gpu-kernel-modulesLinks

NVIDIA Linux open GPU with P2P support

☆1,200

Alternatives and similar repositories for open-gpu-kernel-modules

Users that are interested in open-gpu-kernel-modules are comparing it to the libraries listed below

Sorting:

mikex86 / LibreCuda
☆1,045Updated 2 months ago
NousResearch / DisTrO
Distributed Training Over-The-Internet
☆951Updated 2 months ago
mirage-project / mirage
Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA
☆1,647Updated last week
HazyResearch / ThunderKittens
Tile primitives for speedy kernels
☆2,555Updated last week
turboderp-org / exllamav2
A fast inference library for running LLMs locally on modern consumer-class GPUs
☆4,258Updated 3 weeks ago
ikawrakow / ik_llama.cpp
llama.cpp fork with additional SOTA quants and improved performance
☆964Updated last week
tinygrad / 7900xtx
☆449Updated 4 months ago
PrimeIntellect-ai / OpenDiloco
OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training
☆521Updated 6 months ago
intentee / paddler
Stateful load balancer custom-tailored for llama.cpp 🏓🦙
☆800Updated last week
mobiusml / hqq
Official implementation of Half-Quadratic Quantization (HQQ)
☆856Updated this week
aphrodite-engine / aphrodite-engine
Large-scale LLM inference engine
☆1,502Updated this week
Cornell-RelaxML / quip-sharp
☆549Updated 9 months ago
tspeterkim / flash-attention-minimal
Flash Attention in ~100 lines of CUDA (forward pass only)
☆897Updated 7 months ago
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆870Updated 11 months ago
philipturner / metal-flash-attention
FlashAttention (Metal Port)
☆514Updated 10 months ago
punica-ai / punica
Serving multiple LoRA finetuned LLM as one
☆1,082Updated last year
mlecauchois / micrograd-cuda
☆249Updated last year
PrimeIntellect-ai / prime
prime is a framework for efficient, globally distributed training of AI models over the internet.
☆786Updated 2 months ago
microsoft / VPTQ
VPTQ, A Flexible and Extreme low-bit quantization algorithm
☆648Updated 3 months ago
flashinfer-ai / flashinfer
FlashInfer: Kernel Library for LLM Serving
☆3,474Updated this week
google-deepmind / recurrentgemma
Open weights language model from Google DeepMind, based on Griffin.
☆645Updated 2 months ago
aikitoria / open-gpu-kernel-modules
NVIDIA Linux open GPU with P2P support
☆28Updated last week
Juice-Labs / Juice-Labs
Juice Community Version Public Release
☆598Updated 3 months ago
turboderp-org / exllamav3
An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs
☆466Updated this week
AnswerDotAI / fsdp_qlora
Training LLMs with QLoRA + FSDP
☆1,526Updated 9 months ago
zeux / calm
CUDA/Metal accelerated language model inference
☆599Updated 2 months ago
mit-han-lab / TinyChatEngine
TinyChatEngine: On-Device LLM Inference Library
☆882Updated last year
PaulPauls / llama3_interpretability_sae
A complete end-to-end pipeline for LLM interpretability with sparse autoencoders (SAEs) using Llama 3.2, written in pure PyTorch and full…
☆622Updated 4 months ago
turboderp / exllama
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
☆2,894Updated last year
likejazz / llama3.np
llama3.np is a pure NumPy implementation for Llama 3 model.
☆987Updated 3 months ago