0xSero/turboquant

TurboQuant: Near-optimal KV cache quantization for LLM inference (3-bit keys, 2-bit values) with Triton kernels + vLLM integration

☆1,683

Alternatives and similar repositories for turboquant

Users who are interested in turboquant are comparing it to the repositories listed below.

mitkox / vllm-turboquant
vLLM TurboQuant
☆610 · Jun 25, 2026 · Updated 3 weeks ago
tonbistudio / turboquant-pytorch
From-scratch PyTorch implementation of Google's TurboQuant (ICLR 2026) for LLM KV cache compression. 5x compression at 3-bit with 99.5% a…
☆1,032 · Apr 23, 2026 · Updated 2 months ago
cksac / turboquant-model
☆200 · Apr 5, 2026 · Updated 3 months ago
TheTom / turboquant_plus
☆6,997 · Updated this week
TheTom / llama-cpp-turboquant
LLM inference in C/C++
☆2,160 · Updated this week
z-lab / dflash
DFlash: Block Diffusion for Flash Speculative Decoding
☆5,504 · May 10, 2026 · Updated 2 months ago
scrya-com / rotorquant
KV cache compression via block-diagonal rotation. Beats TurboQuant: better PPL (6.91 vs 7.07), 28% faster decode, 5.3x faster prefill, 44…
☆1,037 · Apr 23, 2026 · Updated 2 months ago
RyanCodrai / turbovec
A vector index built on TurboQuant, written in Rust with Python bindings
☆13,671 · Updated this week
quantumaikr / quant.cpp
LLM inference with 7x longer context. Pure C, zero dependencies. Lossless KV cache compression + single-header library.
☆394 · Apr 26, 2026 · Updated 2 months ago
vllm-project / vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆86,804 · Updated this week
spiritbuun / buun-llama-cpp
LLAMA Turboquant implementation with CUDA support
☆704 · Updated this week
unslothai / unsloth
Unsloth is a local UI for training and running Gemma 4, Qwen3.6, DeepSeek, Kimi, GLM and other models.
☆68,666 · Updated this week
karpathy / autoresearch
AI agents running research on single-GPU nanochat training automatically
☆91,712 · Mar 26, 2026 · Updated 3 months ago
LMCache / LMCache
LMCache: Supercharge Your LLM with the Fastest KV Cache Layer
☆10,782 · Updated this week
RecursiveIntell / turbo-quant
Rust implementation of TurboQuant, PolarQuant, and QJL - zero-overhead vector quantization for semantic search and KV cache compression (…
☆29 · May 31, 2026 · Updated last month
sybil-solutions / local-studio
Control panel for VLLM, Sglang, llama.cpp, exllamav3
☆1,481 · Updated this week
deepseek-ai / DeepSpec
DeepSpec: a full-stack codebase for training and evaluating speculative decoding algorithms
☆6,719 · Jul 9, 2026 · Updated last week
NousResearch / hermes-agent
The agent that grows with you
☆218,250 · Updated this week
allenai / molmoweb
☆579 · Jun 26, 2026 · Updated 3 weeks ago
microsoft / memento
☆497 · Apr 8, 2026 · Updated 3 months ago
sgl-project / sglang
SGLang is a high-performance serving framework for large language models and multimodal models.
☆30,583 · Updated this week
kyegomez / OpenMythos
A theoretical reconstruction of the Claude Mythos architecture, built from first principles using the available research literature.
☆14,737 · May 23, 2026 · Updated last month
aaif-goose / goose
an open source, extensible AI agent that goes beyond code suggestions - install, execute, edit, and test with any LLM
☆51,405 · Updated this week
NVIDIA / OpenShell
OpenShell is the safe, private runtime for autonomous AI agents.
☆7,719 · Updated this week
lyogavin / airllm
AirLLM 70B inference with single 4GB GPU
☆23,860 · Updated this week
NVIDIA / NemoClaw
Run agents like Hermes, LangChain Deep Agents, and OpenClaw more securely inside NVIDIA OpenShell with managed inference
☆21,867 · Updated this week
OmarHory / turboquant
Open-source implementation of Google's TurboQuant (ICLR 2026) - KV cache compression to 2.5–4 bits with near-zero quality loss. 3.8–5.7x …
☆52 · Mar 29, 2026 · Updated 3 months ago
MiniMax-AI / MSA
☆380 · Jun 15, 2026 · Updated last month
Luce-Org / lucebox
Fast LLM speculative inference server for consumer hardware.
☆2,668 · Updated this week
NVIDIA / Model-Optimizer
A unified library of SOTA model optimization techniques like quantization, distillation, pruning, neural architecture search, speculative…
☆3,278 · Updated this week
elder-plinius / OBLITERATUS
OBLITERATE THE CHAINS THAT BIND YOU
☆7,038 · Jun 17, 2026 · Updated last month
ggml-org / llama.cpp
LLM inference in C/C++
☆121,178 · Updated this week
paperclipai / paperclip
The open-source app everyone uses to manage agents at work
☆74,369 · Updated this week
DevTechJr / turboquant-gpu
☆259 · Apr 5, 2026 · Updated 3 months ago
Graphify-Labs / graphify
Turn any codebase, with its docs, SQL schemas, configs, and PDFs, into a queryable knowledge graph. A /graphify skill for Claude Code, Cu…
☆92,915 · Updated this week
earendil-works / pi
AI agent toolkit: unified LLM API, agent loop, TUI, coding agent CLI
☆74,638 · Updated this week
microsoft / BitNet
Official inference framework for 1-bit LLMs
☆39,768 · Updated this week
WeianMao / triattention
TriAttention: Efficient long reasoning with trigonometric KV cache compression. Enables OpenClaw local deployment on memory-constrained …
☆828 · Jul 14, 2026 · Updated last week
MemPalace / mempalace
The best-benchmarked open-source AI memory system. And it's free.
☆57,558 · Updated this week