tonbistudio/turboquant-pytorch

From-scratch PyTorch implementation of Google's TurboQuant (ICLR 2026) for LLM KV cache compression. 5x compression at 3-bit with 99.5% attention fidelity.

☆1,027

Alternatives and similar repositories for turboquant-pytorch

Users interested in turboquant-pytorch are comparing it to the repositories listed below.

0xSero / turboquant
TurboQuant: Near-optimal KV cache quantization for LLM inference (3-bit keys, 2-bit values) with Triton kernels + vLLM integration
☆1,662 · Mar 27, 2026 · Updated 3 months ago
RecursiveIntell / turbo-quant
Rust implementation of TurboQuant, PolarQuant, and QJL — zero-overhead vector quantization for semantic search and KV cache compression (…
☆29 · May 31, 2026 · Updated last month
TheTom / turboquant_plus
☆6,992 · Jun 26, 2026 · Updated 2 weeks ago
mitkox / vllm-turboquant
vLLM TurboQuant
☆610 · Jun 25, 2026 · Updated 2 weeks ago
z-lab / dflash
DFlash: Block Diffusion for Flash Speculative Decoding
☆5,468 · May 10, 2026 · Updated 2 months ago
TheTom / llama-cpp-turboquant
LLM inference in C/C++
☆2,097 · Updated this week
danveloper / flash-moe
Running a big model on a small laptop
☆3,980 · Mar 19, 2026 · Updated 3 months ago
Luce-Org / lucebox
Fast LLM speculative inference server for consumer hardware.
☆2,656 · Updated this week
scrya-com / rotorquant
KV cache compression via block-diagonal rotation. Beats TurboQuant: better PPL (6.91 vs 7.07), 28% faster decode, 5.3x faster prefill, 44…
☆1,027 · Apr 23, 2026 · Updated 2 months ago
OnlyTerp / kvtc
First open-source KVTC implementation (NVIDIA, ICLR 2026) -- 8-32x KV cache compression via PCA + adaptive quantization + entropy coding
☆21 · Apr 17, 2026 · Updated 2 months ago
RyanCodrai / turbovec
A vector index built on TurboQuant, written in Rust with Python bindings
☆12,690 · Jun 10, 2026 · Updated last month
Dynamis-Labs / spectralquant
SpectralQuant: Calibrated Eigenbasis Rotation and Water-Filled Bit Allocation for KV-Cache Compression
☆197 · May 15, 2026 · Updated last month
huggingface / ml-intern
🤗 ml-intern: an open-source ML engineer that reads papers, trains models, and ships ML models
☆10,648 · Updated this week
kyegomez / OpenMythos
A theoretical reconstruction of the Claude Mythos architecture, built from first principles using the available research literature.
☆14,681 · May 23, 2026 · Updated last month
RightNow-AI / autokernel
Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels.
☆1,458 · Mar 19, 2026 · Updated 3 months ago
botirkhaltaev / turboquant
Library for Google's Turboquant Algorithm
☆67 · Mar 29, 2026 · Updated 3 months ago
CaChiJ / kakao-navigation-mcp-server
Kakao Mobility MCP Server for directions and transit information
☆11 · Sep 14, 2025 · Updated 10 months ago
lucienhuangfu / eLLM
eLLM can infer LLM on CPUs faster than on GPUs
☆427 · Updated this week
deepseek-ai / DeepSpec
DeepSpec: a full-stack codebase for training and evaluating speculative decoding algorithms
☆6,638 · Updated this week
lucidrains / lbm-training-framework
Training framework for Large Behavioral Models
☆28 · Sep 17, 2025 · Updated 9 months ago
stanford-iris-lab / meta-harness-tbench2-artifact
Meta-Harness: 76.4% on Terminal-Bench 2.0 (Claude Opus 4.6)
☆1,142 · Mar 26, 2026 · Updated 3 months ago
varjoranta / turboquant-vllm
TurboQuant+ KV cache compression for vLLM. 3.8x smaller KV cache, same conversation quality. Fused CUDA kernels with automatic PyTorch fa…
☆74 · Updated this week
antirez / ds4
DeepSeek 4 Flash and PRO local inference engine for Metal, CUDA and ROCm
☆18,521 · Jul 3, 2026 · Updated last week
Dao-AILab / flash-attention
Fast and memory-efficient exact attention
☆24,452 · Updated this week
JustVugg / colibri
Run GLM-5.2 (744B MoE) on a 25GB-RAM consumer machine — pure C, zero deps, experts streamed from disk. Tiny engine, immense model. 🐦
☆11,798 · Updated this week
facebookresearch / HyperAgents
Self-referential self-improving agents that can optimize for any computable task
☆2,637 · May 9, 2026 · Updated 2 months ago
lightseekorg / tokenspeed
TokenSpeed is a speed-of-light LLM inference engine.
☆1,589 · Updated this week
UMass-Embodied-AGI / CommVQ
[ICML 2025] CommVQ: Commutative Vector Quantization for KV Cache Compression
☆27 · Sep 2, 2025 · Updated 10 months ago
ultraworkers / claw-code
An agent-managed museum exhibit, built in Rust with Gajae-Code / LazyCodex — developed and maintained with no human intervention.
☆194,754 · Jun 26, 2026 · Updated 2 weeks ago
NousResearch / hermes-agent
The agent that grows with you
☆214,747 · Updated this week
kakao / FunctionChat-Bench
☆119 · Feb 25, 2026 · Updated 4 months ago
AbdelStark / turboquant
Rust implementation of Google's TurboQuant algorithm for vector quantization
☆36 · Mar 25, 2026 · Updated 3 months ago
cactus-compute / needle
26m function call model that runs on incredibly small devices
☆3,042 · Jul 1, 2026 · Updated last week
geekjourneyx / awesome-ai-video-prompts
A curated collection of AI video prompting resources, featuring official guides, prompt templates, cinematic techniques, and audio-vi…
☆65 · Jan 4, 2026 · Updated 6 months ago
deepseek-ai / TileKernels
A kernel library written in tilelang
☆1,642 · Apr 23, 2026 · Updated 2 months ago
MemPalace / mempalace
The best-benchmarked open-source AI memory system. And it's free.
☆57,326 · Updated this week
HKUDS / OpenSpace
"OpenSpace: The Quality-First Skill Hub for AI Agents" -- https://open-space.cloud/
☆6,732 · Updated this week
vllm-project / vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆86,251 · Updated this week
NVIDIA / Model-Optimizer
A unified library of SOTA model optimization techniques like quantization, distillation, pruning, neural architecture search, speculative…
☆3,230 · Updated this week