huawei-csl/KVarN

KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.

☆440

Alternatives and similar repositories for KVarN

Users who are interested in KVarN are comparing it to the libraries listed below.

huawei-csl / GENIAL
Code for the Paper GENIAL: Generative Design Space Exploration via Network Inversion for Low Power Algorithmic Logic Units
☆23 • May 22, 2026 • Updated last month
huawei-csl / pto-dsl
Pythonic interface and JIT compiler for https://gitcode.com/cann/pto-isa
☆27 • Jun 1, 2026 • Updated last month
huawei-csl / AC-LoRA
Welcome to the official repository of AC-LORA: (Almost) Training-Free Access Control-Aware Multi-Modal LLMs, a mechanism that provides tr…
☆21 • Nov 14, 2025 • Updated 8 months ago
huawei-csl / spire-hdl
Spire is a Python embedded domain-specific language (DSL) for RTL generation. Its built-in optimizations reduce area and delay of circuit…
☆24 • Updated this week
huawei-csl / SINQ
Welcome to the official repository of SINQ! A novel, fast and high-quality quantization method designed to make any Large Language Model …
☆625 • May 8, 2026 • Updated 2 months ago
chiennv2000 / orthrus
Fast, lossless LLM inference via dual-view diffusion decoding.
☆460 • May 18, 2026 • Updated 2 months ago
Anbeeld / beellama.cpp
KVarN, KV cache precision tail, low-bit quants in llama.cpp for longer context of better precision in the same VRAM
☆789 • Updated this week
eml-eda / match
☆35 • Jul 13, 2026 • Updated last week
Luce-Org / lucebox
Fast LLM speculative inference server for consumer hardware.
☆2,668 • Updated this week
phuongncn / asus-gx10-qwen35-speed-hack
4-5x faster Qwen3.5 on ASUS GX10 / DGX Spark — Hybrid INT4+FP8 + MTP via one shell script
☆31 • Apr 16, 2026 • Updated 3 months ago
lightseekorg / tokenspeed
TokenSpeed is a speed-of-light LLM inference engine.
☆1,638 • Updated this week
eml-eda / eden
Efficient Decision tree Ensembles library for IoT edge nodes
☆16 • Jan 29, 2025 • Updated last year
z-lab / dflash
DFlash: Block Diffusion for Flash Speculative Decoding
☆5,500 • May 10, 2026 • Updated 2 months ago
Anemll / ds4-ssd
DeepSeek V4 Flash specific inference engine. SSD MoE expert paging (slot-bank) + disk KV cache for long agent sessions. Metal-first, narr…
☆29 • Updated this week
JustVugg / nanoeuler
GPT-2-style LLM built from scratch in C/CUDA with hand-written backprop, BPE tokenizer, FlashAttention, pretraining, and SFT.
☆98 • Jun 18, 2026 • Updated last month
cactus-compute / needle
26m agentic model for tiny devices
☆3,225 • Updated this week
cosdt / vllm-ascend
See vLLM official support: https://github.com/vllm-project/vllm-ascend
☆11 • Feb 5, 2025 • Updated last year
deepseek-ai / DeepSpec
DeepSpec: a full-stack codebase for training and evaluating speculative decoding algorithms
☆6,702 • Jul 9, 2026 • Updated last week
SwaggasDeCatas / emuThreeDS
World's first Nintendo 3DS emulator for Apple devices based on Citra.
☆18 • Apr 7, 2023 • Updated 3 years ago
noonghunna / club-3090
Community recipes for serving LLMs on RTX 3090/4090/5090 CUDA GPUs. Multi-engine (vLLM, llama.cpp, ik_llama) and model-agnostic. Currentl…
☆1,751 • Updated this week
ghetea-patrick / riscrithm
Riscrithm is a lightweight, low-boilerplate macro-assembly dialect that compiles straight down to pure, human-readable RISC-V assembly. I…
☆24 • Updated this week
spiritbuun / buun-llama-cpp
LLAMA Turboquant implementation with CUDA support
☆703 • Updated this week
Algebraic-Programming / OneStopParallel
A collection of optimal and heuristic scheduling tools
☆17 • Apr 24, 2026 • Updated 2 months ago
Algebraic-Programming / ALP
Home of ALP/GraphBLAS and ALP/Pregel, featuring shared- and distributed-memory auto-parallelisation of linear algebraic and vertex-centri…
☆33 • Apr 2, 2026 • Updated 3 months ago
stanford-oval / sliders
Repository for paper: Contexts are Never Long Enough: Structured Reasoning for Scalable Question Answering over Long Document Sets
☆27 • Apr 27, 2026 • Updated 2 months ago
Zyora-Dev / zse
The inference engine the open-source world built for itself.
☆153 • Jun 13, 2026 • Updated last month
hao-ai-lab / JetSpec
JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Causal Parallel Tree Drafting
☆163 • Jun 27, 2026 • Updated 3 weeks ago
QuinsZouls / llama-cpp-turboquant
Experimental LLM inference in C/C++
☆39 • May 15, 2026 • Updated 2 months ago
LMCache / LMCache
LMCache: Supercharge Your LLM with the Fastest KV Cache Layer
☆10,732 • Updated this week
teamchong / turboquant-wasm
TurboQuant WASM SIMD vector compression — 3 bits/dim with fast dot product. Requires relaxed SIMD (Chrome 114+, Firefox 128+, Safari 18+,…
☆322 • Apr 19, 2026 • Updated 3 months ago
katanemo / plano
Plano is an AI-native proxy server and data plane for agentic apps. Smart LLM routing, observability, agent orchestration, and guardrails…
☆6,871 • Updated this week
scrya-com / rotorquant
KV cache compression via block-diagonal rotation. Beats TurboQuant: better PPL (6.91 vs 7.07), 28% faster decode, 5.3x faster prefill, 44…
☆1,037 • Apr 23, 2026 • Updated 2 months ago
skyne98 / wiki-gfx906
A database of knowledge around inference & training on GFX906 GPUs https://skyne98.github.io/wiki-gfx906/
☆15 • Feb 21, 2026 • Updated 5 months ago
jungledesh / profile
A physics-grounded, cost-aware optimizer for vLLM.
☆54 • Updated this week
alrevuelta / rs-merkle-tree
Merkle tree implementation in Rust with configurable storage backends and hash functions. Fixed depth and incremental only. Optimized for…
☆228 • Jun 15, 2026 • Updated last month
antirez / ds4
DeepSeek 4 Flash and PRO local inference engine for Metal, CUDA and ROCm
☆18,904 • Updated this week
jmaczan / tiny-vllm
Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM
☆938 • Jul 2, 2026 • Updated 2 weeks ago
microsoft / pg_durable
PostgreSQL in-database durable execution
☆2,659 • Updated this week
Avarok-Cybersecurity / atlas
Pure Rust Inference Engine
☆606 • Updated this week