turbo-tan/llama.cpp-tq3


llama.cpp fork with TQ3_1S/4S CUDA kernels: 3.5-bit WHT quantization achieving Q4s quality at 10% smaller size. Based on a RaBitQ-inspired Walsh-Hadamard transform (WHT). Enables 27B models on 16GB GPUs at 15 tok/s token generation (TG) and 221 tok/s prompt processing (PP).

☆222

Alternatives and similar repositories for llama.cpp-tq3

Users who are interested in llama.cpp-tq3 are comparing it to the repositories listed below.

TheTom / llama-cpp-turboquant
View on GitHub
LLM inference in C/C++
☆2,203 · Jul 22, 2026 · Updated last week
spiritbuun / buun-llama-cpp
View on GitHub
Experimental llama.cpp fork for inference research and development
☆721 · Updated this week
Anbeeld / beellama.cpp
View on GitHub
KVarN, KV cache precision tail, low-bit quants in llama.cpp for longer context of better precision in the same VRAM
☆818 · Updated this week
AEON-7 / Aeon-Bench-Pod
View on GitHub
Run the AEON Bench suite on your own hardware: verified HuggingFace pull → serve → benchmark (text · agentic ×3 harnesses · vision · audi…
☆21 · Updated this week
AtomicBot-ai / atomic-llama-cpp-turboquant
View on GitHub
llama.cpp fork with TurboQuant WHT-rotated KV cache & weight compression + Gemma 4 MTP and Qwen 3.6 NextN speculative decoding (+30-50% t…
☆314 · Updated this week
localai-org / apex-quant
View on GitHub
Adaptive Precision for EXpert Models: MoE-aware mixed-precision quantization
☆408 · Updated this week
am17an / llama.cpp
View on GitHub
LLM inference in C/C++
☆56 · Updated this week
Luce-Org / lucebox
View on GitHub
Fast LLM speculative inference server for consumer hardware.
☆2,694 · Updated this week
ikawrakow / ik_llama.cpp
View on GitHub
llama.cpp fork with additional SOTA quants and improved performance
☆2,969 · Updated this week
stevibe / local-screen-agent
View on GitHub
☆68 · Jun 4, 2026 · Updated last month
QuinsZouls / llama-cpp-turboquant
View on GitHub
Experimental LLM inference in C/C++
☆40 · May 15, 2026 · Updated 2 months ago
andthattoo / structured-cot
View on GitHub
Structured Chain-of-Thought
☆219 · May 16, 2026 · Updated 2 months ago
scrya-com / rotorquant
View on GitHub
KV cache compression via block-diagonal rotation. Beats TurboQuant: better PPL (6.91 vs 7.07), 28% faster decode, 5.3x faster prefill, 44…
☆1,041 · Apr 23, 2026 · Updated 3 months ago
caiovicentino / polarengine-vllm
View on GitHub
PolarEngine: vLLM plugin for PolarQuant quantized LLM inference — 75% FP16 speed at 2.3x less VRAM
☆34 · Apr 13, 2026 · Updated 3 months ago
johndpope / llama-cpp-turboquant
View on GitHub
LLM inference in C/C++
☆64 · May 7, 2026 · Updated 2 months ago
r0b0tlab / hermes-concurrent-agents
View on GitHub
Deploy concurrent Hermes Agent workers on unified-memory GPUs (GB10, DGX Spark) for maximum total tok/s. Profile-isolated, kanban-coordin…
☆75 · Jul 18, 2026 · Updated last week
anpaure / cp_eval
View on GitHub
Tiny evaluation of leading LLMs on competitive programming problems
☆14 · Apr 10, 2026 · Updated 3 months ago
Luce-Org / lucebox-ggml
View on GitHub
VENDORIZED in lucebox-hub. Fork of llama.cpp, ggml graph for lucebox inference engine
☆31 · Jul 8, 2026 · Updated 3 weeks ago
TheTom / turboquant_plus
View on GitHub
☆7,004 · Jul 20, 2026 · Updated last week
Bent-Solutions / hermes-bench
View on GitHub
Local benchmarking UI for LLMs and AI agents
☆21 · Apr 13, 2026 · Updated 3 months ago
z-lab / paroquant
View on GitHub
[ICLR 2026] ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference
☆326 · Jul 1, 2026 · Updated 3 weeks ago
caiovicentino / eoq-quantization
View on GitHub
EOQ: Entropy-Optimal Quantization for LLMs. 11-41% smaller than GGUF Q4_K_M with near-FP16 perplexity.
☆46 · Mar 31, 2026 · Updated 3 months ago
sybil-solutions / local-studio
View on GitHub
Control panel for vLLM, SGLang, llama.cpp, exllamav3
☆1,514 · Updated this week
z-lab / dflash
View on GitHub
DFlash: Block Diffusion for Flash Speculative Decoding
☆5,547 · May 10, 2026 · Updated 2 months ago
AEON-7 / Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash
View on GitHub
Fully uncensored, capability-enhanced abliteration of Qwen3.6-27B. NVFP4 + z-lab DFlash speculative decoding (n=12) on the unified ghcr.i…
☆425 · Jul 3, 2026 · Updated 3 weeks ago
aivrar / multi-turboquant
View on GitHub
Unified KV cache compression for LLM inference — TurboQuant, IsoQuant, PlanarQuant, TriAttention. 10 methods, GPU-validated, multi-GPU pl…
☆24 · Updated this week
unarbos / distil
View on GitHub
Distil SN97 — Competitive Model Distillation on Bittensor
☆35 · May 20, 2026 · Updated 2 months ago
stevibe / BenchLocal
View on GitHub
Test LLMs on real tasks. Compare models side-by-side.
☆387 · Jun 16, 2026 · Updated last month
noonghunna / club-3090
View on GitHub
Community recipes for serving LLMs on RTX 3090/4090/5090 CUDA GPUs. Multi-engine (vLLM, llama.cpp, ik_llama) and model-agnostic. Currentl…
☆1,827 · Updated this week
shea256 / autofoundry
View on GitHub
A CLI tool that automates provisioning GPUs across cloud providers and running AI experiments on them
☆23 · Jun 1, 2026 · Updated last month
outsourc-e / qwen36-4090-recipes
View on GitHub
Reproducible llama.cpp configs + per-category quality benches for Qwen3.6-27B on a single RTX 4090. Winners, dead ends, and the silent-co…
☆23 · Apr 26, 2026 · Updated 3 months ago
Dogacel / Attention-Drift
View on GitHub
Code for the paper *Attention Drift: What Speculative Decoding Models Learn*.
☆28 · May 12, 2026 · Updated 2 months ago
ousiaresearch / Risomorphism-1911
View on GitHub
High-fidelity ASCII art pipeline — edge-aware downsampling, video eikons, Risomorphism 1911 aesthetic
☆30 · May 15, 2026 · Updated 2 months ago
MaximeRivest / brepl
View on GitHub
Universal REPL Bridge for LLMs - Tab completion, interactive prompts, TUI support
☆20 · Nov 24, 2025 · Updated 8 months ago
PrismML-Eng / llama.cpp
View on GitHub
LLM inference in C/C++
☆429 · Jul 22, 2026 · Updated last week
turboderp-org / exllamav3
View on GitHub
An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs
☆1,073 · Updated this week
0xSero / reap-expert-swap
View on GitHub
How many experts do we need to serve a model?
☆152 · Mar 18, 2026 · Updated 4 months ago
dspy-community / dspy-session
View on GitHub
☆27 · Feb 26, 2026 · Updated 5 months ago
Archelunch / dspy-repl
View on GitHub
☆46 · Feb 20, 2026 · Updated 5 months ago