Indras-Mirror/llama.cpp-turboq-mtp

Fused TBQ4 Flash Attention + MTP + Shared Tensors for llama.cpp — 82+ tok/s with lossless 4.25 bpv KV cache at 200K context on RTX 4090

☆89

Alternatives and similar repositories for llama.cpp-turboq-mtp

Users who are interested in llama.cpp-turboq-mtp are comparing it to the repositories listed below.

BoFan-tunning / llama.cpp-MTP-TurboQuant
☆142 · Jun 13, 2026 · Updated last month
am17an / llama.cpp
LLM inference in C/C++
☆56 · Updated this week
AtomicBot-ai / atomic-llama-cpp-turboquant
llama.cpp fork with TurboQuant WHT-rotated KV cache & weight compression + Gemma 4 MTP and Qwen 3.6 NextN speculative decoding (+30-50% t…
☆306 · Updated this week
Anbeeld / beellama.cpp
KVarN, KV cache precision tail, low-bit quants in llama.cpp for longer context of better precision in the same VRAM
☆789 · Updated this week
TheTom / llama-cpp-turboquant
LLM inference in C/C++
☆2,152 · Updated this week
johndpope / llama-cpp-turboquant
LLM inference in C/C++
☆64 · May 7, 2026 · Updated 2 months ago
spiritbuun / buun-llama-cpp
LLAMA Turboquant implementation with CUDA support
☆703 · Updated this week
ikawrakow / ik_llama.cpp
llama.cpp fork with additional SOTA quants and improved performance
☆2,943 · Updated this week
test1111111111111112 / llama-cpp-turboquant-gemma4
TurboQuant llama.cpp fork with optimized turbo4 kernels for Gemma 4 D=256/512 heads — lazy K/V, batch decode, warp-cooperative write. 120…
☆35 · Apr 5, 2026 · Updated 3 months ago
devnen / qwen3.6-windows-server
One-click Qwen3.6-27B inference on Windows. 158 tok/s on RTX 5090, 72 tok/s on RTX 3090. Native, no WSL, no Docker, no telemetry.
☆222 · May 14, 2026 · Updated 2 months ago
turbo-tan / llama.cpp-tq3
llama.cpp fork with TQ3_1S/4S CUDA kernels — 3.5-bit WHT quantization achieving Q4s quality at 10% smaller size. Based on RaBitQ-inspired…
☆221 · Jul 6, 2026 · Updated 2 weeks ago
Madreag / turbo3-cuda
LLM inference in C/C++
☆36 · Apr 12, 2026 · Updated 3 months ago
sfeuga / Autodesk-Install-Fedora
Easy install Autodesk products on Fedora
☆12 · Aug 18, 2020 · Updated 5 years ago
charlie12345 / rocmfp4-llama
NEW ROCmfp4 format for llama.cpp
☆135 · Jun 13, 2026 · Updated last month
devnen / vllm-windows
Patched native-Windows build of vLLM. Three Windows-specific fixes (CPU-relay for Gloo, Qwen3 reasoning parser, wildcard model name) on t…
☆26 · May 8, 2026 · Updated 2 months ago
AIFSH / SemiChat-ComfyUI
☆12 · Feb 19, 2025 · Updated last year
Doorman11991 / budget-aware-mcp
Model-agnostic code memory MCP server. Budget-aware graph retrieval for AI agents. Sub-millisecond queries, token budgeting, deterministi…
☆23 · May 18, 2026 · Updated 2 months ago
QuinsZouls / llama-cpp-turboquant
Experimental LLM inference in C/C++
☆39 · May 15, 2026 · Updated 2 months ago
guqiong96 / Lvllm
LvLLM is a special NUMA extension of vllm that makes full use of CPU and memory resources, reduces GPU memory requirements, and features …
☆386 · Updated this week
Luce-Org / lucebox
Fast LLM speculative inference server for consumer hardware.
☆2,668 · Updated this week
BenChaliah / NVFP4-on-4090-vLLM
AdaLLM is an NVFP4-first inference runtime for Ada Lovelace (RTX 4090) with FP8 KV cache and custom decode kernels. This repo targets NVF…
☆135 · Feb 15, 2026 · Updated 5 months ago
ajarellanod / pi-usage-bars
Usage indicator extension for pi with footer status bars and /usage command
☆15 · Apr 17, 2026 · Updated 3 months ago
guqiong96 / lktransformers
The complete NUMA-optimized branch of the ktransformers project
☆25 · Nov 3, 2025 · Updated 8 months ago
seckinbostanci / csharp-dde-client
C# DDE Client for MetaTrader 4 (via Ndde)
☆10 · Jan 1, 2018 · Updated 8 years ago
Nano-Collective / get-md
A fast, lightweight HTML to Markdown converter optimized for LLM consumption. Uses proven parsing libraries to deliver clean, well-struct…
☆77 · Updated this week
localai-org / apex-quant
Adaptive Precision for EXpert Models: MoE-aware mixed-precision quantization
☆396 · May 29, 2026 · Updated last month
taro-antd / taro-antd
A Taro component library based on Ant Design Mobile, with multi-platform support for WeChat Mini Programs, H5, and more
☆11 · Sep 26, 2018 · Updated 7 years ago
dakshjain-1616 / Qwen-Lens-Studio
Multimodal AI studio powered by Qwen3.6-35B-A3B. End-to-end web app exposing visual reasoning, image captioning, and document understandi…
☆27 · Apr 23, 2026 · Updated 2 months ago
yeataro / TD-JSONLiveLink
TouchDesigner > JSONLiveLink > Unreal Engine
☆13 · Mar 31, 2022 · Updated 4 years ago
crashr / brute-llama
Testbench for llama.cpp llama-server
☆15 · Aug 20, 2025 · Updated 11 months ago
arte-fact / llama-monitor
A llamacpp wrapper to manage and monitor your llama server instance over a web ui.
☆21 · Jun 16, 2026 · Updated last month
kh0pper / crow
Modular, agentic framework and MCP platform you self-host. Build and run your own AI agents, connect Claude Code, Claude Desktop, or Curs…
☆20 · Updated this week
Danmoreng / local-qwen3-coder-env
Linux & Powershell scripts to easily set up and run the Qwen 3.5 series locally on Windows and Linux with llama.cpp.
☆90 · Apr 28, 2026 · Updated 2 months ago
damoshen123 / st-immersive-sound
☆16 · Jun 13, 2026 · Updated last month
aivrar / multi-turboquant
Unified KV cache compression for LLM inference — TurboQuant, IsoQuant, PlanarQuant, TriAttention. 10 methods, GPU-validated, multi-GPU pl…
☆24 · Jul 11, 2026 · Updated last week
woheller69 / textweb
A markdown web renderer for AI agents — see the web without screenshots
☆64 · May 21, 2026 · Updated 2 months ago
reversePublic / whatsappShare
☆25 · Apr 28, 2023 · Updated 3 years ago
atomicmilkshake / llama-cpp-turboquant
llama.cpp fork with TurboQuant quantization (turbo2/3/4) and TriAttention GPU-accelerated KV cache pruning. 75 tok/s on Qwen3-8B / RTX 30…
☆42 · Jul 2, 2026 · Updated 2 weeks ago
immohitsen / RAG-Chat
A premium RAG-based AI Assistant built with React and FastAPI. Features efficient document indexing and high-accuracy retrieval-augmented…
☆18 · Jun 3, 2026 · Updated last month