An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs
☆637 · Feb 16, 2026 · Updated 2 weeks ago
Alternatives and similar repositories for exllamav3
Users interested in exllamav3 are comparing it to the libraries listed below.
- The official API server for Exllama. OAI compatible, lightweight, and fast (see the request sketch after this list). ☆1,139 · Feb 9, 2026 · Updated 3 weeks ago
- A fast inference library for running LLMs locally on modern consumer-class GPUs ☆4,444 · Dec 9, 2025 · Updated 2 months ago
- ☆92 · Dec 9, 2025 · Updated 2 months ago
- Web UI for ExLlamaV2 ☆512 · Feb 5, 2025 · Updated last year
- llama.cpp fork with additional SOTA quants and improved performance ☆1,696 · Updated this week
- ☆165 · Jun 22, 2025 · Updated 8 months ago
- Produce your own Dynamic 3.0 Quants and achieve optimum accuracy & SOTA quantization performance! Input your VRAM and RAM and the toolcha… ☆79 · Feb 22, 2026 · Updated last week
- Large-scale LLM inference engine ☆1,658 · Feb 17, 2026 · Updated 2 weeks ago
- ☆72 · Jun 20, 2025 · Updated 8 months ago
- Prompt Jinja2 templates for LLMs ☆35 · Jul 9, 2025 · Updated 7 months ago
- ik_llama.cpp's Thireus fork with release builds for macOS/Windows/Ubuntu CPU, Vulkan and CUDA ☆63 · Updated this week
- Croco.Cpp is a fork of KoboldCPP inferring GGML/GGUF models on CPU/CUDA with KoboldAI's UI. It's powered partly by IK_LLama.cpp, and compati… ☆158 · Feb 25, 2026 · Updated last week
- Yet Another (LLM) Web UI, made with Gemini ☆12 · Dec 25, 2024 · Updated last year
- Transplants vocabulary between language models, enabling the creation of draft models for speculative decoding WITHOUT retraining. ☆49 · Oct 29, 2025 · Updated 4 months ago
- ☆63 · Jul 10, 2025 · Updated 7 months ago
- A simple Gradio WebUI for loading/unloading models and loras in tabbyAPI. ☆20 · Nov 21, 2024 · Updated last year
- ☆53 · Oct 10, 2025 · Updated 4 months ago
- A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights. ☆2,913 · Sep 30, 2023 · Updated 2 years ago
- LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi… ☆1,028 · Updated this week
- Official implementation of Half-Quadratic Quantization (HQQ) ☆913 · Dec 18, 2025 · Updated 2 months ago
- Modified Beam Search with periodic restarts ☆12 · Sep 12, 2024 · Updated last year
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆2,787 · Updated this week
- Reliable model swapping for any local OpenAI/Anthropic compatible server - llama.cpp, vllm, etc. ☆2,506 · Updated this week
- A multimodal, function calling powered LLM webui. ☆215 · Sep 23, 2024 · Updated last year
- 🎯 An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality degradation across Weight-Only Quantiza… ☆853 · Updated this week
- Customizable implementation of the self-instruct paper. ☆1,049 · Mar 7, 2024 · Updated last year
- Optimizing inference proxy for LLMs ☆3,352 · Jan 28, 2026 · Updated last month
- Lightweight OpenAI-compatible serving with continuous batching, using HuggingFace Transformers, including T5 and Whisper. ☆29 · Mar 15, 2025 · Updated 11 months ago
- Official PyTorch repository for Extreme Compression of Large Language Models via Additive Quantization https://arxiv.org/pdf/2401.06118.p… ☆1,315 · Aug 8, 2025 · Updated 6 months ago
- Run GGUF models easily with a KoboldAI UI. One File. Zero Install. ☆9,594 · Updated this week
- An OpenAI API compatible LLM inference server based on ExLlamaV2. ☆25 · Feb 9, 2024 · Updated 2 years ago
- REAP: Router-weighted Expert Activation Pruning for SMoE compression ☆270 · Dec 9, 2025 · Updated 2 months ago
- Interface for OuteTTS models. ☆1,426 · Jun 21, 2025 · Updated 8 months ago
- A fast batching API to serve LLM models ☆189 · Apr 26, 2024 · Updated last year
- A general 2-8 bits quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and export to onnx/onnx-runtime easily. ☆184 · Apr 2, 2025 · Updated 11 months ago
- LLM Frontend in a single html file ☆701 · Dec 27, 2025 · Updated 2 months ago
- Llama.cpp runner/swapper and proxy that emulates LMStudio / Ollama backends ☆52 · Aug 21, 2025 · Updated 6 months ago
- A stable, fast and easy-to-use inference library with a focus on a sync-to-async API ☆48 · Sep 26, 2024 · Updated last year
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆674 · Apr 25, 2025 · Updated 10 months ago
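
Several of the entries above (the Exllama API server, llama-swap, and others) expose an OpenAI-compatible HTTP API, so a single client works against any of them. A minimal sketch in Python; the base URL, API key, and model name here are assumptions to adjust for whatever your local server actually exposes:

```python
# Minimal sketch of querying an OpenAI-compatible local server from the
# list above. BASE_URL, the API key, and the model name are placeholders;
# substitute the values your server is configured with.
import requests

BASE_URL = "http://localhost:5000/v1"  # hypothetical local endpoint

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": "Bearer sk-local"},  # many local servers accept any key
    json={
        "model": "my-local-model",  # placeholder; use a model your server has loaded
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```

The same request shape also works through the official `openai` Python client by pointing its `base_url` at the local server, which is the usual way these servers are wired into existing tooling.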