turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

☆3,680

Related projects

Alternatives and complementary repositories for exllamav2

turboderp / exllama
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
☆2,760 Updated last year
AutoGPTQ / AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
☆4,497 Updated last month
marella / ctransformers
Python bindings for the Transformer models implemented in C/C++ using GGML library.
☆1,814 Updated 9 months ago
sgl-project / sglang
SGLang is a fast serving framework for large language models and vision language models.
☆6,127 Updated this week
axolotl-ai-cloud / axolotl
Go ahead and axolotl questions
☆7,930 Updated this week
arcee-ai / mergekit
Tools for merging pretrained large language models.
☆4,816 Updated 2 weeks ago
huggingface / text-generation-inference
Large Language Model Text Generation Inference
☆9,122 Updated this week
abetlen / llama-cpp-python
Python bindings for llama.cpp
☆8,141 Updated this week
predibase / lorax
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
☆2,205 Updated this week
casper-hansen / AutoAWQ
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
☆1,765 Updated this week
bitsandbytes-foundation / bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
☆6,299 Updated this week
qwopqwop200 / GPTQ-for-LLaMa
4-bit quantization of LLaMA using GPTQ
☆2,998 Updated 4 months ago
jzhang38 / TinyLlama
The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
☆7,919 Updated 6 months ago
microsoft / LLMLingua
[EMNLP'23, ACL'24] To speed up LLM inference and enhance LLMs' perception of key information, compress the prompt and KV-Cache, which ach…
☆4,653 Updated this week
PygmalionAI / aphrodite-engine
Large-scale LLM inference engine
☆1,134 Updated this week
huggingface / text-embeddings-inference
A blazing fast inference solution for text embeddings models
☆2,846 Updated 2 weeks ago
mit-han-lab / llm-awq
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
☆2,526 Updated last month
S-LoRA / S-LoRA
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
☆1,755 Updated 9 months ago
IST-DASLab / gptq
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
☆1,941 Updated 7 months ago
artidoro / qlora
QLoRA: Efficient Finetuning of Quantized LLMs
☆10,059 Updated 5 months ago
noamgat / lm-format-enforcer
Enforce the output format (JSON Schema, Regex etc) of a language model
☆1,546 Updated last month
EleutherAI / lm-evaluation-harness
A framework for few-shot evaluation of language models.
☆6,990 Updated this week
ggerganov / ggml
Tensor library for machine learning
☆11,233 Updated this week
young-geng / EasyLM
Large language models (LLMs) made easy: EasyLM is a one-stop solution for pre-training, finetuning, evaluating and serving LLMs in JAX/Fl…
☆2,409 Updated 3 months ago
InternLM / lmdeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
☆4,669 Updated this week
vllm-project / vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆30,423 Updated this week
OpenNMT / CTranslate2
Fast inference engine for Transformer models
☆3,411 Updated this week
pytorch / torchtune
PyTorch native finetuning library
☆4,336 Updated this week
microsoft / Llama-2-Onnx
☆1,022 Updated 10 months ago