turboderp-org / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

☆4,364

Alternatives and similar repositories for exllamav2

Users who are interested in exllamav2 are comparing it to the libraries listed below.

turboderp / exllama
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
☆2,905 Updated 2 years ago
aphrodite-engine / aphrodite-engine
Large-scale LLM inference engine
☆1,596 Updated this week
AutoGPTQ / AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
☆4,989 Updated 7 months ago
marella / ctransformers
Python bindings for the Transformer models implemented in C/C++ using GGML library.
☆1,876 Updated last year
casper-hansen / AutoAWQ
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
☆2,272 Updated 6 months ago
abetlen / llama-cpp-python
Python bindings for llama.cpp
☆9,764 Updated 3 months ago
theroyallab / tabbyAPI
The official API server for Exllama. OAI compatible, lightweight, and fast.
☆1,090 Updated this week
arcee-ai / mergekit
Tools for merging pretrained large language models.
☆6,468 Updated 3 weeks ago
intel / intel-extension-for-transformers
⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl…
☆2,167 Updated last year
qwopqwop200 / GPTQ-for-LLaMa
4-bit quantization of LLaMA using GPTQ
☆3,076 Updated last year
MeetKai / functionary
Chat language model that can use tools and interpret the results
☆1,586 Updated last week
OpenNMT / CTranslate2
Fast inference engine for Transformer models
☆4,154 Updated this week
bitsandbytes-foundation / bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
☆7,767 Updated this week
S-LoRA / S-LoRA
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
☆1,868 Updated last year
predibase / lorax
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
☆3,533 Updated 6 months ago
huggingface / text-generation-inference
Large Language Model Text Generation Inference
☆10,656 Updated this week
mit-han-lab / llm-awq
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
☆3,347 Updated 4 months ago
axolotl-ai-cloud / axolotl
Go ahead and axolotl questions
☆10,842 Updated this week
SJTU-IPADS / PowerInfer
High-speed Large Language Model Serving for Local Deployment
☆8,409 Updated 3 months ago
jondurbin / airoboros
Customizable implementation of the self-instruct paper.
☆1,050 Updated last year
IST-DASLab / gptq
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
☆2,221 Updated last year
deepspeedai / DeepSpeed-MII
MII makes low-latency and high-throughput inference possible, powered by DeepSpeed.
☆2,077 Updated 4 months ago
FasterDecoding / Medusa
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
☆2,664 Updated last year
meta-pytorch / gpt-fast
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
☆6,152 Updated 3 months ago
NVIDIA / RULER
This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?
☆1,372 Updated last week
ModelTC / LightLLM
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalabili…
☆3,730 Updated this week
RWKV / rwkv.cpp
INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
☆1,554 Updated 8 months ago
noamgat / lm-format-enforcer
Enforce the output format (JSON Schema, Regex etc) of a language model
☆1,958 Updated 2 months ago
ggml-org / ggml
Tensor library for machine learning
☆13,575 Updated last week
e-p-armstrong / augmentoolkit
Create Custom LLMs
☆1,774 Updated 2 weeks ago