XiongjieDai / GPU-Benchmarks-on-LLM-InferenceLinks

Multiple NVIDIA GPUs or Apple Silicon for Large Language Model Inference?

☆1,670

Alternatives and similar repositories for GPU-Benchmarks-on-LLM-Inference

Users that are interested in GPU-Benchmarks-on-LLM-Inference are comparing it to the libraries listed below

Sorting:

aphrodite-engine / aphrodite-engine
Large-scale LLM inference engine
☆1,457Updated last week
turboderp-org / exllamav2
A fast inference library for running LLMs locally on modern consumer-class GPUs
☆4,216Updated 3 weeks ago
lyogavin / airllm
AirLLM 70B inference with single 4GB GPU
☆5,798Updated last month
mostlygeek / llama-swap
Model swapping for llama.cpp (or any local OpenAPI compatible server)
☆969Updated this week
theroyallab / tabbyAPI
The official API server for Exllama. OAI compatible, lightweight, and fast.
☆990Updated this week
vllm-project / llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
☆1,553Updated this week
ikawrakow / ik_llama.cpp
llama.cpp fork with additional SOTA quants and improved performance
☆608Updated this week
b4rtaz / distributed-llama
Connect home devices into a powerful cluster to accelerate LLM inference. More devices means faster inference.
☆2,189Updated last week
mobiusml / hqq
Official implementation of Half-Quadratic Quantization (HQQ)
☆837Updated last week
arcee-ai / mergekit
Tools for merging pretrained large language models.
☆5,853Updated last week
ModelCloud / GPTQModel
Production ready LLM model compression/quantization toolkit with hw accelerated inference support for both cpu/gpu via HF, vLLM, and SGLa…
☆633Updated this week
casper-hansen / AutoAWQ
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:
☆2,196Updated last month
RahulSChand / gpu_poor
Calculate token/s & GPU memory requirement for any LLM. Supports llama.cpp/ggml/bnb/QLoRA quantization
☆1,316Updated 6 months ago
NVIDIA / RULER
This repo contains the source code for RULER: What’s the Real Context Size of Your Long-Context Language Models?
☆1,159Updated this week
turboderp-org / exllamav3
An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs
☆419Updated last week
turboderp / exllama
A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
☆2,883Updated last year
matt-c1 / llama-3-quant-comparison
Comparison of the output quality of quantization methods, using Llama 3, transformers, GGUF, EXL2.
☆154Updated last year
Blaizzy / mlx-vlm
MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) on your Mac using MLX.
☆1,381Updated this week
predibase / lorax
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
☆3,028Updated last month
Maximilian-Winter / llama-cpp-agent
The llama-cpp-agent framework is a tool designed for easy interaction with Large Language Models (LLMs). Allowing users to chat with LLM …
☆572Updated 4 months ago
ray-project / llmperf
LLMPerf is a library for validating and benchmarking LLMs
☆947Updated 6 months ago
ml-explore / mlx-lm
Run LLMs with MLX
☆1,125Updated last week
gkamradt / LLMTest_NeedleInAHaystack
Doing simple retrieval from LLM models at various context lengths to measure accuracy
☆1,904Updated 10 months ago
argilla-io / distilabel
Distilabel is a framework for synthetic data and AI feedback for engineers who need fast, reliable and scalable pipelines based on verifi…
☆2,773Updated this week
microsoft / VPTQ
VPTQ, A Flexible and Extreme low-bit quantization algorithm
☆646Updated 2 months ago
arcee-ai / DistillKit
An Open Source Toolkit For LLM Distillation
☆657Updated 3 weeks ago
foldl / chatllm.cpp
Pure C++ implementation of several models for real-time chatting on your computer (CPU & GPU)
☆634Updated this week
intel / intel-extension-for-transformers
⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Pl…
☆2,167Updated 8 months ago
Sumandora / remove-refusals-with-transformers
Implements harmful/harmless refusal removal using pure HF Transformers
☆903Updated last year
huggingface / text-embeddings-inference
A blazing fast inference solution for text embeddings models
☆3,731Updated this week