mani-kantap / llm-inference-solutions
A collection of all available inference solutions for LLMs
☆81 · Updated 2 weeks ago
Alternatives and similar repositories for llm-inference-solutions:
Users interested in llm-inference-solutions are comparing it to the libraries listed below.
- Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs ☆212 · Updated last week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆262 · Updated 5 months ago
- Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models ☆225 · Updated 10 months ago
- Easy and Efficient Quantization for Transformers ☆192 · Updated last month
- Training-free Post-training Efficient Sub-quadratic Complexity Attention. Implemented with OpenAI Triton. ☆126 · Updated this week
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆87 · Updated this week
- ☆237 · Updated last week
- ☆53 · Updated 9 months ago
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration ☆200 · Updated 4 months ago
- Comparison of Language Model Inference Engines ☆208 · Updated 3 months ago
- ☆54 · Updated 6 months ago
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ☆289 · Updated last month
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ☆196 · Updated 8 months ago
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆109 · Updated 3 months ago
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆116 · Updated last year
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆335 · Updated 7 months ago
- Benchmark suite for LLMs from Fireworks.ai ☆69 · Updated last month
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ and easy export to ONNX/ONNX Runtime ☆162 · Updated last week
- A toolkit for fine-tuning, inference, and evaluation of GreenBitAI's LLMs ☆79 · Updated last week
- The official repo for "LLoCo: Learning Long Contexts Offline" ☆115 · Updated 9 months ago
- Pretrain, fine-tune, and serve LLMs on Intel platforms with Ray ☆121 · Updated 3 weeks ago
- Experiments with inference on Llama ☆104 · Updated 9 months ago
- ☆193 · Updated 3 months ago
- FP16xINT4 LLM inference kernel that achieves near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens ☆763 · Updated 6 months ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆298 · Updated 8 months ago
- Advanced Quantization Algorithm for LLMs/VLMs. ☆391 · Updated this week
- Google TPU optimizations for transformers models ☆103 · Updated last month
- GPTQLoRA: Efficient Finetuning of Quantized LLMs with GPTQ ☆99 · Updated last year
- Unofficial implementation of https://arxiv.org/pdf/2407.14679 ☆44 · Updated 6 months ago
- ☆117 · Updated 10 months ago