mani-kantap / llm-inference-solutions
A collection of available inference solutions for LLMs
☆88, updated 3 months ago
Alternatives and similar repositories for llm-inference-solutions
Users interested in llm-inference-solutions are comparing it to the repositories listed below.
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs (☆86, updated this week)
- ☆99, updated this week
- ☆53, updated last year
- Benchmark suite for LLMs from Fireworks.ai (☆75, updated 2 weeks ago)
- A general 2-8 bit quantization toolbox supporting GPTQ/AWQ/HQQ/VPTQ, with easy export to ONNX/ONNX Runtime (☆172, updated 2 months ago)
- Easy and Efficient Quantization for Transformers (☆198, updated 3 months ago)
- Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs (☆317, updated this week)
- A high-throughput and memory-efficient inference and serving engine for LLMs (☆263, updated 7 months ago)
- Inference server benchmarking tool (☆67, updated last month)
- ☆260, updated 2 weeks ago
- ☆53, updated 8 months ago
- Comparison of Language Model Inference Engines (☆217, updated 5 months ago)
- Accelerating your LLM training to full speed! Made with ❤️ by ServiceNow Research (☆200, updated this week)
- A list of LLM benchmark frameworks (☆66, updated last year)
- Repo hosting code and materials on speeding up LLM inference using token merging (☆36, updated last year)
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) (☆98, updated this week)
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration (☆211, updated 6 months ago)
- A safetensors extension to efficiently store sparse quantized tensors on disk (☆117, updated this week)
- ☆93, updated last week
- Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" (☆365, updated last year)
- ☆60, updated 2 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models (☆270, updated last week)
- ☆130, updated 2 months ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" (☆275, updated last year)
- Cray-LM unified training and inference stack (☆22, updated 4 months ago)
- Spherically merge PyTorch/HF-format language models with minimal feature loss (☆123, updated last year)
- Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU. Seamlessly integrated with Torchao, Tra… (☆483, updated this week)
- ☆193, updated 3 weeks ago
- ☆119, updated last year
- Pretrain, finetune and serve LLMs on Intel platforms with Ray (☆127, updated last month)