EmbeddedLLM / vllm
vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs
☆87 · Updated this week
Alternatives and similar repositories for vllm:
Users interested in vllm are comparing it to the libraries listed below.
- A high-throughput and memory-efficient inference and serving engine for LLMs · ☆262 · Updated 5 months ago
- ☆116 · Updated 11 months ago
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration · ☆202 · Updated 4 months ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" · ☆272 · Updated last year
- ☆205 · Updated 2 months ago
- A toolkit for fine-tuning, inferencing, and evaluating GreenBitAI's LLMs · ☆81 · Updated 3 weeks ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization · ☆338 · Updated 7 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs · ☆236 · Updated last month
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding · ☆111 · Updated 4 months ago
- PB-LLM: Partially Binarized Large Language Models · ☆152 · Updated last year
- Boosting 4-bit inference kernels with 2:4 Sparsity · ☆72 · Updated 6 months ago
- A general 2-8 bits quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and export to onnx/onnx-runtime easily · ☆165 · Updated 3 weeks ago
- Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" · ☆362 · Updated last year
- KV cache compression for high-throughput LLM inference · ☆119 · Updated last month
- Inference server benchmarking tool · ☆38 · Updated this week
- A collection of all available inference solutions for the LLMs · ☆83 · Updated last month
- Fast low-bit matmul kernels in Triton · ☆275 · Updated this week
- Benchmark suite for LLMs from Fireworks.ai · ☆70 · Updated last month
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs) · ☆54 · Updated this week
- Easy and Efficient Quantization for Transformers · ☆195 · Updated last month
- Cray-LM unified training and inference stack · ☆21 · Updated 2 months ago
- LLM Serving Performance Evaluation Harness · ☆73 · Updated last month
- Fast and memory-efficient exact attention · ☆163 · Updated this week
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs" · ☆155 · Updated 5 months ago
- ☆113 · Updated last week
- GPTQ inference Triton kernel · ☆300 · Updated last year
- A safetensors extension to efficiently store sparse quantized tensors on disk · ☆92 · Updated this week
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving · ☆301 · Updated 9 months ago
- ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization · ☆105 · Updated 5 months ago
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry · ☆40 · Updated last year