EmbeddedLLM / vllm
vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs
☆89 · Updated this week
Related projects
Alternatives and complementary repositories for vllm
- ☆114 · Updated 6 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk · ☆46 · Updated this week
- Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding · ☆70 · Updated last week
- Fast Inference of MoE Models with CPU-GPU Orchestration · ☆170 · Updated last week
- Advanced Quantization Algorithm for LLMs. This is official implementation of "Optimize Weight Rounding via Signed Gradient Descent for t… · ☆245 · Updated this week
- A collection of all available inference solutions for the LLMs · ☆72 · Updated last month
- A toolkit for fine-tuning, inference, and evaluation of GreenBitAI's LLMs · ☆73 · Updated 3 weeks ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs · ☆184 · Updated last month
- A high-throughput and memory-efficient inference and serving engine for LLMs · ☆250 · Updated last month
- [ICLR 2024] Skeleton-of-Thought: Prompting LLMs for Efficient Parallel Generation · ☆144 · Updated 8 months ago
- Code for the paper "QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models" · ☆261 · Updated last year
- PB-LLM: Partially Binarized Large Language Models · ☆146 · Updated 11 months ago
- Code for the paper "QuIP: 2-Bit Quantization of Large Language Models With Guarantees" · ☆347 · Updated 8 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients · ☆172 · Updated 3 months ago
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ, with easy export to ONNX/ONNX Runtime · ☆148 · Updated last month
- KV cache compression for high-throughput LLM inference · ☆82 · Updated last week
- Fast and memory-efficient exact attention · ☆138 · Updated this week
- Repo hosting code and materials for speeding up LLM inference using token merging · ☆29 · Updated 6 months ago
- ☆156 · Updated last month
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization · ☆303 · Updated 2 months ago
- Applied AI experiments and examples for PyTorch · ☆160 · Updated last week
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models · ☆222 · Updated last month
- An efficient implementation of the method proposed in "The Era of 1-bit LLMs"