snowflakedb / ArcticInference
ArcticInference: vLLM plugin for high-throughput, low-latency inference
☆203 · Updated this week
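ArcticInference is distributed as a vLLM plugin. Below is a minimal usage sketch, assuming the plugin is published on PyPI (e.g. `pip install arctic-inference`) and auto-registered through vLLM's plugin mechanism once installed; the model name and sampling settings are illustrative assumptions, not ArcticInference defaults:

```python
# Minimal sketch: once a vLLM plugin such as ArcticInference is installed,
# vLLM typically discovers it automatically, so standard vLLM usage applies.
# The model and parameters below are assumptions for illustration only.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize speculative decoding in two sentences."], params)
print(outputs[0].outputs[0].text)
```

Check the repository's README for the plugin's actual install name and any speculative-decoding or KV-cache options it exposes.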
Alternatives and similar repositories for ArcticInference
Users interested in ArcticInference are comparing it to the libraries listed below.
- ArcticTraining is a framework designed to simplify and accelerate the post-training process for large language models (LLMs)☆190 · Updated this week
- LLM Serving Performance Evaluation Harness☆79 · Updated 5 months ago
- KV cache compression for high-throughput LLM inference☆134 · Updated 6 months ago
- [ICLR2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding☆123 · Updated 8 months ago
- Efficient LLM Inference over Long Sequences☆385 · Updated last month
- LLM KV cache compression made easy☆566 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs☆265 · Updated 9 months ago
- Genai-bench is a powerful benchmark tool designed for comprehensive token-level performance evaluation of large language model (LLM) serv…☆180 · Updated this week
- [ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration☆224 · Updated 8 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk☆141 · Updated last week
- A low-latency & high-throughput serving engine for LLMs☆400 · Updated 2 months ago
- Dynamic Memory Management for Serving LLMs without PagedAttention☆405 · Updated 2 months ago
- Simple extension on top of vLLM to help you speed up reasoning models without training.☆172 · Updated 2 months ago
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.☆206 · Updated last week
- [ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation☆228 · Updated 7 months ago
- PyTorch library for cost-effective, fast and easy serving of MoE models.☆215 · Updated 3 weeks ago
- Perplexity GPU Kernels☆418 · Updated 2 weeks ago
- Boosting 4-bit inference kernels with 2:4 Sparsity☆80 · Updated 11 months ago
- Benchmark suite for LLMs from Fireworks.ai☆76 · Updated this week
- vLLM performance dashboard☆33 · Updated last year
- A minimal implementation of vLLM.☆50 · Updated last year
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash…☆258 · Updated last week
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization (the storage-side idea is sketched after this list)☆365 · Updated 11 months ago
- Fast low-bit matmul kernels in Triton☆338 · Updated last week
- ☆47 · Updated last year
- ☆195 · Updated 2 months ago
- Code for data-aware compression of DeepSeek models☆40 · Updated last month
- ☆215 · Updated 6 months ago
- A throughput-oriented high-performance serving framework for LLMs☆856 · Updated 3 weeks ago
- Common recipes to run vLLM☆68 · Updated this week
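Several of the entries above target KV cache compression or quantization (as noted in the KVQuant entry). The sketch below shows only the core storage-side idea: round-tripping a KV tensor through 8-bit integers with a per-token scale and offset. It is a toy NumPy illustration with assumed shapes, not code from any listed repository:

```python
# Toy per-token 8-bit KV cache quantization: store uint8 codes plus a
# per-token scale/offset, and reconstruct approximate fp32 values on read.
import numpy as np

def quantize_kv(kv: np.ndarray):
    """kv: (num_tokens, head_dim) float32 -> (uint8 codes, scale, offset)."""
    lo = kv.min(axis=-1, keepdims=True)         # per-token minimum
    hi = kv.max(axis=-1, keepdims=True)         # per-token maximum
    scale = (hi - lo) / 255.0
    scale = np.where(scale == 0.0, 1.0, scale)  # guard constant rows
    codes = np.round((kv - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize_kv(codes, scale, lo):
    return codes.astype(np.float32) * scale + lo

kv = np.random.randn(4, 128).astype(np.float32)  # 4 cached tokens (assumed shape)
codes, scale, lo = quantize_kv(kv)
err = np.abs(kv - dequantize_kv(codes, scale, lo)).max()
print(f"uint8 codes: {codes.nbytes} bytes vs {kv.nbytes} bytes fp32; max err {err:.4f}")
```

Production systems layer much more on top (per-channel grouping, outlier handling, dequantization fused into the attention kernel), but the memory saving comes from exactly this kind of low-bit storage.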