powerserve-project / PowerServeLinks

High-speed and easy-use LLM serving framework for local deployment

☆137

Alternatives and similar repositories for PowerServe

Users that are interested in PowerServe are comparing it to the libraries listed below

Sorting:

SJTU-IPADS / Bamboo
Bamboo-7B Large Language Model
☆93Updated last year
efeslab / fiddler
[ICLR'25] Fast Inference of MoE Models with CPU-GPU Orchestration
☆243Updated last year
OpenBMB / CPM.cu
CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec…
☆206Updated 2 months ago
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆145Updated 10 months ago
mlc-ai / relax
☆170Updated 3 weeks ago
intel / xFasterTransformer
☆431Updated 2 months ago
EfficientMoE / MoE-Infinity
PyTorch library for cost-effective, fast and easy serving of MoE models.
☆263Updated last month
microsoft / T-MAC
Low-bit LLM inference on CPU/NPU with lookup table
☆898Updated 6 months ago
saic-fi / MobileQuant
[EMNLP Findings 2024] MobileQuant: Mobile-friendly Quantization for On-device Language Models
☆68Updated last year
SJTU-IPADS / PhoenixOS
Fast OS-level support for GPU checkpoint and restore
☆260Updated 2 months ago
microsoft / vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
☆446Updated 6 months ago
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆142Updated 10 months ago
intel / neural-speed
An innovative library for efficient LLM inference via low-bit quantization
☆350Updated last year
interestingLSY / swiftLLM
A tiny yet powerful LLM inference system tailored for researching purpose. vLLM-equivalent performance with only 2k lines of code (2% of …
☆294Updated 6 months ago
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆790Updated 9 months ago
flexflow / flexflow-serve
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
☆64Updated 2 months ago
ServerlessLLM / ServerlessLLM
Serverless LLM Serving for Everyone.
☆618Updated last week
microsoft / vidur
A large-scale simulation framework for LLM inference
☆491Updated 4 months ago
perplexityai / pplx-kernels
Perplexity GPU Kernels
☆534Updated last month
Cornell-RelaxML / qtip
☆158Updated 5 months ago
vllm-project / compressed-tensors
A safetensors extension to efficiently store sparse quantized tensors on disk
☆214Updated this week
microsoft / BitBLAS
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
☆723Updated 4 months ago
tyler-griggs / melange-release
☆48Updated last year
microsoft / ParrotServe
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
☆200Updated last year
Infini-AI-Lab / Sequoia
scalable and robust tree-based speculative decoding algorithm
☆363Updated 10 months ago
efeslab / Nanoflow
A throughput-oriented high-performance serving framework for LLMs
☆921Updated last month
cornserve-ai / cornserve
Easy, Fast, and Scalable Multimodal AI
☆78Updated last week
microsoft / sarathi-serve
A low-latency & high-throughput serving engine for LLMs
☆454Updated last month
Infini-AI-Lab / UMbreLLa
LLM Inference on consumer devices
☆125Updated 8 months ago
IST-DASLab / MoE-Quant
Code for data-aware compression of DeepSeek models
☆65Updated last month