zhaochenyang20 / ModelServer
Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang
☆24 · Updated 2 weeks ago
Related projects
Alternatives and complementary repositories for ModelServer
- Materials for learning SGLang ☆110 · Updated this week
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆34 · Updated 8 months ago
- ☆64 · Updated 4 months ago
- Manages vllm-nccl dependency ☆17 · Updated 5 months ago
- Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference. ☆26 · Updated 2 weeks ago
- Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆79 · Updated last week
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆51 · Updated 2 months ago
- Code for Palu: Compressing KV-Cache with Low-Rank Projection ☆57 · Updated last week
- Odysseus: Playground of LLM Sequence Parallelism ☆57 · Updated 5 months ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆77 · Updated last month
- The official code for the paper "Parallel Speculative Decoding with Adaptive Draft Length" ☆25 · Updated 3 months ago
- The official implementation of the paper "SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction" ☆39 · Updated last month
- ☆43 · Updated 6 months ago
- ☆22 · Updated 11 months ago
- The Official Implementation of Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference ☆40 · Updated this week
- [NeurIPS 2024] Fast Best-of-N Decoding via Speculative Rejection ☆29 · Updated 3 weeks ago
- Summary of system papers/frameworks/code/tools on training or serving large models ☆56 · Updated 11 months ago
- LLM Serving Performance Evaluation Harness ☆57 · Updated 2 months ago
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind ☆82 · Updated 8 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆149 · Updated 4 months ago
- GPTQ inference TVM kernel ☆36 · Updated 7 months ago
- Multi-Candidate Speculative Decoding ☆28 · Updated 7 months ago
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆76 · Updated 8 months ago
- Modular and structured prompt caching for low-latency LLM inference ☆69 · Updated 2 weeks ago
- An easy-to-use package for implementing SmoothQuant for LLMs ☆83 · Updated 6 months ago
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models ☆126 · Updated 5 months ago
- Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024) ☆40 · Updated last week
- Awesome list for LLM quantization ☆127 · Updated this week
- ☆47 · Updated 2 months ago
- Repository for CPU Kernel Generation for LLM Inference ☆25 · Updated last year