zhaochenyang20 / ModelServer
Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang
☆44 · Updated 4 months ago
Alternatives and similar repositories for ModelServer:
Users interested in ModelServer are comparing it to the libraries listed below.
- Inferflow is an efficient and highly configurable inference engine for large language models (LLMs). ☆238 · Updated last year
- Compare different hardware platforms via the Roofline Model for LLM inference tasks (a Roofline sketch follows this list). ☆93 · Updated last year
- An easy-to-use package for implementing SmoothQuant for LLMs (a smoothing sketch follows this list). ☆95 · Updated 10 months ago
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding. ☆110 · Updated 3 months ago
- Skywork-MoE: A Deep Dive into Training Techniques for Mixture-of-Experts Language Models. ☆129 · Updated 9 months ago
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆108 · Updated last week
- Fast and memory-efficient exact attention. ☆52 · Updated this week
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference. ☆35 · Updated 2 weeks ago
- Manages the vllm-nccl dependency. ☆17 · Updated 9 months ago
- Odysseus: Playground of LLM Sequence Parallelism. ☆66 · Updated 9 months ago
- Fast LLM training codebase with dynamic strategy selection [DeepSpeed + Megatron + FlashAttention + CUDA fusion kernels + compiler]. ☆36 · Updated last year
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind (a sampling sketch follows this list). ☆91 · Updated last year
- KV cache compression for high-throughput LLM inference. ☆117 · Updated last month
- Boosting 4-bit inference kernels with 2:4 sparsity. ☆71 · Updated 6 months ago
- DeeperGEMM: crazy optimized version. ☆61 · Updated last week
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of …). ☆151 · Updated 8 months ago
- Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs. ☆145 · Updated this week
- Vocabulary Parallelism. ☆17 · Updated 2 weeks ago
- 📚 FFPA (Split-D): Yet another Faster Flash Prefill Attention with O(1) GPU SRAM complexity for headdim > 256, ~2x↑🎉 vs SDPA EA. ☆154 · Updated this week
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM. ☆157 · Updated 8 months ago
- Transformer-related optimization, including BERT and GPT. ☆17 · Updated last year
- [ICLR 2025] PEARL: parallel speculative decoding with adaptive draft length. ☆59 · Updated last week
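For the Roofline entry flagged above: the model reduces to a single formula, attainable throughput = min(peak compute, memory bandwidth × arithmetic intensity), which is what makes decode-stage GEMVs memory-bound and prefill GEMMs compute-bound. A minimal Python sketch with illustrative hardware numbers (not the specs of any particular GPU, and not that repository's code):

```python
# Roofline sketch for LLM inference: attainable FLOP/s = min(peak compute,
# memory bandwidth * arithmetic intensity). All numbers below are illustrative.

def roofline(peak_flops: float, peak_bw: float, intensity: float) -> float:
    """Attainable FLOP/s for a kernel with the given arithmetic intensity (FLOPs/byte)."""
    return min(peak_flops, peak_bw * intensity)

PEAK_FLOPS = 300e12  # hypothetical accelerator: 300 TFLOP/s peak compute
PEAK_BW = 2e12       # hypothetical accelerator: 2 TB/s memory bandwidth

# Decode-stage GEMVs move roughly a byte of weights per FLOP, while prefill GEMMs
# reuse weights across many tokens and reach far higher arithmetic intensity.
for name, intensity in [("decode GEMV", 1.0), ("prefill GEMM", 300.0)]:
    attainable = roofline(PEAK_FLOPS, PEAK_BW, intensity)
    bound = "memory-bound" if attainable < PEAK_FLOPS else "compute-bound"
    print(f"{name}: {attainable / 1e12:.1f} TFLOP/s attainable ({bound})")
```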
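For the SmoothQuant entry above: the technique migrates activation outliers into the weights with a per-channel scale s_j = max|X_j|^α / max|W_j|^(1−α), which leaves the matrix product unchanged but flattens the activation range before INT8 quantization. A minimal NumPy sketch of that smoothing step (the function name and shapes are illustrative, not the package's API):

```python
# SmoothQuant-style smoothing step (NumPy sketch; the real package folds the
# scales into the preceding LayerNorm and then quantizes weights/activations to INT8).
import numpy as np

def smooth(x: np.ndarray, w: np.ndarray, alpha: float = 0.5):
    """Rescale per input channel: x_hat = x / s, w_hat = s * w, so x_hat @ w_hat == x @ w."""
    act_max = np.abs(x).max(axis=0)              # per-channel activation range, shape (in_features,)
    wgt_max = np.abs(w).max(axis=1)              # per-channel weight range, shape (in_features,)
    s = act_max**alpha / wgt_max**(1.0 - alpha)  # smoothing factors s_j
    return x / s, w * s[:, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16)); x[:, 3] *= 50      # channel 3 carries an activation outlier
w = rng.normal(size=(16, 4))
x_hat, w_hat = smooth(x, w)
assert np.allclose(x @ w, x_hat @ w_hat)         # the layer's output is unchanged
print(f"activation range: {np.abs(x).max():.1f} -> {np.abs(x_hat).max():.1f}")
```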
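For the speculative sampling entry above: a small draft model proposes tokens cheaply, the target model scores them in one pass, and each draft token is accepted with probability min(1, p/q); on rejection, a replacement is drawn from the normalized residual max(0, p − q), which keeps the output distribution exactly that of the target model. A minimal sketch of that accept/reject rule over a toy vocabulary (toy distributions, not the repository's implementation):

```python
# Accept/reject rule from "Accelerating Large Language Model Decoding with
# Speculative Sampling" for a single draft token, over a toy 4-token vocabulary.
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p: np.ndarray, q: np.ndarray, draft_token: int) -> int:
    """Accept draft_token with prob min(1, p/q); otherwise resample from max(0, p - q)."""
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token
    residual = np.maximum(p - q, 0.0)            # where the target puts more mass than the draft
    return rng.choice(len(p), p=residual / residual.sum())

p = np.array([0.5, 0.3, 0.1, 0.1])               # target model distribution
q = np.array([0.25, 0.25, 0.25, 0.25])           # draft model distribution
draft = rng.choice(4, p=q)                       # token proposed by the draft model
print(draft, "->", speculative_step(p, q, draft))
```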