chen-ace / LLM-Prefill-Decode-Benchmark
Experimentally compares the throughput difference between the Prefill and Decoding phases of LLM inference, revealing the performance bottleneck and explaining the rationale behind Prefill-Decode (PD) disaggregation. Includes test scripts for CUDA and Apple MPS (M-series chips).
☆18 · Updated 5 months ago
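The gap the repository description refers to can be illustrated with a minimal, framework-free sketch. This is not the repository's actual script; the model dimension, sequence lengths, and the single weight matrix standing in for a transformer layer are all illustrative assumptions (attention and the KV cache are omitted). Prefill pushes every prompt token through one large, compute-dense matmul, while decode performs one tiny matmul per generated token, so per-token throughput drops sharply.

```python
import time
import numpy as np

# Toy illustration of the prefill/decode throughput gap.
# d_model and one weight matrix stand in for a transformer layer.
d_model, prompt_len, new_tokens, reps = 512, 256, 256, 20
rng = np.random.default_rng(0)
w = rng.standard_normal((d_model, d_model), dtype=np.float32)
x = rng.standard_normal((prompt_len, d_model), dtype=np.float32)
tok = rng.standard_normal((1, d_model), dtype=np.float32)

_ = x @ w  # warm up BLAS before timing

# Prefill: all prompt tokens processed in one large matmul.
t0 = time.perf_counter()
for _ in range(reps):
    _ = x @ w
prefill_tps = reps * prompt_len / (time.perf_counter() - t0)

# Decode: tokens produced sequentially, one small matmul per step
# (a real decoder would also feed each new token back in).
t0 = time.perf_counter()
for _ in range(reps * new_tokens):
    _ = tok @ w
decode_tps = reps * new_tokens / (time.perf_counter() - t0)

print(f"prefill: {prefill_tps:,.0f} tok/s  decode: {decode_tps:,.0f} tok/s")
```

On typical hardware the prefill throughput is far higher, which is why serving systems that disaggregate the two phases can size and batch them independently.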
Alternatives and similar repositories for LLM-Prefill-Decode-Benchmark
Users interested in LLM-Prefill-Decode-Benchmark are comparing it to the repositories listed below.
- ☆53 · Updated last year
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆119 · Updated last year
- An annotated nano_vllm repository, with MiniCPM4 adaptation completed and support for registering new models. ☆95 · Updated 3 months ago
- Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang ☆60 · Updated last year
- ☆151 · Updated 8 months ago
- [USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral… ☆66 · Updated last year
- Modular and structured prompt caching for low-latency LLM inference ☆102 · Updated last year
- LLM Inference with Deep Learning Accelerator. ☆53 · Updated 9 months ago
- ☆82 · Updated last year
- ATC23 AE ☆47 · Updated 2 years ago
- Inferflow is an efficient and highly configurable inference engine for large language models (LLMs). ☆249 · Updated last year
- ☆79 · Updated last year
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length ☆130 · Updated 3 weeks ago
- LLM theoretical performance analysis tools supporting params, FLOPs, memory, and latency analysis. ☆112 · Updated 4 months ago
- LLM Inference benchmark ☆430 · Updated last year
- A simple calculation for LLM MFU. ☆50 · Updated 2 months ago
- A flexible and efficient training framework for large-scale alignment tasks ☆437 · Updated 3 weeks ago
- SGLang is a fast serving framework for large language models and vision language models. ☆22 · Updated this week
- Bridge Megatron-Core to Hugging Face/Reinforcement Learning ☆159 · Updated last week
- ☆151 · Updated 4 months ago
- Omni_Infer is a suite of inference accelerators designed for the Ascend NPU platform, offering native support and an expanding feature se… ☆86 · Updated this week
- A LLaMA1/LLaMA2 Megatron implementation. ☆28 · Updated last year
- ☆130 · Updated 10 months ago
- Implementations of some LLM KV cache sparsity methods ☆42 · Updated last year
- Transformer-related optimization, including BERT and GPT ☆59 · Updated 2 years ago
- A high-performance distributed deep learning system targeting large-scale and automated distributed training. If you have any interests, … ☆123 · Updated last year
- LLMem: GPU Memory Estimation for Fine-Tuning Pre-Trained LLMs ☆26 · Updated 5 months ago
- [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank ☆64 · Updated last year
- ☆81 · Updated 7 months ago
- ☆512 · Updated 2 months ago