chen-ace / LLM-Prefill-Decode-Benchmark
Experimentally compares the throughput difference between the Prefill and Decoding phases of LLM inference, revealing the performance bottleneck and explaining the rationale behind Prefill-Decode (PD) disaggregation. Includes test scripts for CUDA and Apple MPS (M-series chips).
☆18 · Updated 5 months ago
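The gap the repository description refers to can be illustrated with a minimal, framework-free sketch. This is not the repository's actual script; the model dimension, sequence lengths, and the single weight matrix standing in for a transformer layer are all illustrative assumptions (attention and the KV cache are omitted). Prefill pushes every prompt token through one large, compute-dense matmul, while decode performs one tiny matmul per generated token, so per-token throughput drops sharply.

```python
import time
import numpy as np

# Toy illustration of the prefill/decode throughput gap.
# d_model and one weight matrix stand in for a transformer layer.
d_model, prompt_len, new_tokens, reps = 512, 256, 256, 20
rng = np.random.default_rng(0)
w = rng.standard_normal((d_model, d_model), dtype=np.float32)
x = rng.standard_normal((prompt_len, d_model), dtype=np.float32)
tok = rng.standard_normal((1, d_model), dtype=np.float32)

_ = x @ w  # warm up BLAS before timing

# Prefill: all prompt tokens processed in one large matmul.
t0 = time.perf_counter()
for _ in range(reps):
    _ = x @ w
prefill_tps = reps * prompt_len / (time.perf_counter() - t0)

# Decode: tokens produced sequentially, one small matmul per step
# (a real decoder would also feed each new token back in).
t0 = time.perf_counter()
for _ in range(reps * new_tokens):
    _ = tok @ w
decode_tps = reps * new_tokens / (time.perf_counter() - t0)

print(f"prefill: {prefill_tps:,.0f} tok/s  decode: {decode_tps:,.0f} tok/s")
```

On typical hardware the prefill throughput is far higher, which is why serving systems that disaggregate the two phases can size and batch them independently.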
Alternatives and similar repositories for LLM-Prefill-Decode-Benchmark
Users interested in LLM-Prefill-Decode-Benchmark are comparing it to the repositories listed below.
- ☆53 · Updated last year
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆119 · Updated last year
- An annotated nano_vllm repository, with MiniCPM4 adaptation completed and support for registering new models. ☆95 · Updated 3 months ago
- Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang ☆60 · Updated last year
- ☆151 · Updated 8 months ago
- [USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral… ☆66 · Updated last year
- Modular and structured prompt caching for low-latency LLM inference ☆102 · Updated last year
- LLM Inference with Deep Learning Accelerator. ☆53 · Updated 9 months ago
- ☆82 · Updated last year
- ATC23 AE ☆47 · Updated 2 years ago
- Inferflow is an efficient and highly configurable inference engine for large language models (LLMs). ☆249 · Updated last year
- ☆79 · Updated last year
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length ☆130 · Updated 3 weeks ago
- LLM theoretical performance analysis tools supporting params, FLOPs, memory, and latency analysis. ☆112 · Updated 4 months ago
- LLM Inference benchmark ☆430 · Updated last year
- A simple calculation for LLM MFU. ☆50 · Updated 2 months ago
- A flexible and efficient training framework for large-scale alignment tasks ☆437 · Updated 3 weeks ago
- SGLang is a fast serving framework for large language models and vision language models. ☆22 · Updated this week
- Bridge Megatron-Core to Hugging Face/Reinforcement Learning ☆159 · Updated last week
- ☆151 · Updated 4 months ago
- Omni_Infer is a suite of inference accelerators designed for the Ascend NPU platform, offering native support and an expanding feature se… ☆86 · Updated this week
- A LLaMA1/LLaMA2 Megatron implementation. ☆28 · Updated last year
- ☆130 · Updated 10 months ago
- Implementations of some LLM KV cache sparsity methods ☆42 · Updated last year
- Transformer-related optimization, including BERT and GPT ☆59 · Updated 2 years ago
- A high-performance distributed deep learning system targeting large-scale and automated distributed training. If you have any interests, … ☆123 · Updated last year
- LLMem: GPU Memory Estimation for Fine-Tuning Pre-Trained LLMs ☆26 · Updated 5 months ago
- [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank ☆64 · Updated last year
- ☆81 · Updated 7 months ago
- ☆512 · Updated 2 months ago