chenhongyu2048 / LLM-inference-optimization-paper
Summary of some awesome work for optimizing LLM inference
☆176 · Updated this week
Alternatives and similar repositories for LLM-inference-optimization-paper
Users interested in LLM-inference-optimization-paper are comparing it to the libraries listed below.
- Since the emergence of ChatGPT in 2022, accelerating Large Language Models has become increasingly important. Here is a list of pap… ☆283 · Mar 6, 2025 · Updated 11 months ago
- 📚A curated list of Awesome LLM/VLM Inference Papers with Codes: Flash-Attention, Paged-Attention, WINT8/4, Parallelism, etc.🎉 ☆4,990 · Jan 18, 2026 · Updated 3 weeks ago
- ☆29 · May 28, 2024 · Updated last year
- Curated collection of papers in machine learning systems ☆507 · Feb 7, 2026 · Updated last week
- A throughput-oriented high-performance serving framework for LLMs ☆945 · Oct 29, 2025 · Updated 3 months ago
- Curated collection of papers on MoE model inference ☆342 · Oct 20, 2025 · Updated 3 months ago
- This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding co… ☆281 · Dec 5, 2025 · Updated 2 months ago
- ☆58 · May 4, 2024 · Updated last year
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆812 · Mar 6, 2025 · Updated 11 months ago
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of … ☆314 · Jun 10, 2025 · Updated 8 months ago
- Papers and their code for AI systems ☆347 · Updated this week
- Disaggregated serving system for Large Language Models (LLMs). ☆776 · Apr 6, 2025 · Updated 10 months ago
- Dynamic Memory Management for Serving LLMs without PagedAttention ☆458 · May 30, 2025 · Updated 8 months ago
- ☆114 · May 16, 2025 · Updated 8 months ago
- ☆630 · Jan 14, 2026 · Updated last month
- PyTorch library for cost-effective, fast, and easy serving of MoE models. ☆280 · Feb 2, 2026 · Updated last week
- ☆11 · Aug 4, 2022 · Updated 3 years ago
- Slides and attachments for the 2023/12/22 weekly meeting tech talk on "Containers" (Room 电三 420) ☆10 · Dec 22, 2023 · Updated 2 years ago
- A general-purpose CNN (convolutional neural network) accelerator for Xilinx FPGAs; this design targets the KV260 board and is portable to any MPSoC architecture ☆18 · Dec 13, 2024 · Updated last year
- Code based on vLLM for the paper "Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention". ☆11 · Sep 19, 2024 · Updated last year
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ☆23 · Mar 15, 2024 · Updated last year
- Large Language Model (LLM) Systems Paper List ☆1,818 · Feb 8, 2026 · Updated last week
- Optimize tensor programs fast with Felix, a gradient-descent autotuner. ☆30 · Apr 27, 2024 · Updated last year
- QAQ: Quality Adaptive Quantization for LLM KV Cache ☆55 · Mar 27, 2024 · Updated last year
- A graph pattern mining framework for large graphs on GPU ☆15 · Dec 9, 2024 · Updated last year
- ☆17 · Feb 8, 2024 · Updated 2 years ago
- ☆14 · Dec 5, 2024 · Updated last year
- High-performance RMSNorm implementation using SM core storage (registers and shared memory) ☆26 · Jan 22, 2026 · Updated 3 weeks ago
- ☆22 · Oct 7, 2025 · Updated 4 months ago
- Welder (OSDI 2023), a deep learning compiler ☆32 · Nov 24, 2023 · Updated 2 years ago
- Aims to implement dual-port and multi-QP solutions in the deepEP ibrc transport ☆73 · May 9, 2025 · Updated 9 months ago
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆120 · Mar 13, 2024 · Updated last year
- How to optimize common algorithms in CUDA ☆2,819 · Updated this week
- ☆14 · Nov 3, 2025 · Updated 3 months ago
- Repository for the COLM 2025 paper SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths ☆15 · Jul 10, 2025 · Updated 7 months ago
- ☆53 · Dec 26, 2024 · Updated last year
- ☆13 · Sep 8, 2021 · Updated 4 years ago
- DeepStream + CUDA: yolo26, yolo-master, yolo11, yolov8, SAM, Transformer, etc. ☆35 · Feb 7, 2026 · Updated last week
- An analytical framework that models hardware dataflow of tensor applications on spatial architectures using the relation-centric notation… ☆87 · Apr 28, 2024 · Updated last year
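
One of the entries above is a high-performance RMSNorm kernel built on SM registers and shared memory. For orientation, the operator it accelerates can be sketched in a few lines of NumPy; this is an illustrative reference baseline, not that repository's code:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale each row by the reciprocal of its root-mean-square,
    # then apply a learned per-channel gain (no mean subtraction, no bias,
    # unlike LayerNorm).
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

# Example: the RMS of [3, 4] is sqrt((9 + 16) / 2) ≈ 3.536.
x = np.array([[3.0, 4.0]])
out = rms_norm(x, np.ones(2))
```

A GPU kernel gets its speedup by doing this row reduction entirely in registers and shared memory, avoiding round trips to global memory.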
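
The Roofline-comparison repository in the list above rests on a simple model: an operation's runtime is bounded below by the larger of its compute time and its memory-traffic time. A back-of-envelope sketch for LLM decode, where the hardware figures are approximate A100-class numbers used purely for illustration:

```python
def roofline_time(flops, bytes_moved, peak_flops, peak_bw):
    # Roofline lower bound on runtime: the slower of compute-limited
    # and bandwidth-limited execution.
    return max(flops / peak_flops, bytes_moved / peak_bw)

# Single-token decode for a ~7B-parameter fp16 model (illustrative):
# roughly 2 FLOPs and 2 bytes of weight traffic per parameter per token.
flops = 2 * 7e9          # ~1.4e10 FLOPs per token
traffic = 2 * 7e9        # ~1.4e10 bytes of weight reads per token
peak_flops = 312e12      # ~312 TFLOPS fp16 tensor-core peak (A100-class)
peak_bw = 2.0e12         # ~2 TB/s HBM bandwidth (A100-class)

t = roofline_time(flops, traffic, peak_flops, peak_bw)
# Memory time (7 ms) dwarfs compute time (~0.045 ms), so decode is
# memory-bandwidth-bound under these assumptions.
```

This is why so many of the repositories listed here attack memory traffic (KV-cache quantization, paging, weight quantization) rather than raw FLOPs.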