shishishu / LLM-Inference-AccelerationLinks

LLM Inference with Deep Learning Accelerator.

☆52

Alternatives and similar repositories for LLM-Inference-Acceleration

Users that are interested in LLM-Inference-Acceleration are comparing it to the libraries listed below

Sorting:

galeselee / Awesome_LLM_System-PaperList
Since the emergence of chatGPT in 2022, the acceleration of Large Language Model has become increasingly important. Here is a list of pap…
☆278Updated 7 months ago
chenhongyu2048 / LLM-inference-optimization-paper
Summary of some awesome work for optimizing LLM inference
☆120Updated 4 months ago
feifeibear / LLMRoofline
Compare different hardware platforms via the Roofline Model for LLM inference tasks.
☆115Updated last year
TreeAI-Lab / Awesome-KV-Cache-Management
This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding co…
☆221Updated 2 months ago
madsys-dev / deepseekv2-profile
☆148Updated 7 months ago
MoE-Inf / awesome-moe-inference
Curated collection of papers in MoE model inference
☆285Updated last month
harleyszhang / llm_counts
llm theoretical performance analysis tools and support params, flops, memory and latency analysis.
☆108Updated 3 months ago
HarryWu99 / llm_kvcache_sparsity
Implement some method of LLM KV Cache Sparsity
☆39Updated last year
InternLM / Awesome-LLM-Training-System
☆43Updated last year
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆138Updated 9 months ago
Zefan-Cai / Awesome-LLM-KV-Cache
Awesome-LLM-KV-Cache: A curated list of 📙Awesome LLM KV Cache Papers with Codes.
☆376Updated 7 months ago
LDLINGLINGLING / nano_vllm_note
注释的nano_vllm仓库，并且完成了MiniCPM4的适配以及注册新模型的功能
☆81Updated 2 months ago
thu-pacman / SmartMoE-AE
ATC23 AE
☆47Updated 2 years ago
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆181Updated 2 weeks ago
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆221Updated 2 years ago
interestingLSY / swiftLLM
A tiny yet powerful LLM inference system tailored for researching purpose. vLLM-equivalent performance with only 2k lines of code (2% of …
☆278Updated 4 months ago
microsoft / chunk-attention
☆78Updated 6 months ago
infinigence / Semi-PD
A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation.
☆112Updated 5 months ago
LoongServe / LoongServe
☆124Updated 11 months ago
kwai / Megatron-Kwai
[USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral…
☆65Updated last year
microsoft / ParrotServe
[OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable
☆186Updated last year
PKU-SEC-Lab / HybriMoE
[DAC'25] Official implement of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference"
☆75Updated 4 months ago
YaoJiayi / CacheBlend
☆141Updated 3 months ago
CalebDu / Awesome-Cute
☆107Updated 5 months ago
DeepLink-org / DLSlime
DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit
☆70Updated this week
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆96Updated 10 months ago
mdy666 / mdy_triton
☆148Updated 3 months ago
thu-pacman / FasterMoE
☆87Updated 3 years ago
lambda7xx / awesome-AI-system
paper and its code for AI System
☆331Updated 2 months ago
AlibabaPAI / FLASHNN
☆100Updated last year