sihyeong / Awesome-LLM-Inference-Engine
☆81 · Updated last week
Alternatives and similar repositories for Awesome-LLM-Inference-Engine
Users interested in Awesome-LLM-Inference-Engine are comparing it to the libraries listed below.
- [ICLR 2025] Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding ☆116 · Updated 6 months ago
- ☆77 · Updated 2 months ago
- ☆87 · Updated 3 months ago
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆164 · Updated 9 months ago
- ☆36 · Updated 10 months ago
- ☆109 · Updated 8 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆151 · Updated this week
- ☆69 · Updated 8 months ago
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length ☆90 · Updated 2 months ago
- Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang ☆53 · Updated 7 months ago
- ☆159 · Updated this week
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆101 · Updated last year
- Modular and structured prompt caching for low-latency LLM inference ☆96 · Updated 7 months ago
- ☆141 · Updated 3 months ago
- Since the emergence of ChatGPT in 2022, the acceleration of Large Language Models has become increasingly important. Here is a list of pap… ☆255 · Updated 3 months ago
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity ☆214 · Updated last year
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference. ☆38 · Updated 2 weeks ago
- Implementations of several LLM KV cache sparsity methods ☆32 · Updated last year
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of … ☆226 · Updated 2 weeks ago
- ☆62 · Updated last year
- ☆71 · Updated last month
- ☆54 · Updated last year
- [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank ☆48 · Updated 7 months ago
- Curated collection of papers in MoE model inference ☆200 · Updated 4 months ago
- Summary of some awesome work for optimizing LLM inference ☆77 · Updated 3 weeks ago
- High performance Transformer implementation in C++. ☆125 · Updated 5 months ago
- Code for paper: [ICLR 2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆113 · Updated last month
- LLM Serving Performance Evaluation Harness ☆78 · Updated 4 months ago
- This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding co… ☆154 · Updated last week
- A simple calculation for LLM MFU. ☆38 · Updated 3 months ago