sihyeong / Awesome-LLM-Inference-Engine
☆161 · Updated last month
Alternatives and similar repositories for Awesome-LLM-Inference-Engine
Users interested in Awesome-LLM-Inference-Engine are comparing it to the libraries listed below.
- Awesome-LLM-KV-Cache: A curated list of Awesome LLM KV Cache Papers with Codes. ☆405 · Updated 10 months ago
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… ☆605 · Updated last year
- Awesome list for LLM quantization. ☆378 · Updated 3 months ago
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of … ☆306 · Updated 7 months ago
- This repository serves as a comprehensive survey of LLM development, featuring numerous research papers along with their corresponding co… ☆264 · Updated last month
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving. ☆626 · Updated this week
- Since the emergence of ChatGPT in 2022, accelerating Large Language Models has become increasingly important. Here is a list of pap… ☆283 · Updated 10 months ago
- Curated collection of papers in MoE model inference. ☆333 · Updated 2 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆216 · Updated 3 months ago
- Summary of some awesome work for optimizing LLM inference. ☆162 · Updated last month
- Must-read papers on KV Cache Compression (constantly updating). ☆635 · Updated 3 months ago
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length. ☆145 · Updated 3 weeks ago
- Dynamic Memory Management for Serving LLMs without PagedAttention. ☆454 · Updated 7 months ago
- ☆153 · Updated 10 months ago
- Materials for learning SGLang. ☆717 · Updated last week
- ArcticInference: vLLM plugin for high-throughput, low-latency inference. ☆368 · Updated last week
- LLM Inference with Deep Learning Accelerator. ☆56 · Updated 11 months ago
- Repo for SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting (ISCA'25). ☆70 · Updated 8 months ago
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆120 · Updated last year
- [DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference". ☆96 · Updated last month
- ☆162 · Updated 6 months ago
- A curated list of high-quality papers on resource-efficient LLMs. ☆154 · Updated 10 months ago
- ☆299 · Updated 6 months ago
- A low-latency & high-throughput serving engine for LLMs. ☆464 · Updated last week
- PyTorch library for cost-effective, fast and easy serving of MoE models. ☆273 · Updated 3 months ago
- ☆83 · Updated last year
- LLM theoretical performance analysis tool supporting params, FLOPs, memory, and latency analysis. ☆114 · Updated 6 months ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving. ☆333 · Updated last year
- Tile-Based Runtime for Ultra-Low-Latency LLM Inference. ☆527 · Updated 3 weeks ago
- Code Repository of Evaluating Quantized Large Language Models. ☆136 · Updated last year