DefTruth / Awesome-LLM-Inference
📖 A curated list of Awesome LLM Inference Papers with codes: TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention, etc.
★ 2,845 · Updated this week
Related projects
Alternatives and complementary repositories for Awesome-LLM-Inference
- FlashInfer: Kernel Library for LLM Serving (★ 1,452 · Updated this week)
- LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability… (★ 2,613 · Updated this week)
- Awesome LLM compression research papers and tools. (★ 1,202 · Updated this week)
- [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (★ 2,526 · Updated last month)
- [TMLR 2024] Efficient Large Language Models: A Survey (★ 1,025 · Updated last week)
- A curated list for Efficient Large Language Models (★ 1,270 · Updated this week)
- Large Language Model (LLM) Systems Paper List (★ 645 · Updated this week)
- Modern CUDA Learn Notes with PyTorch: Tensor/CUDA Cores, 150+ CUDA Kernels with PyTorch bindings, HGEMM/SGEMM (95%~99% cuBLAS performance)… (★ 1,473 · Updated this week)
- SGLang is a fast serving framework for large language models and vision language models. (★ 6,127 · Updated this week)
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models; see the smoothing sketch after this list (★ 1,257 · Updated 4 months ago)
- Ongoing research training transformer language models at scale, including: BERT & GPT-2 (★ 1,893 · Updated last month)
- LMDeploy is a toolkit for compressing, deploying, and serving LLMs. (★ 4,669 · Updated this week)
- AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. (★ 1,765 · Updated this week)
- An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & RingAttention) (★ 2,666 · Updated this week)
- Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. (★ 1,120 · Updated 3 months ago)
- How to optimize some algorithms in CUDA. (★ 1,593 · Updated last week)
- Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads; see the multi-head sketch after this list (★ 2,312 · Updated 4 months ago)
- 📰 Must-read papers and blogs on Speculative Decoding ⚡️ (★ 471 · Updated last week)
- Fast inference from large language models via speculative decoding; see the draft-and-verify sketch after this list (★ 569 · Updated 2 months ago)
- 📰 Must-read papers and blogs on LLM based Long Context Modeling 🔥 (★ 1,006 · Updated this week)
- Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24) (★ 826 · Updated this week)
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs… (★ 1,979 · Updated this week)
- This repository collects papers for "A Survey on Knowledge Distillation of Large Language Models". We break down KD into Knowledge Elicitation… (★ 656 · Updated 3 weeks ago)
- S-LoRA: Serving Thousands of Concurrent LoRA Adapters; see the multi-adapter sketch after this list (★ 1,755 · Updated 10 months ago)
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding (★ 1,149 · Updated last month)
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving (★ 1,713 · Updated this week)
- A PyTorch Native LLM Training Framework (★ 665 · Updated 2 months ago)
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. (★ 1,904 · Updated this week)
- A throughput-oriented high-performance serving framework for LLMs (★ 636 · Updated 2 months ago)
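
Technique sketches
The toy sketches below illustrate, in miniature, a few of the techniques the repositories above implement. They are hedged stand-ins, not those projects' actual code or APIs.

First, speculative decoding (referenced from the entries above): a cheap draft model proposes a few tokens and the target model verifies them, here with the simple greedy accept-prefix rule. The `target_logits`/`draft_logits` functions are made-up stand-ins for real model forward passes.

```python
import random

VOCAB_SIZE = 100

def target_logits(tokens):
    """Stand-in for the large target model: deterministic scores per context."""
    rng = random.Random(hash(tuple(tokens)))
    return [rng.random() for _ in range(VOCAB_SIZE)]

def draft_logits(tokens):
    """Stand-in for the cheap draft model: mostly agrees with the target."""
    scores = target_logits(tokens)
    rng = random.Random(len(tokens))
    if rng.random() < 0.3:  # occasional disagreement, so rejections happen
        scores[rng.randrange(VOCAB_SIZE)] += 2.0
    return scores

def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

def speculative_step(tokens, k=4):
    """Draft k tokens greedily, then verify them with the target model."""
    drafted, ctx = [], list(tokens)
    for _ in range(k):                 # cheap sequential drafting
        t = argmax(draft_logits(ctx))
        drafted.append(t)
        ctx.append(t)

    accepted, ctx = [], list(tokens)
    for t in drafted:                  # one batched target pass in practice
        expected = argmax(target_logits(ctx))
        if expected != t:              # first mismatch: keep target's token, stop
            accepted.append(expected)
            return accepted
        accepted.append(t)
        ctx.append(t)
    # every draft token accepted: the target pass yields one bonus token for free
    accepted.append(argmax(target_logits(ctx)))
    return accepted

out = [1, 2, 3]                        # toy prompt
while len(out) < 16:
    out.extend(speculative_step(out))
print(out)
```

In a real serving stack, verification is a single batched forward pass through the target model covering all k draft positions; that one pass replacing k sequential decoding steps is the entire speedup.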
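Next, the Medusa idea (referenced from its entry above): instead of a separate draft model, K extra decoding heads on the base model's last hidden state propose tokens for the next K positions in one pass. Everything here, the shapes, the stand-in `hidden_state`, the random untrained heads, is illustrative only.

```python
import numpy as np

HIDDEN, VOCAB, K = 16, 50, 3

rng = np.random.default_rng(0)
W_base = rng.normal(size=(HIDDEN, VOCAB))        # base LM head
W_medusa = rng.normal(size=(K, HIDDEN, VOCAB))   # K (untrained) Medusa heads

def hidden_state(tokens):
    """Stand-in for the transformer trunk: any deterministic tokens -> h map."""
    rs = np.random.default_rng(sum(tokens) + 7 * len(tokens))
    return rs.normal(size=HIDDEN)

def base_next(tokens):
    """What plain greedy decoding with the base model would emit."""
    return int(np.argmax(hidden_state(tokens) @ W_base))

def medusa_step(tokens):
    """Propose K+1 tokens from ONE hidden state, then verify greedily."""
    h = hidden_state(tokens)
    candidates = [int(np.argmax(h @ W_base))]      # position t+1
    candidates += [int(np.argmax(h @ W_medusa[i])) # positions t+2 .. t+K+1
                   for i in range(K)]

    out, ctx = [], list(tokens)
    for t in candidates:               # verification; batched in a real system
        expected = base_next(ctx)
        out.append(expected)           # always emit the base model's token,
        ctx.append(expected)           # so output == plain greedy decoding
        if expected != t:              # stop at the first disagreement
            break
    return out

seq = [1, 2, 3]                        # toy prompt
while len(seq) < 15:
    seq.extend(medusa_step(seq))
print(seq)
```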
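Third, the smoothing transform at the heart of SmoothQuant (referenced from its entry above): a per-channel scale migrates quantization difficulty from activations, which have outlier channels, to weights, leaving the matrix product mathematically unchanged. `alpha = 0.5` matches the paper's common setting; the data and the fake-quant helper are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))            # activations (tokens x channels)
X[:, 2] *= 50.0                        # one outlier channel, the failure
                                       # mode SmoothQuant targets
W = rng.normal(size=(4, 3))            # weights

alpha = 0.5                            # migration strength from the paper
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

X_s = X / s                            # smoothed activations (tame range)
W_s = s[:, None] * W                   # scale folded into weights offline

assert np.allclose(X @ W, X_s @ W_s)   # the transform is mathematically exact

def fake_quant_int8(a):
    """Symmetric per-tensor INT8 round-trip, to expose the outlier's damage."""
    scale = np.abs(a).max() / 127.0
    return np.round(a / scale) * scale

# Quantize only the activations here, to isolate the effect of smoothing.
err_plain = np.abs(fake_quant_int8(X) @ W - X @ W).max()
err_smooth = np.abs(fake_quant_int8(X_s) @ W_s - X @ W).max()
print(f"activation-quant error, no smoothing:   {err_plain:.3f}")
print(f"activation-quant error, with smoothing: {err_smooth:.3f}")
```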
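Finally, the batched multi-adapter math behind S-LoRA (referenced from its entry above): one shared base GEMM serves every request, and each request adds its own low-rank delta `x @ A_i @ B_i`. The names and shapes are made up; S-LoRA's actual contribution is doing this at scale with unified paging and custom kernels, not this arithmetic itself.

```python
import numpy as np

D_IN, D_OUT, RANK, N_ADAPTERS = 8, 6, 2, 3

rng = np.random.default_rng(0)
W = rng.normal(size=(D_IN, D_OUT))                 # base weight, shared by all
adapters = [(rng.normal(size=(D_IN, RANK)) * 0.1,  # A_i
             rng.normal(size=(RANK, D_OUT)) * 0.1) # B_i
            for _ in range(N_ADAPTERS)]

# A batch of requests, each tagged with the LoRA adapter it was issued for.
batch = [(rng.normal(size=D_IN), i % N_ADAPTERS) for i in range(5)]

X = np.stack([x for x, _ in batch])
base_out = X @ W                                   # ONE shared GEMM for everyone

# Per-request low-rank correction; S-LoRA fuses this into custom kernels and
# pages adapter weights in GPU memory, but the underlying math is just this:
lora_out = np.stack([x @ adapters[i][0] @ adapters[i][1] for x, i in batch])

Y = base_out + lora_out
print(Y.shape)   # (5, 6): five requests, three adapters, one resident base model
```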