InternLM / Awesome-LLM-Training-System
☆14 · Updated last month
Related projects:
- 16-fold memory access reduction with nearly no loss ☆35 · Updated last month
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆134 · Updated 2 months ago
- Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs ☆68 · Updated 3 months ago
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of … ☆84 · Updated 2 months ago
- A parallel VAE that avoids OOM for high-resolution image generation ☆34 · Updated 2 months ago
- ATC23 AE ☆42 · Updated last year
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆161 · Updated 2 months ago (a minimal page-selection sketch appears after this list)
- Official implementation of the ICML 2024 paper "ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking" ☆37 · Updated 2 months ago
- [ACL 2024] A novel QAT framework with self-distillation to enhance ultra-low-bit LLMs ☆69 · Updated 4 months ago
- GPTQ inference TVM kernel ☆35 · Updated 4 months ago
- [ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs ☆72 · Updated last month
- Awesome list for LLM quantization ☆84 · Updated 2 weeks ago
- Since the emergence of ChatGPT in 2022, the acceleration of large language models has become increasingly important. Here is a list of pap… ☆153 · Updated last week
- A summary of some awesome work on optimizing LLM inference ☆26 · Updated this week
- 📖 A small curated list of Awesome SD/DiT/ViT/Diffusion Inference with Distributed/Caching/Sampling: DistriFusion, PipeFusion, AsyncDiff, … ☆64 · Updated 2 weeks ago
- PyTorch implementation of our ICML 2024 paper "CaM: Cache Merging for Memory-efficient LLMs Inference" ☆21 · Updated 3 months ago
- Code repository of "Evaluating Quantized Large Language Models" ☆89 · Updated last week
- [OSDI'24] Serving LLM-based Applications Efficiently with Semantic Variable ☆89 · Updated last week
- Patch convolution to avoid large GPU memory usage of Conv2D ☆73 · Updated 3 months ago (see the patch-convolution sketch after this list)
- Odysseus: Playground of LLM Sequence Parallelism ☆50 · Updated 3 months ago
- Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models ☆27 · Updated last week
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models ☆16 · Updated last month
- PyTorch library for cost-effective, fast and easy serving of MoE models ☆90 · Updated last month
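
To give a feel for the query-aware KV-cache sparsity idea behind the Quest entry above, here is a minimal, illustrative sketch (not the official implementation): the KV cache is split into fixed-size pages, each page is summarized by channel-wise min/max of its keys, an upper bound on the attention score is estimated per page from the current query, and only the top-k pages are kept for attention. The page size, top-k, and function name below are assumptions chosen for illustration.

```python
import torch

def select_pages(query, keys, page_size=16, top_k=4):
    """query: (d,), keys: (T, d) -> indices of the pages most likely to matter."""
    T, d = keys.shape
    num_pages = (T + page_size - 1) // page_size
    pad = num_pages * page_size - T
    if pad:  # pad the last page with NaN so the summaries ignore it
        keys = torch.cat([keys, keys.new_full((pad, d), float("nan"))])
    pages = keys.view(num_pages, page_size, d)
    # Channel-wise min/max summaries per page (NaN padding is neutralized).
    kmax = torch.nan_to_num(pages, nan=float("-inf")).amax(dim=1)
    kmin = torch.nan_to_num(pages, nan=float("inf")).amin(dim=1)
    # Upper bound of q·k over a page: per channel, take max or min depending on the sign of q.
    bound = torch.where(query >= 0, query * kmax, query * kmin).sum(dim=-1)
    return bound.topk(min(top_k, num_pages)).indices

q, k = torch.randn(64), torch.randn(1000, 64)
print(select_pages(q, k))  # indices of the 4 pages this query would attend to
```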
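
The patch-convolution entry above rests on a similarly simple idea: split the input into stripes with a small halo so boundary pixels keep their full receptive field, convolve each stripe separately, and concatenate the results, so peak activation memory scales with a stripe rather than the whole image. Below is a hedged sketch, assuming stride 1 and an odd "same"-padded kernel; the function name is illustrative and not taken from the linked repository.

```python
import torch
import torch.nn.functional as F

def patched_conv2d(x, weight, bias=None, num_stripes=4):
    """'Same' Conv2D computed one horizontal stripe at a time (stride 1, odd kernel)."""
    kh, kw = weight.shape[-2:]
    ph, pw = kh // 2, kw // 2
    x_pad = F.pad(x, (pw, pw, ph, ph))  # zero-pad once on the full tensor
    H = x.shape[-2]
    step = (H + num_stripes - 1) // num_stripes
    outs = []
    for start in range(0, H, step):
        stop = min(start + step, H)
        # The slice carries a halo of ph rows on each side, so boundary
        # outputs match those of a full-tensor convolution exactly.
        stripe = x_pad[..., start:stop + 2 * ph, :]
        outs.append(F.conv2d(stripe, weight, bias))
    return torch.cat(outs, dim=-2)

x = torch.randn(1, 16, 256, 256)
w = torch.randn(32, 16, 3, 3)
assert torch.allclose(patched_conv2d(x, w), F.conv2d(x, w, padding=1), atol=1e-5)
```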