mlc-ai / mlc-en
Related projects:
- FlashInfer: Kernel Library for LLM Serving
- A curated list of awesome projects and papers for distributed training and inference
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
- BitBLAS: a library for mixed-precision matrix multiplication, especially for quantized LLM deployment
- Microsoft Automatic Mixed Precision Library
- A throughput-oriented, high-performance serving framework for LLMs
- GPTQ inference Triton kernel
- FP16xINT4 LLM inference kernel achieving near-ideal ~4x speedups at medium batch sizes of 16-32 tokens
- An open-source, efficient deep learning framework/compiler, written in Python
- Latency and memory analysis of Transformer models for training and inference
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
- Optimized BERT transformer inference on NVIDIA GPUs (https://arxiv.org/abs/2210.03052)
- Zero Bubble Pipeline Parallelism
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
- Serving multiple LoRA-finetuned LLMs as one
- FlagGems: an operator library for large language models, implemented in the Triton language
- Flash Attention in ~100 lines of CUDA (forward pass only)
- A high-throughput and memory-efficient inference and serving engine for LLMs
- A baseline repository for auto-parallelism in neural network training
- Ring Attention implementation with Flash Attention
- Papers and accompanying code for AI systems
- Large Language Model (LLM) Systems Paper List
- Pipeline Parallelism for PyTorch
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
- A fast communication-overlapping library for tensor parallelism on GPUs
- A multi-level tensor algebra superoptimizer