flashinfer-ai / flashinfer
FlashInfer: Kernel Library for LLM Serving
☆1,491 Updated this week
Alternatives and similar repositories for flashinfer:
Users interested in flashinfer are comparing it to the libraries listed below.
- A throughput-oriented high-performance serving framework for LLMs ☆648 Updated 2 months ago
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆640 Updated 2 months ago
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ☆1,156 Updated last month
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs… ☆2,000 Updated this week
- Official Implementation of EAGLE-1 (ICML'24) and EAGLE-2 (EMNLP'24) ☆839 Updated last week
- The Triton TensorRT-LLM Backend ☆715 Updated last week
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models ☆1,273 Updated 4 months ago
- Serving multiple LoRA-finetuned LLMs as one ☆991 Updated 6 months ago
- [MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration ☆2,559 Updated last month
- A PyTorch Native LLM Training Framework ☆674 Updated 3 months ago
- QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving ☆457 Updated 3 weeks ago
- [NeurIPS'24 Spotlight] Speeds up long-context LLM inference with approximate and dynamic sparse attention computation, which reduces in… ☆812 Updated last week
- Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA ☆661 Updated this week
- Ring attention implementation with flash attention ☆595 Updated 3 weeks ago
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆1,725 Updated this week
- AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation: ☆1,793 Updated this week
- Disaggregated serving system for Large Language Models (LLMs). ☆370 Updated 3 months ago
- Pipeline Parallelism for PyTorch ☆729 Updated 3 months ago
- Microsoft Automatic Mixed Precision Library ☆527 Updated 2 months ago
- Fast inference from large language models via speculative decoding ☆590 Updated 3 months ago
- LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalabili… ☆2,644 Updated this week
- Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline mod… ☆323 Updated 2 months ago
- Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM ☆753 Updated this week
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ☆373 Updated this week
- ☆293 Updated 8 months ago
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed. ☆1,911 Updated last week
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆429 Updated this week
- TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillati… ☆589 Updated last week
- FlagGems is an operator library for large language models implemented in Triton Language. ☆347 Updated this week