FlexFlow Serve: Low-Latency, High-Performance LLM Serving
☆74 · Updated Sep 15, 2025
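For context before the list of alternatives, here is a minimal usage sketch of FlexFlow Serve, assuming the `flexflow.serve` Python API shown in the flexflow-serve README; the model name, GPU count, and memory/batch parameters below are illustrative, not prescriptive:

```python
# Minimal FlexFlow Serve sketch, assuming the flexflow.serve Python API
# from the project README; all resource settings here are illustrative.
import flexflow.serve as ff

# Configure the runtime: GPU count, per-GPU memory (MB), and parallelism degrees.
ff.init(
    num_gpus=4,
    memory_per_gpu=14000,
    zero_copy_memory_per_node=30000,
    tensor_parallelism_degree=4,
    pipeline_parallelism_degree=1,
)

# Load a HuggingFace model (name is illustrative) and compile it for serving.
llm = ff.LLM("meta-llama/Llama-3.1-8B-Instruct")
config = ff.GenerationConfig(do_sample=True, temperature=0.9, topp=0.8, topk=1)
llm.compile(
    config,
    max_requests_per_batch=16,
    max_seq_length=256,
    max_tokens_per_batch=128,
)

# Serve a single prompt, then shut down.
llm.start_server()
result = llm.generate("Here are some travel tips for Tokyo:\n")
llm.stop_server()
```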
Alternatives and similar repositories for flexflow-serve
Users interested in flexflow-serve are comparing it to the libraries listed below.
- ☆13 · Updated Jan 7, 2025
- ☆34 · Updated Jun 22, 2024
- Compression for Foundation Models · ☆35 · Updated Jul 21, 2025
- Automatically Discovering Fast Parallelization Strategies for Distributed Deep Neural Network Training · ☆1,861 · Updated Feb 20, 2026
- Performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios · ☆44 · Updated Feb 27, 2025
- ☆164 · Updated Jul 15, 2025
- ☆131 · Updated Nov 11, 2024
- PoC for "SpecReason: Fast and Accurate Inference-Time Compute via Speculative Reasoning" [NeurIPS '25] · ☆64 · Updated Oct 2, 2025
- A throughput-oriented high-performance serving framework for LLMs · ☆946 · Updated Oct 29, 2025
- Dynamic Memory Management for Serving LLMs without PagedAttention · ☆463 · Updated May 30, 2025
- An Attention Superoptimizer · ☆22 · Updated Jan 20, 2025
- Stateful LLM Serving · ☆96 · Updated Mar 11, 2025
- Artifact for "Marconi: Prefix Caching for the Era of Hybrid LLMs" [MLSys '25 Outstanding Paper Award, Honorable Mention] · ☆52 · Updated Mar 5, 2025
- ☆17 · Updated May 10, 2024
- ☆20 · Updated Jun 9, 2025
- High-performance Transformer implementation in C++ · ☆152 · Updated Jan 18, 2025
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of … · ☆314 · Updated Jun 10, 2025
- ☆18 · Updated Aug 5, 2025
- ☆27 · Updated Jan 7, 2025
- Vortex: A Flexible and Efficient Sparse Attention Framework · ☆48 · Updated Jan 21, 2026
- Standalone FlashAttention-2 kernel without a libtorch dependency · ☆114 · Updated Sep 10, 2024
- Multiple GEMM operators built with CUTLASS to support LLM inference · ☆20 · Updated Aug 3, 2025
- TLLM_QMM strips the quantized-kernel implementations out of NVIDIA's TensorRT-LLM, removing the NVInfer dependency and exposing an easy-to-use Pyt… · ☆16 · Updated Jul 5, 2024
- [SIGMOD 2025] PQCache: Product Quantization-based KVCache for Long Context LLM Inference · ☆82 · Updated Dec 7, 2025
- Efficient and easy multi-instance LLM serving · ☆527 · Updated Sep 3, 2025
- Efficient Long-context Language Model Training by Core Attention Disaggregation · ☆91 · Updated this week
- APEX+ is an LLM Serving Simulator · ☆42 · Updated Jun 16, 2025
- SpotServe: Serving Generative Large Language Models on Preemptible Instances · ☆135 · Updated Feb 22, 2024
- Scalable and robust tree-based speculative decoding algorithm · ☆370 · Updated Jan 28, 2025
- ☆97 · Updated Mar 26, 2025
- [COLM 2024] SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models · ☆25 · Updated Oct 5, 2024
- Llama INT4 CUDA inference with AWQ · ☆54 · Updated Jan 20, 2025
- git@github.com:endymecy/awesome-deeplearning-resources.git · ☆22 · Updated Apr 20, 2017
- Möbius Transformation for Fast Inner Product Search on Graph · ☆22 · Updated Jun 3, 2021
- ☆26 · Updated Feb 17, 2025
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer · ☆96 · Updated Feb 20, 2026
- [NeurIPS 2024] Efficient LLM Scheduling by Learning to Rank · ☆70 · Updated Nov 4, 2024
- Draft-Target Disaggregation LLM Serving System via Parallel Speculative Decoding · ☆162 · Updated Feb 4, 2026
- ☆64 · Updated Dec 3, 2024