PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
☆32 · Updated Nov 16, 2024
Alternatives and similar repositories for PipeInfer
Users that are interested in PipeInfer are comparing it to the libraries listed below
- [ICML 2021] "Auto-NBA: Efficient and Effective Search Over the Joint Space of Networks, Bitwidths, and Accelerators" by Yonggan Fu, Yonga… ☆16 · Updated Jan 3, 2022
- [DATE 2025] Official implementation and dataset of AIrchitect v2: Learning the Hardware Accelerator Design Space through Unified Represen… ☆19 · Updated Jan 17, 2025
- FractalTensor is a programming framework that introduces a novel approach to organizing data in deep neural networks (DNNs) as a list of … ☆30 · Updated Dec 21, 2024
- Visual Tagger is a JavaScript tool that visually highlights HTML elements for AIs, aiding in identifying interactive components on web pa… ☆11 · Updated Oct 28, 2024
- A high-performance batching router that optimizes throughput for text-inference workloads ☆16 · Updated Sep 6, 2023
- Continuous Pipelined Speculative Decoding ☆18 · Updated Jan 4, 2026
- [NeurIPS 2024] The official implementation of "Kangaroo: Lossless Self-Speculative Decoding for Accelerating LLMs via Double Early Exitin… ☆68 · Updated Jun 26, 2024
- Disaggregated serving system for Large Language Models (LLMs) ☆785 · Updated Apr 6, 2025
- Accelerating LLM inference with techniques like speculative decoding, quantization, and kernel fusion, focusing on implementing state-of-… ☆11 · Updated Jul 1, 2025
- Code for the AAAI 2024 Oral paper "OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Model… ☆69 · Updated Mar 7, 2024
- Official implementation for 'Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient LLM Reasoning' ☆26 · Updated Feb 18, 2025
- A proxy that hosts multiple single-model runners such as llama.cpp and vLLM ☆13 · Updated May 30, 2025
- Code based on vLLM for the paper "Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention" ☆11 · Updated Sep 19, 2024
- A simple cycle-accurate DaDianNao simulator ☆13 · Updated Mar 27, 2019
- STREAMer: Benchmarking remote volatile and non-volatile memory bandwidth ☆17 · Updated Aug 21, 2023
- A script that automatically switches Qwen3 between its reasoning and non-reasoning modes, built on an OpenAI-like API. The infere… ☆22 · Updated May 9, 2025
- [ACL 2025 Oral 🔥] Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling ☆24 · Updated Nov 11, 2025
- Yet another LLM frontend, written in .NET and WinUI 3 ☆10 · Updated Sep 14, 2025
- Open-source static analysis toolkit for LLM agent plans ☆13 · Updated Aug 9, 2025
- [DAC'25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference" ☆105 · Updated Dec 15, 2025
- [ACL 2025 main] FR-Spec: Frequency-Ranked Speculative Sampling ☆54 · Updated Jul 15, 2025
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI '24) ☆182 · Updated Jul 10, 2024
- Source code for Jellyfish, a soft real-time inference serving system ☆15 · Updated Dec 20, 2022