PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
☆32 · Nov 16, 2024 · Updated last year
Alternatives and similar repositories for PipeInfer
Users interested in PipeInfer are comparing it to the libraries listed below.
- ☆28 · May 24, 2025 · Updated 9 months ago
- Open-source static analysis toolkit for LLM agent plans ☆13 · Aug 9, 2025 · Updated 6 months ago
- A FastAPI application that integrates with Telegram using webhooks and the OpenAI Agents SDK for AI-powered stock trading assistance, utilizi… ☆16 · May 11, 2025 · Updated 9 months ago
- ☆18 · Mar 4, 2025 · Updated 11 months ago
- [DATE 2025] Official implementation and dataset of AIrchitect v2: Learning the Hardware Accelerator Design Space through Unified Represen… ☆19 · Jan 17, 2025 · Updated last year
- ☆32 · Aug 21, 2021 · Updated 4 years ago
- [ICML 2021] "Auto-NBA: Efficient and Effective Search Over the Joint Space of Networks, Bitwidths, and Accelerators" by Yonggan Fu, Yonga… ☆16 · Jan 3, 2022 · Updated 4 years ago
- ☆19 · Mar 21, 2023 · Updated 2 years ago
- A high-performance batching router that optimises maximum throughput for text inference workloads ☆16 · Sep 6, 2023 · Updated 2 years ago
- DEPRECATED: see ChiScraper instead ☆17 · Oct 13, 2024 · Updated last year
- ☆20 · Sep 28, 2024 · Updated last year
- FractalTensor is a programming framework that introduces a novel approach to organizing data in deep neural networks (DNNs) as a list of … ☆32 · Dec 21, 2024 · Updated last year
- KoboldCpp smart launcher with GPU-layer and tensor-override tuning ☆30 · May 18, 2025 · Updated 9 months ago
- Super simple Python connectors for llama.cpp, including vision models (Gemma 3, Qwen2-VL). Compile llama.cpp and run! ☆29 · Dec 11, 2025 · Updated 2 months ago
- [SPAA '21] Efficient Stepping Algorithms and Implementations for Parallel Shortest Paths ☆21 · Aug 10, 2024 · Updated last year
- Run Ollama and GGUF models easily with a single command ☆52 · May 15, 2024 · Updated last year
- Run Orpheus 3B locally with a Gradio UI; standalone app ☆23 · Apr 1, 2025 · Updated 11 months ago
- 🎮 Material You TUI for monitoring NVIDIA GPUs ☆58 · Jan 16, 2026 · Updated last month
- Locally hosted AI-agent Python tool to generate novel research hypotheses, titles, and abstracts ☆30 · Apr 30, 2025 · Updated 10 months ago
- ☆33 · Jan 30, 2025 · Updated last year
- Artifact for "Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving" [SOSP '24] ☆24 · Nov 21, 2024 · Updated last year
- B-Llama3o: a Llama 3 with vision and audio understanding, as well as text, audio, and animation data output ☆26 · Jun 3, 2024 · Updated last year
- The heart of the Pulsar app: fast, secure, shared inference with a modern UI ☆59 · Dec 1, 2024 · Updated last year
- ☆28 · Feb 26, 2023 · Updated 3 years ago
- A cute Llama voice assistant ☆27 · Sep 10, 2023 · Updated 2 years ago
- Code for the AAAI 2024 Oral paper "OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Model… ☆69 · Mar 7, 2024 · Updated last year
- ☆26 · Mar 14, 2024 · Updated last year
- [DAC '25] Official implementation of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference" ☆101 · Dec 15, 2025 · Updated 2 months ago
- ☆54 · May 28, 2025 · Updated 9 months ago
- [ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs ☆123 · Jul 4, 2025 · Updated 7 months ago
- Benchmarks for LLM tokenizers ☆17 · Jan 15, 2026 · Updated last month
- Repository holding the code base for AC-SpGEMM: "Adaptive Sparse Matrix-Matrix Multiplication on the GPU" ☆31 · Jul 7, 2020 · Updated 5 years ago
- Generate your own private morning radio for your commute ☆32 · Feb 5, 2025 · Updated last year
- Share sensitive information securely with a link that can only be viewed once ☆42 · Feb 16, 2026 · Updated 2 weeks ago
- Experience the power of AI with this free AI voice generator demo. Utilizing Deepgram and Groq, we transform text into voice seamlessly. … ☆37 · Jun 12, 2024 · Updated last year
- Official repo for "SplitQuant / LLM-PQ: Resource-Efficient LLM Offline Serving on Heterogeneous GPUs via Phase-Aware Model Partition and … ☆36 · Aug 29, 2025 · Updated 6 months ago
- Disaggregated serving system for Large Language Models (LLMs) ☆777 · Apr 6, 2025 · Updated 10 months ago
- Run multiple resource-heavy large models (LMs) on the same machine with a limited amount of VRAM/other resources by exposing them on differe… ☆88 · Feb 7, 2026 · Updated 3 weeks ago
- The official implementation of the HPCA 2025 paper "Prosperity: Accelerating Spiking Neural Networks via Product Sparsity" ☆37 · Aug 9, 2025 · Updated 6 months ago