Modular and structured prompt caching for low-latency LLM inference
☆109 · Nov 9, 2024 · Updated last year
Alternatives and similar repositories for prompt-cache
Users interested in prompt-cache are comparing it to the libraries listed below.
- Stateful LLM Serving ☆97 · Mar 11, 2025 · Updated 11 months ago
- Implemented a script that automatically adjusts Qwen3's inference and non-inference capabilities, based on an OpenAI-like API. The infere… ☆22 · May 9, 2025 · Updated 9 months ago
- [VLDB'23] A Skew-Resistant Index for Processing-in-Memory ☆26 · Jan 5, 2026 · Updated 2 months ago
- ☆165 · Jul 15, 2025 · Updated 7 months ago
- ☆40 · Jul 26, 2024 · Updated last year
- LITS: An Optimized Learned Index for Strings ☆13 · Jun 18, 2025 · Updated 8 months ago
- Efficient and easy multi-instance LLM serving ☆528 · Sep 3, 2025 · Updated 6 months ago
- ☆10 · Apr 30, 2025 · Updated 10 months ago
- How to plot for papers, slides, demos, etc. ☆10 · Apr 7, 2022 · Updated 3 years ago
- record your keypress counts into a database to track computer usage ☆12 · Dec 25, 2025 · Updated 2 months ago
- Code based on vLLM for the paper "Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention" ☆11 · Sep 19, 2024 · Updated last year
- Inference deployment of llama3 ☆11 · Apr 21, 2024 · Updated last year
- ☆11 · Mar 3, 2024 · Updated 2 years ago
- ☆26 · Aug 31, 2023 · Updated 2 years ago
- LangGraph.js agent that requires authorization before invoking tools ☆14 · Feb 24, 2025 · Updated last year
- The (open-source part of) code to reproduce "BPPSA: Scaling Back-propagation by Parallel Scan Algorithm" ☆13 · Jun 7, 2021 · Updated 4 years ago
- Preparing a Machine Learning/Computer Vision environment for the NVIDIA Jetson TX2 ☆16 · Jul 5, 2017 · Updated 8 years ago
- ☆14 · Aug 19, 2024 · Updated last year
- ☆19 · Jun 1, 2025 · Updated 9 months ago
- ☆19 · Apr 18, 2024 · Updated last year
- Artifacts of the EuroSys'24 paper "Exploring Performance and Cost Optimization with ASIC-Based CXL Memory" ☆31 · Feb 21, 2024 · Updated 2 years ago
- 📀 Encoding YUV into H.264 and transmitting it to an RTMP server ☆10 · Dec 22, 2023 · Updated 2 years ago
- fnm -> this -> load-env = use fnm in nushell ☆12 · Apr 14, 2025 · Updated 10 months ago
- Probably one of the lightest native RAG + Agent apps out there; experience the power of Agent-powered models and Agent-driven knowledge ba… ☆32 · May 30, 2025 · Updated 9 months ago
- ☆13 · Nov 1, 2021 · Updated 4 years ago
- Official implementation of "TailorKV: A Hybrid Framework for Long-Context Inference via Tailored KV Cache Optimization" (Findings of ACL … ☆21 · Jul 25, 2025 · Updated 7 months ago
- ☆14 · Apr 8, 2023 · Updated 2 years ago
- ☆20 · Jun 9, 2025 · Updated 8 months ago
- OpenCSD: eBPF Computational Storage Device (CSD) for Zoned Namespace (ZNS) SSDs in QEMU ☆66 · Nov 1, 2023 · Updated 2 years ago
- HW/SW co-designed end-host RPC stack ☆20 · Oct 28, 2021 · Updated 4 years ago
- ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression (DAC'25) ☆25 · Feb 26, 2026 · Updated last week
- GHive: Accelerating Analytical Query Processing in Apache Hive via CPU-GPU Heterogeneous Computing ☆14 · Nov 8, 2023 · Updated 2 years ago
- LEMON: Explainable Entity Matching ☆19 · Apr 6, 2022 · Updated 3 years ago
- ☆150 · Oct 9, 2024 · Updated last year
- An experimentation platform for LLM inference optimisation ☆36 · Sep 19, 2024 · Updated last year
- PetPS: Supporting Huge Embedding Models with Tiered Memory ☆33 · May 21, 2024 · Updated last year
- Disaggregated serving system for Large Language Models (LLMs) ☆778 · Apr 6, 2025 · Updated 11 months ago
- A ChatGPT (GPT-3.5) & GPT-4 Workload Trace to Optimize LLM Serving Systems ☆241 · Feb 1, 2026 · Updated last month
- STREAMer: Benchmarking remote volatile and non-volatile memory bandwidth ☆17 · Aug 21, 2023 · Updated 2 years ago