ArkVale: Efficient Generative LLM Inference with Recallable Key-Value Eviction (NeurIPS'24)
☆53 · Updated Dec 17, 2024
Alternatives and similar repositories for ArkVale
Users interested in ArkVale are comparing it to the repositories listed below.
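Most of the repositories below share ArkVale's theme: keeping only a budget of KV-cache pages resident during long-context decoding, while retaining lightweight per-page summaries so that evicted pages can be recalled when a later query needs them. A minimal sketch of that idea, with hypothetical names and a plain dot-product relevance score (not ArkVale's actual API or scoring method):

```python
def select_pages(query, summaries, budget):
    """Rank pages by estimated relevance to the current query (dot product
    with each page's summary vector) and keep the top-`budget` pages.
    Evicted pages stay recallable because their summaries are retained."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    ranked = sorted(range(len(summaries)),
                    key=lambda p: dot(summaries[p], query),
                    reverse=True)
    return set(ranked[:budget])

# Two decoding steps with different queries: the resident set changes,
# so a page evicted at step 1 is recalled from backup at step 2.
summaries = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
step1 = select_pages([1.0, 0.2, 0.0], summaries, budget=2)  # pages {0, 1}
step2 = select_pages([0.0, 0.2, 1.0], summaries, budget=2)  # pages {1, 2}
recalled = step2 - step1                                    # page {2}
```

The listed projects differ mainly in how they build the summaries (mean keys, clusters, quantized codes, LSH signatures) and in where evicted pages live (host memory, disk, or a vector store).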
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆374 · Updated Jul 10, 2025
- MAGIS: Memory Optimization via Coordinated Graph Transformation and Scheduling for DNN (ASPLOS'24) ☆56 · Updated May 29, 2024
- [ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference ☆57 · Updated Nov 20, 2024
- [NeurIPS'25 Spotlight] Adaptive Attention Sparsity with Hierarchical Top-p Pruning ☆87 · Updated Nov 29, 2025
- [ICLR 2025] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆53 · Updated Aug 6, 2025
- ☆13 · Updated Dec 9, 2024
- [SIGMOD 2025] PQCache: Product Quantization-based KVCache for Long Context LLM Inference ☆83 · Updated Dec 7, 2025
- ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression (DAC'25) ☆25 · Updated Feb 26, 2026
- ☆17 · Updated Mar 26, 2025
- The Artifact of NeoMem: Hardware/Software Co-Design for CXL-Native Memory Tiering ☆63 · Updated Aug 11, 2024
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on Multi-GPU Clusters ☆57 · Updated Jul 23, 2024
- [VLDB 26, NeurIPS 25] Scalable long-context LLM decoding that leverages sparsity by treating the KV cache as a vector storage system ☆127 · Updated Feb 22, 2026
- PyTorch compilation tutorial covering TorchScript, torch.fx, and Slapo ☆17 · Updated Mar 13, 2023
- A sparse attention kernel supporting mixed sparse patterns ☆467 · Updated Jan 18, 2026
- [ICLR 2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation ☆250 · Updated Dec 16, 2024
- Dynamic Context Selection for Efficient Long-Context LLMs ☆56 · Updated May 20, 2025
- [COLM 2024] TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding ☆277 · Updated Aug 31, 2024
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆179 · Updated Jul 12, 2024
- 🛠 Robust SSH: auto-reconnect SSH session that preserves your running shell and command. Intuitive, no server-side setup, aimed at simplic… ☆13 · Updated Nov 14, 2025
- FPGA-based HyperLogLog Accelerator ☆12 · Updated Jul 13, 2020
- How to plot for papers, slides, demos, etc. ☆10 · Updated Apr 7, 2022
- ☆11 · Updated Apr 3, 2023
- ☆301 · Updated Jul 10, 2025
- Codebase for the ICML'24 paper: Learning from Students: Applying t-Distributions to Explore Accurate and Efficient Formats for LLMs ☆27 · Updated Jun 25, 2024
- Polyite: Iterative Schedule Optimization for Parallelization in the Polyhedron Model ☆12 · Updated Jan 19, 2020
- ☆10 · Updated Sep 14, 2023
- [ICML 2025 Spotlight] ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference ☆283 · Updated May 1, 2025
- The Next-gen Language & Compiler Powering Efficient Hardware Design ☆36 · Updated Jan 16, 2025
- The official implementation of the paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction ☆51 · Updated Oct 18, 2024
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗) ☆661 · Updated Feb 24, 2026
- You Only Search Once: On Lightweight Differentiable Architecture Search for Resource-Constrained Embedded Platforms ☆12 · Updated Apr 17, 2023
- InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management (OSDI'24) ☆179 · Updated Jul 10, 2024
- Utilities for paper writing ☆12 · Updated Jan 11, 2026
- Repository for the COLM 2025 paper SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths ☆15 · Updated Jul 10, 2025
- A retargetable and extensible synthesis-based compiler for modern hardware architectures ☆17 · Updated Nov 20, 2025
- [ASP-DAC 2025] "NeuronQuant: Accurate and Efficient Post-Training Quantization for Spiking Neural Networks" Official Implementation ☆15 · Updated Mar 6, 2025
- ☆23 · Updated Oct 7, 2025
- An analytical framework that models hardware dataflow of tensor applications on spatial architectures using the relation-centric notation… ☆87 · Updated Apr 28, 2024
- Code release for AdapMoE, accepted at ICCAD 2024 ☆35 · Updated Apr 28, 2025