JunHao-Zhu / FusionQuery
[VLDB 2024] Source code for FusionQuery: On-demand Fusion Queries over Multi-source Heterogeneous Data
☆11 · Updated last month
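As a rough orientation, the sketch below illustrates the general idea behind an on-demand fusion query over multi-source heterogeneous data: candidate values are pulled from each source only when a query arrives, and conflicting values are fused by a simple rule. It is a minimal, purely hypothetical sketch — the toy sources, the `query_sources`/`fuse` helpers, and the majority-vote fusion rule are illustrative assumptions, not FusionQuery's actual code or API.

```python
# Hypothetical sketch (NOT FusionQuery's API): on-demand retrieval from two
# heterogeneous toy sources, followed by a naive fusion step.
from collections import Counter

# Toy "sources" with different schemas (both hypothetical).
json_source = [
    {"title": "FusionQuery", "venue": "VLDB 2024"},
    {"title": "PQCache", "venue": "SIGMOD 2025"},
]
table_source = [
    ("FusionQuery", "VLDB'24"),
    ("TidalDecode", "ICLR 2025"),
]

def query_sources(title):
    """Retrieve candidate 'venue' values for a title from each source at query time."""
    candidates = []
    candidates += [r["venue"] for r in json_source if r["title"] == title]
    candidates += [venue for t, venue in table_source if t == title]
    return candidates

def fuse(candidates):
    """Naive truth fusion: majority vote over the retrieved candidate values."""
    if not candidates:
        return None
    return Counter(candidates).most_common(1)[0][0]

if __name__ == "__main__":
    values = query_sources("FusionQuery")
    print(values)        # candidate values gathered from both sources
    print(fuse(values))  # a single fused value chosen by the voting rule
```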
Alternatives and similar repositories for FusionQuery:
Users interested in FusionQuery are comparing it to the libraries listed below
- PyTorch implementation of the paper "SUBP: Soft Uniform Block Pruning for 1xN Sparse CNNs Multithreading Acceleration", accepted by NeurIPS … ☆22 · Updated last year
- Adaptive Attention Sparsity with Hierarchical Top-p Pruning ☆17 · Updated 2 months ago
- [SIGMOD 2025] PQCache: Product Quantization-based KVCache for Long Context LLM Inference ☆42 · Updated 2 months ago
- Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024) ☆56 · Updated last month
- [ICLR 2025] The official PyTorch implementation of "Dynamic Low-Rank Sparse Adaptation for Large Language Models". ☆16 · Updated last month
- Official implementation of ICML 2024 paper "ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking". ☆47 · Updated 9 months ago
- Stateful LLM Serving ☆63 · Updated last month
- SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models ☆19 · Updated 6 months ago
- [ICLR 2024] Dynamic Neural Response Tuning ☆16 · Updated last month
- The official implementation of Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference ☆72 · Updated 3 months ago
- 16-fold memory access reduction with nearly no loss ☆91 · Updated last month
- ☆59 · Updated 10 months ago
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models ☆29 · Updated 8 months ago
- Implementations of several LLM KV cache sparsity methods ☆32 · Updated 10 months ago
- PyTorch implementation of the paper "Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline". ☆85 · Updated last year
- FGNN's artifact evaluation (EuroSys 2022) ☆17 · Updated 3 years ago
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS ☆25 · Updated 2 months ago
- [ICLR 2025] TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆35 · Updated last week
- ☆22 · Updated 2 months ago
- This repo contains the source code for "Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs" ☆36 · Updated 8 months ago
- Artifact for "Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving" [SOSP '24] ☆24 · Updated 5 months ago
- ☆11 · Updated 6 months ago
- Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models ☆43 · Updated 5 months ago
- [ICLR 2024] Jaiswal, A., Gan, Z., Du, X., Zhang, B., Wang, Z., & Yang, Y. Compressing LLMs: The Truth Is Rarely Pure and Never Simple. ☆23 · Updated last week
- LLM Serving Performance Evaluation Harness ☆77 · Updated 2 months ago
- ☆70 · Updated last week
- A sparse attention kernel supporting mixed sparse patterns ☆197 · Updated 2 months ago
- A GPU-optimized system for efficient long-context LLM decoding with a low-bit KV cache. ☆33 · Updated last month
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆34 · Updated last week
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆160 · Updated 9 months ago