SqueezeAILab / SqueezedAttention
SQUEEZED ATTENTION: Accelerating Long Prompt LLM Inference
☆29 · Updated 3 weeks ago
Alternatives and similar repositories for SqueezedAttention:
Users interested in SqueezedAttention are comparing it to the repositories listed below
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆35 · Updated 9 months ago
- Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024) ☆45 · Updated last month
- Odysseus: Playground of LLM Sequence Parallelism ☆61 · Updated 6 months ago
- Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs ☆76 · Updated 3 weeks ago
- [AAAI 2024] Fluctuation-based Adaptive Structured Pruning for Large Language Models ☆37 · Updated 11 months ago
- PyTorch implementation of CaM: Cache Merging for Memory-efficient LLMs Inference (ICML 2024) ☆29 · Updated 5 months ago
- [ICML 2024 Oral] Official implementation of "Accurate LoRA-Finetuning Quantization of LLMs via Information Retention" ☆59 · Updated 8 months ago
- Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding (EMNLP 2023 Long) ☆56 · Updated 2 months ago
- Source code for Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs ☆32 · Updated 4 months ago
- TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆24 · Updated this week
- Activation-aware Singular Value Decomposition for Compressing Large Language Models ☆52 · Updated last month
- [EMNLP 2024] RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization ☆25 · Updated 2 months ago
- [NeurIPS 2024] Fast Best-of-N Decoding via Speculative Rejection ☆32 · Updated last month
- An algorithm for static activation quantization of LLMs ☆95 · Updated last month
- An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆33 · Updated 6 months ago
- The official implementation of the paper "Demystifying the Compression of Mixture-of-Experts Through a Unified Framework" ☆49 · Updated last month
- [ICML 2024] When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models ☆27 · Updated 6 months ago
- [ICLR 2024] Official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models" ☆20 · Updated 9 months ago
- The official implementation of the paper "SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction" ☆40 · Updated 2 months ago
- Implementation of Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting ☆47 · Updated 5 months ago
- BESA is a differentiable weight pruning technique for large language models ☆14 · Updated 9 months ago
- Code for "Everybody Prune Now: Structured Pruning of LLMs with only Forward Passes" ☆27 · Updated 8 months ago
- LLM Inference with Microscaling Format ☆11 · Updated last month
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆151 · Updated 5 months ago
- 16-fold memory access reduction with nearly no loss ☆63 · Updated last month