mit-han-lab / Block-Sparse-Attention
A sparse attention kernel supporting mixed sparse patterns
☆53 · Updated 3 weeks ago
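For orientation, block-sparse attention evaluates attention scores only for the (query-block, key-block) pairs enabled in a block-level mask, which is what makes mixed per-head sparsity patterns cheap to express. The snippet below is a minimal PyTorch sketch of that idea under assumed shapes; the function name `block_sparse_attention`, the `block_mask` layout, and the sink-plus-local example pattern are illustrative assumptions, not this repository's CUDA kernel or API (a real kernel skips masked blocks entirely rather than masking dense scores).

```python
import torch

def block_sparse_attention(q, k, v, block_mask, block_size=64):
    # q, k, v: [seq_len, head_dim]; block_mask: [seq_len // block_size, seq_len // block_size] bool.
    # Conceptual sketch only: dense scores are computed and then masked at block granularity.
    seq_len, head_dim = q.shape
    scores = (q @ k.transpose(0, 1)) / head_dim ** 0.5
    # Expand the block-level mask to token resolution.
    token_mask = block_mask.repeat_interleave(block_size, dim=0)
    token_mask = token_mask.repeat_interleave(block_size, dim=1)
    scores = scores.masked_fill(~token_mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Example "mixed" pattern: every query block attends to block 0 (attention sink)
# and to itself (local diagonal); other block pairs are skipped.
seq_len, head_dim, block_size = 256, 64, 64
n_blocks = seq_len // block_size
mask = torch.zeros(n_blocks, n_blocks, dtype=torch.bool)
mask[:, 0] = True                       # sink block
mask.fill_diagonal_(True)               # local blocks
q, k, v = (torch.randn(seq_len, head_dim) for _ in range(3))
out = block_sparse_attention(q, k, v, mask, block_size)
print(out.shape)  # torch.Size([256, 64])
```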
Related projects
Alternatives and complementary repositories for Block-Sparse-Attention
- 16-fold memory access reduction with nearly no loss ☆57 · Updated 2 months ago
- Official PyTorch implementation of FlatQuant: Flatness Matters for LLM Quantization ☆58 · Updated this week
- An algorithm for static activation quantization of LLMs ☆67 · Updated this week
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆195 · Updated last week
- Quantized Attention on GPU ☆29 · Updated last week
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆146 · Updated 4 months ago
- Code repository for Evaluating Quantized Large Language Models ☆103 · Updated 2 months ago
- SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models ☆12 · Updated last month
- Code for Palu: Compressing KV-Cache with Low-Rank Projection ☆54 · Updated this week
- [ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs ☆80 · Updated 2 months ago
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆161 · Updated this week
- Patch convolution to avoid large GPU memory usage of Conv2D ☆79 · Updated 5 months ago
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆108 · Updated 5 months ago
- Implementation of Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting ☆41 · Updated 4 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆55 · Updated 4 months ago
- [EMNLP 2024] RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization ☆21 · Updated last month
- TidalDecode: Fast and Accurate LLM Decoding with Position Persistent Sparse Attention ☆21 · Updated last month
- MagicPIG: LSH Sampling for Efficient LLM Generation ☆44 · Updated 2 weeks ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆51 · Updated 2 months ago
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs ☆76 · Updated last month
- PyTorch library for cost-effective, fast and easy serving of MoE models ☆101 · Updated 2 months ago
- This repo contains the source code for: Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs ☆32 · Updated 2 months ago
- [ACL 2024] A novel QAT with Self-Distillation framework to enhance ultra low-bit LLMs ☆81 · Updated 5 months ago
- Official implementation of ICML 2024 paper "ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking" ☆41 · Updated 4 months ago
- Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024) ☆35 · Updated 3 weeks ago