DeepAuto-AI / hip-attention
Training-free, post-training, sub-quadratic-complexity attention, implemented with OpenAI Triton.
☆19 · Updated 2 weeks ago
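The repository itself ships fused Triton kernels; to illustrate the underlying idea (each query attends only to its top-scoring key blocks rather than all T keys), here is a minimal, self-contained PyTorch sketch. This is not the repo's API: `topk_block_attention`, `block_size`, and `num_blocks` are hypothetical names, and the flat block scoring below is a simplification of the hierarchical selection a sub-quadratic scheme would use.

```python
# Minimal sketch of block-sparse top-k attention, the general idea behind
# sub-quadratic attention schemes. NOT the hip-attention Triton kernel or API;
# all names and parameters here are illustrative.
import torch
import torch.nn.functional as F

def topk_block_attention(q, k, v, block_size=64, num_blocks=8):
    """q: (T, d), k/v: (T, d). Each query attends only to the `num_blocks`
    key blocks whose mean-pooled keys score highest against it.
    Keys beyond the last full block are dropped for simplicity."""
    T, d = k.shape
    nb = T // block_size
    m = min(num_blocks, nb)
    # Coarse scores: query vs. mean-pooled key of each block -> (T, nb)
    k_blocks = k[: nb * block_size].view(nb, block_size, d).mean(dim=1)
    sel = (q @ k_blocks.t()).topk(m, dim=-1).indices          # (T, m)
    # Gather the selected key/value blocks per query
    k_full = k[: nb * block_size].view(nb, block_size, d)
    v_full = v[: nb * block_size].view(nb, block_size, d)
    k_sel = k_full[sel].reshape(T, -1, d)                      # (T, m*block_size, d)
    v_sel = v_full[sel].reshape(T, -1, d)
    # Dense attention restricted to the selected blocks
    scores = (q.unsqueeze(1) @ k_sel.transpose(1, 2)).squeeze(1) / d ** 0.5
    attn = F.softmax(scores, dim=-1)                           # (T, m*block_size)
    return (attn.unsqueeze(1) @ v_sel).squeeze(1)              # (T, d)
```

Note that with a fixed `num_blocks`, the attention step itself costs O(T) per layer, but this flat selection still scores every block (O(T²/block_size)); reaching truly sub-quadratic cost, as the repository claims, requires a hierarchical top-k search rather than the exhaustive scan sketched here.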
Related projects
Alternatives and complementary repositories for hip-attention
- [ICML 2024 Oral] Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs ☆83 · Updated 3 months ago
- [ICLR 2024 Spotlight] This is the official PyTorch implementation of "EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Di… ☆50 · Updated 5 months ago
- [DATE 2023] Pipe-BD: Pipelined Parallel Blockwise Distillation ☆11 · Updated last year
- [ICML 2024 Oral] This project is the official implementation of our Accurate LoRA-Finetuning Quantization of LLMs via Information Retenti… ☆59 · Updated 7 months ago
- ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation ☆34 · Updated 3 months ago
- It's All In the Teacher: Zero-Shot Quantization Brought Closer to the Teacher [CVPR 2022 Oral] ☆30 · Updated 2 years ago
- ☆23 · Updated 4 months ago
- Code for the AAAI 2024 Oral paper "OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Model… ☆53 · Updated 8 months ago
- The official implementation of PTQD: Accurate Post-Training Quantization for Diffusion Models ☆89 · Updated 8 months ago
- ☆96 · Updated 2 months ago
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod… ☆36 · Updated 8 months ago
- Patch convolution to avoid large GPU memory usage of Conv2D ☆79 · Updated 5 months ago
- An algorithm for static activation quantization of LLMs ☆79 · Updated 2 weeks ago
- QuEST: Efficient Finetuning for Low-bit Diffusion Models ☆35 · Updated 3 months ago
- Compressed LLMs for Efficient Text Generation [ICLR'24 Workshop] ☆65 · Updated 2 months ago
- QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference ☆112 · Updated 8 months ago
- Code repo for the paper "BiT: Robustly Binarized Multi-distilled Transformer" ☆101 · Updated last year
- torch_quantizer is an out-of-the-box quantization tool for PyTorch models on the CUDA backend, specially optimized for Diffusion Models. ☆19 · Updated 7 months ago
- [ICML 2024] SPP: Sparsity-Preserved Parameter-Efficient Fine-Tuning for Large Language Models ☆16 · Updated 5 months ago
- Official implementation of the EMNLP23 paper: Outlier Suppression+: Accurate quantization of large language models by equivalent and opti… ☆42 · Updated last year
- Model Stock: All we need is just a few fine-tuned models ☆92 · Updated 2 months ago
- PyTorch code for Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers ☆34 · Updated 2 months ago
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models ☆24 · Updated 3 months ago
- Official Implementation of SLEB: Streamlining LLMs through Redundancy Verification and Elimination of Transformer Blocks ☆31 · Updated 5 months ago
- Source code for the paper "Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs" ☆32 · Updated 3 months ago
- ☆47 · Updated last year
- [Preprint] Why is the State of Neural Network Pruning so Confusing? On the Fairness, Comparison Setup, and Trainability in Network Prunin… ☆40 · Updated last year
- Official PyTorch implementation of MaskSub "Masking Augmentation for Supervised Learning" ☆34 · Updated 8 months ago
- Is gradient information useful for pruning LLMs? ☆38 · Updated 7 months ago
- ☆33 · Updated 11 months ago