kyegomez / FlashAttention20
Get down and dirty with FlashAttention 2.0 in PyTorch: plug and play, no complex CUDA kernels.
☆90 · Updated last year
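The repo positions itself as plug-and-play FlashAttention 2.0 for PyTorch. As a rough illustration of what a drop-in attention call looks like, here is a minimal sketch using PyTorch's built-in `scaled_dot_product_attention` as a stand-in; the repo's own module name is not shown on this page, so this is not its actual API.

```python
# Minimal sketch of drop-in fused attention in plain PyTorch.
# F.scaled_dot_product_attention dispatches to a FlashAttention-style fused
# kernel when one is available; a repo-provided attention module would be
# swapped into a model the same way (assumption, not the repo's documented API).
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 8, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# Causal self-attention in one call, no hand-written CUDA kernel required.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 1024, 64])
```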
Related projects:
- ☆130 · Updated last year
- Low-bit optimizers for PyTorch ☆109 · Updated 11 months ago
- The official implementation of the EMNLP 2023 paper LLM-FP4 ☆156 · Updated 9 months ago
- An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆33 · Updated 3 months ago
- ☆75 · Updated this week
- PB-LLM: Partially Binarized Large Language Models ☆143 · Updated 9 months ago
- ☆164 · Updated 4 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆134 · Updated 2 months ago
- Unofficial implementations of block/layer-wise pruning methods for LLMs ☆45 · Updated 4 months ago
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ☆182 · Updated 4 months ago
- ☆191 · Updated 3 months ago
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆87 · Updated 3 months ago
- Implementation of Kangaroo: Lossless Self-Speculative Decoding via Double Early Exiting ☆39 · Updated 2 months ago
- Reorder-based post-training quantization for large language models ☆178 · Updated last year
- Unofficial implementation of the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆123 · Updated 2 months ago
- Triton implementation of Flash Attention 2.0 ☆21 · Updated last year
- KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache ☆213 · Updated 3 weeks ago
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆258 · Updated 2 months ago
- Block Transformer: Global-to-Local Language Modeling for Fast Inference (Official Code) ☆118 · Updated 2 weeks ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆161 · Updated 2 months ago
- Explorations into some recent techniques surrounding speculative decoding ☆190 · Updated 11 months ago
- Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs ☆68 · Updated 3 months ago
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by DeepMind (see the sketch after this list) ☆69 · Updated 6 months ago
- Activation-aware Singular Value Decomposition for Compressing Large Language Models ☆43 · Updated 3 weeks ago
- [ICML 2024 Oral] This project is the official implementation of our Accurate LoRA-Finetuning Quantization of LLMs via Information Retenti… ☆55 · Updated 5 months ago
- Official implementation of the ICLR 2024 paper AffineQuant ☆16 · Updated 5 months ago
- This repository contains integer operators on GPUs for PyTorch ☆172 · Updated 11 months ago
- ☆102 · Updated 3 months ago
- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆282 · Updated last month
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" ☆240 · Updated 2 weeks ago
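Several of the entries above revolve around speculative decoding. As a self-contained toy sketch of the accept/reject rule from "Accelerating Large Language Model Decoding with Speculative Sampling" (fixed categorical distributions stand in for the draft and target models; none of this code is taken from the linked repos):

```python
# Toy sketch of the speculative sampling accept/reject rule.
# In practice q comes from a small draft model and p from the large target
# model; here both are fixed toy categorical distributions.
import torch

def speculative_step(p: torch.Tensor, q: torch.Tensor) -> int:
    """Sample one token given target probs p and draft probs q (both 1-D)."""
    draft_token = torch.multinomial(q, 1).item()              # draft proposes
    accept_prob = torch.clamp(p[draft_token] / q[draft_token], max=1.0)
    if torch.rand(()) < accept_prob:                           # accept draft token
        return draft_token
    # On rejection, resample from the residual distribution max(0, p - q).
    residual = torch.clamp(p - q, min=0.0)
    residual = residual / residual.sum()
    return torch.multinomial(residual, 1).item()

vocab = 4
p = torch.tensor([0.1, 0.6, 0.2, 0.1])   # target model probabilities
q = torch.tensor([0.4, 0.3, 0.2, 0.1])   # draft model probabilities
samples = torch.tensor([speculative_step(p, q) for _ in range(10_000)])
# Empirical frequencies converge to p, so the output distribution is preserved.
print(torch.bincount(samples, minlength=vocab) / len(samples))
```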