leloykun / flash-attention-minimal
Flash Attention in 300-500 lines of CUDA/C++
☆36 Updated 4 months ago
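For context on what such a minimal implementation covers, here is a rough sketch (not taken from the repository itself) of the online-softmax accumulation that FlashAttention-style kernels are built around. Sizes and values are illustrative assumptions, and the loop runs on the CPU for clarity; a real kernel tiles K/V through GPU shared memory and keeps the running statistics in registers.

```cpp
// Sketch of the online-softmax update used by FlashAttention-style kernels:
// process keys/values in blocks, keeping only a running max (m), running
// denominator (l), and a partial weighted sum, so the full N x N attention
// matrix is never materialized. Toy sizes, single query, CPU-only.
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const int N = 8, d = 4, BLOCK = 4;            // seq len, head dim, tile size (illustrative)
    std::vector<float> q(d), K(N * d), V(N * d), out(d, 0.0f);
    for (int i = 0; i < d; ++i) q[i] = 0.1f * (i + 1);
    for (int i = 0; i < N * d; ++i) { K[i] = 0.01f * i; V[i] = 0.02f * i; }

    float m = -INFINITY;                          // running max of scores seen so far
    float l = 0.0f;                               // running softmax denominator
    const float scale = 1.0f / std::sqrt((float)d);

    for (int start = 0; start < N; start += BLOCK) {
        for (int j = start; j < start + BLOCK && j < N; ++j) {
            // score s_j = (q . k_j) / sqrt(d)
            float s = 0.0f;
            for (int t = 0; t < d; ++t) s += q[t] * K[j * d + t];
            s *= scale;

            // Online update: when the running max changes, rescale what was
            // accumulated so far, then fold in the new key/value pair.
            float m_new = std::fmax(m, s);
            float alpha = std::exp(m - m_new);    // correction factor for old terms
            float p     = std::exp(s - m_new);    // un-normalized weight of key j
            for (int t = 0; t < d; ++t)
                out[t] = out[t] * alpha + p * V[j * d + t];
            l = l * alpha + p;
            m = m_new;
        }
    }
    for (int t = 0; t < d; ++t) out[t] /= l;      // final normalization

    printf("attention(q, K, V) for one query: ");
    for (int t = 0; t < d; ++t) printf("%.4f ", out[t]);
    printf("\n");
    return 0;
}
```

A CUDA version of the same idea parallelizes this loop over queries and heads, with each thread block streaming K/V tiles through shared memory while holding m, l, and the partial output in registers.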
Alternatives and similar repositories for flash-attention-minimal
Users interested in flash-attention-minimal are comparing it to the libraries listed below.
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. ☆90 Updated 5 months ago
- ☆133 Updated 6 months ago
- Fast and memory-efficient exact attention ☆75 Updated 9 months ago
- Sirius, an efficient correction mechanism, which significantly boosts Contextual Sparsity models on reasoning tasks while maintaining its… ☆21 Updated last year
- The evaluation framework for training-free sparse attention in LLMs ☆106 Updated 2 months ago
- ☆156 Updated 10 months ago
- Stick-breaking attention ☆62 Updated 5 months ago
- Awesome Triton Resources ☆39 Updated 8 months ago
- ☆150 Updated 2 years ago
- ☆21 Updated 8 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆89 Updated last year
- The official implementation of the paper: SimLayerKV: A Simple Framework for Layer-Level KV Cache Reduction. ☆52 Updated last year
- ☆16 Updated this week
- Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of… ☆146 Updated last year
- Triton-based implementation of Sparse Mixture of Experts. ☆257 Updated 2 months ago
- ☆35 Updated last year
- LongSpec: Long-Context Lossless Speculative Decoding with Efficient Drafting and Verification ☆69 Updated 5 months ago
- Code for studying the super weight in LLM ☆121 Updated last year
- Continuous batching and parallel acceleration for RWKV6 ☆22 Updated last year
- Simple and efficient pytorch-native transformer training and inference (batched) ☆79 Updated last year
- [ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference ☆54 Updated last year
- ☆48 Updated 7 months ago
- ☆126 Updated 6 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆175 Updated last year
- Official implementation for Yuan & Liu & Zhong et al., KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark o… ☆87 Updated 10 months ago
- Kinetics: Rethinking Test-Time Scaling Laws ☆84 Updated 5 months ago
- Ouroboros: Speculative Decoding with Large Model Enhanced Drafting (EMNLP 2024 main) ☆112 Updated 9 months ago
- 🔥 A minimal training framework for scaling FLA models ☆324 Updated last month
- ☆57 Updated last year
- ☆48 Updated last year