Implementation of FlashAttention (FA1-FA4) in PyTorch for educational and algorithmic clarity
☆214 · Apr 12, 2026 · Updated last month
Alternatives and similar repositories for FlashAttention-PyTorch
Users that are interested in FlashAttention-PyTorch are comparing it to the libraries listed below.
- Get down and dirty with FlashAttention2.0 in pytorch, plug in and play no complex CUDA kernels · ☆114 · Jul 31, 2023 · Updated 2 years ago
- flash attention tutorial written in python, triton, cuda, cutlass · ☆517 · Jan 20, 2026 · Updated 4 months ago
- An approximate implementation of the OpenAI paper - An Empirical Model of Large-Batch Training for MNIST · ☆11 · Nov 19, 2022 · Updated 3 years ago
- Flash Attention in ~100 lines of CUDA (forward pass only) · ☆1,144 · Dec 30, 2024 · Updated last year
- a minimal cache manager for PagedAttention, on top of llama3. · ☆144 · Aug 26, 2024 · Updated last year
- Implement Flash Attention using Cute. · ☆108 · Dec 17, 2024 · Updated last year
- ☆16 · Mar 13, 2023 · Updated 3 years ago
- Prune transformer layers · ☆74 · May 30, 2024 · Updated 2 years ago
- The simplest implementation of recent Sparse Attention patterns for efficient LLM inference. · ☆92 · Jul 17, 2025 · Updated 10 months ago
- Triton implementation of FlashAttention2 that adds Custom Masks. · ☆176 · Aug 14, 2024 · Updated last year
- ☆27 · Aug 5, 2022 · Updated 3 years ago
- Implementation of the paper "Opcodes as predictor for malware" by Daniel Bilar · ☆11 · Oct 17, 2020 · Updated 5 years ago
- Implementation of Speculative Sampling as described in "Accelerating Large Language Model Decoding with Speculative Sampling" by Deepmind · ☆111 · Feb 29, 2024 · Updated 2 years ago
- triton ver of gqa flash attn, based on the tutorial · ☆12 · Aug 4, 2024 · Updated last year
- Fast inference from large language models via speculative decoding · ☆915 · Aug 22, 2024 · Updated last year
- A re-implementation of the "Red Teaming Language Models with Language Models" paper by Perez et al., 2022 · ☆34 · Oct 9, 2023 · Updated 2 years ago
- [ICLR 2025 & COLM 2025] Official PyTorch implementation of the Forgetting Transformer and Adaptive Computation Pruning · ☆150 · Feb 25, 2026 · Updated 3 months ago
- Port linux kernel list.h to userspace · ☆32 · Mar 18, 2015 · Updated 11 years ago
- ☆16 · Nov 14, 2022 · Updated 3 years ago
- [ACL 2025] Squeezed Attention: Accelerating Long Prompt LLM Inference