leloykun / flash-attention-minimal
Flash Attention in 300-500 lines of CUDA/C++
☆ 36, updated Aug 22, 2025
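For orientation on the technique the repository implements, below is a minimal CUDA sketch (not the repository's actual kernel) of a FlashAttention-style forward pass: K/V are streamed through shared memory in tiles and combined with an online softmax, so the full attention matrix is never materialized. The tile size, fixed head dimension, fp32 inputs, non-causal attention, and the assumption that `seq_len` is a multiple of the tile size are simplifications for illustration only.

```cuda
// Minimal FlashAttention-style forward pass sketch (illustrative, not the repo's kernel).
// Assumes fp32, non-causal attention, seq_len % TILE == 0, and a fixed head dimension.
#include <cuda_runtime.h>
#include <cfloat>

#define TILE 32      // query/key rows processed per tile (assumption)
#define HEAD_DIM 64  // fixed head dimension for this sketch (assumption)

__global__ void flash_attn_fwd(const float* Q, const float* K, const float* V,
                               float* O, int seq_len, float scale) {
    // One block per (batch * head) slice; one thread per query row within a Q tile.
    int bh  = blockIdx.x;
    int tid = threadIdx.x;
    const float* q = Q + (size_t)bh * seq_len * HEAD_DIM;
    const float* k = K + (size_t)bh * seq_len * HEAD_DIM;
    const float* v = V + (size_t)bh * seq_len * HEAD_DIM;
    float*       o = O + (size_t)bh * seq_len * HEAD_DIM;

    __shared__ float Ks[TILE][HEAD_DIM];
    __shared__ float Vs[TILE][HEAD_DIM];

    for (int q_start = 0; q_start < seq_len; q_start += TILE) {
        int qi = q_start + tid;  // global index of this thread's query row
        float qrow[HEAD_DIM];
        for (int d = 0; d < HEAD_DIM; ++d) qrow[d] = q[qi * HEAD_DIM + d];

        float m = -FLT_MAX;           // running max of the attention scores
        float l = 0.f;                // running softmax denominator
        float acc[HEAD_DIM] = {0.f};  // running weighted sum of V rows

        for (int k_start = 0; k_start < seq_len; k_start += TILE) {
            // Cooperatively stage one K/V tile into shared memory.
            for (int d = 0; d < HEAD_DIM; ++d) {
                Ks[tid][d] = k[(k_start + tid) * HEAD_DIM + d];
                Vs[tid][d] = v[(k_start + tid) * HEAD_DIM + d];
            }
            __syncthreads();

            for (int j = 0; j < TILE; ++j) {
                float s = 0.f;  // scaled dot product q . k_j
                for (int d = 0; d < HEAD_DIM; ++d) s += qrow[d] * Ks[j][d];
                s *= scale;

                // Online softmax: rescale previous partial sums to the new max.
                float m_new = fmaxf(m, s);
                float corr  = expf(m - m_new);
                float p     = expf(s - m_new);
                l = l * corr + p;
                for (int d = 0; d < HEAD_DIM; ++d)
                    acc[d] = acc[d] * corr + p * Vs[j][d];
                m = m_new;
            }
            __syncthreads();
        }
        // Normalize the accumulated numerator by the softmax denominator.
        for (int d = 0; d < HEAD_DIM; ++d) o[qi * HEAD_DIM + d] = acc[d] / l;
    }
}
// Launch sketch: flash_attn_fwd<<<batch * heads, TILE>>>(Q, K, V, O, seq_len,
//                                                        1.f / sqrtf((float)HEAD_DIM));
```

Real kernels in this space additionally handle causal masking, arbitrary sequence lengths, half-precision inputs, and the backward pass.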
Alternatives and similar repositories for flash-attention-minimal
Users interested in flash-attention-minimal are comparing it to the libraries listed below.
- Code and plots for "Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs" (☆ 10, updated Dec 30, 2024)
- Triton version of GQA flash attention, based on the tutorial (☆ 12, updated Aug 4, 2024)
- Code for the paper "Stack Attention: Improving the Ability of Transformers to Model Hierarchical Patterns" (☆ 18, updated Mar 15, 2024)
- ☆ 44, updated Nov 1, 2025
- Personal solutions to the Triton Puzzles (☆ 20, updated Jul 18, 2024)
- Parallel Associative Scan for Language Models (☆ 18, updated Jan 8, 2024)
- LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models (☆ 79, updated Oct 16, 2024)
- ☆ 19, updated Dec 12, 2023
- ☆ 22, updated Dec 1, 2021
- ☆ 20, updated Oct 11, 2023
- PyTorch FSDP support for optimizers (☆ 84, updated Dec 8, 2024)
- Flash Attention in ~100 lines of CUDA (forward pass only) (☆ 1,068, updated Dec 30, 2024)
- Flash-Linear-Attention models beyond language (☆ 21, updated Aug 28, 2025)
- u-MPS implementation and experimentation code used in the paper Tensor Networks for Probabilistic Sequence Modeling (https://arxiv.org/ab…) (☆ 19, updated Jul 2, 2020)
- A probabilistic model for contextual word representation. Accepted to ACL 2023 Findings.