shreyansh26 / FlashAttention-PyTorch
Implementation of FlashAttention in PyTorch
☆171 · Updated 9 months ago
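For orientation, below is a minimal PyTorch sketch of the tiled, online-softmax attention that FlashAttention is built on. This is not the repository's code and makes no claim about its kernel structure; the function name, block size, and tensor shapes are illustrative assumptions, and the point is only that the full attention matrix is never materialized.

```python
# Minimal sketch of FlashAttention-style tiled attention with an online softmax.
# Illustrative only: plain PyTorch, single head, 2D (seq_len, head_dim) tensors.
import torch

def tiled_attention(q, k, v, block_size=128):
    """Compute softmax(q @ k^T / sqrt(d)) @ v one key/value block at a time,
    keeping running row-max and row-sum statistics instead of the full matrix."""
    seq_len, d = q.shape
    scale = d ** -0.5
    out = torch.zeros_like(q)
    row_max = torch.full((seq_len, 1), float("-inf"), dtype=q.dtype, device=q.device)
    row_sum = torch.zeros((seq_len, 1), dtype=q.dtype, device=q.device)

    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]           # (B, d)
        v_blk = v[start:start + block_size]           # (B, d)
        scores = (q @ k_blk.T) * scale                # (seq_len, B)

        # Online softmax update: fold this block into the running statistics.
        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max)     # rescale previous accumulator
        p = torch.exp(scores - new_max)               # unnormalized probs for this block
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum

if __name__ == "__main__":
    # Quick check against the naive reference implementation.
    torch.manual_seed(0)
    q, k, v = (torch.randn(512, 64) for _ in range(3))
    ref = torch.softmax((q @ k.T) / 64 ** 0.5, dim=-1) @ v
    print(torch.allclose(tiled_attention(q, k, v), ref, atol=1e-5))
```

The repositories listed below explore the same idea in Triton, CUDA/CUTLASS, or with additional sparsity and quantization tricks.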
Alternatives and similar repositories for FlashAttention-PyTorch
Users interested in FlashAttention-PyTorch are comparing it to the libraries listed below.
- ☆147 · Updated 3 months ago
- ☆148 · Updated 7 months ago
- DeepSeek Native Sparse Attention pytorch implementation ☆103 · Updated this week
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training ☆240 · Updated 2 months ago
- 🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA. ☆220 · Updated 2 months ago
- A collection of memory efficient attention operators implemented in the Triton language. ☆279 · Updated last year
- qwen-nsa ☆78 · Updated 6 months ago
- TransMLA: Multi-Head Latent Attention Is All You Need (NeurIPS 2025 Spotlight) ☆382 · Updated 3 weeks ago
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ☆329 · Updated 7 months ago
- Get down and dirty with FlashAttention 2.0 in PyTorch; plug and play, no complex CUDA kernels ☆108 · Updated 2 years ago
- Efficient Mixture of Experts for LLM Paper List ☆132 · Updated 2 weeks ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆151 · Updated this week
- 青稞Talk ☆150 · Updated this week
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length ☆118 · Updated 6 months ago
- Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS ☆426 · Updated 5 months ago
- ☆43 · Updated last year
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆338 · Updated 3 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆124 · Updated 4 months ago
- Implement Flash Attention using CuTe. ☆96 · Updated 9 months ago
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ☆573 · Updated 3 weeks ago
- Triton implementation of FlashAttention2 that adds custom masks. ☆138 · Updated last year
- Built upon Megatron-Deepspeed and HuggingFace Trainer, EasyLLM has reorganized the code logic with a focus on usability. While enhancing … ☆48 · Updated last year
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆143 · Updated last month
- ☆129 · Updated 4 months ago
- ☆119 · Updated last month
- Triton implementation of FlashAttention 2.0 ☆40 · Updated 2 years ago
- The official implementation of the EMNLP 2023 paper LLM-FP4 ☆216 · Updated last year
- Implementation of FP8/INT8 rollout for RL training without performance drop. ☆253 · Updated 2 weeks ago
- [ICML 2025] Official PyTorch implementation of "FlatQuant: Flatness Matters for LLM Quantization" ☆171 · Updated 2 weeks ago
- Analyse problems in AI with math and code ☆26 · Updated 2 months ago