shreyansh26 / FlashAttention-PyTorch
Implementation of FlashAttention in PyTorch
☆129 · Updated 2 weeks ago
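For context, FlashAttention computes exact attention in tiles with an online softmax, so the full N×N score matrix is never materialized. The sketch below is a minimal PyTorch illustration of that tiling scheme (single head, no masking, no dropout); it shows the algorithm only, not this repository's actual code, and `flash_attention_forward` and `block_size` are illustrative names.

```python
import torch

def flash_attention_forward(Q, K, V, block_size=64):
    # Tiled attention with an online softmax (FlashAttention-style sketch).
    # Q, K, V: (seq_len, head_dim) tensors for a single head; no masking.
    N, d = Q.shape
    scale = d ** -0.5
    O = torch.empty_like(Q)
    for i in range(0, N, block_size):
        Qi = Q[i:i + block_size]
        # Running statistics for the online softmax over key blocks.
        m = torch.full((Qi.shape[0], 1), float("-inf"), device=Q.device, dtype=Q.dtype)
        l = torch.zeros(Qi.shape[0], 1, device=Q.device, dtype=Q.dtype)
        acc = torch.zeros_like(Qi)  # unnormalized output accumulator
        for j in range(0, N, block_size):
            S = Qi @ K[j:j + block_size].T * scale      # one tile of scores
            m_new = torch.maximum(m, S.max(dim=-1, keepdim=True).values)
            p = torch.exp(S - m_new)                    # tile softmax numerators
            alpha = torch.exp(m - m_new)                # rescales older partial sums
            l = alpha * l + p.sum(dim=-1, keepdim=True)
            acc = alpha * acc + p @ V[j:j + block_size]
            m = m_new
        O[i:i + block_size] = acc / l                   # normalize once per query block
    return O
```

The output matches `torch.softmax(Q @ K.T * d ** -0.5, dim=-1) @ V` up to floating-point error, while only one block of scores is ever held in memory at a time.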
Alternatives and similar repositories for FlashAttention-PyTorch:
Users interested in FlashAttention-PyTorch are comparing it to the libraries listed below.
- ☆73 · Updated 6 months ago
- Get down and dirty with FlashAttention 2.0 in PyTorch; plug-and-play, no complex CUDA kernels ☆102 · Updated last year
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference ☆415 · Updated last month
- Awesome list for LLM quantization ☆160 · Updated last month
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ☆249 · Updated 9 months ago
- A collection of memory-efficient attention operators implemented in the Triton language. ☆233 · Updated 7 months ago
- Transformer-related optimization, including BERT and GPT ☆59 · Updated last year
- Triton implementation of Flash Attention 2.0 ☆29 · Updated last year
- Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS ☆255 · Updated 3 weeks ago
- ☆140 · Updated last year
- [ACL 2024] A novel QAT with Self-Distillation framework to enhance ultra-low-bit LLMs. ☆98 · Updated 8 months ago
- Ring attention implementation with flash attention ☆660 · Updated last month
- The CUDA version of the RWKV language model (https://github.com/BlinkDL/RWKV-LM) ☆217 · Updated last month
- Low-bit optimizers for PyTorch ☆125 · Updated last year
- ☆217 · Updated 8 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆152 · Updated 6 months ago
- An easy-to-use package for implementing SmoothQuant for LLMs ☆89 · Updated 8 months ago
- An MoE implementation for PyTorch, [ATC '23] SmartMoE ☆61 · Updated last year
- Performance of the C++ interface of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios. ☆34 · Updated 4 months ago
- Puzzles for learning Triton; play with minimal environment configuration! ☆207 · Updated last month
- ☆76 · Updated last year
- PyTorch distributed tutorials ☆101 · Updated 3 months ago
- (Unofficial) PyTorch implementation of grouped-query attention (GQA) from "GQA: Training Generalized Multi-Query Transformer Models from …" (see the sketch after this list) ☆149 · Updated 8 months ago
- LLM theoretical performance analysis tools supporting params, FLOPs, memory, and latency analysis. ☆76 · Updated 3 weeks ago
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆95 · Updated last month
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 ☆190 · Updated last month
- Spec-Bench: A Comprehensive Benchmark and Unified Evaluation Platform for Speculative Decoding (ACL 2024 Findings) ☆218 · Updated 3 months ago
- ☆79 · Updated 4 months ago
- 📰 Must-read papers on KV Cache Compression (constantly updating 🤗). ☆276 · Updated 2 weeks ago
- Implementation of Flash Attention using CuTe. ☆67 · Updated last month
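One technique from the list above, grouped-query attention (GQA), shares each key/value head across a group of query heads so the KV cache shrinks by the group factor. Below is a minimal sketch under assumed shapes; the function name is illustrative and PyTorch ≥ 2.0 is assumed for `scaled_dot_product_attention`.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    # q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim),
    # where n_q_heads is a multiple of n_kv_heads.
    group_size = q.shape[1] // k.shape[1]
    # Replicate each KV head across its group of query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    return F.scaled_dot_product_attention(q, k, v)
```

With a single KV head this reduces to multi-query attention; with as many KV heads as query heads it is standard multi-head attention.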