shreyansh26 / FlashAttention-PyTorch
Implementation of FlashAttention in PyTorch
☆122 · Updated last year
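Since this repository and most of the list below center on attention kernels, a minimal sketch of the core FlashAttention idea (tiled attention with an online softmax) may help orient readers. This is a plain-PyTorch illustration under assumed single-head `(seq_len, head_dim)` shapes with no masking; the block size and function name are hypothetical, not taken from this repository.

```python
# Minimal sketch of the FlashAttention idea: process K/V in blocks and keep
# running row-wise maxima and normalizers (an "online softmax") so the full
# seq_len x seq_len score matrix is never materialized. Shapes, block size,
# and names are assumptions for illustration, not this repository's code.
import torch

def flash_attention_forward(q, k, v, block_size=64):
    """q, k, v: (seq_len, head_dim) for a single head, no masking."""
    seq_len, head_dim = q.shape
    scale = head_dim ** -0.5

    out = torch.zeros_like(q)                          # running unnormalized output
    row_max = torch.full((seq_len, 1), float("-inf"))  # running max of scores per query row
    row_sum = torch.zeros(seq_len, 1)                  # running softmax denominator

    for start in range(0, seq_len, block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]

        scores = (q @ k_blk.T) * scale                 # (seq_len, block) partial scores
        new_max = torch.maximum(row_max, scores.max(dim=-1, keepdim=True).values)

        # Rescale what was accumulated so far onto the new max, then add this block.
        correction = torch.exp(row_max - new_max)
        p = torch.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + p @ v_blk
        row_max = new_max

    return out / row_sum                               # normalize once at the end

# Sanity check against naive attention that materializes the score matrix.
torch.manual_seed(0)
q, k, v = (torch.randn(128, 32) for _ in range(3))
ref = torch.softmax((q @ k.T) * 32 ** -0.5, dim=-1) @ v
assert torch.allclose(flash_attention_forward(q, k, v), ref, atol=1e-5)
```

The real kernels fuse these steps in on-chip SRAM and handle masking, dropout, and the backward pass; the sketch only shows why the softmax can be computed blockwise.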
Related projects
Alternatives and complementary repositories for FlashAttention-PyTorch
- Get down and dirty with FlashAttention 2.0 in PyTorch: plug and play, no complex CUDA kernels ☆95 · Updated last year
- A collection of memory-efficient attention operators implemented in the Triton language. ☆215 · Updated 5 months ago
- ☆63 · Updated 3 months ago
- ☆133 · Updated last year
- FlashAttention tutorial written in Python, Triton, CUDA, and CUTLASS ☆194 · Updated 4 months ago
- Low-bit optimizers for PyTorch ☆118 · Updated last year
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference ☆352 · Updated last week
- (Unofficial) PyTorch implementation of grouped-query attention (GQA) from "GQA: Training Generalized Multi-Query Transformer Models from …" (see the GQA sketch after this list) ☆130 · Updated 6 months ago
- Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models ☆184 · Updated 6 months ago
- Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers ☆196 · Updated 2 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆67 · Updated 3 months ago
- The official implementation of the EMNLP 2023 paper LLM-FP4 ☆166 · Updated 10 months ago
- ☆74 · Updated 10 months ago
- Transformer-related optimization, including BERT and GPT ☆60 · Updated last year
- Performance of the C++ interface of FlashAttention and FlashAttention v2 in large language model (LLM) inference scenarios. ☆26 · Updated 2 months ago
- ☆183 · Updated 6 months ago
- Ring attention implementation with FlashAttention ☆578 · Updated this week
- ☆79 · Updated 2 months ago
- [ACL 2024] A novel QAT with self-distillation framework to enhance ultra-low-bit LLMs. ☆79 · Updated 5 months ago
- [USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral… ☆46 · Updated 3 months ago
- 📑 Dive into Big Model Training ☆110 · Updated last year
- Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models" ☆253 · Updated 2 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆51 · Updated last week
- The CUDA version of the RWKV language model (https://github.com/BlinkDL/RWKV-LM) ☆212 · Updated 5 months ago
- ☆283 · Updated 7 months ago
- Rectified Rotary Position Embeddings ☆338 · Updated 5 months ago
- Awesome list for LLM quantization ☆122 · Updated 3 weeks ago
- ☆50 · Updated last year
- The official code for the paper "Parallel Speculative Decoding with Adaptive Draft Length" ☆21 · Updated 2 months ago
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM ☆146 · Updated 3 months ago
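As flagged in the GQA entry above, here is a minimal sketch of grouped-query attention in plain PyTorch: each group of query heads shares one key/value head, shrinking the KV cache. All names, shapes, and head counts are assumptions for illustration; the listed repository's actual API may differ.

```python
# Minimal sketch of grouped-query attention (GQA). Illustrative only:
# shapes, head counts, and the repeat-based broadcast are assumptions,
# not code from any repository listed above.
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v):
    """q: (batch, n_q_heads, seq, dim); k, v: (batch, n_kv_heads, seq, dim),
    with n_q_heads an integer multiple of n_kv_heads."""
    group = q.shape[1] // k.shape[1]
    # Broadcast each shared KV head across its group of query heads.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

q = torch.randn(2, 8, 16, 64)   # 8 query heads
k = torch.randn(2, 2, 16, 64)   # 2 shared key/value heads -> groups of 4
v = torch.randn(2, 2, 16, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([2, 8, 16, 64])
```

With 8 query heads sharing 2 KV heads, the KV cache is 4x smaller than in standard multi-head attention while the output shape is unchanged.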