alexzhang13 / flashattention2-custom-mask
Triton implementation of FlashAttention2 that adds Custom Masks.
☆99 · Updated 6 months ago
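Standard FlashAttention-2 kernels bake in a small set of mask patterns (typically none or causal); the point of this repo is a Triton kernel that accepts an arbitrary user-supplied attention mask. As a reference for the semantics only (this is not the repo's API), the sketch below computes the same masked attention with PyTorch's stock SDPA and an explicit boolean mask; a fused kernel's advantage is producing this result without materializing the full seq_len × seq_len score matrix.

```python
# Illustrative sketch of "attention with a custom mask" using PyTorch's
# built-in SDPA. All names and shapes here are assumptions for the example,
# not the flashattention2-custom-mask API.
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 4, 128, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# An arbitrary boolean mask: True = attend, False = block. A custom-mask
# kernel takes a mask like this instead of a fixed causal pattern.
custom_mask = torch.rand(seq_len, seq_len) > 0.1
custom_mask |= torch.eye(seq_len, dtype=torch.bool)  # each query attends to itself

out = F.scaled_dot_product_attention(q, k, v, attn_mask=custom_mask)
print(out.shape)  # torch.Size([2, 4, 128, 64])
```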
Alternatives and similar repositories for flashattention2-custom-mask:
Users interested in flashattention2-custom-mask are comparing it to the libraries listed below.
- Triton-based implementation of Sparse Mixture of Experts. ☆203 · Updated 3 months ago
- Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers ☆207 · Updated 6 months ago
- Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding" ☆114 · Updated 11 months ago
- 🔥 A minimal training framework for scaling FLA models ☆74 · Updated this week
- Odysseus: Playground of LLM Sequence Parallelism ☆65 · Updated 8 months ago
- [ICLR 2025] COAT: Compressing Optimizer States and Activation for Memory-Efficient FP8 Training ☆128 · Updated 2 weeks ago
- [ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference ☆253 · Updated 3 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆67 · Updated 4 months ago
- A sparse attention kernel supporting mixed sparse patterns ☆156 · Updated 3 weeks ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆102 · Updated 2 months ago
- 16-fold memory access reduction with nearly no loss ☆80 · Updated 2 weeks ago
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆226 · Updated this week
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆102 · Updated this week
- The official implementation of the paper "MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression" ☆119 · Updated 3 months ago
- This repository contains the experimental PyTorch native float8 training UX ☆221 · Updated 7 months ago
- Ring attention implementation with flash attention ☆700 · Updated last week
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 ☆196 · Updated 3 months ago
- Triton implementation of FlashAttention-2 ☆29 · Updated last year
- ring-attention experiments ☆127 · Updated 4 months ago
- Collection of kernels written in Triton language ☆110 · Updated 2 weeks ago
- Normalized Transformer (nGPT) ☆156 · Updated 3 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆231 · Updated 2 weeks ago
- A collection of memory efficient attention operators implemented in the Triton language. ☆248 · Updated 9 months ago
- Fast and memory-efficient exact attention ☆65 · Updated this week