kyegomez / FlashMHA
A simple PyTorch implementation of Flash MultiHead Attention
☆21 · Updated last year
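Since the repository's exact API isn't reproduced here, the following is a minimal sketch of what a FlashMHA-style layer typically looks like in PyTorch. The module name `FlashMHASketch` and its constructor arguments are illustrative assumptions, not this repository's actual interface; `torch.nn.functional.scaled_dot_product_attention` (PyTorch ≥ 2.0) dispatches to FlashAttention kernels when hardware and dtypes permit.

```python
# Minimal sketch of a FlashMHA-style layer, NOT the repository's actual API.
# Relies on PyTorch's fused scaled_dot_product_attention, which uses
# FlashAttention kernels on supported GPUs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlashMHASketch(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int, dropout: float = 0.0):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.dropout = dropout
        self.qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor, is_causal: bool = False) -> torch.Tensor:
        b, s, _ = x.shape                      # x: (batch, seq_len, embed_dim)
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)
        # SDPA expects (batch, heads, seq_len, head_dim).
        q = q.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, s, self.num_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(
            q, k, v,
            dropout_p=self.dropout if self.training else 0.0,
            is_causal=is_causal,
        )
        out = out.transpose(1, 2).reshape(b, s, -1)  # back to (b, s, embed_dim)
        return self.out_proj(out)

# Usage:
# mha = FlashMHASketch(embed_dim=512, num_heads=8)
# y = mha(torch.randn(2, 128, 512), is_causal=True)  # y: (2, 128, 512)
```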
Alternatives and similar repositories for FlashMHA:
Users interested in FlashMHA are comparing it to the repositories listed below.
- ☆30 · Updated 11 months ago
- Accelerate LLM preference tuning via prefix sharing with a single line of code ☆41 · Updated last week
- Linear Attention Sequence Parallelism (LASP) ☆82 · Updated 11 months ago
- Triton implementation of bi-directional (non-causal) linear attention ☆47 · Updated 3 months ago
- On The Planning Abilities of OpenAI's o1 Models: Feasibility, Optimality, and Generalizability ☆38 · Updated 2 weeks ago
- Official repository for the ICML 2024 paper "MoRe Fine-Tuning with 10x Fewer Parameters" ☆18 · Updated last week
- Experimental scripts for researching data-adaptive learning rate scheduling ☆23 · Updated last year
- Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding" ☆116 · Updated last year
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024) ☆25 · Updated 10 months ago
- Reproduction code for the paper "Reducing Transformer Key-Value Cache Size with Cross-Layer Attention" (MIT CSAIL) ☆14 · Updated 11 months ago
- ☆71 · Updated 2 months ago
- Official repository of the paper "RNNs Are Not Transformers (Yet): The Key Bottleneck on In-context Retrieval" ☆27 · Updated last year
- [EMNLP 2022] Official implementation of Transnormer from the paper "The Devil in Linear Transformer" ☆60 · Updated last year
- Quantized Attention on GPU ☆45 · Updated 5 months ago
- "Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding" Zhenyu Zhang, Runjin Chen, Shiw… ☆29 · Updated last year
- Code for the paper "Patch-Level Training for Large Language Models" ☆84 · Updated 5 months ago
- Utilities for Training Very Large Models ☆58 · Updated 7 months ago
- ☆45 · Updated 2 months ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆39 · Updated last year
- Here we will test various linear attention designs (see the sketch after this list for the shared core idea) ☆60 · Updated last year
- Implementation of the model "Hedgehog" from the paper "The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry" ☆13 · Updated last year
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod… ☆36 · Updated last year
- Odysseus: Playground of LLM Sequence Parallelism ☆69 · Updated 10 months ago
- ☆68 · Updated this week
- ☆103 · Updated last year
- Patch convolution to avoid large GPU memory usage of Conv2D ☆87 · Updated 3 months ago
- ☆53 · Updated 10 months ago
- [ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate" ☆99 · Updated last month
- 🔥 A minimal training framework for scaling FLA models ☆128 · Updated this week
- ☆18 · Updated last week
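Several entries above (LASP, Transnormer, Hedgehog, the linear-attention playground) share one core trick: replacing softmax(QKᵀ)V, which is quadratic in sequence length, with φ(Q)(φ(K)ᵀV), which is linear because the small key-value summary is contracted first. Below is a minimal non-causal sketch using the common φ(x) = elu(x) + 1 feature map (the choice popularized by "Transformers are RNNs", Katharopoulos et al., 2020); it is not any of these repositories' specific design.

```python
# Minimal non-causal linear attention sketch: O(n * d^2) instead of O(n^2 * d).
import torch

def linear_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     eps: float = 1e-6) -> torch.Tensor:
    # q, k, v: (batch, heads, seq_len, head_dim)
    phi_q = torch.nn.functional.elu(q) + 1.0  # non-negative feature map
    phi_k = torch.nn.functional.elu(k) + 1.0
    # Contract over the sequence first: a (head_dim x head_dim) summary.
    kv = torch.einsum("bhsd,bhse->bhde", phi_k, v)
    # Per-position normalizer: phi(q_i) . sum_j phi(k_j).
    z = torch.einsum("bhsd,bhd->bhs", phi_q, phi_k.sum(dim=2))
    out = torch.einsum("bhsd,bhde->bhse", phi_q, kv) / (z.unsqueeze(-1) + eps)
    return out

# Usage:
# q = k = v = torch.randn(2, 8, 1024, 64)
# y = linear_attention(q, k, v)  # same shape as v, linear cost in seq_len
```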