kyegomez / FlashMHA
A simple PyTorch implementation of Flash MultiHead Attention
☆20 · Updated last year
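FlashMHA itself is the only code this page describes. As a rough, hypothetical sketch of what a flash-style multi-head attention layer typically looks like in PyTorch ≥ 2.0 (not kyegomez's actual implementation; the class and parameter names here are assumptions), one can route through torch.nn.functional.scaled_dot_product_attention, which dispatches to a fused FlashAttention-style kernel when the dtype, device, and mask pattern allow:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlashMHA(nn.Module):
    """Illustrative multi-head attention that routes through PyTorch's
    scaled_dot_product_attention, which can use a fused flash kernel."""

    def __init__(self, embed_dim: int, num_heads: int, causal: bool = False):
        super().__init__()
        assert embed_dim % num_heads == 0, "embed_dim must divide across heads"
        self.num_heads, self.head_dim = num_heads, embed_dim // num_heads
        self.causal = causal
        self.qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape                      # (batch, seq_len, embed_dim)
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)
        # SDPA expects (batch, heads, seq_len, head_dim).
        q, k, v = (t.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        out = F.scaled_dot_product_attention(q, k, v, is_causal=self.causal)
        return self.out_proj(out.transpose(1, 2).reshape(b, n, d))

# Usage with made-up sizes:
mha = FlashMHA(embed_dim=512, num_heads=8, causal=True)
y = mha(torch.randn(2, 128, 512))  # -> (2, 128, 512)
```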
Alternatives and similar repositories for FlashMHA
Users interested in FlashMHA are comparing it to the repositories listed below.
- Linear Attention Sequence Parallelism (LASP) ☆85 · Updated last year
- ☆31 · Updated last year
- [ICML24] Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs ☆91 · Updated 7 months ago
- Exploring Diffusion Transformer Designs via Grafting ☆45 · Updated last month
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts ☆40 · Updated last year
- ☆51 · Updated last year
- Official repository for ICML 2024 paper "MoRe Fine-Tuning with 10x Fewer Parameters" ☆20 · Updated 2 months ago
- Self-reproduction code for the paper "Reducing Transformer Key-Value Cache Size with Cross-Layer Attention" (MIT CSAIL) ☆17 · Updated last year
- The open-source materials for the paper "Sparsing Law: Towards Large Language Models with Greater Activation Sparsity" ☆23 · Updated 8 months ago
- ☆77 · Updated 4 months ago
- Implementation of the model "Hedgehog" from the paper "The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry" ☆14 · Updated last year
- ☆48 · Updated last month
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory-efficient Transformers. ☆48 · Updated 2 years ago
- [ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear attention mechanism ☆101 · Updated last year
- Utilities for Training Very Large Models ☆58 · Updated 9 months ago
- RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best of RNN and transformer. ☆50 · Updated 4 months ago
- ☆119 · Updated last month
- Triton implementation of bi-directional (non-causal) linear attention ☆52 · Updated 5 months ago
- Experimental scripts for researching data-adaptive learning rate scheduling. ☆23 · Updated last year
- Official implementation of "Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning" ☆19 · Updated last month
- Accelerate LLM preference tuning via prefix sharing with a single line of code ☆42 · Updated 2 weeks ago
- ☆22 · Updated 3 months ago
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024). ☆25 · Updated this week
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆142 · Updated last month
- A curated list of recent papers on efficient video attention for video diffusion models, including sparsification, quantization, and caching ☆28 · Updated this week
- Code for NOLA, an implementation of "NOLA: Compressing LoRA using Linear Combination of Random Basis" ☆56 · Updated 10 months ago
- [ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate" ☆116 · Updated 2 weeks ago
- ☆39 · Updated last year
- Here we will test various linear attention designs (a minimal sketch of the idea follows this list). ☆60 · Updated last year
- HGRN2: Gated Linear RNNs with State Expansion ☆55 · Updated 11 months ago
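Many of the entries above (LASP, Hedgehog, DiJiang, the bi-directional Triton kernel, HGRN2) revolve around linear attention. As a rough orientation only, not any one repository's method, the sketch below shows the generic trick: replace softmax(QKᵀ)V with φ(Q)(φ(K)ᵀV), so the (K, V) summary is computed once and cost scales linearly in sequence length. The feature map φ(x) = elu(x) + 1 and the tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Non-causal linear attention: O(n·d²) rather than O(n²·d).

    q, k, v: (batch, heads, seq_len, head_dim). The feature map
    phi(x) = elu(x) + 1 keeps scores positive; it is one common,
    illustrative choice, not the one every repo above uses.
    """
    q, k = F.elu(q) + 1, F.elu(k) + 1            # phi(Q), phi(K)
    kv = torch.einsum("bhnd,bhne->bhde", k, v)   # phi(K)^T V, summed over positions
    norm = torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps  # per-query normalizer
    return torch.einsum("bhnd,bhde->bhne", q, kv) / norm.unsqueeze(-1)

# Shape check with made-up dimensions:
x = torch.randn(1, 2, 16, 8)
print(linear_attention(x, x, x).shape)  # torch.Size([1, 2, 16, 8])
```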