kyegomez / FlashMHALinks

An simple pytorch implementation of Flash MultiHead Attention

☆20

Alternatives and similar repositories for FlashMHA

Users that are interested in FlashMHA are comparing it to the libraries listed below

Sorting:

OpenNLPLab / LASP
Linear Attention Sequence Parallelism (LASP)
☆85Updated last year
BBuf / flash-rwkv
☆31Updated last year
pprp / Pruner-Zero
[ICML24] Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs
☆91Updated 7 months ago
keshik6 / grafting
Exploring Diffusion Transformer Designs via Grafting
☆45Updated last month
rayleizhu / vllm-ra
[ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts
☆40Updated last year
tridao / flash-attention-wheels
☆51Updated last year
SprocketLab / sparse_matrix_fine_tuning
Official repository for ICML 2024 paper "MoRe Fine-Tuning with 10x Fewer Parameters"
☆20Updated 2 months ago
JerryYin777 / Cross-Layer-Attention
Self Reproduction Code of Paper "Reducing Transformer Key-Value Cache Size with Cross-Layer Attention (MIT CSAIL)
☆17Updated last year
thunlp / SparsingLaw
The open-source materials for paper "Sparsing Law: Towards Large Language Models with Greater Activation Sparsity".
☆23Updated 8 months ago
sustcsonglin / linear-attention-and-beyond-slides
☆77Updated 4 months ago
kyegomez / Hedgehog
Implementation of the model "Hedgehog" from the paper: "The Hedgehog & the Porcupine: Expressive Linear Attentions with Softmax Mimicry"
☆14Updated last year
Infini-AI-Lab / gsm_infinite
☆48Updated last month
kyegomez / Blockwise-Parallel-Transformer
32 times longer context window than vanilla Transformers and up to 4 times longer than memory efficient Transformers.
☆48Updated 2 years ago
YuchuanTian / DiJiang
[ICML'24 Oral] The official code of "DiJiang: Efficient Large Language Models through Compact Kernelization", a novel DCT-based linear at…
☆101Updated last year
eth-easl / fmengine
Utilities for Training Very Large Models
☆58Updated 9 months ago
RWKV / RWKV-LM
RWKV is an RNN with transformer-level LLM performance. It can be directly trained like a GPT (parallelizable). So it's combining the best…
☆50Updated 4 months ago
Dao-AILab / grouped-latent-attention
☆119Updated last month
fla-org / flash-bidirectional-linear-attention
Triton implement of bi-directional (non-causal) linear attention
☆52Updated 5 months ago
facebookresearch / adaptive_scheduling
Experimental scripts for researching data adaptive learning rate scheduling.
☆23Updated last year
jiwonsong-dev / ReasoningPathCompression
Official implementation of "Reasoning Path Compression: Compressing Generation Trajectories for Efficient LLM Reasoning"
☆19Updated last month
li-plus / flash-preference
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆42Updated 2 weeks ago
TianjinYellow / StableSPAM
☆22Updated 3 months ago
yuzhenmao / IceFormer
Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024).
☆25Updated this week
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆142Updated last month
hao-ai-lab / Awesome-Video-Attention
A curated list of recent papers on efficient video attention for video diffusion models, including sparsification, quantization, and cach…
☆28Updated this week
UCDvision / NOLA
Code for NOLA, an implementation of "nola: Compressing LoRA using Linear Combination of Random Basis"
☆56Updated 10 months ago
zhixuan-lin / forgetting-transformer
[ICLR 2025] Official PyTorch implementation of "Forgetting Transformer: Softmax Attention with a Forget Gate"
☆116Updated 2 weeks ago
Lyken17 / hf-torrent
☆39Updated last year
BlinkDL / LinearAttentionArena
Here we will test various linear attention designs.
☆60Updated last year
OpenNLPLab / HGRN2
HGRN2: Gated Linear RNNs with State Expansion
☆55Updated 11 months ago