xiayuqing0622 / flex_head_fa

Fast and memory-efficient exact attention

☆75

Alternatives and similar repositories for flex_head_fa

Users who are interested in flex_head_fa are comparing it to the libraries listed below.

tile-ai / tvm
Open deep learning compiler stack for CPU, GPU and specialized accelerators
☆19 · Updated this week
microsoft / AttentionEngine
☆118 · May 19, 2025 · Updated 8 months ago
LeiWang1999 / Stream-k.tvm
☆20 · Sep 28, 2024 · Updated last year
tile-ai / AttentionEngine
☆52 · May 19, 2025 · Updated 8 months ago
shreyansh26 / Attention-Mask-Patterns
Using FlexAttention to compute attention with different masking patterns
☆47 · Sep 22, 2024 · Updated last year
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆100 · Dec 17, 2024 · Updated last year
sjelassi / transformers_ssm_copy
☆35 · Feb 26, 2024 · Updated last year
66RING / CritiPrefill
Code repo for "CritiPrefill: A Segment-wise Criticality-based Approach for Prefilling Acceleration in LLMs".
☆16 · Sep 15, 2024 · Updated last year
microsoft / nnscaler
nnScaler: Compiling DNN models for Parallel Training
☆124 · Sep 23, 2025 · Updated 4 months ago
nox-410 / tvm.tl
An extension of TVMScript to write simple and high-performance GPU kernels with tensorcore.
☆51 · Jul 23, 2024 · Updated last year
chhzh123 / ptc-tutorial
PyTorch compilation tutorial covering TorchScript, torch.fx, and Slapo
☆17 · Mar 13, 2023 · Updated 2 years ago
shawntan / stickbreaking-attention
Stick-breaking attention
☆62 · Jul 1, 2025 · Updated 7 months ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆106 · Jun 28, 2025 · Updated 7 months ago
jamii / texsearch
A search index specialised for LaTeX equations. Developed for latexsearch.com.
☆17 · Jul 15, 2011 · Updated 14 years ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆110 · Mar 26, 2025 · Updated 10 months ago
kylehkhsu / tripod
☆12 · Apr 19, 2024 · Updated last year
bethelmelesse / UnifiedCrawl
☆16 · Nov 26, 2024 · Updated last year
uwsampl / paper-agents
☆13 · Dec 9, 2024 · Updated last year
mit-han-lab / Quest
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆372 · Jul 10, 2025 · Updated 7 months ago
IsaacRe / vllm-kvcompress
KV cache compression for high-throughput LLM inference
☆153 · Feb 5, 2025 · Updated last year
amirbar / StoP
☆13 · Jun 26, 2024 · Updated last year
xhuang28 / NewBioNer
☆11 · Nov 16, 2019 · Updated 6 years ago
peichenxie / FPRev
☆24 · May 9, 2025 · Updated 9 months ago
mit-han-lab / Block-Sparse-Attention
A sparse attention kernel supporting mixed sparse patterns
☆455 · Jan 18, 2026 · Updated 3 weeks ago
Cranial-XIX / longhorn
Official PyTorch Implementation of the Longhorn Deep State Space Model
☆56 · Dec 4, 2024 · Updated last year
ml-jku / hopfield-boosting
☆33 · May 15, 2024 · Updated last year
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆235 · Jun 15, 2025 · Updated 8 months ago
meta-pytorch / attention-gym
Helpful tools and examples for working with flex-attention
☆1,127 · Feb 8, 2026 · Updated last week
OpenBitSys / BitDistiller
[ACL 2024] A novel QAT with Self-Distillation framework to enhance ultra low-bit LLMs.
☆134 · May 16, 2024 · Updated last year
jzhang38 / LongMamba
Some preliminary explorations of Mamba's context scaling.
☆218 · Feb 8, 2024 · Updated 2 years ago
lcy-seso / DLFrameworkTest
My tests and experiments with some popular dl frameworks.
☆17 · Sep 11, 2025 · Updated 5 months ago
TiledTensor / TiledBench
Benchmark tests supporting the TiledCUDA library.
☆18 · Nov 19, 2024 · Updated last year
watcl-lab / positional_attention
Source code for the paper "Positional Attention: Expressivity and Learnability of Algorithmic Computation"
☆14 · May 26, 2025 · Updated 8 months ago
jiazhihao / attention_superoptimizer
An Attention Superoptimizer
☆22 · Jan 20, 2025 · Updated last year
fla-org / flash-linear-attention
🚀 Efficient implementations of state-of-the-art linear attention models
☆4,379 · Updated this week
tile-ai / tilescale
Tile-based language built for AI computation across all scales
☆120 · Feb 8, 2026 · Updated last week
tile-ai / TileOPs
☆86 · Updated this week
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆192 · Jan 28, 2025 · Updated last year
alexzhang13 / flashattention2-custom-mask
Triton implementation of FlashAttention2 that adds Custom Masks.
☆167 · Aug 14, 2024 · Updated last year
