vllm-project / flash-attentionLinks

Fast and memory-efficient exact attention

☆96

Alternatives and similar repositories for flash-attention

Users that are interested in flash-attention are comparing it to the libraries listed below

Sorting:

InternLM / turbomind
☆96Updated 6 months ago
infinigence / FlashOverlap
A lightweight design for computation-communication overlap.
☆181Updated 2 weeks ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆121Updated 5 months ago
FlagOpen / FlagCX
☆91Updated last week
feifeibear / LLMRoofline
Compare different hardware platforms via the Roofline Model for LLM inference tasks.
☆115Updated last year
infinigence / Semi-PD
A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation.
☆112Updated 5 months ago
usyd-fsalab / fp6_llm
An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).
☆265Updated 3 months ago
AlibabaPAI / FLASHNN
☆100Updated last year
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆138Updated last month
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆221Updated 2 years ago
microsoft / sarathi-serve
A low-latency & high-throughput serving engine for LLMs
☆431Updated last week
madsys-dev / deepseekv2-profile
☆148Updated 7 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆124Updated 4 months ago
bytedance / ByteMLPerf
AI Accelerator Benchmark focuses on evaluating AI Accelerators from a practical production perspective, including the ease of use and ver…
☆265Updated 2 months ago
Ascend / triton-ascend
Triton adapter for Ascend. Mirror of https://gitee.com/ascend/triton-ascend
☆76Updated 3 weeks ago
xlite-dev / ffpa-attn
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
☆223Updated 2 months ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆72Updated 5 months ago
triton-lang / kernels
☆92Updated 11 months ago
stepfun-ai / StepMesh
☆307Updated 3 weeks ago
ByteDance-Seed / cudaLLM
☆120Updated 2 months ago
LLMServe / SwiftTransformer
High performance Transformer implementation in C++.
☆138Updated 9 months ago
FlagOpen / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆282Updated last year
DeepLink-org / DLSlime
DLSlime: Flexible & Efficient Heterogeneous Transfer Toolkit
☆70Updated this week
flexflow / flexflow-serve
FlexFlow Serve: Low-Latency, High-Performance LLM Serving
☆63Updated last month
Victarry / PP-Schedule-Visualization
Pipeline Parallelism Emulation and Visualization
☆68Updated 4 months ago
flashinfer-ai / cutlass-viz
☆65Updated 5 months ago
PKU-SEC-Lab / HybriMoE
[DAC'25] Official implement of "HybriMoE: Hybrid CPU-GPU Scheduling and Cache Management for Efficient MoE Inference"
☆75Updated 4 months ago
perplexityai / pplx-kernels
Perplexity GPU Kernels
☆497Updated last month
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆112Updated last year
meta-pytorch / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆264Updated this week