FlagOpen / FlagAttention

A collection of memory-efficient attention operators implemented in the Triton language.

☆283

Alternatives and similar repositories for FlagAttention

Users who are interested in FlagAttention are comparing it to the libraries listed below.


fanshiqing / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆156 Updated 3 weeks ago
66RING / tiny-flash-attention
Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS
☆437 Updated 5 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆125 Updated 5 months ago
usyd-fsalab / fp6_llm
Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5).
☆266 Updated 3 months ago
ColfaxResearch / cutlass-kernels
☆241 Updated last year
madsys-dev / deepseekv2-profile
☆148 Updated 7 months ago
AlibabaPAI / FLASHNN
☆100 Updated last year
meta-pytorch / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆272 Updated this week
Guangxuan-Xiao / torch-int
This repository contains integer operators on GPUs for PyTorch.
☆220 Updated 2 years ago
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆41 Updated 8 months ago
sail-sg / zero-bubble-pipeline-parallelism
Zero Bubble Pipeline Parallelism
☆433 Updated 5 months ago
meta-pytorch / applied-ai
Applied AI experiments and examples for PyTorch
☆301 Updated 2 months ago
KnowingNothing / MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
☆387 Updated 3 weeks ago
AniZpZ / AutoSmoothQuant
An easy-to-use package for implementing SmoothQuant for LLMs
☆107 Updated 6 months ago
xlite-dev / ffpa-attn
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
☆226 Updated 2 months ago
HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆144 Updated 2 months ago
FlagOpen / FlagGems
FlagGems is an operator library for large language models implemented in the Triton Language.
☆703 Updated last week
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆323 Updated last year
OpenPPL / ppl.nn.llm
☆139 Updated last year
gpu-mode / triton-index
Cataloging released Triton kernels.
☆263 Updated last month
DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆244 Updated 4 months ago
Victarry / PP-Schedule-Visualization
Pipeline Parallelism Emulation and Visualization
☆69 Updated 4 months ago
microsoft / vattention
Dynamic Memory Management for Serving LLMs without PagedAttention
☆432 Updated 5 months ago
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆221 Updated 2 years ago
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆96 Updated 10 months ago
InternLM / turbomind
☆97 Updated 7 months ago
CalebDu / Awesome-Cute
☆107 Updated 5 months ago
yifuwang / symm-mem-recipes
☆144 Updated 10 months ago
kwai / Megatron-Kwai
[USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral…
☆66 Updated last year
RulinShao / LightSeq
Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training
☆216 Updated last year