tridao / flash-attention-wheels
☆ 44 · Updated 11 months ago
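Once a flash-attn wheel from this repository (or PyPI) is installed, attention is typically invoked through `flash_attn_func`. The snippet below is a minimal usage sketch, not code from this repo: the shapes, dtype, and the causal flag are illustrative assumptions.

```python
# Minimal usage sketch: assumes a CUDA GPU and an installed flash-attn wheel
# (e.g. `pip install flash-attn`); shapes and dtype below are illustrative.
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
k = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)
v = torch.randn(batch, seqlen, nheads, headdim, device="cuda", dtype=torch.float16)

# Fused attention with causal masking; the output has the same shape as q.
out = flash_attn_func(q, k, v, causal=True)
```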
Related projects
Alternatives and complementary repositories for flash-attention-wheels
- Repository for CPU Kernel Generation for LLM Inference · ☆ 25 · Updated last year
- Breaking the Throughput-Latency Trade-off for Long Sequences with Speculative Decoding · ☆ 79 · Updated this week
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters · ☆ 32 · Updated 3 months ago
- [ACL 2024] RelayAttention for Efficient Large Language Model Serving with Long System Prompts · ☆ 34 · Updated 8 months ago
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… · ☆ 20 · Updated last week
- Repository for Sparse Finetuning of LLMs via a modified version of the MosaicML llmfoundry · ☆ 38 · Updated 10 months ago
- Odysseus: Playground of LLM Sequence Parallelism · ☆ 57 · Updated 5 months ago
- Here we will test various linear attention designs. · ☆ 56 · Updated 6 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity · ☆ 51 · Updated 2 months ago
- QuIP quantization · ☆ 46 · Updated 8 months ago
- FlexAttention w/ FlashAttention3 Support · ☆ 27 · Updated last month
- Code for Palu: Compressing KV-Cache with Low-Rank Projection · ☆ 57 · Updated this week
- CUDA and Triton implementations of Flash Attention with SoftmaxN · ☆ 66 · Updated 5 months ago
- Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs · ☆ 74 · Updated 5 months ago
- The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" · ☆ 56 · Updated last month
- Simple and fast low-bit matmul kernels in CUDA / Triton · ☆ 145 · Updated this week
- ring-attention experiments · ☆ 97 · Updated last month
- My Implementation of Q-Sparse: All Large Language Models can be Fully Sparsely-Activated · ☆ 30 · Updated 3 months ago
- Using FlexAttention to compute attention with different masking patterns (see the sketch after this list) · ☆ 40 · Updated last month
- KV cache compression for high-throughput LLM inference · ☆ 87 · Updated this week
- A repository for research on medium-sized language models · ☆ 74 · Updated 5 months ago
- Official repository for the paper "SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention" · ☆ 92 · Updated last month
- GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLM · ☆ 147 · Updated 4 months ago
- This repository contains code for the MicroAdam paper · ☆ 12 · Updated 4 months ago
- (WIP) Parallel inference for black-forest-labs' FLUX model · ☆ 11 · Updated this week
- An algorithm for static activation quantization of LLMs · ☆ 77 · Updated last week
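For the FlexAttention masking entry above, the sketch below shows the general pattern of computing attention with a custom mask. It assumes PyTorch 2.5+, which ships `torch.nn.attention.flex_attention`; the causal `mask_mod` and the tensor shapes are illustrative assumptions, not code taken from the listed repository.

```python
# Minimal sketch of FlexAttention with a custom mask (assumes PyTorch >= 2.5;
# shapes and the causal mask are illustrative, not code from the listed repo).
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 2, 8, 1024, 64  # batch, heads, sequence length, head dim
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

def causal_mask(b, h, q_idx, kv_idx):
    # Allow a query to attend only to keys at or before its own position.
    return q_idx >= kv_idx

# Build the block-sparse mask once, then reuse it across calls.
block_mask = create_block_mask(causal_mask, B=None, H=None, Q_LEN=S, KV_LEN=S)
out = flex_attention(q, k, v, block_mask=block_mask)  # same shape as q
```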