habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to OpenMP, and automatically compiles the annotated code to GPU kernels (see the sketch below).
☆23 · Updated last week
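A minimal sketch of the annotation style described above. The `appy` package name, the `@appy.jit` decorator, and the exact pragma wording are assumptions based on the OpenMP analogy, not the confirmed API; consult the APPy repository for the real syntax.

```python
# Hypothetical sketch of an APPy-style annotated kernel.
# Decorator name and pragma wording are assumptions, not the confirmed API.
import torch
import appy  # assumed package name


@appy.jit
def vector_add(a, b, c, N):
    # OpenMP-like directive asking the compiler to parallelize this loop on the GPU
    #pragma parallel for simd
    for i in range(N):
        c[i] = a[i] + b[i]


a = torch.randn(1024, device="cuda")
b = torch.randn(1024, device="cuda")
c = torch.empty_like(a)
vector_add(a, b, c, a.numel())  # compiled to a GPU kernel on first call (assumed behavior)
```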
Alternatives and similar repositories for APPy:
Users interested in APPy are comparing it to the libraries listed below.
- Quantized Attention on GPU ☆45 · Updated 5 months ago
- FlexAttention w/ FlashAttention3 Support ☆26 · Updated 7 months ago
- Flash-Muon: An Efficient Implementation of the Muon Optimizer ☆91 · Updated this week
- ☆17 · Updated last week
- Transformers components but in Triton ☆33 · Updated last month
- Accelerate LLM preference tuning via prefix sharing with a single line of code ☆40 · Updated last week
- ☆20 · Updated 2 months ago
- A bunch of kernels that might make stuff slower 😉 ☆40 · Updated this week
- Awesome Triton Resources ☆27 · Updated last week
- Framework to reduce autotune overhead to zero for well-known deployments ☆70 · Updated this week
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance ☆122 · Updated this week
- ☆48 · Updated last year
- Benchmark tests supporting the TiledCUDA library ☆16 · Updated 5 months ago
- ☆68 · Updated 3 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆69 · Updated 10 months ago
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) ☆24 · Updated 11 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing ☆84 · Updated last week
- ☆22 · Updated last year
- GPTQ inference TVM kernel ☆38 · Updated last year
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8 ☆45 · Updated 9 months ago
- ☆68 · Updated 2 weeks ago
- DeeperGEMM: crazy optimized version ☆68 · Updated this week
- ☆30 · Updated 11 months ago
- CUDA and Triton implementations of Flash Attention with SoftmaxN ☆69 · Updated 11 months ago
- Here we will test various linear attention designs ☆60 · Updated last year
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate ☆102 · Updated this week
- 🎬 3.7× faster video generation E2E 🖼️ 1.6× faster image generation E2E ⚡ ColumnSparseAttn 9.3× vs FlashAttn‑3 💨 ColumnSparseGEMM 2.5× … ☆51 · Updated this week
- DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling ☆15 · Updated this week
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference ☆36 · Updated last month
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆45 · Updated 9 months ago