habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to OpenMP, and automatically compiles the annotated code to GPU kernels.
☆25 · Updated last week
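To make the annotation model concrete, below is a minimal sketch of what an OpenMP-style directive on a Python loop might look like with APPy. The `appy` module name, the `@appy.jit` decorator, and the `#pragma parallel for` comment spelling are assumptions inferred from the project description above, not a verified API reference; consult the repository for the actual syntax.

```python
# Hedged sketch: the `appy` package, `@appy.jit` decorator, and the
# `#pragma parallel for` comment syntax are assumptions, not verified
# against the current APPy release.
import numpy as np
import appy  # assumed package name for the APPy compiler


@appy.jit  # assumed decorator: compile the annotated function to a GPU kernel
def vector_add(a, b, c, n):
    #pragma parallel for
    for i in range(n):
        c[i] = a[i] + b[i]


n = 1 << 20
a = np.ones(n, dtype=np.float32)
b = np.full(n, 2.0, dtype=np.float32)
c = np.empty_like(a)
vector_add(a, b, c, n)  # the annotated loop runs as a generated GPU kernel
```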
Alternatives and similar repositories for APPy
Users who are interested in APPy are comparing it to the libraries listed below.
- FlexAttention w/ FlashAttention3 Support · ☆27 · updated last year
- Quantized Attention on GPU · ☆44 · updated 11 months ago
- CUDA and Triton implementations of Flash Attention with SoftmaxN · ☆73 · updated last year
- Transformers components but in Triton · ☆34 · updated 5 months ago
- Benchmark tests supporting the TiledCUDA library · ☆17 · updated 11 months ago
- Xmixers: A collection of SOTA efficient token/channel mixers · ☆29 · updated last month
- Awesome Triton Resources · ☆36 · updated 6 months ago
- Odysseus: Playground of LLM Sequence Parallelism · ☆78 · updated last year
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) · ☆24 · updated last year
- Implementation of Hyena Hierarchy in JAX · ☆10 · updated 2 years ago
- Framework to reduce autotune overhead to zero for well-known deployments · ☆84 · updated last month
- A bunch of kernels that might make stuff slower 😉 · ☆63 · updated this week
- 32 times longer context window than vanilla Transformers, and up to 4 times longer than memory-efficient Transformers · ☆48 · updated 2 years ago
- Triton implementation of bi-directional (non-causal) linear attention · ☆56 · updated 8 months ago
- PyTorch implementation of the Flash Spectral Transform Unit · ☆18 · updated last year
- An efficient implementation of the NSA (Native Sparse Attention) kernel · ☆121 · updated 4 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer · ☆197 · updated 4 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing · ☆100 · updated 4 months ago
- GPTQ inference TVM kernel · ☆39 · updated last year
- Linear Attention Sequence Parallelism (LASP) · ☆87 · updated last year
- Fast and memory-efficient exact attention · ☆71 · updated 7 months ago
- FlashRNN - Fast RNN Kernels with I/O Awareness · ☆103 · updated last week