habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to OpenMP, and automatically compiles the annotated code to GPU kernels.
☆27 · Updated this week
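For context, here is a minimal sketch of the annotation style the project describes: a function marked for JIT compilation, with an OpenMP-like pragma comment telling the compiler which loop to turn into a GPU kernel. The decorator and pragma spellings (`appy.jit`, `#pragma parallel for`) follow the project's documented examples, but treat the details as illustrative rather than a definitive API reference.

```python
import torch
import appy  # APPy package; assumed import name

# A hedged sketch: the pragma comment marks the loop that APPy
# should compile to a GPU kernel, in the spirit of OpenMP.
@appy.jit
def vector_add(a, b, c, N):
    #pragma parallel for
    for i in range(N):
        c[i] = a[i] + b[i]

# Inputs live on the GPU; the annotated loop runs as a kernel.
N = 1024
a = torch.randn(N, device="cuda")
b = torch.randn(N, device="cuda")
c = torch.empty_like(a)
vector_add(a, b, c, N)
```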
Alternatives and similar repositories for APPy
Users interested in APPy are comparing it to the libraries listed below.
- FlexAttention w/ FlashAttention3 Support ☆27 · Updated last year
- ☆32 · Updated last year
- Quantized Attention on GPU ☆44 · Updated last year
- Benchmark tests supporting the TiledCUDA library. ☆18 · Updated last year
- Awesome Triton Resources ☆38 · Updated 7 months ago
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ☆73 · Updated last year
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) ☆24 · Updated last year
- Transformer components, but in Triton ☆34 · Updated 7 months ago
- ☆22 · Updated 2 years ago
- Xmixers: A collection of SOTA efficient token/channel mixers ☆28 · Updated 3 months ago
- ☆23 · Updated 7 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆78 · Updated last year
- Implementation of Hyena Hierarchy in JAX ☆10 · Updated 2 years ago
- Framework to reduce autotune overhead to zero for well-known deployments. ☆90 · Updated 3 months ago
- A bunch of kernels that might make stuff slower 😉 ☆65 · Updated 2 weeks ago
- ☆59 · Updated 2 years ago
- My attempt to improve the speed of the Newton-Schulz algorithm, starting from the Dion implementation. ☆22 · Updated 2 weeks ago
- Here we will test various linear attention designs. ☆62 · Updated last year
- ☆52 · Updated 7 months ago
- An easily extensible framework for understanding and optimizing CUDA operators; for learning purposes only. ☆18 · Updated last year
- ☆132 · Updated 6 months ago
- GPTQ inference TVM kernel ☆41 · Updated last year
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory-efficient Transformers. ☆49 · Updated 2 years ago
- Fast and memory-efficient exact attention ☆74 · Updated 9 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆105 · Updated 5 months ago
- Repository for CPU Kernel Generation for LLM Inference ☆27 · Updated 2 years ago
- Accelerate LLM preference tuning via prefix sharing with a single line of code ☆51 · Updated 5 months ago
- ☆114 · Updated 7 months ago
- ☆76 · Updated last year
- Vortex: A Flexible and Efficient Sparse Attention Framework ☆43 · Updated 2 weeks ago