habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to OpenMP, and automatically compiles the annotated code to GPU kernels.
☆30 · Updated 3 weeks ago
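As a quick illustration of the annotation style described above, here is a minimal sketch of what an APPy-annotated loop could look like. The `appy` module name, the `@appy.jit` decorator, and the `#pragma parallel for` directive are assumptions inferred from the OpenMP-like description, not verified against the project's actual API; check the repository for the exact syntax.

```python
# Hypothetical sketch of an APPy-style annotated loop.
# The module name `appy`, the `@appy.jit` decorator, and the
# `#pragma parallel for` comment directive are assumptions inferred from
# the OpenMP-like description above, not the verified APPy API.
import torch
import appy  # assumed import name for the APPy package


@appy.jit
def saxpy(a, x, y, out, n):
    #pragma parallel for
    for i in range(n):
        out[i] = a * x[i] + y[i]


n = 1 << 20
x = torch.rand(n, device="cuda")
y = torch.rand(n, device="cuda")
out = torch.empty_like(x)
# Per the description, the annotated loop is compiled to a GPU kernel.
saxpy(2.0, x, y, out, n)
```

The usual appeal of this directive-based style, as with OpenMP, is that removing the decorator and the pragma comment leaves plain sequential Python that still runs correctly.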
Alternatives and similar repositories for APPy
Users interested in APPy are comparing it to the libraries listed below.
- FlexAttention w/ FlashAttention3 Support · ☆27 · Updated last year
- Quantized Attention on GPU · ☆44 · Updated last year
- Benchmark tests supporting the TiledCUDA library · ☆18 · Updated last year
- Transformers components but in Triton · ☆34 · Updated 8 months ago
- Framework to reduce autotune overhead to zero for well-known deployments · ☆94 · Updated 4 months ago
- Odysseus: Playground of LLM Sequence Parallelism · ☆79 · Updated last year
- GPTQ inference TVM kernel · ☆41 · Updated last year
- CUDA and Triton implementations of Flash Attention with SoftmaxN · ☆73 · Updated last year
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing · ☆105 · Updated 6 months ago
- A bunch of kernels that might make stuff slower 😉 · ☆75 · Updated this week
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference · ☆46 · Updated 7 months ago
- DeeperGEMM: crazy optimized version · ☆73 · Updated 8 months ago
- Accelerate LLM preference tuning via prefix sharing with a single line of code · ☆51 · Updated 6 months ago
- Awesome Triton Resources · ☆39 · Updated 9 months ago
- Triton implementation of bi-directional (non-causal) linear attention · ☆63 · Updated 11 months ago
- An easily extensible framework for understanding and optimizing CUDA operators, intended for learning use only · ☆18 · Updated last year
- Fast and memory-efficient exact attention · ☆75 · Updated 10 months ago
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters · ☆55 · Updated last year
- Extensible collectives library in Triton · ☆93 · Updated 9 months ago
- Persistent dense GEMM for Hopper in `CuTeDSL` · ☆15 · Updated 5 months ago