habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to OpenMP, and automatically compiles the annotated code to GPU kernels.
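For orientation, below is a minimal sketch of the kind of loop such annotations target. The decorator name and pragma spelling are assumptions for illustration only, not the verified APPy API; the snippet runs as plain NumPy because the directive lives in a comment and the decorator is commented out. See the repository for the actual syntax.

```python
import numpy as np

# Hypothetical sketch: the decorator and pragma below are assumed
# for illustration and are not the confirmed APPy API. The idea is
# that an ordinary Python loop carries an OpenMP-style directive
# and is compiled to a GPU kernel by the APPy compiler.

# @appy.jit                       # assumed decorator (not verified)
def saxpy(a, x, y, out):
    #pragma parallel for          # assumed OpenMP-like loop directive
    for i in range(x.shape[0]):
        out[i] = a * x[i] + y[i]

x = np.arange(1024, dtype=np.float32)
y = np.ones(1024, dtype=np.float32)
out = np.empty_like(x)
saxpy(2.0, x, y, out)             # executes as plain Python here
```

Keeping the directive in a comment means the annotated function remains valid, runnable Python even without the compiler installed.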
Alternatives and similar repositories for APPy:
Users interested in APPy are comparing it to the libraries listed below.
- FlashRNN - Fast RNN Kernels with I/O Awareness
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
- FlexAttention w/ FlashAttention3 Support
- Framework to reduce autotune overhead to zero for well-known deployments.
- 📚 [WIP] FFPA: Yet another Faster Flash Prefill Attention with O(1) GPU SRAM complexity for headdim > 256, 1.8x~3x faster vs. SDPA EA.
- Odysseus: Playground of LLM Sequence Parallelism
- GPTQ inference TVM kernel
- Triton implementation of bi-directional (non-causal) linear attention
- Quantized Attention on GPU
- CUDA and Triton implementations of Flash Attention with SoftmaxN.
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024)
- TensorRT LLM Benchmark Configuration
- Here we will test various linear attention designs.
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024).
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters
- PyTorch bindings for CUTLASS grouped GEMM.
- Awesome Triton Resources
- 🔥 A minimal training framework for scaling FLA models
- Benchmark tests supporting the TiledCUDA library.
- Fast and memory-efficient exact attention
- Flash Attention implemented using CuTe.
- Repository for CPU Kernel Generation for LLM Inference
- IntLLaMA: A fast and light quantization solution for LLaMA
- Memory Optimizations for Deep Learning (ICML 2023)