habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to OpenMP, and automatically compiles the annotated code to GPU kernels.
☆23 · Updated this week
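Below is a minimal sketch of what the annotation style described above might look like: a JIT decorator plus an OpenMP-style pragma comment on an ordinary Python loop. The module name, the `appy.jit` decorator, and the `#pragma parallel for` spelling are assumptions inferred from the project description, not verified APPy API; consult the repository for actual usage.

```python
# Hypothetical usage sketch; names and pragma syntax are assumptions
# based on the project description, not verified APPy API.
import torch
import appy  # assumed module name, taken from the repo name

@appy.jit  # assumed decorator that compiles the annotated function to a GPU kernel
def vector_add(a, b, c, n):
    # OpenMP-style directive marking the loop as data-parallel (syntax assumed)
    #pragma parallel for
    for i in range(n):
        c[i] = a[i] + b[i]

# Tensors live on the GPU; the compiled kernel runs element-wise over them.
n = 1024
a = torch.randn(n, device="cuda")
b = torch.randn(n, device="cuda")
c = torch.empty(n, device="cuda")
vector_add(a, b, c, n)
```

The appeal over hand-written CUDA or Triton is that the loop stays plain Python; the directive alone tells the compiler which loop to map onto GPU threads.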
Alternatives and similar repositories for APPy:
- FlexAttention w/ FlashAttention3 Support ☆26 · Updated 6 months ago
- Quantized Attention on GPU ☆45 · Updated 4 months ago
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) ☆24 · Updated 9 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆68 · Updated 9 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆107 · Updated this week
- Framework to reduce autotune overhead to zero for well-known deployments. ☆63 · Updated last week
- PyTorch implementation of the Flash Spectral Transform Unit. ☆16 · Updated 6 months ago
- FlashRNN - Fast RNN Kernels with I/O Awareness ☆76 · Updated last week
- Transformers components but in Triton ☆32 · Updated 2 weeks ago
- GPTQ inference TVM kernel ☆38 · Updated 11 months ago
- Benchmark tests supporting the TiledCUDA library. ☆16 · Updated 4 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆74 · Updated this week
- PyTorch bindings for CUTLASS grouped GEMM. ☆77 · Updated 5 months ago
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA cores for the decoding stage of LLM inference. ☆35 · Updated 3 weeks ago
- Memory Optimizations for Deep Learning (ICML 2023) ☆62 · Updated last year
- [WIP] Better (FP8) attention for Hopper ☆26 · Updated last month
- Awesome Triton Resources ☆23 · Updated last week
- DeeperGEMM: crazy optimized version ☆64 · Updated this week
- Repository for CPU Kernel Generation for LLM Inference ☆25 · Updated last year
- Personal solutions to the Triton Puzzles ☆18 · Updated 8 months ago
- Here we will test various linear attention designs. ☆60 · Updated 11 months ago
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆44 · Updated 8 months ago
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆45 · Updated 8 months ago
- GPU operators for sparse tensor operations ☆31 · Updated last year