habanero-lab / APPy
APPy (Annotated Parallelism for Python) lets users annotate loops and tensor expressions in Python with OpenMP-like compiler directives and automatically compiles the annotated code to GPU kernels.
☆23 · Updated last month
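For readers unfamiliar with the programming model, the annotation style works roughly as follows: a plain Python loop is decorated, and an OpenMP-style directive written as a comment declares its iterations parallel, which APPy then compiles to a GPU kernel. The sketch below is illustrative only; the `@appy.jit` decorator and the `#pragma parallel for` spelling are assumptions based on the project's examples and may differ from the current API.

```python
import torch
import appy  # assumption: the package is importable under this name

# Assumed API: @appy.jit compiles the decorated function to a GPU kernel,
# honoring OpenMP-like directives written as Python comments.
@appy.jit
def vector_add(a, b, c, N):
    #pragma parallel for
    for i in range(N):
        c[i] = a[i] + b[i]

N = 1 << 20
a = torch.randn(N, device="cuda")
b = torch.randn(N, device="cuda")
c = torch.empty_like(a)
vector_add(a, b, c, N)  # iterations of the annotated loop run as GPU threads
```

The directive plays the same role as OpenMP's `#pragma omp parallel for`: it asserts that the loop iterations are independent, so the compiler is free to map them onto GPU threads instead of running them sequentially.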
Alternatives and similar repositories for APPy:
Users interested in APPy are comparing it to the libraries listed below.
- FlexAttention w/ FlashAttention3 Support ☆26 · Updated 5 months ago
- Quantized Attention on GPU ☆45 · Updated 4 months ago
- Framework to reduce autotune overhead to zero for well-known deployments. ☆63 · Updated this week
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆74 · Updated this week
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) ☆24 · Updated 9 months ago
- GPTQ inference TVM kernel ☆38 · Updated 11 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆107 · Updated this week
- Benchmark tests supporting the TiledCUDA library. ☆16 · Updated 4 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆68 · Updated 9 months ago
- Open deep learning compiler stack for CPU, GPU, and specialized accelerators ☆18 · Updated last week
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference. ☆35 · Updated 3 weeks ago
- Awesome Triton Resources ☆23 · Updated last week
- PyTorch bindings for CUTLASS grouped GEMM. ☆77 · Updated 5 months ago
- DeeperGEMM: crazy optimized version ☆63 · Updated 2 weeks ago
- Transformers components but in Triton ☆32 · Updated 2 weeks ago
- FlashRNN - Fast RNN Kernels with I/O Awareness ☆76 · Updated last week
- IntLLaMA: A fast and light quantization solution for LLaMA ☆18 · Updated last year
- Implementation of Hyena Hierarchy in JAX ☆10 · Updated last year
- Extensible collectives library in Triton ☆84 · Updated this week
- Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024). ☆22 · Updated 9 months ago
- PyTorch implementation of the Flash Spectral Transform Unit. ☆16 · Updated 6 months ago
- Memory Optimizations for Deep Learning (ICML 2023) ☆62 · Updated last year
- A bunch of kernels that might make stuff slower 😉 ☆29 · Updated this week
- Personal solutions to the Triton Puzzles ☆18 · Updated 8 months ago
- Explore training for quantized models ☆17 · Updated 2 months ago