habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to OpenMP, and automatically compiles the annotated code to GPU kernels.
☆23 · Updated last month
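To illustrate the directive-based style the description refers to, here is a minimal sketch. It is not taken from APPy's documentation: the `appy.jit` decorator name and the `#pragma parallel for` spelling are assumptions based on the OpenMP analogy, and the real annotations may differ. Without the decorator the snippet runs as ordinary serial Python.

```python
# Hypothetical sketch of an OpenMP-style annotation in Python; the decorator
# and pragma spelling are assumptions, not APPy's documented API.
import numpy as np
# import appy  # assumed package name; uncomment if APPy is installed


# @appy.jit  # assumed decorator: compile the annotated loop to a GPU kernel
def vector_add(a, b, c, n):
    #pragma parallel for  # directive marking the loop as data-parallel
    for i in range(n):
        c[i] = a[i] + b[i]


# Without the decorator this executes as a plain (serial) Python loop.
a = np.ones(1024, dtype=np.float32)
b = np.ones(1024, dtype=np.float32)
c = np.empty_like(a)
vector_add(a, b, c, a.size)
print(c[:4])  # [2. 2. 2. 2.]
```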
Alternatives and similar repositories for APPy:
Users interested in APPy are comparing it to the libraries listed below.
- FlexAttention w/ FlashAttention3 Support ☆26 · Updated 5 months ago
- Quantized Attention on GPU ☆45 · Updated 4 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆104 · Updated this week
- ☆30 · Updated 9 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆66 · Updated 9 months ago
- FlashRNN - Fast RNN Kernels with I/O Awareness ☆76 · Updated this week
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. By pro… ☆68 · Updated this week
- ☆64 · Updated 2 months ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆74 · Updated 4 months ago
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆45 · Updated 8 months ago
- GPTQ inference TVM kernel ☆39 · Updated 10 months ago
- Repository for CPU Kernel Generation for LLM Inference ☆25 · Updated last year
- ☆46 · Updated last year
- Open deep learning compiler stack for CPU, GPU, and specialized accelerators ☆18 · Updated this week
- Benchmark tests supporting the TiledCUDA library. ☆15 · Updated 4 months ago
- Framework to reduce autotune overhead to zero for well-known deployments. ☆63 · Updated this week
- Awesome Triton Resources ☆20 · Updated 3 months ago
- Here we will test various linear attention designs. ☆60 · Updated 10 months ago
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ☆68 · Updated 9 months ago
- Personal solutions to the Triton Puzzles ☆18 · Updated 8 months ago
- Transformers components but in Triton ☆32 · Updated this week
- ☆22 · Updated last year
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference. ☆35 · Updated 2 weeks ago
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) ☆24 · Updated 9 months ago
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory-efficient Transformers. ☆46 · Updated last year
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆43 · Updated 8 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆71 · Updated 6 months ago