habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to OpenMP, and automatically compiles the annotated code to GPU kernels.
☆23 · Updated last month
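To make the description above concrete, here is a minimal sketch of the annotation style it describes. The `appy.jit` decorator name and the `#pragma parallel for` comment syntax are illustrative assumptions drawn from the OpenMP analogy, not a verified reference to the project's current API.

```python
# Hypothetical APPy-style usage sketch; the decorator name and pragma comment
# below are assumptions for illustration, not confirmed API.
import torch
import appy

@appy.jit
def vector_add(a, b, c, N):
    # OpenMP-like directive marking the loop for GPU parallelization
    #pragma parallel for
    for i in range(N):
        c[i] = a[i] + b[i]

N = 1 << 20
a = torch.rand(N, device="cuda")
b = torch.rand(N, device="cuda")
c = torch.empty_like(a)
vector_add(a, b, c, N)  # assumed behavior: compiled to a GPU kernel on first call
```

The idea mirrors OpenMP: the loop body stays ordinary Python, and the directive tells the compiler which loop to parallelize when it generates the GPU kernel.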
Alternatives and similar repositories for APPy:
Users interested in APPy are comparing it to the libraries listed below.
- FlexAttention w/ FlashAttention3 Support ☆26 · Updated 5 months ago
- Quantized Attention on GPU ☆45 · Updated 4 months ago
- Benchmark tests supporting the TiledCUDA library. ☆15 · Updated 4 months ago
- FlashRNN - Fast RNN Kernels with I/O Awareness ☆76 · Updated this week
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) ☆24 · Updated 9 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆104 · Updated this week
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆74 · Updated this week
- ☆67 · Updated 2 months ago
- Awesome Triton Resources ☆23 · Updated this week
- ☆30 · Updated 10 months ago
- Framework to reduce autotune overhead to zero for well-known deployments. ☆63 · Updated this week
- PyTorch implementation of the Flash Spectral Transform Unit. ☆16 · Updated 6 months ago
- ☆22 · Updated last year
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆45 · Updated 8 months ago
- Transformers components but in Triton ☆32 · Updated 2 weeks ago
- Here we will test various linear attention designs. ☆60 · Updated 11 months ago
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference. ☆35 · Updated 3 weeks ago
- GPTQ inference TVM kernel ☆38 · Updated 11 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆68 · Updated 9 months ago
- [WIP] Better (FP8) attention for Hopper ☆26 · Updated last month
- DeeperGEMM: crazy optimized version ☆63 · Updated 2 weeks ago
- ☆19 · Updated 3 weeks ago
- PyTorch bindings for CUTLASS grouped GEMM. ☆77 · Updated 5 months ago
- TensorRT LLM Benchmark Configuration ☆13 · Updated 8 months ago
- Implementation of Flash Attention using CuTe. ☆74 · Updated 3 months ago
- Implementation of Hyena Hierarchy in JAX ☆10 · Updated last year
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆44 · Updated 8 months ago
- CUDA and Triton implementations of Flash Attention with SoftmaxN. ☆68 · Updated 10 months ago
- Continuous batching and parallel acceleration for RWKV6 ☆24 · Updated 9 months ago
- High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline. ☆102 · Updated 8 months ago