habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to OpenMP, and automatically compiles the annotated code to GPU kernels.
☆23 · Updated this week
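As a quick illustration of the annotation style described above, here is a minimal sketch: a plain Python loop marked with an OpenMP-like pragma comment inside a jitted function. The `@appy.jit` decorator name, the `#pragma parallel for simd` directive, and the use of `torch` tensors on a CUDA device are assumptions for illustration rather than confirmed details of the library's API.

```python
# Hypothetical sketch of APPy-style loop annotation. The decorator name and
# pragma syntax are illustrative and may differ from the library's actual API.
import torch
import appy

@appy.jit
def vector_add(a, b, c, N):
    # Comment-based directive asking the compiler to map this loop onto GPU
    # threads, analogous to OpenMP's `parallel for`.
    #pragma parallel for simd
    for i in range(N):
        c[i] = a[i] + b[i]

if __name__ == "__main__":
    N = 1 << 20
    a = torch.randn(N, device="cuda")  # assumes a CUDA device is available
    b = torch.randn(N, device="cuda")
    c = torch.empty(N, device="cuda")
    vector_add(a, b, c, N)
    assert torch.allclose(c, a + b)
```

The point of the sketch is the workflow: the loop stays ordinary Python, and the comment directive tells the compiler how to parallelize it, so no CUDA or Triton code is written by hand.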
Alternatives and similar repositories for APPy:
Users interested in APPy are comparing it to the libraries listed below.
- Quantized Attention on GPU ☆45 · Updated 5 months ago
- FlexAttention w/ FlashAttention3 Support ☆26 · Updated 6 months ago
- Benchmark tests supporting the TiledCUDA library. ☆16 · Updated 5 months ago
- GPTQ inference TVM kernel ☆38 · Updated last year
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆111 · Updated last week
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆82 · Updated this week
- Framework to reduce autotune overhead to zero for well-known deployments. ☆65 · Updated last week
- Odysseus: Playground of LLM Sequence Parallelism ☆68 · Updated 10 months ago
- DeeperGEMM: crazy optimized version ☆67 · Updated 3 weeks ago
- IntLLaMA: A fast and light quantization solution for LLaMA ☆18 · Updated last year
- Transformers components but in Triton ☆32 · Updated last month
- PyTorch bindings for CUTLASS grouped GEMM. ☆81 · Updated 5 months ago
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference. ☆36 · Updated 3 weeks ago
- Open deep learning compiler stack for CPUs, GPUs, and specialized accelerators ☆18 · Updated last week
- An auxiliary project analyzing the characteristics of KV in DiT Attention. ☆29 · Updated 4 months ago
- PyTorch implementation of the Flash Spectral Transform Unit. ☆16 · Updated 7 months ago
- A bunch of kernels that might make stuff slower 😉 ☆34 · Updated this week
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) ☆24 · Updated 10 months ago
- Flash Attention implemented using CuTe. ☆76 · Updated 4 months ago
- Standalone Flash Attention v2 kernel without libtorch dependency ☆108 · Updated 7 months ago
- DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling ☆15 · Updated this week
- 32 times longer context window than vanilla Transformers and up to 4 times longer than memory-efficient Transformers. ☆47 · Updated last year
- Awesome Triton Resources ☆26 · Updated 3 weeks ago