habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to OpenMP, and automatically compiles the annotated code to GPU kernels (see the sketch below).
☆23 · Updated last week
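A minimal sketch of the annotation style described above. The `appy` package name, the `@appy.jit` decorator, and the exact pragma wording are assumptions based on the OpenMP analogy, not the confirmed API; consult the APPy repository for the real syntax.

```python
# Hypothetical sketch of an APPy-style annotated kernel.
# Decorator name and pragma wording are assumptions, not the confirmed API.
import torch
import appy  # assumed package name


@appy.jit
def vector_add(a, b, c, N):
    # OpenMP-like directive asking the compiler to parallelize this loop on the GPU
    #pragma parallel for simd
    for i in range(N):
        c[i] = a[i] + b[i]


a = torch.randn(1024, device="cuda")
b = torch.randn(1024, device="cuda")
c = torch.empty_like(a)
vector_add(a, b, c, a.numel())  # compiled to a GPU kernel on first call (assumed behavior)
```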
Alternatives and similar repositories for APPy:
Users interested in APPy are comparing it to the libraries listed below.
- Quantized Attention on GPU ☆45 · Updated 5 months ago
- FlexAttention w/ FlashAttention3 Support ☆26 · Updated 7 months ago
- Flash-Muon: An Efficient Implementation of the Muon Optimizer ☆91 · Updated this week
- ☆17 · Updated last week
- Transformers components but in Triton ☆33 · Updated last month
- Accelerate LLM preference tuning via prefix sharing with a single line of code ☆40 · Updated last week
- ☆20 · Updated 2 months ago
- A bunch of kernels that might make stuff slower 😉 ☆40 · Updated this week
- Awesome Triton Resources ☆27 · Updated last week
- Framework to reduce autotune overhead to zero for well-known deployments ☆70 · Updated this week
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance ☆122 · Updated this week
- ☆48 · Updated last year
- Benchmark tests supporting the TiledCUDA library ☆16 · Updated 5 months ago
- ☆68 · Updated 3 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆69 · Updated 10 months ago
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) ☆24 · Updated 11 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing ☆84 · Updated last week
- ☆22 · Updated last year
- GPTQ inference TVM kernel ☆38 · Updated last year
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8 ☆45 · Updated 9 months ago
- ☆68 · Updated 2 weeks ago
- DeeperGEMM: crazy optimized version ☆68 · Updated this week
- ☆30 · Updated 11 months ago
- CUDA and Triton implementations of Flash Attention with SoftmaxN ☆69 · Updated 11 months ago
- Here we will test various linear attention designs ☆60 · Updated last year
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate ☆102 · Updated this week
- 🎬 3.7× faster video generation E2E 🖼️ 1.6× faster image generation E2E ⚡ ColumnSparseAttn 9.3× vs FlashAttn‑3 💨 ColumnSparseGEMM 2.5× … ☆51 · Updated this week
- DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling ☆15 · Updated this week
- Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA using CUDA cores for the decoding stage of LLM inference ☆36 · Updated last month
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆45 · Updated 9 months ago