habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to OpenMP, and automatically compiles the annotated code to GPU kernels.
☆23 · Updated 2 weeks ago
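To make the description concrete, the sketch below shows what an OpenMP-style annotation might look like in practice. This is a minimal, hedged illustration: the `appy` package name, the `@appy.jit` decorator, and the `#pragma parallel for` directive are assumptions inferred from the description above, not verified API.

```python
import torch
import appy  # assumed package name, matching the repository

# Assumed JIT decorator: compiles the annotated function into a GPU kernel.
@appy.jit
def vector_add(a, b, c, N):
    # OpenMP-style directive, written as a comment, marking the loop
    # as data-parallel (per the description above).
    #pragma parallel for
    for i in range(N):
        c[i] = a[i] + b[i]

# Hypothetical usage: tensors live on the GPU; the annotated loop is
# compiled and launched as a kernel when the function is called.
a = torch.randn(1024, device="cuda")
b = torch.randn(1024, device="cuda")
c = torch.empty_like(a)
vector_add(a, b, c, 1024)
```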
Alternatives and similar repositories for APPy
Users interested in APPy are comparing it to the libraries listed below.
- FlexAttention w/ FlashAttention3 Support ☆26 · Updated 8 months ago
- Quantized Attention on GPU ☆44 · Updated 6 months ago
- Flash-Muon: An Efficient Implementation of Muon Optimizer ☆115 · Updated this week
- ☆22 · Updated last year
- ☆49 · Updated 2 weeks ago
- Accelerate LLM preference tuning via prefix sharing with a single line of code ☆41 · Updated last month
- Awesome Triton Resources ☆28 · Updated last month
- Benchmark tests supporting the TiledCUDA library ☆16 · Updated 6 months ago
- ☆70 · Updated 2 weeks ago
- Implementation of Hyena Hierarchy in JAX ☆10 · Updated 2 years ago
- Framework to reduce autotune overhead to zero for well-known deployments ☆74 · Updated 2 weeks ago
- ☆20 · Updated last month
- Official implementation of "The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs" ☆32 · Updated last month
- Reference implementation of "Softmax Attention with Constant Cost per Token" (Heinsen, 2024) ☆24 · Updated 11 months ago
- ☆53 · Updated this week
- Transformers components but in Triton ☆33 · Updated 3 weeks ago
- Repository for CPU Kernel Generation for LLM Inference ☆26 · Updated last year
- Open deep learning compiler stack for CPU, GPU, and specialized accelerators ☆18 · Updated this week
- GPTQ inference TVM kernel ☆38 · Updated last year
- ☆31 · Updated last year
- ☆21 · Updated 2 months ago
- Odysseus: Playground of LLM Sequence Parallelism ☆69 · Updated 11 months ago
- CUDA and Triton implementations of Flash Attention with SoftmaxN ☆70 · Updated last year
- PyTorch implementation of the Flash Spectral Transform Unit ☆16 · Updated 8 months ago
- Personal solutions to the Triton Puzzles ☆18 · Updated 10 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing ☆88 · Updated this week
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance ☆127 · Updated this week
- Using FlexAttention to compute attention with different masking patterns ☆43 · Updated 8 months ago
- A bunch of kernels that might make stuff slower 😉 ☆46 · Updated this week
- ☆73 · Updated 4 months ago