daniel-geon-park / triton_bwd
Automatic differentiation for Triton Kernels
☆10Updated last week
Alternatives and similar repositories for triton_bwd:
Users who are interested in triton_bwd are comparing it to the libraries listed below.
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆77Updated this week
- ☆26Updated 2 weeks ago
- Framework to reduce autotune overhead to zero for well known deployments.☆63Updated this week
- Supplemental materials for The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning☆23Updated 3 months ago
- ☆13Updated 3 weeks ago
- ☆76Updated 5 months ago
- ThrillerFlow is a Dataflow Analysis and Codegen Framework written in Rust.☆14Updated 4 months ago
- DeeperGEMM: crazy optimized version☆64Updated this week
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.☆108Updated this week
- extensible collectives library in triton☆84Updated this week
- TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators☆34Updated last month
- ☆19Updated 6 months ago
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs☆32Updated last week
- ☆22Updated 2 years ago
- An Attention Superoptimizer☆21Updated 2 months ago
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆40Updated 2 weeks ago
- Artifacts of EVT ASPLOS'24☆23Updated last year
- An extension of TVMScript to write simple and high performance GPU kernels with tensorcore.☆50Updated 8 months ago
- A bunch of kernels that might make stuff slower 😉☆29Updated this week
- Test suite for probing the numerical behavior of NVIDIA tensor cores☆37Updated 8 months ago
- (NeurIPS 2022) Automatically finding good model-parallel strategies, especially for complex models and clusters.☆38Updated 2 years ago
- Ahead of Time (AOT) Triton Math Library☆56Updated 2 weeks ago
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS☆23Updated last month
- A GPU-optimized system for efficient long-context LLM decoding with low-bit KV cache.☆25Updated last week
- ☆9Updated last year
- Canvas: End-to-End Kernel Architecture Search in Neural Networks☆26Updated 4 months ago
- Thunder Research Group's Collective Communication Library☆34Updated 11 months ago
- ☆92Updated 11 months ago
- Debug print operator for cudagraph debugging☆10Updated 8 months ago
- ☆25Updated last year