daniel-geon-park / triton_bwd
Automatic differentiation for Triton Kernels
☆11Updated 2 months ago
Alternatives and similar repositories for triton_bwd
Users who are interested in triton_bwd are comparing it to the libraries listed below.
- Framework to reduce autotune overhead to zero for well known deployments.☆74Updated 3 weeks ago
- ☆80Updated 7 months ago
- DeeperGEMM: crazy optimized version☆69Updated last month
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆88Updated this week
- extensible collectives library in triton☆87Updated 2 months ago
- ☆34Updated 2 weeks ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.☆127Updated this week
- ☆59Updated last month
- Supplemental materials for The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning☆23Updated 3 weeks ago
- ☆13Updated 3 months ago
- A bunch of kernels that might make stuff slower 😉☆48Updated this week
- TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators☆55Updated 3 months ago
- ☆49Updated 2 weeks ago
- A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.☆36Updated last month
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆43Updated 2 months ago
- Debug print operator for cudagraph debugging☆10Updated 10 months ago
- ThrillerFlow is a Dataflow Analysis and Codegen Framework written in Rust.☆14Updated 6 months ago
- Boosting 4-bit inference kernels with 2:4 Sparsity☆76Updated 9 months ago
- GitHub mirror of triton-lang/triton repo.☆37Updated this week
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.☆153Updated this week
- (NeurIPS 2022) Automatically finding good model-parallel strategies, especially for complex models and clusters.☆38Updated 2 years ago
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs☆47Updated 2 months ago
- Effective transpose on Hopper GPU☆20Updated last month
- Ahead of Time (AOT) Triton Math Library☆64Updated last week
- ☆71Updated 2 weeks ago
- Quantized Attention on GPU☆44Updated 6 months ago
- A minimal implementation of vllm.☆41Updated 10 months ago
- ☆19Updated 8 months ago
- ☆13Updated 5 months ago
- Artifacts of EVT ASPLOS'24☆25Updated last year