TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators
☆115 · Jun 14, 2025 · Updated 8 months ago
Alternatives and similar repositories for TritonBench
Users interested in TritonBench are comparing it to the libraries listed below.
- ☆64 · Jul 14, 2025 · Updated 7 months ago
- KernelBench: Can LLMs Write GPU Kernels? Benchmark + toolkit with Torch -> CUDA (+ more DSLs). ☆836 · Updated this week
- Experiment notebooks for "Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism". ☆14 · Apr 30, 2025 · Updated 10 months ago
- DeeperGEMM: crazy optimized version. ☆74 · May 5, 2025 · Updated 10 months ago
- ☆52 · May 19, 2025 · Updated 9 months ago
- An LLM-based AI agent that automatically writes correct and efficient GPU kernels. ☆69 · Updated this week
- Triton adapter for Ascend. Mirror of https://gitcode.com/ascend/triton-ascend. ☆113 · Updated this week
- ☆134 · Aug 18, 2025 · Updated 6 months ago
- Samples of good AI-generated CUDA kernels. ☆100 · May 30, 2025 · Updated 9 months ago
- Persistent dense GEMM for Hopper in `CuTeDSL`. ☆15 · Aug 9, 2025 · Updated 6 months ago
- Quantized attention on GPU. ☆44 · Nov 22, 2024 · Updated last year
- A simple API for using CUPTI. ☆11 · Aug 19, 2025 · Updated 6 months ago
- Minimal Transformer base in JAX. A single backbone for language modelling, diffusion, classification, etc. ☆14 · May 28, 2025 · Updated 9 months ago
- ☆11 · Jun 9, 2023 · Updated 2 years ago
- ☆23 · Jul 11, 2025 · Updated 7 months ago
- Fast low-bit matmul kernels in Triton. ☆436 · Feb 1, 2026 · Updated last month
- Evaluating Large Language Models for CUDA Code Generation. ComputeEval is a framework designed to generate and evaluate CUDA code from Lar… ☆104 · Jan 8, 2026 · Updated last month
- ☆79 · Dec 27, 2024 · Updated last year
- Row-wise block scaling for FP8 quantized matrix multiplication. Solution to the GPU MODE AMD challenge. ☆17 · Feb 9, 2026 · Updated 3 weeks ago
- ☆15 · Mar 2, 2025 · Updated last year
- ⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak performance.⚡️ ☆150 · May 10, 2025 · Updated 9 months ago
- Writing a CUDA software ray-tracing renderer with analysis-driven optimization from scratch: a Python-importable, distributed parallel re… ☆37 · Oct 5, 2025 · Updated 5 months ago
- FlashInfer Bench @ MLSys 2026: building AI agents to write high-performance GPU kernels. ☆141 · Feb 9, 2026 · Updated 3 weeks ago
- ☆35 · Mar 7, 2025 · Updated 11 months ago
- ☆18 · Mar 4, 2025 · Updated last year
- Advancing the frontier of efficient AI. ☆54 · Updated this week
- A fork of flux-fast that makes flux-fast even faster with cache-dit: 3.3x speedup on NVIDIA L20. ☆24 · Jul 18, 2025 · Updated 7 months ago
- [DAC 2024] A Holistic Functionalization Approach to Optimizing Imperative Tensor Programs in Deep Learning. ☆15 · Jan 13, 2024 · Updated 2 years ago
- An easy-to-understand TensorOp matmul tutorial. ☆409 · Updated this week
- An extension of TVMScript for writing simple, high-performance GPU kernels with Tensor Cores. ☆50 · Jul 23, 2024 · Updated last year
- Implements Flash Attention using CuTe. ☆102 · Dec 17, 2024 · Updated last year
- Tritonbench is a collection of PyTorch custom operators with example inputs for measuring their performance. ☆327 · Updated this week
- CUDA-L2: Surpassing cuBLAS Performance for Matrix Multiplication through Reinforcement Learning. ☆445 · Jan 8, 2026 · Updated last month
- A framework that reduces autotuning overhead to zero for well-known deployments. ☆97 · Sep 19, 2025 · Updated 5 months ago
- My tests and experiments with some popular DL frameworks. ☆17 · Sep 11, 2025 · Updated 5 months ago
- Minimal PyTorch implementation of TP, SP, FSDP, and sharded EMA. ☆31 · Nov 27, 2025 · Updated 3 months ago
- ☆15 · Aug 18, 2022 · Updated 3 years ago
- Benchmark tests supporting the TiledCUDA library. ☆18 · Nov 19, 2024 · Updated last year
- Conversions to MLIR EmitC. ☆134 · Dec 12, 2024 · Updated last year