thunlp / TritonBenchLinks

TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators

☆95

Alternatives and similar repositories for TritonBench

Users that are interested in TritonBench are comparing it to the libraries listed below

Sorting:

tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆130Updated 5 months ago
triton-lang / kernels
☆93Updated last year
flashinfer-ai / cutlass-viz
☆65Updated 6 months ago
meta-pytorch / KernelAgent
Autonomous GPU Kernel Generation via Deep Agents
☆137Updated this week
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆73Updated 6 months ago
flashinfer-ai / flashinfer-bench
Building the Virtuous Cycle for AI-driven LLM Systems
☆91Updated this week
tile-ai / TileRT
Tile-Based Runtime for Ultra-Low-Latency LLM Inference
☆178Updated this week
ByteDance-Seed / cudaLLM
☆121Updated 3 months ago
DD-DuDa / BitDecoding
[HPCA 2025] A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache.
☆62Updated last week
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆85Updated 2 months ago
FasterDecoding / TEAL
☆148Updated 9 months ago
ByteDance-Seed / FlexPrefill
Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference
☆154Updated last month
stanford-futuredata / stk
☆113Updated last year
meta-pytorch / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆286Updated this week
tile-ai / AttentionEngine
☆50Updated 6 months ago
usyd-fsalab / fp6_llm
An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).
☆272Updated 4 months ago
microsoft / AttentionEngine
☆109Updated 6 months ago
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆167Updated 7 months ago
feifeibear / DPSKV3MFU
Estimate MFU for DeepSeekV3
☆26Updated 10 months ago
HanGuo97 / hilt
☆36Updated 2 weeks ago
toyaix / triton-runner
Multi-Level Triton Runner supporting Python, IR, PTX, and cubin.
☆76Updated last week
mit-han-lab / Quest
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
☆353Updated 4 months ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆100Updated 4 months ago
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆96Updated 11 months ago
andy-yang-1 / DoubleSparse
16-fold memory access reduction with nearly no loss
☆107Updated 7 months ago
cchan / tccl
extensible collectives library in triton
☆91Updated 7 months ago
osayamenja / FlashMoE
Distributed MoE in a Single Kernel [NeurIPS '25]
☆125Updated last month
InternLM / Awesome-LLM-Training-System
☆44Updated last year
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆328Updated last year
INT-FlashAttention2024 / INT-FlashAttention
☆83Updated 9 months ago