Ship correct and fast LLM kernels to PyTorch
☆142Jan 14, 2026Updated last month
Alternatives and similar repositories for BackendBench
Users that are interested in BackendBench are comparing it to the libraries listed below
- Persistent dense gemm for Hopper in `CuTeDSL`☆15Aug 9, 2025Updated 6 months ago
- A Triton-only attention backend for vLLM☆24Feb 11, 2026Updated 2 weeks ago
- ☆23Jul 11, 2025Updated 7 months ago
- study of cutlass☆22Nov 10, 2024Updated last year
- MSLK (Meta Superintelligence Labs Kernels) is a collection of PyTorch GPU operator libraries that are designed and optimized for GenAI tr…☆52Updated this week
- ☆16May 14, 2025Updated 9 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.☆327Updated this week
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs)☆820Updated this week
- Multi-Level Triton Runner supporting Python, IR, PTX, and cubin.☆84Updated this week
- Ahead of Time (AOT) Triton Math Library☆92Updated this week
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.☆766Updated this week
- Quantized LLM training in pure CUDA/C++.☆241Updated this week
- Automatic differentiation for Triton Kernels☆29Aug 12, 2025Updated 6 months ago
- IntLLaMA: A fast and light quantization solution for LLaMA☆18Jul 21, 2023Updated 2 years ago
- A domain-specific language (DSL) based on Triton but providing higher-level abstractions.☆41Feb 4, 2026Updated 3 weeks ago
- Sample Codes using NVSHMEM on Multi-GPU☆30Jan 22, 2023Updated 3 years ago
- Triton-based Symmetric Memory operators and examples☆85Jan 15, 2026Updated last month
- Cataloging released Triton kernels.☆295Sep 9, 2025Updated 5 months ago
- AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming☆177Updated this week
- A toolkit for developers to simplify the transformation of nn.Module instances. It now corresponds to Pytorch.fx.☆13Apr 7, 2023Updated 2 years ago
- A small toy that uses ChatGPT to generate command-line commands☆10Mar 7, 2023Updated 2 years ago
- Fast low-bit matmul kernels in Triton☆433Feb 1, 2026Updated 3 weeks ago
- Implementation of the LDP module block in PyTorch and Zeta from the paper: "MobileVLM: A Fast, Strong and Open Vision Language Assistant …☆15Mar 11, 2024Updated last year
- ☆15Updated this week
- Delegatecall from any contract. A kind of vm.prank for delegatecalls.☆18Sep 10, 2024Updated last year
- Ergonomic alternative to `approve`/`transferFrom` -- flash loans without external calls☆18Feb 2, 2022Updated 4 years ago
- ☆12Jan 7, 2025Updated last year
- GVProf: A Value Profiler for GPU-based Clusters☆53Mar 24, 2024Updated last year
- This repository contains companion software for the Colfax Research paper "Categorical Foundations for CuTe Layouts".☆109Sep 24, 2025Updated 5 months ago
- ☆31Apr 19, 2025Updated 10 months ago
- Distributed multi-agent framework for event-driven, graph-based computation. Elixir/Python, NATS event streaming, modular operator/XCS ar…☆14Nov 4, 2025Updated 3 months ago
- Development containers for triton and triton-cpu☆24Feb 16, 2026Updated last week
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆106Jun 28, 2025Updated 8 months ago
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning☆168Nov 11, 2025Updated 3 months ago
- A collection of reproducible inference engine benchmarks☆38Apr 22, 2025Updated 10 months ago
- ☆118May 19, 2025Updated 9 months ago
- Personal solutions to the Triton Puzzles☆20Jul 18, 2024Updated last year
- Because it's there.☆16Sep 22, 2024Updated last year
- Tile-based language built for AI computation across all scales☆138Updated this week