meta-pytorch / BackendBench

Ship correct and fast LLM kernels to PyTorch

☆124

Alternatives and similar repositories for BackendBench

Users who are interested in BackendBench are comparing it to the libraries listed below.

cchan / tccl
extensible collectives library in triton
☆91 Updated 8 months ago
gpu-mode / ring-attention
ring-attention experiments
☆160 Updated last year
gpu-mode / discord-cluster-manager
Write a fast kernel and run it on Discord. See how you compare against the best!
☆61 Updated this week
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆169 Updated 7 months ago
meta-pytorch / kraken
Triton-based Symmetric Memory operators and examples
☆63 Updated last month
sgl-project / sglang-jax
JAX backend for SGL
☆185 Updated this week
meta-pytorch / KernelAgent
Autonomous GPU Kernel Generation via Deep Agents
☆163 Updated last week
NVIDIA / nsight-python
Nsight Python is a Python kernel profiling interface based on NVIDIA Nsight Tools
☆61 Updated this week
triton-lang / kernels
☆94 Updated last year
open-lm-engine / accelerated-model-architectures
A bunch of kernels that might make stuff slower 😉
☆65 Updated this week
meta-pytorch / applied-ai
Applied AI experiments and examples for PyTorch
☆307 Updated 3 months ago
Deep-Learning-Profiling-Tools / triton-viz
☆256 Updated this week
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆217 Updated last week
meta-pytorch / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆294 Updated this week
dropbox / gemlite
Fast low-bit matmul kernels in Triton
☆401 Updated last week
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆143 Updated 3 weeks ago
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆88 Updated 2 months ago
meta-pytorch / tritonparse
TritonParse: A Compiler Tracer, Visualizer, and Reproducer for Triton Kernels
☆175 Updated last week
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆86 Updated last year
vllm-project / tpu-inference
TPU inference for vLLM, with unified JAX and PyTorch support.
☆170 Updated this week
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆73 Updated 6 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆131 Updated 6 months ago
NVIDIA / compute-eval
Evaluating Large Language Models for CUDA Code Generation. ComputeEval is a framework designed to generate and evaluate CUDA code from Lar…
☆76 Updated last week
flashinfer-ai / cutlass-viz
☆65 Updated 7 months ago
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆253 Updated 2 months ago
ROCm / iris
AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming
☆119 Updated last week
stanford-futuredata / stk
☆113 Updated last year
NVIDIA / jaxpp
JaxPP is a library for JAX that enables flexible MPMD pipeline parallelism for large-scale LLM training
☆58 Updated 2 weeks ago
NVIDIA / nvshmem
NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com…
☆402 Updated 2 weeks ago
microsoft / AttentionEngine
☆113 Updated 6 months ago