openxla / openxla-nvgpu

☆48

Related projects:

microsoft / triton-shared
Shared Middle-Layer for Triton Compilation
☆160 Updated this week
ColfaxResearch / cutlass-kernels
☆138 Updated 2 months ago
cmu-catalyst / collage
System for automated integration of deep learning backends.
☆48 Updated 2 years ago
NVIDIA / online-softmax
Benchmark code for the "Online normalizer calculation for softmax" paper
☆52 Updated 6 years ago
nod-ai / SHARK-Turbine
Unified compiler/runtime for interfacing with PyTorch Dynamo.
☆90 Updated this week
thu-pacman / PET
PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections
☆112 Updated 2 years ago
wmmae / wmma_extension
An extension library of WMMA API (Tensor Core API)
☆81 Updated 2 months ago
tlc-pack / TLCBench
Benchmark scripts for TVM
☆73 Updated 2 years ago
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆93 Updated last week
makslevental / nelli
A lightweight, Pythonic frontend for MLIR
☆79 Updated 10 months ago
TiledTensor / TiledCUDA
TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.
☆114 Updated last week
wangsiping97 / FastGEMV
High-speed GEMV kernels, with up to 2.7x speedup compared to the PyTorch baseline.
☆81 Updated 2 months ago
uwsampl / SparseTIR
SparseTIR: Sparse Tensor Compiler for Deep Learning
☆129 Updated last year
nox-410 / tvm.tl
An extension of TVMScript for writing simple, high-performance GPU kernels with Tensor Cores.
☆49 Updated last month
ankan-ban / llama_cu_awq
LLaMA INT4 CUDA inference with AWQ
☆46 Updated 2 months ago
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆82 Updated 6 months ago
mmperf / mmperf
MatMul performance benchmarks for a single CPU core, comparing hand-engineered and codegen kernels.
☆123 Updated 11 months ago
sunlex0717 / DissectingTensorCores
☆73 Updated 5 months ago
awslabs / raf
☆141 Updated last year
manishucsd / py-codegen
☆14 Updated 4 months ago
facebookresearch / fairring
Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large …
☆61 Updated 2 years ago
roastduck / FreeTensor
A language and compiler for irregular tensor programs.
☆132 Updated 4 months ago
masahi / torchscript-to-tvm
☆66 Updated last year
intel / intel-xpu-backend-for-triton
OpenAI Triton backend for Intel® GPUs
☆126 Updated this week
LeiWang1999 / tvm_gpu_gemm
Playing with GEMM in TVM
☆81 Updated last year
apache / tvm-rfcs
A home for the final text of all TVM RFCs.
☆99 Updated 3 months ago
iree-org / iree-experimental
Experiments and prototypes associated with IREE or MLIR
☆47 Updated last month
pytorch-labs / triton-cpu
An experimental CPU backend for Triton (https://github.com/openai/triton)
☆30 Updated 4 months ago
daadaada / gas
☆39 Updated 3 years ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆41 Updated 3 weeks ago