bdhirsh / pytorch_open_registration_exampleLinks

Example of using pytorch's open device registration API

☆30

Alternatives and similar repositories for pytorch_open_registration_example

Users that are interested in pytorch_open_registration_example are comparing it to the libraries listed below

Sorting:

wmmae / wmma_extension
An extension library of WMMA API (Tensor Core API)
☆99Updated last year
ROCm / aotriton
Ahead of Time (AOT) Triton Math Library
☆75Updated this week
cmu-catalyst / collage
System for automated integration of deep learning backends.
☆47Updated 2 years ago
apache / tvm-rfcs
A home for the final text of all TVM RFCs.
☆105Updated 10 months ago
mmperf / mmperf
MatMul Performance Benchmarks for a Single CPU Core comparing both hand engineered and codegen kernels.
☆134Updated last year
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆94Updated 3 weeks ago
NVIDIA / online-softmax
Benchmark code for the "Online normalizer calculation for softmax" paper
☆96Updated 7 years ago
thu-pacman / PET
PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections
☆122Updated 3 years ago
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆111Updated 10 months ago
AlibabaPAI / FLASHNN
☆96Updated 11 months ago
yifuwang / symm-mem-recipes
☆102Updated 7 months ago
awslabs / raf
☆144Updated 6 months ago
nox-410 / tvm.tl
An extention of TVMScript to write simple and high performance GPU kernels with tensorcore.
☆50Updated last year
LeiWang1999 / tvm_gpu_gemm
play gemm with tvm
☆91Updated 2 years ago
tlc-pack / TLCBench
Benchmark scripts for TVM
☆75Updated 3 years ago
wzsh / wmma_tensorcore_sample
Matrix Multiply-Accumulate with CUDA and WMMA( Tensor Core)
☆138Updated 4 years ago
OpenPPL / ppl.llm.kernel.cuda
☆149Updated 7 months ago
weishengying / cutlass_flash_atten_fp8
使用 cutlass 仓库在 ada 架构上实现 fp8 的 flash attention
☆74Updated 11 months ago
ColfaxResearch / cfx-article-src
☆127Updated 3 months ago
ankan-ban / llama_cu_awq
llama INT4 cuda inference with AWQ
☆54Updated 6 months ago
TiledTensor / TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel …
☆183Updated 6 months ago
microsoft / triton-shared
Shared Middle-Layer for Triton Compilation
☆261Updated last week
daadaada / turingas
Assembler for NVIDIA Volta and Turing GPUs
☆226Updated 3 years ago
masahi / torchscript-to-tvm
☆69Updated 2 years ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆93Updated last month
NVIDIA / TensorRT-Incubator
Experimental projects related to TensorRT
☆109Updated this week
Bruce-Lee-LY / flash_attention_inference
Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios.
☆39Updated 5 months ago
ROCm / TransformerEngine
☆41Updated this week
CalebDu / Awesome-Cute
☆91Updated 2 months ago
intel / torch-ccl
oneCCL Bindings for Pytorch*
☆99Updated this week