NVIDIA / TensorRT-Incubator
Experimental projects related to TensorRT
☆86 Updated this week
Alternatives and similar repositories for TensorRT-Incubator:
Users who are interested in TensorRT-Incubator are comparing it to the libraries listed below.
- ☆178 Updated 6 months ago
- Shared Middle-Layer for Triton Compilation ☆220 Updated this week
- ☆66 Updated 3 weeks ago
- Assembler for NVIDIA Volta and Turing GPUs ☆203 Updated 3 years ago
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆284 Updated this week
- TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles. ☆174 Updated 2 months ago
- OpenAI Triton backend for Intel® GPUs ☆154 Updated this week
- Benchmark code for the "Online normalizer calculation for softmax" paper ☆62 Updated 6 years ago
- A fast communication-overlapping library for tensor parallelism on GPUs. ☆271 Updated 2 months ago
- Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct… ☆330 Updated 4 months ago
- Step-by-step optimization of CUDA SGEMM ☆270 Updated 2 years ago
- PyTorch emulation library for Microscaling (MX)-compatible data formats ☆187 Updated 3 months ago
- A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser") ☆291 Updated this week
- An Easy-to-understand TensorOp Matmul Tutorial ☆306 Updated 3 months ago
- Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators ☆334 Updated this week
- Stores documents and resources used by the OpenXLA developer community ☆113 Updated 5 months ago
- collection of benchmarks to measure basic GPU capabilities ☆280 Updated last week
- An extension library of WMMA API (Tensor Core API) ☆87 Updated 6 months ago
- A home for the final text of all TVM RFCs. ☆101 Updated 3 months ago
- ☆94 Updated last month
- An experimental CPU backend for Triton ☆75 Updated this week
- System for automated integration of deep learning backends. ☆48 Updated 2 years ago
- CUDA Matrix Multiplication Optimization ☆152 Updated 5 months ago
- Unified compiler/runtime for interfacing with PyTorch Dynamo. ☆99 Updated this week
- High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline. ☆93 Updated 6 months ago
- ☆81 Updated 8 months ago
- MLIR-based partitioning system ☆56 Updated this week
- Development repository for the Triton-Linalg conversion ☆167 Updated 3 weeks ago
- Yinghan's Code Sample ☆300 Updated 2 years ago
- ☆48 Updated 10 months ago