YashasSamaga / ConvolutionBuildingBlocks
GEMM and Winograd based convolutions using CUTLASS
☆26 Updated 4 years ago
Alternatives and similar repositories for ConvolutionBuildingBlocks:
Users who are interested in ConvolutionBuildingBlocks are comparing it to the libraries listed below.
- ☆50 Updated last year
- CUDA templates for tile-sparse matrix multiplication based on CUTLASS. ☆51 Updated 7 years ago
- ☆38 Updated 5 years ago
- ☆69 Updated 2 years ago
- ☆95 Updated last year
- System for automated integration of deep learning backends. ☆47 Updated 2 years ago
- A Winograd Minimal Filter Implementation in CUDA ☆24 Updated 3 years ago
- SparseTIR: Sparse Tensor Compiler for Deep Learning ☆135 Updated 2 years ago
- ☆16 Updated 7 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆81 Updated 2 weeks ago
- High-speed GEMV kernels, up to 2.7x speedup compared to the PyTorch baseline. ☆106 Updated 9 months ago
- Benchmark code for the "Online normalizer calculation for softmax" paper ☆91 Updated 6 years ago
- A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores ☆51 Updated last year
- Benchmark scripts for TVM ☆74 Updated 3 years ago
- Codebase associated with the PyTorch compiler tutorial ☆45 Updated 5 years ago
- ☆44 Updated 4 years ago
- An extension library of WMMA API (Tensor Core API) ☆96 Updated 9 months ago
- [MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration ☆197 Updated 2 years ago
- Customized matrix multiplication kernels ☆54 Updated 3 years ago
- ☆17 Updated 3 years ago
- play gemm with tvm ☆90 Updated last year
- Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation ☆27 Updated 5 years ago
- llama INT4 cuda inference with AWQ ☆54 Updated 3 months ago
- An extension of TVMScript to write simple and high performance GPU kernels with tensorcore. ☆50 Updated 9 months ago
- Test winograd convolution written in TVM for CUDA and AMDGPU ☆41 Updated 6 years ago
- Benchmark PyTorch Custom Operators ☆14 Updated last year
- Implementation of TSM2L and TSM2R -- High-Performance Tall-and-Skinny Matrix-Matrix Multiplication Algorithms for CUDA ☆32 Updated 4 years ago
- Training neural networks in TensorFlow 2.0 with 5x less memory ☆130 Updated 3 years ago
- Study of Ampere's Sparse Matmul ☆18 Updated 4 years ago
- ☆23 Updated 5 months ago