google-research / sputnik
A library of GPU kernels for sparse matrix operations.
☆240 Updated 3 years ago
Related projects:
- Assembler for NVIDIA Volta and Turing GPUs ☆195 Updated 2 years ago
- ☆138 Updated 2 months ago
- An Easy-to-understand TensorOp Matmul Tutorial ☆265 Updated this week
- Training neural networks in TensorFlow 2.0 with 5x less memory ☆127 Updated 2 years ago
- Research and development for optimizing transformers ☆121 Updated 3 years ago
- Code for paper "Design Principles for Sparse Matrix Multiplication on the GPU" accepted to Euro-Par 2018 ☆70 Updated 3 years ago
- ☆141 Updated last year
- Step-by-step optimization of CUDA SGEMM ☆207 Updated 2 years ago
- ☆73 Updated 5 months ago
- Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct… ☆264 Updated last week
- PyTorch emulation library for Microscaling (MX)-compatible data formats ☆143 Updated last month
- Benchmark code for the "Online normalizer calculation for softmax" paper ☆52 Updated 6 years ago
- Experimental projects related to TensorRT ☆62 Updated this week
- A simple high-performance CUDA GEMM implementation. ☆319 Updated 8 months ago
- High-speed GEMV kernels, up to 2.7x speedup compared to the PyTorch baseline. ☆81 Updated 2 months ago
- MatMul Performance Benchmarks for a Single CPU Core comparing both hand-engineered and codegen kernels. ☆123 Updated 11 months ago
- [MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration ☆191 Updated 2 years ago
- ☆136 Updated 3 months ago
- Shared Middle-Layer for Triton Compilation ☆157 Updated last week
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core) ☆109 Updated 4 years ago
- Magicube is a high-performance library for quantized sparse matrix operations (SpMM and SDDMM) of deep learning on Tensor Cores. ☆79 Updated last year
- CUDA Matrix Multiplication Optimization ☆118 Updated 2 months ago
- CUDA templates for tile-sparse matrix multiplication based on CUTLASS. ☆48 Updated 6 years ago
- SparseTIR: Sparse Tensor Compiler for Deep Learning ☆129 Updated last year
- Stores documents and resources used by the OpenXLA developer community ☆105 Updated last month
- ☆127 Updated last month
- A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser") ☆250 Updated this week
- ☆48 Updated 6 months ago
- MSCCL++: A GPU-driven communication stack for scalable AI applications ☆233 Updated this week
- A fast communication-overlapping library for tensor parallelism on GPUs. ☆184 Updated this week