NervanaSystems / maxas
Assembler for NVIDIA Maxwell architecture
☆968 Updated 2 years ago
Alternatives and similar repositories for maxas:
Users who are interested in maxas are comparing it to the repositories listed below:
- An unofficial cuda assembler, for all generations of SASS, hopefully :) ☆426 Updated last year
- Patterns and behaviors for GPU computing ☆1,699 Updated 2 years ago
- BLISlab: A Sandbox for Optimizing GEMM ☆498 Updated 3 years ago
- [ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl ☆1,719 Updated last year
- Source code examples from the Parallel Forall Blog ☆1,260 Updated 6 months ago
- Winograd minimal convolution algorithm generator for convolutional neural networks. ☆611 Updated 4 years ago
- A single-header C++ library for simplifying the use of CUDA Runtime Compilation (NVRTC). ☆521 Updated this week
- Assembler for NVIDIA Volta and Turing GPUs ☆212 Updated 3 years ago
- ☆1,813 Updated last year
- A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology ☆958 Updated this week
- Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance. ☆324 Updated last month
- CUDA Kernel Benchmarking Library ☆561 Updated 3 months ago
- CUDA Data Parallel Primitives Library ☆426 Updated 6 years ago
- row-major matmul optimization ☆602 Updated last year
- Examples demonstrating available options to program multiple GPUs in a single node or a cluster ☆610 Updated 3 months ago
- Low-precision matrix multiplication ☆1,792 Updated last year
- A simple memory manager for CUDA designed to help Deep Learning frameworks manage memory ☆296 Updated 6 years ago
- Demonstration of various hardware effects on CUDA GPUs. ☆364 Updated last year
- Library for specialized dense and sparse matrix operations, and deep learning primitives. ☆862 Updated this week
- A CPU tool for benchmarking the peak of floating points ☆524 Updated 4 months ago
- Source code that accompanies The CUDA Handbook. ☆514 Updated 2 weeks ago
- A simple high performance CUDA GEMM implementation. ☆346 Updated last year
- a software library containing BLAS functions written in OpenCL ☆851 Updated 6 months ago
- Tuned OpenCL BLAS ☆1,084 Updated 3 months ago
- Kernel Tuner ☆311 Updated last week
- A flexible and efficient deep neural network (DNN) compiler that generates high-performance executable from a DNN model description. ☆974 Updated 5 months ago
- Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators ☆349 Updated this week
- Stretching GPU performance for GEMMs and tensor contractions. ☆233 Updated this week
- HCC is an Open Source, Optimizing C++ Compiler for Heterogeneous Compute currently for the ROCm GPU Computing Platform ☆434 Updated 4 years ago
- Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct… ☆345 Updated 5 months ago