NervanaSystems / maxas
Assembler for NVIDIA Maxwell architecture
☆981 Updated 2 years ago
Alternatives and similar repositories for maxas:
Users who are interested in maxas are comparing it to the repositories listed below.
- Patterns and behaviors for GPU computing ☆1,707 Updated 2 years ago
- An unofficial cuda assembler, for all generations of SASS, hopefully :) ☆462 Updated last year
- Library for specialized dense and sparse matrix operations, and deep learning primitives. ☆867 Updated this week
- [ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl ☆1,735 Updated last year
- BLISlab: A Sandbox for Optimizing GEMM ☆507 Updated 3 years ago
- A single-header C++ library for simplifying the use of CUDA Runtime Compilation (NVRTC). ☆527 Updated last week
- CUDA Data Parallel Primitives Library ☆428 Updated 6 years ago
- Assembler for NVIDIA Volta and Turing GPUs ☆214 Updated 3 years ago
- Representation and Reference Lowering of ONNX Models in MLIR Compiler Infrastructure ☆830 Updated this week
- ☆408 Updated this week
- a software library containing BLAS functions written in OpenCL ☆852 Updated 7 months ago
- The Tensor Algebra SuperOptimizer for Deep Learning ☆704 Updated 2 years ago
- A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology ☆1,007 Updated 2 weeks ago
- CUDA Kernel Benchmarking Library ☆593 Updated last week
- Winograd minimal convolution algorithm generator for convolutional neural networks. ☆613 Updated 4 years ago
- Open single and half precision gemm implementations ☆378 Updated last year
- Tuned OpenCL BLAS ☆1,090 Updated 4 months ago
- A GPU benchmark tool for evaluating GPUs and CPUs on mixed operational intensity kernels (CUDA, OpenCL, HIP, SYCL, OpenMP) ☆389 Updated 2 months ago
- Low-precision matrix multiplication ☆1,794 Updated last year
- GPUOCelot: A dynamic compilation framework for PTX ☆286 Updated last year
- Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct… ☆367 Updated 6 months ago
- Source code examples from the Parallel Forall Blog ☆1,269 Updated 7 months ago
- A code generator for array-based code on CPUs and GPUs ☆599 Updated this week
- ☆131 Updated last year
- The Tensor Algebra Compiler (taco) computes sparse tensor expressions on CPUs and GPUs ☆1,287 Updated 11 months ago
- common in-memory tensor structure ☆963 Updated last week
- row-major matmul optimization ☆611 Updated last year
- Demonstration of various hardware effects on CUDA GPUs. ☆365 Updated last year
- The Torch-MLIR project aims to provide first class support from the PyTorch ecosystem to the MLIR ecosystem. ☆1,466 Updated last week
- Examples demonstrating available options to program multiple GPUs in a single node or a cluster ☆657 Updated last month