Triple-Z / AVX-AVX2-Example-Code
Example code for Intel AVX / AVX2 intrinsics.
☆123Updated last year
Related projects:
- Assembler for NVIDIA Volta and Turing GPUs☆195Updated 2 years ago
- Stepwise optimizations of DGEMM on CPU, eventually outperforming Intel MKL, even with multithreading.☆103Updated 2 years ago
- Short examples illustrating AVX2 intrinsics for simple tasks.☆81Updated 6 months ago
- Encapsulate the frequently used AVX instructions as independent modules to reduce repeated development workload.☆113Updated 8 months ago
- ☆53Updated last week
- Dissecting NVIDIA GPU Architecture☆78Updated 2 years ago
- Third party assembler and GEMM library for NVIDIA Kepler GPU☆76Updated 4 years ago
- ☆189Updated this week
- ☆88Updated 7 years ago
- MatMul Performance Benchmarks for a Single CPU Core comparing both hand engineered and codegen kernels.☆123Updated 11 months ago
- A GPU benchmark suite for assessing on-chip GPU memory bandwidth☆96Updated 7 years ago
- This is an implementation of sgemm_kernel on L1d cache.☆212Updated 6 months ago
- 14 basic topics for VEGA64 performance optimization☆49Updated 3 years ago
- An extension library of WMMA API (Tensor Core API)☆81Updated 2 months ago
- ☆73Updated 5 months ago
- Chinese translation of the CUDA PTX-ISA document☆23Updated 6 months ago
- development repository for the open earth compiler☆74Updated 3 years ago
- ☆39Updated 4 years ago
- Assembler for NVIDIA Fermi. Imported from Google Code☆68Updated 9 years ago
- CSR5-based SpMV on CPUs, GPUs and Xeon Phi☆93Updated 3 months ago
- examples for tvm schedule API☆97Updated last year
- Intel Data Parallel C++ (and SYCL 2020) Tutorial.☆90Updated 2 years ago
- ☆39Updated 3 years ago
- An unofficial cuda assembler, for all generations of SASS, hopefully :)☆389Updated last year
- BLISlab: A Sandbox for Optimizing GEMM☆467Updated 3 years ago
- Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial☆165Updated 3 months ago
- CudaPAD is a PTX/SASS viewer for NVIDIA Cuda kernels and provides an on-the-fly view of the assembly.☆107Updated last year
- ☆38Updated 4 years ago
- GPU-Accelerated Lossless Data Compressors Survey☆110Updated 4 years ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)☆109Updated 4 years ago