Triple-Z / AVX-AVX2-Example-Code
Example code for Intel AVX / AVX2 intrinsics.
☆138Updated last year
Alternatives and similar repositories for AVX-AVX2-Example-Code
Users who are interested in AVX-AVX2-Example-Code are comparing it to the libraries listed below
- ☆44Updated 4 years ago
- Encapsulate the frequently used AVX instructions as independent modules to reduce repeated development workload.☆121Updated last year
- Stepwise optimizations of DGEMM on CPU, reaching performance faster than Intel MKL eventually, even under multithreading.☆148Updated 3 years ago
- How to design a CPU GEMM on x86 with AVX256 that can beat OpenBLAS.☆70Updated 6 years ago
- A 128 bit unsigned integer class for CUDA☆46Updated 5 months ago
- Introduction to Intel AVX-512☆49Updated last year
- This is an implementation of sgemm_kernel on L1d cache.☆227Updated last year
- Chinese translation of the CUDA PTX-ISA document☆42Updated last week
- Short examples illustrating AVX2 intrinsics for simple tasks.☆95Updated last year
- 14 basic topics for VEGA64 performance optimization☆56Updated 4 years ago
- Parallelized and vectorized SpMV on Intel Xeon Phi (Knights Landing, AVX512, KNL)☆24Updated last year
- BLISlab: A Sandbox for Optimizing GEMM☆527Updated 3 years ago
- code for benchmarking GPU performance based on cublasSgemm and cublasHgemm☆31Updated 3 years ago
- ☆27Updated last year
- ☆91Updated 8 years ago
- TLB Benchmarks☆34Updated 7 years ago
- A GPU benchmark suite for assessing on-chip GPU memory bandwidth☆105Updated 7 years ago
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)☆134Updated 4 years ago
- CSR-based SpGEMM on nVidia and AMD GPUs☆46Updated 9 years ago
- CSR5-based SpMV on CPUs, GPUs and Xeon Phi☆101Updated 11 months ago
- row-major matmul optimization☆637Updated last year
- assembler for NVIDIA FERMI. Imported from Google Code☆72Updated 10 years ago
- A highly efficient library for GEMM operations on Sunway TaihuLight☆17Updated 4 years ago
- Third party assembler and GEMM library for NVIDIA Kepler GPU☆81Updated 5 years ago
- ☆245Updated this week
- GPUDirect Async support for IB Verbs☆115Updated 2 years ago
- ☆112Updated last year
- ☆96Updated 3 years ago
- Implementation of TSM2L and TSM2R -- High-Performance Tall-and-Skinny Matrix-Matrix Multiplication Algorithms for CUDA☆32Updated 4 years ago
- Efficient Distributed GPU Programming for Exascale, an SC/ISC Tutorial☆264Updated last week