kshitijl / avx2-examples
Short examples illustrating AVX2 intrinsics for simple tasks.
☆89 Updated last year
Alternatives and similar repositories for avx2-examples:
Users who are interested in avx2-examples are comparing it to the libraries listed below:
- MatMul Performance Benchmarks for a Single CPU Core comparing both hand engineered and codegen kernels. ☆129 Updated last year
- Example code for Intel AVX / AVX2 intrinsics. ☆137 Updated last year
- Benchmark for measuring the performance of sparse and irregular memory access. ☆76 Updated last month
- ☆140 Updated this week
- Agenium Scale vectorization library for CPUs and GPUs ☆331 Updated 3 years ago
- The Berkeley Container Library ☆124 Updated last year
- RV: A Unified Region Vectorizer for LLVM ☆107 Updated 2 months ago
- A 128 bit unsigned integer class for CUDA ☆45 Updated 3 months ago
- Utilities to measure read access times of caches, memory, and hardware prefetches for simple and fused operations ☆82 Updated last year
- assembler for NVIDIA FERMI. Imported from Google Code ☆72 Updated 10 years ago
- CudaPAD is a PTX/SASS viewer for NVIDIA Cuda kernels and provides an on-the-fly view of the assembly. ☆117 Updated 2 years ago
- Third party assembler and GEMM library for NVIDIA Kepler GPU ☆81 Updated 5 years ago
- A GPU benchmark suite for assessing on-chip GPU memory bandwidth ☆104 Updated 7 years ago
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis. ☆50 Updated 2 weeks ago
- ☆131 Updated last year
- ☆51 Updated 5 years ago
- ☆56 Updated 2 weeks ago
- Stretching GPU performance for GEMMs and tensor contractions. ☆234 Updated 2 weeks ago
- ☆16 Updated 3 years ago
- Archived implementation of BLAS using the SYCL open standard. See oneMath for a replacement. ☆261 Updated 2 months ago
- Little OpenMP Library ☆159 Updated 2 years ago
- An implementation of HIP that works on CPUs, across OSes. ☆115 Updated last year
- TLB Benchmarks ☆33 Updated 7 years ago
- Unofficial description of the CUDA assembly (SASS) instruction sets. ☆79 Updated 3 weeks ago
- Omnitrace: Application Profiling, Tracing, and Analysis ☆309 Updated 3 weeks ago
- ROCm Thrust - run Thrust dependent software on AMD GPUs ☆106 Updated this week
- Haystack is an analytical cache model that given a program computes the number of cache misses. ☆46 Updated 5 years ago
- AVX-optimized sin(), cos(), exp() and log() functions ☆121 Updated 3 years ago
- SYCL Open Source Specification ☆131 Updated this week
- TPP experimentation on MLIR for linear algebra ☆122 Updated 2 weeks ago