kshitijl / avx2-examples
Short examples illustrating AVX2 intrinsics for simple tasks.
☆94 Updated last year
Alternatives and similar repositories for avx2-examples
Users who are interested in avx2-examples are comparing it to the libraries listed below.
- Tools to create performance and roofline plots from measured data ☆58 Updated 10 years ago
- MatMul performance benchmarks for a single CPU core, comparing both hand-engineered and codegen kernels. ☆130 Updated last year
- Kernel Tuning Toolkit ☆59 Updated 2 months ago
- CudaPAD is a PTX/SASS viewer for NVIDIA CUDA kernels and provides an on-the-fly view of the assembly. ☆119 Updated 2 years ago
- TLB Benchmarks ☆33 Updated 7 years ago
- Example code for Intel AVX / AVX2 intrinsics. ☆138 Updated last year
- A 128-bit unsigned integer class for CUDA ☆46 Updated 4 months ago
- An implementation of HIP that works on CPUs, across OSes. ☆116 Updated last year
- ☆16 Updated 3 years ago
- Agenium Scale vectorization library for CPUs and GPUs ☆333 Updated 3 years ago
- ☆149 Updated this week
- ☆51 Updated 5 years ago
- The Berkeley Container Library ☆124 Updated last year
- ☆134 Updated last year
- Third-party assembler and GEMM library for NVIDIA Kepler GPU ☆81 Updated 5 years ago
- Source code for the CPU-Free model - a fully autonomous execution model for multi-GPU applications that completely excludes the involveme… ☆17 Updated last year
- Emulating DMA Engines on GPUs for Performance and Portability ☆40 Updated 10 years ago
- Benchmark for measuring the performance of sparse and irregular memory access. ☆77 Updated last week
- GPU-accelerated error-bounded lossy compression for scientific data. ☆75 Updated this week
- The Farm-SVE package provides a header that implements the ARM C language extensions (ACLE) for the ARM Scalable Vector Extension (SVE) i… ☆14 Updated last year
- Little OpenMP Library ☆160 Updated 2 years ago
- A GPU benchmark suite for assessing on-chip GPU memory bandwidth ☆104 Updated 7 years ago
- A simple profiler to count NVIDIA PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis. ☆52 Updated last month
- Power measurement for CUDA programs by polling using NVIDIA Management Library (NVML) APIs. ☆24 Updated 7 years ago
- ROCm Thrust - run Thrust-dependent software on AMD GPUs ☆108 Updated last week
- Serial and parallel implementations of matrix multiplication ☆40 Updated 4 years ago
- CUDA kernel author's tools ☆111 Updated 3 years ago
- Utilities to measure read access times of caches, memory, and hardware prefetches for simple and fused operations ☆83 Updated last year
- rocWMMA ☆111 Updated this week
- Intel® Extension for MLIR. A staging ground for MLIR dialects and tools for Intel devices using the MLIR toolchain. ☆134 Updated this week