HazyResearch / blocking-tutorial

☆138

Alternatives and similar repositories for blocking-tutorial

Users who are interested in blocking-tutorial are comparing it to the libraries listed below.

facebookresearch / loop_tool
A thin, highly portable toolkit for efficiently compiling dense loop-based computation.
☆149 Updated 3 years ago
parallel-runtimes / lomp
Little OpenMP Library
☆170 Updated 3 years ago
astojanov / Clover
Clover: Quantized 4-bit Linear Algebra Library
☆114 Updated 7 years ago
mmperf / mmperf
MatMul performance benchmarks for a single CPU core, comparing hand-engineered and codegen kernels.
☆138 Updated 2 years ago
bryancatanzaro / trove
Full-speed Array of Structures access
☆176 Updated 2 years ago
michael-lehn / ulmBLAS
ulmBLAS
☆107 Updated 7 months ago
Kobzol / hardware-effects-gpu
Demonstration of various hardware effects on CUDA GPUs.
☆391 Updated 2 years ago
milakov / int_fastdiv
Fast integer division with divisor not known at compile time. To be used primarily in CUDA kernels.
☆73 Updated 10 years ago
codeplaysoftware / portBLAS
Archived implementation of BLAS using the SYCL open standard. See oneMath for a replacement.
☆260 Updated last year
ashvardanian / ParallelReductionsBenchmark
Thrust, CUB, TBB, AVX2, AVX-512, CUDA, OpenCL, OpenMP, Metal, and Rust - all it takes to sum a lot of numbers fast!
☆116 Updated 6 months ago
ithemal / Ithemal
Instruction THroughput Estimator using MAchine Learning (ITHEMAL)
☆152 Updated 4 years ago
HiPerCoRe / KTT
Kernel Tuning Toolkit
☆67 Updated last week
ips4o / ips4o
In-place Parallel Super Scalar Samplesort (IPS⁴o)
☆132 Updated last year
gpuocelot / gpuocelot
GPUOcelot: A dynamic compilation framework for PTX
☆219 Updated 11 months ago
sleeepyjack / warpcore
A Library for fast Hash Tables on GPUs
☆132 Updated 3 months ago
NVIDIA / jitify
A single-header C++ library for simplifying the use of CUDA Runtime Compilation (NVRTC).
☆569 Updated 4 months ago
kshitijl / avx2-examples
Short examples illustrating AVX2 intrinsics for simple tasks.
☆98 Updated last year
tobiasgrosser / islplot
Library to plot integer sets and maps
☆53 Updated 9 years ago
oprecomp / FloatX
Header-only C++ library for low precision floating point type emulation.
☆179 Updated 6 years ago
ROCm / Tensile
[DEPRECATED] Moved to ROCm/rocm-libraries repo
☆254 Updated last week
SunsetQuest / CudaPAD
CudaPAD is a PTX/SASS viewer for NVIDIA CUDA kernels that provides an on-the-fly view of the assembly.
☆127 Updated 3 years ago
gunrock / loops
🎃 GPU load-balancing library for regular and irregular computations.
☆66 Updated 4 months ago
Maratyszcza / FP16
Conversion to/from half-precision floating point formats
☆379 Updated 5 months ago
eyalroz / cuda-kat
CUDA kernel author's tools
☆115 Updated 3 years ago
OpenCilk / opencilk-project
Monorepo for the OpenCilk compiler. Forked from llvm/llvm-project and based on Tapir/LLVM.
☆120 Updated this week
mangpo / swizzle-inventor
A framework that helps implement swizzle GPU kernels
☆51 Updated 5 years ago
libxsmm / libxsmm
Library for specialized dense and sparse matrix operations, and deep learning primitives.
☆933 Updated 3 weeks ago
google / ruy
☆322 Updated last month
gevtushenko / cuda_benchmark
A library to benchmark CUDA code, similar to google benchmark.
☆30 Updated 4 years ago
ORNL / iris
A unified framework across multiple programming platforms
☆43 Updated 8 months ago