NVIDIA / HMM_sample_code
CUDA 12.2 HMM demos
☆19Updated 8 months ago
Alternatives and similar repositories for HMM_sample_code:
Users that are interested in HMM_sample_code are comparing it to the libraries listed below
- Matrix multiplication on GPUs for matrices stored in CPU memory. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.☆31Updated 3 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆74Updated this week
- An easily extensible framework for understanding and optimizing CUDA operators, intended for learning purposes only☆13Updated 9 months ago
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme☆57Updated last week
- GPTQ inference TVM kernel☆38Updated 11 months ago
- ☆49Updated last year
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications.☆25Updated 5 months ago
- An extension library of WMMA API (Tensor Core API)☆93Updated 8 months ago
- RCCL Performance Benchmark Tests☆60Updated 3 weeks ago
- CUDA Templates for Linear Algebra Subroutines☆16Updated this week
- Bandwidth test for ROCm☆54Updated 2 weeks ago
- An Attention Superoptimizer☆21Updated 2 months ago
- Benchmark tests supporting the TiledCUDA library.☆15Updated 4 months ago
- ☆21Updated last month
- study of cutlass☆21Updated 4 months ago
- Test suite for probing the numerical behavior of NVIDIA tensor cores☆37Updated 8 months ago
- Benchmark code for the "Online normalizer calculation for softmax" paper☆87Updated 6 years ago
- Code for Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture (accepted by PVLDB). The outdated wr…☆9Updated last year
- Standalone Flash Attention v2 kernel without libtorch dependency☆108Updated 6 months ago
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer☆90Updated last month
- A practical way of learning Swizzle☆16Updated last month
- rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.☆66Updated this week
- Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Core)☆127Updated 4 years ago
- ☆11Updated 3 years ago
- cuASR: CUDA Algebra for Semirings☆35Updated 2 years ago
- play gemm with tvm☆89Updated last year
- ☆92Updated 11 months ago
- An IR for efficiently simulating distributed ML computation.☆28Updated last year
- A demo of how to write a high-performance convolution that runs on Apple silicon☆54Updated 3 years ago
- ☆36Updated 3 months ago