NVIDIA / HMM_sample_code
CUDA 12.2 HMM demos
☆19Updated 6 months ago
Alternatives and similar repositories for HMM_sample_code:
Users who are interested in HMM_sample_code are comparing it to the libraries listed below.
- TileFusion is a highly efficient kernel template library designed to elevate the level of abstraction in CUDA C for processing tiles.☆56Updated this week
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs.☆30Updated 2 months ago
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications.☆25Updated 4 months ago
- Code for Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture (accepted by PVLDB). The outdated wr…☆9Updated last year
- GPTQ inference TVM kernel☆38Updated 9 months ago
- An extension library of WMMA API (Tensor Core API)☆88Updated 7 months ago
- An Attention Superoptimizer☆21Updated last month
- GPU Performance Advisor☆64Updated 2 years ago
- Test suite for probing the numerical behavior of NVIDIA tensor cores☆37Updated 6 months ago
- Inference framework for MoE layers based on TensorRT with Python binding☆41Updated 3 years ago
- RCCL Performance Benchmark Tests☆59Updated last month
- Standalone Flash Attention v2 kernel without libtorch dependency☆104Updated 5 months ago
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme☆55Updated this week
- Bandwidth test for ROCm☆54Updated last week
- Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large …☆64Updated 2 years ago
- CUDA Templates for Linear Algebra Subroutines☆14Updated this week
- Memory Optimizations for Deep Learning (ICML 2023)☆62Updated 11 months ago
- Benchmark tests supporting the TiledCUDA library.☆15Updated 3 months ago
- No-GIL Python environment featuring NVIDIA Deep Learning libraries.☆43Updated last week
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving☆17Updated this week
- A demo of how to write a high-performance convolution that runs on Apple silicon☆52Updated 3 years ago