NVIDIA / HMM_sample_code
CUDA 12.2 HMM demos
☆19 · Updated 5 months ago
Alternatives and similar repositories for HMM_sample_code:
Users interested in HMM_sample_code are also comparing it to the repositories listed below.
- ☆36 · Updated this week
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications. ☆22 · Updated 3 months ago
- Code for Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture (accepted by PVLDB). The outdated wr… ☆8 · Updated last year
- FP64-equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme ☆54 · Updated 4 months ago
- GPTQ inference TVM kernel ☆38 · Updated 8 months ago
- An extension library of the WMMA API (Tensor Core API) ☆87 · Updated 6 months ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs. ☆30 · Updated last month
- An Attention Superoptimizer ☆20 · Updated 8 months ago
- An IR for efficiently simulating distributed ML computation. ☆25 · Updated last year
- Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large … ☆63 · Updated 2 years ago
- ☆48 · Updated 10 months ago
- Benchmark tests supporting the TiledCUDA library. ☆12 · Updated last month
- Standalone Flash Attention v2 kernel without a libtorch dependency ☆99 · Updated 4 months ago
- ☆19 · Updated 3 months ago
- Framework to reduce autotune overhead to zero for well-known deployments. ☆57 · Updated last month
- ☆20 · Updated last year
- ☆8 · Updated last year
- CUDA Templates for Linear Algebra Subroutines ☆11 · Updated this week
- ☆35 · Updated last month
- ☆57 · Updated 7 months ago
- An external memory allocator example for PyTorch. ☆14 · Updated 3 years ago
- FlexFlow Serve: Low-Latency, High-Performance LLM Serving ☆16 · Updated this week
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆75 · Updated this week
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆37 · Updated 8 months ago
- Supplemental materials for The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning ☆21 · Updated last month
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆87 · Updated 10 months ago
- RCCL Performance Benchmark Tests ☆55 · Updated this week
- Benchmark code for the "Online normalizer calculation for softmax" paper ☆62 · Updated 6 years ago
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS ☆18 · Updated 3 years ago
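One entry above benchmarks the "Online normalizer calculation for softmax" paper (Milakov and Gimelshein), which computes the softmax max and normalizer in a single pass instead of two. A minimal Python sketch of that one-pass idea (the function name is mine, not from the benchmark repo):

```python
import math

def online_softmax(xs):
    """Single-pass softmax normalizer: tracks the running maximum m and
    the normalizer d = sum(exp(x - m)), rescaling d whenever a new
    maximum is seen, so the input is read only once."""
    m = float("-inf")  # running maximum
    d = 0.0            # running normalizer, relative to m
    for x in xs:
        m_new = max(m, x)
        # rescale the old partial sum to the new max, then add the new term
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    # final pass only to materialize the probabilities
    return [math.exp(x - m) / d for x in xs]
```

The rescaling step `d * exp(m - m_new)` is what makes the fusion numerically safe: all exponentials stay shifted by the current maximum, exactly as in the classic two-pass formulation.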