NVIDIA / HMM_sample_code
CUDA 12.2 HMM demos
☆17 · Updated 3 months ago
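The repository demos the Heterogeneous Memory Management (HMM) support introduced in CUDA 12.2, which lets GPU kernels dereference ordinary `malloc`'d system memory directly. The exact demo code is in the repo itself; below is a minimal, hedged sketch of the idea (assumes an HMM-capable system: CUDA 12.2+, the open-source NVIDIA kernel driver, and a device reporting `cudaDevAttrPageableMemoryAccess`):

```cuda
// Minimal HMM sketch: a kernel touches plain malloc'd host memory with
// no cudaMalloc, cudaMemcpy, or cudaMallocManaged. Pages are migrated
// or accessed on demand by the driver. Requires HMM support.
#include <cstdio>
#include <cstdlib>

__global__ void inc(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;  // ordinary pageable system memory
}

int main() {
    int pageable = 0;
    cudaDeviceGetAttribute(&pageable, cudaDevAttrPageableMemoryAccess, 0);
    if (!pageable) {
        printf("device 0 cannot access pageable memory; HMM unavailable\n");
        return 1;
    }

    const int n = 1 << 20;
    int *data = (int *)malloc(n * sizeof(int));  // plain host allocation
    for (int i = 0; i < n; ++i) data[i] = i;

    inc<<<(n + 255) / 256, 256>>>(data, n);      // GPU reads/writes malloc'd memory
    cudaDeviceSynchronize();

    printf("data[42] = %d\n", data[42]);
    free(data);
    return 0;
}
```

On pre-HMM systems the same pattern requires `cudaMallocManaged` (or explicit copies); the point of the HMM demos is that the allocation call above stays a plain `malloc`.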
Related projects
Alternatives and complementary repositories for HMM_sample_code
- GPTQ inference TVM kernel ☆36 · Updated 6 months ago
- FP64-equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme ☆46 · Updated 2 months ago
- An Attention Superoptimizer ☆20 · Updated 6 months ago
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications ☆22 · Updated last month
- An external memory allocator example for PyTorch ☆13 · Updated 3 years ago
- PyTorch bindings for CUTLASS grouped GEMM ☆53 · Updated 3 weeks ago
- Code for Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture (accepted by PVLDB). The outdated wr… ☆8 · Updated last year
- Standalone Flash Attention v2 kernel without libtorch dependency ☆98 · Updated 2 months ago
- An extension library of the WMMA API (Tensor Core API) ☆84 · Updated 4 months ago
- cuASR: CUDA Algebra for Semirings ☆34 · Updated 2 years ago
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆85 · Updated 8 months ago
- Inference framework for MoE layers based on TensorRT with Python binding ☆41 · Updated 3 years ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs ☆29 · Updated 2 months ago
- Quantized Attention on GPU ☆30 · Updated 2 weeks ago
- GPU Performance Advisor ☆63 · Updated 2 years ago
- Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large … ☆63 · Updated 2 years ago
- Odysseus: Playground of LLM Sequence Parallelism ☆57 · Updated 5 months ago
- (NeurIPS 2022) Automatically finding good model-parallel strategies, especially for complex models and clusters ☆34 · Updated 2 years ago
- Transformers components, but in Triton ☆27 · Updated this week
- Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions ☆11 · Updated last year
- FlexAttention w/ FlashAttention3 support ☆27 · Updated last month