Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with the MMA PTX instruction.
☆13Nov 3, 2023Updated 2 years ago
Alternatives and similar repositories for cuda_back2back_hgemm
Users that are interested in cuda_back2back_hgemm are comparing it to the libraries listed below.
- Multiple GEMM operators are constructed with cutlass to support LLM inference.☆20Aug 3, 2025Updated 9 months ago
- Source code of the IPDPS '21 paper: "TileSpMV: A Tiled Algorithm for Sparse Matrix-Vector Multiplication on GPUs" by Yuyao Niu, Zhengyang…☆13Aug 12, 2022Updated 3 years ago
- Lemon is an LALR(1) parser generator for C or C++.☆17Jun 10, 2014Updated 11 years ago
- Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceler…☆32Jun 26, 2024Updated last year
- An intelligent matrix format designer for SpMV☆10Oct 10, 2023Updated 2 years ago
- 🍓 A toy object-oriented programming language written in Rust☆17Apr 10, 2024Updated 2 years ago
- Mirror of http://gitlab.hpcrl.cse.ohio-state.edu/chong/ppopp19_ae, refactoring for understanding☆17Oct 20, 2021Updated 4 years ago
- Python client for the etcd API v3, supporting Python >= 3.7, under active maintenance☆13Aug 4, 2025Updated 9 months ago
- 6502 Emulator written in C++☆13Feb 18, 2025Updated last year
- A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores☆61Nov 24, 2023Updated 2 years ago
- blockchain open sources☆11Aug 18, 2017Updated 8 years ago
- Kubernetes debugging and inspection tool☆13Nov 8, 2018Updated 7 years ago
- Parallel cuckoo hashing on GPUs with CUDA☆12Sep 27, 2019Updated 6 years ago
- Huawei collective communication performance test☆16May 27, 2024Updated 2 years ago
- Parallel SpMV using CSR representation, built in CUDA☆14Jun 27, 2020Updated 5 years ago
- ☆13Nov 25, 2019Updated 6 years ago
- Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…☆545Sep 8, 2024Updated last year
- ☆10Apr 24, 2023Updated 3 years ago
- ☆22Sep 10, 2025Updated 8 months ago
- ☆116May 10, 2026Updated 2 weeks ago
- Yet another Polyhedra Compiler for DeepLearning☆19Apr 14, 2023Updated 3 years ago
- Experiments evaluating preemption on the NVIDIA Pascal architecture☆16Nov 10, 2016Updated 9 years ago
- ☆17Aug 9, 2022Updated 3 years ago
- Offline renderer using CUDA☆13Jun 8, 2020Updated 5 years ago
- ☆34Apr 2, 2025Updated last year
- Source code of the SC '23 paper: "DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector Multipli…☆29Jun 18, 2024Updated last year
- ECM Factorization on CUDA-GPUs☆16Sep 29, 2020Updated 5 years ago
- Several common methods of matrix multiplication are implemented on CPU and Nvidia GPU using C++11 and CUDA.☆14Feb 8, 2023Updated 3 years ago
- Convert CUDA programs from float data type to half or half2 with SIMDization☆19May 28, 2019Updated 7 years ago
- A SoundFont MIDI synthesizer written in pure Odinlang☆11Aug 13, 2023Updated 2 years ago
- ☆18Mar 12, 2025Updated last year
- A CUDA implementation of Arithmetic Coding☆18Jan 21, 2025Updated last year
- BWA-MEM program accelerated with the GPUSeed and GASAL2 libraries☆19Dec 16, 2022Updated 3 years ago
- Music GAN - GANSynth preprocessing, ProGAN and DCGAN architecture☆11Jan 26, 2023Updated 3 years ago
- The code repository of DGCNN on FPGA: Acceleration of The Point Cloud Classifier Using FPGAs☆17Mar 6, 2023Updated 3 years ago
- An Open Source Kepler GPU Assembler☆21Jan 23, 2017Updated 9 years ago
- A Data Oriented C Compiler in C☆25Mar 28, 2024Updated 2 years ago
- FlashSparse significantly reduces the computation redundancy for unstructured sparsity (for SpMM and SDDMM) on Tensor Cores through a Swa…☆39Oct 5, 2025Updated 7 months ago
- A "minimal" example of a Vulkan rainbow triangle in Odin with GLFW.☆12Jun 2, 2024Updated last year