Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with the MMA PTX instruction.
☆13Nov 3, 2023Updated 2 years ago
Alternatives and similar repositories for cuda_back2back_hgemm
Users who are interested in cuda_back2back_hgemm are comparing it to the libraries listed below.
- Multiple GEMM operators are constructed with cutlass to support LLM inference.☆20Aug 3, 2025Updated 7 months ago
- ☆63Mar 21, 2026Updated last week
- Source code of the IPDPS '21 paper: "TileSpMV: A Tiled Algorithm for Sparse Matrix-Vector Multiplication on GPUs" by Yuyao Niu, Zhengyang…☆12Aug 12, 2022Updated 3 years ago
- Lemon is an LALR(1) parser generator for C or C++.☆17Jun 10, 2014Updated 11 years ago
- Specialized Parallel Linear Algebra, providing distributed GEMM functionality for specific matrix distributions with optional GPU acceler…☆31Jun 26, 2024Updated last year
- An intelligent matrix format designer for SpMV☆10Oct 10, 2023Updated 2 years ago
- 🍓 A toy object-oriented programming language written in Rust☆17Apr 10, 2024Updated last year
- Mirror of http://gitlab.hpcrl.cse.ohio-state.edu/chong/ppopp19_ae, refactoring for understanding☆16Oct 20, 2021Updated 4 years ago
- 6502 Emulator written in C++☆13Feb 18, 2025Updated last year
- Python client for the etcd API v3, supports Python >= 3.7, under active maintenance☆12Aug 4, 2025Updated 7 months ago
- A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores☆59Nov 24, 2023Updated 2 years ago
- blockchain open sources☆11Aug 18, 2017Updated 8 years ago
- Kubernetes debugging and inspection tool☆13Nov 8, 2018Updated 7 years ago
- Huawei collective communication performance test☆15May 27, 2024Updated last year
- Parallel cuckoo hashing on GPUs with CUDA☆12Sep 27, 2019Updated 6 years ago
- ☆13Nov 25, 2019Updated 6 years ago
- Parallel SpMV using CSR representation, built in CUDA☆14Jun 27, 2020Updated 5 years ago
- Several optimization methods of half-precision general matrix multiplication (HGEMM) using tensor core with WMMA API and MMA PTX instruct…☆533Sep 8, 2024Updated last year
- ☆10Apr 24, 2023Updated 2 years ago
- AKO4ALL: Agentic Kernel Optimization for All — Open, minimal harness for any kernel, any hardware, any language.☆86Updated this week
- ☆20Sep 10, 2025Updated 6 months ago
- Yet another Polyhedra Compiler for DeepLearning☆19Apr 14, 2023Updated 2 years ago
- Experiments evaluating preemption on the NVIDIA Pascal architecture☆16Nov 10, 2016Updated 9 years ago
- ☆18Aug 9, 2022Updated 3 years ago
- Offline renderer using CUDA☆13Jun 8, 2020Updated 5 years ago
- ☆32Apr 2, 2025Updated 11 months ago
- Source code of the SC '23 paper: "DASP: Specific Dense Matrix Multiply-Accumulate Units Accelerated General Sparse Matrix-Vector Multipli…☆29Jun 18, 2024Updated last year
- ☆22Updated this week
- Several common matrix multiplication methods implemented on CPU and NVIDIA GPU using C++11 and CUDA.☆14Feb 8, 2023Updated 3 years ago
- ECM Factorization on CUDA-GPUs☆14Sep 29, 2020Updated 5 years ago
- Convert CUDA programs from float data type to half or half2 with SIMDization☆20May 28, 2019Updated 6 years ago
- Console Sake Game in Assembly☆21Oct 24, 2022Updated 3 years ago
- A SoundFont MIDI synthesizer written in pure Odinlang☆11Aug 13, 2023Updated 2 years ago
- ☆18Mar 12, 2025Updated last year
- The code repository of DGCNN on FPGA: Acceleration of The Point Cloud Classifier Using FPGAs☆17Mar 6, 2023Updated 3 years ago
- ☆55Feb 5, 2026Updated last month
- A CUDA implementation of Arithmetic Coding☆18Jan 21, 2025Updated last year
- Music GAN - GANSynth preprocessing, ProGAN and DCGAN architecture☆11Jan 26, 2023Updated 3 years ago
- BWA-MEM program accelerated with the GPUSeed and GASAL2 libraries☆19Dec 16, 2022Updated 3 years ago