General Matrix Multiplication using NVIDIA Tensor Cores
☆28 · Jan 25, 2025 · Updated last year
Alternatives and similar repositories for tGeMM
Users that are interested in tGeMM are comparing it to the libraries listed below.
- Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge. ☆18 · Feb 9, 2026 · Updated last month
- Simple problems implemented in CUDA C ☆35 · Apr 7, 2025 · Updated 11 months ago
- Personal solutions to the Triton Puzzles ☆20 · Jul 18, 2024 · Updated last year
- Accelerated General (FP32) Matrix Multiplication from scratch in CUDA ☆184 · Jan 9, 2025 · Updated last year
- Optimized Parallel Tiled Approach to perform Matrix Multiplication by taking advantage of the lower latency, higher bandwidth shared memo… ☆16 · Sep 24, 2017 · Updated 8 years ago
- ☆10 · Apr 10, 2014 · Updated 11 years ago
- Storb is a distributed storage subnet on the Bittensor network ☆13 · Jul 28, 2025 · Updated 7 months ago
- ☆13 · Jan 28, 2026 · Updated last month
- An extension to the GaLore paper, to perform Natural Gradient Descent in low rank subspace ☆18 · Oct 21, 2024 · Updated last year
- ☆37 · Dec 12, 2025 · Updated 3 months ago
- Persistent dense gemm for Hopper in `CuTeDSL` ☆15 · Aug 9, 2025 · Updated 7 months ago
- Multi-path UDP protocol - an example implementation ☆10 · Jul 6, 2015 · Updated 10 years ago
- Write a fast kernel and see how you compare against the best humans and AI on gpumode.com ☆88 · Updated this week
- CUDA 8-bit Tensor Core Matrix Multiplication based on m16n16k16 WMMA API ☆35 · Sep 15, 2023 · Updated 2 years ago
- ☆23 · Jul 11, 2025 · Updated 8 months ago
- Dense optical flow toolbox (from C.Liu) ☆18 · Jun 14, 2012 · Updated 13 years ago
- Some mixture-of-experts architecture implementations ☆26 · Mar 22, 2024 · Updated 2 years ago
- The VD100 development board is based on the Xilinx Versal AI Edge series chip xcve2302 and is designed with a core board and a bottom boa… ☆18 · Jul 9, 2024 · Updated last year
- ☆21 · Jan 21, 2026 · Updated 2 months ago
- GitHub Action to automatically format Rust code and fix clippy lints. ☆27 · Nov 24, 2025 · Updated 4 months ago
- Sparse/dense tensor library for Python ☆12 · Jan 26, 2026 · Updated last month
- Example design for the Ethernet FMC using an FPGA based hardware packet generator/checker to demonstrate maximum throughput ☆12 · Mar 10, 2026 · Updated 2 weeks ago
- An AMD/Xilinx Artix 50T FPGA on a Pi5 Hat with PCIe and GPIO interconnects as well as SPI programming ☆17 · Sep 25, 2024 · Updated last year
- A fast alternative to the standard C/C++ pow() function. With adjustable accuracy-space tradeoff. ☆14 · Jul 12, 2013 · Updated 12 years ago
- ☆14 · Feb 7, 2020 · Updated 6 years ago
- A handy plugin for copying requests/responses directly from Burp, some extra magic included. ☆13 · Oct 15, 2021 · Updated 4 years ago
- 3-axis Accelerometer ☆14 · Aug 20, 2018 · Updated 7 years ago
- Code for "What really matters in matrix-whitening optimizers?" ☆23 · Oct 31, 2025 · Updated 4 months ago
- Interface Xilinx XDMA PCIe with DDR3 using MIG-IP on Artix-7 FPGA using Nitefury dev board ☆18 · Apr 13, 2022 · Updated 3 years ago
- ☆93 · Nov 11, 2025 · Updated 4 months ago
- High Performance FP8 GEMM Kernels for SM89 and later GPUs. ☆20 · Jan 24, 2025 · Updated last year
- LightWeight IP Application Examples for Xilinx FPGA ☆15 · Jan 19, 2016 · Updated 10 years ago
- Tenstorrent Topology (TT-Topology) is a command line utility used to flash multiple NB cards on a system to use specific eth routing conf… ☆16 · Feb 26, 2026 · Updated 3 weeks ago
- Logistic regression FPGA core ☆19 · Apr 7, 2021 · Updated 4 years ago
- SD card Bootloader for atmega processors ☆27 · May 9, 2012 · Updated 13 years ago
- uiomem is a Linux device driver for accessing a memory area outside the Linux Kernel management from user space. ☆14 · Updated this week
- Siemens PPD/SID206/SID803A engine control map detection software ☆21 · Mar 11, 2013 · Updated 13 years ago
- Step by step implementation of a fast softmax kernel in CUDA ☆63 · Jan 6, 2025 · Updated last year
- Trigger an LLM in your CI/CD to auto-complete your work ☆11 · Apr 5, 2023 · Updated 2 years ago