A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software
★64 · Feb 23, 2025 · Updated last year
Alternatives and similar repositories for awesome-gemm
Users that are interested in awesome-gemm are comparing it to the libraries listed below.
- High Performance FP8 GEMM Kernels for SM89 and later GPUs. ★21 · Jan 24, 2025 · Updated last year
- From Minimal GEMM to Everything ★207 · May 22, 2026 · Updated last week
- Official repository Flash Local Linear Attention ★23 · Apr 23, 2026 · Updated last month
- NCCL Examples from Official NVIDIA NCCL Developer Guide. ★20 · May 29, 2018 · Updated 8 years ago
- Evaluating Large Language Models for CUDA Code Generation. ComputeEval is a framework designed to generate and evaluate CUDA code from Lar… ★139 · May 19, 2026 · Updated last week
- Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference. ★47 · Jun 11, 2025 · Updated 11 months ago
- Synthesis using Synopsys DC and Physical Design flow using Synopsys ICC II, of my RISC-V 5 stage pipelined using 32 nm tech repo ★15 · Jul 31, 2024 · Updated last year
- Benchmarking guide for the Azure AI Infrastructure. ★40 · May 22, 2026 · Updated last week
- A Toy-Purpose TPU Simulator ★22 · Jun 7, 2024 · Updated last year
- Notes and code for Programming Massively Parallel Processors ★13 · Mar 29, 2025 · Updated last year
- An open-source hybrid Mesh–Crossbar NoC for scalable, low-latency shared-L1-memory clusters with thousands of cores. ★37 · May 20, 2026 · Updated last week
- A CUDA kernel optimization toolkit for validation, benchmarking, Nsight Compute profiling, bottleneck analysis, and iterative tuning. It … ★168 · Apr 22, 2026 · Updated last month
- A parser for PTX 6.5 ★13 · Jun 19, 2023 · Updated 2 years ago
- Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs ★14 · Apr 3, 2025 · Updated last year
- Example SystemVerilog UVM Environment ★10 · Jun 23, 2015 · Updated 10 years ago
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios. ★16 · Aug 31, 2023 · Updated 2 years ago
- Automated bottleneck detection and solution orchestration ★21 · Feb 24, 2026 · Updated 3 months ago
- Step by step implementation of a fast softmax kernel in CUDA ★68 · Jan 6, 2025 · Updated last year
- transformer tokenizers (e.g. BERT tokenizer) in C++ (WIP) ★18 · Apr 7, 2022 · Updated 4 years ago
- Fast CUDA matrix multiplication from scratch ★1,196 · Sep 2, 2025 · Updated 8 months ago
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ★256 · May 6, 2025 · Updated last year
- This serves as a repository for reproducibility of the SC21 paper "In-Depth Analyses of Unified Virtual Memory System for GPU Accelerated… ★38 · Sep 25, 2023 · Updated 2 years ago
- This repository lists some awesome public CUDA, cuda-python, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, PT… ★479 · Aug 2, 2025 · Updated 9 months ago
- Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge. ★19 · Feb 9, 2026 · Updated 3 months ago
- Example programs and tests for ivshmem module for QEMU/KVM ★20 · Sep 5, 2019 · Updated 6 years ago
- Official Repo For AAAI 2026 Accepted Paper "Rethinking the Spatio-Temporal Alignment of End-to-End 3D Perception" ★31 · Mar 25, 2026 · Updated 2 months ago
- Booksie: an open catalog of free picture storybooks for children instantly available for reading. ★19 · Feb 27, 2026 · Updated 3 months ago
- See vLLM official support: https://github.com/vllm-project/vllm-ascend ★11 · Feb 5, 2025 · Updated last year
- Lab assignments for the Agile Hardware Design course ★18 · Nov 14, 2025 · Updated 6 months ago
- A 32-bit MIPS processor which aims for conformance to the MIPS32 Release 1 ISA. ★19 · Jul 29, 2015 · Updated 10 years ago
- ★24 · May 22, 2026 · Updated last week
- ★24 · Apr 7, 2026 · Updated last month
- ★149 · Apr 4, 2026 · Updated last month
- A lightweight, production-ready C++ library for LLM tokenization, fully compatible with HuggingFace tokenizer.json. ★28 · Jan 4, 2026 · Updated 4 months ago
- DeepSeek VSCode Extension: Your Local AI Coding Companion ★25 · Feb 18, 2025 · Updated last year
- Mini CCL - A lightweight collective communication library ★32 · Jan 2, 2026 · Updated 4 months ago
- Implementation for IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024). ★25 · Feb 22, 2026 · Updated 3 months ago
- Python code (packaged in Docker container) to run the experiments in "A Greedy Algorithm for Quantizing Neural Networks" by Eric Lybrand … ★20 · Jun 20, 2021 · Updated 4 years ago
- Making Flux go brrr on GPUs. ★168 · Jan 5, 2026 · Updated 4 months ago