High Performance FP8 GEMM Kernels for SM89 and later GPUs.
☆20Jan 24, 2025Updated last year
Alternatives and similar repositories for gemm-fp8
Users who are interested in gemm-fp8 are comparing it to the libraries listed below:
- Persistent dense gemm for Hopper in `CuTeDSL`☆15Aug 9, 2025Updated 6 months ago
- High Performance Int8 GEMM Kernels for SM80 and later GPUs.☆20Mar 11, 2025Updated 11 months ago
- Official repository for the paper Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regressi…☆23Oct 1, 2025Updated 5 months ago
- ☆11Jan 10, 2025Updated last year
- A simple and efficient memory pool implemented in C++11.☆10Jun 2, 2022Updated 3 years ago
- [ICML 2023] This project is the official implementation of our accepted ICML 2023 paper BiBench: Benchmarking and Analyzing Network Binar…☆56Mar 4, 2024Updated last year
- [NeurIPS 2023] ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer☆30Dec 6, 2023Updated 2 years ago
- Model Quantization Benchmark☆18Sep 30, 2025Updated 5 months ago
- The official implementation of the ICML 2023 paper OFQ-ViT☆39Oct 3, 2023Updated 2 years ago
- A Triton JIT runtime and ffi provider in C++☆31Updated this week
- A repository for Binary General Matrix Multiply (BGEMM) implemented with a customized CUDA kernel. Thanks to FP6-LLM for the wheels!☆18Aug 30, 2024Updated last year
- This project is the official implementation of our accepted IEEE TPAMI paper Diverse Sample Generation: Pushing the Limit of Data-free Qu…☆15Feb 26, 2023Updated 3 years ago
- The official implementation of the DAC 2024 paper GQA-LUT☆20Dec 20, 2024Updated last year
- SLiM: One-shot Quantized Sparse Plus Low-rank Approximation of LLMs (ICML 2025)☆32Nov 28, 2025Updated 3 months ago
- ☆18Feb 28, 2023Updated 3 years ago
- FP8 flash attention implemented on the Ada architecture using the cutlass repository☆79Aug 12, 2024Updated last year
- [TMLR] Official PyTorch implementation of paper "Quantization Variation: A New Perspective on Training Transformers with Low-Bit Precisio…☆48Sep 27, 2024Updated last year
- 📚 A curated list of awesome matrix-matrix multiplication (A * B = C) frameworks, libraries and software☆61Feb 23, 2025Updated last year
- PyTorch implementation of Near-Lossless Post-Training Quantization of Deep Neural Networks via a Piecewise Linear Approximation☆23Feb 17, 2020Updated 6 years ago
- A practical way of learning Swizzle☆37Feb 3, 2025Updated last year
- QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning☆168Nov 11, 2025Updated 3 months ago
- A framework to compare low-bit integer and floating-point formats☆66Feb 6, 2026Updated 3 weeks ago
- MSLK (Meta Superintelligence Labs Kernels) is a collection of PyTorch GPU operator libraries that are designed and optimized for GenAI tr…☆52Updated this week
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs☆60Mar 25, 2025Updated 11 months ago
- BitSplit Post-training Quantization☆50Dec 20, 2021Updated 4 years ago
- ☆28Dec 2, 2024Updated last year
- [ICML 2025] Official PyTorch implementation of "FlatQuant: Flatness Matters for LLM Quantization"☆211Nov 25, 2025Updated 3 months ago
- Code implementation of GPTAQ (https://arxiv.org/abs/2504.02692)☆81Jul 28, 2025Updated 7 months ago
- [ICASSP'20] DNN-Chip Predictor: An Analytical Performance Predictor for DNN Accelerators with Various Dataflows and Hardware Architecture…☆25Oct 1, 2022Updated 3 years ago
- Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.☆72Sep 8, 2024Updated last year
- A collection of specialized agent skills for AI infrastructure development, enabling Claude Code to write, optimize, and debug high-perfo…☆61Feb 2, 2026Updated 3 weeks ago
- ☆34Feb 3, 2025Updated last year
- [TMLR] Official PyTorch implementation of paper "Efficient Quantization-aware Training with Adaptive Coreset Selection"☆37Aug 20, 2024Updated last year
- From Minimal GEMM to Everything☆163Feb 10, 2026Updated 2 weeks ago
- A WPF serial port assistant developed in C# on .NET 5☆13Mar 21, 2022Updated 3 years ago
- In our implementation of Qwen-Image-Edit, we employ block causal attention to improve inference speed.☆37Feb 16, 2026Updated last week
- Official Repo For AAAI 2026 Accepted Paper "Rethinking the Spatio-Temporal Alignment of End-to-End 3D Perception"☆28Jan 13, 2026Updated last month
- [ICML 2022] "DepthShrinker: A New Compression Paradigm Towards Boosting Real-Hardware Efficiency of Compact Neural Networks", by Yonggan …☆35Jul 12, 2022Updated 3 years ago
- MNSIM_Python_v1.0. The former circuits-level version link: https://github.com/Zhu-Zhenhua/MNSIM_V1.1☆35Jan 5, 2024Updated 2 years ago