IST-DASLab / gemm-int8Links

High Performance Int8 GEMM Kernels for SM80 and later GPUs.

☆18

Alternatives and similar repositories for gemm-int8

Users that are interested in gemm-int8 are comparing it to the libraries listed below

Sorting:

IST-DASLab / gemm-fp8
High Performance FP8 GEMM Kernels for SM89 and later GPUs.
☆20Updated 11 months ago
INT-FlashAttention2024 / INT-FlashAttention
☆85Updated 11 months ago
Qualcomm-AI-research / FP8-quantization
☆168Updated 2 years ago
ruikangliu / FlatQuant
[ICML 2025] Official PyTorch implementation of "FlatQuant: Flatness Matters for LLM Quantization"
☆204Updated last month
thu-nics / qllm-eval
Code Repository of Evaluating Quantized Large Language Models
☆136Updated last year
mit-han-lab / fouroversix
Code for the paper “Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling”
☆104Updated this week
ChenMnZ / PrefixQuant
An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization
☆170Updated last month
thu-ml / Jetfire-INT8Training
☆60Updated last year
Intelligent-Computing-Lab-Panda / GPTAQ
Code implementation of GPTAQ (https://arxiv.org/abs/2504.02692)
☆79Updated 5 months ago
Dao-AILab / fast-hadamard-transform
Fast Hadamard transform in CUDA, with a PyTorch interface
☆271Updated 2 months ago
xvyaward / owq
Code for the AAAI 2024 Oral paper "OWQ: Outlier-Aware Weight Quantization for Efficient Fine-Tuning and Inference of Large Language Model…
☆68Updated last year
Guangxuan-Xiao / torch-int
This repository contains integer operators on GPUs for PyTorch.
☆236Updated 2 years ago
Qualcomm-AI-research / gptvq
☆40Updated last year
naver-aics / lut-gemm
☆83Updated last year
Ther-nullptr / circult-eda-mlsys-tinyml-arxiv-daily
🎓Automatically Update circult-eda-mlsys-tinyml Papers Daily using Github Actions (Update Every 8th hours)
☆10Updated last week
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆96Updated 3 months ago
nbasyl / LLM-FP4
The official implementation of the EMNLP 2023 paper LLM-FP4
☆219Updated 2 years ago
ModelTC / Outlier_Suppression_Plus
Official implementation of the EMNLP23 paper: Outlier Suppression+: Accurate quantization of large language models by equivalent and opti…
☆50Updated 2 years ago
bytedance / AffineQuant
Official implementation of the ICLR 2024 paper AffineQuant
☆28Updated last year
weishengying / cutlass_flash_atten_fp8
使用 cutlass 仓库在 ada 架构上实现 fp8 的 flash attention
☆78Updated last year
BrotherHappy / OSTQuant
[ICLR2025]: OSTQuant: Refining Large Language Model Quantization with Orthogonal and Scaling Transformations for Better Distribution Fitt…
☆87Updated 9 months ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆142Updated 8 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆139Updated 7 months ago
ankan-ban / llama_cu_awq
llama INT4 cuda inference with AWQ
☆55Updated 11 months ago
HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆153Updated 4 months ago
dsl-learn / triton-tutorial
Getting Started with Triton: A Tutorial for Python Beginners
☆29Updated 2 months ago
zhangsichengsjtu / AFPQ
AFPQ code implementation
☆23Updated 2 years ago
facebookresearch / ParetoQ
This repository contains the training code of ParetoQ introduced in our work "ParetoQ Scaling Laws in Extremely Low-bit LLM Quantization"
☆116Updated 2 months ago
ademeure / DeeperGEMM
DeeperGEMM: crazy optimized version
☆74Updated 8 months ago
thu-nics / ViDiT-Q
[ICLR'25] ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation
☆145Updated 9 months ago