IST-DASLab / gemm-int8
High Performance Int8 GEMM Kernels for SM80 and later GPUs.
☆12 · Updated 5 months ago
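For context on what such a kernel does: int8 GEMM on SM80-class GPUs multiplies 8-bit integer operands on the tensor cores and accumulates into 32-bit integers. The sketch below is a minimal single-tile illustration using CUDA's public WMMA API; it is not the repository's actual code, and a production kernel would typically add shared-memory tiling, asynchronous copies, software pipelining, and a fused dequantization epilogue on top of this building block. The kernel name and launch shape are our own illustrative choices.

```cuda
// Minimal illustrative int8 tensor-core GEMM tile (NOT the repository's kernel).
// One warp computes a single 16x16 int32 tile of C from int8 tiles of A and B.
// Compile with: nvcc -arch=sm_80 int8_wmma.cu
#include <mma.h>

using namespace nvcuda;

__global__ void int8_wmma_tile(const signed char* A,  // 16x16, row-major
                               const signed char* B,  // 16x16, column-major
                               int* C,                // 16x16, row-major
                               int lda, int ldb, int ldc) {
    // int8 operand fragments with int32 accumulators (m16n16k16 shape).
    wmma::fragment<wmma::matrix_a, 16, 16, 16, signed char, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, signed char, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, int> c_frag;

    wmma::fill_fragment(c_frag, 0);                 // C tile starts at zero
    wmma::load_matrix_sync(a_frag, A, lda);         // warp-collective loads
    wmma::load_matrix_sync(b_frag, B, ldb);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // int8 x int8 -> int32
    wmma::store_matrix_sync(C, c_frag, ldc, wmma::mem_row_major);
}
```

A single tile is launched with one warp, e.g. `int8_wmma_tile<<<1, 32>>>(dA, dB, dC, 16, 16, 16);`. Real high-performance kernels iterate this primitive over the K dimension and tile it across thread blocks.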
Alternatives and similar repositories for gemm-int8
Users interested in gemm-int8 are comparing it to the repositories listed below.
- High Performance FP8 GEMM Kernels for SM89 and later GPUs. ☆19 · Updated 7 months ago
- Official implementation of the EMNLP23 paper: Outlier Suppression+: Accurate quantization of large language models by equivalent and opti… ☆46 · Updated last year
- A collection of research papers on low-precision training methods ☆33 · Updated 3 months ago
- [ICML 2023] This project is the official implementation of our accepted ICML 2023 paper BiBench: Benchmarking and Analyzing Network Binar… ☆56 · Updated last year
- ☆158 · Updated 2 years ago
- ☆22 · Updated 10 months ago
- ☆32 · Updated last year
- [ECCV24] MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization ☆44 · Updated 9 months ago
- ☆51 · Updated last year
- [ICLR 2024] This is the official PyTorch implementation of "QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Mod… ☆39 · Updated last year
- ☆63 · Updated 4 months ago
- ⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak performance.⚡️ ☆107 · Updated 3 months ago
- ☆81 · Updated 7 months ago
- The official PyTorch implementation of the NeurIPS2022 (spotlight) paper, Outlier Suppression: Pushing the Limit of Low-bit Transformer L… ☆48 · Updated 2 years ago
- Code implementation of GPTAQ (https://arxiv.org/abs/2504.02692) ☆59 · Updated last month
- Tutorials on extending and importing TVM with a CMake include dependency. ☆14 · Updated 10 months ago
- LLM Inference with Microscaling Format ☆29 · Updated 9 months ago
- BitPack is a practical tool to efficiently save ultra-low precision/mixed-precision quantized models. ☆57 · Updated 2 years ago
- [TMLR] Official PyTorch implementation of the paper "Efficient Quantization-aware Training with Adaptive Coreset Selection" ☆34 · Updated last year
- A collection of research papers on efficient training of DNNs ☆69 · Updated 3 years ago
- DeeperGEMM: crazy optimized version ☆70 · Updated 3 months ago
- The official implementation of the ICML 2023 paper OFQ-ViT ☆33 · Updated last year
- ☆11 · Updated 7 months ago
- PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware. ☆110 · Updated 8 months ago
- [COLM 2025] Official PyTorch implementation of "Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models" ☆47 · Updated last month
- Code for ICML 2021 submission ☆34 · Updated 4 years ago
- SpInfer: Leveraging Low-Level Sparsity for Efficient Large Language Model Inference on GPUs ☆57 · Updated 5 months ago
- Official implementation for the ECCV 2022 paper LIMPQ, "Mixed-Precision Neural Network Quantization via Learned Layer-wise Importance" ☆58 · Updated 2 years ago
- torch_quantizer is an out-of-the-box quantization tool for PyTorch models on the CUDA backend, specially optimized for diffusion models. ☆23 · Updated last year
- LLaMA INT4 CUDA inference with AWQ ☆54 · Updated 7 months ago