IntelLabs / FP8-Emulation-Toolkit

PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware.

☆110

Alternatives and similar repositories for FP8-Emulation-Toolkit

Users interested in FP8-Emulation-Toolkit are comparing it to the libraries listed below.

Qualcomm-AI-research / FP8-quantization
☆154 · Updated 2 years ago
microsoft / microxcaling
PyTorch emulation library for Microscaling (MX)-compatible data formats
☆262 · Updated last month
naver-aics / lut-gemm
☆64 · Updated last year
microsoft / SparTA
☆150 · Updated last year
Guangxuan-Xiao / torch-int
This repository contains integer operators on GPUs for PyTorch.
☆208 · Updated last year
uwsampl / SparseTIR
SparseTIR: Sparse Tensor Compiler for Deep Learning
☆137 · Updated 2 years ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, with up to 2.7x speedup compared to the PyTorch baseline.
☆113 · Updated last year
LeiWang1999 / tvm_gpu_gemm
play gemm with tvm
☆91 · Updated 2 years ago
ankan-ban / llama_cu_awq
Llama INT4 CUDA inference with AWQ
☆54 · Updated 6 months ago
Dao-AILab / fast-hadamard-transform
Fast Hadamard transform in CUDA, with a PyTorch interface
☆213 · Updated last year
usyd-fsalab / fp6_llm
Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5).
☆260 · Updated 3 weeks ago
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆158 · Updated last year
sunlex0717 / DissectingTensorCores
☆106 · Updated last year
nox-410 / tvm.tl
An extension of TVMScript to write simple and high-performance GPU kernels with Tensor Cores.
☆50 · Updated last year
Qualcomm-AI-research / transformer-quantization
☆206 · Updated 3 years ago
wimh966 / outlier_suppression
The official PyTorch implementation of the NeurIPS2022 (spotlight) paper, Outlier Suppression: Pushing the Limit of Low-bit Transformer L…
☆47 · Updated 2 years ago
INT-FlashAttention2024 / INT-FlashAttention
☆79 · Updated 6 months ago
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆94 · Updated 3 weeks ago
pku-liang / AMOS
Automatic Mapping Generation, Verification, and Exploration for ISA-based Spatial Accelerators
☆114 · Updated 2 years ago
UDC-GAC / venom
A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores
☆52 · Updated last year
aojunzz / NM-sparsity
☆236 · Updated 2 years ago
tlc-pack / TLCBench
Benchmark scripts for TVM
☆75 · Updated 3 years ago
lixiuhong / batched_gemm
☆39 · Updated 5 years ago
apuaaChen / EVT_AE
Artifacts of EVT ASPLOS'24
☆26 · Updated last year
cmu-catalyst / collage
System for automated integration of deep learning backends.
☆47 · Updated 2 years ago
lenLRX / AmpereSparseMatmul
Study of Ampere's Sparse Matmul
☆18 · Updated 4 years ago
ColfaxResearch / cutlass-kernels
☆227 · Updated last year
UDC-GAC / openCNN
A Winograd Minimal Filter Implementation in CUDA
☆25 · Updated 3 years ago
NVIDIA / online-softmax
Benchmark code for the "Online normalizer calculation for softmax" paper
☆95 · Updated 7 years ago
thu-ml / Jetfire-INT8Training
☆51 · Updated last year