IntelLabs / FP8-Emulation-Toolkit
PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware.
☆103 Updated last month
Alternatives and similar repositories for FP8-Emulation-Toolkit:
Users who are interested in FP8-Emulation-Toolkit are comparing it to the libraries listed below:
- ☆131 Updated last year
- ☆47 Updated 9 months ago
- ☆134 Updated 5 months ago
- PyTorch emulation library for Microscaling (MX)-compatible data formats ☆187 Updated 3 months ago
- This repository contains integer operators on GPUs for PyTorch. ☆189 Updated last year
- SparseTIR: Sparse Tensor Compiler for Deep Learning ☆133 Updated last year
- ☆178 Updated 6 months ago
- High-speed GEMV kernels, with up to 2.7x speedup over the PyTorch baseline. ☆93 Updated 6 months ago
- play gemm with tvm ☆85 Updated last year
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆87 Updated 10 months ago
- llama INT4 cuda inference with AWQ ☆49 Updated 6 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5). ☆230 Updated 2 months ago
- DietCode Code Release ☆61 Updated 2 years ago
- ☆157 Updated last year
- TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles. ☆174 Updated 2 months ago
- ☆66 Updated 3 weeks ago
- Standalone Flash Attention v2 kernel without libtorch dependency ☆99 Updated 4 months ago
- An extension of TVMScript for writing simple, high-performance GPU kernels with Tensor Cores. ☆51 Updated 5 months ago
- Automatic Mapping Generation, Verification, and Exploration for ISA-based Spatial Accelerators ☆106 Updated 2 years ago
- ☆85 Updated last year
- ☆81 Updated 8 months ago
- Official PyTorch implementation of FlatQuant: Flatness Matters for LLM Quantization ☆90 Updated 2 months ago
- System for automated integration of deep learning backends. ☆48 Updated 2 years ago
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆132 Updated 7 months ago
- PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections ☆117 Updated 2 years ago
- Code Repository of Evaluating Quantized Large Language Models ☆112 Updated 4 months ago
- ☆197 Updated 3 years ago
- ☆41 Updated 2 years ago
- ☆64 Updated 2 months ago
- CUDA Matrix Multiplication Optimization ☆152 Updated 5 months ago