jundaf2 / INT8-Flash-Attention-FMHA-QuantizationLinks

☆158

Alternatives and similar repositories for INT8-Flash-Attention-FMHA-Quantization

Users that are interested in INT8-Flash-Attention-FMHA-Quantization are comparing it to the libraries listed below

Sorting:

wangsiping97 / FastGEMV
High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.
☆116Updated last year
stanford-futuredata / stk
☆112Updated last year
meta-pytorch / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆223Updated last year
Guangxuan-Xiao / torch-int
This repository contains integer operators on GPUs for PyTorch.
☆220Updated 2 years ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆125Updated 4 months ago
usyd-fsalab / fp6_llm
An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).
☆265Updated 3 months ago
haochengxi / Train_Transformers_with_INT4
☆156Updated 2 years ago
FlagOpen / FlagAttention
A collection of memory efficient attention operators implemented in the Triton language.
☆282Updated last year
hahnyuan / RPTQ4LLM
Reorder-based post-training quantization for large language model
☆194Updated 2 years ago
Dao-AILab / fast-hadamard-transform
Fast Hadamard transform in CUDA, with a PyTorch interface
☆248Updated last week
INT-FlashAttention2024 / INT-FlashAttention
☆82Updated 9 months ago
meta-pytorch / applied-ai
Applied AI experiments and examples for PyTorch
☆299Updated 2 months ago
meta-pytorch / tritonbench
Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.
☆264Updated this week
NVIDIA / online-softmax
Benchmark code for the "Online normalizer calculation for softmax" paper
☆101Updated 7 years ago
cli99 / flops-profiler
pytorch-profiler
☆51Updated 2 years ago
AniZpZ / AutoSmoothQuant
An easy-to-use package for implementing SmoothQuant for LLMs
☆107Updated 6 months ago
Qualcomm-AI-research / FP8-quantization
☆163Updated 2 years ago
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆320Updated last year
IntelLabs / FP8-Emulation-Toolkit
PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware.
☆111Updated 10 months ago
SqueezeBits / QUICK
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
☆118Updated last year
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆246Updated 3 weeks ago
fpgaminer / GPTQ-triton
GPTQ inference Triton kernel
☆311Updated 2 years ago
mobiusml / gemlite
Fast low-bit matmul kernels in Triton
☆381Updated 3 weeks ago
ankan-ban / llama_cu_awq
llama INT4 cuda inference with AWQ
☆55Updated 9 months ago
ColfaxResearch / cutlass-kernels
☆241Updated last year
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆83Updated last year
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆78Updated last year
nbasyl / LLM-FP4
The official implementation of the EMNLP 2023 paper LLM-FP4
☆217Updated last year
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆157Updated 6 months ago
HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆144Updated 2 months ago