jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆160, updated Sep 15, 2023
Alternatives and similar repositories for INT8-Flash-Attention-FMHA-Quantization
Users interested in INT8-Flash-Attention-FMHA-Quantization are comparing it to the libraries listed below.
- GPTQ inference Triton kernel (☆321, updated May 18, 2023)
- This repository contains integer operators on GPUs for PyTorch. (☆237, updated Sep 29, 2023)
- The only known (as of 2022) open-source, easy-to-understand basic algorithm implementations in TD-CEM. (Please star and fork this project if…) (☆15, updated Mar 1, 2022)
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer (☆96, updated Sep 13, 2025)
- Forward and backward Attention DNN operators implemented with LibTorch, cuDNN, and Eigen. (☆30, updated Jun 6, 2023)
- Reorder-based post-training quantization for large language models (☆198, updated May 17, 2023)
- (☆85, updated Jan 23, 2025)
- Standalone Flash Attention v2 kernel without libtorch dependency (☆114, updated Sep 10, 2024)
- Overlapping Schwarz Domain Decomposition Finite Element Algorithm in both Matlab and serial/parallel C++ (☆18, updated Mar 1, 2022)
- Quantized Attention on GPU (☆44, updated Nov 22, 2024)
- GPTQ inference TVM kernel (☆40, updated Apr 25, 2024)
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models (☆1,607, updated Jul 12, 2024)
- Prototype routines for GPU quantization written using PyTorch.
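
The common theme across these projects is INT8 post-training quantization of activations and weights inside attention and GEMM kernels. As a rough orientation only, the sketch below is not taken from any listed repository; it is a minimal PyTorch illustration of the symmetric per-tensor INT8 quantize/dequantize round trip that such kernels implement in fused CUDA/Triton code.

```python
# Minimal sketch (assumption, not code from any repository above):
# symmetric per-tensor INT8 quantization and dequantization.
import torch

def quantize_int8(x: torch.Tensor):
    # Map the largest magnitude in the tensor to 127 (symmetric scale).
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    # Recover an approximation of the original float tensor.
    return q.to(torch.float32) * scale

x = torch.randn(4, 8)
q, scale = quantize_int8(x)
x_hat = dequantize_int8(q, scale)
print((x - x_hat).abs().max())  # quantization error, at most about scale / 2
```

The listed kernels fuse this quantize/matmul/dequantize pattern on the GPU rather than materializing intermediate float tensors, which is where the speed and memory savings come from.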