INT-FlashAttention2024/INT-FlashAttention

☆91

Alternatives and similar repositories for INT-FlashAttention

Users who are interested in INT-FlashAttention are comparing it to the libraries listed below.

HandH1998 / QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
☆158 · Aug 21, 2025 · Updated 11 months ago
Adlik / smoothquantplus
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
☆23 · Mar 15, 2024 · Updated 2 years ago
HanGuo97 / flute
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
☆391 · Apr 13, 2025 · Updated last year
feifeibear / ChituAttention
Quantized Attention on GPU
☆45 · Nov 22, 2024 · Updated last year
ChenMnZ / PrefixQuant
An algorithm for weight-activation quantization (W4A4, W4A8) of LLMs, supporting both static and dynamic quantization
☆176 · Nov 26, 2025 · Updated 8 months ago
TiledTensor / TiledBench
Benchmark tests supporting the TiledCUDA library.
☆19 · Nov 19, 2024 · Updated last year
Intelligent-Computing-Lab-Panda / GPTAQ
Code implementation of GPTAQ (https://arxiv.org/abs/2504.02692)
☆93 · Jul 28, 2025 · Updated last year
tile-ai / AttentionEngine
☆52 · May 19, 2025 · Updated last year
facebookresearch / SpinQuant
Code repo for the paper "SpinQuant LLM quantization with learned rotations"
☆418 · Feb 14, 2025 · Updated last year
mit-han-lab / omniserve
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se…
☆852 · Mar 6, 2025 · Updated last year
IST-DASLab / FP-Quant
☆116 · Feb 26, 2026 · Updated 5 months ago
Bruce-Lee-LY / cuda_hgemv
Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core.
☆75 · Sep 8, 2024 · Updated last year
IST-DASLab / qutlass
QuTLASS: CUTLASS-Powered Quantized BLAS for Deep Learning
☆193 · Jul 20, 2026 · Updated last week
ai-compiler-study / triton-kernels
Triton kernels for Flux
☆23 · Jul 7, 2025 · Updated last year
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆96 · Sep 4, 2024 · Updated last year
efeslab / Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
☆344 · Jul 2, 2024 · Updated 2 years ago
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆165 · Sep 15, 2023 · Updated 2 years ago
zhuzilin / flash-attention-with-sink
☆37 · Aug 7, 2025 · Updated 11 months ago
dropbox / gemlite
Fast low-bit matmul kernels in Triton
☆477 · Jul 15, 2026 · Updated 2 weeks ago
Qualcomm-AI-research / lr-qat
☆54 · Nov 5, 2024 · Updated last year
LeiWang1999 / Stream-k.tvm
☆20 · Sep 28, 2024 · Updated last year
flashinfer-ai / cutlass-viz
☆65 · Apr 26, 2025 · Updated last year
KuangjuX / ncu-cli
Automated CUDA kernel performance diagnostics from NVIDIA Nsight Compute (NCU) CSV exports.
☆34 · Mar 18, 2026 · Updated 4 months ago
hahahaxyf / Get-Moving-with-Pynq-on-Alveo-U50
Get Moving with Pynq on Alveo U50
☆14 · Jul 13, 2020 · Updated 6 years ago
ruikangliu / FlatQuant
[ICML 2025] Official PyTorch implementation of "FlatQuant: Flatness Matters for LLM Quantization"
☆223 · Nov 25, 2025 · Updated 8 months ago
IST-DASLab / Quartet
☆127 · Mar 18, 2026 · Updated 4 months ago
facebookresearch / Ternary_Binary_Transformer
ACL 2023
☆39 · Jun 6, 2023 · Updated 3 years ago
IST-DASLab / marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens.
☆1,111 · Sep 4, 2024 · Updated last year
ByteDance-Seed / decoupleQ
A quantization algorithm for LLM
☆151 · Jun 21, 2024 · Updated 2 years ago
WaveSpeedAI / QuantumAttention
[WIP] Better (FP8) attention for Hopper
☆33 · Feb 24, 2025 · Updated last year
microsoft / AttentionEngine
☆123 · May 19, 2025 · Updated last year
GindaChen / FlexFlashAttention3
FlexAttention w/ FlashAttention3 Support
☆27 · Oct 5, 2024 · Updated last year
thu-ml / Jetfire-INT8Training
☆63 · Jul 21, 2024 · Updated 2 years ago
xlite-dev / ffpa-attn
🤖FFPA: Extends FA-2/3 via Split-D for large headdims, 1.5x~6×↑🎉 vs SDPA, up to 513~535 TFLOPS🎉 on NVIDIA H200.
☆318 · Updated this week
jy-yuan / KIVI
[ICML 2024] KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
☆423 · Nov 20, 2025 · Updated 8 months ago
toyaix / triton-runner
Multi-Level Triton Runner supporting Python, IR, PTX, AMDGCN, cubin and hasco.
☆98 · May 8, 2026 · Updated 2 months ago
vllm-project / tml-fa4
FA4-based Relative Attention Kernel developed by TML and Colfax
☆17 · Jul 17, 2026 · Updated last week
SqueezeAILab / KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆431 · Aug 13, 2024 · Updated last year
HydraQYH / hp_rms_norm
High-performance RMSNorm implementation using SM core storage (registers and shared memory)
☆30 · Jan 22, 2026 · Updated 6 months ago