ssiu / flash-attention-turing
☆22 · Updated this week
Alternatives and similar repositories for flash-attention-turing:
Users interested in flash-attention-turing are comparing it to the libraries listed below.
- LM inference server implementation based on *.cpp. ☆173 · Updated this week
- Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang ☆49 · Updated 5 months ago
- EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆263 · Updated 6 months ago
- ☆185 · Updated 6 months ago
- Production-ready LLM model compression/quantization toolkit with hw-accelerated inference support for both CPU/GPU via HF, vLLM, and SGLa… ☆494 · Updated this week
- DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including … ☆243 · Updated last week
- ☆130 · Updated last month
- The homepage of OneBit model quantization framework. ☆175 · Updated 2 months ago
- Inferflow is an efficient and highly configurable inference engine for large language models (LLMs). ☆242 · Updated last year
- ☆83 · Updated last month
- Fast and memory-efficient exact attention ☆171 · Updated this week
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆446 · Updated last month
- vLLM Router ☆26 · Updated last year
- ☆43 · Updated last week
- 📖 A curated list of Awesome Diffusion Inference Papers with codes: Sampling, Caching, Multi-GPUs, etc. 🎉🎉 ☆212 · Updated last month
- Advanced Quantization Algorithm for LLMs/VLMs. ☆438 · Updated this week
- ☆119 · Updated last week
- A general 2-8 bits quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and export to onnx/onnx-runtime easily. ☆168 · Updated 3 weeks ago
- VPTQ, A Flexible and Extreme low-bit quantization algorithm ☆630 · Updated this week
- PyTorch library for cost-effective, fast and easy serving of MoE models. ☆171 · Updated last week
- xllamacpp - a Python wrapper of llama.cpp ☆35 · Updated 2 weeks ago
- Pure C++ implementation of several models for real-time chatting on your computer (CPU & GPU) ☆578 · Updated this week
- A Suite for Parallel Inference of Diffusion Transformers (DiTs) on multi-GPU Clusters ☆45 · Updated 9 months ago
- Run DeepSeek-R1 GGUFs on KTransformers ☆224 · Updated last month
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆809 · Updated 7 months ago
- LLM inference in C/C++ ☆71 · Updated this week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆102 · Updated this week
- Compare different hardware platforms via the Roofline Model for LLM inference tasks. ☆98 · Updated last year
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of … ☆163 · Updated 9 months ago