ssiu / flash-attention-turing
☆64 · Updated this week
Alternatives and similar repositories for flash-attention-turing
Users interested in flash-attention-turing are comparing it to the libraries listed below.
- A simple Flash Attention v2 implementation with ROCm (RDNA3 GPU, rocWMMA), mainly used for Stable Diffusion (ComfyUI) in Windows ZLUDA en… ☆50 · Updated last year
- Fast and memory-efficient exact attention ☆208 · Updated this week
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆732 · Updated 5 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆113 · Updated this week
- NVIDIA Linux open GPU with P2P support ☆112 · Updated last month
- 🎯 An accuracy-first, highly efficient quantization toolkit for LLMs, designed to minimize quality degradation across Weight-Only Quantiza… ☆815 · Updated this week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆233 · Updated last week
- Ahead of Time (AOT) Triton Math Library ☆87 · Updated this week
- LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi… ☆971 · Updated last week
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆987 · Updated last year
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5). ☆276 · Updated 6 months ago
- Development repository for the Triton language and compiler ☆140 · Updated last week
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆154 · Updated 4 months ago
- DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including … ☆274 · Updated 5 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆216 · Updated 3 months ago
- 8-bit CUDA functions for PyTorch ☆69 · Updated 3 months ago
- AI Tensor Engine for ROCm ☆341 · Updated this week
- DFloat11 [NeurIPS '25]: Lossless Compression of LLMs and DiTs for Efficient GPU Inference ☆592 · Updated last month
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆743 · Updated 5 months ago
- OpenAI Triton backend for Intel® GPUs ☆224 · Updated this week
- Fast low-bit matmul kernels in Triton ☆423 · Updated last month
- GPTQ inference Triton kernel ☆316 · Updated 2 years ago
- PyTorch half-precision GEMM lib w/ fused optional bias + optional ReLU/GELU ☆77 · Updated last year
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆802 · Updated 10 months ago
- Run DeepSeek-R1 GGUFs on KTransformers ☆259 · Updated 10 months ago
- VPTQ, a flexible and extreme low-bit quantization algorithm ☆673 · Updated 8 months ago