ssiu / flash-attention-turing
☆62 · Updated 2 weeks ago
Alternatives and similar repositories for flash-attention-turing
Users interested in flash-attention-turing are comparing it to the libraries listed below.
- Fast and memory-efficient exact attention ☆205 · Updated this week
- A simple Flash Attention v2 implementation with ROCm (RDNA3 GPU, roc wmma), mainly used for Stable Diffusion (ComfyUI) in Windows ZLUDA en… ☆50 · Updated last year
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆153 · Updated 4 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆212 · Updated 2 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆113 · Updated this week
- Advanced quantization toolkit for LLMs and VLMs. Support for WOQ, MXFP4, NVFP4, GGUF, Adaptive Schemes and seamless integration with Tra… ☆785 · Updated this week
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5). ☆277 · Updated 5 months ago
- ☆78 · Updated last year
- LLM model quantization (compression) toolkit with hw acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi… ☆943 · Updated this week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆225 · Updated last week
- DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including … ☆271 · Updated 4 months ago
- ☆207 · Updated 7 months ago
- DFloat11 [NeurIPS '25]: Lossless Compression of LLMs and DiTs for Efficient GPU Inference ☆576 · Updated last month
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆962 · Updated last year
- An easy-to-use package for implementing SmoothQuant for LLMs ☆110 · Updated 8 months ago
- ☆130 · Updated last year
- High-speed GEMV kernels with up to 2.7x speedup over the PyTorch baseline. ☆123 · Updated last year
- ☆159 · Updated 6 months ago
- ☆96 · Updated 9 months ago
- ☆434 · Updated 3 months ago
- ☆59 · Updated 5 months ago
- Ahead of Time (AOT) Triton Math Library ☆84 · Updated 2 weeks ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆321 · Updated last month
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆722 · Updated 4 months ago
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ and easy export to ONNX/ONNX Runtime ☆184 · Updated 8 months ago
- Fast and memory-efficient exact attention ☆105 · Updated last week
- Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang ☆61 · Updated last year
- High-speed and easy-to-use LLM serving framework for local deployment ☆139 · Updated 4 months ago
- Ascend TileLang adapter ☆167 · Updated this week
- OpenAI Triton backend for Intel® GPUs ☆222 · Updated this week