ssiu / flash-attention-turing
☆55 · Updated last week
Alternatives and similar repositories for flash-attention-turing
Users who are interested in flash-attention-turing are comparing it to the libraries listed below.
- a simple Flash Attention v2 implementation with ROCm (RDNA3 GPU, roc wmma), mainly used for stable diffusion (ComfyUI) in Windows ZLUDA en… ☆48 · Updated last year
- Fast and memory-efficient exact attention ☆193 · Updated last week
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆916 · Updated last year
- LLM model quantization (compression) toolkit with hardware acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU and Intel/AMD/Apple CPU vi… ☆842 · Updated this week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆180 · Updated this week
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆671 · Updated 2 months ago
- ☆205 · Updated 5 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). ☆265 · Updated 3 months ago
- DFloat11: Lossless LLM Compression for Efficient GPU Inference ☆550 · Updated 2 months ago
- ☆167 · Updated this week
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆144 · Updated 2 months ago
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆698 · Updated 2 months ago
- ☆43 · Updated 3 weeks ago
- DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including … ☆266 · Updated 2 months ago
- ☆152 · Updated 4 months ago
- ☆430 · Updated last month
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆108 · Updated last week
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆766 · Updated 7 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆199 · Updated 2 weeks ago
- Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU. ☆668 · Updated this week
- Development repository for the Triton language and compiler ☆137 · Updated this week
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆306 · Updated 5 months ago
- Run DeepSeek-R1 GGUFs on KTransformers ☆253 · Updated 7 months ago
- ☆91 · Updated last week
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving. ☆439 · Updated this week
- NVIDIA Linux open GPU with P2P support ☆66 · Updated 2 weeks ago
- AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (N… ☆12 · Updated last year
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆248 · Updated last week
- Ahead of Time (AOT) Triton Math Library ☆79 · Updated last week
- Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang ☆58 · Updated 11 months ago
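Several of the repositories above (flash-attention-turing, the ROCm port, and the upstream "fast and memory-efficient exact attention" project) are optimized kernels for the same underlying operation. As a point of reference, here is a minimal NumPy sketch of the exact scaled-dot-product attention those kernels compute; the shapes and function names are illustrative only and are not taken from any listed repo.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    m = x.max(axis=axis, keepdims=True)
    e = np.exp(x - m)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # q, k, v: (seq_len, head_dim) arrays; returns (seq_len, head_dim).
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q @ k.T) * scale           # (seq_len, seq_len) attention logits
    return softmax(scores, axis=-1) @ v  # probability-weighted sum of values

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((4, 8)) for _ in range(3))
out = attention(q, k, v)
print(out.shape)  # (4, 8)
```

Fused kernels like FlashAttention avoid materializing the full (seq_len, seq_len) `scores` matrix in GPU memory by computing the softmax blockwise, which is where the memory savings over this naive reference come from.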