ssiu / flash-attention-turing
☆54 · Updated this week
Alternatives and similar repositories for flash-attention-turing
Users interested in flash-attention-turing are comparing it to the libraries listed below.
- Fast and memory-efficient exact attention ☆189 · Updated this week
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆900 · Updated last year
- Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA, and HPU. ☆638 · Updated this week
- LLM quantization (compression) toolkit with hardware acceleration support for Nvidia CUDA, AMD ROCm, Intel XPU, and Intel/AMD/Apple CPU vi… ☆795 · Updated this week
- DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including … ☆264 · Updated last month
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆633 · Updated last month
- A simple FlashAttention-2 implementation with ROCm (RDNA3 GPU, rocWMMA), mainly used for Stable Diffusion (ComfyUI) in Windows ZLUDA en… ☆48 · Updated last year
- ☆428 · Updated last week
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5). ☆265 · Updated 2 months ago
- ☆128 · Updated 9 months ago
- ☆199 · Updated 4 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆162 · Updated this week
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆139 · Updated last month
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆102 · Updated this week
- Run DeepSeek-R1 GGUFs on KTransformers ☆251 · Updated 6 months ago
- A powerful toolkit for compressing large models including LLM, VLM, and video generation models ☆568 · Updated last month
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆194 · Updated last week
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆756 · Updated 6 months ago
- Low-bit LLM inference on CPU/NPU with lookup table ☆860 · Updated 3 months ago
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆676 · Updated last month
- A throughput-oriented high-performance serving framework for LLMs ☆891 · Updated last week
- ☆56 · Updated 2 months ago
- ☆166 · Updated this week
- A high-performance inference system for large language models, designed for production environments. ☆466 · Updated last week
- Inferflow is an efficient and highly configurable inference engine for large language models (LLMs). ☆248 · Updated last year
- 8-bit CUDA functions for PyTorch ☆62 · Updated this week
- A general 2-8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, with easy export to ONNX/ONNX Runtime ☆177 · Updated 5 months ago
- 🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA. ☆218 · Updated last month
- A low-latency & high-throughput serving engine for LLMs ☆418 · Updated 3 months ago
- DFloat11: Lossless LLM Compression for Efficient GPU Inference ☆541 · Updated last month