ssiu / flash-attention-turing
☆38 · Updated last week
Alternatives and similar repositories for flash-attention-turing
Users interested in flash-attention-turing are comparing it to the libraries listed below.
- ☆137 · Updated this week
- A simple Flash Attention v2 implementation with ROCm (RDNA3 GPUs, rocWMMA), mainly used for Stable Diffusion (ComfyUI) in Windows ZLUDA en… ☆43 · Updated 10 months ago
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆128 · Updated 2 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆273 · Updated last month
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆84 · Updated this week
- Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang ☆53 · Updated 7 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆129 · Updated this week
- Fast and memory-efficient exact attention ☆174 · Updated this week
- Run DeepSeek-R1 GGUFs on KTransformers ☆236 · Updated 3 months ago
- A general 2–8 bit quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and easy export to ONNX/ONNX Runtime ☆172 · Updated 2 months ago
- ☆194 · Updated last month
- DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including … ☆256 · Updated 3 weeks ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆151 · Updated this week
- An innovative library for efficient LLM inference via low-bit quantization ☆349 · Updated 9 months ago
- A quantization algorithm for LLMs ☆141 · Updated last year
- Automatically quantize GGUF models ☆184 · Updated this week
- Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU. Seamlessly integrated with Torchao, Tra… ☆525 · Updated this week
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆509 · Updated 3 months ago
- PyTorch half-precision GEMM library with fused optional bias + optional ReLU/GELU ☆70 · Updated 6 months ago
- 8-bit CUDA functions for PyTorch ☆53 · Updated last week
- ☆70 · Updated 6 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5) ☆252 · Updated 7 months ago
- An easy-to-use package for implementing SmoothQuant for LLMs ☆102 · Updated 2 months ago
- ☆97 · Updated 9 months ago
- https://wavespeed.ai/ Context-parallel attention that accelerates DiT model inference with dynamic caching ☆304 · Updated last month
- Fast low-bit matmul kernels in Triton ☆323 · Updated last week
- GPTQ inference Triton kernel ☆302 · Updated 2 years ago
- llama.cpp fork with additional SOTA quants and improved performance ☆608 · Updated this week
- ☆87 · Updated 3 months ago
- vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs ☆87 · Updated this week
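
For readers trying out flash-attention-turing itself: upstream FlashAttention-2 requires Ampere (SM80) or newer GPUs, and this fork presumably exists to bring the same kernels to Turing (SM75) cards. A minimal usage sketch follows, assuming the fork mirrors the upstream Dao-AILab flash-attn Python API (`flash_attn_func`, its signature, and the tensor layout below are taken from upstream and are not verified against this fork):

```python
import torch
from flash_attn import flash_attn_func  # assumption: the fork keeps the upstream module name

# flash-attn expects (batch, seqlen, nheads, headdim) tensors in fp16/bf16 on the GPU.
batch, seqlen, nheads, headdim = 2, 1024, 8, 64
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Causal self-attention; the softmax scale defaults to 1/sqrt(headdim) when unset.
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)
print(out.shape)  # torch.Size([2, 1024, 8, 64])
```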