ssiu / flash-attention-turing
☆51 · Updated last week
Alternatives and similar repositories for flash-attention-turing
Users interested in flash-attention-turing are comparing it to the libraries listed below.
- Fast and memory-efficient exact attention ☆183 · Updated 2 weeks ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆92 · Updated last week
- ☆427 · Updated 2 weeks ago
- A simple Flash Attention v2 implementation with ROCm (RDNA3 GPU, roc wmma), mainly used for Stable Diffusion (ComfyUI) in Windows ZLUDA en… ☆47 · Updated last year
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆893 · Updated 11 months ago
- DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including … ☆263 · Updated 3 weeks ago
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆614 · Updated 2 weeks ago
- Production-ready LLM model compression/quantization toolkit with hardware-accelerated inference support for both CPU/GPU via HF, vLLM, and SGLa… ☆758 · Updated this week
- Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU. Seamlessly integrated with Torchao, Tra… ☆607 · Updated this week
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆138 · Updated last week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆153 · Updated this week
- Run DeepSeek-R1 GGUFs on KTransformers ☆250 · Updated 5 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). ☆261 · Updated last month
- Development repository for the Triton language and compiler ☆127 · Updated this week
- ☆196 · Updated 3 months ago
- DFloat11: Lossless LLM Compression for Efficient GPU Inference ☆524 · Updated last week
- A throughput-oriented high-performance serving framework for LLMs ☆881 · Updated 3 weeks ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆746 · Updated 5 months ago
- A fast communication-overlapping library for tensor/expert parallelism on GPUs. ☆1,091 · Updated this week
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆665 · Updated 3 weeks ago
- 8-bit CUDA functions for PyTorch ☆61 · Updated 2 weeks ago
- High-speed and easy-to-use LLM serving framework for local deployment ☆117 · Updated 3 weeks ago
- Low-bit LLM inference on CPU/NPU with lookup table ☆845 · Updated 2 months ago
- ☆149 · Updated 2 months ago
- AI Tensor Engine for ROCm ☆260 · Updated this week
- An innovative library for efficient LLM inference via low-bit quantization ☆348 · Updated last year
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆175 · Updated this week
- llama.cpp fork with additional SOTA quants and improved performance ☆1,111 · Updated this week
- 🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA. ☆211 · Updated 3 weeks ago
- Perplexity GPU Kernels ☆449 · Updated 3 weeks ago