ssiu / flash-attention-turing
☆49 · Updated this week
Alternatives and similar repositories for flash-attention-turing
Users interested in flash-attention-turing are comparing it to the libraries listed below.
- Production ready LLM model compression/quantization toolkit with hw accelerated inference support for both cpu/gpu via HF, vLLM, and SGLa… ☆727 · Updated this week
- Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU. Seamlessly integrated with Torchao, Tra… ☆574 · Updated this week
- a simple Flash Attention v2 implementation with ROCM (RDNA3 GPU, roc wmma), mainly used for stable diffusion(ComfyUI) in Windows ZLUDA en… ☆45 · Updated 11 months ago
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batchsizes of 16-32 tokens. ☆875 · Updated 11 months ago
- Fast and memory-efficient exact attention ☆180 · Updated this week
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆578 · Updated 4 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆142 · Updated this week
- ☆195 · Updated 3 months ago
- ☆428 · Updated 3 weeks ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆88 · Updated this week
- BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment. ☆656 · Updated this week
- An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). ☆260 · Updated 3 weeks ago
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆138 · Updated 4 months ago
- DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including … ☆264 · Updated this week
- run DeepSeek-R1 GGUFs on KTransformers ☆249 · Updated 5 months ago
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆731 · Updated 5 months ago
- [EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a V… ☆529 · Updated last week
- Materials for learning SGLang ☆522 · Updated 3 weeks ago
- Low-bit LLM inference on CPU/NPU with lookup table ☆838 · Updated 2 months ago
- NVIDIA Linux open GPU with P2P support ☆27 · Updated last week
- Perplexity GPU Kernels ☆425 · Updated this week
- Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang ☆55 · Updated 9 months ago
- DFloat11: Lossless LLM Compression for Efficient GPU Inference ☆504 · Updated this week
- Pure C++ implementation of several models for real-time chatting on your computer (CPU & GPU) ☆670 · Updated this week
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆289 · Updated 2 months ago
- A throughput-oriented high-performance serving framework for LLMs ☆862 · Updated last month
- Triton Documentation in Simplified Chinese / Triton 中文文档 ☆78 · Updated 3 months ago
- Fast low-bit matmul kernels in Triton ☆339 · Updated last week
- ☆77 · Updated 8 months ago
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆171 · Updated last week