ssiu / flash-attention-turing
☆43 · Updated this week
Alternatives and similar repositories for flash-attention-turing
Users interested in flash-attention-turing are comparing it to the libraries listed below.
- a simple Flash Attention v2 implementation with ROCm (RDNA3 GPU, roc wmma), mainly used for stable diffusion (ComfyUI) in Windows ZLUDA en… ☆43 · Updated 10 months ago
- Production ready LLM model compression/quantization toolkit with hw accelerated inference support for both cpu/gpu via HF, vLLM, and SGLa… ☆666 · Updated 2 weeks ago
- llama.cpp fork with additional SOTA quants and improved performance ☆652 · Updated this week
- Fast and memory-efficient exact attention ☆177 · Updated this week
- DashInfer is a native LLM inference engine aiming to deliver industry-leading performance atop various hardware architectures, including … ☆259 · Updated last month
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens. ☆855 · Updated 10 months ago
- Model Compression Toolbox for Large Language Models and Diffusion Models ☆530 · Updated 3 months ago
- DFloat11: Lossless LLM Compression for Efficient GPU Inference ☆446 · Updated last month
- Advanced Quantization Algorithm for LLMs and VLMs, with support for CPU, Intel GPU, CUDA and HPU. Seamlessly integrated with Torchao, Tra… ☆528 · Updated this week
- QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs. ☆135 · Updated 3 months ago
- run DeepSeek-R1 GGUFs on KTransformers ☆242 · Updated 4 months ago
- ☆428 · Updated last week
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆135 · Updated this week
- CPM.cu is a lightweight, high-performance CUDA implementation for LLMs, optimized for end-device inference and featuring cutting-edge tec… ☆160 · Updated this week
- An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5). ☆255 · Updated 8 months ago
- ☆195 · Updated 2 months ago
- [ACL 2025 Main] EfficientQAT: Efficient Quantization-Aware Training for Large Language Models ☆277 · Updated last month
- ☆139 · Updated 3 weeks ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆85 · Updated this week
- ☆134 · Updated 3 weeks ago
- ☆128 · Updated 6 months ago
- RAG system for RWKV ☆50 · Updated 7 months ago
- A general 2-8 bits quantization toolbox with GPTQ/AWQ/HQQ/VPTQ, and export to onnx/onnx-runtime easily. ☆174 · Updated 3 months ago
- This is an inference framework for the RWKV large language model implemented purely in native PyTorch. The official native implementation… ☆130 · Updated 11 months ago
- Low-bit LLM inference on CPU/NPU with lookup table ☆823 · Updated last month
- Efficient, Flexible, and Highly Fault-Tolerant Model Service Management Based on SGLang ☆54 · Updated 8 months ago
- An innovative library for efficient LLM inference via low-bit quantization ☆349 · Updated 10 months ago
- SpargeAttention: A training-free sparse attention that can accelerate any model inference. ☆653 · Updated 3 weeks ago
- vLLM performance dashboard ☆32 · Updated last year
- An optimized quantization and inference library for running LLMs locally on modern consumer-class GPUs ☆436 · Updated this week