vllm-project / flash-attention
Fast and memory-efficient exact attention
☆114 · Updated this week
Alternatives and similar repositories for flash-attention
Users interested in flash-attention are comparing it to the libraries listed below.
- ⚡️Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs to achieve peak performance.⚡️ ☆148 · Updated May 10, 2025 (9 months ago)
- FlashTile is a CUDA Tile IR compiler compatible with NVIDIA's tileiras, targeting SM70 through SM121 NVIDIA GPUs. ☆37 · Updated Feb 6, 2026 (last week)
- Quantized Attention on GPU ☆44 · Updated Nov 22, 2024 (last year)
- KV cache store for distributed LLM inference ☆392 · Updated Nov 13, 2025 (3 months ago)
- Kernel Library Wheel for SGLang ☆17 · Updated this week
- High-performance RMSNorm implementation using SM core storage (registers and shared memory) ☆26 · Updated Jan 22, 2026 (3 weeks ago)
- CUDA Templates for Linear Algebra Subroutines ☆101 · Updated Apr 25, 2024 (last year)
- ☆34 · Updated Feb 3, 2025 (last year)
- Demo for Qwen2.5-VL-3B-Instruct on an Axera device. ☆17 · Updated Sep 3, 2025 (5 months ago)
- ☆18 · Updated Mar 4, 2025 (11 months ago)
- FlashInfer: Kernel Library for LLM Serving ☆4,935 · Updated this week
- DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling ☆22 · Updated this week
- RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications. ☆1,049 · Updated this week
- ☆84 · Updated Feb 6, 2026 (last week)
- Patches for Hugging Face Transformers to save memory ☆34 · Updated Jun 2, 2025 (8 months ago)
- Online Preference Alignment for Language Models via Count-based Exploration ☆17 · Updated Jan 14, 2025 (last year)
- ☆155 · Updated Mar 4, 2025 (11 months ago)
- ☆22 · Updated May 5, 2025 (9 months ago)
- A stress-testing tool for the scheduler in large-scale scenarios. ☆16 · Updated Apr 29, 2024 (last year)
- 🤖FFPA: Extends FlashAttention-2 with Split-D and ~O(1) SRAM complexity for large headdim; 1.8x~3x↑🎉 vs. SDPA EA. ☆250 · Updated Feb 5, 2026 (last week)
- Official repository of "Distort, Distract, Decode: Instruction-Tuned Model Can Refine its Response from Noisy Instructions", ICLR 2024 Sp… ☆21 · Updated Mar 7, 2024 (last year)
- Generate compile_commands.json and run clang-tidy with Bazel ☆18 · Updated Jun 23, 2019 (6 years ago)
- ☆18 · Updated Jan 4, 2024 (2 years ago)
- Disaggregated serving system for Large Language Models (LLMs). ☆776 · Updated Apr 6, 2025 (10 months ago)
- Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI. ☆4,701 · Updated this week
- A sparse attention kernel supporting mixed sparse patterns ☆453 · Updated Jan 18, 2026 (3 weeks ago)
- ☆18 · Updated Dec 26, 2023 (2 years ago)
- ☆20 · Updated Dec 24, 2024 (last year)
- NVIDIA Inference Xfer Library (NIXL) ☆876 · Updated this week
- [ECCV24] MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization ☆49 · Updated Nov 27, 2024 (last year)
- Code for paper: [ICLR2025 Oral] FlexPrefill: A Context-Aware Sparse Attention Mechanism for Efficient Long-Sequence Inference ☆160 · Updated Oct 13, 2025 (4 months ago)
- Study of CUTLASS ☆22 · Updated Nov 10, 2024 (last year)
- UCAS compilers course assignment 3: points-to analysis ☆19 · Updated Dec 12, 2021 (4 years ago)
- A semantics-based keyword extraction algorithm for Chinese text ☆20 · Updated Mar 24, 2021 (4 years ago)
- Distributed tracing data from Meta's microservices architecture. ☆25 · Updated Aug 30, 2023 (2 years ago)
- A tiny yet powerful LLM inference system tailored for research purposes. vLLM-equivalent performance with only 2k lines of code (2% of … ☆314 · Updated Jun 10, 2025 (8 months ago)
- AST interpreter built with Clang 5.0.0 and LLVM 5.0.0 ☆14 · Updated Dec 7, 2019 (6 years ago)
- ☆52 · Updated May 19, 2025 (8 months ago)
- [ICML 2025] XAttention: Block Sparse Attention with Antidiagonal Scoring ☆269 · Updated Jul 6, 2025 (7 months ago)