ai-bond/flash-attention-v100

Implementation of FlashAttention-2 for Nvidia Tesla V100 / Titan V

☆175

Alternatives and similar repositories for flash-attention-v100

Users who are interested in flash-attention-v100 are comparing it to the libraries listed below.

zhinianqin / flash-attention-v100
forked from vllm-project/flash-attention
☆58 · May 9, 2026 · Updated 2 months ago
1CatAI / 1Cat-vLLM
vLLM fork for Tesla V100 (SM70) with AWQ 4-bit support, CUDA 12.8 build flow, and validated Qwen3.5 27B/35B deployment on multi-GPU V…
☆539 · Updated this week
humanjesse / vllm-v100
vLLM fork for Tesla V100 (SM70) — extends 1CatAI's AWQ support and adds GGUF support
☆19 · Jun 20, 2026 · Updated last month
zh-nj / lmdeploy-v100
This project is specifically developed for V100, based on lmdeploy 0.12.1, and supports mainstream open-source models from Q4 2025 to Q1 …
☆21 · Mar 18, 2026 · Updated 4 months ago
haohervchb / GooseLLM
1CatV2 with an FA-v100 written in TileLANG, plus many goodies
☆17 · Jul 15, 2026 · Updated last week
zhinianqin / marlin_v100
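marlin_v100 is a minimal standalone Marlin development workspace extracted from the vLLM main tree, focused on source development, minimal builds, and lightweight validation of Marlin dense and Marlin MoE. It keeps the core CUDA/C++ implementation, a minimal thin Python wrapper, generator tests, and write-back to the main tree…
☆19 · Jul 2, 2026 · Updated 2 weeks ago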
ZRayZzz / flash-attention-v100
☆78 · Feb 19, 2024 · Updated 2 years ago
ztxz16 / exvllm
Hybrid inference extension plugin for vLLM; supports multi-NUMA hybrid inference and reaches 1000+ prefill when running the Qwen3-Next model on a single GPU
☆34 · Nov 7, 2025 · Updated 8 months ago
poad42 / cuda-fp8-ampere
IMMA-based FP8-as-storage GEMM experiments for Ampere (sm_86 / RTX 3090 Ti).
☆24 · Jan 30, 2026 · Updated 5 months ago
ssiu / flash-attention-turing
Flash Attention 2 implementation for Turing GPUs
☆116 · Mar 23, 2026 · Updated 3 months ago
aikitoria / open-gpu-kernel-modules
NVIDIA Linux open GPU with P2P support
☆351 · Jul 13, 2026 · Updated last week
2dameneko / ide-cap-chan
ide-cap-chan is a utility for batch image captioning with natural language using various VL models
☆14 · May 8, 2026 · Updated 2 months ago
guqiong96 / Lsglang
Lsglang is a special extension of sglang that fully utilizes CPU and GPU computing resources with an efficient GPU parallel + NUMA parall…
☆97 · Updated this week
sgl-project / sgl-flash-attn
Fast and memory-efficient exact attention
☆22 · Jun 26, 2026 · Updated 3 weeks ago
mengqin / ComfyUI-TwinFlow
A ComfyUI custom node implementation of TwinFlow: Realizing One-step Generation on Large Models with Self-adversarial Flows.
☆44 · Mar 6, 2026 · Updated 4 months ago
mixa3607 / ML-gfx906
ML software (llama.cpp, ComfyUI, vLLM) builds for AMD gfx906 GPUs, e.g. Radeon VII / MI50 / MI60
☆286 · Jul 8, 2026 · Updated last week
komikndr / raylight
Enable true multi gpu capability in Comfy UI using XDiT XFuser and FSDP managed by Ray
☆368 · Updated this week
egaoharu-kensei / flash-attention-triton
Cross-platform FlashAttention-2 Triton implementation for Turing+ GPUs with custom configuration mode
☆26 · Jan 12, 2026 · Updated 6 months ago
averkij / top_papers
Top ML papers of the week.
☆46 · Updated this week
bassrehab / triton-kernels
High-performance GPU kernels for LLM inference in OpenAI Triton. Fused RMSNorm, SwiGLU, INT8 GEMM with benchmarks and roofline analysis.
☆31 · Updated this week
litch230 / comfyui_toriigate
☆20 · May 9, 2026 · Updated 2 months ago
turbo-tan / llama.cpp-tq3
llama.cpp fork with TQ3_1S/4S CUDA kernels — 3.5-bit WHT quantization achieving Q4s quality at 10% smaller size. Based on RaBitQ-inspired…
☆221 · Jul 6, 2026 · Updated 2 weeks ago
weicj / vLLM-2080Ti-Definitive
The definitive vLLM runtime for dual RTX 2080 Ti 22GB + NVLink, delivering Qwen 27B local inference with 100+ tok/s single-request decode…
☆431 · Jul 13, 2026 · Updated last week
TechTalkies / Face-for-Xiaozhi
A plugin code to have a simple face instead of the Xiaozhi AI firmware
☆26 · Mar 29, 2026 · Updated 3 months ago
BoFan-tunning / llama.cpp-MTP-TurboQuant
☆142 · Jun 13, 2026 · Updated last month
Yuan-ManX / ComfyUI-Bagel
ComfyUI-Bagel is now available in ComfyUI, BAGEL is an open‑source multimodal foundation model with 7B active parameters (14B total) trai…
☆29 · May 28, 2025 · Updated last year
astrowander / acmb
Processing of astronomical images
☆14 · Updated this week
DanielSc4 / Dynamic-Activation-Composition
Materials for "Multi-property Steering of Large Language Models with Dynamic Activation Composition"
☆14 · Nov 22, 2024 · Updated last year
yhfgyyf / vllm-deepseek-v4-sm89
Run DeepSeek-V4-Flash on SM89 (Ada / RTX 4090) with vLLM — patch over PR #41834. Validated on 4x RTX 4090.
☆22 · Updated this week
Sandermage / sndr_core_engine
SNDR Core Engine (Genesis) — vLLM runtime patch-overlay for Qwen3.6 + Gemma4 on consumer NVIDIA (Ampere sm_86, 2× A5000/3090). Qwen3.6-35…
☆125 · Updated this week
Sike-Wang / low-bit-Shampoo
4-bit Shampoo for Memory-Efficient Network Training (NeurIPS 2024)
☆13 · Feb 13, 2025 · Updated last year
Derryyyyyy / ComfyUI-DNode
A ComfyUI custom node for smooth handheld camera shake simulation based on Lissajous curves.
☆47 · Feb 28, 2026 · Updated 4 months ago
Anbeeld / beellama.cpp
KVarN, KV cache precision tail, low-bit quants in llama.cpp for longer context of better precision in the same VRAM
☆794 · Updated this week
Ph0rk0z / SageAttention2
Sage attention for Turing.
☆70 · Dec 29, 2025 · Updated 6 months ago
Thireus / ik_llama.cpp
ik_llama.cpp's Thireus fork with release builds for macOS/Windows/Ubuntu CPU, Vulkan and CUDA
☆165 · Updated this week
joshterrell805 / OpenIntro_Statistics_Labs
R labs for the book OpenIntro Statistics (https://www.openintro.org/stat/)
☆13 · Nov 17, 2016 · Updated 9 years ago
aivrar / multi-turboquant
Unified KV cache compression for LLM inference — TurboQuant, IsoQuant, PlanarQuant, TriAttention. 10 methods, GPU-validated, multi-GPU pl…
☆24 · Jul 11, 2026 · Updated last week
lofcz / gpt2sharp
GPT2# is a zero dependency, sub 1 000 loc implementation of GPT2 inference, batteries included
☆13 · May 2, 2023 · Updated 3 years ago
xuchenxu168 / Comfyui_Prompt_Edit
☆69 · Nov 11, 2025 · Updated 8 months ago