kyutai-labs / jax-flash-attn3

JAX bindings for the flash-attention3 kernels

☆16

Alternatives and similar repositories for jax-flash-attn3

Users who are interested in jax-flash-attn3 compare it to the libraries listed below.

tile-ai / tvm
Open deep learning compiler stack for cpu, gpu and specialized accelerators
☆19 Updated last week
GindaChen / FlexFlashAttention3
FlexAttention w/ FlashAttention3 Support
☆27 Updated last year
BBuf / flash-rwkv
☆32 Updated last year
habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to…
☆27 Updated this week
softmax1 / Flash-Attention-Softmax-N
CUDA and Triton implementations of Flash Attention with SoftmaxN.
☆73 Updated last year
IST-DASLab / QIGen
Repository for CPU Kernel Generation for LLM Inference
☆27 Updated 2 years ago
casper-hansen / AutoAWQ_kernels
☆78 Updated last year
simveit / persistent_dense_gemm
Persistent dense gemm for Hopper in `CuTeDSL`
☆15 Updated 4 months ago
nanowell / Q-Sparse-LLM
My Implementation of Q-Sparse: All Large Language Models can be Fully Sparsely-Activated
☆33 Updated last year
Ryu1845 / hyena-jax
Implementation of Hyena Hierarchy in JAX
☆10 Updated 2 years ago
microsoft / AttentionEngine
☆114 Updated 6 months ago
sgl-project / DeepGEMM
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
☆21 Updated last week
graphcore-research / jax-scalify
JAX Scalify: end-to-end scaled arithmetics
☆17 Updated last year
opendatahub-io / vllm-tgis-adapter
vLLM adapter for a TGIS-compatible gRPC server.
☆45 Updated this week
UmerHA / triton_util
Make triton easier
☆49 Updated last year
WaveSpeedAI / QuantumAttention
[WIP] Better (FP8) attention for Hopper
☆32 Updated 9 months ago
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆40 Updated last year
feifeibear / ChituAttention
Quantized Attention on GPU
☆44 Updated last year
kyegomez / OpenStrawberry
An open source replication of the strawberry method that leverages Monte Carlo Search with PPO and/or DPO
☆29 Updated last week
tridao / flash-attention-wheels
☆58 Updated 2 years ago
frankxwang / dpo-prefix-sharing
DPO, but faster 🚀
☆46 Updated last year
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆112 Updated last year
kyegomez / MobileVLM
Implementation of the LDP module block in PyTorch and Zeta from the paper: "MobileVLM: A Fast, Strong and Open Vision Language Assistant …
☆15 Updated last year
SkyworkAI / vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆16 Updated last year
TiledTensor / TiledBench
Benchmark tests supporting the TiledCUDA library.
☆18 Updated last year
yuzhenmao / IceFormer
Implementation of IceFormer: Accelerated Inference with Long-Sequence Transformers on CPUs (ICLR 2024).
☆25 Updated 4 months ago
Bruce-Lee-LY / decoding_attention
Decoding Attention is specially optimized for MHA, MQA, GQA and MLA using CUDA core for the decoding stage of LLM inference.
☆45 Updated 5 months ago
Yifei-Zuo / Flash-LLA
Official repository for the paper Local Linear Attention: An Optimal Interpolation of Linear and Softmax Attention For Test-Time Regressi…
☆23 Updated 2 months ago
cassiewilliam / cuda_op_benchmark
An easily extensible framework for understanding and optimizing CUDA operators, intended for learning purposes only
☆18 Updated last year
IST-DASLab / gemm-fp8
High Performance FP8 GEMM Kernels for SM89 and later GPUs.
☆20 Updated 10 months ago