Dao-AILab / gemm-cublas

☆22

Alternatives and similar repositories for gemm-cublas

Users who are interested in gemm-cublas are comparing it to the repositories listed below.

feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆77 Updated last year
li-plus / flash-preference
Accelerate LLM preference tuning via prefix sharing with a single line of code
☆43 Updated 3 months ago
Dao-AILab / grouped-latent-attention
☆129 Updated 4 months ago
PiotrNawrot / sparse-frontier
The evaluation framework for training-free sparse attention in LLMs
☆100 Updated 3 months ago
tile-ai / AttentionEngine
☆50 Updated 4 months ago
tilde-research / nsa-impl
An efficient implementation of the NSA (Native Sparse Attention) kernel
☆119 Updated 3 months ago
open-lm-engine / flash-model-architectures
A bunch of kernels that might make stuff slower 😉
☆59 Updated this week
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆189 Updated 3 months ago
feifeibear / ChituAttention
Quantized Attention on GPU
☆44 Updated 10 months ago
microsoft / AttentionEngine
☆99 Updated 4 months ago
HanGuo97 / log-linear-attention
☆251 Updated 4 months ago
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆84 Updated 3 weeks ago
softmax1 / Flash-Attention-Softmax-N
CUDA and Triton implementations of Flash Attention with SoftmaxN.
☆73 Updated last year
dame-cell / Triformer
Transformers components but in Triton
☆34 Updated 5 months ago
Doraemonzzz / Awesome-Triton-Resources
Awesome Triton Resources
☆36 Updated 5 months ago
TiledTensor / TiledBench
Benchmark tests supporting the TiledCUDA library.
☆17 Updated 10 months ago
habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to…
☆24 Updated last week
srush / triton-autodiff
Experiment of using Tangent to autodiff triton
☆80 Updated last year
GindaChen / FlexFlashAttention3
FlexAttention w/ FlashAttention3 Support
☆27 Updated last year
gpu-mode / ring-attention
ring-attention experiments
☆152 Updated 11 months ago
BBuf / flash-rwkv
☆32 Updated last year
HanGuo97 / hilt
☆33 Updated this week
stanford-futuredata / stk
☆113 Updated last year
siyan-zhao / prepacking
The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS …
☆60 Updated last year
OpenNLPLab / LASP
Linear Attention Sequence Parallelism (LASP)
☆87 Updated last year
alexzhang13 / flashattention2-custom-mask
Triton implementation of FlashAttention2 that adds Custom Masks.
☆138 Updated last year
tridao / flash-attention-wheels
☆57 Updated last year
zhuzilin / flash-attention-with-sink
☆39 Updated 2 months ago
xiayuqing0622 / flex_head_fa
Fast and memory-efficient exact attention
☆70 Updated 7 months ago
sandyresearch / chipmunk
🎬 3.7× faster video generation E2E 🖼️ 1.6× faster image generation E2E ⚡ ColumnSparseAttn 9.3× vs FlashAttn‑3 💨 ColumnSparseGEMM 2.5× …
☆86 Updated last month