epfml / dynamic-sparse-flash-attentionLinks

☆150

Alternatives and similar repositories for dynamic-sparse-flash-attention

Users that are interested in dynamic-sparse-flash-attention are comparing it to the libraries listed below

Sorting:

mgmalek / efficient_cross_entropy
☆121Updated last year
HazyResearch / based
Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff"
☆243Updated 5 months ago
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆253Updated 2 months ago
insuhan / hyper-attn
☆83Updated 2 years ago
xiayuqing0622 / flex_head_fa
Fast and memory-efficient exact attention
☆74Updated 9 months ago
PiotrNawrot / nano-sparse-attention
The simplest implementation of recent Sparse Attention patterns for efficient LLM inference.
☆91Updated 4 months ago
alexzhang13 / flashattention2-custom-mask
Triton implementation of FlashAttention2 that adds Custom Masks.
☆151Updated last year
JonasGeiping / linear_cross_entropy_loss
A fusion of a linear layer and a cross entropy loss, written for pytorch in triton.
☆73Updated last year
teelinsan / parallel-decoding
Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding"
☆121Updated last year
jzhang38 / LongMamba
Some preliminary explorations of Mamba's context scaling.
☆217Updated last year
RulinShao / LightSeq
Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training
☆218Updated last year
stanford-futuredata / stk
☆113Updated last year
meta-pytorch / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆226Updated last year
thu-ml / low-bit-optimizers
Low-bit optimizers for PyTorch
☆132Updated 2 years ago
kssteven418 / BigLittleDecoder
[NeurIPS'23] Speculative Decoding with Big Little Decoder
☆95Updated last year
HazyResearch / zoology
Understand and test language model architectures on synthetic tasks.
☆240Updated 2 months ago
softmax1 / Flash-Attention-Softmax-N
CUDA and Triton implementations of Flash Attention with SoftmaxN.
☆73Updated last year
athms / mad-lab
A MAD laboratory to improve AI architecture designs 🧪
☆135Updated 11 months ago
AnswerDotAI / cold-compress
Cold Compress is a hackable, lightweight, and open-source toolkit for creating and benchmarking cache compression methods built on top of…
☆146Updated last year
nil0x9 / flash-muon
Flash-Muon: An Efficient Implementation of Muon Optimizer
☆212Updated 5 months ago
Zyphra / tree_attention
Tree Attention: Topology-aware Decoding for Long-Context Attention on GPU clusters
☆130Updated last year
FasterDecoding / TEAL
☆154Updated 9 months ago
nshepperd / flash_attn_jax
JAX bindings for Flash Attention v2
☆99Updated last month
berlino / gated_linear_attention
☆106Updated last year
Edward-Sun / gpt-accelera
Simple and efficient pytorch-native transformer training and inference (batched)
☆78Updated last year
graphcore-research / unit-scaling
A library for unit scaling in PyTorch
☆132Updated 4 months ago
kyo-takano / chinchilla
A toolkit for scaling law research ⚖
☆53Updated 10 months ago
siyan-zhao / prepacking
The source code of our work "Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models" [AISTATS …
☆60Updated last year
haochengxi / Train_Transformers_with_INT4
☆157Updated 2 years ago
jaymody / speculative-sampling
Simple implementation of Speculative Sampling in NumPy for GPT-2.
☆98Updated 2 years ago