exists-forall / striped_attentionLinks

☆39

Alternatives and similar repositories for striped_attention

Users that are interested in striped_attention are comparing it to the libraries listed below

Sorting:

stanford-futuredata / stk
☆107Updated 11 months ago
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆106Updated 2 months ago
RulinShao / LightSeq
Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training
☆212Updated 11 months ago
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆230Updated 8 months ago
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆72Updated last year
cchan / tccl
extensible collectives library in triton
☆88Updated 4 months ago
open-lm-engine / flash-model-architectures
A bunch of kernels that might make stuff slower 😉
☆56Updated this week
jundaf2 / INT8-Flash-Attention-FMHA-Quantization
☆158Updated last year
deepspeedai / DeepSpeed-Kernels
☆74Updated 4 months ago
IST-DASLab / Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
☆80Updated 11 months ago
feifeibear / DPSKV3MFU
Estimate MFU for DeepSeekV3
☆25Updated 7 months ago
zhuohan123 / terapipe
☆75Updated 4 years ago
yanring / Megatron-MoE-ModelZoo
Best practices for testing advanced Mixtral, DeepSeek, and Qwen series MoE models using Megatron Core MoE.
☆41Updated last week
hpcaitech / TensorNVMe
A Python library transfers PyTorch tensors between CPU and NVMe
☆117Updated 8 months ago
triton-lang / kernels
☆85Updated 8 months ago
pytorch-labs / helion
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
☆200Updated this week
Aleph-Alpha-Research / NeurIPS-WANT-submission-efficient-parallelization-layouts
☆22Updated last year
fanshiqing / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆134Updated 3 weeks ago
thu-pacman / FasterMoE
☆86Updated 3 years ago
pytorch-labs / applied-ai
Applied AI experiments and examples for PyTorch
☆289Updated 2 months ago
xuqifan897 / Optimus
☆28Updated 4 years ago
kssteven418 / BigLittleDecoder
[NeurIPS'23] Speculative Decoding with Big Little Decoder
☆93Updated last year
Jokeren / triton-samples
☆28Updated 6 months ago
IBM / triton-dejavu
Framework to reduce autotune overhead to zero for well known deployments.
☆79Updated last week
anyscale / llm-continuous-batching-benchmarks
☆120Updated last year
usyd-fsalab / fp6_llm
An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).
☆260Updated 3 weeks ago
DerrickYLJ / TidalDecode
[ICLR 2025] TidalDecode: A Fast and Accurate LLM Decoding with Position Persistent Sparse Attention
☆44Updated 3 months ago
gpu-mode / ring-attention
ring-attention experiments
☆146Updated 9 months ago
xiayuqing0622 / flex_head_fa
Fast and memory-efficient exact attention
☆69Updated 5 months ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.
☆113Updated last year