meta-pytorch / attention-gymLinks

Helpful tools and examples for working with flex-attention

☆1,062

Alternatives and similar repositories for attention-gym

Users that are interested in attention-gym are comparing it to the libraries listed below

Sorting:

lucidrains / ring-attention-pytorch
Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch
☆545Updated 6 months ago
BobMcDear / attorch
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
☆584Updated 3 months ago
apple / ml-cross-entropy
☆555Updated 2 months ago
lucidrains / native-sparse-attention-pytorch
Implementation of the sparse attention pattern proposed by the Deepseek team in their "Native Sparse Attention" paper
☆782Updated 3 months ago
KellerJordan / Muon
Muon is an optimizer for hidden layers in neural networks
☆2,056Updated last week
fla-org / native-sparse-attention
🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention"
☆925Updated 8 months ago
haoliuhl / ringattention
Large Context Attention
☆753Updated last month
pytorch / PiPPy
Pipeline Parallelism for PyTorch
☆784Updated last year
zhuzilin / ring-flash-attention
Ring attention implementation with flash attention
☆923Updated 2 months ago
facebookresearch / spdl
Scalable and Performant Data Loading
☆345Updated this week
srush / annotated-mamba
Annotated version of the Mamba paper
☆491Updated last year
Haiyang-W / TokenFormer
[ICLR2025 Spotlight🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
☆579Updated 9 months ago
Dao-AILab / quack
A Quirky Assortment of CuTe Kernels
☆675Updated last week
NVIDIA / kvpress
LLM KV cache compression made easy
☆694Updated last week
foundation-model-stack / fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash…
☆271Updated last week
meta-pytorch / torchft
Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)
☆454Updated 3 weeks ago
NVlabs / GatedDeltaNet
[ICLR 2025] Official PyTorch Implementation of Gated Delta Networks: Improving Mamba2 with Delta Rule
☆379Updated 2 months ago
zyushun / Adam-mini
Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793
☆445Updated 6 months ago
lucidrains / rotary-embedding-torch
Implementation of Rotary Embeddings, from the Roformer paper, in Pytorch
☆780Updated 4 months ago
HazyResearch / aisys-building-blocks
Building blocks for foundation models.
☆581Updated last year
thuml / depyf
depyf is a tool to help you understand and adapt to PyTorch compiler torch.compile.
☆765Updated last month
pytorch / helion
A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.
☆640Updated this week
microsoft / Tutel
Tutel MoE: Optimized Mixture-of-Experts Library, Support GptOss/DeepSeek/Kimi-K2/Qwen3 using FP8/NVFP4/MXFP4
☆943Updated 3 weeks ago
facebookresearch / optimizers
For optimization algorithm research and development.
☆547Updated 2 weeks ago
huggingface / kernels
Load compute kernels from the Hub
☆335Updated last week
thinking-machines-lab / batch_invariant_ops
☆912Updated 3 weeks ago
goombalab / hnet
H-Net: Hierarchical Network with Dynamic Chunking
☆785Updated last week
SandAI-org / MagiAttention
A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training
☆570Updated this week
LambdaLabsML / distributed-training-guide
Best practices & guides on how to write distributed pytorch training code
☆543Updated last month
apple / ml-sigmoid-attention
☆303Updated 7 months ago