pytorch-labs / attention-gym
Helpful tools and examples for working with flex-attention
★802 · Updated last week
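For context, attention-gym provides tools and examples built on PyTorch's flex-attention API (`torch.nn.attention.flex_attention`, available since PyTorch 2.5). The snippet below is a minimal sketch of that API rather than code from this repository; the `causal` function name and the tensor shapes are illustrative choices, while the `flex_attention` call and the `score_mod` hook signature are the real PyTorch API.

```python
# Minimal sketch of the flex-attention API that attention-gym builds on.
# Assumes PyTorch >= 2.5 (torch.nn.attention.flex_attention); illustrative
# only, not code from the attention-gym repository itself.
import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, b, h, q_idx, kv_idx):
    # score_mod hook: given a score and its (batch, head, query index,
    # key/value index), return a modified score; here, mask future positions.
    return torch.where(q_idx >= kv_idx, score, float("-inf"))

# (batch, heads, seq_len, head_dim); shapes here are arbitrary for the demo.
q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))
out = flex_attention(q, k, v, score_mod=causal)

# In practice flex_attention is usually wrapped in torch.compile so the
# score_mod is fused into a single attention kernel:
# flex_attention = torch.compile(flex_attention)
```

The point of the API, and of the examples collected in this repository, is that attention variants (causal masks, ALiBi, sliding windows, document masking, and so on) can be expressed as small score-modifier functions like `causal` instead of bespoke CUDA kernels.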
Alternatives and similar repositories for attention-gym
Users interested in attention-gym are comparing it to the libraries listed below.
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in PyTorch ★513 · Updated 2 weeks ago
- Muon optimizer: >30% sample efficiency with <3% wallclock overhead ★661 · Updated this week
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ★544 · Updated this week
- Large Context Attention ★711 · Updated 4 months ago
- ★450 · Updated this week
- Ring attention implementation with flash attention ★771 · Updated last week
- Scalable and Performant Data Loading ★267 · Updated last week
- 🐳 Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" ★686 · Updated 2 months ago
- LLM KV cache compression made easy ★493 · Updated 3 weeks ago
- Pipeline Parallelism for PyTorch ★766 · Updated 9 months ago
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ★249 · Updated this week
- Annotated version of the Mamba paper ★482 · Updated last year
- [ICLR 2025 Spotlight🔥] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters ★559 · Updated 3 months ago
- Implementation of the sparse attention pattern proposed by the DeepSeek team in their "Native Sparse Attention" paper ★637 · Updated 2 weeks ago
- This repository contains the experimental PyTorch native float8 training UX ★222 · Updated 10 months ago
- ★286 · Updated last month
- Efficient LLM Inference over Long Sequences ★376 · Updated this week
- Building blocks for foundation models. ★500 · Updated last year
- Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA ★850 · Updated this week
- For optimization algorithm research and development. ★518 · Updated this week
- Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models ★660 · Updated last month
- Implementation of memory-efficient multi-head attention as proposed in the paper "Self-attention Does Not Need O(n²) Memory" (a minimal sketch of the chunking idea follows this list) ★378 · Updated last year
- TensorDict is a PyTorch-dedicated tensor container. ★925 · Updated this week
- Scalable toolkit for efficient model reinforcement ★361 · Updated this week
- Universal Tensor Operations in Einstein-Inspired Notation for Python. ★374 · Updated last month
- Triton-based implementation of Sparse Mixture of Experts. ★216 · Updated 6 months ago
- When it comes to optimizers, it's always better to be safe than sorry ★233 · Updated 2 months ago
- FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores ★317 · Updated 5 months ago
- PyTorch per-step fault tolerance (actively under development) ★302 · Updated this week
- Code for Adam-mini: Use Fewer Learning Rates To Gain More (https://arxiv.org/abs/2406.16793) ★417 · Updated 2 weeks ago
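On the "Self-attention Does Not Need O(n²) Memory" entry above: the core idea is simple enough to sketch. Keys and values are processed in chunks while a running maximum and running softmax denominator are kept per query, so the full n×n score matrix is never materialized. The code below is a hedged illustration of that technique in plain PyTorch; `chunked_attention` and `chunk_size` are names invented for this sketch, not the listed repository's API.

```python
# Hedged sketch of chunked attention from "Self-attention Does Not Need
# O(n^2) Memory": stream over key/value chunks with a running max and a
# running softmax denominator, so only O(chunk) scores exist at any time.
# Illustration only, not the listed repository's implementation.
import torch

def chunked_attention(q, k, v, chunk_size=64):
    # q, k, v: (seq_len, head_dim) for a single head.
    scale = q.shape[-1] ** -0.5
    acc = torch.zeros_like(q)                             # running sum of p @ v
    row_max = q.new_full((q.shape[0], 1), float("-inf"))  # running score max
    row_sum = q.new_zeros((q.shape[0], 1))                # running softmax denom
    for start in range(0, k.shape[0], chunk_size):
        k_c = k[start:start + chunk_size]
        v_c = v[start:start + chunk_size]
        scores = (q @ k_c.T) * scale                      # (seq_len, chunk)
        new_max = torch.maximum(row_max, scores.amax(-1, keepdim=True))
        rescale = torch.exp(row_max - new_max)            # correct old partials
        p = torch.exp(scores - new_max)
        acc = acc * rescale + p @ v_c
        row_sum = row_sum * rescale + p.sum(-1, keepdim=True)
        row_max = new_max
    return acc / row_sum

# Agrees with dense attention up to floating-point error:
q, k, v = (torch.randn(256, 32) for _ in range(3))
ref = torch.softmax((q @ k.T) * 32 ** -0.5, dim=-1) @ v
assert torch.allclose(chunked_attention(q, k, v), ref, atol=1e-5)
```

The running max/denominator rescaling is the same numerically stable online-softmax trick used by flash attention, which is why several of the repositories above can trade memory for a single pass over the sequence.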