pytorch-labs / attention-gym
Helpful tools and examples for working with flex-attention
★746 · Updated 3 weeks ago
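attention-gym collects tools and recipes for PyTorch's FlexAttention API (torch.nn.attention.flex_attention). As orientation, here is a minimal sketch of that API, assuming PyTorch 2.5+; the shapes, the causal mask, and the relative-position bias are illustrative choices, not code taken from the repository.

```python
# Minimal FlexAttention sketch (assumes PyTorch >= 2.5).
# Shapes and the 0.01 bias slope are illustrative assumptions.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, S, D = 2, 4, 256, 64
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))

def causal(b, h, q_idx, kv_idx):
    # mask_mod: keep only keys at or before the query position
    return q_idx >= kv_idx

# BlockMask lets FlexAttention skip fully masked blocks entirely
block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=S, KV_LEN=S, device=q.device)

def rel_bias(score, b, h, q_idx, kv_idx):
    # score_mod: add a simple relative-position bias to each attention score
    return score + 0.01 * (q_idx - kv_idx)

out = flex_attention(q, k, v, score_mod=rel_bias, block_mask=block_mask)
```

In practice `flex_attention` is usually wrapped with `torch.compile` so the score_mod and mask are fused into a single attention kernel; the eager call above is just the uncompiled fallback.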
Alternatives and similar repositories for attention-gym:
Users who are interested in attention-gym are comparing it to the libraries listed below.
- Implementation of Ring Attention, from Liu et al. at Berkeley AI, in Pytorch (★511 · Updated 6 months ago)
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. (★536 · Updated last week)
- Muon optimizer: +>30% sample efficiency with <3% wallclock overhead (★597 · Updated last month)
- Large Context Attention (★707 · Updated 3 months ago)
- Annotated version of the Mamba paper (★483 · Updated last year)
- Scalable and Performant Data Loading (★252 · Updated this week)
- Ring attention implementation with flash attention (★757 · Updated 3 weeks ago)
- Efficient Triton implementations for "Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention" (★647 · Updated last month)
- Pipeline Parallelism for PyTorch (★765 · Updated 8 months ago)
- FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores (★316 · Updated 4 months ago)
- Efficient implementations of state-of-the-art linear attention models in Torch and Triton (★2,344 · Updated this week)
- Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… (★244 · Updated this week)
- Code for Adam-mini: Use Fewer Learning Rates To Gain More https://arxiv.org/abs/2406.16793 (★409 · Updated 3 weeks ago)
- This repository contains the experimental PyTorch native float8 training UX (★224 · Updated 9 months ago)
- Efficient LLM Inference over Long Sequences (★372 · Updated last week)
- Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory" (★377 · Updated last year); a minimal sketch of the chunked-attention idea appears after this list
- Implementation of the sparse attention pattern proposed by the Deepseek team in their "Native Sparse Attention" paper (★607 · Updated last month)
- Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA (★820 · Updated this week)
- Tutel MoE: Optimized Mixture-of-Experts Library, Support DeepSeek FP8/FP4 (★814 · Updated this week)
- Triton-based implementation of Sparse Mixture of Experts. (★212 · Updated 5 months ago)
- LLM KV cache compression made easy (★471 · Updated this week)
- Flash Attention in ~100 lines of CUDA (forward pass only) (★796 · Updated 4 months ago)
- Building blocks for foundation models. (★487 · Updated last year)
- For optimization algorithm research and development. (★509 · Updated this week)
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference (★488 · Updated 2 weeks ago)
- [ICLR2025 Spotlight] Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters (★555 · Updated 2 months ago)
- Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI (★281 · Updated last month)
- Microsoft Automatic Mixed Precision Library (★596 · Updated 7 months ago)
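For the memory-efficient attention entry above, the core idea of "Self-attention Does Not Need O(n²) Memory" is to process queries and keys in chunks with an online (log-sum-exp) softmax, so the full n×n score matrix is never materialized at once. The sketch below is a plain-PyTorch illustration under assumed shapes and chunk sizes, not code from that repository.

```python
# Illustrative chunked attention with an online log-sum-exp softmax,
# in the spirit of "Self-attention Does Not Need O(n^2) Memory".
# Chunk sizes and tensor shapes are assumptions, not the repo's API.
import torch

def chunked_attention(q, k, v, q_chunk=128, kv_chunk=128):
    # q, k, v: (batch, heads, seq, dim)
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for qs in range(0, q.shape[2], q_chunk):
        q_c = q[:, :, qs:qs + q_chunk] * scale
        acc = torch.zeros_like(q_c)                        # running weighted sum of values
        lse = torch.full(q_c.shape[:-1] + (1,), float("-inf"),
                         dtype=q.dtype, device=q.device)   # running log-sum-exp
        for ks in range(0, k.shape[2], kv_chunk):
            k_c = k[:, :, ks:ks + kv_chunk]
            v_c = v[:, :, ks:ks + kv_chunk]
            s = q_c @ k_c.transpose(-2, -1)                # only (q_chunk x kv_chunk) scores live
            new_lse = torch.logaddexp(lse, s.logsumexp(dim=-1, keepdim=True))
            # rescale the old accumulator, then add this chunk's contribution
            acc = acc * (lse - new_lse).exp() + (s - new_lse).exp() @ v_c
            lse = new_lse
        out[:, :, qs:qs + q_chunk] = acc
    return out

q = k = v = torch.randn(1, 4, 512, 64)
assert torch.allclose(chunked_attention(q, k, v),
                      torch.nn.functional.scaled_dot_product_attention(q, k, v),
                      atol=1e-4)
```

Peak memory per step is one (q_chunk × kv_chunk) score tile plus the running accumulators, which is how the quadratic score matrix is avoided; the final assert just checks the result against PyTorch's reference scaled-dot-product attention.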