lucidrains / ring-attention-pytorch
Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch
⭐ 501 · Updated 3 months ago
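Ring attention shards the sequence across devices: each device keeps its own query block while key/value blocks are passed around a ring, and partial attention results are merged with a numerically stable online softmax. The snippet below is a single-process simulation of that idea only; it is not the API of ring-attention-pytorch, and `ring_attention_simulated` is a hypothetical name used for illustration.

```python
import torch

def ring_attention_simulated(q, k, v, num_ring_slices=4):
    # q, k, v: (batch, heads, seq_len, head_dim); seq_len must divide evenly into num_ring_slices
    scale = q.shape[-1] ** -0.5
    q_blocks = q.chunk(num_ring_slices, dim=-2)
    k_blocks = list(k.chunk(num_ring_slices, dim=-2))
    v_blocks = list(v.chunk(num_ring_slices, dim=-2))

    outputs = []
    for qi in q_blocks:                       # each "device" owns one query block
        acc = torch.zeros_like(qi)            # running (rescaled) weighted sum of values
        lse = torch.full(qi.shape[:-1] + (1,), float('-inf'),
                         device=q.device, dtype=q.dtype)  # running log-sum-exp
        for step in range(num_ring_slices):
            # on real hardware, this kv block would have just arrived from the ring neighbor
            kj, vj = k_blocks[step], v_blocks[step]
            scores = (qi @ kj.transpose(-2, -1)) * scale
            block_lse = scores.logsumexp(dim=-1, keepdim=True)
            block_out = scores.softmax(dim=-1) @ vj
            # online-softmax merge of the new block into the running result
            new_lse = torch.logaddexp(lse, block_lse)
            acc = acc * (lse - new_lse).exp() + block_out * (block_lse - new_lse).exp()
            lse = new_lse
        outputs.append(acc)
    return torch.cat(outputs, dim=-2)

# sanity check against ordinary full attention
q = k = v = torch.randn(1, 8, 1024, 64)
full = ((q @ k.transpose(-2, -1)) * 64 ** -0.5).softmax(dim=-1) @ v
assert torch.allclose(ring_attention_simulated(q, k, v), full, atol=1e-4)
```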
Alternatives and similar repositories for ring-attention-pytorch:
Users interested in ring-attention-pytorch are comparing it to the libraries listed below.
- Large Context Attention ⭐ 682 · Updated 3 weeks ago
- Ring attention implementation with flash attention ⭐ 674 · Updated 2 months ago
- Helpful tools and examples for working with flex-attention ⭐ 635 · Updated this week
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ⭐ 222 · Updated this week
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ⭐ 514 · Updated this week
- Implementation of ST-MoE, the latest incarnation of MoE after years of research at Brain, in Pytorch ⭐ 306 · Updated 8 months ago
- Minimalistic 4D-parallelism distributed training framework for educational purposes ⭐ 724 · Updated this week
- [ICML 2024] CLLMs: Consistency Large Language Models ⭐ 372 · Updated 3 months ago
- Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware. ⭐ 701 · Updated 4 months ago
- This repository contains the experimental PyTorch native float8 training UX ⭐ 221 · Updated 6 months ago
- Triton-based implementation of Sparse Mixture of Experts. ⭐ 196 · Updated 2 months ago
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ⭐ 221 · Updated this week
- Microsoft Automatic Mixed Precision Library ⭐ 567 · Updated 4 months ago
- A repository for research on medium-sized language models. ⭐ 491 · Updated last month
- LLM KV cache compression made easy ⭐ 397 · Updated this week
- Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, spars… (a rough sketch of this idea follows the list) ⭐ 297 · Updated 2 months ago
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ⭐ 424 · Updated this week
- ⭐ 138 · Updated last year
- Code for Adam-mini: Use Fewer Learning Rates To Gain More (https://arxiv.org/abs/2406.16793) ⭐ 385 · Updated 2 months ago
- Some preliminary explorations of Mamba's context scaling. ⭐ 213 · Updated last year
- Muon optimizer: +~30% sample efficiency with <3% wallclock overhead ⭐ 253 · Updated last week
- Explorations into some recent techniques surrounding speculative decoding ⭐ 240 · Updated last month
- ⭐ 350 · Updated 3 weeks ago
- [ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning ⭐ 584 · Updated 11 months ago
- Efficient LLM Inference over Long Sequences ⭐ 357 · Updated this week
- Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers ⭐ 205 · Updated 6 months ago
- Quick implementation of nGPT, learning entirely on the hypersphere, from NvidiaAI ⭐ 271 · Updated 3 months ago
- Annotated version of the Mamba paper ⭐ 473 · Updated 11 months ago
- [ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads ⭐ 426 · Updated last week
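As referenced in the memory-layers item above, here is a rough sketch of a trainable key-value memory layer: a large learnable table of keys and values where each token reads only its top-k slots, so parameter count grows with the table size while per-token compute stays roughly constant. This is a conceptual illustration under assumed names (`SimpleMemoryLayer`), not the listed repository's code; real memory-layer implementations use product keys rather than scoring every slot densely.

```python
import torch
import torch.nn as nn

class SimpleMemoryLayer(nn.Module):
    """Toy sparse key-value memory: only the top-k slots per token are read."""
    def __init__(self, dim, num_slots=16384, topk=32):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_slots, dim) * dim ** -0.5)
        self.values = nn.Embedding(num_slots, dim)   # value table, gathered sparsely
        self.topk = topk

    def forward(self, x):                            # x: (batch, seq, dim)
        # NOTE: scoring every slot densely is the naive version; product-key
        # decompositions keep the slot search sublinear in num_slots.
        scores = x @ self.keys.t()                   # (batch, seq, num_slots)
        top_scores, top_idx = scores.topk(self.topk, dim=-1)
        weights = top_scores.softmax(dim=-1)         # normalize over the selected slots only
        top_values = self.values(top_idx)            # (batch, seq, topk, dim)
        return (weights.unsqueeze(-1) * top_values).sum(dim=-2)

layer = SimpleMemoryLayer(dim=512)
out = layer(torch.randn(2, 16, 512))                 # -> (2, 16, 512)
```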