haoliuhl / ringattention

Large Context Attention

☆746

Alternatives and similar repositories for ringattention

Users who are interested in ringattention are comparing it to the libraries listed below.

lucidrains / ring-attention-pytorch
Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in Pytorch
☆542 Updated 5 months ago
zhuzilin / ring-flash-attention
Ring attention implementation with flash attention
☆903 Updated last month
Azure / MS-AMP
Microsoft Automatic Mixed Precision Library
☆627 Updated last year
foundation-model-stack / fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash…
☆270 Updated 3 months ago
hao-ai-lab / LookaheadDecoding
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
☆1,291 Updated 7 months ago
NVIDIA / NeMo-Aligner
Scalable toolkit for efficient model alignment
☆841 Updated 3 weeks ago
RulinShao / LightSeq
Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training
☆216 Updated last year
lucidrains / speculative-decoding
Explorations into some recent techniques surrounding speculative decoding
☆288 Updated 10 months ago
pytorch / PiPPy
Pipeline Parallelism for PyTorch
☆780 Updated last year
jzhang38 / EasyContext
Memory optimization and training recipes to extrapolate language models' context length to 1 million tokens, with minimal hardware.
☆750 Updated last year
shawntan / scattermoe
Triton-based implementation of Sparse Mixture of Experts.
☆246 Updated 3 weeks ago
hao-ai-lab / Consistency_LLM
[ICML 2024] CLLMs: Consistency Large Language Models
☆405 Updated 11 months ago
SqueezeAILab / KVQuant
[NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
☆389 Updated last year
meta-pytorch / attention-gym
Helpful tools and examples for working with flex-attention
☆1,029 Updated last week
BlackSamorez / tensor_parallel
Automatically split your PyTorch models on multiple GPUs for training & inference
☆658 Updated last year
microsoft / Tutel
Tutel MoE: Optimized Mixture-of-Experts Library, Support GptOss/DeepSeek/Kimi-K2/Qwen3 using FP8/NVFP4/MXFP4
☆934 Updated 3 weeks ago
apple / ml-cross-entropy
☆543 Updated last month
alexzhang13 / flashattention2-custom-mask
Triton implementation of FlashAttention2 that adds Custom Masks.
☆142 Updated last year
sail-sg / zero-bubble-pipeline-parallelism
Zero Bubble Pipeline Parallelism
☆433 Updated 5 months ago
meta-pytorch / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆223 Updated last year
Cornell-RelaxML / QuIP
Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"
☆385 Updated last year
BobMcDear / attorch
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
☆580 Updated 2 months ago
yuhuixu1993 / qa-lora
Official PyTorch implementation of QA-LoRA
☆141 Updated last year
mit-han-lab / duo-attention
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
☆496 Updated 8 months ago
feifeibear / long-context-attention
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference
☆582 Updated 2 weeks ago
epfml / dynamic-sparse-flash-attention
☆149 Updated 2 years ago
fla-org / flame
🔥 A minimal training framework for scaling FLA models
☆273 Updated last month
princeton-nlp / LLM-Shearing
[ICLR 2024] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
☆631 Updated last year
NVIDIA / kvpress
LLM KV cache compression made easy
☆669 Updated last week
NVIDIA / Megatron-Energon
Megatron's multi-modal data loader
☆252 Updated 2 weeks ago