foundation-model-stack / fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash attention v2.
☆232 · Updated 2 weeks ago
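As a quick illustration of the two PyTorch-native pieces named in the description, here is a minimal sketch that routes attention through SDPA's Flash Attention backend and shards the module with FSDP. It is not fms-fsdp code: the module, dimensions, and helper names are illustrative assumptions, and it presumes an initialized process group (e.g. via torchrun), a CUDA device, and bf16 inputs, which the Flash backend requires.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.nn.attention import SDPBackend, sdpa_kernel


class TinySelfAttention(nn.Module):
    """Illustrative causal self-attention block (not taken from fms-fsdp)."""

    def __init__(self, dim: int = 256, n_heads: int = 4):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split_heads(z: torch.Tensor) -> torch.Tensor:
            # (b, t, d) -> (b, n_heads, t, head_dim)
            return z.view(b, t, self.n_heads, d // self.n_heads).transpose(1, 2)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        # Ask SDPA to dispatch to the Flash Attention kernel for this call.
        with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
            out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(b, t, d))


def shard_with_fsdp(model: nn.Module) -> FSDP:
    # FSDP shards parameters, gradients, and optimizer state across ranks;
    # assumes torch.distributed.init_process_group() has already been called.
    return FSDP(model.cuda().to(torch.bfloat16), use_orig_params=True)
```

In practice one process per GPU is launched with torchrun, and the sharded module is then trained like any other nn.Module.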
Alternatives and similar repositories for fms-fsdp:
Users interested in fms-fsdp are comparing it to the libraries listed below.
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆189 · Updated this week
- This repository contains the experimental PyTorch native float8 training UX ☆222 · Updated 7 months ago
- Triton-based implementation of Sparse Mixture of Experts. ☆207 · Updated 3 months ago
- Applied AI experiments and examples for PyTorch ☆249 · Updated this week
- ring-attention experiments ☆127 · Updated 5 months ago
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in PyTorch ☆506 · Updated 4 months ago
- Large Context Attention ☆690 · Updated last month
- Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers ☆207 · Updated 7 months ago
- LLM KV cache compression made easy ☆440 · Updated this week
- Ring attention implementation with flash attention ☆711 · Updated 3 weeks ago
- Explorations into some recent techniques surrounding speculative decoding ☆248 · Updated 2 months ago
- Efficient LLM Inference over Long Sequences ☆365 · Updated last month
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆232 · Updated 3 weeks ago
- Fast low-bit matmul kernels in Triton ☆263 · Updated this week
- PyTorch per step fault tolerance (actively under development) ☆266 · Updated this week
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆524 · Updated last month
- Cataloging released Triton kernels. ☆204 · Updated 2 months ago
- Helpful tools and examples for working with flex-attention (see the FlexAttention sketch after this list) ☆689 · Updated last week
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆336 · Updated 7 months ago
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference ☆447 · Updated last month
- Scalable and Performant Data Loading ☆230 · Updated this week
- Collection of kernels written in Triton language ☆114 · Updated last month
- [MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Se… ☆595 · Updated 2 weeks ago
- Repo for "LoLCATs: On Low-Rank Linearizing of Large Language Models" ☆223 · Updated last month
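For the flex-attention item above, the sketch below shows the core FlexAttention call that repository is organized around (torch.nn.attention.flex_attention, available in recent PyTorch releases). The causal score_mod is the standard introductory example rather than code from that repository; the tensor shapes and dtype are arbitrary assumptions, and a CUDA device is assumed.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention


def causal(score, batch, head, q_idx, kv_idx):
    # Keep scores where the key position is at or before the query position;
    # mask everything else with -inf before the softmax.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))


# (batch, heads, seq_len, head_dim) -- illustrative shapes only.
q = torch.randn(1, 4, 128, 64, device="cuda", dtype=torch.bfloat16)
k = torch.randn(1, 4, 128, 64, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 4, 128, 64, device="cuda", dtype=torch.bfloat16)

# flex_attention is typically wrapped in torch.compile so the score_mod callback
# gets fused into a single attention kernel.
out = torch.compile(flex_attention)(q, k, v, score_mod=causal)
```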