foundation-model-stack / fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and an SDPA implementation of Flash Attention v2.
☆ 194 · Updated this week
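As a rough illustration of the two PyTorch features named in the description, the sketch below wraps a small self-attention module with FSDP and routes attention through `torch.nn.functional.scaled_dot_product_attention` (SDPA), which can dispatch to the Flash Attention v2 kernel when available. This is a minimal sketch under stated assumptions, not code from the fms-fsdp repository; the module names, dimensions, and wrapping policy are illustrative.

```python
# Minimal sketch (assumption: module names, shapes, and wrapping policy are
# illustrative and not taken from fms-fsdp itself).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


class SDPASelfAttention(nn.Module):
    """Causal self-attention that routes through SDPA (may use Flash Attention v2)."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, t, d) -> (b, n_heads, t, head_dim)
        q, k, v = (
            y.view(b, t, self.n_heads, d // self.n_heads).transpose(1, 2)
            for y in (q, k, v)
        )
        # SDPA selects the fastest available backend (flash / mem-efficient / math).
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(b, t, d))


def wrap_with_fsdp(model: nn.Module) -> FSDP:
    # Requires torch.distributed to be initialized first (e.g. launched via torchrun).
    return FSDP(model, use_orig_params=True)
```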
Related projects
Alternatives and complementary repositories for fms-fsdp
- Triton-based implementation of Sparse Mixture of Experts. ☆ 185 · Updated last month
- This repository contains the experimental PyTorch native float8 training UX ☆ 212 · Updated 3 months ago
- Applied AI experiments and examples for PyTorch ☆ 168 · Updated 3 weeks ago
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆ 166 · Updated this week
- Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers ☆ 195 · Updated 3 months ago
- Explorations into some recent techniques surrounding speculative decoding ☆ 212 · Updated last year
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in PyTorch ☆ 477 · Updated 3 weeks ago
- Large Context Attention ☆ 642 · Updated 3 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆ 187 · Updated this week
- LLM KV cache compression made easy ☆ 168 · Updated this week
- Ring attention implementation with flash attention ☆ 588 · Updated 2 weeks ago
- ring-attention experiments ☆ 97 · Updated last month
- Cataloging released Triton kernels. ☆ 138 · Updated 2 months ago
- Easy and Efficient Quantization for Transformers ☆ 180 · Updated 4 months ago
- Zero Bubble Pipeline Parallelism ☆ 283 · Updated last week
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆ 305 · Updated 3 months ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆ 483 · Updated 3 weeks ago
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ☆ 214 · Updated this week
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆ 135 · Updated 5 months ago
- Multipack distributed sampler for fast padding-free training of LLMs ☆ 178 · Updated 3 months ago
- Simple and fast low-bit matmul kernels in CUDA / Triton ☆ 147 · Updated this week
- Scalable toolkit for efficient model alignment ☆ 624 · Updated this week
- Helpful tools and examples for working with flex-attention ☆ 475 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆ 253 · Updated last month
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆ 278 · Updated 4 months ago