foundation-model-stack / fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and an SDPA implementation of Flash Attention v2.
☆ 194 · Updated this week
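As a rough illustration of the two PyTorch features named in the description, the sketch below wraps a small self-attention module with FSDP and routes attention through `torch.nn.functional.scaled_dot_product_attention` (SDPA), which can dispatch to the Flash Attention v2 kernel when available. This is a minimal sketch under stated assumptions, not code from the fms-fsdp repository; the module names, dimensions, and wrapping policy are illustrative.

```python
# Minimal sketch (assumption: module names, shapes, and wrapping policy are
# illustrative and not taken from fms-fsdp itself).
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


class SDPASelfAttention(nn.Module):
    """Causal self-attention that routes through SDPA (may use Flash Attention v2)."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, t, d) -> (b, n_heads, t, head_dim)
        q, k, v = (
            y.view(b, t, self.n_heads, d // self.n_heads).transpose(1, 2)
            for y in (q, k, v)
        )
        # SDPA selects the fastest available backend (flash / mem-efficient / math).
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(out.transpose(1, 2).reshape(b, t, d))


def wrap_with_fsdp(model: nn.Module) -> FSDP:
    # Requires torch.distributed to be initialized first (e.g. launched via torchrun).
    return FSDP(model, use_orig_params=True)
```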
Related projects
Alternatives and complementary repositories for fms-fsdp
- Triton-based implementation of Sparse Mixture of Experts. ☆ 185 · Updated last month
- This repository contains the experimental PyTorch native float8 training UX ☆ 212 · Updated 3 months ago
- Applied AI experiments and examples for PyTorch ☆ 168 · Updated 3 weeks ago
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆ 166 · Updated this week
- Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers ☆ 195 · Updated 3 months ago
- Explorations into some recent techniques surrounding speculative decoding ☆ 212 · Updated last year
- Implementation of 💍 Ring Attention, from Liu et al. at Berkeley AI, in PyTorch ☆ 477 · Updated 3 weeks ago
- Large Context Attention ☆ 642 · Updated 3 months ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ☆ 187 · Updated this week
- LLM KV cache compression made easy ☆ 168 · Updated this week
- Ring attention implementation with flash attention ☆ 588 · Updated 2 weeks ago
- ring-attention experiments ☆ 97 · Updated last month
- Cataloging released Triton kernels. ☆ 138 · Updated 2 months ago
- Easy and Efficient Quantization for Transformers ☆ 180 · Updated 4 months ago
- Zero Bubble Pipeline Parallelism ☆ 283 · Updated last week
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆ 305 · Updated 3 months ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ☆ 483 · Updated 3 weeks ago
- Code for exploring Based models from "Simple linear attention language models balance the recall-throughput tradeoff" ☆ 214 · Updated this week
- Unofficial implementation for the paper "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models" ☆ 135 · Updated 5 months ago
- Multipack distributed sampler for fast padding-free training of LLMs ☆ 178 · Updated 3 months ago
- Simple and fast low-bit matmul kernels in CUDA / Triton ☆ 147 · Updated this week
- Scalable toolkit for efficient model alignment ☆ 624 · Updated this week
- Helpful tools and examples for working with flex-attention ☆ 475 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆ 253 · Updated last month
- [MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving ☆ 278 · Updated 4 months ago