pytorch / torchft
PyTorch per-step fault tolerance (actively under development)
★ 273 · Updated this week
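For orientation, here is a minimal sketch of the per-step fault-tolerant training loop torchft targets, assuming the `Manager`, `Optimizer`, and `DistributedDataParallel` wrappers shown in the project's README. torchft is under active development, so these names and signatures may have changed; treat this as an illustration of the pattern, not the current API.

```python
# Minimal sketch of per-step fault-tolerant training with torchft.
# Assumption: the Manager / Optimizer / DistributedDataParallel wrappers
# from torchft's README; exact names may differ in current releases.
import torch
import torch.nn as nn
import torch.optim as optim
from torchft import DistributedDataParallel, Manager, Optimizer, ProcessGroupGloo

model = nn.Linear(2, 3)

# Callbacks the manager uses to snapshot and restore training state
# when a replica fails or a new one joins mid-run.
def state_dict():
    return {"model": model.state_dict()}

def load_state_dict(sd):
    model.load_state_dict(sd["model"])

manager = Manager(
    pg=ProcessGroupGloo(),  # fault-tolerant process group for gradient averaging
    load_state_dict=load_state_dict,
    state_dict=state_dict,
)

model = DistributedDataParallel(manager, model)
optimizer = Optimizer(manager, optim.AdamW(model.parameters()))

for _ in range(1000):
    batch = torch.rand(2, 2)
    # The wrapped optimizer scopes each iteration as a managed step:
    # if a replica drops out, only that step's work is lost.
    optimizer.zero_grad()
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()
```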
Alternatives and similar repositories for torchft:
Users interested in torchft are comparing it to the libraries listed below.
- Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ★ 238 · Updated this week
- Scalable and Performant Data Loading ★ 234 · Updated this week
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ★ 156 · Updated 3 weeks ago
- LLM KV cache compression made easy ★ 452 · Updated 3 weeks ago
- A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton. ★ 528 · Updated last month
- This repository contains the experimental PyTorch native float8 training UX ★ 222 · Updated 8 months ago
- ★ 166 · Updated last month
- Applied AI experiments and examples for PyTorch ★ 256 · Updated 3 weeks ago
- Fast low-bit matmul kernels in Triton ★ 285 · Updated this week
- Efficient LLM Inference over Long Sequences ★ 366 · Updated last month
- Cataloging released Triton kernels. ★ 213 · Updated 3 months ago
- ★ 205 · Updated 2 months ago
- Load compute kernels from the Hub ★ 113 · Updated this week
- Perplexity GPU Kernels ★ 185 · Updated this week
- Google TPU optimizations for transformers models ★ 107 · Updated 2 months ago
- Where GPUs get cooked ★ 221 · Updated last month
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ★ 170 · Updated last week
- Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ★ 190 · Updated last week
- Home for "How To Scale Your Model", a short blog-style textbook about scaling LLMs on TPUs ★ 237 · Updated last week
- ★ 153 · Updated last year
- ★ 198 · Updated this week
- Collection of kernels written in the Triton language (see the Triton sketch after this list) ★ 118 · Updated last week
- Extensible collectives library in Triton ★ 84 · Updated last week
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ★ 128 · Updated last year
- ★ 295 · Updated this week
- KernelBench: Can LLMs Write GPU Kernels? - Benchmark with Torch -> CUDA problems ★ 254 · Updated last week
- ★ 185 · Updated this week
- JetStream is a throughput- and memory-optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs wel… ★ 313 · Updated this week
- A library to analyze PyTorch traces. ★ 361 · Updated last week
- ring-attention experiments ★ 129 · Updated 5 months ago
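Several of the entries above (the Triton-based neural network modules, the kernel catalog, the low-bit matmul kernels, and the collectives library) are built on OpenAI's Triton language. For readers unfamiliar with it, below is a generic, minimal Triton vector-add kernel illustrating the programming model those repositories build on; it is not taken from any of the listed projects.

```python
# Generic Triton vector-add kernel (essentially the canonical tutorial example).
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```

The masked load/store pattern is what lets a single kernel handle tensor sizes that are not multiples of the block size; the kernel collections listed above apply the same structure to fused, quantized, and collective operations.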