qhliu26 / Dive-into-Big-Model-Training
Dive into Big Model Training
★111 · Updated 2 years ago
Alternatives and similar repositories for Dive-into-Big-Model-Training
Users interested in Dive-into-Big-Model-Training are comparing it to the libraries listed below.
- Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training · ★209 · Updated 8 months ago
- ★82 · Updated 3 years ago
- PyTorch bindings for CUTLASS grouped GEMM. · ★121 · Updated 4 months ago
- Odysseus: Playground of LLM Sequence Parallelism · ★69 · Updated 11 months ago
- Triton-based implementation of Sparse Mixture of Experts. · ★214 · Updated 5 months ago
- ★104 · Updated 8 months ago
- ★146 · Updated last year
- A collection of memory efficient attention operators implemented in the Triton language. · ★267 · Updated 11 months ago
- Zero Bubble Pipeline Parallelism · ★389 · Updated last week
- A high-performance distributed deep learning system targeting large-scale and automated distributed training. If you have any interests, … · ★111 · Updated last year
- A Python library that transfers PyTorch tensors between CPU and NVMe · ★115 · Updated 5 months ago
- ★147 · Updated last year
- REST: Retrieval-Based Speculative Decoding, NAACL 2024 · ★201 · Updated 5 months ago
- PyTorch bindings for CUTLASS grouped GEMM. · ★89 · Updated 2 weeks ago
- [ICLR 2025] PEARL: Parallel Speculative Decoding with Adaptive Draft Length · ★81 · Updated last month
- ★117 · Updated last year
- Latency and Memory Analysis of Transformer Models for Training and Inference · ★411 · Updated 3 weeks ago
- ring-attention experiments · ★142 · Updated 7 months ago
- Sequence-level 1F1B schedule for LLMs. · ★17 · Updated 11 months ago
- Cataloging released Triton kernels. · ★221 · Updated 4 months ago
- USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference · ★492 · Updated 3 weeks ago
- Applied AI experiments and examples for PyTorch · ★267 · Updated this week
- ★132 · Updated 2 months ago
- Collection of kernels written in the Triton language · ★122 · Updated last month
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. · ★124 · Updated this week
- Repository of the paper "Accelerating Transformer Inference for Translation via Parallel Decoding" · ★116 · Updated last year
- ★84 · Updated last month
- Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5). · ★248 · Updated 6 months ago
- [USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral… · ★54 · Updated 9 months ago
- ★158 · Updated last year