Research and development for optimizing transformers
☆131 · Feb 16, 2021 · Updated 5 years ago
Alternatives and similar repositories for substation
Users who are interested in substation are comparing it to the libraries listed below.
- An external memory allocator example for PyTorch. ☆16 · Aug 10, 2025 · Updated 6 months ago
- PSTensor provides a way to hack the memory management of tensors in TensorFlow and PyTorch by defining your own C++ Tensor class. ☆10 · Feb 10, 2022 · Updated 4 years ago
- DaCe - Data Centric Parallel Programming ☆573 · Updated this week
- FTPipe and related pipeline model parallelism research. ☆44 · May 16, 2023 · Updated 2 years ago
- ☆20 · Jun 3, 2023 · Updated 2 years ago
- Dynamic Tensor Rematerialization prototype (modified PyTorch) and simulator. Paper: https://arxiv.org/abs/2006.09616 ☆133 · Jul 6, 2023 · Updated 2 years ago
- PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models. ICML 2021 ☆56 · Jul 21, 2021 · Updated 4 years ago
- ☆78 · May 4, 2021 · Updated 4 years ago
- ☆251 · Jul 25, 2024 · Updated last year
- Distributed Communication-Optimal LU-factorization Algorithm ☆12 · Aug 1, 2021 · Updated 4 years ago
- Analyze network performance in distributed training ☆20 · Oct 20, 2020 · Updated 5 years ago
- Source code repo for paper "TLDR: Token Loss Dynamic Reweighting for Reducing Repetitive Utterance Generation" ☆10 · Aug 11, 2023 · Updated 2 years ago
- ☆13 · Mar 27, 2020 · Updated 5 years ago
- A library to analyze PyTorch traces. ☆467 · Feb 4, 2026 · Updated 3 weeks ago
- BytePS examples (Vision, NLP, GAN, etc.) ☆19 · Nov 24, 2022 · Updated 3 years ago
- MONeT framework for reducing memory consumption of DNN training ☆174 · May 4, 2021 · Updated 4 years ago
- A Data-Centric Compiler for Machine Learning ☆85 · Dec 14, 2025 · Updated 2 months ago
- Benchmarking some transformer deployments ☆26 · Dec 15, 2025 · Updated 2 months ago
- An Attention Superoptimizer ☆22 · Jan 20, 2025 · Updated last year
- This repository has moved; please visit https://github.com/ai2cm/pace for the latest development of fv3core. ☆13 · Dec 21, 2022 · Updated 3 years ago
- A fast and user-friendly runtime for transformer inference (Bert, Albert, GPT2, Decoders, etc.) on CPU and GPU. ☆1,543 · Jul 18, 2025 · Updated 7 months ago
- A GPipe implementation in PyTorch ☆863 · Jul 25, 2024 · Updated last year
- ☆41 · Jun 18, 2021 · Updated 4 years ago
- A tensor-aware point-to-point communication primitive for machine learning ☆283 · Dec 17, 2025 · Updated 2 months ago
- OSLO: Open Source framework for Large-scale model Optimization ☆309 · Aug 25, 2022 · Updated 3 years ago
- C++/MPI proxies for distributed training of deep neural networks. ☆15 · Jun 18, 2022 · Updated 3 years ago
- ☆13 · Jan 23, 2021 · Updated 5 years ago
- A GPU performance profiling tool for PyTorch models ☆510 · Jul 13, 2021 · Updated 4 years ago
- PipeSwitch: Fast Pipelined Context Switching for Deep Learning Applications ☆127 · May 9, 2022 · Updated 3 years ago
- Matrix multiplication on GPUs for matrices stored on a CPU. Similar to cublasXt, but ported to both NVIDIA and AMD GPUs. ☆32 · Apr 2, 2025 · Updated 11 months ago
- Large scale graph learning on a single machine. ☆167 · Feb 25, 2025 · Updated last year
- ☆44 · Sep 6, 2021 · Updated 4 years ago
- PyTorch extensions for high performance and large scale training. ☆3,400 · Apr 26, 2025 · Updated 10 months ago
- Sequence-level 1F1B schedule for LLMs. ☆38 · Aug 26, 2025 · Updated 6 months ago
- The code for our paper "Neural Architecture Search as Program Transformation Exploration" ☆16 · Apr 28, 2021 · Updated 4 years ago
- ☆145 · Jan 30, 2025 · Updated last year
- Scalable PaLM implementation in PyTorch ☆190 · Dec 19, 2022 · Updated 3 years ago
- An open-source efficient deep learning framework/compiler, written in Python. ☆737 · Sep 4, 2025 · Updated 5 months ago
- Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training ☆222 · Aug 19, 2024 · Updated last year