boweiliu / nccl
Optimized primitives for collective multi-GPU communication
☆9 · Updated last year
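NCCL provides collective primitives (all-reduce, all-gather, broadcast, reduce, reduce-scatter) across GPUs. As a minimal sketch of what the library does, the snippet below runs a single-process, multi-GPU all-reduce with the public NCCL C API; the device count and element count are illustrative assumptions, not values taken from this repository.

```c
// Minimal single-process, multi-GPU all-reduce sketch using the public NCCL API.
// Device count and buffer size below are assumptions for illustration only.
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

#define CUDACHECK(cmd) do { cudaError_t e = (cmd); \
  if (e != cudaSuccess) { fprintf(stderr, "CUDA: %s\n", cudaGetErrorString(e)); exit(1); } } while (0)
#define NCCLCHECK(cmd) do { ncclResult_t r = (cmd); \
  if (r != ncclSuccess) { fprintf(stderr, "NCCL: %s\n", ncclGetErrorString(r)); exit(1); } } while (0)

int main(void) {
  const int nDev = 2;            // assumption: two visible GPUs
  const size_t count = 1 << 20;  // assumption: 1M floats per device
  int devs[2] = {0, 1};
  ncclComm_t comms[2];
  float *sendbuf[2], *recvbuf[2];
  cudaStream_t streams[2];

  // Allocate per-device buffers and streams.
  for (int i = 0; i < nDev; ++i) {
    CUDACHECK(cudaSetDevice(devs[i]));
    CUDACHECK(cudaMalloc((void**)&sendbuf[i], count * sizeof(float)));
    CUDACHECK(cudaMalloc((void**)&recvbuf[i], count * sizeof(float)));
    CUDACHECK(cudaStreamCreate(&streams[i]));
  }

  // One communicator per device, all owned by this process.
  NCCLCHECK(ncclCommInitAll(comms, nDev, devs));

  // Group the per-device calls so NCCL can launch them as one collective.
  NCCLCHECK(ncclGroupStart());
  for (int i = 0; i < nDev; ++i)
    NCCLCHECK(ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                            comms[i], streams[i]));
  NCCLCHECK(ncclGroupEnd());

  // Wait for completion, then release resources.
  for (int i = 0; i < nDev; ++i) {
    CUDACHECK(cudaSetDevice(devs[i]));
    CUDACHECK(cudaStreamSynchronize(streams[i]));
    CUDACHECK(cudaFree(sendbuf[i]));
    CUDACHECK(cudaFree(recvbuf[i]));
  }
  for (int i = 0; i < nDev; ++i) ncclCommDestroy(comms[i]);
  return 0;
}
```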
Alternatives and similar repositories for nccl
Users interested in nccl are comparing it to the libraries listed below.
- ☆314 · Updated 11 months ago
- ☆21 · Updated 5 months ago
- NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves the … ☆200 · Updated this week
- ☆38 · Updated this week
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ☆377 · Updated this week
- ☆20 · Updated 9 months ago
- Learn CUDA with PyTorch ☆33 · Updated 3 weeks ago
- Module, Model, and Tensor Serialization/Deserialization ☆250 · Updated this week
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆258 · Updated 2 weeks ago
- ring-attention experiments ☆147 · Updated 9 months ago
- An implementation of the Llama architecture, to instruct and delight ☆21 · Updated 2 months ago
- Two implementations of ZeRO-1 optimizer sharding in JAX ☆14 · Updated 2 years ago
- CUDA checkpoint and restore utility ☆357 · Updated 6 months ago
- Load compute kernels from the Hub ☆233 · Updated this week
- This repository contains the experimental PyTorch native float8 training UX ☆224 · Updated last year
- Implementation of Flash Attention in Jax ☆215 · Updated last year
- Experiment of using Tangent to autodiff triton ☆80 · Updated last year
- xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool to help Cloud developers orchestrate training jobs on accelerat… ☆136 · Updated this week
- ☆83 · Updated last year
- Custom triton kernels for training Karpathy's nanoGPT. ☆19 · Updated 9 months ago
- ☆20 · Updated 2 years ago
- Transformer with Mu-Parameterization, implemented in Jax/Flax. Supports FSDP on TPU pods. ☆32 · Updated 2 months ago
- Collection of kernels written in Triton language ☆142 · Updated 4 months ago
- PyTorch Single Controller ☆345 · Updated this week
- A bunch of kernels that might make stuff slower 😉 ☆56 · Updated last week
- Simple (fast) transformer inference in PyTorch with torch.compile + lit-llama code ☆11 · Updated last year
- The simplest, fastest repository for training/finetuning medium-sized GPTs. ☆149 · Updated last month
- extensible collectives library in triton ☆88 · Updated 4 months ago
- A parallel framework for training deep neural networks ☆63 · Updated 4 months ago
- A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind… ☆158 · Updated last month