boweiliu / nccl
Optimized primitives for collective multi-GPU communication
☆10 · Updated last year
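NCCL exposes its collective primitives through a C API. As a rough illustration (not taken from this fork), the sketch below runs a single-process all-reduce across all visible GPUs with one communicator per device; error handling is elided, the payload is a zero-filled stand-in, and the buffer size is arbitrary (build with e.g. `nvcc allreduce.cu -lnccl`).

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <nccl.h>

int main(void) {
  int ndev = 0;
  cudaGetDeviceCount(&ndev);

  const size_t count = 1 << 20;                          /* elements per GPU; arbitrary */
  ncclComm_t   *comms   = malloc(ndev * sizeof(ncclComm_t));
  cudaStream_t *streams = malloc(ndev * sizeof(cudaStream_t));
  float **sendbuf = malloc(ndev * sizeof(float *));
  float **recvbuf = malloc(ndev * sizeof(float *));

  /* One communicator per visible device (devices 0..ndev-1 when devlist is NULL). */
  ncclCommInitAll(comms, ndev, NULL);

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaStreamCreate(&streams[i]);
    cudaMalloc((void **)&sendbuf[i], count * sizeof(float));
    cudaMalloc((void **)&recvbuf[i], count * sizeof(float));
    cudaMemset(sendbuf[i], 0, count * sizeof(float));    /* zero-filled stand-in data */
  }

  /* Group the per-device calls so one thread can drive every GPU without deadlock. */
  ncclGroupStart();
  for (int i = 0; i < ndev; ++i)
    ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                  comms[i], streams[i]);
  ncclGroupEnd();

  for (int i = 0; i < ndev; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    cudaFree(sendbuf[i]);
    cudaFree(recvbuf[i]);
    ncclCommDestroy(comms[i]);
  }
  printf("all-reduce complete on %d GPUs\n", ndev);
  return 0;
}
```

The `ncclGroupStart`/`ncclGroupEnd` pair fuses the per-device calls into one launch, which is what lets a single thread issue collectives for several GPUs without blocking partway through.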
Alternatives and similar repositories for nccl
Users interested in nccl are comparing it to the libraries listed below:
- Two implementations of ZeRO-1 optimizer sharding in JAX ☆14 · Updated 2 years ago
- PyTorch-centric eager-mode debugger ☆48 · Updated 9 months ago
- An implementation of the Llama architecture, to instruct and delight ☆21 · Updated 4 months ago
- A FlashAttention implementation for JAX with support for efficient document-mask computation and context parallelism. ☆143 · Updated 5 months ago
- NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves the … ☆223 · Updated last week
- A library for unit scaling in PyTorch ☆130 · Updated 2 months ago
- Transformer with Mu-Parameterization, implemented in JAX/Flax. Supports FSDP on TPU pods. ☆32 · Updated 3 months ago
- A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind… ☆161 · Updated 2 weeks ago
- Experiment of using Tangent to autodiff Triton ☆81 · Updated last year
- Extensible collectives library in Triton ☆88 · Updated 6 months ago
- JAX bindings for Flash Attention v2 ☆92 · Updated 3 weeks ago
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate. ☆318 · Updated last week
- Custom Triton kernels for training Karpathy's nanoGPT. ☆19 · Updated 11 months ago
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ☆268 · Updated 2 months ago
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ☆410 · Updated last week
- This repository contains the experimental PyTorch native float8 training UX ☆224 · Updated last year
- A flexible and efficient implementation of Flash Attention 2.0 for JAX, supporting multiple backends (GPU/TPU/CPU) and platforms (Triton/… ☆27 · Updated 7 months ago
- JAX implementation of the Mistral 7B v0.2 model ☆36 · Updated last year
- A bunch of kernels that might make stuff slower 😉 ☆59 · Updated last week
- Minimal yet performant LLM examples in pure JAX ☆177 · Updated last week
- Minimal but scalable implementation of large language models in JAX ☆35 · Updated last month
- xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool that helps Cloud developers orchestrate training jobs on accelerat… ☆143 · Updated this week
- Ring-attention experiments ☆152 · Updated 11 months ago
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆213 · Updated last week
- Module, Model, and Tensor Serialization/Deserialization ☆268 · Updated last month
- Collection of kernels written in the Triton language ☆155 · Updated 5 months ago