boweiliu / ncclLinks

Optimized primitives for collective multi-GPU communication

☆10

Alternatives and similar repositories for nccl

Users that are interested in nccl are comparing it to the libraries listed below

Sorting:

lianakoleva / no-libtorch-compile
☆21Updated 8 months ago
thecharlieblake / lovely-llama
An implementation of the Llama architecture, to instruct and delight
☆21Updated 5 months ago
meta-pytorch / torchcomms
torchcomms: a modern PyTorch communications API
☆291Updated this week
imbue-ai / cluster-health
☆316Updated last year
srush / triton-autodiff
Experiment of using Tangent to autodiff triton
☆80Updated last year
foundation-model-stack / fms-fsdp
🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash…
☆272Updated 2 weeks ago
graphcore-research / jax-scalify
JAX Scalify: end-to-end scaled arithmetics
☆17Updated last year
ezyang / torchdbg
PyTorch centric eager mode debugger
☆48Updated 11 months ago
lucaslingle / mu_transformer
Transformer with Mu-Parameterization, implemented in Jax/Flax. Supports FSDP on TPU pods.
☆32Updated 5 months ago
pytorch / test-infra
This repository hosts code that supports the testing infrastructure for the PyTorch organization. For example, this repo hosts the logic …
☆103Updated this week
meta-pytorch / torchft
Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)
☆452Updated last week
NVIDIA / nvidia-resiliency-ext
NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the …
☆234Updated last week
gpu-mode / discord-cluster-manager
Write a fast kernel and run it on Discord. See how you compare against the best!
☆61Updated last week
graphcore-research / unit-scaling
A library for unit scaling in PyTorch
☆132Updated 4 months ago
cloneofsimo / min-fsdp
☆91Updated last year
nshepperd / flash_attn_jax
JAX bindings for Flash Attention v2
☆98Updated 2 weeks ago
Jaykef / Triton-nanoGPT
Custom triton kernels for training Karpathy's nanoGPT.
☆19Updated last year
google-deepmind / asyncdiloco
☆47Updated last year
siboehm / ShallowSpeed
Small scale distributed training of sequential deep learning models, built on Numpy and MPI.
☆150Updated 2 years ago
coreweave / ml-containers
☆42Updated 3 weeks ago
open-lm-engine / accelerated-model-architectures
A bunch of kernels that might make stuff slower 😉
☆64Updated this week
meta-pytorch / torchsnapshot
A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind…
☆161Updated 2 months ago
meta-pytorch / float8_experimental
This repository contains the experimental PyTorch native float8 training UX
☆225Updated last year
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆216Updated last week
gpu-mode / ring-attention
ring-attention experiments
☆155Updated last year
cchan / tccl
extensible collectives library in triton
☆91Updated 7 months ago
huggingface / kernels
Load compute kernels from the Hub
☆327Updated last week
zinccat / Awesome-Triton-Kernels
Collection of kernels written in Triton language
☆167Updated 7 months ago
vdesai2014 / inference-optimization-blog-post
☆89Updated last year
DataStates / datastates-llm
LLM checkpointing for DeepSpeed/Megatron
☆21Updated last month