soumyadipghosh / eventgrad
Event-Triggered Communication in Parallel Machine Learning
☆25 · Updated 3 years ago
Alternatives and similar repositories for eventgrad:
Users interested in eventgrad are comparing it to the libraries listed below.
- LLM training in simple, raw C/CUDA ☆92 · Updated 11 months ago
- Codebase associated with the PyTorch compiler tutorial ☆45 · Updated 5 years ago
- DLPack for TensorFlow ☆35 · Updated 5 years ago
- Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large … ☆65 · Updated 3 years ago
- Benchmarks to capture important workloads. ☆31 · Updated 2 months ago
- Cooperative Primitives for CUDA C++ Kernel Authors. This repository contains CUB PRs from Q4 2019 until Q4 2020. ☆22 · Updated 4 years ago
- ☆26 · Updated last year
- ☆39 · Updated last year
- Customized matrix multiplication kernels ☆54 · Updated 3 years ago
- Samples demonstrating how to use the Compute Sanitizer Tools and Public API ☆79 · Updated last year
- Implementation of a TensorFlow XLA rematerialization pass ☆15 · Updated 5 years ago
- ☆18 · Updated 2 years ago
- Distributed Bayesian Optimization ☆23 · Updated 4 years ago
- Automatically insert NVTX ranges into PyTorch models ☆17 · Updated 3 years ago
- Memory Optimizations for Deep Learning (ICML 2023) ☆64 · Updated last year
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆45 · Updated 9 months ago
- Introduction to CUDA programming ☆116 · Updated 7 years ago
- Implementations and checkpoints for ResNet, Wide ResNet, ResNeXt, ResNet-D, and ResNeSt in JAX (Flax). ☆109 · Updated 2 years ago
- ☆27 · Updated 3 months ago
- Numbast is a tool to build an automated pipeline that converts CUDA APIs into Numba bindings. ☆44 · Updated this week
- A tensor-aware point-to-point communication primitive for machine learning ☆257 · Updated 2 years ago
- torch::deploy (multipy for non-torch uses) is a system that lets you get around the GIL problem by running multiple Python interpreters i… ☆180 · Updated 4 months ago
- JaxPP is a library for JAX that enables flexible MPMD pipeline parallelism for large-scale LLM training ☆26 · Updated last week
- GEMM and Winograd based convolutions using CUTLASS ☆26 · Updated 4 years ago
- ☆29 · Updated 2 years ago
- 👑 PyTorch code for the Nero optimiser. ☆20 · Updated 2 years ago
- Some CUDA design patterns and a bit of template magic for CUDA ☆150 · Updated last year
- Deadline-based hyperparameter tuning on RayTune. ☆31 · Updated 5 years ago
- Benchmarking some transformer deployments ☆26 · Updated 2 years ago
- JMP is a Mixed Precision library for JAX. ☆194 · Updated 2 months ago