ROCm / Megatron-LM
Ongoing research training transformer models at scale
☆33 · Updated this week
Alternatives and similar repositories for Megatron-LM
Users interested in Megatron-LM are comparing it to the libraries listed below.
- MAD (Model Automation and Dashboarding) ☆30 · Updated last week
- Microsoft Collective Communication Library ☆66 · Updated last year
- Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU (XPU) devices. Note… ☆63 · Updated 5 months ago
- ☆51 · Updated this week
- ☆47 · Updated 11 months ago
- ☆94 · Updated last year
- AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming ☆116 · Updated last week
- NVIDIA Resiliency Extension is a Python package for framework developers and users to implement fault-tolerant features. It improves the … ☆235 · Updated last week
- RCCL Performance Benchmark Tests ☆79 · Updated last week
- ☆147 · Updated 11 months ago
- AI Tensor Engine for ROCm ☆306 · Updated this week
- LLM-Inference-Bench ☆57 · Updated 4 months ago
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ☆217 · Updated last week
- nnScaler: Compiling DNN models for parallel training ☆119 · Updated 2 months ago
- torchcomms: a modern PyTorch communications API ☆295 · Updated this week
- This repository contains the results and code for the MLPerf™ Training v2.0 benchmark. ☆29 · Updated last year
- ☆63 · Updated last week
- SYCL* Templates for Linear Algebra (SYCL*TLA) — a SYCL-based CUTLASS implementation for Intel GPUs ☆55 · Updated this week
- QuickReduce is a performant all-reduce library designed for AMD ROCm that supports inline compression. ☆35 · Updated 3 months ago
- ☆79 · Updated last month
- JaxPP is a library for JAX that enables flexible MPMD pipeline parallelism for large-scale LLM training. ☆57 · Updated last week
- Development repository for the Triton language and compiler ☆137 · Updated this week
- PyTorch bindings for CUTLASS grouped GEMM ☆131 · Updated 6 months ago
- Efficient GPU support for LLM inference with x-bit quantization (e.g. FP6, FP5) ☆272 · Updated 4 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆110 · Updated this week
- Multi-GPU communication profiler and visualizer ☆36 · Updated last year
- oneCCL Bindings for Pytorch* (deprecated) ☆103 · Updated 3 weeks ago
- Fast and memory-efficient exact attention ☆201 · Updated last month
- Extensible collectives library in Triton ☆91 · Updated 8 months ago
- Applied AI experiments and examples for PyTorch ☆307 · Updated 3 months ago