ROCm / Megatron-LMLinks

Ongoing research training transformer models at scale

☆33

Alternatives and similar repositories for Megatron-LM

Users that are interested in Megatron-LM are comparing it to the libraries listed below

Sorting:

ROCm / MAD
MAD (Model Automation and Dashboarding)
☆30Updated last week
Azure / msccl
Microsoft Collective Communication Library
☆66Updated last year
intel / intel-extension-for-deepspeed
Intel® Extension for DeepSpeed* is an extension to DeepSpeed that brings feature support with SYCL kernels on Intel GPU(XPU) device. Note…
☆63Updated 5 months ago
ROCm / TransformerEngine
☆51Updated this week
Azure / msccl-executor-nccl
☆47Updated 11 months ago
triton-lang / kernels
☆94Updated last year
ROCm / iris
AMD RAD's multi-GPU Triton-based framework for seamless multi-GPU programming
☆116Updated last week
NVIDIA / nvidia-resiliency-ext
NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the …
☆235Updated last week
ROCm / rccl-tests
RCCL Performance Benchmark Tests
☆79Updated last week
yifuwang / symm-mem-recipes
☆147Updated 11 months ago
ROCm / aiter
AI Tensor Engine for ROCm
☆306Updated this week
argonne-lcf / LLM-Inference-Bench
LLM-Inference-Bench
☆57Updated 4 months ago
foundation-model-stack / foundation-model-stack
🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components.
☆217Updated last week
microsoft / nnscaler
nnScaler: Compiling DNN models for Parallel Training
☆119Updated 2 months ago
meta-pytorch / torchcomms
torchcomms: a modern PyTorch communications API
☆295Updated this week
mlcommons / training_results_v2.0
This repository contains the results and code for the MLPerf™ Training v2.0 benchmark.
☆29Updated last year
intel / torch-xpu-ops
☆63Updated last week
intel / sycl-tla
SYCL* Templates for Linear Algebra (SYCL*TLA) - SYCL based CUTLASS implementation for Intel GPUs
☆55Updated this week
mk1-project / quickreduce
QuickReduce is a performant all-reduce library designed for AMD ROCm that supports inline compression.
☆35Updated 3 months ago
hao-ai-lab / MuxServe
☆79Updated last month
NVIDIA / jaxpp
JaxPP is a library for JAX that enables flexible MPMD pipeline parallelism for large-scale LLM training
☆57Updated last week
ROCm / triton
Development repository for the Triton language and compiler
☆137Updated this week
tgale96 / grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
☆131Updated 6 months ago
usyd-fsalab / fp6_llm
An efficient GPU support for LLM inference with x-bit quantization (e.g. FP6,FP5).
☆272Updated 4 months ago
ROCm / vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
☆110Updated this week
ParCoreLab / Snoopie
Multi-GPU communication profiler and visualizer
☆36Updated last year
intel / torch-ccl
oneCCL Bindings for Pytorch* (deprecated)
☆103Updated 3 weeks ago
ROCm / flash-attention
Fast and memory-efficient exact attention
☆201Updated last month
cchan / tccl
extensible collectives library in triton
☆91Updated 8 months ago
meta-pytorch / applied-ai
Applied AI experiments and examples for PyTorch
☆307Updated 3 months ago