NVIDIA / NeMo-Run
A tool to configure, launch and manage your machine learning experiments.
★122 · Updated this week
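As a quick orientation before the list of alternatives, here is a minimal sketch of what configuring and launching a task with NeMo-Run can look like. It follows the project's quickstart pattern (`run.Partial`, `run.LocalExecutor`, `run.run`); treat the exact signatures and argument names as assumptions to check against the installed version, not a definitive reference.

```python
# Minimal NeMo-Run sketch (assumed API, per the quickstart pattern):
# wrap a plain Python function as a configurable task and launch it locally.
import nemo_run as run

def train(epochs: int = 10, lr: float = 1e-3) -> None:
    # Stand-in for a real training loop.
    print(f"training for {epochs} epochs at lr={lr}")

if __name__ == "__main__":
    task = run.Partial(train, epochs=20)         # override defaults at configure time
    run.run(task, executor=run.LocalExecutor())  # swap the executor to launch elsewhere (e.g. Slurm)
```

The same task object can be handed to a different executor without touching the training code, which is the configure/launch/manage split the description above refers to.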
Alternatives and similar repositories for NeMo-Run:
Users interested in NeMo-Run are comparing it to the libraries listed below.
- 🚀 Efficiently (pre)training foundation models with native PyTorch features, including FSDP for training and SDPA implementation of Flash… ★225 · Updated this week
- Applied AI experiments and examples for PyTorch ★232 · Updated this week
- Megatron's multi-modal data loader ★167 · Updated this week
- PyTorch per step fault tolerance (actively under development) ★253 · Updated last week
- This repository contains the experimental PyTorch native float8 training UX ★221 · Updated 7 months ago
- A high-throughput and memory-efficient inference and serving engine for LLMs ★260 · Updated 4 months ago (a minimal usage sketch follows this list)
- Scalable and Performant Data Loading ★222 · Updated this week
- LLM KV cache compression made easy ★412 · Updated last week
- Fast low-bit matmul kernels in Triton ★250 · Updated last week
- ★200 · Updated last month
- PyTorch/XLA integration with JetStream (https://github.com/google/JetStream) for LLM inference ★53 · Updated 3 weeks ago
- Fast Matrix Multiplications for Lookup Table-Quantized LLMs ★231 · Updated last week
- ★100 · Updated 6 months ago
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components. ★188 · Updated this week
- ★176 · Updated 5 months ago
- Efficient LLM Inference over Long Sequences ★362 · Updated 2 weeks ago
- ★186 · Updated last week
- Google TPU optimizations for transformers models ★100 · Updated last month
- PyTorch RFCs (experimental) ★130 · Updated 6 months ago
- The Triton backend for the PyTorch TorchScript models. ★144 · Updated this week
- ring-attention experiments ★126 · Updated 4 months ago
- Code repo for the paper "SpinQuant: LLM quantization with learned rotations" ★218 · Updated 2 weeks ago
- Cataloging released Triton kernels. ★176 · Updated last month
- ★231 · Updated last week
- ★60 · Updated this week
- Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers ★206 · Updated 6 months ago
- Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients. ★195 · Updated 7 months ago
- Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU) ★173 · Updated this week
- OpenAI compatible API for TensorRT LLM triton backend ★198 · Updated 7 months ago
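For the vLLM entry above (the high-throughput inference and serving engine for LLMs), a minimal offline-inference sketch using its public `LLM`/`SamplingParams` API is shown below; the model id and sampling values are illustrative only, not recommendations.

```python
# Minimal offline-inference sketch with vLLM; the model id and sampling
# parameters here are placeholders chosen for illustration.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # any HF-compatible model id
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["NeMo-Run is a tool that"], params)
for out in outputs:
    print(out.outputs[0].text)
```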