NVIDIA / NeMo-Framework-Launcher
Provides end-to-end model development pipelines for LLMs and multimodal models that can be launched on-premises or in cloud-native environments.
☆486 · Updated 3 weeks ago
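For context, the launcher drives its pipelines from Hydra-style YAML configs through a Python entry point. Below is a minimal sketch of kicking off a training stage; the `launcher_scripts/main.py` path and the `stages`/`cluster`/`training` override keys are assumptions based on common launcher layouts and may differ across releases, so verify them against the `conf/` directory of your checkout.

```python
# Minimal sketch: launch a training stage via the launcher's CLI.
# ASSUMPTIONS (not taken from this page): the Hydra entry point is
# launcher_scripts/main.py, and the override keys below (stages, cluster,
# training) match your installed release -- verify against conf/config.yaml.
import subprocess

overrides = [
    "stages=[training]",             # pipeline stages to run
    "cluster=bcm",                   # assumed key: Slurm / Base Command cluster target
    "training=gpt3/5b",              # assumed recipe path: a bundled GPT-3 5B config
    "training.trainer.num_nodes=1",  # Hydra dot-override into the training config
]

# Equivalent to: python launcher_scripts/main.py stages=[training] cluster=bcm ...
subprocess.run(["python", "launcher_scripts/main.py", *overrides], check=True)
```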
Alternatives and similar repositories for NeMo-Framework-Launcher:
Users interested in NeMo-Framework-Launcher are comparing it to the libraries listed below.
- The Triton TensorRT-LLM Backend ☆758 · Updated 3 weeks ago
- Scalable toolkit for efficient model alignment ☆697 · Updated this week
- FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens ☆690 · Updated 4 months ago
- ☆218 · Updated this week
- Easy and lightning-fast training of 🤗 Transformers on Habana Gaudi processors (HPU) ☆166 · Updated this week
- Microsoft Automatic Mixed Precision Library ☆554 · Updated 4 months ago
- ☆411 · Updated last year
- 🏋️ A unified multi-backend utility for benchmarking Transformers, Timm, PEFT, Diffusers and Sentence-Transformers with full support of O… ☆284 · Updated this week
- Serving multiple LoRA finetuned LLMs as one ☆1,018 · Updated 8 months ago
- Easy and Efficient Quantization for Transformers ☆192 · Updated last month
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs… ☆2,121 · Updated this week
- Fast Inference Solutions for BLOOM ☆563 · Updated 3 months ago
- Pipeline Parallelism for PyTorch ☆739 · Updated 5 months ago
- MII makes low-latency and high-throughput inference possible, powered by DeepSpeed ☆1,950 · Updated this week
- Large Context Attention ☆677 · Updated last week
- GPTQ inference Triton kernel ☆292 · Updated last year
- LLMPerf is a library for validating and benchmarking LLMs ☆710 · Updated last month
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆50 · Updated this week
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆257 · Updated 3 months ago
- Triton Model Analyzer is a CLI tool that helps you understand the compute and memory requirements of the Triton Inference Serv… ☆448 · Updated 2 weeks ago
- [ICML 2024] SqueezeLLM: Dense-and-Sparse Quantization ☆669 · Updated 5 months ago
- A tool to configure, launch and manage your machine learning experiments ☆107 · Updated this week
- Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers ☆204 · Updated 5 months ago
- 🚀 Collection of components for development, training, tuning, and inference of foundation models leveraging PyTorch native components ☆186 · Updated last week
- [ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding ☆1,183 · Updated 3 months ago
- [NeurIPS 2024] KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization ☆327 · Updated 5 months ago
- Latency and Memory Analysis of Transformer Models for Training and Inference ☆378 · Updated 2 months ago
- Ongoing research training transformer language models at scale, including: BERT & GPT-2 ☆1,358 · Updated 10 months ago
- A throughput-oriented high-performance serving framework for LLMs ☆714 · Updated 4 months ago
- Zero Bubble Pipeline Parallelism ☆317 · Updated 2 months ago