GoogleCloudPlatform / nvidia-nemo-on-gke
Training NVIDIA NeMo Megatron Large Language Model (LLM) using NeMo Framework on Google Kubernetes Engine
☆12Updated 3 weeks ago
Alternatives and similar repositories for nvidia-nemo-on-gke:
Users who are interested in nvidia-nemo-on-gke are comparing it to the repositories listed below
- Recipes for reproducing training and serving benchmarks for large machine learning models using GPUs on Google Cloud.☆56Updated this week
- Testing framework for Deep Learning models (TensorFlow and PyTorch) on Google Cloud hardware accelerators (TPU and GPU)☆64Updated 4 months ago
- xpk (Accelerated Processing Kit, pronounced x-p-k) is a software tool to help Cloud developers to orchestrate training jobs on accelerat…☆115Updated this week
- ☆43Updated 2 months ago
- JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs wel…☆315Updated this week
- AI on GKE is a collection of examples, best-practices, and prebuilt solutions to help build, deploy, and scale AI Platforms on Google Kub…☆302Updated last week
- A collection of YAML files, Helm Charts, Operator code, and guides to act as an example reference implementation for NVIDIA NIM deploymen…☆177Updated last week
- A simplified and automated orchestration workflow to perform ML end-to-end (E2E) model tests and benchmarking on Cloud VMs across differe…☆43Updated this week
- ☆16Updated last month
- Infrastructure as code for GPU accelerated managed Kubernetes clusters.☆55Updated 2 weeks ago
- Cluster Toolkit is open-source software offered by Google Cloud that makes it easy for customers to deploy AI/ML and HPC environments…☆245Updated this week
- The Amazon S3 Connector for PyTorch delivers high throughput for PyTorch training jobs that access and store data in Amazon S3.☆153Updated last week
- An Operator for deployment and maintenance of NVIDIA NIMs and NeMo microservices in a Kubernetes environment.☆92Updated this week
- Deploying EFA in EKS utilizing GPUDirectRDMA where supported☆37Updated 6 months ago
- Run cloud native workloads on NVIDIA GPUs☆168Updated last week
- MLflow Deployment Plugin for Ray Serve☆44Updated 3 years ago
- ☆137Updated this week
- Test infrastructure and tooling for Kubeflow.☆63Updated 2 months ago
- Volume Controller for Kubernetes☆67Updated 2 years ago
- This repository contains example integrations between Determined and other ML products☆48Updated last year
- Secure HDFS Access from Kubernetes☆60Updated 4 years ago
- Large Language Model Text Generation Inference on Habana Gaudi☆32Updated 3 weeks ago
- Apache YuniKorn Scheduler Interface☆28Updated last month
- Kubernetes Operator, Ansible playbooks, and production scripts for large-scale AIStore deployments on Kubernetes.☆93Updated this week
- Terraform module for creating GKE clusters to run Kubeflow☆215Updated 4 years ago
- Performance optimization for Spark running on Kubernetes☆87Updated 4 years ago
- Seldon Core Operator for Kubernetes☆12Updated 5 years ago
- Collection of tools and examples for managing Accelerated workloads in Kubernetes Engine☆229Updated this week
- Repository used to maintain group ACLs used by Kubeflow developers☆17Updated last week
- ☆13Updated this week