awslabs / s3-connector-for-pytorch
The Amazon S3 Connector for PyTorch delivers high throughput for PyTorch training jobs that access and store data in Amazon S3.
☆166 Updated this week
Alternatives and similar repositories for s3-connector-for-pytorch
Users interested in s3-connector-for-pytorch are comparing it to the libraries listed below.
- ☆167 Updated 2 years ago
- ☆47 Updated last month
- EFA/NCCL base AMI build Packer and CodeBuild/Pipeline files. Also base Docker build files to enable EFA/NCCL in containers ☆43 Updated last year
- Create, List, Update, Delete Amazon EKS clusters. Deploy and manage software on EKS. Run distributed model training and inference example… ☆59 Updated 2 weeks ago
- ☆110 Updated 5 months ago
- Container plugin for Slurm Workload Manager ☆347 Updated 7 months ago
- Module, Model, and Tensor Serialization/Deserialization ☆240 Updated last week
- TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and sup… ☆367 Updated last week
- Deploying EFA in EKS utilizing GPUDirectRDMA where supported ☆37 Updated 8 months ago
- ☆58 Updated last month
- Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips. ☆230 Updated this week
- A helper library to connect into Amazon SageMaker with AWS Systems Manager and SSH (Secure Shell) ☆246 Updated 3 months ago
- ☆72 Updated last year
- Kubernetes Operator, ansible playbooks, and production scripts for large-scale AIStore deployments on Kubernetes. ☆98 Updated last week
- Example code for AWS Neuron SDK developers building inference and training applications ☆149 Updated 2 weeks ago
- Scalable and Performant Data Loading ☆278 Updated this week
- A high performance data access library for machine learning tasks ☆74 Updated last year
- This is a plugin which lets EC2 developers use libfabric as network provider while running NCCL applications. ☆176 Updated this week
- NVIDIA Resiliency Extension is a python package for framework developers and users to implement fault-tolerant features. It improves the … ☆179 Updated 2 weeks ago
- AWS virtual gpu device plugin provides capability to use smaller virtual gpus for your machine learning inference workloads ☆205 Updated last year
- ☆62 Updated 4 months ago
- ☆221 Updated this week
- A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind… ☆157 Updated this week
- CUDA checkpoint and restore utility ☆345 Updated 4 months ago
- PyTorch per step fault tolerance (actively under development) ☆329 Updated this week
- Tools to deploy GPU clusters in the Cloud ☆31 Updated 2 years ago
- Create and manage Amazon SageMaker HyperPod clusters, run distributed model training ☆23 Updated last month
- This Guidance demonstrates how to deploy a machine learning inference architecture on Amazon Elastic Kubernetes Service (Amazon EKS). It … ☆44 Updated 3 weeks ago
- KubeFlow on AWS ☆184 Updated 2 weeks ago
- Collection of best practices, reference architectures, model training examples and utilities to train large models on AWS. ☆313 Updated this week