NVIDIA / aistore
AIStore: scalable storage for AI applications
☆1,465 Updated this week
Alternatives and similar repositories for aistore:
Users who are interested in aistore are comparing it to the libraries listed below.
- Collective communications library with various primitives for multi-machine training. ☆1,288 Updated this week
- A CPU+GPU Profiling library that provides access to timeline traces and hardware performance counters. ☆794 Updated this week
- TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and sup… ☆358 Updated this week
- RAFT contains fundamental widely-used algorithms and primitives for machine learning and information retrieval. The algorithms are CUDA-a… ☆869 Updated this week
- A simple yet powerful tool to turn traditional container/OS images into unprivileged sandboxes. ☆731 Updated 4 months ago
- Kubernetes Operator, ansible playbooks, and production scripts for large-scale AIStore deployments on Kubernetes. ☆93 Updated this week
- cuVS - a library for vector search and clustering on the GPU ☆373 Updated this week
- PyTorch elastic training ☆730 Updated 2 years ago
- common in-memory tensor structure ☆978 Updated last week
- A toolkit to run Ray applications on Kubernetes ☆1,664 Updated last week
- NVIDIA Data Center GPU Manager (DCGM) is a project for gathering telemetry and measuring the health of NVIDIA GPUs ☆493 Updated 2 months ago
- Optimized primitives for collective multi-GPU communication ☆3,659 Updated last week
- Tools for building GPU clusters ☆1,334 Updated last week
- RAPIDS Memory Manager ☆569 Updated this week
- A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper, Ada and Bla… ☆2,365 Updated this week
- Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet f… ☆1,830 Updated last year
- A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology ☆1,059 Updated 3 weeks ago
- Modern columnar data format for ML and LLMs implemented in Rust. Convert from parquet in 2 lines of code for 100x faster random access, v… ☆4,454 Updated this week
- NCCL Tests ☆1,065 Updated last month
- Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group) ☆1,303 Updated this week
- Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.) ☆474 Updated this week
- FB (Facebook) + GEMM (General Matrix-Matrix Multiplication) - https://code.fb.com/ml-applications/fbgemm/ ☆1,299 Updated this week
- Unified high-performance Python client for object and file stores. ☆23 Updated last week
- CUDA checkpoint and restore utility ☆325 Updated 2 months ago
- Slurm: A Highly Scalable Workload Manager ☆3,021 Updated this week
- Container plugin for Slurm Workload Manager ☆335 Updated 5 months ago
- Determined is an open-source machine learning platform that simplifies distributed training, hyperparameter tuning, experiment tracking, … ☆3,133 Updated last month
- NVIDIA container runtime library ☆939 Updated this week
- Distributed ML Training and Fine-Tuning on Kubernetes ☆1,762 Updated this week
- JetStream is a throughput and memory optimized engine for LLM inference on XLA devices, starting with TPUs (and GPUs in future -- PRs wel… ☆315 Updated this week