A Python library that transfers PyTorch tensors between CPU and NVMe
☆125 · Nov 27, 2024 · Updated last year
Alternatives and similar repositories for TensorNVMe
Users that are interested in TensorNVMe are comparing it to the libraries listed below
- ☆30 · Sep 4, 2023 · Updated 2 years ago
- Examples of training models with hybrid parallelism using ColossalAI · ☆339 · Mar 23, 2023 · Updated 2 years ago
- ☆12 · Apr 30, 2024 · Updated last year
- ☆12 · Sep 1, 2023 · Updated 2 years ago
- A collection of models built with ColossalAI · ☆32 · Nov 22, 2022 · Updated 3 years ago
- A memory efficient DLRM training solution using ColossalAI · ☆107 · Nov 22, 2022 · Updated 3 years ago
- PyTorch implementation of LAMB for ImageNet/ResNet-50 training · ☆13 · May 13, 2021 · Updated 4 years ago
- TVMScript kernel for deformable attention · ☆25 · Dec 15, 2021 · Updated 4 years ago
- ☆28 · Jul 11, 2021 · Updated 4 years ago
- ☆14 · Nov 7, 2025 · Updated 3 months ago
- ☆53 · Updated this week
- IntLLaMA: A fast and light quantization solution for LLaMA · ☆18 · Jul 21, 2023 · Updated 2 years ago
- ☆218 · Nov 23, 2025 · Updated 3 months ago
- A curated list of awesome projects and papers for distributed training or inference · ☆266 · Oct 8, 2024 · Updated last year
- A practical way of learning Swizzle · ☆37 · Feb 3, 2025 · Updated last year
- ☆120 · Jan 8, 2026 · Updated last month
- ☆21 · Jun 6, 2024 · Updated last year
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. · ☆123 · Dec 25, 2025 · Updated 2 months ago
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity · ☆234 · Sep 24, 2023 · Updated 2 years ago
- a simple API to use CUPTI · ☆11 · Aug 19, 2025 · Updated 6 months ago
- [PACT'24] GraNNDis. A fast and unified distributed graph neural network (GNN) training framework for both full-batch (full-graph) and min… · ☆10 · Aug 13, 2024 · Updated last year
- Persistent dense gemm for Hopper in `CuTeDSL` · ☆15 · Aug 9, 2025 · Updated 6 months ago
- A very simple variant of adversarial training that yields excellent results on MNIST · ☆12 · Mar 19, 2016 · Updated 9 years ago
- A GPU (CUDA) accelerated set of tools for object detection using waldboost/LBP. · ☆10 · May 25, 2015 · Updated 10 years ago
- A toolkit for developers to simplify the transformation of nn.Module instances. It's now corresponding to Pytorch.fx. · ☆13 · Apr 7, 2023 · Updated 2 years ago
- ☆12 · Mar 13, 2023 · Updated 2 years ago
- DQN-MxNet-Gluon · ☆23 · Nov 12, 2017 · Updated 8 years ago
- NVIDIA cuTile learn · ☆163 · Dec 9, 2025 · Updated 2 months ago
- Automated Parallelization System and Infrastructure for Multiple Ecosystems · ☆82 · Nov 19, 2024 · Updated last year
- Fast low-bit matmul kernels in Triton · ☆433 · Feb 1, 2026 · Updated 3 weeks ago
- RPCNIC: A High-Performance and Reconfigurable PCIe-attached RPC Accelerator [HPCA2025] · ☆13 · Dec 9, 2024 · Updated last year
- The (open-source part of) code to reproduce "BPPSA: Scaling Back-propagation by Parallel Scan Algorithm". · ☆13 · Jun 7, 2021 · Updated 4 years ago
- Chimera: bidirectional pipeline parallelism for efficiently training large-scale models. · ☆70 · Mar 20, 2025 · Updated 11 months ago
- ☆52 · May 19, 2025 · Updated 9 months ago
- Scalable PaLM implementation of PyTorch · ☆190 · Dec 19, 2022 · Updated 3 years ago
- Training and serving large-scale neural networks with auto parallelization. · ☆3,183 · Dec 9, 2023 · Updated 2 years ago
- PyTorch distributed training acceleration framework · ☆54 · Aug 13, 2025 · Updated 6 months ago
- Official repository for DistFlashAttn: Distributed Memory-efficient Attention for Long-context LLMs Training · ☆222 · Aug 19, 2024 · Updated last year
- Efficient Auto-scalable Scientific Infrastructure for Engineers and Researchers · ☆15 · Sep 8, 2025 · Updated 5 months ago