An NCCL extension library designed to efficiently offload GPU memory allocated by the NCCL communication library.
☆98 · Dec 17, 2025 · Updated 2 months ago
Alternatives and similar repositories for asystem-amem
Users who are interested in asystem-amem are comparing it to the libraries listed below.
- ☆36 · Dec 9, 2025 · Updated 2 months ago
- An experimental communicating attention kernel based on DeepEP. ☆35 · Jul 29, 2025 · Updated 7 months ago
- ☆20 · Nov 18, 2023 · Updated 2 years ago
- Large language models to diffusion finetuning code ☆24 · Jun 2, 2025 · Updated 9 months ago
- ☆13 · Jan 7, 2025 · Updated last year
- Persistent dense gemm for Hopper in `CuTeDSL` ☆15 · Aug 9, 2025 · Updated 6 months ago
- Pytorch routines for (Ker)nel (Mac)hines ☆10 · Oct 10, 2025 · Updated 4 months ago
- APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation. A system-level optimization for scalable LLM tra… ☆51 · Oct 11, 2025 · Updated 4 months ago
- ☆25 · Oct 11, 2025 · Updated 4 months ago
- ☆18 · Nov 11, 2025 · Updated 3 months ago
- ☆53 · Feb 24, 2026 · Updated last week
- A Top-Down Profiler for GPU Applications ☆22 · Feb 29, 2024 · Updated 2 years ago
- ☆30 · Jan 9, 2026 · Updated last month
- ☆160 · Dec 27, 2024 · Updated last year
- ☆27 · Feb 9, 2026 · Updated 3 weeks ago
- Supplementary material for our paper "Compute Trends Across Three Eras of Machine Learning". ☆45 · Mar 12, 2022 · Updated 3 years ago
- train a model on huchenfeng dataset ☆51 · Dec 8, 2025 · Updated 2 months ago
- ☆29 · Dec 31, 2025 · Updated 2 months ago
- GPUDirect Async support for IB Verbs ☆135 · Nov 10, 2022 · Updated 3 years ago
- NVIDIA Inference Xfer Library (NIXL) ☆898 · Updated this week
- Distributed MoE in a Single Kernel [NeurIPS '25] ☆194 · Updated this week
- CUDA 12.2 HMM demos ☆20 · Jul 26, 2024 · Updated last year
- Automatic differentiation for Triton Kernels ☆29 · Aug 12, 2025 · Updated 6 months ago
- A high-performance RL training-inference weight synchronization framework, designed to enable second-level parameter updates from trainin… ☆132 · Dec 22, 2025 · Updated 2 months ago
- High-performance distributed data shuffling (all-to-all) library for MoE training and inference ☆112 · Updated this week
- NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer ☆163 · Feb 11, 2026 · Updated 3 weeks ago
- ☆44 · Updated this week
- This repository contains companion software for the Colfax Research paper "Categorical Foundations for CuTe Layouts". ☆109 · Sep 24, 2025 · Updated 5 months ago
- Sample Codes using NVSHMEM on Multi-GPU ☆30 · Jan 22, 2023 · Updated 3 years ago
- ☆451 · Aug 10, 2025 · Updated 6 months ago
- A collection of workload implementations for the LDBC SNB benchmark driver ☆20 · Jun 7, 2021 · Updated 4 years ago
- Supplemental materials for The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning ☆25 · May 12, 2025 · Updated 9 months ago
- Ring attention implementation with flash attention ☆986 · Sep 10, 2025 · Updated 5 months ago
- A fast communication-overlapping library for tensor/expert parallelism on GPUs. ☆1,261 · Aug 28, 2025 · Updated 6 months ago
- ☆26 · Aug 31, 2023 · Updated 2 years ago
- ☆65 · Apr 26, 2025 · Updated 10 months ago
- triton for dsa ☆58 · Updated this week
- Train speculative decoding models effortlessly and port them smoothly to SGLang serving. ☆716 · Updated this week
- JAX backend for SGL ☆243 · Updated this week