Utility scripts for PyTorch (e.g. make Perfetto show some disappearing kernels, a memory profiler that understands more low-level allocations such as NCCL, ...)
☆110 Sep 11, 2025 Updated 9 months ago
Alternatives and similar repositories for torch_utils
Users that are interested in torch_utils are comparing it to the libraries listed below.
- [Archived] For the latest updates and community contribution, please visit: https://github.com/Ascend/TransferQueue or https://gitcode.co… ☆16 Jan 16, 2026 Updated 5 months ago
- Bridge Megatron-Core to Hugging Face/Reinforcement Learning ☆216 Updated this week
- Perplexity GPU Kernels ☆586 Nov 7, 2025 Updated 7 months ago
- ☆66 Apr 26, 2025 Updated last year
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆111 Jun 28, 2025 Updated 11 months ago
- ☆20 Dec 24, 2024 Updated last year
- ☆20 May 30, 2026 Updated 2 weeks ago
- a simple API to use CUPTI ☆10 Aug 19, 2025 Updated 9 months ago
- Toolchain built around the Megatron-LM for Distributed Training ☆95 May 20, 2026 Updated 3 weeks ago
- [NeurIPS 2025] ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive ☆73 Dec 11, 2025 Updated 6 months ago
- ☆32 Jul 2, 2025 Updated 11 months ago
- [WIP] Better (FP8) attention for Hopper ☆34 Feb 24, 2025 Updated last year
- Tilus is a tile-level kernel programming language with explicit control over shared memory and registers. ☆488 Updated this week
- An experimental communicating attention kernel based on DeepEP. ☆34 Jul 29, 2025 Updated 10 months ago
- DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments. ☆99 Jan 16, 2026 Updated 4 months ago
- Fastest kernels written from scratch ☆583 Sep 18, 2025 Updated 8 months ago
- Debug print operator for cudagraph debugging ☆15 Aug 2, 2024 Updated last year
- Distributed Compiler based on Triton for Parallel Systems ☆1,459 Apr 22, 2026 Updated last month
- Estimate MFU for DeepSeekV3 ☆26 Jan 5, 2025 Updated last year
- ☆82 Dec 27, 2024 Updated last year
- A high-performance acceleration library dedicated to large-scale model training on AMD GPUs ☆64 Updated this week
- Ring attention implementation with flash attention ☆1,025 Sep 10, 2025 Updated 9 months ago
- ☆52 May 19, 2025 Updated last year
- Ongoing research training transformer models at scale ☆18 Updated this week
- Pipeline Parallelism Emulation and Visualization ☆83 Jan 8, 2026 Updated 5 months ago
- ☆249 Nov 19, 2025 Updated 6 months ago
- ☆169 Dec 27, 2024 Updated last year
- Sequence-level 1F1B schedule for LLMs. ☆37 Aug 26, 2025 Updated 9 months ago
- Byted PyTorch Distributed for Hyperscale Training of LLMs and RLs ☆1,024 Mar 3, 2026 Updated 3 months ago
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com… ☆548 Updated this week
- Allow torch tensor memory to be released and resumed later ☆249 May 16, 2026 Updated 3 weeks ago
- torchcomms: a modern PyTorch communications API ☆368 Updated this week
- ☆42 Dec 9, 2025 Updated 6 months ago
- ☆22 May 5, 2025 Updated last year
- NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer ☆192 Feb 11, 2026 Updated 4 months ago
- Multi-Level Triton Runner supporting Python, IR, PTX, AMDGCN, cubin and hasco. ☆98 May 8, 2026 Updated last month
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆359 Updated this week
- Tutorials for NVIDIA CUPTI samples ☆68 Nov 3, 2025 Updated 7 months ago
- A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation. ☆126 Dec 25, 2025 Updated 5 months ago