Utility scripts for PyTorch (e.g. making Perfetto show kernels that otherwise disappear, a memory profiler that understands lower-level allocations such as NCCL, ...)
☆90Sep 11, 2025Updated 5 months ago
Alternatives and similar repositories for torch_utils
Users who are interested in torch_utils compare it with the libraries listed below.
- Bridge Megatron-Core to Hugging Face/Reinforcement Learning☆197Updated this week
- DeepXTrace is a lightweight tool for precisely diagnosing slow ranks in DeepEP-based environments.☆93Jan 16, 2026Updated last month
- ☆20Dec 24, 2024Updated last year
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆106Jun 28, 2025Updated 8 months ago
- a simple API to use CUPTI☆11Aug 19, 2025Updated 6 months ago
- ☆16Feb 24, 2026Updated last week
- Perplexity GPU Kernels☆567Nov 7, 2025Updated 4 months ago
- Use the tokenizer in parallel to achieve superior acceleration☆20Mar 21, 2024Updated last year
- Ongoing research training transformer models at scale☆18Updated this week
- Sequence-level 1F1B schedule for LLMs.☆38Aug 26, 2025Updated 6 months ago
- [Archived] For the latest updates and community contribution, please visit: https://github.com/Ascend/TransferQueue or https://gitcode.co…☆13Jan 16, 2026Updated last month
- DeepTrace: A lightweight, scalable real-time diagnostic and analysis tool for distributed training tasks.☆18Nov 4, 2025Updated 4 months ago
- ☆16Nov 5, 2018Updated 7 years ago
- Distributed Compiler based on Triton for Parallel Systems☆1,371Feb 13, 2026Updated 3 weeks ago
- A bunch of kernels that might make stuff slower 😉☆75Feb 18, 2026Updated 2 weeks ago
- ☆65Apr 26, 2025Updated 10 months ago
- Fastest kernels written from scratch☆550Sep 18, 2025Updated 5 months ago
- ☆226Nov 19, 2025Updated 3 months ago
- [NeurIPS 2025] ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive☆66Dec 11, 2025Updated 2 months ago
- ☆18Mar 10, 2023Updated 2 years ago
- Ring attention implementation with flash attention☆987Sep 10, 2025Updated 5 months ago
- ☆32Jul 2, 2025Updated 8 months ago
- ☆22May 5, 2025Updated 10 months ago
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com…☆469Feb 28, 2026Updated last week
- Best practices for training DeepSeek, Mixtral, Qwen and other MoE models using Megatron Core.☆167Jan 22, 2026Updated last month
- ☆160Dec 27, 2024Updated last year
- ☆42Sep 8, 2025Updated 5 months ago
- Official Project Page for HLA: Higher-order Linear Attention (https://arxiv.org/abs/2510.27258)☆45Jan 6, 2026Updated 2 months ago
- ☆62Updated this week
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance.☆327Updated this week
- 🍼 Official implementation of Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts☆41Sep 29, 2024Updated last year
- Source code for “悟道” (WuDao)☆21Aug 24, 2021Updated 4 years ago
- ☆79Dec 27, 2024Updated last year
- ☆16Mar 30, 2024Updated last year
- Tilus is a tile-level kernel programming language with explicit control over shared memory and registers.☆446Feb 4, 2026Updated last month
- A Distributed Attention Towards Linear Scalability for Ultra-Long Context, Heterogeneous Data Training☆659Updated this week
- Flash-Linear-Attention models beyond language☆21Aug 28, 2025Updated 6 months ago
- study of cutlass☆22Nov 10, 2024Updated last year
- A Structured Span Selector (NAACL 2022). A structured span selector with a WCFG for span selection tasks (coreference resolution, semanti…☆21Jul 11, 2022Updated 3 years ago