CentML / DeepView.Profile
Interactive performance profiling and debugging tool for PyTorch neural networks.
☆61 · Updated 5 months ago
Alternatives and similar repositories for DeepView.Profile
Users interested in DeepView.Profile are comparing it to the libraries listed below.
- extensible collectives library in triton ☆86 · Updated 2 months ago
- ☆72 · Updated 3 months ago
- ☆105 · Updated 10 months ago
- PyTorch centric eager mode debugger ☆47 · Updated 6 months ago
- Applied AI experiments and examples for PyTorch ☆277 · Updated 3 weeks ago
- Boosting 4-bit inference kernels with 2:4 Sparsity ☆79 · Updated 9 months ago
- Home for OctoML PyTorch Profiler ☆113 · Updated 2 years ago
- ☆28 · Updated 5 months ago
- Fast low-bit matmul kernels in Triton ☆322 · Updated last week
- ☆81 · Updated 7 months ago
- JaxPP is a library for JAX that enables flexible MPMD pipeline parallelism for large-scale LLM training ☆49 · Updated last month
- ring-attention experiments ☆144 · Updated 8 months ago
- A Python library that transfers PyTorch tensors between CPU and NVMe ☆116 · Updated 6 months ago
- A safetensors extension to efficiently store sparse quantized tensors on disk ☆129 · Updated this week
- Collection of kernels written in Triton language ☆132 · Updated 2 months ago
- This repository contains the experimental PyTorch native float8 training UX ☆224 · Updated 10 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆90 · Updated 2 weeks ago
- ☆219 · Updated this week
- A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind … ☆157 · Updated this week
- MLIR-based partitioning system ☆97 · Updated this week
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆43 · Updated 3 months ago
- Microsoft Collective Communication Library ☆64 · Updated 7 months ago
- A schedule language for large model training ☆149 · Updated last year
- High-Performance SGEMM on CUDA devices ☆95 · Updated 5 months ago
- Hydragen: High-Throughput LLM Inference with Shared Prefixes ☆41 · Updated last year
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆167 · Updated this week
- Evaluating Large Language Models for CUDA Code Generation. ComputeEval is a framework designed to generate and evaluate CUDA code from Lar… ☆50 · Updated last week
- Cataloging released Triton kernels. ☆238 · Updated 5 months ago
- Demo of the unit_scaling library, showing how a model can be easily adapted to train in FP8. ☆45 · Updated 11 months ago
- A framework for PyTorch to enable fault management for collective communication libraries (CCL) such as NCCL ☆19 · Updated last month