msaroufim / awesome-profiling

Awesome utilities for performance profiling

☆199

Alternatives and similar repositories for awesome-profiling

Users who are interested in awesome-profiling are comparing it to the libraries listed below.

facebookincubator / dynolog
Dynolog is a telemetry daemon for performance monitoring and tracing. It exports metrics from different components in the system like the…
☆362 Updated 2 weeks ago
sbu-fsl / kernel-ml
Machine Learning Framework for Operating Systems - Brings ML to Linux kernel
☆253 Updated 4 years ago
GPUprobe / gpuprobe-daemon
Lightweight daemon for monitoring CUDA runtime API calls with eBPF uprobes
☆146 Updated 10 months ago
meta-pytorch / triton-cpu
An experimental CPU backend for Triton (https://github.com/openai/triton)
☆49 Updated 5 months ago
jax-ml / ml_dtypes
A stand-alone implementation of several NumPy dtype extensions used in machine learning.
☆329 Updated last week
ashvardanian / PyBindToGPUs
Parallel Computing starter project to build GPU & CPU kernels in CUDA & C++ and call them from Python without a single line of CMake usin…
☆31 Updated 3 months ago
octoml / octoml-profile
Home for OctoML PyTorch Profiler
☆113 Updated 2 years ago
mlcommons / logging
MLPerf™ logging library
☆38 Updated last month
danyangz / lightning
Lightning In-Memory Object Store
☆47 Updated 4 years ago
NVIDIA / cuda-checkpoint
CUDA checkpoint and restore utility
☆415 Updated 4 months ago
intel / iaprof
AI/GPU flame graph
☆246 Updated 4 months ago
wafer-ai / gpu-perf-engineering-resources
A curriculum for learning about GPU performance engineering, from scratch to what the frontier AI labs do
☆349 Updated 3 weeks ago
gpuocelot / gpuocelot
GPUOcelot: A dynamic compilation framework for PTX
☆219 Updated last year
gevtushenko / llm.c
LLM training in simple, raw C/CUDA
☆112 Updated last year
coreweave / ml-containers
☆44 Updated this week
NVIDIA / Fuser
A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")
☆379 Updated this week
CentML / DeepView.Profile
🏙 Interactive performance profiling and debugging tool for PyTorch neural networks.
☆64 Updated last year
microsoft / dist-ir
An IR for efficiently simulating distributed ML computation.
☆32 Updated 2 years ago
openxla / shardy
MLIR-based partitioning system
☆164 Updated this week
meta-pytorch / torchsnapshot
A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind…
☆164 Updated last month
foundation-model-stack / fastsafetensors
High-performance safetensors model loader
☆99 Updated last month
meta-pytorch / autoparallel
An experimental implementation of compiler-driven automatic sharding of models across a given device mesh.
☆52 Updated this week
meta-pytorch / multipy
torch::deploy (multipy for non-torch uses) is a system that lets you get around the GIL problem by running multiple Python interpreters i…
☆182 Updated last month
salykova / sgemm.cu
High-Performance FP32 GEMM on CUDA devices
☆117 Updated last year
simon-mo / vLLM-Benchmark
☆31 Updated 9 months ago
Jokeren / Awesome-GPU
Awesome resources for GPUs
☆609 Updated 2 years ago
facebookresearch / FAMBench
Benchmarks to capture important workloads.
☆32 Updated last week
microsoft / Accera
Open source cross-platform compiler for compute-intensive loops used in AI algorithms, from Microsoft Research
☆116 Updated 2 years ago
meta-pytorch / tlparse
TORCH_TRACE parser for PT2
☆76 Updated last week
fabiocannizzo / FastBinarySearch
Fast and vectorizable algorithms for searching in a vector of sorted floating point numbers
☆153 Updated last year