matiaslindgren / cuda-memory-access-recorder
Record GPU memory accesses of a CUDA program and visualize the access pattern in a browser
☆13 · Updated 3 years ago
Related projects
Alternatives and complementary repositories for cuda-memory-access-recorder
- Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with the MMA PTX instruction. ☆11 · Updated last year
- An IR for efficiently simulating distributed ML computation. ☆25 · Updated 9 months ago
- A memory profiler for NVIDIA GPUs to explore memory inefficiencies in GPU-accelerated applications. ☆22 · Updated 3 weeks ago
- A tracing JIT compiler for PyTorch ☆12 · Updated 2 years ago
- ☆14 · Updated last month
- A simple profiler to count NVIDIA PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis. ☆42 · Updated 10 months ago
- 🎃 GPU load-balancing library for regular and irregular computations. ☆57 · Updated 4 months ago
- [CF ’20] Verified Instruction-Level Energy Consumption Measurement for NVIDIA GPUs ☆15 · Updated 3 years ago
- GEMM and Winograd based convolutions using CUTLASS ☆25 · Updated 4 years ago
- A lightweight, Pythonic frontend for MLIR ☆79 · Updated last year
- A tracing JIT for PyTorch ☆17 · Updated 2 years ago
- A Top-Down Profiler for GPU Applications ☆13 · Updated 8 months ago
- CUDAAdvisor: a GPU profiling tool ☆48 · Updated 6 years ago
- LLVM-Canon aims to transform LLVM modules into a canonical form by reordering and renaming instructions while preserving the same semanti… ☆12 · Updated 6 months ago
- Training neural networks in TensorFlow 2.0 with 5x less memory ☆128 · Updated 2 years ago
- Experiments and prototypes associated with IREE or MLIR ☆49 · Updated 3 months ago
- An experimental ahead-of-time compiler for Relay. ☆51 · Updated 4 years ago
- Torch Frontend for IREE ☆25 · Updated 10 months ago
- CUDA 12.2 HMM demos ☆17 · Updated 3 months ago
- Open source cross-platform compiler for compute-intensive loops used in AI algorithms, from Microsoft Research ☆101 · Updated last year
- A framework that helps implement swizzle GPU kernels ☆41 · Updated 4 years ago
- Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large … ☆63 · Updated 2 years ago
- Code for Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture (accepted by PVLDB). The outdated wr… ☆8 · Updated last year
- Samples demonstrating how to use the Compute Sanitizer Tools and Public API ☆65 · Updated last year
- cuASR: CUDA Algebra for Semirings ☆34 · Updated 2 years ago
- ☆48 · Updated 3 months ago
- Chai ☆42 · Updated 11 months ago
- Directed Acyclic Graph Execution Engine (DAGEE) is a C++ library that enables programmers to express computation and data movement, as ta… ☆44 · Updated 3 years ago
- Julia ports of the Rodinia benchmark suite for heterogeneous computing infrastructures ☆47 · Updated last year