vaibhawvipul / performance-engineering
☆26 · Updated 2 years ago
Alternatives and similar repositories for performance-engineering:
Users who are interested in performance-engineering are comparing it to the libraries listed below:
- LLM training in simple, raw C/CUDA · ☆92 · Updated 11 months ago
- ML/DL Math and Method notes · ☆60 · Updated last year
- Notes and code for Programming Massively Parallel Processors · ☆11 · Updated 3 weeks ago
- Slides and recordings of talks hosted by our community · ☆20 · Updated 10 months ago
- ☆27 · Updated 3 months ago
- A tracing JIT compiler for PyTorch · ☆13 · Updated 3 years ago
- This material contains content on how to profile and optimize simple PyTorch MNIST code using NVIDIA Nsight Systems and PyTorch Profiler · ☆12 · Updated last year
- Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization, in pure C · ☆21 · Updated 9 months ago
- Numbast is a tool to build an automated pipeline that converts CUDA APIs into Numba bindings · ☆44 · Updated last week
- Benchmarking some transformer deployments · ☆26 · Updated 2 years ago
- The CUDA target for Numba · ☆106 · Updated this week
- A parallel framework for training deep neural networks · ☆58 · Updated last month
- Some CUDA example code with READMEs · ☆94 · Updated last month
- Random number library that generates pseudo-random and quasi-random numbers · ☆26 · Updated last week
- A Gentle Principled Introduction to Deep Reinforcement Learning · ☆19 · Updated 3 weeks ago
- Learn OpenMP examples step by step · ☆91 · Updated 3 months ago
- No-GIL Python environment featuring NVIDIA Deep Learning libraries · ☆57 · Updated last week
- Learn CUDA with PyTorch · ☆20 · Updated 2 months ago
- SKIP for AI · ☆21 · Updated 5 years ago
- Serial and parallel implementations of matrix multiplication · ☆40 · Updated 4 years ago
- Benchmark tests supporting the TiledCUDA library · ☆16 · Updated 5 months ago
- A repository of PyTorch examples · ☆9 · Updated 2 years ago
- Samples demonstrating how to use the Compute Sanitizer Tools and Public API · ☆79 · Updated last year
- Write a fast kernel and run it on Discord. See how you compare against the best! · ☆40 · Updated this week
- Benchmarks to capture important workloads · ☆31 · Updated 2 months ago
- ☆13 · Updated 2 years ago
- A distributed end-to-end image classification system using Kubernetes · ☆11 · Updated 3 months ago
- NVIDIA's launch, startup, and logging scripts used by our MLPerf Training and HPC submissions · ☆26 · Updated this week
- ⛰️ RockyML - A High-Performance Scientific Computing Framework for Non-smooth Machine Learning Problems · ☆19 · Updated 2 years ago
- Worked example of the process from Python source to CUDA kernel execution with Numba · ☆40 · Updated 7 months ago