vaibhawvipul / performance-engineering
☆30Updated 2 years ago
Alternatives and similar repositories for performance-engineering
Users who are interested in performance-engineering are comparing it to the libraries listed below.
- LLM training in simple, raw C/CUDA☆108Updated last year
- CS294 AI Systems Class Website☆16Updated 3 years ago
- Write a fast kernel and run it on Discord. See how you compare against the best!☆61Updated last week
- Samples demonstrating how to use the Compute Sanitizer Tools and Public API☆90Updated 2 years ago
- Some CUDA example code with READMEs.☆178Updated last week
- ☆69Updated last week
- ☆13Updated 3 weeks ago
- High-Performance SGEMM on CUDA devices☆112Updated 10 months ago
- PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7)☆66Updated 8 months ago
- Machine Learning Agility (MLAgility) benchmark and benchmarking tools☆40Updated 3 months ago
- ☆28Updated 10 months ago
- TORCH_LOGS parser for PT2☆64Updated last week
- How to ensure correctness and ship LLM generated kernels in PyTorch☆121Updated last week
- A plugin for Jupyter Notebook to run CUDA C/C++ code☆254Updated last year
- Custom kernels in Triton language for accelerating LLMs☆27Updated last year
- ☆19Updated 2 weeks ago
- Learning about CUDA by writing PTX code.☆147Updated last year
- My own repository containing the codes I wrote to practice CUDA programming.☆63Updated 2 years ago
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆47Updated 3 months ago
- Learn CUDA with PyTorch☆111Updated last week
- Notes on "Programming Massively Parallel Processors" by Hwu, Kirk, and Hajj (4th ed.)☆53Updated last year
- Intel Gaudi's Megatron DeepSpeed Large Language Models for training☆15Updated 11 months ago
- Home for OctoML PyTorch Profiler☆114Updated 2 years ago
- Memory Optimizations for Deep Learning (ICML 2023)☆110Updated last year
- ☆77Updated last year
- ☆27Updated last year
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.☆100Updated 4 months ago
- Hand-Rolled GPU communications library☆65Updated last week
- LLM training parallelisms (DP, FSDP, TP, PP) in pure C☆26Updated 4 months ago
- MLPerf™ logging library☆37Updated last month