vaibhawvipul / performance-engineering
☆30 · Updated 3 years ago
Alternatives and similar repositories for performance-engineering
Users interested in performance-engineering are comparing it to the repositories listed below.
- LLM training in simple, raw C/CUDA ☆112 · Updated last year
- High-Performance FP32 GEMM on CUDA devices ☆117 · Updated last year
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆71 · Updated this week
- Some CUDA example code with READMEs. ☆179 · Updated 2 months ago
- ☆15 · Updated 3 months ago
- ☆27 · Updated 2 years ago
- Learning about CUDA by writing PTX code. ☆152 · Updated last year
- ☆95 · Updated this week
- Benchmark tests supporting the TiledCUDA library. ☆18 · Updated last year
- Make Triton easier ☆50 · Updated last year
- Machine Learning Agility (MLAgility) benchmark and benchmarking tools ☆40 · Updated 6 months ago
- Hand-rolled GPU communications library ☆82 · Updated 2 months ago
- PTX tutorial written purely by AIs (OpenAI Deep Research and Claude 3.7) ☆66 · Updated 10 months ago
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆49 · Updated 5 months ago
- Custom PTX instruction benchmark ☆138 · Updated 11 months ago
- TORCH_TRACE parser for PT2 ☆76 · Updated this week
- A curriculum for learning GPU performance engineering, from scratch to what the frontier AI labs do ☆341 · Updated 3 weeks ago
- ☆28 · Updated last year
- A series of high-performance GEMM (General Matrix Multiply) implementations, iteratively optimised for H100 GPUs in pure CUDA ☆64 · Updated 3 weeks ago
- Parallel framework for training and fine-tuning deep neural networks ☆70 · Updated 3 months ago
- Samples demonstrating how to use the Compute Sanitizer tools and public API ☆93 · Updated 2 years ago
- Benchmarks to capture important workloads. ☆32 · Updated 2 weeks ago
- Official problem sets / reference kernels for the GPU MODE leaderboard! ☆201 · Updated this week
- General matrix multiplication using NVIDIA Tensor Cores ☆28 · Updated last year
- TritonParse: a compiler tracer, visualizer, and reproducer for Triton kernels ☆194 · Updated this week
- Framework to reduce autotune overhead to zero for well-known deployments ☆96 · Updated 4 months ago
- Custom kernels in the Triton language for accelerating LLMs ☆27 · Updated last year
- Ship correct and fast LLM kernels to PyTorch ☆140 · Updated 3 weeks ago
- LLM training parallelisms (DP, FSDP, TP, PP) in pure C ☆26 · Updated 2 weeks ago
- Attention in SRAM on Tenstorrent Grayskull ☆40 · Updated last year