vaibhawvipul / performance-engineering
☆30 Updated 2 years ago
Alternatives and similar repositories for performance-engineering
Users that are interested in performance-engineering are comparing it to the libraries listed below
- LLM training in simple, raw C/CUDA ☆108 Updated last year
- High-Performance SGEMM on CUDA devices ☆113 Updated 11 months ago
- Hand-Rolled GPU communications library ☆76 Updated 3 weeks ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆64 Updated this week
- ☆14 Updated last month
- Some CUDA example code with READMEs. ☆179 Updated last month
- Custom PTX Instruction Benchmark ☆136 Updated 9 months ago
- Accelerated General (FP32) Matrix Multiplication from scratch in CUDA ☆174 Updated 11 months ago
- Make triton easier ☆49 Updated last year
- ☆86 Updated last month
- Fast and vectorizable algorithms for searching in a vector of sorted floating point numbers ☆153 Updated last year
- An interactive web-based tool for exploring intermediate representations of PyTorch and Triton models ☆50 Updated 2 weeks ago
- PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7) ☆66 Updated 8 months ago
- Learning about CUDA by writing PTX code. ☆150 Updated last year
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆174 Updated 2 weeks ago
- Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization, in pure C. ☆22 Updated last year
- ☆28 Updated 11 months ago
- A plugin for Jupyter Notebook to run CUDA C/C++ code ☆257 Updated last year
- Memory Optimizations for Deep Learning (ICML 2023) ☆113 Updated last year
- No-GIL Python environment featuring NVIDIA Deep Learning libraries. ☆69 Updated 8 months ago
- Parallel framework for training and fine-tuning deep neural networks ☆70 Updated last month
- Home for OctoML PyTorch Profiler ☆114 Updated 2 years ago
- ☆81 Updated 2 weeks ago
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆47 Updated 4 months ago
- ☆21 Updated 9 months ago
- Multi-Threaded FP32 Matrix Multiplication on x86 CPUs ☆370 Updated 8 months ago
- Custom kernels in Triton language for accelerating LLMs ☆27 Updated last year
- ☆12 Updated 3 months ago
- Notes on "Programming Massively Parallel Processors" by Hwu, Kirk, and Hajj (4th ed.) ☆53 Updated last year
- Worked example of the process from Python source to CUDA kernel execution with Numba ☆44 Updated last year