vaibhawvipul / performance-engineering
☆30 · Updated 2 years ago
Alternatives and similar repositories for performance-engineering
Users who are interested in performance-engineering are comparing it to the repositories listed below.
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆67 · Updated this week
- LLM training in simple, raw C/CUDA ☆110 · Updated last year
- High-Performance SGEMM on CUDA devices ☆115 · Updated 11 months ago
- ☆87 · Updated last week
- ☆15 · Updated 2 months ago
- PTX-Tutorial Written Purely By AIs (Deep Research of OpenAI and Claude 3.7) ☆66 · Updated 9 months ago
- ☆28 · Updated last year
- A curriculum for learning about GPU performance engineering, from scratch to what the frontier AI labs do ☆169 · Updated last week
- Make Triton easier ☆50 · Updated last year
- Some CUDA example code with READMEs. ☆179 · Updated 2 months ago
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆189 · Updated this week
- Learning about CUDA by writing PTX code. ☆151 · Updated last year
- Samples demonstrating how to use the Compute Sanitizer Tools and Public API ☆93 · Updated 2 years ago
- ☆19 · Updated 9 years ago
- Parallel framework for training and fine-tuning deep neural networks ☆70 · Updated 2 months ago
- Ship correct and fast LLM kernels to PyTorch ☆132 · Updated this week
- Machine Learning Agility (MLAgility) benchmark and benchmarking tools ☆40 · Updated 5 months ago
- TORCH_TRACE parser for PT2 ☆71 · Updated this week
- Custom PTX Instruction Benchmark ☆137 · Updated 10 months ago
- This repository contains companion software for the Colfax Research paper "Categorical Foundations for CuTe Layouts". ☆92 · Updated 3 months ago
- Home for OctoML PyTorch Profiler ☆114 · Updated 2 years ago
- Hand-Rolled GPU communications library ☆76 · Updated last month
- Tutorials for running models on First-gen Gaudi and Gaudi2 for Training and Inference. The source files for the tutorials on https://dev… ☆63 · Updated 4 months ago
- Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization, in pure C. ☆22 · Updated last year
- Custom kernels in Triton language for accelerating LLMs ☆27 · Updated last year
- Memory Optimizations for Deep Learning (ICML 2023) ☆114 · Updated last year
- Slides and recordings of talks hosted by our community ☆21 · Updated last year
- ☆23 · Updated 2 months ago
- This library empowers users to seamlessly port pretrained models and checkpoints on the HuggingFace (HF) hub (developed using HF transfor… ☆85 · Updated this week
- ☆27 · Updated 2 years ago