vaibhawvipul / performance-engineering
☆28Updated 2 years ago
Alternatives and similar repositories for performance-engineering
Users that are interested in performance-engineering are comparing it to the libraries listed below
- LLM training in simple, raw C/CUDA☆101Updated last year
- ☆28Updated 6 months ago
- ML/DL Math and Method notes☆61Updated last year
- Write a fast kernel and run it on Discord. See how you compare against the best!☆47Updated this week
- Make triton easier☆47Updated last year
- benchmarking some transformer deployments☆26Updated 2 years ago
- Machine Learning Agility (MLAgility) benchmark and benchmarking tools☆39Updated 2 months ago
- ☆18Updated 2 weeks ago
- ⛰️ RockyML - A High-Performance Scientific Computing Framework for Non-smooth Machine Learning Problems☆19Updated 2 years ago
- Fast and vectorizable algorithms for searching in a vector of sorted floating point numbers☆144Updated 7 months ago
- High-Performance SGEMM on CUDA devices☆97Updated 6 months ago
- A parallel framework for training deep neural networks☆63Updated 4 months ago
- Personal solutions to the Triton Puzzles☆19Updated last year
- Some CUDA example code with READMEs.☆169Updated 4 months ago
- The CUDA target for Numba☆153Updated this week
- Benchmarks to capture important workloads.☆31Updated 5 months ago
- No-GIL Python environment featuring NVIDIA Deep Learning libraries.☆63Updated 3 months ago
- A tracing JIT compiler for PyTorch☆13Updated 3 years ago
- Slides and recordings of talks hosted by our community☆20Updated last year
- FlexAttention w/ FlashAttention3 Support☆26Updated 9 months ago
- This library empowers users to seamlessly port pretrained models and checkpoints on the HuggingFace (HF) hub (developed using HF transfor…☆74Updated this week
- This material contains content on how to profile and optimize simple PyTorch MNIST code using NVIDIA Nsight Systems and PyTorch Profiler☆14Updated 2 years ago
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆43Updated 4 months ago
- Neural network from scratch in CUDA/C++☆82Updated 6 months ago
- Supplementary material for our paper "Compute Trends Across Three Eras of Machine Learning".☆40Updated 3 years ago
- A lightweight MLIR Python frontend with support for PyTorch☆25Updated 10 months ago
- Loop Nest - Linear algebra compiler and code generator.☆22Updated 2 years ago
- Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization, in pure C.☆21Updated last year
- Benchmark tests supporting the TiledCUDA library.☆16Updated 8 months ago
- ☆14Updated last year