vaibhawvipul / performance-engineering
☆28 Updated 2 years ago
Alternatives and similar repositories for performance-engineering
Users who are interested in performance-engineering are comparing it to the repositories listed below.
- LLM training in simple, raw C/CUDA ☆99 Updated last year
- ⛰️ RockyML - A High-Performance Scientific Computing Framework for Non-smooth Machine Learning Problems ☆19 Updated 2 years ago
- ML/DL Math and Method notes ☆61 Updated last year
- Worked example of the process from Python source to CUDA kernel execution with Numba ☆41 Updated 9 months ago
- Machine Learning Agility (MLAgility) benchmark and benchmarking tools ☆39 Updated last month
- A tracing JIT compiler for PyTorch ☆13 Updated 3 years ago
- Fast Matrix Multiplication Implementation in C programming language. This matrix multiplication algorithm is similar to what Numpy uses t… ☆34 Updated 4 years ago
- [DEPRECATED] Moved to ROCm/rocm-libraries repo ☆26 Updated last week
- benchmarking some transformer deployments ☆26 Updated 2 years ago
- Minimal, clean code for the Byte Pair Encoding (BPE) algorithm commonly used in LLM tokenization, in pure C. ☆21 Updated 11 months ago
- A tracing JIT for PyTorch ☆17 Updated 2 years ago
- ☆28 Updated 5 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆46 Updated this week
- Benchmarks to capture important workloads. ☆31 Updated 4 months ago
- Make triton easier ☆46 Updated last year
- Inference Llama 2 in C++ ☆43 Updated last year
- Samples demonstrating how to use the Compute Sanitizer Tools and Public API ☆83 Updated last year
- Notes and artifacts from the ONNX steering committee ☆26 Updated 2 weeks ago
- Some CUDA example code with READMEs. ☆168 Updated 3 months ago
- Learn CUDA with PyTorch ☆27 Updated this week
- A Gentle Principled Introduction to Deep Reinforcement Learning ☆19 Updated 2 months ago
- Serial and parallel implementations of matrix multiplication ☆41 Updated 4 years ago
- ☆18 Updated this week
- Slides and recordings of talks hosted by our community ☆20 Updated last year
- ☆20 Updated 9 years ago
- ☆14 Updated last year
- Article about deploying machine learning models using grpc, pytorch and asyncio ☆28 Updated 2 years ago
- Notes on "Programming Massively Parallel Processors" by Hwu, Kirk, and Hajj (4th ed.) ☆53 Updated 10 months ago
- FP64 equivalent GEMM via Int8 Tensor Cores using the Ozaki scheme ☆73 Updated 3 months ago
- Numbast is a tool to build an automated pipeline that converts CUDA APIs into Numba bindings. ☆47 Updated this week