ashvardanian / cuda-python-starter-kit
Parallel-computing starter project for building GPU and CPU kernels in CUDA and C++ and calling them from Python through PyBind11, without a single line of CMake
☆29 · Updated 6 months ago
Alternatives and similar repositories for cuda-python-starter-kit
Users who are interested in cuda-python-starter-kit compare it to the libraries listed below.
- LLM training in simple, raw C/CUDA ☆104 · Updated last year
- High-Performance SGEMM on CUDA devices ☆101 · Updated 7 months ago
- ☆39 · Updated this week
- Fast and vectorizable algorithms for searching in a vector of sorted floating point numbers ☆149 · Updated 8 months ago
- A list of awesome resources and blogs on topics related to Unum ☆41 · Updated 11 months ago
- A parallel framework for training deep neural networks ☆63 · Updated 6 months ago
- PyTorch Single Controller ☆414 · Updated this week
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆141 · Updated last year
- Some CUDA example code with READMEs. ☆170 · Updated 6 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆57 · Updated this week
- ☆21 · Updated 6 months ago
- ScalarLM - a unified training and inference stack ☆55 · Updated this week
- Effective transpose on Hopper GPU ☆23 · Updated last week
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆193 · Updated 3 months ago
- extensible collectives library in triton ☆87 · Updated 5 months ago
- ☆31 · Updated 4 months ago
- Implementation of the paper "Lossless Compression of Vector IDs for Approximate Nearest Neighbor Search" by Severo et al. ☆82 · Updated 7 months ago
- Pipeline parallelism for the minimalist ☆33 · Updated last month
- No-GIL Python environment featuring NVIDIA Deep Learning libraries. ☆63 · Updated 5 months ago
- Learning about CUDA by writing PTX code. ☆135 · Updated last year
- Learn CUDA with PyTorch ☆74 · Updated last week
- A stand-alone implementation of several NumPy dtype extensions used in machine learning. ☆296 · Updated last week
- Example ML projects that use the Determined library. ☆32 · Updated last year
- ☆69 · Updated 7 months ago
- python package of rocm-smi-lib ☆23 · Updated 2 months ago
- Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo) ☆401 · Updated 2 weeks ago
- Fast low-bit matmul kernels in Triton ☆365 · Updated this week
- Simple MPI implementation for prototyping or learning ☆279 · Updated last month
- NVIDIA NVSHMEM is a parallel programming interface for NVIDIA GPUs based on OpenSHMEM. NVSHMEM can significantly reduce multi-process com… ☆311 · Updated this week
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆92 · Updated this week