ashvardanian / cuda-python-starter-kit
Parallel computing starter project to build GPU & CPU kernels in CUDA & C++ and call them from Python via PyBind11, without a single line of CMake
☆26 · Updated 4 months ago
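The workflow the kit describes, compiling a CUDA kernel and exposing it to Python through PyBind11 with no CMake involved, looks roughly like the minimal sketch below. This is an illustration rather than code from the repository; the `saxpy` kernel, the `kernels` module name, and the build command are invented for the example.

```cpp
// saxpy.cu -- hypothetical sketch of a CUDA kernel bound to Python via pybind11.
// Build sketch (an assumption, adjust paths to your setup):
//   nvcc -O2 -shared -Xcompiler -fPIC $(python -m pybind11 --includes) \
//        saxpy.cu -o kernels$(python3-config --extension-suffix)
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <cuda_runtime.h>

namespace py = pybind11;

// One thread per element: y[i] = a * x[i] + y[i].
__global__ void saxpy_kernel(float a, float const *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Copies NumPy inputs to the device, launches the kernel, copies the result back.
using array_f = py::array_t<float, py::array::c_style | py::array::forcecast>;

array_f saxpy(float a, array_f x, array_f y) {
    int n = static_cast<int>(x.size());
    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    saxpy_kernel<<<(n + 255) / 256, 256>>>(a, dx, dy, n);
    cudaMemcpy(y.mutable_data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    return y;
}

PYBIND11_MODULE(kernels, m) {
    m.def("saxpy", &saxpy, "In-place y = a * x + y computed on the GPU (example)");
}
```

Once compiled, the shared object imports like any Python module (`import kernels; kernels.saxpy(2.0, x, y)`), which is the CMake-free workflow the project description refers to.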
Alternatives and similar repositories for cuda-python-starter-kit
Users interested in cuda-python-starter-kit are comparing it to the libraries listed below.
- LLM training in simple, raw C/CUDA ☆99 · Updated last year
- Example ML projects that use the Determined library. ☆32 · Updated 10 months ago
- A FlashAttention implementation for JAX with support for efficient document mask computation and context parallelism. ☆128 · Updated 3 months ago
- A list of awesome resources and blogs on topics related to Unum ☆40 · Updated 8 months ago
- A parallel framework for training deep neural networks ☆61 · Updated 3 months ago
- ScalarLM - a unified training and inference stack ☆44 · Updated last week
- 👷 Build compute kernels ☆74 · Updated this week
- High-Performance SGEMM on CUDA devices ☆97 · Updated 5 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆46 · Updated last week
- My submission for the GPU MODE/AMD fp8 mm challenge ☆27 · Updated last month
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆62 · Updated last week
- No-GIL Python environment featuring NVIDIA Deep Learning libraries. ☆62 · Updated 2 months ago
- GPU documentation for humans ☆81 · Updated this week
- PyTorch Single Controller ☆296 · Updated this week
- Fast and vectorizable algorithms for searching in a vector of sorted floating point numbers ☆142 · Updated 6 months ago
- Learn CUDA with PyTorch ☆27 · Updated this week
- Extensible collectives library in Triton ☆87 · Updated 3 months ago
- 🏙 Interactive performance profiling and debugging tool for PyTorch neural networks. ☆62 · Updated 5 months ago
- Experimental GPU language with meta-programming ☆23 · Updated 10 months ago
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆188 · Updated last month
- Learning about CUDA by writing PTX code. ☆133 · Updated last year
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ☆134 · Updated last year
- FlexAttention w/ FlashAttention3 Support ☆26 · Updated 9 months ago
- Because it's there. ☆16 · Updated 9 months ago
- Benchmarks to capture important workloads. ☆31 · Updated 5 months ago