ashvardanian / cuda-python-starter-kit
Parallel Computing starter project to build GPU & CPU kernels in CUDA & C++ and call them from Python without a single line of CMake using PyBind11
☆26 Updated 4 months ago
Alternatives and similar repositories for cuda-python-starter-kit
Users that are interested in cuda-python-starter-kit are comparing it to the libraries listed below
- Fast and vectorizable algorithms for searching in a vector of sorted floating point numbers ☆144 Updated 7 months ago
- LLM training in simple, raw C/CUDA ☆102 Updated last year
- A parallel framework for training deep neural networks ☆63 Updated 4 months ago
- High-Performance SGEMM on CUDA devices ☆98 Updated 6 months ago
- ☆21 Updated 5 months ago
- Learn CUDA with PyTorch ☆33 Updated 2 weeks ago
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI. ☆137 Updated last year
- Example ML projects that use the Determined library. ☆32 Updated 10 months ago
- ☆28 Updated 6 months ago
- extensible collectives library in triton ☆88 Updated 4 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆48 Updated this week
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆188 Updated 2 months ago
- PyTorch Single Controller ☆341 Updated this week
- Home for OctoML PyTorch Profiler ☆113 Updated 2 years ago
- Effective transpose on Hopper GPU ☆23 Updated 3 months ago
- A stand-alone implementation of several NumPy dtype extensions used in machine learning. ☆282 Updated 3 weeks ago
- ScalarLM - a unified training and inference stack ☆52 Updated last week
- A FlashAttention implementation for JAX with support for efficient document mask computation and context parallelism. ☆131 Updated 3 months ago
- Implementation of the paper "Lossless Compression of Vector IDs for Approximate Nearest Neighbor Search" by Severo et al. ☆81 Updated 6 months ago
- No-GIL Python environment featuring NVIDIA Deep Learning libraries. ☆63 Updated 3 months ago
- GPU documentation for humans ☆99 Updated 3 weeks ago
- Awesome utilities for performance profiling ☆186 Updated 5 months ago
- Make triton easier ☆47 Updated last year
- A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser") ☆345 Updated this week
- 👷 Build compute kernels ☆87 Updated this week
- An experimental CPU backend for Triton (https://github.com/openai/triton) ☆43 Updated 4 months ago
- A list of awesome resources and blogs on topics related to Unum ☆40 Updated 9 months ago
- ☆15 Updated 4 months ago
- FlexAttention w/ FlashAttention3 Support ☆27 Updated 10 months ago
- torch::deploy (multipy for non-torch uses) is a system that lets you get around the GIL problem by running multiple Python interpreters i… ☆180 Updated 3 weeks ago