ashvardanian / cuda-python-starter-kit
Parallel Computing starter project to build GPU & CPU kernels in CUDA & C++ and call them from Python without a single line of CMake using PyBind11
☆28Updated 5 months ago
Alternatives and similar repositories for cuda-python-starter-kit
Users who are interested in cuda-python-starter-kit are comparing it to the libraries listed below.
- LLM training in simple, raw C/CUDA☆104Updated last year
- High-Performance SGEMM on CUDA devices☆97Updated 7 months ago
- Fast and vectorizable algorithms for searching in a vector of sorted floating point numbers☆146Updated 8 months ago
- A FlashAttention implementation for JAX with support for efficient document mask computation and context parallelism.☆140Updated 4 months ago
- ☆38Updated last week
- ☆21Updated 5 months ago
- Awesome utilities for performance profiling☆186Updated 5 months ago
- A parallel framework for training deep neural networks☆63Updated 5 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best!☆51Updated this week
- Extensible collectives library in Triton☆88Updated 4 months ago
- A bunch of kernels that might make stuff slower 😉☆58Updated last week
- Small scale distributed training of sequential deep learning models, built on Numpy and MPI.☆136Updated last year
- A minimalistic C++ Jinja templating engine for LLM chat templates☆170Updated 3 weeks ago
- Pipeline parallelism for the minimalist☆33Updated 3 weeks ago
- FlexAttention w/ FlashAttention3 Support☆27Updated 10 months ago
- PyTorch Single Controller☆368Updated this week
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard!☆74Updated this week
- A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.☆271Updated this week
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand☆191Updated 2 months ago
- A list of awesome resources and blogs on topics related to Unum☆41Updated 10 months ago
- A stand-alone implementation of several NumPy dtype extensions used in machine learning.☆291Updated 2 weeks ago
- Learning about CUDA by writing PTX code.☆135Updated last year
- Learn CUDA with PyTorch☆67Updated this week
- No-GIL Python environment featuring NVIDIA Deep Learning libraries.☆63Updated 4 months ago
- An experimental CPU backend for Triton (https://github.com/openai/triton)☆44Updated last week
- ☆25Updated this week
- Make Triton easier☆47Updated last year
- ☆232Updated last week
- PyTorch centric eager mode debugger☆48Updated 8 months ago
- ☆15Updated last week