ashvardanian / cuda-python-starter-kit
Parallel-computing starter project for building GPU & CPU kernels in CUDA & C++ and calling them from Python via PyBind11, without a single line of CMake
☆26 · Updated 3 months ago
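The pitch above is worth unpacking: PyBind11 lets a single C++ translation unit expose a kernel to Python with one compiler invocation, which is what makes the "no CMake" claim possible. Below is a minimal sketch of that pattern, assuming pybind11 is installed; the module and function names (`kernel_module`, `saxpy`) are illustrative, not the starter kit's actual API, and the CPU loop stands in for a real CUDA kernel launch.

```cpp
// kernel_module.cpp — illustrative sketch only, not the starter kit's real API.
#include <pybind11/pybind11.h>
#include <pybind11/stl.h>  // automatic std::vector <-> Python list conversion
#include <vector>

// Toy CPU "kernel": y[i] += a * x[i]. A CUDA build would replace this
// loop with a device kernel launch.
std::vector<float> saxpy(float a,
                         const std::vector<float>& x,
                         std::vector<float> y) {
    for (size_t i = 0; i < x.size() && i < y.size(); ++i)
        y[i] += a * x[i];
    return y;  // returned by value; pybind11 converts it back to a Python list
}

PYBIND11_MODULE(kernel_module, m) {
    m.def("saxpy", &saxpy, "Compute y[i] += a * x[i] and return y");
}
```

Built with the one-liner from the pybind11 documentation — `c++ -O3 -shared -std=c++11 -fPIC $(python3 -m pybind11 --includes) kernel_module.cpp -o kernel_module$(python3-config --extension-suffix)` — the result is importable directly: `import kernel_module; kernel_module.saxpy(2.0, [1, 2, 3], [0, 0, 0])`.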
Alternatives and similar repositories for cuda-python-starter-kit
Users interested in cuda-python-starter-kit are comparing it to the libraries listed below.
- LLM training in simple, raw C/CUDA ☆99 · Updated last year
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆46 · Updated this week
- No-GIL Python environment featuring NVIDIA Deep Learning libraries. ☆61 · Updated 2 months ago
- A FlashAttention implementation for JAX with support for efficient document mask computation and context parallelism. ☆115 · Updated 2 months ago
- Learn CUDA with PyTorch ☆27 · Updated this week
- Extensible collectives library in Triton ☆86 · Updated 2 months ago
- My submission for the GPUMODE/AMD fp8 mm challenge ☆25 · Updated 2 weeks ago
- A parallel framework for training deep neural networks ☆61 · Updated 3 months ago
- Reference Kernels for the Leaderboard ☆60 · Updated this week
- TritonParse is a tool designed to help developers analyze and debug Triton kernels by visualizing the compilation process and source code… ☆93 · Updated last week
- FlexAttention w/ FlashAttention3 Support ☆26 · Updated 8 months ago
- High-Performance SGEMM on CUDA devices ☆95 · Updated 5 months ago
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ☆134 · Updated last year
- A list of awesome resources and blogs on topics related to Unum ☆40 · Updated 8 months ago
- SIMD quantization kernels ☆71 · Updated last week
- Proof-of-concept of global switching between numpy/jax/pytorch in a library. ☆18 · Updated last year
- Learning about CUDA by writing PTX code. ☆132 · Updated last year
- PCCL (Prime Collective Communications Library) implements fault-tolerant collective communications over IP ☆94 · Updated last month
- Optimized LLM inference for Apple Silicon using MLX. ☆11 · Updated this week
- ScalarLM - a unified training and inference stack ☆40 · Updated last month
- PyTorch Single Controller ☆218 · Updated this week
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆185 · Updated 3 weeks ago
- 👷 Build compute kernels ☆68 · Updated this week
- Make Triton easier ☆46 · Updated last year
- JaxPP is a library for JAX that enables flexible MPMD pipeline parallelism for large-scale LLM training ☆49 · Updated last month
- GPU documentation for humans ☆70 · Updated this week
- Experimental GPU language with meta-programming ☆23 · Updated 9 months ago
- Personal solutions to the Triton Puzzles ☆19 · Updated 11 months ago