ashvardanian / cuda-python-starter-kit
Parallel computing starter project to build GPU & CPU kernels in CUDA & C++ and call them from Python via PyBind11, without a single line of CMake
☆26 · Updated 4 months ago
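The workflow the kit describes, compiling a CUDA kernel and exposing it to Python through PyBind11 with no CMake involved, looks roughly like the minimal sketch below. This is an illustration rather than code from the repository; the `saxpy` kernel, the `kernels` module name, and the build command are invented for the example.

```cpp
// saxpy.cu -- hypothetical sketch of a CUDA kernel bound to Python via pybind11.
// Build sketch (an assumption, adjust paths to your setup):
//   nvcc -O2 -shared -Xcompiler -fPIC $(python -m pybind11 --includes) \
//        saxpy.cu -o kernels$(python3-config --extension-suffix)
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <cuda_runtime.h>

namespace py = pybind11;

// One thread per element: y[i] = a * x[i] + y[i].
__global__ void saxpy_kernel(float a, float const *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Copies NumPy inputs to the device, launches the kernel, copies the result back.
using array_f = py::array_t<float, py::array::c_style | py::array::forcecast>;

array_f saxpy(float a, array_f x, array_f y) {
    int n = static_cast<int>(x.size());
    float *dx = nullptr, *dy = nullptr;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    saxpy_kernel<<<(n + 255) / 256, 256>>>(a, dx, dy, n);
    cudaMemcpy(y.mutable_data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    return y;
}

PYBIND11_MODULE(kernels, m) {
    m.def("saxpy", &saxpy, "In-place y = a * x + y computed on the GPU (example)");
}
```

Once compiled, the shared object imports like any Python module (`import kernels; kernels.saxpy(2.0, x, y)`), which is the CMake-free workflow the project description refers to.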
Alternatives and similar repositories for cuda-python-starter-kit
Users interested in cuda-python-starter-kit are comparing it to the libraries listed below.
- LLM training in simple, raw C/CUDA ☆99 · Updated last year
- Example ML projects that use the Determined library. ☆32 · Updated 10 months ago
- A FlashAttention implementation for JAX with support for efficient document mask computation and context parallelism. ☆128 · Updated 3 months ago
- A list of awesome resources and blogs on topics related to Unum ☆40 · Updated 8 months ago
- A parallel framework for training deep neural networks ☆61 · Updated 3 months ago
- ScalarLM - a unified training and inference stack ☆44 · Updated last week
- 👷 Build compute kernels ☆74 · Updated this week
- High-Performance SGEMM on CUDA devices ☆97 · Updated 5 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆46 · Updated last week
- My submission for the GPU MODE/AMD fp8 mm challenge ☆27 · Updated last month
- Official Problem Sets / Reference Kernels for the GPU MODE Leaderboard! ☆62 · Updated last week
- No-GIL Python environment featuring NVIDIA Deep Learning libraries. ☆62 · Updated 2 months ago
- GPU documentation for humans ☆81 · Updated this week
- PyTorch Single Controller ☆296 · Updated this week
- Fast and vectorizable algorithms for searching in a vector of sorted floating point numbers ☆142 · Updated 6 months ago
- Learn CUDA with PyTorch ☆27 · Updated this week
- Extensible collectives library in Triton ☆87 · Updated 3 months ago
- 🏙 Interactive performance profiling and debugging tool for PyTorch neural networks. ☆62 · Updated 5 months ago
- Experimental GPU language with meta-programming ☆23 · Updated 10 months ago
- A repository to unravel the language of GPUs, making their kernel conversations easy to understand ☆188 · Updated last month
- Learning about CUDA by writing PTX code. ☆133 · Updated last year
- Small-scale distributed training of sequential deep learning models, built on NumPy and MPI. ☆134 · Updated last year
- FlexAttention w/ FlashAttention3 Support ☆26 · Updated 9 months ago
- Because it's there. ☆16 · Updated 9 months ago
- Benchmarks to capture important workloads. ☆31 · Updated 5 months ago