ashvardanian / cuda-python-starter-kit
A parallel-computing starter project for building GPU and CPU kernels in CUDA and C++ and calling them from Python via PyBind11, without a single line of CMake.
☆ 25 · Updated 2 months ago
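The pitch is the CMake-free workflow: a C++ (or CUDA) kernel is exposed to Python through a PyBind11 module and compiled directly with the include flags that `python3 -m pybind11 --includes` reports. Below is a minimal sketch of that idea, not code from the starter kit itself; the module name `example` and the `saxpy` kernel are illustrative placeholders.

```cpp
// example.cpp — a CPU kernel exposed to Python via pybind11, no CMake involved.
// Illustrative only: `example` and `saxpy` are hypothetical names, not from the repo.
//
// Build (the standard one-liner from the pybind11 docs):
//   c++ -O3 -Wall -shared -std=c++17 -fPIC $(python3 -m pybind11 --includes) \
//       example.cpp -o example$(python3-config --extension-suffix)
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <stdexcept>

namespace py = pybind11;

// Element-wise y = a * x + y over two float32 NumPy arrays, returned as a new array.
py::array_t<float> saxpy(float a, py::array_t<float> x, py::array_t<float> y) {
    py::buffer_info xb = x.request(), yb = y.request();
    if (xb.size != yb.size)
        throw std::runtime_error("x and y must have the same length");

    py::array_t<float> result(xb.size);
    py::buffer_info rb = result.request();

    auto const *xp = static_cast<float const *>(xb.ptr);
    auto const *yp = static_cast<float const *>(yb.ptr);
    auto *rp = static_cast<float *>(rb.ptr);
    for (py::ssize_t i = 0; i < xb.size; ++i)
        rp[i] = a * xp[i] + yp[i];
    return result;
}

PYBIND11_MODULE(example, m) {
    m.def("saxpy", &saxpy, "Compute a*x + y element-wise",
          py::arg("a"), py::arg("x"), py::arg("y"));
}
```

Once compiled, the module imports like any other: `import example; example.saxpy(2.0, x, y)`. The same pattern extends to CUDA kernels by compiling with `nvcc` instead of `c++`.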
Alternatives and similar repositories for cuda-python-starter-kit
Users interested in cuda-python-starter-kit are comparing it to the libraries listed below.
- No-GIL Python environment featuring NVIDIA Deep Learning libraries. ☆60 · Updated last month
- Optimized LLM inference for Apple Silicon using MLX. ☆10 · Updated this week
- LLM training in simple, raw C/CUDA ☆95 · Updated last year
- A list of awesome resources and blogs on topics related to Unum ☆40 · Updated 7 months ago
- ☆21 · Updated 2 months ago
- Learn CUDA with PyTorch ☆20 · Updated 3 months ago
- ☆13 · Updated last year
- Fast and vectorizable algorithms for searching in a vector of sorted floating point numbers ☆137 · Updated 4 months ago
- Write a fast kernel and run it on Discord. See how you compare against the best! ☆44 · Updated this week
- Lightweight Llama 3 8B Inference Engine in CUDA C ☆47 · Updated last month
- A FlashAttention implementation for JAX with support for efficient document mask computation and context parallelism. ☆112 · Updated last month
- FlexAttention w/ FlashAttention3 Support ☆26 · Updated 7 months ago
- 👷 Build compute kernels ☆38 · Updated this week
- ☆11 · Updated 3 months ago
- ☆15 · Updated last month
- Implementation of the paper "Lossless Compression of Vector IDs for Approximate Nearest Neighbor Search" by Severo et al. ☆78 · Updated 3 months ago
- Better bindings for Python ☆17 · Updated 2 years ago
- High-Performance SGEMM on CUDA devices ☆91 · Updated 3 months ago
- TritonParse is a new tool designed to help developers analyze and debug Triton kernels by visualizing the compilation process and source … ☆14 · Updated last week
- ☆24 · Updated this week
- Make triton easier ☆47 · Updated 11 months ago
- Personal solutions to the Triton Puzzles ☆18 · Updated 10 months ago
- A user-friendly tool chain that enables the seamless execution of ONNX models using JAX as the backend. ☆111 · Updated last week
- A collection of reproducible inference engine benchmarks ☆30 · Updated 3 weeks ago
- Rust Implementation of micrograd ☆51 · Updated 10 months ago
- Extensible collectives library in Triton ☆86 · Updated last month
- [WIP] Better (FP8) attention for Hopper ☆30 · Updated 2 months ago
- Generate glue code in seconds to simplify your NVIDIA Triton Inference Server deployments ☆20 · Updated 10 months ago
- ☆88 · Updated last year
- Experiment of using Tangent to autodiff Triton ☆78 · Updated last year