CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) created by NVIDIA for general-purpose computing on its GPUs (Graphics Processing Units). It lets developers use the parallel processing capabilities of NVIDIA GPUs to accelerate computation-heavy tasks such as matrix operations, physics simulations, deep learning training, and real-time video processing. CUDA provides a C-like programming language in which developers write kernel functions that execute on the GPU and manage memory transfers between the host (CPU) and the device (GPU). For suitable, highly parallel workloads CUDA can deliver significant speedups, and it is usable from other environments, including Python through libraries like PyCUDA or frameworks like TensorFlow with GPU support. To program GPUs effectively with CUDA, developers need to understand the basic concepts of kernels, threads, blocks, and warps.
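The kernel, thread, and block concepts mentioned above can be illustrated without a GPU. What follows is a minimal CPU-only Python sketch, not CUDA code: it only mimics the index arithmetic a CUDA kernel typically uses (`blockIdx * blockDim + threadIdx`) and the grouping of threads into warps. The names `launch`, `vector_add_kernel`, and `warps_per_block` are hypothetical helpers invented for this illustration, and the loop runs threads one after another, whereas a real GPU runs them in parallel.

```python
# CPU-only emulation of CUDA-style grid/block/thread indexing (no GPU needed).
# launch, vector_add_kernel and warps_per_block are illustrative names,
# not part of any CUDA or PyCUDA API.

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def vector_add_kernel(block_idx, block_dim, thread_idx, a, b, out):
    """Body executed once per (block, thread) pair, like a GPU kernel."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(out):  # bounds guard: the grid may cover more than n elements
        out[i] = a[i] + b[i]


def launch(kernel, grid_dim, block_dim, *args):
    """Call the kernel for every thread of every block, sequentially."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def warps_per_block(block_dim):
    """Number of warps needed to hold block_dim threads (ceiling division)."""
    return (block_dim + WARP_SIZE - 1) // WARP_SIZE


n = 1000
a = list(range(n))
b = [2 * x for x in a]
out = [0] * n

block_dim = 256
grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n elements
launch(vector_add_kernel, grid_dim, block_dim, a, b, out)
```

With these inputs the grid has more threads than elements, which is why the kernel body checks `i < len(out)` before writing. The same guard is conventional in real kernels whenever the data size is not a multiple of the block size.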
View the most prominent open source CUDA projects in the list below.
- A high-throughput and memory-efficient inference and serving engine for LLMs ☆64,235 · Updated last week
- World's fastest and most advanced password recovery utility ☆24,848 · Updated 2 weeks ago
- Build and run Docker containers leveraging NVIDIA GPUs ☆17,450 · Updated 2 years ago
- SGLang is a fast serving framework for large language models and vision language models. ☆20,874 · Updated this week
- Instant neural graphics primitives: lightning fast NeRF and more ☆17,098 · Updated last month
- kaldi-asr/kaldi is the official location of the Kaldi project. ☆15,253 · Updated 2 months ago
- CUDA on non-NVIDIA GPUs ☆13,534 · Updated this week
- Open3D: A Modern Library for 3D Data Processing ☆13,054 · Updated 2 weeks ago
- Burn is a next generation tensor library and Deep Learning Framework that doesn't compromise on flexibility, efficiency and portability. ☆13,564 · Updated this week
- Solve puzzles. Learn CUDA. ☆11,790 · Updated last year
- TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizat… ☆12,312 · Updated this week
- NumPy aware dynamic Python compiler using LLVM ☆10,769 · Updated this week
- NumPy & SciPy for GPU ☆10,653 · Updated last week
- OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient. ☆9,376 · Updated 3 months ago
- cuDF - GPU DataFrame Library ☆9,352 · Updated last week
- Containers for machine learning ☆9,102 · Updated last week
- A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other ma… ☆8,682 · Updated this week
- CUDA Templates and Python DSLs for High-Performance Linear Algebra ☆8,902 · Updated this week
- Samples for CUDA Developers which demonstrates features in CUDA Toolkit ☆8,531 · Updated 3 months ago
- Modular ZK(Zero Knowledge) backend accelerated by GPU ☆7,738 · Updated last year
- Go package for computer vision using OpenCV 4 and beyond. Includes support for DNN, CUDA, OpenCV Contrib, and OpenVINO. ☆7,304 · Updated last month
- 📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉 ☆8,709 · Updated last week
- A flexible framework of neural networks for deep learning ☆5,912 · Updated 2 years ago
- An interactive NVIDIA-GPU process viewer and beyond, the one-stop solution for GPU process management. ☆6,360 · Updated this week
- A Python framework for accelerated simulation, data generation and spatial computing. ☆5,868 · Updated last week
- ALIEN is a CUDA-powered artificial life simulation program. ☆5,292 · Updated this week
- [ARCHIVED] The C++ parallel algorithms library. See https://github.com/NVIDIA/cccl ☆4,989 · Updated last year
- cuML - RAPIDS Machine Learning Library ☆5,037 · Updated this week
- A PyTorch Library for Accelerating 3D Deep Learning Research ☆4,978 · Updated last week
- Supercharge Your LLM with the Fastest KV Cache Layer ☆6,277 · Updated this week
- ArrayFire: a general purpose GPU library. ☆4,831 · Updated 3 months ago
- Tengine is a lite, high performance, modular inference engine for embedded device ☆4,500 · Updated 9 months ago
- Making it easier to work with shaders ☆4,797 · Updated this week
- Lightning fast C++/CUDA neural network framework ☆4,324 · Updated last week
- HIP: C++ Heterogeneous-Compute Interface for Portability ☆4,260 · Updated last week
- Fast inference engine for Transformer models ☆4,166 · Updated this week
- FlashInfer: Kernel Library for LLM Serving ☆4,168 · Updated this week
- GPU cluster manager for optimized AI model deployment ☆4,147 · Updated this week
- A retargetable MLIR-based machine learning compiler and runtime toolkit. ☆3,475 · Updated last week