kfish / micrograd-cpp-2023
A C++ port of karpathy/micrograd, a tiny scalar-valued autograd engine and a neural net library
☆14Updated 2 years ago
Alternatives and similar repositories for micrograd-cpp-2023
Users who are interested in micrograd-cpp-2023 are comparing it to the libraries listed below.
- Header-only safetensors loader and saver in C++☆78Updated last month
- C implementation of the L-Mul f32/f16 multiplications from paper: https://arxiv.org/html/2410.00907☆28Updated last year
- ☆23Updated 3 years ago
- Generate simple index ranges in C++ and CUDA C++☆39Updated 2 years ago
- ☆13Updated last week
- Easier, quicker command-line CUDA profiling☆45Updated last year
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis.☆57Updated 10 months ago
- Directed Acyclic Graph Execution Engine (DAGEE) is a C++ library that enables programmers to express computation and data movement, as ta…☆47Updated 4 years ago
- A C++ port of karpathy/llm.c that features a tiny torch library while maintaining overall simplicity.☆42Updated last year
- Monorepo for the OpenCilk compiler. Forked from llvm/llvm-project and based on Tapir/LLVM.☆120Updated this week
- 🎃 GPU load-balancing library for regular and irregular computations.☆66Updated 4 months ago
- Source code for 'Modern Parallel Programming with C++ and Assembly' by Dan Kusswurm☆75Updated 3 years ago
- SYCL Reference Manual☆29Updated last week
- High-Performance FP32 GEMM on CUDA devices☆117Updated last year
- LLM training in simple, raw C/CUDA☆112Updated last year
- Parallel Tasking Library (PTL) - Lightweight C++11 multithreading tasking system featuring thread-pool, task-groups, and lock-free task q…☆48Updated last year
- Multi-Threaded FP32 Matrix Multiplication on x86 CPUs☆376Updated 9 months ago
- High Level Algorithmic Skeleton CUDA Library☆30Updated last year
- Little OpenMP Library☆171Updated 3 years ago
- ☆18Updated last year
- Examples from Programming in Parallel with CUDA☆170Updated last week
- ☆26Updated 11 months ago
- Stepwise optimizations of DGEMM on CPU, eventually outperforming Intel MKL, even under multithreading.☆163Updated 4 years ago
- A parallel implementation of DFS for Directed Acyclic Graphs (https://research.nvidia.com/publication/parallel-depth-first-search-directe…☆50Updated 4 years ago
- A lightweight memory allocator for hardware-accelerated machine learning☆181Updated 4 months ago
- ☆90Updated last week
- amdgpu example code in hip/asm☆54Updated last week
- An extension library of WMMA API (Tensor Core API)☆109Updated last year
- Embedded Universal DSL: a good DSL for us, by us☆66Updated this week
- A C/C++ task-based programming model for shared memory and distributed parallel computing.☆72Updated 5 years ago