kfish / micrograd-cpp-2023
A C++ port of karpathy/micrograd, a tiny scalar-valued autograd engine and a neural net library
☆14 · Updated 2 years ago
Alternatives and similar repositories for micrograd-cpp-2023
Users interested in micrograd-cpp-2023 are comparing it to the libraries listed below.
- MLIR-based toolkit targeting intel heterogeneous hardware ☆51 · Updated this week
- amdgpu example code in hip/asm ☆54 · Updated last week
- Stepwise optimizations of DGEMM on CPU, reaching performance faster than Intel MKL eventually, even under multithreading. ☆163 · Updated 4 years ago
- 🎃 GPU load-balancing library for regular and irregular computations. ☆66 · Updated 4 months ago
- Easier, quicker command-line CUDA profiling ☆45 · Updated last year
- Super fast FP32 matrix multiplication on RDNA3 ☆82 · Updated 10 months ago
- Generate simple index ranges in C++ and CUDA C++ ☆39 · Updated 2 years ago
- Directed Acyclic Graph Execution Engine (DAGEE) is a C++ library that enables programmers to express computation and data movement, as ta… ☆47 · Updated 4 years ago
- A simple profiler to count Nvidia PTX assembly instructions of OpenCL/SYCL/CUDA kernels for roofline model analysis. ☆57 · Updated 10 months ago
- High-Performance FP32 GEMM on CUDA devices ☆117 · Updated last year
- A tool for generating information about the matrix multiplication instructions in AMD Radeon™ and AMD Instinct™ accelerators ☆125 · Updated 2 months ago
- ☆59 · Updated this week
- [DEPRECATED] Moved to ROCm/rocm-libraries repo ☆138 · Updated last week
- ☆23 · Updated 3 years ago
- ☆165 · Updated last week
- Header-only safetensors loader and saver in C++ ☆78 · Updated last month
- ☆13 · Updated this week
- C implementation of the L-Mul f32/f16 multiplications from paper: https://arxiv.org/html/2410.00907 ☆28 · Updated last year
- ☆18 · Updated last year
- Implementation of parallel Breadth First Algorithm for graph traversal using CUDA and C++ language. ☆34 · Updated 6 years ago
- CUDA Matrix Multiplication Optimization ☆256 · Updated last year
- ☆53 · Updated 9 months ago
- Examples from Programming in Parallel with CUDA ☆170 · Updated last week
- TPP experimentation on MLIR for linear algebra ☆142 · Updated this week
- Multi-Threaded FP32 Matrix Multiplication on x86 CPUs ☆377 · Updated 9 months ago
- Embedded Universal DSL: a good DSL for us, by us ☆66 · Updated this week
- An extension library of WMMA API (Tensor Core API) ☆109 · Updated last year
- Gallatin is a general-purpose memory manager for CUDA that allows for threads to quickly malloc and free memory of arbitrary size inside … ☆25 · Updated this week
- Open ABI and FFI for Machine Learning Systems ☆333 · Updated this week
- A compiler for Tiger language includes lexical analysis using flexc++, parsing using Bisonc++, type checking, building abstract syntax tr… ☆13 · Updated 3 years ago