elinx / ugrad
A C++ implementation of the scalar-valued autograd engine micrograd
☆23 Updated 5 years ago
Alternatives and similar repositories for ugrad
Users who are interested in ugrad are comparing it to the libraries listed below.
- A C++ port of karpathy/llm.c that features a tiny torch library while maintaining overall simplicity. ☆34 Updated 10 months ago
- MLIR based Tiny Graph Compiler [dev-stage] ☆18 Updated 7 months ago
- my little linear algebra library ☆46 Updated 11 months ago
- Learn OpenCL step by step. ☆136 Updated 2 years ago
- Recreating PyTorch from scratch (C/C++, CUDA, NCCL and Python, with multi-GPU support and automatic differentiation!) ☆150 Updated last year
- LLM training in simple, raw C/CUDA ☆99 Updated last year
- Serial and parallel implementations of matrix multiplication ☆41 Updated 4 years ago
- ☆17 Updated last year
- Multi-Threaded FP32 Matrix Multiplication on x86 CPUs ☆352 Updated 2 months ago
- Class of High Performance Computing taken at U.T.P 2017 ☆65 Updated 7 years ago
- A recurrent (LSTM) neural network in C ☆94 Updated 3 years ago
- Learning about CUDA by writing PTX code. ☆133 Updated last year
- Explore training for quantized models ☆18 Updated this week
- Neural network from scratch in CUDA/C++ ☆80 Updated 5 months ago
- NNCG: A Neural Network Code Generator ☆35 Updated 10 months ago
- Performance of the C++ interface of flash attention and flash attention v2 in large language model (LLM) inference scenarios. ☆16 Updated last year
- NVIDIA tools guide ☆137 Updated 5 months ago
- A faithful clone of Karpathy's llama2.c (one file inference, zero dependency) but fully functional with LLaMA 3 8B base and instruct mode… ☆128 Updated 11 months ago
- CUDA Matrix Multiplication Optimization ☆196 Updated 11 months ago
- CUTLASS and CuTe Examples ☆57 Updated 5 months ago
- An MLIR-based toy DL compiler for TVM Relay. ☆58 Updated 2 years ago
- Several optimization methods of half-precision general matrix vector multiplication (HGEMV) using CUDA core. ☆63 Updated 9 months ago
- Super fast FP32 matrix multiplication on RDNA3 ☆64 Updated 2 months ago
- High-Performance SGEMM on CUDA devices ☆96 Updated 5 months ago
- Converting a deep neural network to integer-only inference in native C via uniform quantization and the fixed-point representation. ☆24 Updated 3 years ago
- Can I make an *optimizing* compiler under 1k lines of code? ☆60 Updated 4 months ago
- An Open Convolutional Neural Network Framework in C++ From Scratch ☆65 Updated 4 years ago
- A simple and fast minimalistic header-only library allowing to run async tasks and execute task graphs. ☆53 Updated 6 months ago
- Clover: Quantized 4-bit Linear Algebra Library ☆114 Updated 7 years ago
- pytorch from scratch in pure C/CUDA and python ☆40 Updated 8 months ago