Recreating PyTorch from scratch (C/C++, CUDA, NCCL and Python, with multi-GPU support and automatic differentiation!)
★165 · Nov 25, 2025 · Updated 5 months ago
Alternatives and similar repositories for PyNorch
Users that are interested in PyNorch are comparing it to the libraries listed below.
- Contrastive Reinforcement Learning ★63 · Apr 4, 2026 · Updated 3 weeks ago
- SpeechPlus: Small LLM-Based Text-to-Speech Library ★21 · May 20, 2025 · Updated 11 months ago
- 🧠 A study guide to learn about Transformers ★12 · Jan 11, 2024 · Updated 2 years ago
- Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge. ★19 · Feb 9, 2026 · Updated 2 months ago
- Alex Krizhevsky's original code from Google Code ★199 · Mar 10, 2016 · Updated 10 years ago
- ★46 · May 24, 2025 · Updated 11 months ago
- UNet diffusion model in pure CUDA ★657 · Jun 28, 2024 · Updated last year
- This repo implements and trains DallE-1 on a synthetically generated dataset which has colored mnist images on texture/solid background a… ★13 · Oct 30, 2024 · Updated last year
- A std::execution style runtime context and High Performance RPC Transport for using OpenUCX. Including CUDA/ROCM/... devices with RDMA. ★30 · Apr 21, 2026 · Updated last week
- A really tiny autograd engine ★100 · May 26, 2025 · Updated 11 months ago
- High Performance FP8 GEMM Kernels for SM89 and later GPUs. ★21 · Jan 24, 2025 · Updated last year
- High-Performance FP32 GEMM on CUDA devices ★122 · Jan 21, 2025 · Updated last year
- Implementation of FlashAttention (FA1-FA4) in PyTorch for educational and algorithmic clarity ★206 · Apr 12, 2026 · Updated 2 weeks ago
- Paper implementation ★52 · Apr 8, 2025 · Updated last year
- Reference implementation of Mistral AI 7B v0.1 model. ★28 · Dec 25, 2023 · Updated 2 years ago
- The official evaluation suite and dynamic data release for MixEval. ★11 · Sep 23, 2024 · Updated last year
- ★93 · Jul 5, 2024 · Updated last year
- This repo implements Diffusion Transformers (DiT) in PyTorch and provides training and inference code on CelebHQ dataset ★64 · Jan 6, 2025 · Updated last year
- High Performance Int8 GEMM Kernels for SM80 and later GPUs. ★22 · Mar 11, 2025 · Updated last year
- Comprehensive CUDA tutorials for Maths & ML with examples ★223 · Jun 11, 2025 · Updated 10 months ago
- ★19 · Jan 16, 2025 · Updated last year
- My submission for the GPUMODE/AMD fp8 mm challenge ★29 · Jun 4, 2025 · Updated 10 months ago
- Open deep learning compiler stack for cpu, gpu and specialized accelerators ★19 · Updated this week
- Fast GPU based tensor core reductions ★13 · Jan 13, 2023 · Updated 3 years ago
- A llama model inference framework implemented in CUDA C++ ★65 · Nov 8, 2024 · Updated last year
- ★16 · Oct 5, 2024 · Updated last year
- This is a small autograd engine, made purely from numpy and python. ★27 · Sep 17, 2024 · Updated last year
- ★14 · May 15, 2023 · Updated 2 years ago
- TACOS: [T]opology-[A]ware [Co]llective Algorithm [S]ynthesizer for Distributed Machine Learning ★33 · Jun 13, 2025 · Updated 10 months ago
- C Compiler written in Kotlin ★13 · Apr 19, 2024 · Updated 2 years ago
- Minimalistic 4D-parallelism distributed training framework for educational purposes ★2,159 · Aug 26, 2025 · Updated 8 months ago
- A zero-config OpenAI client with support for 20+ providers, API key rotation, rate limits, optional LangChain integration and more. ★19 · Dec 11, 2025 · Updated 4 months ago
- An Easy-to-understand TensorOp Matmul Tutorial ★428 · Mar 5, 2026 · Updated last month
- flash attention tutorial written in python, triton, cuda, cutlass ★506 · Jan 20, 2026 · Updated 3 months ago
- ★21 · May 26, 2025 · Updated 11 months ago
- A simple Python tool to measure the performance of ONNX models. ★27 · Sep 15, 2024 · Updated last year
- ★29 · Dec 15, 2025 · Updated 4 months ago
- A collection of reusable, high-performance, well-documented, thoroughly tested layers and models in Jax ★23 · Jun 8, 2025 · Updated 10 months ago
- Find, list, and inspect processes from Go (golang). ★10 · Feb 4, 2018 · Updated 8 years ago