GPU Kernels
☆221 Apr 27, 2025 Updated 10 months ago
Alternatives and similar repositories for 100Days
Users that are interested in 100Days are comparing it to the libraries listed below
- 100 days of building GPU kernels! ☆573 Apr 27, 2025 Updated 10 months ago
- ☆417 Apr 10, 2025 Updated 10 months ago
- A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code. ☆461 Mar 10, 2025 Updated 11 months ago
- This repository is a curated collection of resources, tutorials, and practical examples designed to guide you through the journey of mast… ☆440 Feb 22, 2025 Updated last year
- Persistent dense gemm for Hopper in `CuTeDSL` ☆15 Aug 9, 2025 Updated 6 months ago
- [WIP] Better (FP8) attention for Hopper ☆32 Feb 24, 2025 Updated last year
- Triangles in practice! Triton ☆16 Feb 15, 2024 Updated 2 years ago
- High-performance RMSNorm implementation using SM core storage (registers and shared memory) ☆30 Jan 22, 2026 Updated last month
- Summaries of the paper I read for academic work ☆15 Aug 23, 2022 Updated 3 years ago
- Learnings and programs related to CUDA ☆433 Jun 29, 2025 Updated 8 months ago
- Will write CUDA for 100 days ☆38 May 25, 2025 Updated 9 months ago
- A curated collection of resources, tutorials, and best practices for learning and mastering NVIDIA CUTLASS ☆253 May 6, 2025 Updated 9 months ago
- ☆238 Jan 2, 2025 Updated last year
- A library of replicated state machine algorithms based on Viewstamped Replication Revisited ☆14 Feb 6, 2021 Updated 5 years ago
- A std::execution style runtime context and High Performance RPC Transport for using OpenUCX. Including CUDA/ROCM/... devices with RDMA. ☆29 Feb 22, 2026 Updated last week
- ☆17 May 15, 2025 Updated 9 months ago
- Expert Specialization MoE Solution based on CUTLASS ☆27 Jan 19, 2026 Updated last month
- GEMV implementation with CUTLASS ☆19 Aug 21, 2025 Updated 6 months ago
- A C++ port of karpathy/micrograd, a tiny scalar-valued autograd engine and a neural net library ☆13 Nov 24, 2023 Updated 2 years ago
- Website for CSE 234, Winter 2025 ☆13 Mar 24, 2025 Updated 11 months ago
- Row-wise block scaling for fp8 quantization matrix multiplication. Solution to GPU mode AMD challenge. ☆17 Feb 9, 2026 Updated 3 weeks ago
- Cute layout visualization ☆30 Jan 18, 2026 Updated last month
- CUDA Learning guide ☆531 Jun 20, 2024 Updated last year
- CargoCoin is designed to be a smart contract, crypto currency platform, decentralising global trade and transport. The platform target is… ☆13 Aug 8, 2018 Updated 7 years ago
- ☆15 Jun 10, 2024 Updated last year
- DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling ☆21 Feb 9, 2026 Updated 3 weeks ago
- everything i know about cuda and triton ☆13 Jan 28, 2025 Updated last year
- CUTLASS and CuTe Examples ☆132 Nov 30, 2025 Updated 3 months ago
- ☆116 May 16, 2025 Updated 9 months ago
- LLM training parallelisms (DP, FSDP, TP, PP) in pure C ☆26 Jan 27, 2026 Updated last month
- This repository contains the implementation of the paper: "Span Classification with Structured Information for Disfluency Detection in Sp… ☆15 Jun 6, 2023 Updated 2 years ago
- Personal solutions to the Triton Puzzles ☆20 Jul 18, 2024 Updated last year
- Puzzles for learning Triton ☆2,314 Nov 18, 2024 Updated last year
- A repository consisting of paper/architecture replications of classic/SOTA AI/ML papers in pytorch ☆405 Nov 11, 2025 Updated 3 months ago
- Transformers from scratch using PyTorch & NumPy. ☆50 Feb 7, 2025 Updated last year
- DeeperGEMM: crazy optimized version ☆74 May 5, 2025 Updated 9 months ago
- Making of cuda kernel ☆17 May 27, 2025 Updated 9 months ago
- ☆15 Feb 5, 2025 Updated last year
- Effort to open-source 10.5 trillion parameter Gemini model. ☆17 Dec 6, 2023 Updated 2 years ago