keith2018 / TinyTorch
A tiny deep learning training framework implemented from scratch in C++ that follows PyTorch's API.
☆53 · Updated 3 weeks ago
Alternatives and similar repositories for TinyTorch
Users interested in TinyTorch are comparing it to the libraries listed below.
- Code release for the book "Efficient Training in PyTorch" ☆69 · Updated 2 months ago
- Code and examples for "CUDA - From Correctness to Performance" ☆100 · Updated 8 months ago
- Tutorials for writing high-performance GPU operators in AI frameworks. ☆130 · Updated last year
- Implementing custom PyTorch operators with CUDA/C++ ☆63 · Updated 2 years ago
- Triton documentation in Simplified Chinese ☆71 · Updated 2 months ago
- ☆28 · Updated last month
- A llama model inference framework implemented in CUDA C++ ☆57 · Updated 7 months ago
- Tiny C++11 GPT-2 inference implementation from scratch ☆62 · Updated last month
- A deep learning inference engine with a layered, decoupled architecture ☆73 · Updated 4 months ago
- ⚡️FFPA: Extend FlashAttention-2 with Split-D, achieve ~O(1) SRAM complexity for large headdim, 1.8x~3x↑ vs SDPA. ☆186 · Updated last month
- A lightweight llama-like LLM inference framework based on Triton kernels. ☆128 · Updated last week
- Implementing Flash Attention using CuTe. ☆87 · Updated 6 months ago
- Courses from Bilibili ☆75 · Updated last year
- Large-scale Auto-Distributed Training/Inference Unified Framework | Memory-Compute-Control Decoupled Architecture | Multi-language SDK & … ☆51 · Updated this week
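- Solutions to "Programming Massively Parallel Processors", 2nd edition ☆33 · Updated 3 years ago
- Code and notes on six major parallel computing patterns in CUDA ☆61 · Updated 4 years ago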
- A tutorial for CUDA & PyTorch ☆146 · Updated 5 months ago
- Examples of CUDA implementations using CUTLASS CuTe ☆197 · Updated 4 months ago
- A tutorial series on x86-64 SIMD vector optimization ☆121 · Updated 2 months ago
- ☆278 · Updated 8 months ago
- ⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance. ☆80 · Updated last month
- ☆135 · Updated last year
- Solutions to LeetGPU ☆27 · Updated last week
- ☆70 · Updated 2 years ago
- Step-by-step SGEMM optimization with CUDA ☆19 · Updated last year
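- Code implementations for "Professional CUDA C Programming" (Chinese edition), covering most of the code from Chapters 2 through 8 of the book along with the author's notes. Everything was implemented by hand by the author, so there are bound to be mistakes; please use it with caution, and corrections are very welcome. If it helps you, please give it a Star; it means a lot to the author. Thank you! ☆345 · Updated 2 years ago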
- GPTQ inference TVM kernel ☆40 · Updated last year
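- Putting popular parallel programming frameworks to the test: performance benchmarks (with sharp commentary from 小彭老师). Evaluated so far: Taichi, SyCL, C++, OpenMP, TBB, Mojo ☆35 · Updated last year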
- CPU, memory, compiler, and parallel programming ☆26 · Updated 7 months ago
- ☆276 · Updated 4 years ago