LearningInfiniTensor / handout
Training camp handout
Alternatives and similar repositories for handout:
Users interested in handout are comparing it to the repositories listed below.
- Easy CUDA code
- Notes
- Some HPC projects for learning
- Operator library
- Homepage of the Advanced Compilation Lab (先进编译实验室)
- A compiler for a subset of C, with a backend based on LLVM 20
- Triton Documentation in Simplified Chinese / Triton 中文文档
- A layered, decoupled deep learning inference engine
- Course materials for MIT 6.5940: TinyML and Efficient Deep Learning Computing
- 《自己动手写AI编译器》 (Write Your Own AI Compiler)
- Code and examples for "CUDA - From Correctness to Performance"
- Peking University compiler course project: an independently written compiler for SysY, a subset of C, compiling C to Koopa IR and then Koopa IR to RISC-V assembly
- A lightweight llama-like LLM inference framework based on Triton kernels
- A PyTorch-like deep learning framework. Just for fun.
- A llama model inference framework implemented in CUDA C++
- Learn something after work instead of scrolling your phone. Series 1: CUFX (CUDA Framework eXtended), a CUDA compute framework.
- A summary of the specs of GPUs commonly used for LLM training and inference
- LLM theoretical performance analysis tools supporting parameter, FLOPs, memory, and latency analysis (a back-of-the-envelope sketch of this style of estimate follows the list)
- 📚 FFPA (Split-D): yet another faster flash prefill attention, with O(1) GPU SRAM complexity for headdim > 256, roughly 2x faster than SDPA EA
- Implements Flash Attention using CuTe.
- Free resources for the book AI Compiler Development Guide
- Learning materials for CMU 10-714: Deep Learning Systems
- [HACKATHON prep camp] The PaddlePaddle (飞桨) Launch Program training camp
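
To make concrete what the theoretical-performance-analysis entry above refers to, here is a minimal sketch of the standard back-of-the-envelope estimates (assuming ~2 FLOPs per parameter per token for a forward pass and weight-bandwidth-bound decoding; the function name, constants, and defaults are illustrative assumptions, not any listed repository's actual API):

```python
# Back-of-the-envelope LLM estimates. All constants are assumptions:
# fp16 weights (2 bytes/param) and A100-class peak numbers by default.

def llm_estimates(n_params: float, n_tokens: int,
                  bytes_per_param: int = 2,      # fp16/bf16 weights
                  peak_flops: float = 312e12,    # dense fp16 peak, FLOP/s
                  mem_bw: float = 2.0e12):       # HBM bandwidth, bytes/s
    flops = 2 * n_params * n_tokens              # ~2 FLOPs/param/token forward
    weight_bytes = n_params * bytes_per_param    # static weight memory
    t_compute = flops / peak_flops               # compute-bound lower bound
    t_decode = n_tokens * weight_bytes / mem_bw  # decode re-reads all weights per token
    return flops, weight_bytes, t_compute, t_decode

if __name__ == "__main__":
    f, w, tc, td = llm_estimates(7e9, 128)       # e.g. a 7B model, 128 tokens
    print(f"FLOPs: {f:.2e}, weights: {w / 1e9:.0f} GB")
    print(f"compute-bound: {tc * 1e3:.1f} ms, "
          f"bandwidth-bound decode: {td * 1e3:.0f} ms")
```

For a 7B fp16 model this gives ~14 GB of weights and shows decoding as bandwidth-bound rather than compute-bound, which is the typical conclusion such analysis tools reach.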