tingshua-yts / BetterDL
☆35 · Updated last year
Alternatives and similar repositories for BetterDL:
Users who are interested in BetterDL are comparing it to the libraries listed below.
- ☆123 · Updated last year
- A simple deep learning framework that supports automatic differentiation and GPU acceleration. ☆58 · Updated 2 years ago
- Inference code for LLaMA models ☆120 · Updated last year
- Tutorials for writing high-performance GPU operators in AI frameworks. ☆130 · Updated last year
- LLM theoretical performance analysis tools supporting parameter, FLOPs, memory, and latency analysis. ☆87 · Updated 4 months ago
- ☆123 · Updated last week
- CUDA's six major parallel computing patterns: code and notes ☆60 · Updated 4 years ago
- A high-performance distributed deep learning system targeting large-scale and automated distributed training. If you have any interests, … ☆111 · Updated last year
- Code base and slides for ECE408: Applied Parallel Programming On GPU. ☆122 · Updated 3 years ago
- Optimized softmax in Triton for many cases ☆20 · Updated 8 months ago
- A tutorial for CUDA & PyTorch ☆138 · Updated 3 months ago
- ATC23 AE ☆45 · Updated last year
- PyTorch Dataset Rank Dataset ☆42 · Updated 4 years ago
- Implement custom operators in PyTorch with CUDA/C++ ☆60 · Updated 2 years ago
- Simple Dynamic Batching Inference ☆145 · Updated 3 years ago
- A high-performance distributed deep learning system targeting large-scale and automated distributed training. ☆299 · Updated 2 weeks ago
- ☆139 · Updated last year
- ☆48 · Updated this week
- Models and examples built with OneFlow ☆97 · Updated 6 months ago
- ☆148 · Updated 4 months ago
- Learning how CUDA works ☆250 · Updated 2 months ago
- A tiny learning framework built with cuDNN and cuBLAS. ☆21 · Updated 3 years ago
- Translates networks from different platforms into an intermediate representation (IR) ☆44 · Updated 6 years ago
- ☆39 · Updated 3 years ago
- [USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral… ☆53 · Updated 9 months ago
- 📚FFPA (Split-D): Extend FlashAttention with Split-D for large headdim, O(1) GPU SRAM complexity, 1.8x~3x↑🎉 faster than SDPA EA. ☆171 · Updated last month
- ☆79 · Updated last year
- ☆127 · Updated 4 months ago
- ☆45 · Updated 5 years ago
- A small deep-learning framework with C++/Python/CUDA ☆53 · Updated 7 years ago