array2d / deepx

Large-scale Auto-Distributed Training/Inference Unified Framework | Memory-Compute-Control Decoupled Architecture | Multi-language SDK & Heterogeneous Hardware Support

☆55

Alternatives and similar repositories for deepx

Users who are interested in deepx are comparing it to the repositories listed below.

xgqdut2016 / cuda_code
easy cuda code
☆90 Updated 11 months ago
InfiniTensor / InfiniTensor
☆274 Updated last month
interestingLSY / CUDA-From-Correctness-To-Performance-Code
Codes & examples for "CUDA - From Correctness to Performance"
☆117 Updated last year
tongzhou80 / nanoPyC
☆70 Updated 2 years ago
YuxueYang1204 / CudaDemo
Implement custom operators in PyTorch with cuda/c++
☆74 Updated 2 years ago
CalvinXKY / BasicCUDA
A tutorial for CUDA&PyTorch
☆170 Updated 10 months ago
RussWong / LLM-engineering
☆26 Updated 3 months ago
InfiniTensor / operators
Operator library
☆17 Updated 4 months ago
hyperai / triton-cn
Triton documentation in Simplified Chinese
☆91 Updated last week
Sunt-ing / stick
A PyTorch-like deep learning framework. Just for fun.
☆156 Updated 2 years ago
xgqdut2016 / hpc2torch
☆29 Updated last month
InfiniTensor / RefactorGraph
A layered, decoupled deep learning inference engine
☆76 Updated 9 months ago
InfiniTensor / InfiniLM-Rust
☆125 Updated last month
flagos-ai / FlagCX
☆131 Updated this week
BBuf / how-to-learn-deep-learning-framework
how to learn PyTorch and OneFlow
☆460 Updated last year
dsl-learn / LeetGPU
LeetGPU Solutions
☆84 Updated last month
xlite-dev / ffpa-attn
🤖FFPA: Extend FlashAttention-2 with Split-D, ~O(1) SRAM complexity for large headdim, 1.8x~3x↑🎉 vs SDPA EA.
☆233 Updated 2 weeks ago
AdvancedCompiler / AdvancedCompiler
Homepage of the Advanced Compiler Lab
☆174 Updated last month
LDLINGLINGLING / nano_vllm_note
An annotated nano_vllm repository, with MiniCPM4 adaptation and support for registering new models
☆108 Updated 3 months ago
YangLinzhuo / cuda-sgemm-optimization
CUDA SGEMM optimization note
☆15 Updated 2 years ago
QianyanTech / NBAssembler
Assembler and Decompiler for NVIDIA (Maxwell Pascal Volta Turing Ampere) GPUs.
☆91 Updated 2 years ago
l1nkr / DL-Compiler-Navigation
Machine Learning Compiler Road Map
☆45 Updated 2 years ago
caiwanxianhust / FasterLLaMA
A llama model inference framework implemented in CUDA C++
☆62 Updated last year
ArthurinRUC / cutlass-notes
From Minimal GEMM to Everything
☆82 Updated 3 weeks ago
DD-DuDa / Cute-Learning
Examples of CUDA implementations by Cutlass CuTe
☆254 Updated 5 months ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆134 Updated 6 months ago
XiaoSong9905 / HPC-Notes
Personal Notes for Learning HPC & Parallel Computation [Active Adding New Content]
☆75 Updated 3 years ago
flagos-ai / flagtree
FlagTree is a unified compiler for multiple AI chips, which is forked from triton-lang/triton.
☆137 Updated this week
infinigence / Semi-PD
A prefill & decode disaggregated LLM serving framework with shared GPU memory and fine-grained compute isolation.
☆116 Updated 6 months ago
sunkx109 / GPUs-Specs
Summary of the Specs of Commonly Used GPUs for Training and Inference of LLM
☆67 Updated 3 months ago