dlsyscourse / hw1

☆13

Alternatives and similar repositories for hw1

Users who are interested in hw1 are comparing it to the libraries listed below.

harleyszhang / llm_counts
LLM theoretical performance analysis tools, supporting params, FLOPs, memory, and latency analysis.
☆112Updated 4 months ago
eedalong / ECE408
Code base and slides for ECE408: Applied Parallel Programming on GPU.
☆137Updated 4 years ago
dlsyscourse / hw0
☆51Updated 2 months ago
Hsword / Hetu
A high-performance distributed deep learning system targeting large-scale and automated distributed training. If you have any interests, …
☆123Updated last year
CalvinXKY / mfu_calculation
A simple calculation for LLM MFU.
☆50Updated 2 months ago
gogongxt / nano-sglang
☆41Updated last month
mlc-ai / notebooks
☆210Updated last year
fanlai0990 / CS598
Systems for GenAI
☆147Updated 7 months ago
l1nkr / DL-Compiler-Navigation
Machine Learning Compiler Road Map
☆45Updated 2 years ago
AlibabaPAI / FLASHNN
☆102Updated last year
Chtholly-Boss / swizzle
A practical way of learning Swizzle
☆33Updated 9 months ago
ConnollyLeon / awesome-Auto-Parallelism
A baseline repository of Auto-Parallelism in Training Neural Networks
☆147Updated 3 years ago
kwai / Megatron-Kwai
[USENIX ATC '24] Accelerating the Training of Large Language Models using Efficient Activation Rematerialization and Optimal Hybrid Paral…
☆66Updated last year
KuangjuX / NVSHMEM-Tutorial
NVSHMEM‑Tutorial: Build a DeepEP‑like GPU Buffer
☆143Updated 2 months ago
luliyucoordinate / cute-flash-attention
Implement Flash Attention using Cute.
☆96Updated 11 months ago
interestingLSY / CUDA-From-Correctness-To-Performance-Code
Codes & examples for "CUDA - From Correctness to Performance"
☆117Updated last year
LDLINGLINGLING / nano_vllm_note
An annotated copy of the nano_vllm repository, with MiniCPM4 adaptation and support for registering new models.
☆95Updated 3 months ago
galeselee / Awesome_LLM_System-PaperList
Since the emergence of ChatGPT in 2022, the acceleration of Large Language Models has become increasingly important. Here is a list of pap…
☆279Updated 8 months ago
Sunt-ing / stick
A PyTorch-like deep learning framework. Just for fun.
☆156Updated 2 years ago
xlite-dev / HGEMM
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and CuTe API, Achieve Peak⚡️ Performance.
☆127Updated 6 months ago
AlibabaResearch / flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
☆223Updated 2 years ago
InfiniTensor / InfiniTensor
☆273Updated 3 weeks ago
InternLM / turbomind
☆97Updated 7 months ago
MDK8888 / vllmini
A minimal implementation of vllm.
☆61Updated last year
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆39Updated last year
openmlsys / openmlsys-cuda
Tutorials for writing high-performance GPU operators in AI frameworks.
☆134Updated 2 years ago
mit-han-lab / parallel-computing-tutorial
☆176Updated 2 years ago
Shenggan / awesome-distributed-ml
A curated list of awesome projects and papers for distributed training or inference
☆250Updated last year
microsoft / chunk-attention
☆81Updated 7 months ago
nvixnu / pmpp__programming_massively_parallel_processors
Examples and exercises from the book Programming Massively Parallel Processors - A Hands-on Approach. David B. Kirk and Wen-mei W. Hwu (T…
☆75Updated 4 years ago