Code base and slides for ECE408: Applied Parallel Programming on GPU.
☆144 Jul 2, 2021 Updated 4 years ago
Alternatives and similar repositories for ECE408
Users that are interested in ECE408 are comparing it to the libraries listed below
- ☆50 Dec 4, 2023 Updated 2 years ago
- Distributed DataLoader For Pytorch Based On Ray ☆25 Nov 5, 2021 Updated 4 years ago
- Examples of CUDA implementations by Cutlass CuTe ☆269 Jul 1, 2025 Updated 8 months ago
- Xiao's CUDA Optimization Guide [NO LONGER ADDING NEW CONTENT] ☆323 Nov 8, 2022 Updated 3 years ago
- Keyformer proposes KV Cache reduction through key tokens identification and without the need for fine-tuning ☆57 Mar 26, 2024 Updated last year
- ☆14 Jan 12, 2022 Updated 4 years ago
- CUDA 12.2 HMM demos ☆20 Jul 26, 2024 Updated last year
- DELTA-pytorch: DELTA: Dynamically Optimizing GPU Memory beyond Tensor Recomputation ☆12 Apr 16, 2024 Updated last year
- This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several… ☆1,244 Jul 29, 2023 Updated 2 years ago
- a simple API to use CUPTI ☆11 Aug 19, 2025 Updated 6 months ago
- A simple high performance CUDA GEMM implementation. ☆426 Jan 4, 2024 Updated 2 years ago
- Artifact for OSDI'23: MGG: Accelerating Graph Neural Networks with Fine-grained intra-kernel Communication-Computation Pipelining on Mult… ☆41 Mar 17, 2024 Updated last year
- Student lab assignments for MIT 6.1600 ☆11 May 1, 2025 Updated 10 months ago
- ☆11 Apr 5, 2021 Updated 4 years ago
- Solution of Programming Massively Parallel Processors ☆49 Jan 15, 2024 Updated 2 years ago
- ☆11 Apr 29, 2022 Updated 3 years ago
- [MLSys 2023] Pre-train and Search: Efficient Embedding Table Sharding with Pre-trained Neural Cost Models ☆16 May 5, 2023 Updated 2 years ago
- [HPCA 2026] A GPU-optimized system for efficient long-context LLMs decoding with low-bit KV cache. ☆81 Dec 18, 2025 Updated 2 months ago
- A Really Scalable RL Framework to 10k+ CPUs ☆38 Feb 29, 2024 Updated 2 years ago
- The road to hack SysML and become a system expert ☆510 Sep 25, 2024 Updated last year
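- A good project for campus recruiting (fall and spring hiring) and internships: build, hands-on and from scratch, an LLM inference framework that supports LLama2/3 and Qwen2.5. ☆498 Oct 28, 2025 Updated 4 months ago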
- Tutorial for assignment of Introduction to Database System ☆11 Sep 29, 2025 Updated 5 months ago
- a single-header math library ☆17 Nov 7, 2025 Updated 3 months ago
- ☆18 Nov 21, 2022 Updated 3 years ago
- DISB is a new DNN inference serving benchmark with diverse workloads and models, as well as real-world traces. ☆58 Aug 21, 2024 Updated last year
- ☆2,698 Jan 16, 2024 Updated 2 years ago
- Material for gpu-mode lectures ☆5,800 Feb 1, 2026 Updated last month
- Reading seminar in Harvard Cloud Networking and Systems Group ☆16 Aug 29, 2022 Updated 3 years ago
- A Homework for Computer Architecture at SJTU ☆14 Jan 4, 2020 Updated 6 years ago
- ECE408 (Applied Parallel Programming) Fall 2022 MP ☆19 Mar 24, 2023 Updated 2 years ago
- paper and its code for AI System ☆348 Feb 10, 2026 Updated 3 weeks ago
- AI model training on heterogeneous, geo-distributed resources ☆38 Nov 24, 2025 Updated 3 months ago
- Companion code for the book GPU高性能编程CUDA实战 (GPU High-Performance Programming: CUDA in Practice) ☆45 May 24, 2022 Updated 3 years ago
- ☆30 Sep 13, 2025 Updated 5 months ago
- A decentralized scalar timestamp scheme ☆16 Apr 12, 2021 Updated 4 years ago
- Yet another Polyhedra Compiler for DeepLearning ☆19 Apr 14, 2023 Updated 2 years ago
- Solutions for Programming Massively Parallel Processors (大规模并行处理器编程实战), 2nd edition ☆35 Jun 4, 2022 Updated 3 years ago
- Automated Parallelization System and Infrastructure for Multiple Ecosystems ☆82 Nov 19, 2024 Updated last year
- how to optimize some algorithm in cuda. ☆2,825 Feb 15, 2026 Updated 2 weeks ago