UofT-EcoSystem / BPPSA-open

The (open-source part of) code to reproduce "BPPSA: Scaling Back-propagation by Parallel Scan Algorithm".

☆12

Alternatives and similar repositories for BPPSA-open

Users who are interested in BPPSA-open are comparing it to the libraries listed below.

comaniac / epoi
Benchmark PyTorch Custom Operators
☆14Updated 2 years ago
chhzh123 / ptc-tutorial
PyTorch compilation tutorial covering TorchScript, torch.fx, and Slapo
☆17Updated 2 years ago
ceruleangu / Block-Sparse-Benchmark
Benchmark for matrix multiplications between dense and block sparse (BSR) matrix in TVM, blocksparse (Gray et al.) and cuSparse.
☆23Updated 5 years ago
weiya711 / sam
☆17Updated last month
illinois-impact / klap
A source-to-source compiler for optimizing CUDA dynamic parallelism by aggregating launches
☆15Updated 6 years ago
awslabs / ratex
☆23Updated 3 months ago
SwarmArch / T4
Code released to accompany the ISCA paper: "T4: Compiling Sequential Code for Effective Speculative Parallelization in Hardware"
☆28Updated 3 years ago
zhisbug / Cavs
Cavs: An Efficient Runtime System for Dynamic Neural Networks
☆15Updated 5 years ago
google / iopddl
Supplemental materials for The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning
☆24Updated 6 months ago
ucamrl / xrlflow
☆13Updated 2 years ago
sjtu-epcc / DVABatch
☆21Updated 3 years ago
awslabs / lorien
☆42Updated 2 years ago
jiazhihao / attention_superoptimizer
An Attention Superoptimizer
☆22Updated 10 months ago
msr-fiddle / dnn-partitioning
☆41Updated 5 years ago
tlc-pack / tenset
☆92Updated 3 years ago
UofT-EcoSystem / hfta
Boost hardware utilization for ML training workloads via Inter-model Horizontal Fusion
☆32Updated last year
xshaun / sc22-ae
☆14Updated 3 weeks ago
Jokeren / GPA
GPU Performance Advisor
☆65Updated 3 years ago
uwplse / tensat
Re-implementation of the TASO compiler using equality saturation
☆136Updated 4 years ago
escalab / SIMD2
☆31Updated 3 years ago
xiezhq-hermann / graphiler
Graphiler is a compiler stack built on top of DGL and TorchScript which compiles GNNs defined using user-defined functions (UDFs) into ef…
☆59Updated 3 years ago
awslabs / slapo
A schedule language for large model training
☆151Updated 3 months ago
bytedance / QSync
Official repository for "IPDPS' 24 QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices".
☆20Updated last year
parasailteam / coconet
☆83Updated 2 years ago
PASSIONLab / distributed_sddmm
Distributed SDDMM Kernel
☆11Updated 3 years ago
TiledTensor / TiledLower
TiledLower is a Dataflow Analysis and Codegen Framework written in Rust.
☆14Updated last year
uclasystem / dorylus
Dorylus: Affordable, Scalable, and Accurate GNN Training
☆76Updated 4 years ago
humuyan / Korch
ASPLOS'24: Optimal Kernel Orchestration for Tensor Programs with Korch
☆38Updated 8 months ago
uwsampl / SparseTIR
SparseTIR: Sparse Tensor Compiler for Deep Learning
☆141Updated 2 years ago
sjtu-epcc / Tacker
Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS
☆32Updated 9 months ago