dlsyscourse / lecture13Links

☆9

Alternatives and similar repositories for lecture13

Users that are interested in lecture13 are comparing it to the libraries listed below

Sorting:

facebookresearch / fairring
Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large …
☆65Updated 3 years ago
UDC-GAC / openCNN
A Winograd Minimal Filter Implementation in CUDA
☆25Updated 3 years ago
facebookresearch / FAMBench
Benchmarks to capture important workloads.
☆31Updated 5 months ago
habanero-lab / APPy
APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to…
☆24Updated 3 weeks ago
eedalong / Dpex
Distributed DataLoader For Pytorch Based On Ray
☆24Updated 3 years ago
hpcaitech / TensorNVMe
A Python library transfers PyTorch tensors between CPU and NVMe
☆116Updated 7 months ago
parasj / checkmate
Training neural networks in TensorFlow 2.0 with 5x less memory
☆132Updated 3 years ago
FrancescoSaverioZuppichini / pytorch-2.0-benchmark
Benchmarking PyTorch 2.0 different models
☆21Updated 2 years ago
Youhe-Jiang / IJCAI2023-OptimalShardedDataParallel
[IJCAI2023] An automated parallel training system that combines the advantages from both data and model parallelism. If you have any inte…
☆51Updated 2 years ago
jack-willturner / gymnastics
A "gym" style toolkit for building lightweight NAS systems.
☆13Updated 3 years ago
LeiWang1999 / AutoGPTQ.tvm
GPTQ inference TVM kernel
☆40Updated last year
soumyadipghosh / eventgrad
Event-Triggered Communication in Parallel Machine Learning
☆28Updated 3 years ago
feifeibear / Odysseus-Transformer
Odysseus: Playground of LLM Sequence Parallelism
☆70Updated last year
DeMoriarty / custom_matmul_kernels
Customized matrix multiplication kernels
☆56Updated 3 years ago
Cjkkkk / KgeN
A TVM-like CUDA/C code generator.
☆9Updated 3 years ago
masahi / torchscript-to-tvm
☆69Updated 2 years ago
tlc-pack / libflash_attn
Standalone Flash Attention v2 kernel without libtorch dependency
☆111Updated 10 months ago
gpu-mode / triton-tutorials
☆13Updated 2 months ago
zhuzilin / pytorch-malloc
An external memory allocator example for PyTorch.
☆14Updated 3 years ago
Harry-Chen / InfMoE
Inference framework for MoE layers based on TensorRT with Python binding
☆41Updated 4 years ago
pytorch / torchdistx
Torch Distributed Experimental
☆116Updated 11 months ago
ModelTC / awesome-lm-system
Summary of system papers/frameworks/codes/tools on training or serving large model
☆57Updated last year
Oneflow-Inc / oneflow_convert
OneFlow->ONNX
☆43Updated 2 years ago
microsoft / TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
☆91Updated 3 weeks ago
wangsiping97 / FastGEMV
High-speed GEMV kernels, at most 2.7x speedup compared to pytorch baseline.
☆112Updated last year
tlc-pack / cutlass_fpA_intB_gemm
A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer
☆93Updated this week
Distributed-AI / PipeTransformer
PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models. ICML 2021
☆56Updated 3 years ago
ahennequ / cuda-tensorcores-register-mapping
☆18Updated 2 years ago
ModelTC / pyvlova
Yet another Polyhedra Compiler for DeepLearning
☆19Updated 2 years ago
facebookresearch / MODel_opt
Memory Optimizations for Deep Learning (ICML 2023)
☆98Updated last year