dlsyscourse / lecture13
☆9 Updated 9 months ago
Alternatives and similar repositories for lecture13
Users who are interested in lecture13 are comparing it to the repositories listed below
- Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large … ☆65 Updated 3 years ago
- A Winograd Minimal Filter Implementation in CUDA ☆25 Updated 3 years ago
- Benchmarks to capture important workloads. ☆31 Updated 5 months ago
- APPy (Annotated Parallelism for Python) enables users to annotate loops and tensor expressions in Python with compiler directives akin to… ☆24 Updated 3 weeks ago
- Distributed DataLoader for PyTorch Based on Ray ☆24 Updated 3 years ago
- A Python library that transfers PyTorch tensors between CPU and NVMe ☆116 Updated 7 months ago
- Training neural networks in TensorFlow 2.0 with 5x less memory ☆132 Updated 3 years ago
- Benchmarking different models with PyTorch 2.0 ☆21 Updated 2 years ago
- [IJCAI2023] An automated parallel training system that combines the advantages from both data and model parallelism. If you have any inte… ☆51 Updated 2 years ago
- A "gym" style toolkit for building lightweight NAS systems. ☆13 Updated 3 years ago
- GPTQ inference TVM kernel ☆40 Updated last year
- Event-Triggered Communication in Parallel Machine Learning ☆28 Updated 3 years ago
- Odysseus: Playground of LLM Sequence Parallelism ☆70 Updated last year
- Customized matrix multiplication kernels ☆56 Updated 3 years ago
- A TVM-like CUDA/C code generator. ☆9 Updated 3 years ago
- ☆69 Updated 2 years ago
- Standalone Flash Attention v2 kernel without libtorch dependency ☆111 Updated 10 months ago
- ☆13 Updated 2 months ago
- An external memory allocator example for PyTorch. ☆14 Updated 3 years ago
- Inference framework for MoE layers based on TensorRT with Python binding ☆41 Updated 4 years ago
- Torch Distributed Experimental ☆116 Updated 11 months ago
- Summary of system papers/frameworks/codes/tools on training or serving large model ☆57 Updated last year
- OneFlow->ONNX ☆43 Updated 2 years ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆91 Updated 3 weeks ago
- High-speed GEMV kernels, with up to 2.7x speedup compared to the PyTorch baseline. ☆112 Updated last year
- A standalone GEMM kernel for fp16 activation and quantized weight, extracted from FasterTransformer ☆93 Updated this week
- PipeTransformer: Automated Elastic Pipelining for Distributed Training of Large-scale Models. ICML 2021 ☆56 Updated 3 years ago
- ☆18 Updated 2 years ago
- Yet another Polyhedra Compiler for DeepLearning ☆19 Updated 2 years ago
- Memory Optimizations for Deep Learning (ICML 2023) ☆98 Updated last year