facebookresearch / MODel_opt
Memory Optimizations for Deep Learning (ICML 2023)
☆62 · Updated last year
Alternatives and similar repositories for MODel_opt:
Users who are interested in MODel_opt are comparing it to the libraries listed below:
- SparseTIR: Sparse Tensor Compiler for Deep Learning ☆134 · Updated 2 years ago
- Extensible collectives library in Triton ☆84 · Updated 2 weeks ago
- ☆103 · Updated 7 months ago
- Tritonbench is a collection of PyTorch custom operators with example inputs to measure their performance. ☆108 · Updated this week
- A schedule language for large model training ☆145 · Updated 9 months ago
- TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing. ☆80 · Updated this week
- High-speed GEMV kernels, with up to 2.7x speedup compared to the PyTorch baseline. ☆103 · Updated 9 months ago
- Fairring (FAIR + Herring) is a plug-in for PyTorch that provides a process group for distributed training that outperforms NCCL at large … ☆65 · Updated 3 years ago
- ☆139 · Updated 8 months ago
- Training neural networks in TensorFlow 2.0 with 5x less memory ☆130 · Updated 3 years ago
- A Python library that transfers PyTorch tensors between CPU and NVMe ☆113 · Updated 4 months ago
- ☆68 · Updated 3 weeks ago
- PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections ☆119 · Updated 2 years ago
- ☆157 · Updated last year
- ☆76 · Updated 5 months ago
- Fast Hadamard transform in CUDA, with a PyTorch interface ☆168 · Updated 10 months ago
- pytorch-profiler ☆51 · Updated last year
- (NeurIPS 2022) Automatically finding good model-parallel strategies, especially for complex models and clusters. ☆38 · Updated 2 years ago
- Research and development for optimizing transformers ☆125 · Updated 4 years ago
- Sparsity support for PyTorch ☆34 · Updated 3 weeks ago
- ☆50 · Updated last year
- PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware. ☆108 · Updated 4 months ago
- A minimal implementation of vLLM. ☆37 · Updated 8 months ago
- A performant, memory-efficient checkpointing library for PyTorch applications, designed with large, complex distributed workloads in mind… ☆156 · Updated 4 months ago
- Benchmark code for the "Online normalizer calculation for softmax" paper ☆90 · Updated 6 years ago
- GPTQ inference TVM kernel ☆38 · Updated 11 months ago
- Collection of kernels written in the Triton language ☆118 · Updated last week
- ☆36 · Updated 4 months ago
- This repository contains the results and code for the MLPerf™ Training v1.0 benchmark. ☆38 · Updated last year
- A minimal cache manager for PagedAttention, on top of Llama 3. ☆82 · Updated 7 months ago