ruipeterpan / torch_profiler
Simple PyTorch profiler that combines DeepSpeed Flops Profiler and TorchInfo
☆9Updated last year
Related projects:
- PyTorch compilation tutorial covering TorchScript, torch.fx, and Slapo☆18Updated last year
- SOTA Learning-augmented Systems☆32Updated 2 years ago
- Official repository for "IPDPS '24 QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices".☆19Updated 6 months ago
- An external memory allocator example for PyTorch.☆13Updated 2 years ago
- An Attention Superoptimizer☆19Updated 4 months ago
- Deferred Continuous Batching in Resource-Efficient Large Language Model Serving (EuroMLSys 2024)☆11Updated 3 months ago
- Vector search with bounded performance.☆33Updated 7 months ago
- ☆19Updated last year
- Stateful LLM Serving☆25Updated last month
- Code for "Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning" [NSDI '23]☆38Updated last year
- Code associated with the paper **Fine-tuning Language Models over Slow Networks using Activation Compression with Guarantees**.☆24Updated last year
- ☆13Updated 2 years ago
- Tacker: Tensor-CUDA Core Kernel Fusion for Improving the GPU Utilization while Ensuring QoS☆17Updated 2 years ago
- ☆18Updated 2 years ago
- [ICDCS 2023] DeAR: Accelerating Distributed Deep Learning with Fine-Grained All-Reduce Pipelining☆12Updated 9 months ago
- Official repository for the paper DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines☆13Updated 9 months ago
- ☆12Updated 2 years ago
- ☆11Updated last year
- ☆22Updated 3 years ago
- (NeurIPS 2022) Automatically finding good model-parallel strategies, especially for complex models and clusters.☆33Updated last year
- A Sparse-tensor Communication Framework for Distributed Deep Learning☆13Updated 2 years ago
- Multi-Instance-GPU profiling tool☆51Updated last year
- ☆48Updated 3 years ago
- Primo: Practical Learning-Augmented Systems with Interpretable Models☆18Updated 8 months ago
- ☆14Updated 2 years ago
- Official Repo for "LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization"☆25Updated 6 months ago
- MobiSys#114☆21Updated last year
- ☆21Updated 5 years ago
- Artifact of ASPLOS'23 paper entitled: GRACE: A Scalable Graph-Based Approach to Accelerating Recommendation Model Inference☆16Updated last year
- STREAMer: Benchmarking remote volatile and non-volatile memory bandwidth☆15Updated last year