skypilot-org / spot-traces
Releases the spot availability traces used in the "Can't Be Late" paper.
☆14 · Updated 5 months ago
Related projects:
- Bamboo: a system for running large pipeline-parallel DNNs affordably, reliably, and efficiently using spot instances. ☆46 · Updated last year
- Dorylus: Affordable, Scalable, and Accurate GNN Training. ☆77 · Updated 3 years ago
- AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving (OSDI '23). ☆76 · Updated last year
- A resilient distributed training framework. ☆78 · Updated 5 months ago
- Code for "Shockwave: Fair and Efficient Cluster Scheduling for Dynamic Adaptation in Machine Learning" (NSDI '23). ☆38 · Updated last year
- A cluster-wide model manager that accelerates DNN training via automated training warmup. ☆31 · Updated last year
- A GPU-accelerated vector query processing system that supports large vector datasets beyond GPU memory. ☆16 · Updated 5 months ago
- SpotServe: Serving Generative Large Language Models on Preemptible Instances. ☆92 · Updated 6 months ago
- A ChatGPT (GPT-3.5) & GPT-4 workload trace for optimizing LLM serving systems. ☆110 · Updated last month
- Artifacts for the ASPLOS '23 paper ElasticFlow. ☆51 · Updated 4 months ago
- An LLM serving cluster simulator. ☆55 · Updated 4 months ago
- TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches. ☆54 · Updated last year
- SHADE: Enable Fundamental Cacheability for Distributed Deep Learning Training. ☆28 · Updated last year
- An interference-aware scheduler for fine-grained GPU sharing. ☆92 · Updated 4 months ago
- Official repository for the paper "DynaPipe: Optimizing Multi-task Training through Dynamic Pipelines". ☆13 · Updated 9 months ago
- Artifacts for the SIGCOMM '22 paper Muri. ☆38 · Updated 8 months ago
- A surrogate-based hyperparameter tuning system. ☆26 · Updated last year
- LLMServingSim: a HW/SW co-simulation infrastructure for LLM inference serving at scale. ☆32 · Updated last month
- Fast and efficient model serving using multi-GPUs with direct-host-access (ACM EuroSys '23). ☆51 · Updated 5 months ago
- REEF: a GPU-accelerated DNN inference serving system that enables instant kernel preemption and biased concurrent execution in GPU sche… ☆84 · Updated last year
- An experimental parallel training platform. ☆46 · Updated 5 months ago
- A universal workflow system for exactly-once DAGs. ☆23 · Updated last year
- A virtual memory abstraction for serverless architectures. ☆45 · Updated 2 years ago