Talk title: Sia: Heterogeneity-aware, goodput-optimized ML-cluster scheduling
Paper source: ACM Symposium on Operating Systems Principles (SOSP)
Authors: Suhas Jayaram Subramanya, Daiyaan Arfeen, Shouxu Lin, Aurick Qiao, Zhihao Jia, Gregory R. Ganger
Affiliations: Carnegie Mellon University, Cornell University, Petuum Inc.
Speaker: Pan Feng (潘峰)
Date: January 11, 2024
Venue: Conference Room 621, Boxue Building (博学楼)
Abstract:
The Sia scheduler efficiently assigns heterogeneous deep learning (DL) cluster resources to elastic resource-adaptive jobs. Although some recent schedulers address one aspect or another (e.g., heterogeneity or resource-adaptivity), none addresses all, and most scale poorly to large clusters and/or heavy workloads even without the full complexity of the combined scheduling problem. Sia introduces a new scheduling formulation that can scale to the search-space sizes and intentionally match jobs and their configurations to GPU types and counts, while adapting to changes in cluster load and job mix over time. Sia also introduces a low-profiling-overhead approach to bootstrapping (for each new job) throughput models used to evaluate possible resource assignments, and it is the first cluster scheduler to support elastic scaling of hybrid parallel jobs.
Extensive evaluations show that Sia outperforms state-of-the-art schedulers. For example, even on relatively small 44- to 64-GPU clusters with a mix of three GPU types, Sia reduces average job completion time (JCT) by 30–93%, 99th percentile JCT and makespan by 28–95%, and GPU hours used by 12–55% for workloads derived from 3 real-world environments. Additional experiments demonstrate that Sia scales to at least 2000-GPU clusters, provides improved fairness, and is not over-sensitive to scheduler parameter settings.