I Introduction
Recently, the Parameter Server (PS) architecture [1, 2, 3, 4, 5, 6] has emerged as a popular system architecture to support large-scale distributed learning on a cluster of machines. The PS architecture advocates the separation of working units into “servers” and “workers”, where the servers collectively maintain the model state and the workers “pull” the latest version of the model from the servers, scan their own part of the training data to compute model refinements, and “push” the model updates back to the servers for aggregation. The PS architecture has the advantage of improving network utilization so that it can scale out to bigger models and more machines.
Generally, the completion time of a long-running machine learning job is determined by the time required to reach model convergence. Practically, however, the completion time is largely influenced by the values of various system knobs, such as the server-worker ratio (i.e., how many hardware threads are dedicated to the servers and to the workers), the device placement (i.e., which operations shall be shipped to the GPU for processing and which shall stay on the CPU), and the parallelism degree (e.g., the model replication factor, the model partitioning scheme). Today, unfortunately, the burden of specifying the knob values falls on the users who submit the ML jobs.
Determining the set of knob values that achieves the optimal completion time is beyond human ability. Part of the difficulty is that the response surfaces of ML jobs are highly complex. Figure 1 shows a response surface of running a PS-style TensorFlow job on our cluster (experiment details are presented later). The two system knobs involved are tf.train.ClusterSpec::worker and intra_op_parallelism_threads, which respectively vary the server-worker ratio and the thread affinity of operations (i.e., the mapping between TensorFlow operations and hardware threads). The figure shows that the response surface is complex and non-monotonic, and that the optimum lies where a human cannot easily find it. Adding to the challenge, the completion time of an ML job, unlike traditional data processing, is a complex interplay between statistical efficiency (how many iterations are needed until convergence) and hardware efficiency (how efficiently those iterations are carried out) [7]. Consider the server-worker ratio as an example. On the one hand, more workers would increase the hardware efficiency through a higher degree of data parallelism. On the other hand, that might hurt the statistical efficiency when servers accept asynchronous updates from workers: when more workers concurrently update the global model, the model becomes more inconsistent and requires more iterations to converge. Figure 2 shows such a case, in which varying just one system setting (the server-worker ratio) already yields a 2.5× difference in statistical efficiency.
Configuring distributed ML systems to reduce the execution time of long-running ML jobs currently requires system expertise, something many ML users may lack. Even for system experts, the dependencies between the knobs (e.g., changing one knob may nullify the benefits of another) make the whole task non-trivial, if not downright impossible. Furthermore, this manual tuning task must be repeated whenever the expert-suggested model or the hardware resources change.
In this paper, we present a suite of techniques for building self-tuning parameter servers. By self-tuning, we mean that while a long-running ML job is iteratively training an expert-suggested model, the system itself is also iteratively learning which setting is more efficient for that job and dynamically applying it online. Our goal is to free ML users from the system details as much as possible and let the system progressively discover and apply better and better system settings for a job as it proceeds. The principal contributions of this paper are as follows:

Online Optimization Framework (Section III). We present a framework that is applicable to all PS-style ML systems to support self-tuning. In online tuning, we always hope to discover the optimal system setting as soon as possible (so as to apply it in the next iteration immediately) while minimizing the number of iterations spent on discovering it. Therefore, before the start of each iteration, there is an intrinsic trade-off between trying a potentially better system setting and applying a known good setting. The crux of the framework is the use of Bayesian Optimization (BO) [8, 9, 10] to learn and recommend a system setting. Bayesian Optimization has been extensively used in various offline auto-tuning projects (e.g., [11], [12]). However, there are subtleties when it comes to online ML system tuning. For example, in ML training, the influence of the same system setting depends on whether the training has just started or the model is converging. Therefore, we have formulated a BO that is aware of those intrinsic ML factors.

Online Progress Estimation (Section IV). A key input to our Bayesian Optimization is the estimated remaining completion time of an in-flight ML job. The challenge there is to estimate the statistical progress: “how many more iterations to go based on the current model and current system setting?”. While there is a wealth of works that study the convergence rates of different learning algorithms on different ML problems [13, 14, 15], they are all theoretical bounds under an offline setting, i.e., bounding the maximum number of iterations required given an untrained model. Focusing on gradient descent based learning algorithms (e.g., SGD, Hogwild! [13]), our contributions here are a formal extension of those bounds for an online setting and a methodology to transform those bounds to be legitimate statistical progress estimation functions.
Online Reconfiguration (Section V). None of the above would be meaningful unless there is a way to reconfigure a PS-style ML system online to a new system setting while active jobs are still running. While most ML systems have built-in checkpointing facilities for recovery purposes, implementing online reconfiguration by checkpointing the state of a job and restoring the state under the new system setting would incur excessive overhead and cause system quiescence. To this end, we introduce a novel technique called On Demand Model Relocation (ODMR) so that non-quiescent and efficient online reconfigurations can be carried out.

Experimentation on TensorFlow (Section VI). It is worth noting that all of our contributions above (online optimization framework, online progress estimation, and online reconfiguration) are system agnostic. That is, any existing PS-style system can implement our techniques and enjoy faster end-to-end runtime through self-tuning. As an experimental prototype, we have implemented our techniques on top of TensorFlow [6] and we name it TensorFlowOnline. Experiments show that TensorFlowOnline can reduce the completion times of different long-running TensorFlow jobs by 1.4–18×. TensorFlowOnline is open-source, so its statistical progress estimation component can be easily adapted to any new convergence results from the machine learning community.
In this paper, we focus on online system parameter tuning. By system parameters, we refer to those that influence only the efficiency but not the quality of the models. Therefore, the server-worker ratio in TensorFlow is a system parameter while the learning batch size is not. Instead, the learning batch size is a hyperparameter because it influences the quality of the final model and is not learnable from the data. With this difference clear, we can see that our work is orthogonal to projects that focus on hyperparameter tuning (e.g., Spark TuPAQ [16], Auto-WEKA [17], Google Vizier [18], Spearmint [19], GPyOpt [20], Auto-sklearn [21], and Ease.ml [11]). In fact, hyperparameter tuning is generally a trial-and-error process, where each trial is a different job using a different set of hyperparameters (e.g., learning rate, batch size) under the same system setting (e.g., worker-server ratio). Our work can thereby complement those systems to expedite the execution of each trial and shorten the overall hyperparameter tuning cycle.

Next, we present the preliminaries and background for this paper (Section II), followed by our main contributions (Sections III–VI). We give a review of related work afterwards (Section VII) and conclude this paper with some interesting future directions (Section VIII). The appendix includes some implementation details of our prototype TensorFlowOnline, additional experimental results, and a table that summarizes the major notations used in this paper.
II Background and Preliminary
ML jobs come in many forms, such as logistic regression, support vector machines, and neural networks. Nonetheless, almost all seek a model of parameters that minimizes an empirical risk function: where is the number of data examples in the whole training dataset and is the loss function that quantifies how well the model explains a data example .
II-A Iterative-Convergent ML Algorithms
An ML job is usually executed by an iterative-convergent algorithm and can be abstracted by the following additive operation:
where is the state of the model after iteration and the update function computes the parameter updates based on some data at a learning rate .
Gradient Descent (GD) is arguably the most popular family of iterative-convergent optimization algorithms. GD is applicable to most supervised, semi-supervised, and unsupervised ML problems. As its name suggests, GD is a class of first-order methods whose update function is based on computing gradients from the data. Batch GD (BGD), Stochastic GD (SGD), Mini Batch GD (MGD), and SVRG++ [22] are some example GD family members. In these algorithms, each iteration draws samples from as and the loss of an iteration j is denoted as:
(1) 
where . Generally, a lower loss indicates better model accuracy. Such an algorithm is said to have converged when stops changing or drops below a threshold . More specifically, gradient descent learning algorithms theoretically converge when , where is the optimal model. Practically, since is unknown, is approximated by to determine convergence. Recent works (e.g., [23, 24, 25]) have shown that many industrial-strength tasks require thousands of iterations or weeks to reach convergence.
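The iterate-until-the-loss-drops-below-a-threshold pattern described above can be sketched with a small mini-batch gradient descent loop. This is an illustrative sketch only: the one-parameter least-squares problem, the function name `minibatch_sgd`, and the values chosen for the learning rate, batch size, and threshold are assumptions for the example, not taken from the paper.

```python
import random


def minibatch_sgd(data, lr=0.05, batch_size=4, eps=1e-3, max_iters=10_000, seed=0):
    """Fit y ~ w * x by mini-batch gradient descent (hypothetical toy problem).

    Stops as soon as the loss of the current mini-batch drops below `eps`,
    mirroring the practical convergence test on the per-iteration loss.
    """
    rng = random.Random(seed)
    w = 0.0
    loss = float("inf")
    for j in range(1, max_iters + 1):
        batch = rng.sample(data, batch_size)
        # Mean squared-error loss of the current model on this mini-batch.
        loss = sum((w * x - y) ** 2 for x, y in batch) / batch_size
        if loss < eps:
            return w, j, loss
        # Gradient of the mini-batch loss with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in batch) / batch_size
        w -= lr * grad
    return w, max_iters, loss


# Noiseless data generated from y = 3x, so the optimal model is w = 3.
data = [(k / 10.0, 3.0 * k / 10.0) for k in range(1, 21)]
w, iters, loss = minibatch_sgd(data)
```

The number of iterations returned here plays the role of the statistical efficiency discussed in the introduction: the fewer iterations needed to cross the threshold, the better.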
II-B Parameter Server Architecture
Recently, it is not uncommon to see models with millions of parameters [1, 3] and terabytes of training data [6]. To support learning at that scale, the Parameter Server (PS) architecture [1, 2, 3, 4, 5, 6, 26] has been advocated to distribute the workload across clusters of nodes. The concept of a node is abstract. It can refer to any computation unit like a physical machine, a CPU in a NUMA machine, or a core in a CPU. Generally, the model parameters are maintained by multiple server nodes. Worker nodes periodically “pull” (part of) the latest model from the server(s), perform local computation such as calculating stochastic gradients by accessing their part of the training data, and then “push” the updates back to the server(s) whose parameters need to be updated. Servers update the global model by aggregating the local updates from workers (e.g., averaging the stochastic gradients from workers). As opposed to traditional message-passing and MapReduce-like frameworks, where pairwise communications between workers are needed in order to exchange each other’s parameter updates, the PS architecture has the advantage of only requiring communications between workers and servers, thereby mitigating the network bottleneck. Under the PS architecture, the server-worker ratio is a key knob.
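The pull, compute, push, and aggregate cycle described above can be illustrated with a single-process toy. This is a minimal sketch under stated assumptions: the class `ParameterServer`, the helper `worker_gradient`, the two-parameter linear model, and the synchronous averaging round are all hypothetical simplifications; real PS systems shard the model across many servers and communicate over the network.

```python
class ParameterServer:
    """Toy server: holds the global model and averages the pushed gradients."""

    def __init__(self, dim, lr=0.1):
        self.w = [0.0] * dim
        self.lr = lr
        self._pending = []

    def pull(self):
        # Workers receive a copy of the latest model.
        return list(self.w)

    def push(self, grad):
        # Buffer a worker's local update until the round is aggregated.
        self._pending.append(grad)

    def aggregate(self):
        # Average the buffered gradients and apply one descent step.
        k = len(self._pending)
        for i in range(len(self.w)):
            avg = sum(g[i] for g in self._pending) / k
            self.w[i] -= self.lr * avg
        self._pending.clear()


def worker_gradient(w, shard):
    """Mean squared-error gradient for y ~ w[0] * x + w[1] on one data shard."""
    n = len(shard)
    g0 = sum(2 * (w[0] * x + w[1] - y) * x for x, y in shard) / n
    g1 = sum(2 * (w[0] * x + w[1] - y) for x, y in shard) / n
    return [g0, g1]


# Data from y = 2x + 1, split evenly between two workers.
data = [(k / 10, 2 * (k / 10) + 1) for k in range(20)]
shards = [data[0::2], data[1::2]]
server = ParameterServer(dim=2)
for _ in range(2000):
    for shard in shards:
        model = server.pull()
        server.push(worker_gradient(model, shard))
    server.aggregate()
```

Note that workers never talk to each other in this sketch; all communication goes through `pull` and `push`, which is the property the paragraph above contrasts with pairwise message passing.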
III Online Optimization Framework
Figure 3 shows our general framework to support self-tuning on PS-style ML systems. Many ML systems have a frontend and a backend. Taking TensorFlow as an example, the core operations are carried out by the backend, which is implemented in C++. The frontend is responsible for optimizing and orchestrating a job’s execution. The frontend also offers a high-level API for users to write their programs. There are different TensorFlow frontends, but the most popular one is implemented in Python.
On receiving an ML program J from the frontend, the framework instruments the program before sending it to the ML system backend. The instrumented program is then executed as a job J’, which emits various per-iteration metrics (e.g., execution time, loss) to a repository during its execution. Before starting an iteration, the Tuning Manager will (1) update a Gaussian Process (GP) model using the metrics collected from the previous iteration, (2) carry out Bayesian Optimization (BO) to get a possibly new system setting, and (3) reconfigure the system to that setting. In practice, the Tuning Manager does not reconfigure at every iteration; it executes a certain number of iterations under each setting in order to observe that setting’s online statistical efficiency (its live convergence rate).
III-A Bayesian Optimization
Bayesian Optimization (BO) is a strategy for optimizing a black-box objective function that is unknown beforehand but observable through conducting experiments [8, 9, 10]. Conventionally, in BO, each experiment is executed with a different system setting, and the same system setting is used throughout the same experiment. In our context, each experiment (ML job) is going to be executed under a number of different system settings until the job terminates. Therefore, we introduce the loss of the model into the BO’s input space so as to differentiate whether a system setting is applied at an early stage or at a late stage of the job. That is important because, in ML, a lousy system setting might improve the loss when the training has just started, whereas an optimal setting might hardly improve the loss if the model is converging.
Given an active ML job J, the goal of the BO is to recommend the next system setting that is expected to minimize the remaining completion time of J. Let be a system setting, where each is a configurable system parameter with value . We use to denote the remaining completion time of the job if we switch to setting when the model has reached a loss . is thus a dimensional vector that includes both the system setting and the loss of the model. So, the optimization problem is, given a model whose loss is , find the that minimizes . Knowing ahead of time would be infeasible. Bayesian Optimization thus returns an approximation solution with little overhead.
We model as a GP (Gaussian Process) [27] and use BO to suggest the next setting based on a predefined acquisition function. An acquisition function can be updated with more observations. There are many choices of acquisition function, such as (i) Probability of Improvement (PI) [9], which picks the next setting that maximizes the probability of improving on the current best; (ii) Expected Improvement (EI) [27], which picks the next setting that maximizes the expected improvement over the current best; and (iii) Upper Confidence Bound (UCB) [28], which picks the setting that has the smallest lower bound in its certainty region. Different acquisition functions have different strategies to balance between exploring (suggesting a possibly new setting from an unknown region of the response surface) and exploiting the knowledge so far (suggesting a setting that lies in a known high-performance region). In this paper, we choose EI because it has been shown to be more robust than PI and, unlike UCB, it is parameter-free. Using BO with EI has the ability to learn the objective function quickly and always return the expected optimal setting. BO itself is noise resilient. That is important because what we can collect from experiments is actually : where is a Gaussian noise with zero mean, i.e., . Since , , and are Gaussian, we can infer and its confidence interval [12]. As we discuss in Section IV, the observation noise comes from the fact that is not a direct measurement but a product of (i) the per-iteration execution time (hardware efficiency) and (ii) the estimated number of iterations left (statistical progress). Although (i) can be directly observed, (ii) has to be based on certain empirical estimations (Section IV).

BO has the advantage of being non-parametric, meaning it does not impose any limit on , making our techniques useful for a variety of ML systems. Furthermore, it can deal with non-linear response surfaces but requires far fewer samples than other methods of similar power (e.g., deep networks). Lastly, BO has a good track record in tuning database systems [29, 30]. An approach similar to BO is reinforcement learning [31], and we will explore that direction as future work.

III-B Initialization Phase
We propose to divide the execution of an ML job into two phases: initialization and online tuning. The goal of the initialization phase is to quickly bring in a small set of representative settings and their execution metrics to build the GP. Initially, the job runs the first iterations using the setting , which is the default or the one given by the user. Iterations after that are executed under random settings from the setting space, and each setting runs for iterations. Figure 4a illustrates the major execution metrics that would be inserted into the repository after trying different settings, with iterations. Each record in the collected execution metrics is a quadruple , where indicates that iteration was executed using setting , indicates the execution time of that iteration, and indicates the loss of the model after that iteration.
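The shape of the initialization phase (a block of iterations under the default setting, followed by equally sized blocks under randomly drawn settings) can be sketched as a per-iteration schedule. The function name, the parameter names `iters_per_setting` and `num_random_settings`, and the two knobs in the example setting space are hypothetical names introduced for illustration.

```python
import random


def initialization_schedule(default_setting, setting_space,
                            iters_per_setting, num_random_settings, seed=0):
    """Return one system setting per iteration of the initialization phase.

    `setting_space` maps each knob name to the list of values it may take.
    """
    rng = random.Random(seed)
    # The job starts under the default (or user-given) setting.
    schedule = [default_setting] * iters_per_setting
    for _ in range(num_random_settings):
        # Draw one random value per knob to form a random setting.
        setting = {knob: rng.choice(values)
                   for knob, values in setting_space.items()}
        schedule.extend([setting] * iters_per_setting)
    return schedule


# Hypothetical knobs and values, for illustration only.
space = {"num_workers": [2, 4, 8], "intra_op_threads": [1, 2, 4]}
default = {"num_workers": 4, "intra_op_threads": 2}
schedule = initialization_schedule(default, space,
                                   iters_per_setting=3, num_random_settings=2)
```

Running every setting for the same number of iterations keeps the per-setting statistics comparable when they are later fed to the GP.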
The loss of one iteration alone is insufficient to judge whether a setting has good statistical efficiency. Consequently, the execution metrics are preprocessed into triples , where is the loss of the iteration just before using setting , i.e., (e.g., in Figure 4), and is the estimated remaining completion time if starting to use from a model with loss (details of in Section IV). Furthermore, it is known that some iterations may occasionally incur abnormal loss [32]. Therefore, we apply the outlier removal technique of [33] to remove outliers. Figure 4b shows the training data after preprocessing.

The initialization phase takes a total of iterations and it ends with building a GP based on the collected execution metrics (e.g., to compute the parameter values of the kernel function). After that, the online tuning phase starts.
III-C Online Tuning Phase
In this phase, a new setting is selected from the GP with the highest expected improvement every iterations. Depending on the online reconfiguration cost (Section V), if the expected improvement (EI) of is larger than , then an online reconfiguration to takes place. In other words, if a reconfiguration costs more than what it will potentially save, that reconfiguration would not take place. Overall, the online tuning phase goes on until the job finishes.
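The decision rule of the online tuning phase can be sketched as follows: compute the expected improvement of each candidate setting from the GP posterior, pick the best, and reconfigure only if the improvement outweighs the reconfiguration cost. This is a minimal sketch assuming the standard closed-form EI for a minimization objective with a Gaussian posterior; the function names and the dictionary layout of `candidates` are hypothetical, and the posterior means and standard deviations are supplied directly rather than produced by a fitted GP.

```python
import math


def expected_improvement(mu, sigma, best):
    """Closed-form EI for minimization, given a Gaussian posterior N(mu, sigma^2).

    `best` is the smallest remaining completion time observed so far.
    """
    if sigma <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(best - mu, 0.0)
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (best - mu) * cdf + sigma * pdf


def choose_setting(candidates, best, reconfig_cost):
    """Pick the candidate with the highest EI; reconfigure only if it pays off.

    `candidates` maps a setting name to its (mu, sigma) posterior.
    Returns (setting name or None, EI of the best candidate).
    """
    name, ei = max(
        ((k, expected_improvement(m, s, best)) for k, (m, s) in candidates.items()),
        key=lambda item: item[1],
    )
    return (name if ei > reconfig_cost else None), ei
```

For example, `choose_setting({"a": (100.0, 0.0), "b": (80.0, 0.0)}, best=100.0, reconfig_cost=5.0)` selects `"b"`, while raising `reconfig_cost` to 30 keeps the current setting because the expected saving of 20 no longer covers the cost.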
III-D Miscellaneous
System settings may involve categorical attributes. For ordinal categorical attributes, we simply preprocess their values to be integers. For nominal categorical attributes, we use one-hot encoding [34] to preprocess their data. Suppose a categorical attribute contains categories ; to denote category , one-hot encoding represents that value as a dimensional bit vector with only the th bit set to 1 and all other bits set to 0.

We end this section by discussing how to set the values of and . Our major purpose is to avoid our techniques being parametric if possible. The main usage of is to deduce the statistical progress (live convergence rate). So, we set , the number of iterations executed for each setting, to be three times the number of workers, so that each worker can be assumed to have pushed its update to the server around three times. Our experimental results support this choice empirically. We regard the theoretical foundation of this choice as future work. In this paper, we empirically set , the number of random settings to try in the initialization phase, to 10. So far, no auto-tuner is 100% parameter-free [29, 30]. In Section VIII, we discuss how to possibly eliminate this very last knob, or even the entire initialization phase, by transfer learning [35].
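The preprocessing of categorical knobs described in this section (integers for ordinal attributes, one-hot bit vectors for nominal ones) can be sketched as below. The knob names `batch_prefetch`, `device`, and `num_workers`, and the helper `encode_setting`, are hypothetical and serve only to make the example concrete.

```python
def one_hot(value, categories):
    """Return a bit vector with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if c == value else 0 for c in categories]


def encode_setting(setting, ordinal_levels, nominal_categories):
    """Flatten a system setting into a numeric feature vector.

    Ordinal knobs become the integer rank of their level, nominal knobs are
    one-hot encoded, and every other knob is treated as already numeric.
    Knobs are visited in sorted name order so the layout is deterministic.
    """
    features = []
    for knob in sorted(setting):
        value = setting[knob]
        if knob in ordinal_levels:
            features.append(ordinal_levels[knob].index(value))
        elif knob in nominal_categories:
            features.extend(one_hot(value, nominal_categories[knob]))
        else:
            features.append(float(value))
    return features


# Hypothetical knobs, for illustration only.
ordinal = {"batch_prefetch": ["low", "medium", "high"]}
nominal = {"device": ["cpu", "gpu", "tpu"]}
setting = {"batch_prefetch": "medium", "device": "gpu", "num_workers": 8}
vector = encode_setting(setting, ordinal, nominal)
```

One-hot encoding avoids imposing an artificial order on nominal categories, which matters because the GP kernel measures distances between encoded settings.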
IV Online Progress Estimation
One key input to the online optimization framework is the estimated remaining completion time of a job. Concretely, can be formulated as:
which is a product of (i) the per-iteration execution time (hardware efficiency) and (ii) the estimated number of iterations left (statistical progress). can be directly computed as the average of the recorded iteration times under that setting , e.g., can be computed as in Figure 4a.
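The product form described above can be sketched directly from the execution records. This is a minimal sketch: the record layout (iteration, setting, seconds, loss) follows the quadruples of Section III-B, while the function names are hypothetical and the estimate of iterations left is passed in as a given number rather than derived from a convergence bound.

```python
from collections import defaultdict


def mean_iteration_time(records):
    """Average the recorded iteration time for each setting.

    `records` is an iterable of (iteration, setting_id, seconds, loss) tuples.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for _, setting, seconds, _ in records:
        totals[setting][0] += seconds
        totals[setting][1] += 1
    return {s: total / count for s, (total, count) in totals.items()}


def remaining_time(mean_seconds, iterations_left):
    """Hardware efficiency times statistical progress."""
    return mean_seconds * iterations_left


records = [(1, "A", 2.0, 0.9), (2, "A", 4.0, 0.8), (3, "B", 1.0, 0.7)]
per_setting = mean_iteration_time(records)
estimate = remaining_time(per_setting["A"], iterations_left=10)
```

Only the first factor is a direct measurement; the second is the estimated quantity that the rest of this section is concerned with.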
In this section, we focus on , the remaining number of iterations required to reach model convergence. We posit that estimating in ML systems is as challenging as estimating cardinalities in query optimization. It is known that even today cardinality estimation techniques can be off by orders of magnitude [36]. Yet, most query optimizers live with that in practice. So, following decades of experience from query optimization [37], we aim for estimates that would not lead us to disastrous settings, rather than perfect estimates that are not demanded in practice.
IV-A From Bounds to Legitimate Estimation
Studying the convergence rate of various learning algorithms is a very active research topic in machine learning [38, 13, 39, 40]. Since the parameter server is a parallel learning architecture, in this paper we focus on parallel gradient descent learning algorithms. For example, Hogwild! [13] is a parallel gradient descent learning algorithm that would theoretically converge, i.e., after iterations, where
(2) 
under the assumptions that is c-strongly convex and L-smooth, where is the hidden constant, is the Lipschitz constant, is the optimal model parameter, is the initial model parameter, and is the user-specified convergence threshold.
In this paper, we use Hogwild! as an example (because its convergence results apply to both bulk synchronous and asynchronous model update) and generalize its offline convergence analysis to be an online estimation function. Without loss of generality, assume the learning has finished iterations and switches to use setting to execute more iterations. Then, we know the new pairs of would be scattered around the curve of:
(3) 
Fitting those pairs of to Equation 3 could then determine the values of and for setting . When and are known, we can estimate as:
(4) 
This methodology was pioneered by [41], which however focused on gradient-based algorithms under a serial and offline setting. For example, they deduce the total number of iterations as , with only one hidden constant. For online tuning, the fitting needs to be carried out multiple times, once for each different setting from a different starting point (cf. Equation 3). Furthermore, for gradient-based algorithms under a parallel setting, like Hogwild!, the convergence guarantees usually capture both the algorithmic factor (e.g., use of risk function) and the environmental factor (e.g., data distribution, network delay). While is explicitly related to , the relationship between the algorithmic factor and live metric is implicit. To determine , we first derive its upper bound. Specifically, we show that :

// Section II-A


// is c-strongly convex

// is L-smooth

Combining (i), (ii), and (iii), we have

Combining (iii) and (iv), we get

Consider a constant , then we have
Now the question becomes how to determine (and ) using the collected pairs. Notice that we should not fit both and together because that may find a value for exceeding its upper bound. Up to this point, we understand that (a) should not be a large number, because the log term in Equation 3 means and the difference between and is tiny in practice. As such, a large would dominate the term and degenerate it to a constant. On the other hand, we understand that (b) if is smaller than many of the ’s collected, then during fitting many ’s would result in negatives, making be fitted as negative. That would make Equation 4 return negative numbers, which is undesirable. Based on this information, we set as
(5) 
and then deduce based on the collected pairs. In Equation 5, the term is the supremum of based on its upper bound to address concern (a) and the term inside addresses concern (b).
IV-B Limitations and Opportunities
Our statistical progress estimation (like that of [41]) is founded on known theoretical convergence bounds of the various learning algorithms. The advantage of this approach is that, in principle, anyone can follow our methodology to improve the estimation function whenever there are new results on the bounds (e.g., tighter bounds that consider more factors). Nonetheless, while the convergence bounds of many ML problems are known, the convergence bounds of a number of non-convex problems are still under development. For example, there are bounds for non-convex PCA [42] and two-layer neural networks with ReLU [15], but bounds for deeper neural networks with many layers are still being developed. For our TensorFlowOnline prototype, we have implemented an estimation function based on Hogwild! (which assumes problems are convex). Our experiments show that our estimation function is able to avoid disastrous settings and return efficient system settings for two convex problems and one non-convex problem. We regard this system paper as an initial effort, and we will develop more specific estimation functions for each class of ML models and learning algorithms with known convergence bounds. Work at this scale would, however, require the effort of the community, and we open-source our prototype to facilitate that.

V Online Reconfiguration
Online reconfiguration changes the system setting in the course of an ML job. Under the PS architecture, the following physical changes could be triggered by a reconfiguration:

(Type I) Data Relocation: For example, a recommendation that suggests turning a worker node into a server node would trigger this type of reconfiguration. Here, we further bifurcate data relocation into:

(Type Ia) Training Data Relocation

(Type Ib) Model Data Relocation


(Type II) System Setting Reconfiguration: For example, in TensorFlow, there is a knob to turn the function inlining optimization on or off. This kind of knob does not trigger any data relocation.
To implement online reconfiguration, a baseline solution is to reuse the system’s checkpointing and recovery feature (e.g., the save & restore in TensorFlow). In most circumstances, that feature is collectively implemented by four techniques:

Checkpointing (CKP): This saves the model state (e.g., the current model , the current iteration number ) to persistent storage. Usually, this does not save any system settings (e.g., whether function inlining is on or off) because those values are stored separately in a system configuration/property file or an in-memory data structure. Moreover, checkpointing does not involve the training data because there is a master copy of the training data in the shared storage (e.g., HDFS).

System Setting Recovery (SSR): This is built in as part of the recovery process, in which the system is reinitialized based on the setting specified in the configuration/property file or data structure.

Model Data Recovery (MDR): This is the other part of the built-in recovery process, in which the model state is restored to the servers based on the system setting.

Training Data Recovery (TDR): Because the training data is read-only and stored in the shared storage, on recovery the workers simply fetch the missing data from the shared storage directly.
Existing ML systems implement their checkpointing and recovery process as a CKP and a full SSR+MDR+TDR, respectively. We regard that as the baseline reconfiguration implementation. This baseline implementation is expected to incur high overhead (e.g., checkpointing the state) and cause system quiescence. In this paper, we have developed a new scheme that can carry out online reconfiguration more efficiently. Before introducing that, we first present a new technique for carrying out Type Ib reconfiguration efficiently.
On-Demand Model Relocation (ODMR). In our experience of applying our techniques to TensorFlow, the lion’s share of the reconfiguration cost is attributable to Type Ib, i.e., the cost of relocating some model parameters from one node to another (e.g., when a recommendation suggests increasing the number of servers). Consequently, we design a technique, namely On-Demand Model Relocation, that can carry out Type Ib model data relocation more efficiently.
The idea of ODMR is to carry out parameter relocation on demand. Concretely, on receiving a Type Ib request, the system only invokes SSR to reflect the decision of moving a parameter from a source to a destination. The actual parameter movement takes place only when the parameter is pulled from the source server and pushed back to the destination server. Suppose there are two servers and and they originally manage parameters and , respectively. Now, assume a reconfiguration suggests adding one more server so that the three servers, , , and manage parameters , , and , respectively. When a worker requests a parameter that is supposed to be relocated, e.g., , we simply let the worker pull it from its old location . After the workers have computed the updates, they push both the original values and the updates to the new destination . The reasons for pushing the original value are that (1) the destination does not have the original value , so sending the updates alone is not enough, and (2) the original value “flags” to the servers that this push is special, which avoids possible repeated counting: the first time the server receives the message , it should create a new parameter with value , but the second time it receives a message , it should act as if receiving a normal push with .
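The lazy relocation protocol described above can be sketched in a few lines. This is a single-threaded illustration under stated assumptions, not the TensorFlowOnline implementation: the `Server` and `Router` classes, their method names, and the scalar parameter values are hypothetical, and concurrency control and failure handling are omitted.

```python
class Server:
    """Toy server holding a dictionary of scalar parameters."""

    def __init__(self, name, params=None):
        self.name = name
        self.params = dict(params or {})

    def pull(self, key):
        return self.params[key]

    def push(self, key, update, original=None):
        if original is not None and key not in self.params:
            # First special push: materialize the parameter from the
            # original value carried by the worker.
            self.params[key] = original
        # Afterwards (and for every later push) apply the update normally.
        self.params[key] += update


class Router:
    """Maps each parameter to its owner and tracks pending lazy relocations."""

    def __init__(self, owner):
        self.owner = dict(owner)
        self.moving = {}  # key -> source server that still holds the value

    def relocate(self, key, dest):
        # Metadata-only change; no parameter value is copied here.
        self.moving[key] = self.owner[key]
        self.owner[key] = dest

    def pull(self, key):
        # While a relocation is pending, read from the old location.
        source = self.moving.get(key, self.owner[key])
        return source.pull(key), key in self.moving

    def push(self, key, update, original, relocating):
        dest = self.owner[key]
        if relocating:
            dest.push(key, update, original=original)
            source = self.moving.pop(key, None)
            if source is not None:
                source.params.pop(key, None)  # the source copy is now stale
        else:
            dest.push(key, update)


s1 = Server("s1", {"w1": 1.0, "w2": 2.0})
s2 = Server("s2")
router = Router({"w1": s1, "w2": s1})
router.relocate("w2", s2)

# Two workers pull the relocating parameter before either one pushes.
v_a, flag_a = router.pull("w2")
v_b, flag_b = router.pull("w2")
router.push("w2", 0.5, original=v_a, relocating=flag_a)
router.push("w2", 0.25, original=v_b, relocating=flag_b)
```

The second push carries the same original value as the first, but because the destination already holds the parameter it is treated as a normal push, so the original value is counted only once.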
The ODMR approach has the merit of overlapping a Type Ib relocation with the usual push-and-pull operations. It does not cause any system quiescence as the baseline solution does. By mixing and matching ODMR with the existing checkpointing and recovery techniques in PS-style systems, our reconfiguration scheme is as follows:

For Type Ia reconfiguration only, invoke TDR.

For Type Ib reconfiguration only, invoke ODMR.

For Type II reconfiguration only, change the system configuration file and invoke SSR.

For any combination of the above, invoke the union of their actions.
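The scheme above amounts to a small lookup from reconfiguration types to techniques, with combinations handled by taking the union. The sketch below illustrates this; the table `ACTIONS`, the list `BASELINE`, and the function `plan_reconfiguration` are hypothetical names, and the techniques are represented by their abbreviations only.

```python
# Technique(s) invoked for each reconfiguration type, as listed above.
ACTIONS = {
    "Ia": {"TDR"},   # training data relocation
    "Ib": {"ODMR"},  # model data relocation
    "II": {"SSR"},   # system setting change (after editing the config file)
}

# The baseline always checkpoints and then performs a full recovery.
BASELINE = ["CKP", "SSR", "MDR", "TDR"]


def plan_reconfiguration(types):
    """Return the sorted union of techniques needed for the given types."""
    actions = set()
    for reconfiguration_type in types:
        actions |= ACTIONS[reconfiguration_type]
    return sorted(actions)
```

For instance, `plan_reconfiguration({"Ia", "II"})` yields `["SSR", "TDR"]`, which skips the checkpoint and the full model recovery that the baseline would perform.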
We end this section by recalling the need to estimate the reconfiguration cost for the online tuning phase (Section III-C). With the discussion above, it becomes clear that depends on the reconfiguration type and technique. Nonetheless, we observe empirically that the cost variance of each technique is small, and thus we can simply deduce the cost of each individual technique from the execution metrics collected during the initialization phase.
VI Experiments
We implemented our techniques on top of TensorFlow v1.8. The details of our implementation can be found in Appendix B. Briefly, the implementation includes a user-level library written in Python 2.7 in order to abstract out the system setting of a TensorFlow program. We implemented the Tuning Manager and the repository in Python. We modified the frontend of TensorFlow so as to support our reconfiguration scheme. We refer to this prototype implementation as TensorFlowOnline in this section. TensorFlowOnline supports both asynchronous parallel (ASP) and bulk synchronous parallel (BSP) training. Table I lists all the system knobs supported by TensorFlowOnline.
Hardware. We performed all the experiments on a cluster of 36 identical servers connected by Ethernet. The network bandwidth is 10 Gbps. The computing nodes run 64-bit CentOS 7.3, with the training datasets on HDFS 2.6.0. Each node is an Intel Xeon E5-2620 v4 system with a 16-core CPU running at 2.1 GHz, 64 GB of memory, and an 800 GB SSD.
Comparison. For comparison purposes, we use vanilla TensorFlow as the baseline and compare it with TensorFlowOnline. For each experiment executed by TensorFlow, we repeated the job 100 times, each time using a different random system setting, and report:

Worst: the worst completion time of TensorFlow among 100 random settings.

Average: the average completion time of TensorFlow among 100 random settings.

Best: the best completion time of TensorFlow among 100 random settings.
Note that the result of Best may not be achievable in practice, because nobody would really run the same job on TensorFlow a hundred times just to identify the best system setting for a particular model (hyperparameter) and dataset. We also note that TensorFlow, although popular, is more like a software library than a system: currently users must explicitly specify most system parameters (e.g., the number of workers), and there are no default values per se.
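As a minimal sketch of how the three baseline figures can be derived from the repeated runs, the following Python function summarizes a list of (setting, completion time) pairs. The function name and the data layout are assumptions made for illustration. Since Average is a mean rather than a single run, the sketch also returns the run whose completion time is closest to that mean, which is how a concrete setting is reported for Average.

```python
def summarize_baseline(runs):
    """Summarize baseline runs.

    runs: non-empty list of (setting, completion_time_seconds) pairs,
          one pair per random system setting that was tried.
    """
    times = [completion_time for _, completion_time in runs]
    mean_time = sum(times) / len(times)

    worst = max(runs, key=lambda run: run[1])  # slowest run
    best = min(runs, key=lambda run: run[1])   # fastest run
    # The run whose completion time is closest to the mean stands in for Average.
    representative = min(runs, key=lambda run: abs(run[1] - mean_time))

    return {
        "worst": worst,
        "best": best,
        "average_time": mean_time,
        "average_setting": representative[0],
    }


if __name__ == "__main__":
    # Toy data: setting labels and times are made up for illustration only.
    toy_runs = [("setting-a", 10.0), ("setting-b", 20.0), ("setting-c", 60.0)]
    print(summarize_baseline(toy_runs))
```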
Workload and Datasets We evaluate TensorFlowOnline with three widely used machine learning models: (i) regularized Logistic Regression (LogR), (ii) Support Vector Machine (SVM), and (iii) Convolutional Neural Network (CNN). For CNN, we used AlexNet [47], a convolutional neural network with five layers. AlexNet is also used in many other works [48, 5]. We remark that AlexNet is non-convex, and its convergence bound is still actively researched by the machine learning community. But as a system paper, we still try AlexNet to see whether our current estimation function can be used as a heuristic. This is not uncommon in machine learning. For example, it is known that SGD might converge to a saddle point or a local optimum rather than the global optimum when facing non-convex problems. Nonetheless, SGD is still extensively used in all sorts of deep learning problems in practice.
Table II summarizes the characteristics of the datasets and the workloads used in our experiments. For CNN, we used two different datasets. Specifically, ImageNet is a typical dataset for deep learning. However, each training job on ImageNet can take weeks on our cluster. When running TensorFlow in the baseline experiments, some random (poor) settings took even longer to finish. As we have to run a lot of baseline experiments, we follow [46] and use a reduced version of ImageNet, namely ImageNet8, which consists of the first 8 classes in the original data. The convergence thresholds for LogR, SVM, CNN on CIFAR, and CNN on ImageNet8 are set as 0.2, 0.98, 0.5, and 1.5, respectively. These thresholds are chosen to ensure that we can obtain the baseline results within months.
VI-A Performance Evaluation
Figure 5 compares the completion time of TensorFlowOnline with that of TensorFlow. We see that TensorFlowOnline achieves a speedup of about 1.4× (CNN on ImageNet8) to 2.5× (SVM) over Average, meaning that TensorFlowOnline saves much time for average ML users who have little system background. Furthermore, TensorFlowOnline helps users avoid disastrously bad settings, which are 6× (LogR) to 18× (CNN on ImageNet8) slower than TensorFlowOnline.
Figure 6 shows the loss of the jobs with respect to the training time. In the figure, we indicate the moment when TensorFlowOnline switches from its initialization phase to the online tuning phase with a vertical dotted line. We also add a marker on the x-axis whenever TensorFlowOnline changes the system setting online. From the figure, we observe that TensorFlowOnline might have a slower convergence rate during the initialization phase because it is trying different settings and some of those might not be good ones. Nonetheless, that is worth doing because once TensorFlowOnline enters the online tuning phase, it progressively uses better system settings and converges much faster afterwards.
VI-B Statistical Efficiency versus Hardware Efficiency
The completion time of an ML job is a complex interplay between statistical efficiency and hardware efficiency, because a setting that is good at one kind of efficiency might be a bad setting overall. Figure 7 confirms that. The figure shows the loss of the jobs with respect to the iterations executed. In fact, both TensorFlowOnline and Best have chosen settings that need slightly more iterations to converge on LogR and CNN (CIFAR) when compared with the settings chosen by Worst and Average. But Table III, which gives the number of iterations and the execution time per iteration on all workloads, shows that the settings chosen by TensorFlowOnline and Best have much better hardware efficiency than the settings chosen by Worst and Average. That explains why TensorFlowOnline and Best have much better end-to-end completion times overall. On ImageNet8, TensorFlowOnline has chosen a setting that is more hardware efficient, whereas Best has chosen a setting that is more statistically efficient. We believe that is caused by the fact that our estimation function is only a heuristic when facing non-convex problems. But we remark that the setting chosen by TensorFlowOnline is a fairly good one after all.
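Table III: Number of iterations and execution time per iteration on all workloads.

| Workload | Worst: # of iterations | Worst: time per iteration | Average: # of iterations | Average: time per iteration | TensorFlowOnline: # of iterations | TensorFlowOnline: time per iteration | Best: # of iterations | Best: time per iteration |
|---|---|---|---|---|---|---|---|---|
| LogR | 14899 | 0.846s | 14795 | 0.217s | 22592 | 0.093s | 21834 | 0.060s |
| SVM | 106323 | 0.691s | 227125 | 0.034s | 223519 | 0.013s | 225085 | 0.010s |
| CNN on CIFAR | 35426 | 0.157s | 37520 | 0.023s | 44827 | 0.011s | 43601 | 0.005s |
| CNN on ImageNet8 | 3975 | 24.921s | 1163 | 6.132s | 1555 | 3.463s | 691 | 3.747s |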
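To make the interplay concrete, the sketch below multiplies the number of iterations by the time per iteration for the LogR and SVM rows of Table III. This is only a rough approximation under two assumptions that are ours, not the paper's: that the reported per-iteration time is an average over the run, and that reconfiguration overhead can be ignored. The function and variable names are hypothetical.

```python
# (number of iterations, seconds per iteration), copied from Table III.
TABLE_III = {
    "LogR": {
        "Worst": (14899, 0.846),
        "Average": (14795, 0.217),
        "TensorFlowOnline": (22592, 0.093),
        "Best": (21834, 0.060),
    },
    "SVM": {
        "Worst": (106323, 0.691),
        "Average": (227125, 0.034),
        "TensorFlowOnline": (223519, 0.013),
        "Best": (225085, 0.010),
    },
}


def approx_completion_time(iterations, seconds_per_iteration):
    """Statistical efficiency (iterations) times hardware efficiency (s/iteration)."""
    return iterations * seconds_per_iteration


def slowdown_vs(workload, reference="TensorFlowOnline", other="Worst"):
    """How many times slower `other` is than `reference` under this approximation."""
    reference_time = approx_completion_time(*TABLE_III[workload][reference])
    other_time = approx_completion_time(*TABLE_III[workload][other])
    return other_time / reference_time


if __name__ == "__main__":
    for workload, columns in TABLE_III.items():
        for name, (iterations, seconds) in columns.items():
            total = approx_completion_time(iterations, seconds)
            print("%-5s %-17s ~%.0f s" % (workload, name, total))
```

On LogR, for instance, TensorFlowOnline runs more iterations than Worst (22592 versus 14899), yet the product is roughly six times smaller because each iteration is about nine times cheaper.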
Tables IV, IX, X, and XI list the system settings chosen by Worst, Average, TensorFlowOnline, and Best in detail. For Average, the reported setting is the one whose completion time is closest to the average completion time. For TensorFlowOnline, the reported setting is the final one found by TensorFlowOnline in the online tuning phase. Taking the settings reported in the SVM experiment (Table IV) as an example, we see that TensorFlowOnline found a setting quite close to Best, especially in the server-worker ratio and in the use of parallelism, which explains TensorFlowOnline's near-optimal performance on SVM. When we look at the settings reported in the CNN ImageNet8 experiment (Table XI), we observe that TensorFlowOnline and Best chose quite different system settings: TensorFlowOnline has chosen a more hardware-efficient one, whereas Best has chosen a more statistically efficient one. Nonetheless, the setting chosen by TensorFlowOnline is good enough, and it outperforms the one chosen by Average in terms of end-to-end completion time.
VI-C Reconfiguration
In order to evaluate our proposed reconfiguration scheme, especially ODMR, we also carried out a set of experiments on a variant of TensorFlowOnline whose reconfiguration is implemented using the baseline method (i.e., checkpointing and recovery). Table V compares the reconfiguration costs of the two implementations. Column (a) shows that our reconfiguration scheme reduces the total reconfiguration overhead by 400% (LogR) to 640% (CNN on CIFAR). Column (b) shows the average cost of a single reconfiguration: our reconfiguration scheme reduces the overhead of each reconfiguration by 380% (LogR) to 760% (CNN on CIFAR). The reconfiguration cost is higher in LogR than in the other workloads because the model size of LogR is much bigger (see Table II). That influences both the baseline’s state-checkpointing-and-recovery cost and ODMR’s relocation cost. Nonetheless, we observe that the number of reconfigurations that took place is fairly small, between 24 and 50. Consequently, those costs are worthwhile: they are offset by the use of better system settings in subsequent iterations, as our overall experimental results confirm.
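Table V: Reconfiguration costs of the baseline method and TensorFlowOnline.

| Workload | # of Reconfig | (a) Total overhead: Baseline | (a) Total overhead: TensorFlowOnline | (b) Overhead per reconfiguration: Baseline | (b) Overhead per reconfiguration: TensorFlowOnline |
|---|---|---|---|---|---|
| LogR | 37 | 1739s | 444s | 47s | 12s |
| SVM | 50 | 650s | 100s | 13s | 2s |
| CNN on CIFAR | 26 | 416s | 52s | 16s | 2s |
| CNN on ImageNet8 | 24 | 960s | 144s | 40s | 6s |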
VI-D Estimation Quality
Lastly, we try to understand the quality of our estimation function. Since, as discussed, our primary goal is not estimation accuracy, we evaluate the rank [49] of our estimation function instead. Specifically, there is a perfect ranking of the 100 system settings obtained in the baseline experiments, where the one with the best completion time, i.e., Best, has rank 1, and Worst has rank 100. This perfect ranking can serve as an oracle for the evaluation.
Consider a random system setting in our baseline experiments. We divide its execution metrics into consecutive segments of iterations. Then, for each segment, we follow our methodology in Section IV to form an estimation function, which yields a series of estimation functions. Next, we feed the same convergence threshold to each of them to obtain a series of estimated remaining completion times, one at the end of each segment (e.g., at iteration 60, at iteration 120, and so on). For the 100 random settings used in our baseline experiments, we then obtain a table of estimation results like the following (note: the numbers below are for illustration only):
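| Setting | Est. remaining time at iteration 60 | Est. remaining time at iteration 120 | … |
|---|---|---|---|
| Setting 1 | 5555 | 3333 | … |
| Setting 2 | 4444 | 4222 | … |
| … | … | … | … |
| Setting 100 | 7777 | 6666 | … |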
With a table of estimation results, we can evaluate the quality of the estimation function directly. Specifically, we can deduce which setting is the estimated “optimal” according to the estimation function alone. For example, as of iteration 60, our estimation function regards as the estimated optimal the setting whose estimated remaining time is the lowest among all settings. Similarly, as of iteration 120, it regards as the estimated optimal the setting whose estimated remaining time is the lowest at that moment. Now, consider iteration 60 and the setting that is the estimated optimal at that moment. We can quantify whether the estimation is reliable by cross-checking the rank of that setting with respect to the oracle. Concretely, if that setting also ranks 1st in the oracle, the estimation is good enough to suggest the real optimal. In contrast, if it ranks 100th in the oracle, the estimated “optimal” setting turns out to be the worst one among the 100 settings. The notion of rank based on an actual oracle has been used in [49], where it was shown to be far more informative than the notion of error when evaluating the quality of an estimation function. So now, for each segment (iterations 1 to 60 form segment 1, iterations 61 to 120 form segment 2, etc.), we can identify the rank of the estimated optimal in that segment. We report the average rank of those estimated optimals across all segments. Semantically, that is the quality of our estimates across every possible reconfiguration point of the training job under TensorFlowOnline.
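The rank-based evaluation described above can be sketched as follows. The function name, the dictionary layout, and the toy oracle ranks are assumptions made for illustration; only the estimated remaining times in the toy example reuse the illustrative numbers from the text.

```python
def average_rank_of_estimated_optimal(estimates, oracle_rank):
    """Average oracle rank of the per-segment estimated optimal setting.

    estimates:   dict mapping a setting to a list of estimated remaining
                 completion times, one entry per segment (all lists have
                 the same length; the dict is non-empty).
    oracle_rank: dict mapping a setting to its rank in the oracle, where
                 rank 1 is the setting with the best actual completion time.
    """
    num_segments = len(next(iter(estimates.values())))
    ranks = []
    for segment in range(num_segments):
        # The estimated optimal is the setting with the lowest estimate here.
        estimated_optimal = min(estimates, key=lambda s: estimates[s][segment])
        ranks.append(oracle_rank[estimated_optimal])
    return sum(ranks) / len(ranks)


if __name__ == "__main__":
    # Illustrative estimates for three settings over two segments.
    toy_estimates = {
        "setting-1": [5555, 3333],
        "setting-2": [4444, 4222],
        "setting-100": [7777, 6666],
    }
    # Hypothetical oracle ranks for the toy settings.
    toy_oracle = {"setting-1": 1, "setting-2": 2, "setting-100": 3}
    print(average_rank_of_estimated_optimal(toy_estimates, toy_oracle))
```

A value close to 1 means that, at most reconfiguration points, the setting the estimation function would pick is also among the best according to the oracle.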
Table VI shows the average ranks of our estimation on the four different workloads. Our estimation function has excellent quality on LogR and SVM, for which its estimated optimals are, on average, the third (3.3) and the second (2.0) best settings in reality. As a heuristic for non-convex CNN, our estimation function, though not as promising as on LogR and SVM, is still able to return good but not excellent settings that rank within the top 22 percent. As an initial prototype, we regard that as good enough, because we can see from the previous experiments (Figure 5) that TensorFlowOnline successfully avoids disastrous settings that would have been about 6 times (LogR), 25 times (SVM), 10 times (CIFAR), and 18 times (ImageNet8) slower. Nonetheless, we are aware that new convergence bounds for two-layer neural networks have recently been released [15]. We will try those new bounds as heuristic estimation functions in TensorFlowOnline for CNN problems in the future.
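Table VI: Average rank of the estimated optimal settings.

| Workload | LogR | SVM | CNN on CIFAR | CNN on ImageNet8 |
|---|---|---|---|---|
| Rank | 3.3 | 2.0 | 22.0 | 13.0 |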
VII Related Work
The (short) history of the PS architecture began with systems that were specifically designed for LDA topic modeling [50] and deep networks [5, 1]. Afterwards, general-purpose ML systems also adopted the PS architecture [3, 2]. Compared with auto-tuning database systems, auto-tuning ML systems is still in its infancy. In [25], an offline tuner specifically designed for Adam [5], a closed-source ML system, was presented. That work manually established an analytical cost model based on Adam’s architecture and design. Similarly, in [51], an offline resource tuner for ML systems was discussed. That work, however, focused only on hardware efficiency. More recent works [41] and [52] discuss, respectively, the automatic selection of different GD algorithms by manually creating an analytical cost model, and the automatic placement of operators on CPU/GPU using reinforcement learning. Our scope is considerably broader. More importantly, we target online tuning, i.e., a job is executed using better and better system settings as it proceeds. In contrast, [41] targets offline tuning: it first decides which GD algorithm to use and never changes that decision, even though a job may last for hours or weeks. In [53], experiments show that changing the cluster resources online can influence the completion time of ML jobs, which supports the arguments of this paper. FlexPS [26] shares the same vision as ours. However, FlexPS supports only one knob, the worker-server ratio. Furthermore, it requires users to learn a completely new programming model and API. In contrast, the techniques in this paper are general. We have shown that they can be applied to the popular TensorFlow and that users can enjoy better efficiency with little effort.
There are distributed systems specialized for deep learning, for example, SINGA [54], MXNet [4], FireCaffe [55], SparkNet [56], Omnivore [46], and Project Adam [5]. Decoupling hardware efficiency and statistical efficiency is not new there. For example, MXNet reported hardware efficiency and statistical efficiency separately, SINGA studied their tradeoff, and Omnivore leveraged that tradeoff to improve end-to-end running time. However, they have not studied the issues of online tuning and system reconfiguration as we do.
In machine learning, automatic hyperparameter tuning (e.g., tuning the number of hidden layers in a deep neural network) to find the best model is a grand challenge [57]. Currently, most ML users follow a trial-and-error process to find their “right” model: (i) pick an initial setting for the hyperparameters, and then train the model for a fixed amount of time (e.g., days); (ii) if the final accuracy is not desirable, choose another set of hyperparameter values and repeat the process until the model meets the user’s expectation. Under this trial-and-error cycle, what we propose in this paper would significantly reduce the time of each trial, thereby expediting the search for the ideal model.
Performance modeling and progress estimation are interesting problems in their own right. For example, Ernest [58] trains a performance model for machine learning applications. However, even that recent work has left the estimation of statistical efficiency as future work. A progress indicator is a useful add-on in analytical systems because it lets users know when they will obtain the results [59, 60, 61, 62, 63]. In this paper, we have pioneered a progress indicator for ML systems by giving initial solutions to the statistical progress estimation problem.
VIII Conclusion and Future Work
In this paper, we make a case for building a parameter-server system prototype that supports self-tuning. We show that the performance of machine learning (ML) systems, like that of database systems, is also subject to the values of system parameters. However, unlike database systems, ML systems can afford online, on-job training and tuning because of the long-running nature of iterative ML programs. To this end, we have developed an online optimization framework that is suitable for all ML systems. We have also developed initial solutions to the online statistical progress estimation problem. Furthermore, we have developed new techniques to carry out online reconfiguration in a lightweight and non-quiescent mode. As an initial effort to showcase our techniques, we have implemented a prototype on top of TensorFlow. Experiments show that various ML tasks gain a speedup by a factor of 1.4 to 18.
As a prototype, TensorFlowOnline includes only one specific statistical progress estimation function. Although empirical results show that it works well even on workloads that violate its assumptions, we are going to devise specific estimation functions for each kind of ML problem and algorithm. From the system perspective, we plan to extend our idea to other system architectures (e.g., the peer-to-peer architecture on a scale-up machine [64]) and to platforms with heterogeneous machines (e.g., [65]). We are also in the process of using transfer learning [35] to eliminate the initialization phase. Specifically, when the framework receives a new ML job, it shall search the repository and locate a previous job that is most similar to the new one. Then it shall transfer all candidate settings that the previous job had ever picked to be the new job’s candidate settings. In this case, determining the number of initial settings to try during the initialization phase is no longer a question. OtterTune [30] has also leveraged a similar idea when facing new DB workloads. We believe the formal use of transfer learning in ML system tuning would be promising.
References
 [1] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. W. Senior, P. A. Tucker, K. Yang, and A. Y. Ng, “Large scale distributed deep networks,” in NIPS, 2012, pp. 1232–1240.
 [2] E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu, “Petuum: A new platform for distributed machine learning on big data,” in SIGKDD, 2015, pp. 1335–1344.
 [3] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B. Su, “Scaling distributed machine learning with the parameter server,” in OSDI, 2014, pp. 583–598.
 [4] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, “Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.
 [5] T. M. Chilimbi, Y. Suzue, J. Apacible, and K. Kalyanaraman, “Project adam: Building an efficient and scalable deep learning training system,” in OSDI, 2014, pp. 571–582.
 [6] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: A system for large-scale machine learning,” in OSDI, 2016, pp. 265–283.
 [7] C. Zhang and C. Ré, “DimmWitted: A study of main-memory statistical analytics,” PVLDB, vol. 7, no. 12, pp. 1283–1294, 2014.
 [8] E. Brochu, V. M. Cora, and N. de Freitas, “A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning,” CoRR, 2010.
 [9] J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimization of machine learning algorithms,” in NIPS, 2012, pp. 2960–2968.
 [10] J. Mockus, Bayesian approach to global optimization: theory and applications. Springer Science & Business Media, 2012.
 [11] T. Li, J. Zhong, J. Liu, W. Wu, and C. Zhang, “Ease.ml: Towards multi-tenant resource sharing for machine learning workloads,” PVLDB, vol. 11, no. 5, pp. 607–620, 2018.
 [12] O. Alipourfard, H. H. Liu, J. Chen, S. Venkataraman, M. Yu, and M. Zhang, “Cherrypick: Adaptively unearthing the best cloud configurations for big data analytics,” in NSDI, 2017, pp. 469–482.

 [13] B. Recht, C. Ré, S. J. Wright, and F. Niu, “Hogwild: A lock-free approach to parallelizing stochastic gradient descent,” in NIPS, 2011, pp. 693–701.
 [14] C. D. Sa, C. Zhang, K. Olukotun, and C. Ré, “Taming the wild: A unified analysis of Hogwild-style algorithms,” in NIPS, 2015, pp. 2674–2682.
 [15] Y. Li and Y. Yuan, “Convergence analysis of two-layer neural networks with ReLU activation,” in NIPS, 2017, pp. 597–607.
 [16] E. R. Sparks, A. Talwalkar, M. J. Franklin, M. I. Jordan, and T. Kraska, “Tupaq: An efficient planner for large-scale predictive analytic queries,” arXiv preprint arXiv:1502.00068, 2015.
 [17] L. Kotthoff, C. Thornton, H. H. Hoos, F. Hutter, and K. Leyton-Brown, “Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization in WEKA,” Journal of Machine Learning Research, vol. 18, pp. 25:1–25:5, 2017.
 [18] D. Golovin, B. Solnik, S. Moitra, G. Kochanski, J. Karro, and D. Sculley, “Google Vizier: A service for black-box optimization,” in SIGKDD, 2017, pp. 1487–1495.
 [19] U. Becker-Kornstaedt, L. Scott, and J. Zettel, “Process engineering with spearmint/epg,” in ICSE, 2000, p. 791.
 [20] T. G. authors, “GPyOpt: A bayesian optimization framework in python,” http://github.com/SheffieldML/GPyOpt, 2016.
 [21] T. autosklearn authors, “autosklearn,” https://github.com/automl/autosklearn, 2014.
 [22] Z. Allen-Zhu and Y. Yuan, “Improved SVRG for non-strongly-convex or sum-of-non-convex objectives,” in ICML, 2016, pp. 1080–1089.
 [23] T. Schaul, D. Horgan, K. Gregor, and D. Silver, “Universal value function approximators,” in ICML, 2015, pp. 1312–1320.
 [24] O. Meshi, M. Mahdavi, and A. Schwing, “Smooth and strong: Map inference with linear convergence,” in NIPS, 2015, pp. 298–306.
 [25] F. Yan, O. Ruwase, Y. He, and T. Chilimbi, “Performance modeling and scalability optimization of distributed deep learning systems,” in SIGKDD, 2015, pp. 1355–1364.
 [26] Y. Huang, T. Jin, Y. Wu, Z. Cai, X. Yan, F. Yang, J. Li, Y. Guo, and J. Cheng, “Flexps: Flexible parallelism control in parameter server architecture,” Proceedings of the VLDB Endowment, vol. 11, no. 5, pp. 566–579, 2018.
 [27] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
 [28] E. Contal, D. Buffoni, A. Robicquet, and N. Vayatis, “Parallel gaussian process optimization with upper confidence bound and pure exploration,” in ECML PKDD, 2013, pp. 225–240.
 [29] S. Duan, V. Thummala, and S. Babu, “Tuning database configuration parameters with ituned,” PVLDB, vol. 2, no. 1, pp. 1246–1257, 2009.
 [30] D. V. Aken, A. Pavlo, G. J. Gordon, and B. Zhang, “Automatic database management system tuning through large-scale machine learning,” in SIGMOD, 2017, pp. 1009–1024.
 [31] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” J. Artif. Intell. Res., vol. 4, pp. 237–285, 1996.
 [32] R. Johnson and T. Zhang, “Accelerating stochastic gradient descent using predictive variance reduction,” in NIPS, 2013, pp. 315–323.
 [33] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning: With Applications in R. Springer Publishing Company, Incorporated, 2014.
 [34] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org.
 [35] S. J. Pan and Q. Yang, “A survey on transfer learning,” TKDE, vol. 22, no. 10, pp. 1345–1359, 2010.
 [36] H. Harmouch and F. Naumann, “Cardinality estimation: An experimental survey,” PVLDB, vol. 11, no. 4, pp. 499–512, 2017. [Online]. Available: http://www.vldb.org/pvldb/vol11/p499harmouch.pdf
 [37] G. Lohman, “Is query optimization a “solved” problem?” http://wp.sigmod.org/?p=1075, 2014.
 [38] Q. Meng, W. Chen, Y. Wang, Z.-M. Ma, and T.-Y. Liu, “Convergence analysis of distributed stochastic gradient descent with shuffling,” NIPS, 2017.
 [39] S. Shalev-Shwartz and S. Ben-David, Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
 [40] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,” SIAM Journal on Optimization, vol. 19, no. 4, pp. 1574–1609, 2009.
 [41] Z. Kaoudi, J.-A. Quiané-Ruiz, S. Thirumuruganathan, S. Chawla, and D. Agrawal, “A cost-based optimizer for gradient descent optimization,” in SIGMOD, 2017, pp. 977–992.
 [42] C. D. Sa, “Nonconvex optimization,” http://www.cs.cornell.edu/courses/cs6787/2017fa/Lecture7.pdf, 2017.
 [43] K. Dataset, “Kdd cup 2012, track 1,” https://www.kaggle.com/c/kddcup2012track1, 2012.
 [44] C. Labs, “Criteo releases industry’s largestever dataset for machine learning to academic community,” https://www.criteo.com/news/pressreleases/2015/07/criteoreleasesindustryslargesteverdataset/, 2015.
 [45] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Tech. Rep., 2009.
 [46] S. Hadjis, C. Zhang, I. Mitliagkas, and C. Ré, “Omnivore: An optimizer for multi-device deep learning on CPUs and GPUs,” arXiv preprint arXiv:1606.04487, 2016.
 [47] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012, pp. 1106–1114.
 [48] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
 [49] W. Xu, Z. Feng, and E. Lo, “Fast multi-column sorting in main-memory column-stores,” in SIGMOD, 2016, pp. 1263–1278.
 [50] A. J. Smola and S. M. Narayanamurthy, “An architecture for parallel topic models,” PVLDB, vol. 3, no. 1, pp. 703–710, 2010.
 [51] B. Huang, M. Boehm, Y. Tian, B. Reinwald, S. Tatikonda, and F. R. Reiss, “Resource elasticity for large-scale machine learning,” in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015, pp. 137–152.

 [52] K. Nguyen, H. Daumé III, and J. Boyd-Graber, “Reinforcement learning for bandit neural machine translation with simulated human feedback,” arXiv, 2017.
 [53] X. Pan, S. Venkataraman, Z. Tai, and J. Gonzalez, “Hemingway: Modeling distributed optimization algorithms,” CoRR, 2017.
 [54] B. C. Ooi, K.-L. Tan, S. Wang, W. Wang, Q. Cai, G. Chen, J. Gao, Z. Luo, A. K. Tung, Y. Wang, and Z. Xie, “SINGA: A distributed deep learning platform,” in ACM Multimedia, 2015, pp. 685–688.
 [55] F. N. Iandola, M. W. Moskewicz, K. Ashraf, and K. Keutzer, “FireCaffe: Near-linear acceleration of deep neural network training on compute clusters,” in CVPR, 2016, pp. 2592–2600.
 [56] P. Moritz, R. Nishihara, I. Stoica, and M. I. Jordan, “Sparknet: Training deep networks in spark,” arXiv preprint arXiv:1511.06051, 2015.
 [57] I. Guyon, I. Chaabane, H. J. Escalante, S. Escalera, D. Jajetic, J. R. Lloyd, N. Macià, B. Ray, L. Romaszko, M. Sebag, A. R. Statnikov, S. Treguer, and E. Viegas, “A brief Review of the ChaLearn AutoML Challenge: Anytime Anydataset Learning without Human Intervention,” in Workshop on Automatic Machine Learning, 2016, pp. 21–30.
 [58] S. Venkataraman, Z. Yang, M. J. Franklin, B. Recht, and I. Stoica, “Ernest: Efficient performance prediction for large-scale advanced analytics,” in USENIX, 2016, pp. 363–378.
 [59] J. Li, A. C. König, V. R. Narasayya, and S. Chaudhuri, “Robust estimation of resource consumption for SQL queries using statistical techniques,” PVLDB, vol. 5, no. 11, pp. 1555–1566, 2012.
 [60] J. Li, R. V. Nehme, and J. F. Naughton, “Toward progress indicators on steroids for big data systems,” in CIDR, 2013.
 [61] S. Chaudhuri, V. R. Narasayya, and R. Ramamurthy, “Estimating progress of long running SQL queries,” in SIGMOD, 2004, pp. 803–814.
 [62] G. Luo, J. F. Naughton, C. J. Ellmann, and M. W. Watzke, “Toward a progress indicator for database queries,” in SIGMOD, 2004, pp. 791–802.
 [63] K. Morton, A. L. Friesen, M. Balazinska, and D. Grossman, “Estimating the progress of mapreduce pipelines,” in ICDE, 2010, pp. 681–684.
 [64] D. Grubic, L. Tam, D. Alistarh, and C. Zhang, “Synchronous multi-GPU training for deep learning with low-precision communications: An empirical study,” in EBDT, 2018, pp. 145–156.
 [65] J. Jiang, B. Cui, C. Zhang, and L. Yu, “Heterogeneity-aware distributed parameter servers,” in SIGMOD, 2017, pp. 463–478.
Appendix A Notations
We summarize all the notations used in this paper in Table VII.
Notation  Meaning 

system setting  
optimal or nearoptimal system setting  
the loss of the first iteration of using setting  
dimensional vector that includes both the system setting values and the loss of the model  
the remaining completion time of the job if we switch to setting where the model has reached a loss down to  
the execution time of th iteration  
the loss of iteration under setting  
the estimated remaining completion time at setting  
collected execution metrics  
training data for the Bayesian Optimization (BO)  
model parameters after th iteration  
the optimal model parameters 
Appendix B Implementing TensorFlowOnline
We have implemented our techniques on top of TensorFlow v.1.3 and name our prototype TensorFlowOnline.
B-A User Program
Currently, TensorFlow exposes all system settings through the class constructors and class attributes of its core classes. Table VIII (left) shows how ML users specify those settings within the program. We have implemented a Python module for TensorFlowOnline so that users no longer need to specify the system settings. Table VIII (right) shows the corresponding program with TensorFlowOnline installed. We can see that an ML user no longer needs to deal with those system settings; the only requirement is that she implements her TensorFlow program by extending a new class, MLJobFramework, provided by TensorFlowOnline. That class is implemented in the frontend to collect runtime statistics.
B-B Frontend
TensorFlow’s existing implementation already has explicit facilities to implement CKP, MDR, and TDR. Specifically, CKP can be invoked by Saver.save(), MDR can be invoked by Saver.restore(), and TDR can be invoked by reading training data through TensorFlow’s tf.ReaderBase with the HDFS filesystem plugin.
Now, we discuss how we implement ODMR (On-Demand-Model-Relocation) in TensorFlow. The placement of parameters is controlled by the execution graph generated by TensorFlow’s frontend. When data reallocation occurs (e.g., when the number of parameter servers changes), the attribute tf.Variable::device, which controls the location of the parameters, is updated according to the parameter mapping generated by TensorFlowOnline. To also push the original data values under ODMR, we added an extra operation to the frontend.
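As a rough illustration of what a parameter mapping could look like when the number of parameter servers changes, the sketch below assigns parameters to servers in round-robin order and lists the parameters whose device string changes. The round-robin policy and the function names are assumptions made for this example; the text does not specify how TensorFlowOnline generates its mapping. The device strings follow TensorFlow's `/job:<name>/task:<index>` convention.

```python
def round_robin_placement(param_names, num_ps):
    """Map each parameter to a parameter-server device in round-robin order.

    This placement policy is an assumption made for illustration only.
    """
    return {
        name: "/job:ps/task:%d" % (index % num_ps)
        for index, name in enumerate(param_names)
    }


def relocation_plan(param_names, old_num_ps, new_num_ps):
    """Return {parameter: (old_device, new_device)} for parameters that move."""
    old = round_robin_placement(param_names, old_num_ps)
    new = round_robin_placement(param_names, new_num_ps)
    return {
        name: (old[name], new[name])
        for name in param_names
        if old[name] != new[name]
    }


if __name__ == "__main__":
    # Hypothetical parameter names; growing from 2 to 3 parameter servers.
    params = ["w0", "w1", "w2", "w3"]
    for name, (src, dst) in relocation_plan(params, 2, 3).items():
        print("%s: %s -> %s" % (name, src, dst))
```

Only the parameters returned by `relocation_plan` would need their device attribute updated and their values pushed to the new location; the remaining parameters stay where they are.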
B-C Backend
We modify the backend of TensorFlow in order to reduce the overhead of SSR. Currently, if we want to carry out SSR (e.g., changing the value of intra_op_parallelism_threads) in TensorFlow, the whole TensorFlow program has to restart completely, since the backend of TensorFlow cannot change the system knobs on the fly. In TensorFlowOnline, we modify TensorFlow’s backend and expose a new function called Reconfig() in the API. It allows the backend to accept new knob values from the frontend without restarting the whole job.
Appendix C Additional Experiment Results
| Knob (names simplified) | Worst | Average | TFOnline | Best |
|---|---|---|---|---|
| ps | 34 | 28 | 20 | 18 |
| worker | 2 | 8 | 16 | 18 |
| intra_op_parallelism_threads | 12 | 10 | 6 | 11 |
| inter_op_parallelism_threads | 11 | 6 | 10 | 5 |
| do_common_subexpression_elimination | False | True | True | True |
| max_folded_constant_in_bytes | 86954782 | 10485760 | 25092366 | 28952477 |
| do_function_inlining | False | True | False | True |
| global_jit_level | ON_2 | OFF | ON_1 | ON_2 |
| infer_shapes | True | True | True | True |
| place_pruned_graph | True | False | True | False |
| enable_bfloat16_sendrecv | False | False | True | True |
| Knob (names simplified) | Worst | Average | TFOnline | Best |
|---|---|---|---|---|
| ps | 33 | 16 | 6 | 5 |
| worker | 3 | 20 | 30 | 31 |
| intra_op_parallelism_threads | 2 | 2 | 3 | 12 |
| inter_op_parallelism_threads | 14 | 14 | 13 | 4 |
| do_common_subexpression_elimination | False | True | False | False |
| max_folded_constant_in_bytes | 5125478 | 10485760 | 65136941 | 50785965 |
| do_function_inlining | False | True | False | True |
| global_jit_level | ON_2 | OFF | ON_1 | OFF |
| infer_shapes | True | True | True | True |
| place_pruned_graph | False | False | True | True |
| enable_bfloat16_sendrecv | True | False | True | False |
| Knob (names simplified) | Worst | Average | TFOnline | Best |
|---|---|---|---|---|
| ps | 31 | 25 | 27 | 13 |
| worker | 5 | 11 | 9 | 23 |
| intra_op_parallelism_threads | 12 | 1 | 1 | 12 |
| inter_op_parallelism_threads | 4 | 15 | 15 | 4 |
| do_common_subexpression_elimination | False | True | False | True |
| max_folded_constant_in_bytes | 96500000 | 10485760 | 10485760 | 10485760 |
| do_function_inlining | True | True | False | True |
| global_jit_level | ON_2 | OFF | ON_2 | OFF |
| infer_shapes | False | True | False | True |
| place_pruned_graph | False | False | False | False |
| enable_bfloat16_sendrecv | False | False | False | False |