In modern machine learning, complex statistical models with many free parameters are fit to data by way of automated and highly scalable algorithms. For example, weights of a deep neural network are learned by stochastic gradient descent (SGD), minimizing a loss function over the training data. Unfortunately, some remaining hyperparameters (HPs) cannot be adjusted this way, and their values can significantly affect the prediction quality of the final model. In a neural network, we need to choose the learning rate of the stochastic optimizer, regularization constants, the type of activation functions, and architecture parameters such as the number or the width of the different layers. In Bayesian models, priors need to be specified, while for random forests or gradient boosted decision trees, the number and maximum depth of trees are important HPs.
The problem of hyperparameter tuning can be formulated as the minimization of a black-box objective function $f(\mathbf{x}): \mathcal{X} \to \mathbb{R}$, where $\mathcal{X}$ denotes the space of valid HP configurations, while the value $f(\mathbf{x})$ corresponds to the metric we wish to optimize. We assume that $\mathbf{x} = [x_1, \dots, x_d]$, where $x_j \in \mathcal{X}_j$ is one of the $d$ hyperparameters, such that $\mathcal{X} = \mathcal{X}_1 \times \dots \times \mathcal{X}_d$. For example, given some $\mathbf{x} \in \mathcal{X}$, $f(\mathbf{x})$ may correspond to the held-out error rate of a machine learning model when trained and evaluated using the HPs $\mathbf{x}$. In practice, hyperparameter optimization requires addressing the following challenges:
The function $f$ is considered a black box (i.e., we have no knowledge of its shape), which can be complex and difficult to optimize as we do not have access to its gradients.
Evaluations of $f$ are often expensive in terms of time and compute (e.g., training a deep neural network on a large dataset), so it is important to identify a good hyperparameter configuration with the smallest possible number of queries of $f$.
For complex models, the HP configuration space $\mathcal{X}$ can have diverse types of attributes, some of which may be integer or categorical. For numerical attributes, search ranges need to be determined. Some attributes in $\mathcal{X}$ can even be conditional (e.g., the width of the $l$-th layer of a neural network is only relevant if the model has at least $l$ layers).
Even if $f(\mathbf{x})$ varies smoothly in $\mathbf{x}$, point evaluations of $f$ are typically noisy.
In this paper, we present Amazon SageMaker Automatic Model Tuning (AMT), a fully managed system for black-box function optimization at scale. (Amazon SageMaker is a service that allows easy training and hosting of machine learning models; for details, see https://aws.amazon.com/sagemaker and Liberty et al., 2020.) The key contributions of our work are as follows:
Design, architecture and implementation of hyperparameter optimization as a distributed, fault-tolerant, scalable, secure and fully managed service, integrated with Amazon SageMaker (§3).
Description of the Bayesian Optimization algorithm powering AMT, including efficient hyperparameter representation, surrogate Gaussian process model, acquisition functions, and parallel and asynchronous evaluations (§4).
Overview of advanced features such as log scaling, automated early stopping and warm start (§5).
Discussion of deployment results as well as challenges encountered and lessons learned (§6).
Traditionally, HPs are either hand-tuned by experts in what amounts to a laborious process, or they are selected using brute force schemes such as random search or grid search. In response to increasing model complexity, a range of more sample-efficient HP optimization techniques have emerged. A comprehensive review of modern hyperparameter optimization is provided in Feurer & Hutter (2019). Here, we will focus on work relevant in the context of AMT.
2.1 Model-Free Hyperparameter Optimization
Any HPO method proposes evaluation points $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T$, such that
$$\min_{t = 1, \dots, T} f(\mathbf{x}_t) \approx \min_{\mathbf{x} \in \mathcal{X}} f(\mathbf{x}).$$
For the simplest methods, the choice of $\mathbf{x}_t$ does not depend on earlier observations. In grid search, we fix $K$ values for every HP $x_j$, then evaluate $f$ on the Cartesian product, so that $T = K^d$. In random search, we draw $\mathbf{x}_t$ independently and uniformly at random. More specifically, each $x_{t,j}$ is drawn uniformly from $\mathcal{X}_j$. For numerical HPs, the distribution may also be uniform in the log domain. Random search is typically more efficient than grid search, in particular if some of the HPs are more relevant than others (Bergstra & Bengio, 2012). Both methods are easily parallelizable. Random search should always be considered as a baseline, and is frequently used to initialize more sophisticated HPO methods.
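The contrast between the two baselines can be sketched in a few lines of Python. This is only an illustration: the search space, the HP names and the ranges are hypothetical and do not come from AMT. It shows that, for the same budget, grid search tries only $K$ distinct values per HP, whereas random search typically tries a new value of every HP at each evaluation.

```python
import itertools
import math
import random

# Hypothetical search space (names and ranges are illustrative assumptions).
SPACE = {
    "learning_rate": (1e-4, 1e-1),  # numerical, searched in the log domain
    "weight_decay": (1e-6, 1e-2),   # numerical, searched in the log domain
}


def grid_search_points(space, k):
    """Cartesian product of k log-spaced values per HP, i.e. k**d points."""
    axes = []
    for low, high in space.values():
        step = (math.log(high) - math.log(low)) / (k - 1)
        axes.append([math.exp(math.log(low) + i * step) for i in range(k)])
    return [dict(zip(space, combo)) for combo in itertools.product(*axes)]


def random_search_points(space, n, seed=0):
    """Draw n configurations independently, uniformly in the log domain."""
    rng = random.Random(seed)
    return [
        {name: math.exp(rng.uniform(math.log(lo), math.log(hi)))
         for name, (lo, hi) in space.items()}
        for _ in range(n)
    ]


grid = grid_search_points(SPACE, k=3)    # 3**2 = 9 evaluations
rand = random_search_points(SPACE, n=9)  # same budget of 9 evaluations
```

With this budget, `grid` contains only three distinct learning rates, which is the reason random search tends to do better when a few HPs matter much more than the others.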
These simple baselines can be improved upon by making use of earlier observations in order to plan subsequent evaluations. Population-based methods, such as evolutionary or genetic algorithms, are attractive if parallel computing resources are available (Hansen & Ostermeier, 2001; Real et al., 2020). In each generation, all configurations from a fixed-size population are evaluated. Then, a new population is created by randomly mutating a certain percentile of the top performing configurations, as well as sampling from a background distribution, and the process repeats with the next generation. Evolutionary algorithms (EAs) are powerful and flexible search strategies, which can work even in search spaces of complex structure. However, EAs can be difficult to configure to the problem at hand, and since fine-grained knowledge about $f$ can be encoded only in a large population, they tend to require a substantial amount of parallel compute resources. In some EAs, low-fidelity approximations of $f$ are employed during early generations (Jaderberg et al., 2017) in order to reduce computational cost. Multi-fidelity strategies are discussed in more generality below.
2.2 Surrogate Models. Bayesian Optimization
A key idea to improve data efficiency of sampling is to maintain a surrogate model of the function $f$. At decision step $t$, this model is fit to the previous data $\mathcal{D}_{<t}$, i.e., the evaluations collected before step $t$. In Bayesian optimization (BO), the value of the next $\mathbf{x}_t$ could come from exploration (sampling where $f$ is most uncertain) or from exploitation (minimizing our best estimate), and a calibrated probabilistic model can be used to resolve the trade-off between these desiderata optimally (Srinivas et al., 2012). More concretely, we choose $\mathbf{x}_t$ as the best point according to an acquisition function $\mathcal{A}(\mathbf{x} \mid \mathcal{D}_{<t})$, which is a utility function averaged over the posterior predictive distribution.
The most commonly used surrogate model for BO is based on Gaussian processes (GPs) Rasmussen & Williams (2006), which not only have simple closed-form expressions for posterior and predictive distributions, but also come with strong supporting theory. Other BO surrogate models include random forests Hutter et al. (2011a) and Bayesian neural networks Springenberg et al. (2016). We provide a detailed account of sequential Bayesian optimization with a GP surrogate model in Section 4, see also Snoek et al. (2012). Tutorials on Bayesian optimization are provided in Brochu et al. (2010); Shahriari et al. (2016).
2.3 Early Stopping. Multi-Fidelity Optimization
Modern deep learning architectures can come with hundreds of millions of parameters, and even a single training run can take many hours or even days. In such settings, it is common practice to cheaply probe configurations $\mathbf{x}$ by training for few epochs or on a subset of the data. While this gives rise to a low-fidelity approximation of $f(\mathbf{x})$, such data can be sufficient to filter out poor configurations early on, so that full training is run only for the most promising ones. To this end, we consider functions $f(\mathbf{x}, r)$, $r$ being a resource attribute, where $f(\mathbf{x}, r_{\max}) = f(\mathbf{x})$ is the expensive metric of interest, while $f(\mathbf{x}, r)$, $r < r_{\max}$, are cheaper-to-evaluate approximations. Here, $r$ could be the number of training epochs for a DNN, the number of trees for gradient boosting, or the dataset subsampling ratio.
When training a DNN model, we can evaluate $f(\mathbf{x}, r)$, $r = 1, 2, \dots$, by computing the validation metric after every epoch. For gradient boosting codes supporting incremental training, we can obtain intermediate values as well. In early stopping HPO, the evaluation of $\mathbf{x}$ is terminated at a level $r$ if the probability of it scoring worse at $r_{\max}$ than some earlier $\mathbf{x}'$ is predicted high enough. The median rule is a simple instance of this idea (Golovin et al., 2017), while other techniques aim to extrapolate learning curves beyond the current $r$ (Domhan et al., 2014; Klein et al., 2017). Early stopping is particularly well suited to asynchronous parallel execution: whenever a job is stopped, an evaluation can be started for the next configuration proposed by HPO. An alternative to stopping configurations is to pause and (potentially) resume them later (Swersky et al., 2014).
Besides early stopping HPO, successive halving (SH) (Karnin et al., 2013; Jamieson & Talwalkar, 2015) is another basic instance of a multi-fidelity technique. Let $r_{\min}$ and $r_{\max}$ denote the smallest and largest resource levels. In each round, $f(\mathbf{x}, r_{\min})$ is evaluated for a number of configurations $\mathbf{x}$ sampled at random. Next, $f(\mathbf{x}, 2 r_{\min})$ is run for the top half of configurations, while the bottom half are discarded. This filtering step is repeated until $r_{\max}$ is reached. SH is simple to implement and can be a highly efficient baseline, in particular compared to full-fidelity HPO. However, it can be hard to select $r_{\min}$ in practice. This issue is remedied in Hyperband (Li et al., 2016), which adds an outer iteration over different values of $r_{\min}$. One drawback of SH and Hyperband is their synchronous nature: all configurations have to be evaluated at a certain level before any of them can be promoted to the next level. Also, most processors spend idle time at every synchronization point, waiting for the slowest job to finish. ASHA (Li et al., 2019) extends SH and Hyperband to asynchronous evaluations, with strong reductions in wall-clock time.
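The synchronous successive halving loop described above can be sketched as follows. This is a minimal illustration and not the scheduler used in any particular library: the toy objective and the function names are assumptions, and the resource level is doubled in every round.

```python
import random


def successive_halving(configs, evaluate, r_min, r_max):
    """Synchronous successive halving for a metric to be minimized.

    configs  : list of candidate configurations
    evaluate : callable (config, r) -> metric value at resource level r
    """
    survivors = list(configs)
    r = r_min
    while True:
        # Synchronization point: every survivor is evaluated at level r.
        scored = sorted(survivors, key=lambda c: evaluate(c, r))
        if r >= r_max or len(scored) == 1:
            return scored[0]
        survivors = scored[: max(1, len(scored) // 2)]  # keep the top half
        r = min(2 * r, r_max)                           # double the resource


# Toy multi-fidelity objective (an assumption for illustration): the metric
# improves with more resource r and is best for configurations near 0.3.
def toy_metric(config, r):
    return (config - 0.3) ** 2 + 1.0 / r


rng = random.Random(0)
candidates = [rng.random() for _ in range(16)]
best = successive_halving(candidates, toy_metric, r_min=1, r_max=8)
```

Because the toy metric ranks configurations identically at every resource level, the low-fidelity rounds never discard the eventual winner; with real learning curves this is not guaranteed, which is why the choice of the smallest resource level matters.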
All these methods make use of SH scheduling of evaluation jobs, yet new configurations are chosen at random. BOHB (Falkner et al., 2018) combines synchronous Hyperband with model-based HPO, which tends to significantly outperform the random sampling variants. A more recent combination of ASHA with Bayesian optimization is asynchronous BOHB (Klein et al., 2020). This method tends to be far more efficient than its synchronous counterparts, and can often save up to half of the resources compared to ASHA.
3 The System
In this section we give an overview of the system underpinning SageMaker AMT. We lay out design principles, describe the system architecture, and highlight a number of challenges related to running HPO in the cloud.
3.1 Design Principles
We present the key requirements underlying the design of SageMaker AMT.
Easy to use and fully managed: To ensure broad adoption by data scientists and ML developers with varying levels of ML knowledge, AMT should be easy to use, with minimal effort needed for a new user to set up and execute a tuning job. Further, we would like AMT to be offered as a fully managed service, with a stable API and default configuration settings, so that the implementation complexity is abstracted away from the customer. AMT spares users the pain of provisioning hardware, installing the right software, and downloading the data. It takes care of uploading the models to the users’ accounts and providing them with training performance metrics.
Tightly integrated with other SageMaker components: Considering that model tuning (HPO) is typically performed as a part of the ML pipeline involving several other components, AMT should seamlessly operate with other SageMaker components and APIs.
Scalable: AMT should scale with respect to different data sizes, ML training algorithms, number of HPs, metrics, HPO methods and hardware configurations. Scalability includes a failure-resistant workflow with built-in retry mechanisms to guarantee robustness.
Cost-effective: AMT should be cost-effective to the customer, in terms of both compute and human costs. We would like to enable the customer to specify a budget and support cost reduction techniques such as early stopping and warm start.
3.2 System Architecture
SageMaker AMT provides a distributed, fault-tolerant, scalable, secure and fully managed service for hyperparameter optimization. In this section we examine the building blocks of the service. One of the key building blocks is the AWS SageMaker Training platform, which is used to execute the training jobs and obtain the value of the objective metric for any candidate hyperparameter values chosen by the Hyperparameter Selection Service. In this way, each candidate set of hyperparameters tried is associated with a corresponding SageMaker Training Job in the customer’s AWS account.
SageMaker AMT’s backend is built using a fully serverless architecture by means of a number of AWS building blocks. It uses AWS API Gateway, AWS Lambda, AWS DynamoDB, AWS Step Functions and AWS CloudWatch Events in its backend workflow. AWS API Gateway and AWS Lambda are used to power the API layer, which customers use to call the various SageMaker AMT APIs, such as Create/List/Describe/StopHyperParameterTuningJob(s). AWS DynamoDB is used as the persistent store to keep all the metadata associated with the job and also to track the current state of the job. The overall system’s architecture is depicted in Figure 1. Immense care has been taken to ensure that no customer data is stored in this DynamoDB table. All customer data is handled by the SageMaker Training platform, and SageMaker AMT deals only with the metadata for the jobs. AWS CloudWatch Events, AWS Step Functions and AWS Lambda are used in the AMT workflow engine, which is responsible for kicking off the evaluation of hyperparameter configurations from the Hyperparameter Selection Service, starting training jobs, tracking their progress and repeating the process until the stopping criterion is met.
3.3 Challenges for HPO in the cloud
While running at scale poses a number of challenges, AMT is a highly distributed and fault-tolerant system. System resiliency was one of the guiding principles when building AMT. Example failure scenarios are when the BO engine suggests hyperparameters that cause a training job to run out of memory, or when individual training jobs fail due to dependency issues. The AMT workflow engine is designed to be resilient against such failures, and has a built-in retry mechanism to guarantee robustness.
AMT runs every evaluation as a separate training job on the SageMaker Training platform. Each training job provides customers with a usable model, logs and metrics persisted in CloudWatch. A training job involves setting up a new cluster of EC2 instances, waiting for the setup to complete, and downloading algorithm images. This introduced an overhead that was pronounced for smaller datasets. To address this, AMT puts in place compute provisioning optimizations to reduce the time of setting up clusters and getting them ready to run the training.
4 The Algorithm
Next, we describe the main components of the BO algorithm powering SageMaker AMT. We start with the representation of the hyperparameters, followed by details about the surrogate model and acquisition function. We also describe how AMT handles parallel and asynchronous evaluations.
4.1 Input configuration
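To tune the HPs of a machine learning model, AMT needs an input description of the space over which the optimization is performed. We denote by $\mathcal{H}$ the set of HPs we are working with. For each $h \in \mathcal{H}$, we also define the domain $\mathcal{X}_h$ of values that $h$ can possibly take, leading to the global domain of HPs $\mathcal{X} = \times_{h \in \mathcal{H}} \mathcal{X}_h$.
Each $h$ has data type continuous (real-valued), integer, or categorical, where the numerical types come with lower and upper bounds $l_h$ and $u_h$. Since BO surrogate functions are defined over real-valued inputs, we need to encode a HP configuration $\mathbf{x} \in \mathcal{X}$ into a vector $\mathbf{e}(\mathbf{x})$:
If $h$ is continuous or integer: $e_h(x_h) = (x_h - l_h) / (u_h - l_h)$. Conversely, decoding for an integer HP works by rounding $l_h + e_h(x_h)\,(u_h - l_h)$ to the nearest integer.
If $h$ is categorical: $\mathbf{e}_h(x_h)$ is a one-hot encoding vector, where $[\mathbf{e}_h(x_h)]_k = 1$ if $x_h$ is the $k$-th category in $\mathcal{X}_h$, and $0$ otherwise.
Finally, $\mathbf{e}(\mathbf{x})$ is the concatenation of the $\mathbf{e}_h(x_h)$, $h \in \mathcal{H}$. The presence of categorical HPs, in particular those of high cardinality, implies a large dimension of the encoded vector $\mathbf{e}(\mathbf{x})$.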
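The encoding and decoding steps can be illustrated with a short Python sketch. It assumes min-max scaling of numerical HPs to the unit interval and one-hot encoding of categorical HPs; the search space, the HP names and the dictionary layout are hypothetical and are not AMT's internal representation.

```python
# Hypothetical search space description (names and bounds are assumptions).
SPACE = [
    {"name": "learning_rate", "type": "continuous", "min": 0.0, "max": 1.0},
    {"name": "num_layers", "type": "integer", "min": 1, "max": 9},
    {"name": "activation", "type": "categorical",
     "values": ["relu", "tanh", "sigmoid"]},
]


def encode(config, space):
    """Map a configuration to a vector with all entries in [0, 1]."""
    vec = []
    for hp in space:
        value = config[hp["name"]]
        if hp["type"] == "categorical":
            # One-hot block: one entry per category.
            vec.extend(1.0 if value == c else 0.0 for c in hp["values"])
        else:
            # Min-max scaling for continuous and integer HPs.
            vec.append((value - hp["min"]) / (hp["max"] - hp["min"]))
    return vec


def decode(vec, space):
    """Invert encode(); integer HPs are rounded to the nearest integer."""
    config, i = {}, 0
    for hp in space:
        if hp["type"] == "categorical":
            block = vec[i:i + len(hp["values"])]
            config[hp["name"]] = hp["values"][block.index(max(block))]
            i += len(hp["values"])
        else:
            value = hp["min"] + vec[i] * (hp["max"] - hp["min"])
            # Note: Python's round() rounds exact halves to the even integer.
            config[hp["name"]] = round(value) if hp["type"] == "integer" else value
            i += 1
    return config


config = {"learning_rate": 0.25, "num_layers": 3, "activation": "tanh"}
encoded = encode(config, SPACE)    # 2 numerical entries + 3 one-hot entries
decoded = decode(encoded, SPACE)
```

Three HPs turn into a five-dimensional vector here; a categorical HP with many categories inflates the encoded dimension accordingly.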
Our encoded inputs live in a hypercube $[0, 1]^{d'}$, where $d'$ is the dimension of the encoded vector. Following previous work such as Snoek et al. (2012), we consider a Sobol sequence generator (Sobol, 1967) designed to populate our $d'$-dimensional hypercube as densely as possible. The resulting quasi-random grid constitutes a set of anchor points from which the acquisition function optimization is initialized (see Section 4.3). Moreover, it is a common practice among ML practitioners to exploit some prior knowledge to adequately transform the space of some HP, for instance, applying a log-transform for a regularisation parameter. This point is important as the final performance hinges on this preprocessing. There has been a large body of work dedicated to the automatic determination of such input transformations, e.g., see Assael et al. (2014) and references therein. We elaborate on log scaling in Section 5.1.
4.2 Gaussian process modelling
Once the input HPs are encoded, AMT builds a model mapping hyperparameter configurations to their predicted performance. We follow a form of GP-based global optimisation (Jones et al., 1998; Lizotte, 2008; Osborne et al., 2009; Brochu et al., 2010), assuming that the black-box function $f$ to minimize is drawn from a GP for which a mean function $\mu$ and a covariance function $\mathcal{K}$ need to be defined. Since observations collected from $f$ are normalized to mean zero, we can consider a zero-mean function without loss of generality. The choice of the covariance function $\mathcal{K}_{\boldsymbol{\theta}}$, which depends on some Gaussian process hyperparameters (GPHPs) $\boldsymbol{\theta}$, will be discussed in detail in the next paragraph.
More formally, and given an encoded configuration $\mathbf{x}$, our probabilistic model reads:
$$f(\mathbf{x}) \sim \mathcal{GP}\big(0, \mathcal{K}_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{x}')\big), \qquad y \mid f(\mathbf{x}) \sim \mathcal{N}\big(f(\mathbf{x}), \sigma_0^2\big),$$
where the observation $y$ is modelled as a Gaussian random variable with mean $f(\mathbf{x})$ and variance $\sigma_0^2$: a standard GP regression setup (Rasmussen & Williams, 2006, Chapter 2).
Many choices for the covariance function (or kernel) are admissible. The Matérn-$\frac{5}{2}$ kernel with automatic relevance determination (ARD) parametrisation (Rasmussen & Williams, 2006, Chapter 4) is advocated in Snoek et al. (2012); Swersky et al. (2013); Snoek et al. (2014), where it is shown that ARD does not lead to overfitting, provided that the GPHPs are properly handled. We follow this choice, which has become a de-facto standard in most BO packages.
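For concreteness, the standard closed form of a Matérn-5/2 kernel with ARD, assuming the variant advocated in Snoek et al. (2012), is $k(\mathbf{x}, \mathbf{x}') = \sigma^2 \big(1 + \sqrt{5}\, r + \tfrac{5}{3} r^2\big) \exp(-\sqrt{5}\, r)$ with $r^2 = \sum_i (x_i - x'_i)^2 / \ell_i^2$. The sketch below evaluates it in plain Python; the function and parameter names are illustrative and not taken from AMT.

```python
import math


def matern52_ard(x, x_prime, lengthscales, variance=1.0):
    """Matern-5/2 covariance with one lengthscale per input dimension (ARD)."""
    # Squared distance after rescaling every dimension by its lengthscale.
    r2 = sum(((a - b) / l) ** 2 for a, b, l in zip(x, x_prime, lengthscales))
    s = math.sqrt(5.0 * r2)
    # s * s / 3 equals (5 / 3) * r**2.
    return variance * (1.0 + s + s * s / 3.0) * math.exp(-s)


# A very large lengthscale makes the kernel almost ignore that dimension,
# which is how ARD expresses that an input is irrelevant.
k_relevant = matern52_ard([0.0, 0.0], [0.0, 1.0], lengthscales=[1.0, 1.0])
k_irrelevant = matern52_ard([0.0, 0.0], [0.0, 1.0], lengthscales=[1.0, 1e6])
```

Here `k_irrelevant` is close to the prior variance of 1, while `k_relevant` is clearly smaller, since the two points differ only along the second dimension.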
Our probabilistic model comes with some GPHPs that need to be dealt with. We highlight two possible options, both of which are implemented in AMT. A traditional way of determining the GPHPs consists in finding the $\hat{\boldsymbol{\theta}}$ that maximises the log marginal likelihood of our probabilistic model (Rasmussen & Williams, 2006, Section 5.4). While this approach, known as empirical Bayes, is efficient and often leads to good results, Snoek et al. (2012, and follow-up work) rather advocate the full Bayesian treatment of integrating out $\boldsymbol{\theta}$ by Markov chain Monte Carlo (MCMC). The latter approach is less likely to overfit in the few-observation regime (i.e., early in the BO procedure), but is also more costly, since GP computations have to be done for every MCMC sample. In AMT, we implement slice sampling, one of the most widely used techniques for GPHPs (Murray & Adams, 2010; Mackay, 2003). We observed that slice sampling is a better approach to learn the GPHPs compared to empirical Bayes. In our implementation, we use one chain with 300 samples, of which the first 250 are discarded as burn-in and the remainder is thinned by keeping every 5th sample, resulting in an effective sample size of 10. Moreover, we use a random (normalised) direction, as opposed to a coordinate-wise strategy, to go from our multivariate problem (the vector $\boldsymbol{\theta}$ of GPHPs) to the standard univariate formulation of slice sampling.
It is a common practice among ML practitioners to exploit some prior knowledge to adequately transform the space of some HP, such as applying a log-transform for a regularization parameter. We leverage the ideas developed in Snoek et al. (2014), where the configuration $\mathbf{x}$ is transformed entry-wise by applying for each dimension $k$:
$$\omega_k(x_k) = \mathrm{BetaCDF}(x_k; \alpha_k, \beta_k),$$
where $\alpha_k, \beta_k > 0$ are GPHPs that govern the shape of the transformation. We refer to $\omega(\mathbf{x})$ as the vector resulting from all entry-wise transformations. An alternative, which is the default choice in AMT, is to consider the CDF of the Kumaraswamy distribution, which is more tractable than the CDF of the Beta distribution. A convenient way of handling these additional GPHPs is to overload our definition of the covariance function so that for any two $\mathbf{x}, \mathbf{x}'$:
$$\mathcal{K}_{\boldsymbol{\theta}}(\mathbf{x}, \mathbf{x}') := \mathcal{K}_{\boldsymbol{\theta}}\big(\omega(\mathbf{x}), \omega(\mathbf{x}')\big),$$
where $\{\alpha_k, \beta_k\}_k$ are “merged” within the global vector $\boldsymbol{\theta}$ of GPHPs. Note that, as expected, the resulting covariance function is not stationary anymore.
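The Kumaraswamy CDF has the closed form $F(x; a, b) = 1 - (1 - x^a)^b$ for $x \in [0, 1]$ and $a, b > 0$, which is what makes it cheaper than the Beta CDF. The following sketch applies it entry-wise to an encoded configuration; the function names and the example shape parameters are assumptions for illustration, whereas in the model these parameters would be learned together with the other GPHPs.

```python
def kumaraswamy_cdf(x, a, b):
    """CDF of the Kumaraswamy distribution on [0, 1], for a, b > 0."""
    return 1.0 - (1.0 - x ** a) ** b


def warp(encoded, shape_params):
    """Apply one monotone warping per dimension of an encoded configuration."""
    return [kumaraswamy_cdf(x, a, b) for x, (a, b) in zip(encoded, shape_params)]


# a = b = 1 leaves a dimension unchanged; a < 1 with b = 1 stretches small
# values, similar in spirit to a log transform.
encoded = [0.25, 0.25]
warped = warp(encoded, shape_params=[(1.0, 1.0), (0.5, 1.0)])
```

The first entry stays at 0.25, while the second is mapped to 0.5, so differences between small values of that HP become more visible to the kernel.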
4.3 Acquisition functions
Denote evaluations done up to now by $\mathcal{D}$. In GP-based Bayesian optimization, an acquisition function is optimized in order to determine the hyperparameter configuration at which $f$ should be evaluated next. Most common acquisition functions depend on the GP posterior only via its marginal mean $\mu(\mathbf{x})$ and variance $\sigma^2(\mathbf{x})$.
There is an extensive literature dedicated to the design of acquisition functions. Expected improvement (EI) was introduced by Mockus et al. (1978) and is probably the most popular acquisition function (notably popularised by the EGO algorithm of Jones et al. (1998)). EI is the default choice for toolboxes like SMAC (Hutter et al., 2011a), Spearmint (Snoek et al., 2012), and AMT. It is defined as follows:
$$\mathrm{EI}(\mathbf{x}) = \mathbb{E}\big[\max\big(0, y^* - y(\mathbf{x})\big)\big],$$
where the expectation is taken with respect to the posterior distribution of $y(\mathbf{x})$ given the evaluations done so far, and $y^*$ is the best target value observed so far. In the case of EI, this expectation is taken with respect to a Gaussian marginal consistent with the Gaussian process posterior and, as such, has a simple closed-form expression.
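For a Gaussian marginal with mean $\mu$ and standard deviation $\sigma$, and for a minimization problem with incumbent $y^*$, the closed form is $\mathrm{EI} = \sigma\,[u\,\Phi(u) + \phi(u)]$ with $u = (y^* - \mu)/\sigma$, where $\Phi$ and $\phi$ are the standard normal CDF and density. A small sketch of this textbook formula follows; the function name is illustrative and the code is not AMT's implementation.

```python
import math


def expected_improvement(mu, sigma, y_best):
    """Closed-form EI for minimization under a Gaussian marginal N(mu, sigma^2)."""
    if sigma <= 0.0:
        # No uncertainty: the improvement is deterministic.
        return max(0.0, y_best - mu)
    u = (y_best - mu) / sigma
    pdf = math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    return sigma * (u * cdf + pdf)


# Same predicted mean as the incumbent: EI equals sigma / sqrt(2 * pi),
# so more predictive uncertainty means more expected improvement.
ei_certain = expected_improvement(mu=0.0, sigma=0.1, y_best=0.0)
ei_uncertain = expected_improvement(mu=0.0, sigma=1.0, y_best=0.0)
```

The example shows the exploration effect: both candidates are predicted to match the incumbent, yet the more uncertain one receives a higher EI.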
Another basic acquisition function is provided by Thompson sampling (Thompson, 1933; Hoffman et al., 2014). It is optimized by drawing a realization of the GP posterior and searching for its minimum point. Exact Thompson sampling requires a sample from the joint posterior process, which is intractable and has to be approximated. In AMT, a crude but efficient approximation is used, whereby marginal variables are sampled from the Gaussian posterior marginals $\mathcal{N}(\mu(\mathbf{x}), \sigma^2(\mathbf{x}))$ at locations $\mathbf{x}$ from a dense set. Other interesting families of acquisition functions include those based on information-theoretic criteria (Villemonteix et al., 2009; Hennig & Schuler, 2012; Hernández-Lobato et al., 2014; Shahriari et al., 2014; Wang & Jegelka, 2017), or built upon upper-confidence bound ideas (Srinivas et al., 2010).
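The marginal approximation to Thompson sampling can be sketched as follows: one independent Gaussian draw per location of a dense set, followed by an argmin. The `posterior` callable standing in for the GP posterior marginals, the toy posterior and the grid of locations are assumptions made for illustration.

```python
import random


def thompson_candidate(locations, posterior, rng):
    """Approximate Thompson sampling with independent marginal draws.

    locations : dense set of encoded configurations
    posterior : callable x -> (mean, std) of the posterior marginal at x
    """
    best_x, best_draw = None, float("inf")
    for x in locations:
        mean, std = posterior(x)
        draw = rng.gauss(mean, std)  # ignores correlations between locations
        if draw < best_draw:
            best_x, best_draw = x, draw
    return best_x


# Toy posterior (an assumption): mean is lowest at 0.5, uncertainty is uniform.
def toy_posterior(x):
    return (x - 0.5) ** 2, 0.05


grid = [i / 10 for i in range(11)]
candidate = thompson_candidate(grid, toy_posterior, random.Random(0))
```

Since the draws are independent across locations, this ignores the posterior correlations that an exact joint sample would respect, which is the sense in which the approximation is crude.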
The BO procedure presented so far is purely sequential, but AMT users can specify whether the black-box function should be evaluated in parallel for different candidates.
There are different ways of taking advantage of a distributed/parallel computational environment. If we suppose we have access to $L$ threads/machines, we can simply return the top-$L$ candidates as ranked by the acquisition function. We proceed to the next step only when the $L$ parallel candidate evaluations are completed. This strategy works well as long as the candidates are diverse enough and their evaluation times are not too heterogeneous, which is unfortunately rarely the case. For these reasons, AMT optimizes the acquisition function so as to induce diversity and adopts an asynchronous strategy. As soon as one of the evaluations is done, we update the GP with this new observation and pick the next candidate to fill the available computation slot (making sure, of course, not to select one of the pending candidates). One disadvantage is that this does not take into account the information coming from the fact that we picked the pending candidates. To tackle this, asynchronous processing could also be based on fantasizing (Snoek et al., 2012; Klein et al., 2020).
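The slot-filling step of the asynchronous strategy can be sketched as a selection that skips pending candidates. The toy acquisition function and the candidate set are assumptions for illustration; in a real system the acquisition values would come from the refitted GP.

```python
def next_candidate(candidates, acquisition, pending):
    """Return the best-scoring candidate that is not currently being evaluated.

    candidates  : iterable of hashable configurations (e.g. tuples)
    acquisition : callable config -> score, higher is better
    pending     : set of configurations whose evaluation is still running
    """
    free = [c for c in candidates if c not in pending]
    return max(free, key=acquisition) if free else None


# Toy acquisition (an assumption): prefers configurations close to 0.5.
def toy_acquisition(config):
    return -abs(config[0] - 0.5)


candidates = [(0.1,), (0.4,), (0.7,)]
pending = {(0.4,)}                       # already running in another slot
choice = next_candidate(candidates, toy_acquisition, pending)
pending.add(choice)                      # the freed slot is filled again
```

The best-scoring point `(0.4,)` is already running, so the free slot receives `(0.7,)`. Note that the pending point still contributes no information to the model here, which is the limitation that fantasizing addresses.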
5 Advanced features
Beyond standard BO, AMT comes with a number of additional features aiming at speeding up tuning jobs, saving computational time, and potentially finding hyperparameter configurations that achieve better performance. We start by describing log scaling, followed by early stopping and warm starting.
5.1 Log Scaling
A property of learning problems often observed in practice is that a linear change in validation performance metrics requires an exponential increase in the learning capacity of the estimator (as defined by the VC dimension for example Vapnik (2013)). To illustrate this, we show the relationship between the capacity parameter of SVM and validation accuracy in Figure 2.
Because of this phenomenon, often a wide search range is chosen for hyperparameters which control the capacity of the model. For instance, a typical choice for the capacity parameter of support vector machine is . Note that 99% of the volume of this example search space corresponds to values of hyperparameter in . As a result, smaller values of might be under-explored by the BO procedure if applied directly to such range, as the BO procedure attempts to evenly explore the volume of the search space. To avoid under-exploring smaller values, a log transformation is applied to such model capacity related variables. Such transformation is generally referred to as “log scaling”.
Search within the transformed search space can be done automatically by AMT, provided that the user indicates via the API configuration that such a transformation is appropriate for a given hyperparameter. For all the algorithms provided as part of the SageMaker platform, recommendations are given on the hyperparameters for which log scaling is appropriate and accelerates the convergence to well-performing configurations.
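The effect of log scaling on how search volume is distributed can be checked numerically. The range below is a hypothetical example chosen for illustration and is not the range discussed in the text; the helper names are likewise assumptions.

```python
import math
import random

LOW, HIGH = 1e-3, 1e3  # hypothetical search range spanning six decades


def volume_fraction(a, b, log_scale):
    """Fraction of the search volume covered by [a, b] within [LOW, HIGH]."""
    t = math.log10 if log_scale else (lambda v: v)
    return (t(b) - t(a)) / (t(HIGH) - t(LOW))


def sample(rng, log_scale):
    """Draw one value uniformly in the linear or in the log10 domain."""
    if log_scale:
        return 10 ** rng.uniform(math.log10(LOW), math.log10(HIGH))
    return rng.uniform(LOW, HIGH)


# Share of the volume occupied by the top two decades, [10, 1000].
linear_share = volume_fraction(10.0, HIGH, log_scale=False)  # about 0.99
log_share = volume_fraction(10.0, HIGH, log_scale=True)      # one third

# Share of random draws that land below 1.0 under each scaling.
rng = random.Random(0)
n = 2000
small_linear = sum(sample(rng, False) < 1.0 for _ in range(n)) / n
small_log = sum(sample(rng, True) < 1.0 for _ in range(n)) / n
```

In the linear domain almost no draws fall below 1.0, whereas in the log domain about half of them do, which is why a method that spreads its evaluations evenly over the volume benefits from searching in the transformed space.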
5.2 Early Stopping
As explained earlier, the tuning process involves evaluating several hyperparameter configurations, which can come with considerable computational cost for complex problems. Regardless of the underlying HPO method, it can happen that the proposed hyperparameter configuration does not improve over the previously observed ones, i.e., that $f(\mathbf{x}_t) \geq f(\mathbf{x}_{t'})$ for one or more $t' < t$.
With early stopping enabled, AMT utilizes the information of the previously evaluated hyperparameter configurations to predict whether a specific candidate is promising. If not, it stops the evaluation before it is completed, thus reducing the overall time required by the tuning job. Note that here we assume we can obtain intermediate values $f(\mathbf{x}, r)$, where $f(\mathbf{x}, r)$ represents the value of the objective function for the configuration $\mathbf{x}$ at training iteration $r$.
To implement early stopping, AMT employs the simple but effective median rule (Golovin et al., 2017) to determine which HP configurations to stop early. If the value $f(\mathbf{x}, r)$ is worse than the median of the previously evaluated configurations at the same iteration $r$, then we stop the training. While it is possible to devise more complex models for predicting the performance, the median rule has proven to be effective in practice at reducing the HPO duration without impacting the optimization quality.
One concern with the median rule is that some ML algorithms, such as neural networks, are initialized with random weights and this can significantly impact the performance metrics observed in the initial iterations. These lower-fidelities are not necessarily representative of the final values; as the training proceeds, a seemingly poor HP configuration can eventually improve enough to be the best one. To ensure that the early stopping algorithm is resilient to such cases, we only make stopping decisions after a certain number of training iterations has been evaluated. As the total number of iterations can vary for different algorithms and use-cases, this threshold is determined dynamically based on the duration of the fully completed hyperparameter evaluations.
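The median rule with a grace period can be sketched as follows, for a metric to be minimized. The source only states that the threshold is determined dynamically from completed evaluations, so passing it in as a fixed `min_iterations` argument is a simplifying assumption, as are the function name and the example curves.

```python
from statistics import median


def should_stop(current_curve, completed_curves, min_iterations):
    """Median rule for minimization, with a grace period.

    current_curve    : metric values of the running job, one per iteration
    completed_curves : metric curves of previously evaluated configurations
    min_iterations   : no stopping decision before this many iterations
    """
    r = len(current_curve)  # iterations observed so far
    if r < min_iterations:
        return False        # too early: initial values may be unrepresentative
    # Values of earlier configurations at the same iteration r.
    peers = [curve[r - 1] for curve in completed_curves if len(curve) >= r]
    if not peers:
        return False
    return current_curve[-1] > median(peers)


completed = [[1.0, 0.8, 0.6], [0.9, 0.7, 0.5], [1.2, 1.0, 0.9]]
stop_poor = should_stop([1.1, 0.95], completed, min_iterations=2)
stop_good = should_stop([0.9, 0.75], completed, min_iterations=2)
stop_early = should_stop([1.1, 0.95], completed, min_iterations=3)
```

At iteration 2 the median of the earlier jobs is 0.8, so the job at 0.95 is stopped and the job at 0.75 continues; with a grace period of three iterations, no decision is taken yet for either.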
5.3 Warm start
When running several related tuning jobs, it is typical to build on previous experiments. For example, one may want to gradually increase the number of iterations, change hyperparameter ranges, change which hyperparameters to tune, or even tune the same model on a new dataset. In all these cases, it is desirable to re-use information from previous tuning jobs rather than starting from scratch. Since related tuning tasks are intuitively expected to be similar, this setting lends itself to some form of transfer learning.
With warm start, AMT uses the results of previous tuning jobs to inform which combinations of hyperparameters to search over in the new tuning job. Using information from previous hyperparameter tuning jobs can help increase the performance of the new hyperparameter tuning job by making the search for the best combination of hyperparameters more efficient.
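As an illustration of how these features are exposed to users, the sketch below assembles a request for the `CreateHyperParameterTuningJob` API, which can be sent through the AWS SDK for Python via `create_hyper_parameter_tuning_job`. The field names follow the public API as we understand it and should be verified against the current API reference; all job names, the metric name and the parameter range are hypothetical placeholders, and the training job definition is deliberately left out.

```python
# All concrete names and values below are hypothetical placeholders.
warm_start_config = {
    "ParentHyperParameterTuningJobs": [
        {"HyperParameterTuningJobName": "my-first-tuning-job"},
    ],
    # "IdenticalDataAndAlgorithm" for the same data and algorithm,
    # "TransferLearning" when the data or algorithm has changed.
    "WarmStartType": "TransferLearning",
}

tuning_job_config = {
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
        "Type": "Maximize",
        "MetricName": "validation:accuracy",
    },
    "ResourceLimits": {
        "MaxNumberOfTrainingJobs": 10,
        "MaxParallelTrainingJobs": 2,
    },
    "ParameterRanges": {
        "ContinuousParameterRanges": [
            {
                "Name": "learning_rate",
                "MinValue": "0.0001",
                "MaxValue": "0.1",
                "ScalingType": "Logarithmic",  # log scaling
            },
        ],
    },
    "TrainingJobEarlyStoppingType": "Auto",    # automated early stopping
}

request = {
    "HyperParameterTuningJobName": "my-second-tuning-job",
    "HyperParameterTuningJobConfig": tuning_job_config,
    "WarmStartConfig": warm_start_config,
    # A "TrainingJobDefinition" (algorithm image, data channels, IAM role,
    # instance configuration) is also required before submitting the request.
}
```

The request above combines warm start with log scaling and early stopping; the parent job named in `WarmStartConfig` must already exist in the same account for the call to succeed.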
Speeding up HP tuning with transfer learning is an active line of research. Most of this activity has focused on transferring HP knowledge across different datasets Bardenet et al. (2013); Swersky et al. (2013); Yogatama & Mann (2014); Feurer et al. (2015); Poloczek et al. (2016); Springenberg et al. (2016); Golovin et al. (2017); Fusi et al. (2018); Perrone et al. (2017, 2018, 2019b); Salinas et al. (2020). However, most of this previous work assumes the availability of some meta-data, describing in which sense datasets differ from each other Feurer et al. (2015). This assumption is prohibitively restrictive in practice, since computing meta-data in real-world predictive systems is challenging due to the computational overhead or privacy reasons.
6 Deployment Results
We now demonstrate in experiments the benefits of the BO strategy implemented in AMT over random search. We then turn to empirically showing the advantages of AMT’s advanced features we described in the previous section.
6.1 BO vs random search
Consider the task of tuning two regularization hyperparameters, alpha and lambda, for the XGBoost algorithm on the direct marketing dataset from UCI. We demonstrate the effectiveness of BO at finding well-performing hyperparameter configurations by comparing it to random search.
[Figure 3 caption, partially recovered: each experiment was replicated 50 times, and two standard errors around the average of these replications are plotted.]
Figure 3 compares the performance of random search and BO, as implemented in AMT. From the left and middle plots, we can clearly see that BO suggested more well-performing hyperparameter configurations than random search. From the right plot of Figure 3, it is clear that BO consistently outperforms random search across all numbers of hyperparameter evaluations.
A notebook to run this example on AWS SageMaker can be found at https://github.com/awslabs/amazon-sagemaker-examples/tree/master/hyperparameter_tuning/xgboost_random_log.
6.2 Log scaling
Beyond comparing random search and BO, Figure 3 also illustrates the benefits of log scaling. It can be seen that the BO procedure focuses on a small region of search space values, yet still attempts to explore the rest of the search space. This suggests that the good hyperparameter configurations are found with low values of alpha. When log scaling is activated, BO is steered towards these values, which allows it to focus on the most relevant region of the hyperparameter space.
Note that such exploration can be much more expensive as models with larger learning capacity require more compute resources to learn. An example is a larger number of trees in XGBoost. Hence, not only does log scaling improve the speed at which HPO finds well-performing hyperparameter configurations, but it also reduces the exploration of costly configurations.
6.3 Early stopping
One of the main concerns with early stopping is that it could negatively impact the final objective values. Furthermore, the effect of early stopping is most noticeable on longer training jobs. We consider Amazon SageMaker’s built-in linear learner algorithm on the Gdelt (Global Database of Events, Language and Tone) dataset (http://www.gdeltproject.org/) in both single instance and distributed training mode. Specifically, we used the Gdelt dataset from multiple years in distributed mode and from a single year in single instance mode. Figure 4 compares a hyperparameter tuning job with and without early stopping. Each experiment was replicated 10 times, and the median of the best model score so far (in terms of absolute loss, lower is better) is shown on the $y$-axis, while the $x$-axis represents time. Each tuning job was launched with a budget of 100 hyperparameter configurations to explore. The results show that AMT with early stopping not only explores the same number of HP configurations in less time, but can also yield hyperparameter configurations with the same quality in terms of objective metric.
6.4 Warm start
It is very common to update predictive models at regular intervals, typically with a different choice of the hyperparameter space, iteration count or dataset. This updating process implies having to regularly re-tune the hyperparameters of those models. AMT’s warm start offers a simple and computationally efficient approach to learn from previous tuning tasks.
To demonstrate its benefits, we consider the problem of building an image classifier and iteratively tuning it by running multiple hyperparameter tuning jobs. We focus on two simple use cases: running two sequential hyperparameter tuning jobs on the same algorithm and dataset, and launching a new tuning job on the same algorithm on an augmented dataset. We train Amazon SageMaker’s built-in image classification algorithm on the Caltech-256 dataset, and tune its hyperparameters to maximize validation accuracy.
Figure 5 shows the impact of warm starting from previous tuning jobs. Initially, there is a single tuning job (top). Once this is complete, we launch a new tuning job (center) selecting the previous job as the parent job. The plot shows that the new tuning job (red dots) quickly detected good hyperparameter configurations thanks to the prior knowledge from the parent tuning job (black dots). As the optimization progresses, the validation accuracy continues improving and reaches 0.47, clearly improving over 0.33, the best previous metric found by running the first tuning job from scratch.
Lastly, we demonstrate how warm start can be applied to transfer information of good hyperparameter configurations across different datasets. We apply a range of data augmentations in addition to crop and color transformations, including random transformations (i.e., image rotation, shear, and aspect ratio variations). As a result of these data augmentations, we have a new dataset that is related to the previous one. To create the last hyperparameter tuning job, we warm start from both previous tuning jobs and run BO for 10 more iterations. Figure 5 (bottom) shows how the validation accuracy for the new tuning job (blue dots) changed over time compared to the parent tuning jobs (red and black dots). Once the tuning job is complete, the objective metric has improved again and reached 0.52.
These simple examples show that warm start helps explore the search space iteratively without losing the information coming from previous iterations. In addition, they demonstrate how warm start achieves transfer learning even if the dataset has been changed, giving rise to a new but related hyperparameter tuning task.
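The data flow behind warm starting can be sketched as follows. This is a toy illustration under stated assumptions: the BO surrogate is replaced by a simple search that perturbs the best configuration seen so far, the one-dimensional objective `accuracy` is synthetic, and all function and parameter names are hypothetical. The only point is that a child tuning job begins with the evaluations of its parent jobs already in its history, so its first suggestions are informed rather than random.

```python
import random


def suggest(history, bounds, rng, explore_prob=0.3, step=0.1):
    """Propose the next configuration from (config, metric) observations.

    With an empty history, or with probability explore_prob, sample uniformly.
    Otherwise perturb the best configuration observed so far.
    """
    low, high = bounds
    if not history or rng.random() < explore_prob:
        return rng.uniform(low, high)
    best_x, _ = max(history, key=lambda obs: obs[1])
    width = step * (high - low)
    candidate = best_x + rng.uniform(-width, width)
    return min(high, max(low, candidate))  # clamp to the search range


def run_tuning_job(objective, bounds, n_iter, rng, parent_history=()):
    """Run a tuning job, optionally warm-started from parent observations."""
    history = list(parent_history)  # warm start: reuse parent evaluations
    new_observations = []
    for _ in range(n_iter):
        x = suggest(history, bounds, rng)
        y = objective(x)
        history.append((x, y))
        new_observations.append((x, y))
    return new_observations


def accuracy(x):
    """Synthetic validation accuracy, maximized at x = 0.7."""
    return 1.0 - (x - 0.7) ** 2


rng = random.Random(0)
parent = run_tuning_job(accuracy, (0.0, 1.0), n_iter=10, rng=rng)
child = run_tuning_job(accuracy, (0.0, 1.0), n_iter=10, rng=rng,
                       parent_history=parent)
print("best parent accuracy:", max(y for _, y in parent))
print("best child accuracy: ", max(y for _, y in child))
```

Warm starting from several parents, as in the third tuning job above, corresponds to passing the concatenation of their observations as `parent_history`.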
7 Related Work
Before concluding, we briefly review open source solutions for black-box optimization in this section. Rather than providing an exhaustive list, we aim to give an overview of the tools available publicly.
One of the earliest packages for Bayesian Optimization using a GP as a surrogate model is Spearmint Snoek et al. (2012), where several important extensions including multi-task BO Swersky et al. (2013), input-warping Snoek et al. (2014) and handling of unknown constraints Gelbart et al. (2014) have been introduced. The same strategy has also been implemented in other open source packages such as BayesianOptimization Nogueira (2014), scikit-optimize (https://github.com/scikit-optimize; easy to use when training scikit-learn algorithms) and Emukit Paleyes et al. (2019). Instead of a GP surrogate model, SMAC Hutter et al. (2011b) uses a random forest, which makes it appealing for high dimensional and discrete problems.
With the growing popularity of deep learning frameworks, BO has also been implemented for all the major deep learning frameworks. BoTorch Balandat et al. (2019) is the BO implementation built on top of PyTorch Paszke et al. (2017) and GPyTorch Gardner et al. (2018), with an emphasis on a modular interface, support for scalable GPs as well as multi-objective BO. For TensorFlow (https://www.tensorflow.org/), GPflowOpt Knudde et al. (2017) is a BO implementation that builds on GPflow Matthews et al. (2017). Finally, AutoGluon HNAS Klein et al. (2020) provides asynchronous BO, with and without multi-fidelity optimization, as part of AutoGluon (https://autogluon.mxnet.io/), which is framework-agnostic yet closely integrated with MXNet.
8 Conclusion
We have presented SageMaker AMT, a scalable, fully-managed service to optimize black-box functions in the cloud. We outlined design principles and the system’s architecture, showing how it integrates with other SageMaker components. Powered by state-of-the-art BO, AMT is an effective tool to optimize SageMaker’s built-in as well as customers’ algorithms. It also offers a set of advanced features, such as automatic early stopping and warm-starting from previous tuning jobs, which demonstrably speed up the search for good hyperparameter configurations.
Future work could extend the HPO capabilities of AMT to support multi-fidelity optimization approaches, making an even more efficient use of parallelism and early stopping. Beyond performance metrics, several ML applications involve optimizing multiple alternative metrics at the same time, such as maximum memory usage, inference latency or fairness (Perrone et al., 2019a, 2020; Lee et al., 2020; Guinet et al., 2020). AMT could be extended to optimize multiple objectives simultaneously, automatically suggesting hyperparameter configurations that are optimal along several criteria.
AMT has contributions from many members of the SageMaker team, notably Gavin Bell, Bhaskar Dutt, Anne Milbert, Choucri Bechir, Enrico Sartoriello, Furkan Bozkurt, Ilyes Khamlichi, Adnan Kukuljac, Jose Luis Contreras, Chance Bair, Ugur Adiguzel.
- Assael et al. (2014) Assael, J.-A. M., Wang, Z., and de Freitas, N. Heteroscedastic treed Bayesian optimisation. Technical report, preprint arXiv:1410.7172, 2014.
- Balandat et al. (2019) Balandat, M., Karrer, B., Jiang, D. R., Daulton, S., Letham, B., Wilson, A. G., and Bakshy, E. BoTorch: Programmable Bayesian Optimization in PyTorch. arXiv e-prints, pp. arXiv:1910.06403, October 2019.
- Bardenet et al. (2013) Bardenet, R., Brendel, M., Kégl, B., and Sebag, M. Collaborative hyperparameter tuning. In Proceedings of the International Conference on Machine Learning (ICML), pp. 199–207, 2013.
- Bergstra & Bengio (2012) Bergstra, J. and Bengio, Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research (JMLR), 13:281–305, 2012.
- Brochu et al. (2010) Brochu, E., Cora, V. M., and De Freitas, N. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. Technical report, preprint arXiv:1012.2599, 2010.
- Domhan et al. (2014) Domhan, T., Springenberg, T., and Hutter, F. Extrapolating learning curves of deep neural networks. In ICML AutoML Workshop, 2014.
- Falkner et al. (2018) Falkner, S., Klein, A., and Hutter, F. BOHB: Robust and efficient hyperparameter optimization at scale. In Proceedings of the International Conference on Machine Learning (ICML), pp. 1436–1445, 2018.
- Feurer & Hutter (2019) Feurer, M. and Hutter, F. Hyperparameter optimization. In Hutter, F., Kotthoff, L., and Vanschoren, J. (eds.), AutoML: Methods, Systems, Challenges, chapter 1, pp. 3–37. Springer, 2019.
- Feurer et al. (2015) Feurer, M., Springenberg, T., and Hutter, F. Initializing Bayesian hyperparameter optimization via meta-learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
- Fusi et al. (2018) Fusi, N., Sheth, R., and Elibol, M. Probabilistic matrix factorization for automated machine learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pp. 3352–3361, 2018.
- Gardner et al. (2018) Gardner, J. R., Pleiss, G., Bindel, D., Weinberger, K. Q., and Wilson, A. G. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems, 2018.
- Gelbart et al. (2014) Gelbart, M. A., Snoek, J., and Adams, R. P. Bayesian optimization with unknown constraints. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, pp. 250–259, 2014.
- Golovin et al. (2017) Golovin, D., Solnik, B., Moitra, S., Kochanski, G., Karro, J., and Sculley, D. Google Vizier: A service for black-box optimization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1487–1495, 2017.
- Guinet et al. (2020) Guinet, G., Perrone, V., and Archambeau, C. Pareto-efficient Acquisition Functions for Cost-Aware Bayesian Optimization. NeurIPS Meta Learning Workshop, 2020.
- Hansen & Ostermeier (2001) Hansen, N. and Ostermeier, A. Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation, 9(2):159–195, 2001.
- Hennig & Schuler (2012) Hennig, P. and Schuler, C. J. Entropy search for information-efficient global optimization. Journal of Machine Learning Research (JMLR), 2012.
- Hernández-Lobato et al. (2014) Hernández-Lobato, J. M., Hoffman, M. W., and Ghahramani, Z. Predictive entropy search for efficient global optimization of black-box functions. Technical report, preprint arXiv:1406.2541, 2014.
- Hoffman et al. (2014) Hoffman, M., Shahriari, B., and de Freitas, N. On correlation and budget constraints in model-based bandit optimization with application to automatic machine learning. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 365–374, 2014.
- Hutter et al. (2011a) Hutter, F., Hoos, H. H., and Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration. In Proceedings of LION-5, pp. 507–523, 2011a.
- Hutter et al. (2011b) Hutter, F., Hoos, H. H., and Leyton-Brown, K. Sequential model-based optimization for general algorithm configuration. In International conference on learning and intelligent optimization, pp. 507–523. Springer, 2011b.
- Jaderberg et al. (2017) Jaderberg, M., Dalibard, V., Osindero, S., Czarnecki, W. M., Donahue, J., Razavi, A., Vinyals, O., Green, T., Dunning, I., Simonyan, K., et al. Population based training of neural networks. Technical report, preprint arXiv:1711.09846, 2017.
- Jamieson & Talwalkar (2015) Jamieson, K. and Talwalkar, A. Non-stochastic best arm identification and hyperparameter optimization. Technical report, preprint arXiv:1502.07943, 2015.
- Jones et al. (1998) Jones, D. R., Schonlau, M., and Welch, W. J. Efficient global optimization of expensive black-box functions. Journal of Global Optimization, 1998.
- Karnin et al. (2013) Karnin, Z., Koren, T., and Somekh, O. Almost optimal exploration in multi-armed bandits. In Proceedings of the 30th International Conference on Machine Learning (ICML), pp. 1238–1246, 2013.
- Klein et al. (2017) Klein, A., Falkner, S., Springenberg, J. T., and Hutter, F. Learning curve prediction with Bayesian neural networks. In International Conference on Learning Representations (ICLR), volume 17, 2017.
- Klein et al. (2020) Klein, A., Tiao, L., Lienart, T., Archambeau, C., and Seeger, M. Model-based asynchronous hyperparameter and neural architecture search. arXiv preprint arXiv:2003.10865, 2020.
- Knudde et al. (2017) Knudde, N., van der Herten, J., Dhaene, T., and Couckuyt, I. GPflowOpt: A Bayesian Optimization Library using TensorFlow. arXiv preprint – arXiv:1711.03845, 2017. URL https://arxiv.org/abs/1711.03845.
- Lee et al. (2020) Lee, E. H., Perrone, V., Archambeau, C., and Seeger, M. Cost-aware Bayesian optimization. In ICML AutoML Workshop, 2020.
- Li et al. (2016) Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A novel bandit-based approach to hyperparameter optimization. Technical report, preprint arXiv:1603.06560, 2016.
- Li et al. (2019) Li, L., Jamieson, K., Rostamizadeh, A., Gonina, E., Hardt, M., Recht, B., and Talwalkar, A. Massively parallel hyperparameter tuning. Technical Report 1810.05934v4 [cs.LG], ArXiv, 2019.
- Liberty et al. (2020) Liberty, E., Karnin, Z., Xiang, B., Rouesnel, L., Coskun, B., Nallapati, R., Delgado, J., Sadoughi, A., Astashonok, Y., Das, P., Balioglu, C., Chakravarty, S., Jha, M., Gautier, P., Arpin, D., Januschowski, T., Flunkert, V., Wang, Y., Gasthaus, J., Stella, L., Rangapuram, S., Salinas, D., Schelter, S., and Smola, A. Elastic machine learning algorithms in Amazon SageMaker. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 731–737, 2020.
- Lizotte (2008) Lizotte, D. J. Practical Bayesian optimization. PhD thesis, University of Alberta, 2008.
- Mackay (2003) Mackay, D. J. C. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
- Matthews et al. (2017) Matthews, A. G. d. G., van der Wilk, M., Nickson, T., Fujii, K., Boukouvalas, A., León-Villagrá, P., Ghahramani, Z., and Hensman, J. GPflow: A Gaussian process library using TensorFlow. Journal of Machine Learning Research, 18(40):1–6, apr 2017. URL http://jmlr.org/papers/v18/16-537.html.
- Mockus et al. (1978) Mockus, J., Tiesis, V., and Zilinskas, A. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 1978.
- Murray & Adams (2010) Murray, I. and Adams, R. P. Slice sampling covariance hyperparameters of latent Gaussian models. In Advances in Neural Information Processing Systems, pp. 1732–1740, 2010.
- Nogueira (2014) Nogueira, F. Bayesian optimization: Open source constrained global optimization tool for Python, 2014. URL https://github.com/fmfn/BayesianOptimization.
- Osborne et al. (2009) Osborne, M. A., Garnett, R., and Roberts, S. J. Gaussian processes for global optimization. In Proceedings of the 3rd Learning and Intelligent OptimizatioN Conference (LION 3), 2009.
- Paleyes et al. (2019) Paleyes, A., Pullin, M., Mahsereci, M., Lawrence, N., and González, J. Emulation of physical processes with Emukit. In Second Workshop on Machine Learning and the Physical Sciences, NeurIPS, 2019.
- Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In NIPS-W, 2017.
- Perrone et al. (2017) Perrone, V., Jenatton, R., Seeger, M., and Archambeau, C. Multiple adaptive Bayesian linear regression for scalable Bayesian optimization with warm start. NeurIPS Meta Learning Workshop, 2017.
- Perrone et al. (2018) Perrone, V., Jenatton, R., Seeger, M. W., and Archambeau, C. Scalable hyperparameter transfer learning. Advances in Neural Information Processing Systems 31, pp. 6845–6855, 2018.
- Perrone et al. (2019a) Perrone, V., Shcherbatyi, I., Jenatton, R., Archambeau, C., and Seeger, M. Constrained Bayesian optimization with max-value entropy search. In NeurIPS Meta Learning Workshop, 2019a.
- Perrone et al. (2019b) Perrone, V., Shen, H., Seeger, M. W., Archambeau, C., and Jenatton, R. Learning search spaces for Bayesian optimization: Another view of hyperparameter transfer learning. Advances in Neural Information Processing Systems, 32:12771–12781, 2019b.
- Perrone et al. (2020) Perrone, V., Donini, M., Kenthapadi, K., and Archambeau, C. Fair Bayesian optimization. In ICML AutoML Workshop, 2020.
- Poloczek et al. (2016) Poloczek, M., Wang, J., and Frazier, P. I. Warm starting Bayesian optimization. In Winter Simulation Conference (WSC), 2016, pp. 770–781. IEEE, 2016.
- Rasmussen & Williams (2006) Rasmussen, C. and Williams, C. Gaussian Processes for Machine Learning. MIT Press, 2006.
- Real et al. (2020) Real, E., Liang, C., So, D., and Le, Q. Evolving machine learning algorithms from scratch. In Proceedings of the International Conference on Machine Learning (ICML), pp. 2394–2406, 2020.
- Salinas et al. (2020) Salinas, D., Shen, H., and Perrone, V. A quantile-based approach for hyperparameter transfer learning. International Conference on Machine Learning, pp. 7706–7716, 2020.
- Shahriari et al. (2014) Shahriari, B., Wang, Z., Hoffman, M. W., Bouchard-Côté, A., and de Freitas, N. An entropy search portfolio for Bayesian optimization. Technical report, preprint arXiv:1406.4625, 2014.
- Shahriari et al. (2016) Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N. Taking the human out of the loop: A review of Bayesian optimization. IEEE, 2016.
- Snoek et al. (2012) Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2960–2968, 2012.
- Snoek et al. (2014) Snoek, J., Swersky, K., Zemel, R. S., and Adams, R. P. Input warping for Bayesian optimization of non-stationary functions. Technical report, preprint arXiv:1402.0929, 2014.
- Sobol (1967) Sobol, I. M. On the distribution of points in a cube and the approximate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics, 7(4):86–112, 1967.
- Springenberg et al. (2016) Springenberg, J. T., Klein, A., Falkner, S., and Hutter, F. Bayesian optimization with robust Bayesian neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 4134–4142, 2016.
- Srinivas et al. (2010) Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Gaussian process optimization in the bandit setting: No regret and experimental design. Proceedings of the International Conference on Machine Learning (ICML), pp. 1015–1022, 2010.
- Srinivas et al. (2012) Srinivas, N., Krause, A., Kakade, S., and Seeger, M. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58:3250–3265, 2012.
- Swersky et al. (2013) Swersky, K., Snoek, J., and Adams, R. P. Multi-task Bayesian optimization. In Advances in Neural Information Processing Systems (NIPS), pp. 2004–2012, 2013.
- Swersky et al. (2014) Swersky, K., Snoek, J., and Adams, R. P. Freeze-thaw Bayesian optimization. Technical report, preprint arXiv:1406.3896, 2014.
- Thompson (1933) Thompson, W. R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, pp. 285–294, 1933.
- Vapnik (2013) Vapnik, V. The nature of statistical learning theory. Springer Science & Business Media, 2013.
- Villemonteix et al. (2009) Villemonteix, J., Vazquez, E., and Walter, E. An informational approach to the global optimization of expensive-to-evaluate functions. Journal of Global Optimization, 44(4):509–534, 2009.
- Wang & Jegelka (2017) Wang, Z. and Jegelka, S. Max-value entropy search for efficient Bayesian optimization. 70:3627–3635, 06–11 Aug 2017.
- Yogatama & Mann (2014) Yogatama, D. and Mann, G. Efficient transfer learning method for automatic hyperparameter tuning. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1077–1085, 2014.