Hyperparameter optimization (HPO) has been a pivotal part of machine learning (ML) and has contributed to strong performance in a wide range of tasks (Feurer & Hutter, 2019). It is widely acknowledged that the performance of deep neural networks depends greatly on the configuration of the hyperparameters (Dacrema et al., 2019; Henderson et al., 2018; Lucic et al., 2018). The general flow used to identify the optimal hyperparameters with HPO methods is: (1) select a set of hyperparameters to be evaluated based on the observations made so far; (2) optimize an ML model using training data with the selected hyperparameters; (3) evaluate the loss (or score) of the optimized model using validation data; and (4) repeat steps (1) to (3) adaptively. This procedure is a special case of a black-box function optimization problem, where the input is a set of hyperparameters and the output is a validation score. Among the black-box optimization methods, adaptive algorithms such as Bayesian optimization (BO) (Brochu et al., 2010; Shahriari et al., 2015; Frazier, 2018) have shown superior empirical performance compared with traditional algorithms such as grid search or random search (Bergstra & Bengio, 2012).
A critical assumption in HPO is the availability of an accurate validation score (Frazier, 2018). However, in reality, there are many cases where we cannot access the ground-truth of the task of interest (referred to as the target task hereinafter). For example, in display advertising, predicting the effectiveness of each advertisement, i.e., its click-through rate (CTR), is important for showing relevant advertisements (ads) to users. Therefore, it is necessary to conduct HPO before a new ad campaign starts. However, for new ads that have not yet been displayed to users, one cannot use labeled data to conduct HPO. In this case, the standard HPO procedure is infeasible, as one cannot utilize the labeled target task data and the true validation score of the ML model under consideration.
In this work, we address the infeasibility issue of HPO when the labels of the target task are unavailable. To formulate the situation, we first introduce a novel HPO setting called multi-source unsupervised hyperparameter optimization (MSU-HPO). In MSU-HPO, it is assumed that we do not have the labeled data for a target task. However, we do have the data for some source tasks with a different distribution from the target task. It is natural to assume that we have access to multiple source tasks in most practical settings. In the display advertising example, several labeled datasets of old ads that have already been deployed are often available, which we can use as labeled source task datasets. To the best of our knowledge, no HPO approach exists that can address a situation without labeled target task data, despite its significance and possibility for applications.
A problem with MSU-HPO is that the ground-truth target task objective is inaccessible, so one cannot directly apply the standard HPO procedure. Thus, it is essential to approximate this objective accurately using only the available data. For this purpose, we propose two estimators that enable the evaluation of ML models without labeled target task data. Our estimators are general and can be used in combination with any common black-box optimization method, such as Gaussian process-based BO (Srinivas et al., 2010; Snoek et al., 2012; Hennig & Schuler, 2012; Contal et al., 2014; Hernández-Lobato et al., 2014; Wang & Jegelka, 2017) and the tree-structured Parzen estimator (Bergstra et al., 2011, 2013). Through theoretical analysis, we show that the proposed estimators approximate the target task objective without bias, and that one of them achieves a desirable variance property by selecting useful source tasks based on a task divergence measure. We also present a general and computationally inexpensive HPO procedure for MSU-HPO that builds on our estimators. Finally, we demonstrate that our estimators work properly through numerical experiments with synthetic and real-world datasets.
Related Work. A typical HPO setting is to find a better set of hyperparameters using a labeled target task of interest. As faster convergence is an essential performance metric of the HPO methods, the research community is moving on to the multi-source or transfer settings for which there are some previously solved related source tasks. By combining the additional source task information and the labeled target task dataset, it has been shown that one can improve the hyperparameter search efficiency, and thus reach a better solution with fewer evaluations (Bonilla et al., 2008; Bardenet et al., 2013; Swersky et al., 2013; Yogatama & Mann, 2014; Feurer et al., 2015; Ramachandran et al., 2018; Springenberg et al., 2016; Poloczek et al., 2017; Wistuba et al., 2018; Feurer et al., 2018; Perrone et al., 2018; Law et al., 2019; Perrone et al., 2019; Salinas et al., 2019; Vanschoren, 2019). A critical difference between the multi-source (or transfer) HPOs and our MSU-HPO settings is the existence of labels for the target task. Previous studies assume that analysts can utilize labeled target data. However, as discussed above, this is often unavailable, and thus, these methods are infeasible. This work enables the use of any HPO method in the absence of a labeled target dataset in a theoretically grounded manner for the first time.
Another related field is model evaluation in covariate shift, whose objective is to evaluate the performance of ML models on the target task using only a single relevant source dataset (Sugiyama et al., 2007; You et al., 2019; Zhong et al., 2010). These studies build on the importance sampling (IS) method (Elvira et al., 2015; Sugiyama et al., 2007) to obtain an unbiased estimate of the ground-truth model performance. While our proposed methods are also based on IS, a major difference is that we assume multiple source datasets with different distributions. We demonstrate that the previous IS method can fail in the multi-source setting, and we propose an estimator satisfying the optimal variance property. Moreover, as these methods are specific to model evaluation, the connection between IS-based estimation techniques and automated HPO methods has not yet been explored despite its potentially broad applications. Consequently, we are the first to empirically evaluate the combination of IS-based unbiased estimation and adaptive HPO.
Contributions. The contributions of this work can be summarized as follows:
- We formulate a novel and highly practical HPO setting, MSU-HPO.
- We propose two unbiased estimators for the ground-truth validation score that are calculable with the available data. Additionally, we demonstrate that one of them achieves optimal finite variance among a reasonable class of unbiased estimators.
- We describe a flexible and computationally tractable HPO procedure that builds on the proposed estimators.
- We empirically demonstrate that the proposed procedure works favorably in the MSU-HPO setting. Furthermore, our empirical results suggest a new possible connection between adaptive HPO and IS-based unbiased estimation techniques.
2 Problem Setting
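In this section, we formulate a novel problem setting called multi-source unsupervised hyperparameter optimization (MSU-HPO). Let $\mathcal{X} \subseteq \mathbb{R}^d$ be the $d$-dimensional input space and $\mathcal{Y} \subseteq \mathbb{R}$ be the real-valued output space. We use $p_T(x, y)$ to denote the joint probability density function of the input and output variables $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$ of the target task. The objective of this work is to find the best set of hyperparameters $\theta$ with respect to the target distribution:
$$\theta^{opt} = \arg\min_{\theta \in \Theta} f_T(\theta) \tag{1}$$
where $\Theta$ is a pre-defined hyperparameter search space and $f_T(\theta)$ is the target task objective, which is defined as the generalization error over the target distribution:
$$f_T(\theta) = \mathbb{E}_{(X, Y) \sim p_T}\left[ L(h_\theta(X), Y) \right] \tag{2}$$
where $L: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$ is a bounded loss function such as the zero-one loss, and $h_\theta: \mathcal{X} \to \mathcal{Y}$ is an arbitrary machine learning model that predicts the output values using the input vectors with a set of hyperparameters $\theta$.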
In a standard hyperparameter optimization setting (Bergstra et al., 2011; Feurer & Hutter, 2019; Snoek et al., 2012), labeled i.i.d. validation samples $\{(x_i, y_i)\}_{i=1}^{n_T} \sim p_T$ are available, and one can easily estimate the target objective in Eq. (2) by the following empirical mean:
$$\hat{f}_T(\theta) = \frac{1}{n_T} \sum_{i=1}^{n_T} L(h_\theta(x_i), y_i) \tag{3}$$
where $n_T$ is the number of i.i.d. labeled samples drawn from the target task distribution.
In contrast, under the MSU-HPO setting, labels of the target task are assumed to be unobservable; we can use only unlabeled target validation samples, denoted as $D_T = \{x_i\}_{i=1}^{n_T}$ hereinafter. Instead, we assume the availability of multiple source task datasets, denoted as $\{D_{S_j}\}_{j=1}^{N_S}$, where $j$ is a source task index and $N_S$ denotes the number of source tasks. Each source task dataset is defined as a set of i.i.d. labeled samples: $D_{S_j} = \{(x_i^j, y_i^j)\}_{i=1}^{n_j} \sim p_{S_j}(x, y)$, where $p_{S_j}(x, y)$ is a joint probability density function that characterizes the source task $S_j$.
Regarding the target and source distributions, we make the following assumptions.
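Assumption 1. Source tasks have support for the target task, i.e., $p_T(x) > 0 \Rightarrow p_{S_j}(x) > 0$ for all $x \in \mathcal{X}$ and all $j$.
Assumption 2. Conditional output distributions remain the same between the target and all of the source tasks, i.e., $p_T(y \mid x) = p_{S_j}(y \mid x)$ for all $j$.
Assumption 3. Marginal input distributions of the target and source tasks are different, i.e., $p_T(x) \neq p_{S_j}(x)$ for all $j$. It should also be noted that the marginal input distributions of different source tasks can be different (i.e., for a given $j$, $p_{S_j}(x) \neq p_{S_{j'}}(x)$ when $j \neq j'$).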
The above assumptions are common in the covariate shift literature (Shimodaira, 2000) and state that the input-output relation is the same, but the input distributions are different, for the target and source tasks.
One critical difficulty of the MSU-HPO setting is that the simple approximation using the empirical mean in Eq. (3) is infeasible, as the labeled target dataset is unavailable. Therefore, it is essential to accurately estimate the target task objective function using only an unlabeled target dataset and labeled multiple source datasets.
In this section, we first introduce a simple unbiased estimator for the target task objective by applying an importance weighting technique. This estimator provides an unbiased estimate; however, optimizing it can be a sub-optimal practice, as it can have a large variance when the target and source distributions differ greatly. To address the instability issue of the unbiased estimator, we introduce another estimator, the variance reduced estimator, which achieves the desirable variance property.
3.1 Unbiased Objective Estimator
A natural first candidate method to approximate the target task objective function is to use the importance weighting technique (Shimodaira, 2000). To define our proposed estimator, we first formally introduce the density ratio between the target task distribution and the source task distribution below.
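(Density Ratio) For any $(x, y) \in \mathcal{X} \times \mathcal{Y}$ with a positive source density $p_{S_j}(x, y) > 0$, the density ratio between the target and a source task distribution is
$$w_{S_j}(x, y) = \frac{p_T(x, y)}{p_{S_j}(x, y)} = \frac{p_T(x)}{p_{S_j}(x)} = w_{S_j}(x), \qquad 0 \le w_{S_j}(x) \le C \tag{4}$$
where $C$ is a positive constant. The equalities are derived from Assumption 2.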
Using the density ratio, we define an estimator for the target task objective function.
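(Unbiased Estimator) For a given set of hyperparameters $\theta$, the unbiased estimator for the target task objective function is defined as
$$\hat{f}_{UB}\left(\theta; \{D_{S_j}\}_{j=1}^{N_S}\right) = \frac{1}{n} \sum_{j=1}^{N_S} \sum_{i=1}^{n_j} w_{S_j}(x_i^j) \, L(h_\theta(x_i^j), y_i^j) \tag{5}$$
where UB stands for unbiased, $n = \sum_{j=1}^{N_S} n_j$ is the total sample size of the source tasks, $n_j$ is the number of i.i.d. samples $(x_i^j, y_i^j)$ from the distribution of source task $S_j$, and $w_{S_j}(\cdot)$ is the true density ratio function.
The estimator in Eq. (5) is an application of importance weighted cross-validation (Sugiyama et al., 2007) to the multiple-source task setting and can easily be shown to be statistically unbiased for the ground-truth target task objective function in Eq. (2), i.e., for any given $\theta$, $\mathbb{E}\left[\hat{f}_{UB}\left(\theta; \{D_{S_j}\}_{j=1}^{N_S}\right)\right] = f_T(\theta)$.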
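The pooled importance-weighted mean can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `unbiased_estimate`, the binary input, and the loss values are assumptions made here, and the probabilities from Table 1 are treated as marginals over that hypothetical input. The source samples are chosen so that their empirical frequencies match the source marginals exactly, which makes the check deterministic.

```python
import numpy as np

def unbiased_estimate(losses_per_source, weights_per_source):
    """Importance-weighted loss pooled over all source tasks.

    losses_per_source[j][i]  : loss of a fixed model on sample i of source j
    weights_per_source[j][i] : density ratio p_T(x) / p_Sj(x) at that sample
    """
    n_total = sum(len(losses) for losses in losses_per_source)
    weighted_sum = sum(
        float(np.dot(weights, losses))
        for losses, weights in zip(losses_per_source, weights_per_source)
    )
    return weighted_sum / n_total

# Hypothetical binary input x in {0, 1}; probabilities as in Table 1.
p_target = np.array([0.8, 0.2])
p_sources = [np.array([0.2, 0.8]), np.array([0.9, 0.1])]
loss_by_x = np.array([1.0, 0.5])  # assumed loss values, for illustration only

# Source samples whose empirical frequencies equal the source marginals.
x_sources = [np.array([0] * 2 + [1] * 8), np.array([0] * 9 + [1] * 1)]

losses = [loss_by_x[x] for x in x_sources]
weights = [(p_target / p_s)[x] for p_s, x in zip(p_sources, x_sources)]

estimate = unbiased_estimate(losses, weights)
target_objective = float(p_target @ loss_by_x)  # expected loss under the target
print(estimate, target_objective)
```

In this constructed case the weighted source average coincides with the expected loss under the target distribution, even though neither source distribution matches the target.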
We also characterize the variance of the unbiased estimator.
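The form of this variance can be sketched as follows, under the assumptions that the model is held fixed and that samples are independent within and across source tasks. The notation is chosen here and may differ from the original statement: $n_j$ is the sample size of source task $S_j$, $n = \sum_j n_j$, $w_{S_j}$ is the density ratio, and $f_T(\theta)$ is the target task objective. The second line uses the fact that each importance-weighted loss has mean $f_T(\theta)$.

```latex
\begin{aligned}
\mathrm{Var}\big[\hat{f}_{UB}(\theta)\big]
&= \frac{1}{n^2} \sum_{j=1}^{N_S} \sum_{i=1}^{n_j}
   \mathrm{Var}_{p_{S_j}}\big[ w_{S_j}(X)\, L(h_\theta(X), Y) \big] \\
&= \frac{1}{n^2} \sum_{j=1}^{N_S} n_j
   \Big( \mathbb{E}_{p_{S_j}}\big[ w_{S_j}^2(X)\, L^2(h_\theta(X), Y) \big] - f_T(\theta)^2 \Big)
\end{aligned}
```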
As stated above, the unbiased estimator is a valid approach for approximating a target task objective because of its unbiasedness. The problem is that its variance depends on the square of the density ratio function, which can be huge when there is a source task with a distribution that is dissimilar to that of the target task.
To illustrate this variance problem, we use a toy example where . The loss values for possible tuples, and the probability densities of the target and two source tasks are presented in Table 1. It shows that the target task is similar to the source task , but its distribution is significantly different from that of . For simplicity and without loss of generality, suppose there are two source task datasets such as and . Then from Eq. (6), the variance of the unbiased estimator is about . Intuitively, this large variance is a result of the large variance samples from . In fact, by dropping the samples of reduces the variance to . From this example, we know that the unbiased estimator fails to make the most of the source tasks, and there is room to improve its variance by down-weighting the source tasks dissimilar to the target task.
| target task ($T$) distribution | 0.8 | 0.2 |
| source task ($S_1$) distribution | 0.2 | 0.8 |
| source task ($S_2$) distribution | 0.9 | 0.1 |
3.2 Variance Reduced Objective Estimator
As illustrated with the toy example, the unbiased estimator can be unstable when there are some source tasks with a distribution that is significantly different from that of the target task. To address this variance issue, we define a divergence measure between the two tasks below.
(Task Divergence Measure) The divergence between a source task distribution where and the target task distribution is defined as
This task divergence measure is large when the corresponding source distribution deviates significantly from the target task distribution. Building on this divergence measure, we define the following estimator for the target task objective.
(Variance Reduced Estimator) For a given set of hyperparameters , the variance reduced estimator for the target task objective function is defined as
where VR stands for variance reduced, is any sample size of the i.i.d. samples from the distribution of source task , and is the true density ratio function. is a weight for source task , which is defined as
Note that, for all , and .
Then, we demonstrate that the variance reduced estimator in Eq. (8) is optimal in the sense that any other convex combination of a set of weights that satisfies the unbiasedness for the target task objective function does not provide a smaller variance.
Theorem 1 suggests that the variance reduced estimator achieves a desirable finite sample variance property by weighting each source task based on its divergence to the target task.
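The intuition behind this optimality can be sketched with a standard inverse-variance argument. The parametrization below is an assumption made for illustration and may differ from the exact statement of Theorem 1: each source task receives a weight $\lambda_j$, unbiasedness requires $\sum_j \lambda_j n_j = 1$, and $V_j$ denotes the per-sample variance of the importance-weighted loss under source task $S_j$.

```latex
\begin{aligned}
&\hat{f}_{\lambda}(\theta) = \sum_{j=1}^{N_S} \lambda_j \sum_{i=1}^{n_j}
   w_{S_j}(x_i^j)\, L(h_\theta(x_i^j), y_i^j),
\qquad \lambda_j \ge 0, \quad \sum_{j=1}^{N_S} \lambda_j n_j = 1, \\
&\mathrm{Var}\big[\hat{f}_{\lambda}(\theta)\big] = \sum_{j=1}^{N_S} \lambda_j^2\, n_j\, V_j,
\qquad V_j = \mathrm{Var}_{p_{S_j}}\big[ w_{S_j}(X)\, L(h_\theta(X), Y) \big], \\
&\lambda_j^{\star} = \frac{V_j^{-1}}{\sum_{k=1}^{N_S} n_k V_k^{-1}}
\quad \Longrightarrow \quad
\mathrm{Var}\big[\hat{f}_{\lambda^{\star}}(\theta)\big]
   = \Big( \sum_{k=1}^{N_S} \frac{n_k}{V_k} \Big)^{-1}.
\end{aligned}
```

The weights $\lambda_j^{\star}$ follow from minimizing the variance under the linear constraint with a Lagrange multiplier. Since the uniform choice $\lambda_j = 1/n$ is also feasible, the pooled estimator can never have a smaller variance than this weighted one.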
Let us return to the toy example in Table 1. The values of the divergence measure for and are and , respectively. This leads to the weights of and . Then, the variance of the variance reduced estimator is equal to < (variance when is dropped.). Therefore, it is obvious that the variance reduced estimator performs better than the unbiased estimator does by optimally weighting all available source tasks.
3.3 Hyperparameter Optimization Procedure
We describe several detailed components of the HPO procedure in the MSU-HPO setting.
Density Ratio Estimation: In general, the density ratio functions between the target and source tasks are unavailable and thus should be estimated beforehand. To estimate them, we employ the unconstrained Least-Squares Importance Fitting (uLSIF) procedure (Kanamori et al., 2009; Yamada et al., 2011), which suggests directly minimizing the following squared error for the true density ratio function:
$$\hat{w}_{S_j} = \arg\min_{g \in \mathcal{G}} \mathbb{E}_{p_{S_j}}\left[ \left( w_{S_j}(X) - g(X) \right)^2 \right] = \arg\min_{g \in \mathcal{G}} \left[ \frac{1}{2} \mathbb{E}_{p_{S_j}}\left[ g^2(X) \right] - \mathbb{E}_{p_T}\left[ g(X) \right] \right] \tag{9}$$
where $\mathcal{G}$ is a class of measurable functions. It should be noted that the empirical version of Eq. (9) is calculable with unlabeled target and source task datasets.
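A compact version of uLSIF with a Gaussian kernel model can be sketched as follows. This is a simplified illustration under stated assumptions, not the configuration used in the experiments: the names `gaussian_kernel` and `fit_ulsif`, the kernel width `sigma`, the ridge strength `reg`, and the number of kernel centers are all chosen here, and in practice `sigma` and `reg` would normally be selected by cross-validation.

```python
import numpy as np

def gaussian_kernel(x, centers, sigma):
    # Pairwise squared distances between rows of x and rows of centers.
    sq_dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def fit_ulsif(x_target, x_source, sigma=1.0, reg=0.1, n_centers=100, seed=0):
    """Return a function estimating p_target(x) / p_source(x)."""
    rng = np.random.default_rng(seed)
    n_centers = min(n_centers, len(x_target))
    idx = rng.choice(len(x_target), size=n_centers, replace=False)
    centers = x_target[idx]

    phi_src = gaussian_kernel(x_source, centers, sigma)
    phi_tgt = gaussian_kernel(x_target, centers, sigma)

    # Empirical least-squares objective: 0.5 * a^T H a - h^T a + 0.5 * reg * a^T a
    H = phi_src.T @ phi_src / len(x_source)   # second moment under the source
    h = phi_tgt.mean(axis=0)                  # first moment under the target
    alpha = np.linalg.solve(H + reg * np.eye(n_centers), h)
    alpha = np.maximum(alpha, 0.0)            # keep the estimated ratio non-negative

    def ratio(x):
        return gaussian_kernel(x, centers, sigma) @ alpha

    return ratio

# Synthetic one-dimensional inputs (an assumption for the demo).
rng = np.random.default_rng(0)
x_source = rng.normal(0.0, 1.0, size=(500, 1))
x_target = rng.normal(0.5, 1.0, size=(500, 1))

ratio = fit_ulsif(x_target, x_source)
w_hat = ratio(x_source)   # estimated importance weights for the source samples
print(w_hat.shape, float(w_hat.min()))
```

Only unlabeled inputs enter the fit, which is why the density ratio can be estimated even though the target labels are unavailable.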
Task Divergence Estimation: To utilize the variance reduced estimator, the task divergence measure in Eq. (7) needs to be estimated from the available data. This can be done using the following empirical mean.
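One plausible plug-in form of this estimate, together with the weighting it induces, can be sketched in Python. The details are assumptions made here rather than the authors' exact estimator: the divergence of each source task is approximated by the sample variance of its importance-weighted losses (using the per-source mean in place of the unknown target objective), and each source task is weighted in proportion to its sample size divided by that variance. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def variance_reduced_estimate(losses_per_source, weights_per_source):
    """Combine per-source importance-weighted means with inverse-variance weights."""
    means, variances, sizes = [], [], []
    for losses, weights in zip(losses_per_source, weights_per_source):
        z = np.asarray(weights) * np.asarray(losses)
        means.append(z.mean())
        variances.append(z.var())   # plug-in divergence: mean(z^2) - mean(z)^2
        sizes.append(len(z))
    means = np.asarray(means)
    variances = np.asarray(variances)
    sizes = np.asarray(sizes)

    # Weight of source j is proportional to n_j / divergence_j.
    precision = sizes / np.maximum(variances, 1e-12)
    mix = precision / precision.sum()
    return float(mix @ means), mix

# Hypothetical binary input; probabilities as in Table 1, loss values assumed.
p_target = np.array([0.8, 0.2])
p_sources = [np.array([0.2, 0.8]), np.array([0.9, 0.1])]
loss_by_x = np.array([1.0, 0.5])
x_sources = [np.array([0] * 2 + [1] * 8), np.array([0] * 9 + [1] * 1)]

losses = [loss_by_x[x] for x in x_sources]
weights = [(p_target / p_s)[x] for p_s, x in zip(p_sources, x_sources)]

vr_estimate, mix = variance_reduced_estimate(losses, weights)
print(vr_estimate, mix)
```

In this constructed example nearly all of the weight goes to the second source task, whose distribution is close to the target, while the estimate itself still matches the expected loss under the target distribution.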
How to train $h_\theta$?: To evaluate the validation score of $\theta$, the model parameters of $h_\theta$ should be optimized by a supervised learning procedure. However, in the MSU-HPO setting, the labeled target task dataset is unavailable, and direct training of $h_\theta$ is infeasible. Therefore, we suggest splitting the labeled source task datasets into training and validation sets. Then, we can train $h_\theta$ using the training set by $h_\theta^{*} = \arg\min_{h \in \mathcal{H}_\theta} \hat{f}(h)$, where $\hat{f}$ is an estimator for the target task objective function, such as the unbiased and variance reduced estimators, computed on the training set, and $\mathcal{H}_\theta$ is a hypothesis space defined by a set of hyperparameters $\theta$.
This training procedure enables us to obtain the model parameters of $h_\theta$ as if it were trained on the labeled target task dataset. In addition, it is sufficient to train $h_\theta$ only once to evaluate each $\theta$, so the proposed procedure is computationally inexpensive.
Building on the above details, Algorithm 1 summarizes the high-level hyperparameter optimization procedure under the MSU-HPO setting (the specific procedure when BO is used as OPT is described in Appendix B). We also provide the regret bound of our HPO procedure in the MSU-HPO setting in Appendix C.
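The overall loop can be sketched in Python as follows. This is a simplified illustration and not Algorithm 1 itself: random search stands in for the black-box optimizer OPT, a weighted ridge regression stands in for the ML model, the source tasks are pooled into one training split and one validation split, and the density ratio is assumed to be known. All function names and data-generating choices are hypothetical.

```python
import numpy as np

def weighted_ridge_fit(x, y, w, reg):
    # Closed-form minimizer of sum_i w_i * (y_i - x_i^T b)^2 + reg * ||b||^2.
    xw = x * w[:, None]
    return np.linalg.solve(x.T @ xw + reg * np.eye(x.shape[1]), xw.T @ y)

def weighted_mse(x, y, w, coef):
    # Importance-weighted validation loss, used as the estimated target objective.
    return float(np.mean(w * (y - x @ coef) ** 2))

def msu_hpo_random_search(x_tr, y_tr, w_tr, x_val, y_val, w_val,
                          n_trials=30, seed=0):
    rng = np.random.default_rng(seed)
    best_reg, best_score = None, np.inf
    for _ in range(n_trials):
        reg = 10.0 ** rng.uniform(-3, 3)                     # propose a hyperparameter
        coef = weighted_ridge_fit(x_tr, y_tr, w_tr, reg)     # train on weighted source data
        score = weighted_mse(x_val, y_val, w_val, coef)      # estimate the target objective
        if score < best_score:
            best_reg, best_score = reg, score
    return best_reg, best_score

# Synthetic source data: inputs ~ N(0, 1), while the target inputs ~ N(1, 1).
rng = np.random.default_rng(0)
n = 400
x_raw = rng.normal(0.0, 1.0, size=n)
y = np.sin(x_raw) + 0.1 * rng.normal(size=n)
w = np.exp(x_raw - 0.5)                      # exact density ratio N(1, 1) / N(0, 1)
x = np.column_stack([np.ones(n), x_raw, x_raw ** 2])

half = n // 2
best_reg, best_score = msu_hpo_random_search(
    x[:half], y[:half], w[:half], x[half:], y[half:], w[half:]
)
print(best_reg, best_score)
```

Each candidate hyperparameter requires a single model fit and a single weighted validation pass, and no target labels are touched at any point in the loop.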
We investigate the behavior of our proposed HPO procedure in MSU-HPO using a synthetic problem in Section 4.1 and real-world datasets in Section 4.2. Because there is no existing HPO method for the MSU-HPO setting, we compare the following methods as possible baselines (the code for reproducing the results of the experiments is available at https://github.com/nmasahiro/MSU-HPO; detailed experimental settings are described in Appendix D):
- Naive method, which uses the performance on the concatenation of the source tasks as a validation score.
- Oracle method, which uses the labeled target task for HPO. This method is infeasible in MSU-HPO, and we regard its performance as an upper bound on what the other methods can reach.
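The bias of the naive validation score under covariate shift can be illustrated with a small Python check. The setup is hypothetical: a binary input with the probabilities from Table 1 treated as marginals, assumed loss values, and source samples whose empirical frequencies match the source marginals exactly. The quantity `target_objective` is the exact expected loss under the target distribution, which the oracle method would approximate with labeled target data.

```python
import numpy as np

p_target = np.array([0.8, 0.2])
loss_by_x = np.array([1.0, 0.5])   # assumed loss values, for illustration only

# Two source datasets over a binary input, with marginals (0.2, 0.8) and (0.9, 0.1).
x_sources = [np.array([0] * 2 + [1] * 8), np.array([0] * 9 + [1] * 1)]

# Naive score: unweighted mean loss over the concatenated source samples.
all_losses = np.concatenate([loss_by_x[x] for x in x_sources])
naive_score = float(all_losses.mean())

# Expected loss under the target distribution.
target_objective = float(p_target @ loss_by_x)

print(naive_score, target_objective)
```

The naive score ignores the distributional shift and therefore differs from the target objective even in this noise-free setting.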
4.1 Toy Problem
We consider a -dimensional regression problem with the MSU-HPO setting. The generative process of the toy dataset is as follows:
is the uniform distribution,
denotes the normal distribution, andis a prior parameter that characterizes the marginal input distribution () of task . The objective function is given by:
Similar to the toy experiment in (Law et al., 2019), is a hypothetical ‘hyperparameter’ we would like to optimize. The optimal solution for this experiment is thus .
We described in section 3 that when of the source task and the target task differs significantly, the performance of the variance reduced estimator is better than that of the unbiased estimator. To demonstrate this, we set separately for the source () and the target tasks (). That is, the source and target distributions are similar when ; in contrast, the source and target distributions are quite different when . Finally, we set and .
Figure 1 shows the results of the experiment on the toy problem over runs with different random seeds. Figure 1 (a) indicates that the proposed unbiased and variance reduced estimators significantly outperform the naive method in all settings. This is because our estimators can unbiasedly approximate the target task objective by considering the distributional shift, while the naive method cannot. Moreover, this figure shows the advantage of unbiasedness is highlighted when the distributions of the target and source tasks diverge largely (i.e., when is large.). Next, we compare the performance of the unbiased and variance reduced estimator in Figure 1 (b). This reports the performance of the unbiased estimator relative to the variance reduced one with varying values of . The result indicates that the advantages of using the variance reduced estimator over the unbiased one are further strengthened when there is a large divergence between the target and source task distributions, which is consistent with our theoretical analysis. Finally, as shown in Figure 1 (a), the variance reduced estimator achieves almost the same performance as the upper bound without using the labels of the target task, suggesting its powerful HPO performance on an unlabeled target task.
4.2 Hyperparameter Optimization on Real-World Datasets
Parkinson data consists of voice measurements of patients with the early-stage Parkinson disease collected by using a telemonitoring device in remote symptom progression monitoring. Each patient has about recordings characterized by a feature vector with dimensions. The goal is to predict the Parkinson disease symptom score for each recording from the recordings.
Graft-versus-host disease (GvHD) is an important medical problem in the allogeneic blood transplantation field (Brinkman et al., 2007). The issue occurs in allogeneic hematopoietic stem cell transplant recipients when donor-immune cells in the graft recognize the recipient as foreign and initiate an attack on several tissues. The GvHD dataset contains weekly peripheral blood samples obtained from 31 patients characterized by a feature vector with dimensions. Following Muandet et al. (2013), we omit one patient who has insufficient data and subsample the data of each remaining patient to 1000 data points. The goal is to classify CD3+CD4+CD8+ cells, which have a high correlation with the development of the disease (Brinkman et al., 2007).
Experimental Procedure. To create the MSU-HPO setting, for both datasets, we treat each patient as a task. We select one patient as the target task and regard the remaining patients as multiple source tasks. Then, we use the following experimental procedure: (1) tune the hyperparameters of an ML model by an HPO method using the unlabeled target task and labeled source tasks; (2) split the original target task data into a training set and a test set; (3) train the ML model with the tuned hyperparameters using the training set of the target task; (4) predict the target variables (symptom scores for Parkinson and CD3+CD4+CD8+ cells for GvHD) on the test set of the target patient; (5) calculate the target task objective of the prediction and regard it as the performance of the MSU-HPO method under consideration; (6) repeat the above steps 10 times with different seeds and report the mean and standard error over the simulations.
For Parkinson, we use a support vector machine (SVM) implemented in scikit-learn (Pedregosa et al., 2011) as the ML model and the mean absolute error (MAE) as the target task objective. For GvHD, we use LightGBM (Ke et al., 2017) as the ML model and the binary cross-entropy (BCE) as the target task objective.
| Estimators | Parkinson (MAE) | GvHD (BCE) |
|---|---|---|
| Naive | 1.10334 ± 0.0908 | 0.02121 ± 0.0052 |
| Unbiased (ours) | 1.08283 ± 0.1981 | 0.02141 ± 0.0052 |
| Variance reduced (ours) | 0.40455 ± 0.1755 | 0.01791 ± 0.0039 |
| Oracle (reference) | 0.06862 ± 0.0011 | 0.01584 ± 0.0043 |
Results. Table 2 presents the results of the experiments over 10 runs with different random seeds. First, the unbiased estimator performs about the same as the naive method on Parkinson given their standard errors. Moreover, it slightly underperforms the naive method on GvHD, even though it is unbiased. This is because the number of samples for each task is small, which accentuates the variance issue of the unbiased estimator on these data. Therefore, pursuing only unbiasedness in the approximation of the target task objective is not sufficient in MSU-HPO. On the other hand, the variance reduced estimator alleviates the instability issue of the unbiased estimator and performs best among the feasible methods on both datasets. The results also suggest that the variance reduced estimator works well on both regression (Parkinson) and classification (GvHD) tasks. Therefore, given its variance optimality and empirical performance, we conclude that the variance reduced estimator is the preferred choice for MSU-HPO.
We studied a novel problem setting, MSU-HPO, with the goal of enabling effective HPO with only an unlabeled target task and multiple labeled source task datasets. To this end, we proposed two estimators to approximate the target task objective from available data. In particular, the variance-reduced estimator achieves variance optimality by using the task divergence measure. Empirical evaluations demonstrated that the proposed HPO procedure helps to determine useful hyperparameters in a computationally inexpensive manner, even when the labels of the target task are unobservable.
- Agarwal et al. (2017) Aman Agarwal, Soumya Basu, Tobias Schnabel, and Thorsten Joachims. Effective evaluation using logged bandit feedback from multiple loggers. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 687–696, 2017.
- Bardenet et al. (2013) Rémi Bardenet, Mátyás Brendel, Balázs Kégl, and Michele Sebag. Collaborative hyperparameter tuning. In International conference on machine learning, pp. 199–207, 2013.
- Bergstra & Bengio (2012) James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of machine learning research, 13(Feb):281–305, 2012.
- Bergstra et al. (2013) James Bergstra, Daniel Yamins, and David Daniel Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. 2013.
- Bergstra et al. (2011) James S Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for Hyper-Parameter Optimization. In Advances in neural information processing systems, pp. 2546–2554, 2011.
- Bonilla et al. (2008) Edwin V Bonilla, Kian M Chai, and Christopher Williams. Multi-task Gaussian Process Prediction. In Advances in neural information processing systems, pp. 153–160, 2008.
- Brinkman et al. (2007) Ryan Remy Brinkman, Maura Gasparetto, Shang-Jung Jessica Lee, Albert J Ribickas, Janelle Perkins, William Janssen, Renee Smiley, and Clay Smith. High-content flow cytometry and temporal data analysis for defining a cellular signature of graft-versus-host disease. Biology of Blood and Marrow Transplantation, 13(6):691–700, 2007.
- Brochu et al. (2010) Eric Brochu, Vlad M Cora, and Nando De Freitas. A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010.
- Contal et al. (2014) Emile Contal, Vianney Perchet, and Nicolas Vayatis. Gaussian process optimization with mutual information. In International Conference on Machine Learning, pp. 253–261, 2014.
- Dacrema et al. (2019) Maurizio Ferrari Dacrema, Paolo Cremonesi, and Dietmar Jannach. Are we really making much progress? a worrying analysis of recent neural recommendation approaches. In Proceedings of the 13th ACM Conference on Recommender Systems, pp. 101–109, 2019.
- Elvira et al. (2015) Víctor Elvira, Luca Martino, David Luengo, and Mónica F Bugallo. Efficient multiple importance sampling estimators. IEEE Signal Processing Letters, 22(10):1757–1761, 2015.
- Feurer & Hutter (2019) Matthias Feurer and Frank Hutter. Hyperparameter Optimization. In Automated Machine Learning, pp. 3–33. 2019.
- Feurer et al. (2015) Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. Initializing Bayesian Hyperparameter Optimization via Meta-Learning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
- Feurer et al. (2018) Matthias Feurer, Benjamin Letham, and Eytan Bakshy. Scalable Meta-Learning for Bayesian Optimization using Ranking-Weighted Gaussian Process Ensembles. In AutoML Workshop at ICML, 2018.
- Frazier (2018) Peter I Frazier. A tutorial on bayesian optimization. arXiv preprint arXiv:1807.02811, 2018.
- Henderson et al. (2018) Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, and David Meger. Deep Reinforcement Learning that Matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
- Hennig & Schuler (2012) Philipp Hennig and Christian J Schuler. Entropy search for information-efficient global optimization. Journal of Machine Learning Research, 13(Jun):1809–1837, 2012.
- Hernández-Lobato et al. (2014) José Miguel Hernández-Lobato, Matthew W Hoffman, and Zoubin Ghahramani. Predictive entropy search for efficient global optimization of black-box functions. In Advances in neural information processing systems, pp. 918–926, 2014.
- Kanamori et al. (2009) Takafumi Kanamori, Shohei Hido, and Masashi Sugiyama. A least-squares approach to direct importance estimation. Journal of Machine Learning Research, 10(Jul):1391–1445, 2009.
- Ke et al. (2017) Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in neural information processing systems, pp. 3146–3154, 2017.
- Law et al. (2019) Ho Chung Law, Peilin Zhao, Leung Sing Chan, Junzhou Huang, and Dino Sejdinovic. Hyperparameter learning via distributional transfer. In Advances in Neural Information Processing Systems, pp. 6801–6812, 2019.
- Lucic et al. (2018) Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are Gans Created Equal? A Large-Scale Study. In Advances in neural information processing systems, pp. 700–709, 2018.
- Muandet et al. (2013) Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pp. 10–18, 2013.
- Pedregosa et al. (2011) Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. the Journal of machine Learning research, 12:2825–2830, 2011.
- Perrone et al. (2018) Valerio Perrone, Rodolphe Jenatton, Matthias W Seeger, and Cédric Archambeau. Scalable Hyperparameter Transfer Learning. In Advances in Neural Information Processing Systems, pp. 6845–6855, 2018.
- Perrone et al. (2019) Valerio Perrone, Huibin Shen, Matthias W Seeger, Cedric Archambeau, and Rodolphe Jenatton. Learning search spaces for bayesian optimization: Another view of hyperparameter transfer learning. In Advances in Neural Information Processing Systems, pp. 12751–12761, 2019.
- Poloczek et al. (2017) Matthias Poloczek, Jialei Wang, and Peter Frazier. Multi-information source optimization. In Advances in Neural Information Processing Systems, pp. 4288–4298, 2017.
- Ramachandran et al. (2018) Anil Ramachandran, Sunil Gupta, Santu Rana, and Svetha Venkatesh. Information-theoretic Transfer Learning framework for Bayesian Optimisation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 827–842, 2018.
- Salinas et al. (2019) David Salinas, Huibin Shen, and Valerio Perrone. A copula approach for hyperparameter transfer learning. arXiv preprint arXiv:1909.13595, 2019.
- Shahriari et al. (2015) Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas. Taking the human out of the loop: A review of bayesian optimization. Proceedings of the IEEE, 104(1):148–175, 2015.
- Shimodaira (2000) Hidetoshi Shimodaira. Improving predictive inference under covariate shift by weighting the log-likelihood function. Journal of statistical planning and inference, 90(2):227–244, 2000.
- Snoek et al. (2012) Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian optimization of machine learning algorithms. In Advances in neural information processing systems, pp. 2951–2959, 2012.
- Springenberg et al. (2016) Jost Tobias Springenberg, Aaron Klein, Stefan Falkner, and Frank Hutter. Bayesian Optimization with Robust Bayesian Neural Networks. In Advances in Neural Information Processing Systems, pp. 4134–4142, 2016.
- Srinivas et al. (2010) Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: no regret and experimental design. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pp. 1015–1022, 2010.
- Sugiyama et al. (2007) Masashi Sugiyama, Matthias Krauledat, and Klaus-Robert Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8(May):985–1005, 2007.
- Swersky et al. (2013) Kevin Swersky, Jasper Snoek, and Ryan P Adams. Multi-Task Bayesian Optimization. In Advances in neural information processing systems, pp. 2004–2012, 2013.
- Tsanas et al. (2009) Athanasios Tsanas, Max A Little, Patrick E McSharry, and Lorraine O Ramig. Accurate telemonitoring of parkinson’s disease progression by noninvasive speech tests. IEEE transactions on Biomedical Engineering, 57(4):884–893, 2009.
- Vanschoren (2019) Joaquin Vanschoren. Meta-Learning. In Automated Machine Learning, pp. 35–61. 2019.
- Wang & Jegelka (2017) Zi Wang and Stefanie Jegelka. Max-value entropy search for efficient bayesian optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3627–3635. JMLR.org, 2017.
- Wistuba et al. (2018) Martin Wistuba, Nicolas Schilling, and Lars Schmidt-Thieme. Scalable gaussian process-based transfer surrogates for hyperparameter optimization. Machine Learning, 107(1):43–78, 2018.
- Yamada et al. (2011) Makoto Yamada, Taiji Suzuki, Takafumi Kanamori, Hirotaka Hachiya, and Masashi Sugiyama. Relative density-ratio estimation for robust distribution comparison. In Advances in neural information processing systems, pp. 594–602, 2011.
- Yogatama & Mann (2014) Dani Yogatama and Gideon Mann. Efficient transfer learning method for automatic hyperparameter tuning. In Artificial intelligence and statistics, pp. 1077–1085, 2014.
- You et al. (2019) Kaichao You, Ximei Wang, Mingsheng Long, and Michael Jordan. Towards accurate model selection in deep unsupervised domain adaptation. In International Conference on Machine Learning, pp. 7124–7133, 2019.
- Zhong et al. (2010) Erheng Zhong, Wei Fan, Qiang Yang, Olivier Verscheure, and Jiangtao Ren. Cross validation framework to choose amongst models and datasets for transfer learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 547–562. Springer, 2010.
Appendix A Omitted Proofs
A.1 Derivation of Unbiasedness
We first define a general class of unbiased estimators, called the λ-unbiased estimator, which includes the unbiased and variance reduced estimators as special cases.
We then show that the λ-unbiased estimator is statistically unbiased for the target task function.
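The unbiasedness property can be checked exactly on a small discrete example. The sketch below is illustrative only: it assumes that each source contributes an importance-weighted loss with density ratio w(x) = p_T(x) / p_S(x), and that the combination weights (named `lam` here) are nonnegative and sum to one. The distributions, loss values, and function names are hypothetical.

```python
from fractions import Fraction as F

# Hypothetical example: three input values, one target and two source distributions.
p_target = [F(1, 2), F(1, 4), F(1, 4)]
p_sources = [
    [F(1, 4), F(1, 4), F(1, 2)],
    [F(1, 3), F(1, 3), F(1, 3)],
]
loss = [F(3), F(1), F(2)]  # loss of a fixed model at each input value (assumed)


def expectation(p, values):
    """Expectation of `values` under the discrete distribution `p`."""
    return sum(pi * vi for pi, vi in zip(p, values))


def weighted_source_expectation(p_source, p_tgt, values):
    """E_{x ~ p_source}[w(x) * values(x)] with w(x) = p_tgt(x) / p_source(x)."""
    weighted = [(pt / ps) * v for pt, ps, v in zip(p_tgt, p_source, values)]
    return expectation(p_source, weighted)


def combined_expectation(lam, sources, p_tgt, values):
    """Convex combination of the per-source importance-weighted expectations."""
    return sum(
        l_j * weighted_source_expectation(p_s, p_tgt, values)
        for l_j, p_s in zip(lam, sources)
    )


target_value = expectation(p_target, loss)
lam = [F(2, 5), F(3, 5)]  # any nonnegative weights summing to one
combined = combined_expectation(lam, p_sources, p_target, loss)
print(target_value, combined)
```

Because each per-source expectation already equals the target expectation, any convex combination of them does as well, which is the structure the unbiasedness argument relies on.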
A.2 Derivation of Eq. (6)
Because the samples are independent, the variance can be represented as follows:
is decomposed as
From the unbiasedness property, . Then, we now have
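The two ingredients of this derivation, independence of the samples and the unbiasedness property, can be illustrated numerically. The sketch below assumes a single source and an estimator that is the sample mean of importance-weighted losses z(x) = w(x) * loss(x). Under that assumption the variance is (E_S[z^2] - f^2) / n, where f is the target expectation. All distributions and names are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical discrete source/target distributions and per-input losses.
p_source = [F(1, 4), F(1, 4), F(1, 2)]
p_target = [F(1, 2), F(1, 4), F(1, 4)]
loss = [F(3), F(1), F(2)]

# Importance-weighted loss z(x) = (p_T(x) / p_S(x)) * loss(x).
z = [(pt / ps) * l for pt, ps, l in zip(p_target, p_source, loss)]


def closed_form_variance(n):
    """(E_S[z^2] - (E_S[z])^2) / n, using independence of the n samples."""
    second_moment = sum(ps * zi ** 2 for ps, zi in zip(p_source, z))
    mean = sum(ps * zi for ps, zi in zip(p_source, z))  # equals the target expectation
    return (second_moment - mean ** 2) / n


def enumerated_variance(n):
    """Variance of the n-sample mean, computed by enumerating every sample tuple."""
    first = F(0)
    second = F(0)
    for idx in product(range(len(z)), repeat=n):
        prob = F(1)
        for i in idx:
            prob *= p_source[i]
        estimate = sum(z[i] for i in idx) / n
        first += prob * estimate
        second += prob * estimate ** 2
    return second - first ** 2


print(closed_form_variance(2), enumerated_variance(2))
```

The closed form and the brute-force enumeration agree, which mirrors the step where the mean term is replaced by the target objective using unbiasedness.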
A.3 Proof of Theorem 1
Following the same argument as in Section A.2, the variance of the λ-unbiased estimator in Eq. (12) is
Thus, by replacing for , we have
Moreover, for any set of weights , we obtain the following variance optimality using the Cauchy-Schwarz inequality.
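The Cauchy-Schwarz step has a simple generic form that can be checked numerically. The sketch below assumes the combined variance can be written as the sum of lam_j^2 * v_j over sources, where v_j is a per-source variance term (a stand-in for the quantity in the theorem) and the weights lam_j sum to one. In that case Cauchy-Schwarz gives 1 = (sum_j lam_j)^2 <= (sum_j lam_j^2 v_j) * (sum_j 1 / v_j), with equality when lam_j is proportional to 1 / v_j. The values of `v` and the function names are hypothetical.

```python
from fractions import Fraction as F


def combined_variance(lam, v):
    """Variance of a weighted sum of independent terms with variances v."""
    return sum(l * l * vj for l, vj in zip(lam, v))


def inverse_variance_weights(v):
    """Weights proportional to 1 / v_j, normalized to sum to one."""
    inv = [1 / vj for vj in v]
    total = sum(inv)
    return [i / total for i in inv]


v = [F(1), F(4), F(2)]  # hypothetical per-source variance terms
lam_star = inverse_variance_weights(v)
lower_bound = 1 / sum(1 / vj for vj in v)  # Cauchy-Schwarz lower bound

uniform = [F(1, 3)] * 3
print(lam_star, combined_variance(lam_star, v), combined_variance(uniform, v))
```

Any other choice of weights on the simplex yields a variance at least as large as `lower_bound`, which the inverse-variance weights attain.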
Appendix B Bayesian Optimization Under the MSU-HPO Setting
In Algorithm 1, we described an abstract hyperparameter optimization procedure that allows any black-box optimization method to be used. Here, in Algorithm 2, we describe the hyperparameter optimization procedure under the MSU-HPO setting when the popular Bayesian optimization method is used as the optimizer.
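To make the loop concrete, the following is a minimal one-dimensional sketch of Bayesian optimization applied to an estimated objective. It is not a reproduction of Algorithm 2. It assumes a Gaussian process surrogate with a squared-exponential kernel, a lower-confidence-bound acquisition function, and a fixed candidate grid. The function `estimated_target_loss` is a hypothetical placeholder for the estimated target task objective, and all parameter values are illustrative.

```python
import math


def rbf(a, b, length=0.2):
    """Squared-exponential kernel with unit amplitude."""
    return math.exp(-((a - b) ** 2) / (2.0 * length ** 2))


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve(L, b):
    """Solve (L L^T) x = b by forward and backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def lower_confidence_bound(xs, ys, x_new, kappa=2.0, noise=1e-4):
    """GP posterior mean minus kappa posterior standard deviations at x_new."""
    offset = sum(ys) / len(ys)  # constant prior mean set to the observed average
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    L = cholesky(K)
    k_star = [rbf(a, x_new) for a in xs]
    alpha = solve(L, [y - offset for y in ys])
    mean = offset + sum(k * a for k, a in zip(k_star, alpha))
    var = 1.0 - sum(k * w for k, w in zip(k_star, solve(L, k_star)))
    return mean - kappa * math.sqrt(max(var, 0.0))


def estimated_target_loss(theta):
    """Hypothetical stand-in for the estimated target task objective."""
    return (theta - 0.3) ** 2


def bayesian_optimization(objective, n_iter=10):
    """Evaluate two initial points, then pick each next point by minimizing the LCB."""
    candidates = [i / 100 for i in range(101)]
    xs = [0.0, 1.0]
    ys = [objective(x) for x in xs]
    for _ in range(n_iter):
        pool = [c for c in candidates if c not in xs]  # avoid duplicate evaluations
        x_next = min(pool, key=lambda c: lower_confidence_bound(xs, ys, c))
        xs.append(x_next)
        ys.append(objective(x_next))
    best = min(range(len(xs)), key=lambda i: ys[i])
    return xs[best], ys[best]


best_theta, best_loss = bayesian_optimization(estimated_target_loss)
print(best_theta, best_loss)
```

In the MSU-HPO setting, the placeholder objective would be replaced by an estimator of the target task loss computed from source task data; the surrogate and acquisition steps are otherwise unchanged.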
Appendix C Regret Analysis
In this section, we analyze the regret bound under the MSU-HPO setting. We define the regret as
where is the ground-truth target task objective, is the total sample size among source tasks, is the total number of evaluations, , and where is an estimated target task objective obtained by any estimator (e.g., the unbiased estimator or the variance reduced estimator). Note that each of is the hyperparameter selected in evaluations in the optimization.
To bound the regret above, we first decompose it into the following terms:
The term (A) represents the regret obtained by optimizing the estimated target task objective . The term (B) represents the difference of a function value between the true objective and the estimated objective at , which is the solution obtained by the optimization for the estimated objective. The term (C) represents the difference between the minimum value for the estimated objective and that of the true objective .
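The three terms described above telescope to the regret, which can be verified on a toy example. The sketch below assumes the signs implied by the descriptions: (A) is the estimated objective at the returned solution minus its minimum, (B) is the true objective minus the estimated objective at the returned solution, and (C) is the minimum of the estimated objective minus the minimum of the true objective. The objectives `f_true` and `f_est`, the grid, and the returned point are all hypothetical.

```python
from fractions import Fraction as F

thetas = [F(i, 10) for i in range(11)]  # hypothetical search grid on [0, 1]


def f_true(theta):
    """Hypothetical ground-truth target objective."""
    return (theta - F(3, 10)) ** 2


def f_est(theta):
    """Hypothetical estimated objective (shifted minimizer plus an offset)."""
    return (theta - F(1, 2)) ** 2 + F(1, 20)


# Solution returned by optimizing f_est (not necessarily its exact minimizer).
theta_returned = F(2, 5)
theta_true_opt = min(thetas, key=f_true)
theta_est_opt = min(thetas, key=f_est)

regret = f_true(theta_returned) - f_true(theta_true_opt)
term_a = f_est(theta_returned) - f_est(theta_est_opt)    # optimization regret on f_est
term_b = f_true(theta_returned) - f_est(theta_returned)  # estimation gap at the solution
term_c = f_est(theta_est_opt) - f_true(theta_true_opt)   # gap between the two minima

print(regret, term_a, term_b, term_c)
```

Term (A) depends only on the optimizer, while (B) and (C) depend only on how closely the estimated objective tracks the true one.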
We first show the following two lemmas, which are used to bound the regret.
The following inequality holds with a probability of at least
By Chebyshev’s inequality, we have
Putting the RHS as and solving it for completes the proof. ∎
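The last step of this proof is a standard inversion of Chebyshev's inequality. Writing eps for the deviation and delta for the failure probability (symbol names assumed here), the bound P(|X - E[X]| >= eps) <= Var[X] / eps^2 is set equal to delta and solved for eps, giving eps = sqrt(Var[X] / delta). The sketch below computes this radius and compares it with an empirical tail frequency on Gaussian samples. The variance, delta, and sample size are illustrative.

```python
import math
import random


def chebyshev_radius(variance, delta):
    """Deviation eps such that variance / eps**2 == delta."""
    return math.sqrt(variance / delta)


def empirical_tail(samples, mean, eps):
    """Fraction of samples at least eps away from the mean."""
    return sum(abs(x - mean) >= eps for x in samples) / len(samples)


rng = random.Random(0)  # fixed seed for reproducibility
variance, delta = 4.0, 0.25
eps = chebyshev_radius(variance, delta)
samples = [rng.gauss(0.0, math.sqrt(variance)) for _ in range(10000)]
tail = empirical_tail(samples, 0.0, eps)
print(eps, tail)
```

With probability at least 1 - delta the deviation stays below `eps`. The bound is distribution-free and therefore typically loose, as the small empirical tail for Gaussian samples illustrates.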
The following inequality holds with a probability of at least
By Chebyshev’s inequality, we have