1 Introduction
Consider the following unsupervised ensemble regression setup: the only observations are an m × n matrix of real-valued predictions made by m different regressors or experts f_1, …, f_m on a set of n unlabeled samples x_1, …, x_n. There is no a priori knowledge of the accuracy of the experts and no labeled data to estimate it. Given only the above observed data and minimal knowledge about the unobserved response Y, such as its mean and variance, is it possible to (i)
rank the regressors, say by their mean squared error, or at least detect the most and least accurate ones? and (ii) construct an ensemble predictor for the unobserved continuous responses, more accurate than both the individual predictors and simple ensemble strategies such as their mean or median?

Our motivation for studying this problem comes from several application domains where such scenarios naturally arise. Two such domains are biology and medicine, where in recent years there have been extensive collaborative efforts to solve challenging prediction problems; see for example the past and ongoing DREAM competitions (www.dreamchallenges.org). Here, multiple participants construct prediction models based on published labeled data, which are then evaluated on held-out data whose statistical distribution may differ significantly from the training one. A key question is whether one can provide more accurate answers than those of the individual participants by cleverly combining their prediction models. In Section 5 we present one such example, where competitors had to predict the concentrations of multiple phosphoproteins in various cancer cell lines Hill et al. (2016a). Understanding the causal relationships between these proteins is important, as it may explain variation in disease phenotypes or therapeutic response Hill et al. (2016b). A second application comes from regression problems in computer vision. A specific example, also described in Section 5, is accurate estimation of the bounding box around detected objects in images by combining several pre-constructed deep neural networks.
The regression problem we consider in this paper is a particular instance of unsupervised ensemble learning. Motivated in part by crowdsourced labeling tasks, previous works on unsupervised ensemble learning mostly focused on discrete outputs, considering binary, multi-class or ordinal classification Johnson (1996); Sheng et al. (2008); Whitehill et al. (2009); Raykar et al. (2010); Platanios et al. (2014, 2016); Zhou et al. (2012). Dawid and Skene (1979) were among the first to consider the problem of unsupervised ensemble classification. Their approach was based on the assumption that experts make independent errors conditioned on the unobserved true class label. Even for this simple model, estimating the experts' accuracies and the unknown labels via maximum likelihood is a non-convex problem, typically solved by the expectation-maximization (EM) algorithm. Recently, several authors proposed spectral and tensor-based methods that are computationally efficient and asymptotically consistent Anandkumar et al. (2014); Zhang et al. (2014); Jaffe et al. (2015).

In contrast to the discrete case, nearly all previous works on ensemble regression considered only the supervised setting. Some ensemble methods, such as boosting and random forests, are widely used in practice.
In this work we propose a framework to study unsupervised ensemble regression, focusing on linear aggregation methods. In Section 2, we first review the optimal weights that minimize the mean squared error (MSE) and highlight the key quantities that need to be estimated in an unsupervised setting. Next, in Section 3 we describe related prior work in supervised and unsupervised regression.
Our main contributions appear in Section 4. We propose a framework for unsupervised ensemble regression, based on an analogue of the Dawid and Skene classification model, adapted to the regression setting. Specifically, we assume that the experts make approximately uncorrelated errors with respect to the optimal predictor that minimizes the MSE. We show that if we knew the minimal attainable MSE, then under our assumed model, the accuracies of the experts can be consistently estimated by solving a system of linear equations. Next, based on our theoretical analysis, we develop methods to estimate this minimal MSE, detect the best and worst regressors and derive UPCR, a novel unsupervised principal components ensemble regressor. Section 5 illustrates our methods and the improved accuracy of UPCR over the ensemble mean and median, on a variety of regression problems. These include both problems for which we trained multiple regression algorithms, as well as the two applications mentioned above where the regressors were constructed by a third party and only their predictions were given to us.
Our main findings are that, given only the predictions and the first two moments of the response: (i) our approach is able to distinguish between hard prediction problems, where any linear aggregation of the regressors yields large errors, and feasible problems, where a suitable linear combination of the regressors can accurately estimate the response; (ii) our ranking method is able to reliably detect the most and least accurate experts; and (iii) quite consistently, UPCR performs as well as, and sometimes significantly better than, the mean and median of the regressors. We conclude in Section 6 with a summary and future research directions in unsupervised ensemble regression.
2 Problem Setup
Consider a regression problem with a continuous response Y and explanatory features X from an instance space 𝒳. Let f_1, …, f_m be pre-constructed regression functions f_i : 𝒳 → ℝ, interchangeably also called experts, and let x_1, …, x_n be i.i.d. samples from the marginal distribution of X. We consider the following unsupervised ensemble regression setting, in which the only observed data is the m × n matrix of predictions

    Z_ij = f_i(x_j),   i = 1, …, m,   j = 1, …, n.   (1)
In particular, there are no labeled data pairs (x_j, y_j) and no a priori knowledge of the accuracy of the regressors.

Given only the matrix (1), and explicit knowledge of the first two moments μ_Y = E[Y] and σ_Y² = Var(Y), we ask whether it is possible to: (i) estimate the accuracies of the experts, or at least identify the best and worst of them; and (ii) accurately estimate the responses y_1, …, y_n by an ensemble method whose inputs are the predictions of f_1, …, f_m. As we explain below, knowing the first two moments of Y seems necessary, as otherwise the data matrix (1) can be arbitrarily shifted and scaled. Such knowledge is reasonable in various settings, for example from past experience, previous observations or physical principles.
Following the literature on supervised ensemble regression, we consider linear ensemble learners. Specifically, we restrict ourselves to the following subclass
    f̂(x; w) = μ_Y + Σ_{i=1}^m w_i (f_i(x) − μ_i),   (2)

where μ_Y and σ_Y are assumed known, and μ_i = E[f_i(X)]. Note that in this subclass, for any vector w ∈ ℝ^m, E[f̂(X; w)] = μ_Y. While μ_i is typically unknown, it can be accurately estimated by the empirical mean of the predictions of f_i in Eq. (1), provided n ≫ 1.

As our risk measure, we use the popular mean squared error MSE(w) = E[(Y − f̂(X; w))²]. For completeness, we first review the optimal weights under this risk and describe several supervised ensemble methods that estimate them.
Optimal Weights.
Let Σ be the covariance matrix of the m regressors, with elements

    Σ_ij = E[(f_i(X) − μ_i)(f_j(X) − μ_j)],   (3)

and let ρ be the vector of covariances between the individual regressors and the true response,

    ρ_i = Cov(f_i(X), Y) = E[(f_i(X) − μ_i)(Y − μ_Y)].   (4)

Let w* be a weight vector that minimizes the MSE,

    w* = argmin_{w ∈ ℝ^m} E[(Y − f̂(X; w))²].   (5)
Then it is easy to show that:
Lemma 1.
The weights w* satisfy

    w* = Σ⁻¹ ρ.   (6)
Note that w* depends only on Σ and ρ. If the ensemble regressors are linearly independent, then Σ is invertible and w* is unique. In our unsupervised scenario, the matrix Σ can be estimated from the predictions Z. In contrast, estimating ρ directly from its definition in Eq. (4) requires labeled data. A key challenge in unsupervised ensemble regression is thus to estimate ρ without any labeled data.
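Lemma 1 can be checked numerically. The following sketch uses synthetic data (the expert noise levels and sample size are our own illustrative assumptions, not from the paper): it estimates Σ and ρ from samples, using the labels only to form ρ, which is exactly the supervised information that is unavailable in our setting.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 5000

# Hypothetical synthetic setup: a centered response y and m noisy experts.
y = rng.normal(0.0, 1.0, size=n)
noise_sd = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
Z = y[None, :] + noise_sd[:, None] * rng.normal(size=(m, n))  # m x n predictions

# Sample versions of the quantities in Eqs. (3)-(4); note that rho uses the
# labels y, precisely what the unsupervised setting does not provide.
Zc = Z - Z.mean(axis=1, keepdims=True)
Sigma = Zc @ Zc.T / n            # covariance of the regressors, Eq. (3)
rho = Zc @ (y - y.mean()) / n    # covariance with the response, Eq. (4)

w_star = np.linalg.solve(Sigma, rho)   # optimal weights, Eq. (6)

def mse(pred):
    return float(np.mean((pred - y) ** 2))

mse_mean = mse(Z.mean(axis=0))
mse_opt = mse(w_star @ Zc + y.mean())
```

On this toy ensemble, the optimally weighted combination downweights the noisier experts and achieves a markedly smaller MSE than the simple average.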
3 Previous Work
This section provides a brief overview of prior art: first methods for unsupervised ensemble regression, and then two supervised ensemble regression methods that are related to our approach. We conclude this section with the popular Dawid-Skene model of unsupervised ensemble classification, also relevant to our work.
3.1 Unsupervised Ensemble Regression
Whereas many works considered unsupervised ensemble classification, far fewer studied the regression case. Donmez et al. (2010) proposed a general framework called unsupervised-supervised learning. In the case of regression, they assumed that the marginal probability density function of the response Y is known and that the regressors follow a known parametric model with parameter θ. In this setup, given only unlabeled data, θ can be estimated by maximum likelihood. In contrast, our approach is far more general, as we do not assume a parametric model, nor knowledge of the full marginal density of Y.

More closely related is the recent work of Wu et al. (2016), which in turn is based on Ionita-Laza et al. (2016) and Parisi et al. (2014). Here, the authors compute the leading eigenvector of the covariance of the m regressors, and use it both to detect inaccurate regressors and to determine the weights of the accurate ones. However, as Wu et al. (2016) themselves write, this relation between the leading eigenvector and regressor accuracy "is based on intuition, and we do not have a rigorous mathematical proof so far". Our work provides solid theoretical support for a variant of this spectral approach.

3.2 Supervised Ensemble Regression
As reviewed by Mendes-Moreira et al. (2012), quite a few supervised ensemble regressors have been proposed over the past 30 years. These can be broadly divided into two groups. Methods in the first group retrain a basic regression algorithm multiple times on different subsets of the labeled data, possibly also assigning weights to the various labeled instances. Examples include stacking Wolpert (1992); Breiman (1996); Leblanc and Tibshirani (1996), random forests Breiman (2001) and boosting Freund and Schapire (1995); Friedman et al. (2000). Methods in the second group, described next, combine the predictions of pre-constructed regressors without retraining them.
3.3 Generalized Ensemble Method
Perrone and Cooper (1992) were among the first to consider supervised ensemble regression. They defined the misfit of predictor f_i as m_i(x) = y(x) − f_i(x), and proposed the Generalized Ensemble Method (GEM), with

    f̂_GEM(x) = Σ_{i=1}^m w_i f_i(x),   Σ_{i=1}^m w_i = 1.

The corresponding weights that minimize the MSE are

    w_i = Σ_j (C⁻¹)_ij / Σ_{k,l} (C⁻¹)_kl,   (7)

where C is the misfit population covariance matrix, with entries

    C_ij = E[m_i(X) m_j(X)].   (8)
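A minimal sketch of GEM, assuming the misfit covariance is estimated from a labeled set as in the text (the synthetic experts below are our own illustration):

```python
import numpy as np

def gem_weights(F, y):
    """Sketch of GEM: weights from the inverse misfit covariance of Eq. (8),
    normalized to sum to one as in Eq. (7). F is m x n, y has length n."""
    M = y[None, :] - F                  # misfits m_i(x_j) = y(x_j) - f_i(x_j)
    C = M @ M.T / F.shape[1]            # misfit covariance, Eq. (8)
    u = np.linalg.solve(C, np.ones(F.shape[0]))   # row sums of C^{-1}
    return u / u.sum()                  # Eq. (7)

# Toy labeled set: three experts with independent, heteroscedastic misfits.
rng = np.random.default_rng(0)
n = 4000
y = rng.normal(size=n)
sd = np.array([0.2, 0.5, 1.0])
F = y[None, :] + sd[:, None] * rng.normal(size=(3, n))
w = gem_weights(F, y)
mse_gem = float(np.mean((w @ F - y) ** 2))
mse_avg = float(np.mean((F.mean(axis=0) - y) ** 2))
```

With independent misfits, the GEM weights are approximately proportional to the inverse misfit variances, so the least noisy expert receives the largest weight.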
Perrone and Cooper (1992) proposed to estimate the unknown matrix C, and consequently the weights w, using a labeled set. Unfortunately, in many practical scenarios multicollinearity between the m regressors leads to an ill-conditioned matrix C that cannot be robustly inverted.

3.4 PCR*
A common approach to handling ill-conditioned multivariate problems is via principal component regression Jolliffe (2002). In the context of supervised ensemble learning, Merz and Pazzani (1999) suggested such a method, denoted PCR*. Given a labeled set, let Σ̂ be the sample covariance matrix of the m regressors, and let v_1, …, v_K be its top K leading eigenvectors. Merz and Pazzani (1999) proposed a weight vector of the form w = Σ_{k=1}^K α_k v_k, with coefficients α_k determined by least squares regression over the training set. The number of principal components K is chosen by minimizing the k-fold cross-validation error.
In the common scenario where some ensemble regressors are highly correlated, the matrix Σ̂ is ill-conditioned. The GEM estimator, which inverts it, then yields unstable predictions. In contrast, PCR* with a small number of components can be viewed as a regularized method, providing stability and robustness. In a supervised setting, Merz and Pazzani (1999) found PCR* to outperform GEM.
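The following is a hedged sketch of a PCR*-style combiner (the centering details and the choice of K are our assumptions; the original method selects K by cross-validation):

```python
import numpy as np

def pcr_star(Z, y, K=2):
    """PCR*-style supervised combiner: regress y on the projections of the
    centered predictions onto the top-K eigenvectors of the regressors'
    sample covariance, yielding weights w = sum_k alpha_k v_k."""
    mu = Z.mean(axis=1, keepdims=True)
    Zc = Z - mu
    Sigma_hat = Zc @ Zc.T / Z.shape[1]
    _, eigvecs = np.linalg.eigh(Sigma_hat)     # eigenvalues in ascending order
    V = eigvecs[:, ::-1][:, :K]                # top-K eigenvectors
    T = V.T @ Zc                               # K x n projected predictions
    alpha, *_ = np.linalg.lstsq(T.T, y - y.mean(), rcond=None)
    w = V @ alpha
    predict = lambda Znew: w @ (Znew - mu) + y.mean()
    return w, predict

# Toy usage with strongly correlated experts (ill-conditioned covariance).
rng = np.random.default_rng(0)
n = 4000
y = rng.normal(size=n)
common = rng.normal(size=n)                    # shared error component
Z = y[None, :] + 0.5 * common[None, :] + 0.1 * rng.normal(size=(4, n))
w, predict = pcr_star(Z, y, K=2)
mse_pcr = float(np.mean((predict(Z) - y) ** 2))
```

Because the regression is carried out in the low-dimensional eigenspace, the procedure never inverts the full, nearly singular covariance matrix, which is precisely the source of GEM's instability.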
3.5 Unsupervised Ensemble Classification
The simplest model for unsupervised ensemble classification, going back to Dawid and Skene (1979), is that conditional on the label Y, the m classifiers make independent errors,

    Pr(f_1(X), …, f_m(X) | Y) = Π_{i=1}^m Pr(f_i(X) | Y).   (9)
Dawid and Skene (1979) estimated the classifier accuracies and the labels by the EM method. In recent years several authors developed computationally efficient and rate optimal methods to estimate these quantities Anandkumar et al. (2014); Zhang et al. (2014); Jaffe et al. (2015).
To the best of our knowledge, our work is the first to propose an analogue of this assumption to the regression case, rigorously study it, and consequently derive corresponding unsupervised ensemble regression schemes.
4 Unsupervised Ensemble Regression
Given only the predictions Z, the simplest unsupervised approach to estimate the response at an instance x_j is to average the regressors,

    f̂_mean(x_j) = (1/m) Σ_{i=1}^m f_i(x_j).

Averaging is the optimal linear estimator when all regressors make independent zero-mean errors of equal variance. A more robust but non-linear method is the median,

    f̂_med(x_j) = median{f_1(x_j), …, f_m(x_j)}.

Averaging and median are naïve estimators in the sense that the prediction at each x_j depends only on the j-th column of Z and does not depend on the other observations f_i(x_k), k ≠ j.
As we show theoretically below and illustrate empirically in Section 5, under some reasonable assumptions, one can do significantly better than the ensemble mean and median by analyzing all the data and in particular the covariance matrix of the regressors.
Specifically, we propose a novel framework to study unsupervised ensemble regression, based on the assumption that the experts make approximately uncorrelated errors with respect to the optimal predictor. We develop methods to detect the best and worst regressors and derive UPCR, a novel unsupervised principal components ensemble regressor. Similar to Merz and Pazzani (1999), the weight vector of UPCR is a linear combination of the top few eigenvectors of the sample covariance Σ̂ (typically just one or two). The key novelty is that we estimate the coefficients in a fully unsupervised manner.
To this end, we do assume knowledge of the first two moments of Y. Such knowledge seems inevitable, as otherwise the observed data may be arbitrarily shifted and scaled without changing the correlations of the regressors. Knowing the first moment μ_Y allows us to estimate the bias of each regressor by its mean over the n unlabeled samples,

    b̂_i = (1/n) Σ_{j=1}^n f_i(x_j) − μ_Y.

Knowledge of σ_Y² allows a rough estimate of the accuracy of the regressors: a very accurate regressor f_i must have a bias close to zero and a variance close to σ_Y², whereas a regressor whose bias is large relative to σ_Y, or whose variance is far from σ_Y², must have a large error.

In what follows we consider predicting the mean-centered responses y_j − μ_Y by a linear combination of the mean-centered predictors f_i − μ_i. We thus work with the mean-centered matrix

    Z̃_ij = f_i(x_j) − (1/n) Σ_{k=1}^n f_i(x_k).

This is equivalent to assuming that μ_Y = 0 and that all regressors are unbiased.
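In code, the bias estimates and the centered matrix Z̃ are a one-liner each (a sketch; μ_Y is assumed given, as in the text):

```python
import numpy as np

def center_predictions(Z, mu_Y):
    """Estimate each regressor's bias by its empirical mean minus mu_Y,
    and return the mean-centered prediction matrix used in what follows."""
    b_hat = Z.mean(axis=1) - mu_Y                 # estimated biases b_i
    Z_tilde = Z - Z.mean(axis=1, keepdims=True)   # centered matrix
    return b_hat, Z_tilde

# Tiny example: the second regressor is biased upward by 2.
Z = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0]])
b_hat, Z_tilde = center_predictions(Z, mu_Y=2.0)
```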
4.1 Statistically Independent Errors
As discussed in Section 2, in light of the optimal weights in Eq. (6), the key challenge in unsupervised ensemble regression is to estimate the vector ρ of Eq. (4) without any labeled data.
To this end, we propose the following regression analogue of the Dawid-Skene assumption of conditionally independent experts. Recall that when the risk function is the MSE, the optimal regressor is the conditional mean,

    g(x) = E[Y | X = x].

Its mean is E[g(X)] = μ_Y = 0, and its MSE is σ_Y² − μ₂, where

    μ₂ = E[g(X)²].   (10)

For each regressor, write c_i = E[f_i(X) g(X)]. Since f_i is mean centered and E[f_i(X) Y] = E[f_i(X) g(X)], the covariance ρ_i simplifies to

    ρ_i = c_i.   (11)

Similarly, the MSE of regressor f_i is

    MSE(f_i) = σ_Y² − 2 c_i + Σ_ii.   (12)
In this notation, the challenge is thus to estimate μ₂ and the vector c = (c_1, …, c_m). Inspired by Eq. (9) in the case of classification, we assume the regressors make independent errors with respect to g (strictly speaking, the assumption is that the deviations from g are uncorrelated, not necessarily independent), namely that

    E[(f_i(X) − g(X))(f_j(X) − g(X))] = 0,   for all i ≠ j.   (13)

This assumption is reasonable, for example, when the regressors were trained independently and are rich enough to approximate the conditional mean g well. Note that when the response is perfectly predictable from the features, then Y = g(X), and our assumption then states that the regressors make independent errors with respect to the response Y. This can be viewed as the regression equivalent of the Dawid-Skene model in classification.
Next, we consider how to estimate the values c_i under the independent error assumption of Eq. (13). Suppose for a moment that the value of μ₂ of Eq. (10) were known; we discuss how to estimate it in the next section. As the following theorem shows, in this case we can consistently estimate c by solving a system of linear equations.
Theorem 1.
Assume that the given m regressors make pairwise independent errors with respect to the conditional mean. If μ₂ is known, then given only the data matrix Z, we can consistently estimate the vector ρ at rate O_P(n^(−1/2)).
Proof.
It is instructive to first consider the population setting, n → ∞. Here, under assumption (13), the off-diagonal entries of the population covariance Σ are

    Σ_ij = c_i + c_j − μ₂,   i ≠ j.   (14)

Since Σ is symmetric, these off-diagonal entries provide m(m−1)/2 linear equations for the m unknown variables (c_1, …, c_m). Thus, if m ≥ 3 there are enough linearly independent equations to uniquely recover c. The vector ρ can then be computed from Eq. (11).

In practice, the population matrix Σ is unknown. However, given the matrix Z, we may estimate it by the sample covariance Σ̂. Since ‖Σ̂ − Σ‖ = O_P(n^(−1/2)), estimating c by least squares yields a consistent estimator with asymptotic error O_P(n^(−1/2)). ∎
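The construction in the proof is easy to implement. The following sketch simulates a model in which assumption (13) holds by construction (our own synthetic instance), builds the linear system from the off-diagonal entries of Σ̂ as in Eq. (14), and recovers c by least squares, assuming μ₂ known:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 6, 20000

# Synthetic instance: g plays the role of the (centered) conditional mean,
# and each expert deviates from it with mutually independent errors.
g = rng.normal(0.0, 1.0, size=n)
mu2 = float(np.mean(g ** 2))                 # empirical stand-in for Eq. (10)
delta_sd = rng.uniform(0.2, 1.0, size=m)
Z = g[None, :] + delta_sd[:, None] * rng.normal(size=(m, n))

# The off-diagonal entries of the sample covariance satisfy, approximately,
# Sigma_ij = c_i + c_j - mu2 (Eq. 14); solve this linear system for c.
Zc = Z - Z.mean(axis=1, keepdims=True)
Sigma_hat = Zc @ Zc.T / n
rows, cols = np.triu_indices(m, k=1)
A = np.zeros((len(rows), m))
A[np.arange(len(rows)), rows] = 1.0
A[np.arange(len(rows)), cols] = 1.0
b = Sigma_hat[rows, cols] + mu2
c_hat, *_ = np.linalg.lstsq(A, b, rcond=None)

# Here every expert equals g plus independent noise, so c_i = E[f_i g] = mu2,
# and the least-squares solution should recover that value for every entry.
```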
Remark 1.
In practice, assumption (13), that all pairs of regressors make independent errors, may be strongly violated, at least for some pairs. To be robust to deviations from this assumption, one may choose a suitable loss function L and solve the optimization problem

    min_{c ∈ ℝ^m} Σ_{i<j} L(Σ̂_ij − c_i − c_j + μ₂).   (15)

In our experiments, we considered both the absolute loss and the standard squared loss.
4.2 Unsupervised PCR
The analysis above assumed knowledge of μ₂, or equivalently of the minimal attainable MSE of the regression problem at hand. Clearly, this would seldom be known to the practitioner. Further, any guess of μ₂ gives a valid solution. Specifically, let c(μ̃₂) be the solution of (15) with an assumed value μ̃₂. Then, due to the additive structure inside the parentheses in Eq. (15), regardless of the loss function L, we have c(μ̃₂) = c + Δ 1, where Δ = (μ̃₂ − μ₂)/2. Similarly, by Eq. (11),

    ρ̂(μ̃₂) = ρ + ((μ̃₂ − μ₂)/2) 1.   (16)
What is needed is thus a model selection criterion that can accurately estimate the value of μ₂, given the family of possible solutions ρ̂(μ̃₂).
To motivate our proposed estimator of μ₂, let us first analyze the model of the previous section, but with the additional assumption that all regressors are fairly close to the optimal conditional mean g. Namely, for analysis purposes, we scale the deviations δ_i = f_i − g by a parameter ε,

    f_i(x) = g(x) + ε δ_i(x),   (17)

and study the behaviour of various quantities as a function of ε. Specifically, under Eq. (17), the population covariance of the regressors takes the form

    Σ = μ₂ 1 1ᵀ + ε (a 1ᵀ + 1 aᵀ) + ε² D,

where a_i = E[g(X) δ_i(X)] and D is a diagonal matrix with entries D_ii = E[δ_i(X)²]. The following lemma characterizes the leading eigenvalue and eigenvector of Σ as ε → 0.

Lemma 2.

Let λ₁ be the largest eigenvalue of Σ and v₁ its corresponding unit-norm eigenvector. Then, as ε → 0,

    λ₁ = m μ₂ + 2 ε m ā + O(ε²),   (18)

    v₁ = (1/√m) (1 + (ε/μ₂)(a − ā 1)) + O(ε²),   (19)

where ā = (1/m) Σ_i a_i.
Several insights can be gained from this lemma. First, at ε = 0 the matrix Σ = μ₂ 1 1ᵀ is rank one, with a single non-zero eigenvalue m μ₂ and corresponding eigenvector 1/√m. Hence, if the regressors are all very close to g, their population covariance matrix is nearly rank one and very ill-conditioned. Even with an accurate estimate of μ₂, and consequently of ρ, inverting Eq. (6) to estimate w* would then be extremely unstable.
Second, under the model (17), ρ = μ₂ 1 + ε a. Comparing this to Eq. (19), the vector ρ and the leading eigenvector v₁, properly scaled, are nearly identical, up to a small shift by ε ā 1 and up to O(ε²) terms. Moreover, up to O(ε²) terms, the matrix Σ has rank two, spanned by the two vectors 1 and a. Hence, up to O(ε²) terms, the true vector w* can be written as a linear combination of the first two eigenvectors of Σ. Thus, even though the matrix Σ is ill-conditioned, a principal component approach with just K = 1 or 2 components can provide an excellent approximation of the optimal weight vector w*. While our focus is on unsupervised ensembles, this analysis provides rigorous theoretical support for the PCR* method of Merz and Pazzani (1999), a result which may be of independent interest for supervised ensemble learning.
Third, since by Eq. (12) MSE(f_i) = σ_Y² − 2 c_i + Σ_ii, the worst and best regressors may be detected by the largest and smallest entries of the vector of estimated MSEs, computed from the estimated vector ĉ.
Lemma 2 suggests several ways to estimate the unknown quantity μ₂. By Eq. (18), one option is μ̂₂ = λ₁/m. Under our assumed model, this would incur an error O(ε). Another option, which we found works better in practice, is to consider the relation between ρ and the top eigenvector of Σ̂, scaled to squared norm m. Specifically, we estimate μ₂ by minimizing the following residual,

    μ̂₂ = argmin_{μ̃₂ ≥ 0} ‖ ρ̂(μ̃₂) − μ̃₂ √m v₁ ‖,   (20)

where ρ̂(μ̃₂) is given in Eq. (16) and v₁ is the unit-norm leading eigenvector of Σ̂. From the estimate μ̂₂, the weight vector of UPCR is

    ŵ = (v₁ᵀ ρ̂(μ̂₂) / λ₁) v₁.   (21)
A sketch of our proposed scheme appears in Algorithm 1.
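Since Algorithm 1 itself is not reproduced in this excerpt, the following is a hedged end-to-end sketch assembled from the text: solve the system of Eq. (14) for each candidate μ̃₂ on a grid, select μ̂₂ by a residual against the leading eigenvector as in Eq. (20), and form the weights of Eq. (21). The grid search, the exact residual normalization, and the eigenvector sign convention are our own assumptions.

```python
import numpy as np

def upcr(Z, mu2_grid):
    """Sketch of UPCR: for each candidate mu2 value t, solve Eq. (14) for
    c(t) by least squares, set rho(t) = c(t) (Eq. 11), pick the t minimizing
    a residual against the leading eigenvector (in the spirit of Eq. 20),
    and form the weights of Eq. (21)."""
    m, n = Z.shape
    Zc = Z - Z.mean(axis=1, keepdims=True)
    Sigma = Zc @ Zc.T / n
    lam, vecs = np.linalg.eigh(Sigma)
    lam1 = lam[-1]
    v1 = vecs[:, -1]
    v1 = v1 * np.sign(v1.sum())            # fix the eigenvector's sign

    rows, cols = np.triu_indices(m, k=1)
    A = np.zeros((len(rows), m))
    A[np.arange(len(rows)), rows] = 1.0
    A[np.arange(len(rows)), cols] = 1.0

    best_t, best_rho, best_res = None, None, np.inf
    for t in mu2_grid:
        c_t, *_ = np.linalg.lstsq(A, Sigma[rows, cols] + t, rcond=None)
        res = np.linalg.norm(c_t - t * np.sqrt(m) * v1)   # residual, Eq. (20)
        if res < best_res:
            best_t, best_rho, best_res = t, c_t, res
    w = (v1 @ best_rho) / lam1 * v1        # UPCR weights, Eq. (21)
    return w, best_t

# Toy usage on the independent-error model, with a grid around the truth.
rng = np.random.default_rng(2)
m, n = 6, 20000
g = rng.normal(size=n)                     # centered conditional mean, mu2 ~ 1
Z = g[None, :] + rng.uniform(0.2, 1.0, size=(m, 1)) * rng.normal(size=(m, n))
w, mu2_hat = upcr(Z, np.linspace(0.5, 1.5, 21))
y_hat = w @ (Z - Z.mean(axis=1, keepdims=True))
```

On this toy model the selected μ̂₂ lands near the true value, and the resulting predictor is strongly correlated with the unobserved conditional mean, using no labels at all.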
4.3 Practical Issues
Before illustrating the competitive performance of UPCR, we discuss several important practical issues that need to be addressed when handling realworld ensembles, whose individual regressors may not satisfy our assumptions.
First, when the true value μ₂ ≪ σ_Y², the regression problem at hand is very difficult, and no linear combination of the predictors can achieve a small error. If our estimate satisfies μ̂₂/σ_Y² < τ for some small threshold τ, say 0.1, this is an indication of such a difficult problem. In this case we stop and do not attempt to construct an ensemble learner.
Second, even when accurate prediction is possible, in our experience, if some regressors are far less accurate than others, then it is important to detect them, exclude them from the ensemble, and recompute the various quantities after their removal. However, in the rare cases where only very few regressors remained after this removal, we found it better to compute their simple average instead of Eq. (21).
Finally, if the second eigenvalue λ₂ of Σ̂ is not extremely small, then it is beneficial to project the vector ρ̂ onto the first two eigenvectors of Σ̂. In our experiments we did so whenever λ₂ exceeded a threshold, in which case Eq. (21) is replaced by its two-component analogue,

    ŵ = Σ_{k=1,2} (v_kᵀ ρ̂(μ̂₂) / λ_k) v_k.
5 Experiments
We illustrate the performance of UPCR on various real world regression problems. These include problems for which we trained multiple regression algorithms, and two applications where the regressors were constructed by a third party and only their predictions were given to us.
We compare UPCR to the ensemble mean and median, as well as to a linear oracle regressor of the form (2), which has access to all the response values y_1, …, y_n. It determines its weights by ordinary least squares over all n samples,

    w_oracle = argmin_w Σ_{j=1}^n (y_j − f̂(x_j; w))².   (22)

We denote the normalized MSE of the oracle by MSE_or.
We divide the regression problems into three difficulty levels according to the normalized oracle MSE: (i) problems where it is small, so accurate prediction is possible by a linear combination of the regressors; (ii) problems where it is moderate, posing a challenging regression task; and (iii) problems where it is large, so the experts provide very little, if any, information on Y.
We start with the following basic question: given only Z and the first two moments of Y, can we roughly estimate the difficulty level of our problem? If it belongs to level (i) or (ii), is it possible to detect the most and least accurate regressors? Finally, can we construct a linear combination at least as accurate as the mean or median?
5.1 Manually Crafted Ensembles
With precise details appearing in the supplement, we considered 18 different prediction tasks, including energy output prediction in a power plant, flight delays, basketball scoring and more. Each dataset was randomly split into samples used to train 10 different regression algorithms and remaining samples used to construct the observations Z; see Table 3 in the supplement. The regressors included ridge regression, SVR, kernel regression and decision trees, among others.
Table 1 in the supplement shows the MSE of UPCR, mean and median, averaged over 20 repetitions, each with different random splits into train and test samples. On several datasets, UPCR obtained a significantly lower MSE. With further details in the supplement, here we highlight some of our key results. We start by estimating μ₂ and classifying the problems by difficulty level. Fig. 1 shows this estimation procedure on three datasets. The x-axis is the assumed value μ̃₂, normalized by σ_Y². The black curve is the unobserved MSE obtained by the weight vector of Eq. (21) with assumed value μ̃₂. The red curve is the computed residual of Eq. (20), and the vertical line is the estimated μ̂₂. Our approach is indeed able to correctly detect the difficulty levels of these problems and to estimate a value μ̂₂ whose corresponding MSE is not too far from the minimal achievable over the candidate values μ̃₂. Fig. 7 in the supplement shows the estimated vs. the true μ₂. For easy problems, the agreement is remarkable.
Next, we evaluated the ability to detect the most accurate regressor in the ensemble. We measured the excess risk of selecting the regressor with the smallest estimated MSE, compared to the best regressor, which is unknown. Additionally, we measured the excess risk of selecting the single regressor with the greatest corresponding entry in the leading eigenvector v₁ of Σ̂. The details are given in Table 2 of the supplement. Our experiments show that in most cases choosing the predictor with the lowest estimated MSE outperforms choosing the one with the largest entry in v₁.
Fig. 2 shows the effectiveness of detecting inaccurate regressors by pruning those flagged by their estimated bias or estimated MSE. Finally, Fig. 3 illustrates the advantages of UPCR over the mean and median on problems of easy to moderate difficulty.
5.2 HPN-DREAM Challenge Experiment
Next, we consider real-world problems where the ensemble regressors were constructed by a third party. The first problem came from the HPN-DREAM breast cancer network inference challenge Hill et al. (2016a). Here, participants were asked to predict the time-varying concentrations of 4 proteins after the introduction of an inhibitor. We were given the predictions of the participating models on the test instances, and constructed a separate UPCR model for each protein. Fig. 2 demonstrates the success of our method in detecting accurate regressors and removing inaccurate ones. Fig. 4 shows that UPCR outperformed the mean and median on 3 of the 4 proteins. We note that for all four proteins the single best model had MSE comparable to UPCR; however, this model is unknown. For three of the four proteins, UPCR had a smaller MSE than that of the single model estimated as being the most accurate.
5.3 Bounding Box Experiment
Here we were given the predictions of 6 deep learning models, trained by Seematics Inc., on the location of physical objects in images. The models were trained on the PASCAL Visual Object Classes dataset Everingham et al. (2012), whereas the predictions were made on images from the COCO dataset Lin et al. (2014). We focused on three object classes {person, dog, cat}, with each neural network providing the four coordinates of the bounding box. We used UPCR as an ensemble predictor for each coordinate separately, with the mean squared error as our measure of accuracy. An example of the MSE estimation by our method can be seen in Fig. 5, and the accuracy for all object classes and all coordinates in Fig. 6. Results on a few images are in the supplementary material.

6 Summary and Discussion
In this paper we tackled the problem of unsupervised ensemble regression. We presented a framework to explore this problem, based on an independent error assumption. We proposed methods, together with theoretical support, to detect the best and worst regressors and to linearly aggregate them, all in an unsupervised manner. As our theoretical analysis in Section 4 showed, unsupervised ensemble regression is different from the well studied problem of unsupervised ensemble classification, and required different approaches to its solution.
Our work raises several questions. One of them is how to extend our method to a semisupervised setting, in which there is also a limited amount of labeled data. It is also interesting to theoretically understand the relative benefits of labeled versus unlabeled data for ensemble learning.
Another direction for future research is to replace the strict independent error assumption by more complicated yet realistic models of the dependencies between the regressors. In the context of unsupervised classification, Fetaya et al. (2016) relaxed the conditional independence model of Dawid and Skene by introducing an intermediate layer of latent variables. Instead of a rank-one off-diagonal covariance, the matrix in their model had a low-rank structure, which the authors learned by a spectral method. It is interesting to ask whether a similar approach can be developed for an ensemble of regressors.
Acknowledgments
The authors thank Moshe Guttmann and the Seematics team for their help. This research was funded by the Intel Collaborative Research Institute for Computational Intelligence (B.N.) and by NIH grant 1R01HG00838301A1 (Y.K. and B.N.).
References
 Anandkumar et al. (2014) Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. (2014). Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832.
 Breiman (1996) Breiman, L. (1996). Stacked regressions. Machine Learning, 24(1):49–64.
 Breiman (2001) Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
 Cortez et al. (2009) Cortez, P., Cerdeira, A., Almeida, F., Matos, T., and Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553.
 Dawid and Skene (1979) Dawid, A. P. and Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20–28.
 Donmez et al. (2010) Donmez, P., Lebanon, G., and Balasubramanian, K. (2010). Unsupervised supervised learning i: Estimating classification and regression errors without labels. Journal of Machine Learning Research, 11:1323–1351.
 Everingham et al. (2012) Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2012). The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results.

 Fanaee-T and Gama (2014) Fanaee-T, H. and Gama, J. (2014). Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 2(2-3):113–127.
 Fetaya et al. (2016) Fetaya, E., Nadler, B., Jaffe, A., Kluger, Y., and Jiang, T. (2016). Unsupervised ensemble learning with dependent classifiers. In AISTATS, pages 351–360.
 Freund and Schapire (1995) Freund, Y. and Schapire, R. E. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23–37.
 Friedman et al. (2000) Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–407.
 Friedman (1991) Friedman, J. H. (1991). Multivariate adaptive regression splines. The Annals of Statistics, pages 1–67.
 Hill et al. (2016a) Hill, S. M., Heiser, L. M., Cokelaer, T., Unger, M., Nesser, N. K., Carlin, D. E., Zhang, Y., Sokolov, A., Paull, E. O., Wong, C. K., et al. (2016a). Inferring causal molecular networks: empirical assessment through a communitybased effort. Nature methods, 13(4):310–318.
 Hill et al. (2016b) Hill, S. M., Nesser, N. K., JohnsonCamacho, K., Jeffress, M., Johnson, A., Boniface, C., Spencer, S. E., Lu, Y., Heiser, L. M., Lawrence, Y., et al. (2016b). Context specificity in causal signaling networks revealed by phosphoprotein profiling. Cell systems.
 Ionita-Laza et al. (2016) Ionita-Laza, I., McCallum, K., Xu, B., and Buxbaum, J. D. (2016). A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nature Genetics, 48(2):214–220.
 Jaffe et al. (2015) Jaffe, A., Nadler, B., and Kluger, Y. (2015). Estimating the accuracies of multiple classifiers without labeled data. In AISTATS.
 Johnson (1996) Johnson, V. E. (1996). On Bayesian Analysis of Multirater Ordinal Data: An Application to Automated Essay Grading. Journal of the American Statistical Association, 91(433):42–51.
 Jolliffe (2002) Jolliffe, I. (2002). Principal component analysis. John Wiley & Sons, Ltd.
 Kato (1995) Kato, T. (1995). Perturbation Theory of Linear Operators. Springer, Berlin, second edition.
 Lavancier and Rochet (2016) Lavancier, F. and Rochet, P. (2016). A general procedure to combine estimators. Computational Statistics & Data Analysis, 94:175–192.
 Leblanc and Tibshirani (1996) Leblanc, M. and Tibshirani, R. (1996). Combining Estimates in Regression and Classification. Journal of the American Statistical Association, 91(436):1641–1650.
 Lichman (2013) Lichman, M. (2013). UCI machine learning repository.
 Lin et al. (2014) Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755.
 Mendes-Moreira et al. (2012) Mendes-Moreira, J., Soares, C., Jorge, A. M., and Sousa, J. F. D. (2012). Ensemble approaches for regression: A survey. ACM Computing Surveys (CSUR), 45(1):10.
 Merz and Pazzani (1999) Merz, C. and Pazzani, M. (1999). A Principal Components Approach to Combining Regression Estimates. Machine Learning, 36(1–2):9–32.
 Nadler (2008) Nadler, B. (2008). Finite sample approximation results for principal component analysis: A matrix perturbation approach. The Annals of Statistics, pages 2791–2817.
 Parisi et al. (2014) Parisi, F., Strino, F., Nadler, B., and Kluger, Y. (2014). Ranking and combining multiple predictors without labeled data. Proceedings of the National Academy of Sciences of the United States of America, 111(4):1253–8.
 Perrone and Cooper (1992) Perrone, M. P. and Cooper, L. N. (1992). When networks disagree: Ensemble methods for hybrid neural networks. World Scientific.
 Platanios et al. (2014) Platanios, E. A., Blum, A., and Mitchell, T. (2014). Estimating accuracy from unlabeled data. In Uncertainty in Artificial Intelligence, pages 682–691.
 Platanios et al. (2016) Platanios, E. A., Dubey, A., and Mitchell, T. (2016). Estimating accuracy from unlabeled data: A Bayesian approach. In International Conference on Machine Learning, pages 1416–1425.
 Raykar et al. (2010) Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., and Moy, L. (2010). Learning from crowds. Journal of Machine Learning Research, 11(Apr):1297–1322.
 Sheng et al. (2008) Sheng, V. S., Provost, F., and Ipeirotis, P. G. (2008). Get another label? improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 614–622. ACM.
 Timmermann (2006) Timmermann, A. (2006). Forecast combinations. Handbook of economic forecasting, 1:135–196.
 Whitehill et al. (2009) Whitehill, J., Wu, T.-f., Bergsma, J., Movellan, J. R., and Ruvolo, P. L. (2009). Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems, pages 2035–2043.
 Wolpert (1992) Wolpert, D. H. (1992). Stacked generalization. Neural networks, 5(2):241–259.
 Wu et al. (2016) Wu, D., Lawhern, V. J., Gordon, S., Lance, B. J., and Lin, C.-T. (2016). Spectral meta-learner for regression (SMLR) model aggregation: Towards calibrationless brain-computer interface. In IEEE International Conference on Systems, Man, and Cybernetics, pages 743–749.
 Zhang et al. (2014) Zhang, Y., Chen, X., Zhou, D., and Jordan, M. I. (2014). Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Advances in Neural Information Processing Systems, pages 1260–1268.
 Zhou et al. (2012) Zhou, D., Basu, S., Mao, Y., and Platt, J. C. (2012). Learning from the wisdom of crowds by minimax entropy. In Advances in Neural Information Processing Systems, pages 2195–2203.
Appendix A Proofs
Proof of Lemma 2.
The proof follows a perturbation approach similar to the one outlined in Nadler (2008). Since is symmetric and quadratic in , classical results on perturbation theory Kato (1995) imply that in a small neighborhood of , the leading eigenvalue and eigenvector are analytic in . We may thus expand them in a Taylor series,
We insert this expansion into the eigenvector equation and solve the resulting equations at increasing powers of .
The leading-order equation reads , which gives and . Since the eigenvector is defined only up to a multiplicative factor, we conveniently choose the normalization that holds for all . This gives and .
Appendix B Datasets & Results
Dataset  Oracle  UPCR  Mean  Median 

Abalone  0.43 ()  0.49 ()  0.49 ()  0.49 () 
Affairs  0.92 ()  N.A.  0.96 ()  0.94 () 
Basketball  0.28 ()  0.35 ()  0.35 ()  0.36 () 
Bike sharing  0.00 ()  0.00 ()  0.02 ()  0.00 () 
Blog feedback  0.41 ()  0.49 ()  0.50 ()  0.58 () 
Ccpp  0.06 ()  0.07 ()  0.07 ()  0.07 () 
Flights AUS  0.33 ()  0.46 ()  0.58 ()  0.66 () 
Flights BOS  0.47 ()  0.58 ()  0.66 ()  0.69 () 
Flights BWI  0.44 ()  0.56 ()  0.71 ()  0.82 () 
Flights HOU  0.40 ()  0.59 ()  0.69 ()  0.75 () 
Flights JFK  0.50 ()  0.78 ()  0.74 ()  0.90 () 
Flights LGA  0.47 ()  0.59 ()  0.70 ()  0.78 () 
Flights longhaul  0.69 ()  0.89 ()  0.86 ()  0.97 () 
Friedman1  0.02 ()  0.13 ()  0.18 ()  0.16 () 
Friedman2  0.00 ()  0.06 ()  0.08 ()  0.07 () 
Friedman3  0.04 ()  0.09 ()  0.20 ()  0.22 () 
Online videos  0.09 ()  0.18 ()  0.22 ()  0.28 () 
Wine quality white  0.60 ()  0.64 ()  0.66 ()  0.69 () 
B.1 Selecting a Single Regressor
Table 2 compares the MSE of the single regressor estimated to be the most accurate against the MSE of the single best regressor, which is unknown in this setting. The following two selection methods were compared: (i) selecting the regressor with the maximum entry
, and (ii) selecting the regressor with the minimal estimated MSE. The experiments were repeated 20 times for each dataset, and means and standard deviations are reported. All values are normalized by
for fair comparison.

Dataset  Oracle MSE  Best Regressor MSE  MSE of  MSE of 
Abalone  0.43 ()  0.45 ()  0.49 ()  0.77 () 
Basketball  0.28 ()  0.32 ()  0.36 ()  0.49 () 
Bike sharing  0.00 ()  0.00 ()  0.00 ()  0.00 () 
Blog feedback  0.41 ()  0.43 ()  0.66 ()  0.62 () 
CCPP  0.06 ()  0.07 ()  0.07 ()  0.09 () 
Flights AUS  0.33 ()  0.48 ()  0.56 ()  0.66 () 
Flights BOS  0.47 ()  0.53 ()  0.61 ()  1.04 () 
Flights BWI  0.44 ()  0.50 ()  0.50 ()  0.87 () 
Flights HOU  0.40 ()  0.52 ()  0.53 ()  1.13 () 
Flights JFK  0.50 ()  0.54 ()  0.64 ()  0.95 () 
Flights LGA  0.47 ()  0.53 ()  0.59 ()  1.04 () 
Flights longhaul  0.69 ()  0.74 ()  0.79 ()  1.17 () 
Friedman1  0.02 ()  0.03 ()  0.24 ()  0.03 () 
Friedman2  0.00 ()  0.01 ()  0.14 ()  0.02 () 
Friedman3  0.04 ()  0.07 ()  0.25 ()  0.14 () 
Online videos  0.09 ()  0.10 ()  0.34 ()  0.17 () 
Wine quality white  0.60 ()  0.62 ()  0.77 ()  1.12 () 
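As a concrete aid, the normalization used in the tables can be sketched in a few lines: each regressor's MSE on the held-out samples is divided by the variance of the response. The labeled-data selection below corresponds to the "Best Regressor MSE" baseline of Table 2; the oracle combination itself is not reproduced in this sketch.

```python
# Sketch of the normalized-MSE evaluation used in the tables. F is the
# m x n matrix of predictions by m regressors on n held-out samples, and
# y holds the true responses (available here only for evaluation).
import numpy as np

def normalized_mses(F, y):
    """Per-regressor MSE divided by the variance of the response."""
    return np.mean((F - y) ** 2, axis=1) / np.var(y)

def best_single_regressor(F, y):
    """Index and normalized MSE of the best single regressor (needs labels)."""
    mses = normalized_mses(F, y)
    i = int(np.argmin(mses))
    return i, float(mses[i])
```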
B.2 Dataset Descriptions
Below is a list of the prediction tasks for which we manually trained ensembles of 10 regressors. Table 3 summarizes the main characteristics of each dataset, and Table 1 contains the mean squared errors of the different approaches, normalized by . The experiments were repeated 20 times, and the means and standard deviations are reported. We used standard Python packages for the regression algorithms with the following parameters: Ridge (), Kernel Regression (kernel chosen by cross-validation among polynomial, RBF, and sigmoid), Lasso (), Orthogonal Matching Pursuit, Linear SVR (), SVR with RBF kernel ( chosen by cross-validation from 0.01, 0.1, 1, 10), Regression Tree (depth 4), Regression Tree (infinite depth), Random Forest (100 trees), and a Bagging Regressor.
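The ensemble above can be sketched with scikit-learn. The regularization strengths and the SVR penalty are not specified in the text, so the values below are illustrative placeholders, and KernelRidge stands in for the kernel-regression model (whose kernel was chosen by cross-validation in the experiments).

```python
# Sketch of the 10-regressor ensemble; hyperparameter values are placeholders.
import numpy as np
from sklearn.linear_model import Ridge, Lasso, OrthogonalMatchingPursuit
from sklearn.svm import LinearSVR, SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from sklearn.kernel_ridge import KernelRidge

def make_ensemble():
    return [
        Ridge(alpha=1.0),                       # placeholder alpha
        KernelRidge(kernel="rbf"),              # kernel chosen by CV in the paper
        Lasso(alpha=0.1),                       # placeholder alpha
        OrthogonalMatchingPursuit(),
        LinearSVR(C=1.0),                       # placeholder C
        SVR(kernel="rbf", C=1.0),               # C chosen by CV in the paper
        DecisionTreeRegressor(max_depth=4),
        DecisionTreeRegressor(max_depth=None),  # "infinite depth"
        RandomForestRegressor(n_estimators=100),
        BaggingRegressor(),
    ]

def predict_matrix(ensemble, X_train, y_train, X_test):
    """Fit each regressor and return the m x n matrix of test predictions."""
    return np.vstack([r.fit(X_train, y_train).predict(X_test)
                      for r in ensemble])
```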
Name  

Abalone  3277  700  7  0.59  0.45  0.431 () 
Affairs  5466  700  7  1.08  0.93  0.922 () 
Basketball  48899  900  9  0.43  0.32  0.281 () 
Bike Sharing  15579  1600  16  0.07  0.00  0.000 () 
Blog Feedback  24197  28000  280  0.64  0.43  0.415 () 
CCPP  8968  400  4  0.10  0.07  0.059 () 
Flights AUS  47595  1000  10  0.76  0.48  0.329 () 
Flights BOS  112705  1000  10  0.84  0.53  0.470 () 
Flights BWI  101665  1000  10  0.85  0.50  0.440 () 
Flights HOU  53044  1000  10  0.87  0.52  0.397 () 
Flights JFK  113960  1000  10  0.89  0.54  0.495 () 
Flights LGA  111911  1000  10  0.86  0.53  0.471 () 
Flights Long Haul  9393  1000  10  1.00  0.73  0.686 () 
Friedman1  18800  1000  10  0.31  0.03  0.024 () 
Friedman2  19400  400  4  0.17  0.01  0.004 () 
Friedman3  19400  400  4  0.35  0.07  0.043 () 
Online Videos  66484  2100  21  0.34  0.10  0.094 () 
Wine Quality White  3598  1100  11  0.79  0.62  0.595 () 


is the number of held-out samples. The input is dimensional, and the same random samples were used to train the different algorithms in the ensemble. is the average regressor error, is the minimal error achieved by a regressor in the ensemble, and is the MSE of the oracle, normalized by , with its standard deviation in parentheses. For each dataset, the split between train and test was performed 20 times, and averages are listed.
Abalone.
A dataset containing features of abalone, where the goal is to predict its age Lichman (2013). archive.ics.uci.edu/ml/datasets/Abalone
Affairs.
A dataset containing features describing an individual such as time at work, time spent with spouse, and time spent with a paramour. The goal here is to predict the time spent in extramarital affairs. statsmodels.sourceforge.net/0.6.0/datasets/generated/fair.html
Basketball.
This dataset contains statistics on NBA players; the task is to predict the number of points a player will score in his next game. The features are: name, venue, team, date, start, pts_ma, min_ma, pts_ma_1, min_ma_1, pts, where start indicates whether or not the player started, pts is the number of points scored, min is the number of minutes played, ma denotes a moving average starting at the beginning of the season, and ma_1 is a moving average with a one-game lag.
Bike Sharing.
Bike sharing service statistics, including weather and seasonal information Fanaee-T and Gama (2014). The prediction task here is the daily and hourly count of rented bikes. archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
Blog Feedback.
Instances in this dataset contain features extracted from blog posts. The task associated with the data is to predict how many comments the post will receive.
Flights.
Information on flights from 2008, where the task is to predict the arrival delay in minutes. The features here are the date, day of the week, scheduled and actual departure times, scheduled arrival time, flight ID, tail number, origin, destination, and distance. Due to its size, we split this dataset into flights originating from specific airports (AUS, BOS, BWI, HOU, JFK, and LGA) and long-haul flights. stat-computing.org/dataexpo/2009/the-data.html
CCPP.
Combined Cycle Power Plant UCI dataset containing physical characteristics such as temperature and humidity. The task here is to predict the net hourly electrical energy output of the plant. archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant
Friedman #1
Motivated by Breiman (1996), we used data simulated according to Friedman (1991). The predictor variables $x_1,\dots,x_{10}$ are independent and uniformly distributed over $[0,1]$. The response is $y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + \varepsilon$ and $\varepsilon \sim \mathcal{N}(0,1)$.
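For reference, the Friedman #1 data can be simulated as below (following Friedman, 1991, where only five of the ten uniform predictors affect the response). The unit noise standard deviation is an assumption, as the exact noise level used in the experiments is not stated here.

```python
# Sketch of the Friedman #1 simulation (Friedman, 1991; Breiman, 1996).
# noise_std=1.0 is an assumed default, not a value stated in the paper.
import numpy as np

def friedman1(n, d=10, noise_std=1.0, seed=0):
    """d i.i.d. U[0,1] predictors; only the first five enter the response."""
    rng = np.random.RandomState(seed)
    X = rng.uniform(0.0, 1.0, size=(n, d))
    y = (10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20.0 * (X[:, 2] - 0.5) ** 2
         + 10.0 * X[:, 3]
         + 5.0 * X[:, 4]
         + noise_std * rng.randn(n))  # additive Gaussian noise
    return X, y
```

scikit-learn's `sklearn.datasets.make_friedman1` implements the same recipe.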
Friedman #2
The second dataset tested by Friedman (1991) simulates the impedance in an alternating current circuit. Here the four predictor variables are uniformly distributed over the ranges $0 \le x_1 \le 100$, $40\pi \le x_2 \le 560\pi$, $0 \le x_3 \le 1$, and $1 \le x_4 \le 11$, respectively. The response is $y = \left(x_1^2 + \left(x_2 x_3 - (x_2 x_4)^{-1}\right)^2\right)^{1/2} + \varepsilon$ with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, where the variance $\sigma^2$ was chosen to provide a 3-to-1 signal-to-noise ratio. For the third dataset in this series, Friedman #3, see the original paper Friedman (1991).
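A minimal sketch of the Friedman #2 simulation follows. The `noise_std` value is a placeholder: the experiments set the noise variance to achieve a 3-to-1 signal-to-noise ratio, which depends on the sample variance of the noiseless signal.

```python
# Sketch of the Friedman #2 simulation (Friedman, 1991).
# noise_std is a placeholder; the paper derives it from a 3-to-1 SNR.
import numpy as np

def friedman2(n, noise_std=100.0, seed=0):
    rng = np.random.RandomState(seed)
    x1 = rng.uniform(0.0, 100.0, n)               # resistance
    x2 = rng.uniform(40.0 * np.pi, 560.0 * np.pi, n)  # angular frequency
    x3 = rng.uniform(0.0, 1.0, n)                 # inductance
    x4 = rng.uniform(1.0, 11.0, n)                # capacitance
    signal = np.sqrt(x1 ** 2 + (x2 * x3 - 1.0 / (x2 * x4)) ** 2)
    y = signal + noise_std * rng.randn(n)
    return np.column_stack([x1, x2, x3, x4]), y
```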
Online Videos.
YouTube video transcoding dataset, where the task is to predict the transcoding time from parameters of the video. archive.ics.uci.edu/ml/datasets/Online+Video+Characteristics+and+Transcoding+Time+Dataset
Wine Quality White.
Predict the quality score (1–10) of white wine based on chemical characteristics, such as acidity and pH level Cortez et al. (2009). archive.ics.uci.edu/ml/datasets/Wine+Quality