Unsupervised Ensemble Regression

03/08/2017 · Omer Dror, et al. · Weizmann Institute of Science, IBM, Yale University

Consider a regression problem where there is no labeled data and the only observations are the predictions f_i(x_j) of m experts f_i over many samples x_j. With no knowledge on the accuracy of the experts, is it still possible to accurately estimate the unknown responses y_j? Can one still detect the least or most accurate experts? In this work we propose a framework to study these questions, based on the assumption that the m experts have uncorrelated deviations from the optimal predictor. Assuming the first two moments of the response are known, we develop methods to detect the best and worst regressors, and derive U-PCR, a novel principal components approach for unsupervised ensemble regression. We provide theoretical support for U-PCR and illustrate its improved accuracy over the ensemble mean and median on a variety of regression problems.


1 Introduction

Consider the following unsupervised ensemble regression setup: the only observations are an $m \times n$ matrix of real-valued predictions $f_i(x_j)$ made by $m$ different regressors or experts $f_1,\ldots,f_m$ on a set of $n$ unlabeled samples $x_1,\ldots,x_n$. There is no a-priori knowledge of the accuracy of the experts and no labeled data to estimate it. Given only the above observed data and minimal knowledge about the unobserved response, such as its mean and variance, is it possible to (i) rank the regressors, say by their mean squared error, or at least detect the most and least accurate ones; and (ii) construct an ensemble predictor for the unobserved continuous responses $y_j$, more accurate than both the individual predictors and simple ensemble strategies such as their mean or median?

Our motivation for studying this problem comes from several application domains where such scenarios naturally arise. Two such domains are biology and medicine, where in recent years there have been extensive collaborative efforts to solve challenging prediction problems; see for example the past and ongoing DREAM challenges (www.dreamchallenges.org). Here, multiple participants construct prediction models based on published labeled data, which are then evaluated on held-out data whose statistical distribution may differ significantly from the training one. A key question is whether one can provide more accurate answers than those of the individual participants by cleverly combining their prediction models. In the experiments of Section 5 we present one such example, where competitors had to predict the concentrations of multiple phosphoproteins in various cancer cell lines Hill et al. (2016a). Understanding the causal relationships between these proteins is important, as it may explain variation in disease phenotypes or therapeutic response Hill et al. (2016b). A second application comes from regression problems in computer vision. A specific example, also described in Section 5, is accurate estimation of the bounding box around detected objects in images by combining several pre-constructed deep neural networks.

The regression problem we consider in this paper is a particular instance of unsupervised ensemble learning. Motivated in part by crowdsourced labeling tasks, previous works on unsupervised ensemble learning mostly focused on discrete outputs, considering binary, multiclass or ordinal classification Johnson (1996); Sheng et al. (2008); Whitehill et al. (2009); Raykar et al. (2010); Platanios et al. (2014, 2016); Zhou et al. (2012). Dawid and Skene (1979) were among the first to consider the problem of unsupervised ensemble classification. Their approach was based on the assumption that experts make independent errors conditioned on the unobserved true class label. Even for this simple model, estimating the experts' accuracies and the unknown labels via maximum likelihood is a non-convex problem, typically solved by the expectation-maximization algorithm. Recently, several authors proposed spectral and tensor based methods that are computationally efficient and asymptotically consistent Anandkumar et al. (2014); Zhang et al. (2014); Jaffe et al. (2015).

In contrast to the discrete case, nearly all previous works on ensemble regression considered only the supervised setting. Some ensemble methods, such as boosting and random forests, are widely used in practice.

In this work we propose a framework to study unsupervised ensemble regression, focusing on linear aggregation methods. In Section 2, we first review the optimal weights that minimize the mean squared error (MSE) and highlight the key quantities that need to be estimated in an unsupervised setting. Next, in Section 3 we describe related prior work in supervised and unsupervised regression.

Our main contributions appear in Section 4. We propose a framework for unsupervised ensemble regression, based on an analogue of the Dawid and Skene classification model, adapted to the regression setting. Specifically, we assume that the experts make approximately uncorrelated errors with respect to the optimal predictor that minimizes the MSE. We show that if we knew the minimal attainable MSE, then under our assumed model, the accuracies of the experts can be consistently estimated by solving a system of linear equations. Next, based on our theoretical analysis, we develop methods to estimate this minimal MSE, detect the best and worst regressors and derive U-PCR, a novel unsupervised principal components ensemble regressor. Section 5 illustrates our methods and the improved accuracy of U-PCR over the ensemble mean and median, on a variety of regression problems. These include both problems for which we trained multiple regression algorithms, as well as the two applications mentioned above where the regressors were constructed by a third party and only their predictions were given to us.

Our main findings are that given only the predictions and the first two moments of the response: (i) our approach is able to distinguish between hard prediction problems, where any linear aggregation of the regressors yields large errors, and feasible problems, where a suitable linear combination of the regressors can accurately estimate the response; (ii) our ranking method is able to reliably detect the most and least accurate experts; and (iii) quite consistently, U-PCR performs as well as, and sometimes significantly better than, the mean and median of the regressors. We conclude in Section 6 with a summary and future research directions in unsupervised ensemble regression.

2 Problem Setup

Consider a regression problem with a continuous response $Y$ and explanatory features $X$ from an instance space $\mathcal{X}$. Let $f_1,\ldots,f_m$ be $m$ pre-constructed regression functions, $f_i:\mathcal{X}\to\mathbb{R}$, interchangeably also called experts, and let $x_1,\ldots,x_n$ be $n$ i.i.d. samples from the marginal distribution of $X$. We consider the following unsupervised ensemble regression setting, in which the only observed data is the $m \times n$ matrix of predictions

$$\big\{ f_i(x_j) \big\}_{\,i=1,\ldots,m;\; j=1,\ldots,n}. \qquad (1)$$

In particular, there are no labeled data pairs $(x_j, y_j)$ and no a-priori knowledge on the accuracy of the regressors.

Given only the matrix (1) and explicit knowledge of the first two moments of $Y$, we ask whether it is possible to: (i) estimate the accuracies of the $m$ experts, or at least identify the best and worst of them; and (ii) accurately estimate the responses $y_j$ by an ensemble method whose input is the predictions of $f_1,\ldots,f_m$. As we explain below, knowing the first two moments of $Y$ seems necessary, as otherwise the data matrix (1) can be arbitrarily shifted and scaled. Such knowledge is reasonable in various settings, for example from past experience, previous observations or physical principles.

Following the literature on supervised ensemble regression, we consider linear ensemble learners. Specifically, we restrict ourselves to the following subclass,

$$\hat{y}_w(x) = \mu_y + \sum_{i=1}^{m} w_i \big( f_i(x) - \mu_i \big), \qquad (2)$$

where $\mu_y = \mathbb{E}[Y]$ and $\sigma_y^2 = \mathrm{Var}(Y)$ are assumed known, and $\mu_i = \mathbb{E}[f_i(X)]$. Note that in this subclass, for any weight vector $w$, $\mathbb{E}[\hat{y}_w(X)] = \mu_y$. While $\mu_i$ is typically unknown, it can be accurately estimated given the predictions of $f_i$ in Eq. (1), provided $n \gg 1$.

As our risk measure, we use the popular mean squared error, $\mathrm{MSE}(\hat{y}) = \mathbb{E}\big[(Y - \hat{y}(X))^2\big]$. For completeness, we first review the optimal weights under this risk and describe several supervised ensemble methods that estimate them.

Optimal Weights.

Let $C \in \mathbb{R}^{m \times m}$ be the covariance matrix of the $m$ regressors, with elements

$$c_{ij} = \mathrm{Cov}\big(f_i(X), f_j(X)\big), \qquad (3)$$

and let $\rho \in \mathbb{R}^{m}$ be the vector of covariances between the individual regressors and the true response,

$$\rho_i = \mathrm{Cov}\big(f_i(X), Y\big). \qquad (4)$$

Let $w^*$ be a weight vector that minimizes the MSE,

$$w^* = \arg\min_{w \in \mathbb{R}^m} \; \mathbb{E}\big[\big(Y - \hat{y}_w(X)\big)^2\big]. \qquad (5)$$

Then it is easy to show that:

Lemma 1.

The optimal weights $w^*$ satisfy

$$C\, w^* = \rho. \qquad (6)$$

Note that $w^*$ depends only on $C$ and $\rho$. If the ensemble regressors are linearly independent, then $C$ is invertible and $w^*$ is unique. In our unsupervised scenario, the matrix $C$ can be estimated from the predictions $f_i(x_j)$. In contrast, estimating $\rho$ directly from its definition in Eq. (4) requires labeled data. A key challenge in unsupervised ensemble regression is thus to estimate $\rho$ without any labeled data.
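To make Lemma 1 concrete, here is a minimal numerical sketch of the linear ensemble class (2) and of solving Eq. (6). The symbols C, rho and w follow the notation above, which is our reconstruction of the garbled formulas; the small ridge term is our own numerical safeguard rather than part of the original formulation.

```python
import numpy as np

def optimal_weights(C, rho, ridge=1e-8):
    """Solve C w = rho, cf. Lemma 1. The ridge term guards against an
    ill-conditioned covariance matrix."""
    m = C.shape[0]
    return np.linalg.solve(C + ridge * np.eye(m), rho)

def linear_ensemble(F, w, mu_y):
    """Linear ensemble of the form (2): mu_y + sum_i w_i (f_i(x) - mean_i),
    where F is the m x n prediction matrix of Eq. (1)."""
    Fc = F - F.mean(axis=1, keepdims=True)
    return mu_y + w @ Fc
```

In the unsupervised setting, C can be replaced by the sample covariance of the predictions, whereas rho is precisely the quantity the remainder of the paper seeks to estimate without labels.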

3 Previous Work

This section provides a brief overview of prior work: first, methods for unsupervised ensemble regression, and then two supervised ensemble regression methods that are related to our approach. We conclude this section with the popular Dawid-Skene model of unsupervised ensemble classification, which is also relevant to our work.

3.1 Unsupervised Ensemble Regression

Whereas many works considered unsupervised ensemble classification, far fewer studied the regression case. Donmez et al. (2010) proposed a general framework called unsupervised-supervised learning. In the case of regression, they assumed that the marginal probability density function of the response is known and that the regressors follow a known parametric model with parameter $\theta$. In this setup, given only unlabeled data, $\theta$ can be estimated by maximum likelihood. In contrast, our approach is far more general, as we do not assume a parametric model, nor knowledge of the full marginal density of the response.

More closely related is the recent work of Wu et al. (2016), which in turn is based on Ionita-Laza et al. (2016) and Parisi et al. (2014). Here, the authors compute the leading eigenvector of the covariance matrix of the regressors, and use it both to detect inaccurate regressors and to determine the weights of the accurate ones. However, as Wu et al. (2016) themselves write, this relation between the leading eigenvector and regressor accuracy "is based on intuition, and we do not have a rigorous mathematical proof so far". Our work provides solid theoretical support for a variant of this spectral approach.

3.2 Supervised Ensemble Regression

As reviewed by Mendes-Moreira et al. (2012), quite a few supervised ensemble regressors were proposed over the past 30 years. These can be broadly divided into two groups. Methods in the first group re-train a basic regression algorithm multiple times on different subsets of the labeled data, possibly also assigning weights to the various labeled instances. Examples include stacking Wolpert (1992); Breiman (1996); Leblanc and Tibshirani (1996), random forest Breiman (2001) and boosting Freund and Schapire (1995); Friedman et al. (2000).

In contrast, ensemble methods in the second group view the regressors as pre-constructed and only estimate the weights of their linear combination. Perrone and Cooper (1992) and Merz and Pazzani (1999) derived two such methods, which we briefly describe below.

While not directly related, there is also extensive literature on supervised combination of forecasts in time series analysis and on methods to combine multiple estimators, see Timmermann (2006); Lavancier and Rochet (2016) and many references therein.

3.3 Generalized Ensemble Method

Perrone and Cooper (1992) were among the first to consider supervised ensemble regression. They defined the misfit of predictor $f_i$ as $e_i(x) = f_i(x) - y(x)$, and proposed the Generalized Ensemble Method (GEM), with $\hat{y}_{\mathrm{GEM}}(x) = \sum_{i=1}^{m} w_i f_i(x)$ and $\sum_i w_i = 1$. The corresponding weights that minimize the MSE are

$$w_i = \frac{\sum_{j=1}^{m} (C_e^{-1})_{ij}}{\sum_{k,l=1}^{m} (C_e^{-1})_{kl}}, \qquad (7)$$

where $C_e$ is the misfit population covariance matrix,

$$(C_e)_{ij} = \mathbb{E}\big[e_i(X)\, e_j(X)\big]. \qquad (8)$$

Perrone and Cooper (1992) proposed to estimate the unknown matrix $C_e$, and consequently $w$, using a labeled set. Unfortunately, in many practical scenarios multi-colinearity between the regressors leads to an ill-conditioned matrix $C_e$ that cannot be robustly inverted.

3.4 PCR*

A common approach to handle ill-conditioned multivariate problems is via principal component regression Jolliffe (2002). In the context of supervised ensemble learning, Merz and Pazzani (1999) suggested such a method, denoted PCR*. Given a labeled set, let $\hat{C}$ be the sample covariance matrix of the regressors, and let $\hat{v}_1,\ldots,\hat{v}_K$ be its top $K$ leading eigenvectors. Merz and Pazzani (1999) proposed a weight vector of the form $w = \sum_{k=1}^{K} \alpha_k \hat{v}_k$, with coefficients $\alpha_k$ determined by least squares regression over the training set. The number of principal components $K$ is chosen by minimizing the cross-validation error.

In the common scenario where some ensemble regressors are highly correlated, the matrix $\hat{C}$ is ill-conditioned. The GEM estimator, which inverts the misfit covariance, then yields unstable predictions. In contrast, PCR* with a small number of components can be viewed as a regularized method, providing stability and robustness. In a supervised setting, Merz and Pazzani (1999) found PCR* to outperform GEM.
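As a rough illustration of the PCR* idea in this notation (a simplified sketch: Merz and Pazzani additionally select the number of components by cross-validation, which is omitted here):

```python
import numpy as np

def pcr_star_weights(F_train, y_train, n_components=2):
    """Sketch of a PCR*-style combiner: restrict the weight vector to the
    span of the top eigenvectors of the regressors' sample covariance and
    fit the coefficients by least squares on labeled data.
    F_train: m x n_train matrix of regressor predictions."""
    Fc = F_train - F_train.mean(axis=1, keepdims=True)
    C_hat = Fc @ Fc.T / F_train.shape[1]
    eigvals, eigvecs = np.linalg.eigh(C_hat)       # eigenvalues in ascending order
    V = eigvecs[:, -n_components:]                 # top principal directions
    scores = V.T @ Fc                              # principal component scores
    alpha, *_ = np.linalg.lstsq(scores.T, y_train - y_train.mean(), rcond=None)
    return V @ alpha                               # weight vector w = sum_k alpha_k v_k
```

A GEM-style combiner would instead invert the full misfit covariance matrix, which is exactly the step that becomes unstable under multi-colinearity.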

3.5 Unsupervised Ensemble Classification

The simplest model for unsupervised ensemble classification, going back to Dawid and Skene (1979), is that conditional on the label $Y$, classifiers make independent errors,

$$\Pr\big(f_1(X),\ldots,f_m(X)\,\big|\,Y\big) \;=\; \prod_{i=1}^{m} \Pr\big(f_i(X)\,\big|\,Y\big). \qquad (9)$$

Dawid and Skene (1979) estimated the classifier accuracies and the labels by the EM algorithm. In recent years several authors developed computationally efficient and rate-optimal methods to estimate these quantities Anandkumar et al. (2014); Zhang et al. (2014); Jaffe et al. (2015).

To the best of our knowledge, our work is the first to propose an analogue of this assumption to the regression case, rigorously study it, and consequently derive corresponding unsupervised ensemble regression schemes.

4 Unsupervised Ensemble Regression

Given only the predictions $f_i(x_j)$, the simplest unsupervised approach to estimate the response at an instance $x_j$ is to average the regressors, $\hat{y}_{\mathrm{mean}}(x_j) = \frac{1}{m}\sum_{i=1}^{m} f_i(x_j)$. Averaging is the optimal linear estimator when all regressors make independent zero-mean errors of equal variance. A more robust but non-linear method is the median, $\hat{y}_{\mathrm{med}}(x_j) = \mathrm{median}\{f_1(x_j),\ldots,f_m(x_j)\}$. Averaging and the median are naïve estimators in the sense that the prediction at each $x_j$ depends only on $f_1(x_j),\ldots,f_m(x_j)$ and does not depend on the other observations $f_i(x_k)$, $k \neq j$.
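For concreteness, the two naive baselines operate column-wise on the m x n prediction matrix of Eq. (1) (the function names are ours):

```python
import numpy as np

def ensemble_mean(F):
    """Average the m regressors at each sample (column of F)."""
    return F.mean(axis=0)

def ensemble_median(F):
    """Column-wise median; more robust to a few wildly inaccurate regressors."""
    return np.median(F, axis=0)
```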

As we show theoretically below and illustrate empirically in Section 5, under some reasonable assumptions, one can do significantly better than the ensemble mean and median by analyzing all the data and in particular the covariance matrix of the regressors.

Specifically, we propose a novel framework to study unsupervised ensemble regression, based on the assumption that the experts make approximately uncorrelated errors with respect to the optimal predictor. We develop methods to detect the best and worst regressors and derive U-PCR, a novel unsupervised principal components ensemble regressor. Similar to Merz and Pazzani (1999), the weight vector of U-PCR is a linear combination of the top few eigenvectors of the regressors' sample covariance $\hat{C}$ (typically just one or two). The key novelty is that we estimate the coefficients in a fully unsupervised manner.

To this end, we do assume knowledge of the first two moments of $Y$, namely $\mu_y$ and $\sigma_y^2$. Such knowledge seems inevitable, as otherwise the observed data may be arbitrarily shifted and scaled without changing the correlations of the regressors. Knowing the first moment $\mu_y$ allows us to estimate the bias of each regressor by its mean over the unlabeled samples, $\hat{b}_i = \frac{1}{n}\sum_{j=1}^{n} f_i(x_j) - \mu_y$.

Knowledge of $\sigma_y^2$ allows a rough estimate of the accuracy of the regressors. A very accurate regressor must have $\mathrm{Var}(f_i(X)) \approx \sigma_y^2$, whereas if $\mathrm{Var}(f_i(X)) \ll \sigma_y^2$ or $\mathrm{Var}(f_i(X)) \gg \sigma_y^2$, then $f_i$ must have a large error.

In what follows we consider predicting the mean-centered responses $y_j - \mu_y$ by a linear combination of the mean-centered predictors, $f_i(x) - \hat{\mu}_i$, where $\hat{\mu}_i = \frac{1}{n}\sum_j f_i(x_j)$. We thus work with the mean-centered matrix of predictions, with entries $f_i(x_j) - \hat{\mu}_i$. This is equivalent to assuming that $\mu_y = 0$ and $\mathbb{E}[f_i(X)] = 0$ for all $i$.
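As a small sketch of this preprocessing step (the names mu_y and b_hat are ours; the bias formula follows our reading of the text above):

```python
import numpy as np

def center_predictions(F, mu_y):
    """Estimate each regressor's bias from the unlabeled samples and
    mean-centre the m x n prediction matrix of Eq. (1)."""
    mu_hat = F.mean(axis=1)             # empirical mean of each regressor
    b_hat = mu_hat - mu_y               # estimated biases
    F_centered = F - mu_hat[:, None]    # mean-centred predictions
    return F_centered, b_hat
```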

4.1 Statistically Independent Errors

As discussed in Section 2, in light of the optimal weights in Eq. (6), the key challenge in unsupervised ensemble regression is to estimate the vector $\rho$ of Eq. (4) without any labeled data.

To this end, we propose the following regression analogue of the Dawid-Skene assumption of conditionally independent experts. Recall that when the risk function is the MSE, the optimal regressor is the conditional mean, $g(x) = \mathbb{E}[Y \mid X = x]$. Its mean is $\mathbb{E}[g(X)] = \mu_y = 0$, and its MSE is $\sigma_y^2 - \sigma_g^2$, where

$$\sigma_g^2 = \mathrm{Var}\big(g(X)\big). \qquad (10)$$

For each regressor, write $f_i(x) = g(x) + \delta_i(x)$. Since $f_i$ is mean centered, $\mathbb{E}[\delta_i(X)] = 0$. Hence, $\rho_i$ simplifies to

$$\rho_i = \mathbb{E}\big[f_i(X)\,Y\big] = \sigma_g^2 + \mathbb{E}\big[\delta_i(X)\,g(X)\big]. \qquad (11)$$

Similarly, the MSE of regressor $f_i$ is

$$\mathrm{MSE}(f_i) = \mathbb{E}\big[(Y - f_i(X))^2\big] = \sigma_y^2 - 2\rho_i + c_{ii}. \qquad (12)$$

In this notation, the challenge is thus to estimate $\sigma_g^2$ and the vector of terms $\mathbb{E}[\delta_i(X)\,g(X)]$. Inspired by Eq. (9) in the case of classification, we assume the regressors make independent errors with respect to $g$ (strictly speaking, the assumption is that the deviations from $g$ are uncorrelated and not necessarily independent), namely that

$$\mathbb{E}\big[\delta_i(X)\,\delta_j(X)\big] = 0 \quad \text{for all } i \neq j. \qquad (13)$$

This assumption is reasonable, for example, when the regressors were trained independently and are rich enough to well approximate the conditional mean $g$. Note that when the response is perfectly predictable from the features, then $Y = g(X)$, and our assumption then states that the regressors make independent errors with respect to the response $Y$. This can be viewed as the regression equivalent of the Dawid-Skene model in classification.

Next, we consider how to estimate the values $\rho_i$ under the independent-error assumption of Eq. (13). Suppose for a moment that the value of $\sigma_g^2$ of Eq. (10) were known; we shall discuss how to estimate it in the next section. As the following theorem shows, in this case we can consistently estimate $\rho$ by solving a system of linear equations.

Theorem 1.

Assume that the given regressors make pairwise independent errors with respect to the conditional mean. If $\sigma_g^2$ is known, then, given only the data matrix of Eq. (1), we can consistently estimate the vector $\rho$ at rate $O_P(1/\sqrt{n})$.

Proof.

It is instructive to first consider the population setting where $n \to \infty$. Here, under the assumption (13), the off-diagonal entries of the population covariance $C$ are

$$c_{ij} = \sigma_g^2 + \mathbb{E}\big[\delta_i(X)\,g(X)\big] + \mathbb{E}\big[\delta_j(X)\,g(X)\big], \qquad i \neq j. \qquad (14)$$

Since $C$ is symmetric, these off-diagonal entries provide $m(m-1)/2$ linear equations for the $m$ unknown variables $a_i = \mathbb{E}[\delta_i(X)\,g(X)]$. Thus, if $m \geq 3$ there are enough linearly independent equations to uniquely recover $(a_1,\ldots,a_m)$. The vector $\rho$ can then be computed from Eq. (11).

In practice, the population matrix $C$ is unknown. However, given the matrix of predictions (1), we may estimate it by the sample covariance $\hat{C}$. Since $\hat{C} = C + O_P(1/\sqrt{n})$, estimating $(a_1,\ldots,a_m)$ by least squares yields a consistent estimator with asymptotic error $O_P(1/\sqrt{n})$. ∎

Remark 1.

In practice, assumption (13), that all pairs of regressors make independent errors, may be strongly violated at least for some pairs. To be robust to deviations from this assumption, one may choose a suitable loss function $L$ and solve the optimization problem

$$\hat{a} = \arg\min_{a \in \mathbb{R}^m}\; \sum_{i < j} L\Big(\hat{c}_{ij} - \big(\sigma_g^2 + a_i + a_j\big)\Big). \qquad (15)$$

In our experiments, we considered both the absolute loss and the standard squared loss.
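Under the squared loss, the optimization (15) is an ordinary least-squares problem in the m unknowns a_i = E[delta_i(X) g(X)]. A minimal sketch, under our reading of Eqs. (11), (14) and (15); the robust absolute-loss variant would replace the least-squares solve with an L1 fit:

```python
import numpy as np

def estimate_rho(C_hat, sigma_g2):
    """For an assumed value of sigma_g^2, fit the off-diagonal relations
    c_ij ~= sigma_g^2 + a_i + a_j (Eq. (14)) by least squares (Eq. (15) with
    squared loss), then return rho_i = sigma_g^2 + a_i as in Eq. (11)."""
    m = C_hat.shape[0]
    pairs = [(i, j) for i in range(m) for j in range(i + 1, m)]
    A = np.zeros((len(pairs), m))
    b = np.empty(len(pairs))
    for row, (i, j) in enumerate(pairs):
        A[row, i] = A[row, j] = 1.0
        b[row] = C_hat[i, j] - sigma_g2
    a, *_ = np.linalg.lstsq(A, b, rcond=None)   # requires m >= 3 regressors
    return sigma_g2 + a
```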

4.2 Unsupervised PCR

The analysis above assumed knowledge of $\sigma_g^2$, or equivalently of the minimal attainable MSE of the regression problem at hand. Clearly, this would seldom be known to the practitioner. Further, any guess of $\sigma_g^2$ gives a valid solution. Specifically, let $\hat{a}(\sigma^2)$ be the solution of (15) with an assumed value $\sigma^2$. Then, due to the additive structure inside the parentheses in Eq. (15), regardless of the loss function $L$, we have $\hat{a}(\sigma^2) = \hat{a}(\sigma_g^2) + \Delta\,\mathbf{1}$, where $\Delta = (\sigma_g^2 - \sigma^2)/2$. Similarly, by Eq. (11),

$$\hat{\rho}(\sigma^2) = \hat{\rho}(\sigma_g^2) + \tfrac{1}{2}\big(\sigma^2 - \sigma_g^2\big)\mathbf{1}. \qquad (16)$$

What is needed is thus a model selection criterion that would be able to accurately estimate the value of $\sigma_g^2$, given the family of possible solutions $\hat{\rho}(\sigma^2)$ over all candidate values $\sigma^2$.

To motivate our proposed estimator of $\sigma_g^2$, let us first analyze the model of the previous section, but with the additional assumption that all regressors are fairly close to the optimal conditional mean $g$. Namely, for analysis purposes, we scale the deviations by a parameter $\epsilon$,

$$f_i(x) = g(x) + \epsilon\,\delta_i(x), \qquad (17)$$

and study the behaviour of various quantities as a function of $\epsilon$. Specifically, under Eq. (17), the population covariance of the regressors takes the form

$$C = \sigma_g^2\,\mathbf{1}\mathbf{1}^\top + \epsilon\big(a\mathbf{1}^\top + \mathbf{1}a^\top\big) + \epsilon^2 D,$$

where $a_i = \mathbb{E}[\delta_i(X)\,g(X)]$ and $D$ is a diagonal matrix with entries $D_{ii} = \mathbb{E}[\delta_i^2(X)]$. The following lemma characterizes the leading eigenvalue and eigenvector of $C$, as $\epsilon \to 0$.

Lemma 2.

Let $\lambda(\epsilon)$ and $v(\epsilon)$ be the largest eigenvalue and corresponding eigenvector of $C$. Then, as $\epsilon \to 0$,

$$\lambda(\epsilon) = m\,\sigma_g^2 + 2\epsilon \sum_{i=1}^{m} a_i + O(\epsilon^2), \qquad (18)$$
$$v(\epsilon) = \frac{1}{\sqrt{m}}\,\mathbf{1} + \frac{\epsilon}{\sqrt{m}\,\sigma_g^2}\big(a - \bar{a}\,\mathbf{1}\big) + O(\epsilon^2), \qquad (19)$$

where $\bar{a} = \frac{1}{m}\sum_{i} a_i$.

Several insights can be gained from this lemma. First, at $\epsilon = 0$ the matrix $C$ is rank one, with a single non-zero eigenvector $\mathbf{1}/\sqrt{m}$ and corresponding eigenvalue $m\sigma_g^2$. Hence, if the regressors are all very close to $g$, their population covariance matrix is nearly rank one and very ill conditioned. Even with an accurate estimate of $\sigma_g^2$, and consequently of $\rho$, inverting Eq. (6) to estimate $w^*$ would then be extremely unstable.

Second, under the model (17), $\rho = \sigma_g^2\,\mathbf{1} + \epsilon\, a$. Comparing this to Eq. (19), the vector $\rho$ and the leading eigenvector $v(\epsilon)$, properly scaled, are nearly identical, up to a small shift by a multiple of $\mathbf{1}$ and up to $O(\epsilon^2)$ terms. Moreover, up to $O(\epsilon^2)$ terms, the matrix $C$ has rank two, spanned by the two vectors $\mathbf{1}$ and $a$. Hence, up to $O(\epsilon^2)$ terms, the true vector $w^*$ can be written as a linear combination of the first two eigenvectors of $C$. Thus, even though the matrix $C$ is ill conditioned, a principal component approach with just 1 or 2 components can provide an excellent approximation of the optimal weight vector $w^*$. While our focus is on unsupervised ensembles, this analysis provides rigorous theoretical support for the PCR* method of Merz and Pazzani (1999), a result which may be of independent interest for supervised ensemble learning.

Third, since by Eq. (12) the MSE of regressor $f_i$ is determined by $\rho_i$ and $c_{ii}$, the worst and best regressors may be detected by the largest and smallest entries in the vector of estimated MSEs, or in the estimated vector $\hat{\rho}$.

Lemma 2 suggests several ways to estimate the unknown quantity $\sigma_g^2$. By Eq. (18), one option is $\hat{\sigma}_g^2 = \hat{\lambda}_1/m$. Under our assumed model, this would incur an error of order $\epsilon$. Another option, which we found works better in practice, is to consider the relation between $\hat{\rho}(\sigma^2)$ and the top eigenvector $\hat{v}_1$ of $\hat{C}$, normalized to unit length. Specifically, we estimate $\sigma_g^2$ by minimizing the following residual,

$$\hat{\sigma}_g^2 = \arg\min_{\sigma^2} \;\Big\| \frac{\hat{\rho}(\sigma^2)}{\|\hat{\rho}(\sigma^2)\|} - \hat{v}_1 \Big\|, \qquad (20)$$

where $\hat{\rho}(\sigma^2)$ is given in Eq. (16). From the estimate $\hat{\rho} = \hat{\rho}(\hat{\sigma}_g^2)$, the weight vector of U-PCR is

$$w_{\mathrm{U\text{-}PCR}} = \frac{\hat{v}_1^\top \hat{\rho}}{\hat{\lambda}_1}\, \hat{v}_1. \qquad (21)$$

A sketch of our proposed scheme appears in Algorithm 1.

  Input: Predictions $\{f_i(x_j)\}$ and the moments $\mu_y$, $\sigma_y^2$
  Compute the sample covariance $\hat{C}$, its leading eigenvalue $\hat{\lambda}_1$ and unit-norm eigenvector $\hat{v}_1$
  For a grid of candidate values $\sigma^2$, compute $\hat{\rho}(\sigma^2)$ by Eqs. (11), (15) and (16)
  Estimate $\hat{\sigma}_g^2$ via Eq. (20)
  Set $\hat{\rho} = \hat{\rho}(\hat{\sigma}_g^2)$ and the estimated MSEs via Eq. (12)
  if $\hat{\sigma}_g^2 / \sigma_y^2 < \tau$ then
     Difficult prediction problem; STOP
  end if
  Exclude experts whose entries in $\hat{\rho}$ are abnormally small or large
  Recalculate $\hat{C}$, $\hat{v}_1$ and $\hat{\rho}$ on the remaining experts
  Output: Weight vector of Eq. (21)
Algorithm 1 Sketch of U-PCR
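A compact sketch of the whole pipeline, reusing estimate_rho from the sketch after Remark 1. The exact forms of Eqs. (20) and (21), the grid over sigma^2, the stopping threshold tau, and the omission of the outlier-exclusion and two-eigenvector steps are our assumptions and simplifications, not a faithful re-implementation of Algorithm 1:

```python
import numpy as np

def u_pcr_predict(F, mu_y, sigma_y2, n_grid=200, tau=0.1):
    """Sketch of U-PCR: F is the m x n prediction matrix, mu_y and sigma_y2
    are the known first two moments of the response."""
    m, n = F.shape
    Fc = F - F.mean(axis=1, keepdims=True)             # mean-centre each regressor
    C_hat = Fc @ Fc.T / n                              # sample covariance
    lam, V = np.linalg.eigh(C_hat)
    lam1, v1 = lam[-1], V[:, -1]
    if v1.sum() < 0:                                   # fix sign: entries should be ~ 1/sqrt(m)
        v1 = -v1

    best_resid, sigma_g2, rho = np.inf, None, None
    for sigma2 in np.linspace(1e-6, sigma_y2, n_grid): # candidate values of sigma_g^2
        r = estimate_rho(C_hat, sigma2)                # Eqs. (11), (15), (16)
        resid = np.linalg.norm(r / np.linalg.norm(r) - v1)   # residual, cf. Eq. (20)
        if resid < best_resid:
            best_resid, sigma_g2, rho = resid, sigma2, r

    if sigma_g2 / sigma_y2 < tau:                      # Section 4.3: difficult problem
        raise RuntimeError("Difficult prediction problem; no ensemble constructed")

    w = (v1 @ rho / lam1) * v1                         # one-component weights, cf. Eq. (21)
    return mu_y + w @ Fc                               # predictions of the linear ensemble
```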
Figure 1: The residual of Eq. (20) and the resulting MSE, as functions of the assumed value of $\sigma^2$, for problems of easy, moderate and hard difficulty: (a) CCPP, accurate prediction possible; (b) Basketball, challenging task; (c) Affairs, limited information on the response. The vertical lines mark the estimated $\hat{\sigma}_g^2$ at which the residual is minimal.

4.3 Practical Issues

Before illustrating the competitive performance of U-PCR, we discuss several important practical issues that need to be addressed when handling real-world ensembles, whose individual regressors may not satisfy our assumptions.

First, when the true value $\sigma_g^2 \ll \sigma_y^2$, the regression problem at hand is very difficult, and no linear combination of the predictors can give a small error. If our estimate satisfies $\hat{\sigma}_g^2/\sigma_y^2 < \tau$ for some small threshold $\tau$, say 0.1, this is an indication of such a difficult problem. In this case we stop and do not attempt to construct an ensemble learner.

Second, even when accurate prediction is possible, in our experience, if some regressors are far less accurate than others, then it is important to detect them, exclude them from the ensemble, and recompute the various quantities after their removal. However, in the rare cases where after this removal only very few regressors remained, we found it better to compute their simple average instead of Eq. (21).

Finally, if the second eigenvalue $\hat{\lambda}_2$ is not extremely small relative to $\hat{\lambda}_1$, then it is beneficial to project the vector $\hat{\rho}$ onto the first two eigenvectors of $\hat{C}$. In our experiments we did so when the ratio $\hat{\lambda}_2/\hat{\lambda}_1$ exceeded a small threshold. Then, Eq. (21) is replaced by its two-component analogue, $w = \sum_{k=1}^{2} \frac{\hat{v}_k^\top \hat{\rho}}{\hat{\lambda}_k}\,\hat{v}_k$.

5 Experiments

We illustrate the performance of U-PCR on various real world regression problems. These include problems for which we trained multiple regression algorithms, and two applications where the regressors were constructed by a third party and only their predictions were given to us.

We compare U-PCR to the ensemble mean and median, as well as to a linear oracle regressor of the form (2), which has access to all the response values $y_j$. It determines its weights by ordinary least squares over all $n$ samples,

$$w_{\mathrm{oracle}} = \arg\min_{w} \sum_{j=1}^{n} \big(y_j - \hat{y}_w(x_j)\big)^2. \qquad (22)$$

We denote by MSE(oracle) the MSE of the oracle, normalized by $\sigma_y^2$.
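For reference, a sketch of the supervised oracle of Eq. (22) and of the normalized MSE reported throughout this section (our implementation choices):

```python
import numpy as np

def oracle_weights(F, y):
    """Oracle of Eq. (22): least-squares fit of the centred responses on the
    centred regressor predictions, using all n labeled samples."""
    Fc = F - F.mean(axis=1, keepdims=True)
    w, *_ = np.linalg.lstsq(Fc.T, y - y.mean(), rcond=None)
    return w

def normalized_mse(y_true, y_pred):
    """MSE normalized by the response variance, as used in the tables and figures."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)
```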

We divide the regression problems into three difficulty levels, according to the normalized MSE of the oracle: (i) low values, where accurate prediction is possible by a linear combination of the regressors; (ii) intermediate values, a challenging regression task; and (iii) values close to one, where the experts provide very little, if any, information on the response.

We start with the following basic question: given only the predictions $f_i(x_j)$ and the first two moments of $Y$, can we roughly estimate the difficulty level of our problem? If it belongs to level (i) or (ii), is it possible to detect the most accurate or least accurate regressors? Finally, can we construct a linear combination at least as accurate as the mean or median?

Figure 2: Estimated MSE vs. true MSE for Flights AUS (left panels: (a) before, (b) after outlier removal) and the UACC812 protein (right panels: (c) before, (d) after outlier removal). The outlier removal scheme is not based on the estimated MSE, but rather, as described in the main text, on the entries of the estimated vector $\hat{\rho}$. In some datasets, such as Flights AUS, recalculation after this removal gives more accurate estimates of the regressors' MSE.

5.1 Manually Crafted Ensembles

With precise details appearing in the supplement, we considered 18 different prediction tasks, including energy output prediction in a power plant, flight delays, basketball scoring and more. Each dataset was randomly split into samples used to train 10 different regression algorithms and held-out samples used to construct the observations of Eq. (1); see Table 3 in the Supplementary. The regressors included Ridge Regression, SVR, Kernel Regression and Decision Trees, among others.

Table 1 in the supplement shows the MSE of U-PCR, mean and median, averaged over 20 repetitions, each with a different random split into train and test samples. On several datasets, U-PCR obtained a significantly lower MSE. With further details in the supplement, here we highlight some of our key results. We start by estimating $\sigma_g^2$ and classifying the problems by difficulty level. Fig. 1 shows this estimation procedure on three datasets. The $x$-axis is the value of $\sigma^2$ normalized by $\sigma_y^2$. The black curve is the unobserved MSE obtained by the weight vector of Eq. (21) with assumed value $\sigma^2$. The red curve is the computed residual of Eq. (20), and the vertical line is the estimated $\hat{\sigma}_g^2$. Our approach is indeed able to correctly detect the difficulty levels of these problems and to estimate a value $\hat{\sigma}_g^2$ whose corresponding MSE is not too far from the minimum achievable over all candidate values $\sigma^2$. Fig. 7 in the supplement shows the estimated vs. the true values for these problems. For easy problems, the agreement is remarkable.

Next, we evaluated the ability to detect the most accurate regressor in the ensemble. We measured the excess risk of selecting the regressor with the smallest estimated MSE, compared to the best regressor, which is unknown. Additionally, we measured the excess risk in selecting the single regressor with the greatest corresponding entry in the leading eigenvector $\hat{v}_1$ of $\hat{C}$. The details are given in Table 2 of the Supplementary. Our experiments show that in most cases choosing the predictor with the lowest estimated MSE outperforms choosing the one with the largest entry in $\hat{v}_1$.

Fig. 2 shows the effectiveness of detecting inaccurate regressors by pruning those whose entries in $\hat{\rho}$ are abnormally small or large. Finally, Fig. 3 illustrates the advantages of U-PCR over the mean and median on problems of easy to moderate difficulty.

Figure 3: Excess risk, MSE(ensemble) − MSE(oracle), divided by $\sigma_y^2$, for easy problems where nearly perfect prediction is possible (left) and for challenging problems (right).

5.2 HPN-DREAM Challenge Experiment

Next, we consider real world problems where the ensemble regressors were constructed by a third party. The first problem came from the HPN-DREAM breast cancer network inference challenge Hill et al. (2016a). Here, participants were asked to predict the time-varying concentrations of 4 proteins after the introduction of an inhibitor. We were given the predictions of the participating models on a common set of instances. We constructed a separate U-PCR model for each protein. Fig. 2 demonstrates the success of our method in detecting accurate regressors and removing inaccurate ones. Fig. 4 shows that U-PCR outperformed the mean and median on 3 of the 4 proteins. We note that for all four proteins the single best model had an MSE comparable to U-PCR; however, this model is unknown a priori. For three of the four proteins, U-PCR had a smaller MSE than that of the single model estimated as being the most accurate.

Figure 4: HPN-DREAM Challenge Accuracy.

5.3 Bounding Box Experiment

Here we were given the predictions of 6 deep learning models trained by Seematics Inc. on the location of physical objects in images. The models were trained on the PASCAL Visual Object Classes dataset Everingham et al. (2012), whereas the predictions were made on images from the COCO dataset Lin et al. (2014). We focused on three object classes, {person, dog, cat}, with each neural network providing the four coordinates of the bounding box. We used U-PCR as an ensemble predictor for each coordinate separately, with the mean squared error as our measure of accuracy. An example of the MSE estimation by our method can be seen in Fig. 5, and the accuracy for all object classes and all coordinates in Fig. 6. Results on a few images are shown in the supplementary.

Figure 5: MSE estimation for the class Cat, for one of the bounding-box coordinates
Figure 6: Bounding box prediction accuracies

6 Summary and Discussion

In this paper we tackled the problem of unsupervised ensemble regression. We presented a framework to explore this problem, based on an independent error assumption. We proposed methods, together with theoretical support, to detect the best and worst regressors and to linearly aggregate them, all in an unsupervised manner. As our theoretical analysis in Section 4 showed, unsupervised ensemble regression is different from the well studied problem of unsupervised ensemble classification, and required different approaches to its solution.

Our work raises several questions. One of them is how to extend our method to a semi-supervised setting, in which there is also a limited amount of labeled data. It is also interesting to theoretically understand the relative benefits of labeled versus unlabeled data for ensemble learning.

Another direction for future research is to replace the strict independent-error assumption by more complicated yet realistic models for dependencies between the regressors. In the context of unsupervised classification, Fetaya et al. (2016) relaxed the conditional independence model of Dawid and Skene by introducing an intermediate layer of latent variables. Instead of a rank-one off-diagonal covariance, the matrix in their model had a low-rank structure, which the authors learned by a spectral method. It would be interesting to explore whether a similar approach can be developed for an ensemble of regressors.

Acknowledgments

The authors thank Moshe Guttmann and the Seematics team for their help. This research was funded by the Intel Collaborative Research Institute for Computational Intelligence (B.N.) and by NIH grant 1R01HG008383-01A1 (Y.K. and B.N.).

References

  • Anandkumar et al. (2014) Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. (2014). Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1):2773–2832.
  • Breiman (1996) Breiman, L. (1996). Stacked regressions. Machine Learning, 24(1):49–64.
  • Breiman (2001) Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
  • Cortez et al. (2009) Cortez, P., Cerdeira, A., Almeida, F., Matos, T., and Reis, J. (2009). Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553.
  • Dawid and Skene (1979) Dawid, A. P. and Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates using the em algorithm. Applied statistics, pages 20–28.
  • Donmez et al. (2010) Donmez, P., Lebanon, G., and Balasubramanian, K. (2010). Unsupervised supervised learning i: Estimating classification and regression errors without labels. Journal of Machine Learning Research, 11:1323–1351.
  • Everingham et al. (2012) Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2012). The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results.
  • Fanaee-T and Gama (2014) Fanaee-T, H. and Gama, J. (2014). Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence, 2(2-3):113–127.
  • Fetaya et al. (2016) Fetaya, E., Nadler, B., Jaffe, A., Kluger, Y., and Jiang, T. (2016). Unsupervised ensemble learning with dependent classifiers. In AISTATS, pages 351–360.
  • Freund and Schapire (1995) Freund, Y. and Schapire, R. E. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In European Conference on Computational Learning Theory, pages 23–37.
  • Friedman et al. (2000) Friedman, J., Hastie, T., and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–407.
  • Friedman (1991) Friedman, J. H. (1991). Multivariate adaptive regression splines. The annals of statistics, pages 1–67.
  • Hill et al. (2016a) Hill, S. M., Heiser, L. M., Cokelaer, T., Unger, M., Nesser, N. K., Carlin, D. E., Zhang, Y., Sokolov, A., Paull, E. O., Wong, C. K., et al. (2016a). Inferring causal molecular networks: empirical assessment through a community-based effort. Nature methods, 13(4):310–318.
  • Hill et al. (2016b) Hill, S. M., Nesser, N. K., Johnson-Camacho, K., Jeffress, M., Johnson, A., Boniface, C., Spencer, S. E., Lu, Y., Heiser, L. M., Lawrence, Y., et al. (2016b). Context specificity in causal signaling networks revealed by phosphoprotein profiling. Cell systems.
  • Ionita-Laza et al. (2016) Ionita-Laza, I., McCallum, K., Xu, B., and Buxbaum, J. D. (2016). A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nature genetics, 48(2):214–220.
  • Jaffe et al. (2015) Jaffe, A., Nadler, B., and Kluger, Y. (2015). Estimating the accuracies of multiple classifiers without labeled data. In AISTATS.
  • Johnson (1996) Johnson, V. E. (1996). On Bayesian Analysis of Multirater Ordinal Data: An Application to Automated Essay Grading. Journal of the American Statistical Association, 91(433):42–51.
  • Jolliffe (2002) Jolliffe, I. (2002). Principal component analysis. John Wiley & Sons, Ltd.
  • Kato (1995) Kato, T. (1995). Perturbation Theory of Linear Operators. Springer, Berlin, second edition.
  • Lavancier and Rochet (2016) Lavancier, F. and Rochet, P. (2016). A general procedure to combine estimators. Computational Statistics & Data Analysis, 94:175–192.
  • Leblanc and Tibshirani (1996) Leblanc, M. and Tibshirani, R. (1996). Combining Estimates in Regression and Classification. Journal of the American Statistical Association, 91(436):1641–1650.
  • Lichman (2013) Lichman, M. (2013). UCI machine learning repository.
  • Lin et al. (2014) Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755.
  • Mendes-Moreira et al. (2012) Mendes-Moreira, J., Soares, C., Jorge, A. M., and Sousa, J. F. D. (2012). Ensemble approaches for regression: A survey. ACM Computing Surveys (CSUR), 45(1):10.
  • Merz and Pazzani (1999) Merz, C. and Pazzani, M. (1999). A Principal Components Approach to Combining Regression Estimates. Machine Learning, 36(1-2):9–32.
  • Nadler (2008) Nadler, B. (2008). Finite sample approximation results for principal component analysis: A matrix perturbation approach. The Annals of Statistics, pages 2791–2817.
  • Parisi et al. (2014) Parisi, F., Strino, F., Nadler, B., and Kluger, Y. (2014). Ranking and combining multiple predictors without labeled data. Proceedings of the National Academy of Sciences of the United States of America, 111(4):1253–8.
  • Perrone and Cooper (1992) Perrone, M. P. and Cooper, L. N. (1992). When networks disagree: Ensemble methods for hybrid neural networks. World Scientific.
  • Platanios et al. (2014) Platanios, E. A., Blum, A., and Mitchell, T. (2014). Estimating accuracy from unlabeled data. In Uncertainty in Artificial Intelligence, pages 682–691.
  • Platanios et al. (2016) Platanios, E. A., Dubey, A., and Mitchell, T. (2016). Estimating accuracy from unlabeled data: A bayesian approach. In International Conference on Machine Learning, pages 1416–1425.
  • Raykar et al. (2010) Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., and Moy, L. (2010). Learning from crowds. Journal of Machine Learning Research, 11(Apr):1297–1322.
  • Sheng et al. (2008) Sheng, V. S., Provost, F., and Ipeirotis, P. G. (2008). Get another label? improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 614–622. ACM.
  • Timmermann (2006) Timmermann, A. (2006). Forecast combinations. Handbook of economic forecasting, 1:135–196.
  • Whitehill et al. (2009) Whitehill, J., Wu, T.-f., Bergsma, J., Movellan, J. R., and Ruvolo, P. L. (2009). Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems, pages 2035–2043.
  • Wolpert (1992) Wolpert, D. H. (1992). Stacked generalization. Neural networks, 5(2):241–259.
  • Wu et al. (2016) Wu, D., Lawhern, V. J., Gordon, S., Lance, B. J., and Lin, C.-T. (2016). Spectral meta-learner for regression (smlr) model aggregation: Towards calibrationless brain-computer interface. In IEEE International Conference on Systems, Man, and Cybernetics, pages 743–749.
  • Zhang et al. (2014) Zhang, Y., Chen, X., Zhou, D., and Jordan, M. I. (2014). Spectral methods meet em: A provably optimal algorithm for crowdsourcing. In Advances in neural information processing systems, pages 1260–1268.
  • Zhou et al. (2012) Zhou, D., Basu, S., Mao, Y., and Platt, J. C. (2012). Learning from the wisdom of crowds by minimax entropy. In Advances in Neural Information Processing Systems, pages 2195–2203.

Appendix A Proofs

Proof of Lemma 2.

The proof follows a perturbation approach similar to the one outlined in Nadler (2008). Since $C(\epsilon)$ is symmetric and quadratic in $\epsilon$, classical results on perturbation theory Kato (1995) imply that in a small neighborhood of $\epsilon = 0$, the leading eigenvalue and eigenvector are analytic in $\epsilon$. We may thus expand them in a Taylor series,

$$\lambda(\epsilon) = \lambda_0 + \epsilon\,\lambda_1 + \epsilon^2\,\lambda_2 + \cdots, \qquad v(\epsilon) = v_0 + \epsilon\, v_1 + \epsilon^2\, v_2 + \cdots.$$

We insert this expansion into the eigenvector equation $C(\epsilon)\, v(\epsilon) = \lambda(\epsilon)\, v(\epsilon)$ and solve the resulting equations at increasing powers of $\epsilon$.

The leading order equation reads $\sigma_g^2\,\mathbf{1}\mathbf{1}^\top v_0 = \lambda_0 v_0$, which gives $\lambda_0 = m\sigma_g^2$ and $v_0 = \mathbf{1}/\sqrt{m}$. Since the eigenvector is defined only up to a multiplicative factor, we conveniently chose the normalization $v_0^\top v(\epsilon) = 1$, which holds for all $\epsilon$. This gives $v_0^\top v_1 = 0$ and $v_0^\top v_2 = 0$.

The $O(\epsilon)$ equation reads

$$\big(\sigma_g^2\,\mathbf{1}\mathbf{1}^\top - \lambda_0 I\big) v_1 + \big(a\mathbf{1}^\top + \mathbf{1}a^\top\big) v_0 = \lambda_1 v_0. \qquad (23)$$

Multiplying this equation from the left by $v_0^\top$ gives $v_0^\top\big(a\mathbf{1}^\top + \mathbf{1}a^\top\big)v_0 = \lambda_1$, or $\lambda_1 = 2\sum_i a_i$. Thus, Eq. (18) follows. Inserting the expression for $\lambda_1$ back into Eq. (23) gives

$$\big(\lambda_0 I - \sigma_g^2\,\mathbf{1}\mathbf{1}^\top\big) v_1 = \big(a\mathbf{1}^\top + \mathbf{1}a^\top\big)v_0 - \lambda_1 v_0 = \sqrt{m}\,\big(a - \bar{a}\,\mathbf{1}\big),$$

from which Eq. (19) readily follows. ∎

Appendix B Datasets & Results

Dataset Oracle U-PCR Mean Median
Abalone 0.43 () 0.49 () 0.49 () 0.49 ()
Affairs 0.92 () N.A. 0.96 () 0.94 ()
Basketball 0.28 () 0.35 () 0.35 () 0.36 ()
Bike sharing 0.00 () 0.00 () 0.02 () 0.00 ()
Blog feedback 0.41 () 0.49 () 0.50 () 0.58 ()
Ccpp 0.06 () 0.07 () 0.07 () 0.07 ()
Flights AUS 0.33 () 0.46 () 0.58 () 0.66 ()
Flights BOS 0.47 () 0.58 () 0.66 () 0.69 ()
Flights BWI 0.44 () 0.56 () 0.71 () 0.82 ()
Flights HOU 0.40 () 0.59 () 0.69 () 0.75 ()
Flights JFK 0.50 () 0.78 () 0.74 () 0.90 ()
Flights LGA 0.47 () 0.59 () 0.70 () 0.78 ()
Flights longhaul 0.69 () 0.89 () 0.86 () 0.97 ()
Friedman1 0.02 () 0.13 () 0.18 () 0.16 ()
Friedman2 0.00 () 0.06 () 0.08 () 0.07 ()
Friedman3 0.04 () 0.09 () 0.20 () 0.22 ()
Online videos 0.09 () 0.18 () 0.22 () 0.28 ()
Wine quality white 0.60 () 0.64 () 0.66 () 0.69 ()
Table 1: Mean squared error of different ensemble methods, normalized by $\sigma_y^2$. On the Affairs data, U-PCR estimates that it is a difficult problem and does not predict outcomes. Numbers in bold represent cases where one of the unsupervised ensemble regressors was significantly better than the others.

B.1 Selecting a Single Regressor

Table 2 compares the MSE of the single regressor estimated to be the most accurate versus the MSE of the single best regressor, which is unknown in this setup. The following two methods were compared: (i) selecting the regressor with the maximum entry in the leading eigenvector $\hat{v}_1$, and (ii) selecting the regressor with the minimal estimated MSE. The experiments were repeated 20 times for each dataset; means and standard deviations are reported. All values are normalized by $\sigma_y^2$ for fair comparison.

Dataset Oracle MSE Best Regressor MSE MSE of regressor with minimal estimated MSE MSE of regressor with largest entry in $\hat{v}_1$
Abalone 0.43 () 0.45 () 0.49 () 0.77 ()
Basketball 0.28 () 0.32 () 0.36 () 0.49 ()
Bike sharing 0.00 () 0.00 () 0.00 () 0.00 ()
Blog feedback 0.41 () 0.43 () 0.66 () 0.62 ()
CCPP 0.06 () 0.07 () 0.07 () 0.09 ()
Flights AUS 0.33 () 0.48 () 0.56 () 0.66 ()
Flights BOS 0.47 () 0.53 () 0.61 () 1.04 ()
Flights BWI 0.44 () 0.50 () 0.50 () 0.87 ()
Flights HOU 0.40 () 0.52 () 0.53 () 1.13 ()
Flights JFK 0.50 () 0.54 () 0.64 () 0.95 ()
Flights LGA 0.47 () 0.53 () 0.59 () 1.04 ()
Flights longhaul 0.69 () 0.74 () 0.79 () 1.17 ()
Friedman1 0.02 () 0.03 () 0.24 () 0.03 ()
Friedman2 0.00 () 0.01 () 0.14 () 0.02 ()
Friedman3 0.04 () 0.07 () 0.25 () 0.14 ()

Online videos
0.09 () 0.10 () 0.34 () 0.17 ()
Wine quality white 0.60 () 0.62 () 0.77 () 1.12 ()
Table 2: MSE of the single best estimated regressor

B.2 Dataset Descriptions

Below is a list of the prediction tasks for which we manually trained ensembles with 10 regressors. Table 3 summarizes the main characteristics of each dataset, and Table 1 contains the mean squared errors of the different approaches normalized by . The experiments were repeated 20 times, and the mean and standard deviations are reported. We used standard Python packages for the regression algorithms with the following parameters: Ridge (), Kernel Regression (kernel chosen using cross validation between polynomial, RBF, sigmoid), Lasso (), Orthogonal Matching Pursuit, Linear SVR (), SVR with RBF kernel ( chosen using cross validation out of 0.01, 0.1, 1, 10), Regression Tree (depth 4), Regression Tree (infinite depth), Random Forest (100 trees), and a Bagging Regressor.

Name $n$ (held-out) $n_{\mathrm{train}}$ $d$ Avg. regressor MSE Best regressor MSE Oracle MSE
Abalone 3277 700 7 0.59 0.45 0.431 ()
Affairs 5466 700 7 1.08 0.93 0.922 ()
Basketball 48899 900 9 0.43 0.32 0.281 ()
Bike Sharing 15579 1600 16 0.07 0.00 0.000 ()
Blog Feedback 24197 28000 280 0.64 0.43 0.415 ()
CCPP 8968 400 4 0.10 0.07 0.059 ()
Flights AUS 47595 1000 10 0.76 0.48 0.329 ()
Flights BOS 112705 1000 10 0.84 0.53 0.470 ()
Flights BWI 101665 1000 10 0.85 0.50 0.440 ()
Flights HOU 53044 1000 10 0.87 0.52 0.397 ()
Flights JFK 113960 1000 10 0.89 0.54 0.495 ()
Flights LGA 111911 1000 10 0.86 0.53 0.471 ()
Flights Long Haul 9393 1000 10 1.00 0.73 0.686 ()
Friedman1 18800 1000 10 0.31 0.03 0.024 ()
Friedman2 19400 400 4 0.17 0.01 0.004 ()
Friedman3 19400 400 4 0.35 0.07 0.043 ()
Online Videos 66484 2100 21 0.34 0.10 0.094 ()
Wine Quality White 3598 1100 11 0.79 0.62 0.595 ()

  • $n$ is the number of held-out samples. The input is $d$-dimensional, and the same $n_{\mathrm{train}}$ random samples were used to train the different algorithms in the ensemble. The next column is the average regressor error, followed by the minimal error achieved by a regressor in the ensemble, and the last column is the MSE of the oracle, normalized by $\sigma_y^2$, with its standard deviation in parentheses. For each dataset the split between train and test was performed 20 times; averages are listed.

Table 3: Prediction Problems
Figure 7: Estimated vs. true values in three regression problems of different difficulty levels: (a) CCPP, accurate prediction possible; (b) Basketball, challenging task; (c) Affairs, no information on the response.

Abalone.

A dataset containing features of abalone, where the goal is to predict its age Lichman (2013). archive.ics.uci.edu/ml/datasets/Abalone

Affairs.

A dataset containing features describing an individual such as time at work, time spent with spouse, and time spent with a paramour. The goal here is to predict the time spent in extramarital affairs. statsmodels.sourceforge.net/0.6.0/datasets/generated/fair.html

Basketball.

This dataset contains statistics on NBA players. The task is to predict the number of points scored by a player in the next game. The features are: name, venue, team, date, start, pts_ma, min_ma, pts_ma_1, min_ma_1, pts, where start indicates whether or not the player started, pts is the number of points scored, min is the number of minutes played, ma stands for a moving average starting at the beginning of the season, and ma_1 is a moving average with a one-game lag.

Bike Sharing.

Bike sharing service statistics, including weather and seasonal information Fanaee-T and Gama (2014). The prediction task here is the daily and hourly count of bikes rented. archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset

Blog Feedback.

Instances in this dataset contain features extracted from blog posts. The task associated with the data is to predict how many comments the post will receive.

Flights.

Information on flights from 2008, where the task is to predict the delay upon arrival in minutes. The features here are the date, day of the week, scheduled and actual departure times, scheduled arrival times, flight ID, tail number, origin, destination, and distance. Due to its size, we split this dataset into flights originating from specific airports (AUS, BOS, BWI, HOU, JFK, and LGA), and long-haul flights. stat-computing.org/dataexpo/2009/the-data.html

Ccpp.

Combined Cycle Power Plant UCI-dataset containing physical characteristics such as temperature and humidity. The task here is to predict the net hourly electrical energy output of the plant. archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant

Friedman #1

Motivated by Breiman (1996), we used simulated data according to Friedman (1991). The predictor variables $x_1,\ldots,x_{10}$ are independent and uniformly distributed over $[0,1]$. The response is

$$y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + \varepsilon,$$

with standard Gaussian noise $\varepsilon$.

Friedman #2

The second data set tested by Friedman (1991) simulates impedance in an alternating current circuit. Here the four predictor variables are uniformly distributed over the ranges $0 \le x_1 \le 100$, $40\pi \le x_2 \le 560\pi$, $0 \le x_3 \le 1$ and $1 \le x_4 \le 11$, respectively. The response was

$$y = \sqrt{x_1^2 + \big(x_2 x_3 - 1/(x_2 x_4)\big)^2} + \varepsilon,$$

with $\varepsilon \sim \mathcal{N}(0, \sigma^2)$, where the variance $\sigma^2$ was chosen to provide a 3-to-1 signal-to-noise ratio. For the third dataset in this series, Friedman #3, see the original paper Friedman (1991).

Online Videos.

YouTube video transcoding dataset. Predict the transcoding time based on parameters of the video. archive.ics.uci.edu/ml/datasets/Online+Video+Characteristics+and+Transcoding+Time+Dataset

Wine Quality White.

Predict the quality score (1-10) of white wine based on chemical characteristics, such as acidity and pH level Cortez et al. (2009). archive.ics.uci.edu/ml/datasets/Wine+Quality

Figure 8: Sample images from the bounding box experiment. The ground truth bounding box is shown in blue, U-PCR in dashed green, and the regressors are shown in red.