Unsupervised Ensemble Learning with Dependent Classifiers

10/20/2015, by Ariel Jaffe, et al.

In unsupervised ensemble learning, one obtains predictions from multiple sources or classifiers, yet without knowing the reliability and expertise of each source, and with no labeled data to assess it. The task is to combine these possibly conflicting predictions into an accurate meta-learner. Most works to date assumed perfect diversity between the different sources, a property known as conditional independence. In realistic scenarios, however, this assumption is often violated, and ensemble learners based on it can be severely sub-optimal. The key challenges we address in this paper are: (i) how to detect, in an unsupervised manner, strong violations of conditional independence; and (ii) how to construct a suitable meta-learner. To this end we introduce a statistical model that allows for dependencies between classifiers. Our main contributions are the development of novel unsupervised methods to detect strongly dependent classifiers, better estimate their accuracies, and construct an improved meta-learner. Using both artificial and real datasets, we showcase the importance of taking classifier dependencies into account and the competitive performance of our approach.


1 Introduction

In recent years unsupervised ensemble learning has become increasingly popular. In multiple application domains one obtains the predictions, over a large set of unlabeled instances, of an ensemble of different experts or classifiers with unknown reliability. Common tasks are to combine these possibly conflicting predictions into an accurate meta-learner and to assess the accuracy of the various experts, both without any labeled data.

A leading example is crowdsourcing, whereby a tedious labeling task is distributed to many annotators. Unsupervised ensemble learning is of increasing interest also in computational biology, where recent works in the field propose to solve difficult prediction tasks by applying multiple algorithms and merging their results [3, 7, 14, 1]. Additional examples of unsupervised ensemble learning appear, among others, in medicine [12] and decision science [17].

Perhaps the first to address ensemble learning in this fully unsupervised setup were Dawid and Skene [5]. A key assumption in their work was perfect diversity between the different classifiers; namely, their labeling errors were assumed to be statistically independent of each other. This property, known as conditional independence, is illustrated in the graphical model of Fig. 1 (left). In [5], Dawid and Skene proposed to estimate the parameters of the model, i.e. the accuracies of the different classifiers, by applying the EM procedure to the non-convex likelihood function. With the increasing popularity of crowdsourcing and other unsupervised ensemble learning applications, there has been a surge of interest in this line of work, and multiple extensions of it [22, 11, 18, 23, 20]. As the quality of the solution found by the EM algorithm critically depends on its starting point, several recent works derived computationally efficient spectral methods to provide a good initial guess [2, 10, 15, 9].

Despite its popularity and usefulness, the model of Dawid and Skene has several limitations. One notable limitation is its assumption that all instances are equally difficult, with each classifier having the same probability of error over all instances. This issue was addressed, for example, by Whitehill et al. [23], who introduced a model of instance difficulty, and by Tian et al. [21], who proposed a model where instances are divided into groups and the expertise of each classifier is group dependent.

A second limitation, which is the focus of our work, is the assumption of perfect conditional independence between all classifiers. As we illustrate below, this assumption may be strongly violated in real-world scenarios. Furthermore, as shown in Sec. 5, neglecting classifier dependencies may yield quite sub-optimal predictions. Yet, to the best of our knowledge, relatively few works have attempted to address this important issue.

To handle classifier dependencies, Donmez et al. [6] proposed a model with pairwise interactions between all classifier outputs. However, they noted that empirically, their model did not yield more accurate predictions. Platanios et al. [16] developed a method to estimate the error rates of either dependent or independent classifiers. Their method is based on analyzing the agreement rates between pairs or larger subsets of classifiers, together with a soft prior on weak dependence amongst them.

The present work is partly motivated by the ongoing somatic mutation calling DREAM (Dialogue for Reverse Engineering Assessments and Methods) challenge, a sequence of open competitions for detecting somatic mutations in DNA sequencing data. This is a real-world example of unsupervised ensemble learning, where participants in the currently open competition are given access to the predictions of more than 100 different classifiers, over more than 100,000 instances. These classifiers were constructed by various labs worldwide, each employing their own biological knowledge and possibly proprietary labeled data. The task is to construct, in an unsupervised fashion, an accurate ensemble learner.

In Fig. 2(a) we present the empirical conditional covariance matrix between different classifiers on one of the datasets of the DREAM challenge, for which ground-truth labels have been disclosed. Under the conditional independence assumption, the population conditional covariance between every two classifiers should be exactly zero. Fig. 2(a), in contrast, exhibits strong dependencies between groups of classifiers.

Unsupervised ensemble learning in the presence of possibly strongly dependent classifiers raises the following two key challenges: (i) how to detect, in an unsupervised manner, strong violations of conditional independence; and (ii) how to construct a suitable meta-learner.

To cope with these challenges, in Sec. 2 we introduce a new model for the joint distribution of all classifiers, which allows for dependencies between them through an intermediate layer of latent variables. This generalizes the model of Dawid and Skene and allows for groups of strongly correlated classifiers, as observed, for example, in the DREAM data.

In Sec. 3 we devise a simple algorithm to detect subsets of strongly dependent classifiers using only their predictions and no labeled data. This is done by exploiting the structural low-rank properties of the classifiers' covariance matrix. Fig. 2(b) shows our resulting estimate for deviations from conditional independence on the same data as Fig. 2(a). Comparing the two panels illustrates the ability of our method to detect strong dependencies with no labeled data.

In Sec. 4 we propose methods to better estimate the accuracies of the classifiers and construct an improved meta-learner, both in the presence of strong dependencies between some of the classifiers. Finally, in Sec. 5 we illustrate the competitive performance of the modified ensemble learner derived from our model on artificial data, four datasets from the UCI repository, and three datasets from the DREAM challenge. These empirical results showcase the limitations of the strict conditional independence model, and highlight the importance of modeling the statistical dependencies between different classifiers in unsupervised ensemble learning scenarios.

2 Problem Setup

Notations.

We consider the following binary classification problem. Let $\mathcal{X}$ be an instance space with an output space $\mathcal{Y} = \{-1, 1\}$. A labeled instance $(x, y)$ is a realization of the random variable $(X, Y)$. The joint distribution $p(X, Y)$, as well as the marginals $p(X)$ and $p(Y)$, are all unknown. We further denote by $b$ the class imbalance of $Y$,

$$ b = \Pr(Y = 1) - \Pr(Y = -1). \qquad (1) $$

Let $f_1, \dots, f_m$ be a set of $m$ binary classifiers operating on $\mathcal{X}$. As our classification problem is binary, the accuracy of the $i$-th classifier is fully characterized by its sensitivity $\psi_i$ and specificity $\eta_i$,

$$ \psi_i = \Pr\big(f_i(X) = 1 \mid Y = 1\big), \qquad \eta_i = \Pr\big(f_i(X) = -1 \mid Y = -1\big). \qquad (2) $$

For future use, we denote by $\pi_i$ its balanced accuracy, given by the average of its sensitivity and specificity,

$$ \pi_i = \frac{\psi_i + \eta_i}{2}. \qquad (3) $$

Note that when the class imbalance $b$ is zero, $\pi_i$ is simply the overall accuracy of the $i$-th classifier.
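As a concrete reading of Eqs. (2)-(3), the short helper below (our own illustration, not code from the paper) computes the empirical sensitivity, specificity, and balanced accuracy of a single classifier when ground-truth labels happen to be available, e.g. for the post-hoc evaluations reported later on the DREAM datasets:

```python
import numpy as np

def classifier_accuracies(f, y):
    """Empirical sensitivity, specificity and balanced accuracy (Eqs. 2-3).

    f, y : arrays with +/-1 entries (predictions and true labels).
    """
    psi = np.mean(f[y == 1] == 1)     # Pr(f = 1 | Y = 1)
    eta = np.mean(f[y == -1] == -1)   # Pr(f = -1 | Y = -1)
    return psi, eta, 0.5 * (psi + eta)
```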

The classical conditional independence model.

Fig. 1: (Left) The perfect conditional independence model of Dawid and Skene: all classifiers are independent given the class label $Y$. (Right) The generalized model considered in this work.

In the model proposed by Dawid and Skene [5], depicted in Fig. 1 (left), all classifiers were assumed conditionally independent given the class label. Namely, for any set of predictions $(a_1, \dots, a_m)$,

$$ \Pr\big(f_1(X) = a_1, \dots, f_m(X) = a_m \mid Y = y\big) = \prod_{i=1}^{m} \Pr\big(f_i(X) = a_i \mid Y = y\big). \qquad (4) $$

As shown in [5], the maximum likelihood estimate (MLE) of $y$ given the parameters $\{\psi_i\}$ and $\{\eta_i\}$ is linear in the predictions $f_i(x)$,

$$ \hat{y}(x) = \operatorname{sign}\left( \sum_{i=1}^{m} \left[ f_i(x)\,\ln\frac{\psi_i\,\eta_i}{(1-\psi_i)(1-\eta_i)} + \ln\frac{\psi_i\,(1-\psi_i)}{\eta_i\,(1-\eta_i)} \right] \right). \qquad (5) $$

Hence, the main challenge is to estimate the model parameters $\{\psi_i\}$ and $\{\eta_i\}$. A simple approach to do so, as described in [15, 9], is based on the following insight: a classifier which is totally random has zero correlation with any other classifier. In contrast, a high correlation between the predictions of two classifiers is a strong indication that both are highly accurate, assuming they are not both adversarial.

In many realistic scenarios, however, an ensemble may contain several strongly dependent classifiers. Such a scenario has several consequences. First, the insight above, that high correlation between two classifiers implies that both are accurate, breaks down completely. Second, as shown in Sec. 5, estimating the classifiers' parameters as if they were conditionally independent may be highly inaccurate. Third, in contrast to Eq. (5), the optimal ensemble learner is in general non-linear in the classifiers' predictions; applying the linear meta-classifier of Eq. (5) may be suboptimal, even when provided with the true classifier accuracies.
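For reference, here is a minimal sketch of the linear combination rule of Eq. (5), assuming the sensitivities and specificities are already known (the function name and interface are ours):

```python
import numpy as np

def linear_mle_combiner(F, psi, eta):
    """Linear meta-learner of Eq. (5) under conditional independence.

    F   : (m, n) array of +/-1 predictions, one row per classifier.
    psi : (m,) sensitivities; eta : (m,) specificities (assumed known here).
    """
    psi, eta = np.asarray(psi, float), np.asarray(eta, float)
    w = np.log(psi * eta / ((1 - psi) * (1 - eta)))   # per-classifier weight
    v = np.log(psi * (1 - psi) / (eta * (1 - eta)))   # per-classifier offset
    scores = w @ np.asarray(F) + v.sum()              # aggregated log-likelihood ratio per instance
    return np.where(scores >= 0, 1, -1)
```

Note that a classifier with balanced accuracy exactly 1/2 receives zero weight, while highly accurate (or highly adversarial) classifiers dominate the vote.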

A model for conditionally dependent classifiers.

In this paper we significantly relax the conditional independence assumption. We introduce a new model which allows classifiers to be dependent through unobserved latent variables, and develop novel methods to learn the model parameters and construct an improved non-linear meta-learner.

In contrast to the 2-layer model of Dawid and Skene, our proposed model, illustrated in Fig. 1 (right), has an additional intermediate layer with $K$ latent binary random variables $\alpha_1, \dots, \alpha_K \in \{-1, 1\}$. In this model, the unobserved $\alpha_k$ are conditionally independent given the true label $Y$, whereas each observed classifier $f_i$ depends on $Y$ only through a single and unknown latent variable. Classifiers that depend on different latent variables are thus conditionally independent given $Y$, whereas classifiers that depend on the same latent variable may have strongly correlated prediction errors. Each hidden variable $\alpha_k$ can thus be interpreted as a separate unobserved teacher, or source of information, and the classifiers that depend on it are different perturbations of it. Namely, even though we observe $m$ predictions for each instance, they are in fact generated by a hidden model with intrinsic dimensionality $K$, where possibly $K \ll m$.

Let us now describe in detail our probabilistic model. First, since the latent variables $\alpha_1, \dots, \alpha_K$ follow the classical model of Dawid and Skene, their joint distribution with $Y$ is fully characterized by the class imbalance $b$ and the probabilities

$$ \psi_{\alpha_k} = \Pr(\alpha_k = 1 \mid Y = 1), \qquad \eta_{\alpha_k} = \Pr(\alpha_k = -1 \mid Y = -1), \qquad k = 1, \dots, K. $$

Next, we introduce an assignment function $c : \{1, \dots, m\} \to \{1, \dots, K\}$, such that if classifier $f_i$ depends on $\alpha_k$ then $c(i) = k$. The dependence of classifier $f_i$ on the class label $Y$ is only through its latent variable $\alpha_{c(i)}$,

$$ \Pr\big(f_i(X) = a \mid \alpha_{c(i)}, Y\big) = \Pr\big(f_i(X) = a \mid \alpha_{c(i)}\big). \qquad (6) $$

Hence, classifiers with $c(i) \neq c(j)$ maintain the original conditional independence assumption of Eq. (4). In contrast, classifiers with $c(i) = c(j)$ are only conditionally independent given $\alpha_{c(i)}$,

$$ \Pr\big(f_i(X) = a_i, f_j(X) = a_j \mid \alpha_{c(i)}\big) = \Pr\big(f_i(X) = a_i \mid \alpha_{c(i)}\big)\,\Pr\big(f_j(X) = a_j \mid \alpha_{c(i)}\big). \qquad (7) $$

Note that if the number of groups $K$ is equal to the number of classifiers $m$, then all classifiers are conditionally independent, and we recover the original model of Dawid and Skene.

Since the model now consists of three layers, the remaining parameters that describe it are the sensitivity and specificity of the $i$-th classifier given its latent variable $\alpha_{c(i)}$,

$$ \psi_{i,\alpha} = \Pr\big(f_i(X) = 1 \mid \alpha_{c(i)} = 1\big), \qquad \eta_{i,\alpha} = \Pr\big(f_i(X) = -1 \mid \alpha_{c(i)} = -1\big). $$

By Eq. (6), the overall sensitivity $\psi_i$ of the $i$-th classifier is related to $\psi_{i,\alpha}, \eta_{i,\alpha}$ and $\psi_{\alpha_{c(i)}}$ via

$$ \psi_i = \psi_{i,\alpha}\,\psi_{\alpha_{c(i)}} + (1 - \eta_{i,\alpha})\,(1 - \psi_{\alpha_{c(i)}}), \qquad (8) $$

with a similar expression for its overall specificity $\eta_i$.
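To make the generative process concrete, here is a small sampler for the three-layer model of Fig. 1 (right); it is our own sketch using the notation above, and the argument names (psi_alpha, psi_ia, etc.) are assumptions, not the paper's code:

```python
import numpy as np

def sample_model(n, b, psi_alpha, eta_alpha, c, psi_ia, eta_ia, seed=0):
    """Sample n instances from the three-layer model of Fig. 1 (right).

    b                    : class imbalance, Pr(Y=1) - Pr(Y=-1).
    psi_alpha, eta_alpha : (K,) sensitivity/specificity of the latent variables w.r.t. Y.
    c                    : (m,) group assignment, values in {0, ..., K-1}.
    psi_ia, eta_ia       : (m,) sensitivity/specificity of each classifier w.r.t. its latent variable.
    Returns the labels Y (n,) and the prediction matrix Z (m, n), all entries in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    psi_alpha, eta_alpha = np.asarray(psi_alpha), np.asarray(eta_alpha)
    psi_ia, eta_ia, c = np.asarray(psi_ia), np.asarray(eta_ia), np.asarray(c)

    Y = np.where(rng.random(n) < (1 + b) / 2, 1, -1)                        # true labels
    # latent variables: conditionally independent given Y
    p_alpha = np.where(Y == 1, psi_alpha[:, None], 1 - eta_alpha[:, None])  # Pr(alpha_k = 1 | Y)
    A = np.where(rng.random((len(psi_alpha), n)) < p_alpha, 1, -1)
    # classifiers: depend on Y only through alpha_{c(i)}
    A_c = A[c]                                                              # latent value seen by each classifier
    p_f = np.where(A_c == 1, psi_ia[:, None], 1 - eta_ia[:, None])          # Pr(f_i = 1 | alpha_{c(i)})
    Z = np.where(rng.random((len(c), n)) < p_f, 1, -1)
    return Y, Z
```

For instance, sample_model(10000, 0.0, [0.9, 0.85], [0.9, 0.8], [0, 0, 0, 1], [0.8, 0.75, 0.9, 0.7], [0.8, 0.7, 0.85, 0.75]) produces an ensemble in which the first three classifiers share a latent "teacher" and are therefore conditionally dependent given Y.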

Remark on Model Identifiability.

Note that the model depicted in Fig. 1 (right) is in general not identifiable. For example, the classical model of Dawid and Skene can also be recovered with a single latent variable $\alpha_1$, by having $\alpha_1 = Y$. Similarly, for a latent variable that has only a single classifier dependent on it, the parameters $\psi_{\alpha_k}, \eta_{\alpha_k}$ and $\psi_{i,\alpha}, \eta_{i,\alpha}$ are non-identifiable. Nonetheless, these non-identifiability issues do not affect our algorithms, described below.

Problem Formulation.

We consider the following totally unsupervised scenario. Let $Z$ be a binary $m \times n$ matrix with entries $z_{ij} = f_i(x_j)$, where $z_{ij}$ is the label predicted by classifier $f_i$ at instance $x_j$. We assume the instances $x_j$ are drawn i.i.d. from $p(X)$. We also assume the classifiers satisfy our generalized model, but otherwise we have no prior knowledge as to the number of groups $K$, the assignment function $c$, or the classifier accuracies (sensitivities $\psi_i$ and specificities $\eta_i$). Given only the matrix $Z$ of binary predictions and no labeled data, we consider the following problems:

  1. Is it possible to detect strongly dependent classifiers, and to estimate the number of groups $K$ and the corresponding assignment function $c$?

  2. Given a positive answer to the previous question, how can we estimate the sensitivities and specificities of the different classifiers and construct an improved, possibly non-linear, meta-learner?

3 Estimating the assignment function

The main challenge in our model is the first problem of estimating the number of groups $K$ and the assignment function $c$. Once $c$ is obtained, we will see in Section 4 that our second problem can be reduced to the conditionally independent case, already addressed in previous works [9, 15, 25, 10]. In principle, one could try to fit the whole model by maximum likelihood; however, this results in a hard combinatorial problem. We propose instead to first estimate only $K$ and $c$. We do so using the low-rank structure of the covariance matrix of the classifiers implied by our model.

The covariance matrix.

Let $C$ denote the population covariance matrix of the $m$ classifiers, with entries

$$ c_{ij} = \mathbb{E}\Big[\big(f_i(X) - \mathbb{E}[f_i(X)]\big)\,\big(f_j(X) - \mathbb{E}[f_j(X)]\big)\Big]. \qquad (9) $$

The following lemma describes its structure. It generalizes a similar lemma, for the standard Dawid and Skene model, proven in [15]. The proofs of this and the other lemmas below appear in the appendix.

Lemma 1.

There exist two vectors $u, v \in \mathbb{R}^m$ such that for all $i \neq j$,

$$ c_{ij} = \begin{cases} u_i\,u_j & c(i) \neq c(j), \\ v_i\,v_j & c(i) = c(j). \end{cases} \qquad (10) $$

The population covariance matrix is therefore a combination of two rank-one matrices. The on-block diagonal elements, with $c(i) = c(j)$, correspond to the rank-one matrix $C^{\mathrm{on}} = v v^T$, where "on" stands for on-block, while the off-block diagonal elements, with $c(i) \neq c(j)$, correspond to another rank-one matrix $C^{\mathrm{off}} = u u^T$. Let us define the indicator

$$ \delta_{ij} = \begin{cases} 1 & c(i) = c(j), \\ 0 & c(i) \neq c(j). \end{cases} \qquad (11) $$

The off-diagonal elements of $C$ can thus be written as follows,

$$ c_{ij} = \delta_{ij}\, v_i v_j + (1 - \delta_{ij})\, u_i u_j, \qquad i \neq j. \qquad (12) $$

Learning the model in the ideal setting.

It is instructive to first examine the case where the data is generated exactly according to our model, and the population covariance matrix is exactly known, i.e. $\hat{C} = C$. The question of interest is whether it is possible to recover the assignment function $c$ in this setting.

To this end, let us look at the possible values of the determinants of $2 \times 2$ submatrices of $C$,

$$ d_{ij}^{kl} = \det\begin{pmatrix} c_{ij} & c_{il} \\ c_{kj} & c_{kl} \end{pmatrix} = c_{ij}\,c_{kl} - c_{il}\,c_{kj}, \qquad (13) $$

where $i, j, k, l$ are four distinct indices.

Due to the low rank structure described in lemma 1, we have the following result, with the exact conditions appearing in the appendix.

Lemma 2.

Assume the two vectors $u$ and $v$ are sufficiently different. Then $d_{ij}^{kl} = 0$ if and only if either: (i) three or more of the indices $i, j, k$ and $l$ belong to the same group; or (ii) $c(i) \neq c(j)$, $c(k) \neq c(l)$, $c(i) \neq c(l)$ and $c(k) \neq c(j)$.

With details in the appendix, by comparing the pattern of index pairs $(k, l)$ for which $d_{ij}^{kl} = 0$ across different pairs $(i, j)$, we can deduce, in polynomial time, whether $c(i) = c(j)$.
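The following self-contained check (our own construction, not the paper's code) builds a population covariance with the structure of Eq. (12), using the within-group proportionality between u and v derived in the appendix (Eq. 24), and verifies the zero pattern stated in Lemma 2:

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(1)

# hypothetical assignment of m = 8 classifiers into 4 groups (one of size 3, one singleton)
c = np.array([0, 0, 0, 1, 1, 2, 2, 3])
m = len(c)

# vectors u (off-block) and v (on-block) of Lemma 1; within a group, u_i is
# proportional to v_i with a group-dependent factor mu_k (cf. Eq. 24)
v = rng.uniform(0.3, 0.9, size=m)
mu = rng.uniform(0.4, 0.8, size=c.max() + 1)
u = mu[c] * v

# population covariance of Eq. (12): on-block entries v_i v_j, off-block u_i u_j
delta = (c[:, None] == c[None, :]).astype(float)
C = delta * np.outer(v, v) + (1 - delta) * np.outer(u, u)

def det_2x2(C, i, j, k, l):
    """Determinant of Eq. (13): rows {i, k} and columns {j, l} of C."""
    return C[i, j] * C[k, l] - C[i, l] * C[k, j]

# Lemma 2: the determinant vanishes iff three or more indices share a group,
# or all four pairs (i,j), (k,l), (i,l), (k,j) lie in different groups
for i, j, k, l in permutations(range(m), 4):
    d = det_2x2(C, i, j, k, l)
    same3 = np.bincount(c[[i, j, k, l]]).max() >= 3
    all_off = c[i] != c[j] and c[k] != c[l] and c[i] != c[l] and c[k] != c[j]
    assert np.isclose(d, 0) == (same3 or all_off), (i, j, k, l)
print("Lemma 2 zero pattern verified on this example")
```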

Learning the model in practice.

In practical scenarios, the population covariance matrix $C$ is unknown and we can only compute the sample covariance matrix $\hat{C}$. Furthermore, our model would typically be only an approximation of the classifiers' dependency structure. Given only $\hat{C}$, the approach to recover the assignment function described above, based on exact matching of the pattern of zeros of the determinants of various $2 \times 2$ submatrices, is clearly not applicable.

In principle, since the population covariance satisfies Eq. (12), a standard approach would be to define the following residual,

$$ R(c, u, v) = \sum_{i \neq j} \Big(\hat{c}_{ij} - \delta_{ij}\, v_i v_j - (1 - \delta_{ij})\, u_i u_j\Big)^2, \qquad (14) $$

and find its global minimum. Unfortunately, as stated in the following lemma and proven in the appendix, in general this is not a simple task.

Lemma 3.

Minimizing the residual of Eq. (14) for a general covariance matrix is NP-hard.

In light of Lemma 3, we now present a tractable algorithm to estimate $K$ and $c$ and provide some theoretical support for it. Our algorithm is inspired by the ideal setting, which highlighted the importance of the determinants of $2 \times 2$ submatrices. To detect pairs of classifiers that strongly violate the conditional independence assumption, we thus compute the following score matrix $\hat{S}$,

$$ \hat{S}_{ij} = \sum_{\substack{k, l \notin \{i, j\} \\ k \neq l}} \Big(\hat{c}_{ij}\,\hat{c}_{kl} - \hat{c}_{il}\,\hat{c}_{kj}\Big)^2. \qquad (15) $$

The idea behind the score matrix is the following: consider the score matrix $S$ computed with the population covariance $C$. Lemma 2 characterized the cases where the $2 \times 2$ submatrices in Eq. (15) are of rank one, and hence their determinant is zero. When $c(i) \neq c(j)$, most submatrices come from four different groups, i.e. they have rank one, and thus the sum $S_{ij}$ is small. On the other hand, when $c(i) = c(j)$, many submatrices are not rank one and thus $S_{ij}$ is large, assuming no degeneracy between $u$ and $v$. As the number of instances grows, $\hat{S} \to S$, and large values of $\hat{S}_{ij}$ serve as an indication of strong conditional dependence between classifiers $i$ and $j$.
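A direct O(m^4) implementation of the score matrix of Eq. (15), as we read it, could look as follows; the function name is ours, and the sample covariance can be obtained with np.cov applied to the prediction matrix:

```python
import numpy as np

def score_matrix(C_hat):
    """Score matrix of Eq. (15): S_ij sums the squared 2x2 determinants of
    Eq. (13) over all ordered pairs (k, l) disjoint from {i, j}."""
    m = C_hat.shape[0]
    S = np.zeros((m, m))
    for i in range(m):
        for j in range(i + 1, m):
            total = 0.0
            for k in range(m):
                if k in (i, j):
                    continue
                for l in range(m):
                    if l in (i, j, k):
                        continue
                    d = C_hat[i, j] * C_hat[k, l] - C_hat[i, l] * C_hat[k, j]
                    total += d * d
            S[i, j] = S[j, i] = total
    return S

# e.g.  S_hat = score_matrix(np.cov(Z))   for an (m, n) prediction matrix Z
```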

The following lemma provides some theoretical justification for the utility of the score matrix computed with the population covariance in recovering the assignment function $c$. For simplicity, we analyze the "symmetric" case where the class imbalance $b = 0$ and all $K$ groups have equal size $m/K$. We measure deviation from conditional independence by the following matrices of conditional covariances $C^{(+)}$ and $C^{(-)}$,

$$ c^{(y)}_{ij} = \mathbb{E}\big[f_i(X)\,f_j(X) \mid Y = y\big] - \mathbb{E}\big[f_i(X) \mid Y = y\big]\,\mathbb{E}\big[f_j(X) \mid Y = y\big], \qquad y \in \{+1, -1\}. \qquad (16) $$

Finally, we assume there is a $\delta > 0$ such that the balanced accuracies of all classifiers satisfy $\pi_i \geq \tfrac{1}{2} + \delta$.

Lemma 4.

Under the assumptions described above, if $c(i) \neq c(j)$ then

(17)

In contrast, if $c(i) = c(j)$ then

(18)

An immediate corollary of Lemma 4 is that if the classifiers are sufficiently accurate, and their dependencies within each group are strong enough, then the score matrix exhibits a clear gap, with $S_{ij}$ substantially larger for pairs in the same group than for pairs in different groups. In this case, even a simple single-linkage hierarchical clustering algorithm can recover the correct assignment function from $S$. In practice, as only $\hat{S}$ is available, we apply spectral clustering, which is more robust and works better in practice.

We illustrate the usefulness of the score matrix using the DREAM challenge S1 dataset. Fig. 2(a) shows the matrix of conditional covariances of Eq. (16), computed using the ground-truth labels. Fig. 2(b) shows the score matrix, computed using only the classifiers' predictions. We also plot the values of the score matrix vs. the conditional covariance in Fig. 3. Clearly, a high score is a reliable indication of strong conditional dependence between classifiers.

Fig. 2: (a) The conditional covariance matrix of the DREAM dataset S1, computed using the ground-truth labels. (b) The score matrix of the DREAM S1 dataset, computed from the matrix of classifier predictions; for visualization purposes, the color scale of the score matrix is capped at 300.
Fig. 3: Values of the score matrix $\hat{S}_{ij}$ vs. the corresponding conditional covariance for the DREAM dataset S1. The blue dots represent the mean value; the upper and lower red dots represent upper and lower quantiles, respectively.
1: Estimate the covariance matrix $\hat{C}$ via Eq. (9).
2: Obtain the score matrix $\hat{S}$ via Eq. (15).
3: for all candidate numbers of groups $K$ do
4:     Estimate the assignment function $c_K$ by performing spectral clustering with the Laplacian of the score matrix.
5:     Use the clustering function $c_K$ to estimate the two vectors $u, v$.
6:     Calculate the residual via Eq. (14).
7: end for
8: Pick the assignment function $\hat{c}$ and vectors $\hat{u}, \hat{v}$ which yield the minimal residual.
Algorithm 1 Estimating the assignment function $c$ and the vectors $u, v$

It is important to note that the time complexity needed to build the score matrix is $O(m^4)$. While quartic scaling is usually considered too expensive, in our case, as the number of classifiers in many real-world problems is at most in the hundreds, our algorithm can run on these datasets in less than an hour. This can be sped up, for example, by sampling the elements of $\hat{S}$ instead of computing the full matrix [8].

Estimating the assignment function $c$.

We estimate $c$ by spectral clustering of the score matrix $\hat{S}$ of Eq. (15). As the number of clusters or groups $K$ is unknown, we choose the one which minimizes the residual function defined in Eq. (14). The steps for estimating the number of groups $K$ and the assignment function $c$ are summarized in Algorithm 1. Note that retrieving $u$ and $v$ from the covariance matrix is a rank-one matrix completion problem, for which several solutions exist; see, for example, [4]. Also note that while we compute spectral clustering for various numbers of clusters, the costly eigen-decomposition step only needs to be done once.
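The sketch below mirrors Algorithm 1 under two stated substitutions: scikit-learn's SpectralClustering (with the score matrix as a precomputed affinity) stands in for the Laplacian-based clustering, and a simple alternating least-squares fit stands in for the rank-one matrix-completion step of [4]; all function names are ours.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def fit_rank_one(C, mask, iters=200):
    """Least-squares rank-one fit x x^T to the entries of C selected by mask
    (a simple alternating scheme; a stand-in for rank-one matrix completion)."""
    M = mask.astype(float)
    x = np.ones(C.shape[0])
    for _ in range(iters):
        x = (M * C) @ x / (M @ (x * x) + 1e-12)
    return x

def residual(C, labels):
    """Residual of Eq. (14) for a candidate assignment (off-diagonal entries only)."""
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    v = fit_rank_one(C, same & off_diag)   # on-block rank-one factor
    u = fit_rank_one(C, ~same)             # off-block rank-one factor
    fit = np.where(same, np.outer(v, v), np.outer(u, u))
    return np.sum((C - fit)[off_diag] ** 2)

def estimate_assignment(S, C, k_max):
    """Algorithm 1 (sketch): cluster the score matrix for several candidate K
    and keep the assignment with the smallest residual."""
    best = None
    for k in range(2, k_max + 1):
        labels = SpectralClustering(n_clusters=k, affinity='precomputed',
                                    random_state=0).fit_predict(S)
        r = residual(C, labels)
        if best is None or r < best[0]:
            best = (r, k, labels)
    return best  # (residual, number of groups, assignment)
```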

4 The latent spectral meta learner

Estimating the model parameters.

Given estimates of the number of groups $\hat{K}$ and of the assignment function $\hat{c}$, estimating the remaining model parameters can be divided into two stages: (i) estimating the sensitivity and specificity of the different classifiers given their latent variables, $\psi_{i,\alpha}$ and $\eta_{i,\alpha}$; (ii) estimating the probabilities associated with the latent variables, $\psi_{\alpha_k}, \eta_{\alpha_k}$, and the class imbalance $b$.

The key observation is that in each of these stages the underlying model follows the classical conditionally independent model of [5]. In particular, classifiers with a common latent variable are conditionally independent given its value. Similarly, the latent variables themselves are conditionally independent given the true label $Y$. Thus, we can solve the two stages sequentially by any of the various methods already developed for the Dawid and Skene model. In our implementation, we used the spectral meta-learner proposed in [9], whose code is publicly available. Pseudo-code for this process appears in Algorithm 2.
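In code, the two-stage procedure can be sketched as below; dawid_skene_estimate is a placeholder for any estimator of the classical conditionally independent model (e.g. SML followed by EM), assumed to return per-classifier sensitivities and specificities together with the inferred +/-1 values of the variable it treats as the label:

```python
import numpy as np

def estimate_parameters(Z, c, dawid_skene_estimate):
    """Two-stage parameter estimation under the latent model (cf. Algorithm 2).

    Z : (m, n) matrix of +/-1 predictions; c : (m,) estimated group assignment.
    """
    Z, c = np.asarray(Z), np.asarray(c)
    K, n = int(c.max()) + 1, Z.shape[1]
    psi_ia, eta_ia = np.zeros(Z.shape[0]), np.zeros(Z.shape[0])
    alpha_hat = np.zeros((K, n))
    # Stage 1: within each group the classifiers are conditionally independent
    # given their latent variable, so run the base estimator group by group.
    # (singleton groups would need special handling; omitted in this sketch)
    for k in range(K):
        idx = np.flatnonzero(c == k)
        psi_ia[idx], eta_ia[idx], alpha_hat[k] = dawid_skene_estimate(Z[idx])
    # Stage 2: the (imputed) latent variables are conditionally independent
    # given Y, so run the same estimator once more on them.
    psi_alpha, eta_alpha, y_hat = dawid_skene_estimate(alpha_hat)
    b_hat = np.mean(y_hat)      # plug-in estimate of the class imbalance E[Y]
    return psi_ia, eta_ia, psi_alpha, eta_alpha, b_hat
```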

Label Predictions.

Once all the parameters of the model are known, for each instance $x$ we estimate its label by maximum likelihood,

$$ \hat{y}(x) = \operatorname*{arg\,max}_{y \in \{-1, 1\}} \; \Pr\big(f_1(x), \dots, f_m(x) \mid Y = y\big). \qquad (19) $$

Following our generative model, Fig. 1 (right), the above probability is a function of the model parameters $\psi_{i,\alpha}, \eta_{i,\alpha}, \psi_{\alpha_k}, \eta_{\alpha_k}$ and the assignment function $c$.
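Written out, the likelihood in Eq. (19) factorizes over groups, with each latent variable summed out together with the classifiers assigned to it; the sketch below (our own, same notation as before) makes the resulting non-linear decision rule explicit:

```python
import numpy as np

def latent_ml_predict(F, c, psi_alpha, eta_alpha, psi_ia, eta_ia):
    """Maximum-likelihood label prediction of Eq. (19) under the latent model.

    F : (m, n) array of +/-1 predictions; c : (m,) group assignment in {0..K-1};
    the remaining arguments are the model parameters, as estimated above.
    """
    F, c = np.asarray(F), np.asarray(c)
    psi_ia, eta_ia = np.asarray(psi_ia), np.asarray(eta_ia)
    K, n = len(psi_alpha), F.shape[1]
    loglik = np.zeros((2, n))                  # rows: hypotheses Y = +1 and Y = -1
    for yi, p_alpha in enumerate([np.asarray(psi_alpha), 1 - np.asarray(eta_alpha)]):
        for k in range(K):                     # Pr(alpha_k = 1 | Y) = p_alpha[k]
            idx = c == k
            Fk = F[idx]
            # Pr(f_i = observed value | alpha_k = +1) and | alpha_k = -1)
            p_pos = np.where(Fk == 1, psi_ia[idx, None], 1 - psi_ia[idx, None])
            p_neg = np.where(Fk == -1, eta_ia[idx, None], 1 - eta_ia[idx, None])
            lk = p_alpha[k] * p_pos.prod(axis=0) + (1 - p_alpha[k]) * p_neg.prod(axis=0)
            loglik[yi] += np.log(lk + 1e-300)  # sum the latent variable out, group by group
    return np.where(loglik[0] >= loglik[1], 1, -1)
```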

Classifier selection.

In some cases, it is required to construct a sparse ensemble learner which uses only a small subset of at most $L$ out of the $m$ available classifiers. This problem of selecting a small subset of classifiers, known as ensemble pruning, has mostly been studied in supervised settings; see [19, 13, 24].

Under the conditional independence assumption, the best subset simply consists of the $L$ most accurate classifiers. In our model, in contrast, the correlations between the classifiers have to be taken into account. Assuming the required number of classifiers $L$ is smaller than the number of groups $K$, a simple approach is to select the $L$ most accurate classifiers under the constraint that they all come from different groups. This creates a balance between accuracy and diversity.
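A minimal sketch of this selection rule, assuming estimated balanced accuracies pi_hat and an estimated assignment c are available (function name and interface are ours):

```python
import numpy as np

def select_classifiers(pi_hat, c, L):
    """Ensemble pruning under the dependency model: keep the most accurate
    classifier of each group, then return the L best of these representatives
    (assumes L is at most the number of groups)."""
    pi_hat, c = np.asarray(pi_hat), np.asarray(c)
    reps = np.array([np.flatnonzero(c == k)[np.argmax(pi_hat[c == k])]
                     for k in np.unique(c)])
    return reps[np.argsort(pi_hat[reps])[::-1][:L]]
```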

1: Input: matrix of predictions $Z$, estimated number of groups $\hat{K}$ and assignment function $\hat{c}$.
2: for $k = 1, \dots, \hat{K}$ do
3:     Find all classifiers $i$ with $\hat{c}(i) = k$.
4:     Estimate their parameters $\psi_{i,\alpha}$ and $\eta_{i,\alpha}$.
5:     Estimate the latent values $\alpha_k(x_j)$, $j = 1, \dots, n$.
6: end for
7: Estimate $b$, $\psi_{\alpha_k}$ and $\eta_{\alpha_k}$ from the estimated latent values.
Algorithm 2 Estimate model parameters

5 Experiments

We demonstrate the performance of the latent variable model on artificial data, on datasets from the UCI repository, and on the ICGC-TCGA DREAM challenge.

Throughout our experiments, we compare the performance of the following unsupervised ensemble methods: (1) majority voting, which serves as a baseline; (2) SML+EM, a spectral meta-learner based on the independence assumption [9], which provides an initial guess, followed by EM iterations; (3) Oracle-CI, a linear meta-learner based on Eq. (5), which assumes conditional independence but is given the exact accuracies of all the individual classifiers; and (4) L-SML (latent SML), the new algorithm presented in this work.

For the artificial data, we also present the performance of its oracle meta-learner, denoted Oracle-L, which is given the exact structure and parameters of the model, and predicts the label by maximum likelihood.

5.1 Artificial Data

To validate our theoretical analysis, we generated artificial binary data according to our assumed model, on a balanced classification problem with $b = 0$. We generated an ensemble of $m$ classifiers over $n$ instances, with all parameters of the ensemble chosen uniformly at random from fixed intervals. We consider the case where there is only one group of correlated classifiers, with the remaining classifiers all conditionally independent. The size of the correlated group, $m_1$, is gradually increased; note that for $m_1 = 1$ all classifiers are conditionally independent. Fig. 4 compares the balanced accuracy of the five unsupervised ensemble learners described above as a function of $m_1$. As can be seen, up to moderate values of $m_1$, the ensemble learner based on the concept of correlated classifiers achieves results similar to those of the optimal classifier (Oracle-L). As expected from Lemma 4, as $m_1$ increases further, it becomes harder to correctly estimate the assignment function with the score matrix.

A complementary graph, which presents the probability of recovering the correct assignment function as a function of $m_1$, appears in the appendix. As expected, the degradation in performance starts when the algorithm fails to correctly estimate the model structure.

Fig. 4: Simulated data: Ensemble learner balanced accuracy vs. the size of group 1.
Fig. 5: UCI magic dataset, a comparison of four unsupervised ensemble learners.

5.2 UCI data sets

We applied our algorithms to various binary classification problems using 4 datasets from the UCI repository: Magic, Spambase, Miniboo and Musk. Our ensemble of 16 classifiers consists of 4 random forests, 3 logistic model trees, 4 SVMs and 5 naive Bayes classifiers. Each classifier was trained on a separate, randomly chosen labeled dataset. In our unsupervised ensemble scenario, we had access only to their predictions on a large independent test set.

We present results for the 'Magic' dataset, where the task is to classify each instance as background or as high-energy gamma rays. Further details, and results on the other datasets, appear in the appendix.

As seen in Fig. 5, L-SML improves substantially over the standard SML, and even over the oracle classifier that assumes conditional independence. Our method also outperforms the best individual classifier.

In the appendix we show the conditional covariance matrix of this ensemble together with our estimated assignment. It can be observed that strongly dependent classifiers are indeed grouped together correctly.

5.3 The DREAM mutation calling challenge

The ICGC-TCGA DREAM challenge is an international effort to improve standard methods for identifying cancer-associated mutations and rearrangements in whole-genome sequencing (WGS) data. This publicly available database contains both real and synthetic in-silico tumor instances. The database contains 14 different datasets, each with over 100,000 instances.

Participants in the currently open competition are given access to the predictions of about a hundred different classifiers (denoted there as pipelines); the data can be downloaded from the challenge website http://dreamchallenges.org/. These classifiers were constructed by various labs worldwide, each employing their own biological knowledge and possibly proprietary labeled data. The two current challenges are to construct a meta-learner using either (1) all classifiers, or (2) at most five of them. We evaluate the performance of the different meta-classifiers by their balanced error, i.e., the average of their false positive and false negative rates.

Below we present results on the datasets S1, S2 and S3 for which ground-truth labels have been released.

Challenge I.

The balanced errors of the different meta-learners, constructed using all classifiers, are given in Table 1. The L-SML method outperforms the other meta-learners on all three datasets. On the S3 dataset, it reduces the balanced error by more than 20% relative to the competing meta-learners.

Challenge II

Here the goal is to construct a sparse meta-learner based on at most five individual classifiers from the ensemble. For the methods based on the Dawid and Skene model (voting, SML+EM and Oracle-CI), we took the 5 classifiers with the highest estimated (or known) balanced accuracies. For our model, since the estimated number of groups is larger than five, we first took the best classifier from each group, and then chose the five classifiers with the highest estimated balanced accuracies. For all methods, the final prediction was made by a simple vote of the five chosen classifiers. Though potentially sub-optimal, we chose simple voting because our purpose was to compare the diversity of the selected classifiers.

The results presented in Table 2 show that our method outperforms voting and SML+EM, and achieves results similar to those of the oracle learner.

       Mean   Best   Vote   SML+EM   Oracle-CI   L-SML
S1     6.1    1.7    2.8    1.7      1.7         1.6
S2     8.7    1.8    4.0    2.8      2.8         2.3
S3     8.3    2.5    4.3    2.3      2.3         1.8
Table 1: Balanced error of meta-classifiers based on the full ensemble. For reference, the first two columns give the mean and smallest balanced error of all individual classifiers.
       Vote   SML+EM   Oracle-CI   L-SML
S1     3.2    2.3      1.9         2.0
S2     4.3    4.1      2.5         2.8
S3     2.9    2.9      2.8         2.5
Table 2: Balanced error of sparse meta-classifiers.

6 Acknowledgments

This research was funded in part by the Intel Collaborative Research Institute for Computational Intelligence (ICRI-CI). Y.K. is supported by the National Institutes of Health grants R01 CA158167 and R01 GM086852.

References

  • [1] N. Aghaeepour, G. Finak, H. Hoos, T.R. Mosmann, R. Brinkman, R. Gottardo, R.H. Scheuermann, FlowCAP Consortium, and DREAM Consortium. Critical assessment of automated flow cytometry data analysis techniques. Nature methods, 10(3):228–238, 2013.
  • [2] A. Anandkumar, R. Ge, D. Hsu, S.M. Kakade, and M. Telgarsky. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15:2773–2832, 2014.
  • [3] P.C. Boutros, A.A. Margolin, J.M. Stuart, A. Califano, and G. Stolovitzky. Toward better benchmarking: challenge-based methods assessment in cancer genomics. Genome biology, 15(9):462, 2014.
  • [4] E.J. Candès and B. Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9:717–772, 2009.
  • [5] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society. Series C, 28:20–28, 1979.
  • [6] P. Donmez, G. Lebanon, and K. Balasubramanian. Unsupervised supervised learning I: Estimating classification and regression errors without labels. Journal of Machine Learning Research, 11:1323–1351, 2010.
  • [7] A.D. Ewing, K.E. Houlahan, Y. Hu, K. Ellrott, C. Caloian, T.N. Yamaguchi, J.Ch. Bare, C. P’ng, D. Waggott, and V.Y. Sabelnykova. Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nature methods, 12:623, 2015.
  • [8] E. Fetaya, O. Shamir, and S. Ullman. Graph approximation and clustering on a budget. In 18th Conference on Artificial Intelligence and Statistics, 2015.
  • [9] A. Jaffe, B. Nadler, and Y. Kluger. Estimating the accuracies of multiple classifiers without labeled data. In 18th conference on artificial intelligence and statistics, pages 407–415, 2015.
  • [10] P. Jain and S. Oh. Learning mixtures of discrete product distributions using spectral decompositions. Journal of Machine Learning Research, 35:1–33, 2014.
  • [11] D.R. Karger, S. Oh, and D. Shah. Budget-optimal crowdsourcing using low-rank matrix approximations. In IEEE Allerton Conference on Communication, Control and Computing, pages 284–291, 2011.
  • [12] J. A. Lee. Click to cure. The Lancet Oncology, 2013.
  • [13] G. Martínez-Muñoz, D. Hernández-Lobato, and A. Suárez. An analysis of ensemble pruning techniques based on ordered aggregation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 31(2):245–259, 2009.
  • [14] M. Micsinai, F. Parisi, F. Strino, P. Asp, B.D. Dynlacht, and Y. Kluger. Picking chip-seq peak detectors for analyzing chromatin modification experiments. Nucleic acids research, 2012.
  • [15] F. Parisi, F. Strino, B. Nadler, and Y. Kluger. Ranking and combining multiple predictors without labeled data. Proceedings of the National Academy of Sciences, 111:1253–1258, 2014.
  • [16] E.A. Platanios, A. Blum, and T. Mitchell. Estimating accuracy from unlabeled data. In Uncertainty in Artificial Intelligence, 2014.
  • [17] A.J. Quinn. Crowdsourcing decision support: frugal human computation for efficient decision input acquisition. PhD thesis, 2014.
  • [18] V.C. Raykar, S. Yu, L.H. Zhao, G.H. Valdez, C. Florin, L. Bogoni, and L. Moy. Learning from crowds. Journal of Machine Learning Research, 11:1297–1322, 2010.
  • [19] L. Rokach. Collective-agreement-based pruning of ensembles. Computational Statistics & Data Analysis, 53(4):1015–1026, 2009.
  • [20] A. Sheshadri and M. Lease. SQUARE: A benchmark for research on computing crowd consensus. In AAAI Conference on Human Computation and Crowdsourcing, 2013.
  • [21] Tian Tian and Jun Zhu. Uncovering the latent structures of crowd labeling. In Advances in Knowledge Discovery and Data Mining, pages 392–404. Springer, 2015.
  • [22] P. Welinder, S. Branson, S. Belongie, and P. Perona. The multidimensional wisdom of crowds. In Advances in Neural Information Processing Systems 23 (NIPS 2010), 2010.
  • [23] J. Whitehill, P. Ruvolo, T. Wu, J Bergsma, and J.R. Movellan. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems 22 (NIPS 2009), 2009.
  • [24] Xu-Cheng Yin, Kaizhu Huang, Chun Yang, and Hong-Wei Hao. Convex ensemble learning with sparsity and diversity. Information Fusion, 2014.
  • [25] Y. Zhang, X. Chen, D. Zhou, and M.I. Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Advances in Neural Information Processing Systems, volume 27, pages 1260–1268, 2014.

Appendix A Proof of Lemma 1

This proof is based on the following lemma, which appears in [15]: if two classifiers $f_i, f_j$ are conditionally independent given the class label $Y$, then the covariance between them is equal to

$$ c_{ij} = (1 - b^2)\,(2\pi_i - 1)\,(2\pi_j - 1). \qquad (20) $$

In our model, if $c(i) \neq c(j)$, then $f_i$ and $f_j$ are indeed conditionally independent given $Y$ (Fig. 1, right). The first part of Lemma 1 follows directly from Eq. (20), with $u_i = \sqrt{1-b^2}\,(2\pi_i - 1)$.

To prove the second part of Lemma 1, we note that according to our model, two classifiers with $c(i) = c(j) = k$ are conditionally independent given the value of their latent variable $\alpha_k$. Therefore, we can treat $\alpha_k$ as the class label and apply Eq. (20), with $b$ replaced by the expectation of $\alpha_k$, and the sensitivity and specificity replaced by $\psi_{i,\alpha}, \eta_{i,\alpha}$, respectively. Hence, Eq. (20) becomes

$$ c_{ij} = (1 - b_{\alpha_k}^2)\,(2\pi_{i,\alpha} - 1)\,(2\pi_{j,\alpha} - 1), \qquad (21) $$

where $b_{\alpha_k} = \mathbb{E}[\alpha_k]$ and $\pi_{i,\alpha} = (\psi_{i,\alpha} + \eta_{i,\alpha})/2$.
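As a quick numerical sanity check of Eq. (20) (our own, with arbitrary parameter values), one can simulate two conditionally independent classifiers and compare their empirical covariance with the closed-form expression:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000_000
b, psi, eta = 0.3, np.array([0.8, 0.7]), np.array([0.75, 0.9])

Y = np.where(rng.random(n) < (1 + b) / 2, 1, -1)
p1 = np.where(Y == 1, psi[:, None], 1 - eta[:, None])   # Pr(f_i = 1 | Y)
F = np.where(rng.random((2, n)) < p1, 1, -1)            # conditionally independent given Y

pi = (psi + eta) / 2
empirical = np.cov(F)[0, 1]
predicted = (1 - b**2) * (2 * pi[0] - 1) * (2 * pi[1] - 1)
print(empirical, predicted)   # the two values should agree up to sampling noise
```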

Appendix B Proof of Lemma 2

We assume that $u$ and $v$ are sufficiently different in the following precise sense: we require that for all 4 distinct indices $i, j, k, l$, the determinants $d_{ij}^{kl}$ that are generically non-zero (case 4 below) are indeed non-zero.

Next, we elaborate on the relation between $u$ and $v$. Let us denote by $\psi_{\alpha_k}, \eta_{\alpha_k}$ the sensitivity and specificity of the latent variable $\alpha_k$ with respect to $Y$. Let $f_i$ be a classifier that depends on $\alpha_k$. Applying the law of total probability, its overall sensitivity and specificity are given by

$$ \psi_i = \psi_{i,\alpha}\,\psi_{\alpha_k} + (1-\eta_{i,\alpha})(1-\psi_{\alpha_k}), \qquad \eta_i = \eta_{i,\alpha}\,\eta_{\alpha_k} + (1-\psi_{i,\alpha})(1-\eta_{\alpha_k}). \qquad (22) $$

Adding $\psi_i$ and $\eta_i$ we get the following,

$$ 2\pi_i - 1 = (2\pi_{i,\alpha} - 1)\,(2\pi_{\alpha_k} - 1), \qquad (23) $$

where $\pi_{\alpha_k} = (\psi_{\alpha_k} + \eta_{\alpha_k})/2$. If $c(i) = k$, we have the following dependency between $u_i$ and $v_i$: combining Eq. (23) with the expressions $u_i = \sqrt{1-b^2}\,(2\pi_i - 1)$ and $v_i = \sqrt{1-b_{\alpha_k}^2}\,(2\pi_{i,\alpha} - 1)$ from Appendix A gives

$$ u_i = \frac{\sqrt{1-b^2}\,(2\pi_{\alpha_k} - 1)}{\sqrt{1-b_{\alpha_k}^2}}\; v_i \;=\; \mu_k\, v_i, \qquad (24) $$

where the factor $\mu_k$ depends only on the group $k$. It follows that for two classifiers $i, j$ in the same group, the elements $u_i, u_j$ are linearly dependent with the corresponding elements $v_i, v_j$. This fact shall be useful in proving the lemma.

To prove Lemma 2 we analyze the possible group assignments of the four indices $i, j, k, l$ appearing in $d_{ij}^{kl}$.

  1. All four indices belong to the same group: in this case $d_{ij}^{kl} = v_iv_j\,v_kv_l - v_iv_l\,v_kv_j = 0$.

  2. Exactly three of the indices belong to the same group: here $d_{ij}^{kl} = 0$ as well, using the linear dependency of Eq. (24) between the entries of $u$ and $v$ within that group.

  3. All four pairs appearing in the determinant, $(i,j)$, $(k,l)$, $(i,l)$ and $(k,j)$, belong to different groups: then every entry is off-block and $d_{ij}^{kl} = u_iu_j\,u_ku_l - u_iu_l\,u_ku_j = 0$.

  4. In all remaining cases, at least one but not all of the pairs in the determinant is on-block, and $d_{ij}^{kl} \neq 0$ from our assumption.

It can be seen that $d_{ij}^{kl}$ is equal to zero only if either three or more of the indices belong to the same group (cases (1) and (2)) or all four pairs which appear in the determinant belong to different groups (case (3)).

Appendix C Algorithm for the ideal setting

An immediate conclusion from Lemma 2 is that the set of index pairs $(k, l)$ for which $d_{ij}^{kl} = 0$ depends only on the assignment function. This means we can compare the patterns of zeros associated with two classifiers to decide whether they belong to the same group. If $c(i) = c(j)$, the two patterns coincide. On the other hand, if $c(i) \neq c(j)$ and at least one of the indices $i$ and $j$, w.l.o.g. $i$, belongs to a group with more than one element, then we can find indices $k$ and $l$ for which one of the corresponding determinants vanishes while the other does not.

This means that by comparing the patterns of zeros, we can recover the assignment function. Notice that, according to the algorithm, all singleton classifiers, that is, classifiers that are conditionally independent of the rest of the ensemble, are grouped together under a common latent variable. This is not a problem, as our model is not unique, and this yields an equivalent probabilistic model in which that latent variable is identical to $Y$.

1:Initialize arrays to zero
2:for  do
3:     if  then ()
4:     end if
5:     if  then ()
6:     end if
7:end for
8:if (then
9:     .
10:else
11:     .
12:end if
Algorithm 3 Check if $c(i) = c(j)$

Appendix D Minimizing the residual is an NP-hard problem

We prove Lemma 3 for the case of $K = 2$ clusters and known vectors $u, v$. Our goal is to find a minimizer of the following residual:

$$ R(c) = \sum_{i \neq j} \Big(\hat{c}_{ij} - \delta_{ij}\, v_i v_j - (1 - \delta_{ij})\, u_i u_j\Big)^2. \qquad (25) $$

For the case of $K = 2$ we can simplify the residual considerably. Let us define a vector $x \in \{-1, 1\}^m$ where $x_i = 1$ if $c(i) = 1$ and $x_i = -1$ if $c(i) = 2$. We can replace the indicator function with the following,

$$ \delta_{ij} = \frac{1 + x_i x_j}{2}. \qquad (26) $$

In addition, we can replace the minimization over $c$ with a minimization over $x$,

$$ \min_{x \in \{-1,1\}^m} \; \sum_{i \neq j} \Big(\hat{c}_{ij} - \frac{1 + x_i x_j}{2}\, v_i v_j - \frac{1 - x_i x_j}{2}\, u_i u_j\Big)^2. \qquad (27) $$

Expanding the square in Eq. (27), the first term does not depend on $x$ and we can omit it from the minimization problem. Let us also define the matrix $A$,

$$ a_{ij} = -\Big(\hat{c}_{ij} - \frac{u_i u_j + v_i v_j}{2}\Big)\,\frac{v_i v_j - u_i u_j}{2}, \quad i \neq j, \qquad a_{ii} = 0. \qquad (28) $$

We are left with the following minimization problem:

$$ \min_{x \in \{-1,1\}^m} \; x^T A\, x. \qquad (29) $$

If there is a binary vector $x$ whose residual is precisely zero, then it can be found by computing the eigenvector of $A$ with the smallest eigenvalue. If, however, the minimal residual is not zero, then Eq. (29) is a quadratic optimization problem over discrete variables, which is well known to be NP-hard.

Appendix E Proof of Lemma 4

We start by proving the first part of the lemma, where $c(i) \neq c(j)$. The score matrix is a sum of all possible determinants,

$$ S_{ij} = \sum_{\substack{k, l \notin \{i, j\} \\ k \neq l}} \big(d_{ij}^{kl}\big)^2, \qquad (30) $$

where we refer to each term $\big(d_{ij}^{kl}\big)^2$ as a single score element. The following table separates the score elements into three types, and states the number of elements of each type.

Element type Number of elements

According to Lemma 1, the contribution to the score from elements of the second and third type is exactly zero (see details in Sec. B). We will therefore focus on analyzing the score elements of the first type, where $c(k) = c(l)$. Recall that we assume the symmetric case where $b = 0$ and all groups have equal size $m/K$. Let us consider Lemma 1 in order to analyze the value of $d_{ij}^{kl}$,

(31)

where . For simplicity of notation, let us denote by the ratio of true positives and negatives of the latent variables:

(32)

It can easily be shown that the following holds:

(33)

Inserting (33) into (31) we get,

(34)

Let us now derive the values of the conditional covariance matrices $C^{(+)}$ and $C^{(-)}$. In order to obtain them, we can apply the first part of Lemma 1, and replace the class imbalance $b$, which is the mean value of $Y$, with the conditional mean of the latent variables given $Y$. A similar argument applies to