1 Introduction
In recent years, crowdsourcing applications have gained significant popularity, and consequently much academic attention. At the same time, deep learning has become a major tool in machine learning and artificial intelligence, demonstrating impressive performance in several applications, including computer vision, speech recognition and natural language processing.
The goal of this paper is to show that deep learning methods can also be applied to the areas of crowdsourcing and unsupervised ensemble learning, and provide state-of-the-art results. In unsupervised ensemble learning, one is given the predictions of classifiers on a set of instances and the goal is to recover the true, unknown label of each instance. Dawid and Skene (1979) were among the first to consider such a setup. They assumed that the classifiers are conditionally independent given the true labels. We refer to this model as the DS model and also as the Conditional Independence model.
Despite its simplicity, computing the maximum likelihood estimates of the classifiers' accuracies and the true labels in the DS model is a non-convex optimization problem. In their paper, Dawid and Skene estimated these quantities by the EM algorithm, which is only guaranteed to converge to a local optimum. In recent years, several authors developed computationally efficient spectral methods that are asymptotically consistent under the DS model; see Zhang et al. (2014); Parisi et al. (2014); Jain and Oh (2013); Jaffe et al. (2014) and references therein.
The model of Dawid and Skene relied on two key assumptions that typically do not hold in practice: (i) that classifiers make perfectly independent errors; and (ii) that these errors are uniformly distributed across all instances. To address the second issue above, several authors proposed richer models that include parameters such as instance difficulty and varying skills of annotators across different regions of the input space; see for example Raykar et al. (2010), Whitehill et al. (2009) and Welinder et al. (2010).
In contrast, relatively few works considered relaxations of the conditional independence assumption: Platanios et al. (2014) proposed to estimate the accuracies of possibly dependent classifiers, via their agreement rates over classifier groups of different sizes. Donmez et al. (2010) proposed a model with pairwise interactions between all classifiers. Closest to our approach is the work of Jaffe et al. (2015), who assumed that some of the classifiers may be conditionally dependent, yet their dependency structure can be accurately described by a tree of depth 2.
In this manuscript, we propose a deep learning approach to unsupervised ensemble learning problems with possibly dependent classifiers, where the conditional independence assumption is strongly violated. We make the following contributions. First, we show that the DS model has an equivalent parametrization in terms of a Restricted Boltzmann Machine (RBM) with a single hidden node. Hence, under this model, the posterior probability of the true labels can be estimated from a trained RBM. Next, to tackle violations of conditional independence, we show how a RBM-based Deep Neural Net (DNN) can be applied to unsupervised ensemble learning, and propose a heuristic for determining the DNN architecture. Experimentally, we compare our approach to several state-of-the-art methods that are based on the conditional independence assumption and relaxations of it. We show that our DNN approach often performs better than the other methods on both simulated and real-world datasets. Remarkably, we demonstrate that in some cases, while the raw representation of the data contains correlated features, the learned features in the last hidden layer are almost perfectly uncorrelated.
The structure of this manuscript is as follows: in Section 2 we give a formal definition of the problem. A brief background on RBMs is given in Section 3. In Section 4 we show how RBMs can be used to predict the true labels, under the assumption of conditional independence. In Section 5 we describe how to estimate the labels using a RBM-based DNN. Experimental results are reported in Section 6. The manuscript concludes with a brief summary in Section 7. Proofs appear in the appendix.
1.1 Notation
Throughout this manuscript, $X, Y$ are random variables, and $p_\theta, q_\lambda$ are probability densities, parametrized by $\theta, \lambda$, respectively. We think of $p_\theta$ as the distribution generating the data and of $q_\lambda$ as the RBM model distribution. When the context is clear, we occasionally write $p_\theta(x)$ as a shorthand for $p_\theta(X = x)$. The dimensions of the input data and the sample size are denoted by $d$ and $n$, respectively. We use $\sigma$ to denote the sigmoid function
(1) $\sigma(z) = \frac{1}{1 + e^{-z}}.$
2 Problem Setup
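Let $X \in \{0,1\}^d$ and $Y \in \{0,1\}$ be random variables. We refer to $Y$ as the label of $X$. The pair $(X, Y)$ has a joint distribution, parametrized by $\theta$ and denoted by $p_\theta(X, Y)$, which is given by
$p_\theta(x, y) = p_\theta(y)\, p_\theta(x \mid y).$
The joint distribution $p_\theta(X, Y)$ is not known to us, and neither are the marginals $p_\theta(X)$, $p_\theta(Y)$. Let $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})$ be i.i.d. samples from $p_\theta(X, Y)$. In unsupervised ensemble learning, we observe $x^{(1)}, \ldots, x^{(n)}$ and the learning task is to recover $y^{(1)}, \ldots, y^{(n)}$. In this application, the binary vector $x^{(i)}$ contains the predictions of $d$ classifiers or annotators on an instance, whose label $y^{(i)}$ is unobserved.
2.1 The Conditional Independence Model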
In their seminal paper, Dawid and Skene (1979) assumed that the conditional distribution $p_\theta(X \mid Y)$ factorizes, i.e.,
(2) $p_\theta(x \mid y) = \prod_{i=1}^{d} p_\theta(x_i \mid y).$
Eq. (2), also known as the conditional independence model, is depicted in Figure 1.
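It is fully parametrized by $\theta = (\{\psi_i, \eta_i\}_{i=1}^{d}, \pi)$, where $\pi = p_\theta(Y = 1)$ is the label prior and
$\psi_i = p_\theta(X_i = 1 \mid Y = 1), \quad \eta_i = p_\theta(X_i = 0 \mid Y = 0)$
are often referred to as sensitivity and specificity, respectively. Under the interpretation of the $X_i$'s being classifiers, the sensitivity and specificity quantify the competence of the classifiers or annotators and the conditional independence assumption means that all classifiers make independent errors.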
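To make the conditional independence model concrete, the following is a minimal sketch of how data could be sampled from it and how the label posterior follows from Bayes' rule. The names `psi`, `eta` and `pi` (per-classifier sensitivity, per-classifier specificity, and the prior probability of label 1) are notation assumed for this illustration, and the functions are not taken from the authors' scripts.

```python
import random

def sample_ds(n, psi, eta, pi, rng):
    """Draw n (x, y) pairs from the conditional independence model.

    Assumed notation: psi[i] = P(X_i=1 | Y=1), eta[i] = P(X_i=0 | Y=0),
    pi = P(Y=1).
    """
    data = []
    for _ in range(n):
        y = 1 if rng.random() < pi else 0
        x = []
        for s, t in zip(psi, eta):
            # Probability that classifier i outputs 1 given the label y.
            p_one = s if y == 1 else 1.0 - t
            x.append(1 if rng.random() < p_one else 0)
        data.append((x, y))
    return data

def ds_posterior(x, psi, eta, pi):
    """P(Y=1 | X=x) by Bayes' rule, using the factorized likelihood."""
    like1, like0 = pi, 1.0 - pi
    for xi, s, t in zip(x, psi, eta):
        like1 *= s if xi == 1 else 1.0 - s
        like0 *= (1.0 - t) if xi == 1 else t
    return like1 / (like1 + like0)
```

In the unsupervised setting the parameters are of course unknown; the sketch only shows what the posterior looks like once they are available.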
The conditional independence model is often overly simplistic. In this manuscript we propose to apply deep learning techniques, specifically RBM-based DNNs, for unsupervised ensemble learning problems, where the conditional independence is not likely to hold. The following section gives essential background on RBMs, Section 4 shows that a RBM with a single hidden node is equivalent to the conditional independence model, and Section 5 presents our RBM-based DNN approach.
3 Restricted Boltzmann Machines
A Restricted Boltzmann Machine (RBM) is an undirected bipartite graphical model, consisting of a set of visible binary random variables and a set of hidden binary random variables, arranged in two layers, which are fully connected to each other. An illustration of a RBM is depicted in Figure 2.
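A RBM is parametrized by $\lambda = (W, a, b)$, where $W$ is the weight matrix of the connections between the visible and hidden units, and $a, b$ are the bias vectors of the visible and hidden layers, respectively. Each configuration $(x, h)$ of a RBM is associated with the following energy
(3) $E_\lambda(x, h) = -\left(a^T x + b^T h + x^T W h\right),$
which defines the probability of the configuration
$q_\lambda(x, h) = \frac{e^{-E_\lambda(x, h)}}{Z},$
where $Z = \sum_{x, h} e^{-E_\lambda(x, h)}$ is the partition function. The bipartite structure of the RBM implies factorial conditional probabilities
$q_\lambda(x \mid h) = \prod_i q_\lambda(x_i \mid h), \quad q_\lambda(h \mid x) = \prod_j q_\lambda(h_j \mid x),$
given by
$q_\lambda(x_i = 1 \mid h) = \sigma(a_i + W_{i\cdot} h), \quad q_\lambda(h_j = 1 \mid x) = \sigma(b_j + x^T W_{\cdot j}),$
where $\sigma$ is the sigmoid function defined in equation (1), $W_{i\cdot}$ is the $i$th row of $W$ and $W_{\cdot j}$ is its $j$th column.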
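The factorial conditionals are what make block Gibbs sampling in a RBM cheap: all hidden units can be sampled at once given the visible layer, and vice versa. A minimal pure-Python sketch follows, assuming the standard binary RBM parametrization with a weight matrix `W` stored as a list of rows (visible by hidden) and bias lists `a` (visible) and `b` (hidden); these names and the list layout are choices made for this illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hidden_probs(x, W, b):
    """P(h_j = 1 | x) = sigmoid(b_j + sum_i x_i * W[i][j]) for every j."""
    return [sigmoid(b[j] + sum(x[i] * W[i][j] for i in range(len(x))))
            for j in range(len(b))]

def visible_probs(h, W, a):
    """P(x_i = 1 | h) = sigmoid(a_i + sum_j W[i][j] * h_j) for every i."""
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]

def sample_bernoulli(probs, rng):
    """Independent Bernoulli draws, one per probability."""
    return [1 if rng.random() < p else 0 for p in probs]

def gibbs_step(x, W, a, b, rng):
    """One block Gibbs step: sample h given x, then a new x given h."""
    h = sample_bernoulli(hidden_probs(x, W, b), rng)
    x_new = sample_bernoulli(visible_probs(h, W, a), rng)
    return x_new, h
```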
Given i.i.d. training data $x^{(1)}, \ldots, x^{(n)}$, the RBM parameters are typically tuned to maximize the log-likelihood of the training data, where the likelihood that the RBM associates with a vector $x$ is given by
$q_\lambda(x) = \sum_h q_\lambda(x, h) = \frac{1}{Z} \sum_h e^{-E_\lambda(x, h)}.$
A popular approach to learn the RBM parameters is via gradient-based optimization, where the gradients are approximated using contrastive divergence (Hinton et al., 2006; Bengio, 2009).
4 RBM in the Conditional Independence Case
In this section we show that given observed data from the conditional independence model of Eq. (2), the posterior probabilities of the true, unknown labels can be consistently estimated via a RBM with a single hidden node.
We begin by showing that there is a bijective map from the parameters of a RBM with a single hidden node to the parameters of the conditional independence model, such that the joint distribution specified by the RBM is equivalent to that of the conditional independence model.
Lemma 4.1.
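The joint probability of a RBM with parameters $\lambda = (w, a, b)$, where $w$ is the weight vector of the single hidden node, is equivalent to the joint probability of a conditional independence model with parameters $\theta = (\{\psi_i, \eta_i\}_{i=1}^{d}, \pi)$ given by
$\psi_i = \sigma(a_i + w_i), \quad \eta_i = 1 - \sigma(a_i), \quad \pi = \sigma\left(b + \sum_{i=1}^{d} \log \frac{1 + e^{a_i + w_i}}{1 + e^{a_i}}\right).$
Furthermore, the map $\lambda \mapsto \theta$ is a bijection.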
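The equivalence stated in Lemma 4.1 can be checked numerically for a small RBM by brute-force enumeration. The sketch below assumes the standard binary RBM energy with a weight vector `w`, visible biases `a` and a scalar hidden bias `b`, and derives the conditional independence parameters (named `psi`, `eta`, `pi` here for illustration) from the RBM conditionals and the hidden-unit marginal. It is an illustration of the statement, not the authors' code.

```python
import itertools
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def rbm_to_ds(w, a, b):
    """Map single-hidden-node RBM parameters to (psi, eta, pi)."""
    psi = [sigmoid(ai + wi) for ai, wi in zip(a, w)]   # P(x_i=1 | h=1)
    eta = [1.0 - sigmoid(ai) for ai in a]              # P(x_i=0 | h=0)
    # Log of sum_x exp(-E(x, h)) for h = 1 and for h = 0.
    log_m1 = b + sum(math.log1p(math.exp(ai + wi)) for ai, wi in zip(a, w))
    log_m0 = sum(math.log1p(math.exp(ai)) for ai in a)
    pi = sigmoid(log_m1 - log_m0)                      # P(h=1)
    return psi, eta, pi

def rbm_joint(w, a, b):
    """Joint q(x, h) obtained by enumerating all 2^(d+1) configurations."""
    d = len(a)
    unnorm = {}
    for x in itertools.product((0, 1), repeat=d):
        for h in (0, 1):
            neg_energy = sum(a[i] * x[i] + w[i] * x[i] * h for i in range(d)) + b * h
            unnorm[(x, h)] = math.exp(neg_energy)
    z = sum(unnorm.values())
    return {key: value / z for key, value in unnorm.items()}

def ds_joint(x, y, psi, eta, pi):
    """Joint p(x, y) of the conditional independence model."""
    p = pi if y == 1 else 1.0 - pi
    for xi, s, t in zip(x, psi, eta):
        p_one = s if y == 1 else 1.0 - t
        p *= p_one if xi == 1 else 1.0 - p_one
    return p
```

Comparing `rbm_joint` with `ds_joint` over all configurations, identifying the hidden unit with the label, gives a quick sanity check of the map for any chosen parameter values.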
We are now ready to prove the main result of this section, namely, that the posterior distribution of the true labels can be consistently estimated by a RBM with a single hidden node. To do so, we rely on a special case of a result proved by Chang (1996), that provides conditions under which the parameters of the conditional independence model are identifiable.
Lemma 4.2.
Let $x^{(1)}, \ldots, x^{(n)}$ be observed data from the conditional independence model, specified by $\theta$. Assume that $\theta$ is such that for each $i$, $X_i$ is not independent of $Y$ (i.e., each classifier is not just a random guess), and that $d \geq 3$. Let $\hat\lambda_{MLE}$ be a maximum likelihood parameter estimate of a RBM with a single hidden node. Then the RBM posterior probability $q_{\hat\lambda_{MLE}}(h \mid x)$ converges to the true posterior $p_\theta(y \mid x)$, as $n \to \infty$.
Remark 4.3.
The identifiability of the parameters $\theta$ is up to a single global label flip. This means that one recovers either $Y$ or $1 - Y$. Assuming that on average, the $X_i$'s are more accurate than a random guess, this sign ambiguity can be resolved by comparing the predictions to the majority vote decision.
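The resolution of the label-flip ambiguity described in Remark 4.3 can be sketched as follows. The function names are hypothetical, and the rule used here (flip all predictions when they agree with the majority vote on fewer than half of the instances) is one simple reading of "comparing the predictions to the majority vote decision".

```python
def majority_vote(x):
    """Return 1 if more than half of the binary predictions in x are 1, else 0."""
    return 1 if 2 * sum(x) > len(x) else 0

def resolve_label_flip(samples, predictions):
    """Flip all predictions if they disagree with majority vote on most samples.

    samples: list of binary prediction vectors, one per instance.
    predictions: list of 0/1 labels estimated by the model, one per instance.
    """
    votes = [majority_vote(x) for x in samples]
    agree = sum(1 for v, p in zip(votes, predictions) if v == p)
    if 2 * agree < len(predictions):
        return [1 - p for p in predictions]
    return list(predictions)
```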
Remark 4.4.
Lemma 4.2 assumes that we found the MLE of the RBM parameters. Obtaining such a MLE is problematic for two main reasons. First, RBMs are typically trained to maximize a proxy for the likelihood, as the true likelihood is not tractable. Second, the RBM likelihood function is not concave, hence there are no guarantees that after training a RBM one obtains the maximum likelihood parameter $\hat\lambda_{MLE}$.
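A common instance of the likelihood proxy mentioned in Remark 4.4 is one-step contrastive divergence (CD-1), referred to in Section 3. The sketch below is a minimal pure-Python CD-1 update for a binary RBM, assuming `W` is a visible-by-hidden list of rows and `a`, `b` are the visible and hidden bias lists. Hyperparameter choices such as the learning rate `lr` are left to the caller, and nothing here is taken from the authors' MATLAB scripts.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cd1_update(batch, W, a, b, lr, rng):
    """Apply one CD-1 parameter update in place, averaged over the batch."""
    d, m = len(a), len(b)
    gW = [[0.0] * m for _ in range(d)]
    ga = [0.0] * d
    gb = [0.0] * m
    for x in batch:
        # Positive phase: hidden probabilities and a sample given the data.
        ph = [sigmoid(b[j] + sum(x[i] * W[i][j] for i in range(d))) for j in range(m)]
        h = [1 if rng.random() < p else 0 for p in ph]
        # Negative phase: one Gibbs step back to the visible layer.
        px = [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(m))) for i in range(d)]
        x_neg = [1 if rng.random() < p else 0 for p in px]
        ph_neg = [sigmoid(b[j] + sum(x_neg[i] * W[i][j] for i in range(d)))
                  for j in range(m)]
        # Accumulate the difference between data and reconstruction statistics.
        for i in range(d):
            ga[i] += x[i] - x_neg[i]
            for j in range(m):
                gW[i][j] += x[i] * ph[j] - x_neg[i] * ph_neg[j]
        for j in range(m):
            gb[j] += ph[j] - ph_neg[j]
    scale = lr / len(batch)
    for i in range(d):
        a[i] += scale * ga[i]
        for j in range(m):
            W[i][j] += scale * gW[i][j]
    for j in range(m):
        b[j] += scale * gb[j]
```

Because the update follows an approximate gradient of a non-concave objective, repeated calls give no guarantee of reaching the maximum likelihood parameter, which is the caveat the remark raises.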
5 RBM-based Deep Neural Net
In many practical settings, the variables $X_1, \ldots, X_d$ are not conditionally independent. Fitting a conditionally independent model to such data may yield highly suboptimal predictions for the true labels $y^{(i)}$. To tackle this general case, we propose to train a RBM-based Deep Neural Net (DNN) and use it to estimate the posterior probabilities $p_\theta(Y \mid X)$. In such a DNN, the hidden layer of each RBM is the input for the successive RBM. As suggested by Hinton et al. (2006), the RBMs are trained one at a time, bottom to top, i.e., the DNN is trained in a layer-wise fashion. Specifically, given training data $x^{(1)}, \ldots, x^{(n)}$, we start by training the bottom RBM, and then obtain the first layer hidden representation of the data by sampling $h^{(i)}$ from the conditional RBM distribution $q_\lambda(h \mid x^{(i)})$. The vectors $h^{(1)}, \ldots, h^{(n)}$ are then used as a training set for the second RBM and so on.
In the case considered in this manuscript, where the true label is binary, the uppermost RBM in the DNN has a single hidden unit, from which the posterior probability can be estimated. Such a DNN is depicted in Figure 3.
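The greedy layer-wise procedure can be sketched as a short loop. In this illustration the single-RBM trainer is passed in as a hypothetical callback `train_rbm(data, m)` returning `(W, a, b)` for an RBM with `m` hidden units; any RBM training routine could play that role. The hidden representation fed to the next layer is obtained by sampling, as described above.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sample_hidden(x, W, b, rng):
    """Sample a binary hidden vector from the factorial conditional given x."""
    hidden = []
    for j in range(len(b)):
        p = sigmoid(b[j] + sum(x[i] * W[i][j] for i in range(len(x))))
        hidden.append(1 if rng.random() < p else 0)
    return hidden

def train_stack(data, hidden_sizes, train_rbm, rng):
    """Train RBMs bottom to top; each layer's samples feed the next layer.

    train_rbm is a user-supplied (hypothetical) function mapping
    (data, number_of_hidden_units) to a tuple (W, a, b).
    """
    layers = []
    current = [list(x) for x in data]
    for m in hidden_sizes:
        W, a, b = train_rbm(current, m)
        layers.append((W, a, b))
        current = [sample_hidden(x, W, b, rng) for x in current]
    return layers
```

For binary labels the last entry of `hidden_sizes` would be 1, so that the top RBM has a single hidden unit.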
5.1 Motivation
Deep learning algorithms have recently achieved state-of-the-art performance in a wide range of applications LeCun et al. (2015). While a rigorous theoretical understanding of deep nets is still lacking, many researchers believe that a key property in their success is their ability to disentangle factors of variation in the inputs; see for example Bengio et al. (2013), Tishby and Zaslavsky (2015), and Mehta and Schwab (2014). That is, as one moves through the net, the hidden units become less statistically dependent. We have seen in Section 4 that given a representation in which the units are independent conditional on the true label, a single node RBM gives a consistent estimation of the true label posterior probability. Propagating the data through several RBM layers can hence be seen as a processing of the data, which reduces the conditional dependence of the units while preserving most of the information on the true label $Y$. In Section 6 we will demonstrate cases where such decoupling does indeed happen in practice, i.e., although the original input variables $X_i$'s are not conditionally independent given the true label $Y$, after training, the units in the uppermost hidden layer are, remarkably, approximately conditionally independent. Thus, the assumptions of the conditional independence model apply (with respect to the uppermost hidden layer $H$), and therefore one is able to consistently estimate the label posterior probability, $p_\theta(Y \mid X)$, as in Section 4.
Another motivation for using deep nets with several hidden layers for unsupervised ensemble learning is their rich expressive power. In our setting, we wish to approximate the posterior probability $p_\theta(Y \mid X)$, which in general may be a complicated nonlinear function of $X$. When $p_\theta(Y \mid X)$ cannot be accurately estimated by a RBM with a single hidden node (i.e., when the conditional independence assumption of Dawid and Skene does not hold), a better approximation may be obtained from a deeper network. Several works show that there exist functions that are significantly more efficiently represented by deeper networks, compared to shallower ones, where efficiency corresponds to the number of units. For example, Montufar et al. (2014) show that deep networks with piecewise linear activations can represent functions with a greater number of linear regions compared to shallow networks with the same number of units. In a recent work, Eldan and Shamir (2015) give an example of a radial function that can be efficiently computed by a 3-layer network, while requiring exponentially many units to be approximated accurately by a 2-layer network.
Finally, we would like to emphasize that a RBM-based DNN is a discriminative model to estimate the posterior $p_\theta(Y \mid X)$. In general, it may not correspond to any generative model Arora et al. (2015). Indeed, there is no guarantee that the marginal distributions implied by two adjacent RBMs match. Yet, it can be shown (see Appendix C) that stacking RBMs is a variational inference procedure assuming a specific class of data generation models. The nature of approximation of a top down generative model, where the data is generated from a label $Y$, by a RBM-based DNN is explored in Appendix D.
5.2 Predicting the Label from a Trained DNN
Given a trained DNN and a sample $x$, the label $y$ is estimated by propagating $x$ through the network. Specifically, the units of each layer can be set by either (i) sampling from the conditional distribution given the layer below, i.e., $h \sim q_\lambda(h \mid x)$, or (ii) by MAP estimate, setting each hidden unit $h_j = \arg\max_{h_j} q_\lambda(h_j \mid x)$. Since the first option is stochastic, one may propagate $x$ through the net multiple times and average the outputs to obtain an approximation of $p_\theta(Y \mid X = x)$. Experimentally, we found the two options to be comparably effective, with each slightly outperforming the other in some cases.
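The two propagation options can be sketched in a few lines. In this illustration, `layers` is a list of `(W, b)` pairs, bottom to top, with `W` stored as a list of rows (input by hidden) and the last layer having a single hidden unit; the function names and this layout are assumptions made for the sketch. Passing `rng=None` uses the per-unit MAP rule (thresholding each conditional probability at 0.5), while passing a random generator samples the intermediate layers.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def top_unit_probability(x, layers, rng=None):
    """Propagate x upward and return the activation probability of the top unit."""
    units = list(x)
    for depth, (W, b) in enumerate(layers):
        probs = [sigmoid(b[j] + sum(units[i] * W[i][j] for i in range(len(units))))
                 for j in range(len(b))]
        if depth == len(layers) - 1:
            return probs[0]
        if rng is None:
            # Option (ii): set each unit to its more probable value.
            units = [1 if p >= 0.5 else 0 for p in probs]
        else:
            # Option (i): sample each unit from its conditional distribution.
            units = [1 if rng.random() < p else 0 for p in probs]

def estimate_posterior(x, layers, rng, n_passes=50):
    """Average the top-unit probability over several stochastic passes."""
    total = sum(top_unit_probability(x, layers, rng) for _ in range(n_passes))
    return total / n_passes
```

The number of passes (`n_passes`) is an arbitrary default here; the text does not specify how many stochastic passes were averaged.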
5.3 Choosing the DNN Architecture
The specific DNN architecture (i.e., number and sizes of layers) might have a dramatic effect on the quality of predictions. To determine the number of units in each layer we employed the following procedure: we first train a RBM with $d$ hidden units. Next, we compute the singular value decomposition of the weight matrix $W$, and determine its rank (i.e., the number of sufficiently large singular values). Given that the rank is some $m \leq d$, we retrain the RBM, setting the number of hidden units to be $m$. If $m > 1$, we add another layer on top of the current layer, and proceed recursively. The process stops when $m = 1$, so that the last layer of the DNN contains a single node. We refer to this method as the SVD approach. In our experiments, as a rule of thumb, we set $m$ to be the minimal number of singular values (in descending order) whose cumulative sum is at least 95% of the total sum.
This method takes advantage of the co-adaptation of hidden units, which is a well known phenomenon in RBM training (see, for example, Hinton et al. (2012)). The term co-adaptation describes a situation where several hidden units tend to behave very similarly; this implies that the rank of the weight matrix might be small, although the number of hidden units may be larger.
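The 95% rule of thumb for the effective rank can be written as a small helper. The sketch assumes the singular values of the trained weight matrix have already been computed by some linear algebra routine and are passed in as a plain list; the function name and the `threshold` parameter are illustrative.

```python
def effective_rank(singular_values, threshold=0.95):
    """Smallest k such that the k largest singular values sum to at least
    `threshold` times the sum of all singular values."""
    ordered = sorted(singular_values, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running >= threshold * total:
            return k
    return len(ordered)
```

For example, singular values `[10, 1, 0.5, 0.1]` give an effective rank of 3, since the two largest values account for 11 out of 11.6 (just under 95%), while the three largest account for 11.5.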
6 Experimental Results
In this section we compare the performance of the proposed DNN approach to several other approaches, and report experimental results obtained on four simulated data sets and eight real-world data sets, from two different domains. All our datasets, as well as the scripts reproducing the reported results, are publicly available at https://github.com/ushaham/RBMpaper. Our scripts are based on the publicly available code on Hinton's website, http://www.cs.toronto.edu/~hinton/MatlabForSciencePaper.html.
Specifically, we compare between the following unsupervised ensemble methods:

Vote. Majority voting, which is the maximum likelihood prediction, assuming that all classifiers are conditionally independent and have the same accuracy.

CUBAM. The method of Welinder et al. (2010), which assumes conditional independence, but allows the accuracy of each classifier to vary across different regions of the input domain.

LSML. Latent SML (Jaffe et al., 2015). This method relaxes the conditional independence assumption to a depth-2 tree model.

DNN. The approach presented in this manuscript, with the depth and number of hidden units in each layer determined by the SVD approach, described in Section 5.3.
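Following Jaffe et al. (2015), the performance measure we chose is the balanced accuracy, given by
$\frac{1}{2}\left(\frac{\sum_{i} 1\{\hat y^{(i)} = 1,\ y^{(i)} = 1\}}{\sum_{i} 1\{y^{(i)} = 1\}} + \frac{\sum_{i} 1\{\hat y^{(i)} = 0,\ y^{(i)} = 0\}}{\sum_{i} 1\{y^{(i)} = 0\}}\right),$
where $\hat y^{(i)}$ is the predicted label of the $i$th instance and $1\{\cdot\}$ is the indicator function.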
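Balanced accuracy is the average of the accuracy on the positive instances and the accuracy on the negative instances. A minimal sketch, assuming labels are coded as 0/1 and that both classes are present (otherwise the function raises `ZeroDivisionError`):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for 0/1 labels."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    sensitivity = sum(1 for p in pos if p == 1) / len(pos)
    specificity = sum(1 for p in neg if p == 0) / len(neg)
    return 0.5 * (sensitivity + specificity)
```

Unlike plain accuracy, this measure is not inflated by always predicting the majority class: a constant predictor scores 0.5 regardless of the class balance.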
6.1 Simulated Datasets
In this experiment we carefully generated four synthetic datasets, in order to demonstrate the performance of the DNN approach in several specific scenarios. In all four datasets the observed data is a binary matrix, with input dimension and sample size . A detailed description of the datasets generation process is given in Appendix E.1.

CondInd. A dataset where the conditional independence holds, and 10 of the 15 classifiers are in fact random guesses.

Tree15-3-1. A dataset generated from a depth-2 tree with layer sizes 1,3,15. Every node in the intermediate layer is connected to five nodes in the bottom layer. This dataset is generated from the model considered by LSML, and does not satisfy the conditional independence assumption, as is shown in Figure 6.

LayeredGraph15-5-5-1. A dataset generated from a depth-3 layered graph, with layer sizes 1,5,5,15. In this case, the conditional independence assumption does not hold, although in practice the amount of dependence in the data is not high (see Figure 11).

TruncatedGaussian. Here , where the r.v. follows a mixture of two dimensional Gaussians with different means and the same covariance matrix. The label indicates the specific Gaussian from which is sampled. In this case, the data is highly dependent, as can be seen in Figure 11.
The results are summarized in Table 1. Along with the five unsupervised methods, the table also shows the accuracy of a supervised learner and the estimated accuracy of the Bayes-optimal classifier. The supervised learner is a Multi Layer Perceptron (MLP) with two hidden layers of sizes 4 and 2, that was trained on a dataset with samples (independent of the test dataset). The Bayes-optimal approximated accuracy was computed on a sample of size , with the true posterior probabilities of all possible binary vectors estimated using a sample of size from the corresponding model.

Method    CondInd               Tree15-3-1             LG15-5-5-1            TG
Vote      75.93 ± 0.5           93.45 ± 0.19           76.61 ± 0.09          80.14 ± 0.4
DS        94.78 ± 0.13          92.68 ± 0.14           86.36 ± 0.2           82.03 ± 0.27
CUBAM     91.96 ± 0.18          90.74 ± 0.3            77.12 ± 0.26          83.43 ± 0.31
LSML      55.94 ± 21.88         95.83 ± 0.15           85.87 ± 0.21          79.5 ± 1.35
DNN       94.78 ± 0.13 (15-1)   95.13 ± 0.71 (15-3-1)  86.83 ± 0.2 (15-4-1)  88.09 ± 0.52 (15-3-1)
SUP       94.45 ± 0.11          95.54 ± 0.27           87.01 ± 0.18          90.8 ± 0.4
BayesOpt  95.32                 96.12                  87.05                 91.39
On all of the above datasets, the DNN always outperformed the majority vote rule and CUBAM. On the CondInd dataset, the DNN performs similarly to DS, and significantly better than the other methods. Despite being unsupervised, on this dataset both methods perform slightly better than the specific supervised learner we considered, and around the Bayes-optimal accuracy. The architecture determined by the SVD approach in this case is indeed a single RBM (with a single hidden node). The weight matrix of the RBM is shown in Figure 4, and corresponds to the fact that only the first five classifiers actually contain information about the true label in this dataset.
Figure 5 shows the recovery of the true conditional independence model parameters of a similar conditionally independent dataset (however with no random guess classifiers) from a RBM with a single hidden node, using the map in Lemma 4.1.
On the Tree15-3-1 dataset, LSML, which is tailored for data generated by a tree, outperforms the DNN. This result is expected, since it can be shown that the distribution of the bottom two layers of a tree cannot be parametrized as a RBM (see Appendix D). Still, the DNN performs significantly better than DS, CUBAM and majority vote, and not far from the supervised learner and the optimal Bayes classifier. Figure 6 shows the correlation matrix at the input and hidden layers, as well as the first layer weight matrix, demonstrating that the DNN captured the true data generation model. Consequently, the 3 hidden units are nearly conditionally uncorrelated given the label $Y$.
Figure 7 shows the cumulative proportion of the singular values on the CondInd and Tree15-3-1 datasets, which explains the architecture determined by the SVD approach for both datasets.
On the LayeredGraph15-5-5-1 dataset, while outperforming the other methods, the DNN achieved accuracy close to the supervised learner and the Bayes-optimal accuracy; however, the chosen DNN architecture is different from the one of the true data generation model.
The conditional independence assumption is strongly violated in the case of the TruncatedGaussian dataset. Here the DNN performs better than all other methods by a large margin.
6.2 Real-World Datasets
In this section we experiment with two groups of datasets, from two different domains, as follows:

DREAM. Three datasets from the DREAM mutation calling challenge Ewing et al. (2015); this challenge is an international effort to improve standard methods for identifying cancer-associated mutations and rearrangements in whole-genome sequencing data. The accuracy of current variant calling algorithms is not optimal due to sequencing errors, other experimental factors, parametric choices in each algorithm, and preprocessing and filtering decisions. Unsupervised ensemble learning of multiple variant callers is expected to provide more robust predictions. One of the goals of this challenge is to develop a state-of-the-art meta pipeline for somatic mutation detection, to output mutation calls associated with cancer that are as accurate as possible. Specifically, we used three datasets (S1, S2, S3) containing the predictions of classifiers that determine the presence or absence of mutations in genome sequencing data. The data is available at (Ellrot, 2013). In S1, , . In S2, $d$ = 114, $n$ = 70,561. In S3, , .

Magic. Forty datasets, which are constructed from the Magic dataset in the UCI repository, available at https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope. This dataset contains instances with 11 attributes, which consist of physical measurements of gamma particles; the learning task is to classify each instance as background or high energy gamma rays. Each of the forty datasets we constructed contains binary predictions of 16 classifiers, obtained in the Weka machine learning software. The 16 classifiers belong to four groups: four random forest classifiers, three logistic tree classifiers, four SVM classifiers, and five naive Bayes classifiers. This setting is adopted from Jaffe et al. (2015). The group of SVM classifiers is highly correlated, as is the group of naive Bayes classifiers, as can be seen in Appendix E.2. Each of the forty datasets was obtained from predictions of the same classifiers, each time trained on a different subset of the original Magic dataset (a random subset of size 500).
Table 2 shows the performance of the various methods on the DREAM datasets.
Dataset  Vote    DS      CUBAM  LSML    DNN
S1       97.2 *  98.3 *  92.31  98.4 *  98.42 ± 0.0 (124-1)
S2       96 *    97.2 *  69.19  97.7 *  97.55 ± 0.01 (114-1)
S3       95.7 *  97.7 *  87.65  98.2 *  98.51 ± 0.01 (99-25-1)
As can be seen, the DNN and LSML perform similarly on S1, while the former performs better on S3 and the latter on S2. The two methods outperform the majority vote rule, DS and CUBAM on all three datasets. Remarkably, the hidden representation on the S3 dataset is such that the units are perfectly uncorrelated, conditioned on the hidden label. This is shown in Figure 8.
The results on the Magic datasets are shown in Figure 9. On most of these datasets, the DNN outperforms all other methods by a relatively large margin. On all forty datasets, the SVD approach yielded a 15-3-1 architecture.
To summarize our experiments, we observed that the RBM-based DNN performs at least as well as, and often better than, various other methods on both simulated and real datasets, and that the SVD approach can serve as an effective tool for determining the DNN architecture.
We remark that in our experiments, RBMs tended to be highly sensitive to hyperparameters (such as learning rate, momentum, regularization type and penalty), which need to be carefully tuned. To obtain a reasonable hyperparameter setting we found it useful to apply the random configuration sampling procedure proposed in (Bergstra and Bengio, 2012), and to evaluate different models by an approximation of the average log-likelihood (see, for example, (Salakhutdinov and Murray, 2008) and the corresponding MATLAB scripts in (Salakhutdinov, 2010)).
7 Summary and Discussion
We demonstrated how deep learning techniques can be used for unsupervised ensemble learning, and showed that the DNN approach proposed in this manuscript performs at least as well as, and often better than, state-of-the-art methods, especially when the conditional independence assumption made by Dawid and Skene (1979) does not hold.
Possible directions for future research include extending the approach to multiclass problems, possibly using Discrete RBMs Montúfar and Morton (2013), theoretical analysis of the SVD approach, and information theoretic analysis of the decorrelation, while preserving label information, that occurs while propagating data through a RBM-based DNN.
Acknowledgements
The authors would like to thank George Linderman, Alex Cloninger, Tingting Jiang, Raphy Coifman, Sahand Negahban, Andrew Barron, Alex Kovner, Shahar Kovalsky, Maria Angelica Cueto, Jason Morton, and Bernd Sturmfels for their help.
References
 Arora et al. (2015) Arora, S., Liang, Y., and Ma, T. (2015). Why are deep nets reversible: A simple theory, with implications for training. arXiv preprint arXiv:1511.05653.
 Bengio (2009) Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127.
 Bengio et al. (2013) Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 35(8):1798–1828.
 Bergstra and Bengio (2012) Bergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. The Journal of Machine Learning Research, 13(1):281–305.
 Bishop (2006) Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
 Blei et al. (2016) Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2016). Variational inference: A review for statisticians. arXiv preprint arXiv:1601.00670.
 Casella and Berger (2002) Casella, G. and Berger, R. L. (2002). Statistical inference, volume 2. Duxbury Pacific Grove, CA.
 Chang (1996) Chang, J. T. (1996). Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Mathematical biosciences, 137(1):51–73.
 Cueto et al. (2010) Cueto, M. A., Morton, J., and Sturmfels, B. (2010). Geometry of the restricted Boltzmann machine. Algebraic Methods in Statistics and Probability (eds. M. Viana and H. Wynn), AMS, Contemporary Mathematics, 516:135–153.
 Dawid and Skene (1979) Dawid, A. P. and Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied statistics, pages 20–28.
 Donmez et al. (2010) Donmez, P., Lebanon, G., and Balasubramanian, K. (2010). Unsupervised supervised learning I: Estimating classification and regression errors without labels. The Journal of Machine Learning Research, 11:1323–1351.
 Eldan and Shamir (2015) Eldan, R. and Shamir, O. (2015). The power of depth for feedforward neural networks. arXiv preprint arXiv:1512.03965.
 Ellrot (2013) Ellrot, K. (2013). ICGC-TCGA DREAM mutation calling challenge. https://www.synapse.org/#!Synapse:syn312572/wiki/58893. Online; accessed 12-November-2015.
 Ewing et al. (2015) Ewing, A. D., Houlahan, K. E., Hu, Y., Ellrott, K., Caloian, C., Yamaguchi, T. N., Bare, J. C., P’ng, C., Waggott, D., Sabelnykova, V. Y., et al. (2015). Combining tumor genome simulation with crowdsourcing to benchmark somatic single-nucleotide-variant detection. Nature methods.
 Fox and Roberts (2012) Fox, C. W. and Roberts, S. J. (2012). A tutorial on variational Bayesian inference. Artificial intelligence review, 38(2):85–95.
 Hinton et al. (2006) Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554.
 Hinton et al. (2012) Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
 Jaffe et al. (2015) Jaffe, A., Fetaya, E., Nadler, B., Jiang, T., and Kluger, Y. (2015). Unsupervised ensemble learning with dependent classifiers. arXiv preprint arXiv:1510.05830.
 Jaffe et al. (2014) Jaffe, A., Nadler, B., and Kluger, Y. (2014). Estimating the accuracies of multiple classifiers without labeled data. arXiv preprint arXiv:1407.7644.
 Jain and Oh (2013) Jain, P. and Oh, S. (2013). Learning mixtures of discrete product distributions using spectral decompositions. arXiv preprint arXiv:1311.2972.
 LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.
 Mehta and Schwab (2014) Mehta, P. and Schwab, D. J. (2014). An exact mapping between the variational renormalization group and deep learning. arXiv preprint arXiv:1410.3831.
 Montúfar and Morton (2013) Montúfar, G. and Morton, J. (2013). Discrete restricted Boltzmann machines. arXiv preprint arXiv:1301.3529.
 Montufar et al. (2014) Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems, pages 2924–2932.
 Parisi et al. (2014) Parisi, F., Strino, F., Nadler, B., and Kluger, Y. (2014). Ranking and combining multiple predictors without labeled data. Proceedings of the National Academy of Sciences, 111(4):1253–1258.
 Platanios et al. (2014) Platanios, A., Blum, A., and Mitchell, T. M. (2014). Estimating accuracy from unlabeled data. In Proceedings of UAI.
 Raykar et al. (2010) Raykar, V. C., Yu, S., Zhao, L. H., Valadez, G. H., Florin, C., Bogoni, L., and Moy, L. (2010). Learning from crowds. The Journal of Machine Learning Research, 11:1297–1322.
 Salakhutdinov (2010) Salakhutdinov, R. (2010). Ruslan Salakhutdinov’s web page.
 Salakhutdinov and Murray (2008) Salakhutdinov, R. and Murray, I. (2008). On the quantitative analysis of deep belief networks. In Proceedings of the 25th international conference on Machine learning, pages 872–879. ACM.
 Tishby and Zaslavsky (2015) Tishby, N. and Zaslavsky, N. (2015). Deep learning and the information bottleneck principle. arXiv preprint arXiv:1503.02406.
 Welinder et al. (2010) Welinder, P., Branson, S., Perona, P., and Belongie, S. J. (2010). The multidimensional wisdom of crowds. In Advances in neural information processing systems, pages 2424–2432.
 Whitehill et al. (2009) Whitehill, J., Wu, T.-f., Bergsma, J., Movellan, J. R., and Ruvolo, P. L. (2009). Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems, pages 2035–2043.
 Zhang et al. (2014) Zhang, Y., Chen, X., Zhou, D., and Jordan, M. I. (2014). Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. In Advances in neural information processing systems, pages 1260–1268.
Appendix A Proof of Lemma 4.1
Proof.
We will define so that for every , and .
Since the weight matrix has dimension in this case, it is a vector, which we will denote as . Recall that
hence we define
and
Finally, recall that
where is the energy function given in equation (3), hence we set
(4) 
To see that the map is 1:1, note that uniquely determines , hence uniquely determine . Lastly, rearranging equation (4) we get
so that given , is uniquely determined by . Showing that the map is also surjective is straightforward. Hence it is a bijection. ∎
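The correspondence in Lemma 4.1 can be checked numerically on a small example. The following is a minimal sketch, not the construction used in the proof: it assumes a {0,1} encoding of labels and predictions and the standard RBM joint distribution proportional to exp(a·x + b·h + h·w·x), which may differ in sign or encoding from equation (3). The names `ds_to_rbm`, `ds_joint` and `rbm_joint_table` are hypothetical helpers introduced only for this illustration.

```python
import itertools

import numpy as np


def logit(p):
    """Inverse of the logistic sigmoid."""
    return np.log(p) - np.log1p(-p)


def ds_to_rbm(prior, sens, spec):
    """Map DS parameters to (a, b, w) of an RBM with one hidden unit.

    Assumed conventions (not taken from the paper): prior = p(y = 1),
    sens[i] = p(x_i = 1 | y = 1), spec[i] = p(x_i = 0 | y = 0).
    """
    a = logit(1.0 - spec)  # gives p(x_i = 1 | h = 0) = sigmoid(a_i)
    w = logit(sens) - a    # gives p(x_i = 1 | h = 1) = sigmoid(a_i + w_i)
    # Choose the hidden bias so that the RBM marginal p(h = 1) equals the prior.
    b = logit(prior) - np.sum(np.logaddexp(0.0, a + w) - np.logaddexp(0.0, a))
    return a, b, w


def ds_joint(x, y, prior, sens, spec):
    """Joint probability p(x, y) under the conditional independence model."""
    p_one = sens if y == 1 else 1.0 - spec  # p(x_i = 1 | y)
    p_y = prior if y == 1 else 1.0 - prior
    return p_y * np.prod(np.where(x == 1, p_one, 1.0 - p_one))


def rbm_joint_table(a, b, w):
    """Enumerate all (x, h) states and their RBM joint probabilities."""
    d = len(a)
    states = [(np.array(x), h)
              for x in itertools.product([0, 1], repeat=d)
              for h in (0, 1)]
    neg_energy = np.array([a @ x + b * h + h * (w @ x) for x, h in states])
    probs = np.exp(neg_energy - np.logaddexp.reduce(neg_energy))
    return states, probs
```

Under these assumptions, comparing `ds_joint(x, h, ...)` with the entries of `rbm_joint_table(*ds_to_rbm(...))` over all states of a small model shows that the two joint distributions agree up to floating-point error. Because `a`, `w` and `b` are each obtained by an invertible transformation, the map can also be run in reverse.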
Appendix B Proof of Lemma 4.2
Proof.
Since and for each , is not independent of , by Chang (1996), the parameter of the conditional independence model is identifiable. Since the map in Lemma 4.1 is a bijection, there exists corresponding to , which is therefore identifiable as well. By the consistency property of the MLE (see, for example, Casella and Berger, 2002),
Since is continuous in , one obtains
Finally, note that Lemma 4.1 implies, in particular, that under the map
which completes the proof. ∎
Appendix C Stacking RBMs as a Variational Inference Procedure
Variational inference is a common approach to tackling complicated probability estimation problems (see, for example, Bishop (2006); Fox and Roberts (2012), and the recent review by Blei et al. (2016)). Specifically, let
be a target probability distribution that we want to approximate. In variational inference we define a family of approximate distributions
, and then perform optimization to find the member of that is closest to in Kullback-Leibler divergence. A key idea is that the family is flexible enough to contain a distribution close to , yet simple enough to perform optimization over. For example, a popular choice is to take as the collection of factorized distributions, i.e., of the form . In this section, we motivate the use of an RBM-based DNN by considering a specific data generation model, and showing that training a stack of RBMs on data generated by this model is in fact a variational inference procedure.
The generative model we consider is a two-layer Deep Belief Network (DBN), a model class that played an important role in the emergence of deep learning in 2006 (Hinton et al., 2006). The DBN we consider generates data , , via the probability distribution
where form an RBM (parametrized by ).
We observe data from and our goal is to estimate the posterior for . The posterior can be written as
Cueto et al. (2010) showed that as long as is not too large compared to , RBMs are locally identifiable, i.e., identifiable up to reordering and flips of hidden units (Jason Morton, personal communication). Therefore, when training an RBM with hidden units on , by the consistency property of the MLE (Casella and Berger, 2002), the MLE will converge to the true parameter as . Hence, when is large enough, the vectors obtained from the (trained) RBM are in fact samples from .
At the next step, the vectors are used to train a second RBM, with a single hidden node. Observe that in the data generation model considered in this section, does not factorize. The factorized distribution that minimizes is given by
(Bishop, 2006, Chapter 10). By Lemma 4.1, we know that the distribution
(5) 
is equivalent to an RBM. Finally, by Lemma 4.2, the distribution (5) is consistently estimated by an RBM trained on vectors , and is thus a variational inference procedure.
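For reference, the standard fact behind the factorized approximation (Bishop, 2006, Chapter 10) can be written out as follows. This is a sketch in generic notation, assuming the divergence is taken in the direction KL(p ‖ q) and that q ranges over fully factorized distributions; the symbols z, q_j and p_j are illustrative and are not the notation of this appendix.

```latex
\begin{align*}
\mathrm{KL}(p \,\|\, q)
  &= \sum_{z} p(z) \log \frac{p(z)}{\prod_{j} q_j(z_j)} \\
  &= -\sum_{j} \sum_{z_j} p_j(z_j) \log q_j(z_j) + \mathrm{const} \\
  &= \sum_{j} \mathrm{KL}(p_j \,\|\, q_j) + \mathrm{const}',
\end{align*}
% Here p_j(z_j) = \sum_{z_{\setminus j}} p(z) is the j-th marginal of p,
% and the constants do not depend on q.
```

Each term on the last line is non-negative and vanishes exactly when q_j = p_j, so under these assumptions the best factorized approximation is the product of the marginals of p. The same argument applies to a conditional distribution, with every term conditioned on the same variable.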
Appendix D Stacking RBMs as an Approximation for a Directed Top-Down Model
Assume that the data is generated by a Markov chain
, where , , . We further assume that the distributions factorize, i.e.,
(6) 
and
(7) 
and are given by RBM-like conditional distributions, i.e.,
(8) 
and
(9) 
Hence the corresponding data generation probability is parametrized by , where .
This data generation process is depicted in Figure 10.
The posterior probabilities are given by
By Section 4, we know that is equivalent to an RBM. Therefore, to accurately estimate the posterior, it suffices to approximate .
Under the data generation model described in Figure 10 and equations (6)-(9), the joint distribution cannot be parametrized as an RBM; indeed, does not factorize. Hence, training an RBM on samples from is a mean field approximation of . The form of is shown in the following lemma.
Lemma D.1.
Proof.
From Lemma D.1 we see that is close to being factorizable if is approximately a log-linear function of and is approximately a log-linear function of .
Appendix E Datasets used for our experiments
E.1 Simulated Dataset Generation Details

CondInd: the label was sampled from a Bernoulli(0.5) distribution. The specificity and sensitivity of the variables were sampled uniformly from . The other ten ’s were random guesses, i.e., had specificity = sensitivity = 0.5.

Tree15-3-1: the label was sampled from a Bernoulli(0.5) distribution; each node in the intermediate layer was generated from its parent with specificity and sensitivity sampled uniformly from , and each node in the bottom layer with specificity and sensitivity sampled uniformly from .

LayeredGraph15-5-5-1: Data is generated from a layered graph with four layers of dimensions 1, 5, 5, 15, starting at the true label . Each layer in the graph is generated from the layer above it, and the graph has sparse connectivity (about 30% of the edges exist). For every node and parent we sample specificity and sensitivity uniformly. Finally, the value at each node was calculated as the weighted sum of the probabilities of the node being 1 given the values of the nodes in the preceding layer, normalized by the sum over the edges. The label was sampled from a Bernoulli(0.5) distribution.

TruncatedGaussian: the label was sampled from a Bernoulli(0.5) distribution. One Gaussian had mean vector where each of the 15 coordinates was sampled uniformly. The other Gaussian had mean vector . Both Gaussians had identical covariance matrices, with off-diagonal entries of and diagonal entries of .
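As an illustration of the simplest of these generators, the CondInd setup can be sketched as follows. This is a minimal sketch under stated assumptions, not the code used for the experiments: the sampling interval `acc_range`, the split into five informative and ten random-guess classifiers, the seed, and the function name `sample_cond_ind` are all hypothetical choices made for the example.

```python
import numpy as np


def sample_cond_ind(n, n_informative=5, n_random=10,
                    acc_range=(0.5, 0.8), seed=0):
    """Draw n instances from a conditional independence (DS) model.

    acc_range is an assumed placeholder for the interval from which the
    specificity and sensitivity of the informative classifiers are drawn.
    """
    rng = np.random.default_rng(seed)
    d = n_informative + n_random

    # True labels: Bernoulli(0.5).
    y = rng.integers(0, 2, size=n)

    # Random guessers keep sensitivity = specificity = 0.5.
    sens = np.full(d, 0.5)
    spec = np.full(d, 0.5)
    sens[:n_informative] = rng.uniform(*acc_range, size=n_informative)
    spec[:n_informative] = rng.uniform(*acc_range, size=n_informative)

    # p(x_i = 1 | y) is sens_i when y = 1 and 1 - spec_i when y = 0.
    p_one = np.where(y[:, None] == 1, sens[None, :], 1.0 - spec[None, :])

    # Predictions are drawn independently across classifiers given y.
    x = (rng.random((n, d)) < p_one).astype(int)
    return x, y, sens, spec
```

Calling `sample_cond_ind(1000)` returns a 1000-by-15 binary prediction matrix together with the hidden labels and the sampled classifier parameters, which is the input format assumed by unsupervised ensemble methods: only `x` is observed, while `y` is kept for evaluation.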
E.2 The Magic Datasets
An example of the correlation matrix of the 16 classifiers, conditional on class 0, can be seen in Figure 12.