A Deep Learning Approach to Unsupervised Ensemble Learning

by Uri Shaham, et al.

We show how deep learning methods can be applied in the context of crowdsourcing and unsupervised ensemble learning. First, we prove that the popular model of Dawid and Skene, which assumes that all classifiers are conditionally independent, is equivalent to a Restricted Boltzmann Machine (RBM) with a single hidden node. Hence, under this model, the posterior probabilities of the true labels can instead be estimated via a trained RBM. Next, to address the more general case, where classifiers may strongly violate the conditional independence assumption, we propose to apply a RBM-based Deep Neural Net (DNN). Experimental results on various simulated and real-world datasets demonstrate that our proposed DNN approach outperforms other state-of-the-art methods, in particular when the data violates the conditional independence assumption.



1 Introduction

In recent years, crowdsourcing applications have gained significant popularity and, consequently, much academic attention. At the same time, deep learning has become a major tool in machine learning and artificial intelligence, demonstrating impressive performance in several applications, including computer vision, speech recognition and natural language processing.

The goal of this paper is to show that deep learning methods can also be applied to the areas of crowdsourcing and unsupervised ensemble learning, and provide state-of-the-art results. In unsupervised ensemble learning, one is given the predictions of classifiers on a set of instances and the goal is to recover the true, unknown label of each instance. Dawid and Skene (1979) were among the first to consider such a setup. They assumed that the classifiers are conditionally independent given the true labels. We refer to this model as the DS model and also as the Conditional Independence model.

Despite its simplicity, computing the maximum likelihood estimates of the classifiers’ accuracies and the true labels in the DS model is a non-convex optimization problem. In their paper, Dawid and Skene estimated these quantities by the EM algorithm, which is only guaranteed to converge to a local optimum. In recent years, several authors developed computationally efficient spectral methods that are asymptotically consistent under the DS model, see Zhang et al. (2014); Parisi et al. (2014); Jain and Oh (2013); Jaffe et al. (2014) and references therein.

The model of Dawid and Skene relied on two key assumptions that typically do not hold in practice: (i) that classifiers make perfectly independent errors; and (ii) that these errors are uniformly distributed across all instances. To address the second issue above, several authors proposed richer models that include parameters such as instance difficulty and varying skills of annotators across different regions of the input space; see, for example, Raykar et al. (2010), Whitehill et al. (2009) and Welinder et al. (2010).

In contrast, relatively few works considered relaxations of the conditional independence assumption: Platanios et al. (2014) proposed to estimate the accuracies of possibly dependent classifiers, via their agreement rates over classifier groups of different sizes. Donmez et al. (2010) proposed a model with pairwise interactions between all classifiers. Closest to our approach is the work of Jaffe et al. (2015), who assumed that some of the classifiers may be conditionally dependent, yet their dependency structure can be accurately described by a tree of depth 2.

In this manuscript, we propose a deep learning approach to unsupervised ensemble learning problems with possibly dependent classifiers, where the conditional independence assumption may be strongly violated. We make the following contributions. First, we show that the DS model has an equivalent parametrization in terms of a Restricted Boltzmann Machine (RBM) with a single hidden node. Hence, under this model, the posterior probability of the true labels can be estimated from a trained RBM. Next, to tackle violations of conditional independence, we show how a RBM-based Deep Neural Net (DNN) can be applied to unsupervised ensemble learning, and propose a heuristic for determining the DNN architecture. Experimentally, we compare our approach to several state-of-the-art methods that are based on the conditional independence assumption and relaxations of it. We show that our DNN approach often performs better than the other methods on both simulated and real-world datasets. Remarkably, we demonstrate that in some cases, while the raw representation of the data contains correlated features, the learned features in the last hidden layer are almost perfectly uncorrelated.

The structure of this manuscript is as follows: in Section 2 we give a formal definition of the problem. A brief background on RBMs is given in Section 3. In Section 4 we show how RBMs can be used to predict the true labels, under the assumption of conditional independence. In Section 5 we describe how to estimate the labels using a RBM-based DNN. Experimental results are reported in Section 6. The manuscript concludes with a brief summary in Section 7. Proofs appear in the appendix.

1.1 Notation

Throughout this manuscript, $X$, $Y$ and $H$ are random variables, and $p_\theta$, $q_\lambda$ are probability densities, parametrized by $\theta$ and $\lambda$, respectively. We think of $p_\theta$ as the distribution generating the data and of $q_\lambda$ as the RBM model distribution. When the context is clear, we occasionally write $p_\theta(x)$ as a shorthand for $p_\theta(X = x)$. The dimensions of the input data and the sample size are denoted by $d$ and $n$, respectively. We use $\sigma(\cdot)$ to denote the sigmoid function

$$\sigma(z) = \frac{1}{1 + e^{-z}}. \tag{1}$$


2 Problem Setup

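Let $X \in \{0,1\}^d$ and $Y \in \{0,1\}$ be random variables. We refer to $Y$ as the label of $X$. The pair $(X, Y)$ has a joint distribution, parametrized by $\theta$ and denoted by $p_\theta(X, Y)$, which is given by

$$p_\theta(X, Y) = p_\theta(Y)\, p_\theta(X \mid Y).$$

The joint distribution $p_\theta(X, Y)$ is not known to us, and neither are the marginals $p_\theta(X)$, $p_\theta(Y)$. Let $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)})$ be i.i.d. samples from $p_\theta(X, Y)$. In unsupervised ensemble learning, we observe $x^{(1)}, \ldots, x^{(n)}$ and the learning task is to recover $y^{(1)}, \ldots, y^{(n)}$. In this application, the binary vector $x^{(i)} = (x^{(i)}_1, \ldots, x^{(i)}_d)$ contains the predictions of $d$ classifiers or annotators on an instance, whose label $y^{(i)}$ is unobserved.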

2.1 The Conditional Independence Model

In their seminal paper, Dawid and Skene (1979) assumed that the conditional distribution $p_\theta(X \mid Y)$ factorizes, i.e.,

$$p_\theta(X \mid Y) = \prod_{i=1}^{d} p_\theta(X_i \mid Y). \tag{2}$$

Eq. (2), also known as the conditional independence model, is depicted in Figure 1.

Figure 1: The conditional independence model, studied by Dawid and Skene (1979).

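It is fully parametrized by $\theta = (\{\psi_i\}_{i=1}^{d}, \{\eta_i\}_{i=1}^{d}, \pi)$, where $\psi_i = p_\theta(X_i = 1 \mid Y = 1)$ and $\eta_i = p_\theta(X_i = 0 \mid Y = 0)$ are often referred to as sensitivity and specificity, respectively, and $\pi = p_\theta(Y = 1)$ is the class prior. Under the interpretation of the $X_i$'s being classifiers, the sensitivity and specificity quantify the competence of the classifiers or annotators, and the conditional independence assumption means that all classifiers make independent errors.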
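To make the model concrete, the following Python sketch computes the exact label posterior under the conditional independence model via Bayes' rule. It assumes binary predictions in {0, 1}, with `psi[i]` the sensitivity P(X_i = 1 | Y = 1), `eta[i]` the specificity P(X_i = 0 | Y = 0) and `pi` the prior P(Y = 1); the function names are illustrative and are not taken from the authors' code.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def ci_posterior(x, psi, eta, pi):
    """Return P(Y = 1 | X = x) under the conditional independence model.

    x   : list of binary predictions (0 or 1), one per classifier
    psi : psi[i] = P(X_i = 1 | Y = 1), assumed to lie in (0, 1)
    eta : eta[i] = P(X_i = 0 | Y = 0), assumed to lie in (0, 1)
    pi  : P(Y = 1), assumed to lie in (0, 1)
    """
    # Accumulate log P(Y = y, X = x) for y = 1 and y = 0; the likelihood
    # factorizes over classifiers because of conditional independence.
    log_joint_1 = math.log(pi)
    log_joint_0 = math.log(1.0 - pi)
    for xi, p, e in zip(x, psi, eta):
        log_joint_1 += math.log(p if xi == 1 else 1.0 - p)
        log_joint_0 += math.log(1.0 - e if xi == 1 else e)
    # Bayes' rule written as a sigmoid of the log-odds.
    return _sigmoid(log_joint_1 - log_joint_0)
```

In the unsupervised setting the parameters are of course unknown; the sketch only shows what the posterior looks like once they have been estimated.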

The conditional independence model is often overly simplistic. In this manuscript we propose to apply deep learning techniques, specifically RBM-based DNNs, to unsupervised ensemble learning problems where conditional independence is not likely to hold. The following section gives essential background on RBMs, Section 4 shows that a RBM with a single hidden node is equivalent to the conditional independence model, and Section 5 presents our RBM-based DNN approach.

3 Restricted Boltzmann Machines

A Restricted Boltzmann Machine (RBM) is an undirected bipartite graphical model, consisting of a set of visible binary random variables and a set of hidden binary random variables, arranged in two layers, which are fully connected to each other. An illustration of a RBM is depicted in Figure 2.

Figure 2: A RBM with visible and hidden units.

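A RBM is parametrized by $\lambda = (W, a, b)$, where $W$ is the weight matrix of the connections between the visible and hidden units, and $a$, $b$ are the bias vectors of the visible and hidden layers, respectively. Each configuration $(x, h)$ of a RBM is associated with the following energy

$$E_\lambda(x, h) = -\left(a^T x + b^T h + x^T W h\right), \tag{3}$$

which defines the probability of the configuration

$$q_\lambda(x, h) = \frac{e^{-E_\lambda(x, h)}}{Z},$$

where $Z = \sum_{x, h} e^{-E_\lambda(x, h)}$ is the partition function. The bipartite structure of the RBM implies factorial conditional probabilities

$$q_\lambda(x \mid h) = \prod_i q_\lambda(x_i \mid h), \qquad q_\lambda(h \mid x) = \prod_j q_\lambda(h_j \mid x),$$

given by

$$q_\lambda(x_i = 1 \mid h) = \sigma\left(a_i + W_{i \cdot}\, h\right), \qquad q_\lambda(h_j = 1 \mid x) = \sigma\left(b_j + x^T W_{\cdot j}\right),$$

where $\sigma(\cdot)$ is the sigmoid function defined in equation (1), $W_{i \cdot}$ is the $i$-th row of $W$ and $W_{\cdot j}$ is its $j$-th column.

Given i.i.d. training data $x^{(1)}, \ldots, x^{(n)}$, the RBM parameters are typically tuned to maximize the log-likelihood of the training data, where the likelihood that the RBM associates with a vector $x$ is given by

$$q_\lambda(x) = \sum_h q_\lambda(x, h).$$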

A popular approach to learn the RBM parameters is via gradient-based optimization, where the gradients are approximated using contrastive divergence (Hinton et al., 2006; Bengio, 2009).
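As an illustration of contrastive divergence, the sketch below performs a single CD-1 update on one binary training vector. It is a minimal pure-Python version under the usual binary RBM conventions (weights `W` stored as a d x m list of lists, visible biases `a`, hidden biases `b`); the function names, the plain stochastic update and the absence of momentum or regularization are simplifying assumptions and do not reproduce the authors' training scripts.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def hidden_probs(x, W, b):
    """q(h_j = 1 | x) = sigmoid(b_j + sum_i x_i * W[i][j])."""
    return [sigmoid(b[j] + sum(x[i] * W[i][j] for i in range(len(x))))
            for j in range(len(b))]


def visible_probs(h, W, a):
    """q(x_i = 1 | h) = sigmoid(a_i + sum_j W[i][j] * h_j)."""
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]


def sample_bernoulli(probs, rng):
    """Draw independent 0/1 values with the given success probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]


def cd1_update(x, W, a, b, lr, rng):
    """One CD-1 step on a single binary vector x; updates W, a, b in place."""
    # Positive phase: hidden activations driven by the data vector.
    h0_p = hidden_probs(x, W, b)
    h0 = sample_bernoulli(h0_p, rng)
    # Negative phase: one Gibbs step yields a reconstruction of x.
    x1 = sample_bernoulli(visible_probs(h0, W, a), rng)
    h1_p = hidden_probs(x1, W, b)
    # Approximate gradient: data statistics minus reconstruction statistics.
    for i in range(len(a)):
        for j in range(len(b)):
            W[i][j] += lr * (x[i] * h0_p[j] - x1[i] * h1_p[j])
        a[i] += lr * (x[i] - x1[i])
    for j in range(len(b)):
        b[j] += lr * (h0_p[j] - h1_p[j])
```

In practice the update is averaged over mini-batches and repeated for many epochs; calling `cd1_update(x, W, a, b, 0.1, random.Random(0))` in a loop over the data is the simplest way to experiment with it.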

4 RBM in the Conditional Independence Case

In this section we show that given observed data from the conditional independence model of Eq. (2), the posterior probabilities of the true, unknown labels can be consistently estimated via a RBM with a single hidden node.

We begin by showing that there is a bijective map from the parameters of a RBM with a single hidden node to the parameters of the conditional independence model, such that the joint distribution specified by the RBM is equivalent to that of the conditional independence model.

Lemma 4.1.

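The joint probability $q_\lambda(X, H)$ of a RBM with a single hidden node and parameters $\lambda = (w, a, b)$, where $w$ is the weight vector and $a$, $b$ are the visible and hidden biases, is equivalent to the joint probability $p_\theta(X, Y)$ of a conditional independence model with parameters $\theta = (\{\psi_i\}, \{\eta_i\}, \pi)$ given by

$$\psi_i = \sigma(a_i + w_i), \qquad \eta_i = 1 - \sigma(a_i), \qquad \pi = \frac{e^{b} \prod_{i=1}^{d} \left(1 + e^{a_i + w_i}\right)}{\prod_{i=1}^{d} \left(1 + e^{a_i}\right) + e^{b} \prod_{i=1}^{d} \left(1 + e^{a_i + w_i}\right)}. \tag{4}$$

Furthermore, the map $\lambda \mapsto \theta$ is a bijection.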
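The correspondence in the lemma can be checked numerically. The sketch below assumes the standard binary RBM energy with one hidden unit, E(x, h) = -(a·x + b·h + h·w·x); under that assumption, summing out x gives closed forms for the sensitivities, specificities and class prior, which the second function verifies by brute-force enumeration. The function names are illustrative.

```python
import math
from itertools import product


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def rbm_to_ci_params(w, a, b):
    """Map single-hidden-node RBM parameters (w, a, b) to (psi, eta, pi)."""
    psi = [sigmoid(ai + wi) for ai, wi in zip(a, w)]  # P(X_i = 1 | h = 1)
    eta = [1.0 - sigmoid(ai) for ai in a]             # P(X_i = 0 | h = 0)
    # Log of the unnormalized mass of h = 1 and h = 0 after summing out x.
    log_m1 = b + sum(math.log1p(math.exp(ai + wi)) for ai, wi in zip(a, w))
    log_m0 = sum(math.log1p(math.exp(ai)) for ai in a)
    pi = sigmoid(log_m1 - log_m0)                     # P(h = 1)
    return psi, eta, pi


def brute_force_ci_params(w, a, b):
    """Same quantities, by enumerating all 2 * 2^d RBM configurations."""
    d = len(a)
    mass = {}
    for x in product([0, 1], repeat=d):
        for h in (0, 1):
            energy = -(sum(ai * xi for ai, xi in zip(a, x)) + b * h
                       + h * sum(wi * xi for wi, xi in zip(w, x)))
            mass[(x, h)] = math.exp(-energy)
    z1 = sum(m for (x, h), m in mass.items() if h == 1)
    z0 = sum(m for (x, h), m in mass.items() if h == 0)
    psi = [sum(m for (x, h), m in mass.items() if h == 1 and x[i] == 1) / z1
           for i in range(d)]
    eta = [sum(m for (x, h), m in mass.items() if h == 0 and x[i] == 0) / z0
           for i in range(d)]
    return psi, eta, z1 / (z0 + z1)
```

The enumeration is exponential in d and is only meant as a sanity check for small examples.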

We are now ready to prove the main result of this section, namely, that the posterior distribution of the true labels can be consistently estimated by a RBM with a single hidden node. To do so, we rely on a special case of a result proved by Chang (1996), that provides conditions under which the parameters of the conditional independence model are identifiable.

Lemma 4.2.

Let $x^{(1)}, \ldots, x^{(n)}$ be observed data from the conditional independence model, specified by $\theta$. Assume that $\theta$ is such that for each $i$, $X_i$ is not independent of $Y$ (i.e., each classifier is not just a random guess), and that $d \geq 3$. Let $\hat{\lambda}_{MLE}$ be a maximum likelihood parameter estimate of a RBM with a single hidden node. Then the RBM posterior probability $q_{\hat{\lambda}_{MLE}}(h \mid x)$ converges to the true posterior $p_\theta(y \mid x)$, as $n \to \infty$.

Remark 4.3.

The identifiability of the parameters is up to a single global label flip. This means that one recovers either $p_\theta(y \mid x)$ or $1 - p_\theta(y \mid x)$. Assuming that on average, the $X_i$'s are more accurate than a random guess, this sign ambiguity can be resolved by comparing the predictions to the majority vote decision.
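One simple way to implement this comparison is sketched below: the estimated posteriors are thresholded, compared with the majority vote, and flipped if they agree on fewer than half of the instances. The 0.5 thresholds, the tie-breaking rule and the function names are assumptions made for illustration.

```python
def majority_vote(X):
    """Majority vote over binary predictions; ties are broken towards 1."""
    return [1 if 2 * sum(row) >= len(row) else 0 for row in X]


def resolve_label_flip(posteriors, X):
    """Return posteriors of Y = 1, flipped if they disagree with the vote.

    posteriors : estimated P(Y = 1 | x) per instance, identifiable only
                 up to a global label flip
    X          : list of binary prediction vectors, one per instance
    """
    votes = majority_vote(X)
    preds = [1 if p >= 0.5 else 0 for p in posteriors]
    agreement = sum(int(p == v) for p, v in zip(preds, votes)) / len(votes)
    if agreement < 0.5:
        # The model has learned the labels with 0 and 1 exchanged.
        return [1.0 - p for p in posteriors]
    return list(posteriors)
```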

Remark 4.4.

Lemma 4.2 assumes that we found the MLE of the RBM parameters. Obtaining such a MLE is problematic for two main reasons. First, RBMs are typically trained to maximize a proxy for the likelihood, as the true likelihood is not tractable. Second, the RBM likelihood function is not concave, hence there are no guarantees that after training a RBM one obtains the maximum likelihood parameter $\hat{\lambda}_{MLE}$.

5 RBM-based Deep Neural Net

In many practical settings, the variables $X_1, \ldots, X_d$ are not conditionally independent. Fitting a conditionally independent model to such data may yield highly sub-optimal predictions for the true labels $y^{(i)}$. To tackle this general case, we propose to train a RBM-based Deep Neural Net (DNN) and use it to estimate the posterior probabilities $p_\theta(y \mid x)$. In such a DNN, the hidden layer of each RBM is the input for the successive RBM. As suggested by Hinton et al. (2006), the RBMs are trained one at a time, bottom to top, i.e., the DNN is trained in a layer-wise fashion. Specifically, given training data $x^{(1)}, \ldots, x^{(n)}$, we start by training the bottom RBM, and then obtain the first layer hidden representation of the data by sampling $h^{(i)}$ from the conditional RBM distribution $q_\lambda(h \mid x^{(i)})$. The vectors $h^{(1)}, \ldots, h^{(n)}$ are then used as a training set for the second RBM and so on.

In the case considered in this manuscript, where the true label $Y$ is binary, the upper-most RBM in the DNN has a single hidden unit, from which the posterior probability $p_\theta(y \mid x)$ can be estimated. Such a DNN is depicted in Figure 3.

Figure 3: A sketch of RBM-based DNN with two hidden layers.

5.1 Motivation

Deep learning algorithms have recently achieved state-of-the-art performance in a wide range of applications (LeCun et al., 2015). While a rigorous theoretical understanding of deep nets is still lacking, many researchers believe that a key property in their success is their ability to disentangle factors of variation in the inputs; see for example Bengio et al. (2013), Tishby and Zaslavsky (2015), and Mehta and Schwab (2014). That is, as one moves through the net, the hidden units become less statistically dependent. We have seen in Section 4 that given a representation in which the units are independent conditional on the true label, a single node RBM gives a consistent estimation of the true label posterior probability. Propagating the data through several RBM layers can hence be seen as a processing of the data, which reduces the conditional dependence of the units while preserving most of the information on the true label $Y$. In Section 6 we will demonstrate cases where such decoupling does indeed happen in practice, i.e., although the original input variables $X_i$ are not conditionally independent given the true label $Y$, after training, the units in the uppermost hidden layer are, remarkably, approximately conditionally independent. Thus, the assumptions of the conditional independence model approximately hold with respect to the uppermost hidden layer, and therefore one is able to estimate the label posterior probability, $p_\theta(y \mid x)$, as in Section 4.

Another motivation for using deep nets with several hidden layers for unsupervised ensemble learning is their rich expressive power. In our setting, we wish to approximate the posterior probability $p_\theta(y \mid x)$, which in general may be a complicated nonlinear function of $x$. When $p_\theta(y \mid x)$ cannot be accurately estimated by a RBM with a single hidden node (i.e., when the conditional independence assumption of Dawid and Skene does not hold), a better approximation may be obtained from a deeper network. Several works show that there exist functions that are significantly more efficiently represented by deeper networks, compared to shallower ones, where efficiency corresponds to the number of units. For example, Montufar et al. (2014) show that deep networks with piece-wise linear activations can represent functions with a greater number of linear regions compared to shallow networks with the same number of units. In a recent work, Eldan and Shamir (2015) give an example of a radial function that can be efficiently computed by a 3-layer network, while requiring exponentially many units to be approximated accurately by a 2-layer network.

Finally, we would like to emphasize that a RBM-based DNN is a discriminative model for estimating the posterior $p_\theta(y \mid x)$. In general, it may not correspond to any generative model (Arora et al., 2015). Indeed, there is no guarantee that the marginal distributions implied by two adjacent RBMs match. Yet, it can be shown (see Appendix C) that stacking RBMs is a variational inference procedure assuming a specific class of data generation models. The nature of the approximation of a top-down generative model, where the data is generated from a label $y$, by a RBM-based DNN is explored in Appendix D.

5.2 Predicting the Label from a Trained DNN

Given a trained DNN and a sample $x$, the label $y$ is estimated by propagating $x$ through the network. Specifically, the units of each layer can be set by either (i) sampling from the conditional distribution given the layer below, i.e., $h \sim q_\lambda(h \mid x)$, or (ii) by MAP estimate, setting each hidden unit to $h_j = \arg\max_{h_j} q_\lambda(h_j \mid x)$. Since the first option is stochastic, one may propagate $x$ through the net multiple times and average the outputs to obtain an approximation of $p_\theta(y \mid x)$. Experimentally, we found the two options to be comparably effective, with each slightly outperforming the other in some cases.
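The two propagation options can be sketched as follows. The representation of the network as a list of `(W, b)` pairs (weights from the layer below and biases of the layer above), the use of the top unit's activation probability as the posterior estimate, and all function names are assumptions made for this illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def layer_probs(v, W, b):
    """Activation probabilities of a layer given the layer below it."""
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]


def propagate(x, layers, mode="map", rng=None):
    """Propagate x through stacked RBMs; return the top unit's probability.

    layers : list of (W, b) pairs, bottom to top; the last W has one column
    mode   : "map" sets each hidden unit to its more probable value,
             "sample" draws each hidden unit from its Bernoulli conditional
    """
    v = list(x)
    for W, b in layers[:-1]:
        probs = layer_probs(v, W, b)
        if mode == "sample":
            v = [1 if rng.random() < p else 0 for p in probs]
        else:
            v = [1 if p >= 0.5 else 0 for p in probs]
    W_top, b_top = layers[-1]
    return layer_probs(v, W_top, b_top)[0]


def predict_posterior(x, layers, n_passes=100, seed=0):
    """Average the top-unit probability over several stochastic passes."""
    rng = random.Random(seed)
    total = sum(propagate(x, layers, "sample", rng) for _ in range(n_passes))
    return total / n_passes
```

The deterministic option needs a single pass, while the stochastic option trades extra passes for an averaged estimate; the result is still subject to the global label flip discussed in Remark 4.3.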

5.3 Choosing the DNN Architecture

The specific DNN architecture (i.e., number and sizes of layers) might have a dramatic effect on the quality of predictions. To determine the number of units in each layer we employed the following procedure: we first train a RBM with $d$ hidden units. Next, we compute the singular value decomposition of the weight matrix $W$, and determine its rank (i.e., the number of sufficiently large singular values). Given that the rank is some $m \leq d$, we re-train the RBM, setting the number of hidden units to be $m$. If $m > 1$, we add another layer on top of the current layer, and proceed recursively. The process stops when $m = 1$, so that the last layer of the DNN contains a single node. We refer to this method as the SVD approach. In our experiments, as a rule of thumb, we set $m$ to be the minimal number of singular values (in descending order) whose cumulative sum is at least 95% of the total sum.

This method takes advantage of the co-adaptation of hidden units, which is a well known phenomenon in RBM training (see, for example, Hinton et al. (2012)). The term co-adaptation describes a situation where several hidden units tend to behave very similarly; this implies that the rank of the weight matrix might be small, although the number of hidden units may be larger.
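The rank rule and the recursive construction can be sketched as follows. The sketch takes the singular values as given (in practice they could be computed with a routine such as NumPy's `numpy.linalg.svd`), and `singular_values_for` is a hypothetical callback standing in for "train a RBM on the current representation, with as many hidden units as inputs, and return the singular values of its weight matrix". The `max_layers` guard is an added safety measure and not part of the described procedure.

```python
def effective_rank(singular_values, threshold=0.95):
    """Smallest k such that the k largest singular values sum to at least
    `threshold` times the total sum of singular values."""
    s = sorted(singular_values, reverse=True)
    total = sum(s)
    cumulative = 0.0
    for k, value in enumerate(s, start=1):
        cumulative += value
        if cumulative >= threshold * total:
            return k
    return len(s)


def choose_architecture(d, singular_values_for, threshold=0.95, max_layers=10):
    """Return the layer sizes, e.g. [15, 3, 1], chosen by the SVD heuristic.

    d                   : input dimension (number of classifiers)
    singular_values_for : hypothetical callback (layer_index, n_visible) ->
                          singular values of the weight matrix of a RBM
                          trained on the current layer's representation
    """
    sizes = [d]
    # Keep adding layers until the top layer consists of a single unit.
    while sizes[-1] > 1 and len(sizes) <= max_layers:
        values = singular_values_for(len(sizes) - 1, sizes[-1])
        sizes.append(effective_rank(values, threshold))
    return sizes
```

With this rule, a weight matrix whose first singular value already accounts for 95% of the total yields a single RBM with one hidden node.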

6 Experimental Results

In this section we compare the performance of the proposed DNN approach to several other approaches, and report experimental results obtained on four simulated data sets and eight real world data sets, from two different domains. All our datasets, as well as the scripts reproducing the reported results, are publicly available at https://github.com/ushaham/RBMpaper. (Our scripts are based on the publicly available code on Hinton's website, http://www.cs.toronto.edu/~hinton/MatlabForSciencePaper.html.)

Specifically, we compare between the following unsupervised ensemble methods:

  • Vote. Majority voting, which is the maximum likelihood prediction, assuming that all classifiers are conditionally independent and have the same accuracy.

  • DS. Approximate maximum likelihood predictions under the Dawid and Skene model. Specifically, we use Spectral Meta Learner (Parisi et al., 2014), and Restricted Likelihood (Jaffe et al., 2014).

  • CUBAM. The method of Welinder et al. (2010), which assumes conditional independence, but allows the accuracy of each classifier to vary across different regions of the input domain.

  • L-SML. Latent SML (Jaffe et al., 2015). This method relaxes the conditional independence assumption to a depth 2 tree model.

  • DNN. The approach presented in this manuscript, with the depth and number of hidden units in each layer determined by the SVD approach, described in Section 5.3.

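Following Jaffe et al. (2015), the performance measure we chose is the balanced accuracy, given by

$$\frac{1}{2} \left( \frac{\sum_{i=1}^{n} \mathbb{1}\{\hat{y}^{(i)} = 1,\ y^{(i)} = 1\}}{\sum_{i=1}^{n} \mathbb{1}\{y^{(i)} = 1\}} + \frac{\sum_{i=1}^{n} \mathbb{1}\{\hat{y}^{(i)} = 0,\ y^{(i)} = 0\}}{\sum_{i=1}^{n} \mathbb{1}\{y^{(i)} = 0\}} \right),$$

where $\hat{y}^{(i)}$ is the predicted label of the $i$-th instance and $\mathbb{1}\{\cdot\}$ is the indicator function.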
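For reference, balanced accuracy is the average of the accuracy on the positive class and the accuracy on the negative class. A minimal Python sketch, assuming labels coded as 0 and 1 (the function name is illustrative):

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies for binary labels in {0, 1}."""
    pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    neg = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present in y_true")
    sensitivity = sum(1 for p in pos if p == 1) / len(pos)
    specificity = sum(1 for p in neg if p == 0) / len(neg)
    return 0.5 * (sensitivity + specificity)
```

Unlike plain accuracy, this measure gives a score of 0.5 to a predictor that always outputs the same class, regardless of class imbalance.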

6.1 Simulated Datasets

In this experiment we carefully generated four synthetic datasets, in order to demonstrate the performance of the DNN approach in several specific scenarios. In all four datasets the observed data is a binary $n \times d$ matrix, with input dimension $d = 15$ and sample size $n$. A detailed description of the dataset generation process is given in Appendix E.1.

  • CondInd A dataset where the conditional independence assumption holds, and ten of the 15 classifiers are in fact random guesses.

  • Tree15-3-1 A dataset generated from a depth-2 tree with layer sizes 1,3,15. Every node in the intermediate layer is connected to five nodes in the bottom layer. This dataset is generated from the model considered by L-SML, and does not satisfy the conditional independence assumption, as is shown in Figure 6.

  • LayeredGraph15-5-5-1 A dataset generated from a depth-3 layered graph, with layer sizes 1,5,5,15. In this case, the conditional independence assumption does not hold, although in practice the amount of dependence in the data is not high (see Figure 11).

  • TruncatedGaussian. Here $X$ is a binarized version of a r.v. $Z$, which follows a mixture of two $d$-dimensional Gaussians with different means and the same covariance matrix. The label $Y$ indicates the specific Gaussian from which $Z$ is sampled. In this case, the data is highly dependent, as can be seen in Figure 11.

The results are summarized in Table 1. Along with the five unsupervised methods, the table also shows the accuracy of a supervised learner and the estimated accuracy of the Bayes-optimal classifier. The supervised learner is a Multi Layer Perceptron (MLP) with two hidden layers of sizes 4 and 2, which was trained on a labeled dataset independent of the test dataset. The Bayes-optimal accuracy was approximated on a sample from the corresponding model, with the true posterior probabilities of all possible binary vectors estimated from a further sample from that model.

| Method | condInd | Tree15-3-1 | LG15-5-5-1 | TG |
|---|---|---|---|---|
| Vote | 75.93 ± 0.5 | 93.45 ± 0.19 | 76.61 ± 0.09 | 80.14 ± 0.4 |
| DS | 94.78 ± 0.13 | 92.68 ± 0.14 | 86.36 ± 0.2 | 82.03 ± 0.27 |
| CUBAM | 91.96 ± 0.18 | 90.74 ± 0.3 | 77.12 ± 0.26 | 83.43 ± 0.31 |
| L-SML | 55.94 ± 21.88 | 95.83 ± 0.15 | 85.87 ± 0.21 | 79.5 ± 1.35 |
| DNN | 94.78 ± 0.13 (15-1) | 95.13 ± 0.71 (15-3-1) | 86.83 ± 0.2 (15-4-1) | 88.09 ± 0.52 (15-3-1) |
| SUP | 94.45 ± 0.11 | 95.54 ± 0.27 | 87.01 ± 0.18 | 90.8 ± 0.4 |
| Bayes-Opt | 95.32 | 96.12 | 87.05 | 91.39 |

Table 1: Balanced accuracy of various unsupervised ensemble methods on the four synthetic datasets, along with a supervised learner (SUP), and the Bayes optimal classifier (Bayes-Opt). The results are presented as mean ± standard deviation, based on 5 repetitions, where in each repetition a new dataset was sampled from the model. The numbers in brackets denote the architecture of the DNN, found by the SVD approach.

On all of the above datasets, the DNN always outperformed the majority vote rule and CUBAM. On the CondInd dataset, the DNN performs similarly to DS, and significantly better than the other methods. Despite being unsupervised, on this dataset both methods perform slightly better than the specific supervised learner we considered, and around the Bayes-optimal accuracy. The architecture determined by the SVD approach in this case is indeed a single RBM (with a single hidden node). The weight matrix of the RBM is shown in Figure 4, and corresponds to the fact that only the first five classifiers actually contain information about the true label in this dataset.

Figure 4: The RBM weight vector on the condInd dataset. The hidden unit is strongly connected only to the first five visible units, reflecting the fact that in an unsupervised manner, the RBM detected that the remaining units are random guess classifiers.

Figure 5 shows the recovery of the true conditional independence model parameters of a similar conditionally independent dataset (however with no random guess classifiers) from a RBM with a single hidden node, using the map in Lemma 4.1.

Figure 5: Recovery of the conditional independence model parameters from a RBM with a single hidden node, on a dataset sampled from a conditional independence model. The true parameters were sampled uniformly. Each circle corresponds to a single parameter (e.g., $\psi_i$ for some $i$). For convenience, the identity line was added to the plot.

On the Tree15-3-1 dataset, L-SML, which is tailored for data generated by a tree, outperforms the DNN. This result is expected, since it can be shown that the distribution of the bottom two layers of a tree cannot be parametrized as a RBM (see Appendix D). Still, the DNN performs significantly better than DS, CUBAM and majority vote, and not far from the supervised learner and the optimal Bayes classifier. Figure 6 shows the correlation matrix at the input and hidden layers, as well as the first layer weight matrix, demonstrating that the DNN captured the true data generation model. Consequently, the 3 hidden units are nearly conditionally uncorrelated given the label $Y$.

Figure 6: The Tree15-3-1 experiment. Top left: correlation matrix of the input data for a fixed class. The first and middle five are not conditionally independent of each other. Top right: correlation matrix of the hidden layer of the DNN for the same class. The hidden units are approximately uncorrelated. Bottom: weight matrix of the bottom RBM of the DNN, showing that each hidden unit is strongly connected to 5 visible units, as in the original data generation model.

Figure 7 shows the cumulative proportion of the singular values on the condInd and Tree15-3-1 datasets, which explains the architecture determined by the SVD approach for both datasets.

Figure 7: Cumulative proportion of singular values on the condInd and Tree15-3-1 datasets. While in the condInd case the first singular value is more than 95% of the total sum of singular values, the first three singular values are needed on the Tree15-3-1 dataset. The horizontal line at 0.95 is added to the plot for convenience.

On the LayeredGraph15-5-5-1 dataset, while outperforming the other methods, the DNN achieved accuracy close to the supervised learner and the Bayes optimal accuracy; however, the chosen DNN architecture is different from the one of the true data generation model.

The conditional independence assumption is strongly violated in the case of the TruncatedGaussian dataset. Here the DNN performs better than all other methods by a large margin.

6.2 Real-World Datasets

In this section we experiment with two groups of datasets, from two different domains, as follows:

  • DREAM Three datasets from the DREAM mutation calling challenge (Ewing et al., 2015); this challenge is an international effort to improve standard methods for identifying cancer-associated mutations and rearrangements in whole-genome sequencing data. The accuracy of current variant calling algorithms is not optimal due to sequencing errors, other experimental factors, parametric choices in each algorithm and preprocessing and filtering decisions. Unsupervised ensemble learning of multiple variant callers is expected to provide more robust predictions. One of the goals of this challenge is to develop a state-of-the-art meta pipeline for somatic mutation detection, to output mutation calls associated with cancer that are as accurate as possible. Specifically, we used three datasets (S1, S2, S3), containing the predictions of classifiers that determine the presence or absence of mutations in genome sequencing data. The data is available at (Ellrot, 2013). In S2, $d = 114$ and $n = 70,561$; the input dimensions of S1 and S3, as reflected by the DNN architectures in Table 2, are $d = 124$ and $d = 99$, respectively.

  • Magic Forty datasets, which are constructed from the Magic dataset in the UCI repository, available at https://archive.ics.uci.edu/ml/datasets/MAGIC+Gamma+Telescope. This dataset contains instances with 11 attributes, which consist of physical measurements of gamma particles; the learning task is to classify each instance as background or high energy gamma rays. Each of the forty datasets we constructed contains binary predictions of 16 classifiers, obtained in the Weka machine learning software. The 16 classifiers belong to four groups: four random forest classifiers, three logistic trees classifiers, four SVM classifiers, and five naive Bayes classifiers. This setting is adopted from Jaffe et al. (2015). The group of SVM classifiers is highly correlated, as is the group of naive Bayes classifiers, as can be seen in Appendix E.2. Each of the forty datasets was obtained from predictions of the same classifiers, each time trained on a different random subset of size 500 of the original Magic dataset.

Table 2 shows the performance of the various methods on the DREAM datasets.

| Dataset | Vote | DS | CUBAM | L-SML | DNN |
|---|---|---|---|---|---|
| S1 | 97.2 * | 98.3 * | 92.31 | 98.4 * | 98.42 ± 0.0 (124-1) |
| S2 | 96 * | 97.2 * | 69.19 | 97.7 * | 97.55 ± 0.01 (114-1) |
| S3 | 95.7 * | 97.7 * | 87.65 | 98.2 * | 98.51 ± 0.01 (99-25-1) |

Table 2: Balanced accuracy of various methods on the DREAM datasets S1, S2 and S3. DNN results are averaged over 5 repetitions, and are presented as mean ± standard deviation. The numbers in brackets denote the architecture of the DNN, found by the SVD approach. * Results reported in (Jaffe et al., 2015).

As can be seen, the DNN and L-SML perform similarly on S1, while the former performs better on S3 and the latter on S2. The two methods outperform the majority vote rule, DS and CUBAM on all three datasets. Remarkably, the hidden representation on the S3 dataset is such that the units are almost perfectly uncorrelated, conditioned on the hidden label. This is shown in Figure 8.

Figure 8: Correlation matrices of the input (left) and hidden (right) layers of the DNN on the S3 dataset, for a fixed class. Remarkably, the hidden units are almost perfectly uncorrelated, conditioned on the class.

The results on the Magic datasets are shown in Figure 9. On most of these datasets, the DNN outperforms all other methods, with a relatively large margin. On all forty datasets, the SVD approach yielded a 15-3-1 architecture.

Figure 9: Performance of the various methods on the Magic datasets. For convenience, the identity line is added to the plot. Most of the points are below the identity line, which indicates that the DNN tends to outperform all other methods on these datasets.

To summarize our experiments, we observed that the RBM-based DNN performs at least as well as, and often better than, various other methods on both simulated and real datasets, and that the SVD approach can serve as an effective tool for determining the DNN architecture.

We remark that in our experiments, RBMs tended to be highly sensitive to hyper-parameters such as learning rate, momentum, regularization type and penalty, which therefore need to be carefully tuned. To obtain a reasonable hyper-parameter setting we found it useful to apply the random configuration sampling procedure proposed in (Bergstra and Bengio, 2012), and to evaluate different models by an approximation of the average log-likelihood (see, for example, (Salakhutdinov and Murray, 2008) and the corresponding MATLAB scripts in (Salakhutdinov, 2010)).

7 Summary and Discussion

We demonstrated how deep learning techniques can be used for unsupervised ensemble learning, and showed that the DNN approach proposed in this manuscript performs at least as well as, and often better than, state-of-the-art methods, especially when the conditional independence assumption made by Dawid and Skene (1979) does not hold.

Possible directions for future research include extending the approach to multiclass problems, possibly using Discrete RBMs (Montúfar and Morton, 2013), theoretical analysis of the SVD approach, and information theoretic analysis of the de-correlation, while preserving label information, that occurs while propagating data through a RBM-based DNN.


The authors would like to thank George Linderman, Alex Cloninger, Tingting Jiang, Raphy Coifman, Sahand Negahban, Andrew Barron, Alex Kovner, Shahar Kovalsky, Maria Angelica Cueto, Jason Morton, and Bernd Sturmfels for their help.


Appendix A Proof of Lemma 4.1


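We will define $\theta$ so that for every $i$, $p_\theta(X_i \mid Y) = q_\lambda(X_i \mid H)$, and $p_\theta(Y) = q_\lambda(H)$.

Since the weight matrix $W$ has dimension $d \times 1$ in this case, it is a vector, which we will denote as $w$. Recall that

$$q_\lambda(X_i = 1 \mid H = h) = \sigma(a_i + w_i h),$$

hence we define

$$\psi_i = \sigma(a_i + w_i), \qquad \eta_i = 1 - \sigma(a_i).$$

Finally, recall that

$$q_\lambda(H = 1) = \frac{\sum_{x} e^{-E_\lambda(x, 1)}}{\sum_{x} e^{-E_\lambda(x, 0)} + \sum_{x} e^{-E_\lambda(x, 1)}},$$

where $E_\lambda$ is the energy function given in equation (3), hence we set

$$\pi = \frac{e^{b} \prod_{i=1}^{d} \left(1 + e^{a_i + w_i}\right)}{\prod_{i=1}^{d} \left(1 + e^{a_i}\right) + e^{b} \prod_{i=1}^{d} \left(1 + e^{a_i + w_i}\right)}.$$

To see that the map is 1:1, note that $\eta_i$ uniquely determines $a_i$, hence $(\psi_i, \eta_i)$ uniquely determine $w_i$. Lastly, rearranging equation (4) we get

$$b = \log \frac{\pi}{1 - \pi} + \sum_{i=1}^{d} \log \frac{1 + e^{a_i}}{1 + e^{a_i + w_i}},$$

so that given $a$ and $w$, $b$ is uniquely determined by $\pi$. Showing that the map is also surjective is straightforward. Hence it is a bijection. ∎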

Appendix B Proof of Lemma 4.2


Since and for each , is not independent of , by Chang (1996), the parameter of the conditional independence model is identifiable. Since the map in Lemma 4.1 is a bijection, there exists corresponding to , which is therefore identifiable as well. By the consistency property of the MLE (see, for example, (Casella and Berger, 2002)),

Since is continuous in , one obtains

Finally, note that Lemma 4.1 implies, in particular, that under the map

which completes the proof. ∎

Appendix C Stacking RBMs as a Variational Inference Procedure

Variational inference is a common approach to tackling complicated probability estimation problems (see, for example, Bishop (2006), Fox and Roberts (2012), and the recent review by Blei et al. (2016)). Specifically, let

be a target probability distribution that we want to approximate. In variational inference we define a family of approximate distributions

, and then perform optimization to find the member of that is closest to in Kullback-Leibler divergence. A key idea is that the family is flexible enough to contain a distribution close to , yet simple enough to optimize over. For example, a popular choice is to take as the collection of factorized distributions, i.e., of the form . In this section, we motivate the use of an RBM-based DNN by considering a specific data generation model and showing that training a stack of RBMs on data generated by this model is in fact a variational inference procedure.

The generative model we consider is a two-layer Deep Belief Network (DBN), a model that played an important role in the emergence of deep learning in 2006 (Hinton et al., 2006). The DBN we consider generates data , , via the probability distribution

where form an RBM (parametrized by ).

We observe data from and our goal is to estimate the posterior for . The posterior can be written as

Cueto et al. (2010) showed that as long as is not too large compared to , RBMs are locally identifiable, i.e., identifiable up to order and flips of hidden units (Jason Morton, personal communication). Therefore, when training an RBM with hidden units on , by the consistency property of the MLE (Casella and Berger, 2002), the MLE will converge to the true parameter as . Hence, when is large enough, the vectors obtained from the (trained) RBM are in fact samples from .

At the next step, the vectors are used to train a second RBM, with a single hidden node. Observe that in the data generation model considered in this section, does not factorize. The factorized distribution that minimizes is given by

Bishop (2006) (Chapter 10). By Lemma 4.1, we know that the distribution


is equivalent to an RBM. Finally, by Lemma 4.2, the distribution (5) is consistently estimated by an RBM trained on the vectors , so training the stack of RBMs is a variational inference procedure.
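The role of the factorized approximation can be illustrated on a toy example. The sketch below assumes that the divergence is taken with the target distribution as the first argument, KL(p‖q), in which case the best factorized approximation is the product of the marginals of p (Bishop (2006), Chapter 10). The toy joint distribution and all names are assumptions made for this illustration.

```python
import math

# Toy joint distribution over two dependent binary variables (h1, h2).
p = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}


def kl_to_factorized(p, q1, q2):
    """KL(p || q) for q(h1, h2) = Bernoulli(h1; q1) * Bernoulli(h2; q2)."""
    total = 0.0
    for (h1, h2), prob in p.items():
        q = (q1 if h1 == 1 else 1.0 - q1) * (q2 if h2 == 1 else 1.0 - q2)
        total += prob * math.log(prob / q)
    return total


# Marginal probabilities P(h1 = 1) and P(h2 = 1) under p.
m1 = sum(prob for (h1, _), prob in p.items() if h1 == 1)
m2 = sum(prob for (_, h2), prob in p.items() if h2 == 1)

# KL divergence attained by the product of marginals.
kl_marginals = kl_to_factorized(p, m1, m2)

# Best KL divergence over a grid of other factorized candidates.
grid = [i / 100 for i in range(1, 100)]
kl_best_on_grid = min(kl_to_factorized(p, a, b) for a in grid for b in grid)
```

No factorized candidate on the grid attains a smaller divergence than the product of marginals, and `kl_marginals` stays strictly positive because the toy distribution itself does not factorize.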

Appendix D Stacking RBMs as an Approximation for a Directed Top-Down Model

Assume that the data is generated by a Markov chain

, where , , . We further assume that the distributions factorize, i.e.,




and are given by RBM-like conditional distributions, i.e.,




Hence the corresponding data generation probability is parametrized by , where .

This data generation process is depicted in Figure 10.

Figure 10: Data generated by a Markov Chain with RBM-like conditional probabilities.

The posterior probabilities are given by

By Section 4, we know that is equivalent to an RBM. Therefore, to accurately estimate the posterior, it suffices to approximate .

Under the data generation model described in Figure 10 and equations (6)-(9), it is evident that the joint distribution cannot be parametrized as an RBM; indeed, does not factorize. Hence, training an RBM on samples from is a mean field approximation of . The form of is shown in the following lemma.

Lemma D.1.

Under the data generation model described in Figure 10 and equations (6)-(9), the joint distribution is given by



By definition,



and similarly

we obtain


Plugging equation (11) into equation (10) we get

From Lemma D.1 we see that is close to factorizable if is approximately a log-linear function of and is approximately a log-linear function of .

Appendix E Datasets Used for Our Experiments

E.1 Simulated Dataset Generation Details

  • CondInd: the label was sampled from a Bernoulli(0.5) distribution. The specificity and sensitivity of the variables were sampled uniformly from . The other ten ’s were random guesses, i.e., had specificity = sensitivity = 0.5.

  • Tree15-3-1: the label was sampled from a Bernoulli(0.5) distribution; each node in the intermediate layer was generated from its parent with specificity and sensitivity sampled uniformly from , and each node in the bottom layer with specificity and sensitivity sampled uniformly from .

  • LayeredGraph15-5-5-1: Data is generated from a Layered Graph with four layers of dimensions 1,5,5,15, starting at the true label . Each layer in the graph is generated from the above layer, and the graph has sparse connectivity (about 30% of the edges exist). For every node and parent we sample specificity and sensitivity uniformly. Finally, the value at each node was calculated as the weighted sum of the probabilities of the node being 1 given the values of the nodes in the preceding layer, normalized by the sum over the edges. The label was sampled from a Bernoulli(0.5) distribution.

  • TruncatedGaussian: the label was sampled from a Bernoulli(0.5) distribution. One Gaussian had mean vector where each of the 15 coordinates was sampled uniformly. The other Gaussian had mean vector . Both Gaussians had the same covariance matrix, with off-diagonal entries of and diagonal entries of .
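As a rough sketch of how a CondInd-style dataset could be simulated: the label is drawn from Bernoulli(0.5), and each classifier then answers independently given the label according to its own sensitivity and specificity. The number of informative classifiers (`n_informative=5`), the sampling range (`low`, `high`) and the function name are assumptions made for this illustration, not the exact settings used in our experiments.

```python
import random


def generate_condind(n_samples, n_informative=5, n_random=10,
                     low=0.5, high=0.8, seed=0):
    """Simulate binary classifiers that are conditionally independent given y.

    Returns (labels, predictions, sensitivities, specificities).
    """
    rng = random.Random(seed)
    # Informative classifiers: sensitivity/specificity drawn uniformly (assumed range).
    # Random guessers: sensitivity = specificity = 0.5.
    sens = [rng.uniform(low, high) for _ in range(n_informative)] + [0.5] * n_random
    spec = [rng.uniform(low, high) for _ in range(n_informative)] + [0.5] * n_random

    labels, predictions = [], []
    for _ in range(n_samples):
        y = 1 if rng.random() < 0.5 else 0  # label ~ Bernoulli(0.5)
        row = []
        for s, e in zip(sens, spec):
            # P(x_i = 1 | y = 1) = sensitivity, P(x_i = 1 | y = 0) = 1 - specificity.
            p_one = s if y == 1 else 1.0 - e
            row.append(1 if rng.random() < p_one else 0)
        labels.append(y)
        predictions.append(row)
    return labels, predictions, sens, spec


labels, predictions, sens, spec = generate_condind(2000, seed=1)
```

Each row of `predictions` holds the 15 classifier outputs for one instance; only these rows, and not `labels`, would be available to an unsupervised ensemble learner.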

Figure 11: Correlation matrices of the input data, for the class in all four simulated datasets: CondInd (top left), Tree15-3-1 (top right), LayeredGraph15-5-5-1 (bottom left), TruncatedGaussian (bottom right).

E.2 The Magic Datasets

An example of the correlation matrix of the 16 classifiers given the 0 class is shown in Figure 12.

Figure 12: Correlation matrix of the 16 classifiers in the Magic1 dataset, for the class.