Benefits of Overparameterization in Single-Layer Latent Variable Generative Models

06/28/2019 · Rares-Darius Buhai et al. · Google, MIT

One of the most surprising and exciting discoveries in supervised learning was the benefit of overparametrization (i.e. training a very large model) in improving the optimization landscape of a problem, with minimal effect on statistical performance (i.e. generalization). In contrast, unsupervised settings have been under-explored, despite the fact that overparameterization was observed to be helpful there as early as Dasgupta & Schulman (2007). In this paper, we perform an exhaustive study of different aspects of overparameterization in unsupervised learning via synthetic and semi-synthetic experiments. We discuss benefits to different metrics of success (held-out log-likelihood, recovering the parameters of the ground-truth model), sensitivity to variations of the training algorithm, and behavior as the amount of overparameterization increases. We find that, when learning using methods such as variational inference, larger models can significantly increase the number of ground truth latent variables recovered.


1 Introduction

Unsupervised learning has been an area of intense focus in recent years. In the absence of labels, the goal of unsupervised learning can vary. Generative adversarial networks, for example, have shown promise for density estimation and synthetic data generation. In commercial applications, unsupervised learning is often used to extract features of the data that are useful in downstream tasks. In the sciences, the goal is frequently to confirm or reject whether a particular model fits well or to identify salient aspects of the data that give insight into underlying causes and factors of variation.

Many unsupervised learning problems of interest, cast as finding a maximum likelihood model, are computationally intractable in the worst case. Though much theoretical work has been done on provable algorithms using the method of moments and tensor decomposition techniques (Anandkumar et al., 2014; Arora et al., 2017; Halpern & Sontag, 2013), iterative techniques such as variational inference are still widely preferred. In particular, variational approximations using recognition networks have become increasingly popular, especially in the context of variational autoencoders (Mnih & Gregor, 2014). Intriguingly, it has been observed, e.g. by Yeung et al. (2017), that in practice many of the latent variables have low activations and hence are effectively not used by the model.

Related phenomena have long been known in supervised learning: it was folklore knowledge among practitioners for some years that training larger neural networks can aid optimization, yet not affect generalization substantially. The seminal paper (Zhang et al., 2016) thoroughly studied this phenomenon with synthetic and real-life experiments in the supervised setting. In brief, they showed that some neural network architectures that demonstrate strong performance on benchmark datasets are so massively overparametrized that they can “memorize” large image data sets (they can perfectly fit a completely random data set of the same size). In contrast, overparametrization in the unsupervised case has received much less attention.

This paper aims to be a controlled empirical study making precise the benefits of overparameterization in unsupervised learning settings. More precisely, we consider the task of fitting two common latent-variable models – a discrete factor analysis model using a noisy-OR parametrization, and sparse coding (interpreted as a probabilistic model, see Section 3.2). Through experiments on synthetic and semi-synthetic data sets, we study the following aspects:

  • Log likelihood: We show that larger models provide an increase in the (held-out) log-likelihood, though the benefit decreases as we add latent variables.

  • Latent variable recovery: We show that larger models increase the number of ground truth latent variables recovered, as well as the number of runs in which all ground truth latent variables are recovered. Furthermore, we show that recovering the ground-truth latent variables from the overparameterized solutions can be done via a simple filtering step: the optimization tends to converge to a solution in which all latent variables that do not match ground truth latent variables are either discardable (i.e. have low prior probability) or near-duplicates of other matched latent variables.

  • Effects of extreme overparametrization: We show that while the benefits of adding new latent variables have diminishing returns, the harmful effects of extreme overparameterization are at most minor. Both the log-likelihood and the number of ground truth recoveries do not worsen significantly as the number of latent variables increases. Instead, performance sometimes continues to improve even with many times the true number of latent variables.

  • Effects of training algorithm: We show that changes to the training algorithm, such as significantly increasing the batch size or using a different variational posterior, do not significantly affect the beneficial effects of overparameterization. For learning noisy-OR networks, we test two algorithms based on variational learning: one with a logistic regression recognition network, and one with a mean-field posterior (e.g. see (Wainwright et al., 2008)).

  • Latent variable stability over the course of training: One possible explanation for why overparametrization helps is that having more latent variables increases the chances that at least one initialization will be close to each of the ground-truth latent variables. (This is indeed the idea of (Dasgupta & Schulman, 2007)). This does not appear to be the dominant factor here. We track the “matching” of the trained latent variables to the ground truth latent variables (matching = minimum cost bipartite matching, with a cost based on parameter closeness), and show that this matching changes until relatively late in the training process. This suggests that the benefit of overparameterization being observed is not simply due to increased likelihood of initializations close to the ground truth values.

In this investigation, we use simple generative models so that we can exert maximal control over our experiments. With more complex, deep generative models, there is a lot of latitude in choosing the architecture of the model when overparametrizing, making it more difficult to disentangle the importance of the choice of architecture from overparametrization in general.

2 Related work

Though overparameterization has long been around as a folklore strategy for improving the optimization landscape, the first systematic study of it in supervised learning is found in Zhang et al. (2016). Subsequent theoretical works also provided mathematical explanations of some of these phenomena (Allen-Zhu et al., 2018; Allen-Zhu & Li, 2019). The unsupervised case, on the other hand, has received comparatively little attention.

To our knowledge, the earliest paper that points out a (simple) benefit of overparametrization is Dasgupta & Schulman (2007), in the context of recovering the means of well-separated spherical Gaussians given samples from the mixture. They point out that using sufficiently many input points as “guesses” (i.e. initializations) for the means allows one to guarantee, by the coupon collector phenomenon, that at least one point from each component in the mixture is included – which would not be guaranteed if only as many points as components were used. A filtering step subsequently allows them to recover the components.

More recently, Li et al. (2018) explored matrix completion and Xu et al. (2018) mixtures of two Gaussians. In the former paper, the authors consider fitting a full-rank matrix to the partially observed matrix, yet prove that gradient descent finds the correct, low-rank matrix. In the latter, the authors prove that when fitting a symmetric, equal-weight, two-component mixture, treating the weights as variables helps EM avoid local minima. (This runs contrary to the intuition that, since the weights are known to be equal, one should incorporate this information directly into the algorithm.)

3 Learning overparametrized generative models

We focus on the task of fitting two commonly used latent-variable models: noisy-OR networks and sparse coding. Beyond the ubiquity of these models, the reason we chose them is that they provide the “simplest” architectures for a generative model: a single latent layer, albeit with different activation functions. This makes the choice of “overparameterized” architecture non-controversial: we simply include more latent variables.

3.1 Noisy-OR networks

A noisy-OR network is a bipartite directed graphical model, in which one layer contains binary latent variables and the other layer contains binary observed variables. Edges are directed from latent variables to observed variables. The model has as parameters a set of prior probabilities for the latent variables, a set of noise probabilities for the observed variables, and a set of weights for the graph. If the latent variables are denoted as $s_1, \dots, s_m$ and the observed variables as $x_1, \dots, x_n$, the joint probability distribution specified by the model factorizes as:

$$P(s, x) = \prod_{i=1}^{m} P(s_i) \prod_{j=1}^{n} P(x_j \mid s),$$

where $P(s_i = 1) = \pi_i$ (the prior probabilities) and

$$P(x_j = 0 \mid s) = (1 - \nu_j) \prod_{i=1}^{m} f_{ij}^{s_i},$$

with $\nu_j$ the noise probability of observed variable $x_j$. It is common to refer to $f_{ij}$ as the failure probability between $s_i$ and $x_j$ (i.e. the probability with which, if $s_i = 1$, it “fails to activate” $x_j$).
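To make the generative process concrete, below is a minimal sketch of sampling from a noisy-OR network under the parametrization above; the function and array names (sample_noisy_or, pi, F, nu) are our own illustrative choices, not the paper's.

import numpy as np

def sample_noisy_or(pi, F, nu, rng=np.random.default_rng()):
    """Draw one sample (s, x) from a noisy-OR network.

    pi : (m,) prior probabilities P(s_i = 1)
    F  : (m, n) failure probabilities f_ij
    nu : (n,) noise probabilities of the observed variables
    """
    m, n = F.shape
    s = rng.random(m) < pi                                # latent activations
    # P(x_j = 0 | s) = (1 - nu_j) * prod_i f_ij^{s_i}
    p_x0 = (1.0 - nu) * np.prod(np.where(s[:, None], F, 1.0), axis=0)
    x = rng.random(n) >= p_x0                             # observed activations
    return s.astype(int), x.astype(int)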

Training algorithm: We optimize an approximation of the likelihood of the data under the model, the evidence lower bound (ELBO). This is necessary because direct maximum likelihood optimization is intractable. If the joint distribution is $P_\theta(x, s)$, we have, using the notation in (Mnih & Gregor, 2014):

$$\log P_\theta(x) \;\geq\; \mathbb{E}_{Q_\phi(s \mid x)}\big[ \log P_\theta(x, s) - \log Q_\phi(s \mid x) \big],$$

where $Q_\phi(s \mid x)$ is a variational posterior, also known as a recognition network in this setting. When $Q_\phi(s \mid x) = P_\theta(s \mid x)$, the inequality becomes an equality; however, it is intractable to compute $P_\theta(s \mid x)$. Instead, we assume that $Q_\phi$ belongs to a simpler family of distributions: a logistic regression distribution parameterized by weights $W$ and bias terms $b$, s.t.:

$$Q_\phi(s_i = 1 \mid x) = \sigma(W_i^\top x + b_i),$$

where $\sigma$ is the logistic sigmoid. Then, we maximize the lower bound by taking gradient steps w.r.t. $\theta$ and $\phi$. Furthermore, to improve the estimation of the gradients, we use variance normalization and input-dependent signal centering, as in Mnih & Gregor (2014). For the input-dependent signal centering, we use a two-layer neural network.
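As an illustration, the following is a minimal sketch of a single-sample Monte Carlo estimate of this lower bound (the learning signal), assuming the reconstructed parametrization above; the helper names (elbo_sample, W, b) are ours.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elbo_sample(x, pi, F, nu, W, b, rng=np.random.default_rng()):
    """Single-sample estimate of the ELBO for one data point x, with a
    logistic regression recognition network Q(s_i=1|x) = sigmoid(W_i x + b_i)."""
    q = sigmoid(W @ x + b)                                # (m,) posterior probs
    s = (rng.random(q.shape[0]) < q).astype(float)        # sample s ~ Q(.|x)

    log_q = np.sum(s * np.log(q) + (1 - s) * np.log1p(-q))        # log Q(s|x)
    log_prior = np.sum(s * np.log(pi) + (1 - s) * np.log1p(-pi))  # log P(s)
    # log P(x|s): noisy-OR likelihood, P(x_j=0|s) = (1-nu_j) * prod_i f_ij^{s_i}
    p_x0 = (1.0 - nu) * np.prod(F ** s[:, None], axis=0)
    log_lik = np.sum(x * np.log1p(-p_x0) + (1 - x) * np.log(p_x0))

    return log_prior + log_lik - log_q                    # learning signal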

Extracting the ground-truth latent variables: As we are training an overparametrized model, we need to filter the learned latent variables to extract those corresponding to the ground-truth variables. First, we discard all latent variables that are discardable, namely those whose prior probability is below a small threshold or whose failure probabilities to the observed variables are all close to 1. Then, for every pair of latent variables that are duplicates (measured as having failure probability vectors within a small distance of each other), we discard the one with lower prior probability – such that for any cluster of duplicate latent variables, only the one with the largest prior probability survives.
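A minimal sketch of this filtering step, assuming Euclidean distance between failure probability vectors; the threshold values (prior_thresh, fail_thresh, dup_thresh) are placeholders, since the exact constants are not reproduced here.

import numpy as np

def filter_latents(pi, F, prior_thresh=0.01, fail_thresh=0.9, dup_thresh=0.5):
    """Discard low-prior / disconnected latent variables, then de-duplicate.

    pi : (m,) learned prior probabilities
    F  : (m, n) learned failure probabilities
    Returns the indices of the surviving latent variables.
    """
    m = pi.shape[0]
    # Step 1: keep only latents with non-negligible prior and at least one connection.
    keep = [i for i in range(m)
            if pi[i] >= prior_thresh and np.any(F[i] <= fail_thresh)]
    # Step 2: within each cluster of near-duplicates, keep the largest-prior latent.
    keep.sort(key=lambda i: -pi[i])            # process in decreasing prior order
    survivors = []
    for i in keep:
        if all(np.linalg.norm(F[i] - F[j]) >= dup_thresh for j in survivors):
            survivors.append(i)
    return survivors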

3.2 Sparse coding

A sparse coding model is specified by a matrix $A \in \mathbb{R}^{n \times m}$ with $\|A_i\|_2 = 1$ for every column $A_i$ (i.e. unit columns). Samples are generated from this model according to $y = A x$, with $x \geq 0$ and $\|x\|_0 = k$ (i.e. sparsity $k$). The coordinates of the vector $x$ play the role of the latent variables, and the distribution $x$ is generated from is as follows:


  • Uniformly at random choose $k$ coordinates of $x$ to be non-zero.

  • Sample the values of the non-zero coordinates uniformly at random.

  • Renormalize the non-zero coordinates so they sum to 1.

Training algorithm: We use a simple alternating-minimization algorithm given in (Li et al., 2016). It starts with a random initialization of $A$, such that $A$ has unit columns. Then, at each iteration, it “decodes” the latent variables for a batch of samples: a sample $y$ is decoded as $\hat{x}$ by thresholding $A^\top y$ at a fixed level, using the current version of $A$. After decoding, it takes a gradient step toward minimizing the “reconstruction error” $\|y - A \hat{x}\|_2^2$, and then re-normalizes the columns of $A$ such that it has unit columns. Here, overparameterization means learning a matrix with $m' > m$ columns, where $m$ is the number of columns of the ground truth matrix. A sketch of one update step is given below.
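The following is a minimal sketch of one such update, under the assumption that decoding is a simple entrywise thresholding of $A^\top y$ (the exact decoder in (Li et al., 2016) may differ in details); the names tau and lr are placeholders for the fixed threshold and step size.

import numpy as np

def sparse_coding_update(A, Y, tau=0.1, lr=0.05):
    """One alternating-minimization step on a batch Y of shape (batch, n).

    A   : (n, m') current dictionary with unit columns
    tau : fixed decoding threshold
    lr  : gradient step size
    """
    # Decode: threshold A^T y entrywise (assumed ReLU-style decoder).
    X_hat = np.maximum(A.T @ Y.T - tau, 0.0)              # (m', batch)
    # Gradient of the batch reconstruction error ||Y^T - A X_hat||_F^2 w.r.t. A.
    residual = Y.T - A @ X_hat                            # (n, batch)
    grad = -2.0 * residual @ X_hat.T                      # (n, m')
    A = A - lr * grad
    # Re-normalize columns to unit norm.
    return A / np.linalg.norm(A, axis=0, keepdims=True)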

Extracting the latent variables: Similarly to the noisy-OR case, we are training an overparametrized model, so to extract the latent variables which correspond to the ground-truth variables we need a filtering step. First, we apply the decoding step to all samples in the training set, and mark as "present" all coordinates in the support of each decoded $\hat{x}$. Second, we discard the columns that were never marked as "present". The intuition is rather simple: the first step is a proxy for the prior in the noisy-OR case (it captures how often a latent variable is "used"). The second step removes the unused latent variables. (Note, one can imagine a softer removal, where one removes the variables used less than some threshold, but this simpler step ends up being sufficient for us.)
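A short sketch of this column-filtering step, reusing the hypothetical thresholding decoder from the sketch above:

import numpy as np

def filter_columns(A, Y, tau=0.1):
    """Keep only the dictionary columns whose coordinate is ever non-zero
    in the decoded codes of the training samples Y (shape (num_samples, n))."""
    X_hat = np.maximum(A.T @ Y.T - tau, 0.0)   # (m', num_samples) decoded codes
    used = np.any(X_hat > 0, axis=1)           # was the column ever "present"?
    return A[:, used]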

4 Empirical study for noisy-OR networks

We study the effect of overparameterization in noisy-OR networks using 7 synthetic and semi-synthetic data sets:

(1) The first, IMG, is based on Šingliar & Hauskrecht (2006). There are 8 latent variables, and the observed variables represent the pixels of an image. Thus, the connections of a latent variable to the observed variables can be represented as an image (see Figure 1). All latent variables have the same prior probability, and all failure probabilities different from 1 are set to a common value.

(a)
(b)
Figure 1: (a) Configuration of the IMG noisy-OR model. In the first row, each image represents a latent variable. Each pixel in an image represents the failure probability of the latent variable with the corresponding observed variable (white pixels correspond to failure probabilities different from 1). In the second row, each node represents an observed variable; the observed variables corresponding to the first row of the images are shown. The edges show failure probabilities different from 1. (b) Samples of the IMG data set. Each image represents a sample, and each pixel represents an observed variable (white pixels correspond to activated observed variables).

(2) The second, PLNT, is semi-synthetic: we learn a noisy-OR model from a real-world data set, then sample from the learned model. We learn the model from the UCI plants data set (Lichman et al., 2013), where each data point represents a plant that grows in North America and the 70 binary features indicate in which states and territories of North America it is found. The data set contains 34,781 data points. The resulting noisy-OR model has 8 latent variables with non-negligible prior probabilities, and each failure probability is either equal to 1 or below a fixed threshold (see Appendix C for the construction).

The next three data sets are based on randomly generated models with the same numbers of latent and observed variables as IMG:

  • (3) UNIF: Each latent variable's prior is sampled uniformly at random, and the latent variable is connected to each observed variable independently with a fixed probability. If connected, the corresponding failure probability is drawn uniformly at random; otherwise it is 1.

  • (4) CON8: All latent variables have the same prior probability. Each latent variable is connected to exactly 8 observed variables, selected at random. If connected, the failure probability is a fixed value; otherwise it is 1.

  • (5) CON24: Same as CON8, but each latent variable is connected to 24 observed variables.

The rationale for the previous two distributions is to test different densities for the connection patterns.

The final two are intended to assess whether overparameterization continues to be beneficial in the presence of model misspecification, i.e. when the generated data does not truly come from a noisy-OR model, or when there are additional (distractor) latent variables that occur with low probability.

(6) IMG-FLIP: First, generate a sample from the IMG model described above. Then, with a fixed probability, flip the value of every fourth observed variable in the sample.

(7) IMG-UNIF: This model combines the previous two types. The first 8 latent variables are those of the IMG model, with their original prior probability. We then introduce additional latent variables generated as in the UNIF model, each with a small prior probability.

For all models except PLNT, the noise probabilities are set to a common fixed value; PLNT uses the learned noise probabilities. To create each data set, we generate samples from the corresponding model and split them into a training set and a validation set. Samples are generated exactly once from the ground truth model and re-used in all experiments with that model. For the randomly generated models, we generate the ground truth model exactly once. Reported log-likelihoods are obtained on the validation set.

To count how many ground truth latent variables are recovered, we perform minimum cost bipartite matching between the ground truth latent variables and the recovered latent variables. The cost of matching two latent variables is the distance between their weight vectors (after first removing the learned latent variables with prior probability below a small threshold). After finding the optimal matching, we consider as recovered all ground truth latent variables for which the matching cost is below a fixed threshold. Note the algorithm may recover the ground truth latent variables without converging to a maximum likelihood solution, because we do not require the prior probabilities of the latent variables to match (some of the latent variables may be split into duplicates) and because the matching ignores the state of the unmatched latent variables. We discuss this in more detail in Section 4.3. A sketch of this matching procedure is given below.
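A minimal sketch of this recovery metric, assuming Euclidean distance between failure probability (weight) vectors; prior_thresh and cost_thresh stand in for the unspecified thresholds.

import numpy as np
from scipy.optimize import linear_sum_assignment

def count_recovered(F_true, F_learned, pi_learned,
                    prior_thresh=0.01, cost_thresh=1.0):
    """Count ground truth latents recovered via min-cost bipartite matching.

    F_true     : (m, n) ground truth failure probability vectors
    F_learned  : (m', n) learned failure probability vectors
    pi_learned : (m',) learned prior probabilities
    """
    # Remove learned latents with very small prior probability.
    F_kept = F_learned[pi_learned >= prior_thresh]
    # Pairwise matching costs (distance between weight vectors).
    cost = np.linalg.norm(F_true[:, None, :] - F_kept[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)
    return int(np.sum(cost[rows, cols] < cost_thresh))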

4.1 Overparameterization improves log-likelihood and ground truth recovery

For all data sets, we test the recognition network algorithm using 8 latent variables (i.e. no overparametrization), as well as 16, 32, 64, and 128 latent variables. For each experiment configuration, we run the algorithm several times with different random initializations of the generative model parameters.

Figure 2: Performance of the noisy-OR network learning algorithm. The plots show statistics over runs of the algorithm with random initializations on different data sets and with different numbers of latent variables. The y-axis in the top row denotes the average number of ground truth latent variables recovered, in the middle row the percentage of runs with full ground truth recovery, and in the bottom row the average held-out negative log-likelihood under the recovered model. The confidence intervals are shown as red bars.

We report in Figure 2 the average number of ground truth latent variables recovered, the percentage of runs with full ground truth recovery (i.e. where all ground truth latent variables are recovered), and the held-out log-likelihood under the recovered model. Note all the log-likelihoods reported are in fact the value of the ELBO (because it is intractable to compute the actual log-likelihood). We see that in all data sets, overparameterization leads to significantly improved metrics compared to using the true number of latent variables.

4.2 Harm of extreme overparameterization is minor; benefits are often significant

The results suggest that there may exist an optimal level of overparameterization for each data set, after which overparameterization stops conferring benefits (the harmful effect then may appear because larger models are more difficult to train).

The level at which the metrics peak varies across data sets, and for several of them the peak occurs at a large multiple of the true number of latent variables. Therefore, overparameterization can continue to confer benefits up to very large levels of overparameterization. In addition, even when the largest models are harmful relative to lower levels of overparameterization, they still lead to significantly improved metrics compared to no overparameterization.

4.3 Unmatched latent variables are discarded or duplicates

When the full ground truth is recovered in an overparameterized setting, the unmatched latent variables usually fall into two categories: discardable or duplicates, as described in Section 3.1. To test this observation systematically, we use the filtering step described in that section.

We applied the filtering step to all experiments reported in Figure 2, in the runs where the algorithm recovered the full ground truth. In nearly all of these runs, the filtering step keeps exactly the latent variables that match the ground truth latent variables. The exceptions are a small fraction of the runs; in these cases, the filtering step tends to keep a few additional latent variables, which have higher failure probabilities than the matched ones (but nonetheless below the discarding threshold).

Note that the solutions containing duplicates are not in general equivalent to the ground truth solutions in terms of likelihood. For example, two latent variables with identical weight vectors need not be equivalent to any single latent variable with the same weight vector (and possibly different prior probability); a short proof of this is in the Appendix (H).

4.4 Batch size does not change the effect

We also test the algorithm using a significantly larger batch size. Although the performance decreases – as may be expected given folklore wisdom that stochasticity is helpful in avoiding local minima – overparameterization remains beneficial across all metrics. This shows that the effect of overparameterization is not tied to the stochasticity conferred by a small batch size. For example, on the IMG data set, the average number of ground truth latent variables recovered still increases as we increase the number of latent variables used for learning. See the Appendix (E, Table 2) for detailed results.

4.5 Variational distribution does not change the effect

To test the effect of the choice of variational distribution, on all data sets, we additionally test the algorithm using a mean-field variational posterior instead of the logistic regression recognition network. (This is arguably the simplest variational approximation.) In this case, the variational posterior models the latent variables as independent Bernoulli. In each epoch, for each sample, the variational posterior is updated from scratch until convergence using coordinate ascent, and then a gradient update is taken w.r.t. the parameters of the generative model.

Though the specific performance achieved on each data set often differs significantly from the previous results, overparameterization still leads to clearly improved metrics on all data sets. For example, on the IMG data set, the average number of ground truth latent variables recovered again increases as the number of latent variables used for learning grows. See the Appendix (E, Table 3).

4.6 Matching to ground-truth latent variables is unstable

To understand the optimization dynamics better, we inspect how early the recovered latent variables start to converge toward the ground truth latent variables they match in the end. If this convergence started very early, it could indicate that each latent variable converges to the closest ground truth latent variable – then, overparameterization would simply make it more likely that each ground truth latent variable has a latent variable close to it at initialization.

The story is more complex. First, early on (especially within the first epoch), there is significant conflict between the latent variables, and it is difficult to predict which ground truth latent variable each will converge to. We illustrate this in Figure 3 for a run that recovers the full ground truth on the IMG data set when learning with an overparameterized number of latent variables. In part (a) of the Figure, at regular intervals in the optimization process, we matched the latent variables to the ground truth latent variables and counted how many pairs are the same as at the end of the optimization process. Especially within the first epoch, the number of such pairs is small, suggesting the latent variables are not “locked” to their final state. Part (b) of the Figure pictorially depicts different stages in the first epoch of the same run, clearly showing that in the beginning there are many latent variables that are in “conflict” for the same ground truth latent variables.

Second, even in the later stages of the algorithm, it is often the case that the contribution of one ground truth latent variable is split between multiple recovered latent variables, or that the contribution of multiple ground truth latent variables is merged into one recovered latent variable. This is illustrated in part (c) (the same successful run depicted in Figure 3 (b)), which shows multiple later stages of the optimization process which contain “conflict” between latent variables. See the Appendix (F) for the evolution of the optimization process across more intervals.

Of course, the observations above do not rule out the possibility that closeness at initialization between the latent variables and the ground truth latent variables is an important ingredient of the beneficial effect. However, we showed that the optimization process remains complex even in cases with overparameterization.

(a)
(b)
(c)
Figure 3: State of the optimization process on a successful run of the noisy-OR network learning algorithm on the IMG data set with an overparameterized number of latent variables. (a) The blue line shows the number of latent variables matched to the same ground truth latent variable as at the end of the optimization. The red line is the negative held-out log-likelihood. The graph is truncated after a fixed number of epochs. (b) The shapes of the latent variables at three points within the first epoch. (c) The shapes of the latent variables at three later points in the optimization.

4.7 Effects of model mismatch

In the experiments with model mismatch, we still use the noisy-OR network learning algorithm from Section 4. We show that, in both data sets, overparameterization allows the algorithm to recover the underlying IMG latent variables more accurately, while modeling the noise with extra latent variables. In general, we think that in misspecified settings the algorithm tends to learn a projection of the ground truth model onto the specified noisy-OR model family, and that overparameterization often allows more of the noise to be “explained away” through latent variables.

For both data sets, the first bump in the recovery metrics happens when we learn with one more latent variable than the true number, which allows 8 latent variables to be used for the IMG structure and the extra latent variable to capture some of the noise. After that, more overparameterization increases the accuracy even further: for example, on the IMG-FLIP data set, the average number of ground truth latent variables recovered keeps increasing as more latent variables are added. See the Appendix (E, Table 4) for detailed results. Also, see the Appendix (G) for examples of the latent variables recovered in successful runs.

For IMG-FLIP, in successful runs, the algorithm tends to learn a model with an extra latent variable with significant non-zero prior probability that approximates the shape of the noise (i.e. a latent variable with connections to every fourth observed variable). For IMG-UNIF, the algorithm uses the extra latent variables to capture the tail latent variables. In this case, the algorithm often merges many of the tail latent variables, which is likely due to their smaller prior probabilities.

5 Empirical study for sparse coding

We find that the conclusions for sparse coding are qualitatively the same as for the noisy-OR models. Thus, for reasons of space, we only describe them briefly; see Appendix (E, Table 5) for full details.

We again evaluate using synthetic data sets. The ground truth matrix $A$ is sampled in pairs of columns such that the angle between the two columns in each pair is a fixed $\theta$. Specifically, we first generate random unit columns; then, for each of these columns, we generate another column that is a random rotation at angle $\theta$ of the original column. As a result, columns in different pairs have with high probability an absolute inner product of approximately 0 (i.e., they are roughly orthogonal), whereas the columns in the same pair have an inner product determined by the angle $\theta$. The smaller the angle $\theta$, the more difficult it is to learn the ground truth, because it is more difficult to distinguish columns in the same pair. We construct two data sets that differ in the value of $\theta$. A sketch of this construction is given below.
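A minimal sketch of this paired-column construction; the dimension n, the number of pairs, and the angle theta are placeholder values, not the paper's.

import numpy as np

def paired_dictionary(n=64, num_pairs=12, theta=np.pi / 6,
                      rng=np.random.default_rng()):
    """Build a dictionary whose columns come in pairs at a fixed angle theta."""
    cols = []
    for _ in range(num_pairs):
        u = rng.standard_normal(n)
        u /= np.linalg.norm(u)                      # first column of the pair
        # Random direction orthogonal to u, defining a random rotation plane.
        v = rng.standard_normal(n)
        v -= (v @ u) * u
        v /= np.linalg.norm(v)
        w = np.cos(theta) * u + np.sin(theta) * v   # second column, at angle theta
        cols.extend([u, w])
    return np.stack(cols, axis=1)                   # (n, 2 * num_pairs)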

We experimented with learning using a matrix with 24 columns (the true number) and with 48 columns (overparameterized). To measure how many ground truth columns are recovered, we perform minimum cost bipartite matching between the recovered columns and the ground truth columns. As cost, we use the distance between the columns, and we consider correct all matches with cost below a fixed threshold. We measure the error between the recovered matrix and the ground truth matrix as the sum of costs in the bipartite matching result (including costs larger than the threshold).

As in the case of noisy-OR networks, overparameterization consistently improves the number of ground truth columns recovered, the number of runs with full ground truth recovery, and the error: on both data sets, learning with 48 columns improves all three metrics compared to learning with 24 columns.

6 Discussion

The goal of this work was to exhibit the first controlled and thorough study of the benefits of overparametrization in unsupervised learning settings, more concretely noisy-OR networks and sparse coding. The results show that overparameterization is beneficial and impervious to a variety of changes in the settings of the learning algorithm.

As this is the first study of its kind, we mention several natural open questions. As demonstrated in Section 4.5, the choice of variational distribution has an impact on performance that cannot be offset by more overparametrization. In fact, when using the weaker variational approximation of Šingliar & Hauskrecht (2006) (which introduced some of the data sets we use), we were not able to recover all sources, regardless of the level of overparametrization. This delicate interplay between the power of the variational family and the level of overparametrization demands more study.

Of course, inextricably linked to this is precise understanding of the effects of architecture – especially so with the deluge of different varieties of (deep) generative models. We leave the task of designing controlled experiments for more complicated settings for future work.

We also observed that the sparsity of the connections between latent variables and observed variables can be important for the learnability of the model. For example, for the PLNT data set, we had to make the connections sparse by thresholding the failure probabilities in order to obtain a model that can be learned.

Finally, the work of (Zhang et al., 2016) on overparametrization in supervised settings considered a data-poor regime, where the number of parameters in the neural networks is comparable to or larger than the size of the training set. We did not explore such extreme levels of overparametrization.

References

Appendix A Learning algorithms

We give here pseudocode for the learning algorithms, as well as details that were not provided in the main document.

a.1 Noisy-OR networks with recognition network

Recall the evidence lower bound (ELBO):

$$\mathcal{L}(x; \theta, \phi) = \mathbb{E}_{Q_\phi(s \mid x)}\big[ \log P_\theta(x, s) - \log Q_\phi(s \mid x) \big].$$

One can derive the following gradients (see (Mnih & Gregor, 2014)):

$$\nabla_\theta \mathcal{L} = \mathbb{E}_{Q_\phi(s \mid x)}\big[ \nabla_\theta \log P_\theta(x, s) \big], \qquad \nabla_\phi \mathcal{L} = \mathbb{E}_{Q_\phi(s \mid x)}\big[ \big( \log P_\theta(x, s) - \log Q_\phi(s \mid x) \big) \nabla_\phi \log Q_\phi(s \mid x) \big].$$

Then, we maximize the lower bound by taking gradient steps w.r.t. $\theta$ and $\phi$. To estimate the gradients, we average the quantities inside the expectations over multiple samples from $Q_\phi(s \mid x)$.

See Algorithm 1 for an update step of the algorithm, without variance normalization and input-dependent signal centering.

The experiments use a fixed mini-batch size (unless specified otherwise). The parameters $\theta$ and $\phi$ are optimized using Adam.

  function Update($P_\theta$, $Q_\phi$, $\{x^{(1)}, \dots, x^{(B)}\}$)
     for $b = 1$ to $B$ do
        sample $s^{(b)} \sim Q_\phi(\cdot \mid x^{(b)})$
        $\ell^{(b)} \leftarrow \log P_\theta(x^{(b)}, s^{(b)}) - \log Q_\phi(s^{(b)} \mid x^{(b)})$
     end for
     $\theta \leftarrow \theta + \eta_\theta \cdot \frac{1}{B} \sum_{b} \nabla_\theta \log P_\theta(x^{(b)}, s^{(b)})$
     $\phi \leftarrow \phi + \eta_\phi \cdot \frac{1}{B} \sum_{b} \ell^{(b)} \nabla_\phi \log Q_\phi(s^{(b)} \mid x^{(b)})$
  end function
Algorithm 1 Update step for learning noisy-OR networks with a recognition network. $P_\theta$ is the current noisy-OR network model (with parameters $\theta$), $Q_\phi$ is the current recognition network model (with parameters $\phi$), and $\{x^{(1)}, \dots, x^{(B)}\}$ is a batch of samples.

a.1.1 Variance normalization and input-dependent signal centering

We use variance normalization and input-dependent signal centering to improve the estimation of $\nabla_\phi \mathcal{L}$, as in (Mnih & Gregor, 2014).

The goal of both techniques is to reduce the variance in the estimation of $\nabla_\phi \mathcal{L}$. They are based on the observation that (see (Mnih & Gregor, 2014)):

$$\mathbb{E}_{Q_\phi(s \mid x)}\big[ \big( \ell_{\theta, \phi}(x, s) - c \big) \nabla_\phi \log Q_\phi(s \mid x) \big] = \mathbb{E}_{Q_\phi(s \mid x)}\big[ \ell_{\theta, \phi}(x, s) \, \nabla_\phi \log Q_\phi(s \mid x) \big],$$

where $\ell_{\theta, \phi}(x, s) = \log P_\theta(x, s) - \log Q_\phi(s \mid x)$ is the learning signal and $c$ does not depend on $s$. Therefore, it is possible to reduce the variance of the estimator by subtracting some $c$ close to $\ell_{\theta, \phi}(x, s)$.

Variance normalization keeps running averages of the mean and variance of the learning signal $\ell_{\theta, \phi}(x, s)$. Let $\bar{\mu}$ be the running mean and $\bar{\sigma}^2$ be the running variance. Then variance normalization transforms $\ell_{\theta, \phi}(x, s)$ into $\tilde{\ell}(x, s) = (\ell_{\theta, \phi}(x, s) - \bar{\mu}) / \max(1, \bar{\sigma})$.

Input-dependent signal centering keeps an input-dependent function $b_\psi(x)$ that approximates the normalized value of the learning signal. We model $b_\psi$ as a two-layer neural network. We train $b_\psi$ to minimize the squared error $\big( \tilde{\ell}(x, s) - b_\psi(x) \big)^2$, and optimize it using SGD.

Therefore, our estimator of $\nabla_\phi \mathcal{L}$ is obtained as:

$$\widehat{\nabla}_\phi = \big( \tilde{\ell}(x, s) - b_\psi(x) \big) \, \nabla_\phi \log Q_\phi(s \mid x), \qquad s \sim Q_\phi(\cdot \mid x).$$
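As an illustration, a minimal sketch of the running-statistics normalization of the learning signal (the update rate alpha is a placeholder; the input-dependent baseline $b_\psi(x)$ would be subtracted by the caller):

import numpy as np

class SignalNormalizer:
    """Running mean/variance normalization of the learning signal."""

    def __init__(self, alpha=0.8):
        self.alpha, self.mean, self.var = alpha, 0.0, 1.0

    def __call__(self, signal):
        # Update running statistics of the raw learning signal.
        self.mean = self.alpha * self.mean + (1 - self.alpha) * np.mean(signal)
        self.var = self.alpha * self.var + (1 - self.alpha) * np.var(signal)
        # Normalize; the caller then subtracts the input-dependent baseline b(x).
        return (signal - self.mean) / max(1.0, np.sqrt(self.var))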

a.2 Noisy-OR networks with mean-field variational posterior

In the mean-field algorithm, we use the variational posterior $Q(s) = \prod_{i=1}^{m} q_i^{s_i} (1 - q_i)^{1 - s_i}$. That is, the latent variables are modeled as independent Bernoulli.

For each data point, we optimize the mean-field parameters $q_1, \dots, q_m$ from scratch (unlike the case of the recognition network variational posterior, which is “global”), and then we make a gradient update to the generative model.

To optimize the variational posterior we use coordinate ascent, according to:

$$q_i \leftarrow \sigma\Big( \mathbb{E}\big[ \log P_\theta(x, s_i = 1, s_{-i}) - \log P_\theta(x, s_i = 0, s_{-i}) \big] \Big),$$

where the expectation is over $s_{-i} \sim \prod_{j \neq i} q_j$ and $\sigma$ is the logistic sigmoid.

See Algorithm 2 for an update step of the algorithm. We use a fixed number of iterations of coordinate ascent, and we estimate the expectations with Monte Carlo samples.

  function Update($P_\theta$, $x$)
     initialize $q_1, \dots, q_m$; $g \leftarrow 0$
     for $t = 1$ to $T$ do
        for $i = 1$ to $m$ do
           $\Delta \leftarrow 0$
           for $k = 1$ to $K$ do
              sample $s_{-i} \sim \prod_{j \neq i} q_j$
              $\Delta \leftarrow \Delta + \log P_\theta(x, s_i = 1, s_{-i}) - \log P_\theta(x, s_i = 0, s_{-i})$
           end for
           $q_i \leftarrow \sigma(\Delta / K)$
        end for
     end for
     for $k = 1$ to $K$ do
        sample $s^{(k)} \sim \prod_{i} q_i$
        $g \leftarrow g + \nabla_\theta \log P_\theta(x, s^{(k)})$
     end for
     $\theta \leftarrow \theta + \eta_\theta \cdot g / K$
  end function
Algorithm 2 Update step for learning noisy-OR networks with mean-field variational posterior. $P_\theta$ is the current noisy-OR network model (with parameters $\theta$), and $x$ is a sample. $T$ is the number of coordinate ascent iterations and $K$ the number of Monte Carlo samples.

a.3 Sparse coding

See Algorithm 3 for an update step of the algorithm. We use a fixed batch size in the learning algorithm in all experiments.

  function Update($A$, $\{y^{(1)}, \dots, y^{(B)}\}$)
     for $b = 1$ to $B$ do
        $\hat{x}^{(b)} \leftarrow \mathrm{decode}(y^{(b)}; A)$
     end for
     $A \leftarrow A - \eta \cdot \nabla_A \frac{1}{B} \sum_{b} \| y^{(b)} - A \hat{x}^{(b)} \|_2^2$
     normalize columns of $A$ to unit norm
  end function
Algorithm 3 Alternating minimization algorithm update for sparse coding. $A$ is the current matrix, and $\{y^{(1)}, \dots, y^{(B)}\}$ is a batch of samples.

Appendix B IMG data set properties

In addition to being easy to visualize, the noisy-OR network model of the IMG data set has properties that ensure it is not “too easy” to learn. Specifically, some of the 8 latent variables do not have “anchor” observed variables (i.e. observed variables for which only a single failure probability is different from 1). Such anchor observed variables are an ingredient of most known provable algorithms for learning noisy-OR networks. More technically, the model requires a subtraction step in the quartet learning approach of (Jernite et al., 2013).

Appendix C PLNT data set construction

We learn the PLNT model from the UCI plants data set (Lichman et al., 2013), where each data point represents a plant that grows in North America and the 70 binary features indicate in which states and territories of North America it is found. The data set contains 34,781 data points.

To learn the model, we use the learning algorithm described in Section A.1 with a fixed number of latent variables. We remove all learned latent variables with prior probability below a small threshold. Furthermore, we transform all failure probabilities above a threshold into 1. This transformation is necessary to obtain sparse connections between the latent variables and the observed variables; without it, every latent variable is connected to almost every observed variable, which makes learning difficult. The resulting noisy-OR network model has 8 latent variables, each with a non-negligible prior probability. By construction, each failure probability different from 1 is below the threshold.

Figure 4 shows a representation of the latent variables learned in the PLNT data set. As observed, the latent variables correspond to neighboring regions in North America.

Figure 4: Latent variable configuration of the PLNT noisy-OR network model. Each map represents a latent variable. The regions in blue represent the observed variables for which the failure probability is not 1. The fifth latent variable, which seems to contain only Florida, also contains Puerto Rico and the Virgin Islands (not shown on the map).

Appendix D Detailed experiment configuration

Below we give more details on the configuration of the experiments.

d.1 Initialization

In all noisy-OR network experiments, the noisy-OR network model is initialized by sampling each prior probability, each failure probability, and each complement of a noise probability from a distribution biased toward values close to 1, so the noisy-OR network is biased toward having large prior probabilities, large failure probabilities, and small noise probabilities. We found that this biased initialization improves results over one centered around 0.5.

In the recognition network noisy-OR experiments, we initialize the recognition network to have all weight parameters and bias parameters drawn uniformly at random from a fixed interval.

In the mean-field network noisy-OR experiments, we initialize the mean-field Bernoulli random variables to have parameters drawn uniformly at random from a fixed interval.

In the sparse coding experiments, we initialize the matrix $A$ by sampling each entry from a standard Gaussian, and then normalizing the columns to have unit norm.

d.2 Hyperparameters

We generally tune hyperparameters within small multiplicative factors for each experiment configuration. Due to time constraints, for each choice of hyperparameters we test it on only a few runs with random initializations of the algorithm, and then choose the best performing hyperparameters for the large-scale experiments.

In the recognition network noisy-OR experiments, we tune the step size for the noisy-OR network model parameters and the step size for the input-dependent signal centering neural network. The step size for the recognition network model parameters is the same as the one for the noisy-OR network model parameters (tuning it independently did not seem to change the results significantly). The variance reduction technique requires a rate for the running estimates; we set this to a fixed value.

In the mean-field noisy-OR experiments, we only tune the step size for the noisy-OR network model parameters. For the coordinate ascent method used to optimize the mean-field parameters, we use a fixed number of iterations of coordinate ascent (i.e. passes through all coordinates), and in each iteration we estimate expectations with Monte Carlo samples.

In the sparse coding experiments, we tune the step size for the updates to the matrix $A$ and the decoding threshold.

d.3 Number of epochs

In all experiments, we use a large enough fixed number of epochs such that the log-likelihood / error measurement does not improve at the end of the optimization in all or nearly-all the runs. However, to avoid overfitting, we save the model parameters at regular intervals in the optimization process, and report the results from the timestep that achieved the best held-out log-likelihood / error measurement (i.e. we perform post-hoc “early stopping”).

Appendix E Tables of results

Below we present detailed tables of results for the experiments.

Table 1 shows the noisy-OR network results with recognition network. (Corresponds to Section 4.1 in the main document.)

Table 2 shows the noisy-OR network results with recognition network and a larger batch size. (Corresponds to Section 4.4 in the main document.)

Table 3 shows the noisy-OR network results with mean-field variational posterior. (Corresponds to Section 4.5 in the main document.)

Table 4 shows the noisy-OR network results with recognition network on the misspecified data sets (IMG-FLIP and IMG-UNIF). (Corresponds to Section 4.7 in the main document.)

Table 5 shows the sparse coding results. (Corresponds to Section 5 in the main document.)

LAT VARS | RECOV VARS | FULL RECOV (%) | NEGATIVE LL (NATS)
IMG
8
16
32
64
128
PLNT
8
16
32
64
128
UNIF
8
16
32
64
128
CON8
8
16
32
64
128
CON24
8
16
32
64
128
Table 1: Performance of the noisy-OR network learning algorithm with recognition network. Each row reports statistics for runs of the algorithm with random initializations. The confidence intervals are included. The first column denotes the number of latent variables used in learning, the second column the average number of ground truth latent variables recovered, the third column the percentage of runs with full ground truth recovery, and the fourth column the average held-out negative log-likelihood.
LAT VARS | RECOV VARS | FULL RECOV (%) | NEGATIVE LL (NATS)
IMG
8
16
32
PLNT
8
16
32
UNIF
8
16
32
CON8
8
16
32
CON24
8
16
32
Table 2: Performance of the noisy-OR network learning algorithm with recognition network and batch size . Each row reports statistics for runs of the algorithm with random initializations. The confidence intervals are included. The first column denotes the number of latent variables used in learning, the second column the average number of ground truth latent variables recovered, the third column the percentage of runs with full ground truth recovery, and the fourth column the average held-out negative log-likelihood.
LAT VARS | RECOV VARS | FULL RECOV (%) | NEGATIVE LL (NATS)
IMG
8
16
32
PLNT
8
16
32
UNIF
8
16
32
CON8
8
16
32
CON24
8
16
32
Table 3: Performance of the noisy-OR network learning algorithm with mean-field variational posterior. Each row reports statistics for runs of the algorithm with random initializations. The confidence intervals are included. The first column denotes the number of latent variables used in learning, the second column the average number of ground truth latent variables recovered, the third column the percentage of runs with full ground truth recovery, and the fourth column the average held-out negative log-likelihood.
LAT VARS | RECOV VARS | FULL RECOV (%) | NEGATIVE LL (NATS)
IMG-FLIP
8
9
10
16
IMG-UNIF
8
9
16
32
Table 4: Performance of the noisy-OR network learning algorithm with recognition network on the misspecified data sets (IMG-FLIP and IMG-UNIF). Each row reports statistics for runs of the algorithm with random initializations. The confidence intervals are included. The first column denotes the number of latent variables used in learning, the second column the average number of ground truth latent variables recovered, the third column the percentage of runs with full ground truth recovery, and the fourth column the average held-out negative log-likelihood.
COLS | RECOV COLS | FULL RECOV (%) | ERROR
24
48
24
48
Table 5: Performance of the sparse coding learning algorithm. Each row reports statistics for runs of the algorithm with random initializations. The confidence intervals are included. The first column denotes the number of latent variables used in learning, the second column the average number of ground truth columns recovered, the third column the percentage of runs with full ground truth recovery, and the fourth column the average error.

Appendix F State of the optimization process

Figures 5 and 6 show more steps of the optimization process on the successful overparameterized run mentioned in Section 4.6 in the main document.

Figure 5: Latent variables on a successful run of the noisy-OR network learning algorithm on the IMG data set with an overparameterized number of latent variables. Shown is the state of the latent variables at nine points in the optimization process.
Figure 6: Latent variables on a successful run of the noisy-OR network learning algorithm on the IMG data set with an overparameterized number of latent variables. Shown is the state of the latent variables at nine further points in the optimization process.

Appendix G Recovered latent variables on misspecified data sets

Figure 7 shows successful recoveries for the IMG-FLIP and IMG-UNIF data sets. (Corresponds to Section 4.7 in the main document.) As observed, some of the extra latent variables are used to model some of the noise due to misspecification.

(a)
(b)
Figure 7: Latent variables recovered in successful runs (i.e. runs that recover the IMG latent variables) on the mismatched data sets. Below each image corresponding to a latent variable, there is a color corresponding to its prior (whiter means a larger prior). (a) Successful run on the IMG-FLIP data set. (b) Successful run on the IMG-UNIF data set.

Appendix H Proof that duplicate latent variables are not equivalent to single latent variable

We prove that a model with two latent variables that have identical failure probabilities is not necessarily equivalent to a model with one latent variable that has the same failure probabilities (and possibly different priors).

Consider a noisy-OR network model with two latent variables $s_1, s_2$ and two observed variables $x_1, x_2$, in which both latent variables have prior probability $\pi_0$ and failure probability $f$ to each observed variable, and in which the noise probabilities are 0. Then the negative moments are $P(x_1 = 0) = P(x_2 = 0) = (1 - \pi_0(1 - f))^2$ and $P(x_1 = 0, x_2 = 0) = (1 - \pi_0(1 - f^2))^2$.

Consider now a noisy-OR network model with one latent variable and two observed variables $x_1, x_2$, with the same failure probability $f$ to each observed variable and noise probabilities 0, but prior probability $\pi$. To match the first-order negative moments $P(x_1 = 0)$ and $P(x_2 = 0)$, we need $1 - \pi(1 - f) = (1 - \pi_0(1 - f))^2$, i.e. $\pi = \pi_0 (2 - \pi_0(1 - f))$. But then the second-order negative moment of this model is $1 - \pi(1 - f^2) = (1 - \pi_0(1 - f^2))^2 - \pi_0^2 f (1 - f)(1 - f^2)$, which does not match $P(x_1 = 0, x_2 = 0)$ whenever $\pi_0 > 0$ and $f \in (0, 1)$. Therefore, there exists no noisy-OR model with one latent variable and identical failure and noise probabilities that is equivalent to the noisy-OR model with two latent variables.
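For concreteness, here is a worked instance with illustrative values of our own choosing (not the paper's constants), $\pi_0 = 1/2$ and $f = 1/2$:

$$P(x_1 = 0) = P(x_2 = 0) = \Big(1 - \tfrac{1}{2} \cdot \tfrac{1}{2}\Big)^2 = \tfrac{9}{16}, \qquad P(x_1 = 0, x_2 = 0) = \Big(1 - \tfrac{1}{2} \cdot \tfrac{3}{4}\Big)^2 = \tfrac{25}{64}.$$

Matching the first-order moments forces the single-latent prior to be $\pi = \tfrac{1}{2}\big(2 - \tfrac{1}{4}\big) = \tfrac{7}{8}$, which gives a second-order negative moment of $1 - \tfrac{7}{8} \cdot \tfrac{3}{4} = \tfrac{22}{64} \neq \tfrac{25}{64}$.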