1 Introduction
Unsupervised learning has been an area of intense focus in recent years. In the absence of labels, the goal of unsupervised learning can vary. Generative adversarial networks, for example, have shown promise for density estimation and synthetic data generation. In commercial applications, unsupervised learning is often used to extract features of the data that are useful in downstream tasks. In the sciences, the goal is frequently to confirm or reject whether a particular model fits well or to identify salient aspects of the data that give insight into underlying causes and factors of variation.
Many unsupervised learning problems of interest, cast as finding a maximum likelihood model, are computationally intractable in the worst case. Though much theoretical work has been done on provable algorithms using the method of moments and tensor decomposition techniques (Anandkumar et al., 2014; Arora et al., 2017; Halpern & Sontag, 2013), iterative techniques such as variational inference are still widely preferred. In particular, variational approximations using recognition networks have become increasingly popular, especially in the context of variational autoencoders (Mnih & Gregor, 2014). Intriguingly, it has been observed, e.g. by Yeung et al. (2017), that in practice many of the latent variables have low activations and hence are not used by the model.
Related phenomena have long been known in supervised learning: it was folklore knowledge among practitioners for some years that training larger neural networks can aid optimization, yet not affect generalization substantially. The seminal paper of Zhang et al. (2016) thoroughly studied this phenomenon with synthetic and real-life experiments in the supervised setting. In brief, they showed that some neural network architectures that demonstrate strong performance on benchmark datasets are so massively overparametrized that they can “memorize” large image data sets (they can perfectly fit a completely random data set of the same size). In contrast, overparametrization in the unsupervised case has received much less attention.
This paper aims to be a controlled empirical study making precise the benefits of overparameterization in unsupervised learning settings. More precisely, we consider the task of fitting two common latent-variable models: a discrete factor analysis model using a noisy-OR parametrization, and sparse coding (interpreted as a probabilistic model, see Section 3.2). Through experiments on synthetic and semi-synthetic data sets, we study the following aspects:

Log-likelihood: We show that larger models provide an increase in the (held-out) log-likelihood, though the benefit decreases as we add latent variables.

Latent variable recovery: We show that larger models increase the number of ground truth latent variables recovered, as well as the number of runs in which all ground truth latent variables are recovered. Furthermore, we show that recovering the ground-truth latent variables from the overparameterized solutions can be done via a simple filtering step: the optimization tends to converge to a solution in which all latent variables that do not match ground truth latent variables either can be discarded (i.e. have low prior probability) or are near-duplicates of other matched latent variables.

Effects of extreme overparametrization: We show that while the benefits of adding new latent variables have diminishing returns, the harmful effects of extreme overparameterization are at most minor. Neither the log-likelihood nor the number of ground truth recoveries worsens significantly as the number of latent variables increases. Instead, sometimes the performance continues to increase even with times the true number of latent variables.

Effects of training algorithm: We show that changes to the training algorithm, such as significantly increasing the batch size or using a different variational posterior, do not affect the beneficial effects of overparameterization significantly. For learning noisy-OR networks, we test two algorithms based on variational learning: one with a logistic regression recognition network, and one with a mean-field posterior (see, e.g., Wainwright et al., 2008).

Latent variable stability over the course of training: One possible explanation for why overparametrization helps is that having more latent variables increases the chances that at least one initialization will be close to each of the ground-truth latent variables. (This is indeed the idea of Dasgupta & Schulman (2007).) This does not appear to be the dominant factor here. We track the “matching” of the trained latent variables to the ground truth latent variables (matching = minimum cost bipartite matching, with a cost based on parameter closeness), and show that this matching changes until relatively late in the training process. This suggests that the benefit of overparameterization being observed is not simply due to an increased likelihood of initializations close to the ground truth values.
In this investigation, we use simple generative models so that we can exert maximal control over our experiments. With more complex, deep generative models, there is a lot of latitude in choosing the architecture of the model when overparametrizing, making it more difficult to disentangle the importance of the choice of architecture from overparametrization in general.
2 Related work
Though overparameterization as a folklore strategy for improving the optimization landscape has long been around, the first systematic study of it in supervised learning is found in Zhang et al. (2016). Subsequent theoretical works also provided mathematical explanations of some of these phenomena (Allen-Zhu et al., 2018; Allen-Zhu & Li, 2019). The unsupervised case, on the other hand, has received comparatively little attention.
To our knowledge, the earliest paper that points out a (simple) benefit of overparametrization is Dasgupta & Schulman (2007) in the context of recovering the means of well-separated spherical Gaussians given samples from the mixture. They point out that using input points as "guesses" (i.e. initializations) for the means allows us to guarantee, by the coupon collector phenomenon, that we include at least one point from each component in the mixture – which would not be so if we only used . A filtering step subsequently allows them to recover the components.
More recently, Li et al. (2018) explored matrix completion and Xu et al. (2018) mixtures of two Gaussians. In the former paper, the authors consider fitting a full-rank matrix to the partially observed matrix, yet prove gradient descent finds the correct, low-rank matrix. In the latter, the authors prove that when fitting a symmetric, equal-weight, two-component mixture, treating the weights as variables helps EM avoid local minima. (This runs counter to the intuition that, knowing that the weights are equal, one should incorporate this information directly into the algorithm.)
3 Learning overparametrized generative models
We focus on the task of fitting two commonly used latent-variable models: noisy-OR networks and sparse coding. Beyond the ubiquity of these models, the reason we chose them is that they provide the “simplest” architectures for a generative model: a single latent layer, albeit with different activation functions. This makes the choice of “overparameterized” architecture non-controversial: we simply include more latent variables.
3.1 Noisy-OR networks
A noisy-OR network is a bipartite directed graphical model, in which one layer contains binary latent variables and the other layer contains binary observed variables. Edges are directed from latent variables to observed variables. The model has as parameters a set of prior probabilities for the latent variables, a set of noise probabilities for the observed variables, and a set of weights for the graph. If the latent variables are denoted as and the observed variables as , the joint probability distribution specified by the model factorizes as:
where
It is common to refer to as the failure probability between and (i.e. the probability with which, if , it “fails to activate” ).
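To make the model concrete, the following is a minimal sketch of sampling from a noisy-OR network and evaluating its log joint probability under the standard noisy-OR parametrization. The names `priors`, `noise`, and `failure` are our own illustrative notation, and we assume that `noise[j]` is the probability that observed variable j turns on when none of its parents does; this is one common convention and may differ in detail from the exact parametrization intended here.

```python
import math
import random


def prob_off(h, noise_j, failure_j):
    """P(x_j = 0 | h): the noise and every active parent must all fail."""
    p = 1.0 - noise_j
    for h_i, f_ij in zip(h, failure_j):
        if h_i:
            p *= f_ij
    return p


def sample(priors, noise, failure, rng):
    """Draw (h, x); failure[j][i] is the failure probability between h_i and x_j."""
    h = [1 if rng.random() < p else 0 for p in priors]
    x = [0 if rng.random() < prob_off(h, noise[j], failure[j]) else 1
         for j in range(len(noise))]
    return h, x


def log_joint(h, x, priors, noise, failure):
    """log p(x, h) = log p(h) + sum_j log p(x_j | h)."""
    lp = sum(math.log(p if h_i else 1.0 - p) for h_i, p in zip(h, priors))
    for j, x_j in enumerate(x):
        off = prob_off(h, noise[j], failure[j])
        lp += math.log(off if x_j == 0 else 1.0 - off)
    return lp


# Toy model (hypothetical numbers): 2 latent variables, 3 observed variables.
priors = [0.3, 0.6]
noise = [0.01, 0.01, 0.01]
failure = [[0.1, 1.0], [0.1, 0.2], [1.0, 0.2]]  # 1.0 means "not connected"
h, x = sample(priors, noise, failure, random.Random(0))
```

A failure probability of 1 encodes a missing edge, which is why sparse connection patterns correspond to failure-probability vectors with most entries equal to 1.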
Training algorithm: We optimize an approximation of the likelihood of the data under the model, the evidence lower bound (ELBO). This is necessary because direct maximum likelihood optimization is intractable. If the joint pdf is , we have, using the notation of Mnih & Gregor (2014):
where is a variational posterior, also known as a recognition network in this setting. When , the inequality becomes equality; however, it is intractable to compute . Instead, we will assume that belongs to a simpler family of distributions: a logistic regression distribution parameterized by weights and bias terms , s.t.:
Then, we maximize the lower bound by taking gradient steps w.r.t. and . Furthermore, to improve the estimation of the gradients, we use variance normalization and input-dependent signal centering, as in Mnih & Gregor (2014). For the input-dependent signal centering, we use a two-layer neural network with hidden nodes in the second layer and activation functions.
Extracting the ground-truth latent variables: As we are training an overparametrized model, we need to filter the learned latent variables to extract latent variables corresponding to the ground-truth variables. First, we discard all latent variables that are discardable, namely those that have prior probability less than or for which all failure probabilities of the related observable variables are greater than . Then, for every pair of latent variables that are duplicates (measured as having failure probability vectors closer than in distance), we discard the one with lower prior probability, such that for any cluster of duplicate latent variables, only the one with the largest prior probability survives.
3.2 Sparse coding
A sparse coding model is specified by a matrix with , (i.e. unit columns). Samples are generated from this model according to , with , and (i.e. sparsity ). The coordinates of the vector play the role of the latent variables, and the distribution is generated from is as follows:

Uniformly randomly choose coordinates of to be nonzero.

Sample the values for the nonzero coordinates uniformly in .

Renormalize the nonzero coordinates so they sum to 1.
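The three steps above can be sketched as follows. This is an illustration only: the argument names `m` (number of latent coordinates), `k` (sparsity), and the sampling interval `[low, high]` are placeholders of ours, and the example values passed at the bottom are hypothetical rather than the settings used in the experiments.

```python
import random


def sample_code(m, k, low, high, rng):
    """Sample a k-sparse, non-negative code of length m whose entries sum to 1."""
    support = rng.sample(range(m), k)                    # step 1: pick the support
    values = [rng.uniform(low, high) for _ in support]   # step 2: draw magnitudes
    total = sum(values)
    code = [0.0] * m
    for i, v in zip(support, values):                    # step 3: renormalize
        code[i] = v / total
    return code


# Hypothetical sizes and interval, chosen only to exercise the function.
code = sample_code(m=10, k=3, low=0.5, high=1.0, rng=random.Random(0))
```

The sketch assumes `low > 0`, so that the normalizing sum is strictly positive.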
Training algorithm: We use a simple alternating-minimization algorithm given in Li et al. (2016). It starts with a random initialization of , such that has unit columns. Then, at each iteration, it “decodes” the latent variables for a batch of samples, s.t. a sample is decoded as , for some fixed and the current version of . After decoding, it takes a gradient step toward minimizing the “reconstruction error” , and then renormalizes the columns of such that it has unit columns. Here, overparameterization means learning a matrix with , where is the number of columns of the ground truth matrix.
Extracting the latent variables: As in the noisy-OR case, we are training an overparametrized model, so to extract latent variables which correspond to the ground-truth variables we need a filtering step. First, we apply the decoding step to all samples in the training set, and mark as "present" all coordinates in the support of . Second, we discard the columns that were never marked as "present". The intuition is rather simple: the first step is a proxy for the prior in the noisy-OR case (it captures how often a latent variable is "used"). The second step removes the unused latent variables. (Note, one can imagine a softer removal, where one removes the variables used less than some threshold, but this simpler step ends up being sufficient for us.)
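A minimal sketch of this two-step filter is given below. It assumes the decoded latent vectors are already available as a list `codes` (one vector per training sample); the function name and the tolerance `tol` are illustrative choices of ours, not part of the algorithm as stated.

```python
def filter_columns(codes, num_columns, tol=0.0):
    """Return indices of dictionary columns used by at least one decoded sample.

    codes: list of decoded latent vectors, one per training sample.
    tol: entries with absolute value <= tol count as zero (hypothetical knob).
    """
    present = [False] * num_columns
    for code in codes:                      # step 1: mark the support of each code
        for i, value in enumerate(code):
            if abs(value) > tol:
                present[i] = True
    # step 2: keep only the columns that were marked at least once
    return [i for i, used in enumerate(present) if used]


codes = [[0.0, 0.7, 0.0, 0.3], [0.0, 0.0, 0.0, 1.0]]
kept = filter_columns(codes, num_columns=4)  # columns 0 and 2 are never used
```

The softer variant mentioned above would replace the boolean marks with usage counts and compare them against a threshold.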
4 Empirical study for noisy-OR networks
We study the effect of overparameterization in noisy-OR networks using 7 synthetic data sets:
(1) The first, IMG, is based on Šingliar & Hauskrecht (2006). There are latent variables and observed variables. The observed variables represent the pixels of an image. Thus, the connections of a latent variable to observed variables can be represented as an image (see Figure 1). Latent variables have priors . All failure probabilities different from are .

(2) The second, PLNT, is semi-synthetic: we learn a noisy-OR model from a real-world data set, then sample from the learned model. We learn the model from the UCI plants data set (Lichman et al., 2013), where each data point represents a plant that grows in North America and the 70 binary features indicate in which states and territories of North America it is found. The data set contains 34,781 data points. The resulting noisy-OR model has latent variables, prior probabilities between and , and failure probabilities either less than or equal to .
The next three data sets are based on randomly generated models with latent and observed variables (same as IMG):

(3) UNIF: Each latent variable’s prior is sampled and it is connected to each observation with probability . If connected, the corresponding failure probability is drawn from ; otherwise it is .

(4) CON8: for all . Each latent variable is connected to exactly observed variables, selected at random. If connected, the failure probability is ; otherwise it is .

(5) CON24: Same as CON8, but each latent variable is connected to observed variables.
The rationale for the previous two distributions is to test different densities for the connection patterns.
The final two are intended to assess whether overparameterization continues to be beneficial in the presence of model misspecification, i.e. when the generated data does not truly come from a noisy-OR model, or when there are additional (distractor) latent variables that occur with low probability.
(6) IMGFLIP: First, generate a sample from the IMG model described above. Then, with probability , flip the value of every fourth observed variable in the sample (i.e. , , …).
(7) IMGUNIF: This model has latent variables and observed variables. The first latent variables are those of the IMG model, again with prior probability . We then introduce more latent variables from the UNIF model, with prior probabilities each.
For all models except PLNT, noise probabilities are set to . PLNT uses the learned noise probabilities. To create each data set, we generate samples from the corresponding model. We split these samples into a training set of samples and a validation set of samples. Samples are generated exactly once from the ground truth model and reused in all experiments with that model. For the randomly generated models, we generate the ground truth model exactly once. Reported log-likelihoods are obtained on the validation set.
To count how many ground truth latent variables are recovered, we perform minimum cost bipartite matching between the ground truth latent variables and the recovered latent variables. The cost of matching two latent variables is the distance between their weight vectors (removing first the variables with prior probability lower than ). After finding the optimal matching, we consider as recovered all ground truth latent variables for which the matching cost is less than . Note the algorithm may recover the ground truth latent variables without converging to a maximum likelihood solution because we do not require the prior probabilities of the latent variables to match (some of the latent variables may be split into duplicates) and because the matching algorithm ignores the state of the unmatched latent variables. We discuss this in more detail in Section 4.3.
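The recovery count described above can be sketched with an off-the-shelf assignment solver. The sketch below uses `scipy.optimize.linear_sum_assignment`; the function name `count_recovered`, the choice of norm (`order`), and the threshold values in the example are assumptions of ours, since the specific distance and threshold are not fixed by this sketch.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def count_recovered(truth, learned, threshold, order=np.inf):
    """Count ground-truth rows matched to a learned row at cost below threshold.

    truth: (k, n) array, one weight vector per ground-truth latent variable.
    learned: (m, n) array, weight vectors that survived the prior filter.
    order: norm used as matching cost (an assumption; see the text above).
    """
    diff = truth[:, None, :] - learned[None, :, :]
    cost = np.linalg.norm(diff, ord=order, axis=2)  # (k, m) pairwise distances
    rows, cols = linear_sum_assignment(cost)        # minimum-cost matching
    return int(np.sum(cost[rows, cols] < threshold))


# Toy example with hypothetical numbers: 2 true and 3 learned latent variables.
truth = np.array([[1.0, 0.0], [0.0, 1.0]])
learned = np.array([[0.0, 0.95], [0.5, 0.5], [1.02, 0.0]])
recovered = count_recovered(truth, learned, threshold=0.1)
```

Because the cost matrix is rectangular, learned latent variables left unmatched simply do not enter the count, which mirrors the remark that the matching ignores the state of unmatched latent variables.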
4.1 Overparameterization improves log-likelihood and ground truth recovery
For all data sets, we test the recognition network algorithm using latent variables (i.e. no overparametrization), , , , and . For each experiment configuration, we run the algorithm times with different random initializations of the generative model parameters.
We report in Figure 2 the average number of ground truth latent variables recovered, the percentage of runs with full ground truth recovery (i.e. where ground truth latent variables are recovered), and the held-out log-likelihood under the recovered model. Note all the log-likelihoods reported are in fact the value of the ELBO (because it is intractable to compute the actual log-likelihood). We see that in all data sets, overparameterization leads to significantly improved metrics compared to using latent variables.
4.2 Harm of extreme overparameterization is minor; benefits are often significant
The results suggest that there may exist an optimal level of overparameterization for each data set, after which overparameterization stops conferring benefits (the harmful effect then may appear because larger models are more difficult to train).
The peak happens at latent variables for IMG, at for PLNT, at for UNIF, at for CON8, and at for CON24. Therefore, overparameterization can continue to confer benefits up to very large levels of overparameterization. In addition, even when latent variables are harmful with respect to lower levels of overparameterization, latent variables lead to significantly improved metrics compared to no overparameterization.
4.3 Unmatched latent variables are discarded or duplicates
When the full ground truth is recovered in an overparameterized setting, the unmatched latent variables usually fall into two categories: discardable or duplicates, as described in Section 3.1. To test this observation systematically, we use the filtering step described in that section.
We applied the filtering step to all experiments reported in Figure 2, in the runs where the algorithm recovered the full ground truth. In nearly all of these runs, the filtering step keeps exactly latent variables that match the ground truth latent variables. Exceptions are runs out of a total of runs (i.e. ); in these exception cases, the filtering step tends to keep more latent variables, out of which match the ground truth, and the others have higher failure probabilities (but nonetheless lower than the threshold of ).
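For concreteness, the filtering step of Section 3.1 can be sketched as below. The thresholds `prior_min`, `failure_max`, and `dup_dist` are placeholders, and measuring closeness of failure-probability vectors by the maximum absolute difference is an assumption made only for this illustration.

```python
def filter_latents(priors, failure, prior_min, failure_max, dup_dist):
    """Return indices of latent variables kept by the two-stage filter.

    failure[i] is the failure-probability vector of latent variable i.
    prior_min, failure_max and dup_dist are placeholder thresholds.
    """
    # Stage 1: drop latent variables that are rarely on or connected to nothing.
    alive = [i for i in range(len(priors))
             if priors[i] >= prior_min and min(failure[i]) <= failure_max]

    def dist(i, j):
        # Assumed distance: largest coordinate-wise gap between the two vectors.
        return max(abs(a - b) for a, b in zip(failure[i], failure[j]))

    # Stage 2: within every duplicate pair, drop the one with the lower prior
    # (ties are broken in favor of the smaller index).
    kept = []
    for i in alive:
        beaten = any(j != i and dist(i, j) < dup_dist
                     and (priors[j], -j) > (priors[i], -i) for j in alive)
        if not beaten:
            kept.append(i)
    return kept


# Toy example with hypothetical numbers.
priors = [0.25, 0.001, 0.2, 0.1, 0.3]
failure = [[0.1, 1.0, 1.0], [0.1, 0.1, 1.0], [1.0, 0.1, 1.0],
           [0.12, 1.0, 1.0], [1.0, 1.0, 1.0]]
kept = filter_latents(priors, failure, prior_min=0.01, failure_max=0.9,
                      dup_dist=0.05)
```

In this toy example, variable 1 is dropped for its low prior, variable 4 because it is connected to nothing, and variable 3 as a lower-prior duplicate of variable 0.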
Note that the solutions containing duplicates are not in general equivalent to the ground truth solutions in terms of likelihood. For example, two latent variables with identical weight vectors need not be equivalent to any single latent variable with the same weight vector (and possibly different prior probability); a short proof of this is in the Appendix (H).
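The following small numerical check illustrates this point; it is not the proof from the Appendix. We assume a toy setting of our own: two observed variables, zero noise, and latent variables that connect to both observed variables with the same failure probability `f`. Matching the marginal probability that one observed variable is off pins down the prior `q` of a single replacement latent variable, and the joint probability that both are off then disagrees.

```python
def dup_stats(p1, p2, f):
    """P(x1=0) and P(x1=0, x2=0) with two duplicate latent variables."""
    off_one = (1 - p1 * (1 - f)) * (1 - p2 * (1 - f))
    off_both = (1 - p1 * (1 - f * f)) * (1 - p2 * (1 - f * f))
    return off_one, off_both


def single_stats(q, f):
    """The same two probabilities with a single latent variable of prior q."""
    return 1 - q * (1 - f), 1 - q * (1 - f * f)


p1, p2, f = 0.5, 0.5, 0.5                 # hypothetical values
off_one, off_both = dup_stats(p1, p2, f)
q = (1 - off_one) / (1 - f)               # the only prior that matches P(x1 = 0)
single_one, single_both = single_stats(q, f)
# single_one equals off_one by construction, but single_both differs from off_both.
```

Since the marginal is strictly decreasing in `q`, no other choice of prior can repair the mismatch in this toy setting.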
4.4 Batch size does not change the effect
We also test the algorithm using batch size instead of . Although the performance decreases – as may be expected given folklore wisdom that stochasticity is helpful in avoiding local minima – overparameterization remains beneficial across all metrics. This shows that the effect of overparameterization is not tied to the stochasticity conferred by a small batch size. For example, on the IMG data set, we recover on average , , and ground truth latent variables for learning with , , and latent variables, respectively. See the Appendix (E, Table 2) for detailed results.
4.5 Variational distribution does not change the effect
To test the effect of the choice of variational distribution, on all data sets, we additionally test the algorithm using a mean-field variational posterior instead of the logistic regression recognition network. (This is arguably the simplest variational approximation.) In this case, the variational posterior models the latent variables as independent Bernoulli. In each epoch, for each sample, the variational posterior is updated from scratch until convergence using coordinate ascent, and then a gradient update is taken w.r.t. the parameters of the generative model.
Though the specific performance achieved on each data set often differs significantly from the previous results, overparameterization still leads to clearly improved metrics on all data sets. For example, on the IMG data set, we recover on average , , and ground truth latent variables for learning with , , and latent variables, respectively. See the Appendix (E, Table 3).
4.6 Matching to ground-truth latent variables is unstable
To understand the optimization dynamics better, we inspect how early the recovered latent variables start to converge toward the ground truth latent variables they match in the end. If this convergence started very early, it could indicate that each latent variable converges to the closest ground truth latent variable – then, overparameterization would simply make it more likely that each ground truth latent variable has a latent variable close to it at initialization.
The story is more complex. First, early on (especially within the first epoch), there is significant conflict between the latent variables, and it is difficult to predict which ground truth latent variable they will converge to. We illustrate this in Figure 3 for a run that recovers the full ground truth on the IMG data set when learning with latent variables. In part (a) of the Figure, at regular intervals in the optimization process, we matched the latent variables to the ground truth latent variables and counted how many pairs are the same as at the end of the optimization process. Especially within the first epoch, the number of such pairs is small, suggesting the latent variables are not “locked” to their final state. Part (b) of the Figure pictorially depicts different stages in the first epoch of the run with latent variables, clearly showing that in the beginning there are many latent variables that are in “conflict” for the same ground truth latent variables.
Second, even in the later stages of the algorithm, it is often the case that the contribution of one ground truth latent variable is split between multiple recovered latent variables, or that the contribution of multiple ground truth latent variables is merged into one recovered latent variable. This is illustrated in part (c) (the same successful run depicted in Figure 3 (b)), which shows multiple later stages of the optimization process which contain “conflict” between latent variables. See the Appendix (F) for the evolution of the optimization process across more intervals.
Of course, the observations above do not rule out the possibility that closeness at initialization between the latent variables and the ground truth latent variables is an important ingredient of the beneficial effect. However, we showed that the optimization process remains complex even in cases with overparameterization.
4.7 Effects of model mismatch
In the experiments with model mismatch, we still use the noisy-OR network learning algorithm from Section 4. We show that, in both data sets, overparameterization allows the algorithm to recover the underlying IMG latent variables more accurately, while modeling the noise with extra latent variables. In general, we think that in misspecified settings the algorithm tends to learn a projection of the ground truth model onto the specified noisy-OR model family, and that overparameterization often allows more of the noise to be “explained away” through latent variables.
For both data sets, the first bump in the recovery metrics happens when we learn with latent variables, which allows latent variables to be used for the IMG data set, and the extra latent variable to capture some of the noise. After that, more overparameterization increases the accuracy even further. For example, on the IMGFLIP data set, we recover on average , , and ground truth latent variables for learning with , , and latent variables, respectively. See the Appendix (E, Table 4) for detailed results. Also, see the Appendix (G) for examples of the latent variables recovered in successful runs.
For IMGFLIP, in successful runs, the algorithm tends to learn a model with an extra latent variable with significant nonzero prior probability that approximates the shape of the noise (i.e. a latent variable with connections to every fourth observed variable). For IMGUNIF, the algorithm uses the extra latent variables to capture the tail latent variables. In this case, the algorithm often merges many of the tail latent variables, which is likely due to their smaller prior probabilities.
5 Empirical study for sparse coding
We find that the conclusions for sparse coding are qualitatively the same as for the noisy-OR models. Thus, for reasons of space, we only describe them briefly; see Appendix (E, Table 5) for full details.
We again evaluate using synthetic data sets. is sampled in pairs of columns such that the angle between each pair is a fixed . Specifically, we first generate random unit columns; then, for each of these columns, we generate another column that is a random rotation at angle of the original column. As a result, columns in different pairs have with high probability an absolute value inner product of approximately (i.e., roughly orthogonal), whereas the columns in the same pair have an inner product determined by the angle . The smaller the angle , the more difficult it is to learn the ground truth, because it is more difficult to distinguish columns in the same pair. We construct two data sets, the first with and the second with .
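One way to realize this paired construction is sketched below, as an illustration under our own assumptions: each pair consists of a random unit vector `a` and a second vector obtained by rotating `a` by the given angle toward a random direction orthogonal to `a`. The function name, dimensions, and the angle of 10 degrees are hypothetical and are not the values used in the experiments.

```python
import numpy as np


def paired_columns(dim, num_pairs, angle, rng):
    """Return a (dim, 2 * num_pairs) matrix of unit columns in pairs at `angle` radians."""
    cols = []
    for _ in range(num_pairs):
        a = rng.standard_normal(dim)
        a /= np.linalg.norm(a)                 # random unit column
        b = rng.standard_normal(dim)
        b -= (b @ a) * a                       # remove the component along a
        b /= np.linalg.norm(b)                 # unit vector orthogonal to a
        cols += [a, np.cos(angle) * a + np.sin(angle) * b]
    return np.stack(cols, axis=1)


# Hypothetical sizes and angle, chosen only to exercise the function.
A = paired_columns(dim=100, num_pairs=3, angle=np.deg2rad(10.0),
                   rng=np.random.default_rng(0))
```

By construction, the two columns of a pair have inner product equal to the cosine of the angle, while columns from different pairs are independent random directions and are therefore nearly orthogonal in high dimension.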
We experimented with learning using a matrix with columns (the true number) and with columns (overparameterized). To measure how many ground truth columns are recovered, we perform minimum cost bipartite matching between the recovered columns and the ground truth columns. As cost, we use the distance between the columns and consider correct all matches with cost below . We measure the error between the recovered matrix and the ground truth matrix as the sum of costs in the bipartite matching result (including costs larger than the threshold).
As in the case of noisy-OR networks, overparameterization consistently improves the number of ground truth columns recovered, the number of runs with full ground truth recovery, and the error. For example, for , learning with columns gives on average recovered columns, full recoveries, and error, while learning with columns gives recovered columns, full recoveries, and average error.
6 Discussion
The goal of this work was to exhibit the first controlled and thorough study of the benefits of overparametrization in unsupervised learning settings, more concretely noisy-OR networks and sparse coding. The results show that overparameterization is beneficial and impervious to a variety of changes in the settings of the learning algorithm.
As this is the first study of its kind, we mention several natural open questions. As demonstrated in Section 4.5, the choice of variational distribution has an impact on the performance which cannot be offset by more overparametrization. In fact, when using the weaker variational approximation used in Šingliar & Hauskrecht (2006) (which introduced some of the datasets we use), we were not able to recover all sources, regardless of the level of overparametrization. This delicate interplay between the power of the variational family and the level of overparametrization demands more study.
Of course, inextricably linked to this is precise understanding of the effects of architecture – especially so with the deluge of different varieties of (deep) generative models. We leave the task of designing controlled experiments for more complicated settings for future work.
We also observed that the sparsity of the connections between latent variables and observed variables can be important in the learnability of the model. For example, for the PLNT data set, it was required to make the connections sparse by thresholding the failure probabilities in order to obtain a model that can be learned.
Finally, the work of Zhang et al. (2016) on overparametrization in supervised settings considered a data-poor regime, in which the number of parameters in the neural networks is comparable to or larger than the size of the training set. We did not explore such extreme levels of overparametrization.
References
Allen-Zhu & Li (2019) Allen-Zhu, Z. and Li, Y. Can SGD learn recurrent neural networks with provable generalization? arXiv preprint arXiv:1902.01028, 2019.
Allen-Zhu et al. (2018) Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918, 2018.
Anandkumar et al. (2014) Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research, 15(1):2773–2832, 2014.
Arora et al. (2017) Arora, S., Ge, R., Ma, T., and Risteski, A. Provable learning of noisy-OR networks. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1057–1066. ACM, 2017.
Dasgupta & Schulman (2007) Dasgupta, S. and Schulman, L. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. Journal of Machine Learning Research, 8(Feb):203–226, 2007.
Halpern & Sontag (2013) Halpern, Y. and Sontag, D. Unsupervised learning of noisy-OR Bayesian networks. arXiv preprint arXiv:1309.6834, 2013.
Jernite et al. (2013) Jernite, Y., Halpern, Y., and Sontag, D. Discovering hidden variables in noisy-OR networks using quartet tests. In Advances in Neural Information Processing Systems, pp. 2355–2363, 2013.
Li et al. (2016) Li, Y., Liang, Y., and Risteski, A. Recovery guarantee of non-negative matrix factorization via alternating updates. In Advances in Neural Information Processing Systems, pp. 4987–4995, 2016.
Li et al. (2018) Li, Y., Ma, T., and Zhang, H. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference On Learning Theory, pp. 2–47, 2018.
Lichman et al. (2013) Lichman, M. et al. UCI machine learning repository, 2013.
Mnih & Gregor (2014) Mnih, A. and Gregor, K. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, pp. 1791–1799, 2014.
Šingliar & Hauskrecht (2006) Šingliar, T. and Hauskrecht, M. Noisy-OR component analysis and its application to link analysis. Journal of Machine Learning Research, 7(Oct):2189–2213, 2006.
Wainwright et al. (2008) Wainwright, M. J., Jordan, M. I., et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
Xu et al. (2018) Xu, J., Hsu, D. J., and Maleki, A. Benefits of over-parameterization with EM. In Advances in Neural Information Processing Systems, pp. 10685–10695, 2018.
Yeung et al. (2017) Yeung, S., Kannan, A., Dauphin, Y., and Fei-Fei, L. Tackling over-pruning in variational autoencoders. arXiv preprint arXiv:1706.03643, 2017.
Zhang et al. (2016) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
Appendix A Learning algorithms
We give here pseudocode for the learning algorithms, as well as details that were not provided in the main document.
A.1 Noisy-OR networks with recognition network
Recall the evidence lower bound (ELBO):
One can derive the following gradients (see Mnih & Gregor, 2014):
Then, we maximize the lower bound by taking gradient steps w.r.t. and . To estimate the gradients, we average the quantities inside the expectations over multiple samples from .
See Algorithm 1 for an update step of the algorithm, without variance normalization and input-dependent signal centering.
The experiments use minibatch size (unless specified otherwise). and are optimized using Adam.
A.1.1 Variance normalization and input-dependent signal centering
We use variance normalization and input-dependent signal centering to improve the estimation of , as in Mnih & Gregor (2014).
The goal of both techniques is to reduce the variance in the estimation of . They are based on the observation that (see Mnih & Gregor, 2014):
where does not depend on . Therefore, it is possible to reduce the variance in the estimator by using some close to .
Variance normalization keeps running averages of the mean and variance of . Let be the average mean and be the average variance. Then variance normalization transforms into .
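As a rough illustration of variance normalization, the sketch below keeps exponential running averages of the mean and variance of a scalar learning signal and uses them to center and rescale each batch. The class name, the smoothing factor `decay`, and the choice to divide by `max(1, std)` (so that small signals are never amplified) are assumptions of ours, following our reading of Mnih & Gregor (2014); they are not a statement of the exact transformation used in the experiments.

```python
import math


class SignalNormalizer:
    """Running-average centering and scaling of a scalar learning signal."""

    def __init__(self, decay=0.9):
        self.decay = decay      # hypothetical smoothing factor
        self.mean = 0.0
        self.var = 1.0

    def __call__(self, signals):
        """Update the running statistics from a batch, then normalize the batch."""
        batch_mean = sum(signals) / len(signals)
        batch_var = sum((s - batch_mean) ** 2 for s in signals) / len(signals)
        self.mean = self.decay * self.mean + (1 - self.decay) * batch_mean
        self.var = self.decay * self.var + (1 - self.decay) * batch_var
        scale = max(1.0, math.sqrt(self.var))   # never amplify the signal
        return [(s - self.mean) / scale for s in signals]


normalizer = SignalNormalizer(decay=0.9)
normalized = normalizer([-120.0, -95.0, -101.5])  # e.g. per-sample learning signals
```

Subtracting a quantity that does not depend on the sampled latent variables leaves the gradient estimator unbiased, which is why this kind of centering can reduce variance without changing the expected update.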
Input-dependent signal centering keeps an input-dependent function that approximates the normalized value of . We model as a two-layer neural network with hidden nodes in the second layer and activation functions. We train to minimize , and optimize it using SGD.
Therefore, our estimator of is obtained as:
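The two variance-reduction steps can be sketched as follows. This is an illustrative sketch rather than the exact estimator used in the experiments. The division by `max(1, std)` follows the form given by Mnih & Gregor (2014) and is assumed to carry over here. The class and function names and the `rate` value are hypothetical, and the baseline is passed in as a plain number instead of being produced by a trained network.

```python
import numpy as np


class RunningNormalizer:
    """Running mean and variance of the learning signal (moving averages)."""

    def __init__(self, rate):
        self.rate = rate  # weight given to the previous running estimate
        self.mean = 0.0
        self.var = 1.0

    def normalize(self, signal):
        signal = np.asarray(signal, dtype=float)
        self.mean = self.rate * self.mean + (1 - self.rate) * signal.mean()
        self.var = self.rate * self.var + (1 - self.rate) * signal.var()
        # Dividing by max(1, std) can shrink the signal but never amplifies it.
        return (signal - self.mean) / max(1.0, np.sqrt(self.var))


def reduced_variance_signal(signal, baseline, normalizer):
    """Normalize the learning signal, then subtract an input-dependent baseline."""
    normalized = normalizer.normalize(signal)
    return normalized - baseline, normalized


# Hypothetical values: rate=0.0 makes the running statistics equal the batch ones.
normalizer = RunningNormalizer(rate=0.0)
centered, normalized = reduced_variance_signal([1.0, 2.0, 3.0], 0.5, normalizer)
```

In a full implementation, `baseline` would be the output of the input-dependent network for the current data point, and `normalized` would serve as that network's regression target.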
A.2 Noisy-OR networks with mean-field variational posterior
In the mean-field algorithm, we use a fully factorized variational posterior. That is, the latent variables are modeled as independent Bernoulli.
For each data point, we optimize from scratch (unlike the case of the recognition network variational posterior, which is “global”), and then we make a gradient update to the generative model.
To optimize the variational posterior we use coordinate ascent, according to:
where the expectation is over .
See Algorithm 2 for an update step of the algorithm. We use iterations of coordinate ascent, and we use samples to estimate expectations.
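A minimal sketch of such a coordinate ascent procedure is given below, assuming the standard mean-field update for a factorized Bernoulli posterior: the logit of each coordinate is set to the expected difference in log joint probability between switching that latent variable on and off, with the expectation over the remaining coordinates estimated by sampling. The function names, the initialization range, and the iteration and sample counts are hypothetical, and this is not claimed to reproduce Algorithm 2 exactly.

```python
import numpy as np


def log_joint(x, h, prior, fail, noise_c):
    """log p(x, h) for a noisy-OR network; h has shape (S, n)."""
    log_prior = (h * np.log(prior) + (1 - h) * np.log1p(-prior)).sum(axis=1)
    # P(x_j = 0 | h) = noise_c_j * prod_i fail_ij ** h_i
    log_off = np.log(noise_c) + h @ np.log(fail)
    log_on = np.log1p(-np.exp(log_off))
    return log_prior + (x * log_on + (1 - x) * log_off).sum(axis=1)


def mean_field_posterior(x, prior, fail, noise_c, n_iters, n_samples, rng):
    """Coordinate ascent on the Bernoulli parameters q_i of a factorized posterior."""
    n = prior.size
    q = rng.uniform(0.25, 0.75, size=n)  # hypothetical initialization range
    for _ in range(n_iters):
        for i in range(n):
            # Sample the other coordinates from the current posterior.
            h = (rng.random((n_samples, n)) < q).astype(float)
            h_on, h_off = h.copy(), h.copy()
            h_on[:, i] = 1.0
            h_off[:, i] = 0.0
            diff = (log_joint(x, h_on, prior, fail, noise_c)
                    - log_joint(x, h_off, prior, fail, noise_c))
            q[i] = 1.0 / (1.0 + np.exp(-diff.mean()))
    return q


# With a single latent variable the mean-field posterior is exact.
rng = np.random.default_rng(0)
q_single = mean_field_posterior(
    x=np.array([1.0]), prior=np.array([0.3]), fail=np.array([[0.2]]),
    noise_c=np.array([0.9]), n_iters=2, n_samples=4, rng=rng)
```

After the posterior for a data point has been optimized, samples drawn from it would be used to estimate the gradient of the bound with respect to the generative model parameters.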
A.3 Sparse coding
See Algorithm 3 for an update step of the algorithm. We use batch size in the learning algorithm in all experiments.
Appendix B IMG data set properties
In addition to being easy to visualize, the noisy-OR network model of the IMG data set has properties that ensure it is not “too easy” to learn. Specifically, out of the latent variables do not have “anchor” observed variables (i.e. observed variables for which a single failure probability is different from 1). Such anchor observed variables are an ingredient of most known provable algorithms for learning noisy-OR networks. More technically, the model requires a subtraction step in the quartet learning approach of (Jernite et al., 2013).
Appendix C PLNT data set construction
We learn the PLNT model from the UCI plants data set (Lichman et al., 2013), where each data point represents a plant that grows in North America and the 70 binary features indicate in which states and territories of North America it is found. The data set contains 34,781 data points.
To learn the model, we use the learning algorithm described in Section A.1, with latent variables. We remove all learned latent variables with prior probability less than . Furthermore, we transform all failure probabilities greater than into 1. This transformation is necessary to obtain sparse connections between the latent variables and the observed variables; without it, every latent variable is connected to almost every observed variable, which makes learning difficult. The resulting noisy-OR network model has latent variables. Each latent variable has a prior probability between and . By construction, each failure probability different from 1 is between and .
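The post-processing described above can be sketched as follows. The thresholds are left as parameters because their values are not given here, and the values in the usage example are hypothetical. The sketch also assumes that a failure probability of 1 is what removes a connection, since a latent variable with failure probability 1 has no effect on the corresponding observed variable.

```python
import numpy as np


def sparsify_noisy_or(prior, fail, min_prior, max_fail):
    """Drop rarely active latent variables and cut weak connections.

    prior: (n,) prior probabilities; fail: (n, m) failure probabilities.
    Latent variables with prior below min_prior are removed, and failure
    probabilities above max_fail are set to 1 (no connection).
    """
    keep = prior >= min_prior
    new_prior = prior[keep]
    new_fail = fail[keep].copy()
    new_fail[new_fail > max_fail] = 1.0
    return new_prior, new_fail


# Hypothetical learned parameters and thresholds, for illustration only.
prior = np.array([0.005, 0.2, 0.1])
fail = np.array([[0.1, 0.9], [0.3, 0.7], [0.6, 0.2]])
new_prior, new_fail = sparsify_noisy_or(prior, fail, min_prior=0.01, max_fail=0.5)
```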
Figure 4 shows a representation of the latent variables learned in the PLNT data set. As observed, the latent variables correspond to neighboring regions in North America.
Appendix D Detailed experiment configuration
Below we give more details on the configuration of the experiments.
D.1 Initialization
In all noisy-OR network experiments, the noisy-OR network model is initialized by sampling each prior probability , each failure probability , and each complement of noise probability as follows: sample , and then set the parameter to . Note that is roughly between and , so the noisy-OR network is biased toward having large prior probabilities, large failure probabilities, and small noise probabilities. We found that this biased initialization improves results over one centered around .
In the recognition network noisy-OR experiments, we initialize the recognition network to have all weight parameters and bias parameters uniform in .
In the mean-field network noisy-OR experiments, we initialize the mean-field Bernoulli random variables to have parameters uniform in .
In the sparse coding experiments, we initialize the matrix by sampling each entry from a standard Gaussian, and then normalizing the columns to have unit norm.
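The sparse coding initialization can be sketched in a few lines. The function name and the matrix dimensions in the example are hypothetical.

```python
import numpy as np


def init_dictionary(n_features, n_columns, rng):
    """Standard Gaussian entries, then each column rescaled to unit norm."""
    A = rng.standard_normal((n_features, n_columns))
    return A / np.linalg.norm(A, axis=0, keepdims=True)


rng = np.random.default_rng(0)
A = init_dictionary(n_features=50, n_columns=24, rng=rng)
```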
D.2 Hyperparameters
We generally tune hyperparameters within factors of for each experiment configuration. Due to time constraints, for each choice of hyperparameters we only test it on runs with random initializations of the algorithm, and then choose the best-performing hyperparameters for the large-scale experiments.
In the recognition network noisy-OR experiments, we tune the step size for the noisy-OR network model parameters and the step size for the input-dependent signal centering neural network. The step size for the recognition network model parameters is the same as the one for the noisy-OR network model parameters (tuning it independently did not seem to change the results significantly). The variance reduction technique requires a rate for the running estimates; we set this to .
In the mean-field noisy-OR experiments, we only tune the step size for the noisy-OR network model parameters. For the coordinate ascent method used to optimize the mean-field parameters, we use iterations of coordinate ascent (i.e. iterations through all coordinates), and in each iteration we estimate expectations with samples.
In the sparse coding experiments, we tune the step size for the updates to the matrix and the variable.
D.3 Number of epochs
In all experiments, we use a large enough fixed number of epochs such that the log-likelihood / error measurement does not improve at the end of the optimization in all or nearly all the runs. However, to avoid overfitting, we save the model parameters at regular intervals in the optimization process, and report the results from the time step that achieved the best held-out log-likelihood / error measurement (i.e. we perform post-hoc “early stopping”).
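Post-hoc early stopping of this kind amounts to selecting the best saved checkpoint after training has finished. A minimal sketch, assuming each checkpoint is stored as a (step, held-out log-likelihood, parameters) tuple; the tuple layout and the numbers are made up for illustration:

```python
def best_checkpoint(checkpoints):
    """Return the saved entry with the highest held-out log-likelihood."""
    return max(checkpoints, key=lambda entry: entry[1])


# Hypothetical training history: (step, held-out log-likelihood, parameters).
history = [
    (100, -14.2, "params@100"),
    (200, -13.1, "params@200"),
    (300, -13.6, "params@300"),
]
step, heldout_ll, params = best_checkpoint(history)
```

For an error measurement, where lower is better, `min` would be used in place of `max`.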
Appendix E Tables of results
Below we present detailed tables of results for the experiments.
Table 1 shows the noisy-OR network results with recognition network (also included in the main document, Section 4.1).
Table 2 shows the noisy-OR network results with recognition network and larger batch size . (Corresponds to Section 4.4 in the main document.)
Table 3 shows the noisy-OR network results with mean-field variational posterior. (Corresponds to Section 4.5 in the main document.)
Table 4 shows the noisy-OR network results with recognition network on the misspecified data sets (IMGFLIP and IMGUNIF). (Corresponds to Section 4.7 in the main document.)
[Table 1. Columns: LAT VARS | RECOV VARS | FULL RECOV (%) | NEGATIVE LL (NATS). Rows: 8, 16, 32, 64, and 128 latent variables for each of the data sets IMG, PLNT, UNIF, CON8, and CON24. The numeric entries are not preserved in this version.]
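[Table 2. Columns: LAT VARS | RECOV VARS | FULL RECOV (%) | NEGATIVE LL (NATS). Rows: 8, 16, and 32 latent variables for each of the data sets IMG, PLNT, UNIF, CON8, and CON24. The numeric entries are not preserved in this version.]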
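[Table 3. Columns: LAT VARS | RECOV VARS | FULL RECOV (%) | NEGATIVE LL (NATS). Rows: 8, 16, and 32 latent variables for each of the data sets IMG, PLNT, UNIF, CON8, and CON24. The numeric entries are not preserved in this version.]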
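[Table 4. Columns: LAT VARS | RECOV VARS | FULL RECOV (%) | NEGATIVE LL (NATS). Rows: 8, 9, 10, and 16 latent variables for IMGFLIP; 8, 9, 16, and 32 latent variables for IMGUNIF. The numeric entries are not preserved in this version.]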
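[Sparse coding results table. Columns: COLS | RECOV COLS | FULL RECOV (%) | ERROR. Rows: 24 and 48 columns, in two groups. The group labels and numeric entries are not preserved in this version.]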
Appendix F State of the optimization process
Figures 5 and 6 show more steps of the optimization process on the successful run with latent variables mentioned in Section 4.6 in the main document.
Appendix G Recovered latent variables on misspecified data sets
Figure 7 shows successful recoveries for the IMGFLIP and IMGUNIF data sets. (Corresponds to Section 4.7 in the main document.) As observed, some of the extra latent variables are used to model some of the noise due to misspecification.
Appendix H Proof that duplicate latent variables are not equivalent to a single latent variable
We prove that a model with two latent variables that have identical failure probabilities is not necessarily equivalent to a model with one latent variable that has the same failure probabilities (and possibly different priors).
Consider a noisy-OR network model with two latent variables and two observed variables . Let (prior probabilities), (failure probabilities), and (noise probabilities). Then the negative moments are , , and .
Consider now a noisy-OR network model with one latent variable and two observed variables . Let and . Then, to match the first-order negative moments and , we need (prior probability). But then this gives , which does not match . Therefore, there exists no noisy-OR model with one latent variable and identical failure and noise probabilities that is equivalent to the noisy-OR model with two latent variables.
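The argument can be checked numerically with exact rational arithmetic. The parameter values below are hypothetical and are not necessarily those used in the proof above. The sketch assumes the usual noisy-OR form, in which the probability that a set of observed variables are all off factorizes over the latent variables.

```python
from fractions import Fraction


def negative_moment(priors, fail, noise, subset):
    """P(x_j = 0 for every j in subset) when all latent variables share `fail`."""
    moment = Fraction(1)
    joint_fail = Fraction(1)  # product of the failure probabilities over the subset
    for j in subset:
        moment *= 1 - noise[j]
        joint_fail *= fail[j]
    for p in priors:
        # Latent variable off (prob. 1 - p) or on and failing on the whole subset.
        moment *= 1 - p + p * joint_fail
    return moment


half = Fraction(1, 2)
fail = [half, half]
noise = [Fraction(0), Fraction(0)]
two_latents = [half, half]

m1 = negative_moment(two_latents, fail, noise, [0])      # 9/16
m2 = negative_moment(two_latents, fail, noise, [1])      # 9/16
m12 = negative_moment(two_latents, fail, noise, [0, 1])  # 25/64

# Prior of the single latent variable that matches the first-order moments.
p = (1 - m1 / (1 - noise[0])) / (1 - fail[0])            # 7/8
single_m2 = negative_moment([p], fail, noise, [1])       # 9/16
single_m12 = negative_moment([p], fail, noise, [0, 1])   # 11/32
```

With these values the single-latent model reproduces both first-order negative moments but gives 11/32 instead of 25/64 for the joint one, so the two models are not equivalent.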