1 Introduction
Humans have the ability to automatically categorize objects and entities through subconsciously defined notions of similarity, and often in the absence of any supervised signal. For instance, studies have shown that young infants are capable of automatically forming categories based on gender (Younger and Fearing, 1999; Johnston et al., 2001), types of animals (Gopnik and Meltzoff, 1987; Bornstein and Arterberry, 2010), shapes (Smith and Samuelson, 2006), etc. It is generally hypothesized that such discrete categorizations result in efficient encoding of sensory input that reduces the amount of information processing required by the brain (Rakison and Yermolayeva, 2010). Therefore, unsupervised categorization can be seen as a means of learning useful encoding for real world data. This skill is extremely valuable since the majority of data available in the real world is unlabeled.
In this spirit, we introduce a generic parameterization that allows learning representations from unlabeled data by categorizing them. Specifically, our parameterization implicitly maps samples from an observed random variable to a latent discrete space where the distribution gets segmented into a finite number of arbitrary conditional distributions. Imposing different conditions on the latent space through different objective functions will result in learning qualitatively different representations.
We note that our parameterization may be used to compute statistical quantities involving observed variables and latent discrete variables that are in general difficult to compute, thus providing a flexible framework for unsupervised representation learning. To illustrate this aspect, we develop two independent use cases for this parameterization: mutual information maximization (MIM; Linsker, 1988) and disjoint manifold labeling, as described in the abstract. For the MIM task, we experiment with benchmark image datasets and show that the unsupervised representation learned by the network achieves good performance on downstream classification tasks. For the manifold labeling task, we show experiments on 2D datasets and their high-dimensional counterparts designed as per the problem formulation, and show that the proposed objective can optimally label disjoint manifolds. For both objectives we design regularizations necessary to achieve the desired behavior in practice.
The paper is organized as follows. We introduce the parameterization in section 2. We then develop the two applications of the parameterization, viz. mutual information maximization and disjoint manifold labeling, in section 3 and section 4 respectively. Finally, we show experiments in section 5, followed by related work and conclusion. All the proofs can be found in the appendix.
2 Neural Bayes
Consider a data distribution from which we have access to i.i.d. samples . We suppose that this marginal distribution is a union of conditionals where the density is denoted by
and the corresponding probability mass denoted by
. Here is a discrete random variable with states. We now introduce the parameterization that allows us to implicitly factorize any marginal distribution into conditionals as described above. Aside from the technical details, the key idea behind this parameterization is Bayes' rule.
Lemma 1
Let and
be any conditional and marginal distribution defined for continuous random variable
and discrete random variable . If , then there exists a nonparametric function for any given input with the property such that,
(1) 
and this parameterization is consistent.
Thus the function can be seen as a form of soft categorization of input samples. In practice, we use a neural network with sufficient capacity and softmax output to realize this function . We name our parameterization method Neural Bayes and replace with to denote the parameters of the network. By imposing different conditions on the structure of by formulating meaningful objectives, we will get qualitatively different kinds of factorization of the marginal , and therefore the function will encode the posterior for that factorization. In summary, if one formulates any objective that involves the terms , or , where is an observed random variable and is a discrete latent random variable, then they can be substituted with , and respectively.
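As a minimal sketch of the substitution described above, the snippet below treats the softmax outputs of a network as the posterior over the discrete latent variable, estimates the prior as the batch average of those outputs, and obtains the ratio between each conditional density and the marginal density by Bayes' rule. The function names and the toy logits are hypothetical; the logits stand in for the outputs of a real network.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def neural_bayes_quantities(batch_logits):
    """Return (posteriors, priors, ratios) for a batch of per-sample logits.

    posteriors[i][k] plays the role of p(z=k | x_i),
    priors[k] is the batch estimate of p(z=k),
    ratios[i][k] is p(x_i | z=k) / p(x_i), obtained from Bayes' rule.
    """
    posteriors = [softmax(row) for row in batch_logits]
    n = len(posteriors)
    k = len(posteriors[0])
    priors = [sum(p[j] for p in posteriors) / n for j in range(k)]
    ratios = [[p[j] / priors[j] for j in range(k)] for p in posteriors]
    return posteriors, priors, ratios


# Toy logits for 4 samples and 3 latent states (illustrative values only).
logits = [[2.0, 0.0, -1.0], [0.0, 1.5, 0.0], [-1.0, 0.0, 2.5], [0.5, 0.5, 0.5]]
post, prior, ratio = neural_bayes_quantities(logits)
```

By construction the prior estimates sum to one, and each density ratio averages to one over the batch, which is the sample version of a conditional density integrating to one.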
On an important note, the Neural Bayes parameterization requires using the term , through which computing gradients is infeasible in general. A general discussion around this can be found in appendix A. Nonetheless, we show that minibatch gradients can have good fidelity for one of the objectives we propose using our parameterization. In the next two sections, we explore two different ways of factorizing resulting in qualitatively different goals of unsupervised representation learning.
3 Mutual Information Maximization (MIM)
3.1 Theory
Suppose we want to find a discrete latent representation (with states) for the distribution such that the mutual information is maximized (Linsker, 1988). Such an encoding demands that it must be very efficient since it has to capture maximum possible information about the continuous distribution in just discrete states. Assuming we can learn such an encoding, we are interested in computing since it tells us the likelihood of belonging to each discrete state of , thereby performing soft categorization which may be useful for downstream tasks. In the proposition below, we show an objective for computing for a discrete latent representation that maximizes .
Proposition 1
(Neural Bayes-MIM-v1) Let be a nonparametric function for any given input with the property . Consider the following objective,
(2) 
Then , where .
The proof essentially involves expressing MI in terms of , and , which can be substituted using the Neural Bayes parameterization. The objective proposed in the above proposition poses a challenge, however: it contains the term for which computing high-fidelity gradients in a batch setting is problematic (see appendix A). We can overcome this problem for the MIM objective because the gradients through certain terms turn out to be zero, as shown by the following theorem.
Theorem 1
(Gradient Simplification) Denote,
(3) 
(4) 
where indicates that gradients are not computed through the argument. Then .
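The gradient simplification can be checked numerically on a toy problem. The sketch below assumes the objective is the negative mutual information written with softmax posteriors and their batch average, and compares finite-difference gradients of that objective with those of a copy in which the batch average is held fixed (the "gradients not computed" version). The logits table used as parameters and all function names are hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def batch_prior(logits):
    """Batch average of the softmax posteriors, one entry per latent state."""
    post = [softmax(row) for row in logits]
    return [sum(p[j] for p in post) / len(post) for j in range(len(post[0]))]


def objective(logits, fixed_prior=None):
    """Negative MI estimate. If fixed_prior is given it is used as a constant."""
    post = [softmax(row) for row in logits]
    prior = fixed_prior if fixed_prior is not None else batch_prior(logits)
    total = 0.0
    for p in post:
        for j, value in enumerate(p):
            total += value * math.log(value / prior[j])
    return -total / len(post)


def finite_diff_grad(fn, logits, step=1e-5):
    """Central finite differences with respect to every logit."""
    grad = []
    for i in range(len(logits)):
        row = []
        for j in range(len(logits[i])):
            plus = [r[:] for r in logits]
            minus = [r[:] for r in logits]
            plus[i][j] += step
            minus[i][j] -= step
            row.append((fn(plus) - fn(minus)) / (2.0 * step))
        grad.append(row)
    return grad


logits = [[0.3, -1.2, 0.8], [1.1, 0.4, -0.5], [-0.7, 0.9, 0.2], [0.0, 0.1, -0.3]]
stopped = batch_prior(logits)  # value of the prior, treated as a constant below
full_grad = finite_diff_grad(lambda t: objective(t), logits)
stop_grad = finite_diff_grad(lambda t: objective(t, fixed_prior=stopped), logits)
max_gap = max(abs(a - b) for ra, rb in zip(full_grad, stop_grad) for a, b in zip(ra, rb))
```

The two gradients agree up to finite-difference error because the extra term produced by differentiating through the batch average is proportional to the derivative of the sum of the prior entries, which is constant.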
The above theorem implies that as long as we plug in a decent estimate of
in the objective, unbiased gradients can be computed without the need to compute gradients using the entire dataset. Note that the objective can be rewritten as,
(5) 
The second term is the negative entropy of the discrete latent representation which acts as a uniform prior. In other words, this term encourages learning a latent code such that all states of activate uniformly over the marginal input distribution
. This is an attribute of distributed representation which is a fundamental goal in deep learning. We can therefore further encourage this behavior by treating the coefficient of this term as a hyperparameter. In our experiments we confirm both the distributed representation behavior of this term as well as the benefit of using a hyperparameter as our coefficient.
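The split into a conditional-entropy term and a negative-entropy term can be illustrated on a small batch of soft assignments. This is a sketch under the assumption that the mutual information estimate is the batch average of the posterior-weighted log ratio between posterior and prior; the function names are hypothetical.

```python
import math


def batch_prior(posteriors):
    n = len(posteriors)
    return [sum(p[j] for p in posteriors) / n for j in range(len(posteriors[0]))]


def mutual_information(posteriors):
    """Batch estimate of MI between inputs and the discrete latent variable."""
    prior = batch_prior(posteriors)
    total = 0.0
    for p in posteriors:
        for j, value in enumerate(p):
            if value > 0.0:  # 0 * log(0) is taken as 0
                total += value * math.log(value / prior[j])
    return total / len(posteriors)


def entropy_gap(posteriors):
    """Entropy of the prior minus the average entropy of the posteriors."""
    prior = batch_prior(posteriors)
    h_prior = -sum(m * math.log(m) for m in prior if m > 0.0)
    h_cond = 0.0
    for p in posteriors:
        h_cond -= sum(v * math.log(v) for v in p if v > 0.0)
    return h_prior - h_cond / len(posteriors)


confident_balanced = [[1.0, 0.0], [0.0, 1.0]]  # every state used, no ambiguity
uninformative = [[0.3, 0.7], [0.3, 0.7]]       # posterior ignores the input
```

With two states, `mutual_information(confident_balanced)` reaches the maximum of log 2, while `mutual_information(uninformative)` is zero; the two functions agree on any batch, which is the decomposition used in the text.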
3.2 Implementation Details
Alternative Formulation of Uniform Prior: In practice we found that an alternative formulation of the second term in Eq 3.1 results in better performance and more interpretable filters. Specifically, we replace it with the following cross-entropy formulation,
(6) 
While both the second term in Eq 3.1 and are minimized when , the latter formulation provides much stronger gradients during optimization when approaches 1 (see appendix C.1 for details); is undesirable since it discourages distributed representation. Finally, unbiased gradients can be computed through Eq 3.2 as long as a good estimate of is plugged in. Also note that the condition in lemma 1 is met by the Neural Bayes-MIM objective implicitly during optimization, as discussed in the above paragraph in regard to distributed representation.
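The difference in gradient strength can be seen in a small numerical sketch. It assumes the cross-entropy term compares a uniform prior over the latent states with the estimated marginal of each state, treating each state as a binary event; this is one form consistent with the description above and is an assumption, as are the function names.

```python
import math


def neg_entropy(prior):
    """Negative entropy of the estimated marginal over latent states."""
    return sum(m * math.log(m) for m in prior if m > 0.0)


def cross_entropy_prior(prior):
    """Assumed cross-entropy between a uniform target and each marginal entry."""
    k = len(prior)
    return -sum(math.log(m) / k + (k - 1) / k * math.log(1.0 - m) for m in prior)


def d_neg_entropy(m):
    """Derivative of m * log(m) with respect to m."""
    return math.log(m) + 1.0


def d_cross_entropy(m, k):
    """Derivative of one cross-entropy summand with respect to m."""
    return -1.0 / (k * m) + (k - 1) / (k * (1.0 - m))


uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
near_collapse = 0.999  # one state absorbing almost all of the mass
```

Both terms are smaller at `uniform` than at `skewed`, and `d_cross_entropy(0.25, 4)` vanishes at the uniform point. At `near_collapse` the derivative of the negative-entropy term stays close to 1, whereas the cross-entropy derivative is in the hundreds.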
Implementation: The final Neural Bayes-MIM-v2 objective is,
(7) 
where and are hyperparameters, is a smoothness regularization introduced in section 4.2, and is a small scalar used to prevent numerical instability. Qualitatively, we find that the regularization prevents filters from memorizing the input samples. Finally, we apply the first two terms in Eq 3.2 to all hidden layers of a deep network at different scales (computed by spatially average pooling and applying Softmax). These two regularizations gave a significant performance boost. Thorough implementation details are provided in appendix B. For brevity, we refer to our final objective as Neural Bayes-MIM in the rest of the paper.
To compute a good estimate of the gradients, we use the following trick. During optimization, we compute gradients using a sufficiently large minibatch of size MBS (e.g. ) that fits in memory (so that the estimate of is reasonable), accumulate these gradients until BS samples are seen (e.g. 2000), and average them before updating the parameters to further reduce estimation error.
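The accumulation scheme can be sketched independently of any deep learning framework. In the snippet below, `grad_fn` is a hypothetical callable that returns the gradient (a list of floats) computed on one minibatch; the generator yields one averaged gradient for every BS samples seen. Requiring BS to be a multiple of MBS is a simplifying assumption of this sketch.

```python
def accumulated_gradients(samples, grad_fn, mbs, bs):
    """Yield one averaged gradient per BS samples, built from minibatches of size MBS."""
    if bs % mbs != 0:
        raise ValueError("this sketch assumes BS is a multiple of MBS")
    steps = bs // mbs
    for start in range(0, len(samples) - bs + 1, bs):
        total = None
        for s in range(steps):
            chunk = samples[start + s * mbs:start + (s + 1) * mbs]
            grad = grad_fn(chunk)  # gradient estimated on one minibatch
            total = grad if total is None else [a + b for a, b in zip(total, grad)]
        # Average the accumulated minibatch gradients before the parameter update.
        yield [t / steps for t in total]


def toy_grad(chunk):
    """Stand-in gradient: the mean of the minibatch, as a one-dimensional vector."""
    return [sum(chunk) / len(chunk)]


updates = list(accumulated_gradients([float(v) for v in range(8)], toy_grad, mbs=2, bs=4))
```

Each yielded vector would be handed to the optimizer as a single update; trailing samples that do not fill a complete batch of size BS are skipped in this sketch.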
4 Disjoint Manifold Labeling (DML)
4.1 Theory
A distribution is defined over a support. In many cases, the support may be a set of disjoint manifolds. In this task, our goal is to label samples from each disjoint manifold with a distinct value. This formulation can be seen as a generalization of subspace clustering (Ma et al., 2008) where affine manifolds are considered. To make the problem concrete, we first formalize the definition of a disjoint manifold.
Definition 1
(Connected Set) We say that a set is a connected set (disjoint manifold) if for any , there exists a continuous path between and such that all the points on the path also belong to .
To identify such disjoint manifolds in a distribution, we exploit the observation that only partitions that separate one disjoint manifold from others have high divergence between the respective conditional distributions while partitions that cut through a disjoint manifold result in conditional distributions with low divergence between them. Therefore, the objective we propose for this task is to partition the unlabeled data distribution into conditional distributions ’s such that a divergence between them is maximized. By doing so we recover the conditional distributions defined over the disjoint manifolds (we prove its optimality in theorem 2). We begin with two disjoint manifolds and extend this idea to multiple disjoint manifolds in appendix G.
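The observation about partitions can be illustrated on a one-dimensional toy support made of two separated intervals. The sketch assumes a soft binary labeling given by a sigmoid with fixed slope (a stand-in for a continuous labeling function), builds the two induced conditionals by Bayes' rule, and measures the Jensen-Shannon divergence between them. The support, the sigmoid and all names are hypothetical.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def induced_conditionals(masses, labels):
    """Split a discrete distribution into two conditionals using soft labels in [0, 1]."""
    prior_one = sum(l * p for l, p in zip(labels, masses))
    q_one = [l * p / prior_one for l, p in zip(labels, masses)]
    q_zero = [(1.0 - l) * p / (1.0 - prior_one) for l, p in zip(labels, masses)]
    return q_one, q_zero


def js_divergence(q_one, q_zero):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    total = 0.0
    for a, b in zip(q_one, q_zero):
        mid = 0.5 * (a + b)
        if a > 0.0:
            total += 0.5 * a * math.log(a / mid)
        if b > 0.0:
            total += 0.5 * b * math.log(b / mid)
    return total


# Two "manifolds": points on [0, 2] and on [10, 12], with equal mass on every point.
support = [i / 10.0 for i in range(21)] + [10.0 + i / 10.0 for i in range(21)]
masses = [1.0 / len(support)] * len(support)


def js_for_threshold(threshold):
    labels = [sigmoid(x - threshold) for x in support]
    return js_divergence(*induced_conditionals(masses, labels))


js_gap = js_for_threshold(6.0)  # boundary placed between the two intervals
js_cut = js_for_threshold(1.0)  # boundary cutting through the first interval
```

With the boundary in the gap, the divergence is close to its maximum of log 2; with the boundary cutting through an interval, the continuous labeling takes intermediate values there and the divergence drops markedly.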
Let be a symmetric divergence (e.g. Jensen-Shannon divergence, Wasserstein divergence, etc.), and and be the disjoint conditional distributions that we want to learn. Then the aforementioned objective can be written formally as follows:
(8)  
Since our goal is to simply assign labels to data samples corresponding to which manifold they belong instead of learning conditional distributions as achieved by Eq. (8), we would like to learn a function which maps samples from disjoint manifolds to distinct labels. To do so, below we derive an objective equivalent to Eq. (8) that learns such a function .
Proposition 2
(Neural Bayes-DML) Let be a nonparametric function for any given input , and let be the Jensen-Shannon divergence. Define scalars and . Then the objective in Eq. (8) is equivalent to,
(9)  
(10) 
Optimality: We now prove the optimality of the proposed objective towards discovering disjoint manifolds present in the support of a probability density function .
Theorem 2
(optimality) Let be a probability density function over whose support is the union of two nonempty connected sets (definition 1) and that are disjoint, i.e. . Let belong to the class of continuous functions which is learned by solving the objective in Eq. (9). Then the objective in Eq. (9) is maximized if and only if one of the following is true:
The above theorem proves that optimizing the derived objective over the space of functions implicitly partitions the data distribution into maximally separated conditionals by assigning a distinct label to points in each manifold. Most importantly, the theorem shows that the continuity condition on the function plays an important role. Without this condition, the network cannot identify disjoint manifolds.
4.2 Implementation Details
Prior Collapse: The constraint in proposition 2 is a boundary condition required for technical reasons in lemma 1. In practice we do not worry about it because optimization itself avoids situations where . To see the reason behind this, note that except when initialized in a way such that , the log terms are negative by definition. Since the denominators of and are and respectively, the objective is maximized when moves away from 0 and 1. Thus, for any reasonable initialization, optimization itself pushes away from 0 and 1.
Smoothness of : As shown in theorem 2, the proposed objectives can optimally recover disjoint manifolds only when the function
is continuous. In practice we found enforcing the function to be smooth (thus also continuous) helps significantly. Therefore, after experimenting with a handful of heuristics for regularizing
, we found the following finite difference Jacobian regularization to be effective (can be scalar or vector),
(11) 
where is a normalized noise vector computed independently for each sample in a batch of size as,
(12) 
Here is the matrix containing the batch of samples, and each dimension of is sampled i.i.d. from a standard Gaussian. This computation ensures that the perturbation lies in the span of data, which we found to be important. Finally
is the scale of normalized noise added to all samples in a batch. In our experiments, since we always normalize the datasets to have zero mean and unit variance across all dimensions, we sample
.
Implementation: We implement the binary-partition Neural Bayes-DML using the Monte-Carlo sampling approximation of the following objective,
(13) 
where and . Here is a small scalar used to prevent numerical instability, and is a hyperparameter to control the continuity of . The multi-partition case can be implemented in a similar way. Due to the need for computing in the objective, optimizing it using gradient descent methods with small batch sizes is not possible. Therefore we experiment with this method on datasets where gradients can be computed for a very large batch size needed to approximate the gradient through sufficiently well.
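The finite-difference smoothness penalty described in this section can be sketched as follows. The noise direction is a random Gaussian combination of the samples in the batch, normalized to unit length, so that the perturbation stays in the span of the data. Using the squared difference divided by the squared scale is an assumption of this sketch, as are the function names; `fn` stands in for the network.

```python
import math
import random


def span_noise(batch, rng):
    """Unit-norm random direction in the span of the batch samples."""
    coeffs = [rng.gauss(0.0, 1.0) for _ in batch]
    dim = len(batch[0])
    vec = [sum(c * x[d] for c, x in zip(coeffs, batch)) for d in range(dim)]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def smoothness_penalty(fn, batch, scale, rng):
    """Average squared finite-difference quotient of fn along span-restricted noise."""
    total = 0.0
    for x in batch:
        direction = span_noise(batch, rng)  # drawn independently for each sample
        shifted = [xi + scale * di for xi, di in zip(x, direction)]
        diff = [a - b for a, b in zip(fn(x), fn(shifted))]
        total += sum(d * d for d in diff) / (scale * scale)
    return total / len(batch)


rng = random.Random(0)
# Toy batch lying in a 2D subspace of a 3D input space (third coordinate is zero).
batch = [[1.0, 2.0, 0.0], [0.5, -1.0, 0.0], [-2.0, 0.3, 0.0]]
penalty = smoothness_penalty(lambda x: [3.0 * x[0] + 4.0 * x[1]], batch, 0.1, rng)
```

Because every noise direction is a combination of the batch samples, it never leaves the subspace the data occupies; in the toy batch its third coordinate is always zero.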
5 Experiments
5.1 Mutual Information Maximization
Instead of aiming for state-of-the-art results, our goal in this section is to conduct a preliminary (but thorough) set of experiments using Neural Bayes-MIM to understand the behavior of the algorithm and the hyperparameters involved, and to do a fair comparison with popular existing methods for self-supervised learning. Therefore, we use the following simple CNN encoder architecture in our experiments: . (Footnote 1: We use the following shorthand for a) conv layer: C(number of filters, filter size, stride size, padding); b) pooling: P(kernel size, stride, padding, pool mode).) For an input image of size , the output of this encoder has size . The encoder is initialized using orthogonal initialization (Saxe et al., 2013), batch normalization (Ioffe and Szegedy, 2015) is used after each convolution layer, and ReLU nonlinearities are used. All datasets are normalized to have dimension-wise 0 mean and unit variance. Early stopping in all experiments is done using the test set (following previous work). We broadly follow the experimental setup of Hjelm et al. (2019). We do not use any data augmentation in our experiments. After training the encoder, we freeze its features and train a 1 hidden layer (200 units) classifier to get the final test accuracy. Extending the algorithm to more complex architectures (e.g. ResNets (He et al., 2016)), use of multiple data augmentation techniques and other advanced regularizations (e.g. see Bachman et al. (2019)) is left as future work.
5.1.1 Ablation Studies
Behavior of Neural Bayes-MIM-v1 (Eq 3.1) vs Neural Bayes-MIM (v2, Eq 3.2): The experiments and details are discussed in appendix C.2. The main differences are: 1. the majority of the filters learned by the v1 objective are dead, as opposed to the v2 objective which encourages distributed representation; 2. the performance of v2 is better than that of the v1 objective.
Visualization of Filters: We visualize the filters learned by the Neural Bayes-MIM objective on MNIST digits and qualitatively study the effects of the regularizations used. For this we train a deep fully connected network with 3 hidden layers each of width 500 using Adam with learning rate 0.001, batch size 500, 0 weight decay for 50 epochs (other Adam hyperparameters are kept standard). We train three configurations: 1. , ; 2. , ; 3. , . The learned filters are shown in figure 4. We find that the uniform prior regularization () prevents dead filters while the smoothness regularization () prevents input memorization.
Performance due to Regularizations and State Scaling: We now evaluate the effects of the various components involved in the Neural Bayes-MIM objective: coefficients and , and applying the objective at different scales of hidden states. We use the CIFAR10 dataset for these experiments.
In the first experiment, for each value of the number of different scales considered, we vary , and record the final performance, thus capturing the variation in performance due to all these three components. We consider two scaling configurations: 1. no pooling is applied to the hidden layers; 2. for each hidden layer, we spatially average pool the state using a pooling filter with a stride of 2. For the encoder used in our experiments (which has 4 internal hidden layers post ReLU), this gives us 4 and 8 states respectively (including the original unscaled hidden layers) to apply the Neural Bayes-MIM objective. After getting all the states, we apply the Softmax activation to each state along the channel dimension so that the Neural Bayes parameterization holds. Thus for states with height and width, the objective is applied to each spatial location separately and averaged. Also, for states with height (or width) less than the pooling size, we use the height (or width) as pooling size.
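The state-scaling step can be sketched on a single hidden state stored as nested lists with layout channels x height x width. The pooling size of 2 used in the example is an assumption (the text above specifies only the stride), and the function names are hypothetical. The kernel is clipped to the state's height or width when the state is smaller than the pooling size, as described above.

```python
import math


def avg_pool(state, size, stride):
    """Spatial average pooling; the kernel is clipped to the state's height and width."""
    height, width = len(state[0]), len(state[0][0])
    kh, kw = min(size, height), min(size, width)
    pooled = []
    for channel in state:
        rows = []
        for i in range(0, height - kh + 1, stride):
            row = []
            for j in range(0, width - kw + 1, stride):
                window = [channel[i + a][j + b] for a in range(kh) for b in range(kw)]
                row.append(sum(window) / len(window))
            rows.append(row)
        pooled.append(rows)
    return pooled


def channel_softmax(state):
    """Softmax across channels, applied separately at every spatial location."""
    channels, height, width = len(state), len(state[0]), len(state[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for i in range(height):
        for j in range(width):
            column = [state[c][i][j] for c in range(channels)]
            peak = max(column)
            exps = [math.exp(v - peak) for v in column]
            total = sum(exps)
            for c in range(channels):
                out[c][i][j] = exps[c] / total
    return out


# Toy hidden state with 3 channels and a 4 x 4 spatial grid.
state = [[[float(c + i + j) for j in range(4)] for i in range(4)] for c in range(3)]
scaled_states = [channel_softmax(state), channel_softmax(avg_pool(state, 2, 2))]
```

Each entry of `scaled_states` then holds, at every spatial location, a distribution over channels to which the objective would be applied and averaged.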
We train Neural Bayes-MIM on the full training set for 100 epochs using Adam with learning rate 0.001 (other Adam hyperparameters are standard), minibatch size 500, batch size 2000 and 0 weight decay. In the first 32 experiments, and are sampled uniformly from and respectively. In the next 5 experiments, is set to be 0 while is sampled uniformly. In the next 5 experiments, is set to be 0 while is sampled uniformly. Thus in total we run 42 experiments for each number of scales considered.
Once we get a trained , we train a 1 hidden layer (with 200 units) MLP classifier on the frozen features from using the labels in the training set. This training is done for 100 epochs using Adam with learning rate 0.001 (other Adam hyperparameters are standard), batch size 128 and weight decay 0.
As a baseline for these experiments, we use a randomly initialized encoder . Since there are no tunable hyperparameters in this case, we perform a grid search on the classifier hyperparameters. Specifically, we choose weight decay from , batch size from , and learning rate from . This yields a total of 16 configurations. The test accuracy from these runs varied between and . We consider as our baseline.
The performance of encoders under the aforementioned configurations is shown in figure 5. It is clear that both the hyperparameters and especially play an important role in the quality of representations learned. Also, applying Neural Bayes-MIM at different scales of the network states significantly improves the average and best performance.
Effect of Minibatch size (MBS) and Batch size (BS): During implementation, we proposed to compute gradients using a reasonably large minibatch of size MBS and accumulate gradients until BS samples are seen. This is done to overcome the gradient estimation problem due to the term in Neural Bayes-MIM. Here we evaluate the effect of these two hyperparameters on the final test performance. We choose MBS from and BS from . For each combination of MBS and BS, we train the CNN encoder using Neural Bayes-MIM with and (chosen by examining figure 5); the rest of the training settings are kept identical to those used for the figure 5 experiment. Table 1 shows the final test accuracy on CIFAR10 for each combination of hyperparameters MBS and BS. We make two observations: 1. using very small MBS (e.g. 50 and 100) typically results in poor performance (even worse than that of a random encoder ()), while larger MBS significantly improves performance; 2. using a larger BS further improves performance in most cases (even when MBS is small).
Accuracy vs Epochs: Finally, we plot the evolution of accuracy over epochs for all the models learned in the experiments of figure 5. For Neural Bayes-MIM we use the models with scaling (42 in total), and all 16 models for the random encoder. The convergence plot is shown in figure 6.
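Table 1: Final test accuracy on CIFAR10 for each combination of MBS (rows) and BS (columns).
| MBS \ BS | 50 | 250 | 500 | 2000 | 3000 |
| --- | --- | --- | --- | --- | --- |
| 50 | 40.62 | 42.97 | 41.41 | 75 | 78.91 |
| 100 | N/A | 67.97 | 66.41 | 78.12 | 78.91 |
| 250 | N/A | 76.56 | 78.91 | 82.03 | 84.38 |
| 500 | N/A | N/A | 82.03 | 78.91 | 79.69 |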
5.1.2 Final Classification Performance
We compare the final test accuracy of Neural Bayes-MIM with 3 baselines: a random encoder (described in ablation studies), Deep Infomax (Hjelm et al., 2019), and Rotation Prediction based representation learning (Gidaris et al., 2018), on the benchmark image datasets CIFAR10 and CIFAR100 (Krizhevsky, 2009) and STL10 (Coates et al., 2011). Random Network refers to the use of a randomly initialized network. The experimental details for it are identical to those in our ablation, involving a hyperparameter search over 16 configurations done for each dataset separately.
DIM results are reported from Hjelm et al. (2019). We omit the STL10 number for DIM because we resize images to a much smaller size of in our runs instead of as used in DIM.
Rotation prediction refers to the algorithm in Gidaris et al. (2018) where the encoder is learned by training it to predict the rotation of unlabeled images. We use the same CNN architecture used in previous experiments, with a linear classifier added on top, and train it to predict 4 rotation angles: 0, 90, 180 and 270 degrees. We run this pretraining with 8 configurations of hyperparameters: batch size (each batch further includes rotated copies of each sample making the total batch size 100, 200), weight decay and learning rate . For each run, we then train a 1 hidden layer (200 units) classifier on top of the frozen features with learning rate . We report the best performance of all runs. Since Kolesnikov et al. (2019) report that lower layers in CNN architectures trained with rotation prediction have better performance on downstream tasks, we also train classifiers on the layer and report their performance, which is significantly better.
The following describes the experiment details for Neural Bayes-MIM. We use and (chosen roughly by examining figure 5), and MBS=500, BS=4000 in all the experiments. Note these values are not tuned for STL10 and CIFAR100. For CIFAR10 and STL10 each, we run 4 configurations of Neural Bayes-MIM over hyperparameters learning rate and weight decay . For each run, we then train a 1 hidden layer (200 units) classifier on top of the frozen features with learning rate . We report the best performance of all runs. For CIFAR100, we take the encoder that produces the best performance on CIFAR10, and train a classifier with the 2 learning rates and report the best of the 2 runs. Similar to rotation prediction, we also train classifiers on the layer and report their performance.
Table 2 reports the classification performance of all the methods. We note that all experiments were done with the CNN architecture without any data augmentation. Neural Bayes-MIM outperforms baseline methods in general. However, when using layer features, rotation prediction (RP) performs better. We hope to further improve the performance of Neural Bayes-MIM with additional regularizations similar to Bachman et al. (2019).
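Table 2: Classification performance (test accuracy) of all the methods.
| Encoder \ Dataset | CIFAR10 | CIFAR100 | STL10 |
| --- | --- | --- | --- |
| Random Network | 67.97 | 42.97 | 53.91 |
| Rotation Prediction (RP) | 33.59 | 6.25 | 25.78 |
| DIM (Hjelm et al., 2019) | 80.95 | 49.74 | - |
| Neural Bayes-MIM | 82.81 | 55.47 | 64.84 |
| RP ( layer) | 84.38 | 64.06 | 70.31 |
| Neural Bayes-MIM ( layer) | 84.38 | 57.03 | 67.19 |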
5.2 Disjoint Manifold Labeling
Clustering in general is an ill-posed problem. However, in our problem setup, the definition is precise, i.e., our goal is to optimally label all the disjoint manifolds present in the support of a distribution. Since this is a unique goal that is not generally considered in the literature, as empirical verification, we show qualitative results on 2D synthetic datasets in figure 7. The top 2 subfigures have 2 clusters and the bottom 2 have 3 clusters. For all experiments we use a 4 layer MLP with 400 hidden units each, batch normalization, ReLU activation, and last layer Softmax activation. In all cases we train using the Adam optimizer with a learning rate of 0.001, batch size of 400 and no weight decay, until convergence. The regularization coefficient was chosen from as the value that resulted in optimal clustering. For generality in these experiments, these 2D datasets were projected to high dimensions (512) by appending 510 dimensions of 0 entries to each sample and then randomly rotated before performing clustering. The datasets were then projected back to the original 2D space for visualizing predictions. Additional experiments can be found in appendix H.
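The embedding of the 2D datasets into a higher-dimensional space can be sketched as zero padding followed by a random orthogonal transform, with the transpose used to map samples back for visualization. The orthogonal matrix here comes from Gram-Schmidt on Gaussian vectors, which is an assumption about how the random rotation is drawn (it may include a reflection). The example uses 8 dimensions instead of 512 to keep the pure-Python code fast; all names are hypothetical.

```python
import math
import random


def random_orthogonal(dim, rng):
    """Rows form an orthonormal basis, built by Gram-Schmidt on Gaussian vectors."""
    basis = []
    while len(basis) < dim:
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for b in basis:
            dot = sum(x * y for x, y in zip(vec, b))
            vec = [x - dot * y for x, y in zip(vec, b)]
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 1e-8:  # skip (practically impossible) degenerate draws
            basis.append([x / norm for x in vec])
    return basis


def embed(point, matrix):
    """Pad a low-dimensional point with zeros, then apply the orthogonal matrix."""
    dim = len(matrix)
    padded = list(point) + [0.0] * (dim - len(point))
    return [sum(row[j] * padded[j] for j in range(dim)) for row in matrix]


def project_back(vector, matrix, low_dim=2):
    """Apply the transpose (the inverse) and keep the first low_dim coordinates."""
    dim = len(matrix)
    padded = [sum(matrix[i][j] * vector[i] for i in range(dim)) for j in range(dim)]
    return padded[:low_dim]


rng = random.Random(0)
rotation = random_orthogonal(8, rng)
points_2d = [[0.5, -1.0], [2.0, 3.0], [-1.5, 0.25]]
points_high = [embed(p, rotation) for p in points_2d]
recovered = [project_back(h, rotation) for h in points_high]
```

Because the transform is orthogonal, distances between samples are preserved in the high-dimensional space, and `recovered` matches `points_2d` up to floating-point error.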
6 Related Work
Neural Bayes-MIM maximizes mutual information for learning useful representations in a self-supervised way. Since the idea was introduced in Linsker (1988) and Bell and Sejnowski (1995), a myriad of self-supervised methods involving MIM have been proposed. As discussed in Vincent et al. (2010), autoencoder based methods achieve this goal implicitly by minimizing the reconstruction error of the input samples under an isotropic Gaussian assumption. Deep infomax (DIM, Hjelm et al. (2019)) instead uses MINE (Belghazi et al., 2018) to estimate MI and maximize it while applying it to both local and global features and imposing priors on the learned representation. Hjelm et al. (2019) have also shown that representations learned by DIM perform better than those learned by autoencoder based methods such as VAE (Kingma and Welling, 2013), β-VAE (Higgins et al., 2017) and adversarial autoencoder (Makhzani et al., 2015), among others such as noise as targets (Bojanowski and Joulin, 2017) and BiGAN (Donahue et al., 2016). Contrastive Predictive Coding (Oord et al., 2018) also maximizes MI by predicting lower layer representations from higher layers using a contrastive loss instead of reconstruction loss.
Unlike the aforementioned methods that learn continuous latent representations, Neural Bayes-MIM implicitly learns discrete latent representations. We note that the estimation of mutual information due to the Neural Bayes parameterization in the Neural Bayes-MIM-v1 objective (Eq 3.1) turns out to be identical to the one proposed in IMSAT (Hu et al., 2017). However, there are important differences: 1. we provide theoretical justifications for the parameterization used (lemma 1) and show in theorem 1 why it is feasible to compute high fidelity gradients using this objective in the minibatch setting even though it contains the term . On the other hand, the justification used in IMSAT is that optimizing using minibatches is equivalent to optimizing an upper bound of the original objective; 2. while the MI part of IMSAT was introduced in the context of clustering, we improve the MI formulation (Eq 3.2), and introduce regularization terms and state scaling which are important for learning useful representations using the Neural Bayes-MIM objective that perform well on downstream classification tasks; 3. we perform extensive ablation studies exposing the role of the introduced regularizations; 4. the goal of our paper is broader, i.e., to introduce the Neural Bayes parameterization that can be used for formulating new objectives. From the aspect of learning discrete latent representations, Neural Bayes-MIM has similarities with VQ-VAE (Oord et al., 2017). However, similar to other autoencoder based methods, VQ-VAE imposes the isotropy assumption in the reconstruction loss.
In many self-supervised methods, the idea is to learn useful representations by predicting non-trivial information about the input. Examples of such methods are Rotation Prediction (Gidaris et al., 2018), Exemplar (Dosovitskiy et al., 2014), Jigsaw (Noroozi and Favaro, 2016) and Relative Patch Location (Doersch et al., 2015). Kolesnikov et al. (2019) have extensively compared these methods and found that Rotation Prediction (RP) in general outperforms or performs at par with the latter methods. For the aforementioned reasons, we compared Neural Bayes-MIM with RP and DIM.
Numerous recent papers have proposed clustering algorithms for unsupervised representation learning such as Deep Clustering (Caron et al., 2018), information based clustering (Ji et al., 2019), (Shaham et al., 2018), Associative Deep Clustering (Haeusser et al., 2018) etc. Our goal in regard to clustering in Neural Bayes-DML is in general different from such methods. Our objective is aimed at labeling disjoint manifolds in a distribution. Thus it can be seen as a generalization of the traditional subspace clustering methods (Ma et al., 2008; Liu et al., 2010) from affine subspaces to arbitrary manifolds.
7 Conclusion
We proposed a parameterization method that can be used to express an arbitrary set of distributions , and in closed form using a neural network with sufficient capacity, which can in turn be used to formulate new objective functions. We formulated two different objectives that use this parameterization which were aimed towards different goals of self-supervised learning: learning deep network features using the infomax principle, and identification of disjoint manifolds in the support of continuous distributions. We presented theoretical and empirical analysis of both the objectives while especially focusing on the former since it has broader applications.
Acknowledgments
I (DA) was supported by IVADO during my time at MILA and am currently supported by Salesforce. There are many people who have directly or indirectly contributed to this work and we would like to thank them. During the early phase of research on Neural Bayes-DML (in the context of which the Neural Bayes parameterization was developed), Chen Xing pointed out an intuition which led me to simplify its optimization procedure. We thank Ali Madani and Ehsan Hosseini-Asl for exploring Neural Bayes-DML for unsupervised representation learning for images. We thank Min Lin for taking interest in the connection between Neural Bayes-DML and mutual information, which led me to the idea that mutual information can be computed using the parameterization. We thank Aadyot Bhatnagar and Weiran Wang for proof-checking the paper and providing helpful feedback. We thank Devon Hjelm and Alex Fedorov for discussing their algorithm Deep Infomax in great detail. Finally, we thank Aaron Courville, Sharan Vaswani, Nikhil Naik, Isabela Albuquerque, Lav Varshney, Yu Bai, Jonathan Binas, David Krueger, Tegan Maharaj and Govardana Sachithanandam Ramachandran for helpful discussions.
References
Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pp. 15509–15519.
Mine: mutual information neural estimation. arXiv preprint arXiv:1801.04062.
An information-maximization approach to blind separation and blind deconvolution. Neural Computation 7 (6), pp. 1129–1159.
Unsupervised learning by predicting noise. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 517–526.
The development of object categorization in young children: hierarchical inclusiveness, age, perceptual attribute, and group versus individual analyses. Developmental Psychology 46 (2), pp. 350.
Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 132–149.
An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 215–223.
Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1422–1430.
Adversarial feature learning. arXiv preprint arXiv:1605.09782.
Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 766–774.
Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728.
The development of categorization in the second year and its relation to other cognitive and linguistic developments. Child Development, pp. 1523–1531.
Associative deep clustering: training a classification network with no labels. In German Conference on Pattern Recognition, pp. 18–32.
Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
beta-VAE: learning basic visual concepts with a constrained variational framework. International Conference on Learning Representations (ICLR).
Learning deep representations by mutual information estimation and maximization. International Conference on Learning Representations (ICLR).
Learning discrete representations via information maximizing self-augmented training. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1558–1567.
Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9865–9874.
Developmental changes in infants’ and toddlers’ attention to gender categories. Merrill-Palmer Quarterly (1982), pp. 563–584.
Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Revisiting self-supervised visual representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1920–1929.
Learning multiple layers of features from tiny images. Technical report.
Self-organization in a perceptual network. Computer 21 (3), pp. 105–117.
Robust subspace segmentation by low-rank representation. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 663–670.
Estimation of subspace arrangements with applications in modeling and segmenting mixed data. SIAM Review 50 (3), pp. 413–458.
Adversarial autoencoders. arXiv preprint arXiv:1511.05644.
Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84.
Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
Neural discrete representation learning. arXiv preprint arXiv:1711.00937.
Infant categorization. Wiley Interdisciplinary Reviews: Cognitive Science 1 (6), pp. 894–905.
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.
SpectralNet: spectral clustering using deep neural networks. arXiv preprint arXiv:1801.01587.
An attentional learning account of the shape bias: reply to Cimpian and Markman (2005) and Booth, Waxman, and Huang (2005). Developmental Psychology 42 (6), pp. 1339–1343.
Stacked denoising autoencoders: learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11 (Dec), pp. 3371–3408.
Parsing items into separate categories: developmental change in infant categorization. Child Development 70 (2), pp. 291–303.
Appendix
Appendix A Gradient Computation Problem for the term
The Neural Bayes parameterization contains the term . Computing an unbiased gradient through this term is in general difficult without very large batch sizes, even though the quantity itself may be well estimated using very few samples. For instance, consider the scalar function . Consider the scenario when . The quantity can be estimated very accurately using even one example. Further, , hence . However, when using a finite number of samples, the approximation of can have very high variance because the gradient terms from individual samples fail to cancel properly.
In the case of Neural Bayes-MIM we found that gradients through terms involving were 0. This allows us to estimate gradients for this objective reliably in the mini-batch setting. In general, however, this may be challenging, and solving objectives that use the Neural Bayes parameterization may require a customized workaround for each objective.
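The effect described above can be illustrated with a minimal sketch. The toy function here, psi(w) = E_x[sigmoid(w * x)] with x drawn from a standard Gaussian, is our own assumption chosen for illustration and is not necessarily the function used in this appendix. At w = 0 every sample yields the value 0.5 exactly, whereas the per-sample gradient terms 0.25 * x only cancel on average, so small batches give noisy gradient estimates of a true gradient that is 0.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def value_estimate(w, xs):
    """Monte Carlo estimate of psi(w) = E_x[sigmoid(w * x)]."""
    return sum(sigmoid(w * x) for x in xs) / len(xs)

def grad_estimate(w, xs):
    """Monte Carlo estimate of d psi / d w = E_x[sigmoid'(w * x) * x]."""
    total = 0.0
    for x in xs:
        s = sigmoid(w * x)
        total += s * (1.0 - s) * x
    return total / len(xs)

def gradient_error(w, batch_size, trials, rng):
    """Average |gradient estimate| over many batches.

    At w = 0 with zero-mean x the true gradient is 0, so this is the
    average absolute error of the mini-batch estimate.
    """
    errors = []
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(batch_size)]
        errors.append(abs(grad_estimate(w, xs)))
    return sum(errors) / len(errors)

rng = random.Random(0)
# The value is recovered exactly from a single sample ...
one_sample_value = value_estimate(0.0, [rng.gauss(0.0, 1.0)])
# ... but the gradient estimate is only reliable for large batches.
err_small = gradient_error(0.0, batch_size=1, trials=1000, rng=rng)
err_large = gradient_error(0.0, batch_size=1000, trials=100, rng=rng)
print(one_sample_value, err_small, err_large)
```

In this toy setting the error of the gradient estimate shrinks only at the usual Monte Carlo rate in the batch size, while the value estimate is already exact with one sample.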
Appendix B Implementation Details of the Neural Bayes-MIM Objective
We apply the Neural Bayes-MIM objective (Eq 3.2) to all the hidden layers at different scales (using average pooling). We now discuss its implementation details. Consider the CNN architecture used in our experiments: . Denote by () the 4 hidden-layer ReLU outputs after the 4 convolution layers. For an input of size , all these hidden states have height and width dimensions in addition to the channel dimension. For a mini-batch , these hidden states are therefore 4-dimensional tensors. Let these 4 dimensions for the state be denoted by , where the dimensions denote batch size, number of channels, height and width. Denote to be the Softmax function applied along the channel dimension, and to be . Further, denote () as the scaled version of the original states computed by average pooling, and define numbers accordingly. Then the total Neural Bayes-MIM objective for this architecture is given by,
(14) 
where,
(15) 
and,
(16) 
where is a normalized noise vector computed independently for each sample in the batch as,
(17) 
Here is the matrix containing the batch of samples, and each dimension of is sampled i.i.d. from a standard Gaussian. This computation ensures that the perturbation lies in the span of the data. Finally, is the scale of the normalized noise added to all samples in a batch. In our experiments, since we always normalize the datasets to have zero mean and unit variance across all dimensions, we sample . Note that for the architecture used, results in an output with height and width equal to 1, hence the output is effectively a 2D matrix of size . Finally, the gradient from this mini-batch is accumulated and averaged over multiple batches before updating the parameters, for a more accurate estimate of gradients.
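The noise construction described above can be sketched as follows. This is a minimal illustration under our own assumptions, not the exact form of Eq 17: each noise vector is taken to be a random Gaussian combination of the samples in the batch, rescaled to unit Euclidean norm. The names `span_noise` and `perturb` are hypothetical helpers introduced only for this example.

```python
import math
import random

def span_noise(batch, rng):
    """Return one unit-norm noise vector per sample.

    Each vector is a Gaussian-weighted combination of the rows of
    `batch`, so it lies in the span of the data by construction.
    (Assumed form; the exact expression is given by Eq 17.)
    """
    n, d = len(batch), len(batch[0])
    noises = []
    for _ in range(n):
        # Fresh i.i.d. standard Gaussian coefficients for every sample.
        coeffs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        v = [sum(coeffs[i] * batch[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        noises.append([c / norm for c in v])
    return noises

def perturb(batch, scale, rng):
    """Add the normalized noise, multiplied by `scale`, to every sample."""
    noises = span_noise(batch, rng)
    return [[x + scale * e for x, e in zip(row, eps)]
            for row, eps in zip(batch, noises)]

rng = random.Random(0)
# Two samples in R^3 that both lie in the plane where the last coordinate is 0.
batch = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
noise = span_noise(batch, rng)
noisy = perturb(batch, 0.1, rng)
print(noise)
```

Because the toy batch has no component along the third axis, neither does the noise, which is the sense in which the perturbation stays in the span of the data.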
Appendix C Additional Analysis of Neural Bayes-MIM
C.1 Gradient Strength of Uniform Prior in Neural Bayes-MIM-v1 (Eq 3.1) vs Neural Bayes-MIM-v2 (Eq 3.2)
As discussed in the main text, the term,
(18) 
acts as a uniform prior encouraging the representations to be distributed. However, gradients are much stronger when approaches 1 for the alternative cross-entropy formulation,
(19) 
To see this, note that the gradient of is given by,
(20)  
(21)  
(22) 
where the last equality holds due to the linearity of expectation and because by design. On the other hand, the gradient of is given by,
(23) 
When the representation being learned is such that the marginal peaks along a single state , i.e., (making the representation degenerate), the gradient for the term for v1 is given by,
(24) 
while that for v2 is given by,
(25) 
whose magnitude approaches infinity as . Thus is beneficial in terms of gradient strength.
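The contrast in gradient strength can be checked numerically with a short sketch. The functional forms below are our assumptions about the two prior terms, since the equations are not reproduced here: for v1 a negative-entropy term q log q per state, and for v2 a cross-entropy term -(1/K) log q - ((K - 1)/K) log(1 - q), where q stands for the marginal probability of the peaked state and K for the number of states. The functions return the scalar coefficient that multiplies the gradient of q with respect to the parameters.

```python
import math

def coef_v1(q):
    """Coefficient for the assumed v1 term q * log(q).

    The product rule gives log(q) + 1, but the constant 1 is dropped
    because it cancels when summed over states (the outputs sum to one).
    """
    return math.log(q)

def coef_v2(q, num_states):
    """Coefficient for the assumed v2 term
    -(1/K) * log(q) - ((K - 1)/K) * log(1 - q)."""
    k = float(num_states)
    return -1.0 / (k * q) + (k - 1.0) / (k * (1.0 - q))

# As the marginal of one state approaches 1, the v1 coefficient vanishes
# while the v2 coefficient grows without bound.
for q in (0.9, 0.99, 0.999):
    print(q, abs(coef_v1(q)), abs(coef_v2(q, num_states=10)))
```

Under these assumed forms, the v1 coefficient tends to 0 in the degenerate regime, so the prior exerts little corrective force exactly when it is most needed.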
C.2 Empirical Comparison between Neural Bayes-MIM-v1 (Eq 3.1) and Neural Bayes-MIM-v2 (Eq 3.2)
To empirically understand the difference in behavior between the Neural Bayes-MIM objectives v1 and v2, we first plot the filters learned by the v1 objective and compare them with those learned by the v2 objective. The filters learned by the v1 objective are shown in figure 8 using the configuration , . It can be seen that most filters are dead. We tried other configurations as well, without any change in the outcome. Since the v1 and v2 objectives differ only in the formulation of the uniform prior regularization, as explained in the previous section, we believe that v1 leads to dead filters because of weak gradients from its regularization term.
In the second set of experiments, we train many models using the Neural Bayes-MIM-v1 and Neural Bayes-MIM-v2 objectives separately, with different hyper-parameter configurations similar to the setting of figure 5. The performance scatter plot is shown in figure 13. We find that Neural Bayes-MIM-v2 has better average and best performance than Neural Bayes-MIM-v1.
Appendix D Proof of Lemma 1
Lemma 1
Let and be any conditional and marginal distribution defined for continuous random variable and discrete random variable . If , then there exists a non-parametric function for any given input with the property such that,
(26) 
and this parameterization is consistent.
Proof: First we prove existence. Notice that there exists a non-parametric function . Denote . Then,
(27) 
and,
(28) 
Thus works. To verify that this parameterization is consistent, note that for any ,
(29) 
where we use the condition . Secondly, we note that,
(30)  
(31)  
(32) 
where the last equality is due to the conditions . Thirdly,
(33)  
(34)  
Finally, we have from Bayes’ rule:
(35)  
(36)  
(37) 
where the second equality holds because of the existence and consistency proofs of and shown above.
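The consistency conditions verified in the proof can also be checked numerically on a finite analogue. The sketch below assumes the parameterization takes the Bayes-rule form in which the function output plays the role of the posterior over states, its expectation gives the marginal over states, and the conditional over inputs is the output times the input density divided by that expectation. The lemma itself concerns a continuous variable; a three-point support is used here purely for illustration, and all names are hypothetical.

```python
def neural_bayes_tables(p_x, L):
    """Build p(z) and p(x | z) from p(x) and the function outputs.

    p_x[i]  : probability of the i-th input value.
    L[i][k] : output for input i and state k; each row sums to one.
    """
    num_inputs, num_states = len(p_x), len(L[0])
    # Marginal over states: expectation of the outputs under p(x).
    p_z = [sum(p_x[i] * L[i][k] for i in range(num_inputs))
           for k in range(num_states)]
    # Conditional over inputs for each state.
    p_x_given_z = [[L[i][k] * p_x[i] / p_z[k] for i in range(num_inputs)]
                   for k in range(num_states)]
    return p_z, p_x_given_z

def posterior(p_z, p_x_given_z, i):
    """Bayes' rule: p(z = k | x_i) for every state k."""
    joint = [p_x_given_z[k][i] * p_z[k] for k in range(len(p_z))]
    total = sum(joint)
    return [j / total for j in joint]

p_x = [0.5, 0.3, 0.2]
L = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
p_z, p_x_given_z = neural_bayes_tables(p_x, L)

print(p_z)                                # marginal over states
print([sum(row) for row in p_x_given_z])  # each conditional sums to one
print(posterior(p_z, p_x_given_z, 0))     # recovers the first row of L
```

The marginal over states sums to one, each conditional is normalized, mixing the conditionals with the marginal recovers p(x), and Bayes' rule returns the original outputs, mirroring the four steps of the proof.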
Appendix E Proofs for Neural Bayes-MIM
Proposition 1
(Neural Bayes-MIM-v1) (Proposition 1 in main text) Let be a non-parametric function for any given input with the property . Consider the following objective,
(38) 
Then , where .
Theorem 1
(Theorem 1 in main text) Denote,
(43) 
(44) 
where denotes gradients are not computed through the argument. Then .
Proof: We note that,