1 Introduction
Humans are able to learn from both labeled and unlabeled data. Young infants can acquire knowledge about the world and distinguish objects of different classes with only a few provided "labels". Mathematically, this poverty of input implies that the data distribution contains information useful for inferring the category posterior. The ability to extract this hidden knowledge from the data in order to leverage both labeled and unlabeled examples for inference and learning, i.e. semi-supervised learning, has been a long-sought-after objective in computer vision, machine learning, and computational neuroscience.
In the last few years, Deep Convolutional Networks (DCNs) have emerged as powerful supervised learning models that achieve near-human or super-human performance on various visual inference tasks, such as object recognition and image segmentation. However, DCNs still lag far behind humans in semi-supervised learning tasks, in which only a few labels are available. The main difficulty is that, until recently, there has been no mathematical framework for deep learning architectures. As a result, it is not clear how DCNs encode the data distribution, which makes combining supervised and unsupervised learning challenging.
Recently, the Deep Rendering Mixture Model (DRMM) [13, 14] has been developed as a probabilistic graphical model underlying DCNs. The DRMM is a hierarchical generative model in which the image is rendered via multiple levels of abstraction. It has been shown that bottom-up inference in the DRMM corresponds to feedforward propagation in a DCN. The DRMM thus enables semi-supervised learning with DCNs. Some preliminary semi-supervised learning results for the DRMM are provided in [14]. Those results are promising, but more work is needed to evaluate the algorithms across many tasks.
In this paper, we systematically develop a semi-supervised learning algorithm for the Non-Negative DRMM (NN-DRMM), a DRMM in which the intermediate rendered templates are non-negative. Our algorithm contains a bottom-up inference pass to infer the nuisance variables in the data and a top-down pass that performs reconstruction. We also employ variational inference and the non-negative nature of the NN-DRMM to derive two new penalty terms for the training objective function. An overview of our algorithm is given in Figure 1. We validate our methods by showing state-of-the-art semi-supervised learning results on MNIST and SVHN, as well as results comparable to other state-of-the-art methods on CIFAR10. Finally, we analyze the trained model using a synthetically rendered dataset, which mimics CIFAR10 but has ground-truth labels for nuisance variables, including the orientation and location of the object in the image. We show how the trained NN-DRMM encodes nuisance variations across its layers and compare against traditional DCNs.
2 Related Work
We focus our review on semi-supervised methods that employ neural network structures and divide them into different types.
Autoencoder-based Architectures: Many early works in semi-supervised learning for neural networks are built upon autoencoders [1]. In autoencoders, the images are first projected onto a low-dimensional manifold via an encoding neural network and then reconstructed using a decoding neural network. The model is learned by minimizing the reconstruction error. This method is able to learn from unlabeled data and can be combined with traditional neural networks to perform semi-supervised learning. In this line of work are the Contractive Autoencoder [16], the Manifold Tangent Classifier, the Pseudo-label Denoising Auto-Encoder [9], the Winner-Take-All Autoencoders [11], and the Stacked What-Where Autoencoder [20]. These architectures perform well when there are enough labels in the dataset but fail when the number of labels is reduced, since the data distribution is not taken into account. Recently, the Ladder Network [15] was developed to overcome this shortcoming. The Ladder Network approximates a deep factor analyzer in which each layer in the model is a factor analyzer. Deep neural networks are then used to perform approximate bottom-up and top-down inference.
Deep Generative Models:
Another line of work in semi-supervised learning is to use neural networks to estimate the parameters of a probabilistic graphical model. This approach is applied when the inference in the graphical model is hard to derive or when exact inference is computationally intractable. The Deep Generative Model family is in this line of work [7, 10]. Both Ladder Networks and Deep Generative Models yield good semi-supervised learning performance on benchmarks, and they are complementary to our semi-supervised learning on the DRMM. However, our method differs from these approaches in that the DRMM is the graphical model underlying DCNs, and we theoretically derive our semi-supervised learning algorithm as proper probabilistic inference against this graphical model.
Generative Adversarial Networks (GANs): In the last few years, a variety of GANs have achieved promising results in semi-supervised learning on different benchmarks, including MNIST, SVHN, and CIFAR10. These models also generate good-looking images of natural objects. In GANs, two neural networks play a minimax game: one generates images, and the other classifies images. The objective function is the game's Nash equilibrium, which is different from standard objective functions in probabilistic modeling. It would be both exciting and promising to extend the DRMM objective to a minimax game as in GANs, but we leave this for future work.
3 Deep Rendering Mixture Model
The Deep Rendering Mixture Model (DRMM) is a recently developed probabilistic generative model whose bottom-up inference, under a non-negativity assumption, is equivalent to feedforward propagation in a DCN [13, 14]. It has been shown that inference in the DRMM is efficient due to the hierarchical structure of the model. In particular, the latent variations in the data are captured across multiple levels in the DRMM. This factorized structure results in an exponential reduction of the free parameters in the model and enables efficient learning and inference. The DRMM can potentially be used for semi-supervised learning tasks [13].
Definition 1 (Deep Rendering Mixture Model).
The Deep Rendering Mixture Model (DRMM)
is a deep Gaussian Mixture Model (GMM) with special constraints on the latent variables. Generation in the DRMM takes the form:
(1)  c^(L) ∼ Cat(π_{c^(L)}),  g^(ℓ) ∼ Cat(π_{g^(ℓ)})  ∀ℓ ∈ {1, …, L}
(2)  μ_cg = Λ_g μ_{c^(L)} ≡ Λ^(1)_{g^(1)} Λ^(2)_{g^(2)} ⋯ Λ^(L)_{g^(L)} μ_{c^(L)}
(3)  I ∼ N(μ_cg, σ² 1)
where ℓ ∈ {1, …, L} is the layer, c^(L) is the object category, g^(ℓ) are the latent (nuisance) variables at layer ℓ, and Λ^(ℓ)_{g^(ℓ)} are parameter dictionaries that contain templates at layer ℓ. Here the image I is generated by adding isotropic Gaussian noise to a multiscale "rendered" template μ_cg.
In the DRMM, the rendering path is defined as the sequence (c^(L), g^(L), …, g^(1)) from the root (overall class) down to the individual pixels at ℓ = 0.
The variable μ_{c^(L)} is the template used to render the image, and the product Λ_g = Λ^(1)_{g^(1)} ⋯ Λ^(L)_{g^(L)} represents the sequence of local nuisance transformations that partially render finer-scale details as we move from abstract to concrete. Note that the factorized structure of the DRMM results in an exponential reduction in the number of free parameters. This enables efficient inference and learning, and better generalization.
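As a toy illustration, the generation process above can be sketched in a few lines of NumPy. The dimensions, the number of nuisance values per layer, and the random dictionaries below are hypothetical stand-ins for the learned parameters Λ^(ℓ) and the class templates μ_c; this is a sketch of the sampling procedure, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): 3 classes, 2 nuisance values per layer, L = 2 layers.
C, D0, D1, D2 = 3, 16, 8, 4            # D0 = pixels, D2 = coarsest template size
mu_c = rng.standard_normal((C, D2))    # class templates mu_{c^(L)}
# One template dictionary per layer: Lambda^(l), indexed by the nuisance g^(l).
Lambda1 = rng.standard_normal((2, D0, D1))   # layer 1 maps D1 -> D0
Lambda2 = rng.standard_normal((2, D1, D2))   # layer 2 maps D2 -> D1

def render(c, g1, g2, sigma=0.1):
    """Sample an image I ~ N(Lambda1_{g1} Lambda2_{g2} mu_c, sigma^2 I)."""
    mu_cg = Lambda1[g1] @ Lambda2[g2] @ mu_c[c]
    return mu_cg + sigma * rng.standard_normal(D0)

# Generate: sample the category and the per-layer nuisances, then render.
c = rng.integers(C)
g1, g2 = rng.integers(2), rng.integers(2)
I = render(c, g1, g2)
print(I.shape)  # (16,)
```

Setting sigma to zero recovers the noiseless rendered template μ_cg, which is the quantity the hard EM algorithm works with.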
A useful variant of the DRMM is the Non-Negative Deep Rendering Mixture Model (NN-DRMM), in which the intermediate rendered templates are constrained to be non-negative. The NN-DRMM can be written as
(4)  z^(ℓ) ≡ Λ^(ℓ+1)_{g^(ℓ+1)} ⋯ Λ^(L)_{g^(L)} μ_{c^(L)} ≥ 0,  ∀ℓ ∈ {1, …, L}.
It has been proven that inference in the NN-DRMM via a dynamic programming algorithm leads to the feedforward propagation in a DCN. This paper develops a semi-supervised learning algorithm for the NN-DRMM. For brevity, throughout the rest of the paper we will drop the NN prefix.
Sum-Over-Paths Formulation of the DRMM: The DRMM can be reformulated by expanding out the matrix multiplications in the generation process into scalar products. Each pixel intensity is then the sum, over all active paths leading to that pixel, of the product of the weights along that path. The sparsity of the switching variables controls the fraction of active paths. Figure 2 depicts the sum-over-paths formulation graphically.
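The sum-over-paths view can be checked numerically on a minimal example, using two hypothetical weight matrices in place of the rendering dictionaries: the matrix-product form of each pixel equals the sum, over all paths from that pixel to the template entries, of the product of weights along the path:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((3, 2))
mu = rng.standard_normal(2)

# Matrix form: I = W1 @ W2 @ mu
I = W1 @ W2 @ mu

# Sum-over-paths form: pixel i is the sum over all paths (i, j, k) of the
# product of the weights along the path, times the template entry at the end.
I_paths = np.zeros(4)
for i in range(4):
    for j, k in product(range(3), range(2)):
        I_paths[i] += W1[i, j] * W2[j, k] * mu[k]

print(np.allclose(I, I_paths))  # True
```

Zeroing some weights (sparse switching variables) removes the corresponding paths from the sum, which is the sense in which sparsity controls the fraction of active paths.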
4 DRMMbased SemiSupervised Learning
4.1 Learning Algorithm
Our semi-supervised learning algorithm for the DRMM is analogous to the hard Expectation-Maximization (EM) algorithm for GMMs [2, 13, 14]. In the E-step, we perform a bottom-up inference to estimate the most likely joint configuration (ĉ, ĝ) of the latent variables and the object category given the input image. This bottom-up pass is then followed by a top-down inference, which uses (ĉ, ĝ) to reconstruct the image Î and compute the reconstruction error. It is known that when applying a hard EM algorithm to GMMs, the reconstruction error averaged over the dataset is proportional to the expected complete-data log-likelihood. For labeled images, we also compute the cross-entropy between the predicted object classes and the given labels, as in regular supervised learning tasks. In order to further improve the performance, we introduce into the training objective function a Kullback-Leibler divergence penalty on the predicted object class and a non-negativity penalty on the intermediate rendered templates at each layer. The motivation and derivation for these two terms are discussed in Sections 4.2 and 4.3 below. The final objective function for semi-supervised learning in the DRMM is given by L = α_CE L_CE + α_RC L_RC + α_KL L_KL + α_NN L_NN, where α_CE, α_RC, α_KL, and α_NN are the weights for the cross-entropy loss L_CE, reconstruction loss L_RC, variational inference loss L_KL, and the non-negativity penalty loss L_NN, respectively. The losses are defined as follows:
(5)  L_CE = − Σ_{I ∈ D_L} log q(c(I) | I)
(6)  L_RC = Σ_I ‖I − Î(ĉ, ĝ)‖²_2
(7)  L_KL = Σ_I KL( q(c | I) ‖ p(c) )
(8)  L_NN = Σ_ℓ ‖max(0, −ẑ^(ℓ))‖²_2
Here, q(c | I) is an approximation of the true posterior p(c | I). In the context of the DRMM and the DCN, q(c | I) is given by the SoftMax activations, p(c) is the class prior, D_L is the set of labeled images, and c(I) is the given label of image I. The max(·, 0) operator is applied elementwise and is equivalent to the ReLU activation function used in DCNs.
During training, instead of a closed-form M-step as in the EM algorithm for GMMs, we use gradient-based optimization methods such as stochastic gradient descent to optimize the objective function.
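The combined objective can be sketched as follows. This is a simplified NumPy version under assumed loss forms (cross-entropy on labeled images only, squared reconstruction error, KL to the class prior, and a squared hinge on negative intermediate template values); the weights, default values, and function names are illustrative, not the paper's implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def semisup_loss(logits, labels, I, I_hat, z_list, prior,
                 a_ce=1.0, a_rc=1.0, a_kl=0.1, a_nn=0.1):
    """Weighted sum of the four loss terms; `labels` holds -1 for unlabeled images."""
    q = softmax(logits)                                   # q(c | I): SoftMax output
    labeled = labels >= 0
    ce = -np.mean(np.log(q[labeled, labels[labeled]]))    # cross-entropy, labeled only
    rc = np.mean(np.sum((I - I_hat) ** 2, axis=1))        # reconstruction error
    kl = np.mean(np.sum(q * np.log(q / prior), axis=1))   # KL(q(c|I) || p(c))
    nn = sum(np.mean(np.maximum(0.0, -z) ** 2) for z in z_list)  # non-negativity
    return a_ce * ce + a_rc * rc + a_kl * kl + a_nn * nn

# Sanity check: uniform q, matching prior, perfect reconstruction, non-negative z.
logits = np.zeros((2, 3))            # -> q is uniform over 3 classes
labels = np.array([0, -1])           # one labeled image, one unlabeled
I_img = np.zeros((2, 4))
loss = semisup_loss(logits, labels, I_img, I_img.copy(),
                    [np.ones(3)], np.full(3, 1 / 3))
print(np.isclose(loss, np.log(3)))   # only the cross-entropy term is nonzero -> True
```

In practice each term would be backpropagated through the network; the sketch only shows how the four terms combine into one scalar objective.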
4.2 Variational Inference
The DRMM can compute the most likely latent configuration (ĉ, ĝ) given the image I and therefore allows exact inference of p(I | c). Using variational inference, we would like the approximate posterior q(c | I) to be close to the true posterior p(c | I) by minimizing the KL divergence KL(q(c | I) ‖ p(c | I)). It has been shown in [3] that this minimization is equivalent to the following optimization:
(9)  max_q  E_{q(c|I)}[ log p(I | c) ] − KL( q(c | I) ‖ p(c) )
where p(c) is the class prior. A similar idea has been employed in variational autoencoders [6], but here, instead of following a Gaussian distribution, c is a categorical random variable. An extension of the optimization in Eqn. 9 is given by:
(10)  max_q  E_{q(c|I)}[ log p(I | c) ] − β KL( q(c | I) ‖ p(c) )
As has been shown in [4], for this optimization there exists a value of β such that the latent variations in the data are optimally disentangled.
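The decomposition behind Eqn. 9 can be verified numerically on a tiny discrete model: since log p(I) = ELBO + KL(q ‖ p(c|I)), maximizing the variational objective is the same as driving q toward the true posterior. The prior and likelihood values below are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(2)

# Tiny discrete model: prior p(c) and likelihood p(I|c) for one fixed image I.
p_c = np.array([0.5, 0.3, 0.2])          # class prior
p_I_given_c = np.array([0.1, 0.4, 0.7])  # likelihood of the observed I under each c

p_I = np.sum(p_I_given_c * p_c)          # evidence p(I)
p_c_given_I = p_I_given_c * p_c / p_I    # true posterior p(c|I)

q = rng.dirichlet(np.ones(3))            # arbitrary approximate posterior q(c|I)

kl = lambda a, b: np.sum(a * np.log(a / b))
elbo = np.sum(q * np.log(p_I_given_c)) - kl(q, p_c)   # objective in Eqn. 9

# Identity: log p(I) = ELBO + KL(q || p(c|I)); maximizing the ELBO over q
# therefore minimizes the KL between q and the true posterior.
print(np.isclose(np.log(p_I), elbo + kl(q, p_c_given_I)))  # True
```

Because c is categorical with finitely many values, every quantity here is an exact finite sum, which mirrors the paper's observation that the expectation over classes can be computed exactly.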
The KL divergence in Eqns. 9 and 10 yields the L_KL loss in the semi-supervised learning objective function for the DRMM (see Section 4.1). Similarly, the expected reconstruction error E_{q(c|I)}[log p(I | c)] corresponds to the L_RC loss in the objective function. Note that this expected reconstruction error can be computed exactly for the DRMM, since there are only a finite number of configurations for the class c. When the number of object classes is large, such as in ImageNet [17], where there are 1000 classes, sampling techniques can be used to approximate the expectation. From our experiments (see Section 5), we notice that for semi-supervised learning tasks on MNIST, SVHN, and CIFAR10, using the most likely predicted ĉ from the bottom-up inference to compute the reconstruction error yields the best classification accuracies.
4.3 Non-Negativity Constraint Optimization
In order to derive DCNs from the DRMM, the intermediate rendered templates must be non-negative [14]. This is necessary in order to apply the max-product algorithm, wherein we can push the max to the right to obtain an efficient message-passing algorithm. We enforce this condition in the top-down inference of the DRMM by introducing new non-negativity constraints into the optimizations 9 and 10. There are various well-developed methods for solving optimization problems with non-negativity constraints. We employ a simple but useful approach, which adds an extra non-negativity penalty, in this case L_NN, to the objective function. This yields an unconstrained optimization that can be solved by gradient-based methods such as stochastic gradient descent. We cross-validate the penalty weight α_NN.
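A sketch of the penalty itself, under the squared-hinge form assumed above: it is zero wherever the constraint z ≥ 0 holds and quadratic in the violation otherwise, so its gradient only pushes on violated coordinates. The function names are illustrative:

```python
import numpy as np

def nn_penalty(z):
    """Penalty that is zero where z >= 0 and grows quadratically where z < 0."""
    return np.sum(np.maximum(0.0, -z) ** 2)

def nn_penalty_grad(z):
    """Gradient of the penalty: 2z on violated coordinates (z < 0), 0 elsewhere."""
    return -2.0 * np.maximum(0.0, -z)

z = np.array([1.0, -0.5, 0.0, -2.0])
print(nn_penalty(z))        # 0.25 + 4.0 = 4.25
print(nn_penalty_grad(z))   # [ 0. -1.  0. -4.]
```

Because the penalty is smooth away from zero and vanishes on the feasible set, adding it with a cross-validated weight turns the constrained problem into an unconstrained one that plain SGD can handle.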
5 Experiments
We evaluate our methods on the MNIST, SVHN, and CIFAR10 datasets. In all experiments, we perform semi-supervised learning using the DRMM with a training objective that includes the cross-entropy cost, the reconstruction cost, the KL divergence, and the non-negativity penalty discussed in Section 4.1. We train the model on all provided training examples with different numbers of labels and report state-of-the-art test errors on MNIST and SVHN. The results on CIFAR10 are comparable to state-of-the-art methods. In order to focus on and better understand the impact of the KL and NN penalties on semi-supervised learning in the DRMM, we do not use any other regularization techniques such as Dropout or noise injection in our experiments. We also use only a simple stochastic gradient descent optimization with an exponentially decayed learning rate to train the model. Applying regularization and using better optimization methods like Adam [6] may further improve the semi-supervised learning performance of the DRMM. More model and training details are provided in the Appendix.
5.1 MNIST
The MNIST dataset contains 60,000 training images and 10,000 test images of handwritten digits from 0 to 9. Each image is of size 28-by-28. For evaluating semi-supervised learning, we randomly choose N_L images with labels from the training set such that the amounts of labeled training images from each class are balanced. The remaining training images are provided without labels. We use a 5-layer DRMM with a feedforward configuration similar to the Conv Small network in [15]. We apply batch normalization to the net inputs and use stochastic gradient descent with an exponentially decayed learning rate to train the model.
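An exponentially decayed schedule of the kind used here can be sketched as follows; the endpoint values (0.2 down to 0.0001 over 500 epochs) are taken from the training details in the Appendix, while the exact interpolation is an assumption:

```python
def exp_decayed_lr(epoch, lr_start=0.2, lr_end=1e-4, n_epochs=500):
    """Exponentially decay the learning rate from lr_start (epoch 0)
    to lr_end (last epoch) over n_epochs epochs."""
    decay = (lr_end / lr_start) ** (1.0 / (n_epochs - 1))
    return lr_start * decay ** epoch

print(exp_decayed_lr(0))    # 0.2
print(exp_decayed_lr(499))  # 1e-4 (up to floating-point error)
```

Each epoch multiplies the rate by a fixed factor, so the schedule is a straight line on a log scale between the two endpoints.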
Table 1 shows the test error for each experiment. The KL and NN penalties help improve the semi-supervised learning performance across all setups. In particular, the KL penalty alone substantially reduces the test error. Using both the KL and NN penalties, the test error is reduced further, and the DRMM achieves state-of-the-art results in all experiments. (Note: the results for the Improved GAN are for the permutation-invariant MNIST task, while the DRMM performance is on the regular MNIST task. Since the DRMM contains local latent variables at each level, it is not suitable for tasks such as permutation-invariant MNIST.) We also analyze the value that the KL and NN penalties add to the learning. Table 2 reports the reductions in test errors when using the KL penalty only, the NN penalty only, and both penalties during training. Individually, the KL penalty leads to significant improvements in test errors, likely because it helps disentangle latent variations in the data. In fact, for a model with continuous latent variables, it has been experimentally shown that there exists an optimal weight for the KL penalty such that all of the latent variations in the data are almost optimally disentangled [4]. More results are provided in the Appendix.
Model  Test error (%) for a given number of labeled examples  
DGN [7]    
catGAN [19]      
Virtual Adversarial [12]      
Skip Deep Generative Model [10]      
Ladder Network [15]    
Auxiliary Deep Generative Model [10]      
Improved GAN [18]    
DRMM 5-layer  
DRMM 5-layer + NN penalty  
DRMM 5-layer + KL penalty  
DRMM 5-layer + KL and NN penalties 
Model  Test error reduction (%)  

DRMM 5-layer + NN penalty  
DRMM 5-layer + KL penalty  
DRMM 5-layer + KL and NN penalties 
5.2 SVHN
Like MNIST, the SVHN dataset is used for validating semi-supervised learning methods. SVHN contains 73,257 color images of street-view house number digits. For training, we use a 9-layer DRMM with a feedforward propagation similar to the Conv Large network in [15]. Other training details are the same as for MNIST. We train our model with N_L ∈ {500, 1000, 2000} labeled images and show state-of-the-art results in Table 3.
Model  Test error (%) for a given number of labeled examples  

500  1000  2000  
DGN [7]  
Virtual Adversarial [12]  
Auxiliary Deep Generative Model [10]  
Skip Deep Generative Model [10]  
Improved GAN [18]  
DRMM 9-layer + KL penalty  
DRMM 9-layer + KL and NN penalties 
5.3 CIFAR10
We use CIFAR10 to test the semi-supervised learning performance of the DRMM on natural images. For CIFAR10 training, we use the same 9-layer DRMM as for SVHN. Stochastic gradient descent with an exponentially decayed learning rate is again used to train the model. Table 4 presents the semi-supervised learning results for the 9-layer DRMM. Even though we only use a simple SGD algorithm to train our model, the DRMM achieves results comparable to state-of-the-art methods (close to the test errors of the Ladder Networks). For semi-supervised learning tasks on CIFAR10, the Improved GAN has the best classification error. However, unlike the Ladder Networks and the DRMM, GAN-based architectures have an entirely different objective function, approximating the Nash equilibrium of a two-player minimax game, and are therefore not directly comparable to our model.
5.4 Analyzing the DRMM using Synthetic Imagery
In order to better understand what the DRMM learns during training and how latent variations are encoded in the DRMM, we train DRMMs on a synthetic dataset that has labels for important latent variations in the data, and we analyze the trained model using a linear decoding analysis. We show that the DRMM disentangles latent variations over multiple layers and compare the results with traditional DCNs.
Dataset and Training:
The DRMM captures latent variations in the data [13, 14]. Given that the DRMM yields very good semi-supervised learning performance on classification tasks, we would like to gain more insight into how a trained DRMM stores knowledge of latent variations in the data. Such an analysis requires labels for the latent variations in the data; however, popular benchmarks such as MNIST, SVHN, and CIFAR10 do not include that information. In order to overcome this difficulty, we developed a Python API for Blender, an open-source computer graphics rendering package, that allows us not only to generate images but also to access the values of the latent variables used to generate each image.
The dataset we generate for the linear decoding analysis in Section 5.4 mimics CIFAR10. It contains 60K grayscale images in 10 classes of natural objects. Each image is of size 32-by-32, and the classes are the same as in CIFAR10. For each image, we also have labels for the slant, tilt, x-location, y-location, and depth of the object in the image. Sample images from the dataset are given in Figure 3.
For training, we split the dataset into a training set and a test set, containing 50K and 10K images, respectively. We perform semi-supervised learning with labeled and unlabeled images using a 9-layer DRMM with the same configuration as in the experiments on CIFAR10. We also train the equivalent DCN on the same dataset in a supervised setup using the same number of labeled images. The test errors are reported in Table 5.
Model  Test error (%)  

50K  
Conv Large 9-layer  
DRMM 9-layer + KL and NN penalties 
Linear Decoding Analysis:
We applied a linear decoding analysis to the DRMMs and the DCNs trained on the synthetic dataset. In particular, for a given image, we map its activations at each layer to the latent variables by first quantizing the values of each latent variable into 10 bins and then classifying the activations into those bins using the first ten principal components of the activations. We show the classification errors in Figure 4.
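The analysis can be sketched as follows, with synthetic stand-ins for the activations and a single latent variable; quantile binning, PCA via an SVD, and a nearest-centroid linear decoder are one plausible instantiation of the procedure, not necessarily the exact one used in the paper:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical stand-ins: activations at one layer and one latent (e.g. x-location).
n, d = 500, 64
latent = rng.uniform(0.0, 1.0, size=n)
acts = np.outer(latent, rng.standard_normal(d)) + 0.1 * rng.standard_normal((n, d))

# 1) Quantize the latent variable into 10 equally populated bins.
edges = np.quantile(latent, np.linspace(0, 1, 11)[1:-1])
y = np.digitize(latent, edges)                      # bin index 0..9

# 2) Project the activations onto their first 10 principal components.
acts_c = acts - acts.mean(axis=0)
_, _, Vt = np.linalg.svd(acts_c, full_matrices=False)
proj = acts_c @ Vt[:10].T

# 3) Linear decoding: nearest class-centroid classifier in the projected space.
centroids = np.stack([proj[y == k].mean(axis=0) for k in range(10)])
pred = np.argmin(((proj[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
error = np.mean(pred != y)
print(error)  # low error: this layer linearly encodes the latent variable
```

A layer whose activations retain the latent variable yields a low decoding error; a layer that has discarded it decodes at chance, which is how the per-layer curves in Figure 4 are read.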
Like the DCNs, the DRMMs disentangle latent variations in the data. However, the DRMMs keep the information about the latent variations across most of the layers in the model and only drop that information when making a decision on the class label. The DRMMs behave this way because, during semi-supervised learning, in addition to the object classification task, they also need to minimize the reconstruction error, and knowledge of the latent variations in the input images is needed for this second task.
6 Conclusions
In this paper, we proposed a new approach for semi-supervised learning with DCNs. Our algorithm builds upon the DRMM, a recently developed probabilistic generative model underlying DCNs. We employed the EM algorithm to develop the bottom-up and top-down inference in the DRMM. We also applied variational inference and utilized the non-negativity constraint in the DRMM to derive two new penalty terms, the KL and NN penalties, for the training objective function. Our method achieves state-of-the-art results in semi-supervised learning tasks on MNIST and SVHN and yields results comparable to state-of-the-art methods on CIFAR10. We analyzed the trained DRMM using our synthetic dataset and showed how latent variations are disentangled across its layers. Taken together, our semi-supervised learning algorithm for the DRMM is promising for a wide range of applications in which labels are hard to obtain, as well as for future research.
References
 [1] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
 [2] C. M. Bishop. Pattern Recognition and Machine Learning, volume 4. Springer New York, 2006.
 [3] D. M. Blei, A. Kucukelbir, and J. D. McAuliffe. Variational inference: A review for statisticians. arXiv preprint arXiv:1601.00670, 2016.
 [4] I. Higgins, L. Matthey, X. Glorot, A. Pal, B. Uria, C. Blundell, S. Mohamed, and A. Lerchner. Early visual concept learning with unsupervised deep learning. arXiv preprint arXiv:1606.05579, 2016.
 [5] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [6] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [7] D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semisupervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
 [8] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller. Efficient backprop. In Neural Networks: Tricks of the Trade. Springer, 1998.
 [9] D.-H. Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, 2013.
 [10] L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther. Auxiliary deep generative models. arXiv preprint arXiv:1602.05473, 2016.
 [11] A. Makhzani and B. J. Frey. Winnertakeall autoencoders. In Advances in Neural Information Processing Systems, pages 2773–2781, 2015.
 [12] T. Miyato, S.i. Maeda, M. Koyama, K. Nakae, and S. Ishii. Distributional smoothing by virtual adversarial examples. arXiv preprint arXiv:1507.00677, 2015.
 [13] A. B. Patel, T. Nguyen, and R. G. Baraniuk. A probabilistic theory of deep learning. arXiv preprint arXiv:1504.00641, 2015.
 [14] A. B. Patel, T. Nguyen, and R. G. Baraniuk. A probabilistic framework for deep learning. NIPS, 2016.
 [15] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semisupervised learning with ladder networks. In Advances in Neural Information Processing Systems, pages 3532–3540, 2015.

 [16] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 833–840, 2011.
 [17] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
 [18] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. arXiv preprint arXiv:1606.03498, 2016.
 [19] J. T. Springenberg. Unsupervised and semisupervised learning with categorical generative adversarial networks. arXiv preprint arXiv:1511.06390, 2015.
 [20] J. Zhao, M. Mathieu, R. Goroshin, and Y. LeCun. Stacked whatwhere autoencoders. arXiv preprint arXiv:1506.02351, 2016.
Appendix A Model Architectures and Training Details
Table 6: Model architectures and training details.

Model: DRMM of ConvSmall [15]. Dataset: MNIST. Optimizer/hyperparameters: SGD [8], learning rate exponentially decayed from 0.2 to 0.0001 over 500 epochs.
- Input: 784 (flattened 28x28x1).
- E-step Bottom-Up: Conv 32x5x5 (Full), Maxpool 2x2, Conv 64x3x3 (Valid), 64x3x3 (Full), Maxpool 2x2, Conv 128x3x3 (Valid), 10x1x1 (Valid), Meanpool 6x6, Softmax. BatchNorm after each Conv layer; ReLU activation.
- Classes: 10.
- E-step Top-Down: DRMM Top-Down reconstruction. No BatchNorm; nearest-neighbor upsampling.

Model: DRMM of ConvLarge [15]. Datasets: SVHN, CIFAR10. Optimizer/hyperparameters: SGD [8], learning rate exponentially decayed from 0.2 to 0.0001 over 500 epochs.
- Input: 3072 (flattened 32x32x3).
- E-step Bottom-Up: Conv 96x3x3 (Half), 96x3x3 (Full), 96x3x3 (Full), Maxpool 2x2, Conv 192x3x3 (Valid), 192x3x3 (Full), 192x3x3 (Valid), Maxpool 2x2, Conv 192x3x3 (Valid), 192x1x1 (Valid), 10x1x1 (Valid), Meanpool 6x6, Softmax. BatchNorm after each Conv layer; ReLU activation.
- Classes: 10.
- E-step Top-Down: DRMM Top-Down reconstruction. No BatchNorm; nearest-neighbor upsampling.
The details of the model architectures and training in this paper are provided in Table 6. The models are trained using stochastic gradient descent [8] with an exponentially decayed learning rate. All convolutions are of stride one, and poolings are non-overlapping. Full, half, and valid convolutions follow the conventions in Theano: a full convolution increases the image size, a half convolution preserves the image size, and a valid convolution decreases the image size. The means and variances used in batch normalization [5] are tracked during training using an exponential moving average and used during testing and validation. The implementation of the DRMM generation process can be found in Section B. The set of labeled images is replicated until its size matches the size of the unlabeled set (60K for MNIST, 73,257 for SVHN, and 50K for CIFAR10). In each training iteration, equal amounts of labeled and unlabeled images (each half of the batch size) are sent into the DRMM. The batch size used is 100. The hyperparameter values provided in Table 6 correspond to the labeled-set sizes N_L used for MNIST, SVHN, and CIFAR10, respectively, where N_L is the number of labeled images used in training. Also, DRMM of ConvSmall is the DRMM whose E-step Bottom-Up is similar to the ConvSmall network [15], and DRMM of ConvLarge is the DRMM whose E-step Bottom-Up is similar to the ConvLarge network [15]. Note that we apply batch normalization only after the convolutions, not after the pooling layers.
Appendix B Generation Process in the Deep Rendering Mixture Model
As mentioned in Section 3 of the paper, generation in the DRMM takes the form:
c^(L) ∼ Cat(π_{c^(L)}),  g^(ℓ) ∼ Cat(π_{g^(ℓ)}) ∀ℓ;  μ_cg = Λ^(1)_{g^(1)} ⋯ Λ^(L)_{g^(L)} μ_{c^(L)};  I ∼ N(μ_cg, σ² 1),
where ℓ is the layer, c^(L) is the object category, g^(ℓ) are the latent (nuisance) variables at layer ℓ, and Λ^(ℓ)_{g^(ℓ)} are parameter dictionaries that contain templates at layer ℓ. The image is generated by adding isotropic Gaussian noise to a multiscale "rendered" template μ_cg. When applying the hard Expectation-Maximization (EM) algorithm, we take the zero-noise limit. Here, g^(ℓ) = (a^(ℓ), x^(ℓ)), where a^(ℓ) is a vector of binary switching variables that select the templates to render and x^(ℓ) is the vector of rendering positions. Note that a ∈ {0, 1} (see Figure 2B) and x ∈ {UL, UR, LL, LR}, where UL, UR, LL and LR stand for the upper-left, upper-right, lower-left and lower-right positions, respectively. As defined in [14], the intermediate rendered image is given by:
(11)  
(12)  
(13)  
(14) 
where M is a masking matrix, Γ is the set of core templates (without any zero-padding or translation) at layer ℓ, Z is a set of zero-padding operators, and T_x is a set of translation operators to position x. Elements of Z and T are indexed by the rendering position. The core templates are shared across pixels in the same channel of the intermediate rendered image. Note that in the main paper, we call these intermediate rendered templates. The DRMM layer can be implemented using convolutions, or equivalently, deconvolutions. a^(ℓ) and x^(ℓ) are used to select the rendering templates and the positions to render, respectively. In the E-step Top-Down Reconstruction, the â^(ℓ) and x̂^(ℓ) estimated in the E-step Bottom-Up are used instead.