1 Introduction
Max-margin learning has been effective for learning discriminative models, with many examples such as univariate-output support vector machines (SVMs)
[9] and multivariate-output max-margin Markov networks (or structured SVMs) [52, 1, 53]. However, the ever-increasing size of complex data makes it hard to construct such a fully discriminative model, which has only a single layer of adjustable weights, for two reasons: (1) manually constructed features may not capture the underlying high-order statistics well; and (2) a fully discriminative approach cannot reconstruct the input data when noise or missing values are present. To address the first challenge, previous work has considered incorporating latent variables into a max-margin model, including partially observed maximum entropy discrimination Markov networks [61], structured latent SVMs [56] and max-margin min-entropy models [36]. All this work has primarily focused on a shallow structure of latent variables. To improve the flexibility, learning SVMs with a deep latent structure has been presented in [51]. However, these methods do not address the second challenge, which requires a generative model to describe the inputs. Recent work on learning max-margin generative models includes max-margin topic models [58, 59], max-margin Harmoniums [8], and nonparametric Bayesian latent SVMs [60], which can infer the dimension of latent features from data. However, these methods only consider a shallow structure of latent variables, which may not be flexible enough to describe complex data.
Much work has been done on learning generative models with a deep structure of nonlinear hidden variables, including deep belief networks
[45, 27, 41], autoregressive models [23, 17], stochastic variations of autoencoders
[55, 4, 3] and Generative Adversarial Nets (GANs) [14, 40]. For such models, inference is a challenging problem, which has motivated much recent progress on stochastic variational inference algorithms [22, 44, 5, 6]. However, the primary focus of deep generative models (DGMs) has been on unsupervised learning, with the goals of learning latent representations and generating input samples. Though the latent representations can be used with a downstream classifier to make predictions, it is often beneficial to learn a joint model that considers both input and response variables. Recent work on semi-supervised deep generative models
[21, 50, 34, 46] demonstrates the effectiveness of DGMs in modeling the density of unlabeled data to benefit the prediction task (see Sec. 2 for a detailed discussion). However, it remains open whether discriminative max-margin learning is suitable for this task. In this paper, we revisit the max-margin principle and present max-margin deep generative models (mmDGMs), which learn multilayered representations that are good for both classification and input inference. Our mmDGMs conjoin the flexibility of DGMs in describing input data and the strong discriminative ability of max-margin learning in making accurate predictions. Given fully labeled data, we formulate mmDGMs as solving a variational inference problem of a DGM regularized by a set of max-margin posterior constraints, which bias the model to learn representations that are good for prediction. We define the max-margin posterior constraints as a linear functional of the target variational distribution of the latent representations. To optimize the joint learning problem, we develop a doubly stochastic subgradient descent algorithm, which generalizes the Pegasos algorithm [49] to consider nontrivial latent variables. For the variational distribution, we build a recognition model to capture the nonlinearity, similarly to [22, 44].
To reduce the dependency on fully labeled data, we further propose a class-conditional variant of mmDGMs (mmDCGMs) to deal with partially labeled data for semi-supervised learning, where the amount of unlabeled data is typically much larger than that of labeled data. Specifically, mmDCGMs employ a deep max-margin classifier to infer the missing labels for unlabeled data and a class-conditional deep generative model [21]
to capture the joint distribution of the data, labels and latent variables. Unlike
[42, 50, 46], our mmDCGMs separate the pathways of inferring labels and latent variables completely and can generate images given a specific class. Instead of inferring the full posterior of labels as in [21, 34], which is computationally expensive for large datasets, we use the prediction of the classifier as a point estimate of the label to speed up the training procedure. We further design additional max-margin and label-balance regularization terms for unlabeled data to enhance the classifier and significantly boost the classification performance.
We consider two types of networks in our mmDGMs and mmDCGMs—multilayer perceptrons (MLPs) as in
[22, 44] and convolutional neural networks (CNNs) [24]. In the CNN case, following [11], we apply unpooling, convolution and rectification sequentially to form a highly nontrivial deep generative network that generates images from latent variables learned automatically by a recognition model using a standard CNN. We present the detailed network structures in the experiment section. Empirical results on the widely used MNIST [24], SVHN [39] and small NORB [25] datasets demonstrate that: (1) mmDGMs can significantly improve the prediction performance in supervised learning, which is competitive with the best feedforward neural networks, while retaining the capability of generating input samples and completing their missing values; and (2) mmDCGMs can achieve state-of-the-art classification results in semi-supervised learning with efficient inference, and disentangle styles and classes based on raw images. In summary, our main contributions are:

We present max-margin DGMs for both supervised and semi-supervised settings to significantly enhance the discriminative power of DGMs while retaining their generative ability;

We develop efficient algorithms to solve the joint learning problems, which involve intractable expectations and non-smooth piecewise linear operations;

We achieve state-of-the-art results on several benchmarks in semi-supervised learning, and prediction accuracy in supervised learning competitive with fully discriminative CNNs.
The rest of the paper is structured as follows. Section 2 surveys the related work. Section 3 presents maxmargin deep generative models for both supervised and semisupervised learning. Section 4 presents experimental results. Finally, Section 5 concludes.
2 Related Work
Deep generative models (DGMs) are good at discovering the underlying structures in input data, but training the model parameters and inferring the posterior distribution are highly nontrivial tasks. Recently, significant progress has been made on enriching the representative power of variational inference and Markov chain Monte Carlo methods for posterior inference, such as Variational Autoencoders (VAEs)
[22, 44] and neural adaptive MCMC [12]. VAEs [22, 44] build a recognition model to infer the posterior of latent variables, and the parameters are trained to optimize a variational bound of the data likelihood. Neural adaptive MCMC [12] employs a similar recognition model as the proposal distribution for importance sampling to estimate the gradient of the log-posterior, and hence can perform approximate Bayesian inference of DGMs.
To learn the parameters, besides the commonly used MLE objective adopted by VAEs, recent work has proposed various alternatives. For example, Generative Adversarial Nets (GANs) [14]
construct a discriminator to distinguish the generated samples from the training data, and the parameters are trained based on a minimax two-player game framework. Generative Moment Matching Networks (GMMNs)
[30, 13] generate samples from a directed deep generative model, which is trained to match all orders of statistics between training data and samples from the model. The very recent work [43] extends these ideas to learn conditional GMMNs with much broader applicability. Extensive work has focused on realistic image generation in the unsupervised setting. For example, DRAW [16]
employs recurrent neural networks as the generative model and recognition model, and introduces a 2D attention mechanism to generate sequences of realistic digits step by step. MEM-VAE
[29] leverages an external memory and an attention mechanism to encode and retrieve the detailed information lost in the recognition model to enhance DGMs. LAPGAN [10] proposes a cascade of GANs to generate high-quality natural images through a Laplacian pyramid framework [7]. DCGAN [40] adopts fractionally strided convolution networks in the generator to learn the spatial upsampling and refines the generated samples.
Some recent advances [21, 34, 50, 46, 42] have been made on extending DGMs to deal with partially observed data. For example, the conditional VAEs [21] treat labels as conditions of DGMs to describe input data; they perform posterior inference of labels given unlabeled data and can generate a specific class of images. ADGM [34] introduces auxiliary latent variables to DGMs to make the variational distribution more expressive, and does well in semi-supervised learning. CatGAN [50] generalizes GANs with a categorical discriminative network and an objective function that includes the mutual information between the input data and the prediction of the discriminative network. [46]
proposes feature matching, virtual batch normalization and other techniques to improve the performance of GANs on semi-supervised learning and image generation. The Ladder Network
[42] achieves excellent classification results in semi-supervised learning by employing lateral connections between autoencoders to reduce the competition between invariant feature extraction and the reconstruction of object details.
Our work is complementary to the above progress in the sense that we investigate a new criterion (i.e., max-margin learning) for DGMs in both supervised and semi-supervised settings. Some preliminary results on the fully supervised mmDGMs were published in [28], while the semi-supervised extensions are novel.
3 Max-margin Deep Generative Models
We now present the max-margin deep generative models for supervised learning and their class-conditional variants for semi-supervised learning. For both methods, we present efficient algorithms.
3.1 Basics of Deep Generative Models
We start from a general setting, where we have i.i.d. data $X = \{x_n\}_{n=1}^N$. A deep generative model (DGM) assumes that each $x_n \in \mathbb{R}^D$ is generated from a vector of latent variables $z_n \in \mathbb{R}^K$, which itself follows some distribution. The joint probability of a DGM is as follows:

(1) $p(X, Z | \theta) = \prod_{n=1}^N p(z_n) \, p(x_n | z_n, \theta),$

where $p(z_n)$ is the prior of the latent variables and $p(x_n | z_n, \theta)$ is the likelihood model for generating observations. For notation simplicity, we define $Z = \{z_n\}_{n=1}^N$. Depending on the structure of $z$, various DGMs have been developed, such as the deep belief networks [45, 27], deep sigmoid networks [38], deep latent Gaussian models [44], and deep autoregressive models [17]. In this paper, we focus on the directed DGMs, which can be easily sampled from via an ancestral sampler.
However, in most cases learning DGMs is challenging due to the intractability of posterior inference. The state-of-the-art methods resort to stochastic variational methods under the maximum likelihood estimation (MLE) framework (see the related work for alternative learning methods). Specifically, let $q(z_n)$ be the variational distribution that approximates the true posterior $p(z_n | x_n, \theta)$. A variational upper bound of the per-sample negative log-likelihood (NLL) $-\log p(x_n | \theta)$ is:

$\mathcal{L}(\theta; x_n) := \mathrm{KL}\big(q(z_n) \,\|\, p(z_n)\big) - \mathbb{E}_{q(z_n)}\big[\log p(x_n | z_n, \theta)\big],$

where $\mathrm{KL}(q \| p)$ is the Kullback-Leibler (KL) divergence between distributions $q$ and $p$. Then, $\mathcal{L}(\theta; X) := \sum_{n=1}^N \mathcal{L}(\theta; x_n)$ upper bounds the full negative log-likelihood $-\log p(X | \theta)$.
It is important to notice that if we do not make restrictive assumptions on the variational distribution $q(z_n)$, the bound is tight by simply setting $q(z_n) = p(z_n | x_n, \theta)$. That is, the MLE is equivalent to solving the variational problem $\min_{\theta, q} \mathcal{L}(\theta; X)$. However, since the true posterior is intractable except in a handful of special cases, we must resort to approximation methods. One common assumption is that the variational distribution is of some parametric form, $q_\phi(z_n)$, and then we optimize the variational bound w.r.t. the variational parameters $\phi$. For DGMs, another challenge arises: the variational bound is often intractable to compute analytically. To address this challenge, early work further bounds the intractable parts with tractable ones by introducing more variational parameters [47]. However, this technique increases the gap between the bound being optimized and the log-likelihood, potentially resulting in poorer estimates. Much recent progress [22, 44, 38] has been made on hybrid Monte Carlo and variational methods, which approximate the intractable expectations and their gradients over the parameters $(\theta, \phi)$ via some unbiased Monte Carlo estimates. Furthermore, to handle large-scale datasets, stochastic optimization of the variational objective can be used with a suitable learning rate annealing scheme. It is important to notice that variance reduction is a key part of these methods in order to have fast and stable convergence.
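As a concrete illustration of such a Monte Carlo estimate, the sketch below computes a per-sample variational bound for a toy model with a diagonal-Gaussian variational distribution and a factorized Bernoulli likelihood. The linear likelihood map `W` and all numeric values are illustrative assumptions, not from the paper; the KL term is computed analytically while the reconstruction term is sampled.

```python
import numpy as np

rng = np.random.default_rng(0)

def neg_elbo_estimate(x, mu, log_var, log_px_given_z, L=10):
    """Monte Carlo estimate of the per-sample variational bound
    KL(q(z|x) || p(z)) - E_q[log p(x|z)], with q = N(mu, diag(exp(log_var)))
    and prior p(z) = N(0, I)."""
    # Analytic KL between a diagonal Gaussian q and the standard normal prior.
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    # Reparameterized samples z = mu + sigma * eps (low-variance estimator).
    eps = rng.standard_normal((L, mu.size))
    z = mu + np.exp(0.5 * log_var) * eps
    recon = np.mean([log_px_given_z(x, z_l) for z_l in z])
    return kl - recon

# Toy likelihood: factorized Bernoulli with logits from a fixed linear map.
W = rng.standard_normal((4, 2)) * 0.1
def log_px_given_z(x, z):
    logits = W @ z
    return float(np.sum(x * logits - np.logaddexp(0.0, logits)))

x = np.array([1.0, 0.0, 1.0, 0.0])
bound = neg_elbo_estimate(x, mu=np.zeros(2), log_var=np.zeros(2),
                          log_px_given_z=log_px_given_z)
```

Since the Bernoulli log-likelihood is non-positive and the KL term is non-negative, the estimated bound is always non-negative here.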
Most work on directed DGMs has focused on the generative capability of inferring the observations, such as filling in missing values [22, 44, 38], while relatively little work has been done on investigating the predictive power, except the recent advances [21, 50] in semi-supervised learning. Below, we present max-margin deep generative models, which explore the discriminative max-margin principle to improve the predictive ability of the latent representations, while retaining the generative capability.
3.2 Max-margin Deep Generative Models
We first consider the fully supervised setting, where the training data is a pair $(x, y)$ with input features $x$ and the ground-truth label $y$. Without loss of generality, we consider multiclass classification, where $y \in \mathcal{Y} = \{1, \dots, M\}$. As illustrated in Fig. 1, a max-margin deep generative model (mmDGM) consists of two components: (1) a deep generative model to describe input features; and (2) a max-margin classifier to consider supervision. For the generative model, we can in theory adopt any DGM that defines a joint distribution over $(X, Z)$ as in Eq. (1). For the max-margin classifier, instead of fitting the input features into a conventional SVM, we define the linear classifier on the latent representations, whose learning will be regularized by the supervision signal as we shall see. Specifically, if the latent representation $z$ is given, we define the latent discriminant function $F(y, z; \eta) := \eta^\top f(y, z)$, where $f(y, z)$ is an $MK$-dimensional vector that concatenates $M$ subvectors, with the $y$-th being $z$ and all others being zero, and $\eta$ is the corresponding weight vector.
We consider the case that $\eta$ is a random vector, following some prior distribution $p_0(\eta)$. Then our goal is to infer the posterior distribution $p(\eta, Z | X, Y)$, which is typically approximated by a variational distribution $q(\eta, Z)$ for computational tractability. Notice that this posterior is different from the one in the vanilla DGM. We expect that the supervision information will bias the learned representations to be more powerful for predicting the labels at testing. To account for the uncertainty of $(\eta, z)$, we take the expectation and define the discriminant function $F(y; x) := \mathbb{E}_q[\eta^\top f(y, z)]$, and the final prediction rule that maps inputs to outputs is:

(2) $y^* := \operatorname*{argmax}_{y \in \mathcal{Y}} \mathbb{E}_q\big[\eta^\top f(y, z)\big].$
Note that different from the conditional DGM [21], which puts the class labels $y$ upstream and generates the latent representations as well as input data by conditioning on $y$, the above classifier is a downstream model, in the sense that the supervision signal is determined by conditioning on the latent representations.
3.2.1 The Learning Problem
We want to jointly learn the parameters $\theta$ and infer the posterior distribution $q(\eta, Z)$. Based on the equivalent variational formulation of MLE, we define the joint learning problem as solving:

(3) $\min_{\theta, q(\eta, Z), \xi} \; \mathcal{L}(\theta; q) + C \sum_{n=1}^N \xi_n$
s.t.: $\mathbb{E}_q[\eta^\top \Delta f_n(y)] \ge \Delta l_n(y) - \xi_n, \;\; \xi_n \ge 0, \;\; \forall n, \forall y,$

where $\Delta f_n(y) := f(y_n, z_n) - f(y, z_n)$ is the difference of the feature vectors; $\Delta l_n(y)$
is the loss function that measures the cost of predicting $y$
if the true label is $y_n$; and $C$ is a nonnegative regularization parameter balancing the two components. In the objective, the variational bound is defined as $\mathcal{L}(\theta; q) := \mathrm{KL}(q(\eta, Z) \| p_0(\eta) p(Z)) - \mathbb{E}_q[\log p(X | Z, \theta)]$, and the margin constraints are from the classifier (2). If we ignore the constraints (e.g., by setting $C$ to 0), the solution of $q(\eta, Z)$ will be exactly the Bayesian posterior, and the problem is equivalent to doing MLE for $\theta$. By absorbing the slack variables, we can rewrite the problem in an unconstrained form:
(4) $\min_{\theta, q(\eta, Z)} \; \mathcal{L}(\theta; q) + C \sum_{n=1}^N \mathcal{R}(q; x_n, y_n),$

where the hinge loss is $\mathcal{R}(q; x_n, y_n) := \max_{y} \big( \Delta l_n(y) - \mathbb{E}_q[\eta^\top \Delta f_n(y)] \big)$. Due to the convexity of the max function, it is easy to verify that the hinge loss is an upper bound of the training error of classifier (2), that is, $\mathcal{R}(q; x_n, y_n) \ge \mathbb{I}(y^*_n \ne y_n)$. Furthermore, the hinge loss is a convex functional over the variational distribution because of the linearity of the expectation operator. These properties render the hinge loss a good surrogate to optimize over. Previous work has explored this idea to learn discriminative topic models [58], but with a restriction to a shallow structure of hidden variables. Our work presents a significant extension to learn deep generative models, which poses new challenges for learning and inference.
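To make the upper-bound property concrete, here is a minimal numeric check. The class scores stand in for the expected discriminant values $\mathbb{E}_q[\eta^\top f(y, z)]$; the random test values are illustrative only.

```python
import numpy as np

def multiclass_hinge(scores, y_true, delta=1.0):
    """Per-sample multiclass hinge loss:
    max_y [ delta * 1(y != y_true) + score(y) - score(y_true) ]."""
    margins = scores - scores[y_true]
    margins = margins + delta * (np.arange(len(scores)) != y_true)
    return float(np.max(margins))

def zero_one(scores, y_true):
    """0-1 training error of the argmax prediction rule."""
    return float(np.argmax(scores) != y_true)

# The hinge loss upper-bounds the 0-1 error for arbitrary scores.
rng = np.random.default_rng(1)
for _ in range(100):
    s = rng.standard_normal(5)
    y = int(rng.integers(5))
    assert multiclass_hinge(s, y) >= zero_one(s, y)
```

The check passes for any scores: when the argmax disagrees with the true label, some margin term exceeds `delta`; when it agrees, the hinge is still at least zero.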
3.2.2 The Doubly Stochastic Subgradient Algorithm
The variational formulation of problem (4) naturally suggests that we can develop a variational algorithm to address the intractability of the true posterior. We now present a new algorithm to solve problem (4). Our method is a doubly stochastic generalization of the Pegasos (i.e., Primal Estimated subGrAdient SOlver for SVM) algorithm [49] for the classic SVMs with fully observed input features, with the new extension of dealing with a highly nontrivial structure of latent variables.
First, we make the structured mean-field (SMF) assumption that $q(\eta, Z) = q(\eta) q(Z)$. Under this assumption, we have the discriminant function $F(y; x) = \mathbb{E}_{q(\eta)}[\eta]^\top \mathbb{E}_{q(z)}[f(y, z)]$. Moreover, we can solve for the optimal solution of $q(\eta)$ in some analytical form. In fact, by the calculus of variations, we can show that, given the other parts, the solution is $q(\eta) \propto p_0(\eta) \exp\big(\eta^\top \sum_{n, y} \omega_n^y \, \mathbb{E}_{q(z)}[\Delta f_n(y)]\big)$, where $\omega$ are the Lagrange multipliers (see [58] for details). If the prior is normal, $p_0(\eta) = \mathcal{N}(0, \sigma^2 I)$, we have the normal posterior $q(\eta) = \mathcal{N}\big(\sigma^2 \sum_{n, y} \omega_n^y \, \mathbb{E}_{q(z)}[\Delta f_n(y)], \, \sigma^2 I\big)$. Therefore, even though we did not make a parametric assumption on $q(\eta)$, the above results show that the optimal posterior distribution of $\eta$ is Gaussian. Since we only use the expectation of $\eta$ in the optimization problem and in prediction, we can directly solve for the mean parameter, which we still denote by $\eta$ for simplicity, instead of the full distribution $q(\eta)$. Further, in this case we can verify that $\mathrm{KL}(q(\eta) \| p_0(\eta)) = \frac{\|\eta\|_2^2}{2\sigma^2} + \text{const}$, and then the equivalent objective function in terms of $\eta$ can be written as:

(5) $\min_{\theta, \phi, \eta} \; \mathcal{L}(\theta, \phi; X) + \frac{\|\eta\|_2^2}{2\sigma^2} + C \, \mathcal{R}(\eta, \phi; X, Y),$

where $\mathcal{R} := \sum_{n=1}^N \mathcal{R}_n$ is the total hinge loss, and the per-sample hinge loss is $\mathcal{R}_n := \max_y \big( \Delta l_n(y) - \eta^\top \mathbb{E}_{q_\phi(z_n)}[\Delta f_n(y)] \big)$. Below, we present a doubly stochastic subgradient descent algorithm to solve this problem.
The first stochasticity arises from a stochastic estimate of the objective by random minibatches. Specifically, batch learning needs to scan the full dataset to compute subgradients, which is often too expensive for large-scale datasets. One effective technique is stochastic subgradient descent [49], where at each iteration we randomly draw a minibatch of the training data and then do the variational updates over the small minibatch. Formally, given a minibatch of size $B$, we get an unbiased estimate of the objective:

$\tilde{\mathcal{J}} := \frac{N}{B} \sum_{b=1}^B \mathcal{L}(\theta, \phi; x_b) + \frac{\|\eta\|_2^2}{2\sigma^2} + \frac{C N}{B} \sum_{b=1}^B \mathcal{R}_b.$
The second stochasticity arises from a stochastic estimate of the persample variational bound and its subgradient, whose intractability calls for another Monte Carlo estimator.
Formally, let $\{z_{nl}\}_{l=1}^L$ be a set of samples from the variational distribution $q_\phi(z_n | x_n, y_n)$, where we explicitly put the conditions. Then, the estimates of the per-sample variational bound and the per-sample hinge loss are

$\tilde{\mathcal{L}}(\theta, \phi; x_n) := \frac{1}{L} \sum_{l=1}^L \big( \log q_\phi(z_{nl}) - \log p(x_n, z_{nl} | \theta) \big)$

and

$\tilde{\mathcal{R}}_n := \max_y \Big( \Delta l_n(y) - \frac{1}{L} \sum_{l=1}^L \eta^\top \Delta f_n(y; z_{nl}) \Big),$

respectively, where $\Delta f_n(y; z) := f(y_n, z) - f(y, z)$. Note that $\tilde{\mathcal{L}}$ is an unbiased estimate of $\mathcal{L}$, while $\tilde{\mathcal{R}}_n$ is a biased estimate of $\mathcal{R}_n$. Nevertheless, we can still show that $\tilde{\mathcal{R}}_n$ is an upper bound estimate of $\mathcal{R}_n$ under expectation. Furthermore, this biasedness does not affect our estimate of the gradient. In fact, by using the equality $\nabla_\phi q_\phi(z) = q_\phi(z) \nabla_\phi \log q_\phi(z)$, we can construct an unbiased Monte Carlo estimate of $\nabla_\phi \big( \mathcal{L}(\theta, \phi; x_n) + C \mathcal{R}_n \big)$ as:

(6) $\tilde{g}_\phi := \frac{1}{L} \sum_{l=1}^L \nabla_\phi \log q_\phi(z_{nl}) \Big( \log q_\phi(z_{nl}) - \log p(x_n, z_{nl} | \theta) - C \eta^\top \Delta f_n(\tilde{y}_n; z_{nl}) \Big),$

where the last term roots from the hinge loss with the loss-augmented prediction $\tilde{y}_n := \operatorname*{argmax}_y \big( \Delta l_n(y) - \frac{1}{L} \sum_l \eta^\top \Delta f_n(y; z_{nl}) \big)$. For $\theta$ and $\eta$, the estimates of the gradient and the subgradient are easier, which are:

$\nabla_\theta \tilde{\mathcal{L}} = -\frac{1}{L} \sum_{l=1}^L \nabla_\theta \log p(x_n, z_{nl} | \theta)$

and

$\nabla_\eta \tilde{\mathcal{R}}_n = -\frac{1}{L} \sum_{l=1}^L \Delta f_n(\tilde{y}_n; z_{nl}).$

Notice that the sampling and the gradient $\nabla_\phi \log q_\phi(z_{nl})$ only depend on the variational distribution, not the underlying model.
The above estimates consider the general case where the variational bound is intractable. In some cases, we can compute the KL-divergence term analytically, e.g., when the prior and the variational distribution are both Gaussian. In such cases, we only need to estimate the remaining intractable part by sampling, which often reduces the variance [22]. Similarly, we could use the expectation of the features $\mathbb{E}_{q_\phi}[\Delta f_n(y)]$ directly in the computation of the subgradients (e.g., $\nabla_\eta \tilde{\mathcal{R}}_n$ and $\nabla_\phi \tilde{\mathcal{R}}_n$), if it can be computed analytically, instead of sampling, which again can lead to variance reduction.
With the above estimates of subgradients, we can use stochastic optimization methods such as SGD [49] and AdaM [20] to update the parameters, as outlined in Alg. 1. Overall, our algorithm is a doubly stochastic generalization of Pegasos to deal with the highly nontrivial latent variables.
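The overall loop can be sketched on a toy two-class problem as below. The fixed per-input Gaussian standing in for the learned recognition model, the separable synthetic data and all constants are illustrative assumptions, not the paper's architecture; the point is only to show the two sources of stochasticity (random minibatches and Monte Carlo latent samples) feeding a Pegasos-style update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 2-D "latent means" with labels given by the larger coordinate.
N, K, M = 200, 2, 2                       # points, latent dim, classes
classes = rng.integers(M, size=N)
means = rng.standard_normal((N, K)) + 3.0 * np.eye(K)[classes]
labels = np.argmax(means, axis=1)
eta = np.zeros((M, K))                    # classifier weights, one row per class

def hinge_subgrad(eta, z, y, delta=1.0):
    """Subgradient of max_y [delta*1(y!=y_true) + eta_y.z - eta_{y_true}.z]."""
    scores = eta @ z + delta * (np.arange(M) != y)
    y_star = int(np.argmax(scores))       # loss-augmented prediction
    g = np.zeros_like(eta)
    if y_star != y:
        g[y_star] += z
        g[y] -= z
    return g

B, L, lr, lam = 16, 5, 0.05, 1e-3         # minibatch, MC samples, step, decay
for t in range(300):
    idx = rng.integers(N, size=B)          # stochasticity 1: random minibatch
    grad = lam * eta                       # Gaussian-prior regularizer on eta
    for n in idx:
        z_samples = means[n] + 0.1 * rng.standard_normal((L, K))
        for z in z_samples:                # stochasticity 2: MC latent samples
            grad = grad + hinge_subgrad(eta, z, labels[n]) / (B * L)
    eta -= lr * grad

accuracy = float(np.mean(np.argmax(eta @ means.T, axis=0) == labels))
```

On this toy problem the learned `eta` approaches the coordinate-comparison rule that generated the labels, so the training accuracy ends up high.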
Now, the remaining question is how to define an appropriate variational distribution to obtain a robust estimate of the subgradients as well as the objective. Two types of methods have been developed for unsupervised DGMs, namely, variance reduction [38] and auto-encoding variational Bayes (AVB) [22]. Though both methods can be used for our models, we focus on the AVB approach. For continuous variables $z$, under certain mild conditions we can reparameterize the variational distribution $q_\phi(z | x)$ using some simple variables $\epsilon$. Specifically, we can draw samples $\epsilon$ from some simple distribution $p(\epsilon)$ and do the transformation $z = g_\phi(\epsilon, x)$ to get a sample from the distribution $q_\phi(z | x)$. We refer the readers to [22] for more details. In our experiments, we consider the special Gaussian case, where we assume that the variational distribution is a multivariate Gaussian with a diagonal covariance matrix:

(7) $q_\phi(z | x) = \mathcal{N}\big(\mu(x; \phi), \operatorname{diag}(\sigma^2(x; \phi))\big),$

whose mean and variance are functions of the input data. This defines our recognition model. Then, the reparameterization trick is as follows: we first draw standard normal variables $\epsilon \sim \mathcal{N}(0, I)$ and then do the transformation $z = \mu(x; \phi) + \sigma(x; \phi) \odot \epsilon$ to get a sample. For simplicity, we assume that both the mean and variance are functions of $x$ only. However, it is worth emphasizing that although the recognition model is unsupervised, the parameters $\phi$ are learned in a supervised manner, because the subgradient (6) depends on the hinge loss. Further details of the experimental settings are presented in Sec. 4.1.
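A minimal sketch of the reparameterization step, with fixed illustrative values of the mean and standard deviation (in practice both would be outputs of the recognition network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Reparameterization for q(z|x) = N(mu, diag(sigma^2)):
# draw eps ~ N(0, I) and set z = mu + sigma * eps. The sample is a
# deterministic, differentiable function of (mu, sigma), so gradients
# can flow through the sampling step into the recognition network.
mu = np.array([1.0, -2.0])
sigma = np.array([0.5, 2.0])

eps = rng.standard_normal((100000, 2))
z = mu + sigma * eps                # reparameterized samples

# Empirical moments match the target Gaussian.
assert np.allclose(z.mean(axis=0), mu, atol=0.05)
assert np.allclose(z.std(axis=0), sigma, atol=0.05)
```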
3.3 Conditional Variants for Semi-supervised Learning
As collecting labeled data is often costly and time-consuming, semi-supervised learning (SSL) [62] is an important setting, where easy-to-obtain unlabeled data are leveraged to improve the quality of the classifier. We now present an extension of mmDGMs to the semi-supervised learning scenario.
Given a labeled dataset $\mathcal{D}_L = \{(x_n, y_n)\}_{n=1}^{N_L}$ and an unlabeled dataset $\mathcal{D}_U = \{x_n\}_{n=1}^{N_U}$, where the size $N_U$ is typically much larger than $N_L$, the goal of SSL is to explore the intrinsic structures underlying the unlabeled data to help learn a classifier. As the learning objective of mmDGMs consists of two parts (a data likelihood and a classification loss), a naive approach to considering unlabeled data is to simply ignore the loss term when the class label is missing. However, such ignorance leads to a weak coupling between the likelihood model and the classifier. Below, we present a conditional variant of mmDGMs, namely max-margin deep conditional generative models (mmDCGMs), to strongly couple the classifier and the data likelihood.
As in mmDGMs, an mmDCGM consists of two components: (1) a deep max-margin classifier to infer labels given data, and (2) a class-conditional deep generative model to describe the joint distribution of the data, labels and latent variables. Fig. 1 compares the graphical models of the mmDGM and mmDCGM. Below, we present the learning objective of mmDCGM formally, which consists of several key components. For notation simplicity, we will omit the parameters $\theta$, $\phi$ and $\eta$ in the following formulae if no confusion arises.
Generative loss: The first part of our learning objective is a generative loss to describe the observed data. For the labeled data whose $y$ is visible, mmDCGM maximizes the joint likelihood of the pair $(x, y)$, $\log p(x, y)$, which is lower bounded by:

(8) $\log p(x, y) \ge \mathbb{E}_{q(z | x, y)}\big[\log p(x, y, z) - \log q(z | x, y)\big] =: -\mathcal{L}(x, y).$

For the unlabeled data whose $y$ is hidden, we can maximize the marginal likelihood by integrating out the hidden labels, whose variational lower bound is:

(9) $\log p(x) \ge \sum_{y} q(y | x) \big( -\mathcal{L}(x, y) \big) + \mathcal{H}\big(q(y | x)\big) =: -\mathcal{U}(x).$
These lower bounds were adopted in the previous method [21]. However, one issue with this method is computational inefficiency when dealing with a large set of unlabeled data and a large number of classes. This is because we need to compute the lower bounds of the joint likelihood for all possible $y$ for each unlabeled data point.
To make it computationally efficient, we propose to use the prediction of a classifier, $\tilde{y} := \operatorname*{argmax}_y F(y; x)$, as a point estimate to approximate the full posterior $q(y | x)$ and speed up the inference procedure. We denote the classifier by $F$
because it is not restricted to a specific form with a proper distribution over labels, but is an unnormalized one trained under the max-margin principle. Indeed, the outputs of the classifier are real values transformed by linear operations, denoting the signed distance from the data to the hyperplanes defined by the weights. Consequently, the entropy term should be zero and the lower bound turns out to be:

(10) $\log p(x) \ge -\mathcal{L}(x, \tilde{y}).$

Note that the lower bound is valid because we can view the prediction as a delta distribution $q(y | x) = \delta_{\tilde{y}}(y)$.
With the above derivations, we define the overall generative loss as the summation of the negative variational bounds over $\mathcal{D}_L$ and $\mathcal{D}_U$:

(11) $\mathcal{J} := \sum_{(x, y) \in \mathcal{D}_L} \mathcal{L}(x, y) + \sum_{x \in \mathcal{D}_U} \mathcal{L}(x, \tilde{y}).$
Hinge loss: The second part of our learning objective is a hinge loss on the labeled data. Specifically, though the labeled data can contribute to the training of the classifier implicitly through the objective function in Eqn. (11), it has been shown that adding a predictive loss for the labeled data can speed up convergence and achieve better results [21, 34]. Here, we adopt a similar idea by introducing a hinge loss as the discriminative regularization for the labeled data:

(12) $\mathcal{R}_L := \sum_{(x_n, y_n) \in \mathcal{D}_L} \max_y \big( \Delta l_n(y) + F(y; x_n) - F(y_n; x_n) \big),$
which is the same as in the fully supervised case.
Hat loss: The third part of our learning objective is a hat loss on the unlabeled data. Specifically, as $N_U$ is typically much larger than $N_L$ in semi-supervised learning, it is desirable that the unlabeled data regularize the behaviour of the classifier explicitly. To this end, we further propose a max-margin “hat loss” [62] for the unlabeled data as follows:

(13) $\mathcal{R}_U := \sum_{x_n \in \mathcal{D}_U} \max_y \big( \Delta \tilde{l}_n(y) + F(y; x_n) - F(\tilde{y}_n; x_n) \big),$

where $\tilde{y}_n := \operatorname*{argmax}_y F(y; x_n)$ and $\Delta \tilde{l}_n(y) := l \, \mathbb{I}(y \ne \tilde{y}_n)$ is a function that indicates whether $y$ equals the prediction $\tilde{y}_n$ or not. Namely, we treat the prediction as a putative label and apply the hinge loss function to the unlabeled data. This function is called the hat loss due to its shape in a binary classification example [62]. Intuitively, the hinge loss enforces the predictor to make predictions correctly and confidently with a large margin for labeled data, while the hat loss only requires the predictor to make decisions confidently for unlabeled data. The hat loss, which was originally proposed in S3VMs [54], assumes that the decision boundary tends to lie in low-density areas of the feature space. In such shallow models, the correctness of the assumption heavily depends on the true data distribution, which is fixed but unknown. However, the constraint is much more relaxed when built upon the latent feature space learned by a deep model, as in our method. In practice, the predictive performance of mmDCGMs is improved substantially by adding this regularization, as will be shown in Sec. 4.3.
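The behaviour of the hat loss can be seen in a small numeric sketch (the score vectors below are illustrative): it vanishes exactly when the top class wins every other class by the margin, regardless of whether the prediction is correct.

```python
import numpy as np

def hat_loss(scores, delta=1.0):
    """Max-margin 'hat loss' for one unlabeled example: treat the current
    prediction y_hat = argmax(scores) as a putative label and apply the
    hinge loss against it. Zero iff the prediction beats every other class
    by at least `delta`, i.e. it rewards confident decisions, not correct
    ones."""
    y_hat = int(np.argmax(scores))
    margins = scores - scores[y_hat] + delta * (np.arange(len(scores)) != y_hat)
    return float(np.max(margins))

confident = np.array([4.0, 0.5, -1.0])   # top class wins by a wide margin
uncertain = np.array([1.0, 0.9, 0.8])    # classes nearly tied

assert hat_loss(confident) == 0.0        # no penalty for confident decisions
assert hat_loss(uncertain) > 0.0         # low-margin decisions are penalized
```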
Label-balance regularization: The last part of our learning objective is a regularization term to balance the possible label predictions on the unlabeled data. Specifically, one practical problem of semi-supervised learning is the imbalance of the predictions [62]; that is, a classifier may classify most of the unlabeled points as the same class. To address this problem, we introduce a balance constraint for multiclass semi-supervised learning:

(14) $\frac{1}{N_U} \sum_{x_n \in \mathcal{D}_U} \mathbb{I}(\tilde{y}_n = k) = \frac{1}{N_L} \sum_{(x_n, y_n) \in \mathcal{D}_L} \mathbb{I}(y_n = k), \quad \forall k,$

which assumes that the distribution of the predictions on unlabeled data should be the same as that of the ground-truth labels in the labeled set. However, both sides in Eqn. (14) are summations of indicator functions, which are non-differentiable with respect to $\eta$. Therefore, we cannot optimize $\eta$ based on gradient methods to satisfy this constraint directly. Here, we relax the constraint (14) as:

(15) $\frac{1}{N_U} \sum_{n} \mathbb{I}(\tilde{y}_n = k) F(k; x_n) = \frac{1}{N_L} \sum_{n} \mathbb{I}(y_n = k) F(k; x_n), \quad \forall k,$

where we simplify the summation notation: the first sum is over $\mathcal{D}_U$ and the second over $\mathcal{D}_L$. Given a certain class $k$, the left-hand side selects the unlabeled data whose predictions equal $k$ according to the indicator functions, and adds the corresponding activations (discriminant functions $F(k; x_n)$, divided by the factor $N_U$) together. The right-hand side computes the analogously normalized activations, with indicator functions on the same class, for the labeled data. Note that $F(\tilde{y}_n; x_n)$ is no smaller than $F(y; x_n)$ for any other $y$, due to the definitions of the prediction $\tilde{y}_n$ and the indicator function. The gradients in the relaxed version are still not well defined due to the indicator functions. However, assuming that the predictions are given, both sides in Eqn. (15) are summations without indicator functions, which are differentiable with respect to $\eta$. In our experiments, we indeed ignore the dependency of the indicator functions on $\eta$ and approximate the total gradients by the gradients of the cumulative activations. This approximation does not work for the constraint in Eqn. (14), because both sides turn out to be scalars given the predictions, and the gradient with respect to $\eta$ is zero almost everywhere, which cannot be used to optimize parameters. In fact, the relaxed constraint balances the predictions of unlabeled data according to the ground truth implicitly, under the further assumption that the cumulative activation is proportional to the number of predictions for any $k$. Intuitively, if the cumulative activation of the selected unlabeled data in a certain class $k$ is larger than that of the labeled data, then probably the predictor classifies some unlabeled data as $k$ incorrectly. Consequently, $\eta$ is updated to reduce the activations, and then the number of predictions in this class will decrease, because $F(k; x_n)$ may become smaller than $F(y; x_n)$ for some other $y$. Moreover, as hard constraints are unlikely to be satisfied in practice, we further relax them by using a regularization penalty in the common $\ell_2$-norm:

$\mathcal{R}_B := \sum_{k} \Big( \frac{1}{N_U} \sum_{x_n \in \mathcal{D}_U} \mathbb{I}(\tilde{y}_n = k) F(k; x_n) - \frac{1}{N_L} \sum_{(x_n, y_n) \in \mathcal{D}_L} \mathbb{I}(y_n = k) F(k; x_n) \Big)^2.$
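The relaxed penalty can be sketched as below; the function name, the random test scores and the normalization by dataset sizes follow the description in the text, but the concrete values are illustrative assumptions.

```python
import numpy as np

def balance_penalty(scores_u, scores_l, y_l, num_classes):
    """Relaxed label-balance regularizer (a sketch): for each class k,
    compare the normalized cumulative activation of unlabeled points
    predicted as k with that of labeled points whose ground truth is k,
    and penalize the squared difference."""
    preds_u = np.argmax(scores_u, axis=1)  # point-estimate labels
    penalty = 0.0
    for k in range(num_classes):
        lhs = scores_u[preds_u == k, k].sum() / len(scores_u)
        rhs = scores_l[y_l == k, k].sum() / len(scores_l)
        penalty += (lhs - rhs) ** 2
    return float(penalty)

rng = np.random.default_rng(0)
scores_l = rng.standard_normal((20, 3))    # labeled-set activations
y_l = rng.integers(3, size=20)             # ground-truth labels
scores_u = rng.standard_normal((100, 3))   # unlabeled-set activations
p = balance_penalty(scores_u, scores_l, y_l, 3)
```

Treating the predictions inside the indicator functions as fixed, the penalty is an ordinary differentiable function of the activations, matching the gradient approximation described above.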
With the above sub-objectives, our final objective function is a weighted sum:

(16) $\min \; \mathcal{J} + \alpha \mathcal{R}_L + \beta \mathcal{R}_U + \gamma \mathcal{R}_B,$

where $\alpha$, $\beta$ and $\gamma$ are hyperparameters that control the relative weights of the corresponding terms. We will discuss the choice of each value in Sec. 4.1.
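Assembled, the training objective is simply this weighted combination; the helper name and the default weight values below are illustrative placeholders, not the paper's settings:

```python
def total_objective(gen_loss, hinge_loss, hat_loss, balance_penalty,
                    alpha=1.0, beta=0.1, gamma=0.01):
    """Weighted sum of the four sub-objectives: generative loss, labeled
    hinge loss, unlabeled hat loss, and label-balance penalty."""
    return gen_loss + alpha * hinge_loss + beta * hat_loss + gamma * balance_penalty

# Example: each term contributes in proportion to its weight.
obj = total_objective(1.0, 2.0, 3.0, 4.0)
```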
To optimize the overall learning objective, we still use the doubly stochastic algorithm described in Sec. 3.2.2 to compute the (sub)gradient estimates for all of the parameters and perform updates. Specifically, given a minibatch of data consisting of labeled pairs and unlabeled points, we sequentially:

predict $\tilde{y}_n$ using the classifier for each unlabeled $x_n$;

plug the predictions of the unlabeled data and the ground truth of the labeled data into the indicator functions in the label-balance regularization;

take (sub)gradients with respect to all parameters in the generative model, recognition model and classifier to optimize the final objective (16);

approximate the (sub)gradients with intractable expectations using the techniques described in Sec. 3.2.2 and update parameters.
Though the objective in semi-supervised learning is complex, our method works well in practice.
4 Experiments
We now present the experimental results in both supervised and semi-supervised learning settings. Our results on several benchmark datasets demonstrate that both mmDGMs and mmDCGMs are highly competitive in classification while retaining the generative ability, compared with various strong competitors.
4.1 Experiment Settings
Though mmDGMs and mmDCGMs are applicable to any DGMs that define a joint distribution of and respectively, we concentrate on the Variational Autoencoder (VA) [22] and Conditional VA [21] in our experiments. We consider two types of recognition models: multiple layer perceptrons (MLPs) and convolutional neural networks (CNNs). We denote our mmDGM with MLPs by MMVA. To perform classification using VA which is unsupervised, we first learn the feature representations by VA, and then build a linear SVM classifier on these features using the Pegasos stochastic subgradient algorithm [49]. This baseline will be denoted by VA+Pegasos. The corresponding models with CNNs are denoted by ConvMMVA and ConvVA+Pegasos respectively. We denote our mmDCGM with CNNs by ConvMMCVA
. We implement all experiments based on Theano [2]. (Source code and more detailed settings can be found at https://github.com/thuml/mmdcgmssl.)
4.1.1 Datasets and Preprocessing
We evaluate our models on the widely adopted MNIST [24], SVHN [39] and small NORB [25] datasets. MNIST consists of handwritten digits of 10 different classes (0 to 9). There are 50,000 training samples, 10,000 validation samples and 10,000 testing samples, each of size 28×28. SVHN is a large dataset of 32×32 color images, where the task is to recognize the center digit in natural scene images. We follow [48, 15] to split the dataset into 598,388 training data, 6,000 validation data and 26,032 testing data. The small NORB dataset consists of gray-scale images distributed across 5 general classes: animal, human, airplane, truck and car. Both the training set and the testing set in NORB contain 24,300 samples with different lighting conditions and azimuths. We downsample the images to 32×32 as in [34] and split 1,000 samples from the training set as validation data when required.
For fair comparison in supervised learning on SVHN, we perform Local Contrast Normalization (LCN) in the ConvMMVA experiments following [48, 15] and model the conditional distribution of the data given the latent variables as Gaussian. In all other cases, we simply normalize the data by a factor of 256 and model the data with a Bernoulli distribution.
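For reference, a simplified box-window variant of LCN can be sketched in a few lines of numpy; the cited works use a Gaussian weighting, and the window size and epsilon below are illustrative choices of ours:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def local_contrast_normalize(img, k=7, eps=1e-4):
    """Subtract the local mean and divide by the local std over a k x k window.

    A simplified (box-window) stand-in for the Gaussian-weighted LCN of
    [48, 15]; operates on a single-channel 2-D image with values in [0, 1].
    """
    pad = k // 2
    padded = np.pad(img, pad, mode='reflect')
    windows = sliding_window_view(padded, (k, k))          # H x W x k x k
    centered = img - windows.mean(axis=(-1, -2))           # remove local mean
    padded_c = np.pad(centered, pad, mode='reflect')
    std = np.sqrt(sliding_window_view(padded_c, (k, k)).var(axis=(-1, -2)))
    # Floor the divisor to avoid amplifying noise in flat regions.
    return centered / np.maximum(std, max(std.mean(), eps))

img = np.random.default_rng(0).random((32, 32))
out = local_contrast_normalize(img)
```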
4.1.2 Supervised Learning
In mmDGMs, the recognition network and the classifier share layers in computation. The mean and variance of the latent variables are transformed from the last layer of the recognition model through an affine transformation. Note that we could use not only the expectation of the latent variables but also the activations of any layer in the recognition model as features; the only theoretical difference is the point from which we add the hinge-loss regularization to the gradient and backpropagate it to previous layers. In all of the experiments, the mean of the latent variables has the same nonlinearity as, but a typically much lower dimension than, the activation of the last layer in the recognition model, and hence often leads to worse performance. We therefore use different features in MMVA and ConvMMVA, as explained below. We use AdaM [20] to optimize the parameters in all of the models. Although AdaM is an adaptive gradient-based optimization method, we decay the global learning rate by a constant factor after a sufficient number of epochs to ensure stable convergence.
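The update rule and the decayed global rate can be sketched as follows; the schedule constants (base rate, decay factor, decay interval) are illustrative values of ours, not the paper's settings:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr, b1=0.9, b2=0.999, eps=1e-8):
    """One AdaM update [20]; `lr` is the (possibly decayed) global learning rate."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)            # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)            # bias-corrected second moment
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Hypothetical schedule: halve the global rate every 100 epochs after epoch 300.
def global_lr(epoch, base=3e-4, start=300, every=100, factor=0.5):
    return base * factor ** max(0, (epoch - start) // every + 1) if epoch >= start else base

theta = np.zeros(5); m = np.zeros(5); v = np.zeros(5)
for t in range(1, 11):
    grad = theta - 1.0                   # gradient of 0.5 * ||theta - 1||^2
    theta, m, v = adam_step(theta, grad, m, v, t, lr=global_lr(epoch=0))
```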
In MMVA, we follow the settings in [21] to compare the generative and discriminative capacity of VA and MMVA. Both the recognition and generative models employ a two-layer MLP with 500 hidden units in each layer, and the dimension of the latent variables is 50. We use a default value of C in MMVA. We concatenate the activations of the two layers as the features used in the supervised tasks. We illustrate the network architecture of MMVA in Appendix A.
In ConvMMVA, we use standard CNNs [24] with convolution and max-pooling operations as the recognition model to obtain more competitive classification results. For the generative model, we use unconvnets [11] with a structure symmetric to the recognition model, to reconstruct the input images approximately. More specifically, the top-down generative model has the same structure as the bottom-up recognition model, but replaces max-pooling with the unpooling operation [11] and applies unpooling, convolution and rectification in that order. Typically, there are 5 or 6 convolutional layers in both the generative model and the recognition model, and the kernel size is either 5 or 3, depending on the data. The total number of parameters is comparable with previous work [15, 31, 26] and the split of the training sets is the same. For simplicity, we do not include mlpconv layers [31, 26] or contrast normalization layers in our recognition model, though they are compatible with our model. We set default values of C separately on MNIST and SVHN. We use the activations of the last deterministic layer as the features. We illustrate the network architecture of ConvMMVA with Gaussian hidden variables and Bernoulli visible variables in Fig. 2.
4.1.3 Semi-supervised Learning
The mmDCGM completely separates the classifier from the recognition model of the latent variables, which allows us to simply combine a state-of-the-art classifier and deep generative model without competition between the two. We only consider convolutional neural networks here and adopt advanced techniques including global average pooling [32] and batch normalization [19] to boost the performance of our ConvMMCVA. The architecture of the max-margin classifier follows that of the discriminator in [50], and the generative model is similar to that of ConvMMVA but concatenates the feature maps with additional label maps in one-hot encoding format at each layer, as in [40]. As in ConvMMVA, the depth of each convolutional network is 5 or 6. We set C following the conditional VAE [21]. We choose the weights of the remaining regularization terms via grid search, in terms of the validation classification error of a shallow S3VM on MNIST given 100 labels, and fix the best values in our ConvMMCVA across all of the datasets. Other hyperparameters, including the annealing strategy and batch size, are chosen according to the validation generative loss. Once the hyperparameters are fixed, we run our model 10 times with different random splits of the labeled and unlabeled data, and report the mean and standard deviation of the error rates.
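The grid search over the two regularization weights amounts to the following sketch, where `validation_error` is a hypothetical stand-in for training the shallow S3VM and measuring its validation error:

```python
import itertools

def grid_search(validation_error, grid_a, grid_b):
    """Pick the pair of regularization weights minimizing validation error."""
    return min(itertools.product(grid_a, grid_b),
               key=lambda ab: validation_error(*ab))

# Toy stand-in for the S3VM validation error on MNIST with 100 labels;
# the grids and the quadratic error surface are illustrative only.
err = lambda a, b: (a - 1.0) ** 2 + (b - 0.1) ** 2
best_a, best_b = grid_search(err, [0.1, 1.0, 10.0], [0.01, 0.1, 1.0])
```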
4.2 Results with Supervised Learning
We first present the results in the supervised learning setting. Specifically, we evaluate the predictive and generative performance of our MMVA and ConvMMVA on the MNIST and SVHN datasets in various tasks, including classification, sample generation, and missing data imputation.
4.2.1 Predictive Performance
C | Error Rate (%) | Lower Bound
0 | 1.35 | 93.17
1 | 1.86 | 95.86
  | 0.88 | 95.90
  | 0.54 | 96.35
  | 0.45 | 99.62
  | 0.43 | 112.12
We test both MMVA and ConvMMVA on the MNIST dataset. In the MLP case, the first three rows in Table LABEL:mnistbasictable compare VA+Pegasos, VA+Class-conditional VA and MMVA, where VA+Class-conditional VA refers to the best fully supervised model in [21]. Our model outperforms the baselines significantly. We further use the t-SNE algorithm [35] to embed the features learned by VA and MMVA into the 2-D plane, which again demonstrates the stronger discriminative ability of MMVA (see Appendix B for details).
In the CNN case, Table LABEL:effectc shows the effect of C on the classification error rate and the variational lower bound. Typically, as C gets larger, ConvMMVA learns more discriminative features at the cost of a worse estimate of the data likelihood. However, if C is too small, the supervision is not enough to yield predictive features; an intermediate value of C gives quite a good trade-off between classification and generative performance. In this setting, the classification performance of our ConvMMVA model is comparable to that of state-of-the-art fully discriminative networks with comparable architectures and numbers of parameters, shown in the last four rows of Table LABEL:mnistbasictable.
We focus on ConvMMVA on the SVHN dataset as it is more challenging. Table LABEL:svhnbasictable shows the predictive performance on SVHN. In this harder problem, we observe a larger improvement of ConvMMVA over ConvVA+Pegasos, suggesting that DGMs benefit greatly from max-margin learning on image classification. We also compare ConvMMVA with state-of-the-art results. To the best of our knowledge, there are no competitive generative models for classifying digits on the SVHN dataset with full labels.
4.2.2 Generative Performance
We investigate the capability of MMVA and ConvMMVA to generate samples. Fig. 3 and Fig. 4 illustrate images randomly sampled from the VA and MMVA models on MNIST and SVHN respectively, where we output the expectation of the value at each pixel to obtain a smooth visualization. Fig. 4 demonstrates the benefits of jointly training DGMs and max-margin classifiers. Though ConvVA gives a tighter lower bound on the data likelihood and reconstructs data more elaborately, it fails to learn the pattern of digits in this complex scenario and cannot generate meaningful images. In this scenario, the hinge-loss regularization on the recognition model helps generate the main objects to be classified in the images.
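Outputting the per-pixel expectation rather than a binary sample can be sketched as follows for a Bernoulli decoder; the logits here are random placeholders for an actual decoder's output:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Placeholder decoder logits for a 28 x 28 Bernoulli image.
logits = rng.normal(size=(28, 28))

# Drawing a Bernoulli sample per pixel gives a noisy binary image ...
sample = (rng.random(logits.shape) < sigmoid(logits)).astype(float)
# ... while the per-pixel expectation gives the smooth visualization.
smooth = sigmoid(logits)
```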
Noise Type | VA | MMVA | ConvVA | ConvMMVA
RandDrop (0.2) | 0.0109 | 0.0110 | 0.0111 | 0.0147
RandDrop (0.4) | 0.0127 | 0.0127 | 0.0127 | 0.0161
RandDrop (0.6) | 0.0168 | 0.0165 | 0.0175 | 0.0203
RandDrop (0.8) | 0.0379 | 0.0358 | 0.0453 | 0.0449
Rect (6×6) | 0.0637 | 0.0645 | 0.0585 | 0.0597
Rect (8×8) | 0.0850 | 0.0841 | 0.0754 | 0.0724
Rect (10×10) | 0.1100 | 0.1079 | 0.0978 | 0.0884
Rect (12×12) | 0.1450 | 0.1342 | 0.1299 | 0.1090
4.2.3 Missing Data Imputation and Classification
We further test MMVA and ConvMMVA on the task of missing data imputation. For MNIST, we consider two types of missing values [33]: (1) RandDrop: each pixel is missing randomly with a prespecified probability; and (2) Rect: a rectangle located at the center of the image is missing. Given the perturbed images, we initialize the missing values uniformly between 0 and 1, and then iteratively do the following: (1) use the recognition model to sample the hidden variables; (2) predict the missing values to generate images; and (3) use the refined images as the input of the next round. For SVHN, we follow the same procedure as for MNIST but initialize the missing values with Gaussian random variables, as the input distribution changes.
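The three-step imputation loop can be sketched as below, with a fixed orthogonal projection standing in for the learned recognition and generative networks (all names are ours, and the toy dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 16, 4
# Toy stand-in for the model: an orthogonal projection takes the place of the
# recognition (encode) and generative (decode) networks.
W, _ = np.linalg.qr(rng.normal(size=(D, H)))     # D x H, orthonormal columns

def encode(x):      # stand-in for sampling hidden variables from the recognition model
    return W.T @ x

def decode(z):      # stand-in for the generative model's reconstruction
    return W @ z

def impute(x_observed, mask, iters=100):
    """mask[i] is True where pixel i is observed; missing entries are
    initialized uniformly in [0, 1] and refined iteratively."""
    x = np.where(mask, x_observed, rng.random(x_observed.shape))
    for _ in range(iters):
        z = encode(x)                            # (1) infer hidden variables
        x_hat = decode(z)                        # (2) regenerate the image
        x = np.where(mask, x_observed, x_hat)    # (3) keep observed pixels fixed
    return x

x = rng.random(D)
mask = rng.random(D) > 0.4                       # ~40% of pixels missing at random
x_filled = impute(x, mask)
```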
Intuitively, generative models with CNNs can be more powerful at learning patterns and high-level structures, while generative models with MLPs focus more on reconstructing the pixels in detail. This conforms to the MSE results shown in Table LABEL:deniosemse: ConvVA and ConvMMVA outperform VA and MMVA with a missing rectangle, while VA and MMVA outperform ConvVA and ConvMMVA with random missing values. Compared with the baselines, mmDGMs also produce more accurate completions when large patches are missing. All of the models infer missing values for 100 iterations.
We visualize the inference procedure of MMVA in Fig. 5. Considering both types of missing values, MMVA could infer the unknown values and refine the images in several iterations even with a large ratio of missing pixels. More visualization results on MNIST and SVHN are presented in Appendix C.
Noise Level | CNN | ConvVA | ConvMMVA
Rect (6×6) | 7.5 | 2.5 | 1.9
Rect (8×8) | 18.8 | 4.2 | 3.7
Rect (10×10) | 30.3 | 8.4 | 7.7
Rect (12×12) | 47.2 | 18.3 | 15.9
We further present classification results with missing values on MNIST in Table LABEL:errorwithmissing. The CNN makes predictions on the incomplete data directly, while ConvVA and ConvMMVA first infer the missing data for 100 iterations and then make predictions on the refined data. In this scenario, ConvMMVA outperforms both ConvVA and the CNN, which demonstrates the advantage of mmDGMs in having both strong discriminative and generative capabilities.
Overall, mmDGMs are comparably capable of inferring missing values and tend to learn high-level patterns rather than local details.
4.3 Results with Semisupervised Learning
We now present the predictive and generative results on MNIST, SVHN and small NORB datasets given partially labeled data.
4.3.1 Predictive Performance
Algorithm | n = 100 | n = 1,000 | ALL
M1+M2 [21] | 3.33 () | 2.40 () | 0.96
VAT [37] | 2.33 | 1.36 | 0.64
Ladder [42] | 1.06 () | 0.84 () | 0.57
CatGAN [50] | 1.91 () | 1.73 () | 0.91
ADGM [34] | 0.96 () | – | –
SDGM [34] | 1.32 () | – | –
ConvCatGAN [50] | 1.39 () | – | 0.48
ImprovedGAN [46] | 0.96 () | – | –
ConvLadder [42] | 0.89 () | – | –
ConvMMCVA | 1.24 () | 0.54 () | 0.31
We compare our ConvMMCVA with a large body of previous methods on the MNIST dataset under different settings in Table LABEL:sslmnisttable. Our method is competitive with the state-of-the-art results given 100 labels. As the number of labels increases, the max-margin principle significantly boosts the performance of ConvMMCVA relative to the other models, including the Ladder Network [42]. Indeed, given 1,000 labels, ConvMMCVA not only beats existing methods in the same setting, but is also comparable to the best supervised results of DGMs. The supervised learning results of ConvMMCVA again confirm that, by leveraging the max-margin principle, DGMs can achieve the same discriminative ability as state-of-the-art CNNs with comparable architectures. We analyze the effect of the number of labels on ConvMMCVA in Fig. 6, where the four curves share the same settings but use different random seeds to split the data and initialize the networks.
Algorithm | SVHN | NORB
M1+M2 [21] | 36.02 () | 18.79 ()
VAT [37] | 24.63 | 9.88
ADGM [34] | 22.86 | 10.06 ()
SDGM [34] | 16.61 () | 9.40 ()
ImprovedGAN [46] | 8.11 () | –
Ensemble10GANs [46] | 5.88 () | –
ConvMMCVA | 4.95 () | 6.11 ()
Table LABEL:sslrealtable shows the classification results on the more challenging SVHN and NORB datasets. Following previous methods [21, 37, 34], we use 1,000 labels on both datasets. Our methods outperform the previous state of the art substantially. Ensemble10GANs refers to an ensemble of 10 ImprovedGANs [46] with 9-layer classifiers, while we employ a single model with a shallower 6-layer classifier. Note that it is easy to further improve our model by using more advanced networks, e.g. ResNet [18], without competition between components thanks to the separated architectures. In this paper, we focus on comparable architectures for fairness.
We further analyze the effect of the regularization terms to investigate the possible reasons for the strong performance. If we omit the hat-loss regularization, ConvMMCVA suffers from overfitting and achieves only a 6.4% error rate on the MNIST dataset given 100 labels. The underlying reason is that we approximate full posterior inference by a greedy point estimate: if the prediction of the classifier is wrong, the generative model tends to interpret the unlabeled data with the incorrect label instead of forcing the classifier to find the true label, as in the previous conditional DGM [21]. The hat loss provides an effective way for the classifier to reach a sufficiently good classification result, which can then be fine-tuned according to the generative loss. In fact, trained to optimize the max-margin losses for both the labeled and unlabeled data, the classifier alone, without the DGM, achieves a 2.1% error rate on MNIST given 100 labels. These results demonstrate the effectiveness of our proposed max-margin loss for the unlabeled data. The label-balance regularization further reduces the error rate by 0.2% in this setting. Besides the excellent performance, our ConvMMCVA provides a potential way to apply class-conditional DGMs to large-scale datasets with many more categories, owing to the efficient inference.
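One simple way to realize a label-balance regularizer, sketched here with our own stand-in formulation (the paper's exact indicator-based form may differ), is to penalize the gap between the class frequencies predicted on unlabeled data and those observed on the labeled set:

```python
import numpy as np

def label_balance_penalty(pred_labels, true_labels_lab, K):
    """Penalize deviation between the class frequencies predicted on unlabeled
    data and the empirical class frequencies of the labeled set.

    This L1 formulation is our illustrative stand-in, not the paper's exact
    indicator-based regularizer.
    """
    freq_unlab = np.bincount(pred_labels, minlength=K) / len(pred_labels)
    freq_lab = np.bincount(true_labels_lab, minlength=K) / len(true_labels_lab)
    return float(np.abs(freq_unlab - freq_lab).sum())

# Predictions skewed toward class 0 incur a positive penalty.
penalty = label_balance_penalty(np.array([0, 0, 0, 1]), np.array([0, 1]), K=2)
```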
4.3.2 Generative Performance
We demonstrate that our ConvMMCVA can disentangle classes and styles given a small amount of labels on the MNIST, SVHN and NORB datasets, as shown in Fig. 7 and Fig. 8. The images are generated by conditioning on a label and a style vector. On the MNIST and SVHN datasets, ConvMMCVA is able to generate high-quality images and can capture the intensities, scales and colors of the images. Note that previous generation results on SVHN in the semi-supervised setting are either unconditioned [46] or based on preprocessed data [21]. Our samples are a little blurry on the NORB dataset, which contains elaborate images of 3-D toys with different lighting conditions and points of view. Nevertheless, ConvMMCVA can still separate these physical semantics from the general categories, beyond digits. To the best of our knowledge, there are no competitive generative models that generate NORB data class-conditionally given partially labeled data.
5 Conclusions
In this paper, we propose max-margin deep generative models (mmDGMs) and their class-conditional variants (mmDCGMs), which conjoin the predictive power of the max-margin principle and the generative ability of deep generative models. We develop a doubly stochastic subgradient algorithm to learn all parameters jointly, and consider two types of recognition models based on MLPs and CNNs respectively. We evaluate our mmDGMs and mmDCGMs in supervised and semi-supervised learning settings respectively. Given partially labeled data, we approximate the full posterior over the labels by a delta distribution for efficiency, and propose additional max-margin and label-balance losses on the unlabeled data for effectiveness.
We present extensive results demonstrating that our methods can significantly improve the prediction performance of deep generative models, while retaining a strong ability to generate input samples and complete missing values. By employing CNNs in our mmDGMs and mmDCGMs, we achieve low error rates on several datasets including MNIST, SVHN and NORB, which are competitive with the best fully discriminative networks in supervised learning and significantly improve on the previous state-of-the-art semi-supervised results.
Acknowledgments
The work was supported by the National Basic Research Program (973 Program) of China (No. 2013CB329403), National NSF of China (Nos. 61620106010, 61322308, 61332007), the Youth Topnotch Talent Support Program, and Tsinghua TNList Lab Big Data Initiative.
References

[1] Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. In International Conference on Machine Learning, 2003.
[2] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio. Theano: new features and speed improvements. In Advances in Neural Information Processing Systems, Deep Learning and Unsupervised Feature Learning Workshop, 2012.
 [3] Y. Bengio, E. Laufer, G. Alain, and J. Yosinski. Deep generative stochastic networks trainable by backprop. In International Conference on Machine Learning, 2014.
[4] Y. Bengio, L. Yao, G. Alain, and P. Vincent. Generalized denoising autoencoders as generative models. In Advances in Neural Information Processing Systems, 2013.
 [5] J. Bornschein and Y. Bengio. Reweighted wakesleep. International Conference on Learning Representations, 2015.
 [6] Y. Burda, R. Grosse, and R. Salakhutdinov. Importance weighted autoencoders. International Conference on Learning Representations, 2015.
[7] P. Burt and E. Adelson. The Laplacian pyramid as a compact image code. IEEE Transactions on Communications, 31(4):532–540, 1983.
 [8] N. Chen, J. Zhu, F. Sun, and E. P. Xing. Largemargin predictive latent subspace learning for multiview data analysis. IEEE Trans. on PAMI, 34(12):2365–2378, 2012.
[9] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
[10] E. Denton, S. Chintala, A. Szlam, and R. Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, 2015.
 [11] A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. Computer Vision and Pattern Recognition, 2015.
 [12] C. Du, J. Zhu, and B. Zhang. Learning deep generative models with doubly stochastic mcmc. arXiv preprint arXiv:1506.04557, 2015.

[13] G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Uncertainty in Artificial Intelligence, 2015.
[14] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
[15] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. C. Courville, and Y. Bengio. Maxout networks. In International Conference on Machine Learning, 2013.
 [16] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra. Draw: A recurrent neural network for image generation. International Conference on Machine Learning, 2015.
 [17] K. Gregor, I. Danihelka, A. Mnih, C. Blundell, and D. Wierstra. Deep autoregressive networks. In International Conference on Machine Learning, 2014.
 [18] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. Computer Vision and Pattern Recognition, 2016.
 [19] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
 [20] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.
 [21] D. P. Kingma, D. J. Rezende, S. Mohamed, and M. Welling. Semisupervised learning with deep generative models. In Advances in Neural Information Processing Systems, 2014.
 [22] D. P. Kingma and M. Welling. Autoencoding variational Bayes. In International Conference on Learning Representations, 2014.
 [23] H. Larochelle and I. Murray. The neural autoregressive distribution estimator. In International Conference on Artificial Intelligence and Statistics, 2011.
 [24] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. In Proceedings of the IEEE, 1998.
 [25] Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004.
 [26] C. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu. Deeplysupervised nets. In International Conference on Artificial Intelligence and Statistics, 2015.
 [27] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In International Conference on Machine Learning, 2009.
 [28] C. Li, J. Zhu, T. Shi, and B. Zhang. Maxmargin deep generative models. In Advances in neural information processing systems, 2015.
 [29] C. Li, J. Zhu, and B. Zhang. Learning to generate with memory. International Conference on Machine Learning, 2016.
 [30] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In International Conference on Machine Learning, 2015.
 [31] M. Lin, Q. Chen, and S. Yan. Network in network. In International Conference on Learning Representations, 2014.
 [32] M. Lin, Q. Chen, and S. Yan. Network in network. International Conference on Learning Representations, 2014.
[33] R. J. Little and D. B. Rubin. Statistical Analysis with Missing Data. Wiley, 1987.
 [34] L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther. Auxiliary deep generative models. In International Conference on Machine Learning, 2016.
[35] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
 [36] K. Miller, M. P. Kumar, B. Packer, D. Goodman, and D. Koller. Maxmargin minentropy models. In International Conference on Artificial Intelligence and Statistics, 2012.
 [37] T. Miyato, S. Maeda, M. Koyama, K. Nakae, and S. Ishii. Distributional smoothing with virtual adversarial training. stat, 1050:25, 2015.
 [38] A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning, 2014.
 [39] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. Advances in Neural Information Processing Systems Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
 [40] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. International Conference on Learning Representations, 2015.
 [41] M. Ranzato, J. Susskind, V. Mnih, and G. E. Hinton. On deep generative models with applications to recognition. In Computer Vision and Pattern Recognition, 2011.
 [42] A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semisupervised learning with ladder networks. In Advances in Neural Information Processing Systems, 2015.
 [43] Y. Ren, J. Li, Y. Luo, and J. Zhu. Conditional generative momentmatching networks. Advances in Neural Information Processing Systems, 2016.

[44] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
[45] R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In International Conference on Artificial Intelligence and Statistics, 2009.
[46] T. Salimans, I. J. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 2016.
 [47] L. Saul, T. Jaakkola, and M. Jordan. Mean field theory for sigmoid belief networks. Journal of AI Research, 4:61–76, 1996.
 [48] P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neural networks applied to house numbers digit classification. In International Conference on Pattern Recognition, 2012.
 [49] S. ShalevShwartz, Y. Singer, N. Srebro, and A. Cotter. Pegasos: Primal estimated subgradient solver for SVM. Mathematical Programming, Series B, 2011.
 [50] J. T. Springenberg. Unsupervised and semisupervised learning with categorical generative adversarial networks. In International Conference on Learning Representations, 2016.
 [51] Y. Tang. Deep learning using linear support vector machines. In Challenges on Representation Learning Workshop, International Conference on Machine Learning, 2013.
 [52] B. Taskar, C. Guestrin, and D. Koller. Maxmargin Markov networks. In Advances in Neural Information Processing Systems, 2003.
 [53] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In International Conference on Machine Learning, 2004.
 [54] V. Vapnik. Statistical learning theory. WileyInterscience, 1998.

[55] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P. A. Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.
[56] C. J. Yu and T. Joachims. Learning structural SVMs with latent variables. In International Conference on Machine Learning, 2009.
 [57] M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In International Conference on Learning Representations, 2013.
 [58] J. Zhu, A. Ahmed, and E. P. Xing. MedLDA: Maximum margin supervised topic models. Journal of Machine Learning Research, 13:2237–2278, 2012.
 [59] J. Zhu, N. Chen, H. Perkins, and B. Zhang. Gibbs maxmargin topic models with data augmentation. Journal of Machine Learning Research, 15:1073–1110, 2014.
 [60] J. Zhu, N. Chen, and E. P. Xing. Bayesian inference with posterior regularization and applications to infinite latent SVMs. Journal of Machine Learning Research, 15:1799–1847, 2014.
 [61] J. Zhu, E.P. Xing, and B. Zhang. Partially observed maximum entropy discrimination Markov networks. In Advances in Neural Information Processing Systems, 2008.
 [62] X. Zhu and A. Goldberg. Introduction to semisupervised learning. Synthesis lectures on artificial intelligence and machine learning, 3(1):1–130, 2009.
Appendix A Network Architecture of MMVA
We illustrate the network structure of MMVA with Gaussian hidden variables and Bernoulli visible variables in Fig. 9.
Appendix B Manifold Visualization
t-SNE embedding results of the features learned by VA and MMVA in the 2-D plane are shown in Fig. 10 (a) and Fig. 10 (b) respectively, using the same data points randomly sampled from the MNIST dataset. Compared to VA's embedding, MMVA separates the images from different categories better, especially for confusable digits such as "4" and "9". These results show that MMVA, benefiting from the max-margin principle, learns more discriminative representations of digits than VA.
Appendix C Visualization of Imputation Results
The imputation results of ConvVA and ConvMMVA on MNIST and SVHN are shown in Fig. 11 and Fig. 12 respectively. On MNIST, ConvMMVA makes fewer mistakes and refines the images better, which accords with the MSE results reported in the main text. On SVHN, in most cases ConvMMVA can complete images with missing values on this much harder dataset; in the remaining cases it fails, potentially due to the variable digit patterns and the lower color contrast compared with the handwritten-digit dataset. Nevertheless, ConvMMVA achieves results comparable with ConvVA on inferring missing data.