1 Introduction
Performance on the task of supervised image classification has improved vastly in the era of deep learning through modern convolutional neural network (CNN) [30] based discriminative classifiers [26, 43, 31, 42, 16, 52, 18]. On the other hand, unsupervised generative models in deep learning were previously attained using methods under the umbrella of graphical models, e.g., the Boltzmann machine [17] or autoencoder [4, 24] architectures. However, the rich representational power seen in convolution-based (discriminative) models is not directly enjoyed by these generative models. Later, inverting convolutional neural networks to convert internal representations into a real image was investigated in [6, 27]. Recently, generative adversarial networks (GAN) [11] and follow-up works [38, 2, 14] have attracted a tremendous amount of attention in machine learning and computer vision by producing high-quality synthesized images through training a pair of competing models against one another in an adversarial manner. While a generator tries to create "fake" images to fool the discriminator, the discriminator attempts to discern between the "real" (given training) images and the "fake" ones. After convergence, the generator is able to produce images faithful to the underlying data distribution.
Before the deep learning era [17], generative modeling had been an area with a steady pace of development [1, 5, 54, 36, 49, 53]. These models were guided by rigorous statistical theories which, although appealing in theory, did not succeed in producing synthesized images of practical quality.
In terms of building generative models from discriminative classifiers, there have been early attempts in [48, 44]. In [48], a generative model was obtained from a repeatedly trained boosting algorithm [8] using a weak classifier, whereas [44] used a strong classifier to self-generate negative examples, or "pseudo-negatives".
To address the lack of richness in representation and efficiency in synthesis, convolutional neural networks were adopted in introspective neural networks (INN) [29, 21] to build a single model that is simultaneously generative and discriminative. The generative modeling aspect was studied in [29], where a sequence of CNN classifiers was trained, while the power in the classification setting was revealed in [21] in the form of introspective convolutional networks (ICN), which use only a single CNN classifier. Although the INN models [29, 21] point to a promising direction for obtaining a single model that is both a good generator and a strong discriminative classifier, a sequence of CNNs was needed to generate realistic synthesis. This requirement may become a bottleneck with respect to training complexity and model size.
Recently, a generic formulation [2] was developed within the GAN model family that incorporates a Wasserstein objective to alleviate the well-known difficulty in GAN training. Motivated by introspective neural networks (INN) [29] and this Wasserstein objective [2], we propose to adopt the Wasserstein term into the INN formulation to enhance its modeling capability. The resulting model, Wasserstein introspective neural networks (WINN), shows greatly enhanced modeling capability over INN while reducing the number of CNN classifiers needed.
2 Significance and Related Work
We make several interesting observations for WINN:

By adopting the Wasserstein distance into INN, we are able to generate images using a single CNN in WINN with even higher quality than those by INN using 20 CNNs (as seen in Figures 2, 4, 5, 6, and 7; similar underlying CNN architectures are used in WINN and INN). WINN achieves a significant reduction in model complexity over INN, making the generator more practical.

Within texture modeling, INN and WINN are able to inherently model the input image space, making the synthesis of large texture images realistic, whereas GAN projects a noise vector onto the image space, making image patch stitching more difficult (although extensions exist), as demonstrated in Figure 2.
To compare with the family of GAN models, we compute Inception scores using the standard procedure on the CIFAR-10 dataset and observe modest results. Here, we typically train 4 to 5 cascades to boost the numbers, but WINN with one CNN is already promising. Overall, modern GAN variants (e.g., [14]) still outperform our WINN with better-quality images. Some results are shown in Figure 7.

To test the robustness of the discriminative abilities of WINN, we directly turn WINN into a discriminative classifier by training it on the standard MNIST and SVHN datasets. Not only are we able to improve over the previous ICN [21] classifier for supervised classification, we also observe a large improvement in robustness against adversarial examples compared with the baseline CNN, ResNet, and the competing ICN.
In terms of other related work, we briefly discuss some existing methods below.
Wasserstein GAN. A closely related work to our WINN algorithm is the Wasserstein generative adversarial networks (WGAN) method [2, 14]. While WINN adopts the Wasserstein distance as motivated by WGAN, our overall algorithm is still within the family of introspective neural networks (INN) [29, 21]. WGAN, on the other hand, is a variant of GAN with an objective that is easier to train. The level of difference between WINN and WGAN is similar to that between INN [29, 21] and GAN [12]. Overall comparisons between INN and GAN have been described in [29, 21].
Generative ConvNets. Recently, there has also been a cluster of algorithms developed in [50, 51, 15] where Langevin dynamics are adopted in generator CNNs. However, the models proposed in [50, 51, 15] do not perform introspection (Figure 1), and their generator and discriminator components are still somewhat separated; thus, their generators are not used as effective discriminative classifiers to perform state-of-the-art classification on standard supervised machine learning tasks. Their training processes are also more complex than those of INN and WINN.
Deep energy models (DEMs) [35]. DEM [35] extends standard density estimation by using multi-layer neural networks (MLNN) with a rather complex training procedure. The probability model in DEM includes both the raw input and the features computed by the MLNN. WINN instead takes a more general and simpler form and is easier to train (see Eq. (1)). In general, DEM belongs to the minimum description length (MDL) family of models in which the maximum likelihood is achieved. WINN, instead, has a formulation that is simultaneously discriminative and generative.
3 Introspective Neural Networks
3.1 Brief introduction of INN
We first briefly introduce the introspective neural network method (INNg) [29] for generative modeling and its companion model [21], which focuses on the classification aspect. The main motivation behind the INN work [29, 21] is to make a convolutional neural network classifier simultaneously discriminative and generative. A single CNN classifier is trained in an introspective manner to improve the standard supervised classification result [21]; however, a sequence of CNNs is needed to synthesize images of good quality [29].
Figure 1 shows a brief illustration of a single introspective CNN classifier [21]. We discuss our basic unsupervised formulation next. Suppose we are given a set of training examples $S_+ = \{\mathbf{x}_i,\ i=1,\dots,n\}$, where we assume each $\mathbf{x}_i \in \mathbb{R}^m$ (e.g., $m$ equals the number of pixel values of an image). These constitute positive examples of the patterns/targets we wish to model. The main idea of INN is to define pseudo-negative examples that are self-generated by the discriminative classifier itself. We therefore define a label $y \in \{-1, +1\}$ for each example $\mathbf{x}$: $y = +1$ if $\mathbf{x}$ is from the given training set and $y = -1$ if $\mathbf{x}$ is self-generated. Motivated by the generative via discriminative learning (GDL) framework [44], one could try to learn the generative model for the given examples, $p(\mathbf{x})$, by a sequentially learned distribution of the pseudo-negative samples, written as $p^-_{\theta_t}(\mathbf{x})$, where $\theta_t$ includes all the model parameters learned at step $t$:
$$p^-_{\theta_t}(\mathbf{x}) = \frac{1}{Z_t} \frac{q_{\theta_t}(y=+1|\mathbf{x})}{q_{\theta_t}(y=-1|\mathbf{x})}\, p^-_{\theta_{t-1}}(\mathbf{x}), \quad t = 1, \dots, T, \qquad (1)$$
where $Z_t$ is the normalizing constant and the initial distribution $p^-_{\theta_0}(\mathbf{x})$ is simple, such as a Gaussian distribution over the entire space of $\mathbb{R}^m$. The discriminative classifier $q_{\theta_t}(y|\mathbf{x})$ is a convolutional neural network (CNN) parameterized by $\theta_t = (w^{(0)}_t, w^{(1)}_t)$, where $w^{(1)}_t$ denotes the weights of the top layer combining the features $\phi(\mathbf{x}; w^{(0)}_t)$ (e.g., a softmax layer) and $w^{(0)}_t$ parameterizes the internal representations. The synthesis process through which pseudo-negative samples are generated is carried out by stochastic gradient Langevin dynamics [47] as
$$\Delta \mathbf{x} = \frac{\epsilon}{2} \nabla_{\mathbf{x}} \ln \frac{q_{\theta_t}(y=+1|\mathbf{x})}{q_{\theta_t}(y=-1|\mathbf{x})} + \eta,$$
where $\eta \sim N(0, \epsilon)$ is Gaussian noise and $\epsilon$ is the step size, which is annealed during the sampling process. Overall, we desire
$$p^-_{\theta_t}(\mathbf{x}) \to p(\mathbf{x}) \quad \text{as } t \to \infty, \qquad (2)$$
using the iterative reclassification-by-synthesis process [21, 29] guided by Eq. (1).
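To make the sampling step concrete, here is a minimal one-dimensional sketch of the Langevin-style update, with a toy quadratic standing in for the classifier's log-ratio ln[q(y=+1|x)/q(y=-1|x)]; the function, step-size schedule, and all names below are illustrative, not the paper's implementation:

```python
import numpy as np

# Toy 1-D version of the Langevin-style synthesis update: gradient ascent
# on a stand-in log-ratio plus Gaussian noise. Here the "log-ratio" is
# -0.5 * (x - mu)**2 (so its gradient is mu - x), playing the role of the
# quantity the CNN classifier computes in the paper.
def langevin_sample(x0, mu, steps=2000, eps0=0.1, seed=0):
    rng = np.random.default_rng(seed)
    x = float(x0)
    for t in range(steps):
        eps = eps0 / np.sqrt(1.0 + t)    # mildly annealed step size
        grad = mu - x                    # gradient of the toy log-ratio
        x += 0.5 * eps * grad + rng.normal(0.0, np.sqrt(eps))
    return x

# A sample initialized far from the mode drifts into the mode's vicinity.
x_final = langevin_sample(x0=-5.0, mu=3.0)
```

With the noise term kept, the updates draw approximate samples from the density implied by the log-ratio rather than collapsing onto its mode.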
3.2 Connection to the Wasserstein distance
The overall training process, reclassification-by-synthesis, is carried out iteratively without an explicit objective function. The generative adversarial network (GAN) model [11] instead has an objective function formulated in a minimax fashion, with the generator and the discriminator competing against each other. The Wasserstein generative adversarial network (WGAN) work [2] improves GAN [11] by replacing the Jensen-Shannon distance with an efficient approximation of the Earth-Mover distance [2]. There has also been further generalization of the GAN family of models in [32].
Let $p(\mathbf{x})$ be the target distribution and $p^-_\theta(\mathbf{x})$ be the pseudo-negative distribution parameterized by $\theta$. Next, we show a connection between the INN framework and the WGAN formulation [2], whose objective (rewritten with our notation) can be defined as
$$\min_{\theta} \max_{f \in \mathcal{F}} \; \mathbb{E}_{\mathbf{x} \sim p(\mathbf{x})}[f(\mathbf{x})] - \mathbb{E}_{\mathbf{x} \sim p^-_\theta(\mathbf{x})}[f(\mathbf{x})], \qquad (3)$$
where $\mathcal{F}$ denotes the space of 1-Lipschitz functions. To build the connection between Eq. (3) of WGAN and Eq. (1) of INN, we first present the following lemma.
Lemma 1
Considering $f(\mathbf{x}) = \frac{1}{2} \ln \frac{p(\mathbf{x})}{p^-_\theta(\mathbf{x})}$ and assuming its 1-Lipschitz property, we have a lower bound on the Wasserstein distance by
$$\max_{f \in \mathcal{F}} \; \mathbb{E}_{\mathbf{x} \sim p}[f(\mathbf{x})] - \mathbb{E}_{\mathbf{x} \sim p^-_\theta}[f(\mathbf{x})] \;\ge\; \frac{1}{2}\left[ \mathrm{KL}(p \,\|\, p^-_\theta) + \mathrm{KL}(p^-_\theta \,\|\, p) \right] = \frac{1}{2} J(p, p^-_\theta), \qquad (4)$$
where $\mathrm{KL}(\cdot \,\|\, \cdot)$ denotes the Kullback-Leibler divergence between the two distributions $p$ and $p^-_\theta$, and $J(p, p^-_\theta)$ is the Jeffreys divergence.
Proof: see Appendix A.
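The Jeffreys divergence appearing in the lemma, J(p, q) = KL(p||q) + KL(q||p), can be checked numerically; this small sketch computes it for two discrete distributions (the lemma itself concerns continuous densities, and the distributions below are arbitrary illustrative choices):

```python
import numpy as np

# Jeffreys divergence as the symmetrized KL divergence, computed on
# discrete probability vectors for illustration.
def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def jeffreys(p, q):
    return kl(p, q) + kl(q, p)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.1, 0.6, 0.3])

j_pq = jeffreys(p, q)
```

Unlike KL itself, the Jeffreys divergence is symmetric in its arguments, which is what allows the bound of Eq. (4) to treat both directions at once.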
Note that using Bayes' rule, the ratio of the generative probabilities $p(\mathbf{x}|y=+1)/p(\mathbf{x}|y=-1)$ in Lemma 1 can be turned into the ratio of the discriminative probabilities $q(y=+1|\mathbf{x})/q(y=-1|\mathbf{x})$, assuming equal priors $p(y=+1) = p(y=-1)$.
Theorem 1
The INN formulation of Eq. (1) connects to a lower bound of the WGAN objective of Eq. (3).
Proof: see Appendix B.
4 Wasserstein Introspective Networks
Here we present the formulation for WINN building upon the formulation of the prior introspective learning works presented in Section 3.
4.1 WINN algorithm
We denote our unlabeled input training data as $S_+ = \{\mathbf{x}_i,\ i=1,\dots,n\}$. Also, we denote the set of all the self-generated pseudo-negative samples up to step $t$ as $S^-_t$. In other words, $S^-_t$ consists of pseudo-negatives sampled from $p^-_{\theta_i}(\mathbf{x})$ for $i = 1, \dots, t$, where $\theta_i$ is the model parameter vector at step $i$.
Classification-step. The classification-step can be viewed as training a classifier to approximate the Wasserstein distance between the distributions underlying $S_+$ and $S^-_t$. Note that we also keep pseudo-negatives from earlier stages (essentially the mistakes of the earlier stages) to prevent the classifier from forgetting what it has learned in previous stages. We use CNNs parameterized by $\theta_t$ as base classifiers. Let $f_{\theta_t}(\mathbf{x})$ denote the output of the final fully connected layer of the CNN (without passing through the sigmoid nonlinearity). In the previous introspective learning frameworks [29, 21], the classifier learning objective was to minimize the following standard cross-entropy loss function on $S_+ \cup S^-_t$:
$$-\mathbb{E}_{\mathbf{x} \in S_+}\left[ \ln \sigma(f_{\theta_t}(\mathbf{x})) \right] - \mathbb{E}_{\mathbf{x} \in S^-_t}\left[ \ln \left(1 - \sigma(f_{\theta_t}(\mathbf{x}))\right) \right],$$
where $\sigma(\cdot)$ denotes the sigmoid nonlinearity. Motivated by Section 3.2, in WINN training we wish to minimize the following Wasserstein loss function by stochastic gradient descent via backpropagation:
$$\mathcal{L}(\theta_t) = \mathbb{E}_{\mathbf{x} \in S^-_t}\left[ f_{\theta_t}(\mathbf{x}) \right] - \mathbb{E}_{\mathbf{x} \in S_+}\left[ f_{\theta_t}(\mathbf{x}) \right]. \qquad (5)$$
To enforce the function $f_{\theta_t}$ to be Lipschitz, we add the following gradient penalty term [14] to $\mathcal{L}(\theta_t)$:
$$\lambda \, \mathbb{E}_{\hat{\mathbf{x}}}\left[ \left( \| \nabla_{\hat{\mathbf{x}}} f_{\theta_t}(\hat{\mathbf{x}}) \|_2 - 1 \right)^2 \right],$$
where $\hat{\mathbf{x}} = \epsilon\, \mathbf{x}_+ + (1 - \epsilon)\, \mathbf{x}_-$, $\epsilon \sim U[0, 1]$, $\mathbf{x}_+ \in S_+$, and $\mathbf{x}_- \in S^-_t$.
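To illustrate the gradient penalty mechanics, here is a small sketch using a linear critic f(x) = w . x, whose input gradient is simply w and so can be written in closed form; a real WINN critic is a CNN whose gradient comes from backpropagation, and the penalty weight of 10 below is the value suggested in [14], not necessarily the paper's setting:

```python
import numpy as np

# Gradient penalty on a random interpolate between a positive example and
# a pseudo-negative. For the linear critic f(x) = w . x the input gradient
# is exactly w, so the penalty lam * (||grad|| - 1)^2 is closed-form.
rng = np.random.default_rng(0)
lam = 10.0                        # penalty weight, as suggested in [14]
w = rng.normal(size=8)

x_pos = rng.normal(size=8)        # a "real" example
x_neg = rng.normal(size=8)        # a pseudo-negative example
eps = rng.uniform()               # interpolation coefficient ~ U[0, 1)
x_hat = eps * x_pos + (1 - eps) * x_neg

grad_f = w                        # input gradient of the linear critic
penalty = lam * (np.linalg.norm(grad_f) - 1.0) ** 2

# Rescaling w to unit norm drives the penalty to zero (1-Lipschitz critic).
w_unit = w / np.linalg.norm(w)
penalty_unit = lam * (np.linalg.norm(w_unit) - 1.0) ** 2
```

The penalty is zero exactly when the critic's input gradient has unit norm, which is the soft version of the 1-Lipschitz constraint on the function class in Eq. (3).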
Synthesis-step. Obtaining increasingly difficult pseudo-negative samples is an integral part of the introspective learning framework, as it is crucial for tightening the decision boundary. To this end, we develop an efficient sampling procedure under the Wasserstein formulation. After the classification-step, we obtain the following distribution of pseudo-negatives:
$$p^-_{\theta_t}(\mathbf{x}) = \frac{1}{Z_t} \exp\{ f_{\theta_t}(\mathbf{x}) \} \, p^-_{\theta_{t-1}}(\mathbf{x}), \qquad (6)$$
where $Z_t$ is the normalizing constant; the initial distribution $p^-_{\theta_0}(\mathbf{x})$ is a Gaussian distribution or the distribution defined in Appendix D. We find that the distribution of Appendix D encourages the diversity of sampled images.
The following equivalence is shown in [29, 21]:
$$\frac{q_{\theta_t}(y=+1|\mathbf{x})}{q_{\theta_t}(y=-1|\mathbf{x})} = \exp\{ f_{\theta_t}(\mathbf{x}) \}. \qquad (7)$$
The sampling strategy of [29, 21] was to carry out gradient ascent on the term $\ln \frac{q_{\theta_t}(y=+1|\mathbf{x})}{q_{\theta_t}(y=-1|\mathbf{x})}$. In Lemma 1 we chose $f(\mathbf{x})$ to be $\frac{1}{2} \ln \frac{p(\mathbf{x})}{p^-_\theta(\mathbf{x})}$. Using Bayes' rule, it is easy to see that $f_{\theta_t}(\mathbf{x})$ is loosely connected to $\ln \frac{p(\mathbf{x})}{p^-_{\theta_{t-1}}(\mathbf{x})}$. Also, [2, 14] argue that $f(\mathbf{x})$ correlates with the quality of the sample $\mathbf{x}$. This motivates us to use the following sampling strategy. After initializing $\mathbf{x}$ by drawing a fair sample from $p^-_{\theta_{t-1}}(\mathbf{x})$, we increase $f_{\theta_t}(\mathbf{x})$ using gradient ascent on the image via backpropagation. Specifically, as shown in [47], we can obtain fair samples from the distribution in Eq. (6) using the update rule
$$\Delta \mathbf{x} = \frac{\epsilon}{2} \nabla_{\mathbf{x}} f_{\theta_t}(\mathbf{x}) + \eta,$$
where $\epsilon$ is a time-varying step size and $\eta$ is a random variable following the Gaussian distribution $N(0, \epsilon)$. The Gaussian noise term is added to make the samples cover the full distribution. Inspired by [20], we found that injecting noise in the image space can be substituted by applying Dropout to the higher layers of the CNN. In practice, we were able to obtain samples of sufficient diversity without step size annealing and noise injection. As an early stopping criterion, we empirically find the following to be effective: (1) we measure the minimum and maximum values of $f_{\theta_t}$ over the positive examples; (2) we set the early stopping threshold to a random number drawn uniformly between these two numbers. Intuitively, by matching the $f_{\theta_t}$ values of positives and pseudo-negatives, we expect to obtain pseudo-negative samples that match the quality of positive samples.
4.2 Expanding model capacity
In practice, we find that the version with a single classifier, which we call WINN-single, is expressive enough to capture the generative distribution in a variety of applications. The introspective learning formulation [44, 29, 21] allows us to model more complex distributions by adding a sequence of cascaded classifiers parameterized by $\theta_1, \dots, \theta_T$. Then, we can model the distribution as
$$p^-_{\theta_T}(\mathbf{x}) = \frac{1}{Z} \exp\Big\{ \sum_{t=1}^{T} f_{\theta_t}(\mathbf{x}) \Big\} \, p^-_{\theta_0}(\mathbf{x}). \qquad (8)$$
In the next sections, we demonstrate the modeling capability of WINN with cascaded classifiers, as well as its agnosticism to the type of base classifier.
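The effect of cascading in Eq. (8) can be sketched with toy one-dimensional "classifier" scores in place of CNN outputs (all functions below are illustrative stand-ins):

```python
import numpy as np

# Cascade sketch: the (unnormalized) pseudo-negative log-density is the
# initial log-density plus the sum of all classifier scores.
def log_p0(x):
    return -0.5 * x ** 2                    # Gaussian initial distribution

def cascade_log_density(x, classifiers):
    return log_p0(x) + sum(f(x) for f in classifiers)

# Two toy "classifiers", each nudging probability mass toward x = 2.
fs = [lambda x: -0.25 * (x - 2.0) ** 2,
      lambda x: -0.25 * (x - 2.0) ** 2]

# The mode of the cascade moves from 0 (initial distribution) toward 2.
xs = np.linspace(-4, 4, 801)
mode = xs[np.argmax(cascade_log_density(xs, fs))]
```

Each added classifier multiplies another exp-score factor into the density, which is how the cascade sharpens the model toward the positive examples.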
4.3 GAN’s discriminator vs. WINN’s classifier
GAN uses competing discriminator and generator networks, whereas WINN maintains a single model that is both generative and discriminative. Some general comparisons between GAN and INN have been provided in [29, 21]. Below we make a few additional observations that are worth future exploration and discussion.

First, the generator of GAN is a cost-effective option for image patch synthesis, as it works in a feed-forward fashion. However, the generator of GAN is not meant to be trained as a classifier to perform the standard classification task, while the generator in the introspective framework is also a strong classifier. Section 5.6 shows WINN to have significant robustness to external adversarial examples.

Second, the discriminator in GAN is meant to be a critic but not a generator. To show whether or not the discriminator in GAN can also be used as a generator, we train WGAN-GP [14] on the CelebA face dataset. Using the same CNN architecture (the ResNet from [14]) that was used as GAN's discriminator, we also train a WINN-single model, giving GAN's discriminator and WINN-single identical CNN architectures. Applying our sampling strategy to WGAN-GP's discriminator allows us to synthesize images from it as well, and we show some samples in Figure 3 (a). These synthesized images do not look like faces, yet they are classified by the discriminator of WGAN-GP as "real" faces; this demonstrates the separation between the generator and the discriminator in GAN. In contrast, images synthesized by WINN-single's CNN classifier are face-like, as shown in Figure 3 (b).

Third, the discriminator of GAN may not be used as a direct discriminative classifier for the standard supervised learning task. As shown and discussed in ICN [21], the introspective framework gives its discriminator full classification ability.
5 Experiments
5.1 Implementation
Classification-step. For training the discriminator network, we use Adam [23] with a mini-batch size of 100. The learning rate was set to 0.0001, with the Adam momentum parameters set following [14]. Each batch consists of 50 positive images sampled from the set of positives $S_+$ and 50 pseudo-negative images sampled from the set of pseudo-negatives $S^-_t$. In each iteration, we cap the total number of training images used.
Synthesis-step. For synthesizing pseudo-negative images via backpropagation, we perform gradient ascent in the image space. In the first cascade, each image is initialized with noise sampled from the distribution described in Appendix D. In later cascades, images are initialized with images sampled from the previous cascade. We use Adam with a mini-batch size of 100 and a learning rate of 0.01.
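Putting the synthesis-step together with the early-stopping rule of Section 4.1, a toy sketch looks as follows (the quadratic critic, learning rate, and sampling choices are illustrative stand-ins; WINN ascends on real images via backpropagation through the CNN):

```python
import numpy as np

# Synthesis loop sketch: gradient ascent on a toy critic score f until the
# sample's score crosses a threshold drawn uniformly between the minimum
# and maximum scores of the positive examples (Section 4.1's criterion).
rng = np.random.default_rng(1)

def f(x):                       # toy critic score, peaked at x = 2
    return -(x - 2.0) ** 2

def f_grad(x):
    return -2.0 * (x - 2.0)

pos_scores = f(rng.normal(2.0, 0.3, size=50))      # scores of positives
threshold = rng.uniform(pos_scores.min(), pos_scores.max())

x = rng.normal(0.0, 1.0)        # pseudo-negative initialized from noise
for _ in range(10000):
    if f(x) >= threshold:       # early stopping: score matches positives'
        break
    x += 0.01 * f_grad(x)       # gradient ascent step (lr = 0.01)
```

Randomizing the stopping threshold between the positives' extreme scores keeps the pseudo-negatives spread over the same quality range as the positives instead of all stopping at one score level.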
5.2 Texture modeling
We evaluate the texture modeling capability of WINN. For a fair comparison, we use the same 7 texture images presented in [9], where each texture image has a size of 256×256. We follow the training method of Section 5.1, except that positive images are constructed by cropping 64×64 patches from the source texture image at random positions. We use the network architecture of Appendix C. After the 64×64 patch-based model is trained, we synthesize texture images of arbitrary size using the any-size-image-generation method following [29]. During the synthesis process, we keep a single working image of size 320×320. Note that we expand the image so that the center 256×256 pixels are covered with equal probability. In each iteration, we sample 200 patches from the working image and perform gradient ascent on the chosen patches. For pixels overlapping between patches, we take the average of the gradients assigned to those pixels. We show synthesized texture images in Figures 2 and 4. WINN-single shows a significant improvement over INNg-single and comparable results to INNg (using 20 CNNs). It is worth noting that [10, 45] leverage the rich features of a VGG-19 network pretrained on ImageNet; WINN and INNg instead train networks from scratch.
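The overlapping-patch bookkeeping for any-size synthesis can be sketched as follows (patch coordinates and gradient values are toy placeholders for those produced by the patch-based model):

```python
import numpy as np

# Gradients of randomly chosen patches are written into a working image;
# pixels covered by several patches receive the average of their gradients.
def average_patch_grads(shape, patches, grads):
    """patches: list of (row, col, size); grads: matching gradient arrays."""
    acc = np.zeros(shape)
    cnt = np.zeros(shape)
    for (r, c, s), g in zip(patches, grads):
        acc[r:r + s, c:c + s] += g
        cnt[r:r + s, c:c + s] += 1.0
    # Average where covered; leave uncovered pixels at zero.
    return np.divide(acc, cnt, out=np.zeros(shape), where=cnt > 0)

# Two 2x2 patches overlapping in the middle column of a 2x3 image.
patches = [(0, 0, 2), (0, 1, 2)]
grads = [np.full((2, 2), 1.0), np.full((2, 2), 3.0)]
avg = average_patch_grads((2, 3), patches, grads)
```

Averaging (rather than summing) the overlapping gradients keeps the effective step size uniform across pixels regardless of how many patches cover them.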
5.3 CelebA face modeling
[Figure 5 panels: Real data, DCGAN, INNg-single, INNg, WINN-single (ours), WINN-4CNNs (ours)]
The CelebA dataset [33] consists of face images of celebrities. It has been widely used in previous generative modeling works since it contains large pose variations and background clutter. The network architecture adopted here is described in Appendix C. In Figure 5, we show some synthesized face images using WINN-single and WINN, as well as those by DCGAN [38], INNg-single, and INNg [29]. WINN-single attains image quality even higher than that of INNg (12 CNNs).
5.4 SVHN modeling
SVHN [34] consists of digit images from Google Street View. It contains 73,257 training images, 26,032 test images, and 531,131 extra images. We use only the training images for unsupervised SVHN modeling, with the ResNet architecture described in [14]. Images generated by WINN-single and WINN (4 CNN classifiers), as well as by DCGAN and INNg, are shown in Figure 6. The improvement of WINN over INNg is evident.
5.5 CIFAR10 modeling
Table 1: Inception scores on CIFAR-10 for real data, WGAN-GP [14], WGAN [2], DCGAN [38] (as reported in [19]), ALI [7] (as reported in [46]), Improved GANs (L) [40], INNg-single [29], INNg [29], WINN-single (ours), and WINN-5CNNs (ours).
CIFAR-10 [25] consists of 50,000 training images and 10,000 test images of size 32×32 in 10 classes. We use the training images, augmented by horizontal flips [26], for unsupervised CIFAR-10 modeling. We use the ResNet given in [14]. Figure 7 shows images generated by various models.
[Figure 7 panels: Real data, DCGAN, WINN-single (ours), WINN-5CNNs (ours)]
To measure semantic discriminability, we compute the Inception scores [40] on the generated images. WINN shows a clear advantage over INN. WINN-5CNNs produces a result close to WGAN, but there is still a gap to the state-of-the-art results of WGAN-GP.
5.6 Image classification and adversarial examples
To demonstrate the robustness of WINN as a discriminative classifier, we present experiments on supervised classification tasks.

Training methods. We add the Wasserstein loss term to the ICN [21] loss function, obtaining an objective of the form
$$\mathcal{L} = \mathcal{L}_{\mathrm{ICN}} + \lambda \, \mathcal{L}_{\mathrm{W}},$$
where $\mathcal{L}_{\mathrm{ICN}}$ is the multi-class ICN loss, with $w^{(0)}$ denoting the internal parameters of the CNN and $w^{(1)}_k$ the top-layer weights for the $k$-th class, and $\mathcal{L}_{\mathrm{W}}$ is the Wasserstein loss of Eq. (5). In the experiments, we set the weight $\lambda$ of the WINN loss to 0.01. We use a vanilla network architecture resembling [21] as the baseline CNN, which has fewer filters and parameters than the one in [21]. We also use a ResNet-32 architecture with Layer Normalization [3] on MNIST and SVHN. In the classification-step, we use Adam with a fixed learning rate of 0.001. In the synthesis-step, we use the Adam optimizer as described in Section 5.1. Table 2 shows the errors on MNIST and SVHN.
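As a sketch of the combined objective (the exact multi-class ICN loss is given in [21]; the form below only assumes that the Wasserstein term of Eq. (5), without its penalty, is added with weight 0.01):

```python
import numpy as np

# Toy combined loss: multi-class cross-entropy plus 0.01 * Wasserstein term.
def softmax_ce(logits, label):
    z = logits - logits.max()           # stabilized log-softmax
    return float(-(z[label] - np.log(np.exp(z).sum())))

def winn_loss(f_pos, f_neg):
    # Eq. (5) without the gradient penalty: mean critic score of
    # pseudo-negatives minus mean critic score of positives.
    return float(np.mean(f_neg) - np.mean(f_pos))

logits = np.array([2.0, 0.5, -1.0])     # toy class scores for one example
ce = softmax_ce(logits, label=0)
w = winn_loss(f_pos=np.array([1.2, 0.8]), f_neg=np.array([0.1, -0.3]))
total = ce + 0.01 * w                   # weight 0.01 on the WINN term
```

With the small weight, the classification term dominates while the Wasserstein term keeps the synthesis machinery trained alongside it.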
Table 3: Adversarial error of each method and correction rates (by the method and by the baseline) on MNIST, comparing the baseline vanilla CNN, ICN [21], WINN-single vanilla (ours), the baseline ResNet-32, and WINN-single ResNet-32 (ours).
Robustness to adversarial examples. It is argued in [13] that a discriminative CNN's vulnerability to adversarial examples primarily arises from its linear nature. Since the reclassification-by-synthesis process helps tighten the decision boundary (Figure 1), one might expect CNNs trained with the WINN algorithm to be more robust to adversarial examples. Note that unlike existing methods for adversarial defenses [13, 28], our method does not train networks with specific types of adversarial examples. With the test images of MNIST and SVHN, we adopt the "fast gradient sign method" [13] (with a fixed perturbation magnitude for each dataset) to generate adversarial examples clipped to the valid pixel range, which differs from [21]. We experiment with two networks having the same architecture and differing only in the training method (the standard cross-entropy loss vs. the WINN procedure). We call the former the baseline CNN. We summarize the results in Table 3. Compared to ICN [21], WINN significantly reduces the adversarial error to 7.99% and improves the correction rate to 90.00%. In addition, we have adopted the ResNet-32 architecture into WINN; see Tables 3 and 4. We still obtain the adversarial error reduction and correction rate improvement on MNIST and SVHN with ResNet-32. Our observation is that WINN does not necessarily improve over a strong baseline for the supervised classification task, but its advantage under adversarial attacks is evident.
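The fast gradient sign method used to generate the adversarial test examples admits a compact sketch (the gradient below is a toy placeholder for the network's backpropagated loss gradient):

```python
import numpy as np

# Fast gradient sign method (FGSM): perturb each pixel by eps in the
# direction of the sign of the loss gradient, then clip to the valid
# pixel range [0, 1].
def fgsm(x, loss_grad, eps):
    x_adv = x + eps * np.sign(loss_grad)
    return np.clip(x_adv, 0.0, 1.0)

x = np.array([0.2, 0.5, 0.99])          # toy pixel values
g = np.array([-0.7, 1.3, 0.4])          # stand-in d(loss)/dx from backprop
x_adv = fgsm(x, g, eps=0.1)
```

Only the sign of the gradient matters, so the attack maximally perturbs every pixel within an eps-ball in the L-infinity sense.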
Table 4: Adversarial error of each method and correction rates (by the method and by the baseline) on SVHN, comparing the baseline ResNet-32 and WINN-single ResNet-32 (ours).
5.7 Agnostic to different architectures
In Figure 8, we demonstrate that our algorithm is agnostic to the type of classifier by varying the network architecture to ResNet [16] and DenseNet [18]. Little modification was required to adapt the two architectures for WINN.
[Figure 8 panels: WINN-single (ResNet-13), WINN-single (DenseNet-20)]
6 Conclusion
In this paper, we have introduced Wasserstein introspective neural networks (WINN), which produce encouraging results as a generator and a discriminative classifier at the same time. WINN achieves a large model size reduction over the previous introspective neural networks (INN). In most of the images shown in this paper, we find a single CNN classifier in WINN sufficient to produce visually appealing images as well as a significant error reduction against adversarial examples. WINN is agnostic to the CNN architecture design, and we demonstrate results on three networks, a vanilla CNN, ResNet [16], and DenseNet [18], where popular CNN discriminative classifiers are turned into generative models under the WINN procedure. WINN can be adopted in a wide range of computer vision applications such as image classification, recognition, and generation.
Acknowledgements.
This work is supported by NSF IIS1717431 and NSF IIS1618477. The authors thank Justin Lazarow, Long Jin, Hubert Le, Ying Nian Wu, Max Welling, Richard Zemel, and Tong Zhang for valuable discussions.
7 Appendix
A. Proof of Lemma 1.
Proof. Plugging $f(\mathbf{x}) = \frac{1}{2} \ln \frac{p(\mathbf{x})}{p^-_\theta(\mathbf{x})}$ into Eq. (3), we have
$$\mathbb{E}_{\mathbf{x} \sim p}\left[ \tfrac{1}{2} \ln \tfrac{p(\mathbf{x})}{p^-_\theta(\mathbf{x})} \right] - \mathbb{E}_{\mathbf{x} \sim p^-_\theta}\left[ \tfrac{1}{2} \ln \tfrac{p(\mathbf{x})}{p^-_\theta(\mathbf{x})} \right] = \frac{1}{2} \mathrm{KL}(p \,\|\, p^-_\theta) + \frac{1}{2} \mathrm{KL}(p^-_\theta \,\|\, p) = \frac{1}{2} J(p, p^-_\theta).$$
Since this particular $f$ is assumed to be 1-Lipschitz, the maximum over $\mathcal{F}$ in Eq. (3) is at least this value, which gives the lower bound of Eq. (4).
B. Proof of Theorem 1.
Corollary 1
The Jeffreys divergence $J(p, p^-_\theta)$ in Eq. (4) of Lemma 1 is lower and upper bounded by the squared total variation distance $\delta^2(p, p^-_\theta)$, up to multiplicative constants.
Proof. Based on Pinsker's inequality [37, 41], it is observed that
$$\mathrm{KL}(p \,\|\, q) \ge 2\, \delta^2(p, q),$$
where $\delta(p, q)$ is the total variation (TV) distance. From [41] we also have an upper bound on the KL divergence in terms of the TV distance. Applying the above bounds to $\mathrm{KL}(p \,\|\, p^-_\theta)$ and $\mathrm{KL}(p^-_\theta \,\|\, p)$, using the symmetry of the TV distance, $\delta(p, p^-_\theta) = \delta(p^-_\theta, p)$, and plugging the resulting inequalities into the Jeffreys divergence, we observe that $J(p, p^-_\theta)$ is upper and lower bounded by $\delta^2(p, p^-_\theta)$ up to multiplicative constants.
Now we can look at Theorem 1. It was shown in [21] that the iterative process of Eq. (1) reduces the divergence between $p^-_{\theta_t}$ and $p$, which bounds the Jeffreys divergence as shown in Corollary 1. Lemma 1 shows the connection between the Jeffreys divergence and the WGAN objective (Eq. (3)) when $f(\mathbf{x}) = \frac{1}{2} \ln \frac{p(\mathbf{x})}{p^-_\theta(\mathbf{x})}$. We can therefore see that the formulation of introspective neural networks (Eq. (1)) connects to a lower bound of the WGAN [2] objective (Eq. (3)).
C. Texture and CelebA Modeling Architecture. Inspired by [22], we design a CNN architecture for 64×64 images as in Table 5. We use the Swish [39] nonlinearity after each convolutional layer. We add Layer Normalization [3] after each convolution except the first layer, following [14].
Table 5: CNN architectures used for texture and CelebA modeling and for the alternative initialization network.
D. Alternative Initializations. We sample an initial pseudo-negative image by applying an operation defined by the network above to a random noise tensor. The weights of the network are randomly sampled as well. We do not apply any nonlinearities in the network. We add Layer Normalization [3] after each convolution except the last layer.
References
 [1] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive science, 9(1):147–169, 1985.
 [2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
 [3] L. J. Ba, R. Kiros, and G. E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.

 [4] P. Baldi. Autoencoders, unsupervised learning, and deep architectures. In ICML Workshop on Unsupervised and Transfer Learning, pages 37–49, 2012.
 [5] S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE transactions on pattern analysis and machine intelligence, 19(4):380–393, 1997.
 [6] A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In CVPR, 2015.
 [7] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville. Adversarially learned inference. In ICLR, 2017.
 [8] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. of Comp. and Sys. Sci., 55(1), 1997.
 [9] L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In NIPS, 2015.
 [10] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
 [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
 [12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
 [13] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
 [14] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of wasserstein gans. In NIPS, 2017.
 [15] T. Han, Y. Lu, S.-C. Zhu, and Y. N. Wu. Alternating back-propagation for generator network. In AAAI, 2017.
 [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [17] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
 [18] G. Huang*, Z. Liu*, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
 [19] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie. Stacked generative adversarial networks. In CVPR, 2017.
 [20] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
 [21] L. Jin, J. Lazarow, and Z. Tu. Introspective classification with convolutional nets. In NIPS, 2017.
 [22] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In ICLR, 2018.
 [23] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [24] D. P. Kingma and M. Welling. Autoencoding variational bayes. In ICLR, 2014.
 [25] A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research).
 [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
 [27] T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In NIPS, 2015.
 [28] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning at scale. In ICLR, 2017.
 [29] J. Lazarow*, L. Jin*, and Z. Tu. Introspective neural networks for generative modeling. In ICCV, 2017.
 [30] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. In Neural Computation, 1989.
 [31] C.-Y. Lee*, S. Xie*, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In AISTATS, 2015.
 [32] S. Liu, O. Bousquet, and K. Chaudhuri. Approximation and convergence properties of generative adversarial learning. In NIPS, 2017.
 [33] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015.
 [34] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
 [35] J. Ngiam, Z. Chen, P. W. Koh, and A. Y. Ng. Learning deep energy models. In ICML, 2011.
 [36] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 37(23):3311–3325, 1997.
 [37] M. S. Pinsker. Information and information stability of random variables and processes. 1960.
 [38] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
 [39] P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. CoRR, abs/1710.05941, 2017.
 [40] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In NIPS, 2016.
 [41] I. Sason and S. Verdú. f-divergence inequalities. IEEE Transactions on Information Theory, 62(11):5973–6006, 2016.
 [42] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. In ICLR, 2015.
 [43] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
 [44] Z. Tu. Learning generative models via discriminative approaches. In CVPR, 2007.
 [45] D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feedforward synthesis of textures and stylized images. In ICML, 2016.
 [46] D. Warde-Farley and Y. Bengio. Improving generative adversarial networks with denoising feature matching. In ICLR, 2017.
 [47] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.
 [48] M. Welling, R. S. Zemel, and G. E. Hinton. Self supervised boosting. In NIPS, 2002.
 [49] Y. N. Wu, Z. Si, H. Gong, and S.-C. Zhu. Learning active basis model for object detection and recognition. International journal of computer vision, 90(2):198–235, 2010.

 [50] J. Xie, Y. Lu, R. Gao, S.-C. Zhu, and Y. N. Wu. Cooperative learning of energy-based model and latent variable model via MCMC teaching. In AAAI, 2018.
 [51] J. Xie, Y. Lu, S.-C. Zhu, and Y. N. Wu. A theory of generative convnet. In ICML, 2016.
 [52] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
 [53] S.-C. Zhu and D. Mumford. A stochastic grammar of images. Foundations and Trends® in Computer Graphics and Vision, 2(4):259–362, 2007.
 [54] S. C. Zhu, Y. N. Wu, and D. Mumford. Minimax entropy principle and its application to texture modeling. Neural Computation, 9(8):1627–1660, 1997.