On the other hand, unsupervised generative models in deep learning were previously built with methods under the umbrella of graphical models, e.g., the Boltzmann machine
or autoencoder [4, 24] architectures. However, the rich representational power of convolution-based (discriminative) models was not directly enjoyed by these generative models. Later, inverting convolutional neural networks to convert internal representations into a real image was investigated in [6, 27]. Recently, generative adversarial networks (GAN) and follow-up works [38, 2, 14]
have attracted a tremendous amount of attention in machine learning and computer vision by producing high-quality synthesized images through training a pair of competing models against each other in an adversarial manner. While the generator tries to create "fake" images to fool the discriminator, the discriminator attempts to discern between the "real" (given training) and "fake" images. After convergence, the generator is able to produce images faithful to the underlying data distribution.
Before the deep learning era, generative modeling had seen a steady pace of development [1, 5, 54, 36, 49, 53]. These models were guided by rigorous statistical theories which, although theoretically elegant, did not succeed in producing synthesized images of practical quality.
In terms of building generative models from discriminative classifiers, there were early attempts in [48, 44]. The former obtained a generative model from a repeatedly trained boosting algorithm with weak classifiers, whereas the latter used a strong classifier to self-generate negative examples, or "pseudo-negatives".
To address the lack of representational richness and synthesis efficiency, convolutional neural networks were adopted in introspective neural networks (INN) [29, 21] to build a single model that is simultaneously generative and discriminative. The generative modeling aspect was studied in [29], where a sequence of CNN classifiers was trained, while the power in the classification setting was revealed in [21] in the form of introspective convolutional networks (ICN), which use only a single CNN classifier. Although the INN models [29, 21] point to a promising direction for obtaining a single model that is both a good generator and a strong discriminative classifier, a sequence of CNNs was needed to generate realistic syntheses. This requirement is a possible bottleneck with respect to training complexity and model size.
Recently, a generic formulation was developed within the GAN model family that incorporates a Wasserstein objective to alleviate the well-known difficulty in GAN training [2]. Motivated by introspective neural networks (INN) and this Wasserstein objective, we propose to adopt the Wasserstein term into the INN formulation to enhance its modeling capability. The resulting model, Wasserstein introspective neural networks (WINN), shows greatly enhanced modeling capability over INN while greatly reducing the number of CNN classifiers required.
2 Significance and Related Work
We make several interesting observations for WINN:
By adopting the Wasserstein distance into INN, we are able to generate images using a single CNN in WINN with even higher quality than those produced by INN using 20 CNNs (as seen in Figures 2, 4, 5, 6, and 7; similar underlying CNN architectures are used in WINN and INN). WINN thus achieves a significant reduction in model complexity over INN, making the generator more practical.
Within texture modeling, INN and WINN inherently model the input image space, making the synthesis of large texture images realistic, whereas GAN projects a noise vector onto the image space, which makes image patch stitching more difficult (although extensions exist), as demonstrated in Figure 2.
To compare with the family of GAN models, we compute Inception scores using the standard procedure on the CIFAR-10 dataset and observe modest results. Here, we typically train 4-5 cascades to boost the numbers, but WINN with one CNN is already promising. Overall, modern GAN variants still outperform our WINN with better-quality images. Some results are shown in Figure 7.
To test the robustness of the discriminative abilities of WINN, we directly make WINN into a discriminative classifier by training it on the standard MNIST and SVHN datasets. Not only are we able to improve over the previous ICN classifier for supervised classification, but we also observe a large improvement in robustness against adversarial examples compared with the baseline CNN, ResNet, and the competing ICN.
In terms of other related work, we briefly discuss some existing methods below.
Wasserstein GAN. A closely related work to our WINN algorithm is the Wasserstein generative adversarial networks (WGAN) method [2, 14]. While WINN adopts the Wasserstein distance as motivated by WGAN, our overall algorithm remains within the family of introspective neural networks (INN) [29, 21]. WGAN, on the other hand, is a variant of GAN with an objective that is easier to train. The difference between WINN and WGAN is similar to that between INN [29, 21] and GAN. Overall comparisons between INN and GAN have been described in [29, 21].
Generative ConvNets. Recently, there has also been a cluster of algorithms developed in [50, 51, 15] where Langevin dynamics are adopted in generator CNNs. However, the models proposed in [50, 51, 15] do not perform introspection (Figure 1) and their generator and discriminator components are still somewhat separated; thus, their generators are not used as effective discriminative classifiers to perform state-of-the-art classification on standard supervised machine learning tasks. Their training processes are also more complex than those of INN and WINN.
Deep energy models (DEM). DEM extends standard density estimation by using multi-layer neural networks (MLNN) with a rather complex training procedure. The probability model in DEM includes both the raw input and the features computed by the MLNN. WINN instead takes a more general and simpler form and is easier to train (see Eq. (1)). In general, DEM belongs to the minimum description length (MDL) family of models, in which maximum likelihood is pursued; WINN, instead, has a formulation that is simultaneously discriminative and generative.
3 Introspective Neural Networks
3.1 Brief introduction of INN
We first briefly introduce the introspective neural network method (INNg) for generative modeling and its companion model ICN, which focuses on the classification aspect. The main motivation behind the INN work [29, 21] is to make a convolutional neural network classifier simultaneously discriminative and generative. A single CNN classifier is trained in an introspective manner to improve the standard supervised classification result; however, a sequence of CNNs is needed to be able to synthesize images of good quality.
Figure 1 shows a brief illustration of a single introspective CNN classifier. We discuss our basic unsupervised formulation next. Suppose we are given a set of training examples $S^+ = \{x_i \mid i = 1, \ldots, n\}$, where each $x_i \in \mathbb{R}^m$ (e.g., $m$ is the number of pixels of an image). These constitute positive examples of the patterns/targets we wish to model. The main idea of INN is to define pseudo-negative examples that are self-generated by the discriminative classifier itself. We therefore define a label $y \in \{+1, -1\}$ for each example $x$: $y = +1$ if $x$ is from the given training set and $y = -1$ if $x$ is self-generated. Motivated by the generative via discriminative learning (GDL) framework, one can learn the generative model for the given examples, $p(x \mid y = +1)$, through a sequentially learned distribution of the pseudo-negative samples, $p(x \mid y = -1; \theta_t)$, abbreviated as $p^-_{\theta_t}(x)$, where $\theta_t$ includes all the model parameters learned at step $t$:
$$p^-_{\theta_t}(x) = \frac{1}{Z_t}\,\frac{q_{\theta_t}(y=+1 \mid x)}{q_{\theta_t}(y=-1 \mid x)}\; p^-_{\theta_{t-1}}(x), \qquad t = 1, \ldots, T, \quad (1)$$

where $Z_t = \int \frac{q_{\theta_t}(y=+1 \mid x)}{q_{\theta_t}(y=-1 \mid x)}\, p^-_{\theta_{t-1}}(x)\, dx$ is a normalizing constant and the initial distribution $p^-_{\theta_0}(x)$ is, e.g., a Gaussian distribution over the entire space of $x$. The discriminative classifier $q_{\theta_t}(y \mid x)$ is a convolutional neural network (CNN) parameterized by $\theta_t = (w^{(0)}_t, w^{(1)}_t)$, where $w^{(1)}_t$ denotes the weights of the top layer combining the features $\phi(x; w^{(0)}_t)$
(e.g., a softmax layer) and $w^{(0)}_t$ parameterizes the internal representations. The synthesis process through which pseudo-negative samples are generated is carried out by stochastic gradient Langevin dynamics as

$$\Delta x = \frac{\epsilon}{2}\,\nabla_x \ln \frac{q_{\theta_t}(y=+1 \mid x)}{q_{\theta_t}(y=-1 \mid x)} + \eta,$$

where $\eta \sim \mathcal{N}(0, \epsilon)$ is Gaussian noise and $\epsilon$ is a step size that is annealed in the sampling process. Overall, we desire $p^-_{\theta_t}(x) \rightarrow p(x \mid y=+1)$ as $t \rightarrow \infty$.
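To make the sampling procedure concrete, here is a minimal numpy sketch of stochastic gradient Langevin dynamics. It uses a toy linear stand-in for the CNN logit (for a sigmoid classifier, $\ln q(+1\mid x)/q(-1\mid x)$ equals the logit), so the names `classifier_logit`, the weight vector, and all step-size settings are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def classifier_logit(x, w):
    """Toy stand-in for the CNN logit f(x) = w . phi(x), with phi = identity.
    For a sigmoid classifier, ln q(+1|x)/q(-1|x) equals this logit."""
    return x @ w

def grad_logit(x, w):
    """Gradient of the logit w.r.t. the input; d(w.x)/dx = w everywhere."""
    return np.broadcast_to(w, x.shape)

def sgld_sample(w, n_samples=4, n_steps=200, eps0=0.1):
    """Draw pseudo-negatives via stochastic gradient Langevin dynamics:
    x <- x + (eps/2) * grad_x ln[q(+1|x)/q(-1|x)] + N(0, eps)."""
    x = rng.normal(size=(n_samples, w.size))   # p_0^-: Gaussian initialization
    for t in range(n_steps):
        eps = eps0 / (1 + t)                   # annealed step size
        noise = rng.normal(scale=np.sqrt(eps), size=x.shape)
        x = x + 0.5 * eps * grad_logit(x, w) + noise
    return x

w = np.array([1.0, -2.0])   # hypothetical top-layer weights
samples = sgld_sample(w)
print(samples.shape)        # samples drift toward regions scored "positive"
```

The annealed step size and the matching noise variance follow the standard Langevin recipe; in a real model `grad_logit` would be computed by backpropagation through the CNN.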
3.2 Connection to the Wasserstein distance
The overall INN training process, reclassification-by-synthesis, is carried out iteratively without an explicit objective function. The generative adversarial network (GAN) model instead has an objective function formulated in a minimax fashion, with the generator and discriminator competing against each other. The Wasserstein generative adversarial network (WGAN) work improves GAN by replacing the Jensen-Shannon distance with an efficient approximation of the Earth-Mover distance. There have also been further generalizations of the GAN family of models.
Let $p^+(x) := p(x \mid y=+1)$ be the target distribution and $p^-_\theta(x)$ be the pseudo-negative distribution parameterized by $\theta$. Next, we show a connection between the INN framework and the WGAN formulation, whose objective (rewritten with our notation) can be defined as

$$\min_{\theta} \; \max_{\|f\|_L \le 1} \; \mathbb{E}_{x \sim p^+}\big[f(x)\big] - \mathbb{E}_{x \sim p^-_\theta}\big[f(x)\big]. \quad (3)$$
Lemma 1. Considering $f(x) = \ln \frac{p^+(x)}{p^-_\theta(x)}$ and assuming its 1-Lipschitz property, we have a lower bound on the Wasserstein distance:

$$W(p^+, p^-_\theta) \;\ge\; \mathbb{E}_{x \sim p^+}\!\left[\ln \frac{p^+(x)}{p^-_\theta(x)}\right] - \mathbb{E}_{x \sim p^-_\theta}\!\left[\ln \frac{p^+(x)}{p^-_\theta(x)}\right] \;=\; KL(p^+ \,\|\, p^-_\theta) + KL(p^-_\theta \,\|\, p^+) \;=\; \mathcal{J}(p^+, p^-_\theta),$$

where $KL(\cdot \,\|\, \cdot)$ denotes the Kullback-Leibler divergence between the two distributions and $\mathcal{J}(\cdot, \cdot)$ is the Jeffreys divergence.
Proof: see Appendix A.
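The identity behind Lemma 1, that $\mathbb{E}_{p^+}[f] - \mathbb{E}_{p^-}[f]$ with $f = \ln(p^+/p^-)$ equals the Jeffreys divergence, can be checked numerically. The discrete distributions `p` and `q` below are arbitrary stand-ins for $p^+$ and $p^-_\theta$, chosen only for illustration.

```python
import numpy as np

# Two discrete distributions standing in for p+ and p_theta^-.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

kl_pq = np.sum(p * np.log(p / q))       # KL(p || q)
kl_qp = np.sum(q * np.log(q / p))       # KL(q || p)
jeffreys = kl_pq + kl_qp                # Jeffreys divergence J(p, q)

# The critic of Lemma 1: f(x) = ln p(x)/q(x).
f = np.log(p / q)
wgan_gap = np.sum(p * f) - np.sum(q * f)   # E_p[f] - E_q[f]

print(jeffreys, wgan_gap)   # equal by the algebra of the lemma
```

Expanding $\mathbb{E}_p[\ln p/q] - \mathbb{E}_q[\ln p/q]$ term by term gives exactly $KL(p\|q) + KL(q\|p)$, which is what the script confirms.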
Note that using Bayes' rule, the ratio of the generative probabilities in Lemma 1 can be turned into the ratio of the discriminative probabilities, assuming equal priors $p(y=+1) = p(y=-1)$.

Theorem 1. The introspective neural network formulation of Eq. (1) connects to a lower bound of the WGAN objective of Eq. (3).
Proof: see Appendix B.
4 Wasserstein Introspective Networks
Here we present the WINN formulation, building upon the prior introspective learning works presented in Section 3.
4.1 WINN algorithm
We denote our unlabeled input training data as $S^+ = \{x_i \mid i = 1, \ldots, n\}$. Also, we denote the set of all self-generated pseudo-negative samples up to step $t$ as $S^-_t$. In other words, $S^-_t$ consists of pseudo-negatives sampled from our models $p^-_{\theta_k}(x)$ for $k = 1, \ldots, t$, where $\theta_k$ is the model parameter vector at step $k$.
Classification-step. The classification-step can be viewed as training a classifier to approximate the Wasserstein distance between $S^+$ and $S^-_t$. Note that we also keep pseudo-negatives from earlier stages, which are essentially the mistakes of those stages, to prevent the classifier from forgetting what it has learned previously. We use CNNs parameterized by $\theta = (w^{(0)}, w^{(1)})$ as base classifiers. Let

$$f_\theta(x) = w^{(1)} \cdot \phi(x; w^{(0)})$$

denote the output of the final fully connected layer (without passing through the sigmoid nonlinearity) of the CNN. In the previous introspective learning frameworks [29, 21], the classifier learning objective was to minimize the following standard cross-entropy loss function on $S^+ \cup S^-_t$:

$$\mathcal{L}_{CE}(\theta) = -\,\mathbb{E}_{x \sim S^+}\big[\ln \sigma(f_\theta(x))\big] \;-\; \mathbb{E}_{x \sim S^-_t}\big[\ln\big(1 - \sigma(f_\theta(x))\big)\big],$$

where $\sigma(\cdot)$ denotes the sigmoid nonlinearity. Motivated by Section 3.2, in WINN training we instead wish to minimize the following Wasserstein loss function by the stochastic gradient descent algorithm via backpropagation:

$$\mathcal{L}_{W}(\theta) = \mathbb{E}_{x \sim S^-_t}\big[f_\theta(x)\big] \;-\; \mathbb{E}_{x \sim S^+}\big[f_\theta(x)\big].$$

To enforce the function $f_\theta$ to be 1-Lipschitz, we add the following gradient penalty term to $\mathcal{L}_W$:

$$\lambda\, \mathbb{E}_{\hat{x}}\Big[\big(\|\nabla_{\hat{x}} f_\theta(\hat{x})\|_2 - 1\big)^2\Big],$$

where $\hat{x} = \alpha\, x^+ + (1 - \alpha)\, x^-$, $\alpha \sim U[0, 1]$, $x^+ \sim S^+$, and $x^- \sim S^-_t$.
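As a sketch of the classification-step loss, the following numpy fragment evaluates the Wasserstein term plus a gradient penalty for a toy linear critic, for which the input-gradient norm is available in closed form. The critic, the Gaussian stand-in data, and the penalty weight `lam=10.0` are all illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def critic(x, w):
    """Linear stand-in for the CNN output f_theta(x); its input
    gradient is w everywhere, so the true gradient norm is ||w||_2."""
    return x @ w

def winn_classification_loss(x_pos, x_neg, w, lam=10.0):
    # Wasserstein term: E_{x ~ S_t^-}[f(x)] - E_{x ~ S^+}[f(x)]
    w_loss = critic(x_neg, w).mean() - critic(x_pos, w).mean()
    # Interpolates x_hat = a*x+ + (1-a)*x-, as in the gradient penalty
    a = rng.uniform(size=(x_pos.shape[0], 1))
    x_hat = a * x_pos + (1 - a) * x_neg
    # For this linear critic the gradient norm at any x_hat is just ||w||_2;
    # a real implementation would backpropagate through the CNN at x_hat.
    grad_norm = np.linalg.norm(w)
    gp = lam * (grad_norm - 1.0) ** 2
    return w_loss + gp

x_pos = rng.normal(loc=+1.0, size=(64, 2))   # stand-ins for S^+
x_neg = rng.normal(loc=-1.0, size=(64, 2))   # stand-ins for S_t^-
w = np.array([0.5, 0.5])
print(winn_classification_loss(x_pos, x_neg, w))
```

Because the critic separates the two clouds, the Wasserstein term is negative here, while the penalty pushes the gradient norm toward 1.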
Synthesis-step. Obtaining increasingly difficult pseudo-negative samples is an integral part of the introspective learning framework, as it is crucial for tightening the decision boundary. To this end, we develop an efficient sampling procedure under the Wasserstein formulation. After the classification-step, we obtain the following distribution of pseudo-negatives:

$$p^-_{\theta_t}(x) = \frac{1}{Z_t} \exp\big(f_{\theta_t}(x)\big)\, p^-_{\theta_{t-1}}(x),$$

where $Z_t = \int \exp\big(f_{\theta_t}(x)\big)\, p^-_{\theta_{t-1}}(x)\, dx$; the initial distribution $p^-_{\theta_0}(x)$ is a Gaussian distribution or the distribution defined in Appendix D. We find that the distribution of Appendix D encourages the diversity of sampled images.
The following equivalence is shown in [29, 21]:

$$p^-_{\theta_t}(x) = \frac{1}{\prod_{k=1}^{t} Z_k} \exp\Big(\sum_{k=1}^{t} f_{\theta_k}(x)\Big)\, p^-_{\theta_0}(x).$$

The sampling strategy of [29, 21] was to carry out gradient ascent on the term $f_{\theta_t}(x)$. In Lemma 1 we chose $f(x)$ to be $\ln \frac{p^+(x)}{p^-_\theta(x)}$. Using Bayes' rule, it is easy to see that $f_\theta(x)$ is loosely connected to $\ln \frac{p^+(x)}{p^-_\theta(x)}$. Also, [2, 14] argue that the critic value $f_\theta(x)$ correlates with the quality of the sample $x$. This motivates the following sampling strategy. After initializing $x$ by drawing a fair sample from $p^-_{\theta_{t-1}}(x)$, we increase $f_{\theta_t}(x)$ using gradient ascent on the image $x$ via backpropagation. Specifically, we can obtain fair samples from the distribution $p^-_{\theta_t}(x)$ using the following update rule:
$$\Delta x = \frac{\epsilon}{2}\, \nabla_x f_{\theta_t}(x) + \eta,$$

where $\epsilon$ is a time-varying step size and $\eta \sim \mathcal{N}(0, \epsilon)$ is a random variable following a Gaussian distribution. The Gaussian noise term is added to make the samples cover the full distribution. Inspired by prior work, we found that injecting noise in the image space can be substituted by applying Dropout to the higher layers of the CNN. In practice, we were able to obtain samples of sufficient diversity without step-size annealing and noise injection.
As an early-stopping criterion, we empirically find the following effective: (1) we measure the minimum and maximum of $f_{\theta_t}$ over the positive examples; (2) we set the early-stopping threshold to a random number drawn from the uniform distribution between these two numbers. Intuitively, by matching the value of $f_{\theta_t}$ between positives and pseudo-negatives, we expect to obtain pseudo-negative samples that match the quality of positive samples.
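The synthesis update and the randomized early stop can be sketched together. The linear critic `f`, the Gaussian "positive" cloud, and the step size below are toy assumptions; only the structure (ascend the critic until its value crosses a threshold drawn between the min and max critic value of the positives) mirrors the procedure above.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x, w):
    """Toy stand-in for the critic f_theta."""
    return x @ w

def synthesize(w, x_pos, n_steps=500, lr=0.1):
    """Gradient-ascent synthesis with the randomized early stop:
    the threshold is drawn uniformly between the min and max of f
    over the positive examples."""
    lo, hi = f(x_pos, w).min(), f(x_pos, w).max()
    threshold = rng.uniform(lo, hi)
    x = rng.normal(size=w.size)        # initialize from a Gaussian p_0^-
    for _ in range(n_steps):
        if f(x, w) >= threshold:       # stop once the sample scores like a positive
            break
        x = x + lr * w                 # grad_x f = w for the linear critic
    return x, threshold

w = np.array([1.0, 1.0])
x_pos = rng.normal(loc=2.0, size=(32, 2))   # stand-ins for positive examples
x, thr = synthesize(w, x_pos)
print(f(x, w) >= thr)
```

Drawing the threshold at random, rather than always aiming for the maximum, spreads the pseudo-negatives over the range of critic values that real positives occupy.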
4.2 Expanding model capacity
In practice, we find that the version with a single classifier, which we call WINN-single, is expressive enough to capture the generative distribution in a variety of applications. The introspective learning formulation [44, 29, 21] also allows us to model more complex distributions by adding a sequence of cascaded classifiers, with each cascade trained against, and initialized from, the pseudo-negatives of the previous one, so that later cascades progressively refine the modeled distribution.
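The cascade structure can be sketched abstractly: samples from one stage initialize the next, and each stage's critic pulls them further toward the data. Everything concrete below (the quadratic critics, the target point, step counts) is a toy assumption used only to show the control flow.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical cascade of K = 3 critics f_k; each scores samples near
# the data mode (here the point (2, 2)) more sharply than the last.
target = np.array([2.0, 2.0])

def critic_grad(x, k):
    """Analytic input gradient of the toy critic f_k(x) = -(k+1)||x - target||^2."""
    return -2.0 * (k + 1) * (x - target)

x = rng.normal(size=(16, 2))       # cascade 1 initializes from noise
for k in range(3):                  # samples of cascade k initialize cascade k+1
    for _ in range(100):
        x = x + 0.01 * critic_grad(x, k)   # gradient ascent on f_k
print(np.abs(x - target).max())
```

Each outer iteration plays the role of one cascaded classifier: it receives the previous stage's samples and moves them closer to the target distribution.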
In the next sections, we demonstrate the modeling capability of WINN under cascaded classifiers, as well as its agnosticism to the type of base classifier.
4.3 GAN’s discriminator vs. WINN’s classifier
GAN uses a competing discriminator and generator, whereas WINN maintains a single model that is both generative and discriminative. Some general comparisons between GAN and INN have been provided in [29, 21]. Below we make a few additional observations that are worth future exploration and discussion.
First, the generator of GAN is a cost-effective option for image patch synthesis, as it works in a feed-forward fashion. However, the generator of GAN is not meant to be trained as a classifier to perform standard classification tasks, while the generator in the introspective framework is also a strong classifier. Section 5.6 shows WINN to have significant robustness to external adversarial examples.
Second, the discriminator in GAN is meant to be a critic but not a generator. To examine whether the discriminator in GAN can also be used as a generator, we train WGAN-GP on the CelebA face dataset. Using the same CNN architecture (a ResNet) that served as GAN's discriminator, we also train a WINN-single model, so that GAN's discriminator and WINN-single have identical CNN architectures. Applying our sampling strategy to WGAN-GP's discriminator allows us to synthesize images from it as well; we show some samples in Figure 3 (a). These synthesized images do not look like faces, yet they are classified by the discriminator of WGAN-GP as "real" faces; this demonstrates the separation between the generator and the discriminator in GAN. In contrast, images synthesized by WINN-single's CNN classifier are face-like, as shown in Figure 3 (b).
Classification-step. For training the discriminator network, we use Adam with a mini-batch size of 100 and a learning rate of 0.0001. Each batch consists of 50 positive images sampled from the set of positives $S^+$ and 50 pseudo-negative images sampled from the set of pseudo-negatives $S^-_t$. In each iteration, we cap the total number of training images used.
Synthesis-step. For synthesizing pseudo-negative images via backpropagation, we perform gradient ascent in the image space. In the first cascade, each image is initialized with noise sampled from the distribution described in Appendix D. In later cascades, images are initialized with images sampled from the previous cascade. We use Adam with a mini-batch size of 100 and a learning rate of 0.01.
5.2 Texture modeling
We evaluate the texture modeling capability of WINN. For a fair comparison, we use the same 7 texture images presented in prior work, where each texture image has a size of 256×256. We follow the training method of Section 5.1, except that positive images are constructed by cropping 64×64 patches from the source texture image at random positions. We use the network architecture of Appendix C. After the 64×64 patch-based model is trained, we synthesize texture images of arbitrary size using the anysize-image-generation method. During the synthesis process, we keep a single working image of size 320×320; we expand the image so that the center 256×256 pixels are covered with equal probability. In each iteration, we sample 200 patches from the working image and perform gradient ascent on the chosen patches. For pixels overlapping between patches, we take the average of the gradients assigned to those pixels. We show synthesized texture images in Figures 2 and 4. WINN-single shows a significant improvement over INNg-single and comparable results to INNg (which uses 20 CNNs). It is worth noting that [10, 45] leverage the rich features of a VGG-19 network pretrained on ImageNet; WINN and INNg instead train networks from scratch.
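The gradient averaging over overlapping patches can be sketched directly. The helper name and the toy 8×8 image are illustrative; the logic (accumulate each patch's gradient into the working image and divide by per-pixel counts) is the operation described above.

```python
import numpy as np

def accumulate_patch_gradients(image_shape, corners, patch_grads, size):
    """Average per-patch gradients into a full-size gradient image.
    `corners` holds top-left (row, col) positions; pixels covered by
    several patches get the mean of all gradients assigned to them."""
    grad_sum = np.zeros(image_shape)
    count = np.zeros(image_shape)
    for (r, c), g in zip(corners, patch_grads):
        grad_sum[r:r+size, c:c+size] += g
        count[r:r+size, c:c+size] += 1
    count[count == 0] = 1      # untouched pixels keep a zero gradient
    return grad_sum / count

# Toy example: 8x8 working image, two overlapping 4x4 patches.
corners = [(0, 0), (2, 2)]
grads = [np.ones((4, 4)), 3 * np.ones((4, 4))]
avg = accumulate_patch_gradients((8, 8), corners, grads, 4)
print(avg[3, 3])   # overlap of both patches -> (1 + 3) / 2 = 2.0
```

Averaging rather than summing keeps the effective step size uniform across the working image regardless of how many sampled patches happen to cover a pixel.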
5.3 CelebA face modeling
(Figure 5 panels: INNg, WINN-single (ours), WINN-4CNNs (ours).)
The CelebA dataset consists of face images of celebrities. This dataset has been widely used in previous generative modeling works since it contains large pose variations and background clutter. The network architecture adopted here is described in Appendix C. In Figure 5, we show some synthesized face images using WINN-single and WINN, as well as those by DCGAN, INNg-single, and INNg. WINN-single attains image quality even higher than that of INNg (12 CNNs).
5.4 SVHN modeling
SVHN consists of images from Google Street View. It contains a training set, a test set, and a set of extra images; we use only the training images for unsupervised SVHN modeling, with a ResNet architecture. Generated images by WINN-single and WINN (4 CNN classifiers), as well as DCGAN and INNg, are shown in Figure 6. The improvement of WINN over INNg is evident.
5.5 CIFAR-10 modeling
(Table: Inception scores on CIFAR-10; compared methods include DCGAN, ALI, and Improved GANs.)
CIFAR-10 consists of 50,000 training images and 10,000 test images of size 32×32 in 10 classes. We use training images augmented by horizontal flips for unsupervised CIFAR-10 modeling, with a ResNet architecture. Figure 7 shows generated images by various models.
(Figure 7 panels: real data, DCGAN, WINN-single (ours), WINN-5CNNs (ours).)
To measure semantic discriminability, we compute the Inception scores on the generated images. WINN shows a clear advantage over INN. WINN-5CNNs produces a result close to WGAN, but there is still a gap to the state-of-the-art results of WGAN-GP.
5.6 Image classification and adversarial examples
To demonstrate the robustness of WINN as a discriminative classifier, we present experiments on supervised classification tasks.
Training methods. We add the Wasserstein loss term to the ICN loss function, obtaining the combined objective

$$\mathcal{L}(\theta) = \mathcal{L}_{ICN}(\theta) + \lambda\, \mathcal{L}_{W}(\theta),$$

where $\theta = \big(w^{(0)}, \{w^{(1)}_k\}\big)$, $w^{(0)}$ denotes the internal parameters of the CNN, and $w^{(1)}_k$ denotes the top-layer weights for the $k$-th class. In the experiments, we set the weight of the WINN loss, $\lambda$, to 0.01. We use a vanilla network architecture resembling prior designs as the baseline CNN, with fewer filters and parameters. We also use a ResNet-32 architecture with Layer Normalization on MNIST and SVHN. In the classification-step, we use Adam with a fixed learning rate of 0.001; the synthesis-step also uses the Adam optimizer. Table 2 shows the errors on MNIST and SVHN.
(Table 3, MNIST: adversarial error of each method, and correction rates by the method and by the baseline, for the baseline vanilla CNN, WINN-single vanilla (ours), and WINN-single ResNet-32 (ours).)
Robustness to adversarial examples. It has been argued that a discriminative CNN's vulnerability to adversarial examples primarily arises from its linear nature. Since the reclassification-by-synthesis process helps tighten the decision boundary (Figure 1), one might expect CNNs trained with the WINN algorithm to be more robust to adversarial examples. Note that, unlike existing methods for adversarial defense [13, 28], our method does not train networks on specific types of adversarial examples. On the test images of MNIST and SVHN, we adopt the "fast gradient sign method" to generate adversarial examples, clipped to the valid image range. We experiment with two networks having the same architecture and differing only in training method (the standard cross-entropy loss vs. the WINN procedure); we call the former the baseline CNN. We summarize the results in Table 3. Compared to ICN, WINN significantly reduces the adversarial error to 7.99% and improves the correction rate to 90.00%. In addition, we have adopted the ResNet-32 architecture into WINN; see Tables 3 and 4. We still obtain adversarial error reduction and correction rate improvement on MNIST and SVHN with ResNet-32. Our observation is that WINN does not necessarily improve over a strong baseline on the supervised classification task, but its advantage under adversarial attack is evident.
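The fast gradient sign method itself is simple to state in code: perturb the input by a small amount in the sign of the loss gradient, then clip back to the valid range. The numpy sketch below uses a toy logistic model; the weights, input, and step size are illustrative assumptions, not the experiments' settings.

```python
import numpy as np

def fgsm(x, grad_loss_x, eps):
    """Fast gradient sign method: move x by eps in the sign of the
    loss gradient, then clip to the valid image range [0, 1]."""
    x_adv = x + eps * np.sign(grad_loss_x)
    return np.clip(x_adv, 0.0, 1.0)

# Toy example: cross-entropy loss of a linear logistic model, true label y = +1.
w = np.array([2.0, -1.0])
x = np.array([0.6, 0.4])
y = 1.0
p = 1.0 / (1.0 + np.exp(-(x @ w)))   # sigmoid(w . x)
grad = (p - y) * w                   # d/dx of the cross-entropy loss
x_adv = fgsm(x, grad, eps=0.1)
print(x_adv)                         # -> [0.5 0.5]
```

For a deep network, `grad` would come from one backward pass of the classification loss with respect to the input pixels.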
(Table 4, SVHN: adversarial error of the method, and correction rates by the method and by the baseline, for WINN-single ResNet-32 (ours).)
5.7 Agnostic to different architectures
In Figure 8, we demonstrate that our algorithm is agnostic to the type of classifier by varying the network architecture to ResNet and DenseNet. Little modification was required to adapt the two architectures for WINN.
(Figure 8 panels: WINN-single (ResNet-13), WINN-single (DenseNet-20).)
In this paper, we have introduced Wasserstein introspective neural networks (WINN), which produce encouraging results as a generator and a discriminative classifier at the same time. WINN achieves a model size reduction over the previous introspective neural networks (INN) by a factor of 20. In most of the images shown in the paper, we find a single CNN classifier in WINN sufficient to produce visually appealing images as well as a significant error reduction against adversarial examples. WINN is agnostic to the CNN architecture design, and we demonstrate results on three networks, a vanilla CNN, ResNet, and DenseNet, where popular CNN discriminative classifiers are turned into generative models under the WINN procedure. WINN can be adopted in a wide range of computer vision applications such as image classification, recognition, and generation.
Acknowledgements. This work is supported by NSF IIS-1717431 and NSF IIS-1618477. The authors thank Justin Lazarow, Long Jin, Hubert Le, Ying Nian Wu, Max Welling, Richard Zemel, and Tong Zhang for valuable discussions.
A. Proof of Lemma 1.
Proof. Plugging $f(x) = \ln \frac{p^+(x)}{p^-_\theta(x)}$ into Eq. (3), we have

$$\mathbb{E}_{x \sim p^+}\!\left[\ln \frac{p^+(x)}{p^-_\theta(x)}\right] - \mathbb{E}_{x \sim p^-_\theta}\!\left[\ln \frac{p^+(x)}{p^-_\theta(x)}\right] = KL(p^+ \,\|\, p^-_\theta) + KL(p^-_\theta \,\|\, p^+) = \mathcal{J}(p^+, p^-_\theta).$$
B. Proof of Theorem 1.
By Lemma 1, $W(p^+, p^-_\theta) \ge \mathcal{J}(p^+, p^-_\theta)$, where $\mathcal{J}$ is the Jeffreys divergence. By Pinsker's inequality,

$$KL(p^+ \,\|\, p^-_\theta) \ge 2\,\delta(p^+, p^-_\theta)^2,$$

where $\delta(\cdot, \cdot)$ is the total variation (TV) distance. From the f-divergence inequalities of Sason and Verdú we also have an upper bound on the KL divergence in terms of the TV distance. Applying the above bounds to both KL terms and using the symmetry of the TV distance, $\delta(p^+, p^-_\theta) = \delta(p^-_\theta, p^+)$, we observe that the Jeffreys divergence $\mathcal{J}(p^+, p^-_\theta)$ is upper and lower bounded by functions of $\delta(p^+, p^-_\theta)$.

Now we can look at Theorem 1. It was shown in the GDL framework that the iterative process of Eq. (1) reduces the divergence between $p^-_{\theta_t}$ and $p^+$, which bounds the Jeffreys divergence as shown above. Lemma 1 shows the connection between the Jeffreys divergence and the WGAN objective (Eq. (3)) when $f(x) = \ln \frac{p^+(x)}{p^-_\theta(x)}$. We therefore see that the formulation of introspective neural networks (Eq. (1)) connects to a lower bound of the WGAN objective (Eq. (3)).
C. Texture and CelebA Modeling Architecture. We design a CNN architecture for 64×64 images as in Table 5. We use the Swish nonlinearity after each convolutional layer, and add Layer Normalization after each convolution except the first layer.
(Table 5 columns: Textures and CelebA; Alternative Initialization.)
D. Alternative Initializations
We sample an initial pseudo-negative image by applying the operation defined by the network above to a noise tensor sampled from a Gaussian distribution. The weights of the network are likewise sampled from a Gaussian distribution. We do not apply any nonlinearities in the network, and we add Layer Normalization after each convolution except the last layer.
-  D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147–169, 1985.
-  M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
-  L. J. Ba, R. Kiros, and G. E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
-  P. Baldi. Autoencoders, unsupervised learning, and deep architectures. In ICML Workshop on Unsupervised and Transfer Learning, pages 37–49, 2012.
-  S. Della Pietra, V. Della Pietra, and J. Lafferty. Inducing features of random fields. IEEE transactions on pattern analysis and machine intelligence, 19(4):380–393, 1997.
-  A. Dosovitskiy, J. T. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In CVPR, 2015.
-  V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville. Adversarially learned inference. In ICLR, 2017.
-  Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. of Comp. and Sys. Sci., 55(1), 1997.
-  L. Gatys, A. S. Ecker, and M. Bethge. Texture synthesis using convolutional neural networks. In NIPS, 2015.
-  L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
-  I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015.
-  I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of wasserstein gans. In NIPS, 2017.
-  T. Han, Y. Lu, S.-C. Zhu, and Y. N. Wu. Alternating back-propagation for generator network. In AAAI, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural computation, 18(7):1527–1554, 2006.
-  G. Huang*, Z. Liu*, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
-  X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie. Stacked generative adversarial networks. In CVPR, 2017.
-  P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
-  L. Jin, J. Lazarow, and Z. Tu. Introspective classification with convolutional nets. In NIPS, 2017.
-  T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. In ICLR, 2018.
-  D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
-  D. P. Kingma and M. Welling. Auto-encoding variational bayes. In ICLR, 2014.
-  A. Krizhevsky, V. Nair, and G. Hinton. CIFAR-10 (Canadian Institute for Advanced Research).
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
-  T. D. Kulkarni, W. F. Whitney, P. Kohli, and J. Tenenbaum. Deep convolutional inverse graphics network. In NIPS, 2015.
-  A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial machine learning at scale. In ICLR, 2017.
-  J. Lazarow*, L. Jin*, and Z. Tu. Introspective neural networks for generative modeling. In ICCV, 2017.
-  Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Backpropagation applied to handwritten zip code recognition. In Neural Computation, 1989.
-  C.-Y. Lee*, S. Xie*, P. Gallagher, Z. Zhang, and Z. Tu. Deeply-supervised nets. In AISTATS, 2015.
-  S. Liu, O. Bousquet, and K. Chaudhuri. Approximation and convergence properties of generative adversarial learning. In NIPS, 2017.
-  Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015.
-  Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS workshop on deep learning and unsupervised feature learning, volume 2011, page 5, 2011.
-  J. Ngiam, Z. Chen, P. W. Koh, and A. Y. Ng. Learning deep energy models. In ICML, 2011.
-  B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 37(23):3311–3325, 1997.
-  M. S. Pinsker. Information and information stability of random variables and processes. 1960.
-  A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
-  P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. CoRR, abs/1710.05941, 2017.
-  T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In NIPS, 2016.
-  I. Sason and S. Verdú. f-divergence inequalities. IEEE Transactions on Information Theory, 62(11):5973–6006, 2016.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
-  Z. Tu. Learning generative models via discriminative approaches. In CVPR, 2007.
-  D. Ulyanov, V. Lebedev, A. Vedaldi, and V. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In ICML, 2016.
-  D. Warde-Farley and Y. Bengio. Improving generative adversarial networks with denoising feature matching. In ICLR, 2017.
-  M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient langevin dynamics. In ICML, 2011.
-  M. Welling, R. S. Zemel, and G. E. Hinton. Self supervised boosting. In NIPS, 2002.
-  Y. N. Wu, Z. Si, H. Gong, and S.-C. Zhu. Learning active basis model for object detection and recognition. International journal of computer vision, 90(2):198–235, 2010.
-  J. Xie, Y. Lu, R. Gao, S.-C. Zhu, and Y. N. Wu. Cooperative learning of energy-based model and latent variable model via MCMC teaching. In AAAI, 2018.
-  J. Xie, Y. Lu, S.-C. Zhu, and Y. N. Wu. A theory of generative convnet. In ICML, 2016.
-  S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
-  S.-C. Zhu and D. Mumford. A stochastic grammar of images. Foundations and Trends® in Computer Graphics and Vision, 2(4):259–362, 2007.
-  S. C. Zhu, Y. N. Wu, and D. Mumford. Minimax entropy principle and its application to texture modeling. Neural Computation, 9(8):1627–1660, 1997.