Towards Interpretable Image Synthesis by Learning Sparsely Connected AND-OR Networks

09/10/2019 ∙ by Xianglei Xing, et al. ∙ NC State University

This paper proposes interpretable image synthesis by learning hierarchical AND-OR networks of sparsely connected semantically meaningful nodes. The proposed method is based on the compositionality and interpretability of the scene-objects-parts-subparts-primitives hierarchy in image representation. A scene has different types (i.e., OR) each of which consists of a number of objects (i.e., AND). This can be recursively formulated across the scene-objects-parts-subparts hierarchy and is terminated at the primitive level (e.g., Gabor wavelets-like basis). To realize this interpretable AND-OR hierarchy in image synthesis, the proposed method consists of two components: (i) Each layer of the hierarchy is represented by an over-complete set of basis functions. The basis functions are instantiated using convolution to be translation covariant. Off-the-shelf convolutional neural architectures are then exploited to implement the hierarchy. (ii) Sparsity-inducing constraints are introduced in end-to-end training, which facilitate a sparsely connected AND-OR network to emerge from initially densely connected convolutional neural networks. A straightforward sparsity-inducing constraint is utilized, that is, to only allow the top-k basis functions to be active at each layer (where k is a hyperparameter). The learned basis functions are also capable of image reconstruction to explain away input images. In experiments, the proposed method is tested on five benchmark datasets. The results show that meaningful and interpretable hierarchical representations are learned, with better image synthesis and reconstruction quality than state-of-the-art baselines.







1 Introduction

Remarkable recent progress on image synthesis [12, 6, 19, 37, 3, 38] has been made using deep neural networks (DNNs) [23, 22]. Most efforts focus on developing sophisticated architectures and training paradigms for sharp and realistic-looking image synthesis [28, 5, 19]. Although high-fidelity images can be generated, the internal synthesizing process of DNNs is still largely viewed as a black box, potentially hindering their long-term applicability in eXplainable AI (XAI) [8]. More recently, the generative adversarial network (GAN) dissection method [4] has been proposed to identify internal neurons in pre-trained GANs that show interpretable meanings, using a separate annotated dataset in a post-hoc fashion.

In this paper, we focus on learning interpretable models for unconditional image synthesis from scratch with explicit hierarchical representations. By interpretable image synthesis, we mean that the internal image generation process can be explicitly unfolded through meaningful basis functions, learned end-to-end at different layers and conceptually reflecting the hierarchy of scene-objects-parts-subparts-primitives. A scene has different types (i.e., OR) each of which consists of a number of objects (i.e., AND). This can be recursively formulated across the scene-objects-parts-subparts hierarchy and is terminated at the primitive level (e.g., Gabor wavelets-like basis functions). Figure 1 shows an example of the AND-OR tree learned from scratch for explaining a generated face image.

Fig. 1: An example of an AND-OR tree learned for faces. For clarity, we only show 3 layers (out of the total 5 layers). On the top, we show a synthesized face. A 3-layer AND-OR tree is illustrated from Layer 3 (the part/composite part level) to Layer 1 (the primitive level). The entire grid of the Layer-3 feature map is interpreted by an AND-node. Each position in the grid is interpreted by an OR-node, such as the eye nodes. Similarly, Layer 2 and Layer 1 can be interpreted w.r.t. the AND-OR compositions. The activated basis functions have semantically meaningful interpretations at Layer 3 and Layer 2. Layer 1 shows the learned primitives, covering the classic Gabor-like wavelets and blob-like primitives. See text for detail. Best viewed in color.

The hierarchy of scene-objects-parts-subparts-primitives lies at the heart of image grammar models [11, 44]. The AND-OR compositionality has been applied in many vision tasks [44]. With the recent resurgence of deep neural networks (DNNs) [23, 22] and the more recent DNN-based image synthesis frameworks, such as the widely used Generative Adversarial Networks (GANs) [12] and Variational Auto-Encoder (VAE) methods [20, 15], the hierarchy is usually assumed to be modeled implicitly in DNNs. Due to the dense connections between consecutive layers in traditional DNNs, they often learn noisy compositional patterns of how entities in a layer are formed from “smaller” ones in the layer right below it.

On the other hand, the sparsity principle has played a fundamental role in high-dimensional statistics, machine learning, signal processing and AI. In particular, the sparse coding scheme [33] is an important principle for understanding the visual cortex. By imposing sparsity constraints on the coefficients of a linear generative model, [34] learned Gabor-like wavelets from natural image patches that resemble the neurons in the primary visual cortex (V1). Since then, much important work on sparse coding has been presented in the literature, preceding the resurgence of DNNs. Given the remarkable successes of sparse coding models, it is not unreasonable to assume that a top-down generative model of natural images should be based on the linear sparse coding model, or incorporate the sparse coding principle at all of its layers. However, developing a top-down sparse coding model that can generate, rather than merely reconstruct, realistic-looking natural image patterns has proven to be a difficult task [18], mainly due to the difficulty of selecting and fitting sparse basis functions to each image.

In this paper, we take a step forward by rethinking the dense connections between consecutive layers in traditional DNNs. We propose to “re-wire” them sparsely for explicit modeling of the hierarchy of scene-objects-parts-subparts-primitives in image synthesis (see Figure 1). To realize the “re-wiring”, we integrate the sparsity principle into DNNs in a simple yet effective and adaptive way: (i) Each layer of the hierarchy is represented by an (over-complete) set of basis functions. The basis functions are instantiated using convolution to be translation covariant. Off-the-shelf convolutional neural architectures are then exploited to implement the hierarchy, such as the generator networks used in GANs. (ii) Sparsity-inducing constraints are introduced in end-to-end training, which facilitate a sparsely connected AND-OR network to emerge from initially densely connected convolutional neural networks. A straightforward sparsity-inducing constraint is utilized, that is, to only allow the top-k basis functions to fire at each layer (where k is a hyperparameter). By doing so, we can harness the highly expressive modeling capability and the end-to-end learning flexibility of DNNs, as well as the interpretability rigor of the explicit compositional hierarchy.
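As a concrete illustration, the top-k constraint can be sketched in a few lines of NumPy. This is a toy stand-alone version (the paper applies the constraint inside the generator layers); the function and variable names here are ours, not the paper's.

```python
import numpy as np

def top_k_mask(x, k):
    """Keep only the k largest entries of x; zero out the rest.

    Sketch of the sparsity-inducing constraint: only the top-k
    basis-function coefficients are allowed to fire at a layer.
    """
    x = np.asarray(x, dtype=float)
    if k >= x.size:
        return x.copy()
    keep = np.argpartition(x, -k)[-k:]  # indices of the k largest values
    out = np.zeros_like(x)
    out[keep] = x[keep]
    return out

coeffs = np.array([0.1, -2.0, 3.5, 0.7, 1.2])
sparse = top_k_mask(coeffs, k=2)  # only 3.5 and 1.2 survive
```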

2 Related Work


Sparsity-regularized feature learning methods [31, 30, 16] were proposed for learning effective feature representations that can improve the performance of classification tasks. The sparsity constraints are designed and encouraged by the Kullback-Leibler divergence between Bernoulli random variables [31], penalties on the normalized features [32], and the winner-take-all principle [29]. However, these methods do not have the ability to generate new data. Lee et al. [24, 17]

proposed convolutional deep belief networks, which employ sparsity regularization and probabilistic max-pooling to learn hierarchical representations. However, training deep belief nets is difficult and computationally expensive. Zeiler et al. [42, 43] proposed deconvolutional networks to learn low- and mid-level image representations based on the convolutional decomposition of images under a sparsity constraint. However, in the aforementioned methods, the hierarchical representations have to be learned layer by layer, that is, to first train the bottom layer of the network, then fix the learned layer and train the upper layers one by one. Moreover, the above methods usually work on gray images or gradient images, preprocessed to remove low-frequency texture information and highlight structure information. Unlike the above methods, the proposed method can directly work on raw color images without any preprocessing. The proposed model can simultaneously learn meaningful hierarchical representations, generate realistic images and reconstruct the original images.

Our Contributions. This paper makes three main contributions to the field of generative learning: (i) It proposes interpretable image synthesis that unfolds the internal generation process via a hierarchical AND-OR network of semantically meaningful nodes. (ii) It presents a simple yet effective sparsity-inducing method that facilitates a hierarchical AND-OR network of sparsely connected nodes to emerge from an initial network with dense connections between consecutive layers. (iii) It shows that meaningful hierarchical representations can be learned end-to-end in image synthesis, with better quality than state-of-the-art baselines.

3 The Proposed Approach

3.1 Image Synthesis and Model Interpretability

From the viewpoint of top-down generative learning in image synthesis, we start with a d-dimensional latent code vector z consisting of latent factors. We usually assume z ~ N(0, I_d), where I_d represents the d-dimensional identity matrix. In GANs and VAE, generator networks are used to implement the highly non-linear mapping from a latent code vector z to a synthesized image x, which lies in a D-dimensional image space (i.e., D equals the product of the spatial dimensions, width and height of an image, and the number of chromatic channels, such as 3 for RGB images). The generator networks are thus seen as non-linear extensions of factor analysis [13]. We have,



x = g(z; Θ) + ε, (1)

where ε denotes the observational errors, assumed to be Gaussian white noise, and g(·; Θ) represents the generator network, with Θ collecting the parameters from all layers.
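For intuition, this factor-analysis view can be sketched with a toy generator; the one-hidden-layer ReLU network and the sizes d = 4, D = 16 below are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

# Generator as non-linear factor analysis: x = g(z) + eps, with a
# standard normal latent code z and Gaussian observation noise.
rng = np.random.default_rng(0)
d, D = 4, 16
W1 = rng.normal(size=(8, d))
W2 = rng.normal(size=(D, 8))

def g(z):
    """Toy generator network: a one-hidden-layer ReLU MLP."""
    return W2 @ np.maximum(W1 @ z, 0.0)

z = rng.normal(size=d)                # z ~ N(0, I_d)
x = g(z) + 0.1 * rng.normal(size=D)   # observed image vector with noise
```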

As illustrated in the top of Figure 2, dense connections between consecutive layers are learned in the vanilla generator network, which we regard as the main drawback that hinders explicit model interpretability. We explore and exploit the AND-OR compositionality in image synthesis by learning to rewire the connections sparsely and to unfold the internal image generation process in an interpretable way, as illustrated in the bottom of Figure 2.

Fig. 2: Top: traditional generator networks with dense connections (solid arrows) between consecutive layers, which are widely used in GANs and VAE. Bottom: the proposed AND-OR networks with sparse connections (dashed arrows). See text for detail.

3.2 The Proposed AND-OR Network

Without loss of generality, consider a simple hierarchy of object(O)-part(P)-primitive/basis(B) that generates RGB images. Starting with the latent code vector z, we have,

Hierarchy: z → O → P → B → image, (2)
Layer Index: 1 → 2 → 3. (3)

For example, Figure 2 illustrates the computing flow from Layer 1 to Layer 3.

The symbol O in the hierarchy is grounded in an internal feature space and can be treated as a vector when instantiated. Similarly, the symbols P and B will be instantiated as vectors in their respective feature spaces. The output is a generated RGB image.

To better show how we facilitate the sparse connections to emerge from the dense ones, we look at the computing flow through the lens of vector-matrix multiplication [10]. In the vanilla generator network, consider a vector v_i^(l) in Layer l; it connects to a set of vectors v_j^(l+1) in Layer l+1. Let ch(i) be the set of indices of the vectors in Layer l+1 which connect with v_i^(l) (i.e., its child nodes). For j ∈ ch(i), we have,

u_{i→j} = W_{i,j} v_i^(l) + b_{i,j},

where u_{i→j} denotes the contribution of v_i^(l) to v_j^(l+1), since other vectors in Layer l may connect to v_j^(l+1) too; W_{i,j} is the transformation matrix and b_{i,j} the bias vector. From Layer 1 to Layer 2, the latent code vector is connected with all vectors v_j^(2) with different W's and b's. From Layer 2 to Layer 3, convolution is usually used, so each v_i^(2) only connects to vectors v_j^(3) locally, and the W's and b's are shared among different i's.

Denote by pa(j) the set of indices of vectors in Layer l connecting with v_j^(l+1). In the vanilla generator network,

v_j^(l+1) = f( Σ_{i ∈ pa(j)} u_{i→j} ),

where f(·) stands for an activation function such as the ReLU function.

In the proposed method, we compute v_j^(l+1) by,

v_j^(l+1) = S_{k_{l+1}}( f( Σ_{i ∈ pa(j)} u_{i→j} ) ),

where S_k(·) is the sparsity-inducing function. From the latent code vector to Layer 2, we apply the sparsity-inducing function along the channel dimension and retain the top k elements of the resulting vector in terms of the element values. In the subsequent layers, we apply it along the spatial domain across the channel dimensions individually. By doing so, the resulting vectors at different locations will have different sparsity ratios. The k_l's are hyper-parameters, usually set so that a higher layer has a higher sparsity degree than a lower layer.
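A minimal NumPy sketch of the per-channel spatial top-k operation described above, applied to a toy (H, W, C) feature map; the function and variable names are ours, not the paper's.

```python
import numpy as np

def spatial_top_k(fmap, k):
    """Per-channel spatial top-k: for each channel of an (H, W, C)
    feature map, keep the k largest activations over the H*W grid."""
    H, W, C = fmap.shape
    out = np.zeros_like(fmap)
    for c in range(C):
        flat = fmap[:, :, c].ravel()
        keep = np.argpartition(flat, -k)[-k:]   # k largest positions
        mask = np.zeros(H * W, dtype=bool)
        mask[keep] = True
        out[:, :, c] = np.where(mask.reshape(H, W), fmap[:, :, c], 0.0)
    return out

rng = np.random.default_rng(0)
fmap = rng.normal(size=(4, 4, 3))
sparse_map = spatial_top_k(fmap, k=2)  # 2 surviving activations per channel
```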

With sparsity-inducing functions, image synthesis is fundamentally changed in terms of representation, and the internal generation process becomes much easier to unfold. The proposed AND-OR network emerges from the vanilla densely connected generator network. We can rewrite Eqn. 1 as,

x = g(z; Θ, {k_l}) + ε, (7)

where {k_l} are the sparsity hyper-parameters. We summarize the proposed AND-OR network for image synthesis as follows.

Layer 1 to Layer 2. The latent code vector z is represented by a root OR-node (non-terminal symbol),

z → z_1 ∨ z_2 ∨ ⋯,

where ∨ denotes OR switching between symbols (i.e., instantiated latent code vectors that generate different object images).

Each instantiated latent code vector is then mapped to an object instance AND-node O. The object instance AND-node represents the object-part decomposition in the lattice of the feature map. We have,

O → P_1 · P_2 · ⋯ · P_m,

where · represents the composition between symbols and m is the number of part symbols. The object-part decomposition is usually done in the spatial domain: each position (or block of positions) in the support domain of the feature map accounts for one part, so dividing the domain into larger blocks yields fewer parts.

Each part symbol is then represented by an OR-node in the feature vector space, indicating the sparse selection among the candidate basis functions. When instantiated, we have a part AND-node.

Layer 2 to Layer 3. Each part AND-node is decomposed into a number of child part-type OR-nodes,

P → B_1 · B_2 · ⋯,

where the number of children is determined by the kernel size when convolution is used to compute Layer 3 from Layer 2.

Similarly, each part-type OR-node is grounded in the feature vector space, indicating the sparse selection among candidates. When instantiated, we have a part-primitive AND-node. The AND-OR structure is then recursively formulated in the downstream layers. Now, looking at Figure 2 again, for each instantiated latent code vector we can follow the sparse connections and visualize the encountered kernel symbols (see Figure 1).

3.3 Learning and Inference

The proposed AND-OR network can still utilize off-the-shelf end-to-end learning frameworks, since the sparsity-inducing terms do not change the formulation (Eqn. 7). We adopt the alternating back-propagation learning framework proposed in [13].

Denote by {x_i, i = 1, …, n} the training dataset consisting of n images (e.g., face images). The learning objective is to maximize the observed-data log-likelihood,

L(Θ) = Σ_i log p(x_i; Θ) = Σ_i log ∫ p(x_i, z; Θ) dz, (11)

where the latent vector z for an observed image x is integrated out, and p(x, z; Θ) is the complete-data likelihood. The gradient of log p(x; Θ) is computed as follows,

∂/∂Θ log p(x; Θ) = E_{p(z|x; Θ)}[ ∂/∂Θ log p(x, z; Θ) ]. (12)


In general, the expectation in Eqn. 12 is analytically intractable. A Monte Carlo average is usually adopted in practice, with samples drawn from the posterior by the Langevin dynamics,

z_{t+1} = z_t + (s²/2) ∂/∂z log p(x, z_t; Θ) + s ε_t, (13)

where t indexes the time step, s is the step size, and ε_t ~ N(0, I_d) denotes the noise term.
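The Langevin update can be sketched for a toy linear "generator", where the gradient of the log joint density is available in closed form; the step size, noise scale and network below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def langevin_step(z, x, g, jac, sigma=0.3, step=0.1, rng=None):
    """One Langevin update for the latent code z given an image x.

    Uses the gradient of log p(x, z) = -||x - g(z)||^2 / (2 sigma^2)
    - ||z||^2 / 2 (Gaussian noise model and prior). `jac` returns the
    Jacobian of the generator; both g and jac are toy stand-ins.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    residual = x - g(z)
    grad_log_p = jac(z).T @ residual / sigma**2 - z
    noise = rng.normal(size=z.shape)
    return z + 0.5 * step**2 * grad_log_p + step * noise

# Toy linear "generator" g(z) = W z, so the Jacobian is simply W.
W = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [1.0, 1.0]])
g = lambda z: W @ z
jac = lambda z: W
x = np.array([1.0, 2.0, 2.0])

rng = np.random.default_rng(1)
z = np.zeros(2)
for _ in range(100):
    z = langevin_step(z, x, g, jac, rng=rng)
# z now hovers near the posterior mode, so g(z) roughly reconstructs x
```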

Based on Eqn. 7, the complete-data log-likelihood is computed by,

log p(x, z; Θ) = −‖x − g(z; Θ, {k_l})‖² / (2σ²) − ‖z‖² / 2 + const, (14)

where const is a constant term independent of z and Θ. It can be shown that, given sufficient transition steps, the z obtained from this procedure follows the joint posterior distribution [40]. For each training example x_i, we run the Langevin dynamics in Eqn. 13 to get the corresponding posterior sample z_i. The sample is then used for the gradient computation in Eqn. 12. The parameters are then learned through the Monte Carlo approximation,

∇L(Θ) ≈ Σ_i ∂/∂Θ log p(x_i, z_i; Θ). (15)

3.4 Combining with an Energy-Based Network

It is well known that using the squared Euclidean distance alone to train generator networks often yields blurry reconstruction results, since the precise location information of details may not be preserved, and the ℓ2 loss in the image space leads to averaging effects among all likely locations. To improve the quality, we utilize an energy-based network to help the generator network. The energy-based model takes the form of an exponential tilting of a reference distribution of observed data,

p(x; w) = (1/Z(w)) exp( f(x; w) ) q(x), (16)

where f(x; w) is parameterized by a bottom-up ConvNet which maps an image to the feature statistics or energy, Z(w) is the normalizing constant, and q(x) is the reference distribution, such as Gaussian white noise.


Let p_data be the underlying true data distribution. We jointly learn the generator network and the energy-based network by,


where KL(·‖·) is the KL divergence and the cross-entropy term measures the discrepancy between the two distributions. Minimizing the first term is equivalent to maximizing Eqn. 11. Maximizing the negative KL divergence in the second term is equivalent to maximizing the following log-likelihood function,


The gradient of this log-likelihood is computed by,

One key result is that ∂/∂w log Z(w) = E_{p(x; w)}[ ∂f(x; w)/∂w ], where E_{p(x; w)} denotes the expectation with respect to p(x; w). We use the generative model to alleviate the difficulty of sampling images from the energy-based model,


The third term in Eqn. 18 can be handled by,


Then, the gradient of the cross-entropy w.r.t. the generator parameters Θ is computed by,
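The learning signal for the energy-based parameters has the familiar contrastive form: push the statistics up on real data and down on synthesized samples. A toy NumPy sketch with hand-picked feature statistics, an illustrative assumption standing in for the paper's bottom-up ConvNet f(x; w):

```python
import numpy as np

# Contrastive gradient for an energy f(x; w) = w . phi(x): the update
# equals feature statistics averaged over real data minus those averaged
# over synthesized samples.
def phi(x):
    """Toy feature statistics of a sample: its mean and second moment."""
    return np.array([x.mean(), (x ** 2).mean()])

real = np.random.default_rng(0).normal(1.0, 0.5, size=(64, 8))  # "data"
fake = np.random.default_rng(1).normal(0.0, 1.0, size=(64, 8))  # "samples"
grad_w = (np.mean([phi(x) for x in real], axis=0)
          - np.mean([phi(x) for x in fake], axis=0))
# grad_w[0] > 0 here: the update raises the first statistic on real data,
# whose mean (1.0) exceeds that of the synthesized samples (0.0).
```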


Algorithm 1 summarizes the detail of learning and inference.

0:    (1) training examples (2) network architectures and sparsity-inducing hyper-parameters (3) Langevin steps and learning iterations

    (1) estimated parameters

and (2) synthesized examples
1:  Let t ← 0; randomly initialize the generator parameters Θ and the energy-based network parameters w.
2:  repeat
3:     Step 1: For each training example, sample a latent vector from the prior and generate a synthesized example with the current generator. Update w by ascending the gradient computed using Eqn. 20.
4:     Step 2: For each x_i, start from the current z_i and run the Langevin dynamics to update z_i, each step of which follows Eqn. 13.
5:     Step 3: Update Θ using the gradient computed from Eqn. 15 and the cross-entropy gradient computed from Eqn. 22.
6:     Let t ← t + 1
7:  until the maximum number of learning iterations is reached
Algorithm 1 Learning and Inference Algorithm

4 Experiments

In this section, we present the qualitative and quantitative results of the proposed method tested on five datasets widely used in image synthesis. The proposed method consistently obtains better quantitative performance, with interpretable hierarchical representations learned. We implement the proposed method using Google's TensorFlow, and our source code will be released.

Datasets: We use the CelebA dataset [27], the human fashion dataset [26], the Stanford car dataset [21], and the LSUN bedroom dataset [41]. We train the proposed AND-OR networks on the first 10k CelebA images as processed by OpenFace [2], 78,979 human fashion images as in [26], the first 16k Stanford car images, and the first 100k bedroom images, all cropped to the same resolution.

Baselines: We compare our model with state-of-the-art image synthesis methods including VAE [20], DCGAN [35], WGAN [3], CoopNet [37], CEGAN [7], ALI [9], and ALICE [25]. We use the Fréchet Inception distance (FID) [14] to evaluate the quality of generated images. The number of generated samples for computing FID is the same as the size of the training set. We also compare image reconstruction quality in terms of per-pixel mean square error (MSE).

Settings: Table I summarizes architectures of the generator network and the energy-based network used in our experiments.

Layer Generator Network Energy Based Network
1 Y, ()
2 FC, ; Conv+LReLU,
3 Upsample, 2 Downsample, 2
Conv+ReLU, Conv+LReLU,
Conv+ReLU, ; Conv+LReLU,
4 Upsample, 2 Downsample, 2
Conv+ReLU, Conv+LReLU,
Conv+ReLU, ; Conv+LReLU,
5 Upsample, 2 Downsample, 2
Conv+ReLU, Conv+LReLU,
Conv+ReLU, ; Conv+LReLU,
6 Upsample, 2 Downsample, 2
Conv+ReLU, Conv+LReLU,
Conv+ReLU, Conv+LReLU,
7 Conv+Tanh, FC,

Network architectures used in experiments. Upsample uses nearest-neighbor interpolation. Downsample uses average pooling. LReLU is the leaky ReLU with negative slope 0.2. All convolution layers use the kernel sizes and numbers of output channels listed in the table. The sparsity-inducing hyper-parameters are also given.
Fig. 3: Results of image synthesis and reconstruction on CelebA. The first two rows show the original face images, the middle two rows show the reconstruction results, and the last two rows show the generated face images. The learned AND-OR tree model is illustrated in Figure 1.
Fig. 4: Results on the human fashion dataset. Top: the three rows show original images, reconstructed images and generated images respectively. Bottom: the learned AND-OR tree model shown in the same way as Figure 1. Best viewed in color and magnification.
Fig. 5: Results on the Stanford car dataset. Top: the three rows show original images, reconstructed images and generated images respectively. Bottom: the learned AND-OR tree model shown in the same way as Figure 1. Best viewed in color and magnification.
Fig. 6: Results on the LSUN bedroom dataset. Top: the three rows show original images, reconstructed images and generated images respectively. Bottom: the learned AND-OR tree model shown in the same way as Figure 1. Best viewed in color and magnification.
Datasets Methods VAE [20] DCGAN [35] WGAN [3] CoopNet [37] CEGAN [7] ALI [9] ALICE [25] Ours
CelebA 45.06 19.28 18.85 28.49 20.62 30.53 23.17 16.62 2.32
HumanFashion 23.28 10.82 10.19 15.39 11.14 16.75 12.56 8.65 1.44
Stanford cars 76.21 33.58 31.62 45.34 36.12 50.48 37.35 28.36 2.26
LSUN bedrooms 81.35 36.26 33.81 49.73 41.64 52.79 39.08 29.70 4.11
TABLE II: Comparisons of the Fréchet inception distance (FID). Smaller FID is better. The last column shows the improvement of our method over the runner-up method, WGAN.

4.1 Qualitative Results

The proposed AND-OR network is capable of joint image synthesis and reconstruction. Figure 3 shows examples of reconstructed and generated face images. The tops of Figure 4, Figure 5 and Figure 6 show examples for human fashion images, car images and bedroom images respectively. Both the reconstructed and the generated images look sharp. The reconstructed bedroom images (Figure 6) look relatively blurrier; bedroom images usually have larger variations, which may entail more complicated generator and energy-based network architectures. We use the same architectures for all tasks.

The learned AND-OR trees on the five datasets unfold the internal generation process, with semantically meaningful internal basis functions learned (emerged). To our knowledge, this is the first work in image synthesis that learns interpretable image generation from scratch. More interestingly, we observe that the primitive layers in different AND-OR trees share many common patterns similar to Gabor wavelets and blob-like structures, which is also consistent with results in traditional sparse coding.

4.2 Quantitative Results

The FID comparisons are summarized in Table II. The proposed method consistently outperforms the seven state-of-the-art image synthesis methods in comparison. On the human fashion dataset, where the images are relatively clean, our method obtains the smallest improvement. On the bedroom dataset, where the images are much more complex with large structural and appearance variations, our method obtains the largest improvement. We note that all the improvements are obtained with more interpretable representations learned in the form of AND-OR trees. This is especially interesting since it shows that jointly improving model performance and interpretability is possible.

We utilize the per-pixel mean square error (MSE) to evaluate image reconstruction. Table III shows comparisons with three state-of-the-art methods that are also capable of joint image synthesis and reconstruction (VAE [20], ALI [9], and ALICE [25]). We do not compare with the variants of GANs and CoopNets since they usually cannot perform joint image reconstruction.

Datasets Methods VAE [20] ALI [9] ALICE [25] Ours
CelebA 0.016 0.132 0.019 0.011
HumanFashion 0.033 0.28 0.043 0.024
Stanford cars 0.081 0.563 0.078 0.054
LSUN bedrooms 0.154 0.988 0.127 0.097
TABLE III: Comparisons of the per-pixel mean square error (MSE). Smaller MSE is better.

4.3 Ablation Studies

In addition to the AND-OR tree visualization, we propose a simple method to evaluate the interpretability of learned basis functions (e.g., those at Layer 3; see Figure 1). We perform template matching between the learned basis functions and the training images using the fast normalized cross-correlation algorithm [39]. Consider Layer 3 (the object part level): if the learned basis functions contain meaningful local parts of the object, the matching score should be high. We compare the Layer-3 basis functions learned with and without the proposed sparsity-inducing approach respectively (i.e., Eqn. 7 vs. Eqn. 1). The mean matching scores are summarized in Table IV. The proposed method significantly outperforms the counterpart. The results verify that the proposed method can learn meaningful basis functions for better model interpretability.
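The normalized cross-correlation score underlying this evaluation can be sketched directly in NumPy; this brute-force version illustrates the metric itself, not the fast algorithm of [39].

```python
import numpy as np

def ncc(image, template):
    """Max normalized cross-correlation of a template over all valid
    positions of an image. A high maximum score means the template
    (e.g., a learned basis function) matches some image region well."""
    th, tw = template.shape
    t = (template - template.mean()) / (template.std() + 1e-8)
    best = -1.0
    H, W = image.shape
    for i in range(H - th + 1):
        for j in range(W - tw + 1):
            patch = image[i:i + th, j:j + tw]
            p = (patch - patch.mean()) / (patch.std() + 1e-8)
            best = max(best, float((p * t).mean()))
    return best

# A template cut out of the image should match (almost) perfectly.
img = np.zeros((8, 8))
img[2:5, 3:6] = np.array([[0, 1, 0], [1, 2, 1], [0, 1, 0]])
score = ncc(img, img[2:5, 3:6])  # close to the maximum score of 1.0
```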

Methods \ Datasets CelebA HumanFashion Cars Bedroom
w/o sparsity 0.33 0.29 0.31 0.23
w/ sparsity 0.83 0.81 0.76 0.72
TABLE IV: Evaluation of interpretability: matching scores of the fast normalized cross-correlation algorithm for the generator trained without sparsity and for the proposed sparsely activated generator.

5 Conclusion

This paper proposes interpretable image synthesis by learning sparsely connected AND-OR networks. The proposed method is built on the vanilla generator network. The AND-OR network of sparsely connected nodes emerges from the original densely connected generator network when sparsity-inducing terms are introduced. In training, we further combine the generator with energy-based networks and pose the learning problem under MLE. The resulting AND-OR networks are capable of joint image synthesis and reconstruction. In experiments, the proposed method is tested on five benchmark datasets. The results show that meaningful and interpretable hierarchical representations are learned, with better image synthesis and reconstruction quality than seven state-of-the-art methods.

Appendix A Computing and Visualization of the Basis functions

A.1 Basis functions and sparse representations

Suppose I denotes an image defined on the spatial domain D, where x denotes a two-dimensional vector indexing the coordinates of pixels. I can be treated as a two-dimensional function defined on D. I can also be treated as a vector if we fix an ordering for the pixels. If |D| counts the number of pixels in D, then |D| is the dimensionality of the vector I.

A linear basis function (or basis vector) is a local image patch which is utilized to represent image intensities. Let {B_i} be a set of prototype basis functions, e.g., wavelets. Suppose each B_i is supported on a local domain centered at the origin; B_i can be shifted, or translated spatially, to a position x to get a translated copy B_{i,x}. B_{i,x} can be treated as a locally supported function defined on D, or as a vector of the same dimensionality as I. The basis functions provide a top-down linear representation (top-down means from the coefficients to the image) of an image by

I = Σ_{i,x} a_{i,x} B_{i,x} + ε,

where a_{i,x} are the coefficients and ε is the residual image. When most of the a_{i,x} are equal to zero, the resulting representation is also called sparse coding.

The traditional one-layer sparse coding model generalizes factor analysis by changing the prior distribution of the coefficient vector from a Gaussian vector to a sparse vector. It typically comprises two steps: (1) learning a dictionary of basis functions {B_i}, where each B_{i,x} is a spatially translated copy of a prototype function B_i; and (2) inferring the latent factors, i.e., the sparse coefficients.

K-SVD [1] is a typical algorithm for learning the dictionary of basis functions, while convex relaxation (ℓ1 minimization) or greedy methods such as orthogonal matching pursuit (OMP) [36] can be employed to infer the sparse coefficients, i.e., the selection of the basis functions. We develop a hierarchical top-down sparse coding model that can generate (in addition to merely reconstruct) realistic natural image patterns.
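A minimal sketch of the OMP inference step on a toy dictionary (here trivially the identity matrix; learning the dictionary itself, e.g. with K-SVD, is not shown):

```python
import numpy as np

def omp(D, x, k):
    """Greedy orthogonal matching pursuit: pick at most k columns of
    dictionary D to sparsely represent the signal x."""
    residual = x.copy()
    support = []
    for _ in range(k):
        # column most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in support:
            support.append(j)
        # least-squares refit of the coefficients on the chosen support
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    a = np.zeros(D.shape[1])
    a[support] = coef
    return a

D = np.eye(5)                            # trivial orthonormal dictionary
x = np.array([0.0, 3.0, 0.0, -1.0, 0.0])
a = omp(D, x, k=2)                       # recovers the 2-sparse code
```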

A.2 Inducing a unified hierarchical sparse coding model from the sparsely connected generator

A sparsity-inducing function is introduced to develop an explainable generative model. At each layer of the top-down generator network, the sparsity-inducing function only allows the top-k coefficients to be active and forces all the other coefficients to be zero.

Suppose there are L deconvolutional layers, and the feature map at the l-th layer is h^(l); after applying the sparsity-inducing function, the surviving activations constitute a sparse tensor. For the l-th layer,

h^(l+1) = S( r( D(h^(l); θ^(l)) ) ),

where we denote the deconvolutional operation as D(·), the parameters at the l-th layer as θ^(l), the ReLU operation as r(·), and the sparsity-inducing function as S(·). The sparse activations are determined by the input latent code vector and the learned parameters (weights and biases) of the generator. A different latent code vector will generate a different parsing tree. The ReLU and the sparsity-inducing function can be seen as switches or masks that partition the space. For an instantiated latent code vector z, we record the masks of the elements chosen by both the sparsity-inducing and ReLU functions. Then, these non-linear functions can be replaced by linear functions,

where D(·) is the deconvolutional function, M_S denotes the mask matrix of the sparsity-inducing function, M_r denotes the mask matrix of the ReLU function, and ⊙ denotes the element-wise product between two matrices. The final generated image can then be formulated as a composition of these masked linear maps.

Let the sparse activation be a^(l) = M^(l) h^(l), where M^(l) is the mask matrix that only selects the sparse activations. Then the generated image is a linear function of a^(l),

where the mapping is linear (composed of a series of linear functions). For the l-th layer, the values of the sparse activations can be interpreted as the sparse coefficients. The sparse activations in the l-th layer are equivalently obtained as,


And we have

We obtain the final induced unified sparse coding formulation as,

I = Σ_l B^(l) a^(l),

where B^(l) collects the basis functions of the sparse coefficients at the l-th layer (each B^(l) can be computed from Eq. (23)).

The generated (or reconstructed) image can thus be represented as the summation of basis functions at different layers. This decomposition can be employed to analyze each sparse activation's contribution to the whole generated image; as the layer goes from bottom to top, the basis functions contain more and more high-level semantic information about the generated images.
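The collapse of the generator into masked linear maps can be verified numerically on a toy two-layer network: once the ReLU mask is frozen for a given latent code, each surviving activation owns an explicit basis function (a column of the masked weight matrix). All sizes and names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(6, 3))    # layer-1 weights (toy sizes)
W2 = rng.normal(size=(12, 6))   # layer-2 weights mapping to "image" space
z = rng.normal(size=3)

h = W1 @ z
m = (h > 0).astype(float)       # ReLU mask, fixed once z is instantiated
x = W2 @ (m * h)                # generated "image"

# With the mask frozen, the network is linear: each surviving activation
# h[i] contributes the basis function W2[:, i] scaled by h[i].
B = W2 * m                      # masked basis functions as columns
x_decomposed = B @ h            # identical to the forward pass
```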


Acknowledgments

The work of X. Xing, S.-C. Zhu and Y. Wu is supported by DARPA XAI project N66001-17-2-4029; ARO project W911NF1810296; ONR MURI project N00014-16-1-2007; and Extreme Science and Engineering Discovery Environment (XSEDE) grant ASC170063. The work of X. Xing is also supported by Natural Science Foundation of China No. 61703119, Natural Science Fund of Heilongjiang Province of China No. QC2017070, and Fundamental Research Funds for the Central Universities No. 3072019CFT0402. The work of T. Wu is supported in part by NSF IIS-1909644 and ARO grant W911NF1810295.


  • [1] M. Aharon, M. Elad, A. Bruckstein, et al. (2006) K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE TSP 54 (11), pp. 4311. Cited by: §A.1.
  • [2] B. Amos, B. Ludwiczuk, and M. Satyanarayanan (2016) OpenFace: a general-purpose face recognition library with mobile applications. Technical report CMU-CS-16-118. Cited by: §4.
  • [3] M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein gan. arXiv:1701.07875. Cited by: §1, TABLE II, §4.
  • [4] D. Bau, J. Zhu, H. Strobelt, Z. Bolei, J. B. Tenenbaum, W. T. Freeman, and A. Torralba (2019) GAN dissection: visualizing and understanding generative adversarial networks. In ICLR, Cited by: §1.
  • [5] A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv:1809.11096. Cited by: §1.
  • [6] A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv:1809.11096. Cited by: §1.
  • [7] Z. Dai, A. Almahairi, P. Bachman, E. Hovy, and A. Courville (2017) Calibrating energy-based generative adversarial networks. arXiv:1702.01691. Cited by: TABLE II, §4.
  • [8] DARPA (2016)

    Explainable artificial intelligence (xai) program, explainable-artificial-intelligence, full solicitation at darpa-baa-16-53.pdf
    Cited by: §1.
  • [9] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville (2016) Adversarially learned inference. arXiv:1606.00704. Cited by: §4.2, TABLE II, TABLE III, §4.
  • [10] R. Gao, J. Xie, S. Zhu, and Y. N. Wu (2019) Learning grid cells as vector representation of self-position coupled with matrix representation of self-motion. See DBLP:conf/iclr/2019, External Links: Link Cited by: §3.2.
  • [11] S. Geman, D. Potter, and Z. Y. Chi (2002) Composition systems. Quarterly of Applied Mathematics 60 (4), pp. 707–736. Cited by: §1.
  • [12] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NeurIPS, pp. 2672–2680. Cited by: §1, §1.
  • [13] T. Han, Y. Lu, S. Zhu, and Y. N. Wu (2017) Alternating back-propagation for generator network.. In AAAI, pp. 1976–1984. Cited by: §3.1, §3.3.
  • [14] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, pp. 6626–6637. Cited by: §4.
  • [15] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner (2016) Beta-vae: learning basic visual concepts with a constrained variational framework. Cited by: §1.
  • [16] E. Hosseini-Asl, J. M. Zurada, and O. Nasraoui (2015) Deep learning of part-based representation of data using sparse autoencoders with nonnegativity constraints. IEEE Trans. Neural Netw. Learn. Syst 27 (12), pp. 2486–2498. Cited by: §2.
  • [17] G. B. Huang, H. Lee, and E. Learned-Miller (2012) Learning hierarchical representations for face verification with convolutional deep belief networks. In CVPR, pp. 2518–2525. Cited by: §2.
  • [18] H. Ishwaran and J. S. Rao (2005) Spike and slab variable selection: frequentist and bayesian strategies. The Ann. of Stat. 33 (2), pp. 730–773. Cited by: §1.
  • [19] T. Karras, S. Laine, and T. Aila (2018) A style-based generator architecture for generative adversarial networks. arXiv:1812.04948. Cited by: §1.
  • [20] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv:1312.6114. Cited by: §1, §4.2, TABLE II, TABLE III, §4.
  • [21] J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia. Cited by: §4.
  • [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NeurIPS, pp. 1106–1114. Cited by: §1, §1, §3.2.
  • [23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proc. of the IEEE 86 (11), pp. 2278–2324. Cited by: §1, §1.
  • [24] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng (2009) Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, pp. 609–616. Cited by: §2.
  • [25] C. Li, H. Liu, C. Chen, Y. Pu, L. Chen, R. Henao, and L. Carin (2017)

    Alice: towards understanding adversarial learning for joint distribution matching

    In NeurIPS, pp. 5495–5503. Cited by: §4.2, TABLE II, TABLE III, §4.
  • [26] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang (2016-06) DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In CVPR, Cited by: §4.
  • [27] Z. Liu, P. Luo, X. Wang, and X. Tang (2015-12) Deep learning face attributes in the wild. In ICCV, Cited by: §4.
  • [28] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet (2017) Are gans created equal? a large-scale study. arXiv:1711.10337. Cited by: §1.
  • [29] A. Makhzani and B. J. Frey (2015) Winner-take-all autoencoders. In NeurIPS, pp. 2791–2799. Cited by: §2.
  • [30] A. Makhzani and B. Frey (2013) K-sparse autoencoders. arXiv:1312.5663. Cited by: §2.
  • [31] A. Ng et al. (2011) Sparse autoencoder. CS294A Lecture notes 72 (2011), pp. 1–19. Cited by: §2.
  • [32] J. Ngiam, Z. Chen, S. A. Bhaskar, P. W. Koh, and A. Y. Ng (2011) Sparse filtering. In NeurIPS, pp. 1125–1133. Cited by: §2.
  • [33] B. A. Olshausen and D. J. Field (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381 (6583), pp. 607. Cited by: §1.
  • [34] B. A. Olshausen and D. J. Field (1997) Sparse coding with an overcomplete basis set: a strategy employed by v1?. Vision research 37 (23), pp. 3311–3325. Cited by: §1.
  • [35] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434. Cited by: TABLE II, §4.
  • [36] J. A. Tropp and A. C. Gilbert (2007) Signal recovery from random measurements via orthogonal matching pursuit. IEEE Trans. on Info. Th. 53 (12), pp. 4655–4666. Cited by: §A.1.
  • [37] J. Xie, Y. Lu, R. Gao, and Y. N. Wu (2018) Cooperative learning of energy-based model and latent variable model via mcmc teaching. In AAAI, Cited by: §1, TABLE II, §4.
  • [38] X. Xing, T. Han, R. Gao, S. Zhu, and Y. N. Wu Unsupervised disentangling of appearance and geometry by deformable generator network. In CVPR, pp. 10354–10363. Cited by: §1.
  • [39] J. Yoo and T. H. Han (2009) Fast normalized cross-correlation. Circuits, systems and signal processing 28 (6), pp. 819. Cited by: §4.3.
  • [40] L. Younes (1999) On the convergence of markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics 65 (3-4), pp. 177–228. Cited by: §3.3.
  • [41] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao (2015) Lsun: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv:1506.03365. Cited by: §4.
  • [42] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus (2010) Deconvolutional networks.. In CVPR, Vol. 10, pp. 7. Cited by: §2.
  • [43] M. D. Zeiler, G. W. Taylor, R. Fergus, et al. (2011) Adaptive deconvolutional networks for mid and high level feature learning.. In ICCV, Vol. 1, pp. 6. Cited by: §2.
  • [44] S. Zhu and D. Mumford (2007) A stochastic grammar of images. Found. Trends Comp. Graph. Vis. 2 (4), pp. 259–362. Cited by: §1.