1 Introduction
Remarkable recent progress on image synthesis [12, 6, 19, 37, 3, 38] has been made using deep neural networks (DNNs) [23, 22]. Most efforts focus on developing sophisticated architectures and training paradigms for sharp and realistic-looking image synthesis [28, 5, 19]. Although high-fidelity images can be generated, the internal synthesizing process of DNNs is still largely viewed as a black box, which potentially hinders their long-term applicability in eXplainable AI (XAI) [8]. More recently, the generative adversarial network (GAN) dissection method [4] has been proposed to identify internal neurons in pretrained GANs that show interpretable meanings, using a separate annotated dataset in a post-hoc fashion.
In this paper, we focus on learning interpretable models for unconditional image synthesis from scratch with explicit hierarchical representations. By interpretable image synthesis, we mean that the internal image generation process can be explicitly unfolded through meaningful basis functions that are learned end-to-end at different layers and conceptually reflect the scene-objects-parts-subparts-primitives hierarchy. A scene has different types (i.e., OR), each of which consists of a number of objects (i.e., AND). This can be recursively formulated across the scene-objects-parts-subparts hierarchy and is terminated at the primitive level (e.g., Gabor-wavelet-like basis functions). Figure 1 shows an example of the AND-OR tree learned from scratch for explaining a generated face image.
The scene-objects-parts-subparts-primitives hierarchy is at the stem of image grammar models [11, 44], and AND-OR compositionality has been applied to many vision tasks [44]. With the recent resurgence of deep neural networks (DNNs) [23, 22] and the more recent DNN-based image synthesis frameworks, such as the widely used Generative Adversarial Networks (GANs) [12] and Variational Auto-Encoder (VAE) methods [20, 15], this hierarchy is usually assumed to be modeled implicitly inside the DNNs. Due to the dense connections between consecutive layers in traditional DNNs, they often learn noisy compositional patterns of how entities in a layer are formed from "smaller" ones in the layer right below it.
On the other hand, the sparsity principle has played a fundamental role in high-dimensional statistics, machine learning, signal processing and AI. In particular, the sparse coding scheme [33] is an important principle for understanding the visual cortex. By imposing sparsity constraints on the coefficients of a linear generative model, [34] learned Gabor-like wavelets from natural image patches that resemble the neurons in the primary visual cortex (V1). Since then, much important work on sparse coding has been presented in the literature before the resurgence of DNNs. Given the remarkable successes of sparse coding models, it is not unreasonable to assume that a top-down generative model of natural images should be based on the linear sparse coding model, or incorporate the sparse coding principle at all of its layers. However, developing a top-down sparse coding model that can generate, rather than merely reconstruct, realistic-looking natural image patterns has proven to be a difficult task [18], mainly due to the difficulty of selecting and fitting sparse basis functions to each image.
In this paper, we take a step toward rethinking the dense connections between consecutive layers in traditional DNNs. We propose to "rewire" them sparsely for explicit modeling of the scene-objects-parts-subparts-primitives hierarchy in image synthesis (see Figure 1). To realize the "rewiring", we integrate the sparsity principle into DNNs in a simple yet effective and adaptive way: (i) Each layer of the hierarchy is represented by an (over-complete) set of basis functions. The basis functions are instantiated using convolution to be translation covariant. Off-the-shelf convolutional neural architectures are then exploited to implement the hierarchy, such as the generator networks used in GANs. (ii) Sparsity-inducing constraints are introduced in end-to-end training, which facilitates a sparsely connected AND-OR network emerging from an initially densely connected convolutional neural network. A straightforward sparsity-inducing constraint is utilized: only the top-$k$ basis functions are allowed to fire at each layer (where $k$ is a hyperparameter). By doing so, we can harness the highly expressive modeling capability and the end-to-end learning flexibility of DNNs, together with the interpretability rigor of the explicit compositional hierarchy.
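To make the "rewiring" concrete, a minimal NumPy sketch of the top-$k$ sparsity-inducing constraint is given below (the function name is ours; the paper's implementation uses TensorFlow):

```python
import numpy as np

def top_k_mask(v, k):
    """Keep the k largest entries of a vector and zero out the rest.

    This is the straightforward sparsity-inducing constraint described
    above: only the top-k basis functions at a layer are allowed to fire.
    """
    idx = np.argsort(v)[-k:]          # indices of the k largest values
    mask = np.zeros_like(v)
    mask[idx] = 1.0
    return v * mask

# Example: out of 8 candidate basis functions, only 2 may fire.
activations = np.array([0.1, 2.3, -0.5, 1.7, 0.0, 0.9, 3.1, 0.2])
print(top_k_mask(activations, k=2))   # only 2.3 and 3.1 survive
```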
2 Related Work
Sparse autoencoders [31, 30, 16] were proposed for effective feature representations, and these representations can improve the performance of classification tasks. The sparsity constraints are designed and encouraged by the Kullback-Leibler divergence between Bernoulli random variables [31], a penalty on the normalized features [32], and the winner-take-all principle [29]. However, these methods do not have the ability to generate new data. Lee et al. [24, 17] proposed convolutional deep belief networks which employ sparsity regularization and probabilistic max-pooling to learn hierarchical representations. However, learning is difficult and computationally expensive when training deep belief nets. Zeiler et al. [42, 43] proposed deconvolutional networks to learn low- and mid-level image representations based on the convolutional decomposition of images under a sparsity constraint. However, for the aforementioned methods, the hierarchical representations have to be learned layer by layer, that is, one first trains the bottom layer of the network, then fixes the learned layer and trains the upper layers one by one. Moreover, the above methods usually work on gray images or gradient images which are preprocessed by removing the low-frequency texture information and highlighting the structure information. Unlike the above methods, the proposed method can work directly on raw color images without any preprocessing. The proposed model can simultaneously learn meaningful hierarchical representations, generate realistic images and reconstruct the original images.
Our Contributions. This paper makes three main contributions to the field of generative learning: (i) It proposes interpretable image synthesis that unfolds the internal generation process via a hierarchical AND-OR network of semantically meaningful nodes. (ii) It presents a simple yet effective sparsity-inducing method that facilitates a hierarchical AND-OR network of sparsely connected nodes emerging from an initial network of dense connections between consecutive layers. (iii) It shows that meaningful hierarchical representations can be learned end-to-end in image synthesis with better quality than state-of-the-art baselines.
3 The Proposed Approach
3.1 Image Synthesis and Model Interpretability
From the viewpoint of top-down generative learning in image synthesis, we start with a $d$-dimensional latent code vector $z \in \mathbb{R}^d$ consisting of $d$ latent factors. We usually assume $z \sim \mathcal{N}(0, I_d)$, where $I_d$ represents the $d$-dimensional identity matrix. In GANs and VAEs, generator networks are used to implement the highly non-linear mapping from a latent code vector $z$ to a synthesized image, denoted by $x$, which lies in a $D$-dimensional image space (i.e., $D$ equals the product of the spatial dimensions, the width and height of an image, and the number of chromatic channels, such as 3 for RGB images). The generator networks are thus seen as non-linear extensions of factor analysis [13]. We have,

$x = G(z; \Theta) + \epsilon,$  (1)

where $\epsilon \sim \mathcal{N}(0, \sigma^2 I_D)$ is the observational error assumed to be Gaussian white noise, $G(\cdot; \Theta)$ represents the generator network, and $\Theta$ collects the parameters from all layers.
As illustrated in the top of Figure 2, dense connections between consecutive layers are learned in the vanilla generator network, which we think is the main drawback that hinders explicit model interpretability. We explore and exploit the AND-OR compositionality in image synthesis by learning to rewire the connections sparsely and to unfold the internal image generation process in an interpretable way, as illustrated in the bottom of Figure 2.
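For concreteness, Eqn. 1 can be sketched in a few lines of Python with a stand-in one-layer generator (the shapes and the toy $G$ are illustrative only; any deep generator fits in its place):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma = 100, 64 * 64 * 3, 0.1   # latent dim, image dim, noise level (illustrative)

def G(z, Theta):
    """Stand-in for the generator network (here a single non-linear layer)."""
    return np.tanh(Theta @ z)

Theta = rng.normal(size=(D, d)) / np.sqrt(d)
z = rng.normal(size=d)                        # z ~ N(0, I_d)
x = G(z, Theta) + sigma * rng.normal(size=D)  # Eqn. 1: x = G(z; Theta) + eps
```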
3.2 The Proposed AND-OR Network
Without loss of generality, consider a simple object (O) - part (P) - primitive/basis (B) hierarchy that generates RGB images. Starting with the latent code vector $z$, we have,

Hierarchy: $z \rightarrow O \rightarrow P \rightarrow B \rightarrow x$,  (2)
Layer Index: $1 \rightarrow 2 \rightarrow 3 \rightarrow \cdots$  (3)

For example, Figure 2 illustrates the computing flow from Layer 1 to Layer 3.
The symbol $O$ in the hierarchy is grounded in an internal $n_O$-dimensional space, and can be treated as an $n_O$-dimensional vector when instantiated. Similarly, the symbols $P$ and $B$ will be instantiated as $n_P$-dimensional and $n_B$-dimensional vectors respectively. $x$ is a generated RGB image of size $W \times H \times 3$.
To better show how we facilitate the sparse connections emerging from the dense ones, we look at the computing flow through the lens of vector-matrix multiplication [10]. In the vanilla generator network, consider a vector $u_i$ in Layer $l$; it connects to a set of vectors $v_j$'s in Layer $l+1$. Let $C(i)$ be the set of indices of the vectors in Layer $l+1$ which connect with $u_i$ (i.e., $u_i$'s child nodes). We have, for $j \in C(i)$,

$v_{i \to j} = W_{i,j}\, u_i + b_{i,j},$  (4)

where $v_{i \to j}$ means the contribution of $u_i$ to $v_j$, since other vectors in Layer $l$ may connect to $v_j$ too. $W_{i,j}$ is the transformation matrix and $b_{i,j}$ the bias vector. Consider Layer 1 to Layer 2 ($l = 1$): the latent code vector is connected with all vectors $v_j$'s in Layer 2, with different $W$'s and $b$'s. Consider Layer 2 to Layer 3 ($l = 2$): convolution is usually used, so each $u_i$ only connects to vectors $v_j$'s locally, and the $W$'s and $b$'s are shared among different $i$'s. Denote by $P(j)$ the set of indices of the vectors in Layer $l$ connecting with $v_j$. In the vanilla generator network,
$v_j = f\Big(\sum_{i \in P(j)} v_{i \to j}\Big),$  (5)

where $f(\cdot)$ stands for an activation function such as the ReLU function [22]. In the proposed method, we compute $v_j$ by,
$v_j = g_k\Big(f\Big(\sum_{i \in P(j)} v_{i \to j}\Big)\Big),$  (6)

where $g_k(\cdot)$ is the sparsity-inducing function. From symbol $z$ to $O$, we apply the sparsity-inducing function along the channel dimension and retain the top $k$ out of $n$ elements of the resulting vector in terms of the element values. In the subsequent layers, we apply it along the spatial domain across the channels individually. By doing so, the resulting vectors at different locations will have different sparsity ratios. The $k$'s are hyperparameters, usually set so that a higher layer has a higher sparsity degree than the layer below it.
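A sketch of the two ways the sparsity-inducing function is applied, channel-wise for the first layer and spatially per channel afterwards (the `(H, W, C)` layout and the function names are our assumptions):

```python
import numpy as np

def sparsify_channels(v, k):
    """Layer 1 -> 2: retain the top-k of the channel elements of v."""
    keep = np.argsort(v)[-k:]
    out = np.zeros_like(v)
    out[keep] = v[keep]
    return out

def sparsify_spatial(fmap, k):
    """Later layers: apply top-k over the H*W spatial positions of each
    channel individually, so the channel vectors at different locations
    end up with different sparsity ratios."""
    H, W, C = fmap.shape
    out = np.zeros_like(fmap)
    for c in range(C):
        flat = fmap[:, :, c].ravel()
        keep = np.argsort(flat)[-k:]
        chan = np.zeros_like(flat)
        chan[keep] = flat[keep]
        out[:, :, c] = chan.reshape(H, W)
    return out
```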
With the sparsity-inducing functions, image synthesis is fundamentally changed in terms of representation, and the internal generation process becomes much easier to unfold. The proposed AND-OR network emerges from the vanilla densely connected generator network. We can rewrite Eqn. 1 as,

$x = G_k(z; \Theta) + \epsilon,$  (7)

where $k$ collects the sparsity hyperparameters of all layers. We summarize the proposed AND-OR network for image synthesis as follows.
Layer 1 to Layer 2: The latent code vector is represented by a root OR-node (non-terminal symbol),

$z_1 \,|\, z_2 \,|\, \cdots,$  (8)

where $|$ denotes the OR switching between symbols (i.e., instantiated latent code vectors that generate different object images).
Each instantiated latent code vector is then mapped to an object instance AND-node. The object instance AND-node represents the object-part decomposition in the lattice (of size $s \times s$). We have,

$O \to P_1 \cdot P_2 \cdots P_m,$  (9)

where $\cdot$ represents the composition between symbols and $m$ is the number of part symbols. The object-part decomposition is usually done in the spatial domain. For example, if the support domain for $O$ is an $s \times s$ lattice, we will have at most $s^2$ parts; we could use fewer parts if we further divide the domain into larger blocks.
Each part symbol is then represented by an OR-node in the internal $n_P$-dimensional vector space, indicating the sparse selection among the candidates. When instantiated, we have a part AND-node.
Layer 2 to Layer 3: Each part AND-node is decomposed into a number of child part-type OR-nodes,

$P_i \to B_1 \cdot B_2 \cdots B_{m'},$  (10)

where $m'$ is determined by the kernel size when convolution is used to compute Layer 3 from Layer 2.
Similarly, each part-type OR-node is grounded in the internal $n_B$-dimensional vector space, indicating the sparse selection among the candidates. When instantiated, we have a part-primitive AND-node. The AND-OR structure is then recursively formulated in the downstream layers. Now, looking at Figure 2 again, for each instantiated $z$ we can follow the sparse connections and visualize the encountered kernel symbols (see Figure 1).
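The unfolding itself is mechanical: the parse tree for an instantiated $z$ is read off by recording which entries survive the top-$k$ function at each layer. A sketch (names and the per-layer representation are illustrative, not the paper's code):

```python
import numpy as np

def parse_tree(activations_per_layer):
    """Given the post-sparsity feature maps of one sample, return, per
    layer, the indices of the surviving (non-zero) activations. Each
    surviving index is an instantiated AND-node; the candidates it was
    selected from form the corresponding OR-node."""
    tree = []
    for layer, act in enumerate(activations_per_layer, start=1):
        alive = np.flatnonzero(act)
        tree.append({"layer": layer, "active_indices": alive.tolist()})
    return tree
```

Visualizing the kernels indexed by `active_indices`, layer by layer, yields trees like the one in Figure 1.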
3.3 Learning and Inference
The proposed AND-OR network can still utilize off-the-shelf end-to-end learning frameworks, since the sparsity-inducing terms do not change the formulation (Eqn. 7). We adopt the alternating back-propagation learning framework proposed in [13].
Denote by $\{x_i\}_{i=1}^{N}$ the training dataset consisting of $N$ images (e.g., face images). The learning objective is to maximize the observed-data log-likelihood,

$L(\Theta) = \sum_{i=1}^{N} \log p(x_i; \Theta) = \sum_{i=1}^{N} \log \int p(x_i, z_i; \Theta)\, dz_i,$  (11)

where the latent vector $z_i$ for an observed data point $x_i$ is integrated out, and $p(x_i, z_i; \Theta)$ is the complete-data likelihood. The gradient of $L(\Theta)$ is computed as follows,

$\nabla_\Theta \log p(x; \Theta) = \mathbb{E}_{p(z|x; \Theta)}\big[\nabla_\Theta \log p(x, z; \Theta)\big].$  (12)
In general, the expectation in Eqn. 12 is analytically intractable, and a Monte Carlo average is usually adopted in practice, with samples drawn from the posterior by the Langevin dynamics,

$z_{\tau+1} = z_\tau + \frac{\delta^2}{2} \nabla_z \log p(z_\tau \,|\, x; \Theta) + \delta\, \mathcal{E}_\tau,$  (13)

where $\tau$ indexes the time step, $\delta$ is the step size, and $\mathcal{E}_\tau$ denotes the noise term, $\mathcal{E}_\tau \sim \mathcal{N}(0, I_d)$.
Based on Eqn. 7, the complete-data log-likelihood is computed by,

$\log p(x, z; \Theta) = -\frac{1}{2\sigma^2}\,\|x - G_k(z; \Theta)\|^2 - \frac{1}{2}\,\|z\|^2 + c,$  (14)

where $c$ is a constant term independent of $z$ and $\Theta$. It can be shown that, given sufficient transition steps, the $z$ obtained from this procedure follows the joint posterior distribution [40]. For each training example $x_i$, we run the Langevin dynamics in Eqn. 13 to get the corresponding posterior sample $z_i$. The sample is then used for the gradient computation in Eqn. 12, and the parameters are learned through the Monte Carlo approximation,

$\Theta_{t+1} = \Theta_t + \eta_t\, \frac{1}{N} \sum_{i=1}^{N} \nabla_\Theta \log p(x_i, z_i; \Theta_t),$  (15)

where $\eta_t$ is the learning rate.
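The full loop of Eqns. 13-15 can be sketched on a toy linear generator (everything here, from the shapes to the step sizes, is illustrative; in practice $G_k$ is the sparsely connected deep generator and the gradients come from autodiff):

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, sigma, delta, lr = 8, 32, 0.3, 0.1, 0.05
Theta = 0.1 * rng.normal(size=(D, d))   # toy linear generator G(z) = Theta @ z

def grad_log_post(z, x):
    # Gradient of Eqn. 14 w.r.t. z for the linear toy generator.
    return Theta.T @ (x - Theta @ z) / sigma**2 - z

def langevin(z, x, steps=20):
    # Eqn. 13: posterior sampling of z given x.
    for _ in range(steps):
        z = z + 0.5 * delta**2 * grad_log_post(z, x) + delta * rng.normal(size=d)
    return z

X = rng.normal(size=(16, D))            # stand-in training "images"
Z = rng.normal(size=(16, d))            # persistent posterior samples
for it in range(100):
    Z = np.stack([langevin(z, x) for z, x in zip(Z, X)])
    # Eqn. 15: Monte Carlo gradient ascent on the complete-data log-likelihood.
    grad = sum(np.outer(x - Theta @ z, z) for x, z in zip(X, Z)) / (sigma**2 * len(X))
    Theta = Theta + lr * grad
```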
3.4 Combining with an Energy-Based Network
It is well known that using the squared Euclidean distance alone to train generator networks often yields blurry reconstruction results, since the precise location information of details may not be preserved, and the $\ell_2$ loss in the image space leads to averaging effects among all likely locations. In order to improve the quality, we utilize an energy-based network to help the generator network. The energy-based model is in the form of an exponential tilting of a reference distribution of the observed data,

$p(x; \theta) = \frac{1}{Z(\theta)} \exp\big(f(x; \theta)\big)\, q(x),$  (16)

where $f(x; \theta)$ is parameterized by a bottom-up ConvNet which maps an image to the feature statistics or energy, $Z(\theta)$ is the normalizing constant, and $q(x)$ is the reference distribution, such as Gaussian white noise,

$q(x) = \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\Big(-\frac{\|x\|^2}{2\sigma^2}\Big).$  (17)
Let $p_{\rm data}$ be the underlying true data distribution, $p_\Theta$ the distribution of the generator network, and $p_\theta$ the energy-based model. We jointly learn the generator network and the energy-based network by,

$\min_{\Theta}\max_{\theta}\;\; \mathrm{KL}(p_{\rm data} \,\|\, p_\Theta) - \mathrm{KL}(p_{\rm data} \,\|\, p_\theta) + \mathrm{CE}(p_\Theta \,\|\, p_\theta),$  (18)

where $\mathrm{KL}$ is the KL divergence and $\mathrm{CE}$ is the cross-entropy between the two distributions. Minimizing the first term is equivalent to maximizing Eqn. 11. Maximizing the negative KL divergence in the second term is equivalent to maximizing the following log-likelihood function,

$L_E(\theta) = \frac{1}{N} \sum_{i=1}^{N} \log p(x_i; \theta).$  (19)

The gradient of $L_E(\theta)$ is computed by,

$\nabla_\theta L_E(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta f(x_i; \theta) - \mathbb{E}_{p(x;\theta)}\big[\nabla_\theta f(x; \theta)\big].$

One key result is that $\nabla_\theta \log Z(\theta) = \mathbb{E}_{p(x;\theta)}\big[\nabla_\theta f(x; \theta)\big]$, where $\mathbb{E}_{p(x;\theta)}$ denotes the expectation with respect to $p(x; \theta)$. We use the generative model to alleviate the difficulty of sampling images from the energy-based model: samples from the generator initialize the Langevin dynamics,

$x_{\tau+1} = x_\tau + \frac{\delta^2}{2} \nabla_x \log p(x_\tau; \theta) + \delta\, \mathcal{E}_\tau.$  (20)

The third term in Eqn. 18 can be solved by,

$\mathrm{CE}(p_\Theta \,\|\, p_\theta) = -\mathbb{E}_{p_\Theta}\big[\log p(x; \theta)\big] = -\mathbb{E}_{z}\big[f(G_k(z; \Theta); \theta)\big] + \log Z(\theta) + \mathrm{const}.$  (21)

Then, the gradient of the cross-entropy w.r.t. $\Theta$ is computed by,

$\nabla_\Theta\, \mathrm{CE}(p_\Theta \,\|\, p_\theta) = -\mathbb{E}_{z}\big[\nabla_\Theta f(G_k(z; \Theta); \theta)\big].$  (22)
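The cooperative scheme can be summarized in pseudocode-style Python (all five callables are assumed; Eqns. 19-22 live inside `update_ebm` and `update_gen`):

```python
import numpy as np

rng = np.random.default_rng(1)

def langevin_x(x, grad_log_p, steps=15, delta=0.02):
    """Eqn. 20 (sketch): revise images by Langevin dynamics under the
    energy-based density p(x; theta) ~ exp(f(x; theta)) q(x)."""
    for _ in range(steps):
        x = x + 0.5 * delta**2 * grad_log_p(x) + delta * rng.normal(size=x.shape)
    return x

def coop_step(x_data, z, G, grad_log_p, update_ebm, update_gen):
    """One joint update (sketch; G maps z to images, grad_log_p is the
    gradient of log p(x; theta) w.r.t. x)."""
    x_init = G(z)                          # generator proposes initial samples
    x_rev = langevin_x(x_init, grad_log_p) # EBM revises them (Eqn. 20)
    update_ebm(x_data, x_rev)              # contrast data vs. revised samples (Eqn. 19)
    update_gen(z, x_rev)                   # generator chases revised samples (Eqns. 21-22)
```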
Algorithm 1 summarizes the details of learning and inference.
4 Experiments
In this section, we present qualitative and quantitative results of the proposed method on five datasets widely used in image synthesis. The proposed method consistently obtains better quantitative performance while learning interpretable hierarchical representations. We implement the proposed method using Google's TensorFlow (https://github.com/tensorflow), and our source code will be released.
Datasets: We use the CelebA dataset [27], the human fashion dataset [26], the Stanford car dataset [21], and the LSUN bedroom dataset [41]. We train the proposed AND-OR networks on the first 10k CelebA images as processed by OpenFace [2], 78,979 human fashion images as done in [26], the first 16k Stanford car images, and the first 100k bedroom images, all cropped to the same size.
Baselines: We compare our model with state-of-the-art image synthesis methods, including VAE [20], DCGAN [35], WGAN [3], CoopNet [37], CEGAN [7], ALI [9], and ALICE [25]. We use the Fréchet Inception distance (FID) [14] to evaluate the quality of generated images; the number of generated samples for computing FID is the same as the size of the training set. We also compare image reconstruction quality in terms of per-pixel mean squared error (MSE).
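For reference, the FID between two sets of Inception features follows the standard closed form; a minimal sketch (`scipy` provides the matrix square root):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real, feats_fake):
    """Frechet Inception distance [14] from Inception feature matrices
    (rows are samples): ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^{1/2})."""
    mu1, mu2 = feats_real.mean(0), feats_fake.mean(0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):   # numerical noise can add tiny imaginary parts
        covmean = covmean.real
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2 * covmean))
```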
Settings: Table I summarizes the architectures of the generator network and the energy-based network used in our experiments.

Layer | Generator Network | Energy-Based Network
1 | Y | -
2 | FC | Conv + LReLU
3 | Upsample x2; Conv + ReLU; Conv + ReLU | Downsample x2; Conv + LReLU; Conv + LReLU
4 | Upsample x2; Conv + ReLU; Conv + ReLU | Downsample x2; Conv + LReLU; Conv + LReLU
5 | Upsample x2; Conv + ReLU; Conv + ReLU | Downsample x2; Conv + LReLU; Conv + LReLU
6 | Upsample x2; Conv + ReLU; Conv + ReLU | Downsample x2; Conv + LReLU; Conv + LReLU
7 | Conv + Tanh | FC

Table I: Network architectures used in the experiments. Upsample uses nearest-neighbor interpolation; Downsample uses average pooling; LReLU is the leaky ReLU with negative slope 0.2. All convolution layers use kernels of the same size, with the number of output channels listed in parentheses. The sparsity-inducing hyperparameter $k$ is also given.

Datasets \ Methods | VAE [20] | DCGAN [35] | WGAN [3] | CoopNet [37] | CEGAN [7] | ALI [9] | ALICE [25] | Ours | Improvement
CelebA | 45.06 | 19.28 | 18.85 | 28.49 | 20.62 | 30.53 | 23.17 | 16.62 | 2.32
Human fashion | 23.28 | 10.82 | 10.19 | 15.39 | 11.14 | 16.75 | 12.56 | 8.65 | 1.44
Stanford cars | 76.21 | 33.58 | 31.62 | 45.34 | 36.12 | 50.48 | 37.35 | 28.36 | 2.26
LSUN bedrooms | 81.35 | 36.26 | 33.81 | 49.73 | 41.64 | 52.79 | 39.08 | 29.70 | 4.11

Table II: FID comparisons (lower is better).
4.1 Qualitative Results
The proposed AND-OR network is capable of joint image synthesis and reconstruction. Figure 3 shows examples of reconstructed and generated face images. The tops of Figure 4, Figure 5 and Figure 6 show examples for human fashion images, car images and bedroom images respectively. Both the reconstructed and the generated images look sharp. The reconstructed bedroom images (Figure 6) look relatively blurrier: bedroom images usually have larger variations, which may entail more complicated generator and energy-based network architectures, whereas we use the same architectures for all the tasks.
The learned AND-OR trees on the five datasets unfold the internal generation process with semantically meaningful internal basis functions learned (emerged). To our knowledge, this is the first work in image synthesis that learns interpretable image generation from scratch. More interestingly, we observe that the primitive layers in different AND-OR trees share many common patterns similar to Gabor wavelets and blob-like structures, which is also consistent with results in traditional sparse coding.
4.2 Quantitative Results
The FID comparisons are summarized in Table II. The proposed method consistently outperforms the seven state-of-the-art image synthesis methods in the comparisons. On the human fashion dataset, where the images are clean, our method obtains the least improvement, by 1.44. On the bedroom dataset, where the images are much more complex with large structural and appearance variations, our method obtains the biggest improvement, by 4.11. We note that all the improvements are obtained with more interpretable representations learned in the form of AND-OR trees. This is especially interesting since it shows that jointly improving model performance and interpretability is possible.
We utilize the per-pixel mean squared error (MSE) to evaluate image reconstruction. Table III shows the comparisons with three state-of-the-art methods that are also capable of joint image synthesis and reconstruction (VAE [20], ALI [9], and ALICE [25]). We do not compare with the variants of GANs and CoopNets since they usually cannot perform joint image reconstruction.
4.3 Ablation Studies
In addition to the AND-OR tree visualization, we propose a simple method to evaluate the interpretability of the learned basis functions (e.g., those at Layer 3, see Figure 1). We perform template matching between the learned basis functions and the training images using the fast normalized cross-correlation algorithm [39]. Consider Layer 3 (a.k.a. the object part level): if the learned basis functions contain meaningful local parts of the object, the matching score should be high. We compare the Layer-3 basis functions learned with and without the proposed sparsity-inducing approach (i.e., Eqn. 7 vs. Eqn. 1). The mean matching scores are summarized in Table IV. The proposed method significantly outperforms the counterpart, which verifies that it learns meaningful basis functions for better model interpretability.
Methods \ Datasets | CelebA | Human fashion | Cars | Bedroom
w/o sparsity | 0.33 | 0.29 | 0.31 | 0.23
w/ sparsity | 0.83 | 0.81 | 0.76 | 0.72

Table IV: Mean template-matching scores of the Layer-3 basis functions.
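A sketch of this evaluation using scikit-image's normalized cross-correlation (the grayscale 2-D arrays and the max-then-mean aggregation are our assumptions):

```python
import numpy as np
from skimage.feature import match_template  # fast normalized cross-correlation

def interpretability_score(basis_images, training_images):
    """Mean of the best normalized cross-correlation between each
    visualized Layer-3 basis function and the training images.
    Templates must be smaller than the images in both dimensions."""
    scores = []
    for b in basis_images:
        best = max(match_template(img, b).max() for img in training_images)
        scores.append(best)
    return float(np.mean(scores))
```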
5 Conclusion
This paper proposes interpretable image synthesis by learning sparsely connected AND-OR networks. The proposed method is built on the vanilla generator network: an AND-OR network of sparsely connected nodes emerges from the original densely connected generator network when sparsity-inducing terms are introduced. In training, we further combine the generator with an energy-based network and pose the learning problem under MLE. The resulting AND-OR networks are capable of joint image synthesis and reconstruction. In experiments, the proposed method is tested on five benchmark datasets. The results show that meaningful and interpretable hierarchical representations are learned, with better image synthesis and reconstruction quality than seven state-of-the-art methods.
Appendix A Computing and Visualization of the Basis Functions
A.1 Basis Functions and Sparse Representations
Suppose $I$ denotes an image defined on the spatial domain $\mathcal{D}$, where $u \in \mathcal{D}$ denotes a two-dimensional vector which indexes the coordinates of the pixels. $I$ can be treated as a two-dimensional function defined on $\mathcal{D}$; it can also be treated as a vector if we fix an ordering for the pixels. Suppose $|\mathcal{D}|$ counts the number of pixels in $\mathcal{D}$; then $|\mathcal{D}|$ is the dimensionality of the vector $I$.
A linear basis function (or basis vector) is a local image patch which is utilized to represent image intensities. Let $\{B_j\}$ be a set of prototype basis functions, e.g., wavelets. Suppose that each $B_j$ is supported on a local domain centered at the origin; $B_j$ can be shifted or translated spatially to a position $u$ to get a translated copy $B_{j,u}$. $B_{j,u}$ can be treated as a locally supported function defined on $\mathcal{D}$; it can also be treated as a vector of the same dimensionality as $I$. The basis functions give a top-down linear representation (top-down means from the coefficients to the image) of an image by

$I = \sum_{j,u} c_{j,u}\, B_{j,u} + \epsilon,$

where the $c_{j,u}$ are the coefficients, and $\epsilon$ is the residual image. When most of the $c_{j,u}$ are equal to zero, the resulting representation is called sparse coding.
The traditional one-layer sparse coding model generalizes the prior distribution of the latent coefficient vector in factor analysis from a Gaussian noise vector to a sparse vector, and typically comprises two steps: (1) learning a dictionary of basis functions $\{B_j\}$, where $j$ indexes a finite collection of prototype functions whose spatially translated copies tile the image domain; (2) inferring the sparse coefficients, i.e., the selection of the basis functions. K-SVD [1] is a typical algorithm for learning the dictionary of basis functions, while convex $\ell_1$ relaxation or greedy methods such as orthogonal matching pursuit (OMP) [36] can be employed to infer the sparse coefficients. We will develop a hierarchical top-down sparse coding model that can generate (in addition to merely reconstruct) realistic natural image patterns.
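As a reference point for step (2), a minimal OMP sketch (dictionary columns assumed unit-norm and $k \ge 1$; no claim to match the exact variant in [36]):

```python
import numpy as np

def omp(x, B, k):
    """Orthogonal matching pursuit (sketch): greedily select k atoms from
    the dictionary B (columns) and solve least squares on the selection."""
    residual, support = x.copy(), []
    for _ in range(k):
        j = int(np.argmax(np.abs(B.T @ residual)))   # best-matching atom
        support.append(j)
        coef, *_ = np.linalg.lstsq(B[:, support], x, rcond=None)
        residual = x - B[:, support] @ coef
    c = np.zeros(B.shape[1])
    c[support] = coef
    return c   # sparse coefficients: x ~ B @ c
```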
A.2 Inducing a Unified Hierarchical Sparse Coding Model from the Sparsely Connected Generator
A sparsity-inducing function is introduced to develop an explainable generative model. At each layer of the top-down generator network, the sparsity-inducing function only selects the top-$k$ coefficients to be active and forces all the other coefficients to be zero.
Suppose there are $L$ deconvolutional layers; the feature map at the $l$-th layer is $F_l$, and after sparsification the surviving activations constitute a sparse tensor $\hat{F}_l$. For the $l$-th layer,

$\hat{F}_l = g\big(r\big(D(\hat{F}_{l-1}; w_l)\big)\big),$

where we denote the deconvolution operation as $D$, the parameters at the $l$-th layer as $w_l$, the ReLU operation as $r$, and the sparsity-inducing function as $g$. The sparse activations are determined by the input latent code vector and the learned parameters (weights and biases) of the generator; a different latent code vector will generate a different parsing tree. The ReLU and the sparsity-inducing function can be seen as switches or masks that partition the space. For an instantiated latent code vector $z$, record the masks of the chosen elements of both the sparsity-inducing and ReLU functions. Then, these non-linear functions become linear functions,

$\hat{F}_l = M^g_l \odot \big(M^r_l \odot D(\hat{F}_{l-1}; w_l)\big),$

where $D$ is the deconvolution function, $M^g_l$ denotes the mask matrix of the sparsity-inducing function, $M^r_l$ denotes the mask matrix of the ReLU function, and $\odot$ denotes the element-wise product between two matrices. The final generated image can be formulated as,

$I = \Lambda_L \circ \Lambda_{L-1} \circ \cdots \circ \Lambda_1(z),$

where each $\Lambda_l(\cdot) = M^g_l \odot \big(M^r_l \odot D(\cdot\,; w_l)\big)$ is the masked, and hence linear, mapping of the $l$-th layer. Let the sparse activations at the $l$-th layer be $a_l = M^g_l \odot F_l$, where $M^g_l$ is the mask matrix that only selects the sparse activations; then $\hat{F}_l = a_l$.
Since each masked layer is linear, the mapping from any layer's sparse activations to the image is a linear function (composed of a series of linear functions). For the $l$-th layer, the values of the sparse activations can be interpreted as sparse coefficients $c_{l,i}$. The sparse activations in the $l$-th layer are equivalent to,

$a_l = \Lambda_l \circ \cdots \circ \Lambda_1(z).$  (23)

And we have

$I = \Lambda_L \circ \cdots \circ \Lambda_{l+1}(a_l).$  (24)

We obtain the final induced unified sparse coding formulation, in which the image is a linear combination of basis functions weighted by sparse coefficients, where $B_{l,i}$ is the basis function of the $i$-th sparse coefficient of the $l$-th layer, obtained by pushing a unit activation through the remaining linear layers $\Lambda_L \circ \cdots \circ \Lambda_{l+1}$ ($c_{l,i}$ can be computed from Eq. (23)).
The generated (or reconstructed) image can be represented as the summation of basis functions at different layers,

$I = \sum_{i} c_{l,i}\, B_{l,i}, \quad l = 1, \ldots, L.$  (25)

Eq. (25) can be employed to analyze each sparse activation's contribution to the whole generated image; as the layer goes from bottom to top, $B_{l,i}$ contains more and more high-level semantic information of the generated images.
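Computing $B_{l,i}$ amounts to pushing a unit activation through the mask-frozen linear layers; a sketch (representing each layer as a dense matrix is a simplification of the actual deconvolutions):

```python
import numpy as np

def basis_image(unit_index, layer, linear_layers, masks):
    """Contribution of one sparse activation to the generated image.
    linear_layers[l] maps layer-l activations to layer-(l+1)
    pre-activations; masks[l] are the recorded ReLU / top-k masks that
    freeze the network into a linear map for this sample."""
    v = np.zeros(linear_layers[layer].shape[1])
    v[unit_index] = 1.0                  # a unit activation at (layer, i)
    for W, m in zip(linear_layers[layer:], masks[layer:]):
        v = m * (W @ v)                  # masked linear propagation (Eq. 24)
    return v                             # the basis function B_{l,i}
```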
Acknowledgments
The work of X. Xing, S.-C. Zhu and Y. Wu is supported by DARPA XAI project N66001-17-2-4029, ARO project W911NF-18-1-0296, ONR MURI project N00014-16-1-2007, and an Extreme Science and Engineering Discovery Environment (XSEDE) grant ASC170063. The work of X. Xing is also supported by Natural Science Foundation of China No. 61703119, Natural Science Fund of Heilongjiang Province of China No. QC2017070, and Fundamental Research Funds for the Central Universities No. 3072019CFT0402. The work of T. Wu is supported in part by NSF IIS-1909644 and ARO grant W911NF-18-1-0295.
References
[1] (2006) K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE TSP 54(11), pp. 4311.
[2] (2016) OpenFace: a general-purpose face recognition library with mobile applications. Technical report CMU-CS-16-118.
[3] (2017) Wasserstein GAN. arXiv:1701.07875.
[4] (2019) GAN dissection: visualizing and understanding generative adversarial networks. In ICLR.
[5] (2018) Large scale GAN training for high fidelity natural image synthesis. arXiv:1809.11096.
[6] (2018) Large scale GAN training for high fidelity natural image synthesis. arXiv:1809.11096.
[7] (2017) Calibrating energy-based generative adversarial networks. arXiv:1702.01691.
[8] (2016) Explainable Artificial Intelligence (XAI) program, http://www.darpa.mil/program/explainableartificialintelligence, full solicitation at http://www.darpa.mil/attachments/darpabaa1653.pdf.
[9] (2016) Adversarially learned inference. arXiv:1606.00704.
[10] (2019) Learning grid cells as vector representation of self-position coupled with matrix representation of self-motion. In ICLR.
[11] (2002) Composition systems. Quarterly of Applied Mathematics 60(4), pp. 707–736.
[12] (2014) Generative adversarial nets. In NeurIPS, pp. 2672–2680.
[13] (2017) Alternating back-propagation for generator network. In AAAI, pp. 1976–1984.
[14] (2017) GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, pp. 6626–6637.
[15] (2016) beta-VAE: learning basic visual concepts with a constrained variational framework.
[16] (2015) Deep learning of part-based representation of data using sparse autoencoders with nonnegativity constraints. IEEE Trans. Neural Netw. Learn. Syst. 27(12), pp. 2486–2498.
[17] (2012) Learning hierarchical representations for face verification with convolutional deep belief networks. In CVPR, pp. 2518–2525.
[18] (2005) Spike and slab variable selection: frequentist and Bayesian strategies. The Annals of Statistics 33(2), pp. 730–773.
[19] (2018) A style-based generator architecture for generative adversarial networks. arXiv:1812.04948.
[20] (2013) Auto-encoding variational Bayes. arXiv:1312.6114.
[21] (2013) 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia.
[22] (2012) ImageNet classification with deep convolutional neural networks. In NeurIPS, pp. 1106–1114.
[23] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), pp. 2278–2324.
[24] (2009) Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, pp. 609–616.
[25] (2017) ALICE: towards understanding adversarial learning for joint distribution matching. In NeurIPS, pp. 5495–5503.
[26] (2016) DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In CVPR.
[27] (2015) Deep learning face attributes in the wild. In ICCV.
[28] (2017) Are GANs created equal? A large-scale study. arXiv:1711.10337.
[29] (2015) Winner-take-all autoencoders. In NeurIPS, pp. 2791–2799.
[30] (2013) k-sparse autoencoders. arXiv:1312.5663.
[31] (2011) Sparse autoencoder. CS294A Lecture Notes 72(2011), pp. 1–19.
[32] (2011) Sparse filtering. In NeurIPS, pp. 1125–1133.
[33] (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381(6583), pp. 607.
[34] (1997) Sparse coding with an overcomplete basis set: a strategy employed by V1? Vision Research 37(23), pp. 3311–3325.
[35] (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434.
[36] (2007) Signal recovery from random measurements via orthogonal matching pursuit. IEEE Transactions on Information Theory 53(12), pp. 4655–4666.
[37] (2018) Cooperative learning of energy-based model and latent variable model via MCMC teaching. In AAAI.
[38] Unsupervised disentangling of appearance and geometry by deformable generator network. In CVPR, pp. 10354–10363.
[39] (2009) Fast normalized cross-correlation. Circuits, Systems and Signal Processing 28(6), pp. 819.
[40] (1999) On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics 65(3-4), pp. 177–228.
[41] (2015) LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv:1506.03365.
[42] (2010) Deconvolutional networks. In CVPR.
[43] (2011) Adaptive deconvolutional networks for mid and high level feature learning. In ICCV.
[44] (2007) A stochastic grammar of images. Foundations and Trends in Computer Graphics and Vision 2(4), pp. 259–362.