Variational Capsules for Image Analysis and Synthesis

07/11/2018 ∙ by Huaibo Huang, et al.

A capsule is a group of neurons whose activity vector models different properties of the same entity. This paper extends the capsule to a generative version, named variational capsules (VCs). Each VC produces a latent variable for a specific entity, making it possible to integrate image analysis and image synthesis into a unified framework. Variational capsules model an image as a composition of entities in a probabilistic model. Different capsules' divergence with a specific prior distribution represents the presence of different entities, which can be applied in image analysis tasks such as classification. In addition, variational capsules encode multiple entities in a semantically-disentangling way. Diverse instantiations of capsules are related to various properties of the same entity, making it easy to generate diverse samples with fine-grained semantic attributes. Extensive experiments demonstrate that deep networks designed with variational capsules can not only achieve promising performance on image analysis tasks (including image classification and attribute prediction) but can also improve the diversity and controllability of image synthesis.




1 Introduction

With recent advances in deep learning, tremendous success has been achieved in a variety of application domains, including image analysis and synthesis. Image analysis usually refers to extracting information from an image Zhang et al. (2017) using discriminative models, while image synthesis aims to produce image samples following an assigned distribution via generative models. These two tasks are highly interconnected and are expected to complement and promote each other. Numerous methods attempt to utilize both analysis blocks (e.g., classifiers) and synthesis blocks (e.g., autoregressive models Van Oord et al. (2016); van den Oord et al. (2016), VAEs Kingma & Welling (2014); Rezende et al. (2014) and GANs Goodfellow et al. (2014)). In these approaches, analysis blocks are employed to produce controllable conditions for synthesis blocks Perarnau et al. (2016); Wang et al. (2018), or serve as constraints to regularize the target properties of generated samples Ledig et al. (2017). Nevertheless, in most circumstances the analysis and synthesis blocks are trained separately, which may not be optimal for tackling the two tasks simultaneously. Building a unified framework for image analysis and synthesis, in which the two tasks collaborate and assist each other, remains a challenge.

To address this challenge, we present a new method, called variational capsules (VCs), to model images in a unified discriminative and generative framework. Capsules, which were originally introduced by Hinton et al. (2011); Sabour et al. (2017), are groups of neurons whose activity vector represents various properties of a particular entity. The proposed variational capsules are a new version of capsules, which use the divergence of each capsule with a prior distribution rather than the length of the activity vector to represent the probability that an entity exists. Variational capsules model an image as a composition of entities in a probabilistic model, which maps the existing entities into a posterior that approximately matches the prior. Compared with the capsules in Sabour et al. (2017), variational capsules can be drawn from the prior distribution, which extends them from a purely discriminative model to a joint discriminative-generative model.

As illustrated in Figure 1, our framework takes a VAE-like architecture and comprises two parts: an encoder that maps the input images into variational capsules and a generator (or decoder) that generates images from masked variational capsules. In the training phase, the encoder aims to detect or classify the existing entity and make the active capsules match the prior distribution, while the decoder tries to reconstruct the input image from the active vectors. In the testing phase, the encoder can be used to analyze the input images via the predicted capsules, and the decoder can synthesize a new sample by drawing capsules from the prior distribution. In addition, when dealing with multiple entities, taking attributes as an example, the model can map attribute-aware information into disentangling semantic representations, which makes it possible to synthesize or edit images in a more controllable way with larger diversity. As shown in Figure 3, the presented model can generate various styles of glasses while preserving other attributes in face synthesis or editing, whereas most other models Yan et al. (2016); Perarnau et al. (2016); Lample et al. (2017) can only generate one style of glasses in this case.

Our contribution is four-fold. i) We present a new version of capsules, variational capsules, that model images as a composition of entities in a probabilistic model. ii) We present a unified framework for image analysis and synthesis, in which image synthesis helps improve the prediction accuracy of image analysis, while image analysis provides semantic representations for image synthesis. iii) The proposed method provides a new technique for conditional image generation by mapping an image to disentangling semantic representations, which improves the interpretability and diversity of image synthesis. iv) The experiments demonstrate that the proposed method achieves promising or even state-of-the-art performance in some typical image analysis tasks, such as classification and attribute prediction, and outperforms state-of-the-art methods in the diversity and controllability of image synthesis.

2 Background

As our work is mostly related to capsules, variational autoencoders (VAEs) and conditional image generation, we start with a brief review of them.

Capsules. Hinton et al. (2011) introduce capsules to represent properties of an image and propose the transforming auto-encoder to learn and manipulate an image with capsules. Sabour et al. (2017) use the length of a capsule's activity vector to represent the probability of an entity and design an iterative routing-by-agreement mechanism to improve the performance of capsule networks. Hinton et al. (2018) propose a matrix version of capsules with EM routing. Our work can be seen as a new version of capsules that uses a different metric to represent the presence of an entity. It extends capsules to generative models that are capable of producing new samples.

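Variational Autoencoder (VAE). The VAE Kingma & Welling (2014); Rezende et al. (2014) is one of the most promising generative models owing to its theoretical elegance, stable training and well-behaved manifold representations. A VAE consists of two models: a generative model $p_\theta(x|z)$ that synthesizes the visible data $x$ from the latent code $z$, and an inference model $q_\phi(z|x)$ that maps the visible data $x$ to the latent code $z$, which is expected to match a prior $p(z)$. The objective of the VAE is to maximize the variational lower bound (or evidence lower bound, ELBO) of $\log p_\theta(x)$:

$$\log p_\theta(x) \ge \mathbb{E}_{q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - D_{KL}\left(q_\phi(z|x)\,\|\,p(z)\right). \quad (1)$$

The first term in the ELBO aims to reconstruct the input data from the posterior $q_\phi(z|x)$ and the second term aims to make the posterior match the prior $p(z)$. Following the original VAEs Kingma & Welling (2014), let the prior be the centred isotropic multivariate Gaussian $p(z) = \mathcal{N}(0, I)$ and the posterior be $q_\phi(z|x) = \mathcal{N}(z; \mu, \mathrm{diag}(\sigma^2))$; then the KL-divergence term, given $N$ data samples, can be computed as:

$$D_{KL} = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{M}\left(\mu_{ij}^2 + \sigma_{ij}^2 - \log\sigma_{ij}^2 - 1\right), \quad (2)$$

where $M$ is the dimension of the latent code.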
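As a concrete illustration of the KL term, the sketch below evaluates the closed-form divergence between a diagonal Gaussian posterior and the standard normal prior for a single latent vector. The function name `kl_to_standard_normal` and the use of log-variances as input are assumptions made for this example, not details taken from the paper.

```python
import math

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

# A posterior identical to the prior has zero divergence.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))   # 0.0
# Moving the mean away from zero increases the divergence.
print(kl_to_standard_normal([1.0, -2.0], [0.0, 0.0]))  # 2.5
```

Summing this quantity over a batch gives the per-batch divergence; whether the batch term is summed or averaged is a normalization choice.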


Conditional image synthesis. There are mainly two forms of conditional image synthesis according to the provided condition. One is to generate new images from a prior $p(z)$ and given conditions $c$ such as object category, attribute, caption, etc. This task is often implemented by typical generative models that take the combination of the latent code $z$ and the condition $c$ as input, including CGANs Mirza & Osindero (2014), CVAEs Yan et al. (2016); Bao et al. (2017), conditional PixelCNN van den Oord et al. (2016), etc. The other is to generate new versions of an input image $x$ according to the given conditions $c$, which is also called image transformation (manipulation or editing) Lample et al. (2017); Perarnau et al. (2016); Shu et al. (2017); Bao et al. (2018).

In conditional image synthesis, the conditions are mostly given or learned as binary codes Mirza & Osindero (2014); Yan et al. (2016); Bao et al. (2017); Lample et al. (2017); Perarnau et al. (2016) (to indicate category, attribute, caption, etc.) or embedding features Shu et al. (2017); Bao et al. (2018). Lample et al. (2017) learn attribute-invariant features through adversarial learning and modify an image by sliding the values of the binary attributes. Bao et al. (2018) disentangle the identity and attribute features from a face image and map the attribute information into the prior distribution. Compared with the existing methods, our method can learn semantically-disentangling embeddings at a fine-grained level. Taking attribute-guided image synthesis as an example, our method can map the input image into disentangling attribute-aware embeddings, i.e., each attribute is embedded into different capsules, which makes it possible to generate various styles for each attribute while preserving other characteristics of the image.

3 Approach

Figure 1: Illustration of the VCNs. The network consists of two parts: an inference model (encoder) to map an input image into the posterior $q_\phi(z|x)$ and a generator model to produce the output image from the masked capsules. The posterior is trained to match the prior $p(z)$ for the active capsules, and to deviate from the prior for the nonactive capsules. The input capsules of the generator are sampled from the posterior (or from the prior when sampling new images) using the reparameterization trick with a mask to indicate the present entities.

To analyze and synthesize images in a unified framework, the proposed VCs are expected to have two properties: first, the activity vector of a capsule indicates the probability that an entity exists in an image; second, the capsule follows a known prior distribution, so that new capsules can be sampled from the prior to generate new images. In this section, we first describe how to design such variational capsules, followed by the training details of the VC networks (VCNs) and their applications in image synthesis.

3.1 Variational Capsules

The capsules proposed in Sabour et al. (2017) use the length of the instantiation vector to represent the probability that an entity exists. To facilitate the sampling of new capsules, we design variational capsules in a probabilistic manner such that the active capsules follow a known prior distribution while the nonactive ones do not. Following VAEs Kingma & Welling (2014), we select the KL-divergence as the metric of how closely two distributions match. Hence, the KL-divergence of each capsule with the prior distribution represents the probability that a capsule's entity exists, i.e., the capsule corresponding to an existing entity has a small KL-divergence with the prior while those corresponding to non-existing entities have large KL-divergences with the prior distribution.

Following the original VAEs Kingma & Welling (2014), the prior $p(z)$ is assumed to follow an isotropic multivariate Gaussian distribution, i.e., $p(z) = \mathcal{N}(0, I)$, while the proposed capsule follows a multivariate Gaussian distribution whose mean and covariance are parameterized by $\mu_k$ and $\sigma_k$. The KL-divergence of each capsule with the prior, i.e., $D_{KL}(q_\phi(z_k|x)\,\|\,p(z))$, can be computed using Eq. (2). Letting $d_k$ denote this divergence, we use a separate margin loss $L_k$ for each capsule (where $k$ indicates the index of the capsule), which is defined as:

$$L_k = T_k\, d_k + \lambda\,(1 - T_k)\,[m - d_k]^+,$$

where $T_k = 1$ if and only if the entity $k$ (such as a category or an attribute) exists; $[\cdot]^+ = \max(0, \cdot)$, $m$ is a positive margin; and $\lambda$ is a down-weighting coefficient. The total loss is the sum of the losses of all the capsules, i.e., $L_{margin} = \sum_k L_k$.
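To make the per-capsule loss concrete, here is a minimal sketch that assumes a hinge-style form: the divergence of a present entity is penalized directly, and the divergence of an absent entity is penalized only while it stays below the margin. The exact functional form, the function name `capsule_margin_loss`, and the default values of `margin` and `down_weight` are assumptions for illustration, not values reported in the paper.

```python
def capsule_margin_loss(divergences, present, margin=10.0, down_weight=0.5):
    """Sum of per-capsule margin losses on KL divergences (assumed hinge form).

    divergences: KL divergence of each capsule with the prior.
    present:     flags, 1 if the capsule's entity exists and 0 otherwise.
    """
    total = 0.0
    for d_k, t_k in zip(divergences, present):
        # Present entity: pull the capsule toward the prior (small divergence).
        active_term = t_k * d_k
        # Absent entity: push the capsule at least `margin` away from the prior.
        inactive_term = (1 - t_k) * max(0.0, margin - d_k)
        total += active_term + down_weight * inactive_term
    return total

# Capsule 0 is present and close to the prior; capsule 1 is absent but too close.
print(capsule_margin_loss([0.5, 4.0], [1, 0]))   # 0.5 + 0.5 * (10 - 4) = 3.5
```

An absent capsule whose divergence already exceeds the margin contributes nothing, so the loss does not keep pushing it away indefinitely.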

3.2 Training VCNs

As illustrated in Figure 1, VCNs are designed like VAEs Kingma & Welling (2014) with an auto-encoding architecture. The VCNs contain two modules: an inference model (Encoder) to map an input image into an approximate posterior matching the prior distribution, and a generator model (Generator) to generate samples from the capsules. During training, the proposed variational capsules in Figure 1 are sampled from the posterior $q_\phi(z|x) = \mathcal{N}(z; \mu, \sigma^2)$, where $\mu$ and $\sigma$ are the output vectors of the encoder. Following VAEs Kingma & Welling (2014), the capsules are sampled using the reparameterization trick, i.e., $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ is a random vector and $\odot$ means the element-wise multiplication. Besides, we take masked capsules as the input of the generator; i.e., all the capsules are set to zero, except for the active capsule that has the minimal KL-divergence with the prior distribution.
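A minimal sketch of the sampling and masking step is given below. It assumes the encoder outputs log-variances (a common parameterization that the text does not state) and uses hypothetical helper names and toy 2-D capsules.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """KL divergence of a diagonal Gaussian capsule with the N(0, I) prior."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var))

def sample_capsule(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def masked_capsules(mus, log_vars, rng):
    """Sample every capsule, then zero out all but the one closest to the prior."""
    kls = [kl_to_standard_normal(m, lv) for m, lv in zip(mus, log_vars)]
    active = kls.index(min(kls))
    capsules = [sample_capsule(m, lv, rng) for m, lv in zip(mus, log_vars)]
    masked = [z if k == active else [0.0] * len(z) for k, z in enumerate(capsules)]
    return masked, active

rng = random.Random(0)
mus = [[3.0, 3.0], [0.1, -0.1], [-4.0, 2.0]]   # capsule 1 is closest to the prior
log_vars = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
masked, active = masked_capsules(mus, log_vars, rng)
print(active)   # 1
print(masked)
```

Because the noise enters only through `eps`, the sample remains a differentiable function of the encoder outputs, which is the purpose of the reparameterization.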

Similar to the capsules in Sabour et al. (2017), we use an additional reconstruction loss to encourage the proposed capsules to capture the instantiation details of the entity in the input image. The auto-encoding loss $L_{AE}$ is a classic pixel-wise mean squared error (MSE), which estimates the element-wise reconstruction quality given $N$ data samples:

$$L_{AE} = \frac{1}{N}\sum_{i=1}^{N}\left\|x_i - \hat{x}_i\right\|_2^2,$$

where $\hat{x}_i$ is the reconstruction of the $i$-th data sample $x_i$. Note that the exact choice of the auto-encoding loss is not fundamental to the proposed method. For example, a cross-entropy loss may be more suitable for binary images such as the digits in MNIST LeCun et al. (1998).

In addition, when dealing with high-resolution images such as faces in CelebA Liu et al. (2015), additional losses could be adopted to boost synthesis performance Ledig et al. (2017); Bao et al. (2017); Larsen et al. (2016), such as the adversarial loss in GANs Goodfellow et al. (2014) to promote generating sharp images, and the perceptual loss Bruna et al. (2016); Johnson et al. (2016) to regularize high-level semantic properties. When the adversarial loss is employed, an extra discrimination model (Discriminator) is introduced to compete with the above-mentioned encoder and generator as in traditional GANs. Concretely, we take the form of the adversarial loss as in LSGAN Mao et al. (2017) to get better convergence and higher image quality.
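For reference, a least-squares adversarial objective can be sketched as follows. The targets of 1 for real and 0 for synthesized samples follow the common LSGAN setting; the text does not state which targets are used here, and the function names and plain-list inputs are assumptions.

```python
def lsgan_discriminator_loss(scores_real, scores_fake):
    """Least-squares loss pushing D(real) toward 1 and D(fake) toward 0."""
    real_term = sum((s - 1.0) ** 2 for s in scores_real) / len(scores_real)
    fake_term = sum(s ** 2 for s in scores_fake) / len(scores_fake)
    return 0.5 * (real_term + fake_term)

def lsgan_generator_loss(scores_fake):
    """Least-squares loss pushing D(fake) toward 1 from the generator's side."""
    return 0.5 * sum((s - 1.0) ** 2 for s in scores_fake) / len(scores_fake)

# A perfect discriminator incurs no loss.
print(lsgan_discriminator_loss([1.0, 1.0], [0.0, 0.0]))   # 0.0
# The generator is penalized for the sample the discriminator rejects.
print(lsgan_generator_loss([0.0, 1.0]))                    # 0.25
```

Unlike the sigmoid cross-entropy loss, the squared error keeps penalizing samples that are classified correctly but lie far from the target, which is the usual argument for its more stable gradients.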


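The final objective takes the following form:

$$L = L_{margin} + \alpha L_{AE} + \beta L_{adv},$$

where $L_{adv}$ denotes the adversarial loss, and $\alpha$ and $\beta$ are weighting coefficients to balance the importance of the losses.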
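The weighted combination can be illustrated with a few lines of arithmetic. In this sketch the margin loss enters with unit weight and the other two terms are scaled by `w_recon` and `w_adv`; that arrangement, the function names, and the per-sample mean used for the reconstruction error are assumptions. The default weights simply reuse the trade-off values 0.025 and 10 reported for the CelebA experiments.

```python
def mse_loss(images, reconstructions):
    """Mean over samples of the summed squared pixel error."""
    per_sample = [sum((a - b) ** 2 for a, b in zip(x, x_hat))
                  for x, x_hat in zip(images, reconstructions)]
    return sum(per_sample) / len(per_sample)

def total_objective(margin_loss, recon_loss, adv_loss, w_recon=0.025, w_adv=10.0):
    """Weighted sum of the margin, reconstruction and adversarial losses."""
    return margin_loss + w_recon * recon_loss + w_adv * adv_loss

images = [[0.0, 0.0], [1.0, 1.0]]
reconstructions = [[0.0, 1.0], [1.0, 1.0]]
recon = mse_loss(images, reconstructions)
print(recon)   # 0.5
print(total_objective(margin_loss=3.5, recon_loss=recon, adv_loss=0.25))
```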

3.3 Image Synthesis with VCs

There are three steps to generate a new sample with variational capsules: (1) determine the mask according to the expected entity, such as the object category or attribute of the generated image; (2) sample the variational capsules from the prior distribution $p(z)$, or from the posterior $q_\phi(z|x)$ given an input image $x$; (3) produce the output image from the masked capsules using the generator. Through these steps, it is possible to control the synthesized images in a fine-grained way. We can modify the mask to change the expected entities in the output image and can use different instantiations of an active capsule to generate large variations of a specific entity.
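The three steps can be sketched as follows. The generator here is only a placeholder that flattens its input, standing in for the trained network; all function names and the tiny capsule sizes are hypothetical.

```python
import random

def build_mask(num_capsules, active_indices):
    """Step 1: mark the capsules of the entities expected in the output."""
    return [1 if k in active_indices else 0 for k in range(num_capsules)]

def sample_from_prior(num_capsules, dim, rng):
    """Step 2: draw every capsule from the N(0, I) prior."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_capsules)]

def apply_mask(capsules, mask):
    """Zero out the capsules whose entities should be absent."""
    return [[v if m else 0.0 for v in z] for z, m in zip(capsules, mask)]

def toy_generator(masked):
    """Step 3 placeholder: a real generator would decode an image here."""
    return [v for z in masked for v in z]

rng = random.Random(0)
mask = build_mask(num_capsules=4, active_indices={2})
capsules = sample_from_prior(num_capsules=4, dim=3, rng=rng)
output = toy_generator(apply_mask(capsules, mask))
print(mask)     # [0, 0, 1, 0]
print(output)
```

Resampling `capsules` with the same mask changes the instantiation of the chosen entity, while changing the mask changes which entities appear.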

The advantage of the proposed method is most apparent when dealing with images that contain multiple entities. Variational capsules encode a single image into disentangling semantic representations in which each capsule corresponds to a specific entity. With the semantically-disentangling representation, it is easy to control the properties of the synthesized image. Taking attribute-guided image generation as an example, an input image is modeled as a composition of multiple attribute-related capsules. We can synthesize a completely new image with the latent representation sampled from the prior distribution. We can also modify a given image or a synthesized image at the attribute level, so that it is easy to generate various styles of a single attribute while preserving the properties of other attributes. For example, the proposed VCNs can synthesize images with various styles of bangs or glasses (see Figure 3). In contrast, most other works can only change the existence of an attribute Yan et al. (2016); Perarnau et al. (2016) or the intensity of the attribute Lample et al. (2017); Bao et al. (2017); Bruna et al. (2016), but can hardly provide different styles of the attribute.

4 Experiments

We conduct experiments on two datasets to evaluate the performance of the presented method on image analysis and synthesis: attribute prediction and synthesis on CelebA Liu et al. (2015), and digit classification and generation on MNIST LeCun et al. (1998).

4.1 Experiments on CelebA

The CelebA database Liu et al. (2015) consists of 202,599 celebrity images with large variations in facial attributes. These images are obtained from unconstrained environments and annotated with 40 attributes. The standard split for CelebA is employed in our experiments, with 162,770 images for training, 19,867 for validation and 19,962 for testing. Following the image pre-processing method in Lample et al. (2017), we use the aligned version of CelebA in our experiments. Images are first center cropped and then resized before being fed into our networks.

We treat attribute prediction as a multi-task binary classification problem. For each attribute, we train a classifier with two outputs that model the active/nonactive status of this attribute. Following the original VAEs, each output is formed by a pair of variational capsules that represent the mean and covariance of the posterior distribution, respectively. In our attribute prediction experiment, the dimension of each variational capsule is set to 32. Therefore, the encoder has $40 \times 2 \times 2 = 160$ variational capsule outputs in total, and each capsule is a 32-D vector. The decoder receives $40 \times 2 = 80$ capsules as input, which are sampled from the posteriors given by the outputs of the encoder.

Let C5s1-k denote a $5 \times 5$ Convolution-BatchNorm-ReLU layer with k filters and stride 1. d denotes an average pooling layer with stride 2. u denotes an upsampling layer with scale factor 2. Rk denotes a residual block that contains two Convolution-BatchNorm-ReLU layers with k filters, and an extra convolution layer with k filters in the identity path when the number of input channels does not equal k. Fk denotes a fully-connected layer with output dimension k. If a convolutional layer carries the suffix gk, the convolutional filters in this layer are separated into k groups. The encoder architecture is: C5s1-16d, R32d, R64d, R128d, R256d, R512d, C1s1-10240, C4s1g40-5120. The decoder architecture is: F8192, R512u, R256u, R128u, R64u, R32u, R16u, R16, C5s1-3.
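The layer notation makes it possible to sanity-check the tensor sizes with a little arithmetic. The sketch below counts the pooling and upsampling suffixes in the two architecture strings and checks two products: 40 attributes × 2 outputs × 2 capsules × 32 dimensions against the 5120 channels of the last encoder layer, and 512 channels on a 4 × 4 grid against the 8192 units of the first decoder layer. The 4 × 4 grid is an assumption suggested by the C4 kernel, and the helper name `count_suffix` is hypothetical.

```python
ENCODER = "C5s1-16d, R32d, R64d, R128d, R256d, R512d, C1s1-10240, C4s1g40-5120"
DECODER = "F8192, R512u, R256u, R128u, R64u, R32u, R16u, R16, c5s1-3"

def count_suffix(architecture, suffix):
    """Count layers whose token ends with the resampling suffix ('d' or 'u')."""
    return sum(token.strip().endswith(suffix) for token in architecture.split(","))

downsamplings = count_suffix(ENCODER, "d")
upsamplings = count_suffix(DECODER, "u")
print(downsamplings, upsamplings)   # 6 6

# 40 attributes x 2 outputs x (mean, covariance) capsules, each 32-D.
capsule_outputs = 40 * 2 * 2
assert capsule_outputs * 32 == 5120    # channels of C4s1g40-5120

# Assumed 4x4 spatial grid at the decoder input: 512 channels x 4 x 4.
base = 4
assert 512 * base * base == 8192       # units of F8192

# Six factor-2 upsamplings from the assumed 4x4 grid.
print(base * 2 ** upsamplings)         # 256
```

Under these assumptions the encoder and decoder are symmetric: six halvings on the way in and six doublings on the way out.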

An extra multi-scale discriminator is employed to differentiate between natural and synthesized samples, with which adversarial training is conducted. All these sub-networks are trained jointly with a batch size of 64 and a learning rate of . In our experiment, we empirically set the trade-off parameters for reconstruction loss and adversarial loss to 0.025 and 10, respectively.

4.1.1 Attribute Prediction

Approach 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21

LNets+ANet Liu et al. (2015) 91.00 79.00 81.00 79.00 98.00 95.00 68.00 78.00 88.00 95.00 84.00 80.00 90.00 91.00 92.00 99.00 95.00 97.00 90.00 87.00 98.00
MOON Rudd et al. (2016) 94.03 82.26 81.67 84.92 98.77 95.80 71.48 84.00 89.40 95.86 95.67 89.38 92.62 95.44 96.32 99.47 97.04 98.10 90.99 87.01 98.10
MCNN+AUX Hand & Chellappa (2017) 94.51 83.42 83.06 84.92 98.90 96.05 71.47 84.53 89.78 96.01 96.17 89.15 92.84 95.67 96.32 99.63 97.24 98.20 91.55 87.58 98.17
PaW Ding et al. (2018) 94.64 83.01 82.86 84.58 98.93 95.93 71.46 83.63 89.84 95.85 96.11 88.50 92.62 95.46 96.26 99.59 97.38 98.21 91.53 87.44 98.39
Ours (w/o recon.) 94.64 84.10 83.01 85.05 98.90 95.98 71.43 84.99 89.58 96.06 96.26 89.06 93.01 95.96 96.60 99.69 97.52 98.28 91.57 87.71 98.32
Ours 94.88 84.15 83.19 85.69 99.05 96.09 71.75 84.95 90.23 96.28 96.26 90.00 93.06 95.66 96.58 99.70 97.66 98.37 92.06 87.80 98.37
Approach 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 Avg
LNets+ANet Liu et al. (2015) 92.00 95.00 81.00 95.00 66.00 91.00 72.00 89.00 90.00 96.00 92.00 73.00 80.00 82.00 99.00 93.00 71.00 93.00 87.00 87.30
MOON Rudd et al. (2016) 93.54 96.82 86.52 95.58 75.73 97.00 76.46 93.56 94.82 97.59 92.60 82.26 82.47 89.60 98.95 93.93 87.04 96.63 88.08 90.94
MCNN+AUX Hand & Chellappa (2017) 93.74 96.88 87.23 96.05 75.84 97.05 77.47 93.81 95.16 97.85 92.73 83.58 83.91 90.43 99.05 94.11 86.63 96.51 88.48 91.29
PaW Ding et al. (2018) 94.05 96.90 87.56 96.22 75.03 97.08 77.35 93.44 95.07 97.64 92.73 83.52 84.07 89.93 99.02 94.24 87.70 96.85 88.59 91.23
Ours (w/o recon.) 94.02 96.95 87.53 96.57 75.96 97.06 77.70 93.85 94.94 97.84 93.16 83.69 84.30 90.81 99.07 94.37 86.21 96.42 88.72 91.42
Ours 93.98 96.97 87.70 96.43 73.82 97.08 76.50 93.99 95.24 97.97 93.17 84.39 84.40 90.84 99.13 94.14 87.81 97.15 88.65 91.53
Table 1: Attribute prediction accuracies on CelebA. Attributes are numbered from 1 to 40 in alphabetical order.

To demonstrate the capacity of our method in image analysis, we conduct attribute prediction on CelebA. The classification accuracies are reported in Table 1. Our method obtains an average accuracy of 91.53%, outperforming the baseline method LNets+ANet Liu et al. (2015) by over 4%. Besides, VCNs perform better than PaW Ding et al. (2018), which uses multiple networks, and MCNN+AUX Hand & Chellappa (2017), which elaborately categorizes the attributes into different groups. Adding the reconstruction loss raises the average prediction accuracy from 91.42% to 91.53%, suggesting that image synthesis is helpful for learning discriminative representations. In particular, we observe that the reconstruction loss contributes most to predicting attributes that explicitly affect the visual appearance of face images, such as 'Heavy Makeup', 'Rosy Cheeks' and 'Straight Hair'.

4.1.2 Face Synthesis

In this part, we provide experiments on generating face images from latent representations. Attribute-conditioned image generation, facial attribute swapping and attribute interpolation are conducted to demonstrate our model's ability to synthesize face images with great diversity and fine-grained controllability.

Attribute-conditioned image generation. Figure 2 shows examples of generating face images from specified attributes. The left side of Figure 2 shows results of directly synthesizing new faces from latent codes. The latent codes are randomly sampled from the prior and then masked according to the targeted attributes. In the first column of each example, a binary block image exhibits the exact activation status of the 40 attributes in CelebA, corresponding to the first 40 blocks in the matrices. For each example on the right side, a reference image is used to generate images sharing the same attributes. Concretely, attributes are first predicted via the proposed inference model (encoder), then latent codes are sampled according to the prediction results. We change the latent code for one attribute in each example by adjusting specific capsules while keeping the rest fixed. Two positive and two negative samples are provided for the changed attribute. As shown in Figure 2, all the generated faces are visually plausible and accord with the targeted attributes. Attributes of the reference images are well transferred to the newly generated images, which demonstrates that VCNs perform well in both image analysis and synthesis. Specifically, when we change specific attributes, the remaining attributes are well preserved, suggesting that our model is able to learn semantically-disentangling representations.

Figure 2: Attribute-conditioned image generation. Blocks corresponding to the changed attributes are marked with red color.

Facial attribute swapping. Some visual examples of facial attribute swapping are shown in Figure 3. For each identity, the first and second images are the original and reconstructed faces from the CelebA testing set, respectively. Due to the injection of random noise in the training phase of VCNs, the reconstructed images cannot keep accurate pixel-wise similarity with the original images, but their attributes remain unchanged. The remaining four faces are synthesis results obtained by swapping an attribute of the input face while keeping the other attributes preserved. These generated images confirm that the proposed VCNs are able to learn semantically disentangling features. In addition, various embodiments of the same attribute can be accessed by resampling the proposed variational capsules. For example, different styles of glasses and bangs are presented in these generated faces.

Figure 3: Swapping the attributes of different faces. From left to right, original faces, reconstruction faces and various attribute-swapping results.
Figure 4: Interpolation between individual attributes.
Figure 5: Interpolation between multiple attributes.

Facial attribute interpolation. In this part, we conduct attribute interpolation experiments to show our method's ability to change facial attributes continuously. Figure 4 shows results for single-attribute interpolation. Specifically, we change the attribute intensity by linearly interpolating between an active capsule and a nonactive capsule. Neighboring images differ only subtly, while the images at the two ends differ from each other to a greater degree. These results demonstrate the ease of use of our method, as we can easily synthesize facial images with desired attributes and intensities. Apart from single-attribute interpolation, results for multi-attribute interpolation are also provided (Figure 5). As mentioned above, diverse samples can be synthesized from the same active capsules with different instantiations. Thus, we interpolate between two different instantiations of the same attributes in the first row of Figure 5, in which the faces change gradually while the attributes are kept unchanged. The remaining rows are interpolation results between faces with different attributes. The interpolation results are visually pleasing, and continuous attribute changes can be found in these generated images, again confirming our method's capacity to represent facial attributes.
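Linear interpolation between two capsule instantiations amounts to a convex combination of the two vectors. A minimal sketch, with an assumed helper `interpolate` and toy 2-D capsules in place of the 32-D ones:

```python
def interpolate(z_start, z_end, steps):
    """Linearly interpolate between two capsule vectors, endpoints included."""
    weights = [i / (steps - 1) for i in range(steps)]
    return [[(1 - t) * a + t * b for a, b in zip(z_start, z_end)] for t in weights]

path = interpolate([0.0, 2.0], [1.0, 0.0], steps=5)
for z in path:
    print(z)
# First vector: [0.0, 2.0], middle: [0.5, 1.0], last: [1.0, 0.0]
```

Each vector on the path would be decoded by the generator to produce one frame of the interpolation sequence.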

4.2 Experiments on MNIST

The MNIST database LeCun et al. (1998) is a digit dataset with 60,000 training and 10,000 testing images. All images in the MNIST dataset are binary images of size $28 \times 28$, and each image contains a single handwritten digit with a class label from 0 to 9.

The encoder used in this experiment consists of 3 convolutional layers, followed by a fully-connected layer which produces variational capsules for the means and covariances of the 10 digit classes. The dimension of the variational capsules is set to 16, so the output of the encoder is 320-D. The encoder architecture is: C5s1-256, C5s2-256, C5s2-256, F320. The decoder consists of 3 fully-connected layers, and the detailed network architecture is: F512, F1024, F784.
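To illustrate how the 320-D encoder output can be turned into a class prediction, the sketch below splits it into 10 mean capsules and 10 variance capsules of 16 dimensions each and picks the class whose capsule has the smallest KL-divergence with the prior. The memory layout (all means first, then all log-variances), the log-variance parameterization, and the function names are assumptions.

```python
import math

NUM_CLASSES, CAPSULE_DIM = 10, 16

def split_capsules(flat):
    """Cut a flat vector into NUM_CLASSES capsules of CAPSULE_DIM values each."""
    return [flat[k * CAPSULE_DIM:(k + 1) * CAPSULE_DIM] for k in range(NUM_CLASSES)]

def predict_digit(encoder_output):
    """Return the class whose capsule is closest (in KL) to the N(0, I) prior."""
    half = NUM_CLASSES * CAPSULE_DIM
    assert len(encoder_output) == 2 * half   # 10 classes x 2 x 16 = 320
    mus = split_capsules(encoder_output[:half])
    log_vars = split_capsules(encoder_output[half:])
    kls = [0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu_k, lv_k))
           for mu_k, lv_k in zip(mus, log_vars)]
    return kls.index(min(kls))

# Toy encoder output: every class mean is far from zero except class 7.
means = [0.0 if k // CAPSULE_DIM == 7 else 2.0 for k in range(NUM_CLASSES * CAPSULE_DIM)]
log_variances = [0.0] * (NUM_CLASSES * CAPSULE_DIM)
print(predict_digit(means + log_variances))   # 7
```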

(a) Original
(b) Reconstructions
(c) Samples
Figure 6: Results on the MNIST dataset. From left to right are the original images from the MNIST test set, the reconstruction images and the synthesized new digit samples.
Figure 7: Interpolation between digit images.

Digit recognition. We obtain an error rate of using the encoder only, and when introducing the reconstruction loss. The introducing of auto-encoding architecture helps in improving the classification accuracy. Sabour et al. Sabour et al. (2017) implement a 3-layer capsule network, and achieve error rates of without reconstruction loss and with reconstruction loss respectively. Hinton et al. Hinton et al. (2018) get error rate with matrix capsules. Dynamic routing mechanism is employed in these capsule networks Sabour et al. (2017); Hinton et al. (2018) and plays important roles. However, because of the different way of representing the probability that an entity exists, these routing algorithms in Sabour et al. (2017); Hinton et al. (2018) cannot be directly applied in the proposed variational capsules. It will be a potential research direction for our method to explore the design of routing algorithm.

Digit synthesis. As illustrated in Figure 6, the reconstructions of variational capsules are robust while keeping only important details. Apart from reconstructing digit images from the inputs, we also generate new digits from randomly sampled latent representations. The synthesized images follow the given class labels, showing almost no ambiguity to the human eye. Besides, great diversity can be found within the same class. To show that our model is able to learn the digit representation, examples of interpolation between two different digit images are also provided in Figure 7. Specifically, we interpolate between the respective latent codes of the two digits, and the generated digit images show continuous changes. Digit synthesis on the MNIST dataset verifies that our method can be used in conditional image generation, and again reflects the benefit of the proposed unified discriminative-generative framework.

5 Conclusion

In this paper, we have presented a new type of capsule that models images in a unified discriminative and generative framework. The proposed variational capsules are designed in a probabilistic way, in which the values of active capsules are expected to be drawn from a known prior distribution. Thus, the divergence of each capsule with the prior distribution can be used to represent the presence of an entity, deriving a new metric for image classification. By sampling values for active capsules from the prior distribution, the proposed VCs can be further extended into a generative model and employed to synthesize new images. Benefitting from the semantically-disentangling representations learned via VCs, it is easy to synthesize image samples with fine-grained semantic attributes and large diversity. The experimental results of the digit recognition and synthesis as well as the facial attribute prediction and manipulation demonstrate our method’s superiority in integrating image analysis and synthesis into a unified framework.


References

  • Bao et al. (2017) Bao, Jianmin, Chen, Dong, Wen, Fang, Li, Houqiang, and Hua, Gang. CVAE-GAN: Fine-grained image generation through asymmetric training. In ICCV, 2017.
  • Bao et al. (2018) Bao, Jianmin, Chen, Dong, Wen, Fang, Li, Houqiang, and Hua, Gang. Towards open-set identity preserving face synthesis. In CVPR, 2018.
  • Bruna et al. (2016) Bruna, Joan, Sprechmann, Pablo, and LeCun, Yann. Super-resolution with deep convolutional sufficient statistics. In ICLR, 2016.
  • Ding et al. (2018) Ding, Hui, Zhou, Hao, Zhou, Shaohua Kevin, and Chellappa, Rama. A deep cascade network for unaligned face attribute classification. In AAAI, 2018.
  • Goodfellow et al. (2014) Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In NIPS, pp. 2672–2680, 2014.
  • Hand & Chellappa (2017) Hand, Emily M and Chellappa, Rama. Attributes for improved attributes: A multi-task network utilizing implicit and explicit relationships for facial attribute classification. In AAAI, pp. 4068–4074, 2017.
  • Hinton et al. (2011) Hinton, Geoffrey E, Krizhevsky, Alex, and Wang, Sida D. Transforming auto-encoders. In ICANN, pp. 44–51, 2011.
  • Hinton et al. (2018) Hinton, Geoffrey E, Sabour, Sara, and Frosst, Nicholas. Matrix capsules with EM routing. In ICLR, 2018.
  • Johnson et al. (2016) Johnson, Justin, Alahi, Alexandre, and Fei-Fei, Li. Perceptual losses for real-time style transfer and super-resolution. In ECCV, pp. 694–711, 2016.
  • Kingma & Welling (2014) Kingma, Diederik P and Welling, Max. Auto-encoding variational bayes. In ICLR, 2014.
  • Lample et al. (2017) Lample, Guillaume, Zeghidour, Neil, Usunier, Nicolas, Bordes, Antoine, Denoyer, Ludovic, et al. Fader networks: Manipulating images by sliding attributes. In NIPS, pp. 5969–5978, 2017.
  • Larsen et al. (2016) Larsen, Anders Boesen Lindbo, Sønderby, Søren Kaae, Larochelle, Hugo, and Winther, Ole. Autoencoding beyond pixels using a learned similarity metric. In ICML, pp. 1558–1566, 2016.
  • LeCun et al. (1998) LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Ledig et al. (2017) Ledig, Christian, Theis, Lucas, Huszar, Ferenc, Caballero, Jose, Cunningham, Andrew, Acosta, Alejandro, Aitken, Andrew, Tejani, Alykhan, Totz, Johannes, Wang, Zehan, and Shi, Wenzhe. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, pp. 4681–4690, 2017.
  • Liu et al. (2015) Liu, Ziwei, Luo, Ping, Wang, Xiaogang, and Tang, Xiaoou. Deep learning face attributes in the wild. In ICCV, pp. 3730–3738, 2015.
  • Mao et al. (2017) Mao, Xudong, Li, Qing, Xie, Haoran, Lau, Raymond YK, Wang, Zhen, and Smolley, Stephen Paul. Least squares generative adversarial networks. In ICCV, pp. 2813–2821, 2017.
  • Mirza & Osindero (2014) Mirza, Mehdi and Osindero, Simon. Conditional generative adversarial nets. In NIPSW, 2014.
  • Perarnau et al. (2016) Perarnau, Guim, van de Weijer, Joost, Raducanu, Bogdan, and Álvarez, Jose M. Invertible conditional GANs for image editing. In NIPSW, 2016.
  • Rezende et al. (2014) Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pp. 1278–1286, 2014.
  • Rudd et al. (2016) Rudd, Ethan M, Günther, Manuel, and Boult, Terrance E. MOON: A mixed objective optimization network for the recognition of facial attributes. In ECCV, pp. 19–35, 2016.
  • Sabour et al. (2017) Sabour, Sara, Frosst, Nicholas, and Hinton, Geoffrey E. Dynamic routing between capsules. In NIPS, pp. 3859–3869, 2017.
  • Shu et al. (2017) Shu, Zhixin, Yumer, Ersin, Hadap, Sunil, Sunkavalli, Kalyan, Shechtman, Eli, and Samaras, Dimitris. Neural face editing with intrinsic image disentangling. In CVPR, pp. 5541–5550, 2017.
  • van den Oord et al. (2016) van den Oord, Aaron, Kalchbrenner, Nal, Espeholt, Lasse, Vinyals, Oriol, Graves, Alex, et al. Conditional image generation with pixelcnn decoders. In NIPS, pp. 4790–4798, 2016.
  • Van Oord et al. (2016) Van Oord, Aaron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. In ICML, pp. 1747–1756, 2016.
  • Wang et al. (2018) Wang, Ting-Chun, Liu, Ming-Yu, Zhu, Jun-Yan, Tao, Andrew, Kautz, Jan, and Catanzaro, Bryan. High-resolution image synthesis and semantic manipulation with conditional GANs. In CVPR, 2018.
  • Yan et al. (2016) Yan, Xinchen, Yang, Jimei, Sohn, Kihyuk, and Lee, Honglak. Attribute2image: Conditional image generation from visual attributes. In ECCV, pp. 776–791, 2016.
  • Zhang et al. (2017) Zhang, YuJin et al. Image analysis. Walter de Gruyter GmbH & Co KG, 2017.