1 Introduction
A fundamental challenge in developing a conceptual understanding of our world is learning the factorial structure of observations without supervision [3, 27]. Conceptual understanding requires a disentangled representation, which separates the underlying explanatory factors and explicitly represents the important attributes of real-world data [5, 1]. For instance, given an image dataset of human faces, a disentangled representation can include the face's appearance attributes, such as color, light source, identity, and gender, and the geometric attributes, such as face shape and viewing angle. A disentangled representation is useful not only for building more transparent and interpretable generative models, but also for a large variety of downstream AI tasks, such as transfer learning and zero-shot inference, where humans excel but machines struggle [23]. Many exciting applications require generative models that can synthesize novel instances while certain key factors of variation are held fixed. Potential applications include generating a face image with desired attributes, such as color, face shape, expression, and view, or transferring the face shape, expression, or view learned from one person to another.
Generative models have shown great promise in learning disentangled representations of images. The generative models used for unsupervised disentangling usually fall into two categories: the Generative Adversarial Network (GAN) framework [11, 9, 29, 24, 33]
and the Variational Autoencoder (VAE) framework
[18, 31, 28, 22]. InfoGAN [6], a representative of the former family, is motivated by the principle of maximizing the mutual information between the observations and a subset of latent vectors. However, its disentangling performance is sensitive to the choice of the prior and the number of latent vectors. The β-VAE [14], from the latter family, learns disentangled representations by augmenting the VAE objective with an extra KL penalty that encourages the latent distribution (variational posterior) to be close to the standard normal distribution, giving a more robust and stable solution for disentangling.
In contrast to the existing methods, which use one latent vector to encode the factors of variation, our work introduces a deformable generator network that disentangles the appearance and geometric information of an image into two independent latent vectors in an unsupervised manner. Motivated by the Active Appearance Models (AAM) [7, 20], which use a linear model to jointly capture the appearance and shape variation of an image, the proposed model introduces two nonlinear generators to extract the appearance and geometric information separately. Unlike the AAM methods [7, 20, 19], which require hand-annotated facial landmarks for each training image, the proposed deformable generator model is purely unsupervised and learns from images alone.
2 Model and learning algorithm
This section provides the details of the model and the associated learning and inference algorithm.
2.1 Model
The proposed model contains two generator networks: an appearance generator and a geometric generator. The two generators are connected by a warping function to produce the final images or video frames, as shown in Figure 1. Suppose an arbitrary image or video frame is generated with two independent latent vectors: Z^a, which controls its appearance, and Z^g, which controls its geometric information. Varying the geometric latent vector Z^g while fixing the appearance latent vector Z^a, we can transform an object's geometric properties, such as rotating it by some angle or changing its shape. Varying Z^a while fixing Z^g, we can change the identity or category of the object while keeping it in the same geometric status, such as the same viewing angle or the same shape. Thus, the appearance information and the geometric information are disentangled in the ideal situation.
The model can be expressed as

X = F(Z^a, Z^g; θ) + ε = F_w(F_a(Z^a; θ_a), F_g(Z^g; θ_g)) + ε,   (1)

where Z^a ∼ N(0, I), Z^g ∼ N(0, I), and ε ∼ N(0, σ²I) are independent, and θ = (θ_a, θ_g). F_w is the warping function, which employs the displacement field generated by the geometric generator F_g to warp the image produced by the appearance generator F_a and obtain the final output image X.
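As a concrete, simplified sketch of Eq.(1), the forward model can be written in a few lines of numpy. Toy linear maps stand in for the deep generators F_a and F_g, and a crude nearest-neighbor warp stands in for the differentiable bilinear warping described in Section 2.2; all names and dimensions here are illustrative, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 8          # image size
d_a = d_g = 4      # appearance / geometric latent dimensions

# Toy linear maps standing in for the deep generators F_a and F_g.
Wa = rng.standard_normal((H * W, d_a))
Wg = 0.5 * rng.standard_normal((2 * H * W, d_g))

def appearance_gen(za):
    """F_a(Z^a; theta_a): latent -> appearance image."""
    return (Wa @ za).reshape(H, W)

def geometric_gen(zg):
    """F_g(Z^g; theta_g): latent -> displacement field (dx, dy)."""
    return (Wg @ zg).reshape(2, H, W)

def warp(img, dx, dy):
    """F_w: crude nearest-neighbor warping stand-in (Section 2.2 uses
    differentiable bilinear interpolation instead)."""
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    xs_src = np.clip(np.rint(xs + dx).astype(int), 0, W - 1)
    ys_src = np.clip(np.rint(ys + dy).astype(int), 0, H - 1)
    return img[ys_src, xs_src]

# Eq.(1): X = F_w(F_a(Z^a), F_g(Z^g)) + eps
za = rng.standard_normal(d_a)
zg = rng.standard_normal(d_g)
dx, dy = geometric_gen(zg)
X = warp(appearance_gen(za), dx, dy) + 0.01 * rng.standard_normal((H, W))
```

Note that setting Z^g = 0 produces a zero displacement field, so the warp reduces to the identity and X is the appearance image plus noise; this is exactly the "canonical face" behavior exploited in the experiments.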
2.2 Warping function
A warping function usually includes a geometric transformation operation for image coordinates and a differentiable interpolation (or resampling) operation. The geometric transformation describes the destination coordinates for every location in the source coordinates. The geometric operation only modifies the positions of pixels in an image without changing their colors and illumination. Therefore, the color and illumination information and the geometric information are naturally disentangled by the geometric generator and the appearance generator in the proposed model.
The geometric transformation can be a rigid affine mapping, as used in spatial transformer networks [17], or a non-rigid deformable mapping, which is the case in our work. Specifically, the coordinate displacements (dx, dy) (i.e., the dense optical flow field) at each regular grid point (x, y) of the output warped image are generated by our geometric generator F_g(Z^g; θ_g). The pointwise transformation in this deformable mapping can be formulated as

(x^s, y^s) = (x + dx, y + dy),   (2)

where (x^s, y^s) are the source coordinates in the image generated by the appearance generator F_a(Z^a; θ_a).
Since the source coordinates (x^s, y^s) evaluated by Eq.(2) do not usually have integer values, each pixel value of the output warped image is computed by a differentiable interpolation operation. Let I_a denote the image generated by the appearance generator. The warping function can be formulated as

X(x, y) = F_int(I_a(x^s, y^s)),   (3)

where F_int is the differentiable interpolation function. In this study, we employ a differentiable bilinear interpolation of the form

X(x, y) = Σ_i Σ_j I_a(i, j) max(0, 1 − |x^s − i|) max(0, 1 − |y^s − j|),   (4)

where (i, j) ranges over the pixel grid of I_a and, from Eq.(2), (x^s, y^s) = (x + dx, y + dy). The details of back-propagation through this bilinear interpolation can be found in [17].
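The bilinear resampling of Eqs.(3)-(4) can be sketched in numpy as follows. This is a minimal forward-pass version with border clipping; the function name and the toy image are illustrative, and a real implementation would also back-propagate through it as in [17]:

```python
import numpy as np

def bilinear_warp(img, dx, dy):
    """Differentiable bilinear warping, Eqs.(3)-(4): resample img at the
    source coordinates (x + dx, y + dy), clipping at the image border."""
    H, W = img.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    xs_src = np.clip(xs + dx, 0.0, W - 1.0)
    ys_src = np.clip(ys + dy, 0.0, H - 1.0)
    x0 = np.floor(xs_src).astype(int)
    y0 = np.floor(ys_src).astype(int)
    x1 = np.clip(x0 + 1, 0, W - 1)
    y1 = np.clip(y0 + 1, 0, H - 1)
    wx = xs_src - x0          # fractional parts = bilinear weights
    wy = ys_src - y0
    return (img[y0, x0] * (1 - wx) * (1 - wy)
            + img[y0, x1] * wx * (1 - wy)
            + img[y1, x0] * (1 - wx) * wy
            + img[y1, x1] * wx * wy)

# A unit rightward displacement reproduces a shifted copy of the image
# (with border replication from the clipping).
img = np.arange(16, dtype=float).reshape(4, 4)
ones = np.ones((4, 4))
shifted = bilinear_warp(img, dx=ones, dy=np.zeros((4, 4)))
```

Because only coordinates are moved, not pixel values, this operation preserves color and illumination exactly, which is the structural reason the two generators disentangle.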
A similar displacement field is used in deformable convolutional networks [8]. The computation of coordinate displacements is known as optical flow estimation [15, 4, 32, 10, 16, 30]. Our work is concerned with modeling and generating the optical flow, in addition to estimating it. The displacement can be caused by the motion of objects in the scene, or by a change of viewpoint relative to 3D objects in the scene. Therefore, it is natural to incorporate motion and 3D models into the geometric generator, where the change or variation of the displacement field depends on the motion and 3D information.
2.3 Inference and learning
To learn this deformable generator model, we introduce a learning and inference algorithm for the two latent vectors, without designing and learning extra inference networks. Our method is motivated by a maximum likelihood learning algorithm for generator networks [13]. Specifically, the proposed model can be trained by maximizing the log-likelihood on the training dataset {X_i, i = 1, ..., N},

L(θ) = (1/N) Σ_{i=1}^N log p(X_i; θ) = (1/N) Σ_{i=1}^N log ∫ p(X_i, Z_i^a, Z_i^g; θ) dZ_i^a dZ_i^g.   (5)

The uncertainties in inferring Z^a and Z^g are taken into account by the above observed-data log-likelihood.
We can evaluate the gradient of L(θ) according to the following well-known result, which is related to the EM algorithm:

∂/∂θ log p(X; θ) = E_{p(Z^a, Z^g | X; θ)} [ ∂/∂θ log p(X, Z^a, Z^g; θ) ].   (6)
Since the expectation in Eq.(6) is usually analytically intractable, we employ Langevin dynamics to draw samples from the posterior and compute the Monte Carlo average to obtain an approximation. For each observation X, the latent vectors Z^a and Z^g can be sampled from the posterior alternately by Langevin dynamics: fixing Z^g and sampling Z^a, then fixing Z^a and sampling Z^g. The latent vectors are inferred and updated as follows:

Z^a_{τ+1} = Z^a_τ + (s²/2) ∂/∂Z^a log p(Z^a_τ, Z^g, X; θ) + s U^a_τ,
Z^g_{τ+1} = Z^g_τ + (s²/2) ∂/∂Z^g log p(Z^a, Z^g_τ, X; θ) + s U^g_τ,   (7)

where τ indexes the steps of the Langevin sampling, U^a_τ and U^g_τ are standard Gaussian noise terms added to prevent the chain from becoming trapped in local modes, and s is the step size of the Langevin dynamics. The log of the joint density in Eq.(7) can be evaluated by

log p(Z^a, Z^g, X; θ) = −(1/(2σ²)) ‖X − F(Z^a, Z^g; θ)‖² − (1/2) ‖Z^a‖² − (1/2) ‖Z^g‖² + const,   (8)

where F and σ are defined in Eq.(1), and the omitted terms are constants that do not depend on Z^a, Z^g, or θ. It can be shown that, given sufficient transition steps, the Z^a and Z^g obtained from this procedure follow their joint posterior distribution.
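The Langevin update of Eq.(7) can be illustrated with a deliberately small example: a single latent vector, a toy linear map standing in for the deep generator, and no warping. The gradient of the log joint density follows directly from Eq.(8); all constants below are illustrative choices, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
D, d = 8, 2            # observation and latent dimensions
sigma, s = 0.5, 0.05   # noise std (Eq. 1) and Langevin step size (Eq. 7)

A = rng.standard_normal((D, d))          # toy linear "generator" F(Z) = A Z
z_true = rng.standard_normal(d)
X = A @ z_true + sigma * rng.standard_normal(D)

def grad_log_joint(z):
    # Gradient of Eq.(8): -||X - A z||^2 / (2 sigma^2) - ||z||^2 / 2 + const
    return A.T @ (X - A @ z) / sigma**2 - z

z = np.zeros(d)                          # cold start; training uses warm starts
for _ in range(2000):                    # Langevin updates, Eq.(7)
    z = z + 0.5 * s**2 * grad_log_joint(z) + s * rng.standard_normal(d)
```

After enough steps, z is approximately a sample from the posterior, which concentrates near the latent that generated X when the observation noise is small.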
Obtaining independent samples from the posterior density in each training iteration is infeasible due to the high computational cost of the MCMC updates. In this paper, the MCMC transitions of both Z^a and Z^g start from the updated latent vectors of the previous learning iteration. This persistent updating results in a chain that is long enough to sample from the posterior distribution, and the warm initialization vastly reduces the computational burden of the MCMC updates. The convergence of stochastic gradient descent based on persistent MCMC has been studied by [34].

For each training example X_i, we run the Langevin dynamics in Eq.(7) to obtain the corresponding posterior samples Z_i^a and Z_i^g. These samples are then used for the gradient computation in Eq.(6). More precisely, the parameter θ is learned through the Monte Carlo approximation

∂L(θ)/∂θ ≈ (1/N) Σ_{i=1}^N ∂/∂θ log p(Z_i^a, Z_i^g, X_i; θ) = (1/N) Σ_{i=1}^N (1/σ²) (X_i − F(Z_i^a, Z_i^g; θ)) ∂F(Z_i^a, Z_i^g; θ)/∂θ.   (9)

The whole algorithm iterates through two steps: (1) an inferential step, which infers the latent vectors through Langevin dynamics, and (2) a learning step, which updates the network parameters by stochastic gradient descent. Gradient computations in both steps are powered by back-propagation. Algorithm 1 describes the details of the learning and inference algorithm.
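The two-step loop can be sketched end to end on toy data. As before, a linear map stands in for the generator and a single latent per example stands in for the (Z^a, Z^g) pair; the persistent latents implement the warm-start strategy, and the step counts and learning rate are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, d = 64, 8, 2
sigma, s, lr = 0.5, 0.05, 0.005

# Synthetic dataset from a ground-truth toy linear generator.
A_true = rng.standard_normal((D, d))
X = rng.standard_normal((N, d)) @ A_true.T + sigma * rng.standard_normal((N, D))

A = 0.1 * rng.standard_normal((D, d))   # model parameters theta
Z = np.zeros((N, d))                    # persistent latent samples, one per example

def recon_err():
    return np.mean((X - Z @ A.T) ** 2)

err_start = recon_err()
for _ in range(200):
    # Inferential step: a few persistent Langevin updates per example (Eq. 7).
    for _ in range(5):
        grad = (X - Z @ A.T) @ A / sigma**2 - Z
        Z = Z + 0.5 * s**2 * grad + s * rng.standard_normal((N, d))
    # Learning step: Monte Carlo gradient ascent on theta (Eq. 9).
    A = A + lr * (X - Z @ A.T).T @ Z / (sigma**2 * N)
err_end = recon_err()
```

Only a handful of Langevin steps are run per learning iteration, which is exactly why the warm start from the previous iteration's latents matters: the chain is persistent across iterations rather than restarted.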
2.4 Deformable Variational Autoencoder
The proposed deformable generator scheme is general and agnostic to the underlying learning method. In fact, our method can also be trained as a VAE [18] to obtain a deformable variational autoencoder, by utilizing an extra inference network to infer (Z^a, Z^g) through reparameterization. Specifically, we learn an approximation q_φ(Z^a, Z^g | X) to the intractable posterior p(Z^a, Z^g | X; θ). The appearance and geometric latent vectors are assumed to be independent Gaussians in the approximate distribution, i.e., q_φ(Z^a, Z^g | X) = q_φ(Z^a | X) q_φ(Z^g | X), where the means and variances are modeled by an inference network with parameters φ. This deformable VAE is a natural extension of the proposed deformable generator framework. We show some preliminary results in Sec. 3.1.1. Notice that the proposed scheme can also be used with adversarial learning methods [11], by designing separate discriminator networks for shape and appearance. We leave this for future work. In this work, we focus on the current learning and inference algorithm for the sake of simplicity, so that we do not resort to extra networks.

3 Experiments
In this section, we first qualitatively demonstrate that our proposed deformable generator framework consistently disentangles the appearance and geometric information. We then analyze and evaluate the proposed model quantitatively. The deformable generator network structures and parameters are listed in the Appendix. When visualizing the effect of a component of a latent vector, we vary that component over a fixed interpolation range while holding the other components fixed.
3.1 Qualitative experiments
3.1.1 Experiments on CelebA
We first train the deformable generator on 10,000 images from the CelebA benchmark dataset [25]. Some examples from CelebA are shown in Figure 2; the images are aligned by OpenFace [2] and cropped.
To study the performance of the proposed method in disentangling appearance and geometric information, we investigate the effect of different combinations of the geometric latent vector Z^g and the appearance latent vector Z^a. (1) Set the geometric latent vector Z^g to zero, and each time vary one dimension of the appearance vector Z^a over a uniform grid of values, while holding the other dimensions of Z^a at zero. Some typical generated images are shown in Figure 3. (2) Set Z^a to a fixed value, and each time vary one dimension of the geometric latent vector Z^g over a uniform grid of values, while keeping the other dimensions of Z^g at zero. Some representative generated results are shown in Figure 4. The full images corresponding to each dimension of Z^a and Z^g are attached in the appendix.
As we can observe from Figure 3, (1) although the training faces from CelebA have different viewing angles, the appearance latent vector Z^a only encodes front-view information, and (2) each dimension of the appearance latent vector encodes appearance information such as color, illumination, and identity. For example, in the first row of Figure 3, from left to right, the color of the background varies from black to white, and the identity of the face changes from a woman to a man. In the second row of Figure 3, the moustache of the man becomes thicker as the value of the corresponding dimension of Z^a decreases, and the hair of the woman becomes denser as the value of the corresponding dimension of Z^a increases. In the third row, from left to right, the skin color varies from dark to light, and in the fourth row, from left to right, the illumination changes from the left side of the face to the right side.
From Figure 4, we have the following interesting observations. (1) The geometric latent vector Z^g does not encode any appearance information: the color, illumination, and identity are the same across these generated images. (2) Each dimension of the geometric latent vector encodes fundamental geometric information such as shape and viewing angle. For example, in the first row of Figure 4, the shape of the face changes from fat to thin from left to right; in the second row, the pose of the face varies from left to right; in the third row, from left to right, the tilt of the face varies from downward to upward; and in the fourth row, the expression changes from stretched to cramped.
The appearance and geometric information can also be effectively disentangled by the introduced deformable VAE. For the extra inference (encoder) network, we use the mirror structure of our generator model, with convolutional layers in place of transposed convolutional layers. The generator network structure and the other parameters are kept the same as in the model learned by alternating back-propagation. Figures 5 and 6 show interpolation results following the same protocol described above.
From the results in Figures 3 and 4, we find that the appearance and geometric information of face images have been disentangled effectively. Therefore, we can apply the geometric warping (e.g. operations in Figure 4) learned by the geometric generator to all the canonical faces (e.g. generated faces in Figure 3) learned by the appearance generator. Figure 7 demonstrates the effect of applying geometric warping to the generated canonical faces in Figure 3. Comparing Figure 3 with Figure 7, we find that the rotation and shape warping operations do not modify the identity information of the canonical faces, which corroborates the disentangling power of the proposed deformable generator model.
Furthermore, we evaluate the disentangling ability of the proposed model by transferring and recombining geometric and appearance vectors from different faces. Specifically, we first feed 7 unseen images from CelebA into our deformable generator model to infer their appearance vectors Z^a_1, ..., Z^a_7 and geometric vectors Z^g_1, ..., Z^g_7, using the Langevin dynamics (with 300 steps) in Eq.(7). Then, we transfer and recombine the appearance and geometric vectors to generate six new face images, as shown in the second row of Figure 8, and recombine them in the complementary way to generate another six new faces, as shown in the third row of Figure 8. From the 2nd to the 7th column, the images in the second row share the same appearance vector, but the geometric latent vectors are swapped between each image pair. As we can observe from the second row of Figure 8, (1) the geometric information of the original images is swapped in the synthesized images, and (2) the inferred Z^g captures the view information of the unseen images. The images in the third row of Figure 8 share the same geometric vector, but the appearance vectors are swapped between each image pair. From the third row of Figure 8, we observe that (1) the appearance information is exchanged, and (2) the inferred Z^a captures the color, illumination, and coarse appearance information but loses the more nuanced identity information. Only a finite set of features is learned from 10k CelebA images, and the model may not contain the features necessary to closely model an unseen face.
The learned geometric information can be directly applied to the faces of animals such as cats and monkeys, as shown in Figure 9. The monkey and cat faces rotate from left to right when the rotation warping learned from human faces is applied. The shape of both the monkey and cat faces changes from fat to thin when the shape warping learned by the geometric generators is used.
3.1.2 Experiments on expression dataset
We next study the performance of the proposed deformable generator model on the facial expression dataset CK+ [26]. Following the same experimental protocol as in the last subsection, we investigate the change produced by each dimension of the appearance latent vector Z^a (after setting the geometric latent vector Z^g to zero) and of the geometric latent vector Z^g (after setting the appearance latent vector Z^a to a fixed value). The disentangled results are shown in Figure 10. The training faces from CK+ have expression labels, but we do not use any such labels in our unsupervised learning method. Although the dataset contains faces with different expressions, the learned appearance latent vector usually encodes a neutral expression. The geometric latent vector controls the major variation in expression, but does not change the identity information.
To test whether appearance and geometric information are disentangled in the proposed model, we transfer the learned expressions from CK+ to another face dataset, MultiPie [12], by fine-tuning the appearance generator on the target face dataset while fixing the parameters of the geometric generator. Figure 10(c) shows the result of transferring the expressions of Figure 10(b) onto the faces of MultiPie. The expressions from the gray-scale faces of CK+ have been transferred onto the color faces of MultiPie.
3.1.3 Experiment on CIFAR10
We further test our model on the CIFAR-10 [21] dataset, which includes various object categories and has 50,000 training examples. We randomly sample Z^a from its prior distribution. For Z^g, we interpolate one dimension over a uniform grid of values and fix the other dimensions to zero. Figure 11 shows interpolated examples generated by the model learned from the car category. For each row, we use a different Z^a and interpolate the same dimension of Z^g. The results show that each dimension of Z^g controls a specific geometric transformation.
3.2 Quantitative experiments
3.2.1 Covariance between the latent vectors and geometric variation
To quantitatively study the covariance between each dimension of the latent vectors (Z^a, Z^g) and input images with geometric variation, we select images with ground-truth labels recording geometric attributes, specifically the multi-view face images from the MultiPie dataset [12]. For each of the 5 viewing angles, we feed 100 images into the learned model to infer their geometric latent vector Z^g and appearance latent vector Z^a. Under each view, we compute the means of the inferred latent vectors. For each dimension i of Z^g, we construct a 5-dimensional vector z̄^g_i containing its mean response at each of the 5 views; similarly, we construct a 5-dimensional vector z̄^a_i for each dimension i of Z^a. We normalize the viewing-angle vector v to have unit norm. Finally, we compute the covariance between each dimension of the latent vectors (Z^a, Z^g) and the view variation as follows:
r_i = |Cov(z̄_i, v)| = (1/5) |(z̄_i − mean(z̄_i))ᵀ (v − mean(v))|,   (10)

where z̄_i denotes the 5-dimensional mean-response vector of the i-th dimension of the latent vector Z^g or Z^a, and |·| denotes the absolute value. We summarize the covariance responses r^g and r^a of the geometric and appearance latent vectors in Figure 12. As we can observe in Figure 12, r^g tends to be much larger than r^a.
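The covariance response of Eq.(10) can be sketched directly in numpy. The five viewing angles below are illustrative placeholders (the exact values are not assumed here); the metric only requires a normalized view vector and the per-view mean latent responses:

```python
import numpy as np

# Normalized viewing-angle vector v for the 5 views (angles illustrative).
angles = np.array([-30.0, -15.0, 0.0, 15.0, 30.0])
v = angles / np.linalg.norm(angles)

def view_covariance(mean_responses):
    """mean_responses: (5, d) array; row k holds the mean inferred latent
    under view k. Returns the per-dimension covariance response of Eq.(10)."""
    zc = mean_responses - mean_responses.mean(axis=0)   # center over views
    vc = v - v.mean()                                   # center the view vector
    return np.abs(zc.T @ vc) / len(v)

# A latent dimension that tracks the viewing angle scores high;
# a constant dimension scores zero.
responses = np.stack([v, np.ones(5)], axis=1)
r = view_covariance(responses)
```

A large r_i therefore flags a latent dimension whose mean response moves monotonically with the viewing angle, which is how the view-encoding dimensions of Z^g are identified in Figure 12.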
Moreover, for the two largest r^g and the largest r^a, we plot the covariance relationship between the corresponding latent dimension and the viewing-angle vector v in Figure 13.

As we can observe from the left and center subfigures of Figure 13, the dimensions of Z^g corresponding to the two largest r^g have very strong negative and positive covariance, respectively, with the change in viewing angle. However, as shown in the right subfigure, the dimension of Z^a corresponding to the largest r^a does not have strong covariance with the change in viewing angle. We wish to point out that we should not expect Z^a to encode the identity exclusively and Z^g to encode the view exclusively, because different persons may have shape changes, and different views may have lighting or color changes.

Furthermore, we generate face images by varying the dimensions of Z^g corresponding to the two largest covariance responses over a uniform grid of values, while holding the other dimensions of Z^g at zero, as in Section 3.1.1. Similarly, we generate face images by varying the dimension of Z^a corresponding to the largest covariance response over the same grid, while holding the other dimensions of Z^a at zero. The generated images are shown in Figure 13(b). We can make several important observations. (1) The variation in viewing angle in the first two rows is very obvious, and the magnitude of the change in view in the first row is larger than that in the second row. This is consistent with the ordering of the covariance responses and with the observation that the slope in the left subfigure of Figure 13(a) is steeper than that of the center subfigure. (2) In the first row, the faces rotate from right to left, and the covariance relationship in the left subfigure of Figure 13(a) is nearly perfectly negative. In the second row, the faces rotate from left to right, and the covariance relationship in the center subfigure of Figure 13(a) is nearly perfectly positive. (3) It is difficult to find obvious variation in viewing angle in the third row. Therefore, these generated images further verify that the geometric generator of the proposed model mainly captures geometric variation, while the appearance generator is not sensitive to geometric variation.
3.2.2 Reconstruction error on unseen multiview faces
Since the proposed deformable generator model can disentangle the appearance and geometric information of an image, we can transfer the geometric warping operations learned from one dataset to another. Specifically, given 1,000 front-view faces from the MultiPie dataset [12], we fine-tune the appearance generator's parameters while fixing the geometric generator's parameters, which are learned from the CelebA dataset. We can then reconstruct unseen images with various viewpoints. To quantitatively evaluate the geometric knowledge transfer ability of our model, we compute the reconstruction error on 5,000 unseen images from MultiPie spanning the 5 views, with 1,000 faces per view. We compare the proposed model with state-of-the-art generative models, namely VAE [18, 5] and ABP [13]. For a fair comparison, we first train the VAE and ABP models on the same CelebA training set of 10,000 faces, and then fine-tune them on the 1,000 front-view faces from the MultiPie dataset. The mean squared reconstruction error per image for each method is shown in Table 1. As we can observe from Table 1, the proposed method obtains the lowest reconstruction error. Our model benefits from the transferred geometric knowledge learned from the CelebA dataset, while neither the VAE nor the ABP model can efficiently learn or transfer purely geometric information.
3.3 Balancing explainingaway competition
The proposed deformable generator model utilizes two generator networks to disentangle the appearance and geometric information from an image. Since the geometric generator only produces displacement for each pixel without modifying the pixel’s value, the color and illumination information and the geometric information are naturally disentangled by the proposed model’s specific structure.
In order to properly disentangle the identity (or category) and the view (or geometry) information, the learning capacity between the appearance generator and geometric generator should be balanced. The appearance generator and the geometric generator cooperate with each other to generate the images. Meanwhile, they also compete against each other to explain away the training images. If the learning of the appearance generator outpaces that of the geometric generator, the appearance generator will encode most of the knowledge (including the view and shape information), while the geometric generator will only learn minor warping operations. On the other hand, if the geometric generator learns much faster than the appearance generator, the geometric generator will encode most of the knowledge (including the identity or category information, which should be encoded by the appearance network).
To control the trade-off between the two generators, we introduce a balance parameter β, defined as the ratio of the number of filters per layer between the appearance and geometric generators. The balance parameter should not be too large or too small; we set β to 0.625 in our experiments.
4 Conclusion
In this study, we propose a deformable generator model which aims to disentangle the appearance and geometric information of an image into two independent latent vectors, Z^a and Z^g. The learned geometric generator can be transferred to other datasets, or can be used for data augmentation to produce more variations in the training data for better generalization.
In addition to the learning and inference algorithm adopted in this paper, the model can also be trained by VAE and GAN, as well as their generalizations such as β-VAE and InfoGAN, which target disentanglement in general.
The model can be generalized to a dynamic one by adding transition models for the latent vectors. While the transition model for the appearance vector may generate dynamic textures of non-trackable motion, the transition model for the geometric vector may generate intuitive physics of trackable motion. The geometric generator can also be generalized to incorporate 3D information of rigid or non-rigid 3D objects.
In our work, the warping function based on coordinate displacements is hand-designed. A refinement of this warping function, in the form of a residual added to it, may be learned from data. However, we tend to believe that the warping function itself, or more importantly the notion of coordinate displacements, may have to be a fundamentally innate part of a model for vision that may not be learned from data.
Acknowledgment
This work was supported by DARPA SIMPLEX N6600115C4035, ONR MURI N000141612007, DARPA ARO W911NF1610579, DARPA N660011724029, Natural Science Foundation of China No. 61703119, Natural Science Fund of Heilongjiang Province of China No. QC2017070 and Fundamental Research Funds for the Central Universities of China No. HEUCFM180405. We thank Mitchell K. Hill for his assistance with writing.
References

[1] A. Achille and S. Soatto. Emergence of invariance and disentanglement in deep representations. In Proc. International Conference on Machine Learning (ICML), Sydney, 2017.
[2] B. Amos, B. Ludwiczuk, and M. Satyanarayanan. OpenFace: A general-purpose face recognition library with mobile applications. Technical report, CMU-CS-16-118, CMU School of Computer Science, 2016.
[3] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
[4] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High accuracy optical flow estimation based on a theory for warping. In European Conference on Computer Vision, pages 25–36. Springer, 2004.
[5] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Desjardins, and A. Lerchner. Understanding disentangling in beta-VAE. arXiv preprint arXiv:1804.03599, 2018.
[6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.
[7] T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):681–685, 2001.
[8] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 764–773, 2017.
[9] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, pages 1486–1494, 2015.
[10] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015.
[11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[12] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 28(5):807–813, May 2010.
[13] T. Han, Y. Lu, S.-C. Zhu, and Y. N. Wu. Alternating back-propagation for generator network. In AAAI, pages 1976–1984, 2017.
[14] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
[15] B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17(1-3):185–203, 1981.
[16] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[17] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.
[18] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
[19] J. Kossaifi, L. Tran, Y. Panagakis, and M. Pantic. GAGAN: Geometry-aware generative adversarial networks. arXiv preprint arXiv:1712.00684, 2017.
[20] J. Kossaifi, G. Tzimiropoulos, and M. Pantic. Fast and exact Newton and bidirectional fitting of active appearance models. IEEE Transactions on Image Processing, 26(2):1040–1053, 2017.
[21] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[22] A. Kumar, P. Sattigeri, and A. Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848, 2017.
[23] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
[24] Z. Li, Y. Tang, and Y. He. Unsupervised disentangled representation learning with analogical relations. arXiv preprint arXiv:1804.09502, 2018.
[25] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.
[26] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. The extended Cohn-Kanade dataset (CK+): A complete dataset for action unit and emotion-specified expression. In IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 94–101. IEEE, 2010.
[27] M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun. Disentangling factors of variation in deep representation using adversarial training. In Advances in Neural Information Processing Systems, pages 5040–5048, 2016.
[28] A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In ICML, 2014.
[29] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
[30] A. Ranjan and M. J. Black. Optical flow estimation using a spatial pyramid network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[31] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278–1286, 2014.
[32] D. Sun, S. Roth, and M. J. Black. A quantitative analysis of current practices in optical flow estimation and the principles behind them. International Journal of Computer Vision, 106(2):115–137, 2014.
[33] L. Tran, X. Yin, and X. Liu. Disentangled representation learning GAN for pose-invariant face recognition. In CVPR, 2017.
[34] L. Younes. On the convergence of Markovian stochastic algorithms with rapidly decreasing ergodicity rates. Stochastics: An International Journal of Probability and Stochastic Processes, 65(3-4):177–228, 1999.
Appendix A Deformable Generator Network Structure and Parameters
Layers  In-Out Shape  Conv Kernel Size  Stride

Null  Null
Fc1  Null
Deconv1
Deconv2
Deconv3
Out (Deconv4)

Layers  In-Out Shape  Conv Kernel Size  Stride

Null  Null
Fc1  Null
Deconv1
Deconv2
Deconv3
Out (Deconv4)