1 Introduction
Numerous graphics algorithms have been established to synthesize photorealistic images from 3D models and environmental variables (lighting and viewpoints), a process commonly known as rendering. At the same time, recent advances in vision algorithms enable computers to gain some form of understanding of objects contained in images, such as classification [17], detection [11], segmentation [19], and caption generation [28], to name a few. These approaches typically aim to deduce abstract representations from raw image pixels. However, it has been a long-standing problem for both graphics and vision to automatically synthesize novel images by applying intrinsic transformations (e.g., 3D rotation and deformation) to the subject of an input image. From an artificial intelligence perspective, this can be viewed as answering questions about object appearance when the view angle or illumination is changed, or some action is taken. These synthesized images may then be perceived by humans in photo editing [15], or evaluated by other machine vision systems, such as a game-playing agent with vision-based reinforcement learning [21, 22].
In this paper, we consider the problem of predicting, from a single image, the transformed appearances of an object when it is rotated in 3D. In general, this is an ill-posed problem due to the loss of information inherent in projecting a 3D object into the image space. Classic geometry-based approaches either recover a 3D object model from multiple related images, i.e., multi-view stereo and structure-from-motion, or register a single image of a known object category to its prior 3D model, e.g., faces [5]. The resulting mesh can be used to re-render the scene from novel viewpoints. However, because they rely on 3D meshes as intermediate representations, these methods are 1) limited to particular object categories, 2) vulnerable to image alignment mistakes, and 3) prone to generating artifacts during unseen texture synthesis. To overcome these limitations, we propose a learning-based approach without explicit 3D model recovery. Having observed rotations of similar 3D objects (e.g., faces, chairs, household objects), the trained model can both 1) better infer the true pose, shape and texture of the object, and 2) make plausible assumptions about potentially ambiguous aspects of appearance in novel viewpoints. Thus, the learning algorithm relies on mappings between the Euclidean image space and the underlying nonlinear manifold. In particular, 3D view synthesis can be cast as pose manifold traversal, where a desired rotation can be decomposed into a sequence of small steps. A major challenge arises due to the long-term dependency among multiple rotation steps; the key identifying information (e.g., shape, texture) from the original input must be remembered along the entire trajectory. Furthermore, the local rotation at each step must generate the correct result on the data manifold, or subsequent steps will also fail.
Closely related to the image generation task considered in this paper is the problem of 3D invariant recognition, which involves comparing object images from different viewpoints or poses with dramatic changes of appearance. In their mental rotation experiments [24], Shepard and Metzler found that the time taken for humans to match 3D objects from two different views increased proportionally with the angular rotational difference between them. It was as if the humans were rotating their mental images at a steady rate. Inspired by this mental rotation phenomenon, we propose a recurrent convolutional encoder-decoder network with action units to model the process of pose manifold traversal. The network consists of four components: a deep convolutional encoder [17], shared identity units, recurrent pose units with rotation action inputs, and a deep convolutional decoder [8]. Rather than training the network to model a specific rotation sequence, we provide control signals at each time step instructing the model how to move locally along the pose manifold. The rotation sequences can be of varying length. To improve the ease of training, we employ curriculum learning, similar to that used in other sequence prediction problems [29]. Intuitively, the model should learn how to make a one-step rotation before learning how to make a series of such rotations.
The main contributions of this work are summarized as follows. First, a novel recurrent convolutional encoder-decoder network is developed for learning to apply out-of-plane rotations to human faces and 3D chair models. Second, the learned model can generate realistic rotation trajectories with a control signal supplied at each step by the user. Third, despite only being trained to synthesize images, our model learns discriminative view-invariant features without using class labels. This weakly-supervised disentangling is especially notable with longer-term prediction.
2 Related Work
The transforming autoencoder [13] introduces the notion of capsules in deep networks, which track both the presence and position of visual features in the input image. These models can apply affine transformations and 3D rotations to images. We address a similar task of rendering object appearance undergoing 3D rotations, but we use a convolutional network architecture in lieu of capsules, and incorporate action inputs and recurrent structure to handle repeated rotation steps. The Predictive Gating Pyramid [20] was developed for time-series prediction and can learn image transformations including shifts and rotation over multiple time steps. Our task is related to this time-series prediction, but our formulation includes a control signal, uses disentangled latent features, and uses convolutional encoder and decoder networks to model detailed images. Ding and Taylor [7] proposed a gating network to directly model mental rotation by optimizing transforming distance. Instead of extracting invariant recognition features in one shot, their model learns to perform recognition by exploring a space of relevant transformations. Similarly, our model can explore the space of rotations of an object image by setting the control signal at each time step of our recurrent network.
The problem of training neural networks that generate images is studied in [27]. Dosovitskiy et al. [8] proposed a convolutional network mapping shape, pose and transformation labels to images for generating chairs. It is able to control these factors of variation and generate high-quality renderings. We also generate chair renderings in this paper, but our model adds several additional features: a deep encoder network (so that we can generalize to novel images, rather than only decode), distributed representations for appearance and pose, and recurrent structure for long-term prediction.
Contemporary to our work, the Inverse Graphics Network (IGN) [18] also adds an encoding function to learn graphics codes of images, along with a decoder similar to that in the chair-generating network. As in our model, IGN uses a deep convolutional encoder to extract image representations, applies modifications to these, and then re-renders. Our model differs in that 1) we train a recurrent network to perform trajectories of multiple transformations, 2) we add a control signal input at each step, and 3) we use deterministic feedforward training rather than the variational autoencoder (VAE) framework [16] (although our approach could be extended to a VAE version).
A related line of work to ours is disentangling the latent factors of variation that generate natural images. Bilinear models for separating style and content are developed in [26], and are shown to be capable of separating handwriting style and character identity, and also separating face identity and pose. The disentangling Boltzmann Machine (disBM) [23] applies this idea to augment the Restricted Boltzmann Machine by partitioning its hidden state into distinct factors of variation and modeling their higher-order interaction. The multi-view perceptron [31] employs a stochastic feedforward network to disentangle the identity and pose factors of face images in order to achieve view-invariant recognition. The encoder network for IGN is also trained to learn a disentangled representation of images by extracting a graphics code for each factor. In [6], the (potentially unknown) latent factors of variation are both discovered and disentangled using a novel hidden unit regularizer. Our work is also loosely related to the “DeepStereo” algorithm [10] that synthesizes novel views of scenes from multiple images using deep convolutional networks.
3 Recurrent Convolutional Encoder-Decoder Network
In this section we describe our model formulation. Given an image of a 3D object, our goal is to synthesize its rotated views. Inspired by the recent success of convolutional networks (CNNs) in mapping images to high-level abstract representations [17] and synthesizing images from graphics codes [8], we base our model on deep convolutional encoder-decoder networks. One example network structure is shown in Figure 1. The encoder network uses convolution-relu layers with stride 2 and 2-pixel padding so that the dimension is halved at each convolution layer, followed by two fully-connected layers. In the bottleneck layer, we define a group of units to represent the pose (pose units), to which the desired transformations can be applied. The other group of units represents what does not change during transformations, named identity units. The decoder network is symmetric to the encoder. To increase dimensionality we use fixed upsampling as in [8]. We found that fixed stride-2 convolution and upsampling worked better than max-pooling and unpooling with switches, because when applying transformations the encoder pooling switches would not in general match the switches produced by the target image. The desired transformations are reflected by the action units. We used a 1-of-3 encoding, in which [1 0 0] encoded a clockwise rotation, [0 1 0] encoded a no-op, and [0 0 1] encoded a counter-clockwise rotation. The triangle in Figure 1 indicates a tensor product taking as input the pose units and action units, and producing the transformed pose units. Equivalently, the action unit selects the matrix that transforms the input pose units to the output pose units.
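To make the tensor-product view concrete, the following minimal Python sketch (not the implementation used in our experiments) shows how a one-hot action can select one of three matrices that is then applied to the pose units. The 2-D pose vector, the matrices in `TENSOR`, the function names, and the slot ordering (clockwise, no-op, counter-clockwise) are all assumptions made for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def transform_pose(pose, action, tensor):
    """Contract a 3-way tensor with the action vector and the pose units.

    tensor[k] is the matrix associated with action slot k, so a one-hot
    action simply selects one matrix and applies it to the pose units.
    """
    out = [0.0] * len(pose)
    for weight, matrix in zip(action, tensor):
        if weight == 0:
            continue  # inactive action slots contribute nothing
        contribution = matvec(matrix, pose)
        out = [o + weight * c for o, c in zip(out, contribution)]
    return out


# Hypothetical toy matrices acting on 2-D pose units (illustrative only).
CLOCKWISE = [[0.0, 1.0], [-1.0, 0.0]]
NO_OP = [[1.0, 0.0], [0.0, 1.0]]
COUNTER_CLOCKWISE = [[0.0, -1.0], [1.0, 0.0]]
TENSOR = [CLOCKWISE, NO_OP, COUNTER_CLOCKWISE]

pose = [1.0, 0.0]
rotated = transform_pose(pose, [1, 0, 0], TENSOR)      # clockwise step
unchanged = transform_pose(pose, [0, 1, 0], TENSOR)    # no-op
restored = transform_pose(rotated, [0, 0, 1], TENSOR)  # undo the step
```

In this toy setting a clockwise step followed by a counter-clockwise step returns the original pose vector; a learned tensor is not guaranteed to have this property.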
The action units introduce a small linear increment to the pose units, which essentially models the local transformations in the nonlinear pose manifold. However, in order to achieve longer rotation trajectories, if we simply accumulate the linear increments from the action units (e.g., [2 0 0] for a two-step clockwise rotation), the pose units will fall off the manifold, resulting in bad predictions. To overcome this problem, we generalize the model to a recurrent neural network, a class of models that has been shown to capture long-term dependencies for a wide variety of sequence modeling problems. In essence, we use recurrent pose units to model the step-by-step pose manifold traversals. The identity units are shared across all time steps since we assume that all training sequences preserve the identity while only changing the pose. Figure 2 shows the unrolled version of our RNN model.
We only perform encoding at the first time step, and all transformations are carried out in the latent space; i.e., the model predictions at time step $t$ are not fed into the next time step input. The training objective is based on pixel-wise prediction over all time steps for training sequences:
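$$\min \sum_{i=1}^{N} \sum_{t=1}^{T} \left\| y^{(i,t)} - g\left( f_{\mathrm{id}}(x^{(i)}),\; f_{\mathrm{pose}}(x^{(i)}, a^{(1)}, \ldots, a^{(t)}) \right) \right\|_2^2 \qquad (1)$$
where $a^{(1)}, \ldots, a^{(T)}$ is the sequence of $T$ actions, $f_{\mathrm{id}}$ produces the identity features invariant to all the time steps, $f_{\mathrm{pose}}$ produces the transformed pose features at time step $t$, $g(\cdot, \cdot)$ is the image decoder producing an image given the output of $f_{\mathrm{id}}$ and $f_{\mathrm{pose}}$, $x^{(i)}$ is the $i$-th image, and $y^{(i,t)}$ is the $i$-th training image target at step $t$.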
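The objective can be read as a sum of pixel-wise squared errors accumulated over every step of a training sequence. The Python sketch below illustrates this for a single sequence; `predict_step`, `toy_rotate`, and `identity_model` are hypothetical stand-ins for the encoder, recurrent pose units and decoder, and images are represented as flat lists of pixel values purely for illustration.

```python
def squared_error(predicted, target):
    """Pixel-wise squared error between two flat images."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))


def sequence_loss(predict_step, image, actions, targets):
    """Sum squared errors over all time steps of one training sequence.

    predict_step(image, actions_so_far) returns the predicted image after
    applying the first t actions to the single input image; predictions are
    never fed back as inputs.
    """
    total = 0.0
    for t in range(1, len(actions) + 1):
        prediction = predict_step(image, actions[:t])
        total += squared_error(prediction, targets[t - 1])
    return total


def toy_rotate(image, actions_so_far):
    # Hypothetical "perfect" model: each clockwise action shifts pixels by one.
    shift = sum(a[0] - a[2] for a in actions_so_far) % len(image)
    return image[shift:] + image[:shift]


def identity_model(image, actions_so_far):
    # Hypothetical baseline that ignores the actions entirely.
    return list(image)


image = [0.0, 1.0, 2.0, 3.0]
actions = [[1, 0, 0], [1, 0, 0]]  # two clockwise steps
targets = [[1.0, 2.0, 3.0, 0.0], [2.0, 3.0, 0.0, 1.0]]

perfect_loss = sequence_loss(toy_rotate, image, actions, targets)
baseline_loss = sequence_loss(identity_model, image, actions, targets)
```

The full objective would additionally sum this quantity over all training sequences.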
3.1 Curriculum Training
We trained the network parameters using backpropagation through time and the ADAM optimization method [3]. To effectively train our recurrent network, we found it beneficial to use curriculum learning [4], in which we gradually increase the difficulty of training by increasing the trajectory length. This appears to be useful for sequence prediction with recurrent networks in other domains as well [22, 29]. In Section 4, we show that increasing the training sequence length improves both the model’s image prediction performance and the pose-invariant recognition performance of identity features.
Also, longer training sequences force the identity units to better disentangle themselves from the pose. If the same identity units need to be used to predict images rotated by two different angles during training, these units cannot pick up pose-related information. In this way, our model can learn disentangled features (i.e., identity units can do invariant identity recognition but are not informative of pose, and vice versa) without explicitly regularizing to achieve this effect. We did not find it necessary to use gradient clipping.
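The curriculum can be summarized as a loop over increasing sequence lengths in which each stage is warm-started from the previous one. The sketch below is a schematic illustration only: `train_stage` is a hypothetical callback standing in for real training with backpropagation through time, and the toy version merely records which sequence lengths the parameters have seen.

```python
def run_curriculum(schedule, train_stage, initial_params):
    """Train one stage per sequence length, warm-starting from the previous stage."""
    params = initial_params
    snapshots = {}
    for length in schedule:
        # Each stage is initialized with the parameters of the previous stage.
        params = train_stage(params, length)
        snapshots["RNN%d" % length] = params
    return snapshots


def toy_train_stage(params, length):
    # Placeholder for real training: it only records the sequence lengths
    # that these parameters have been trained on so far.
    return {"seen_lengths": params["seen_lengths"] + [length]}


chair_schedule = [1, 2, 4, 8, 16]  # schedule used for the chair models
snapshots = run_curriculum(chair_schedule, toy_train_stage, {"seen_lengths": []})
```

Keeping a snapshot per stage mirrors how the intermediate models (RNN1, RNN2, and so on) are compared in the experiments.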
4 Experiments
We carry out experiments to achieve the following objectives. First, we examine the ability of our model to synthesize high-quality images of both faces and complex 3D objects (chairs) over a wide range of rotational angles. Second, we evaluate the discriminative performance of disentangled identity units through cross-view object recognition. Third, we demonstrate the ability to generate and rotate novel object classes by interpolating identity units of query objects.
4.1 Datasets
Multi-PIE.
The Multi-PIE [12] dataset consists of 754,204 face images from 337 people. The images are captured from 15 viewpoints under 20 illumination conditions in different sessions. To evaluate our model for rotating faces, we select a subset of Multi-PIE that covers 7 viewpoints evenly from to under neutral illumination. Each face image is aligned through manually annotated landmarks on eyes, nose and mouth corners, and then cropped to pixels. We use the images of the first 200 people for training and the remaining 137 people for testing.
Chairs.
This dataset contains 1393 chair CAD models made publicly available by Aubry et al. [2]. Each chair model is rendered from 31 azimuth angles (with steps of or ) and 2 elevation angles ( and ) at a fixed distance to the virtual camera. We use a subset of 809 chair models in our experiments, which were selected out of the 1393 by Dosovitskiy et al. [8] in order to remove near-duplicate models (e.g., models differing only in color) or low-quality models. We crop the rendered images to have a small border and resize them to a common size of pixels. We also prepare their binary masks by subtracting the white background. We use the images of the first 500 models as the training set and the remaining 309 models as the test set.
4.2 Network Architectures and Training Details
Multi-PIE.
The encoder network for the Multi-PIE dataset used two convolution-relu layers with stride 2 and 2-pixel padding, followed by one fully-connected layer: . The numbers of identity and pose units are and , respectively. The decoder network is symmetric to the encoder. The curriculum training procedure starts with the single-step rotation model, which we call RNN1.
We prepare the training samples by pairing face images of the same person captured in the same session with adjacent camera viewpoints. For example, at is mapped to at with action ; at is mapped to at with action ; and at is mapped to at with action . For face images with ending viewpoints and , only one-way rotation is feasible. We train the network using the ADAM optimizer with fixed learning rate for epochs. (We carry out experiments using Caffe [14] on Nvidia K40c and Titan X GPUs.)
Since there are 7 viewpoints per person per session, we schedule the curriculum training in stages with sequence lengths of 2, 4 and 6, which we call RNN2, RNN4 and RNN6, respectively. To sample training sequences with fixed length, we allow both clockwise and counter-clockwise rotations. For example, when , one input image at is mapped to with corresponding angles and action inputs . In each stage, we initialize the network parameters with the previous stage and fine-tune the network with fixed learning rate for additional epochs.
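One way to sample fixed-length training sequences over a bounded set of viewpoints is sketched below. This is an illustrative assumption rather than the exact sampling procedure used in our experiments: viewpoints are indexed 0 to 6, a clockwise step is taken to increase the index, only rotation actions (no no-ops) are sampled, and all names are hypothetical.

```python
import random

CLOCKWISE = (1, 0, 0)
COUNTER_CLOCKWISE = (0, 0, 1)


def sample_sequence(num_views, length, rng):
    """Sample a start view and `length` rotation actions that stay in range.

    Returns the start view index, the list of one-hot actions, and the list
    of view indices visited after each action. Requires num_views >= 2.
    """
    start = rng.randrange(num_views)
    view = start
    actions, views = [], []
    for _ in range(length):
        moves = []
        if view + 1 < num_views:
            moves.append((CLOCKWISE, view + 1))
        if view - 1 >= 0:
            moves.append((COUNTER_CLOCKWISE, view - 1))
        # At an end viewpoint only one direction is feasible.
        action, view = rng.choice(moves)
        actions.append(action)
        views.append(view)
    return start, actions, views


rng = random.Random(0)
start, actions, views = sample_sequence(num_views=7, length=6, rng=rng)
```

Because both directions are allowed, a length-6 sequence can be sampled from any starting viewpoint, not only from the two end viewpoints.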
Chairs.
The encoder network for chairs used three convolution-relu layers with stride 2 and 2-pixel padding, followed by two fully-connected layers: . The decoder network is symmetric, except that after the fully-connected layers it branches into image and mask prediction layers. The mask prediction indicates whether a pixel belongs to foreground or background. We adopted this idea from the generative CNN [8] and found it beneficial to training efficiency and image synthesis quality. A tradeoff parameter is applied to the mask prediction loss. We train the single-step network parameters with fixed learning rate for epochs. We schedule the curriculum training with sequence lengths of 2, 4, 8 and 16, which we call RNN2, RNN4, RNN8 and RNN16. Note that the curriculum training stops at 16 steps because we reached the limit of GPU memory. Since the images of each chair model are rendered from 31 viewpoints evenly sampled between and , we can easily prepare training sequences of clockwise or counter-clockwise step rotations around the circle. Similarly, the network parameters of the current stage are initialized with those of the previous stage and fine-tuned with the learning rate for epochs.


[Figure: rows labeled Input, RNN, and 3D model.]
4.3 3D View Synthesis of Novel Objects
We first examine the re-rendering quality of our RNN models for novel object instances that were not seen during training. On the Multi-PIE dataset, given one input image from the test set with possible views between to , the encoder produces identity units and pose units, and then the decoder renders images progressively with fixed identity units and action-driven recurrent pose units up to steps. Examples are shown in Figure 3 of the longest rotations, i.e., clockwise from to and counter-clockwise from to with RNN6. High-quality renderings are generated with smooth transformations between adjacent views. The characteristics of faces, such as gender, expression, eyes, nose and glasses, are also preserved during rotation. We also compare our RNN model with a state-of-the-art 3D morphable model for face pose normalization [30] in Figure 4. It can be observed that our RNN model produces stable renderings while the 3D morphable model is sensitive to facial landmark localization. One advantage of the 3D morphable model is that it preserves facial textures well.
On the chair dataset, we use RNN16 to synthesize 16 rotated views of novel chairs in the test set. Given a chair image of a certain view, we define two action sequences: one for progressive clockwise rotation and another for counter-clockwise rotation. It is a more challenging task compared to rotating faces due to the complex 3D shapes of chairs and the large rotation angles (more than after 16-step rotations). Since no previous methods tackle the exact same chair re-rendering problem, we use a k-nearest-neighbor (KNN) method for baseline comparisons. The KNN baseline is implemented as follows. We first extract the CNN features “fc7” from the VGG-16 net [25] for all the chair images. For each test chair image, we find its K nearest neighbors in the training set by comparing their “fc7” features. The retrieved top-K images are expected to be similar to the query in terms of both style and pose [1]. Given a desired rotation angle, we synthesize rotated views of the test image by averaging the corresponding rotated views of the retrieved top-K images in the training set at the pixel level. We tune the K value in [1,3,5,7], namely KNN1, KNN3, KNN5 and KNN7, to achieve the best performance. Two examples are shown in Figure 5. In our RNN model, the 3D shapes are well preserved with clear boundaries for all the 16 rotated views from different inputs, and the appearance changes smoothly between adjacent views with a consistent style.
[Figure: alternating rows labeled RNN and KNN; columns labeled Input, t=1 through t=16.]
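The KNN baseline described above can be sketched in a few lines of Python. The sketch is a simplified illustration, not our evaluation code: features and images are short flat lists, the squared Euclidean distance is an assumed choice for comparing features (the text only says they are compared), and all function and variable names are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_rotated_view(query_feature, train_features, train_rotated_views, k):
    """Average, pixel by pixel, the rotated views of the K nearest training images.

    train_rotated_views[i] is the view of training image i rotated by the
    desired angle, stored as a flat list of pixel values.
    """
    ranked = sorted(
        range(len(train_features)),
        key=lambda i: squared_distance(query_feature, train_features[i]),
    )
    top = ranked[:k]
    num_pixels = len(train_rotated_views[0])
    return [
        sum(train_rotated_views[i][p] for i in top) / len(top)
        for p in range(num_pixels)
    ]


# Toy data: three training images with 2-D features and 2-pixel rotated views.
train_features = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
train_rotated_views = [[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]]
query_feature = [0.2, 0.0]

knn1_view = knn_rotated_view(query_feature, train_features, train_rotated_views, k=1)
knn2_view = knn_rotated_view(query_feature, train_features, train_rotated_views, k=2)
```

Averaging several neighbors trades sharpness for robustness, which is one reason the value of K is tuned.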
Note that conceptually the learned network parameters during different stages of curriculum training can be used to process an arbitrary number of rotation steps. The RNN1 model (the first row in Figure 7) works well in the first rotation step, but it produces degenerate results from the second step. The RNN2 (the second row), trained with two-step rotations, generates reasonable results in the third step. Progressively, the RNN4 and RNN8 seem to generalize well on chairs with longer predictions ( for RNN4 and for RNN8). We measure the quantitative performance of KNN and our RNN by the mean squared error (MSE) in (1) in Figure 7. The best KNN with 5 retrievals (KNN5) obtains 310 MSE, which is comparable to our RNN4 model, but our RNN16 model (179 MSE) significantly outperforms KNN5, a 42% relative improvement.
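The relative improvement quoted above follows directly from the two MSE values. The short check below uses a hypothetical helper name; only the numbers 310 and 179 come from the text.

```python
def relative_improvement(baseline_mse, model_mse):
    """Fraction by which the model reduces the baseline error."""
    return (baseline_mse - model_mse) / baseline_mse


improvement = relative_improvement(310.0, 179.0)  # KNN5 vs. RNN16
percent = round(100 * improvement)
```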
4.4 CrossView Object Recognition
In this experiment, we examine and compare the discriminative performance of disentangled representations through cross-view object recognition.
Multi-PIE.
We create 7 gallery/probe splits from the test set. In each split, the face images of the same view, e.g., , are collected as the gallery and the images of the other views as probes. We extract 512-d features from the identity units of RNNs for all the test images so that the probes are matched to the gallery by their cosine distance. A match is considered a success if the matched gallery image has the same identity as the probe. We also categorize the probes in each split by measuring their angle offsets from the gallery. In particular, the angle offsets range from to . The recognition difficulty increases with the angle offset. To demonstrate the discriminative performance of our learned representations, we also implement a convolutional network classifier. The CNN architecture is set up by connecting our encoder and identity units with a 200-way softmax output layer, and its parameters are learned on the training set with ground truth class labels. The 512-d features extracted from the layer before the softmax layer are used to perform cross-view object recognition as above. Figure 8 (left) compares the average success rates of RNNs and CNN with their standard deviations over 7 splits for each angle offset. The success rates of RNN1 drop more than from angle offset to . The success rates keep improving in general with curriculum training of RNNs, and the best results are achieved with RNN6. As expected, the performance gap for RNN6 between to reduces to . This phenomenon demonstrates that our RNN model gradually learns pose/viewpoint-invariant representations for 3D face recognition. Without using any class labels, our RNN model achieves competitive results against the CNN.
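The gallery/probe matching protocol can be illustrated with a small Python sketch. It assumes that each probe is assigned the identity of the gallery feature with the highest cosine similarity (equivalently, the smallest cosine distance); the 2-D toy features, identity labels and function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def success_rate(gallery, probes):
    """Fraction of probes whose best-matching gallery entry has the same identity.

    gallery and probes are lists of (identity, feature) pairs.
    """
    hits = 0
    for probe_id, probe_feature in probes:
        best_id, _ = max(
            gallery, key=lambda item: cosine_similarity(probe_feature, item[1])
        )
        if best_id == probe_id:
            hits += 1
    return hits / len(probes)


# Toy example: two gallery identities and three probes, one of which is mismatched.
gallery = [("a", [1.0, 0.0]), ("b", [0.0, 1.0])]
probes = [("a", [0.9, 0.1]), ("b", [0.2, 0.8]), ("a", [0.4, 0.6])]
rate = success_rate(gallery, probes)
```

In the experiments this rate would be computed separately for each angle offset and then averaged over the splits.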
Chairs.
The experimental setup is similar to Multi-PIE. There are in total 31 azimuth views per chair instance. For each view, we create its gallery/probe split so that we have 31 splits. We extract 512-d features from identity units of RNN1, RNN2, RNN4, RNN8 and RNN16. The probes for each split are sorted by their angle offsets from the gallery images. Note that this experiment is particularly challenging because chair matching is a fine-grained recognition task and chair appearances change significantly with 3D rotations. We also compare our model against a CNN, but instead of training a CNN from scratch we use the pre-trained VGG-16 net [25] to extract the 4096-d “fc7” features for chair matching. The success rates are shown in Figure 8 (right). The performance drops quickly when the angle offset is greater than , but the RNN16 significantly improves the overall success rates, especially for large angle offsets. We notice that the standard deviations are large around the angle offsets to . This is because some views contain more information about the chair 3D shapes than other views, so we see performance variations. Interestingly, the performance of the VGG-16 net surpasses our RNN model when the angle offset is greater than . We hypothesize that this phenomenon results from the symmetric structures of most of the chairs. The VGG-16 net was trained with mirroring data augmentation to achieve certain symmetric invariance, while our RNN model does not exploit this structure.
To further demonstrate the disentangling property of our RNN model, we use the pose units extracted from the input images to repeat the above cross-view recognition experiments. The mean success rates are shown in Table 1. It turns out that the better the identity units perform, the worse the pose units perform. When the identity units achieve near-perfect recognition on Multi-PIE, the pose units only obtain a mean success rate , which is close to the random guess for 200 classes.
[Table 1: mean success rates of cross-view recognition. Columns: RNN: identity, RNN: pose, CNN. Rows: Multi-PIE, Chairs.]
4.5 Class Interpolation and View Synthesis
In this experiment, we demonstrate the ability of our RNN model to generate novel chairs by interpolating between two existing ones. Given two chair images of the same view from different instances, the encoder network is used to compute their identity units $z^{1}_{\mathrm{id}}$, $z^{2}_{\mathrm{id}}$ and pose units $z^{1}_{\mathrm{pose}}$, $z^{2}_{\mathrm{pose}}$, respectively. The interpolation is computed by $z_{\mathrm{id}} = \beta z^{1}_{\mathrm{id}} + (1-\beta) z^{2}_{\mathrm{id}}$ and $z_{\mathrm{pose}} = \beta z^{1}_{\mathrm{pose}} + (1-\beta) z^{2}_{\mathrm{pose}}$, where $\beta \in [0, 1]$. The interpolated $z_{\mathrm{id}}$ and $z_{\mathrm{pose}}$ are then fed into the recurrent decoder network to render its rotated views. Example interpolations between four chair instances are shown in Figure 9. The interpolated chairs present smooth stylistic transformations between any pair of input classes (each row in Figure 9), and their unique stylistic characteristics are also well preserved among their rotated views (each column in Figure 9).
[Figure 9: rows labeled Input, t=1, t=5, t=9, t=13; columns labeled by interpolation weights from 0.0 to 1.0 in steps of 0.2.]
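The interpolation of latent units is a convex combination of two vectors. The sketch below illustrates it on toy identity vectors; the convention that the weight multiplies the first vector, the vector values, and the names are assumptions for illustration, and the decoder that would render the interpolated units is omitted.

```python
def interpolate(z_first, z_second, beta):
    """Convex combination beta * z_first + (1 - beta) * z_second."""
    return [beta * a + (1.0 - beta) * b for a, b in zip(z_first, z_second)]


# Toy identity units of two chairs (hypothetical values).
z_id_first = [0.0, 10.0]
z_id_second = [10.0, 0.0]

# Sweep the interpolation weight from 0.0 to 1.0 in steps of 0.2.
betas = [i / 5 for i in range(6)]
sweep = [interpolate(z_id_first, z_id_second, beta) for beta in betas]
midpoint = interpolate(z_id_first, z_id_second, 0.5)
```

Each interpolated vector would then be paired with the pose units and passed through the recurrent decoder to render rotated views.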
5 Conclusion
In this paper, we develop a recurrent convolutional encoder-decoder network and demonstrate its effectiveness for synthesizing 3D views of unseen object instances. On the Multi-PIE dataset and a database of 3D chair CAD models, the model predicts accurate renderings across trajectories of repeated rotations. The proposed curriculum training, which gradually increases the trajectory length of training sequences, yields both better image appearance and more discriminative features for pose-invariant recognition. We also show that a trained model can interpolate across the identity manifold of chairs at fixed pose, and traverse the pose manifold while fixing the identity. This generative disentangling of chair identity and pose emerged from our recurrent rotation prediction objective, even though we do not explicitly regularize the hidden units to be disentangled. Our future work includes introducing more actions other than rotation into the proposed model, handling objects embedded in complex scenes, and handling one-to-many mappings for which a transformation yields a multi-modal distribution over future states in the trajectory.
Acknowledgments
This work was supported in part by ONR N000141310762, NSF CAREER IIS1453651, and NSF CMMI1266184. We thank NVIDIA for donating a Tesla K40 GPU.
References

[1] M. Aubry and B. C. Russell. Understanding deep features with computer-generated imagery. In ICCV, 2015.
[2] M. Aubry, D. Maturana, A. A. Efros, B. Russell, and J. Sivic. Seeing 3D chairs: exemplar part-based 2D-3D alignment using a large dataset of CAD models. In CVPR, 2014.
[3] J. Ba and D. Kingma. Adam: A method for stochastic optimization. In ICLR, 2015.
[4] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
[5] V. Blanz and T. Vetter. A morphable model for the synthesis of 3D faces. In SIGGRAPH, 1999.
[6] B. Cheung, J. Livezey, A. Bansal, and B. Olshausen. Discovering hidden factors of variation in deep networks. In ICLR, 2015.
[7] W. Ding and G. Taylor. Mental rotation by optimizing transforming distance. In NIPS Deep Learning and Representation Learning Workshop, 2014.
[8] A. Dosovitskiy, J. Springenberg, and T. Brox. Learning to generate chairs with convolutional neural networks. In CVPR, 2015.
[9] S. Fidler, S. Dickinson, and R. Urtasun. 3D object detection and viewpoint estimation with a deformable 3D cuboid model. In NIPS, 2012.
[10] J. Flynn, I. Neulander, J. Philbin, and N. Snavely. Deepstereo: Learning to predict new views from the world’s imagery. arXiv preprint arXiv:1506.06825, 2015.
[11] R. Girshick. Fast R-CNN. In ICCV, 2015.
[12] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-PIE. Image and Vision Computing, 28(5):807–813, May 2010.
[13] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In ICANN, 2011.
[14] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[15] N. Kholgade, T. Simon, A. Efros, and Y. Sheikh. 3D object manipulation in a single photograph using stock 3D models. In SIGGRAPH, 2014.
[16] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
[18] T. D. Kulkarni, W. Whitney, P. Kohli, and J. B. Tenenbaum. Deep convolutional inverse graphics network. In NIPS, 2015.
[19] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[20] V. Michalski, R. Memisevic, and K. Konda. Modeling deep temporal dependencies with recurrent “grammar cells”. In NIPS, 2014.
[21] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement learning. In NIPS Deep Learning Workshop, 2013.
[22] J. Oh, X. Guo, H. Lee, R. Lewis, and S. Singh. Action-conditional video prediction using deep networks in atari games. In NIPS, 2015.
[23] S. Reed, K. Sohn, Y. Zhang, and H. Lee. Learning to disentangle factors of variation with manifold interaction. In ICML, 2014.
[24] R. N. Shepard and J. Metzler. Mental rotation of three-dimensional objects. Science, 171(3972):701–703, 1971.
[25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[26] J. B. Tenenbaum and W. T. Freeman. Separating style and content with bilinear models. Neural Computation, 12(6):1247–1283, 2000.
[27] T. Tieleman. Optimizing neural networks that generate images. PhD thesis, University of Toronto, 2014.
[28] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
[29] W. Zaremba and I. Sutskever. Learning to execute. arXiv preprint arXiv:1410.4615, 2014.
[30] X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li. High-fidelity pose and expression normalization for face recognition in the wild. In CVPR, 2015.
[31] Z. Zhu, P. Luo, X. Wang, and X. Tang. Multi-view perceptron: a deep model for learning face identity and view representations. In NIPS, 2014.
Appendix
5.1 Training the chair model with mask stream.
When training the chair model, in addition to decoding the rotated chair images, we also decode their binary masks. The layer structure is illustrated in Figure 10. Being an easier prediction objective than the image, the binary mask provides a useful regularization to the network and significantly improves the prediction performance. We compare the learning curves with and without the mask stream for training RNN1 in Figure 11. Notably, the training loss for the network with the mask stream decreases faster and its testing loss tends to converge better.
5.2 Results on cars.
We also evaluate our model on synthesizing 3D views of cars from a single image. We use the car CAD models collected by [9]. For each of 183 CAD models, we generate 64x64 grayscale renderings from 24 azimuth angles, each offset by 15 degrees, and 4 elevation angles [0,6,12,18]. The renderings from the first 150 models are used for training and the remaining 33 models for testing. The same network structure as the chair model in Figure 10 is used in this experiment, except that both input and output layers are single-channel for grayscale images. The curriculum training also follows the procedure in the chair experiment. We train RNN1, RNN2, RNN4, RNN8 and RNN16 sequentially. Note that we only train the network to perform azimuth rotation. We present example 3D view synthesis of 16-step rotation on two car models from the test set in Figure 12.
[Figure 12: columns labeled Input, t=1 through t=16.]