IterGANs: Iterative GANs to Learn and Control 3D Object Transformation

04/16/2018 ∙ by Ysbrand Galama, et al. ∙ University of Amsterdam

We are interested in learning visual representations which allow for 3D manipulations of visual objects based on a single 2D image. We cast this into an image-to-image transformation task, and propose Iterative Generative Adversarial Networks (IterGANs) which iteratively transform an input image into an output image. Our models learn a visual representation that can be used for objects seen during training, but also for objects never seen before. Since object manipulation requires a full understanding of the geometry and appearance of the object, our IterGANs learn an implicit 3D model and a full appearance model of the object, both inferred from a single (test) image. Two advantages of IterGANs are that the intermediate generated images can be used as an additional supervision signal, even in an unsupervised fashion, and that the number of iterations can be used as a control signal to steer the transformation. Experiments on rotated objects and scenes show how IterGANs help the generation process.


1 Introduction

In this paper we are interested in manipulating visual objects and scenes, without resorting to externally provided (CAD) models or advanced (3D/depth) sensing techniques. To be more specific, we focus on rotating objects and rotating the camera viewpoint of a scene from a single 2D image. Manipulating objects/scenes requires an expectation about the appearance and the geometrical structure of the unseen part of the object/scene. Humans clearly have such an expectation, based on an understanding of the physics of the world, the continuity of objects, and previously seen (related) objects and scenes. We aim to learn such a 3D understanding, which can be inferred from a single 2D image.

In order to learn a representation for object manipulation, we cast this problem into an image-to-image transformation task, with the goal of transforming an input image, following a given 3D transformation, into a target image. For this kind of object manipulation, often either stereoscopic cameras [1, 2] or temporal data streams [3, 4] have been used to infer depth cues, while our aim is to obtain the target image from a single input image only. Similar to how humans are able to do this with one eye closed [5], there have also been works that aim to reconstruct from a single image [6, 7]; however, these typically require externally provided 3D object models, e.g. [8], or focus on a single class of objects only [9]. Our aim, on the other hand, is to learn a general transformation model, which can transform many classes of objects, even objects never seen at train time (without provided 3D models), based on the fact that appearance and geometrical continuity are (mostly) not object/domain specific but generally applicable. For now, we focus on a specific instance of this general object manipulation problem: the object in the target image is a fixed rotation of the object in the input image.

For this task, we propose the use of Iterative Generative Adversarial Networks (IterGANs), see Fig. 1(a). GANs have been used for many (image) generation tasks [10, 11, 12], including image-to-image prediction [13, 14]. IterGANs are an extension of the image-to-image generator GANs [13], where the input image is fed to the generator and the output of the generator is fed into the generator again; this cycle is repeated for a predefined number of iterations, or used as a control mechanism to steer the image rotation.

The iterative nature of IterGANs has some particular advantages over a single image-to-image GAN translation network: (i) the generator only needs to learn a small image manipulation; and (ii) we can use the intermediately generated images to steer the learning process. A fundamental difference between image manipulation and the image-to-image tasks explored in [13] is that when translating a map into an aerial image, there exists a one-to-one pixel relation between the input and the output image. In the case of object manipulation, however, pixels have long-range dependencies, depending on the geometry and appearance of the object and the required degree of rotation. IterGANs break these long dependencies into a series of shorter ones. Moreover, IterGANs allow the use of intermediate loss functions, measuring the quality of the series of generated images, to improve the overall transformation quality.

(a) IterGAN Pipeline
(b) Mental rotation
Figure 1: IterGAN (left): Illustration of a default 3D manipulation pipeline (blue) versus an IterGAN pipeline (red), the latter of which learns an implicit image-to-image model for 3D object manipulation. Mental rotation (right): Psychology research has shown that there exists a linear relation between the human reaction time to identify matching pairs of rotated objects and the degree of rotation [15]. Thus we postulate that it is easier to train an IterGAN to iteratively rotate an object by a few degrees than to train a GAN for a single (large) rotation.
How long does it take you to find the non-matching pair?

This paper is an extended version of our ICLR Workshop paper [16]; the source code is available on GitHub (https://github.com/tomaat/itergan). Our paper is organised as follows. Next we discuss the most related work in image-to-image object manipulation. In Section 3 we introduce IterGANs and propose extended loss functions on the intermediate generated images. We show extensive experimental results in Section 4, on the ALOI dataset [17] for object rotation and the VKITTI dataset [18] for camera-viewpoint scene rotation. Finally, we conclude the paper in Section 5.

2 Related work

There is a vast amount of related work in the field of 3D reconstruction and image generation. Here, we only highlight the most relevant methods with respect to our proposed models. For a more in-depth overview we refer to [19] for 3D reconstruction and to [14] for image generation.

3D reconstruction from 2D images

There are several techniques for generating 3D environments from 2D data, differing in both the type of data, and the type of environment. It is possible to create a point-cloud from video using Structure from Motion [20, 21], or to fit polygonal objects to images [7, 22] and deform them to fit other images [8].

In recent years, there have also been attempts to use deep learning instead. These techniques also allow forgoing 3D models, thus using only 2D images to describe the 3D environment. Such an approach has been used to classify objects [23], generate different viewpoints from descriptors [24], or create the frames for 3D movies [25].

In contrast to their work, we focus on changing the viewpoint of a given image. Therefore our models need to not only construct the view, as Dosovitskiy et al. [24] do, but also perceive an input image, and capture more than the disparity map of Xie et al. [25]. Moreover, since the output is an image, we use GANs as a training method to transform the image.

Image manipulation with GANs

GANs [26, 27] have been shown to be useful for the generation of images. These networks manage to generate visually pleasing images, as shown by many recent studies (e.g. [10, 11, 12]). Using a GAN conditioned on an input image to generate a related image has been proposed for various translation settings as 'Pix2pix' [13].

The development of GANs has led to multiple studies on manipulating images and the objects in them. It is now possible to age a face from a single image [28], or to provide it with glasses and a shave [29]. Alternatively, images can be manipulated by changing the main object [30] (e.g. cats to dogs).

Novel viewpoint estimation

A more specific form of image manipulation is novel viewpoint estimation, which has a slightly different goal than 3D reconstruction, since the output is again a 2D image. A new viewpoint can be estimated using voxel projections [31], using GANs to estimate the frontal view of a face [32], or by transforming a single image with viewpoint estimation [33]. The quality of the generated image can be improved by using multiple reconstructed images [33], or by using a second network to finish the image estimated by a flow-network [9].

Our work differs from previous work in the type of data and the control signal used. We use the number of iterations of the generator as control signal, not a vector fed into the network. Moreover, instead of synthetic data of a single object class, we use a heterogeneous dataset of real-world images photographed under constrained conditions.

(a) IterGAN base
(b) Unsupervised IDL
(c) Supervised IDL
Figure 2: IterGAN framework and intermediate discriminator loss (IDL) functions: the iterative nature of IterGANs (left) allows us to define an unsupervised discriminator loss on the intermediately generated images (middle), where the discriminator aims to tell apart generated images from real images (A or T), and a supervised discriminator loss function (right), where the discriminator aims to tell apart an input-generated image pair from an input-real image pair.

3 IterGANs

Iterative GANs are an extension of the image-to-image GAN networks of Isola et al. [13], where the generator is called iteratively. This allows both for learning an image-to-image translation task in more detail by making small iterative steps, and for controlling object manipulation by calling the generator for k iterations, where k depends on the desired manipulation. The underlying assumption of Pix2Pix, i.e. a one-to-one correspondence between pixels in the input and output image, is (in part) restored by calling the generator iteratively. Thus, the final image generated from an input image A is obtained by iteratively calling the generator G of the GAN network k times:

$G^{k}(A) \;=\; \underbrace{G(G(\cdots G(A)\cdots))}_{k\ \text{times}}$    (1)

Each generator call now has to rotate its input image only by a small fraction, until the final rotation has been reached. The iterative nature of the IterGAN is illustrated in Fig. 2(a); we refer to the IterGAN network as IG.
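To make the iterative call of Eq. 1 concrete, the following is a minimal PyTorch-style sketch; the tiny stand-in generator, the 256x256 input size, and the choice k=6 are illustrative assumptions, not the authors' exact architecture or settings.

```python
import torch
import torch.nn as nn

class TinyGenerator(nn.Module):
    """Stand-in for the Pix2pix U-Net generator (illustrative only)."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1), nn.Tanh(),
        )
    def forward(self, x):
        return self.net(x)

def iterative_generate(G: nn.Module, A: torch.Tensor, k: int) -> torch.Tensor:
    """Eq. 1: feed the generator its own output k times."""
    x = A
    for _ in range(k):
        x = G(x)  # each call should perform one small rotation step
    return x

if __name__ == "__main__":
    G = TinyGenerator()
    A = torch.randn(1, 3, 256, 256)        # a single input image (batch of 1)
    T_hat = iterative_generate(G, A, k=6)  # k = 6 is an illustrative choice
    print(T_hat.shape)                     # torch.Size([1, 3, 256, 256])
```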

Our proposed network has the same number of parameters as a Pix2Pix network [13] with a single generator, and for learning we can use the same generator and discriminator losses as in [13]:

$\mathcal{L}_{D} \;=\; -\,\mathbb{E}_{A,T}\big[\log D(A, T)\big] \;-\; \mathbb{E}_{A}\big[\log\big(1 - D(A, G^{k}(A))\big)\big]$    (2)
$\mathcal{L}_{G} \;=\; -\,\mathbb{E}_{A}\big[\log D(A, G^{k}(A))\big] \;+\; \lambda\,\mathbb{E}_{A,T}\big[\big\|T - G^{k}(A)\big\|_{1}\big]$    (3)

where the cross-entropy (adversarial) loss is computed by the discriminator D, conditioned on the input image A and applied to the generated image G^k(A), and the L1 loss is computed between the generated image G^k(A) and the target image T.
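For reference, a hedged sketch of how Eqs. 2-3 look in the standard conditional-GAN form of [13]; the exact discriminator interface D(A, image) and the L1 weight of 100 are assumptions made for illustration, not verified against the authors' code.

```python
import torch
import torch.nn.functional as F

def pix2pix_losses(D, A, T, T_hat, lam: float = 100.0):
    """Sketch of Eqs. 2-3 in the standard conditional-GAN form.

    D is assumed to be a conditional discriminator taking (input, image) pairs
    and returning logits; lam is the L1 weight (100 is the Pix2pix default,
    assumed here).
    """
    real_logit = D(A, T)
    fake_logit = D(A, T_hat.detach())

    # Discriminator: real pairs -> 1, generated pairs -> 0 (Eq. 2)
    d_loss = (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
              + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit)))

    # Generator: fool the discriminator + L1 distance to the target (Eq. 3)
    gen_logit = D(A, T_hat)
    g_loss = (F.binary_cross_entropy_with_logits(gen_logit, torch.ones_like(gen_logit))
              + lam * F.l1_loss(T_hat, T))
    return d_loss, g_loss
```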

Object mask specific reconstruction loss

The ALOI dataset [17] we use for most of our experiments contains (rotated) objects against a black background. In order to focus the reconstruction mostly on the objects, we use a variant of the L1 loss, which uses the provided binary mask M:

$\mathcal{L}_{L1M} \;=\; \mathbb{E}_{A,T}\big[\big\|(1 + M) \odot (T - G^{k}(A))\big\|_{1}\big]$    (4)

where M assigns pixels to the object (1) or the background (0). This variant of the L1 loss weights the object twice as important as the background, which is relevant for our task of object transformations. We refer to models trained with this loss as the mask-loss variants.
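A minimal sketch of the mask-weighted L1 loss of Eq. 4, assuming M is a binary tensor broadcastable over the image channels and that the loss is averaged over pixels (the exact normalisation is an assumption).

```python
import torch

def masked_l1(T_hat: torch.Tensor, T: torch.Tensor, M: torch.Tensor) -> torch.Tensor:
    """Object-mask weighted L1 loss (Eq. 4, reconstructed form).

    M is a binary mask (1 = object, 0 = background); object pixels are
    weighted twice as heavily as background pixels.
    """
    weight = 1.0 + M                       # 2 on the object, 1 on the background
    return (weight * (T_hat - T).abs()).mean()
```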

Figure 3: Examples of generated artefacts in intermediate images (top row) that are not present in the final generated image (bottom row). The artefacts are independent of the specific value of k; the examples stem from different iterations.

Intermediate artefacts

While the iterative generator in general leads to more realistic final images compared to the baseline Pix2pix model, the intermediate generated images show different types of artefacts. In Fig. 3 we show examples of generated intermediate images from experiments with different values of k, and each of these shows artefacts, like switching colours, adding noise, or adding other patterns. Interestingly, these are gone in the final generated image, so repeatedly applying the same generator removes the introduced artefacts. To overcome these artefacts and to improve the general generation quality, we introduce a set of intermediate loss functions.

3.1 Intermediate Discriminator Loss Functions

In this section we present two classes of intermediate discriminator loss (IDL) functions, where we include an additional discriminator in the learning process, which is fed with the intermediate images. While IterGANs produce intermediate generated images, so far it was an implicit assumption that these would be realistic images as well. The additional IDLs enforce the generator to learn to produce realistic intermediate images, either in an unsupervised setting, which only requires the real images A and T, or in a supervised setting, where we also require the real intermediate images, see Fig. 2(b) and (c).

Unsupervised Intermediate Discriminator Loss

In the Unsupervised IDL model we include a discriminator D_u, with the goal of telling apart the real images A or T from any of the generated images G^i(A), unconditioned on the original input image. The generator, on the other hand, aims to fool this discriminator (as well as the main discriminator). This results in the following loss functions:

$\mathcal{L}_{D_u} \;=\; -\,\mathbb{E}\big[\log D_u(X)\big] \;-\; \mathbb{E}_{i}\big[\log\big(1 - D_u(G^{i}(A))\big)\big], \qquad X \in \{A, T\}$    (5)
$\mathcal{L}_{G_u} \;=\; \mathcal{L}_{G} \;-\; \mu\,\mathbb{E}_{i}\big[\log D_u(G^{i}(A))\big]$    (6)

where i is uniformly sampled from {1, ..., k-1}, either A or T is used as the real image X, and μ is an additional hyper-parameter. Since no additional labelled data is required, this is an unsupervised IDL; we coin this the unsupervised-IDL model, see Fig. 2(b).
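The sketch below illustrates one plausible implementation of the unsupervised IDL; the sampling of a single intermediate image and a single real image per step follows the text, while the discriminator interface and the binary cross-entropy formulation are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def unsupervised_idl(D_u, A, T, intermediates):
    """Sketch of the unsupervised intermediate discriminator loss (Eqs. 5-6).

    intermediates: list of generated images G^1(A), ..., G^{k-1}(A);
    D_u is an unconditional discriminator on single images returning logits.
    Returns the discriminator loss and the generator term (to be scaled by mu).
    """
    i = torch.randint(len(intermediates), (1,)).item()
    fake = intermediates[i].detach()
    real = A if torch.rand(1).item() < 0.5 else T     # either A or T is used

    real_logit, fake_logit = D_u(real), D_u(fake)
    d_loss = (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
              + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit)))

    # Generator term: make the sampled intermediate look real (weighted by mu).
    g_logit = D_u(intermediates[i])
    g_term = F.binary_cross_entropy_with_logits(g_logit, torch.ones_like(g_logit))
    return d_loss, g_term
```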

Supervised Intermediate Discriminator Loss

The second extension also adds a discriminator D_s, but conditioned on the input image A. It aims to discriminate whether an image is a real or a generated rotation of image A, and uses the intermediate target images T_i for supervision. While this discriminator is also conditioned, and therefore behaves more similarly to the conditional discriminator in Pix2Pix/IG, the difference is that the goal of the main discriminator is to detect whether the output is the one specific rotation of the input (R=30° in most of our experiments), whereas the new discriminator should accept an arbitrary rotation of image A.

The loss functions for this model are:

$\mathcal{L}_{D_s} \;=\; -\,\mathbb{E}\big[\log D_s(A, T_i)\big] \;-\; \mathbb{E}\big[\log\big(1 - D_s(A, G^{i}(A))\big)\big]$    (7)
$\mathcal{L}_{G_s} \;=\; \mathcal{L}_{G} \;-\; \mu\,\mathbb{E}\big[\log D_s(A, G^{i}(A))\big]$    (8)

where for each image pair (A, T) a single pair (G^i(A), T_i) is randomly sampled, with i from {1, ..., k-1}. In order to train these losses, we need additional supervision in the form of the intermediate real images T_i; therefore this is a supervised IDL, and we coin this the supervised-IDL model, see Fig. 2(c).
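Analogously, a sketch of the supervised IDL, assuming the real intermediate rotations T_i are available and that D_s takes (input, image) pairs; the exact formulation of Eqs. 7-8 is a reconstruction, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def supervised_idl(D_s, A, intermediates, targets):
    """Sketch of the supervised intermediate discriminator loss (Eqs. 7-8).

    intermediates: generated images G^1(A), ..., G^{k-1}(A);
    targets: the corresponding real intermediate rotations T_1, ..., T_{k-1};
    D_s is conditioned on A and should accept any real rotation of A.
    """
    i = torch.randint(len(intermediates), (1,)).item()
    real_logit = D_s(A, targets[i])
    fake_logit = D_s(A, intermediates[i].detach())
    d_loss = (F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
              + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit)))

    # Generator term: the sampled intermediate should pass as a real rotation of A.
    g_logit = D_s(A, intermediates[i])
    g_term = F.binary_cross_entropy_with_logits(g_logit, torch.ones_like(g_logit))
    return d_loss, g_term
```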

3.2 Training IterGANs to Control Object Manipulation

In this section we aim to use IterGANs to control object rotation, where the number of iterations k is determined by the desired rotation of the object. We see this as an explicit alternative to methods that learn implicit control by adding a signal to the representation of the generator network, see e.g. [9].

In fact, the supervised IDL method already uses sequences of images of a transforming (rotating) object for training; however, the learning objective aims for a fixed rotation and a fixed number of iterations k. Here, we control object rotation by varying the number of iterations k depending on the desired rotation of the object. Therefore, instead of training on input-target pairs with a fixed rotation, we sample a value of k and select an input/target pair with the corresponding degree of rotation. In each training step the generator is repeated k times and aims to generate an image at the corresponding rotation. When k > 1 the IDL can still be used to discriminate the intermediate results.

Stepwise learning to rotate objects

To learn optimally for controlling manipulation, we combine the insight that learning small rotations is easier than learning larger rotations with the observation that the intermediate images in IterGANs might contain artefacts. Therefore we adhere to a training procedure which slowly increases the degree of rotation: in practice we start with small values of k for a few epochs, and then in three steps increase the range of k to the full range.
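A minimal sketch of such a stepwise schedule is shown below; the warm-up length, the three-step widening, and k_max = 6 are illustrative assumptions rather than the exact values used in the paper.

```python
import random

def sample_k(epoch: int, k_max: int = 6, epochs_per_step: int = 5) -> int:
    """Illustrative stepwise curriculum for the number of iterations k.

    For the first few epochs only k = 1 is used; afterwards the upper bound
    on k is raised in three steps until the full range [1, k_max] is reached.
    """
    step = min(epoch // epochs_per_step, 3)   # 0, 1, 2 or 3
    upper = 1 + step * (k_max - 1) // 3       # 1 -> 2 -> 4 -> 6 (for k_max = 6)
    return random.randint(1, upper)
```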

4 Experiments

Experimental Setup and Evaluation Measures

For most of our experiments we use the data of the Amsterdam Library of Object Images (ALOI) [17]. This set contains images of 1000 household objects, photographed under constrained lighting conditions from different viewing directions using a turntable setup. Each object is rotated in steps of 5°, resulting in 72 images per object, which were padded and scaled to fit the input size of the networks.

For training and testing, we split this data into three different sets. First, for training we use a subset of the data containing 800 objects, each contributing 36 image pairs with a fixed rotation of 30° between input and target. For testing we use two different sets. The first contains 100 objects from the train set, but with different start (and thus target) rotations, and is referred to as Seen objects. The other test set also contains 100 objects, albeit objects not present in the train set. This set is referred to as Unseen objects and contains the same number of images.

We train all models for 20 epochs, with the hyper-parameters from [13] and, where used, the IDL weight μ. To counteract bad initialisation of the models, each was trained multiple times (3), and the best run was used for comparison. Source code, the trained models, and the train and test splits used are available on GitHub.

Evaluating the quality of generated images is hard by itself [34, 35]; therefore we use several evaluation measures. Pixels in the target and generated output images are scaled to [-1, 1] by using a tanh activation function in the last layer of the network. We then evaluate using:

  1. the mean absolute distance between pixels (the L1 loss, also used in [13]);

  2. the object-specific mask loss (the masked L1 loss of Eq. 4); and

  3. the Kullback-Leibler label divergence:

    $d_{KL}\big(G^{k}(A)\,\big\|\,T\big) \;=\; \sum_{c} p_c\big(G^{k}(A)\big)\,\log\frac{p_c\big(G^{k}(A)\big)}{p_c(T)}$    (9)

    to measure the similarity of the label distributions p, obtained from a pre-trained VGG16, between the generated image G^k(A) and the target image T. This measure is inspired by the KL measure used to quantify specificity and diversity in [35]; however, we aim for the generated image to be realistic and therefore to have a label distribution similar to that of the target image (a small sketch of this measure is given below the list).
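As an illustration, the KL label divergence of Eq. 9 can be computed from classifier softmax outputs as sketched below; the preprocessing, the use of torchvision's VGG16, and the numerical clamping are assumptions, not the authors' evaluation code.

```python
import torch
import torch.nn.functional as F
import torchvision  # used in the usage example below

def kl_label_divergence(generated: torch.Tensor, target: torch.Tensor,
                        model: torch.nn.Module) -> torch.Tensor:
    """KL divergence between label distributions (Eq. 9, reconstructed form).

    `generated` and `target` are batches of images already preprocessed for the
    classifier (e.g. resized to 224x224 and ImageNet-normalised).
    """
    with torch.no_grad():
        p_gen = F.softmax(model(generated), dim=1)   # label distribution of generated images
        p_tgt = F.softmax(model(target), dim=1)      # label distribution of target images
    # sum_c p_c(generated) * log(p_c(generated) / p_c(target)), averaged over the batch
    return (p_gen * (p_gen.clamp_min(1e-12).log()
                     - p_tgt.clamp_min(1e-12).log())).sum(dim=1).mean()

# Usage sketch (weights argument depends on the torchvision version):
# vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").eval()
# score = kl_label_divergence(gen_batch, tgt_batch, vgg)
```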

Preliminary experiments with other evaluation measures, including SSIM [36] and VIFp [37], showed a similar ordering of results as the L1 loss, and are therefore not included in the overview.

(a) Baseline comparison
(b) Cumulative plot
Figure 4: Model comparison: evaluation of the different models (left), including the Identity, Projective, and Pix2pix baselines, and a cumulative plot of the scores of the first four models (right). In a cumulative plot, each line indicates the percentage of the data with that score or lower; thus the earlier a line reaches 1, the better the model.

4.1 IterGANs on ALOI

In the first experiment, we compare different IG models to three baselines: Pix2pix, which is identical to using k=1, and two non-learning transformations, the identity projection (B=A) and the projective transformation, which rotates the image plane assuming a pinhole camera to compute point pairs for calculating the transformation matrix. We compare the base IG model, the models with IDL (unsupervised and supervised), and their mask-loss variants. To reduce the number of hyper-parameters, we use a fixed number of iterations for these models, since the target images are available at 5° intervals. For different values of k we have already shown the existence of artefacts in intermediate images (see Fig. 3).

From the results in Fig. 4 (left) we observe that for the L1 and mask-loss measures the learning methods outperform the non-learning baselines, while for the KL label divergence the latter are very strong. This is probably because VGG16 has learned to be invariant to object viewpoint, while subtle differences in local image statistics can have a large impact. The IG model improves over the Pix2pix model on the L1 and mask-loss measures; while the difference is small, it is significant according to the non-parametric Friedman test, in which each image acts as a judge ranking the two models.

Figure 5: Cumulative scores showing the differences between Pix2pix and IG in detail. For both seen (solid) and unseen (dashed) data, IG scores slightly better than Pix2pix.
Figure 6: Difference in score between Pix2pix and IG, split by input angle and object ID. The y-axis indicates the score, while the x-axis indicates either the input angle of the camera (left) or the ID of the object (right). *To increase readability, the IDs are sorted by the scores of our model.

Detailed comparison

In order to gain more insight into the difference between the Pix2pix and IG models, we compare the two in different ways. First, in Fig. 5 we show a cumulative histogram of the score for both models. This plot shows that for both seen and unseen objects, IG scores better than Pix2pix. An independent t-test on both populations supports this by indicating a significant difference between the two models.

Second, in Fig. 6, we compare Pix2pix and IG, where the results are grouped by viewing angle of the input (left) or by object identity (right), and then averaged over all test examples with the same angle/ID. From this figure we conclude that the viewing angle of the input image does not significantly influence the performance. On the other hand, there is a clear relation between object identity and performance: apparently, for both models, some objects are more difficult to turn around than others.

The object-identity plot also shows a preference for the IG model, given that IG outperforms (i.e. has a lower score than) Pix2pix for every angle and almost every object.

Figure 7: Evaluation of the performance of adding the IDL (left) and the mask loss (right) to the training. The mask loss increases the performance on the generated image; the IDL does so only when used in combination with the mask loss.

Figure 8: Qualitative comparison of the models (columns). The top three rows show a rotation of seen objects, and the bottom three of unseen objects. The green rectangle is magnified to better compare details.

Figure 9: Inter- and extrapolation of the different models (rows) with the ground truth at top. The columns show the rotation from the input. The iterative GANs show more realistic generated images for a wide range of angles. (Best viewed in colour, zoom in for details)

IterGAN extensions

The scores in Fig. 7 show the performance of the introduced extensions: the IDL and the mask loss. The IDLs were introduced to suppress the artefacts in the intermediate results. These models (unsupervised and supervised IDL) score lower on the quality of the final image. However, qualitative examples show a reduction of artefacts in the intermediate images; see Fig. 9 for examples. The fact that the final results of these models are lower than those of the base IG model could be because finding a solution that satisfies both intermediate and overall quality is much harder than finding any solution that satisfies only final quality, thus making training more difficult.

Learning with the mask loss shows a clear increase in performance of the models. This result is not surprising, since the models are now also trained on the test measure, and the large background is taken into account during training. The fact that the IDL models now score better could be because the mask loss helps the generator find a solution that fools the discriminators. The small difference in performance between the unsupervised and supervised IDL, where the latter has more target data available during training, could mean either that the models already perform at their best, or that the extra data is underused by the current training method.

We therefore conclude that the introduction of the mask loss clearly outperforms the other methods, and that adding the unsupervised IDL further improves the quality of the image, without the need for extra training data.

4.2 Control object rotation

Since IterGANs generate the final image in steps, each step can be interpreted as a partial rotation. In this experiment we show what happens when the number of iterations is used to control the amount of rotation, i.e. we inter- and extrapolate the generator to generate different angles. Furthermore, we show how using a variable k during training (as explained in Section 3.2) influences the performance of the partial rotations.
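As a usage sketch of this control mechanism, the desired rotation angle is converted into a number of generator iterations. The 5° step per call matches the ALOI sampling, but the exact per-call rotation learned by the models is an assumption here.

```python
def rotate_by_angle(G, A, angle_deg: float, step_deg: float = 5.0):
    """Use the iteration count as the control signal for the rotation.

    G is any image-to-image generator assumed to perform one `step_deg`
    rotation per call; angles are rounded to the nearest multiple of the step.
    """
    k = max(1, round(angle_deg / step_deg))
    x = A
    for _ in range(k):
        x = G(x)
    return x
```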

As baselines, Pix2pix was used in our default setting of a fixed 30° rotation (which is only able to generate images rotated by a multiple of 30°), and a second Pix2pix variant trained on image pairs of varying rotation angles. The three models with a variable k during training use either the full range of k for each epoch, or a stepwise increase from small k to the full range in three steps of five epochs. Furthermore, the best model with a fixed k during training is also used for comparison.

The graph in Fig. 10 shows the performance of each model for different angles (and thus different values of k). The Pix2pix baseline trained on varying angles shows an upper bound of quality, and the fixed 30° Pix2pix shows the inflexibility of interpolated angles. A clear phenomenon is the periodic pattern of the fixed-k IterGAN: depending on the phase it is either one of the best scoring models or the worst, showing the effect of the artefacts in the intermediate images, as also seen in the examples in Fig. 9. This phenomenon shows that even with an IDL, it is hard to train the generator to produce good intermediate images because of the complex parameter space. The models with a variable k do not show this behaviour: each iteration of the generator creates a slightly rotated image. The graph also shows that training with an incrementally increasing k outperforms training with all values of k from the start, indicating that starting with smaller angles helps the training process. Lastly, the IDL no longer seems to have any effect, probably because the generator is already forced to produce good intermediate images since it was also trained on smaller values of k. The use of IDLs could be beneficial again if not all intermediate target images are available, e.g. when only a subset of the rotations has targets to train on.

Figure 10: Performance on control of the rotation. Each bar indicates the mean score of that model on images generated at the angle indicated by the x-axis. Seen objects (top) and unseen objects (bottom) are rotated by repeating the trained generator. Note that since Pix2pix rotates in steps of 30°, most angles cannot be generated by this model, which therefore has no bar for those angles.

Figure 11: Scores comparing the models on unseen objects. Right: the mean±std. scores on both the seen and unseen datasets; left: a cumulative plot of four of these models for visual comparison. The best model on seen data is closely followed on unseen data by the models trained with a variable k.

4.3 Unseen objects

In this final experiment we compare the results on the two test sets: the seen and unseen objects. The results are shown in Fig. 11. We observe, as expected, that the performance on seen objects is in general better than on unseen objects, but also that training for the mask-loss metric does not improve the results noticeably in this case. Moreover, as seen from the qualitative results in Fig. 8, the mask-loss model also introduces artefacts in the background of the final image. However, the models trained with a variable k are less affected by unseen objects: their drop in performance is noticeably smaller. This indicates that these models can rotate objects in images without the object necessarily being part of the training set, i.e. the models can use the cues in a single image to understand the 3D properties of an arbitrary object.

Figure 12: Table of L1 scores of the models trained on VKITTI, on the seen and unseen test sequences.

4.4 Camera rotation on VKITTI

In our final experiment, we explore a different rotation problem: camera rotation. The task is to generate a scene from a different camera viewpoint. For this task we use the Virtual-KITTI dataset [18], which is a synthetic version of KITTI [38] that contains a side view, rotated by a fixed angle from the main camera. The task is to generate this side view from the main camera. Since the proposed models are based on image pairs, we use all frames independently and not as a video stream.

We use pairs of wide-format images and select four sequences for training. Since the models are fully convolutional, an input image that is three times as wide only changes the size of each layer's output, not the number of parameters. Yet, given that the dataset is smaller than ALOI (and that the intermediate rotations were not available), we evaluate only Pix2pix, IG, and the unsupervised-IDL model, trained for 50 epochs. Also note that since there are no foreground masks, the plain L1 measure was used throughout both training and testing.

Figure 13: Illustration of generated images from the VKITTI dataset. The models iteratively rotate the scene and add details. (Best viewed in colour, zoom in for details)

In Fig. 13 we show examples from the trained models. For both IG and the unsupervised-IDL model we show, for a given input, the intermediate and final generated images. Similar to the ALOI experiments, IG has periodic artefacts in the intermediate steps, which are reduced by the supervision of the second discriminator in the IDL model. The intermediate images of the IDL model seem to add details in every step, besides rotating the scene.

The final performance is rather similar for all three models (see Fig. 12), both on test images from the training sequences and on images from the unseen test sequences, with IG being marginally better in both scenarios. Therefore, we conclude that Iterative GANs are suitable for generating rotated images.

5 Conclusion & Outlook

In this paper we have introduced IterGANs, a model to iteratively transform an image into a target image, whereby the generator has to learn only a small transformation. IterGANs are in part inspired by the Shepard and Metzler [15] mental rotation experiment (in Fig. 1(b) the middle objects are not only rotated, but also mirrored). Our experiments have shown that IterGANs outperform a direct transformation GAN.

Moreover, the intermediate generated images allow for extended loss functions, by using intermediate discriminators, either supervised or unsupervised. Surprisingly, the unsupervised loss functions outperform the supervised ones.

Finally, IterGANs allow for controlling the image manipulation by calling the generator a different number of times. Learning with an increasing number of iterations, i.e. sampling from an increasing range of possible k, improves the quality of the final results. Our experiments on rotating objects never seen at train time show that our proposed models learn a generic object appearance and transformation generator. Future research could investigate extending the control signal beyond rotation to a full 3D transformation of both objects and scenes.

Acknowledgements

This research is supported in part by the NWO VENI What&Where project.

References

  • [1] Ko, J., Kim, M., Kim, C.: 2d-to-3d stereoscopic conversion: depth-map estimation in a 2d single-view image. In: Proc. of the SPIE. Volume 6696. (2007) 66962
  • [2] Bruno, F., Bruno, S., De Sensi, G., Luchi, M.L., Mancuso, S., Muzzupappa, M.: From 3d reconstruction to virtual reality: A complete methodology for digital archaeological exhibition. Journal of Cultural Heritage 11(1) (2010) 42–49
  • [3] Pollefeys, M., Nistér, D., Frahm, J.M., Akbarzadeh, A., Mordohai, P., Clipp, B., Engels, C., Gallup, D., Kim, S.J., Merrell, P., et al.: Detailed real-time urban 3d reconstruction from video. International Journal of Computer Vision 78(2-3) (2008) 143–167
  • [4] Gibson, S., Hubbold, R.J., Cook, J., Howard, T.L.: Interactive reconstruction of virtual environments from video sequences. Computers & Graphics 27(2) (2003) 293–301
  • [5] Vishwanath, D., Hibbard, P.B.: Seeing in 3-d with just one eye: Stereopsis without binocular vision. Psychological science 24(9) (2013) 1673–1685
  • [6] Saxena, A., Sun, M., Ng, A.Y.: Make3d: Learning 3d scene structure from a single still image. IEEE transactions on pattern analysis and machine intelligence 31(5) (2009) 824–840
  • [7] Rematas, K., Nguyen, C.H., Ritschel, T., Fritz, M., Tuytelaars, T.: Novel views of objects from a single image. IEEE transactions on pattern analysis and machine intelligence 39(8) (2017) 1576–1590
  • [8] Vicente, S., Agapito, L.: Balloon shapes: Reconstructing and deforming objects with volume from images. In: 3DTV-Conference, IEEE (2013)
  • [9] Park, E., Yang, J., Yumer, E., Ceylan, D., Berg, A.C.: Transformation-grounded image generation network for novel 3d view synthesis. In: CVPR. (2017)
  • [10] Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: ICML. (2016)
  • [11] Denton, E.L., Chintala, S., Fergus, R., et al.: Deep generative image models using a laplacian pyramid of adversarial networks. In: Advances in neural information processing systems. (2015) 1486–1494
  • [12] Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
  • [13] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR, IEEE (July 2017)
  • [14] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: ICCV, IEEE (Oct 2017)
  • [15] Shepard, R., Metzler, J.: Mental rotation of three-dimensional objects. Science 171(3972) (1971) 701–703
  • [16] Galama, Y., Mensink, T.: Itergans: Iterative gans for rotating visual objects. In: International Conference on Learning Representations - Workshop (ICLRw). (2018)
  • [17] Geusebroek, J.M., Burghouts, G.J., Smeulders, A.W.: The amsterdam library of object images. International Journal of Computer Vision 61(1) (2005) 103–112
  • [18] Gaidon, A., Wang, Q., Cabon, Y., Vig, E.: Virtual worlds as proxy for multi-object tracking analysis. In: CVPR, IEEE (2016)
  • [19] Li, M., Zheng, D., Zhang, R., Yin, J., Tian, X.: Overview of 3d reconstruction methods based on multi-view. In: IHMSC, IEEE (2015)
  • [20] Koenderink, J.J., Van Doorn, A.J.: Affine structure from motion. JOSA A 8(2) (1991) 377–385
  • [21] Nistér, D.: Preemptive ransac for live structure and motion estimation. Machine Vision and Applications 16(5) (2005) 321–329
  • [22] Suveg, I., Vosselman, G.: Reconstruction of 3d building models from aerial images and maps. ISPRS Journal of Photogrammetry and remote sensing 58(3) (2004) 202–224
  • [23] Su, H., Maji, S., Kalogerakis, E., Learned-Miller, E.: Multi-view convolutional neural networks for 3d shape recognition. In: ICCV, IEEE (2015)
  • [24] Dosovitskiy, A., Tobias Springenberg, J., Brox, T.: Learning to generate chairs with convolutional neural networks. In: CVPR, IEEE (2015)
  • [25] Xie, J., Girshick, R., Farhadi, A.: Deep3d: Fully automatic 2d-to-3d video conversion with deep convolutional neural networks. In: ECCV, Springer (2016)
  • [26] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems. (2014) 2672–2680
  • [27] Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
  • [28] Antipov, G., Baccouche, M., Dugelay, J.L.: Face aging with conditional generative adversarial networks. In: ICIP. (2017)
  • [29] Shen, W., Liu, R.: Learning residual images for face attribute manipulation. In: CVPR, IEEE (2017) 1225–1233
  • [30] Liang, X., Zhang, H., Xing, E.P.: Generative semantic manipulation with contrasting gan. arXiv preprint arXiv:1708.00315 (2017)
  • [31] Yan, X., Yang, J., Yumer, E., Guo, Y., Lee, H.: Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In: NIPS. (2016)
  • [32] Huang, R., Zhang, S., Li, T., He, R., et al.: Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis. In: ICCV. (2017)
  • [33] Zhou, T., Tulsiani, S., Sun, W., Malik, J., Efros, A.A.: View synthesis by appearance flow. In: ECCV. (2016)
  • [34] Wang, Z., Bovik, A.C., Lu, L.: Why is image quality assessment so difficult? In: ICASSP, IEEE (2002)
  • [35] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., Chen, X., Chen, X.: Improved techniques for training gans. In: NIPS. (2016)
  • [36] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4) (2004) 600–612
  • [37] Sheikh, H.R., Bovik, A.C.: Image information and visual quality. IEEE Transactions on image processing 15(2) (2006) 430–444
  • [38] Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: CVPR, IEEE (2012)

Appendix 0.A Projective transformation

One of the baselines introduced in the paper is a projective baseline: approximate the rotation with a projective transformation matrix. In this supplementary material, we give a more detailed explanation of the computation and several examples of qualitative results.

Assuming the object to be a plane and assuming a pinhole camera, a projective transformation can be used to transform the images. Because the position of the camera with respect to the object is known for the ALOI dataset, the computation is relatively straightforward.

The rotation matrix for a rotation of θ (in radians) around the vertical axis of the turntable is:

$R(\theta) = \begin{pmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{pmatrix}$    (10)

When using the pinhole camera model to compute where points on this rotated plane and their originals are projected onto the image, we need to define the origin. Assuming the origin of the 3D orthogonal basis is fixed at the base of the object, and using the known camera position (in cm) in (X,Y,Z), we obtain:

$s \begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = P \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix}$    (11)

where (X, Y, Z) are real-world coordinates, (u, v) are pixel coordinates, s is a scaling factor, and P is the camera projection matrix. Therefore (u, v) can be computed from (X, Y, Z), and the rotated pixel coordinates (u', v') can be computed from the rotated points R(θ)(X, Y, Z)^T.

The pixel to rotated-pixel transformation matrix H is then defined by:

$s' \begin{pmatrix} u' \\ v' \\ 1 \end{pmatrix} = H \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}, \qquad H = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix}$    (12)

To solve for the entries h_ij we need at least 4 pairs of (u, v) and (u', v'). Thus, we use the four corner points of a unit square on the object plane to compute H. With this transformation matrix, new images can be created from the images of the dataset by applying Eq. 12 to all pixels and using bilinear interpolation. Examples of transformed images can be seen in Fig. 14.
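For illustration, the homography of Eq. 12 can be estimated from four (or more) point correspondences with the standard direct linear transform, as sketched below in numpy; the camera calibration used to obtain the correspondences for the ALOI setup is not reproduced here.

```python
import numpy as np

def fit_homography(src_pts: np.ndarray, dst_pts: np.ndarray) -> np.ndarray:
    """Estimate the 3x3 matrix H of Eq. 12 from >= 4 point correspondences
    using the direct linear transform. src_pts, dst_pts: arrays of shape (N, 2)."""
    rows = []
    for (u, v), (up, vp) in zip(src_pts, dst_pts):
        rows.append([u, v, 1, 0, 0, 0, -up * u, -up * v, -up])
        rows.append([0, 0, 0, u, v, 1, -vp * u, -vp * v, -vp])
    A = np.asarray(rows, dtype=float)
    _, _, Vt = np.linalg.svd(A)
    return Vt[-1].reshape(3, 3)   # right singular vector of the smallest singular value

def warp_pixel(H: np.ndarray, u: float, v: float) -> tuple:
    """Apply Eq. 12 to a single pixel coordinate (homogeneous division)."""
    p = H @ np.array([u, v, 1.0])
    return p[0] / p[2], p[1] / p[2]
```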

Figure 14: Several examples of the projective baseline. The columns show the input, output, target, pixel-wise difference between output and target, and the score. For illustration purposes, grey values indicate the out-of-plane pixels; when computing the score, we use the background colour (black).

Appendix 0.B Additional qualitative examples

Figure 15: Examples of seen objects as rotated by the different models (columns). Each row shows a rotation of the input (leftmost), and the ground truth is given as rightmost. The green rectangle is magnified to better compare details.
Figure 16: Examples of unseen objects, similar to Fig. 15.
Figure 17: Rotating a seen object with the different models (rows, first is ground truth) over different angles (columns).
Figure 18: Another seen example like Fig. 17
Figure 19: Rotating an unseen object with the different models (rows, first is ground truth) over different angles (columns). Similar to Fig. 17 and 18, only now for objects never seen during training.
Figure 20: Another unseen example like Fig. 19.
