Semi-parametric Object Synthesis

07/24/2019 ∙ by Andrea Palazzi, et al. ∙ 2

We present a new semi-parametric approach to synthesize novel views of an object from a single monocular image. First, we exploit man-made object symmetry and piece-wise planarity to integrate rich a-priori visual information into the novel viewpoint synthesis process. An Image Completion Network (ICN) then leverages 2.5D sketches rendered from a 3D CAD as guidance to generate a realistic image. In contrast to concurrent works, we do not rely solely on synthetic data but leverage instead existing datasets for 3D object detection to operate in a real-world scenario. Differently from competitors, our semi-parametric framework allows the handling of a wide range of 3D transformations. Thorough experimental analysis against state-of-the-art baselines shows the efficacy of our method both from a quantitative and a perceptive point of view. Code and supplementary material are available at:




1 Introduction

How would you see an object from another point of view? Given a single view of an object, predicting how it would look from arbitrarily different viewpoints is definitely non-trivial for both humans and machines. This task is inherently ill-posed, as most of the 3D information is lost in the projection onto the 2D image plane. Still, according to Gardner's theory of multiple intelligences [12], the ability to perform out-of-plane transformations on 2D objects constitutes a sign of visual-spatial intelligence and a desirable property for a machine in its long journey towards human-level intelligence. Indeed, humans have been shown to perform mental transformations when making decisions about their surrounding environment [47, 2, 66], e.g. deciding whether two object figures share the same underlying three-dimensional shape despite differing by a rotation [47] or a scale factor [2].

Figure 1: From a single in-the-wild image (left) our framework can synthesize realistic novel views of the object (top) or transfer its appearance to different models (bottom).

Recently, powerful parametric deep learning models [26, 14] made it possible to frame the generation of novel viewpoints as a conditioned image synthesis problem. Despite the promising results, a number of issues are still open. First, even though synthesised images may look realistic per se, fine-grained visual appearance (e.g. texture) is lost when encoded through the network. Second, a vast amount of data is required for the network to generalize to arbitrary transformations. Consequently, most methods are trained solely on synthetic data, leading to a performance drop in real-world scenarios. Finally, in the absence of any prior information about the novel view, a fully-parametric model struggles to generalize to arbitrary viewpoint changes: indeed, most recent works are still restricted to a small set of transformations (e.g. rotating around the object at constant radius) [63, 53, 71, 39].
These shortcomings are particularly unfortunate given that a number of prior works have shown that a non-parametric approach can be a viable path to photorealism, as also pointed out by Qi et al. [42]. For instance, new images can be generated by collaging [15, 27, 7, 22, 19] or by leveraging multiple photographs to synthesize novel views via image-based rendering [5, 4, 36, 17, 37]. Still, these methods require a large amount of data at test time: entire image banks for collaging, multiple photographs and depth data for image-based rendering.
In this work, we propose an original semi-parametric approach to the problem of object novel viewpoint synthesis, which attempts to take the best from both worlds: the realism of non-parametric models, and the representational capability of purely parametric ones. The intuition behind this work is that many man-made objects exhibit a symmetric, piece-wise planar structure. Therefore, they may be approximately represented by a small set of piece-wise planar patches, which can be warped almost exactly from source to destination viewpoint via a symmetry-aware homography transformation. These warped patches provide a rich hint about the visual content of the target viewpoint, but they are far from being useful on their own. Thus, a fully-convolutional network is fed with these patches along with 2.5D CAD-rendered sketches to be used as guidance; it is then trained in a self-supervised manner to discriminate which part of the image must be completed or in-painted for the result to look realistic (see Fig. 


Figure 2: Model architecture overview. Approximately planar patches are extracted from the 2D keypoint locations. The Image Completion Network (ICN) uses the synthetic 2.5D sketches as templates to reconstruct the object's appearance from the patches in a self-supervised fashion. During training, input patches are warped back and forth to a randomly sampled viewpoint to enforce resilience against homography issues that are likely to be encountered at test time. During inference, novel views of the input object are synthesised by providing the ICN with a novel viewpoint and a (possibly different) rendered 3D model to be used as shape guideline.

Among all man-made object categories, vehicles have by far drawn most of the attention in recent literature due to their ubiquity in urban scene understanding applications [64, 54, 11, 46, 33, 29]. Our experimental evaluation will therefore focus on vehicles, leaving as future work the analysis of a broader set of object categories.
In summary, our main contributions follow:

  • We frame the problem of object novel viewpoint synthesis in a semi-parametric setting. Simple geometrical assumptions about the object shape provide rich hints about its appearance (non-parametric); this information guides a fully-convolutional network (parametric) in the synthesis process.

  • We show how our model can be trained on existing datasets for 3D object detection, with no need for paired source/target viewpoint images. Furthermore, we leverage 2D keypoints for real-world images where foreground segmentation is not provided.

  • We demonstrate how our method is able to preserve visual details (e.g. texture) and is resilient to a much wider range of 3D transformations than competitors.

A thorough experimental analysis is conducted comparing our proposal with state-of-the-art methods, considering both the quantitative and the perceptual point of view.

2 Related Work

View synthesis In just a few years, the widespread adoption of deep generative models [26, 14] has led to astounding results in different areas of image synthesis [43, 1, 67, 23, 56, 65]. In this scenario, conditional GANs [34] have been demonstrated to be a powerful tool to tackle image-to-image translation problems [20, 72, 73, 8]. Hallucinating novel views of the subject of a photo can be naturally framed as an image-to-image translation problem. For human subjects, this has been cast as predicting the subject's appearance in different poses [32, 49, 68, 49]. Fashion and surveillance domains drew most of the attention, with much progress enabled by large real-world datasets providing multiple views of the same subject [30, 69].
For rigid objects instead, this task is usually referred to as novel 3D view synthesis and additional assumptions such as object symmetry are taken into account. Starting from a single image, Yang et al. [63] showed how a recurrent convolutional network can be trained via curriculum learning to perform out-of-plane object rotation. In a similar setting, Tatarchenko et al. [53] predicted both object appearance and depth map from different viewpoints. Successive works [71, 39] trained a network to learn a symmetry-aware appearance flow to map object pixels from input to output view, re-casting the remaining synthesis as a task of image completion. All these works [63, 53, 71, 39] assume the target view to be known at training time. As this is not usually the case in the real world, these approaches are limited by the need to be trained solely on synthetic data and exhibit limited generalization in a real-world scenario.
The recent work by Zhu et al. [74] exploits cycle consistency losses to overcome the need for paired data, thus training on datasets of segmented real-world cars and chairs they gathered for the purpose. Although that work shows more realistic results, it requires pixel-level segmentation for each class of interest. In contrast, we show that already available datasets for object 3D pose estimation [62, 61] can be used for this purpose, despite the extremely rough alignment between the annotated model and the image. More importantly, we differ from all methods above as we explicitly bootstrap the image generation with visual patches warped from the source image.
3D Shape reconstruction In this frame, a few recent works [58, 74] have shown that the use of 2.5D sketches can be a viable path to bridge the gap between synthetic and real-world data. In particular, in Zhu et al. [74] the 2.5D sketch consists of both a silhouette and a depth image rendered from a learnt low-resolution voxel grid by means of a differentiable ray-tracer. While this method is appealing for its geometrical guarantees, it is limited by a number of factors: i) it requires a custom differentiable ray-tracing module; ii) the footprint of voxel-based representations scales with the cube of the resolution despite most of the information lying on the surface [51, 38]; iii) errors in the 3D voxel grid naturally propagate to the 2.5D sketch. We also follow this line of work to provide soft 3D priors to the synthesis process. However, in our semi-parametric setting 2.5D sketches are additional inputs which can be rendered from arbitrary viewpoints using standard rendering engines.
Nonparametric view synthesis In the interactive editing setting, recent works [24, 44] have shown astounding results by keeping the human in the loop and assuming a perfect (even part-level) alignment between the 3D model and the input image. As pixels are warped from the input to the target view [44], it is not feasible to transfer the object appearance to a completely different model. Moreover, the time required to synthesise the output is still far from real-time (a few seconds). In contrast, we work with a very coarse alignment between the input image and the 3D model. Since we train a deep network to map texture hints onto the output, it is possible to transfer appearance from one model to a completely different one (e.g. sedan to jeep). Finally, our method takes just a few milliseconds to hallucinate the novel object view and can thus work in real-time.
In the different setting of image synthesis from semantic layout, the recent work of Qi et al. [42] has shown that non-parametric components (i.e. a memory bank of image segments) can be integrated in a parametric image synthesis pipeline to produce impressive photo-realistic results; this effectively reduces the complexity of the task to aligning, ordering and painting these segments properly on the output canvas. Although our task is completely different, we similarly rely on image patches to provide hints to the Image Completion Network; however, our patches are not queried from a database but warped directly from the input view.

3 Model

Our goal is to develop a framework in which the visual appearance of an object can be automatically predicted from arbitrary viewpoints, given a single real-world image. More formally, the model takes as input an image of a single object viewed from the source viewpoint along with its 2D keypoints, and outputs an image depicting the same object from the destination viewpoint. For training only, the 3D CAD model aligned with the input image must also be available in order to remove the background during self-supervision. At test time only object 2D keypoints are needed (see Fig. 2).
The three main parts of our pipeline are described in this section: i) extraction of the planar patches from the object (Sec. 3.1), ii) patch warping to the destination viewpoint (Sec. 3.2) and iii) synthesis of the novel image (Sec. 3.3, 3.4 and 3.5). In the following, we assume that a bounding box around the object and 2D keypoints have been provided by off-the-shelf detectors, e.g. [16, 41]; we focus on the overall pipeline for object synthesis instead.

3.1 Keypoint-based decomposition into planar patches

We leverage 2D keypoints to approximate the visible object with a simple polyhedron with a small set of faces. Since keypoints mark characteristic locations in the object shape (e.g. corners), a face defined from at least three of those carries a semantic meaning (e.g. the roof of a car). Exploiting 2D keypoints to find object faces is appealing for a number of reasons. First, it is straightforward to compute the homography matrix between planes in different viewpoints, since computing correspondences between two sets of keypoints is trivial. Furthermore, a number of datasets provide object landmark annotations in real-world scenarios (e.g. [28, 62, 61, 57, 59]) and robust keypoint detection methods exist [55, 41, 16].
Specifically, for each source image an array of 2D keypoints is available, being the category-specific number of keypoints (e.g. for vehicles). From these a set of planar patches can be defined as:


where each patch has a subset of object keypoints as vertices. For vehicles we choose four patches, namely the left, right, front and back sides.

3.2 Warping and dewarping

Warping patches Source patches are warped to the destination viewpoint to get a set of warped patches that are employed to bootstrap the novel viewpoint synthesis. To this end, we define the destination viewpoint to be an arbitrary rigid transformation of the camera:


Locations of 2D keypoints in the novel viewpoint can be now computed by using a pinhole camera model as:


Here the virtual intrinsic camera matrix has square pixels and its principal point at the image center, and the 3D keypoints are those of the CAD model. It is worth noting that the CAD model is not constrained to be the same as the one in the source image; it only has to feature the same keypoints (see appearance transfer in Fig. 8). A homography matrix relating planar surfaces in the two views is then estimated from correspondences between the source keypoints and their projected locations in the destination viewpoint. In this way patches in the destination viewpoint (warped patches from now on) can be computed via matrix multiplication:
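The projection and homography steps described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the intrinsic matrix values, the 30° rotation and the four coplanar 3D points are all hypothetical, and the homography is estimated with a plain Direct Linear Transform, which is only one possible estimator.

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project Nx3 world points to Nx2 pixels with a pinhole camera."""
    cam = points_3d @ R.T + t           # world -> camera coordinates
    uvw = cam @ K.T                     # camera -> homogeneous pixel coordinates
    return uvw[:, :2] / uvw[:, 2:3]

def estimate_homography(src, dst):
    """Direct Linear Transform from at least 4 correspondences (Nx2 each)."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    H = vt[-1].reshape(3, 3)            # null vector of the constraint matrix
    return H / H[2, 2]

def apply_homography(H, pts):
    """Map Nx2 pixel coordinates through the 3x3 homography H."""
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

# Hypothetical virtual camera: square pixels, principal point at the
# center of a 128x128 image.
K = np.array([[128.0, 0.0, 64.0],
              [0.0, 128.0, 64.0],
              [0.0, 0.0, 1.0]])

# Hypothetical planar face defined by four coplanar 3D keypoints.
patch_3d = np.array([[-1.0, -0.5, 0.0], [1.0, -0.5, 0.0],
                     [1.0, 0.5, 0.0], [-1.0, 0.5, 0.0]])
t = np.array([0.0, 0.0, 5.0])
angle = np.deg2rad(30.0)                # assumed rotation of the destination view
R_dst = np.array([[np.cos(angle), 0.0, np.sin(angle)],
                  [0.0, 1.0, 0.0],
                  [-np.sin(angle), 0.0, np.cos(angle)]])

src_px = project_points(patch_3d, K, np.eye(3), t)   # keypoints in source view
dst_px = project_points(patch_3d, K, R_dst, t)       # keypoints in novel view
H = estimate_homography(src_px, dst_px)
```

Because the four keypoints are coplanar, `apply_homography(H, src_px)` reproduces `dst_px`, and any other point lying on the same plane is mapped consistently between the two views.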


De-warping patches Since the dataset does not provide paired views, it is not possible to supervise the destination image; hence, we propose to train the network in a self-supervised manner, by forcing it to reconstruct the source image from the source patches. Nonetheless, this would create a distribution shift between the data fed to the network during training and inference stages. In fact, while the source patches are perfectly aligned with the source image (they are a subset of it), the warped patches might be affected by homography failures and interpolation errors among other issues. To alleviate this shift, we train the network to reconstruct the source image from a third set of patches (called dewarped patches in what follows), obtained by warping the source patches to a randomly sampled viewpoint and then back to the source viewpoint:


In this way the network learns to cope with possible transformation errors and cannot simply short-circuit input patches to the output. The importance of this dewarping trick for a well-behaved network training is highlighted in Sec. 4.1.
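A toy NumPy sketch of the dewarping idea follows: a patch is warped with a homography and then warped back with its inverse, so the result is aligned with the original but carries resampling errors and missing pixels. The nearest-neighbour sampler, the function names and the example homography are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def warp_image(img, H, fill=0.0):
    """Warp an HxW (or HxWxC) image with homography H.

    Uses inverse mapping with nearest-neighbour sampling; output pixels
    whose pre-image falls outside the input are set to `fill`.
    """
    h, w = img.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    dst = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # 3 x N
    src = np.linalg.inv(H) @ dst                              # pre-images
    sx = np.rint(src[0] / src[2]).astype(int)
    sy = np.rint(src[1] / src[2]).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.full(img.shape, fill, dtype=float)
    out_flat = out.reshape(h * w, -1)                         # view on `out`
    img_flat = img.reshape(h * w, -1)
    out_flat[valid] = img_flat[sy[valid] * w + sx[valid]]
    return out

def dewarp(patch, H):
    """Warp a patch forth with H and back with its inverse."""
    return warp_image(warp_image(patch, H), np.linalg.inv(H))

rng = np.random.default_rng(0)
patch = rng.random((32, 32))

# Hypothetical mild transformation standing in for a randomly sampled viewpoint.
H_rand = np.array([[1.05, 0.08, -2.0],
                   [-0.06, 0.97, 1.5],
                   [0.0, 0.0, 1.0]])
dewarped = dewarp(patch, H_rand)
```

With the identity homography the patch is returned unchanged, while any other transformation leaves borders empty and shifts some pixels, which is the kind of imperfection the network is meant to become robust to.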

Figure 3: Visual results comparison with competitors and ablated versions of the proposed method on Pascal3D+ test set. Please refer to Sec. 4.1 for details.

3.3 Leveraging 2.5D sketches

While image patches carry rich information about the appearance of the object, they bear few cues about the object shape. In other words, visual aspect and shape are disentangled by design. This is a desirable property enabling multiple applications which require changing one of the two while keeping the other fixed. In this section we propose a method to constrain the synthesised object shape. We assume a set of CAD models which approximate the intra-class variation for the current object class, each being a 3D mesh. The number of CADs needed to cover the intra-class variation reasonably depends on the object category, but it is often relatively low ( for the vehicle class in the Pascal3D+ dataset [62]). Each training example is thus composed of an image and its associated viewpoint and CAD index. Therefore, a virtual camera can be used to render the CAD from that viewpoint. In particular, following [58], we render the 2.5D sketch of CAD surface normals along with a coarse material-based part segmentation:


which provides rich information about the object’s 3D shape. During training, this 2.5D sketch is fed to the image completion network together with de-warped patches to reconstruct . It is worth noticing that these additional data come for free, as they require only synthetic sources (i.e. the object CAD in our method).

3.4 Appearance prior

Our method relies on warped patches to transfer the object appearance from a source to a destination viewpoint. Still it can happen that viewpoints and are so far apart that an object shares no visible faces across the two even with symmetry constraints (e.g. front to back). To alleviate this issue, we crop from the input image a small patch with side of the image size and give it as an additional input to the image completion network as a prior knowledge about the rough object appearance in absence of other hints.

3.5 Image Completion Network

The Image Completion Network (ICN) is a fully convolutional network parametrized by trained to reconstruct a realistic image from dewarped patches , 2.5D sketches and appearance prior :



Our fully convolutional network resembles the generator networks introduced in [72] and employed, among others, in [74]. Our reasons for this choice are twofold. On the one hand, this generator network has been designed for similar tasks, achieving state-of-the-art performance and being employed by a wide community. On the other hand, using a comparable architecture allows an equitable comparison with the state-of-the-art, since differences in results cannot be attributed to the representational power of the network.


A number of recent works [21, 6, 42] indicate that loss functions based on high-level features extracted from pretrained networks can lead to much more realistic results compared to naive per-pixel losses between the output and ground-truth image. Given a set of layers from a network and a training pair consisting of a real and a generated image, we define the perceptual loss function as


Where is the ICN. We employ each second convolutional layer of each block in VGG-19 [50] as feature extractor . Following [6] we set such that the expected contribution of each term is approximately the same for each layer.
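As a rough illustration of a layer-weighted perceptual loss, the sketch below sums a weighted distance between feature maps of the real and generated images. The L1 distance, the mean normalisation, the toy feature shapes and the weights are assumptions; in practice the features would come from the chosen VGG-19 layers rather than from random arrays.

```python
import numpy as np

def perceptual_loss(feats_real, feats_fake, weights):
    """Weighted sum over layers of the mean absolute feature difference.

    feats_real, feats_fake: lists of arrays, one per feature layer.
    weights: one scalar per layer, balancing the layer contributions.
    """
    total = 0.0
    for f_real, f_fake, lam in zip(feats_real, feats_fake, weights):
        total += lam * np.mean(np.abs(f_real - f_fake))
    return float(total)

# Toy stand-ins for features extracted at three layers (hypothetical shapes).
rng = np.random.default_rng(0)
feats_real = [rng.random((64, 32, 32)), rng.random((128, 16, 16)), rng.random((256, 8, 8))]
feats_fake = [f + 0.1 * rng.standard_normal(f.shape) for f in feats_real]
layer_weights = [1.0, 0.5, 0.25]    # placeholder values, not from the paper

loss = perceptual_loss(feats_real, feats_fake, layer_weights)
```

The per-layer weights play the balancing role described in the text: a layer whose raw feature differences are typically larger receives a smaller weight, so that no single layer dominates the sum.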
As mentioned above, images generated from novel viewpoints cannot be directly supervised if the dataset does not provide paired views. Nevertheless, we can still enforce the realism of ICN output in an adversarial fashion. Given a generic image synthesised by ICN either in the source () or the destination () viewpoint, we set up a min-max game as follows:


where D is the discriminator network from [72] aiming to distinguish between real and synthesised images. Our total loss is defined as:


where modulates the contribution of the adversarial term.

4 Experiments

Figure 4: Results of 360° rotation. Our output is consistent for the whole rotation circle. Best viewed zoomed on screen.
Figure 5: Predictions of our model from different viewpoints. The geometry-aware design of our semi-parametric method allows the model to be resilient to large viewpoint variations, including rotation, elevation and camera distance. Best viewed on screen.

Datasets Large-scale 3D shape repositories providing object geometries such as Princeton Shape Benchmark [48] and Shapenet [3] exist, but do not come with aligned real-world images. In contrast, large real-world datasets for 3D object detection and pose estimation such as Pascal3D+ [62] and Objectnet3D [61] provide 3D shape annotations (roughly [52]) aligned with in-the-wild images. In the following we use the Pascal3D+ dataset.
Competitors We evaluate our method against three competitive baselines which are currently state-of-the-art in the task of novel viewpoint synthesis. The first one is Visual Object Networks [74], an adversarial learning framework in which object shape, viewpoint and texture are treated as three conditionally independent factors that contribute to the synthesis of the novel viewpoint. The pre-trained model released by the authors is denoted VON in what follows. Since VON is originally trained on a custom car dataset collected by the authors [74], for a fair comparison we implement a second baseline (the fine-tuned VON) by fine-tuning their network on Pascal3D+. Lastly, we compare to Variational U-Net for Conditional Appearance and Shape Generation [10] (VUnet in the following), a state-of-the-art framework for conditional image generation based on variational autoencoders [26]. In the original implementation [10] a U-net [45] architecture is fed with keypoint-based skeletons to perform pose-guided human generation. We re-train their model on Pascal3D+ to perform pose-guided object generation. To ensure the setting is the same, we feed their shape-encoding network with our 2.5D sketches, instead of the skeletons employed in the original implementation.

To maximize evaluation fairness, in what follows we only sample novel viewpoints rotating around the z-axis at fixed distance and elevation, which is the only setting handled by competitors. Still, our method can handle more general roto-translations, as showcased in Fig. 5.

Implementation details The 2D bounding box of each example of Pascal3D+ [62] is padded to a square aspect ratio and resized to 128x128 pixels. We work in LAB space relying on the training procedure from [60]. Following [55, 41], truncated and occluded objects are discarded, resulting in 4081 training and 1042 testing examples respectively. Our model is trained for 150 epochs with batch size 10. During training both input and target undergo small random rotations, translations and shearing for data augmentation purposes. We use the Adam [25] optimizer with initial learning rate and halve it every 25 epochs. Loss balancing term is set to . The code is developed in PyTorch [40]; we depend on the Open3D library [70] for 3D data manipulation and rendering. Random search has been employed for hyper-parameter tuning.
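The step schedule described above (halving every 25 epochs over 150 epochs) can be written as a one-line function. The function name is hypothetical, and since the initial learning rate is not recoverable from the text, `base_lr` is left as a parameter and the value used below is only a placeholder.

```python
def learning_rate(epoch, base_lr, step=25, factor=0.5):
    """Learning rate at a given (0-indexed) epoch for a step-decay schedule."""
    return base_lr * factor ** (epoch // step)

# Placeholder base rate of 1.0 so the printed values are the decay multipliers.
schedule = [learning_rate(e, base_lr=1.0) for e in range(150)]
```

In PyTorch the same behaviour is available through `torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.5)`, stepping the scheduler once per epoch.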

4.1 Visual Results

Our model produces high-quality results for a variety of camera viewpoints, preserving fine-grained object appearance, as can be seen in Figures 3, 4 and 5.

Method  P3D+   0°     30°    60°    90°    120°   150°   180°   210°   240°   270°   300°   330°   Avg
ours    43.1   166.7  53.4   42.7   46.5   48.2   58.5   177.6  55.7   45.9   45.6   42.3   55.5   69.89
VUnet   72.4   202.4  90.7   79.9   88.3   78.8   96.3   203.3  94.0   77.5   85.2   82.0   92.6   105.9
VON     86.4   134.1  107.2  150.0  126.5  124.4  114.4  151.7  113.8  127.2  128.9  132.0  107.3  126.4
VON     81.9   165.2  100.6  137.8  125.9  137.6  108.1  190.5  155.7  134.8  123.1  117.1  102.1  133.2
Table 1: Fréchet Inception Distance [18] results. Each row reports the average distance between real and generated images for the method on the left. The first column lists FID scores when novel viewpoints are sampled from the Pascal3D+ distribution. On the right side, viewpoints are sampled while rotating around the object at fixed elevation and distance, and FID results are reported for 12 azimuthal angle bins. Please refer to Sec. 4.2 for details.
Comparison          Car (plain)   Car (textured)   Avg
ours > VUnet [10]   76.0%         85.0%            78.0%
ours > VON [74]     88.0%         98.0%            91.0%
ours > VON [74]     96.0%         99.0%            97.0%
Table 2: Blind randomized A/B test results. Each row lists the percentage of workers who preferred the novel viewpoint generated with our method with respect to each baseline (chance is at 50%).

Competitors The key differences between the proposed method and baselines can be appreciated in Fig. 3. The output from Visual Object Networks [74] (VON) is generally realistic, but hardly reflects the visual appearance of the input. Furthermore, both the pre-trained and the fine-tuned VON generator networks do not generalize to poses which are less common in the training set such as the frontal pose in (d). VUnet [10] suffers from the blurred results typical of variational autoencoders [13]; also, due to skip connections, input appearance may leak to the output when the two viewpoints are very different (b, d). More generally, the drawbacks of a solely learning-based viewpoint synthesis, which all three competitors share, are evident in (a, c): complex textures cannot be recovered once compressed in a feature vector.

Ablation Ablated versions of our model are shown in the fourth to eighth columns. First we investigate the contribution of the dewarping trick presented in Sec. 3.2. In the No-dewarp column, the ICN was trained to reconstruct the image from the source patches instead of the dewarped ones (see Sec. 3.1). As expected, despite the very low reconstruction error at training time (due to the similarity between the source patches and the source image), the model fails to generalize to the synthesis of novel viewpoints where the textures are the result of a homography transformation. The effect of removing the appearance prior is showcased in the No-prior column. Without prior information, the ICN fails to infer the object appearance when no planar patch is provided, as shown in (b, e). Removing the adversarial term (No-adv column) results in slightly blurred outputs. Finally, Sil-only and Normal-only show ablated versions in which the input sketches consist only of the 2D silhouette and the 2.5D surface normals respectively. Although results do not differ dramatically, it can be appreciated how the network benefits from additional information to resolve ambiguous situations such as self-occlusions (e) and details such as side windows, lights, wheels (a-e).

Results for appearance transfer are shown in Fig. 8. In this setting, the network is asked to complete the warped faces using the 2.5D sketch rendered from a totally different CAD. It can be appreciated how novel viewpoints are still realistic, since the network exploits the 2.5D sketch to complete the warped appearance in a CAD-agnostic manner. Common failure cases, mostly coming from errors in 2D keypoints or from homography failures, are showcased in Fig. 9: please refer to the caption for details.

4.2 Metrics Evaluation

Fréchet Inception Distance To quantitatively measure the similarity between generated and real images we rely on Fréchet Inception Distance (FID), which was shown to consistently correlate with human judgment [18, 31]. We employ activations from the last convolutional layer of an InceptionV3 model pretrained on ImageNet [9] to compute the FID in two scenarios. In the first one, novel viewpoints used to generate the images are sampled from Pascal3D+ annotations. Since real images come from the same viewpoint distribution, this setting indicates the minimum distance that a method can achieve; results are reported in Table 1, first column (P3D+). In the second scenario, we sample novel viewpoints while rotating around the object at fixed distance and elevation: results are reported in Table 1, binned in 12 equidistant azimuthal angles. Fréchet Inception Distance rewards the realism we can get with our semi-parametric approach; in both settings our method outperforms competitors by a large margin.
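For reference, the Fréchet distance between two sets of activations can be sketched in NumPy as below, using the standard formula ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)). This is a generic sketch rather than the evaluation code used in the paper: the random arrays stand in for InceptionV3 activations, and the trace of the matrix square root is obtained from the eigenvalues of S1 S2 instead of an explicit matrix square root.

```python
import numpy as np

def frechet_distance(act_real, act_fake):
    """Frechet distance between Gaussians fitted to two activation sets.

    act_real, act_fake: arrays of shape (num_samples, num_features).
    """
    mu1, mu2 = act_real.mean(axis=0), act_fake.mean(axis=0)
    s1 = np.cov(act_real, rowvar=False)
    s2 = np.cov(act_fake, rowvar=False)
    # Tr((S1 S2)^(1/2)) equals the sum of square roots of the eigenvalues
    # of S1 S2, which are real and non-negative for covariance matrices.
    eigvals = np.linalg.eigvals(s1 @ s2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1) + np.trace(s2) - 2.0 * tr_sqrt)

# Toy activations (hypothetical): 500 samples with 8 features each.
rng = np.random.default_rng(0)
act_a = rng.standard_normal((500, 8))
act_b = act_a + 3.0     # same covariance, mean shifted by 3 in every feature

fd_same = frechet_distance(act_a, act_a)
fd_shift = frechet_distance(act_a, act_b)
```

Identical sets give a distance of (numerically) zero, while shifting every feature mean by 3 leaves the covariance terms cancelling and yields the squared mean difference, 8 x 3^2 = 72.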

4.3 Perceptual Experiments

Figure 6: Results of time-limited A/B preference test against real images. Both the pre-trained VON and our method are resilient to human judgement over time. Please refer to Sec. 4.3 for details.

To assess the quality of our results also from a perceptual point of view, randomized A/B preference tests were performed by 43 human workers, following the experimental protocol of previous works [6, 42, 74]. As we want to evaluate both the realism and the appearance coherence of our method, we perform two different tests. For both experiments, images are all shown at the same resolution of 128x128 pixels. As all methods produce a white background, ground truth aligned 3D CAD is used to mask Pascal3D+ real images. Both sampling order and left-right order of A and B are randomized.

In the first setting, the subject is presented with three images: while the first one comes from Pascal3D+ test set, A and B depict a novel viewpoint of the object generated with two different methods. The human worker is then asked whether rotating the input object would better lead to A or B. This setting rewards the method producing images which are more consistent with human’s expectation after a mental rotation of the first object. Results reported in Tab. 2 indicate that our method is largely preferred to competitors, likely because of the built-in realism that comes from warping the original image. As a further analysis we split by manual annotation Pascal3D+ images into plain and textured sets, the latter set containing vehicles which feature characteristic textures. Table 2 highlights that results on the two sets are significantly different, with workers expressing almost unanimous preference for our method on the textured set. The fact that human attention was caught by these appearance details highlights the importance of preserving fine-grained details in the synthesized output.

The second experiment consists of a two-alternative forced choice aimed at evaluating the relative realism of each method. Here the subject is presented with only two images for a fixed amount of time. The worker is then asked which of the two appeared more realistic. The experiment is then repeated by varying the amount of time the worker can spend on each pair of images. Results depicted in Figure 6 are twofold. On the one hand, workers clearly discern VUnet and fine-tuned VON images from real ones as more time is available. VUnet is hurt by excessive blur and visual artifacts; the fine-tuned VON suffers from a severe loss of realism w.r.t. the original VON method, which may be related to the great variety of viewpoints in the Pascal3D+ dataset compared with the one used in Zhu et al. [74]. On the other hand, both the original VON and our method produce realistic images that workers struggle to distinguish from the real ones even in 8000ms.

Figure 7: Artificial data generated by stitching synthesised vehicles onto Pascal3D+ [62] backgrounds. See Sec. 4.4 for details.
Figure 8: Appearance transfer to different 3D models; it can be appreciated how the network performs a reasonable transfer of the object's appearance even when the destination 3D model is very different from the input one.
Figure 9: Failure cases. The model can fail when the input patches are grossly wrong, either due to errors in keypoint estimation (top) or to homography failure (middle). Lastly, when the training dataset provides a very coarse alignment between the 3D models and the images, such as Pascal3D+ [62], in some cases the alignment error is learnt as well (bottom - see the back wheel).

4.4 Applications

Novel viewpoint synthesis Our method can be employed to generate realistic novel views of an object from an arbitrary viewpoint and distance, as depicted in Figures 4 and 5.

Disentangled appearance and shape editing Our approach makes it possible to edit object appearance and shape in a disentangled manner. Thus shape can be changed while preserving appearance (Fig. 8) or the other way around (Fig. 7).

Data augmentation Our model may be used to generate realistic synthetic data. To support this, we generate ten synthetic images from each one in Pascal3D+ [62] training set and employ them to augment training data for a stacked-hourglass network [35, 41] in the task of keypoint localization. In this setting, training solely on synthetic images achieves of Percentage of Correct Keypoints (PCK) on the real test set, compared to when training on real data. However, pre-training on generated data and fine-tuning on real ones boosts the PCK by to , all other hyper-parameters being the same.
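The PCK metric mentioned above can be sketched as follows. The threshold convention (a fraction `alpha` of a per-image reference size such as the bounding-box side) and the function signature are assumptions, since the exact protocol is not stated here; keypoint visibility is also ignored for brevity.

```python
import numpy as np

def pck(pred, gt, ref_size, alpha=0.1):
    """Percentage of Correct Keypoints.

    pred, gt: arrays of shape (num_images, num_keypoints, 2) in pixels.
    ref_size: per-image reference length, shape (num_images,).
    A keypoint counts as correct if its error is at most alpha * ref_size.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)            # (N, K) pixel errors
    thresh = alpha * np.asarray(ref_size, dtype=float)[:, None]
    return float((dist <= thresh).mean())

# Toy example: one 128-pixel image with four keypoints (hypothetical values).
gt = np.zeros((1, 4, 2))
pred = np.array([[[0.0, 0.0], [5.0, 0.0], [0.0, 20.0], [30.0, 0.0]]])
score = pck(pred, gt, ref_size=[128.0])   # threshold is 12.8 pixels
```

Here two of the four predictions fall within the 12.8-pixel threshold, so the score is 0.5.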

5 Conclusions

In this work we have presented a semi-parametric approach to novel viewpoint synthesis for objects. In our framework, non-parametric visual hints act as prior information for a deep parametric model that synthesizes realistic images while disentangling appearance and shape. Both the perceptual experiments and the image-quality metrics reward our method for its realism and for the visual consistency of the synthesized object across arbitrary points of view. Still, a number of improvements can already be foreseen and are left as future work, such as loosening the assumptions on input shape to handle more complex (deformable) objects, or training the whole system end-to-end by leveraging spatial transformer layers and differentiable rendering.

References


  • [1] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
  • [2] C. Bundesen and A. Larsen. Visual transformation of size. Journal of Experimental Psychology: Human Perception and Performance, 1(3):214, 1975.
  • [3] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
  • [4] G. Chaurasia, S. Duchene, O. Sorkine-Hornung, and G. Drettakis. Depth synthesis and local warps for plausible image-based navigation. ACM Transactions on Graphics (TOG), 32(3):30, 2013.
  • [5] G. Chaurasia, O. Sorkine, and G. Drettakis. Silhouette-aware warping for image-based rendering. In Computer Graphics Forum, volume 30, pages 1223–1232. Wiley Online Library, 2011.
  • [6] Q. Chen and V. Koltun. Photographic image synthesis with cascaded refinement networks. In IEEE International Conference on Computer Vision (ICCV), volume 1, page 3, 2017.
  • [7] T. Chen, M.-M. Cheng, P. Tan, A. Shamir, and S.-M. Hu. Sketch2photo: Internet image montage. In ACM transactions on graphics (TOG), volume 28, page 124. ACM, 2009.
  • [8] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018.
  • [9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. 2009.
  • [10] P. Esser, E. Sutter, and B. Ommer. A variational u-net for conditional appearance and shape generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8857–8866, 2018.
  • [11] D. Feng, L. Rosenbaum, and K. Dietmayer. Towards safe autonomous driving: Capture uncertainty in the deep neural network for lidar 3d vehicle detection. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 3266–3273. IEEE, 2018.
  • [12] H. Gardner. Frames of mind: The theory of multiple intelligences. Hachette UK, 2011.
  • [13] I. Goodfellow, Y. Bengio, and A. Courville. Deep learning. MIT press, 2016.
  • [14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [15] J. Hays and A. A. Efros. Scene completion using millions of photographs. ACM Transactions on Graphics (TOG), 26(3):4, 2007.
  • [16] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  • [17] P. Hedman, T. Ritschel, G. Drettakis, and G. Brostow. Scalable inside-out image-based rendering. ACM Transactions on Graphics (TOG), 35(6):231, 2016.
  • [18] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
  • [19] P. Isola and C. Liu. Scene collaging: Analysis and synthesis of natural images with semantic layers. In Proceedings of the IEEE International Conference on Computer Vision, pages 3048–3055, 2013.
  • [20] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976. IEEE, 2017.
  • [21] J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pages 694–711. Springer, 2016.
  • [22] M. K. Johnson, K. Dale, S. Avidan, H. Pfister, W. T. Freeman, and W. Matusik. Cg2real: Improving the realism of computer generated images using a large collection of photographs. IEEE Transactions on Visualization and Computer Graphics, 17(9):1273–1285, 2011.
  • [23] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. International Conference on Learning Representations (ICLR), 2018.
  • [24] N. Kholgade, T. Simon, A. Efros, and Y. Sheikh. 3d object manipulation in a single photograph using stock 3d models. ACM Transactions on Graphics (TOG), 33(4):127, 2014.
  • [25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
  • [26] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR), 2014.
  • [27] J.-F. Lalonde, D. Hoiem, A. A. Efros, C. Rother, J. Winn, and A. Criminisi. Photo clip art. In ACM transactions on graphics (TOG), volume 26, page 3. ACM, 2007.
  • [28] J. J. Lim, H. Pirsiavash, and A. Torralba. Parsing IKEA Objects: Fine Pose Estimation. ICCV, 2013.
  • [29] X. Liu, W. Liu, H. Ma, and H. Fu. Large-scale vehicle re-identification in urban surveillance videos. In 2016 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2016.
  • [30] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1096–1104, 2016.
  • [31] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet. Are gans created equal? a large-scale study. In Advances in neural information processing systems, pages 698–707, 2018.
  • [32] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool. Pose guided person image generation. In Advances in Neural Information Processing Systems, pages 406–416, 2017.
  • [33] P. A. Marín-Reyes, L. Bergamini, J. Lorenzo-Navarro, A. Palazzi, S. Calderara, and R. Cucchiara. Unsupervised vehicle re-identification using triplet networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 166–1665. IEEE, 2018.
  • [34] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [35] A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
  • [36] R. Ortiz-Cayon, A. Djelouah, and G. Drettakis. A bayesian approach for selective image-based rendering using superpixels. In International Conference on 3D Vision-3DV, 2015.
  • [37] R. Ortiz-Cayon, A. Djelouah, F. Massa, M. Aubry, and G. Drettakis. Automatic 3d car model alignment for mixed image-based rendering. In 2016 Fourth International Conference on 3D Vision (3DV), pages 286–295. IEEE, 2016.
  • [38] A. Palazzi, L. Bergamini, S. Calderara, and R. Cucchiara. End-to-end 6-dof object pose estimation through differentiable rasterization. In Second Workshop on 3D Reconstruction Meets Semantics (3DRMS), 2018.
  • [39] E. Park, J. Yang, E. Yumer, D. Ceylan, and A. C. Berg. Transformation-grounded image generation network for novel 3d view synthesis. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 702–711. IEEE, 2017.
  • [40] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
  • [41] G. Pavlakos, X. Zhou, A. Chan, K. G. Derpanis, and K. Daniilidis. 6-dof object pose from semantic keypoints. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2011–2018. IEEE, 2017.
  • [42] X. Qi, Q. Chen, J. Jia, and V. Koltun. Semi-parametric image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8808–8816, 2018.
  • [43] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. International Conference on Learning Representations (ICLR), 2016.
  • [44] K. Rematas, C. H. Nguyen, T. Ritschel, M. Fritz, and T. Tuytelaars. Novel views of objects from a single image. IEEE transactions on pattern analysis and machine intelligence, 39(8):1576–1590, 2017.
  • [45] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [46] Y. Shen, T. Xiao, H. Li, S. Yi, and X. Wang. Learning deep neural networks for vehicle re-id with visual-spatio-temporal path proposals. In Proceedings of the IEEE International Conference on Computer Vision, pages 1900–1909, 2017.
  • [47] R. N. Shepard and J. Metzler. Mental rotation of three-dimensional objects. Science, 171(3972):701–703, 1971.
  • [48] P. Shilane, P. Min, M. Kazhdan, and T. Funkhouser. The princeton shape benchmark. In Proceedings Shape Modeling Applications, 2004., pages 167–178. IEEE, 2004.
  • [49] A. Siarohin, E. Sangineto, S. Lathuilière, and N. Sebe. Deformable gans for pose-based human image generation. In CVPR 2018-Computer Vision and Pattern Recognition, 2018.
  • [50] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [51] A. Sinha, A. Unmesh, Q. Huang, and K. Ramani. Surfnet: Generating 3d shape surfaces using deep residual networks. In IEEE CVPR, volume 1, 2017.
  • [52] X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B. Tenenbaum, and W. T. Freeman. Pix3d: Dataset and methods for single-image 3d shape modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2974–2983, 2018.
  • [53] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Multi-view 3d models from single images with a convolutional network. In European Conference on Computer Vision, pages 322–337. Springer, 2016.
  • [54] M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun. Multinet: Real-time joint semantic reasoning for autonomous driving. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 1013–1020. IEEE, 2018.
  • [55] S. Tulsiani and J. Malik. Viewpoints and keypoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1510–1519, 2015.
  • [56] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8798–8807, 2018.
  • [57] Z. Wang, L. Tang, X. Liu, Z. Yao, S. Yi, J. Shao, J. Yan, S. Wang, H. Li, and X. Wang. Orientation invariant feature embedding and spatial temporal regularization for vehicle re-identification. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [58] J. Wu, Y. Wang, T. Xue, X. Sun, B. Freeman, and J. Tenenbaum. Marrnet: 3d shape reconstruction via 2.5 d sketches. In Advances in neural information processing systems, pages 540–550, 2017.
  • [59] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman. 3d interpreter networks for viewer-centered wireframe modeling. International Journal of Computer Vision, 126(9):1009–1026, 2018.
  • [60] W. Xian, P. Sangkloy, V. Agrawal, A. Raj, J. Lu, C. Fang, F. Yu, and J. Hays. Texturegan: Controlling deep image synthesis with texture patches. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8456–8465, 2018.
  • [61] Y. Xiang, W. Kim, W. Chen, J. Ji, C. Choy, H. Su, R. Mottaghi, L. Guibas, and S. Savarese. Objectnet3d: A large scale database for 3d object recognition. In European Conference on Computer Vision, pages 160–176. Springer, 2016.
  • [62] Y. Xiang, R. Mottaghi, and S. Savarese. Beyond pascal: A benchmark for 3d object detection in the wild. In Applications of Computer Vision (WACV), 2014 IEEE Winter Conference on, pages 75–82. IEEE, 2014.
  • [63] J. Yang, S. E. Reed, M.-H. Yang, and H. Lee. Weakly-supervised disentangling with recurrent transformations for 3d view synthesis. In Advances in Neural Information Processing Systems, pages 1099–1107, 2015.
  • [64] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang. Denseaspp for semantic segmentation in street scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3684–3692, 2018.
  • [65] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang. Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [66] J. M. Zacks, J. Mires, B. Tversky, and E. Hazeltine. Mental spatial transformations of objects and perspective. Spatial Cognition and Computation, 2(4):315–332, 2000.
  • [67] H. Zhang, T. Xu, H. Li, S. Zhang, X. Wang, X. Huang, and D. N. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 5907–5915, 2017.
  • [68] B. Zhao, X. Wu, Z.-Q. Cheng, H. Liu, Z. Jie, and J. Feng. Multi-view image generation from a single-view. In 2018 ACM Multimedia Conference, pages 383–391. ACM, 2018.
  • [69] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In Computer Vision, IEEE International Conference on, 2015.
  • [70] Q.-Y. Zhou, J. Park, and V. Koltun. Open3D: A modern library for 3D data processing. arXiv:1801.09847, 2018.
  • [71] T. Zhou, S. Tulsiani, W. Sun, J. Malik, and A. A. Efros. View synthesis by appearance flow. In European conference on computer vision, pages 286–301. Springer, 2016.
  • [72] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. IEEE International Conference on Computer Vision (ICCV), 2017.
  • [73] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.
  • [74] J.-Y. Zhu, Z. Zhang, C. Zhang, J. Wu, A. Torralba, J. Tenenbaum, and B. Freeman. Visual object networks: image generation with disentangled 3d representations. In Advances in Neural Information Processing Systems, pages 118–129, 2018.