Semi-parametric Object Synthesis. A semi-parametric approach for synthesizing novel views of an object from a single monocular image.
We present a new semi-parametric approach to synthesize novel views of an object from a single monocular image. First, we exploit man-made object symmetry and piece-wise planarity to integrate rich a-priori visual information into the novel viewpoint synthesis process. An Image Completion Network (ICN) then leverages 2.5D sketches rendered from a 3D CAD model as guidance to generate a realistic image. In contrast to concurrent works, we do not rely solely on synthetic data but instead leverage existing datasets for 3D object detection to operate in a real-world scenario. Differently from competitors, our semi-parametric framework can handle a wide range of 3D transformations. Thorough experimental analysis against state-of-the-art baselines shows the efficacy of our method both from a quantitative and a perceptual point of view. Code and supplementary material are available at: https://github.com/ndrplz/semiparametric
How would you see an object from another point of view? Given a single view of an object, predicting how it would look from arbitrarily different viewpoints is non-trivial for both humans and machines. The task is inherently ill-posed, as most of the 3D information is lost in the projection onto the 2D image plane.
Still, according to Gardner's theory of multiple intelligences, the ability to perform out-of-plane transformations on 2D objects constitutes a sign of visual-spatial intelligence and a desirable property for a machine on its long journey towards human-level intelligence. Indeed, humans have been shown to perform mental transformations when making decisions about their surrounding environment [47, 2, 66], e.g. deciding whether two object figures share the same underlying three-dimensional shape despite differing by a rotation or a scale factor.
Recently, powerful parametric deep learning models [26, 14] made it possible to frame the generation of novel viewpoints as a conditioned image synthesis problem. Despite the promising results, a number of issues remain open. First, even though synthesised images may look realistic per se, fine-grained visual appearance (e.g. texture) is lost when encoded through the network. Second, a vast amount of data is required for the network to generalize to arbitrary transformations. Consequently, most methods are trained solely on synthetic data, leading to a performance drop in real-world scenarios. Finally, in the absence of any prior information about the novel view, a fully-parametric model struggles to generalize to arbitrary viewpoint changes: indeed, most recent works still restrict themselves to a small set of transformations (e.g. rotating around the object at constant radius) [63, 53, 71, 39].
In this work, we propose a semi-parametric approach to the problem of object novel viewpoint synthesis, which attempts to take the best from both worlds: the realism of non-parametric models and the representational capability of purely parametric ones. The intuition behind this work is that many man-made objects exhibit a symmetric, piece-wise planar structure. Therefore, they may be approximately represented by a small set of piece-wise planar patches, which can be warped almost exactly from source to destination viewpoint via a symmetry-aware homography transformation. These warped patches provide a rich hint about the visual content of the target viewpoint, but they are far from useful on their own. Thus, a fully-convolutional network is fed with these patches along with 2.5D CAD-rendered sketches to be used as guidance; it is then trained in a self-supervised manner to discriminate which parts of the image must be completed or in-painted for the result to look realistic (see Fig. 2).
Among all man-made object categories, that of vehicles has by far drawn most of the attention in recent literature, due to their ubiquity in urban scene understanding applications [64, 54, 11, 46, 33, 29]. Our experimental evaluation will therefore focus on vehicles, leaving the analysis of a broader set of object categories as future work.
We frame the problem of object novel viewpoint synthesis in a semi-parametric setting. Simple geometrical assumptions about the object shape provide rich hints about its appearance (non-parametric); this information guides a fully-convolutional network (parametric) in the synthesis process.
We show how our model can be trained on existing datasets for 3D object detection, with no need for paired source/target viewpoint images. Furthermore, we leverage 2D keypoints for real-world images where foreground segmentation is not provided.
We demonstrate how our method is able to preserve visual details (e.g. texture) and is resilient to a much wider range of 3D transformations than competitors.
A thorough experimental analysis is conducted comparing our proposal with state-of-the-art methods, considering both the quantitative and the perceptual points of view.
View synthesis In just a few years, the widespread adoption of deep generative models [26, 14] has led to astounding results in different areas of image synthesis [43, 1, 67, 23, 56, 65]. In this scenario, conditional GANs have been demonstrated to be a powerful tool to tackle image-to-image translation problems [20, 72, 73, 8]. Hallucinating novel views of the subject of a photo can be naturally framed as an image-to-image translation problem. For human subjects, this has been cast as predicting a person's appearance in different poses [32, 49, 68]. The fashion and surveillance domains drew most of the attention, with much progress enabled by large real-world datasets providing multiple views of the same subject [30, 69].
A closely related approach exploits cycle-consistency losses to overcome the need for paired data, training on datasets of segmented real-world cars and chairs gathered for the purpose. Although that work shows more realistic results, it requires pixel-level segmentation for each class of interest. In contrast, we show that already available datasets for object 3D pose estimation [62, 61] can be used for this purpose, despite the extremely rough alignment between the annotated model and the image. More importantly, we differ from all the methods above as we explicitly bootstrap the image generation with visual patches warped from the source image.
Our goal is to develop a framework in which the visual appearance of an object can be automatically predicted from arbitrary viewpoints, given a single real-world image. More formally, the model takes as input an image of a single object viewed from the source viewpoint, along with its 2D keypoints, and outputs an image depicting the same object from the destination viewpoint. For training only, the 3D CAD model aligned with the image must also be available, in order to remove the background during self-supervision. At test time, only the object's 2D keypoints are needed (see Fig. 2).
The three main parts of our pipeline are described in this section: i) extraction of the planar patches from the object (Sec. 3.1), ii) patch warping to the destination viewpoint (Sec. 3.2), and iii) synthesis of the novel image (Sec. 3.3, 3.4 and 3.5). In the following, we assume a bounding box around the object and its 2D keypoints have been provided by off-the-shelf detectors, e.g. [16, 41]; we focus instead on the overall pipeline for object synthesis.
We leverage 2D keypoints to approximate the visible object with a simple polyhedron with a small set of faces.
Since keypoints mark characteristic locations in the object shape (e.g. corners), a face defined from at least three of those carries a semantic meaning (e.g. the roof of a car).
Exploiting 2D keypoints to find object faces is appealing for a number of reasons. First, it is straightforward to compute the homography matrix between planes in different viewpoints, since computing correspondences between two sets of keypoints is trivial. Furthermore, a number of datasets provide object landmark annotations in real-world scenarios (e.g. [28, 62, 61, 57, 59]), and solid keypoint detection methods exist [55, 41, 16].
Specifically, for each source image an array of 2D keypoints is available, where the number of keypoints is category-specific (e.g. for vehicles). From these, a set of planar patches can be defined as:
where each patch has a subset of the object keypoints as vertices. We choose four patches for vehicles, namely the left, right, front, and back sides.
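The face-to-keypoint assignment can be sketched as a simple lookup table. The keypoint indices below are illustrative, since the paper's actual vehicle keypoint ordering is not given in this excerpt:

```python
import numpy as np

# Hypothetical keypoint layout for a vehicle: indices into the 2D keypoint
# array. The real indexing depends on the dataset's annotation convention.
VEHICLE_FACES = {
    "left":  [0, 1, 2, 3],
    "right": [4, 5, 6, 7],
    "front": [0, 4, 5, 1],
    "back":  [3, 7, 6, 2],
}

def patch_polygon(keypoints_2d, face):
    """Return the 2D polygon (one planar patch) for a named object face."""
    idx = VEHICLE_FACES[face]
    return keypoints_2d[idx]  # (4, 2) array of patch vertices
```

Each polygon is then used both to crop the source patch and, after projection of the same keypoints into the destination view, to estimate the per-face homography.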
Warping patches Source patches are warped to the destination viewpoint to get a set of warped patches that are employed to bootstrap the novel viewpoint synthesis. To this end, we define the destination viewpoint to be an arbitrary rigid transformation of the camera:
The locations of 2D keypoints in the novel viewpoint can now be computed using a pinhole camera model as:
where the virtual intrinsic camera matrix has square pixels and principal point in the image center, and the 3D keypoints come from the CAD model. It is worth noting that the CAD model is not constrained to be the same as the one in the source image; it only has to feature the same keypoints (see appearance transfer in Fig. 8). A homography matrix relating planar surfaces in the two views is then estimated from the correspondences between source and destination keypoints. In this way, patches in the destination viewpoint (warped patches from now on) can be computed via matrix multiplication:
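The per-face homography estimation can be sketched with a plain Direct Linear Transform, assuming at least four keypoint correspondences per patch (a production pipeline would typically use a robust estimator such as RANSAC):

```python
import numpy as np

def estimate_homography(src, dst):
    """Direct Linear Transform: find H such that dst ~ H @ src (homogeneous).
    src, dst: (N, 2) arrays of corresponding 2D keypoints, N >= 4."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # The homography is the null vector of A, found via SVD.
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_homography(H, pts):
    """Map (N, 2) points through H, with perspective division."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]
```

Warping the pixel content of a patch then amounts to sampling the source image through the inverse homography, for which an off-the-shelf warper can be used.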
De-warping patches Since the dataset does not provide paired views, it is not possible to supervise the destination image directly; hence, we propose to train the network in a self-supervised manner, by forcing it to reconstruct the source image from its own patches. Nonetheless, this would create a distribution shift between the data fed to the network during training and inference. In fact, while a source patch is perfectly aligned with the source image (it is a subset of it), a warped patch might be affected by homography failures and interpolation errors, among other issues. To alleviate this shift, we train the network to reconstruct the image from a third set of patches (called dewarped patches in what follows):
In this way the network learns to cope with possible transformation errors and cannot simply short-circuit input patches to the output. The importance of this dewarping trick for a well-behaved network training is highlighted in Sec. 4.1.
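A minimal sketch of the round trip (warp to the destination view, then back), using nearest-neighbour sampling as a stand-in for a full image warper; it shows how the dewarped image stays aligned with the source while accumulating resampling error:

```python
import numpy as np

def warp_image(img, H, out_shape):
    """Warp an image by homography H using inverse mapping with
    nearest-neighbour sampling (a minimal stand-in for a real warper)."""
    h, w = out_shape
    Hinv = np.linalg.inv(H)
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3)
    src = coords @ Hinv.T
    src = src[:, :2] / src[:, 2:3]
    sx = np.clip(np.round(src[:, 0]).astype(int), 0, img.shape[1] - 1)
    sy = np.clip(np.round(src[:, 1]).astype(int), 0, img.shape[0] - 1)
    return img[sy, sx].reshape(h, w)

def dewarp(img, H):
    """Round-trip warp: source -> destination -> back to source.
    The result is aligned with the source image but carries the same class
    of interpolation errors as the warped patches seen at test time."""
    warped = warp_image(img, H, img.shape[:2])
    return warp_image(warped, np.linalg.inv(H), img.shape[:2])
```

Training on such round-tripped patches exposes the network to realistic warping artifacts while keeping the reconstruction target perfectly aligned.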
While image patches carry rich information about the appearance of the object, they bear few cues about the object's shape. In other words, visual appearance and shape are disentangled by design. This is a desirable property, enabling multiple applications which require changing one of the two while keeping the other fixed. In this section we propose a method to constrain the synthesised object shape. Let us consider the set of CAD models which approximate the intra-class variation for the current object class, each being a 3D mesh composed of faces. The number of CADs needed to reasonably cover the intra-class variation depends on the object category, but it is often relatively low (e.g. for the vehicle class in the Pascal3D+ dataset). Each training example is thus composed of an image and its associated viewpoint and CAD index. Therefore, a virtual camera can be used to render the CAD from the annotated viewpoint. In particular, we render the 2.5D sketch of CAD surface normals along with a coarse material-based part segmentation:
which provides rich information about the object's 3D shape. During training, this 2.5D sketch is fed to the image completion network together with the de-warped patches to reconstruct the source image. It is worth noting that these additional data come for free, as they require only synthetic sources (i.e. the object CAD in our method).
Our method relies on warped patches to transfer the object appearance from a source to a destination viewpoint. Still, it can happen that the two viewpoints are so far apart that the object shares no visible faces across them, even with symmetry constraints (e.g. front to back). To alleviate this issue, we crop from the input image a small patch, whose side is a fraction of the image size, and give it as an additional input to the image completion network as prior knowledge about the rough object appearance in the absence of other hints.
The Image Completion Network (ICN) is a fully convolutional network trained to reconstruct a realistic image from the dewarped patches, the 2.5D sketches, and the appearance prior:
Our fully convolutional network resembles generator networks introduced and widely employed in prior work. The reasons for this choice are twofold. On the one hand, this generator network has been designed for similar tasks, achieving state-of-the-art performance and being employed by a wide community. On the other hand, using a comparable architecture allows an equitable comparison with the state of the art, since differences in results cannot be attributed to the representational power of the network.
We define the perceptual loss function as a weighted sum of distances between features extracted from the reconstructed and target images, where the reconstruction is the output of the ICN.
We employ the second convolutional layer of each block in VGG-19 as the feature extractor. Following prior work, we set the layer weights such that the expected contribution of each term is approximately the same for each layer.
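A framework-agnostic sketch of such a loss: the feature maps below stand in for VGG-19 activations, and the default weighting (the inverse of each layer's element count) is one illustrative way to make every layer's expected contribution comparable:

```python
import numpy as np

def perceptual_loss(feats_pred, feats_target, weights=None):
    """Weighted L1 distance between feature maps from several layers.
    feats_*: lists of activation arrays (stand-ins for VGG-19 features;
    the real model would be a pretrained network applied to both images).
    weights: per-layer weights; if None, each layer is weighted by the
    inverse of its element count so every term contributes comparably."""
    if weights is None:
        weights = [1.0 / f.size for f in feats_target]
    return sum(w * np.abs(p - t).sum()
               for w, p, t in zip(weights, feats_pred, feats_target))
```

In the actual pipeline the feature lists would be produced by running both the ICN output and the target image through the frozen VGG-19 and collecting the chosen layers.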
As mentioned above, images generated from novel viewpoints cannot be directly supervised if the dataset does not provide paired views. Nevertheless, we can still enforce the realism of the ICN output in an adversarial fashion. Given a generic image synthesised by the ICN, either in the source or the destination viewpoint, we set up a min-max game as follows:
where D is the discriminator network, aiming to distinguish between real and synthesised images. Our total loss is defined as:
where a balancing weight modulates the contribution of the adversarial term.
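The combined objective can be sketched as follows; the non-saturating adversarial term and the default weight value are illustrative, as the excerpt elides both the exact adversarial formulation and the balancing value:

```python
import numpy as np

def generator_loss(perceptual_term, d_fake_logits, lam=0.01):
    """Total generator objective: perceptual term plus a lambda-weighted
    adversarial term. Here we use the non-saturating GAN loss
    -log sigmoid(D(fake)) = softplus(-D(fake)) as an illustrative choice;
    lam=0.01 is a placeholder, not the paper's value.
    d_fake_logits: discriminator logits on synthesised images."""
    adv = np.log1p(np.exp(-d_fake_logits)).mean()  # softplus(-x)
    return perceptual_term + lam * adv
```

The discriminator is trained with the opposite objective, pushing its logits up on real images and down on synthesised ones.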
Large-scale 3D shape repositories providing object geometries, such as the Princeton Shape Benchmark and ShapeNet, exist, but do not come with aligned real-world images. Differently, large real-world datasets for 3D object detection and pose estimation, such as Pascal3D+ and ObjectNet3D, provide 3D shape annotations roughly aligned with in-the-wild images.
In the following, we use the Pascal3D+ dataset.
Competitors We evaluate our method against three competitive baselines which are currently state-of-the-art in the task of novel viewpoint synthesis. The first one is Visual Object Networks, an adversarial learning framework in which object shape, viewpoint and texture are treated as three conditionally independent factors that contribute to the synthesis of the novel viewpoint. The pre-trained model released by the authors is denoted VON in what follows. Since VON was originally trained on a custom car dataset collected by the authors, for a fair comparison we implement a second baseline, denoted VON-ft, by fine-tuning their network on Pascal3D+. Lastly, we compare to the Variational U-Net for Conditional Appearance and Shape Generation (VUnet in the following), a state-of-the-art framework for conditional image generation based on a variational autoencoder. In the original implementation, a U-Net architecture is fed with keypoint-based skeletons to perform pose-guided human generation. We re-train their model on Pascal3D+ to perform pose-guided object generation. To ensure the setting is the same, we feed their shape-encoding network with our 2.5D sketches instead of the skeletons employed in the original implementation.
Each input image is padded to a square aspect ratio and resized to 128x128 pixels. We work in LAB color space. Following [55, 41], truncated and occluded objects are discarded, resulting in 4081 training and 1042 testing examples respectively. Our model is trained for 150 epochs with batch size 10. During training, both input and target undergo small random rotations, translations and shearing for data augmentation purposes. We use the Adam optimizer and halve the learning rate every 25 epochs. The code is developed in PyTorch; we rely on the Open3D library for 3D data manipulation and rendering. Hyper-parameters, including the loss balancing term, were tuned via random search.
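The learning-rate schedule amounts to a simple step decay; the base value below is a placeholder, since the paper's initial learning rate is elided in this excerpt:

```python
def learning_rate(epoch, base_lr=1e-4):
    """Step schedule: halve the learning rate every 25 epochs.
    base_lr is illustrative, not the paper's actual initial value."""
    return base_lr * 0.5 ** (epoch // 25)
```

In PyTorch this corresponds to `torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.5)`.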
| A/B preference | Car (plain) | Car (textured) | Avg |
| --- | --- | --- | --- |
| ours > VUnet | 76.0% | 85.0% | 78.0% |
| ours > VON | 88.0% | 98.0% | 91.0% |
| ours > VON-ft | 96.0% | 99.0% | 97.0% |
Competitors The key differences between the proposed method and the baselines can be appreciated in Fig. 3. The output from Visual Object Networks (VON) is generally realistic, but hardly reflects the visual appearance of the input. Furthermore, neither the VON nor the VON-ft generator network generalizes to poses which are less common in the training set, such as the frontal pose in (d). VUnet suffers from the blurred results typical of variational autoencoders; also, due to skip connections, the input appearance may leak into the output when the two viewpoints are very different (b, d). More generally, the drawbacks of a purely learning-based viewpoint synthesis, which all three competitors share, are evident in (a, c): complex textures cannot be recovered once compressed into a feature vector.
Ablation Ablated versions of our model are shown in the fourth to eighth columns. First, we investigate the benefit of the dewarping trick presented in Sec. 3.2. In the No-dewarp column, the ICN was trained to reconstruct the image from the source patches (Sec. 3.1) instead of the dewarped ones. As expected, despite the very low reconstruction error at training time (due to the similarity between the source patches and the source image), the model fails to generalize to the synthesis of novel viewpoints, where the textures are the result of a homography transformation. The effect of removing the appearance prior is showcased in the No-prior column. Without prior information, the ICN fails to infer the object appearance when no planar patch is provided, as shown in (b, e). Removing the adversarial term (No-adv column) results in slightly blurred outputs.
Finally, Sil-only and Normal-only show ablated versions in which the input sketches consist only of the 2D silhouette and the 2.5D surface normals respectively. Although results do not differ dramatically, it can be appreciated how the network benefits from the additional information to resolve ambiguous situations such as self-occlusions (e) and details such as side windows, lights and wheels (a-e).
Results for appearance transfer are shown in Fig. 8. In this setting, the network is requested to complete the warped faces using the 2.5D sketch rendered from a totally different CAD. It can be appreciated how the novel viewpoints are still realistic, since the network exploits the 2.5D sketch to complete the warped appearance in a CAD-agnostic manner. Common failure cases, mostly coming from errors in 2D keypoints or from homography failures, are showcased in Fig. 9: please refer to the caption for details.
Fréchet Inception Distance To quantitatively measure the similarity between generated and real images, we rely on the Fréchet Inception Distance (FID), which has been shown to correlate consistently with human judgment [18, 31]. We employ activations from the last convolutional layer of an InceptionV3 model pretrained on ImageNet to compute the FID in two scenarios. In the first one, the novel viewpoints used to generate the images are sampled from Pascal3D+ annotations. Since real images come from the same viewpoint distribution, this setting indicates the minimum distance that a method can achieve; results are reported in Table 1, first column (P3D+). In the second scenario, we sample novel viewpoints while rotating around the object at fixed distance and elevation; results are reported in Table 1, binned into 12 equidistant azimuthal angles. The Fréchet Inception Distance rewards the realism of our semi-parametric approach: in both settings our method outperforms competitors by a large margin.
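Given the (N, D) matrices of Inception activations for real and generated images, the FID is the Fréchet distance between the two Gaussians fitted to them. A minimal numpy sketch (using an eigendecomposition for the matrix square root; library implementations typically use `scipy.linalg.sqrtm`):

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """Fréchet distance between two Gaussians:
    FID = ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^{1/2})."""
    diff = mu1 - mu2
    # Matrix square root of cov1 @ cov2 via eigendecomposition.
    vals, vecs = np.linalg.eig(cov1 @ cov2)
    sq = np.sqrt(vals.astype(complex))
    covmean = (vecs * sq) @ np.linalg.inv(vecs)
    tr = np.trace(cov1 + cov2 - 2 * covmean)
    return float(diff @ diff + tr.real)

def fid_from_activations(act_real, act_fake):
    """Fit Gaussians to (N, D) activation matrices and return the FID."""
    mu_r, cov_r = act_real.mean(0), np.cov(act_real, rowvar=False)
    mu_f, cov_f = act_fake.mean(0), np.cov(act_fake, rowvar=False)
    return frechet_distance(mu_r, cov_r, mu_f, cov_f)
```

Identical activation sets give a distance of zero; shifting one set moves the score by the squared mean displacement.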
To assess the quality of our results also from a perceptual point of view, randomized A/B preference tests were performed by 43 human workers, following the experimental protocol of previous works [6, 42, 74]. As we want to evaluate both the realism and the appearance coherence of our method, we perform two different tests. For both experiments, images are all shown at the same resolution of 128x128 pixels. As all methods produce a white background, the ground-truth aligned 3D CAD is used to mask the Pascal3D+ real images. Both the sampling order and the left-right order of A and B are randomized.
In the first setting, the subject is presented with three images: while the first one comes from the Pascal3D+ test set, A and B depict a novel viewpoint of the object generated with two different methods. The human worker is then asked whether rotating the input object would more plausibly lead to A or B. This setting rewards the method producing images which are more consistent with a human's expectation after a mental rotation of the first object. Results reported in Tab. 2 indicate that our method is largely preferred to competitors, likely because of the built-in realism that comes from warping the original image. As a further analysis, we manually split the Pascal3D+ images into plain and textured sets, the latter containing vehicles which feature characteristic textures. Table 2 highlights that results on the two sets are significantly different, with workers expressing an almost unanimous preference for our method on the textured set. The fact that human attention was caught by these appearance details highlights the importance of preserving fine-grained details in the synthesized output.
The second experiment consists of a two-alternative forced choice aimed at evaluating the relative realism of each method. Here the subject is presented with only two images for a predetermined amount of time and is then asked which of the two appeared more realistic. The experiment is repeated varying the amount of time the worker can spend on each pair of images. The results depicted in Figure 6 are twofold. On the one hand, workers clearly discern VUnet and VON-ft images from real ones as more time is available. VUnet is hurt by excessive blur and visual artifacts; VON-ft suffers from a severe loss of realism w.r.t. the original VON method, which may be related to the great variety of viewpoints in the Pascal3D+ dataset compared with the one used by Zhu et al. On the other hand, both VON and our method produce realistic images that workers struggle to distinguish from the real ones, even with 8000 ms of viewing time.
Disentangled appearance and shape editing
Our approach makes it possible to edit object appearance and shape in a disentangled manner: the shape can be changed while preserving the appearance (Fig. 8), or the other way around (Fig. 7).
Data augmentation Our model may be used to generate realistic synthetic data. To support this, we generate ten synthetic images from each one in the Pascal3D+ training set and employ them to augment the training data for a stacked-hourglass network [35, 41] in the task of keypoint localization. In this setting, training solely on synthetic images yields a lower Percentage of Correct Keypoints (PCK) on the real test set than training on real data. However, pre-training on generated data and fine-tuning on real ones boosts the PCK, all other hyper-parameters being the same.
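The PCK metric used above can be sketched as follows; the threshold convention (a fraction alpha of the bounding-box size) is the customary one, and the paper's exact alpha is not given in this excerpt:

```python
import numpy as np

def pck(pred, gt, bbox_size, alpha=0.1):
    """Percentage of Correct Keypoints: a predicted 2D keypoint counts as
    correct if it lies within alpha * bbox_size pixels of the ground truth.
    pred, gt: (N, K, 2) keypoint arrays; bbox_size: (N,) per-image
    bounding-box side (typically the larger of width and height)."""
    dist = np.linalg.norm(pred - gt, axis=-1)        # (N, K) distances
    thresh = alpha * np.asarray(bbox_size)[:, None]  # (N, 1) thresholds
    return float((dist <= thresh).mean())
```

Scoring a model then reduces to running the keypoint detector over the test set and averaging this indicator across images and keypoints.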
In this work we have presented a semi-parametric approach for object novel viewpoint synthesis. In our framework, non-parametric visual hints act as prior information for a deep parametric model to synthesize realistic images, disentangling appearance and shape. Perceptual experiment results, as well as image-quality metrics, reward our method for its realism and for the visual consistency of the synthesised object across arbitrary points of view. Still, a number of improvements can already be foreseen and are left as future work, such as loosening the assumptions on input shape to handle more complex (deformable) objects, or training the whole system end-to-end by leveraging spatial transformer layers and differentiable rendering.
International Conference on Machine Learning, pages 214–223, 2017.
IEEE International Conference on Computer Vision (ICCV), volume 1, page 3, 2017.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8789–8797, 2018.
Towards safe autonomous driving: Capture uncertainty in the deep neural network for lidar 3D vehicle detection. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 3266–3273. IEEE, 2018.
Image-to-image translation with conditional adversarial networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5967–5976. IEEE, 2017.
Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
Generative image inpainting with contextual attention. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.