Specular-to-Diffuse Translation for Multi-View Reconstruction

07/14/2018 ∙ by Shihao Wu, et al. ∙ 2

Most multi-view 3D reconstruction algorithms, especially when shape-from-shading cues are used, assume that object appearance is predominantly diffuse. To alleviate this restriction, we introduce S2Dnet, a generative adversarial network for transferring multiple views of objects with specular reflection into diffuse ones, so that multi-view reconstruction methods can be applied more effectively. Our network extends unsupervised image-to-image translation to multi-view "specular to diffuse" translation. To preserve object appearance across multiple views, we introduce a Multi-View Coherence loss (MVC) that evaluates the similarity and faithfulness of local patches after the view-transformation. Our MVC loss ensures that the similarity of local correspondences among multi-view images is preserved under the image-to-image translation. As a result, our network yields significantly better results than several single-view baseline techniques. In addition, we carefully design and generate a large synthetic training data set using physically-based rendering. During testing, our network takes only the raw glossy images as input, without extra information such as segmentation masks or lighting estimation. Results demonstrate that multi-view reconstruction can be significantly improved using the images filtered by our network. We also show promising performance on real world training and testing data.


1 Introduction

Three-dimensional reconstruction from multi-view images is a long-standing problem in computer vision. State-of-the-art shape-from-shading techniques achieve impressive results [1, 2]. These techniques, however, make rather strong assumptions about the data, mainly that target objects are predominantly diffuse with almost no specular reflectance. Multi-view reconstruction of glossy surfaces is a challenging problem, which has been addressed by adding specialized hardware (e.g., coded pattern projection [3] and two-layer LCD [4]), imposing surface constraints [5, 6], or making use of additional information like silhouettes and environment maps [7], or the Blinn-Phong model [8].

In this paper, we present a generative adversarial neural network (GAN) that translates multi-view images of objects with specular reflection to diffuse ones. The network aims to generate a specular-free surface, which then can be reconstructed by a standard multi-view reconstruction technique as shown in Figure 1. We name our translation network S2Dnet, for Specular-to-Diffuse. Our approach is inspired by recent GAN-based image translation methods, like pix2pix [9] or cycleGAN [10], that can transform an image from one domain to another. Such techniques, however, are not designed for multi-view image translation. Directly applying these translation techniques to individual views is prone to reconstruction artifacts due to the lack of coherence among the transformed images. Hence, instead of using single views, our network considers a triplet of nearby views as input. These triplets allow learning the mutual information of neighboring views. More specifically, we introduce a global-local discriminator and a perceptual correspondence loss that evaluate the multi-view coherency of local corresponding image patches. Experiments show that our method outperforms baseline image translation methods.

Figure 1: Specular-to-diffuse translation of multi-view images. We show eleven views of a glossy object (top), and the specular-free images generated by our network (bottom).

Another obstacle to applying image translation techniques to specularity removal is the lack of good training data. It is rather impractical to take enough paired or even unpaired photos to successfully train a deep network. Inspired by recent work on simulating training data by physically-based rendering [11, 12, 13, 14] and on domain adaptation [15, 16, 17, 18], we present a fine-tuned process for generating training data and then adapting it to real-world data. Instead of using ShapeNet [19], we develop a new training dataset that includes models with richer geometric details, which allows us to apply our method to complex real-world data. Both quantitative and qualitative evaluations demonstrate that the performance of multi-view reconstruction can be significantly improved using the images filtered by our network. We also show the performance of adapting our network to real-world training and testing data, with some promising results.

2 Related work

Specular Object Reconstruction. Image based 3D reconstruction has been widely used for AR/VR applications, and the reconstruction speed and quality have been improved dramatically in recent years. However, most photometric stereo methods are based on the assumption that the object surface is diffuse, that is, the appearance of the object is view independent. Such assumptions, however, are not valid for glossy or specular objects in uncontrolled environments. It is well known that modeling the specularity is difficult as the specular effects are largely caused by the complicated global illumination that is usually unknown. For example, Godard et al. [7] first reconstruct a rough model by silhouette and then refine it using the specified environment map. Their method can reconstruct high quality specular surfaces from HDR images with extra information, such as silhouette and environment map.

In contrast, our method requires only the multi-view images as input. Researchers have proposed sophisticated equipment, such as a setup with two-layer LCDs to encode the directions of the emitted light field [4], taking advantage of the IR images recorded by RGB-D scanners [20, 21] or casting coded patterns onto mirror-like objects [3]. While such techniques can effectively handle challenging non-diffuse effects, they require additional hardware and user expertise. Another way to tackle this problem is by introducing additional assumptions, such as surface constraints [5, 6], the Blinn-Phong model [8], and shape-from-specularity [22]. These methods can also benefit from our network that outputs diffuse images, where strong specularities are removed from uncontrolled illumination. Please refer to [23] for a survey on specular object reconstruction.

GAN-based Image-to-Image Translation. We are inspired by the latest success of learning-based image-to-image translation methods, such as ConditionalGAN [9], cycleGAN [10], dualGAN [24], and discoGAN [17]. The remarkable capacity of Generative Adversarial Networks (GANs) [25] in modeling data distributions allows these methods to transform images from one domain to another with relatively small amounts of training data, while preserving the intrinsic structure of original images faithfully. With improved multi-scale training techniques, such as Progressive GAN [26] and pix2pixHD [27], image-to-image translation can be performed at megapixel resolutions and achieve results of stunning visual quality.

Recently, modified image-to-image translation architectures have been successfully applied to ill-posed or underconstrained vision tasks, including face frontal view synthesis [28], facial geometry reconstruction [29, 30, 31, 32], raindrop removal [33], or shadow removal [34]. These applications motivate us to develop a glossiness removal method based on GANs to facilitate multi-view 3D reconstruction of non-diffuse objects.

Learning-based Multi-View 3D Reconstruction. Learning surface reconstruction from multi-view images end-to-end has been an active research direction recently [35, 36, 37, 38]. Wu et al. [39] and Gwak et al. [40] use GANs to learn the latent space of shapes and apply it to single image 3D reconstruction. 3D-R2N2 [36] designs a recurrent network for unified single and multi-view reconstruction. Image2Mesh [41] learns parameters of free-form-deformation of a base model. Nonetheless, in general, the reconstruction quality of these methods cannot really surpass that of traditional approaches that exploit multiple-view geometry and heavily engineered photometric stereo pipelines. To take the local image feature coherence into account, we focus on removing the specular effect on the image level and resort to the power of multi-view reconstruction as a post-processing and also a production step.

On the other hand, there are works, closer to ours, that focus on applying deep learning on subparts of the stereo reconstruction pipeline, such as depth and pose estimation [42], feature point detection and description [43, 44], semantic segmentation [45], and bundle adjustment [46, 47]. These methods still impose the Lambertian assumption for objects or scenes, where our method can serve as a preprocessing step to deal with glossiness.

Learning-based Intrinsic Image Decomposition. Our method is also loosely related to some recent works on learning intrinsic image decomposition. These methods include training a CNN to reconstruct rendering parameters, e.g., material [48, 49], reflectance maps [50], illumination [51], or some combination of those components [13, 52, 48]. These methods are often trained on synthetic data and are usually applied to the re-rendering of single images. Our method shares certain similarity with these methods. However, our goal is not to recover intrinsic images with albedos. Disregarding albedo, we aim for output images with a consistent appearance across the entire training set that reflects the structure of the object.

3 Multi-view Specular-to-Diffuse GAN

Figure 2: Overview of S2Dnet. Two generators and two discriminators are trained simultaneously to learn cross-domain translations between the glossy and the diffuse domain. In each training iteration, the model randomly picks and forwards a real glossy and diffuse image sequence, computes the loss functions and updates the model parameters.

In this section, we introduce S2Dnet, a conditional GAN that translates multi-view images of highly specular scenes into corresponding diffuse images. The input to our model is a multi-view sequence of a glossy scene without any additional input such as segmentation masks, camera parameters, or light probes. This enables our model to process real-world data, where such additional information is not readily available. The output of our model directly serves as input to state-of-the-art photometric stereo pipelines, resulting in improved 3D reconstruction without additional effort. Figure 2 shows a visualization of the proposed model. We discuss the training data, one of our major contributions, in Section 3.1. In Section 3.2 we introduce the concept of inter-view coherence that enables our model to process multiple views of a scene in a consistent manner, which is important in the context of multi-view reconstruction. Then, we outline in Section 3.3 the overall end-to-end training procedure. Implementation details are discussed in Section 3.4. Upon publication we will release both our data (synthetic and real) and the proposed model to foster further work.

Figure 3: Gallery of our synthetically rendered specular-to-diffuse training data.

3.1 Training Data

To train our model to translate multi-view glossy images to their diffuse counterparts, we need appropriate data for both domains, i.e., glossy source domain images as inputs, and diffuse images as the target domain. Yi et al. [24] propose a MATERIAL dataset consisting of unlabeled data grouped in different material classes, such as plastic, fabric, metal, and leather, and they train GANs to perform material transfer. However, the MATERIAL dataset does not contain multi-view images and thus is not suited for our application. Moreover, the dataset is rather small and we expect our deep model to require a larger amount of training data. Hence, we propose a novel synthetic dataset consisting of multi-view images, which is both sufficiently large to train deep networks and sufficiently complex to generalize to real-world objects. For this purpose, we collect and align 91 watertight and noise-free geometric models featuring rich geometric details from SketchFab (Figure 3). We exclude three models for testing and use the remaining 88 models for training. To obtain a dataset that generalizes well to real-world images, we use PBRT, a physically based renderer [53], to render these geometric models in various environments with a wide variety of glossy materials applied to form our source domain. Next, we render the target domain images by applying a Lambertian material to our geometric models.

Our experiments show that the choice of the rendering parameters has a strong impact on the translation performance. On one hand, making the two domains more similar by choosing similar materials for both domains improves the translation quality on synthetic data. Moreover, simple environments, such as a constant ground plane, also increase the quality on synthetic data. On the other hand, such simplifications cause the model to overfit and prevent generalization to real-world data. Hence, a main goal of our dataset is to provide enough complexity to allow generalization to real data.

To achieve realistic illumination, we randomly sample one of 20 different HDR indoor environment maps and randomly rotate it for each scene. In addition, we orient a directional light source pointing from the camera approximately towards the center of the scene and position two additional light sources above the scene. The intensities, positions, and directions of these additional light sources are randomly jittered. This setup guarantees a rather even, but still random illumination. To render the source domain images, we apply the various metal materials defined in PBRT, including copper, silver, and gold. Material roughness and index of refraction are randomly sampled to cover a large variety of glossy materials. We randomly sample camera positions on the upper hemisphere around the scene pointing towards the center of the scene. To obtain multi-view data, we always sample 5 close-by, consecutive camera positions in clockwise order while keeping the scene parameters fixed to mimic the common procedure of taking photos for stereo reconstruction. Since we collect 5 images of the same scene and the input to our network consists of 3 views, we obtain 3 training samples per scene. All images are rendered at a single fixed resolution, which is the limit for our GPU memory. However, it is likely that higher resolutions would further improve the reconstruction quality.

Finally, we render the exact same images again with a white, Lambertian material, i.e., the mapping from the source to the target domain is bijective. The proposed procedure results in a training dataset of more than 647k images, i.e., more than 320k images per domain. For testing, we rendered 2k sequences of images, each consisting of 50 images. All qualitative results on synthetic data shown in this paper belong to this test set.
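The count of 3 training samples per 5-view scene is consistent with sliding a window of three consecutive views over the sequence. The sketch below illustrates that reading; the windowing scheme and the function name view_triplets are assumptions for illustration, since the text does not spell out how triplets are formed.

```python
def view_triplets(views):
    """Return every run of three consecutive views from an ordered sequence."""
    return [tuple(views[i:i + 3]) for i in range(len(views) - 2)]


# Five consecutive camera positions of one scene (placeholder names).
scene = ["view0", "view1", "view2", "view3", "view4"]
for triplet in view_triplets(scene):
    print(triplet)
```

Under this assumption, a sequence of n views yields n - 2 triplets, so 5 views give 3 samples.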

3.2 Inter-view Coherence

Multi-view reconstruction algorithms leverage corresponding features in different views to accurately estimate the 3D geometry. Therefore, we cannot expect good reconstruction quality if the glossy images in a multi-view sequence are translated independently using standard image translation methods, e.g., [10, 9]. This will introduce inconsistencies along the different views, and thus cause artifacts in the subsequent reconstruction. We therefore propose a novel model that enforces inter-view coherence by processing multiple views simultaneously. Our approach consists of a global and local consistency constraint: the global constraint is implemented using an appropriate network architecture, and the local consistency is enforced using a novel loss function.

Global Inter-view Coherence.

A straightforward idea to incorporate multiple views is to stack them pixel-by-pixel before feeding them to the network. We found that this does not lead to strong enough constraints, since the network can still learn independent filter weights for the different views. This results in blurry translations, especially if corresponding pixels in different views are not aligned, which is typically the case. Instead, we concatenate the different views along the spatial axis before feeding them to the network. This solution, although simple, forces the network to use the same filter weights for all views, and thus effectively avoids inconsistencies on a global scale.
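The difference between the two ways of combining views is easiest to see in the resulting tensor shapes. The following is a minimal sketch using nested Python lists as toy images of shape (height, width, channels); the helper names are hypothetical and the real model operates on PyTorch tensors.

```python
def make_view(height, width, value):
    """Build a toy RGB image as nested lists with shape (height, width, 3)."""
    return [[[value, value, value] for _ in range(width)] for _ in range(height)]


def stack_channels(views):
    """Stack views pixel-by-pixel: shape (h, w, 3 * number of views)."""
    height, width = len(views[0]), len(views[0][0])
    return [[[c for v in views for c in v[r][col]] for col in range(width)]
            for r in range(height)]


def concat_width(views):
    """Concatenate views side by side: shape (h, number of views * w, 3)."""
    height = len(views[0])
    return [[px for v in views for px in v[r]] for r in range(height)]


def shape(img):
    return (len(img), len(img[0]), len(img[0][0]))


views = [make_view(4, 6, k) for k in range(3)]
print(shape(stack_channels(views)))  # (4, 6, 9)
print(shape(concat_width(views)))    # (4, 18, 3)
```

With channel stacking, the first convolution sees nine input channels and can assign separate weights to each view. With spatial concatenation, every view passes through the same three-channel filters.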

Local Inter-view Coherence.

Figure 4: Two examples of the SIFT correspondences pre-computed for our training.

Incorporating loss functions based on local image patches has been successfully applied to generative adversarial models, such as image completion [54] or texture synthesis [55]. However, comparing image patches at random locations is not meaningful in a multi-view setup for stereo reconstruction. Instead, we encourage the network to maintain feature point correspondences in the input sequence, i.e., inter-view correspondences should be invariant to the translation. Since the subsequent reconstruction pipeline relies on such correspondences, maintaining them during translation should improve reconstruction quality. To achieve this, we first extract SIFT feature correspondences for all training images. For each training sequence consisting of three views, we compute corresponding feature points between the different views in the source domain; see Figure 4 for two examples. During training, we encourage the network output at the SIFT feature locations to be similar along the views using a perceptual loss in VGG feature space [56, 27, 57, 58]. The key idea is to measure both high- and low-level similarity of two images by considering their feature activations in a deep CNN like VGG. We adopt this idea to keep local image patches around corresponding SIFT features perceptually similar in the translated output. The perceptual loss in VGG feature space is defined as:

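$$\mathcal{L}_{VGG}(x, \hat{x}) = \sum_{i=1}^{N} \frac{1}{M_i} \left\| F^{(i)}(x) - F^{(i)}(\hat{x}) \right\|_1 \qquad (1)$$

where $F^{(i)}$ denotes the $i$-th layer in the VGG network consisting of $M_i$ elements. Now consider a glossy input sequence consisting of three images $X_1, X_2, X_3$, and the corresponding diffuse sequence $\tilde{X}_1, \tilde{X}_2, \tilde{X}_3$ produced by our model. A SIFT correspondence for this sequence consists of three image coordinates $p_1, p_2, p_3$, one in each glossy image, and all three pixels at the corresponding coordinates represent the same feature. We then extract local image patches $\tilde{x}_j$ centered at $p_j$ from $\tilde{X}_j$, and define the perceptual correspondence loss as:

$$\mathcal{L}_{corr} = \mathcal{L}_{VGG}(\tilde{x}_1, \tilde{x}_2) + \mathcal{L}_{VGG}(\tilde{x}_2, \tilde{x}_3) + \mathcal{L}_{VGG}(\tilde{x}_1, \tilde{x}_3) \qquad (2)$$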
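The structure of the perceptual correspondence loss can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: toy_features is a hypothetical two-layer stand-in for VGG activations, images are single-channel nested lists, and summing the loss over all three patch pairs is an assumed reading of the definition.

```python
def crop(image, center, size):
    """Cut a size x size patch centred at (row, col); assumes it fits in the image."""
    r0, c0 = center[0] - size // 2, center[1] - size // 2
    return [row[c0:c0 + size] for row in image[r0:r0 + size]]


def toy_features(patch):
    """Hypothetical stand-in for VGG: raw pixels plus 2x2 average pooling."""
    flat = [v for row in patch for v in row]
    pooled = [
        (patch[r][c] + patch[r][c + 1] + patch[r + 1][c] + patch[r + 1][c + 1]) / 4.0
        for r in range(0, len(patch) - 1, 2)
        for c in range(0, len(patch[0]) - 1, 2)
    ]
    return [flat, pooled]


def perceptual_loss(a, b):
    """Sum over feature layers of the mean absolute feature difference."""
    return sum(
        sum(abs(u - v) for u, v in zip(fa, fb)) / len(fa)
        for fa, fb in zip(toy_features(a), toy_features(b))
    )


def correspondence_loss(outputs, points, size=4):
    """Perceptual loss over all pairs of patches around one correspondence."""
    patches = [crop(img, p, size) for img, p in zip(outputs, points)]
    return sum(perceptual_loss(patches[i], patches[j])
               for i, j in [(0, 1), (1, 2), (0, 2)])


def pattern(r, c):
    return (3 * r + 5 * c) % 11


# Three toy "translated" views: the same pattern shifted by k pixels per view.
views = [[[pattern(r, c - k) for c in range(10)] for r in range(10)] for k in range(3)]
print(correspondence_loss(views, [(5, 4), (5, 5), (5, 6)]))  # matching points
print(correspondence_loss(views, [(5, 5), (5, 5), (5, 5)]))  # mismatched points
```

The loss vanishes when the patches around the three corresponding points agree and grows when they do not, which is the signal used to keep translated views mutually consistent.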

3.3 Training Procedure

Given two sets of data samples from two domains, a source domain $X$ and a target domain $Y$, the goal of image translation is to find a mapping $G$ that transforms data points $x \in X$ to $Y$ such that $G(x) \in Y$, while the intrinsic structure of $x$ should be preserved under $G$. Training GANs has been proven to produce astonishing results on this task, both in supervised settings where the data of the two domains are paired [9], and in unsupervised cases using unpaired data [10]. In our experiments, we observed that both approaches (ConditionalGAN [9] and cycleGAN [10]) perform similarly well on our dataset. However, while paired training data might be readily available for synthetic data, paired real-world data is difficult to obtain. Therefore we come up with a design for unsupervised learning that can easily be fine-tuned on unpaired real-world data.

Cycle-consistency Loss.

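Similar to CycleGAN [10], we learn the mapping between domain $X$ and $Y$ with two translators $G_{X \to Y}$ and $G_{Y \to X}$ that are trained simultaneously. The key idea is to train with cycle-consistency loss, i.e., to enforce that $\hat{x} \approx x$ and $\hat{y} \approx y$, where $\hat{x} = G_{Y \to X}(G_{X \to Y}(x))$ and $\hat{y} = G_{X \to Y}(G_{Y \to X}(y))$. This cycle-consistency loss encourages data points to preserve their intrinsic structure under the learned mapping. Formally, the cycle-consistency loss is defined as:

$$\mathcal{L}_{cyc} = \left\| \hat{x} - x \right\|_1 + \left\| \hat{y} - y \right\|_1 \qquad (3)$$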

Adversarial Loss.

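To force the translation networks to produce data that is indistinguishable from genuine images, we also include an adversarial loss to train our model. For both translators $G_{X \to Y}$ and $G_{Y \to X}$, in GAN context often called generators, we train two additional discriminator networks $D_Y$ and $D_X$ that are trained to distinguish translated from genuine images. To train our model, we use the following adversarial term:

$$\mathcal{L}_{adv} = \mathcal{L}_{GAN}(G_{X \to Y}, D_Y) + \mathcal{L}_{GAN}(G_{Y \to X}, D_X) \qquad (4)$$

where $\mathcal{L}_{GAN}$ is the LSGAN formulation [59].

Overall, we train our model using the following loss function:

$$\mathcal{L} = \lambda_{adv} \mathcal{L}_{adv} + \lambda_{cyc} \mathcal{L}_{cyc} + \lambda_{corr} \mathcal{L}_{corr} \qquad (5)$$

where $\mathcal{L}_{cyc}$ is the cycle-consistency loss, $\mathcal{L}_{corr}$ is the perceptual correspondence loss, and $\lambda_{adv}$, $\lambda_{cyc}$, and $\lambda_{corr}$ are user-defined hyperparameters.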
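To make the interplay of the loss terms concrete, here is a minimal sketch on flat lists of numbers. It assumes an L1 cycle-consistency term as in CycleGAN and the standard least-squares GAN objectives with targets 1 for real and 0 for fake; the function names and the toy generators are hypothetical, and the weights are passed in explicitly because their values are not given here.

```python
def l1(a, b):
    """Mean absolute difference between two equally sized flat lists."""
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)


def cycle_loss(x, y, g_xy, g_yx):
    """x -> Y -> X and y -> X -> Y should both return to the starting point."""
    return l1(g_yx(g_xy(x)), x) + l1(g_xy(g_yx(y)), y)


def lsgan_generator_loss(d_fake):
    """Least-squares generator term: push scores on translated data towards 1."""
    return 0.5 * sum((s - 1.0) ** 2 for s in d_fake) / len(d_fake)


def lsgan_discriminator_loss(d_real, d_fake):
    """Least-squares discriminator term: real scores to 1, fake scores to 0."""
    real = sum((s - 1.0) ** 2 for s in d_real) / len(d_real)
    fake = sum(s ** 2 for s in d_fake) / len(d_fake)
    return 0.5 * (real + fake)


def total_generator_loss(adv, cyc, corr, w_adv, w_cyc, w_corr):
    """Weighted sum of adversarial, cycle-consistency, and correspondence terms."""
    return w_adv * adv + w_cyc * cyc + w_corr * corr


# Toy generators that are exact inverses of each other.
halve = lambda v: [0.5 * t for t in v]
double = lambda v: [2.0 * t for t in v]
print(cycle_loss([1.0, 2.0, 4.0], [0.5, 3.0], halve, double))  # 0.0
```

Because the two toy generators invert each other exactly, the cycle term is zero; any information lost in one direction would show up as a positive value.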

3.4 Implementation Details

Figure 5: Illustration of the generator and discriminator network. The generator uses the U-net architecture and both input and output are a multi-view sequence consisting of three views. A random SIFT correspondence is sampled during training to compute the correspondence loss. The multi-scale joint discriminator examines three scales of the image sequence and two scales of corresponding local patches. The width and height of each rectangular block indicate the channel size and the spatial dimension of the output feature map, respectively.

Our model is based on cycleGAN and implemented in PyTorch. We experimented with different architectures for the translation networks, including U-Net [60], ResNet [61], and RNN-blocks [62]. Given enough training time, we found that all networks produce similar results. Due to its memory efficiency and fast convergence, we chose U-Net for our final model. As shown in Figure 5, we use the multi-scale discriminator introduced in [27] that downsamples by a rate of 2, which generally works better for high resolution images. Our discriminator also considers the local correspondence patches as additional input, which helps to produce coherent translations. Following the training guidance proposed in [26], we use pixel-wise normalization in the generators and add a 1-strided convolutional layer after each deconvolutional layer. For computing the correspondence loss, we use a fixed patch size and randomly sample a single SIFT correspondence per training iteration. The discriminator architecture is C64-C128-C256-C512-C1. The generator's encoder architecture is C64-C128-C256-C512-C512-C512-C512-C512. We use the same hyperparameter values in all our experiments and train using the Adam optimizer with a learning rate of 0.0002.

4 Evaluation

In this section, we present qualitative and quantitative evaluations of our proposed S2Dnet. For this purpose, we evaluate the performance of our model on both the translation task and the subsequent 3D reconstruction, and we compare to several baseline systems. In Section 4.1 we report results on our synthetic test set and we also perform an evaluation on real-world data in Section 4.2.

To evaluate the benefit of our proposed inter-view coherence, we perform a comparison to a single-view translation baseline by training a cycleGAN network [10] on glossy to diffuse translation. Since our synthetic dataset features a bijective mapping between glossy and diffuse images, we also train a pix2pix network [9] for a supervised baseline on synthetic data. In addition, we compare reconstruction quality to performing stereo reconstruction directly on the glossy multi-view sequence to demonstrate the benefit of translating the input as a preprocessing step. For 3D reconstruction, we apply a state-of-the-art multi-view surface reconstruction method [1] on input sequences consisting of 10 to 15 views. For our method, we translate each input view sequentially but we feed the two neighboring views as additional inputs to our multi-view network. For the two baseline translation methods, we translate each view independently. The 3D reconstruction pipeline then uses the entire translated multi-view sequence as input.

4.1 Synthetic Data

            Glossy   pix2pix   cycleGAN   S2Dnet
Image MSE   118.39   56.20     69.15      57.78

Table 1: Quantitative evaluation of the image error on our synthetic testing data.

Figure 6: Qualitative translation results on a synthetic input sequence consisting of 8 views. From top down: the glossy input sequence, the ground truth diffuse rendering, and the translation results for the baselines pix2pix and cycleGAN, and our S2Dnet. The output of pix2pix is generally blurry. The cycleGAN output, although sharp, lacks inter-view consistency. Our S2Dnet produces both crisp and coherent translations.

For a quantitative evaluation of the image translation performance, we compute MSE with respect to the ground truth diffuse renderings on our synthetic test set. Table 1 shows a comparison of our S2Dnet to pix2pix and cycleGAN. Unsurprisingly, the supervised pix2pix network performs best, closely followed by our S2Dnet, which outperforms the unsupervised baseline by a significant margin. In Figure 6 we show qualitative translation results. Note that the output of pix2pix is generally blurry. Since MSE penalizes outliers and prefers a smooth solution, pix2pix still achieves a low MSE. While the output of cycleGAN is sharper, the translated sequence lacks inter-view consistency, whereas our S2Dnet produces both highly detailed and coherent translations.
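The observation that a blurry output can still score a low MSE follows directly from the metric. The toy example below is only an illustration: the 0 to 255 pixel range and the single-row "images" are assumptions, and the numbers are unrelated to Table 1.

```python
def image_mse(pred, target):
    """Mean squared error over all pixel values of two equally sized images."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    return sum((a - b) ** 2 for a, b in zip(p, t)) / len(p)


target = [[0, 255, 0, 255]]            # sharp stripes
sharp_misaligned = [[255, 0, 255, 0]]  # equally sharp, shifted by one pixel
blurry = [[127.5, 127.5, 127.5, 127.5]]

print(image_mse(sharp_misaligned, target))  # 65025.0
print(image_mse(blurry, target))            # 16256.25
```

The uniformly gray prediction carries no detail at all, yet its error is a quarter of that of the sharp but misaligned one. This is why MSE alone does not indicate whether an output is usable for feature matching.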

Model      1     2     3     4     5     6     7     8     9     10    AVG
Glossy     0.67  0.88  1.35  0.76  1.15  1.13  1.15  0.78  0.54  0.66  0.90
cycleGAN   1.18  0.72  0.89  0.59  1.35  0.72  0.99  0.62  0.51  0.42  0.80
S2Dnet     0.52  0.67  0.72  0.43  0.87  0.54  0.92  0.65  0.55  0.56  0.64

Table 2: Quantitative evaluation of surface reconstruction performance on 10 different scenes. The error metric is the percentage of bounding box diagonal. Our S2Dnet performs best on average, and the translation baseline still performs significantly better on average than directly reconstructing from the glossy images. The numbering of the models follows the visualization in Figure 7, using the same left to right order.

Figure 7: Qualitative surface reconstruction results on 10 different scenes. From top to bottom: glossy input, ground truth diffuse renderings, cycleGAN translation outputs, our S2Dnet translation outputs, reconstructions from glossy images, reconstructions from ground truth diffuse images, reconstructions from cycleGAN output, and reconstructions from our S2Dnet output. All sequences are excluded from our training set, and the objects in columns 3 and 4 have not even been seen during training.

Next, we evaluate the quality of the surface reconstruction by feeding the translated sequences to the reconstruction pipeline. We found that the blurry output of pix2pix is not suitable for stereo reconstruction, since already the first step, estimating camera parameters based on feature correspondences, fails on this data. We therefore exclude pix2pix from the surface reconstruction evaluation but include the trivial baseline of directly reconstructing from the glossy input sequence to demonstrate the benefit of the translation step. In order to compute the geometric error of the surface reconstruction output, we register the reconstructed geometry to the ground truth mesh using a variant of ICP [63]. Next, we compute the Euclidean distance of each reconstructed surface point to its nearest neighbor in the ground truth mesh and report the per-model mean value. Table 2 shows the surface reconstruction error for our S2Dnet in comparison to the two remaining baselines. The numbers show that our S2Dnet performs best on average, and that preprocessing the glossy input sequences clearly helps to obtain a more accurate reconstruction, even when using the cycleGAN baseline. In Figure 7 we show qualitative surface reconstruction results for 10 different scenes in various environments.
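The error metric described above can be sketched in a few lines. This is a simplified illustration, assuming the two shapes are already registered, that the ground truth is represented by sampled points rather than a mesh surface, and that the normalizing bounding box is that of the ground truth; it uses a brute-force nearest-neighbor search, and the function names are hypothetical.

```python
import math


def bbox_diagonal(points):
    """Length of the diagonal of the axis-aligned bounding box of 3D points."""
    mins = [min(p[k] for p in points) for k in range(3)]
    maxs = [max(p[k] for p in points) for k in range(3)]
    return math.dist(mins, maxs)


def reconstruction_error(reconstructed, ground_truth):
    """Mean nearest-neighbor distance as a percentage of the bbox diagonal."""
    mean_dist = sum(
        min(math.dist(p, q) for q in ground_truth) for p in reconstructed
    ) / len(reconstructed)
    return 100.0 * mean_dist / bbox_diagonal(ground_truth)


ground_truth = [(0, 0, 0), (3, 0, 0), (0, 4, 0), (3, 4, 0)]  # bbox diagonal 5
reconstructed = [(0, 0, 0.5), (3, 4, 0.5)]                   # each 0.5 away
print(reconstruction_error(reconstructed, ground_truth))
```

Normalizing by the bounding box diagonal makes errors comparable across models of different scale. For realistic point counts, a spatial index such as a k-d tree would replace the brute-force search.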

4.2 Real-world Data

Since we do not have real-world ground truth data, we compile a real-world test set and perform a qualitative comparison on it. For all methods, we compare generalization performance when training on our synthetic dataset. Moreover, we evaluate how the different models perform when fine-tuning on real-world data, or training on real-world data from scratch. For this purpose, we compile a dataset by shooting photos of real-world objects. We choose 5 diffuse real-world objects and take 5k pictures in total from different camera positions and under varying lighting conditions. Next, we use a glossy spray paint to cover our objects with a glossy coat and shoot another 5k pictures to represent the glossy domain. The resulting dataset consists of unpaired samples of glossy and diffuse objects under real-world conditions, see Figure 10 a) and b).

Figure 8: Qualitative translation results on a real-world input sequence consisting of 11 views. The first row shows the glossy input sequence and the remaining rows show the translation results of pix2pix, cycleGAN, and our S2Dnet. All networks are trained on synthetic data only. Similar to the synthetic case, cycleGAN outperforms pix2pix, but it produces high-frequency artifacts that are not consistent along the views. Our S2Dnet is able to remove most of the specular effects and preserves all the geometric details in a consistent manner.
Figure 9: Qualitative surface reconstruction results on 7 different real-world scenes. Top to bottom: glossy input, cycleGAN translation outputs, our S2Dnet translation outputs, reconstructions from glossy images, reconstructions from cycleGAN output, and reconstructions from our S2Dnet output. All networks are trained on synthetic data only.

In Figure 8 we show qualitative translation results on real-world data. All networks are trained on synthetic data only here, and they all manage to generalize to some extent to real-world data, thanks to our high-quality synthetic dataset. Similar to the synthetic results in Figure 6, pix2pix produces blurry results, while cycleGAN introduces inconsistent high-frequency artifacts. S2Dnet is able to remove most of the specular effects and preserves geometric details in a consistent manner. In Figure 9 we show qualitative surface reconstruction results for 7 different scenes. Artifacts occur mainly close to the object silhouettes in complex background environments. This could be mitigated by training with segmentation masks.

Figure 10: a), b) A sample of our real-world dataset. c) translation result of cycleGAN when training from scratch on our real-world dataset. d) S2Dnet output, trained from scratch on our real-world dataset. e) S2Dnet output, trained on synthetic data only. f) S2Dnet output, trained on synthetic data, fine-tuned on real-world data.

Finally, we evaluate performance when either fine-tuning or training from scratch on real-world data. We retrain or fine-tune S2Dnet and cycleGAN on our real-world dataset, but cannot retrain pix2pix for this purpose, since it relies on a supervision signal that is not present in our unpaired real-world dataset. Our experiments show that training or fine-tuning using such a small dataset leads to heavy overfitting. The translation performance for real-world objects that were not seen during training decreases significantly compared to the models trained on synthetic data only. In Figure 10 c) and d) we show image translation results of cycleGAN and S2Dnet when training from scratch on our real-world dataset. Since the scene in Figure 10 is part of the training set (although the input image itself is excluded from the training set), our S2Dnet produces decent translation results, which is not the case for scenes not seen during training. Fine-tuning our S2Dnet produces similar results (Figure 10 f)).

5 Limitations and Future Work

Although the proposed framework enables reconstructing glossy and specular objects more accurately compared to state-of-the-art 3D reconstruction algorithms, a few limitations do exist. First, since the network architecture contains an encoder and a decoder with skip connections, the glossy-to-Lambertian image translation is limited to images of a fixed resolution. This resolution might be too low for certain types of applications. Next, due to the variability of the background in real images, the translation network might treat a portion of the background as part of the reconstructed object. Similarly, the network occasionally misclassifies the foreground as part of the background, especially in very bright regions on specular objects. Finally, as the simulated training data was rendered by assuming a fixed albedo, the network cannot consistently translate glossy materials with spatially varying albedo into a Lambertian surface. We expect that, given a larger and more diverse training set in terms of shapes, backgrounds, albedos and materials, the accuracy of the proposed method in recovering real objects would improve considerably. Our current training dataset includes the most common types of specular material. The proposed translation network has the potential to be extended to other, more challenging materials, such as transparent objects, given proper training data.

6 Supplementary Material

This supplementary material provides more technical details and experimental results for our specular-to-diffuse translation network, S2Dnet. Upon publication we will also release the full training and test data set as well as the trained network and code.

Training Details.

Tables 3 and 4 give the details of our network architecture. Xavier initialization [64] is used for the weights. We train our models on an NVIDIA 1080Ti GPU with 11GB of GPU memory, which limits us to a training batch size of 1.
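As an illustration of the initialization step, the following minimal sketch computes the Xavier (Glorot) uniform bound for a convolution kernel and samples weights from it. It assumes the uniform variant and the common fan convention (channels times kernel area); the text does not say which variant is used, and the function names are hypothetical.

```python
import math
import random


def xavier_uniform_bound(in_channels, out_channels, kernel_h, kernel_w):
    """Return a such that weights are drawn from U(-a, a).

    Assumption: fan_in and fan_out are the channel counts multiplied by
    the kernel area, as is common for convolution layers.
    """
    receptive_field = kernel_h * kernel_w
    fan_in = in_channels * receptive_field
    fan_out = out_channels * receptive_field
    return math.sqrt(6.0 / (fan_in + fan_out))


def xavier_uniform_weights(in_channels, out_channels, kernel_h, kernel_w, seed=0):
    """Sample a flat list of kernel weights (hypothetical helper)."""
    rng = random.Random(seed)
    bound = xavier_uniform_bound(in_channels, out_channels, kernel_h, kernel_w)
    count = out_channels * in_channels * kernel_h * kernel_w
    return [rng.uniform(-bound, bound) for _ in range(count)]


# First layer of Tables 3 and 4: 3 input channels, 64 output channels, 4x4 kernel.
bound = xavier_uniform_bound(3, 64, 4, 4)
weights = xavier_uniform_weights(3, 64, 4, 4)
```

The bound shrinks as the layer gets wider, which keeps the variance of activations roughly constant from layer to layer at the start of training.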

Additional Results.

Figure 11 presents a gallery of our real-world training data as described in the paper. Figure 12 shows more results of our S2Dnet for input scenes with varying illumination and different objects. Figure 13 extends Figure 10 in the paper, illustrating the results of training setups that use different kinds of training data. Figures 14 and 15 show two additional examples of image translation and reconstruction, where S2Dnet outperforms both pix2pix and cycleGAN. We also observe that when the weight of the perceptual correspondence loss is set to zero, i.e., when the perceptual correspondence loss is removed, the output of S2Dnet lacks inter-view consistency.

| Layer | Input Shape | Output Shape | Layer Information |
|---|---|---|---|
| Input Layer | (h, 3w, 3) | (h/2, 3w/2, 64) | CONV-(N64, K4x4, S2, P1), LeReLU |
| Hidden Layers | (h/2, 3w/2, 64) | (h/4, 3w/4, 128) | CONV-(N128, K4x4, S2, P1), PN, LeReLU |
| | (h/4, 3w/4, 128) | (h/8, 3w/8, 256) | CONV-(N256, K4x4, S2, P1), PN, LeReLU |
| | (h/8, 3w/8, 256) | (h/16, 3w/16, 512) | CONV-(N512, K4x4, S2, P1), PN, LeReLU |
| Output Layer | (h/16, 3w/16, 512) | (h/32, 3w/32, 1) | CONV-(N1, K4x4, S2, P1) |

Table 3: Discriminator network architecture. We use 5 such discriminators that share an identical network structure but operate at three scales of the image sequence and two scales of the corresponding local patches, using LSGAN (see Figure 5 in the paper). N: number of output channels, K: kernel size, S: stride, P: padding, PN: pixel-wise normalization, LeReLU: LeakyReLU. w, h: width and height of the input images. Note that the input width is 3w because we spatially concatenate the three views of the input sequence.
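To see why each K4x4, S2, P1 convolution in Table 3 halves the spatial size, the standard convolution output-size formula can be applied layer by layer. The sketch below is only an illustration: the helper names and the 256x256 per-view resolution are assumptions made for the example, not values taken from the paper.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def discriminator_shapes(h, w, views=3):
    """Trace (height, width, channels) through the five layers of Table 3.

    The three views are concatenated side by side, so the input width
    is views * w.
    """
    output_channels = [64, 128, 256, 512, 1]
    shape = (h, views * w, 3)
    shapes = [shape]
    for channels in output_channels:
        shape = (conv_out(shape[0]), conv_out(shape[1]), channels)
        shapes.append(shape)
    return shapes


# Hypothetical per-view resolution of 256x256 pixels.
shapes = discriminator_shapes(256, 256)
for shape in shapes:
    print(shape)
```

With these assumed inputs, the discriminator output is an 8x24 single-channel response map rather than a single scalar, so each output value scores one region of the concatenated three-view image.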
| Part | Input Shape | Output Shape | Layer Information |
|---|---|---|---|
| Down-sampling | (h, 3w, 3) | (h/2, 3w/2, 64) | CONV-(N64, K4x4, S2, P1), LeReLU |
| | (h/2, 3w/2, 64) | (h/4, 3w/4, 128) | CONV-(N128, K4x4, S2, P1), PN, LeReLU |
| | (h/4, 3w/4, 128) | (h/8, 3w/8, 256) | CONV-(N256, K4x4, S2, P1), PN, LeReLU |
| | (h/8, 3w/8, 256) | (h/16, 3w/16, 512) | CONV-(N512, K4x4, S2, P1), PN, LeReLU |
| | (h/16, 3w/16, 512) | (h/32, 3w/32, 512) | CONV-(N512, K4x4, S2, P1), PN, LeReLU |
| | (h/32, 3w/32, 512) | (h/64, 3w/64, 512) | CONV-(N512, K4x4, S2, P1), PN, LeReLU |
| | (h/64, 3w/64, 512) | (h/128, 3w/128, 512) | CONV-(N512, K4x4, S2, P1), PN, LeReLU |
| | (h/128, 3w/128, 512) | (h/256, 3w/256, 512) | CONV-(N512, K4x4, S2, P1), PN, LeReLU |
| | (h/256, 3w/256, 512) | (h/512, 3w/512, 512) | CONV-(N512, K4x4, S2, P1), ReLU |
| Up-sampling | (h/512, 3w/512, 512) | (h/256, 3w/256, 512) | DECONV-(N512, K4x4, S2, P1), CONV-(N512, K3x3, S1, P1), PN, ReLU |
| | (h/256, 3w/256, 1024) | (h/128, 3w/128, 512) | DECONV-(N512, K4x4, S2, P1), CONV-(N512, K3x3, S1, P1), PN, ReLU |
| | (h/128, 3w/128, 1024) | (h/64, 3w/64, 512) | DECONV-(N512, K4x4, S2, P1), CONV-(N512, K3x3, S1, P1), PN, ReLU |
| | (h/64, 3w/64, 1024) | (h/32, 3w/32, 512) | DECONV-(N512, K4x4, S2, P1), CONV-(N512, K3x3, S1, P1), PN, ReLU |
| | (h/32, 3w/32, 1024) | (h/16, 3w/16, 512) | DECONV-(N512, K4x4, S2, P1), CONV-(N512, K3x3, S1, P1), PN, ReLU |
| | (h/16, 3w/16, 1024) | (h/8, 3w/8, 256) | DECONV-(N256, K4x4, S2, P1), CONV-(N256, K3x3, S1, P1), PN, ReLU |
| | (h/8, 3w/8, 512) | (h/4, 3w/4, 128) | DECONV-(N128, K4x4, S2, P1), CONV-(N128, K3x3, S1, P1), PN, ReLU |
| | (h/4, 3w/4, 256) | (h/2, 3w/2, 64) | DECONV-(N64, K4x4, S2, P1), CONV-(N64, K3x3, S1, P1), PN, ReLU |
| | (h/2, 3w/2, 64) | (h, 3w, 3) | DECONV-(N3, K4x4, S2, P1), CONV-(N512, K3x3, S1, P1), Tanh |

Table 4: Generator network architecture.
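The input channel counts of the up-sampling stages in Table 4 (512, then 1024, ..., 512, 256, 64) are consistent with U-Net style skip connections, where each decoder stage receives the previous decoder output concatenated with the encoder feature map of the same resolution. The sketch below reproduces that bookkeeping. It is one reading of the table, not the authors' code: the function name and the choice of which stages receive a skip input are assumptions inferred from the listed channel counts.

```python
# Output channels per stage, as listed in Table 4.
ENCODER_CHANNELS = [64, 128, 256, 512, 512, 512, 512, 512, 512]
DECODER_CHANNELS = [512, 512, 512, 512, 512, 256, 128, 64, 3]


def decoder_input_channels(encoder, decoder, skip_stages):
    """Channels entering each decoder stage.

    A stage listed in skip_stages receives the previous output concatenated
    with the encoder feature map at the same resolution; other stages
    receive the previous output alone.
    """
    inputs = []
    previous = encoder[-1]  # the bottleneck feeds the first decoder stage
    for stage, out_channels in enumerate(decoder):
        skip = encoder[-1 - stage] if stage in skip_stages else 0
        inputs.append(previous + skip)
        previous = out_channels
    return inputs


# Assumption: stages 1..7 take a skip input; the first stage reads the
# bottleneck directly and the last stage is listed with 64 input channels.
decoder_inputs = decoder_input_channels(
    ENCODER_CHANNELS, DECODER_CHANNELS, skip_stages=set(range(1, 8))
)
print(decoder_inputs)

# Nine stride-2 stages give a total down-sampling factor of 2**9.
total_downsampling = 2 ** len(ENCODER_CHANNELS)
```

Because the nine stride-2 stages reduce each spatial dimension by a factor of 512, the input height and concatenated width must be compatible with that factor for the skip connections to line up, which fits the fixed-resolution limitation noted in Section 5.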
Figure 11: Gallery of our glossy-to-diffuse real-world training data and the spray (leftmost column) we used to paint the objects. We first choose 5 diffuse real-world objects and take 5k pictures in total from different camera positions and under varying lighting conditions. We then use a glossy spray paint to cover our objects with a glossy coat and shoot another 5k pictures to represent the glossy domain.
Figure 12: Gallery of our glossy-to-diffuse results of 40 synthetic and 10 real-world (last two rows) scenes. All sequences are excluded from our training set. Three synthetic (Armadillo, Standing Buddha, Roman Head Sculpture) and all real-world objects have not even been seen during training.
Figure 13: A sample of our real-world dataset is shown in (a-b). Translation results of cycleGAN when training from scratch on our real-world dataset or synthetic data only are shown in (c) and (d), respectively. S2Dnet outputs, trained from scratch on our real-world dataset or synthetic data only, are shown in (e) and (f), respectively. Another output of S2Dnet, trained on synthetic data and then fine-tuned on real-world data is presented in (g). The last row demonstrates the corresponding reconstruction results. Note that the output images are blurry when training from scratch on real-world data, i.e. (c) and (e), and thus not suitable for stereo reconstruction.
Figure 14: Qualitative comparison of image translation and surface reconstruction on a synthetic sequence consisting of 10 views. From top to bottom: glossy input, ground truth diffuse renderings, pix2pix translation outputs, cycleGAN translation outputs, our S2Dnet translation outputs without the perceptual correspondence loss, and our S2Dnet translation outputs with the perceptual correspondence loss. The last row shows the corresponding reconstruction results. All sequences are excluded from our training set. The output of pix2pix is blurry and not suitable for multi-view reconstruction. The outputs of cycleGAN and of our S2Dnet without the perceptual correspondence loss, although sharp, lack inter-view consistency. Our S2Dnet with the perceptual correspondence loss produces crisp and coherent translations, resulting in a better reconstruction.
Figure 15: Another comparison of image translation and surface reconstruction on a synthetic input sequence consisting of 10 views.

Acknowledgement

We thank the anonymous reviewers for their constructive comments. This work was supported in part by the Swiss National Science Foundation (169151), NSFC (61522213, 61761146002, 61861130365), the 973 Program (2015CB352501), the Guangdong Science and Technology Program (2015A030312015), the ISF-NSFC Joint Research Program (2472/17) and the Shenzhen Innovation Program (KQJSCX20170727101233642).

References