This is the official website of our work 3D Appearance Super-Resolution with Deep Learning published on CVPR2019.
We tackle the problem of retrieving high-resolution (HR) texture maps of objects that are captured from multiple view points. In the multi-view case, model-based super-resolution (SR) methods have recently been shown to recover high quality texture maps. On the other hand, the advent of deep learning-based methods has already had a significant impact on the problem of video and image SR. Yet, a deep learning-based approach to super-resolve the appearance of 3D objects is still missing. The main limitation of exploiting the power of deep learning techniques in the multi-view case is the lack of data. We introduce a 3D appearance SR (3DASR) dataset based on the existing ETH3D, SyB3R, MiddleBury, and our Collection of 3D scenes from TUM, Fountain and Relief. We provide the high- and low-resolution texture maps, the 3D geometric model, images and projection matrices. We exploit the power of 2D learning-based SR methods and design networks suitable for the 3D multi-view case. We incorporate the geometric information by introducing normal maps and further improve the learning process. Experimental results demonstrate that our proposed networks successfully incorporate the 3D geometric information and super-resolve the texture maps.
Efficiently retrieving the appearance information of objects through multi-camera observations is of great importance for the final goal of creating realistic 3D content. To increase the realism of the reconstructed 3D object, a detailed appearance needs to be added on top of the geometry. This high quality 3D content is used in applications such as movie production, video games and digital cultural heritage preservation. Yet, even with highly accurate 3D geometric reconstruction, simply re-projecting the images onto the geometry does not guarantee detailed appearance coverage.
To regain details from the low-resolution (LR) images, model-based super-resolution (SR) techniques have been introduced in the multi-view case [22, 21, 45]. These methods introduce a single coherent texture space to define a common texture map and they model the captured image as a downgraded version of this high-resolution (HR) texture map. Through the image formation model they exploit the visual redundancy of the overlapping views [22, 21] and of video frames. Although these model-based SR techniques successfully recover high quality texture maps, they are computationally demanding.
On the other hand, 2D example-based SR methods have been shown to outperform the model-based methods. The basic assumption of example-based SR is the recurrence of similar patches in different parts of an image or in different images. In particular, recent deep learning-based techniques have been proposed to learn the mapping between the LR and HR images. Different networks are trained on large image datasets that contain pairs of HR and LR images. Super-resolving LR images is then realized with a feed-forward step. Yet, a deep learning-based approach to super-resolve the appearance of 3D objects is still missing.
In this paper, our goal is to introduce deep learning techniques into the problem of appearance SR in the multi-view case. To exploit the capacity of 2D deep learning techniques, we first provide a 3D appearance dataset. Similar to the model-based SR methods, we introduce a common texture space and define a single coherent texture map. This texture map is first mapped onto the geometry. Then the textured surface is projected into the image space. We express the concatenation of these two mappings through the image formation model (Fig. 2). Through this image generation process and using captured images of multiple scaling factors we can then recover the corresponding texture maps. We provide a dataset that contains ground truth HR texture maps together with LR texture maps at three down-scaling factors. The dataset covers both synthetic scenes (SyB3R) and real scenes (ETH3D, MiddleBury, and our Collection of 3D scenes from TUM, Fountain and Relief). We then leverage the capacity of 2D learning-based methods and design two architectures suitable for the 3D multi-view case. Following prior work, we introduce normal maps to capture the local structure of the 3D model and incorporate the 3D geometric information into the 2D SR network. To our knowledge, our work is the first that introduces deep learning approaches for appearance SR in the multi-view case. Using our provided dataset, we evaluate different texture map SR methods including interpolation-based, model-based, and learning-based. In summary, the contributions of our paper are:
a 3D texture dataset that contains pairs of HR and LR textures of 3D objects. With this dataset we facilitate the integration of deep learning techniques into the problem of appearance SR in the multi-view case and we open up a promising novel research direction. We refer to the dataset as 3DASR.
the first appearance SR framework that elegantly combines the power of 2D deep learning-based techniques with the 3D geometric information in the multi-view setting.
The rest of the paper is organized as follows. Sec. 2 introduces related works of this paper. Sec. 3 describes how the texture maps are retrieved. Sec. 4 explains the generation process of the dataset. Sec. 5 explores the introduction of normal information into neural networks to super-resolve LR texture maps. Sec. 6 shows the evaluation results of different methods. Sec. 7 concludes the paper.
2D image SR has been extensively studied and it can be classified into three categories, i.e. interpolation-based, model-based, and example-based [40, 17, 48, 18]. Although a comprehensive review of these methods is beyond the scope of this paper, we present the underlying concepts of each of them. Interpolation-based methods [2, 32] increase the resolution by computing pixel values using the neighbouring information. But leveraging only the local information within the image cannot guarantee the recovery of high-frequency details. Model-based approaches describe the LR image as a downgraded version of the HR image and express the forward degradation system analytically. To solve the inverse problem, prior knowledge about the unknown HR image, such as smoothness and non-local similarity [8, 34], is imposed. Treating the problem as a stochastic process, a maximum likelihood or maximum a posteriori approach is followed. Although these methods successfully recover high-frequency details, they require elaborate optimization techniques. Most of the time they correspond to iterative approaches that are computationally heavy and time-consuming. Learning-based methods shift this computational burden to the learning phase and, using the trained network, super-resolve the image through a feed-forward step. Due to the availability of large datasets, carefully designed network architectures can learn the mapping from LR to HR images and achieve state-of-the-art performance [14, 44, 28, 36, 33, 50]. Our work introduces a deep learning-based approach in the multi-view case to retrieve the fine texture of 3D objects.
Adding a high quality texture layer onto the 3D geometry plays an essential role in the final realism. This is a challenging step since in the multi-view case there are additional sources of variation that we need to account for, namely occlusions, calibration and reconstruction inaccuracies. Several methods have been proposed in the literature  to efficiently exploit all the available color information and to address the aforementioned challenges.
Single view selection. To cope with different geometric inaccuracies, several methods use only one view to assign texture to each face. Lempitsky and Ivanov compensate for seams between the boundaries of each face by solving a discrete labeling problem. Gal et al. incorporate in their optimization the effect of foreshortening, image resolution, and blur by modifying the weighting function. Waechter et al. add an additional smoothness term to penalize inconsistencies between adjacent faces. By choosing a single view, these methods disregard the multiple color information that exists in the multi-view setting.
Multi-view selection. To leverage the multiple color information over views, several methods blend the images for each face. Debevec et al. reproject and blend view contributions according to visibility and viewpoint-to-surface angle. To capture view-dependent shading effects, Buehler et al. model and approximate the plenoptic function for the scene object. Some hybrid approaches [3, 10] select a single view per face and blend, in frequency space, views close to texture patch borders. To correct geometric inaccuracies, other work jointly optimizes the camera poses with the photometric consistency. Following the success of patch-based synthesis methods, Bi et al. propose a single view-independent texture mapping method that accounts for geometric misalignment. In general, these methods do not efficiently exploit the visual redundancy across viewpoints.
Multi-view texture SR methods. To retrieve fine appearance details, a handful of texture SR methods have leveraged the SR principle in the multi-view case and compute texture maps with a resolution higher than the input images [25, 39]. Goldlücke et al. introduce an image formation model to super-resolve texture maps and to refine the geometry and camera calibration. Tsiminaki et al. further improve SR texture quality by exploiting additional temporal redundancy and by uniformly correcting calibration and geometry errors with optical flow. These methods are however computationally expensive.
We alleviate the limitations of these model-based SR methods by introducing deep learning-based approaches, which have been shown to outperform model-based methods in the 2D case.
In order to be able to use deep learning-based techniques for super-resolving the texture of 3D objects, datasets need to be available. For 2D image SR there are several benchmarking datasets, namely Set5, Set14, Urban100, BSD100, and others [47, 26]. ImageNet has also been used as a training dataset in several example-based approaches [14, 15]. More recently, the DIV2K dataset was introduced to provide higher quality images.
Such data are however not available in the multi-view case. We propose in this work a methodology to compute textures of several resolutions and we provide a 3D texture dataset, 3DASR, that contains pairs of HR and LR textures of 3D objects.
The image formation model simulates the generation of the image from the unknown texture map. In Fig. 2, we can distinguish two steps i.e., texture mapping and projection to image space.
The texture mapping function assigns each entity of the texture map (texel) to a 3D point of the geometry. In order to be able to define the texture map and the mapping, we first need to parameterize the geometry in a common space. We assume that the 3D model is a known triangulated mesh and thus we can define any UV parameterization. Advanced algorithms that result in space-optimized texture maps are discussed in the literature. In this work we use a fixed UV parameterization, described in Subsec. 4.1. Through this mapping function, each texel is mapped to a point of the 3D mesh model.
We assume that the camera poses and the intrinsic camera parameters are known. The textured 3D object is then projected into the image space using the known projection matrices: for each viewpoint, the camera projection matrix maps a geometric point of the surface to a pixel location in the image plane of the corresponding image. Working with the vectorized versions of the texture map and of the projected image, the image is then expressed as a linear combination of the texels, that is, as the product of a projection operator in matrix form and the vectorized texture map. To estimate this projection operator, several issues need to be addressed. First, two geometric points of the surface might be projected into the same location due to the convexity of the geometry, and then only the visible color value needs to be selected. Second, this projection step can lead to non-integer locations. Third, the distribution of the projected points in the image space is non-uniform, which means that the points may be sparse in some areas. To combine the contributions of all the projected points falling into the neighborhood of a pixel, we introduce a Gaussian function as the weighting function. This function takes the location proximity into account, favoring projected points near the center of the pixel while penalizing those far away from it. Combining the contributions in this way also handles the sparse areas in the image space that can originate from high-curvature regions of the surface.
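The Gaussian weighting described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `splat_gaussian`, the parameters `sigma` and `radius`, and the assumption that occluded points have already been removed are all choices made for this example.

```python
import numpy as np

def splat_gaussian(points_xy, colors, height, width, sigma=0.5, radius=1):
    """Accumulate colors of projected (visible) points into an image.

    points_xy: (N, 2) non-integer pixel locations (x, y).
    colors:    (N, C) color of each projected point.
    sigma, radius: hypothetical Gaussian width and neighborhood size.
    """
    accum = np.zeros((height, width, colors.shape[1]), dtype=np.float64)
    weight = np.zeros((height, width), dtype=np.float64)
    for (x, y), c in zip(points_xy, colors):
        cx, cy = int(round(x)), int(round(y))
        for py in range(cy - radius, cy + radius + 1):
            for px in range(cx - radius, cx + radius + 1):
                if 0 <= px < width and 0 <= py < height:
                    # Weight decays with squared distance to the pixel center.
                    d2 = (px - x) ** 2 + (py - y) ** 2
                    w = np.exp(-d2 / (2.0 * sigma ** 2))
                    accum[py, px] += w * c
                    weight[py, px] += w
    covered = weight > 0
    image = np.zeros_like(accum)
    # Normalize by the total weight so each pixel is a weighted average.
    image[covered] = accum[covered] / weight[covered][:, None]
    return image, covered

points = np.array([[0.0, 0.0], [2.0, 0.0]])
colors = np.array([[1.0], [0.0]])
image, covered = splat_gaussian(points, colors, height=1, width=3)
```

In this toy call the middle pixel lies at equal distance from both points, so it receives the average of their colors, while the outer pixels take the color of their nearest point.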
We retrieve the texture maps by inverting the image formation model. We examine several resolutions, namely the ground truth high resolution and the down-scaled ones. Given the projection matrices together with the multi-view images, we compute the corresponding texture maps.
The 3DASR dataset we provide is based on four existing subsets: one synthetic subset, SyB3R, and three real subsets, ETH3D, MiddleBury, and a Collection of Bird, Beethoven and Bunny from the multi-view dataset of TUM, Fountain and Relief. We follow a generic pipeline to preprocess all subsets. We compute the triangulated 3D mesh with texture coordinates and vertex normals. We use the images provided by the original dataset as the HR images and we downscale them by the respective scale factors to compute the corresponding LR images. The projection matrices for the corresponding LR images are derived by RQ matrix decomposition of the original projection matrix and then scaling down the intrinsic parameters.
ETH3D, Collection, and MiddleBury correspond to real scenes. Regarding ETH3D, we use the training set of the HR multi-view subset that contains 13 scenes. Every scene is provided with multi-view images captured by DSLR cameras, the camera intrinsic and extrinsic parameters, and the ground truth point clouds captured by laser scanners. Collection is a collection of 6 3D scenes. We use the TempleRing and DinoRing of MiddleBury.
We first compute the triangulated mesh and then unwrap it to define the texture map. Through the UV unwrapping we assign to each vertex a UV coordinate.
For MiddleBury, we use the Multi-View Stereo (MVS) pipeline to reconstruct the meshes. For Bird, Beethoven and Bunny we use the same meshes as in the original paper, and for Fountain and Relief the meshes are those refined in the work of Maier et al.
For the ETH3D subset, the provided 3D model is just a point cloud. Therefore, both processing steps are needed. Fig. 3 shows the workflow. Note that triangulation is implemented in MeshLab while parameterization is done in Blender. For most of the scenes there are multiple point clouds, each capturing the scene geometry from a different viewpoint. These point clouds are therefore fused to create a complete scene geometry, followed by the computation of normals. The merged point cloud contains tens of millions of points, which may become a computational bottleneck for the post-processing. Thus, the point cloud is simplified using Poisson disk sampling, which reduces the number of points while maintaining the geometric details of the scene. Then the mesh is reconstructed using the ball pivoting algorithm.
The reconstruction result is exported to a PLY file which is imported into Blender. Blender’s UV unwrapping procedure is used for UV parameterization. Finally, the triangulated mesh with UV texture coordinates is exported to an OBJ file.
We consider the images provided by the original dataset as the HR images and we derive the LR images by down-sampling the HR ones. The intrinsic and extrinsic parameters are given for ETH3D and MiddleBury, so computing the projection matrices is straightforward. For the Collection subset, we use RQ decomposition to compute the intrinsic and extrinsic parameters. For all three subsets, the projection matrices corresponding to the LR images are derived by down-scaling the intrinsic parameters by each of the three scaling factors.
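A minimal NumPy sketch of this step is given below: a projection matrix is split into intrinsics and extrinsics with an RQ decomposition, and the intrinsics are scaled down. The helper names `rq3` and `downscale_projection` are hypothetical, and the sketch assumes that the decomposition yields a proper rotation and that the pixel origin convention makes a plain division of the intrinsics appropriate (no half-pixel offset).

```python
import numpy as np

def rq3(M):
    """RQ decomposition of a 3x3 matrix: M = K @ R with K upper triangular
    (positive diagonal) and R orthogonal, built on top of NumPy's QR."""
    Q, U = np.linalg.qr(np.flipud(M).T)
    K = np.flipud(np.fliplr(U.T))
    R = np.flipud(Q.T)
    # Flip signs so that the diagonal of K is positive.
    S = np.diag(np.sign(np.diag(K)))
    return K @ S, S @ R

def downscale_projection(P, s):
    """Return the projection matrix of an image down-scaled by factor s."""
    K, R = rq3(P[:, :3])
    t = np.linalg.solve(K, P[:, 3])
    # Scale focal lengths and principal point, keep the last row intact.
    K_lr = np.diag([1.0 / s, 1.0 / s, 1.0]) @ K
    return K_lr @ np.hstack([R, t[:, None]])

# Toy camera used only for illustration.
K_true = np.array([[1000.0, 0.0, 320.0], [0.0, 1000.0, 240.0], [0.0, 0.0, 1.0]])
a = 0.3
R_true = np.array([[np.cos(a), -np.sin(a), 0.0],
                   [np.sin(a),  np.cos(a), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.1, -0.2, 2.0])
P_hr = K_true @ np.hstack([R_true, t_true[:, None]])
P_lr = downscale_projection(P_hr, 2.0)

X = np.array([0.3, 0.1, 1.0, 1.0])  # homogeneous 3D point
hr_px = (P_hr @ X)[:2] / (P_hr @ X)[2]
lr_px = (P_lr @ X)[:2] / (P_lr @ X)[2]
```

Under these assumptions a 3D point lands at exactly half its HR pixel coordinates in the image down-scaled by a factor of 2.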
SyB3R is a synthetic dataset containing four scenes. Each scene contains an accurate geometry mesh model with optimal UV parameterization. The image rendering pipeline is shown in Fig. 4. To speed up the rendering, we add GPU option to the Python script. We edit the synthetic scene by keeping the major object, setting image resolutions, adding lights and cameras. The generated script and altered scene are passed to Blender and Cycles, resulting in the rendered images. The original mesh model of SyB3R contains several separated objects whose texture maps may overlap with each other in the texture space. To address this problem, we only keep the major part of the scene, i.e., the body of Toad, the skull of Skull, and the single rock of Geological Sample. We do not use Lego Bulldozer because it consists of many small pieces without meaningful texture.
To capture every surface of the object, 14 cameras are uniformly aligned on the sphere surrounding the object. The focal length of the cameras is 25 mm. The size of the sensor is mm. To ensure uniform background across the rendered images, 6 lights are added in the scene lighting from the 6 directions of the object.
The resolution of HR images is while the resolution of the LR images is calculated by dividing the HR width and height with respective scaling factors. Knowing the focal length, image resolution, principal point, rotation matrix and translation vector, the camera projection matrix is computed. As stated by the authors , the rendering time can be multiple hours per image due to the high computational load of the image synthesis process. Thus, we use GPU to render the images. Examples of rendered images are shown in Fig. 5.
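As an illustration of how a projection matrix follows from these quantities, the sketch below builds a pinhole camera matrix. The function name and the sensor and image sizes in the example call are placeholders rather than the values used for the dataset, the principal point is assumed to lie at the image center, and the rotation and translation are assumed to be expressed in the usual computer vision convention (camera looking along +Z), which differs from Blender's camera convention.

```python
import numpy as np

def projection_matrix(focal_mm, sensor_w_mm, sensor_h_mm, width_px, height_px, R, t):
    """Build P = K [R | t] for a pinhole camera.

    The focal length is converted from millimetres to pixels using the
    sensor size; the principal point is assumed to be the image center.
    """
    fx = focal_mm * width_px / sensor_w_mm
    fy = focal_mm * height_px / sensor_h_mm
    cx, cy = width_px / 2.0, height_px / 2.0
    K = np.array([[fx, 0.0, cx],
                  [0.0, fy, cy],
                  [0.0, 0.0, 1.0]])
    return K @ np.hstack([R, np.asarray(t, dtype=float).reshape(3, 1)])

# Placeholder sensor size and resolution, for illustration only.
P = projection_matrix(25.0, 32.0, 24.0, 640, 480, np.eye(3), np.zeros(3))
x = P @ np.array([1.0, 0.0, 5.0, 1.0])
pixel = x[:2] / x[2]
```

The LR projection matrices follow from the same function by passing the down-scaled width and height.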
After generating these data, we can now use the texture retrieval algorithm and compute the texture maps of different resolutions. Fig. 6 shows the texture maps of the different scenes.
Our 3DASR dataset contains pairs of HR and LR texture maps which resemble two-dimensional images. This allows us to make use of state-of-the-art 2D deep learning-based image SR methods. Such an integration is however not without its own difficulties. Being in the multi-view setting, the geometric information also needs to be encoded. In addition, the texture domain has its own characteristics compared to natural images. It is thus important to adapt the 2D deep learning-based SR method to this new domain. We incorporate the 3D geometric information through the normals and we show how to guide the learning process.
Normal coordinates can be normalized and stored as pixel colors in normal maps (Fig. 8), which have the same support as the texture maps. These normal maps capture the local structure of the surface. We thus feed them into the network to introduce the 3D geometric information. We store them as PNG images with 4 channels. The first 3 channels store the normalized normal coordinates and the fourth (alpha) channel is a mask that shows the support of the texture map, namely, where texel information is available.
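One possible encoding of such a 4-channel normal map is sketched below. The mapping from the range [-1, 1] to 8-bit values is a common convention and an assumption here, since the exact normalization is not specified; the function name is hypothetical, and writing the array to a PNG file is left to an image library of choice.

```python
import numpy as np

def encode_normal_map(normals, mask):
    """Pack per-texel normals and the support mask into an RGBA array.

    normals: (H, W, 3) float array of (possibly unnormalized) normals.
    mask:    (H, W) bool array, True where texel information is available.
    Returns an (H, W, 4) uint8 array.
    """
    norm = np.linalg.norm(normals, axis=-1, keepdims=True)
    unit = normals / np.maximum(norm, 1e-12)
    # Assumed convention: map each coordinate from [-1, 1] to [0, 255].
    rgb = np.round((unit * 0.5 + 0.5) * 255.0).astype(np.uint8)
    alpha = np.where(mask, 255, 0).astype(np.uint8)
    rgba = np.dstack([rgb, alpha])
    rgba[~mask, :3] = 0  # no geometry outside the support
    return rgba

normals = np.zeros((1, 2, 3))
normals[0, 0] = [0.0, 0.0, 2.0]  # points along +Z, not yet unit length
mask = np.array([[True, False]])
rgba = encode_normal_map(normals, mask)
```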
The next essential step is to incorporate the normal maps and adjust the neural network to the multi-view setting. There are two main approaches. The first is to use them directly as input information by concatenating them with the texture maps. The second approach is to interpret them as high-level features and concatenate them with feature maps computed at specific layers of the network. We follow the second approach due to the following two considerations. First, the normal maps encode 3D geometric information and can indeed be seen as high-level feature maps. Second, in the case where the normal maps were used as input, the whole network should be trained from scratch. Given the small size of our 3DASR dataset this would lead to over-fitting. Thus, by introducing them at higher layers we train only the few last layers of the network, fine-tune the lower ones and avoid this way over-fitting.
In order to examine the importance of the geometric information for the performance of the training, we compute the normals in the spaces of both the low- and high-resolution texture maps. We call them LR and HR normal maps accordingly. We use EDSR as a case-study network to show the adaptation of the network. We thus provide two different versions, one where the LR normal maps are added before the upsampling layer and a second where the HR normal maps are added after the upsampling layer.
The architecture of the two adapted networks is shown in Fig. (a) and Fig. (b). We name them NLR and NHR, reflecting the use of LR and HR normal maps respectively. In Fig. (a), LR normal maps are concatenated with the feature maps after the 30th ResBlock. The following two ResBlocks and the upsampling layer learn a representation from the combined feature map. In Fig. (b), the upsampling layer is moved before the two fine-tuning ResBlocks and the HR normal maps are added directly after the upsampling layer. Four additional convolutional layers follow the two ResBlocks. The number of feature maps after the concatenation becomes 260, which is the sum of the original 256 channels and the additional 4 channels of the normal map.
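The channel arithmetic of the two variants can be checked with a small NumPy sketch. It only illustrates tensor shapes: the helper `concat_normals` is hypothetical, random arrays stand in for real feature maps, and nearest-neighbor repetition is used as a stand-in for the learned upsampling layer.

```python
import numpy as np

def concat_normals(features, normal_map):
    """Concatenate a (C, H, W) feature map with a (4, H, W) normal map."""
    if features.shape[1:] != normal_map.shape[1:]:
        raise ValueError("feature map and normal map must share the same spatial size")
    return np.concatenate([features, normal_map], axis=0)

scale, h, w = 2, 8, 8
rng = np.random.default_rng(0)
body_features = rng.standard_normal((256, h, w))   # output of the body part
lr_normals = rng.random((4, h, w))                  # LR normal map (RGBA)
hr_normals = rng.random((4, scale * h, scale * w))  # HR normal map (RGBA)

# NLR: concatenate at low resolution, before upsampling.
nlr_input = concat_normals(body_features, lr_normals)

# NHR: upsample first (placeholder for the learned layer), then concatenate.
upsampled = body_features.repeat(scale, axis=1).repeat(scale, axis=2)
nhr_input = concat_normals(upsampled, hr_normals)
```

Both variants feed 260 channels to the layers that follow; they differ only in the spatial resolution at which the geometric information enters.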
We name the layers from the starting convolutional layer to the 30th ResBlock the body part of the network. The remaining layers are referred to as the tail part. The parameters of the body part are loaded from the pretrained EDSR model and fine-tuned to adapt to the texture domain, while those of the tail part are randomly initialized and trained from scratch. Thus, a larger learning rate is used to train the tail parameters while a smaller one is used to fine-tune the body parameters. We also directly fine-tune the EDSR model without any architecture modification, in which case an in-between learning rate is used. To train the CNN, the mask is used to identify the active areas of the texture maps. We crop the texture maps into patches and feed them into the network for training, excluding those patches whose black areas are larger than a predefined threshold. During inference the CNN is applied to the whole LR texture map.
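The patch selection can be sketched as follows. The function name, the non-overlapping grid, and the parameters `patch_size` and `max_empty_fraction` are assumptions made for this example; the patch size and threshold actually used are not reproduced here.

```python
import numpy as np

def extract_patches(texture, mask, patch_size, max_empty_fraction):
    """Crop a texture map into non-overlapping patches, dropping patches
    whose inactive (black) area exceeds the given fraction."""
    patches = []
    height, width = mask.shape
    for y in range(0, height - patch_size + 1, patch_size):
        for x in range(0, width - patch_size + 1, patch_size):
            window = mask[y:y + patch_size, x:x + patch_size]
            empty_fraction = 1.0 - window.mean()
            if empty_fraction <= max_empty_fraction:
                patches.append(texture[y:y + patch_size, x:x + patch_size])
    return patches

texture = np.arange(4 * 4 * 3, dtype=np.float64).reshape(4, 4, 3)
mask = np.zeros((4, 4), dtype=bool)
mask[:, :2] = True  # only the left half of the map carries texels
patches = extract_patches(texture, mask, patch_size=2, max_empty_fraction=0.5)
```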
The provided dataset contains 4 subsets and 24 texture maps in total. Cross-validation is used to get the evaluation result on the whole dataset. That is, we divide the 24 texture maps into 2 splits, one for training and one for testing. The texture maps of the 4 subsets are distributed equally between the two splits, so that each split contains 12 texture maps. In addition, we also try cross-validation within each subset, where the training and testing texture maps come from the same subset. The 4 subsets are captured under different conditions and they may have different characteristics. In the case of cross-validation within the subset, the training and testing data share the same characteristics. In the case of cross-validation on the whole dataset, there are more training data but with different characteristics. A comparison of these two cases can indicate whether subset characteristics or a large training set is more important in our problem setting. The networks are trained for 50 epochs for subset cross-validation and 100 epochs for all of the other experiments.
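A simple way to obtain two balanced splits is sketched below. The per-subset counts are those reported for the dataset (13, 6, 3 and 2 scenes); the scene identifiers and the round-robin assignment are hypothetical and do not reproduce the actual split used in the experiments.

```python
def two_fold_split(subset_sizes):
    """Assign scenes to two splits in round-robin order so that every
    subset contributes to both splits."""
    splits = ([], [])
    index = 0
    for name, count in subset_sizes.items():
        for k in range(count):
            splits[index % 2].append(f"{name}_{k:02d}")  # placeholder scene id
            index += 1
    return splits

subset_sizes = {"ETH3D": 13, "Collection": 6, "SyB3R": 3, "MiddleBury": 2}
split_a, split_b = two_fold_split(subset_sizes)

# Two-fold cross-validation: train on one split, test on the other, then swap.
folds = [(split_a, split_b), (split_b, split_a)]
```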
PSNR per method (figure labels): EDSR 21.77 dB, EDSR-FT 28.25 dB, NLR 28.38 dB, NHR 30.25 dB, HRST 32.29 dB.
Using our 3DASR dataset, we compare three main categories of methods for super-resolving the appearance of 3D objects: interpolation-based, model-based and learning-based. The interpolation-based methods include nearest, bilinear, bicubic, and Lanczos interpolation. We use the method of Tsiminaki et al. as a representative of the model-based category, denoted as HRST. Using the EDSR network as a base model, we introduce several modifications of it. There are in total 6 different cases. EDSR: We use the pretrained EDSR network and directly test it on our data. EDSR-FT: We fine-tune the pretrained EDSR on our 3DASR dataset without architecture modification, using whole-set cross-validation. NLR-Sub: We incorporate LR normal maps into EDSR and use subset cross-validation. NLR: We incorporate LR normal maps into EDSR as in Fig. (a) and use whole-set cross-validation. NHR: We incorporate HR normal maps into EDSR as in Fig. (b) and use whole-set cross-validation. HRST-CNN: We use EDSR as a post-processing step on the super-resolved texture maps of HRST. In this scenario, the upsampling layer of EDSR is replaced with ordinary convolutional layers.
We compute PSNR metrics in the active regions of the texture domain, that is, on the set of texels in the texture domain that are actually mapped to the 3D model. For the purpose of benchmarking, these metrics can also be computed in the image domain by reprojecting the texture maps into the image space. According to the PSNR values of Table 1, we can draw the following conclusions.
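A PSNR restricted to the active texels can be computed as in the sketch below. The function name and the 8-bit peak value of 255 are assumptions for this example; the released evaluation code may differ in details such as color space or border handling.

```python
import numpy as np

def masked_psnr(sr, hr, mask, peak=255.0):
    """PSNR between two texture maps, evaluated only where mask is True.

    sr, hr: (H, W, C) arrays; mask: (H, W) bool array of active texels.
    """
    diff = sr.astype(np.float64)[mask] - hr.astype(np.float64)[mask]
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

hr = np.zeros((2, 2, 3), dtype=np.uint8)
sr = hr.copy()
sr[0, 0] = 255  # error outside the active region, ignored by the mask
sr[1, 1] = 10   # error inside the active region
mask = np.array([[False, True], [True, True]])
psnr = masked_psnr(sr, hr, mask)
```

The large error at the inactive texel does not affect the result, which is the purpose of restricting the metric to the support of the texture map.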
Among the interpolation-based methods, bilinear interpolation achieves better results than bicubic and Lanczos interpolation, which contradicts what is commonly observed for 2D image interpolation. This can probably be explained by the fact that the texture and the ordinary image domains have different characteristics. In 2D image SR, the LR image is modeled as a bicubic down-sampled version of the HR image, which favors advanced interpolation methods. In the multi-view setting, due to the several sources of variability, the LR and HR texture maps might not be strictly aligned.
The texture domain differs from the image domain. Compared to the pretrained EDSR model, the fine-tuned EDSR-FT incorporates the characteristics of the texture domain. Thus, algorithms need to be adapted to the specific domain.
We incorporate the 3D geometric information of the multi-view setting through the normal maps and we compare to the simple case of the fine-tuned EDSR-FT. According to the PSNR values, the geometric information improves the quality of the reconstructed texture maps. We then validate its importance by comparing the two cases of NLR and NHR. The PSNR values increase even more when we express this geometric information with higher precision: the NHR case, where HR normal maps are used, outperforms NLR. Thus, HR normal maps capture more geometric details and improve the performance.
NLR-Sub uses cross-validation within each subset while NLR uses it on the whole set. In the case of NLR-Sub the subset characteristics are respected, while in the case of NLR they are not. The main advantage of NLR is that more data are used for training (12 HR texture maps). The higher PSNR values of NLR compared to NLR-Sub indicate that the training data size is more important than subset characteristics for this task. Furthermore, the PSNR gap between NLR and NLR-Sub on ETH3D is larger than that on MiddleBury and Collection. This is because ETH3D is a relatively larger dataset than MiddleBury and Collection. Thus, even if subset cross-validation is used, NLR-Sub does not diverge a lot from NLR on the ETH3D dataset. Therefore, we conclude that although each subset may have its own characteristics, training data size stands out as a major factor.
The model-based method HRST formulates the texture retrieval problem as an optimization problem. It is a two-stage iterative algorithm and its computational cost increases even more with increasing geometric complexity. This explains the unstable behaviour of the HRST method across the datasets. HRST outperforms NHR on MiddleBury and Collection but not on ETH3D and SyB3R. In most cases, HRST-CNN enhances the super-resolved texture maps. It is important to note that even in the cases where the model-based method outperforms the deep learning-based approach, the PSNR values are relatively close. More importantly, the deep learning-based approach is a feed-forward step that can be executed in seconds, while the model-based one is a heavy iterative process.
The visual results are shown in Fig. 9 and Fig. 10. Directly upsampling the LR texture maps produces blurry images. EDSR leads to some white texels along the boundaries between the black region and the texture region. As we gradually introduce the characteristics of the domain through the EDSR-FT, NLR, and NHR methods, we successfully recover more visual details.
We provided 3DASR, a 3D appearance SR dataset that captures both synthetic and real scenes with a large variety of texture characteristics. The dataset, the evaluation code, and the baseline models are available at https://github.com/ofsoundof/3D_Appearance_SR. The dataset is based on four datasets, ETH3D, Collection, MiddleBury, and SyB3R. It contains ground truth HR texture maps and LR texture maps at three scaling factors. The 3D mesh, multi-view images, projection matrices, and normal maps are also provided. We introduced a deep learning-based SR framework in the multi-view setting. We showed that 2D deep learning-based SR techniques can successfully be adapted to the new texture domain by introducing the geometric information via normal maps, and that they achieve performance relatively similar to that of the model-based methods. This work opens up a novel direction of deep learning-based texture SR methods for the multi-view setting. A necessary next step is to enlarge our dataset, either through common augmentation techniques or by following our proposed texture retrieval pipeline to introduce new datasets. The fact that the performance of our deep learning-based SR framework is in some cases (MiddleBury and Collection) below the model-based one indicates that there is still room for more elaborate methods that unify the concepts of model-based SR techniques and the 2D deep learning-based approaches.
Table 2 columns: Mesh size | No. vertices | No. faces | Resolution | No. views
The provided dataset has 24 different scenes in total including 13 from ETH3D, 6 from Collection, 3 from SyB3R, and 2 from Middlebury. Each scene contains a 3D mesh, multi-view images, and the corresponding projection matrices. The details of those scenes are provided in Table 2 including the mesh size, the number of vertices and faces in the mesh, the resolution of the HR images, and the number of views in the scene. It is shown in Table 2 that the scenes have different complexities, i.e., different mesh size and number of vertices and faces.
Twelve of the 24 texture maps at different resolutions (HR and the down-sampled versions) are shown in Fig. 11, Fig. 12, and Fig. 13, respectively. By comparing the texture maps at different resolutions, we find that the ground truth texture maps contain more details than the LR ones. In addition, the HR texture maps are denser than the LR ones. Since optimal UV parameterizations exist for the synthetic scenes GeologicalSample, Toad, and Skull, their texture maps have fewer disconnected support regions.
In Table 3 and Table 4, we show the PSNR and SSIM results of different methods. Apart from the methods in the main paper, the results of FSRCNN [dong2016accelerating], SRResNet, and RCAN are also provided. For FSRCNN, the pre-trained models provided by the authors are directly used. SRResNet, EDSR, and RCAN are trained on DIV2K. More visual results of Relief, Facade, Buddha, and Fountain for different methods are shown in Fig. 14, Fig. 15, Fig. 16, and Fig. 17, respectively.