Deep Shape-from-Template: Wide-Baseline, Dense and Fast Registration and Deformable Reconstruction from a Single Image

11/19/2018 ∙ by David Fuentes Jiménez, et al. ∙ Universidad de Alcalá

We present Deep Shape-from-Template (DeepSfT), a novel Deep Neural Network (DNN) method for solving real-time automatic registration and 3D reconstruction of a deformable object viewed in a single monocular image. DeepSfT advances the state-of-the-art in various aspects. Compared to existing DNN SfT methods, it is the first fully convolutional real-time approach that handles an arbitrary object geometry, topology and surface representation. It also does not require ground truth registration with real data and scales well to very complex object models with large numbers of elements. Compared to previous non-DNN SfT methods, it does not involve numerical optimization at run-time, and is a dense, wide-baseline solution that does not demand, and does not suffer from, feature-based matching. It is able to process a single image with significant deformation and viewpoint changes, and handles well the core challenges of occlusions, weak texture and blur. DeepSfT is based on residual encoder-decoder structures and refining blocks. It is trained end-to-end with a novel combination of supervised learning from simulated renderings of the object model and semi-supervised automatic fine-tuning using real data captured with a standard RGB-D camera. The cameras used for fine-tuning and run-time can be different, making DeepSfT practical for real-world use. We show that DeepSfT significantly outperforms state-of-the-art wide-baseline approaches for non-trivial templates, with quantitative and qualitative evaluation.




1 Introduction

The joint task of registration and 3D reconstruction of deformable objects from RGB videos and images is a major objective in computer vision, with numerous potential applications, for instance in augmented reality. In comparison with other mature 3D reconstruction problems, such as Structure-from-Motion (SfM) where rigidity is imposed on the scene [25], deformable registration and 3D reconstruction present significant unsolved problems. Two main scenarios exist in this task: Non-Rigid SfM (NRSfM) [8, 27, 47, 11] and Shape-from-Template (SfT) [42, 6, 35, 10]. NRSfM reconstructs the 3D shape of a deformable object from multiple RGB images. In contrast, SfT reconstructs the 3D shape from a single RGB image using an object template. The template includes knowledge about the object’s appearance, shape and permissible deformations. These are typically represented by a texture-map, a 3D mesh and a simple mechanical model. SfT is suitable for many applications where the template is known or can be acquired, using for instance SfM or any available 3D scanning solution.

SfT solves two fundamental and intimately related problems: i) template-image registration, which associates pixels in the image to their corresponding locations in the template, and ii) shape inference, which recovers the observed 3D shape or equivalently the template’s 3D deformation. The majority of SfT methods focus on solving shape inference assuming that registration is independently obtained with existing feature-based or dense methods [39, 19, 14]. In all other cases, both problems are solved simultaneously using tracking with iterative optimization [35, 13, 3]. To date there exists no non-DNN wide-baseline SfT method capable of solving both problems densely and in real-time. DNN SfT methods have been very recently proposed [40, 20], following the success of the DNN methodology in related problems such as 3D human pose estimation [33, 22], depth [16, 18, 28] and surface normal reconstruction with rigid objects [5, 45]. The general idea is to learn the function that maps an input image to the template’s 3D deformation parameters from training data. This has the potential to jointly solve registration and shape inference and eliminates the need for iterative optimization at run-time. These two recent methods are promising but bear important limitations. First, they are limited to flat templates described by regular meshes with very small vertex counts. Second, they require ground-truth registration for training, which is practically impossible to obtain for real data.

We propose DeepSfT, the first DNN SfT method based on a fully-convolutional network without the above limitations. DeepSfT has the following desirable characteristics. 1) It is dense and provides registration and 3D reconstruction at the pixel level. 2) It does not require temporal continuity and handles large deformations and pose changes between the template and the object. 3) It runs in real-time using conventional GPU hardware. 4) It is applicable to templates with arbitrary geometry, topology and surface representation, including meshes, implicit and explicit functions such as NURBS. 5) It is highly robust and handles well the major challenges of SfT, including self and external occlusions, illumination changes and blur. 6) Training involves a novel combination of supervised learning with synthetic data and semi-supervised learning with RGB-D real data. Crucially, we do not require ground-truth registration for the real training data but only RGB-D. Compared to previous approaches, this makes it feasible to acquire the real training data automatically, and therefore feasible to deploy it in real settings. 7) The network complexity, training cost and running cost are independent of the template representation, for instance of the mesh vertex count. It therefore scales very well to highly complex templates with detailed geometry that were, until now, not solvable in real-time. There exists no previous method in the literature with the above characteristics. Our method thus pushes SfT significantly forward. We present quantitative and qualitative experimental results showing that our method outperforms existing methods in accuracy, robustness and computation time.

2 Previous Work

We first review the non-DNN SfT methods, forming the vast majority of existing work. We start with the shape inference methods and then the integrated methods combining shape inference and registration. We finally review the recent DNN SfT methods.

Shape inference methods.

The shape inference methods assume that the registration between the template and the image is given, which is a fundamental limiting factor of applicability. We classify them according to the deformation model. The most popular deformation model is isometry, which attempts to approximately preserve the geodesic distance, and has been shown to be widely applicable.

Isometric methods follow three main strategies: i) using a convex relaxation of isometry called inextensibility [41, 42, 37, 9], ii) using local differential geometry [6, 10] and iii) minimizing a global non-convex cost [9, 36]. Methods in iii) are the most accurate but also the most expensive. They require an initial solution found using a method from i) or ii). There also exist non-isometric methods, with the angle-preserving conformal model [6] or simple mechanical models with linear [32, 31] and non-linear elasticity [23, 24, 2]. These models all require boundary conditions in the form of known 3D points, which is another fundamental limiting factor of applicability. Their well-posedness remains an open research question.

Integrated methods.

The integrated methods compute both registration and shape inference. We classify them according to their ability to handle wide-baseline cases. Short-baseline methods are restricted to video data and may work in real time [35, 13, 29]. They are based on the iterative minimization of a non-convex cost and use keypoint correspondences [35] or optic flow [13, 29]. The latter supports dense solutions and resolves complex, high-frequency deformations. Their main limitations are two-fold. First, they break down when there is fast deformation or camera motion. Second, at run-time, they must solve an optimization process that is highly computationally demanding, requiring careful hand-crafted design and balancing of data and deformation constraints. In contrast, wide-baseline SfT methods can deal with individual images showing the object with strong deformation without priors on the camera viewpoint [35, 12]. These methods solve registration sparsely using keypoints such as SIFT [30] with filtering to reduce the mismatches [39, 38]. The main limitations of these methods are two-fold. First, they are fundamentally limited by the feature-based registration, which fails due to a weak or repetitive texture, low image resolution, blur or viewpoint distortion. Second, they require solving a highly demanding optimization problem at run-time. Because of these limitations, the existing wide-baseline methods have only been shown to work for simple objects with simple deformations, such as bending sheets of paper.

DNN SfT methods.

Two DNN SfT methods [40, 20] have been recently proposed. They address isometric SfT by learning the mapping between the input image and the 3D vertex coordinates of a regular mesh. Both methods use regression with a fully-convolutional encoder. They require the template to be flat and to contain a small number of regular elements. In [40] belief maps are obtained for the 2D position of the vertices, which are then combined with depth estimation and reprojection constraints to recover their 3D positions. This considerably limits the size of the mesh, as shown by the reported examples, which all have small vertex counts. Both methods were trained and tested with synthetically generated images. Only [40] provides results on a real video of a bending paper sheet, but it required ground-truth registration and shape to fine-tune the network on part of the video. These two methods thus form a preliminary step toward applying DNNs to SfT, but are strongly limited by the low template complexity and the requirement for ground truth registration. Indeed, even if depth may be relatively easy to obtain for training, ground truth registration is extremely difficult to measure for real data.

3 Problem Formulation

Figure 1 shows the geometrical model of SfT.

Figure 1: Differential Geometrical Model

The template is known and represented by a 3D surface jointly with its appearance, described as a texture map. The texture map is standard and represented as a collection of flattened texture charts whose union covers the appearance of the whole template, as seen in Figure 1. In our approach templates are not restricted to a specific topology, modelling both thin-shell and volumetric objects. They are also not restricted to a specific representation. In our experimental section we use mesh representations because of their generality, but this is not a requirement of the method. The bijective map between the texture map and the template surface is known. We assume that the template surface is deformed by an unknown quasi-isometric map onto the unknown deformed surface. Quasi-isometric maps permit localized extension/compression, common with real world deforming objects.

The input image is modeled as a colour intensity function, which is discretized into a regular grid of pixels in the retinal plane. The visible part of the deformed surface is projected onto an unknown subset of the image plane. We assume a perspective camera for projection, with the deformed surface represented by a perspective embedding.

We assume the camera intrinsics are known and any lens distortion is either negligible or has been corrected. The depth function gives the depth coordinate of each visible surface point in the camera’s coordinate system. Volumetric templates always induce self-occlusions in the image.

The unknown registration function, or warp, is an injective map that relates each point of the visible image region to its corresponding point in the texture map.
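The relation between a pixel, its depth coordinate and the 3D surface point can be illustrated with a standard pinhole model. The following is a minimal sketch under assumptions: the function names, the intrinsic parameter names (`fx`, `fy`, `cx`, `cy`) and the numeric values are illustrative and are not taken from the paper.

```python
def project(point, fx, fy, cx, cy):
    """Project a 3D point (x, y, z) in camera coordinates to a pixel (u, v)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("the point must lie in front of the camera (z > 0)")
    return (fx * x / z + cx, fy * y / z + cy)


def back_project(pixel, depth, fx, fy, cx, cy):
    """Recover the 3D point observed at a pixel from its depth coordinate."""
    u, v = pixel
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


# Illustrative intrinsics (assumed values, not taken from the paper).
FX, FY, CX, CY = 500.0, 500.0, 240.0, 135.0

point = (0.10, -0.05, 1.00)  # metres, camera frame
pixel = project(point, FX, FY, CX, CY)
recovered = back_project(pixel, point[2], FX, FY, CX, CY)
```

With known intrinsics, a per-pixel depth map is therefore enough to recover the visible 3D surface, which is why a depth output suffices for reconstruction in this setting.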

4 Network architecture

We propose a DNN, hereinafter DeepSfT, that estimates the depth map and the registration warp directly from the input image. The estimated depth and warp maps are normalized and discretized versions of the depth function and the registration function. Our method also recovers the visible image region of the object, because both outputs take a constant value outside it. In this sense DeepSfT performs object segmentation at the pixel level. The network weights depend on the template and are learned by training (see Section 6). DeepSfT is trained to recognize a specific template, so training requires a large number of deformations, as described in Section 5.
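The idea that a constant value outside the object doubles as a segmentation can be sketched as follows. This is an illustration under stated assumptions: the sentinel value `BACKGROUND = -1.0`, the normalization to [0, 1], the threshold and all function names are hypothetical choices, not values confirmed by the paper.

```python
BACKGROUND = -1.0  # assumed sentinel for pixels outside the object


def encode_map(values, mask, lo, hi):
    """Normalize object pixels from [lo, hi] to [0, 1]; mark the rest as background."""
    return [
        [(v - lo) / (hi - lo) if inside else BACKGROUND
         for v, inside in zip(value_row, mask_row)]
        for value_row, mask_row in zip(values, mask)
    ]


def decode_mask(encoded, threshold=-0.5):
    """Recover the object mask: anything clearly above the sentinel is object."""
    return [[value > threshold for value in row] for row in encoded]


depth_mm = [[800.0, 0.0], [0.0, 1200.0]]
mask = [[True, False], [False, True]]
encoded = encode_map(depth_mm, mask, lo=500.0, hi=1500.0)
recovered_mask = decode_mask(encoded)
```

A threshold halfway between the sentinel and the normalized range is used because a regression network outputs continuous values that only approximate the sentinel.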

Figure 2 shows the proposed network architecture. The complete architecture receives an RGB input image with a resolution of 480×270 pixels (see Table 1) and returns the estimated depth map and the registration maps. Both outputs have the same size as the input image.

Figure 2: DeepSfT architecture. The proposed network architecture is composed of two principal blocks, the main block and the domain adaptation block. Each block is an encoder-decoder scheme tailored for SfT. The main block receives an RGB input image, processes it and gives a first depth map estimate along with the warp. The domain adaptation block takes the depth and warp estimates from the main block and the RGB input image to create a final refined depth map.

DeepSfT is divided into two main blocks. The first, the main block, is modelled on an encoder-decoder architecture, very similar to those used in semantic segmentation [4]. It gives a first depth map estimate and the registration warp. The second is a domain adaptation block that uses the RGB input image together with the output of the previous block to refine the depth map estimate. This cascade topology, where the input image is fed into refinement blocks, has been shown to improve on single-stage results in 3D depth estimation [17]. This block is also crucial to adapt the network to real data as described in Section 6.

Both the main and domain adaptation blocks use identity, convolutional and deconvolutional residual feed-forwarding structures based on the ResNet50 [43] (see Figure 3).

Figure 3: Identity, convolutional and deconvolutional residual blocks.

Each block is composed of two unbalanced parallel branches with convolutional layers that propagate the information forward into deeper layers, preserving the high frequencies of the data.
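The two-branch structure can be illustrated with a toy residual block. This is a minimal sketch in the spirit of ResNet [43], not the paper's layers: plain matrix-vector products stand in for convolutions, batch normalization is omitted, and all names are hypothetical.

```python
def relu(xs):
    return [max(0.0, x) for x in xs]


def dense(xs, weights):
    """Stand-in for a convolution: a plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, xs)) for row in weights]


def identity_block(xs, w1, w2):
    """Main branch F(x) plus an identity shortcut: relu(F(x) + x)."""
    main = dense(relu(dense(xs, w1)), w2)
    return relu([m + x for m, x in zip(main, xs)])


def projection_block(xs, w1, w2, w_shortcut):
    """Main branch plus a projected shortcut, so the output width may change."""
    main = dense(relu(dense(xs, w1)), w2)
    shortcut = dense(xs, w_shortcut)
    return relu([m + s for m, s in zip(main, shortcut)])
```

The shortcut branch carries the input forward almost unchanged, so even when the main branch contributes little, the signal still reaches deeper layers.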

Table 1 shows the layered decomposition of the main block. It first receives the RGB input image and performs a first reduction of the input size. Then, a sequence of three convolutional and identity blocks encode texture and the depth information as deep features.

| Layer num | Type | Output size | Kernels/Activation |
|---|---|---|---|
| 1 | Input | (270,480,3) | |
| 2 | Convolution 2D | (135,240,64) | (7,7) |
| 3 | Batch Normalization | (135,240,64) | |
| 4 | Activation | (135,240,64) | Relu |
| 5 | MaxPooling 2D | (45,80,64) | (3,3) |
| 6 | Encoding Convolutional Block | (45,80,[64, 64, 256]) | (3,3) |
| 7-8 | Encoding identity Block x 2 | (45,80,[64, 64, 256]) | (3,3) |
| 9 | Encoding Convolutional Block | (23,40,[128, 128, 512]) | (3,3) |
| 10-12 | Encoding identity Block x 3 | (23,40,[128, 128, 512]) | (3,3) |
| 13 | Encoding Convolutional Block | (12,20,[256, 256, 1024]) | (3,3) |
| 14-16 | Encoding identity Block x 3 | (12,20,[256, 256, 1024]) | (3,3) |
| 17-20 | Encoding identity Block x 3 | (12,20,[1024, 1024, 256]) | (3,3) |
| 21 | Decoding Convolutional Block | (24,40,[512, 512, 128]) | (3,3) |
| 22 | Cropping 2D | (23,39,128) | (1,1) |
| 23-25 | Encoding identity Block x 3 | (23,39,[512, 512, 128]) | (3,3) |
| 26 | Decoding Convolutional Block | (46,78,[256, 256, 64]) | (3,3) |
| 27 | Zero Padding | (46,80,64) | (0,1) |
| 28-29 | Encoding identity Block x 2 | (46,80,[256, 256, 64]) | (3,3) |
| 30 | Upsampling | (138,240,64) | (3,3) |
| 31 | Cropping 2D | (136,240,64) | (2,0) |
| 32 | Convolution 2D | (136,240,64) | (7,7) |
| 33 | Batch Normalization | (135,240,64) | |
| 34 | Activation | (136,240,64) | Relu |
| 35 | Upsampling | (272,480,64) | (3,3) |
| 36 | Cropping 2D | (270,480,64) | (2,0) |
| 37 | Convolution 2D | (272,480,3) | (3,3) |
| 38 | Activation | (270,480,1) | Linear |

Table 1: Main block architecture.
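The spatial sizes listed in Table 1 can be traced with simple arithmetic. The sketch below is one reading of the table, under assumptions: the downsampling and upsampling factors are inferred from consecutive output sizes rather than stated by the paper, size reductions are assumed to round up, and crop or pad amounts are expressed as the total number of rows and columns removed or added.

```python
import math

# (operation, amount) pairs inferred from the output sizes in Table 1.
MAIN_BLOCK_OPS = [
    ("down", 2),       # 7x7 convolution
    ("down", 3),       # max pooling
    ("down", 2),       # second encoding convolutional block
    ("down", 2),       # third encoding convolutional block
    ("up", 2),         # first decoding convolutional block
    ("crop", (1, 1)),
    ("up", 2),         # second decoding convolutional block
    ("pad", (0, 2)),
    ("up", 3),
    ("crop", (2, 0)),
    ("up", 2),
    ("crop", (2, 0)),
]


def trace_sizes(height, width, ops=MAIN_BLOCK_OPS):
    """Return the (height, width) after each operation, starting with the input."""
    sizes = [(height, width)]
    for name, amount in ops:
        if name == "down":
            height, width = math.ceil(height / amount), math.ceil(width / amount)
        elif name == "up":
            height, width = height * amount, width * amount
        elif name == "crop":
            height, width = height - amount[0], width - amount[1]
        elif name == "pad":
            height, width = height + amount[0], width + amount[1]
        else:
            raise ValueError(f"unknown operation: {name}")
        sizes.append((height, width))
    return sizes


sizes = trace_sizes(270, 480)
```

The cropping and padding steps exist because 270 is not a multiple of the total downsampling factor, so the decoder has to compensate for rounding in the encoder in order to return a map of the input size.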

Information is reduced to a compressed feature vector in the encoder's representation space. Information related to the deformable surface is encoded in this vector for each RGB input image.

Decoding is performed with decoding blocks. These require upsampling layers to increase the dimensions of the input tensors before passing through the convolution layers, as shown in Figure 3.c. Finally, the last layers consist of convolutional and cropping layers that adapt the output of the decoding block to the size of the output maps. The first channel provides the depth estimate and the last two channels provide the registration warp.

Table 2 shows the layered decomposition of the domain adaptation block. It is a reduced version of the main block where only the first two encoding and decoding blocks are included. The domain adaptation block takes as input the concatenation of the input image and the output from the main block (6 channels) and outputs a new refined depth map.
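The data flow of the cascade can be sketched as below. This is a minimal illustration, assuming the two blocks are available as callables; `main_block`, `adaptation_block` and the list-of-rows image layout are hypothetical stand-ins for the trained networks and their tensors.

```python
def run_cascade(image, main_block, adaptation_block):
    """Run both blocks; image is a list of rows of (r, g, b) pixels.

    Returns (refined_depth, warp), where warp holds (u, v) per pixel.
    """
    first = main_block(image)  # rows of (depth, warp_u, warp_v)
    # Concatenate RGB with the main block output: 6 channels per pixel.
    stacked = [
        [tuple(rgb) + tuple(out) for rgb, out in zip(image_row, out_row)]
        for image_row, out_row in zip(image, first)
    ]
    refined_depth = adaptation_block(stacked)
    # The warp comes from the main block; only the depth is refined.
    warp = [[(u, v) for _, u, v in row] for row in first]
    return refined_depth, warp
```

Keeping the warp from the main block and refining only the depth matches the description of the second block as producing a refined depth map.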

| Layer num | Type | Output size | Kernels/Activation |
|---|---|---|---|
| 1 | Input | (270,480,3) | |
| 2 | Convolution 2D | (135,240,64) | (7,7) |
| 3 | Batch Normalization | (135,240,64) | |
| 4 | Activation | (135,240,64) | Relu |
| 5 | MaxPooling 2D | (45,80,64) | (3,3) |
| 6 | Encoding Convolutional Block | (45,80,[64, 64, 256]) | (3,3) |
| 7-8 | Encoding identity Block x 2 | (45,80,[64, 64, 256]) | (3,3) |
| 9 | Encoding Convolutional Block | (23,40,[128, 128, 512]) | (3,3) |
| 10-13 | Encoding identity Block x 4 | (23,40,[128, 128, 512]) | (3,3) |
| 14 | Decoding Convolutional Block | (46,80,[512, 512, 128]) | (3,3) |
| 15-16 | Encoding identity Block x 2 | (46,80,[512, 512, 128]) | (3,3) |
| 17 | Upsampling | (92,160,128) | (2,2) |
| 18 | Cropping 2D | (92,160,128) | (2,0) |
| 19 | Convolution 2D | (90,160,64) | (3,3) |
| 20 | Batch Normalization | (90,160,64) | |
| 21 | Activation | (90,160,64) | Relu |
| 22 | Upsampling | (270, 480, 64) | (3,3) |
| 23 | Convolution 2D | (270, 480, 32) | (3,3) |
| 24 | Activation | (270, 480, 32) | Relu |
| 25 | Convolution 2D | (272,480,1) | (3,3) |
| 26 | Activation | (270,480,1) | Linear |

Table 2: Domain adaptation block architecture. This block adapts the network to the real-data domain.

5 Training

We create a quasi-photorealistic SfT synthetic database using simulation software. Synthetic data allows us to easily train our DNN end-to-end. We then re-train the domain adaptation block using a much smaller dataset collected with a standard RGB-D sensor. We recall that there are no public training datasets of this kind.

5.1 Synthetic Data

This process involves randomized sampling from the object’s deformation space, generating the resulting deformation, and rendering from randomized viewpoints. We now describe the process for generating these training datasets for the templates used in the experimental section below (two thin-shell and two volumetric templates, see Table 3). DB1 corresponds to a DIN A4 piece of poorly-textured paper. DB2 has the same shape as DB1 but with a richer texture. DB3 is a soft child’s toy and DB4 is an adult sneaker. We emphasize that no previous work has been able to solve SfT for these last two objects in wide-baseline. The rest-shape surfaces for DB3 and DB4 are triangulated meshes built using SfM (Agisoft Photoscan [1]).

| | DB1 | DB2 | DB3 | DB4 |
|---|---|---|---|---|
| Mesh faces | 1521 | 1521 | 36256 | 5212 |

[Image rows: Texture Maps, Synthetic images-S, Real images-R]

Table 3: First row shows four templates of the databases DB1, DB2, DB3 and DB4. Next three rows show different synthetic deformation images of the templates. The last three rows represent real deformations of the templates.

We use Blender [7] to sample the deformation spaces and to create quasi-photorealistic renderings. It includes a physics simulation engine to simulate deformations with different degrees of stiffness using position based dynamics. For the paper templates we used Blender’s cloth simulator with a high stiffness term to model the stiffness of paper, with contour conditions and tensile and compressive forces in randomized 3D directions. This generates continuous deformation videos. For the other two templates we used rig-based deformation with hand-crafted rigs. This generates non-continuous deformation instances, using randomized joint angle configurations. For each deformation we generate random viewpoint variations with random rotations and translations of the camera, lighting variations using different specular light models and random backgrounds obtained with [34]. In total, each dataset consists of RGB images, depth maps, and registrations (2-channel optical flow maps between the image and the template’s texture map). All images have a resolution of 480×270 pixels to fit the input/output of the network. We refer the reader to the supplementary material for a copy of these datasets with rigs and simulation parameters.
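The randomized viewpoint step can be sketched independently of the renderer. The following is an illustration only: the sampling ranges, parameter names and output layout are hypothetical, and the paper's actual Blender scripts and parameter values are not reproduced here.

```python
import random


def sample_viewpoint(rng, max_angle_deg=45.0, depth_range_m=(0.5, 1.5), max_shift_m=0.2):
    """Draw one random camera rotation (Euler angles) and translation."""
    rotation = tuple(rng.uniform(-max_angle_deg, max_angle_deg) for _ in range(3))
    translation = (
        rng.uniform(-max_shift_m, max_shift_m),
        rng.uniform(-max_shift_m, max_shift_m),
        rng.uniform(*depth_range_m),
    )
    return {"rotation_deg": rotation, "translation_m": translation}


def sample_dataset(num_deformations, views_per_deformation, seed=0):
    """Pair every deformation index with several random viewpoints."""
    rng = random.Random(seed)
    return [
        {"deformation_id": d, **sample_viewpoint(rng)}
        for d in range(num_deformations)
        for _ in range(views_per_deformation)
    ]


samples = sample_dataset(num_deformations=3, views_per_deformation=4)
```

Seeding the generator makes a sampled dataset reproducible, which helps when a disjoint set of test configurations has to be drawn with the same process.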

5.2 Real Data

We used a Microsoft Kinect v2 to record RGB-D frames of the four objects while undergoing deformations induced by hand manipulation, and viewpoint changes (see Table 3). Image resolution was downsized to 480×270 pixels to fit the input shape of the network.

6 Training Procedure

The training procedure is divided into two main steps: 1) training with synthetic data, followed by 2) semi-supervised fine-tuning with real data. In step 1) both the main and domain adaptation blocks are trained end-to-end as a single block. We use ADAM [15] optimization and initialize DeepSfT with uniform random weights [46]. The loss function combines two terms, each scaled by a constant: the Euclidean norm of the difference between the output depth map and its ground truth, and the Euclidean norm of the difference between the output warp and its ground truth. Observe that the depth and warp estimates inherently depend on the network weights and on the input image, see Eq. (2).

In step 2) we train the domain adaptation block using real data while freezing the weights of the main block. This step is crucial to adapt the network to handle the ‘render gap’ and include the appearance characteristics of real data, such as the complex illumination, camera response and color balance. Also crucial is the fact that this can be done automatically, without the need for ground truth registration. We use stochastic gradient descent (SGD) with a small and fixed learning rate. Having both a low learning rate and a reduced number of epochs allows us to adapt our network to real data while avoiding overfitting. In this step a different loss function is used, which only includes the depth information given by the depth sensor as the target of the domain adaptation block.
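The two losses can be sketched as follows. This is a minimal illustration consistent with the description above, under assumptions: maps are flattened to lists, the two terms are added, the norm is not squared, and the weight names `w_depth` and `w_warp` and their default values are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean norm of the difference between two flattened maps."""
    if len(a) != len(b):
        raise ValueError("maps must have the same number of elements")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def synthetic_loss(depth_pred, depth_gt, warp_pred, warp_gt, w_depth=1.0, w_warp=1.0):
    """Step 1: weighted depth and warp terms, both with ground truth."""
    return (w_depth * euclidean_distance(depth_pred, depth_gt)
            + w_warp * euclidean_distance(warp_pred, warp_gt))


def real_loss(depth_pred, depth_sensor):
    """Step 2: depth term only, with the RGB-D sensor depth as the target."""
    return euclidean_distance(depth_pred, depth_sensor)
```

Dropping the warp term in the second step is what removes the need for ground truth registration on real data, since only sensor depth enters the objective.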

7 Experimental Results

We evaluate DeepSfT in terms of 3D reconstruction and registration error with synthetic and real test data (described in §5.2). Synthetic test data was generated using the same process as the synthetic training data, using new randomized configurations not present in the training data. Real test data was generated using the same process as the real training data, using new video sequences, consisting of new viewpoints and object manipulations not present in the training data. We also generated new test data using two new cameras, as described below in §7.3.

We compare DeepSfT against a state-of-the-art isometric SfT method [10], referred to as CH17. We provide this method with two types of registration: CH17+GTR uses the ground truth registration (indicating its best possible performance independent of the registration method) and CH17+DOF uses the output of a state-of-the-art dense optical flow method [44]. In the latter case we generate registration for image sequences using frame-to-frame tracking. We also compare these two variants using a posteriori deformation refinement with Levenberg–Marquardt, which is standard practice for improving the output of closed-form SfT methods. We refer to these improvements as CH17R+GTR and CH17R+DOF. We compare DeepSfT with two DNN-based methods. The first is a naïve application of the popular ResNet architecture [43] to SfT, referred to as R50F. We implemented this by removing the final two layers of ResNet and introducing one dense layer with 200 neurons and a final dense layer with a 3-channel output (for depth and warp maps) of the same size as the input image. We trained R50F with exactly the same training data as DeepSfT, including real-data fine-tuning. Fine-tuning was implemented by optimizing the depth loss while forcing the warp outputs to be unchanged, using the same optimizer and learning rate as we used for DeepSfT. The second DNN method is [20], applicable only to DB1 and DB2. Because public code is not available, we carefully re-implemented it, requiring an adaptation of the image input size and the mesh size, so that it matched the size of the meshes for DB1 and DB2. We refer to this as HDM-net.

We evaluate reconstruction error using the root mean square error (RMSE) between the 3D reconstruction and the ground truth in millimeters. We also use RMSE to evaluate the registration accuracy in pixels. The evaluation of registration accuracy is notoriously difficult for real data, because there is no way to reliably obtain ground truth. We propose to use as a proxy for the ground truth the output from a state-of-the-art dense trajectory optical flow method DOF. We only make this evaluation for video sequence data, for which DOF can reliably estimate optical flow over the sequence.
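The error metric can be sketched as below. This is one common reading of RMSE for point sets, taken as an assumption: the mean is over points and each point contributes its squared Euclidean distance. The paper does not spell out the exact averaging, and the function name is illustrative.

```python
import math


def rmse(estimates, references):
    """Root mean square error between two equally long lists of points.

    Works for 3D points in millimeters (reconstruction) and for 2D points
    in pixels (registration).
    """
    if len(estimates) != len(references) or not estimates:
        raise ValueError("need two non-empty lists of the same length")
    total = 0.0
    for est, ref in zip(estimates, references):
        total += sum((e - r) ** 2 for e, r in zip(est, ref))
    return math.sqrt(total / len(estimates))
```

For real data, only pixels with a valid depth reading in both maps would be passed to such a function, since the sensor does not return depth everywhere.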

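| Sequence | Samples | Registration (px): DOF | Registration (px): R50F | Registration (px): DeepSfT | 3D Reconstruction (mm): CH17+GTR | 3D Reconstruction (mm): CH17+DOF | 3D Reconstruction (mm): CH17R+GTR | 3D Reconstruction (mm): CH17R+DOF | 3D Reconstruction (mm): HDM-net | 3D Reconstruction (mm): R50F | 3D Reconstruction (mm): DeepSfT |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DB1S | 3400 | 4.63 | 6.69 | 1.87 | 6.8968 | 15.60 | 8.27 | 15.41 | 10.80 | 7.99 | 1.68 |
| DB2S | 3400 | 5.91 | 6.13 | 1.34 | 6.89 | 28.26 | 8.27 | 28.04 | 9.92 | 7.75 | 1.63 |
| DB1R | 100 | - | 5.02 | 2.32 | - | 38.12 | - | 34.24 | - | 17.53 | 9.51 |
| DB2R | 230 | - | 4.13 | 1.53 | - | 27.31 | - | 25.24 | - | 14.45 | 7.3721 |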

Table 4: Evaluation on synthetic and real databases DB1S, DB2S, DB1R and DB2R

7.1 Experiments with thin-shell objects and continuous test sequences

We show in Tables 4 and 6 the quantitative and qualitative results obtained with the thin-shell templates DB1 and DB2 with synthetic test datasets, denoted by DB1S and DB2S, and real test datasets, denoted by DB1R and DB2R. In terms of reconstruction error DeepSfT is considerably better than the other methods, both on synthetic data, where the error remains below 2mm, and on real data, where the error is below 10mm. The Kinect V2 has an uncertainty of about 10mm at a distance of one meter, which partially explains the higher error for real data. The second and third best methods are R50F and HDM-net, also based on deep learning. However, their results are far from those of DeepSfT. The method CH17 obtains reasonable results when it is provided with ground truth registration (CH17+GTR and CH17R+GTR). However, the performance is considerably worse when real registration is provided using dense optical flow (CH17+DOF and CH17R+DOF).

In terms of registration error, DeepSfT also has the best results both for synthetic test data, where ground-truth registration is available, and for real test data, where DOF is used as the proxy. In all cases DeepSfT has a mean registration error of approximately 2 pixels. The performance of R50F is competitive with DOF, with registration errors of approximately 5 pixels. We note that DOF exploits temporal coherence, while R50F and DeepSfT do not and process each frame independently.

7.2 Experiments with volumetric objects and non-continuous test images

The quantitative and qualitative results of the experiments for volumetric templates DB3 and DB4 are provided in Tables 5 and 6 with both synthetic test data, denoted by DB3S and DB4S, and real test data, denoted by DB3R and DB4R. In this case we only provide registration error with synthetic data, because reliable registration using DOF is impossible with non-continuous test images. The methods CH17+GTR and CH17R+GTR are tested only in the case of DB4S, because this is the only case in which they can work (requiring a continuous texture map and a registration).

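| Sequence | Samples | Registration (px): R50F | Registration (px): DeepSfT | 3D Reconstruction (mm): CH17+GTR | 3D Reconstruction (mm): CH17R+GTR | 3D Reconstruction (mm): R50F | 3D Reconstruction (mm): DeepSfT |
|---|---|---|---|---|---|---|---|
| DB3S | 5000 | 7.14 | 1.05 | - | - | 6.34 | 1.16 |
| DB4S | 5000 | 8.93 | 3.60 | 73.80 | 70.70 | 12.62 | 1.57 |
| DB3R | 1300 | - | - | - | - | 12.43 | 5.12 |
| DB4R | 550 | - | - | - | - | 27.31 | 7.55 |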

Table 5: Evaluation on synthetic and real databases DB3S, DB4S, DB3R and DB4R

We observe a similar trend as with the thin-shell objects. DeepSfT is the best method both in terms of 3D reconstruction, with errors of the order of millimeters, and in registration with errors close to 2 pixels. The second best method is R50F although its results are significantly worse than those obtained by DeepSfT. The results of CH17 and its variants are very poor. This may be due to the fact that CH17 is not a method well adapted for volumetric objects with non-negligible deformation strain.

[Image table. Rows: Input Image, Depth Output, Warp-U Output, Warp-V Output. Depth Error (mm) per example: 3.21, 4.69, 11.26, 8.96, 9.08, 7.49]

Table 6: Examples of the four templates outputs

We show in Table 7 qualitative reconstruction results obtained with DB1, DB3 and DB4 with real images.

[Image table. Rows: Input Image, Output vs GTH, Textured Output, Textured Groundtruth, Error Map]

Table 7: Examples of 3D shapes recovered by DeepSfT

We observe that shapes recovered with DeepSfT are similar to ground-truth obtained with the RGB-D camera. We can observe that the error is larger near self-occlusion boundaries. Errors for DB1 are qualitatively smaller than for volumetric objects, which is consistent with Tables 4 and 5.

7.3 Experiments with other cameras

We now present experiments showing the ability of DeepSfT to be used with a different camera at run-time, without any fine-tuning with the new camera. The different cameras are an Intel Realsense D435 [26] (an RGB-D camera that we use for quantitative evaluation) and a GoPro Hero V3 [21] (an RGB camera for qualitative evaluation). Table 8 shows their respective camera intrinsics.

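| Camera | Image Resolution | Focal length x (px) | Focal length y (px) | Principal point x (px) | Principal point y (px) |
|---|---|---|---|---|---|
| Kinect V2 | 1920x1080 | 1057.8 | 1064 | 947.64 | 530.38 |
| Intel Realsense D435 | 1270x720 | 915.457 | 915.457 | 645.511 | 366.344 |
| Gopro Hero V3 | 1920x1080 | 1686.8 | 1694.2 | 952.8 | 563.5 |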

Table 8: Camera description table

We have trained DeepSfT with a source RGB-D camera (Kinect V2), which has different intrinsics to the new cameras. We cannot immediately use images from a new camera because the network weights are specific to the intrinsics of the source camera. We propose to handle this by adapting the new camera’s effective intrinsics to match the source camera. Because the object’s depth within the training set varies (and so the perspective effects vary), we can emulate training with the new camera’s intrinsics simply by an affine transform of the new camera image. This eliminates the need to retrain the network. We assume lens distortion is either negligible or has been corrected a priori using e.g. OpenCV. The affine transform consists of a linear scaling and a displacement, both computed from the intrinsics of the new camera and the intrinsics of the source camera divided by the image downsizing factor. The corrected image is then clipped about its optical centre and zero padded (if necessary) to obtain the input image size of DeepSfT.
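One way to build such an affine transform is sketched below. It is an assumption-laden illustration, not the paper's exact formula: the transform is chosen so that every pixel keeps its viewing ray when moved from the new camera to the downsized source camera, the four Table 8 values per camera are read as focal lengths and principal point, and the downsizing factor of 4 is inferred from 1920×1080 frames being reduced to 480×270.

```python
def affine_from_intrinsics(new, source):
    """Return (sx, sy, tx, ty) mapping new-camera pixels to source-camera pixels."""
    sx = source["fx"] / new["fx"]
    sy = source["fy"] / new["fy"]
    tx = source["cx"] - sx * new["cx"]
    ty = source["cy"] - sy * new["cy"]
    return sx, sy, tx, ty


def apply_affine(pixel, transform):
    """Apply the scaling and displacement to one pixel coordinate."""
    u, v = pixel
    sx, sy, tx, ty = transform
    return (sx * u + tx, sy * v + ty)


DOWNSIZE = 4.0  # assumed: 1920x1080 Kinect frames reduced to 480x270
KINECT_FULL = {"fx": 1057.8, "fy": 1064.0, "cx": 947.64, "cy": 530.38}
KINECT = {name: value / DOWNSIZE for name, value in KINECT_FULL.items()}
REALSENSE = {"fx": 915.457, "fy": 915.457, "cx": 645.511, "cy": 366.344}

TRANSFORM = affine_from_intrinsics(REALSENSE, KINECT)
```

Under this construction the principal point of the new camera lands on the principal point of the source camera, so clipping about the optical centre afterwards only removes image borders.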

Table 9 gives the 3D reconstruction error for the Intel Realsense D435 [26]. For the GoPro Hero V3 [21] we show qualitative results.

| Camera | Depth error (mm) |
|---|---|
| Kinect V2 | 7.12 |
| Realsense D435 | 12.34 |
| Gopro Hero V3 | - |

[Image columns: Converted Images, Results]

Table 9: Results of experiments with different cameras (error in mm)

Quantitatively, the 3D reconstruction errors of the original camera and the Intel Realsense D435 [26] are of the same order (7.12 mm and 12.34 mm respectively, see Table 9). This indicates that DeepSfT generalizes well to images taken with a different camera. DeepSfT is able to cope with images from other cameras even if the focal lengths are quite different, as is the case with the GoPro camera.

7.4 Light and Occlusion Resistance

We show that DeepSfT is resistant to light changes and significant occlusions. The first two rows of Table 10 show representative examples of scenes with external and self occlusions for the thin-shell and volumetric objects. DeepSfT is able to cope with them, accurately detecting the occlusion boundaries.

[Table 10 image grid. Occlusions: input image and output. Illumination changes: input and output. Failure cases: input and points with information.]

Table 10: Representative occlusion resistance, light resistance and failure cases.

The third and fourth rows of Table 10 show examples of scenes with light changes that produce significant changes in shading. DeepSfT shows resistance to those changes.

7.5 Failure Modes

There are some instances where DeepSfT fails, shown in the final two rows of Table 10. Some are general failure modes of SfT (very strong occlusions and illumination changes), for which all methods fail at some point. Others are specific to a learning-based approach (excessive deformations that are not represented in the training set).

7.6 Timing Experiments

Table 11 shows the average frame rates of the compared methods, benchmarked on a conventional Linux desktop PC with a single NVIDIA GTX-1080 GPU.

| | DeepSfT | R50F | CH17 | CH17R | DOF |
|---|---|---|---|---|---|
| Frame rate (fps) | 20.4 | 37 | 0.75 | 0.193 | 8.84 |

Table 11: Frame rates of the evaluated methods.

The DNN-based methods are considerably faster than the other methods, with frame rates close to real time (DeepSfT). Solutions based on CH17 are far from real-time.
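To make the frame rates in Table 11 easier to compare, the short sketch below converts them to per-frame latency and computes a relative speed-up. It only re-expresses the reported numbers; the dictionary and helper names are illustrative and not part of the original evaluation code.

```python
# Frame rates (fps) as reported in Table 11.
fps = {"DeepSfT": 20.4, "R50F": 37.0, "CH17": 0.75, "CH17R": 0.193, "DOF": 8.84}

def ms_per_frame(rate_fps):
    """Convert a frame rate in frames per second to milliseconds per frame."""
    return 1000.0 / rate_fps

latency_ms = {name: ms_per_frame(rate) for name, rate in fps.items()}

# Ratio of frame rates: how many times faster DeepSfT runs than CH17.
speedup_vs_ch17 = fps["DeepSfT"] / fps["CH17"]

for name, ms in latency_ms.items():
    print(f"{name:8s} {ms:9.1f} ms/frame")
print(f"DeepSfT vs CH17 speed-up: {speedup_vs_ch17:.1f}x")
```

On these figures DeepSfT takes roughly 49 ms per frame, whereas CH17 needs more than a second per frame.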

8 Conclusions

We have presented DeepSfT, the first dense, real-time solution for wide-baseline SfT with generic templates. This has been a long-standing computer vision problem for over a decade. DeepSfT will enable many real-world applications that require dense registration and 3D reconstruction of deformable objects, in particular augmented reality with deforming objects. We also expect it to be an important component for dense NRSfM in the wild. In the future we aim to improve results by incorporating temporal context information with recurrent neural networks, and to extend the approach to unsupervised learning.

References


  • [1] Agisoft Photoscan.
  • [2] A. Agudo and F. Moreno-Noguer. Simultaneous pose and non-rigid shape with particle dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2179–2187, 2015.
  • [3] A. Agudo, F. Moreno-Noguer, B. Calvo, and J. M. M. Montiel. Sequential non-rigid structure from motion using physical priors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(5):979–994, 2016.
  • [4] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. CoRR, abs/1511.00561, 2015.
  • [5] A. Bansal, B. Russell, and A. Gupta. Marr revisited: 2d-3d alignment via surface normal prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5965–5974, 2016.
  • [6] A. Bartoli, Y. Gérard, F. Chadebecq, T. Collins, and D. Pizarro. Shape-from-template. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(10):2099–2118, 2015.
  • [7] Blender Online Community. Blender - a 3D modelling and rendering package. Blender Foundation, Blender Institute, Amsterdam.
  • [8] C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3D shape from image streams. In Proceedings of the Conference on Computer Vision and Pattern Recognition. IEEE, 2000.
  • [9] F. Brunet, A. Bartoli, and R. I. Hartley. Monocular template-based 3d surface reconstruction: Convex inextensible and nonconvex isometric methods. Computer Vision and Image Understanding, 125:138–154, 2014.
  • [10] A. Chhatkuli, D. Pizarro, A. Bartoli, and T. Collins. A stable analytical framework for isometric shape-from-template by surface integration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(5):833–850, 2017.
  • [11] A. Chhatkuli, D. Pizarro, T. Collins, and A. Bartoli. Inextensible non-rigid structure-from-motion by second-order cone programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2017.
  • [12] T. Collins and A. Bartoli. Using isometry to classify correct/incorrect 3D-2D correspondences. In ECCV, 2014.
  • [13] T. Collins, A. Bartoli, N. Bourdel, and M. Canis. Robust, real-time, dense and deformable 3d organ tracking in laparoscopic videos. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 404–412. Springer, 2016.
  • [14] T. Collins, P. Mesejo, and A. Bartoli. An analysis of errors in graph-based keypoint matching and proposed solutions. In European Conference on Computer Vision, pages 138–153. Springer, 2014.
  • [15] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, December 2014.
  • [16] D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
  • [17] D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2366–2374. Curran Associates, Inc., 2014.
  • [18] R. Garg, V. K. BG, G. Carneiro, and I. Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision, pages 740–756. Springer, 2016.
  • [19] V. Gay-Bellile, A. Bartoli, and P. Sayd. Direct estimation of nonrigid registrations with image-based self-occlusion reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1):87–104, Jan 2010.
  • [20] V. Golyanik, S. Shimada, K. Varanasi, and D. Stricker. Hdm-net: Monocular non-rigid 3d reconstruction with learned deformation model. CoRR, abs/1803.10193, 2018.
  • [21] GoPro. GoPro Hero Silver V3 RGB camera.
  • [22] R. A. Güler, N. Neverova, and I. Kokkinos. Densepose: Dense human pose estimation in the wild. arXiv preprint arXiv:1802.00434, 2018.
  • [23] N. Haouchine and S. Cotin. Template-based monocular 3D recovery of elastic shapes using lagrangian multipliers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, July 2017.
  • [24] N. Haouchine, J. Dequidt, M.-O. Berger, and S. Cotin. Single view augmentation of 3D elastic objects. In ISMAR, pages 229–236. IEEE, 2014.
  • [25] R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
  • [26] Intel. Intel RealSense D435 stereo depth camera.
  • [27] L. Torresani, A. Hertzmann and C. Bregler. Nonrigid structure-from-motion: Estimating shape and motion with hierarchical priors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5):878–892, 2008.
  • [28] F. Liu, C. Shen, G. Lin, and I. D. Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):2024–2039, 2016.
  • [29] Q. Liu-Yin, R. Yu, L. Agapito, A. Fitzgibbon, and C. Russell. Better together: Joint reasoning for non-rigid 3d reconstruction with specularities and shading. arXiv preprint arXiv:1708.01654, 2017.
  • [30] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110, 2004.
  • [31] A. Malti, A. Bartoli, and R. Hartley. A linear least-squares solution to elastic shape-from-template. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1629–1637, 2015.
  • [32] A. Malti, R. Hartley, A. Bartoli, and J.-H. Kim. Monocular template-based 3d reconstruction of extensible surfaces with local linear elasticity. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1522–1529, 2013.
  • [33] J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, volume 1, page 5, 2017.
  • [34] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
  • [35] D. T. Ngo, J. Östlund, and P. Fua. Template-based monocular 3d shape recovery using laplacian meshes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):172–187, 2016.
  • [36] E. Özgür and A. Bartoli. Particle-sft: A provably-convergent, fast shape-from-template algorithm. International Journal of Computer Vision, 123(2):184–205, 2017.
  • [37] M. Perriollat, R. Hartley, and A. Bartoli. Monocular template-based reconstruction of inextensible surfaces. International journal of computer vision, 95(2):124–137, 2011.
  • [38] J. Pilet, V. Lepetit, and P. Fua. Fast non-rigid surface detection, registration and realistic augmentation. IJCV, 76(2):109–122, February 2008.
  • [39] D. Pizarro and A. Bartoli. Feature-based deformable surface detection with self-occlusion reasoning. International Journal of Computer Vision, 97(1):54–70, 2012.
  • [40] A. Pumarola, A. Agudo, L. Porzi, A. Sanfeliu, V. Lepetit, and F. Moreno-Noguer. Geometry-aware network for non-rigid shape prediction from a single view. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2018.
  • [41] M. Salzmann and P. Fua. Reconstructing sharply folding surfaces: A convex formulation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1054–1061. IEEE, 2009.
  • [42] M. Salzmann, F. Moreno-Noguer, V. Lepetit, and P. Fua. Closed-form solution to non-rigid 3d surface registration. Computer Vision–ECCV 2008, pages 581–594, 2008.
  • [43] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, December 2015.
  • [44] N. Sundaram, T. Brox, and K. Keutzer. Dense point trajectories by GPU-accelerated large displacement optical flow. In ECCV, 2010.
  • [45] X. Wang, D. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 539–547, 2015.
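  • [46] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, 2010.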
  • [47] Y. Dai, H. Li, and M. He. A simple prior-free method for non-rigid structure-from-motion factorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.