The joint task of registration and 3D reconstruction of deformable objects from RGB videos and images is a major objective in computer vision, with numerous potential applications, for instance in augmented reality. In comparison with other mature 3D reconstruction problems, such as Structure-from-Motion (SfM) where rigidity is imposed on the scene, deformable registration and 3D reconstruction present significant unsolved problems. Two main scenarios exist in this task: Non-Rigid SfM (NRSfM) [8, 27, 47, 11] and Shape-from-Template (SfT) [42, 6, 35, 10]. NRSfM reconstructs the 3D shape of a deformable object from multiple RGB images. In contrast, SfT reconstructs the 3D shape from a single RGB image using an object template. The template includes knowledge about the object’s appearance, shape and permissible deformations. These are typically represented by a texture-map, a 3D mesh and a simple mechanical model. SfT is suitable for many applications where the template is known or can be acquired, using for instance SfM or any available 3D scanning solution.
SfT solves two fundamental and intimately related problems: i) template-image registration, which associates pixels in the image with their corresponding locations in the template, and ii) shape inference, which recovers the observed 3D shape or, equivalently, the template's 3D deformation. The majority of SfT methods focus on solving shape inference, assuming that registration is independently obtained with existing feature-based or dense methods [39, 19, 14]. In all other cases, both problems are solved simultaneously using tracking with iterative optimization [35, 13, 3]. To date there exists no non-DNN wide-baseline SfT method capable of solving both problems densely and in real-time. DNN SfT methods have been very recently proposed [40, 20], following the success of the DNN methodology in related problems such as 3D human pose estimation [33, 22], depth [16, 18, 28] and surface normal reconstruction with rigid objects [5, 45]. The general idea is to learn, from training data, the function that maps an input image to the template's 3D deformation parameters. This has the potential to jointly solve registration and shape inference and eliminates the need for iterative optimization at run-time. These two recent methods are promising but bear important limitations. First, they are limited to flat templates described by regular meshes with very small vertex counts. Second, they require ground-truth registration for training, which is practically impossible to obtain for real data.
We propose DeepSfT, the first DNN SfT method based on a fully-convolutional network without the above limitations. DeepSfT has the following desirable characteristics. 1) It is dense and provides registration and 3D reconstruction at the pixel level. 2) It does not require temporal continuity and handles large deformations and pose changes between the template and the object. 3) It runs in real-time using conventional GPU hardware. 4) It is applicable to templates with arbitrary geometry, topology and surface representation, including meshes and implicit and explicit functions such as NURBS. 5) It is highly robust and handles well the major challenges of SfT, including self and external occlusions, illumination changes and blur. 6) Training involves a novel combination of supervised learning with synthetic data and semi-supervised learning with real RGB-D data. Crucially, we do not require ground-truth registration for the real training data, but only RGB-D. Compared to previous approaches, this makes it feasible to acquire the real training data automatically, and therefore feasible to deploy the method in real settings. 7) The network complexity, training cost and running cost are independent of the template representation, for instance of the mesh vertex count. It therefore scales very well to highly complex templates with detailed geometry that were, until now, not solvable in real-time. There exists no previous method in the literature with the above characteristics. Our method thus pushes SfT significantly forward. We present quantitative and qualitative experimental results showing that our method clearly outperforms the state of the art in accuracy, robustness and computation time.
2 Previous Work
We first review the non-DNN SfT methods, forming the vast majority of existing work. We start with the shape inference methods and then the integrated methods combining shape inference and registration. We finally review the recent DNN SfT methods.
Shape inference methods.
The shape inference methods assume that the registration between the template and the image is given, which is a fundamental limiting factor of applicability. We classify them according to the deformation model. The most popular deformation model is isometry, which approximately preserves geodesic distances, and has been shown to be widely applicable. Isometric methods follow three main strategies: i) using a convex relaxation of isometry called inextensibility [41, 42, 37, 9], ii) using local differential geometry [6, 10] and iii) minimizing a global non-convex cost [9, 36]. Methods in iii) are the most accurate but also the most expensive. They require an initial solution found using a method from i) or ii). There also exist non-isometric methods, with the angle-preserving conformal model  or simple mechanical models with linear [32, 31] and non-linear elasticity [23, 24, 2]. These models all require boundary conditions in the form of known 3D points, which is another fundamental limiting factor of applicability. Their well-posedness remains an open research question.
Integrated methods.
The integrated methods compute both registration and shape inference. We classify them according to their ability to handle wide-baseline cases. Short-baseline methods are restricted to video data and may work in real-time [35, 13, 29]. They are based on the iterative minimization of a non-convex cost and use keypoint correspondences  or optical flow [13, 29]. The latter supports dense solutions and resolves complex, high-frequency deformations. Their main limitations are two-fold. First, they break down when there is fast deformation or camera motion. Second, at run-time, they must solve an optimization process that is highly computationally demanding, requiring careful hand-crafted design and balancing of data and deformation constraints. In contrast, wide-baseline SfT methods can deal with individual images showing the object with strong deformation, without priors on the camera viewpoint [35, 12]. These methods solve registration sparsely using keypoints such as SIFT , with filtering to reduce the mismatches [39, 38]. The main limitations of these methods are two-fold. First, they are fundamentally limited by the feature-based registration, which fails with weak or repetitive texture, low image resolution, blur or viewpoint distortion. Second, they require solving a highly demanding optimization problem at run-time. Because of these limitations, the existing wide-baseline methods have only been shown to work for simple objects with simple deformations, such as bending sheets of paper.
DNN SfT methods.
Two DNN SfT methods [40, 20] have been recently proposed. They address isometric SfT by learning the mapping between the input image and the 3D vertex coordinates of a regular mesh. Both methods use regression with a fully-convolutional encoder. They require the template to be flat and to contain a small number of regular elements. In , belief maps are obtained for the 2D positions of the vertices, which are then combined with depth estimation and reprojection constraints to recover their 3D positions. This considerably limits the size of the mesh, as shown by the reported examples with fewer than  vertices. Both methods were trained and tested with synthetically generated images. Only  provides results on a real video of a bending paper sheet, but required ground-truth registration and shape to fine-tune the network on part of the video. These two methods thus form a preliminary step toward applying DNNs to SfT, but are strongly limited by the low template complexity and the requirement for ground-truth registration. Indeed, even if depth may be relatively easy to obtain for training, ground-truth registration is extremely difficult to measure for real data.
3 Problem Formulation
Figure 1 shows the geometrical model of SfT.
The template is known and represented by a 3D surface jointly with its appearance, described as a texture map . The texture map is standard and represented as a collection of flattened texture charts whose union covers the appearance of the whole template, as seen in Figure 1. In our approach, templates are not restricted to a specific topology, modelling both thin-shell and volumetric objects. They are also not restricted to a specific representation. In our experimental section we use mesh representations because of their generality, but this is not a requirement of the method. The bijective map between  and  is known and denoted by . We assume that the template surface is deformed with an unknown quasi-isometric map , where  denotes the unknown deformed surface. Quasi-isometric maps permit localized extension and compression, common with real-world deforming objects.
The input image is modeled as the colour intensity function , which is discretized into a regular grid of pixels in the retinal plane. The visible part of the surface is projected on an unknown subset of the image plane . We assume the perspective camera for projection:
where is represented with a perspective embedding with .
We assume is known and any lens distortion is either negligible or has been corrected. The depth function represents the depth coordinate of from the camera’s coordinate system. In the absence of self-occlusions, . Volumetric templates always induce self-occlusions in the image.
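As a concrete illustration of this camera model, the sketch below projects camera-frame 3D points to pixels and reads off their depths. The intrinsics matrix `K` here is a hypothetical example (its focal lengths and principal point are assumptions, not values from the paper), and lens distortion is assumed already corrected:

```python
import numpy as np

# Hypothetical intrinsics (focal lengths fx, fy and principal point cx, cy).
K = np.array([[525.0,   0.0, 319.5],
              [  0.0, 525.0, 239.5],
              [  0.0,   0.0,   1.0]])

def project(points, K):
    """Project 3D camera-frame points (N, 3) to pixel coordinates (N, 2).

    Also returns the per-point depth: as in the text, the depth function
    is simply the z coordinate of each surface point in the camera frame.
    """
    P = np.asarray(points, dtype=float)
    depth = P[:, 2]
    uv_hom = (K @ P.T).T                  # homogeneous image coordinates
    return uv_hom[:, :2] / depth[:, None], depth

pixels, depth = project([[0.1, -0.2, 1.0]], K)   # one visible surface point
```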
The unknown registration function, or warp, is an injective map that relates each point of to its corresponding point in .
4 Network architecture
We propose a DNN, hereinafter DeepSfT, that estimates and directly from the input image :
where  and  are normalized (, ) and discretized versions of  and . Our method also recovers , as both  and  are equal to  outside the domain of the image. In this sense, DeepSfT performs object segmentation at the pixel level.  are the network weights, which depend on the template  and are learned during training (see Section 6). DeepSfT is trained to recognize a specific template, so a large number of deformation examples is required, as described in Section 5.
Figure 2 shows the proposed network architecture. The complete architecture receives an RGB input image with a resolution of pixels and returns the estimated depth map and the registration maps . Both and have the same size as the input image.
DeepSfT is divided into two main blocks. The main block follows an encoder-decoder architecture, similar to those used in semantic segmentation . It produces a first depth map estimate  and the proposed registration function . The second is a domain adaptation block that uses the RGB input image together with the output of the previous block to refine the depth map estimate . This cascade topology, where the input image is fed into refinement blocks, has proven to improve the results obtained with single-stage methods for 3D depth estimation . This block is also crucial to adapt the network to real data, as described in Section 6.
Each block is composed of two unbalanced parallel branches with convolutional layers that propagate the information forward into deeper layers, preserving the high frequencies of the data.
Table 1 shows the layered decomposition of the main block. It first receives the RGB input image and performs a first reduction of the input size. Then, a sequence of three convolutional and identity blocks encodes the texture and depth information as deep features. The information is reduced to a compressed feature vector in a representation space of dimension . Information related to the deformable surface is encoded in this vector for each RGB input image. Decoding is performed with decoding blocks. These require upsampling layers to increase the dimensions of the input tensors before passing through the convolution layers, as shown in Figure 3.c. Finally, the last layers consist of CNNs and cropping layers that adapt the output of the decoding block to the size of the output maps (). The first channel provides the depth estimate and the last two channels provide the registration warp.
Table 2 shows the layered decomposition of the domain adaptation block. It is a reduced version of the main block, where only the first two encoding and decoding blocks are included. The domain adaptation block takes as input the concatenation of the input image and the output from the main block (6 channels) and outputs a new, refined depth map.
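The cascade wiring described above can be sketched as follows. This is a toy illustration of the data flow only: the two stand-in functions replace the real convolutional blocks of Tables 1 and 2, and exist purely to show the channel layout and the 6-channel concatenation:

```python
import numpy as np

# Toy sketch of the DeepSfT cascade data flow (block internals are
# placeholders, not the real layers). The main block maps the RGB image
# to a 3-channel output: one depth channel plus a 2-channel warp. The
# domain adaptation block consumes the 6-channel concatenation of the
# RGB image and that output, and returns a refined depth map.
def main_block(rgb):
    h, w, _ = rgb.shape
    out = np.empty((h, w, 3))
    out[..., 0] = rgb.mean(axis=-1)    # stand-in "depth" channel
    out[..., 1:] = rgb[..., :2]        # stand-in "warp" channels
    return out

def domain_adaptation_block(rgb, coarse):
    x = np.concatenate([rgb, coarse], axis=-1)   # 6-channel input
    return x.mean(axis=-1)                       # stand-in refined depth

rgb = np.random.rand(8, 8, 3)
coarse = main_block(rgb)                         # depth + warp estimate
refined = domain_adaptation_block(rgb, coarse)   # refined depth map
```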
5 Datasets
We create a quasi-photorealistic SfT synthetic database using simulation software. Synthetic data allows us to easily train our DNN end-to-end. We then re-train the domain adaptation block using a much smaller dataset collected with a standard RGB-D sensor. We recall that there are no public training datasets of this kind.
5.1 Synthetic Data
This process involves randomized sampling from the object's deformation space, generating the resulting deformation, and rendering from randomized viewpoints. We now describe the process for generating the training datasets for the templates used in the experimental section below (two thin-shell and two volumetric templates, see Table 3). DB1 corresponds to a DIN A4 piece of poorly-textured paper. DB2 has the same shape as DB1 but a richer texture. DB3 is a soft child's toy and DB4 is an adult sneaker. We emphasize that no previous work has been able to solve SfT for these last two objects in the wide-baseline setting. The rest-shape surfaces for DB3 and DB4 are obtained as triangulated meshes built using SfM (Agisoft Photoscan).
We use Blender  to sample the deformation spaces and to create quasi-photorealistic renderings. Blender includes a physics simulation engine that simulates deformations with different degrees of stiffness using position-based dynamics. For the paper templates we used Blender's cloth simulator with a high stiffness term to model the stiffness of paper, with boundary conditions and tensile and compressive forces in randomized 3D directions. This generates continuous deformation videos. For the other two templates we used rig-based deformation with hand-crafted rigs. This generates non-continuous deformation instances, using randomized joint angle configurations. For each deformation we generate random viewpoint variations with random rotations and translations of the camera, lighting variations using different specular light models, and random backgrounds obtained with . In total, each dataset consists of RGB images, depth maps and registrations (2-channel optical flow maps between the image and the template's texture map). All images have a resolution of  to fit the input/output of the network. We refer the reader to the supplementary material for a copy of these datasets with rigs and simulation parameters.
5.2 Real Data
We used a Microsoft Kinect v2 to record a total of  RGB-D frames of the four objects while they underwent deformations induced by hand manipulation and viewpoint changes (see Table 3). The image resolution was downsized to  to fit the input shape of the network.
6 Training Procedure
The training procedure is divided into two main steps: 1) training with synthetic data, followed by 2) semi-supervised fine-tuning with real data. In step 1), both the main and domain adaptation blocks are trained end-to-end as a single block. We use ADAM  optimization with learning rate  and parameters . We train for  epochs with a batch size of . We initialize DeepSfT with uniform random weights . The loss function is defined as follows:
where  and  are the output depth map and warp estimates given by DeepSfT, respectively. The terms  and  are the respective ground truths, and  and  are constants. The symbol  denotes the Euclidean norm. Observe that  and  inherently depend on the network weights  and on the input image , see Eq. (2). In step 2) we train the domain adaptation block using real data while freezing the weights of the main block. This step is crucial to adapt the network to the 'render gap' and to the appearance characteristics of real data, such as complex illumination, camera response and color balance. Also crucial is the fact that this can be done automatically, without the need for ground-truth registration. We use stochastic gradient descent (SGD) with a small, fixed learning rate of . We train the network for  epochs with a batch size of . Having both a low learning rate and a reduced number of epochs allows us to adapt our network to real data while avoiding overfitting. In this step a different loss function is used, which only includes the depth information given by the depth sensor as the target of the domain adaptation block:
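One plausible form of the step-1 loss, consistent with the description above, is sketched below: a weighted sum of per-pixel squared Euclidean errors on the depth map and on the 2-channel warp. The weighting constants `lambda_d` and `lambda_w` are assumptions, not the paper's exact values:

```python
import numpy as np

# Hedged sketch of the supervised loss: weighted sum of per-pixel squared
# Euclidean errors on depth and warp. lambda_d / lambda_w are assumptions.
def sft_loss(depth_pred, depth_gt, warp_pred, warp_gt,
             lambda_d=1.0, lambda_w=1.0):
    l_depth = np.mean((depth_pred - depth_gt) ** 2)
    l_warp = np.mean(np.sum((warp_pred - warp_gt) ** 2, axis=-1))
    return lambda_d * l_depth + lambda_w * l_warp

# Example on toy 4x4 maps: depth off by 1 everywhere, warp off by (1, 1).
loss = sft_loss(np.zeros((4, 4)), np.ones((4, 4)),
                np.zeros((4, 4, 2)), np.ones((4, 4, 2)))
```

The step-2 loss described next is the same depth term alone, applied to the domain adaptation block's output against sensor depth.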
7 Experimental Results
We evaluate DeepSfT in terms of 3D reconstruction and registration error with synthetic and real test data (described in §5.2). Synthetic test data was generated using the same process as the synthetic training data, using new randomized configurations not present in the training data. Real test data was generated using the same process as the real training data, using new video sequences, consisting of new viewpoints and object manipulations not present in the training data. We also generated new test data using two new cameras, as described below in §7.3.
We compare DeepSfT against a state-of-the-art isometric SfT method , referred to as CH17. We provide this method with two types of registration: CH17+GTR uses the ground-truth registration (indicating its best possible performance independent of the registration method) and CH17+DOF uses the output of a state-of-the-art dense optical flow method . In the latter case we generate registration for image sequences using frame-to-frame tracking. We also compare these two variants with a posteriori deformation refinement using Levenberg–Marquardt, which is standard practice for improving the output of closed-form SfT methods. We refer to these variants as CH17R+GTR and CH17R+DOF. We compare DeepSfT with two DNN-based methods. The first is a naïve application of the popular ResNet architecture  to SfT, referred to as R50F. We implemented this by removing the final two layers of ResNet and introducing one dense layer with 200 neurons and a final dense layer with a 3-channel output (for depth and warp maps) of the same size as the input image. We trained R50F with exactly the same training data as DeepSfT, including real-data fine-tuning. Fine-tuning was implemented by optimizing the depth loss while forcing the warp outputs to remain unchanged, using the same optimizer and learning rate as for DeepSfT. The second DNN method is , applicable only to DB1 and DB2. Because public code is not available, we carefully re-implemented it, requiring an adaptation of the image input size and the mesh size so that it matched the size of the meshes for DB1 and DB2. We refer to it as HDM-net.
We evaluate reconstruction error using the root mean square error (RMSE) between the 3D reconstruction and the ground truth, in millimeters. We also use RMSE to evaluate registration accuracy, in pixels. Evaluating registration accuracy is notoriously difficult for real data, because there is no way to reliably obtain ground truth. We propose to use as a proxy for the ground truth the output of a state-of-the-art dense trajectory optical flow method, DOF. We only make this evaluation for video sequence data, for which DOF can reliably estimate optical flow over the sequence.
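Both metrics are the same formula in different units, which a short sketch makes explicit: per-point Euclidean errors (millimeters for 3D points, pixels for warp displacements), squared, averaged and square-rooted:

```python
import numpy as np

# RMSE over point sets: each row of pred/gt is one point (3D coordinates
# in mm for reconstruction, 2D pixel coordinates for registration).
def rmse(pred, gt):
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.sqrt(np.mean(np.sum((pred - gt) ** 2, axis=-1))))

# Single reconstructed point displaced by a 3-4-5 triangle: 5 mm of error.
err_3d = rmse([[0.0, 0.0, 0.0]], [[3.0, 4.0, 0.0]])
```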
7.1 Experiments with thin-shell objects and continuous test sequences
We show in Tables 4 and 6 the quantitative and qualitative results obtained with the thin-shell templates DB1 and DB2, with synthetic test datasets, denoted DB1S and DB2S, and real test datasets, denoted DB1R and DB2R. In terms of reconstruction error, DeepSfT is considerably better than the other methods, both on synthetic data, where the error remains below 2 mm, and on real data, where the error is below 10 mm. The Kinect v2 has an uncertainty of about 10 mm at a distance of one meter, which partially explains the higher error for real data. The second and third best methods are R50F and HDM-net, also based on deep learning. However, their results are far from those of DeepSfT. The method CH17 obtains reasonable results when it is provided with ground-truth registration (CH17+GTR and CH17R+GTR). However, the performance is considerably worse when real registration is provided using dense optical flow (CH17+DOF and CH17R+DOF).
In terms of registration error, DeepSfT also has the best results, both for synthetic test data, where ground-truth registration is available, and for real test data, where DOF is used as the proxy. In all cases DeepSfT has a mean registration error of approximately 2 pixels. The performance of R50F is competitive with DOF, with registration errors of approximately 5 pixels. We note that DOF exploits temporal coherence while R50F and DeepSfT do not, processing each frame independently.
7.2 Experiments with volumetric objects and non-continuous test images
The quantitative and qualitative results of the experiments for the volumetric templates DB3 and DB4 are provided in Tables 5 and 6, with both synthetic test data, denoted DB3S and DB4S, and real test data, denoted DB3R and DB4R. In this case we only provide registration error with synthetic data, because reliable registration using DOF is impossible with non-continuous test images. The methods CH17+GTR and CH17R+GTR are tested only on DB4S, because this is the only case in which they can work (requiring a continuous texture map and a registration).
We observe a similar trend as with the thin-shell objects. DeepSfT is the best method, both in terms of 3D reconstruction, with errors of the order of millimeters, and in terms of registration, with errors close to 2 pixels. The second best method is R50F, although its results are significantly worse than those obtained by DeepSfT. The results of CH17 and its variants are very poor. This may be because CH17 is not well adapted to volumetric objects with non-negligible deformation strain.
We show in Table 7 qualitative reconstruction results obtained with DB1, DB3 and DB4 with real images.
7.3 Experiments with other cameras
We now present experiments showing the ability of DeepSfT to be used with a different camera at run-time, without any fine tuning with the new camera. The different cameras are an Intel Realsense D435 (an RGB-D camera that we use for quantitative evaluation) and a Gopro Hero V3 (an RGB camera for qualitative evaluation). Table 8 shows their respective camera intrinsics.
We trained DeepSfT with a source RGB-D camera (Kinect v2), whose intrinsics differ from those of the new cameras. We cannot immediately use images from a new camera because the network weights are specific to the intrinsics of the source camera. We propose to handle this by adapting the new camera's effective intrinsics to match the source camera. Because the object's depth within the training set varies (and so the perspective effects vary), we can emulate training with the new camera's intrinsics simply by an affine transform of the new camera image. This eliminates the need to retrain the network. We assume lens distortion is either negligible or has been corrected a priori using e.g. OpenCV. The affine transform is given by  and displacement , where  are the intrinsics of the new camera and  are the intrinsics of the source camera divided by . The corrected image is then clipped about its optical centre and zero-padded (if necessary) to obtain the resolution of  (the input image size of DeepSfT).
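A minimal sketch of this adaptation is given below, under several stated assumptions: the intrinsics matrices `K_new` and `K_src` are hypothetical examples, the principal-point displacement is omitted (the crop is about the image centre), and a nearest-neighbour rescale stands in for proper interpolation:

```python
import numpy as np

# Hedged sketch: rescale the new camera's image so its effective focal
# lengths match the source (training) camera's, then centre-crop / zero-pad
# to the network input resolution. Lens distortion is assumed removed.
def adapt_intrinsics(img, K_new, K_src, out_hw):
    sy = K_src[1, 1] / K_new[1, 1]        # focal-length ratios give the scale
    sx = K_src[0, 0] / K_new[0, 0]
    h, w = img.shape[:2]
    # nearest-neighbour rescale (stand-in for proper interpolation)
    ys = np.clip((np.arange(int(round(h * sy))) / sy).astype(int), 0, h - 1)
    xs = np.clip((np.arange(int(round(w * sx))) / sx).astype(int), 0, w - 1)
    scaled = img[ys][:, xs]
    # centre crop / zero pad to the network input size
    H, W = out_hw
    out = np.zeros((H, W) + img.shape[2:], dtype=img.dtype)
    h2, w2 = scaled.shape[:2]
    ch, cw = min(H, h2), min(W, w2)
    oy, ox = (H - ch) // 2, (W - cw) // 2
    iy, ix = (h2 - ch) // 2, (w2 - cw) // 2
    out[oy:oy + ch, ox:ox + cw] = scaled[iy:iy + ch, ix:ix + cw]
    return out

# Example: new camera has twice the focal length of the source camera,
# so the image is downscaled by 0.5 and zero-padded to 64x64.
K_new = np.array([[500.0, 0.0, 50.0], [0.0, 500.0, 50.0], [0.0, 0.0, 1.0]])
K_src = np.array([[250.0, 0.0, 50.0], [0.0, 250.0, 50.0], [0.0, 0.0, 1.0]])
adapted = adapt_intrinsics(np.ones((100, 100, 3)), K_new, K_src, (64, 64))
```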
Quantitatively, the 3D reconstruction errors of the original camera and the Intel Realsense D435 are quite similar. This clearly demonstrates the ability of DeepSfT to generalize to images taken with a different camera. DeepSfT is able to cope with images from other cameras even when the focal lengths are quite different, as is the case with the GoPro camera.
7.4 Light and Occlusion Resistance
We show that DeepSfT is resistant to light changes and significant occlusions. The first two rows of Table 10 show representative examples of scenes with external and self occlusions for the thin-shell and volumetric objects. DeepSfT is able to cope with them, accurately detecting the occlusion boundaries.
The third and fourth rows of Table 10 show examples of scenes with light changes that produce significant changes in shading. DeepSfT shows resistance to those changes.
7.5 Failure Modes
There are some instances where DeepSfT fails, shown in the final two rows of Table 10. There are general failure modes of SfT (very strong occlusions and illumination changes), for which all methods will fail at some point. We also have failure modes specific to a learning-based approach (excessive deformations that are not represented in the training set).
7.6 Timing Experiments
Table 11 shows the average frame rates of the compared methods, benchmarked on a conventional Linux desktop PC with a single NVIDIA GTX-1080 GPU.
The DNN-based methods are considerably faster than the other methods, with frame rates close to real time (DeepSfT). Solutions based on CH17 are far from real-time.
8 Conclusion
We have presented DeepSfT, the first dense, real-time solution for wide-baseline SfT with generic templates. This has been a long-standing computer vision problem for over a decade. DeepSfT will enable many real-world applications that require dense registration and 3D reconstruction of deformable objects, in particular augmented reality with deforming objects. We also expect it to be an important component for dense NRSfM in the wild. In the future we aim to improve results by incorporating temporal context information with recurrent neural networks, and to extend DeepSfT
References
-  Agisoft Photoscan. https://www.agisoft.com.
-  A. Agudo and F. Moreno-Noguer. Simultaneous pose and non-rigid shape with particle dynamics. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2179–2187, 2015.
-  A. Agudo, F. Moreno-Noguer, B. Calvo, and J. M. M. Montiel. Sequential non-rigid structure from motion using physical priors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(5):979–994, 2016.
-  V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. CoRR, abs/1511.00561, 2015.
-  A. Bansal, B. Russell, and A. Gupta. Marr revisited: 2d-3d alignment via surface normal prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5965–5974, 2016.
-  A. Bartoli, Y. Gérard, F. Chadebecq, T. Collins, and D. Pizarro. Shape-from-template. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(10):2099–2118, 2015.
-  Blender Online Community. Blender - a 3D modelling and rendering package. Blender Foundation, Blender Institute, Amsterdam.
-  C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3D shape from image streams. In Proceedings of the Conference on Computer Vision and Pattern Recognition. IEEE, 2000.
-  F. Brunet, A. Bartoli, and R. I. Hartley. Monocular template-based 3d surface reconstruction: Convex inextensible and nonconvex isometric methods. Computer Vision and Image Understanding, 125:138–154, 2014.
-  A. Chhatkuli, D. Pizarro, A. Bartoli, and T. Collins. A stable analytical framework for isometric shape-from-template by surface integration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(5):833–850, 2017.
-  A. Chhatkuli, D. Pizarro, T. Collins, and A. Bartoli. Inextensible non-rigid structure-from-motion by second-order cone programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2017.
-  T. Collins and A. Bartoli. Using isometry to classify correct/incorrect 3D-2D correspondences. In ECCV, 2014.
-  T. Collins, A. Bartoli, N. Bourdel, and M. Canis. Robust, real-time, dense and deformable 3d organ tracking in laparoscopic videos. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 404–412. Springer, 2016.
-  T. Collins, P. Mesejo, and A. Bartoli. An analysis of errors in graph-based keypoint matching and proposed solutions. In European Conference on Computer Vision, pages 138–153. Springer, 2014.
-  D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, December 2014.
-  D. Eigen and R. Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In Proceedings of the IEEE International Conference on Computer Vision, pages 2650–2658, 2015.
-  D. Eigen, C. Puhrsch, and R. Fergus. Depth map prediction from a single image using a multi-scale deep network. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2366–2374. Curran Associates, Inc., 2014.
-  R. Garg, V. K. BG, G. Carneiro, and I. Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision, pages 740–756. Springer, 2016.
-  V. Gay-Bellile, A. Bartoli, and P. Sayd. Direct estimation of nonrigid registrations with image-based self-occlusion reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(1):87–104, Jan 2010.
-  V. Golyanik, S. Shimada, K. Varanasi, and D. Stricker. Hdm-net: Monocular non-rigid 3d reconstruction with learned deformation model. CoRR, abs/1803.10193, 2018.
-  GoPro. Gopro hero silver v3 rgb camera. https://es.gopro.com/update/hero3.
-  R. A. Güler, N. Neverova, and I. Kokkinos. Densepose: Dense human pose estimation in the wild. arXiv preprint arXiv:1802.00434, 2018.
-  N. Haouchine and S. Cotin. Template-based monocular 3D recovery of elastic shapes using lagrangian multipliers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, July 2017.
-  N. Haouchine, J. Dequidt, M.-O. Berger, and S. Cotin. Single view augmentation of 3D elastic objects. In ISMAR, pages 229–236. IEEE, 2014.
-  R. Hartley and A. Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
-  Intel. Intel realsense d435 stereo depth camera. http://realsense.intel.com.
-  L. Torresani, A. Hertzmann and C. Bregler. Nonrigid structure-from-motion: Estimating shape and motion with hierarchical priors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(5):878–892, 2008.
-  F. Liu, C. Shen, G. Lin, and I. D. Reid. Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):2024–2039, 2016.
-  Q. Liu-Yin, R. Yu, L. Agapito, A. Fitzgibbon, and C. Russell. Better together: Joint reasoning for non-rigid 3d reconstruction with specularities and shading. arXiv preprint arXiv:1708.01654, 2017.
-  D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60:91–110, 2004.
-  A. Malti, A. Bartoli, and R. Hartley. A linear least-squares solution to elastic shape-from-template. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1629–1637, 2015.
-  A. Malti, R. Hartley, A. Bartoli, and J.-H. Kim. Monocular template-based 3d reconstruction of extensible surfaces with local linear elasticity. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1522–1529, 2013.
-  J. Martinez, R. Hossain, J. Romero, and J. J. Little. A simple yet effective baseline for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, volume 1, page 5, 2017.
-  N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
-  D. T. Ngo, J. Östlund, and P. Fua. Template-based monocular 3d shape recovery using laplacian meshes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(1):172–187, 2016.
-  E. Özgür and A. Bartoli. Particle-sft: A provably-convergent, fast shape-from-template algorithm. International Journal of Computer Vision, 123(2):184–205, 2017.
-  M. Perriollat, R. Hartley, and A. Bartoli. Monocular template-based reconstruction of inextensible surfaces. International journal of computer vision, 95(2):124–137, 2011.
-  J. Pilet, V. Lepetit, and P. Fua. Fast non-rigid surface detection, registration and realistic augmentation. IJCV, 76(2):109–122, February 2008.
-  D. Pizarro and A. Bartoli. Feature-based deformable surface detection with self-occlusion reasoning. International Journal of Computer Vision, 97(1):54–70, 2012.
-  A. Pumarola, A. Agudo, L. Porzi, A. Sanfeliu, V. Lepetit, and F. Moreno-Noguer. Geometry-aware network for non-rigid shape prediction from a single view. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2018.
-  M. Salzmann and P. Fua. Reconstructing sharply folding surfaces: A convex formulation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1054–1061. IEEE, 2009.
-  M. Salzmann, F. Moreno-Noguer, V. Lepetit, and P. Fua. Closed-form solution to non-rigid 3d surface registration. Computer Vision–ECCV 2008, pages 581–594, 2008.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv:1512.03385, December 2015.
-  N. Sundaram, T. Brox, and K. Keutzer. Dense point trajectories by GPU-accelerated large displacement optical flow. In ECCV, 2010.
-  X. Wang, D. Fouhey, and A. Gupta. Designing deep networks for surface normal estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 539–547, 2015.
-  X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of AISTATS, 2010.
-  Y. Dai, H. Li, and M. He. A simple prior-free method for non-rigid structure-from-motion factorization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.