Deep NRSfM++: Towards 3D Reconstruction in the Wild

01/27/2020 ∙ by Chaoyang Wang, et al. ∙ Carnegie Mellon University

The recovery of 3D shape and pose solely from 2D landmarks stemming from a large ensemble of images can be viewed as a non-rigid structure from motion (NRSfM) problem. To date, however, the application of NRSfM to problems in the wild has been problematic. Classical NRSfM approaches do not scale to large numbers of images and can only handle certain types of 3D structure (e.g. low-rank). A recent breakthrough in this problem has allowed for the reconstruction of a substantially broader set of 3D structures, dramatically expanding the approach's importance to many problems in computer vision. However, the approach is still limited in that (i) it cannot handle missing/occluded points, and (ii) it is applicable only to weak-perspective camera models. In this paper, we present Deep NRSfM++, an approach to allow NRSfM to be truly applicable in the wild by offering up innovative solutions to the above two issues. Furthermore, we demonstrate state-of-the-art performance across numerous benchmarks, even against recent methods based on deep neural networks.




1 Introduction

It is well understood [12, 28] how to recover the pose of an object through 2D landmarks if one knows the 3D structure beforehand. In most practical circumstances, however, one does not know that 3D structure. Some [35, 45] have instead advocated using a dictionary of 3D structures, where the shape instance with the smallest re-projection error is chosen so as to recover the joint pose and shape. This strategy is also problematic, as there is little guarantee that the dictionary of 3D shapes reflects the geometry of the object within the image. Increasingly, the vision community needs solutions to this problem that rely solely on 2D landmarks from a large ensemble of images. This problem can be viewed as non-rigid structure from motion (NRSfM).

Figure 1: We present Deep NRSfM++, a general NRSfM framework of joint 3D shape and camera recovery from in-the-wild datasets, possibly consisting of occluded 2D landmarks and strong perspective camera projections.

Current state-of-the-art NRSfM techniques, however, come with severe limitations. For one, NRSfM optimizers impose prior assumptions (e.g. low-rank dictionaries and temporal coherence) that limit their ability to deal with large-scale datasets. In addition, most NRSfM techniques are restricted to weak perspective camera models and assume that fully annotated point correspondences are available. These assumptions are hard to satisfy for objects in the wild, where occlusions and strong perspective effects are inevitable.

Recently, Kong & Lucey [22] introduced a new NRSfM technique, termed Deep NRSfM, that learns the dictionaries for joint recovery of 3D shapes and camera poses. Deep NRSfM solves the hierarchical block-sparse dictionary learning problem by optimizing the objective end-to-end using deep learning as the machinery, where the structured inductive biases in the network architecture reflect the optimization procedure used for estimating block-sparse codes. With its ability to learn solely from 2D landmarks, Deep NRSfM circumvents the limitation of having to learn from associated ground-truth 3D CAD models. However, its assumptions of fully annotated observations under weak perspective camera models make it impractical for datasets collected in the wild, where (a) images can exhibit strong perspective effects and (b) landmark annotations can be missing due to heavy (self-)occlusions.

In this work, we build upon the theoretical elegance of Deep NRSfM and propose Deep NRSfM++, a more general framework applicable to more relaxed settings. Deep NRSfM++ is able to model both weak and strong perspective camera models with the ability to tolerate missing data. This is an important step in extending NRSfM to real-world scenarios, since perspective effects and missing landmarks are inherent to in-the-wild image datasets. We show that missing data and perspective projection can be accounted for by adaptively normalizing both the input 2D landmarks and the shape dictionary; in addition, explicit estimation of the camera’s translational component can be circumvented by fully taking advantage of the object-centric nature of the problem. These reformulations lead to a unified framework under both strong and weak perspective camera models, capable of handling missing data.

Our contributions are summarized as follows:

  • We offer a novel formulation of the NRSfM problem that handles missing landmark data and perspective projection and is compatible with block sparse coding.

  • We propose a solution at the architectural level that keeps a closer mathematical proximity to the hierarchical block-sparse coding objective in NRSfM.

  • We demonstrate state-of-the-art performance of Deep NRSfM++ across multiple benchmarks when comparing against classical NRSfM and deep learning methods, showing the effectiveness of Deep NRSfM++ in handling a high percentage of missing data under both weak and strong perspective camera models.

We note that in this paper, we consider approaches to NRSfM that make no assumptions about the temporal relationship between images and are thus more generally applicable to datasets that are disjoint in both space and time.

2 Related Work

Non-rigid structure from motion. NRSfM concerns the problem of reconstructing 3D shapes from 2D point correspondences across multiple images, without assuming that the 3D shape is rigid. Under orthogonal projection, for each image, NRSfM is framed as factorizing the 2D projection matrix $\mathbf{W} \in \mathbb{R}^{P \times 2}$ as the product of a 3D shape matrix $\mathbf{S} \in \mathbb{R}^{P \times 3}$ and a camera projection matrix $\mathbf{M} \in \mathbb{R}^{3 \times 2}$:

$$\mathbf{W} = \mathbf{S}\mathbf{M}, \qquad (1)$$

where each row of $\mathbf{W}$ and $\mathbf{S}$ corresponds to the image coordinate and the world coordinate of the $i$-th point of a shape with a total of $P$ points. This factorization problem is obviously ill-posed. To resolve the ambiguities in the solutions, priors are enforced upon the stack of shape matrices under multiple views, and also on the trajectories if temporal information is available. Such priors include the assumptions of shape/trajectory matrices being low rank [9, 5, 3, 14]; lying in a union of subspaces [25, 49, 1]; or being sparsely compressible [23, 21, 48].
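To make the factorization and its ill-posedness concrete, the following is a minimal NumPy sketch. The names `W`, `S`, `M`, `P` and `G` are illustrative labels chosen for this example, and the random shape and camera are assumptions, not data from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
P = 5                                   # number of landmarks (arbitrary choice)
S = rng.standard_normal((P, 3))         # 3D shape, one point per row
S -= S.mean(axis=0)                     # zero-centred, object-centric coordinates

# Random orthonormal matrix via QR; its first two columns act as the 3x2 camera.
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
M = Q[:, :2]

W = S @ M                               # 2D projections, shape (P, 2)

# Ill-posedness: any invertible 3x3 matrix G gives another exact factorization.
G = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)
S_alt, M_alt = S @ G, np.linalg.inv(G) @ M
```

Both `(S, M)` and `(S_alt, M_alt)` reproduce the same `W`, which is why additional priors on the shapes are needed.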

These methods have clear mathematical interpretations and usually guarantee the uniqueness of the solution. However, many of them face limitations on large-scale data. The low-rank assumption is infeasible when the data has complex variations and the number of points is much smaller than the number of frames. Union-of-subspaces NRSfM has difficulty clustering shape deformations effectively from 2D inputs alone, and estimating the affinity matrix when the number of frames is large. The sparsity prior offers more power to model complex shape variations with a large number of subspaces, but because there are many possible subspaces to choose from, it is sensitive to noise.

Perspective projection. Most NRSfM research assumes the weak perspective camera model. For real-world data with objects close to the camera, however, modeling perspective projection is necessary for accurate reconstruction. Sturm & Triggs [36] formulate SfM under a perspective camera as a factorization problem. This formulation was later extended to solve NRSfM. Xiao & Kanade [46] develop a two-step factorization algorithm: projective depths are first recovered with subspace constraints embedded in the 2D measurements, and the factorization is then solved by weak perspective NRSfM. Wang et al. [42] propose to update solutions from a weak perspective to a full perspective projection by refining the projective depths recursively. Hartley & Vidal [17] derive a closed-form linear solution. Their algorithm requires an initial estimate of a multifocal tensor, which is reported to be sensitive to noise.

Instead of directly solving the factorization problem in the form of Sturm & Triggs [36], we simplify the problem by fully utilizing the object-centric settings, and reformulate it to be compatible with block sparse coding.

Missing data. Missing point correspondences are inevitable in real-world data due to occlusions. Handling missing data is a non-trivial task, not only because the input data is incomplete, but also because translation can no longer be removed simply by centering the 2D data. Prior works employ the following strategies: (i) introduce a visibility mask to exclude missing 2D points from the evaluation of the objective function [10, 15, 23, 43, 22], (ii) recover the missing 2D points using matrix completion algorithms and then run NRSfM [9, 27, 16], or (iii) treat missing points as unknowns and update them iteratively [31]. In this work, we follow the first strategy and derive a novel approach under the framework of block sparse coding.
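Strategy (i) can be illustrated with a small sketch: a boolean visibility mask removes occluded landmarks from a 2D reprojection error. The function name `masked_reprojection_error` and the choice of a mean squared error are assumptions made for illustration, not the objective used by any of the cited works.

```python
import numpy as np

def masked_reprojection_error(W, W_hat, visible):
    """Mean squared 2D error computed over visible landmarks only."""
    visible = np.asarray(visible, dtype=bool)
    diff = (W - W_hat)[visible]                  # drop occluded rows
    return float(np.sum(diff ** 2) / max(visible.sum(), 1))

W = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0]])       # observed 2D points
W_hat = np.array([[0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])   # reprojected 2D points
visible = [True, True, False]                    # third landmark is occluded
err = masked_reprojection_error(W, W_hat, visible)
```

The large discrepancy on the occluded third landmark does not contribute to `err`.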

Deep learning methods for NRSfM. In the recent literature on unsupervised 3D pose learning, an equivalent problem is approached by training neural networks to lift 2D input poses to 3D. The lifting network is primarily trained by minimizing the 2D reprojection error. However, the 2D reprojection error alone is not enough to constrain the problem, and thus other constraints are needed. GANs and various forms of self-consistency loss [30, 6, 40, 11, 24] are used to assist training. These methods are developed mostly from the machine learning point of view and, as a consequence, lack geometric interpretability, especially when dealing with both missing data and perspective projection.

In contrast, we mathematically derive a general framework that is applicable to both weak perspective and perspective projection, robust to missing data, and interpretable as solving hierarchical block sparse dictionary coding.

3 Background: Deep NRSfM

Figure 2: Deep NRSfM++: the 2D keypoint input and the shape dictionary are normalized according to the visibility mask and camera type (summarized in Table 1). The normalized 2D input is then fed into an encoder-decoder network derived from hierarchical block sparse coding (see Sec. 4.1). The network outputs the camera matrix and the 3D shape, from which a 2D reconstruction is obtained. The training objective is to minimize the difference between this reconstruction and the normalized 2D input.

At its core, Deep NRSfM [22] solves a hierarchical block sparse dictionary learning problem under the assumption that the 3D shape $\mathbf{S} \in \mathbb{R}^{P \times 3}$ is compressible via multi-layer sparse coding, where $P$ is the number of landmarks. This is formally written as

s.t. (2)


is the vectorization of

, and , are the hierarchical dictionaries. Constraints on multi-layer sparsity not only preserve sufficient freedom in shape variation, but also result in more constrained code recovery.

Hierarchical block sparse coding. Multiplying both sides of equations in (2) by the camera projection matrix leads to the following hierarchical block sparse coding objective:


where denotes Kronecker product and is a reshape of . The 3D shape matrix can be recovered by


which is a concatenation of code blocks. Since the camera matrix is orthonormal, we have


where denotes the sum of the Frobenius norm of each block. Therefore, we equivalently enforce . This hierarchical block sparse coding formulation can be generalized to handle occlusions and also perspective projections, as we show in later sections.

Bilevel optimization. Kong & Lucey [22] proposed to solve the hierarchical block sparse dictionary learning problem through a bilevel optimization procedure. With the dictionaries fixed, the lower-level optimization solves for the camera matrices and shapes by minimizing the block sparse coding objective. Kong & Lucey [22] advocated a neural network formulation inspired by the Iterative Shrinkage and Thresholding Algorithm (ISTA) [4, 34, 18], which can be interpreted as an approximation of a single ISTA iteration that infers the sparse codes using the network weights as the dictionaries [32]. The neural network approximates the solution of the lower-level optimization problem. The block sparsity constraint, however, was further relaxed to a non-negativity constraint with the ReLU operation, which we empirically find to degrade performance. At the higher level, the network weights are updated to minimize the 2D reprojection error, i.e. the difference between the reprojection of the 3D reconstruction and the 2D measurement.

The theoretical connection between solving a hierarchical block sparse coding problem and a feed-forward network is the major breakthrough in Deep NRSfM. However, this connection was formulated under the assumption of perfect point correspondences and a weak perspective camera model. We address these issues and propose Deep NRSfM++, a more general framework capable of handling missing points and perspective camera projections.

4 Deep NRSfM++

Deep NRSfM++ is a general framework for NRSfM that can handle an arbitrary number of missing landmarks under both weak and strong perspective camera models. Similar to Deep NRSfM, it formulates the problem as learning hierarchical block sparse dictionaries. The major difference is that the 2D input and the first dictionary are adaptive according to the visibility of the input points as well as the selected camera model. The block sparse coding objective can be written in a generic form as:


where is a diagonal binary matrix indicating visibility and is the number of columns of the camera matrix , where for weak perspective and for perspective camera models. , , denote generic forms and are functions of , , and respectively; for the special case of full landmark visibility under weak perspective cameras, these relationships fall back to (3) in Deep NRSfM with , , . Table 1 summarizes the formulation for , , for different settings. We provide derivations of these mathematical relationships in Sec. 4.2 for weak perspective cameras and Sec. 4.3 for perspective cameras.

weak perspective ,
perspective  (28)  (29) ,
Table 1: Summary of , , under weak perspective and perspective camera.

4.1 ISTA with Block Soft Thresholding

Consider the single-layer case for block sparse coding:


One iteration of ISTA is computed as


where is the proximal operator for block sparsity of block size . Let , where . Thus is equivalent to applying block soft thresholding (BST) to all , defined as


Suppose is initialized to , and the step size , then the first iteration of ISTA is written as


We interpret BST as solving for the block sparse code and incorporate it as the nonlinearity in the encoder part of our network, similar to a single-layer feed-forward ReLU network being interpreted as basis pursuit [32]. This formulation is closer to the true objective of NRSfM than Deep NRSfM, which uses ReLU as the nonlinearity and thereby relaxes the constraint to sparsity with a non-negativity constraint. This is mainly due to the fact that the relationship is not applicable to the non-negative constraint. We show empirical evidence of the superiority of BST over ReLU in Table 2.
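As an illustration of block soft thresholding inside one ISTA iteration, the sketch below treats each row of the code matrix as one block. This row-wise block layout, the objective written in the docstring, the zero initialization, the unit step size and the names `block_soft_threshold` and `ista_step` are all assumptions made for this example; the exact block structure of the paper (which involves a Kronecker product) is not reproduced here.

```python
import numpy as np

def block_soft_threshold(Z, thresh):
    """Shrink each row (block) of Z toward zero by `thresh` in Euclidean norm."""
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - thresh / np.maximum(norms, 1e-12))
    return scale * Z

def ista_step(Psi, X, D, lam, step=1.0):
    """One ISTA iteration for 0.5*||X - D Psi||_F^2 + lam * sum_k ||Psi[k]||_2."""
    grad = D.T @ (D @ Psi - X)                   # gradient of the quadratic term
    return block_soft_threshold(Psi - step * grad, step * lam)

D = np.eye(3)                                    # toy dictionary
X = np.array([[3.0, 4.0], [0.3, 0.4], [0.0, 0.0]])
# Starting from zero with unit step, the first iteration reduces to BST(D^T X).
Psi1 = ista_step(np.zeros((3, 2)), X, D, lam=1.0)
```

Unlike ReLU, the thresholding acts on the norm of a whole block, so a block is either kept (with a shrunk magnitude and unchanged direction) or set to zero as a unit, regardless of the signs of its entries.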

Encoder-decoder network. By unrolling one iteration of block ISTA for each layer, our encoder network takes as input and produces the block code for the last layer as output, expressed as


where are the thresholds for each of the block. Following similar practice in Deep NRSfM, we factorize into the code vector and camera matrix with approximate solutions, where the camera matrix is further constrained to be orthonormal under weak perspective camera and under perspective camera using SVD [22].
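The text states that the camera matrix is constrained to be orthonormal using SVD but does not spell out the procedure. One common choice, shown here as an assumption rather than as the authors' exact step, is the orthogonal Procrustes projection, which replaces the singular values of the estimate by ones. The name `nearest_orthonormal` is hypothetical.

```python
import numpy as np

def nearest_orthonormal(A):
    """Closest matrix with orthonormal columns to A in Frobenius norm."""
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt                                # singular values replaced by ones

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 2))                  # unconstrained 3x2 camera estimate
M = nearest_orthonormal(A)
```

Applying the function to a matrix that already has orthonormal columns returns it unchanged.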

The recovered code is passed into a decoder network to reconstruct the 3D shape via


Training objective. Reflecting the block sparse coding objective in (6), our training objective is to make the estimated 2D reconstruction match the normalized 2D input as closely as possible:


4.2 Weak Perspective Camera

Due to an inherent scale ambiguity between camera and 3D shape in weak perspective camera models, we do not explicitly solve for the camera scale, but rather normalize the input 2D points with a corresponding 2D bounding box. We consider the scale after normalization to be one and reconstruct a scaled 3D shape. We assume the 3D shape is zero-centered and lies in an object-centric coordinate system. The shape can thus be transformed to camera coordinates via:


where is the rotation matrix and is the translation vector. Under orthographic projection, the 2D projection is


where and are the first two columns of and respectively is the 2D translation in x-y coordinates. Assuming that the object-centric coordinates are centered at the mean of the keypoint locations, we have


Handling missing data. To take possible occlusions into account, we can rewrite (15) as a generalized form using the visibility diagonal binary matrix as


Since not all 2D keypoint locations are guaranteed available, cannot be computed via (16) anymore. To resolve this, we propose to replace the occluded keypoints with the projection from , with rewritten as


where indicates the visibility of the -th keypoint and denotes the -th 3D point in . Rearranging (18) yields


where denotes the number of visible points. Substituting this expression into (17) and rearranging, we have


Since from (3), we have


In other words, is formed by shifting with the average of visible keypoint locations. This aligns with the common practice employed for data normalization [22, 30]. For the special case where all points are visible, the expressions in (21) degenerate back to (3).
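The effect of this normalization can be checked numerically. In the sketch below, shifting the 2D keypoints by the mean of the visible ones removes the unknown translation, and the result equals the projection of the 3D points shifted by the mean of the same visible subset. The helper name `center_by_visible`, the row-vector convention and the zeroing of occluded rows are assumptions made for this illustration.

```python
import numpy as np

def center_by_visible(X, visible):
    """Subtract the mean of the visible rows of X and zero out occluded rows."""
    v = np.asarray(visible, dtype=float)[:, None]        # (P, 1) visibility mask
    mean_vis = (v * X).sum(axis=0) / max(v.sum(), 1.0)
    return v * (X - mean_vis)

rng = np.random.default_rng(0)
P = 6
S = rng.standard_normal((P, 3))
S -= S.mean(axis=0)                                      # zero-centred 3D shape
M = np.linalg.qr(rng.standard_normal((3, 3)))[0][:, :2]  # orthonormal 3x2 camera
t = np.array([0.7, -1.2])                                # unknown 2D translation
W = S @ M + t                                            # observed 2D keypoints

visible = np.array([1, 1, 0, 1, 0, 1])
W_tilde = center_by_visible(W, visible)                  # normalized 2D input
S_tilde = center_by_visible(S, visible)                  # same shift on the 3D points
```

Here `W_tilde` equals `S_tilde @ M`, so the translation `t` never has to be estimated explicitly.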

4.3 Perspective Camera

We first consider the case where all points are visible. Let be the point coordinates in . Due to the fact that , can also be expressed as


where refers to each row of , and , , are the three columns of . Assuming that camera intrinsics , we have the linear relationships


which simply states that the product of the depth and 2D coordinates is equivalent to the x-y coordinates in 3D.
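The statement above can be verified with a few lines of NumPy. The sketch assumes identity camera intrinsics and a row-vector convention; the variable names (`X_cam`, `uv`, `z`) and the random shape, rotation and translation are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
P = 4
S = 0.3 * rng.standard_normal((P, 3))            # object-centric 3D points
R = np.linalg.qr(rng.standard_normal((3, 3)))[0] # orthonormal matrix standing in for a rotation
t = np.array([0.2, -0.1, 5.0])                   # object placed in front of the camera

X_cam = S @ R + t                                # points in camera coordinates
z = X_cam[:, 2:3]                                # per-point depth
uv = X_cam[:, :2] / z                            # perspective projection, identity intrinsics

# Depth times 2D coordinates recovers the x-y coordinates in 3D.
xy_from_uv = z * uv
```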

In the object-centric coordinate system, translation can be expressed as the mean of back-projection of the 2D points


Substituting (24) and (22) into (23), we have the following linear relationships in matrix form:


In this case, is formed not only by shifting with its mean but also scaled by , the depth of the object center to the camera. is simply a scalar that normalizes the 2D input and controls the scale of the 3D reconstruction, which is similar to the weak perspective case. However, is now a vectorization of by concatenating its columns.

Handling missing data. As in the weak perspective case, translation can be expressed using the average of visible and occluded points multiplied by their projective depth as


and similarly for . Substituting the new expressions of the translational components and into (23), we have


where , which is the same as normalized dictionary in (21).

Scale correction. Properly normalizing the input has a positive impact on both classical NRSfM and deep learning methods. Since we do not assume oracle 3D information to be accessible, we keep our algorithm general by using available 2D information such as bounding boxes. In practice, we utilize a strategy to estimate and recorrect the scale in an iterative fashion. Specifically, we use the detected 2D object bounding box to provide an initial estimate of ; subsequently, we update the scale estimation with either the Frobenius norm of the reconstructed shape or the average bone length of a skeleton model, if available. Once we have updated the scale estimation, namely ’s, we rerun Deep NRSfM++ and update the reconstruction. This scale correction procedure is applied iteratively, such that the 3D reconstruction and the scale estimation improve each other.

5 Experiments

Implementation details. The only hyper-parameters of our approach are the number of layers and the sizes of the dictionaries. Among these, we find the most important to be the size of the dictionary at the last level, which is chosen depending on the amount of shape variation in the data. For human skeleton data, we use 8-10, and for rigid objects, we use 2-4. Compared to the residual networks used in other deep learning approaches, our model is much more compact. We train our method with a standard optimizer, e.g. Adam. A detailed description is provided in the supplementary material.

Evaluation metrics. We employ the following metrics to evaluate the accuracy of 3D reconstruction:

  • MPJPE: before calculating the mean per joint position error, we first normalize the scale of the prediction to match the ground truth (GT). To account for the depth ambiguity of weak perspective cameras, we also flip the depth values of the prediction if doing so leads to a lower error.

  • PA-MPJPE: the prediction is rigidly aligned to GT before evaluating MPJPE.

  • STRESS: a metric borrowed from Novotny et al. [30] that is invariant to camera pose and scale.

  • Normalized 3D error: the 3D error normalized by the scale of GT, used in prior NRSfM works [2, 9, 22, 15].
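A possible implementation of the first two metrics is sketched below. Several details are assumptions, since the text does not specify them: the scale is matched through the Frobenius norm, the depth flip negates the z coordinate, and the alignment for PA-MPJPE is a Procrustes alignment that also solves for scale, as is common for this metric. The function names are hypothetical.

```python
import numpy as np

def mpjpe(pred, gt):
    """Mean Euclidean distance per joint; inputs are (J, 3) arrays."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

def mpjpe_scale_flip(pred, gt):
    """MPJPE after matching the Frobenius norm of GT and trying a depth flip."""
    pred = pred * (np.linalg.norm(gt) / np.linalg.norm(pred))
    flipped = pred * np.array([1.0, 1.0, -1.0])
    return min(mpjpe(pred, gt), mpjpe(flipped, gt))

def pa_mpjpe(pred, gt):
    """MPJPE after Procrustes alignment (rotation, translation, scale) to GT."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    A, B = pred - mu_p, gt - mu_g
    U, s, Vt = np.linalg.svd(A.T @ B)
    d = 1.0 if np.linalg.det(U @ Vt) >= 0 else -1.0   # exclude reflections
    Dm = np.diag([1.0, 1.0, d])
    R = U @ Dm @ Vt
    scale = (s * np.diag(Dm)).sum() / (A ** 2).sum()
    return mpjpe(scale * A @ R + mu_g, gt)

rng = np.random.default_rng(0)
gt = rng.standard_normal((10, 3))
err_flip = mpjpe_scale_flip(2.0 * gt * np.array([1.0, 1.0, -1.0]), gt)

Q = np.linalg.qr(rng.standard_normal((3, 3)))[0]
Q[:, 0] *= np.sign(np.linalg.det(Q))                  # make Q a proper rotation
err_pa = pa_mpjpe(0.5 * gt @ Q + 3.0, gt)
```

Both example errors are zero up to floating-point precision, because the predictions differ from GT only by transformations that the respective metric factors out.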

Block-soft vs ReLU thresholding. We first compare our approach against Deep NRSfM [22] on orthogonal projection data with perfect point correspondences, e.g. the CMU motion capture dataset [8]. We follow the settings in Deep NRSfM: we train a separate model for each human subject and report the normalized 3D error on the training set. Table 2 shows that our approach, which is closer to solving the true block sparse coding objective by using block soft thresholding instead of ReLU, achieves better accuracy on most subjects than both our own re-implementation of Deep NRSfM (ReLU) and the numbers reported in the original paper. We also compare against other NRSfM methods and show dramatic improvement.

Weak perspective projection with missing data. We evaluate the weak perspective version of our method on two benchmarks with a high amount of missing data. Synthetic Up3D is a large synthetic dataset with dense human keypoints based on the Unite the People 3D (Up3D) dataset [26]. The data is generated by the orthographic projection of the SMPL body shape with 6890 vertices. The visibility of each point is computed using a ray tracer. The goal is to reconstruct 3D shapes from the rendered 2D keypoints. We follow the exact same settings as C3DPO [30]: we apply the learned model on the test set and evaluate the metrics on the 79 representative vertices of the SMPL model. Pascal 3D+ [45] consists of images from PASCAL VOC and ImageNet for 12 rigid object categories with sparse keypoint annotations. Each category is associated with up to 10 CAD models. To ensure consistency between the 2D keypoints and the 3D GT, we follow [30] and use the orthographic projections of the aligned CAD models. The visibility of each point is then updated according to the original 2D annotations. Like C3DPO, we train a single model to account for all 12 object categories. In addition, we include the results of testing our method using 2D keypoints detected by HRNet [37].

Our method demonstrates state-of-the-art performance compared to other NRSfM methods and to the deep learning method C3DPO (see Tables 3 and 4). On the one hand, comparing against Deep NRSfM shows an error reduction of over 32% due to our novel formulation for handling missing data (see Table 4). On the other hand, our method compares favorably against C3DPO. It shows an even greater advantage over C3DPO-base, where both are learned using only the 2D reprojection loss. This supports the effectiveness of our method.

Subject   CNS     NLO     SPS     Deep NRSfM   ReLU    Ours
1         0.613   1.22    1.28    0.175        0.265   0.112
5         0.657   1.160   1.122   0.220        0.393   0.230
18        0.541   0.917   0.953   0.081        0.117   0.076
23        0.603   0.998   0.880   0.053        0.093   0.048
64        0.543   1.218   1.119   0.082        0.179   0.020
70        0.472   0.836   1.009   0.039        0.030   0.019
106       0.636   1.016   0.957   0.113        0.364   0.116
123       0.479   1.009   0.828   0.040        0.040   0.020
Table 2: Results on CMU Motion Capture. Compared with NRSfM methods: CNS [27], NLO [10], SPS [21] and Deep NRSfM [22]. ‘ReLU’ is our re-implementation of Deep NRSfM.
Figure 3: Qualitative comparison. Blue points: GT, yellow points: prediction, red lines: distance between prediction and GT. The first two rows are visual results from H3.6M. In each sample, results from left to right are: C3DPO, ours with weak perspective camera model, and ours with perspective projection. The bottom two rows are results from the Apollo dataset. In each sample, results from left to right are: ours with weak perspective camera model, and ours with perspective projection.
Method            MPJPE   STRESS
EM-SfM [39]       0.107   0.061
GbNRSfM [14]      0.093   0.062
C3DPO-base [30]   0.160   0.105
C3DPO [30]        0.067   0.040
Ours              0.062   0.037
Table 3: Results on Synthetic Up3D.
Method            MPJPE   STRESS
GT 2D keypoints as input:
EM-SfM [39]       131.0   116.8
GbNRSfM [14]      184.6   111.3
Deep NRSfM [22]   51.3    44.5
C3DPO-base [30]   53.5    46.8
C3DPO [30]        36.6    31.1
Ours              34.8    27.9
2D keypoints detected by HRNet as input:
Deep NRSfM        65.3    47.7
CMR [20]          74.4    53.7
C3DPO             57.5    41.4
Ours              53.0    36.1
Table 4: Results on Pascal3D+. The first row section is the testing result with GT 2D keypoints as input. The second row section is the testing result with keypoints detected by HRNet.
Martinez et al. [29] - - 45.5 37.1
Zhao et al. [47] - - 43.8 -
3DInterpreter [44] - 88.6
AIGN [13] - 79.0 2
RepNet [40] 50.9 38.2
Drover et al. [11] - 38.2
Pose-GAN [24] 130.9 -
C3DPO [30] 95.6 -
Chen et al. [6] - 58
+ DA, TD - 51
Ours(weak persp) 104.2 72.9
+ persp 60.5 51.8
+ scale corr itr1 57.0 51.3
+ scale corr itr2 56.6 50.9
Table 5: Results on H3.6M. For each method we indicate their training supervision. MV/T means multi-view or temporal constraint. Ext3D means using external 3D data. The first row section lists two state-of-the-art supervised methods as reference. The 2nd section lists weakly supervised methods that use external 3D data. The bottom section lists unsupervised methods. The different versions of our method are: + persp: using perspective projection model, +scale corr itr: applying different number of scale correction iterations.
                     no missing pts.     with missing pts.
method               train     test      train     test
Consensus [27]       1.30      -         -         -
Ours(weak persp)     0.596     0.591     0.679     0.681
Ours(persp)          0.152     0.145     0.182     0.185
+ scale correction   0.131     0.124     0.165     0.168
Table 6: Results on Apollo 3D Car dataset. The evaluation metric is MPJPE in meters.
Missing pts. (%)
Noise() 0 30 60
0 0.124 0.142 0.192
3 0.129 0.144 0.205
5 0.136 0.150 0.202
10 0.125 0.166 0.181
15 0.191 0.188 0.304
Table 7: Results of our method on Apollo 3D Car dataset under different occlusion rate and noise in 2D keypoints.

Strong perspective projection. We evaluate our approach on two datasets with strong perspective effects. H3.6M [19] is a large-scale human pose dataset annotated by MoCap systems. We closely follow the commonly used evaluation protocol: we use 5 subjects (S1, S5, S6, S7, S8) for training and 2 subjects (S9, S11) for testing. Apollo 3D Car [35] has 5277 images featuring cars. Each car instance is annotated with a 3D pose by running PnP with 3D CAD models. 2D keypoint annotations are provided without 3D ground truth. To evaluate our method, we render 2D keypoints by projecting 34 car models according to the 3D car pose labels. The visibility of each keypoint is marked according to the original 2D keypoint annotations. To demonstrate strong perspective effects, we select cars within 30 meters with no fewer than 25 visible points (out of 58 in total) for our experiment. This gives us 2848 samples for training and 1842 for testing.
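The sample selection rule for Apollo 3D Car can be written as a simple filter. The sketch below is hypothetical: the function name `select_samples`, the array layout and the inclusive distance comparison are assumptions, and the toy data is invented for illustration only.

```python
import numpy as np

def select_samples(distances_m, visibility, max_dist=30.0, min_visible=25):
    """Boolean mask over samples: close enough and with enough visible keypoints."""
    distances_m = np.asarray(distances_m, dtype=float)
    n_visible = np.asarray(visibility, dtype=bool).sum(axis=1)
    return (distances_m <= max_dist) & (n_visible >= min_visible)

vis = np.zeros((3, 58), dtype=bool)     # 58 keypoints per car
vis[0, :30] = True                      # 30 visible keypoints, 12 m away: kept
vis[1, :20] = True                      # 20 visible keypoints, 12 m away: too few points
vis[2, :40] = True                      # 40 visible keypoints, 45 m away: too far
keep = select_samples([12.0, 12.0, 45.0], vis)
```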

We evaluate different variants of our approach. We find that modeling perspective projection (Ours persp) leads to significant improvement over the weak perspective model (Ours weak persp), and that applying scale correction further improves accuracy. Our method is robust to different levels of noise and occlusion (see Table 7) and achieves the best result among unsupervised learning methods. We outperform the previous leading GAN-based method [6] by a significant margin when using the same training set (50.9 vs. 58); Chen et al. [6] have to utilize external training sources and temporal constraints to reach our level of performance.

6 Conclusion

We propose Deep NRSfM++, a general NRSfM framework for joint recovery of 3D shapes and camera poses solely from 2D landmarks. Deep NRSfM++ generalizes to both weak and strong perspective camera models with the ability to handle missing/occluded landmark data, making it applicable to datasets captured in the wild. Furthermore, we establish a closer theoretical connection between an end-to-end learning framework and the true objective of the NRSfM problem. We demonstrate state-of-the-art performance across numerous benchmarks against various classical NRSfM techniques as well as deep learning based approaches, indicating the effectiveness of our approach.


I: Implementation details

Dictionary sizes.

The dictionary size in each layer of the block sparse coding is listed in Table 8. We tried two strategies to set the dictionary sizes: (i) exponential decrease, i.e. 512, 256, 128, …, and (ii) linear decrease, i.e. 125, 115, 104, … . Both strategies give reasonably good results. The remaining manual choice is the size of the first and last layer dictionaries. We find that the size of the first layer does not have a major impact on accuracy as long as it is sufficiently large and the corresponding hierarchical dictionary is deep enough (e.g. for exponential decrease, for linear decrease). The major performance factor is the size of the last layer dictionary. The rule of thumb we found is: 2-4 for rigid objects and 8-10 for articulated objects such as the human body.
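The two strategies can be sketched as follows. The helper names and the rounding rule (evenly spaced values rounded to the nearest integer) are assumptions; with 12 layers between 125 and 10, this rule happens to reproduce the H3.6M row of Table 8, and halving from 512 down to 8 reproduces the CMU-Mocap row.

```python
import numpy as np

def exponential_sizes(first, last):
    """Halve the dictionary size at each layer, from `first` down to `last`."""
    sizes = [first]
    while sizes[-1] // 2 >= last:
        sizes.append(sizes[-1] // 2)
    return sizes

def linear_sizes(first, last, num_layers):
    """Evenly spaced dictionary sizes, rounded to the nearest integer."""
    return [int(v) for v in np.round(np.linspace(first, last, num_layers))]

exp_sizes = exponential_sizes(512, 8)
lin_sizes = linear_sizes(125, 10, 12)
```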

Training parameters.

We train with the Adam optimizer. The learning rate is 0.0001, with a linear decay rate of 0.95 per 50k iterations. The total number of iterations is 400k to 1.2 million, depending on the data. The batch size is set to 128; larger or smaller batch sizes lead to similar results.
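The exact form of the decay is not specified beyond the numbers above. One plausible reading, shown here purely as an assumption, is a staircase schedule that multiplies the learning rate by 0.95 once every 50k iterations; the function name `learning_rate` is hypothetical.

```python
def learning_rate(iteration, base_lr=1e-4, decay=0.95, every=50_000):
    """Staircase schedule: multiply by `decay` once every `every` iterations."""
    return base_lr * decay ** (iteration // every)

lr_start = learning_rate(0)             # initial learning rate
lr_400k = learning_rate(400_000)        # after 8 decay steps
```

Under this reading, the learning rate after 400k iterations is roughly two thirds of its initial value.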

Initial scale for input normalization.

For weak perspective and strong perspective data, we need to estimate the scale so as to properly normalize the size of the 2D input shape. For Pascal3D+ and H3.6M datasets, we use the maximum length of the 2D bounding box edges, i.e. . For Apollo 3D Car dataset, we choose the minimum length of the bounding box, i.e. by assuming that the height of each car is identical.

II: Additional empirical analysis


We add the test results using detected 2D keypoints as input in Table 9. Deep NRSfM++ achieves state-of-the-art results compared to other unsupervised methods.

Apollo 3D Car dataset.

Figure 4 shows additional analysis of Deep NRSfM++ on the training set. Our method achieves cm error for over 80% testing samples with occlusions, while the compared baseline method, namely Consensus NRSfM [27], fails to produce meaningful reconstructions using perfect point correspondences. Average errors at different distances, rotation angles and occlusion rates are also reported. Overall, our method does not have a strong bias against a particular distance or occlusion rate. It does show larger error at azimuth, most likely due to the data distribution, where most cars are in either front () or back () view.

III: Additional discussion

One of the benefits of solving NRSfM by training a neural network is that, in addition to the 2D reconstruction loss, we can easily employ other loss functions to further constrain the problem. In summary, our preliminary study finds that: (i) adding the canonicalization loss [30] does not noticeably improve results; (ii) adding Lasso regularization on gives marginally better results on some datasets; (iii) adding a symmetry constraint on the skeleton bone lengths helps to improve robustness against network initialization, but does not lead to better accuracy.

Dataset | Dictionary sizes
CMU-Mocap | 512, 256, 128, 64, 32, 16, 8
UP3D | 512, 256, 128, 64, 32, 16, 8
Pascal3D+ | 256, 128, 64, 32, 16, 8, 4, 2
H3.6M | 125, 115, 104, 94, 83, 73, 62, 52, 41, 31, 20, 10
Apollo | 128, 100, 64, 50, 32, 16, 8, 4
Table 8: Dictionary sizes used in each experiment.
Martinez et al. [29] - - 62.9 52.1
Zhao et al. [47] - - 57.6 -
3DInterpreter [44] - 98.4
AIGN [13] - 97.2
Tome et al. [38] 88.4 -
RepNet [40] 89.9 65.1
Drover et al. [11] - 64.6
Pose-GAN [24] 173.2 -
C3DPO [30] 145.0 -
Wang et al. [41] 83.0 57.5
Chen et al. [6] - 68
Ours(persp proj) 68.9 59.4
+ scale corr itr1 67.3 59.2
+ scale corr itr2 67.0 58.7
Table 9: Results on the H3.6M dataset with detected 2D keypoint input. For our results, we use detected points from a cascaded pyramid network (CPN [7]) fine-tuned on the H3.6M training set (excluding S9 and S11) by [33].
Figure 4: Additional results on the Apollo 3D Car dataset. Top-left: percentage of success at different error thresholds. Rest: average error at different distances to the camera (top-right), azimuth angles of the car (bottom-left) and number of observed keypoints (bottom-right).


  • [1] A. Agudo, M. Pijoan, and F. Moreno-Noguer (2018-06) Image collection pop-up: 3d reconstruction and clustering of rigid and non-rigid categories. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [2] I. Akhter, Y. Sheikh, S. Khan, and T. Kanade (2009) Nonrigid structure from motion in trajectory space. In Advances in neural information processing systems, pp. 41–48. Cited by: §5.
  • [3] I. Akhter, Y. Sheikh, and S. Khan (2009) In defense of orthonormality constraints for nonrigid structure from motion. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1534–1541. Cited by: §2.
  • [4] A. Beck and M. Teboulle A fast iterative shrinkage-thresholding algorithm with application to wavelet-based image deblurring. Cited by: §3.
  • [5] C. Bregler Recovering non-rigid 3d shape from image streams. Cited by: §2.
  • [6] C. Chen, A. Tyagi, A. Agrawal, D. Drover, R. MV, S. Stojanov, and J. M. Rehg (2019-06) Unsupervised 3d pose estimation with geometric self-supervision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, Table 5, §5, Table 9.
  • [7] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun (2018) Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7103–7112. Cited by: Table 9.
  • [8] CMU motion capture dataset. Note: available at Cited by: §5.
  • [9] Y. Dai, H. Li, and M. He (2014) A simple prior-free method for non-rigid structure-from-motion factorization. International Journal of Computer Vision 107 (2), pp. 101–122. Cited by: §2, §2, §5.
  • [10] A. Del Bue, F. Smeraldi, and L. Agapito (2007) Non-rigid structure from motion using ranklet-based tracking and non-linear optimization. Image and Vision Computing 25 (3), pp. 297–310. Cited by: §2, Table 2.
  • [11] D. Drover, R. MV, C. Chen, A. Agrawal, A. Tyagi, and C. Phuoc Huynh (2018-09) Can 3d pose be learned from 2d projections alone?. In The European Conference on Computer Vision (ECCV) Workshops, Cited by: §2, Table 5, Table 9.
  • [12] M. A. Fischler and R. C. Bolles (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395. Cited by: §1.
  • [13] H. Fish Tung, A. W. Harley, W. Seto, and K. Fragkiadaki (2017-10) Adversarial inverse graphics networks: learning 2d-to-3d lifting and image-to-image translation from unpaired supervision. In The IEEE International Conference on Computer Vision (ICCV), Cited by: Table 5, Table 9.
  • [14] K. Fragkiadaki, M. Salas, P. Arbelaez, and J. Malik (2014) Grouping-based low-rank trajectory completion and 3d reconstruction. In Advances in Neural Information Processing Systems, pp. 55–63. Cited by: §2, Table 3, Table 4.
  • [15] P. F. Gotardo and A. M. Martinez (2011) Kernel non-rigid structure from motion. In 2011 International Conference on Computer Vision, pp. 802–809. Cited by: §2, §5.
  • [16] O. C. Hamsici, P. F. Gotardo, and A. M. Martinez (2012) Learning spatially-smooth mappings in non-rigid structure from motion. In European Conference on Computer Vision, pp. 260–273. Cited by: §2.
  • [17] R. Hartley and R. Vidal (2008) Perspective nonrigid shape and motion recovery. In European Conference on Computer Vision, pp. 276–289. Cited by: §2.
  • [18] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin (2005) The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer 27 (2), pp. 83–85. Cited by: §3.
  • [19] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu (2013) Human3.6M: large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE transactions on pattern analysis and machine intelligence 36 (7), pp. 1325–1339. Cited by: §5.
  • [20] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik (2018) Learning category-specific mesh reconstruction from image collections. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 371–386. Cited by: Table 4.
  • [21] C. Kong and S. Lucey (2016) Prior-less compressible structure from motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4123–4131. Cited by: §2, Table 2.
  • [22] C. Kong and S. Lucey (2019-10) Deep non-rigid structure from motion. In The IEEE International Conference on Computer Vision (ICCV), Cited by: Deep NRSfM++: Towards 3D Reconstruction in the Wild, §1, §2, §3, §3, §4.1, §4.2, Table 2, Table 4, §5, §5.
  • [23] C. Kong, R. Zhu, H. Kiani, and S. Lucey (2016) Structure from category: a generic and prior-less approach. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 296–304. Cited by: §2, §2.
  • [24] Y. Kudo, K. Ogaki, Y. Matsui, and Y. Odagiri (2018) Unsupervised adversarial learning of 3d human pose from 2d joint locations. arXiv preprint arXiv:1803.08244. Cited by: §2, Table 5, Table 9.
  • [25] S. Kumar, Y. Dai, and H. Li (2017) Monocular dense 3d reconstruction of a complex dynamic scene from two perspective frames. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4649–4657. Cited by: §2.
  • [26] C. Lassner, J. Romero, M. Kiefel, F. Bogo, M. J. Black, and P. V. Gehler (2017-07) Unite the people: closing the loop between 3d and 2d human representations. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Cited by: §5.
  • [27] M. Lee, J. Cho, and S. Oh (2016) Consensus of non-rigid reconstructions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4670–4678. Cited by: §2, Table 2, Table 6, Apollo 3D Car dataset..
  • [28] V. Lepetit, F. Moreno-Noguer, and P. Fua (2009) EPnP: an accurate O(n) solution to the PnP problem. International journal of computer vision 81 (2), pp. 155. Cited by: §1.
  • [29] J. Martinez, R. Hossain, J. Romero, and J. J. Little (2017) A simple yet effective baseline for 3d human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2640–2649. Cited by: Table 5, Table 9.
  • [30] D. Novotny, N. Ravi, B. Graham, N. Neverova, and A. Vedaldi (2019-10) C3DPO: canonical 3d pose networks for non-rigid structure from motion. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2, §4.2, Table 3, Table 4, Table 5, §5, §5, III: Additional discussion, Table 9.
  • [31] M. Paladini, A. Del Bue, J. Xavier, L. Agapito, M. Stošić, and M. Dodig (2012) Optimal metric projections for deformable and articulated structure-from-motion. International journal of computer vision 96 (2), pp. 252–276. Cited by: §2.
  • [32] V. Papyan, Y. Romano, and M. Elad (2017) Convolutional neural networks analyzed via convolutional sparse coding. The Journal of Machine Learning Research 18 (1), pp. 2887–2938. Cited by: §3, §4.1.
  • [33] D. Pavllo, C. Feichtenhofer, D. Grangier, and M. Auli (2019) 3D human pose estimation in video with temporal convolutions and semi-supervised training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7753–7762. Cited by: Table 9.
  • [34] C. J. Rozell, D. H. Johnson, R. G. Baraniuk, and B. A. Olshausen (2008) Sparse coding via thresholding and local competition in neural circuits. Neural computation 20 (10), pp. 2526–2563. Cited by: §3.
  • [35] X. Song, P. Wang, D. Zhou, R. Zhu, C. Guan, Y. Dai, H. Su, H. Li, and R. Yang (2019) Apollocar3d: a large 3d car instance understanding benchmark for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5452–5462. Cited by: §1, §5.
  • [36] P. Sturm and B. Triggs (1996) A factorization based algorithm for multi-image projective structure and motion. In European conference on computer vision, pp. 709–720. Cited by: §2, §2.
  • [37] K. Sun, B. Xiao, D. Liu, and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. arXiv preprint arXiv:1902.09212. Cited by: §5.
  • [38] D. Tome, C. Russell, and L. Agapito (2017-07) Lifting from the deep: convolutional 3d pose estimation from a single image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 9.
  • [39] L. Torresani, A. Hertzmann, and C. Bregler (2008) Nonrigid structure-from-motion: estimating shape and motion with hierarchical priors. IEEE transactions on pattern analysis and machine intelligence 30 (5), pp. 878–892. Cited by: Table 3, Table 4.
  • [40] B. Wandt and B. Rosenhahn (2019) RepNet: weakly supervised training of an adversarial reprojection network for 3d human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7782–7791. Cited by: §2, Table 5, Table 9.
  • [41] C. Wang, C. Kong, and S. Lucey (2019-10) Distill knowledge from nrsfm for weakly supervised 3d pose learning. In The IEEE International Conference on Computer Vision (ICCV), Cited by: Table 9.
  • [42] G. Wang, H. Tsui, and Z. Hu (2007) Structure and motion of nonrigid object under perspective projection. Pattern recognition letters 28 (4), pp. 507–515. Cited by: §2.
  • [43] G. Wang, H. Tsui, and Q. M. J. Wu (2008) Rotation constrained power factorization for structure from motion of nonrigid objects. Pattern Recognition Letters 29, pp. 72–80. Cited by: §2.
  • [44] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman (2016) Single image 3d interpreter network. In European Conference on Computer Vision, pp. 365–382. Cited by: Table 5, Table 9.
  • [45] Y. Xiang, R. Mottaghi, and S. Savarese (2014) Beyond pascal: a benchmark for 3d object detection in the wild. In IEEE Winter Conference on Applications of Computer Vision, pp. 75–82. Cited by: §1, §5.
  • [46] J. Xiao and T. Kanade (2005) Uncalibrated perspective reconstruction of deformable structures. In Tenth IEEE International Conference on Computer Vision (ICCV’05) Volume 1, Vol. 2, pp. 1075–1082. Cited by: §2.
  • [47] L. Zhao, X. Peng, Y. Tian, M. Kapadia, and D. N. Metaxas (2019-06) Semantic graph convolutional networks for 3d human pose regression. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 5, Table 9.
  • [48] X. Zhou, M. Zhu, S. Leonardos, K. G. Derpanis, and K. Daniilidis (2016) Sparseness meets deepness: 3d human pose estimation from monocular video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4966–4975. Cited by: §2.
  • [49] Y. Zhu, D. Huang, F. De La Torre, and S. Lucey (2014) Complex non-rigid motion 3d reconstruction by union of subspaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1542–1549. Cited by: §2.