1 Introduction
Nonrigid structure-from-motion (NRSfM) aims at simultaneously recovering the camera motion and nonrigid structure from 2D images captured by a monocular camera. It is central to many computer vision applications (3D reconstruction, motion capture, human-computer interaction, etc.) and has received considerable attention in recent years. A great number of methods have been established, and most of the existing methods can be roughly categorized as sparse methods and dense methods [1][2][3][4][5][6][7]. NRSfM is inherently underdetermined (a 3D point must be estimated from a single 2D measurement); therefore, extra regularization is needed to constrain the problem. For sparse NRSfM, various priors/constraints have been enforced, such as shape basis [1], trajectory basis [8], shape-trajectory basis [9][7] and smoothness [10]. In sparse reconstruction, the feature points are geometrically far apart from each other, thus spatial regularization cannot be enforced.
By contrast, dense NRSfM aims at achieving 3D nonrigid reconstruction for each pixel in a video sequence, where spatial constraints have been widely exploited to regularize the problem [11][12][13]. Garg et al. [14] proposed to enforce both a total variation constraint and a nuclear-norm-induced low-rank constraint on the 3D nonrigid shape. This results in a complex convex optimization, and a GPU is needed to speed up the implementation. Furthermore, they only evaluated the method on complete and noise-free datasets, thus its robustness remains questionable. In Russell et al.'s work [13], segmentation is performed on both object level and part level, then piecewise reconstruction is applied by assuming locally rigid pieces. In [15], motion segmentation is paired with rank-constrained 2D track completion to deal with occlusions, then nuclear norm minimization is used to recover the 3D shape. Yu et al. [16] proposed to utilize the temporal smoothness in both camera motion and 3D deformation, where a template 3D shape is available. Ranftl et al. [17] investigated the relative scale in dynamic scenes. All these constraints are based on motion and semantic segmentation, and are thus computationally complex.
In this paper, we look for a simple and elegant convex optimization for dense NRSfM that can be efficiently implemented on a CPU. We argue that the inherent spatial and temporal smoothness constraints can be well exploited to regularize the dense nonrigid reconstruction problem. First, we revisit the temporal smoothness used in sparse reconstruction and demonstrate that it can be employed in the dense case directly. Second, to exploit the spatial smoothness in dense reconstruction, we resort to the Laplacian of the 3D nonrigid shape, which captures the local smoothness and is mathematically simple. Finally, to handle inevitable noise and outliers in real-world image measurements, we robustify the data term by using the $\ell_1$ norm rather than the commonly used $\ell_2$ norm. In this way, our method robustly exploits the spatial-temporal smoothness in dense nonrigid reconstruction. Our method is very easy to implement, as it only involves solving a series of least squares problems. In Fig. 1, we demonstrate the contribution of each component. With the introduction of temporal smoothness, spatial smoothness and the robust cost function, the dense 3D nonrigid reconstruction is gradually improved.
2 Prerequisites
Dense NRSfM takes a 2D video obtained by a monocular camera as input, with $F$ image frames each containing $P$ pixels. In this paper, we assume the per-pixel feature tracks have been extracted, say by optical flow or dense matching. Thus, the input to our system is a feature track matrix $\mathbf{W} \in \mathbb{R}^{2F \times P}$, which stacks the $2 \times 1$ vectors $\mathbf{w}_{ij} = [x_{ij}, y_{ij}]^{\top}$, where $\mathbf{w}_{ij}$ denotes the $j$th feature point captured in the $i$th image frame. Assuming an orthographic camera model and that the camera has been centered at the center of the object, we have $\mathbf{W}_i = \mathbf{R}_i \mathbf{S}_i$, where $\mathbf{R}_i$ is a $2 \times 3$ matrix that represents the first two rows of the rotation matrix of the $i$th frame, and $\mathbf{S}_i$ is a $3 \times P$ matrix containing the 3D positions of every point in the $i$th nonrigid shape. Stacking all the feature tracks for all frames gives:
$$\mathbf{W} = \mathbf{R}\mathbf{S}, \tag{1}$$
where $\mathbf{R} = \mathrm{blkdiag}(\mathbf{R}_1, \dots, \mathbf{R}_F)$ and $\mathbf{S} = [\mathbf{S}_1^{\top}, \dots, \mathbf{S}_F^{\top}]^{\top}$ have dimensions $2F \times 3F$ and $3F \times P$, respectively. Dense NRSfM aims at simultaneously recovering both the camera motion $\mathbf{R}$ and the nonrigid shape $\mathbf{S}$ from the feature track matrix $\mathbf{W}$. The problem is inherently underdetermined, as the number of variables to estimate ($3FP$) exceeds the number of measurements ($2FP$). Therefore, extra constraints are needed to regularize the problem.
Under dense NRSfM, we solve for the camera rotation by utilizing the low-rank structure of $\mathbf{S}$. Even though we have to deal with tens of thousands of points, the rotation estimation method in [10] can still handle it, as its computational complexity is independent of the number of points and only depends on the model complexity $K$.
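The stacked orthographic model can be sketched in a few lines of NumPy. This is a minimal illustration under assumed notation (block-diagonal $\mathbf{R}$ of size $2F \times 3F$, stacked $\mathbf{S}$ of size $3F \times P$); the helper names `random_rotation` and `project_orthographic` are hypothetical and the data are synthetic.

```python
import numpy as np

def random_rotation(rng):
    """Return a random 3x3 rotation matrix (QR of a Gaussian matrix, det fixed to +1)."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))      # fix the column signs of the factorization
    if np.linalg.det(q) < 0:         # flip one axis so that det(q) = +1
        q[:, 0] = -q[:, 0]
    return q

def project_orthographic(rotations, shapes):
    """Stack W = R S for F frames.

    rotations: (F, 3, 3) full rotations; only the first two rows are used.
    shapes:    (F, 3, P) per-frame 3D shapes, assumed centred at the origin.
    Returns W (2F x P), block-diagonal R (2F x 3F) and stacked S (3F x P).
    """
    F, _, P = shapes.shape
    R = np.zeros((2 * F, 3 * F))
    for i in range(F):
        R[2 * i:2 * i + 2, 3 * i:3 * i + 3] = rotations[i, :2, :]
    S = shapes.reshape(3 * F, P)
    return R @ S, R, S

rng = np.random.default_rng(0)
F, P = 5, 40
shapes = rng.standard_normal((F, 3, P))
shapes -= shapes.mean(axis=2, keepdims=True)      # centre each frame
rotations = np.stack([random_rotation(rng) for _ in range(F)])
W, R, S = project_orthographic(rotations, shapes)
print(W.shape, R.shape, S.shape)                  # (10, 40) (10, 15) (15, 40)
print("unknowns in S:", S.size, "measurements in W:", W.size)
```

Even with the rotations known, the shape alone has more unknowns than there are measurements, which is why the regularizers of the following sections are needed.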
3 Formulation and Solution
In this paper, we propose to exploit the generic and generally available smoothness in the temporal and spatial directions. By jointly enforcing the spatial and temporal constraints, we are able to achieve dense nonrigid reconstruction in an easy and elegant way.
3.1 Temporal Smoothness Revisited
First, we revisit the temporal smoothness, which has been widely used in sparse NRSfM [10]. We argue that this simple strategy can achieve performance comparable to complex convex optimization or ADMM-based methods.
By introducing smooth deformation regularization [18][19][20], we can formulate the nonrigid shape recovery problem as minimizing a data term evaluated on the image measurements and a regularization term based on temporal smoothness, thus reaching the following optimization:
$$\min_{\mathbf{S}} \; \|\mathbf{W} - \mathbf{R}\mathbf{S}\|_F^2 + \lambda \|\mathbf{H}\mathbf{S}\|_F^2, \tag{2}$$
where the first term measures the reprojection error evaluated on the image plane while the second term measures the temporal smoothness. We can apply different smoothness operators $\mathbf{H}$ to characterize various kinds of smoothness in the temporal direction, e.g. first-order smoothness as in Eq. (3), second-order smoothness, etc.
$$\mathbf{H} = \begin{bmatrix} \mathbf{I} & -\mathbf{I} & & \\ & \ddots & \ddots & \\ & & \mathbf{I} & -\mathbf{I} \end{bmatrix} \in \mathbb{R}^{3(F-1) \times 3F}, \tag{3}$$
where $\mathbf{I}$ is the $3 \times 3$ identity matrix.
The resultant optimization problem in Eq. (2) admits an analytical (closed-form) solution,
$$\mathbf{S} = (\mathbf{R}^{\top}\mathbf{R} + \lambda \mathbf{H}^{\top}\mathbf{H})^{-1}\mathbf{R}^{\top}\mathbf{W}. \tag{4}$$
The rotation matrix $\mathbf{R}$ has full row rank, thus $\mathbf{R}^{\top}\mathbf{R}$ is generally of rank $2F$ and rank deficient. The smoothness matrix is rank deficient too: $\mathbf{H}^{\top}\mathbf{H}$ is of rank $3F - 3$ (for first-order smoothness). In the general case, $\mathbf{R}^{\top}\mathbf{R} + \lambda \mathbf{H}^{\top}\mathbf{H}$ is a full-rank matrix, thus invertible.
The 3D nonrigid shape generated by this solution depends on the choice of the trade-off parameter $\lambda$, which balances the 2D reprojection error against the temporal smoothness. When $\lambda$ approaches 0, the solution approaches $\mathbf{R}^{\dagger}\mathbf{W}$, i.e. the pseudoinverse solution. When $\lambda$ is large enough, the solution approaches a rigid shape, which minimizes the combination of $\|\mathbf{W} - \mathbf{R}\mathbf{S}\|_F^2$ and $\|\mathbf{H}\mathbf{S}\|_F^2$. When $\lambda$ approaches $\infty$, the solution approaches a trivial solution [10].
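The closed-form temporal-smoothness solve can be sketched as follows. This is an illustrative implementation under assumed conventions: $\mathbf{H}$ is taken to be the first-order difference operator acting on stacked $3 \times P$ frames, the objective is the sum of squared Frobenius norms weighted by `lam`, and the helper names and synthetic data are hypothetical.

```python
import numpy as np

def first_order_difference(F):
    """3(F-1) x 3F matrix H with (H S)_i = S_i - S_{i+1} for stacked 3xP frames."""
    D = np.eye(F - 1, F) - np.eye(F - 1, F, k=1)   # (F-1) x F forward difference
    return np.kron(D, np.eye(3))

def solve_temporal(W, R, lam):
    """Closed-form minimiser of ||W - R S||_F^2 + lam * ||H S||_F^2."""
    F = R.shape[0] // 2
    H = first_order_difference(F)
    A = R.T @ R + lam * (H.T @ H)                  # 3F x 3F, independent of P
    return np.linalg.solve(A, R.T @ W)

# Synthetic example: a slowly scaling shape seen by random orthographic cameras.
rng = np.random.default_rng(0)
F, P = 8, 30
R = np.zeros((2 * F, 3 * F))
for i in range(F):
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    R[2 * i:2 * i + 2, 3 * i:3 * i + 3] = q[:2]    # two orthonormal rows per frame
base = rng.standard_normal((3, P))
S_true = np.vstack([base * (1 + 0.02 * i) for i in range(F)])
W = R @ S_true

H = first_order_difference(F)
for lam in (1e-6, 1.0, 1e6):
    S = solve_temporal(W, R, lam)
    print(f"lam={lam:g}  reprojection={np.linalg.norm(W - R @ S):.3e}  "
          f"temporal={np.linalg.norm(H @ S):.3e}")
```

The linear system is only $3F \times 3F$, so the cost of the factorization does not grow with the number of points; the point count only enters through the right-hand side.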
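Connection: The smoothness-constrained solution $\mathbf{S}$ and the pseudoinverse solution $\mathbf{S}^{\dagger} = \mathbf{R}^{\dagger}\mathbf{W}$ are connected as:
$$(\mathbf{R}^{\top}\mathbf{R} + \lambda \mathbf{H}^{\top}\mathbf{H})\,\mathbf{S} = \mathbf{R}^{\top}\mathbf{R}\,\mathbf{S}^{\dagger}. \tag{5}$$
Therefore $\mathbf{S} = (\mathbf{R}^{\top}\mathbf{R} + \lambda \mathbf{H}^{\top}\mathbf{H})^{-1}\mathbf{R}^{\top}\mathbf{R}\,\mathbf{S}^{\dagger}$. As proved in [10], the pseudoinverse solution $\mathbf{S}^{\dagger}$ is a degenerate case where the nonrigid shape at each frame lies on a plane. $\mathbf{S}$ can be viewed as a per-frame weighted version of $\mathbf{S}^{\dagger}$.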
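The link between the two solutions rests on a single identity. A short derivation, assuming that $\mathbf{R}$ has full row rank (so that $\mathbf{R}^{\dagger} = \mathbf{R}^{\top}(\mathbf{R}\mathbf{R}^{\top})^{-1}$), that the smoothness-constrained solution satisfies the normal equations of the regularized least squares problem with smoothness operator $\mathbf{H}$ and weight $\lambda$, and writing $\mathbf{S}^{\dagger} = \mathbf{R}^{\dagger}\mathbf{W}$ (notation assumed here):

```latex
\begin{aligned}
\mathbf{R}^{\top}\mathbf{R}\,\mathbf{R}^{\dagger}
  &= \mathbf{R}^{\top}\,(\mathbf{R}\mathbf{R}^{\top})(\mathbf{R}\mathbf{R}^{\top})^{-1}
   = \mathbf{R}^{\top}, \\
(\mathbf{R}^{\top}\mathbf{R} + \lambda \mathbf{H}^{\top}\mathbf{H})\,\mathbf{S}
  &= \mathbf{R}^{\top}\mathbf{W}
   = \mathbf{R}^{\top}\mathbf{R}\,\mathbf{R}^{\dagger}\mathbf{W}
   = \mathbf{R}^{\top}\mathbf{R}\,\mathbf{S}^{\dagger}.
\end{aligned}
```

The left-multiplying matrix is $3F \times 3F$, so it mixes rows of $\mathbf{S}^{\dagger}$ across frames while treating every point (column) identically.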
3.2 Spatial Smoothness Simplified
The temporal smoothness constrains the dense nonrigid reconstruction along the temporal dimension, i.e., it enforces the smoothness of each 3D trajectory. However, it cannot regularize the 3D shape within each frame. Garg et al. [14] proposed to use total variation to encourage spatial smoothness while maintaining sharp boundaries. The complexity of the resultant optimization prohibits its real-world application to large-scale 3D reconstruction.
To efficiently and effectively utilize the smoothness along the spatial dimension, we propose a simple filtering mechanism, namely the Laplacian filter, which enforces spatial smoothness locally in the 3D shape space. In Fig. 2, we illustrate different 2D filters for enforcing spatial smoothness. The Laplacian filter enforces a locally linear/planar model, which provides an easy way to encourage second-order smoothness. As all linear filtering can be equivalently expressed as matrix multiplication, for the recovered nonrigid shape $\mathbf{S}$, the filtering output is defined as:
$$\hat{\mathbf{s}} = \mathbf{D}\,\mathrm{vec}(\mathbf{S}), \tag{6}$$
where $\mathbf{D}$ is a matrix containing all the filtering operations; each row of $\mathbf{D}$ defines a spatial filter evaluated at the position of one entry of $\mathrm{vec}(\mathbf{S})$.
Spatial smoothness is effective in smoothing 3D reconstruction. However, the spatial smoothness itself is not sufficient to recover the correct shape. Without temporal constraint, the result will be close to the pseudoinverse case, which lies in a plane. By putting spatial smoothness and temporal smoothness together, we are able to achieve reliable 3D reconstruction even from noisy 2D inputs.
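A Laplacian filter of this kind can be written as a matrix acting on the point dimension. The sketch below is an assumption-laden illustration: it uses a regular 4-connected grid as a stand-in for the mesh neighbourhoods, a $P \times P$ matrix `L` applied to each coordinate row as `S @ L.T`, and a hypothetical helper `grid_laplacian`. It shows that a planar shape produces zero response at interior points while a local spike does not.

```python
import numpy as np

def grid_laplacian(rows, cols):
    """P x P Laplacian filter for points laid out on a rows x cols grid.

    Row j holds +1 at point j and -1/|N(j)| at each of its 4-connected
    neighbours, so (L @ x)[j] is x[j] minus the mean of its neighbours.
    """
    P = rows * cols
    L = np.zeros((P, P))
    for r in range(rows):
        for c in range(cols):
            j = r * cols + c
            nbrs = [(r + dr, c + dc)
                    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= r + dr < rows and 0 <= c + dc < cols]
            L[j, j] = 1.0
            for rr, cc in nbrs:
                L[j, rr * cols + cc] = -1.0 / len(nbrs)
    return L

rows, cols = 6, 7
L = grid_laplacian(rows, cols)
yy, xx = np.mgrid[0:rows, 0:cols]
x, y = xx.ravel().astype(float), yy.ravel().astype(float)
S_plane = np.vstack([x, y, 0.5 * x - 0.2 * y + 1.0])   # 3 x P planar shape
S_bump = S_plane.copy()
S_bump[2, 3 * cols + 3] += 1.0                          # one spike in depth

interior = ((yy > 0) & (yy < rows - 1) & (xx > 0) & (xx < cols - 1)).ravel()
resp_plane = S_plane @ L.T                              # filter every coordinate row
resp_bump = S_bump @ L.T
print("plane, max interior response:", np.abs(resp_plane[:, interior]).max())
print("bump,  max interior response:", np.abs(resp_bump[:, interior]).max())
```

Penalizing the squared filter response therefore leaves locally planar patches untouched and only discourages deviations from the neighbourhood mean.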
3.3 Optimization Robustified
Noise and outliers are inevitable in real-world measurements, and dense NRSfM methods must handle them robustly. Most of the existing methods apply the $\ell_2$ norm on the data term, and thus cannot handle noise and outliers well. We propose to replace the $\ell_2$ norm with the $\ell_1$ norm, thus increasing the robustness of the data term $\|\mathbf{W} - \mathbf{R}\mathbf{S}\|_1$.
To deal with the convex $\ell_1$ norm efficiently, we propose to use iteratively reweighted least squares (IRLS), where we solve a least squares problem in each iteration. Figure 1 illustrates the performance of the $\ell_1$ norm on data with outliers. It shows that our $\ell_1$-norm relaxation performs better on data with outliers.
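The IRLS idea can be illustrated on a toy robust regression problem. The sketch below is generic and not specific to NRSfM: it assumes the common weight choice $w_i = 1/\max(|r_i|, \epsilon)$, and the function name `irls_l1` and the line-fitting data are hypothetical.

```python
import numpy as np

def irls_l1(A, b, n_iter=50, eps=1e-6):
    """Approximately minimise ||A x - b||_1 by iteratively reweighted least squares."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]        # l2 initialisation
    for _ in range(n_iter):
        r = A @ x - b
        w = 1.0 / np.maximum(np.abs(r), eps)         # |r| equals w * r^2 at the current x
        Aw = A * w[:, None]                          # rows of A scaled by the weights
        x = np.linalg.solve(A.T @ Aw, Aw.T @ b)      # weighted normal equations
    return x

# Fit a line y = 2 t + 1 when 10% of the samples are gross outliers.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
A = np.column_stack([t, np.ones_like(t)])
x_true = np.array([2.0, 1.0])
b = A @ x_true + 0.01 * rng.standard_normal(t.size)
bad = rng.choice(t.size, size=20, replace=False)
b[bad] += 10.0

x_l2 = np.linalg.lstsq(A, b, rcond=None)[0]
x_l1 = irls_l1(A, b)
print("l2 parameter error:", np.linalg.norm(x_l2 - x_true))
print("l1 parameter error:", np.linalg.norm(x_l1 - x_true))
```

Each iteration is an ordinary weighted least squares solve; samples with large residuals receive small weights and so lose influence on the next estimate.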
3.4 Spatial-Temporal Smoothness Constraint
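By enforcing the spatial-temporal smoothness constraint and applying the robust $\ell_1$ norm on the data term, we reach:
$$\min_{\mathbf{S}} \; \|\mathbf{W} - \mathbf{R}\mathbf{S}\|_1 + \lambda_1 \|\mathbf{H}\mathbf{S}\|_F^2 + \lambda_2 \|\mathbf{D}\,\mathrm{vec}(\mathbf{S})\|_2^2, \tag{7}$$
where $\lambda_1$ and $\lambda_2$ are the trade-off parameters. The three terms are the “data term”, “temporal smoothness term” and “spatial smoothness term”, respectively. Under the IRLS formulation, we solve the following least squares problem in each iteration:
$$\min_{\mathbf{S}} \; \|\mathbf{\Omega} \odot (\mathbf{W} - \mathbf{R}\mathbf{S})\|_F^2 + \lambda_1 \|\mathbf{H}\mathbf{S}\|_F^2 + \lambda_2 \|\mathbf{D}\,\mathrm{vec}(\mathbf{S})\|_2^2, \tag{8}$$
where $\mathbf{\Omega}$ collects the IRLS weights computed from the residuals of the previous iteration and $\odot$ denotes the element-wise product.
A closed-form solution can be derived by using the first-order condition. However, the computational complexity is high due to the filtering matrix $\mathbf{D}$. Instead, we propose to solve the least squares problem with gradient descent, where the gradient is derived as:
$$\nabla_{\mathbf{S}} = -2\,\mathbf{R}^{\top}\big(\mathbf{\Omega} \odot \mathbf{\Omega} \odot (\mathbf{W} - \mathbf{R}\mathbf{S})\big) + 2\lambda_1 \mathbf{H}^{\top}\mathbf{H}\mathbf{S} + 2\lambda_2\,\mathrm{vec}^{-1}\big(\mathbf{D}^{\top}\mathbf{D}\,\mathrm{vec}(\mathbf{S})\big). \tag{9}$$
$\mathrm{vec}^{-1}(\cdot)$ denotes the inverse operator of vectorization, which transforms a vector into a matrix of proper dimension.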
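One inner least squares solve by gradient descent might look like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the spatial filter is taken to be a $P \times P$ matrix `L` applied as `S @ L.T` (equivalent to a large filtering matrix $\mathbf{L} \otimes \mathbf{I}$ acting on the column-major vectorization of the shape), `wts` holds fixed element-wise weights standing in for the IRLS weights, the step size comes from a simple Lipschitz bound, and all parameter values are arbitrary.

```python
import numpy as np

def objective(S, W, R, H, L, wts, lam1, lam2):
    """Weighted data term + temporal smoothness term + spatial smoothness term."""
    E = W - R @ S
    return (np.sum(wts * E**2) + lam1 * np.sum((H @ S)**2)
            + lam2 * np.sum((S @ L.T)**2))

def gradient(S, W, R, H, L, wts, lam1, lam2):
    """Gradient of `objective` with respect to S (same shape as S)."""
    E = W - R @ S
    return (-2.0 * R.T @ (wts * E) + 2.0 * lam1 * H.T @ (H @ S)
            + 2.0 * lam2 * (S @ L.T) @ L)

def descend(S0, W, R, H, L, wts, lam1, lam2, n_iter=200):
    """Plain gradient descent with a fixed step 1/lip from a Lipschitz bound."""
    lip = 2.0 * (wts.max() * np.linalg.norm(R, 2)**2
                 + lam1 * np.linalg.norm(H, 2)**2
                 + lam2 * np.linalg.norm(L, 2)**2)
    S = S0.copy()
    for _ in range(n_iter):
        S = S - gradient(S, W, R, H, L, wts, lam1, lam2) / lip
    return S

rng = np.random.default_rng(0)
F, P = 4, 12
R = np.zeros((2 * F, 3 * F))
for i in range(F):
    q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    R[2 * i:2 * i + 2, 3 * i:3 * i + 3] = q[:2]           # two orthonormal rows
H = np.kron(np.eye(F - 1, F) - np.eye(F - 1, F, k=1), np.eye(3))
L = 2.0 * np.eye(P) - np.eye(P, k=1) - np.eye(P, k=-1)    # chain Laplacian stand-in
W = rng.standard_normal((2 * F, P))
wts = rng.uniform(0.5, 1.5, size=W.shape)                 # stand-in IRLS weights
lam1, lam2 = 1.0, 0.1                                     # illustrative values only

S0 = np.zeros((3 * F, P))
S1 = descend(S0, W, R, H, L, wts, lam1, lam2)
print("objective before:", objective(S0, W, R, H, L, wts, lam1, lam2))
print("objective after: ", objective(S1, W, R, H, L, wts, lam1, lam2))
```

Every gradient evaluation needs only matrix products with `R`, `H` and `L`, so no matrix of size $3FP \times 3FP$ is ever formed. In practice `L` would be sparse.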
4 Experimental Results
Setting up: To evaluate our method against existing state-of-the-art dense NRSfM methods, we used the 4 dense synthetic sequences and 3 real videos from [14]. Each sequence contains a 2D correspondence matrix and a quad mesh for neighborhood assignment. These sequences have over 20,000 trajectories forming dense surfaces, which makes the problem much more challenging than the sparse scenarios.
We first enforced the temporal smoothness constraint to obtain an initial 3D nonrigid reconstruction. Then our method runs iteratively to optimize the cost function with spatial-temporal constraints. The trade-off parameters are set as , .
On the synthetic face sequences, the results of our method are shown in Fig. 3. We overlay the ground truth shape in red and our 3D reconstruction in blue. These figures show that our method can reconstruct the 3D object quite accurately. Table 1 shows the quantitative evaluation of our method along with various other methods, including Trajectory Basis (PTA) [8], Metric Projection (MP) [4] and the Dense Variational method (DV) [14]. As shown in the table, our method achieves performance competitive with the state-of-the-art methods. It is worth noting that our method is easy to implement, as it only involves a series of least squares problems.
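Table 1. Quantitative evaluation on the four synthetic sequences.
Dataset | PTA [8] | MP [4] | DV [14] | Ours
--------|---------|--------|---------|-------
Seq1    | 0.2431  | 0.2575 | 0.0531  | 0.0636
Seq2    | 0.0988  | 0.0644 | 0.0457  | 0.0569
Seq3    | 0.0596  | 0.0682 | 0.0346  | 0.0374
Seq4    | 0.0877  | 0.0772 | 0.0379  | 0.0428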
For dense sequences obtained from real videos, the input 2D video tracks and results obtained by our method are shown in Fig. 4. As shown in the figures, our method outputs reasonable results on the Face and Back sequences, while on the challenging Heart sequence that has both large deformations and small rotation, our result seems to be too flat. This emphasizes the importance of a correct rotation matrix.
Dense input with noise:
As stated in previous sections, assuming a smooth 3D surface, our spatial smoothness constraint encourages local smoothness, hence increasing the accuracy and resolution. To evaluate the performance of our method under noise, we added Gaussian noise to the 2D input images, with the standard deviation determined by a noise ratio ranging from 0.01 to 0.05. Each noise setting is repeated 5 times to obtain statistical results. Figure 5 shows the performance of our method under different noise ratios on the 4 synthetic sequences. It shows that even at large noise ratios, the 3D error of our method remains low.
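A noise-injection protocol of this kind can be sketched as below. The exact scaling of the standard deviation is not recoverable here, so the sketch assumes `sigma = ratio * max|W|`; that choice, the function name `add_gaussian_noise` and the stand-in track matrix are all hypothetical.

```python
import numpy as np

def add_gaussian_noise(W, ratio, rng):
    """Return W plus zero-mean Gaussian noise with std = ratio * max|W| (assumed scaling)."""
    sigma = ratio * np.abs(W).max()
    return W + sigma * rng.standard_normal(W.shape), sigma

rng = np.random.default_rng(0)
W = rng.uniform(-1.0, 1.0, size=(20, 500))        # stand-in for a 2F x P track matrix
for ratio in (0.01, 0.02, 0.03, 0.04, 0.05):
    # Five independent repetitions per noise level, as in the protocol above.
    trials = [add_gaussian_noise(W, ratio, rng)[0] for _ in range(5)]
    emp = np.mean([np.std(Wn - W) for Wn in trials])
    print(f"ratio={ratio:.2f}  empirical noise std={emp:.4f}")
```

Averaging the reconstruction error over the repetitions of each level then gives one point of the error-versus-noise curve.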
Dense input with outliers: To evaluate the capability of our method in dealing with outliers, we performed experiments with the following settings: a certain fraction of the points in the video ($F \times P$ points in total, for $F$ frames of $P$ points each) are set at random positions. The outlier ratio is set to 2%, 4%, 6%, 8% and 10%, respectively. We compute the final 3D error by averaging over 5 trials in order to obtain a statistically reliable result.
Figure 5 illustrates the performance of our method under different outlier ratios. As the outlier ratio increases, the 3D error increases slightly, staying under 0.1 for all synthetic sequences. The error curves are almost linear, which demonstrates the robustness of our method.
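The outlier setting can be mimicked with a short helper. This is a sketch under assumptions: outlier positions are drawn uniformly inside the bounding box of the tracks (the sampling range is not specified in the text), the track matrix stores the x and y coordinates of frame `i` in rows `2i` and `2i+1`, and `add_outliers` is a hypothetical name.

```python
import numpy as np

def add_outliers(W, ratio, rng):
    """Replace a fraction `ratio` of the F*P tracked 2D points by random positions.

    W is 2F x P with rows (2i, 2i+1) holding the x and y coordinates of frame i.
    Random positions are drawn uniformly inside the bounding box of W (an assumption).
    """
    F, P = W.shape[0] // 2, W.shape[1]
    n_out = int(round(ratio * F * P))
    idx = rng.choice(F * P, size=n_out, replace=False)   # distinct (frame, point) pairs
    frames, points = np.divmod(idx, P)
    Wo = W.copy()
    lo_x, hi_x = W[0::2].min(), W[0::2].max()
    lo_y, hi_y = W[1::2].min(), W[1::2].max()
    Wo[2 * frames, points] = rng.uniform(lo_x, hi_x, size=n_out)
    Wo[2 * frames + 1, points] = rng.uniform(lo_y, hi_y, size=n_out)
    return Wo, frames, points

rng = np.random.default_rng(0)
W = rng.uniform(-1.0, 1.0, size=(20, 500))               # stand-in 2F x P tracks, F = 10
for ratio in (0.02, 0.04, 0.06, 0.08, 0.10):
    Wo, frames, points = add_outliers(W, ratio, rng)
    changed = (Wo[0::2] != W[0::2]) | (Wo[1::2] != W[1::2])
    print(f"ratio={ratio:.2f}  corrupted points={changed.sum()}")
```

Both coordinates of a selected point are replaced together, so each outlier corrupts one 2D measurement rather than a single matrix entry.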
5 Conclusions
In this paper, we propose a unified framework for dense nonrigid 3D reconstruction, which utilizes both spatial and temporal smoothness to regularize the underconstrained problem. Furthermore, the cost function has been robustified to deal with real-world noise and outliers. Our method achieves performance competitive with state-of-the-art dense NRSfM methods. The implementation of our method only involves solving a series of least squares problems, thus making dense NRSfM easy.
References

[1] Christoph Bregler, Aaron Hertzmann, and Henning Biermann, “Recovering nonrigid 3D shape from image streams,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2000, pp. 690–696.
[2] Jing Xiao, Jinxiang Chai, and Takeo Kanade, “A closed-form solution to nonrigid shape and motion recovery,” in Proc. European Conf. Computer Vision, 2004, vol. 3024, pp. 573–587.
[3] Lorenzo Torresani, Aaron Hertzmann, and Chris Bregler, “Nonrigid structure-from-motion: Estimating shape and motion with hierarchical priors,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 5, pp. 878–892, 2008.
[4] Marco Paladini, Alessio Del Bue, João Xavier, Lourdes Agapito, Marko Stosic, and Marija Dodig, “Optimal metric projections for deformable and articulated structure-from-motion,” Int. J. Comput. Vision, vol. 96, no. 2, pp. 252–276, Jan. 2012.
[5] Yuchao Dai, Hongdong Li, and Mingyi He, “A simple prior-free method for nonrigid structure-from-motion factorization,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012, pp. 2018–2025.
[6] Minsik Lee, Jungchan Cho, Chong-Ho Choi, and Songhwai Oh, “Procrustean normal distribution for nonrigid structure from motion,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2013, pp. 1280–1287.
[7] Tomas Simon, Jack Valmadre, Iain Matthews, and Yaser Sheikh, “Separable spatiotemporal priors for convex reconstruction of time-varying 3D point clouds,” pp. 204–219, Springer International Publishing, Cham, 2014.
[8] I. Akhter, Y. Sheikh, S. Khan, and T. Kanade, “Trajectory space: A dual representation for nonrigid structure from motion,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 7, pp. 1442–1456, July 2011.
[9] P.F.U. Gotardo and A.M. Martinez, “Nonrigid structure from motion with complementary rank-3 spaces,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2011, pp. 3065–3072.
[10] Yuchao Dai, Hongdong Li, and Mingyi He, “A simple prior-free method for nonrigid structure-from-motion factorization,” International Journal of Computer Vision, vol. 107, no. 2, pp. 101–122, 2014.
[11] C. Russell, J. Fayad, and L. Agapito, “Dense nonrigid structure from motion,” in International Conference on 3D Imaging, Modeling, Processing, Visualization and Transmission (3DIMPVT), 2012, pp. 509–516.
[12] Ravi Garg, Anastasios Roussos, and Lourdes Agapito, “A variational approach to video registration with subspace constraints,” International Journal of Computer Vision, pp. 1–29, 2013.
[13] Chris Russell, Rui Yu, and Lourdes Agapito, “Video pop-up: Monocular 3D reconstruction of dynamic scenes,” in European Conference on Computer Vision. Springer, 2014, pp. 583–598.
[14] R. Garg, A. Roussos, and L. Agapito, “Dense variational reconstruction of nonrigid surfaces from monocular video,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2013, pp. 1272–1279.
[15] Katerina Fragkiadaki, Marta Salas, Pablo Arbelaez, and Jitendra Malik, “Grouping-based low-rank trajectory completion and 3D reconstruction,” in Advances in Neural Information Processing Systems 27, 2014, pp. 55–63.
[16] Rui Yu, Chris Russell, Neill D. F. Campbell, and Lourdes Agapito, “Direct, dense, and deformable: Template-based nonrigid 3D reconstruction from RGB video,” in The IEEE International Conference on Computer Vision (ICCV), December 2015.
[17] Rene Ranftl, Vibhav Vineet, Qifeng Chen, and Vladlen Koltun, “Dense monocular depth estimation in complex dynamic scenes,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[18] Henrik Aanæs and Fredrik Kahl, “Estimation of deformable structure and motion,” in ECCV Workshop on Vision and Modelling of Dynamic Scenes, 2002, pp. 1–4.
[19] S. I. Olsen and A. Bartoli, “Implicit nonrigid structure-from-motion with priors,” J. Math. Imaging Vis., vol. 31, no. 2–3, pp. 233–244, July 2008.
[20] J. Valmadre and S. Lucey, “General trajectory prior for nonrigid reconstruction,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012, pp. 1394–1401.