Non-rigid structure from motion (NRSfM) aims to recover camera motion and non-rigid structure simultaneously from 2D images captured by a monocular camera. It is central to many computer vision applications and has received considerable attention in recent years. This classical problem is highly under-constrained. Although existing NRSfM approaches have shown promising results, most of them assume that only one object in the scene undergoes non-rigid deformation. Real-world non-rigid scenes are much more complex: for example, multiple persons performing different activities, soccer players on a field, or salsa dancers. All of these examples constitute multi-body non-rigid deformation, which cannot be explained well under the single non-rigid object assumption. It is therefore natural to extend single-body NRSfM to multi-body NRSfM, where the task is to jointly reconstruct and segment multiple 3D deforming objects over time.
In solving multi-body NRSfM, a natural and direct two-stage approach is to first reconstruct the non-rigid multi-body structure with state-of-the-art non-rigid reconstruction methods and then segment the distinct objects with subspace clustering methods such as Sparse Subspace Clustering (SSC) or other clustering algorithms, or vice versa. However, such pipelines do not exploit the inherent structure of the problem, i.e., non-rigid motion segmentation provides critical information to constrain 3D reconstruction, while 3D non-rigid reconstruction can in turn constrain the corresponding motion segmentation problem. Furthermore, since the non-rigid shape deformation actually occurs in 3D space, it is more intuitive to segment objects in 3D space than in the projected 2D image space.
Additionally, it is generally more convenient, both computationally and numerically, to solve a task with a unified approach than to solve it sequentially. Therefore, in this paper, we propose a framework to simultaneously reconstruct and cluster multiple non-rigid shapes by exploiting the spatio-temporal correlation in the data. Such an approach explains the dynamics of non-rigid shapes in a more intuitive way. Explicitly, we represent multi-body NRSfM as a union of subspaces in both the 3D trajectory space (spatially) and the 3D shape space (temporally). We use the fact that each 3D trajectory can be expressed in terms of other trajectories only if they lie in the same subspace (spatial clustering), and that each individual activity can be expressed in terms of activities belonging to the same subspace (temporal clustering). A visual illustration of the spatio-temporal subspace concept is presented in Fig. 1. Concretely, spatial clustering reconstructs a trajectory as an affine combination of other trajectories from the same deforming object, while temporal clustering explains the shape of the deforming objects in one frame as an affine combination of shapes from other frames that belong to a similar activity.
By exploiting the spatio-temporal clustering structure, our approach learns affinity matrices that naturally encode the subspace information. From the affinity matrices, the number of deformable objects, the different activities, and the membership of each sample can be inferred directly and used for reconstruction. Furthermore, we exploit the fact that connections between samples should be tight if they belong to the same subspace and loose if they belong to different subspaces. Therefore, we propose to use a mixture of $\ell_1$-norm and $\ell_2$-norm regularization (also known as the Elastic Net), which helps in controlling the sparsity of the affinity matrices.
The main contributions of this paper are:
(i) We propose a joint segmentation and reconstruction framework for the challenging task of complex multi-body NRSfM by exploiting the inherent spatio-temporal union of subspaces constraint.
(ii) We propose to efficiently solve the resultant non-convex optimization problem with the Alternating Direction Method of Multipliers (ADMM).
(iii) Extensive experimental results on both synthetic and real multi-body NRSfM datasets demonstrate the superior performance of our proposed framework.
2 Related Works
Multi-body structure from motion (SfM) is an important problem in computer vision. For rigid motion, the problem can be solved by a direct extension of elegant multi-view geometry techniques. However, the solution to multi-body NRSfM is not straightforward, owing to the difficulty of modeling complex non-rigid variations. Recent state-of-the-art NRSfM reconstruction has shown promising results, but Zhu et al. argued that such an approach may fail when modeling long-term complex non-rigid motions. They note that the method of Dai et al. is “highly dependent on the complexity of the motion”. To overcome this difficulty, they suggested representing long-term non-rigid motion as a union of subspaces rather than a single subspace. Subsequently, Cho et al. used probabilistic variations to model complex shapes.
Despite the above accomplishments, NRSfM is still far behind its rigid counterpart. This gap is principally due to the difficulty of modeling real-world non-rigid deformation. If the deformation is irregular or arbitrary, explaining the 3D structure is nearly impossible. Nevertheless, many real-world deformations can be constrained; on this basis, Bregler introduced NRSfM in what is considered the seminal work in the area. In that work, Bregler demonstrated that non-rigid deformation can be represented as a linear combination of a set of shape bases. Following this work, several researchers modeled NRSfM by utilizing additional constraints. In 2008, Akhter et al. presented a dual approach by modeling 3D trajectories. In 2009, Akhter et al. proved that even though there is an ambiguity in the shape bases or trajectory bases, non-rigid shapes can still be recovered uniquely. In 2012, Dai et al. proposed a “prior-free” method to recover camera motion and 3D non-rigid deformation by exploiting the low-rank constraint only. Besides the shape basis model and the trajectory basis model, the shape-trajectory approach combines the two models and formulates the problem as revealing the trajectories of the shape basis coefficients. Beyond the linear combination model, Lee et al. proposed a Procrustean Normal Distribution (PND) model, where 3D shapes are aligned and fit to a normal distribution. Simon et al. exploited the Kronecker pattern in the shape-trajectory (spatio-temporal) priors. Zhu and Lucey applied the convolutional sparse coding technique to NRSfM using point trajectories. However, their method requires learning an over-complete basis of 3D trajectories prior to performing 3D reconstruction.
Recently, Russell et al. proposed to simultaneously segment a complex dynamic scene, containing a mixture of multiple objects, into its constituent objects and to reconstruct a 3D model of the scene. They formulate the problem as hierarchical graph-cut based segmentation, in which the whole scene is decomposed into background and foreground objects, and the complex motion of non-rigid or articulated objects is modeled as a set of overlapping rigid parts.
Our method differs from the aforementioned works in the following aspects: 1) We provide a novel framework for joint segmentation and reconstruction of multiple non-rigid deformations; 2) We propose a simple yet efficient and elegant optimization routine and its solution based on ADMM; 3) Our method can be applied to both sparse and dense scenarios (up to the order of ten thousand feature tracks).
Under our formulation, we intend to reconstruct 3D non-rigid shapes such that they satisfy both the spatio-temporal union of affine subspaces constraint and the non-rigid shape constraints (low rank and spatial coherency). Let $W \in \mathbb{R}^{2F \times P}$ represent the 2D measurement matrix, with $F$ the number of frames and $P$ the number of feature points. We use the orthographic camera model and eliminate the translation component of the camera motion, as suggested in prior work:
$$W = RS, \tag{1}$$
where $R \in \mathbb{R}^{2F \times 3F}$ denotes the camera rotation matrix and $S \in \mathbb{R}^{3F \times P}$ represents the 3D shapes of the deforming objects over all frames. This classical representation of the NRSfM problem aims at recovering both the camera motion $R$ and the non-rigid 3D shapes $S$ from the 2D measurement matrix $W$ such that $W = RS$. Following the same representation for the 2D-3D relation, we use the residual $W - RS$ to infer the re-projection error.
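The measurement model can be sketched in a few lines. This is only an illustration under assumed conventions: each frame contributes a 2x3 orthographic camera (the first two rows of a rotation) and a 3xP shape block, and stacking the per-frame products is equivalent to multiplying by a block-diagonal rotation matrix. The helper names (`matmul`, `orthographic_rotation`, `project`) are hypothetical and not taken from the paper.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def orthographic_rotation(theta):
    """First two rows of a rotation by theta about the y-axis (a 2x3 camera)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s],
            [0.0, 1.0, 0.0]]

def project(rotations, shapes):
    """Stack the per-frame projections R_f S_f into a 2F x P measurement matrix."""
    W = []
    for R_f, S_f in zip(rotations, shapes):
        W.extend(matmul(R_f, S_f))
    return W

# Two frames (F = 2) and three points (P = 3); each S_f is 3 x P with rows x, y, z.
shapes = [
    [[0.0, 1.0, 2.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]],
    [[0.0, 1.5, 2.0], [0.0, 0.5, 1.0], [1.0, 1.0, 2.0]],
]
rotations = [orthographic_rotation(0.0), orthographic_rotation(math.pi / 2)]
W = project(rotations, shapes)  # 4 x 3: two image rows per frame
```

In the first frame the camera looks down the z-axis, so the two image rows are simply the x and y coordinates; in the second frame the quarter turn about the y-axis maps the depth coordinate into the first image row.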
3.1 Representing multiple non-rigid deformations in trajectory space
Representing multiple non-rigid objects with a single linear trajectory space does not provide a compact representation of the 3D trajectories. When there are multiple non-rigid objects, each object can be characterized as lying in an affine subspace. Therefore, the 3D trajectories lie in a union of affine subspaces, which can equivalently be formulated in terms of self-expressiveness, i.e.,
$$S = S C_1, \tag{2}$$
where $C_1 \in \mathbb{R}^{P \times P}$ is the coefficient matrix over the $P$ trajectories. To rule out the trivial solution $C_1 = I$, we explicitly enforce the diagonal constraint $\mathrm{diag}(C_1) = 0$. As we represent each non-rigid object as lying in an affine subspace, we further enforce the affine constraint $\mathbf{1}^\top C_1 = \mathbf{1}^\top$. Besides the above constraints, we also want trajectories that belong to the same deforming object to be tightly connected, and trajectories from different objects to be loosely connected. To accommodate this idea of inter-class and intra-class trajectory clustering, we use the elastic net formulation to trade off connectedness against sparsity. Combining all the constraints, we reach the following optimization:
$$\min_{C_1} \; \lambda_1 \|C_1\|_1 + \frac{\gamma_1}{2} \|C_1\|_F^2 \quad \text{s.t.} \quad S = S C_1, \;\; \mathbf{1}^\top C_1 = \mathbf{1}^\top, \;\; \mathrm{diag}(C_1) = 0. \tag{3}$$
A visual illustration of this idea in trajectory space for a single trajectory is provided in Fig. 2. Here, $\|\cdot\|_1$ and $\|\cdot\|_F$ denote the $\ell_1$-norm and the Frobenius norm, respectively.
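To make the self-expressiveness idea concrete, the following toy sketch builds two objects with three trajectories each, hand-constructs a coefficient matrix that satisfies the zero-diagonal and affine constraints, and reads the segmentation off the resulting affinity. Everything here is an assumption for illustration: the data are synthetic, the coefficient matrix is written by hand rather than optimized, the elastic net weighting (`lam`, `gamma / 2`) is one common convention, and connected components stand in for the spectral clustering usually applied to such affinities.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def midpoint(u, v):
    """Affine combination 0.5*u + 0.5*v of two trajectories."""
    return [(a + b) / 2.0 for a, b in zip(u, v)]

def block_diag(block, copies):
    """Place `copies` copies of a square block on the diagonal of a zero matrix."""
    n = len(block)
    M = [[0.0] * (n * copies) for _ in range(n * copies)]
    for k in range(copies):
        for i in range(n):
            for j in range(n):
                M[k * n + i][k * n + j] = block[i][j]
    return M

def elastic_net(C, lam, gamma):
    """lam * ||C||_1 + (gamma / 2) * ||C||_F^2 (this weighting is an assumption)."""
    l1 = sum(abs(c) for row in C for c in row)
    fro2 = sum(c * c for row in C for c in row)
    return lam * l1 + 0.5 * gamma * fro2

def segment(C, tol=1e-8):
    """Label trajectories by connected components of the affinity |C| + |C|^T."""
    n = len(C)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and abs(C[i][j]) + abs(C[j][i]) > tol:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Each trajectory is a 3F-vector (F = 2 frames); the third point of each object
# is the midpoint of the other two in every frame.
a1 = [0.0, 0.0, 1.0, 0.2, 0.1, 1.0]
a2 = [2.0, 0.0, 1.0, 2.4, 0.3, 1.2]
b1 = [5.0, 1.0, 3.0, 5.0, 2.0, 3.5]
b2 = [7.0, 3.0, 3.0, 6.0, 4.0, 2.5]
trajectories = [a1, a2, midpoint(a1, a2), b1, b2, midpoint(b1, b2)]
S = [list(row) for row in zip(*trajectories)]  # 3F x P, one trajectory per column

# Column j holds the affine coefficients that rebuild trajectory j from the
# other trajectories of the same object (zero diagonal, columns sum to one).
block = [[0.0, -1.0, 0.5],
         [-1.0, 0.0, 0.5],
         [2.0, 2.0, 0.0]]
C = block_diag(block, 2)

residual = max(abs(s - r) for row_s, row_r in zip(S, matmul(S, C))
               for s, r in zip(row_s, row_r))
column_sums = [sum(C[i][j] for i in range(len(C))) for j in range(len(C))]
labels = segment(C)
```

Because the coefficient matrix is block diagonal, no trajectory borrows from the other object, and the two blocks show up directly as two connected components in the affinity.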
3.2 Representing multiple non-rigid deformations in shape space
An example of complex non-rigid motion is shown in Fig. 1, where the subjects perform different activities at different time instances. Each distinct motion adheres to a different local subspace, and the complete non-rigid motion lies in a union of shape subspaces. As noted in prior work, this assumption leads to superior 3D reconstruction. To incorporate into our formulation the idea that different activities lie in a union of affine subspaces, we express the 3D shapes in terms of self-expressiveness of the frames along the temporal direction:
$$S^\sharp = S^\sharp C_2, \tag{4}$$
where $S^\sharp \in \mathbb{R}^{3P \times F}$ (with $F$ frames and $P$ points) is the reshuffled version of $S$, representing the per-frame 3D shape as a column vector, and $C_2 \in \mathbb{R}^{F \times F}$ is the coefficient matrix over frames. A visual intuition of this idea in shape space for a single frame is provided in Fig. 3.
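The reshuffling between the stacked shape matrix and its per-frame vectorized form can be sketched as follows. The ordering inside each column (all x-coordinates, then all y, then all z) is an assumption made for this example, and the function names `reshuffle` and `unshuffle` are hypothetical.

```python
def reshuffle(S, num_frames):
    """Map S (3F x P, rows x/y/z per frame) to a 3P x F matrix whose column f
    is frame f written as one vector: all x's, then all y's, then all z's."""
    columns = []
    for f in range(num_frames):
        frame = S[3 * f:3 * f + 3]
        columns.append([v for row in frame for v in row])
    return [list(row) for row in zip(*columns)]

def unshuffle(S_sharp, num_points):
    """Inverse of `reshuffle`: rebuild the 3F x P stacked shape matrix."""
    num_frames = len(S_sharp[0])
    S = []
    for f in range(num_frames):
        column = [row[f] for row in S_sharp]
        for k in range(3):
            S.append(column[k * num_points:(k + 1) * num_points])
    return S

# F = 2 frames, P = 2 points.
S = [[1, 2], [3, 4], [5, 6],       # frame 0: x, y, z rows
     [7, 8], [9, 10], [11, 12]]    # frame 1: x, y, z rows
S_sharp = reshuffle(S, num_frames=2)
```

The mapping only permutes entries, so it is linear and invertible, which is what allows constraints on the reshuffled matrix to be tied back to the stacked shapes.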
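For temporal clustering, we also use the elastic net as the regularizer, for reasons similar to those given in Section 3.1 for the trajectory coefficients. With $S^\sharp$ the reshuffled shape matrix and $C_2$ the coefficient matrix over frames, this leads to the following optimization:
$$\min_{C_2} \; \lambda_2 \|C_2\|_1 + \frac{\gamma_2}{2} \|C_2\|_F^2 \quad \text{s.t.} \quad S^\sharp = S^\sharp C_2, \;\; \mathbf{1}^\top C_2 = \mathbf{1}^\top, \;\; \mathrm{diag}(C_2) = 0. \tag{5}$$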
3.3 Enforcing the global shape constraint
In seeking a compact representation for multi-body non-rigid objects, we penalize the number of independent non-rigid shapes. Similar to previous work, we penalize the nuclear norm of the reshuffled shape matrix $S^\sharp$, because the nuclear norm is known to be the convex envelope of the rank function. In this way, the global shape constraint is expressed as:
$$\min_{S^\sharp} \; \|S^\sharp\|_*, \tag{6}$$
where $\|\cdot\|_*$ denotes the nuclear norm of a matrix, i.e., the sum of its singular values.
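As a small worked example of why the nuclear norm acts as a rank surrogate, consider the generic facts below. They are standard results about the nuclear norm and singular value thresholding, written in notation chosen for this note ($X$, $\mathcal{D}_\tau$), not formulas taken from the paper.

```latex
% Definition: sum of singular values.
\|X\|_* \;=\; \sum_{i=1}^{\min(m,n)} \sigma_i(X),
\qquad X \in \mathbb{R}^{m \times n}.

% Worked example: the singular values are 4 and 3.
X = \begin{bmatrix} 3 & 0 \\ 0 & 4 \\ 0 & 0 \end{bmatrix}
\;\Longrightarrow\;
\operatorname{rank}(X) = 2, \qquad \|X\|_* = 4 + 3 = 7.

% Singular value thresholding is the proximal operator of the nuclear norm.
\mathcal{D}_\tau(X)
= U \operatorname{diag}\!\big(\max(\sigma_i - \tau, 0)\big) V^\top
= \arg\min_{Z} \; \tau \|Z\|_* + \tfrac{1}{2} \|Z - X\|_F^2,
\qquad X = U \operatorname{diag}(\sigma_i) V^\top.

% With \tau = 3.5 the singular values (4, 3) shrink to (0.5, 0),
% so the thresholded matrix has rank 1.
```

Shrinking every singular value by the same amount drives the small ones to zero first, which is how a nuclear norm penalty encourages a low-rank shape matrix.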
3.4 Joint Reconstruction and Segmentation Formulation
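Putting all the above constraints (the spatio-temporal union of subspaces constraint and the global shape constraint) together, we reach a multi-body non-rigid reconstruction and segmentation formulation:
$$\min_{S, C_1, C_2} \; \frac{1}{2} \|W - RS\|_F^2 + \lambda_1 \|C_1\|_1 + \frac{\gamma_1}{2} \|C_1\|_F^2 + \lambda_2 \|C_2\|_1 + \frac{\gamma_2}{2} \|C_2\|_F^2 + \lambda_3 \|S^\sharp\|_*$$
$$\text{s.t.} \quad S = S C_1, \;\; S^\sharp = S^\sharp C_2, \;\; \mathbf{1}^\top C_1 = \mathbf{1}^\top, \;\; \mathbf{1}^\top C_2 = \mathbf{1}^\top, \;\; \mathrm{diag}(C_1) = 0, \;\; \mathrm{diag}(C_2) = 0, \tag{7}$$
where $S \in \mathbb{R}^{3F \times P}$, $S^\sharp \in \mathbb{R}^{3P \times F}$, $C_1 \in \mathbb{R}^{P \times P}$, and $C_2 \in \mathbb{R}^{F \times F}$. $\lambda_1$, $\lambda_2$, $\lambda_3$, $\gamma_1$, and $\gamma_2$ are the trade-off parameters.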
To solve the proposed optimization we introduce decoupling variables in Eq. 7, which leads to the following formulation:
The auxiliary variables are introduced to simplify the derivation. The reshuffling is a linear mapping from $S$ to its reshuffled version $S^\sharp$. Specifically, $S$ stacks the per-frame 3D shapes row-wise, whereas each column of $S^\sharp$ holds the shape of one frame as a vector. The first term in the above optimization penalizes the re-projection error under orthographic projection. In the single-body NRSfM configuration, the 3D shape can be well characterized as lying in a single low-dimensional linear subspace. However, when there are multiple non-rigid objects, each non-rigid object can be characterized as lying in an affine subspace. To represent this idea mathematically in shape space and trajectory space, respectively, we introduce the corresponding self-expressiveness constraints.
In addition, to reveal the intrinsic structure of multi-body NRSfM, we seek the sparsest solution in both trajectory and shape space. Consequently, we enforce the $\ell_1$ norm on the trajectory-space and shape-space coefficient matrices. However, high sparsity may lead to misclassification of samples or trajectories. Therefore, to balance sparsity against connectedness, we incorporate the elastic net for both coefficient matrices. Lastly, we enforce a global shape constraint (the nuclear norm term) for a compact representation of multi-body non-rigid objects by penalizing the rank of the entire non-rigid shape.
Due to the two bilinear terms that arise from the self-expressiveness constraints, the overall optimization of Eq. (8) is non-convex. We solve it via the alternating direction method of multipliers (ADMM), which has proven effective for many non-convex problems and is widely used in computer vision. ADMM works by decomposing the original optimization problem into several sub-problems, each of which can be solved efficiently. To this end, we seek to decompose Eq. (8) into several sub-problems.
In the augmented Lagrangian of Eq. (9), each equality constraint is associated with a Lagrange multiplier, and we use the same penalty parameter for every augmented Lagrange term to simplify the derivation and the parameter setting. The symbol $\langle \cdot, \cdot \rangle$ represents the Frobenius inner product of two matrices, i.e., the trace of the product of the two matrices. For example, given two matrices $A$ and $B$ of the same size, the Frobenius inner product is calculated as $\langle A, B \rangle = \mathrm{Tr}(A^\top B)$.
ADMM works by minimizing Eq. (9) with respect to one variable while fixing the others. During each iteration, we update each variable and the Lagrange multipliers in sequence. The detailed derivation of the solution is presented in the Appendix.
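The alternating update pattern can be illustrated on a much simpler problem. The sketch below is not the paper's algorithm: it applies ADMM to a convex toy problem, minimizing 0.5 * ||x - a||^2 + lam * ||z||_1 subject to x = z, purely to show the cycle of one primal update per variable followed by a multiplier update. The names `admm_l1_denoise`, `rho`, and `iterations` are choices made for this example.

```python
def soft_threshold(x, tau):
    """Shrink x toward zero by tau: sign(x) * max(|x| - tau, 0)."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0

def admm_l1_denoise(a, lam, rho=1.0, iterations=200):
    """Toy ADMM for min 0.5*||x - a||^2 + lam*||z||_1 subject to x = z.

    Augmented Lagrangian (y is the multiplier, rho the penalty parameter):
        0.5*||x - a||^2 + lam*||z||_1 + <y, x - z> + (rho / 2)*||x - z||^2
    """
    n = len(a)
    x, z, y = [0.0] * n, [0.0] * n, [0.0] * n
    for _ in range(iterations):
        # x-update: set the derivative with respect to x to zero.
        x = [(a_i - y_i + rho * z_i) / (1.0 + rho) for a_i, y_i, z_i in zip(a, y, z)]
        # z-update: proximal step of the l1 term, i.e. soft-thresholding.
        z = [soft_threshold(x_i + y_i / rho, lam / rho) for x_i, y_i in zip(x, y)]
        # Multiplier update: gradient ascent on the constraint residual x - z.
        y = [y_i + rho * (x_i - z_i) for y_i, x_i, z_i in zip(y, x, z)]
    return x, z, y

x, z, y = admm_l1_denoise([3.0, -0.5, 1.2], lam=1.0)
```

For this toy problem the exact minimizer is the soft-thresholded input, so the iterates can be checked against it; the multi-body formulation follows the same pattern but with more variables and no convexity guarantee.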
Solution for S: The closed-form solution for $S$ can be derived by taking the derivative of Eq. (9) w.r.t. $S$ and equating it to zero.
Solution for : The closed form solution for can be derived by taking derivative of (9) w.r.t and equating to zero.
Solution for : The closed form solution for can be derived as
Solution for : The closed form solution for can be derived as
Solution for : The optimization of given all the remaining variables can be expressed as:
A closed-form solution exists for this sub-problem. Define the soft-thresholding operation as $\mathcal{S}_\tau[x] = \mathrm{sign}(x)\max(|x| - \tau, 0)$; the optimal solution can then be obtained as:
where = .
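The role of soft-thresholding in an elastic net sub-problem can be sketched for a single entry. The scalar sub-problem below, lam * |c| + (gamma / 2) * c^2 + (rho / 2) * (c - v)^2, is an assumed generic form chosen for illustration; the exact target `v` and constants in the paper's update depend on its auxiliary variables and multipliers, which are not reproduced here.

```python
def soft_threshold(x, tau):
    """sign(x) * max(|x| - tau, 0)."""
    if x > tau:
        return x - tau
    if x < -tau:
        return x + tau
    return 0.0

def objective(c, v, lam, gamma, rho):
    """Scalar elastic net sub-problem (assumed form, see the lead-in)."""
    return lam * abs(c) + 0.5 * gamma * c * c + 0.5 * rho * (c - v) ** 2

def elastic_net_prox(v, lam, gamma, rho):
    """Closed-form minimizer of `objective`: complete the square in the two
    quadratic terms, then soft-threshold."""
    return soft_threshold(rho * v, lam) / (gamma + rho)

def update_matrix(V, lam, gamma, rho):
    """Apply the scalar update entry-wise; the problem separates over entries."""
    return [[elastic_net_prox(v, lam, gamma, rho) for v in row] for row in V]

def brute_force_minimum(v, lam, gamma, rho, lo=-3.0, hi=3.0, steps=6000):
    """Smallest objective value on a uniform grid, used only as a sanity check."""
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    return min(objective(c, v, lam, gamma, rho) for c in grid)

C_new = update_matrix([[2.0, 0.3], [-1.5, 0.0]], lam=1.0, gamma=1.0, rho=1.0)
```

Entries whose magnitude falls below the threshold are set exactly to zero, which produces sparsity, while the quadratic term scales the surviving entries down rather than letting a few of them dominate.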
Solution for : The closed-form solution for can be obtained similarly:
Solution for The derivation for the solution of is similar to .
Detailed derivations of the solution to each sub-problem are provided in Appendix A. Finally, the Lagrange multipliers are updated as:
5 Experiments and Results
We performed extensive experiments on freely available benchmark datasets. We tested our approach on both real and synthetic data under sparse and semi-dense scenarios. Let $S_{est}$ denote the estimated 3D structure and $S_{GT}$ the ground-truth structure; we use the following error metrics to evaluate the performance of the approach:
(i) Relative error in multi-body non-rigid 3D reconstruction,
$$e_{3D} = \frac{\|S_{est} - S_{GT}\|_F}{\|S_{GT}\|_F}.$$
(ii) Error in multi-body non-rigid motion segmentation,
$$e_{MS} = \frac{\text{number of incorrectly segmented trajectories}}{\text{total number of trajectories}}.$$
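The two metrics can be sketched as below, under stated assumptions: the reconstruction error is taken as the Frobenius norm of the difference between estimate and ground truth, normalized by the Frobenius norm of the ground truth, and the segmentation error is the fraction of misassigned trajectories after the best one-to-one relabeling of the predicted clusters. The function names are hypothetical, and the exhaustive search over relabelings is only practical for a handful of clusters.

```python
import itertools
import math

def frobenius_norm(M):
    """Square root of the sum of squared entries."""
    return math.sqrt(sum(v * v for row in M for v in row))

def relative_reconstruction_error(S_est, S_gt):
    """||S_est - S_gt||_F / ||S_gt||_F (assumed form of the relative error)."""
    diff = [[e - g for e, g in zip(row_e, row_g)] for row_e, row_g in zip(S_est, S_gt)]
    return frobenius_norm(diff) / frobenius_norm(S_gt)

def segmentation_error(predicted, ground_truth):
    """Fraction of misassigned points under the best relabeling of clusters."""
    pred_ids = sorted(set(predicted))
    gt_ids = sorted(set(ground_truth))
    # Pad with None so that surplus predicted clusters can stay unmatched.
    targets = gt_ids + [None] * max(0, len(pred_ids) - len(gt_ids))
    best = len(predicted)
    for perm in itertools.permutations(targets, len(pred_ids)):
        mapping = dict(zip(pred_ids, perm))
        wrong = sum(1 for p, g in zip(predicted, ground_truth) if mapping[p] != g)
        best = min(best, wrong)
    return best / len(predicted)

e_3d = relative_reconstruction_error([[3.0, 4.5]], [[3.0, 4.0]])
e_ms = segmentation_error([1, 1, 0, 0, 0], [0, 0, 1, 1, 0])
```

Cluster labels carry no meaning of their own, so the segmentation error has to be computed after matching predicted clusters to ground-truth objects; here the swap of labels 0 and 1 leaves a single misassigned point out of five.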