Spatial-Temporal Union of Subspaces for Multi-body Non-rigid Structure-from-Motion

05/14/2017 ∙ by Suryansh Kumar, et al. ∙ Australian National University

Non-rigid structure-from-motion (NRSfM) has so far been studied mostly for recovering the 3D structure of a single non-rigid/deforming object. To handle challenging real-world scenes with multiple deforming objects, existing methods either pre-segment the different objects in the scene or treat multiple non-rigid objects as a whole to obtain the 3D non-rigid reconstruction. However, these methods fail to exploit the inherent structure of the problem, as the segmentation and the reconstruction cannot benefit from each other. In this paper, we propose a unified framework to jointly segment and reconstruct multiple non-rigid objects. To compactly represent complex multi-body non-rigid scenes, we propose to exploit the structure of the scenes along both the temporal and the spatial directions, thus achieving a spatio-temporal representation. Specifically, we represent the 3D non-rigid deformations as lying in a union of subspaces along the temporal direction and represent the 3D trajectories as lying in a union of subspaces along the spatial direction. This spatio-temporal representation not only provides competitive 3D reconstruction but also outputs robust segmentation of multiple non-rigid objects. The resultant optimization problem is solved efficiently using the Alternating Direction Method of Multipliers (ADMM). Extensive experimental results on both synthetic and real multi-body NRSfM datasets demonstrate the superior performance of our proposed framework compared with state-of-the-art methods.

1 Introduction

Aiming at recovering the camera motion and non-rigid structure simultaneously from 2D images captured by monocular cameras, non-rigid structure-from-motion (NRSfM) is central to many computer vision applications and has received considerable attention in recent years. This classical problem is highly under-constrained. Although existing approaches to NRSfM [6] [8] [24] [14] [4] have presented promising results, most of these methods assume that there is only one object undergoing non-rigid deformation in the scene. However, real-world non-rigid scenes are much more complex: for example, multiple persons performing different activities, soccer players on a field, or salsa dancers. All these real-world examples constitute multi-body non-rigid deformation, which cannot be explained well under the single non-rigid object assumption. Therefore, it is quite natural to extend single-body NRSfM to multi-body NRSfM, where the task is to jointly reconstruct and segment multiple 3D deforming objects over time.

In solving the problem of multi-body NRSfM, a natural and direct two-stage process is to first reconstruct the non-rigid multi-body structure by applying state-of-the-art non-rigid reconstruction methods [9] [18] [29] and then segment distinct objects using subspace clustering methods such as Sparse Subspace Clustering (SSC) [12] or other clustering algorithms, or vice versa. However, such pipelines never exploit the inherent structure of the problem, i.e., non-rigid motion segmentation provides critical information to constrain 3D reconstruction, while 3D non-rigid reconstruction can in turn constrain the corresponding motion segmentation problem. Furthermore, since the non-rigid shape deformation actually occurs in 3D space, it is more intuitive to segment objects in 3D space rather than in the projected 2D image space.

Figure 1: Illustration of the two clustering constraints used in our framework. We observe that, when different objects undergo complex non-rigid motion, temporal clustering helps improve the 3D reconstruction by clustering different activities over time, such as stretching, walking, and jumping. Spatial clustering helps explain the segmentation of distinct structures over images. Frames with similar activities are shown in the same color, and different subjects undergoing deformations are shown in boxes. Here, T. Cluster refers to the temporal cluster and S. Cluster refers to the spatial cluster. This flow diagram demonstrates that subjects performing different activities over time lie in distinct temporal and spatial subspaces; consequently, the 3D trajectories spanned by different structures lie in distinct subspaces. The example images are collected from the UMPM dataset [1]. (Best viewed on screen in color)

Additionally, it is often more convenient, both computationally and numerically, to solve a given task using a unified approach than to solve it sequentially. Therefore, in this paper, we propose a framework to simultaneously reconstruct and cluster multiple non-rigid shapes by exploiting the spatio-temporal correlation in the data. Such an approach lets us explain the dynamics of non-rigid shapes in a more intuitive way. Explicitly, we represent multi-body NRSfM as a union of subspaces both in 3D trajectory space (spatially) and in 3D shape space (temporally). We use the fact that each 3D trajectory can be expressed in terms of other trajectories only if they are from the same subspace (spatial clustering) [17], and each individual activity can be expressed in terms of other activities belonging to the same subspace (temporal clustering) [29]. A visual illustration of the spatio-temporal subspace concept is presented in Fig. 1. Concretely, spatial clustering tries to reconstruct a trajectory using an affine combination of other trajectories from the same deforming object, while temporal clustering tries to explain the shape of deforming objects using an affine combination of shapes at other frames belonging to a similar activity.

By exploiting the spatio-temporal clustering structure, our approach is able to learn affinity matrices that naturally encode subspace information. From the affinity matrices, the number of deformable objects, the different activities, and the membership of each sample can be inferred directly to achieve reconstruction. Furthermore, we exploit the fact that the connectivity between samples must be tight if they belong to the same subspace and loose if they belong to different subspaces. Therefore, we propose to use a mixture of $\ell_1$ norm and $\ell_2$ norm regularization (also known as the Elastic Net [31]), which helps in controlling the sparsity of the affinity matrices.

Contributions:
  1. We propose a joint segmentation and reconstruction framework for the challenging task of complex multi-body NRSfM by exploiting the inherent spatio-temporal union of subspaces constraint.

  2. We propose to efficiently solve the resultant non-convex optimization problem using the Alternating Direction Method of Multipliers (ADMM) [5].

  3. Extensive experimental results on both synthetic and real multi-body NRSfM datasets demonstrate the superior performance of our proposed framework.

2 Related Works

Multi-body structure from motion (SfM) is an important problem in computer vision. For rigid motion, solving this problem is a direct extension of elegant multi-view geometry techniques [13][20]. However, the solution to multi-body NRSfM is not straightforward, due to the difficulty of modeling complex non-rigid variations. A recent state-of-the-art NRSfM reconstruction method [9] has shown promising results, while Zhu et al. [29] argued that such an approach may fail when modeling long-term complex non-rigid motions. They note that the method of Dai et al. [8] is “highly dependent on the complexity of the motion” [29]. Hence, to overcome this difficulty, they suggested representing long-term non-rigid motion as a union of subspaces rather than a single subspace. Subsequently, Cho et al. [7] used probabilistic variations to model complex shapes.

Despite the above accomplishments, NRSfM is still far behind its rigid counterpart. This gap is principally due to the difficulty of modeling real-world non-rigid deformation. If the deformation is irregular or arbitrary, then explaining the 3D structure is nearly impossible. Nevertheless, many real-world deformations can be constrained; on this basis, Bregler et al. [6] introduced NRSfM in what is considered the seminal work in the field. In that work, Bregler et al. demonstrated that non-rigid deformation can be represented by a linear combination of a set of shape bases. Following this work, several researchers tried to model NRSfM by utilizing additional constraints [25], [27], [21]. In 2008, Akhter et al. [4] presented a dual approach by modeling 3D trajectories. In 2009, Akhter et al. [3] proved that even though there is an ambiguity in the shape bases or trajectory bases, non-rigid shapes can still be solved uniquely without any ambiguity. In 2012, Dai et al. [8] proposed a “prior-free” method to recover camera motion and 3D non-rigid deformation by exploiting the low-rank constraint only. Besides the shape basis model and the trajectory basis model, the shape-trajectory approach [16] combines the two models and formulates the problem as revealing the trajectory of the shape basis coefficients. Besides the linear combination model, Lee et al. [18] proposed a Procrustean Normal Distribution (PND) model, where 3D shapes are aligned and fit into a normal distribution. Simon et al. [23] exploited the Kronecker pattern in the shape-trajectory (spatio-temporal) priors. Zhu and Lucey [30] applied the convolutional sparse coding technique to NRSfM using point trajectories. However, the method requires learning an over-complete basis of 3D trajectories prior to performing 3D reconstruction.

Recently, Russell et al. [22] proposed to simultaneously segment a complex dynamic scene containing a mixture of multiple objects into its constituent objects and reconstruct a 3D model of the scene. They formulate the problem as hierarchical graph-cut based segmentation, where the whole scene is decomposed into background and foreground objects, and the complex motion of non-rigid or articulated objects is modeled as a set of overlapping rigid parts.

Our method differs from the aforementioned works in the following aspects: 1) We provide a novel framework for joint segmentation and reconstruction in the multiple non-rigid deformation problem; 2) We propose a simple yet efficient and elegant optimization routine, solved using ADMM; 3) Our method can be applied to both sparse and dense scenarios (up to the order of ten thousand feature tracks).

A part of this work has been published in 3DV 2016 [17], which addressed multi-body NRSfM by using the spatial constraint only. The work of [17] can be viewed as a special case of the present work.

3 Formulation

Under our formulation, we intend to reconstruct 3D non-rigid shapes such that they satisfy both the spatio-temporal union of affine subspaces constraint and the non-rigid shape constraints (low rank and spatial coherency). Let $W \in \mathbb{R}^{2F \times P}$ represent the 2D measurement matrix, with $F$ the number of frames and $P$ the number of feature points. We use the orthographic camera model and eliminate the translation component of the camera motion as suggested in [6]:

$$W = RS \tag{1}$$

where $R \in \mathbb{R}^{2F \times 3F}$ denotes the camera rotation matrix and $S \in \mathbb{R}^{3F \times P}$ represents the 3D shapes of the deforming objects over all frames. This classical representation of the NRSfM problem [6] aims at recovering both the camera motion $R$ and the non-rigid 3D shapes $S$ from the 2D measurement matrix $W$ such that $W = RS$. Following the same representation for the 2D-3D relation, we use the residual between $W$ and $RS$ to measure the re-projection error.
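As an illustration of the factorization model in Eq. (1), the following sketch builds a synthetic measurement matrix from a block-diagonal stack of per-frame orthographic rotations and a stacked shape matrix. The dimensions (2F x P for the measurements, 2F x 3F for the rotations, 3F x P for the shapes) follow the usual NRSfM convention and are an assumption here; the function names and the random data are hypothetical and only serve the illustration.

```python
import numpy as np

def random_rotation(rng):
    """Draw a random 3x3 rotation from the QR factorization of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q = q * np.sign(np.diag(r))   # fix the column signs of the factorization
    if np.linalg.det(q) < 0:      # flip one axis so that det(q) = +1
        q[:, 0] = -q[:, 0]
    return q

def build_measurements(F, P, seed=0):
    """Return toy (W, R, S) with W = R @ S for F frames and P points."""
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((3 * F, P))
    # Centre every coordinate row so that the per-frame centroid is the origin,
    # which mimics removing the translation component.
    S -= S.mean(axis=1, keepdims=True)
    R = np.zeros((2 * F, 3 * F))
    for f in range(F):
        # Orthographic projection keeps the first two rows of the rotation.
        R[2 * f:2 * f + 2, 3 * f:3 * f + 3] = random_rotation(rng)[:2, :]
    return R @ S, R, S

W, R, S = build_measurements(F=5, P=12)
print(W.shape, R.shape, S.shape)
```

Because the shapes are centred, the rows of the resulting measurement matrix also have zero mean, which is why the translation can be dropped from the model.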

3.1 Representing multiple non-rigid deformations in trajectory space

Representing multiple non-rigid objects using a single linear trajectory space does not provide a compact representation of 3D trajectories [29]. When there are multiple non-rigid objects, each object can be characterized as lying in an affine subspace. Therefore, the 3D trajectories lie in a union of affine subspaces, which can equivalently be formulated in terms of self-expressiveness, i.e.,

(2)
Figure 2: Visual illustration of the affine subspace constraint in trajectory space. Each column of the 3D shape matrix is the trajectory of a 3D point (shown in green). The visualization shows that a trajectory can be reconstructed using an affine combination of a few other trajectories. This pictorial representation is provided for illustration purposes only. (Best viewed in color)

where the coefficient matrix has one column per trajectory. To rule out the trivial solution in which each trajectory simply represents itself, we explicitly constrain the diagonal of the coefficient matrix to be zero. As we represent each non-rigid object as lying in an affine subspace, we further enforce the affine constraint that the coefficients in each column sum to one. Besides the above constraints, we also want the connections between trajectories to be tight if they belong to the same deforming object and loose otherwise. To capture this idea of inter-class and intra-class trajectory clustering, we use the elastic net formulation [28] to trade off between connectedness and sparsity. Combining all the constraints together, we reach the following optimization:

(3)
subject to:

A visual illustration of this idea in trajectory space for a single trajectory is provided in Fig. 2. Here, $\|\cdot\|_1$ and $\|\cdot\|_F$ denote the $\ell_1$-norm and the Frobenius norm respectively.
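The affine self-expressiveness idea behind Eq. (2) can be checked numerically on toy data. The sketch below does not solve the elastic-net problem of Eq. (3); it only verifies, with an ordinary least-squares solve, that a point on an affine subspace is an affine combination (coefficients summing to one, zero weight on itself) of other points from the same subspace. The function name and the synthetic data are illustrative assumptions.

```python
import numpy as np

def affine_coefficients(X, j):
    """Express column j of X as an affine combination of the other columns.

    Solves [X_others; 1^T] c = [x_j; 1] in the least-squares sense. The
    appended row of ones encodes sum(c) = 1. On noise-free data from a single
    affine subspace the system is consistent, so the minimum-norm solution
    returned by lstsq satisfies it up to rounding.
    """
    others = np.delete(np.arange(X.shape[1]), j)
    A = np.vstack([X[:, others], np.ones((1, others.size))])
    b = np.concatenate([X[:, j], [1.0]])
    c, *_ = np.linalg.lstsq(A, b, rcond=None)
    full = np.zeros(X.shape[1])
    full[others] = c              # entry j stays zero (no self-representation)
    return full

rng = np.random.default_rng(0)
# Eight points on a 2-dimensional affine subspace of R^6: x = offset + B @ k.
offset = rng.standard_normal((6, 1))
B = rng.standard_normal((6, 2))
X = offset + B @ rng.standard_normal((2, 8))

c = affine_coefficients(X, j=0)
print("sum of coefficients:", c.sum())
print("reconstruction error:", np.linalg.norm(X @ c - X[:, 0]))
```

Stacking one such coefficient vector per column gives a matrix with zero diagonal and unit column sums, which is the structure the constraints in Eq. (3) describe.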

3.2 Representing multiple non-rigid deformations in shape space

(a)
(b)
Figure 3: Visual representation of the union of subspaces in shape space. (a) Two different subjects performing Dance (red) and Yoga (green) respectively. (b) Equivalent representation of both activities in shape space for a single frame, with the green ellipsoid showing the shape space for the Yoga activity and the red ellipsoid showing that for the Dance activity. It can be observed that shapes from different activities span distinct subspaces. The gray ellipsoid shows the union of both subspaces. (Best viewed in color)

An example of complex non-rigid motion is shown in Figure 1, where the subjects perform different activities at different time instances. Each such distinct motion adheres to a different local subspace, and the complete non-rigid motion lies in a union of shape subspaces. As mentioned in [29], this assumption leads to superior 3D reconstruction. To incorporate into our formulation the concept that different activities lie in a union of affine subspaces, we express the 3D shapes in terms of self-expressiveness of frames along the temporal direction.

(4)

where the matrix being expressed is the reshuffled version of the shape matrix S, in which each per-frame 3D shape is represented as a column vector.

A visual intuition of this idea in shape space for a single frame is provided in Fig. 3.
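The reshuffling step can be sketched as a pure reshape. The code assumes a 3F x P shape matrix whose rows are ordered frame by frame as (x, y, z), and produces a 3P x F matrix with one column per frame. The ordering of coordinates within each column and the function names are assumptions made for illustration.

```python
import numpy as np

def reshuffle(S):
    """Map a 3F x P shape matrix to a 3P x F matrix with one column per frame."""
    F = S.shape[0] // 3
    P = S.shape[1]
    # Rows 3f..3f+2 of S hold the 3 x P shape of frame f; flatten it row-wise.
    return S.reshape(F, 3 * P).T

def unshuffle(S_sharp, P):
    """Inverse of reshuffle: recover the 3F x P shape matrix."""
    F = S_sharp.shape[1]
    return S_sharp.T.reshape(3 * F, P)

# Two frames, four points, filled with consecutive integers for readability.
S_demo = np.arange(2 * 3 * 4, dtype=float).reshape(6, 4)
S_sharp_demo = reshuffle(S_demo)
print(S_sharp_demo.shape)
```

With this layout, temporal self-expressiveness acts on the columns of the reshuffled matrix (frames), whereas spatial self-expressiveness acts on the columns of the original matrix (trajectories).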

For temporal clustering, we also use elastic net regularization, for the same reason given in Section 3.1, thereby formulating the following optimization:

(5)
subject to:

3.3 Enforcing the global shape constraint

In seeking a compact representation for multi-body non-rigid objects, we penalize the number of independent non-rigid shapes. Similar to [8] and [14], we penalize the nuclear norm of the reshuffled shape matrix, because the nuclear norm is the convex envelope of the rank function (over the unit spectral-norm ball). In this way, the global shape constraint is expressed as:

(6)

where $\|\cdot\|_*$ denotes the nuclear norm of the matrix, i.e., the sum of its singular values.
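A short numerical check of the nuclear norm as a rank surrogate follows. The matrices are random toy data and the helper name is hypothetical. The final comparison illustrates the convex-envelope property: for a matrix scaled to unit spectral norm, the nuclear norm never exceeds the rank.

```python
import numpy as np

def nuclear_norm(X):
    """Sum of the singular values of X."""
    return np.linalg.svd(X, compute_uv=False).sum()

rng = np.random.default_rng(0)
# A 30 x 20 matrix of rank 2, built as a product of two thin factors.
low_rank = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 20))

# Scale to unit spectral norm (largest singular value equal to one).
unit = low_rank / np.linalg.norm(low_rank, 2)

print("rank:", np.linalg.matrix_rank(low_rank))
print("nuclear norm (unit spectral norm):", nuclear_norm(unit))
```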

3.4 Joint Reconstruction and Segmentation Formulation

Putting all the above constraints (spatio-temporal union of subspace constraint and global shape constraint) together, we reach a multi-body non-rigid reconstruction and segmentation formulation:

(7)
subject to:

where , , and . are the trade-off parameters.

4 Solution

To solve the proposed optimization we introduce decoupling variables in Eq. 7, which leads to the following formulation:

(8)
subject to:

The auxiliary variables are introduced to simplify the derivation. denotes the linear mapping from to its reshuffled version . Specifically, S = and
= . The first term in the above optimization is meant for penalizing re-projection error under orthographic projection. Under single-body NRSFM configuration, 3D shape can be well characterized as lying in a single low dimensional linear subspace. However, when there are multiple non-rigid objects, each non-rigid object could be characterized as lying in an affine subspace. To represent this idea mathematically in shape and trajectory space respectively, we introduce and .

In addition, to reveal the intrinsic structure of multi-body NRSfM, we seek the sparsest solution in both trajectory and shape space. Consequently, we impose the $\ell_1$ norm on the coefficient matrices of both spaces. However, high sparsity may lead to misclassification of samples or trajectories. Therefore, to maintain a balance between sparsity and connectedness, we incorporate the elastic net for both coefficient matrices. Lastly, we enforce a global shape constraint (the nuclear norm penalty) for a compact representation of multi-body non-rigid objects by penalizing the rank of the entire non-rigid shape.

Due to the two bilinear self-expressiveness terms (each the product of a shape matrix and a coefficient matrix), the overall optimization of Eq. (8) is non-convex. We solve it via the alternating direction method of multipliers (ADMM), which has proven effective for many non-convex problems and is widely used in computer vision. ADMM works by decomposing the original optimization problem into several sub-problems, each of which can be solved efficiently. To this end, we seek to decompose Eq. (8) into several sub-problems.

We introduce Lagrange multipliers into Eq. (8) and reach the augmented Lagrangian formulation:

(9)

Here, one Lagrange multiplier is associated with each equality constraint, and a single penalty parameter is shared by all augmented Lagrangian terms to simplify the derivation and the parameter setting. The symbol $\langle \cdot, \cdot \rangle$ represents the Frobenius inner product of two matrices, i.e., the trace of the product of the transpose of the first matrix with the second. For example, given two matrices $A$ and $B$ of the same size, the Frobenius inner product is calculated as $\langle A, B \rangle = \mathrm{Tr}(A^{T}B)$.

ADMM works by minimizing Eq. (9) with respect to one variable while fixing the others. During each iteration, we update each variable and the Lagrange multipliers in sequence. The detailed derivation of the solution is presented in the Appendix.
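To make the alternating update pattern concrete, here is a toy ADMM loop on a much simpler convex problem than Eq. (8): minimize 0.5 * ||x - a||^2 + lam * ||z||_1 subject to x = z. The problem, the variable names, and the fixed penalty `beta` are illustrative assumptions and are not taken from the paper; only the structure (update each primal variable in turn, then the multiplier) mirrors the scheme described above.

```python
import numpy as np

def soft_threshold(v, tau):
    """Element-wise shrinkage: sign(v) * max(|v| - tau, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def admm_toy(a, lam, beta=1.0, iters=200):
    """ADMM for min 0.5*||x - a||^2 + lam*||z||_1  subject to  x = z."""
    x = np.zeros_like(a)
    z = np.zeros_like(a)
    y = np.zeros_like(a)          # Lagrange multiplier for x - z = 0
    for _ in range(iters):
        # x-update: zero gradient of 0.5||x-a||^2 + <y, x-z> + beta/2 ||x-z||^2
        x = (a - y + beta * z) / (1.0 + beta)
        # z-update: proximal step of the l1 term around x + y / beta
        z = soft_threshold(x + y / beta, lam / beta)
        # multiplier update: gradient ascent on the constraint residual
        y = y + beta * (x - z)
    return x, z

a = np.array([3.0, -0.3, 0.0, -2.5, 0.9])
x_hat, z_hat = admm_toy(a, lam=1.0)
print(z_hat)
```

For this toy problem the exact minimizer is known in closed form (soft-thresholding of `a` at level `lam`), which makes it easy to confirm that the iterations converge.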

Solution for S: The closed-form solution for S can be derived by taking the derivative of Eq. (9) w.r.t. S and equating it to zero.

(10)

Solution for : The closed form solution for can be derived by taking derivative of (9) w.r.t and equating to zero.

(11)

Solution for : The closed form solution for can be derived as

(12)
(13)

Solution for : The closed form solution for can be derived as

(14)
(15)

Solution for : The optimization of given all the remaining variables can be expressed as:

(16)

A closed-form solution exists for this sub-problem. Defining the soft-thresholding operation as $\mathcal{S}_{\tau}[x] = \mathrm{sign}(x)\max(|x| - \tau, 0)$, the optimal solution can be obtained as:

(17)

where = .
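The way an elastic-net term reduces to soft-thresholding can be sketched for a generic sub-problem of the form min over C of lam * ||C||_1 + (gamma / 2) * ||C||_F^2 + (beta / 2) * ||C - V||_F^2. Whether Eq. (16) has exactly this form is an assumption; the symbols `lam`, `gamma`, `beta`, and `V` are placeholders. Completing the square in the two quadratic terms shows that the minimizer is the soft-thresholding of `beta * V / (gamma + beta)` at level `lam / (gamma + beta)`.

```python
import numpy as np

def soft_threshold(V, tau):
    """Element-wise shrinkage: sign(V) * max(|V| - tau, 0)."""
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

def elastic_net_prox(V, lam, gamma, beta):
    """Minimizer of lam*|c| + gamma/2*c^2 + beta/2*(c - v)^2, element-wise."""
    return soft_threshold(beta * V / (gamma + beta), lam / (gamma + beta))

def scalar_objective(c, v, lam, gamma, beta):
    return lam * np.abs(c) + 0.5 * gamma * c**2 + 0.5 * beta * (c - v) ** 2

lam, gamma, beta = 0.5, 1.0, 2.0
V = np.array([-3.0, -0.2, 0.1, 2.5])
closed_form = elastic_net_prox(V, lam, gamma, beta)

# Brute-force check on a fine grid, one scalar problem per entry of V.
grid = np.linspace(-5.0, 5.0, 200001)
brute_force = np.array(
    [grid[np.argmin(scalar_objective(grid, v, lam, gamma, beta))] for v in V]
)
print(closed_form)
```

Small entries of `V` are set exactly to zero while large ones are shrunk, which is how the elastic net keeps the coefficient matrix sparse without cutting all within-cluster connections.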

Solution for : The closed-form solution for can be obtained similarly:

(18)

Solution for The derivation for the solution of is similar to .

(19)

Detailed derivations of each sub-problem's solution are provided in Appendix A. Finally, the Lagrange multipliers and are updated as:

(20)
(21)
(22)
(23)
(24)

Initialization: Since the proposed problem is non-convex, proper initialization is required for fast convergence. In this work, we obtain the rotation matrix R using [8] and initialize the shape matrix S as pinv(R) W, where W is the 2D measurement matrix. The remaining parameters were kept at fixed values. The complete implementation is provided in Algorithm 1.
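The pseudo-inverse initialization can be sketched on synthetic data as follows. The rotations here are random stand-ins rather than the output of [8], and all names are illustrative assumptions. The sketch shows that the initial shape reproduces the measurements exactly but is only the minimum-norm solution, so it generally differs from the true shape (the depth along each viewing direction is missing).

```python
import numpy as np

rng = np.random.default_rng(1)
F, P = 4, 10

# Block-diagonal orthographic camera matrix: two orthonormal rows per frame.
R = np.zeros((2 * F, 3 * F))
for f in range(F):
    Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
    R[2 * f:2 * f + 2, 3 * f:3 * f + 3] = Q[:2, :]

S_true = rng.standard_normal((3 * F, P))   # toy ground-truth shapes
W = R @ S_true                             # toy 2D measurements

# Initialization: minimum-norm shape consistent with the measurements.
S0 = np.linalg.pinv(R) @ W

print("re-projection error:", np.linalg.norm(R @ S0 - W))
print("distance to true shape:", np.linalg.norm(S0 - S_true))
```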

0:    2D feature track matrix , camera motion , , , , , , ; Initialize: , , , , , , , = ;
  while not converged do
     1. Update by Eq. (10), Eq. (11), Eq. (18), Eq. (19), Eq. (13) and Eq. (15); The new value for each variable is updated over iteration, which was initialized for the first iteration.
     2. Update and by Eq. (20)-Eq. (24);
     3. Check the convergence conditions , , , , and , ; ;
  end while
   , , , , .

  Form an affinity matrix from the estimated coefficient matrix, then apply spectral clustering [19] to it to achieve non-rigid motion segmentation.
Algorithm 1 Multi-body non-rigid 3D reconstruction and segmentation using ADMM
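The final segmentation step can be sketched as follows. The symmetrization |C| + |C|^T is a common choice in subspace clustering and is assumed here rather than taken from the paper, and the two-way split by the sign of the Fiedler vector is a simplified stand-in for the spectral clustering method of [19]. The coefficient matrix is a hand-built toy with two strongly connected groups and weak cross-connections.

```python
import numpy as np

def affinity_from_coefficients(C):
    """Symmetric, non-negative affinity built from a coefficient matrix."""
    return np.abs(C) + np.abs(C).T

def two_way_spectral_split(A):
    """Split a connected graph in two using the sign of the Fiedler vector."""
    laplacian = np.diag(A.sum(axis=1)) - A
    _, eigvecs = np.linalg.eigh(laplacian)   # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                  # second-smallest eigenvalue
    return (fiedler > 0).astype(int)

# Toy coefficients: 4 + 5 samples, strong links inside each group, weak across.
n_a, n_b = 4, 5
C = np.full((n_a + n_b, n_a + n_b), 0.01)
C[:n_a, :n_a] = 1.0
C[n_a:, n_a:] = 1.0
np.fill_diagonal(C, 0.0)                     # no self-representation

labels = two_way_spectral_split(affinity_from_coefficients(C))
print(labels)
```

For more than two objects, the same affinity matrix would be passed to a full spectral clustering routine that uses several eigenvectors followed by k-means.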

5 Experiments and Results

We performed extensive experiments on freely available benchmark datasets. We tested our approach on both real and synthetic data under sparse and semi-dense scenarios. Given the estimated 3D structure and the ground-truth structure, we use the following error metrics to evaluate the performance of the approach:
(i) Relative error in multi-body non-rigid 3D reconstruction

(25)

(ii) Error in multi-body non-rigid motion segmentation,