A New Approach To Two-View Motion Segmentation Using Global Dimension Minimization

04/10/2013 ∙ by Bryan Poling, et al. ∙ University of Minnesota 0

We present a new approach to rigid-body motion segmentation from two views. We use a previously developed nonlinear embedding of two-view point correspondences into a 9-dimensional space and identify the different motions by segmenting lower-dimensional subspaces. In order to overcome nonuniform distributions along the subspaces, whose dimensions are unknown, we suggest the novel concept of global dimension and its minimization for clustering subspaces with some theoretical motivation. We propose a fast projected gradient algorithm for minimizing global dimension and thus segmenting motions from 2-views. We develop an outlier detection framework around the proposed method, and we present state-of-the-art results on outlier-free and outlier-corrupted two-view data for segmenting motion.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 11

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

A classic problem in computer vision is that of feature-based motion segmentation from two views. In this problem one has two images, taken at different times, of a 3D scene. The scene is assumed to consist of multiple, independently moving rigid bodies. The goal is to identify the different moving objects and estimate a motion model for each one of them. For this purpose, one automatically tracks the locations of visually-interesting “features” in the scene, which are visible in both views (e.g., via Lucas Kanade type algorithm 

Lucas-Kanade-20years_2004

). Each feature is represented as a pair of 2-vectors, holding the image coordinates of the feature in the two different views; such a pair is referred to as a

point correspondence. The mathematical problem of feature-based, two-view motion segmentation is to both segment the point correspondences according to the rigid objects to which they belong, and estimate a motion model for each object.

A basic strategy to solve the feature-based, two-view motion segmentation problem is to first cluster point correspondences and then estimate the single-body motions within clusters (well-known methods for single-body motion estimation are described in Hartley2000 ; ma2004book ). This procedure was suggested in FengPerona98 , while clustering point correspondences with

-means or spectral clustering, and in 

Torr98geometricmotion , while alternating between clustering and motion segmentation via an EM procedure. Both clustering strategies of FengPerona98 and Torr98geometricmotion are based primarily on spatial separation between the clusters, however, different clusters in this setting may intersect each other (e.g., when motions share a symmetry).

Due to this problem, some algebraic methods have been developed for directly solving for the motion parameters, while eliminating the clustering of point correspondences Vidal+2views06 ; rao_ijcv . Another solution is to segment feature trajectories by taking into account their geometric structures, which may be different than spatial separation (the feature trajectories in 2-views are the 4-dimensional vectors concatenating the 2 point correspondences of the same feature from 2 views).

Costeira and Kanade Costeira98 showed that under the affine camera model, feature trajectories in -views (for these are vectors of length ) within each rigid body lie on an affine subspace of dimension at most 3. This observation has given rise to several feature-based motion segmentation schemes for -views, which are based on clustering subspaces; we refer to such clustering as Hybrid Linear Modeling (HLM). Many algorithms have been suggested for solving the HLM problem, for example, the -flats (KF) algorithm or any of its variants Tipping99mixtures ; Bradley00kplanes ; Tseng00nearest ; Ho03 ; MKF_workshop09 , methods based on direct matrix factorization Boult91factorization-basedsegmentation ; Costeira98 ; Kanatani01 ; Kanatani02

, Generalized Principal Component Analysis (GPCA) 

Vidal05 ; Ma07 ; Ozay10 , Local Subspace Affinity (LSA) Yan06LSA , RANSAC (for HLM) Yang06Robust , Agglomerative Lossy Compression (ALC) Ma07Compression , Spectral Curvature Clustering (SCC) spectral_applied , Sparse Subspace Clustering (SSC) ssc09 ; ssc_elhamifar13 , Local Best-Fit Flats (LBF and its spectral version SLBF) LBF_cvpr10 ; LBF_journal12 and Low-rank Representation (LRR) lrr_short ; lrr_long . Some theoretical guarantees for particular HLM algorithms appear in spectral_theory ; higher-order ; lp_recovery_part2_11 ; Soltanolkotabi2011 ; Soltanolkotabi+EC13 ; ALZ13 . Two recent reviews on HLM are by Vidal SubspaceClustering_Vidal and Aldroubi Aldroubi_review_13 .

For the more general and realistic model of the perspective camera, it can be shown that feature trajectories from two-views lie on quadratic surfaces of dimension at most 3 (in ) (see §2). Arias-Castro et al. higher-order suggested clustering the quadratic surfaces of point correspondences (in ) using Higher Order Spectral Clustering (HOSC) for manifold clustering. They demonstrated competitive results on the outlier-free database of rao_ijcv , when assuming that the clusters are of dimension 2. However, their results are not competitive for incorporating dimension 3 and they did not provide any numerical evidence that the dimension of the surfaces was 2 and not 3.

A different approach for clustering these particular quadratic surfaces can be obtained by embedding point correspondences into “quadratic coordinates” and then clustering subspaces. More precisely, if a point correspondence is mapped into , where denotes the Kronecker product, then these quadratic surfaces are mapped into linear subspaces of dimensions at most 8, which are determined by the fundamental matrices Hartley2000 ; ma2004book ; AtevKSCC of the different motions. Chen et al. AtevKSCC have used this idea for clustering such quadratic mappings of point correspondences by the Spectral Curvature Clustering (SCC) algorithm spectral_applied (they showed that instead of performing the actual mapping, one can apply the kernel trick). They claimed that other HLM algorithms (at that time) did not work well for such embedded data.

The drawback of applying SCC to this quadratic mapping of point correspondences in is that SCC does not work well with subspaces of mixed dimensions, and the subspace dimensions must be known a-priori. Unfortunately, the subspaces in this application have mixed and unknown dimensions (see §2). What makes SCC successful for this application is the fact that it takes into account some global information of the subspaces (i.e., for -dimensional subspaces it uses affinities based on arbitrary points, and in particular, far-away points). This helps SCC deal with nonuniform sampling along subspaces with local structure very different than the global one (see §2). On the other hand, local methods (e.g., LBF_journal12 ; mapa ) often do not work well in this setting.

The purpose of this paper is to develop an HLM algorithm that can successfully cluster the quadratically-embedded point correspondences in . In particular, it exploits the global structure of the underlying subspaces, i.e., their “dimensions”. We remark that earlier works Barbará00usingthe ; Gionis:2005:DIC:1081870.1081880 ; Haro06 ; Haro08TPMM used dimension estimators to segment data clusters according to their intrinsic dimension (Barbará00usingthe and Gionis:2005:DIC:1081870.1081880 used box counting estimators of some fractal dimensions and Haro06 ; Haro08TPMM used the statistical estimator of Levina05 ). However, their methods do not distinguish well subspaces of the same dimension. Here on the other hand, we aim to minimize “dimensions” within tentative clusters, instead of estimating them. Thus, we take into account the effect of tentative clusters on these “dimensions” and try to optimize accordingly the appropriate choice of clusters.

For this purpose, we propose a class of empirical dimension estimators, and a corresponding notion of global dimension for a mixture of subspaces (a function of the estimated dimensions of its constituent parts). We propose the global dimension minimization (GDM) algorithm, which is a fast projected gradient method aiming to minimize the global dimension among all data partitions. We also build an outlier detection framework into this development to allow for corrupted data sets. We demonstrate state-of-the-art results for two-view motion segmentation (via quadratic embedding), both in the outlier-free and outlier-corrupted cases. We even show that these results are competitive with the state-of-the-art results for multiple-views, i.e., using all frames of a video sequence (obtained under the affine camera model). To motivate the use of global dimension, we prove that for special settings and choice of parameters, the global dimension is minimized by the correct partition of the data (representing the underlying subspaces). We then discuss what to do in more general settings.

The paper is organized as follows: §2 briefly explains how the problem of 2-view motion segmentation can be formulated as a problem in HLM; §3 introduces global dimension and explains why its minimization can solve the HLM problem under some conditions; §4 develops a fast projected gradient method for minimizing global dimension; §5 develops an outlier detection/rejection framework for global dimension minimization; §6 demonstrates numerical results on real-world 2-view data sets for both outlier-removed and outlier-corrupted data; finally, §7 concludes this work. The appendix contains proofs of the key results in the paper.

2 Formulating 2-View Motion Segmentation as a Problem in HLM

One way of formulating the motion segmentation problem in terms of HLM is by exploiting the Affine Motion Subspace. Costeira and Kanade Costeira98 demonstrated that when a set of features all come from a single rigid body, then under the assumptions of the affine camera model, the corresponding feature trajectories lie in an affine subspace of dimension 3 or less. One can use this fact to partition the set of features by clustering their trajectories into subspaces. This is a popular formulation of the segmentation problem, even when dealing with only two views of a scene.

The formulation involving the affine motion subspace has the advantage that the feature trajectories tend to be nicely distributed in their respective subspaces, and the different subspaces all have nearly the same dimensions. This formulation has the drawback that it requires an affine camera model. The affine camera assumption breaks down when viewing objects close to the camera, or when looking at objects at significantly different ranges. The consequence of this is that the trajectories from a rigid body do not lie within a subspace, but rather in a manifold which is only locally approximated by a subspace of dimension at most 3.

When dealing with 2-view segmentation, there is another approach, based on a more general camera model, which avoids this problem of distortion. This approach assumes a perspective camera, and relies on the fundamental matrix Hartley2000 for a rigid body.

Indeed, if is the fundamental matrix for a rigid body, and and together form a point correspondence (in standard homogenous coordinates) from that body, then

which is algebraically equivalent to

(1)

where

and

We refer to the vector as the nonlinear or Kronecker embedding of a point correspondence (recall that is the Kronecker product). The vectors obtained through this nonlinear embedding for feature points on the same rigid object lie in a linear subspace of of dimension at most 8. Indeed, (1) says that there is a vector orthogonal to all of the feature trajectories in this set (it also shows that the linear embedding lies on a 3-dimensional quadratic manifold). However, the subspace dimension can decrease due to two different reasons. First of all, if there are very few points (per motion), then they may span a lower-dimensional subspace. The second cause is degeneracy in the 3D configuration of the features. If all world points and both camera centers live on a ruled quadratic surface111A surface is ruled if through every point of there exists a straight line that lies on ., then their corresponding subspace has dimension 7 or less. In particular, if all world points (but not necessarily the camera centres) are coplanar, the corresponding subspace will have dimension no larger than 6 (see (Hartley2000, , pg. 296)). Therefore, to make use of this embedding, the hybrid-linear modeling algorithm being employed must be tolerant of subspaces of mixed dimension.

Figure 1: Two views of a 3D scene with features overlayed (left), and the nonlinearly embedded point correspondences in , projected onto the 3-dimensional subspace spanned by their 3rd, 4th and 5th principal components (right). (Color figure online)

Since the perspective camera assumption is accurate in a much broader range of situations than the affine camera model, subspaces are more apparent with the nonlinear embedding than with the linear embedding. However, the nonlinear embedding distorts the original sampling and results in lower-dimensional structures (of dimension at most 3) within the higher dimensional subspaces (of typical dimensions 6, 7 or 8), which is a serious obstacle for many HLM algorithms, especially ones using local spatial information.

3 Global Dimension

From here on, we will be considering 2-view motion segmentation under the perspective camera model, i.e. using the Kronecker embedding. We will present a global HLM method, which is well-suited for handling the data which results from this embedding. We begin our development by providing some intuitive motivation for our approach.

Imagine that we have access to an oracle, who for any set of vectors in , can provide for us a good, robust estimate of the dimensionality of the set222In a noiseless case this would return the dimension of the linear span of the set of vectors. Now, suppose we have a set of vectors which are sampled from a hybrid-linear distribution. Consider a general partition of the data set, and define the “vector of set dimensions” for that partition to be the vector of oracle-provided approximate dimensions of each respective set in the partition. Our inspiration is the observation that for most partitions one may happen upon, each set in the partition will typically contain points from many of the underlying subspaces. The associated vector of set dimensions will contain relatively large numbers, and the -norm of this vector will be large. The -norm of the vector of set dimensions will be referred to as the global dimension of the partition. The best way to make the global dimension small, it would seem, is to try and decrease all of the elements of the vector of set dimensions by grouping together vectors that come from common subspaces. This notion will be made precise and we will show, in fact, that under certain conditions, the natural partition of the data set (the one where point assignment agrees with subspace affiliation) is a global minimizer of the global dimension function.

Our approach to HLM will be to find the partition of a data set that yields the lowest possible global dimension. In this section, we develop the global dimension objective function in two parts. In §3.1 we suggest a new class of dimension estimators that can perform the role of the oracle in our discussion above. In §3.2 we define global dimension and explain why we expect its minimizer to reveal the clusters corresponding to the underlying subspaces. A fast algorithm for this minimization will be later described in §4.

3.1 On Empirical Dimension

We present here a class of dimension estimators depending on a parameter . For and any , we use the notation to mean (even for , where is not a norm). For a given set of vectors in , , we denote by

the vector of singular values of the

data matrix (the matrix whose columns are the data vectors).

For the empirical dimension, denoted by (or simply ) is defined by

(2)

When , this is sometimes called the “effective rank”333“Effective rank” is sometimes defined differently. See Roy_theeffective . of the data matrix vershynin_book .

The following theorem explains why is a good estimator for dimension. Put simply, it says two things. First, if we rotate and/or uniformly scale our set of vectors by a non-zero amount, then the empirical dimension of the set does not change. Second, in the absence of noise, empirical dimension never exceeds true dimension, but it approaches true dimension in the limit (as the number of measurements goes to infinity) for spherically symmetric distributions. From now on we refer to -dimensional subspaces as -subspaces.

Theorem 3.1

For , possesses the following properties:

  1. is invariant under dilations (i.e., scaling).

  2. is invariant under orthogonal transformations.

  3. If are contained in a -subspace of , then .

  4. If

    are i.i.d. samples from a sub-Gaussian probability measure, which is spherically symmetric within a

    -subspace444A measure is spherically symmetric within a -subspace if it is supported on this subspace and invariant to rotations within this subspace. and non-degenerate555A measure is non-degenerate on a subspace if it does not concentrate mass on any proper subspace. In our setting the measure is also assumed to be spherically symmetric, and this assumption is equivalent to assuming the measure does not concentrate at the origin., then with probability 1.

To gain some intuition into the definition of empirical dimension, consider taking a large set of samples from a spherically symmetric distribution supported by a -subspace. Call the covariance matrix for this distribution . As the number of samples becomes large, the empirical covariance matrix approaches , which has the first elements on the main diagonal all equal (call the value

), and 0’s everywhere else. The empirical dimension of the set of vectors involves the singular values of the data matrix, which are approaching the square roots of the eigenvalues of

. Hence, as the number of samples increases, we get:

(3)

Thus, for any value of in , the empirical dimension approaches the true dimension of the set as the number of measurements increases.

If we look at a distribution that is not spherically symmetric, but still supported by a -subspace, then empirical dimension tends to under-estimate the true dimension of the distribution, even as the number of samples approaches infinity. This is actually desirable behavior. If we take a spherically symmetric distribution in a -subspace and imagine the process of collapsing it in one direction until it lies in a -subspace, then true dimension behaves discontinuously. The true dimension of a large set of samples will equal until the collapsing is complete; at that point the dimension will instantly drop to . Empirical dimension smoothly drops from to during this collapsing process. It is in this setting that we see the necessity of the parameter . This parameter controls how quickly the empirical dimension drops from to in this process. More generally, a low value of results in a “strict” dimension estimator (meaning that it will not under-estimate dimension easily, even when distributions are asymmetric). When is large (approaching 1), empirical dimension is a lenient dimension estimator. It is much more tolerant of noise, but it may consequently under-estimate the dimension of highly asymmetric distributions. The trade-off is that when dealing with noisy data or distributions only approximately supported by linear subspaces, a stricter estimator can mistakenly interpret noise or distortion as energy in new directions, thereby causing an over-estimate of dimension. Numerical experiments (e.g., Fig. 2) have shown that values of between 0.3 and 0.7 seem to provide reasonable estimators, which tend to agree with our intuitive notion of dimension.

Figure 2:

The experiment mentioned above is illustrated. A normally-distributed point cloud is created in

and is slowly collapsed into a plane and then a line. One can see that if is close to , empirical dimension more closely tracks true dimension, resulting in a strict dimension estimator. If is close to , empirical dimension changes more smoothly, resulting in a lenient estimator. (Color figure online)

In our application, since the perspective camera model is reasonably accurate (as opposed to the affine model), the nonlinear embedding of point correspondences results in subspaces with rather negligible distortion. We can thus afford a low value of . In fact, this is needed because the data vectors are frequently distributed in very non-isotropic ways with this embedding. Thus, to avoid underestimating dimension, we choose , which lies just slightly above the lowest value we confirmed for (0.3). Notice that this value is not “tuned” to individual data sets, but is chosen based on the properties of the application as a whole and the nature of the embedding.

3.2 On Global Dimension

Assume we are provided a data set in (in our application with the nonlinear embedding) and a partition of it for some (i.e., are disjoint subsets of whose union is ). We also assume that lies on a union of subspaces and denote the “correct” (or natural) partition of the data (where each subset contains only points from a single underlying subspace) by . For a fixed , are the empirical dimensions of the sets . We seek to minimize a function based on these dimensions to recover . To this end we define global dimension (GD). When thinking of this function, we take the set of data vectors to be fixed and given, and we view GD as a function of partitions, , of the set of data vectors. For a fixed (we discuss the meaning of later) we define GD as follows:

(4)

Our strategy for recovering will be to try and find the partition of the data set that minimizes . Intuitively, by trying to minimize the -norm of the vector of set dimensions, we are looking for a partition where all of the set dimensions are small. Imagine trying to minimize this objective function by hand, and starting with a partition close to, but not equal to . If there is a point assigned to the wrong cluster, then removing it from the set it is currently assigned to should result in a significant drop in the dimension of that particular set. Re-assigning that point to the correct set, on the other hand, will have little impact on the dimension of the target set because the point will lie approximately in the span of other points already in the set.

Thus, such a change would cause a significant drop in one of the set dimensions, without disturbing the other sets, and the global dimension will decrease. This would suggest that amongst partitions that are close to it, yields the lowest global dimension. Additionally, if one considers a “random”, or usual partition, then each set in that partition will tend to contain vectors from many different subspaces. Each set will have a large dimension, and the global dimension will exceed that of . This would suggest that minimizing global dimension may be a reasonable objective if we want to recover .

Unfortunately, there can exist certain special partitions of a data set that result in low global dimension (in some cases even lower than that of ). For example, let us choose , so that the global dimension of a partition is simply the sum of the dimensions of its constituent parts. Now consider 3 lines in the plane, and a data set consisting of many points sampled from each line. In this case will consist of three sets. Each set will contain only points from a single line. The dimension of each set in is 1. Hence, . On the other hand, if we consider the “degenerate” partition, that simply puts all points in a single set, then since we are in , the dimension of that set, and hence the global dimension of the partition, is 2.

The above example is actually rather special. Consider the same data set, but set to a large value instead of 1. When is large, the global dimension approximately returns the largest value from . Now consider minimizing this quantity, subject to the constraint that the partition contains no more than 3 sets. Minimizing global dimension in this setting penalizes partitions consisting of fewer, higher-dimensional sets instead of multiple, more balanced sets. Specifically, the global dimension of the degenerate partition is again approximately 2, while the global dimension of is approximately 1, since that is the maximum dimension of its constituent sets. In fact, as shown in the next theorem, using large effectively resolves the issue of special partitions yielding lower global dimension than .

We will consider the setting where we have distinct linear subspaces of , each of dimension . Call these subspaces . Assume we have a collection of non-degenerate measures supported by these subspaces (so that is supported by , ). Let be a set consisting of i.i.d. points from each (so ). We require that for each so that each subspace is adequately represented in the data set. Let be global dimension for a fixed parameter , defined using true dimension as the “dimension estimator” for a set. That is, where the sets are the constituent sets of the partition and returns the true dimension of its parameter set. Then, we get the following result:

Theorem 3.2

Let , , satisfy the conditions above. If

(5)

then amongst all partitions of into or fewer sets, the natural partition is almost surely (w.r.t ) the unique minimizer of .

The weakness of the above theorem is that it requires all of the intrinsic subspaces to have the same dimension. In practice, the global dimension objective function appears to be rather robust to subspaces with mixed dimensions. If there is a large difference in dimension between two subspaces in a dataset, then the minimum of global dimension tends to be very near , the only difference being that a few points from the higher-dimensional set are re-assigned to lower-dimensional sets to balance out the set dimensions.

Theorem 3.2 gives us a quantitative way of selecting an appropriate value of for our application. Specifically, if we identify the largest number of clusters we will need to address () and an upper bound for , then the right-hand side (RHS) of (5) gives us a lower bound on the value of . As long as is larger than this bound, then Theorem 3.2 ensures that the natural partition (uniquely) minimizes global dimension. Table 1 exemplifies the values of the RHS of (5) for different values of and . We do not want to choose extravagantly large because of the potential for numerical issues when taking large powers. In our setting, we want to be able to handle up to sets and we will use one less than the ambient dimension as an upper bound for the intrinsic dimension (). According to Theorem 3.2 we should select . In all of our experiments we set to give us a safety margin. We did some tests on one of our motion segmentation databases (outlier-free RAS) to see how sensitive the minimizer of global dimension is to in practice. We found that values as low as result in nearly identical performance to , and we don’t start to see significant degradation in results in the other direction until .

8 7 6 5 4
2 5.89 5.19 4.50 3.80 3.11
3 9.33 8.23 7.13 6.03 4.92
4 11.77 10.38 8.99 7.60 6.21
Table 1: Values of the RHS of (5) for various values of and . Theorem 3.2 ensures that is the unique minimizer of when is larger than these values

4 A Fast Algorithm for Minimizing Global Dimension

Global dimension is defined on the set of partitions of a data set. With a discrete domain, finding ways of quickly minimizing the objective function is non-trivial. In this section we briefly introduce a method, which we will call Global Dimension Minimization (GDM) for doing exactly this.

GDM is based on the gradient projection method (bertsekas1995nonlinear, , §2.3). In order to apply a gradient-based method, we need to re-formulate the problem so that we have a smooth objective function over a convex domain. To do this we employ the notion of fuzzy assignment. Rather than trying to assign each data point a label, identifying it with a single cluster, we allow each point to be associated with every cluster simultaneously, in varying amounts. Specifically, we assign each data point a probability vector where the ’th coordinate holds the strength of ’s affiliation with cluster . Assuming we have a data set of points in , and we seek clusters, we need probability vectors of length to encode the soft partition of the data. This membership information will be stored in a membership matrix, , where each column is a probability vector. Element of the matrix holds the strength of ’s affiliation with cluster .

The next step is to extend the definition of global dimension so that it is defined on soft partitions in a meaningful way. In its original formulation, to evaluate the global dimension of a partition, we would break up the data set into parts, based on the partition, and estimate the dimension of each part using empirical dimension. To extend this to soft partitions, we estimate the dimension of the ’th set in a partition by scaling each data point by its respective affiliation strength to set ( is multiplied by ). We then use empirical dimension to estimate the dimension of the scaled set. In essence, each point is now included in each dimension estimate. However, if a point is scaled so that it lays near the origin when considering a given set, it has little impact on the estimated dimension of that set. In fact, if we look at the global dimension of a soft partition that assigns each data point entirely to a single set ( has only 1’s and 0’s in it), then the global dimension of that soft partition, using our new definition, agrees with the global dimension of the corresponding “hard partition”, using our original definition. Thus, this change is a reasonable extension of the original definition to soft partitions. Our extended definition of global dimension is:

(6)

With this modified formulation, global dimension is an almost-everywhere differentiable function defined over the Cartesian product of -dimensional probability simplexes. One can check that this is a convex domain (the product of convex sets is convex). A natural approach to minimizing a problem of this sort is the gradient projection method (bertsekas1995nonlinear, , §2.3). In this method, we begin at some initial state, compute the gradient of the objective function, take a step in the direction opposite the gradient, and then project our new state back into the domain of optimization. This is repeated until our state converges.

The gradient of global dimension can be computed, but we need some notation first. For , we denote by the -by- matrix whose ’th column equals for (i.e., is the data matrix scaled according to weights for cluster ). Let be the thin SVD of , and denote the vector of elements from the diagonal of . Let . Define

(7)
Theorem 4.1

The derivative of global dimension w.r.t. an arbitrary element of the membership matrix is given by:

(8)

A proof of Theorem 4.1 is included in the appendix. This theorem allows us to evaluate the gradient vector of global dimension. As was mentioned before, in an iteration of the gradient projection method we take a step in the direction opposite the gradient. Computing a good step size is frequently a challenging task, but here we are fortunate. Our domain has a meaningful natural scale, since it is formed as a product of probability simplexes. Intuitively, our step size should be large enough to move us across the entire space in a reasonable number of steps, but small enough that any individual membership vector can move only a fraction of the way across its own simplex in one step. In practice, we scale each step so that the membership vectors most affected by the step move a distance of .3 on average. This seems to work well in general.

Finally, one can check that projecting onto the domain of optimization can be accomplished by individually projecting each column of onto the standard -dimensional probability simplex.

We have outlined a projected gradient descent method for minimizing global dimension. The above method forms the core of the GDM algorithm. However, since the global dimension function is non-convex (and hence may contain multiple local minimums), it is important to achieve reasonably good initialization. Our initialization strategy is inspired by ALC Ma07Compression . We start with a “trivial” partition where each point is in its own set, and we randomly select many pairs of sets in the partition. For each pair, we hypothetically merge the two sets and measure the resulting global dimension. We select the pair that results in the lowest global dimension when merged and we effect that merge. We then repeat the process iteratively until we have the desired number of sets in our partition (in each step the number of sets in the partition decreases by ). After initialization, the projected gradient descent algorithm is run until convergence (or for a fixed, but large number of iterations). Thresholding is performed to recover a “hard partition” from our soft partition (point is assigned to cluster if is the largest element from column of ). After this is done, we perform a final genetic stage to clean up small errors which may have occurred in any of the previous stages. This is done by taking each point and hypothetically re-assigning it to each different cluster (while all other point assignments are kept fixed) and retaining the assignment that results in the lowest global dimension. This is repeated a few times or until no single-point re-assignments reduce global dimension. This primarily helps with placing points that lie near the intersections of different subspaces (their fuzzy assignments may associate them almost equally to different subspaces, making them difficult to place). Finally, we run this entire process several times and return the best partition of all runs (as measured by global dimension).

0:  : data, : number of clusters, : global dimension parameter, : empirical dimension parameter, , , : number of iterations (default: , )
0:  A partition, , of into disjoint clusters
  for  do
      Partition of where each point is in its own set.
     while number of sets in greater than  do
         Randomly choose several pairs of sets.
         For each pair, measure the effect on global dimension if the pair is merged.
         Merge the pair of sets which results in the lowest global dimension.
     end while
      Convert to a soft partition, encoded in membership matrix .
     for  do
         Compute gradient of global dimension, .
         = average magnitude of largest 10% of columns of .
         Take a step in direction of length .
         Project each column of onto the standard -dimensional probability simplex.
     end for
      Convert back to a “hard partition”, , by thresholding.
     for  do
        for  do
            Check if re-assigning point to some other cluster decreases global dimension.
            If so, re-assign point to that cluster.
        end for
     end for
  end for
   return partition from all runs with lowest global dimension.
Algorithm 1 GDM Algorithm for HLM

4.1 Complexity of GDM

A thorough analysis of the computational complexity is not included here; this is a short summary of the computational aspects involved. The main numerical component of GDM is computing . For a single iteration its complexity is . Our choice of requires a sorting procedure and is thus of order operations for a single iteration. The initialization of the algorithm via ALC-type procedure Ma07Compression requires operations. Also, the last genetic step has the following complexity (without taking advantage of incremental SVD). In theory, we can make the algorithm linear in the number of points , by randomly initializing it, removing the genetic “clean-up” step and changing how we select our step size666This is assuming that we will not require more iterations to get close enough to the minimum that we can apply thresholding. In our experiments the number of needed iterations does not appear to grow with , but we do not have any results to guarantee this.. We have good numerical evidence, even with large , that this can result in good accuracy and speed for artificial data. Regardless, for the values of in our application the algorithm is sufficiently fast and these additional steps help improving accuracy, especially for points which are nearby several clusters (whose percentage is not negligible when is small).

5 Detecting and Rejecting Outliers with GDM

In practice, it turns out that the GDM algorithm described above is naturally robust to a small number of outliers (in that they do not tend to affect the classification of inliers), but no instruments were put in place for explicitly detecting or rejecting these outlying points. In this section, we introduce a modification to GDM that allows for explicit outlier detection and rejection. The guiding intuition is that an outlier has the property that if the true hybrid-linear structure is reflected in a partition, then no matter which group we assign the outlier to, it causes a significant increase in the empirical dimension of that group. This, in turn, results in a significant increase in global dimension. In other words, if we have a partition that reflects the true hybrid-linear structure of the data set, then there is no good place to put an outlier. If the algorithm was given the option of paying a fixed, low price for the right to ignore a given point, it would make sense for it to exercise this option on outliers, and only segment inliers.

We propose modifying the global dimension objective function, and the accompanying variational development in the following way:

(9)

where:

(10)

This modification adds an additional “cluster” to the problem (call it cluster 1), and we treat it differently than the others. Clusters through contribute to the global dimension in the same way that they did in the original development. Cluster contributes to the cost function the sum of the membership strengths of all data points to this cluster. This is the “fuzzy assignment” version of the following notion: we allow the algorithm to pay a fixed price, , for the right to ignore any particular data point (not assign it to any true cluster).

5.1 Modification to GDM

The proposed modification to the objective function only trivially changes the state space (now it is the product of -dimensional probability simplexes, as opposed to -dimensional simplexes). Thus, our method of projecting states onto the convex domain is effectively the same. The change to the objective function does mean that we must re-evaluate the gradient of global dimension. The computation is very similar to the unmodified version, and the result is:

(11)

and, for all

(12)

where the notation and constants are as defined in §4. Thus, the necessary modifications to GDM are:

  1. Update the evaluation of the objective function according to (9).

  2. Update the initialization of the state vector to include an outlier group.

  3. Update the state projection routine to accommodate additional dimensions in domain.

  4. Update the evaluation of according to 11 and 12.

5.2 Practical Implementations of Outlier Rejection

We have described an idea for how to handle outliers, but it introduces a new parameter, . It is not immediately clear how one should choose this parameter, and how sensitive the results will be to it. In theory one would need to choose an outlier cost, , that is not so high that nothing is ever assigned to the outlier group, but not so low that large quantities of inliers are assigned to this group. The appropriate values would likely depend on multiple quantities, like intrinsic dimension, noise level, and distortion of the underlying subspaces. These are quantities that can vary not just between applications, but also from data set to data set for a single application. Applying the suggested modification exactly as proposed (and trying to “tune” this parameter) would therefore lead to an unreliable and unpredictable algorithm. We refer to this approach as GDM-Naive, and Figure 3 illustrates why this method is unsound. Instead, we propose two variations of this method, which lead to more reliable solutions.

             The three images here illustrate the problem with GDM-Naive. Each triangle represents the probability simplex containing the fuzzy assignment vectors for a fictitious data set. The fuzzy assignment for each point is plotted after many iterations of GDM. Points in red are inliers and points in green are outliers. The quantization regions (for the threshold step) are numbered 1-3. One can see that if is not chosen correctly, points can be quantized into the wrong cluster. On the other hand, the outlier ranking of a point (the # of points closer the outlier corner) is a more stable quantity. (Color figure online)

Large : Nothing ends up in outlier group.
Medium : Only outliers end up in outlier group.
Low : Some inliers end up in outlier group.
Figure 3: Graphical depiction of the problem with GDM-Naive
  1. GDM Known-Fraction: Run the proposed algorithm with a fixed, low value of (we use ) but stop before the threshold step. Rank the data points according to their membership strengths to the outlier group. Remove a pre-set fraction of the data set (the part that most strongly affiliates with the outlier group). Continue with the classic (non-outlier version) of the variational algorithm on the surviving points only777We could skip this step and segment directly from the fuzzy assignment that we already have. Refining the membership matrix after removing the outliers is done to repair whatever damage the outliers may have done to the membership matrix before thresholding. - this provides the inlier segmentation. The points that were removed are labelled outliers.

  2. GDM Model-Reassign: Run method 1 above (GDM Known-Fraction). Fit subspaces of appropriate dimension (round the empirical dimension) to each set in the resulting partition. Re-assign all points (including those that were decided to be outliers) according to their distances from each subspace. Call a point an outlier if it is more than some fixed distance, , from all of the subspaces.

Each of the proposed methods handles the task of selecting , but introduces a new parameter. For method 1, this is the percentage of the data set to throw out. For method 2, the new parameter is the maximum distance a point can be from a subspace to be considered an inlier. Both of these parameters are more natural than selecting

. In a noisy environment, one may have an idea, based on experiments, of what percentage of the data set will be outliers, or what the inlier modelling error tends to be. Additionally, when using the “Model-Reassign” method, one could find the average and variance of the residuals,

, and respectively, when fitting subspaces to the inlier clusters. These quantities can be used to come up with a reasonable value of for a given application ( for some ). One could also find these values on a per-cluster basis and have a different outlier threshold for each cluster.

6 Results on Real-World Data

6.1 Performance in the Absence of Outliers

We tested the GDM algorithm on 2 motion segmentation databases. First, we used the outlier-free RAS database rao_ijcv ; AtevKSCC and compared with many leading methods in 2-view segmentation. We noticed that some of the HLM methods performed better when using the linearly embedded point correspondences than with the nonlinear embedding. Therefore, in Table 2 we present each of the competing HLM algorithms twice. Where “Linear” appears, the algorithm was run on the feature trajectories in . Where “Nonlinear” appears, the algorithm was run on the Kronecker products (in ) of the standard homogeneous coordinates of each feature correspondence. Figure 5 presents more details on the performance of the HLM methods with the nonlinear embedding, and Table 3 gives the average runtimes of these methods. The other HLM methods we included are SCC spectral_applied , MAPA mapa , SSC ssc09 , SLBF LBF_journal12 , and LRR lrr_short . We also included two other successful methods for two-views (for which there was a code available online): RAS rao_ijcv and HOSC higher-order . Algorithm parameters and our experiment procedure are detailed in §8.4.

(a) SCC
(b) GDM
(c) MAPA
Figure 4: Clustering by SCC, GDM, and MAPA on file 6 of the outlier-free RAS database. (Color figure online)
File Number Average Average
1 2 3 4 5 6 7 8 9 10 11 12 13 w/o File #8
Method/Embedding GDM Nonlinear 0.85 0.00 1.57 0.65 0.00 0.00 0.00 12.76 0.00 0.00 0.00 0.00 0.00 1.22 0.26
SCC Linear 0.85 0.00 1.18 0.65 0.00 1.37 0.00 1.42 0.39 0.00 0.00 1.01 0.00 0.53 0.45
SCC Nonlinear 0.85 0.00 24.41 0.00 0.00 19.18 0.00 0.00 0.00 13.97 5.36 0.84 1.10 5.05 5.48
MAPA Linear 0.85 3.65 1.18 0.65 0.00 13.70 15.97 1.29 0.00 0.00 0.00 0.67 3.30 3.17 3.33
MAPA Nonlinear 0.85 20.55 21.65 0.65 0.00 21.92 6.25 7.73 0.00 13.97 1.43 0.34 3.30 7.59 7.57
SSC Linear 1.69 18.26 0.79 1.94 0.00 0.00 6.25 32.22 0.00 0.00 14.64 1.35 4.40 6.27 4.11
SSC Nonlinear 1.27 0.00 22.44 0.65 0.00 21.92 0.00 9.02 0.00 13.97 9.29 12.12 6.59 7.48 7.35
SLBF Linear 0.85 0.46 1.18 0.65 0.00 0.00 0.00 0.26 0.00 0.00 0.00 0.67 4.40 0.65 0.68
SLBF Nonlinear 0.85 0.00 5.12 1.94 0.00 19.18 0.00 10.57 0.00 13.97 0.00 1.68 14.29 5.20 4.75
LRR Linear 5.08 24.66 1.18 2.58 2.38 2.74 0.00 29.12 0.00 0.00 8.93 14.81 18.68 8.47 6.75
LRR Nonlinear 1.27 9.13 2.76 1.94 0.00 0.00 3.47 3.61 0.00 0.00 9.64 18.18 2.20 4.02 4.05
RAS 11.65 0.00 2.56 9.68 16.19 26.03 26.74 11.21 3.28 13.97 3.21 2.36 6.59 10.27 10.19
HOSC d=2 0.85 0.00 24.41 1.61 0.00 0.00 0.00 22.94 0.00 0.00 0.00 0.00 2.20 4.00 2.42
HOSC d=3 1.27 23.74 24.41 3.23 0.00 19.18 12.15 19.59 23.75 0.00 1.43 1.01 17.58 11.33 10.65
Table 2: Misclassification Rates (given as % Error) on the outlier-free RAS database.
Figure 5: GDM is compared against other HLM methods on the nonlinear 2-view embedding of the outlier-free RAS database. (Color figure online)
Runtime (seconds)

Method

GDM 12.7
SCC 2.3
MAPA 5.6
SSC 89.5
SLBF 4.0
LRR 0.8
Table 3: Average runtimes (per file) of HLM-based methods on non-linearly embedded (outlier-free) RAS data.

From Table 2 and Figure 5, we can see that GDM performs very competitively on this database. There is only a single file () on which GDM exhibits significant error. This file contains features from two bent magazines as well as a rigid background. Since the bent magazines are clearly non-rigid, our model assumptions are not met (see Fig. 6). There were two methods in the comparison that had a lower average misclassification error than GDM (“SCC Linear” and “SLBF Linear”). This is because they perform significantly better on file (). Both of these are spectral methods, accompanied by the linear embedding, and are therefore better able to handle the manifold structure that results from the non-rigidity of the objects in this file. Amongst the other files however, GDM performs better on average than both of these two methods (see the last column of Table 2). Comparing just the HLM-based methods on the nonlinearly-embedded data, GDM performs better than any other method, with the most perfect classifications and the fewest number of files with significant errors. Figure 5 more clearly emphasizes this superb performance amongst methods using the nonlinear embedding.

(a) Frame 1
(b) Frame 2
Figure 6: File 8 in the RAS database. This is a problematic file because the two magazines in the scene appear to undergo a non-rigid transformation between the two frames. Point correspondences are colored according to ground-truth segmentation. (Color figure online)

We also performed experiments on the Hopkins155 database Tron07abenchmark . For 2-view segmentation we extracted the first and last frame of each sequence and performed 2-view segmentation on the nonlinear embedding (in ) of the data. For comparison, we demonstrate the results of some other HLM algorithms on this embedded data: MAPA mapa , SCC-MS spectral_applied ; LBF_journal12 and SLBF-MS LBF_journal12 . We also supply results for a few state-of-the-art HLM methods on the full -view feature trajectories. For these -view results we chose in this table the best methods on Hopkins155 we are aware of, which do not require careful tuning with parameters: SSC ssc09 and SLBF-MS LBF_journal12 . We also include the reference (REF) results Tron07abenchmark . REF finds the best linear models (via least squares approximation) for each cluster of embedded points (given the ground truth segmentation), and then finds new clusters by assigning points to the models they best agree with. For GDM on this database, it was necessary to increase the number of random initializations ( in Algorithm 1) to achieve reliable convergence (we changed it from 10 to 30). From Table 4

we see that GDM outperforms the other 2-view methods (although SCC matches or nearly matches its performance in some categories). We remark that we also tested a genetic algorithm for minimizing the global dimension and it achieved even more accurate results, however, we do not include it here since it is not as fast as GDM.

Checker Traffic Articulated All
2-motion Mean Median Mean Median Mean Median Mean Median

-view

GDM 2.79 0.00 1.78 0.00 2.66 0.00 2.51 0.00
MAPA 12.85 14.07 6.49 6.93 7.15 5.33 10.69 10.03
SCC (d=7) 2.79 0.00 1.97 0.00 3.42 0.00 2.64 0.00
SLBF (d=6) 8.18 1.39 3.98 0.53 4.73 0.40 6.78 1.11
-view SLBF-MS (2,3) 1.28 0.00 0.21 0.00 0.94 0.00 0.98 0.00
SSC-N (4,3) 1.29 0.00 0.29 0.00 0.97 0.00 1.00 0.00
REF 2.76 0.49 0.30 0.00 1.71 0.00 2.03 0.00
(a)
Checker Traffic Articulated All
3-motion Mean Median Mean Median Mean Median Mean Median

-view

GDM 5.37 3.23 4.23 2.69 5.32 5.32 5.14 3.13
MAPA 21.89 19.49 13.15 13.04 9.04 9.04 19.41 18.09
SCC (d=7) 8.05 5.85 4.67 5.45 5.85 5.85 7.25 5.45
SLBF (d=6) 14.08 12.80 7.93 6.75 4.79 4.79 12.32 9.57
-view SLBF-MS (2,3) 3.33 0.39 0.24 0.00 2.13 2.13 2.64 0.22
SSC-N (4,3) 3.22 0.29 0.53 0.00 2.13 2.13 2.62 0.22
REF 6.28 5.06 1.30 0.00 2.66 2.66 5.08 2.40
(b)
Table 4: The mean and median percentage of misclassified points for two-motions and three-motions in Hopkins 155 database with comparisons to state-of-the-art -views. Winning results amongst the 2-view methods are bold-faced in each category.

It is also interesting to note that our results for 2-views are comparable to the reference results with -views. That is, the results of GDM are the best one can expect with pure linear modeling given many views and assuming an affine camera model. GDM for -views gave comparable results and we thus did not include it. On the other hand, both SLBF-MS and SSC-N are able to obtain better results with -views and this may be because their machinery of spectral clustering (together with good choices of spectral weights) allows them to take into account some of the manifold structure and nearness of points (information beyond linear modeling).

6.2 Performance in the Presence of Outliers

We tested the methods suggested in §5.2 on the outlier-corrupted RAS database rao_ijcv . The performance of classic GDM (no outlier rejection machinery) is also presented on this database, as is the performance of GDM on the corresponding outlier-free database (for comparison purposes). We also show results from three competing methods for segmenting motion with outliers: RAS rao_ijcv , HOSC higher-order , and LRR lrr_short ; lrr_long with outlier rejection performed by identifying the largest columns of , as suggested in (lrr_long, , pg. 9). The details of this experiment, including parameter values, are given in §8.4.

It is non-trivial to fairly compare different algorithms in the presence of outliers. Each method generally has at least one parameter for controlling how it handles outliers. This parameter balances the desire for a high outlier detection rate with a desire for a low false alarm rate (these two quantities are invariably correlated). Using any popular metric for evaluating segmentation accuracy (like misclassification rate for true inliers888“True inliers” are points that are inliers according to ground truth.

), the performance of each algorithm will depend substantially on its outlier handling parameter. In general terms, if an algorithm is allowed to discard points as outliers more freely, then the accuracy on the surviving points will improve. Thus, if one method is more conservative than another in discarding points as outliers, the results will likely be skewed in favor of one method over the other. It is therefore important when looking at segmentation accuracy to think in terms of accuracy for a given

true positive rate (TPR) and false positive rate (FPR):

TPR
FPR

There are two aspects of these algorithms we wish to compare. The first is outlier detection performance (how good is each method at distinguishing between inliers and outliers). The second is segmentation performance, where we evaluate how good each method is at segmenting motions in the presence of outliers.

To compare the outlier detection performance of multiple methods, a common tool is the ROC curve, which parametrically plots the TPR vs. FPR as a function of the outlier parameter for a method. A “random classifier” that randomly labels points as inliers or outliers will have an ROC curve lying along the line

. An ideal classifier will follow the line . Hence, methods can be compared by seeing which ROC curve is highest over the broadest range of FPRs (or over the FPRs one is interested in). The ROC curves for GDM (using the Model-Reassign outlier detection method and varying ), LRR (by varying ), RAS (by varying “outlierFraction”), and HOSC (by varying ), are presented in Fig. 7.

File Number Average Average
1 2 3 4 5 6 7 8 9 10 11 12 13 w/o File #8
Method GDM - Model-Reassign 2.97 0.00 4.33 1.29 0.95 0.00 0.00 12.63 0.00 6.62 0.00 2.02 17.58 3.72 2.98
GDM - Classic 0.85 0.00 1.57 32.26 0.00 0.00 0.00 22.94 0.00 0.00 22.14 8.75 16.48 8.08 6.84
RAS 19.49 5.02 1.97 5.81 15.71 23.29 25.00 11.86 2.32 13.97 12.14 18.18 21.98 13.60 13.74
LRR 4.24 20.55 22.83 7.10 7.14 8.22 18.75 34.54 2.32 27.21 8.57 11.78 25.27 15.27 13.67
HOSC (d=2) 11.02 22.37 16.54 33.55 10.95 2.74 11.11 11.34 3.09 13.97 36.79 66.67 8.79 19.15 19.80
GDM - clean 0.85 0.00 1.57 0.65 0.00 0.00 0.00 12.76 0.00 0.00 0.00 0.00 0.00 1.22 0.26
Table 5: Misclassification Rates (given as % Error) of inliers on the RAS database. All but ‘Classic GDM - clean’ are misclassification rates when run on the outlier-corrupted datasets. ‘GDM - clean’ gives the performance of the unmodified GDM algorithm, when run on the outlier-removed datasets (included as a reference).

GDM was again run using the nonlinear embedding of the data. HOSC was run with the linear embedding and LRR was run with the nonlinear embedding since these were the cases that yielded the best performance in the outlier-free tests for each algorithm (see §8.4 for more details). From Fig. 7 we can see that GDM is very competitive at detecting outliers on this database. At low FPRs GDM yields comparatively excellent performance. At higher FPRs HOSC has a moderate advantage at outlier detection v.s. GDM. However, it will be seen later (Table 5) that HOSC is not competitive at segmentation in the presence of outliers. Furthermore, the presented HOSC results were prepared using (see §8.4), instead of as argued for by its authors. Using gave worse results and made the algorithm take an extremely long time to execute.

Figure 7: The outlier detection performance of GDM (Model-Reassign) is compared against other motion segmentation methods on the outlier-free RAS database. (Color figure online)

The TPR and FPR for a robust segmentation algorithm cannot generally be controlled independently or arbitrarily. Thus, for a comparison of segmentation accuracy, one must select “reasonable” parameters for each method, which correspond to the same general region of ROC space. It should be understood that since the TPR and FPR cannot be controlled exactly for each method, any such comparison is inherently unfair, and by manipulating outlier parameters the results can be skewed somewhat in any direction.

For the purpose of fairly comparing GDM with other methods, we must select only one of the suggested outlier detection schemes for GDM (“GDM - Known Fraction” or “GDM Model-Reassign”). To effectively use “GDM - Known Fraction”, one must either know roughly what fraction of his or her data are going to be outliers, or be in a situation where over-rejecting points as outliers is acceptable (you can then over-estimate the outlier fraction). Since this is not usually the case, we will consider the results of “GDM Model-Reassign” when comparing with other methods.

In Table 5 we present a file-by-file comparison of segmentation accuracy for the aforementioned methods using parameters that place the FPR of each method in the range of 0.01 to 0.08. Table 6 reports the average TPR and FPR for each of these methods.

TPR FPR

Method

GDM - Model-Reassign 0.56 0.01
GDM - Classic NA NA
RAS 0.74 0.08
LRR 0.49 0.04
HOSC 0.71 0.06
GDM - clean NA NA
Table 6: True Positive Rate (TPR) and False Positive Rate (FPR) for each method in our segmentation comparison in Table 5. GDM - Model Reassign, RAS, LRR, and HOSC were each tuned to achieve a false positive rate in the range of 0.01 to 0.08.

One can see from Table 5 that “GDM Model-Reassign” causes an overall improvement in segmentation accuracy (vs “GDM - Classic”) in the presence of outliers. There were several files where the outliers cause the classic GDM method to misclassify large fractions of the data sets (files 4, 8, and 11 have inlier misclassification rates over 20%). On these files the error rates of “GDM Model-Reassign” are dramatically lower. There are some files where the outlier detection framework appears to hurt performance, but in most of these cases the degradation is slight. The results for GDM are better in most cases (and on average) than the competing methods, although there are a few files where GDM is outperformed by a small margin. Unlike the strong outlier detection performance of HOSC discussed earlier, the segmentation capabilities of HOSC appear very intolerant to outliers (if even a few outliers slip through, segmentation performance suffers).

7 Conclusions

We presented a new approach to 2-view motion segmentation, which is also a general method for HLM. Its development was motivated by the main obstacle of recovering multiple subspaces within the nonlinear embedding of point correspondences into ; namely, the nonuniform distributions along subspaces (of unknown dimensions). The idea was to minimize a global quantity, i.e., global dimension. Unlike KSCC AtevKSCC , which also exploits global information, this approach does not make an a-priori assumption on the dimensions of the underlying subspaces. We formulated a fast method to minimize this global dimension, which we referred to as GDM. We demonstrated state-of-the-art results of GDM for 2-view motion segmentation.

We carefully explained the meaning of the two main parameters in our algorithm, and , and the trade-offs they express. We gave a theoretical basis for selecting an appropriate value of . Needless to say that these parameters are fixed throughout the paper. We described a preliminary theory which motivated the notion of global dimension, and we justified why it makes sense as an objective function in our application.

Finally, we presented an outlier detection/rejection framework for GDM. We explored two complimentary implementations of this framework, and we presented results demonstrating that it is competitive at handling outliers in this application.

8 Appendix

8.1 Proof of Theorem 3.1

We prove the four properties of the statement of the theorem. For simplicity we assume that . That is, the number of data points is greater than the dimension of the ambient space. This is the usual case in many applications.
Proof of Property 1: Clearly, scaling all data vectors by results in scaling all the singular values of the corresponding data matrix by . Furthermore, this results in scaling by both the numerator and denominator of the expression for the empirical dimension for any . Therefore, the empirical dimension is invariant to this scaling.
Proof of Property 2: The singular values of a matrix (in particular the data matrix) are invariant to any orthogonal transformation of this matrix and thus the empirical dimension is invariant to such transformation.
Proof of Property 3: If are contained in a -subspace, then since these form the columns of , . Since and are orthogonal, . In particular, has at most singular values. Let be the vector of singular values of , and let be the indicator vector of 999 has a in each coordinate where has a non-zero element, and ’s in all other coordinates..

The generalized Hölder’s Inequality (grafakos2004classical, , pg. 10) states that if:

(13)

then

(14)

To apply this result to vectors, we view them as functions over the set with counting measure.

Let , , . Also let , . These values satisfy (13). We therefore get:

(15)

Proof of Property 4: By hypothesis, the data vectors are i.i.d. and sampled according to probability measure , where is sub-Gaussian, non-degenerate, and spherically symmetric in a -subspace of . We define the th data matrix:

Then is the th sample covariance matrix of our data set. Also, let

be a random variable with probability measure

. Then is the covariance matrix of the distribution. A consequence of being spherically symmetric in a -subspace is that after an appropriate rotation of space, is diagonal with a fixed constant in of its diagonal entries and in all other locations. We are trying to prove a result about empirical dimension, which is scale invariant and invariant under rotations of space. Because of these two properties we can assume that the appropriate rotation and scaling has been done so that is diagonal with value in diagonal entries and in all others. Without any loss of generality, we assume that the first diagonal entries are the non-zero ones.

Let , , denote the vector of singular values of the matrix . Our first task will be to show that converges in probability (as ) to the vector:

(16)

To accomplish our task, we will first relate to the vector of singular values of , and then use a result showing that converges to as .

It is clear that the vector of singular values of , which we will denote by , is given by:

(17)

Next, we will need the following result regarding covariance estimation. This is Corollary 5.50 of vershynin_book , adapted to be consistent with our notation.


Lemma 1 (Covariance Estimation): Consider a sub-Gaussian distribution in

with covariance matrix . Let , and . If , then with probability at least , , where denotes the spectral norm (i.e., largest singular value of the matrix). The constant depends only on the sub-Gaussian norm of the distribution.


In our problem, we are applying this lemma to the distribution . Let be given. If

(18)

then with probability at least . The -norm of the difference of two matrices bounds the differences of their individual singular values. We will use the following result to make this precise:


Lemma 2 bhatia:1997 : Let denote the th largest singular value of an arbitrary -by- matrix. Then: , for each .


Because is diagonal with only values and on the diagonal, the singular values of are simply these diagonal values. We will use to denote the ’th singular value of .

Setting and , in lemma 2 we get: , for each . This implies that:

(19)

Notice that as , approaches . Specifically, for any desired tolerance, , and any desired certainty, , can be chosen large enough that with probability greater than , , simultaneously for each . It follows from this that the vector converges in probability to (16) as .

Finally, . Thus, is a continuous function of the vector . Hence, since converges to as , converges in probability to

(20)

8.2 Proof of Theorem 3.2

Recall that denotes the natural partition of the data set. First, we notice that , where is the true dimension of set of the partition. Notice that cannot exceed since is supported by , a -subspace. Furthermore, since does not concentrate mass on subspaces it is a probability 0 event that all points from exist in a proper subspace of . Thus, for the natural partition, is almost surely , for each . Hence, is almost surely .

Next, we will find a lower bound for the global dimension of any non-natural partition of the data, and show that if meets the hypothesis criteria, the lower bound we get is greater than . To accomplish this we need the following lemma.

Lemma 1

If then almost surely has one set with dimension at least .

Before proving the lemma, observe that a consequence is that if , then with probability 1:

(21)

Then, from our hypothesis:

(22)

Hence,

(23)

Thus, if we show Lemma 1, the proof of the theorem follows. To prove Lemma 1 we require an a simpler lemma:

Lemma 2

If a set in has fewer than points from a subspace , then either has dimension at least or adding another point from to (an R.V. with probability measure , independent from all other samples) will almost surely increase the dimension of by 1.

Proof

If then has dimension strictly less than the ambient space (). Observe that is a linear subspace of , which a.s. does not contain . We cannot have proper containment since . Also, we have fewer than points from in , and each other point in lies in with probability 0 (All do not concentrate mass on subspaces). Thus, a.s. does not equal .

Therefore, if we intersect with we get a proper subspace of ; call it . We note that since does not concentrate on subspaces. Thus, since has probability measure , a.s. lies outside the intersection of and . It follows that if we add to , the dimension of a.s. increases by 1.

Now we prove Lemma 1. We will assume all sets in have dimension less than and pursue a contradiction. By hypothesis, our set contains at least points from each subspace . Since , there is some subspace whose points are assigned to or more distinct sets in . Let be a point from . Now, choose points from each and denote this collection of points . When making this selection, ensure that is not chosen and that of the points selected from , not all of them are assigned to the same set in as . Notice that induces a partition on .

Select any point and remove it from the set . Since we are assuming that each set in has dimension less than , Lemma 2 implies that the set in to which belongs will have its dimension decrease by . Now select another point and remove it. Lemma 2 still applies and so the set to which belonged will have its dimension decrease by 1. We can repeat this until all points have been removed. Since each removal decreases the dimension of some set in by it follows that before any removals the sum of the dimensions of all sets in was at least . Since each of the sets in had dimension or less, we conclude that in fact each set must have had dimension exactly .

Now, consider our set and add in . By our choice of , Lemma implies that its addition a.s. increases the dimension of its target set in by (to ). Adding in all remaining points from will only increase the dimensions of the sets in . Thus, we almost surely have a set of dimension at least in , contradicting our hypothesis.

8.3 Proof of Theorem 3

Recall that the soft partition is stored in a membership matrix . Specifically, the ’th element of , denoted , holds the “probability” that vector belongs to cluster . Thus, each column of forms a probability vector.

Hence, global dimension is a real-valued function of the matrix . We will think of the membership matrix as being vectorized, so that the domain of optimization can be thought of as a subset of . However, we will not explicitly vectorize the membership matrix. Thus, when we talk about the gradient of global dimension, we are referring to another -by- matrix, where the ’th element is the derivative of global dimension w.r.t. .

To differentiate global dimension we must be able to differentiate the singular values of a matrix w.r.t. each element of that matrix. A treatment of this is available in PapadopouloSVD .

To begin, recall the definition of :

(24)

We will denote the thin SVD (only columns of and are used) of :

(25)

Also, we will let refer to the ’th element of

. Then, using the chain rule: