SubKit
Subspace Clustering Library
view repo
We propose Ordered Subspace Clustering (OSC) to segment data drawn from a sequentially ordered union of subspaces. Similar to Sparse Subspace Clustering (SSC) we formulate the problem as one of finding a sparse representation but include an additional penalty term to take care of sequential data. We test our method on data drawn from infrared hyper spectral, video and motion capture data. Experiments show that our method, OSC, outperforms the state of the art methods: Spatial Subspace Clustering (SpatSC), Low-Rank Representation (LRR) and SSC.
READ FULL TEXT VIEW PDFSubspace Clustering Library
In many areas such as machine learning and image processing, high dimensional data are ubiquitous. This high dimensionality has adverse effects on the computation time and memory requirements of many algorithms. Fortunately, it has been shown that high dimensional data often lie in a space of much lower dimension than the ambient space
elhamifar2012sparse ; vidal2011subspace. This has motivated the creation of many dimension reduction techniques. These techniques, such as Principal Component Analysis (PCA), assume that the data belongs to a single low dimensional subspace
candes2011robust . However in reality the data often lies in a union of multiple subspaces. Therefore it is desirable to determine the subspaces in the data so that one can apply dimension reduction to each subspace separately. The problem of assigning data points to subspaces is known as subspace segmentation.Given a data matrix of observed column-wise samples , where is the dimension of the data, the objective of subspace segmentation is to learn corresponding subspace labels . Data within is assumed to be drawn from a union of subspaces of dimensions . Both the number of subspaces and the dimension of each subspace are unknown. To further complicate the problem it is rarely the case that clean data is observed. Instead we usually observe data which has been corrupted by noise. Subspace segmentation is a difficult task since one must produce accurate results quickly while contending with numerous unknown parameters and large volume of potentially noisy data.
The use of subspace segmentation as a pre-processing method has not been limited to dimensionality reduction. For example it has been used in other applications such as image compression hong2006multiscale , image classification zhang2013learning ; DBLP:conf/dicta/BullG12
liu2012fixed ; liu2011latent , image segmentation yang2008unsupervised ; cheng2011multi . Furthermore state-of-the-art subspace segmentation has shown impressive results for pure segmentation tasks such as identifying individual rigidly moving objects in video tomasi1992shape ; costeira1998multibody ; kanatani2002motion ; jacquet2013articulated , identifying face images of a subject under varying illumination basri2003lambertian ; georghiades2001few , segmentation of human activities zhu2014complex and temporal video segmentation vidal2005generalized .This paper is concerned with a variant of subspace segmentation in which the data has a sequential structure. The data is assumed to be sampled at uniform intervals in either space or time in a single direction. For example video data which as a function of time has a sequential structure vidal2005generalized ; tierney2014subspace where it is assumed that frames are similar to their consecutive frames (neighbours) until the scene ends. Another example is hyper-spectral drill core data elhamifar2012sparse , which is obtained by sampling the infrared reflectance along the length of the core. The mineralogy is typically stratified meaning segments of mineral compounds congregate together Guo.Y;Gao.J;Li.F-2013 ; guo2014spatial . The sequential structure implies that consecutive data samples are likely to share the same subspace label i.e. , until of course a boundary point is reached.
This papers main contribution is the proposal and discussion of the Ordered Subspace Clustering (OSC) method, which exploits the sequential structure of the data. Experimental evaluation demonstrates that OSC outperforms state-of-the-art subspace segmentation methods on both synthetic and real world datasets. A preliminary version of this paper was published in CVPR14 tierney2014subspace . The optimisation scheme that was suggested in the preliminary version lacked a guarantee of convergence and suffered from huge computational cost. In this paper we provide two new optimisation schemes to solve the OSC objective, which have guaranteed convergence, much lower computational requirements and can be computed in parallel. Furthermore we perform experiments on new synthetic and real datasets.
The state-of-the-art methods in subspace segmentation are the spectral subspace segmentation methods such as Sparse Subspace Clustering (SSC) and Low-Rank Representation (LRR). Spectral subspace segmentation methods consist of two steps:
Learn the subspace structure from the data
Interpret the structure as an affinity matrix and segment via spectral clustering
The main difference between spectral methods is in their approaches to learning the subspace structure.
To learn the subspace structure of the data, spectral subspace segmentation methods exploit the the self expressive property elhamifar2012sparse :
each data point in a union of subspaces can be efficiently reconstructed by a combination of other points in the data.
In other words a point in a subspace can only be represented by a linear combination of points from within the same subspace. Unless the subspaces intersect or overlapping, which is assumed to be extremely unlikely in practice. This leads to the following model
(1) |
where
is a vector of coefficients, which encode the subspace structure. Due to the self-expressive property the non-zero elements of
will correspond to samples in that are in the same subspace as sample . Therefore learning the coefficient vectors for each data sample can reveal some of the underlying subspace structure. The model can be expressed for all data points aswhere columns of .
After learning the next step is to assign each data point a subspace label. The first step in this process is to build a symmetric affinity matrix. The affinity matrix is usually defined as
(2) |
where element of is interpreted as the affinity or similarity between data points and . Next this affinity matrix is used by a spectral clustering method for final segmentation. Normalised Cuts (NCut) shi2000normalized is the de facto spectral clustering method for this task elhamifar2012sparse ; liu2010robust .
So far it has been assumed that the original and clean data is observed. Unfortunately this ideal situation is rare with real world data. Instead the data is usually corrupted by noise in the data capture process or during data transmission of the data. Therefore most subspace clustering methods assume the following data generation model
where is the original data where each point (column) lies on a subspace and is noise.
follows some probability distribution. Two common assumptions for
are Gaussian distribution and Laplacian distribution.
Since it may be difficult to isolate the original data from the noise , most subspace clustering methods actually address the issue of noise by allowing greater flexibility in the self-expressive model. The self-expressive model usually becomes
where is a fitting error and is different from .
Sparse Subspace Clustering (SSC) was originally introduced by Elhamifar & Vidal elhamifar2012sparse ; elhamifar2009sparse . SSC adopts concepts from the domain of sparse models, namely that
there exists a sparse solution, , whose nonzero entries correspond to data points from the same subspace as .
In the case where the observed data is noiseless, i.e. we have , each data point lying in the -dimensional subspace can be represented by points. This corresponds to the sparse representation of points, ideally a sparse solution should only select coefficients belonging to the same subspace as each point. Furthermore the number of non-zero coefficients should correspond to the dimension of the underlying subspace. The sparsity goals of SSC could be achieved through a solution to the following
(3) |
where is called the norm and is defined the number of non-zero entries. The diagonal constraint is used to avoid the degenerate solution of expressing the point as a linear combination of itself. However this problem is intractable, instead the convex relaxation norm is used
(4) |
The is the norm and is defined as
i.e. the sum of absolute values of the entries. We call this heuristic SSC.
To overcome the simultaneous presence of noise and outliers, Elhamifar & Vidal
elhamifar2009sparse devised the following alternative(5) | |||
where is the Frobenius norm and is high magnitude sparse fitting error. This model allows for flexibility in the fitting error since setting either or to eliminates or from the model. This only compensates for fitting errors however shows surprising robustness in practice.
Recent work by Soltanolkotabi, Elhamifar and Candes soltanolkotabi2014robust showed that under rather broad conditions using noisy data the approach should produce accurate clustering results. These conditions include maximum signal-to-noise ratio, number of samples in each cluster and distance between subspaces and appropriate selection of parameters. They use the following relaxed objective
(6) |
with regularisation parameter tuned for each data sample.
In practice SSC allows for efficient computation. Each column of can be computed independently and in parallel with the other columns. In contrast to LRR (discussed next) the computational requirements are lightweight. Furthermore is sparse, which can reduce memory requirements and decrease time computational requirements and time spent during the final spectral clustering step.
Rather than compute the sparsest representation of each data point individually, Low-Rank Representation (LRR) by Liu, Lin and Yu liu2010robust attempts to incorporate global structure of the data by computing the lowest-rank representation of the set of data points. Therefore the objective becomes
(7) |
This means that not only can the data points be decomposed as a linear combination of other points but the entire coefficient matrix should be low-rank. The aim of the rank penalty is to create a global grouping effect that reflects the underlying subspace structure of the data. In other words, data points belonging to the same subspace should have similar coefficient patterns.
Similar to SSC, the original objective for LRR is intractable. Instead the authors of LRR suggest a heuristic version which uses the closest convex envelope of the rank operator: the nuclear or trace norm. The objective then becomes
(8) |
where
is the nuclear norm and is the sum of the singular values. The singular values can be computed through the Singular Value Decomposition (SVD).
LRR has achieved a lot of attention in the subspace segmentation community, which had led to some interesting discoveries. The most surprising of which is that there is a closed form solution to the heuristic noiseless LRR objective. The closed form solution is given by the Shape Interaction Matrix (SIM) and is defined as
where are the right singular vectors given by the SVD of .
In the case where noise is present, the authors of LRR suggested a similar model to that used in SSC. However they assume that their fitting error will be only be present in a small number of columns. This results in the following objective
(9) |
where is the norm.
Even though LRR has shown impressive accuracy performance in many subspace segmentation tasks it has two drawbacks:
high computational cost,
large memory requirements.
LRR’s high computational cost comes from the required computation of the SVD of at every iteration. Depending on the convergence tolerance LRR may iterate hundreds or even thousands of times. However some improvements have been made by computing partial or skinny SVD approximations. Similarly the large memory requirements of LRR stem from the computation of the SVD of . Since the number of elements in scales quadratically with the number of number of data samples it may not be possible to apply LRR even for modest datasets. Work has been done in fast approximations of SVD liberty2007randomized ; woolfe2008fast but it has not yet been applied to LRR at the time of writing.
Laplacian Regularised LRR Yin:rz and LRR with Local Constraint Zheng2013398 incorporate Laplacian regularisation to ensure that data points close in the ambient space share similar coefficient structure. The objectives for both approaches can be generalised to
(10) |
where is a placeholder for a fitting term and other regularisation such as nuclear norm or on and is a weight based on distance between sample and .
Spatial Subspace Clustering (SpatSC) Guo.Y;Gao.J;Li.F-2013 extended SSC by incorporating a sequential neighbour penalty
(11) | |||
where is a lower triangular matrix with on the diagonal and on the second lower diagonal:
(12) |
Therefore . The aim of this formulation is to force consecutive columns of to be similar.
The assumption for Ordered Subspace Clustering (OSC) tierney2014subspace is that the data is sequentially structured. Since physically neighbouring data samples are extremely likely to lie in the same subspace they should have similar coefficient patterns. Consider a a video sequence from a television show or movie. The frames are sequentially ordered and each scene lies on a subspace. Since the scene changes are relatively rare compared to the high frame rate it is extremely likely that consecutive frames are from the same subspace. In other words the columns of should follow the rule .
Similar to SpatSC, OSC extends SSC with an additional regularisation penalty. The objective is as follows:
(13) |
where is defined as in (12).
Instead of the norm over as used in SpatSC, OSC uses the norm to enforce column similarity of . In contrast to SpatSC this objective much more strictly enforces column similarity in . only imposes sparsity at the element level in the column differences and does not directly penalise whole column similarity. Therefore it allows the support (non-zero entries) of each consecutive column to vary. In effect this allows some values in consecutive columns to be vastly different. This does not meet the stated objective of .
Thus in (13), the weak penalty from SpatSC has been replaced with the stronger penalty to strictly enforce column similarity. We remove the diagonal constraint as it is no longer required in most cases and can interfere with column similarity. However we discuss in the following section how to include the constraint if it is required.
In the preliminary version of this paper tierney2014subspace a procedure to solve the relaxed version of the objective (13) was discussed. However the former procedure lacked a guarantee of convergence. Furthermore the procedure suffered from huge computational complexity due to the expensive Sylvester equation penzl1998numerical ; golub1979hessenberg required.
In this paper two new procedures are discussed for relaxed and exact variants, both of which have guaranteed convergence and reduced computational complexity. For a demonstration of the speed improvements please see Section 8 and Figure 5. There improvements have been achieved through adoption of the LADMAP (Linearized Alternating Direction Method with Adaptive Penalty) boyd2011distributed , LinLiuSu2011 and LADMPSAP (Linearized Alternating Direction Method with Parallel Spliting and Adaptive Penalty) DBLP:conf/acml/LiuLS13 frameworks. Overviews of the procedures can be found in Algorithms 1 and 2.
The relaxed variant of (13) can be written as:
(14) |
Then the Augmented Lagrangian for the introduced auxiliary variable constraint is,
(15) |
Objective (15) will be solved for and in a sequential and alternative manner when fixing the other, respectively. Given the solution state and adaptive constant , the procedure for is as follows:
Update by solving the following subproblem
(16) |
which is equivalent to
(17) | ||||
There is no closed form solution to the above problem because of the coefficient matrices and on . Thus linearisation over the last three terms is used. Denote and which is from the augmented Langrangian. The linear approximation at BeckTeboulle2009 for and respectively, is
(18) |
and
(19) |
where and
Problem (20) is separable at element level and each has a closed-form solution defined by the soft thresholding operator, see bach2011convex ; liu2010efficient , as follows
(21) |
Given the new value from last step, is updated by solving
The linear term is easily absorbed into the quadratic term such that a solvable problem can be achieved as follows,
(22) |
where with a constant .^{1}^{1}1Ideally . For the purposes of convergence analysis in Section 5.4 is set to larger than 1. Denote by , then the above problem has a closed-form solution defined as follows,
(23) |
where and are the -th columns of and , respectively. Please refer to liu2010robust .
Update by
(24) |
Update adaptive constant by
The entire procedure for solving the relaxed OSC objective is summarized in Algorithm 1. This set of updating rules is a generalisation of those in LADMAP LinLiuSu2011 , as such it will be referred to as v-LADMAP for short. Note however, that in the original LADMAP the linearisation is performed only on the augmented Lagrange term, i.e. on , based on which the convergence analysis is carried out. Whereas in the v-LADMAP case, both and are linearised in order to obtain a closed-form solution to . This difference means that the convergence analysis in LADMAP is no longer applicable here. As such, detailed analysis on the convergence of v-LADMAP is provided in Section 4.4.
Similar to the relaxed version, auxiliary constraint variables are introduced
(25) | |||
Then the Augmented Lagrangian form is used to incorporate the constraints
(26) |
In problem (26), there are three primary variables , and , so a simple linearised ADM as used in the previous subsection may diverge in the multi-variable case as demonstrated in DBLP:conf/acml/LiuLS13 . To overcome this the so-called Linearized Alternating Direction Method with Parallel Splitting and Adaptive Penalty method (LADMPSAP) is adopted, which for problem (26), consists of the following steps, see DBLP:conf/acml/LiuLS13 :
Update
(27) |
Define
By linearizing , (27) can be approximated with the following proximal problem
where ( is an appropriate constant) and
As discussed before the solution is given by the soft thresholding operator defined in (21) with .
Update by
This is a least square problem whose solution can be given by
(28) |
Update by
The solution can be obtained by using (23) with replaced by .
Update multipliers with the new values of primary variables by
Update
The entire procedure is summarised in Algorithm 2.
In some cases, it may be desirable to enforce the constraint i.e. we should not allow each data point to be represented by itself. The objective becomes
(29) | |||
To enforce such a constraint it is not necessary to significantly alter the aforementioned optimisation schemes. This constraint only affects the step involving . Since this step is the soft shrinkage operator and is separable at the element level one can simply set the diagonal entries to afterwards. In other words the update solution (21) becomes
(30) |
LADMPSAP adopts a special strategy that updates all primary variables in parallel using their values from the last iteration. See Step 2 to Step 11 and the equations they refer to. The LADMPSAP algorithm for problem (25) is guaranteed to converge. For the convergence analysis, please refer to DBLP:conf/acml/LiuLS13 . The convergence theorem is repeated here with some modifications reflecting the settings in our problem.
If is non-decreasing and upper bounded, , then the sequence generated by Algorithm 2 converges to a KKT point of problem (25).
Differently in v-LADMAP, which is used to solve the relaxed objective, updating the primary variables is performed in sequence. Meaning that one updated primary variable is used immediately to update another primary variable so that the optimisation is carried out by alternating directions sequentially. In Step 2 in Algorithm 1, the updated value of is used to obtain . The proof of convergence for LADMAP does not completely extend to v-LADMAP and since variables are updated sequentially the convergence from LADMPSAP does not apply either. As such the convergence theorem for v-LADMAP is presented in the remainder of this section.
Consider the original relaxed constrained version (14). The KKT conditions of problem (14) lead to the following: there exists a triplet such that
(31) | |||
(32) |
where denotes the subdifferential.
The following relations hold
(33) | ||||
(34) |
where
(35) | |||
(36) |
For the sequence generated by Algorithm 1 the following identity holds