# High Dimensional Low Rank plus Sparse Matrix Decomposition

This paper is concerned with the problem of low rank plus sparse matrix decomposition for big data. Conventional algorithms for matrix decomposition use the entire data to extract the low-rank and sparse components, and are based on optimization problems with complexity that scales with the dimension of the data, which limits their scalability. Furthermore, existing randomized approaches mostly rely on uniform random sampling, which is quite inefficient for many real world data matrices that exhibit additional structures (e.g. clustering). In this paper, a scalable subspace-pursuit approach that transforms the decomposition problem to a subspace learning problem is proposed. The decomposition is carried out using a small data sketch formed from sampled columns/rows. Even when the data is sampled uniformly at random, it is shown that the sufficient number of sampled columns/rows is roughly O(rμ), where μ is the coherency parameter and r the rank of the low rank component. In addition, adaptive sampling algorithms are proposed to address the problem of column/row sampling from structured data. We provide an analysis of the proposed method with adaptive sampling and show that adaptive sampling makes the required number of sampled columns/rows invariant to the distribution of the data. The proposed approach is amenable to online implementation and an online scheme is proposed.


## I Introduction

Suppose we are given a data matrix D ∈ ℝ^{N1×N2}, which can be expressed as

 D = L + S, (1)

where L is a low rank (LR) matrix and S is a sparse matrix with arbitrary unknown support, whose entries can have arbitrarily large magnitude. Many important applications in which the data under study can be naturally modeled using (1) were discussed in [1]. The cutting-edge Principal Component Pursuit approach developed in [1, 2] directly decomposes D into its LR and sparse components by solving the convex program

 min_{˙L, ˙S} λ∥˙S∥_1 + ∥˙L∥_* subject to ˙L + ˙S = D (2)

where ∥·∥_1 is the ℓ1-norm, ∥·∥_* is the nuclear norm, and λ determines the trade-off between the sparse and LR components [2]. The convex program (2) can precisely recover both the LR and sparse components if the column and row subspaces of L are sufficiently incoherent with the standard basis and the non-zero elements of S are sufficiently diffused [2]. Although the problem in (2) is convex, its computational complexity is intolerable for large volumes of high-dimensional data. Even the efficient iterative algorithms proposed in [3, 4] have prohibitive computational and memory requirements in high-dimensional settings.

Contributions: This paper proposes a new randomized decomposition approach, which extracts the LR component in two consecutive steps. First, the column-space (CS) of L is learned from a small subset of the columns of the data matrix. Second, the row-space (RS) of L is obtained using a small subset of the rows of D. Unlike conventional decomposition methods that use the entire data, we only utilize a small data sketch, and solve two low-dimensional optimization problems in lieu of one high-dimensional matrix decomposition problem (2), resulting in significant running time speed-ups.

To the best of our knowledge, it is shown here for the first time that the sufficient number of randomly sampled columns/rows scales linearly with the rank and the coherency parameter of L even with uniform random sampling. Also, in contrast to the existing randomized approaches [5, 6, 7], which use blind uniform random sampling, we propose a new methodology for efficient column/row sampling. When the columns/rows of L are not distributed uniformly in the CS/RS of L, which is the case for much real world data, the proposed sampling approach is shown to achieve significant savings in data usage compared to uniform random sampling-based methods, which require remarkable portions of the data. The proposed sampling algorithms can be independently used for feature selection from high-dimensional data.

In the presented approach, once the CS is learned, each column is decomposed efficiently and independently using the proposed randomized vector decomposition method. Unlike most existing approaches, which are batch-based, this unique feature enables applicability to online settings. The presented vector decomposition method can be independently used in many applications as an efficient vector decomposition algorithm or for efficient linear decoding [8, 9, 10].

### I-a Notation and definitions

We use bold-face upper-case letters to denote matrices and bold-face lower-case letters to denote vectors. Given a matrix A, ∥A∥ denotes its spectral norm, ∥A∥_F its Frobenius norm, and ∥A∥_∞ the infinity norm, which is equal to the maximum absolute value of its elements. In an N-dimensional space, e_i is the i-th vector of the standard basis (i.e., the i-th element of e_i is equal to one and all the other elements are equal to zero). The empty brackets [ ] denote an empty matrix and the matrix

 A = [A1 A2 ... An]

is the column-wise concatenation of the matrices {A_i}. Random sampling refers to sampling without replacement.

## Ii Background and Related Work

### Ii-a Exact LR plus sparse matrix decomposition

The incoherence of the CS and RS of L is an important requirement for the identifiability of the decomposition problem in (1) [1, 2]. For the LR matrix L with rank r and compact SVD L = UΣV^T (where U ∈ ℝ^{N1×r}, Σ ∈ ℝ^{r×r} and V ∈ ℝ^{N2×r}), the incoherence condition is typically defined through the requirements [2, 1]

 max_i ∥U^T e_i∥_2^2 ≤ μr/N1, max_i ∥V^T e_i∥_2^2 ≤ μr/N2 and ∥UV^T∥_∞ ≤ √(μr/(N1N2)) (3)

for some parameter μ that bounds the projection of the standard basis onto the CS and RS. Other useful measures for the coherency of subspaces are given in [11] as

 γ(U) = √N1 max_{i,j} |U(i,j)|, γ(V) = √N2 max_{i,j} |V(i,j)| (4)

where γ(U) and γ(V) bound the coherency of the CS and the RS, respectively. When some of the elements of the orthonormal basis of a subspace are too large, the subspace is coherent with the standard vectors. Actually, it is not hard to show that μ ≤ max(γ^2(U), γ^2(V)).
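Both sets of coherence measures can be computed directly from a compact SVD. The sketch below assumes the definition of μ in (3) and reads γ(U), γ(V) as the scaled maximum magnitudes of the basis entries; the function name is ours.

```python
import numpy as np

def coherence_params(L, r):
    """Coherence measures of the rank-r matrix L, following (3) and (4).

    Returns mu (from the projections of the standard basis onto the
    column/row spaces) and the pair gamma(U), gamma(V) (from the largest
    entries of the orthonormal bases U and V).
    """
    N1, N2 = L.shape
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    U, V = U[:, :r], Vt[:r, :].T          # compact orthonormal bases
    mu = max(N1 / r * (U ** 2).sum(axis=1).max(),
             N2 / r * (V ** 2).sum(axis=1).max())
    gamma_U = np.sqrt(N1) * np.abs(U).max()
    gamma_V = np.sqrt(N2) * np.abs(V).max()
    return mu, gamma_U, gamma_V
```

For a matrix whose column space is aligned with a standard basis vector, μ attains its maximal value N1/r, and the bound μ ≤ max(γ^2(U), γ^2(V)) holds with equality.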

The decomposition of a data matrix into its LR and sparse components was analyzed in [2, 1], and sufficient conditions for exact recovery using the convex minimization (2) were derived. In [1], the sparsity pattern of the sparse matrix is selected uniformly at random following the so-called Bernoulli model to ensure that the sparse matrix is not LR with overwhelming probability. In this model, which is also used in this paper, each element of the sparse matrix can be non-zero independently with a constant probability ρ. Without loss of generality (w.l.o.g.), suppose that N1 ≥ N2. The following lemma states the main result of [1].

###### Lemma 1 (Adapted from [1]).

Suppose that the support set of S follows the Bernoulli model with parameter ρ. The convex program (2) with λ = 1/√N1 yields the exact decomposition with probability at least 1 − c N1^{-10} provided that

 r ≤ ρ_r N2 μ^{-1} (log N1)^{-2}, ρ ≤ ρ_s (5)

where c, ρ_r and ρ_s are numerical constants.

The optimization problem in (2) is convex and can be solved using standard techniques such as interior point methods [2]. Although these methods have fast convergence rates, their usage is limited to small-size problems due to the high complexity of computing a step direction. Similar to the iterative shrinking algorithms for ℓ1-norm and nuclear norm minimization, a family of iterative algorithms for solving the optimization problem (2) was proposed in [4, 3]. However, they also require working with the entire data. For example, the algorithm in [4] requires computing a partial SVD of an N1 × N2 matrix in every iteration.

### Ii-B Randomized approaches

Owing to their inherent low-dimensional structures, the robust principal component analysis (PCA) and matrix decomposition problems can be conceivably solved using small data sketches, i.e., a small set of random observations of the data [12, 13, 14, 6, 7, 15]. In [12], it was shown based on a simple degrees-of-freedom analysis that the LR and the sparse components can be precisely recovered using a small set of random linear measurements of D. A convex program was proposed in [12] to recover these components using random matrix embedding with a polylogarithmic penalty factor in sample complexity, albeit the formulation also requires solving a high-dimensional optimization problem.

The iterative algorithms which solve (2) have O(rN1N2) complexity per iteration since they compute a partial SVD of N1 × N2 dimensional matrices [4]. To reduce complexity, GoDec [16] uses a randomized method to efficiently compute the SVD, and the decomposition algorithm in [17] minimizes the rank of a random projection of the data instead of the rank of the full matrix. However, these approaches do not have provable performance guarantees and their memory requirements scale with the full data dimensions. Another limitation of the algorithm in [17] is its instability, since different random projections may yield different results.

The divide-and-conquer approach in [5] (and a similar algorithm in [18]) can achieve super-linear speedups over full-scale matrix decomposition. This approach forms an estimate of L by combining two low-rank approximations obtained from submatrices formed from sampled rows and columns of D using the generalized Nyström method [19]. Our approach also achieves super-linear speedups in decomposition, yet is fundamentally different from [5] and offers several advantages for the following reasons. First, our approach is a subspace-pursuit approach that focuses on subspace learning in a structure-preserving data sketch. Once the CS is learned, each column of the data is decomposed independently using a proposed randomized vector decomposition algorithm. Second, unlike [5], which is a batch approach that requires storing the entire data, the structure of the proposed approach naturally lends itself to online implementation (c.f. Section IV-E), which could be very beneficial for settings where the data comes in on the fly. Third, while the analysis provided in [5] requires a substantially larger number of random observations to ensure exact decomposition with high probability (whp), we show that the order of the sufficient number of random observations depends linearly on the rank and the coherency parameter even if uniform random sampling is used. Fourth, the structure of the proposed approach enables us to leverage efficient sampling strategies for challenging and realistic scenarios in which the columns and rows of L are not uniformly distributed in their respective subspaces, or when the data exhibits additional structures (e.g. clustering structures) (c.f. Sections IV-B, IV-C). In such settings, the uniform random sampling used in [5] requires significantly larger amounts of data to carry out the decomposition.

## Iii Structure of the Proposed Approach and Theoretical Result

In this section, the structure of the proposed randomized decomposition method is presented. A step-by-step analysis of the proposed approach is provided and sufficient conditions for exact decomposition are derived. Theorem 5 stating the main theoretical result of the paper is presented at the end of this section. The proofs of the lemmas and the theorem are deferred to the appendix. Let us rewrite (1) as

 D=UQ+S, (6)

where Q = ΣV^T. The representation matrix Q is a full row rank matrix that contains the expansion of the columns of L in the orthonormal basis U. The first step of the proposed approach aims to learn the CS of L using a subset of the columns of D, and in the second step the representation matrix is obtained using a subset of the rows of D.

Fundamentally, the CS of L can be obtained from a small subset of the columns of L. However, since we do not have direct access to the LR matrix, a random subset of the columns of D is first selected. Hence, the matrix of sampled columns can be written as Ds1 = D S1, where S1 ∈ ℝ^{N2×m1} is the column sampling matrix and m1 is the number of selected columns. The matrix of selected columns can be written as

 Ds1=Ls1+Ss1, (7)

where Ls1 and Ss1 are its LR and sparse components, respectively. The idea is to decompose the sketch Ds1 into its LR and sparse components to learn the CS of L from the CS of Ls1. Note that the columns of Ss1 are a subset of the columns of S since Ss1 = S S1. Should we be able to decompose Ds1 into its exact LR and sparse components (c.f. Lemma 3), we also need to ensure that the columns of Ls1 span the CS of L. The following lemma establishes that a small subset of the columns of L sampled uniformly at random contains sufficient information (i.e., the columns of the LR component of the sampled data span the CS of L) if the RS is incoherent.

###### Lemma 2.

Suppose m1 columns are sampled uniformly at random from the matrix L with rank r. If

 m1 ≥ r γ^2(V) max(c2 log r, c3 log(3/δ)), (8)

then the selected columns of the matrix span the columns subspace of L with probability at least 1 − δ, where c2 and c3 are numerical constants.

Thus, if γ(V) is small (i.e., the RS is not coherent), a small set of randomly sampled columns can span the CS of L. According to Lemma 2, if m1 satisfies (8), then L and Ls1 have the same CS whp. The following optimization problem (of dimensionality N1 × m1) is solved to decompose Ds1 into its LR and sparse components.

 min_{˙Ls1, ˙Ss1} (1/√N1) ∥˙Ss1∥_1 + ∥˙Ls1∥_* subject to ˙Ls1 + ˙Ss1 = Ds1. (9)

Thus, the columns subspace of the LR matrix can be recovered by finding the columns subspace of Ls1. Our next lemma establishes that (9) yields the exact decomposition using roughly O(rμ) randomly sampled columns. To simplify the analysis, in the following lemma it is assumed that the CS of the LR matrix is sampled from the random orthogonal model [20], i.e., the columns of U are selected uniformly at random among all families of r orthonormal vectors.

###### Lemma 3.

Suppose the columns subspace of L is sampled from the random orthogonal model, Ls1 has the same column subspace as L, and the support set of Ss1 follows the Bernoulli model with parameter ρ. In addition, assume that the columns of Ds1 were sampled uniformly at random. If

 m1 ≥ (r/ρ_r) μ′ (log N1)^2 and ρ ≤ ρ_s, (10)

then (9) yields the exact decomposition whp, where

 μ′ = max(c7 max(r, log N1)/r, 6γ^2(V), (c9 γ(V) log N1)^2) (11)

and ρ_r, ρ_s, c7 and c9 are constant numbers, provided that N2 is also greater than the RHS of the first inequality of (10).

Therefore, according to Lemma 3 and Lemma 2, the CS of L can be obtained using roughly O(rμ′) uniformly sampled data columns. Note that m1 ≪ N2 for high-dimensional data, as m1 scales linearly with r. Hence, the requirement that N2 also be greater than the RHS of the first inequality of (10) is by no means restrictive and is naturally satisfied.

Suppose that (9) decomposes Ds1 into its exact components and assume that the CS of L has been correctly identified. W.l.o.g., we can use U as an orthonormal basis for the learned CS. An arbitrary column d_i of D can be written as d_i = U q_i + s_i, where q_i and s_i are the corresponding columns of Q and S, respectively. Thus, d_i − U q_i is a sparse vector. This suggests that q_i can be learned using the minimization

 min_{^q_i} ∥d_i − U ^q_i∥_1, (12)

where the ℓ1-norm is used as a surrogate for the ℓ0-norm to promote a sparse solution [8, 11]. The optimization problem (12) is similar to a system of linear equations with r unknown variables and N1 equations. Since r ≪ N1, the idea is to learn ^q_i using only a small subset of the equations. Thus, we propose the following vector decomposition program

 min_{^q_i} ∥S2^T d_i − S2^T U ^q_i∥_1, (13)

where S2 ∈ ℝ^{N1×m2} selects m2 rows of U (and the corresponding elements of d_i).

First, we have to ensure that the rank of S2^T U is equal to the rank of U, for if ^q_i* is the optimal point of (13), then U ^q_i* will be the LR component of d_i. According to Lemma 2, m2 ≥ r γ^2(U) max(c2 log r, c3 log(3/δ)) is sufficient to preserve the rank of U when the rows are sampled uniformly at random. In addition, the following lemma establishes that if the rank of S2^T U is equal to the rank of U, then the sufficient value of m2 for (13) to yield the correct columns of L whp is linear in r.
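The sampled problem (13) is a small linear program with r unknowns and m2 residuals. A possible sketch using `scipy.optimize.linprog` follows; the helper names are ours, and the epigraph reformulation of the ℓ1 objective is the standard one, not a construction from the paper.

```python
import numpy as np
from scipy.optimize import linprog

def l1_regression(A, b):
    """Solve min_q ||b - A q||_1 as a linear program.

    Variables are (q, t) with -t <= b - A q <= t and objective sum(t).
    """
    m, r = A.shape
    c = np.concatenate([np.zeros(r), np.ones(m)])
    # A q - t <= b   and   -A q - t <= -b
    A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
    b_ub = np.concatenate([b, -b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (r + m))
    return res.x[:r]

def decompose_column(d, U, row_idx):
    """Randomized vector decomposition in the spirit of (13): recover the
    LR part of column d from the subset row_idx of its entries, given a
    CS basis U. The sparse part is d minus the returned vector."""
    q = l1_regression(U[row_idx, :], d[row_idx])
    return U @ q
```

When the sampled rows avoid the corrupted entries (or contain only a few of them), the ℓ1 fit recovers the representation vector exactly, so the full LR column is reconstructed from a small subset of its elements.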

###### Lemma 4.

Suppose that the rank of S2^T U is equal to the rank of U and assume that the CS of L is sampled from the random orthogonal model. The optimal point of (13) is equal to q_i whp provided that

 ρ ≤ 0.5/(rβ(c6 κ log(N1/δ) + 1)), m2 ≥ max( (2rβ(β−2) log(1/δ))/(3(β−1)^2) · (c6 κ log(N1/δ) + 1), c5 (log(N1/δ))^2, 6√(3/δ) ) (14)

where c5 and c6 are constant numbers and β can be any real number greater than one.

Therefore, we can obtain the LR component of each column using a random subset of its elements. Since (12) is an ℓ1-norm minimization problem, we can write the representation matrix learning problem as

 min_{^Q} ∥S2^T D − S2^T U ^Q∥_1. (15)

Thus, (15) learns Q using a subset of the rows of D, as S2^T D is the matrix formed from the m2 sampled rows of D.

As such, we solve two low-dimensional subspace pursuit problems (9) and (15), of dimensions N1 × m1 and m2 × N2, respectively, instead of the N1 × N2-dimensional decomposition problem (2), and use a small random subset of the data to learn the CS and the representation matrix Q. The table of Algorithm 1 explains the structure of the proposed approach.
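Putting the two steps together, a toy end-to-end sketch of the Algorithm 1 pipeline might look as follows. The inner solvers (a basic ADMM for the sketch decomposition (9) and an LP-based ℓ1 fit for (15)) and all parameter choices are illustrative stand-ins; the paper does not prescribe particular solvers for the two subproblems.

```python
import numpy as np
from scipy.optimize import linprog

def _shrink(X, t):
    return np.sign(X) * np.maximum(np.abs(X) - t, 0.0)

def _pcp(D, n_iter=400):
    """Small ADMM solver for the m1-column sketch problem (illustrative)."""
    lam = 1.0 / np.sqrt(max(D.shape))
    mu = 0.25 * D.size / np.abs(D).sum()   # heuristic penalty parameter
    L = np.zeros_like(D); S = np.zeros_like(D); Y = np.zeros_like(D)
    for _ in range(n_iter):
        Uu, s, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
        L = (Uu * _shrink(s, 1.0 / mu)) @ Vt
        S = _shrink(D - L + Y / mu, lam / mu)
        Y += mu * (D - L - S)
    return L, S

def _l1_fit(A, b):
    """min_x ||b - A x||_1 via a linear program (for the row-sampled fit)."""
    m, r = A.shape
    res = linprog(np.r_[np.zeros(r), np.ones(m)],
                  A_ub=np.block([[A, -np.eye(m)], [-A, -np.eye(m)]]),
                  b_ub=np.r_[b, -b], bounds=[(None, None)] * (r + m))
    return res.x[:r]

def decompose(D, r, m1, m2, rng):
    """Two-step sketch: CS from m1 sampled columns, Q from m2 sampled rows."""
    N1, N2 = D.shape
    cols = rng.choice(N2, m1, replace=False)
    Ls1, _ = _pcp(D[:, cols])                               # decompose sketch
    U = np.linalg.svd(Ls1, full_matrices=False)[0][:, :r]   # learned CS basis
    rows = rng.choice(N1, m2, replace=False)
    Q = np.column_stack([_l1_fit(U[rows, :], D[rows, i])    # row-sampled fit
                         for i in range(N2)])
    return U @ Q                                            # LR component
```

Only an N1 × m1 decomposition and N2 tiny ℓ1 fits of size m2 × r are solved, rather than one N1 × N2 problem.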

We can readily state the following theorem which establishes sufficient conditions for Algorithm 1 to yield exact decomposition. In this theorem, it is assumed that the columns and rows are sampled uniformly at random. In the next section, an efficient method for column and row sampling is presented. In addition, a scheme for online implementation is proposed.

###### Theorem 5.

Suppose the columns subspace of the LR matrix L is sampled from the random orthogonal model and the support set of S follows the Bernoulli model with parameter ρ. In addition, it is assumed that Algorithm 1 samples the columns and rows uniformly at random. If for any small δ > 0,

 (16)

where

 μ′ = max(c7 max(r, log N1)/r, 6γ^2(V), (c9 γ(V) log N1)^2), κ = (log N1)/r,

where ρ_r, ρ_s, c7 and c9 are constant numbers and β is any real number greater than one, then the proposed approach (Algorithm 1) yields the exact decomposition whp, provided that N2 is also greater than the RHS of the first inequality of (10).

Theorem 5 guarantees that the LR component can be obtained using a small subset of the data. The randomized approach has two main advantages. First, it significantly reduces the memory/storage requirements since it only uses a small data sketch and solves two low-dimensional optimization problems versus one large problem. Second, the proposed approach has a per-iteration running time complexity that is significantly lower than the O(rN1N2) per-iteration complexity of the full scale decomposition (2) [3, 4], implying remarkable speedups for big data. For instance, consider L and S generated as in the models above with data dimensions 500, 1000, 5000, and larger; the proposed approach yields the correct decomposition with 90, 300, 680, 1520 and 4800-fold speedups, respectively, over directly solving (2).

## Iv Efficient Column/Row Sampling

In sharp contrast to randomized algorithms for matrix approximation rooted in numerical linear algebra (NLA) [21, 22], which seek to compute matrix approximations from sampled data using importance sampling, in matrix decomposition and robust PCA we do not have direct access to the LR matrix to measure how informative particular columns/rows are. As such, the existing randomized algorithms for matrix decomposition and robust PCA [5, 6, 13, 7] have predominantly relied upon uniform random sampling of columns/rows.

In Section IV-A, we briefly describe the implications of non-uniform data distribution and show that uniform random sampling may not be favorable for data matrices exhibiting structures that prevail in many real datasets. In Section IV-B, we present an efficient column sampling strategy which will be integrated with the proposed decomposition method. The decomposition method with efficient column/row sampling is presented in Sections IV-C and IV-D.

### Iv-a Non-uniform data distribution

When data points lie in a low-dimensional subspace, a small subset of the points can span the subspace. However, uniform random sampling is only effective when the data points are distributed uniformly in the subspace. To clarify, Fig. 1 shows two scenarios for a set of data points in a two-dimensional subspace. In the left plot, the data points are distributed uniformly at random. In this case, two randomly sampled data points can span the subspace whp. In the right plot, 95 percent of the data lie on a one-dimensional subspace, thus we may not be able to capture the two-dimensional subspace from a small random subset of the data points.

In practice, the data points in a low-dimensional subspace may not be uniformly distributed, but rather exhibit some additional structures. A prevailing structure in many modern applications is clustered data [23, 24]. For example, user ratings for certain products (e.g. movies) in recommender systems are not only LR due to their inherent correlations, but also exhibit additional clustering structures owing to the similarity of the preferences of individuals from similar backgrounds (e.g. education, culture, or gender) [23, 25].

To further show that uniform random sampling falls short when the data points are not distributed uniformly in the subspace, consider a rank-60 matrix L whose columns are generated in n clusters: the columns within each cluster are random linear combinations of a small set of cluster-specific basis vectors, with the elements of the bases and the mixture coefficients sampled independently from a normal distribution, so that the rank of L is equal to 60 whp. Fig. 2 illustrates the rank of the randomly sampled columns versus the number of sampled columns for different numbers of clusters n. As n increases, so does the required number of uniformly sampled columns. For the largest number of clusters considered, it turns out that we need to sample more than half of the columns to span the CS. As such, we cannot evade high-dimensionality with uniform random column/row sampling. In [23], it was shown that the RS coherency increases when the columns follow a more clustered distribution, and vice versa. These observations match our theoretical result in Lemma 2, which established that the sufficient number of randomly sampled columns depends linearly on the coherency parameter of the RS.
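The effect described above is easy to reproduce. The sketch below generates a rank-r matrix whose columns form clusters and measures the rank captured by uniformly sampled columns; the sizes and cluster counts are illustrative, since the exact values used for Fig. 2 are not fully recoverable from the text.

```python
import numpy as np

def clustered_lowrank(N1, n_cols, n_clusters, r, rng):
    """Rank-r matrix whose columns concentrate in n_clusters clusters,
    each spanning only r/n_clusters dimensions of the column space."""
    d = r // n_clusters
    blocks = []
    for _ in range(n_clusters):
        Ui = rng.standard_normal((N1, d))           # cluster-specific basis
        blocks.append(Ui @ rng.standard_normal((d, n_cols // n_clusters)))
    return np.hstack(blocks)

def sampled_rank(L, m, rng):
    """Rank of m columns of L sampled uniformly at random."""
    cols = rng.choice(L.shape[1], m, replace=False)
    return np.linalg.matrix_rank(L[:, cols])
```

With 30 clusters and r = 60, sampling as many columns as the rank almost never captures the full column space, because every cluster must contribute at least two sampled columns.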

### Iv-B Efficient column sampling method

Column sampling is widely used for dimensionality reduction and feature selection [15, 26, 27]. In the column sampling problem, the LR matrix (or the matrix whose span is to be approximated with a small set of its columns) is available. Thus, the columns are sampled based on their importance, measured by the so-called leverage scores [21], as opposed to blind uniform sampling. We refer the reader to [15, 27] and references therein for more information about efficient column sampling methods.

Next, we present a sampling approach which will be used in Section IV-C, where the proposed decomposition algorithm with efficient sampling is presented. The proposed sampling strategy is inspired by the approach in [27] in the context of volume sampling. The table of Algorithm 2 details the presented column sampling procedure. Given a matrix A with rank r, the algorithm aims to sample a small subset of the columns of A that span its CS. The first column is sampled uniformly at random or based on a judiciously chosen probability distribution [21]. The next columns are selected sequentially so as to maximize the novelty to the span of the selected columns. As shown in step 2.2 of Algorithm 2, a design threshold is used to decide whether a given column brings sufficient novelty to the sampled columns by thresholding the ℓ2-norm of its projection on the complement of the span of the sampled columns. The threshold is naturally set to zero in a noise-free setting. Once the selected columns are believed to span the CS of A, they are removed from A. This procedure is repeated C times (using the remaining columns). In each pass, the algorithm finds columns which span the CS of the matrix of remaining columns, whose rank is bounded above by r after every pass. As such, the algorithm samples approximately Cr columns in total. In the proposed decomposition method with efficient column/row sampling (presented in Section IV-C), we set C large enough to ensure that the selected columns form a low rank matrix.
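One pass of the selection step can be sketched as a greedy Gram–Schmidt-style scan; this is our simplified reading of step 2.2 (the full Algorithm 2 repeats such passes on the remaining columns), with a small tolerance standing in for the zero threshold of the noise-free setting.

```python
import numpy as np

def sample_columns(A, tau=1e-8):
    """One selection pass: scan the columns of A and keep those whose
    component outside the span of the columns kept so far exceeds tau
    (tau = 0 in an exactly noise-free setting)."""
    N1 = A.shape[0]
    basis = np.zeros((N1, 0))           # orthonormal basis of kept columns
    picked = []
    for j in range(A.shape[1]):
        resid = A[:, j] - basis @ (basis.T @ A[:, j])  # projection residual
        norm = np.linalg.norm(resid)
        if norm > tau:                   # column brings sufficient novelty
            basis = np.column_stack([basis, resid / norm])
            picked.append(j)
    return picked
```

On a rank-r matrix a single pass returns roughly r columns that span the CS regardless of how unevenly the columns are distributed, which is exactly what uniform sampling fails to do on clustered data.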

### Iv-C Proposed decomposition algorithm with efficient sampling

In this section, we develop a modified decomposition algorithm that replaces uniform random sampling with the efficient column/row sampling method (Algorithm 2). In Section V, it is shown that the proposed technique can remarkably reduce the sampling requirement. We consider a setting wherein the data points (the columns of D) are not uniformly distributed, rather they admit an additional structure (such as clustering), wherefore a small subset of uniformly sampled columns is not likely to span the CS. However, we assume that the rows of L are distributed well enough, in the sense that they do not align much along any specific directions, such that a small multiple of r rows of D sampled uniformly at random span the RS of L whp. In Section IV-D, we dispense with this assumption. The proposed decomposition algorithm rests on three key ideas detailed next.

#### Iv-C1 Informative column sampling

The first important idea underlying the proposed sampling approach is to start sampling along the dimension that has the better distribution. For instance, in the example considered in Section IV-A, the columns of D admit a clustering structure. However, the CS of L is a random r-dimensional subspace, which means that the rows of L are distributed uniformly at random in the RS of L. Thus, in this case we start with row sampling. The main intuition is that while almost 60 randomly sampled rows of D span the RS, a considerable portion of the columns (almost 4000) should be sampled to capture the CS, as shown in Fig. 2. As another example, consider an extreme scenario where only two columns of L are non-zero. In this case, with random sampling we need to sample almost all the columns to ensure that the sampled columns span the CS of L. But, if the non-zero columns are non-sparse, a small subset of randomly chosen rows of L will span its row space.

Let r̄ denote a known upper bound on r. Such knowledge is often available as side information depending on the particular application. For instance, facial images under varying illumination and facial expressions are known to lie on a special low-dimensional subspace [28]. For visualization, Fig. 3 provides a simplified illustration of the matrices defined in this section. We sample a set of rows of D uniformly at random, with the number of sampled rows chosen sufficiently large (based on r̄) to ensure that the non-sparse component of the matrix of sampled rows, assumably with rank r, is a LR matrix. If we locate a subset of the columns of this LR component that span its CS, the corresponding columns of L would span the CS of L. To this end, the convex program (2) is applied to the matrix of sampled rows to extract its LR component. Then, Algorithm 2 is applied to this LR component to find a set of informative columns. In Remark 1, we discuss how to choose the number of sampling passes in the algorithm. The matrix Ds1 is then formed using the columns of D corresponding to the selected columns.

#### Iv-C2 CS learning

Similar to the CS learning step of Algorithm 1, we can obtain the CS of L by decomposing Ds1. However, we propose to leverage valuable information in the matrix of sampled rows in decomposing Ds1. In particular, if the matrix of sampled rows is decomposed correctly, the RS of the selected columns of its LR component would be the same as that of the LR component of Ds1, given that the rank of that LR component is equal to r. Let Vs1 be an orthonormal basis for this RS. Thus, in order to learn the CS of L, we only need to solve

 min_{^U} ∥Ds1 − ^U Vs1^T∥_1. (17)
###### Remark 1.

Define d_i_s1 as the i-th row of Ds1. According to (17), ^Ui (the i-th row of ^U) is obtained as the optimal point of

 min_{^Ui} ∥(d_i_s1)^T − Vs1 (^Ui)^T∥_1. (18)

Based on the analysis provided in the proof of Lemma 4, the optimal point of (18) is equal to the i-th row of the LR component of Ds1 if the number of selected columns is sufficiently greater than the number of non-zero elements of the i-th row of the sparse component of Ds1, scaled by a factor which depends on the coherency of the subspace spanned by Vs1. Thus, the number of sampling passes in Algorithm 2 is determined based on the rank of L and the sparsity of S, i.e., the number of selected columns has to be sufficiently greater than the expected number of non-zero elements of the rows of the sparse component of Ds1.

###### Remark 2.

We note that the convex algorithm (2) may not always yield an accurate decomposition of the matrix of sampled rows, since structured data may not be sufficiently incoherent [2, 23], suggesting that the decomposition step can be further improved. Consider the submatrix of the matrix of sampled rows consisting of the columns that were selected to form Ds1. According to our investigations, an improved Vs1 can be obtained by applying the decomposition algorithm presented in [29] to this submatrix, then using the RS of its LR component as an initial guess for the RS of the non-sparse component of the matrix of sampled rows. Since this submatrix is low-dimensional, this extra step is a low complexity operation.

#### Iv-C3 Representation matrix learning

Suppose that the CS of L was learned correctly, i.e., the span of the optimal point ^U of (17) is equal to the span of U. Thus, we use ^U as a basis for the learned CS. Now we leverage the information embedded in ^U to select the informative rows. Algorithm 2 is applied to ^U^T to locate the informative rows of ^U. We then form the matrix Ds2 from the rows of D corresponding to the selected rows of ^U, and the representation matrix is learned as

 min_{^Q} ∥Ds2 − Us2 ^Q∥_1, (19)

where Us2 is the matrix of the selected rows of ^U. Subsequently, the LR matrix can be obtained from the learned CS and the representation matrix.

### Iv-D Column/Row sampling from sparsely corrupted data

In Section IV-C, we assumed that the LR component of the matrix of uniformly sampled rows has rank r. However, if the rows are not well-distributed, a reasonably sized random subset of the rows may not span the RS of L. Here, we present a sampling approach which can find the informative columns/rows even when both the columns and the rows exhibit clustering structures, such that a small random subset of the columns/rows of L cannot span its CS/RS. The algorithm presented in this section (Algorithm 3) can be independently used as an efficient sampling approach for big data. In this paper, we use Algorithm 3 to form Ds1 if both the columns and rows exhibit clustering structures.

The table of Algorithm 3, Fig. 4 and its caption provide the details of the proposed sampling approach and the definitions of the matrices used. We start the cycle from the position marked “I” in Fig. 4 with the column sketch formed according to the initialization step of Algorithm 3. For ease of exposition, assume that the decomposition steps are exact, i.e., the sketches are decomposed correctly, and that the selected columns are the informative columns of the sketch, so that the rank of the LR component of the selected columns is equal to the rank of the LR component of the sketch. If the rows of L exhibit a clustering structure, it is likely that a small random subset of the rows does not span the RS, and the rank achieved by the initial sketch falls short of r. We continue one cycle of the algorithm by going through steps 1, 2 and 3 of Fig. 4 to update the sampled rows. Using a similar argument, we see that the rank of the updated row sketch will be greater than the rank of the initial one. Thus, if we run more cycles of the algorithm – each time updating the sampled columns and rows – the rank of the sampled sketches will increase. As detailed in the table of Algorithm 3, we stop if the dimension of the span of the obtained LR component does not change in consecutive iterations. While there is no guarantee that the rank of the sampled sketch will converge to r (it can converge to a smaller value), our investigations have shown that Algorithm 3 performs quite well and the RS of the sketch converges to the RS of L in few steps. We have also found that adding some randomly sampled columns (rows) to the informative columns (rows) can effectively avert convergence to a lower dimensional subspace. For instance, some randomly sampled columns can be added to the columns selected by applying Algorithm 2 to the current sketch.

Algorithm 3 was found to converge in a very small number of iterations (typically fewer than 4). Thus, even when Algorithm 3 is used to form the sketch matrices, the order of complexity of the proposed decomposition method with efficient column/row sampling (presented in Section IV-C) is essentially unchanged.

### IV-E Online Implementation

The proposed decomposition approach consists of two main steps, namely, learning the CS of the LR component, then decomposing the columns independently. This structure lends itself to online implementation, which can be very beneficial in settings where the data arrives on the fly. The idea is to first learn the CS of the LR component from a small batch of the data and then keep tracking it. Since the CS is being tracked, any new data column can be decomposed based on the updated subspace. The table of Algorithm 4 details the proposed online matrix decomposition algorithm.

Algorithm 4 uses a parameter that determines the rate at which the algorithm updates the CS of the LR component. For instance, if this parameter is set to 20, then the CS is updated once every 20 new data columns (step 2.2 of Algorithm 4). It has to be set in accordance with the rate of change of the subspace of the LR component; a small value is used if the subspace is changing rapidly. A second parameter determines how many of the most recently received columns are used to update the CS. If the subspace changes rapidly, the older columns may be less relevant to the current subspace, hence a small value is used. On the other hand, when the data is noisy and the subspace changes at a slower rate, choosing a larger value can lead to a more accurate estimate of the CS.
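A minimal sketch of this windowed tracking scheme follows (not Algorithm 4 itself: the sparse component is omitted, the underlying subspace is held static, and the window size w, update rate T, and all variable names are illustrative). Every T columns the CS estimate is refreshed from the w most recent columns, and each arriving column is then decomposed against the tracked subspace.

```python
import numpy as np

rng = np.random.default_rng(1)
r, n_dim, T, w = 5, 80, 4, 20          # rank, ambient dim, update rate, window
U_true = np.linalg.qr(rng.standard_normal((n_dim, r)))[0]

U_hat = None
window, errors = [], []
for t in range(200):
    d = U_true @ rng.standard_normal(r)   # new (noise-free) data column
    window = (window + [d])[-w:]          # keep only the w last columns
    if t % T == 0 and len(window) >= r:   # refresh the CS estimate
        U_hat = np.linalg.svd(np.column_stack(window),
                              full_matrices=False)[0][:, :r]
    if U_hat is not None:
        resid = d - U_hat @ (U_hat.T @ d)  # decompose d against tracked CS
        errors.append(np.linalg.norm(resid) / np.linalg.norm(d))
```

With clean data from a fixed subspace, the residuals are numerically zero once the first update has occurred; the sparse component in the real algorithm is handled by the robust representation learning step instead of a plain projection.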

### IV-F Noisy Data

In practice, noisy data can be modeled as

 D = L + S + N, (22)

where N is an additive noise component. In [30], it was shown that the program

 min_(^L,^S) λ∥^S∥₁ + ∥^L∥_* subject to ∥^L + ^S − D∥_F ≤ ε_n, (23)

can recover the LR and sparse components with an error bound proportional to the noise level. The parameter ε_n has to be chosen based on the noise level. This modified version can be used in the proposed algorithms to account for the noise. Similarly, to account for the noise in the representation learning problem (15), the ℓ₁-norm minimization problem can be modified as follows:

 min_(^Q,^E) ∥S₂ᵀD − S₂ᵀU ^Q − ^E∥₁ subject to ∥^E∥_F ≤ δ_n. (24)

The matrix ^E is used to cancel out the effect of the noise, and the parameter δ_n is chosen based on the noise level [31].

## V Numerical Simulations

In this section, we present numerical simulations to study the performance of the proposed randomized decomposition method. First, we present a set of simulations confirming our analysis, which established that the sufficient number of sampled columns/rows is linear in the rank r. Then, we compare the proposed approach to the state-of-the-art randomized algorithm [5] and demonstrate that the proposed sampling strategy can lead to notable performance improvements. We then provide an illustrative example to showcase the effectiveness of our approach on real video frames for background subtraction and activity detection. Given the structure of the proposed approach, it is shown that side information can be leveraged to further simplify the decomposition task. In addition, a numerical example is provided to examine the performance of Algorithm 3. Finally, we investigate the performance of the online algorithm and show that the proposed online method can successfully track the underlying subspace.

In all simulations, the Augmented Lagrange Multiplier (ALM) algorithm [4, 1] is used to solve the optimization problem (2). In addition, the ℓ₁-magic routine [32] is used to solve the ℓ₁-norm minimization problems. It is important to note that in all the provided simulations (except in Section V-D), the convex program (2) that operates on the entire data yields correct decomposition with respect to the considered criteria. Thus, if the randomized methods cannot yield correct decomposition, it is because they fall short of acquiring the essential information through sampling.

### V-A Phase transition plots

In this section, we investigate the required number of randomly sampled columns/rows. The LR matrix is generated as a product L = U Qᵀ, where the elements of the factors U and Q are sampled independently from a standard normal distribution N(0,1). The sparse matrix S follows the Bernoulli model, i.e., each of its elements is non-zero with a fixed probability ρ. In this experiment, Algorithm 1 is used and the columns/rows are sampled uniformly at random.
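For reference, this data model can be sketched as follows (the dimensions and the magnitude of the sparse entries are illustrative, not the paper's exact parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
N1, N2, r, rho = 400, 400, 5, 0.02

# LR component: product of two independent Gaussian factors
L = rng.standard_normal((N1, r)) @ rng.standard_normal((r, N2))

# Sparse component: Bernoulli support, each entry non-zero w.p. rho
mask = rng.random((N1, N2)) < rho
S = mask * rng.uniform(-10, 10, (N1, N2))
D = L + S

# Uniform column sampling: with m substantially larger than r, the
# sampled columns of L span its column space with high probability.
m = 40
cols = rng.choice(N2, m, replace=False)
```

Under this unstructured model a uniform sample works well, which is exactly the regime the phase transition plots probe; the structured models of Section V-B are where uniform sampling starts to fail.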

Fig. 5 shows the phase transition plots for different numbers of randomly sampled rows/columns. In this simulation, the size of the data matrix is fixed. For each pair of rank and sparsity parameters, we generate 10 random realizations, and a trial is considered successful if the recovered LR matrix is sufficiently close to L in relative Frobenius-norm error. It is clear that the required number of sampled columns/rows increases as the rank or the sparsity parameter is increased. When the sparsity parameter is increased to 0.3, the proposed algorithm can hardly yield correct decomposition; in this case, the matrix S is in fact no longer sparse.

The top row of Fig. 5 confirms that the sufficient numbers of sampled columns and rows are roughly linear in r. For instance, when the rank is increased from 5 to 25, the required number of sampled columns/rows increases from 30 to 140. In this experiment, the column and row spaces of L are sampled from the random orthogonal model. Thus, the CS and RS have small coherency whp [20]. Therefore, the important factor governing the sample complexity is the rank of L. Indeed, Fig. 6 shows the phase transition for different sizes of the data matrix when the rank of L is fixed. One can see that the required numbers of sampled columns and rows are almost independent of the size of the data, confirming our analysis.
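The role of coherency here can be made concrete with a small numerical sketch, using a simple μ-style coherence measure on the row norms of a basis (the construction and dimensions below are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(3)
N, r = 1000, 10

# Basis drawn from the random orthogonal model: low coherence whp.
V_rand = np.linalg.qr(rng.standard_normal((N, r)))[0]

# Structured basis: all energy concentrated on only 2r rows,
# which makes the coherence close to its maximum N/r.
B = np.zeros((N, r))
B[:2 * r, :] = rng.standard_normal((2 * r, r))
V_struct = np.linalg.qr(B)[0]

def coherence(V):
    """mu(V) = (N / r) * max_i ||V^T e_i||^2, always between 1 and N/r."""
    N, r = V.shape
    return N / r * (np.linalg.norm(V, axis=1) ** 2).max()

print(coherence(V_rand), coherence(V_struct))
```

A basis from the random orthogonal model stays near the lower end of the [1, N/r] range, while the concentrated basis sits near the upper end, which is why uniform sampling needs far more samples in the coherent case.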

### V-B Efficient column/row sampling

In this experiment, the algorithm presented in Section IV-C is compared to the randomized decomposition algorithm in [5]. It is shown that the proposed sampling strategy can effectively reduce the required number of sampled columns/rows, and makes the proposed method remarkably robust to structured data. The LR component is generated as a concatenation of n clusters,

 L = [G₁ G₂ ... Gₙ].

For some of the clusters,

 Gᵢ = Uᵢ Qᵢ,

where the elements of the factors Uᵢ and Qᵢ are sampled independently from a standard normal distribution N(0,1). For the remaining clusters,

 Gᵢ = (1/3) Uᵢ Qᵢ,

where the elements of Uᵢ and Qᵢ are again sampled independently from a standard normal distribution, so that these clusters carry less energy. The cluster ranks are chosen such that the rank of L is equal to 60 whp. The sparse matrix S follows the Bernoulli model and each element of S is non-zero with probability 0.02. In this simulation, we do not use Algorithm 3; the row sketch is formed from 300 uniformly sampled rows of D.

We evaluate the performance of the algorithm for different numbers of clusters n. Fig. 7 shows the performance of the proposed approach and the approach in [5] for different numbers of clusters and sampled columns/rows. For each configuration, we compute the error in LR matrix recovery averaged over 10 independent runs, and conclude that the algorithm yields correct decomposition if the average error is less than 0.01. In Fig. 7, the values 0 and 1 designate incorrect and correct decomposition, respectively. It can be seen that the presented approach requires a significantly smaller number of samples to yield correct decomposition. This is due to the fact that the randomized algorithm [5] samples both the columns and rows uniformly at random and independently. In sharp contrast, we use the row sketch to find the most informative columns, and also leverage the information embedded in the CS to find the informative rows. One can see that when the number of clusters is large, [5] cannot yield correct decomposition even with a substantially larger number of sampled columns/rows.

### V-C Vector decomposition for background subtraction

The LR plus sparse matrix decomposition can be effectively used to detect a moving object in a stationary background [1, 33]. The background is modeled as a LR matrix and the moving object as a sparse matrix. Since videos are typically high dimensional, standard algorithms can be quite slow for such applications. Our algorithm is a good candidate for this problem as it significantly reduces the dimensionality. The decomposition problem can be further simplified by leveraging prior information about the stationary background. In particular, we may know that the background does not change, or that it can be constructed from a pre-known dictionary. For example, consider the video from [34], which was also used in [1]. A few frames of the stationary background are illustrated in Fig. 8.

Thus, we can simply form the CS of the LR matrix from these frames, which describe the stationary background in its different states. Accordingly, we just need to learn the representation matrix. As such, the background subtraction problem is simplified to a vector decomposition problem.
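To illustrate, here is a small sketch of the vector decomposition step under a known CS. The ℓ₁ representation learning of (15) is approximated with iteratively reweighted least squares (IRLS) rather than an LP solver such as ℓ₁-magic, and all dimensions and names are illustrative: the subspace U plays the role of the background CS and the sparse vector plays the role of the moving object.

```python
import numpy as np

rng = np.random.default_rng(4)
n, r = 200, 5
U = np.linalg.qr(rng.standard_normal((n, r)))[0]   # known "background" CS

q_true = rng.standard_normal(r)
s_true = np.zeros(n)
support = rng.choice(n, 15, replace=False)         # sparse "foreground"
s_true[support] = rng.uniform(-5, 5, 15)
d = U @ q_true + s_true                            # one observed frame

# l1 representation learning via IRLS: approximately minimize
# ||d - U q||_1 over q, which is robust to the sparse outliers.
q = np.linalg.lstsq(U, d, rcond=None)[0]           # least-squares warm start
for _ in range(100):
    w = 1.0 / np.maximum(np.abs(d - U @ q), 1e-8)  # reweighting
    Uw = U * w[:, None]
    q = np.linalg.solve(U.T @ Uw, Uw.T @ d)        # weighted normal equations

s_hat = d - U @ q                                  # recovered sparse part
```

Because the corruption is sparse, the robust fit recovers the representation vector and hence the foreground, whereas the plain least-squares warm start alone would smear the outliers across the background.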

Fig. 9 shows that the proposed method successfully separates the background and the moving objects. In this experiment, 500 randomly sampled rows (i.e., 500 randomly sampled pixels) are used for the representation matrix learning (15). While the running time of our approach is just a few milliseconds, it takes almost half an hour to decompose the video file if we use (2) directly [1].

### V-D Alternating algorithm for column sampling

In this section, we investigate the performance of Algorithm 3 for column sampling. The rank of the selected columns is shown to converge to the rank of L even when both the rows and columns of L exhibit a highly structured distribution. To generate the LR matrix, we first generate a matrix as in Section IV-A and construct one factor of L from its first right singular vectors. We then generate a second matrix in the same way and use its first right singular vectors as the other factor, so that both the columns and rows of L are highly structured and coherent. Note that in this very challenging scenario the resulting LR matrix is itself nearly sparse. Thus, in this simulation we set the sparse matrix equal to zero and use Algorithm 3 as follows. The initial column sketch is formed using 300 columns sampled uniformly at random, and the following steps are performed iteratively:
1. Apply Algorithm 2 to the current column sketch to select its informative rows, and form the row sketch from the corresponding rows of D.
2. Apply Algorithm 2 to the row sketch to select its informative columns, and form the new column sketch from the corresponding columns of D.

Fig. 10 shows the rank of the selected columns after each iteration. It is evident that the algorithm converges to the rank of L in fewer than 3 iterations, even for a large number of clusters. The size of the data matrix is the same for all considered numbers of clusters.

### V-E Online Implementation

In this section, the proposed online method is examined. It is shown that the proposed scalable online algorithm tracks the underlying subspace successfully. The sparse matrix follows the Bernoulli model. The orthonormal matrix U spans a random r-dimensional subspace, and the LR matrix is generated as follows.

For k from 1 to N₂
1. Generate qₖ and E randomly.
2. Set the k-th column of L equal to U qₖ.
3. If (k mod m = 0)
  U ← approx-r(U + α E)
 End If
End For

The elements of qₖ and E are sampled from standard normal distributions. The output of the function approx-r is the matrix of the first r left singular vectors of its input matrix, and (k mod m) is the remainder of k divided by m. The parameters α and m control the rate of change of the underlying subspace: the subspace changes at a higher rate if α is increased or m is decreased. In this simulation, m = 10, i.e., the CS is randomly rotated every 10 new data columns. We compare the performance of the proposed online approach to the online algorithm in [35]. For our proposed method, a subset of the rows is sampled when Algorithm 2 is applied. The method presented in [35] is initialized with the exact CS and updates the CS with every new data column, whereas the update-rate parameter of the proposed online method is set equal to 4 (i.e., the CS is updated every 4 new data columns). Let ^L denote the recovered LR matrix. Fig. 11 shows the ℓ₂-norm of the columns of the recovery error L − ^L normalized by the average ℓ₂-norm of the columns of L, for different values of α. One can see that the proposed method can successfully track the CS while it is continuously changing. The online method [35] performs well when the subspace is not changing, but fails to track the subspace when it is changing.
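The generation loop above can be sketched as follows, with approx-r implemented via a truncated SVD (the specific dimensions and parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
n_dim, r, alpha, m, n_cols = 100, 5, 0.1, 10, 60

def approx_r(A, r):
    """First r left singular vectors of A (the approx-r function)."""
    return np.linalg.svd(A, full_matrices=False)[0][:, :r]

U = approx_r(rng.standard_normal((n_dim, r)), r)  # initial random CS
cols, dists = [], []
U_prev = U.copy()
for k in range(1, n_cols + 1):
    cols.append(U @ rng.standard_normal(r))       # steps 1-2: new column
    if k % m == 0:                                # step 3: rotate the CS
        E = rng.standard_normal((n_dim, r))
        U = approx_r(U + alpha * E, r)
        # distance between consecutive subspaces (sine of largest angle)
        dists.append(np.linalg.norm(U_prev - U @ (U.T @ U_prev), 2))
        U_prev = U.copy()
L = np.column_stack(cols)
```

Each rotation moves the subspace by an amount controlled by α, so the recorded subspace distances are strictly positive but well below 1 for a small perturbation.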

## Appendix

Proof of Lemma 2
The selected columns of L can be written as L_{s1} = L S₁. Using the compact SVD L = U Σ Vᵀ, L_{s1} can be rewritten as L_{s1} = U Σ (VᵀS₁). Therefore, to show that the CS of L_{s1} is equal to that of L, it suffices to show that the matrix VᵀS₁ is a full rank matrix. The matrix S₁ᵀ selects m₁ rows of V uniformly at random. Therefore, using Theorem 2 in [11], if

 m₁ ≥ r γ²(V) max(c₂ log r, c₃ log(3/δ)), (25)

then the matrix S₁ᵀV satisfies the inequality

 ∥I − (N₂/m₁) VᵀS₁S₁ᵀV∥ ≤ 1/2 (26)

with probability at least 1 − δ, where c₂ and c₃ are numerical constants [11]. Accordingly, if σ₁ and σ_r denote the largest and smallest singular values of VᵀS₁, respectively, then

 m₁/(2N₂) ≤ σ_r² ≤ σ₁² ≤ 3m₁/(2N₂). (27)

Therefore, the singular values of VᵀS₁ are greater than or equal to √(m₁/(2N₂)) > 0. Accordingly, the matrix VᵀS₁ is a full rank matrix.
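The concentration step (26)–(27) can be checked numerically. A small sketch follows; the dimensions N₂, r, m₁ are illustrative, and V is drawn from the random orthogonal model so that its coherence is low whp.

```python
import numpy as np

rng = np.random.default_rng(6)
N2, r, m1 = 2000, 5, 200

# Right singular vectors drawn from the random orthogonal model.
V = np.linalg.qr(rng.standard_normal((N2, r)))[0]

# S1^T selects m1 rows of V uniformly at random.
idx = rng.choice(N2, m1, replace=False)
Vs = V[idx, :]

# Deviation in (26); if it is at most 1/2, the singular-value
# bounds of (27) follow immediately.
dev = np.linalg.norm(np.eye(r) - (N2 / m1) * Vs.T @ Vs, 2)
sv = np.linalg.svd(Vs, compute_uv=False)
print(dev, sv[-1] ** 2, sv[0] ** 2)
```

Since the eigenvalues of (N₂/m₁) VᵀS₁S₁ᵀV lie within dev of 1, a deviation below 1/2 pins the squared singular values of S₁ᵀV inside [m₁/(2N₂), 3m₁/(2N₂)], exactly as (27) states.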

###### Remark 3.

A direct application of Theorem 2 in [11] would in fact lead to the sufficient condition

 m₁ ≥ r γ²(R) max(c₂ log r, c₃ log(3/δ)), (28)

where R denotes the complete matrix of right singular vectors of L. The bound in (25) is slightly tighter since it uses the incoherence parameter γ(V) in place of γ(R) in (28), where V consists of the first r columns of R. This follows easily by replacing the incoherence parameter in the step that bounds the ℓ₂-norm of the row vectors of the sampled submatrix in the proof of ([11], Theorem 2).

Proof of Lemma 3
The sampled columns can be written as

 D_{s1} = D S₁ = L_{s1} + S_{s1}. (29)

First, we investigate the coherency of the new LR matrix L_{s1}. Define P_{S₁ᵀV} as the projection matrix onto the CS of S₁ᵀV, which is equal to the RS of L_{s1}. The projection of the standard basis onto the RS of L_{s1} can therefore be bounded as

 max_i ∥P_{S₁ᵀV} e_i∥₂² = max_i ∥S₁ᵀV (VᵀS₁S₁ᵀV)⁻¹ VᵀS₁ e_i∥₂² ≤ max_j ∥S₁ᵀV (VᵀS₁S₁ᵀV)⁻¹ Vᵀ e_j∥₂² ≤ ∥S₁ᵀV (VᵀS₁S₁ᵀV)⁻¹∥² max_j ∥Vᵀ e_j∥₂²