Principal Components Analysis (PCA) is a tool that is frequently used for dimension reduction. Given a matrix of data
, PCA computes a small number of orthogonal directions, called principal components, that contain most of the variability of the data. For relatively noise-free data that lies close to a low-dimensional subspace, PCA is easily accomplished via singular value decomposition (SVD). The problem of PCA in the presence of outliers is referred to as robust PCA (RPCA). In recent work, Candès et al. posed RPCA as a problem of separating a low-rank matrix, , and a sparse matrix, , from their sum, . They proposed a convex program called principal components’ pursuit (PCP) that provided a provably correct batch solution to this problem under mild assumptions. PCP solves
where is the nuclear norm (sum of singular values), is the sum of the absolute values of the entries, and is an appropriately chosen scalar. The same program was analyzed in parallel by Chandrasekharan et al.  and later by Hsu et al. . Since these works, there has been a large amount of work on batch approaches for RPCA and their performance guarantees.
When RPCA needs to be solved in a recursive fashion for sequentially arriving data vectors it is referred to as online (or recursive) RPCA. Online RPCA assumes that a short sequence of outlier-free (sparse component free) data vectors is available. An example application where this problem occurs is the problem of separating a video sequence into foreground and background layers (video layering) on-the-fly . Video layering is a key first step for automatic video surveillance and many other streaming video analytics tasks. In videos, the foreground usually consists of one or more moving persons or objects and hence is a sparse image. The background images usually change only gradually over time , e.g., moving lake waters or moving trees in a forest, and hence are well modeled as lying in a low-dimensional subspace that is fixed or slowly changing. Also, the changes are global (dense) . In most video applications, it is valid to assume that an initial short sequence of background-only frames is available and this can be used to estimate the initial subspace via SVD.
Often in video applications the sparse foreground is actually the signal of interest, and the background is the noise. In this case, the problem can be interpreted as one of recursive sparse recovery in (potentially) large but structured noise. Our result allows for to be large in magnitude as long as it is structured. The structure we impose is that the ’s lie in a low dimensional subspace that changes slowly over time.
In some other applications, instead of there being outliers, parts of a data vector may be missing entirely. When the (unknown) complete data vector is a column of a low-rank matrix, the problem of recovering it is referred to as matrix completion (MC). For example, recovering video sequences and tracking their subspace changes in the presence of easily detectable foreground occlusions. If the occluding object’s intensity is known and is significantly different from that of the background, its support can be obtained by simple thresholding. The background video recovery problem then becomes an MC problem. A nuclear norm minimization (NNM) based solution for MC was introduced in  and studied in . The convex program here is to minimize the nuclear norm of subject to and agreeing on all observed entries. Since then there has been a large amount of work on batch methods for MC and their correctness results.
I-a Problem Definition
Consider the online MC problem. Let denote the set of missing entries at time . We observe a vector that satisfies
with the possibility that can be infinity too. Here is such that, for large enough (quantified in Model 2.2), the matrix is a low-rank matrix. Notice that by defining as above, we are setting to zero the entries that are missed (see the notation section on page I-D).
Consider the online RPCA problem. At time we observe a vector that satisfies
Here is as defined above and is the sparse (outlier) vector. We use to denote the support set of .
For both problems above, for , we are given complete outlier-free measurements so that it is possible to estimate the initial subspace. For the video surveillance application, this would correspond to having a short initial sequence of background only images, which can often be obtained. For , the goal is to estimate (or and in case of RPCA) as soon as arrives and to periodically update the estimate of .
In the rest of the paper, we refer to as the missing/corrupted entries’ set.
I-B Related Work
Some other work that also studies the online MC problem (defined differently from above) includes [7, 8, 9, 10]. We discuss the connection with the idea from  in Section IV. The algorithm from , GROUSE, is a first order stochastic gradient method; a result for its convergence to the local minimum of the cost function it optimizes is obtained in . The algorithm of , PETRELS, is a second order stochastic gradient method. It is shown in  that PETRELS converges to the stationary point of the cost function it optimizes. The advantage of PETRELS and GROUSE is that they do not need initial subspace knowledge. Another somewhat related work is .
Partial results have been provided for ReProCS for online RPCA in our older work . In other more recent work  another partial result is obtained for online RPCA defined differently from above. Neither of these is a correctness result. Both require an assumption that depends on intermediate algorithm estimates. Another somewhat related work is  on online PCA with contaminated data. This does not model the outlier as a sparse vector but defines anything that is far from the data subspace as an outlier.
Some other works only provide an algorithm without proving any performance results, e.g., .
We discuss the most related works in detail in Sec III-C.
In this work we develop and study a practical modification of the Recursive Projected Compressive Sensing (ReProCS) algorithm introduced and studied in our earlier work  for online RPCA. We also develop a special case of it that solves the online MC problem. The main contribution of this work is that we obtain a complete correctness result for ReProCS-based algorithms for both online MC and online RPCA (or more generally, online sparse plus low-rank matrix recovery). Online algorithms are useful because they are causal (needed for applications like video surveillance) and, in most cases, are faster and need less storage compared to most batch techniques (we should mention here that there is some recent work on faster batch techniques as well, e.g., ). To the best of our knowledge, this work and an earlier conference version of this  may be among the first correctness results for online RPCA. The algorithm studied in  is more restrictive.
Moreover, as we will see, by exploiting temporal dependencies, such as slow subspace change, and initial subspace knowledge, our result is able to allow for a more correlated set of missing/corrupted entries than do the various results for PCP [2, 3, 4] or NNM  (see Sec. III).
Our result uses the overall proof approach introduced in our earlier work  that provided a partial result for online RPCA. The most important new insight needed to get a complete result is described in Section IV-C. Also see Sec. III-C. New proof techniques are needed for this line of work because almost all existing works only analyze batch algorithms that solve a different problem. Also, as explained in Section IV, the standard PCA procedure cannot be used in the subspace update step and hence results for it are not applicable.
As shown in , because it exploits initial subspace knowledge and slow subspace change, ReProCS has significantly improved recovery performance compared with batch RPCA algorithms - PCP  and  - as well as with the online algorithm of  for foreground and background extraction in many simulated and real video sequences; it is also faster than the batch methods but slower than .
We use to denote transpose. The 2-norm of a vector and the induced 2-norm of a matrix are denoted by . For a set of integers, denotes its cardinality and denotes its complement set. We use to denote the empty set. For a vector , is a smaller vector containing the entries of indexed by . Define to be an
matrix of those columns of the identity matrix indexed by. For a matrix , define . For matrices and where the columns of are a subset of the columns of , refers to the matrix of columns in and not in .
For an Hermitian matrix ,
denotes an eigenvalue decomposition. That is,has orthonormal columns, and is a diagonal matrix of size at least . (If is rank deficient, then can have any size between and .) For Hermitian matrices and , the notation means that is positive semi-definite. We order the eigenvalues of an Hermitian matrix in decreasing order. So .
For integers and , we use the interval notation to mean all of the integers between and , inclusive, and similarly for etc.
For a matrix , the restricted isometry constant (RIC) is the smallest real number such that
for all -sparse vectors . A vector is -sparse if it has or fewer non-zero entries.
We refer to a matrix with orthonormal columns as a basis matrix. Notice that if is a basis matrix, then .
For basis matrices and , define . This quantifies the difference between their range spaces. If and have the same number of columns, then , otherwise the function is not necessarily symmetric.
The remainder of the paper is organized as follows. In Section II we give the model and main result for both online MC and online RPCA. Next we discuss our main results in Section III. The algorithms for solving both problems are given and discussed in Section IV. The discussion also explains why the proof of our main result should go through. Section IV-C within this section describes the key insight needed by the proof and Section IV-D gives the proof outline. The most general form of our model on the missing entries set, , is described in Section V. A key new lemma for proving our main results is also given in this section. The proof of our main results can be found in Section VI. Proofs of three long lemmas needed for proving the lemmas leading to the main theorem are postponed until Section VII. Section VIII shows numerical experiments backing up our claims. We discuss some extensions in Section IX and give conclusions in Section X
Ii Online Matrix Completion: Assumptions and Main Result
Before we give our model on , we need the following definition.
Recall that for is the training data. Let be the minimum non-zero eigenvalue of . That is
Define to be the matrix containing the eigenvectors of
to be the matrix containing the eigenvectors of, with eigenvalues larger than or equal to , as its columns.
We will use as the initial subspace knowledge in the algorithms. We will use in our algorithms to set the eigenvalue threshold to both detect subspace change and estimate the number of newly added directions. We also use to state the slow subspace change assumption below We use this to state the most general version of the slow subspace change assumption in Model 2.2. However, as explained in the footnote in the line below (4), we can get a slightly more restrictive model without using .
Ii-a Model on
We assume that is a vector from a fixed or slowly changing low-dimensional subspace that changes in such a way that the matrix is low rank for large enough. This can be modeled in various ways. The simplest and most commonly used model for data that lies in a low-dimensional subspace is to assume that at all times, it is is independent and identically distributed (iid) with zero mean and a fixed covariance matrix that is low rank. However this is often impractical since, in most applications, data statistics change with time, albeit “slowly”. To model this perfectly, one would need to assume that is zero mean with covariance that changes at each time. Let denote its diagonalization (with tall); then this means that both and can change at each time . This is the most general case but it but it has an identifiability problem for estimating the subspace of . The subspace spanned by the columns of cannot be estimated with one data point. If has columns, one needs or more data points for its accurate estimation. So, if changes at each time, it is not clear how all the subspaces can be accurately estimated. Moreover, in general (without specific assumptions), this will not ensure that the matrix is low rank. To resolve this issue, a general enough but tractable option is to assume that is constant for a certain period of time and then changes and can change at each time. To ensure that changes “slowly”, we assume that, when changes, the eigenvalues along the newly added subspace directions are small for some time ( frames) after the change. One precise model for this is specified next. We also assumed boundedness of . This is more practically valid rather than the usual Gaussian assumption (often made for simplicity) since most sensor data or noise is bounded.
Model 2.2 (Model on ).
Assume that the are zero mean and bounded random vectors in that are mutually independent over time. Also assume that their covariance matrix has an eigenvalue decomposition
where changes as
and changes as follows. For , define and assume that
where and but not too large 111Our result would still hold if the were different for each change time (i.e. ). We let them be the same to reduce notation. If we do not want to use here in the model on , we can replace (4) by (for a positive constant ) instead and assume in the theorem that is close to , e.g. will suffice. . We assume that (a) for a ; and (b) for all , . Here and are algorithm parameters that are set in Theorem 2.7.
Other minor assumptions are as follows. (i) Define and assume that . (ii) For , define
and assume that . This ensures that, for all , the matrix is low-rank. (iii) Define
as the maximum eigenvalue at any time and assume that .
Observe from the above that is a basis matrix and is diagonal. We refer to the ’s as the subspace change times.
A visual depiction of the above model can be found in Figure 1.
Define the largest and smallest eigenvalues along the new directions for the first frames after a subspace change as
The slow change model on is one way to ensure that
i.e. the maximum variance of the projection ofalong the new directions is small enough for the first frames after a change. Also the minimum variance is larger than a constant greater than zero (and hence detectable). The proof of our main result only relies on (5) and does not use the actual slow increase model in any other way. The above inequality along with quantifies “slow subspace change”.
Notice that the above model does not put any assumption on the eigenvalues along the existing directions. In particular, they do not need to be greater than zero and hence the model automatically allows existing directions (columns of for ) to drop out of the current subspace. It could be the case that for some time period, (for an corresponding to a column of ), so that the column of is not contributing anything to at that time. For the same index , could also later increase again to a nonzero value. Therefore is only a bound on the rank of for , and not necessarily the rank itself. A more explicit model for deletion of directions is to let change as
where contains the columns of for which the variance is zero. If we add the assumption that be a basis matrix (i.e. deleted directions cannot be part of a later ), then this is a special case of Model 2.2 above. We say special case because this only allows deletions at times , whereas Model 2.2 allows deletion of old directions at any time.
The above model assumes that ’s are zero mean and mutually independent over time. In the video analytics application, zero mean is easy to ensure by letting be the background image at time with an empirical ‘mean background image’ (computed using the training data) subtracted out. The independence assumption then models independent background variations around a common mean. As we explain in Section IX , this can be easily relaxed and we can get a result very similar to the current one under a first order autoregressive model on the
, this can be easily relaxed and we can get a result very similar to the current one under a first order autoregressive model on the’s.
For , let and . Observe that Model 2.2 does not have any constraint on . Thus if we assume that its entries are such that their changes from to are smaller than or equal to , then clearly, for all and all 222This follows because and . Thus the ratio is bounded by since .. Since is large, the upper bound is a small quantity, i.e. the covariance matrix changes slowly. For later time instants, we do not have any requirement (and so in particular could still change slowly). Hence the above model includes “slow changing” and low-rank as a special case.
Ii-B Model on the set of missing entries or the outlier support set,
Our result requires that the set of missing entries (or the outlier support sets), , have some changes over time. We give one simple model for it below. One example that satisfies this model is a video application consisting of a foreground with one object of length or less that can remain static for at most frames at a time. When it moves, it moves downwards (or upwards, but always in one direction) by at least pixels, and at most pixels. Once it reaches the bottom of the scene, it disappears. The maximum motion is such that, if the object were to move at each frame, it still does not go from the top to the bottom of the scene in a time interval of length , i.e. . Anytime after it has disappeared another object could appear. We show this example in Fig. 2.
Model 2.3 (model on ).
Let , with , denote the times at which changes and let denote the distinct sets. For an integer (we set its value in Theorem 2.7), assume the following.
Assume that for all times with and .
Let be a positive integer so that for any ,
For any ,
and for any ,
(One way to ensure is to require that for all , with .)
In this model, takes values ; the largest value it can take is (this will happen if changes at every time).
Clearly the video moving object example satisfies the above model as long as . 333Let be the support set of the object (set of pixels containing the object). The first condition holds since there is at most one object of size or less and the object cannot remain static for more than frames. Since it moves in one direction by at least each time it moves, this means that definitely after it moves times, the supports will be disjoint (second condition). The third condition holds because it moves in one direction and by at most with (so even if it were to move at each , i.e. if for all , the third condition will hold). Also see Fig. 2. This becomes clearer from Fig. 2.
In order to recover the ’s from missing data or to separate them from the sparse outliers, the basis vectors for the subspace from which they are generated cannot be sparse. We quantify this using the incoherence condition from . Let be the smallest real number such that
Recall from the notation section that is the column of the identity matrix (or standard basis vector). We bound and in the theorem.
Ii-D Main Result for Online Matrix Completion
Recall that and . Define , and .
Also define , and for , . Let
Notice that . Also, and for , .
The following theorem gives a correctness result for Algorithm 1 given and explained in Section IV. The algorithm has two parameters - and . The parameter is the number of consecutive time instants that are used to obtain an estimate of the new subspace, and is the total number of times the new subspace is estimated before we get an accurate enough estimate of it. The algorithm uses and defined in Definition 2.1 and as inputs.
Suppose that the following hold.
Then, with probability at least
Then, with probability at least, at all times ,
the subspace error is bounded above by for .
Theorem 2.5 says that if an accurate estimate of the initial subspace is available; the two algorithm parameters are set appropriately; the ’s are mutually independent over time and the low-dimensional subspace from which is generated changes “slowly” enough, i.e. (a) the delay between change times is large enough () and (b) the eigenvalues along the newly added directions are small enough for frames after a subspace change (so that (5) holds); the set of missing entries at time , , has enough changes; and the basis vectors that span the low-dimensional subspaces are dense enough; then, with high probability (w.h.p.), the error in estimating will be small at all times . Also, the error in estimating the low-dimensional subspace will be initially large when new directions are added, but will decay to a small constant times within a finite delay.
Consider the accurate initial subspace assumption. If the training data truly satisfies (without any noise or modeling error) and if we have at least linearly independent ’s (if ’s are continuous random vectors, this corresponds to needing almost surely), then the estimate of obtained from training data will actually be exact, i.e. we will have . The theorem assumption that allows for the initial training data to be noisy or not exactly satisfying the model. If the training data is noisy, we need to know (in practice this is computed by thresholding to retain a certain percentage of largest eigenvalues). In this case we can let be the -th eigenvalue of and be the top eigenvectors.
The following corollary is also proved when we prove the above result.
The following conclusions also hold under the assumptions of Theorem 2.7 with probability at least
The estimates of the subspace change times given by Algorithm 1 satisfy , for ;
The estimates of the number of new directions are correct, i.e. for and ;
The recovery error satisfies:
The subspace error satisfies,
Ii-E Main Result for Online Robust PCA
Recall that in this case we assume that the observations satisfy with the support of , denoted , not known. We have the following result for Algorithm 2 given and explained in Section IV. This requires two extra assumptions beyond what the previous result needed. For the matrix completion problem, the set of missing entries is known, while in the robust PCA setting, the support set, , of the sparse outliers, , must be determined. We recover this using an ell-1 minimization step followed by thresholding. To do this correctly, we need a lower bound on the absolute values of the nonzero entries of . Moreover, Algorithm 2 has two extra parameters - , which is the bound on the two norm of the noise seen by the ell-1 minimization step, and , which is the threshold used to recover the support of . These need to be set appropriately.
The two extra algorithm parameters are set as: and
Then with probability at least ,
the support set is exactly recovered, i.e. for all ;
The second assumption above can be interpreted as either a lower bound on , or as an upper bound on in terms of . This latter interpretation is another “slow subspace change” condition. For the ’s, this result shows that their support is exactly recovered w.h.p. and its nonzero entries are accurately recovered.
Ii-F Simple Generalizations
Model on . Consider the subspace change model, Model 2.2. For simplicity we put a slow increase model on the eigenvalues along the new directions for the entire period . However, as explained below the model, the proof of our result does not actually use this slow increase model. It only uses (5), i.e. . Recall that and are the minimum and maximum eigenvalues along the new directions for the first frames after a subspace change. Thus, in the interval our proof actually does not need any constraint on .
With a minor modification to our proof, we can prove our result with an even weaker condition. We need (5) to hold with being the minimum of the minimum eigenvalues of any -frame average covariance matrix along the new directions over the period , i.e. with . For video analytics, this translates to requiring that, after a subspace change, enough (but not necessarily all) background frames have ‘detectable’ energy along the new directions, so that the minimum eigenvalue of the average covariance is above a threshold.
Secondly, we should point out that there is a trade off between the bound on , and consequently on , in Model 2.2 and the bound on assumed in Model 2.3. Allowing a larger value of (faster subspace change) will require a tighter bound on which corresponds to requiring more changes to . We chose the bounds and for simplicity of computations. There are many other pairs that would also work. The above trade-off can be seen from the proof of Lemma 6.14. The proof uses Model 5.1 of which Model 2.3 is a special case. For video analytics, this means that if the background subspace changes are faster, then we also need the foreground objects to be moving more so we can ‘see’ enough of the background behind them.
Thirdly, in Model 2.2 we let be an EVD of . This automatically implies that is diagonal. But our proof only uses the fact that is block diagonal with blocks and . If we relax this and we let be a decomposition of where is block diagonal as above, then our model allows the variance along any direction from to become zero for any period of time and/or become nonzero again later. Thus, in the special case of (6) we can actually allow , where is an rotation matrix and contains the columns of for which the variance is zero. This will be a special case of this generalization if is a basis matrix.
Initialization condition. The first condition of the theorem requires that we have accurate initial subspace knowledge. As explained below the theorem, this means that we can allow noisy training data. Moreover, notice that if we let , then new background directions can enter the subspace at the same time as the first foreground object. Said another way, all we need is an accurate enough estimate of all but directions of the initial subspace, and an assumption of small eigenvalues for sometime ( frames) along the directions for which we do not have an accurate enough estimate (or do not have an estimate).
Denseness assumption. Consider the denseness assumption. Define the (un)denseness coefficient as follows.
For a basis matrix , define .
Notice that left hand side in (7) is . Using the triangle inequality, it is easy to show that . Therefore, using the fact that for a basis matrix , (see proof of the first statement of Lemma C.2 in Appendix C), the denseness assumptions of Theorem 2.7 imply that
The reason for defining as above is the following lemma from .
Lemma 2.9 ().
For a basis matrix , .
Lower bound on minimum nonzero entry of in the online RPCA result (Corollary 2.7). For online RPCA, notice that our result needs a lower bound on the minimum magnitude nonzero entry, , of the outlier vector . This may seem counter-intuitive, since it means that outlier magnitudes need to be large enough for the proposed algorithm to work whereas one would expect that smaller corruptions are easier to deal with. This is actually true in our case as well, and the lower bound on minimum nonzero entry of is an artifact of trying to use a simpler model and a simpler proof approach. As we explain next, what we really need is that the corruptions either be small enough (to not affect subspace recovery too much) or be large enough (to be detectable).