 # Compressed Sensing and Matrix Completion with Constant Proportion of Corruptions

We improve existing results in the field of compressed sensing and matrix completion when sampled data may be grossly corrupted. We introduce three new theorems. 1) In compressed sensing, we show that if the m × n sensing matrix has independent Gaussian entries, then one can recover a sparse signal x exactly by tractable ℓ1 minimization even if a positive fraction of the measurements are arbitrarily corrupted, provided the number of nonzero entries in x is O(m/(log(n/m) + 1)). 2) In the very general sensing model introduced in "A probabilistic and RIPless theory of compressed sensing" by Candes and Plan, and assuming a positive fraction of corrupted measurements, exact recovery still holds if the signal now has O(m/(log^2 n)) nonzero entries. 3) Finally, we prove that one can recover an n × n low-rank matrix from m corrupted sampled entries by tractable optimization provided the rank is on the order of O(m/(n log^2 n)); again, this holds when there is a positive fraction of corrupted samples.


## 1 Introduction

### 1.1 Introduction on Compressed Sensing with Corruptions

Compressed sensing (CS) has been well-studied in recent years [9, 19]. This novel theory asserts that a sparse or approximately sparse signal can be acquired by taking just a few non-adaptive linear measurements. This fact has numerous consequences which are being explored in a number of fields of applied science and engineering. In CS, the acquisition procedure is often represented as $y = Ax$, where the $m \times n$ matrix $A$ is called the sensing matrix and $y$ is the vector of measurements or observations. It is now well-established that the solution $\hat{x}$ to the optimization problem

$$\min_{\tilde{x}} \|\tilde{x}\|_1 \quad \text{such that} \quad A\tilde{x} = y, \tag{1.1}$$

is guaranteed to be the original signal $x$ with high probability, provided $x$ is sufficiently sparse and $A$ obeys certain conditions. A typical result is this: if $A$ has iid Gaussian entries, then exact recovery occurs provided $\|x\|_0 \le Cm/(\log(n/m) + 1)$ [10, 18, 37] for some positive numerical constant $C$. Here is another example: if $A$ is a matrix with rows randomly selected from the DFT matrix, the condition becomes $\|x\|_0 \le Cm/\log n$.

This paper discusses a natural generalization of CS, which we shall refer to as compressed sensing with corruptions. We assume that some entries of the data vector are totally corrupted but we have absolutely no idea which entries are unreliable. We still want to recover the original signal efficiently and accurately. Formally, we have the mathematical model

$$y = Ax + f = [A, I]\begin{bmatrix} x \\ f \end{bmatrix}, \tag{1.2}$$

where $x$ is an $n$-dimensional vector and $f$ is an $m$-dimensional vector. The number of nonzero coefficients in $x$ is $\|x\|_0$ and similarly for $f$. As in the above model, $A$ is an $m \times n$ sensing matrix, usually sampled from a probability distribution. The problem of recovering $x$ (and hence $f$) from $y$ has been recently studied in the literature in connection with some interesting applications. We discuss a few of them.

• Clipping. Signal clipping frequently appears because of nonlinearities in the acquisition device [27, 38]. Here, one typically measures $g(Ax)$ rather than $Ax$, where $g$ is always a nonlinear map. Letting $f = g(Ax) - Ax$, we thus observe $y = Ax + f$. Nonlinearities usually occur at large amplitudes so that for those components with small amplitudes, we have $f_i = g(Ax)_i - (Ax)_i = 0$. This means that $f$ is sparse and, therefore, our model is appropriate. Just as before, locating the portion of the data vector that has been clipped may be difficult because of additional noise.

• CS for networked data. In a sensor network, different sensors will collect measurements of the same signal $x$ independently (they each measure $z_i = \langle a_i, x \rangle$) and send the outcome to a central hub for analysis [23, 30]. By setting $a_i^*$ as the row vectors of $A$, this is just $z = Ax$. However, typically some sensors will fail to send the measurements correctly, and will sometimes report totally meaningless measurements. Therefore, we collect $y = Ax + f$, where $f$ models recording errors.

There have been several theoretical papers investigating the exact recovery method for CS with corruptions [40, 29, 28, 30, 38], and all of them consider the following recovery procedure in the noiseless case:

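$$\min_{\tilde{x}, \tilde{f}} \|\tilde{x}\|_1 + \lambda(m,n)\|\tilde{f}\|_1 \quad \text{such that} \quad A\tilde{x} + \tilde{f} = [A, I]\begin{bmatrix} \tilde{x} \\ \tilde{f} \end{bmatrix} = y. \tag{1.3}$$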

We will compare them with our results in Section 1.4.
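Program (1.3) becomes a linear program once each variable is split into its positive and negative parts. The following is a minimal sketch, assuming SciPy's `linprog` with the HiGHS backend is available. The function name `l1_recover`, the problem sizes, the support locations and the weight `lam=1.0` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np
from scipy.optimize import linprog


def l1_recover(A, y, lam):
    """Solve min ||x||_1 + lam * ||f||_1 subject to A x + f = y as an LP.

    Uses the splitting x = xp - xn, f = fp - fn with all parts >= 0.
    """
    m, n = A.shape
    # Cost vector for the stacked variable [xp, xn, fp, fn].
    c = np.concatenate([np.ones(2 * n), lam * np.ones(2 * m)])
    I = np.eye(m)
    A_eq = np.hstack([A, -A, I, -I])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    z = res.x
    x_hat = z[:n] - z[n:2 * n]
    f_hat = z[2 * n:2 * n + m] - z[2 * n + m:]
    return x_hat, f_hat


# Illustrative instance (all values hypothetical): sparse signal, sparse gross errors.
rng = np.random.default_rng(0)
m, n = 40, 50
A = rng.standard_normal((m, n)) / np.sqrt(m)
x = np.zeros(n)
x[[3, 17]] = [1.5, -2.0]
f = np.zeros(m)
f[[5, 20]] = [10.0, -7.0]
y = A @ x + f

x_hat, f_hat = l1_recover(A, y, lam=1.0)
print("signal error:", np.linalg.norm(x_hat - x))
print("corruption error:", np.linalg.norm(f_hat - f))
```

Because the true pair is feasible, the returned objective value can never exceed that of the true pair. Whether the two coincide on a given instance depends on the sparsity levels and on the weight.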

### 1.2 Introduction on matrix completion with corruptions

Matrix completion (MC) bears some similarity with CS. Here, the goal is to recover a low-rank matrix $L$ from a small fraction of linear measurements. For simplicity, we suppose the matrix is square, of size $n \times n$ (the general case is similar). The standard model is that we observe $P_O(L)$, where $O \subset \{1, \dots, n\} \times \{1, \dots, n\}$ and

$$P_O(L)_{ij} = \begin{cases} L_{ij} & \text{if } (i,j) \in O; \\ 0 & \text{otherwise.} \end{cases}$$

The problem is to recover the original matrix $L$, and there have been many papers studying this problem in recent years, see [33, 8, 12, 26, 21], for example. Here one minimizes the nuclear norm (the sum of all the singular values) to recover the original low-rank matrix. We discuss below an improved result due to Gross (with a slight difference).
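The two objects used in this subsection, the sampling operator and the nuclear norm, can be illustrated numerically. This is a small sketch with NumPy. The helper names `sample_entries` and `nuclear_norm`, the rank-one matrix and the mask are made up for the example.

```python
import numpy as np


def sample_entries(L, mask):
    """Keep L[i, j] where mask[i, j] is True and set every other entry to zero."""
    return np.where(mask, L, 0.0)


def nuclear_norm(L):
    """Sum of the singular values of L."""
    return np.linalg.svd(L, compute_uv=False).sum()


# A rank-one 3 x 3 matrix and a hypothetical set of observed positions.
L = np.outer([1.0, 2.0, 3.0], [1.0, 0.0, -1.0])
mask = np.array([[True, False, True],
                 [False, True, True],
                 [True, True, False]])

observed = sample_entries(L, mask)
print(observed)
print(nuclear_norm(L))  # for a rank-one matrix u v^T this equals ||u||_2 * ||v||_2
```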

Define for some by meaning that

are iid Bernoulli random variables with parameter

. Then the solution to

$$\min_{\tilde{L}} \|\tilde{L}\|_* \quad \text{such that} \quad P_O(\tilde{L}) = P_O(L), \tag{1.4}$$

is guaranteed to be exactly with high probability, provided . Here, is a positive numerical constant, is the rank of , and is an incoherence parameter introduced in  which is only dependent of .

This paper is concerned with the situation in which some entries may have been corrupted. Therefore, our model is that we observe

$$P_O(L) + S, \tag{1.5}$$

where and are the same as before and is supported on . Just as in CS, this model has broad applicability. For example, Wu et al. used this model in photometric stereo . This problem has also been introduced in  and is related to recent work in separating a low-rank from a sparse component [14, 4, 24, 13, 43]. A typical result is that the solution to

$$\min_{\tilde{L}, \tilde{S}} \|\tilde{L}\|_* + \lambda(m,n)\|\tilde{S}\|_1 \quad \text{such that} \quad P_O(\tilde{L}) + \tilde{S} = P_O(L) + S, \tag{1.6}$$

is guaranteed to be the true pair with high probability under some assumptions about [4, 16]. We will compare them with our result in Section 1.4.

### 1.3 Main results

This section introduces three models and three corresponding recovery results. The proofs of these results are deferred to Section 2 for Theorem 1.1, Section 3 for Theorem 1.2 and Section 4 for Theorem 1.3.

#### 1.3.1 CS with iid matrices [Model 1]

###### Theorem 1.1

Suppose that is an random matrix whose entries are iid Gaussian variables with mean

and variance

, the signal to acquire is , and our observation is where and . Then by choosing , the solution to

$$\min_{\tilde{x}, \tilde{f}} \|\tilde{x}\|_1 + \lambda\|\tilde{f}\|_1 \quad \text{such that} \quad \|(A\tilde{x} + \tilde{f}) - y\|_2 \le \epsilon \tag{1.7}$$

satisfies with probability at least . This holds universally; that is to say, for all vectors and obeying and . Here , , and are numerical constants.

In the above statement, the matrix is random. Everything else is deterministic. The reader will notice that the number of nonzero entries is on the same order as that needed for recovery from clean data [10, 19, 3, 37], while the condition of implies that one can tolerate a constant fraction of possibly adversarial errors. Moreover, our convex optimization is related to LASSO  and Basis Pursuit .

#### 1.3.2 CS with general sensing matrices [Model 2]

In this model, and

$$A = \frac{1}{\sqrt{m}} \begin{pmatrix} a_1^* \\ \vdots \\ a_m^* \end{pmatrix},$$

where are iid copies of a random vector whose distribution obeys the following two properties: 1) ; 2) . This model has been introduced in  and includes a lot of the stochastic models used in the literature. Examples include partial DFT matrices, matrices with iid entries, certain random convolutions  and so on.

In this model, we assume that and in (1.2) have fixed support denoted by and , and with cardinality and . In the remainder of the paper, is the restriction of to indices in and is the restriction of to . Our main assumption here concerns the sign sequences: the sign sequences of and are independent of each other, and each is a sequence of symmetric iid variables.

###### Theorem 1.2

For the model above, the solution to (1.3), with , is exact with probability at least , provided that and . Here , and are some numerical constants.

Above, and have fixed supports and random signs. However, by a recent de-randomization technique first introduced in , exact recovery with random supports and fixed signs would also hold. We will explain this de-randomization technique in the proof of Theorem 1.3. In some specific models, such as independent rows from the DFT matrix, could be a numerical constant, which implies the proportion of corruptions is also a constant. An open problem is whether Theorem 1.2 still holds in the case where and have both fixed supports and signs. Another open problem is to know whether the result would hold under more general conditions about as in  in the case where has both random support and random signs.

We emphasize that the sparsity condition is a little stronger than the optimal result available in the noise-free literature [9, 7]), namely,. The extra logarithmic factor appears to be important in the proof which we will explain in Section 3, and a third open problem is whether or not it is possible to remove this factor.

Here we do not give a sensitivity analysis for the recovery procedure as in Model 1. Actually by applying a similar method introduced in  to our argument in Section 3, a very good error bound could be obtained in the noisy case. However, technically there is little novelty but it will make our paper very long. Therefore we decide to only discuss the noiseless case and focus on the sampling rate and corruption ratio.

#### 1.3.3 MC from corrupted entries [Model 3]

We assume is of rank and write its reduced SVD as , where and . Let be the smallest quantity such that for all ,

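$$\|UU^* e_i\|_2^2 \le \frac{\mu r}{n}, \quad \|VV^* e_i\|_2^2 \le \frac{\mu r}{n}, \quad \text{and} \quad \|UV^*\|_\infty \le \frac{\sqrt{\mu r}}{n}.$$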

This model is the same as that originally introduced in , and later used in [21, 32, 12, 4, 16]. We observe , where and is supported on . Here we assume that satisfy the following model:

##### Model 3.1:

1. Fix an by matrix , whose entries are either or .
2. Define for a constant satisfying . Specifically speaking, are iid Bernoulli random variables with parameter .
3. Conditioning on , assume that are independent events with . This implies that .
4. Define . Then we have
5. Let be supported on , and .

###### Theorem 1.3

Under Model 3.1, suppose and . Moreover, suppose and denote as the optimal solution to the problem (1.6). Then we have with probability at least for some numerical constant , provided the numerical constants is sufficiently small and is sufficiently large.

In this model is available while , and are not known explicitly from the observation . By the assumption , we can use to approximate . From the following proof we can see that is not required to be exactly for the exact recovery. The power of our result is that one can recover a low-rank matrix from a nearly minimal number of samples even when a constant proportion of these samples has been corrupted.

We only discuss the noiseless case for this model. Actually by a method similar to 

, a suboptimal estimation error bound can be obtained by a slight modification of our argument. However, it is of little interest technically and beyond the optimal result when

is large. There are other suboptimal results for matrix completion with noise, such as , but the error bound is not tight when the additional noise is small. We want to focus on the noiseless case in this paper and leave the problem with noise for future work.

The values of $\lambda$ are chosen for theoretical guarantee of exact recovery in Theorems 1.1, 1.2 and 1.3. In practice, $\lambda$ is usually chosen by cross validation.

### 1.4 Comparison with existing results, related work and our contribution

In this section we will compare Theorems 1.1, 1.2 and 1.3 with existing results in the literature.

We begin with Model 1. In , Wright and Ma discussed a model where the sensing matrix has independent columns with common mean and normal perturbations with variance . They chose , and proved that with high probability provided , and has random signs. Here is much smaller than . We notice that since the authors of  talked about a different model, which is motivated by , it may not be comparable with ours directly. However, for our motivation of CS with corruptions, we assume satisfy a symmetric distribution and get better sampling rate.

A bit later, Laska et al.  and Li et al.  also studied this problem. By setting , both papers establish that for Gaussian (or sub-Gaussian) sensing matrices , if , then the recovery is exact. This follows from the fact that obeys a restricted isometry property known to guarantee exact recovery of sparse vectors via minimization. Furthermore, the sparsity requirement about is the same as that found in the standard CS literature, namely, . However, the result does not allow a positive fraction of corruptions. For example, if , we have , which will go to zero as goes to zero.

As for Model 2, an interesting piece of work  (and later  on the noisy case) appeared during the preparation of this paper. These papers discuss models in which

is formed by selecting rows from an orthogonal matrix with low incoherence parameter

, which is the minimum value such that for any . The main result states that selecting gives exact recovery under the following assumptions: 1) the rows of are chosen from an orthogonal matrix uniformly at random; 2) is a random signal with independent signs and equally likely to be either ; 3) the support of is chosen uniformly at random. (By the de-randomization technique introduced in  and used in , it would have been sufficient to assume that the signs of are independent and take on the values with equal probability). Finally, the sparsity conditions require and , which are nearly optimal, for the best known sparsity condition when is . In other words, the result is optimal up to an extra factor of ; the sparsity condition about is of course nearly optimal.

However, the model for does not include some models frequently discussed in the literature such as subsampled tight or continuous frames. Against this background, a recent paper of Candès and Plan  considers a very general framework, which includes a lot of common models in the literature. Theorem 1.2 in our paper is similar to Theorem 1 in . It assumes similar sparsity conditions, but is based on this much broader and more applicable model introduced in . Notice that, we require whereas  requires . Therefore, we improve the condition by a factor of , which is always at least and can be as large as . However, our result imposes , which is worse than by the same factor. In , the parameter depends upon , while our is only a function of and . This is why the results differ, and we prefer to use a value of that does not depend on because in some applications, an accurate estimate of may be difficult to obtain. In addition, we use different techniques of proof which the clever golfing scheme of  is exploited.

Sparse approximation is another problem of underdetermined linear system where the dictionary matrix is always assumed to be deterministic. Readers interested in this problem (which always requires stronger sparsity conditions) may also want to study the recent paper  by Studer et al. There, the authors introduce a more general problem of the form , and analyzed the performance of -recovery techniques by using ideas which have been popularized under the name of generalized uncertainty principles in the basis pursuit and sparse approximation literature.

As for Model 3, Theorem 1.3 is a significant extension of the results presented in , in which the authors have a stringent requirement . In a very recent and independent work , the authors consider a model where both and are unions of stochastic and deterministic subsets, while we only assume the stochastic model. We recommend interested readers to read the paper for the details. However, only considering their results on stochastic and , a direct comparison shows that the number of samples we need is less than that in this reference. The difference is several logarithmic factors. Actually, the requirement of in our paper is optimal even for clean data in the literature of MC. Finally, we want to emphasize that the random support assumption is essential in Theorem 1.3 when the rank is large. Examples can be found in .

We wish to close our introduction with a few words concerning the techniques of proof we shall use. The proof of Theorem 1.1 is based on the concept of restricted isometry, which is a standard technique in the literature of CS. However, our argument involves a generalization of the restricted isometry concept. The proofs of Theorems 1.2 and 1.3 are based on the golfing scheme, an elegant technique pioneered by David Gross , and later used in [32, 4, 7] to construct dual certificates. Our proof leverages results from . However, we contribute novel elements by finding an appropriate way to phrase sufficient optimality conditions, which are amenable to the golfing scheme. Details are presented in the following sections.

## 2 A Proof of Theorem 1.1

In the proof of Theorem 1.1, we will see the notation $P_T x$. Here $x$ is a $k$-dimensional vector, $T$ is a subset of $\{1, \dots, k\}$ and we also use $T$ to represent the subspace of all $k$-dimensional vectors supported on $T$. Then $P_T x$ is the projection of $x$ onto the subspace $T$, which keeps the values of $x$ on the support $T$ and sets the other elements to zero. In this section we use the floor function $\lfloor \cdot \rfloor$ to denote the integer part of any real number.
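The coordinate projection described above amounts to masking entries. This is a minimal sketch in NumPy, where the helper name `project_onto_support` and the sample vector are hypothetical.

```python
import numpy as np


def project_onto_support(x, T):
    """Return the vector that agrees with x on the index set T and is zero elsewhere."""
    out = np.zeros_like(x)
    idx = np.asarray(sorted(T), dtype=int)
    out[idx] = x[idx]
    return out


x = np.array([3.0, -1.0, 0.5, 2.0])
p = project_onto_support(x, {0, 3})
print(p)
```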

First we generalize the concept of the restricted isometry property (RIP)  for the convenience to prove our theorem:

###### Definition 2.1

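For any matrix $\Phi$ with $n + m$ columns, define the RIP-constant $\delta_{s_1, s_2}$ as the infimum value of $\delta$ such that

$$(1 - \delta)\left(\|x\|_2^2 + \|f\|_2^2\right) \le \left\| \Phi \begin{bmatrix} x \\ f \end{bmatrix} \right\|_2^2 \le (1 + \delta)\left(\|x\|_2^2 + \|f\|_2^2\right)$$

holds for any $x$ with $|\mathrm{supp}(x)| \le s_1$ and $f$ with $|\mathrm{supp}(f)| \le s_2$.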

###### Lemma 2.2

For any and such that , and , , we have

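$$\left| \left\langle \Phi \begin{bmatrix} x_1 \\ f_1 \end{bmatrix}, \Phi \begin{bmatrix} x_2 \\ f_2 \end{bmatrix} \right\rangle \right| \le \delta_{s_1, s_2} \sqrt{\|x_1\|_2^2 + \|f_1\|_2^2} \sqrt{\|x_2\|_2^2 + \|f_2\|_2^2}$$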

Proof First, we suppose . By the definition of , we have

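$$2(1 - \delta_{s_1, s_2}) \le \left\langle \Phi \begin{bmatrix} x_1 + x_2 \\ f_1 + f_2 \end{bmatrix}, \Phi \begin{bmatrix} x_1 + x_2 \\ f_1 + f_2 \end{bmatrix} \right\rangle \le 2(1 + \delta_{s_1, s_2}),$$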

and

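$$2(1 - \delta_{s_1, s_2}) \le \left\langle \Phi \begin{bmatrix} x_1 - x_2 \\ f_1 - f_2 \end{bmatrix}, \Phi \begin{bmatrix} x_1 - x_2 \\ f_1 - f_2 \end{bmatrix} \right\rangle \le 2(1 + \delta_{s_1, s_2}).$$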

By the above inequalities, we have , and hence by homogeneity, we have without the norm assumption.
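The step from the two displayed inequalities to the conclusion can be spelled out with the polarization identity. The following is a sketch under assumptions that are not legible in the extracted text: the vectors are real, the stacked vectors $u_i = [x_i; f_i]$ are normalized so that $\|u_1\|_2 = \|u_2\|_2 = 1$, and $u_1$, $u_2$ have disjoint supports, so that $\|u_1 \pm u_2\|_2^2 = 2$. The shorthand $u_i$ is introduced here for illustration only.

```latex
\begin{aligned}
\langle \Phi u_1, \Phi u_2 \rangle
  &= \tfrac{1}{4}\left( \|\Phi(u_1 + u_2)\|_2^2 - \|\Phi(u_1 - u_2)\|_2^2 \right), \\
\left| \langle \Phi u_1, \Phi u_2 \rangle \right|
  &\le \tfrac{1}{4}\left( 2(1 + \delta_{s_1, s_2}) - 2(1 - \delta_{s_1, s_2}) \right)
   = \delta_{s_1, s_2}.
\end{aligned}
```

For general nonzero vectors, applying this to $u_i / \|u_i\|_2$ and multiplying through by $\|u_1\|_2 \|u_2\|_2$ gives the bound stated in the lemma.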

###### Lemma 2.3

Suppose with RIP-constant ()and is between and . Then for any with , any with , and any with the solution to the optimization problem (1.7) satisfies .

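Proof Suppose $\Delta x = \hat{x} - x$ and $\Delta f = \hat{f} - f$. Then by (1.7) we have

$$\left\| \Phi \begin{bmatrix} \Delta x \\ \Delta f \end{bmatrix} \right\|_2 \le \|w\|_2 + \left\| \Phi \begin{bmatrix} \hat{x} \\ \hat{f} \end{bmatrix} - \left( \Phi \begin{bmatrix} x \\ f \end{bmatrix} + w \right) \right\|_2 \le 2\epsilon.$$

It is easy to check that the original pair $(x, f)$ satisfies the inequality constraint in (1.7), so we have

$$\|x + \Delta x\|_1 + \lambda \|f + \Delta f\|_1 \le \|x\|_1 + \lambda \|f\|_1. \tag{2.1}$$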

Then it suffices to show .

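Suppose $T_0$ with $|T_0| = s_1$ such that $\mathrm{supp}(x) \subset T_0$. Denote $T_0^c = T_1 \cup \dots \cup T_l$ where $|T_1| = \dots = |T_{l-1}| = s_1$ and $|T_l| \le s_1$. Moreover, suppose $T_1$ contains the indices of the $s_1$ largest (in the sense of absolute value) coefficients of $P_{T_0^c} \Delta x$, $T_2$ contains the indices of the $s_1$ largest coefficients of $P_{(T_0 \cup T_1)^c} \Delta x$, and so on. Similarly, define $V_0$ such that $\mathrm{supp}(f) \subset V_0$ and $|V_0| = s_2$, and divide $V_0^c = V_1 \cup \dots \cup V_k$ in the same way. By this setup, we easily have

$$\sum_{j \ge 2} \|P_{T_j} \Delta x\|_2 \le s_1^{-1/2} \|P_{T_0^c} \Delta x\|_1, \tag{2.2}$$

and

$$\sum_{j \ge 2} \|P_{V_j} \Delta f\|_2 \le s_2^{-1/2} \|P_{V_0^c} \Delta f\|_1. \tag{2.3}$$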
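Bounds (2.2) and (2.3) follow from sorting the coefficients outside the head set by magnitude and cutting them into blocks of equal size. Each entry of a block is then no larger than the average magnitude over the previous block. The sketch below checks (2.2) numerically on a random vector. The helper `block_partition`, the dimensions and the choice of head set are hypothetical.

```python
import numpy as np


def block_partition(v, head, s):
    """Split the indices outside `head` into blocks of size s, ordered by decreasing |v|."""
    rest = np.setdiff1d(np.arange(v.size), head)
    order = rest[np.argsort(-np.abs(v[rest]), kind="stable")]
    return [order[i:i + s] for i in range(0, order.size, s)]


rng = np.random.default_rng(0)
n, s = 30, 4
delta = rng.standard_normal(n)          # plays the role of Delta x
T0 = np.arange(s)                       # hypothetical head set of size s
blocks = block_partition(delta, T0, s)  # blocks[0] is T1, blocks[1] is T2, ...

# Left side of (2.2): sum of l2 norms over the blocks T2, T3, ...
tail_l2 = sum(np.linalg.norm(delta[B]) for B in blocks[1:])
# Right side of (2.2): s^{-1/2} times the l1 norm of delta outside T0
outside = np.concatenate(blocks)
bound = np.abs(delta[outside]).sum() / np.sqrt(s)

print(tail_l2, "<=", bound)
```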

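On the other hand, by the assumption $\mathrm{supp}(x) \subset T_0$ and $\mathrm{supp}(f) \subset V_0$, we have

$$\|x + \Delta x\|_1 = \|P_{T_0} x + P_{T_0} \Delta x\|_1 + \|P_{T_0^c} \Delta x\|_1 \ge \|x\|_1 - \|P_{T_0} \Delta x\|_1 + \|P_{T_0^c} \Delta x\|_1, \tag{2.4}$$

and similarly,

$$\|f + \Delta f\|_1 \ge \|f\|_1 - \|P_{V_0} \Delta f\|_1 + \|P_{V_0^c} \Delta f\|_1. \tag{2.5}$$

By inequalities (2.1), (2.4) and (2.5), we have

$$\|P_{T_0^c} \Delta x\|_1 + \lambda \|P_{V_0^c} \Delta f\|_1 \le \|P_{T_0} \Delta x\|_1 + \lambda \|P_{V_0} \Delta f\|_1. \tag{2.6}$$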

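By the definition of $\delta_{2s_1, 2s_2}$, the fact $\left\| \Phi \begin{bmatrix} \Delta x \\ \Delta f \end{bmatrix} \right\|_2 \le 2\epsilon$ and Lemma 2.2, we have

$$\begin{aligned}
&(1 - \delta_{2s_1, 2s_2})\left( \|P_{T_0} \Delta x + P_{T_1} \Delta x\|_2^2 + \|P_{V_0} \Delta f + P_{V_1} \Delta f\|_2^2 \right) \\
&\quad \le \left\| \Phi \begin{bmatrix} P_{T_0} \Delta x + P_{T_1} \Delta x \\ P_{V_0} \Delta f + P_{V_1} \Delta f \end{bmatrix} \right\|_2^2 \\
&\quad = \left\langle \Phi \begin{bmatrix} P_{T_0} \Delta x + P_{T_1} \Delta x \\ P_{V_0} \Delta f + P_{V_1} \Delta f \end{bmatrix}, \Phi \begin{bmatrix} \Delta x \\ \Delta f \end{bmatrix} - \Phi \begin{bmatrix} P_{T_2} \Delta x + \dots + P_{T_l} \Delta x \\ P_{V_2} \Delta f + \dots + P_{V_k} \Delta f \end{bmatrix} \right\rangle \\
&\quad \le -\left\langle \Phi \begin{bmatrix} P_{T_0} \Delta x + P_{T_1} \Delta x \\ P_{V_0} \Delta f + P_{V_1} \Delta f \end{bmatrix}, \Phi \begin{bmatrix} P_{T_2} \Delta x + \dots + P_{T_l} \Delta x \\ P_{V_2} \Delta f + \dots + P_{V_k} \Delta f \end{bmatrix} \right\rangle + 2\epsilon \left\| \Phi \begin{bmatrix} P_{T_0} \Delta x + P_{T_1} \Delta x \\ P_{V_0} \Delta f + P_{V_1} \Delta f \end{bmatrix} \right\|_2 \\
&\quad \le \delta_{2s_1, 2s_2} \left( \left\| \begin{bmatrix} P_{T_0} \Delta x \\ P_{V_0} \Delta f \end{bmatrix} \right\|_2 + \left\| \begin{bmatrix} P_{T_1} \Delta x \\ P_{V_1} \Delta f \end{bmatrix} \right\|_2 \right) \left( \sum_{j \ge 2} \|P_{T_j} \Delta x\|_2 + \sum_{j \ge 2} \|P_{V_j} \Delta f\|_2 \right) \\
&\qquad + 2\epsilon \sqrt{1 + \delta_{2s_1, 2s_2}} \sqrt{\|P_{T_0} \Delta x\|_2^2 + \|P_{T_1} \Delta x\|_2^2 + \|P_{V_0} \Delta f\|_2^2 + \|P_{V_1} \Delta f\|_2^2}.
\end{aligned}$$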

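Moreover, since

$$\begin{aligned}
\sum_{j \ge 2} \|P_{T_j} \Delta x\|_2 + \sum_{j \ge 2} \|P_{V_j} \Delta f\|_2
&\le s_1^{-1/2} \|P_{T_0^c} \Delta x\|_1 + s_2^{-1/2} \|P_{V_0^c} \Delta f\|_1 && \text{by (2.2) and (2.3)} \\
&\le 2 s_1^{-1/2} \left( \|P_{T_0^c} \Delta x\|_1 + \lambda \|P_{V_0^c} \Delta f\|_1 \right) && \text{by } \lambda > \tfrac{1}{2}\sqrt{s_1 / s_2} \\
&\le 2 s_1^{-1/2} \left( \|P_{T_0} \Delta x\|_1 + \lambda \|P_{V_0} \Delta f\|_1 \right) && \text{by (2.6)} \\
&\le 2 s_1^{-1/2} \left( s_1^{1/2} \|P_{T_0} \Delta x\|_2 + \lambda s_2^{1/2} \|P_{V_0} \Delta f\|_2 \right) && \text{by the Cauchy-Schwarz inequality} \\
&\le 4 \|P_{T_0} \Delta x\|_2 + 4 \|P_{V_0} \Delta f\|_2, && \text{by } \lambda < 2\sqrt{s_1 / s_2}
\end{aligned}$$

we have

$$\begin{aligned}
&\left( \left\| \begin{bmatrix} P_{T_0} \Delta x \\ P_{V_0} \Delta f \end{bmatrix} \right\|_2 + \left\| \begin{bmatrix} P_{T_1} \Delta x \\ P_{V_1} \Delta f \end{bmatrix} \right\|_2 \right) \left( \sum_{j \ge 2} \|P_{T_j} \Delta x\|_2 + \sum_{j \ge 2} \|P_{V_j} \Delta f\|_2 \right) \\
&\quad \le 8 \left( \|P_{T_0} \Delta x\|_2^2 + \|P_{T_1} \Delta x\|_2^2 + \|P_{V_0} \Delta f\|_2^2 + \|P_{V_1} \Delta f\|_2^2 \right).
\end{aligned}$$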

Therefore, by , we have

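$$\sqrt{\|P_{T_0} \Delta x\|_2^2 + \|P_{T_1} \Delta x\|_2^2 + \|P_{V_0} \Delta f\|_2^2 + \|P_{V_1} \Delta f\|_2^2} \le \frac{2\epsilon \sqrt{1 + \delta_{2s_1, 2s_2}}}{1 - 9\delta_{2s_1, 2s_2}}.$$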

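Since

$$\sum_{j \ge 2} \|P_{T_j} \Delta x\|_2 + \sum_{j \ge 2} \|P_{V_j} \Delta f\|_2 \le 4 \|P_{T_0} \Delta x\|_2 + 4 \|P_{V_0} \Delta f\|_2,$$

we have

$$\begin{aligned}
\|\Delta x\|_2 + \|\Delta f\|_2
&\le 5 \left( \|P_{T_0} \Delta x\|_2 + \|P_{V_0} \Delta f\|_2 \right) + \left( \|P_{T_1} \Delta x\|_2 + \|P_{V_1} \Delta f\|_2 \right) \\
&\le \sqrt{52} \sqrt{\|P_{T_0} \Delta x\|_2^2 + \|P_{T_1} \Delta x\|_2^2 + \|P_{V_0} \Delta f\|_2^2 + \|P_{V_1} \Delta f\|_2^2} \\
&\le \frac{4\sqrt{13 + 13\delta_{2s_1, 2s_2}}}{1 - 9\delta_{2s_1, 2s_2}} \, \epsilon.
\end{aligned}$$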

We now cite a well-known result in the literature of CS, e.g. Theorem 5.2 of .

###### Lemma 2.4

Suppose is a random matrix defined in model 1. Then for any , there exist such that with probability at least ,

$$(1 - \delta)\|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \delta)\|x\|_2^2$$

holds universally for any with .

Also, we cite a well-known result that bounds the largest singular value of a random matrix, e.g.  and .

###### Lemma 2.5

Let be an matrix whose entries are independent standard normal random variables. Then for every , with probability at least , one has .

We now prove Theorem 1.1:
Proof Suppose , are two constants independent of and , and their values will be specified later. Set and . We want to bound the RIP-constant for the matrix when is sufficiently small. For any with and with , and any with , any with , we have

$$\left\| [A, I] \begin{bmatrix} x \\ f \end{bmatrix} \right\|_2^2 = \|Ax + f\|_2^2 = \|Ax\|_2^2 + \|f\|_2^2 + 2\langle P_V A P_T x, f \rangle.$$
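The expansion above isolates a cross term that is controlled by the spectral norm of the submatrix $P_V A P_T$. This is a small numerical check of the identity and of the resulting bound $|2\langle P_V A P_T x, f \rangle| \le 2 \|P_V A P_T\|_{2,2} \|x\|_2 \|f\|_2$. The dimensions and the supports `T` and `V` are hypothetical, and the vectors are taken to be real.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 8, 12
A = rng.standard_normal((m, n)) / np.sqrt(m)  # iid Gaussian entries with variance 1/m

T = [0, 4, 7]  # hypothetical support of x
V = [1, 5]     # hypothetical support of f
x = np.zeros(n)
x[T] = rng.standard_normal(len(T))
f = np.zeros(m)
f[V] = rng.standard_normal(len(V))

# Coordinate projections written as diagonal 0/1 matrices.
P_T = np.zeros((n, n))
P_T[T, T] = 1.0
P_V = np.zeros((m, m))
P_V[V, V] = 1.0

cross = 2.0 * (P_V @ A @ P_T @ x) @ f
lhs = np.linalg.norm(A @ x + f) ** 2
rhs = np.linalg.norm(A @ x) ** 2 + np.linalg.norm(f) ** 2 + cross

# The spectral norm of the submatrix controls the cross term.
op_norm = np.linalg.norm(P_V @ A @ P_T, 2)
cross_bound = 2.0 * op_norm * np.linalg.norm(x) * np.linalg.norm(f)

print(lhs, rhs)
print(abs(cross), "<=", cross_bound)
```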

By Lemma 2.4, assuming , with probability at least we have

$$(1 - \delta)\|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \delta)\|x\|_2^2 \tag{2.7}$$

holds universally for any such and .

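Now we fix $T$ and $V$, and we want to bound $\|P_V A P_T\|_{2,2}$. By Lemma 2.5, we actually have

$$\|P_V A P_T\|_{2,2} \le \frac{1}{\sqrt{m}} \left( \sqrt{2s_1} + \sqrt{2s_2} + \sqrt{\delta^2 m} \right) \le 2\sqrt{2\alpha} + \delta \tag{2.8}$$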

with probability at least . Then with probability at least , inequality 2.8 holds universally for any satisfying and satisfying . By , we have , where only depends on and as , and hence . Similarly, because , we have , where only depends on and as , and hence . Therefore, inequality 2.8 holds universally for any such and with probability at least .

Combined with 2.7, we have

 (1−δ)∥x∥22+∥f∥22−(2√2α+δ)∥x∥2∥f∥2≤∥∥∥[A,I][xf]