A coherence parameter characterizing generative compressed sensing with Fourier measurements

07/19/2022
by Aaron Berk, et al. (Concordia University, McGill University)

In Bora et al. (2017), a mathematical framework was developed for compressed sensing guarantees in the setting where the measurement matrix is Gaussian and the signal structure is the range of a generative neural network (GNN). The problem of compressed sensing with GNNs has since been extensively analyzed when the measurement matrix and/or network weights follow a subgaussian distribution. We move beyond the subgaussian assumption, to measurement matrices that are derived by sampling uniformly at random rows of a unitary matrix (including subsampled Fourier measurements as a special case). Specifically, we prove the first known restricted isometry guarantee for generative compressed sensing with subsampled isometries, and provide recovery bounds with nearly order-optimal sample complexity, addressing an open problem of Scarlett et al. (2022, p. 10). Recovery efficacy is characterized by the coherence, a new parameter, which measures the interplay between the range of the network and the measurement matrix. Our approach relies on subspace counting arguments and ideas central to high-dimensional probability. Furthermore, we propose a regularization strategy for training GNNs to have favourable coherence with the measurement operator. We provide compelling numerical simulations that support this regularized training strategy: our strategy yields low coherence networks that require fewer measurements for signal recovery. This, together with our theoretical results, supports coherence as a natural quantity for characterizing generative compressed sensing with subsampled isometries.


I Introduction

The solution of underdetermined linear inverse problems has many important applications, including geophysics (Kumar et al., 2015; Herrmann et al., 2012) and medical imaging (Adcock and Hansen, 2021; Lustig et al., 2008). In particular, compressed sensing permits stable and robust order-optimal recovery of signals that are well represented by one of a certain set of structural proxies (e.g., sparsity) (Foucart and Rauhut, 2013). Moreover, this recovery is effected using an order-optimal number of random measurements (Foucart and Rauhut, 2013). In applications like medical imaging (Adcock and Hansen, 2021), the measurement matrices under consideration are derived from a bounded orthonormal system (a unitary matrix with bounded entries), which complicates the theoretical analysis. Furthermore, for such applications one desires a highly effective representation for encoding the images. Developing a theoretical analysis that properly accounts for realistic measurement paradigms and intricately designed image representations is non-trivial in general (Bora et al., 2017; Foucart and Rauhut, 2013; Jalal et al., 2021). For example, there has been much work validating that generative neural networks (GNNs) are highly effective at representing natural signals (Kingma and Welling, 2013; Goodfellow et al., 2014; Radford et al., 2015). In this vein, recent work has shown promising empirical results for compressed sensing with realistic measurement matrices when the structural proxy is a GNN. Other recent work has established recovery guarantees for compressed sensing when the structural proxy is a GNN and the measurement matrix is subgaussian (Bora et al., 2017). (See Section I-A for a fuller account of related aspects of this problem.) However, the following remains an open problem (Scarlett et al., 2022, p. 10):

Open problem (subIso GCS).

A theoretical analysis of compressed sensing when the measurement matrix is structured (e.g., a randomly subsampled unitary matrix) and the signal model proxy is a GNN.

Broadly, we approach a solution to the open problem (subIso GCS) as follows. Given a measurement matrix, a particular GNN architecture, and an unknown signal lying in the range of the network, we determine the conditions under which it is possible to approximately recover the signal from noisy linear measurements by (approximately) solving an optimization problem of the form

(1)

Above, the measurements are corrupted by unknown noise. Specifically, we are interested in establishing sample complexity bounds (lower bounds on the number of measurements) for realistic measurement matrices, in which the underdetermined measurement matrix is randomly subsampled from a unitary matrix. Namely, its rows have been sampled uniformly at random without replacement from a unitary matrix. We next present a mathematical description of this row subsampling, following (Dirksen, 2015, Section 4).
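For concreteness, a standard formulation of such a program in the generative-prior literature (Bora et al., 2017), written in notation of our own choosing (A for the measurement matrix, G for the network, x_0 for the unknown signal, and η for the corruption; not necessarily the paper's symbols), is

```latex
\min_{z \in \mathbb{R}^{k}} \; \bigl\| A\,G(z) - y \bigr\|_{2},
\qquad y = A x_{0} + \eta, \quad x_{0} \in \operatorname{ran}(G).
```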

Definition I.1 (subsampled isometry).

Let be integers and let be a unitary matrix. Let

be an iid Bernoulli random vector:

. Define the set and enumerate the elements of as where

is a binomial random variable with

. Let be the matrix whose th row is . We call an -subsampled isometry. When there is no risk of confusion we simply refer to as a subsampled isometry and implicitly acknowledge the existence of an giving rise to .

With the subsampled isometry so defined, it is isotropic: in expectation, its conjugate transpose times itself equals the identity matrix.

Remark I.1.

An important example of such a unitary matrix is the discrete orthonormal system given by the (discrete) Fourier basis. For example, “in applications such as medical imaging, one is confined to using subsampled Fourier measurements due to the inherent design of the hardware” (Scarlett et al., 2022, p. 10). Consider the matrix whose entries are the normalized complex exponentials of the discrete Fourier basis; here, the imaginary unit denotes the complex number whose square is −1. This matrix is known as the discrete Fourier transform (DFT) matrix, and it plays an important role in signal processing and numerical computation (Adcock and Hansen, 2021; Scarlett et al., 2022; Jalal et al., 2021). Thus, all of our results apply, in particular, when the measurement matrix is a subsampled DFT. ∎
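For illustration, the following NumPy sketch constructs a Bernoulli-subsampled DFT measurement matrix in the spirit of Definition I.1. The rescaling by sqrt(n/m) is an assumption on our part, chosen so that the matrix is isotropic in expectation; the paper's exact normalization and sign convention may differ.

```python
import numpy as np

def subsampled_dft(n, m, rng):
    """Sketch of an (m, n)-subsampled isometry built from the unitary DFT matrix.

    Rows of the n x n unitary DFT matrix are kept independently with probability
    m / n, then rescaled by sqrt(n / m) so that E[A^* A] = I (isotropy).
    """
    # Unitary DFT matrix (assumed convention): U[j, k] = exp(-2*pi*1j*j*k/n) / sqrt(n).
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    U = np.exp(-2j * np.pi * j * k / n) / np.sqrt(n)

    keep = rng.random(n) < m / n          # iid Bernoulli(m/n) row selectors
    return np.sqrt(n / m) * U[keep, :]    # rescaled subsampled rows

# Quick isotropy check (Monte Carlo): the average of A^* A over draws approaches I.
n, m, trials = 64, 16, 500
avg = np.zeros((n, n), dtype=complex)
for t in range(trials):
    A = subsampled_dft(n, m, rng=np.random.default_rng(t))
    avg += A.conj().T @ A / trials
print(np.linalg.norm(avg - np.eye(n)) / np.sqrt(n))  # small for large `trials`
```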

Lastly, we introduce the kind of GNN to which we restrict our attention in this work. Namely, we study ReLU-activated expansive neural networks, where ReLU is the so-called rectified linear unit, defined as ReLU(x) = max(x, 0) and acting element-wise on the entries of its argument.

Definition I.2 (-generative network).

Fix the integers and suppose for that . A -generative network is a function of the form
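As a concrete (assumed) instance of such a network, one may take the composition of ReLU layers G(z) = ReLU(W_d ReLU(W_{d-1} ... ReLU(W_1 z))). A minimal forward-pass sketch, with hypothetical layer sizes and weight scaling chosen purely for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def generative_network(z, weights):
    """Forward pass of a ReLU-activated fully connected network.

    `weights` is a list [W_1, ..., W_d] with W_i of shape (n_i, n_{i-1}) and
    expansive layer sizes n_0 = k <= n_1 <= ... <= n_d = n.
    """
    h = z
    for W in weights:
        h = relu(W @ h)
    return h

# Hypothetical layer sizes (k, n_1, n_2) = (10, 50, 200) with Gaussian weights.
rng = np.random.default_rng(0)
sizes = [10, 50, 200]
weights = [rng.standard_normal((sizes[i + 1], sizes[i])) / np.sqrt(sizes[i + 1])
           for i in range(len(sizes) - 1)]
x = generative_network(rng.standard_normal(sizes[0]), weights)  # a point in the range
```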

With these ingredients, we provide a suggestive “cartoon” of the main theoretical contribution of this work, which itself can be found in Theorem II.1.

Theorem Sketch (Cartoon).

Consider a generative network and a subsampled isometry, and suppose the range of the network is “incoherent” with respect to the rows of the underlying unitary matrix, as quantified by a coherence parameter. If the number of measurements is sufficiently large relative to this parameter, then, with high probability on the measurement matrix, it is possible to approximately recover an unknown signal in the network's range from noisy underdetermined linear measurements with nearly order-optimal error.

Coherence is defined in Definition I.4 below, using the measurement norm introduced in Definition I.3. The notion of “incoherence” in the cartoon above is made precise in Section II. Coherence is related to the concept of incoherent bases (Foucart and Rauhut, 2013, p. 373), while the measurement norm is closely related to a norm used in Rudelson and Vershynin (2008). Effectively, coherence characterizes the alignment between the elements of the network's range and the row vectors of the subsampled isometry.

Definition I.3 (Measurement norm).

Let be a unitary matrix. Define the norm by

Definition I.4 (Coherence).

Let be a set and a unitary matrix. For , say that is -coherent with respect to if

We refer to the quantity on the left-hand side as the coherence.
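For illustration, the sketch below evaluates the measurement norm and the coherence of a k-dimensional subspace, under the assumed convention (our reading, not verified against the displays) that the measurement norm of a vector is the largest absolute inner product with the rows of the unitary matrix, and that the coherence of a set is the supremum of this norm over its unit-norm elements. For a subspace with orthonormal basis Q, this supremum equals the largest row norm of U Q.

```python
import numpy as np

def measurement_norm(x, U):
    """Assumed measurement norm: largest |<u_i, x>| over the rows u_i of U."""
    return np.max(np.abs(U.conj() @ x))

def subspace_coherence(Q, U):
    """Coherence of span(Q) w.r.t. the unitary U, where Q has orthonormal columns.

    sup over unit x in span(Q) of max_i |<u_i, x>| equals the largest row norm of
    U Q (the two-to-infinity norm of U Q).
    """
    return np.max(np.linalg.norm(U.conj() @ Q, axis=1))

# Example: a random k-dimensional subspace and the identity as the unitary matrix.
rng = np.random.default_rng(0)
n, k = 128, 8
Q, _ = np.linalg.qr(rng.standard_normal((n, k)))   # orthonormal basis of the subspace
U = np.eye(n)
alpha = subspace_coherence(Q, U)
# Since sum_i ||Q^T u_i||_2^2 = ||Q||_F^2 = k over n rows, alpha >= sqrt(k / n).
print(alpha, np.sqrt(k / n))
```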

The idea is that the structural proxy/prior under consideration should be incoherent with respect to the measurement process. Thus, we desire that vectors in the range of the network not be too closely aligned (in the sense controlled by the measurement norm) with the rows of the unitary matrix to which the subsampled isometry is associated. We argue that this is a natural quantity to consider. Though it is likely difficult to compute the coherence precisely in practice, we propose a computationally efficient heuristic that upper bounds it. Furthermore, we argue that, under suitable randomness assumptions, an assumed bound on the coherence may be considered rather mild. Loosely: if the network has Gaussian weights, one may take a favourable coherence parameter in Theorem Sketch (Cartoon) (see Theorem II.1 and Theorem III.1).

We briefly itemize the main contributions of this paper:

  • we introduce the coherence for characterizing recovery efficacy via the alignment of the network’s range with the measurement matrix (see Definition I.4);

  • we establish a restricted isometry property for the generative networks of Definition I.2 with subsampled isometries (see Theorem II.2 and Section II);

  • we prove sample complexity and recovery bounds in this setting (see Theorem II.1);

  • we propose a regularization strategy for training GNNs with low coherence (see Section IV-A) and demonstrate improved sample complexity for recovery (see Section IV-B);

  • together with our theory, we provide compelling numerical simulations that support coherence as a natural quantity of interest linked to favourable deep generative recovery (see Section IV-B).

I-A Related work

Theoretically, Bora et al. (2017) have analyzed compressed sensing problems in the so-called generative prior framework, focusing on Gaussian or subgaussian measurement matrices. Similar problems in the generative prior framework have been analyzed, but, to our knowledge, all have relied on random designs. For example, Berk (2021) extends the analysis to the setting of demixing with subgaussian matrices, while Liu and Liu (2022) analyze the semi-parametric single-index model with generative prior under Gaussian measurements. Finally, exact recovery of the underlying latent code for GNNs (i.e., seeking the latent code that generates a given signal) has been analyzed; however, these analyses rely on the GNN having a suitable structure with weight matrices that possess a suitable randomness (Hand and Voroninski, 2019; Joshi et al., 2021; Hand and Joshi, 2019; Hand et al., 2018). For a review of these and related problems, see Scarlett et al. (2022).

Promising empirical results of Jalal et al. (2021) suggest remarkable efficacy of generative compressed sensing (GCS) in realistic measurement paradigms. Furthermore, the authors provide a framework with theoretical guarantees for using Langevin dynamics to sample from a generative prior. Several recent works have developed sophisticated generative adversarial networks (GANs) (which are effectively a type of GNN) for compressed sensing in medical imaging (Deora et al., 2020; Mardani et al., 2018). Other work has empirically explored multi-scale (non-Gaussian) sampling strategies for image compressed sensing using GANs (Li et al., 2022). Separately, see Wentz and Doostan (2022) for the use of GCS in uncertainty quantification of high-dimensional partial differential equations with random inputs. Recently popular is the use of untrained GNNs for signal recovery (Ulyanov et al., 2018; Heckel and Hand, 2019). For instance, Darestani and Heckel (2021) executed a promising empirical investigation of medical image compressed sensing using untrained GNNs.

Compressed sensing with subsampled isometries is well studied for sparse signal recovery. The original works developing such recovery guarantees are Candès et al. (2006); Donoho (2006), with improvements appearing in Rudelson and Vershynin (2008); Rauhut (2010). See Foucart and Rauhut (2013) for a thorough presentation of this material including relevant background. See Dirksen (2015, Sec. 4) for a clear presentation of this material via an extension of generic chaining. In this setting, the best-known number of log factors in the sample complexity bound sufficient to achieve the restricted isometry property is due to Bourgain (2014) with subsequent extensions and improvements in (Haviv and Regev, 2017; Chkifa et al., 2018; Brugiapaglia et al., 2021). Naderi and Plan (2022) address compressed sensing with subsampled isometries when the structural proxy is a neural network with random weights.

Using a notion of coherence to analyze the solution of convex linear inverse problems was proposed in Candès et al. (2006); Candès and Romberg (2007). Cape et al. (2019) relate this notion to the two-to-infinity matrix norm (defined in Section I-B) in order to analyze covariance estimation and singular subspace recovery. Additionally, see Donoho and Elad (2003) or Foucart and Rauhut (2013, p. 373) for a discussion of incoherent bases, and Rudelson and Vershynin (2008, p. 1034) for the analogue of our measurement norm in the sparsity case.

The present work relies on important ideas from high-dimensional probability, such as controlling the expected supremum of a random process on a geometric set. These ideas are well treated in Liaw et al. (2017); Jeong et al. (2020); see Vershynin (2018) for a thorough treatment of high-dimensional probability. This work also relies on counting the linear regions comprising the range of a ReLU-activated GNN. In this respect, we rely on a result that appears in Naderi and Plan (2021). Tighter but less analytically tractable bounds appear in (Serra et al., 2018), while a computational exploration of region counting has been performed in Novak et al. (2018).

I-B Notation

For an integer denote . For , denote the norm for by and for by . Here, if then and the conjugate is given by . If is a matrix then the conjugate transpose is denoted . The norm for real numbers, is defined in the standard, analogous way. Denote the real and complex sphere each by , disambiguating only where unclear from context. The operator norm of a matrix , induced by the Euclidean norm, is denoted . Unless otherwise noted, denotes the th row of the matrix , viewed as a column vector. The Frobenius norm of is denoted and satisfies . The matrix norm for is . We use to denote the standard projection operator onto the set , which selects a single point lexicographically, if necessary, to ensure uniqueness.

The Bernoulli distribution with a given parameter, and the binomial distribution for a given number of items with a given rate, are denoted in the standard way.

Throughout this work, represents an absolute constant having no dependence on any parameters, whose value may change from one appearance to the next. Constants with dependence on a parameter will be denoted with an appropriate subscript — e.g.,  is an absolute constant depending only on a parameter . Likewise, for two quantities , if then ; analogously for . Finally, given two sets , denotes the Minkowski sum/difference: . Similarly, for , and . The range of a function is denoted (e.g., if is a matrix then denotes the column space of ).

II Main results

We start by presenting the main result of the paper, which establishes sample complexity and recovery bounds for generative compressed sensing with subsampled isometries. Below, .

Theorem II.1 (subsampled isometry GCS).

Let be a -generative network and a subsampled isometry associated to a unitary matrix . If is -coherent with respect to , and

then, the following holds with probability at least on the realization of .

For any , let where . If

and satisfies , then,

Remark II.1.

The modelling error incurred via could be large compared to . Generally, it holds that . However, if admits a good representation of the modelled data distribution, then one might expect this term still to be small. Certainly, if , the final expression in Theorem II.1 reduces to

Otherwise, if this error term is independent of the measurement matrix, its expectation may be bounded by Jensen's inequality; thus, by Markov's inequality, one obtains a high-probability bound. Finally, a strategy for more precisely controlling this term is given in Appendix A-C (see especially Appendix A-C2). ∎

Analogous to the restricted isometry property of compressed sensing or the set-restricted eigenvalue condition of Bora et al. (2017), the proof of Theorem II.1 relies on a restricted isometry condition. This condition ensures that pairwise distances between points in the range of the network are approximately preserved under the action of the measurement matrix. We first state a result controlling norms of points in the range under the action of the measurement matrix; control over pairwise distances then follows easily.

Theorem II.2 (Gen-RIP).

Let be a subsampled isometry associated to a unitary matrix . Suppose that is a -generative network and that is -coherent with respect to . If

then with probability at least on the realization of , it holds that

Remark II.2.

The above result is interesting when the coherence parameter is suitably small. In Section III we argue that such dependence is possible (see the proposition in Section III and Theorem III.1). ∎

We now state the result that provides the notion of restricted isometry needed for Theorem II.1. This result, which controls pairwise differences of elements in , is an immediate consequence of Theorem II.2 using the observation in Remark A.3.

Corollary (Restricted isometry on the difference set).

Let be a -generative network and suppose is a subsampled isometry associated to a unitary matrix . Assume that is -coherent with respect to . If

then with probability at least on the realization of , it holds that

The proofs of Theorem II.2 and Theorem II.1 are deferred to Section V-A. Theorem II.2 follows directly from the lemma below and a subspace-counting result in Appendix A-B. The lemma, presented next, characterizes restricted isometry of a subspace incoherent with the measurement basis; its proof is also deferred to Section V-A.

Lemma (RIP for incoherent subspace).

Let be a subsampled isometry associated to a unitary matrix and suppose that is a -dimensional subspace that is -coherent with respect to . Then, for any ,

Remark II.3.

Convincing empirical results of Novak et al. (2018) suggest the number of linear regions for empirically observed neural networks may typically be linear in the number of nodes, rather than exponential in the width. Such a reduction would be a boon for the sample complexity obtained in Theorem II.2, which depends on the number of linear regions comprising the network's range (using the result in Appendix A-B; see Section V-A). ∎

III Typical Coherence

The first result of this section establishes a lower bound on the coherence parameter yielding a quadratic “bottleneck” on the sample complexity in terms of the parameter .

Proposition .

Let be a unitary matrix. Any -dimensional subspace is at least -coherent with respect to . Furthermore, this lower bound is tight.

Under mild assumptions, when the generative network has random weights one may show that this is a typical coherence level between the network and the measurement operator.

Theorem III.1.

Let be a unitary matrix and be a -generative network. Assume the weight matrices of , , are jointly iid Gaussian: , . Then, with probability at least , is -coherent with respect to , where

Remark III.1.

We briefly comment on the behaviour of the third term, which, we argue, dominates for the principal case of interest. Assume the layers have approximately constant size: i.e., for two absolute constants ,

In this case, all terms in the sum in the third term will be of the same order, making this term have order . If we further make the reasonable assumption that , then the third term dominates all others, hence

Remark III.2.

Using the results of Section II together with Remark III.1, one may take as the sample complexity for Theorem II.1, in the case of a generative network with Gaussian weights,

IV Numerics

In this section we explore the connection between coherence and recovery error empirically, to suggest that coherence is indeed the salient quantity dictating recovery error. In addition, we propose a regularization strategy to train low-coherence GNNs; to our knowledge, this regularization strategy is new. The first experiment produces a phase portrait showing, empirically, how successful recovery depends on the coherence and the number of measurements. We also show, for a fixed number of measurements, that the probability of recovery failure increases with higher coherence. In the second experiment, we use the novel regularization approach to show that fewer measurements are required for signal recovery when a GNN is trained to have low coherence.

IV-A Experimental methodology

IV-A1 Coherence heuristic and regularization

Ideally, in these experiments, one would calculate the coherence of the network exactly, via Definition I.4. However, computing the coherence is likely intractable in general. Instead, we use an upper bound on the coherence obtained as follows. Consider the final weight matrix of the generative network and write its QR decomposition, in which the first factor is orthogonal and the second contains an invertible submatrix. Retain the columns of the orthogonal factor corresponding to that submatrix, and fix the orthogonal matrix with respect to which coherence is measured. Using the resulting orthonormal basis, we bound the coherence with respect to the fixed orthogonal matrix as

(2)

To re-phrase: a network is always coherent with respect to the fixed orthogonal matrix at the level given by this upper bound. Our experiments and theory are consistent with the hypothesis that this is an effective heuristic for the coherence.
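Under our reading of this heuristic (an assumption on our part, since we paraphrase the derivation), one computes an orthonormal basis for the column space of the final weight matrix via its QR decomposition and bounds the coherence with respect to an orthogonal matrix B by the largest row norm of B times that basis. A minimal sketch:

```python
import numpy as np
from scipy.fft import dct

def coherence_upper_bound(W_last, B):
    """Heuristic coherence bound, assuming the set of interest lies in col(W_last).

    W_last: final weight matrix, shape (n, p) with p <= n.
    B:      orthogonal (or unitary) matrix whose rows form the measurement basis.
    Returns the largest row norm of B @ Qtilde, where Qtilde spans col(W_last).
    """
    Q, _ = np.linalg.qr(W_last)               # reduced QR: Q is n x p with orthonormal columns
    return np.max(np.linalg.norm(B @ Q, axis=1))

# Example with a hypothetical 784 x 256 final layer and an orthogonal DCT basis.
rng = np.random.default_rng(0)
n, p = 784, 256
B = dct(np.eye(n), norm="ortho", axis=0)      # orthogonal DCT matrix
print(coherence_upper_bound(rng.standard_normal((n, p)), B))
```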

Motivated by (2), we propose a strategy, novel to our knowledge, to promote low coherence of the final layer with respect to a fixed orthogonal matrix. This is achieved by applying the following regularization to the final weight matrix of the GNN during training:

(3)

Namely, the regularizer is added to the training loss function. Roughly, this regularization promotes low coherence because the penalized quantity is smallest when the relevant factor of the final weight matrix is orthonormal, which in turn ties the coherence of the final layer to the heuristic bound (2).
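As one plausible instantiation of such a regularizer (the functional form and the weights lam_coh and lam_orth below are our assumptions, not necessarily the exact form of (3)), one may penalize the largest row norm of B W, i.e., the heuristic (2) applied to the raw final weights, together with a term pushing the weights toward orthonormal columns:

```python
import torch

def coherence_regularizer(W, B, lam_coh=1.0, lam_orth=1.0):
    """Hypothetical surrogate for a coherence-promoting regularizer.

    W: final weight matrix of the decoder, shape (n, p).
    B: fixed orthogonal matrix, shape (n, n) (e.g., a DCT basis).
    Penalizes (i) the largest row norm of B @ W, a proxy for the coherence
    heuristic (2), and (ii) deviation of W^T W from the identity, so that W
    itself approximates the orthonormal basis used in (2).
    """
    row_norms = torch.linalg.norm(B @ W, dim=1)
    orth_gap = W.T @ W - torch.eye(W.shape[1], device=W.device, dtype=W.dtype)
    return lam_coh * row_norms.max() + lam_orth * torch.linalg.norm(orth_gap)

# During VAE training, such a term would be added to the usual loss, e.g.:
#   loss = vae_loss(x, x_hat, mu, logvar) + coherence_regularizer(decoder_last_W, B)
```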

IV-A2 Network architectures

In the experiments, we use three generative neural networks trained on the MNIST dataset (Deng, 2012), which consists of 60,000 images of handwritten digits. The GNNs are fully connected networks with three layers. The first uses the sigmoid activation function on its final layer, while the remaining two are of the form in Definition I.2. We use the sigmoid network, whose architecture is more representative of practical applications, as a point of comparison for the other two. Variational autoencoders (VAEs) (Kingma and Welling, 2013), with two of the GNNs as decoder networks, were trained using the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001 and a mini-batch size of 64 using Flux (Innes, 2018). We trained another VAE, with the remaining GNN as its decoder, using the same hyperparameters but with the regularization strategy described in Section IV-A1 to promote low coherence of the final layer with respect to a fixed orthogonal matrix. Specifically, the regularization term was added to the VAE loss function; in all cases the VAE loss function was otherwise the usual one. See the code repository for specific implementation details, including the definition of the encoders, and refer to Kingma and Welling (2013, 2019) for further background on VAEs.

IV-A3 Measurement matrix

Throughout the experiments, the unitary matrix was chosen to be the discrete cosine transform (DCT) matrix. For DCT implementation details, see for instance Virtanen et al. (2020, fftpack.dct). The measurement matrix is a mild variant of the subsampled isometry defined in Definition I.1, modified to ensure that each realization has exactly the prescribed number of rows. Namely, the random measurement matrix is subsampled from the DCT matrix by selecting the rows indexed by the first entries of a uniform random permutation, and it is still re-normalized as in Definition I.1.
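A sketch of one way to realize this measurement matrix, under our reading of the construction (the sqrt(n/m) rescaling and the norm="ortho" DCT convention are assumptions):

```python
import numpy as np
from scipy.fft import dct

def subsampled_dct_matrix(n, m, rng):
    """Subsampled DCT measurement matrix with exactly m rows.

    Rows are indexed by the first m entries of a uniform random permutation of
    {0, ..., n-1}, taken from the orthogonal (norm='ortho') DCT matrix and
    rescaled by sqrt(n / m), mirroring the normalization in Definition I.1.
    """
    U = dct(np.eye(n), norm="ortho", axis=0)   # n x n orthogonal DCT matrix
    rows = rng.permutation(n)[:m]              # first m indices of a permutation
    return np.sqrt(n / m) * U[rows, :]

A = subsampled_dct_matrix(n=784, m=100, rng=np.random.default_rng(0))
print(A.shape)   # (100, 784)
```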

IV-A4 First experiment

For the first experiment, we consider a generative network whose inner layers are fixed and whose last layer interpolates, via an interpolation parameter, between the final layers of two of the trained networks. The coherence of one of these final layers was small, while that of the other was upper bounded via (2) by a comparatively large value. As a result, for large values of the interpolation parameter, one should expect the interpolated network to have large coherence with respect to the DCT matrix. We randomly sample a target latent code, fix the number of measurements, and set the target signal to be the network's output at that code. For each measurement size and coherence upper bound, we perform independent trials. For each trial, we approximately solve (1) by running Adam until a fixed iteration budget is reached or the norm of the gradient falls below a tolerance, and take the resulting network output as the recovered signal. See the code repository for specific implementation details. We say the target signal was successfully recovered if the relative reconstruction error (rre) between the recovered and target signals falls below a fixed threshold.
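For illustration, a sketch of the recovery step in PyTorch. The hyperparameters (learning rate, iteration budget, gradient tolerance) are placeholders of our choosing rather than the values used in the experiments, G is assumed to be a differentiable map from the latent space to signal space, and the rre formula is the standard definition, assumed here:

```python
import torch

def recover(G, A, y, k, n_iters=2000, lr=0.01, grad_tol=1e-6):
    """Approximately solve (1): minimize ||A G(z) - y||_2 over the latent code z."""
    z = torch.randn(k, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(n_iters):
        opt.zero_grad()
        loss = torch.linalg.norm(A @ G(z) - y)
        loss.backward()
        if z.grad.norm() < grad_tol:       # stop early near a stationary point
            break
        opt.step()
    return G(z).detach()

def rre(x_hat, x_target):
    """Relative reconstruction error (assumed definition)."""
    return (torch.linalg.norm(x_hat - x_target) / torch.linalg.norm(x_target)).item()
```

Success is then declared when the rre of the recovered signal falls below the fixed threshold.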

IV-A5 Second experiment

For the second experiment, we use each trained network. The coherence upper bound of the regularized network, computed using (2), is markedly lower than that of the unregularized one, which empirically shows that the regularization (3) promotes low coherence during training. For each network, we use the corresponding encoder network from its VAE. We randomly sample an image from the test set of the MNIST dataset and let the target signal be the decoder applied to the encoding of that image; i.e., the target lies in the range of the network and closely resembles the test set image. We fix a range of measurement sizes, and for each measurement size we run independent trials. On each trial, we generate a realization of the measurement matrix and randomly sample a test image from the MNIST dataset. To estimate the target on each trial, we approximately solve (1) by running Adam until a fixed iteration budget is reached or the Euclidean norm of the gradient falls below a tolerance. See Berk et al. (2022) for specific implementation details.

IV-B Numerical results

IV-B1 Recovery phase transition

Fig. 1: Dependence of recovery on coherence and number of measurements for GNNs trained on MNIST. (a) Empirical recovery probability as a function of coherence and number of measurements; each block corresponds to the average of 20 independent trials, with white corresponding to 20 successful recoveries (small rre) and black to no successful recoveries. (b) Empirical rre as a function of coherence for a fixed number of measurements; each dot corresponds to one of 20 trials at each coherence level, the solid line shows the empirical geometric mean rre versus the coherence upper bound, and the envelope shows the geometric standard deviation.

The results of the first experiment appear in Figure 1. Specifically, panel (a) plots the fraction of successful recoveries from 20 independent trials as a function of the coherence heuristic (2) and the number of measurements. White squares correspond to successful recovery in every trial (all errors below the success threshold), while black squares correspond to no successful recoveries (all errors above it). In panel (b), we show a slice of the phase plot at a fixed number of measurements, plotting rre as a function of the coherence heuristic (2). Each dot corresponds to one of 20 trials at each coherence level. The plot is shown on a log-log scale. The solid line plots the geometric mean of the rre as a function of coherence, with an envelope representing the geometric standard deviation (see Adcock et al. (2022, App. A.1.3) for more information on this visualization strategy). Figure 1 indicates that coherence may be effectively controlled via the heuristic (2), and that coherence is a natural quantity associated with recovery performance. These findings corroborate our theoretical results.

IV-B2 Incoherent networks require fewer measurements

In the second experiment, we provide compelling numerical simulations that support our regularization strategy for lowering the coherence of the trained network, resulting in stable recovery of the target signal with far fewer measurements. The results of the second experiment are shown in Figure 2 and Figure 3. In Figure 2, we show the recovered image for three images from the MNIST test set. For each block of images, the top row corresponds to the low-coherence network; the middle row, to the high-coherence network; and the bottom row, to the network with final sigmoid activation. The left-most column is the target image in the range of the respective network, labelled signal. All images were clamped to the displayable intensity range. The figure shows that a GNN with low coherence can effectively recover the target signal with far fewer measurements than a network with high coherence, even when the latter uses a final sigmoid activation function (a realistic choice in practical settings). Remarkably, in some cases we observed that images could be recovered from very few measurements. This highlights the importance of regularizing for low coherence during training. In Figure 3, we further provide empirical evidence of the benefit of low coherence for recovery. For each measurement size, we show the results of 10 independent trials for the three networks (squares, triangles, and circles, respectively). The lines correspond to the empirical geometric mean rre for each network: the dotted line is associated with the sigmoid network; the dashed line, with the high-coherence network; and the solid line, with the low-coherence network. The data are plotted on a log-log scale. Each shaded envelope corresponds to the geometric standard deviation about the respective geometric mean. This figure empirically supports that high-probability successful recovery is achieved with markedly lower sample complexity for the low-coherence network than for either of the other two. This finding corroborates our theoretical results.

Fig. 2: Recovery comparison of MNIST images for various measurement sizes (denoted by column heading) for a low-coherence network, a high-coherence network, and a network with final sigmoid activation. In each block: the top row corresponds to the low-coherence network; the middle row, to the high-coherence network; and the bottom row (labelled Sig), to the sigmoid network. The leftmost column, labelled signal, corresponds to the target image.
Fig. 3: Performance comparison for three GNNs trained on the MNIST dataset: one with low coherence, another with high coherence, and the last with sigmoid activation on the last layer. The relative reconstruction error is plotted against the number of measurements. For each number of measurements, each dot corresponds to one of 10 trials; in each trial, a random image drawn from the MNIST test partition was used as the target signal. Lines depict the empirical geometric mean rre as a function of the number of measurements; shaded regions correspond to the geometric standard deviation. The solid line corresponds to the low-coherence network; the dashed line, to the high-coherence network; the dotted line, to the sigmoid network.

V Proofs

V-A Proofs for Section II

We proceed by proving Theorem II.2, then Theorem II.1. Note that the corollary on restricted isometry over the difference set, needed for Theorem II.1, follows immediately from Theorem II.2 using Remark A.3.

Proof of Theorem II.2.

It follows from the result in Appendix A-B that the range of the network may be contained in a union of linear subspaces of dimension at most the latent dimension, with the number of subspaces satisfying

via Remark A.2. For any linear subspace in the collection, observe that is -coherent with respect to by assumption. Consequently, by a union bound,

The latter quantity is bounded above by if

whence, by substituting the bound for , it suffices to take

Proof of Theorem II.1.

With the stated probability on the realization of the measurement matrix, a restricted isometry condition holds on the difference set, by the corollary above. Therefore,

In particular, since ,

Therefore, recalling that ,

In the final inequality, we have used that

together with the fact that .

Proof of the lemma (RIP for incoherent subspace).

Observe that since is a unitary matrix. Thus, since ,

Define , using that is an orthogonal projection, hence symmetric. Then,

Since and belong to a -dimensional subspace, there exists an isometry so that we may view as a -dimensional vector. By abuse of notation, for and being the operator norm for a Hermitian matrix over induced by the Euclidean norm,

We will apply the matrix Bernstein inequality (Appendix A-A1) to achieve concentration of the operator norm of the sum of mean-zero random matrices above. The coherence assumption bounds the relevant inner products. Consequently, the operator norm of each constituent matrix is bounded almost surely: for each index,

and the operator norm of the covariance matrix is bounded as

Therefore, by the matrix Bernstein inequality (Appendix A-A1) it follows that