# Improved Algorithms for Matrix Recovery from Rank-One Projections

We consider the problem of estimation of a low-rank matrix from a limited number of noisy rank-one projections. In particular, we propose two fast, non-convex proper algorithms for matrix recovery and support them with rigorous theoretical analysis. We show that the proposed algorithms enjoy linear convergence and that their sample complexity is independent of the condition number of the unknown true low-rank matrix. By leveraging recent advances in low-rank matrix approximation techniques, we show that our algorithms achieve computational speed-ups over existing methods. Finally, we complement our theory with some numerical experiments.


## 1 Introduction

### 1.1 Setup

This paper studies the following inverse problem: given a fixed (but unknown) rank-$r$ symmetric matrix $L_* \in \mathbb{R}^{p \times p}$, recover $L_*$ from a small number of rank-one projections of the form:

$$y_i = x_i^T L_* x_i + e_i = \langle x_i x_i^T, L_* \rangle + e_i, \qquad i = 1, \ldots, m, \tag{1}$$

where $x_1, \ldots, x_m \in \mathbb{R}^p$ denote (known) feature vectors and $e_1, \ldots, e_m$ denote stochastic noise. This inverse problem can be used to model numerous challenges in statistics and machine learning, including:

Matrix sensing. Reconstructing low-rank matrices from (noisy) linear measurements impacts several applications in control and system identification [1], collaborative filtering [2], and imaging. The problem (1) specializes the matrix sensing problem to the case where the measurement matrices are constrained to the specific rank-one form $x_i x_i^T$; recently, such measurements have been shown to provide several computational benefits [3, 4].

Covariance sketching. Estimating a high-dimensional covariance matrix, given a stream of independent samples $x_1, x_2, \ldots \in \mathbb{R}^p$, involves maintaining the empirical estimate $\widehat{\Sigma} = \frac{1}{n} \sum_i x_i x_i^T$, which can require quadratic ($O(p^2)$) space complexity. Alternatively, one can record a sequence of linear sketches $\langle a_j, x_i \rangle$ of each sample. At the conclusion of the stream, the sketches corresponding to a given vector $a_j$ are squared and aggregated to form a measurement $y_j = \frac{1}{n} \sum_i \langle a_j, x_i \rangle^2 = \langle a_j a_j^T, \widehat{\Sigma} \rangle$, which is nothing but a sketch of $\widehat{\Sigma}$ in the form (1); efficient recovery methods to invert such sketches exist [5, 6].
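To make the sketching pipeline concrete, the following small simulation (illustrative only; all variable names and parameter values are our own) streams samples, stores one scalar sketch per sample, and checks that the averaged squared sketches form a rank-one measurement of the covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 20, 50000           # dimension, number of streamed samples

# Zero-mean samples with a known (assumed) covariance.
A = rng.standard_normal((p, p)) / np.sqrt(p)
Sigma = A @ A.T            # ground-truth covariance (PSD by construction)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# Sketching vector a: each sample is reduced to a single scalar <a, x_i>.
a = rng.standard_normal(p)
sketches = X @ a           # O(1) storage per sample, never forms x_i x_i^T

# Squaring and averaging the sketches yields a rank-one measurement
# y ~ <a a^T, Sigma> of the (never explicitly formed) covariance.
y = np.mean(sketches ** 2)
target = a @ Sigma @ a
print(abs(y - target) / abs(target))   # small relative error, shrinks as 1/sqrt(n)
```

The point of the sketch is the memory profile: the $p \times p$ covariance is never stored, yet a measurement of the form (1) is recovered at the end of the stream.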

Polynomial neural network learning. Consider a shallow (two-layer) neural network architecture with $p$ input nodes, a single hidden layer of $r$ neurons with quadratic activation function $\sigma(z) = z^2$ and weights $w_j \in \mathbb{R}^p$, and an output layer comprising a single node with weights $\alpha_j$. The input-output relationship between an input $x$ and the corresponding output $y$ can be described as [7]:

$$y = \sum_{j=1}^{r} \alpha_j \langle w_j, x \rangle^2 = x^T \left( \sum_{j=1}^{r} \alpha_j w_j w_j^T \right) x.$$

Here, the learning problem is to estimate the weights $\{\alpha_j, w_j\}_{j=1}^r$ as accurately as possible given a sequence of training input-output pairs $\{(x_i, y_i)\}$. The recent works [7, 8] explore efficient algorithms to recover the weights of such a "ground-truth" network under distributional assumptions.
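The identity above — that a quadratic-activation network is exactly a rank-one projection of a rank-$r$ matrix — can be checked numerically. This is an illustrative sketch with arbitrary weights, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
p, r = 10, 3                       # input dimension, hidden-layer width

# Hypothetical ground-truth network weights.
W = rng.standard_normal((r, p))    # hidden-layer weights w_j (rows)
alpha = rng.standard_normal(r)     # output-layer weights alpha_j

def forward(x):
    """Two-layer net with quadratic activation: y = sum_j alpha_j <w_j, x>^2."""
    return float(alpha @ (W @ x) ** 2)

# The same map written as x^T L_* x with the rank-r matrix
# L_* = sum_j alpha_j w_j w_j^T -- exactly the model in (1).
L_star = (W.T * alpha) @ W         # p x p, rank <= r
x = rng.standard_normal(p)
print(abs(forward(x) - x @ L_star @ x))   # identical up to floating point
```

Consequently, any recovery guarantee for model (1) translates directly into a learning guarantee for this network class.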

### 1.2 Our contributions

In this paper, we make concrete algorithmic progress on solving problems of the form (1).

A range of algorithms for solving (1) (or variants thereof) exist in the literature, and can be broadly classified into two categories: (i) convex approaches, all of which model the rank-$r$ assumption via a convex penalty term, such as the nuclear norm [1, 2, 5, 6], and (ii) nonconvex approaches based on either alternating minimization [3, 8] or greedy approximation [7, 9]. Both types of approaches suffer from severe computational difficulties, particularly in the high-dimensional regime. Even the most computationally efficient convex approaches require multiple invocations of a singular value decomposition (SVD) of a potentially large $p \times p$ matrix, which can incur cubic ($O(p^3)$) running time. Moreover, even the best available non-convex approaches require a very accurate initialization and a well-conditioned underlying matrix (if this is not the case, the running time of all available methods again inflates to $O(p^3)$, or worse).

We take a different approach, and show how to leverage recent results in randomized numerical linear algebra [10, 11] to our advantage. Our algorithm is also non-convex; however, unlike earlier works, our method does not require any full SVD calculations. Specifically, we demonstrate that a careful concatenation of randomized, approximate SVD methods, coupled with appropriately defined gradient steps, leads to efficient and accurate estimation. To our knowledge, this work constitutes the first method for matrix recovery from rank-one projections that has nearly linear running time, is nearly sample-optimal for fixed target rank $r$, and is unconditional (i.e., it does not depend on the condition number of the underlying matrix). Numerical experiments reveal that our methods yield a very attractive tradeoff between sample complexity and running time for efficient matrix recovery.

### 1.3 Techniques

At a high level, our method can be viewed as a variant of the seminal algorithms proposed in [12, 13], which essentially constitute projected/proximal gradient descent with respect to the space of rank-$r$ matrices. However, since computing an SVD in high dimensions can be a bottleneck, we cannot use this approach directly. To this end, we use the approximation-based matrix recovery framework proposed in [11]. This work demonstrates how to carefully integrate approximate SVD methods into SVP-based matrix recovery algorithms; in particular, algorithms that satisfy certain "head" and "tail" projection properties (explained below in Section 3) are sufficient to guarantee robust and fast convergence. This framework enables us to remove the need to compute even a single SVD.

However, a direct application of [11] does not succeed for the observation model (1). Two major obstacles arise in our case: (i) It is known that the measurement operator mapping $L_*$ to $y$ does not satisfy the Restricted Isometry Property (in terms of the Frobenius norm) over rank-$r$ matrices [5, 6, 3]; therefore, the standard statistical and algorithmic correctness arguments no longer apply. (ii) The structure of the rank-one measurements inflates the running time of computing even a simple gradient update (ignoring any cost incurred during the rank-$r$ projection, whether done using exact or approximate SVDs).

We resolve the first issue by studying the concentration properties of certain linear operators based on the rank-one projections, leveraging an approach first proposed in [3]. We show that a non-trivial "bias correction" step in projected descent-type methods, together with a fresh set of samples in each iteration, is necessary to achieve fast (linear) convergence. To be more precise, let the operator $\mathcal{A} : \mathbb{R}^{p \times p} \to \mathbb{R}^m$ be defined as $[\mathcal{A}(L)]_i = x_i^T L x_i$ for $i = 1, \ldots, m$, where each $x_i$ is a standard normal random vector. A simple calculation shows that at any given iteration $t$, if $L_t$ is the estimate of the underlying low-rank matrix, then:

$$\mathbb{E}\left[\frac{1}{m} \mathcal{A}^* \mathcal{A} (L_t - L_*)\right] = 2(L_t - L_*) + \mathrm{Tr}(L_t - L_*)\, I,$$

where $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix and the expectation is taken with respect to the randomness in the $x_i$'s. The left-hand side of this equation (roughly) corresponds to the expected value of the gradient in each iteration; while the gradient points in the "correct" direction $L_t - L_*$, it is additionally biased by the extra trace term. Motivated by this, we develop a new descent scheme that achieves linear convergence by carefully accounting for this bias. Interestingly, the sample complexity of our approach only increases by mild polylogarithmic factors compared to the best available techniques, even though using fresh samples in each iteration adds a logarithmic factor (which is affordable due to the linear convergence of the proposed algorithm).
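The Gaussian fourth-moment identity above can be verified by Monte Carlo simulation. The following sketch (our own illustration; parameters are arbitrary) compares the empirical operator $\frac{1}{m}\mathcal{A}^*\mathcal{A}$ applied to a fixed symmetric matrix against $2\Delta + \mathrm{Tr}(\Delta) I$:

```python
import numpy as np

rng = np.random.default_rng(2)
p, m = 4, 400000

# A fixed symmetric matrix Delta playing the role of L_t - L_*.
D = rng.standard_normal((p, p))
Delta = (D + D.T) / 2

# (1/m) A* A (Delta) = (1/m) sum_i (x_i^T Delta x_i) x_i x_i^T
X = rng.standard_normal((m, p))                 # rows are standard normal x_i
quad = np.einsum('ij,jk,ik->i', X, Delta, X)    # quadratic forms x_i^T Delta x_i
AtA = (X.T * quad) @ X / m                      # sum_i quad_i x_i x_i^T / m

# Gaussian fourth-moment identity: E[(x^T M x) x x^T] = 2M + Tr(M) I.
expected = 2 * Delta + np.trace(Delta) * np.eye(p)
print(np.max(np.abs(AtA - expected)))           # Monte Carlo error ~ 1/sqrt(m)
```

The residual trace term is exactly what the bias-correction step in our descent scheme removes.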

We resolve the second issue by carefully exploiting the rank-one structure of the observations. In particular, we develop a modification of the randomized block-Krylov SVD (BK-SVD) algorithm of [10] that works for certain "implicitly defined" matrices, i.e., where the input to the SVD routine is not a matrix, but rather a linear operator constructed using the current estimate $L_t$. This modification, coupled with the tail- and head-projection arguments developed in [11], enables us to achieve a fast per-iteration computational complexity. In particular, our algorithm strictly improves over the (worst-case) per-iteration running time of all existing algorithms.

## 2 Related Work

Due to space constraints, we only provide a brief (and necessarily incomplete) review of related work. The low-rank matrix recovery problem has received significant attention from the machine learning community over the last few years [14]. The so-called affine rank minimization problem can be thought of as the generalization of popular compressive sensing [15, 16] to the set of low-rank matrices, and has several applications in signal processing, communications, control, and computer vision [1, 2, 17]. In early works on matrix recovery, the observation operator is assumed to be parametrized by independent full-rank matrices that satisfy certain restricted isometry conditions [2, 18]. In this setup, it has been established that $O(rp)$ observations are sufficient to recover an unknown rank-$r$ matrix in (2) [19], which is statistically optimal.

More recently, several works have studied the special case where the operator is parametrized by independent rank-one matrices, as introduced above in (1) [6, 5, 4, 3, 20]. This model arises in several key applications. Three popular practical examples include: (i) Matrix completion [21]. This problem is popular for studying collaborative filtering and recommender systems, and here every observation corresponds to a single entry of the unknown matrix. Therefore, the rank-one observation matrices are of the form $e_i e_j^T$, with $e_i, e_j$ corresponding to canonical basis vectors. (ii) Covariance sketching for real-time data monitoring [6, 22]. Here, a stream of high-dimensional data $x_1, x_2, \ldots$ is sequentially observed at a high rate, and acquisition systems with memory and processing limitations can only record a few summaries (or sketches) of individual data points, of the form $\langle a_j, x_i \rangle$. The eventual goal is to infer the second-order statistics (i.e., the covariance matrix) from the recorded sketches. (iii) Phase retrieval. This is a well-known challenge in certain types of imaging systems where the goal is to reconstruct a signal (or image) from amplitude-only information [23, 24].

We now briefly discuss algorithmic techniques for low-rank matrix recovery from rank-one projections. The first class of techniques can be categorized as convex-relaxation based approaches [6, 5, 4, 23]. For instance, the authors in [6, 5] demonstrate that the observation operator satisfies a specialized mixed-norm isometry condition called the RIP-$\ell_2/\ell_1$. Further, they show that the sample complexity of matrix recovery using rank-one projections matches the optimal rate $O(rp)$. However, these methods advocate using either semidefinite programming (SDP) or proximal sub-gradient algorithms [25], both of which are too slow for very high-dimensional problems.

The second class of techniques includes non-convex methods, which are all based on a factorization approach initially advocated by Burer and Monteiro. Here, the underlying low-rank matrix is factorized as $L = U V^T$, where $U, V \in \mathbb{R}^{p \times r}$. In the Altmin-LRROM method of [3], $U$ and $V$ are updated in alternating fashion such that in each iteration the columns of $U$ and $V$ are orthonormalized, while in the generalized Factorization Machine (gFM) method of [20], $U$ and $V$ are updated based on the construction of certain sequences of moment estimators. However, both of these approaches require a spectral initialization, which involves running a rank-$r$ SVD on a $p \times p$ matrix; therefore, the running time heavily depends on the condition number (i.e., the ratio of the maximum and the minimum nonzero singular values) of $L_*$. Our proposed approach is also non-convex, and parts of our statistical analysis are based on certain matrix concentration results introduced in [3, 20]. However, unlike these previous works, our method is not adversely affected by poor matrix conditioning, either in theory (as demonstrated in Table 1) or in practice (as demonstrated by our numerical experiments below). We achieve this improvement by leveraging new methods for randomized approximate SVD [10] coupled with the framework of [11].

Finally, we mention that a related matrix recovery scheme using approximate SVDs (based on Frank-Wolfe type greedy approximation) has been proposed for learning polynomial neural networks [9, 7]. This approach has been shown to compare favorably to typical neural network learning methods (such as stochastic gradient descent); however, its rate of convergence is sub-linear. We extend this line of work by providing (a) a rigorous statistical analysis that precisely establishes upper bounds on the number of samples required for learning such networks, and (b) an algorithm that exhibits linear convergence to the desired solution; see Table 1.

## 3 Main Results

### 3.1 Preliminaries

Let us first introduce some notation. Throughout this paper, $\|\cdot\|_F$ and $\|\cdot\|_2$ denote the Frobenius and spectral norm, respectively, and $\mathrm{Tr}(\cdot)$ denotes the trace of a matrix. For any given subspace $U$, we denote the orthogonal projection onto $U$ by $P_U$. The underlying observation model (1) can be represented as follows:

$$y = \mathcal{A}(L_*) + e \tag{2}$$

where $L_* \in \mathbb{R}^{p \times p}$ is a symmetric matrix, the linear operator $\mathcal{A} : \mathbb{R}^{p \times p} \to \mathbb{R}^m$ is defined via $[\mathcal{A}(L)]_i = x_i^T L x_i$, and each $x_i$ is a random vector in $\mathbb{R}^p$ for $i = 1, \ldots, m$. Also, the adjoint operator $\mathcal{A}^* : \mathbb{R}^m \to \mathbb{R}^{p \times p}$ is defined as $\mathcal{A}^*(v) = \sum_{i=1}^m v_i x_i x_i^T$. Here, $e \in \mathbb{R}^m$ denotes additive noise, and throughout the paper we assume that $e$ is zero-mean, subgaussian with i.i.d. entries, and independent of the $x_i$'s. The goal is to recover the unknown matrix $L_*$ with rank $r \ll p$ from as few observations as possible.

In the analysis of our algorithms, we need the following regularity condition on the operators $\mathcal{A}$ and $\mathcal{A}^*$ with respect to the set of low-rank matrices, which we term the Conditional Unbiased Restricted Isometry Property, abbreviated as CU-RIP:

###### Definition 1.

(CU-RIP) Consider fixed rank-$r$ matrices $L_1$ and $L_2$, and let $y = \mathcal{A}(L_2) + e$ with $\bar{y} = \frac{1}{m} \sum_{i=1}^m y_i$, for a linear operator $\mathcal{A}$ parametrized by rank-one matrices. Then $\mathcal{A}$ is said to satisfy the CU-RIP$(\rho_1, \rho_2, \rho_3)$ if there exist $0 < \rho_1 < 1$, $\rho_2 > 0$, and an arbitrarily small constant $\rho_3 > 0$ such that

$$\left\| L_1 - L_2 - \frac{\eta}{m} \mathcal{A}^* \mathcal{A} (L_1 - L_2) + \eta \left( \mathrm{Tr}(L_1) - \bar{y} \right) I \right\|_2 \leq \rho_1 \| L_1 - L_2 \|_2 + \rho_2 \, \mathrm{Tr}(L_2) + \rho_3. \tag{3}$$

Let $\mathbb{U}_r$ denote the set of all rank-$r$ matrix subspaces, i.e., subspaces of $\mathbb{R}^{p \times p}$ which are spanned by any $r$ atoms of the form $u v^T$, where $u$ and $v$ are unit $\ell_2$-norm vectors. We use the idea of head and tail approximate projections with respect to $\mathbb{U}_r$, first proposed in [26] and instantiated in the context of low-rank approximation in [11].

###### Definition 2 (Approximate tail projection).

Let $c_T > 0$ be a constant. Then $\mathcal{T} : \mathbb{R}^{p \times p} \to \mathbb{U}_r$ is a $c_T$-approximate tail projection algorithm if for all $L \in \mathbb{R}^{p \times p}$, $\mathcal{T}$ returns a subspace $W = \mathcal{T}(L)$ that satisfies $\| L - P_W L \|_F \leq c_T \| L - L_r \|_F$, where $L_r$ is the optimal rank-$r$ approximation of $L$.

###### Definition 3 (Approximate head projection).

Let $c_H > 0$ be a constant. Then $\mathcal{H} : \mathbb{R}^{p \times p} \to \mathbb{U}_r$ is a $c_H$-approximate head projection if for all $L \in \mathbb{R}^{p \times p}$, the returned subspace $V = \mathcal{H}(L)$ satisfies $\| P_V L \|_F \geq c_H \| L_r \|_F$, where $L_r$ is the optimal rank-$r$ approximation of $L$.

Observe that if $c_T = 1$ (resp. $c_H = 1$), then the algorithms $\mathcal{T}$ and $\mathcal{H}$ can be realized using ordinary SVD techniques.

### 3.2 Algorithms and theoretical results

We now propose methods to estimate $L_*$ given knowledge of $\{(x_i, y_i)\}_{i=1}^m$. Our first method is somewhat computationally inefficient, but achieves very good sample complexity and serves to illustrate the overall algorithmic approach. Consider the non-convex optimization problem:

$$\min_{L \in \mathbb{R}^{p \times p}} \; \frac{1}{2m} \| y - \mathcal{A}(L) \|_2^2 \quad \text{s.t.} \quad \mathrm{rank}(L) \leq r. \tag{4}$$

To solve this problem, we first propose an algorithm that we call Exact Projections for Rank-One Matrix recovery, or EP-ROM, described in pseudocode form in Algorithm 1. In Algorithm 1, $P_r$ denotes the projection operator onto the set of rank-$r$ matrices.
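The core iteration — a bias-corrected gradient step followed by an exact rank-$r$ projection — can be sketched as follows. This is a simplified, noiseless illustration under our own parameter choices (a single fixed sample set rather than fresh samples per iteration, and step size $\eta = 1/2$ motivated by the expectation identity in Section 1.3), not the paper's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(3)
p, r, m = 8, 1, 5000
eta, iters = 0.5, 15

# Ground-truth rank-r symmetric matrix and noiseless rank-one measurements.
U = rng.standard_normal((p, r))
L_star = U @ U.T
X = rng.standard_normal((m, p))
y = np.einsum('ij,jk,ik->i', X, L_star, X)     # y_i = x_i^T L_* x_i

def project_rank_r(M, r):
    """Exact projection onto rank-r matrices via a full SVD (the costly step)."""
    Uv, s, Vt = np.linalg.svd(M)
    return (Uv[:, :r] * s[:r]) @ Vt[:r]

L = np.zeros((p, p))
y_bar = y.mean()                               # unbiased estimate of Tr(L_*)
for _ in range(iters):
    resid = np.einsum('ij,jk,ik->i', X, L, X) - y
    grad = (X.T * resid) @ X / m               # (1/m) A*(A(L) - y)
    grad -= (np.trace(L) - y_bar) * np.eye(p)  # bias-correction step
    L = project_rank_r(L - eta * grad, r)

print(np.linalg.norm(L - L_star, 2) / np.linalg.norm(L_star, 2))
```

Since $\mathbb{E}[\text{grad}] \approx 2(L - L_*)$ after the correction, a step size near $1/2$ drives the iterate toward $L_*$ geometrically in this regime.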

We now analyze this algorithm. First, we provide a theoretical result which establishes the statistical and optimization convergence rates of EP-ROM. More precisely, we derive an upper bound on the estimation error (measured in the spectral norm) of recovering the unknown low-rank matrix $L_*$. Due to space constraints, we defer all proofs to the appendix.

###### Theorem 4 (Linear convergence of EP-ROM).

Consider the sequence of iterates $(L_t)$ obtained in EP-ROM. Let $U_t$, $U_{t+1}$, and $U_*$ denote the column spaces of $L_t$, $L_{t+1}$, and $L_*$, respectively, and define the subspace $J_t = \mathrm{span}(U_t \cup U_{t+1} \cup U_*)$. Also, assume that the linear operator $\mathcal{A}$ satisfies the CU-RIP with parameters $\rho_1$, $\rho_2$, and $\rho_3$, where $\eta$ is the step size of EP-ROM and $\rho_2$, $\rho_3$ are arbitrarily small constants (to be specified later). Then, with an appropriate choice of step size $\eta$, EP-ROM outputs a sequence of estimates $L_t$ such that:

$$\| L_{t+1} - L_* \|_2 \leq q_1 \| L_t - L_* \|_2 + q_2 \, \mathrm{Tr}(L_*) + \frac{2\eta}{m} \| P_{J_t} \mathcal{A}^* e \|_2, \tag{5}$$

where $q_1 < 1$ and $q_2$ are constants determined by the CU-RIP parameters.

Theorem 4 implies (via induction) that EP-ROM exhibits linear convergence; further, the radius of convergence depends on the third term on the right-hand side of (5). This term represents the statistical error rate. We now prove that this error term can be suitably bounded.

###### Theorem 5 (Bounding the statistical error of EP-ROM).

Consider the observation model (2) with zero-mean subgaussian noise $e$ with i.i.d. entries (independent of the $x_i$'s) such that $\|e_i\|_{\psi_2} \leq \tau$ (here, $\|\cdot\|_{\psi_2}$ denotes the $\psi_2$-norm; see Definition 11 in the appendix). Then, with probability at least $1 - \xi_3$, we have:

$$\left\| \frac{1}{m} \mathcal{A}^* e \right\|_2 \leq C'' \tau \sqrt{\frac{p \log p}{m}} \log\left( \frac{p}{\xi_3} \right), \tag{6}$$

where $C''$ is a constant.

In establishing linear convergence of EP-ROM, we assume that the CU-RIP holds at each iteration. The following theorem certifies this assumption by showing that this condition holds with high probability.

###### Theorem 6 (Verifying CU-RIP$(\rho_1, \rho_2, \rho_3)$).

At any iteration $t$ of EP-ROM, the CU-RIP$(\rho_1, \rho_2, \rho_3)$ is satisfied with high probability, with $\rho_1 < 1$ and arbitrarily small constants $\rho_2$ and $\rho_3$, provided that the number of samples $m$ is sufficiently large (scaling linearly in $p$, up to polylogarithmic factors, for fixed rank $r$).

Integrating the above results and using the idea of fresh sampling in each iteration, we obtain the following corollary formally establishing linear convergence.

###### Corollary 7.

Under the assumptions of Theorem 4, to achieve $\varepsilon$-accuracy in the estimation of $L_*$ in spectral norm, EP-ROM needs $K = O\!\left( \log \frac{\|L_*\|_2}{\varepsilon} \right)$ iterations. In other words, the output of EP-ROM satisfies the following after $K$ iterations, with high probability:

$$\| L_K - L_* \|_2 \leq (q_1)^K \| L_* \|_2 + \frac{q_2}{1 - q_1} \mathrm{Tr}(L_*) + \frac{C'' \tau \eta}{1 - q_1} \sqrt{\frac{p \log p}{m}} \log\left( \frac{p}{\xi_3} \right). \tag{7}$$

Based on Theorems 5 and 6 and Corollary 7, the sample complexity of EP-ROM scales (up to polylogarithmic factors) linearly in $p$ for fixed target rank $r$.

While EP-ROM exhibits linear convergence, its per-iteration complexity is high, since it requires a projection onto the space of rank-$r$ matrices, which necessitates an SVD. In the absence of any spectral assumptions on the input to the SVD, the per-iteration running time of EP-ROM can be cubic in $p$, which can be prohibitive; the overall running time to achieve $\varepsilon$-accuracy is therefore cubic as well (please see Section 5.3 in the appendix for more discussion). To reduce the running time, one could instead replace the standard SVD routine with heuristics such as Lanczos iterations [27]; however, these may not result in algorithms with provable convergence guarantees. Instead, following [11], we use two inexact rank-$r$ projections (in particular, tail- and head-approximate projection operators), and we show that this leads to provable convergence. Based on this idea, we propose our second algorithm, which we call Approximate Projection for Rank-One Matrix recovery, or AP-ROM. We display the pseudocode of AP-ROM in Algorithm 2.

The choice of approximate low-rank projection operators $\mathcal{T}$ and $\mathcal{H}$ is flexible. We note that tail-approximate projections have been widely studied in the randomized numerical linear algebra literature [28, 29, 30]; however, head-approximate projection methods are less well-known. In our method, we use the randomized Block Krylov SVD (BK-SVD) method proposed by [10], which has been shown to satisfy both types of approximation guarantees [11]. An appealing feature of this method is that the running time incurred while computing a low-rank approximation of a given matrix is independent of the spectral gap of the matrix. We leverage this property to show asymptotic improvements over other fast SVD methods (such as the well-known power method).

We briefly discuss the BK-SVD algorithm. BK-SVD takes a $p \times p$ input matrix and returns an $r$-dimensional subspace which approximates the span of the top $r$ singular vectors of the input. Mathematically, if $A$ is the input, $A_r$ is its best rank-$r$ approximation, and $Z$ is a basis matrix that spans the subspace returned by BK-SVD, then the projection $B = Z Z^T A$ of $A$ onto this subspace satisfies the following relations:

$$\| A - B \|_F \leq c_T \| A - A_r \|_F, \qquad | u_i^T A A^T u_i - z_i^T A A^T z_i | \leq (1 - c_H) \sigma_{r+1}^2,$$

where $c_T$ and $c_H$ are the tail and head projection constants, $u_i$ denotes the $i$-th left singular vector of $A$, and $z_i$ denotes the $i$-th column of $Z$. In Appendix B of [11], it is shown that this per-vector guarantee can be used to prove the approximate head projection property, i.e., $\| P_Z A \|_F \geq c_H \| A_r \|_F$. We now establish that AP-ROM also exhibits linear convergence, while obeying similar statistical properties as EP-ROM. We have the following results.

###### Theorem 8 (Convergence of AP-ROM).

Consider the sequence of iterates $(L_t)$ obtained in AP-ROM. Assume that $\mathcal{A}$ satisfies the CU-RIP with parameters $\rho_1$, $\rho_2$, and $\rho_3$, where $\eta$ is the step size of AP-ROM and $\rho_2$ and $\rho_3$ are arbitrarily small positive constants. Choose the step size $\eta$ appropriately in terms of the head and tail projection constants $c_H$ and $c_T$. Then, AP-ROM outputs a sequence of estimates $L_t$ such that:

$$\| L_{t+1} - L_* \|_2 \leq q_1' \| L_t - L_* \|_2 + q_2' \, \mathrm{Tr}(L_*) + \frac{1}{m} \left( \eta (1 + c_T) + \frac{\phi (1 + c_H)}{\sqrt{1 - \phi^2}} \right) \| P_{V_t} \mathcal{A}^* e \|_F, \tag{8}$$

where $q_1' < 1$ and $q_2'$ are constants determined by the CU-RIP parameters and the projection constants.

Similar to Corollary 7, and again using fresh samples in each iteration, we have the following result.

###### Corollary 9.

Under the assumptions of Theorem 8, in order to achieve $\varepsilon$-accuracy in the estimation of $L_*$ in spectral norm, AP-ROM requires $K = O\!\left( \log \frac{\|L_*\|_2}{\varepsilon} \right)$ iterations. Specifically, the output of AP-ROM satisfies the following after $K$ iterations, with high probability:

$$\| L_K - L_* \|_2 \leq (q_1')^K \| L_* \|_2 + \frac{q_2'}{1 - q_1'} \mathrm{Tr}(L_*) + \frac{q_3'}{1 - q_1'} \sqrt{\frac{p \log p}{m}} \log\left( \frac{p}{\xi_3} \right), \tag{9}$$

where $q_3'$ is a constant analogous to the noise-term constant in Corollary 7.

From Theorem 8 and Corollary 9, we observe that the sample-complexity of AP-ROM (i.e., the number of samples to achieve a given accuracy) remains the same as in EP-ROM.

The above analysis of AP-ROM shows that instead of using exact rank-$r$ projections (as in EP-ROM), one can use tail- and head-approximate projections, implemented via the randomized block Krylov method of [10]; by Theorem 7 of [10], the running time of this projection step is independent of the spectral gap of its input. However, the calculation of the gradient (i.e., the input to the head projection method $\mathcal{H}$) is itself a challenge. In essence, the bottleneck in calculating the gradient is the application of the adjoint operator $\mathcal{A}^*$, which requires $O(p^2)$ operations for each sample if the rank-one matrices $x_i x_i^T$ are formed explicitly. Coupled with the sample complexity, this means that the per-iteration running time scales as $O(m p^2)$, which overshadows any gains achieved during the projection step (please see the appendix for more discussion).

To address this challenge, we propose a modified version of BK-SVD for head-approximate projection which exploits the special rank-one structure involved in the calculation of the gradients. We call this method Modified BK-SVD, or MBK-SVD. The basic idea is to evaluate the Krylov-subspace iteration of BK-SVD efficiently, fully avoiding any explicit formation of the output of the adjoint operator $\mathcal{A}^*$. Due to space constraints, the pseudocode as well as the analysis of MBK-SVD is deferred to the appendix.
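The underlying trick — applying $\mathcal{A}^*(v)$ to a block of vectors without ever assembling the $p \times p$ matrix $\sum_i v_i x_i x_i^T$ — can be illustrated in a few lines (our own sketch of the idea; the MBK-SVD pseudocode itself is in the appendix):

```python
import numpy as np

rng = np.random.default_rng(5)
p, m, r = 100, 2000, 5
X = rng.standard_normal((m, p))    # measurement vectors x_i (rows)
v = rng.standard_normal(m)         # residual-type vector, e.g. A(L_t) - y

def adjoint_matvec(X, v, Q):
    """Compute A*(v) @ Q = (sum_i v_i x_i x_i^T) Q implicitly:
    cost O(m p r) per block instead of O(m p^2) for explicit assembly."""
    return X.T @ ((X @ Q) * v[:, None])

Q = rng.standard_normal((p, r))
implicit = adjoint_matvec(X, v, Q)

# Explicit (slow) reference: assemble sum_i v_i x_i x_i^T, then multiply.
G = (X.T * v) @ X
explicit = G @ Q
print(np.max(np.abs(implicit - explicit)))   # agrees up to floating point
```

Since block Krylov SVD only touches its input through such block products, substituting this implicit matvec is what removes the $O(m p^2)$ bottleneck per iteration.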

###### Theorem 10.

To achieve $\varepsilon$-accuracy, AP-ROM (with MBK-SVD) runs in nearly linear time (where "nearly" hides polylogarithmic dependencies).

It is worthwhile to note that both the sample complexity and the running time of AP-ROM depend on the spectral norm $\|L_*\|_2$. However, in several applications this quantity can be assumed to be bounded. For example, in learning two-layer polynomial networks as discussed above, the relation between output and input is given by $y = \sum_{j=1}^r \alpha_j \langle w_j, x \rangle^2$ with suitably bounded weights (see [7], Section 4.3). Therefore, the spectral norm of $L_* = \sum_{j=1}^r \alpha_j w_j w_j^T$ is bounded in the worst case, and constant under reasonable incoherence assumptions on the weight vectors $w_j$.

## 4 Experimental results

In this section, we present experiments to support our proposed algorithms. We compare EP-ROM and AP-ROM with convex nuclear-norm minimization as well as the gFM algorithm of [8]. To solve the nuclear-norm minimization, we use FASTA [31, 32], which efficiently implements an accelerated proximal sub-gradient method. For AP-ROM, we consider three different implementations. The first uses a low-rank approximation heuristic based on Lanczos iterations (implementable by Matlab's SVDS) instead of exact SVDs. The second uses the randomized Block Krylov SVD (BK-SVD) of [10]. Finally, the third uses our proposed modified BK-SVD method. In all the experiments, we generate a low-rank matrix $L_* = U U^T$ with $U \in \mathbb{R}^{p \times r}$, where the entries of $U$ are randomly chosen according to the standard normal distribution.

Figures 1(a) and 1(b) show the phase transition of successful recovery as well as the evolution of the objective function versus the iteration count for the four algorithms (MBK-SVD is not plotted since it behaves similarly to BK-SVD in terms of accuracy). In these plots, we have used several Monte Carlo trials, and the phase transition plot is generated based on the empirical probability of success; here, success is declared when the relative error between $\widehat{L}$ (the estimate of $L_*$) and the ground truth (measured in terms of spectral norm) falls below a fixed threshold. For solving convex nuclear-norm minimization using FASTA, we set the Lagrangian parameter via a grid search. In Figure 1(a), there is no additive noise; in Figure 1(b), we consider additive zero-mean Gaussian noise. As we can see in Figure 1(a), the phase transition for the convex method is slightly better than for the non-convex algorithms, which is consistent with known theoretical results. However, the convex method is improper, i.e., the rank of its estimate is much higher than $r$.

In Figure 1(c), we compare our algorithms with the gFM method of [8] with respect to the condition number of the ground-truth matrix. The setup is as before, and we fix the number of measurements. We plot the probability of success, declared when the relative error between $\widehat{L}$ and the ground truth falls below a fixed threshold, over several Monte Carlo trials. This plot shows that the success of gFM heavily depends on the condition number (as per its theory, recovery requires an additional number of samples proportional to the square of the condition number). Finally, in Figure 1(d), we compare the algorithms in terms of running time in the high-dimensional regime. Here, we have used our proposed modified BK-SVD algorithm. The y-axis denotes the relative error in spectral norm, and we report averages over several Monte Carlo trials. As we can see, the convex methods are the slowest (as expected); the non-convex methods are comparable to each other, while MBK-SVD is the fastest. This plot verifies that our modified head-approximate projection routine is substantially faster than the other methods, which makes it a promising approach for high-dimensional matrix recovery applications.

## References

• [1] M. Fazel. Matrix rank minimization with applications. PhD thesis, Stanford University, 2002.
• [2] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM review, 52(3):471–501, 2010.
• [3] K. Zhong, P. Jain, and I. Dhillon. Efficient matrix sensing using rank-1 gaussian measurements. In Int. Conf on Algorithmic Learning Theory, pages 3–18. Springer, 2015.
• [4] R. Kueng, H. Rauhut, and U. Terstiege. Low rank matrix recovery from rank one measurements. Appl. Comput. Harmon. Anal., 42(1):88–116, 2017.
• [5] T. Cai and A. Zhang. Rop: Matrix recovery via rank-one projections. Ann. Stat., 43(1):102–138, 2015.
• [6] Y. Chen, Y. Chi, and A. Goldsmith. Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Trans. Inform. Theory, 61(7):4034–4059, 2015.
• [7] R. Livni, S. Shalev-Shwartz, and O. Shamir. On the computational efficiency of training neural networks. In Adv. Neural Inf. Proc. Sys. (NIPS), pages 855–863, 2014.
• [8] M. Lin and J. Ye. A non-convex one-pass framework for generalized factorization machine and rank-one matrix sensing. In Adv. Neural Inf. Proc. Sys. (NIPS), pages 1633–1641, 2016.
• [9] S. Shalev-shwartz, A. Gonen, and O. Shamir. Large-scale convex minimization with a low-rank constraint. In Proc. Int. Conf. Machine Learning, pages 329–336, 2011.
• [10] C. Musco and C. Musco. Randomized block krylov methods for stronger and faster approximate singular value decomposition. In Adv. Neural Inf. Proc. Sys. (NIPS), pages 1396–1404, 2015.
• [11] C. Hegde, I. Indyk, and S. Ludwig. Fast recovery from a union of subspaces. In Adv. Neural Inf. Proc. Sys. (NIPS), 2016.
• [12] P. Jain, R. Meka, and I. Dhillon. Guaranteed rank minimization via singular value projection. In Adv. Neural Inf. Proc. Sys. (NIPS), pages 937–945, 2010.
• [13] J. Cai, E. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
• [14] M. Davenport and J. Romberg. An overview of low-rank matrix recovery from incomplete observations. IEEE J. Select. Top. Sig. Proc., 10(4):608–622, 2016.
• [15] E. Candès and M. Wakin. An introduction to compressive sampling. IEEE Sig. Proc. Magazine, 25(2):21–30, 2008.
• [16] S. Foucart and H. Rauhut. A mathematical introduction to compressive sensing. Springer, 2013.
• [17] A. Jung, G. Tauböck, and F. Hlawatsch. Compressive spectral estimation for nonstationary random processes. IEEE Trans. Inform. Theory, 59(5):3117–3138, 2013.
• [18] Y. Liu. Universal low-rank matrix recovery from pauli measurements. In Adv. Neural Inf. Proc. Sys. (NIPS), pages 1638–1646, 2011.
• [19] E. Candes and Y. Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Trans. Inform. Theory, 57(4):2342–2359, 2011.
• [20] M. Lin, S. Qiu, B. Hong, and J. Ye. The second order linear model. arXiv preprint arXiv:1703.00598, 2017.
• [21] E. Candès and B. Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 9(6), 2009.
• [22] G. Dasarathy, P. Shah, B. Bhaskar, and R. Nowak. Sketching sparse matrices, covariances, and graphs via tensor products. IEEE Trans. Inform. Theory, 61(3):1373–1388, 2015.
• [23] E. Candes, T. Strohmer, and V. Voroninski. Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math., 66(8):1241–1274, 2013.
• [24] P. Netrapalli, P. Jain, and S. Sanghavi. Phase retrieval using alternating minimization. In Adv. Neural Inf. Proc. Sys. (NIPS), pages 2796–2804, 2013.
• [25] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
• [26] C. Hegde, P. Indyk, and L. Schmidt. Approximation algorithms for model-based compressive sensing. IEEE Trans. Inform. Theory, 61(9):5129–5147, 2015.
• [27] C. Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. Journal of Research of the National Bureau of Standards, 45(4), 1950.
• [28] K. L. Clarkson and D. Woodruff. Low rank approximation and regression in input sparsity time. In Proc. ACM Symp. Theory of Comput., pages 81–90. ACM, 2013.
• [29] M. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proc. Natl. Acad. Sci., 106(3):697–702, 2009.
• [30] V. Rokhlin, A. Szlam, and M. Tygert. A randomized algorithm for principal component analysis. SIAM J. Matrix Anal. Appl., 31(3):1100–1124, 2009.
• [31] T. Goldstein, C. Studer, and R. Baraniuk. A field guide to forward-backward splitting with a FASTA implementation. arXiv preprint arXiv:1411.3406, 2014.
• [32] T. Goldstein, C. Studer, and R. Baraniuk. FASTA: A generalized implementation of forward-backward splitting, January 2015. http://arxiv.org/abs/1501.04979.
• [33] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
• [34] J. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230, 2015.

## 5 Appendix

Below, the expression $J_1 + J_2$ for two sets $J_1$ and $J_2$ refers to the Minkowski sum of the two sets, defined as $J_1 + J_2 = \{a + b : a \in J_1,\ b \in J_2\}$. When a rank-$r$ matrix subspace appears in such a sum, it denotes the associated set of vectors. In addition, we denote the $(i,j)$ entry of a matrix $M$ and the $i$-th entry of a vector $v$ by $M_{ij}$ and $v_i$, respectively.
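For concreteness, the rank-one measurement operator $\mathcal{A}(L) = (x_i^T L x_i)_{i=1}^m$ and its adjoint $\mathcal{A}^*(z) = \sum_i z_i x_i x_i^T$, which appear throughout the proofs below, can be sketched in NumPy. This is a minimal illustration under standard Gaussian sensing vectors; the function names are ours, not the paper's:

```python
import numpy as np

def A(L, X):
    # Rank-one projections: A(L)_i = x_i^T L x_i, with x_i the rows of X.
    return np.einsum("mi,ij,mj->m", X, L, X)

def A_star(z, X):
    # Adjoint operator: A*(z) = sum_i z_i x_i x_i^T.
    return np.einsum("m,mi,mj->ij", z, X, X)

rng = np.random.default_rng(0)
n, r, m = 6, 2, 40
U = rng.standard_normal((n, r))
L = U @ U.T                      # rank-r symmetric PSD matrix
X = rng.standard_normal((m, n))  # sensing vectors x_i as rows
z = rng.standard_normal(m)

# Adjoint identity: <A(L), z> = <L, A*(z)>
lhs = A(L, X) @ z
rhs = np.sum(L * A_star(z, X))
print(np.isclose(lhs, rhs))  # True
```

The adjoint identity checked at the end is what licenses moving $\mathcal{A}^*$ across inner products in the derivations below.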

###### Proof of Theorem 4.

Here we show that the error in the estimate of $L^*$ decreases in one iteration. Let $U_t$, $U_{t+1}$, and $U^*$ denote bases for the column spaces of $L_t$, $L_{t+1}$, and $L^*$, respectively. Define the set $J$ as the sum of the subspaces spanned by $U_t$, $U_{t+1}$, and $U^*$, and let $P_J$ denote the orthogonal projection onto it. In addition, define $b$ as follows:

$$b = L_t - \eta P_J\!\left(\frac{1}{m}\sum_{i=1}^m \left(x_i^T L_t x_i - y_i\right)x_i x_i^T - \left(\operatorname{Tr}(L_t)-\bar y\right)I\right) = L_t - \frac{\eta}{m} P_J \mathcal{A}^*\!\left(\mathcal{A}(L_t) - y\right) + \eta P_J \left(\operatorname{Tr}(L_t)-\bar y\right)I.$$

We have:

$$\begin{aligned}
\|L_{t+1}-L^*\|_2 &\le \|L_{t+1}-b\|_2 + \|b-L^*\|_2 \overset{a_1}{\le} 2\|b-L^*\|_2 \\
&\le 2\left\|L_t - L^* - \frac{\eta}{m}P_J\mathcal{A}^*(\mathcal{A}(L_t)-y) + \eta P_J(\operatorname{Tr}(L_t)-\bar y)I\right\|_2 \\
&\overset{a_2}{\le} 2\left\|P_J\!\left(L_t-L^*-\frac{\eta}{m}\mathcal{A}^*\mathcal{A}(L_t-L^*)+\eta(\operatorname{Tr}(L_t)-\bar y)I\right)\right\|_2 + \frac{2\eta}{m}\left\|P_J\mathcal{A}^* e\right\|_2 \\
&\overset{a_3}{\le} 2\left\|L_t-L^*-\frac{\eta}{m}\mathcal{A}^*\mathcal{A}(L_t-L^*)+\eta(\operatorname{Tr}(L_t)-\bar y)I\right\|_2 + \frac{2\eta}{m}\left\|\mathcal{A}^* e\right\|_2 \\
&\overset{a_4}{\le} 2(\eta\delta_1-1)\|L_t-L^*\|_2 + 2\eta\delta_2\operatorname{Tr}(L^*) + \frac{2\eta}{m}\|\mathcal{A}^* e\|_2, \qquad (10)
\end{aligned}$$

where $a_1$ holds since $L_{t+1}$ is generated by projecting $b$ onto the set of matrices with rank $r$, so by definition $L_{t+1}$ has the minimum Euclidean distance to $b$ over all matrices with rank $r$; $a_2$ holds by the definition of $\mathcal{A}$ from (2) and the triangle inequality; $a_3$ follows from the fact that $L_t - L^*$ lies in the set $J$ and the fact that the spectral norm of the projection $P_J$ is at most one; finally, $a_4$ holds by the CU-RIP assumption in the theorem. Choosing the step size $\eta$ and the constants $\delta_1, \delta_2$ appropriately completes the proof. ∎
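The iteration analyzed in this proof, i.e. the debiased gradient step followed by a rank-$r$ projection, can be sketched numerically. The snippet below uses an exact truncated-SVD projection and noiseless Gaussian rank-one measurements; it is an illustrative sketch in our own notation (step size, problem sizes, and function names are our choices), not the paper's exact EP-ROM implementation:

```python
import numpy as np

def rank_r_projection(M, r):
    # Exact rank-r projection of a symmetric matrix via truncated SVD.
    U, s, Vt = np.linalg.svd(M, hermitian=True)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]

rng = np.random.default_rng(1)
n, r, m, eta = 8, 1, 3000, 0.25
G = rng.standard_normal((n, r))
L_star = G @ G.T                             # unknown rank-r PSD matrix
X = rng.standard_normal((m, n))              # sensing vectors x_i as rows
y = np.einsum("mi,ij,mj->m", X, L_star, X)   # noiseless measurements y_i = x_i^T L* x_i
y_bar = y.mean()                             # estimates Tr(L*)

L = np.zeros((n, n))                         # zero initialization
for _ in range(50):
    resid = np.einsum("mi,ij,mj->m", X, L, X) - y
    # Debiased gradient: (1/m) A*(A(L) - y) - (Tr(L) - y_bar) I
    grad = np.einsum("m,mi,mj->ij", resid, X, X) / m - (np.trace(L) - y_bar) * np.eye(n)
    L = rank_r_projection(L - eta * grad, r)

rel_err = np.linalg.norm(L - L_star) / np.linalg.norm(L_star)
```

The trace correction term compensates for the bias $\mathbb{E}[(x^T \Delta x)\,x x^T] = 2\Delta + \operatorname{Tr}(\Delta)I$ of Gaussian rank-one measurements, so in expectation the step moves directly along the residual $\Delta = L_t - L^*$.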

###### Proof of Corollary 7.

First, we note that the assumption on $\eta$ in Theorem 4 guarantees a per-iteration contraction of the error. Since EP-ROM uses fresh samples in each iteration, $L_t$ is independent of the sensing vectors $x_i$ used in iteration $t$. On the other hand, from Theorem 6, the CU-RIP holds in each iteration with high probability. As a result, by a union bound over the iterations of the algorithm, the CU-RIP holds in all iterations with the probability claimed in the corollary. By recursively applying inequality (5) (with zero initialization) and invoking Theorem 5, we obtain the claimed result. ∎
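Unrolling the recursion makes the linear-convergence claim concrete: if each iteration satisfies a bound of the form $\|L_{t+1}-L^*\| \le \rho\|L_t-L^*\| + \nu$ with contraction factor $\rho < 1$ and per-iteration statistical error $\nu$ (our notation, a generic sketch of the argument), then with zero initialization a geometric-series bound gives

```latex
\|L_K - L^*\| \;\le\; \rho^K \|L^*\| + \sum_{k=0}^{K-1}\rho^k \nu \;\le\; \rho^K \|L^*\| + \frac{\nu}{1-\rho},
```

so the error decays geometrically to a floor set by the statistical term.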

###### Proof of Theorem 8.

Let $\mathcal{H}$ and $\mathcal{T}$ denote the approximate head and tail projections with constants $c_H$ and $c_T$, let $V$ denote the subspace selected by the head projection, so that $\mathcal{H}(\cdot) = P_V(\cdot)$, and let $Y$ denote a subspace containing the column space of $L_t - L^*$. Also, define

$$b' = L_t - \eta\,\mathcal{H}\!\left(\frac{1}{m}\sum_{i=1}^m \left(x_i^T L_t x_i - y_i\right)x_i x_i^T - \left(\operatorname{Tr}(L_t)-\bar y\right)I\right) = L_t - \eta\,\mathcal{H}\!\left(\frac{1}{m}\mathcal{A}^*(\mathcal{A}(L_t)-y) - \left(\operatorname{Tr}(L_t)-\bar y\right)I\right).$$

Furthermore, by definition of the approximate tail projection, $L_{t+1} = \mathcal{T}(b')$. Now, we have:

$$\begin{aligned}
\|L_{t+1}-L^*\|_F &= \|L^*-\mathcal{T}(b')\|_F \le \|L^*-b'\|_F + \|b'-\mathcal{T}(b')\|_F \overset{a_1}{\le} (1+c_T)\|b'-L^*\|_F \\
&= (1+c_T)\left\|L_t-L^*-\eta\,\mathcal{H}\!\left(\frac{1}{m}\mathcal{A}^*(\mathcal{A}(L_t)-y)-(\operatorname{Tr}(L_t)-\bar y)I\right)\right\|_F \\
&\overset{a_2}{=} (1+c_T)\left\|L_t-L^*-\eta P_V\!\left(\frac{1}{m}\mathcal{A}^*(\mathcal{A}(L_t)-y)-(\operatorname{Tr}(L_t)-\bar y)I\right)\right\|_F,
\end{aligned}$$

where $a_1$ is implied by the triangle inequality and the definition of the approximate tail projection, and equality $a_2$ holds by the definition of the approximate head projection. Next, we have:

$$\begin{aligned}
\|L_{t+1}-L^*\|_F &\overset{a_3}{\le} (1+c_T)\left\|P_V(L_t-L^*)+P_{V^\perp}(L_t-L^*)-\eta P_V\!\left(\frac{1}{m}\mathcal{A}^*(\mathcal{A}(L_t)-y)-(\operatorname{Tr}(L_t)-\bar y)I\right)\right\|_F \qquad (11) \\
&\overset{a_4}{\le} (1+c_T)\left\|P_V(L_t-L^*)-\eta P_V\!\left(\frac{1}{m}\mathcal{A}^*\mathcal{A}(L_t-L^*)-(\operatorname{Tr}(L_t)-\bar y)I\right)\right\|_F + (1+c_T)\|P_{V^\perp}(L_t-L^*)\|_F + \frac{(1+c_T)\eta}{m}\|P_V\mathcal{A}^* e\|_F \\
&\overset{a_5}{\le} (1+c_T)\left\|P_{V+Y}\!\left(L_t-L^*-\eta\left(\frac{1}{m}\mathcal{A}^*\mathcal{A}(L_t-L^*)-(\operatorname{Tr}(L_t)-\bar y)I\right)\right)\right\|_F + (1+c_T)\|P_{V^\perp}(L_t-L^*)\|_F + \frac{(1+c_T)\eta}{m}\|P_V\mathcal{A}^* e\|_F, \qquad (12)
\end{aligned}$$

where $a_3$ follows by decomposing the residual $L_t-L^*$ onto the two subspaces $V$ and $V^\perp$, and $a_4$ is due to the triangle inequality together with the facts that $y = \mathcal{A}(L^*) + e$ and that $P_V$ is a contraction in the Frobenius norm.

Now, we need to bound the three terms in (11). The third (statistical error) term can be bounded using Theorem 5, which we invoke in Corollary 9. For the first term, we have:

$$\begin{aligned}
(1+c_T)&\left\|P_{V+Y}\!\left(L_t-L^*-\eta\left(\frac{1}{m}\mathcal{A}^*\mathcal{A}(L_t-L^*)-(\operatorname{Tr}(L_t)-\bar y)I\right)\right)\right\|_F \\
&\overset{a_1}{\le} \beta_1(1+c_T)\sqrt{r}\left\|L_t-L^*-\eta\left(\frac{1}{m}\mathcal{A}^*\mathcal{A}(L_t-L^*)-(\operatorname{Tr}(L_t)-\bar y)I\right)\right\|_2 \\
&\overset{a_2}{\le} \beta_1(1+c_T)(\eta\delta_1-1)\sqrt{r}\,\|
\end{aligned}$$