1 Introduction
1.1 Setup
This paper studies the following inverse problem: given a fixed (but unknown) low-rank symmetric matrix, recover it from a small number of rank-one projections of the form:
(1) 
where the feature vectors are known and the additive noise term is stochastic. This inverse problem can be used to model numerous challenges in statistics and machine learning, including:
Matrix sensing. Reconstructing low-rank matrices from (noisy) linear measurements arises in several applications in control and system identification [1], collaborative filtering [2], and imaging. The problem (1) specializes the matrix sensing problem to the case where the measurement matrices are constrained to be of the specific rank-one form; recently, such measurements have been shown to provide several computational benefits [3, 4].
Covariance sketching. Estimating a high-dimensional covariance matrix, given a stream of independent samples, involves maintaining the empirical estimate, which can require quadratic space complexity. Alternatively, one can record a sequence of linear sketches of each sample. At the conclusion of the stream, the sketches corresponding to a given sketching vector are squared and aggregated to form a measurement, which is nothing but a sketch of the covariance matrix in the form (1); efficient recovery methods to invert such sketches exist [5, 6].
Polynomial neural network learning. Consider a shallow (two-layer) neural network architecture with input nodes, a single hidden layer of neurons with quadratic activation functions and associated weight vectors, and an output layer comprising a single node with its own weights. The input-output relationship between an input and the corresponding output can be described as in [7]. Here, the learning problem is to estimate the weights as accurately as possible given a sequence of training input-output pairs. The recent works [7, 8] explore efficient algorithms to recover the weights of such a "ground-truth" network under distributional assumptions.
1.2 Our contributions
In this paper, we make concrete algorithmic progress on solving problems of the form (1).
A range of algorithms for solving (1) (or variants thereof) exists in the literature, and these can be broadly classified into two categories: (i) convex approaches, all of which model the rank assumption via a convex penalty term such as the nuclear norm [1, 2, 5, 6], and (ii) nonconvex approaches based on either alternating minimization [3, 8] or greedy approximation [7, 9]. Both types of approaches suffer from severe computational difficulties, particularly in the high-dimensional regime. Even the most computationally efficient convex approaches require multiple invocations of a singular value decomposition (SVD) of a potentially large matrix, which can incur cubic running time. Moreover, even the best available nonconvex approaches require a very accurate initialization and a well-conditioned underlying matrix (if this is not the case, the running time of all available methods again inflates to cubic, or worse).

We take a different approach, and show how to leverage recent results in randomized numerical linear algebra [10, 11] to our advantage. Our algorithm is also nonconvex; however, unlike earlier works, our method does not require any full SVD calculations. Specifically, we demonstrate that a careful concatenation of randomized, approximate SVD methods, coupled with appropriately defined gradient steps, leads to efficient and accurate estimation. To our knowledge, this work constitutes the first method for matrix recovery from rank-one projections that has nearly linear running time, is nearly sample-optimal for a fixed target rank, and is unconditional (i.e., it does not depend on the condition number of the underlying matrix). Numerical experiments reveal that our methods yield a very attractive trade-off between sample complexity and running time for efficient matrix recovery.
Table 1: Summary of algorithms.
Algorithm | Sample complexity | Total Running Time | SD
Convex [6, 5, 4] | | | Yes
GECO [7, 9] | N/A | | Yes
AltMinLRROM [3] | | | Yes
gFM [8] | | | Yes
EPROM [This paper] | | | Yes
APROM [This paper] | | | No
1.3 Techniques
At a high level, our method can be viewed as a variant of the seminal algorithms proposed in [12, 13], which essentially constitute projected/proximal gradient descent onto the set of low-rank matrices. However, since computing an SVD in high dimensions can be a bottleneck, we cannot use this approach directly. To this end, we use the approximation-based matrix recovery framework proposed in [11]. That work demonstrates how to carefully integrate approximate SVD methods into SVP-based matrix recovery algorithms; in particular, algorithms that satisfy certain "head" and "tail" projection properties (explained below in Section 3) are sufficient to guarantee robust and fast convergence. This framework enables us to remove the need to compute even a single SVD.
However, a direct application of [11] does not succeed for the observation model (1). Two major obstacles arise in our case: (i) It is known that the measurement operator mapping the unknown matrix to the observations does not satisfy the Restricted Isometry Property (in terms of the Frobenius norm) over low-rank matrices [5, 6, 3]; therefore, the standard statistical and algorithmic correctness arguments no longer apply. (ii) The structure of the rank-one measurements inflates the running time of computing even a simple gradient update (ignoring any cost incurred during the low-rank projection, whether performed using exact or approximate SVDs).
We resolve the first issue by studying the concentration properties of certain linear operators based on the rank-one projections, leveraging an approach first proposed in [3]. We show that a nontrivial "bias correction" step for projected-descent-type methods, together with a fresh set of samples in each iteration, is necessary to achieve fast (linear) convergence. To be more precise, consider the operator defined by the rank-one projections with standard normal feature vectors. A simple calculation shows that, at any given iteration, if we denote the current estimate of the underlying low-rank matrix, then:
where the expectation is taken with respect to the randomness in the feature vectors and the trace operator appears in the bias term. The left-hand side of this equation (roughly) corresponds to the expected value of the gradient at each iteration, and it is clear that while the gradient points in the "correct" direction, it is additionally biased by the extra trace term. Motivated by this, we develop a new descent scheme that achieves linear convergence by carefully accounting for this bias. Interestingly, even though fresh samples are used in each iteration, the sample complexity of our approach increases only by a mild factor (an extra multiplicative factor together with polylogarithmic terms) compared to the best available techniques; thanks to the linear convergence of the proposed algorithm, the resampling contributes only a logarithmic factor.
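The bias structure described above is easy to check numerically. The sketch below relies on the standard Gaussian identity that, for a symmetric matrix M and a standard normal vector x, the expectation of (x^T M x) x x^T equals 2M + tr(M) I; the names M, p, n are generic placeholders rather than the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 6, 200_000
A = rng.standard_normal((p, p))
M = (A + A.T) / 2                               # an arbitrary symmetric "error" matrix

xs = rng.standard_normal((n, p))                # standard normal feature vectors
quad = np.einsum('ij,jk,ik->i', xs, M, xs)      # x_i^T M x_i for every sample
empirical = (xs * quad[:, None]).T @ xs / n     # (1/n) sum_i (x_i^T M x_i) x_i x_i^T

predicted = 2 * M + np.trace(M) * np.eye(p)     # the "correct" direction plus the trace bias
print(np.linalg.norm(empirical - predicted))    # small; only Monte Carlo error remains
```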
We resolve the second issue by carefully exploiting the rank-one structure of the observations. In particular, we develop a modification of the randomized block-Krylov SVD (BKSVD) algorithm of [10] that works for certain "implicitly defined" matrices, i.e., where the input to the SVD routine is not a matrix but rather a linear operator constructed using the current estimate. This modification, coupled with the tail- and head-projection arguments developed in [11], enables us to achieve a fast per-iteration computational complexity. In particular, our algorithm strictly improves over the (worst-case) per-iteration running time of all existing algorithms.
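To see why the rank-one structure matters computationally, the sketch below contrasts forming the adjoint-type matrix explicitly with applying it implicitly to a block of vectors; this is the kind of matrix-vector access our modified routine relies on. The names (xs, q, V) are illustrative only.

```python
import numpy as np

def apply_explicit(xs, q, V):
    # Form G = sum_i q_i x_i x_i^T (a p x p matrix), then multiply: O(m p^2 + p^2 b) work.
    G = xs.T @ (q[:, None] * xs)
    return G @ V

def apply_implicit(xs, q, V):
    # Same product G @ V without forming G: inner products x_i^T V first,
    # then a weighted recombination of the x_i; O(m p b) work.
    return xs.T @ (q[:, None] * (xs @ V))

rng = np.random.default_rng(0)
m, p, b = 500, 200, 5
xs, q, V = rng.standard_normal((m, p)), rng.standard_normal(m), rng.standard_normal((p, b))
print(np.allclose(apply_explicit(xs, q, V), apply_implicit(xs, q, V)))   # True
```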
2 Related Work
Due to space constraints, we provide only a brief (and necessarily incomplete) review of related work. The low-rank matrix recovery problem has received significant attention from the machine learning community over the last few years [14]. The so-called affine rank minimization problem can be thought of as the generalization of popular compressive sensing [15, 16] to the set of low-rank matrices, and has several applications in signal processing, communications, control, and computer vision [1, 2, 17]. In early works on matrix recovery, the observation operator is assumed to be parametrized by independent full-rank matrices that satisfy certain restricted isometry conditions [2, 18]. In this setup, it has been established that a number of observations on the order of the degrees of freedom of the unknown matrix is sufficient for recovery in (2) [19], which is statistically optimal.

More recently, several works have studied the special case in which the operator is parametrized by independent rank-one matrices, as introduced above in (1) [6, 5, 4, 3, 20]. This model arises in several key applications; three popular ones include: (i) Matrix completion [21]. This problem is popular for studying collaborative filtering and recommender systems, and here every observation corresponds to a single entry of the unknown matrix; therefore, the rank-one observation matrices are outer products of canonical basis vectors. (ii) Covariance sketching for real-time data monitoring [6, 22]. Here, a stream of high-dimensional data is observed sequentially at a high rate, and acquisition systems with memory and processing limitations can only record a few summaries (or sketches) of the individual data points. The eventual goal is to infer the second-order statistics (i.e., the covariance matrix) from the recorded sketches. (iii) Phase retrieval. This is a well-known challenge in certain types of imaging systems where the goal is to reconstruct a signal (or image) from amplitude information alone [23, 24].

We now briefly discuss algorithmic techniques for low-rank matrix recovery from rank-one projections. The first class of techniques can be categorized as convex-relaxation approaches [6, 5, 4, 23]. For instance, the authors in [6, 5] demonstrate that the observation operator satisfies a specialized mixed-norm restricted isometry condition, and they show that the sample complexity of matrix recovery using rank-one projections matches the optimal rate. However, these methods advocate using either semidefinite programming (SDP) or proximal subgradient algorithms [25], both of which are too slow for very high-dimensional problems.
The second class of techniques comprises nonconvex methods, all of which are based on a factorization approach initially advocated by Burer and Monteiro, in which the underlying low-rank matrix is factorized into a product of two low-rank factors. In the AltMinLRROM method of [3], the two factors are updated in an alternating fashion, with their columns orthonormalized at each iteration, while in the generalized Factorization Machine (gFM) method of [20], the factors are updated based on the construction of certain sequences of moment estimators. However, both of these approaches require a spectral initialization, which involves computing a truncated SVD of a given matrix, and therefore the running time depends heavily on the condition number (i.e., the ratio of the largest to the smallest nonzero singular value) of the underlying matrix. Our proposed approach is also nonconvex, and parts of our statistical analysis are based on certain matrix concentration results introduced in [3, 20]. However, unlike these previous works, our method is not adversely affected by a poor matrix condition number, either in theory (as demonstrated in Table 1) or in practice (as demonstrated by our numerical experiments below). We achieve this improvement by leveraging new methods for randomized approximate SVD [10], coupled with the framework of [11].

Finally, we mention that a related matrix recovery scheme using approximate SVDs (based on Frank-Wolfe-type greedy approximation) has been proposed for learning polynomial neural networks [9, 7]. This approach has been shown to compare favorably to typical neural network learning methods (such as stochastic gradient descent); however, its rate of convergence is sublinear. We extend this line of work by providing (a) a rigorous statistical analysis that precisely establishes upper bounds on the number of samples required for learning such networks, and (b) an algorithm that exhibits linear convergence to the desired solution; see Table 1.

3 Main Results
3.1 Preliminaries
Let us first introduce some notation. Throughout this paper, we work with the Frobenius norm, the spectral norm, and the trace of a matrix; for any given set, we use the corresponding orthogonal projection operator. The underlying observation model (1) can be represented as follows:
(2) 
where the unknown matrix is symmetric, the linear operator collects the rank-one projections defined in (1), and each feature vector is an independent random vector. The adjoint operator is defined accordingly as a weighted sum of the rank-one matrices formed from the feature vectors. Here, the noise is additive, and throughout the paper we assume that it is zero-mean, sub-Gaussian with i.i.d. entries, and independent of the feature vectors. The goal is to recover the unknown low-rank matrix from as few observations as possible.
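For concreteness, one possible realization of the measurement operator, its adjoint, and a toy instance of the observation model is sketched below; the normalization and the names (forward_op, adjoint_op, Xstar, xs) are our own choices and may differ from the paper's definitions.

```python
import numpy as np

def forward_op(xs, X):
    # A(X): the vector of rank-one projections x_i^T X x_i, i = 1, ..., m
    return np.einsum('ij,jk,ik->i', xs, X, xs)

def adjoint_op(xs, q):
    # A*(q) = sum_i q_i x_i x_i^T, a p x p symmetric matrix
    return xs.T @ (q[:, None] * xs)

# toy instance of the observation model (2)
rng = np.random.default_rng(1)
p, r, m = 50, 3, 2000
U = rng.standard_normal((p, r))
Xstar = U @ U.T                                              # unknown rank-r symmetric matrix
xs = rng.standard_normal((m, p))                             # random feature vectors
y = forward_op(xs, Xstar) + 0.01 * rng.standard_normal(m)    # noisy rank-one projections
```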
In the analysis of our algorithms, we need the following regularity condition on these operators with respect to the set of low-rank matrices, which we term the Conditional Unbiased Restricted Isometry Property, abbreviated as CURIP:
Definition 1.
(CURIP) Consider fixed low-rank matrices and a linear operator parametrized by rank-one matrices, as above. Then the operator is said to satisfy CURIP if there exist constants and an arbitrarily small constant such that
(3) 
Consider the set of all rank-r matrix subspaces, i.e., subspaces spanned by atoms that are outer products of unit-norm vectors. We use the idea of head and tail approximate projections with respect to this set, first proposed in [26] and instantiated in the context of low-rank approximation in [11].
Definition 2 (Approximate tail projection).
Let . Then is an approximate tail projection algorithm if, for all , it returns a subspace that satisfies , where is the optimal rank-r approximation of .
Definition 3 (Approximate head projection).
Let be a constant. Then is an approximate head projection if, for all , the returned subspace satisfies , where is the optimal rank-r approximation of .
Observe that when the approximation constants above vanish, both of these projection algorithms can be realized using ordinary (exact) SVD techniques.
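With zero approximation constants, both oracles reduce to an ordinary truncated SVD. A minimal realization of this exact (and expensive) baseline, which is the projection EPROM uses, is:

```python
import numpy as np

def exact_rank_r_projection(M, r):
    # Best rank-r approximation of M via a full SVD; this realizes both the
    # approximate tail and the approximate head projection with no slack,
    # at cubic cost for a dense square matrix.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]
```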
3.2 Algorithms and theoretical results
We now propose methods to estimate given knowledge of . Our first method is somewhat computationally inefficient, but achieves very good sample complexity and serves to illustrate the overall algorithmic approach. Consider the nonconvex optimization problem:
s.t.  (4) 
To solve this problem, we first propose an algorithm that we call Exact Projections for Rank-One Matrix recovery, or EPROM, described in pseudocode form in Algorithm 1. In Algorithm 1, we use the exact projection operator onto the set of matrices of the target rank.
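Algorithm 1 is given in the paper in pseudocode; the sketch below shows roughly what one EPROM-style iteration could look like, with an exact rank-r projection and an illustrative de-biasing step motivated by the expectation computed in Section 1.3. The step size, the constants, and the exact form of the correction are assumptions for illustration, not the paper's specification.

```python
import numpy as np

def ep_rom_step(X, xs, y, r, eta):
    # One projected-gradient iteration: form the (biased) gradient from the
    # current residuals, remove the trace bias suggested by the Gaussian
    # identity E[(x^T M x) x x^T] = 2M + tr(M) I, take a step, and project
    # back onto rank-r matrices with an exact SVD.
    m, p = xs.shape
    resid = np.einsum('ij,jk,ik->i', xs, X, xs) - y        # x_i^T X x_i - y_i
    G = (xs * resid[:, None]).T @ xs / m                   # (1/m) sum_i resid_i x_i x_i^T
    G = 0.5 * (G - np.mean(resid) * np.eye(p))             # illustrative bias correction
    U, s, Vt = np.linalg.svd(X - eta * G)
    return (U[:, :r] * s[:r]) @ Vt[:r, :]                  # exact rank-r projection
```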
We now analyze this algorithm. First, we provide a theoretical result that establishes the statistical and optimization convergence rates of EPROM. More precisely, we derive an upper bound on the estimation error (measured in the spectral norm) for recovering the unknown low-rank matrix. Due to space constraints, we defer all proofs to the appendix.
Theorem 4 (Linear convergence of EPROM).
Consider the sequence of iterates obtained by EPROM. Let , and denote the column spaces of , and , respectively. Define the subspace such that . Also, assume that the linear operator satisfies CURIP with parameters , and , where is the step size of EPROM and are arbitrarily small constants (to be specified later). Choose the step size in EPROM as . Then, EPROM outputs a sequence of estimates such that:
(5) 
where and .
Theorem 4 implies (via induction) that EPROM exhibits linear convergence; further, the radius of convergence depends on the third term on the right-hand side of (5). This term represents the statistical error rate. We now prove that this error term can be suitably bounded.
Theorem 5 (Bounding the statistical error of EPROM).
Consider the observation model (2) with zero-mean sub-Gaussian noise with i.i.d. entries (and independent of the feature vectors) whose sub-Gaussian norm is bounded (see Definition 11 in the appendix). Then, with probability at least , we have:
(6)
where is a constant that depends on .
In establishing linear convergence of EPROM, we assume that the CURIP holds at each iteration. The following theorem certifies this assumption by showing that this condition holds with high probability.
Theorem 6 (Verifying CURIP).
At any iteration, with probability at least , CURIP is satisfied with parameters , , and , where are arbitrarily small constants, provided that for some .
Integrating the above results and using the idea of fresh sampling in each iteration, we obtain the following corollary formally establishing linear convergence.
Corollary 7.
Consider all the assumptions in Theorem 4. To achieve accuracy in the estimation of in the spectral norm, EPROM needs iterations. In other words, the output of EPROM satisfies the following after iterations with high probability:
(7) 
While EPROM exhibits linear convergence, its per-iteration complexity is still high since it requires a projection onto the set of rank-r matrices, which necessitates the application of SVD. In the absence of any spectral assumptions on the input to the SVD, the per-iteration running time of EPROM can be cubic, which can be prohibitive; please see Section 5.3 in the appendix for a discussion of the overall running time. To reduce the running time, one can instead replace the standard SVD routine with heuristics such as Lanczos iterations [27]; however, these may not result in algorithms with provable convergence guarantees. Instead, following [11], we can use two inexact rank-r projections (in particular, tail- and head-approximate projection operators), and we show that this leads to provable convergence. Based on this idea, we propose our second algorithm, which we call Approximate Projections for Rank-One Matrix recovery, or APROM. We display the pseudocode of APROM in Algorithm 2.

The choice of approximate low-rank projection operators is flexible. We note that tail-approximate projections have been widely studied in the randomized numerical linear algebra literature [28, 29, 30]; however, head-approximate projection methods are less well known. In our method, we use the randomized Block Krylov SVD (BKSVD) method proposed by [10], which has been shown to satisfy both types of approximation guarantees [11]. A nice feature of this method is that the running time of computing a low-rank approximation of a given matrix is independent of the spectral gap of the matrix. We leverage this property to show asymptotic improvements over other existing fast SVD methods (such as the well-known power method).
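Before discussing BKSVD in more detail below, here is a compact randomized block-Krylov routine in the spirit of [10]; it is a simplified sketch rather than the exact algorithm of [10], with the iteration depth and the Rayleigh-Ritz step chosen for readability.

```python
import numpy as np

def bk_svd(M, r, depth=8, seed=0):
    # Build a block Krylov space from a random start block, orthonormalize it,
    # and extract a rank-r approximation by Rayleigh-Ritz within that space.
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((M.shape[1], r))
    block, blocks = M @ S, []
    for _ in range(depth):
        blocks.append(block)
        block = M @ (M.T @ block)                  # next Krylov block: (M M^T)^j M S
    Q, _ = np.linalg.qr(np.hstack(blocks))         # orthonormal basis of the Krylov space
    Ub, s, Vt = np.linalg.svd(Q.T @ M, full_matrices=False)
    return Q @ Ub[:, :r], s[:r], Vt[:r, :]         # approximate top-r singular triplets
```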
We briefly discuss the BKSVD algorithm. BKSVD takes as input a matrix together with a target rank, and returns a low-dimensional subspace that approximates the span of the top right singular vectors of the input. Mathematically, if we denote the input matrix, its best rank-r approximation, and a basis matrix spanning the subspace returned by BKSVD, then the projection of the input onto this subspace satisfies the following relations:

where the constants appearing above are the tail and head projection constants, and the remaining symbol denotes the corresponding right singular vector of the input. In Appendix B of [11], it is shown that this per-vector guarantee can be used to prove the approximate head projection property. We now establish that APROM also exhibits linear convergence, while obeying similar statistical properties as EPROM. We have the following result:

Theorem 8 (Convergence of APROM).
Let . Also, assume that the measurement operator satisfies CURIP with parameters , and , where is the step size of APROM and and are arbitrarily small positive constants. Choose the step size such that , where and are constants. Then, APROM outputs a sequence of estimates such that:
(8) 
where and .
Similar to Corollary 7, and again using fresh samples in each iteration, we have the following result.
Corollary 9.
Under the assumptions in Theorem 8, in order to achieve accuracy in the estimation of in terms of spectral norm, APROM requires iterations. Specifically, the output of APROM satisfies the following after iterations with high probability:
(9) 
where
From Theorem 8 and Corollary 9, we observe that the sample complexity of APROM (i.e., the number of samples needed to achieve a given accuracy) remains the same as in EPROM.
The above analysis of APROM shows that instead of using exact rank projections (as in EPROM), one can use tail- and head-approximate projections implemented by the randomized block Krylov method proposed in [10]; the running time of this method is characterized in Theorem 7 of [10]. While the running time of the projection step is gap-independent, the calculation of the gradient (i.e., the input to the head projection method) is itself a challenge. In essence, the bottleneck in computing the gradient is the calculation of the adjoint operator, which requires quadratically many operations for each sample. Coupled with the sample complexity, this means that the per-iteration running time inflates, which overshadows any gains achieved during the projection step (please see the appendix for more discussion).
To address this challenge, we propose a modified version of BKSVD for head-approximate projection that exploits the special rank-one structure involved in the calculation of the gradients. We call this method Modified BKSVD, or MBKSVD. The basic idea is to evaluate the Krylov-subspace iterations of BKSVD efficiently while fully avoiding any explicit calculation of the adjoint operator. Due to space constraints, the pseudocode as well as the analysis of MBKSVD are deferred to the appendix.
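The key step of MBKSVD can be sketched as follows: every Krylov block is produced by applying the implicitly defined gradient matrix through its rank-one factors, so the large square matrix is never materialized. This is our own illustrative rendering; the precise algorithm and its analysis appear in the appendix.

```python
import numpy as np

def implicit_krylov_basis(xs, q, r, depth=8, seed=0):
    # Krylov basis for G = sum_i q_i x_i x_i^T (symmetric, never formed).
    # Each product G @ V is computed from the rank-one factors in O(m p r) time.
    rng = np.random.default_rng(seed)
    m, p = xs.shape
    V = rng.standard_normal((p, r))
    blocks = []
    for _ in range(depth):
        V = xs.T @ (q[:, None] * (xs @ V))     # G @ V via inner products with the x_i
        blocks.append(V)
    Q, _ = np.linalg.qr(np.hstack(blocks))     # orthonormal basis spanning the Krylov space
    return Q                                   # to be fed into a Rayleigh-Ritz step as in BKSVD
```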
Theorem 10.
To achieve accuracy, APROM (with MBKSVD) runs in time (here, the notation hides the dependency on ).
It is worth noting that both the sample complexity and the running time of APROM depend on the spectral norm of the underlying matrix. However, in several applications this quantity can be assumed to be bounded above. For example, in learning two-layer polynomial networks as discussed above, the relation between the output and the input is given in Section 1.1, with weight constraints as in [7], Section 4.3. Therefore, the spectral norm of the underlying matrix is controlled in the worst case, and is constant under reasonable incoherence assumptions on the weights.
Figure 1: Comparison of algorithms. (a) Phase transition plot for various numbers of measurements. (b) Evolution of the objective function versus the number of iterations. (c) Probability of success as a function of the condition number. (d) Running time of the algorithms in the high-dimensional regime.

4 Experimental results
In this section, we report experiments supporting our proposed algorithms. We compare EPROM and APROM with convex nuclear norm minimization as well as with the gFM algorithm of [8]. To solve the nuclear norm minimization problem, we use FASTA [31, 32], which efficiently implements an accelerated proximal subgradient method. For APROM, we consider three different implementations. The first uses a low-rank approximation heuristic based on Lanczos iterations (implemented via Matlab's SVDS) instead of exact SVDs. The second uses the randomized Block Krylov SVD (BKSVD) of [10]. Finally, the third uses our proposed modified BKSVD method. In all the experiments, we generate a low-rank ground-truth matrix from a random factor whose entries are drawn independently from the standard normal distribution.
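As we read the setup above, the ground-truth matrix is the product of a Gaussian factor with its own transpose; a sketch of this construction (the exact sizes and normalization used in the experiments may differ) is:

```python
import numpy as np

def make_ground_truth(p, r, seed=0):
    # Rank-r symmetric test matrix built from a p x r factor with i.i.d.
    # standard normal entries.
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((p, r))
    return U @ U.T
```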
Figures 1(a) and 1(b) show the phase transition of successful recovery as well as the evolution of the objective function versus the iteration count for four algorithms (here, MBKSVD is not plotted since it behaves similarly to BKSVD in terms of accuracy). In these plots, we use Monte Carlo trials, and the phase transition plot is generated from the empirical probability of success; here, a trial is successful when the relative error between the estimate and the ground truth (measured in the spectral norm) falls below a fixed threshold. For solving the convex nuclear norm minimization with FASTA, we set the Lagrangian parameter via a grid search. In Figure 1(a), there is no additive noise; in Figure 1(b), we consider additive Gaussian noise with a fixed standard deviation. As we can see in Figure 1(a), the phase transition for the convex method is slightly better than for the nonconvex algorithms, which is consistent with known theoretical results. However, the convex method is improper, i.e., the rank of its estimate is much higher than the target rank.

In Figure 1(c), we compare our algorithms with the gFM method of [8] with respect to the condition number of the ground-truth matrix. The setup is as before, and we fix the number of measurements. We plot the probability of success, declaring success when the relative error between the estimate and the ground truth falls below a fixed threshold, over the Monte Carlo trials. This plot shows that the success of gFM depends heavily on the condition number. (As per the theory, to be able to recover the matrix, an additional factor proportional to the square of the condition number is required.) Finally, in Figure 1(d), we compare the algorithms in the high-dimensional regime in terms of running time. Here, we use our proposed modified BKSVD algorithm. The y-axis denotes the relative error in the spectral norm, and we report averages over the Monte Carlo trials. As we can see, the convex methods are the slowest (as expected); the nonconvex methods are comparable to each other, while MBKSVD is the fastest. This plot verifies that our modified head-approximate projection routine is substantially faster than the other methods, which makes it a promising approach for high-dimensional matrix recovery applications.
References
 [1] M. Fazel. Matrix rank minimization with applications. PhD thesis, Stanford University, 2002.
 [2] B. Recht, M. Fazel, and P. Parrilo. Guaranteed minimumrank solutions of linear matrix equations via nuclear norm minimization. SIAM review, 52(3):471–501, 2010.
 [3] K. Zhong, P. Jain, and I. Dhillon. Efficient matrix sensing using rank1 gaussian measurements. In Int. Conf on Algorithmic Learning Theory, pages 3–18. Springer, 2015.
 [4] R. Kueng, H. Rauhut, and U. Terstiege. Low rank matrix recovery from rank one measurements. Appl. Comput. Harmon. Anal., 42(1):88–116, 2017.
 [5] T. Cai and A. Zhang. Rop: Matrix recovery via rankone projections. Ann. Stat., 43(1):102–138, 2015.
 [6] Y. Chen, Y. Chi, and A. Goldsmith. Exact and stable covariance estimation from quadratic sampling via convex programming. IEEE Trans. Inform. Theory, 61(7):4034–4059, 2015.
 [7] R. Livni, S. ShalevShwartz, and O. Shamir. On the computational efficiency of training neural networks. In Adv. Neural Inf. Proc. Sys. (NIPS), pages 855–863, 2014.
 [8] M. Lin and J. Ye. A nonconvex onepass framework for generalized factorization machine and rankone matrix sensing. In Adv. Neural Inf. Proc. Sys. (NIPS), pages 1633–1641, 2016.
 [9] S. Shalevshwartz, A. Gonen, and O. Shamir. Largescale convex minimization with a lowrank constraint. In Proc. Int. Conf. Machine Learning, pages 329–336, 2011.
 [10] C. Musco and C. Musco. Randomized block krylov methods for stronger and faster approximate singular value decomposition. In Adv. Neural Inf. Proc. Sys. (NIPS), pages 1396–1404, 2015.
 [11] C. Hegde, I. Indyk, and S. Ludwig. Fast recovery from a union of subspaces. In Adv. Neural Inf. Proc. Sys. (NIPS), 2016.
 [12] P. Jain, R. Meka, and I. Dhillon. Guaranteed rank minimization via singular value projection. In Adv. Neural Inf. Proc. Sys. (NIPS), pages 937–945, 2010.
 [13] J. Cai, E. Candès, and Z. Shen. A singular value thresholding algorithm for matrix completion. SIAM Journal on Optimization, 20(4):1956–1982, 2010.
 [14] M. Davenport and J. Romberg. An overview of lowrank matrix recovery from incomplete observations. IEEE J. Select. Top. Sig. Proc., 10(4):608–622, 2016.
 [15] E. Candès and M. Wakin. An introduction to compressive sampling. IEEE Sig. Proc. Magazine, 25(2):21–30, 2008.
 [16] S. Foucart and H. Rauhut. A mathematical introduction to compressive sensing, volume 1. Springer, 2013.
 [17] A. Jung, G. Tauböck, and F. Hlawatsch. Compressive spectral estimation for nonstationary random processes. IEEE Trans. Inform. Theory, 59(5):3117–3138, 2013.
 [18] Y. Liu. Universal lowrank matrix recovery from pauli measurements. In Adv. Neural Inf. Proc. Sys. (NIPS), pages 1638–1646, 2011.
 [19] E. Candes and Y. Plan. Tight oracle inequalities for lowrank matrix recovery from a minimal number of noisy random measurements. IEEE Trans. Inform. Theory, 57(4):2342–2359, 2011.
 [20] M. Lin, S. Qiu, B. Hong, and J. Ye. The second order linear model. arXiv preprint arXiv:1703.00598, 2017.
 [21] E. Candès and B. Recht. Exact matrix completion via convex optimization. Found. Comput. Math., 9(6), 2009.

 [22] G. Dasarathy, P. Shah, B. Bhaskar, and R. Nowak. Sketching sparse matrices, covariances, and graphs via tensor products. IEEE Trans. Inform. Theory, 61(3):1373–1388, 2015.
 [23] E. Candes, T. Strohmer, and V. Voroninski. Phaselift: Exact and stable signal recovery from magnitude measurements via convex programming. Comm. Pure Appl. Math., 66(8):1241–1274, 2013.
 [24] P. Netrapalli, P. Jain, and S. Sanghavi. Phase retrieval using alternating minimization. In Adv. Neural Inf. Proc. Sys. (NIPS), pages 2796–2804, 2013.
 [25] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
 [26] C. Hegde, P. Indyk, and L. Schmidt. Approximation algorithms for modelbased compressive sensing. IEEE Trans. Inform. Theory, 61(9):5129–5147, 2015.

 [27] C. Lanczos. An iteration method for the solution of the eigenvalue problem of linear differential and integral operators. Journal of Research of the National Bureau of Standards, 45(4), 1950.
 [28] K. L. Clarkson and D. Woodruff. Low rank approximation and regression in input sparsity time. In Proc. ACM Symp. Theory of Comput., pages 81–90. ACM, 2013.
 [29] M. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proc. Natl. Acad. Sci., 106(3):697–702, 2009.

 [30] V. Rokhlin, A. Szlam, and M. Tygert. A randomized algorithm for principal component analysis. SIAM J. Matrix Anal. Appl., 31(3):1100–1124, 2009.
 [31] T. Goldstein, C. Studer, and R. Baraniuk. A field guide to forward-backward splitting with a FASTA implementation. arXiv preprint, abs/1411.3406, 2014.
 [32] T. Goldstein, C. Studer, and R. Baraniuk. FASTA: A generalized implementation of forward-backward splitting, January 2015. http://arxiv.org/abs/1501.04979.
 [33] R. Vershynin. Introduction to the nonasymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
 [34] J. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(12):1–230, 2015.
5 Appendix
Below, the expression for two sets refers to their Minkowski sum, defined as for given sets and . Also, denotes the set of vectors associated with , the set of all rank-r matrix subspaces. In addition, we denote the entries of a matrix and the entries of a vector as and , respectively.
Proof of Theorem 4.
Here we show that the error in the estimate decreases in one iteration. Let , and denote the bases for the column spaces of , and , respectively. Define the set as , and let denote the projection onto it. In addition, define as follows:
We have:
(10) 
where the first inequality holds since is generated by projecting onto the set of matrices with rank , and by the definition of , it also has the minimum Euclidean distance to b over all matrices with that rank; the second inequality holds by the definition of from (2) and the triangle inequality; the third follows from the fact that lies in the set and from the spectral norm bound on the projection matrix ; finally, the last inequality holds by the CURIP assumption in the theorem. By choosing and , the proof is complete. ∎
Proof of Corollary 7.
First, we note that by our assumption on in Theorem 4, . Since EPROM uses fresh samples in each iteration, is independent of the sensing vectors ’s for all . On the other hand, from Theorem 6, CURIP holds in a given iteration with probability . As a result, by a union bound over the iterations of the algorithm, CURIP holds across all iterations with probability at least . By recursively applying inequality (5) (with zero initialization) and applying Theorem 5, we obtain the claimed result. ∎
Proof of Theorem 8.
Assume that such that and . Also, define
Furthermore, by the definition of the approximate tail projection, . Now, we have:
where the first inequality is implied by the triangle inequality and the definition of the approximate tail projection, and the second inequality holds by the definition of the approximate head projection. Next, we have:
(11)  
(12) 
where the first step follows by decomposing the residual onto the two subspaces and , and the second is due to the triangle inequality and the fact that and .