 # Block CUR : Decomposing Large Distributed Matrices

A common problem in large-scale data analysis is to approximate a matrix using a combination of specifically sampled rows and columns, known as CUR decomposition. Unfortunately, in many real-world environments, the ability to sample specific individual rows or columns of the matrix is limited by either system constraints or cost. In this paper, we consider matrix approximation by sampling predefined blocks of columns (or rows) from the matrix. This regime is commonly found when data is distributed across multiple nodes in a compute cluster, where such blocks correspond to columns (or rows) of the matrix stored on the same node, which can be retrieved with much less overhead than retrieving individual columns stored across different nodes. We propose a novel algorithm for sampling useful column blocks and provide guarantees for the quality of the approximation. We demonstrate the practical utility of this algorithm for computing the block CUR decomposition of large matrices in a distributed setting using Apache Spark. Using our proposed block CUR algorithms, we can achieve a significant speed-up compared to a regular CUR decomposition with the same quality of approximation.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The ability to perform large-scale data analysis is often limited by two opposing forces. The first force is the need to store data in a matrix format for the purpose of analysis techniques such as regression or classification. The second force is the inability to store the data matrix completely in memory due to the size of the matrix in many application settings. This conflict gives rise to storing factorized matrix forms, such as SVD or CUR decompositions .

We consider a matrix with rows and columns, i.e., . Using a truncated

number of singular vectors (

e.g., where

), the singular value decomposition (SVD) provides the best rank-

approximation to the original matrix. The singular vectors often do not preserve the structure in original data. Preserving the original structure in the data may be desirable due to many reasons including interpret-ability in case of biometric data or for storage efficiency in case of sparse matrices. This has led to the introduction of the CUR decomposition, where the factorization is performed with respect to a subset of rows and columns of the matrix itself. This specific decomposition describes the matrix as the product of a subset of matrix rows and a subset of matrix columns (along with a matrix that fits ).

Significant prior work has examined how to efficiently choose the rows and columns in the CUR decomposition and has derived worst-case error bounds (e.g., ). These methods have been applied successfully to many real-world problems including genetics , astronomy , and mass spectrometry imaging . Unfortunately, a primary assumption of current CUR techniques, that individual rows and columns of the matrix can be queried, is either impossible or quite costly in many real world problems and instead require a block approach.

In this paper, we consider the following two applications which represent the two main motivating factors for considering block decompositions.

Biometric data analysis. In applications where the ordering of rows or columns is meaningful, such as images, video, or speech data matrices, sampling contiguous blocks of columns adds contextual information that is necessary for interpretability of the factorized representation. One emerging application is audience reaction analysis of video content using biometrics. We focus on the scenario where users watch video content while wearing sensors, and changes in biometric sensors indicate changes in reaction to the content. For example, increases in heart rate or a spike in electrodermal activity indicate an increase in content engagement. In this paper, a matrix of biometric data such as Electrodermal Activity (EDA) is collected from users reacting to external stimuli, e.g., watching video content. In prior work, EDA has shown to be useful for a variety of user analytics tasks to assess the reaction of viewers[14, 8]. In this setting, is the number of users and corresponds to the number of time samples for which biometric reaction is collected. Unfortunately, there is significant cost in acquiring each user’s reaction to lengthy content so instead we collect full responses (corresponding to some rows of the matrix) from only a limited number of users. For remaining users, we propose to collect responses for only a few important scenes of the video (corresponding to column blocks of the matrix) as shown in Figure (a)a and then approximate their full response. An individual time sample in this use case cannot be queried in isolation due to the lack of context that caused that biometric reaction. Instead, collections of time segments (i.e., blocks) must be presented to the user. In this setting block sampling can be viewed as a restriction which leads to more interpretable solutions.

Distributed storage systems. Large-scale datasets often require distributed storage, a regime where there can be substantial overhead involved in querying individual rows or columns of a matrix. In these regimes, it is more efficient to retrieve predefined blocks of rows or columns at one time corresponding to the rows or columns stored on the same node, as shown in Figure (b)b, in order to minimize the overhead in terms of latency while keeping the throughput constant. In doing so, one forms a Block CUR decomposition, with more details provided in Section 4.2. Current CUR decomposition techniques do not take advantage of this predefined block structure.

Main contributions. Using these insights into real-world applications of CUR decomposition, this paper makes a series of contributions. We propose a simple randomized Block CUR algorithm for subset selection of rows and blocks of columns and derive novel worst-case error bounds for this randomized algorithm. On the theory side, we present new theoretical results related to approximating matrix multiplication and generalized regression in the block setting. These results are the fundamental building blocks used to derive the error bounds for the presented randomized algorithms. The sample complexity bounds feature a non-trivial dependence on the matrix partition, the distribution of information in the blocks of the matrix. This dependence is non-trivial in that it cannot be obtained by simply extending the analysis of the original individual column CUR setting to the Block CUR setting. As a result, our analysis finds a sample complexity improvement on the order of the block stable rank of a matrix (See Table 1 in Section 3).

On the practical side, this algorithm performs fast block sampling taking advantage of the natural storage of matrices in distributed environments (See Table 2 in Section 4.2). We demonstrate empirically that the proposed Block CUR algorithms can achieve a significant speed-up when used to decompose large matrices in a distributed data setting. We conduct a series of CUR decomposition experiments using Apache Spark on Amazon Elastic Map-Reduce (Amazon EMR) using both synthetic and real-world data. In this distributed environment, we find that our Block CUR approach achieves a speed-up of x to x for matrices larger than . This is compared with previous CUR approaches that sample individual rows and columns and while achieving the same matrix approximation error rate. We also perform experiments with real-world user biometric data from a content testing environment and present interesting use cases where our algorithms can be applied to user analytics tasks.

## 2 Setup and background

### 2.1 Notation

Let denote the identity matrix and

denote a zero matrix of appropriate size. We denote vectors (matrices) with lowercase (uppercase) bold symbols like

(). The -th row (column) of a matrix is denoted by (). We represent the -th block of rows of a matrix by and the -th block of columns of a matrix by .

Let denote the set . Let and . The singular value decomposition (SVD) of can be written as where contains the left singular vectors; is the diagonal matrix of singular values, for ; and is an orthonormal matrix containing the right singular vectors of . We denote as the best rank- approximation to in terms of Frobenius norm. The pseudoinverse of is defined as . Also, note that is the projection of onto the column space of , and is the projection of onto the row space of .

The Frobenius norm and spectral norm of a matrix are denoted by and respectively. The square of the Frobenius norm is given by . The spectral norm is given by .

### 2.2 The CUR problem and other related work

The need to factorize a matrix using a collection of rows and columns of that matrix has motivated the CUR decomposition literature. CUR decomposition is focused on sampling rows and columns of the matrix to provide a factorization that is close to the best rank- approximation of the matrix. One of the most fundamental results for a CUR decomposition of a given matrix was obtained in . We re-state it here for the sake of completion and setting the appropriate context for our results to be stated in the next section. This relative error bound result is summarized in the following theorem.

###### Theorem 1

(Theorem 2 from  applied to ) Given and an integer , let and . There exist randomized algorithms such that, if columns are chosen to construct and rows are chosen to construct

, then with probability

, the following holds:

 ∥A−CUR∥F≤(1+ε)∥A−Ak∥F

where , and is the scaled intersection of and .

This theorem states that as long as enough rows and columns of the matrix are acquired ( and , respectively), then the CUR decomposition will be within a constant factor of the error associated with the best rank- approximation of that matrix. Central to the proposed randomized algorithm was the concept of sampling columns of the matrix based on a leverage score. The leverage score measures the contribution of each column to the approximation of .

###### Definition 1

The leverage score of a column is defined as the squared row norm of the top- right singular vectors of corresponding to the column:

 ℓj=∥VTA,kej∥22, j∈[n],

where consists of the top- right singular vectors of as its rows, and is the -th column of identity matrix which picks the -th column of .

The CUR algorithm involves randomly sampling rows using probabilities generated by the calculated leverage scores to obtain the matrix , and thereafter sampling columns of based on leverage scores of the matrix to obtain . The key technical insight in  is that the leverage score of a column measures “how much” of the column lies in the subspace spanned by the top- left singular vectors of ; therefore, this method of samping is also known as subspace sampling. By sampling columns that lie in this subspace more often, we get a relative-error low rank approximation of the matrix. The concept of sampling the important columns of a matrix based on the notion of subspace sampling first appeared in context of fast regression in  and was refined in  to obtain performance error guarantees for CUR matrix decomposition.

These guarantees were subsequently improved in follow-up work . Modified versions of this problem have been studied extensively for adaptive sampling , divide-and-conquer algorithms for parallel computations , and input-sparsity algorithms . The authors of  propose an adaptive sampling-based algorithm which requires only columns to be sampled when the entire matrix is known and its SVD can be computed. The authors of  also proposed an optimal, deterministic CUR algorithm. In , the authors prove the lower bound of the column selection problem; at least columns are selected to achieve the ratio.

These prior results require sampling of arbitrary rows and columns of the matrix which may be either unrealistic or inefficient in many practical applications. In this paper, we focus on the problem of efficiently sampling pre-defined blocks of columns (or rows) of the matrix to provide a factorization that is close to the best rank- approximation of the matrix in the more natural environment of block sampling for biometric and distributed computation, explore the performance advantages of block sampling over individual column sampling, and provide the first non-trivial theoretical error guarantees for Block CUR decomposition. In the following section, we propose and analyze a randomized algorithm for sampling blocks of the matrix based on block leverage scores.

## 3 The Block CUR algorithm

A block may be defined as a collection of columns or rows. For clarity of exposition, without loss of generality, we consider column blocks but the techniques and derivations also hold for row blocks by applying them to the transpose of the matrix. For ease of exposition, we also assume equal-sized blocks but one could easily extend the methods to blocks of varying sizes. Let be the number of possible blocks in . We consider the blocks to be predefined due to natural constraints or cost, such as data partitioning in a distributed compute cluster.

The goal of the Block CUR algorithm is to approximate the underlying matrix using blocks of columns and rows, as represented in Figure 2. For example, in the biometric analysis setting each block could correspond to user reactions at a collection of time points corresponding to a scene in a movie. The goal is to approximate the users’ reactions to the full movie using only their response to a summary of the movie (containing a subset of the scenes). Figure 2: Example Block CUR decomposition, where Ct∈Rm×s for t∈[g] is sampled from {A(jt) : jt∈[G]}.

Given the new regime of submatrix blocks, we begin by defining a block leverage score for each block of columns.

###### Definition 2

The block leverage score of a group of columns is defined as the sum of the squared row norms of the top- right singular vectors of corresponding to the columns in the block:

 ℓg(A,k)=∥VTA,kEg∥2F, g∈[G],

where consists of the top- right singular vectors of , and consists of the corresponding block of columns in the identity matrix which picks the columns of corresponding to the elements in block .

Much like the individual column leverage scores defined in , the block leverage scores measure how much a particular column block contributes to the approximation of the matrix .

### 3.1 Algorithm details

The Block CUR Algorithm, detailed in Algorithm LABEL:groupalgo, takes as input the matrix and returns as output an matrix consisting of a small number of rows of and an matrix consisting of a small number of column blocks from .

algocf[h]

In Algorithm LABEL:groupalgo, for , block is sampled with some probability and scaled using matrix . The -th non-zero block of is defined as where is the number of blocks picked by the algorithm. This sampling matrix picks the blocks of columns and scales each block to compute . A similar sampling and scaling matrix is defined to pick the blocks of rows and scale each block to compute . An example of sampling matrix with blocks chosen in order is as follows:

 Sn×gs=⎡⎢ ⎢ ⎢ ⎢ ⎢ ⎢ ⎢⎣1√gp1Is00001√gp2Is01√gp3Is0000⎤⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥⎦.

In addition to considering block sampling of columns, another advantage of this algorithm is not requiring the computation of a full SVD of . In many large-scale applications, it may not feasible to compute the SVD of the entire matrix

. In these cases, algorithms requiring knowledge of the leverage scores cannot be used. Instead, we use an estimate of the block leverage scores called the

approximate block leverage scores. A subset of the rows (corresponding to users) are chosen uniformly at random, and the block scores are calculated using the top- right singular vectors of this row matrix instead of the entire matrix. This step is not the focus of the experiments in this paper so it can also be replaced with other fast approximate calculations of leverage scores involving sketching or additional sampling [3, 16]. The advantage of using our approximate leverage scores is that the same set of rows is used to approximate the scores and also to compute the CUR approximation. Hence no additional sampling or sketching steps are required. In terms of the biometric application, each row corresponds to a user’s biometric reaction to a movie. Since collecting user reactions to lengthy content can be expensive, eliminating redundant sampling leads to huge savings in resources.

The running time of Algorithm LABEL:groupalgo is essentially driven by the time required to compute the SVD of , i.e., ) time, and the time to construct , and . Construction of requires ) time, construction of takes ) time, construction of requires ) time and construction of takes ) time.

### 3.2 Theoretical results and discussion

The main technical contribution of the paper is a novel relative-error bound on the quality of approximation using blocks of columns or rows to approximate a matrix . Before stating the main result, we define two important quantities that measure important properties of the matrix that are fundamental to the quality of approximation. We first define a property of matrix rank relative to the collection of matrix blocks. Specifically, we focus on the concept of matrix stable rank from  and define the block stable rank as the minimum stable rank across all matrix blocks.

###### Definition 3

Let consist of the top- right singular vectors of . Then the block stable rank is defined as

 αA=ming∈[G]∥VTA,kEg∥2F∥VTA,kEg∥22,

where consists of the corresponding block of columns in the identity matrix that picks the columns of corresponding to the elements in block .

Intuitively, the above definition gives a measure of how informative the worst matrix column block is. The second property is a notion of column space incoherence. When we sample rows uniformly at random, we can give relative error approximation guarantees when the matrix satisfies an incoherence condition. This avoids pathological constructions of rows of that cannot be sampled at random.

###### Definition 4

The top- column space incoherence is defined as

 μ:=μ(UTA,k)=mkmaxi∥UTA,kei∥22,

where picks the -th column of .

The column space incoherence is used to provide a guarantee for fast approximation without computing the SVD of the entire matrix . Equipped with these definitions, we state the main result that provides a relative-error guarantee for the Block CUR approximation in Theorem 2.

###### Theorem 2

Given with incoherent top- column space, i.e., , let and . There exist randomized algorithms such that, if rows and column blocks are chosen to construct and , respectively, then with probability , the following holds:

 ∥A−CUR∥F≤(1+ε)∥A−Ak∥F,

where and is the pseudoinverse of scaled intersection of and .

We provide a sketch of the proof and highlight the main technical challenges in proving the claim in Section 3.3 and defer the proof details to the Appendix. In Section 3.3, we first provide a relative-error guarantee (Lemma 2) for the approximation provided by Algorithm LABEL:groupalgo. After applying standard boosting techniques (explained in Section 3.3) we get the main result stated above.

We detail the differences between our technique and prior CUR algorithms here. This includes additional assumptions required, algorithmic trade-offs, and discussion of sampling and computational complexity.

Block stable rank. Theorem 2 tells us that the number of blocks required to achieve an relative error depends on the structure of the blocks (through ). Intuitively, this is saying the groups that provide more information improve the approximation faster than less informative groups. The term depends on the stable or numerical rank (a stable relaxation of exact rank) of the blocks. The stable rank is a relaxation of the rank of the matrix; in fact, it is stable under small perturbations of the matrix . For instance, the stable rank of an approximately low rank matrix tends to be low. The term defined in Theorem 2 is the minimum stable rank of the column blocks. Thus, the term gives a dependence of the block sampling complexity on the stable ranks of the blocks. It is easy to check that . In the best case, when all the groups have full stable rank with equal singular values, achieves its maximum. The worst case is achieved when a group or block is rank-. That is, sampling groups of rank gives us a lot more information than groups of rank 1, which leads to a reduction in the total sampling complexity.

Incoherence. The column space incoherence (Definition 4) is used to provide a guarantee for approximation without computing the SVD of the entire matrix . However, if it is possible to compute the SVD of the entire matrix, then the rows can be sampled using row leverage scores, and the incoherence assumption can be dropped. The relative error guarantee, independent of incoherence, for the full SVD Block CUR approximation is stated as Corollary 1 in the Appendix. The corollary follows by a similar analysis as Theorem 2 so we defer the proof to the Appendix. Other than block sampling, the setup of this result is equivalent to the traditional column sampling result stated in Lemma 2. Next, we compare the block sampling result with extensions of traditional column sampling.

Sample complexity: comparison with extensions of traditional CUR results. In order to compare the sample complexity of our block sampling results with trivial block extensions of traditional column sampling results we focus our attention on the similar leverage score based CUR result in Theorem 1. A simple extension to block setting could be obtained by considering a larger row space in which blocks are expanded to vectors. This would lead to a sample complexity bound obtained by Theorem 1. The sampling complexity of the Block CUR derived in Theorem 2 tells us the number of sampling operations or queries that need to be made to memory in order to construct the and matrices. As shown in Table 1 the column block sample complexity obtained by traditional CUR extensions results is always greater than or equal to those required by our Block CUR result because . This happens since traditional CUR-based results are obtained by completely ignoring the block structure of the matrix.

As a side note, the authors are aware that more recent adaptive column sampling-based algorithms such as [15, 2] require only columns to be sampled. These results assume full computation of the SVD is possible, and they are byproducts of heavy machinery using ideas like deterministic, Batson/Srivastava/Spielman (BSS) sampling and adaptive sampling on top of leverage scores. By extending these advanced techniques to block sampling, it may be possible to obtain tighter bounds but it does not bring new insight into the problem of sampling blocks and unnecessarily complicates the discussion. Therefore we defer this extension to future work.

### 3.3 Proof sketch of main result

In this section, we provide a sketch of the proof of Theorem 2 and defer the details to the Appendix. The proof of the main result rests on two important lemmas. These results are important in their own right and could be useful wherever the block sampling issue arises. The first result concerns approximate block multiplication.

Block multiplication lemma. The following lemma shows that the multiplication of two matrices and can be approximated by the product of the smaller sampled and scaled block matrices. This is the key lemma in proving the main result.

###### Lemma 1

Let , ,, and be defined as Construct and using sampling probabilities that satisfy

 pi≥β∥A(i)∥2F∑Gj=1∥A(j)∥2F,

for all and where . Then, with probability at least ,

 ∥AB−CR∥F≤1δ√βgαA∥A∥F∥B∥F.

The proof details are provided in the Appendix. The main difficulty in proving this claim is to account for the block structure. Even though one could trivially extend individual column sampling analysis to this setting by serializing the blocks, this would lead to trivial bounds as they do not leverage the block structure. Our results exploit this knowledge and hence introduce a dependence of the sample complexity on the block stable rank of the matrix.

Using the block multiplication lemma we prove Lemma 2, which states a non-boosting approximation error result for Algorithm LABEL:groupalgo.

###### Lemma 2

Given with incoherent top- column space, i.e., , let and . If rows and column blocks are chosen according to Algorithm LABEL:groupalgo, then with probability at least 0.7, the following holds:

 ∥A−CUR∥F≤(1+ε)∥A−Ak∥F,

where , is the pseudoinverse of the scaled intersection of and .

The proof of Lemma 2 follows standard techniques in  with modifications necessary for block sampling (see Appendix for the proof details). Finally, the result in Theorem 2 follows by applying standard boosting methods to Lemma 2 and running Algorithm LABEL:groupalgo times. By choosing the solution with minimum error and observing that , we have that the relative error bound holds with probability greater than .

Remark. As a consequence of Lemma 2, we show that if enough blocks are sampled with high probability, then . This gives a guarantee on the approximate solution obtained by solving a block-sampled regression problem instead of the entire least squares problem. As a special case of the above result, when we get a bound for the block column subset selection problem. If blocks are chosen, then with probability at least we have

## 4 Experiments

### 4.1 Experiments with biometric data

One emerging application is audience reaction analysis of video content using biometrics. Specifically, users watch video content while wearing sensors, with changes in biometric sensors indicating changes in reaction to the content. For example, increases in heart rate or a spike in electrodermal activity indicate an increase in content engagement. In prior work, biometric signal analysis techniques have been developed to determine valence  (e.g., positive vs. negative reactions to films) and content segmentation . Unfortunately these experiments require a large number of users to sit through the entire video content, which can be both costly and time-consuming.

We consider the observed biometric signals as a matrix with users (as rows) and biometric time samples (as columns). Matrix approximation techniques, such as CUR decomposition, point to the ability to infer the complete matrix by showing the entire content to only a subset of users (i.e., rows), while the remaining users see only selected scenes of the content (i.e., column blocks). To replicate a user’s true reaction to content, individual columns cannot be sampled (e.g., showing the user 0.25 seconds of video content) given the lack of scene context. Instead, longer scenes must be shown to the user to gather a representative response. Therefore, the Block CUR decomposition proposed in this paper is directly applicable.

The biometric experiment setup is as follows. We attached subjects with the Empatica E3 wearable sensor  that measures electro-dermal activity (EDA) at Hz. The subjects were shown a -minute episode of the television series “NCIS”, in the genres of action and crime. The resulting biometric data matrix was

. Our goal is to use Block CUR decomposition to show only a subset of users the entire content, and to then impute the biometric data for users that have viewed only a small number of selected scenes from the content.

Results. We refer to the biometric data matrix as and plot the EDA traces (rows) corresponding to four users in Figure (a)a. To demonstrate the low rank nature of the data, we plot the Frobenius norm of covered by as a function of in Figure (b)b. We find that for this data, only singular vectors are needed to capture of the total Frobenius norm of the complete matrix. Next, we segment the columns of this matrix into blocks such that columns (or seconds). In Figure 4, we show the computed block leverage scores. The leverage scores seem to suggest that certain scenes are more important than others. For example, the highest leverage scores are around the 12, 26, and 38 minute marks. This corresponds to scenes of a dead body, unveiling of a clue to solving the mystery, and the final arrest, respectively.

Using Algorithm LABEL:groupalgo, we uniformly sample EDA traces (rows) of 20 users and hold out the EDA traces of 4 users. We sample column blocks and plot the resulting error in Frobenius norm in Figure 5. The plots show the normalized Frobenius norm error of the CUR approximation as a function of the number of blocks, , sampled. More precisely, the ratio and are plotted for two values of the target rank, and and two values of block size, and columns per block ( and seconds), respectively. We also compare the error using , the rank- approximation of , which leads to an exactly rank- matrix approximation since this may be a restriction in some applications. We repeat Algorithm LABEL:groupalgo ten times111These plots were generated using sampling without replacement even though our theory supports sampling with replacement since sampling the same blocks is inefficient in practice. and plot the mean normalized error over 10 trials.

The error drops sharply as we sample more blocks but quickly flattens demonstrating that a summary of the movie could suffice to approximate the full responses. The plots also show the interplay between the number of blocks sampled and the issue of context which is related to block size. To give the viewer some context we would want to make the scene as long as possible but we want to show them only a summary of the content to reduce the cost. These conflicting aims result in a trade-off of block size and the number of blocks sampled. For example, for , the normalized error is less than when a minute long clip is shown to the viewer, that is with block size columns (or seconds), whereas the normalized error is less than when a minute long clip is shown to the viewer () with block size columns (or seconds). These results demonstrate the practical use of the Block CUR algorithm.

### 4.2 Distributed experiments

In this section we demonstrate empirically that the proposed block sampling based CUR algorithms can achieve a significant speed-up when used to decompose matrices in a distributed data setting by comparing their performance with individual column sampling based traditional CUR algorithms on both synthetic and real-world data. We report the relative-error of the decomposition (i.e., ) and the sampling time of each algorithm on different data-sets.

We implemented the algorithms in Scala 2.10 and Apache Spark 2.11 on Amazon Elastic Map-Reduce (Amazon EMR). The compute cluster was constructed using four Amazon m4.4xlarge instances, with each compute node having 64 GB of RAM. Using Spark, we store the data sets as resilient distributed dataset (RDD), a collection of elements partitioned across the nodes of the cluster (see Figure 2). In other words, Spark partitions the data into many blocks and distributes these blocks across multiple nodes in the cluster. Using block sampling, we can approximate the matrix by sampling only a subset of the important blocks. Meanwhile, individual column sampling would require looking up all the partitions containing specific columns of interest as shown in Table 2. Our experiments examine the runtime speed-up from our block sampling CUR that exploits the partitioning of data.

Synthetic experiments. The synthetic data is generated by where and are random matrices with i.i.d. Gaussian random entries, resulting in a low rank matrix . We perform CUR decomposition on matrices of size with , target rank , and number of blocks (set here across all experiments to be ). The leverage scores are calculated by computing the SVD of the rows sampled uniformly with . We sample one-sixth of the rows.

Figure 6 shows the plots for relative error achieved with respect to the runtime required to sample and matrices for both Block CUR and traditional CUR algorithms. To focus on the speed-up achieved by taking into account the block storage of data we compare running times of only the sampling operations of the algorithms (which excludes the time required to compute the SVD). We note that other steps in both algorithms can be updated to include faster variants such as the approximation of leverage scores by sketching or sampling . We vary , the number of blocks chosen, from to . The number of columns chosen is thus , where denotes the number of columns in a block and varies from to . We repeat each algorithm (Block CUR and traditional CUR) twice for the specified number of columns, with each realization as a point in the plot. The proposed Block CUR algorithm samples the columns in blocks, while traditional CUR algorithm samples the columns one at a time.

Consistently, these results show that block sampling achieves the relative error much faster than the individual column sampling – with performance gains increasing as the size of the matrix grows, as shown in Figure 6. While the same amount of data is being transmitted regardless of whether block or individual column sampling is used, block sampling is much faster because it needs to contact fewer executors to retrieve blocks of columns rather than the same number of columns individually. In the worst case, sampling individual columns may need to communicate with all of the executors, while block sampling only needs to communicate with executors. Thus, by exploiting the partitioning of the data, the Block CUR approach is able to achieve roughly the same quality of approximation as traditional column-based CUR, as measured by relative error, with significantly less computation time.

Real-world experiments. We also conduct experiments on the Arcene dataset  which has 900 rows and 10,000 columns. We compare the running time for both block and traditional CUR decomposition. We again find consistent improvements for the block-wise approach compared with individual column sampling. With block size , sampling up to 10 groups led to an average speed up of 11.2 over individual column sampling, as shown in Figure 7. The matrix is very low rank, and sampling a few groups gave small relative errors.

## 5 Conclusion

In this paper we extended the problem of CUR matrix decomposition to the block setting which is naturally relevant to distributed storage systems and biometric data analysis. We proposed a novel algorithm and derived its performance bounds. We demonstrated its practical utility on real-world distributed storage systems and audience analytics. Some possible future directions for this work include calculating the leverage scores quickly or adaptively, and considering the algorithms and error bounds when the matrix has a pre-specified structure like sparsity.

## References

•  C. Boutsidis, P. Drineas, and M. Magdon-Ismail. Near-optimal column-based matrix reconstruction. SIAM Journal on Computing, 43(2):687–717, 2014.
•  C. Boutsidis and D. P. Woodruff. Optimal CUR matrix decompositions. In

Proceedings of the 46th Annual ACM Symposium on Theory of Computing

, pages 353–362. ACM, 2014.
•  P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff. Fast approximation of matrix coherence and statistical leverage.

Journal of Machine Learning Research

, 13(Dec):3475–3506, 2012.
•  P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Sampling algorithms for regression and applications. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1127–1136. Society for Industrial and Applied Mathematics, 2006.
•  P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 30(2):844–881, 2008.
•  M. Garbarino, M. Lai, D. Bender, R. Picard, and S. Tognetti. Empatica E3—A wearable wireless multi-sensor device for real-time computerized biofeedback and data acquisition. In Proceedings of the 4th International Conference on Wireless Mobile Communication and Healthcare, pages 39–42, November 2014.
•  I. Guyon, S. R. Gunn, A. Ben-Hur, and G. Dror.

Result analysis of the NIPS 2003 feature selection challenge.

In Advances in Neural Information Processing Systems, pages 545–552, 2004.
•  S. Jain, U. Oswal, K. S. Xu, B. Eriksson, and J. Haupt. A compressed sensing based decomposition of electrodermal activity signals. IEEE Transactions on Biomedical Engineering, 64(9):2142–2151, 2017.
•  W. Lian, V. Rao, B. Eriksson, and L. Carin. Modeling correlated arrival events with latent semi-markov processes. In Proceedings of the International Conference on Machine Learning, 2014.
•  L. W. Mackey, M. I. Jordan, and A. Talwalkar. Divide-and-conquer matrix factorization. In Advances in Neural Information Processing Systems, pages 1134–1142, 2011.
•  M. W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.
•  P. Paschou, E. Ziv, E. G. Burchard, S. Choudhry, W. Rodriguez-Cintron, M. W. Mahoney, and P. Drineas. PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genetics, 3(9):e160, 2007.
•  M. Rudelson and R. Vershynin. Sampling from large matrices: An approach through geometric functional analysis. Journal of the ACM, 54(4):21, 2007.
•  F. Silveira, B. Eriksson, A. Sheth, and A. Sheppard. Predicting audience responses to movie content from electro-dermal activity signals. In Proceedings of the ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 707–716. ACM, 2013.
•  S. Wang and Z. Zhang. A scalable CUR matrix decomposition algorithm: lower time complexity and tighter bound. In Advances in Neural Information Processing Systems, pages 647–655, 2012.
•  M. Xu, R. Jin, and Z.-H. Zhou. CUR algorithm for partially observed matrices. In Proceedings of the International Conference on Machine Learning, pages 1412–1421, 2015.
•  J. Yang, O. Rübel, M. W. Mahoney, and B. P. Bowen. Identifying important ions and positions in mass spectrometry imaging data using cur matrix decompositions. Analytical Chemistry, 87(9):4658–4666, 2015.
•  C.-W. Yip, M. W. Mahoney, A. S. Szalay, I. Csabai, T. Budavári, R. F. Wyse, and L. Dobos. Objective identification of informative wavelength regions in galaxy spectra. The Astronomical Journal, 147(5):110, 2014.

## Appendix A Supplementary Material for Block CUR: Decomposing Matrices using Groups of Columns

As stated in Section 3.3, the result in Theorem 2 follows by applying standard boosting methods to Lemma 2 and running Algorithm 1 times. By choosing the solution with minimum error and observing that , we have that the relative error bound holds with probability greater than . Hence, it suffices to prove Lemma 2 to prove the main result.

### a.1 Proof of Lemma  2

First, note and .

 ∥A−CUR∥F =∥A−AS(RS)†R∥F

Recall that has rank no greater than ; ; and that the same column blocks from and

are picked with the following probability distribution:

 pi=∥VTR,rEi∥2Fr, ∀i∈[G].

We can use Lemma 3 (stated and proved in next section) with probability at least 0.85 we have

 ∥A−AS(RS)†R∥F≤(1+ε)∥A−AR†R∥F.

Next, we bound . Since has incoherent column space, the uniform sampling distribution satisfies eqn. (13) in  with . Consequently, we can apply modified version of Theorem 1 in  we get with probability at least 0.85, . Finally, we get with probability 0.7,

 ∥A−CUR∥F ≤(1+ε′)2∥A−Ak∥F, ≤(1+ε′′)∥A−Ak∥F, letting ε′′=3ε′.

This completes the proof of Lemma 1.

#### a.1.1 Approximating generalized ℓ2 regression in the block setting

In this section, we give theory for generalized least squares using block subset selection that is used to prove the main results for the algorithms but applies to arbitrary matrices and . Given matrices and , the generalized least squares problem is

 minX∈Rm×r∥A−XB∥F.

It is well-known that the solution to this optimization problem is given by . To approximate this problem by a subsampled problem, we sample some blocks of columns from and to approximate the standard regression by the following optimization:

 minX∈Rm×r∥(AS)−X(BS)∥F.

The solution of this problem is given by . In the following lemma, we give a guarantee stating that, when enough blocks are sampled with the specified probability, the approximate solution is close to the actual solution to the regression.

###### Lemma 3

Suppose has rank no greater than ; ; and let the same column blocks from and be picked with the following probability distribution:

 pi=∥(VB,k)(i)∥2Fk, ∀i∈[G].

If blocks are chosen, then with probability at least we have

 ∥A−AS(BS)†B∥F≤(1+ε)∥A−AB†B∥F.
###### Proof 1

Let and

We start by showing is full rank. Using Lemma 2, if and , we get the following with probability ,

 ∥VTkVk−VTkSSTVk∥2=∥Ik−VTkSSTVk∥2≤4kδ√αRg≤ε12.

This further gives us a bound on the singular values of , for all ,

 |1−σ2i(VTkS)|=|σi(VTkVk)−σi(VTkSSTVk)|≤∥Ik−VTkSSTVk∥2≤ε1. (1)

Thus, it follows for all singular values of ,

 √1−ε1≤σi(VTkS)≤√1+ε1. (2)

Now, consider

 ∥Ω∥2 =∥(VTkS)†−(VTkS)T∥2 =∥Σ−1VTkS−ΣVTkS∥2 =maxi∣∣ ∣∣σi(VTkS)−1σi(VTkS)∣∣ ∣∣ =maxi|σ2i(VTkS)−1||σi(VTkS)| ≤∥VTkVk−VTkSSTVk∥2√1−∥VTkVk−VTkSSTVk∥2 ≤ε1/2√1−ε1/2 ≤ε1/√2,

where the first inequality follows from equation (1), the second inequality follows by applying Lemma 1 and the last inequality follows since implies

Also, for any we have,

 E[∥QS∥2F] =E[g∑t=1∥∥∥1√gpjtQ(jt)∥∥∥2F] =g∑t=1E[1gpjt∥∥Q(jt)∥∥2F] =g∑t=1G∑i=1pi1gpi∥∥Q(i)∥∥2F =∥Q∥2F.

By Jensen’s inequality,

 E[∥QS∥F]2≤E[∥QS∥2F]=∥Q∥2F.

By applying Markov’s inequality, we get with probability ,

 ∥QS∥F≤1δ′E[∥QS∥F]≤1δ′∥Q∥F. (3)

The following will be useful later,

 AS(BS)†B =AS(UkΣkVTkS)†UkΣkVTk =AS(VTkS)†Σ−1kUTkUkΣkVTk =AS(VTkS)†VTk

Using this result and observing that , we break down the left hand term into 3 manageable components,

 ∥A−AS(BS)†B∥F =∥A−AS(VTkS)†VTk∥F =∥A−AVkVTkS(VTkS)†VTk+AV⊥kV⊥TkS(VTkS)†VTk∥F

As seen before, with high probability, is full rank. Using this fact along with triangle inequality gives us

 ∥A−AS(BS)†B∥F =∥A−AVkVTk+AV⊥kV⊥TkS(VTkS)†VTk∥F ≤∥A−AVkVTk∥F+∥AV⊥kV⊥TkS(VTkS)†VTk∥F

Define ,

 ∥A−AS(BS)†B∥F =∥AV⊥kV⊥Tk∥F+∥AV⊥kV⊥TkS(Ω+(VTkS)T)∥F ≤∥AV⊥kV⊥Tk∥F+∥AV⊥kV⊥TkS∥F∥Ω∥2+∥AV⊥kV⊥TkSSTVk∥F

By (3) and since ,

 ∥A−AS(BS)†B∥F ≤(1+1δ′∥Ω∥2)∥AV⊥kV⊥Tk∥F+∥AV⊥kV⊥TkVk−AV⊥kV⊥TkSSTVk∥F

Using Lemma 1 and ,

 ∥A−AS(BS)†B∥F ≤(1+1δ′∥Ω∥2)∥AV⊥kV⊥Tk∥F+1δ2√αRg∥AV⊥kV⊥Tk∥F∥Vk∥F ≤(1+1δ′∥Ω∥2)∥AV⊥kV⊥Tk∥F+√kδ2√αRg∥AV⊥kV⊥Tk∥F ≤(1+1δ′∥Ω∥2+ε1√8δ2)∥AV⊥kV⊥Tk∥F

where the second inequality follows since and the last inequality follows since .

Finally, using , we have

 ∥A−AS(BS)†B∥F≤(1+1δ′∥Ω∥2+ε1√8δ2)∥A−AB†B∥F

Thus, we can conclude the following with probability

 ∥A−AS(BS)†B∥F ≤(1+(1√2δ′+12δ2)ε1)∥A−AB†B∥F ≤(1+ε)∥A−AB†B∥F

by setting and .

Lemma 2 is used, then . Finally, note that by assumption.

### a.2 Proof of Lemma 1

###### Proof 2

Note that,

 E⎡⎣(A(jt)B(jt)gpjt)i1i2⎤⎦=G∑k=1pk(A(k)B(k)gpk)i1i2=1g(AB)i1i2

Since each block is picked independently we have,

 var[(CR)i1i2] =var⎡⎣g∑t=1(A(jt)B(jt)gpjt)i1i2⎤⎦ =g∑t=1var⎡⎣(A(jt)B(jt)gpjt)i1i2⎤⎦ =g∑t=1⎛⎜⎝E⎡⎢⎣(A(jt)B(jt)gpjt)2i1i2⎤⎥⎦−E⎡⎣(A(jt)B(jt)gpjt)i1i2⎤⎦2⎞⎟⎠ =g⎛⎝G∑k=1pk(A(k)B(k)gpk)2i1i2−(AB)2i1i2g2⎞⎠ =1g⎛⎜⎝