# Localized sketching for matrix multiplication and ridge regression

We consider sketched approximate matrix multiplication and ridge regression in the novel setting of localized sketching, where at any given point only part of the data matrix is available. This corresponds to a block diagonal structure on the sketching matrix. We show that, under mild conditions, block diagonal sketching matrices require only O(stable rank/ϵ²) and O(stat. dim./ϵ) total sample complexity for matrix multiplication and ridge regression, respectively. This matches the state-of-the-art bounds that are obtained using global sketching matrices. The localized nature of the sketching allows different parts of the data matrix to be sketched independently, and is hence more amenable to computation in distributed and streaming settings, with a smaller memory and computational footprint.


## 1 Introduction

Efficient linear algebraic computations are of fundamental importance in machine learning and signal processing applications. This has led to a rise in randomized linear algebraic methods that aim to solve large problems only approximately, but with much lower time complexity than standard methods (see [Woodruff:2014:STN:2693651.2693652, wang2017sketched, 7355313, pmlr-v80-chowdhury18a] and references therein). In this work, we consider two specific examples: sketched matrix multiplication [optimal] and ridge regression [sharper], but with additional constraints on the sketching matrices that arise in the context of distributed data acquisition. Formally, if W ∈ R^{N×d1} and Y ∈ R^{N×d2}, computing the product WᵀY takes O(N d1 d2) time, which can be prohibitive for large N. The sketched version then aims to find a matrix S ∈ R^{M×N}, with M ≪ N, such that

 ∥∥(SW)T(SY)−WTY∥∥≤ϵ∥W∥∥Y∥. (1)

Computing the sketched matrix product then takes only O(M d1 d2) time (not accounting for the time to compute SW and SY themselves). State-of-the-art bounds show that a sketch size proportional to (sr(W) + sr(Y))/ϵ² suffices, where sr(·) is the stable rank of a matrix (defined in Section 2; a stable alternative to the rank). Similarly, given A ∈ R^{N×d} with N ≫ d and b ∈ R^N, the ridge regression problem is

 x∗ = arg minx∈Rd f(x) := ∥Ax−b∥² + λ∥x∥² (2)

and can be solved in O(Nd²) time. The sketched problem instead seeks a matrix S ∈ R^{M×N} such that solving

 ^x = arg minx∈Rd fS(x) := ∥SAx−Sb∥² + λ∥x∥² (3)

yields

 f(^x)≤(1+ϵ)f(x∗). (4)

State-of-the-art bounds show that for small ϵ, a sketch size proportional to sdλ/ϵ suffices, where sdλ is the statistical dimension, again a more stable alternative to the rank of A, as defined in Section 2.
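The sketch-and-solve scheme in (2)–(4) can be illustrated with a minimal numpy sketch. This is our own illustration, not the paper's code: the matrix sizes, the regularization level, and the use of a dense Gaussian sketch S are all illustrative assumptions; note that only the data-fit term is sketched, while the regularizer is left untouched, exactly as in (3).

```python
import numpy as np

rng = np.random.default_rng(0)

def ridge_solve(A, b, lam):
    """Solve min_x ||Ax - b||^2 + lam * ||x||^2 via the normal equations."""
    d = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)

def f(A, b, lam, x):
    """The ridge objective from (2)."""
    return np.linalg.norm(A @ x - b) ** 2 + lam * np.linalg.norm(x) ** 2

N, d, lam, M = 2000, 20, 1.0, 200
A = rng.standard_normal((N, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)

x_star = ridge_solve(A, b, lam)            # exact solution of (2)

# Dense Gaussian sketch: only the data-fit term ||Ax - b||^2 is sketched.
S = rng.standard_normal((M, N)) / np.sqrt(M)
x_hat = ridge_solve(S @ A, S @ b, lam)     # solution of the sketched problem (3)

ratio = f(A, b, lam, x_hat) / f(A, b, lam, x_star)  # the (1 + eps) factor in (4)
```

Since x∗ is the exact minimizer, the ratio is always at least 1; with M substantially larger than d it stays close to 1, which is the guarantee (4) asserts.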

With this background in place, let us consider a scenario where the data matrix is naturally divided into J blocks that are not all available at a single location. Let each block be of size Nj × d, with ∑j Nj = N. Such partitioning of data into different blocks occurs naturally in many applications. For example, dynamic systems produce data that evolve over time. Storing the entire data before sketching it would require large amounts of memory [9]; it would be preferable to sketch the system as it evolves, leading to a natural partition. In yet another application, consider the square kilometer array [11]. This array consists of antennas distributed across the continents of Australia and Africa. To handle the massive data rates (on the order of terabytes per second), it is desirable to sketch the data locally at each antenna and then transmit the result to the central processing location. In distributed systems that use an edge-cloud architecture, edge nodes collect data that must be communicated to the cloud for inference. The communication requirements can be made smaller if the data at each edge node is compressed to an “optimal” dimension.

A feature of existing sketching methods (including those that use fast Johnson–Lindenstrauss matrices such as the Subsampled Randomized Hadamard Transform (SRHT) [ailon2006approximate] and sparse sketching matrices [clarkson2017low]) is that they need access to all, or an arbitrary subset, of the rows of the data matrix (see Figure 1). Clearly, this is unsuitable for an application with distributed data. This leads us to ask the following questions: Is there a way to adapt sketching techniques to such applications? What is the best way to model dimensionality reduction for such applications? Two naïve approaches are readily available: i) Since each block is of size Nj × d, its rank is upper bounded by d. One could obtain a subspace embedding for each block and communicate these sketched blocks to the central node. The resulting dimension of the aggregated data then grows linearly in J, since each block needs to be sketched to a dimension proportional to its rank. ii) Sketch each data block separately, and add the resulting sketches at the central node instead of aggregating them; this results in a sketch of the entire data matrix. Using existing bounds, one can conclude the size that the final sketch needs to be, which again requires each data block also to be sketched to that same dimension.

A major drawback of both of the above approaches is that they do not take advantage of the inherent low dimensionality of the entire matrix, resulting in an unnecessarily large sketch size for each data block. Our observation is that it should be possible to lose information locally, while still retaining all the information about the data globally. This is exactly what we address in this paper: we show theoretically that each of the J blocks can be sketched to a much smaller dimension. This implies that the sketch obtained from a single block may not be big enough to provide a subspace embedding for that block; yet, an embedding of the entire matrix can be obtained once the sketches from the individual blocks are aggregated. Hence, our work aims to initiate a study of how to extend sketching methods to distributed data acquisition scenarios.

Our proposal is to impose a block diagonal structure on the sketching matrix. We denote such a sketching matrix as SD. We then partition the data matrices analogously. This results in sketches of the form

 SDA = diag(S1, S2, …, SJ) · [A1; A2; ⋮; AJ] = [S1A1; S2A2; ⋮; SJAJ]. (5)

We assume that Sj ∈ R^{Mj×Nj}, with the row blocks of equal size Nj = N/J, although our results extend to the case where the Nj’s are of different sizes. Further, in our paper we assume that the non-zero entries of the matrix SD are drawn from the Gaussian distribution. Our goal is to study the sample complexities Mj required to achieve guarantees similar to those in [optimal] and [sharper] for dense (non-block-diagonal) sketching matrices.

Apart from the structural advantages described above, computing the product SDA can also be much cheaper than an unstructured random projection. For generic Sj, the sketch can be computed in ∑j O(Mj Nj d) time, as compared to the O(M̃ N d) required for a dense, unstructured sketch with M̃ = ∑j Mj rows. Second, the computation is trivial to parallelize into J blocks, each requiring O(Mj Nj d) time. For large problems with low effective rank, where we can take many blocks, this gives us a sketch with structured randomness competitive with methods that use SRHT and sparse embedding matrices [Woodruff:2014:STN:2693651.2693652]. Furthermore, the blocks themselves could be designed to be fast transforms. Owing to these computational advantages, blocking could be a strategy by itself.
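The computation in (5) never requires the full matrix SD to be formed: each term Sj Aj touches only the j-th data block, so the blocks can be sketched locally and in parallel. The following sketch is our own minimal illustration of this (the sizes, the seed, and the equal-size split are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(1)

def block_diagonal_sketch(A, Ms, rng):
    """Apply the block diagonal sketch SD of (5) to A without forming SD.

    A is split into J = len(Ms) equal row blocks; block j is multiplied by an
    independent Mj x Nj Gaussian matrix.  In a distributed setting each
    product Sj @ Aj would be computed at the node holding block j.
    """
    blocks = np.array_split(A, len(Ms), axis=0)
    return np.vstack([
        (rng.standard_normal((Mj, Aj.shape[0])) / np.sqrt(Mj)) @ Aj
        for Aj, Mj in zip(blocks, Ms)
    ])

N, d, J = 1200, 10, 4
A = rng.standard_normal((N, d))           # incoherent synthetic data
SA = block_diagonal_sketch(A, [60] * J, rng)   # total sketch size 240 << N
```

Even though no single 60-row block sketch embeds its 300-row data block particularly well, the aggregated 240-row sketch preserves the Gram matrix AᵀA of the full data to within a modest relative error, which is the phenomenon the paper formalizes.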

### 1.1 Related work

There is a vast and growing literature on sketching techniques. Here we briefly review some of the work most relevant to ours in the context of our setting. Note that while sketching can also be used as a pre-conditioning method [7355313], here we will only address “sketch and solve” methods where the original problem is (approximately) solved in a reduced dimension.

Sketching methods for solving ordinary least squares problems are well summarized in [Woodruff:2014:STN:2693651.2693652]. However, as noted in [sharper], solutions for sketched ridge regression problems are more relevant in practice, since regularization is often necessary. Similar to [sharper], we address this problem, but in the setting where the sketching matrix is block diagonal. We provide conditions on the data matrix under which such structured matrices can have the same sample complexity as [sharper].

Our work is closely related to that of [RIPBD], which studies the restricted isometry property (RIP) of block diagonal matrices. Those results can be used to directly obtain subspace embedding guarantees for block diagonal matrices. However, this approach requires a sample complexity dependent on the rank of the data matrix and not its approximate rank. For large matrices with fast spectral decay, this dependency can lead to sub-optimal sample complexity. Another difference is that we consider block diagonal matrices that have different sized blocks, while [RIPBD] assumes that all the blocks are of the same size. One of the main conclusions of our paper is that choosing the block sizes in a data dependent fashion leads to improved (optimal) sample complexity.

A statistical analysis of sketched ridge regression in a distributed setting is provided in [wang2017sketched]. This work considers the ridge regression problem in the multivariate setting (where the response and the solution are matrices) and analyzes model averaging in the case of distributed computation of the sketched ridge regression solution. In that setting, various processors each solve the problem with a part of the data and the estimators are then communicated to a central agent. In contrast, we consider a scenario where the estimate is computed by the central agent with only sketched data sent from the various nodes.

Another work that is similar in spirit to ours and addresses sketched regression in a distributed setting is [mcwilliams2014loco]. The setting considered in this work lies somewhere between that of [wang2017sketched] and ours. It considers multiple processors solving the ridge regression problem with different parts of the data similar to [wang2017sketched], but also assumes that the data used by each processor is available to all other processors in a sketched form. In contrast, in our work, the sketched data from all the nodes is available to only a central computing agent.

A complementary line of work focuses on the same problem but in the high dimensional regime where the number of features exceeds the number of samples. In [Chen:2015:FRA:3020847.3020869], a sketching based algorithm is proposed that achieves a relative error guarantee for the solution vector. This result is further improved in [pmlr-v80-chowdhury18a]. Sketching has also been applied in the context of kernel ridge regression, where the data points are mapped to a higher dimensional feature space before solving the regression problem. Sketching is used to reduce the number of such high dimensional features in [krr_drineas] and [doi:10.1137/16M1105396]. Sampling and rescaling of features is considered in [krr_drineas]. Random feature maps are also used to construct pre-conditioners in [doi:10.1137/16M1105396] to solve kernel ridge regression, where it is shown that a number of random feature maps proportional to the effective rank of the kernel matrix suffices to obtain a high quality pre-conditioner. While our work targets a different setting (where N ≫ d) and requires a different set of analytical tools, it is noteworthy that our guarantees involve a similar dependence on the stable rank of the underlying data matrix.

## 2 Main results

Our main contribution is a theoretical analysis of the block model described in (5). A naïve strategy to analyze block diagonal matrices is to treat each block separately and use a number of random projections proportional to its effective rank. But this would not take advantage of the low dimensional structure of the full matrix, resulting in a highly suboptimal sample complexity. Instead, we show that under mild assumptions on the data matrix, the total sample complexity M̃ = ∑j Mj can match the existing bounds mentioned above.

### 2.1 Stable rank, statistical dimension and incoherence

Before we can state our main results, we need to define a few quantities that characterize the complexity of matrix multiplication and ridge regression problems.

Stable rank of a matrix: The stable rank of a matrix W is defined as sr(W) = ∥W∥F²/∥W∥². Note that sr(W) ≤ rank(W). For matrices with a flat spectrum, the stable rank equals the rank of the matrix. However, if the singular values decay, then the stable rank captures the effective low dimensionality of the matrix, even when it is technically full rank.
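The definition above is directly computable from the singular values. A minimal numpy sketch (the two toy diagonal matrices are our own illustrative examples):

```python
import numpy as np

def stable_rank(W):
    """sr(W) = ||W||_F^2 / ||W||^2; always at most rank(W)."""
    s = np.linalg.svd(W, compute_uv=False)   # singular values, descending
    return float((s ** 2).sum() / s[0] ** 2)

# Flat spectrum: stable rank equals the rank.
flat = np.diag([1.0, 1.0, 1.0, 1.0])
# Decaying spectrum: full rank, but effectively one dimensional.
decay = np.diag([1.0, 0.1, 0.01, 0.001])
```

Here `stable_rank(flat)` is exactly 4, while `stable_rank(decay)` is just above 1 even though `decay` has rank 4, illustrating how the stable rank captures effective low dimensionality.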

Statistical dimension of the ridge regression problem: The ridge regression problem defined in (2) can be reformulated as

 minx∈Rd ∥[A; √λ Id]x − [b; 0]∥² ⇔ minx∈Rd ∥˜Ax − ˜b∥².

The scalar multiple of the identity on the bottom of ˜A means it will technically be rank d. But in some sense, a more nuanced notion of rank would count dimensions in the column space of ˜A that have singular values greater than √λ differently than those with singular values less than √λ. One way to bring this distinction out is through the statistical dimension

 sdλ=∑iσ2iσ2i+λ.

In the sum above, if σi² ≫ λ, then the contribution for that term is approximately one, while if σi² ≪ λ, it is essentially zero. This allows us to interpret sdλ as a kind of “effective rank”. Note that sdλ ≤ rank(A) and can be much lower than it. While making λ very large can of course make sdλ very small, this also introduces a larger bias in the estimates provided by (2) and (3), driving both of their solutions to zero. Choosing the λ that balances this bias-variance trade-off is equally important in sketched and non-sketched ridge regression.
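The statistical dimension is again a simple function of the singular values. A short sketch (the toy spectrum with two strong and two weak directions is our own example):

```python
import numpy as np

def statistical_dimension(A, lam):
    """sd_lambda = sum_i sigma_i^2 / (sigma_i^2 + lambda).

    Directions with sigma_i^2 >> lambda contribute ~1 to the sum,
    while those with sigma_i^2 << lambda contribute ~0.
    """
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.sum(s ** 2 / (s ** 2 + lam)))

# Hypothetical toy spectrum: two strong directions, two weak ones.
A = np.diag([10.0, 10.0, 0.1, 0.1])
```

With λ = 1, the statistical dimension of this matrix is very close to 2: the two strong directions each count as roughly one effective dimension, the two weak ones as roughly zero. Taking λ → 0 recovers the rank (4 here), while a very large λ drives sdλ toward 0, matching the bias discussion above.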

Incoherence of the data matrices: In randomized sampling schemes, the sampling probability of each row depends on the corresponding leverage score, which is the squared norm of the corresponding row of an orthobasis for the data matrix. Leverage scores highlight the relative importance of each row.

Block diagonal matrices can be thought of as a generalization of sampling matrices. Instead of a single row, each block now accesses a submatrix of the data. Instead of using uniformly sized diagonal blocks Sj, we show that a relative importance term associated with each block, similar to leverage scores, dictates the number of random projections required to attain optimal sample complexity. Let U be an orthobasis for the column space of the data matrix, and let Uj denote its row blocks, partitioned conformally with the data. We will show that the corresponding relative importance parameter, which we term the coherence of Uj, is

 Γ(Uj)=min(∥∥Uj∥∥2∞N,∥∥Uj∥∥22).

Here, ∥·∥∞ denotes the element-wise infinity norm and ∥·∥ denotes the spectral norm. We can observe that

 1J≤maxjΓ(Uj)≤1. (6)

When the Γ(Uj)’s are all close to 1/J, the columns of U are incoherent, or not too aligned with the standard basis vectors. On the contrary, when they are close to 1, there are vectors in the column space of U which are close (in an inner product sense) to the standard basis vectors. We describe bases that have small coherence parameters as being incoherent. We will show that as long as the coherence is not too high, the sample complexity of block diagonal matrices can match that of generic sketching matrices.

Number of random projections: Low values of the coherence parameter (highly incoherent bases) indicate relative uniformity in the importance of the blocks. For such subspaces, it is reasonable to expect that roughly the same number of random projections can be drawn from each data block. On the other hand, when the coherence parameters have a high dynamic range, the number of random projections from each block should be proportional to the corresponding Γ(Uj). This is precisely our proposed strategy for designing the number of random projections Mj. We propose that Mj be chosen as

 Mj=M0Γ(Uj) (7)

for some constant M0 that we will determine later. Our theoretical results state that block diagonal sketching matrices can achieve optimal sample complexity when the Mj’s are designed as in (7). This is also reminiscent of sampling algorithms, where the sampling probability of each row is proportional to the corresponding leverage score.
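The coherence-proportional allocation rule (7) can be sketched in a few lines of numpy. This is our own illustration (the QR-based orthobasis, the sizes, and the ceiling rounding are assumptions for the example); for incoherent data, each block's coherence lands near the 1/J lower bound of (6), so each Mj comes out close to M0/J:

```python
import numpy as np

rng = np.random.default_rng(2)

def coherence(Uj, N):
    """Gamma(U_j) = min(N * ||U_j||_inf^2, ||U_j||^2) for a row block U_j
    of an N-row orthobasis U (||.||_inf element-wise, ||.|| spectral)."""
    return min(N * np.abs(Uj).max() ** 2, np.linalg.norm(Uj, 2) ** 2)

def sketch_sizes(A, J, M0):
    """Allocate M_j = ceil(M0 * Gamma(U_j)) per block, following (7)."""
    N = A.shape[0]
    U, _ = np.linalg.qr(A)                 # orthobasis for the column span of A
    return [int(np.ceil(M0 * coherence(Uj, N)))
            for Uj in np.array_split(U, J, axis=0)]

N, d, J = 1000, 8, 5
A = rng.standard_normal((N, d))            # incoherent data: Gamma_j ~ 1/J
Ms = sketch_sizes(A, J, M0=100)            # so each M_j should be ~ M0/J
```

For coherent data (e.g. a block that contains energy aligned with a few standard basis vectors), the corresponding Γ(Uj) grows toward 1 and that block automatically receives more random projections.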

### 2.2 Sample complexity bounds for localized sketching

#### Localized sketching for matrix multiplication

Some of the earlier works that addressed this problem required the sketch size to scale with the ranks of the matrices involved. However, matrices with high ranks can still be approximately low dimensional, as indicated by their stable rank. In [optimal] it is shown that the sample complexity of S in (1) (under certain distributions) depends only on the stable ranks of the matrices. They describe distributions that satisfy

 PS∼D(∥∥(SW)T(SY)−WTY∥∥>ϵ∥W∥∥Y∥√(1+sr(W)/k)√(1+sr(Y)/k))<δ

for any desired ϵ and δ and a suitable k. When S is a dense matrix with sub-Gaussian entries, this holds with a number of rows proportional to (k + log(1/δ))/ϵ². Then, taking k on the order of sr(W) + sr(Y), S satisfies (1). Hence, to achieve a relative error in the spectral norm, S only needs to have a number of rows proportional to the stable ranks of W and Y.
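The guarantee (1) for a dense sketch is easy to check empirically. The following is our own minimal sketch (sizes and seed are illustrative assumptions): with low-stable-rank W and Y and a sketch size M well above (sr(W) + sr(Y))/ϵ², the relative spectral-norm error of the sketched product is small.

```python
import numpy as np

rng = np.random.default_rng(3)

def sketched_product(W, Y, M, rng):
    """Estimate W^T Y from a single shared dense Gaussian sketch, as in (1)."""
    S = rng.standard_normal((M, W.shape[0])) / np.sqrt(M)
    return (S @ W).T @ (S @ Y)

N, M = 5000, 500
W = rng.standard_normal((N, 5))     # sr(W) <= 5
Y = rng.standard_normal((N, 5))     # sr(Y) <= 5

approx = sketched_product(W, Y, M, rng)
# Relative error in the sense of the right-hand side of (1).
rel_err = (np.linalg.norm(approx - W.T @ Y, 2)
           / (np.linalg.norm(W, 2) * np.linalg.norm(Y, 2)))
```

With sr(W) + sr(Y) ≤ 10 and M = 500, the bound predicts ϵ on the order of √(10/500) ≈ 0.14, and the observed `rel_err` is of that magnitude.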

Our first main result is such a guarantee for block diagonal sketching matrices. Unlike the distributions proposed in [optimal], block diagonal distributions cannot be both oblivious to the data matrices and have optimal sample complexity. A naïve way to achieve (2.2) when the sketch is block diagonal is to use the triangle inequality:

 ∥∥(SDW)T (SDY)−WTY∥∥≤∑j∥∥(SjWj)T(SjYj)−WTjYj∥∥

where Wj and Yj are the corresponding blocks as in (5). However, this requires a per-block approximate product guarantee for each j, which can lead to suboptimal sample complexities, as the stable ranks of the blocks can be as high as those of W and Y themselves. We show in our analysis that we can in fact achieve

 ˜M = ∑j Mj = Ω((sr(W) + sr(Y))/ϵ²)

for incoherent matrices. With the Mj designed as in (7), we have the following result for computing approximate matrix products:

###### Theorem 1

Fix matrices W and Y, and let SD be a block diagonal matrix as in (5) with the entries of each Sj drawn from the Gaussian distribution. Let U be an orthobasis for the span of the columns of W and Y, and let Γ(Uj) be the corresponding incoherence terms. Then the tail bound (2.2) holds when the Mj are taken as in (7) with

 M0=Ω(klog(2/δ)ϵ2). (8)

We can examine the total sample complexity ˜M = ∑j Mj. Consider a highly incoherent basis U: each entry of such a basis has magnitude on the order of 1/√N. Examples of such bases include orthobases of matrices with entries drawn from the Gaussian distribution and any subset of the Fourier basis. Since each column of U has an ℓ2-norm of 1, for such bases Γ(Uj) is close to its lower bound of 1/J. Then we have Mj ≈ M0/J and ˜M ≈ M0. We see that even though SD has a block diagonal structure, it can still have an optimal sample complexity.

#### Block diagonal sketching of ridge regression

Let us now consider the sketched ridge regression problem shown in (3). Let U1 comprise the first N rows of an orthobasis for the matrix ˜A. Then (4) holds with constant probability if S satisfies the following two conditions:

 ∥UT1STSU1 − UT1U1∥ ≤ 1/4, (9)
 ∥UT1STSr∗ − UT1r∗∥ ≤ √(ϵf(x∗))/2, (10)

where r∗ = b − Ax∗, and we recall that x∗ is the solution of (2). These conditions are well known in the randomized linear algebra community (see [sharper], Lemma 9). Both of the above conditions on S can be re-expressed as approximate matrix product guarantees by choosing the pair of matrices as (U1, U1) for (9) and (U1, r∗) for (10). We now state our main result for block diagonal sketching of ridge regression problems. Let U1 and r∗ be as defined above, and let U be an orthobasis for the span of the columns of U1 and r∗, with the Γ(Uj)’s being the corresponding incoherence terms.

###### Theorem 2

Let U be an orthobasis as above and Γ(Uj) be the corresponding incoherence terms. Let SD be a block diagonal matrix as in (5) with the entries of each Sj drawn from the Gaussian distribution. Let x∗ be the solution to (2), and ^x be the solution to (3). Then

 f(^x)≤(1+ϵ)f(x∗),

with constant probability when the Mj obey (7) with a suitably chosen M0.

As before, if A and λ are such that the basis U is incoherent, then the total sample complexity matches that of dense sketching matrices. We are hence able to establish that, though highly structured, block diagonal random matrices can in fact have optimal sample complexities.

#### Estimating the incoherence terms

An important question is how the coherence parameters Γ(Uj) can be estimated. Note that the main challenge is in computing an orthobasis for the data matrix. We develop a sketching based algorithm to empirically estimate the Γ(Uj)’s to within a constant factor of the true values. The algorithm uses fast localized random projections of the blocks and computes an estimate of the QR factorization of the data matrix at a central processing unit. Using the approximate R factor, the Uj blocks are then estimated locally. The algorithm is detailed in the appendix; its worst case time complexity is less than the sketch compute time when J is not too large. In Figure 3, we show the estimated incoherence parameters and the true parameters for a test matrix. We can see that the estimated values are within a constant factor of the true Γ(Uj)’s. An important note here is that in many applications, an estimate of the Γ(Uj)’s may be obtained using a priori domain knowledge. Yet another insight is that if distributional assumptions on the data can be made, as is common in machine learning, then the Γ(Uj)’s can be very reliably estimated a priori [RIPBD]. Any such prior information will lead to better sample complexities as compared to the naïve techniques described in the introduction.

## 3 Experiments

We demonstrate the effectiveness of block diagonal sketching matrices through experiments on both synthetic and real data. In our first experiment, we demonstrate the importance of choosing the sizes of the diagonal blocks according to our proposed method in (7). We design the singular values of the test matrix so that the spectrum decays sharply after the leading singular values, making the stable rank much smaller than the rank. For each trial, we generate the sketching matrices with Gaussian entries. In Figure 2, we plot the error averaged over the trials for different sketch sizes. In particular, we show that when the Mj are chosen as in (7), the block diagonal sketch has the same rate of error decay as a dense sketch, and has a worse rate otherwise.

In our next set of experiments, we study performance in terms of prediction accuracy on the YearPredictionMSD dataset. It contains 89 audio features for a set of songs, and the task is to predict their release year. The dataset has 463,715 training samples and 51,630 test samples. In this case, we use diagonal blocks of the same size. Across independent realizations of the block diagonal and dense sketching matrices, we compute the empirical probability of attaining a (1+ϵ) relative error for various sketch sizes. We show phase transition plots in Figure 4, which demonstrate that block diagonal matrices are as effective as dense matrices in terms of accuracy, for the same sample complexity.

We also seek to highlight the computational advantages provided by block matrices. To this end, we compare the sketch compute times for block diagonal matrices with those of SRHT sketching matrices, for matrices of increasing size divided into a corresponding number of blocks. To ensure a fair comparison, we replace the SRHT matrix with a randomly subsampled Fast Fourier Transform (FFT) matrix, since both have the same theoretical sketch compute time, but the FFT matrix has very efficient software implementations. The sketch compute times are shown in Table 1. Our choice of J renders each block small enough for very efficient computations, resulting in block diagonal matrices being much faster than the FFT matrix.

## 4 Proof Sketch

In this section, we provide a sketch of the proof of Theorems 1 and 2. Full proofs are provided in the appendix. We first prove Theorem 1; the proof of Theorem 2 follows by choosing the pair of matrices appropriately, as explained in Section 2.2. The fundamental property of a distribution D of matrices that enables any S ∼ D to satisfy (2.2) is the subspace embedding moment property, defined in [sharper]:

 ES∼D∥∥(SU)T(SU)−I∥∥l≤ϵlδ, (11)

for some l, where ϵ and δ are tolerance parameters that determine the sample complexity and U is any orthobasis for the span of the columns of W and Y. Thus, our main goal is to prove that the subspace embedding moment property holds for block diagonal sketching matrices.

Our methods differ from the common ϵ-net argument, since using the union bound for block diagonal matrices results in a suboptimal sample complexity. The main tools we use are the estimates for the suprema of chaos processes found in [krahmerSOCP] and an entropy estimate from the study of restricted isometry properties of block diagonal matrices computed in [RIPBD]. We first establish tail bounds on the spectral norm of the matrix

 Δ=(SDU)T(SDU)−I, (12)

where U is an orthobasis for the subspace of interest, and then bound its moments to establish the subspace embedding moment property.

### 4.1 Tail bound on the spectral norm of the matrix Δ

We first express ∥Δ∥ as

 ∥Δ∥ =supz∈Rd∥z∥=1∣∣zT(SDU)T(SDU)z−1∣∣ (13)

For the matrices Sj, let sj denote their vectorized versions, obtained by stacking the columns one below the other. Let v be the vector containing all of the sj’s. Note that v is a vector with i.i.d. Gaussian entries. We can then express (13) as

 ∥Δ∥ = supPz∈P ∣∥Pzv∥² − E∥Pzv∥²∣

where is defined as

 P = {Pz = diag(P1(z), P2(z), …, PJ(z)) : z ∈ Rd, ∥z∥ = 1},
 Pj(z) = (1/√Mj) · diag((Ujz)T, …, (Ujz)T) (Mj copies of (Ujz)T),

where Uj are the row blocks of U as before. Observe that ∥Δ∥ is then the supremum of the deviation of a Gaussian quadratic form from its expectation, taken over the set P. This matches the framework developed in [krahmerSOCP] to bound such suprema. We use their result (Theorem 3.1, [krahmerSOCP]) to obtain tail bounds on ∥Δ∥, stated in Lemma 1.

###### Lemma 1

For any orthonormal matrix U and a block diagonal matrix SD as in Theorem 1, there exists a constant c such that

 P(∥Δ∥≤c√dlog(2/δ)M0)≥1−δ. (14)

For a desired tolerance ϵ, if M0 = Ω(d log(2/δ)/ϵ²), then ∥Δ∥ ≤ ϵ with high probability. This is similar to a subspace embedding guarantee. We now show that this tail bound naturally induces a bound on the moments of ∥Δ∥, from which the main theorems in Section 2 can be proved.

### 4.2 Moment bound on ∥Δ∥

Tail bounds for certain random variables can be translated into bounds on their moments using the following result:

###### Lemma 2 (7.13, [foucart2013mathematical])

Suppose that a random variable q satisfies P(|q| ≥ e^{1/γ}αu) ≤ βe^{−u^γ/γ} for some α, β, γ > 0 and for all u > 0. Then, for p > 0, E|q|^p ≤ βα^p(eγ)^{p/γ}Γ(p/γ + 1), where Γ(·) is the Gamma function.

By choosing α, β and γ to match the tail bound in Lemma 1, we obtain

###### Lemma 3

For any orthonormal matrix U and a block diagonal matrix SD as in Theorem 1, if M0 = Ω(d log(2/δ)/ϵ²), then for suitable p,

 E∥Δ∥p≤ϵpδ (15)

Approximate matrix product guarantee. Let W and Y be as in (2.2). As explained in [optimal], we can assume that they have orthogonal columns. For a given k as in (2.2), let W and Y be partitioned into groups of k columns, with Wl and Yl′ denoting the groups. The approach in [optimal] then uses the following result in their argument, which follows from (15):

 E∥∥(SWl)T(SYl′)−WTlYl′∥∥p≤ϵp∥Wl∥p∥Yl′∥pδ (16)

for all pairs (l, l′). In their setting, this holds because the sketching distribution is oblivious to the data matrices. Although block diagonal matrices are not oblivious, this result still holds for the Mj chosen in (7). This is because if U is an orthobasis for the span of W and Y, and Ull′ is an orthobasis for the span of Wl and Yl′, then the coherence of each block of Ull′ is dominated by that of the corresponding block of U, for all pairs (l, l′). Hence, a given block diagonal sketching matrix can satisfy (16). The rest of the proof remains the same as in [optimal]. This concludes the proof of Theorem 1.

## 5 Conclusion

In this paper, we study a particular model that can be used while applying sketching techniques to high dimensional data that are available in a distributed fashion. Our proposed block diagonal sketching model forms an intermediate model between sampling methods and random projection methods and is a useful abstraction. We show theoretically and experimentally that choosing the sketch sizes proportional to a certain coherence term of the data blocks results in an optimal sample complexity. While we do not provide formal analysis of the algorithm to estimate the coherence parameters, we show empirically that they can be estimated.

## Appendix A Proof of Theorem 1

The fundamental property of a distribution D of matrices that enables any S ∼ D to satisfy (8, main paper) is the subspace embedding moment property, defined in [sharper]:

 ES∼D∥∥(SU)T(SU)−I∥∥l≤ϵlδ, (17)

for some l, where ϵ and δ are tolerance parameters that determine the sample complexity and U is any orthobasis for the span of the columns of W and Y. Thus, our main goal is to prove that the subspace embedding moment property holds for block diagonal sketching matrices.

Our methods differ from the common ϵ-net argument, since using the union bound for block diagonal matrices results in a suboptimal sample complexity. The main tools we use are the estimates for the suprema of chaos processes found in [krahmerSOCP] and an entropy estimate from the study of restricted isometry properties of block diagonal matrices computed in [RIPBD]. We first establish tail bounds on the spectral norm of the matrix

 Δ=(SDU)T(SDU)−I, (18)

where U is an orthobasis for the subspace of interest, and then bound its moments to establish the subspace embedding moment property.

### a.1 Suprema of chaos processes

We briefly state here the main result from [krahmerSOCP] that provides a uniform bound on the deviation of a Gaussian quadratic form from its expectation. Obtaining a tail bound on the spectral norm of Δ is just a particular application of this general framework.

For a given set of matrices P, we define the spectral radius d2(P), the Frobenius norm radius dF(P), and the Talagrand γ2 functional as

 d2(P) = supP∈P ∥P∥,  dF(P) = supP∈P ∥P∥F,  γ2(P, ∥⋅∥) = ∫0^{d2(P)} √(log N(P, ∥⋅∥, u)) du,

where N(P, ∥⋅∥, u) denotes the covering number of the set P with respect to balls of radius u in the spectral norm. The main result of [krahmerSOCP] is then the following theorem.

###### Theorem 3

[Theorem 3.1, [krahmerSOCP]] Let P be a set of matrices and let ϕ be a vector of i.i.d. standard normal entries. Then, for t > 0,

 P(supP∈P|∥Pϕ∥2−E∥Pϕ∥2|>c1E+t)≤2e−c2min{t2V2,tU} (19)

where

 E =γ2(P)[γ2(P)+dF(P)]+d2(P)dF(P), V =d2(P)[γ2(P)+dF(P)], U =d22(P).

A similar approach of using the results from [krahmerSOCP] to analyze block diagonal random matrices was first used in [RIPBD] in the context of compressed sensing. However, we target a different set of problems that result in different theoretical considerations and proof techniques.

### a.2 Tail bound on the spectral norm of the matrix Δ

We first express ∥Δ∥ as

 ∥Δ∥ = supz∈Rd, ∥z∥=1 ∣zT(SDU)T(SDU)z − 1∣ (20)

For the matrices Sj, let sj denote their vectorized versions, obtained by stacking the columns one below the other. Let v be the vector containing all of the sj’s. Note that v is a vector with i.i.d. Gaussian entries. We can then express (20) as

 ∥Δ∥ = supPz∈P ∣∥Pzv∥² − E∥Pzv∥²∣

where P is defined as

 P = {Pz = diag(P1(z), P2(z), …, PJ(z)) : z ∈ Rd, ∥z∥ = 1},
 Pj(z) = (1/√Mj) · diag((Ujz)T, …, (Ujz)T) (Mj copies of (Ujz)T),

where Uj are the row blocks of U as before. Observe that ∥Δ∥ is then the supremum of the deviation of a Gaussian quadratic form from its expectation, taken over the set P.

We can then compute the corresponding quantities d2(P), dF(P) and γ2(P) as follows.

The spectral radius is bounded as

 supPz∈P ∥Pz∥ = maxj sup∥z∥=1 ∥Ujz∥/√Mj
 ≤ maxj min(√N ∥Uj∥∞ ∥z∥1, ∥Uj∥ ∥z∥2)/√Mj
 ≤ ∥z∥1/√M0 ≤ √d/√M0,

where the last line follows from the definition of Mj in (7).

The radius in the Frobenius norm is defined as

$$\sup_{P_z \in \mathcal{P}} \|P_z\|_F^2 = \sum_j \|U_j z\|_2^2 = \|Uz\|_2^2 = 1.$$

The upper bound for $\gamma_2(\mathcal{P}, \|\cdot\|)$ can be obtained from Equation (34) in [RIPBD] (Eftekhari et al., 2015). In their derivation, they consider a full orthobasis and the set of sparse vectors; the bound also holds for a fixed $d$-dimensional subspace. Hence,

$$\gamma_2(\mathcal{P}, \|\cdot\|) \lesssim \sqrt{\frac{d}{M_0}}\, \log d \log \tilde{M}. \tag{22}$$

Plugging these quantities into Theorem 3, we obtain the following lemma.

###### Lemma 4

For any orthonormal matrix $U$ and block diagonal matrix $SD$ as in Theorem 1, there exists a constant $c$ such that

$$\mathbb{P}\left( \|\Delta\| \le c \sqrt{\frac{d \log(2/\delta)}{M_0}} \right) \ge 1 - \delta. \tag{23}$$

For a desired tolerance $\epsilon$, if $M_0 \gtrsim d \log(2/\delta)/\epsilon^2$, then $\|\Delta\| \le \epsilon$ with probability at least $1 - \delta$. This is similar to a subspace embedding guarantee. We now show that this tail bound naturally induces a bound on the moments of $\|\Delta\|$, from which the main theorems in Section 2 can be proved.
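As a numerical illustration of this behaviour (not part of the proof), the snippet below estimates $\|\Delta\|$ for a block diagonal Gaussian sketch of a random orthobasis and shows that the deviation shrinks as the per-block sketch sizes $M_j$ grow. The block sizes and the $1/\sqrt{M_j}$ scaling are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Empirical check of the subspace-embedding behaviour in Lemma 4:
# for a block diagonal Gaussian sketch, ||Delta|| = ||(SU)^T (SU) - I||
# decays as the per-block sketch sizes grow.
d, J, n_j = 8, 5, 50
U, _ = np.linalg.qr(rng.normal(size=(J * n_j, d)))
Ublocks = np.split(U, J)

def delta_norm(M_j):
    # Apply an independent Gaussian block to each row block of U.
    SU = np.vstack([
        rng.normal(size=(M_j, n_j)) @ Uj / np.sqrt(M_j)
        for Uj in Ublocks
    ])
    return np.linalg.norm(SU.T @ SU - np.eye(d), 2)

small = np.mean([delta_norm(20) for _ in range(20)])
large = np.mean([delta_norm(320) for _ in range(20)])
print(small, large)   # the deviation shrinks as M_j grows
```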

### A.3 Moment bound on ∥Δ∥

Tail bounds for certain random variables can be translated into bounds on their moments using the following result:

###### Lemma 5 (Proposition 7.13, [foucart2013mathematical])

Suppose that a random variable $q$ satisfies, for some $\alpha, \beta, \gamma > 0$,

$$\mathbb{P}\left( |q| \ge e^{1/\gamma} \alpha u \right) \le \beta e^{-u^\gamma / \gamma}$$

for all $u > 0$. Then, for $p > 0$,

$$\mathbb{E}|q|^p \le \beta\, \alpha^p\, (e\gamma)^{p/\gamma}\, \Gamma\!\left(\frac{p}{\gamma} + 1\right)$$

where $\Gamma(\cdot)$ is the Gamma function.

To adapt this result to bound the moments of the spectral norm of the random matrix $\Delta$, we choose $q = \|\Delta\|$, with $\alpha$, $\beta$, and $\gamma$ determined by the tail bound (23). We can then obtain the following result.
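To see Lemma 5 in action on a fully explicit example of our own choosing (separate from the application to $\|\Delta\|$), take $q = |g|$ for a standard normal $g$: the Gaussian tail fits the hypothesis with $\gamma = 2$, $\beta = 2$, $\alpha = e^{-1/2}$, and the resulting moment bound can be checked by Monte Carlo.

```python
import math
import numpy as np

rng = np.random.default_rng(3)

# A concrete instance of Lemma 5 (Foucart & Rauhut, Prop. 7.13).
# For q = |g| with g standard normal, P(|q| >= t) <= 2 exp(-t^2 / 2),
# which matches the hypothesis with gamma = 2, beta = 2, alpha = e^{-1/2}
# (so that e^{1/gamma} * alpha * u = u).
gamma, beta, alpha = 2.0, 2.0, math.exp(-0.5)

samples = np.abs(rng.normal(size=200_000))
for p in [1, 2, 4, 6]:
    empirical = np.mean(samples ** p)              # Monte Carlo E|q|^p
    bound = beta * alpha**p * (math.e * gamma)**(p / gamma) \
        * math.gamma(p / gamma + 1)
    print(p, empirical, bound)                     # empirical <= bound
```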

###### Lemma 6

For any orthonormal matrix $U$ and block diagonal matrix $SD$ as in Theorem 1 with $M_0 \gtrsim d \log(2/\delta)/\epsilon^2$, we have

$$\mathbb{E}\|\Delta\|^p \le \epsilon^p \delta \tag{24}$$

for $p \lesssim \log(1/\delta)$.

### A.4 Approximate matrix product guarantee

With the moment bound established above, we can now use the framework of [optimal] to establish (8, main paper). We cannot use their proof directly, since the sample complexity in the moment bound (24) is not oblivious to the matrix $U$. However, once we fix the data matrix, we can adapt the argument used in [optimal] to show that (8, main paper) holds.

Let $W$ and $Y$ be as in (8, main paper). As explained in [optimal], we can assume that they have orthogonal columns. Let $W$ and $Y$ be partitioned into groups of columns, with $W_l$ and $Y_{l'}$ denoting the groups. [optimal] then use the following result in their argument, which follows from (24):

$$\mathbb{E}\left\| (SW_l)^T (SY_{l'}) - W_l^T Y_{l'} \right\|^p \le \epsilon^p \|W_l\|^p \|Y_{l'}\|^p \delta \tag{25}$$

for all pairs $(l, l')$. This holds since, in their setting, the sketching matrices are oblivious to the data matrices.

Although block diagonal matrices are not oblivious, this result still holds for the sketching matrix $S$ in Theorem 1. This follows from the observation that if $U^{l,l'}$ is an orthobasis for the span of $W_l$ and $Y_{l'}$, and $U$ is an orthobasis for the span of $W$ and $Y$, then their row blocks satisfy

$$\Gamma\left( U^{l,l'}_j \right) \le \Gamma(U_j) \tag{26}$$

for all pairs $(l, l')$. Hence, a given block diagonal sketching matrix can satisfy (25) as well. The rest of the proof remains the same as in [optimal]. This concludes the proof of Theorem 1. Extending this argument to prove Theorem 2 is straightforward, since ridge regression is a particular case of their framework.
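The resulting approximate matrix product guarantee can be illustrated numerically. The snippet below (illustrative dimensions, a Gaussian block diagonal sketch with the $1/\sqrt{M_j}$ scaling) measures the sketched product error against the scale $\|W\| \|Y\|$ appearing in (25); the same sketch must be applied to both matrices.

```python
import numpy as np

rng = np.random.default_rng(4)

# Empirical version of the approximate matrix product guarantee:
# with a block diagonal Gaussian sketch S applied to both W and Y,
# ||(SW)^T (SY) - W^T Y|| stays well below ||W|| ||Y||.
N, J, dW, dY = 300, 6, 5, 4
W = rng.normal(size=(N, dW))
Y = rng.normal(size=(N, dY))

M_j = 40                                   # sketch rows per block
blocks = np.split(np.arange(N), J)
# One shared block diagonal sketch, blocks scaled by 1/sqrt(M_j).
Sblocks = [rng.normal(size=(M_j, len(b))) / np.sqrt(M_j) for b in blocks]
SW = np.vstack([S @ W[b] for S, b in zip(Sblocks, blocks)])
SY = np.vstack([S @ Y[b] for S, b in zip(Sblocks, blocks)])

err = np.linalg.norm(SW.T @ SY - W.T @ Y, 2)
scale = np.linalg.norm(W, 2) * np.linalg.norm(Y, 2)
print(err / scale)   # small relative error
```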

## Appendix B Algorithm for estimation of the incoherence parameters Γ(Uj)

Our algorithm for estimating the block incoherence parameters is inspired by the algorithms for leverage score estimation in the row sampling literature [drineas2012fast, Woodruff:2014:STN:2693651.2693652] and from randomized SVD algorithms [halko2011finding].

The main idea is the following: suppose we had access to the QR factorization of the data matrix $A$:

 A=QR. (27)

Then, an orthobasis can be obtained by computing $U = AR^{-1}$. However, computing the QR factorization is as expensive as the matrix multiplication or ridge regression problems themselves. We use a similar approach, but only aim to capture the row space of $A$, in a distributed fashion: we take random projections in an iterative fashion, until the row space of the sketch "converges", and then estimate the QR factorization from the resulting sketch. Our algorithm is described in Algorithm LABEL:algo_gamma. Note that we only aim to compute a constant-factor approximation of the QR factors. Hence, each step is inexpensive: the QR factorization in each iteration can be updated efficiently from its previous estimate, and the final estimate of $R$ and the $\Gamma(U_j)$'s are computed from small sketched matrices.
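Algorithm LABEL:algo_gamma is not reproduced here, but a rough Python instantiation of the idea might look as follows. The Gaussian projections, the simple convergence test, and the use of $\|U_j\|^2$ as the per-block incoherence proxy are all illustrative choices, not the algorithm's exact specification.

```python
import numpy as np

rng = np.random.default_rng(5)

# Sketch of the incoherence-estimation idea: iteratively grow a row-space
# sketch G = Omega A until its row space stops changing, take a QR
# factorization of A restricted to that estimated row space, and read off
# per-block incoherence estimates from the resulting orthobasis.
def estimate_block_incoherence(A, block_ids, tol=1e-3, step=5):
    N, d = A.shape
    G = np.empty((0, d))
    prev = np.zeros((d, 0))
    while True:
        G = np.vstack([G, rng.normal(size=(step, N)) @ A])  # new projections
        V = np.linalg.qr(G.T)[0]          # orthobasis of the sketch row space
        # Stop once the captured row space stops growing.
        if V.shape[1] == prev.shape[1] and \
           np.linalg.norm(V - prev @ (prev.T @ V)) < tol:
            break
        prev = V
    R = np.linalg.qr(A @ V)[1]            # constant-factor QR estimate
    U = A @ V @ np.linalg.inv(R)          # approximate orthobasis, U = A V R^{-1}
    return [np.linalg.norm(U[b], 2) ** 2 for b in block_ids]

A = rng.normal(size=(60, 4)) @ rng.normal(size=(4, 4))  # rank-4 data matrix
blocks = np.split(np.arange(60), 3)
gammas = estimate_block_incoherence(A, blocks)
print(gammas)   # one incoherence estimate per row block
```

Since $U$ has orthonormal columns, each block estimate lies in $(0, 1]$, and the estimates can be computed independently per block once $R$ is shared.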