# pylspack: Parallel algorithms and data structures for sketching, column subset selection, regression and leverage scores

We present parallel algorithms and data structures for three fundamental operations in Numerical Linear Algebra: (i) Gaussian and CountSketch random projections and their combination, (ii) computation of the Gram matrix and (iii) computation of the squared row norms of the product of two matrices, with a special focus on “tall-and-skinny” matrices, which arise in many applications. We provide a detailed analysis of the ubiquitous CountSketch transform and its combination with Gaussian random projections, accounting for memory requirements, computational complexity and workload balancing. We also demonstrate how these results can be applied to column subset selection, least squares regression and leverage scores computation. These tools have been implemented in pylspack, a publicly available Python package,[https://github.com/IBM/pylspackhttps://github.com/IBM/pylspack] whose core is written in C++ and parallelized with OpenMP, and which is compatible with standard matrix data structures of SciPy and NumPy. Extensive numerical experiments indicate that the proposed algorithms scale well and significantly outperform existing libraries for tall-and-skinny matrices.

## Authors

• 3 publications
• 2 publications
05/09/2022

### Robust Parameter Identifiability Analysis via Column Subset Selection

We advocate a numerically reliable and accurate approach for practical p...
11/09/2020

### Quantum-Inspired Algorithms from Randomized Numerical Linear Algebra

We create classical (non-quantum) dynamic data structures supporting que...
11/20/2018

### Parallel Matrix Condensation for Calculating Log-Determinant of Large Matrix

Calculating the log-determinant of a matrix is useful for statistical co...
03/15/2019

### Subset Selection for Matrices with Fixed Blocks

Subset selection for matrices is the task of extracting a column sub-mat...
05/23/2021

### Estimating leverage scores via rank revealing methods and randomization

We study algorithms for estimating the statistical leverage scores of re...
06/22/2020

### dm_control: Software and Tasks for Continuous Control

The dm_control software package is a collection of Python libraries and ...
10/04/2021

### A New Acceleration Paradigm for Discrete CosineTransform and Other Fourier-Related Transforms

Discrete cosine transform (DCT) and other Fourier-related transforms hav...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

In the past two decades, randomized dimensionality reduction, also known as “sketching”, has become a fundamental operation in Randomized Numerical Linear Algebra (RandNLA). The underlying concept is to use random linear maps to reduce the problem size and ultimately derive, with high probability, fast exact or approximate solutions. This is typically achieved by random sampling or random projections. Such approaches have been studied in the context of applications in several areas, including Graph Algorithms

[79, 9][41, 27], Optimization [71, 33] and Quantum Mechanics [52].

A variety of random projections have been proposed in the literature, with different structures, sparsity and statistical guarantees; cf. [1, 45, 2, 3, 81, 68, 67, 25, 48, 49, 63, 26, 34, 20] to name a few. In applications, such transforms often satisfy the Johnson-Lindenstrauss lemma [47] or provide Oblivious Subspace Embeddings [75]. For convenience, we recall these concepts.

###### Definition 1.

is a Johnson-Lindenstrauss transform (JLT) with parameters , or -JLT for short, if with probability at least , for any -element set it holds that for any .

###### Definition 2.

A random matrix is an -Oblivious Subspace Embedding (-OSE) with parameters , if with probability at least , for any fixed matrix , , it holds that for all range with probability at least .

From these definitions, it is clear that specialized matrix-multiplication related operations are a key component for sketching to be useful in practice. As noted in [61], such operations must take into account special structures and face the challenges of sparse computations, including efficient data structures for representing the underlying matrices as well as specialized algorithms. Such aspects were mentioned, but not elaborated, in our recent companion paper [78]. We here aim to contribute in this direction, by providing a detailed analysis for a set of specific operations and sketching transforms which are critical for various important problems in RandNLA and beyond, for example leverage scores computation [36, 25, 67, 4, 58, 29, 28, 88, 78], least squares regression [75, 67, 85, 72, 35, 8, 64, 70], column subset selection [23, 17, 10, 24, 77, 78], and rank computation [76, 23, 22, 15, 62]. Specifically, given a tall-and-skinny matrix with , we present parallel algorithms, data structures and a numerical library for the following set of operations, all of which can be seen as special cases of sparse matrix multiplication:

 (i) SA,(ii) GA,(iii) GSA,(iv) A⊤A,(v) diag(ABB⊤A⊤),

where is a CountSketch matrix [21, 25],

is a matrix with i.i.d elements from the standard normal distribution, and

is an arbitrary dense matrix.

### 1.1 Contributions and structure of this paper

Our main contributions are the following.

1. A novel analysis of parallel algorithms and data structures for the CountSketch transform and its combination with Gaussian random projections.

2. A detailed analysis of parallel algorithms for the computation of the Gramian of a tall-and-skinny matrix .

3. Applications of the aforementioned methods in column subset selection, least squares regression and leverage scores computation.

4. pylspack, a multithreaded numerical library implementing the aforementioned methods which outperforms existing state-of-the-art libraries.

The outline of the paper is the following. In Section 2 we describe a novel sparse data structure for the CountSketch transform. This data structure has low memory requirements, and can be efficiently constructed and multiplied with a sparse or a dense matrix in parallel. We also show that it can be efficiently combined with Gaussian random projections to obtain smaller sketching sizes, which is a common technique in RandNLA. To achieve low memory requirements, the random projection matrices are computed on the fly and discarded once used, in a streaming fashion. Evidently, alternative approaches should be considered when the sketching matrix needs to be persistent in memory for repeated applications [12]. For the CountSketch matrix, we derive probabilistic bounds regarding load balancing on a shared memory machine. In Section 3 we provide parallel algorithms for operations and , namely the computation of the Gramian and the computation of the row norms of the product of two matrices. At the time of this writing, it appears that among the most prominent numerical libraries for matrix computations, the only method targeting the computation of the Gramian of a sparse matrix and returning the result as a dense matrix is mkl_sparse_?_syrkd in Intel222Intel is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States and other countries. MKL, which is closed source and its actual implementation is not available. In Section 4 we show how these tools can be used to yield efficient parallel algorithms for column subset selection, leverage scores computation and least squares regression. In Section 5 we describe pylspack, a publicly available Python package, whose core is written in C++ and parallelized with OpenMP, and which is compatible with standard matrix data structures of SciPy and NumPy. The package is easily installable with a single command:

 \$ pip install git+https://github.com/IBM/pylspack

Numerical experiments with a variety of matrices in Section 6 demonstrate that pylspack

can significantly outperform existing open-source and vendor libraries, especially for the tall-and-skinny case.

### 1.2 Related work

The literature on parallel algorithms and data structures for sketching is rather limited, with most of the existing works focusing on a small set of sketching transforms, namely Gaussian, sparse Rademacher and subsampled randomized Hadamard transforms [86, 8, 72, 64, 84], while other state-of-the-art transforms, like OSNAP [67] or the aforementioned CountSketch,444CountSketch is in fact an OSNAP with sparsity nonzero per column.

have been overlooked. Software for these tasks is also scarce. In Python, for example, which is arguably the most popular language for data science to date, the only existing implementation of random projections readily available in standard repositories like Anaconda

or Pypi, appears to be the random_projection module of Scikit-learn [69], which is in general well implemented, but it is limited to Gaussian transforms and Sparse Sign transforms described in [1, 59] and it is not multithreaded. The open source Skylark library [51], which has been used in several works [12, 13, 11], is a prominent example of software for sketching and related algorithms, with great potential in data science applications. It includes implementations for a very large number of sparse and dense sketching transforms and related algorithms, most of them multithreaded, however, the main focus of Skylark is distributed memory parallelism using MPI, rather than shared memory parallelism, which is the focus of this work. Two limitations of Skylark are that its Python API is implemented in Python 2, which is officially sunset, and it requires substantial effort to build. In our implementation, the pylspack Python package, we do not only focus on performance and memory efficiency, but also ease-of-use and applicability, which are desirable properties for a consumable technology.

### 1.3 Notation

Throughout the paper, unless stated otherwise, the following notation is used. We follow the Householder notation, denoting matrices with capital letters, vectors with small letters, and scalars with Greek letters. All vectors are assumed to be columns. For a matrix

, and are the -th row and -th column, respectively, both assumed to be column vectors, and is the element in row and column . is the set , where . For a set , denotes the submatrix of containing the columns defined by . denotes the best rank- approximation of in the -norm. denotes the pseudoinverse. Unless specifically noted otherwise, the 2-norm is assumed for matrices and vectors. is the probability of occurrence of an event , while is the normal distribution with mean value . As usual, denotes the

-th largest singular value of

. The matrix argument is omitted when it is evident from the context. is the number of nonzeros of , where in this case can be either a vector or a matrix. We also define , which will be used in complexity bounds. We use , , for some constant , and is used if and at the same time. We will be referring to matrices with i.i.d. elements from as Gaussian matrices. We denote by randn

standard normal random variables, by

randi random integers and by randb boolean random variables. In the compressed sparse row (CSR) format, a matrix is encoded as three 1-dimensional arrays, , and , which store the values, column indices (ordered by row index) and row pointers, respectively. In particular, the -th element of shows the starting position in and of the -th row of . Therefore, is the number of nonzero elements of the -th row. In the algorithms and their analysis we will use the notation , and to refer to the corresponding arrays , and of the matrix .

## 2 Sketching algorithms and data structures

We consider parallel algorithms for sketching. We first describe our data structures for CountSketch transforms, one of the most important random projection constructions in the RandNLA literature. We define certain target properties for storage requirements and complexities, and then propose specific data structures that satisfy them. We then describe efficient parallel algorithms for the combination of Gaussian and CountSketch transforms. Finally, we also describe a memory efficient parallel algorithm for Gaussian sketching, which is very similar to other existing implementations, but we add it for completeness since it is part of our software library. We note that, even though we focus on CountSketch, it is straightforward to adapt our methods for any OSNAP [67] (CountSketch is an OSNAP with nonzero per column).

### 2.1 CountSketch

We describe a simple sparse data structure for CountSketch, with low memory requirements, that can be efficiently constructed and applied to a given input matrix in parallel. Hereafter we assume that the given input matrix is stored in a row-major storage format, specifically, CSR for sparse matrices and row-major array for dense matrices. The output is always stored as a dense row-major array. We justify this choice in Section 2.2.3.

Recall that a CountSketch is a matrix which has exactly one non-zero element in each column at a random position chosen uniformly, and each nonzero is equal to with probability [21, 25, 63, 67]. In order to satisfy the OSE property, needs to have rows [67, 68]. An example of a CountSketch matrix is depicted in Figure 2. Note that, regardless of the number of rows of , the number of nonzeros is always equal to . An efficient data structure for should have the following properties.

###### Definition 3.

An efficient CountSketch data structure should adhere to the following specifications, assuming that there are processors available:

1. It stores no more than elements.

1. can construct the data structure in operations,777Operations can be of any type, including integer, logic operations and (pseudo) random number generation.

2. can compute the product in operations,

3. can compute the product , where is a dense matrix, in operations.

3. These algorithms require low (ideally none):

2. communication between processors.

#### 2.1.1 Vector representation

Since each column of a CountSketch matrix has exactly one nonzero element, a minimal data structure to represent it is to simply store it as a vector of size , containing signed integers, one for each column. The -th element of , say , implies that . In Figure 1 we show an example of a vector representation of the matrix of Figure 2.

The vector representation fulfills all the requirements for , but it does not lend itself for parallel algorithms. An additional partitioning by rows can remedy this effect.

#### 2.1.2 Partitioning

While the vector representation is simple and has low memory requirements, it is not convenient for computing the products and in parallel. A row-major storage, which would allow the independent computation of each row of , thus eliminating write conflicts, would be preferable. To achieve both low memory requirements as well as efficient parallel computation of and , we use partitioning, which is a standard way of performing parallel matrix computations. A straightforward partitioning strategy is to use a dense 2-d array, where each element of the array is a pointer to a sparse matrix data structure storing the corresponding submatrix of . We formalize this in the following Block CountSketch (BCS) data structure. As we will show later, this simple data structure satisfies all the required specifications. An example is illustrated in Figure 2.

###### Definition 4.

Let be a CountSketch matrix of size , which is partitioned in blocks of size each.888Without loss of generality, we assume that divides exactly and divides exactly, that is, and . A Block CountSketch data structure is a 2-d array of abstract data types , . Each represents a submatrix , where , , , .

The choice of the data structure to represent submatrices is discussed in detail in Section 2.3.

### 2.2 Block CountSketch Algorithms

Having a partitioned matrix, performing operations in parallel is rather straightforward. The matrix construction is parallelized across the columns, since the columns are independent but there are dependencies within each column. The multiplication is parallelized across rows, since is assumed to be stored in a row-major format (either sparse CSR or dense row-major). The complexities depend on the actual data structure that will be used to store the blocks of .

#### 2.2.1 Construction

Algorithm 1 describes the procedure for populating an empty Block CountSketch instance. The input arguments are integers , , , , denoting the number of rows, columns, rows-per-row-block and columns-per-column-block of the Block CountSketch matrix, respectively. The complexity is parametrized by the choice of , which is the data structure type used to store the submatrices (e.g. COO, CSR, CSC, other). Algorithm 1 performs operations, where denotes the complexity of inserting an element in an instance of type . In order to satisfy Requirement (2a), it must hold that .

#### 2.2.2 Computing the product Sa.

Algorithm 2 describes a procedure for the parallel computation of the product At this point we make the assumption that the product will be dense. This is because typically consists of rows, in order to satisfy the OSE property [67, 63], where each row is constructed as a linear combination of rows of on average. For the tall-and-skinny case, we can assume that , and therefore it is justified to treat and store as a dense matrix.

Let denote the complexity of multiplying , stored as a data structure of type , with a CSR input matrix . The (sequential) complexity of Algorithm 2 is . Since is such that each row of will be used exactly once in the final product, then, as long as the chosen data structure does not perform redundant operations, the complexity is simply . It is not straightforward how to derive upper bounds for the parallel complexity using processors, but due to the randomized nature of we it is possible to obtain probabilistic guarantees.

#### 2.2.3 Computing the product Gsa (CountGauss)

The combination of CountSketch and Gaussian matrices has been mentioned in [25, 85] and has been extensively studied in [49]

in the context of Support Vector Machines and Nonnegative Matrix Factorizations, where it was first referred to as the “CountGauss” transform. It has also been studied in the context of least squares, column selection and leverage scores computation in

[78]. Building upon the previous algorithms, we describe an implementation for computing the product where is a Gaussian matrix and is a CountSketch. Parallelization is achieved within each individual step of the algorithm, which is listed in Algorithm 3. To reduce memory requirements for temporary storage, the computation is performed in batches. This does not affect the complexity.

### 2.3 Choosing a data structure for the submatrices

It remains to choose an appropriate data structure to store the submatrices of . We opt to store submatrices in the well-known coordinate format (COO). We do not assume any ordering, therefore constant time insertion can be achieved by appending new elements at the end. Moreover, in the unordered COO format the product for some input matrix can be computed in by iterating over the nonzeros of , multiplying the -th row of by and updating the corresponding columns of the -th row of the resulting dense array. Hereafter, all references to the COO format pertain to the unordered case.

Of interest is the special case of Block CountSketch where , that is, when the partitioning takes place only column-wise. Since each block of consists of only one single row, then it can be represented as a sparse vector: it is simply a list of signed integers, each integer denoting the column index of the nonzero element, while its sign denotes whether this nonzero element is equal to or . We will refer to this special case as Block Column CountSketch (BCCS). The complexity of insertion and matrix multiplication for BCCS is the same as COO. A difference between COO and BCCS is the amount of memory required to store the entire Block CountSketch matrix. For BCCS, we only store one signed index for each nonzero element, totaling elements. In addition, pointers to the submatrices are stored. In summary, the memory requirements are as follows:

• COO: indices + pointers,

• BCCS: indices + pointers.

additional bits are required to store the signs in both representations.

### 2.4 Probabilistic bounds for processor workload

Due to the randomized nature of the CountSketch matrix, we can derive upfront probabilistic bounds on the workload distribution among processors. Let be a CountSketch matrix and . Let . Recall that the product is computed in . Assume that we have processors, and the multiplication is performed in parallel as follows: each processor computes randomly chosen rows of . Each row of , , is simply the product of . Let , be the set of indices of nonzero columns of . Clearly, the product is computed in . Since is a random matrix, we can try to get probabilistic bounds on the “workload” of each processor. Let be the random variable which counts the number of nonzeros of that “contribute” to the product . This variable can also be rewritten as , where is one if was chosen to contribute in and zero otherwise. But is simply a Bernoulli random variable with parameter since each column of has only one nonzero in a position which is chosen uniformly at random from the set . Generalizing for a fixed set of rows instead of simply one row we have the following.

###### Lemma 1.

Let be a CountSketch and a fixed input matrix. Let , be a subset of rows of , chosen uniformly at random without replacement. For , let , , and be independent Bernoulli random variables with parameter , that is, with probability and with probability . Let . The following hold:

 μ =E[Y]=qnnz(A), σ2 =Var(Y)=q(1−q)nnz2(A).
###### Proof.

For the expectation we have

 μ=E[Y]=E[n∑i=1wiϵi]=n∑i=1wiE[ϵi]=qn∑i=1wi=qnnz(A).

Similarly, we can compute the variance. First we need to compute

. We have

 EY2 =E[(n∑i=1wiϵi)2]=E⎡⎣(n∑i=1w2iϵ2i)+2∑i≠kwiwkϵiϵk⎤⎦=qn∑i=1w2i+2q2∑i≠kwiwk.

We can then compute

 Var(Y)=E[Y2]−(E[Y])2=q(1−q)n∑i=1w2i=q(1−q)nnz2(A).

We can then use Chebyshev’s inequality to get a first tail bound.

###### Corollary 1.

Let , and be as in Lemma 1. It holds that

 Pr[|Y−EY|≥√nq(1−q)nnz2(A)]≤1n.
###### Proof.

The result follows by simply applying Chebyshev’s inequality on the random variable and using Lemma 1. This gives

 Pr[|Y−μ|≥t]≤σ2t2.

The result follows by substituting . ∎

As it is known, the inequalities of Chebyshev and Markov are generally tight. We can say more by noting that the random variables are bounded in magnitude. Therefore, we can use Hoeffding’s inequality to get a tighter tail bound. For proofs of Theorem 1 see the corresponding references.

###### Theorem 1 (Hoeffding’s Inequality [16, 82]).

Let be independent, bounded and centered random variables, that is and . Then,

 Pr[∣∣ ∣∣n∑i=1Zi∣∣ ∣∣>t]≤2exp(−t22∑ni=1α2i).

To apply Hoeffding’s inequality, it suffices to derive the appropriate centered random variables. To that end, let . Clearly, if we assume that .

###### Lemma 2.

Let , and be the same as in Lemma 1 and . The following hold

 Pr[∣∣ ∣∣n∑i=1Zi∣∣ ∣∣>(1−q)√2nnz2(A)logn.] ≤2n, Pr[∣∣ ∣∣n∑i=1Zi∣∣ ∣∣>μ√2(1−qq)2logn.] ≤2n.
###### Proof.

Since are independent, centered, and bounded, we can directly use Hoeffding’s inequality. This gives

 Pr[∣∣ ∣∣n∑i=1Zi∣∣ ∣∣>t]≤2exp(−t22(1−q)2nnz2(A)).

Setting gives the first inequality. For the second inequality, we set , where and . This gives

 Pr[∣∣ ∣∣n∑i=1Zi∣∣ ∣∣>δμ]≤2exp(−δ2μ22(1−q)2nnz2(A))=2exp(−δ2q2nnz2(A)2(1−q)2nnz2(A)).

Note that , since these two quantities are simply the norms and where is a -dimensional vector containing all . Moreover, for . Therefore the failure probability can be further bounded as

 2exp(−δ2q2nnz2(A)2(1−q)2nnz2(A))≤2exp(−δ2q22(1−q)2).

Setting concludes the proof for the last inequality. ∎

This implies that, for a fixed subset of with size , large fluctuations in the corresponding workload have small probability to occur. We have finally left to analyze the behaviour over all processors. The analysis is finalized by taking a union bound over sets, with rows in each set. We summarize in the following lemma.

###### Lemma 3.

Let be the same as in Lemma 1. If we use processors to perform the multiplication , by assigning distinct rows of on each processor, then the following hold:

1. Each processor will execute operations in expectation,

2. the bounds of Lemma 2 hold for all processors at the same time with probability at least .

###### Proof.

The expectation comes directly from Lemma 1. For the tail, we simply take the union bound of success probabilities of distinct fixed sets. ∎

We conclude the analysis of CountSketch data structures with the following theorem.

###### Theorem 2.

Given a sparse tall-and-skinny matrix , the Block-COO-CountSketch and Block-Column-CountSketch data structures satisfy all conditions of Definition 3 deterministically, except Condition (2b), which is satisfied in expectation and it is well centered with high probability based on Lemma 3.

### 2.5 Gaussian sketching

Unlike other sketching transforms, Gaussian projections have been well studied not only in theory but also in practice, most notably for computing randomized least squares preconditioners [72, 64]. The description that we provide here is very similar to the aforementioned works. If is dense we can directly use BLAS routines to compute the product. If is sparse in CSR format, one can take advantage of the sparsity as described in Algorithm 4. To parallelize, each process is assigned a block of rows of , say , and computes the entire product . In this manner, there are no write conflicts and no need for synchronization during execution, while each process executes the same number of operations. However, as well noted in [64], it is crucial to use a fast algorithm for generating Gaussian random variables, otherwise the total performance might be overwhelmed by this operation.

## 3 Gram matrix and Euclidean row norms

Given a sparse matrix , the computation of the Gram matrix is a fundamental operation in various applications. This can be performed with standard general sparse matrix multiplication algorithms, based on Gustavson’s algorithm [43] or more recent proposals [14, 18]; see also [50, Chapters 13 and 14] for a review. However, most works focus on the general case where the inputs are distinct sparse matrices,999Some recent works [7, 38] are focused on multiplying by its transpose, albeit for dense matrices. and the resulting matrix is also sparse, and therefore they consider efficient access patterns and sparsity structures for . Here, instead, we consider the special case of the Gramian when is tall-and-skinny. The task, in general, is not easy, and this has been recognized in several implementations and software libraries for sparse matrix computations [73, 37]. From a theoretical perspective, the fastest known sequential algorithms are based on the concept of fast matrix multiplication. As it is known, two general rectangular dense matrices and can be multiplied in , where is the best known bound for the exponent of square matrix multiplication [5]. Analogous results exist for explicitly rectangular matrix multiplication [40] and for sparse matrix multiplication [87]. Outer-product based algorithms that are considered in this work might be asymptotically “slower”, however, they are simple to analyze and work well in practice.

A key observation is that when is sufficiently tall, then the resulting Gram matrix will most likely be dense. This assumption greatly simplifies the complexity of efficient algorithms. In fact, there is a remarkably simple algorithm to perform this operation which is optimal in the sense that it performs scalar multiplications, where is the number of required scalar multiplications to be performed by any standard inner-product or outer-product based algorithm (see also [50, Section 1.4.3]). We list this algorithm in Algorithm 5.

Notably, at the time of writing, none of the following scientific software libraries, all of which include sparse matrix components, provide function calls targeting this specific operation: Eigen [42], SPARSKIT [73], CombBLAS [19], SciPy [83], IBM® ESSL [44],101010IBM and the IBM logo are trademarks of International Business Machines Corporation, registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. Acurrent list of IBM trademarks is available on ibm.com/trademark. Ginkgo [6], CSparse [32]. It appears that the only available implementations of this operation are the Intel MKL [46] mkl_sparse_?_spmmd and mkl_sparse_?_syrkd methods, which are closed-source and the actual implementation used is not available. Algorithm 5 is simple and optimal in terms of operations. However, due to the irregular sparsity patterns of the outer products formulation, it is not straightforward to implement it in parallel.

### 3.1 Parallel Gram 1: A memory efficient algorithm

We describe a memory-efficient parallel variant of Algorithm 5, which requires no additional memory for intermediate results and no communication between processors. The Algorithm works as follows: each processor is assigned a predifined range , of rows of and computes the product . However, since the matrix is stored in CSR, and therefore is stored in CSC, the processors do not have constant time access to the rows of . Each processor needs to filter out the rows that it needs to discard. That is, for each row of , each processor must check which column indices of that row belong in the range . A sequential search runs in , summing up to . However, if column indices of each row are sorted in an ascending or descending order, which is the case for CSR, then search can be performed in a binary fashion to yield a total of and . Moreover, unless the distribution of nonzeros of among the rows and the columns is highly imbalanced, or more specifically, if each row has nonzeros and each column has nonzeros, then the time complexity of the actual multiplication operation is . This means that the time complexity of the Algorithm is , which is close to the desired . As shown in the experimental evaluation, the algorithm scales adequately even for matrices with non-zeros.

### 3.2 Parallel Gram 2: A faster algorithm

Faster algorithms can be derived if additional memory is available to store temporary results. One such algorithm can procceed by splitting the matrix in row blocks, assigning each processor to compute the Gram matrix of one row block, storing it locally in a dense matrix, and ultimately summing all the partial results. This algorithm requires additional storage for intermediate results. A similar analysis to the one of the previous section, shows that for any matrix with nonzeros per row, each processor in Algorithm 7 will compute the corresponding Gram matrix of the rows that it was assigned in steps. Thereafter, the partial results will be summed in steps, therefore, the total time complexity of the algorithm is . For the tall-and-skinny case, i.e. when , this algorithm has a much smaller time complexity than Algorithm 6. In future work, it would be of interest to also take I/O into account, in the spirit of recent research [56, 55].

### 3.3 Computing Euclidean row norms

We next describe an algorithm for computing the squared row norms of the matrix product . As we will see, this is a computational kernel in advanced algorithms for leverage scores computation, with being the matrix whose leverage scores are sought, while is an “orthogonalizer” for , i.e. it is such that forms an (exact or approximate) orthonormal basis for range. A naive approach to compute the row norms is to first compute the product and then compute the squared euclidean row norms. An alternative approach that has the same complexity but lower memory requirements is to avoid precomputing the product and instead, iterate over the rows of , computing and storing only the final result. For sparse matrices, there is an even better approach, achieving complexity, instead of ; cf. [78, Lemma 2.1].

Algorithm 8 performs this operation for a sparse CSR input matrix and achieves desired number of operations. There are no write conflicts during execution and therefore the outer loop can be fully parallelized. In our implementation, to improve load balancing, we do not restrict a specific list of rows for each process to compute. Instead, we let each process choose a new element of to compute once it has finished its previous computation.

## 4 Applications

We next demonstrate some example applications of the proposed methods. We note that the high level algorithms that we will describe are parallel variants of algorithms that have already been discussed in the literature. We thus omit details related to the proofs of correctness and statistical guarantees and refer the reader to the corresponding publications.

### 4.1 Column subset selection

In Algorithm 9 we describe an algorithm to select a subset of columns of a tall-and-skinny matrix. A practical choice is to use rows for and rows for to achieve decent speed as well as approximation guarantees. We refer to [23, 17, 10, 24, 77, 78] for more details.

### 4.2 Least squares regression

In Algorithm 10 we describe a direct least squares solver based on Algorithm 7. In this case is explicitly formed and its pseudoinverse is used to compute the solution of the least squares problem.

In Algorithm 11 we describe a direct least squares solver based on Algorithm 3. Such algorithms in the literature are often referred as “sketch-and-solve”. In this case we first sketch the matrix to a smaller size using a sketching matrix which satisfies the OSE property and then directly solve the least squares problem . In this case the solution is approximate. For further details see [75, 35, 67, 85].

In Algorithm 12 we describe an iterative least squares solver, based on building a preconditioner for using Algorithm 3 and then using a preconditioned iterative method to solve the least squares problem. Such algorithms in the literature have been referred as “sketch-and-precondition”. denotes the number of rows of the Gaussian embedding , and denotes the number of rows of the CountSketch matrix (note that is also the number of columns of ). By choosing , then only the CountSketch transform will be applied on , e.g. only will be computed, ignoring . Practical combinations for and are and ([64, 78]) or even and [31]; see [72, 35, 8, 64, 70] for more details on sketching algorithms for least squares regression.

### 4.3 Leverage scores

Using the basic algorithms of Sections 2 and 3 we can describe parallel algorithms for leverage scores. For details on accuracy and parameter selection we refer the reader to [78, §5]. Algorithm 13 computes the exact leverage scores by first computing the Gram matrix and then using its pseudoinverse as an orthogonalaier for . Algorithm 14 first selects a subset of columns using Algorithm 9. Algorithms 15 and 16 return approximate solutions and are suited for dense matrices.