 # Inv-ASKIT: A Parallel Fast Diret Solver for Kernel Matrices

We present a parallel algorithm for computing the approximate factorization of an N-by-N kernel matrix. Once this factorization has been constructed (with N ^2 N work), we can solve linear systems with this matrix with N N work. Kernel matrices represent pairwise interactions of points in metric spaces. They appear in machine learning, approximation theory, and computational physics. Kernel matrices are typically dense (matrix multiplication scales quadratically with N) and ill-conditioned (solves can require 100s of Krylov iterations). Thus, fast algorithms for matrix multiplication and factorization are critical for scalability. Recently we introduced ASKIT, a new method for approximating a kernel matrix that resembles N-body methods. Here we introduce INV-ASKIT, a factorization scheme based on ASKIT. We describe the new method, derive complexity estimates, and conduct an empirical study of its accuracy and scalability. We report results on real-world datasets including "COVTYPE" (0.5M points in 54 dimensions), "SUSY" (4.5M points in 8 dimensions) and "MNIST" (2M points in 784 dimensions) using shared and distributed memory parallelism. In our largest run we approximately factorize a dense matrix of size 32M × 32M (generated from points in 64 dimensions) on 4,096 Sandy-Bridge cores. To our knowledge these results improve the state of the art by several orders of magnitude.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## I Introduction

Given a set of points and a kernel function , a kernel matrix is the matrix111We restrict our discussion to the case in which and come from the same point set. The -indexed points are called the target points, and the -indexed points are called the source points. A kernel matrix can also involve target and source points that come from different sets. whose entries are given by for . We are interested in the factorization of .

Kernel matrices are typically dense, so matrix-vector multiplication with

(henceforth, “MatVec”) requires work. Solving a linear system with (which we will refer to as “Invert” K) is not stable because can be nearly singular. We typically invert , where is the regularization parameter and is the identity matrix. The LU factorization of requires work. An iterative Krylov method requires work but the constant can be quite high depending on .

Motivation and significance. Kernel matrices appear in N-body problems , machine learning methods for classification, regression, and density estimation [9, 30, 16], spatial statistics , Gaussian processes , and approximation theory 

, among other applications. Typical operations required with kernel matrices are matrix-vector multiplication, solving linear systems, and eigenvalue computations. For example, in machine learning, solving linear systems is used in ridge regression and support vector machines. Eigenvalue computations are required in spectral clustering. Such operations can be done either with iterative solvers in which we only require matrix-vector multiplication. However, the conditioning can be bad and convergence can be quite slow.

For many (but not all) kernel matrices, dramatic acceleration of the MatVec and the Invert operations can be achieved if we exploit subtle properties of their structure. For example, consider the Gaussian kernel, a popular kernel in machine learning,

 K(xi,xj)=exp(−12∥xi−xj∥22h2), (1)

where is the bandwidth. For small , approaches the identity matrix whereas for large , approaches the rank-one constant matrix. The first regime suggests sparse approximations while the second regime suggests global (numerical) low-rank approximations. For the majority of values, however, is neither sparse nor globally low-rank. It turns out that the off-diagonal blocks of may admit good low-rank approximations. A key challenge is identifying and constructing these approximations. When the points are in low dimensions (say ), effective techniques for the MatVec and Invert operations exist. But for high-dimensional problems, there is very little work.

Contributions: Our goal is a method that exhibits algorithmic and parallel scalability for inverting for large and . We only require the ability to compute an entry of in time. Our scheme extends ASKIT222Approximate Skeletonization Kernel Independent Treecode. The ASKIT library is available at http://padas.ices.utexas.edu/libaskit., a method we introduced in a series of papers [22, 25, 23]. ASKIT approximates in time. Here we introduce Inv-ASKIT for factorizing and inverting the ASKIT-based approximation of . In particular:

• We present sequential and parallel factorization algorithms that have complexity and can be applied in time (described in §II-C, §II-D).

• We present a theoretical analysis of the complexity of the algorithm (in terms of time, communication and storage). The estimates for the parallel factorization and solution are given by (16) and (15) respectively in §II-D.

• An ASKIT variant approximates as a triple sum: a block-diagonal plus a low-rank plus a sparse matrix. In this case we use Inv-ASKIT as a preconditioner for a GMRES iterative solver.

• We apply our solver to a binary classification problem using kernel regression. We present the solver’s accuracy, training and test errors, and compare it with Krylov iterative methods (described in §IV).

• We conduct strong and weak scaling analysis in which we report communication, computation and overall efficiency (described in §IV).

Our method uses the hierarchical decomposition of ASKIT. The fundamental concept behind it is simple: First, approximate as the sum of a block-diagonal matrix and a low-rank matrix. Second, compute the inverse using the Sherman-Morrison-Woodbury (SMW) formula. Third, apply this decomposition recursively to each block of the block-diagonal matrix. To our knowledge, Inv-ASKIT

is the only algorithm that can be used with high-dimensional data for a large variety of kernels and the only algorithm that supports shared and distributed memory parallelism.

Inv-ASKIT does not assume symmetry of the kernel, global low-rank, sparsity, or any other property other than that, up to a small sparse matrix-correction, the off-diagonal blocks admit a low-rank approximation.

Related work: We do not review the large literature of accelerating the matrix-vector operation with a kernel matrix. We focus only on fast factorization schemes. Many times kernel matrices admit a good global low-rank approximation. For such matrices the Nystrom method and its variants can be very effective [31, 8, 14] for both approximating and its regularized inverse. However, in many applications does not have a global low-rank structure and Nystrom methods fail. In , we discussed this in detail and presented classification results on real datasets for which the Nystrom method fails.

The idea of exploiting the block-diagonal-plus-low-rank structure has been discussed extensively in the past fifteen years for kernel matrices in low dimensions and finite-difference/finite element matrices. Indeed for point datasets in low dimensions, there exist methods that are much faster and more scalable than ASKIT. Approximating is more challenging, but this is also a well-studied problem. Efficient schemes for approximating the inverse of kernel (but also other types of) matrices are also known as fast direct solvers, and they have roots in nested dissection (or domain decomposition), multifrontal methods and truncated Newton-Schulz iterations. Examples include [2, 1, 3, 15, 10, 5, 29, 20, 13] and this list is non-exhaustive. All these works, however, have been applied successfully only in low-dimensional problems () and do not address parallelism and performance.

To our knowledge,  is the earliest work that parallelizes hierarchical matrices with bulk synchronous parallelization on shared memory systems, but it is not efficient due to the critical path. In , -LU improves its efficiency by employing a DAG-based task scheduling. Distributed implementations include  and , but, to our understanding, they were designed for sparse matrices related to 2D and 3D stencil discretizations, not kernel matrices. Although the basic machinery is similar, but the matrices are quite different. We discuss this more in §II-A. For Gaussian kernels, the only work for fast inversion is [3, 2] in which the authors consider Gaussian and other radial-basis kernels in low dimensions (). Parallelism and scalability were not considered in that work. Their largest test had 1M points in 3D. We consider datasets that have 30 more points and over 200 higher dimensions.

Limitations: This is a novel algorithm and this paper represents our first attempt in an efficient implementation. For this reason it has several limitations. First, we do not provide a theoretical analysis of its numerical stability. (But it is relatively easy to prove it for symmetric and positive definite matrices.) Second, we do not exploit certain symmetries in ASKIT that can further reduce the factorization time. For example, can be symmetric. Third, we do not performance-tune various algorithm parameters. Fourth, we do not discuss optimizations that can be used to amortize the factorization cost over different values of . Fifth, for certain ASKIT variants our scheme can be only used as a preconditioner and the errors can be significant. Sixth, there are additional low-rank structures that we do not exploit. Finally, there exist fundamental limitations inherited from ASKIT. Not all kernel matrices have a block-diagonal-plus-low-rank structure. Examples are point datasets that have high ambient dimensionality and kernels that are highly oscillatory (e.g., the high-frequency Helmholtz kernel). Also, the sparse-matrix correction for the off-diagonals can have large norm, in which case using this version of Inv-ASKIT as a preconditioner will not work.

## Ii Methods

Notation: In Table I, we summarize the main notation used in the paper. We begin with discussing the basic concepts in factorization of hierarchical matrices §II-A. Then we present a brief summary of the ASKIT algorithm §II-B. We conclude with the sequential §II-C and parallel §II-D factorization algorithms.

### Ii-a Hierarchical Matrices and Treecodes

Informally, is hierarchical if it can be approximated by , where is a block-diagonal matrix, is low-rank, and each block of is also a hierarchical matrix. To define the , , matrices first we consider the following block partitioning to -by- blocks

 K=[K11K12K21K22]=[K1100K22]+[0K12K210]. (2)

is the first matrix in the sum and the product of is equal to the second matrix; and will be defined shortly. So in summary: (1) Blocks and admit low-rank approximations, which implies that and are low-rank. (2) The same decomposition applies recursively for the diagonal blocks and . This approximation derived by recursive partitioning is directly associated with a binary partitioning of the rows and columns of . Note that using lexicographic ordering of the rows like in (2) will not, in general, result in low-rank off-diagonal blocks. An appropriate reordering, in our case using the tree structure, is critical for exposing such low-rank structure. Given the finest granularity , we partition the points using geometric information and a binary tree so that we have less than points per leaf. (The number of points is manually selected. Typically is about 1048 or 2048 and it effects both computation time and accuracy. For a discussion, see ). We denote the levels of the tree by , with at the root and at the leaves. Let be a node in the tree. We “overload” the notation by using to also indicate the indices of the points assigned to so that all . Then the hierarchical partition and approximation of assumes that for every node , we can partition the self-interactions between its points as follows

 Kαα=[KllKlrKrlKrr]=[KllKrr]+[KlrKrl], (3)

where denotes the left child of a node, and the right child of a node. Equivalently to (2), we can write . The matrices and are selected to approximate off-diagonal blocks in (3) such that

 [0KlrKrl0]≈UαVα. (4)

The particular form of these matrices for the ASKIT method is given in (4). If is invertible, we can write immediately the approximate inverse of by using the SMW formula:

 ˜K=D+UV=D(I+D−1UV)=D(I+WV),~{}where~{}W=D−1U.˜K−1=(I+WV)−1D−1=(I−W(I+VW)−1V)D−1=(I−WZV)D−1,~{}where~{}Z=(I+VW)−1. (5)

The extension to is trivial. In Inv-ASKIT, factorizing amounts to visiting every tree node to compute and . We require traversing the descendants of to compute . If is the numerical rank of and , then computing requires linear solves with . Computing requires factorizing (using LAPACK) the matrix . The recursion stops at the leaves of the tree in which we have a single matrix , which we then again factorize using LAPACK. We give pseudocode for the precise algorithms in §II-C.

As mentioned in §I, there is extensive work on fast solvers for hierarchical matrices. Such solvers rely on the SMW formula. But the details can differ significantly, for example the permutation or reordering of the matrix, construction of and , and the exploitation of various symmetries.333More precisely our matrix can be described as a HODLR (Hierarchical Off-Diagonal Low Rank) matrix . Other types of matrices include hierarchically semi-separable matrices (HSS), and -matrices . ASKIT can also produce a slightly different decomposition (known as a fast multipole matrix or an -matrix), which decomposes into a sum of a hierarchical matrix (which has the form we discussed above) and a sparse matrix . Next, we give some details for ASKIT’s construction of the hierarchical approximation of and briefly explain the decomposition.

Details on the algorithm can be found in . Here we briefly summarize it. The input is a set of points and a list of nearest neighbor points in for each point . We build a balanced binary tree-based partitioning of the points. We use a parallel ball tree  data structure. For quadtrees and octrees can be effectively used, but in high dimensions such data structures cannot be used.

After building the tree, we construct the hierarchical representation. In ASKIT we use the notion of a point being separated from a tree node . A basic separation criterion is that . If is the interaction between and , we construct and such that

 Kiα≈Ki˜αP˜αα. (6)

The points are referred to as the “skeletons”. The skeletons are the pivot columns of the QR factorization of a subsample of the block matrix . (The nearest-neighbor information is used to effectively sample this block matrix.) does not involve any approximation, but instead of computing interactions with all points in we just compute interactions with .

is computed using the interpolative decomposition algorithm. A key property of

ASKIT is that the skeletons of a node are selected from the skeletons of its children. That is, for a node the skeletons are a subset of or . Furthermore (when we compute ) can be computed recursively by only storing . If we assume the skeleton size is fixed for all nodes, then is an matrix.

The details of constructing skeletons and the matrices can be found in . In the pseudocode below we outline the two steps required to perform a fast MatVec . The “Skeletonize” step builds the projection matrices for each node and computes , auxiliary variables that are then used in the “Evaluate” step in which for every target point , the tree is traversed top-down to compute the entry of the output vector for point .

1:  //Skeletonize:
2:  Skeletonize
3:  if  is leaf then
4:     Find
5:     Construct
6:
7:  else
8:     Skeletonize
9:     Skeletonize
10:     Construct ,
11:
12:  end if
1:  //Evaluate:
2:  MatVec
3:  if Separated then
4:
5:  else
6:     if  is leaf then
7:
8:     else
9:         MatVec
10:         MatVec
11:     end if
12:  end if

In the Skeletonize algorithm, lines 4, 5 and 10 do not depend on . Therefore, if we want to apply to a different vector , only lines 6 and 11 are needed. In the Evaluate phase, ASKIT provides two different algorithms for the Separated function. In the first algorithm, is separated from if . This definition creates a hierarchical approximation of identical to (3). Then, if and are two sibling nodes with parent , the matrices and are approximated by

 Klr≈Kl˜rP˜rr~{}and~{}Krl≈Kr˜lP˜ll. (7)

We now can define for a node

 Uα=[Kl˜r00Kr˜l]~{}and~{}Vα=[0P˜rrP˜ll0]. (8)

This completes the definition of the hierarchical matrix for Inv-ASKIT (compare (8) with (4)). To reduce memory costs however, ASKIT never constructs explicitly. Instead, looking at the Skeletonize function, ASKIT only stores for each node . In the factorization we will need the whole matrix. We explain how to compute it in §II-C.

Other then , ASKIT can also define to be separated from if no nearest-neighbor of is in . This definition introduces a more complicated structure in the matrix. We do construct (7) but for points that have nearest-neighbors in , line 4 in the Evaluate is replaced by , i.e., an exact evaluation. For certain kernels this significantly increases the accuracy of the overall scheme. A convenient way to think about this is that the second algorithm builds an approximation that is equivalent to

 K≈(D+UV)+S=˜K+S, (9)

where is the hierarchical matrix and is a sparse matrix that is zero if we use the separation criterion. If the norm of is small relative to we expect that will be a good preconditioner for . (Again there are kernels for which this is not the case.)

In this section we summarized the basic ideas in ASKIT. Its structure is very similar to what is known as a treecode in computational physics. We omitted several important details of the ASKIT algorithm. For example, we assumed that the number of skeleton points is fixed. In reality, this number differs for each tree node and it is chosen adaptively so that the approximation error is below a tolerance . Finally, the parallelization and scalability of ASKIT was described in . Another version of ASKIT that resembles the relationship between fast-multipole methods and treecodes is described in . In these papers, we also discuss the error analysis and work/storage complexity of ASKIT. Next, we will assume that we are given from ASKIT and we proceed with discussing the Inv-ASKIT factorization in detail using the ideas we outlined in §II-A.

### Ii-C Fast direct solver

Given the binary tree, the regularization parameter , and in terms of and for every tree node , we construct a factorization of that allows the fast solution of linear systems. We describe the algorithm for the case of , for notational simplicity, but the algorithms are stated for any . So given , we wish to compute .

In the factorization phase we compute , , and . We first show how (5) is computed in a blocked matrix. is computed as

 [˜Kll˜Krr](I+[Wl˜rWr˜l][P˜rrP˜ll]). (10)

where and . Then the partitioned form of is

 I−[Wl˜rWr˜l][IP˜rrWr˜lP˜llWl˜rI]−1[P˜rrP˜ll]. (11)

If is an non-leaf node, is evaluated recursively by

 P˜αα=Pα[˜l˜r][P˜llP˜rr] using {GEMM}. (12)

At the leaf nodes is computed by ASKIT (using the LAPACK routines GEQP3 and GELS ).

Factorization Phase. Algorithm II.1 performs a postorder traversal of the tree. If is a leaf node, we factorize with partial-pivoting LU factorization (LAPACK routine GETRF). Otherwise, we first compute , , , using (12). Then we solve and by calling Solve() (Algorithm II.2). (Note that when we visit a node at some level, the factorization information required for the solve is already in place.) We compute and factorize using the blocked representation in (11). Solve() is described in Algorithm II.2, which applies to a vector .

Solving Phase. If is a leaf node, directly invokes an LU solver (LAPACK routine GETRS). Otherwise, we first solve and recursively. Then we compute (11) with two matrix-matrix multiplications (BLAS routine GEMM) and an LU solver (GETRS). Note that both algorithms are implemented using bottom-up tree traversal and “for- loops”—we do not actually employ recursion.

Refactoring for different . In the case that we want to refactor for a different value of , (e.g., in cross-validation studies), needs to be refactorized, but the skeletons and we have computed remain the same. Thus, when we refactor with different , only and need to be recomputed. (The factorization at the leaf level can be also amortized if SVD is used instead of LU, but we do not discuss this here.)

Complexity of the sequential algorithm. We present the complexity analysis of Algorithm II.1 and Algorithm II.2. The main parameters are and

. They depend on the kernel, the dimensionality, and the geometric distribution of points. They control the accuracy of the scheme, but are not explicitly related to problem size

. In the actual implementation is chosen adaptively for each node. If we use its maximum value (across nodes), then the results below are an upper bound to the work complexity. Note that in some applications, the kernel does depend on the problem size, for example the bandwidth value for the Gaussian kernel in machine learning problems. In that case and will need to change accordingly. But here we assume a fixed kernel for all values of , since the way the kernel changes with increasing depends on the application and the geometry of the dataset.

denotes the complexity of the factorization Algorithm II.1 with points, and denotes the complexity of Algorithm II.2 with points. We can verify that

 Ts(N)=2Ts(N2)+O(Ns)=O(sNlogN). (13)

Using the solving phase complexity, we derive as

 Tf(N)=2Tf(N2)+s2Ts(N)=O(s2Nlog2N). (14)

To summarize, Algorithm II.1 takes work and Algorithm II.2 takes work.

Storage. Beyond the storage required for ASKIT, we need to store (overwritten by ), and the LU factorization in each level. and take in each level. Overall, the space requirement is . To save about half of the memory, can be constructed recursively when it’s needed, and we discard it when we reach the parent level. This will result in slightly larger constants in the solving phase.

Shared memory parallelism: There are several ways to exploit shared memory parallelism. We have implemented and tested two of those. The first way is level-by-level parallel tree traversal for both construction and evaluation, in which we use sequential versions of BLAS/LAPACK routines, concurrent traversal of nodes at the same level, and sequential traversal across levels. The second way is to use sequential traversals and use the shared-memory parallelism for the BLAS/LAPACK routines. It turns out, that it is more efficient to use a hybrid version of these two approaches. From level- to the leaf level, we disable parallel BLAS and LAPACK because we have experimentally observed that parallel tree traversal provides higher efficiency due to better memory locality. From level- to level- in the local tree, we switch from parallel tree traversal to parallel BLAS operations. In parallel tree traversal some cores may start to idle from tree level three, which only has 8 tree nodes. So in the top four levels of the tree we use parallel GETRF, GETRS and GEMM.

### Ii-D Parallel algorithm

In  we discussed the parallelization of ASKIT. Here we build on that work and derive our parallel factorization algorithm. We assume that we have ranks in the global MPI communicator. The data-structure used in ASKIT is summarized in Figure 1. The distributed-memory parallelism in the algorithm is invoked between tree levels-0 and -. We briefly review the structure of the tree before discussing the algorithms and the complexity.

Distributed Tree. Without loss of generality, we assume and are powers of two and that each MPI process owns points. Figure 1 is a distributed tree with . We use purple boxes to indicate tree nodes. Each treenode node is assigned to a number of MPI processes. All processes sharing a node are grouped to an MPI communicator. We use {i} to indicate the MPI rank in the local communicator. For example, at level 0 the root of the tree is distributed among all 4 processes (from green to blue in Figure 1). The green process has local MPI rank {0}, and the blue process has rank {3}. If a node contains processes, then the processes with rank {} are assigned to the left child of , and processes with rank {} are assigned to the right child of . At level , each tree node is assigned to a single MPI process, which is also assigned the entire subtree of that node from level to the leaf level (not displayed in Figure 1). Fig. 1: The ASKIT distributed binary tree with p=4. The purple boxes indicate tree nodes. Depending on the level a tree node is logically distributed across several MPI processes, which are organized under an MPI-communicator. The ranks are indicated in curly brackets. Each process owns the data of its colored rectangle region. For example, the yellow process owns r(α) and P~rr at level 2, but it doesn’t own ~α at level 1. We depict the main data used in the factorization, the matrices P and the skeleton points at each node. Notice that the nodes at level 2 are not the leaf nodes. But the computations in the subtrees at higher levels are entirely local to the MPI processes that own them. In this example we have four such processes, indicated by different colors. Fig. 2: Distributed Kα˜β and Pα˜α factors of the ˜K matrix that corresponds to the partitioning of Figure 1. Each MPI processes (in the global communicator) stores only the matrices that have the same color with it. A matrix that has multiple color is distributed across several processes. We also show the partitioning of input and output vectors.

Some processes may own the skeletons and matrix of the distributed node. We give an example in Figure 1; only {0} in each node stores the skeletons and . In the factorization we need to build matrices , which in turn involve , , and . These matrices are unnecessary in ASKIT but in Inv-ASKIT we need to compute and store them in parallel. These computations involve collective broadcasts and reductions in the communicators of and point-to-point exchanges between the {0} and {} ranks of the processes that were assigned to . Finally matrix is stored at {0}. Figure 2 shows how these matrices are distributed for the matrix generated by the tree in Figure 1.

Factorization. In Algorithm II.3, we describe how (11) and (12) are computed. Each process uses its local MPI rank to make different decisions. E.g., if a statement is labeled with , then only processes with rank execute it. Otherwise, all statements are executed by all processes. The whole communication pattern is simple. If a process owns the skeletons and of , then it sends to all processes in the same node and sends its skeletons to all processes in its sibling tree node. In the pseudo-code, this is a little bit complicate, since we can’t simply broadcast between communicators. Thus, some point-to-point communication is involved.

We will explain the algorithm using the yellow process (Yellow) in Figure 1 and Figure 2. At level 2, Algorithm II.1 is invoked, and the local tree of is factorized. At level 1, Yellow returns from line 5 as rank {1}. At line 6, it receives , which will later be used to compute in the parent level. At line 7, it receives from Green, and at line 9 it sends it owns to Green. is used to evaluate at line 12. Evaluating doesn’t require communication at this level, since all factors in the subtree are owned by Yellow. At line 14, is overwritten by . Finally, Yellow sends its to Green, and is computed and factorized on Green. At level-0, we only illustrate how that Yellow received at level 1 is used to compute the at line 11. By the definition of (12), the right portion of only requires and . The first part is received at level 1, and the second part is computed at level 1.

Solver. Algorithm II.4 uses a similar bottom-up tree traversal. Algorithm II.2 is invoked when the traversal reaches level-. At line 7, {0} reduces all portions of and computes where was previously factorized. At line 8, is broadcast by {0}. Finally, each process computes its own .

Complexity. We use a network model to derive the parallel complexity. We assume that the network topology provides complexity for Bcast and Reduce for a message of size . For notational clarity we omit the latency cost.

Let be the parallel complexity of Algorithm II.4. is the base case complexity of Algorithm II.2. We can derive the complexity recursively as Here is the message transfer computation cost of (11), and is the communication cost. The overall solve complexity with grain size is given by

 Ts(n,p)=O(nlogN)comp+O(slog2p)comm, (15)

including both computation and communication. The number of messages required is if one wants to include latency costs. We observe that the communication cost doesn’t scale with the problem size , but only with and .

The complexity of Algorithm II.3 is similarly given by Here is the time in Algorithm II.4, the time in (12), and is the communication cost spent on broadcasting skeletons and . Overall, the factorization complexity with grain size is

 Tf(n,p)=O(nlog2N)comp+O(s2log3p+dlogp)comm, (16)

including both computation and communication. Observe that the sequential complexity is times the parallel computation complexity, which indicates that the algorithm should be able to achieve almost linear scalability. The communication cost is minor since it does not depend on .

Discussion. Inv-ASKIT shares the common problems of all treecodes: (1) the parallelism diminishes exponentially when traversing to the root, and (2) the possible global synchronization during the level by level traversal. Solving (5) in a distributed node reduces the product to one process in the communicator. This takes the advantage of the collective Bcast and Reduce, but the parallelism diminishes. This efficiency degradation is not significant when and are small, but we will observe this effect in the scaling experiments. The load-balancing problem happens when the adaptive skeletonization is used to dynamically decide the rank of the off-diagonal blocks. The processes with the largest rank becomes the bottleneck. This load-balancing issues requires a dynamic scheduling and beyond the scope of this paper.

## Iii Experimental Setup

We performed numerical experiments to examine the accuracy and the scalability of Inv-ASKIT. Notice that the only alternative method that we could compare with Inv-ASKIT for the datasets of the size and dimensionality is a global low-rank approximation of . But in , we have already shown that for several bandwidths of practical interest the low-rank approximation is not a good scheme for approximating .

We also applied Inv-ASKIT to binary classification tasks on real-world datasets using kernel regression. Next we summarize the hardware, datasets, and parameters used in our experiments. In the majority of our tests we use the Gaussian kernel function (1).

Implementation and hardware. All experiments were conducted on the Maverick and Stampede clusters at the Texas Advanced Computing Center. Maverick nodes have two 10-core Intel Xeon E5-2680 v2 (3.1GHz) processors and 256GB RAM. Stampede nodes have two 8-core E5-2680 (2.8GHz) processors and 32GB RAM. Inv-ASKIT and ASKIT are written in C++, employing OpenMP for shared memory parallelism, Intel MKL for high performance linear algebra operations and Intel MPI for distributed memory parallelism. A custom direct “MatVec”444

See the General Stride Kernel Summation project:

https://github.com/ChenhanYu/ks. in ASKIT is optimized with SSE2 and AVX. The classification code is written in C++, employing a Krylov subspace method (GMRES) from the PETSc library .

Iterative methods. Recall that Inv-ASKIT can only compute without the sparse matrix . While , ASKIT can better approximate using nearest-neighbor pruning ( will be smaller), but we can no longer compute its exact inverse directly. Still Inv-ASKIT can be a good preconditioner. In this case, converging to the desired accuracy may take several GMRES iterations, but the classification accuracy may also be improved.

Datasets: In Table II, we give details of each dataset. In strong scaling and the classification tasks on Maverick, we use three real-world datasets: SUSY (high-energy physics) and COVTYPE (forest cartographic variables) from , MNIST (handwritten digit recognition) from . In the weak scaling tasks, we generate a random 64D point cloud which are drawn from a 6D NORMAL distribution and embedded in 64D with additional noise. This resembles a dataset with a high ambient but relatively small intrinsic dimension, something typical in data analysis. We also report strong scaling results using a 16M NORMAL synthetic dataset.

Bandwidth selection and accuracy metrics. For the majority of experiments, we used the Gaussian kernel. The values are taken from 

, where they were selected using cross-validation with a kernel Bayes classifier. The regularization

is also selected by cross-validation.

We test both versions of ASKIT (defined by the point-node separation criterion discussed in §II-B), which correspond to and in (9). We use three metrics to measure the quality of the approximations for the MatVec and Invert approximations of ASKIT and Inv-ASKIT:555Reminder on the notation: Here is the exact kernel matrix and is its ASKIT approximation with . By , we refer to the inverse based on the ASKIT approximation.

 ϵM=∥Kw−(˜K+S)w∥2/∥Kw∥2,ϵI=∥w−(λI+˜K)−1(λI+˜K)w∥2/∥w∥2,ρ=∥u−(λI+˜K+S)w)∥2/∥u∥2. (17)

The first error measures the ASKIT MatVec error. If the off-diagonal blocks of indeed admit low-rank, then should be small. Since evaluating exactly ( work) is impossible for large , we estimate by sampling 1K points. We measure the condition of Inv-ASKIT’s Invert for the case using . If is well-conditioned, then should be small. We test as a preconditioner for GMRES for solving and we report the normalized residual upon termination of GMRES. Typically the desired for the classification tasks is 1e-3. However, if the GMRES terminates while reaching the maximum iteration steps, then will be larger. Overall, we measure the quality of a preconditioner according to the required iteration steps #iter and the achieved accuracy .

Kernel regression for binary classification. Given a set of points with binary labels666The sets MNIST and COVTYPE have several classes. For those we perform a one-vs-all binary classification. Take MNIST dataset as an example, the two classes in the binary classification task are whether a point is the digit or not. , we split these points into the training and the testing sets. We use the training set to compute the weights for the regression. That is, given the correct classification labels , where for each training point , we solve for the regression weights . Once we compute these weights, the label for a test point is given by is . By comparing the true labels of the test points (which are given) to the computed ones using the kernel regression, we can estimate the classification accuracy as a percentage of correctly classified points. The classification errors are reported in the next section.

## Iv Empirical Results

We present three sets of results, for weak and strong scaling and for kernel regression. We conducted 29 experiments labeled from #1 to #29 in the tables. In Table III, we report numerical accuracy and strong scaling results. In Table IV, we report weak scaling for both Gaussian and polynomial kernels. Finally, we apply Inv-ASKIT as a preconditioner in the binary classification problem. We show that Inv-ASKIT’s speeds up the convergence rate up to 5 and thus it is a good preconditioner for . We expect this to be the case when is sufficiently smaller than . In the results the parameter indicates the accuracy tolerance used in the skeletonization phase of the ASKIT algorithm. is the maximum skeleton size. is the number of the nearest-neighbors.

Accuracy. In Table III and Table IV, we report ASKIT’s error for reference but our focus is on and , which are related to Inv-ASKIT. Note that is not always small due to the amplification of rounding errors by the conditioning of . Both and have strong impact on the condition number. For a Gaussian kernel matrix, a smaller usually results in better conditioned systems since it approaches the identity matrix. Runs #6–#8 are examples where the systems are well-conditioned even with . With a larger and lower , becomes nearly rank-deficient. In #12–#19 we see that is not sufficiently large, and that’s why we observe large .

Scaling. Now, we discuss the weak and strong scaling results in Table III and Table IV. The overall Inv-ASKIT time is divided into factorization time and the solving time . The factorization time is split into (1) communication and (2) computation . is further subdivided into local and distributed time. The communication part of is negligible so we do not split . Take #1 as an example, the communication time is seconds, the local factorization time is seconds, and the distributed factorization time is seconds. The solving time is second, and the overall time is seconds.

In Table III we report the strong-scaling efficiency . The base-line experiments are #1, #4, #6 and #9, which have . Fixing the problem size, we measure the efficiency by doubling the core number each time. We observe that the communication cost is insignificant; roughly speaking, doubling the number of cores will reduce the efficiency by 4%–12%. The efficiency degradation is due to diminishing parallelism during the computation of (5), where only the root rank in each subcommunicator works. We also measure the relative-to-peak FLOPS efficiency. The peak for 640-core on Maverick is 777Each E5-2680 v2 processor can compute GFLOPS per second. The aggregate theoretical peak performance of 64 processors is GFLOPS TFLOPS. TFLOPS, and the peak of 4096-core on Stampede is TFLOPS. The ratio FLOPSpeak from low-to-high is MNIST2M 8%, NORMAL 20%, COVTYPE 36% and SUSY 45%.

The scalability varies between datasets during the factorization, because there are load-balancing issues in the level-by-level traversal. Adaptive skeletonization results in different values of for nodes in the same tree level. The node with the largest becomes the critical path in each level. The load-balancing problem is especially serious in MNIST where the workload can vary by orders of magnitude. On the other hand, SUSY has the most balanced workload, which allows it to reach 45% peak floating point efficiency. We also report the MatVec performance in the column. The MatVec in ASKIT is highly optimized and does not require any communication. Typically, it can reach near-linear scaling and more than 80% of peak floating point efficiency.

In Table IV, we report weak-scaling results. We perform experiments using Gaussian and polynomial kernels and we observe similar behavior. We report the FLOPS rate () and relative parallel efficiency (). The base-line experiments are #12 and #16. The efficiency drops dramatically from 20-core runs to 80-core runs, since #12 and #16 do not involve MPI. From 80 cores to 640 cores, doubling the number of cores results only in 3% loss, which shows that the weak scalability is almost linear. The weak-scalability experiments can reach a higher efficiency, since there are no communication and load-balancing issues. As a result, Inv-ASKIT can reach 52% theoretical peak floating point efficiency.

Classification. We conduct three classification experiments, following . The training time is presented in pairs to show both the runtime and the iteration steps of the GMRES solver. Four experiments are conducted for each dataset, combining different kernel matrix approximations ( or ) and different preconditioners ( indicates no preconditioning or ). We present both training and testing (10K samples) accuracy, and we present the normalized residual upon termination of GMRES.

Approximating the kernel matrix using , we reach 96% accuracy in COVTYPE, 79% accuracy in SUSY and 100% accuracy in MNIST. Using the nearest-neighbor-based separation criterion which creates a nonzero , we can reach a higher accuracy (97%) in COVTYPE. In #21, #25 and #29, we use as initial guess, which allows GMRES to converge immediately to a high precision by only applying the preconditioner. The training time only involves the factorization time and the solving time, which are both deterministic. Without preconditioners, GMRES takes 3 to 5 longer to converge to the desired accuracy. During the cross-validation, time spent on training may be much longer depending on the combination and .

In runs #23 and #27, we show that Inv-ASKIT also works well as a preconditioner when we include the nearest-neighbors in the ASKIT MatVec approximation. Comparing #22 to #23, we can find that #23 with Inv-ASKIT only takes 43 iterations to converge to 1e-3. On the other hand, #22 reaches the maximum iteration limit, only converging to 3e-3. Involving nearest-neighbors can better approximate in some cases. For example, #23 can reach a higher classification accuracy than #21. Overall Inv-ASKIT provides a 2 to 3

speedup. For non