I Introduction
Given a set of points $X = \{x_i\}_{i=1}^N \subset \mathbb{R}^d$ and a kernel function $\mathcal{K}(x_i, x_j)$, a kernel matrix $K$ is the $N \times N$ matrix^{1} whose entries are given by $K_{ij} = \mathcal{K}(x_i, x_j)$ for $x_i, x_j \in X$. We are interested in the factorization of $K$. (^{1}We restrict our discussion to the case in which the targets and sources come from the same point set $X$. The points indexed by $i$ are called the target points, and the points indexed by $j$ are called the source points. A kernel matrix can also involve target and source points that come from different sets.)
Kernel matrices are typically dense, so matrix-vector multiplication with $K$ (henceforth, “MatVec”) requires $O(N^2)$ work. Solving a linear system with $K$ (which we will refer to as “Invert”) is not stable because $K$ can be nearly singular. We typically invert $K + \lambda I$ instead, where $\lambda$ is the regularization parameter and $I$ is the identity matrix. The LU factorization of $K + \lambda I$ requires $O(N^3)$ work. An iterative Krylov method requires $O(N^2)$ work per iteration, but the constant can be quite high depending on $\mathcal{K}$.
Motivation and significance. Kernel matrices appear in N-body problems [11], machine learning methods for classification, regression, and density estimation [9, 30, 16], spatial statistics [7], Gaussian processes [27], and approximation theory [28], among other applications. Typical operations required with kernel matrices are matrix-vector multiplication, solving linear systems, and eigenvalue computations. For example, in machine learning, solving linear systems is required in ridge regression and support vector machines, and eigenvalue computations are required in spectral clustering. Such operations can be performed with iterative solvers, which only require matrix-vector multiplication; however, the conditioning can be bad and convergence can be quite slow.
For many (but not all) kernel matrices, dramatic acceleration of the MatVec and Invert operations can be achieved if we exploit subtle properties of their structure. For example, consider the Gaussian kernel, a popular kernel in machine learning,
$\mathcal{K}(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|_2^2}{2 h^2}\right),$  (1)
where $h$ is the bandwidth. For small $h$, $K$ approaches the identity matrix, whereas for large $h$, $K$ approaches the rank-one constant matrix. The first regime suggests sparse approximations, while the second suggests global (numerical) low-rank approximations. For the majority of bandwidth values, however, $K$ is neither sparse nor globally low-rank. It turns out that the off-diagonal blocks of $K$ may admit good low-rank approximations. A key challenge is identifying and constructing these approximations. When the points are in low dimensions (say $d \leq 3$), effective techniques for the MatVec and Invert operations exist. But for high-dimensional problems, there is very little work.
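These two regimes are easy to observe numerically. The sketch below (plain NumPy; the point set and bandwidth values are arbitrary choices for illustration, not from the paper) builds the Gaussian kernel matrix of (1) and checks that a tiny bandwidth gives a nearly identity matrix while a huge bandwidth gives a numerically rank-one matrix:

```python
import numpy as np

def gaussian_kernel(X, h):
    """Dense Gaussian kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 h^2))."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * h * h))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))          # 200 points in 8 dimensions

K_small = gaussian_kernel(X, h=1e-3)       # h -> 0: K approaches the identity
K_large = gaussian_kernel(X, h=1e3)        # h -> inf: K approaches the rank-one all-ones matrix

print(np.linalg.norm(K_small - np.eye(200)))   # essentially zero
s = np.linalg.svd(K_large, compute_uv=False)
print(s[1] / s[0])                             # tiny ratio: numerically rank one
```

For intermediate bandwidths, neither of the two printed quantities is small, which is exactly the regime that requires the hierarchical approximations discussed next.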
Contributions: Our goal is a method that exhibits algorithmic and parallel scalability for inverting $K + \lambda I$ for large $N$ and $d$. We only require the ability to compute an entry $K_{ij}$ of $K$ in $O(d)$ time. Our scheme extends ASKIT^{2}, a method we introduced in a series of papers [22, 25, 23]. ASKIT approximates the MatVec with $K$ in near-linear time. Here we introduce InvASKIT for factorizing and inverting the ASKIT-based approximation of $K + \lambda I$. In particular: (^{2}Approximate Skeletonization Kernel Independent Treecode. The ASKIT library is available at http://padas.ices.utexas.edu/libaskit.)

An ASKIT variant approximates $K$ as a triple sum: a block-diagonal plus a low-rank plus a sparse matrix. In this case we use InvASKIT as a preconditioner for a GMRES iterative solver.

We apply our solver to a binary classification problem using kernel regression. We present the solver’s accuracy and training and test errors, and compare it with Krylov iterative methods (described in §IV).

We conduct strong and weak scaling analyses in which we report communication, computation, and overall efficiency (described in §IV).
Our method uses the hierarchical decomposition of ASKIT. The fundamental concept behind it is simple: first, approximate $K$ as the sum of a block-diagonal matrix and a low-rank matrix; second, compute the inverse using the Sherman-Morrison-Woodbury (SMW) formula; third, apply this decomposition recursively to each block of the block-diagonal matrix. To our knowledge, InvASKIT is the only algorithm that can be used with high-dimensional data for a large variety of kernels, and the only one that supports shared- and distributed-memory parallelism.
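The first two of these steps can be illustrated in a few lines of NumPy. The sketch below is an illustration, not the InvASKIT implementation; the block sizes, ranks, and random test matrices are arbitrary. It solves with a block-diagonal-plus-low-rank matrix via the SMW formula, touching the blocks of $D$ only through solves:

```python
import numpy as np

def smw_solve(D_blocks, U, V, b):
    """Solve (D + U V) w = b, with D block-diagonal and U V low-rank, via
    Sherman-Morrison-Woodbury:
    w = D^{-1} b - D^{-1} U (I + V D^{-1} U)^{-1} V D^{-1} b."""
    def d_solve(rhs):
        out, i = np.empty_like(rhs), 0
        for Db in D_blocks:                 # in InvASKIT this step is recursive
            out[i:i + len(Db)] = np.linalg.solve(Db, rhs[i:i + len(Db)])
            i += len(Db)
        return out

    Dinv_b, Dinv_U = d_solve(b), d_solve(U)
    core = np.eye(U.shape[1]) + V @ Dinv_U  # small system of size rank(UV)
    return Dinv_b - Dinv_U @ np.linalg.solve(core, V @ Dinv_b)

rng = np.random.default_rng(1)
n, s = 64, 4
D1 = rng.standard_normal((n // 2, n // 2)) + n * np.eye(n // 2)  # well-conditioned blocks
D2 = rng.standard_normal((n // 2, n // 2)) + n * np.eye(n // 2)
U = rng.standard_normal((n, s))
V = 0.1 * rng.standard_normal((s, n))       # modest low-rank perturbation
b = rng.standard_normal(n)

A = np.block([[D1, np.zeros_like(D1)], [np.zeros_like(D2), D2]]) + U @ V
w = smw_solve([D1, D2], U, V, b)
print(np.linalg.norm(A @ w - b))            # small residual
```

The only dense solves are with the two diagonal blocks and with a small $s \times s$-sized core matrix; the third step of the scheme replaces the block solves by recursion.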
InvASKIT does not assume symmetry of the kernel, global low rank, sparsity, or any property other than that, up to a small sparse-matrix correction, the off-diagonal blocks admit a low-rank approximation.
Related work: We do not review the large literature on accelerating the matrix-vector operation with a kernel matrix; we focus only on fast factorization schemes. Kernel matrices often admit a good global low-rank approximation. For such matrices, the Nystrom method and its variants can be very effective [31, 8, 14] for approximating both $K$ and its regularized inverse. However, in many applications $K$ does not have a global low-rank structure and Nystrom methods fail. In [24], we discussed this in detail and presented classification results on real datasets for which the Nystrom method fails.
The idea of exploiting the block-diagonal-plus-low-rank structure has been discussed extensively in the past fifteen years for kernel matrices in low dimensions and for finite-difference/finite-element matrices. Indeed, for point datasets in low dimensions there exist methods that are much faster and more scalable than ASKIT. Approximating the inverse is more challenging, but this too is a well-studied problem. Efficient schemes for approximating the inverse of kernel (but also other types of) matrices are also known as fast direct solvers, and they have roots in nested dissection (or domain decomposition), multifrontal methods, and truncated Newton-Schulz iterations. Examples include [2, 1, 3, 15, 10, 5, 29, 20, 13], and this list is non-exhaustive. All these works, however, have been applied successfully only to low-dimensional problems and do not address parallelism and performance.
To our knowledge, [18] is the earliest work that parallelizes hierarchical matrices, using bulk-synchronous parallelization on shared-memory systems, but it is not efficient due to the critical path. The LU factorization of [19] improves its efficiency by employing DAG-based task scheduling. Distributed implementations include [29] and [17], but, to our understanding, they were designed for sparse matrices related to 2D and 3D stencil discretizations, not kernel matrices. Although the basic machinery is similar, the matrices are quite different. We discuss this further in §II-A. For Gaussian kernels, the only work on fast inversion is [3, 2], in which the authors consider Gaussian and other radial-basis kernels in low dimensions. Parallelism and scalability were not considered in that work. Their largest test had 1M points in 3D. We consider datasets that have 30× more points and over 200× higher dimensions.
Limitations: This is a novel algorithm, and this paper represents our first attempt at an efficient implementation. For this reason it has several limitations. First, we do not provide a theoretical analysis of its numerical stability. (But it is relatively easy to prove for symmetric positive definite matrices.) Second, we do not exploit certain symmetries in ASKIT that can further reduce the factorization time; for example, the approximation $\tilde{K}$ can be symmetric. Third, we do not performance-tune various algorithm parameters. Fourth, we do not discuss optimizations that can be used to amortize the factorization cost over different values of the regularization parameter $\lambda$. Fifth, for certain ASKIT variants our scheme can be used only as a preconditioner, and the errors can be significant. Sixth, there are additional low-rank structures that we do not exploit. Finally, there exist fundamental limitations inherited from ASKIT. Not all kernel matrices have a block-diagonal-plus-low-rank structure; examples are point datasets that have high ambient dimensionality and kernels that are highly oscillatory (e.g., the high-frequency Helmholtz kernel). Also, the sparse-matrix correction for the off-diagonal blocks can have large norm, in which case using this version of InvASKIT as a preconditioner will not work.
II Methods
Notation            Description
$N$                 problem size
$d$                 dimension of input points
$m$                 leaf node size
$s$                 number of skeleton points
$p$                 number of MPI processes
$\alpha$, $\beta$   tree nodes, overloaded with the indices they own
$l(\alpha)$, $r(\alpha)$   left and right children of the tree node $\alpha$
$X$                 point dataset
$X_\alpha$          points owned by node $\alpha$, i.e., $X_\alpha \subseteq X$
$\tilde{K}$         approximate kernel matrix
Notation: In Table I, we summarize the main notation used in the paper. We begin by discussing the basic concepts in the factorization of hierarchical matrices (§II-A). Then we present a brief summary of the ASKIT algorithm (§II-B). We conclude with the sequential (§II-C) and parallel (§II-D) factorization algorithms.
II-A Hierarchical Matrices and Treecodes
Informally, $K$ is hierarchical if it can be approximated by $D + UV$, where $D$ is a block-diagonal matrix, $UV$ is low-rank, and each block of $D$ is itself a hierarchical matrix. To define the $D$, $U$, $V$ matrices, we first consider the following two-by-two block partitioning of $K$:
$K = \begin{pmatrix} K_{11} & K_{12} \\ K_{21} & K_{22} \end{pmatrix} = \begin{pmatrix} K_{11} & 0 \\ 0 & K_{22} \end{pmatrix} + \begin{pmatrix} 0 & K_{12} \\ K_{21} & 0 \end{pmatrix}.$  (2)
$D$ is the first matrix in the sum, and the product $UV$ approximates the second matrix; $U$ and $V$ will be defined shortly. So, in summary: (1) blocks $K_{12}$ and $K_{21}$ admit low-rank approximations, which implies that $U$ and $V$ can be taken low-rank; (2) the same decomposition applies recursively to the diagonal blocks $K_{11}$ and $K_{22}$. This approximation, derived by recursive partitioning, is directly associated with a binary partitioning of the rows and columns of $K$. Note that using a lexicographic ordering of the rows as in (2) will not, in general, result in low-rank off-diagonal blocks. An appropriate reordering, in our case using the tree structure, is critical for exposing such low-rank structure. Given the finest granularity $m$, we partition the points using geometric information and a binary tree so that we have fewer than $m$ points per leaf. (The number $m$ is manually selected; typically it is about 1024 or 2048, and it affects both computation time and accuracy. For a discussion, see [22].) We denote the levels of the tree by $\ell = 0, \ldots, L$, with the root at $\ell = 0$ and the leaves at $\ell = L$. Let $\alpha$ be a node in the tree. We “overload” the notation by using $\alpha$ to also indicate the indices of the points assigned to it, so that $X_\alpha \subseteq X$. Then the hierarchical partition and approximation of $K$ assume that for every node $\alpha$, we can partition the self-interactions $K_{\alpha\alpha}$ between its points as follows
$K_{\alpha\alpha} = \begin{pmatrix} K_{l(\alpha)l(\alpha)} & K_{l(\alpha)r(\alpha)} \\ K_{r(\alpha)l(\alpha)} & K_{r(\alpha)r(\alpha)} \end{pmatrix},$  (3)
where $l(\alpha)$ denotes the left child of a node and $r(\alpha)$ the right child. Equivalently to (2), we can write $K_{\alpha\alpha} \approx D_\alpha + U_\alpha V_\alpha$. The matrices $U_\alpha$ and $V_\alpha$ are selected to approximate the off-diagonal blocks in (3) such that
$K_{l(\alpha)r(\alpha)} \approx U_{l(\alpha)} V_{r(\alpha)}, \qquad K_{r(\alpha)l(\alpha)} \approx U_{r(\alpha)} V_{l(\alpha)}.$  (4)
The particular form of these matrices for the ASKIT method is given in (8). If $D_\alpha$ is invertible, we can immediately write the approximate inverse of $K_{\alpha\alpha}$ using the SMW formula:
$(D + UV)^{-1} = D^{-1} - D^{-1} U \left(I + V D^{-1} U\right)^{-1} V D^{-1}.$  (5)
The extension to $K + \lambda I$ is trivial. In InvASKIT, factorizing amounts to visiting every tree node $\alpha$ to compute $D_\alpha^{-1} U_\alpha$ and the factorization of $I + V_\alpha D_\alpha^{-1} U_\alpha$. We require traversing the descendants of $\alpha$ to compute $D_\alpha^{-1} U_\alpha$. If $s$ is the numerical rank of the off-diagonal blocks, then computing $D_\alpha^{-1} U_\alpha$ requires $2s$ linear solves with $D_\alpha$. Computing the inverse of $I + V_\alpha D_\alpha^{-1} U_\alpha$ requires factorizing this small matrix using LAPACK. The recursion stops at the leaves of the tree, where we have a single dense block, which we again factorize using LAPACK. We give pseudocode for the precise algorithms in §II-C.
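To make the recursion concrete, here is a compact NumPy model of the whole scheme on a dense matrix. It is only a sketch, not InvASKIT: the off-diagonal blocks are compressed with a truncated SVD of fixed rank $s$ (ASKIT uses skeletonization instead), the partitioning is a plain halving of sorted 1D points (instead of the ball tree), and nothing is parallel:

```python
import numpy as np

def hfactor(K, lam, m, s):
    """Recursively factor K + lam*I as block-diagonal plus low-rank (HODLR-style model)."""
    n = K.shape[0]
    if n <= m:                                   # leaf: keep the dense block
        return ("leaf", K + lam * np.eye(n))
    half = n // 2
    # Rank-s truncated SVDs of the two off-diagonal blocks.
    Pl, dl, Ql = np.linalg.svd(K[:half, half:])
    U1, V1 = Pl[:, :s] * dl[:s], Ql[:s]
    Pr, dr, Qr = np.linalg.svd(K[half:, :half])
    U2, V2 = Pr[:, :s] * dr[:s], Qr[:s]
    left = hfactor(K[:half, :half], lam, m, s)
    right = hfactor(K[half:, half:], lam, m, s)
    return ("node", half, left, right, U1, V1, U2, V2)

def hsolve(node, b):
    """Apply the approximate inverse using the SMW formula at every node."""
    if node[0] == "leaf":
        return np.linalg.solve(node[1], b)
    _, half, left, right, U1, V1, U2, V2 = node
    def d_solve(r):                              # D^{-1} applied child by child
        return np.concatenate([hsolve(left, r[:half]), hsolve(right, r[half:])])
    n, s = len(b), U1.shape[1]
    U = np.zeros((n, 2 * s)); U[:half, :s] = U1; U[half:, s:] = U2
    V = np.zeros((2 * s, n)); V[:s, half:] = V1; V[s:, :half] = V2
    Db, DU = d_solve(b), d_solve(U)
    core = np.eye(2 * s) + V @ DU
    return Db - DU @ np.linalg.solve(core, V @ Db)

# Sorted 1D points give low-rank off-diagonal Gaussian kernel blocks.
x = np.linspace(0.0, 1.0, 256)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * 0.3 ** 2))
lam = 1.0
tree = hfactor(K, lam, m=32, s=12)
b = np.sin(7 * x)
w = hsolve(tree, b)
print(np.linalg.norm((K + lam * np.eye(256)) @ w - b))   # small residual
```

The residual is not zero because the off-diagonal blocks are truncated; it shrinks as the rank $s$ grows, which mirrors the adaptive rank selection discussed later.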
As mentioned in §I, there is extensive work on fast solvers for hierarchical matrices. Such solvers rely on the SMW formula, but the details can differ significantly, for example in the permutation or reordering of the matrix, the construction of $U$ and $V$, and the exploitation of various symmetries.^{3} ASKIT can also produce a slightly different decomposition (known as a fast-multipole-type matrix), which decomposes $K$ into a sum of a hierarchical matrix (which has the form we discussed above) and a sparse matrix $S$. Next, we give some details of ASKIT’s construction of the hierarchical approximation of $K$ and briefly explain this decomposition. (^{3}More precisely, our matrix can be described as a HODLR (Hierarchically Off-Diagonal Low-Rank) matrix [1]. Other types of matrices include hierarchically semiseparable (HSS) matrices and $\mathcal{H}$-matrices [12].)
II-B ASKIT
Details on the algorithm can be found in [22]; here we briefly summarize it. The input is a set of points $X$ and, for each point $x_i$, a list of its nearest neighbors in $X$. We build a balanced binary tree-based partitioning of the points using a parallel ball tree [26] data structure. In low dimensions, quadtrees and octrees can be used effectively, but in high dimensions such data structures cannot be used.
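For intuition, the following sketch builds such a binary partitioning in NumPy. It splits at the median along the coordinate of largest spread, a crude stand-in for the ball tree splits that ASKIT actually uses; the leaf size and dataset are arbitrary illustrative choices:

```python
import numpy as np

def build_tree(X, idx, m):
    """Recursively bisect the index set `idx` until each node owns <= m points."""
    if len(idx) <= m:
        return {"idx": idx}                              # leaf node
    P = X[idx]
    dim = int(np.argmax(P.max(axis=0) - P.min(axis=0)))  # largest-spread coordinate
    order = idx[np.argsort(P[:, dim])]
    half = len(order) // 2                               # median split: balanced tree
    return {"idx": idx,
            "left": build_tree(X, order[:half], m),
            "right": build_tree(X, order[half:], m)}

def leaves(node):
    if "left" not in node:
        return [node["idx"]]
    return leaves(node["left"]) + leaves(node["right"])

rng = np.random.default_rng(3)
X = rng.standard_normal((1000, 64))
tree = build_tree(X, np.arange(1000), m=100)
print([len(l) for l in leaves(tree)])   # every leaf holds at most m points
```

The leaves partition the index set, which is exactly the (reordered) block structure used by the hierarchical matrix in §II-A.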
After building the tree, we construct the hierarchical representation. In ASKIT we use the notion of a point $x_i$ being separated from a tree node $\alpha$. A basic separation criterion is that $x_i \notin X_\alpha$. If $K(x_i, X_\alpha)$ is the interaction between $x_i$ and $X_\alpha$, we construct skeleton points $\tilde{X}_\alpha$ and a projection matrix $P_\alpha$ such that
$K(x_i, X_\alpha) \approx K(x_i, \tilde{X}_\alpha)\, P_\alpha.$  (6)
The points $\tilde{X}_\alpha \subset X_\alpha$ are referred to as the “skeletons”. The skeletons are the pivot columns of a pivoted QR factorization of a subsample of the block matrix $K(X \setminus X_\alpha, X_\alpha)$. (The nearest-neighbor information is used to effectively sample this block matrix.) The factor $K(x_i, \tilde{X}_\alpha)$ does not involve any approximation; instead of computing interactions with all points in $X_\alpha$, we just compute interactions with $\tilde{X}_\alpha$.
$P_\alpha$ is computed using the interpolative decomposition algorithm. A key property of ASKIT is that the skeletons of a node are selected from the skeletons of its children: for a node $\alpha$, the skeletons $\tilde{X}_\alpha$ are a subset of $\tilde{X}_{l(\alpha)} \cup \tilde{X}_{r(\alpha)}$. Furthermore, $P_\alpha$ can be computed recursively by only storing the local projection from the children’s skeletons; if we assume the skeleton size $s$ is fixed for all nodes, this locally stored matrix is $s \times 2s$. The details of constructing the skeletons and the $P$ matrices can be found in [22]. In the pseudocode below, we outline the two steps required to perform a fast MatVec $u = \tilde{K} w$. The “Skeletonize” step builds the projection matrices for each node and computes the skeleton weights $\tilde{w}_\alpha = P_\alpha w_\alpha$, auxiliary variables that are then used in the “Evaluate” step, in which for every target point $x_i$ the tree is traversed top-down to compute the entry $u_i$ of the output vector.
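A simplified model of this skeletonization is sketched below: greedy column pivoting (a stand-in for the pivoted QR that ASKIT uses, which it performs on a neighbor-based subsample) picks the skeleton columns $J$, and a least-squares fit builds the projection $P$ with $A \approx A[:, J]\, P$:

```python
import numpy as np

def interpolative_decomposition(A, s):
    """Pick s skeleton columns by greedy pivoting, then fit P so A ~= A[:, J] @ P."""
    R = A.astype(float).copy()
    J = []
    for _ in range(s):
        j = int(np.argmax((R * R).sum(axis=0)))  # column with largest residual norm
        q = R[:, j] / np.linalg.norm(R[:, j])
        R -= np.outer(q, q @ R)                  # deflate the chosen direction
        J.append(j)
    P = np.linalg.lstsq(A[:, J], A, rcond=None)[0]
    return np.array(J), P

rng = np.random.default_rng(4)
A = rng.standard_normal((50, 5)) @ rng.standard_normal((5, 40))  # exactly rank 5
J, P = interpolative_decomposition(A, s=5)
print(np.linalg.norm(A - A[:, J] @ P))           # essentially zero
```

The key point, as in ASKIT, is that the approximation reuses actual columns of $A$ (actual source points), which is what makes the nested parent-from-children skeleton selection possible.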
In the Skeletonize algorithm, lines 4, 5, and 10 do not depend on $w$. Therefore, if we want to apply $\tilde{K}$ to a different vector, only lines 6 and 11 are needed. In the Evaluate phase, ASKIT provides two different algorithms for the Separated function. In the first algorithm, $x_i$ is separated from $\alpha$ if $x_i \notin X_\alpha$. This definition creates a hierarchical approximation of $K$ identical to (3). Then, if $l$ and $r$ are two sibling nodes with parent $\alpha$, the matrices $K_{lr}$ and $K_{rl}$ are approximated by
$K_{lr} \approx K(X_l, \tilde{X}_r)\, P_r, \qquad K_{rl} \approx K(X_r, \tilde{X}_l)\, P_l.$  (7)
We can now define, for a node $\alpha$ with children $l = l(\alpha)$ and $r = r(\alpha)$,
$U_\alpha = \begin{pmatrix} K(X_l, \tilde{X}_r) & 0 \\ 0 & K(X_r, \tilde{X}_l) \end{pmatrix}, \qquad V_\alpha = \begin{pmatrix} 0 & P_r \\ P_l & 0 \end{pmatrix}.$  (8)
This completes the definition of the hierarchical matrix for InvASKIT (compare (8) with (4)). To reduce memory costs, however, ASKIT never constructs $P_\alpha$ explicitly. Instead, looking at the Skeletonize function, ASKIT only stores the local projection for each node $\alpha$. In the factorization we will need the whole matrix; we explain how to compute it in §II-C.
Other than the criterion $x_i \notin X_\alpha$, ASKIT can also define $x_i$ to be separated from $\alpha$ if no nearest neighbor of $x_i$ is in $X_\alpha$. This definition introduces a more complicated structure in the matrix. We still construct (7), but for points that have nearest neighbors in $\alpha$, line 4 in Evaluate is replaced by an exact evaluation $K(x_i, X_\alpha) w_\alpha$. For certain kernels this significantly increases the accuracy of the overall scheme. A convenient way to think about the second algorithm is that it builds an approximation equivalent to
$\tilde{K} = H + S,$  (9)
where $H$ is the hierarchical matrix and $S$ is a sparse matrix that is zero if we use the $x_i \notin X_\alpha$ separation criterion. If the norm of $S$ is small relative to $H$, we expect that $(H + \lambda I)^{-1}$ will be a good preconditioner for $\tilde{K} + \lambda I$. (Again, there are kernels for which this is not the case.)
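The following toy NumPy experiment illustrates the claim, with randomly generated stand-ins for $H$ and $S$ and a plain Richardson iteration instead of GMRES for brevity: when $\|S\|$ is small, preconditioning with $(H + \lambda I)^{-1}$ makes the iteration converge in a handful of steps, since the error contracts by roughly $\|(H + \lambda I)^{-1} S\|$ per step:

```python
import numpy as np

rng = np.random.default_rng(5)
n, lam = 100, 1.0
H = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # stand-in for the hierarchical part
S = np.zeros((n, n))                                # small sparse correction
S[rng.integers(0, n, 50), rng.integers(0, n, 50)] = 0.01 * rng.standard_normal(50)

A = H + S + lam * np.eye(n)                         # full regularized matrix
M = np.linalg.inv(H + lam * np.eye(n))              # InvASKIT-style preconditioner
u = rng.standard_normal(n)

w = np.zeros(n)
for it in range(100):                               # preconditioned Richardson iteration
    r = u - A @ w
    if np.linalg.norm(r) <= 1e-12 * np.linalg.norm(u):
        break
    w = w + M @ r                                   # error contracts like ||M S|| per step
print(it, np.linalg.norm(A @ w - u) / np.linalg.norm(u))
```

With a larger $\|S\|$ the contraction factor approaches one and the same iteration stalls, which is the failure mode mentioned in the Limitations paragraph.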
In this section we summarized the basic ideas in ASKIT. Its structure is very similar to what is known as a treecode in computational physics. We omitted several important details of the ASKIT algorithm. For example, we assumed that the number of skeleton points is fixed; in reality, this number differs for each tree node and is chosen adaptively so that the approximation error is below a tolerance $\tau$. Finally, the parallelization and scalability of ASKIT were described in [25]. Another version of ASKIT, which resembles the relationship between fast multipole methods and treecodes, is described in [23]. In these papers we also discuss the error analysis and the work/storage complexity of ASKIT. In what follows, we assume that we are given $\tilde{K}$ from ASKIT, and we proceed with discussing the InvASKIT factorization in detail using the ideas outlined in §II-A.
II-C Fast Direct Solver
Given the binary tree, the regularization parameter $\lambda$, and $\tilde{K}$ in terms of the skeletons and projection matrices for every tree node, we construct a factorization of $\tilde{K} + \lambda I$ that allows the fast solution of linear systems. We describe the algorithm for the case $\lambda = 0$ for notational simplicity, but the algorithms are stated for any $\lambda$. So, given $u$, we wish to compute $w = \tilde{K}^{-1} u$.
In the factorization phase we compute $U_\alpha$, $V_\alpha$, and the factorization of $I + V_\alpha D_\alpha^{-1} U_\alpha$ for every node $\alpha$. We first show how (5) is computed in blocked form. With $l = l(\alpha)$ and $r = r(\alpha)$, the matrix $I + V_\alpha D_\alpha^{-1} U_\alpha$ is computed as
$I + V_\alpha D_\alpha^{-1} U_\alpha = \begin{pmatrix} I & B_r \\ B_l & I \end{pmatrix},$  (10)

where $B_l = V_l \tilde{K}_{ll}^{-1} U_l$ and $B_r = V_r \tilde{K}_{rr}^{-1} U_r$. Then the partitioned form of (5) is
$w = \tilde{K}_{\alpha\alpha}^{-1} u = D_\alpha^{-1} u - D_\alpha^{-1} U_\alpha \left(I + V_\alpha D_\alpha^{-1} U_\alpha\right)^{-1} V_\alpha D_\alpha^{-1} u.$  (11)
If $\alpha$ is a nonleaf node, $P_\alpha$ is evaluated recursively by
$P_\alpha = \hat{P}_\alpha \begin{pmatrix} P_{l(\alpha)} & 0 \\ 0 & P_{r(\alpha)} \end{pmatrix},$  (12)

where $\hat{P}_\alpha$ is the locally stored $s \times 2s$ projection.
At the leaf nodes, $P_\alpha$ is computed by ASKIT (using the LAPACK routines GEQP3 and GELS [22]).
Factorization Phase. Algorithm II.1 performs a post-order traversal of the tree. If $\alpha$ is a leaf node, we factorize its diagonal block with a partial-pivoting LU factorization (LAPACK routine GETRF). Otherwise, we first compute $P_{l(\alpha)}$, $P_{r(\alpha)}$, $U_\alpha$, and $V_\alpha$ using (12). Then we solve for $\tilde{K}_{ll}^{-1} U_l$ and $\tilde{K}_{rr}^{-1} U_r$ by calling Solve() (Algorithm II.2). (Note that when we visit a node at some level, the factorization information required for the solve is already in place.) We then compute and factorize $I + V_\alpha D_\alpha^{-1} U_\alpha$ using the blocked representation in (10). Solve() is described in Algorithm II.2; it applies $\tilde{K}_{\alpha\alpha}^{-1}$ to a vector.
Solving Phase. If $\alpha$ is a leaf node, Solve() directly invokes an LU solver (LAPACK routine GETRS). Otherwise, we first solve with the two children recursively. Then we compute (11) with two matrix-matrix multiplications (BLAS routine GEMM) and an LU solve (GETRS). Note that both algorithms are implemented using bottom-up tree traversals and “for” loops; we do not actually employ recursion.
Refactoring for different $\lambda$. If we want to refactor for a different value of $\lambda$ (e.g., in cross-validation studies), $\tilde{K} + \lambda I$ needs to be refactorized, but the skeletons and projection matrices we have computed remain the same. Thus, when we refactor with a different $\lambda$, only $D_\alpha^{-1} U_\alpha$ and the factorization of $I + V_\alpha D_\alpha^{-1} U_\alpha$ need to be recomputed. (The factorization at the leaf level can also be amortized if an SVD is used instead of LU, but we do not discuss this here.)
Complexity of the sequential algorithm. We present the complexity analysis of Algorithm II.1 and Algorithm II.2. The main parameters are $s$ and $m$. They depend on the kernel, the dimensionality, and the geometric distribution of the points. They control the accuracy of the scheme, but they are not explicitly related to the problem size $N$. In the actual implementation, $s$ is chosen adaptively for each node; if we use its maximum value across nodes, the results below are an upper bound on the work complexity. Note that in some applications the kernel does depend on the problem size, for example the bandwidth value for the Gaussian kernel in machine learning problems. In that case $s$ and $m$ will need to change accordingly. But here we assume a fixed kernel for all values of $N$, since the way the kernel changes with increasing $N$ depends on the application and the geometry of the dataset. Let $T_f(N)$ denote the complexity of the factorization Algorithm II.1 with $N$ points, and let $T_s(N)$ denote the complexity of Algorithm II.2 with $N$ points. We can verify that
$T_s(N) = 2\, T_s(N/2) + O(sN), \qquad T_s(m) = O(m^2),$  (13)

so that $T_s(N) = O(sN \log(N/m) + Nm)$. Using the solving-phase complexity, we derive $T_f$ as

$T_f(N) = 2\, T_f(N/2) + 2s\, T_s(N/2) + O(s^2 N), \qquad T_f(m) = O(m^3).$  (14)
To summarize, Algorithm II.1 takes $O(s^2 N \log^2(N/m) + N m^2)$ work and Algorithm II.2 takes $O(s N \log(N/m) + N m)$ work.
Storage. Beyond the storage required for ASKIT, we need to store $D_\alpha^{-1} U_\alpha$ (overwritten on $U_\alpha$) and the LU factors of $I + V_\alpha D_\alpha^{-1} U_\alpha$ at each level. $U_\alpha$ and $V_\alpha$ take $O(sN)$ storage per level. Overall, the space requirement is $O(sN \log(N/m) + Nm)$. To save about half of the memory, $U_\alpha$ can be constructed recursively when it is needed and discarded when we reach the parent level; this results in slightly larger constants in the solving phase.
Shared memory parallelism: There are several ways to exploit shared-memory parallelism; we have implemented and tested two of them. The first is a level-by-level parallel tree traversal for both construction and evaluation, in which we use sequential versions of the BLAS/LAPACK routines, concurrent traversal of nodes at the same level, and sequential traversal across levels. The second is to use sequential traversals and exploit the shared-memory parallelism inside the BLAS/LAPACK routines. It turns out that a hybrid of these two approaches is more efficient. From level 4 down to the leaf level, we disable parallel BLAS and LAPACK, because we have experimentally observed that parallel tree traversal provides higher efficiency due to better memory locality. From level 3 up to the root of the local tree, we switch from parallel tree traversal to parallel BLAS operations: in parallel tree traversal some cores start to idle from tree level three, which has only 8 tree nodes, so in the top four levels of the tree we use parallel GETRF, GETRS, and GEMM.
II-D Parallel Algorithm
In [25] we discussed the parallelization of ASKIT. Here we build on that work and derive our parallel factorization algorithm. We assume that we have $p$ ranks in the global MPI communicator. The data structure used in ASKIT is summarized in Figure 1. The distributed-memory parallelism in the algorithm is invoked between tree levels 0 and $\log p$. We briefly review the structure of the tree before discussing the algorithms and the complexity.
Distributed Tree. Without loss of generality, we assume $N$ and $p$ are powers of two and that each MPI process owns $N/p$ points. Figure 1 shows a distributed tree with $p = 4$. We use purple boxes to indicate tree nodes. Each tree node is assigned to a number of MPI processes, and all processes sharing a node are grouped into an MPI communicator. We use {i} to indicate the MPI rank in the local communicator. For example, at level 0 the root of the tree is distributed among all 4 processes (from green to blue in Figure 1); the green process has local MPI rank {0}, and the blue process has rank {3}. If a node contains $q$ processes, then the processes with local ranks {0} through {q/2-1} are assigned to the left child, and the processes with ranks {q/2} through {q-1} are assigned to the right child. At level $\log p$, each tree node is assigned to a single MPI process, which is also assigned the entire subtree of that node down to the leaf level (not displayed in Figure 1).
Only some processes own the skeletons and projection matrix of a distributed node; in the example of Figure 1, only rank {0} in each node stores the skeletons and $P$. In the factorization we need to build the matrices $I + V_\alpha D_\alpha^{-1} U_\alpha$, which in turn involve $U_\alpha$, $V_\alpha$, and $D_\alpha^{-1} U_\alpha$. These matrices are unnecessary in ASKIT, but in InvASKIT we need to compute and store them in parallel. These computations involve collective broadcasts and reductions in the communicators of the nodes, as well as point-to-point exchanges between rank {0} and rank {q/2} of the processes assigned to a node. Finally, the matrix $I + V_\alpha D_\alpha^{-1} U_\alpha$ is stored at rank {0}. Figure 2 shows how these matrices are distributed for the matrix generated by the tree in Figure 1.
Factorization. In Algorithm II.3, we describe how (11) and (12) are computed. Each process uses its local MPI rank to make different decisions; e.g., if a statement is labeled with {0}, then only processes with rank {0} execute it, and otherwise all statements are executed by all processes. The overall communication pattern is simple: if a process owns the skeletons and $P$ of a node, then it sends $P$ to all processes in the same node and sends its skeletons to all processes in the sibling tree node. In the pseudocode this is a little more complicated, since we cannot simply broadcast between communicators; thus, some point-to-point communication is involved.
We explain the algorithm using the yellow process (Yellow) in Figure 1 and Figure 2. At level 2, Algorithm II.1 is invoked and Yellow’s local tree is factorized. At level 1, Yellow returns from line 5 as rank {1}. At line 6, it receives the projection $P$, which will later be used to compute $U$ at the parent level. At line 7, it receives skeletons from Green, and at line 9 it sends the skeletons it owns to Green; these are used to evaluate $U$ at line 12. Evaluating $U$ does not require communication at this level, since all factors in the subtree are owned by Yellow. At line 14, $U$ is overwritten by $D^{-1} U$. Finally, Yellow sends its portion to Green, and $I + V D^{-1} U$ is computed and factorized on Green. At level 0, we only illustrate how the $P$ that Yellow received at level 1 is used to compute the $U$ at line 11. By the definition of (12), the right portion of $U$ only requires the child’s projection and skeletons: the first part is received at level 1, and the second part is computed at level 1.
Solver. Algorithm II.4 uses a similar bottom-up tree traversal. Algorithm II.2 is invoked when the traversal reaches level $\log p$. At line 7, rank {0} reduces all portions of $V_\alpha D_\alpha^{-1} u$ and solves with $I + V_\alpha D_\alpha^{-1} U_\alpha$, which was previously factorized. At line 8, the result is broadcast by rank {0}. Finally, each process computes its own portion of $w$.
Complexity. We use a simple network model to derive the parallel complexity. We assume that the network topology provides $O(n \log p)$ complexity for Bcast and Reduce with a message of size $n$. For notational clarity we omit the latency cost.
Let $T^p_s(N)$ be the parallel complexity of Algorithm II.4, with base case $T_s(N/p)$, the complexity of the sequential Algorithm II.2 on $N/p$ points. We can derive the complexity recursively as $T^p_s(N) = T^p_s(N/2) + t_c + t_m$, where $t_c$ is the computation cost of (11) and $t_m$ is the communication cost. The overall solve complexity with grain size $N/p$ is given by

$T_s(N/p) + O\!\left(\frac{sN}{p} \log p\right) + O\!\left(s \log^2 p\right),$  (15)
including both computation and communication. The number of messages required is $O(\log^2 p)$ if one wants to include latency costs. We observe that the communication cost does not scale with the problem size $N$, but only with $s$ and $p$.
The complexity of Algorithm II.3 is derived similarly; the terms are the time spent in Algorithm II.4, the time spent in (12), and the communication cost of broadcasting the skeletons and $P$. Overall, the factorization complexity with grain size $N/p$ is

$T_f(N/p) + O\!\left(\frac{s^2 N}{p} \log^2 p\right) + O\!\left(s^2 \log^2 p\right),$  (16)
including both computation and communication. Observe that the sequential complexity is $p$ times the parallel computation complexity, which indicates that the algorithm should be able to achieve almost linear scalability. The communication cost is minor since it does not depend on $N$.
Discussion. InvASKIT shares the common problems of all treecodes: (1) the parallelism diminishes exponentially while traversing toward the root, and (2) global synchronizations may occur during the level-by-level traversal. Solving (5) in a distributed node reduces the product $V_\alpha D_\alpha^{-1} u$ to one process in the communicator. This takes advantage of the collective Bcast and Reduce, but the parallelism diminishes. The efficiency degradation is not significant when $s$ and $p$ are small, but we will observe this effect in the scaling experiments. A load-balancing problem occurs when adaptive skeletonization is used to dynamically decide the rank of the off-diagonal blocks: the process with the largest rank becomes the bottleneck. Addressing this load imbalance requires dynamic scheduling and is beyond the scope of this paper.
III Experimental Setup
We performed numerical experiments to examine the accuracy and the scalability of InvASKIT. Notice that the only alternative method we could compare InvASKIT with, for datasets of this size and dimensionality, is a global low-rank approximation of $K$. But in [24] we have already shown that for several bandwidths of practical interest a global low-rank approximation is not a good scheme for approximating $K$.
We also applied InvASKIT to binary classification tasks on real-world datasets using kernel regression. Next we summarize the hardware, datasets, and parameters used in our experiments. In the majority of our tests we use the Gaussian kernel (1).
Implementation and hardware. All experiments were conducted on the Maverick and Stampede clusters at the Texas Advanced Computing Center. Maverick nodes have two 10-core Intel Xeon E5-2680 v2 (3.1GHz) processors and 256GB RAM. Stampede nodes have two 8-core E5-2680 (2.8GHz) processors and 32GB RAM. InvASKIT and ASKIT are written in C++, employing OpenMP for shared-memory parallelism, Intel MKL for high-performance linear algebra operations, and Intel MPI for distributed-memory parallelism. A custom direct “MatVec”^{4} in ASKIT is optimized with SSE2 and AVX. The classification code is written in C++, employing a Krylov subspace method (GMRES) from the PETSc library [4]. (^{4}See the General Stride Kernel Summation project: https://github.com/ChenhanYu/ks.)
Iterative methods. Recall that InvASKIT can only compute the exact inverse of the approximation without the sparse matrix $S$. With $S \neq 0$, ASKIT can better approximate $K$ using nearest-neighbor pruning (the MatVec error will be smaller), but we can no longer compute the exact inverse directly. Still, InvASKIT can be a good preconditioner. In this case, converging to the desired accuracy may take several GMRES iterations, but the classification accuracy may also be improved.
Dataset   $N$     test points  $d$   $h$
COVTYPE   500K    10K          54    0.07
SUSY      4.5M    10K          8     0.07
MNIST     1.6M    10K          784   0.10
NORMAL    1M-32M  --           64    0.31-0.17
Datasets: In Table II, we give details of each dataset. In the strong scaling and classification tasks on Maverick, we use three real-world datasets: SUSY (high-energy physics) and COVTYPE (forest cartographic variables) from [21], and MNIST (handwritten digit recognition) from [6]. In the weak scaling tasks, we generate a random 64-dimensional point cloud drawn from a 6-dimensional NORMAL distribution and embedded in 64 dimensions with additional noise. This resembles a dataset with a high ambient but relatively small intrinsic dimension, something typical in data analysis. We also report strong scaling results using a 16M-point NORMAL synthetic dataset.
Bandwidth selection and accuracy metrics. For the majority of the experiments we used the Gaussian kernel. The bandwidth values $h$ are taken from [24], where they were selected using cross-validation with a kernel Bayes classifier. The regularization parameter $\lambda$ is also selected by cross-validation. We test both versions of ASKIT (defined by the point-node separation criterion discussed in §II-B), which correspond to $S = 0$ and $S \neq 0$ in (9). We use three metrics to measure the quality of the MatVec and Invert approximations of ASKIT and InvASKIT:^{5} (^{5}Reminder on the notation: here $K$ is the exact kernel matrix and $\tilde{K}$ is its ASKIT approximation. The inverse is based on the ASKIT approximation.)
$\epsilon_K = \frac{\|K w - \tilde{K} w\|_2}{\|K w\|_2}, \qquad \epsilon_s = \frac{\|w - (\tilde{K} + \lambda I)^{-1} (\tilde{K} + \lambda I) w\|_2}{\|w\|_2}, \qquad \epsilon = \frac{\|(\tilde{K} + \lambda I) w - u\|_2}{\|u\|_2}.$  (17)
The first error, $\epsilon_K$, measures the ASKIT MatVec error. If the off-diagonal blocks of $K$ indeed admit low rank, then $\epsilon_K$ should be small. Since evaluating $K w$ exactly ($O(N^2)$ work) is impossible for large $N$, we estimate $\epsilon_K$ by sampling 1K points. We measure the conditioning of InvASKIT’s Invert for the case $S = 0$ using $\epsilon_s$: if $\tilde{K} + \lambda I$ is well-conditioned, then $\epsilon_s$ should be small. We test InvASKIT as a preconditioner for GMRES and report the normalized residual $\epsilon$ upon termination of GMRES. Typically the desired $\epsilon$ for the classification tasks is 1e-3; however, if GMRES terminates upon reaching the maximum number of iterations, then $\epsilon$ will be larger. Overall, we measure the quality of a preconditioner by the required number of iterations (#iter) and the achieved accuracy $\epsilon$.
Kernel regression for binary classification. Given a set of points with binary labels^{6}, we split these points into a training set and a testing set. We use the training set to compute the weights for the regression: given the correct classification labels $u$, with $u_i \in \{-1, +1\}$ for each training point $x_i$, we solve $(K + \lambda I) w = u$ for the regression weights $w$. Once we compute these weights, the predicted label for a test point $x$ is the sign of $K(x, X) w$. By comparing the true labels of the test points (which are given) to the ones computed using kernel regression, we estimate the classification accuracy as the percentage of correctly classified points. The classification errors are reported in the next section. (^{6}The sets MNIST and COVTYPE have several classes; for those we perform a one-vs-all binary classification. Taking MNIST as an example, the two classes in the binary task are whether or not a point is a given digit.)
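A self-contained NumPy sketch of this pipeline (with a synthetic two-blob dataset instead of the real ones, and a dense solve in place of InvASKIT; the bandwidth and regularization values are arbitrary) looks as follows:

```python
import numpy as np

def gram(A, B, h):
    """Gaussian kernel block K(A, B)."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * h * h))

rng = np.random.default_rng(6)
n = 200
# Toy binary problem: two Gaussian blobs with labels -1 and +1.
X = np.vstack([rng.normal(-1.0, 1.0, (n // 2, 2)), rng.normal(1.0, 1.0, (n // 2, 2))])
u = np.hstack([-np.ones(n // 2), np.ones(n // 2)])
Xt = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
ut = np.hstack([-np.ones(50), np.ones(50)])

h, lam = 1.0, 1e-2
w = np.linalg.solve(gram(X, X, h) + lam * np.eye(n), u)  # train: (K + lam I) w = u
pred = np.sign(gram(Xt, X, h) @ w)                       # test: sign of the regression
print((pred == ut).mean())                               # fraction correctly classified
```

In the paper’s experiments, the dense `np.linalg.solve` call is exactly the step that InvASKIT (or preconditioned GMRES) replaces at scale.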
IV Empirical Results
We present three sets of results: weak scaling, strong scaling, and kernel regression. We conducted 29 experiments, labeled #1 to #29 in the tables. In Table III, we report numerical accuracy and strong scaling results. In Table IV, we report weak scaling for both Gaussian and polynomial kernels. Finally, we apply InvASKIT as a preconditioner in the binary classification problem. We show that InvASKIT speeds up the convergence rate by up to 5×, and thus it is a good preconditioner; we expect this to be the case when the norm of $S$ is sufficiently smaller than that of $H$. In the results, $\tau$ indicates the accuracy tolerance used in the skeletonization phase of the ASKIT algorithm, $s$ is the maximum skeleton size, and $\kappa$ is the number of nearest neighbors.
Table III: strong-scaling results.

#      cores   solve (s)   factor (s)   efficiency
COVTYPE
#1     160     1.1         145          1.00
#2     320     0.6         77           0.94
#3     640     0.3         42           0.86
SUSY
#4     320     2.6         1307         1.00
#5     640     1.3         681          0.96
MNIST
#6     160     2.5         16           1.00
#7     320     1.3         9            0.88
#8     640     0.5         5            0.80
NORMAL
#9             1.5         96           1.00
#10            0.8         52           0.92
#11            0.4         30           0.80
Accuracy. In Table III and Table IV, we report ASKIT's MatVec error for reference, but our focus is on the errors related to InvASKIT. Note that the factorization error is not always small, due to the amplification of rounding errors by the conditioning of K + λI. Both the regularization parameter λ and the kernel bandwidth have a strong impact on the condition number. For a Gaussian kernel matrix, a smaller bandwidth usually results in a better-conditioned system, since the matrix approaches the identity. Runs #6–#8 are examples where the systems are well-conditioned. With a larger bandwidth and less regularization, K + λI becomes nearly rank-deficient. In #12–#19 we see that λ is not sufficiently large, which is why we observe large errors.
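The two conditioning regimes can be observed directly on a small Gaussian kernel matrix; the sizes, bandwidths, and regularization values below are arbitrary illustrative choices:

```python
import numpy as np

def gaussian_condition_number(X, h, lam=0.0):
    """Condition number of K + lam*I for the Gaussian kernel with bandwidth h."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * h * h))
    return np.linalg.cond(K + lam * np.eye(len(X)))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))
for h in (0.1, 10.0):          # small bandwidth: K ~ identity; large: K ~ rank-one
    for lam in (0.0, 1e-2):    # regularization lifts the small eigenvalues
        c = gaussian_condition_number(X, h, lam)
        print(f"h={h:5.1f}  lam={lam:.0e}  cond={c:.2e}")
```

A small bandwidth gives a nearly diagonal, well-conditioned matrix; a large bandwidth gives a nearly rank-one matrix whose conditioning is rescued only by the regularization term.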
Table IV: weak-scaling results (Gaussian and polynomial kernels).

               Gaussian                        Polynomial
#              #12    #13    #14    #15        #16    #17    #18    #19
N              1M     4M     16M    32M        1M     4M     16M    32M
cores          20     80     320    640        20     80     320    640
               0.31   0.24   0.19   0.17
               1e-1   2e-1   3e-1   3e-1
error          1e-7   1e-7   6e-7   1e-6       6e-8   3e-7   1e-6   2e-6
comm (s)       0.0    0.1    0.1    0.2        0.0    0.0    0.1    0.1
factor (s)     295    396    509    574        300    401    509    574
solve (s)      2.5    3.1    3.8    4.2        2.5    3.1    3.8    4.2
total (s)      298    399    513    578        303    404    513    578
TFLOPS         0.3    1.1    4.2    8.2        0.3    1.1    4.3    8.2
efficiency     1.00   0.92   0.88   0.85       1.00   0.92   0.89   0.85
Scaling. We now discuss the weak- and strong-scaling results in Table III and Table IV. The overall InvASKIT time is divided into the factorization time and the solving time. The factorization time is split into (1) communication and (2) computation; the computation time is further subdivided into local and distributed parts. The communication part of the solving time is negligible, so we do not split it. Taking #1 as an example, the factorization time is 145 seconds, of which the communication time is a negligible fraction, and the solving time is 1.1 seconds, for an overall time of about 146 seconds.
In Table III we report the strong-scaling efficiency. The baseline experiments are #1, #4, #6 and #9, which have efficiency 1.00. Fixing the problem size, we measure the efficiency while doubling the number of cores each time. We observe that the communication cost is insignificant; roughly speaking, doubling the number of cores reduces the efficiency by 4%–12%. The efficiency degradation is due to diminishing parallelism during the computation of (5), where only the root rank in each subcommunicator works. We also measure the FLOPS efficiency relative to the theoretical peak, computed from the aggregate peak of the E5-2680 v2 processors for the 640-core runs on Maverick and the 4096-core runs on Stampede. The ratio FLOPS/peak, from low to high, is MNIST2M 8%, NORMAL 20%, COVTYPE 36% and SUSY 45%.
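Under our reading of the Table III columns, the reported strong-scaling efficiency is reproduced by a one-line helper (the baseline is the smallest-core run of each dataset):

```python
def strong_scaling_efficiency(t_base, p_base, t, p):
    """Efficiency of a run on p cores taking t seconds, relative to a
    baseline on p_base cores taking t_base seconds; perfect strong
    scaling (time halves when cores double) gives 1.0."""
    return (t_base * p_base) / (t * p)

# Runs #1 -> #2 in Table III: 145 s on 160 cores vs 77 s on 320 cores.
print(round(strong_scaling_efficiency(145, 160, 77, 320), 2))  # prints 0.94
```

The same formula gives 0.86 for run #3 (42 s on 640 cores), matching the table.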
The scalability varies between datasets during the factorization because of load-balancing issues in the level-by-level traversal. Adaptive skeletonization results in different skeleton sizes for nodes in the same tree level, and the node with the largest skeleton becomes the critical path at each level. The load-balancing problem is especially serious for MNIST, where the workload can vary by orders of magnitude. On the other hand, SUSY has the most balanced workload, which allows it to reach 45% of peak floating-point efficiency. We also report the MatVec performance. The MatVec in ASKIT is highly optimized and does not require any communication; typically, it reaches near-linear scaling and more than 80% of peak floating-point efficiency.
In Table IV, we report weak-scaling results. We perform experiments using Gaussian and polynomial kernels and observe similar behavior for both. We report the FLOPS rate and the relative parallel efficiency. The baseline experiments are #12 and #16. The efficiency drops dramatically from the 20-core runs to the 80-core runs, since #12 and #16 do not involve MPI. From 80 cores to 640 cores, each doubling of the number of cores results in only a 3% loss, which shows that the weak scalability is almost linear. The weak-scaling experiments reach a higher efficiency than the strong-scaling ones, since there are no significant communication or load-balancing issues. As a result, InvASKIT reaches 52% of the theoretical peak floating-point efficiency.
#      time (s)   #iter   Train (%)   Test (%)   residual
COVTYPE
#20    231        100     99          96         4e-3
#21    62         0       99          96         2e-11
COVTYPE
#22    1725       100     99          97         3e-3
#23    950        43      99          97         1e-3
SUSY
#24    3056       49      80          79         1e-3
#25    870        0       80          79         1e-14
SUSY
#26    11519      47      80          79         1e-3
#27    3419       7       80          79         1e-3
MNIST
#28    185        5       100         100        1e-9
#29    36         0       100         100        2e-14
Classification. We conduct three classification experiments, following [24]. The training time is presented in pairs to show both the runtime and the number of iterations of the GMRES solver. Four experiments are conducted for each dataset, combining different kernel matrix approximations (with or without the nearest-neighbor contributions in the ASKIT MatVec) and different preconditioners (no preconditioning or InvASKIT). We present both training and testing accuracy (the test set has 10K samples), and we report the normalized residual upon termination of GMRES.
Approximating the kernel matrix with ASKIT, we reach 96% accuracy on COVTYPE, 79% on SUSY, and 100% on MNIST. Using the nearest-neighbor-based separation criterion, which adds a nonzero near-field contribution, we reach a higher accuracy (97%) on COVTYPE. In #21, #25, and #29, we use the InvASKIT solution as the initial guess, which allows GMRES to converge immediately to high precision by only applying the preconditioner; the training time then involves only the factorization time and the solving time, which are both deterministic. Without a preconditioner, GMRES takes 3 to 5× longer to converge to the desired accuracy. During cross-validation, the time spent on training may be much longer, depending on the parameter combination.
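The preconditioned solve and the factorization-based initial guess can be sketched with SciPy's GMRES. Here an LU factorization of a slightly perturbed matrix plays the role of InvASKIT's approximate factorization; the sizes, bandwidth, and perturbation are arbitrary illustrative choices:

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(0)
N, lam = 300, 1e-2
X = rng.standard_normal((N, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
A = np.exp(-sq / 2.0) + lam * np.eye(N)            # K + lam*I, Gaussian kernel, h = 1
y = rng.choice([-1.0, 1.0], size=N)                # labels as the right-hand side

# Stand-in for InvASKIT: factorize an approximation of A once, ...
A_tilde = A + 1e-5 * rng.standard_normal((N, N))
lu = lu_factor(A_tilde)
M = LinearOperator((N, N), matvec=lambda v: lu_solve(lu, v))

# ... then use it both for the initial guess and as the preconditioner.
w0 = lu_solve(lu, y)
w, info = gmres(A, y, x0=w0, M=M, maxiter=100)     # info == 0 on convergence
```

With an accurate approximate factorization the initial residual is already small, so GMRES needs few (or zero) iterations, mirroring runs #21, #25, and #29.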
In runs #23 and #27, we show that InvASKIT also works well as a preconditioner when we include the nearest neighbors in the ASKIT MatVec approximation. Comparing #22 to #23, we find that #23 with InvASKIT takes only 43 iterations to converge to 1e-3, whereas #22 reaches the maximum iteration limit, converging only to 3e-3. Involving nearest neighbors can better approximate K in some cases; for example, #23 reaches a higher classification accuracy than #21. Overall, InvASKIT provides a 2 to 3× speedup. For non