    # Distributed-memory Hierarchical Interpolative Factorization

The hierarchical interpolative factorization (HIF) offers an efficient way for solving or preconditioning elliptic partial differential equations. By exploiting locality and low-rank properties of the operators, the HIF achieves quasi-linear complexity for factorizing the discrete positive definite elliptic operator and linear complexity for solving the associated linear system. In this paper, the distributed-memory HIF (DHIF) is introduced as a parallel and distributed-memory implementation of the HIF. The DHIF organizes the processes in a hierarchical structure and keep the communication as local as possible. The computation complexity is O(N N/P) and O(N/P) for constructing and applying the DHIF, respectively, where N is the size of the problem and P is the number of processes. The communication complexity is O(√(P)^3 P)α + O(N^2/3/√(P))β where α is the latency and β is the inverse bandwidth. Extensive numerical examples are performed on the NERSC Edison system with up to 8192 processes. The numerical results agree with the complexity analysis and demonstrate the efficiency and scalability of the DHIF.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

This paper proposes an efficient distributed-memory algorithm for solving elliptic partial differential equations (PDEs) of the form,

 −∇⋅(a(x)∇u(x))+b(x)u(x)=f(x)x∈Ω⊂R3, (1)

with a certain boundary condition, where , and are given functions and is an unknown function. Since this elliptic equation is of fundamental importance to problems in physical sciences, solving (1) effectively has a significant impact in practice. Discretizing this with local schemes such as the finite difference or finite element methods leads to a sparse linear system,

 Au=f, (2)

where is a sparse symmetric matrix with non-zero entries with being the number of the discretization points, and and are the discrete approximations of the functions and , respectively. For many practical applications, one often needs to solve (1) on a sufficient fine mesh for which can be very large, especially for three dimensional (3D) problems. Hence, there is a practical need for developing fast and parallel algorithms for the efficient solution of (1).

### 1.1 Previous work

A great deal of effort in the field of scientific computing has been devoted to the efficient solution of (2). Beyond the

complexity naïve matrix inversion approach, one can classify the existing fast algorithms into the following groups.

The first one consists of the sparse direct algorithms, which take advantage of the sparsity of the discrete problem. The most noticeable example in this group is the nested dissection multifrontal method (MF) method [16, 14, 26]. By carefully exploring the sparsity and the locality of the problem, the multifrontal method factorizes the matrix (and thus ) as a product of sparse lower and upper triangular matrices. For 3D problems, the factorization step costs operations while the application step takes operations. Many parallel implementations [3, 4, 30] of the multifrontal method were proposed and they typically work quite well for problem of moderate size. However, as the problem size goes beyond a couple of millions, most implementations, including the distributed-memory ones, hit severe bottlenecks in memory consumption.

The second group consists of iterative solvers [9, 32, 33, 15], including famous algorithms such as the conjugate gradient (CG) method and the multigrid method. Each iteration of these algorithms typically takes steps and hence the overall cost for solving (2) is proportional to the number of iterations required for convergence. For problems with smooth coefficient functions and , the number of iterations typically remains rather small and the optimal linear complexity is achieved. However, if the coefficient functions lack regularity or have high contrast, the iteration number typically grows quite rapidly as the problem size increases.

The third group contains the methods based on structured matrices [7, 6, 8, 11]. These methods, for example the -matrix [18, 20], the -matrix , and the hierarchically semi-separable matrix (HSS) [10, 41], are shown to have efficient approximations of linear or quasi-linear complexity for the matrices and . As a result, the algebraic operations of these matrices are of linear or quasi-linear complexities as well. More specifically, the recursive inversion and the rocket-style inversion  are two popular methods for the inverse operation. For distributed-memory implementations, however, the former lacks parallel scalability [24, 25] while the latter demonstrates scalability only for 1D and 2D problems . For 3D problems, these methods typically suffer from large prefactors that make them less efficient for practical large-scale problems.

A recent group of methods explore the idea of integrating the MF method with the hierarchical matrix [28, 40, 38, 39, 17, 37, 21] or block low-rank matrix [34, 35, 2] approach in order to leverage the efficiency of both methods. Instead of directly applying the hierarchical matrix structure to the 3D problems, these methods apply it to the representation of the frontal matrices (i.e., the interactions between the lower dimensional fronts). These methods are of linear or quasi-linear complexities in theory with much small prefactors. However, due to the combined complexity, the implementation is highly non-trivial and quite difficult for parallelization [27, 42].

More recently, the hierarchical interpolative factorization (HIF) [22, 23] is proposed as a new way for solving elliptic PDEs and integral equations. As compared to the multifrontal method, the HIF includes an extra step of skeletonizing the fronts in order to reduce the size of the dense frontal matrices. Based on the key observation that the number of skeleton points on each front scales linearly as the one-dimensional fronts, the HIF factorizes the matrix (and thus ) as a product of sparse matrices that contains only non-zero entries in total. In addition, the factorization and application of the HIF are of complexities and , respectively, for

being the total number of degrees of freedom (DOFs) in (

2). In practice, the HIF shows significant saving in terms of computational resources required for 3D problems.

### 1.2 Contribution

This paper proposes the first distributed-memory hierarchical interpolative factorization (DHIF) for solving very large scale problems. In a nutshell, the DHIF organizes the processes in an octree structure in the same way that the HIF partitions the computation domain. In the simplest setting, each leaf node of the computation domain is assigned a single process. Thanks to the locality of the operator in (1), this process only communicates with its neighbors and all algebraic computations are local within the leaf node. At higher levels, each node of the computation domain is associated with a process group that contains all processes in the subtree starting from this node. The computations are all local within this process group via parallel dense linear algebra and the communications are carried out between neighboring process groups. By following this octree structure, we make sure that both the communication and computations in the distributed-memory HIF are evenly distributed. As a result, the distributed-memory HIF implementation achieves and parallel complexity for constructing and applying the factorization, respectively, where is the number of DOFs and is the number of processes.

We have performed extensive numerical tests. The numerical results support the complexity analysis of the distributed-memory HIF and suggest that the DHIF is a scalable method up to thousands of processes and can be applied to solve large scale elliptic PDEs.

### 1.3 Organization

The rest of this paper is organized as follow. In Section 2, we review the basic tools needed for both HIF and DHIF, and review the sequential HIF. Section 3 presents the DHIF as a parallel extension of the sequential HIF for 3D problems. Complexity analyses for memory usage, computation time and communication volume are given at the end of this section. The numerical results detailed in Section 4 show that the DHIF is applicable to large scale problems and achieves parallel scalability up to thousands of processes. Finally, Section 5 concludes with some extra discussions on future work.

## 2 Preliminaries

This section reviews the basic tools and the sequential HIF. First, we start by listing the notations that are widely used throughout this paper.

### 2.1 Notations

In this paper, we adopt MATLAB notations for simple representation of submatrices. For example, given a matrix and two index sets, and , represents the submatrix of with the row indices in and column indices in . The next two examples explores the usage of MATLAB notation “”. With the same settings, represents the submatrix of with row indices in and all columns. Another usage of notation “

” is to create regularly spaced vectors for integer values

and , for instance, is the same as for .

In order to simplify the presentation, we consider the problem (1) with periodic boundary condition and assume that the domain , and is discretized with a grid of size for , where and are both integers. In the rest of this paper, is known as the number of levels in the hierarchical structure and is the level number of the root level. We use to denote the total number of DOFs, which is the dimension of the sparse matrix in (2). Furthermore, each grid point is defined as

 xj=hj=h(j1,j2,j3) (3)

where , and . Figure 1: Cell Structure: top, front, left, and interior points are indicated by arrows; bottom, back, and right points are not plotted in the figure; the black dots denote the edge points; the dash line indicates that the front frame is pulled away in order to show the interior points.

In order to fully explore the hierarchical structure of the problem, we recursively bipartite each dimension of the grid into levels. Let the leaf level be level and the root level be level . At level , a cell indexed with is of size and each point in the cell is in the range, for and . denotes the grid point set of the cell at level indexed with .

A cell owns three faces: top, front, and left. Each of these three faces contains the grid points on the first frame in the corresponding direction. For example, the front face contains the grid points in . Besides these three in-cell faces (top, front, and left) that are owned by a cell, each cell is also adjacent to three out-of-cell faces (bottom, back, right) owned by its neighbors. Each of these three faces contains the grid points on the next to the last frame in the corresponding dimension. As a result, these faces contain DOFs that belong to adjacent cells. For example, the bottom face of contains the grid points in . These six faces are the surrounding faces of . One also defines the interior of to be for the same and . Figure 1 gives an illustration of a cell, its faces, and its interior. These definitions and notations are summarized in Table 1. Also included here are some notations used for the processes, which will be introduced later.

### 2.2 Sparse Elimination

Suppose that is a symmetric matrix. The row/column indices of are partitioned into three sets where refers to the interior point set, refers to the surrounding face point set, and refers to the rest point set. We further assume that there is no interaction between the indices in and the ones in . As a result, one can write in the following form

 A=⎡⎢ ⎢⎣AIIATFIAFIAFFATRFARFARR⎤⎥ ⎥⎦. (4)

Let the decomposition of be , where is lower triangular matrix with unit diagonal. According to the block Gaussian elimination of given by (4), one defines the sparse elimination to be

 STIASI=⎡⎢⎣DIBFFATRFARFARR⎤⎥⎦, (5)

where is the associated Schur complement and the explicit expressions for is

 SI=⎡⎢⎣L−TI−A−1IIATFIII⎤⎥⎦. (6)

The sparse elimination removes the interaction between the interior points and the corresponding surrounding face points and leaves and untouched. We call the entire point set, , the active point set. Then, after the sparse elimination, the interior points are decoupled from other points, which is conceptually equivalent to eliminate the interior points from the active point set. After this, the new active point set can be regarded as . Figure 2: Sparse elimination: the interior points are eliminated after the sparse elimination; the rest points are not all plotted in the figure.

Figure 2 illustrates the impact of the sparse elimination. The dots in the figure represent the active points. Before the sparse elimination (left), edge points, face points and interior points are active while after the sparse elimination (right) the interior points are eliminated from the active point set.

### 2.3 Skeletonization

Skeletonization is a tool for eliminating redundant point set from a symmetric matrix that has low-rank off-diagonal blocks. The key step in skeletonization uses the interpolative decomposition [12, 29] of low-rank matrices.

Let be a symmetric matrix of the form,

 A=[AFFATRFARFARR], (7)

where is a numerically low-rank matrix. The interpolative decomposition of is (up to a permutation)

 ARF=[AR\widebar\widebarFARˆF]≈[ARˆFTFARˆF], (8)

where is the interpolation matrix, is the skeleton point set, is the redundant point set, and . Applying this approximation to results

 A≈⎡⎢ ⎢ ⎢ ⎢⎣A\widebar\widebarF\widebar\widebarFATˆF\widebar\widebarFTTFATRˆFAˆF\widebar\widebarFAˆFˆFATRˆF\bigstrut[b]\bigstrut[t]ARˆFTFARˆFARR⎤⎥ ⎥ ⎥ ⎥⎦, (9)

and be symmetrically factorized as

 ST\widebar\widebarFQTFAQFS\widebar\widebarF≈ST\widebar\widebarF⎡⎢ ⎢ ⎢ ⎢⎣B\widebar\widebarF\widebar\widebarFBTˆF\widebar\widebarFBˆF\widebar\widebarFAˆFˆFATRˆF\bigstrut[b]\bigstrut[t]ARˆFARR⎤⎥ ⎥ ⎥ ⎥⎦S\widebar\widebarF=⎡⎢ ⎢ ⎢⎣D\widebar\widebarFBˆFˆFATRˆF\bigstrut[b]\bigstrut[t]ARˆFARR⎤⎥ ⎥ ⎥⎦, (10)

where

 B\widebar\widebarF\widebar\widebarF = A\widebar\widebarF\widebar\widebarF−TTFAˆF\widebar\widebarF−ATˆF\widebar\widebarFTF+TTFAˆFˆFTF, (11) BˆF\widebar\widebarF = AˆF\widebar\widebarF−AˆFˆFTF, (12) BˆFˆF = AˆFˆF−BˆF\widebar\widebarFB−1\widebar\widebarF\widebar\widebarFBTˆF\widebar\widebarF. (13)

The factor is generated by the block Gaussian elimination, which is defined to be

 QF=⎡⎢ ⎢⎣I−TFI\bigstrut[b]\bigstrut[t]I⎤⎥ ⎥⎦. (14)

Meanwhile, the factor is introduced in the sparse elimination:

 S\widebar\widebarF=⎡⎢ ⎢ ⎢⎣L−T\widebar\widebarF−B−1\widebar\widebarF\widebar\widebarFBTˆF\widebar\widebarFI\bigstrut[b]\bigstrut[t]I⎤⎥ ⎥ ⎥⎦ (15)

where and come from the factorization of , i.e., . Similar to what happens in Section 2.2, the skeletonization eliminates the redundant point set from the active point set. Figure 3: Skeletonization: the working face is colored by red and pink; red points are the skeleton points on the face whereas pink points are the redundant points on the face; skeletonization eliminates the redundant points from the active point set.

The point elimination idea of the skeletonization is illustrated in Figure 3. Before the skeletonization (left), the edge points, interior points, skeleton face points (red) and redundant face points (pink) are all active, while after the skeletonization (right) the redundant face points are eliminated from the active point set.

### 2.4 Sequential HIF

This section reviews the sequential hierarchical interpolative factorization (HIF) for 3D elliptic problems (1) with the periodic boundary condition. Without loss of generality, we discretize (1) with the seven-point stencil on a uniform grid, which is defined in Section 2.1. The discrete system is

 1h2(aj−12e1+aj+12e1+aj−12e2+aj+12e2+aj−12e3+aj+12e3)uj−1h2(aj−12e1uj−e1+aj+12e1uj+e1+aj−12e2uj−e2+aj+12e2uj+e2+aj−12e3uj−e3+aj+12e3uj+e3)+bjuj=fj (16)

at each grid point for and , where , , , and approximates the unknown function at . The corresponding linear system is

 Au=f (17)

where is a sparse SPD matrix if .

We first introduce the notion of active and inactive DOFs.

• A set of DOFs of are called active if is not a diagonal matrix or

is a non-zero matrix;

• A set of DOFs of are called inactive if is a diagonal matrix and is a zero matrix.

Here refers to the complement of the set . Sparse elimination and skeletonization provide concrete examples of active and inactive DOFs. For example, sparse elimination turns the indices from active DOFs of to inactive DOFs of in (5). Skeletonization turns the indices from active DOFs of to inactive DOFs of in (10).

With these notations, the sequential HIF in  is summarized as follows. A more illustrative representation of the sequential HIF is given on the left column of Figure 5.

• Preliminary. Let be the sparse symmetric matrix in (17), be the initial active DOFs of , which includes all indices.

• Level for .

• Preliminary. Let denote the matrix before any elimination at level . is the corresponding active DOFs. Let us recall the notations in Section 2.1. denotes the active DOFs in the cell at level indexed with . and denote the surrounding faces and interior active DOFs in the corresponding cell, respectively.

• Sparse Elimination. We first focus on a single cell at level indexed with , i.e., . To simplify the notation, we drop the superscript and subscript for now and introduce , , , and . Based on the discretization and previous level eliminations, the interior active DOFs interact only with itself and its surrounding faces. The interactions of the interior active DOFs and the rest DOFs are empty and the corresponding matrix is zero, . Hence, by applying sparse elimination, we have,

 STIAℓSI=⎡⎢ ⎢⎣DIBℓFF(AℓRF)TAℓRFAℓRR⎤⎥ ⎥⎦, (18)

where the explicit definitions of and are given in the discussion of sparse elimination. This factorization eliminates from the active DOFs of .

Looping over all cells at level , we obtain

 ˜Aℓ = (19) ˜Σℓ = Σℓ∖⋃I∈IℓI. (20)

Now all the active interior DOFs at level are eliminated from .

• Skeletonization. Each face at level not only interacts within its own cell but also interacts with faces of neighbor cells. Since the interaction between any two different faces is low-rank, this leads us to apply skeletonization. The skeletonization for any face gives,

 ST\widebar\widebarFQTF˜AℓQFS\widebar\widebarF=⎡⎢ ⎢ ⎢ ⎢ ⎢⎣˜D\widebar\widebarF˜BℓˆFˆF(˜AℓRˆF)T\bigstrut[b]\bigstrut[t]˜AℓRˆF˜AℓRR⎤⎥ ⎥ ⎥ ⎥ ⎥⎦, (21)

where is the skeleton DOFs of , is the redundant DOFs of , and refers to the rest DOFs. Due to the elimination from previous levels, scales as and contains a non-zero submatrix of size . Therefore, the interpolative decomposition can be formed efficiently. Readers are referred to Section 2.3 for the explicit forms of each matrix in (21).

Looping over all faces at level , we obtain

 Aℓ+1≈⎛⎝∏F∈FℓS\widebar\widebarFQF⎞⎠T˜Aℓ⎛⎝∏F∈FℓS\widebar\widebarFQF⎞⎠=⎛⎝∏F∈FℓS\widebar\widebarFQF⎞⎠T⎛⎝∏I∈IℓSI⎞⎠TAℓ⎛⎝∏I∈IℓSI⎞⎠⎛⎝∏F∈FℓS\widebar\widebarFQF⎞⎠=(Wℓ)TAℓWℓ, (22)

where . The active DOFs for the next level is now defined as,

 Σℓ+1=˜Σℓ∖⋃F∈Fℓ\widebar\widebarF=Σℓ∖⎛⎝⎛⎝⋃I∈IℓI⎞⎠⋃⎛⎝⋃F∈Fℓ\widebar\widebarF⎞⎠⎞⎠. (23)
• Level . Finally, and are the matrix and active DOFs at level . Up to a permutation, can be factorized as

 AL=[ALΣLΣLDR]=[LΣLI][DΣLDR][LTΣLI]:=(WL)−TD(WL)−1. (24)

Combining all these factorization results

 (25)

and

 A−1≈W0⋯WL−1WLD−1(WL)T(WL−1)T⋯(W0)T=F−1. (26)

is factorized into a multiplicative sequence of matrices and each corresponding to level is again a multiplicative sequence of sparse matrices, , and . Due to the fact that any , or contains a small non-trivial (i.e., neither identity nor empty) matrix of size , the overall complexity for strong and applying is . Hence the application of the inverse of is of computation and memory complexity.

## 3 Distributed-memory HIF

This section describes the algorithm for the distributed-memory HIF.

### 3.1 Process Tree

For simplicity, assume that there are processes. We introduce a process tree that has levels and resembles the hierarchical structure of the computation domain. Each node of this process tree is called a process group. First at the leaf level, there are leaf process groups denoted as . Here , and the superscript refers to the leaf level (level 0). Each group at this level only contains a single process. Each node at level 1 of the process tree is constructed by merging 8 leaf processes. More precisely, we denote the process group at level 1 as for , , and . Similarly, we recursively define the node at level as . Finally, the process group at the root includes all processes. Figure 4 illustrates the process tree. Each cube in the process tree is a process group. Figure 4: Process tree: 64 processes are organized in the process tree.

### 3.2 Distributed-memory method

Same as in Section 2.4, we define the grid on for , where is a small positive integer and is the level number of the root level. Discretizing (1) with seven-point stencil on the grid provides the linear system , where is a sparse SPD matrix, is the unknown function at grid points, and is the given function at grid points.

Given the process tree (Section 3.1) with processes and the sequential HIF structure (Section 2.4), the construction of the distributed-memory hierarchical interpolative factorization (DHIF) consists of the following steps.

• Preliminary. Construct the process tree with processes. Each process group owns the data corresponding to cell and the set of active DOFs in are denoted as , for and . Set and let the process group own , which is a sparse tall-skinny matrix with non-zero entries.

• Level for .

• Preliminary. Let denote the matrix before any elimination at level . denotes the active DOFs owned by the process group for , , and the non-zero submatrix of is distributed among the process group using the two-dimensional block-cyclic distribution.

• Sparse Elimination. The process group owns , which is sufficient for performing sparse elimination for . To simplify the notation, we define as the active interior DOFs of cell , as the surrounding faces, and as the rest active DOFs. Sparse elimination at level within the process group performs essentially

 STIAℓSI=⎡⎢ ⎢⎣DIBℓFF(AℓRF)TAℓRFAℓRR⎤⎥ ⎥⎦, (27)

where ,

 SI=⎡⎢ ⎢⎣(LℓI)−T−(AℓII)−1(AℓFI)TII⎤⎥ ⎥⎦ (28)

with . Since is owned locally by , both and are local matrices. All non-trivial (i.e., neither identity nor empty) submatrices in are formed locally and stored locally for application. On the other hand, updating on requires some communication in the next step.

• Communication after sparse elimination. After all sparse eliminations are performed, some communication is required to update for each cell . For the problem (1) with the periodic boundary conditions, each face at level is the surrounding face of exactly two cells. The owning process groups of these two cells need to communicate with each other to apply the additive updates, a submatrix of . Once all communications are finished, the parallel sparse elimination does the rest of the computation, which can be conceptually denoted as,

 ˜Aℓ=⎛⎝∏I∈IℓSI⎞⎠TAℓ⎛⎝∏I∈IℓSI⎞⎠,˜Σℓj=Σℓj∖⋃I∈IℓI, (29)

for .

• Skeletonization. For each face owned by , the corresponding matrices is stored locally. Similar to the parallel sparse elimination part, most operations are local at the process group and can be carried out using the dense parallel linear algebra efficiently. By forming a parallel interpolative decomposition (ID) for , the parallel skeletonization can be, conceptually, written as,

 S\widebar\widebarFQF˜Aℓ(QF)T(S\widebar\widebarF)T≈⎡⎢ ⎢ ⎢ ⎢⎣D\widebar\widebarF˜BℓˆFˆF˜AℓRˆF\bigstrut[b]\bigstrut[t]˜AℓRˆF˜AℓRR⎤⎥ ⎥ ⎥ ⎥⎦, (30)

where the definitions of and are given in the discussion of skeletonization. Since , , and are all owned by , it requires only local operations to form

 ˜Bℓ\widebar\widebarF\widebar\widebarF=˜Aℓ\widebar\widebarF\widebar\widebarF−(TℓF)T˜AℓˆF\widebar\widebarF−(˜AℓˆF\widebar\widebarF)TTℓF+(TℓF)T˜AℓˆFˆFTℓF,˜BℓˆF\widebar\widebarF=˜AℓˆF\widebar\widebarF−˜AℓˆFˆFTℓF,˜BℓˆFˆF=˜AℓˆFˆF−˜BℓˆF\widebar\widebarF(˜Bℓ\widebar\widebarF\widebar\widebarF)−1(˜BℓˆF\widebar\widebarF)T. (31)

Similarly, , which is the factor of , is also formed within the process group . Moreover, since non-trivial blocks in and are both local, this implies that the applications of and are local operations. As a result, the parallel skeletonization factorizes conceptually as,

 (32)

and we can define

 Wℓ=⎛⎝∏I∈IℓSI⎞⎠⎛⎝∏F∈FℓS\widebar\widebarFQF⎞⎠,Σℓ+1/2j=˜Σℓj∖⋃F∈Fℓ\widebar\widebarF=Σℓj∖⎛⎝⎛⎝⋃F∈Fℓ\widebar\widebarF⎞⎠⋃⎛⎝⋃I∈IℓI⎞⎠⎞⎠. (33)

We would like to emphasize that the factors are evenly distributed among the process groups at level and that all non-trivial blocks are stored locally.

• Merging and Redistribution. Towards the end of the factorization at level , we need to merge the process groups and redistribute the data associated with the active DOFs in order to prepare for the work at level . For each process group at level , , for , , we first form its active DOF set by merging from all its children , where . In addition, is separately owned by . A redistribution among is needed in order to reduce the communication cost for future parallel dense linear algebra. Although this redistribution requires a global communication among , the complexities for message and bandwidth are bounded by the cost for parallel dense linear algebra. Actually, as we shall see in the numerical results, its cost is far lower than that of the parallel dense linear algebra.

• Level Factorization. The parallel factorization at level is quite similar to the sequential one. After factorizations from all previous levels, is distributed among . A parallel factorization of among the processes in results

 (34)

Consequently, we forms the DHIF for and as

 (35)

and

 A−1≈W0⋯WL−1WLD−1(WL)T(WL−1)T⋯(W0)T=F−1. (36)

The factors, are evenly distributed among all processes and the application of is basically a sequence of parallel dense matrix-vector multiplications.