I Introduction
Symmetric positive definite systems of linear equations arise in the solution of many scientific and engineering problems. Efficient solution of such linear system is important for the overall performance of the application codes. In this paper, we consider direct methods for solving a sparse symmetric positive definite linear system, which are based on Cholesky factorization. While direct methods can be expensive for large matrices, in terms of execution times and storage requirement when compared to iterative methods, they have the advantage that they terminate in a finite number of operations. Also, direct methods can handle linear systems that are ill conditioned or the situation when there are many multiple righthand sides. An example of illconditioned linear systems is in the computation of interior eigenvalues of a matrix using the shiftinvert Lanczos algorithm.
We propose a new implementation of sparse Cholesky factorization using an asynchronous task paradigm. We introduce a parallel distributed memory solver called symPACK. By using a taskbased formalism and dynamic scheduling techniques within a node, symPACK achieves good strong scaling on modern supercomputers.
An outline of the paper is as follows. In Section II, we provide some background on sparse Cholesky factorization. In Section III, we present our implementation in symPACK. The asynchronous paradigm is described in Section IV. Some numerical results are presented in Section V, followed by some concluding remarks in Section VI
Ii Background on Cholesky factorization
In the following, we give some background on Cholesky factorization and how symmetry and sparsity can be taken into account. We first review the basic Cholesky algorithm for dense matrices and then detail how it can be modified to handle sparse matrices efficiently. We also present fundamental notions on sparse matrix computations before reviewing the work related to sparse Cholesky factorization.
Iia The basic algorithms
Let be an by symmetric positive definite matrix. The Cholesky algorithm factors the matrix into
(1) 
where is a lower triangular matrix, and is the transpose of and is upper triangular. The factorization thus allows symmetry to be exploited, since only needs to be computed and saved.
The basic Cholesky factorization algorithm, given in Alg. 1, can be described as follows:

Current column of is computed using column of .

Column of is used to update the remaining columns of .
If is a dense matrix, then every column , , is updated.
Once the factorization is computed, the solution to the original linear system can be obtained by solving two triangular linear systems using the Cholesky factor .
IiB Cholesky factorization of sparse matrices
For largescale applications, is often sparse, meaning that most of the elements of are zero. When the Cholesky factorization of is computed, some of the zero entries will turn into nonzero (due to the subtraction operations in the column updates; see Alg. 1). The extra nonzero entries are referred to as fillin. For indepth discussion of sparse Cholesky factorization, the reader is referred to [1].
Following is an important observation in sparse Cholesky factorization. It is expected that the columns of will become denser and denser as one moves from the left to the right. This is due to the fact that the fillin in one column will result in additional fillin in subsequent columns. Thus, it is not uncommon to find groups of consecutive columns that eventually share essentially the same zerononzero structure. Such a group of columns is referred to as a supernode. To be specific, if columns , , , form a supernode, then the diagonal block of these columns will be completely dense, and row , , within the supernode is either entirely zero or entirely nonzero.
Fillin entries and supernodes of a sample symmetric matrix are depicted in Figure (a)a. In this example, 10 supernodes are found. Fillin entries are created in supernode 8 because of the nonzero entries in supernode 6.
The elimination tree of (or ) is a very important and useful tool in sparse Cholesky factorization. It is an acyclic graph that has vertices , with corresponding to column of . Suppose . There is an edge between and in the elimination tree if and only if is the first offdiagonal nonzero entry in column of . Thus, is called the parent of and is a child of . The elimination tree contains a lot of information regarding the sparsity structure of and the dependency among the columns of . See [2] for details.
An elimination tree can be expressed in terms of supernodes rather than column. In such a case, it is referred to as a supernodal elimination tree. An example of such tree is depicted in Figure (b)b.
IiC Scheduling in parallel sparse Cholesky factorization
In the following, we discuss scheduling of the computation in the numerical factorization. The only constraints that have to be respected are the numerical dependencies among the columns: column of has to be updated by column of , for any such that , but the order in which the updates occur is mathematically irrelevant, as long as the updates are performed before column of is factored. There is therefore significant freedom in the scheduling of computational tasks that factorization algorithms can exploit.
For instance, on sequential platforms, this has led to two wellknown variants of the Cholesky factorization algorithm: leftlooking and rightlooking schemes, which have been introduced in the context of dense linear algebra [3]. In the leftlooking algorithm, before column of is factored, all updates coming from columns of such that and are first applied. In that sense, the algorithm is “looking to the left” of column . In rightlooking, after a column has been factored, every column such that and is updated by column . The algorithm thus “looks to the right” of column .
Distributed memory platforms add the question of where the computations are going to be performed. Various parallel algorithms have been proposed in the literature for Cholesky factorization, such as MUMPS [4], which is based on the multifrontal approach (a variant of rightlooking), and PASTIX [5], which is leftlooking.
In [6]
, the author classifies parallel Cholesky algorithms into three families:
fanin, fanout and fanboth.The fanin family includes all algorithms such that all updates from a column to other columns , for such that , are computed on the processor owning column . When one of these columns, say , will be factored, the processor owning will have to “fanin” (or collect) updates from previous columns.
The fanout family includes algorithms that compute updates from column to columns , for such that , on processors owning columns . This means that the processor owning column has to “fanout” (or broadcast) column of the Cholesky factor.
The fanboth family generalizes these two families to allow these updates to be performed on any processors. This family relies on computation maps to map computations to processors.
In the rest of the paper, we will use the term fanboth algorithm as a shorthand to refer to an algorithm belonging to the fanboth family (and similarly for fanin and fanout).
Iii A versatile parallel sparse Cholesky algorithm
As mentioned in the previous section, there are many ways to schedule the computations as long as the precedence constraints are satisfied. The Cholesky factorization in symPACK is inspired by the fanboth algorithm. This leads to a high level of versatility and modularity, which allows symPACK to adapt to various platforms and network topologies.
Iiia Taskbased formulation
Both fanboth and symPACK involve three types of operations: factorization, update, aggregation. We let be an by matrix, and denote these tasks using the following notation^{1}^{1}1We use MATLAB notation in this paper.:

Factorization : compute column of the Cholesky factor.

Aggregation : apply all aggregate vectors from columns , with , to column .
An example of dependencies among these tasks for three columns , and , with and , is depicted in Figure 4. After column has been factored, its updates to dependent columns and can be computed. This corresponds to tasks and . Note that both these tasks require , which has to be fannedout to these two tasks. After these two tasks have been processed, and have been computed. can now be updated using , after which is ready to be executed. After that, the task , which produces , can be executed. The two aggregate vectors and are then applied on column during the execution of task , requiring aggregate vectors to be fannedin. Finally, task can be processed. As can be observed, fanboth indeed involves data exchanges that can be observed in either fanin or fanout.
IiiB Parallel algorithm and computation maps
We now describe fanboth in a parallel setting. We assume a parallel distributed memory platform with processors ranging from to . We assume that and are cyclically distributed by supernodes of various sizes in a 1D way, as depicted in Figure (a)a. The maximum supernode size is limited to 150 columns. This has the benefit of allowing a good load balancing of nonzero entries and computation per processor, although communication might not achieve optimal load balance. An example of such a distribution using 4 distributed memory nodes, or processors, is depicted in Figure (a)a.
Ashcraft [6] introduces the concept of computation maps to guide the mapping of tasks onto processors. A mapping is a twodimensional grid that “extends” to the matrix size (i.e., by). Values represent node ranks computed using a closedform generator expression. Therefore, the by grid is not explicitly stored. A mapping is said to be by when 1 rank is found in each column of , by1 when distinct values are found in every column and by when distinct ranks are found on each row and column.
A computation map is used to map the tasks as follows:

Tasks and are mapped onto node

Tasks is mapped onto node
In a parallel setting, aggregate vectors can be further accumulated on each node to reflect the updates of all local columns residing on a given node to a given column . We let be such an aggregate vector. We have:
Given a task mapping strategy , it is important to note that the factor columns , , need to be sent to at most the number of distinct node ranks present in the lower triangular part of column of . Aggregate vectors need to be sent to the number of distinct ranks in the lower triangular part of row of .
In [6], the author discusses the worst case communication volume depending on which computation map is used. The by maps involve at most nodes in each communication step, while each step involves at most nodes for either by1 or 1by computation maps. The volume is directly impacted by the number of nodes participating to each communication step. However, the by maps require two kinds of messages to be exchanged (i.e. both factors and aggregate vectors) while 1by or by1 only require one type of message. The latency cost is therefore higher for by computation maps.
Various formulations of the Cholesky factorization can generally be described by a fanboth algorithm with appropriate computation maps. For instance, fanin and fanout are good examples, as illustrated by Figure 8. Our symPACK also uses these computation maps but it is not restricted to such task assignments. This flexible design will allow us to derive and evaluate a wider range of task mapping and scheduling strategies.
IiiC Impact of communication strategy
Without loss of generality, parallel distributed memory algorithms perform communication following two strategies. A data transfer happening between two processors (or nodes) and can be performed the following ways:

sends the data to as soon as the data is available using a push strategy,

gets the data from as soon as data is required using a pull strategy.
These two strategies will be explained in detail below.
Another very important characteristic of communication protocols is whether a communication primitive is twosided or onesided. The former requires to issue a send operation and to issue a matching receive operation. Such strategy is employed in most MPI applications. The latter strategy can be employed in two ways, and relies on the fact that the communication library is able to write/read to/from a remote memory location. Either puts directly the data into ’s memory, or gets the data directly from ’s memory. This type of communication have been introduced in MPI2 and refined in MPI3, and is also available in other libraries such as GASNet.
In the following, we discuss those strategies and their corresponding implications in the context of the sparse Cholesky factorization, and more generally in the context of direct sparse solvers. Two kinds of messages can be exchanged throughout the factorization: factors and aggregate vectors. The first type of messages corresponds to the entries in a column after it has been factorized, or in other words, to a portion of the output data of the algorithm. The second type of messages is a temporary buffer in which a given will accumulate all its updates to a remote target column residing on .
In the next two paragraphs, we suppose that two tasks and are respectively mapped onto two processors and . Task depends on data produced by task . Let denote that data.
Push strategy
First, computes task . As soon as it is done processing that task, it sends to processor . When processor selects task to be executed, the first thing done is to post a receive request, and wait until has been fetched. Once it is received, task can be processed.
Pull strategy
Let us now consider the pull strategy. Processor processes task and produces . Later on, processor selects task . It first sends a message to processor , requesting to be sent to processor . Once this transfer is completed, task can be processed.
The key difference between the push strategy and the pull strategy is therefore which processor has the responsibility to initiate the data exchange.
IiiD Asynchronous communications and deadlock situations
In the following, we discuss the use of asynchronous communication primitives in sparse matrix solvers, and more specifically in the case of Cholesky factorization. Asynchronous communications, or nonblocking communications, are often used in parallel applications in order to achieve good strong scaling and deliver high performance.
In some situations though, asynchronous communications must be used with care. Communication libraries have to resort to a certain number of buffers to perform multiple asynchronous communications concurrently. However, the space for these buffers is a limited resource, and a communication library will certainly run out of buffer space if too many asynchronous communications are performed concurrently. In such a case, the communication primitives become blocking and deadlock might occur. This latter case corresponds to the situation in which each processor has only one send buffer and one receive buffer.
Let us consider a task graph and more precisely the simpler case where that task graph is a directed tree, and analyze it in the context of matrix computation. Let us also suppose that operations are performed on entire columns of an input matrix. Let us denote:

an input matrix of dimension on which operations are going to be applied to columns.

the set of columns in and distributed onto processors in a cyclic way. We have .

: a task tree, where is the set of vertices in the tree and the set of directed edges between vertices in .

a task from a source column to a target column , of matrix . Let us assume that column is modified after has been processed.

a dependency between two tasks and . It corresponds to a communication if tasks are not mapped onto the same processor.
For sparse Cholesky factorization, such a task tree can be derived from the elimination tree of matrix . We suppose that this elimination tree has been labeled in a postorder fashion, which is generally the case. Therefore, every edge in has to respect the constraint .
Suppose that each processor has only one send buffer and one receive buffer, and that processors will push, or send, the newly produced data as soon as it has been produced. Let us also assume that prior to executing a task, the incoming data has to be received.
Such a task tree is depicted in Figure 9, with different colors corresponding to distinct processors. Let us describe how such a tree is processed by the processors.
First, each of the processors executes one task , of the bottom level. Processors to send their respective data to processor , which receives each message one by one in a sequential way.
All processors can now compute tasks , and then send their respective data to processor on which task has been assigned. This consumes the send buffer of all processors but processor .
Processor then computes task in the rightmost branch of the tree. It cannot send the data because the send buffer is currently occupied. Processor is waiting for the incoming data to task , which cannot be sent by . Hence a deadlock situation.
In order to avoid this kind of situation, tasks and messages can be scheduled in the following way:

Process tasks in nondecreasing order of target column , then in nondecreasing order of source column .

Send message in nondecreasing order of target column , then in nondecreasing order of source column and only if with respect to this ordering where is the next task scheduled onto this processor.
This problem has also been observed in [7] in the context of multifrontal factorization, in which a similar criterion has to be used.
Iv Asynchronous taskbased formulation
Modern platforms can be subject to high performance variability, and it is hard to derive an accurate model of such platforms. Obtaining good static scheduling strategies is therefore difficult. For scheduling purposes, an application is often modeled using Directed Acyclic Graphs (or DAGs). Computations are modeled as tasks, and represented by vertices in the graph while dependencies between tasks are represented by the edges. In the context of sparse matrix computations, there is no fixed task graph for a particular numerical kernel. The task graph is inherently depending upon the structure of the sparse matrix on which the computation is going to be performed. This makes the use of advanced static scheduling techniques even harder to apply.
This motivated us to rely on a dynamic scheduling approach instead, which is by nature more amenable to cope with performance variations and incomplete knowledge of the task graph. Even when the task graph is known, which is the case in sparse matrix factorizations, task completions are hard to predict on a parallel platform and dynamic scheduling is an efficient way to deal with this issue. We propose the following data structure, where each processor has:

a local task queue (LTQ), containing all the tasks statically mapped onto this processor and awaiting execution,

a ready task queue (RTQ), containing all the tasks for which precedence constraints have been satisfied and that can therefore be processed.
This is illustrated in Figure 10.
A task is represented by a source supernode and a target supernode on which computations have to be applied. Each task also has an incoming dependency counter, initially set to the number of incoming edges.
As symPACK implements a factorization similar to fanboth, three types of tasks have to be dealt with. Similarly, a message exchanged to satisfy the dependence between tasks mapped onto distinct processors is labeled by the source supernode of the receiving task and the target supernode of the receiving task.
The overall mechanism that we propose is the following: whenever a task is completed, processors owning dependent tasks are notified that new input data is now available.
As soon as a processor is done with its current computation, it periodically handles these incoming notifications by issuing onesided gets. This get operation can either be a nonblocking communication or a blocking communication. The incoming dependency counter of the corresponding task is decremented when the communication has been completed. This corresponds to a strategy similar to the pull strategy discussed earlier, the only difference being that directly notifies rather than periodically requesting data to .
When a task from the LTQ has all its dependencies satisfied (i.e., when its dependency counter reaches zero) then it is moved to the RTQ, and is now ready for execution. The processor then picks a task from the RTQ and executes it. If multiple tasks are available in the RTQ, then the next task that will be processed is picked according to a dynamic scheduling policy. As a first step, we use the same criterion for picking a task in the RTQ than the criterion that prevents deadlocks. Evaluating different scheduling policies will be the subject of a future analysis.
Iva Datadriven asynchronous communication model
Communications are becoming a bottleneck in most scientific computing applications. This is even more true for sparse linear algebra kernels, which often exhibit a higher communication to computation ratio. Moreover, modern parallel platforms have to exploit interconnects that are more complex than in the past, and often display a deeper hierarchical structure. Larger scale also has a sideeffect which can be observed on most modern platforms: performance variability. In the following, we propose a communication protocol for our parallel sparse Cholesky implementation that allow the communications to drive the scheduling. This is a crucial piece toward an Asynchronous Task execution model.
We use the UPC++ PGAS library [8, 9] for communicating between distributed memory nodes. UPC++ is built on top of GASNet [10], and introduces several parallel programming features useful for our implementation.
First, it provides global pointers for accessing memory locations on remote nodes. Using the get and put functions, one can transfer the data between two nodes in a onesided way. Moreover, these transfers are handled by RDMA calls, and are therefore generally performed without interrupting the remote processor. Using this concept of global pointers, UPC++ also allows us to allocate and deallocate memory on a node from a remote node.
Another useful feature is the ability to perform remote asynchronous function calls. A processor can submit a function for execution on a remote node. It gets pushed into a queue that the remote processor executes when calling the UPC++ progress function.
width=
We consider a data notification and communication process which is heavily based on these two features of UPC++. This process is depicted in Figure 11. Let us suppose that at the end of a computation, has produced some data that needs to be sent out to .
First, notifies that some data has been produced by sending it a pointer ptr to the data along with some metadata meta. This is done by doing an asynchronous function call to a signal(ptr,meta) function on directly from , and is referred to as step 1 in the diagram.
When finishes its current computation, it calls a poll function (step 2), whose main role is to watch for incoming communications and do the bookkeeping of task dependencies. This function resorts to UPC++ progress function to execute all asynchronous calls to the signal(ptr,meta) function, which enqueues ptr and meta into a list. This corresponds to steps 3 and 4. The next step in the poll function is to go through that list of global ptr and issue a get operation to pull the data (step 5). Note that this get can be asynchronous, but for the sake of simplicity, we suppose a blocking get operation here. Once the get operation is completed, the poll function updates the dependencies of the every task that will be using this data (which can be found by looking at the metadata meta). If all dependencies of a task are met, that task is moved into the list of readytasks RTQ (at step 6).
Finally, resumes its work by selecting a task from RTQ and run it.
As mentioned before, two types of data are encountered in Cholesky factorization: factors and aggregate vectors. Factors represent the output of the algorithm. The procedure described in Figure 11 can be applied to these factors in a straightforward way. Aggregate vectors, however, are temporary data. Hence, they need to be deleted when not required anymore; that is after step 5 of the process has been completed. UPC++ allows a process to deallocate memory on a remote process using a global pointer to that memory zone. Therefore, when dealing with aggregate vectors, will deallocate the data pointed by ptr on after it is done fetching it, without interrupting .
V Performance evaluation
In this section, we present the performance of the sparse Cholesky factorization implemented in our solver symPACK. Our experiments are conducted on the NERSC Edison supercomputer, which is based on Cray XC30 nodes. Each node has two Intel(R) Xeon(R) CPU E52695 v2 “Ivy bridge” processors with 12 cores running at 2.40GHz and 64GB of memory [11].
We evaluate the performance of symPACK, our parallel asynchronous taskbased sparse Cholesky implementation using a set of matrices from the University of Florida Sparse Matrix Collection [12]. A description of each matrix can be found in Table I.
In this paper, we analyze the performance of symPACK in a distributed memory setting only. Therefore, all experiments are conducted without multithreading (which is commonly referred to as “flatMPI”).
For sparse Cholesky factorization, the amount of fillin that occurs depends on where the nonzero elements are in the matrix. Permuting (or ordering) the matrix symmetrically changes its zerononzero structure, and hence changes the amount of fillin in the factorization process. In our experiments, a fillreducing ordering computed using Scotch [13] is applied to the original matrix. The Scotch library contains an implementation of the nested dissection algorithm [14] to compute a permutation that reduces the number of fillin entries in the Cholesky factor.
Va Impact of communication and scheduling strategy
First, we aim at characterizing the impact of the communication strategy used during Cholesky factorization. We also aim at evaluating the impact of the dynamic scheduling described in Section IV. To this end, we conduct a strong scaling experiment using the boneS10 matrix from the University of Florida Sparse Matrix collection [12]. Run times are averaged out of two runs. Results are depicted in Figure 12
, with error bars representing standard deviations. In this experiment, three variants of
symPACK are compared: Push, Pull and Pull + dynamic scheduling.The Push variant of symPACK is based on a twosided push communication protocol implemented using MPI. It uses the scheduling constraints introduced in Section IIID to prevent deadlocks. These constraints apply to both computations and communications.
The Pull variant implements a onesided pull communication protocol using UPC++, but relieves the constraints on communications while still respecting the constraints on computations. As a result, both Push and Pull executes the same static schedule for computations, but organize communication in two different ways.
We observe in Figure 12 that the Pull variant of symPACK outperforms the Push variant. This confirms that the communication protocol described in Section IVA and that relies on UPC++ to perform the onesided communications displays a negligible overhead compared to a twosided communication strategy using MPI.
This performance difference confirms that the sorting criterion that needs to be applied on both tasks and outgoing communications when using a push strategy also significantly constrains the schedule. Removing the constraints on how communications are scheduled while avoiding still deadlocks through the use of the Pull strategy allows to achieve a better scalability.
This trend is further improved by using a dynamic scheduling policy in conjunction with the Pull strategy. This confirms the dynamic scheduling as described in Section IV is a good way to improve scalability in the context of sparse matrix computations. In the rest of the paper, results corresponding to symPACK will correspond to the Pull + dynamic scheduling variant.
VB Strong scaling
In the next set of experiments, we evaluate the strong scaling of our sparse symmetric solver symPACK. We compare its performance to two stateoftheart parallel symmetric solvers: MUMPS 5.0 [4] and PASTIX 5.2.2 [5]. The package MUMPS is a wellknown sparse solver based on the multifrontal approach and that implements a symmetric factorization. The code PASTIX is based on a rightlooking supernodal formulation.
We also provide the run times achieved by SuperLU_DIST 4.3 [15, 16] as a reference. Note that SuperLU_DIST is not a symmetric code and therefore requires twice as much memory and floating point operations (if the columns are factored in the same order). However, it is well known for its good strong scaling. Therefore, only scalability trend rather than run times should be compared.
The same ordering, Scotch, is used for all solvers presented in the experiments. As this paper focuses solely on distributed memory platforms, neither PASTIX, MUMPS nor SuperLU_DIST are using multithreading. Furthermore, the term processor corresponds to a distributed memory process. Each data point corresponds to the average of three runs.
On the G3_circuit matrix, for which results are depicted in Figure 13, MUMPS and PASTIX perform better when using up to 96 and 192 processors respectively. On larger platform, symPACK becomes faster than these two stateoftheart solvers, displaying a better strong scaling. The average speedup against the fastest solver for this specific matrix is 1.07, with a minimum value of 0.24 and a maximum value of 5.70 achieved when using 2048 processors.
The performance of symPACK on a smaller number of processors can be explained by the data structures which are used to reduce the memory usage at the expense of more expensive indirect addressing operations. The G3_circuit matrix being extremely sparse, it is very likely that simpler structures with lower overhead would yield a higher level of performance. In terms of scalability, symPACK displays a favorable trend when compared to SuperLU_DIST, which scales up 192 processors on the expanded problem.
On other problems, symPACK is faster than all alternatives, as observed on Figures 14, 15, 16, 17, and 18. Detailed speedups over the best symmetric solver and the best overall solver (thus including SuperLU_DIST) are presented in Table II. The highest average speedup is achieved on the af_shell7 problem, for which symPACK can achieve an average speedup of 3.21 over the best of every other solver. The corresponding minimum speedup is 0.89 while the maximum is 7.77.
Speedup vs. sym.  Speedup vs. best  

Problem  min  max  avg.  min  max  avg. 
G3_circuit  0.24  5.70  1.07  0.24  5.70  1.07 
Flan_1565  1.06  9.40  2.11  1.06  7.07  1.94 
af_shell7  0.89  10.61  3.61  0.89  7.77  3.21 
audikw_1  1.11  14.46  3.14  1.11  2.84  1.77 
boneS10  0.86  N.A.  N.A.  0.86  4.73  1.75 
bone010  1.06  16.83  3.34  1.06  2.03  1.47 
Interestingly, SuperLU_DIST is the fastest of the stateoftheart solvers on the audikw_1 and bone010 matrices when using more than 384 processors. In those two cases, symPACK achieves an average speedup of respectively 1.77 and 1.47. If the memory constraint is such that one cannot run an unsymmetric solver like SuperLU_DIST, then symPACK achieves an average speedup of respectively 3.14 and 3.34 over the best symmetric solver. Note that on the boneS10 matrix, neither PASTIX nor MUMPS succeeded using 2048 processors.
Comments
There are no comments yet.