1 Introduction
Gaussian elimination (GE) is one of the oldest and most important numerical methods [1996Matrix, grcar2011mathematicians, yuan2012jiu]. It is the most widely used method for solving linear systems:
(1) Ax = b
where A is a square matrix of rank n, b is the user-specified right-hand side, and x is the solution vector. In the literature, this method is also called a direct method [duff2017direct, demmel200010], or a direct sparse solver in the case of sparse matrices.
In the age of Gauss (1777–1855), the number of equations was very small. In recent decades, the rank of sparse matrices has grown steadily, from thousands to millions, billions, and even trillions. Their solution has become the time and resource bottleneck of many applications.
In addition to the explosion of scale, another main challenge is making efficient use of the hierarchical architecture of modern computers [dongarra1998numerical]. Only a few highly optimized dense matrix operators (such as BLAS [dongarra1990set, anderson1999lapack, kaagstrom1998gemm]) can exploit the full performance of modern cores (CPU, GPU, TPU, …). These dense block operations are hard to implement in a sparse solver. How to construct these blocks? How to schedule them? How to reduce the communication between them? These all involve complex combinatorial optimization problems. Most of them are NP-hard [bovet1994introduction], and so cannot be solved exactly in polynomial time. There are further problems unique to sparse solvers, such as fill-in ordering (Yannakakis proved it is NP-complete [yannakakis1981computing]). Therefore, a sparse solver is not just a simple extension of GE; it needs a new framework to handle these problems.
Modern sparse solvers [schenk2004solving, li2003superlu_dist, amestoy2000mumps, Chen2018] are based on operators on blocks. The first version was the frontal solver by Bruce Irons [irons1970frontal], which builds an LU or Cholesky decomposition from the assembly of element matrices (fronts). Duff and Reid [duff2017direct] improved this idea into the multifrontal solver, which deals with several independent fronts at the same time [liu1992multifrontal]. Their research laid the foundation of the modern sparse solver, and we are still walking along the road they pointed out. This paper may be the first attempt to improve the multifrontal method from a more general, learning-based framework.
1.1 Main Modules of Sparse Solver
Modern sparse solvers can be divided into three stages (some authors use a four-stage division, with an independent fill-in ordering stage):

Symbolic analysis: includes many tasks on the symbolic (nonzero) pattern of the matrix.

Transform the matrix into a specific structure (for example, block triangular form).

Reorder the matrix to reduce fill-in (details in section 3.1).
Finding an ordering that minimizes fill-in is NP-hard [yannakakis1981computing].
Get the patterns of the L and U factors, and construct frontals from these patterns.
There are some greedy algorithms to construct frontals, but the optimal frontal partition is NP-hard; we will prove this in a later paper.
Static parallel task scheduling on these frontals (details in section 3.2).
The static task scheduling problem is NP-hard [graham1969bounds].
Further structures, such as the elimination tree [gilbert1993elimination, liu1990role].
Finally, this stage produces a permutation matrix P.


Numeric factorization: computes the L and U factors in frontal format:
(2) PA = LU
Pivoting is an important technique to improve numerical stability; for more details of adaptive pivoting, see section 3.3.
This is usually the most time-consuming stage. The key to reducing time and improving efficiency is dynamic (online) task scheduling, which is NP-hard [graham1969bounds] and much more complicated than static scheduling. For more details, see section 3.2.

Solve: performs forward and backward substitution to get the solution x:
(3) Ly = Pb, Ux = y
Dynamic task scheduling is also needed to improve the efficiency.
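The three stages above can be seen end to end through SciPy's SuperLU interface. The following is a minimal sketch on a made-up 3x3 system (not the multifrontal pipeline discussed later): splu internally performs the symbolic analysis (column ordering), the numeric factorization, and exposes the triangular solves.

```python
import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

# A small sparse system Ax = b (entries are illustrative only).
A = csc_matrix(np.array([[4.0, 1.0, 0.0],
                         [1.0, 3.0, 1.0],
                         [0.0, 1.0, 2.0]]))
b = np.array([1.0, 2.0, 3.0])

# Symbolic analysis + numeric factorization; lu.perm_r / lu.perm_c
# hold the row/column permutations chosen by the solver.
lu = splu(A)

# The solve stage: forward and backward substitution.
x = lu.solve(b)

print(np.allclose(A @ x, b))  # prints True: the residual is tiny
```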
1.2 Learning the Sparse Solver by Markov Decision Process
It is clear that there are many combinatorial optimization problems in a sparse solver. To handle them, the Markov Decision Process (MDP for short) [bellman1957markovian, gosavi2015solving] is a natural and powerful model. Over the next several sections we reveal this seemingly surprising connection: many algorithms in a sparse solver are MDPs, this process can be learned, and the performance can be improved by learning-based methods.
These methods have recently enjoyed great success in academia and industry. For example, Google's AlphaGo [silver2016mastering] defeated the Go master Lee Sedol. The engine of AlphaGo is based on deep reinforcement learning, whose theoretical basis is the MDP. It is time to use these powerful modern tools to tackle the old Gaussian elimination problem.
2 Background
In this section, we give some background knowledge needed in the following analysis and derivation.
2.1 Gaussian Elimination with Pivoting
Let us first review classical Gaussian elimination with pivoting. For more details, please see [1996Matrix, duff2017direct, foster1997growth, sherman1978algorithms].
As Algorithm 1 shows, at step k the value and pattern of A^(k) depend on A^(k-1) and the pivot position. Different pivoting strategies lead to different A^(k). For an input matrix A, we define the set of all possible A^(k) as the elimination set of A, denoted E(A), or simply E in this paper.
Figure 1 below shows the elimination process of a simple matrix; the first step uses column partial pivoting to swap two rows.
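As a concrete sketch of the behavior of Algorithm 1, here is a plain NumPy implementation of dense LU with partial pivoting (a textbook version, not the sparse frontal form used by real solvers):

```python
import numpy as np

def lu_partial_pivoting(A):
    """LU with partial pivoting: returns P, L, U such that P @ A = L @ U."""
    n = A.shape[0]
    U = A.astype(float).copy()
    L = np.eye(n)
    P = np.eye(n)
    for k in range(n - 1):
        # Pick the largest |entry| in column k, at or below the diagonal.
        p = k + np.argmax(np.abs(U[k:, k]))
        if p != k:  # swap rows in U, in P, and in the computed part of L
            U[[k, p], k:] = U[[p, k], k:]
            P[[k, p]] = P[[p, k]]
            L[[k, p], :k] = L[[p, k], :k]
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]      # multiplier
            U[i, k:] -= L[i, k] * U[k, k:]   # eliminate row i
    return P, L, U

# A tiny pivot in position (0,0) forces a row swap at the first step.
A = np.array([[1e-8, 1.0], [1.0, 1.0]])
P, L, U = lu_partial_pivoting(A)
print(np.allclose(P @ A, L @ U))  # prints True
```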
2.2 BLAS and Frontal
BLAS is an acronym for Basic Linear Algebra Subprograms [dongarra1990set, anderson1999lapack, kaagstrom1998gemm]. It is the cornerstone of modern high-performance computing. BLAS-1 routines perform scalar, vector, and vector-vector operations; BLAS-2 routines perform matrix-vector operations; and BLAS-3 routines perform matrix-matrix operations. Many operations on frontals are BLAS-3 routines.
2.3 Distributed/Heterogeneous Computing Systems
Today's distributed computing systems often contain various processors (CPU, GPU, TPU, …), so they are also heterogeneous computing systems [karatza2002load]. Such a complex system has amazing peak computing power, but the question is how to exert it. Many problems can reduce the efficiency of the whole system:

load imbalance among processes;

overhead during communication;

synchronizations for computations;

online scheduling.
A widely used model for reasoning about these problems is the following DAG model.
2.4 DAG (Directed Acyclic Graph) Based Task Scheduling
We can use a task graph to describe all tasks in a sparse solver, where the nodes correspond to tasks on frontals and the edges represent dependencies between tasks. This DAG model has been widely used in task scheduling. [wu2015hierarchical] uses a hierarchical DAG to schedule tasks between CPU and GPU, whereby the CPU gets small tiles and the GPU gets large tiles. [gupta2002improved] uses not only a task-DAG but also a data-DAG to deal with data dependencies in sparse GE. The DAGuE framework [bosilca2012dague] implements a complex DAG-based scheduling engine that handles many low-level architectural issues, such as load balancing, memory distribution, cache reuse and memory locality on non-uniform memory access (NUMA) architectures, and communication/computation overlapping. For more details of task scheduling, please see section 3.2.
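A minimal sketch of this task-DAG model follows: a toy DAG with made-up task costs is scheduled by a greedy list scheduler onto two workers (assuming unit-free costs and zero communication overhead; real solver schedulers must model much more).

```python
import heapq

# Toy task DAG: deps[t] lists the tasks t depends on; costs are invented.
deps = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"], "E": ["C"], "F": ["D", "E"]}
cost = {"A": 2, "B": 3, "C": 1, "D": 4, "E": 2, "F": 1}

def list_schedule(deps, cost, workers=2):
    """Greedy list scheduling: a task is ready when all its deps finish;
    the earliest-free worker runs the ready task that became ready first."""
    children = {t: [] for t in deps}
    indeg = {t: len(d) for t, d in deps.items()}
    for t, ds in deps.items():
        for d in ds:
            children[d].append(t)
    ready = [(0, t) for t, k in indeg.items() if k == 0]  # (earliest start, task)
    heapq.heapify(ready)
    free = [(0, w) for w in range(workers)]               # (time free, worker id)
    heapq.heapify(free)
    finish = {}
    while ready:
        est, task = heapq.heappop(ready)
        wfree, w = heapq.heappop(free)
        end = max(est, wfree) + cost[task]
        finish[task] = end
        heapq.heappush(free, (end, w))
        for c in children[task]:
            indeg[c] -= 1
            if indeg[c] == 0:  # all deps done: ready at the latest dep finish
                heapq.heappush(ready, (max(finish[d] for d in deps[c]), c))
    return max(finish.values())  # makespan

print(list_schedule(deps, cost, workers=2))  # prints 9
```

With one worker the makespan is simply the total cost (13 here); the DAG's parallelism is what the second worker exploits.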
2.5 Markov Property
The Markov property is named after the brilliant Russian mathematician Andrey Markov. Informally: in some processes, the future does not depend on the past, only on the present state. This is a very interesting and powerful property. Whether one notices it or not, many greedy/progressive algorithms are based on this property, especially many algorithms in sparse solvers. For example, MD (minimum degree ordering) always selects the node with minimum degree at the current state, and partial pivoting always selects an element in the current column/row. In the process of GE, the value and degree of each element vary from step to step, but we do not care how they changed in the past; we only check the values at the current step. The same holds for task scheduling algorithms. The Markov property is a prerequisite for many models, such as the MDP discussed below.
2.6 Markov Decision Process (MDP)
The Markov decision process (MDP for short) [bellman1957markovian, gosavi2015solving] is a powerful model. As its name suggests, it finds a solution (a decision process) under the Markov property. The standard MDP includes four elements [bellman1957markovian]:

S is the state space of the MDP.

A is the set of all actions. We also use A(s) to denote the set of actions available from state s.

R_a(s, s') is the immediate reward (or expected immediate reward) received after transitioning from state s to state s' due to action a.

A transition matrix P is a square matrix describing the transitions of the state space:
(4) P = (p_ij), where p_ij = Pr(s_{t+1} = j | s_t = i)
Each element p_ij is a nonnegative real number representing the probability of moving from state i to state j, and each row of P sums to 1: Σ_j p_ij = 1.
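A tiny numerical check of these properties (the 3-state transition matrix below is invented for illustration): every entry is a probability, every row sums to 1, and the state distribution after t steps is mu @ P^t.

```python
import numpy as np

# A toy 3-state transition matrix (made-up probabilities).
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.0, 1.0]])

# Row-stochastic: nonnegative entries, each row sums to 1.
print(bool(np.all(P >= 0) and np.allclose(P.sum(axis=1), 1.0)))  # prints True

# Starting in state 0, the distribution after 3 steps is mu0 @ P^3;
# it is still a probability distribution (sums to 1).
mu0 = np.array([1.0, 0.0, 0.0])
mu3 = mu0 @ np.linalg.matrix_power(P, 3)
```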
MDP is the theoretical foundation of reinforcement learning [kaelbling1996reinforcement]. A widely used reinforcement learning technique is Q-Learning.
2.7 Q-Learning
The MDP model provides a general paradigm and an abstract mathematical framework. For practical problems, there are many powerful techniques, such as Q-Learning [watkins1989learning, watkins1992q]. "Q" refers to the expected reward for an action taken in a given state [watkins1989learning]. At each step, Q-Learning takes some action based on its estimate of the Q-values. If we know the Q-value of every action in a state, then it is simple to find the optimal solution of the MDP.
The following is a typical framework of Q-Learning [sutton2018reinforcement].
This framework shows a key advantage of Q-Learning: avoiding the use of the transition matrix. In the case of a sparse solver, the probabilities in the transition matrix are hard to estimate, or the transition matrix needs huge memory (the number of states of a million-order matrix is astronomical!). Instead, we can always get the reward or Q-value. So in practice, Q-Learning is more suitable for studying the various combinatorial problems in GE. Watkins [watkins1992q] proved that it converges to the optimal Q-values with probability 1, so it is a reliable technique for the combinatorial optimization problems in sparse solvers.
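The tabular update at the heart of this framework can be sketched on a toy chain MDP; the states, rewards, and hyperparameters below are invented purely for illustration. The agent explores at random, yet the Q-Learning update still ranks the better action higher in every state (the off-policy property Watkins proved).

```python
import random

# Toy deterministic chain MDP: states 0, 1, and a terminal state 2.
# "right" moves one step toward the terminal, paying reward 1 on arrival;
# "left" stays put with reward 0.
ACTIONS = ["left", "right"]
GAMMA, ALPHA = 0.9, 0.5

def step(s, a):
    if a == "right":
        s2 = s + 1
        return s2, (1.0 if s2 == 2 else 0.0)
    return s, 0.0

random.seed(0)
Q = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}
for _ in range(2000):
    s = 0
    while s != 2:
        a = random.choice(ACTIONS)  # pure exploration
        s2, r = step(s, a)
        # The Q-Learning update: bootstrap from the best action at s2.
        target = r if s2 == 2 else r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (target - Q[(s, a)])
        s = s2

# The learned Q-values prefer "right" in both states.
print(Q[(0, "right")] > Q[(0, "left")] and Q[(1, "right")] > Q[(1, "left")])
```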
2.8 Offline Q-Learning
Solution time is one of the most important metrics of a sparse solver. We can avoid the training time of Q-Learning by an offline technique: first collect many matrices and train the model on these sample matrices, then apply the learned Q-values to solve a new matrix. This offline Q-Learning (or batch Q-Learning) has recently regained popularity as a viable path towards effective real-world applications; see, for example, Conservative Q-Learning [kumar2020conservative, levine2020offline].
3 Markov Decision Process in Sparse Solver
In this section, we list many combinatorial optimization problems in sparse solvers, all of which have the Markov property (section 2.5). Since there is an implicit or explicit reward for the decision at each step, all these algorithms can be regarded as Markov decision processes (MDPs). We list the state spaces, actions, and rewards of these algorithms, and then use Q-Learning to improve them.
3.1 MDP in Minimum Degree Order
Minimum degree (MD) ordering [duff1974comparison, george1977application, eisenstat1977yale, george1980fast, amestoy1996approximate] is an important technique to reduce fill-in. Its principle is simple, as shown in Algorithm 4: at each step, it always picks the node with minimum degree (possibly an approximate or weighted value). In practical implementations, the key is how to update the degrees quickly without reducing the quality of the ordering. For example, the state-of-the-art AMD [amestoy1996approximate] uses an approximate formula for an upper bound of the degree. Surprisingly, on many problems this approximation can even produce less fill-in than methods using the accurate degree. This seemingly 'unreasonable' result shows again the difficulty of combinatorial optimization problems.
As Algorithm 4 shows, the degrees of the nodes vary from step to step. But we do not care how they changed in the past; we only check the degrees at the current step. That is the Markov property: the decision does not depend on the past, only on the present state.
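The greedy loop of minimum degree ordering, together with the fill-in each elimination causes, can be sketched as follows. This is a plain version for illustration, far simpler than AMD (no supernodes, no approximate degrees); the graph is a made-up 4-cycle.

```python
# Greedy minimum degree ordering on an undirected graph, counting fill-in.
def min_degree_order(adj):
    adj = {v: set(ns) for v, ns in adj.items()}
    order, fill = [], 0
    while adj:
        # Markov property: only the current degrees matter.
        v = min(adj, key=lambda u: (len(adj[u]), u))
        nbrs = adj.pop(v)
        for u in nbrs:
            adj[u].discard(v)
        # Eliminating v connects its neighbors into a clique;
        # every edge added here is a fill-in entry.
        nbrs = sorted(nbrs)
        for i, u in enumerate(nbrs):
            for w in nbrs[i + 1:]:
                if w not in adj[u]:
                    adj[u].add(w)
                    adj[w].add(u)
                    fill += 1
        order.append(v)
    return order, fill

# A 4-cycle 0-1-2-3: eliminating any (degree-2) node creates one fill edge.
order, fill = min_degree_order({0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}})
print(order, fill)  # prints [0, 1, 2, 3] 1
```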
3.1.1 Action and Reward in Minimum Degree Order
At each step of the MD method, we always pick a node so as to reduce the fill-in. So the negative of the number of fill-ins can be regarded as the reward value, and different strategies correspond to different actions. In addition to AMD mentioned above, there are more choices:

MMDF - Modified Minimum Deficiency [ng1999performance].

MMMD - Modified Multiple Minimum Degree [ng1999performance].

MIND - Minimum Increase in Neighborhood Degree [ng1999performance].

AMIND - Approximate Minimum Increase in Neighborhood Degree [rothberg1998node].

MMF - Minimum Mean Local Fill [ng1999performance].

AMMF - Approximate Minimum Mean Local Fill [rothberg1998node].

So the action space would be A = { AMD, MMDF, MMMD, MIND, AMIND, MMF, AMMF, … }, and the reward would be r = -(number of fill-ins).
3.1.2 Adaptive Minimum Degree Order by Q-Learning
In all previous research [duff1974comparison, george1977application, eisenstat1977yale, george1980fast, amestoy1996approximate], only one method is used in the MD process; no research tries to use different strategies at different steps. Since all the metrics in A are reasonable, why not choose different strategies according to different patterns? This choice can be learned through the offline Q-Learning algorithm (section 2.8). So we propose a Q-Learning based MD ordering method, as Algorithm 5 shows:
3.2 MDP in Task Scheduling
Task scheduling is the key to high performance, especially in distributed and heterogeneous computing [sinnen2007task, luo2021learning, glaubius2012real]. The main targets (objective functions, rewards) are to minimize the solution time (or makespan) and resource consumption (memory, power, …). A goal specific to sparse solvers is precision; we will discuss it in a later paper. For most problems, the direct solver gives a solution with reasonable precision as long as the solution completes.
This problem has been deeply and widely studied. In recent years, research on MDP-based scheduling has been very active, with many good ideas [sinnen2007task, luo2021learning, glaubius2012real, shyalika2020reinforcement]. Unfortunately, these ideas have not been applied to sparse solvers. To our knowledge, in the case of matrix computation, only a few papers [agullo2015bridging, grinsztajn2020geometric, grinsztajn2021readys] offer relevant research; they successfully apply MDP-based algorithms to dense Cholesky decomposition. In the work of [grinsztajn2020geometric, grinsztajn2021readys], the Q-Learning method has the following modules: 1) the state space comes from the embedding of each task by a graph neural network [kipf2016semi]; 2) the reward function is the final makespan of the whole scheduling trajectory, normalized by a baseline duration; 3) the action space consists of selecting an available task or doing nothing (pass). They trained this model with A2C [mnih2016asynchronous] (a policy gradient method). For sparse solvers, there is still no similar study. There are great differences between dense and sparse solvers: different state spaces, action spaces, and reward functions. A lot of work is needed to apply MDP-based task scheduling.
3.2.1 Action and Reward in Task Scheduling
There are many goals, as listed below; the sheer number of goals shows the importance of scheduling.

TIME - the wall time spent by the solver.

MEMORY - memory consumption [lacoste2015scheduling].

BALANCE - workload balance between the resources.

OVERLAP - efficient communication/computation overlap [marjanovic2010overlapping, sao2018communication].

POWER - reduce the power and energy consumed [aguilar2020performance].

LOCALITY - cache reuse and memory locality.

So the action space would be A = { TIME, MEMORY, BALANCE, OVERLAP, POWER, LOCALITY, … }, and the reward is a weighted sum of these metrics.
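Such a weighted-sum reward might be sketched as follows; the metric names, values, and weights are hypothetical, and each metric is treated as a normalized cost to be minimized.

```python
# Sketch of a scheduling reward: a weighted sum of (negated) cost metrics.
def reward(metrics, weights):
    """Higher reward = better schedule; every metric is a cost, so negate it."""
    return -sum(weights[k] * metrics[k] for k in weights)

# Hypothetical normalized costs of one candidate schedule.
metrics = {"time": 1.8, "memory": 0.6, "comm": 0.3}
weights = {"time": 1.0, "memory": 0.2, "comm": 0.5}
r = reward(metrics, weights)
print(round(r, 3))  # prints -2.07
```

The weights encode the priorities among the goals (time first here); tuning them is itself part of the learning problem.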
3.2.2 Improving Task Scheduling by Q-Learning
For the sparse solver, task scheduling is even more important, and we propose the following algorithm:
3.3 Action and Reward in Adaptive Pivoting Method
Pivoting is an important technique to improve numerical stability. The partial pivoting listed in Algorithm 1 is only one choice; there are more pivoting techniques. At each step of GE, there are many choices, so we can use a learning-based method to pick one pivoting action. That is the basic principle of adaptive pivoting. Here are some common pivoting methods:

PP - Partial Pivoting [sherman1978algorithms].

RP - Rook Pivoting [foster1997growth].

CP - Complete Pivoting [edelman1992complete].

SPP - Supernodal Partial Pivoting [demmel1999supernodal]. In a multifrontal solver, pivoting can be applied only within the current frontal (supernode), but in some cases we still need to find a proper pivot element among all uneliminated frontals.

SKIP - No pivoting (for example, there is no need to pivot in Cholesky factorization).

RBT - Recursive Butterfly Transforms [lindquist2020replacing].

So the action space of pivoting is A = { PP, RP, CP, SPP, SKIP, RBT, … }.
A reasonable and classic reward value is the negative growth factor [wilkinson1988algebraic, foster1997growth, higham1989large]:
(5) ρ = max_{i,j,k} |a_ij^(k)| / max_{i,j} |a_ij|
That is, the pivoting method should reduce the growth factor as much as possible. As Wilkinson pointed out 60 years ago [wilkinson1988algebraic], a low value of ρ means high numerical stability; that is the goal of pivoting.
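Equation (5) can be checked numerically. The sketch below runs dense GE with and without partial pivoting on the textbook small-pivot example: without pivoting the tiny (0,0) entry produces an enormous growth factor, while partial pivoting keeps it near 1.

```python
import numpy as np

def growth_factor(A, pivot=True):
    """Track the max |entry| over all elimination steps; rho per equation (5)."""
    U = A.astype(float).copy()
    n = U.shape[0]
    max_elem = np.abs(U).max()
    for k in range(n - 1):
        if pivot:  # partial pivoting: move the largest entry onto the diagonal
            p = k + np.argmax(np.abs(U[k:, k]))
            U[[k, p]] = U[[p, k]]
        U[k+1:, k+1:] -= np.outer(U[k+1:, k] / U[k, k], U[k, k+1:])
        U[k+1:, k] = 0.0
        max_elem = max(max_elem, np.abs(U).max())
    return max_elem / np.abs(A).max()

A = np.array([[1e-8, 1.0], [1.0, 1.0]])
print(growth_factor(A, pivot=False) > 1e6)  # prints True (rho ~ 1e8)
print(growth_factor(A, pivot=True) < 10)    # prints True (rho ~ 1)
```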
4 Recasting and Improving the Sparse Solver by Q-Learning
In the previous section, we analyzed the main algorithms in the sparse solver and found that they can be described by MDPs. We further proposed how to use Q-Learning to improve these algorithms. Based on these algorithm modules, we propose a novel offline Q-Learning framework for sparse Gaussian elimination. This framework has three basic modules:
MDP 1 - Minimum Fill-in. Reward: r = -(number of fill-ins). Action Space = { AMD, MMDF, MMMD, MIND, AMIND, MMF, AMMF, … }

MDP 2 - Task Scheduling. Reward: weighted sum of the scheduling metrics. Action Space = { TIME, MEMORY, BALANCE, OVERLAP, POWER, LOCALITY, … }

MDP 3 - Pivoting for Numerical Stability. Reward: r = -ρ (negative growth factor). Action Space = { PP, RP, CP, SPP, SKIP, RBT, … }

Based on Algorithm 1 ("Gaussian Elimination with Pivoting"), Algorithm 5 ("An offline Q-Learning framework for the minimum degree algorithm"), and Algorithm 6 ("An offline Q-Learning framework for the task scheduling of sparse solvers"), we finally unify these algorithms and recast sparse GE as Algorithm 7, "Sparse Gaussian Elimination in the framework of Q-Learning".
As Algorithm 7 shows, we first learn the Q-values in the offline training stage. These learned Q-values help to find a good "policy" at each step. The convergence of this algorithm is proved in [watkins1992q, melo2001convergence], so it finds an optimal policy maximizing the total reward over all successive steps. We list common reward values in Table 1. The process of maximizing reward corresponds to an accurate and efficient GE process: less fill-in, high numerical stability, good parallelism, low communication overhead, …
Now we have linked two seemingly unrelated areas, Q-Learning and sparse solvers. This is not a coincidence or a forced interpretation: most problems in sparse solvers are combinatorial optimization problems, and MDP is a powerful framework for these NP-hard problems. In addition to the Q-Learning introduced in this paper, there are more powerful reinforcement learning methods. Reinforcement learning methods are built on the MDP framework and can be easily integrated into the algorithm [shyalika2020reinforcement, kaelbling1996reinforcement, grinsztajn2021readys]. We expect more improvements from these tools and will try more of them [chen2020deep] in later studies.
4.1 A demo case
Let us look at Figure 2: a tiny matrix with only 3 rows. If only row transformations are considered, there are six different elimination processes in total.
(6) 
Figure 2 shows three cases, each with a different reward and action.

Figure 2.(a): no pivoting and no row permutation. This process requires minimal computation.

Figure 2.(b): two rows can be eliminated at the same time, so there is a high degree of parallelism.

Figure 2.(c): apply a column pivoting strategy to ensure numerical stability.
So even for this tiny matrix, there are different strategies for different goals. At each step, there are different policies to pick different actions. We can use offline Q-Learning to learn these strategies, then apply them to improve the efficiency and accuracy of the solution process.
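The six elimination processes can be enumerated directly. The sketch below uses a made-up 3x3 matrix (not the one in Figure 2) and the growth factor as one possible reward: each of the 3! = 6 row orders yields a different elimination process with its own stability.

```python
import itertools
import numpy as np

def growth(U0):
    """Growth factor of GE without pivoting on a 3x3 matrix."""
    U = U0.astype(float).copy()
    m = np.abs(U).max()
    for k in range(2):
        U[k+1:, k+1:] -= np.outer(U[k+1:, k] / U[k, k], U[k, k+1:])
        U[k+1:, k] = 0.0
        m = max(m, np.abs(U).max())
    return m / np.abs(U0).max()

# An illustrative matrix chosen so every row order has nonzero pivots.
A = np.array([[2.0, 1.0, 1.0],
              [1.0, 3.0, 1.0],
              [1.0, 1.0, 2.0]])

# One elimination process (and one growth factor) per row permutation.
rhos = {p: growth(A[list(p), :]) for p in itertools.permutations(range(3))}
print(len(rhos))  # prints 6
```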
4.2 GSS - a High Performance Solver with a Q-Learning Based Scheduler
We have implemented a Q-Learning based scheduling algorithm in GSS [Chen2018]. It uses a Q-Learning based task-load-tree split technique to create a submatrix that can be fully factored on the GPU, so the data transfer time is reduced to a minimum. GSS also uses some adaptive load balancing techniques in the framework of Q-Learning. In the static scheduling stage, the LU factorization is split into many parallel tasks, and each task is assigned to a different computing core (CPU, GPU, …). Then, in the actual elimination process, the dynamic scheduling strategy performs further load balancing based on a pretrained table. That is, if the GPU cores have high computing power, they automatically run more tasks; if the CPU is more powerful, GSS gives it more tasks. The choice of target core is learned from the table.
There are some experimental results at https://github.com/closestgit/GSS. For many large matrices, GSS is about 2-3 times faster than PARDISO [schenk2004solving] in MKL. We will give more detailed results in subsequent papers.
5 Prospect
In this paper, we listed many combinatorial optimization processes in sparse solvers and unified them in the framework of Markov decision processes, with detailed analysis of state spaces, action spaces, and rewards. We then used Q-Learning to improve the solution.
This is only the first step in rethinking and recasting the Gaussian elimination process, and more work remains to be done. We will report more methods and numerical experimental results in subsequent papers.