Learning the Markov Decision Process in the Sparse Gaussian Elimination

09/30/2021
by Yingshi Chen, et al.

We propose a learning-based approach to sparse Gaussian elimination. Modern sparse solvers contain many hard combinatorial optimization problems; these NP-hard problems can be handled in the framework of the Markov Decision Process, in particular with the Q-Learning technique. We propose Q-Learning algorithms for the main modules of a sparse solver: minimum degree ordering, task scheduling, and adaptive pivoting. Finally, we recast the whole sparse solver into the framework of Q-Learning. Our study is a first step toward connecting two classical mathematical models: Gaussian elimination and the Markov Decision Process. The learning-based algorithms can improve the performance of a sparse solver, which we have verified in numerical experiments.


1 Introduction

Gaussian elimination (GE) is one of the oldest and most important numerical methods [1996Matrix, grcar2011mathematicians, yuan2012jiu]. It is the most widely used method for solving linear systems:

Ax = b    (1)

where A is a square matrix of order n, b is the user-specified right-hand side, and x is the solution. In many works, this method is also called the direct method [duff2017direct, demmel200010], or a direct sparse solver in the case of a sparse matrix.

In the age of Gauss (1777–1855), the number of equations was very small. In recent decades, the order of sparse matrices has grown larger and larger, from thousands to millions, billions, and even trillions. Their solution has become the time and resource bottleneck of many applications.

In addition to the explosion of scale, another main challenge is how to make efficient use of the hierarchical architecture of modern computers [dongarra1998numerical]. Only a few highly optimized dense matrix operators (such as BLAS [dongarra1990set, anderson1999lapack, kaagstrom1998gemm]) can make full use of the performance of modern cores (CPU, GPU, TPU, ...). These dense block operations are hard to implement in a sparse solver. How should these blocks be constructed? How should they be scheduled? How can the communication between them be reduced? All of these involve complex combinatorial optimization problems. Most of them are NP-hard [bovet1994introduction] and cannot be solved in polynomial time. There are further problems unique to sparse solvers, such as fill-in reducing ordering (Yannakakis proved it is NP-complete [yannakakis1981computing]). Therefore, a sparse solver is not just a simple extension of GE; it needs a new framework to handle these problems.

Modern sparse solvers [schenk2004solving, li2003superlu_dist, amestoy2000mumps, Chen2018] are based on operations on blocks. The earliest version is the frontal solver by Bruce Irons [irons1970frontal], which builds an LU or Cholesky decomposition from the assembly of element matrices (fronts). Duff and Reid [duff2017direct] improved this idea into the multifrontal solver, which deals with several independent fronts at the same time [liu1992multifrontal]. Their research laid the foundation of modern sparse solvers, and we are still walking along the road they pointed out. This paper may be the first attempt to improve the multifrontal method from a more general, learning-based framework.

1.1 Main Modules of Sparse Solver

A modern sparse solver can be divided into three stages (some divide it into four stages, with an independent fill-in reducing ordering stage):

  1. Symbolic analysis: includes many tasks on the symbolic (non-zero) pattern of the matrix.

    • Transform the matrix into a specific structure (for example, block triangular form).

    • Reorder the matrix to reduce fill-in (details in section 3.1).
      Finding an ordering that minimizes fill-in is NP-hard [yannakakis1981computing].

    • Compute the patterns of the L and U factors, and construct frontals from these patterns.
      There are greedy algorithms to construct frontals, but the optimal frontal partition is NP-hard; we will prove this in a later paper.

    • Static parallel task scheduling on these frontals (details in section 3.2).
      The static task scheduling problem is NP-hard [graham1969bounds].

    • Other tasks, such as building the elimination tree [gilbert1993elimination, liu1990role].

    Finally, this stage produces a permutation matrix P.

  2. Numeric factorization: computes the L and U factors in frontal format.

    PA = LU    (2)

    Pivoting is an important technique to improve numerical stability. For more details of adaptive pivoting, see section 3.3.

    This is usually the most time-consuming stage. The key to reducing time and improving efficiency is dynamic (online) task scheduling. It’s NP-hard [graham1969bounds] and much more complicated than static scheduling. For more details, see section 3.2.

  3. Solve: performs forward and backward substitution to get the solution x.

    Ly = Pb,   Ux = y    (3)

    Dynamic task scheduling is also needed to improve the efficiency.

1.2 Learning the Sparse Solver by the Markov Decision Process

It is clear that there are many combinatorial optimization problems in a sparse solver. To handle these problems, the Markov Decision Process (MDP for short) [bellman1957markovian, gosavi2015solving] is a smart and powerful model. In the following sections we reveal this seemingly surprising connection: the many MDPs hidden in a sparse solver, how to learn these processes, and how to improve performance with learning-based methods.

These methods have recently achieved great success in academia and industry. For example, Google's AlphaGo [silver2016mastering] defeated the Go master Lee Sedol. The engine of AlphaGo is based on deep reinforcement learning, whose theoretical basis is the MDP. It is time to use these powerful modern tools to tackle the old problem of Gaussian elimination.

2 Background

In this section, we give some background knowledge needed in the following analysis and derivation.

2.1 Gaussian Elimination with Pivoting

Let us first review classical Gaussian elimination with pivoting. For more detail, please see [1996Matrix, duff2017direct, foster1997growth, sherman1978algorithms].

Input: a square matrix A of order n

1: A^(1) = A
2: for k = 1, 2, ..., n-1 do
3:     Apply a pivoting strategy (partial pivoting [sherman1978algorithms], rook pivoting [foster1997growth], ...)
4:     Select the pivot element and its position (r, c)
5:     if (r, c) ≠ (k, k) then
6:         Interchange rows k and r and columns k and c
7:     Update the un-eliminated part of the matrix to get A^(k+1):
8:     for i = k+1, ..., n do
9:         a_ik = a_ik / a_kk
10:        for j = k+1, ..., n do
11:            a_ij = a_ij - a_ik a_kj
Algorithm 1 Gaussian Elimination with Pivoting
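To make the elimination step concrete, the following is a minimal dense NumPy sketch of Algorithm 1 with partial pivoting; the in-place storage of the L and U factors and the function name are our own illustration, not code from any solver discussed here (a sparse solver works on frontal blocks rather than the full matrix).

```python
import numpy as np

def ge_partial_pivoting(A):
    """Dense sketch of Algorithm 1 with partial pivoting (illustration only)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    perm = np.arange(n)
    for k in range(n - 1):
        # partial pivoting: largest entry in column k among rows k..n-1
        r = k + np.argmax(np.abs(A[k:, k]))
        if r != k:
            A[[k, r]] = A[[r, k]]
            perm[[k, r]] = perm[[r, k]]
        # eliminate below the pivot; multipliers are stored in place (L factor)
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return A, perm   # strictly lower part holds L, upper part holds U
```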

As Algorithm 1 shows, at step k the value and pattern of A^(k) depend on A^(k-1) and the pivot position. Different pivoting strategies lead to different A^(k). For an input matrix A, we define the set of all possible intermediate matrices A^(k) as the elimination set of A.

Figure 1 below shows the elimination process of a simple matrix; the first step is column partial pivoting, which swaps two rows.

Figure 1: Gaussian elimination process of a matrix

2.2 BLAS and Frontal

BLAS is an acronym for Basic Linear Algebra Subprograms [dongarra1990set, anderson1999lapack, kaagstrom1998gemm]. It is the cornerstone of modern high performance computing. BLAS-1 routines perform scalar, vector, and vector-vector operations; BLAS-2 routines perform matrix-vector operations; and BLAS-3 routines perform matrix-matrix operations. Many operations on frontals are BLAS-3 routines.
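As an illustration of how a frontal update maps onto BLAS-3, the sketch below performs a Schur-complement style update C ← C − L21·U12 with a single GEMM call through SciPy's BLAS wrappers; the block names and sizes are hypothetical and only meant to show the shape of the operation.

```python
import numpy as np
from scipy.linalg.blas import dgemm

# Hypothetical frontal blocks: after factoring the pivot block, the
# contribution block is updated with one BLAS-3 (GEMM) call.
L21 = np.random.rand(64, 16)   # rows below the pivot block
U12 = np.random.rand(16, 64)   # columns to the right of the pivot block
C   = np.random.rand(64, 64)   # contribution (Schur complement) block

# C <- C - L21 @ U12, expressed as a single dgemm call
C = dgemm(alpha=-1.0, a=L21, b=U12, beta=1.0, c=C)
```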

2.3 Distributed/Heterogeneous Computing Systems

Today's distributed computing systems often contain various processors (CPU, GPU, TPU, ...), so they are also heterogeneous computing systems [karatza2002load]. Such a complex system has an impressive peak computing power, but the question is how to exploit it. Many problems can reduce the efficiency of the whole system:

  • load imbalance among processes;

  • overhead during communication;

  • synchronizations for computations;

  • online scheduling. A widely used model is the DAG model described below.

2.4 DAG (Directed Acyclic Graph) Based Task Scheduling

We can use a task graph to describe all tasks in a sparse solver, where the nodes correspond to the tasks on frontals and the edges represent dependencies between tasks. This DAG model has been widely used in task scheduling. [wu2015hierarchical] uses a hierarchical DAG to schedule tasks between CPU and GPU, whereby the CPU gets small tiles and the GPU gets large tiles. [gupta2002improved] not only uses a task-DAG, but also a data-DAG to deal with the data dependencies in sparse GE. The DAGuE framework in [bosilca2012dague] implements a complex DAG based scheduling engine, which can deal with many low-level architectural issues such as load balancing, memory distribution, cache reuse and memory locality on non-uniform memory access (NUMA) architectures, and communication/computation overlapping. For more detail on task scheduling, please see section 3.2.
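The sketch below illustrates the DAG model with a minimal greedy list scheduler over a toy task graph on two workers; the task names, costs, and worker set are invented for illustration and this is not the DAGuE engine or the scheduler used in any of the cited solvers.

```python
# A minimal greedy list scheduler over a task DAG (a sketch only).
# deps[t] lists the tasks t depends on; cost[t] is an illustrative duration.
deps = {"F1": [], "F2": [], "F3": ["F1", "F2"], "F4": ["F3"]}
cost = {"F1": 3.0, "F2": 2.0, "F3": 4.0, "F4": 1.0}
workers = {"cpu0": 0.0, "gpu0": 0.0}            # time each worker becomes free

indeg = {t: len(d) for t, d in deps.items()}
finish = {}                                      # finish time of each task
ready = [t for t, deg in indeg.items() if deg == 0]

while ready:
    task = ready.pop(0)
    dep_done = max((finish[d] for d in deps[task]), default=0.0)
    worker = min(workers, key=workers.get)       # earliest-free worker
    start = max(dep_done, workers[worker])
    finish[task] = start + cost[task]
    workers[worker] = finish[task]
    for t, d in deps.items():                    # release successors
        if task in d:
            indeg[t] -= 1
            if indeg[t] == 0:
                ready.append(t)

print("makespan:", max(finish.values()))
```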

2.5 Markov Property

The Markov property is named after the Russian mathematician Andrey Markov: in such a process, the future does not depend on the past, only on the present state. This is a simple but powerful property. Whether we notice it or not, many greedy/progressive algorithms are based on this property, especially in sparse solvers. For example, MD (minimum degree) ordering always selects the node with minimum degree in the current state, and partial pivoting always selects an element in the current column or row. In the process of GE, the value and degree of each element vary from step to step, but we do not care how they changed in the past; we only check the values at the current step. The same holds for task scheduling algorithms. The Markov property is a prerequisite for many models, such as the MDP discussed below.

2.6 Markov Decision Process (MDP)

The Markov decision process (MDP for short) [bellman1957markovian, gosavi2015solving] is a smart and powerful model. As its name suggests, this model finds a decision process built on the Markov property. The standard MDP includes four elements [bellman1957markovian]:

  • S is the state space of the MDP.

  • A is the set of all actions. We also use A_s to denote the set of actions available from state s.

  • P_a(s, s') = Pr(s_{t+1} = s' | s_t = s, a_t = a) is the probability that action a in state s at time t leads to state s' at time t+1.

  • R_a(s, s') is the immediate reward (or expected immediate reward) received after transitioning from state s to state s', due to action a.

A transition matrix P is a square matrix used to describe the transitions of the state space:

P = (P_ij),   P_ij = Pr(s_{t+1} = j | s_t = i)    (4)

Each element P_ij is a non-negative real number representing the probability of moving from state i to state j. Each row of P sums to 1: Σ_j P_ij = 1.
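As a small concrete example of these definitions, the snippet below builds the transition and reward arrays of a hypothetical three-state, two-action MDP and checks the row-stochastic property; all numbers are purely illustrative.

```python
import numpy as np

# P[a, i, j] = probability of moving from state i to state j under action a.
P = np.array([
    [[0.9, 0.1, 0.0],
     [0.0, 0.8, 0.2],
     [0.0, 0.0, 1.0]],
    [[0.5, 0.5, 0.0],
     [0.1, 0.1, 0.8],
     [0.0, 0.0, 1.0]],
])
# R[a, i] = expected immediate reward for taking action a in state i.
R = np.array([[0.0, -1.0, 5.0],
              [0.0, -2.0, 5.0]])

# Every row of each per-action transition matrix must sum to 1.
assert np.allclose(P.sum(axis=2), 1.0)
```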

MDP is the theoretical foundation of Reinforcement Learning[kaelbling1996reinforcement]. A widely-used Reinforcement Learning technique is Q-Learning.

2.7 Q-Learning

The MDP model presents a general paradigm and an abstract mathematical framework. For practical problems, there are many powerful techniques, such as Q-Learning [watkins1989learning, watkins1992q]. "Q" refers to the expected reward for an action taken in a given state [watkins1989learning]. At each step, Q-Learning takes some action based on the current estimate of the Q-values. If we know the Q-value of every action in a state, it is simple to find the optimal solution of the MDP.

The following is a typical framework of Q-Learning[sutton2018reinforcement].

Init Parameters: Learning rate α, discount rate γ and reward function R

1: Initialize Q(s, a) for all states s and actions a. For the terminal state, Q(terminal, ·) = 0
2: for each episode do
3:     Pick an initial state s
4:     while true do
5:         Choose action a for the current state s, using a policy derived from Q (e.g. ε-greedy)
6:         Take action a, then enter the next state s'
7:         Get reward r
8:         Update Q by some method, for example:
9:             Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
10:        Set s ← s'
11:        if s is terminal then
12:            break
Algorithm 2 A general framework of Q-Learning

This framework shows a key advantage of Q-Learning: it avoids the use of a transition matrix. In the case of a sparse solver, the probabilities in the transition matrix are hard to estimate, and the transition matrix would need huge memory (the number of states of a million-order matrix is astronomical!). Instead, we can always get the reward or Q-value. So in practice, Q-Learning is more suitable for studying the various combinatorial problems in GE. Watkins [watkins1992q] proved that it converges to the optimal Q-values with probability 1, so it is a reliable technique for the combinatorial optimization problems in a sparse solver.
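For reference, a minimal tabular implementation of Algorithm 2 might look as follows; the environment interface (`reset`, `step`, `actions`) is a hypothetical placeholder, and ε-greedy is one common choice for the policy derived from Q.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-Learning following Algorithm 2 (a sketch; `env` is assumed
    to provide reset(), actions(state), and step(state, action))."""
    Q = defaultdict(float)                       # Q[(state, action)], zero-initialized
    for _ in range(episodes):
        s = env.reset()
        while True:
            acts = env.actions(s)
            if random.random() < eps:            # epsilon-greedy policy derived from Q
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(s, a)
            best_next = 0.0 if done else max(Q[(s_next, x)] for x in env.actions(s_next))
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
            if done:
                break
    return Q
```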

2.8 Offline Q-Learning

Solution time is one of the most important metrics of a sparse solver. We can keep the training time out of the solver by using an offline technique: first collect many matrices and train the model on these sample matrices, then apply the learned Q-values when solving a new matrix. This offline Q-Learning (or batch Q-Learning) has recently regained popularity as a viable path towards effective real-world application, for example Conservative Q-Learning [kumar2020conservative, levine2020offline].

Init Parameters: Learning rate α, discount rate γ and reward function R

Training stage:

Learn Q(s, a) using an offline data-set

Inference stage:

1: Pick an initial state s
2: while true do
3:     Choose action a for the current state s, using the policy derived from the learned Q
4:     Take action a from the learned policy, then enter the next state s'
5:     Get reward r
6:     Set s ← s'
7:     if s is terminal then
8:         break
Algorithm 3 Conservative Q-learning
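The inference stage of Algorithm 3 then reduces to acting greedily with respect to the pre-learned table, as in this sketch (same hypothetical environment interface as in the previous example):

```python
def offline_inference(env, Q):
    """Inference stage of Algorithm 3: act greedily w.r.t. a pre-learned Q table."""
    s = env.reset()
    total_reward = 0.0
    while True:
        # policy derived from the learned Q: pick the best-valued action
        a = max(env.actions(s), key=lambda x: Q.get((s, x), 0.0))
        s, r, done = env.step(s, a)
        total_reward += r
        if done:
            return total_reward
```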

3 Markov Decision Process in Sparse Solver

In this section, we list several combinatorial optimization problems in a sparse solver, all of which have the Markov property (section 2.5). There is an implicit or explicit reward for the decision at each step, so all these algorithms can be regarded as Markov Decision Processes. We list the state space, actions, and rewards of these algorithms, and then use Q-Learning to improve them.

3.1 MDP in Minimum Degree Order

Minimum degree (MD) ordering [duff1974comparison, george1977application, eisenstat1977yale, george1980fast, amestoy1996approximate] is an important technique to reduce fill-in. Its principle is simple, as shown in Algorithm 4: at each step, it always picks the node with minimum degree (possibly an approximate or weighted value). In a practical implementation, the key is how to update the degrees quickly without reducing the quality of the ordering. For example, the state-of-the-art AMD [amestoy1996approximate] uses an approximate formula for an upper bound of the degree. Surprisingly, this approximation can even produce less fill-in than methods using the exact degree on many problems. This seemingly 'unreasonable' result shows once more the difficulty of combinatorial optimization problems.

Input: Construct a graph G = (V, E) from the nonzero pattern of A

1: For each node v ∈ V, set d(v) to be the degree of v
2: k = 1
3: while V ≠ ∅ do
4:     Select the node v that minimizes d(v)
5:     Update the permutation vector: p(k) = v, k = k + 1
6:     Eliminate v, update G to the elimination graph
7:     For each node u adjacent to v, update d(u)
8: return The permutation vector p
Algorithm 4 The minimum degree order algorithm

As Algorithm 4 shows, the degrees of the nodes vary from step to step. But we do not care how they changed in the past; we only check the degrees at the current step. That is the Markov property: the decision does not depend on the past, only on the present state.
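A minimal executable sketch of Algorithm 4 on a graph stored as an adjacency dictionary is shown below; it uses exact degrees and explicit elimination-graph updates, whereas production codes such as AMD rely on quotient graphs and approximate degrees for speed.

```python
def minimum_degree_order(adj):
    """Sketch of Algorithm 4 on an undirected graph given as {node: set(neighbors)}."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    order = []
    while adj:
        v = min(adj, key=lambda u: len(adj[u]))       # node of minimum current degree
        nbrs = adj.pop(v)
        order.append(v)
        for u in nbrs:                                # eliminate v: clique its neighbors
            adj[u].discard(v)
            adj[u] |= (nbrs - {u})
    return order

# A tiny illustrative graph (hypothetical sparsity pattern).
adj = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2, 4}, 4: {3}}
print(minimum_degree_order(adj))
```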

3.1.1 Action and Reward in Minimum Degree Order

At each step of the MD method, we pick a node so as to reduce the fill-in, so the negative of the number of fill-ins can be regarded as the reward value, and different selection strategies correspond to different actions. In addition to the AMD mentioned above, there are more choices:

  • MMDF - modified minimum deficiency[ng1999performance].

  • MMMD - modified multiple minimum degree[ng1999performance].

  • MIND - Minimum Increase In Neighborhood Degree [ng1999performance].

  • AMIND - Approximate Minimum Increase In Neighborhood Degree[rothberg1998node].

  • MMF - Minimum Mean Local Fill [ng1999performance].

  • AMMF - Approximate Minimum Mean Local Fill[rothberg1998node].

So the action space is A = { AMD, MMDF, MMMD, MIND, AMIND, MMF, AMMF, ... }, and the reward is the negative of the number of fill-ins.

3.1.2 Adaptive Minimum Degree Order by Q-Learning

In all previous research [duff1974comparison, george1977application, eisenstat1977yale, george1980fast, amestoy1996approximate], only one method is used throughout the MD process; no study tries to use different strategies at different steps. Since all the metrics in the action space are reasonable, why not choose different strategies according to different patterns? This choice can be learned through the offline Q-Learning algorithm (section 2.8). So we propose a Q-Learning based MD ordering method, as Algorithm 5 shows:

Training stage:

Learn Q on the MD process of many matrices.

The Q table stores the choice of strategy at each state.

Inference stage:

1: The initial state s is just A (the input matrix)
2: while V ≠ ∅ do
3:     For the current state s, take action a from the learned Q table, then select node v
4:     Eliminate v
5:     Get reward r
6:     Set s to the next state
Algorithm 5 An offline Q-Learning framework for the minimum degree algorithm
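One possible shape of the inference loop of Algorithm 5 is sketched below; the Q table layout, the `featurize` state-feature function, and the per-strategy node selectors are hypothetical placeholders for whatever a concrete implementation would learn and provide.

```python
def adaptive_md_order(adj, Q, featurize, strategies):
    """Inference loop of Algorithm 5 (a sketch). `strategies[name](adj)` returns
    the node that strategy (e.g. AMD, MMDF, ...) would eliminate next;
    `featurize(adj)` maps the graph to a coarse state feature."""
    adj = {v: set(n) for v, n in adj.items()}
    order = []
    while adj:
        s = featurize(adj)
        name = max(strategies, key=lambda k: Q.get((s, k), 0.0))   # learned choice
        v = strategies[name](adj)
        nbrs = adj.pop(v)
        order.append(v)
        for u in nbrs:                                             # eliminate v
            adj[u].discard(v)
            adj[u] |= (nbrs - {u})
    return order
```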

3.2 MDP in Task Scheduling

Task scheduling is the key to high performance, especially in distributed and heterogeneous computing [sinnen2007task, luo2021learning, glaubius2012real]. The main target (objective function, reward) is to minimize the solution time (or makespan) and resource consumption (memory, power, ...). A goal specific to sparse solvers is precision, which we will discuss in a later paper; for most problems, a direct solver gives a solution with reasonable precision as long as the factorization completes.

This problem has been deeply and widely studied. In recent years, research on MDP based scheduling has been very active and there are many good ideas [sinnen2007task, luo2021learning, glaubius2012real, shyalika2020reinforcement]. Unfortunately, these ideas have not been applied to sparse solvers. To our knowledge, in the case of matrix computation, only a few papers [agullo2015bridging, grinsztajn2020geometric, grinsztajn2021readys] are relevant; they successfully apply MDP based algorithms to dense Cholesky decomposition. In the work of [grinsztajn2020geometric, grinsztajn2021readys], the Q-Learning method has the following modules: 1) the state space is the embedding of each task produced by a graph neural network [kipf2016semi]; 2) the reward function is the final makespan of the whole scheduling trajectory, normalized by a baseline duration; 3) the action space consists of selecting an available task or doing nothing (pass). They trained this model with A2C [mnih2016asynchronous] (a policy gradient method).

For sparse solvers, there is still no similar study. There are great differences between dense and sparse solvers: they have different state spaces, action spaces, and reward functions. A lot of work is needed to apply MDP based task scheduling.

3.2.1 Action and Reward in Task Scheduling

There are many goals, as listed below; the sheer number of goals shows the importance of scheduling.

  • TIME - the wall time spent by the solver

  • MEMORY - memory consumption [lacoste2015scheduling]

  • BALANCE - workload balancing between the resources

  • OVERLAP - Efficient Communication/Computation Overlap [marjanovic2010overlapping, sao2018communication]

  • POWER - Reduce the power and energy consumed[aguilar2020performance].

  • LOCALITY - memory locality and cache reuse (see section 2.4)

So the action space is A = { TIME, MEMORY, BALANCE, OVERLAP, POWER, LOCALITY, ... }, and the reward is a weighted sum of these metrics.
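As a small illustration of such a combined reward, one could negate a weighted sum of the measured costs; the weights and metric names below are purely illustrative, not values from the paper.

```python
# Illustrative weights for combining scheduling metrics into a single reward.
WEIGHTS = {"time": 1.0, "memory": 0.2, "balance": 0.5, "overlap": 0.3, "power": 0.1}

def scheduling_reward(costs):
    """costs: measured values such as {"time": 12.3, "memory": 4.1};
    lower cost means higher reward, hence the negation."""
    return -sum(WEIGHTS.get(name, 0.0) * value for name, value in costs.items())
```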

3.2.2 Improve Task Scheduling by Q-Learning

For a sparse solver, task scheduling is even more important, and we propose the following algorithm:

Training stage:

Learn Q on the task scheduling of many matrices

Inference stage:

1: The initial state s is just A (the input matrix)
2: while the task DAG is not empty do
3:     For the current state s, take action a from the learned Q table.
4:     Run one or more tasks, update the task DAG.
5:     Get reward r
6:     Set s to the next state
Algorithm 6 An offline Q-Learning framework for the task scheduling of sparse solver

3.3 Action and Reward in Adaptive Pivoting Method

Pivoting is an important technique to improve numerical stability. The partial pivoting listed in Algorithm 1 is only one choice; there are more pivoting techniques. At each step of GE there are many choices, so we can use a learning-based method to pick one pivoting action. That is the basic principle of adaptive pivoting. Here are some common pivoting methods:

  • PP - Partial pivoting[sherman1978algorithms]

  • RP - Rook pivoting[foster1997growth]

  • CP - Complete pivoting[edelman1992complete]

  • SPP - Supernodal partial pivoting [demmel1999supernodal]. In a multifrontal solver, pivoting can be applied only within the current frontal (supernode); but in some cases we still need to find a proper pivot element among all uneliminated frontals.

  • SKIP - No pivoting (for example, there is no need for pivoting in the Cholesky factorization)

  • RBT - Recursive butterfly transforms [lindquist2020replacing]

So the action space of pivoting is A = { PP, RP, CP, SPP, SKIP, RBT, ... }.

A reasonable and classic reward value is the negative of the growth factor ρ [wilkinson1988algebraic, foster1997growth, higham1989large]:

ρ = max_{i,j,k} |a_ij^(k)| / max_{i,j} |a_ij|    (5)

That is, the pivoting method should keep the growth factor as small as possible. As Wilkinson pointed out some 60 years ago [wilkinson1988algebraic], a low value of ρ means high numerical stability, and that is the goal of pivoting.
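For a dense matrix, the growth factor of Eq. (5) can be measured directly during elimination, as in this small sketch with partial pivoting (illustration only; a sparse solver would track the maxima per frontal):

```python
import numpy as np

def growth_factor(A):
    """Measure the growth factor rho of GE with partial pivoting (dense sketch)."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    max_orig = np.abs(A).max()
    max_elim = max_orig
    for k in range(n - 1):
        p = k + np.argmax(np.abs(A[k:, k]))      # partial pivoting
        A[[k, p]] = A[[p, k]]
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
        # track the largest entry of the reduced matrix A^(k+1)
        max_elim = max(max_elim, np.abs(A[k+1:, k+1:]).max())
    return max_elim / max_orig

print(growth_factor(np.random.rand(6, 6)))
```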

4 Recasting and Improve Sparse Solver by Q-Learning

In the previous section, we analyzed the main algorithms in a sparse solver and found that they can all be described as MDPs, and we proposed how to use Q-Learning to improve them. Based on these algorithmic modules, we now propose a novel offline Q-Learning framework for sparse Gaussian elimination. This framework has three basic modules:

  1. State space - the elimination set, which includes all possible eliminated matrices A^(k) (section 2.1).

  2. Action space - Table 1 lists the actions of the different Markov Decision Processes (MDPs), together with the corresponding reward metrics.

  3. Q-value - Q(s, a) is the expected reward for action a taken in state s.


MDP 1 - Minimum Fill-in
Reward: −(number of fill-ins)
Action Space = { AMD, MMDF, MMMD, MIND, AMIND, MMF, AMMF, … }

MDP 2 - Task scheduling
Reward: weighted sum of the scheduling metrics
Action Space = { TIME, MEM, BALANCE, OVERLAP, POWER, LOCALITY, … }

MDP 3 - Pivot for numerical stability
Reward: −ρ (negative growth factor)
Action Space = { PP, RP, CP, SPP, SKIP, RBT, … }

Table 1: Action Space and Rewards of the MDPs in Sparse Solver

Based on Algorithm 1 ("Gaussian Elimination with Pivoting"), Algorithm 5 ("An offline Q-Learning framework for the minimum degree algorithm"), and Algorithm 6 ("An offline Q-Learning framework for the task scheduling of sparse solver"), we finally unify these algorithms and recast sparse GE as Algorithm 7, "Sparse Gaussian Elimination in the framework of Q-Learning".

Init Parameters: Learning rate α, discount rate γ and reward functions

Pre-Training with many sample matrices:

Learn Q using an offline data-set

Solve a matrix A:

1: s = A
2: for k = 1, 2, ..., n-1 do
3:     For the current state s, select an action a (maybe ordering, pivoting, task scheduling, ...) based on the Q value.
4:     switch action a do
5:         case Minimum Degree Order:
6:             Update the matrix pattern, ...
7:         case Pivoting:
8:             Get the pivot element and its position (r, c)
9:             if (r, c) ≠ (k, k) then
10:                Interchange rows and columns
11:            Update the un-eliminated part of the matrix
12:        case Scheduling:
13:            Update the task DAG, ...
14: return the solution x
Algorithm 7 Sparse Gaussian Elimination in the framework of Q-Learning
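At a very high level, the dispatch loop of Algorithm 7 could be sketched as follows; the Q table layout, `featurize`, and the module callables are hypothetical placeholders standing in for the ordering, pivoting, and scheduling components described above.

```python
def q_sparse_ge(A, Q, featurize, modules):
    """High-level sketch of Algorithm 7. `Q` is a pre-trained table,
    `featurize(state)` maps a solver state to a coarse feature, and `modules`
    maps action names to one-step callables plus "done" and "solve" routines."""
    state = A
    while not modules["done"](state):
        s = featurize(state)
        # pick the module with the largest learned Q-value in this state
        action = max(("order", "pivot", "schedule"), key=lambda a: Q.get((s, a), 0.0))
        state = modules[action](state)
    return modules["solve"](state)   # forward/backward substitution for x
```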

As Algorithm 7 shows, we first learn the Q-values in the offline training stage. These learned Q-values help to find a good "policy" at each step. The convergence of this algorithm is proved in [watkins1992q, melo2001convergence], so it finds an optimal policy that maximizes the total reward over all successive steps. We listed common reward values in Table 1. The process of maximizing the reward corresponds to an accurate and efficient GE process: less fill-in, high numerical stability, good parallelism, low communication overhead, ...

Now we have linked two seemingly unrelated areas, Q-Learning and the sparse solver. This is not a coincidence or a forced explanation: most problems in sparse solvers are combinatorial optimization problems, and the MDP is a powerful framework for these NP-hard problems. In addition to the Q-Learning introduced in this paper, there are more powerful reinforcement learning methods; they build on the MDP and can easily be integrated into the algorithm [shyalika2020reinforcement, kaelbling1996reinforcement, grinsztajn2021readys]. We expect more improvements from these tools and will try more of them [chen2020deep] in later studies.

4.1 A demo case

Consider Figure 2: a tiny matrix with only 3 rows. If only row permutations are considered, there are six different elimination processes in total.

(6)

Figure 2 shows three of these cases, corresponding to different rewards and actions.

  • Figure 2.(a): no pivoting and no row permutation. This process requires minimal computation.

  • Figure 2.(b): several entries can be eliminated at the same time, so there is a high degree of parallelism.

  • Figure 2.(c): apply a column pivoting strategy to ensure numerical stability.

So even for this tiny matrix, there are different strategies for different goals, and at each step different policies pick different actions. We can use offline Q-Learning to learn these strategies and then apply them to improve the efficiency and accuracy of the solution process.

(a) Minimize computation
(b) High parallelism
(c) High numerical stability
Figure 2: Different elimination processes for different actions

4.2 GSS - a High Performance Solver with Q-Learning based Scheduler

We have implemented a Q-Learning based scheduling algorithm in GSS [Chen2018]. It uses a Q-Learning based task-load-tree splitting technique to create a sub-matrix that can be fully factored on the GPU, so the data transfer time is reduced to a minimum. GSS also uses some adaptive load balancing techniques in the framework of Q-Learning. In the static scheduling stage, the LU factorization is split into many parallel tasks and each task is assigned to a different computing core (CPU, GPU, ...). Then, in the actual elimination process, the dynamic scheduling strategy does further load balancing based on the pretrained Q table: if the GPU cores have higher computing power, they automatically run more tasks; if the CPU is more powerful, GSS gives it more tasks. The choice of target core is learned from the Q table.

Some experimental results are available at https://github.com/closest-git/GSS. For many large matrices, GSS is about 2-3 times faster than PARDISO [schenk2004solving] in MKL. We will give more detailed results in subsequent papers.

5 Prospect

In this paper, we listed many combinatorial optimization processes in the sparse solver and unified them into the framework of the Markov Decision Process, with a detailed analysis of state space, action space, and reward. We then used the Q-Learning technique to improve the solution process.

This is only the first step in rethinking and recasting the Gaussian elimination process; more work remains to be done. We will report more methods and numerical experimental results in subsequent papers.

References