1 Introduction
The multigroup radiation diffusion (MGD) equations have a broad range of applications, including inertial confinement fusion (ICF) and astrophysics m04 . Because of the complicated nonlinear couplings among dozens of physical quantities at multiple temporal and spatial scales, MGD equations are often discretized by the finite volume method, which allows for local conservation h03 ; n01 ; h04 ; g01 ; p03 ; s02 ; x05 , resulting in a series of nonsymmetric but positive definite large-scale sparse linear systems. The overall complexity increases not only with the mesh size but also with the level of coupling between physical quantities. It must be emphasized that these linear solves account for the bulk of the run time (about 80% in general) in ICF numerical simulations, because the coefficient matrices are invariably ill-conditioned.
To effectively address the aforesaid bottlenecks, numerous preconditioned Krylov subspace methods have been proposed over the past decades in an effort to be both efficient and scalable; see b01 ; m01 ; x03 ; y01 ; y02 ; x01 ; x02 ; h02 ; y03 and the references cited therein. These preconditioners are conceived as approximate inverses and fall into the categories of incomplete LU factorizations, domain decomposition preconditioners, monolithic algebraic multigrid (AMG) methods and their symmetric / nonsymmetric combinations. Since each of these coefficient matrices has an underlying block structure, one can also devise a block preconditioner to separate the global problem into easier-to-solve subproblems and form an object-oriented framework that allows for code reuse and incorporates single-physics experience into multiphysics simulations. The objective is to approximately invert numerous individual scalar systems instead of the fully coupled system. Preconditioners of this type have been proposed and analyzed in the literature, including the block diagonal preconditioner d01 ; l01 ; z03 , block lower / upper triangular preconditioner b04 ; b02 ; c01 , product (splitting) preconditioner r02 ; w01 ; z02 and constraint preconditioner b06 ; k01 ; d02 . Block preconditioners with multigrid components have proven very successful in a variety of applications, e.g., liquid crystal directors modeling b07 , multiphase flow in porous media b03 , the Stokes problem c02 , the incompressible Navier-Stokes problem e01 , second-order Agmon-Douglis-Nirenberg elliptic systems l02 , a magnetohydrodynamics model m02 , the Dirichlet biharmonic problem p02 , electrical activity in the heart s01 , the Brinkman problem v01 , all-speed melt pool flow physics w03 and fully coupled flow and geomechanics w02 . Our focus in this work is on block preconditioning based on the classical AMG method owing to its general applicability, high efficiency and easy-to-use user interface.
In the recent work a01 , four types of operator-based preconditioners have been developed in the Jacobian-free Newton-Krylov method for solving two-dimensional three-temperature energy equations. These preconditioners are application specific, relying on physical properties of the energy equations as well as different linearizations of different terms in the nonlinear residual. They are demonstrated numerically to be very effective. However, the corresponding preconditioning matrix has to be assembled explicitly in each time step. In this study, we restrict our consideration to a variety of purely algebraic block preconditioners, which are based only on information available from the original block system, without constructing the preconditioning matrix.
The remainder of this work proceeds as follows. Section 2 provides the mathematical formulation of the MGD equations and the fully coupled, implicit cell-centered finite volume discretization. Section 3 describes several AMG block-based preconditioning strategies, including an adaptive preconditioning strategy and four approximations of the matrix inverses to improve performance. In Section 4, we compare the performance of these preconditioners on representative MGD linear systems from capsule implosion simulations, on both sequential and parallel computers. Finally, we draw conclusions in Section 5.
2 Problem formulation and discretization
This article is concerned with the time-dependent MGD equations in a certain symmetric geometry p01
(2.1) 
where

, , , respectively denote the number of energy groups, velocity of light, density of the medium and energy transfer coefficient between electron and ion;

, , respectively denote the th spectral radiation and electron scattering energy densities and highly nonlinear radiation diffusion coefficient;

, , respectively denote the th radiation source, scattering and absorption coefficients of the Planck-averaged electron energy;

, , respectively denote the specific heat capacity, temperature and nonlinear thermal-conductivity coefficient of electron ( = ) or ion ( = ).
Various discretization schemes (see q01 ) can be applied to reduce the continuous differential equations to finite-dimensional sparse linear systems. Utilizing the (adaptive) backward Euler method for the temporal discretization, followed by frozen-in coefficients for the linearization, and then an appropriate cell-centered finite volume scheme for the spatial discretization, we obtain a series of sparse block-structured linear systems from (2.1) by grouping together the unknowns corresponding to the same physical quantity:
(2.2) 
in which and for group index , so that the coefficient matrix is generally positive definite but not necessarily symmetric. For these linear systems, we use the restarted generalized minimal residual (GMRES()) method, where is the number of Krylov directions to orthogonalize against. Furthermore, preconditioning the GMRES() solver is essential for rapid convergence.
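As a self-contained sketch of this solver setup (in Python with SciPy rather than the HYPRE/C implementation used later in this paper, and with an incomplete LU factorization standing in for the AMG preconditioners discussed below; matrix size and coefficients are invented), restarted, preconditioned GMRES might look like:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# A small nonsymmetric, positive definite tridiagonal matrix standing in
# for one MGD linear system; all sizes and coefficients are invented.
n = 200
A = sp.diags([-1.2, 2.5, -0.8], [-1, 0, 1], shape=(n, n), format='csc')
b = np.ones(n)

# The paper preconditions with AMG; spilu is used here only as a
# self-contained stand-in exposing the same LinearOperator interface.
ilu = spla.spilu(A, drop_tol=1e-5)
M = spla.LinearOperator((n, n), matvec=ilu.solve)

# GMRES(30): restart after orthogonalizing 30 Krylov directions.
x, info = spla.gmres(A, b, M=M, restart=30, atol=1e-10)
```

With the relative tolerance left at its default, `info == 0` signals that the restarted iteration has converged.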
It is worth noting that the diagonal blocks () have the same nonzero structure as a discretized purely elliptic problem, while the coupling terms () are nonsingular diagonal matrices. Typically, the algebraic characteristics of the submatrices () are much more favorable than those of the matrix . This is the fundamental motivation for devising block preconditioners to solve (2.2).
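To make this block structure concrete, the following hypothetical miniature (Python/SciPy; all coefficients invented for illustration) assembles a one-group (3T) matrix with elliptic diagonal blocks, diagonal coupling blocks, and no direct radiation-ion coupling; unequal coupling pairs make it nonsymmetric:

```python
import scipy.sparse as sp

def model_3t_matrix(n=50, c_re=0.2, c_ie=0.1):
    """Miniature one-group (3T) block matrix: diagonal blocks share the
    stencil of a 1D elliptic operator, couplings are diagonal, and the
    radiation-ion blocks are zero (coupling only through the electrons)."""
    lap = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
    I = sp.identity(n)
    Z = sp.csr_matrix((n, n))
    A_r = lap + c_re * I            # radiation block
    A_e = lap + (c_re + c_ie) * I   # electron block
    A_i = lap + c_ie * I            # ion block
    # unequal off-diagonal pairs make the global matrix nonsymmetric
    return sp.bmat([[A_r,             -c_re * I,        Z        ],
                    [-0.8 * c_re * I,  A_e,             -c_ie * I],
                    [Z,               -0.8 * c_ie * I,  A_i      ]],
                   format='csr')

A = model_3t_matrix()
```

The toy matrices used to exercise the block preconditioners sketched later in this section all follow this pattern.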
3 Preconditioning strategies
This section is devoted to four preconditioning strategies: the monolithic classical AMG, the physical-variable based coarsening two-level (PCTL) preconditioner and two types of block Schur complement preconditioners. In addition, two improvements are introduced in Sections 3.4 and 3.5.
3.1 Monolithic classical AMG preconditioner
The classical AMG method developed in r01 is by now quite mature and one of the most popular preconditioners in real applications, owing to its scalability and its applicability on complicated domains and unstructured grids. It consists of a Setup phase and a Solve phase. The former builds all the ingredients required by a hierarchy of grids, from the finest to the coarsest, under the assumption that no information on the underlying geometry, grids and continuous operators is available, while the latter performs V-cycles, W-cycles or F-cycles. For more detailed information and challenges, we refer to the review articles s04 ; x07 ; x06
and related references therein. It should be noted, however, that different coarsening schemes, interpolation procedures and relaxation methods result in different operator complexities, Setup time, cycle time and convergence rates. We apply BoomerAMG
h01 , a parallel implementation of classical AMG in the HYPRE package, on the fully coupled monolithic linear system as a black-box preconditioner, empirically with a strength threshold of 0.25, a single V-cycle with Falgout (a hybrid RS / CLJP) coarsening h01 for test runs, an aggressive coarsening process on the finest level, one pre-smoothing and one post-smoothing sweep performed by the hybrid Gauss-Seidel method in symmetric ordering (i.e., the down cycle relaxes first all coarse points and then all fine points, and the up cycle uses the opposite traversal), classical modified interpolation, at most 100 degrees of freedom (DoFs) at the coarsest level, which is solved via Gaussian elimination, and other parameters chosen from the default configuration.
3.2 PCTL preconditioner
The PCTL algorithm was first proposed by Xu and his co-authors for 3-temperature () linear systems x03 . The four-level multigrid reduction preconditioner proposed recently by Bui, Wang and Osei-Kuffuor b08 can be viewed as a variation of PCTL. The generalization of PCTL used to precondition (2.2) is straightforward, although various components have to be chosen. Taking the special structure of the coefficient matrix into account, we set the electron temperature variables to be coarse points and the others to be fine points. The relaxation routine used here is C/F block relaxation sweeps, i.e., variables associated with coarse points are relaxed first, followed by the other unknowns. Denote by
the interpolation matrix to match up with the aforementioned coarse-grid selection, where submatrices () and . However, they are often significantly denser, even though
are all diagonal matrices. From the implementation point of view, they are restricted to be diagonal and give an exact approximation only for constant functions. That is, for vector
, the resulting vectors containing all the diagonal entries of the blocks (), denoted by , can be respectively computed as follows: The restriction is defined as and the coarse-grid operator is taken as
(3.1) 
Noting that () are all diagonal, it can be deduced that and have the same nonzero structure. The preconditioning phase of the PCTL algorithm takes the following form:

(pre-smoothing step) Run block relaxation once on the subsystems:

Solve the coarse-grid system: ;

(coarse-grid correction) Set .
where is an incoming arbitrary Krylov vector and is the outgoing Krylov solution.
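The steps above can be sketched in a dense Python/NumPy toy for the one-group case, with the electron unknowns as coarse points. The diagonal interpolation weights below are an assumed construction chosen to reproduce constant vectors exactly, since the paper's exact formula is not reproduced here; the zero radiation-ion blocks and all coefficients in the usage are invented:

```python
import numpy as np

def pctl_apply(A_r, A_e, A_i, D_re, D_er, D_ei, D_ie, r):
    """One application of a PCTL-style two-level preconditioner:
    C/F block relaxation, then a Galerkin coarse-grid correction on
    the electron (coarse) unknowns."""
    n = A_e.shape[0]
    r_r, r_e, r_i = np.split(r, 3)
    # (1) pre-smoothing: relax the coarse (electron) block first,
    #     then the fine (radiation and ion) blocks
    x_e = np.linalg.solve(A_e, r_e)
    x_r = np.linalg.solve(A_r, r_r - D_re @ x_e)
    x_i = np.linalg.solve(A_i, r_i - D_ie @ x_e)
    x = np.concatenate([x_r, x_e, x_i])
    # (2) diagonal interpolation blocks, chosen here (an assumption)
    #     so that P reproduces constant vectors exactly
    w_r = np.linalg.solve(A_r, -D_re @ np.ones(n))
    w_i = np.linalg.solve(A_i, -D_ie @ np.ones(n))
    P = np.vstack([np.diag(w_r), np.eye(n), np.diag(w_i)])
    Z = np.zeros((n, n))
    A = np.block([[A_r, D_re, Z], [D_er, A_e, D_ei], [Z, D_ie, A_i]])
    A_c = P.T @ A @ P                 # Galerkin coarse operator, cf. (3.1)
    # (3) coarse-grid correction
    e_c = np.linalg.solve(A_c, P.T @ (r - A @ x))
    return x + P @ e_c
```

With weakly coupled blocks, a single application leaves a residual only in the electron rows after pre-smoothing, which the coarse-grid correction then targets.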
3.3 Two types of block Schur complement preconditioners
Note that the coefficient matrix can be rewritten in the factored form
(3.2) 
where the matrix and
is the identity matrix of the same size as
. At this point, it is important to emphasize that the factorization (3.2) breaks away from the original system. Then a similar factorization is applied to decouple the remaining block into two separate systems, where
The error associated with the global factorization would be zero if is inverted exactly and is determined completely. However, since it would be prohibitively expensive to form the full block matrix explicitly, we instead implement with for , ignoring all off-diagonal elements of . In this situation, we obtain the completely decoupled form, leading to the following implementation procedure:

(the intermediate ion segment) ;

(the intermediate electron segment) ;

, ;

(the corrected electron segment) ;

(the corrected ion segment) .
This preconditioning strategy is denoted as Schur1, whose error can be formulated as follows:
where
denotes the zero matrix of suitable size.
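The essential pattern of Schur1 — intermediate solves for the ion and radiation segments, a corrected electron solve through an approximate Schur complement, and corrected back-substitution — can be sketched for the one-group (3T) case as follows (Python/NumPy; a hedged reading of the procedure, with the diagonal approximation of the inner inverses applied only when forming the electron Schur complement, and all coefficients in the usage invented):

```python
import numpy as np

def schur1_apply(A_r, A_e, A_i, D_re, D_er, D_ei, D_ie, r):
    """Hedged sketch of a Schur1-type application for the one-group (3T)
    case: eliminate the ion and radiation unknowns into an electron Schur
    complement, solve it, then back-substitute."""
    r_r, r_e, r_i = np.split(r, 3)
    y_i = np.linalg.solve(A_i, r_i)          # intermediate ion segment
    y_r = np.linalg.solve(A_r, r_r)          # intermediate radiation segment
    # approximate electron Schur complement; diag() keeps it cheap to
    # form, mirroring the diagonal approximation described in the text
    S = (A_e
         - D_er @ np.diag(1.0 / np.diag(A_r)) @ D_re
         - D_ei @ np.diag(1.0 / np.diag(A_i)) @ D_ie)
    x_e = np.linalg.solve(S, r_e - D_er @ y_r - D_ei @ y_i)  # corrected electron
    x_r = y_r - np.linalg.solve(A_r, D_re @ x_e)             # corrected radiation
    x_i = y_i - np.linalg.solve(A_i, D_ie @ x_e)             # corrected ion
    return np.concatenate([x_r, x_e, x_i])
```

With exact inner solves, the only residual left by one application sits in the electron rows and is proportional to the Schur complement approximation error, consistent with the error expression above.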
The last preconditioning strategy, denoted by Schur2, is motivated by reducing the block system into a and a block system. It is of the approximate block factorization type. Assembling the block decompositions
and
we obtain the full Schur complement preconditioning matrix of the form
Schur2  
where and are the Schur complement matrices, respectively defined by
The first factor in (3.3) represents the “elimination” stage, while the second factor in (3.3) executes the “substitution” stage. Schur2 can be viewed as an extension to the matrix of the primitive-variable block Schur complement preconditioner proposed recently by Weston et al. w03 . In the practical implementation, similar to Schur1, we take with . In this setting, the preconditioning operation proceeds as follows:

(the intermediate electron segment) ;

, ; ;

(the corrected electron segment) .
Observe that the action of Schur2 contains (), once and twice. The error associated with this preconditioner can be written as
Comparing with the block matrix , it is not difficult to see that there are more nonzero entries in . As a consequence, Schur1 may perform better than Schur2 in terms of convergence behavior.
3.4 Fast approximations of the matrix inverse
It is customary to take advantage of the diagonal dominance property in order to approximate efficiently the action of each matrix inverse in the preceding block preconditioners. We start with a definition.
Definition 1
For a given threshold , define the weak diagonal dominance factor for the diagonal blocks ():
Remark 1
Note that is a strictly increasing function; the larger the factor , the more weakly diagonally dominant the matrix becomes.
It is clear from Definition 1 that can be regarded as a diagonal matrix if the indicator holds. is taken to be in our experiments. We implement two options for approximating the operation :

When , this operation is performed by a fixed number of Jacobi iterations;

Otherwise, a fixed number of AMG Vcycles is used.
It is worthwhile to point out that the interpolation operator appearing in the PCTL preconditioner should be constructed by another option:

The matrix inverse is approximated by a number of AMG V-cycles to some prescribed accuracy.
Also, notice that the inverse of still appears in the Schur complement matrices and . This can be avoided by a diagonal approximation. In this fashion, and are replaced by
where denotes the diagonal matrix whose diagonal entries are those of . Therefore, () and can be formed explicitly and approximated by option (#3). Approximations of and (), denoted by and , are carried out in exactly the same manner. Alternatively, a more accurate approach can be used to remedy the quite slow convergence when options (#1)-(#3) break down: the action of can be approximated in an iterative fashion, that is,

Taking the zero vector as the initial guess, the approximate solution at the th step is given by
where () is iteratively approximated by option (#1), (#2) or (#3).
We note that option (#4) is much more expensive. In addition, the remaining matrix inverses , and () can be treated in the same fashion as above.
3.5 An adaptive block preconditioning strategy
The coupling strength can be used as a clear criterion to automatically choose a preconditioner for each system arising from the nonlinear iterations at each integration time step. Although the change from one linear system to the next may be relatively small, the accumulated change after many time steps can be significant. An indicator is introduced in Definition 2 to measure the coupling strength of each coupling term.
Definition 2
For a given threshold , define the weak coupling factor for the coupling term () relative to :
Remark 2
The larger the factor , the weaker the coupling between the th and th physical quantities. Furthermore, is a strictly increasing function.
With this definition, it is possible to assume that can be neglected without loss of robustness, provided that is larger than a switch criterion . The smaller is chosen, the more aggressive the strategy becomes. In particular, if is satisfied for some physical variable , then it can be extracted from the preconditioning matrix, i.e., the th row and column are removed. In this case, the th approximate solution is computed independently, and the remaining unknowns are treated in the same way as in the previous scenario. In our computations, and are set to and , respectively. It should be mentioned that this adaptive block preconditioning strategy can be imposed on any of the block preconditioners described in Sections 3.2-3.3.
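A minimal sketch of the switching logic (Python/NumPy; the coupling factor below is an illustrative stand-in for Definition 2, whose exact formula is not reproduced here, and the block/coupling containers and threshold value are hypothetical):

```python
import numpy as np

def coupling_factor(A_kk, D_kj):
    """Illustrative weak-coupling factor of the coupling block D_kj
    relative to the diagonal block A_kk: large values mean the coupling
    is weak (both blocks carry their coupling on the diagonal here)."""
    d = np.abs(np.diag(D_kj))
    return float(np.min(np.abs(np.diag(A_kk)) / np.maximum(d, 1e-300)))

def select_couplings(blocks, couplings, delta=1e8):
    """Keep only coupling blocks whose factor does not exceed the switch
    criterion delta; dropping all couplings of a variable reduces the
    preconditioner toward a block triangular or block diagonal one."""
    return {(k, j): D for (k, j), D in couplings.items()
            if coupling_factor(blocks[k], D) <= delta}
```

Couplings judged negligible by the indicator are simply omitted from the preconditioning matrix, and the corresponding unknowns can then be solved for independently.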
Remark 3
Obviously, the block lower (upper) triangular preconditioner is equivalent to the particular case when and ( and ) for , while the block diagonal preconditioner corresponds to the simplest case that and for .
Remark 4
It must be emphasized that it is necessary to perform the coarse-grid correction step in the PCTL preconditioner except when the conditions () are all satisfied.
3.6 A summary of block preconditioning strategies
In summary, the block Schur complement preconditioners utilize legacy algorithms to deal with the reduced scalar subsystems in an operator-splitting fashion, whereas the PCTL preconditioner is derived from the observation that there is no coupling between the radiation and ion temperature variables, although both are separately coupled with the electron temperature variables. One should be aware that all of them are based on the divide-and-conquer paradigm: each task is first split into more tractable subtasks, and their results are then gathered and coupled.
Five types of operations performed in the Setup and Preconditioning phases of these block preconditioners are compared in Table 1 with respect to a certain subsystem. It is important to note that all the coupling terms () are diagonal. Consequently, all matrix-vector multiplications that occur reduce to so-called Hadamard products of two or three vectors, except the one in the second step of the PCTL preconditioner, and a matrix update confined to diagonal entries is mathematically equivalent to one vector update. Note that additional matrix inverses as well as matrix updates are required in the Setup phase of PCTL. As can be seen, compared with Schur2, there are
more matrix-vector multiplications in PCTL, as well as one more matrix inverse in Schur1. Evidently, Schur2 is often preferred over PCTL and Schur1 when they exhibit similar inner and outer convergence behaviors. As mentioned above, however, Schur1 will probably be more accurate, giving rise to faster convergence and hence better efficiency than Schur2 in many cases.
Setup phase  Preconditioning phase  
matrix inverse  matrix update  matrix inverse  matrix-vector multiplication  Hadamard product of vectors  vector update  
PCTL  
Schur1  0  0  0  
Schur2  0  0  0 
Each adaptive block preconditioned GMRES() solver consists of two levels. The adaptive block preconditioning strategy described in Section 3.5 is applied at the first level. Its purpose is to find an appropriate preconditioner with less computation time for problems that are relatively easy to solve. The second level bifurcates every matrix inverse appearing in the preceding block preconditioners into the fast or the accurate approximation proposed in Section 3.4.
3.7 Implementation details
The three block preconditioners described in Sections 3.2-3.3 are implemented on the basis of the HYPRE package using C and the message passing interface (MPI) with data located in distributed memory. The parallel executable code decomposes the global communicator, as illustrated in Figure 1, such that each subsystem owns a group holding an ordered collection of processor identifiers for the purpose of running parallelized modules to solve large problems.
An effective parallel data decomposition plays a key role in achieving good performance. Each subsystem should be partitioned in such a way that each part is of about equal size to avoid load balancing issues. In our parallel implementation, all diagonal blocks are stored using the ParCSR matrix data structure within the associated communication group, while all coupling terms are represented by the ParVector data structure of global length . More detailed information about ParCSR and ParVector can be found in Falgout et al. f01 . Each matrix inverse is accomplished by BoomerAMG h01 with one of the options described in Section 3.4, where we use V(1,1)-cycles with hybrid Gauss-Seidel relaxation in symmetric ordering, coarse-grid matrices formed algebraically by the Galerkin process, and the algebraic interface-based coarsening recently presented by Xu and Mo in x02 to obtain a good balance between convergence and complexity.
It is apparent that Schur1 and Schur2 are far easier to implement than PCTL, because the latter requires the construction and numerical solution of the coarse-grid system, which involves much more complicated global communication between different processors y04 ; e.g., all diagonal blocks () and coupling terms () should be distributed both locally (data sets used in the setup and preconditioning phases) and globally (data sets used to generate the globally distributed matrix by formula (3.1) in the setup phase). For Schur1 and Schur2, in contrast, the distributed local data sets are sufficient; their parallel flow diagrams and MPI communication patterns are depicted in Figure 2. The numerical solution of a certain subsystem at steps 1, 2, 3 and 4 is followed by the data exchange process labeled (a), (b), (c) and (d), respectively. Another communication group is established in the setup phase so that processors with the same local identifier in the partition (shown in Figure 1) are grouped together, making the exchange of messages (such as (b) and (c) in Schur1) easier and more efficient to implement. At this point, the two MPI functions MPI_Bcast and MPI_Gather can be utilized to spread elements from the root processor (belonging to comm_E) to the others and to gather elements from the other processors onto the root processor, respectively.
4 Numerical results
In this section, we provide experimental results intended to study the effectiveness and parallel scalability of the block preconditioners against BoomerAMG h01 implemented in the parallel C library HYPRE version 2.15.1. These results are obtained on the Tianhe-1 supercomputer y05 , a multi-array, configurable and cooperative parallel system with a theoretical peak speed of 1.372 petaflops, composed of high performance general-purpose microprocessors and a high-speed Infiniband network. Our code is compiled with mpich-3.1.3 using the icc compiler version 11.1.059 and the O2 optimization level. The preconditioned GMRES() iteration was halted when the Euclidean norm of the current relative residual was smaller than . Unless otherwise stated, the stopping tolerances are chosen to be for performance reasons in approximating , , and () in the Schur complement preconditioners, as well as in the construction of the interpolation and block relaxations in the PCTL preconditioner.
The following notation is used in the tables below: denotes the number of preconditioned GMRES() iterations; represents the CPU time measured in seconds to solve the linear system by the preconditioned GMRES() solver; PCTL, Schur1 and Schur2 stand for the adaptive PCTL, Schur1 and Schur2 variants, respectively; indicates the number of processors used.
4.1 Numerical experiments on one processor
The results in this subsection are structured in two distinct parts. They are based on two suites of representative MGD linear systems from two-dimensional capsule implosion simulations: the first suite includes 4 three-temperature (i.e., only one-group) linear systems, respectively denoted by ; the other suite consists of 37 three-temperature and 20 twenty-group linear systems.
The performance of seven preconditioned GMRES(30) solvers is investigated for problems  on two different grids, as shown in Table 2. We observe from this table that (i) AMG preconditioned GMRES(30) does not converge robustly enough and achieves convergence only in an excessive number of iterations (shown by ); (ii) PCTL, Schur1 and Schur2 exhibit better convergence behavior and, as expected, Schur1 is numerically more robust than the other two preconditioners; (iii) Schur1 and Schur2 require less wall time than PCTL; (iv) the adaptive variants are much more efficient, although additional arithmetic operations are needed to determine an appropriate and inexpensive preconditioner.
AMG  PCTL  PCTL  Schur1  Schur1  Schur2  Schur2  AMG  PCTL  PCTL  Schur1  Schur1  Schur2  Schur2  

9  3.6  3  1.7  2  1.5  3  1.4  3  1.4  4  1.6  4  1.6  9  14.8  3  7.6  2  6.4  3  7.1  3  6.9  5  7.7  5  7.8  
80  31.2  12  6.1  10  5.3  4  3.1  5  2.9  4  3.0  5  2.9  100  164.2  17  270.3  14  28.7  4  13.5  6  14.0  4  11.8  6  13.0  
29  12.1  11  5.4  6  3.8  4  3.1  5  2.9  6  3.4  6  3.2  32  56.4  14  28.2  7  17.4  4  14.3  6  13.6  8  18.0  8  16.2  
11  4.8  3  11.6  6  3.9  3  9.1  6  3.9  3  9.3  6  3.9  11  20.6  3  24.9  6  16.3  3  18.5  6  16.3  3  19.5  6  16.3 
An important observation from Table 2 is why the computational cost of PCTL becomes unacceptable for on the grid, despite its reasonable iteration counts. The focus is on the inner iteration comparisons among PCTL, PCTL, Schur1, Schur1, Schur2 and Schur2. The results are illustrated in Table 3, where (i) entries with the superscript indicate that option (#1) is used and option (#2) otherwise; here we consider just a single Jacobi iteration or AMG V-cycle to approximate each matrix inverse; (ii) entries with the superscript indicate that the block diagonal preconditioner with option (#1) or (#2) is utilized to improve performance. Since the simulation is much more sensitive to the accuracy of the radiation temperatures, three iterations are used to solve approximately for them, while one iteration is applied to the subsystems associated with the electron and ion temperatures. The primary reason why PCTL is so expensive is that a total of 623 AMG V-cycles are required to solve all of the subsystems associated with the radiation temperatures. This is remedied in PCTL by using option (#1) or (#2) instead of option (#3), however, at the cost of more coarse-grid correction steps (from 9 to 15). Fortunately, there is a decrease in the number of outer iterations (from 17 to 14), resulting in a much faster preconditioner.
PCTL  PCTL  Schur1  Schur1  Schur2  Schur2  

0  1  4  3  3  4  1  1  3  3  6  3  7  6  3  14  3  5  4  
2  1  10  623  9  9  6  1  15  15  15  10  12  9  14  7  9  10  5  13  7  
3  1  5  19  5  11  6  1  8  8  8  10  14  10  14  7  17  14  9  17  9  
7  1  6  27  4  1  0  0  0  5  25  5  7  27  4 
Next we examine the iteration counts and wall time for the second suite of representative linear systems. A plot of this data appears in Figure 3, with problem sizes of roughly 2.3 and 16.9 million DoFs, respectively. It can easily be seen that PCTL, Schur1, Schur2 and their adaptive variants converge much faster than AMG, with the exception of 3 twenty-group systems which could not be solved by PCTL, Schur1 and Schur2. The reason is that the stopping tolerance in option (#3) is not small enough; however, the smaller the tolerance, the higher the total costs. It should be emphasized that, although they may demonstrate slightly slower convergence, PCTL, Schur1 and Schur2 are less expensive to use, because the actual convergence behavior is produced by the block diagonal preconditioner. In terms of total wall time, Schur1 and Schur2 have comparable computational cost, and they are both observed to be the most computationally efficient options, yielding a 13.7% (three-temperature) and 41.6% (twenty-group) reduction compared with PCTL, and running 5.5 (three-temperature) and 8.5 (twenty-group) times faster than AMG.
4.2 Parallel experiments
Run times and strong and weak parallel scaling properties are of particular interest in this subsection. Since our previous performance evaluations showed that PCTL, Schur1 and Schur2 generally lead to better convergence, robustness and efficiency, we only coded their parallel implementations, while PCTL, Schur1 and Schur2 are not considered here. Timings are measured by the MPI function MPI_Wtime in seconds.
We consider 6 twenty-group linear systems from the previous subsection, respectively denoted by , and examine the strong parallel scalability of the AMG, PCTL, Schur1 and Schur2 preconditioned GMRES(30) solvers on a grid (see Table 4). The simulation is run on 22 to 352 cores, each time doubling the number of cores used. The problem size per processor is 48,000 DoFs for 352 cores. The results in Table 4 show that these four preconditioners exhibit good convergence (in terms of the number of iterations) and numerical (in terms of wall time) scalability in the strong sense. We note that the PCTL, Schur1 and Schur2 preconditioned GMRES(30) solvers all converge robustly in at most 3 steps with respect to the number of processors and the problem character, and run on average 3.4, 4.1 and 5.5 times faster than AMG on 352 cores.
AMG  PCTL  
19  22.7  18  14.7  17  9.4  19  5.8  16  3.0  2  6.8  2  4.3  2  2.5  2  1.7  2  1.1  
18  22.0  18  15.0  18  9.9  18  5.6  17  3.1  2  6.7  2  4.3  2  2.5  2  1.8  2  1.2  
18  21.9  17  14.5  19  10.2  19  5.8  19  3.3  2  6.8  2  4.2  2  2.5  2  1.7  2  1.1  
34  55.6  32  23.1  36  16.5  32  8.2  34  4.5  2  6.9  2  4.2  2  2.6  2  1.6  2  1.1  
26  53.7  24  22.5  24  13.7  27  7.4  30  4.3  2  7.0  2  4.3  2  2.6  2  1.6  2  1.1  
27  49.8  29  31.3  33  17.8  29  7.6  32  4.4  3  8.5  3  5.2  2  2.7  3  2.1  2  1.1  
Schur1  Schur2  
2  6.8  2  4.0  2  2.5  2  1.6  2  0.9  2  6.1  2  3.8  2  2.1  2  1.3  2  0.7  
2  6.7  2  4.1  2  2.5  2  1.5  2  0.9  2  6.1  2  3.9  2  2.1  2  1.2  2  0.7  
2  6.8  2  4.1  2  2.4  2  1.4  2  0.9  2  6.2  2  3.8  2  2.3  2  1.1  2  0.6  
2  6.8  2  4.2  2  2.5  2  1.5  2  0.9  2  6.2  2  3.9  2  2.2  2  1.3  2  0.7  
2  6.8  2  4.1  2  2.5  3  1.9  2  1.0  2  6.2  2  4.0  2  2.3  2  1.2  2  0.7  
3  8.2  3  5.0  2  2.5  3  1.9  2  0.9  3  7.4  3  4.7  2  2.2  3  1.6  2  0.7 
The subsequent experiments constitute a weak parallel scaling study to analyze the convergence and numerical scalability of the GMRES(30) solvers preconditioned by AMG, PCTL, Schur1 and Schur2. We start with a mesh size of and then refine the mesh in all directions up to , so the largest problem has about 16.9 million unknowns. We run these problems using 22, 88 and 352 cores (with 48,000 DoFs per processor), respectively. As shown in Table 5, all of the preconditioners tested here scale well in the weak numerical sense. The PCTL, Schur1 and Schur2 preconditioned GMRES(30) solvers converge robustly with respect to the discrete problem size, while the number of iterations of the AMG preconditioned GMRES(30) solver for varies from 16 on the grid to 30 on the grid. Regarding the parallel efficiency calculated by s05 , where is the wall time using processors, the average parallel efficiency of the AMG, PCTL, Schur1 and Schur2 preconditioned GMRES(30) solvers is 72.2%, 79.3%, 74.7% and 75.1%, respectively. Furthermore, in terms of wall time, Schur2 has obvious advantages over AMG, PCTL and Schur1.
AMG  PCTL  Schur1  Schur2  

15  1.33  15  2.29  16  3.01  2  0.62  2  0.87  2  1.08  2  0.48  2  0.67  2  0.91  2  0.35  2  0.50  2  0.71  
14  1.25  17  2.49  17  3.13  2  0.58  2  0.85  2  1.15  2  0.46  2  0.65  2  0.93  2  0.36  2  0.49  2  0.69  
14  1.24  18  2.63  19  3.26  2  0.58  2  0.84  2  1.11  2  0.47  2  0.65  2  0.91  2  0.35  2  0.48  2  0.64  
28  2.05  30  3.55  34  4.52  2  0.59  2  0.88  2  1.12  2  0.46  2  0.64  2  0.89  2  0.35  2  0.46  2  0.67  
16  1.47  18  2.61  30  4.31  2  0.58  3  0.98  2  1.11  2  0.45  3  0.76  2  0.95  2  0.34  3  0.58  2  0.70  
21  1.68  21  2.78  32  4.40  2  0.59  2  0.86  2  1.09  2  0.46  3  0.76  2  0.94  2  0.34  3  0.59  2  0.72 
5 Conclusions
Three types of AMG block-based preconditioning techniques and two improvements have been introduced to solve the linear systems resulting from the MGD equations. They are easy to set up, since they only involve operations on blocks that are readily extracted from the monolithic linear system. Our experiments show that they all scale well both algorithmically and in parallel, and exhibit overall better convergence behavior and computational efficiency than the monolithic AMG preconditioner.
Acknowledgement
This work is financially supported by Science Challenge Project (TZ2016002), National Natural Science Foundation of China (11601462, 11971414), Project of Scientific Research Fund of Hunan Provincial Science and Technology Department (2018WK4006), Hunan Provincial CivilMilitary Integration Industrial Development Project “Adaptive Multilevel Solver and Its Application in ICF Numerical Simulation” and Hunan Provincial Natural Science Foundation of China (2018JJ3494). The numerical calculations in this paper have been done on the supercomputing system of the National Supercomputing Center in Changsha.
References
 (1) H. B. An, Z. Y. Mo, X. W. Xu, X. W. Jia, Operator-based preconditioning for the 2D 3T energy equations in radiation hydrodynamics simulations, J. Comput. Phys. 385 (2019) 51–74.

 (2) Z. Z. Bai, G. H. Golub, L. Z. Lu, J. F. Yin, Block triangular and skew-Hermitian splitting methods for positive-definite linear systems, SIAM J. Sci. Comput. 26 (2005) 844–863.
 (3) Z. Z. Bai, M. K. Ng, Z. Q. Wang, Constraint preconditioners for symmetric indefinite matrices, SIAM J. Matrix Anal. Appl. 31 (2009) 410–433.
 (4) C. Baldwin, P. N. Brown, R. Falgout, F. Graziani, J. Jones, Iterative linear solvers in 2D radiation-hydrodynamics code: Methods and performance, J. Comput. Phys. 154 (1999) 1–40.
 (5) F. P. A. Beik, M. Benzi, Block preconditioners for saddle point systems arising from liquid crystal directors modeling, Calcolo 55 (2018) 29.

 (6) M. Benzi, B. Uçar, Block triangular preconditioners for M-matrices and Markov chains, Electron. Trans. Numer. Anal. 26 (2007) 209–227.
 (7) Q. M. Bui, H. C. Elman, J. D. Moulton, Algebraic multigrid preconditioners for multiphase flow in porous media, SIAM J. Sci. Comput. 39 (2017) S662–S680.

 (8) Q. M. Bui, L. Wang, D. Osei-Kuffuor, Algebraic multigrid preconditioners for two-phase flow in porous media with phase transitions, Adv. Water Resour. 114 (2018) 19–28.
 (9) Z. H. Cao, Positive stable block triangular preconditioners for symmetric saddle point problems, Appl. Numer. Math. 57 (2007) 899–910.
 (10) A. M. A. Côrtes, L. Dalcin, A. F. Sarmiento, N. Collier, V. M. Calo, A scalable block-preconditioning strategy for divergence-conforming B-spline discretizations of the Stokes problem, Comput. Methods Appl. Mech. Engrg. 316 (2017) 839–858.
 (11) E. de Sturler, J. Liesen, Block-diagonal and constraint preconditioners for nonsymmetric indefinite linear systems. Part I: Theory, SIAM J. Sci. Comput. 26 (2005) 1598–1619.
 (12) H. S. Dollar, Constraint-style preconditioners for regularized saddle-point problems, SIAM J. Matrix Anal. Appl. 29 (2007) 672–684.
 (13) H. Elman, V. E. Howle, J. Shadid, R. Shuttleworth, R. Tuminaro, A taxonomy and comparison of parallel block multilevel preconditioners for the incompressible Navier-Stokes equations, J. Comput. Phys. 227 (2008) 1790–1808.
 (14) R. D. Falgout, J. E. Jones, U. M. Yang, The design and implementation of hypre, a library of parallel high performance preconditioners, Lect. Notes Comput. Sci. Eng. 51 (2006) 267–294.
 (15) Y. N. Gao, X. K. Zhao, Y. H. Li, Finite volume element methods for two-dimensional three-temperature radiation diffusion equations, Numer. Math. Theor. Meth. Appl. 9 (2016) 470–496.
 (16) A. Hasečić, S. Muzaferija, I. Demirdžić, Finite volume method for radiative transport in multiphase flows with free surfaces, Numer. Heat Trans. A 70 (2016) 347–365.
 (17) P. Hassanzadeh, G. D. Raithby, Finite-volume solution of the second-order radiative transfer equation: Accuracy and solution cost, Numer. Heat Trans. B 53 (2008) 374–382.
 (18) V. E. Henson, U. M. Yang, BoomerAMG: A parallel algebraic multigrid solver and preconditioner, Appl. Numer. Math. 41 (2002) 155–177.
 (19) Q. Y. Hu, L. Zhao, Domain decomposition preconditioners for the system generated by discontinuous Galerkin discretization of 2D3T heat conduction equations, Commun. Comput. Phys. 22 (2017) 1069–1100.
 (20) C. Keller, N. I. M. Gould, A. J. Wathen, Constraint preconditioning for indefinite linear systems, SIAM J. Matrix Anal. Appl. 21 (2000) 1300–1317.
 (21) B. Lee, Multigrid for second-order ADN elliptic systems, Numer. Linear Algebra Appl. 26 (2019) e2256.
 (22) J. J. Lee, K. A. Mardal, R. Winther, Parameter-robust discretization and preconditioning of Biot's consolidation model, SIAM J. Sci. Comput. 39 (2017) A1–A24.
 (23) Y. C. Ma, K. B. Hu, X. Z. Hu, J. Xu, Robust preconditioners for incompressible MHD models, J. Comput. Phys. 316 (2016) 721–746.
 (24) D. Mihalas, B. Weibel-Mihalas, Foundations of radiation hydrodynamics, Dover, 1999.
 (25) Z. Y. Mo, L. J. Shen, G. Wittum, Parallel adaptive multigrid algorithm for 2D 3T diffusion equations, Int. J. Comput. Math. 81 (2004) 361–374.
 (26) C. Y. Nie, S. Shu, H. Y. Yu, J. Wu, Superconvergence and asymptotic expansions for bilinear finite volume element approximations, Numer. Math. Theor. Meth. Appl. 6 (2013) 408–423.
 (27) W. B. Pei, The construction of simulation algorithms for laser fusion, Commun. Comput. Phys. 2 (2007) 255–270.
 (28) G. Peng, Z. M. Gao, W. J. Yan, X. L. Feng, A positivity-preserving finite volume scheme for three-temperature radiation diffusion equations, Appl. Numer. Math. (2020), doi: https://doi.org/10.1016/j.apnum.2020.01.013.
 (29) J. Pestana, R. Muddle, M. Heil, F. Tisseur, M. Mihajlović, Efficient block preconditioning for a finite element discretization of the Dirichlet biharmonic problem, SIAM J. Sci. Comput. 38 (2016) A325–A345.

 (30) A. Quarteroni, A. Valli, Numerical approximation of partial differential equations, Springer-Verlag, 1994.
 (31) Y. H. Ran, J. G. Wang, D. L. Wang, On preconditioners based on HSS for the space fractional CNLS equations, East Asian J. Appl. Math. 7 (2017) 70–81.
 (32) J. W. Ruge, K. Stüben, Algebraic multigrid, in Multigrid Methods, Front. Appl. Math. 3 (1987) 73–130.
 (33) S. Sahni, V. Thanvantri, Performance metrics: Keeping the focus on runtime, IEEE Parall. Distrib. 4 (1996) 43–56.
 (34) Z. Q. Sheng, J. Y. Yue, G. W. Yuan, Monotone finite volume schemes of nonequilibrium radiation diffusion equations on distorted meshes, SIAM J. Sci. Comput. 31 (2009) 2915–2934.
 (35) K. Stüben, A review of algebraic multigrid, J. Comput. Appl. Math. 128 (2001) 281–309.
 (36) J. Sundnes, G. T. Lines, K. A. Mardal, A. Tveito, Multigrid block preconditioning for a coupled system of partial differential equations modeling the electrical activity in the heart, Comput. Method Biomec. 5 (2002) 397–409.
 (37) P. S. Vassilevski, U. Villa, A blockdiagonal algebraic multigrid preconditioner for the Brinkman problem, SIAM J. Sci. Comput. 35 (2013) S3–S17.
 (38) C. Wang, T. Z. Huang, C. Wen, A new preconditioner for indefinite and asymmetric matrices, Appl. Math. Comput. 219 (2013) 11036–11043.
 (39) B. Weston, R. Nourgaliev, J. P. Delplanque, A. T. Barker, Preconditioning a Newton-Krylov solver for all-speed melt pool flow physics, J. Comput. Phys. 397 (2019) 108847.
 (40) J. A. White, R. I. Borja, Block-preconditioned Newton-Krylov solvers for fully coupled flow and geomechanics, Comput. Geosci. 15 (2011) 647–659.
 (41) H. Xie, X. Xu, C. L. Zhai, H. Yong, A positivitypreserving finite volume scheme for heat conduction equation on generalized polyhedral meshes, Commun. Comput. Phys. 24 (2018) 1375–1408.
 (42) J. Xu, L. Zikatanov, Algebraic multigrid methods, Acta Numer. 26 (2017) 591–721.
 (43) X. W. Xu, Parallel algebraic multigrid methods: State-of-the-art and challenges for extreme-scale applications, J. Num. Method Comp. Appl. 40 (2019) 243–260.
 (44) X. W. Xu, Z. Y. Mo, Algebraic interface-based coarsening AMG preconditioner for multiscale sparse matrices with applications to radiation hydrodynamics computation, Numer. Linear Algebra Appl. 24 (2017) e2078.
 (45) X. W. Xu, Z. Y. Mo, H. B. An, Algebraic two-level iterative method for 2D 3T radiation diffusion equations, Chinese J. Comput. Phys. 26 (2009) 1–8.
 (46) X. W. Xu, Z. Y. Mo, H. B. An, An adaptive AMG preconditioning strategy for solving large-scale sparse linear systems, Sci. Sin. Inf. 46 (2016) 1411–1420.
 (47) X. J. Yang, X. K. Liao, W. X. Xu, J. Q. Song, Q. F. Hu, J. S. Su, L. Q. Xiao, K. Lu, Q. Dou, J. P. Jiang, C. Q. Yang, TH-1: China's first petaflop supercomputer, Front. Comput. Sci. China 4 (2010) 445–455.
 (48) X. Q. Yue, S. Shu, J. X. Wang, Z. Y. Zhou, Substructuring preconditioners with a simple coarse space for 2D 3T radiation diffusion equations, Commun. Comput. Phys. 23 (2018) 540–560.
 (49) X. Q. Yue, S. Shu, X. W. Xu, Z. Y. Zhou, An adaptive combined preconditioner with applications in radiation diffusion equations, Commun. Comput. Phys. 18 (2015) 1313–1335.
 (50) X. Q. Yue, X. W. Xu, S. Shu, JASMIN-based two-dimensional adaptive combined preconditioner for radiation diffusion equations in inertial fusion research, East Asian J. Appl. Math. 7 (2017) 495–507.
 (51) X. Q. Yue, Z. Y. Zhou, X. W. Xu, S. Shu, A parallel adaptive PCTL preconditioner for 3T radiation diffusion equations, Nat. Sci. J. Xiangtan Univ. 40 (2018) 6–10.
 (52) Q. F. Zhang, C. J. Zhang, Block preconditioning strategies for nonlinear viscous wave equations, Appl. Math. Model. 37 (2013) 5801–5813.
 (5