Testing performance with and without Block Low Rank Compression in MUMPS and the new PaStiX 6.0 for JOREK nonlinear MHD simulations

07/31/2019
by   Richard Nies, et al.

The interface to the MUMPS solver was updated in the JOREK MHD code to support Block Low Rank (BLR) compression, and an interface to the new PaStiX solver version 6, also supporting BLR, has been implemented. First tests were carried out with JOREK, which solves a large sparse matrix system iteratively in each time step. For the preconditioning, a direct solver is applied in the code to sub-matrices, and it is at this point that BLR was applied; the results are summarized in this report. For a simple case with a linearly growing mode, results with both solvers look promising: a considerable reduction of the memory consumption by several tens of percent was obtained. A direct increase in performance was already seen in particular configurations. The choice of the BLR accuracy parameter ϵ proves to be critical in this simple test and also in more realistic simulations, which were carried out only with MUMPS due to the limited time available. The more realistic test showed an increase in run time when using BLR, which was mitigated when using larger values of ϵ. However, the GMRes iterative solver no longer reaches convergence when ϵ is too large, since the preconditioner becomes too inaccurate in that case. It is thus critical to use an ϵ as large as possible while still reaching convergence. More tests regarding this optimum will be necessary in the future. BLR can also lead to an indirect speed-up in particular cases, when the simulation can be run on a smaller number of compute nodes due to the reduced memory consumption.


1 Background

1.1 JOREK solver

The JOREK [1] solver always involves LU factorisations, either to solve the entire matrix directly (generally used when the simulation is run axisymmetrically) or as a preconditioner before using the iterative solver GMRes. We focus on the latter case.

In the preconditioner, the blocks of toroidal harmonics are assumed to be decoupled from one another (i.e. off-diagonal blocks are not considered in the preconditioner), which greatly reduces the memory requirements and the runtime. This approach reaches its limits when the system is very nonlinear, as the preconditioner then becomes inaccurate and many GMRes iterations may be needed to converge to the solution.
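To make this strategy concrete, the following minimal Python/SciPy sketch (not JOREK code; all matrices, sizes and tolerances are artificial) builds a sparse system of several weakly coupled harmonic-like blocks, LU-factorises only the diagonal blocks, and hands the resulting block-diagonal operator to GMRes as a preconditioner:

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    rng = np.random.default_rng(0)
    nh, nb = 3, 200                      # hypothetical: 3 harmonic blocks of size 200

    # Strongly diagonally dominant diagonal blocks plus a weak global coupling.
    diag_blocks = [sp.random(nb, nb, density=0.05, random_state=rng)
                   + 10.0 * sp.identity(nb) for _ in range(nh)]
    A = sp.block_diag(diag_blocks, format="csr") \
        + 0.1 * sp.random(nh * nb, nh * nb, density=0.001, random_state=rng)

    # Preconditioner: LU-factorise only the decoupled diagonal blocks,
    # ignoring the off-diagonal coupling (as the JOREK preconditioner does).
    lu_blocks = [spla.splu(blk.tocsc()) for blk in diag_blocks]

    def apply_preconditioner(r):
        # Block-wise forward/back substitution; the coupling is neglected here.
        return np.concatenate([lu_blocks[i].solve(r[i * nb:(i + 1) * nb])
                               for i in range(nh)])

    M = spla.LinearOperator(A.shape, matvec=apply_preconditioner)
    b = rng.standard_normal(nh * nb)
    x, info = spla.gmres(A, b, M=M, maxiter=200)
    print("GMRes converged" if info == 0 else f"GMRes returned info = {info}")

The weaker the coupling (the more linear the system), the more accurate this block-diagonal preconditioner is and the fewer GMRes iterations are needed; strong coupling degrades it, which is exactly the limitation described above.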

1.2 LU factorisation of sparse matrix

The blocks of harmonics are sparse matrices (i.e. only a small percentage of entries is non-zero) and naive algorithms used for dense matrices should thus not be applied here. Indeed, these would lead to substantial fill-in: the combined storage cost for the L and U factors would greatly exceed that of the original matrix. The JOREK code makes use of either PaStiX [2] or MUMPS [3], both of which are able to handle sparse matrices efficiently.

In the following, we will briefly present the main steps of these solvers, without going into the details. More information can be found e.g. in [4].

In the analysis phase, a graph ordering tool is called, generally SCOTCH or METIS. These tools represent the matrix as a graph, and an elimination tree is constructed to reduce fill-in. This tree is traversed as each node is eliminated, and its contributions to nodes further up the tree (corresponding to fill-in) are computed. The update of nodes further up the tree can be performed either immediately after elimination (fan-out or right-looking), as late as possible before the elimination of the node further up the tree (fan-in or left-looking), or it can be carried up the tree, accumulating further updates from other nodes in so-called fronts (the multifrontal approach). [Footnote 1: PaStiX uses a fan-in strategy in the communication, in the sense that contributions are accumulated locally on the origin node and sent to the remote node only once all contributions have been applied, such that the remote node can then perform the update. Except for this specific case in the distributed setting, the algorithm is fan-out/right-looking oriented to obtain more parallelism, even for the accumulation of the remote contributions. MUMPS, in contrast, being based on the multifrontal approach, can be considered a fan-in/left-looking oriented algorithm.]

In the factorisation phase, the L and U factors are computed, following the elimination tree obtained in the analysis phase. This is typically the most CPU-intensive step, although the factorisation does not need to be repeated for every time step in a typical JOREK simulation, since the preconditioning matrix is reused as long as reasonable convergence is obtained in GMRes.

In the solve phase, the solution is computed using standard forward and back substitution. A solve also has to be performed for every iteration in GMRes.

1.3 BLR compression

Even with the use of graph ordering tools, the fill-in can be substantial and the required memory storage for the L and U factors can be very high. As a possible solution to this problem, the factors can be compressed using Block Low Rank (BLR) compression.

Let A be a matrix of size m × n. Let k_ϵ be the approximated numerical rank of A at accuracy ϵ. A is a low-rank matrix if there exist three matrices U of size m × k_ϵ, V of size n × k_ϵ and E of size m × n such that

    A = U Vᵀ + E,    (1)

where ‖E‖₂ ≤ ϵ and k_ϵ (m + n) < m n. The last condition implies lower dimensionality and thus also lower storage costs for the combined U and V than for the original A, provided the matrix is sufficiently dense. The basic idea is to represent the matrix as the product of two skinny matrices. Note that lossless compression is possible, as the accuracy can be set to ϵ = 0.
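As an illustration of Eq. (1), the following sketch compresses an artificial block at accuracy ϵ using a truncated SVD. This is only one possible way to construct U and V; MUMPS and PaStiX use their own rank-revealing kernels internally:

    import numpy as np

    def blr_compress(A, eps):
        """Return U (m x k), V (n x k) such that ||A - U V^T||_2 <= eps."""
        W, s, Zt = np.linalg.svd(A, full_matrices=False)
        k = max(1, int(np.sum(s > eps)))          # approximated numerical rank k_eps
        U = W[:, :k] * s[:k]                      # absorb the singular values into U
        V = Zt[:k, :].T
        return U, V

    # A smooth test block that is numerically low rank.
    m = n = 256
    x = np.linspace(0.0, 1.0, m)
    A = 1.0 / (2.0 + np.abs(x[:, None] - x[None, :]))

    U, V = blr_compress(A, eps=1e-8)
    k = U.shape[1]
    print("numerical rank k_eps        :", k)
    print("entries stored, dense vs LR :", m * n, "vs", k * (m + n))
    print("compression error ||E||_2   :", np.linalg.norm(A - U @ V.T, 2))
    # A matrix-vector product A @ v costs about 2*m*n flops in dense form, but only
    # about 2*k*(m + n) flops in low-rank form via U @ (V.T @ v).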

Besides possible memory gains, using BLR compression can in principle also reduce the runtime, as basic matrix operations are accelerated: for k_ϵ ≪ min(m, n), the number of operations required for matrix-matrix products and triangular solves involving compressed blocks scales with k_ϵ rather than with the full block dimension, so the complexity of both kernels is reduced accordingly.

The L and U matrices are generally not low-rank as a whole, but individual sub-blocks may be efficiently compressed. There is no general admissibility criterion to determine whether a block should be stored in low-rank form; this depends on the solver used, see [5] (MUMPS) and [6] (PaStiX) for more details on the implementation of BLR compression in these solvers.
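The following sketch, which reuses the blr_compress helper from the previous example, tiles a matrix into blocks and keeps the low-rank form of a block only if it actually reduces storage. This purely storage-based rule is deliberately simplistic and is not the admissibility logic of either solver:

    def compress_blockwise(A, block_size, eps, compress):
        """Tile A into block_size x block_size tiles and store each tile either
        dense or in low-rank form, whichever needs fewer entries."""
        n = A.shape[0]
        blocks, entries = {}, 0
        for i in range(0, n, block_size):
            for j in range(0, n, block_size):
                B = A[i:i + block_size, j:j + block_size]
                U, V = compress(B, eps)
                k = U.shape[1]
                if k * (B.shape[0] + B.shape[1]) < B.size:   # low-rank form is cheaper
                    blocks[(i, j)] = ("lowrank", U, V)
                    entries += k * (B.shape[0] + B.shape[1])
                else:
                    blocks[(i, j)] = ("dense", B)
                    entries += B.size
        return blocks, entries

    # Example, reusing A and blr_compress from the previous sketch:
    # blocks, entries = compress_blockwise(A, 64, 1e-8, blr_compress)
    # print("entries stored:", entries, "instead of", A.size)

In a real factor, blocks representing well-separated parts of the mesh typically compress well, while (near-)diagonal blocks tend to remain dense.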

1.3.1 Runtime performance in JOREK

The runtime of the JOREK solver will be impacted by the use of BLR compression. Before testing this in detail in the rest of this report, we lay out how the performance is affected by BLR compression.

The analysis step takes a longer time due to the additional search for suitable matrices to compress. However, the impact on the total runtime is negligible, as the analysis only has to be done once at the simulation (re-)start.

Two opposing effects will play a role in the factorisation step: the compression of the LU blocks leads to an overhead, but the compressed matrices allow for faster basic matrix operations (see above). In practice, the time for factorisation seems to be generally increased with BLR in our tests, although higher values of ϵ can reduce or even revert this effect.

The same conclusions hold for the solve step, as the (de-)compression induces an overhead, which the faster matrix operations can cancel. [Footnote 2: This overhead can, and in the future will, be avoided in both PaStiX and MUMPS.]

The runtime performance of the GMRes iterative solver is set by the number of iterations needed to reach convergence and by the time needed for a basic solve (see the step above), as the solve is repeated for every GMRes iteration. As for the factorisation and solve steps above, the use of BLR therefore causes an overhead, and larger values of ϵ will not always be beneficial here: they may reduce the number of matrix operations and thereby the runtime, but a too large value of ϵ deteriorates the accuracy of the preconditioner and leads to worse convergence in GMRes, increasing the number of iterations necessary to reach convergence. Taken to the extreme, this can stop the simulation, as the GMRes iterative solver is given a maximum number of steps to reach convergence in JOREK.
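The interplay between preconditioner accuracy and GMRes iteration count can be illustrated with the following sketch, in which the loss of accuracy is modelled simply by perturbing the preconditioning matrix before factorising it; this perturbation is only a stand-in for BLR compression, not the actual mechanism in MUMPS or PaStiX:

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    rng = np.random.default_rng(2)
    n = 400
    A = (sp.random(n, n, density=0.02, random_state=rng) + 4.0 * sp.identity(n)).tocsc()
    b = rng.standard_normal(n)

    for eps in [0.0, 1e-3, 1e-1, 1.0]:
        # Perturbed copy of A, standing in for an approximate (compressed) factorisation.
        P = (A + eps * sp.random(n, n, density=0.02, random_state=rng)).tocsc()
        lu = spla.splu(P)
        M = spla.LinearOperator(A.shape, matvec=lu.solve)

        residuals = []                               # one entry per GMRes iteration
        x, info = spla.gmres(A, b, M=M, maxiter=200,
                             callback=lambda r: residuals.append(r),
                             callback_type="pr_norm")
        print(f"eps = {eps:g}: {len(residuals)} iterations, converged = {info == 0}")

The more the preconditioner deviates from the true factorisation, the more iterations GMRes needs, until the iteration budget is exhausted and convergence fails, mirroring the behaviour described above.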

2 BLR compression in MUMPS

2.1 Resolution scan

2.1.1 Basic setup

All simulations in this report were carried out on a Linux cluster, where each compute node is equipped with 2x Intel Xeon Gold 6130 CPUs with 16 cores and 2.1 GHz base clock speed, AVX 512, “Skylake” architecture, 22 MB L3 cache, and fast Omnipath interconnect.

We first perform basic resistive ballooning mode simulations (a test case called inxflow in JOREK) with 2 toroidal harmonics () and 4 toroidal planes. The simulations are restarted during the linear growth phase for 10 time steps, with the factorisation enforced to be carried out in every time step. The iterative solver GMRes is given a tolerance of and maximal iterations to reach convergence. Each simulation is run on one compute node with 2 MPI tasks, with 2 OpenMP threads each.

The resolution was varied from (n_flux, n_tht) = (16, 20) up to
(n_flux, n_tht) = (128, 160), where n_flux denotes the radial and n_tht the poloidal number of grid points, in 6 steps of approximate factors . For each resolution, the runtime performance and memory consumption of the solver MUMPS are investigated, with and without BLR compression, and with different values of the BLR tolerance: .

2.1.2 Memory consumption

The total memory consumption in MB of the MUMPS solver depending on resolution is shown in Tab. 1 and Fig. 1. Note that the blocks corresponding to nonzero toroidal harmonics have twice the dimensionality of the n = 0 block, such that correspondingly more of the total memory indicated is used for the n ≠ 0 block than for the n = 0 block.

Significant memory gains can be made using BLR compression. These gains become larger for higher values of ϵ (as the matrices can be compressed more aggressively) and for larger problem sizes (as there are more opportunities for compression).

Table 1: Memory consumption in the MUMPS resolution scan (in MB), for the different resolutions (n_flux, n_tht), without BLR and for the different BLR tolerances ϵ
Figure 1: Memory gains from BLR compression

Note, however, that large values of ϵ can significantly deteriorate the quality of the preconditioning, leading to an increased number of GMRes iterations (see Sec. 1.3.1). Depending on the required tolerance for GMRes and on the number of iterations allowed, this can even prevent convergence, which was the case in the highest-resolution simulations for the largest values of ϵ.

2.1.3 Runtime

The runtimes listed below are averages over the simulations' 10 time steps. The activation of the Block Low Rank feature leads to an increase of the analysis phase's runtime by a factor of approximately . The analysis phase was nevertheless left out of the following comparisons, as it only has to be performed once and its impact on the overall performance is thus negligible.

Table 2: Runtime for the factorisation step (in s), for the different resolutions (n_flux, n_tht), without BLR and for the different BLR tolerances ϵ
Table 3: Runtime for the entire time step (in s), for the different resolutions (n_flux, n_tht), without BLR and for the different BLR tolerances ϵ

The runtime for the factorisation step of the MUMPS solver is shown in Tab. 2 and Fig. 2. Here, the use of BLR compression can be detrimental (overhead caused by the compression of the LU factors) or beneficial (speed-up of matrix operations). Which of these two effects dominates depends on the size of the problem and on the choice of ϵ.

The runtime for the solution step of the MUMPS solver is shown in Fig. 3. The solution step generally incurs an overhead from compression, which can however be mitigated at high resolutions, where the decrease in the number of operations can compensate for the overhead.

The runtime for the GMRes step of the JOREK solver is shown in Fig. 4, where the strong spike reflects the loss of convergence mentioned above. A slowdown is observed at all resolutions, as the same overhead mentioned for the solution step comes into play here, as well as a larger number of iterations in the GMRes solver when ϵ is too high and the preconditioner thus too inaccurate.

The reduction of the time for the solution phase in Fig. 3 at the highest resolution leaves open the possibility that this phase is actually accelerated at even larger resolutions. This could then also possibly lead to a reduction in the time for GMRes, if it was possible to compensate for both the (de-)compression overhead and the higher number of iterations caused by the preconditioner’s inaccuracy.

Figure 2: Runtime for factorisation step
Figure 3: Runtime for solution step
Figure 4: Runtime for GMRes
Figure 5: Average runtime per time step

The average total runtime per time step shown in Tab. 3 confirms the generally increased runtime when using BLR compression. For large values of ϵ, this increase is quite modest. In our simulations at the highest resolution, the optimal choice of ϵ keeps the increase in runtime minimal while substantially lowering the memory requirements. The good performance observed there is due to the reduced runtime for factorisation, which largely offsets the increase in runtime from GMRes. The peculiar spike in the runtime at a resolution of (n_flux, n_tht) = (64, 80) is due to an increase in the number of GMRes iterations in some of the time steps, suggesting that fluctuations in the preconditioner's accuracy can come into play for larger values of ϵ.

However, the JOREK solver generally keeps the preconditioning LU factors for several time steps, until they no longer precondition the matrix satisfactorily, i.e. until the number of GMRes iterations in the previous time step has passed a certain threshold. As a consequence, for many time steps only the solve step and GMRes have to be performed. For a too large ϵ, this could increase the runtime substantially, as GMRes is repeated several times for each LU factorisation. Additionally, the LU factorisation might need to be repeated after a smaller number of time steps. On the other hand, the runtimes for the solution and GMRes phases might be reduced at very high resolutions due to the reduced number of operations, as Figs. 3 and 4 seem to suggest.

The accuracy ϵ should thus be chosen as large as possible, to allow for significant memory gains and to keep the runtime as low as possible, but not so large that GMRes convergence is severely impaired. This choice of ϵ will depend on the exact problem at hand as well as on the setup of the GMRes solver.
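In pseudocode terms, this selection strategy could look like the following sketch; build_preconditioner and run_gmres are hypothetical placeholders for the actual solver calls (a MUMPS/PaStiX factorisation with a given BLR accuracy, and JOREK's preconditioned GMRes step):

    def choose_blr_epsilon(candidates, build_preconditioner, run_gmres):
        """Return the largest candidate accuracy for which GMRes still converges."""
        for eps in sorted(candidates, reverse=True):      # try the largest eps first
            M = build_preconditioner(eps)                 # factorise with BLR accuracy eps
            converged, n_iterations = run_gmres(M)
            if converged:
                return eps, n_iterations                  # keep the largest eps that works
        raise RuntimeError("no candidate epsilon reached GMRes convergence")

Such a scan would in practice be performed on a short, representative run before committing to a value of ϵ for a production simulation.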

2.2 More realistic simulation

The goal of this section is to ascertain the benefits of BLR compression during the nonlinear phase, where the preconditioning matrix is not as effective and the convergence of GMRes is thus even more critical, as well as its general usefulness in a typical JOREK simulation.

We thus turn to simulations of shattered pellet injections (SPI), running for a larger number of time steps through the linear and nonlinear phases, with 6 toroidal harmonics and 32 toroidal planes. The resolution of all simulations is (n_flux, n_tht) = (56, 138). The maximal number of iterations of the GMRes iterative solver was fixed, while two different values of its tolerance were used, to investigate the effect of this tolerance on the convergence properties when using compression. Indeed, we force the recalculation of the preconditioner (i.e. a factorisation) every time the number of GMRes iterations exceeds a threshold, such that a lower GMRes tolerance will lead to more factorisations.
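The reuse policy just described can be summarised by the following sketch; factorise and gmres_solve are hypothetical placeholders for the actual solver calls, and max_reuse_iters stands for the iteration threshold mentioned above:

    def time_loop(n_steps, factorise, gmres_solve, max_reuse_iters=20):
        """Re-factorise the preconditioner only when the previous time step
        needed more GMRes iterations than the given threshold."""
        lu = factorise()                      # initial preconditioner
        n_factorisations = 1
        for _ in range(n_steps):
            n_iterations = gmres_solve(lu)    # returns the iterations used in this step
            if n_iterations > max_reuse_iters:
                lu = factorise()              # preconditioner no longer good enough
                n_factorisations += 1
        return n_factorisations

A lower GMRes tolerance or a less accurate (more strongly compressed) preconditioner raises the typical iteration count per step and therefore the number of factorisations counted by such a loop.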

We again compare simulations with different MUMPS setups: without BLR and with BLR, using different values of ϵ. For the higher GMRes tolerance value, the simulations with two of the ϵ values did not yield satisfactory convergence in GMRes, the former in the linear and the latter in the nonlinear phase. We thus examine only the remaining values of ϵ.

The total memory consumption of the MUMPS solver in GB amounts to

  • No BLR:           ()

  • BLR, :    ()

  • BLR, : ()

  • BLR, : ()

  • BLR, : ()

The runtime performance data of all simulations is given in Tab. 4. Note that the runtime for the factorisation and solve steps is independent of the GMRes tolerance value.

Table 4: Performance of BLR compression in the SPI simulations. For each GMRes tolerance and each setup (no BLR and the different values of ϵ), the table lists the number of factorisations, the average time per iteration (s), the average number of GMRes iterations and the average time per GMRes call (s), as well as the average time per factorisation (s) and per solve (s).

2.2.1 Runtime performance for a GMRes tolerance of

In the first set of simulations, with the first GMRes tolerance value, the number of factorisations varied somewhat unexpectedly, as can be seen in Tab. 4: it rose substantially for one value of ϵ, largely exceeding the number of factorisations obtained for the other values. The frequency (number of occurrences) of each occurring number of GMRes iterations, shown in Fig. 6, reveals an increased number of GMRes iterations in an intermediate range for this value of ϵ, which in turn leads to a higher average time per iteration (Tab. 4).

Figure 6: Histogram of GMRes iteration numbers for a tolerance of

This discrepancy can be explained by the fact that a GMRes tolerance of is too high, possibly leading to different physical results. Note that in this simulation, most factorisations take place during the physically violent thermal quench of the plasma core. Even small variations in the duration or intensity of this process can lead to the observed differences. Indeed, a longer thermal quench is observed for , as shown in Fig. 7 (at ). Note that this does not directly explain why the number of factorisations increases for , but it does make this fact less surprising. The reduction of the relative differences in the number of factorisations at a higher GMRes tolerance value of (see next section) seems to confirm that the original tolerance of is too high.

Figure 7: Core temperature evolution for GMRes tolerances of and .

At all converging values of ϵ observed, the runtime is substantially increased, as shown by the average runtime per iteration in Tab. 4. This is only partly due to longer and more numerous factorisations: the biggest contribution to the increased runtime comes from GMRes, and it originates from an increased runtime per solve, not from an increased number of iterations, as can be seen from Tab. 4. At the highest converging value of ϵ, this results in a runtime increase in exchange for a memory gain.

2.2.2 Runtime performance for a GMRes tolerance of

In the case of the second GMRes tolerance value, the number of factorisations is roughly constant for all values of ϵ, and it is, surprisingly, even slightly reduced compared to the simulations without BLR compression. The latter is probably again due to small differences in the physical results, while the former indicates that the accuracy of the preconditioner is here determined by the nonlinearity of the system rather than by the compression. Indeed, the average number of GMRes iterations and the number of factorisations do not increase for large values of ϵ up to the second-largest value considered. For the largest value of ϵ, this no longer holds, as the GMRes solver did not reach convergence in that case.

The constancy of the number of factorisations in this case means that the time per iteration is now set by the time for factorisation, solve and GMRes, all of which incur increased runtimes from (de-)compression; these are, however, partially offset by the use of larger values of ϵ. Even with the higher number of factorisations at this GMRes tolerance value, the main contributor to the increased runtime is GMRes, although the increased runtime for factorisation now plays a larger role than for the other tolerance value. At the highest converging value of ϵ observed, a considerable memory gain is obtained at the cost of an increase in runtime.

The above analysis suggests that the optimal value should be found somewhere between and , which could unfortunately not be investigated due to time constraints for this study. Logically, the optimal value should be such that the inaccuracy in the preconditioner induced by the compression accuracy is similar to that inherent in our assumption of decoupled harmonics in the preconditioner. This should ensure that the time for factorisation, solve and GMRes is decreased without incurring too great an increase in the preconditioner’s inaccuracy.

Moreover, it might even be worth going beyond this limit by increasing the maximal number of GMRes iterations when using BLR compression, allowing for still larger values of ϵ, which would further reduce the memory consumption and could possibly offset the runtime increase caused by the larger number of GMRes iterations.

3 PaStiX version 6.x

The new version 6.0 of the PaStiX sparse matrix solver is still in development, but seems to be the way forward for JOREK, as it brings new features such as Block Low Rank compression. [Footnote 3: For the tests shown here, a development version equivalent to release 6.0.2 with some additional corrections was used. The hash key of the respective commit in the git repository is 17f18ce6e87de504580c27d52e5672311c413d21.] It is, however, not yet MPI-parallelised (this will be implemented in version 6.1), so its current use in JOREK is restricted to cases with one MPI task per PaStiX instance (i.e. one task per toroidal harmonic). Apart from this, the solver is already fully functional in JOREK, as an updated interface has been implemented in mod_poiss.f90 (equilibrium), solve_pastix_all.f90 (direct LU factorisation of the entire matrix, generally used in axisymmetric runs), solve_mat_n.f90 and gmres_precondition.f90 (LU factorisation as preconditioner followed by GMRes, generally used for multiple harmonics).

The implementation can handle multiple degrees of freedom to make the analysis and factorisation more efficient (switched on through the flag USE_BLOCK in JOREK), although the underlying matrix structure and the analysis results currently have to be expanded during the analysis phase, as the subsequently invoked calls do not yet support multiple degrees of freedom. As this expansion is at present only implemented in the analysis phase, the analysis is currently repeated for every time step when using PaStiX 6 with multiple degrees of freedom in JOREK. [Footnote 4: This overhead can be avoided in the future.] Once the PaStiX developers have remedied these limitations, the expansion and the repeated analysis should be removed from the JOREK implementation, giving a further boost to the runtime performance.
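To illustrate what this expansion means, the following sketch uses SciPy's block (BSR) format as a stand-in for a matrix whose graph nodes carry several degrees of freedom, and expands it into a plain scalar CSR matrix; PaStiX's internal sparse-matrix structure and JOREK's matrix layout differ from this, so the example is purely illustrative:

    import numpy as np
    import scipy.sparse as sp

    dof, nodes = 2, 4                        # 2 degrees of freedom per graph node
    rng = np.random.default_rng(3)

    # One dense dof x dof block per nonzero entry of the node graph (here tridiagonal).
    graph = sp.diags([1, 1, 1], [-1, 0, 1], shape=(nodes, nodes)).toarray()
    dense = np.kron(graph, np.ones((dof, dof))) * rng.standard_normal((nodes * dof,
                                                                       nodes * dof))
    A_blocked = sp.bsr_matrix(dense, blocksize=(dof, dof))
    print("blocked structure :", nodes, "graph nodes,",
          A_blocked.data.shape[0], "stored dof-blocks")

    A_expanded = A_blocked.tocsr()           # scalar ("expanded") structure
    print("expanded structure:", A_expanded.shape[0], "rows,", A_expanded.nnz, "nonzeros")

Working directly on the blocked (compressed) graph keeps the analysis cheaper, since the ordering tool only sees one node per group of degrees of freedom; the expansion to the scalar structure is what currently has to be performed before the factorisation and solve calls.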

3.1 Benchmarking

In the following, we present small benchmark tests comparing PaStiX 6 with the previously used PaStiX 5.2.1 ("Release 4492"). We again use a peeling-ballooning scenario (inxflow), where the simulation is run with 2 toroidal harmonics in the phase of linear growth (Sec. 3.1.1).

The time evolution is computed with and without the USE_BLOCK feature, which also allows an evaluation of its usefulness in the different versions of PaStiX. Furthermore, in Sec. 3.1.2, the scaling with the number of OpenMP threads is checked for the different PaStiX versions by running 10 time steps in the linear growth phase with different numbers of OpenMP threads. Finally, 10 time steps are again rerun in the linear growth phase for various spatial resolutions in Sec. 3.1.3 to check how the new PaStiX version scales with the problem size. The basic setup has a spatial resolution of (n_flux, n_tht) = (32, 40) and a fixed number of OpenMP threads.

3.1.1 Basic simulation

The simulation is run for 90 time steps in the linear growth phase to obtain a first idea of the typical runtime performance, but also to later assess whether the OpenMP and resolution scans, which are run over only 10 time steps, yield representative results. The runtimes measured in this first benchmark are given in Tab. 5. All values are averages, except for the single-dof analysis, which is performed only once. Both single and multiple degrees of freedom (dof) are investigated (USE_BLOCK feature).

Table 5: Basic benchmark of PaStiX 6: runtimes (in s) of the Analysis, Factorisation, Solve and GMRes phases for PaStiX 5.2.1 and PaStiX 6.0.2, with a single dof and with multiple dofs

Considering a single degree of freedom, the time for analysis is substantially reduced in PaStiX 6, by a factor of . However, the analysis phase only has to be repeated once per simulation (restart), such that this gain, while appreciated, should not heavily influence the total runtime.

The reason why the use of multiple degrees of freedom seems to lead to a greater speed-up in the older PaStiX version is that the time listed under Analysis also includes the conversion of the matrix to an input usable by PaStiX. This conversion takes slightly longer for PaStiX 6 because the latter necessitates an additional new sparse matrix structure, whereas the matrix was directly passed to PaStiX beforehand. This additional overhead could be reduced in the future by directly using the new sparse matrix structure in JOREK’s distribute_harmonics routine.

The time for factorisation is also reduced in this simple case for the new PaStiX version. The difference shrinks when multiple degrees of freedom are used, such that the times for factorisation are very similar here. Indeed, multiple degrees of freedom are not yet implemented in the factorisation part of PaStiX 6, such that the speed-up is reduced to the contribution from a more performant analysis.

Finally, the runtimes for solution and for GMRes are reduced in the new PaStiX version, by approximately when a single degree of freedom is assumed. Here, the use of multiple degrees of freedom does not lead to a speed-up for PaStiX 6, as it is not yet implemented in the solve part.

3.1.2 OpenMP thread scan

In this scan, 10 time steps were performed during the linear growth phase and the factorisation was forced to take place at every time step. The number of OpenMP threads was varied in factors of from to . The resulting runtimes for the factorisation and solve steps, as well as the GMRes solver, are shown in Fig. 8. The analysis step was left out as it did not vary depending on the number of OpenMP threads, so the result from the previous basic benchmark stands (old version: s, new version: s).

Figure 8: Runtimes in OpenMP scan for the single-dof case

Although the scaling in the number of OpenMP threads seems to be worse for the factorisation step in the new PaStiX version, the difference is minimal even at threads, where the runtimes for factorisation are basically the same between the two PaStiX versions. The difference also becomes smaller for the solve and GMRes steps, but the new version of PaStiX was still always faster for these compared to the old version. This can also be seen in Fig. 9, where the speed-up in PaStiX 6 has been computed by dividing the runtime of the older PaStiX version by that of PaStiX 6. The analysis phase speed-up is for all OpenMP configurations.

Figure 9: Speedup from PaStiX 6 in the OpenMP scan

3.1.3 Resolution scan

The same 10 time steps were now performed at resolutions between (n_flux, n_tht) = (16, 20) and (128, 160) in 6 steps of approximate size . Only OpenMP threads were used, and the factorisation was again forced to be repeated at every time step.

The speed-up in this scan can be seen in Fig. 10, showing that PaStiX 6 leads to a speed-up for all phases (Factorisation, Solution, GMRes) and all resolutions, with the notable exception of the solution and GMRes phases at the very highest resolutions. The analysis phase was not included in Fig. 10, as it stays constant around for all resolutions.

Figure 10: Speedup from PaStiX 6 in the resolution scan

The speed-up in the factorisation is most encouraging as it does not plummet at larger resolutions, which is positive considering how significant this part of the solver is (e.g. runtime for factorisation at the highest resolution employed here, PaStiX Release 4492: s, PaStiX 6: s). The solve and GMRes phases show a more mixed picture, as the speed-up seems to be lost when going to very high resolutions. Whether this trend continues at even higher resolutions remains to be investigated, and this trend could of course change during the further development of PaStiX 6. Moreover, these phases are generally less critical for the total runtime (e.g. runtime for solution phase at the highest resolution employed here, PaStiX Release 4492: s, PaStiX 6: s). This prompts the need for a future benchmarking of PaStiX 6 on a realistic JOREK simulation, where the effect on the total runtime could be meaningfully investigated.

3.2 Preliminary results from BLR compression with PaStiX 6

The new BLR compression feature in PaStiX 6 could not yet be thoroughly tested, as some convergence problems occurred when using the memory-optimal settings. The following results are derived from a resolution scan with the same setup as in Sec. 2.1, but using the PaStiX 6 solver with the "just-in-time" setting, which optimises runtime gains instead of memory gains [6].

The memory consumption, obtained from a PaStiX 6 diagnostic, is listed in Tab. 6. Note that the numbers may not all be accurate to the last digit and cannot necessarily be compared directly to those given for the MUMPS resolution scan. Nevertheless, they indicate that very good compression can be attained, even for small (albeit non-zero) values of ϵ. This also seems to indicate that the BLR accuracy parameter here cannot be directly compared to that of the MUMPS solver, possibly due to an additional internal scaling in MUMPS.

Table 6: Memory consumption in the PaStiX 6 resolution scan (in MB), for the different resolutions (n_flux, n_tht), without BLR and for the different BLR tolerances ϵ

The speed-up in the average time per iteration shown in Fig. 11 confirms that the ϵ values cannot be directly compared to those of the MUMPS solver, as many more simulations could not reach convergence here (all runs with the largest ϵ considered and almost all with the next-largest). [Footnote 5: Note that version 6.0.2 of PaStiX also had a bug, which will be resolved in 6.0.3 and which might partially explain this behaviour.] However, it seems easier to obtain a speed-up here, as many of the converging simulations demonstrate, especially at high resolutions.

Figure 11: Average runtime per time step

This speed-up can mostly be traced back to the speed-up in the factorisation step, shown in Fig. 12. Note that the factorisation is forced to be repeated every time step in this scan, such that the factorisation will be the main contribution to the total time per iteration. This is not necessarily the case in a typical JOREK simulation, as Sec. 2.2 demonstrated.

Figure 12: Runtime for factorisation step

It is thus instructive to investigate the speed-up in the solve and GMRes steps, shown in Figs. 13 and 14. Whether the solve phase is sped up or slowed down depends on the value of ϵ employed; a higher resolution only mildly increases the speed-up for a given ϵ. In comparison, the GMRes runtimes seem to always be longer, as a result of the worse preconditioner. At high resolutions, however, the fact that the time for the solve is reduced with BLR suggests that the runtime for GMRes could also be reduced in a realistic JOREK simulation. Indeed, the number of GMRes iterations in Tab. 4 showed that for small enough values of ϵ the accuracy of the preconditioner is primarily determined by the nonlinearity of the system, not by the BLR accuracy. In other words, the increased runtime for GMRes observed in Fig. 14 merely reflects the strongly increased number of GMRes iterations, which was not observed in the realistic simulation (Sec. 2.2) when using small enough values of ϵ.

The results in this section are preliminary, as many aspects of BLR compression in PaStiX 6 remain to be investigated. They are nevertheless very encouraging, as the memory consumption is greatly reduced and the factorisation step seems to enjoy a substantial speed-up. Whether this speed-up can outweigh the slow-down in the GMRes phase in a realistic JOREK simulation remains to be investigated.

Figure 13: Runtime for solution step
Figure 14: Runtime for GMRes

4 Conclusions

Interfaces in the JOREK MHD code have been updated for MUMPS and PaStiX in order to test block low rank compression offered by these solver libraries. First tests show promising trends. Further tests are necessary to fully evaluate the benefit in production simulations.

Acknowledgements

First and foremost, the authors would like to thank Mathieu Faverge and the rest of the PaStiX 6 development team for their quick and helpful responses to the diverse problems we encountered during the implementation of the new PaStiX 6 interface in JOREK. The authors would also like to thank Guido Huijsmans for his instructive comments on the use of Block-Low-Rank compression in JOREK.

References

  • [1] G. Huysmans and O. Czarny, “MHD stability in x-point geometry: simulation of ELMs,” Nuclear Fusion, vol. 47, pp. 659–666, jun 2007.
  • [2] P. Hénon, P. Ramet, and J. Roman, "PaStiX: A High-Performance Parallel Direct Solver for Sparse Symmetric Positive Definite Systems," Parallel Computing, vol. 28, no. 2, pp. 301–321, 2002.
  • [3] P. Amestoy, I. Duff, J. L’Excellent, and J. Koster, “A fully asynchronous multifrontal solver using distributed dynamic scheduling,” SIAM Journal on Matrix Analysis and Applications, vol. 23, no. 1, pp. 15–41, 2001.
  • [4] T. Mary, Block low-rank multifrontal solvers : complexity, performance, and scalability. Theses, Université Paul Sabatier - Toulouse III, Nov. 2017.
  • [5] P. R. Amestoy, A. Buttari, J.-Y. L’Excellent, and T. Mary, “Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures,” research report, INPT-IRIT ; CNRS-IRIT ; INRIA-LIP ; UPS-IRIT, Apr. 2017.
  • [6] G. Pichon, E. Darve, M. Faverge, P. Ramet, and J. Roman, “Sparse supernodal solver using block low-rank compression: Design, performance and analysis,” International Journal of Computational Science and Engineering, vol. 27, pp. 255 – 270, July 2018.