1.1 JOREK solver
JOREK  solver always involves LU factorisations, either to solve the entire matrix directly (generally used when the simulation is run axisymmetrically) or as a preconditioner before using the iterative solver GMRes. We focus on the latter case.
In the preconditioner, the blocks of toroidal harmonics are assumed to be decoupled from one another (i.e. off-diagonal blocks are not considered in the preconditioner), which greatly reduces the memory requirements and the runtime. This approach comes to its limits when the system is very nonlinear, as the preconditioner will be inaccurate and many GMRes iterations may be needed to converge to the solution.
1.2 LU factorisation of sparse matrix
The blocks of harmonics are sparse matrices (i.e. a small percentage of entries is non-zero) and naive algorithms used for dense matrices should thus not be applied here. Indeed, these would lead to a substantial fill-in: the combined storage cost for the and factors would greatly exceed that of the original matrix. The
JOREK code makes use of either
PaStiX  or
MUMPS , both of which are able to efficiently handle sparse matrices.
In the following, we will briefly present the main steps of these solvers, without going into the details. More information can be found e.g. in .
In the analysis phase, a graph ordering tool is called, generally
METIS. These represent the matrix as a graph and an elimination tree is constructed to reduce fill-in. This tree is traversed as each node is eliminated and its contribution to nodes further up the tree are computed (corresponding to fill-in). This corresponds to an update of nodes up the tree which can be done either immediately after elimination (fan-out or right-looking), as late as possible before the elimination of the node further up the tree (fan-in or left-looking) or it can be carried up the tree, accumulating further updates from other nodes in so called fronts (fan-out or right-looking)111
PaStiX is using a fan-in strategy in the communication. In the sense that are accumulated locally on the origin node, and then sent to the remote node when all contributions are applied, such that the remote node can perform the operation . Except for this specific case in distributed, all the algorithm is fan-out/right-looking oriented to get more parallelism, even for the accumulation of the remote contribution. MUMPS, in contrast, based on the multi-frontal approach, can be considered as fan-in/left-looking oriented algorithm..
In the factorisation phase, the factors of the and matrices are computed, following the elimination tree obtained in the analysis phase. This is typically the most CPU intensive step, although the factorisation does not need to be repeated for every time step in a typical
JOREK simulation, since the preconditioning matrix is reused as long as reasonable convergence is obtained in GMRes.
In the solve phase, the solution is computed using standard forward and back-substitution methods. A solve also has to be repeated for every iteration in GMRes.
1.3 BLR compression
Even with the use of graph ordering tools, the fill-in can be substantial and the required memory storage for the factors of and can be very high. As a possible solution to this problem, the factors could be compressed using Block-Low-Rank (BLR) compression.
Let be a matrix of size . Let be the approximated numerical rank of at accuracy . is a low-rank matrix if there exist three matrices of size , of size and of size such that :
where and . The last condition implies lower dimensionality and thus also lower storage costs for combined U and V than for original A, provided the matrix is sufficiently dense. The basic idea is to represent the matrix as the product of two skinny matrices. Note that lossless compression is possible as the accuracy can be set to .
Besides possible memory gains, using BLR compression can in principle also reduce the runtime, as basic matrix operations are accelerated. Indeed, for , matrix-matrix products go from (for dense matrices) to operations, while a triangular solve goes from to operations. The complexity is thus reduced for and respectively.
The and matrices are generally not low-rank, but individual sub-matrices may be efficiently compressed. There is no general admissibility criterion to determine if a block should be stored in low-rank. This depends on the solver used, see  (
MUMPS) and  (
PaStiX) for more details on the implementation of BLR compression in these solvers.
1.3.1 Runtime performance in JOREK
The runtime of the
JOREK solver will be impacted by the use of BLR compression. Before testing this in detail in the rest of this report, we layout how the performance is affected by BLR compression.
The analysis step takes a longer time due to the additional search for suitable matrices to compress. However, the impact on the total runtime is negligible, as the analysis only has to be done once at the simulation (re-)start.
Two opposing effects will play a role in the factorisation step: the compression of the LU blocks leads to an overhead but the compressed matrices lead to faster basic matrix operations (see above). In practice, the time for factorisation seems to be generally increased with BLR in our tests, although higher values of can reduce or even revert this effect.
The same conclusions hold for the solve step, as the (de-)compression induces an overhead, which the faster matrix operations can cancel222This overhead can be avoided and will be avoided in both PaStiX and MUMPS in the future.
The runtime performance of the GMRes iterative solver is set by the number of iterations needed to reach convergence and the time needed for a basic solve (see step above), as it is repeated for every GMRes iteration. As for the factorisation and solve steps above, this means that the use of BLR causes an overhead but larger values of will not always be beneficial here. Indeed, they may reduce the number of matrix operations and thereby reduce runtime, but a too large value of deteriorates the accuracy of the preconditioner and leads to worse convergence in GMRes, increasing the number of iterations necessary to reach convergence. Taken to the extreme, this can stop the simulation, as the GMRes iterative solver is given a maximum number of steps to reach convergence in
2 BLR compression in MUMPS
2.1 Resolution scan
2.1.1 Basic setup
All simulations in this report were carried out on a Linux cluster, where each compute node is equipped with 2x Intel Xeon Gold 6130 CPUs with 16 cores and 2.1 GHz base clock speed, AVX 512, “Skylake” architecture, 22 MB L3 cache, and fast Omnipath interconnect.
We first perform basic resistive ballooning mode simulations (a test case called
JOREK) with 2 toroidal harmonics () and 4 toroidal planes. The simulations are restarted during the linear growth phase for 10 time steps, with the factorisation enforced to be carried out in every time step. The iterative solver GMRes is given a tolerance of and maximal iterations to reach convergence. Each simulation is run on one compute node with 2 MPI tasks, with 2 OpenMP threads each.
The resolution was varied from
(n_flux, n_tht) = (16, 20) up to
(n_flux, n_tht) = (128, 160), where
n_flux denotes the radial and
n_tht the poloidal number of grid points, in 6 steps of approximate factors . For each resolution, the runtime performance and memory consumption of the solver MUMPS are investigated, with and without BLR compression, and with different values of the BLR tolerance: .
2.1.2 Memory consumption
The total memory consumption in MB of the MUMPS solver depending on resolution is shown in Tab 1 and Fig. 1. Note that the blocks corresponding to nonzero toroidal harmonics have twice the dimensionality of the block, such that of the total memory indicated is used in the block and in the block.
Significant memory gains can be made using BLR compression. These gains are exacerbated for large values of (as the matrices can be compressed more effectively) and for large problem sizes (as there are more opportunities for compression).
Note, however, that large values of can significantly deteriorate the quality of the preconditioning, leading to an increased number of iterations for GMRes (see Sec. 1.3.1). Depending on the required tolerance for GMRes and the number of iterations, this can even prevent convergence, which was the case in the highest resolution simulations with .
The runtimes listed below are averages over the simulations’ 10 time steps. The activation of the Block-Low-Rank feature leads to an increase in the analysis phase’s runtime by a factor of approximately . It was however left out in the following, as the analysis only has to be performed once and its impact on the performance is thus negligible.
The runtime for the factorisation step of the MUMPS solver is shown in Tab. 2 and Fig. 2. Here, the use of BLR compression can be detrimental (overhead caused by the compression of LU factors) or beneficial (speedup of matrix operations). Which of these two effects dominates depends on the size of the problem and the choice of .
The runtime for the solution step of the MUMPS solver is shown in Fig. 3. The solution step generally incurs an overhead from compression, which can however be mitigated at high resolutions, where the decrease in the number of operations can compensate for the overhead.
The runtime for the GMRes step of the JOREK solver is shown in Fig. 4, where the strong spike for reflects the loss of convergence. A slowdown is observed at all resolutions as the same overhead mentioned in the solution step comes into play here, as well as larger number of iterations in the GMRes solver when is too high and the preconditioner is too inaccurate.
The reduction of the time for the solution phase in Fig. 3 at the highest resolution leaves open the possibility that this phase is actually accelerated at even larger resolutions. This could then also possibly lead to a reduction in the time for GMRes, if it was possible to compensate for both the (de-)compression overhead and the higher number of iterations caused by the preconditioner’s inaccuracy.
The average total runtime per time step shown in Tab. 3 confirms the generally increased runtime when using BLR compression. For large values of , this increase is quite modest. In our simulations at highest resolution, the optimal choice is , where the increase in runtime is minimal (no BLR: s, : s) while the memory requirements are substantially lower (no BLR: GB, : GB). The high performance observed here is due to the reduced runtime for factorisation, largely offsetting the increase in runtime from GMRes. The peculiar spike in the runtime for and a resolution of
(n_flux, n_tht) = (64, 80) is due to an increase in the number of iterations of GMRes for out of the time steps, suggesting that fluctuations in the preconditioner’s accuracy can come into play for larger values of .
JOREK solver generally keeps the preconditioning LU factors for several timesteps, until it does not satisfactorily precondition the matrix anymore, i.e. the number of GMRes iterations in the previous time step has passed a certain threshold. As a consequence, for many time steps only the solve step and GMRes have to be performed. For too large , this could increase the runtime substantially, as GMRes is repeated several times for each LU factorisation. Additionally, the LU factorisation might need to be repeated after a smaller number of time steps. However, the runtimes for the solution phase and GMRes phases might be reduced at very high resolutions due to the reduced number of operations, as Figs. 3 and 4 seem to suggest.
The accuracy should be chosen as large as possible to allow for significant memory gains and keep the runtime as low as possible, without being too high, such that GMRes convergence is not too greatly impaired. This choice of will depend on the exact problem at hand as well as the setup of the GMRes solver.
2.2 More realistic simulation
The goal of this section is to ascertain the benefits of BLR compression during the nonlinear phase, where the preconditioning matrix is not as effective and the convergence of GMRes is thus even more critical, as well as its general usefulness in a typical
We thus turn to simulations of shattered pellet injections (SPI), running for a larger number of time steps () through linear and nonlinear phases, with 6 toroidal harmonics () and 32 toroidal planes. The resolution of all simulations is
(n_flux,n_tht) = (56, 138). The maximal number of iterations of the GMRes iterative solver was set to while its tolerance was set to and , to investigate its effects on the convergence properties when using compression. Indeed, we force the recalculation of the preconditioner (i.e. a factorisation) every time the number of steps in GMRes exceeds , such that a lower GMRes tolerance will lead to more factorisations.
We again compare simulations with different MUMPS setups: without BLR and with BLR, using different values of . For the higher GMRes tolerance value of , the simulations with did not yield satisfactory convergence in GMRes, the former in the linear and the latter in the nonlinear phase. We thus examine only the values .
The total memory consumption of the MUMPS solver in GB amounts to
No BLR: ()
BLR, : ()
BLR, : ()
BLR, : ()
BLR, : ()
The runtime performance data of all simulations is given in Tab. 4. Note that the runtime for the factorisation and solve steps is independent of the GMRes tolerance value.
|GMRes tol.||No BLR|
|Nbr. of factorisations|
|Avg. time per iter. (s)|
|Avg. nbr. of GMRes iter.|
|Avg. time per GMRes (s)|
|Nbr. of factorisations|
|Avg. time per iter. (s)|
|Avg. nbr. of GMRes iter.|
|Avg. time per GMRes (s)|
|Avg. time per facto. (s)|
|Avg. time per solve (s)|
2.2.1 Runtime performance for a GMRes tolerance of
In the first set of simulations with a GMRes tolerance of , the number of factorisations varied somewhat unexpectedly, as can be seen in Tab. 4. Indeed, the number of factorisations rose substantially for , largely exceeding the number of factorisations for and . Indeed, the frequency (number of occurences) for each occuring number of GMRes iterations shown in Fig. 6 reveals an increased number of GMRes iterations between and for . This in turns leads to a higher average time per iteration (Tab. 4).
This discrepancy can be explained by the fact that a GMRes tolerance of is too high, possibly leading to different physical results. Note that in this simulation, most factorisations take place during the physically violent thermal quench of the plasma core. Even small variations in the duration or intensity of this process can lead to the observed differences. Indeed, a longer thermal quench is observed for , as shown in Fig. 7 (at ). Note that this does not directly explain why the number of factorisations increases for , but it does make this fact less surprising. The reduction of the relative differences in the number of factorisations at a higher GMRes tolerance value of (see next section) seems to confirm that the original tolerance of is too high.
At all observed converging values of , the runtime is substantially increased, as shown by the average runtime per iteration in Tab. 4. This is only partly due to longer and more numerous factorisations, as the biggest contributing factor to the increased runtime is the increased runtime in GMRes, which originates from an increased runtime per solve, not from an increased number of iterations, as can be seen from Tab. 4. At the highest converging value of , this leads to a runtime increase for a memory gain of .
2.2.2 Runtime performance for a GMRes tolerance of
In the case of a GMRes tolerance of , the number of factorisations seems to be roughly constant for all values of
, and it is surprisingly slightly reduced compared to the simulations without BLR compression. The latter is probably again due to small differences in the physical results, as the former indicates that the accuracy of the preconditioner is determined here by the linearity of the system. Indeed, the average number of GMRes iterations and the number of factorisations does not increase for large values ofup to . For , this does not hold anymore, as the GMRes solver did not reach convergence in this case.
The constancy of the number of factorisations in this case means that the time per iteration is now set by the time for factorisation, solve and GMRes, all of which incur increased runtimes from (de-)compression. However, these are partially offset by the use of larger values of . Even at the high number of factorisations for this GMRes tolerance value, the main contributor to the increased runtime is the runtime in GMRes, although the increased runtime for factorisation now plays a larger role compared to the higher GMRes tolerance value. At the highest observed converging value of , a memory gain of for a runtime increase.
The above analysis suggests that the optimal value should be found somewhere between and , which could unfortunately not be investigated due to time constraints for this study. Logically, the optimal value should be such that the inaccuracy in the preconditioner induced by the compression accuracy is similar to that inherent in our assumption of decoupled harmonics in the preconditioner. This should ensure that the time for factorisation, solve and GMRes is decreased without incurring too great an increase in the preconditioner’s inaccuracy.
Moreover, it might even be worth going beyond this limit by increasing the maximal number of GMRes iterations when using BLR compression, allowing for still larger values of which further reduce memory consumption and possibly offset the runtime increase caused by the larger number of GMRes iterations.
3 PaStiX version 6.x
The new version 6.0 of the
PaStiX sparse matrix solver is still in development, but seems to be the way forward for
JOREK, as it brings new features such as Block-Low-Rank compression 333For the tests shown here, a development version equivalent to release 6.0.2 with some additional corrections was used. The hash key of the respective commit in the git repository is 17f18ce6e87de504580c27d52e5672311c413d21.. It is however not yet MPI-parallelised, which will be implemented in version 6.1, so its current use in
JOREK is restricted to cases with one MPI task per
PaStiX instance (i.e. one task per toroidal harmonic). Apart from this, the solver is already fully functional in
JOREK, as an updated interface has been implemented in
solve_pastix_all.f90 (direct LU factorisation of entire matrix, generally used in axisymmetric runs),
gmres_precondition.f90 (LU factorisation as preconditioner followed by GMRes, generally used for multiple harmonics).
The implementation can handle multiple degrees of freedom to make analysis and factorisation more efficient (switched on through the flag
JOREK), although the underlying matrix structure and the analysis results currently have to be expanded during the analysis phase, as the thereafter invoked calls do not yet support multiple degrees of freedom. As this expansion is as of now only implemented in the analysis phase, the analysis is currently being repeated for every time step when using
PaStiX 6with multiple degrees of freedom in
JOREK444This overhead can be avoided in the future.. Once the
PaStiXdevelopers have remedied to these problems, the expansion and repeated analysis should be removed in the
JOREKimplementation, giving a further boost to the runtime performance.
In the following, we present small benchmark tests of
PaStiX 6 with the previously generally used
PaStiX 5.2.1 (”Release 4492”). We again use a peeling-ballooning scenario (
inxflow): where the simulation is run with 2 toroidal harmonics () in the phase of linear growth (Sec. 3.1.1).
The time evolution is computed with and without the
USE_BLOCK feature, which also the evaluation of its usefulness in different versions of
PaStiX. Furthermore, in Sec. 3.1.2, the scaling in the number of OpenMP threads is checked for the different
PaStiX versions by running 10 time steps in the linear growth phase with different numbers of OpenMP threads. Finally, 10 time steps are again rerun in the linear growth phase for various spatial resolutions in Sec. 3.1.3 to check how the new
PaStiX version scales with the problem size. The basic setup has a spatial resolution of
(n_flux,n_tht) = (32,40) and the number of OpenMP threads is set to .
3.1.1 Basic simulation
The simulation is run for 90 time steps in the linear growth phase to obtain a first idea of the typical runtime performance but also to later assess if the OpenMP and resolution scans which are run over only 10 time steps yield the correct results. The runtimes measured in this first benchmark are given in Tab. 5. They are all averaged except the analysis for single-dof which is performed only once. Both single and multiple degrees of freedom (dof) are investigated (
|Runtime (in s)||single dof||multiple dofs|
|PaStiX 5.2.1||PaStiX 6.0.2||Release 5.2.1||PaStiX 6.0.2|
Considering a single degree of freedom, the time for analysis is subtantially reduced in
PaStiX 6, by a factor of . However, the analysis phase only has to be repeated once per simulation (restart), such that this gain is appreciated but should not heavily influence the total runtime.
The reason why the use of multiple degrees of freedom seems to lead to a greater speed-up in the older
PaStiX version is that the time listed under Analysis also includes the conversion of the matrix to an input usable by
PaStiX. This conversion takes slightly longer for
PaStiX 6 because the latter necessitates an additional new sparse matrix structure, whereas the matrix was directly passed to
PaStiX beforehand. This additional overhead could be reduced in the future by directly using the new sparse matrix structure in
The time for factorisation is also reduced in this simple case for the new
PaStiX version. The difference shrinks when multiple degrees of freedom are used, such that the times for factorisation are very similar here. Indeed, multiple degrees of freedom are not yet implemented in the factorisation part of
PaStiX 6, such that the speed-up is reduced to the contribution from a more performant analysis.
Finally, the runtimes for solution and for GMRes are reduced in the new
PaStiX version, by approximately when a single degree of freedom is assumed. Here, the use of multiple degrees of freedom does not lead to a speed-up for
PaStiX 6, as it is not yet implemented in the solve part.
3.1.2 OpenMP thread scan
In this scan, 10 time steps were performed during the linear growth phase and the factorisation was forced to take place at every time step. The number of OpenMP threads was varied in factors of from to . The resulting runtimes for the factorisation and solve steps, as well as the GMRes solver, are shown in Fig. 8. The analysis step was left out as it did not vary depending on the number of OpenMP threads, so the result from the previous basic benchmark stands (old version: s, new version: s).
Although the scaling in the number of OpenMP threads seems to be worse for the factorisation step in the new
PaStiX version, the difference is minimal even at threads, where the runtimes for factorisation are basically the same between the two
PaStiX versions. The difference also becomes smaller for the solve and GMRes steps, but the new version of
PaStiX was still always faster for these compared to the old version. This can also be seen in Fig. 9, where the speed-up in
PaStiX 6 has been computed by dividing the runtime of the older
PaStiX version by that of
PaStiX 6. The analysis phase speed-up is for all OpenMP configurations.
3.1.3 Resolution scan
The same 10 time steps were now performed at resolutions between
(n_flux, n_tht) = (16, 20) and (128, 160) in 6 steps of approximate size . Only OpenMP threads were used, and the factorisation was again forced to be repeated at every time step.
The speed-up in this scan can be seen in Fig. 10, showing that
PaStiX 6 leads to a speed-up for all phases (Factorisation, Solution, GMRes) and all resolutions, with the notable exception of the solution and GMRes phases at the very highest resolutions. The analysis phase was not included in Fig. 10, as it stays constant around for all resolutions.
The speed-up in the factorisation is most encouraging as it does not plummet at larger resolutions, which is positive considering how significant this part of the solver is (e.g. runtime for factorisation at the highest resolution employed here,
PaStiX Release 4492: s,
PaStiX 6: s). The solve and GMRes phases show a more mixed picture, as the speed-up seems to be lost when going to very high resolutions. Whether this trend continues at even higher resolutions remains to be investigated, and this trend could of course change during the further development of
PaStiX 6. Moreover, these phases are generally less critical for the total runtime (e.g. runtime for solution phase at the highest resolution employed here,
PaStiX Release 4492: s,
PaStiX 6: s). This prompts the need for a future benchmarking of
PaStiX 6 on a realistic
JOREK simulation, where the effect on the total runtime could be meaningfully investigated.
3.2 Preliminary results from BLR compression with PaStiX 6
The new BLR compression feature in
PaStiX 6 could not yet be thoroughly tested, as some convergence problems occured when using the memory-optimal settings. The following results are derived from a resolution scan with the same setup as in Sec. 2.1 but using the
PaStiX 6 solver with the ”just-in-time” setting, which optimises runtime gains instead of memory gains .
The memory consumption, obtained from a
PaStiX 6 diagnostic, is listed in Tab. 6. Note that the numbers may not all be accurate to the last digit, and can not necessarily be directly compared to those given in the MUMPS resolution scan. Nevertheless, they indicate that very good compression can be attained, even for small (albeit non-zero) values of . This also seems to indicate that the accuracy of the BLR solver here cannot be directly compared to that from the
MUMPS solver, possibly due to an additional internal scaling in
The speed-up in the average time per iteration in Fig. 11 confirms that the values cannot be directly compared to those of the
MUMPS solver, as many more simulations could not reach convergence here (all runs with and almost all with )555Note, that version 6.0.2 of PaStiX also had a bug, which will be resolved in 6.0.3, which might partially explain this behaviour. However, it seems easier to obtain a speed-up here, as many simulations with demonstrate, especially at high resolutions.
This speed-up can mostly be traced back to the speed-up in the factorisation step, shown in Fig. 12. Note that the factorisation is forced to be repeated every time step in this scan, such that the factorisation will be the main contribution to the total time per iteration. This is not necessarily the case in a typical JOREK simulation, as Sec. 2.2 demonstrated.
It is thus instructive to investigate the speed-up in the solve and GMRes steps, shown in Figs. 13 and 14. The speed-up or slow-down of the solve phase depends on the value of employed, a higher resolution only mildly increases the speed-up for a given . In comparison, the GMRes runtimes seem to always be longer, as a result of the worse preconditioner. At high resolutions however, the fact that the time for solve is reduced with BLR suggests that the runtime for GMRes could also be reduced in a realistic
JOREK simulation. Indeed, the number of GMRes iterations in Tab. 4 showed that for small enough values of the accuracy of the preconditioner is primarily determined by the nonlinearity of the system, not by the BLR accuracy. In other words, the increased runtime for GMRes observed in Fig. 14 merely reflects the highly increased number of iterations in GMRes, which were not observed in the realistic simulation (Sec. 2.2) when using small enough values of .
The results in this section are preliminary, as many aspects of BLR compression in
PaStiX 6 remain to be investigated. However, they are already very encouraging as the memory consumption is greatly reduced and the factorisation step seems to enjoy a subtantial speed-up. Whether this speed-up can rival the slow-down in the GMRes phase in a realistic
JOREK simulation remains to be investigated.
Interfaces in the JOREK MHD code have been updated for MUMPS and PaStiX in order to test block low rank compression offered by these solver libraries. First tests show promising trends. Further tests are necessary to fully evaluate the benefit in production simulations.
First and foremost, the authors would like to thank Mathieu Faverge and the rest of the
PaStiX 6 development team for their quick and helpful responses to the diverse problems we encountered during the implementation of the new
PaStiX 6 interface in
The authors would also like to thank Guido Huijsmans for his instructive comments on the use of Block-Low-Rank compression in
-  G. Huysmans and O. Czarny, “MHD stability in x-point geometry: simulation of ELMs,” Nuclear Fusion, vol. 47, pp. 659–666, jun 2007.
-  P. Hénon, P. Ramet, and J. Roman, “PaStiX: A High-Performance Parallel Direct Solver for Sparse Symmetric Definite Systems,” Parallel Computing, vol. 28, no. 2, pp. 301–321, 2002.
-  P. Amestoy, I. Duff, J. L’Excellent, and J. Koster, “A fully asynchronous multifrontal solver using distributed dynamic scheduling,” SIAM Journal on Matrix Analysis and Applications, vol. 23, no. 1, pp. 15–41, 2001.
-  T. Mary, Block low-rank multifrontal solvers : complexity, performance, and scalability. Theses, Université Paul Sabatier - Toulouse III, Nov. 2017.
-  P. R. Amestoy, A. Buttari, J.-Y. L’Excellent, and T. Mary, “Performance and Scalability of the Block Low-Rank Multifrontal Factorization on Multicore Architectures,” research report, INPT-IRIT ; CNRS-IRIT ; INRIA-LIP ; UPS-IRIT, Apr. 2017.
-  G. Pichon, E. Darve, M. Faverge, P. Ramet, and J. Roman, “Sparse supernodal solver using block low-rank compression: Design, performance and analysis,” International Journal of Computational Science and Engineering, vol. 27, pp. 255 – 270, July 2018.