## I Introduction

J. Dongarra at his talk at International Congress ICMS-2016 [1] put attansion on the several difficult challenges. He noted that the task of managing calculations on a cluster with distributed memory for algorithms with sparse matrices is today one of these the most difficult challenges.

We have to add also one more problem. It is a high computational complexity, that can be connected with the type of the basic algebra: you can take a matrix over field or over commutative ring.

For sparse matrices, it is not true that all computations over polynomials or integers can be effectivelly reduced due to the technic of modular computations. It was proved in theoretical investigatios of computational complexity for some algorithms with sparse matrices in [2]. Below we give experiments with large sparse matrices, which confirm these theoretical reults.

We consider the class of block-recursive matrix algorithms. The most famous of them are standard and Strassen’s block matrix multiplication, Strassen’s block-matrix inversion [3].

Block-recursive algorithms were not so important as long as the calculations were performed on computers with shared memory. Only in the nineties it became clear that block-recursive matrix algorithms are required to operate with sparse large matrices on a supercomputer with distributed memory.

Note that the generalization of Strassen’s matrix inversion algorithm [3] with additional permutations of rows and columns by J. Bunch and J. Hopkroft [4] is not a block-recursive algorithm.

The block recursive algorithm for the solution of systems of linear equations and for adjoint matrix computation which is some generalisation of Strassen’s inversion in commutative domains was suggested in the papers [8], [9] and [11]. See also at the book [10]. However, in all these algorithms, except matrix multiplication, a very strong restriction are imposed on the matrix: the leading corner minors should not be zero.

This restriction was removed later. The algorithm that computes the adjoint matrix, the echelon form, and the kernel of the matrix operator for the commutative domains was proposed in [12]. The block-recursive algorithm for the Bruhat decomposition and the LEU decomposition for the matrix over the field was obtained in [13], and these algorithms were generalized for the matrices over commutative domains in [15] and in [16].

In this article we review the main achievements in this class of algorithms and present the results of experiments that we conducted with these algorithms on a supercomputer MVS-10P at the Joint Supercomputer Center of the Russian Academy of Science.

In the next section, we present the main application areas in which such algorithms are used.

## Ii Some important areas for applications of sparse matrices algorithms

### Ii-a Computations of functions of electronic circuits

The behavior of electronic circuits can be described by Kirchhoff’s laws. The three basic approaches in this theory are direct current, constant frequency current and a current that varies with time. All these cases require the compilation and solution of sparse systems of equations (numerical, polynomial or differential). The solution of such differential equations by the Laplace method also leads to the solution of polynomial systems of equations [17].

### Ii-B Control systems

In 1967 Howard H. Rosenbrock introduced a useful state-space representation and transfer function matrix form for control systems, which is known as the Rosenbrock System Matrix [18]. Since that time, the properties of the matrix of polynomials being intensively studied in the literature of linear control systems.

### Ii-C Computation of Gröbner bases

Another important application is the calculation of Gröbner bases. A matrix composed of Buchberger S-polynomials is a strongly sparse matrix. Reduction of the polynomial system is performed when calculating the echelon and diagonal forms of this matrix. The algorithm F4 [19] was the first such matrix algorithm.

### Ii-D Solving ODE’s and PDE’s.

Solving ODE’s and PDE’s is often based on solution of leanear systems with sparse matrices over numbers or over polynomials. One of the important class of sparse matrix is called quasiseparable. Any submatrix of quasiseparable matrix entirely below or above the main diagonal has small rank. These quasiseparable matrices arise naturally in solving PDE’s for particle interaction with the Fast Multi-pole Method (FMM). The efficiency of application of the block-recursive algorithm of the Bruhat decomposition to the quasiseparable matrices is studied in [21].

## Iii Development of the recursive matrix agorithms in integral domain

We can trace how developed the matrix recursive agorithms in integral domain, which eventually led to the creation of modern algorithms. There are several separate periods.

### Iii-a Algorithms for solution of a system of linear equations of size in an integral domain, which served as the basis for recursive algorithms

(1983) Forward and backward algorithm () [5].

(1989) One pass algorithm () [6].

(1995) Combined algoritm with upper left block of size ( for ) [7].

Really, this was already the first step of a recursive algorithm.
It was first discovered that when the matrix is divided into equal four blocks (), the least computational complexity is achieved.
Consequently, further dichotomous division of blocks can give the best algorithm.
It remains to prove several determinant identities that would allow us to do recursive calculations.

### Iii-B Recursive algorithms for solution of a system of linear equations and for adjoint matrix computation in an integral domain without permutations

(1997) Recursive algorithm for solution of a system of linear equations [8].

(2000) Adjoint matrix computation (with 6 levels) [9].

(2006) Adjoint matrix computation alternative algorithm (with 5 levels) [11].

Now it remained to solve the problem of permutation of blocks and ensure the fulfillment of determinant identities.

### Iii-C Main recursive algorithms for matrices in a domain

(2008) Computation of adjoint and inverse matrices and the operator kernel in a domain [12].

(2010) Bruhat and LEU decompositions in a feild [13].

(2012) Bruhat and LDU decompositions in a domain [14], [15].

Recursive algorithms for sparse matrices in commutative domains with the complexity of matrix multiplication are obtained.
The complexity of computing the matrix product for matrices of size we denote by .

### Iii-D New achivements and new applications

(2013) It is proved that the LEU algorithm has the complexity for rank matrices
[20].

(2015) New algorithms for Bruhat and LDU decompositions in a domain (alternative algorithm) [16].

(2017) It is proved that the LEU algorithm has the complexity for quasiseparable matrix. A matrix is called
quasiseparable if any it’s submatrix which entirely disposed below or above the main diagonal has small rank , [21].

## Iv Recursive standard and strassen’s matrix multiplication

The graph of recursive algorithm for standard matrix multiplication is shown at Figure 1.

Number of operations for the standard algorithm is .

The graph of the Strassen multiplication algorithm can be easily represented similarly. The number of operations for this algorithm is .

The algorithm for multiplying matrices on leaf tops should take into account the sparse matrix structure and compact storage form.

We note that there exists a boundary with respect to the density of the matrix, which separates the region of applicability of the Strassen multiplication. We note that there exists a theoretical boundary for the density of sparse matrix, which separates the region of efficient application of the Strassen algorithm of multiplication. If the density of the matrix is below this boundary, then only standard multiplication is effective. This is due to the fact that the addition of blocks, which is performed in the Strassen algorithm, leads to an increase in the density of the matrix blocks (see details in [2]).

## V Recursive Strassen’s matrix inversion

If , and then

We have denoted here , , , , , .

The graph of recursive Strassen’s matrix inversion is shown on Figure 2.

## Vi Recursive inversion of triangular matrix

If is triangular matrix of order and then

## Vii Recursive Cholesky decomposition

Let be a positive definite symmetric matrix and be a low triangle matrix with the property . The mapping

is called an Cholesky decomposition. Let , then you can used following recursive algorithm.

#### Vii-1

Let .

We can compute C= and

#### Vii-2

Let .

Then and .

## Viii Recursive computation of the adjoint matrix, kernel and determinant

We consider matrices over a commutative domain.

Semigroup is formed by matrices, which have the number of unit elements coincides with its rank, and the remaining elements are zero. The semigroup is formed by the diagonal matrices: , =

. The identity matrix

is a unit in and in . For each matrix we define diagonal matrices Also, we used the involution fanction on , with the property .For the matrix , the matrix is a left annihilator, and the matrix is a right annihilator. So we can denote the set of echelon matrix of order : In other words, () is a block of echelon matrix with , such that the sets of zero rows of the matrices and coincide, and each nonzero column of the matrix coincides with the same column of the matrix . We write: .

Below we will use such notation for any matrix and :

The mapping

for is called an extended adjoint mapping of the pair () if it is defined recursively as follows.

For we define

For and we define

In all other cases, we split the matrix into four equal blocks

#### Viii-1

Let

We denote , , .

#### Viii-2

Let .

#### Viii-3

Let

We denote ,

#### Viii-4

Let

We denote ,

, ,

,

Sentence. The map defines an extended adjoint matrix , an echelon matrix , and a matrix such that and [12]. The graph of extended adjoint map is shown at Figure 3.

## Ix Results of experiments with matrix recursive algorithms on a cluster with distributed memory

The block-recursive matrix algorithms require a special approachs to managing parallel programs. One approach to the cluster computations management is a scheme with one dispatcher.

We consider another scheme of cluster menagement. It is a scheme with multidispatching, when each involved computing module has its own dispatch thread and several processing threads. Each processor, along with its subtask, receives a list of slave processors. During the computation, this list changes when new processors are added or when some of these processors complete their subtasks [22], [23].

### Ix-a Scalability

We have done the experiments using the supercomputer MVS-10P based on RSC Tornado architecture. It is a 10 Petaflops supercomputing system at the Joint Supercomputer Center (JSCC) of the Russian Academy of Science (RAS).

We demonstrate the results of experiments with parallel programms on the base of a scheme with multidispatching, using C++ and OpenMPI. We demonstrate a scalability of these programs. To do this, we plot the graph of efficiency factor.

For an “ideal parallel program” the product of the time for solving the computational problem by the number of cores in the computational cluster must remain constant:

So the value

can be taken as the “efficiency factor” of the n-cores with respect to the k-cores.

In order to investigate how the efficiency factor changes with increasing number of cores for a given program, it is possible to conduct a series of experiments with different number of cores in the cluster. If the program has this coefficient above 50% for some range of number of cores, then we conside that it has good scalability in this range. We conducted several series of experiments and obtained graphs that show how the efficiency factor varies. In all experiments, except Strassen’s matrix inversion, we took matrices with 15 bit integers.

For the algorithm of recursive Strassen’s matrix inversion, we took a dense matrix with double-precision numbers. Figures 4 and 5 show the results of experiments with dence matrices of sizes 8000x8000 and 16000x16000, correspondingly. For a cluster having 200 cores, the efficiency coefficient is equal to 51% for a matrix size of 8000x8000 and 73% for a matrix size of 16000x16000.

In Figure 6, the efficiency is shown for a recursive algorithm for computing the adjoint matrix and the kernel, when the number of cores in a cluster changes from 8 to 400. We took arbitrary dense matrices with size 8000x8000. For a cluster having 200 cores, the efficiency coefficient is equal to 66%. Efficiency coefficient drops to 44% when the number of cores reaches 400.

Figures 4– 7 demonstrate that the larger the size of the matrix, the better the efficiency coefficient remains with increasing cluster sizes.

### Ix-B Comparison of the calculation time

The next three figures show how the calculation time varies with the growth of the matrix sizes for different algorithms and for different sparseness of the matrices.

We investigate the algorithm of computing the adjoint matrix and the kernel (see the algorithm in Figure 3). Experiments with two different versions of this algorithm are compared in Figures 7 and 8. These experiments were carried out for integer matrices on a cluster with 100 cores. We compared a standard algorithm and an algorithm in which the Chinese remainder theorem (CRT) was applied. It is well known that the application of the Chinese remainder theorem makes it possible to reduce the number of operations in such algorithms by times, where is the size of the matrix, if the standard algorithm for multiplying integers is used. This is true for dense matrices.

Figure 7 shows the results of experiments for dense matrices. For a matrix of size 2500x2500, the CRT algorithm is about twice as fast as a standard algorithm.

Figure 8 shows the results of experiments for sparse matrices that have a density of 1 percent. For a matrix of size 2500x2500, the CRT algorithm is approximately twice as slow as the standard algorithm.

We see that for very sparse matrices, the CRT algorithm should not be used.

## X Results of experiments with sequential programs: comparison with Mathematica and MAPLE

An experimental comparison of the sequential recursive algorithm for computing the adjoint matrix with similar programs in Mathematica and MAPLE systems is shown in Figure 9.

We compared our algorithms with Mathematica 11 and MAPLE 2015. For comparison, we took random dense integer matrices and performed calculations with identical matrices in 4 programs. The best time of calculations was demonstrated by sequential recursive algorithm. For example, for 600x600 matrices, it is twice as fast as Mathematica 11 and 7 times faster than MAPLE 2015. A slightly worse calculation time was shown by the fourth program. This is the parallel program based on recursive algorithm that was run on a single processor.

## Xi Conclusion

The algorithms we discussed were used in the cloud computing system of the computer algebra Math Partner. You can used this cloud system on the websites http://mathpar.cloud.unihub.ru.

Functions that correspond to these algorithms can be called by the operators adjoint, kernel, det, LDU, BruhatDecomposition.

The algorithms that have been discussed above have a wide application. Therefore, it would be important to have such a public site where users could perform parallel computing to solve their specific tasks using a large number of processors.

## Xii Acknowledgment

The authors are grateful to the Ivannikov Institute for System Programming of the Russian Academy of Science for hosting the cloud computer algebra system Math Partner and to the Joint Supercomputer Center of the Russian Academy of Science for providing the ability to perform calculations on the supercomputer MVS-10P.

## References

- [1] J. Dongarra “With Extrim Scale Computing the Rules Have Changed”, in Mathematical Software. ICMS 2016, 5th International Congress, Proceedings (G.-M. Greuel, T. Koch, P. Paule, A. Sommese, eds.), Springer, LNCS, vol. 9725, pp. 3–8, 2016.
- [2] G.I.Malashonok, Y.D.Valeev, A.O. Lapaev, “On the choice of a multiplication algorithm for polynomials and polynomial matrices”, J. Math Sci. vol. 168, no. 3, pp. 398–416, 2010.
- [3] V. Strassen, “ Gaussian Elimination is not optimal,” Numerische Mathematik. vol. 13, issue 4, pp. 354–356, 1969.
- [4] J. Bunch, J. Hopkroft, “Triangular factorization and inversion by fast matrix multiplication,” Mat. Comp. vol. 28, pp. 231–236, 1974.
- [5] G.I.Malashonok, “Solution of a system of linear equations in an integral domain, ” USSR J. of Comput. Math. and Math. Phys.,vol. 23, no. 6, pp. 1497–1500, 1983.
- [6] G.I.Malashonok, “Algorithms for the solution of systems of linear equations in commutative rings. ” Effective methods in Algebraic Geometry, Progr. Math.,vol. 94, Birkhauser Boston, Boston, MA, 1991, pp. 289–298, 1991.
- [7] G.I. Malaschonok. “Algorithms for computing determinants in commutative rings. ” Discrete Math. Appl., Vol. 5, no. 6, pp. 557–566, 1995.
- [8] G.I.Malashonok, “Recursive Method for the Solution of Systems of Linear Equations,” Computational Mathematics. A. Sydow Ed, Proceedings of the 15th IMACS World Congress, vol. I, (Berlin, August 1997), Wissenschaft & Technik Verlag, Berlin, pp. 475–480, 1997.
- [9] G. Malashonok, “ Effective Matrix Methods in Commutative Domains”, Formal Power Series and Algebraic Combinatorics, Springer, Berlin, pp. 506–517, 2000.
- [10] G. Malashonok, “ Matrix computational methods in commutative rings,” Tambov: Tambov State University, 2002.
- [11] A.G.Akritas, G.I. Malashonok, “ Computation of Adjoint Matrix,” Computational Science, ICCS 2006, LNCS 3992, Springer, Berlin, pp. 486–489, 2006.
- [12] G. Malashonok, “On computation of kernel of operator acting in a module” [Tambov University Reports. Series: Natural and Technical Sciences], vol. 13, issue 1, pp. 129–131, 2008.
- [13] G. Malashonok, “Fast Generalized Bruhat Decomposition,” Computer Algebra in Scientific Computing, LNCS 6244, Springer, Berlin, pp. 194–202, 2010.
- [14] G. Malashonok, “On fast generalized Bruhat decomposition in the domains,” Tambov University Reports. Series: Natural and Technical Sciences, vol. 17, issue 2, pp. 544–551, 2012.
- [15] G. Malashonok, “Generalized Bruhat decomposition in commutative domains,” Computer Algebra in Scientific Computing, CASC’2013, LNCS 8136, Springer, Heidelberg, 2013, pp. 231–242, 2013.
- [16] G. Malashonok, A. Scherbinin, “Triangular Decomposition of Matrices in a Domain,” Computer Algebra in Scientific Computing, LNCS 9301, Springer, Switzerland, 2015, pp. 290–304, 2015.
- [17] R.P. Clayton, “Fundamentals of Electric Circuit Analysis,” Mercer University & University of Kentucky, John Wiley & Sons, Inc., 2001.
- [18] Rosenbrock, H.H. “ Transformation of linear constant system equations,” Proc. I.E.E. vol. 114, pp. 541–544, 1967.
- [19] Faugere, J.-C. “A new efficient algorithm for computing Gröbner bases (F4)” . Journal of Pure and Applied Algebra. Elsevier Science, vol. 139, no. 1, pp. 61–88, 1999.
- [20] J.-G. Dumas, C. Pernet, Z. Sultan, “ Simultaneous computation of the row and column rank profiles,” In: Kauers, M. (Ed.), Proc. ISSAC’13. ACM Press, pp. 181–188, 2013.
- [21] Pernet C., Storjohann A. “ Time and space efficient generators for quasiseparable matrices,” Journal of Symbolic Computation, vol. 85, no. 2, pp. 224–246, 2018.
- [22] Ilchenko E.A. “An algorithm for the decentralized control of parallel computing process,” Tambov University Reports, series: Natural and Technical Sciences, vol. 18, no. 4, pp. 1198-1206, 2013.
- [23] Ilchenko E.A. “About effective methods parallelizing block recursive algorithms,” Tambov University Reports. series: Natural and Technical Sciences, vol. 20, no. 5, pp. 1173–1186, 2015.

Comments

There are no comments yet.