1 Introduction
Let $A, B \in \mathbb{R}^{n \times n}$ and let $\lambda \in \mathbb{C}$. The matrix pencil $(A, B)$ consists of all matrices of the form $A - \lambda B$, where $\lambda \in \mathbb{C}$. The set of (generalized) eigenvalues of the matrix pencil $(A, B)$ is given by

$$\lambda(A, B) = \{ \lambda \in \mathbb{C} : \det(A - \lambda B) = 0 \}. \tag{1}$$

We say that $x$ is a (generalized) eigenvector of the matrix pencil $(A, B)$ if and only if $x \neq 0$ and

$$A x = \lambda B x, \qquad \lambda \in \lambda(A, B). \tag{2}$$
The eigenvalues of $(A, B)$ can be computed by first reducing $(A, B)$ to generalized real Schur form $(S, T)$. Specifically, there exist orthogonal matrices $Q$ and $Z$ such that $S = Q^T A Z$ is quasi-upper triangular and $T = Q^T B Z$ is upper triangular. It is clear that

$$\lambda(A, B) = \lambda(S, T). \tag{3}$$

Moreover, if $y$ is a generalized eigenvector of $(S, T)$ corresponding to the eigenvalue $\lambda$, then $x = Zy$ is a generalized eigenvector of $(A, B)$ corresponding to the eigenvalue $\lambda$.
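These facts can be checked numerically. The following sketch (assuming NumPy and SciPy, which wrap the same LAPACK routines) reduces a random pencil to generalized real Schur form and confirms that the eigenvalues are preserved:

```python
import numpy as np
from scipy.linalg import qz, eigvals

rng = np.random.default_rng(42)
n = 6
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# Generalized real Schur form: A = Q S Z^T and B = Q T Z^T with Q, Z
# orthogonal, S quasi-upper triangular and T upper triangular.
S, T, Q, Z = qz(A, B, output='real')

# The pencils (A, B) and (S, T) have the same generalized eigenvalues.
ev_ST = eigvals(S, T)
for ev in eigvals(A, B):
    assert np.min(np.abs(ev_ST - ev)) < 1e-8
```

If `y` is an eigenvector of `(S, T)`, then `Z @ y` recovers the corresponding eigenvector of `(A, B)`.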
In this paper, we consider the parallel computation of eigenvectors of a matrix pencil in real Schur form. In exact arithmetic, this problem can be solved using substitution. However, substitution is very vulnerable to floating-point overflow.
In LAPACK [3] there exists a family xTGEVC of subroutines which compute the generalized eigenvectors of a matrix pencil in Schur form. They prevent overflow by dynamically scaling the eigenvectors. These subroutines are scalar codes which compute the eigenvectors one by one. In this paper we discuss the construction of algorithms which are not only robust, but blocked and parallel.
Our paper is organized as follows. In Section 2 we consider the problem of computing the eigenvectors of a matrix pencil in real Schur form using real arithmetic. This problem is equivalent to solving a homogeneous matrix equation of the form

$$S X D_\beta - T X D_\alpha = 0, \tag{4}$$

where $D_\beta$ is diagonal and $D_\alpha$ is block diagonal with diagonal blocks which are either 1-by-1 or 2-by-2. In Section 3 we present a blocked algorithm for solving this matrix equation. In Section 4 we discuss how to prevent overflow in this algorithm. The concept of an augmented matrix is central to this discussion. In Section 5 we describe Zazamoukh, a robust task-parallel solver which has been developed and integrated into the new StarNEig library for solving nonsymmetric eigenvalue problems [1, 9]. The performance of Zazamoukh is compared to LAPACK in Section 6. We outline several directions for future work in Section 7.
2 Real arithmetic
Let $A, B \in \mathbb{R}^{n \times n}$ be given. The set of generalized eigenvalues of the matrix pencil $(A, B)$ can be computed by first reducing $(A, B)$ to generalized real Schur form. Specifically, there exist orthogonal matrices $Q$ and $Z$ such that

$$S = Q^T A Z, \qquad T = Q^T B Z$$

are upper block triangular, where the diagonal blocks of $S$ are either 1-by-1 or 2-by-2 and $T$ is upper triangular. It is clear that $\lambda(A, B) = \lambda(S, T)$.
In order to simplify the discussion, we will make the following assumptions.

- If $S_{jj}$ is a 1-by-1 diagonal block, then the pencil $(S_{jj}, T_{jj})$ has a single real eigenvalue.

- If $S_{jj}$ is a 2-by-2 diagonal block, then the pencil $(S_{jj}, T_{jj})$ has a pair of complex conjugate eigenvalues.

- All eigenvalues of $(S, T)$ are distinct.
We follow the standard convention and represent an eigenvalue $\lambda$ using an ordered pair $(\alpha, \beta)$, where $\alpha \in \mathbb{C}$ and $\beta \geq 0$. If $\beta \neq 0$, then $\lambda = \alpha / \beta$ is a finite eigenvalue. The case of $\alpha \neq 0$ and $\beta = 0$ corresponds to an infinite eigenvalue. The case of $\alpha = \beta = 0$ corresponds to an indefinite problem.

We now consider the problem of computing generalized eigenvectors of $(S, T)$. Our goal is to obtain an equivalent problem which can be solved using a blocked algorithm.
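LAPACK's xGGEV family returns exactly these pairs (as ALPHAR/ALPHAI and BETA). A small sketch, assuming SciPy's wrapper, classifies the eigenvalues of a pencil with a singular $B$:

```python
import numpy as np
from scipy.linalg import eig

# A pencil with finite eigenvalues 1 and 2 and one infinite eigenvalue
# (the last diagonal entry of B is zero, so B is singular).
A = np.diag([1.0, 2.0, 3.0])
B = np.diag([1.0, 1.0, 0.0])

# homogeneous_eigvals=True returns the pairs (alpha_j, beta_j) directly.
w, X = eig(A, B, homogeneous_eigvals=True)
alpha, beta = w

for a, b in zip(alpha, beta):
    if b != 0:
        print("finite eigenvalue:", a / b)
    elif a != 0:
        print("infinite eigenvalue")
    else:
        print("indefinite (0/0)")
```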
2.1 Computing a single eigenvector
In this subsection, we note that the problem of computing a single generalized eigenvector of $(S, T)$ is equivalent to solving a tall homogeneous matrix equation involving real matrices. Let $(\alpha_j, \beta_j)$ denote the $j$th eigenvalue of $(S, T)$ and let $m_j \in \{1, 2\}$, where $m_j = 1$ when the eigenvalue is real and $m_j = 2$ when the eigenvalues form a complex conjugate pair.

Let $a_j \in \mathbb{R}^{m_j \times m_j}$ and $b_j \in \mathbb{R}^{m_j \times m_j}$ be given by

$$a_j = \alpha_j, \qquad b_j = \beta_j \tag{5}$$

when the $j$th eigenvalue is real (or equivalently $S_{jj}$ is 1-by-1) and

$$a_j = \begin{pmatrix} \operatorname{Re}(\alpha_j) & \operatorname{Im}(\alpha_j) \\ -\operatorname{Im}(\alpha_j) & \operatorname{Re}(\alpha_j) \end{pmatrix}, \qquad b_j = \beta_j I_2 \tag{6}$$

when the eigenvalues are complex (or equivalently $S_{jj}$ is 2-by-2). With this notation, the problem of computing an eigenvector is equivalent to solving the homogeneous linear equation

$$S X_j b_j - T X_j a_j = 0 \tag{7}$$

with respect to $X_j \in \mathbb{R}^{n \times m_j}$, $X_j \neq 0$. This is an immediate consequence of the following lemma.
Lemma 1
Let and let where . Then the following statements are true.
2.2 Computing all eigenvectors
In this subsection, we note that the problem of computing all generalized eigenvectors of $(S, T)$ is equivalent to solving a homogeneous matrix equation involving real matrices. Specifically, let $D_\alpha$ and $D_\beta$ be given by

$$D_\alpha = \operatorname{diag}(a_1, a_2, \dots, a_p), \qquad D_\beta = \operatorname{diag}(b_1, b_2, \dots, b_p), \tag{8}$$

where $a_j$ and $b_j$ are given by equations (5) and (6). Then $X = [X_1, X_2, \dots, X_p]$ solves the homogeneous matrix equation

$$S X D_\beta - T X D_\alpha = 0 \tag{9}$$

if and only if each $X_j$ solves equation (7).
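Over $\mathbb{C}$ the 2-by-2 blocks in (6) collapse to scalars, so the structure of equation (9) can be checked directly with diagonal $D_\alpha$ and $D_\beta$. A sketch, assuming SciPy, with names mirroring the text:

```python
import numpy as np
from scipy.linalg import qz, eig

rng = np.random.default_rng(7)
n = 5
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))
S, T, Q, Z = qz(A, B, output='real')

# Eigenvalue pairs (alpha_j, beta_j) and eigenvectors of the pencil (S, T).
w, X = eig(S, T, homogeneous_eigvals=True)
D_alpha = np.diag(w[0])  # over C the 2-by-2 blocks become complex scalars
D_beta = np.diag(w[1])

# Each column satisfies beta_j S x_j = alpha_j T x_j, i.e. S X D_beta = T X D_alpha.
residual = S @ X @ D_beta - T @ X @ D_alpha
assert np.abs(residual).max() < 1e-8
```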
3 A blocked algorithm
It is straightforward to derive a blocked algorithm for solving the homogeneous matrix equation (9). Specifically, redefine

$$S = (S_{ij}), \qquad T = (T_{ij})$$

to denote any partitioning of $S$ and $T$ into an $N$-by-$N$ block matrix which does not split any of the 2-by-2 blocks along the diagonal of $S$. Apply the same partitioning to $X$, $D_\alpha$, and $D_\beta$. Then $S$, $T$, and $X$ are block upper triangular. It is straightforward to see that equation (9) can be solved using Algorithm 1.
(10) 
We see that Algorithm 1 can be implemented using three distinct kernels:

- Compute the eigenvectors corresponding to a small matrix pencil in real Schur form.

- Execute linear updates of the form

  $$Y \leftarrow Y - (S X D_\beta - T X D_\alpha), \tag{11}$$

  where $S$ and $T$ are small dense matrices, $D_\beta$ is diagonal and $D_\alpha$ is block diagonal with blocks of size 1-by-1 or 2-by-2.

- Solve small matrix equations of the form

  $$S X D_\beta - T X D_\alpha = Y, \tag{12}$$

  where $(S, T)$ is a matrix pencil in real Schur form, $D_\beta$ is diagonal and $D_\alpha$ is block diagonal with diagonal blocks of dimension 1-by-1 or 2-by-2.
Once these kernels have been implemented, it is straightforward to parallelize Algorithm 1 using a task-based runtime system such as StarPU [4].
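To make the substitution structure concrete, here is a minimal, non-robust sketch of the scalar algorithm in complex arithmetic, where every diagonal block is 1-by-1 (the real 2-by-2 case adds bookkeeping but no new ideas); the function name is mine, not StarNEig's:

```python
import numpy as np

def pencil_eigvecs(S, T):
    """Back-substitution for the eigenvectors of an upper triangular pencil.

    Assumes S and T are upper triangular (complex Schur form) with distinct,
    finite eigenvalues. Column j of the result is the eigenvector for the
    pair (alpha_j, beta_j) = (S[j, j], T[j, j]).
    """
    n = S.shape[0]
    X = np.zeros((n, n), dtype=complex)
    for j in range(n):
        a, b = S[j, j], T[j, j]          # eigenvalue pair (alpha_j, beta_j)
        X[j, j] = 1.0                    # normalize the trailing entry
        for i in range(j - 1, -1, -1):   # solve (b*S - a*T) x = 0 upwards
            rhs = -(b * S[i, i + 1:j + 1] - a * T[i, i + 1:j + 1]) @ X[i + 1:j + 1, j]
            X[i, j] = rhs / (b * S[i, i] - a * T[i, i])  # may overflow: see Sec. 4
    return X
```

The division in the innermost loop is exactly where overflow can occur, which motivates the machinery of Section 4.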
4 Deriving a robust blocked algorithm
The subroutines xTGEVC prevent overflow using principles originally derived by Edward Anderson and implemented in the subroutines xLATRS [2]. These subroutines apply to triangular linear systems

$$T x = b \tag{13}$$

with a single right-hand side $b$. Mikkelsen and Karlsson [7] formalized the work of Anderson and derived a robust blocked algorithm for solving (13). Mikkelsen, Schwarz and Karlsson [8] derived a robust blocked algorithm for solving triangular linear systems

$$T X = B$$

with multiple right-hand sides. Their StarPU implementation (Kiya) is significantly faster than DLATRS when numerical scaling is necessary and not significantly slower than DTRSM when numerical scaling is unnecessary.
A robust variant of Algorithm 1 can be derived using the principles applied by Mikkelsen, Schwarz and Karlsson [8]. We can use the two real functions ProtectUpdate and ProtectDivision introduced by Mikkelsen and Karlsson [7]. They are used to prevent scalar divisions and linear updates from exceeding the overflow threshold $\Omega$. They have the following key properties.

- If $t \neq 0$ and $|b| \leq \Omega$, and

  $$\xi = \text{ProtectDivision}(|b|, |t|), \tag{14}$$

  then $\xi \in (0, 1]$ and $|\xi b| \leq |t| \Omega$. It follows that the scaled division

  $$x = (\xi b) / t \tag{15}$$

  cannot exceed $\Omega$.

- If

  $$\xi = \text{ProtectUpdate}(c, a, b) \tag{16}$$

  is defined, with

  $$\|Y\|_\infty \leq c \leq \Omega, \qquad \|A\|_\infty \leq a \leq \Omega, \qquad \|X\|_\infty \leq b \leq \Omega, \tag{17}$$

  then $\xi \in (0, 1]$ and $\xi (c + a b) \leq \Omega$. It follows that

  $$Z = \xi Y - A (\xi X) \tag{18}$$

  can be computed without any intermediate or final result exceeding $\Omega$.
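The following sketch conveys the flavor of these two guards. The exact formulas (and the restriction of the scalings to integer powers of 2) are in [7], so treat this as an illustration rather than the published definitions:

```python
import numpy as np

OMEGA = 1.0e300  # illustrative overflow threshold (the true one is ~1.8e308)

def protect_division(b, t):
    """Return xi in (0, 1] such that |(xi * b) / t| <= OMEGA.

    Assumes t != 0 and |b| <= OMEGA. Simplified sketch of ProtectDivision
    from Mikkelsen and Karlsson; not the published formula.
    """
    if abs(b) > abs(t) * OMEGA:   # the division b / t would overflow
        return (abs(t) * OMEGA) / abs(b)
    return 1.0

def protect_update(c, a, b):
    """Return xi in (0, 1] such that xi * c + a * (xi * b) <= OMEGA.

    Assumes c, a, b <= OMEGA, where c bounds the matrix being updated and
    a, b bound the factors of the update. Simplified sketch of ProtectUpdate.
    """
    if b <= 1.0:
        return 0.5 if c + a * b > OMEGA else 1.0      # here a * b <= a, safe to form
    return 0.5 / b if a > (OMEGA - c) / b else 1.0    # avoids forming a * b
```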
Now consider the kernels needed for implementing Algorithm 1. It is easy to see that they can be implemented using loops, divisions and linear updates. It is therefore plausible that robust variants can be implemented using ProtectDivision and ProtectUpdate. However, resolving all the relevant details requires many pages. Here we explain how the updates given by equation (11) can be executed without exceeding the overflow threshold $\Omega$. We will use the concept of an augmented matrix introduced by Mikkelsen, Schwarz and Karlsson [8].
Definition 1
Let $Y \in \mathbb{R}^{m \times n}$ be partitioned into block columns

$$Y = [Y_1, Y_2, \dots, Y_p] \tag{19}$$

and let $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_p)$ have $\alpha_j \in (0, 1]$. The augmented matrix $\langle \alpha, Y \rangle$ represents the real matrix given by

$$\langle \alpha, Y \rangle = [\alpha_1^{-1} Y_1, \alpha_2^{-1} Y_2, \dots, \alpha_p^{-1} Y_p]. \tag{20}$$

This is a trivial extension of the original definition, which only considered the case of $p = 1$. The purpose of the scaling factors is to extend the normal representational range of our floating-point numbers.
(21) 
Now $\langle \gamma, X D_\beta \rangle$ can be computed without overflow using Algorithm 2. In our case $D_\beta$ is diagonal, so we are merely scaling the columns of $X$. However, $\langle \alpha, Y - S X \rangle$ can now be computed without overflow using Algorithm 3. It follows that the update (11) can be computed without overflow using two applications of Algorithm 2 and Algorithm 3. Now suppose that $X$ is $m$-by-$n$. Then Algorithm 2 does $O(mn)$ flops on $O(mn)$ data. However, Algorithm 3 then does $O(m^2 n)$ flops on the same data. Therefore, the overall arithmetic intensity is $O(m)$.
(22) 
(23) 
(24) 
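The following sketch illustrates how an update of the form $Y \leftarrow Y - AX$ is executed on augmented matrices: the columns are first brought to a common scaling (the role of Algorithm 2) and then shrunk further, if necessary, so that the multiplication cannot overflow (the role of Algorithm 3). The function names and the simplified guard are mine, not StarNEig's:

```python
import numpy as np

OMEGA = 1.0e300  # illustrative overflow threshold

def protect_update(c, a, b):
    """Simplified guard: xi in (0, 1] with xi*c + a*(xi*b) <= OMEGA."""
    if b <= 1.0:
        return 0.5 if c + a * b > OMEGA else 1.0
    return 0.5 / b if a > (OMEGA - c) / b else 1.0

def robust_update(alpha, Y, A, gamma, X):
    """Form the augmented representation of <alpha, Y> - A @ <gamma, X>.

    <alpha, Y> represents the matrix with columns Y[:, j] / alpha[j];
    every scaling factor lies in (0, 1]. Sketch only, not StarNEig's code.
    """
    a_norm = np.abs(A).sum(axis=1).max()          # infinity norm of A
    for j in range(Y.shape[1]):
        # Bring both columns to the common scaling beta <= alpha[j], gamma[j].
        beta = min(alpha[j], gamma[j])
        y = (beta / alpha[j]) * Y[:, j]
        x = (beta / gamma[j]) * X[:, j]
        # Shrink further if the linear update itself could overflow.
        xi = protect_update(np.abs(y).max(), a_norm, np.abs(x).max())
        alpha[j] = xi * beta
        Y[:, j] = xi * y - A @ (xi * x)
    return alpha, Y
```

When no scaling is necessary all factors remain 1 and the update reduces to the ordinary `Y - A @ X`.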
In order to execute all linear updates needed for a robust variant of Algorithm 1 we require certain norms. Specifically, we need the infinity norms of all superdiagonal blocks of $S$ and $T$. Moreover, we require the infinity norm of certain submatrices of $X$. These submatrices consist of either a single column (segment of a real eigenvector) or two adjacent columns (segment of a complex eigenvector). The infinity norm must be recomputed whenever a submatrix has been initialized or updated. ProtectUpdate requires that its input arguments are bounded by $\Omega$, and failure is possible if they are not. It is therefore necessary to scale the matrices $S$ and $T$ to ensure that all blocks have infinity norm bounded by $\Omega$.
5 Zazamoukh - a task-parallel robust solver
The new StarNEig library runs on top of StarPU and can be used to solve dense nonsymmetric eigenvalue problems. A robust variant of Algorithm 1 has been implemented in StarNEig. This implementation (Zazamoukh) uses augmented matrices whose scaling factors are stored as signed 32-bit integers. Zazamoukh can compute the eigenvectors corresponding to any subset of eigenvalues which is closed under complex conjugation. Zazamoukh is currently limited to shared memory, but an extension to distributed memory is planned.
5.1 Memory layout
Given block sizes $m_b$ and $n_b$, Zazamoukh partitions $S$, $T$ and the matrix $X$ of eigenvectors conformally by rows and columns. In the absence of any 2-by-2 blocks on the diagonal of $S$, the tiles of $S$ and $T$ are $m_b$-by-$m_b$ and the tiles of $X$ are $m_b$-by-$n_b$. The only exceptions can be found along the right and lower boundaries of the matrices. This default configuration is adjusted minimally to prevent splitting any 2-by-2 diagonal block of $S$ or separating the real part and the imaginary part of a complex eigenvector into separate block columns.
5.2 Tasks
Zazamoukh relies on four types of tasks:

- Pre-processing tasks which compute all quantities needed for robustness. This includes the infinity norms of all superdiagonal tiles of $S$ and $T$ as well as all norms needed for the robust solution of equations of the type (12). If necessary, the matrices $S$ and $T$ are scaled minimally.

- Solve tasks which use DTGEVC to compute the lower tips of eigenvectors and a robust solver based on DLALN2 to solve equations of the type (12).

- Update tasks which execute updates of the type (11) robustly.

- Post-processing tasks which enforce a consistent scaling on all eigenvectors.
5.3 Task insertion order and priorities
Zazamoukh is closely related to Kiya, which solves triangular linear systems with multiple right-hand sides. Apart from the pre-processing and post-processing tasks, the main task graph is the disjoint union of task graphs, one for each block column of the matrix of eigenvectors. Zazamoukh uses the same task insertion order and priorities as Kiya to process each of these subgraphs.
6 Numerical experiments
In this section we give the results of a set of experiments involving tiny and small matrices. Each experiment consisted of computing all eigenvectors of the matrix pencil. The run time was measured for DTGEVC from LAPACK and for Zazamoukh. Results related to somewhat larger matrices can be found in the NLAFET Deliverable 2.7 [10].
The experiments were executed on an Intel Xeon E5-2690v4 (“Broadwell”) node with 28 cores arranged in two NUMA islands with 14 cores each. The theoretical peak performance in double-precision arithmetic is 41.6 GFlop/s for one core and 1164.8 GFlop/s for a full node.
We used the StarNEig test program starneig-test to generate reproducible experiments. The default parameters produce matrix pencils where approximately 1 percent of all eigenvalues are zeros, 1 percent of all eigenvalues are infinities, and there are no indefinite eigenvalues. Zazamoukh used the default tile size, which is 1.6 percent of the matrix dimension.
All experiments were executed with exclusive access to a complete node (28 cores). LAPACK was run in sequential mode, while Zazamoukh used 28 StarPU workers and 1 master thread. A summary of our results is given in Figure 1. The speedup of Zazamoukh over LAPACK is initially very modest, as there are not enough tasks to keep 28 workers busy, but it picks up rapidly and Zazamoukh achieves a superlinear speedup over DTGEVC when $m \geq 10000$. This is an expression of the fact that Zazamoukh uses a blocked algorithm, whereas DTGEVC computes the eigenvectors one by one.
dimension    eigenvalue analysis          run time (ms)              speed-up
m            zeros    inf.    indef.      LAPACK       StarNEig
1000           11      13      0              295          175        1.6857
2000           25      16      0             1598          409        3.9071
3000           24      30      0             6182          929        6.6545
4000           42      49      0            15476         1796        8.6169
5000           54      37      0            30730         2113       14.5433
6000           61      64      0            53700         2637       20.3641
7000           67      64      0            84330         3541       23.8153
8000           56      69      0           122527         4769       25.6924
9000           91      91      0           171800         6189       27.7589
10000         108      94      0           242466         7821       31.0019
20000         175     197      0          2034664        49823       40.8378
30000         306     306      0          7183746       162747       44.1406
40000         366     382      0         17713267       380856       46.5091
7 Conclusion
Previous work by Mikkelsen, Schwarz and Karlsson has shown that triangular linear systems can be solved in parallel without overflow using augmented matrices. In this paper we have shown that the eigenvectors of a matrix pencil can be computed in parallel without overflow using augmented matrices. Certainly, robust algorithms are slower than non-robust algorithms when numerical scaling is not needed, but robust algorithms will always return a result which can be evaluated in the context of the user’s application. To the best of our knowledge, StarNEig is the only library which contains a parallel robust solver for computing the generalized eigenvectors of a dense nonsymmetric matrix pencil. The StarNEig solver (Zazamoukh) runs on top of StarPU and uses augmented matrices and scaling factors which are integer powers of 2 to prevent overflow. It achieves superlinear speedup compared with DTGEVC from LAPACK. In the immediate future we expect to pursue the following work:

- Extend Zazamoukh to also compute left eigenvectors. Here the layout of the loops is different and we must use the 1-norm instead of the infinity norm when executing the overflow protection logic.

- Extend Zazamoukh to distributed memory machines.

- Extend Zazamoukh's solver to use recursive blocking to reduce the run time further. The solve tasks all lie on the critical path of the task graph.

- Extend Zazamoukh to complex data types. This case is simpler than real arithmetic because there are no 2-by-2 blocks on the main diagonal of $S$.
Acknowledgements
This work is part of a project (NLAFET) that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 671633. This work was supported by the Swedish strategic research programme eSSENCE. We thank the High Performance Computing Center North (HPC2N) at Umeå University for providing computational resources and valuable support during test and performance runs.
References
 [1] StarNEig - A task-based library for solving nonsymmetric eigenvalue problems, https://github.com/NLAFET/StarNEig

 [2] Anderson, E.: LAPACK Working Note No. 36: Robust Triangular Solves for Use in Condition Estimation. Tech. Rep. CS-91-142, University of Tennessee, Knoxville, TN, USA (August 1991)
 [3] Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users’ Guide. SIAM, Philadelphia, PA, 3rd edn. (1999)
 [4] Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.A.: StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. CCPE - Special Issue: Euro-Par 2009 23, 187–198 (Feb 2011). https://doi.org/10.1002/cpe.1631, http://hal.inria.fr/inria00550877
 [5] Baudin, M.: Personal email to C.C.K. Mikkelsen
 [6] Baudin, M., Smith, R.L.: A robust complex division in scilab. CoRR abs/1210.4539 (2012), http://arxiv.org/abs/1210.4539
 [7] Kjelgaard Mikkelsen, C.C., Karlsson, L.: Blocked Algorithms for Robust Solution of Triangular Linear Systems. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds.) Parallel Processing and Applied Mathematics. pp. 68–78. Springer International Publishing, Cham (2018)
 [8] Kjelgaard Mikkelsen, C.C., Schwarz, A.B., Karlsson, L.: Parallel Robust Solution of Triangular Linear Systems. CCPE 0(0), e5064 (2018). https://doi.org/10.1002/cpe.5064
 [9] Myllykoski, M., Kjelgaard Mikkelsen, C.C.: StarNEig - A Task-based Library for Solving Nonsymmetric Eigenvalue Problems. Submitted to PPAM 2019
 [10] Myllykoski, M., Kjelgaard Mikkelsen, C.C., Schwarz, A., Kågström, B.: D2.7 eigenvalue solvers for nonsymmetric problems. Tech. rep., Umeå University (2019), http://www.nlafet.eu/wpcontent/uploads/2019/04/D2.7EVPsolversevaluationfinal.pdf