Let $A \in \mathbb{R}^{n \times n}$ and let $B \in \mathbb{R}^{n \times n}$. The matrix pencil $(A, B)$ consists of all matrices of the form $A - \lambda B$, where $\lambda \in \mathbb{C}$. The set of (generalized) eigenvalues of the matrix pencil $(A, B)$ is given by $\lambda(A, B) = \{\lambda \in \mathbb{C} : \det(A - \lambda B) = 0\}$. We say that $x$ is a (generalized) eigenvector of the matrix pencil $(A, B)$ if and only if $x \neq 0$ and $Ax = \lambda Bx$ for some $\lambda \in \lambda(A, B)$.
The eigenvalues of $(A, B)$ can be computed by first reducing $(A, B)$ to generalized real Schur form $(S, T)$. Specifically, there exist orthogonal matrices $Q$ and $Z$ such that $S = Q^T A Z$ is quasi-upper triangular and $T = Q^T B Z$ is upper triangular. It is clear that $\lambda(A, B) = \lambda(S, T)$. Moreover, if $y$ is a generalized eigenvector of $(S, T)$ corresponding to the eigenvalue $\lambda$, then $x = Zy$ is a generalized eigenvector of $(A, B)$ corresponding to the eigenvalue $\lambda$.
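As a concrete illustration, the reduction and the back-transformation of eigenvectors can be sketched in a few lines of NumPy/SciPy (an expository sketch only, not the implementation discussed later in the paper; the variable names mirror the text):

```python
import numpy as np
from scipy.linalg import qz, eig

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# Generalized real Schur form: S = Q^T A Z is quasi-upper triangular
# and T = Q^T B Z is upper triangular (SciPy's qz returns A = Q S Z^T).
S, T, Q, Z = qz(A, B, output='real')

# The pencils (A, B) and (S, T) have the same eigenvalues ...
ev_AB = np.sort_complex(eig(A, B, right=False))
ev_ST = np.sort_complex(eig(S, T, right=False))
assert np.allclose(ev_AB, ev_ST)

# ... and if y is an eigenvector of (S, T) for the eigenvalue lam,
# then x = Z y is an eigenvector of (A, B) for the same eigenvalue.
lam, Y = eig(S, T)
x = Z @ Y[:, 0]
assert np.allclose(A @ x, lam[0] * (B @ x))
```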
In this paper, we consider the parallel computation of eigenvectors of a matrix pencil in real Schur form. In exact arithmetic, this problem can be solved using substitution. However, substitution is very vulnerable to floating-point overflow.
LAPACK contains a family xTGEVC of subroutines which compute the generalized eigenvectors of a matrix pencil in Schur form. They prevent overflow by dynamically scaling the eigenvectors. These subroutines are scalar codes which compute the eigenvectors one by one. In this paper we discuss the construction of algorithms which are not only robust, but also blocked and parallel.
Our paper is organized as follows. In Section 2 we consider the problem of computing the eigenvectors of a matrix pencil $(S, T)$ in real Schur form using real arithmetic. This problem is equivalent to solving a homogeneous matrix equation of the form
$$S Y D_\beta - T Y D_\alpha = 0,$$
where $D_\beta$ is diagonal and $D_\alpha$ is block diagonal with diagonal blocks which are either 1-by-1 or 2-by-2. In Section 3 we present a blocked algorithm for solving this matrix equation. In Section 4 we discuss how to prevent overflow in this algorithm. The concept of an augmented matrix is central to this discussion. A robust task-parallel solver Zazamoukh has been developed and integrated into the new StarNEig library for solving non-symmetric eigenvalue problems [1, 9]. The performance of Zazamoukh is compared to LAPACK in Section 6. We outline several directions for future work in Section 7.
2 Real arithmetic
Let $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{n \times n}$ be given. The set of generalized eigenvalues of the matrix pencil $(A, B)$ can be computed by first reducing $(A, B)$ to generalized real Schur form. Specifically, there exist orthogonal matrices $Q$ and $Z$ such that $S = Q^T A Z$ is quasi-upper triangular and $T = Q^T B Z$ is upper triangular. It is clear that $\lambda(A, B) = \lambda(S, T)$.
In order to simplify the discussion, we will make the following assumptions.
If a diagonal block $S_{jj}$ of $S$ is 1-by-1, then the pencil $(S_{jj}, T_{jj})$ has a single real eigenvalue.
If a diagonal block $S_{jj}$ of $S$ is 2-by-2, then the pencil $(S_{jj}, T_{jj})$ has two complex conjugate eigenvalues.
All eigenvalues are distinct.
We follow the standard convention and represent eigenvalues
using an ordered pair $(\alpha, \beta)$, where $\alpha \in \mathbb{C}$ and $\beta \in \mathbb{R}$. If $\beta \neq 0$, then $\lambda = \alpha / \beta$ is a finite eigenvalue. The case of $\alpha \neq 0$ and $\beta = 0$ corresponds to an infinite eigenvalue. The case of $\alpha = \beta = 0$ corresponds to an indefinite problem.
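The classification of an eigenvalue from its pair representation can be sketched as follows (a minimal Python illustration; `classify` is a hypothetical helper, and SciPy's `homogeneous_eigvals` option supplies the pairs):

```python
import numpy as np
from scipy.linalg import eig

# Classify an eigenvalue from its pair (alpha, beta): beta != 0 gives
# the finite eigenvalue alpha/beta, alpha != 0 with beta = 0 gives an
# infinite eigenvalue, and alpha = beta = 0 signals an indefinite problem.
def classify(alpha, beta):
    if beta != 0:
        return ('finite', alpha / beta)
    if alpha != 0:
        return ('infinite', None)
    return ('indefinite', None)

# Example: a pencil with eigenvalues 1, 2 and one infinite eigenvalue
# (the third diagonal entry of B is zero).
A = np.diag([1.0, 2.0, 3.0])
B = np.diag([1.0, 1.0, 0.0])
alphas, betas = eig(A, B, right=False, homogeneous_eigvals=True)
kinds = sorted(classify(a, b)[0] for a, b in zip(alphas, betas))
assert kinds == ['finite', 'finite', 'infinite']
```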
We now consider the problem of computing generalized eigenvectors of $(S, T)$. Our goal is to obtain an equivalent problem which can be solved using a blocked algorithm.
2.1 Computing a single eigenvector
In this subsection, we note that the problem of computing a single generalized eigenvector of $(S, T)$ is equivalent to solving a tall homogeneous matrix equation involving real matrices. Let the eigenvalue be represented by the pair $(\alpha, \beta)$, where $\alpha = a + ib$ and $\beta \in \mathbb{R}$. Let $M$ be given by
$$M = \begin{pmatrix} a & b \\ -b & a \end{pmatrix}$$
when $b \neq 0$ (or equivalently, when the eigenvalue is complex) and by $M = (a)$
when $b = 0$ (or equivalently, when the eigenvalue is real). With this notation, the problem of computing an eigenvector is equivalent to solving the homogeneous linear equation
$$\beta S V = T V M$$
with respect to the nonzero matrix $V$ whose columns are the real and imaginary parts of the eigenvector. This is an immediate consequence of the following lemma.
Let $(\alpha, \beta)$ represent an eigenvalue of $(S, T)$, where $\alpha = a + ib$, and let $V$ and $M$ be as above. Then $V \neq 0$ solves $\beta S V = T V M$ if and only if the complex vector formed from the columns of $V$ is a generalized eigenvector of $(S, T)$ corresponding to $(\alpha, \beta)$.
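For a real eigenvalue the equivalence can be checked directly: with the pair $(\alpha, \beta)$, an eigenvector $y$ satisfies $(\beta S - \alpha T)y = 0$. A small illustrative sketch (it uses a dense null-space computation, whereas the algorithms below exploit the triangular structure):

```python
import numpy as np
from scipy.linalg import null_space

# A 3-by-3 pencil (S, T) already in real Schur form, with T = I so the
# eigenvalues are simply the diagonal entries of S.
S = np.array([[2.0, 1.0, 0.5],
              [0.0, 3.0, 1.0],
              [0.0, 0.0, 4.0]])
T = np.eye(3)

# The middle eigenvalue, represented by the pair (alpha, beta) = (3, 1).
alpha, beta = 3.0, 1.0
y = null_space(beta * S - alpha * T)[:, 0]   # basis of the null space
assert np.allclose(beta * (S @ y), alpha * (T @ y))
```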
2.2 Computing all eigenvectors
In this subsection, we note that the problem of computing all generalized eigenvectors of $(S, T)$ is equivalent to solving a homogeneous matrix equation involving real matrices. Specifically, let $D_\beta$ be the diagonal matrix formed from the scalars $\beta_j$ and let $D_\alpha$ be the block diagonal matrix whose 1-by-1 and 2-by-2 diagonal blocks are the matrices $M_j$ associated with the eigenvalues. Then $Y$ solves the homogeneous matrix equation
$$S Y D_\beta = T Y D_\alpha$$
if and only if each block column of $Y$ solves the single-eigenvector equation of Section 2.1.
3 A blocked algorithm
It is straightforward to derive a blocked algorithm for solving the homogeneous matrix equation $S Y D_\beta = T Y D_\alpha$. Specifically, redefine $S = (S_{ij})$
to denote any partitioning of $S$ into an $N$-by-$N$ block matrix which does not split any of the 2-by-2 blocks along the diagonal of $S$. Apply the same partitioning to $T$, $Y$, $D_\alpha$, and $D_\beta$. Then $S$ and $T$ are block upper triangular, while $D_\alpha$ and $D_\beta$ are block diagonal. It is straightforward to see that the matrix equation can be solved using Algorithm 1.
We see that Algorithm 1 can be implemented using three distinct kernels:
Compute the eigenvectors corresponding to a small matrix pencil in real Schur form.
Execute linear updates of the form
$$F \leftarrow F - S_{ij} Y_j D_\beta + T_{ij} Y_j D_\alpha,$$
where $S_{ij}$ and $T_{ij}$ are small dense matrices, $D_\beta$ is diagonal and $D_\alpha$ is block diagonal with blocks of size 1-by-1 or 2-by-2.
Solution of small matrix equations of the form
$$S_{jj} Y_j D_\beta - T_{jj} Y_j D_\alpha = F,$$
where $(S_{jj}, T_{jj})$ is a small matrix pencil in real Schur form, $D_\beta$ is diagonal and $D_\alpha$ is block diagonal with diagonal blocks of dimension 1 or 2.
4 Deriving a robust blocked algorithm
The subroutines xTGEVC prevent overflow using principles originally derived by Edward Anderson and implemented in the subroutines xLATRS. These subroutines apply to triangular linear systems
$$Tx = \sigma b$$
with a single right-hand side $b$, where the scaling factor $\sigma \in (0, 1]$ is chosen to prevent overflow. Mikkelsen and Karlsson formalized the work of Anderson and derived a robust blocked algorithm for solving such systems. Mikkelsen, Schwarz and Karlsson derived a robust blocked algorithm for solving triangular linear systems
with multiple right-hand sides. Their StarPU implementation (Kiya) is significantly faster than DLATRS when numerical scaling is necessary and not significantly slower than DTRSM when numerical scaling is unnecessary.
A robust variant of Algorithm 1 can be derived using the principles applied by Mikkelsen, Schwarz and Karlsson. We can use the two real functions ProtectUpdate and ProtectDivision introduced by Mikkelsen and Karlsson. They are used to prevent scalar divisions and linear updates from exceeding the overflow threshold $\Omega$. They have the following key properties.
If $b$ and $t$ are floating-point numbers with $0 < |b| \leq \Omega$ and $0 < |t| \leq \Omega$, and
$$\xi = \operatorname{ProtectDivision}(|b|, |t|),$$
then $\xi \in (0, 1]$ and $|\xi b| \leq |t|\, \Omega$. It follows that the scaled division
$$x = (\xi b) / t$$
cannot exceed $\Omega$.
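Such a guard can be sketched as follows (a simplified variant written for this discussion, not the actual ProtectDivision of Mikkelsen and Karlsson; the factor $\Omega/2$ is a safety margin against rounding):

```python
import numpy as np

OMEGA = np.finfo(np.float64).max   # overflow threshold

# Pick xi in (0, 1] such that the scaled division (xi * b) / t cannot
# exceed OMEGA (simplified ProtectDivision-style guard).
def protect_division(b, t):
    if abs(t) >= 1.0 or abs(b) <= abs(t) * (OMEGA / 2.0):
        return 1.0                          # b / t is already safe
    return abs(t) * (OMEGA / 2.0) / abs(b)  # shrink b before dividing

b, t = 1.0e300, 1.0e-10     # the naive division b / t would overflow
xi = protect_division(b, t)
x = (xi * b) / t            # representable: |x| <= OMEGA
assert 0.0 < xi < 1.0 and np.isfinite(x)
```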
If $\xi = \operatorname{ProtectUpdate}(\|Y\|_\infty, \|T\|_\infty, \|X\|_\infty)$ is defined, with
$$\|Y\|_\infty \leq \Omega, \quad \|T\|_\infty \leq \Omega, \quad \|X\|_\infty \leq \Omega,$$
then $\xi \in (0, 1]$ and $\xi \left( \|Y\|_\infty + \|T\|_\infty \|X\|_\infty \right) \leq \Omega$. It follows that the linear update
$$Z = \xi Y - T (\xi X)$$
can be computed without any intermediate or final result exceeding $\Omega$.
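A simplified sketch of such a guard (again our own variant, not the exact routine; the comparisons are arranged so that the guard itself cannot overflow):

```python
import numpy as np

OMEGA = np.finfo(np.float64).max   # overflow threshold

# Choose xi in (0, 1] so that the scaled update xi*Y - T @ (xi*X)
# provably stays below OMEGA, using only the three norm bounds
# (simplified ProtectUpdate-style guard with a factor 1/2 margin).
def protect_update(norm_Y, norm_T, norm_X):
    safe = (OMEGA - norm_Y) / (2.0 * norm_T)   # largest safe ||xi*X||
    return 1.0 if norm_X <= safe else safe / norm_X

Y = np.full((2, 2), 1.0e307)
T = np.full((2, 2), 1.0e5)
X = np.full((2, 2), 5.0e303)   # naive T @ X overflows to infinity

xi = protect_update(np.linalg.norm(Y, np.inf),
                    np.linalg.norm(T, np.inf),
                    np.linalg.norm(X, np.inf))
Z = xi * Y - T @ (xi * X)      # overflow-free scaled update
assert 0.0 < xi < 1.0 and np.all(np.isfinite(Z))
```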
Now consider the kernels needed for implementing Algorithm 1. It is easy to see that they can be implemented using loops, divisions and linear updates. It is therefore plausible that robust variants can be implemented using ProtectDivision and ProtectUpdate. However, resolving all the relevant details requires many pages. Here we explain how the linear updates in Algorithm 1 can be executed without exceeding the overflow threshold $\Omega$. We will use the concept of an augmented matrix introduced by Mikkelsen, Schwarz and Karlsson.
Let $Y$ be partitioned into block columns
$$Y = \begin{pmatrix} Y_1 & Y_2 & \cdots & Y_q \end{pmatrix}$$
and let $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_q)$ have $\alpha_j \in (0, 1]$. The augmented matrix $\langle \alpha, Y \rangle$ represents the real matrix given by
$$\begin{pmatrix} \alpha_1^{-1} Y_1 & \alpha_2^{-1} Y_2 & \cdots & \alpha_q^{-1} Y_q \end{pmatrix}.$$
This is a trivial extension of the original definition, which only considered the case of $q = 1$. The scaling factors are used to extend the normal representational range of our floating-point numbers.
Now $\langle \alpha, Y \rangle D$ can be computed without overflow using Algorithm 2. In our case $D$ is diagonal, so we are merely scaling the columns of $Y$. Moreover, the product $T \langle \alpha, Y \rangle$ can now be computed without overflow using Algorithm 3. It follows that the linear updates in Algorithm 1 can be computed without overflow using two applications of Algorithm 2 and Algorithm 3. Now suppose that $Y$ is $m$-by-$m$. Then Algorithm 2 does $O(m^2)$ flops on $O(m^2)$ data. However, Algorithm 3 then does $O(m^3)$ flops on the same data. Therefore, the overall arithmetic intensity is $O(m)$.
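The effect of the scaling factors can be illustrated with a single augmented block column (a toy sketch: `scale_augmented` is a hypothetical helper, the scalings here are arbitrary reals whereas the paper restricts them to powers of 2, and Algorithms 2 and 3 operate on whole block matrices):

```python
import numpy as np

OMEGA = np.finfo(np.float64).max   # overflow threshold

# <alpha, x> represents the vector x / alpha. To form d * (x / alpha)
# when d * x would overflow, shrink x and absorb the shrink into alpha:
# the represented vector is unchanged, but every stored entry is safe.
def scale_augmented(alpha, x, d):
    norm = np.abs(x).max()
    if norm > 0.0 and abs(d) > (OMEGA / 2.0) / norm:
        s = (OMEGA / 2.0) / norm / abs(d)   # shrink factor
        alpha, x = s * alpha, s * x
    return alpha, d * x

alpha, x = 1.0, np.array([1.0e307, 2.0e307])
alpha2, x2 = scale_augmented(alpha, x, 100.0)   # naive 100 * x overflows

assert np.all(np.isfinite(x2)) and alpha2 < alpha
# The represented vector is scaled consistently: x2 / x == 100 * alpha2.
assert np.allclose(x2 / x, 100.0 * alpha2 / alpha)
```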
In order to execute all linear updates needed for a robust variant of Algorithm 1 we require certain norms. Specifically, we need the infinity norms of all super-diagonal blocks of $S$ and $T$. Moreover, we require the infinity norm of certain submatrices of $Y$. These submatrices consist of either a single column (the segment of a real eigenvector) or two adjacent columns (the segment of a complex eigenvector). The infinity norm must be recomputed whenever a submatrix has been initialized or updated. ProtectUpdate requires that the input arguments are bounded by $\Omega$, and failure is possible if they are not. It is therefore necessary to scale the matrices $S$ and $T$ to ensure that all blocks have infinity norm bounded by $\Omega$.
5 Zazamoukh - a task-parallel robust solver
The new StarNEig library runs on top of StarPU and can be used to solve dense non-symmetric eigenvalue problems. A robust variant of Algorithm 1 has been implemented in StarNEig. This implementation (Zazamoukh) uses augmented matrices and scaling factors which are signed 32-bit integers. Zazamoukh can compute the eigenvectors corresponding to any subset of eigenvalues which is closed under complex conjugation. Zazamoukh is currently limited to shared memory, but an extension to distributed memory is planned.
5.1 Memory layout
Given block sizes $m_b$ and $n_b$, Zazamoukh partitions $S$, $T$ and the matrix $Y$ of eigenvectors conformally by rows and columns. In the absence of any 2-by-2 blocks on the diagonal of $S$, the tiles of $S$ and $T$ are $m_b$-by-$m_b$ and the tiles of $Y$ are $m_b$-by-$n_b$. The only exceptions can be found along the right and lower boundaries of the matrices. This default configuration is adjusted minimally to prevent splitting any 2-by-2 diagonal block of $S$ or separating the real part and the imaginary part of a complex eigenvector into different block columns.
Zazamoukh relies on four types of tasks:
Pre-processing tasks which compute all quantities needed for robustness. This includes the infinity norms of all super-diagonal tiles of $S$ and $T$, as well as all norms needed for the robust solution of the small diagonal-block equations. If necessary, the matrices $S$ and $T$ are scaled minimally.
Solve tasks which use DTGEVC to compute the lower tips of the eigenvectors and a robust solver based on DLALN2 to solve the small diagonal-block equations.
Update tasks which execute the linear updates robustly.
Post-processing tasks which enforce a consistent scaling on all eigenvectors.
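The post-processing step can be sketched as follows (`make_consistent` is a hypothetical helper; in Zazamoukh the factors are powers of 2 stored as 32-bit integers, here plain floats). Each eigenvector carries its own scaling factor, and consistency is enforced by rescaling every column to the smallest factor, so every applied ratio is at most 1 and cannot overflow:

```python
import numpy as np

# Enforce a consistent scaling across eigenvector columns: rescale every
# column to the smallest scaling factor alpha_min. Each applied ratio
# alpha_min / alpha_j is <= 1, so this step cannot overflow.
def make_consistent(alphas, X):
    a_min = min(alphas)
    for j, a in enumerate(alphas):
        X[:, j] *= a_min / a
    return a_min, X

alphas = [1.0, 0.25, 0.5]
X = np.ones((3, 3))
a_min, Y = make_consistent(alphas, X)
assert a_min == 0.25
assert np.allclose(Y[0], [0.25, 1.0, 0.5])
```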
5.3 Task insertion order and priorities
Zazamoukh is closely related to Kiya which solves triangular linear systems with multiple right-hand sides. Apart from the pre-processing and post-processing tasks, the main task graph is the disjoint union of task graphs, one for each block column of the matrix of eigenvectors. Zazamoukh uses the same task insertion order and priorities as Kiya to process each of these sub-graphs.
6 Numerical experiments
In this section we give the results of a set of experiments involving tiny and small matrices. Each experiment consisted of computing all eigenvectors of the matrix pencil. The run time was measured for LAPACK (DTGEVC) and Zazamoukh. Results related to somewhat larger matrices can be found in the NLAFET Deliverable 2.7.
The experiments were executed on an Intel Xeon E5-2690v4 (“Broadwell”) node with 28 cores arranged in two NUMA islands with 14 cores each. The theoretical peak performance in double-precision arithmetic is 41.6 GFlop/s for one core and 1164.8 GFlop/s for a full node.
We used the StarNEig test-program starneig-test to generate reproducible experiments. The default parameters produce matrix pencils where approximately 1 percent of all eigenvalues are zeros, 1 percent of all eigenvalues are infinities, and there are no indefinite eigenvalues. Zazamoukh used the default tile size, which is 1.6 percent of the matrix dimension for large matrix pencils.
All experiments were executed with exclusive access to a complete node (28 cores). LAPACK was run in sequential mode, while Zazamoukh used 28 StarPU workers and 1 master thread. A summary of our results is given in Figure 1. The speedup of Zazamoukh over LAPACK is initially very modest, as there are not enough tasks to keep 28 workers busy, but it picks up rapidly, and Zazamoukh achieves a superlinear speedup over DTGEVC for the larger matrices. This is an expression of the fact that Zazamoukh uses a blocked algorithm, whereas DTGEVC computes the eigenvectors one by one.
(Figure 1 reports, for each matrix dimension, an eigenvalue analysis, the runtimes in ms, and the speedup.)
Previous work by Mikkelsen, Schwarz and Karlsson has shown that triangular linear systems can be solved in parallel without overflow using augmented matrices. In this paper we have shown that the eigenvectors of a matrix pencil can be computed in parallel without overflow using augmented matrices. Certainly, robust algorithms are slower than non-robust algorithms when numerical scaling is not needed, but robust algorithms will always return a result which can be evaluated in the context of the user’s application. To the best of our knowledge, StarNEig is the only library which contains a parallel robust solver for computing the generalized eigenvectors of a dense non-symmetric matrix pencil. The StarNEig solver (Zazamoukh) runs on top of StarPU and uses augmented matrices and scaling factors which are integer powers of 2 to prevent overflow. It achieves superlinear speedup compared with DTGEVC from LAPACK. In the immediate future we expect to pursue the following work:
Extend Zazamoukh to also compute left eigenvectors. Here the layout of the loops is different and we must use the 1-norm instead of the infinity norm when executing the overflow protection logic.
Extend Zazamoukh to distributed memory machines.
Extend Zazamoukh’s solver to use recursive blocking to reduce the run-time further. The solve tasks all lie on the critical path of the task graph.
Extend Zazamoukh to complex data types. This case is simpler than real arithmetic because there are no 2-by-2 blocks on the main diagonal of $S$.
This work is part of a project (NLAFET) that has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 671633. This work was supported by the Swedish strategic research programme eSSENCE. We thank the High Performance Computing Center North (HPC2N) at Umeå University for providing computational resources and valuable support during test and performance runs.
-  StarNEig — A task-based library for solving nonsymmetric eigenvalue problems, https://github.com/NLAFET/StarNEig
-  Anderson, E.: LAPACK Working Note No. 36: Robust Triangular Solves for Use in Condition Estimation. Tech. Rep. CS-UT-CS-91-142, University of Tennessee, Knoxville, TN, USA (August 1991)
-  Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users’ Guide. SIAM, Philadelphia, PA, 3rd edn. (1999)
-  Augonnet, C., Thibault, S., Namyst, R., Wacrenier, P.A.: StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. CCPE - Special Issue: Euro-Par 2009 23, 187–198 (Feb 2011). https://doi.org/10.1002/cpe.1631, http://hal.inria.fr/inria-00550877
-  Baudin, M.: Personal e-mail to C.C.K. Mikkelsen
-  Baudin, M., Smith, R.L.: A robust complex division in scilab. CoRR abs/1210.4539 (2012), http://arxiv.org/abs/1210.4539
-  Kjelgaard Mikkelsen, C.C., Karlsson, L.: Blocked Algorithms for Robust Solution of Triangular Linear Systems. In: Wyrzykowski, R., Dongarra, J., Deelman, E., Karczewski, K. (eds.) Parallel Processing and Applied Mathematics. pp. 68–78. Springer International Publishing, Cham (2018)
-  Kjelgaard Mikkelsen, C.C., Schwarz, A.B., Karlsson, L.: Parallel Robust Solution of Triangular Linear Systems. CCPE 0(0), e5064 (2018). https://doi.org/10.1002/cpe.5064
-  Myllykoski, M., Kjelgaard Mikkelsen, C.C.: StarNEig - A Task-based Library for Solving Nonsymmetric Eigenvalue Problems. Submitted to PPAM-2019
-  Myllykoski, M., Kjelgaard Mikkelsen, C.C., Schwarz, A., Kågström, B.: D2.7 eigenvalue solvers for nonsymmetric problems. Tech. rep., Umeå University (2019), http://www.nlafet.eu/wp-content/uploads/2019/04/D2.7-EVP-solvers-evaluation-final.pdf