# The LAPW method with eigendecomposition based on the Hari–Zimmermann generalized hyperbolic SVD

In this paper we propose an accurate, highly parallel algorithm for the generalized eigendecomposition of a matrix pair (H, S), given in a factored form (F^∗ J F, G^∗ G). Matrices H and S are generally complex and Hermitian, and S is positive definite. This type of matrices emerges from the representation of the Hamiltonian of a quantum mechanical system in terms of an overcomplete set of basis functions. This expansion is part of a class of models within the broad field of Density Functional Theory, which is considered the golden standard in condensed matter physics. The overall algorithm consists of four phases, the second and the fourth being optional, where the two last phases are computation of the generalized hyperbolic SVD of a complex matrix pair (F,G), according to a given matrix J defining the hyperbolic scalar product. If J = I, then these two phases compute the GSVD in parallel very accurately and efficiently.

There are no comments yet.

## Authors

• 5 publications
• 8 publications
• 5 publications
• 3 publications
03/14/2020

### A Kogbetliantz-type algorithm for the hyperbolic SVD

In this paper a two-sided, parallel Kogbetliantz-type algorithm for the ...
05/18/2020

### Hyperbolic Distance Matrices

Hyperbolic space is a natural setting for mining and visualizing data wi...
04/18/2021

### Fifty Three Matrix Factorizations: A systematic approach

The success of matrix factorizations such as the singular value decompos...
09/02/2019

### On the inclusion of damping terms in the hyperbolic MBO algorithm

The hyperbolic MBO is a threshold dynamic algorithm which approximates i...
06/03/2021

### Nonlinear Matrix Approximation with Radial Basis Function Components

We introduce and investigate matrix approximation by decomposition into ...
06/24/2014

### Saccadic Eye Movements and the Generalized Pareto Distribution

We describe a statistical analysis of the eye tracker measurements in a ...
12/07/2019

### Augmented hyperbolic models with diffusive-dispersive shocks

Given a first-order nonlinear hyperbolic system of conservation laws end...

## Code Repositories

### MPHZ

A multi-precision Hari-Zimmermann complex GSVD.

### FLAPWxHZ

The Hari-Zimmermann complex generalized hyperbolic SVD and EVD.

### JACSD

Some utilities for the Jacobi-type SVD algorithms and beyond, plus a Cosine-Sine Decomposition tester.

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Density Functional Theory (DFT) is the Standard Model at the base of simulations in Condensed Matter Physics. At the center of most DFT simulations lays the initialization of the Hamiltonian matrix and its diagonalization. In many DFT methods the form and size of the Hamiltonian depends on the choice of the set of basis functions used to expand the atomic orbitals. When such a basis set is not orthonormal, a positive definite and Hermitian Overlap matrix has to be computed and diagonalized simultaneously with ; this pair of matrices define a generalized Hermitian eigenproblem (or eigenpencil, in short). In a subset of all DFT methods labeled as LAPW, the entries of both and

are represented as multiple sums and products of smaller matrices with specific properties. It this paper, we show how one can exploit this peculiar representation and solve the generalized eigenvalue problem without explicitly assembling the

and matrices. Our alternative method solves the eigenpencil using a cascade of phases ending with the Hari–Zimmermann algorithm for a generalized SVD. We implemented a shared memory version of this method and demonstrate its scalability on a number of test cases extracted from concrete DFT simulations. In those cases where the matrix is ill-conditioned, our eigendecomposition has the additional benefit of providing enhanced accuracy and avoid altogether the need of the failure-prone Cholesky decomposition.

The birth of DFT is marked by two fundamental articles authored in the mid-60s by the Nobel prize winner Walter Kohn and his collaborator Lu J. Sham and Pierre Hohenberg [14, 16]. DFT provides an approach to the theory of electronic structure that is alternative to the solution of the Schrödinger equation. While in the latter the emphasis is on a many-electron wave function describing the dynamics of electrons in a multi-atomic system, in DFT the electron density distribution plays a central role. Besides providing a complementary perspective by focusing on quantities depending mainly on the three-dimensional coordinate space r, DFT has made possible the simulation of much larger systems than the conventional multi-particle wave function methods. Depending on the specific DFT method, computing complexity scales at most with the cube of the number of atoms, with ongoing progress towards bringing it down to linear scaling.

Despite being a general theory, DFT can be realized in as many flavors as are the sets of basis functions one can choose from. Two widely spread classes of basis functions build on the simplicity of plane waves to build more complex and rich sets of basis functions, namely Projected Plane Waves (PAW) [23] and Linearized Augmented Plane Waves (LAPW) [26]. The complexity of these sets lays in that they are made up of non-orthogonal basis functions. In the case of LAPW, the set of functions is also overcomplete. The consequence of non-orthogonality is that the matrix , whose entries are the scalar products among all the basis functions of any given finite size set, is usually dense. In the particular case of LAPW methods such matrix is positive definite but could have few singular values quite close to zero. This potential problem is due to the overcompleteness of the basis set and tends to worsen as the number of atoms increases since the number of basis functions grows linearly with the number of atoms.

In DFT methods the dynamics of the quantum systems is described by a Hamiltonian operator. In practice, the Hamiltonian is translated into a Hermitian matrix whose size and structure depends on the specific DFT method. This is because the matrix is the result of the projection of the Hamiltonian operator over the finite set of basis functions of the given method. In the LAPW method, the mathematical form of the functions leads to an expression for both and in terms of a sum of smaller matrices over all possible atoms,

 H =NA∑a=1(A∗aT[AA]aAa+A∗aT[AB]aBa+B∗aT[BA]aAa+B∗aT[BB]aBa) (1) S =NA∑a=1(A∗aAa+B∗aU∗aUaBa),

where , , while the remaining matrices in (1) are complex, square, and of order , with some additional properties. Matrices are real and diagonal, and are Hermitian, and for all holds

 (T[AB]a)∗=T[BA]a.

Except for , the other matrices are in general dense, and can have a range of sizes dictated by the constants , , and (see section 2 for some their typical range). Despite the formulation above could lend itself to computation through specialized middleware libraries such as the Basic Linear Algebra Subprograms (BLAS), the standard approach followed by most code developers was one based on minimizing memory footprint and FLOP count [8, 17].

Recently, an alternative method for the assembly of the matrices and was presented in [11] and further developed in [9]. In their work [11], Di Napoli et al. consolidate the underlying matrix structure of the operations and proceed to encapsulate them in terms of the level 3 kernels of the BLAS library. For instance, in order to maximize the arithmetic intensity of the computation, matrix is written as , with

 HAA=NA∑a=1A∗aT[AA]aAa,HAB+BA+BB=NA∑a=1(B∗aZa+Z∗aBa),

where

 Za=T[BA]aAa+12T[BB]aBa.

Each of the s and s matrices is then packed in memory in two consecutive 2-dimensional arrays and , respectively. In the end, the sum is computed by just two ZHER2K BLAS subroutines. A similar procedure holds for the matrix . Once assembled the algebraic dense generalized eigenproblem is solved by standard methods. A Cholesky decomposition is used to reduce the problem to standard form . In turn, the standard problem is solved by a dense direct algorithm such as MRRR [10] provided by the LAPACK library [1], or an iterative eigensolver specialized for DFT computation (e.g., the ChASE library [30]). When the assembled matrix is ill-conditioned, as it may happen for quantum systems with a large number of atoms (), the Cholesky decomposition may fail and makes it practically impossible to solve the corresponding generalized eigenproblem.

In this work, we propose an alternative numerical method for solving the eigenproblem generated by the matrix pair from (1), without forming the matrices explicitly. The core of the method is based on the generalized hyperbolic singular value decomposition (GHSVD) [2]. Not only such a method solves for the eigenproblem directly without assembling and , but also could give more accurate results when is nearly singular. This is possible since the GHSVD decomposition acts directly on the multiplying factors making up , conceivably reducing the singularity down to the square root of the condition number of . As a surplus, if , the matrix of the hyperbolic scalar product, is equal to the identity, the GHSVD reduces to the generalized SVD (GSVD), which is computed very efficiently in parallel.

The four phases of the algorithm are:

1. The problem is expressed as , , the (tall-and-skinny) matrices and are assembled, and the matrices , formed from , , , and , are simultaneously factored by the Hermitian indefinite factorization with complete pivoting, reformulating as , with .

2. Optionally, the matrix is shortened by the indefinite, -QR factorization to obtain the square factor and a new signature matrix ; also, is shortened by the QR factorization to obtain the square factor .

3. The GEVD of is obtained by computing the -GHSVD of (or -GHSVD of , if the second phase is skipped) by the implicit Hari–Zimmermann method.

4. Optionally, the GHSVD process is formally completed by explicitly computing the right generalized singular vectors

from the generalized eigenvector matrix

.

All the phases are implemented in Fortran (with some C) code and parallelized using OpenMP threading, but the non-optional ones are carefully designed to facilitate an easy conversion into a distributed-memory algorithm, should the problem sizes so require. The Herimitian indefinite factorization with complete pivoting, the indefinite QR factorization, and the complex GHSVD of the Hari–Zimmermann type constitute an efficient (due to heavy vectorization efforts) shared memory software contribution in their own right, with an intent for them to be reusable in other problems as well.

The paper is subdivided into eight sections. In section 2, we present in more detail the physics of the problem and the mathematical model leading to the expression (1) for and . Section 3 is devoted to formulating the problem in precise algebraic terms. The next four sections deal with the four phases of the algorithm, where the first three of them belong to the algorithm proper, and the fourth one completes the computation of the GHSVD and is unrelated to the underlying mathematical physics origin of the problem. Since each phase is an algorithm on its own, at the end of each section we present the numerical results and the parallelization techniques applied. The paper concludes with a discussion of the possible future work.

## 2 The H and S matrices in LAPW methods

Density Functional Theory is based on the seminal work of Hohenberg [14], and the follow up landmark paper by Kohn and Sham [16]. At the core of DFT are a set of equations, called Kohn–Sham equations, that have to be solved for each of the single particle wave function

 ^HKSψi(r)=[−ℏ22me∇2r+V[n({r})](r)]ψi(r)=ϵiψi(r),i=1,…,Ne. (2)

The peculiarity of these equations is that the Hamiltonian operator depends implicitly on all the s through the charge density function , which makes the entire set of Kohn–Sham equation strongly coupled and non-linear. In particular the function is the sum of the squares of all s up to the total number of electrons in any given quantum system

 n({r})=Ne∑i=1|ψi(r)|2. (3)

Because the equations (2) are non-linearly coupled, they can be solved only self-consistently: one starts from a reasonable guess for the charge density , computes the potential , and solves (2). The resulting functions s and values s are then used to compute a new density as in (3), which is compared to the starting one. If the two densities do not match, the self-consistent loop is repeated with a new mixed charge density. The loop stops only when the new and the old density agree up to some defined constant.

So far we have described the general setup. There are many methods that translate this setup into a computable algorithm, and this is where the various “flavors” of DFT differ. The first element of difference is in the choice of the set of functions s used to expand every one particle wave function

 ψi(r)=NG∑t=1ct,iφt(r). (4)

In the LAPW method [29, 15], the configuration space where the atomic cells are defined is divided in two disjoint areas where the wave functions have distinct symmetries: close to the atomic nuclei, solutions tend to be spherically symmetric and strongly varying, while further away from the nuclei, they can be approximated as uniformly oscillating. The qualitative structure of the solution leads to a space composed of non-overlapping spheres—called muffin tins (MT)—separated by interstitial (INT) areas. The complete set of basis functions are given by a piece-wise definition for each of the atoms and relative surrounding regions.

 φt(r)=⎧⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪⎩lmax∑l=0l∑m=−l[A(l,m),a,tul,a(r)+B(l,m),a,t˙ul,a(r)]Yl,m(^ra),ath MT1√Ωexp(ikt⋅r),% INT. (5)

In the MT spheres, each basis function depends on specialized radial functions , their derivative and the spherical harmonics ; the former only depend on the distance r from the MT center, while the latter form a complete basis on the unit sphere defined by and so depends solely on the MT spherical angles. Despite being piece-wise functions, s must be continuous and differentiable for each index and each atomic index . The coefficients are set to guarantee that for each of the values of the indices and . The variable ranges over the size of the plane wave functions set in INT, and is used to label the vector living in the space reciprocal to r. As such, the momentum characterizes the specific wave function entering in the basis set. The total size of the basis set is determined by setting a cutoff value .

When one substitutes the expansion of  (4) in (2), the Kohn–Sham equations become an algebraic generalized eigenvalue problem that needs to be solved for the -tuples of coefficients

 NG∑t=1(H)t′,tct,i=ϵiNG∑t=1(S)t′,tct,i

The complexity of the LAPW basis set is transferred to the definition of the entries of the Hamiltonian and Overlap matrices, respectively and , given by

 (H)t′,t=∑a∬φ∗t′(r)^HKSφt(r)dr,(S)t′,t=∑a∬φ∗t′(r)φt(r)dr.

By substituting explicitly the functions of equation (5) and computing the integrals, one ends up with the following expressions for and :

 (H)t′,t=∑a∑L′,L (6) +

and

 (S)t′,t=∑a∑L=(l,m)A∗L,a,t′AL,a,t+B∗L,a,t′BL,a,t∥˙ul,a∥2. (7)

The new matrices

are dense and their computation involves multiple integrals between the radial basis functions

and the non-spherical part of the potential multiplied by Gaunt coefficients***For details see [17].. As can be seen by simple inspection, equations (6)–(7) are equivalent to equations (1): while the former are written with all indices explicit, the latter have a subset of them implicit which highlights their matrix form.

We conclude with a small excursus on the structure of the self-consistent loop and its computational cost. In the first step, a starting charge density is used to compute the Kohn–Sham Hamiltonian . In a second step the set of basis functions is set up and the set of coefficients are derived. Then, the Hamiltonian and Overlap matrices are initialized, followed by the fourth step when the generalized eigenvalue problems is solved numerically to return the eigenpairs . Finally a new charge density is computed and convergence is checked before starting a new loop. Out of all the steps above, initializing and and solving the eigenproblem accounts for more than 80% of CPU time. Having cubic complexity , the eigenproblem solution is usually considered the most expensive of the two. It turns out that generating the matrices may be as expensive. If and are the range of the summations and , then a simple evaluation shows that equations (7) and (6) have complexity equal to and , respectively. A typical simulation uses approximately basis functions, with ranging from about to about , and an angular momentum , which results in . It follows that the factor is roughly of the same order of magnitude as so that the generation of and also displays cubic complexity . In practice, the constants above have values in the following order of magnitude: , , and .

## 3 Problem formulation

Our intention is to keep the matrices and from (1), given in a factored form, factored during the whole diagonalization process. The core of the algorithm (its third phase) uses a one-sided Jacobi-like method for the implicit diagonalization. More precisely, it computes a hyperbolic analog of the generalized SVD.

###### Definition 1

For the given matrices , , , , and , where is of full column rank, there exist a

(i.e., ), a unitary matrix , and a nonsingular matrix , such that

 F=UΣFX,G=VΣGX,ΣF∈Rm×n,ΣG∈Rp×n. (8)

The elements of and are zeroes, except for the diagonal entries, which are real and nonnegative. Furthermore, and satisfy

 ΣTFΣF+ΣTGΣG=I.

The ratios are called the generalized hyperbolic singular values of the pair . If the pair is real, then all matrices in (8) are real.

We choose to define the generalized hyperbolic SVD (GHSVD) only if the matrix is of full column rank. This implies , and there is no need to mention this in the definition. In the case of full column rank , the matrix is positive definite, and the matrix pair , where , is Hermitian and definite, so it can be simultaneously diagonalized by congruences (see, for example, [22]).

If the GHSVD is computed as in (8), then the generalized eigenvalues and eigenvectors of are easily retrieved, since

 H =F∗JF=X∗Σ∗FU∗JUΣFX=X∗Σ∗FJΣFX:=X∗ΛFX, S =G∗G=X∗Σ∗GV∗VΣGX=X∗Σ∗GΣGX:=X∗ΛGX.

Substituting in the expression for above, we get

 HZ=SZΛ;Z:=X−1,Λ:=Λ−1GΛF. (9)

Thus, from (9), the generalized eigenvalues of the matrix pair are the squared generalized hyperbolic singular values, with the signs taken from the corresponding diagonal elements in , i.e., , and the matrix of the generalized eigenvectors is the inverse of the matrix of the right generalized singular vectors. For theoretical purposes it can be assumed that is sorted descendingly, though for simplicity it is not the case in our implementation.

An approach that uses the SVD on a matrix factor, instead of the eigendecomposition on the multiplied factors, usually computes small eigenvalues more accurately.

The idea of our algorithm is to transform the initial problem by using the properties of matrices , , , and . In the first phase of the algorithm, matrices ,

 Ta=[T[AA]aT[AB]aT[BA]aT[BB]a]

are assembled from the already mentioned matrices and factored, to bring them in a suitable form for the GHSVD computation.

After that, we are left with two tall matrices, which have, in our test examples, between and times more rows than columns. It is widely known that the Jacobi-like SVD algorithms are more efficient if the factors are square, so in the second (optional) phase we could preprocess the factors — by the hyperbolic QR factorization (see [24]), and by the ordinary tall and skinny QR factorization (ZGEQR routine from LAPACK), into square ones.

The third phase, optionally augmented by the fourth phase, is a complex version of the implicit Hari–Zimmermann method. This method is a slight modification (at least from the mathematical point of view) of the real method presented in [21]. The complex transformations, for the two-sided method, have been derived in the PhD thesis of Vjeran Hari [12].

### 3.1 Testing environment

The testing environment consists of an Intel Xeon Phi 7210 CPU, running at 1.30 GHz with Turbo Boost turned off, in Quadrant cluster mode with 96 GiB of RAM and 16 GiB of flat-mode MCDRAM, under 64-bit CentOS Linux 7.6.1810 with the Intel Fortran and C compilers and Math Kernel Library (MKL), version 19.0.3.199. For the Intel 80-bit extended floating-point type support, required by the error-checking code, GNU Fortran 8.2.1 (and KIND=10) was used.

The software distributionAvailable in https://github.com/venovako/FLAPWxHZ repository. of all phases presented in this paper is written mostly in Fortran, with some auxiliary parts in C, while the parallelization relies on the OpenMP constructs. The phases are meant to be run in a sequence, where each phase is executed as a separate process with a number of OpenMP threads. Since the testing machine has enough memory to hold all required data, as the modern compute nodes in general do, the algorithms are implemented for the shared memory, but at least the algorithms for Phases 1, 3, and 4 can be easily transformed into distributed-memory ones, should the volume of data so require.

Apart from the maximal number of threads as described above (64), the tests were also performed with 32 threads, to assess the effects on the computational time of the larger block sizes and the availability of the whole L2 data cache (1 MiB, shared among two cores) to a thread. The algorithms do not require any particular number of threads in principle, but are not intended to be used single-threadedly.

Another hardware feature targeted is the SIMD vectorization: each core has a private L1 data cache of 32 kiB with a line size of 64 B, and equally wide (e.g., for 8 double precision floating-point numbers) vector registers upon which a subset of AVX-512 instructions is capable to operate in the SIMD fashion. The vectorization was employed both implicitly, by aligning the data to the cache line size whenever possible and instructing the compiler to vectorize the loops, and (semi-)explicitly, as will be described in the following sections. The code is parametrized by the maximal SIMD length (i.e., the number of 8 B lanes in the widest vector register type) , and it vectorizes successfully on other architectures (e.g., on AVX2, with ).

Under an assumption that the compiler-generated floating-point reductions (e.g., those of the SUM Fortran intrinsic) obey the same order of operations in each run, and due to the alignment enforced as above, the algorithms should be considered conditionally reproducible, in a sense that the multiple runs of the executables on the same data in the same environment should produce bitwise-identical results.

### 3.2 Datasets

Each dataset under test contained all matrix inputs (, , , , , and ) for a single problem instance. With 8 datasets, summarized in Table 1, we believe to have a representative coverage of the small-to-medium size problems from practice. As already mentioned in section 2, the maximum value of the momentum which appears as an index to the dataset label (e.g., AuAg_2.5) determines the size of the basis functions set . This is why datasets with same label (e.g., AuAg) but different index (e.g., 2.5 vs. 3.0) have differing values for . In the following, the datasets are referred to by their IDs.

## 4 Phase 1 – simultaneous factorizations of Ta matrices

The goal of this section is to rewrite the problem (1) in a form suitable for GHSVD computation.

### 4.1 Problem reformulation

The first step is to write (1) as

 H=NA∑a=1H∗aTaHa,S=NA∑a=1S∗aSa, (10)

where

 Ta=[T[AA]aT[AB]a(T[AB]a)∗T[BB]a],Ha=[AaBa],Sa=[AaUaBa].

Furthermore, matrices in (10) can be expressed as

 H =[H∗1⋯H∗NA]diag(T1,…,TNA)⎡⎢ ⎢⎣H1⋮HNA⎤⎥ ⎥⎦:=F∗0TF0 (11) S =[S∗1⋯S∗NA]⎡⎢ ⎢⎣S1⋮SNA⎤⎥ ⎥⎦:=˜G∗˜G.

In (11), stands for a block-diagonal matrix with the prescribed diagonal blocks , . Newly defined matrices have the following dimensions: , , , and . From now on, let , and .

To efficiently exploit the structure of the problem, the matrix needs to be diagonal, with its diagonal elements equal to either or (possibly with some zeroes in the case of singular ). There is no mathematical obstacle to apply the simultaneous (-)orthogonalization in the computation of the GHSVD on the already described matrices , , and implicitly, but the repeated multiplication (in each reduction step) by is slow. Therefore, should be either factored concurrently, by using a somewhat modified version of the Hermitian indefinite factorization of all blocks, or diagonalized concurrently: the factorizations (or diagonalizations) are completely independent of each other and can proceed in parallel. Since the diagonalization, compared to the Hermitian indefinite factorization, is a slower process, our choice is to factor all the diagonal blocks .

### 4.2 Hermitian indefinite factorization

Each is factored by the algorithm described in [27]. The algorithm for each consists of the Hermitian indefinite factorization with a suitable pivoting [3, 4, 5, 6, 7], followed by the transformation of the block-diagonal matrix. Such factorization has the following form

where is a permutation (in the LAPACK sense), is upper triangular, and is block-diagonal, with diagonal blocks of order or .

Then, is transformed into . If has a diagonal block of order at position , then stores the sign of this block in its th diagonal element, and the th row of is scaled by . In the case of a (Hermitian) pivot block of order , this block is diagonalized by a single Jacobi rotation, and the corresponding two rows of in (12) are multiplied by that rotation. After that, two transformations of the new diagonal elements of are performed, as above. To speed-up the process, the rotation and the scaling of two rows of are combined and then applied as a single transformation.

The outer permutations are generated starting from the identity, and stored as the partial permutations of the principal submatrices, as in LAPACK, according to the pivoting of choice. Since the matrices are of a relatively small order, our choice is the complete pivoting from [7].

### 4.3 Postprocessing

After the factorization, a postprocessing step is applied to obtain

 Ta=ˆM∗aˆJaˆMa,

where . Note that does not need to remain triangular.

Finally, by applying an inner permutation , can be rearranged into a diagonal matrix , where the positive signs precede the negative ones on the diagonal. This property of matrices can be exploited to speed-up computation of the hyperbolic scalar products in the subsequent phases. Let the whole factorization routine described thus far be called ZHEBPJIn the software distribution, ZHEBPJ corresponds to HIF_ZHEBPC routine, followed by HIF_JPART.. Then,

 Ta=˜M∗a˜Ja˜Ma,˜Ma=ˆPaˆMa,

and is multiplied by from the left, i.e., .

After such preprocessing, from (11) is written as

For datasets A1–A4, each has 3 positive and 239 negative signs, for a total of 324 positive and 25812 negative signs in . For datasets B1–B4, a (non-consecutive) half of matrices has 0 positive and 98 negative signs in each matrix, and the other half has 98 positive and 0 negative signs in each matrix (i.e., matrices are definite), for a total of 25088 positive and the same number of negative signs in .

### 4.4 Implementation

The computational task for an index is essentially sequential, up to a possible usage of a parallel BLAS. On the other hand, for different indices , the tasks are embarrassingly parallel. Therefore, it seems reasonable to parallelize the algorithm such that each thread is responsible for one or more indices , as indicated in the pseudocode of Algorithm 1.

Computing to assemble is in itself a loop with completely independent iterations, and could also be done in parallel, using the nested parallelism within each thread, should be large enough and should also the newly spawned threads for that loop have enough computational resources available to warrant the overhead of the thread management.

In our implementation, MCDRAM is not explicitly used. Each thread is solely responsible for allocating and accessing the memory for the data it processes. Thus, the NUMA data locality is achievable whenever each NUMA node has enough storage, what makes the algorithm viable in the heavily non-uniform memory access settings (e.g., SNC-4 mode of the Intel Xeon Phi CPUs).

In a distributed memory setting (e.g., using the MPI processes), the assembling of , , and can be done by assigning to each process a (not necessarily contiguous) subrange of the iteration range of the for-all loop from Algorithm 1, while inside the process all atoms assigned to it are processed exactly as above, within an OpenMP parallel-do loop. The matrices , , and would then end up being distributed in the chunks corresponding to the chosen subranges among the processes.

### 4.5 Testing

In Table 2 the average per-atom wall execution time of Phase 1 is shown. The results suggest that it is beneficial to have more L2 data cache available per thread, as is the case with 32 threads overall. In the breakdown of the weights (i.e., percentages of time taken) of each computational step it is confirmed that ZGEMM starts to dominate the other computational steps of Algorithm 1 as the ratio increases. It is a strong indication that even a procedure more expensive than ZHEBPJ, such as a diagonalization of , may be applied on the datasets having a square-like shape, without considerably degrading the relative performance of Phase 1.

### 4.6 An alternative way forward

After this phase has completed, one can proceed as described in the rest of the paper, should the condition numbers of (the yet unformed) matrices and be large enough to severely affect the accuracy of a direct solution of the generalized Hermitian eigenproblem with the pair .

An alternative and more time-efficient way to proceed would be to explicitly form and . For , one ZHERK call would suffice. For , a copy of should be made, and that copy’s rows should be scaled in parallel by the diagonal elements of . One ZGEMM call on and then completes the formation of . After that, an efficient solver for the generalized Hermitian eigenproblem can be employed on , such as ZHEGV or ZHEGVD from LAPACK.

#### 4.6.1 Row scaling

A cache-friendly implementation§§§See src/test/zhegvt.F90 in FLAPWxHZ repository for an example of such an approach. of the row scaling by is to iterate sequentially over the rows of a fixed column , and change the sign of each element for which , while the outer parallel-do loop iterates over all column indices . However, that implementation can be optimized further.

If has its diagonal partitioned into (regularly or irregularly sized) blocks of the same sign, then can be compactly encoded as a sequence of pairs , one for each block of negative signs, where is the first index belonging to a block , and is the block’s length. The iteration over all rows and the conditional sign changes as above can be replaced by iteration over all such blocks. For each block, iterate sequentially in the range of indices from to , and change the signs unconditionally, thus eliminating the conditional branching based on the sign of .

Such run-length-like encoding is employed in Phase 3, where it also accelerates the hyperbolic dot products in the case where the positive signs precede the negative ones on the diagonal of a sign matrix (i.e., at most one negative block exists) given by ZHEBPJ when forming the square factors for the inner Hari–Zimmermann method.

## 5 Phase 2 – optional (J,i) URV factorization

It is a well-established fact that the one-sided Jacobi-type algorithms are fastest if they work on square matrices, since the column dot-products and updates are the shortest possible.

Therefore, if we find the square factors , , and the corresponding of the matrix pair , instead of the rectangular factors , , and the corresponding , then we expect that the overhead of such a shortening will be less than the computational time saved by avoiding the rectangular factors. To this end, matrix is shortened by using the hyperbolic QR factorization (also called the JQR factorization), according to the given :

 P1˜FP2=QFF;˜Q∗F˜J˜QF=J,˜QF:=PT1QF, (13)

where is block upper triangular with diagonal blocks of order or , is the shortened signature matrix, and are the column and the row permutation matrix, respectively. Our application does not use , so it is not explicitly formed. From (13) it holds

 PT2˜F∗˜J˜FP2=F∗Q∗FP1˜JPT1QFF=F∗˜Q∗F˜J˜QFF=F∗JF.

Since the JQR requires both row and column pivoting (see [24]), matrix , with its columns prepermuted according to , the column pivoting of the JQR, will then be factored by the ordinary (tall and skinny) QR factorization (e.g., by the LAPACK routine ZGEQR). The latter QR factorization does not employ column pivoting, but in principle the row pivoting or presorting may be used:

 P3(˜GP2)=QGG;˜Q∗G˜QG=In,˜QG:=PT3QG,

where is upper triangular, and , which is not needed in our application. We also do not depend on the special forms of and later on.

From (8) it follows that the -GHSVD of and , and that of and , differ only in the column permutation of the right singular vectors, i.e., , or, from (9), the row permutation of the eigenvectors, i.e., , while , and thus , stay the same. Therefore, the square factors , , and can be used in place of , , and throughout the rest of the computation, and then the results could be easily converted back to the ones of the original problem.

If we expect to be badly conditioned, this optional phase could be skipped, or we should resort to a slower but more stable QR factorization of with the column (and maybe row) pivoting,

 P4(˜GP2)P5=QG′G′;˜Q∗G′˜QG′=In,˜QG′:=PT4QG′,

where has to be applied back to . The blocked, column-pivoted QR factorization is provided by the LAPACK routine ZGEQP3.

Then, and could be substituted for and in the rest of the computation. For (or ) thus obtained it holds (or, ), where . Such an approach has not been tested, since it was not required for our datasets, but it is not hard to be implemented should a need for it arise.

### 5.1 ˜J-dot products and norms

Since can in principle contain the positive and the negative signs in any order, and vectorization is strongly desired, a -dot product of two vectors, , is computed as piecewise sums , ,

 Re(Σj) =Re(Σj)+˜Ji(Re(fi)Re(gi)+Im(fi)Im(gi)), Im(Σj) =Im(Σj)+˜Ji(Re(fi)Im(gi)−Im(fi)Re(gi)),

where starts with a value of and increments in steps of up to . The real and the imaginary component of the resulting are obtained by SUM-reducing and , respectively. Similarly, is computed by SUM-reducing , where

 Σ′j=Σ′j+˜Ji(Re(fi)2+Im(fi)2).

The “square” of the -norm of a vector thus obtained can be positive or negative, with a possibility of cancellations inadvertently occurring in the summations. It remains an open question how to compute the -norms both efficiently and accurately, though one possible speed improvement might be to encode as described in subsection 4.6.1 and simplify the above three piecewise summations accordingly.

### 5.2 Pivoting

To achieve the maximal numerical stability, the JQR factorization is usually performed with the complete pivoting. In the first step the pivot column(s) are chosen from the –Grammian matrix , and later on, in the th step, from the –Grammian matrix , where the is a part of the matrix yet to be reduced, and is the matrix of signs that corresponds to the unreduced matrix (see Fig. 1). The complete pivoting in the first step needs formation of the whole , i.e., floating-point operations. Such an approach, consistently implemented throughout the algorithm, leads to operations solely for the choice of pivots. Therefore, we relaxed the pivoting strategy to the diagonal pivoting supplemented with the partial pivoting [5, Algorithm C].

#### 5.2.1 Diagonal pivoting

First, squares of the -norms , where is the th column of and , are computed in a parallel-do loop over and stored in a work array. Since all columns have the length of , each parallel loop iteration executes (sequentially) in approximately the same time and the work is therefore well balanced among the threads.

Let be the smallest index such that for all . If , the th and the th column of (and thus also the first and the th column of ) are swapped. If , the column pivoting is completed.

#### 5.2.2 Partial pivoting

Otherwise, -dot products , where is the th column of and , are computed in a parallel-do loop over and stored in a complex workspace, while their magnitudes are placed in a real workspace. Same as above, this work is well balanced among the threads.

Let be the smallest index such that for all . As in [5], if

 |h[k]11|≥α|h[k]1i|,

where , then the column pivoting for the step is completed.

Otherwise, -dot products , where is the th column of and , are computed in a parallel-do loop over and their magnitudes are stored in a real workspace. This work is only slightly imbalanced among threads, since for a thread assigned to the iteration sets as a way of recording that such value (and its index) should be skipped when searching for a maximumUnless all other values are also 0..

Let be the smallest index such that for all . As in [5], if

 |h[k]11||h[k]ij|≥α|h[k]1i|2,

then the column pivoting for the step is completed; else, if

 |h[k]ii|≥α|h[k]ij|,

then the th and the th column of (and thus also the first and the th column of ) are swapped and the column pivoting for the step is completed.

Otherwise, a pivot is chosen, by taking the first column of and swapping the th and th column of (and thus also the second and the th column of ), if ; else, the second pivot column is already in place.

The pivot column(s) have thus been brought to the front of the matrix by at most two column swaps. The ensuing row pivoting is explained further below.

### 5.3 Hyperbolic Householder reflectors

If a single pivot is chosen, the first column of is reduced by a single hyperbolic Householder reflector [25, Theorem 4.4] to a vector , , and is the first vector of the canonical base.

For the sake of completeness, a modified version of Theorem 4.4 (for the hyperbolic scalar product and a simple shape of ) is included here.

###### Theorem 1

Let be a hyperbolic scalar product matrix of order . Let be two distinct vectors. There exists a basic reflector ,

 H(w)=I−2w(w∗˜Jkw)+w∗˜Jk, (14)

such that if and only if

1. and satisfy the -isometry property

 ~f∗1˜Jk~f1=f∗1˜Jkf1, (15)

and -symmetry property

 ~f∗1˜Jkf1=f∗1˜Jk~f1