Matrix decomposition techniques Stewart (2000)
, and specifically eigenvalue decompositionsGolub and van der Vorst (2000)
, are widely used in numerical linear algebra, scientific computing, machine learning, quantum computing and other scientific fields.
In general, given no assumptions on the structure of an eigenspace, the eigenvector matrix of a given linear operator of sizeexhibits no advantageous numerical properties and therefore they require
operations when performing matrix-vector multiplications. In this paper, we want to perform an approximate eigenvalue decomposition so that we accurately represent the original eigenspace with a new one that also exhibits favorable numerical complexity, for example, it requires onlyoperations when performing matrix-vector multiplication with a generic vector. In such cases, a trade-off between the accuracy of the approximation and its numerical complexity exists. Several previous works, such as Lee et al. (2008) and Kondor et al. (2014), have already introduced these ideas in the machine learning community with considerable success. Such approximations are particularly useful in situations where, once computed, the eigenspace is repeatedly used in matrix-vector calculations in downstream applications.
Eigendecomposition algorithms developed in the matrix computations literature are different for symmetric and unsymmetric matrices. In the symmetric case, the eigenspace is always full (non-defective), real-valued and furthermore, orthonormal Golub and Van Loan (1996)[Chapter 8]. We approximate these eigenspaces by using extended Givens transformations (which are themselves orthonormal and include as a particular case the well-known Givens, sometimes also called Jacobi, rotations Givens (1958)
). In this case, given the spectrum or an estimation of it, we can provide a locally optimal iterative algorithm similar to Jacobi diagonalization for symmetric matricesJacobi (1846). The general, unsymmetric, case Golub and Van Loan (1996)[Chapter 7] is much more challenging as the given matrix might not even be diagonalizable and furthermore, even when it is, the factorization has to be done over the complex-valued field in general. As the eigenvector matrix is generally unstructured, in this case, we rely on a given number of scaling and shear transformations to approximate it Rusu (2018). We formulate optimization problems for each of these basic components and show how to locally, optimally solve them with an iterative algorithm and closed-form solutions.
For both proposed algorithms, we show experimental results on the approximation of random unstructured symmetric matrices and then show an application to the construction of fast graph Fourier transforms on synthetic and real-world directed and undirected graphs.
2 Prior approaches
The literature has always distinguished between eigendecompositions of symmetric and unsymmetric matrices and we will do the same.
In the symmetric case, the diagonalization is done with orthonormal matrices, which are well understood in terms of their decomposition with Givens rotations or Householder reflectors (the QR algorithm, see Chapters 5.1 and 5.2 of Golub and Van Loan (1996)). The starting point for some approaches in the literature is the Jacobi diagonalization process for symmetric matrices Jacobi (1846) which is an iterative procedure that uses Givens rotations to bring the symmetric matrix to a strongly diagonally dominant one. A truncated Jacobi procedure is used in Le Magoarou et al. (2018) to compute fast graph Fourier transforms for undirected graphs (as undirected implies symmetry in the adjacency and Laplacian matrices). Other methods deal directly with the orthonormal eigenspace. For example, Treelets Lee et al. (2008) and multiresolution Kondor et al. (2014) structures use Givens rotations in a structured way to decompose the orthonormal components into hierarchies or multiple different scales, respectively. Another approach is to exploit manifold optimization techniques to find approximate factorizations of orthonormal matrices with few Givens rotations either by greedy coordinate descent Shalit and Chechik (2014) or by –style optimization Frerix and Bruna (2019)
. Recently, an approach that combines rotations and reflections was proposed with an application to fast principal component analysis (PCA) projectionsRusu and Rosasco (2019). While this latter work needs to precompute the orthonormal eigenspace, in this paper we show how to perform the same factorization given the dataset.
In the unsymmetric case, we rely on sparse structured components. For example, the incomplete LU Meijerink and Vorst (1977), the randomized LU Shabat et al. (2018) factorizations, the additive low-rank plus multiresolution decomposition Mudrakarta et al. (2019), and approximate Gaussian elimination Kyng and Sachdeva (2016) all rely on structured sparse matrices to construct efficient approximations of a given unstructured matrix.
In this paper, we use structured matrices to construct numerically efficient approximations of eigenspaces. We describe the fundamental building blocks of our factorizations and provide exact optimization problems with closed-form solutions to optimally, locally update these blocks efficiently.
3 Problem setup and formulation
3.1 The symmetric case
Given a symmetric matrix the main result that we use is its eigenvalue factorization as
where we assume w.l.o.g. that the entries of , which are the real-valued eigenvalues of , are in descending algebraic order and is the real-valued orthonormal eigenspace. Based on (1), we consider the problem
where a set of orthonormal matrices such that matrix-vector multiplication with any matrix from this set is , instead of the classic . Let us now consider a particular set . Based on all the orthonormal matrices
where the non-zero entries located at rows/columns and , denoted as “*”, are the two possible options in (3). The matrices in (3) are basic building blocks of the orthonormal group because every orthonormal matrix can be diagonalized (and the diagonal entries are ) using Givens rotations (the matrix (4) with the first, unsymmetric, component in (3
)) by the QR decomposition of, see Chapter 5.2.5 of Golub and Van Loan (1996). Then, in this paper, any has the following structure
where all matrices are G-transforms (4). The number is given and fixed. With this structure, matrix-vector multiplication takes operations while storing takes approximately bits, where is the number of bits required for a double precision floating-point representation. Similar structures to (5) have been previously proposed by Lee et al. (2008), Kondor et al. (2014), and Frerix and Bruna (2019), but they all consider only Givens rotations, and no reflectors.
3.2 The unsymmetric case
Given a general diagonalizable the main result that we use is its eigenvalue factorization as
where contains the complex-valued eigenvalues and has the complex-valued eigenvectors. Based on (6), we consider the problem
where a set of general matrices such that matrix-vector multiplication with any matrix from this set or its inverse is , instead of . Let us now consider a particular set . Based on scaling and shear transformations
and, similarly to (4), we define the T-transform:
where the non-zero entries are the three possible options in (8). For the shear transformations we necessarily have while for the the scaling transformations we abuse notation and impose , i.e., in (9
) is the identity matrix except for thediagonal element that is . The matrices in (9) are building blocks of every diagonalizable because, by Gaussian elimination (see Chapter 3.2.1 of Golub and Van Loan (1996)), shear transformations (9) diagonalize and then scaling transformations (9) exactly represent the resulting diagonal. We choose these three specific matrix as the optimization in (7) takes place over the variable and its inverse and these matrices have trivial inverses. Then, in this paper, any has the following structure
where all matrices are T-transforms (9). Their number is given and fixed. We assume that the factorization contains scalings and shears (). With this structure, matrix-vector multiplication takes operations while storing takes approximately bits, where is the number of bits required for a double precision floating-point representation.
There are two important differences when compared to G-transforms. Neither the scaling nor the shears are orthogonal but they are more efficient: two computations and one degree of freedom per transform (as compared to the G-transform where we have 6 operations and one degree of freedom). Therefore, for the same computational cost, we expect T-transforms to provide more accurate approximations. The two types of transforms are connected since anyorthonormal transformation can be written as a product of three shears and scalings by the lifting scheme Daubechies and Sweldens (1998).
4 Proposed factorizations and algorithms
In this section we propose approximate solutions to the optimization problems (1) and (6). Therefore, we distinguish between the symmetric and unsymmetric cases. Furthermore, we analyze separately the initialization and iterative procedures that improve the approximation for each of the two problems. Both (2) and (6) echo the fast circulant matrix-vector multiplication which is possible because every circulant matrix of size has a factorization as , Chapter 4.8 of Golub and Van Loan (1996), where is the Fourier matrix and , . The idea is to replace the Fourier matrix with a new learned matrix ( as (5) or as (10)), with similar computational properties to the Fourier Cooley and Tukey (1965).
4.1 Approximation of symmetric matrices
Based on the eigenvalue decomposition, the idea is to approximate in (1) with a fast approximation (5). The approximation of would therefore be . Multiplications with can be viewed as a sequence of fast multiplications by , and finally . Matrix-vector multiplication with a diagonal matrix is fast, operations, so therefore our goal is to construct such that it also has advantageous numerical properties (for example, computations take less than operations). Therefore, based on (5), we propose an approximation as
where are the estimated eigenvalues of . These can be the actual eigenvalues of if they are known, or else they can be randomly initialized or set to the diagonal elements of – in the latter case, we should ensure entries of are distinct for reasons that will be clear later in this section. To find the best approximation of we can compute the best approximation to the spectrum by the following lemma.
Let and orthogonal be fixed, then the of the expression is given by
The complexity of computing is .
Let us now move to approximate the orthogonal eigenspace of . Given an approximation as (11) where all G-transforms were initialized, we now study the problem of initializing such that we minimize
where we have defined the symmetric matrix
(Optimal initialization of each G-transform) Let , be fixed and all components for while for , then the optimal component has its non-trivial values given by for the optimal coordinates
where we have denoted the quantity
and the symmetric .
Theorem 1 provides an efficient way to find the optimal G-transform that minimizes (13): both the indices and the transform values. Starting from we can continue down to and initialize in this fashion all G-transforms in (5). Also, notice that the unified approach we propose (allowing for both the rotation and the reflection) simplifies the results, i.e., we are not looking to optimize an angle of rotation but we get the optimal local solution by an eigenvalue decomposition (the solution to a two-sided Procrustes problem). The computational cost of (15) is dominated by the sweep of the indices ( operations).
(Connection to the Jacobi method) The Jacobi method, see Chapter 8.4 of Golub and Van Loan (1996), used to diagonalize symmetric matrices, only uses Givens rotations and selects indices that correspond to the off-diagonal element of with the highest magnitude, i.e., we have (15) with . The Jacobi algorithm is not concerned with the number of Givens rotations used to diagonalize and indeed, in general, more than rotations are used (see Chapter 8.4.3 of Golub and Van Loan (1996) for a detailed discussion on how the number of rotations relates to the converge of the method). Furthermore, the Jacobi method uses only Givens rotations (while we have a richer structure for given in (3)) and it does not explicitly have an objective function as (2), i.e., there is no reference matrix to be reconstructed in the sense of minimizing a Frobenius norm (the Jacobi method minimizes the squared sum of the off-diagonal entries of the approximation). Also, the Jacobi method does not need an estimate of the eigenvalues . If we now ignore the eigenvalue information, we can consider in (15). When we have that , just as in the Jacobi method, while when we have . Therefore, the calculated score approximates the Jacobi approach when off-diagonal elements are large but the selection criterium for the indices is different as the iterative process makes progress and the working matrix becomes diagonally dominant. Finally, we note that whenever which agrees with theoretical convergence results on the Jacobi method that hold when assuming distinct eigenvalues Henrici (1958).
The proposed approach is also significantly different from other previous approaches. As opposed to the approach in Kondor et al. (2014), by maximizing (15) we find the indices of the optimal transform without actually explicitly having to compute it. This saves up computational time (the are easy to compute: only 10 operations) and also leads to better approximation (as we also consider the reflector simultaneously with the Givens rotation in (3)). The approach in Frerix and Bruna (2019) uses again only Givens rotations to perform coordinate descent on the orthonormal manifold using a particular basis for the tangent space such that the exponential map is a Givens rotation. Noting that has the same structure as , i.e., the reflection can be seen as a coordinate swap followed by a rotation, we can interpret our approach as a simultaneous dual tangent space descent. Unfortunately, efforts to integrate this view with manifold optimization, in general, seem difficult at this stage as many difficulties arise: for example, the logarithmic map of the reflector is complex-valued and therefore it is not clear how to choose a basis for the tangent space corresponding to the reflector. As rotations and reflections are disconnected components the unified approach used in this paper may work only for the objective function and constraints we consider (due to the existence of the Procrustes solutions based on eigendecompositions).
Given an approximation as (11) where all G-transforms were initialized, we now study the problem of improving each individual iteratively. We want to optimize each G-transform sequentially such that we minimize
where, due to the invariance of the Frobenius norm to multiplications by and its transpose, we have defined the symmetric matrices
Notice that Theorem 1 covered only the case when is a diagonal matrix. Next we state a general result that holds for the minimization of (17) and any and .
(Optimal update of each G-transform) Let and be any symmetric matrices, then the minimizer of the quantity in (17) has its non-trivial values given by
where for the optimal coordinates
where . The new index runs through the two options (rotation and reflector) in (3). The matrices of size , the vectors of length and of length , and the matrices and all of size depend only on the entries in , and are given explicitly in the supplementary materials.
Unlike with the initialization procedure, Theorem 2 shows that considering both the rotation and the reflector simultaneously does not lead to a unified optimization problem. Indeed, the index runs through both transformations from (3). Still, solving the second problem, for , brings an extra computational load that is negligible as it shares most calculations with the first problem, for .
The iterative process is computationally expensive as it covers all unique pairs of indices while the calculation of (20) is itself non-trivial and requires operations (substantially more expensive than the initialization (16)). If the running time is an important constraint, we can run the iterative process just as a “polishing step”: keep the indices of the G-transforms fixed all the time and update only the values of the transformations .
4.2 Approximation of unsymmetric matrices
Similarly to the symmetric case, based now on the eigenvalue decomposition (6), the idea is to approximate with a numerically efficient approximation . The approximation of would therefore be . Multiplications with can be viewed as a sequence of fast multiplications by , and finally . Again, matrix-vector multiplication with a diagonal is fast and therefore the computational burden depends on the numerical properties of and its inverse. By using scaling and shear transformations (8) in the direct transformation the numerical properties transfer also to its inverse since inverses of scalings and shears are themselves scalings and shears, respectively. Therefore, based on (10), we propose an approximation as
where are the estimated eigenvalues of , which we constraint to be real-valued. Just like in the symmetric case, there are several ways to set : randomly, the diagonal values of or the true eigenvalues, if they are known. To find the best approximation of we can compute the best approximation to the spectrum by the following lemma.
Let and be fixed, then the of the expression is given by
where is the Khatri-Rao product.
The computational complexity of getting is , we solve a least squares problem of size where the columns of are Kronecker products. Alternatively, we can approximately solve the problem by some iterative method which exploits the Kronecker product structure to efficiently perform matrix-vector multiplications.
Let us now move to approximate the eigenspace of . Given an approximation as (22) where all T-transforms were initialized, we now study the problem of initializing such that we minimize
where we have defined the symmetric matrix
(Optimal initialization of each T-transform) Let and be fixed, let all components for while for , then the optimal component that minimizes the quantity in (24) is given by
where the quantities , for an index that runs through all three options in (8), are rational functions in and are given explicitly in the supplementary materials.
Given an approximation as (22) where all T-transforms were initialized, we now study the problem of improving each individual iteratively. Therefore, we want to optimize each T-transform sequentially such that we minimize
where we have defined the matrix
(Optimal update of each T-transform) Let and be any matrices, then the optimal component that minimizes the quantity in (27) is given by
where the quantities , for an index that runs through all three options in (8), are rational functions in and are given explicitly in the supplementary materials.
Although the initialization and iterative steps look very similar, (26) and (29), they are significantly different from a computational perspective. While a particular is computed in constant time , for each of the s we have quadratic complexity .
Similarly to the symmetric case, we now have a locally optimal way of choosing and updating our T-transforms. Also, the update step is again significantly more computationally expensive than the initialization and more are more expensive than the results for the symmetric case. The basic difficulty stems from the fact that we are dealing now with building blocks that are not orthogonal and therefore are not invariant in the Frobenius norm. The exact initialization of T-transforms takes while the update takes operations. Due to this high computational cost, simplification can be brought to the algorithm: for example, in the update steps we no longer search over every index pair but we keep these indices fixed and just calculate the locally optimal coefficient of the transformation . We call this step a polishing step and it reduces the computational complexity of updating the T-transforms to .
We now make two remarks regarding the proposed decomposition for general matrices.
(Using T-transforms for the symmetric case) The ideas of this section can also be applied to the symmetric case. Consider an approximation analogous to (11) based on T-transforms as
If the factorization in (11) is based on the eigendecomposition of symmetric matrices, this one is similar to a Cholesky factorization. With this structure we reach minimization problems similar to (27) that have similar solutions to (67) following the same development as in (63), for brevity we omit these formulas. The obvious disadvantage of this approach is that we do not explicitly preserve the eigen-information of the original matrix: we lose control over the spectrum of the approximation and the T-transform do not approximate explicitly the eigenspace. Still, there are some advantages: i) T-transforms (2 operations per degree of freedom) are more efficient than G-transforms (6 operations per degree of freedom) and therefore we expect to get better approximation accuracy for the same numerical complexity or vice-versa, lower computational complexity for the same representation accuracy; and ii) direct and inverse matrix-vector multiplications with are still efficient, e.g., take at most operations instead of with .
An alternative construction is to keep the eigendecomposition but use real-valued approximate eigenvalues and use T-transforms to approximate the orthonormal eigenspace
We lose the orthogonality of the eigenspace but we expect this representation to be more accurate just because, as already discussed, T-transforms have better numerical properties as compared to G-transforms. As every matrix in (3) can be written as four transformations from (8) (two shears and two scalings by Daubechies and Sweldens (1998)), we can use the approximation of from (11) as an initialization to the factorization in (31) with .
(An approximate Schur decomposition for ) Notice that in (2), instead of the diagonal containing we can use for example an upper (or lower) triangular matrix which is also sparse, say off-diagonal elements. The computational complexity of using such an approximation (directly or inversely) would still be while we expect the approximation accuracy to be better, lower overall due to the extra degrees in freedom in the new triangular factor. This factorization would be similar to the Schur decomposition , where is upper triangular and is orthonormal (but all real-valued). This is in contrast with the eigenvalues decomposition of which is done over the complex values, in general.
4.3 The proposed algorithm00footnotetext: https://epfl-lts2.github.io/gspbox-html
Given that we now have optimal ways to initialize (Theorems 1 and 3) and update G and T transforms (Theorems 2 and 4), we are ready the describe the full proposed procedure, in Algorithm 1. For brevity, we describe a single procedure for both the symmetric and general matrices. Whenever we allow for spectrum updates of or we explicitly mention so.
As previously discussed, every step of the algorithm is locally optimal and can only decrease the objective functions we consider. Thus, convergence to a stationary point is guaranteed. Compared to previous methods, there are two characteristics that we would like to highlight at this point: i) in the symmetric case, we use simultaneously both rotations and reflections to construct our approximation; and ii) for each sub-problem we define we can provide a closed-form solution based either on singular and eigenvalue decompositions or least squares. Also, because the proposed method relies on the calculation of scores that span indices and it is naturally parallelizable and is amenable to randomized linear algebra techniques.
Proofs of the lemmas and theorems from this section are collected in the supplementary materials.
5 Experimental results
In this section, we describe numerical experimental results with the proposed algorithms and compare them to previous work. Following previous methods Le Magoarou et al. (2018); Frerix and Bruna (2019), we measure the quality of the approximation using the Frobenius norm objective functions of the optimization problems we consider. Source code for Algorithm 1 is available online111https://github.com/cristian-rusu-research/FAST-EIGENSPACE-APPROXIMATIONS. In all the experiments we use ‘update’ for the spectrum estimation and only polishing in the iterative part of the algorithm (keep indices found in the initialization fixed and optimize only the values of the transforms - monotonicity in the objective function is still preserved).
We show an application to the calculation of the fast graph Fourier transforms. Given a graph with vertices we compute its Laplacian where is the diagonal degree matrix and is the adjacency matrix, i.e., if a directed edge exists between nodes and and otherwise. Given the eigenvalue decomposition of the Laplacian, we call the graph Fourier transformation of the graph. Our goal is now to build approximations of which enjoy the same numerical complexity as the classic time-domain Fourier transformation, i.e., . We distinguish between undirected and directed graphs. With undirected graphs, the adjacency matrix is symmetric and therefore the Laplacian is symmetric positive semidefinite allowing for an eigendecomposition with an orthonormal basis as . In this case, we use G-transforms to approximate the eigenspace. For directed graphs, the Laplacian does not have an orthonormal structure and therefore we use the more general T-transforms in the factorization. In Figure 1 we show approximation results for different types of graphs of different sizes (number of vertices ).
In Figure 4 we compared the proposed method against the previous state of the art for the computation of fast graph Fourier transforms. The results are shown for four graphs: Minnesota graph from Defferrard et al. (2015) with and 3304 edges, HumanProtein graph from Rual et al. (2005) with and 6726 edges, Email graph from Guimerà et al. (2004) with and 5451 edges and the Facebook graph from Leskovec and Mcauley (2012) with and 2981 edges. Our proposed method performs best in all these situations.
In this paper, we assume we do not have information about the eigenspace of , but have access to itself. Previous work Rusu and Rosasco (2019); Frerix and Bruna (2019) considered the possibility of performing first the eigendecomposition and then working directly with the eigenspace . For the same graphs, we show in Figure 4 the evolution of the accuracy of the overall Laplacian as a function of the number of basic transformations in their factorization. We emphasize again that this metric is different from the one used in Figure 4 on the accuracy of the eigenspace .
In Figure 4 we analyze the approximation accuracy of estimating the Laplacian for random undirected Erdos-Renyi graphs with and compare it against the approach in Rusu and Rosasco (2019) which needs directly the eigenspace . We show that our proposed method, especially with the updated spectrum, is the most appropriate to build an accurate approximation of .
The supplementary materials have further numerical experiments, comparisons, and measurements of the running time of the fast transformations (not just number of operations).
In this paper, we have described algorithms for the efficient approximate computation of eigenspaces which can be efficiently manipulated. We show an application to the computation of the fast graph Fourier transform (both for directed and undirected graphs) and compare against previous approaches from the literature, which we outperform. An open problem, that we cannot address at this time, is the setup of an appropriate theoretical framework where the proposed factorizations and algorithms can be analyzed and their limitations understood.
- An algorithm for the machine calculation of complex Fourier series. Math. Comp. 19, pp. 297–301. Cited by: §4.
- Factoring wavelet transforms into lifting steps. J. Fourier Anal. App. 4 (3), pp. 247–269. Cited by: §3.2, Remark 2.
- PyGSP: Graph Signal Processing in Python. External Links: Cited by: §5.
- Approximating orthogonal matrices with effective Givens factorization. In Proceedings 36th International Conference on Machine Learning (ICML), Cited by: §2, §3.1, Figure 4, §4.1, §5, §5.
- A constrained eigenvalue problem. Linear Algebra Appl. 114-115, pp. 815–839. Cited by: Proof of Theorem 2.
- Computation of plain unitary rotations transforming a general matrix to triangular form. Journal of the Society for Industrial and Applied Mathematics 6 (1), pp. 26–50. Cited by: §1.
- Eigenvalue computation in the 20th century. Journal of Computational and Applied Mathematics 123 (1-2), pp. 35–65. Cited by: §1.
- Matrix computations. Johns Hopkins University Press. Cited by: §1, §2, §3.1, §3.2, §4, Proof of Theorem 2, Remark 1.
- Self-similar community structure in a network of human interactions. Physical review. E, Statistical, nonlinear, and soft matter physics 68, pp. 065103. Cited by: §5.
- Cambridge University Press. Cited by: Proof of Theorem 1.
- On the speed of convergence of cyclic and quasicyclic Jacobi methods for computing eigenvalues of Hermitian matrices. Journal of the Society for Industrial and Applied Mathematics 6 (2), pp. 144–162. Cited by: Remark 1.
- Uber ein leichtes Verfahren die in der Theorie der Sacularstorungen vorkommenden Gleichungen numerisch aufzulosen. Journal fur die reine und angewandte Mathematik 30, pp. 51–94. Cited by: §1, §2.
- Multiresolution matrix factorization. In Proceedings 31st International Conference on Machine Learning (ICML), pp. II–1620–II–1628. Cited by: §1, §2, §3.1, Figure 4, §4.1.
- Approximate Gaussian elimination for Laplacians - fast, sparse, and simple. In IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), Vol. , pp. 573–582. Cited by: §2.
- Approximate fast graph Fourier transforms via multi-layer sparse approximations. IEEE Transactions on Signal and Information Processing over Networks 4 (2), pp. 407–420. Cited by: §2, Figure 4, §5.
- Treelets - an adaptive multi-scale basis for sparse unordered data. Annals of Applied Statistics 2 (2), pp. 435–471. Cited by: §1, §2, §3.1.
- Learning to discover social circles in ego networks. In Advances in Neural Information Processing Systems 25, pp. 539–547. Cited by: §5.
- An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix. Math. Comp. 31, pp. 148–162. Cited by: §2.
- Asymmetric multiresolution matrix factorization. arXiv 1910.05132. Cited by: §2.
- Towards a proteome-scale map of the human protein-protein interaction network. Nature 437, pp. 1173–8. Cited by: §5.
- Fast approximation of orthogonal matrices and application to PCA. arXiv 1907.08697. Cited by: §2, §3.1, Figure 4, §5, §5.
- Learning fast sparsifying transforms. IEEE Trans. Sig. Proc. 65 (16), pp. 4367–4378. Cited by: §3.1.
- Learning multiplication-free linear transformations. arXiv 1812.03412. Cited by: §1.
- On two-sided orthogonal Procrustes problems. Psychometrika 33 (1), pp. 19–33. Cited by: Proof of Theorem 1, Proof of Theorem 2.
- Randomized LU decomposition. Applied and Computational Harmonic Analysis 44 (2), pp. 246 – 272. Cited by: §2.
- Coordinate-descent for learning orthogonal matrices through Givens rotations. In Proceedings 31st International Conference on Machine Learning (ICML), pp. I–548–I–556. Cited by: §2.
- The decompositional approach to matrix computation. Computing in Science Engineering 2 (1), pp. 50–59. Cited by: §1.
Eigenvalues of symmetric matrices
Given the symmetric matrix its two eigenvalues are given by
Proof of Lemma 1
The result follows directly by using the invariance of the Frobenius norm under orthogonal transformations:
Then, because the Frobenius norm is entry-wise we have the minimizer .
Proof of Theorem 1
Given that we have initial values for all components for (while for ) and we now want to also initialize the component such that we minimize objective function of (2). That expression can be written as
where we have defined the cost
For convenience, we defined the symmetric matrices
and . In the development of (34) we have used the trace definition of the Frobenius norm , the fact that the Frobenius norm is invariant to orthonormal transformation (in particular G-transformation) and that operates only on rows and columns and .
The problem of maximizing the quantity in (35) is known as the two-side orthonormal Procrustes problem Schonemann  whose solution in our case is given by
We will use that . We assume that in the eigenvalue decomposition of where the diagonal matrix contains the eigenvalues in algebraic descending order. The same ordering is also assumed in . Therefore, by the rearrangement inequality, see Section 10.2, Theorem 368 of Hardy et al. , and with the trace quantity is maximized and it reduces to