1 Introduction
A wide variety of naturally occurring data, represented as large matrices, often has complex hierarchical structure. This is the case, for instance, for data originating from social networks, complex biological interactions, and circuit simulations. With the increasing size of such matrices, there is a need to represent and compress them efficiently. Towards this end, a number of methods have been proposed (mahoney2009cur; coifman2006diffusion; hackbusch1999sparse; halko2009). All such methods make an implicit or explicit assumption about the nature of such matrices, which serves as a prior to facilitate efficient factorization and compression. Frequently, the assumption in question is that the matrices are of low rank. For example, Principal Component Analysis (PCA) decomposes a symmetric matrix A into the form A = U D U^T, where U is an orthogonal matrix and D is diagonal. Notably, such methods are single-level algorithms. Moreover, the columns of U are usually dense, which does not adequately capture strong locality properties that might be exhibited, for instance, in real-world graphs.
Taking a different tack, Multiresolution Matrix Factorization (MMF) (kondor2014multiresolution) framed matrix factorization as a type of multiresolution analysis (MRA) over a discrete metric space whose metric structure is represented by a self-adjoint operator. Thus, in contrast to low-rank approaches, MMF factorizes a symmetric matrix A in the following multilevel manner:
A ≈ Q_1^T Q_2^T ⋯ Q_L^T H Q_L ⋯ Q_2 Q_1,
where each of the carefully selected matrices Q_ℓ is orthogonal and extremely sparse, and the matrix H is close to diagonal. MMF is able to discover soft hierarchical arrangements of localized structures in the data represented by the matrix. It is amenable to parallelization and fast implementations (kondor2015parallel). The wavelet bases uncovered by MMF also provide a natural sparse approximation for functions defined over the space. For instance, if the matrix A is the normalized graph Laplacian, then the factorization provides us with wavelet bases for approximating functions on graphs.
While MMF has been shown to be useful in various practical settings (teneva2016multiresolution; ding2017multiresolution; ithapu2017decoding), one notable drawback of the method is that it is wholly defined using a self-adjoint operator such as the graph Laplacian. Therefore, MMF can only handle a multilevel factorization of symmetric matrices. However, many naturally occurring data are represented more concisely as a nonsymmetric matrix; the underlying space might only have a quasi-metric structure. This is the case, for instance, when we are working with directed graphs. For data living on such spaces, hierarchical or multiscale structure is just as important, albeit more complicated, and thus there is much to be gained by defining appropriate sparse wavelet bases for function approximation.
Defining such wavelet bases, in particular for functions on directed graphs, is an active area of research (sevi2018harmonic; satoshi2019; mhaskar2018). However, there is currently no consensus on the correct approach for doing so. Further, existing methods are difficult to implement in practice for downstream applications. In this paper, we do not attempt to directly extend multiresolution analysis (MRA) in the spirit of MMF to such spaces, but rather provide practical methods to leverage intuitions used in MMF to make it operable on general square matrices. We show experimentally that our methods can perform very well in various real-world matrix compression tasks.
To conclude this section, our contributions in this paper are threefold: 1) we provide two practical ways to adapt MMF to nonsymmetric matrices; 2) we argue that extending MRA to nonsymmetric matrices is not analogous to developing a multilevel SVD; in particular, we show experimentally that it is important to consider interactions between different levels in the hierarchical structure; and 3) we show that a combined low-rank and MMF approach, which amounts to removing a small global-scale component of the matrix and then extracting hierarchical structure from the residual, is even more effective than either of the two complementary methods for matrix compression.
After describing the notation used in this paper, we give a brief introduction to Multiresolution Matrix Factorization in Section 3.
2 Notation
We denote [n] = {1, 2, …, n}. Inner products between vectors are denoted by the operator ⟨·, ·⟩. The (i, j)th entry of a matrix A is denoted by A_{i,j}. If S is a set of indices, A_S represents the submatrix of A corresponding to those indices. S̄ denotes the complement of a set. When S can be interpreted as an ordered set, for instance, if it is a set of indices, then S(k) represents the kth index from that set.
3 Multiresolution Matrix Factorization
Given a measurable space X and the space of square-integrable functions on X, multiresolution analysis (MRA) constructs a sequence of spaces of functions of increasing smoothness:
L²(X) = V_0 ⊇ V_1 ⊇ V_2 ⊇ ⋯,
where each V_ℓ is split into a smoother space V_{ℓ+1} and a rougher space W_{ℓ+1}, such that V_ℓ = V_{ℓ+1} ⊕ W_{ℓ+1}. The length scale over which a function varies increases with ℓ, which makes this decomposition an analysis at multiple scales/resolutions.
kondor2014multiresolution extends MRA to functions on discrete spaces (e.g., undirected graphs) by interpreting a given symmetric matrix as a discrete version of a continuous self-adjoint operator. Specifically, for A ∈ ℝ^{n×n}, kondor2014multiresolution defines multiresolution analysis with respect to the symmetric matrix A as a successive filtration
ℝ^n = V_0 ⊇ V_1 ⊇ V_2 ⊇ ⋯ ⊇ V_L,   (1)
where each V_ℓ has an orthonormal basis Φ_ℓ, and its complementary space W_ℓ (with V_{ℓ-1} = V_ℓ ⊕ W_ℓ) has an orthonormal basis Ψ_ℓ, that satisfy the following three axioms:
 MRA1

Each subspace V_ℓ consists of functions that are increasingly smooth with respect to A, smoothness being measured through the quadratic form ⟨v, Av⟩.
 MRA2

The basis vectors of W_ℓ are localized in the sense that their spread increases no faster than a certain rate.
 MRA3

Suppose the matrix Q_ℓ expresses the new basis Φ_ℓ ∪ Ψ_ℓ in terms of the previous basis Φ_{ℓ-1}. Then Q_ℓ is sparse. (Φ_0 is taken to be the standard basis.)
Once an overcomplete multiscale basis satisfying MRA1–3 for A, i.e., the union of the bases Φ_ℓ and Ψ_ℓ across all levels, is determined, it can be used to compress the matrix A. Applying each of the Q_ℓ's sequentially, we obtain versions of A compressed to various scales:
A_ℓ = Q_ℓ A_{ℓ-1} Q_ℓ^T,  with A_0 = A.
Since the choice of the Q_ℓ's aligns with MRA1, kondor2014multiresolution argues that the matrices A_ℓ are approximately in core-diagonal form. That is, up to a permutation, each of those matrices has a block-diagonal form, with one block being a diagonal matrix.
Formally, Multiresolution Matrix Factorization is defined as an approximate multilevel factorization of the form
A ≈ Q_1^T Q_2^T ⋯ Q_L^T H Q_L ⋯ Q_2 Q_1,   (2)
where the matrices Q_ℓ and H obey the following conditions:

Each Q_ℓ is orthogonal and highly sparse. In the simplest case, each Q_ℓ is a Givens rotation, i.e., a matrix which differs from the identity in just four matrix elements:
Q_{i,i} = cos θ,  Q_{i,j} = −sin θ,  Q_{j,i} = sin θ,  Q_{j,j} = cos θ,
for some pair of indices (i, j) and rotation angle θ. Multiplying a vector with such a matrix rotates it counterclockwise by θ in the (i, j) plane. More generally, Q_ℓ is a so-called k-point rotation, which rotates not just two, but k coordinates.

Typically, in MMF factorizations L is comparable to n, and the size of the active part of the Q_ℓ matrices decreases according to a set schedule n = d_0 ≥ d_1 ≥ ⋯ ≥ d_L. More precisely, there is a nested sequence of sets [n] = S_0 ⊇ S_1 ⊇ ⋯ ⊇ S_L such that the part of each rotation Q_ℓ outside S_{ℓ-1} is the identity. S_ℓ is called the active set at level ℓ. In the simplest case, d_ℓ = n − ℓ.

H is a core-diagonal matrix, which means that it is block diagonal with two blocks: the core, corresponding to the final active set, which is dense, and the remaining block, which is diagonal. In other words, an entry of H is zero unless both of its indices lie in the final active set or it lies on the diagonal.
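To make the structure of such a rotation concrete, here is a minimal NumPy sketch (our illustration; the function name and the sign convention for the sine entries are our own choices):

```python
import numpy as np

def givens_rotation(n, i, j, theta):
    """n-by-n identity modified in four entries so that it rotates
    the (i, j) coordinate plane counterclockwise by theta."""
    G = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = G[j, j] = c
    G[i, j], G[j, i] = -s, s
    return G

# Givens rotations are orthogonal: G @ G.T == I
G = givens_rotation(4, 0, 2, np.pi / 3)
```

Applying `G` to the standard basis vector e_i yields cos θ in coordinate i and sin θ in coordinate j, i.e., a counterclockwise rotation in the (i, j) plane.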
Given a symmetric matrix, the factorization is carried out by minimizing the Frobenius norm error between the original matrix and its factorized form. Algorithms such as GreedyJacobi (kondor2014multiresolution) or pMMF (teneva2016multiresolution) are typically used in practical applications (teneva2016multiresolution; ding2017multiresolution; mudrakarta2017generic).
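As an illustration of the greedy idea (a heavily simplified NumPy sketch, not the actual GreedyJacobi algorithm, which among other things supports k-point rotations and more careful pair selection):

```python
import numpy as np

def greedy_jacobi_sketch(A, n_levels):
    """Simplified one-Givens-per-level sketch of a GreedyJacobi-style
    MMF: at each level, pick the active pair (i, j) with the largest
    off-diagonal magnitude, rotate so that entry (i, j) becomes zero,
    and retire index j from the active set."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    active = list(range(n))
    rotations = []
    for _ in range(n_levels):
        if len(active) < 2:
            break
        # choose the active pair with the largest |A[i, j]|
        i, j = max(((p, q) for ai, p in enumerate(active)
                    for q in active[ai + 1:]),
                   key=lambda pq: abs(A[pq[0], pq[1]]))
        # Jacobi angle that zeroes the (i, j) entry after conjugation
        theta = 0.5 * np.arctan2(2 * A[i, j], A[i, i] - A[j, j])
        G = np.eye(n)
        c, s = np.cos(theta), np.sin(theta)
        G[i, i] = G[j, j] = c
        G[i, j], G[j, i] = s, -s
        A = G @ A @ G.T
        rotations.append(G)
        active.remove(j)  # j leaves the active set (wavelet coordinate)
    return rotations, A
```

Each rotation is orthogonal, so the Frobenius norm of the matrix is preserved while its off-diagonal mass strictly decreases, which is the classical Jacobi mechanism underlying the greedy scheme.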
Having discussed the core ideas behind MMF, we briefly review some related work before moving forward.
4 Multiscale structure in nonsymmetric matrices
In this section, I want to write about why it's not trivial to extend MMF to nonsymmetric matrices, and also explain a little of the intuition behind the formulations in the next sections. Two points to address:

How well is MRA1 satisfied in the formulations proposed? i.e., what measure of smoothness are we using?
Notes (just for guidance – not intended to be part of the paper)

The broad goal here is to efficiently compress and estimate large matrices arising in the real world

This process involves making assumptions about the matrix – the most common one being the low-rank assumption, which is typically modeled via PCA

The problem with PCA is that the eigenvectors are always dense, which does not capture cases where the matrix “more closely couples certain clusters of nearby coordinates than those farther apart w.r.t. some underlying topology”. Sparse PCA solves the locality problem, but does not capture global-scale structure.

MMF is a method that aims to capture multiscale structure in discrete spaces such as graphs. Matrices are considered as operators on vectors, and MMF essentially decomposes that operator in terms of a multiscale basis.

Given a continuous operator such as the Laplacian, the application of this operator to a function can be expressed in terms of the eigenfunctions of the operator. In the case of the Laplacian, the eigenfunctions are the Fourier basis functions.
Similarly, when a symmetric matrix is the operator, the application of this operator (i.e., the matrix-vector product) can be expressed in terms of the eigenvectors of the matrix. In this way, the eigenvalue decomposition is an analogue of the Fourier decomposition.

Multiresolution Analysis (MRA) captures multiscale structure in a given space of functions by repeatedly reducing the space so that the smoothness of functions increases in some sense. For functions on the real line, Mallat (1989) defines MRA in terms of wavelets constructed via dilation and translation of certain canonical functions.
Question: how does Mallat show that this way of constructing wavelets corresponds to the idea of compressing the space by smoothness of functions? In particular, what exactly is the measure of smoothness that he uses, and how is it related to the symmetry/asymmetry of an operator that is used to determine the smoothness of a function?

The MMF paper defines MRA on discrete spaces similarly, in terms of successively compressing a given subspace. Note that each subspace here would be defined by a finite set of basis vectors. Three axioms are defined that characterize the smoothness of each space (MRA1), the localization (MRA2), and the existence of a fast wavelet transform (MRA3).
What does it mean to relax the criterion that each subspace must be a strict subset of the previous one?
How is smoothness defined in the asymmetric case?

Now that we can determine a wavelet basis in matrix form, the given operator can be transformed into this basis by conjugation with the wavelet transform matrix.
In the symmetric case, the wavelet functions are assumed to be close to eigenvectors of the matrix, so that the MRA1 criterion ensures that each subspace is smoother than the previous. This makes the structure of the transformed matrix core-diagonal. MMF-factorizable matrices are defined to be those for which applying the wavelet transform results in an exactly core-diagonal matrix.
In reality, the transformed matrix is never exactly core-diagonal. We manually zero out entries of the wavelet-transformed matrix to bring it to that form. That makes MMF an approximate factorization.
In some of the asymmetric versions we develop, this criterion is relaxed, thereby violating the MRA1 axiom a bit, but that results in better compression.
5 Related Work
The work most closely related to some of the ideas in this paper is that on defining wavelets for functions on directed graphs. Two notable recent works on the theme are those of sevi2018harmonic and mhaskar2018. sevi2018harmonic did so by using a random walk operator, while mhaskar2018 used an extension of diffusion polynomial frames (mhaskarmaggioni) with a different operator. In fact, the influential work on diffusion wavelets (coifman2006diffusion) provides a more general theory in which symmetry of the diffusion operator is not necessary. However, it is unclear how to leverage these ideas to define a principled asymmetric MMF that also has a clear MRA interpretation. On the side of matrix factorization, works that bear some resemblance to our approach are LiYangButterfly; LiYangInterpolate, in which the left and right matrices do not have to be related by transposition. However, these approaches do not have a clear wavelet-based motivation.
6 MMF on nonsymmetric matrices
Building on the MMF approach described in Section 3, we now provide two approaches that extend MMF to general square matrices.
6.1 MMF via an additive decomposition
We begin by observing that any square matrix A can be written as the sum of a symmetric matrix and a skew-symmetric matrix as follows:
A = (A + A^T)/2 + (A − A^T)/2.   (3)
We propose performing MRA on each term separately. Since the first term is symmetric, it directly admits a multiresolution matrix factorization. In particular, the factorization of the symmetric part can be obtained via a parallel variant of MMF such as pMMF (kondor2015parallel). Thus, the problem of computing a multiresolution factorization of an asymmetric square matrix is now reduced to computing a multiresolution factorization of a skew-symmetric matrix.
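A minimal NumPy sketch of this split (the function name is ours; the decomposition itself is exactly equation 3):

```python
import numpy as np

def symmetric_skew_split(A):
    """Split a square matrix into its symmetric and skew-symmetric
    parts, A = (A + A^T)/2 + (A - A^T)/2."""
    return (A + A.T) / 2, (A - A.T) / 2

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
sym, skew = symmetric_skew_split(A)
# sym is symmetric, skew is skew-symmetric, and they sum back to A
```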
6.1.1 MMF for skewsymmetric matrices
For skew-symmetric matrices, the MRA axioms, especially MRA1, are no longer applicable. The reason is that under the definition of smoothness in MRA1, for a skew-symmetric matrix A, the smoothness v^T A v is always zero: v^T A v = (v^T A v)^T = v^T A^T v = −v^T A v.
In symmetric MMF, the choice of the sparse orthogonal factors and the diagonal part of the core-diagonal matrix is based on the assumption that the multiscale basis vectors determined by the MMF algorithm are as close to eigenvectors as possible. In other words, the filtration defined in equation 1 is a sequence of subspaces that is as close to a spectral filtration as possible. Indeed, MMF approximately diagonalizes a symmetric matrix using O(n) rotations, where n is the dimension of the matrix. We take this intuition as our starting point to extend MMF to skew-symmetric matrices.
Note that a matrix A is skew-symmetric if A^T = −A. Skew-symmetric matrices have purely imaginary eigenvalues and are therefore not diagonalizable over the reals. Thus, a skew-symmetric matrix can never be perfectly represented using the same form of MMF as for a symmetric matrix over the reals. In order to overcome this issue, we use the well-known observation (murnaghan1931canonical) that a skew-symmetric matrix, under an orthogonal transformation, assumes the so-called Murnaghan canonical form,
Q^T A Q = B_1 ⊕ B_2 ⊕ ⋯ ⊕ B_k ⊕ 0 ⊕ ⋯ ⊕ 0,   (4)
where the B_j are 2 × 2 matrices of the form
B_j = [ 0, b_j ; −b_j, 0 ],
where b_j ∈ ℝ.
This observation affords a multiresolution factorization for skew-symmetric matrices identical to that for symmetric matrices, with the difference that the core-diagonal matrix is allowed to have the following form:
H = H_core ⊕ B_1 ⊕ B_2 ⊕ ⋯ ⊕ B_k ⊕ 0 ⊕ ⋯ ⊕ 0,   (5)
where the B_j are 2 × 2 matrices as defined above and H_core is the dense core.
A couple of observations on this form are in order. First, notice that the skew-symmetry of the matrix is preserved throughout its MMF. Second, this form also allows for interaction between wavelets at adjacent levels. This feature is distinct from the original MMF procedure, in which no such interactions occur. To obtain the factorization, the procedure is similar to the GreedyJacobi algorithm from kondor2014multiresolution, with the difference that in the last step, instead of zeroing out off-diagonal elements, we sparsify the matrix into the Murnaghan normal form as described above.
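Numerically, a basis bringing a skew-symmetric matrix to (a permutation of) this block form can be obtained from the real Schur decomposition; a small SciPy sketch (our tooling choice for illustration, not part of the proposed algorithm):

```python
import numpy as np
from scipy.linalg import schur

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
A = (M - M.T) / 2  # skew-symmetric

# Real Schur form: A = Z T Z^T with Z orthogonal. Since A is normal
# and skew-symmetric, T is block diagonal with 2x2 blocks of the
# form [[0, b], [-b, 0]] (the Murnaghan canonical structure).
T, Z = schur(A, output='real')
```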
Thus, the multiresolution decomposition of a square matrix A can be done by factorizing each of the two parts from equation 3 separately and writing A as a sum of two MMFs. In other words, we may write A ≈ A_sym + A_skew, where
A_sym ≈ Q_1^T ⋯ Q_L^T H Q_L ⋯ Q_1,   (6)
A_skew ≈ R_1^T ⋯ R_L^T H' R_L ⋯ R_1,   (7)
where H' has the form defined in equation 5.
In the next section, we present another approach for multiresolution factorization.
6.2 MMF via direct factorization
In this approach, we aim to capture hierarchical structure by operating on the square matrix directly rather than decomposing it into an additive form. Recall that we would like to identify clusters of coordinates that a matrix more closely couples than others, and determine an appropriate multiresolution analysis of A. When the matrix is symmetric, the linear transformation that it represents is self-adjoint. In the asymmetric case, depending on whether we are transforming x ↦ Ax or x ↦ A^T x, we may end up with different ways in which A might couple the coordinates.
We consider two different ways of performing multiresolution analysis of A, one based on A and another on A^T. Suppose U represents the wavelet transform in the former case, and V that of the latter. Thus, we have
B = U A V^T,
which is the matrix A expressed in the bases U and V. In the symmetric case, B takes a core-diagonal form, based on the fact that the eigenvalue decomposition converts a symmetric matrix to diagonal form, and MMF uses orthonormal transformations that are designed to be as close to eigenvectors as possible. In the asymmetric case, we relax this assumption and allow some of the off-diagonal elements of B to be nonzero. This is motivated by the fact that the frequencies or smoothness coefficients filtered by the left- and right-hand sides may not align with each other.
Remark.
When A represents the adjacency matrix of a directed graph, we are essentially identifying hierarchical cluster structures considering 1) only the outgoing edges, and 2) only the incoming edges. The matrix B gives a mapping between these two cluster structures.
We define asymmetric MMF to be a multiresolution factorization of the following form:
A ≈ Q_1^T Q_2^T ⋯ Q_L^T H R_L ⋯ R_2 R_1,   (8)
where

Each of the left and right rotation matrices is highly sparse and orthogonal, similar to the rotations in a symmetric MMF

H is a core-sparse matrix; that is, under an appropriate permutation of rows and columns, H would be a 2 × 2 block matrix with the upper-left block (the core) dense and the other three blocks highly sparse

When the left and right rotations coincide and H is in core-diagonal form with a symmetric core, the above factorization reduces to a symmetric MMF.
The computation of the asymmetric MMF follows an approach similar to that of the symmetric MMF. The algorithm we propose is designed to minimize the Frobenius norm error between A and its factorized form.
We determine the left and right rotations in a greedy manner. That is, we compute a sequence of rotated versions of A.
Each of the rotations is computed by identifying sets of similar rows (for the left rotations) and columns (for the right rotations), and determining an appropriate point rotation. This requires us to compute Gram matrices over both the rows and the columns. Finally, we sparsify the final rotated version of A to determine the core-sparse matrix H. We explore three ways of sparsification:

CoreDiagonal: retain only the diagonal elements of H (in addition to the dense core)

TopN: retain the top N elements in terms of magnitude

GreedyTopN: select elements by magnitude greedily such that no two of the selected elements fall in the same row or column
The full procedure is summarized in Algorithm 1.
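For concreteness, the TopN and GreedyTopN rules could be sketched as follows (an illustration with our own function names, assuming the dense core occupies the upper-left block):

```python
import numpy as np

def top_n(B, core, n_keep):
    """Keep the dense core x core upper-left block of B plus the n_keep
    largest-magnitude entries outside it; zero everything else."""
    out = np.zeros_like(B)
    out[:core, :core] = B[:core, :core]
    mask = np.ones(B.shape, dtype=bool)
    mask[:core, :core] = False
    # rank entries outside the core by magnitude (core entries get -1)
    idx = np.argsort(np.where(mask, np.abs(B), -1.0), axis=None)[-n_keep:]
    out.flat[idx] = B.flat[idx]
    return out

def greedy_top_n(B, core, n_keep):
    """Like top_n, but entries are accepted greedily by magnitude
    subject to no two selected entries sharing a row or a column."""
    out = np.zeros_like(B)
    out[:core, :core] = B[:core, :core]
    cands = sorted(
        ((abs(B[i, j]), i, j)
         for i in range(B.shape[0]) for j in range(B.shape[1])
         if not (i < core and j < core)),
        reverse=True)
    rows, cols = set(), set()
    for _, i, j in cands:
        if len(rows) == n_keep:
            break
        if i not in rows and j not in cols:
            out[i, j] = B[i, j]
            rows.add(i); cols.add(j)
    return out
```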
The running time of this algorithm is of the same order as that of the GreedyJacobi routine used in the original symmetric MMF. Nevertheless, computation can be made faster by similarly extending the parallel MMF algorithm to maintain row and column Gram matrices and compute separate left and right rotations. It must be noted that applying the left rotations to A does not affect the column Gram matrix, and applying the right rotations does not affect the row Gram matrix.
6.3 Numerical experiments
Similar to the evaluation of MMF in teneva2016multiresolution, we compare our asymmetric MMF formulations with the low-rank CUR decomposition (mahoney2009cur) for the task of compressing square matrices. The CUR decomposition is a widely used alternative to the singular value decomposition for compressing large matrices. To reduce a matrix A to rank k, the method proceeds by carefully selecting columns (C) and rows (R) and determines a matrix U such that the product CUR is as close to the best rank-k approximation of A as possible.
To set up a comparison, we sample 70 square matrices from the UFlorida sparse matrix repository (davis2011university). The matrices in this repository were sourced from various fields of science and engineering and contain data arising from circuit simulations, social networks, chemical processes, 2D/3D problems, etc. All the sampled matrices were chosen such that the numerical symmetry was less than , and had a maximum dimensionality of .
6.3.1 Results on MMF via an additive decomposition
We first present results comparing the asymmetric MMF obtained by the additive decomposition to CUR. Table 1 shows the percentage of times the asymmetric MMF achieved a better compression error than CUR over 65 matrices. Since both the CUR and MMF algorithms are randomized, we report results averaged over 3 trials. CUR performs better than MMF in the majority of cases.
Matrix kind  num. matrices  1%  10%  25%  50%  75% 

problems with underlying 2D/3D geometry  
structural problem  4  50.0  50.0  50.0  0.0  0.0 
computational fluid dynamics problem  3  0.0  0.0  0.0  0.0  0.0 
2D/3D problem  5  80.0  80.0  60.0  60.0  60.0 
materials problem  2  0.0  0.0  0.0  0.0  0.0 
problems with no underlying geometry  
economic problem  11  18.2  9.1  9.1  9.1  9.1 
directed graph  19  5.3  5.3  5.3  0.0  0.0 
power network problem  3  0.0  0.0  0.0  0.0  0.0 
circuit simulation problem  2  0.0  0.0  0.0  0.0  0.0 
chemical process simulation problem  23  0.0  0.0  8.7  0.0  0.0 
statistical/mathematical problem  1  100.0  0.0  0.0  0.0  0.0 
counterexample problem  1  0.0  0.0  0.0  0.0  0.0 
total wins:  13.5  10.8  12.2  5.4  5.4 
6.3.2 Results on MMF via direct factorization
Next, we report results for the direct factorization variant of the asymmetric MMF. In particular, we consider two different versions of the method based on which elements are retained in the core-sparse matrix H. In the first case, we retain the compressed dense core, along with elements chosen greedily by magnitude such that no two of these elements fall in the same row or column.
Table 2 shows the percentage of times this version of the asymmetric MMF wins over CUR in compression error.
Matrix kind  num. matrices  1%  10%  25%  50%  75% 

problems with underlying 2D/3D geometry  
structural problem  4  50.0  50.0  25.0  0.0  0.0 
materials problem  1  0.0  0.0  0.0  0.0  0.0 
2D/3D problem  4  100.0  75.0  75.0  75.0  0.0 
computational fluid dynamics problem  3  100.0  100.0  0.0  0.0  0.0 
problems with no underlying geometry  
statistical/mathematical problem  1  100.0  100.0  100.0  0.0  0.0 
economic problem  9  77.8  22.2  11.1  33.3  11.1 
power network problem  2  100.0  0.0  50.0  0.0  0.0 
directed graph  15  80.0  20.0  6.7  0.0  0.0 
chemical process simulation problem  21  80.9  23.8  14.3  0.0  0.0 
circuit simulation problem  2  100.0  0.0  0.0  0.0  0.0 
counterexample problem  1  0.0  0.0  0.0  0.0  0.0 
total wins:  79.4  30.2  17.5  9.5  1.6 
We observe that this version of MMF performs significantly better than the additive version, and is able to capture structure both in problems with known underlying geometry and in problems without.
Relaxing the structure of the matrix H further, we report in Table 3 the results for simply storing the top N elements in the core-sparse matrix.
Matrix kind  num. matrices  1%  10%  25%  50%  75% 

problems with underlying 2D/3D geometry  
structural problem  4  100.0  100.0  100.0  50.0  50.0 
materials problem  1  100.0  100.0  100.0  100.0  100.0 
2D/3D problem  5  100.0  100.0  100.0  80.0  60.0 
computational fluid dynamics problem  3  100.0  100.0  100.0  0.0  0.0 
problems with no underlying geometry  
directed graph  18  100.0  50.0  22.3  16.7  0.0 
chemical process simulation problem  21  100.0  90.5  67.7  57.1  38.1 
circuit simulation problem  2  100.0  100.0  100.0  50.0  0.0 
power network problem  3  100.0  100.0  100.0  66.7  33.3 
counterexample problem  1  100.0  100.0  100.0  100.0  100.0 
economic problem  10  100.0  100.0  100.0  70.0  50.0 
statistical/mathematical problem  1  100.0  100.0  100.0  100.0  0.0 
total wins:  100.0  84.1  69.6  49.3  29.0 
Here, we see that MMF is able to capture structure and compress nearly all kinds of matrices at all compression levels.
From the numerical results, it seems that allowing interaction between wavelet levels is critical to achieving good approximation errors. This indicates that hierarchical structure in nonsymmetric matrices is more complex than that uncovered by simply extending the symmetric MMF analogous to an SVD.
For smaller compression ratios, CUR seems preferable for some types of problems, such as directed graphs. It is possible that some matrices have global-scale behavior superimposed on the hierarchical structure; this can happen, for instance, when the data is noisy. In the next part, we explore ways of combining MMF and low-rank approaches to achieve even better compression errors.
6.3.3 Matrix compression via a combination of CUR and MMF
To motivate this section, we first consider Figure 1, which shows the error of the MMF approximation of a matrix as a function of the decay rate of its singular values.
We observe that when the singular values decay at a superlinear rate, MMF does not approximate the matrix well. However, in many real-world matrices, the decay of singular values may not be uniform and may fluctuate between very high and low values. An example of such a spectrum is shown in Figure 2. In this case, we see that although there are segments where the decay rate is sublinear, the sharp decay around singular value 1500 may significantly impact the effectiveness of MMF on this matrix. Thus, it is well motivated to first compress the matrix using low-rank methods and then apply MMF to the residual. We describe this technique in Algorithm 2.
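The essence of this combination can be sketched as follows (here a truncated SVD stands in for CUR purely for brevity, and the function name is ours; the actual procedure is given in Algorithm 2):

```python
import numpy as np

def strip_global_component(A, k):
    """Remove a rank-k 'global-scale' component via truncated SVD and
    return it together with the residual, which would then be handed
    to an MMF-style factorizer."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    lowrank = (U[:, :k] * s[:k]) @ Vt[:k]
    return lowrank, A - lowrank
```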
For various values of the rank k, we report the errors for compressing various matrices to 5% of their original size in Figure 3.
For most of the matrices, we observe that there exists at least one rank to which the matrix can first be compressed using CUR, before compressing with MMF, that results in lower compression errors than either CUR or MMF individually. In practice, there is no method to determine this rank for a given matrix; however, removing a small global-scale component of the matrix, after which the hierarchical structure can be easily extracted, is an interesting and worthwhile problem.
In all the experiments, the running time of MMF is equal to or slightly higher than that of low-rank methods. However, we note that our algorithms can be made significantly faster by straightforwardly adapting parallel versions of MMF (kondor2015parallel) to the asymmetric case. We expect speedups similar to those achieved in the symmetric case (teneva2016multiresolution).
7 Conclusion
In this paper, we have presented approaches to extend Multiresolution Matrix Factorization to nonsymmetric matrices. In general, efficiently estimating hierarchical structure in matrices is an increasingly important problem for which there currently exist very few practically useful methods. For future work, we aim to use ideas from wavelet theory on more general spaces to define MMFs for general square matrices in which the MRA interpretation is clear.