Tensor Graph Convolutional Networks for Prediction on Dynamic Graphs

10/16/2019 · by Osman Asif Malik, et al.

Many irregular domains such as social networks, financial transactions, neuron connections, and natural language structures are represented as graphs. In recent years, a variety of graph neural networks (GNNs) have been successfully applied for representation learning and prediction on such graphs. However, in many of the applications, the underlying graph changes over time and existing GNNs are inadequate for handling such dynamic graphs. In this paper we propose a novel technique for learning embeddings of dynamic graphs based on a tensor algebra framework. Our method extends the popular graph convolutional network (GCN) for learning representations of dynamic graphs using the recently proposed tensor M-product technique. Theoretical results that establish the connection between the proposed tensor approach and spectral convolution of tensors are developed. Numerical experiments on real datasets demonstrate the usefulness of the proposed method for an edge classification task on dynamic graphs.


1 Introduction

Graphs are popular data structures used to effectively represent interactions and structural relationships between entities in structured data domains. Inspired by the success of deep neural networks for learning representations in the image and language domains, the application of neural networks to graph representation learning has recently attracted much interest. A number of graph neural network (GNN) architectures have been explored in the contemporary literature for a variety of graph related tasks and applications (Hamilton et al., 2017; Seo et al., 2018; Chen et al., 2018; Zhou et al., 2018; Wu et al., 2019). Methods based on graph convolution filters, which extend convolutional neural networks (CNNs) to irregular graph domains, are popular (Bruna et al., 2013; Defferrard et al., 2016; Kipf and Welling, 2016). Most of these GNN models operate on a given, static graph.

In many real-world applications, the underlying graph changes over time, and learning representations of such dynamic graphs is essential. Examples include analyzing social networks (Berger-Wolf and Saia, 2006), predicting collaboration in citation networks (Leskovec et al., 2005), detecting fraud and crime in financial networks (Weber et al., 2018; Pareja et al., 2019), traffic control (Zhao et al., 2019), and understanding neuronal activities in the brain (De Vico Fallani et al., 2014). In such dynamic settings, the temporal interdependence in the graph connections and features also plays a substantial role. However, efficient GNN methods that handle time varying graphs and that capture the temporal correlations are lacking.

By a dynamic graph, we mean a sequence of graphs $G^{(t)} = (V, E^{(t)})$, $t = 1, \ldots, T$, with a fixed set of $N$ nodes, adjacency matrices $\mathbf{A}^{(t)} \in \mathbb{R}^{N \times N}$, and graph feature matrices $\mathbf{X}^{(t)} \in \mathbb{R}^{N \times F}$, where the row $\mathbf{X}^{(t)}_{n:} \in \mathbb{R}^{F}$ is the feature vector consisting of the $F$ features associated with node $n$ at time $t$. The graphs can be weighted, and directed or undirected. They can also have additional properties like (time varying) node and edge classes, which would be stored in a separate structure. Suppose we only observe the first $T' < T$ graphs in the sequence. The goal of our method is to use these observations to predict some property of the remaining graphs. In this paper, we use it for edge classification. Other potential applications are node classification and edge/link prediction.

In recent years, tensor constructs have been explored to effectively process high-dimensional data, in order to better leverage the multidimensional structure of such data (Kolda and Bader, 2009). Tensor based approaches have been shown to perform well in many image and video processing applications (Hao et al., 2013; Kilmer et al., 2013; Martin et al., 2013; Zhang et al., 2014; Zhang and Aeron, 2016; Lu et al., 2016; Newman et al., 2018). A number of tensor based neural networks have also been investigated to extract and learn multi-dimensional representations, e.g. methods based on tensor decomposition (Phan and Cichocki, 2010), tensor-trains (Novikov et al., 2015; Stoudenmire and Schwab, 2016), and the tensor factorized neural network (Chien and Bao, 2017). Recently, a new tensor framework called the tensor M-product framework (Braman, 2010; Kilmer and Martin, 2011; Kernfeld et al., 2015) was proposed that extends matrix based theory to high-dimensional architectures.

Figure 1: TensorGCN approach.

In this paper, we propose a novel tensor variant of the popular graph convolutional network (GCN) architecture (Kipf and Welling, 2016), which we call TensorGCN. It captures correlation over time by leveraging the tensor M-product framework. The flexibility and matrix mimetic nature of the framework help us adapt the GCN architecture to tensor space. Figure 1 illustrates our method at a high level: First, the time varying adjacency matrices $\mathbf{A}^{(t)}$ and feature matrices $\mathbf{X}^{(t)}$ of the dynamic graph are aggregated into an adjacency tensor and a feature tensor, respectively. These tensors are then fed into our TensorGCN, which computes an embedding that can be used for a variety of tasks, such as link prediction, and edge and node classification. GCN architectures are motivated by graph convolution filtering, i.e., applying filters/functions to the graph Laplacian (and in turn to its eigenvalues) (Bruna et al., 2013), and we establish a similar connection between TensorGCN and spectral filtering of tensors. Experimental results on real datasets illustrate the performance of our method for the edge classification task on dynamic graphs. Elements of our method can also be used as a preprocessing step for other dynamic graph methods.
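To make the aggregation step concrete, the following minimal sketch (our own helper, assuming NumPy arrays `A_list` and `X_list` holding the per-time adjacency and feature matrices) stacks them into third-order tensors with time along the third mode.

```python
import numpy as np

def build_tensors(A_list, X_list):
    """Stack T adjacency matrices (N x N) and T feature matrices (N x F)
    into tensors of shape N x N x T and N x F x T, respectively."""
    A = np.stack(A_list, axis=2)  # adjacency tensor; time runs along the third mode
    X = np.stack(X_list, axis=2)  # feature tensor
    return A, X
```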

2 Related Work

The idea of using graph convolution based on the spectral graph theory for GNNs was first introduced by Bruna et al. (2013). Defferrard et al. (2016) then proposed Chebnet, where the spectral filter was approximated by Chebyshev polynomials in order to make it faster and localized. Kipf and Welling (2016) presented the simplified GCN, a degree-one polynomial approximation of Chebnet, in order to speed up computation further and improve the performance. There are many other works that deal with GNNs when the graph and features are fixed/static; see the review papers by Zhou et al. (2018) and Wu et al. (2019) and references therein. These methods cannot be directly applied to the dynamic setting we consider. Seo et al. (2018) devised the Graph Convolutional Recurrent Network for graphs with time varying features. However, this method assumes that the edges are fixed over time, and is not applicable in our setting. Wang et al. (2018) proposed a method called EdgeConv, which is a neural network (NN) approach that applies convolution operations on static graphs in a dynamic fashion. Their approach is not applicable when the graph itself is dynamic. Zhao et al. (2019) developed a temporal GCN method called T-GCN, which they apply for traffic prediction. Their method assumes the graph remains fixed over time, and only the features vary.

The set of methods most relevant to our setting of learning embeddings of dynamic graphs use combinations of GNNs and recurrent architectures (RNNs) to capture the graph structure and handle time dynamics, respectively. The approach in Manessi et al. (2020) uses Long Short-Term Memory (LSTM), a recurrent network, in order to handle time variations along with GNNs. They design architectures for semi-supervised node classification and for supervised graph classification.

Pareja et al. (2019) presented a variant of GCN called EvolveGCN, where Gated Recurrent Units (GRUs) and LSTMs are coupled with a GCN to handle dynamic graphs. This method is currently the state of the art. However, their approach is based on a heuristic RNN/GRU mechanism, is not theoretically grounded, and does not harness a tensor algebraic framework to incorporate time varying information.

Newman et al. (2018) presented a tensor NN which utilizes the M-product tensor framework. Their approach can be applied to image and other high-dimensional data that lie on regular grids, and differs from ours since we consider data on dynamic graphs.

3 Tensor M-Product Framework

Here, we cover the necessary preliminaries on tensors and the M-product framework. For a more general introduction to tensors, we refer the reader to the review paper by Kolda and Bader (2009). In this paper, a tensor is a three-dimensional array of real numbers, denoted by boldface Euler script letters, e.g. $\mathcal{A}$. Matrices are denoted by bold uppercase letters, e.g. $\mathbf{A}$; vectors are denoted by bold lowercase letters, e.g. $\mathbf{a}$; and scalars are denoted by lowercase letters, e.g. $a$. An element at position $(i, j, k)$ in a tensor is denoted by subscripts, e.g. $\mathcal{A}_{ijk}$, with similar notation for elements of matrices and vectors. A colon denotes all elements along that dimension: $\mathbf{A}_{i:}$ denotes the $i$th row of the matrix $\mathbf{A}$, and $\mathcal{A}_{::k}$ denotes the $k$th frontal slice of $\mathcal{A}$. The vectors $\mathcal{A}_{ij:}$ are called the tubes of $\mathcal{A}$.

The framework we consider relies on a new definition of the product of two tensors, called the M-product (Braman, 2010; Kilmer and Martin, 2011; Kilmer et al., 2013; Kernfeld et al., 2015). A distinguishing feature of this framework is that the M-product of two three-dimensional tensors is also three-dimensional, which is not the case for e.g. tensor contractions (Bishop and Goldberg, 2012). It allows one to elegantly generalize many classical numerical methods from linear algebra, and has been applied e.g. in neural networks (Newman et al., 2018), imaging (Kilmer et al., 2013; Martin et al., 2013; Semerci et al., 2014), facial recognition (Hao et al., 2013), and tensor completion and denoising (Zhang et al., 2014; Zhang and Aeron, 2016; Lu et al., 2016). Although the framework was originally developed for three-dimensional tensors, which is sufficient for our purposes, it has been extended to handle tensors of dimension greater than three (Martin et al., 2013). Definitions 3.1–3.3 below describe the M-product.

Definition 3.1 (M-transform).

Let $\mathbf{M} \in \mathbb{R}^{T \times T}$ be a mixing matrix. The M-transform of a tensor $\mathcal{A} \in \mathbb{R}^{I \times J \times T}$ is denoted by $\mathcal{A} \times_3 \mathbf{M}$ and defined elementwise as

$(\mathcal{A} \times_3 \mathbf{M})_{ijk} = \sum_{l=1}^{T} \mathbf{M}_{kl} \mathcal{A}_{ijl}. \qquad (1)$

We say that $\mathcal{A} \times_3 \mathbf{M}$ is in the transformed space. Note that if $\mathbf{M}$ is invertible, then $(\mathcal{A} \times_3 \mathbf{M}) \times_3 \mathbf{M}^{-1} = \mathcal{A}$. Consequently, $\times_3 \mathbf{M}^{-1}$ is the inverse M-transform. The definition in (1) may also be written in matrix form as $\operatorname{unfold}(\mathcal{A} \times_3 \mathbf{M}) = \mathbf{M} \operatorname{unfold}(\mathcal{A})$, where the unfold operation takes the tubes of $\mathcal{A}$ and stacks them as columns into a $T \times IJ$ matrix, and $\operatorname{fold}(\operatorname{unfold}(\mathcal{A})) = \mathcal{A}$. Appendix A provides illustrations of how the M-transform works.
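As a concrete sketch, the unfold/fold recipe above can be implemented in a few lines of NumPy (the helper names are ours; equivalently, `np.einsum('ijl,kl->ijk', A, M)` computes the same transform directly).

```python
import numpy as np

def unfold(A):
    """Unfold an I x J x T tensor into a T x (I*J) matrix whose columns are the tubes of A."""
    I, J, T = A.shape
    return A.reshape(I * J, T).T  # column i*J + j holds the tube A[i, j, :]

def fold(B, shape):
    """Inverse of unfold: rebuild the I x J x T tensor from its T x (I*J) unfolding."""
    I, J, T = shape
    return B.T.reshape(I, J, T)

def m_transform(A, M):
    """M-transform A x_3 M: apply the mixing matrix M along the third (time) mode."""
    return fold(M @ unfold(A), A.shape)

def inverse_m_transform(A_hat, M):
    """Inverse M-transform, assuming M is invertible."""
    return fold(np.linalg.solve(M, unfold(A_hat)), A_hat.shape)
```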

Definition 3.2 (Facewise product).

Let $\mathcal{A} \in \mathbb{R}^{I \times J \times T}$ and $\mathcal{B} \in \mathbb{R}^{J \times K \times T}$ be two tensors. The facewise product, denoted by $\mathcal{A} \triangle \mathcal{B}$, is defined facewise as $(\mathcal{A} \triangle \mathcal{B})_{::k} = \mathcal{A}_{::k} \mathcal{B}_{::k}$.

Definition 3.3 (M-product).

Let $\mathcal{A} \in \mathbb{R}^{I \times J \times T}$ and $\mathcal{B} \in \mathbb{R}^{J \times K \times T}$ be two tensors, and let $\mathbf{M} \in \mathbb{R}^{T \times T}$ be an invertible matrix. The M-product, denoted by $\mathcal{A} \star \mathcal{B}$, is defined as

$\mathcal{A} \star \mathcal{B} = \big((\mathcal{A} \times_3 \mathbf{M}) \triangle (\mathcal{B} \times_3 \mathbf{M})\big) \times_3 \mathbf{M}^{-1}. \qquad (2)$
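A minimal sketch of the facewise product and the M-product, reusing the `m_transform` and `inverse_m_transform` helpers from the sketch above (again, our names rather than a published API):

```python
import numpy as np

def facewise(A, B):
    """Facewise product: multiply matching frontal slices, A[:, :, k] @ B[:, :, k]."""
    return np.einsum('ijk,jlk->ilk', A, B)

def m_product(A, B, M):
    """M-product: transform along the third mode, multiply facewise, transform back."""
    return inverse_m_transform(facewise(m_transform(A, M), m_transform(B, M)), M)
```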

In the original formulation of the M-product, $\mathbf{M}$ was chosen to be the Discrete Fourier Transform (DFT) matrix, which allows efficient computation using the Fast Fourier Transform (FFT) (Braman, 2010; Kilmer and Martin, 2011; Kilmer et al., 2013). The framework was later extended to arbitrary invertible $\mathbf{M}$ (e.g. discrete cosine and wavelet transforms) (Kernfeld et al., 2015). A benefit of the tensor M-product framework is that many standard matrix concepts can be generalized in a straightforward manner. Definitions 3.4–3.7 extend the matrix concepts of diagonality, identity, transpose and orthogonality to tensors (Braman, 2010; Kilmer et al., 2013).

Definition 3.4 (f-diagonal).

A tensor is said to be f-diagonal if each frontal slice is diagonal.

Definition 3.5 (Identity tensor).

Let $\hat{\mathcal{I}} \in \mathbb{R}^{N \times N \times T}$ be defined facewise as $\hat{\mathcal{I}}_{::k} = \mathbf{I}$, where $\mathbf{I}$ is the matrix identity. The M-product identity tensor $\mathcal{I}$ is then defined as $\mathcal{I} = \hat{\mathcal{I}} \times_3 \mathbf{M}^{-1}$.

Definition 3.6 (Tensor transpose).

The transpose of a tensor $\mathcal{A}$ is defined as $\mathcal{A}^\top = \hat{\mathcal{A}} \times_3 \mathbf{M}^{-1}$, where $\hat{\mathcal{A}}_{::k} = \big((\mathcal{A} \times_3 \mathbf{M})_{::k}\big)^\top$ for each $k$.

Definition 3.7 (Orthogonal tensor).

A tensor $\mathcal{Q} \in \mathbb{R}^{N \times N \times T}$ is said to be orthogonal if $\mathcal{Q}^\top \star \mathcal{Q} = \mathcal{Q} \star \mathcal{Q}^\top = \mathcal{I}$.

Leveraging these concepts, a tensor eigendecomposition can now be defined (Braman, 2010; Kilmer et al., 2013):

Definition 3.8 (Tensor eigendecomposition).

Let $\mathcal{A} \in \mathbb{R}^{N \times N \times T}$ be a tensor and assume that each frontal slice $(\mathcal{A} \times_3 \mathbf{M})_{::k}$ is symmetric. We can then eigendecompose these slices as $(\mathcal{A} \times_3 \mathbf{M})_{::k} = \hat{\mathcal{Q}}_{::k} \hat{\mathcal{D}}_{::k} \hat{\mathcal{Q}}_{::k}^\top$, where $\hat{\mathcal{Q}}_{::k}$ is orthogonal and $\hat{\mathcal{D}}_{::k}$ is diagonal (see e.g. Theorem 8.1.1 in Golub and Van Loan (2013)). The tensor eigendecomposition of $\mathcal{A}$ is then defined as $\mathcal{A} = \mathcal{Q} \star \mathcal{D} \star \mathcal{Q}^\top$, where $\mathcal{Q} = \hat{\mathcal{Q}} \times_3 \mathbf{M}^{-1}$ is orthogonal and $\mathcal{D} = \hat{\mathcal{D}} \times_3 \mathbf{M}^{-1}$ is f-diagonal.
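Since the decomposition is defined slice by slice in the transformed domain, it can be computed with an ordinary symmetric eigensolver, as in the following sketch (reusing the helpers above; `numpy.linalg.eigh` handles each symmetric frontal slice):

```python
import numpy as np

def tensor_eig(A, M):
    """Tensor eigendecomposition A = Q * D * Q^T (M-product), assuming each frontal
    slice of A x_3 M is symmetric."""
    A_hat = m_transform(A, M)
    Q_hat = np.zeros_like(A_hat)
    D_hat = np.zeros_like(A_hat)
    for k in range(A_hat.shape[2]):
        w, V = np.linalg.eigh(A_hat[:, :, k])  # eigendecompose each frontal slice
        Q_hat[:, :, k] = V
        D_hat[:, :, k] = np.diag(w)
    return inverse_m_transform(Q_hat, M), inverse_m_transform(D_hat, M)
```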

4 Tensor Dynamic Graph Embedding

Our approach is inspired by the first order GCN by Kipf and Welling (2016) for static graphs, owing to its simplicity and effectiveness. For a graph with adjacency matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$ and feature matrix $\mathbf{X} \in \mathbb{R}^{N \times F}$, a GCN layer takes the form $\mathbf{Y} = \sigma(\hat{\mathbf{A}} \mathbf{X} \mathbf{W})$, where

$\hat{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2} (\mathbf{A} + \mathbf{I}) \tilde{\mathbf{D}}^{-1/2}, \qquad (3)$

$\tilde{\mathbf{D}}$ is diagonal with $\tilde{\mathbf{D}}_{ii} = 1 + \sum_j \mathbf{A}_{ij}$, $\mathbf{I}$ is the matrix identity, $\mathbf{W} \in \mathbb{R}^{F \times F'}$ is a matrix to be learned when training the NN, and $\sigma$ is an activation function, e.g., ReLU. Our approach translates this to a tensor model by utilizing the M-product framework. We first introduce a tensor activation function $\hat{\sigma}$ which operates in the transformed space.

Definition 4.1.

Let $\mathcal{A}$ be a tensor and $\sigma$ an elementwise activation function. We define the tensor activation function $\hat{\sigma}$ as $\hat{\sigma}(\mathcal{A}) = \sigma(\mathcal{A} \times_3 \mathbf{M}) \times_3 \mathbf{M}^{-1}$.
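In code, this is just an elementwise nonlinearity applied in the transformed domain, e.g. (a sketch with ReLU, reusing the helpers above):

```python
import numpy as np

def tensor_activation(A, M, sigma=lambda Z: np.maximum(Z, 0.0)):
    """Tensor activation of Definition 4.1: sigma is applied elementwise in the
    transformed domain, and the result is mapped back with the inverse M-transform."""
    return inverse_m_transform(sigma(m_transform(A, M)), M)
```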

We can now define our proposed dynamic graph embedding. Let $\hat{\mathcal{A}} \in \mathbb{R}^{N \times N \times T}$ be a tensor with frontal slices $\hat{\mathcal{A}}_{::t} = \hat{\mathbf{A}}^{(t)}$, where $\hat{\mathbf{A}}^{(t)}$ is the normalization (3) of $\mathbf{A}^{(t)}$. Moreover, let $\mathcal{X} \in \mathbb{R}^{N \times F \times T}$ be a tensor with frontal slices $\mathcal{X}_{::t} = \mathbf{X}^{(t)}$. Finally, let $\mathcal{W} \in \mathbb{R}^{F \times F' \times T}$ be a weight tensor. We define our dynamic graph embedding as $\mathcal{Y} = \hat{\mathcal{A}} \star \mathcal{X} \star \mathcal{W} \in \mathbb{R}^{N \times F' \times T}$. This computation can also be repeated in multiple layers. For example, a 2-layer formulation would be of the form

$\mathcal{Y} = \hat{\mathcal{A}} \star \hat{\sigma}\big(\hat{\mathcal{A}} \star \mathcal{X} \star \mathcal{W}^{(0)}\big) \star \mathcal{W}^{(1)}. \qquad (4)$

One important consideration is how to choose the matrix $\mathbf{M}$ which defines the M-product. For time-varying graphs, we choose $\mathbf{M}$ to be lower triangular and banded, so that each frontal slice $(\hat{\mathcal{A}} \times_3 \mathbf{M})_{::k}$ is a linear combination of the adjacency matrices $\hat{\mathbf{A}}^{(\max(1, k-b+1))}, \ldots, \hat{\mathbf{A}}^{(k)}$, where we refer to $b$ as the "bandwidth" of $\mathbf{M}$. This choice ensures that each frontal slice only contains information from current and past graphs that are close temporally. Specifically, the entries of $\mathbf{M}$ are set to

$\mathbf{M}_{kl} = \begin{cases} 1/\min(b, k) & \text{if } \max(1, k-b+1) \le l \le k, \\ 0 & \text{otherwise}, \end{cases} \qquad (5)$

which implies that $(\hat{\mathcal{A}} \times_3 \mathbf{M})_{::k} = \frac{1}{\min(b, k)} \sum_{l=\max(1, k-b+1)}^{k} \hat{\mathbf{A}}^{(l)}$ for each $k$. Another possibility is to treat $\mathbf{M}$ as a parameter matrix to be learned from the data.
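A minimal sketch constructing a banded lower-triangular mixing matrix of this form (helper name ours):

```python
import numpy as np

def banded_lower_triangular_M(T, b):
    """Build a T x T lower-triangular, banded mixing matrix in which row k (1-indexed)
    averages slices max(1, k-b+1), ..., k, so each transformed frontal slice mixes only
    the current graph and at most b-1 of its predecessors."""
    M = np.zeros((T, T))
    for k in range(1, T + 1):
        lo = max(1, k - b + 1)
        M[k - 1, lo - 1:k] = 1.0 / (k - lo + 1)
    return M
```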

In order to avoid over-parameterization and improve performance, we choose the weight tensor $\mathcal{W}$ (at each layer) such that each of the frontal slices of $\mathcal{W}$ in the transformed domain is the same, i.e., $(\mathcal{W} \times_3 \mathbf{M})_{::k} = \mathbf{W}$ for all $k$. In other words, the parameters in each layer are shared and learned over all the training instances. This significantly reduces the number of parameters to be learned.
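Putting these pieces together, a single layer with the shared transformed-domain weight can be sketched as a small PyTorch module (our reading of the construction, not the authors' reference implementation; the layer works entirely in the transformed domain, which is also how the embedding is consumed below).

```python
import torch
import torch.nn as nn

class TensorGCNLayer(nn.Module):
    """One tensor GCN layer sketch: M-transform the adjacency and feature tensors,
    multiply facewise, apply one shared F x F_out weight matrix to every frontal slice,
    and apply the nonlinearity elementwise in the transformed domain."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(in_features, out_features))

    def forward(self, A_hat, X, M):
        A_t = torch.einsum('ijl,kl->ijk', A_hat, M)     # A_hat x_3 M
        X_t = torch.einsum('ijl,kl->ijk', X, M)         # X x_3 M
        Z_t = torch.einsum('ijk,jfk->ifk', A_t, X_t)    # facewise product per time slice
        Y_t = torch.einsum('ifk,fo->iok', Z_t, self.W)  # shared weight on every slice
        return torch.relu(Y_t)                          # output stays in the transformed domain
```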

An embedding $\mathcal{Y}$ can now be used for various prediction tasks, like link prediction, and edge and node classification. In Section 5, we apply our method to edge classification by using a model similar to that used by Pareja et al. (2019): Given an edge between nodes $m$ and $n$ at time $t$, the predictive model is

$\mathbf{p} = \operatorname{softmax}\big(\mathbf{U} \, [(\mathcal{Y} \times_3 \mathbf{M})_{m:t}, (\mathcal{Y} \times_3 \mathbf{M})_{n:t}]^\top\big), \qquad (6)$

where $(\mathcal{Y} \times_3 \mathbf{M})_{m:t}$ and $(\mathcal{Y} \times_3 \mathbf{M})_{n:t}$ are row vectors, $\mathbf{U} \in \mathbb{R}^{C \times 2F'}$ is a weight matrix, and $C$ is the number of classes. Note that the embedding $\mathcal{Y}$ is first M-transformed before the matrix $\mathbf{U}$ is applied to the appropriate feature vectors. This, combined with the fact that the tensor activation functions are applied elementwise in the transformed domain, allows us to avoid ever needing to apply the inverse M-transform. This approach reduces the computational cost, and has been found to improve performance in the edge classification task.
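A sketch of this classification head (hedged: the exact parameterization may differ from the paper; here `Y_hat` is the embedding already in the transformed domain and `U` maps the concatenated endpoint embeddings to class scores):

```python
import torch

def edge_class_probabilities(Y_hat, U, m, n, t):
    """Class probabilities for the edge (m, n) at time t. Y_hat has shape N x F' x T
    (transformed domain) and U has shape C x 2F'."""
    z = torch.cat([Y_hat[m, :, t], Y_hat[n, :, t]])  # concatenate the two endpoint embeddings
    return torch.softmax(U @ z, dim=0)               # probabilities over the C edge classes
```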

4.1 Theoretical Motivation for TensorGCN

Here, we present the results that establish the connection between the proposed TensorGCN and spectral convolution of tensors, in particular spectral filtering and approximation on dynamic graphs. This is analogous to the graph convolution based on spectral graph theory in the GNNs by Bruna et al. (2013), Defferrard et al. (2016), and Kipf and Welling (2016). All proofs are provided in Appendix D.

Let $\mathcal{L} \in \mathbb{R}^{N \times N \times T}$ be a form of tensor Laplacian defined as $\mathcal{L} = \mathcal{I} - \hat{\mathcal{A}}$. Throughout the remainder of this subsection, we will assume that the adjacency matrices $\mathbf{A}^{(t)}$ are symmetric.

Proposition 4.2.

The tensor $\mathcal{L}$ has an eigendecomposition $\mathcal{L} = \mathcal{Q} \star \mathcal{D} \star \mathcal{Q}^\top$.

Much like the spectrum of a normalized graph Laplacian is contained in $[0, 2]$ (Shuman et al., 2013), the tensor spectrum of $\mathcal{L}$ satisfies a similar property.

Proposition 4.3 (Spectral bound).

The entries of $\mathcal{D} \times_3 \mathbf{M}$ lie in $[0, 2]$.

Following the work by Kilmer et al. (2013), three-dimensional tensors in $\mathbb{R}^{N \times N \times T}$ can be viewed as operators on $N \times T$ matrices, with those matrices "twisted" into tensors in $\mathbb{R}^{N \times 1 \times T}$. With this in mind, we define a tensor variant of the graph Fourier transform.

Definition 4.4 (Tensor-tube M-product).

Let $\mathcal{X} \in \mathbb{R}^{N \times 1 \times T}$ and $\boldsymbol{\theta} \in \mathbb{R}^{1 \times 1 \times T}$. Analogously to the definition of the matrix-scalar product, we define the tensor-tube M-product $\mathcal{X} \star \boldsymbol{\theta}$ via $(\mathcal{X} \star \boldsymbol{\theta})_{i1:} = \mathcal{X}_{i1:} \star \boldsymbol{\theta}$ for each $i$.

Definition 4.5 (Tensor graph Fourier transform).

Let $\mathcal{X} \in \mathbb{R}^{N \times F \times T}$ be a tensor. We define its tensor graph Fourier transform as $\hat{\mathcal{X}} = \mathcal{Q}^\top \star \mathcal{X}$.

This is analogous to the definition of the matrix graph Fourier transform, and it defines a convolution-like operation for tensors similar to spectral graph convolution (Shuman et al., 2013; Bruna et al., 2013). Each lateral slice $\mathcal{X}_{:j:}$ is expressible in terms of the set of lateral slices $\{\mathcal{Q}_{:i:}\}_{i=1}^{N}$ as follows:

$\mathcal{X}_{:j:} = \sum_{i=1}^{N} \mathcal{Q}_{:i:} \star (\mathcal{Q}^\top \star \mathcal{X})_{ij:}, \qquad (7)$

where each $(\mathcal{Q}^\top \star \mathcal{X})_{ij:}$ can be considered a tubal scalar. In fact, the lateral slices $\mathcal{Q}_{:i:}$ form a basis for the set $\mathbb{R}^{N \times 1 \times T}$ with product $\star$; see Appendix D for further details.

Definition 4.6 (Tensor spectral graph filtering).

Given a signal $\mathcal{X} \in \mathbb{R}^{N \times 1 \times T}$ and a function $g$, we define the tensor spectral graph filtering of $\mathcal{X}$ with respect to $g$ as

$\mathcal{Y} = g(\mathcal{L}) \star \mathcal{X}, \qquad (8)$

where

$g(\mathcal{L}) = \mathcal{Q} \star g(\mathcal{D}) \star \mathcal{Q}^\top. \qquad (9)$

In order to avoid the computation of an eigendecomposition, Defferrard et al. (2016) use a polynomial to approximate the filter function. We take a similar approach, and approximate $g(\mathcal{D})$ with an M-product polynomial. For this approximation to make sense, we impose additional structure on $g(\mathcal{D})$.

Assumption 4.7.

Assume that $g(\mathcal{D})$ is defined as

$g(\mathcal{D}) = \hat{g}(\mathcal{D} \times_3 \mathbf{M}) \times_3 \mathbf{M}^{-1}, \qquad (10)$

where $\hat{g}$ is defined elementwise as $\hat{g}(\mathcal{D} \times_3 \mathbf{M})_{iik} = g_k\big((\mathcal{D} \times_3 \mathbf{M})_{iik}\big)$ with each $g_k : \mathbb{R} \rightarrow \mathbb{R}$ continuous.

Proposition 4.8.

Suppose $g(\mathcal{D})$ satisfies Assumption 4.7. For any $\varepsilon > 0$, there exists an integer $K$ and a set $\{\boldsymbol{\theta}_\kappa\}_{\kappa=0}^{K} \subset \mathbb{R}^{1 \times 1 \times T}$ such that

$\Big\| g(\mathcal{D}) - \sum_{\kappa=0}^{K} \mathcal{D}^{\star \kappa} \star \boldsymbol{\theta}_\kappa \Big\|_F \le \varepsilon, \qquad (11)$

where $\|\cdot\|_F$ is the tensor Frobenius norm, and where $\mathcal{D}^{\star \kappa}$ is the M-product of $\kappa$ instances of $\mathcal{D}$, with the convention that $\mathcal{D}^{\star 0} = \mathcal{I}$.

As in the work of Defferrard et al. (2016), a tensor polynomial approximation allows us to approximate $\mathcal{Y}$ in (8) without computing the eigendecomposition of $\mathcal{L}$:

$\mathcal{Y} \approx \sum_{\kappa=0}^{K} \mathcal{L}^{\star \kappa} \star \mathcal{X} \star \boldsymbol{\theta}_\kappa. \qquad (12)$

All that is necessary is to compute tensor powers of $\mathcal{L}$. We can also define tensor polynomial analogs of the Chebyshev polynomials and do the approximation in (12) in terms of those instead of the tensor monomials $\mathcal{L}^{\star \kappa}$. This is not necessary for the purposes of this paper. Instead, we note that if a degree-one approximation is used, the computation in (12) becomes

$\mathcal{Y} \approx \mathcal{X} \star \boldsymbol{\theta}_0 + \mathcal{L} \star \mathcal{X} \star \boldsymbol{\theta}_1. \qquad (13)$

Setting $\boldsymbol{\theta} := \boldsymbol{\theta}_0 = -\boldsymbol{\theta}_1$, which is analogous to the parameter choice made in the degree-one approximation by Kipf and Welling (2016), we get

$\mathcal{Y} \approx (\mathcal{I} - \mathcal{L}) \star \mathcal{X} \star \boldsymbol{\theta} = \hat{\mathcal{A}} \star \mathcal{X} \star \boldsymbol{\theta}. \qquad (14)$

If we let $\mathcal{X}$ contain $F$ signals, i.e., $\mathcal{X} \in \mathbb{R}^{N \times F \times T}$, and apply $F'$ filters, (14) becomes

$\mathcal{Y} \approx \hat{\mathcal{A}} \star \mathcal{X} \star \boldsymbol{\Theta}, \qquad (15)$

where $\boldsymbol{\Theta} \in \mathbb{R}^{F \times F' \times T}$. This is precisely our embedding model, with $\boldsymbol{\Theta}$ replaced by a learnable parameter tensor $\mathcal{W}$.

5 Numerical Experiments

Here, we present results for edge classification on four datasets (links to the datasets are provided in Appendix B): the Bitcoin Alpha and OTC transaction datasets (Kumar et al., 2016), the Reddit body hyperlink dataset (Kumar et al., 2018), and a chess results dataset (Kunegis, 2013). The bitcoin datasets consist of transaction histories for users on two different platforms. Each node is a user, and each directed edge indicates a transaction and is labeled with an integer between −10 and 10 which indicates the sender's trust for the receiver. We convert these labels to two classes: positive (trustworthy) and negative (untrustworthy). The Reddit dataset is built from hyperlinks from one subreddit to another. Each node represents a subreddit, and each directed edge is an interaction, labeled −1 for a hostile interaction or +1 for a friendly interaction. We only consider those subreddits which have a total of 20 interactions or more. In the chess dataset, each node is a player, and each directed edge represents a match with the source node being the white player and the target node being the black player. Each edge is labeled −1 for a black victory, 0 for a draw, and +1 for a white victory. Table 1 summarizes the statistics for the different datasets.

Dataset   Nodes   Edges   Graphs (T)   Time window length   Classes
Bitcoin OTC 6,005 35,569 135 14 days 2
Bitcoin Alpha 7,604 24,173 135 14 days 2
Reddit 3,818 163,008 86 14 days 2
Chess 7,301 64,958 100 31 days 3
Table 1: Dataset statistics.

The data is temporally partitioned into $T$ graphs, with each graph containing data from a particular time window. Both $T$ and the time window length can vary between datasets. For each node-time pair in these graphs, we compute the number of outgoing and incoming edges and use these two numbers as features (see the sketch after Table 2). The adjacency tensor $\hat{\mathcal{A}}$ is then constructed as described in Section 4. The frontal slices of $\hat{\mathcal{A}}$ and $\mathcal{X}$ are divided into $T_{\text{train}}$ training slices, $T_{\text{val}}$ validation slices, and $T_{\text{test}}$ testing slices, which come sequentially after each other; see Figure 2 and Table 2.

Figure 2: Partitioning of $\hat{\mathcal{A}}$ into training, validation and testing data.

Dataset   $T_{\text{train}}$   $T_{\text{val}}$   $T_{\text{test}}$   Performance metric
Bitcoin OTC 95 20 20 F1 score
Bitcoin Alpha 95 20 20 F1 score
Reddit 66 10 10 F1 score
Chess 80 10 10 Accuracy
Table 2: Partitioning and performance metric for each dataset.
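The two node features described above (the numbers of outgoing and incoming edges) can be computed directly from the adjacency slices; the following sketch (binary edge counts assumed) shows one way to do it.

```python
import numpy as np

def degree_features(A):
    """From an N x N x T adjacency tensor, build an N x 2 x T feature tensor whose two
    features are the number of outgoing and incoming edges per node and time step."""
    out_deg = (A != 0).sum(axis=1)  # N x T: edges leaving each node
    in_deg = (A != 0).sum(axis=0)   # N x T: edges entering each node
    return np.stack([out_deg, in_deg], axis=1).astype(float)
```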

Since the adjacency matrices corresponding to the graphs are very sparse for these datasets, we apply the same technique as Pareja et al. (2019) and add the entries of each frontal slice $\hat{\mathcal{A}}_{::t}$ to the following $l - 1$ frontal slices, where we refer to $l$ as the "edge life." Note that this only affects $\hat{\mathcal{A}}$, and that the added edges are not treated as real edges in the classification problem.
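A sketch of this edge-life preprocessing, under our reading that each slice's entries are carried forward into the next $l - 1$ slices of the adjacency tensor only:

```python
import numpy as np

def apply_edge_life(A, life):
    """Add the entries of each frontal slice A[:, :, t] to the following life - 1 slices.
    Only the adjacency tensor is modified; edge labels are left untouched."""
    T = A.shape[2]
    A_out = A.astype(float).copy()
    for t in range(T):
        for s in range(t + 1, min(t + life, T)):
            A_out[:, :, s] += A[:, :, t]
    return A_out
```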

The bitcoin and Reddit datasets are heavily skewed, with about 90% of edges labeled positively, and the remaining labeled negatively. Since the negative instances are more interesting to identify (e.g. to prevent financial fraud or online hostility), we use the F1 score to evaluate the experiments on these datasets, treating the negative edges as the ones we want to identify. The classes are more well-balanced in the chess dataset, so we use accuracy to evaluate those experiments.

We use a 1-layer architecture, i.e., the embedding $\mathcal{Y} = \hat{\mathcal{A}} \star \mathcal{X} \star \mathcal{W}$, for training. When computing the embeddings for the validation and testing data, we still need $T_{\text{train}}$ frontal slices of $\hat{\mathcal{A}}$ and $\mathcal{X}$, which we get by using a sliding window of slices. This is illustrated in Figure 2, where the green, blue and red blocks show the frontal slices used when computing the embeddings for the training, validation and testing data, respectively. The embeddings for the validation and testing data are computed from the correspondingly shifted windows of slices. Preliminary experiments with 2-layer architectures did not show convincing improvements in performance. We believe this is because the datasets only have two features, so a 1-layer architecture is sufficient for extracting the relevant information in the data. For training, we use the cross entropy loss function:

$\ell = -\sum_{(m, n, t)} \sum_{c=1}^{C} \alpha_c \, \mathbf{y}_c^{(mnt)} \log \mathbf{p}_c^{(mnt)}, \qquad (16)$

where the outer sum runs over the training edges, $\mathbf{y}^{(mnt)}$ is a one-hot vector encoding the true class of the edge $(m, n)$ at time $t$, $\mathbf{p}^{(mnt)}$ is the predicted class distribution in (6), and $\boldsymbol{\alpha}$ is a vector summing to 1 which contains the weight of each class. Since the bitcoin and Reddit datasets are so skewed, we weigh the minority class more heavily in the loss function for those datasets, and treat $\boldsymbol{\alpha}$ as a hyperparameter; see Appendix C for details.
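In PyTorch, a class-weighted cross entropy in the spirit of (16) can be set up as follows (a sketch; `w_minority` plays the role of the minority-class weight discussed above, and the class ordering is our assumption).

```python
import torch
import torch.nn as nn

def make_weighted_loss(w_minority):
    """Cross entropy with class weights (w_minority, 1 - w_minority), where index 0 is
    assumed to be the negative (minority) class and index 1 the positive (majority) class.
    Note that nn.CrossEntropyLoss expects unnormalized class scores and integer labels."""
    weights = torch.tensor([w_minority, 1.0 - w_minority])
    return nn.CrossEntropyLoss(weight=weights)

# Example: weigh the minority class heavily on the skewed bitcoin and Reddit datasets.
loss_fn = make_weighted_loss(w_minority=0.9)
```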

The experiments are implemented in PyTorch with some preprocessing done in Matlab. Our code will eventually be made available at https://github.com/OsmanMalik. In the experiments, we use a fixed edge life $l$, a fixed bandwidth $b$, and 6 output features. Since the graphs in the considered datasets are directed, we also investigate the impact of symmetrizing the adjacency matrices, where the symmetrized version of an adjacency matrix is obtained by combining it with its transpose.

We compare our method with three other methods. The first one is a variant of the WD-GCN by Manessi et al. (2020), which they specify in Equation (8a) of their paper. For the LSTM layer in their description, we use 6 output features instead of the larger number they specify. This is to avoid overfitting and to make the method more comparable to ours, which also uses 6 output features. For the final layer, we use the same prediction model as that used by Pareja et al. (2019) for edge classification. The second method is a 1-layer variant of EvolveGCN-H by Pareja et al. (2019). The third method is a simple baseline which uses a 1-layer version of the GCN by Kipf and Welling (2016); it uses the same weight matrix for all temporal graphs. Both EvolveGCN-H and the baseline GCN use 6 output features as well.

Table 3 shows the results when the adjacency matrices have not been symmetrized. In this case, our method outperforms the other methods on the two bitcoin datasets and the chess dataset, with WD-GCN performing best on the Reddit dataset. Table 4 shows the results when the adjacency matrices have been symmetrized. Our method outperforms the other methods on the Bitcoin OTC dataset and the chess dataset, and performs similarly to, but slightly worse than, the best performing methods on the Bitcoin Alpha and Reddit datasets. Overall, symmetrizing the adjacency matrices seems to lead to lower performance.

Method   Bitcoin OTC   Bitcoin Alpha   Reddit   Chess
WD-GCN   0.2062   0.1920   0.2337   0.4311
EvolveGCN   0.3284   0.1609   0.2012   0.4351
GCN   0.3317   0.2100   0.1805   0.4342
TensorGCN (Proposal)   0.3529   0.2331   0.2028   0.4708
Table 3: Results without symmetrizing the adjacency matrices. A higher value is better.

Method   Bitcoin OTC   Bitcoin Alpha   Reddit   Chess
WD-GCN   0.1009   0.1319   0.2173   0.4321
EvolveGCN   0.0913   0.2273   0.1942   0.4091
GCN   0.0769   0.1538   0.1966   0.4369
TensorGCN (Proposal)   0.3103   0.2207   0.2071   0.4713
Table 4: Results when using symmetrized adjacency matrices. A higher value is better.

6 Conclusion

We have presented a novel approach for dynamic graph embedding which leverages the tensor M-product framework. We used it for edge classification in experiments on four real datasets, where it performed competitively compared to state-of-the-art methods. Future research directions include further developing the theoretical guarantees for the method, investigating optimal structure and learning of the transform matrix $\mathbf{M}$, using the method for other prediction tasks, and investigating how to utilize deeper architectures for dynamic graph learning.

References

Appendix A Illustration of the M-transform

We provide some illustrations that show how the M-transform in Definition 3.1 works. Recall that $\mathcal{A} \times_3 \mathbf{M} = \operatorname{fold}(\mathbf{M} \operatorname{unfold}(\mathcal{A}))$. The tensor $\mathcal{A}$ is first unfolded into a $T \times IJ$ matrix, as illustrated in Figure 3. This unfolded tensor is then multiplied from the left by the matrix $\mathbf{M}$, as illustrated in Figure 4; the figure also illustrates the banded lower triangular structure of $\mathbf{M}$. Finally, the output matrix is folded back into a tensor. The fold operation is defined to be the inverse of the unfold operation.

Figure 3: Illustration of the unfold operation applied to a tensor.
Figure 4: Illustration of the matrix product between $\mathbf{M}$ and the unfolded tensor.

Appendix B Links to Datasets

Appendix C Further Details on the Experiment Setup

When partitioning the data into graphs, as described in Section 5, if there are multiple data points corresponding to an edge $(m, n)$ for a given time step $t$, we only add that edge once to the corresponding graph and set its label equal to the sum of the labels of the different data points. For example, if a bitcoin user makes three transactions to another user during time step $t$, then we add a single edge between them to graph $t$ with a label equal to the sum of the three ratings.

For training, we run gradient descent with a learning rate of 0.01 and momentum of 0.9 for 10,000 iterations. Every 100 iterations, we compute and store the performance of the model on the validation data. As mentioned in Section 5, the weight vector $\boldsymbol{\alpha}$ in the loss function (16) is treated as a hyperparameter in the bitcoin and Reddit experiments. Since these datasets all have two edge classes, let $\alpha_1$ and $\alpha_2$ be the weights of the minority (negative) and majority (positive) classes, respectively. Since these parameters add to 1, we have $\alpha_2 = 1 - \alpha_1$. For all methods, we repeat the bitcoin and Reddit experiments once for each value of $\alpha_1$ in a predefined grid. For each model and dataset, we then find the best stored performance of the model on the validation data across all $\alpha_1$ values. We then treat the corresponding model as the trained model, and report its performance on the testing data in Tables 3 and 4. The results for the chess experiment are computed in the same way, but only for a single weight vector $\boldsymbol{\alpha}$.

Appendix D Additional Results and Proofs

Throughout this section, $\|\cdot\|_F$ will denote the Frobenius norm (i.e., the square root of the sum of the squared elements) of a matrix or tensor, and $\|\cdot\|_2$ will denote the matrix spectral norm.

We first provide a few further results that clarify the algebraic properties of the M-product. Let $\mathbb{R}^{1 \times 1 \times T}$ denote the set of tubes, and let $\mathbb{R}^{N \times 1 \times T}$ denote the set of lateral slices. Under the M-product framework, the set $\mathbb{R}^{1 \times 1 \times T}$ plays a role similar to that played by scalars in matrix algebra. With this in mind, an element of $\mathbb{R}^{N \times 1 \times T}$ can be seen as a length-$N$ vector consisting of tubal elements of length $T$. Propositions D.1 and D.2 make this more precise.

Proposition D.1 (Proposition 4.2 in Kernfeld et al. (2015)).

The set $\mathbb{R}^{1 \times 1 \times T}$ with the product $\star$, which is denoted by $(\mathbb{R}^{1 \times 1 \times T}, \star)$, is a commutative ring with identity.

Proposition D.2 (Theorem 4.1 in Kernfeld et al. (2015)).

The set $\mathbb{R}^{N \times 1 \times T}$ with the product $\star$, which is denoted by $(\mathbb{R}^{N \times 1 \times T}, \star)$, is a free module over the ring $(\mathbb{R}^{1 \times 1 \times T}, \star)$.

A free module is similar to a vector space. Like a vector space, it has a basis. Proposition D.3 shows that the lateral slices of $\mathcal{Q}$ in the tensor eigendecomposition form a basis for $(\mathbb{R}^{N \times 1 \times T}, \star)$, similarly to how the eigenvectors in a matrix eigendecomposition form a basis.

Proposition D.3.

The lateral slices $\mathcal{Q}_{:i:}$ of $\mathcal{Q}$ in Definition 3.8 form a basis for $(\mathbb{R}^{N \times 1 \times T}, \star)$.

Proof.

Let $\mathcal{X} \in \mathbb{R}^{N \times 1 \times T}$. Note that

$\mathcal{X} = \mathcal{Q} \star \mathcal{Q}^\top \star \mathcal{X} = \sum_{i=1}^{N} \mathcal{Q}_{:i:} \star \boldsymbol{\theta}_i, \qquad (17)$

where $\boldsymbol{\theta}_i = (\mathcal{Q}^\top \star \mathcal{X})_{i1:}$. So the lateral slices of $\mathcal{Q}$ are a generating set for $(\mathbb{R}^{N \times 1 \times T}, \star)$. Now suppose

$\sum_{i=1}^{N} \mathcal{Q}_{:i:} \star \boldsymbol{\theta}_i = 0 \qquad (18)$

for some tubes $\{\boldsymbol{\theta}_i\}_{i=1}^{N}$. Then $\mathcal{Q} \star \boldsymbol{\Theta} = 0$, where $\boldsymbol{\Theta} \in \mathbb{R}^{N \times 1 \times T}$ has tubes $\boldsymbol{\Theta}_{i1:} = \boldsymbol{\theta}_i$, and consequently

$(\mathcal{Q} \times_3 \mathbf{M})_{::k} (\boldsymbol{\Theta} \times_3 \mathbf{M})_{::k} = 0 \quad \text{for each } k. \qquad (19)$

Since each frontal face of $\mathcal{Q} \times_3 \mathbf{M}$ is an invertible matrix, this implies that each frontal face of $\boldsymbol{\Theta} \times_3 \mathbf{M}$ is zero, and hence each $\boldsymbol{\theta}_i = 0$. So the lateral slices of $\mathcal{Q}$ are also linearly independent in $(\mathbb{R}^{N \times 1 \times T}, \star)$. ∎

D.1 Proofs of Propositions in the Main Text

Proof of Proposition 4.2.

Since each adjacency matrix $\mathbf{A}^{(t)}$, and hence each $\hat{\mathbf{A}}^{(t)}$, is symmetric, each frontal slice $(\hat{\mathcal{A}} \times_3 \mathbf{M})_{::k}$ is also symmetric. Consequently,

$(\mathcal{L} \times_3 \mathbf{M})_{::k} = \mathbf{I} - (\hat{\mathcal{A}} \times_3 \mathbf{M})_{::k} \qquad (20)$

is symmetric for each $k$, so each frontal slice of $\mathcal{L} \times_3 \mathbf{M}$ is symmetric, and $\mathcal{L}$ therefore has a tensor eigendecomposition. ∎

Proof of Proposition 4.3.

Each $\hat{\mathbf{A}}^{(t)}$ has a spectrum contained in $[-1, 1]$. Since $\hat{\mathbf{A}}^{(t)}$ is symmetric, it follows that $\|\hat{\mathbf{A}}^{(t)}\|_2 \le 1$. Consequently,

$\big\| (\hat{\mathcal{A}} \times_3 \mathbf{M})_{::k} \big\|_2 = \Big\| \sum_{l} \mathbf{M}_{kl} \hat{\mathbf{A}}^{(l)} \Big\|_2 \le \sum_{l} |\mathbf{M}_{kl}| \, \big\| \hat{\mathbf{A}}^{(l)} \big\|_2 \le 1, \qquad (21)$

where we used the fact that $\sum_l |\mathbf{M}_{kl}| = 1$. So since the frontal slices $(\hat{\mathcal{A}} \times_3 \mathbf{M})_{::k}$ are symmetric, they each have a spectrum in $[-1, 1]$. It follows that each frontal slice

$(\mathcal{L} \times_3 \mathbf{M})_{::k} = \mathbf{I} - (\hat{\mathcal{A}} \times_3 \mathbf{M})_{::k} \qquad (22)$

has a spectrum contained in $[0, 2]$, which means that the entries of $\mathcal{D} \times_3 \mathbf{M}$ all lie in $[0, 2]$. ∎

Lemma D.4.

Let $\mathcal{A} \in \mathbb{R}^{I \times J \times T}$ and let $\mathbf{M} \in \mathbb{R}^{T \times T}$ be invertible. Then

$\| \mathcal{A} \times_3 \mathbf{M}^{-1} \|_F \le \| \mathbf{M}^{-1} \|_2 \, \| \mathcal{A} \|_F. \qquad (23)$

Proof.

We have

$\| \mathcal{A} \times_3 \mathbf{M}^{-1} \|_F = \| \mathbf{M}^{-1} \operatorname{unfold}(\mathcal{A}) \|_F \le \| \mathbf{M}^{-1} \|_2 \, \| \operatorname{unfold}(\mathcal{A}) \|_F = \| \mathbf{M}^{-1} \|_2 \, \| \mathcal{A} \|_F, \qquad (24)$

where the inequality is a well-known relation that holds for all matrices. ∎

Proof of Proposition 4.8.

By the Weierstrass approximation theorem, there exists an integer $K$ and a set $\{\hat{\boldsymbol{\theta}}_\kappa\}_{\kappa=0}^{K} \subset \mathbb{R}^{1 \times 1 \times T}$ such that for all $k$ and all $x \in [0, 2]$,

$\Big| g_k(x) - \sum_{\kappa=0}^{K} (\hat{\boldsymbol{\theta}}_\kappa)_{11k} \, x^{\kappa} \Big| \le \frac{\varepsilon}{\sqrt{N T} \, \| \mathbf{M}^{-1} \|_2}. \qquad (25)$

Let $\boldsymbol{\theta}_\kappa = \hat{\boldsymbol{\theta}}_\kappa \times_3 \mathbf{M}^{-1}$. Note that if $i = j$, then

$\Big( \big( g(\mathcal{D}) - \sum_{\kappa=0}^{K} \mathcal{D}^{\star \kappa} \star \boldsymbol{\theta}_\kappa \big) \times_3 \mathbf{M} \Big)_{iik} = g_k\big( (\mathcal{D} \times_3 \mathbf{M})_{iik} \big) - \sum_{\kappa=0}^{K} (\hat{\boldsymbol{\theta}}_\kappa)_{11k} \, (\mathcal{D} \times_3 \mathbf{M})_{iik}^{\kappa}, \qquad (26)$

and the corresponding off-diagonal entries are zero, since $\mathcal{D}$ is f-diagonal. So

$\Big\| g(\mathcal{D}) - \sum_{\kappa=0}^{K} \mathcal{D}^{\star \kappa} \star \boldsymbol{\theta}_\kappa \Big\|_F^2 \le \| \mathbf{M}^{-1} \|_2^2 \, \Big\| \Big( g(\mathcal{D}) - \sum_{\kappa=0}^{K} \mathcal{D}^{\star \kappa} \star \boldsymbol{\theta}_\kappa \Big) \times_3 \mathbf{M} \Big\|_F^2 \le \varepsilon^2, \qquad (27)$

where the first inequality follows from Lemma D.4, and the second follows from (25), (26), and Proposition 4.3, since the transformed tensor has at most $N T$ nonzero entries. Taking square roots completes the proof. ∎