1 Introduction
Graphs are popular data structures used to effectively represent interactions and structural relationships between entities in structured data domains. Inspired by the success of deep neural networks for learning representations in the image and language domains, the application of neural networks to graph representation learning has recently attracted much interest. A number of graph neural network (GNN) architectures have been explored in the contemporary literature for a variety of graph-related tasks and applications (Hamilton et al., 2017; Seo et al., 2018; Chen et al., 2018; Zhou et al., 2018; Wu et al., 2019). Methods based on graph convolution filters, which extend convolutional neural networks (CNNs) to irregular graph domains, are popular (Bruna et al., 2013; Defferrard et al., 2016; Kipf and Welling, 2016). Most of these GNN models operate on a given, static graph.
In many real-world applications, the underlying graph changes over time, and learning representations of such dynamic graphs is essential. Examples include analyzing social networks (Berger-Wolf and Saia, 2006), predicting collaboration in citation networks (Leskovec et al., 2005), detecting fraud and crime in financial networks (Weber et al., 2018; Pareja et al., 2019), traffic control (Zhao et al., 2019), and understanding neuronal activities in the brain (De Vico Fallani et al., 2014). In such dynamic settings, the temporal interdependence in the graph connections and features also plays a substantial role. However, efficient GNN methods that handle time-varying graphs and capture the temporal correlations are lacking.
By a dynamic graph, we mean a sequence of graphs $(V, \mathbf{A}^{(t)}, \mathbf{X}^{(t)})$, $t \in \{1, 2, \ldots, T\}$, with a fixed set $V$ of $N$ nodes, adjacency matrices $\mathbf{A}^{(t)} \in \mathbb{R}^{N \times N}$, and graph feature matrices $\mathbf{X}^{(t)} \in \mathbb{R}^{N \times F}$, where $\mathbf{X}^{(t)}_{n:} \in \mathbb{R}^{F}$ is the feature vector consisting of $F$ features associated with node $n$ at time $t$. The graphs can be weighted, and directed or undirected. They can also have additional properties like (time-varying) node and edge classes, which would be stored in a separate structure. Suppose we only observe the first $T'$ graphs in the sequence, where $T'$ is less than $T$. The goal of our method is to use these observations to predict some property of the remaining graphs. In this paper, we use it for edge classification. Other potential applications are node classification and edge/link prediction.
In recent years, tensor constructs have been explored to effectively process high-dimensional data, in order to better leverage the multidimensional structure of such data
(Kolda and Bader, 2009). Tensor-based approaches have been shown to perform well in many image and video processing applications (Hao et al., 2013; Kilmer et al., 2013; Martin et al., 2013; Zhang et al., 2014; Zhang and Aeron, 2016; Lu et al., 2016; Newman et al., 2018). A number of tensor-based neural networks have also been investigated to extract and learn multidimensional representations, e.g. methods based on tensor decomposition (Phan and Cichocki, 2010), tensor-trains (Novikov et al., 2015; Stoudenmire and Schwab, 2016), and tensor-factorized neural networks (Chien and Bao, 2017). Recently, a new tensor framework called the tensor M-product framework (Braman, 2010; Kilmer and Martin, 2011; Kernfeld et al., 2015) was proposed that extends matrix-based theory to high-dimensional architectures.
In this paper, we propose a novel tensor variant of the popular graph convolutional network (GCN) architecture (Kipf and Welling, 2016), which we call TensorGCN. It captures correlation over time by leveraging the tensor M-product framework. The flexibility and matrix mimeticability of the framework help us adapt the GCN architecture to tensor space. Figure 1 illustrates our method at a high level: First, the time-varying adjacency matrices and feature matrices
of the dynamic graph are aggregated into an adjacency tensor and a feature tensor, respectively. These tensors are then fed into our TensorGCN, which computes an embedding that can be used for a variety of tasks, such as link prediction, and edge and node classification. GCN architectures are motivated by graph convolution filtering, i.e., applying filters/functions to the graph Laplacian (in turn its eigenvalues) (Bruna et al., 2013), and we establish a similar connection between TensorGCN and spectral filtering of tensors. Experimental results on real datasets illustrate the performance of our method for the edge classification task on dynamic graphs. Elements of our method can also be used as a preprocessing step for other dynamic graph methods.
2 Related Work
The idea of using graph convolution based on spectral graph theory for GNNs was first introduced by Bruna et al. (2013). Defferrard et al. (2016) then proposed Chebnet, where the spectral filter was approximated by Chebyshev polynomials in order to make it faster and localized. Kipf and Welling (2016) presented the simplified GCN, a degree-one polynomial approximation of Chebnet, in order to speed up computation further and improve the performance. There are many other works that deal with GNNs when the graph and features are fixed/static; see the review papers by Zhou et al. (2018) and Wu et al. (2019) and references therein. These methods cannot be directly applied to the dynamic setting we consider. Seo et al. (2018) devised the Graph Convolutional Recurrent Network for graphs with time-varying features. However, this method assumes that the edges are fixed over time, and is not applicable in our setting. Wang et al. (2018) proposed a method called EdgeConv, which is a neural network (NN) approach that applies convolution operations on static graphs in a dynamic fashion. Their approach is not applicable when the graph itself is dynamic. Zhao et al. (2019) developed a temporal GCN method called T-GCN, which they apply to traffic prediction. Their method assumes the graph remains fixed over time, and only the features vary.
The set of methods most relevant to our setting of learning embeddings of dynamic graphs uses combinations of GNNs and recurrent architectures (RNNs) to capture the graph structure and handle time dynamics, respectively. The approach in Manessi et al. (2020) uses Long Short-Term Memory (LSTM), a recurrent network, in order to handle time variations along with GNNs. They design architectures for semi-supervised node classification and for supervised graph classification.
Pareja et al. (2019) presented a variant of GCN called EvolveGCN, where Gated Recurrent Units (GRU) and LSTMs are coupled with a GCN to handle dynamic graphs. This approach is currently the state of the art. However, it is based on a heuristic RNN/GRU mechanism, which is not theoretically viable, and does not harness a tensor algebraic framework to incorporate time-varying information.
Newman et al. (2018) present a tensor NN which utilizes the M-product tensor framework. Their approach can be applied to image and other high-dimensional data that lie on regular grids, and differs from ours since we consider data on dynamic graphs.
3 Tensor M-Product Framework
Here, we cover the necessary preliminaries on tensors and the M-product framework. For a more general introduction to tensors, we refer the reader to the review paper by Kolda and Bader (2009). In this paper, a tensor is a three-dimensional array of real numbers denoted by boldface Euler script letters, e.g. $\mathcal{X} \in \mathbb{R}^{I \times J \times T}$. Matrices are denoted by bold uppercase letters, e.g. $\mathbf{X}$; vectors are denoted by bold lowercase letters, e.g. $\mathbf{x}$; and scalars are denoted by lowercase letters, e.g. $x$. An element at position $(i, j, t)$ in a tensor is denoted by subscripts, e.g. $\mathcal{X}_{ijt}$, with similar notation for elements of matrices and vectors. A colon will denote all elements along that dimension; $\mathbf{X}_{i:}$ denotes the $i$th row of the matrix $\mathbf{X}$, and $\mathcal{X}_{::k}$ denotes the $k$th frontal slice of $\mathcal{X}$. The vectors $\mathcal{X}_{ij:}$ are called the tubes of $\mathcal{X}$.
The framework we consider relies on a new definition of the product of two tensors, called the M-product (Braman, 2010; Kilmer and Martin, 2011; Kilmer et al., 2013; Kernfeld et al., 2015). A distinguishing feature of this framework is that the M-product of two three-dimensional tensors is also three-dimensional, which is not the case for e.g. tensor contractions (Bishop and Goldberg, 2012). It allows one to elegantly generalize many classical numerical methods from linear algebra, and has been applied e.g. in neural networks (Newman et al., 2018), imaging (Kilmer et al., 2013; Martin et al., 2013; Semerci et al., 2014), facial recognition (Hao et al., 2013), and tensor completion and denoising (Zhang et al., 2014; Zhang and Aeron, 2016; Lu et al., 2016). Although the framework was originally developed for three-dimensional tensors, which is sufficient for our purposes, it has been extended to handle tensors of dimension greater than three (Martin et al., 2013). The following Definitions 3.1–3.3 describe the M-product.
Definition 3.1 (M-transform).
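Let $\mathbf{M} \in \mathbb{R}^{T \times T}$ be a mixing matrix. The M-transform of a tensor $\mathcal{X} \in \mathbb{R}^{I \times J \times T}$ is denoted by $\mathcal{X} \times_3 \mathbf{M} \in \mathbb{R}^{I \times J \times T}$ and defined elementwise as
$$(\mathcal{X} \times_3 \mathbf{M})_{ijt} = \sum_{k=1}^{T} \mathbf{M}_{tk} \mathcal{X}_{ijk}. \tag{1}$$
We say that $\mathcal{X} \times_3 \mathbf{M}$ is in the transformed space. Note that if $\mathbf{M}$ is invertible, then $(\mathcal{X} \times_3 \mathbf{M}) \times_3 \mathbf{M}^{-1} = \mathcal{X}$. Consequently, $\mathcal{X} \times_3 \mathbf{M}^{-1}$ is the inverse M-transform of $\mathcal{X}$. The definition in (1) may also be written in matrix form as $\mathcal{X} \times_3 \mathbf{M} = \operatorname{fold}(\mathbf{M} \operatorname{unfold}(\mathcal{X}))$, where the unfold operation takes the tubes of $\mathcal{X}$ and stacks them as columns into a $T \times IJ$ matrix, and $\operatorname{fold}(\operatorname{unfold}(\mathcal{X})) = \mathcal{X}$. Appendix A provides illustrations of how the M-transform works.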
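To make Definition 3.1 concrete, the following NumPy sketch applies a mixing matrix along the third mode of a small random tensor and checks that the elementwise form agrees with the fold/unfold matrix form. The function names `m_transform`, `unfold`, and `fold`, the tensor sizes, and the random test matrix are illustrative assumptions, not a description of the code used in the experiments.

```python
import numpy as np

def m_transform(X, M):
    """Apply the M-transform along the third (time) mode of X.

    X has shape (I, J, T) and M has shape (T, T); the result has
    entries out[i, j, t] = sum_k M[t, k] * X[i, j, k].
    """
    return np.einsum("tk,ijk->ijt", M, X)

def unfold(X):
    """Stack the tubes X[i, j, :] as columns of a T x (I*J) matrix."""
    return X.reshape(-1, X.shape[2]).T

def fold(mat, shape):
    """Inverse of unfold for a tensor of the given (I, J, T) shape."""
    return mat.T.reshape(shape)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 5))
# Hypothetical mixing matrix; the shift by 3*I keeps it well conditioned.
M = rng.standard_normal((5, 5)) + 3.0 * np.eye(5)

X_hat = m_transform(X, M)
# The matrix form fold(M @ unfold(X)) agrees with the elementwise definition.
assert np.allclose(X_hat, fold(M @ unfold(X), X.shape))
# Applying the inverse M-transform recovers the original tensor.
assert np.allclose(m_transform(X_hat, np.linalg.inv(M)), X)
```

The `einsum` form avoids materializing the unfolded matrix, while the fold/unfold form mirrors the matrix expression in the definition; both give the same result.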
Definition 3.2 (Facewise product).
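Let $\mathcal{X} \in \mathbb{R}^{I \times J \times T}$ and $\mathcal{Y} \in \mathbb{R}^{J \times K \times T}$ be two tensors. The facewise product, denoted by $\mathcal{X} \triangle \mathcal{Y} \in \mathbb{R}^{I \times K \times T}$, is defined facewise as $(\mathcal{X} \triangle \mathcal{Y})_{::t} = \mathcal{X}_{::t} \mathcal{Y}_{::t}$.
Definition 3.3 (M-product).
Let $\mathcal{X} \in \mathbb{R}^{I \times J \times T}$ and $\mathcal{Y} \in \mathbb{R}^{J \times K \times T}$ be two tensors, and let $\mathbf{M} \in \mathbb{R}^{T \times T}$ be an invertible matrix. The M-product, denoted by $\mathcal{X} \star \mathcal{Y} \in \mathbb{R}^{I \times K \times T}$, is defined as
$$\mathcal{X} \star \mathcal{Y} = \big((\mathcal{X} \times_3 \mathbf{M}) \triangle (\mathcal{Y} \times_3 \mathbf{M})\big) \times_3 \mathbf{M}^{-1}. \tag{2}$$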
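The M-product can be sketched in a few lines of NumPy: transform both tensors, multiply matching frontal slices, and transform back. The helper names (`m_transform`, `facewise`, `m_product`), the tensor sizes, and the random mixing matrix below are assumptions made for illustration only.

```python
import numpy as np

def m_transform(X, M):
    """out[i, j, t] = sum_k M[t, k] * X[i, j, k]."""
    return np.einsum("tk,ijk->ijt", M, X)

def facewise(X, Y):
    """Facewise product: X[:, :, t] @ Y[:, :, t] for every frontal slice t."""
    return np.einsum("ijt,jkt->ikt", X, Y)

def m_product(X, Y, M):
    """M-product: transform both tensors, multiply facewise, transform back."""
    M_inv = np.linalg.inv(M)
    return m_transform(facewise(m_transform(X, M), m_transform(Y, M)), M_inv)

rng = np.random.default_rng(1)
T = 4
M = rng.standard_normal((T, T)) + 3.0 * np.eye(T)  # hypothetical invertible M
X = rng.standard_normal((2, 3, T))
Y = rng.standard_normal((3, 5, T))
W = rng.standard_normal((5, 2, T))

Z = m_product(X, Y, M)  # shape (2, 5, T): the result is again a 3D tensor

# Like the matrix product, the M-product is associative.
left = m_product(m_product(X, Y, M), W, M)
right = m_product(X, m_product(Y, W, M), M)
assert np.allclose(left, right)
# With M equal to the identity, the M-product reduces to the facewise product.
assert np.allclose(m_product(X, Y, np.eye(T)), facewise(X, Y))
```

Because all the work happens slice by slice in the transformed space, a chain of M-products only needs one forward transform per input and one inverse transform at the end.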
In the original formulation of the M-product, $\mathbf{M}$ was chosen to be the Discrete Fourier Transform (DFT) matrix, which allows efficient computation using the Fast Fourier Transform (FFT) (Braman, 2010; Kilmer and Martin, 2011; Kilmer et al., 2013). The framework was later extended to arbitrary invertible $\mathbf{M}$ (e.g. discrete cosine and wavelet transforms) (Kernfeld et al., 2015). A benefit of the tensor M-product framework is that many standard matrix concepts can be generalized in a straightforward manner. Definitions 3.4–3.7 extend the matrix concepts of diagonality, identity, transpose and orthogonality to tensors (Braman, 2010; Kilmer et al., 2013).
Definition 3.4 (f-diagonal).
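A tensor $\mathcal{X} \in \mathbb{R}^{N \times N \times T}$ is said to be f-diagonal if each frontal slice $\mathcal{X}_{::t}$ is diagonal.
Definition 3.5 (Identity tensor).
Let $\hat{\mathcal{I}} \in \mathbb{R}^{N \times N \times T}$ be defined facewise as $\hat{\mathcal{I}}_{::t} = \mathbf{I}$, where $\mathbf{I}$ is the matrix identity. The M-product identity tensor $\mathcal{I} \in \mathbb{R}^{N \times N \times T}$ is then defined as $\mathcal{I} = \hat{\mathcal{I}} \times_3 \mathbf{M}^{-1}$.
Definition 3.6 (Tensor transpose).
The transpose of a tensor $\mathcal{X}$ is defined as $\mathcal{X}^\top = \mathcal{Y} \times_3 \mathbf{M}^{-1}$, where $\mathcal{Y}_{::t} = (\mathcal{X} \times_3 \mathbf{M})_{::t}^\top$ for each $t$.
Definition 3.7 (Orthogonal tensor).
A tensor $\mathcal{X} \in \mathbb{R}^{N \times N \times T}$ is said to be orthogonal if $\mathcal{X} \star \mathcal{X}^\top = \mathcal{X}^\top \star \mathcal{X} = \mathcal{I}$.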
Leveraging these concepts, a tensor eigendecomposition can now be defined (Braman, 2010; Kilmer et al., 2013):
Definition 3.8 (Tensor eigendecomposition).
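Let $\mathcal{X} \in \mathbb{R}^{N \times N \times T}$ be a tensor and assume that each frontal slice $(\mathcal{X} \times_3 \mathbf{M})_{::t}$ is symmetric. We can then eigendecompose these as $(\mathcal{X} \times_3 \mathbf{M})_{::t} = \hat{\mathcal{Q}}_{::t} \hat{\mathcal{D}}_{::t} \hat{\mathcal{Q}}_{::t}^\top$, where $\hat{\mathcal{Q}}_{::t}$ is orthogonal and $\hat{\mathcal{D}}_{::t}$ is diagonal (see e.g. Theorem 8.1.1 in Golub and Van Loan (2013)). The tensor eigendecomposition of $\mathcal{X}$ is then defined as $\mathcal{X} = \mathcal{Q} \star \mathcal{D} \star \mathcal{Q}^\top$, where $\mathcal{Q} = \hat{\mathcal{Q}} \times_3 \mathbf{M}^{-1}$ is orthogonal, and $\mathcal{D} = \hat{\mathcal{D}} \times_3 \mathbf{M}^{-1}$ is f-diagonal.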
4 Tensor Dynamic Graph Embedding
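Our approach is inspired by the first-order GCN by Kipf and Welling (2016) for static graphs, owing to its simplicity and effectiveness. For a graph with adjacency matrix $\mathbf{A} \in \mathbb{R}^{N \times N}$ and feature matrix $\mathbf{X} \in \mathbb{R}^{N \times F}$, a GCN layer takes the form $\mathbf{Y} = \sigma(\tilde{\mathbf{A}} \mathbf{X} \mathbf{W})$, where
$$\tilde{\mathbf{A}} = \tilde{\mathbf{D}}^{-1/2} (\mathbf{A} + \mathbf{I}) \tilde{\mathbf{D}}^{-1/2}, \tag{3}$$
$\tilde{\mathbf{D}}$ is diagonal with $\tilde{\mathbf{D}}_{ii} = 1 + \sum_j \mathbf{A}_{ij}$, $\mathbf{I}$ is the matrix identity, $\mathbf{W} \in \mathbb{R}^{F \times F'}$ is a matrix to be learned when training the NN, and $\sigma$ is an activation function, e.g., ReLU. Our approach translates this to a tensor model by utilizing the M-product framework. We first introduce a tensor activation function $\hat{\sigma}$ which operates in the transformed space.
Definition 4.1.
Let $\mathcal{A}$ be a tensor and $\sigma$ an elementwise activation function. We define the activation function $\hat{\sigma}$ as $\hat{\sigma}(\mathcal{A}) = \sigma(\mathcal{A} \times_3 \mathbf{M}) \times_3 \mathbf{M}^{-1}$.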
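We can now define our proposed dynamic graph embedding. Let $\mathcal{A} \in \mathbb{R}^{N \times N \times T}$ be a tensor with frontal slices $\mathcal{A}_{::t} = \tilde{\mathbf{A}}^{(t)}$, where $\tilde{\mathbf{A}}^{(t)}$ is the normalization of $\mathbf{A}^{(t)}$. Moreover, let $\mathcal{X} \in \mathbb{R}^{N \times F \times T}$ be a tensor with frontal slices $\mathcal{X}_{::t} = \mathbf{X}^{(t)}$. Finally, let $\mathcal{W} \in \mathbb{R}^{F \times F' \times T}$ be a weight tensor. We define our dynamic graph embedding as $\mathcal{Y} = \mathcal{A} \star \mathcal{X} \star \mathcal{W} \in \mathbb{R}^{N \times F' \times T}$. This computation can also be repeated in multiple layers. For example, a 2-layer formulation would be of the form
$$\mathcal{Y} = \mathcal{A} \star \hat{\sigma}(\mathcal{A} \star \mathcal{X} \star \mathcal{W}^{(0)}) \star \mathcal{W}^{(1)}. \tag{4}$$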
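As a rough illustration of how a 2-layer embedding of this kind could be evaluated, the NumPy sketch below works entirely in the transformed space: since the tensor activation is an elementwise activation applied to the M-transformed tensor, the transformed output is a chain of per-slice matrix products with a ReLU in between. Everything concrete here is an assumption for the example: the function names, the sizes, the running-average mixing matrix, and the choice to parameterize the weights directly by their transformed slices `W0_hat` and `W1_hat`.

```python
import numpy as np

def normalize_adjacency(A):
    """GCN normalization D^{-1/2} (A + I) D^{-1/2}, with D_ii = 1 + sum_j A_ij."""
    A_loop = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_loop.sum(axis=1))
    return d_inv_sqrt[:, None] * A_loop * d_inv_sqrt[None, :]

def m_transform(X, M):
    """out[i, j, t] = sum_k M[t, k] * X[i, j, k]."""
    return np.einsum("tk,ijk->ijt", M, X)

def two_layer_embedding_hat(A, X, W0_hat, W1_hat, M):
    """Two-layer embedding, returned in the transformed space (no inverse transform)."""
    A_hat, X_hat = m_transform(A, M), m_transform(X, M)
    hidden = np.einsum("ijt,jkt,klt->ilt", A_hat, X_hat, W0_hat)
    hidden = np.maximum(hidden, 0.0)  # elementwise ReLU in the transformed space
    return np.einsum("ijt,jkt,klt->ilt", A_hat, hidden, W1_hat)

rng = np.random.default_rng(2)
N, F, H, F_out, T = 5, 2, 4, 3, 3  # hypothetical sizes
raw = (rng.random((N, N, T)) < 0.3).astype(float)  # random 0/1 adjacency slices
A = np.stack([normalize_adjacency(raw[:, :, t]) for t in range(T)], axis=2)
X = rng.standard_normal((N, F, T))
M = np.tril(np.ones((T, T))) / np.arange(1, T + 1)[:, None]  # running average over time
W0_hat = rng.standard_normal((F, H, T))
W1_hat = rng.standard_normal((H, F_out, T))

Y_hat = two_layer_embedding_hat(A, X, W0_hat, W1_hat, M)
print(Y_hat.shape)
```

Staying in the transformed space means the inverse of the mixing matrix is never formed, which is convenient when the downstream model also consumes the transformed embedding.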
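One important consideration is how to choose the matrix $\mathbf{M}$ which defines the M-product. For time-varying graphs, we choose $\mathbf{M}$ to be lower triangular and banded so that each frontal slice $(\mathcal{A} \times_3 \mathbf{M})_{::t}$ is a linear combination of the adjacency matrices $\mathcal{A}_{::\max(1, t-b+1)}, \ldots, \mathcal{A}_{::t}$, where we refer to $b$ as the “bandwidth” of $\mathbf{M}$. This choice ensures that each frontal slice $(\mathcal{A} \times_3 \mathbf{M})_{::t}$ only contains information from current and past graphs that are close temporally. Specifically, the entries of $\mathbf{M}$ are set to
$$\mathbf{M}_{tk} = \begin{cases} \dfrac{1}{\min(b, t)} & \text{if } \max(1, t-b+1) \leq k \leq t, \\ 0 & \text{otherwise}, \end{cases} \tag{5}$$
which implies that $\sum_k \mathbf{M}_{tk} = 1$ for each $t$. Another possibility is to treat $\mathbf{M}$ as a parameter matrix to be learned from the data.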
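A small dependency-free sketch of one such banded lower-triangular mixing matrix is given below. It assumes that row $t$ places equal weight $1/\min(b, t)$ on the current and the previous $b - 1$ time steps (fewer at the start of the sequence), which is one choice consistent with the rows summing to one; the function name and the example sizes are hypothetical.

```python
def banded_mixing_matrix(T, b):
    """Lower-triangular banded T x T matrix (as nested lists).

    Row t (1-indexed) averages time steps max(1, t - b + 1), ..., t with
    equal weights, so every row sums to 1.
    """
    M = [[0.0] * T for _ in range(T)]
    for t in range(1, T + 1):
        width = min(b, t)  # number of time steps inside the band
        for k in range(max(1, t - b + 1), t + 1):
            M[t - 1][k - 1] = 1.0 / width
    return M

M = banded_mixing_matrix(T=5, b=3)
for row in M:
    print(" ".join("%.2f" % value for value in row))
```

Since the matrix is lower triangular with a nonzero diagonal, it is invertible, and slice $t$ of a transformed tensor depends only on slices at or before time $t$.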
In order to avoid over-parameterization and improve the performance, we choose the weight tensor $\mathcal{W}$ (at each layer) such that each of the frontal slices of $\mathcal{W}$ in the transformed domain remains the same, i.e., $(\mathcal{W} \times_3 \mathbf{M})_{::t} = (\mathcal{W} \times_3 \mathbf{M})_{::t'}$ for all $t, t'$. In other words, the parameters in each layer are shared and learned over all the training instances. This significantly reduces the number of parameters to be learned.
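An embedding $\mathcal{Y} \in \mathbb{R}^{N \times F' \times T}$ can now be used for various prediction tasks, like link prediction, and edge and node classification. In Section 5, we apply our method for edge classification by using a model similar to that used by Pareja et al. (2019): Given an edge between nodes $m$ and $n$ at time $t$, the predictive model is
$$p(m, n, t) = \operatorname{softmax}\big(\mathbf{U} \, [(\mathcal{Y} \times_3 \mathbf{M})_{m:t}, \; (\mathcal{Y} \times_3 \mathbf{M})_{n:t}]^\top\big), \tag{6}$$
where $(\mathcal{Y} \times_3 \mathbf{M})_{m:t} \in \mathbb{R}^{F'}$ and $(\mathcal{Y} \times_3 \mathbf{M})_{n:t} \in \mathbb{R}^{F'}$ are row vectors, $\mathbf{U} \in \mathbb{R}^{C \times 2F'}$ is a weight matrix, and $C$ is the number of classes. Note that the embedding $\mathcal{Y}$ is first M-transformed before the matrix $\mathbf{U}$ is applied to the appropriate feature vectors. This, combined with the fact that the tensor activation functions are applied elementwise in the transformed domain, allows us to avoid ever needing to apply the inverse M-transform. This approach reduces the computational cost, and has been found to improve performance in the edge classification task.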
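The edge classifier can be sketched as follows: the transformed embeddings of the two endpoint nodes at the given time are concatenated, multiplied by a weight matrix, and passed through a softmax. The function names, the toy embedding vectors, and the weight values below are hypothetical and serve only to illustrate the shape of the computation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(value - shift) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def edge_class_probabilities(y_m, y_n, U):
    """Concatenate the two node embeddings, apply U, and take a softmax.

    y_m, y_n: transformed embeddings (length F') of the source and target node.
    U: weight matrix with one row of length 2 * F' per class.
    """
    features = list(y_m) + list(y_n)
    logits = [sum(u * f for u, f in zip(row, features)) for row in U]
    return softmax(logits)

# Toy example with F' = 2 output features and C = 2 classes.
y_m, y_n = [1.0, 0.0], [0.0, 1.0]
U = [[1.0, 0.0, 0.0, 1.0],
     [0.0, 1.0, 1.0, 0.0]]
probs = edge_class_probabilities(y_m, y_n, U)
print(probs)
```

Concatenating in the order (source, target) keeps the model sensitive to edge direction, which matters for the directed graphs considered in the experiments.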
4.1 Theoretical Motivation for TensorGCN
Here, we present the results that establish the connection between the proposed TensorGCN and spectral convolution of tensors, in particular spectral filtering and approximation on dynamic graphs. This is analogous to the graph convolution based on spectral graph theory in the GNNs by Bruna et al. (2013), Defferrard et al. (2016), and Kipf and Welling (2016). All proofs are provided in Appendix D.
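Let $\mathcal{L} \in \mathbb{R}^{N \times N \times T}$ be a form of tensor Laplacian defined as $\mathcal{L} = \mathcal{I} - \mathcal{A}$. Throughout the remainder of this subsection, we will assume that the adjacency matrices $\mathbf{A}^{(t)}$ are symmetric.
Proposition 4.2.
The tensor $\mathcal{L}$ has an eigendecomposition $\mathcal{L} = \mathcal{Q} \star \mathcal{D} \star \mathcal{Q}^\top$.
Much like the spectrum of a normalized graph Laplacian is contained in $[0, 2]$ (Shuman et al., 2013), the tensor spectrum of $\mathcal{L}$ satisfies a similar property.
Proposition 4.3 (Spectral bound).
The entries of $\mathcal{D} \times_3 \mathbf{M}$ lie in $[0, 2]$.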
Following the work by Kilmer et al. (2013), three-dimensional tensors in can be viewed as operators on matrices, with those matrices “twisted” into tensors in . With this in mind, we define a tensor variant of the graph Fourier transform.
Definition 4.4 (Tensor-tube M-product).
Let and . Analogously to the definition of the matrix-scalar product, we define via .
Definition 4.5 (Tensor graph Fourier transform).
Let be a tensor. We define a tensor graph Fourier transform as .
This is analogous to the definition of the matrix graph Fourier transform. It defines a convolution-like operation for tensors similar to spectral graph convolution (Shuman et al., 2013; Bruna et al., 2013). Each lateral slice is expressible in terms of the set as follows:
(7) 
where each can be considered a tubal scalar. In fact, the lateral slices form a basis for the set with product ; see Appendix D for further details.
Definition 4.6 (Tensor spectral graph filtering).
Given a signal and a function , we define the tensor spectral graph filtering of with respect to as
(8) 
where
(9) 
In order to avoid the computation of an eigendecomposition, Defferrard et al. (2016) use a polynomial to approximate the filter function. We take a similar approach, and approximate with an M-product polynomial. For this approximation to make sense, we impose additional structure on .
Assumption 4.7.
Assume that is defined as
(10) 
where is defined elementwise as with each continuous.
Proposition 4.8.
Suppose satisfies Assumption 4.7. For any , there exists an integer and a set such that
(11) 
where is the tensor Frobenius norm, and where is the M-product of instances of , with the convention that .
As in the work of Defferrard et al. (2016), a tensor polynomial approximation allows us to approximate in (8) without computing the eigendecomposition of :
(12) 
All that is necessary is to compute tensor powers of . We can also define tensor polynomial analogs of the Chebyshev polynomials and do the approximation in (12) in terms of those instead of the tensor monomials . This is not necessary for the purposes of this paper. Instead, we note that if a degreeone approximation is used, the computation in (12) becomes
(13) 
Setting , which is analogous to the parameter choice made in the degreeone approximation by Kipf and Welling (2016), we get
(14) 
If we let contain signals, i.e., , and apply filters, (14) becomes
(15) 
where . This is precisely our embedding model, with replaced by a learnable parameter tensor .
5 Numerical Experiments
Here, we present results for edge classification on four datasets (links to the datasets are provided in Appendix B): the Bitcoin Alpha and OTC transaction datasets (Kumar et al., 2016), the Reddit body hyperlink dataset (Kumar et al., 2018), and a chess results dataset (Kunegis, 2013). The bitcoin datasets consist of transaction histories for users on two different platforms. Each node is a user, and each directed edge indicates a transaction and is labeled with an integer between $-10$ and $10$ which indicates the sender's trust for the receiver. We convert these labels to two classes: positive (trustworthy) and negative (untrustworthy). The Reddit dataset is built from hyperlinks from one subreddit to another. Each node represents a subreddit, and each directed edge is an interaction which is labeled with $-1$ for a hostile interaction or $+1$ for a friendly interaction. We only consider those subreddits which have a total of 20 interactions or more. In the chess dataset, each node is a player, and each directed edge represents a match with the source node being the white player and the target node being the black player. Each edge is labeled $-1$ for a black victory, $0$ for a draw, and $+1$ for a white victory. Table 1 summarizes the statistics for the different datasets.
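Table 1: Dataset statistics.
| Dataset | Nodes | Edges | Graphs ($T$) | Time window length | Classes |
|---|---|---|---|---|---|
| Bitcoin OTC | 6,005 | 35,569 | 135 | 14 days | 2 |
| Bitcoin Alpha | 7,604 | 24,173 | 135 | 14 days | 2 |
| Reddit | 3,818 | 163,008 | 86 | 14 days | 2 |
| Chess | 7,301 | 64,958 | 100 | 31 days | 3 |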
The data is temporally partitioned into $T$ graphs, with each graph containing data from a particular time window. Both $T$ and the time window length can vary between datasets. For each node-time pair $(n, t)$ in these graphs, we compute the number of outgoing and incoming edges and use these two numbers as features. The adjacency tensor $\mathcal{A}$ is then constructed as described in Section 4. The frontal slices of $\mathcal{A}$ are divided into training slices, validation slices, and testing slices, which come sequentially after each other; see Figure 2 and Table 2.
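Table 2: Partitioning of the frontal slices and performance metric for each dataset.
| Dataset | Training slices | Validation slices | Testing slices | Performance metric |
|---|---|---|---|---|
| Bitcoin OTC | 95 | 20 | 20 | F1 score |
| Bitcoin Alpha | 95 | 20 | 20 | F1 score |
| Reddit | 66 | 10 | 10 | F1 score |
| Chess | 80 | 10 | 10 | Accuracy |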
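The feature construction and the sequential split described above can be sketched in plain Python. The edge-list format `(source, target, t)` with 0-based indices and the function names are assumptions for this example; the 95/20/20 split is the one listed for the Bitcoin datasets.

```python
def degree_features(edges, num_nodes, num_steps):
    """Out- and in-degree of every node in every graph.

    edges: iterable of (source, target, t) with 0-based node and time indices.
    Returns feats[t][n] = [out_degree, in_degree] of node n in graph t.
    """
    feats = [[[0, 0] for _ in range(num_nodes)] for _ in range(num_steps)]
    for src, dst, t in edges:
        feats[t][src][0] += 1  # one more outgoing edge for the source
        feats[t][dst][1] += 1  # one more incoming edge for the target
    return feats

def sequential_split(num_train, num_val, num_test):
    """Indices of training, validation and testing slices, in temporal order."""
    train = list(range(num_train))
    val = list(range(num_train, num_train + num_val))
    test = list(range(num_train + num_val, num_train + num_val + num_test))
    return train, val, test

edges = [(0, 1, 0), (0, 2, 0), (2, 1, 1)]  # toy edge list
feats = degree_features(edges, num_nodes=3, num_steps=2)
train, val, test = sequential_split(95, 20, 20)
print(feats[0], len(train), len(val), len(test))
```

Splitting by consecutive blocks of slices, rather than at random, ensures that validation and testing always use graphs that come after the training graphs in time.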
Since the adjacency matrices corresponding to graphs are very sparse for these datasets, we apply the same technique as Pareja et al. (2019) and add the entries of each frontal slice $\mathcal{A}_{::t}$ to the following $l - 1$ frontal slices $\mathcal{A}_{::(t+1)}, \ldots, \mathcal{A}_{::(t+l-1)}$, where we refer to $l$ as the “edge life.” Note that this only affects $\mathcal{A}$, and that the added edges are not treated as real edges in the classification problem.
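The edge-life step can be illustrated with a short NumPy sketch. It assumes that the step is applied to the raw adjacency slices and that each slice is added to the $l - 1$ slices that follow it (truncated at the end of the sequence); the function name `apply_edge_life` and the toy tensor are hypothetical.

```python
import numpy as np

def apply_edge_life(A, life):
    """Add each frontal slice A[:, :, t] to the following life - 1 slices."""
    T = A.shape[2]
    out = A.copy()
    for t in range(T):
        # Slices t + 1, ..., t + life - 1, truncated at the last slice.
        for s in range(t + 1, min(t + life, T)):
            out[:, :, s] += A[:, :, t]
    return out

A = np.zeros((2, 2, 4))
A[0, 1, 0] = 1.0  # a single edge 0 -> 1 observed in the first graph
A_life = apply_edge_life(A, life=3)
print(A_life[0, 1, :])
```

With an edge life of 1 the tensor is left unchanged, so the parameter interpolates between the raw sparse slices and increasingly dense, temporally smoothed ones.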
The bitcoin and Reddit datasets are heavily skewed, with about 90% of edges labeled positively, and the remainder labeled negatively. Since the negative instances are more interesting to identify (e.g. to prevent financial fraud or online hostility), we use the F1 score to evaluate the experiments on these datasets, treating the negative edges as the ones we want to identify. The classes are more well-balanced in the chess dataset, so we use accuracy to evaluate those experiments.
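To see why accuracy is a poor fit for the skewed datasets, the sketch below computes the F1 score with the negative class treated as the class of interest. The function name and the toy labels are illustrative assumptions; a classifier that always predicts the positive class reaches 90% accuracy on the first toy example but has an F1 score of zero for the negative class.

```python
def f1_score(y_true, y_pred, target_label):
    """F1 score with `target_label` treated as the class to be identified."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == target_label and p == target_label)
    fp = sum(1 for t, p in pairs if t != target_label and p == target_label)
    fn = sum(1 for t, p in pairs if t == target_label and p != target_label)
    if tp == 0:
        return 0.0  # no correctly identified target edges
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Skewed toy data: nine positive edges and one negative edge.
y_true = [1] * 9 + [-1]
always_positive = [1] * 10
accuracy = sum(t == p for t, p in zip(y_true, always_positive)) / len(y_true)
print(accuracy, f1_score(y_true, always_positive, target_label=-1))

# A second toy example with one hit, one false alarm and one miss.
print(f1_score([-1, -1, 1, 1, 1], [-1, 1, -1, 1, 1], target_label=-1))
```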
We choose to use an embedding for training. When computing the embeddings for the validation and testing data, we still need frontal slices of , which we get by using a sliding window of slices. This is illustrated in Figure 2, where the green, blue and red blocks show the frontal slices used when computing the embeddings for the training, validation and testing data, respectively. The embeddings for the validation and testing data are and , respectively. Preliminary experiments with 2-layer architectures did not show convincing improvements in performance. We believe this is because the datasets only have two features, so that a 1-layer architecture is sufficient for extracting the relevant information in the data. For training, we use the cross entropy loss function:
$$\text{loss} = -\sum_{t} \sum_{(m,n) \in E_t} \sum_{c=1}^{C} \alpha_c \, f(m,n,t)_c \log\big(p(m,n,t)_c\big), \tag{16}$$
where $E_t$ is the set of edges at time $t$, $f(m,n,t) \in \mathbb{R}^{C}$ is a one-hot vector encoding the true class of the edge $(m,n)$ at time $t$, and $\alpha \in \mathbb{R}^{C}$ is a vector summing to 1 which contains the weight of each class. Since the bitcoin and Reddit datasets are so skewed, we weight the minority class more heavily in the loss function for those datasets, and treat $\alpha$ as a hyperparameter; see Appendix C for details.
The experiments are implemented in PyTorch with some preprocessing done in Matlab. Our code will eventually be made available at https://github.com/OsmanMalik. In the experiments, we use an edge life of , a bandwidth , and 6 output features. Since the graphs in the considered datasets are directed, we also investigate the impact of symmetrizing the adjacency matrices, where the symmetrized version of an adjacency matrix is defined as .
We compare our method with three other methods. The first one is a variant of the WD-GCN by Manessi et al. (2020), which they specify in Equation (8a) of their paper. For the LSTM layer in their description, we use output features instead of . This is to avoid overfitting and make the method more comparable to ours, which uses 6 output features. For the final layer, we use the same prediction model as that used by Pareja et al. (2019) for edge classification. The second method is a 1-layer variant of EvolveGCN-H by Pareja et al. (2019). The third method is a simple baseline which uses a 1-layer version of the GCN by Kipf and Welling (2016). It uses the same weight matrix for all temporal graphs. Both EvolveGCN-H and the baseline GCN use 6 output features as well.
Table 3 shows the results when the adjacency matrices have not been symmetrized. In this case, our method outperforms the other methods on the two bitcoin datasets and the chess dataset, with WD-GCN performing best on the Reddit dataset. Table 4 shows the results when the adjacency matrices have been symmetrized. Our method outperforms the other methods on the Bitcoin OTC dataset and the chess dataset, and performs similarly to but slightly worse than the best performing methods on the Bitcoin Alpha and Reddit datasets. Overall, symmetrizing the adjacency matrices seems to lead to lower performance.
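Table 3: Results without symmetrizing the adjacency matrices (F1 score for the Bitcoin and Reddit datasets, accuracy for the chess dataset).
| Method | Bitcoin OTC | Bitcoin Alpha | Reddit | Chess |
|---|---|---|---|---|
| WD-GCN | 0.2062 | 0.1920 | 0.2337 | 0.4311 |
| EvolveGCN | 0.3284 | 0.1609 | 0.2012 | 0.4351 |
| GCN | 0.3317 | 0.2100 | 0.1805 | 0.4342 |
| TensorGCN (Proposal) | 0.3529 | 0.2331 | 0.2028 | 0.4708 |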
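Table 4: Results with symmetrized adjacency matrices (F1 score for the Bitcoin and Reddit datasets, accuracy for the chess dataset).
| Method | Bitcoin OTC | Bitcoin Alpha | Reddit | Chess |
|---|---|---|---|---|
| WD-GCN | 0.1009 | 0.1319 | 0.2173 | 0.4321 |
| EvolveGCN | 0.0913 | 0.2273 | 0.1942 | 0.4091 |
| GCN | 0.0769 | 0.1538 | 0.1966 | 0.4369 |
| TensorGCN (Proposal) | 0.3103 | 0.2207 | 0.2071 | 0.4713 |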
6 Conclusion
We have presented a novel approach for dynamic graph embedding which leverages the tensor M-product framework. We used it for edge classification in experiments on four real datasets, where it performed competitively compared to state-of-the-art methods. Future research directions include further developing the theoretical guarantees for the method, investigating optimal structure and learning of the transform matrix $\mathbf{M}$, using the method for other prediction tasks, and investigating how to utilize deeper architectures for dynamic graph learning.
References
 Berger-Wolf and Saia (2006) Tanya Y. Berger-Wolf and Jared Saia. A framework for analysis of dynamic social networks. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 523–528. ACM, 2006.
 Bishop and Goldberg (2012) Richard L. Bishop and Samuel I. Goldberg. Tensor Analysis on Manifolds. Courier Corporation, 2012.
 Braman (2010) Karen Braman. Third-order tensors as linear operators on a space of matrices. Linear Algebra and its Applications, 433(7):1241–1253, 2010.
 Bruna et al. (2013) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
 Chen et al. (2018) Jie Chen, Tengfei Ma, and Cao Xiao. FastGCN: Fast learning with graph convolutional networks via importance sampling. In ICLR, 2018.
 Chien and Bao (2017) Jen-Tzung Chien and Yi-Ting Bao. Tensor-factorized neural networks. IEEE Transactions on Neural Networks and Learning Systems, 29(5):1998–2011, 2017.
 De Vico Fallani et al. (2014) Fabrizio De Vico Fallani, Jonas Richiardi, Mario Chavez, and Sophie Achard. Graph analysis of functional brain networks: Practical issues in translational neuroscience. Philosophical Transactions of the Royal Society B: Biological Sciences, 369(1653):20130521, 2014.
 Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
 Golub and Van Loan (2013) Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, Baltimore, 4th edition, 2013. ISBN 9781421407944.
 Hamilton et al. (2017) William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. In NIPS, 2017.
 Hao et al. (2013) Ning Hao, Misha E. Kilmer, Karen Braman, and Randy C. Hoover. Facial recognition using tensor-tensor decompositions. SIAM Journal on Imaging Sciences, 6(1):437–463, 2013.

 Kernfeld et al. (2015) Eric Kernfeld, Misha Kilmer, and Shuchin Aeron. Tensor–tensor products with invertible linear transforms. Linear Algebra and its Applications, 485:545–570, 2015.
 Kilmer and Martin (2011) Misha E. Kilmer and Carla D. Martin. Factorization strategies for third-order tensors. Linear Algebra and its Applications, 435(3):641–658, 2011.
 Kilmer et al. (2013) Misha E. Kilmer, Karen Braman, Ning Hao, and Randy C. Hoover. Third-order tensors as operators on matrices: A theoretical and computational framework with applications in imaging. SIAM Journal on Matrix Analysis and Applications, 34(1):148–172, 2013.
 Kipf and Welling (2016) Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 Kolda and Bader (2009) Tamara G. Kolda and Brett W. Bader. Tensor Decompositions and Applications. SIAM Review, 51(3):455–500, August 2009. ISSN 0036-1445. doi: 10.1137/07070111X.
 Kumar et al. (2016) Srijan Kumar, Francesca Spezzano, V. S. Subrahmanian, and Christos Faloutsos. Edge weight prediction in weighted signed networks. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 221–230. IEEE, 2016.
 Kumar et al. (2018) Srijan Kumar, William L. Hamilton, Jure Leskovec, and Dan Jurafsky. Community interaction and conflict on the web. In Proceedings of the 2018 World Wide Web Conference, pages 933–943. International World Wide Web Conferences Steering Committee, 2018.
 Kunegis (2013) Jérôme Kunegis. Konect: The koblenz network collection. In Proceedings of the 22nd International Conference on World Wide Web, pages 1343–1350. ACM, 2013.
 Leskovec et al. (2005) Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: Densification laws, shrinking diameters and possible explanations. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 177–187. ACM, 2005.

 Lu et al. (2016) Canyi Lu, Jiashi Feng, Yudong Chen, Wei Liu, Zhouchen Lin, and Shuicheng Yan. Tensor robust principal component analysis: Exact recovery of corrupted low-rank tensors via convex optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5249–5257, 2016.
 Manessi et al. (2020) Franco Manessi, Alessandro Rozza, and Mario Manzo. Dynamic graph convolutional networks. Pattern Recognition, 97:107000, 2020.
 Martin et al. (2013) Carla D. Martin, Richard Shafer, and Betsy LaRue. An order-p tensor factorization with applications in imaging. SIAM Journal on Scientific Computing, 35(1):A474–A490, 2013.
 Newman et al. (2018) Elizabeth Newman, Lior Horesh, Haim Avron, and Misha Kilmer. Stable Tensor Neural Networks for Rapid Deep Learning. arXiv preprint arXiv:1811.06569, 2018.
 Novikov et al. (2015) Alexander Novikov, Dmitrii Podoprikhin, Anton Osokin, and Dmitry P Vetrov. Tensorizing neural networks. In Advances in neural information processing systems, pages 442–450, 2015.
 Pareja et al. (2019) Aldo Pareja, Giacomo Domeniconi, Jie Chen, Tengfei Ma, Toyotaro Suzumura, Hiroki Kanezashi, Tim Kaler, Tao B. Schardl, and Charles E. Leiserson. EvolveGCN: Evolving graph convolutional networks for dynamic graphs. arXiv preprint arXiv:1902.10191, 2019.

 Phan and Cichocki (2010) Anh Huy Phan and Andrzej Cichocki. Tensor decompositions for feature extraction and classification of high dimensional datasets. Nonlinear Theory and Its Applications, IEICE, 1(1):37–68, 2010.
 Semerci et al. (2014) Oguz Semerci, Ning Hao, Misha E. Kilmer, and Eric L. Miller. Tensor-based formulation and nuclear norm regularization for multi-energy computed tomography. IEEE Transactions on Image Processing, 23(4):1678–1693, 2014.
 Seo et al. (2018) Youngjoo Seo, Michaël Defferrard, Pierre Vandergheynst, and Xavier Bresson. Structured sequence modeling with graph convolutional recurrent networks. In International Conference on Neural Information Processing, pages 362–373. Springer, 2018.
 Shuman et al. (2013) David I. Shuman, Sunil K. Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, May 2013. ISSN 1053-5888. doi: 10.1109/MSP.2012.2235192.
 Stoudenmire and Schwab (2016) Edwin Stoudenmire and David J Schwab. Supervised learning with tensor networks. In Advances in Neural Information Processing Systems, pages 4799–4807, 2016.
 Wang et al. (2018) Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E. Sarma, Michael M. Bronstein, and Justin M. Solomon. Dynamic graph CNN for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018.
 Weber et al. (2018) Mark Weber, Jie Chen, Toyotaro Suzumura, Aldo Pareja, Tengfei Ma, Hiroki Kanezashi, Tim Kaler, Charles E. Leiserson, and Tao B. Schardl. Scalable Graph Learning for Anti-Money Laundering: A First Look. arXiv preprint arXiv:1812.00076, 2018.
 Wu et al. (2019) Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S. Yu. A comprehensive survey on graph neural networks. arXiv preprint arXiv:1901.00596, 2019.
 Zhang and Aeron (2016) Zemin Zhang and Shuchin Aeron. Exact tensor completion using t-SVD. IEEE Transactions on Signal Processing, 65(6):1511–1526, 2016.
 Zhang et al. (2014) Zemin Zhang, Gregory Ely, Shuchin Aeron, Ning Hao, and Misha Kilmer. Novel methods for multilinear data completion and denoising based on tensor-SVD. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3842–3849, 2014.
 Zhao et al. (2019) Ling Zhao, Yujiao Song, Chao Zhang, Yu Liu, Pu Wang, Tao Lin, Min Deng, and Haifeng Li. T-GCN: A Temporal Graph Convolutional Network for Traffic Prediction. IEEE Transactions on Intelligent Transportation Systems, 2019.
 Zhou et al. (2018) Jie Zhou, Ganqu Cui, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434, 2018.
Appendix A Illustration of the M-transform
We provide some illustrations that show how the M-transform in Definition 3.1 works. Recall that $\mathcal{X} \times_3 \mathbf{M} = \operatorname{fold}(\mathbf{M}\, \operatorname{unfold}(\mathcal{X}))$. The tensor $\mathcal{X}$ is first unfolded into a matrix, as illustrated in Figure 3. This unfolded tensor is then multiplied from the left by the matrix $\mathbf{M}$, as illustrated in Figure 4; the figure also illustrates the banded lower triangular structure of $\mathbf{M}$. Finally, the output matrix is folded back into a tensor. The fold operation is defined to be the inverse of the unfold operation.
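The unfold, multiply, fold steps described above can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: the tensor layout `(I, J, T)` with time as the last axis, the helper names `unfold`, `fold`, and `m_transform`, and the particular banded lower triangular matrix built by `banded_lower_triangular` (with a hypothetical bandwidth `b`) are all assumptions made for the example.

```python
import numpy as np

def unfold(X):
    # (I, J, T) -> (T, I*J): row t holds the flattened frontal slice X[:, :, t].
    I, J, T = X.shape
    return X.transpose(2, 0, 1).reshape(T, I * J)

def fold(X_mat, I, J):
    # Inverse of unfold: (T, I*J) -> (I, J, T).
    T = X_mat.shape[0]
    return X_mat.reshape(T, I, J).transpose(1, 2, 0)

def m_transform(X, M):
    # Unfold, multiply from the left by M, then fold back into a tensor.
    I, J, _ = X.shape
    return fold(M @ unfold(X), I, J)

def banded_lower_triangular(T, b):
    # Hypothetical example matrix: row t averages over the last
    # min(b, t + 1) time steps, so it is banded and lower triangular.
    M = np.zeros((T, T))
    for t in range(T):
        start = max(0, t - b + 1)
        M[t, start:t + 1] = 1.0 / (t - start + 1)
    return M

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 5))
M = banded_lower_triangular(T=5, b=2)
Y = m_transform(X, M)
```

Because this example `M` is lower triangular with a nonzero diagonal, it is invertible, so applying `m_transform(Y, np.linalg.inv(M))` recovers `X` up to floating-point error.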
Appendix B Links to Datasets

The Bitcoin Alpha dataset is available at
https://snap.stanford.edu/data/soc-sign-bitcoin-alpha.html.
The Bitcoin OTC dataset is available at
https://snap.stanford.edu/data/soc-sign-bitcoin-otc.html.
The Reddit dataset is available at
https://snap.stanford.edu/data/soc-RedditHyperlinks.html.
Note that we use the dataset with hyperlinks in the body of the posts.
The chess dataset is available at
http://konect.uni-koblenz.de/networks/chess.
Appendix C Further Details on the Experiment Setup
When partitioning the data into graphs, as described in Section 5, if there are multiple data points corresponding to an edge for a given time step, we only add that edge once to the corresponding graph and set the label equal to the sum of the labels of the different data points. For example, if one bitcoin user makes three transactions to another user during the same time step, then we add a single edge between them to the corresponding graph, with a label equal to the sum of the three ratings.
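The aggregation rule above can be sketched as follows. The record layout `(time_step, source, target, label)`, the function name, and the rating values are hypothetical and chosen only for illustration; they are not taken from the datasets.

```python
from collections import defaultdict

def aggregate_edge_labels(records):
    """Collapse repeated (time_step, source, target) records into one edge
    whose label is the sum of the individual labels."""
    summed = defaultdict(int)
    for time_step, source, target, label in records:
        summed[(time_step, source, target)] += label
    return dict(summed)

# Hypothetical ratings: three transactions from "u" to "v" in time step 0,
# and one more in time step 1.
records = [
    (0, "u", "v", 3),
    (0, "u", "v", -1),
    (0, "u", "v", 2),
    (1, "u", "v", 5),
]
edges = aggregate_edge_labels(records)
```

Here the three records in time step 0 become a single edge with label 3 - 1 + 2 = 4, while the record in time step 1 stays a separate edge in a different graph.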
For training, we run gradient descent with a learning rate of 0.01 and momentum of 0.9 for 10,000 iterations. Every 100 iterations, we compute and store the performance of the model on the validation data. As mentioned in Section 5, the weight vector in the loss function (16) is treated as a hyperparameter in the bitcoin and Reddit experiments. Since these datasets all have two edge classes, the weight vector has one entry for the minority (negative) class and one for the majority (positive) class. Since these weights add to 1, choosing the minority-class weight determines the majority-class weight. For all methods, we repeat the bitcoin and Reddit experiments once for each candidate value of the weight. For each model and dataset, we then find the best stored performance of the model on the validation data across all weight values. We then treat the corresponding model as the trained model, and report its performance on the testing data in Tables 4 and 4. The results for the chess experiment are computed in the same way, but only for a single weight vector.
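The selection procedure above can be sketched as a small loop. This is an illustration under stated assumptions, not the authors' code: `make_model`, `train_step`, and `validate` are hypothetical callbacks, the candidate weights are placeholders, and a larger validation score is assumed to be better.

```python
def select_model(candidate_weights, make_model, train_step, validate,
                 n_iters=10_000, eval_every=100):
    """Return (score, weight, iteration, model) for the best stored
    validation result across all candidate class weights."""
    best = None
    for weight in candidate_weights:
        model = make_model()
        for iteration in range(1, n_iters + 1):
            # Assumes train_step returns a new model object, so that the
            # stored snapshot is not overwritten by later updates.
            model = train_step(model, weight)
            if iteration % eval_every == 0:
                score = validate(model)
                if best is None or score > best[0]:
                    best = (score, weight, iteration, model)
    return best

# Toy usage: the "model" is a single number that each step moves by `weight`,
# and the validation score peaks when the model equals 5.
best = select_model(
    candidate_weights=[0.5, 1.0],
    make_model=lambda: 0.0,
    train_step=lambda model, weight: model + weight,
    validate=lambda model: -(model - 5.0) ** 2,
    n_iters=10,
    eval_every=2,
)
```

The model returned in `best` is the one that would then be evaluated once on the test data.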
Appendix D Additional Results and Proofs
Throughout this section, $\|\cdot\|_F$ will denote the Frobenius norm (i.e., the square root of the sum of the elements squared) of a matrix or tensor, and $\|\cdot\|_2$ will denote the matrix spectral norm.
We first provide a few further results that clarify the algebraic properties of the M-product. Let denote the set of tensors. Similarly, let denote the set of tensors. Under the M-product framework, the set plays a role similar to that played by scalars in matrix algebra. With this in mind, the set can be seen as a length vector consisting of tubal elements of length . Propositions D.1 and D.2 make this more precise.
Proposition D.1 (Proposition 4.2 in Kernfeld et al. (2015)).
The set with product , which is denoted by , is a commutative ring with identity.
Proposition D.2 (Theorem 4.1 in Kernfeld et al. (2015)).
The set with product , which is denoted by , is a free module over the ring .
A free module is similar to a vector space. Like a vector space, it has a basis. Proposition D.3 shows that the lateral slices of in the tensor eigendecomposition form a basis for , similarly to how the eigenvectors in a matrix eigendecomposition form a basis.
Proposition D.3.
The lateral slices of in Definition 3.8 form a basis for .
Proof.
Let . Note that
(17) 
where . So the lateral slices of are a generating set for . Now suppose
(18) 
for some . Then , and consequently
(19) 
Since each frontal face of is an invertible matrix, this implies that each frontal face of is zero, and hence . So the lateral slices of are also linearly independent in . ∎
D.1 Proofs of Propositions in the Main Text
Proof of Proposition 4.2.
Since each adjacency matrix and each is symmetric, each frontal slice is also symmetric. Consequently,
(20) 
so each frontal slice of is symmetric, and therefore has an eigendecomposition. ∎
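The symmetry argument can be checked numerically. The sketch below assumes the elementwise form of the M-transform implied by Appendix A, in which frontal slice $t$ of the transformed tensor is a weighted sum of the original frontal slices; the tensor sizes, the random symmetric slices, and the random lower triangular matrix are hypothetical choices for illustration.

```python
import numpy as np

def m_transform(X, M):
    # Frontal slice t of the output is sum_k M[t, k] * X[:, :, k].
    return np.einsum("tk,ijk->ijt", M, X)

rng = np.random.default_rng(0)
N, T = 4, 3
A = rng.standard_normal((N, N, T))
A = A + A.transpose(1, 0, 2)              # make every frontal slice symmetric
M = np.tril(rng.standard_normal((T, T)))  # arbitrary lower triangular matrix

A_hat = m_transform(A, M)
for t in range(T):
    S = A_hat[:, :, t]
    # A weighted sum of symmetric matrices is symmetric ...
    assert np.allclose(S, S.T)
    # ... so each slice has a real eigendecomposition S = V diag(w) V^T.
    w, V = np.linalg.eigh(S)
    assert np.allclose(V @ np.diag(w) @ V.T, S)
```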
Proof of Proposition 4.3.
Each has a spectrum contained in . Since is symmetric, it follows that . Consequently,
(21) 
where we used the fact that . So since the frontal slices are symmetric, they each have a spectrum in . It follows that each frontal slice
(22) 
has a spectrum contained in , which means that the entries of all lie in . ∎
Lemma D.4.
Let and let be invertible. Then
(23) 
Proof.
We have
(24)  
where the inequality is a well-known relation that holds for all matrices. ∎
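As a numerical sanity check, the sketch below assumes that the well-known relation in question is the standard bound $\|\mathbf{M}\mathbf{B}\|_F \le \|\mathbf{M}\|_2 \|\mathbf{B}\|_F$, which combines the two norms introduced at the start of this appendix. That identification is an assumption, and the matrix sizes and random entries are hypothetical.

```python
import numpy as np

def frobenius_bound(M, B):
    """Return (lhs, rhs) for the bound ||M B||_F <= ||M||_2 * ||B||_F."""
    lhs = np.linalg.norm(M @ B, "fro")
    # For a 2-D array, ord=2 gives the largest singular value (spectral norm).
    rhs = np.linalg.norm(M, 2) * np.linalg.norm(B, "fro")
    return lhs, rhs

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
B = rng.standard_normal((5, 12))
lhs, rhs = frobenius_bound(M, B)
```

In the M-transform setting, `B` would play the role of the unfolded tensor, whose Frobenius norm equals that of the tensor itself because unfolding only rearranges entries.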