Much real-world data can be cast as structured sequences, with spatio-temporal sequences being a special case. A well-studied example of spatio-temporal data is video, where successive frames share temporal and spatial structures. Many works, such as Donahue et al. (2015); Karpathy & Fei-Fei (2015); Vinyals et al. (2015), leveraged a combination of CNN and RNN to exploit such spatial and temporal regularities. Their models are able to process possibly time-varying visual inputs for variable-length prediction. These neural network architectures combine a CNN for visual feature extraction with an RNN for sequence learning. Such architectures have been successfully used for video activity recognition, image captioning and video description.
More recently, interest has grown in properly fusing the CNN and RNN models for spatio-temporal sequence modeling. Inspired by language modeling, Ranzato et al. (2014) proposed a model to represent complex deformations and motion patterns by discovering both spatial and temporal correlations. They showed that prediction of the next video frame and interpolation of intermediate frames can be achieved by building an RNN-based language model on the visual words obtained by quantizing the image patches. Their highest-performing model, recursive CNN (rCNN), uses convolutions for both inputs and states. Shi et al. (2015) then proposed the convolutional LSTM network (convLSTM), a recurrent model for spatio-temporal sequence modeling which uses 2D-grid convolution to leverage the spatial correlations in input data. They successfully applied their model to the prediction of the evolution of radar echo maps for precipitation nowcasting.
The spatial structure of many important problems may however not be as simple as regular grids. For instance, the data measured from meteorological stations lie on an irregular grid, i.e. a network of stations with a heterogeneous spatial distribution. More challenging, the structure of the data may not even be spatial, as is the case for social or biological networks. Finally, the interpretation that sentences can be regarded as random walks on vocabulary graphs, a view popularized by Mikolov et al. (2013), allows us to cast language analysis problems as graph-structured sequence models.
This work builds on the recent models of Defferrard et al. (2016); Ranzato et al. (2014); Shi et al. (2015) to design the GCRN model for modeling and predicting time-varying graph-based data. The core idea is to merge a CNN for graph-structured data with an RNN to simultaneously identify meaningful spatial structures and dynamic patterns. A generic illustration of the proposed GCRN architecture is given in Figure 2.
2.1 Structured Sequence Modeling
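Sequence modeling is the problem of predicting the most likely future length-$K$ sequence given the previous $J$ observations:

$$\hat{x}_{t+1}, \ldots, \hat{x}_{t+K} = \underset{x_{t+1}, \ldots, x_{t+K}}{\arg\max}\; P(x_{t+1}, \ldots, x_{t+K} \mid x_{t-J+1}, \ldots, x_t), \tag{1}$$

where $x_t \in D$ is an observation at time $t$ and $D$ denotes the domain of the observed features. The archetypal application is the $n$-gram language model (with $n = J+1$), where $P(x_{t+1} \mid x_{t-J+1}, \ldots, x_t)$ models the probability of word $x_{t+1}$ appearing conditioned on the past $J$ words in the sentence (Graves, 2013).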
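As a small illustration of the $n$-gram case, the conditional probability of the next word given the previous words can be estimated by counting. The sketch below is only an assumption-laden toy: the helper names `fit_ngram` and `predict_next`, the toy sentence and the context length are all hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict

def fit_ngram(tokens, J):
    """Count how often each word follows each context of J previous words."""
    counts = defaultdict(Counter)
    for t in range(J, len(tokens)):
        context = tuple(tokens[t - J:t])
        counts[context][tokens[t]] += 1
    return counts

def predict_next(counts, context):
    """Return the most likely next word and its estimated conditional probability.

    Assumes the context was seen at least once during fitting.
    """
    followers = counts[tuple(context)]
    total = sum(followers.values())
    word, freq = followers.most_common(1)[0]
    return word, freq / total

# Toy corpus (hypothetical); J = 2 corresponds to a trigram model (n = J + 1).
tokens = "the cat sat on the mat the cat sat up the cat ran".split()
counts = fit_ngram(tokens, J=2)
word, prob = predict_next(counts, ["the", "cat"])
print(word, prob)
```

Counting only works for small vocabularies and short contexts, which is one motivation for the recurrent models discussed next.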
In this paper, we are interested in special structured sequences, i.e. sequences where features of the observations are not independent but linked by pairwise relationships. Such relationships are universally modeled by weighted graphs.
Data $x_t$ can be viewed as a graph signal, i.e. a signal defined on an undirected and weighted graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, A)$, where $\mathcal{V}$ is a finite set of $|\mathcal{V}| = n$ vertices, $\mathcal{E}$ is a set of edges and $A \in \mathbb{R}^{n \times n}$ is a weighted adjacency matrix encoding the connection weight between two vertices. A signal $x_t : \mathcal{V} \to \mathbb{R}^{d_x}$ defined on the nodes of the graph may be regarded as a matrix $x_t \in \mathbb{R}^{n \times d_x}$ whose $i$-th row is the $d_x$-dimensional value of $x_t$ at the $i$-th node. While the number of free variables in a structured sequence of length $K$ is in principle $\mathcal{O}(n^K d_x^K)$, we seek to exploit the structure of the space of possible predictions to reduce the dimensionality and hence make those problems more tractable.
2.2 Long Short-Term Memory
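A special class of recurrent neural networks (RNN) that prevents the gradient from vanishing too quickly is the popular long short-term memory (LSTM) introduced by Hochreiter & Schmidhuber (1997). This architecture has proven stable and powerful for modeling long-range dependencies in various general-purpose sequence modeling tasks (Graves, 2013; Srivastava et al., 2015; Sutskever et al., 2014). A fully-connected LSTM (FC-LSTM) may be seen as a multivariate version of LSTM where the input $x_t \in \mathbb{R}^{d_x}$, cell output $h_t \in [-1, 1]^{d_h}$ and states $c_t \in \mathbb{R}^{d_h}$ are all vectors. In this paper, we follow the FC-LSTM formulation of Graves (2013), that is:

$$
\begin{aligned}
i &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + w_{ci} \odot c_{t-1} + b_i), \\
f &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + w_{cf} \odot c_{t-1} + b_f), \\
c_t &= f \odot c_{t-1} + i \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \\
o &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + w_{co} \odot c_t + b_o), \\
h_t &= o \odot \tanh(c_t),
\end{aligned} \tag{2}
$$

where $\odot$ denotes the Hadamard product, $\sigma(\cdot)$ the sigmoid function $\sigma(x) = 1/(1 + e^{-x})$ and $i, f, o \in [0, 1]^{d_h}$ are the input, forget and output gates. The weights $W_{x\cdot} \in \mathbb{R}^{d_h \times d_x}$, $W_{h\cdot} \in \mathbb{R}^{d_h \times d_h}$, $w_{c\cdot} \in \mathbb{R}^{d_h}$ and biases $b_i, b_f, b_c, b_o \in \mathbb{R}^{d_h}$ are the model parameters. (A practical trick is to initialize the biases $b_i$, $b_f$ and $b_o$ to one such that the gates are initially open.) Such a model is called fully-connected because the dense matrices $W_{x\cdot}$ and $W_{h\cdot}$ linearly combine all the components of $x$ and $h$. The optional peephole connections $w_{c\cdot} \odot c_t$, introduced by Gers & Schmidhuber (2000), have been found to improve performance on certain tasks.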
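The FC-LSTM update with peephole connections described above can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions rather than the authors' implementation: the function names `fc_lstm_step` and `init_params`, the parameter dictionary layout, the random initialization scale and the toy dimensions are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fc_lstm_step(x, h_prev, c_prev, p):
    """One FC-LSTM step with peepholes; x is (d_x,), h_prev and c_prev are (d_h,)."""
    i = sigmoid(p["W_xi"] @ x + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])
    f = sigmoid(p["W_xf"] @ x + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])
    c = f * c_prev + i * np.tanh(p["W_xc"] @ x + p["W_hc"] @ h_prev + p["b_c"])
    # The output gate peeks at the updated cell state c, not c_prev.
    o = sigmoid(p["W_xo"] @ x + p["W_ho"] @ h_prev + p["w_co"] * c + p["b_o"])
    h = o * np.tanh(c)
    return h, c

def init_params(d_x, d_h, rng):
    """Hypothetical initializer: small random weights, gate biases set to one."""
    p = {}
    for g in "ifco":
        p[f"W_x{g}"] = 0.1 * rng.standard_normal((d_h, d_x))
        p[f"W_h{g}"] = 0.1 * rng.standard_normal((d_h, d_h))
        p[f"b_{g}"] = np.zeros(d_h)
    for g in "ifo":
        p[f"w_c{g}"] = 0.1 * rng.standard_normal(d_h)  # peephole weights
        p[f"b_{g}"] = np.ones(d_h)                     # gates initially open
    return p

rng = np.random.default_rng(0)
d_x, d_h = 3, 4                      # toy sizes (assumed)
params = init_params(d_x, d_h, rng)
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.standard_normal((5, d_x)):   # a random length-5 input sequence
    h, c = fc_lstm_step(x_t, h, c, params)
print(h.shape, c.shape)
```

Because the cell output is a sigmoid gate times a tanh, every entry of `h` stays strictly inside $(-1, 1)$.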
2.3 Convolutional Neural Networks on Graphs
Generalizing convolutional neural networks (CNNs) to arbitrary graphs is a recent area of interest. Two approaches have been explored in the literature: (i) a generalization of the spatial definition of a convolution (Masci et al., 2015; Niepert et al., 2016) and (ii) a multiplication in the graph Fourier domain by way of the convolution theorem (Bruna et al., 2014; Defferrard et al., 2016). Masci et al. (2015) introduced a spatial generalization of CNNs to 3D meshes. The authors used geodesic polar coordinates to define convolution operations on mesh patches, and formulated a deep learning architecture which allows comparison across different meshes. Hence, this method is tailored to manifolds and is not directly generalizable to arbitrary graphs. Niepert et al. (2016) proposed a spatial approach which may be decomposed in three steps: (i) select a node, (ii) construct its neighborhood and (iii) normalize the selected sub-graph, i.e. order the neighboring nodes. The extracted patches are then fed into a conventional 1D Euclidean CNN. As graphs generally do not possess a natural ordering (temporal, spatial or otherwise), a labeling procedure should be used to impose it. Bruna et al. (2014) were the first to introduce the spectral framework described below in the context of graph CNNs. The major drawback of this method is its $\mathcal{O}(n^2)$ complexity, which was overcome by the technique of Defferrard et al. (2016), which offers a complexity linear in the number of edges and provides strictly localized filters. Kipf & Welling (2016) took a first-order approximation of the spectral filters proposed by Defferrard et al. (2016) and successfully used it for semi-supervised classification of nodes. While we focus on the framework introduced by Defferrard et al. (2016), the proposed model is agnostic to the choice of the graph convolution operator $*_\mathcal{G}$.
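As it is difficult to express a meaningful translation operator in the vertex domain (Bruna et al., 2014; Niepert et al., 2016), Defferrard et al. (2016) chose a spectral formulation for the convolution operator on graph $*_\mathcal{G}$. By this definition, a graph signal $x \in \mathbb{R}^{n \times d_x}$ is filtered by a non-parametric kernel $g_\theta(\Lambda) = \mathrm{diag}(\theta)$, where $\theta \in \mathbb{R}^n$ is a vector of Fourier coefficients, as

$$y = g_\theta *_\mathcal{G} x = g_\theta(L) x = g_\theta(U \Lambda U^T) x = U g_\theta(\Lambda) U^T x \in \mathbb{R}^{n \times d_x}, \tag{3}$$

where $U \in \mathbb{R}^{n \times n}$ is the matrix of eigenvectors and $\Lambda \in \mathbb{R}^{n \times n}$ the diagonal matrix of eigenvalues of the normalized graph Laplacian $L = I_n - D^{-1/2} A D^{-1/2} = U \Lambda U^T \in \mathbb{R}^{n \times n}$, where $I_n$ is the identity matrix and $D \in \mathbb{R}^{n \times n}$ is the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$ (Chung, 1997). Note that the signal $x$ is filtered by $g_\theta$ with an element-wise multiplication of its graph Fourier transform $U^T x$ with $g_\theta$ (Shuman et al., 2013). Evaluating (3) is however expensive, as the multiplication with $U$ is $\mathcal{O}(n^2)$. Furthermore, computing the eigendecomposition of $L$ might be prohibitively expensive for large graphs. To circumvent this problem, Defferrard et al. (2016) parametrize $g_\theta$ as a truncated expansion, up to order $K-1$, of Chebyshev polynomials $T_k$ such that

$$g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{\Lambda}), \tag{4}$$

where the parameter $\theta \in \mathbb{R}^K$ is a vector of Chebyshev coefficients and $T_k(\tilde{\Lambda}) \in \mathbb{R}^{n \times n}$ is the Chebyshev polynomial of order $k$ evaluated at $\tilde{\Lambda} = 2\Lambda/\lambda_{max} - I_n$. The graph filtering operation can then be written as

$$y = g_\theta *_\mathcal{G} x = g_\theta(L) x = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L}) x, \tag{5}$$

where $T_k(\tilde{L}) \in \mathbb{R}^{n \times n}$ is the Chebyshev polynomial of order $k$ evaluated at the scaled Laplacian $\tilde{L} = 2L/\lambda_{max} - I_n$. Using the stable recurrence relation $T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$ with $T_0 = 1$ and $T_1 = x$, one can evaluate (5) in $\mathcal{O}(K|\mathcal{E}|)$ operations, i.e. linearly with the number of edges. Note that as the filtering operation (5) is a polynomial of order $K-1$ of the Laplacian, it is localized and depends only on nodes that are at most $K-1$ hops away from the central node. The reader is referred to Defferrard et al. (2016) for details and an in-depth discussion.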
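The Chebyshev recurrence described above can be checked numerically against the spectral definition of the filter. The following NumPy sketch is illustrative only: it uses dense matrices for brevity (the linear cost in the number of edges requires a sparse Laplacian), it assumes the upper bound $\lambda_{max} = 2$ of the normalized Laplacian instead of computing the largest eigenvalue, and the function names, the path graph and the coefficient values are hypothetical.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2}; assumes every node has at least one edge."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def chebyshev_filter(L, x, theta, lmax=2.0):
    """Evaluate y = sum_k theta_k T_k(L_tilde) x with the three-term recurrence."""
    L_tilde = 2.0 * L / lmax - np.eye(L.shape[0])
    t_prev = x                                   # T_0(L_tilde) x = x
    y = theta[0] * t_prev
    if len(theta) > 1:
        t_curr = L_tilde @ x                     # T_1(L_tilde) x = L_tilde x
        y = y + theta[1] * t_curr
        for k in range(2, len(theta)):
            t_next = 2.0 * (L_tilde @ t_curr) - t_prev
            y = y + theta[k] * t_next
            t_prev, t_curr = t_curr, t_next
    return y

# Path graph on 5 nodes: 0 - 1 - 2 - 3 - 4
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = normalized_laplacian(A)

theta = np.array([0.5, -0.3, 0.2])   # K = 3 coefficients (arbitrary values)
x = np.zeros((5, 1))
x[0, 0] = 1.0                        # impulse at node 0
y = chebyshev_filter(L, x, theta)

# Reference: filter in the graph Fourier domain via the eigendecomposition of L.
lam, U = np.linalg.eigh(L)
g = np.polynomial.chebyshev.chebval(2.0 * lam / 2.0 - 1.0, theta)
y_ref = U @ (g[:, None] * (U.T @ x))
print(np.allclose(y, y_ref))
```

With $K = 3$ coefficients the response to an impulse at node 0 vanishes on nodes 3 and 4, which are more than two hops away, illustrating the locality of the filter.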
3 Related Works
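Shi et al. (2015) introduced a model for regular grid-structured sequences, which can be seen as a special case of the proposed model where the graph is an image grid whose nodes are well ordered. Their model is essentially the classical FC-LSTM (2) where the multiplications by dense matrices $W$ have been replaced by convolutions with kernels $W$:

$$
\begin{aligned}
i &= \sigma(W_{xi} * x_t + W_{hi} * h_{t-1} + w_{ci} \odot c_{t-1} + b_i), \\
f &= \sigma(W_{xf} * x_t + W_{hf} * h_{t-1} + w_{cf} \odot c_{t-1} + b_f), \\
c_t &= f \odot c_{t-1} + i \odot \tanh(W_{xc} * x_t + W_{hc} * h_{t-1} + b_c), \\
o &= \sigma(W_{xo} * x_t + W_{ho} * h_{t-1} + w_{co} \odot c_t + b_o), \\
h_t &= o \odot \tanh(c_t),
\end{aligned} \tag{6}
$$

where $*$ denotes the 2D convolution by a set of kernels. In their setting, the input tensor $x_t \in \mathbb{R}^{n_r \times n_c \times d_x}$ is the observation of $d_x$ measurements at time $t$ of a dynamical system over a spatial region represented by a grid of $n_r$ rows and $n_c$ columns. The model holds spatially distributed hidden and cell states of size $d_h$ given by the tensors $c_t, h_t \in \mathbb{R}^{n_r \times n_c \times d_h}$. The size $m$ of the convolutional kernels $W_{h\cdot} \in \mathbb{R}^{m \times m \times d_h \times d_h}$ and $W_{x\cdot} \in \mathbb{R}^{m \times m \times d_h \times d_x}$ determines the number of parameters, which is independent of the grid size $n_r \times n_c$. Earlier, Ranzato et al. (2014) proposed a similar RNN variation which uses convolutional layers instead of fully connected layers. The hidden state at time $t$ is given by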
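
$$h_t = \tanh\big(\sigma(W_{x2} * \sigma(W_{x1} * x_t)) + \sigma(W_h * h_{t-1})\big), \tag{7}$$

where the convolutional kernels $W_h \in \mathbb{R}^{d_h \times d_h}$ are restricted to filters of size 1x1 (effectively a fully connected layer shared across all spatial locations).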
Observing that natural language exhibits syntactic properties that naturally combine words into phrases, Tai et al. (2015) proposed a model for tree-structured topologies, where each LSTM has access to the states of its children. They obtained state-of-the-art results on semantic relatedness and sentiment classification. Liang et al. (2016) followed up and proposed a variant on graphs. Their sophisticated network architecture obtained state-of-the-art results for semantic object parsing on four datasets. In those models, the states are gathered from the neighborhood by way of a weighted sum with trainable weight matrices. Those weights are however not shared across the graph, which would otherwise have required some ordering of the nodes, like any other spatial definition of graph convolution. Moreover, their formulations are limited to the one-neighborhood of the current node, with equal weight given to each neighbor.
Motivated by spatio-temporal problems like modeling human motion and object interactions, Jain et al. (2016) developed a method to cast a spatio-temporal graph as a rich RNN mixture which essentially associates an RNN to each node and edge. Again, the communication is limited to directly connected nodes and edges.
The closest model to our work is probably the one proposed by Li et al. (2015), which showed state-of-the-art performance on a problem from program verification. Whereas they use the iterative procedure of the Graph Neural Networks (GNNs) model introduced by Scarselli et al. (2009) to propagate node representations until convergence, we instead use the graph CNN introduced by Defferrard et al. (2016) to diffuse information across the nodes. While their motivations are quite different, those models are related by the fact that a spectral filter defined as a polynomial of order $K$ can be implemented as a $K$-layer GNN. (The basic idea is to set the transition function as a diffusion and the output function such as to realize the polynomial recurrence, then stack $K$ of those. See Defferrard et al. (2016) for details.)
4 Proposed GCRN Models
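We propose two GCRN architectures that are quite natural, and investigate their performance in real-world applications in Section 5.
Model 1. The most straightforward definition is to stack a graph CNN, defined as (5), for feature extraction and an LSTM, defined as (2), for sequence learning:

$$
\begin{aligned}
x_t^{\mathrm{CNN}} &= \mathrm{CNN}_\mathcal{G}(x_t), \\
i &= \sigma(W_{xi} x_t^{\mathrm{CNN}} + W_{hi} h_{t-1} + w_{ci} \odot c_{t-1} + b_i), \\
f &= \sigma(W_{xf} x_t^{\mathrm{CNN}} + W_{hf} h_{t-1} + w_{cf} \odot c_{t-1} + b_f), \\
c_t &= f \odot c_{t-1} + i \odot \tanh(W_{xc} x_t^{\mathrm{CNN}} + W_{hc} h_{t-1} + b_c), \\
o &= \sigma(W_{xo} x_t^{\mathrm{CNN}} + W_{ho} h_{t-1} + w_{co} \odot c_t + b_o), \\
h_t &= o \odot \tanh(c_t).
\end{aligned} \tag{8}
$$

In that setting, the input matrix $x_t \in \mathbb{R}^{n \times d_x}$ may represent the observation of $d_x$ measurements at time $t$ of a dynamical system over a network whose organization is given by a graph $\mathcal{G}$. $x_t^{\mathrm{CNN}}$ is the output of the graph CNN gate. For a proof of concept, we simply choose here $x_t^{\mathrm{CNN}} = W^{\mathrm{CNN}} *_\mathcal{G} x_t$, where $W^{\mathrm{CNN}} \in \mathbb{R}^{K \times d_x \times d_x}$ are the Chebyshev coefficients for the graph convolutional kernels of support $K$. The model also holds spatially distributed hidden and cell states of size $d_h$ given by the matrices $c_t, h_t \in \mathbb{R}^{n \times d_h}$. Peepholes are controlled by $w_{c\cdot} \in \mathbb{R}^{n \times d_h}$. The weights $W_{h\cdot} \in \mathbb{R}^{d_h \times d_h}$ and $W_{x\cdot} \in \mathbb{R}^{d_h \times d_x}$ are the parameters of the fully connected layers. An architecture such as (8) may be enough to capture the data distribution by exploiting local stationarity and compositionality properties as well as the dynamic properties.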
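Model 2. To generalize the convLSTM model (6) to graphs we replace the Euclidean 2D convolution $*$ by the graph convolution $*_\mathcal{G}$:

$$
\begin{aligned}
i &= \sigma(W_{xi} *_\mathcal{G} x_t + W_{hi} *_\mathcal{G} h_{t-1} + w_{ci} \odot c_{t-1} + b_i), \\
f &= \sigma(W_{xf} *_\mathcal{G} x_t + W_{hf} *_\mathcal{G} h_{t-1} + w_{cf} \odot c_{t-1} + b_f), \\
c_t &= f \odot c_{t-1} + i \odot \tanh(W_{xc} *_\mathcal{G} x_t + W_{hc} *_\mathcal{G} h_{t-1} + b_c), \\
o &= \sigma(W_{xo} *_\mathcal{G} x_t + W_{ho} *_\mathcal{G} h_{t-1} + w_{co} \odot c_t + b_o), \\
h_t &= o \odot \tanh(c_t).
\end{aligned} \tag{9}
$$

In that setting, the support $K$ of the graph convolutional kernels defined by the Chebyshev coefficients $W_{h\cdot} \in \mathbb{R}^{K \times d_h \times d_h}$ and $W_{x\cdot} \in \mathbb{R}^{K \times d_h \times d_x}$ determines the number of parameters, which is independent of the number of nodes $n$. To keep the notation simple, we write $W_{xi} *_\mathcal{G} x_t$ to mean a graph convolution of $x_t$ with $d_h d_x$ filters which are functions of the graph Laplacian $L$ parametrized by $K$ Chebyshev coefficients, as noted in (4) and (5). In a distributed computing setting, $K$ controls the communication overhead, i.e. the number of nodes any given node should exchange with in order to compute its local states.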
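To make the graph-convolutional LSTM of Model 2 concrete, here is a minimal NumPy sketch of one recurrent step, in which every dense product of the FC-LSTM is replaced by a Chebyshev graph convolution. It is a sketch under assumptions, not the authors' code: the names `graph_conv`, `gcrn_step` and `init_params`, the coefficient tensors of shape `(K, d_out, d_in)`, the per-node peephole weights of shape `(n, d_h)`, the choice $\lambda_{max} = 2$ and the ring graph are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def graph_conv(L_tilde, s, W):
    """Chebyshev graph convolution: s is (n, d_in), W is (K, d_out, d_in)."""
    K = W.shape[0]
    basis = [s]                                   # T_0(L_tilde) s
    if K > 1:
        basis.append(L_tilde @ s)                 # T_1(L_tilde) s
    for _ in range(2, K):
        basis.append(2.0 * (L_tilde @ basis[-1]) - basis[-2])
    # Sum over polynomial order k and input feature d -> output (n, d_out).
    return np.einsum("knd,kod->no", np.stack(basis), W)

def gcrn_step(L_tilde, x, h_prev, c_prev, p):
    """One step of a graph-convolutional LSTM with peepholes."""
    gc = lambda name, s: graph_conv(L_tilde, s, p[name])
    i = sigmoid(gc("W_xi", x) + gc("W_hi", h_prev) + p["w_ci"] * c_prev + p["b_i"])
    f = sigmoid(gc("W_xf", x) + gc("W_hf", h_prev) + p["w_cf"] * c_prev + p["b_f"])
    c = f * c_prev + i * np.tanh(gc("W_xc", x) + gc("W_hc", h_prev) + p["b_c"])
    o = sigmoid(gc("W_xo", x) + gc("W_ho", h_prev) + p["w_co"] * c + p["b_o"])
    return o * np.tanh(c), c

def init_params(K, d_x, d_h, n, rng):
    p = {}
    for g in "ifco":
        p[f"W_x{g}"] = 0.1 * rng.standard_normal((K, d_h, d_x))
        p[f"W_h{g}"] = 0.1 * rng.standard_normal((K, d_h, d_h))
        p[f"b_{g}"] = np.zeros(d_h)
    for g in "ifo":
        p[f"w_c{g}"] = 0.1 * rng.standard_normal((n, d_h))  # assumed peephole shape
    return p

# Ring graph with n nodes (toy example); every node has degree 2.
n, d_x, d_h, K = 6, 2, 3, 3
A = np.zeros((n, n))
for j in range(n):
    A[j, (j + 1) % n] = A[(j + 1) % n, j] = 1.0
deg = A.sum(axis=1)
L = np.eye(n) - A / np.sqrt(np.outer(deg, deg))
L_tilde = L - np.eye(n)          # 2 L / lambda_max - I with lambda_max = 2

rng = np.random.default_rng(0)
params = init_params(K, d_x, d_h, n, rng)
h, c = np.zeros((n, d_h)), np.zeros((n, d_h))
for x_t in rng.standard_normal((4, n, d_x)):     # random length-4 graph signal sequence
    h, c = gcrn_step(L_tilde, x_t, h, c, params)
print(h.shape)
```

In this sketch the graph-convolution weights have `K * d_h * d_x` and `K * d_h * d_h` entries per gate, so only the assumed peephole terms grow with the number of nodes.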
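The proposed blend of RNNs and graph CNNs is not limited to LSTMs and is straightforward to apply to any kind of recurrent network. For example, a vanilla RNN would be modified as

$$h_t = \tanh(W_x *_\mathcal{G} x_t + W_h *_\mathcal{G} h_{t-1}), \tag{10}$$

and a Gated Recurrent Unit (GRU) (Cho et al., 2014) as

$$
\begin{aligned}
z &= \sigma(W_{xz} *_\mathcal{G} x_t + W_{hz} *_\mathcal{G} h_{t-1}), \\
r &= \sigma(W_{xr} *_\mathcal{G} x_t + W_{hr} *_\mathcal{G} h_{t-1}), \\
\tilde{h} &= \tanh(W_{xh} *_\mathcal{G} x_t + W_{hh} *_\mathcal{G} (r \odot h_{t-1})), \\
h_t &= z \odot h_{t-1} + (1 - z) \odot \tilde{h}.
\end{aligned} \tag{11}
$$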
As demonstrated by Shi et al. (2015), structure-aware LSTM cells can be stacked and used as sequence-to-sequence models using an architecture composed of an encoder, which processes the input sequence, and a decoder, which generates an output sequence. This is standard practice for machine translation using RNNs (Cho et al., 2014; Sutskever et al., 2014).
5.1 Spatio-Temporal Sequence Modeling on Moving-MNIST
For this synthetic experiment, we use the moving-MNIST dataset generated by Shi et al. (2015). All sequences are 20 frames long (10 frames as input and 10 frames for prediction) and contain two handwritten digits bouncing inside a patch. Following their experimental setup, all models are trained by minimizing the binary cross-entropy loss using back-propagation through time (BPTT) and RMSProp with the learning rate of that setup and a decay rate of 0.9. We choose the best model with early stopping on the validation set. All implementations are based on their Theano code and dataset (http://www.wanghao.in/code/SPARNN-release.zip). The adjacency matrix $A$ is constructed as a k-nearest-neighbor (knn) graph with Euclidean distance and Gaussian kernel between pixel locations. For a fair comparison with the model of Shi et al. (2015) defined in (6), all GCRN experiments are conducted with Model 2 defined in (9), which is the same architecture with the 2D convolution $*$ replaced by a graph convolution $*_\mathcal{G}$. To further explore the impact of the isotropic property of our filters, we generated a variant of the moving-MNIST dataset where digits are also rotating (see Figure 4).
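The construction of a k-nearest-neighbor graph over pixel locations with Gaussian edge weights can be sketched as follows. The number of neighbors `k=8`, the kernel width `sigma` (defaulting to the mean neighbor distance), the symmetrization by element-wise maximum and the function name `grid_knn_adjacency` are assumptions made for illustration; the text above does not specify them.

```python
import numpy as np

def grid_knn_adjacency(rows, cols, k=8, sigma=None):
    """Symmetric k-NN adjacency between pixel locations of a rows x cols grid."""
    ys, xs = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)  # (n, 2)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))                           # (n, n)
    np.fill_diagonal(dist, np.inf)                                     # no self-loops

    n = rows * cols
    nearest = np.argsort(dist, axis=1, kind="stable")[:, :k]           # k nearest per node
    row_idx = np.repeat(np.arange(n), k)
    d_knn = dist[row_idx, nearest.ravel()]
    if sigma is None:
        sigma = d_knn.mean()                                           # assumed kernel width

    A = np.zeros((n, n))
    A[row_idx, nearest.ravel()] = np.exp(-d_knn ** 2 / (2.0 * sigma ** 2))
    return np.maximum(A, A.T)                                          # make the graph undirected

A = grid_knn_adjacency(7, 7, k=8)
center = 3 * 7 + 3                     # index of pixel (3, 3)
print(A.shape, int((A[center] > 0).sum()))
```

On this 7x7 toy grid the central pixel ends up connected to its 8 surrounding pixels, so one hop on the graph matches the reach of a 3x3 patch around that pixel.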
| Architecture | Structure | Filter size | Parameters | Runtime | Test (w/o Rot) | Test (Rot) |
Table 1 shows the performance of various models: (i) the baseline FC-LSTM from Shi et al. (2015), (ii) the 1-layer LSTM+CNN from Shi et al. (2015) with different filter sizes, and (iii) the proposed LSTM+graph CNN (GCNN) defined in (9) with different supports $K$. These results show the ability of the proposed method to capture spatio-temporal structures. Perhaps surprisingly, GCNNs can offer better performance than regular CNNs, even when the domain is a 2D grid and the data are images, the problem CNNs were initially developed for. The explanation is to be found in the differences between 2D filters and spectral graph filters. While a spectral filter of support $K$ covers the same reach as a square 2D patch (see Figure 2), the two differ in the isotropic nature of the former and in the number of parameters: $K$ coefficients for the spectral filter against one weight per pixel of the patch for the 2D filter. Table 1 indeed shows that LSTM+CNN rivals LSTM+GCNN for small filters. However, when increasing the filter size, the GCNN variant clearly outperforms the CNN variant. This experiment demonstrates that graph spectral filters can obtain superior performance on regular domains with far fewer parameters thanks to their isotropic nature, a controversial property. Indeed, as the nodes are not ordered, there is no notion of an edge going up, down, right or left. All edges are treated equally, inducing some sort of rotation invariance. Additionally, Table 1 shows that the computational complexity of each model is linear in the filter size, and Figure 3 shows the learning dynamics of some of the models.
5.2 Natural Language Modeling on Penn Treebank
The Penn Treebank dataset has 1,036,580 words. It was pre-processed in Zaremba et al. (2014) and split (https://github.com/wojzaremba/lstm) into a training set of 929k words, a validation set of 73k words, and a test set of 82k words. The size of the vocabulary of this corpus is 10,000. We use the gensim library (https://radimrehurek.com/gensim/models/word2vec.html) to compute a word2vec model (Mikolov et al., 2013) for embedding the words of the dictionary in a 200-dimensional space. Then we build the adjacency matrix of the word embedding using a 4-nearest neighbor graph with cosine distance. Figure 6 presents the computed adjacency matrix and its 3D visualization. We used the hyperparameters of the small configuration given by the code (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/models/rnn/ptb/ptb_word_lm.py) based on Zaremba et al. (2014): the size of the data mini-batch is 20, the number of temporal steps to unroll is 20, the dimension of the hidden state is 200. The global learning rate is 1.0 and the norm of the gradient is bounded by 5. The learning rate is decayed over the epochs. All experiments run for 13 epochs with a dropout value of 0.75. For Zaremba et al. (2014), the input representation can be either the 200-dim embedding vector of the word, or the 10,000-dim one-hot representation of the word. For our models, the input representation is a one-hot representation of the word. This choice allows us to use the graph structure of the words.
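The word graph described above can be sketched as a nearest-neighbor search under cosine similarity. The sketch below uses random vectors as a stand-in for the word2vec embeddings, a reduced vocabulary size, binary edge weights and symmetrization by element-wise maximum; these choices, like the function name `cosine_knn_adjacency`, are assumptions for illustration only.

```python
import numpy as np

def cosine_knn_adjacency(E, k=4):
    """Binary, symmetric k-NN graph between the rows of E under cosine similarity."""
    E_unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    sim = E_unit @ E_unit.T                       # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)                # exclude self-matches
    nearest = np.argsort(-sim, axis=1)[:, :k]     # k most similar words per word
    A = np.zeros((len(E), len(E)))
    A[np.repeat(np.arange(len(E)), k), nearest.ravel()] = 1.0
    return np.maximum(A, A.T)                     # undirected graph

rng = np.random.default_rng(0)
E = rng.standard_normal((50, 200))   # stand-in for 50 words embedded in 200 dimensions
A = cosine_knn_adjacency(E, k=4)
print(A.shape, int(A.sum(axis=1).min()))
```

After symmetrization each word has at least 4 neighbors, and possibly more when other words select it among their own nearest neighbors.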
| Architecture | Representation | Parameters | Train Perplexity | Test Perplexity |
|---|---|---|---|---|
| Zaremba et al. (2014) code | embedding | 681,800 | 36.96 | 117.29 |
| Zaremba et al. (2014) code | one-hot | 34,011,600 | 53.89 | 118.82 |
Table 2 reports the final train and test perplexity values for each investigated model and Figure 5 plots the perplexity value vs. the number of epochs for the train and test sets with and without dropout regularization. Numerical experiments show:
- Given the same experimental conditions in terms of architecture and no dropout regularization, the standalone LSTM model is more accurate than the LSTM using the spatial graph information extracted by graph CNN with the GCRN architecture of Model 1, Eq. (8).
- However, using dropout regularization, the graph LSTM model outperforms the standalone LSTM in terms of perplexity.
- The use of spatial graph information found by graph CNN speeds up the learning process, and overfits the training dataset in the absence of dropout regularization. The graph structure likely acts as a constraint on the learning system that is forced to move in the space of language topics.
We performed the same experiments with LSTM and Model 2 defined in (9). Model 1 significantly outperformed Model 2, and Model 2 did worse than the standalone LSTM. This poor performance may be the result of the large increase of dimensionality in Model 2, as the dimension of the hidden and cell states changes from 200 to 10,000, the size of the vocabulary. A solution would be to downsize the data dimensionality, as done in Shi et al. (2015) in the case of image data.
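For readers less familiar with the metric, perplexity is the exponential of the average negative log-likelihood that the model assigns to the observed words. The short sketch below uses this standard definition; the function name `perplexity` and the toy inputs are hypothetical and unrelated to the reported results.

```python
import math

def perplexity(probs):
    """probs: probability assigned by the model to each observed word."""
    avg_nll = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(avg_nll)

# A model that spreads its mass uniformly over a 10,000-word vocabulary
# has a perplexity equal to the vocabulary size.
uniform = perplexity([1.0 / 10000] * 20)
# Assigning higher probability to the observed words lowers the perplexity.
sharper = perplexity([0.01] * 20)
print(round(uniform), round(sharper))
```

Lower values are better: a perplexity around 100 on a 10,000-word vocabulary means the model is, on average, as uncertain as a uniform choice among about 100 words.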
6 Conclusion and Future Work
This work aims at learning spatio-temporal structures from graph-structured and time-varying data. In this context, the main challenge is to identify the best possible architecture that combines recurrent neural networks like vanilla RNN, LSTM or GRU with convolutional neural networks for graph-structured data. We have investigated here two architectures, one using a stack of CNN and RNN (Model 1), and one using convLSTM that considers convolutions instead of fully connected operations in the RNN definition (Model 2). We have then considered two applications: video prediction and natural language modeling. Model 2 has shown good performance in the case of video prediction, by improving the results of Shi et al. (2015). Model 1 has also provided promising performance in the case of language modeling, particularly in terms of learning speed. It has been shown that (i) isotropic filters, maybe surprisingly, can outperform classical 2D filters on images while requiring far fewer parameters, and (ii) graphs coupled with graph CNN and RNN are a versatile way of introducing and exploiting side-information, e.g. the semantics of words, by structuring a data matrix.
Future work will investigate applications to data naturally structured as dynamic graph signals, for instance fMRI and sensor networks. The graph CNN model we have used is rotationally invariant, and such a spatial property seems quite attractive in real situations where motion goes beyond translation. We will also investigate how to benefit from the fast learning property of our system to speed up language models. Finally, it will be interesting to analyze the underlying dynamical properties of generic RNN architectures in the case of graphs. Graph structures may introduce stability to RNN systems, and prevent them from expressing unstable dynamic behaviors.
This research was supported in part by the European Union’s H2020 Framework Programme (H2020-MSCA-ITN-2014) under grant No. 642685 MacSeNet, and an Nvidia equipment grant.
- Bruna et al. (2014) Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral Networks and Locally Connected Networks on Graphs. In International Conference on Learning Representations (ICLR), 2014.
- Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv:1406.1078, 2014.
- Chung (1997) F. R. K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.
- Defferrard et al. (2016) Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems (NIPS), 2016.
- Donahue et al. (2015) Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- Gers & Schmidhuber (2000) Felix A Gers and Jürgen Schmidhuber. Recurrent nets that time and count. In IEEE-INNS-ENNS International Joint Conference on Neural Networks, 2000.
- Graves (2013) Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850, 2013.
- Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 1997.
- Jain et al. (2016) Ashesh Jain, Amir R. Zamir, Silvio Savarese, and Ashutosh Saxena. Structural-RNN: Deep Learning on Spatio-Temporal Graphs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- Karpathy & Fei-Fei (2015) Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- Kipf & Welling (2016) Thomas N. Kipf and Max Welling. Semi-Supervised Classification with Graph Convolutional Networks. arXiv:1609.02907, 2016.
- Li et al. (2015) Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493, 2015.
- Liang et al. (2016) Xiaodan Liang, Xiaohui Shen, Jiashi Feng, Liang Lin, and Shuicheng Yan. Semantic object parsing with graph lstm. arXiv:1603.07063, 2016.
- Masci et al. (2015) Jonathan Masci, Davide Boscaini, Michael M. Bronstein, and Pierre Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. In IEEE International Conference on Computer Vision (ICCV) Workshops, 2015.
- Mikolov et al. (2013) T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient Estimation of Word Representations in Vector Space. In International Conference on Learning Representations (ICLR), 2013.
- Niepert et al. (2016) Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning Convolutional Neural Networks for Graphs. In International Conference on Machine Learning (ICML), 2016.
- Ranzato et al. (2014) Marc'Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv:1412.6604, 2014.
- Scarselli et al. (2009) Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 2009.
- Shi et al. (2015) Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-kin Wong, and Wang-chun Woo. Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In Advances in Neural Information Processing Systems (NIPS), 2015.
- Shuman et al. (2013) D. Shuman, S. Narang, P. Frossard, A. Ortega, and P. Vandergheynst. The Emerging Field of Signal Processing on Graphs: Extending High-Dimensional Data Analysis to Networks and other Irregular Domains. IEEE Signal Processing Magazine, 2013.
- Srivastava et al. (2015) Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using lstms. In International Conference on Machine Learning (ICML), 2015.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks. In Advances in neural information processing systems (NIPS), 2014.
- Tai et al. (2015) Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In Association for Computational Linguistics (ACL), 2015.
- Vinyals et al. (2015) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
- Zaremba et al. (2014) Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv:1409.2329, 2014.