Structured Sequence Modeling with Graph Convolutional Recurrent Networks

12/22/2016 ∙ by Youngjoo Seo, et al. ∙ EPFL

This paper introduces Graph Convolutional Recurrent Network (GCRN), a deep learning model able to predict structured sequences of data. Precisely, GCRN is a generalization of classical recurrent neural networks (RNN) to data structured by an arbitrary graph. Such structured sequences can represent series of frames in videos, spatio-temporal measurements on a network of sensors, or random walks on a vocabulary graph for natural language modeling. The proposed model combines convolutional neural networks (CNN) on graphs to identify spatial structures and RNN to find dynamic patterns. We study two possible architectures of GCRN, and apply the models to two practical problems: predicting moving MNIST data, and modeling natural language with the Penn Treebank dataset. Experiments show that exploiting simultaneously graph spatial and dynamic information about data can improve both precision and learning speed.




1 Introduction

Many real-world data can be cast as structured sequences, with spatio-temporal sequences being a special case. A well-studied example of spatio-temporal data is video, where succeeding frames share temporal and spatial structures. Many works, such as Donahue et al. (2015); Karpathy & Fei-Fei (2015); Vinyals et al. (2015), leveraged a combination of CNN and RNN to exploit such spatial and temporal regularities. Their models are able to process possibly time-varying visual inputs for variable-length prediction. These neural network architectures combine a CNN for visual feature extraction with an RNN for sequence learning. Such architectures have been successfully used for video activity recognition, image captioning and video description.

More recently, interest has grown in properly fusing the CNN and RNN models for spatio-temporal sequence modeling. Inspired by language modeling, Ranzato et al. (2014) proposed a model to represent complex deformations and motion patterns by discovering both spatial and temporal correlations. They showed that prediction of the next video frame and interpolation of intermediate frames can be achieved by building an RNN-based language model on the visual words obtained by quantizing the image patches. Their highest-performing model, recursive CNN (rCNN), uses convolutions for both inputs and states.

Shi et al. (2015) then proposed the convolutional LSTM network (convLSTM), a recurrent model for spatio-temporal sequence modeling which uses 2D-grid convolution to leverage the spatial correlations in input data. They successfully applied their model to the prediction of the evolution of radar echo maps for precipitation nowcasting.

The spatial structure of many important problems may however not be as simple as regular grids. For instance, the data measured from meteorological stations lie on an irregular grid, i.e. a network of stations with a heterogeneous spatial distribution. More challenging, the structure of data may not even be spatial, as is the case for social or biological networks. Finally, the interpretation that sentences can be regarded as random walks on vocabulary graphs, a view popularized by Mikolov et al. (2013), allows us to cast language analysis problems as graph-structured sequence models.

This work builds on the recent models of Defferrard et al. (2016); Ranzato et al. (2014); Shi et al. (2015) to design the GCRN model for modeling and predicting time-varying graph-based data. The core idea is to merge CNN for graph-structured data and RNN to identify simultaneously meaningful spatial structures and dynamic patterns. A generic illustration of the proposed GCRN architecture is given by Figure 1.

Figure 1: Illustration of the proposed GCRN model for spatio-temporal prediction of graph-structured data. The technique combines at the same time CNN on graphs and RNN. RNN can be easily exchanged with LSTM or GRU networks.
Figure 2: Illustration of the neighborhood on an 8-nearest-neighbor grid graph. Isotropic spectral filters of support $K$ have access to nodes at most $K - 1$ hops away.

2 Preliminaries

2.1 Structured Sequence Modeling

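Sequence modeling is the problem of predicting the most likely future length-$K$ sequence given the previous $J$ observations:

$$\hat{x}_{t+1}, \ldots, \hat{x}_{t+K} = \arg\max_{x_{t+1}, \ldots, x_{t+K}} P(x_{t+1}, \ldots, x_{t+K} \mid x_{t-J+1}, \ldots, x_t), \tag{1}$$

where $x_t \in D$ is an observation at time $t$ and $D$ denotes the domain of the observed features. The archetypal application is the $n$-gram language model (with $n = J + 1$), where $P(x_{t+1} \mid x_{t-J+1}, \ldots, x_t)$ models the probability of word $x_{t+1}$ appearing conditioned on the past $J$ words in the sentence (Graves, 2013).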

In this paper, we are interested in special structured sequences, i.e. sequences where features of the observations are not independent but linked by pairwise relationships. Such relationships are universally modeled by weighted graphs.

Data $x_t$ can be viewed as a graph signal, i.e. a signal defined on an undirected and weighted graph $\mathcal{G} = (\mathcal{V}, \mathcal{E}, A)$, where $\mathcal{V}$ is a finite set of $|\mathcal{V}| = n$ vertices, $\mathcal{E}$ is a set of edges and $A \in \mathbb{R}^{n \times n}$ is a weighted adjacency matrix encoding the connection weight between two vertices. A signal $x_t : \mathcal{V} \to \mathbb{R}^{d_x}$ defined on the nodes of the graph may be regarded as a matrix $x_t \in \mathbb{R}^{n \times d_x}$ whose row $i$ is the $d_x$-dimensional value of $x_t$ at the $i$-th node. While the number of free variables in a structured sequence of length $K$ is in principle $\mathcal{O}(n^K d_x^K)$, we seek to exploit the structure of the space of possible predictions to reduce the dimensionality and hence make those problems more tractable.
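As a minimal sketch of this data layout, the snippet below stores a structured sequence as an array with one graph signal per time step. The toy sizes, the path graph, its edge weights and the one-row-per-node storage convention are all illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical toy sizes: n nodes, d_x features per node, T time steps.
n, d_x, T = 4, 2, 5

# Weighted adjacency matrix A of an undirected graph (a 4-node path graph here).
A = np.zeros((n, n))
for i, j, w in [(0, 1, 1.0), (1, 2, 0.5), (2, 3, 2.0)]:
    A[i, j] = A[j, i] = w  # undirected graph, so A is symmetric

# A structured sequence: one (n, d_x) graph signal per time step.
rng = np.random.default_rng(0)
x = rng.standard_normal((T, n, d_x))

x_t = x[2]           # graph signal at time step 2, shape (n, d_x)
node_value = x_t[1]  # d_x-dimensional value of that signal at node 1
```

The adjacency matrix is shared by all time steps; only the signal living on the nodes changes over time.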

2.2 Long Short-Term Memory

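A special class of recurrent neural networks (RNN) that prevents the gradient from vanishing too quickly is the popular long short-term memory (LSTM) introduced by Hochreiter & Schmidhuber (1997). This architecture has proven stable and powerful for modeling long-range dependencies in various general-purpose sequence modeling tasks (Graves, 2013; Srivastava et al., 2015; Sutskever et al., 2014). A fully-connected LSTM (FC-LSTM) may be seen as a multivariate version of LSTM where the input $x_t \in \mathbb{R}^{d_x}$, cell output $h_t \in [-1, 1]^{d_h}$ and states $c_t \in \mathbb{R}^{d_h}$ are all vectors. In this paper, we follow the FC-LSTM formulation of Graves (2013), that is:

$$\begin{aligned}
i &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + w_{ci} \odot c_{t-1} + b_i), \\
f &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + w_{cf} \odot c_{t-1} + b_f), \\
c_t &= f \odot c_{t-1} + i \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \\
o &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + w_{co} \odot c_t + b_o), \\
h_t &= o \odot \tanh(c_t),
\end{aligned} \tag{2}$$

where $\odot$ denotes the Hadamard product, $\sigma(\cdot)$ the sigmoid function $\sigma(x) = 1/(1 + e^{-x})$ and $i, f, o \in [0, 1]^{d_h}$ are the input, forget and output gates. The weights $W_{x\cdot} \in \mathbb{R}^{d_h \times d_x}$, $W_{h\cdot} \in \mathbb{R}^{d_h \times d_h}$, $w_{c\cdot} \in \mathbb{R}^{d_h}$ and biases $b_i, b_f, b_c, b_o \in \mathbb{R}^{d_h}$ are the model parameters. (A practical trick is to initialize the biases $b_i$, $b_f$ and $b_o$ to one such that the gates are initially open.) Such a model is called fully-connected because the dense matrices $W_{x\cdot}$ and $W_{h\cdot}$ linearly combine all the components of $x$ and $h$. The optional peephole connections $w_{c\cdot} \odot c_t$, introduced by Gers & Schmidhuber (2000), have been found to improve performance on certain tasks.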
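The recurrence can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: it assumes the FC-LSTM form with element-wise peephole weights (input, forget and output gates, with the output gate peeking at the updated cell state), and the parameter names, dictionary layout, sizes and random initialization are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fc_lstm_step(x_t, h_prev, c_prev, p):
    """One FC-LSTM step with element-wise peephole connections."""
    i = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])
    f = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])
    c_t = f * c_prev + i * np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])
    o = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["w_co"] * c_t + p["b_o"])
    h_t = o * np.tanh(c_t)
    return h_t, c_t

def init_params(d_x, d_h, rng):
    """Hypothetical initialization: small random weights, gate biases set to one."""
    p = {}
    for g in "ifco":
        p["W_x" + g] = 0.1 * rng.standard_normal((d_h, d_x))
        p["W_h" + g] = 0.1 * rng.standard_normal((d_h, d_h))
        # Gate biases start at one so the gates are initially open; cell bias at zero.
        p["b_" + g] = np.zeros(d_h) if g == "c" else np.ones(d_h)
    for g in "ifo":
        p["w_c" + g] = 0.1 * rng.standard_normal(d_h)
    return p

rng = np.random.default_rng(0)
d_x, d_h = 3, 4
params = init_params(d_x, d_h, rng)
h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.standard_normal((5, d_x)):  # a random length-5 input sequence
    h, c = fc_lstm_step(x_t, h, c, params)
```

Because the output is a gate in (0, 1) times a tanh, every component of `h` stays inside (-1, 1), which matches the stated range of the cell output.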

2.3 Convolutional Neural Networks on Graphs

Generalizing convolutional neural networks (CNNs) to arbitrary graphs is a recent area of interest. Two approaches have been explored in the literature: (i) a generalization of the spatial definition of a convolution (Masci et al., 2015; Niepert et al., 2016) and (ii) a multiplication in the graph Fourier domain by way of the convolution theorem (Bruna et al., 2014; Defferrard et al., 2016). Masci et al. (2015) introduced a spatial generalization of CNNs to 3D meshes. The authors used geodesic polar coordinates to define convolution operations on mesh patches, and formulated a deep learning architecture which allows comparison across different meshes. Hence, this method is tailored to manifolds and is not directly generalizable to arbitrary graphs. Niepert et al. (2016) proposed a spatial approach which may be decomposed in three steps: (i) select a node, (ii) construct its neighborhood and (iii) normalize the selected sub-graph, i.e. order the neighboring nodes. The extracted patches are then fed into a conventional 1D Euclidean CNN. As graphs generally do not possess a natural ordering (temporal, spatial or otherwise), a labeling procedure should be used to impose it.

Bruna et al. (2014) were the first to introduce the spectral framework described below in the context of graph CNNs. The major drawback of this method is its complexity, which was overcome with the technique of Defferrard et al. (2016), which offers a linear complexity and provides strictly localized filters. Kipf & Welling (2016) took a first-order approximation of the spectral filters proposed by Defferrard et al. (2016) and successfully used it for semi-supervised classification of nodes. While we focus on the framework introduced by Defferrard et al. (2016), the proposed model is agnostic to the choice of the graph convolution operator $*_\mathcal{G}$.

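As it is difficult to express a meaningful translation operator in the vertex domain (Bruna et al., 2014; Niepert et al., 2016), Defferrard et al. (2016) chose a spectral formulation for the convolution operator on graph $*_\mathcal{G}$. By this definition, a graph signal $x \in \mathbb{R}^{n \times d_x}$ is filtered by a non-parametric kernel $g_\theta(\Lambda) = \mathrm{diag}(\theta)$, where $\theta \in \mathbb{R}^n$ is a vector of Fourier coefficients, as

$$y = g_\theta *_\mathcal{G} x = g_\theta(L) x = g_\theta(U \Lambda U^T) x = U g_\theta(\Lambda) U^T x \in \mathbb{R}^{n \times d_x}, \tag{3}$$

where $U \in \mathbb{R}^{n \times n}$ is the matrix of eigenvectors and $\Lambda \in \mathbb{R}^{n \times n}$ the diagonal matrix of eigenvalues of the normalized graph Laplacian $L = I_n - D^{-1/2} A D^{-1/2} = U \Lambda U^T \in \mathbb{R}^{n \times n}$, where $I_n$ is the identity matrix and $D \in \mathbb{R}^{n \times n}$ is the diagonal degree matrix with $D_{ii} = \sum_j A_{ij}$ (Chung, 1997). Note that the signal $x$ is filtered by $g_\theta$ with an element-wise multiplication of its graph Fourier transform $U^T x$ with $g_\theta$ (Shuman et al., 2013). Evaluating (3) is however expensive, as the multiplication with $U$ is $\mathcal{O}(n^2)$. Furthermore, computing the eigendecomposition of $L$ might be prohibitively expensive for large graphs. To circumvent this problem, Defferrard et al. (2016) parametrize $g_\theta$ as a truncated expansion, up to order $K - 1$, of Chebyshev polynomials $T_k$ such that

$$g_\theta(\Lambda) = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{\Lambda}), \tag{4}$$

where the parameter $\theta \in \mathbb{R}^K$ is a vector of Chebyshev coefficients and $T_k(\tilde{\Lambda}) \in \mathbb{R}^{n \times n}$ is the Chebyshev polynomial of order $k$ evaluated at $\tilde{\Lambda} = 2\Lambda/\lambda_{max} - I_n$. The graph filtering operation can then be written as

$$y = g_\theta *_\mathcal{G} x = g_\theta(L) x = \sum_{k=0}^{K-1} \theta_k T_k(\tilde{L}) x, \tag{5}$$

where $T_k(\tilde{L}) \in \mathbb{R}^{n \times n}$ is the Chebyshev polynomial of order $k$ evaluated at the scaled Laplacian $\tilde{L} = 2L/\lambda_{max} - I_n$. Using the stable recurrence relation $T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$ with $T_0 = 1$ and $T_1 = x$, one can evaluate (5) in $\mathcal{O}(K|\mathcal{E}|)$ operations, i.e. linearly with the number of edges. Note that as the filtering operation (5) is a polynomial of order $K - 1$ of the Laplacian, it is localized and depends only on nodes that are at most $K - 1$ hops away from the central node. The reader is referred to Defferrard et al. (2016) for details and an in-depth discussion.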
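To make the Chebyshev filtering concrete, here is a small NumPy sketch that evaluates the filter with the three-term recurrence and checks it against the eigendecomposition route. Several things are assumptions for illustration: the function names, the toy path graph, the use of the upper bound 2 in place of the largest eigenvalue of the normalized Laplacian, and dense matrices (a sparse Laplacian would be needed to actually obtain a cost linear in the number of edges).

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2}; assumes every node has at least one edge."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def chebyshev_filter(L, x, theta, lmax=2.0):
    """Evaluate y = sum_k theta[k] T_k(L_tilde) x with the three-term recurrence."""
    L_tilde = 2.0 * L / lmax - np.eye(L.shape[0])
    t_prev, t_curr = x, L_tilde @ x  # T_0(L_tilde) x and T_1(L_tilde) x
    y = theta[0] * t_prev
    if len(theta) > 1:
        y = y + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_prev, t_curr = t_curr, 2.0 * L_tilde @ t_curr - t_prev
        y = y + theta[k] * t_curr
    return y

def spectral_filter(L, x, theta, lmax=2.0):
    """Same filter through the eigendecomposition, used here only as a check."""
    lam, U = np.linalg.eigh(L)
    g = np.polynomial.chebyshev.chebval(2.0 * lam / lmax - 1.0, theta)
    return U @ (g[:, None] * (U.T @ x))

# Toy example: path graph on 6 nodes, 2 features per node, K = 3 coefficients.
n = 6
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
L = normalized_laplacian(A)
rng = np.random.default_rng(0)
x = rng.standard_normal((n, 2))
theta = np.array([0.5, -1.0, 0.25])

y_cheb = chebyshev_filter(L, x, theta)
y_spec = spectral_filter(L, x, theta)

# Locality: with K = 3 coefficients, an impulse at node 0 reaches at most 2 hops.
impulse = np.zeros((n, 1))
impulse[0] = 1.0
y_imp = chebyshev_filter(L, impulse, theta)
```

The two routes agree up to floating-point error, while the recurrence never forms the eigenvectors. The impulse response is zero beyond node 2 of the path, which illustrates the localization of a filter with three coefficients.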

3 Related Works

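Shi et al. (2015) introduced a model for regular grid-structured sequences, which can be seen as a special case of the proposed model where the graph is an image grid whose nodes are well ordered. Their model is essentially the classical FC-LSTM (2) where the multiplications by dense matrices $W$ have been replaced by convolutions with kernels $W$:

$$\begin{aligned}
i &= \sigma(W_{xi} * x_t + W_{hi} * h_{t-1} + w_{ci} \odot c_{t-1} + b_i), \\
f &= \sigma(W_{xf} * x_t + W_{hf} * h_{t-1} + w_{cf} \odot c_{t-1} + b_f), \\
c_t &= f \odot c_{t-1} + i \odot \tanh(W_{xc} * x_t + W_{hc} * h_{t-1} + b_c), \\
o &= \sigma(W_{xo} * x_t + W_{ho} * h_{t-1} + w_{co} \odot c_t + b_o), \\
h_t &= o \odot \tanh(c_t),
\end{aligned} \tag{6}$$

where $*$ denotes the 2D convolution by a set of kernels. In their setting, the input tensor $x_t \in \mathbb{R}^{n_r \times n_c \times d_x}$ is the observation of $d_x$ measurements at time $t$ of a dynamical system over a spatial region represented by a grid of $n_r$ rows and $n_c$ columns. The model holds spatially distributed hidden and cell states of size $d_h$ given by the tensors $c_t, h_t \in \mathbb{R}^{n_r \times n_c \times d_h}$. The size $m$ of the convolutional kernels $W_{h\cdot} \in \mathbb{R}^{m \times m \times d_h \times d_h}$ and $W_{x\cdot} \in \mathbb{R}^{m \times m \times d_h \times d_x}$ determines the number of parameters, which is independent of the grid size $n_r \times n_c$. Earlier, Ranzato et al. (2014) proposed a similar RNN variation which uses convolutional layers instead of fully connected layers. The hidden state at time $t$ is given by

$$h_t = \tanh(\sigma(W_{x2} * \sigma(W_{x1} * x_t)) + \sigma(W_h * h_{t-1})), \tag{7}$$

where the convolutional kernels $W_h \in \mathbb{R}^{d_h \times d_h}$ are restricted to filters of size 1x1 (effectively a fully connected layer shared across all spatial locations).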

Observing that natural language exhibits syntactic properties that naturally combine words into phrases, Tai et al. (2015) proposed a model for tree-structured topologies, where each LSTM has access to the states of its children. They obtained state-of-the-art results on semantic relatedness and sentiment classification. Liang et al. (2016) followed up and proposed a variant on graphs. Their sophisticated network architecture obtained state-of-the-art results for semantic object parsing on four datasets. In those models, the states are gathered from the neighborhood by way of a weighted sum with trainable weight matrices. Those weights are however not shared across the graph, which would otherwise have required some ordering of the nodes, like any other spatial definition of graph convolution. Moreover, their formulations are limited to the one-neighborhood of the current node, with equal weight given to each neighbor.

Motivated by spatio-temporal problems like modeling human motion and object interactions, Jain et al. (2016) developed a method to cast a spatio-temporal graph as a rich RNN mixture which essentially associates a RNN to each node and edge. Again, the communication is limited to directly connected nodes and edges.

The closest model to our work is probably the one proposed by Li et al. (2015), which showed state-of-the-art performance on a problem from program verification. Whereas they use the iterative procedure of the Graph Neural Networks (GNNs) model introduced by Scarselli et al. (2009) to propagate node representations until convergence, we instead use the graph CNN introduced by Defferrard et al. (2016) to diffuse information across the nodes. While their motivations are quite different, those models are related by the fact that a spectral filter defined as a polynomial of order $K$ can be implemented as a $K$-layer GNN. (The basic idea is to set the transition function as a diffusion and the output function such as to realize the polynomial recurrence, then stack $K$ of those. See Defferrard et al. (2016) for details.)

4 Proposed GCRN Models

We propose two GCRN architectures that are quite natural, and investigate their performances in real-world applications in Section 5.

Model 1.

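The most straightforward definition is to stack a graph CNN, defined as (5), for feature extraction and an LSTM, defined as (2), for sequence learning:

$$\begin{aligned}
x_t^{CNN} &= \mathrm{CNN}_\mathcal{G}(x_t), \\
i &= \sigma(W_{xi} x_t^{CNN} + W_{hi} h_{t-1} + w_{ci} \odot c_{t-1} + b_i), \\
f &= \sigma(W_{xf} x_t^{CNN} + W_{hf} h_{t-1} + w_{cf} \odot c_{t-1} + b_f), \\
c_t &= f \odot c_{t-1} + i \odot \tanh(W_{xc} x_t^{CNN} + W_{hc} h_{t-1} + b_c), \\
o &= \sigma(W_{xo} x_t^{CNN} + W_{ho} h_{t-1} + w_{co} \odot c_t + b_o), \\
h_t &= o \odot \tanh(c_t).
\end{aligned} \tag{8}$$

In that setting, the input matrix $x_t \in \mathbb{R}^{n \times d_x}$ may represent the observation of $d_x$ measurements at time $t$ of a dynamical system over a network whose organization is given by a graph $\mathcal{G}$. $x_t^{CNN}$ is the output of the graph CNN gate. For a proof of concept, we simply choose here $x_t^{CNN} = W^{CNN} *_\mathcal{G} x_t$, where $W^{CNN} \in \mathbb{R}^{K \times d_x \times d_x}$ are the Chebyshev coefficients for the graph convolutional kernels of support $K$. The model also holds spatially distributed hidden and cell states of size $d_h$ given by the matrices $c_t, h_t \in \mathbb{R}^{n \times d_h}$. Peepholes are controlled by $w_{c\cdot} \in \mathbb{R}^{n \times d_h}$. The weights $W_{h\cdot} \in \mathbb{R}^{d_h \times d_h}$ and $W_{x\cdot} \in \mathbb{R}^{d_h \times d_x}$ are the parameters of the fully connected layers. An architecture such as (8) may be enough to capture the data distribution by exploiting local stationarity and compositionality properties as well as the dynamic properties.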
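A minimal NumPy sketch of one Model 1 step follows: a Chebyshev graph convolution extracts features, then an LSTM with dense weights updates the states. It rests on assumptions that are not confirmed by the text: signals are stored as (n, d) arrays with one row per node, the dense LSTM weights act on the feature axis of every node, the largest Laplacian eigenvalue is replaced by its upper bound 2, and all names, sizes and the ring graph are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def scaled_laplacian(A, lmax=2.0):
    """L_tilde = 2 L / lmax - I, with L the normalized Laplacian (no isolated nodes)."""
    n = A.shape[0]
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return 2.0 * L / lmax - np.eye(n)

def graph_conv(L_tilde, x, W):
    """Chebyshev graph convolution: x is (n, d_in), W is (K, d_out, d_in)."""
    terms = [x]
    if W.shape[0] > 1:
        terms.append(L_tilde @ x)
    for _ in range(2, W.shape[0]):
        terms.append(2.0 * L_tilde @ terms[-1] - terms[-2])
    # Sum over the K polynomial terms and the input features.
    return np.einsum("knd,kod->no", np.stack(terms), W)

def gcrn_m1_step(L_tilde, x_t, h_prev, c_prev, p):
    """Graph CNN feature extraction followed by a peephole LSTM update."""
    x_cnn = graph_conv(L_tilde, x_t, p["W_cnn"])  # (n, d_x)

    def dense(g):
        return x_cnn @ p["W_x" + g].T + h_prev @ p["W_h" + g].T + p["b_" + g]

    i = sigmoid(dense("i") + p["w_ci"] * c_prev)
    f = sigmoid(dense("f") + p["w_cf"] * c_prev)
    c_t = f * c_prev + i * np.tanh(dense("c"))
    o = sigmoid(dense("o") + p["w_co"] * c_t)
    return o * np.tanh(c_t), c_t

# Toy setup: ring graph with n nodes and random parameters.
n, d_x, d_h, K = 8, 2, 3, 3
rng = np.random.default_rng(0)
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
L_tilde = scaled_laplacian(A)

p = {"W_cnn": 0.1 * rng.standard_normal((K, d_x, d_x))}
for g in "ifco":
    p["W_x" + g] = 0.1 * rng.standard_normal((d_h, d_x))
    p["W_h" + g] = 0.1 * rng.standard_normal((d_h, d_h))
    p["b_" + g] = np.zeros(d_h)
for g in "ifo":
    p["w_c" + g] = 0.1 * rng.standard_normal((n, d_h))

h, c = np.zeros((n, d_h)), np.zeros((n, d_h))
for x_t in rng.standard_normal((4, n, d_x)):  # a random length-4 sequence
    h, c = gcrn_m1_step(L_tilde, x_t, h, c, p)
```

In this sketch the graph only enters through the feature extraction step; the recurrent update itself mixes features node by node.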

Model 2.

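To generalize the convLSTM model (6) to graphs we replace the Euclidean 2D convolution $*$ by the graph convolution $*_\mathcal{G}$:

$$\begin{aligned}
i &= \sigma(W_{xi} *_\mathcal{G} x_t + W_{hi} *_\mathcal{G} h_{t-1} + w_{ci} \odot c_{t-1} + b_i), \\
f &= \sigma(W_{xf} *_\mathcal{G} x_t + W_{hf} *_\mathcal{G} h_{t-1} + w_{cf} \odot c_{t-1} + b_f), \\
c_t &= f \odot c_{t-1} + i \odot \tanh(W_{xc} *_\mathcal{G} x_t + W_{hc} *_\mathcal{G} h_{t-1} + b_c), \\
o &= \sigma(W_{xo} *_\mathcal{G} x_t + W_{ho} *_\mathcal{G} h_{t-1} + w_{co} \odot c_t + b_o), \\
h_t &= o \odot \tanh(c_t).
\end{aligned} \tag{9}$$

In that setting, the support $K$ of the graph convolutional kernels defined by the Chebyshev coefficients $W_{h\cdot} \in \mathbb{R}^{K \times d_h \times d_h}$ and $W_{x\cdot} \in \mathbb{R}^{K \times d_h \times d_x}$ determines the number of parameters, which is independent of the number of nodes $n$. To keep the notation simple, we write $W_{xi} *_\mathcal{G} x_t$ to mean a graph convolution of $x_t$ with $d_h d_x$ filters which are functions of the graph Laplacian $L$ parametrized by $K$ Chebyshev coefficients, as noted in (4) and (5). In a distributed computing setting, $K$ controls the communication overhead, i.e. the number of nodes any given node should exchange with in order to compute its local states.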
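The following NumPy sketch illustrates one Model 2 step, where every dense product of the LSTM is replaced by a Chebyshev graph convolution. It is a sketch under stated assumptions, not the authors' code: kernels are stored as (K, d_out, d_in) arrays, signals as (n, d) arrays, the largest Laplacian eigenvalue is replaced by its upper bound 2, and the helper names, sizes and ring graph are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def scaled_laplacian(A, lmax=2.0):
    """L_tilde = 2 L / lmax - I, with L the normalized Laplacian (no isolated nodes)."""
    n = A.shape[0]
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return 2.0 * L / lmax - np.eye(n)

def graph_conv(L_tilde, x, W):
    """Chebyshev graph convolution: x is (n, d_in), W is (K, d_out, d_in)."""
    terms = [x]
    if W.shape[0] > 1:
        terms.append(L_tilde @ x)
    for _ in range(2, W.shape[0]):
        terms.append(2.0 * L_tilde @ terms[-1] - terms[-2])
    return np.einsum("knd,kod->no", np.stack(terms), W)

def gcrn_m2_step(L_tilde, x_t, h_prev, c_prev, p):
    """LSTM step whose input and state transforms are graph convolutions."""
    def conv(g):
        return (graph_conv(L_tilde, x_t, p["W_x" + g])
                + graph_conv(L_tilde, h_prev, p["W_h" + g]) + p["b_" + g])

    i = sigmoid(conv("i") + p["w_ci"] * c_prev)
    f = sigmoid(conv("f") + p["w_cf"] * c_prev)
    c_t = f * c_prev + i * np.tanh(conv("c"))
    o = sigmoid(conv("o") + p["w_co"] * c_t)
    return o * np.tanh(c_t), c_t

def init_params(n, d_x, d_h, K, rng):
    p = {}
    for g in "ifco":
        p["W_x" + g] = 0.1 * rng.standard_normal((K, d_h, d_x))
        p["W_h" + g] = 0.1 * rng.standard_normal((K, d_h, d_h))
        p["b_" + g] = np.zeros(d_h)
    for g in "ifo":
        p["w_c" + g] = 0.1 * rng.standard_normal((n, d_h))
    return p

def kernel_param_count(p):
    """Number of graph convolution coefficients (peepholes and biases excluded)."""
    return sum(v.size for k, v in p.items() if k.startswith("W_"))

n, d_x, d_h, K = 8, 2, 3, 3
rng = np.random.default_rng(0)
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
L_tilde = scaled_laplacian(A)
p = init_params(n, d_x, d_h, K, rng)

h, c = np.zeros((n, d_h)), np.zeros((n, d_h))
for x_t in rng.standard_normal((4, n, d_x)):
    h, c = gcrn_m2_step(L_tilde, x_t, h, c, p)
```

`kernel_param_count` returns `4 * K * d_h * (d_x + d_h)` whatever the value of `n`, which is the property stated above: the number of kernel coefficients depends on the support and the feature sizes, not on the number of nodes.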

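The proposed blend of RNNs and graph CNNs is not limited to LSTMs and is straightforward to apply to any kind of recurrent network. For example, a vanilla RNN would be modified as

$$h_t = \tanh(W_x *_\mathcal{G} x_t + W_h *_\mathcal{G} h_{t-1}), \tag{10}$$

and a Gated Recurrent Unit (GRU) (Cho et al., 2014) as

$$\begin{aligned}
z &= \sigma(W_{xz} *_\mathcal{G} x_t + W_{hz} *_\mathcal{G} h_{t-1}), \\
r &= \sigma(W_{xr} *_\mathcal{G} x_t + W_{hr} *_\mathcal{G} h_{t-1}), \\
\tilde{h} &= \tanh(W_{xh} *_\mathcal{G} x_t + W_{hh} *_\mathcal{G} (r \odot h_{t-1})), \\
h_t &= z \odot h_{t-1} + (1 - z) \odot \tilde{h}.
\end{aligned} \tag{11}$$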


As demonstrated by Shi et al. (2015), structure-aware LSTM cells can be stacked and used as sequence-to-sequence models using an architecture composed of an encoder, which processes the input sequence, and a decoder, which generates an output sequence. This is a standard practice for machine translation using RNNs (Cho et al., 2014; Sutskever et al., 2014).
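As an illustration of how the same substitution carries over to other recurrent cells, the sketch below applies graph convolutions inside a GRU and unrolls it over a short sequence. It assumes one common GRU convention (update gate `z`, reset gate `r`, new state `z * h_prev + (1 - z) * candidate`); the kernel layout, helper names, sizes and ring graph are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def scaled_laplacian(A, lmax=2.0):
    """L_tilde = 2 L / lmax - I, with L the normalized Laplacian (no isolated nodes)."""
    n = A.shape[0]
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = np.eye(n) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    return 2.0 * L / lmax - np.eye(n)

def graph_conv(L_tilde, x, W):
    """Chebyshev graph convolution: x is (n, d_in), W is (K, d_out, d_in)."""
    terms = [x]
    if W.shape[0] > 1:
        terms.append(L_tilde @ x)
    for _ in range(2, W.shape[0]):
        terms.append(2.0 * L_tilde @ terms[-1] - terms[-2])
    return np.einsum("knd,kod->no", np.stack(terms), W)

def graph_gru_step(L_tilde, x_t, h_prev, p):
    """GRU step with graph convolutions in place of dense products."""
    z = sigmoid(graph_conv(L_tilde, x_t, p["W_xz"]) + graph_conv(L_tilde, h_prev, p["W_hz"]))
    r = sigmoid(graph_conv(L_tilde, x_t, p["W_xr"]) + graph_conv(L_tilde, h_prev, p["W_hr"]))
    h_cand = np.tanh(graph_conv(L_tilde, x_t, p["W_xh"])
                     + graph_conv(L_tilde, r * h_prev, p["W_hh"]))
    return z * h_prev + (1.0 - z) * h_cand

def unroll(L_tilde, xs, p, d_h):
    """Run the cell over a sequence xs of shape (T, n, d_x); return all hidden states."""
    h = np.zeros((xs.shape[1], d_h))
    states = []
    for x_t in xs:
        h = graph_gru_step(L_tilde, x_t, h, p)
        states.append(h)
    return np.stack(states)

n, d_x, d_h, K, T = 8, 2, 3, 3, 5
rng = np.random.default_rng(0)
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
L_tilde = scaled_laplacian(A)
p = {}
for g in "zrh":
    p["W_x" + g] = 0.1 * rng.standard_normal((K, d_h, d_x))
    p["W_h" + g] = 0.1 * rng.standard_normal((K, d_h, d_h))

hs = unroll(L_tilde, rng.standard_normal((T, n, d_x)), p, d_h)
```

An encoder in a sequence-to-sequence setup would keep only the last entry of `hs` and hand it to a decoder cell of the same kind as its initial state.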

5 Experiments

5.1 Spatio-Temporal Sequence Modeling on Moving-MNIST

For this synthetic experiment, we use the moving-MNIST dataset generated by Shi et al. (2015). All sequences are 20 frames long (10 frames as input and 10 frames for prediction) and contain two handwritten digits bouncing inside a $64 \times 64$ patch. Following their experimental setup, all models are trained by minimizing the binary cross-entropy loss using back-propagation through time (BPTT) and RMSProp with a learning rate of $10^{-3}$ and a decay rate of 0.9. We choose the best model with early stopping on the validation set. All implementations are based on their Theano code and dataset. The adjacency matrix $A$ is constructed as a k-nearest-neighbor (knn) graph with Euclidean distance and Gaussian kernel between pixel locations. For a fair comparison with the convLSTM of Shi et al. (2015) defined in (6), all GCRN experiments are conducted with Model 2 defined in (9), which is the same architecture with the 2D convolution $*$ replaced by a graph convolution $*_\mathcal{G}$. To further explore the impact of the isotropic property of our filters, we generated a variant of the moving MNIST dataset where digits are also rotating (see Figure 4).
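The graph construction can be sketched as follows: connect each pixel to its k nearest pixels in Euclidean distance and weight the edges with a Gaussian kernel. The function name, the bandwidth choice (mean neighbor distance) and the symmetrization by element-wise maximum are assumptions for illustration, since the text does not specify them. The dense distance matrix also limits this sketch to small grids.

```python
import numpy as np

def grid_knn_graph(rows, cols, k=8, sigma=None):
    """Weighted k-NN adjacency between the pixel locations of a rows x cols grid."""
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    n = len(coords)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # (n, n) Euclidean distances
    np.fill_diagonal(dist, np.inf)            # a pixel is not its own neighbor
    nbrs = np.argsort(dist, axis=1)[:, :k]    # indices of the k nearest pixels
    src = np.repeat(np.arange(n), k)
    dst = nbrs.ravel()
    if sigma is None:
        sigma = dist[src, dst].mean()         # hypothetical bandwidth choice
    A = np.zeros((n, n))
    A[src, dst] = np.exp(-dist[src, dst] ** 2 / (2.0 * sigma ** 2))
    return np.maximum(A, A.T)                 # symmetrize: keep an edge if either end has it

# Small 8 x 8 grid for illustration.
A = grid_knn_graph(8, 8, k=8)
center = 4 * 8 + 4                              # pixel at row 4, column 4
num_center_neighbors = int((A[center] > 0).sum())
```

For a pixel away from the border, the 8 nearest neighbors are exactly the surrounding 3x3 ring, so `num_center_neighbors` is 8. Border pixels pick up some neighbors two steps away because they have fewer adjacent pixels.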

| Architecture | Structure | Filter size | Parameters | Runtime | Test (w/o Rot) | Test (Rot) |
|---|---|---|---|---|---|---|
| FC-LSTM | N/A | N/A | 142,667,776 | N/A | 4832 | - |
| LSTM+CNN | N/A | $5 \times 5$ | 13,524,496 | 2.10 | 3851 | 4339 |
| LSTM+CNN | N/A | $9 \times 9$ | 43,802,128 | 6.10 | 3903 | 4208 |
| LSTM+GCNN | | $K = 3$ | 1,629,712 | 0.82 | 3866 | 4367 |
| LSTM+GCNN | | $K = 5$ | 2,711,056 | 1.24 | 3495 | 3932 |
| LSTM+GCNN | | $K = 7$ | 3,792,400 | 1.61 | 3400 | 3803 |
| LSTM+GCNN | | $K = 9$ | 4,873,744 | 2.15 | 3395 | 3814 |
| LSTM+GCNN | | $K = 7$ | 3,792,400 | 1.61 | 3446 | 3844 |
| LSTM+GCNN | | $K = 7$ | 3,792,400 | 1.61 | 3578 | 3963 |

Table 1: Comparison between models. Runtime is the time spent per mini-batch in seconds. Test cross-entropies correspond to moving MNIST, and rotating and moving MNIST. LSTM+GCNN is Model 2 defined in (9). Cross-entropy of FC-LSTM is taken from Shi et al. (2015).
Figure 3: Cross-entropy on validation set. Left: performance of graph CNN with various filter supports $K$. Right: performance w.r.t. graph construction.
Figure 4: Qualitative results for moving MNIST, and rotating and moving MNIST. First row is the input sequence, second the ground truth, and third and fourth are the predictions of the LSTM+CNN and LSTM+GCNN models.

Table 1 shows the performance of various models: (i) the baseline FC-LSTM from Shi et al. (2015), (ii) the 1-layer LSTM+CNN from Shi et al. (2015) with different filter sizes, and (iii) the proposed LSTM+graph CNN (GCNN) defined in (9) with different supports $K$. These results show the ability of the proposed method to capture spatio-temporal structures. Perhaps surprisingly, GCNNs can offer better performance than regular CNNs, even when the domain is a 2D grid and the data is images, the problem CNNs were initially developed for. The explanation is to be found in the differences between 2D filters and spectral graph filters. While a spectral filter of support $K = 3$ corresponds to the reach of a patch of size $5 \times 5$ (see Figure 2), the difference resides in the isotropic nature of the former and the number of parameters: $K = 3$ for the former and $5^2 = 25$ for the latter. Table 1 indeed shows that LSTM+CNN($5 \times 5$) rivals LSTM+GCNN with $K = 3$. However, when increasing the filter size to $K = 5$ or $7$, the GCNN variant clearly outperforms the CNN variant. This experiment demonstrates that graph spectral filters can obtain superior performance on regular domains with far fewer parameters thanks to their isotropic nature, a controversial property. Indeed, as the nodes are not ordered, there is no notion of an edge going up, down, right or left. All edges are treated equally, inducing some sort of rotation invariance. Additionally, Table 1 shows that the runtime of each model grows roughly linearly with the filter size, and Figure 3 shows the learning dynamics of some of the models.
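The parameter counts in Table 1 can be checked with a little arithmetic. The sketch below assumes that the count is an affine function of the number of coefficients per filter, and that the four distinct LSTM+GCNN counts correspond to $K = 3, 5, 7, 9$ and the two LSTM+CNN counts to $5 \times 5$ and $9 \times 9$ filters; this assignment is inferred here, and it turns out to fit every reported count exactly.

```python
# Parameter counts reported in Table 1, keyed by the assumed number of
# coefficients per filter: K for graph filters, m * m for 2D filters.
gcnn = {3: 1_629_712, 5: 2_711_056, 7: 3_792_400, 9: 4_873_744}
cnn = {5 * 5: 13_524_496, 9 * 9: 43_802_128}

# Fit params = slope * coefficients + offset on the first two GCNN rows.
slope = (gcnn[5] - gcnn[3]) // (5 - 3)
offset = gcnn[3] - 3 * slope

# Every other row should then be predicted exactly by the same affine law.
predictions = {coeffs: slope * coeffs + offset for coeffs in {**gcnn, **cnn}}
all_match = all(predictions[c] == reported for c, reported in {**gcnn, **cnn}.items())

# Ratio of coefficients for filters of equal reach (5 x 5 patch vs. K = 3).
ratio = (5 * 5) / 3
```

Under these assumptions each additional coefficient costs the same number of parameters in both model families, so a $5 \times 5$ 2D filter spends about 8 times more coefficients than a graph filter with $K = 3$ to cover the same neighborhood.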

5.2 Natural Language Modeling on Penn Treebank

The Penn Treebank dataset has 1,036,580 words. It was pre-processed in Zaremba et al. (2014) and split into a training set of 929k words, a validation set of 73k words, and a test set of 82k words. The size of the vocabulary of this corpus is 10,000. We use the gensim library to compute a word2vec model (Mikolov et al., 2013) for embedding the words of the dictionary in a 200-dimensional space. Then we build the adjacency matrix of the word embedding using a 4-nearest neighbor graph with cosine distance. Figure 6 presents the computed adjacency matrix, and its 3D visualization. We used the hyperparameters of the small configuration given by the code based on Zaremba et al. (2014): the size of the data mini-batch is 20, the number of temporal steps to unroll is 20, the dimension of the hidden state is 200. The global learning rate is 1.0 and the norm of the gradient is bounded by 5. The learning decay function is selected to be $0.5^{\max(0, \#\text{epoch} - 4)}$. All experiments have 13 epochs, and dropout value is 0.75. For Zaremba et al. (2014), the input representation can be either the 200-dim embedding vector of the word, or the 10,000-dim one-hot representation of the word. For our models, the input representation is a one-hot representation of the word. This choice allows us to use the graph structure of the words.
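A sketch of the vocabulary graph construction is given below. Random vectors stand in for the word2vec embeddings, and the binary edge weights, the symmetrization and the function name are assumptions for illustration; the sizes are much smaller than the 10,000-word vocabulary so that dense matrices remain cheap.

```python
import numpy as np

def cosine_knn_graph(E, k=4):
    """Symmetric binary k-NN adjacency from embeddings E of shape (n, d)."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)
    sim = En @ En.T                          # cosine similarity between all pairs
    np.fill_diagonal(sim, -np.inf)           # a word is not its own neighbor
    nbrs = np.argsort(-sim, axis=1)[:, :k]   # k most similar words per row
    n = E.shape[0]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
    return np.maximum(A, A.T)                # undirected: keep an edge if either end has it

# Hypothetical stand-in for the embeddings: 50 "words" in 16 dimensions.
rng = np.random.default_rng(0)
E = rng.standard_normal((50, 16))
A = cosine_knn_graph(E, k=4)

# A word fed to the model as a one-hot graph signal with one feature per node.
word_id = 7
x_t = np.zeros((A.shape[0], 1))
x_t[word_id] = 1.0
neighbors = np.flatnonzero(A[word_id])       # words a graph filter can reach in one hop
```

With a one-hot input, a graph convolution spreads the activation of the current word to its neighbors in the embedding graph, which is how the graph structure of the vocabulary enters the model.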

Figure 5: Learning dynamic of LSTM with and without graph structure and dropout regularization.
| Architecture | Representation | Parameters | Train Perplexity | Test Perplexity |
|---|---|---|---|---|
| Zaremba et al. (2014) code | embedding | 681,800 | 36.96 | 117.29 |
| Zaremba et al. (2014) code | one-hot | 34,011,600 | 53.89 | 118.82 |
| LSTM | embedding | 681,800 | 48.38 | 120.90 |
| LSTM | one-hot | 34,011,600 | 54.41 | 120.16 |
| LSTM, dropout | one-hot | 34,011,600 | 145.59 | 112.98 |
| GCRN-M1 | one-hot | 42,011,602 | 18.49 | 177.14 |
| GCRN-M1, dropout | one-hot | 42,011,602 | 114.29 | 98.67 |

Table 2: Comparison of models in terms of perplexity. Zaremba et al. (2014) code is run as the benchmark algorithm. The original Zaremba et al. (2014) code used as input the 200-dim embedding representation of words, computed here by the gensim library. As our model runs on the 10,000-dim one-hot representation of words, we also ran Zaremba et al. (2014) code on this representation. We re-implemented Zaremba et al. (2014) code with the same architecture and hyperparameters. Recall that GCRN-M1 refers to GCRN Model 1 defined in (8).

Table 2 reports the final train and test perplexity values for each investigated model and Figure 5 plots the perplexity value vs. the number of epochs for the train and test sets with and without dropout regularization. Numerical experiments show:

  1. Given the same experimental conditions in terms of architecture and no dropout regularization, the standalone LSTM model is more accurate than the LSTM using the spatial graph information (test perplexity 120.16 vs. 177.14), extracted by graph CNN with the GCRN architecture of Model 1, Eq. (8).

  2. However, using dropout regularization, the graph LSTM model outperforms the standalone LSTM, with test perplexity values of 98.67 vs. 112.98.

  3. The use of spatial graph information found by graph CNN speeds up the learning process, and overfits the training dataset in the absence of dropout regularization. The graph structure likely acts as a constraint on the learning system, which is forced to move in the space of language topics.

  4. We performed the same experiments with LSTM and Model 2 defined in (9). Model 1 significantly outperformed Model 2, and Model 2 did worse than standalone LSTM. This bad performance may be the result of the large increase of dimensionality in Model 2, as the dimension of the hidden and cell states changes from 200 to 10,000, the size of the vocabulary. A solution would be to downsize the data dimensionality, as done in Shi et al. (2015) in the case of image data.

Figure 6: Left: adjacency matrix of word embeddings. Right: 3D visualization of words’ structure.

6 Conclusion and Future Work

This work aims at learning spatio-temporal structures from graph-structured and time-varying data. In this context, the main challenge is to identify the best possible architecture that combines recurrent neural networks like vanilla RNN, LSTM or GRU with convolutional neural networks for graph-structured data. We have investigated here two architectures, one using a stack of CNN and RNN (Model 1), and one using convLSTM that considers convolutions instead of fully connected operations in the RNN definition (Model 2). We have then considered two applications: video prediction and natural language modeling. Model 2 has shown good performance in the case of video prediction, improving on the results of Shi et al. (2015). Model 1 has also provided promising performance in the case of language modeling, particularly in terms of learning speed. It has been shown that (i) isotropic filters, maybe surprisingly, can outperform classical 2D filters on images while requiring far fewer parameters, and (ii) that graphs coupled with graph CNN and RNN are a versatile way of introducing and exploiting side-information, e.g. the semantics of words, by structuring a data matrix.

Future work will investigate applications to data naturally structured as dynamic graph signals, for instance fMRI and sensor networks. The graph CNN model we have used is rotationally invariant, and such a spatial property seems quite attractive in real situations where motion is beyond translation. We will also investigate how to benefit from the fast learning property of our system to speed up language models. Finally, it will be interesting to analyze the underlying dynamical properties of generic RNN architectures in the case of graphs. Graph structures may introduce stability to RNN systems, and prevent them from expressing unstable dynamic behaviors.


This research was supported in part by the European Union’s H2020 Framework Programme (H2020-MSCA-ITN-2014) under grant No. 642685 MacSeNet, and Nvidia equipment grant.