1 Introduction
Supervised sequence modeling aims to build models that extract effective features from a variety of sequence data, such as video or text, via supervised learning. It has extensive applications ranging from computer vision [26, 35] to natural language processing [8, 42]. The key challenge in supervised sequence modeling is to capture the long-range temporal dependencies, which are used to further learn high-level features for the whole sequence. Most state-of-the-art methods for supervised sequence modeling are built upon recurrent neural networks (RNNs) [32], whose effectiveness has been validated extensively [33, 52]. One crucial limitation of vanilla RNNs is the gradient-vanishing problem along the temporal domain, which results in the inability to model long-term dependencies. This limitation is substantially mitigated by gated recurrent networks such as GRU [4] and LSTM [11], which employ learnable gates to selectively retain information in the memory or hidden states. Memory-based methods for sequence modeling [34, 39, 45] were further proposed to address the limited memory of recurrent networks. However, a potential drawback of these methods is that they only model explicitly the information interactions between adjacent time steps in the sequence, hence the high-order interactions between non-adjacent time steps are not fully exploited. This drawback gives rise to two negative consequences: 1) the high-level features contained in the interactions between non-adjacent time steps cannot be distilled; 2) it greatly limits the modeling of long-range temporal dependencies, since one-order interaction information cannot be maintained over a long term due to information dilution and gradient vanishing along recurrent operations.
Inspired by non-local methods [3, 44], which aim to explore potential interactions between all pairs of feature portions, we propose to perform non-local operations to model the high-order interactions between non-adjacent time steps in a sequence. The captured high-order interactions not only help distill latent high-level features that are hardly learned by typical sequence modeling methods focusing on one-order interactions, but also contribute to modeling long-range temporal dependencies, since the non-local operations strengthen latent feature propagation and thus substantially alleviate the vanishing-gradient problem. Since exploring full-order interactions between all time steps of a long sequence is computationally expensive, and also unnecessary due to information redundancy, we model full-order interactions by non-local operations within a temporal block (a segment of the sequence) and slide the block to recurrently update the extracted information. More specifically, we propose the Non-local Recurrent Neural Memory (NRNM) to perform block-wise non-local operations to learn full-order interactions within each memory block and capture local but high-resolution temporal dependencies. Meanwhile, the global interactions between adjacent blocks are captured by updating the memory states in a gated recurrent manner when sliding the memory cell. Consequently, long-range dependencies can be maintained. Figure 1 illustrates our method with an example of action recognition. Compared to typical supervised sequence modeling methods, especially recurrent networks with memory mechanisms, our NRNM benefits from the following advantages:

It is able to model 1) the local full-order interactions between all time steps within a segment of the sequence (memory block) and 2) the global interactions between memory blocks. Thus, it can capture much longer temporal dependencies.

The proposed NRNM is able to learn latent high-level features contained in high-order interactions between non-adjacent time steps, which may be missed by conventional methods.

The NRNM cell can be seamlessly integrated into any existing sequence model with recurrent structure to enhance the power of sequence modeling. The integrated model can be trained in an end-to-end manner.
2 Related work
Graphical sequence models.
The conventional graphical models for sequence modeling can be roughly divided into two categories: generative and discriminative models. A well-known example of a generative model is the Hidden Markov Model (HMM) [30], which models sequence data with a chain of latent nominal variables. Discriminative graphical models model the distribution over all class labels conditioned on the input data. The Conditional Random Field (CRF) [19] is a discriminative model for sequential prediction that assumes a linear mapping between observations and labels. To tackle this limitation of linearity, many nonlinear CRF variants have been proposed [25, 27, 28, 41]. The disadvantages of graphical models compared to recurrent networks lie in their hard optimization and limited capability of temporal modeling. Our model is designed based on recurrent networks.

Recurrent Networks. The Recurrent Neural Network [32]
learns a hidden representation for each time step by taking into account both current and previous information. Benefiting from its advantages such as easy training and temporal modeling, it has been successfully applied to, amongst others, handwriting recognition [2] and speech recognition [33]. However, the key limitation of the vanilla RNN is the gradient vanishing problem during training [10], so it cannot model long-range temporal dependencies. This limitation is alleviated by gated recurrent networks such as the Long Short-Term Memory (LSTM) [11] and the Gated Recurrent Unit (GRU) [4], which selectively retain information through learnable gates. Nevertheless, a potential limitation of these models is that they only model explicitly one-order interactions between adjacent time steps, hence the high-order interactions between non-adjacent time steps are not fully captured. Our model is proposed to circumvent this drawback by employing non-local operations to model full-order interactions in a block-wise manner. Meanwhile, the global interactions between blocks are modeled by a gated recurrent mechanism. Thus, our model is able to model long-range temporal dependencies and distill high-level features contained in high-order interactions.

Memory-based recurrent networks. Memory networks were first proposed to rectify the limited memory of recurrent networks [39, 45], and were then extended to various tasks, especially in natural language processing. Most of these models build external memory units upon a basis model to augment its memory [9, 34, 39, 45]. In particular, attention mechanisms [1] are employed to filter the information flow from memory [8, 18, 39, 47]. The primary difference between these memory-based recurrent networks and our model is that they focus on augmenting the memory size to memorize more information for reference, while our model aims to model high-order interactions between different time steps in a sequence, which is not addressed by existing memory-based networks.
3 Non-local Recurrent Neural Memory
Given as input a sequence, our Non-local Recurrent Neural Memory (NRNM) is designed as a memory module to capture long-range temporal dependencies in a non-local manner. It can be seamlessly integrated into any existing sequential model with recurrent structure to enhance sequence modeling. As illustrated in Figure 2, we build our NRNM upon an LSTM backbone as an instantiation. We first elaborate on the cell structure of our NRNM and then describe how NRNM and the LSTM backbone perform sequence modeling collaboratively.
3.1 NRNM Cell
Residing on a basis sequence model (a standard LSTM backbone), the proposed Non-local Recurrent Neural Memory (NRNM) aims to maintain a memory cell along the temporal dimension which can not only distill the underlying information contained in past time steps of the input sequence but also capture temporal dependencies, both over a long temporal range. To this end, the NRNM cell performs non-local operations on a segment of the sequence (termed a non-local memory block) that slides along the temporal domain, as shown in Figure 2. This block-wise design is analogous to DenseNet [12], which performs dense connections (a form of non-local operation) within blocks. The obtained memory embeddings are further leveraged to refine the hidden embeddings used for the final prediction. The memory state is updated recurrently when sliding the memory block, consistent with the update of the basis LSTM backbone.
Consider an input sequence x_1, …, x_T of length T, in which x_t denotes the observation at the t-th time step. The hidden representation of the input sequence at time step t learned by the LSTM backbone is denoted as h_t. NRNM learns the memory embedding for a block (a segment of time steps) with block size B covering the temporal interval [t−B+1, t] by refining the underlying information contained in this time interval. Specifically, we consider two types of source information for the NRNM cell: 1) the learned hidden representations h_{t−B+1}, …, h_t in this time interval by the LSTM backbone; 2) the original input features x_{t−B+1}, …, x_t. Hence the memory embedding M̃_t at time step t is formulated as:

M̃_t = f_NRNM(h_{t−B+1:t}, x_{t−B+1:t}),   (1)

where f_NRNM is the nonlinear transformation function performed by the NRNM cell. Here we incorporate the input features x_{t−B+1:t}, which are already assimilated in the hidden representations of the basis LSTM backbone, since we aim to explore the latent interactions between hidden representations and input features in the current block (i.e., the interval [t−B+1, t]).
Next we elaborate on the transformation function f_NRNM of the NRNM cell, presented in Figure 3. To distill the information in the current block that is worth retaining in memory, we apply a Self-Attention mechanism implemented with multi-head attention [42] to model latent full-order interactions among the source information, namely the original input features x_{t−B+1:t} and the hidden representations h_{t−B+1:t} produced by the LSTM in the current block:

Q = S W_q,  K = S W_k,  V = S W_v,  A = softmax(Q Kᵀ / √d_m),  O = A V,   (2)

where S = [h_{t−B+1:t}; x_{t−B+1:t}] stacks the source information units. Herein, Q, K and V are the queries, keys and values of Self-Attention, transformed from the source information S by the parameters W_q, W_k and W_v respectively. A is the attention weight matrix calculated by the dot-product attention scheme scaled by the memory hidden size d_m. The obtained attention embedding O is then fed into two skip-connection layers and one fully-connected layer to produce the memory embedding M̃_t.
The physical interpretation of this design is that the source information is composed of 2B information units: B LSTM hidden states and B input features. Each unit of the obtained memory embedding is constructed by attending to each of these source information units, while the size of the memory embedding can be customized via the parametric transformation. As such, the full-order latent interactions between the source information units are explored in a non-local way. Another benefit of such non-local operations is that they strengthen latent feature propagation and thus alleviate the vanishing-gradient problem from which recurrent networks always suffer.
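To make the non-local operation concrete, below is a minimal single-head NumPy sketch of the attention in Equation 2. It assumes hidden states and input features share one feature width, uses random stand-ins for the learned projection matrices, and omits the follow-up skip-connection and fully-connected layers, so it is an illustration rather than the authors' exact multi-head implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def nrnm_attention(H, X, d_m, rng):
    """Single-head sketch of the non-local operation (Eq. 2).

    H: (B, d) LSTM hidden states of the current block.
    X: (B, d) input features of the same block (assumed same width as H).
    Returns a (2B, d_m) attention embedding in which every output unit
    attends to all 2B source units, i.e. full-order interactions.
    The weight matrices are random stand-ins for learned parameters.
    """
    S = np.concatenate([H, X], axis=0)                 # stack source units
    Wq, Wk, Wv = (rng.standard_normal((S.shape[1], d_m)) * 0.1
                  for _ in range(3))
    Q, K, V = S @ Wq, S @ Wk, S @ Wv                   # queries, keys, values
    A = softmax(Q @ K.T / np.sqrt(d_m))                # scaled dot-product weights
    return A @ V                                       # each row mixes all sources
```

Every row of the attention matrix A spans both hidden states and raw inputs, which is exactly the interaction between the two information types that Equation 1 is designed to expose.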
Since the LSTM hidden states already contain history information through the recurrent structure, in practice we use a striding scheme to select hidden states as the source information for the NRNM cell, which avoids potential information redundancy and improves modeling efficiency. For instance, we pick hidden states at a fixed stride within the temporal interval [t−B+1, t] as the source information.

Gated recurrent update of memory state. M̃_t only contains information within the current temporal block [t−B+1, t]. To model the temporal consistency between adjacent memory blocks, we also update the memory state in a gated recurrent manner, similar to the recurrent scheme of LSTM. Specifically, the final memory state M_t for the current memory block is obtained by:

M_t = g_t^i ⊙ M̃_t + g_t^f ⊙ M_{t−s},   (3)
where s is the sliding window size of the NRNM cell, which controls the updating frequency of the memory state. g_t^i and g_t^f are the input gate and forget gate respectively, balancing the memory information flow from the current block and the previous memory state M_{t−s}. They are modeled by measuring the compatibility between the current input feature x_t and the previous memory state:

g_t^i = σ(W_i^m x_t + U_i^m M_{t−s} + b_i^m),  g_t^f = σ(W_f^m x_t + U_f^m M_{t−s} + b_f^m),   (4)

wherein W_i^m, W_f^m, U_i^m and U_f^m are transformation matrices while b_i^m and b_f^m are bias terms.
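The gated update of Equations 3 and 4 can be sketched as follows on flattened memory vectors; the parameter names are illustrative placeholders, not the authors' implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update_memory(m_prev, m_tilde, x_t, Wi, Ui, bi, Wf, Uf, bf):
    """Gated recurrent update of the memory state (Eqs. 3-4), a sketch.

    m_prev:  previous memory state (flattened), shape (d_m,)
    m_tilde: memory embedding of the current block, shape (d_m,)
    x_t:     current input feature, shape (d_x,)
    The gates weigh new block information against the old state.
    """
    g_i = sigmoid(x_t @ Wi + m_prev @ Ui + bi)   # input gate (Eq. 4)
    g_f = sigmoid(x_t @ Wf + m_prev @ Uf + bf)   # forget gate (Eq. 4)
    return g_i * m_tilde + g_f * m_prev          # Eq. 3
```

With all parameters at zero both gates evaluate to 0.5, so the new state is the average of the block embedding and the previous memory, which makes the balancing role of the two gates easy to see.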
Modeling long-range dependencies. We aim to capture underlying long-range dependencies in a sequence through a two-pronged strategy:

We perform non-local operations within each temporal block via the NRNM cell to capture full-order interactions locally between different time steps and distill a high-quality memory state. Hence, local but high-resolution temporal information can be captured.

We update the memory state smoothly in a gated recurrent manner when sliding the window of the memory block along the temporal domain. This is designed to capture the global temporal dependencies between memory blocks at low resolution, considering the potential information redundancy and computational efficiency.
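The two-pronged scheme boils down to a schedule of overlapping block intervals. This illustrative helper (names assumed, not from the paper) enumerates them for sequence length T, block size B and sliding window s:

```python
def memory_blocks(T, B=8, s=4):
    """Enumerate sliding non-local memory blocks, a sketch.

    T: sequence length, B: block size, s: sliding window (update stride).
    Returns the (start, end) interval of each block, inclusive.
    With s < B adjacent blocks overlap, so the local full-order
    interactions are chained into a global, recurrent memory update.
    """
    return [(t - B + 1, t) for t in range(B - 1, T, s)]
```

For T = 16, B = 8 and s = 4 this yields [(0, 7), (4, 11), (8, 15)]: adjacent blocks overlap by B − s = 4 steps, which is how the gated recurrence links blocks together.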
3.2 Sequence Modeling
Our NRNM can be seamlessly integrated into the LSTM backbone to enhance the power of sequence modeling. Specifically, we incorporate the obtained memory state into the recurrent update of the LSTM cell states to help refine their quality, as shown in Figure 4:

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t + g_t^m ⊙ m_t,   (5)

where c_{t−1}, c_t and c̃_t are the previous LSTM cell state, the current cell state and the candidate cell state respectively. m_t is the vector flattened from the most recently updated memory state M. f_t and i_t are the routine forget gate and input gate of the LSTM cell to balance the information flow between the current time step and the previous step. All f_t, i_t and c̃_t are modeled in a similar nonlinear way as a function of the input feature x_t and the previous hidden state h_{t−1}. For instance, the input gate is modeled as:

i_t = σ(W_i x_t + U_i h_{t−1} + b_i).   (6)
In Equation 5, we enrich the modeling of the LSTM cell state by incorporating our NRNM memory state via a memory gate g_t^m. The memory gate controls the information flow from the memory state m_t and is derived by measuring the relevance (compatibility) between the current input feature and the memory state:

g_t^m = σ(W_m x_t + U_m m_t + b_m),   (7)

where W_m and U_m are transformation matrices and b_m is the bias term.
The newly constructed cell state c_t is further used to derive the hidden state h_t of the whole sequence model prepared for the final prediction:

h_t = o_t ⊙ tanh(c_t),   (8)

where o_t is the output gate, which is modeled in a similar way to the input gate in Equation 6.
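Putting Equations 5, 7 and 8 together, one step of the memory-gated cell update can be sketched as below; the memory vector is assumed to be already projected to the cell width, and all parameter names are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cell_update(c_prev, c_cand, f_t, i_t, o_t, x_t, m_t, Wm, Um, bm):
    """Memory-gated LSTM cell update (Eqs. 5, 7, 8), a sketch.

    c_prev, c_cand: previous and candidate cell states, shape (d,)
    f_t, i_t, o_t:  routine forget, input and output gates, shape (d,)
    x_t:            current input feature, shape (d_x,)
    m_t:            flattened memory state, assumed pre-projected to (d,)
    Wm, Um, bm:     placeholder parameters of the memory gate g_m.
    """
    g_m = sigmoid(x_t @ Wm + m_t @ Um + bm)          # Eq. 7: memory relevance
    c_t = f_t * c_prev + i_t * c_cand + g_m * m_t    # Eq. 5: enriched cell state
    h_t = o_t * np.tanh(c_t)                         # Eq. 8: hidden state
    return c_t, h_t
```

The only departure from a vanilla LSTM step is the extra g_m ⊙ m_t term, which injects block-level memory into the cell state each time step.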
3.3 End-to-end Parameter Learning
The memory state of our NRNM for current block is learned based on the hidden states within this block of the LSTM backbone while the obtained memory state is in turn leveraged to refine the hidden states in future time steps. Hence, our NRNM and the LSTM backbone are integrated seamlessly and refine each other alternately.
The learned hidden representations h_t in Equation 8 for a sequence of length T can be used for any sequence prediction task, such as step-wise prediction (like language modeling) or sequence classification (like action classification). In subsequent experiments, we validate our model on two sequence classification tasks with different modalities: action recognition and sentiment analysis. Below we present the loss function for training our model for sequence classification; it is straightforward to substitute the loss function to adapt our model to step-wise prediction tasks.
Given a training set containing N sequences x^{(n)}_{1:T} and their associated labels y^{(n)}, we learn our NRNM and the LSTM backbone jointly in an end-to-end manner by minimizing the conditional negative log-likelihood of the training data with respect to the parameters Θ:

L(Θ) = − Σ_{n=1}^{N} log P(y^{(n)} | x^{(n)}_{1:T}; Θ),   (9)
where the probability of the predicted label y among C classes is calculated from the hidden state h_T at the last time step:

P(y | x_{1:T}; Θ) = softmax(W_p h_T + b_p),   (10)

herein W_p and b_p are the parameters of the linear transformation and the bias term.
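Equations 9 and 10 amount to a standard softmax classifier on the last hidden state; a minimal numerically stable sketch (W and b are hypothetical placeholders for the classifier parameters):

```python
import numpy as np

def sequence_nll(h_last, y, W, b):
    """Sequence-classification loss (Eqs. 9-10), a minimal sketch.

    h_last: (n, d) final hidden states of n sequences
    y:      (n,)   integer class labels
    W:      (d, C) classifier weights; b: (C,) bias
    Returns the mean negative log-likelihood under a softmax classifier.
    """
    logits = h_last @ W + b
    logits = logits - logits.max(axis=1, keepdims=True)   # numeric stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(y)), y].mean()            # Eq. 9 (mean form)
```

With untrained (zero) parameters the loss equals log C, the entropy of a uniform prediction, which is a convenient sanity check when wiring up training.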
4 Experiments on Action Recognition
To evaluate the performance of our proposed NRNM model, we first consider the task of action recognition in which the temporal dependencies between frames in a video are the most discriminative cues.
4.1 Dataset and Evaluation Protocol
We evaluate our method on the NTU dataset [35], which is currently the largest action recognition dataset. It is an RGB+D-based dataset containing 56,880 video sequences and 4 million frames collected from 40 distinct subjects. The dataset includes 60 action categories. 3D skeleton data (i.e., 3D coordinates of 25 body joints) is provided, captured with Microsoft Kinect.
In our experiments, we opt for the NTU dataset using only 3D skeleton joint information, rather than Kinetics [15] based on RGB information, since single-frame RGB information already provides strong cues for action recognition and weakens the importance of temporal dependencies [29]. Dropping the RGB-D information forces our model to recognize actions relying on the temporal information of the joints.
Two standard evaluation protocols are provided in [35]: Cross-Subject (CS) and Cross-View (CV). The CS evaluation splits the 40 subjects equally into training and test sets consisting of 40,320 and 16,560 samples respectively. In the CV evaluation, samples of camera 1 are used for testing and samples from cameras 2 and 3 for training. We report both metrics for performance evaluation.

4.2 Implementation
Our NRNM is built on a 3-layer LSTM backbone. The number of hidden units of all recurrent networks mentioned in this work (vanilla RNN, GRU, LSTM) is tuned on a validation set. We employ a 4-head attention scheme in practice. The size of the memory state is set to be the same as the combined size of the input hidden states. Following Tu et al. [40], Zoneout [17] is employed for network regularization. The dropout value is set to 0.5 to prevent potential overfitting. Adam [16] is used with an initial learning rate of 0.001 for gradient descent optimization.
4.3 Investigation on NRNM
We first perform experiments to investigate our proposed NRNM systematically.
Effect of the block size B. We first conduct experiments on the NTU dataset to investigate the performance of NRNM as a function of the block size. Concretely, we evaluate our method with increasing block sizes of 4, 6, 8, 10, and 12 while fixing the other hyperparameters.
Figure 5 shows that the accuracy initially increases with the block size, which is reasonable since a larger block size allows NRNM to incorporate information from more time steps in memory and thus capture longer temporal dependencies. As the block size increases further beyond the saturation point at a block size of 8, the performance starts to decrease. We surmise that non-local operations on a long block of the sequence result in overfitting on the training data and information redundancy.
Effect of the integration location of NRNM in the LSTM backbone. We next study the effect of integrating the NRNM into different layers of the 3-layer LSTM backbone. Figure 5 presents the results, from which we can conclude: 1) integrating NRNM at any layer of the LSTM outperforms the standard LSTM; 2) integrating NRNM once at a single layer performs better than applying NRNM at multiple layers, which would lead to information redundancy and overfitting; 3) integrating NRNM at the middle layer achieves the best performance, probably because the layer-2 hidden states of the LSTM are more suitable for NRNM to distill information than the low-level and high-level features learned as layer-1 and layer-3 hidden states.
Effect of the sliding window size s. We then investigate the effect of the sliding window size, which controls the updating frequency of the memory state. Theoretically, a too-small sliding window implies much overlap between two adjacent memory blocks and thus tends to lead to information redundancy. On the other hand, a too-large sliding window results in a large unaccessed temporal interval between two adjacent memory blocks and would potentially miss information in that interval.

In this set of experiments, we set the block size to 8 time steps and consider different sliding window sizes. Figure 5 reports that the model performs well when the sliding window is around 4 to 8, while the performance decreases at other values, which validates our analysis.
Comparison with LSTM baselines. To investigate the effectiveness of our NRNM, we compare our model to the basic recurrent models, including vanilla RNN, GRU, LSTM and high-order RNN, on the NTU dataset under two evaluation protocols: Cross-Subject (CS) and Cross-View (CV). Figure 6 shows that 1) all RNNs with memory or gated structures outperform the vanilla RNN and the high-order RNN by a large margin, which indicates the advantages of memory and gated structures for controlling information flow; 2) the high-order RNN performs better than the vanilla RNN, which implies the necessity of non-local operations, since high-order connections can be considered a simple non-local operation in a local area; this is also consistent with existing conclusions [37, 50]; 3) our NRNM outperforms LSTM significantly, which demonstrates the superiority of our model over the standard LSTM.
4.4 Comparison with State-of-the-arts
In this set of experiments, we compare our model with the state-of-the-art methods for action recognition on the NTU dataset in both Cross-Subject (CS) and Cross-View (CV) metrics. Note that we do not compare with methods that employ extra information or prior knowledge, such as joint connections for each part of the body or human body structure modeling [36, 48].
Table 1 reports the experimental results. Our model achieves the best performance in both CS and CV metrics, which demonstrates its superiority over other recurrent networks, especially those with memory or gated structures. While the LSTM-based methods [38, 52] boost performance over the standard LSTM by introducing extra attention mechanisms, our model still outperforms them as well as the standard LSTM by a substantial margin.
Table 1: Accuracy (%) on the NTU dataset under the CS and CV protocols.

Method  CS  CV
HBRNN-L [7]  59.1  64.0
Part-aware LSTM [35]  62.9  70.3
Trust Gate ST-LSTM [21]  69.2  77.7
Two-stream RNN [43]  71.3  79.5
Ensemble TS-LSTM [20]  74.6  81.3
VA-LSTM [51]  79.4  87.6
STA-LSTM [38]  73.4  81.2
EleAtt-LSTM [52]  78.4  85.0
EleAtt-GRU [52]  79.8  87.1
LSTM (baseline)  70.3  84.0
NRNM (ours)  80.8  89.2
Analysis of model complexity. To compare the model complexity of our model and other recurrent baselines, and to investigate whether the performance gain of our model merely comes from augmented model complexity, we evaluate the recurrent baselines with different model complexities (configurations) in Table 2. Our model substantially outperforms the other baselines under their optimized configurations, which demonstrates that the performance superiority of our model does not result from increased capacity due to extra parameters.
Table 2: CV accuracy (%) and parameter counts for baselines with different numbers of layers and hidden sizes.

Model (layers, hidden size)  CV (%)  #Parameters
3-LSTM (256)  83.9  1.5M
3-LSTM (512)  84.0  5.6M
5-LSTM (512)  83.1  9.8M
3-EleAtt-LSTM (256)  85.5  1.8M
6-EleAtt-LSTM (256)  82.7  3.8M
4-EleAtt-LSTM (512)  83.4  8.9M
3-EleAtt-GRU (100)  87.1  0.3M
3-EleAtt-GRU (256)  85.4  1.4M
5-EleAtt-GRU (256)  85.0  2.5M
NRNM (ours)  89.2  3.6M
4.5 Qualitative Analysis
To qualitatively illustrate the advantages of the proposed NRNM, Figure 7 presents a concrete video example with the ground-truth action label “walking towards each other”. In this example, it is quite challenging to recognize the action since it can only be inferred from the temporal variations of the relative distance between the two persons in the scene. Hence, capturing long-range dependencies is crucial. The standard LSTM misclassifies it as “punching/slapping other person”, while our model classifies it correctly due to its capability to model long-range temporal information with the designed NRNM.

Figure 7 also visualizes two blocks of memory states, each of which is learned by the NRNM cell by incorporating information from multiple frames, including input features and the hidden states of the LSTM backbone. To obtain more insight into the non-local operations of NRNM, we visualize the attention weights A in Equation 2 to show that each unit of the memory state is calculated by attending to all units of the source information (hidden states and input features).
5 Experiments on Sentiment Analysis
Next we perform experiments on the task of sentiment analysis to evaluate our model on the text modality. Specifically, we aim to classify online movie reviews as positive or negative, which is a sequence classification problem.
5.1 Dataset and Evaluation Protocol
We use the IMDB Review dataset [22], which is a standard benchmark for sentiment analysis. It contains 50,000 labeled reviews, among which 25,000 samples are used for training and the rest for testing. The average length of the reviews is 241 words and the maximum length is 2,526 words [5]. Note that the IMDB dataset also provides an additional 50,000 unlabeled reviews, which are used by several customized semi-supervised learning methods [5, 6, 14, 24, 31]. Since we only use labeled data for supervised training, we compare our method with methods based on supervised learning using the same set of training data for a fair comparison. The torchtext library (https://github.com/pytorch/text) is used for data preprocessing. Following the training strategy of Dai et al. [5], we pretrain a language model for extracting word embeddings.
5.2 Comparison with LSTM Baselines
We first conduct a set of experiments to compare our model with the basic recurrent networks, including vanilla RNN, GRU, LSTM and high-order RNN. Figure 9 shows that our model outperforms all other baselines significantly, which reveals the remarkable advantages of our NRNM. Besides, while LSTM and GRU perform much better than the vanilla RNN, the high-order RNN also boosts performance by a large margin compared to the vanilla RNN. This again demonstrates the benefits of high-order connections, which are a simple form of non-local operations in a local area.
5.3 Comparison with the State-of-the-arts
Next we compare our NRNM with the state-of-the-art methods, including LSTM [46], oh-CNN [13] and oh-2LSTMp [14], which learn word embeddings with a customized CNN or LSTM instead of using an existing pretrained word embedding vocabulary; DSL [46] and MLDL [46], which perform dual learning between language modeling and sentiment analysis; GLoMo [49], which is a transfer learning framework; and BCN+Char+CoVe [23], which trains a machine translation model to encode word embeddings to improve the performance of sentiment analysis.

5.4 Qualitative Analysis
Figure 8 illustrates an example of sentiment analysis from the IMDB dataset. This movie review is fairly challenging since its last sentence seems positive, which is prone to misguide models, especially when the hidden state of the last time step is used for prediction. Our model correctly classifies it as “negative”, while LSTM fails. We also visualize the attention weights of the non-local operations (Equation 2) in two blocks of NRNM states to show the contribution of each unit of the source information to calculating the NRNM states. The first memory block corresponds to the first sentence, which is an important cue of negative sentiment, while the second memory block corresponds to the last sentence.
6 Conclusion
In this work, we have presented the Non-local Recurrent Neural Memory (NRNM) for supervised sequence modeling. We perform non-local operations within each memory block to model full-order interactions between non-adjacent time steps, and model the global interactions between memory blocks in a gated recurrent manner. Thus, long-range temporal dependencies are captured. Our method achieves state-of-the-art performance on the tasks of action recognition and sentiment analysis.
References
[1] (2015) Neural machine translation by jointly learning to align and translate. In ICLR.
[2] (2009) A novel connectionist system for improved unconstrained handwriting recognition. IEEE TPAMI 31 (5).
[3] (2005) A non-local algorithm for image denoising. In CVPR.
[4] (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.
[5] (2015) Semi-supervised sequence learning. In NeurIPS.
[6] (2017) TopicRNN: a recurrent neural network with long-range semantic dependency. In ICLR.
[7] (2015) Hierarchical recurrent neural network for skeleton based action recognition. In CVPR.
[8] (2017) Improving neural language models with a continuous cache. In ICLR.
[9] (2014) Neural Turing machines. arXiv preprint arXiv:1410.5401.
[10] (2001) Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. A Field Guide to Dynamical Recurrent Neural Networks, IEEE Press.
[11] (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
[12] (2017) Densely connected convolutional networks. In CVPR.
[13] (2014) Effective use of word order for text categorization with convolutional neural networks. arXiv preprint arXiv:1412.1058.
[14] (2016) Supervised and semi-supervised text categorization using LSTM for region embeddings. In ICML.
[15] (2017) The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950.
[16] (2015) Adam: a method for stochastic optimization. In ICLR.
[17] (2017) Zoneout: regularizing RNNs by randomly preserving hidden activations. In ICLR.
[18] (2016) Ask me anything: dynamic memory networks for natural language processing. In ICML.
[19] (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data.
[20] (2017) Ensemble deep learning for skeleton-based action recognition using temporal sliding LSTM networks. In ICCV.
[21] (2016) Spatio-temporal LSTM with trust gates for 3D human action recognition. In ECCV.
[22] (2011) Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1.
[23] (2017) Learned in translation: contextualized word vectors. In NeurIPS.
[24] (2017) Adversarial training methods for semi-supervised text classification. In ICLR.
[25] (2007) Latent-dynamic discriminative models for continuous gesture recognition. In CVPR.
[26] (2017) Temporal attention-gated model for robust sequence classification. In CVPR.
[27] (2018) Multivariate time-series classification using the hidden-unit logistic model. IEEE Transactions on Neural Networks and Learning Systems 29 (4), pp. 920–931.
[28] (2009) Conditional neural fields. In NIPS.
[29] (2017) Learning spatio-temporal representation with pseudo-3D residual networks. In ICCV.
[30] (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77 (2), pp. 257–286.
[31] (2017) Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444.
[32] (1988) Learning representations by back-propagating errors. Cognitive Modeling 5 (3).
[33] (2014) Long short-term memory recurrent neural network architectures for large scale acoustic modeling. In Fifteenth Annual Conference of the International Speech Communication Association.
[34] (2018) Relational recurrent neural networks. In NeurIPS.
[35] (2016) NTU RGB+D: a large scale dataset for 3D human activity analysis. In CVPR.
[36] (2018) Skeleton-based action recognition with spatial reasoning and temporal stack learning. In ECCV.
[37] (2016) Higher order recurrent neural networks. arXiv preprint arXiv:1605.00064.
[38] (2018) Spatio-temporal attention-based LSTM networks for 3D action recognition and detection. IEEE TIP 27 (7), pp. 3459–3471.
[39] (2015) End-to-end memory networks. In NeurIPS.
[40] (2018) Spatial-temporal data augmentation based on LSTM autoencoder network for skeleton-based human action recognition. In ICIP.
[41] (2011) Hidden-unit conditional random fields. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics.
[42] (2017) Attention is all you need. In NeurIPS.
[43] (2017) Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks. In CVPR.
[44] (2018) Non-local neural networks. In CVPR.
[45] (2015) Memory networks. In ICLR.
[46] (2018) Model-level dual learning. In ICML.
[47] (2016) Dynamic memory networks for visual and textual question answering. In ICML.
[48] (2018) Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI.
[49] (2018) GLoMo: unsupervisedly learned relational graphs as transferable representations. In NeurIPS.
[50] (2018) High order recurrent neural networks for acoustic modelling. In ICASSP.
[51] (2017) View adaptive recurrent neural networks for high performance human action recognition from skeleton data. In ICCV.
[52] (2018) Adding attentiveness to the neurons in recurrent neural networks. In ECCV.