Focusing on What is Relevant: Time-Series Learning and Understanding using Attention

06/22/2018
by   Phongtharin Vinayavekhin, et al.

This paper is a contribution towards the interpretability of deep learning models in time-series applications. We propose a temporal attention layer that is capable of selecting the relevant information to perform various tasks, including data completion, key-frame detection and classification. The method uses the whole input sequence to calculate an attention value for each time step. This results in more focused attention values and more plausible visualisation than previous methods. We apply the proposed method to three different tasks. Experimental results show that the proposed network produces results comparable to the state of the art. In addition, the network provides better interpretability of its decisions; that is, it assigns significant attention weight to the relevant frames, compared with similar techniques attempted in the past.


I. Introduction

Recent progress in deep neural networks has led to an exponential increase in Artificial Intelligence (AI) applications. While many of these techniques have surpassed human performance in specific tasks, their effectiveness and their application to real-world problems are limited by the non-interpretability of their outcomes. Explainability is essential to understand and trust AI solutions.

Numerous techniques have been invented to gain insight into deep learning models. These techniques provide post-hoc interpretability for the learned model [1] and can be broadly categorised into i) methods that perform a calculation after training to find out what the model has learned, without affecting the performance of the original model [2, 3], and ii) models or layers that contain human-understandable information by construction [4, 5, 6, 7]. Models or layers of the second kind generally improve, or at least maintain, model accuracy while providing insight. This paper follows the latter category.

In general, data can be characterised into two groups: spatial and temporal. In this paper, we are interested in using a deep learning model to analyse temporal data. Learning the structure of temporal data is crucial in many applications where the explanation of the decision is as important as the prediction accuracy. For instance, recognising the most informative sequences of events and visualising them is useful and desired in computer vision [8], marketing analysis [9], and medical applications [10]. For example, if a patient were diagnosed with an illness, it would be natural for him/her to be curious about which information led to this inference.

In this paper, we propose a novel neural network layer that learns temporal relations in time-series while providing interpretable information about the model by employing an attention mechanism. This layer calculates each attention weight based on information from the whole time-series, which allows the network to directly select dependencies in the temporal data. Using the proposed layer results in a focused distribution of attention values, which is beneficial when interpreting a result of the network, as it gives significant weight only to the related frames. This is in contrast to existing works on temporal attention [6, 11], where the network relies on Recurrent Neural Networks (RNNs) to capture the temporal dependency of the input, calculates each attention weight based on a single latent vector, and produces more diffuse attention that gives significant weight to non-significant frames.

We show how to use the proposed layer with a conventional neural network by providing two architectures: an auto-encoder and a classification model. These architectures are applied to three applications: motion capture data completion, key-frame detection in video sequences, and action classification. The experimental results show that the network achieves accuracy comparable to the state of the art and provides a clear focus on the key frames that lead to the outcome.

II. Related Work

Deep learning models are often treated as black boxes. However, it is important to understand what they are learning for certain applications. Krizhevsky et al. [4] show interpretability of the network by visualising the weights of the first convolutional layer. Mahendran and Vedaldi [2] try to understand what each layer of a deep network is doing by inverting the latent representation using a generic natural image prior. Another approach is to interpret the function computed by each individual neuron. This research can be separated into two categories: dataset-centric and network-centric. The dataset-centric approach requires both the network and the data, while the latter requires only the trained network. A dataset-centric approach displays the parts of images that cause high or low activations for individual units. Zeiler and Fergus [3] propose a method that backtracks the network computations to identify which image patches are responsible for the activation of certain neurons. The network-centric approach analyses the network without any data. Nguyen et al. [12] use evolutionary algorithms or gradient descent to produce images that can fool neural networks. Such techniques can be used to gain a better understanding of neural networks.

Instead of performing an additional calculation to visualise the model, this paper focuses on a type of layer that contains interpretable information by construction: an attention layer [5]. This type of layer outputs information, an attention matrix, that explains the network's behaviour. The attention mechanism has become a key component of sequence transduction for modelling temporal dependencies. Such mechanisms are commonly used for temporal sequences together with RNNs and Convolutional Neural Networks (CNNs). Bahdanau et al. [6] propose an attention mechanism to improve the performance and visualisation of applications like machine translation; in this case, the attention values are the weights of a linear combination of latent vectors encoded by an RNN. Vaswani et al. [7] propose a transformer architecture based on self-attention; the transformer relies entirely on self-attention to compute representations of its input and output, without using RNNs or CNNs. Daniluk et al. [13] propose a key-value attention mechanism that uses specific output representations for querying a sliding-window memory of previous token representations. Sonderby et al. [14] and Raffel and Ellis [15] modify the calculation of Bahdanau's original attention mechanism [6]: each attention value is calculated as a function of the latent representation of a Bidirectional Recurrent Neural Network (BRNN) encoder at a single time step, under the assumption that the encoder captures the temporal information of the whole sequence. In the proposed attention layer, each attention value instead depends on all time steps of the input sequence. This allows the layer to compare and choose the time steps that are most relevant to the desired output, which results in more focused attention values.

III. Proposed Method

Here, we introduce the proposed neural network layer, which learns temporal relations of sequential data while allowing visualisation for model interpretation. Next, we describe how to use the layer in two different network architectures, along with details of how to train them.

III-A. Temporal Contextual Layer

Fig. 1: Temporal contextual layer: a neural network layer used to learn a temporal relation between two time-series. The layer takes the output of a time-distributed encoder as input and produces a calculated context vector as output. The type of architecture (auto-encoder or classification) depends on the number of time steps of the context vectors and the activation function of the decoder.

Our method assumes that some temporal relation exists in the time-series. The data is not required to be precisely periodic; only some semblance of a temporal pattern is needed. To learn the pattern, we propose a neural network layer which we refer to as a temporal contextual layer. In addition, the layer has the advantage of interpretability, as it acts as an attention mechanism. We begin by describing the proposed method as a layer that learns a temporal relation between two time-series.

To start, we define the input time-series as a sequence of length $T_x$ in which each element is a $D$-dimensional vector representing the current state. The full input is a matrix $X \in \mathbb{R}^{T_x \times D}$ whose element at time step $t$ is $x_t$. Similarly, the output sequence of length $T_y$ is $Y$, with elements $y_t$. Our layer is formulated using an attention mechanism. Previously, attention has been used together with an RNN encoder and decoder on sequential data [6]. It computes a context vector $c_i$ as the linear combination of a sequence of RNN latent vectors $h_j$:

$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$    (1)

where $T_x$ is the input sequence length and $\alpha_{ij}$ is a normalised weight calculated by applying a softmax function to an attention weight $e_{ij}$. The attention weight is a learnable function $e_{ij} = f(h_j, s_{i-1})$ of the latent input at the current time step, $h_j$, and the previous cell state of the decoder, $s_{i-1}$.
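For contrast, the following is a minimal PyTorch sketch of this classic scheme (the paper provides no code; parameter names are illustrative), in which each unnormalised score sees only one latent vector $h_j$ and the previous decoder state:

```python
import torch
import torch.nn.functional as F

def classic_attention_context(h, s_prev, W_h, W_s, v):
    """Sketch of the Bahdanau-style context vector of Eq. 1.

    h      : (T_x, H) encoder latent vectors h_j
    s_prev : (H,)     previous decoder cell state s_{i-1}
    W_h, W_s : (H, H), v : (H,)  -- illustrative parameters of f
    """
    e = torch.tanh(h @ W_h + s_prev @ W_s) @ v  # e_ij sees one h_j at a time
    alpha = F.softmax(e, dim=0)                 # normalised weights
    c = alpha @ h                               # context vector, Eq. 1
    return c, alpha
```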

Although the latent representation of an RNN at one time step is a function of all previous steps, it might not capture long-term information due to its limited memory, even with a gated architecture such as Long Short-Term Memory (LSTM). Therefore, we propose to calculate each attention weight by providing the information of all time steps to the layer:

$e_{ij} = f(h_1, h_2, \ldots, h_{T_x})$    (2)

Fig. 1 shows the temporal contextual layer. It takes a sequence of predefined length as input and learns a mapping from this sequence to an output sequence, also of predefined length (which may be different), using an attention mechanism. Zero-padding can be used to handle time series of variable length.

In summary, two time-series $X$ and $Y$ can be temporally related by a matrix $C$ as:

$Y = CX$    (3)

where $C$ is a temporal contextual matrix that can be visualised in the same way as an attention matrix. To achieve the desired behaviour defined in Eq. 2, it can be defined as:

$C = \varsigma(\tilde{C})$    (4)
$\tilde{C} = g_2\left( g_1(W_1 X + b_1)\, W_2 + b_2 \right)$    (5)

where $\varsigma$ is a row-wise softmax function and $\tilde{C}$ is the unnormalised temporal contextual matrix. $W_1$, $W_2$, $b_1$, and $b_2$ are the learned weight and bias matrices, and $g_1$ and $g_2$ are non-linear activation functions; specifically, we use a tanh for $g_1$ and a relu for $g_2$. The trainable parameters of the layer are exactly the entries of $W_1$, $W_2$, $b_1$, and $b_2$.
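As a concrete illustration, here is a minimal PyTorch sketch of the layer (batch-first shapes). The paper provides no reference code, and the exact arrangement of $W_1$ and $W_2$ in Eq. 5 is reconstructed from the text; in particular, letting $W_1$ act along the time axis, so that every unnormalised score sees all input time steps as required by Eq. 2, is our assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalContextualLayer(nn.Module):
    """Minimal sketch of Eqs. 3-5 (an assumption, not reference code).

    mix_time (W1, b1) acts on the time axis, so every score depends on
    all input time steps (Eq. 2); g1 = tanh and g2 = relu as in the text.
    """
    def __init__(self, t_in, t_out, d_latent):
        super().__init__()
        self.mix_time = nn.Linear(t_in, t_out)     # W1, b1: mixes time steps
        self.mix_feat = nn.Linear(d_latent, t_in)  # W2, b2: maps to T_x scores

    def forward(self, h):
        # h: (batch, t_in, d_latent), the whole encoded input sequence
        z = torch.tanh(self.mix_time(h.transpose(1, 2)))    # (B, d, t_out)
        c_tilde = F.relu(self.mix_feat(z.transpose(1, 2)))  # (B, t_out, t_in)
        C = F.softmax(c_tilde, dim=-1)  # Eq. 4: row-wise softmax over input steps
        y = C @ h                       # Eq. 3: context sequence (B, t_out, d)
        return y, C                     # C is the visualisable attention matrix
```

Each row of C sums to one, so every output step is a convex combination of the encoded input steps, which is what makes the matrix directly readable as attention.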

III-B. Usage and Applications of the Temporal Contextual Layer

This subsection details two ways of combining the proposed layer with conventional neural network layers to build a network architecture that solves a specific task.

III-B1. Auto-encoder Model

The temporal contextual layer is inserted between the encoder and the decoder, as depicted in Fig. 1, to create an auto-encoder model that learns temporal relations of time-series. An auto-encoder is an unsupervised model that learns a representation of the data by generating an output similar to the input it receives.

The input time-series is encoded into a latent representation either by a dense layer or by a BRNN. In the former case, the layer is applied in a time-distributed manner, i.e. the same encoder is applied to each time step of the input separately. The encoder can be either sparse or compressive. The time-series of latent representations is then passed to the temporal contextual layer, which outputs a time-series of the same length in time. Lastly, the contextual latent representation is passed to a dense layer to decode the time-series back to the same feature dimension as the original raw input.
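A sketch of how these pieces fit together, reusing the TemporalContextualLayer sketched in Section III-A (layer sizes and names are illustrative):

```python
import torch
import torch.nn as nn

class TemporalContextualAutoencoder(nn.Module):
    """Auto-encoder variant: time-distributed dense encoder ->
    temporal contextual layer -> dense decoder."""
    def __init__(self, d_in, d_latent, t_in, t_out):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_latent)  # shared across time steps
        self.context = TemporalContextualLayer(t_in, t_out, d_latent)
        self.decoder = nn.Linear(d_latent, d_in)  # back to raw feature size

    def forward(self, x):
        # x: (batch, t_in, d_in); nn.Linear acts on the last dimension,
        # i.e. the same encoder is applied to each time step separately.
        h = torch.relu(self.encoder(x))
        y, C = self.context(h)         # y: (batch, t_out, d_latent)
        return self.decoder(y), C      # reconstruction and attention matrix
```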

An auto-encoder model can be used, for example, i) to perform data completion and ii) to detect a key-frame in a time-series. For the former application, the denoising auto-encoder is classically used on data with random and partial occlusion [16, 17]. The proposed network can also be used for filling occluded gaps in time-series (data interpolation) [18], owing to its ability to find the temporal relations in the time-series. Section IV-A demonstrates this on motion capture data, together with a motion extrapolation task. For the latter application, the entire time-series is provided to the model as input, and the task is to reconstruct only the desired key-frame as output. In this case, the proposed layer learns to pick the information relevant to reconstructing the desired key-frame. This allows us to indirectly detect the key-frame from the attention weights without explicitly training the network in a supervised fashion. Results are shown in Section IV-B, where a key-frame is detected in a video.

III-B2. Classification Model

A temporal contextual layer can also be used in a classification problem. We consider one specific type of classification task where the input is sequential data and the output is its corresponding class. Examples of real-world applications are action recognition from mo-cap data, object recognition in video, and speech recognition. A temporal contextual layer whose output has a single time step is placed before the final softmax layer to choose the frames that are important for differentiating the time-series from others. As before, the raw input sequence can be encoded either by a spatial layer or by a BRNN layer. The spatial layer can either encode one individual frame or combine information from multiple frames, as a convolutional network does. The main idea here is to maintain the temporal order of the input sequence. Fig. 1 shows the network architecture of this classification model.
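Following the same pattern, a hedged sketch of the classification variant, in which the contextual layer emits a single context vector ahead of the softmax:

```python
import torch
import torch.nn as nn

class TemporalContextualClassifier(nn.Module):
    """Classification variant: the contextual layer reduces the encoded
    sequence to one context vector that feeds the softmax head."""
    def __init__(self, d_in, d_latent, t_in, n_classes):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_latent)            # per-frame encoder
        self.context = TemporalContextualLayer(t_in, 1, d_latent)
        self.head = nn.Linear(d_latent, n_classes)

    def forward(self, x):
        h = torch.relu(self.encoder(x))   # (batch, t_in, d_latent)
        y, C = self.context(h)            # y: (batch, 1, d_latent)
        logits = self.head(y.squeeze(1))  # softmax is applied inside the loss
        return logits, C                  # C highlights the decisive frames
```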

The proposed model provides insight into the result of the classification. Large weights in the attention matrix specify the input frames that contribute to the classification decision. We show this analysis in an action classification experiment in Section IV-C.

Both the auto-encoder and the classification model with a temporal contextual layer can be trained by solving an optimisation problem. The loss function is minimised through stochastic gradient descent; its choice depends on whether the output time-series is discrete or continuous. The gradient can be back-propagated through the layer, as all operations in the proposed layer are differentiable.
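A minimal training step consistent with this description (optimiser choice and names are illustrative):

```python
import torch.nn.functional as F

def train_step(model, x, target, optimiser, discrete=False):
    """One stochastic-gradient step; the loss follows the output type:
    cross-entropy for discrete outputs, MSE for continuous ones."""
    optimiser.zero_grad()
    out, _ = model(x)
    loss = F.cross_entropy(out, target) if discrete else F.mse_loss(out, target)
    loss.backward()    # every operation in the proposed layer is
    optimiser.step()   # differentiable, so gradients flow through it
    return loss.item()
```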

IV. Experimental Results

The temporal contextual layer and the proposed architectures are applied to three different tasks. First, we use the auto-encoder model to perform data completion on mo-cap data and to detect a key-frame in a video sequence. Finally, we use the classification model to classify various human actions.

IV-A. Motion Capture Data Completion

Fig. 2: Result of motion capture completion (motion interpolation) of a walking sequence in the Human 3.6M dataset. Reconstructed poses are shown together with the original poses for a qualitative comparison, along with the attention matrix over all time steps and the detailed attention weights of a selected time step. To reconstruct that pose, the network combines information from other poses with a similar appearance. The attention weights of a previous method [14] are also given for comparison.

For this task, we apply the auto-encoder model to fill a gap in occluded motion capture data (motion interpolation) and to predict future motion (motion extrapolation). A public dataset of 3D human motion, Human 3.6M [19], is used. The experiment is detailed as follows:

  • The data is down-sampled to a lower frame rate. Human posture is represented by joint orientations using an exponential map in the parent coordinate frame [20, 21, 22].

  • For motion interpolation, a zero-valued occluded hole of 2400 ms is created in the middle of each sequence, leaving prefix and suffix motion on either side. During training, these occluded sequences are given as input while the output is the original sequence (a data-preparation sketch is given after this list).

  • For motion extrapolation, 50 frames of prefix motion are used as input and the next 60 frames are used as output during training.

  • The motions of one group of subjects are used for training and validation, while a held-out subject is used for testing.

  • For each activity, the numbers of training, validation and testing sequences correspond to the baseline, for comparison purposes [22].

  • The auto-encoder model with a dense encoder is trained with the Mean Squared Error (MSE) loss.
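The interpolation pairs can be built as sketched below; cutting the zero-valued hole from the centre of the sequence is our reading of the setup above, and the hole length in frames is left as a parameter since the frame rate is not restated here:

```python
import numpy as np

def make_interpolation_pair(seq, hole_len):
    """seq: (T, n_features) motion sequence. Returns (input, target):
    the input has a zero-valued hole cut from its middle and the
    target is the original, unoccluded sequence."""
    x = seq.copy()
    start = (len(seq) - hole_len) // 2   # centre the occluded hole
    x[start:start + hole_len] = 0.0      # zero out the gap
    return x, seq
```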

Results are evaluated using the MSE of the joints in Euler angles, while differences in global location and body rotation are disregarded [22]. The results are compared with convolutional auto-encoders [17] and other motion prediction baselines [20, 21, 22]. Our implementation of [17] follows the kernel sizes reported, while the number of feature maps in each layer is changed. Publicly available source code and models (https://github.com/asheshjain399/RNNexp, https://github.com/una-dinosauria/human-motion-prediction) are used to predict motion for the next 60 frames. We run all seq-to-seq methods [22]; the residual sampling-based loss (rSA) method performs best in our experiment, and its average error over the last training iterations is reported in Table I.

Method 160 320 560 1200 1840 2080 2240
(milliseconds)
Activity : Walking
CNN [17] 1.31 1.35 1.33 1.44 1.45 1.51 1.44
Proposed 0.82 0.82 1.00 0.75 0.97 1.02 0.88
LSTM-3LR [20] 0.98 1.37 1.78 2.24 2.40 2.28 2.39
ERD [20] 1.11 1.38 1.79 2.27 2.41 2.39 2.49
S-RNN [21] 0.94 1.16 1.61 2.09 2.32 2.35 2.36
S2S (rSA) [22] 0.59 0.82 0.97 1.22 1.48 1.54 1.58
Proposed-pred 1.11 1.16 1.04 0.99 1.31 1.35 1.29
Activity : Smoking
CNN [17] 2.02 2.24 2.31 2.45 2.49 2.37 2.17
Proposed 1.00 1.19 1.33 1.43 1.19 1.04 1.43
LSTM-3LR [20] 1.61 1.98 2.27 2.52 2.60 2.76 2.93
ERD [20] 1.80 2.19 2.54 3.45 3.36 3.33 3.36
S-RNN [21] 0.94 1.16 1.61 2.09 2.32 2.35 2.36
S2S (rSA) [22] 0.76 1.20 1.46 2.13 2.31 2.41 2.47
Proposed-pred 1.09 1.27 1.52 1.91 1.85 1.96 2.37
Activity : Eating
CNN [17] 1.33 1.44 1.55 1.90 1.62 1.62 1.39
Proposed 0.65 0.85 1.06 1.39 1.04 1.01 0.82
LSTM-3LR [20] 1.25 1.82 2.28 2.69 2.65 2.74 2.70
ERD [20] 1.79 2.33 2.61 2.42 2.35 2.37 2.30
S-RNN [21] 1.41 1.85 2.19 2.84 2.99 3.05 3.11
S2S (rSA) [22] 0.50 0.78 1.10 1.63 1.66 1.79 1.81
Proposed-pred 0.77 1.01 1.32 1.49 1.58 1.60 1.51
TABLE I: Comparing results of motion completion techniques.

Errors for three activities, i.e. walking, smoking, and eating, are reported in Table I; results above the dashed line are for motion interpolation, while extrapolation results are below it. The proposed method performs better than the other interpolation method (CNN). For extrapolation, our method performs comparably to most methods in the short term, but better in the long term. The interpolation results of the proposed method are also better than its extrapolation results, because interpolation uses data both before and after the gap to fill it.

Fig. 2 displays reconstructed poses at various time steps together with the original sequence. The attention matrix shows that the network combines poses with similar appearance from different phases, both before and after the reconstructed time step, to reconstruct a specific pose. This is more plausible than a previous attention method [14] ([6] without an RNN decoder), which combines poses of both similar and different appearance, as depicted by the diffuse attention graph in the figure. One disadvantage of the proposed method is that its output sequence is not smooth, because the attention weights are not learned based on the previous output. One solution would be to use an RNN decoder and incorporate knowledge of previously reconstructed frames into the attention function [6].

IV-B. Key-frame Detection in a Video Sequence

The proposed auto-encoder model is used for key-frame detection in this experiment. We reconstruct MNIST digits from a video sequence of randomly placed digits. Each video has 10 frames in total, and each frame corresponds to an MNIST digit. We specifically reconstruct the digit 2 from the video input. Convolutional encoder and decoder networks are pre-trained on MNIST images. During training, we pass each image frame in the video through the encoder to obtain a latent feature vector. These features are passed through the proposed layer, which gives a single contextual feature vector, which is then passed through the decoder to reconstruct the desired image of the digit 2. The encoder and decoder are also fine-tuned in this training process.

(a) Samples of a good reconstruction and correct detection result.
(b) Samples of a bad reconstruction and incorrect detection result.
Fig. 3: Attention value for each digit in the input video. For good reconstructions, the attention value on the digit 2 is higher than on the other digits.

Fig. 3 shows the attention values and reconstructed images. Qualitative inspection shows that the proposed attention layer chooses features from the image frames containing the digit 2 when the reconstruction is of good quality. In the failure cases, it attends to other digits and the reconstruction is poor. To quantitatively measure the performance of the proposed attention layer, we computed the detection accuracy for the location of the digit 2 in the video sequence. For each video, we take the location of the maximum attention value as the detection for that input. We compared this detection to the ground-truth location of the digit 2 in the videos, yielding a high detection accuracy, which demonstrates that the attention layer gives significant attention to features from those locations. To encourage sparsity in the attention layer, we use a negative L2 activity regularisation on the attention unit, which reaches its minimum possible value (close to −1 for a softmax-normalised row) when there is a single spike in the attention activity.
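The detection rule and the sparsity term described above might be sketched as follows (inputs are numpy arrays; the regularisation weight is an assumed hyperparameter):

```python
import numpy as np

def keyframe_accuracy(attention, true_locations):
    """attention: (n_videos, n_frames) softmax-normalised rows; the
    frame with the maximum attention value is the detected key frame."""
    detected = attention.argmax(axis=1)
    return float((detected == true_locations).mean())

def negative_l2_activity(C, weight=1e-3):
    """Negative L2 activity regulariser: for a softmax-normalised row,
    -||a||^2 reaches its minimum of -1 exactly when the row is a single
    spike, so adding this term to the loss encourages sparse attention."""
    return -weight * (C ** 2).sum()
```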

IV-C. Action Classification on Motion Capture Data

The proposed classification model is used to classify human motions in the KIT Whole-Body Human Motion Database [23]. The motions are captured using a VICON motion capture system and fitted to a human model to obtain sequences of joint angles. We select nine actions of two subjects, listed in Table II(b), based on the balance of the number of samples per action. The experiment is detailed as follows:

  • All sequences are down-sampled by a factor of four by selecting every fourth frame (i.e. with four different starting offsets); this increases the number of sequences four-fold. A preprocessing sketch is given after this list.

  • All sequences are padded with zeros to the length of the longest sequence (227 time steps). The posture at each time step contains 44 joint angles; hence, each input sequence is a matrix of size 227 × 44.

  • The fps-reduced motions are divided into training, validation and testing sequences.

  • The classification model with a dense encoder is trained with a categorical cross-entropy loss. Training is stopped when the validation loss has stopped decreasing for several consecutive epochs.
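The preprocessing in the list above can be sketched as follows, assuming each raw sequence is an array of shape (T, 44) and that all four phase-shifted down-sampled copies are kept:

```python
import numpy as np

def downsample_and_pad(seq, factor=4, max_len=227):
    """Keep every `factor`-th frame at each of the `factor` offsets
    (four sequences per original), then zero-pad each copy to the
    longest sequence length."""
    padded = []
    for offset in range(factor):
        c = seq[offset::factor]                         # down-sampled copy
        pad = np.zeros((max_len - len(c), seq.shape[1]))
        padded.append(np.concatenate([c, pad], axis=0))
    return padded                                       # list of (max_len, 44)
```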

We compare the proposed method with a multilayer perceptron network (MLP) and a multi-layer LSTM [24]. For the former, we take the proposed classification model with a dense encoder and replace the temporal contextual layer with a dense layer. For the latter, two LSTM layers are stacked, and the output of the last time step is passed to a dense layer with a softmax activation.

Results are evaluated using the accuracy and the time required to train the model. We ran the experiment several times for each method and report the averages in Table II(a). The classification model with a dense encoder performs well in terms of accuracy. It takes less time to train than a multi-layer LSTM because back-propagation through time is not required. While the method takes more time to train than the MLP, it provides an interpretation of the result, described later in this section. The proposed network also has fewer parameters than the other networks.

Method | Accuracy (%) | Training Time (s) {Epochs} | Number of Parameters
MLP | 76.3 ± 1.1 | 13.4 {12} | 933,081
Dense + Proposed | 76.7 ± 0.6 | 56.5 {52} | 4,732
2 Layer LSTM [24] | 65.7 ± 7.7 | 584.0 {43} | 221,321
BiLSTM + Att. [14] | 85.4 ± 2.2 | 239.1 {18} | 57,097
BiLSTM + Proposed | 85.9 ± 2.7 | 225.5 {17} | 86,252
(a) Comparison of accuracy, training time, and number of parameters.

True Label

Bow 16 0 0 0 0 0 0 0 0
Jump 0 38 0 0 0 0 0 0 0
Kick 0 16 16 0 0 0 0 0 0
Golf 0 0 0 17 0 0 0 0 0
Tennis 0 5 0 0 22 0 0 4 0
Squat 0 0 0 0 0 8 0 0 0
Stomp 0 4 0 0 0 0 12 0 0
Throw 0 4 4 0 0 0 0 8 0
Wave 2 2 0 0 0 0 0 4 15
B J K G Te S St Th W
Predicted Label
(b) Confusion matrix of nine actions (Dense + Proposed).
TABLE II: Result of action classification on a subset of KIT dataset.
(a) Postures with the top-2 highest attention values (key frames) of three actions with perfect classification.
(b) Classification where key frames of confusing classes are similar.
Fig. 4: Visualisation of result based on attention values (Dense + Proposed).

Table II(b) shows the confusion matrix of the proposed classification model. Three actions, i.e. bow, play golf, and squat, have perfect classification results without any incorrect classification to or from other actions. Fig. 4(a) shows sample postures from those actions with the top-2 highest attention values (key frames). Based on their attention values, the information in these frames is used in the classification decision. These key frames are very different between actions. They also correspond to the human intuition for describing the actions, but this is not always true for all actions, as there is no control over what the network will learn. On the other hand, sample postures with incorrect classification results are shown in Fig. 4(b). The key frames for tennis and throw are similar, which leads to some incorrect classifications.

Fig. 5: Comparison of the attention values of various methods in the action classification task. The proposed method provides more focused attention values to classify the bow action (note that the Y-axes of the graphs differ).

We compare the attention values with a previous method by performing experiments using a Bidirectional Long Short-Term Memory (BiLSTM) encoder together with the proposed attention layer (BiLSTM + Proposed) and a feed-forward attention (BiLSTM + Att.) [14]. Methods with a BiLSTM encoder perform better in terms of accuracy but require more training time than the dense encoder, as shown in Table II(a). A comparison of attention values for one bow action is shown in Fig. 5. The action occurs between time steps 50-75. The proposed attention layer with a dense encoder uses frames in that vicinity to make a decision, whereas the BiLSTM encoder with feed-forward attention combines information from both inside and outside the action. Another downside of the BiLSTM encoder is that one latent representation combines information from various time steps, which makes the results harder to interpret. When using the proposed layer with the BiLSTM encoder, the attention spikes on very few frames, mainly frame 58. In this case, the latent representation could have captured the temporal information of the nearby frames that makes it distinguishable from other actions.

V. Conclusion

This paper proposes a neural network layer that learns the temporal structure of the data. The method is based on an attention mechanism, which provides an interpretation of the deep learning model. We applied the method to various applications and showed that the proposed temporal contextual layer retains, and in some cases improves, the performance of the model on the task. The network also allows the results of the model to be interpreted and visualised. As a future direction, we plan to investigate stacking the temporal contextual layer to gain insight into deeper and more complex models.

References

  • [1] Z. C. Lipton, "The mythos of model interpretability," ICML Workshop on Human Interpretability in Machine Learning (WHI), 2016.
  • [2] A. Mahendran and A. Vedaldi, “Understanding deep image representations by inverting them,” in Procs. of CVPR, 2015.
  • [3] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Procs. of ECCV, 2014.
  • [4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in NIPS, 2012.
  • [5] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in Procs. of ICML, 2015.
  • [6] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in ICLR, 2015.
  • [7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in NIPS, 2017.
  • [8] K. Tang, D. Koller, and L. Fei-Fei, “Learning latent temporal structure for complex event detection,” in Procs. of CVPR, 2012.
  • [9] D. Markovitch and P. N. Golder, "Using stock prices to predict market events: Evidence on sales takeoff and long-term firm survival," 2008, vol. 27, pp. 717–729.
  • [10] S. J. Baker and E. P. Reddy, “Understanding the temporal sequence of genetic events that lead to prostate cancer progression and metastasis,” Proc. of the National Academy of Sciences of the United States of America, 2013.
  • [11] W. Pei, T. Baltrusaitis, D. M. Tax, and L.-P. Morency, "Temporal attention-gated model for robust sequence classification," in Procs. of CVPR, 2017.
  • [12] A. M. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images,” in Procs. of CVPR, 2015.
  • [13] M. Daniluk, T. Rocktäschel, J. Welbl, and S. Riedel, “Frustratingly short attention spans in neural language modeling,” ICLR, 2017.
  • [14] S. K. Sonderby, C. K. Sonderby, H. Nielsen, and O. Winther, "Convolutional LSTM networks for subcellular localization of proteins," in Procs. of the Second Int'l Conf. on Algorithms for Computational Biology, 2015, pp. 68–80.
  • [15] C. Raffel and D. P. W. Ellis, "Feed-forward networks with attention can solve some long-term memory problems," in Workshop Extended Abstracts of the 4th ICLR, 2016.
  • [16] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Procs. of ICML, 2008.
  • [17] D. Holden, J. Saito, T. Komura, and T. Joyce, “Learning motion manifolds with convolutional autoencoders,” in SIGGRAPH Asia 2015 Technical Briefs, ser. SA ’15, 2015, pp. 18:1–18:4.
  • [18] M. Berglund, T. Raiko, M. Honkala, L. Kärkkäinen, A. Vetek, and J. Karhunen, “Bidirectional recurrent neural networks as generative models,” in NIPS, 2015.
  • [19] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu, “Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 7, pp. 1325–1339, jul 2014.
  • [20] K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik, “Recurrent network models for human dynamics,” in Procs. of ICCV, 2015.
  • [21] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena, “Structural-rnn: Deep learning on spatio-temporal graphs,” in Procs. of CVPR, 2016.
  • [22] J. Martinez, M. J. Black, and J. Romero, “On human motion prediction using recurrent neural networks,” in Procs. of CVPR, 2017.
  • [23] C. Mandery, O. Terlemez, M. Do, N. Vahrenkamp, and T. Asfour, “The kit whole-body human motion database,” in Int’l Conf. on Advanced Robotics (ICAR), July 2015, pp. 329–336.
  • [24] A. Shahroudy, J. Liu, T.-T. Ng, and G. Wang, “Ntu rgb+d: A large scale dataset for 3d human activity analysis,” in Procs. of CVPR, 2016.