Spatiotemporal Learning of Dynamic Gestures from 3D Point Cloud Data

by   Joshua Owoyemi, et al.
Tohoku University

In this paper, we demonstrate an end-to-end spatiotemporal gesture learning approach for 3D point cloud data using a new gestures dataset of point clouds acquired from a 3D sensor. Nine classes of gestures were learned from gesture sample data. We mapped point cloud data into dense occupancy grids; time steps of the occupancy grids are then used as inputs to a 3D convolutional neural network, which learns the spatiotemporal features in the data without explicit modeling of gesture dynamics. We also introduced a 3D region of interest jittering approach for point cloud data augmentation. This resulted in an increased classification accuracy of up to 10% when the augmented data is added to the original training data. The developed model is able to classify gestures from the dataset with 84.44% accuracy. We propose that point cloud data will become a more viable data type for scene understanding and motion recognition, as 3D sensors become ubiquitous in years to come.




I Introduction

In recent years, understanding human motion has been gaining popularity [1] in robotics, mainly for human-robot interaction (HRI) [2] applications. This knowledge is useful because humans and robots increasingly share working and living spaces. There is therefore a need for robots to ‘understand’ the intentions of humans through gesture and action recognition, and through interpretation and prediction of human motions. This will help robots to work effectively with humans and ensure the safety of both parties. While there has been a lot of research using 2D sensors and images for this purpose [3] [4], 3D sensors have gained traction because of the increased accessibility of low-cost devices such as the Microsoft Kinect [5], allowing for more uses of 3D data.

In this paper we are interested in learning human actions and gestures from 3D point cloud data. According to [6], actions are more generic whole-body movements, while gestures are more fine-grained upper-body movements performed by a user that have a meaning in a particular context. In our previous work [7], we developed a model to predict the intentions of human arm motions in a workspace. We further evaluate this approach for dynamic gestures in this paper.

Researchers have used 2D data to achieve some remarkable progress in action and gesture recognition [8] [9] [10] [3]. However, using 2D images has proven difficult in some situations, such as varying illumination conditions and cluttered backgrounds [11]. On the other hand, 3D data such as point clouds and depth maps offer advantages such as illumination invariance [12] and can capture information about the exact size and shape of an object in its physical space, allowing precise and accurate data usage for 3D manipulation, coordination and visualization [13] [14].

In most of the reviewed works, researchers used depth maps [15] or skeleton representations [16] [17] from 3D sensors to analyze or learn human motion. To our knowledge, there have been few works on gesture recognition based only on 3D point cloud data. This might be largely because point cloud data are unorganized, making them difficult to use directly as model inputs. Considerable preprocessing therefore has to be done to convert the raw data into usable formats. In our case, we make use of an occupancy grid representation, with unit cells referred to as voxels. This representation can be used for point cloud spaces of arbitrary size, subject to the resolution of the voxels. The key contributions of our work are as follows: 1. We demonstrate spatiotemporal learning from point cloud data through a new dataset of common Japanese gestures. 2. We develop a 3D CNN model which learns gestures end-to-end from a 3D representation of a point cloud stream and outputs the corresponding gesture class performed by the human. 3. We evaluate the 3D CNN model on the new dataset of common Japanese gestures.

II Related Work

With 3D sensors becoming ubiquitous in computer vision and robotic applications, the task of motion and gesture recognition from 3D data is one of increasing practical relevance. A general approach to gesture recognition from 3D data involves feature extraction from the data, followed by possible dimensionality reduction, and application of a classifier to the resulting processed data. Some earlier approaches include: a Hidden Markov Model (HMM) based classifier [18] using the size of an ellipsoid containing an object and the length of the vector from the center of mass to extreme points; analysis of the Fourier transform and Radon transform of a self-similarity matrix of features obtained from actions [19]; and computation of local spin image descriptors [14] or Local Surface Normal (LSN) based descriptors [20]. A different approach was the use of multi-view projection of the point cloud into view images, describing hand gestures by extracting and fusing features in the view images [21], with the claim that this conversion of feature space increases the intra-class similarity and reduces inter-class similarity. In contrast to these approaches, our approach does not rely on hand-engineered features or descriptors from point cloud data. Rather, we use a 3D convolutional neural network (3D CNN) based model which is able to both automatically build relevant features and classify inputs from example data.

II-A Other 3D Data Types

Apart from point clouds, 3D data are also represented as depth maps, which are images that contain information relating to the distance of the surfaces of scene objects from a viewpoint. Depth maps have proven useful in gesture recognition [4] [22] [23], mostly because the data is in 2D, which makes it easy to apply popular feature extraction approaches. However, we argue that our approach is applicable not only to gesture learning but to action learning in general. While depth data offers only a single point of view of the 3D space, using raw point cloud data allows us to define arbitrary regions of interest (ROIs) when using sensors that provide a 3D field of view. An example is a LIDAR sensor mounted on a self-driving car. We can create multiple ROIs depending on the location of objects in the scene and individually analyze their actions.

Researchers also use skeletal features, either provided by the sensor’s software development kit (SDK), as in the case of Kinect, or extracted manually from depth data [16]. These methods [17][24], however, involve manually modeling or engineering the features for the gestures or actions to be learned. Our approach, on the other hand, does not involve explicitly modeling the dynamics of the gestures or actions learned. Instead, the gestures are learned end-to-end, from input directly to gesture class, using supervised learning.

II-B Gesture and action recognition with CNN

CNNs have proven effective in pattern recognition tasks [25] and are known to outperform hand-engineered feature-based approaches in image classification, object recognition and similar tasks [26][4][3]. For gesture recognition, CNNs have been used to achieve state-of-the-art results [8][17][9][24] utilizing different kinds of feature representations and data types. Asadi-Aghbolaghi et al. [6] present a good survey on deep learning for action recognition in image sequences.

In this paper, our model is inspired by the early fusion model in [27], where consecutive frames in a video are fed into a 2D CNN in order to classify the actions in the video stream. Similarly, we feed consecutive frames of point cloud data, but into a 3D CNN, to learn the actions performed in the point cloud stream. To demonstrate the spatiotemporal learning capability of our approach, we collected a new dataset of common Japanese gestures (see Fig. 1). These were chosen arbitrarily from watching videos and asking randomly chosen Japanese people to tell us the common gestures they use in day-to-day life.

Fig. 1: Sample frames of the Dataset of Common Japanese Gestures. We collected one class of no gestures and 9 classes of gestures. The gesture classes are (1) No gesture, (2) Come here, (3) Me, (4) No thank you, (5) Money, (6) Peace, (7) Not allowed, (8) OK, (9) I’m sorry, (10) I got it!.

III Method

In this section we describe the methods we used in learning spatiotemporal features from point cloud data. First, we present the data collected, then the training data preparation. We also describe the data augmentation approach employed to improve the efficiency of the learning process.

III-A Data collection

We collected training data as point cloud frames acquired from a Kinect sensor. The dataset consists of a class labeled ‘No Gesture’ and nine other gesture classes, making a total of ten classes. The classes of gestures are: 1. No Gesture, 2. Come, 3. Me, 4. No Thank You, 5. Money, 6. Peace, 7. Not Allowed, 8. OK, 9. I’m Sorry, and 10. I Got It!. The dataset was collected with a Kinect sensor facing the subjects, as shown in Fig. 2. There were 5 subjects, with each subject repeating each gesture at least 30 times. A total of 87,156 point cloud frames were collected for training and 29,758 frames for testing. Sample frames of the dataset are shown in Fig. 1 for four consecutive frames of each gesture class. An ROI was determined to specify the volume in the scene where the subject is expected to be, so that only the upper part of the subject is captured. The body was not removed from the point cloud because some gestures also involve bodily movements and the use of both hands. An example is the gesture “I’m sorry”, which involves slightly bending or bowing the head with both hands clapped together in front of the head.

Fig. 2: The setup for point cloud data collection. The subject sits in front of a Kinect sensor at a considerable distance. The ROI is -0.50 m to 0.50 m, -0.30 m to 0.40 m, 0.50 m to 1.40 m in x, y and z axes respectively.
Fig. 3: Input occupancy grid, mapped from point clouds. The dimension of the occupancy grid depends on the size chosen for the voxels. Here we used a voxel size of 50 mm. There is also a data augmentation step which involves randomly jittering the ROI by a chosen size. To augment point cloud data, the original ROI is jittered randomly along the x-axis, the y-axis, the z-axis, or a combination of the three axes.

III-B Training Data Preparation

To prepare the training data, point clouds from a 3D sensor are converted into 3D occupancy grids, with each point in the point cloud discretely mapped to a voxel coordinate by updating the occupancy grid, similar to [28]. Each voxel has an initial value and is updated from a sequence of range measurements that either hit or pass through the given voxel, with one measurement for each individual point in the point cloud. This process is illustrated in Fig. 3 alongside the data augmentation approach, which we explain in the next subsection. We also define a “lookback-window” size corresponding to the number of prior time steps to consider when recognizing the gesture at time $t$. Hence, for each input time step, we have a data sample tensor $X_t$ paired with the label $y_t$, the corresponding gesture class. We therefore aim to find a set of parameters $\theta$ of the non-linear function $f$ that maps the input $X_t$ to the corresponding label $y_t$, given by (2).

$$y_t = f(X_t; \theta) \qquad (2)$$
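The mapping from points to voxels can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a binary occupancy rule (a voxel is 1 if any point falls inside it), the ROI bounds from the Fig. 2 caption and the 50 mm voxel size from the Fig. 3 caption. The names `grid_shape` and `voxelize` are hypothetical.

```python
# ROI bounds in metres for x, y and z, taken from the Fig. 2 caption.
ROI = ((-0.50, 0.50), (-0.30, 0.40), (0.50, 1.40))
VOXEL = 0.05  # 50 mm voxel size, from the Fig. 3 caption


def grid_shape(roi=ROI, voxel=VOXEL):
    """Number of voxels along x, y and z for the given ROI."""
    return tuple(round((hi - lo) / voxel) for lo, hi in roi)


def voxelize(points, roi=ROI, voxel=VOXEL):
    """Map (x, y, z) points to a binary occupancy grid of nested lists.

    Assumption: a voxel is set to 1 if at least one point falls inside it.
    Points whose voxel index falls outside the grid are ignored.
    """
    shape = grid_shape(roi, voxel)
    nx, ny, nz = shape
    grid = [[[0] * nz for _ in range(ny)] for _ in range(nx)]
    for point in points:
        # Integer voxel index along each axis, measured from the ROI minimum.
        index = [int((value - lo) // voxel) for value, (lo, _hi) in zip(point, roi)]
        if all(0 <= i < n for i, n in zip(index, shape)):
            i, j, k = index
            grid[i][j][k] = 1
    return grid


if __name__ == "__main__":
    print(grid_shape())  # voxels along x, y, z
```

With these values the ROI spans 1.0 m, 0.7 m and 0.9 m, so each frame becomes a 20 x 14 x 18 grid under this sketch.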


III-C Training data augmentation

We doubled the amount of training data by applying a data augmentation approach we call ‘3D ROI jittering’ to the original dataset. This was achieved by applying a translation vector to the original 3D ROI of each gesture performance in the dataset. As illustrated in Fig. 3, we shift the ROI around in space while the position of the point cloud data stays fixed. This is intended to simulate spatial variations in the performance of the gestures.

For a given ROI and jittering size, the jittered ROI is obtained by translating the ROI by a vector. The vector is chosen randomly from a set formed as the union of permutations with replacement of the jitter offsets along the three axes. These permutations, therefore, represent the different jittering configurations we can have for a chosen jitter size.
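ROI jittering can be illustrated with a short sketch. The exact set of translation vectors is an assumption here: each component is taken to be `-size`, `0` or `+size`, and the all-zero vector is excluded, which gives 26 configurations. The ROI bounds come from the Fig. 2 caption. The function names are hypothetical.

```python
import itertools
import random

# ROI bounds in metres for x, y and z, taken from the Fig. 2 caption.
ROI = ((-0.50, 0.50), (-0.30, 0.40), (0.50, 1.40))


def jitter_vectors(size):
    """All non-zero translations whose components are -size, 0 or +size.

    Assumption: this enumeration stands in for the set of jittering
    configurations described in the text.
    """
    offsets = (-size, 0.0, size)
    return [v for v in itertools.product(offsets, repeat=3) if any(v)]


def jitter_roi(roi, size, rng=random):
    """Shift the ROI by one randomly chosen jitter vector.

    The point cloud itself is untouched; only the ROI bounds move.
    """
    v = rng.choice(jitter_vectors(size))
    return tuple((lo + d, hi + d) for (lo, hi), d in zip(roi, v))


if __name__ == "__main__":
    # A 5 cm jitter; the seed only makes the example repeatable.
    print(jitter_roi(ROI, 0.05, random.Random(0)))
```

Because only the bounds move, re-voxelizing the same points against the shifted ROI yields a grid in which the subject appears displaced by a whole number of voxels when the jitter size is a multiple of the voxel size.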

III-D 3D CNN Model

CNN models are known to be well suited to representing non-linear relationships, as in (2), that involve multidimensional input spaces, as in image classification and object detection problems [25]. Here, we are interested in using a CNN model to learn gestures from 3D data, hence the use of a 3D CNN.

CNNs are characterized by convolution operations between an input tensor and a convolution kernel (see illustration in Fig. 4). For 3D inputs, the output of this operation is the activation of a node of a feature map in the next layer.
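For reference, 3D convolution is commonly written in the form used by Ji et al. [8]. The notation below is an assumption and may differ from the authors' own: $v_{ij}^{xyz}$ is the activation at position $(x, y, z)$ of the $j$-th feature map in layer $i$, $\phi$ is the activation function (ReLU in this model), $b_{ij}$ is the bias, $m$ indexes the feature maps of layer $i-1$, and $w_{ijm}^{pqr}$ is the kernel weight at offset $(p, q, r)$ of a kernel of size $P_i \times Q_i \times R_i$.

```latex
v_{ij}^{xyz} = \phi\!\left( b_{ij}
  + \sum_{m} \sum_{p=0}^{P_i-1} \sum_{q=0}^{Q_i-1} \sum_{r=0}^{R_i-1}
    w_{ijm}^{pqr} \, v_{(i-1)m}^{(x+p)(y+q)(z+r)} \right),
\qquad \phi(a) = \max(0, a)
```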

Deep CNN models achieve automatic feature construction by stacking multiple convolutional layers, where higher layers capture more complex or discriminating features. Formally, each layer’s output is a set of activations from the layer’s Rectified Linear Units (ReLUs) [29], which are functions of kernel weights and biases to be optimized. Equations (5) to (7) represent the operations at the intermediate layers and at the output layer.


We used a configuration similar to that of our previous work [7], with modifications relating to the size of the voxel space. As shown in Fig. 5, the model has a total of 7 layers: 4 convolutional layers, 2 fully connected layers and the output layer. The 3D convolutional layers were designed to extract spatial features in each input time step and temporal features across time steps of the input. The first and second layers use 5x5x5 convolution kernels, followed by a 3x3x3 kernel in the third layer and a 2x2x2 kernel in the fourth layer. A stride of 2x2x2 was used throughout the convolutional layers. We apply max pooling [30] after the second and fourth layers to reduce the dimensionality of the parameters connected to the next layers.

Fig. 4: 3D convolution illustration. 3D kernels, represented by the multicoloured cubes, are applied to inputs from the previous layer. Here we also have a temporal dimension, and the kernel weights are shared across this dimension. For the input layer, contiguous occupancy grid time steps represent the temporal dimension of the input.

III-E Training Details

We trained our final model using 5-fold cross-validation, employing early stopping with a patience of 3 in each cross-validation cycle. This prevents the model from overfitting and helps to choose a good trade-off between accuracy and training loss. For optimization, we used the Adam optimizer [31], with dropout [32] of 0.3 on the fully connected layers to further prevent overfitting. During training, we reduce the learning rate by a factor of 0.3 when the validation loss has not decreased for 3 epochs.
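The learning rate rule can be made concrete with a small simulation. This is a simplified sketch, not the Keras implementation the authors used. It assumes that "a factor of 0.3" means multiplying the current rate by 0.3, and that the counter resets after each reduction. The initial rate of 1e-3 is an assumed placeholder, and `plateau_schedule` is a hypothetical name.

```python
def plateau_schedule(val_losses, initial_lr=1e-3, factor=0.3, patience=3):
    """Return the learning rate in effect after each epoch's validation step.

    Assumptions: the rate is multiplied by `factor` once the validation loss
    has failed to improve for `patience` consecutive epochs, and the counter
    then starts again from zero.
    """
    lr = initial_lr
    best = float("inf")
    stale = 0  # epochs since the validation loss last improved
    history = []
    for loss in val_losses:
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor
                stale = 0
        history.append(lr)
    return history


if __name__ == "__main__":
    # Made-up validation losses, for illustration only.
    print(plateau_schedule([1.0, 0.9, 0.95, 0.92, 0.91, 0.8]))
```

In this made-up run the loss stalls for three epochs after the second one, so the rate drops once, at the fifth epoch.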

All training was done on an Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz x 12 with 4 NVIDIA TITAN X GPUs. The Keras library with a TensorFlow backend was used for model implementation.

Fig. 5: The architecture of the developed 3D CNN model. Input into the model are time steps of occupancy grids converted from point cloud data. Here, we show the first four filters of the convolutional layers and the activation map in the fully connected layers. The red cells show the activations in the filters of each layer. The filters that are black signify no activation for the given input.

IV Evaluation

The developed model was evaluated on the test data kept apart during data collection. We used a window size of 4 timesteps for our training and evaluation. The following subsections outline details of the evaluations carried out on the model.
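The window of 4 time steps can be pictured as a sliding window over the frame sequence. The sketch below makes two assumptions: each sample takes the label of its most recent frame, and the frames come from one continuous recording (it does not handle boundaries between recordings). `make_windows` is a hypothetical helper, not part of the authors' code.

```python
def make_windows(frames, labels, window=4):
    """Pair each run of `window` consecutive frames with a gesture label.

    Assumption: the label of a sample is the label of its last frame.
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")
    samples = []
    for t in range(window - 1, len(frames)):
        # Frames t-window+1 .. t form one input sample for time step t.
        samples.append((frames[t - window + 1 : t + 1], labels[t]))
    return samples


if __name__ == "__main__":
    # Integers stand in for occupancy grids here.
    for sample, label in make_windows(list(range(6)), list("abcdef")):
        print(sample, label)
```

A sequence of N frames yields N - 3 samples with a window of 4, since the first three frames have too little history to form a full window.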

IV-A Data Augmentation

Training with augmented data not only doubled the amount of training data but also helped to compensate for spatial variation in the gestures, since a subject could perform a gesture in different positions within the ROI. We compared the results of training with and without augmented data and found that adding augmented data to our training samples increased test accuracy by up to 10%. This held for the different classifiers that we evaluated, and is a significant improvement over using only the collected data. A summary of the results from data augmentation is shown in Fig. 6, which compares different jitter sizes on different classifiers.

Fig. 6: Evaluation and comparison of augmentation jitter sizes, using jitter sizes of 5 cm, 10 cm and 1.5 cm. A jitter size of 0 cm signifies that no augmentation was performed. The accuracies of both the 3D CNN models and the off-the-shelf classifier increased after applying data augmentation: from 57.4% to 67.64% for the random forest model, from 74.47% to 81.98% for the 3D CNN LSTM model and from 75.8% to 82.8% for the 3D CNN model. However, a further increase in the jitter size did not yield a significant increase in accuracy. Accuracy instead becomes worse, most evidently in the random forest model and later for the other models.

IV-B Models Comparison

We compared the 3D CNN model with a random forest model, an off-the-shelf classifier, to evaluate whether the 3D CNN model has a considerable accuracy advantage over such a classifier. In addition, because LSTM models [33] are known to perform well on time series or temporal data, we compared the 3D CNN model with an LSTM variant. This was done by replacing the first fully connected layers in the model with an LSTM layer and passing each time step of the input through individual mini 3D CNN networks to learn spatial features, and then into the LSTM layers to learn the temporal features. The LSTM variant of the 3D CNN architecture is shown in Fig. 7. Even though these two models are not directly equivalent, we used similar hyperparameters for training in order to have similar conditions. In our evaluation, the 3D CNN LSTM model did not perform better than the 3D CNN. A summary of the evaluation results is shown in Table I.

We see that the 3D CNN model is able to learn both spatial and temporal relationships in the data presented; furthermore, it outperforms the LSTM variant of the model for this particular problem. The confusion matrix for our final model is shown in Fig. 8.


TABLE I: Models Performance Comparison

Model                           Accuracy
Random Forest                   67.64%
3D CNN                          75.80%
3D CNN + Augmentation           84.44%
3D CNN + LSTM                   74.47%
3D CNN + LSTM + Augmentation    81.82%

Fig. 7: The architecture of the LSTM variant of 3D CNN of our final model. The input time steps are separated into individual mini CNN networks in order that the spatial features are learned in the CNN layers and then the temporal features in the LSTM layers. This achieved a classification accuracy of 81.82%.
Fig. 8: Confusion Matrix of the prediction on the test set, showing the performance of the model in each class. Class 5 gesture, Peace, has the lowest accuracy, while class 6, Not Allowed, has the highest accuracy.
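Per-class accuracy, as read off a confusion matrix like the one in Fig. 8, is the diagonal entry of each row divided by the row total. The sketch below assumes that rows are true classes and columns are predicted classes. The 3x3 matrix is made-up toy data and not the paper's results. `per_class_accuracy` is a hypothetical helper.

```python
def per_class_accuracy(confusion):
    """Fraction of samples of each true class that were predicted correctly.

    Assumption: confusion[i][j] counts samples of true class i that were
    predicted as class j.
    """
    accuracies = []
    for i, row in enumerate(confusion):
        total = sum(row)
        accuracies.append(row[i] / total if total else 0.0)
    return accuracies


if __name__ == "__main__":
    # Toy counts for three classes, for illustration only.
    toy = [[8, 2, 0],
           [1, 9, 0],
           [0, 5, 5]]
    print(per_class_accuracy(toy))
```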

V Conclusions

We showed an end-to-end approach for spatiotemporal gesture learning from point cloud data. Our data augmentation approach achieved an increase in model accuracy of up to 10%. One limitation of this work is the difficulty of working with higher-resolution representations of point clouds. A smaller voxel size would dramatically increase the dimension of the training data, and subsequently the training and computation time involved. At some point a smaller voxel size is infeasible because of limited computation resources and memory. Future work could address a computation- and memory-efficient representation of point clouds, or another data augmentation scheme that compensates for the anthropometry of the subjects used for training, thereby covering a wide range of users.


This work is partially supported by JSPS Grant-in-Aid 16H06536.


  • [1] M. Ye, Q. Zhang, L. Wang, J. Zhu, R. Yang, and J. Gall, “A survey on human motion analysis from depth data,” Lecture Notes in Computer Science (Time-of-Flight and Depth Imaging. Sensors, Algorithms, and Applications), vol. 8200 LNCS, pp. 149–187, 2013.
  • [2] S. Nikolaidis, K. Gu, R. Ramakrishnan, and J. Shah, “Efficient Model Learning for Human-Robot Collaborative Tasks,” arXiv, pp. 1–9, 2014.
  • [3] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Weinberger, Eds.   Curran Associates, Inc., 2014, pp. 568–576.
  • [4] P. Wang, W. Li, S. Liu, Y. Zhang, Z. Gao, and P. Ogunbona, “Large-scale continuous gesture recognition using convolutional neural networks,” 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 13–18, Dec 2016.
  • [5] G. Marin, F. Dominio, and P. Zanuttigh, “Hand gesture recognition with Leap Motion and Kinect devices,” International Conference on Image Processing (ICIP), pp. 1565–1569, 2014.
  • [6] M. Asadi-Aghbolaghi, A. Clapés, M. Bellantonio, H. J. Escalante, V. Ponce-López, X. Baró, I. Guyon, S. Kasaei, and S. Escalera, Deep Learning for Action and Gesture Recognition in Image Sequences: A Survey.   Cham: Springer International Publishing, 2017, pp. 539–578.
  • [7] J. Owoyemi and K. Hashimoto, “Learning Human Motion Intention with 3D Convolutional Neural Network,” in 2017 IEEE International Conference on Mechatronics and Automation (ICMA), 2017, pp. 1810–1815.
  • [8] S. Ji, W. Xu, M. Yang, and K. Yu, “3D Convolutional Neural Networks for Human Action Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 221–231, 2013.
  • [9] C. Lea, A. Reiter, R. Vidal, and G. D. Hager, “Segmental spatiotemporal CNNs for fine-grained action segmentation,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9907 LNCS, pp. 36–52, 2016.
  • [10] C. Lea, R. Vidal, and G. D. Hager, “Learning convolutional action primitives for fine-grained action recognition,” Proceedings - IEEE International Conference on Robotics and Automation, vol. 2016-June, pp. 1642–1649, 2016.
  • [11] S. S. Rautaray and A. Agrawal, “Vision based hand gesture recognition for human computer interaction: a survey,” Artificial Intelligence Review, vol. 43, no. 1, pp. 1–54, 2012.
  • [12] B. Feng, F. He, X. Wang, Y. Wu, H. Wang, S. Yi, and W. Liu, “Depth-projection-map-based bag of contour fragments for robust hand gesture recognition,” IEEE Transactions on Human-Machine Systems, vol. 47, no. 4, pp. 511–523, Aug 2017.
  • [13] Y. Song, J. Tang, F. Liu, and S. Yan, “Body surface context: A new robust feature for action recognition from depth videos,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 6, pp. 952–964, June 2014.
  • [14] B. Apostol, C. R. Mihalache, and V. Manta, “Using spin images for hand gesture recognition in 3D point clouds,” System Theory, Control and Computing (ICSTCC), 2014 18th International Conference, no. November 2010, pp. 544–549, 2014.
  • [15] P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang, and P. Ogunbona, “Action Recognition from Depth Maps Using Deep Convolutional Neural Networks,” IEEE Transactions on Human Machine Systems, vol. 46, no. 4, pp. 1–12, 2015.
  • [16] S. Wu, F. Jiang, and D. Zhao, “Hand gesture recognition based on skeleton of point clouds,” 2012 IEEE 5th International Conference on Advanced Computational Intelligence, ICACI 2012, pp. 566–569, 2012.
  • [17] Z. Ding, P. Wang, P. O. Ogunbona, and W. Li, “Investigation of different skeleton features for cnn-based 3d action recognition,” arXiv preprint arXiv:1705.00835, 2017.
  • [18] M. Cholewa and P. Sporysz, “Classification of dynamic sequences of 3D point clouds,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 8467 LNAI, no. PART 1, pp. 672–683, 2014.
  • [19] M. Asadi-Aghbolaghi and S. Kasaei, “View invariant human action recognition using fourier-based and radon-based point cloud analysis,” 2014 7th International Symposium on Telecommunications, IST 2014, pp. 66–71, 2014.
  • [20] A. Abdolmaleki, M. Movahedi, N. Lau, and L. P. Reis, “RoboCup 2012: Robot Soccer World Cup XVI,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 7500, pp. 237–248, 2013.
  • [21] C. Liang, Y. Song, and Y. Zhang, “Hand gesture recognition using view projection from point cloud,” in Image Processing (ICIP), 2016 IEEE International Conference on, no. 28.   IEEE, 2016, pp. 4413–4417.
  • [22] B. Feng, F. He, X. Wang, Y. Wu, H. Wang, S. Yi, and W. Liu, “Depth-Projection-Map-Based Bag of Contour Fragments for Robust Hand Gesture Recognition,” IEEE Transactions on Human-Machine Systems, vol. 47, no. 4, pp. 511–523, 2016.
  • [23] J. Imran and P. Kumar, “Human action recognition using RGB-D sensor and deep convolutional neural networks,” 2016 International Conference on Advances in Computing, Communications and Informatics, ICACCI 2016, pp. 144–148, 2016.
  • [24] C. Li, Y. Hou, P. Wang, and W. Li, “Joint Distance Maps Based Action Recognition with Convolutional Neural Networks,” IEEE Signal Processing Letters, vol. 24, no. 5, pp. 624–628, 2017.
  • [25] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient Based Learning Applied to Document Recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
  • [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Advances In Neural Information Processing Systems, pp. 1–9, 2012.
  • [27] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and F. F. Li, “Large-scale video classification with convolutional neural networks,” Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 1725–1732, 2014.
  • [28] D. Maturana and S. Scherer, “VoxNet: A 3D Convolutional Neural Network for Real-Time Object Recognition,” IROS, pp. 922–928, 2015.
  • [29] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier Nonlinearities Improve Neural Network Acoustic Models,” Proceedings of the 30th International Conference on Machine Learning, vol. 28, p. 6, 2013.
  • [30] D. Scherer, A. Müller, and S. Behnke, “Evaluation of pooling operations in convolutional architectures for object recognition,” Artificial Neural Networks–ICANN 2010, pp. 92–101, 2010.
  • [31] D. P. Kingma and J. L. Ba, “Adam: a Method for Stochastic Optimization,” International Conference on Learning Representations 2015, pp. 1–15, 2015.
  • [32] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” ArXiv e-prints, pp. 1–18, 2012.
  • [33] K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, and J. Schmidhuber, “LSTM: A Search Space Odyssey,” IEEE Transactions on Neural Networks and Learning Systems, 2016.