A Video Recognition Method by using Adaptive Structural Learning of Long Short Term Memory based Deep Belief Network

09/30/2019 ∙ by Shin Kamada, et al. ∙ Prefectural University of Hiroshima

Deep learning builds deep architectures such as multi-layered artificial neural networks to effectively represent multiple features of input patterns. The adaptive structural learning method of Deep Belief Network (DBN) can realize a high classification capability while searching for the optimal network structure during training. The method finds the optimal number of hidden neurons of a Restricted Boltzmann Machine (RBM) with the neuron generation-annihilation algorithm while training on the given input data, and then adds a new layer to the DBN with the layer generation algorithm to realize a deep data representation. Moreover, the learning algorithm of Adaptive RBM and Adaptive DBN was extended to time-series analysis by using the idea of LSTM (Long Short Term Memory). In this paper, our proposed prediction method was applied to Moving MNIST, which is a benchmark data set for video recognition. We aim to demonstrate the power of our proposed method in the video recognition research field, since video is a rich source of visual information. Compared with the LSTM model, our method showed higher prediction performance (more than 90% prediction accuracy for test data).




I Introduction

Recently, Artificial Intelligence (AI) with sophisticated technologies has become an essential technique in our lives [1]. In particular, recent advances in deep learning methods enable higher performance on several kinds of big data compared to traditional methods [2, 3]. For example, CNNs (Convolutional Neural Networks) such as AlexNet [4], GoogLeNet [5], VGG16 [6], and ResNet [7] greatly improved classification and detection accuracy in image recognition [8].

Following the improvement of image recognition, deep learning is also applied to video recognition [9]. Video recognition is a kind of fusion task that requires image recognition and time-series prediction simultaneously. That is, a recurrent function is required that classifies a given image or detects an object while predicting the future. Understanding of time-series video is expected in various industrial fields, such as human detection, pose or facial estimation from video cameras, and autonomous driving systems.


LSTM (Long Short Term Memory) is a well-known method for time-series prediction and is applied to deep learning methods [11]. The method enables the traditional recurrent neural network to retain not only short-term memory but also long-term memory for given sequential data [12]. For video recognition with LSTM, convolutional filters can be used instead of one-dimensional neurons, since one frame of a sequential video can be seen as one image [13].

In our research, we proposed the adaptive structural learning method of DBN [14]. The adaptive structural learning can find a network structure of suitable size for the given input space during training. The neuron generation and annihilation algorithms [15, 16] were implemented on the Restricted Boltzmann Machine (RBM) [17], and the layer generation algorithm [18] was implemented on the Deep Belief Network (DBN) [19]. The adaptive structural learning of DBN (Adaptive DBN) showed the highest classification capability in image recognition on benchmark data sets such as MNIST [20], CIFAR-10, and CIFAR-100 [21]. Moreover, the learning algorithm of Adaptive RBM and Adaptive DBN was extended to time-series prediction by using the idea of LSTM [22]. Whereas LSTM is often implemented on a CNN structure, we implemented it on our Adaptive RBM and DBN, and the proposed method showed higher prediction accuracy than the other methods on several time-series benchmark data sets, such as Nottingham (MIDI) and CMU (Motion Capture).

To further improve the method, in this paper we applied it to Moving MNIST [23], which is a benchmark data set for video recognition. We aim to demonstrate the power of our proposed method in the video recognition research field, since video is a rich source of visual information. Compared with the LSTM model [24], our method achieved higher prediction performance.

The remainder of this paper is organized as follows. Section II briefly explains the basic idea of the adaptive structural learning of DBN. Section III describes the extension of Adaptive DBN to time-series prediction. Section IV verifies the effectiveness of our proposed method on Moving MNIST. Section V gives some discussion and concludes this paper.

II Adaptive Learning Method of Deep Belief Network

This section explains the traditional RBM [17] and DBN [19] to describe the basic behavior of our proposed adaptive learning method of DBN.

II-A Neuron Generation and Annihilation Algorithm of RBM

Although recent deep learning models have high classification capability, determining the network structure and the number of parameters remains a difficult task in AI research. To address this problem, we have developed the adaptive structural learning method of RBM (Adaptive RBM) [14]. An RBM, as shown in Fig. 1, is an unsupervised, energy-based graphical model with two kinds of layers: a visible layer for input and a hidden layer for feature vectors. The neuron generation algorithm of the Adaptive RBM can generate an optimal number of hidden neurons, so that the trained RBM has a suitable structure for the given input space.

The neuron generation is based on the idea of Walking Distance (WD), which is inspired by the multi-layered neural network in [25]. WD is the difference between the previous variance and the current variance of the learning parameters. An RBM has three kinds of parameters: those for the visible neurons, those for the hidden neurons, and the weights of the connections between them. The Adaptive RBM monitors these parameters except the one for the visible neurons (the paper [14] describes the reason for the exclusion). A WD that remains large during training means that the existing hidden neurons alone cannot represent an ambiguous pattern, because there are too few hidden neurons. In order to express the ambiguous patterns, a new neuron is inserted that inherits the attributes of the parent hidden neuron, as shown in Fig. 2(a).
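The generation step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the exact rule of [14]: the function names `walking_distance` and `generate_neurons`, the threshold `theta_g`, the use of the product of the weight WD and the bias WD as the trigger, and the small Gaussian perturbation of the inherited weights are all hypothetical choices.

```python
import random

def walking_distance(prev, curr):
    """Mean absolute change of a parameter vector between two training steps.
    (Assumed stand-in for the WD monitored by the Adaptive RBM.)"""
    return sum(abs(c - p) for p, c in zip(prev, curr)) / len(curr)

def generate_neurons(weights, biases, wd_weights, wd_biases, theta_g,
                     noise=1e-3, seed=0):
    """Append a child neuron for every hidden neuron whose WD is still large.

    weights[j] holds the visible-to-hidden weights of hidden neuron j and
    biases[j] its bias; wd_weights[j] and wd_biases[j] are the monitored WDs.
    """
    rng = random.Random(seed)
    new_weights = [row[:] for row in weights]
    new_biases = biases[:]
    parents = []
    for j, (dw, dc) in enumerate(zip(wd_weights, wd_biases)):
        if dw * dc > theta_g:  # assumed trigger: both WD values remain large
            parents.append(j)
            # The child inherits the parent's attributes, slightly perturbed.
            new_weights.append([w + rng.gauss(0.0, noise) for w in weights[j]])
            new_biases.append(biases[j])
    return new_weights, new_biases, parents

# Example: only hidden neuron 1 changed between two steps, so it gets a child.
prev_weights = [[0.1, 0.2], [0.0, 0.0], [0.3, 0.1]]
weights = [[0.1, 0.2], [0.5, 0.5], [0.3, 0.1]]
prev_biases = [0.0, 0.0, 0.0]
biases = [0.0, 0.4, 0.0]
wd_w = [walking_distance(p, c) for p, c in zip(prev_weights, weights)]
wd_b = [walking_distance([p], [c]) for p, c in zip(prev_biases, biases)]
new_w, new_b, parents = generate_neurons(weights, biases, wd_w, wd_b, theta_g=0.1)
print(parents, len(new_w))
```

Copying the parent's weights with a small perturbation is one simple way to let the child start from the parent's attributes while still being able to diverge during later training.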

Fig. 1: A network structure of RBM

In addition to neuron generation, the neuron annihilation algorithm is applied to the Adaptive RBM after the neuron generation process, as shown in Fig. 2(b). The neuron generation process may produce some unnecessary or redundant neurons. Such hidden neurons are therefore removed according to their output activities.
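A minimal sketch of such an annihilation step is given below. It assumes that "output activity" means the mean activation probability of a hidden neuron over the training samples and that a neuron is removed when this mean falls below a threshold; the name `annihilate_neurons` and the threshold `theta_a` are hypothetical and are not taken from [14].

```python
def annihilate_neurons(weights, biases, hidden_probs, theta_a):
    """Drop hidden neurons whose mean activation over the data is below theta_a.

    hidden_probs[n][j] is the activation probability of hidden neuron j
    for training sample n; weights[j] and biases[j] belong to neuron j.
    """
    n_samples = len(hidden_probs)
    n_hidden = len(biases)
    mean_act = [sum(row[j] for row in hidden_probs) / n_samples
                for j in range(n_hidden)]
    keep = [j for j in range(n_hidden) if mean_act[j] >= theta_a]
    return [weights[j] for j in keep], [biases[j] for j in keep], keep

# Example: neuron 1 is almost never active and is removed.
hidden_probs = [[0.9, 0.01, 0.6],
                [0.8, 0.02, 0.4]]
weights = [[0.1, 0.2], [0.5, 0.5], [0.3, 0.1]]
biases = [0.0, 0.4, 0.1]
kept_w, kept_b, keep = annihilate_neurons(weights, biases, hidden_probs, theta_a=0.05)
print(keep)
```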

II-B Layer Generation Algorithm of DBN

A DBN is a hierarchical model built by stacking several pre-trained RBMs. In the building process, the output (hidden neuron activations) of the $l$-th RBM is used as the input of the $(l+1)$-th RBM. Generally, a DBN with multiple RBMs has higher data representation power than a single RBM. Such a hierarchical model can represent the specified features from an abstract concept to a concrete object in the direction from the input layer to the output layer. However, the optimal number of RBMs depends on the target data space.
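The stacking can be illustrated by a forward pass through already trained RBMs. The sketch below assumes sigmoid hidden units and omits pre-training entirely; the names `rbm_hidden` and `dbn_forward` and the toy weights are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def rbm_hidden(v, weights, biases):
    """Hidden activation probabilities of one RBM.
    weights[j][i] connects visible neuron i to hidden neuron j."""
    return [sigmoid(b + sum(w * x for w, x in zip(row, v)))
            for row, b in zip(weights, biases)]

def dbn_forward(v, layers):
    """Feed the hidden activations of each RBM to the next RBM as input."""
    activations = [v]
    for weights, biases in layers:
        activations.append(rbm_hidden(activations[-1], weights, biases))
    return activations

layers = [
    ([[0.5, -0.5, 0.0], [0.0, 0.5, 0.5]], [0.0, 0.0]),  # RBM 1: 3 visible, 2 hidden
    ([[1.0, -1.0]], [0.0]),                             # RBM 2: 2 visible, 1 hidden
]
acts = dbn_forward([1.0, 1.0, 0.0], layers)
print([len(a) for a in acts])
```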

We developed Adaptive DBN, which automatically adjusts the network structure by self-organization, monitoring the WD in a similar way. If neither the WD nor the energy function becomes small, the current RBM lacks the data representation power needed to reproduce the input patterns, and a new RBM is generated to keep suitable classification power for the data set. Therefore, the condition for layer generation is defined by using the total WD and the energy function. Fig. 3 shows the overview of layer generation in Adaptive DBN.
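As a rough illustration of this condition, the sketch below computes the standard RBM energy $E(v, h) = -\sum_i b_i v_i - \sum_j c_j h_j - \sum_{i,j} v_i W_{ji} h_j$ and compares the total WD and the energy with two thresholds. The function `should_add_layer` and the thresholds `theta_wd` and `theta_energy` are assumptions made for the example; the exact condition is defined in [14].

```python
def rbm_energy(v, h, weights, b, c):
    """Energy of a joint configuration (v, h) of an RBM.
    weights[j][i] connects visible neuron i to hidden neuron j."""
    visible_term = sum(bi * vi for bi, vi in zip(b, v))
    hidden_term = sum(cj * hj for cj, hj in zip(c, h))
    interaction = sum(v[i] * weights[j][i] * h[j]
                      for j in range(len(h)) for i in range(len(v)))
    return -(visible_term + hidden_term + interaction)

def should_add_layer(total_wd, total_energy, theta_wd, theta_energy):
    """Assumed rule: add an RBM while both monitored values stay large."""
    return total_wd > theta_wd and total_energy > theta_energy

energy = rbm_energy(v=[1.0, 0.0], h=[1.0], weights=[[0.5, 0.2]],
                    b=[0.1, 0.1], c=[0.3])
print(energy, should_add_layer(2.0, energy, theta_wd=1.0, theta_energy=-5.0))
```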

(a) Neuron generation
(b) Neuron annihilation
Fig. 2: Adaptive RBM
Fig. 3: Overview of Adaptive DBN

III Adaptive RNN-DBN for time-series prediction

In time-series prediction, some LSTM methods improve the prediction performance of the traditional recurrent neural network by using several gates such as the forget gate, the peephole connection gate, and the full gradient gate [11]. These gates can represent multiple patterns of a time-series sequence, that is, not only short-term memory but also long-term memory.

Recurrent Neural Network Restricted Boltzmann Machine (RNN-RBM) [26] is an RBM-based recurrent model for time-series prediction. The method is an extension of the traditional Temporal RBM (TRBM) and Recurrent TRBM (RTRBM) [27], and it also uses an idea similar to LSTM for better performance.

Fig. 4 shows the network structure of RNN-RBM. Like an RNN, RNN-RBM has a recurrent structure based on a Markov process for a given time-series sequence. Let an input sequence of length $T$ be $V = \{v^{(1)}, \ldots, v^{(t)}, \ldots, v^{(T)}\}$. $v^{(t)}$ and $h^{(t)}$ are the input and hidden neurons of the $t$-th RBM, respectively. The time-dependent parameters $b^{(t)}$, $c^{(t)}$, and $u^{(t)}$ are calculated from the past parameters at $t-1$ by the following equations.

$$b^{(t)} = b + W_{uv} u^{(t-1)}$$

$$c^{(t)} = c + W_{uh} u^{(t-1)}$$

$$u^{(t)} = \sigma(b_u + W_{uu} u^{(t-1)} + W_{vu} v^{(t)})$$

$\sigma$ is an activation function. For example, the $\tanh$ function was used in the paper [26]. $u^{(t)}$ represents the time-series context for the given input. $u^{(0)}$ is the initial state, and its values are given randomly. $\theta = \{b, c, W, b_u, W_{uv}, W_{uh}, W_{uu}, W_{vu}\}$ is the set of learning parameters, which does not depend on time. At each time $t$, the RBM learning can employ the update algorithm of $b^{(t)}$ and $c^{(t)}$ at time $t$ and of the weights $W$ between them. After the error has been calculated up to time $T$, the gradient for $\theta$ is traced backward from time $T$ to time $t$ by BPTT [28, 29]. BPTT is the Back Propagation Through Time method, which is often used in traditional recurrent neural networks.
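The recurrence of RNN-RBM can be sketched as a forward pass that produces the time-dependent biases of the RBM at every step. The sketch follows the general form of the model in [26] with tanh as the activation function; the parameter names (`b`, `c`, `b_u`, `W_uv`, `W_uh`, `W_uu`, `W_vu`), the helper functions, and the toy values are assumptions for illustration, and the RBM update and BPTT are omitted.

```python
import math

def matvec(M, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def add(*vectors):
    """Element-wise sum of vectors of equal length."""
    return [sum(vals) for vals in zip(*vectors)]

def rnn_rbm_biases(sequence, params, u0):
    """Return the (visible bias, hidden bias) pair of every time step
    and the final context vector."""
    u_prev = u0
    biases = []
    for v_t in sequence:
        # Biases at time t depend only on the context of time t-1.
        b_t = add(params["b"], matvec(params["W_uv"], u_prev))
        c_t = add(params["c"], matvec(params["W_uh"], u_prev))
        biases.append((b_t, c_t))
        # The context is then updated with the current input frame.
        pre = add(params["b_u"], matvec(params["W_uu"], u_prev),
                  matvec(params["W_vu"], v_t))
        u_prev = [math.tanh(x) for x in pre]
    return biases, u_prev

# Toy model: 2 visible neurons, 1 hidden neuron, 1 context neuron.
params = {
    "b": [0.0, 0.0], "c": [0.0], "b_u": [0.0],
    "W_uv": [[1.0], [-1.0]], "W_uh": [[2.0]],
    "W_uu": [[0.0]], "W_vu": [[1.0, 0.0]],
}
biases, u_final = rnn_rbm_biases([[1.0, 0.0], [0.0, 1.0]], params, u0=[0.0])
print(biases[1], u_final)
```

In this toy example the first frame activates the context neuron, which shifts the biases of the second RBM; this is how information about past frames reaches the RBM at the current time.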

We developed Adaptive RNN-RBM by applying the neuron generation and annihilation algorithm to RNN-RBM. A suitable number of hidden neurons is sought by the neuron generation and annihilation algorithm for better representation power, in the same way as in the usual Adaptive RBM [22]. In addition, the hierarchical model with layer generation was developed as Adaptive RNN-DBN by using the same monitoring function as Adaptive DBN. In other words, the output signal of the hidden neurons of the $l$-th Adaptive RNN-RBM is used as the input signal of the $(l+1)$-th Adaptive RNN-RBM, as shown in Fig. 3. The proposed Adaptive RNN-DBN was applied to several time-series benchmark data sets, such as Nottingham (MIDI) and CMU (Motion Capture). In [22], the prediction accuracy of the proposed method was higher than that of the traditional methods.

Fig. 4: Structure of RNN-RBM

IV Experiment Results

In this paper, the effectiveness of our proposed method was verified on the Moving MNIST benchmark data set [24]. We aim to demonstrate the power of our proposed method in the video recognition research field, since video is a rich source of visual information. When visual objects, faces, emotions, actions, or events must be detected and identified in real time, state-of-the-art video recognition software for comprehensive video content analysis can provide advanced solutions for AI-based visual content search. Therefore, the prediction performance is compared with that of the traditional LSTM in this paper.

Fig. 5: Two samples of Moving MNIST

IV-A Data set: Moving MNIST

Moving MNIST [24] is a benchmark data set for video recognition. It contains 10,000 samples: 8,000 for training and 2,000 for test. Each sample consists of 20 sequential gray scale images of a $64 \times 64$ patch, in which two digits move. The digits were chosen randomly from the training set. The selected digits are placed initially at random locations inside the patch. Each digit was assigned a velocity, which has a direction and a magnitude. The direction and the magnitude were also chosen randomly. Fig. 5 shows two samples of 20 sequential images.

Moving MNIST is used in various studies, such as video decomposition [30] and future prediction [24]. This paper aims to investigate the effectiveness of our LSTM-based model in video recognition, and we compare its recognition capability as a future predictor.

Since no teacher signal is provided for the samples in Moving MNIST, the LSTM in [24] was evaluated by the cross entropy between the given sequence (ground truth) and the predicted sequence. Fig. 6 shows the composite model of LSTM in [24]. In that method, the first 10 frames are used as the input sequence of the model, and the remaining 10 frames are used to evaluate the future prediction, as shown in Fig. 5.

Fig. 7 shows an abstract procedure of prediction in our method. The same evaluation method as in [24] was used for our method. Since our method predicts the next frame for a given frame, each predicted frame is used as the next input, and then the squared error and the prediction accuracy between the ground truth images and the predicted images were evaluated.
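The prediction procedure can be sketched as an autoregressive rollout. The sketch is a simplified illustration: `predict_next` is a hypothetical stand-in for the trained model, frames are flat lists of pixel values, and the prediction accuracy is assumed here to be the ratio of pixels that agree after thresholding at 0.5, which the paper does not define explicitly.

```python
def rollout(predict_next, input_frames, n_future):
    """Predict n_future frames, feeding every predicted frame back as input."""
    history = [list(frame) for frame in input_frames]
    for _ in range(n_future):
        history.append(predict_next(history))
    return history[len(input_frames):]

def squared_error(truth, pred):
    """Sum of squared pixel differences over all frames."""
    return sum((t - p) ** 2
               for ft, fp in zip(truth, pred) for t, p in zip(ft, fp))

def pixel_accuracy(truth, pred, threshold=0.5):
    """Ratio of pixels that agree after binarization (assumed definition)."""
    pairs = [(t >= threshold, p >= threshold)
             for ft, fp in zip(truth, pred) for t, p in zip(ft, fp)]
    return sum(a == b for a, b in pairs) / len(pairs)

def shift_right(history):
    """Toy predictor: move the last frame one pixel to the right (cyclic)."""
    last = history[-1]
    return last[-1:] + last[:-1]

# Toy "video": one bright pixel moving over a 4-pixel frame for 6 steps.
video = [[1.0 if i == t % 4 else 0.0 for i in range(4)] for t in range(6)]
predicted = rollout(shift_right, video[:3], n_future=3)
print(squared_error(video[3:], predicted), pixel_accuracy(video[3:], predicted))
```

For Moving MNIST, the same loop would take the first 10 frames as `input_frames` and set `n_future` to 10.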

Fig. 6: The Composite Model of LSTM [24]
Fig. 7: Abstract of prediction

IV-B Experimental Results

Table I and Table II show the prediction results on Moving MNIST for training and test. The value in Table I is the cross entropy between the ground truth images and the predicted images at the last frame of the sequence. The result of the LSTM is cited from the paper [24]. In Table II, the first value in each cell is the squared error and the value in brackets is the prediction accuracy.
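For reference, a cross entropy between a ground truth frame and a predicted frame can be computed as below. This is a generic sketch that assumes pixel values in [0, 1] and a sum over pixels; the function name and the clipping constant `eps` are hypothetical, and the exact normalization used in [24] may differ.

```python
import math

def binary_cross_entropy(truth, pred, eps=1e-12):
    """Binary cross entropy summed over the pixels of one frame."""
    total = 0.0
    for t, p in zip(truth, pred):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total

uncertain = binary_cross_entropy([1.0, 0.0], [0.5, 0.5])
perfect = binary_cross_entropy([1.0, 0.0], [1.0, 0.0])
print(uncertain, perfect)
```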

For the LSTM, the training result and the prediction accuracy for test are not reported. For the Adaptive RNN-DBN, three settings of the learning rate (lr) were used for evaluation. For training of the Adaptive RNN-DBN, the cross entropy and the squared error became small (16.1 to 18.5 and 10.3 to 14.5, respectively), and the prediction accuracy reached about 99%. For test, all three settings of the Adaptive RNN-DBN showed smaller prediction error than the LSTM. The best performance of the Adaptive RNN-DBN was obtained when the learning rate was 0.050.

We also investigated the intermediate results in the predicted images of the Adaptive RNN-DBN. Table III shows the future prediction accuracy for each predicted frame. The prediction accuracy decreased gradually from the 11th to the 20th frame.

TABLE I: Prediction result (Cross entropy)

| Model | Training | Test |
|---|---|---|
| LSTM [24] | - | 341.2 |
| Adaptive RNN-DBN (lr = 0.010) | 18.5 | 165.3 |
| Adaptive RNN-DBN (lr = 0.050) | 16.1 | 134.0 |
| Adaptive RNN-DBN (lr = 0.001) | 17.0 | 140.6 |

TABLE II: Prediction result (Squared loss and correct ratio)

| Model | Training | Test |
|---|---|---|
| Adaptive RNN-DBN (lr = 0.010) | 11.4 (99.0%) | 119.9 (91.8%) |
| Adaptive RNN-DBN (lr = 0.050) | 10.3 (99.4%) | 100.5 (92.5%) |
| Adaptive RNN-DBN (lr = 0.001) | 14.5 (98.9%) | 140.2 (89.4%) |

TABLE III: Prediction Accuracy for each frame

| Model | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|
| Adaptive RNN-DBN (lr = 0.010) | 98.2% | 97.9% | 97.0% | 95.2% | 94.5% | 93.9% | 93.3% | 92.8% | 92.0% | 91.8% |
| Adaptive RNN-DBN (lr = 0.050) | 99.5% | 98.0% | 97.9% | 96.5% | 96.4% | 94.5% | 93.1% | 92.9% | 92.8% | 92.5% |
| Adaptive RNN-DBN (lr = 0.001) | 96.8% | 96.8% | 96.4% | 94.8% | 93.6% | 92.9% | 91.8% | 91.0% | 90.0% | 89.4% |

V Conclusion

Deep learning is widely used in various research fields, especially image recognition. In our research, Adaptive DBN, which can find the optimal network structure for given data, was developed. The method shows higher classification accuracy than existing deep learning methods on several benchmark data sets. In this paper, our proposed prediction method was applied to Moving MNIST, which is a benchmark data set for video recognition. Compared with the LSTM model, our method showed higher prediction performance (more than 90% prediction accuracy for test data). Our proposed method will be further improved for better prediction capability by evaluating it on other large video data, such as video streams and time-series video data for defect detection.


This work was supported by JSPS KAKENHI Grant Numbers 19K12142 and 19K24365, and by the commissioned research of the National Institute of Information and Communications Technology (NICT, 21405), Japan.


  • [1] Markets and Markets, http://www.marketsandmarkets.com/Market-Reports/deep-learning-market-107369271.html (accessed 28 November 2018) (2016)
  • [2] Y.Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, vol.2, no.1, pp.1–127 (2009)
  • [3] Q.V.Le, M.A.Ranzato, et al., Building high-level features using large scale unsupervised learning, Proc. of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.8595–8598 (2013)
  • [4] A.Krizhevsky, I.Sutskever, G.E.Hinton, ImageNet Classification with Deep Convolutional Neural Networks, Proc. of Advances in Neural Information Processing Systems 25 (NIPS 2012) (2012)
  • [5] C.Szegedy, W. Liu, Y.Jia, P.Sermanet, S.Reed, D.Anguelov, D.Erhan, V.Vanhoucke, A.Rabinovich, Going Deeper with Convolutions, Proc. of CVPR2015 (2015)
  • [6] K.Simonyan, A.Zisserman, Very deep convolutional networks for large-scale image recognition, Proc. of International Conference on Learning Representations (ICLR 2015) (2015)
  • [7] K.He, X.Zhang, S.Ren, J.Sun, Deep residual learning for image recognition, Proc. of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.770–778 (2016)
  • [8] O.Russakovsky, J.Deng, H.Su, J.Krause, S.Satheesh, S.Ma, Z.Huang, A.Karpathy, A.Khosla, M.Bernstein, et al., Imagenet large scale visual recognition challenge, International Journal of Computer Vision, vol.115, no.3. pp.211–252 (2015)
  • [9] M.Mohammadi, A.Al-Fuqaha, S.Sorour, and M.Guizani, Deep Learning for IoT Big Data and Streaming Analytics: A Survey, in IEEE Communications Surveys & Tutorials (2018)
  • [10] H.Zhang, Y.Zhang and B.Zhong, et.al., A Comprehensive Survey of Vision-Based Human Action Recognition Methods, Sensors, vol.19, no.5, pp.1–20 (2019)
  • [11] Y.Bengio, P.Simard, and P.Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, vol.5, no.2, pp.157–166 (1994)
  • [12] Z.C.Lipton, D.C.Kale, C.Elkan, and R.Wetzell, Learning to Diagnose with LSTM Recurrent Neural Networks, in International Conference on Learning Representations (ICLR 2016), pp.1–18 (2016)
  • [13] S.Xingjian, C.Zhourong, W.Hao, Y.Dit-Yan, W.Wai-kin, W.Wang-chun, Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting, Advances in Neural Information Processing Systems 28 (NIPS 2015) pp.802–810 (2015)
  • [14] S.Kamada, T.Ichimura, A.Hara, and K.J.Mackin, Adaptive Structure Learning Method of Deep Belief Network using Neuron Generation-Annihilation and Layer Generation, Neural Computing and Applications, pp.1–15 (2018)
  • [15] S.Kamada and T.Ichimura, An Adaptive Learning Method of Restricted Boltzmann Machine by Neuron Generation and Annihilation Algorithm. Proc. of 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC2016), pp.1273–1278 (2016)
  • [16] S.Kamada, T.Ichimura, A Structural Learning Method of Restricted Boltzmann Machine by Neuron Generation and Annihilation Algorithm, Neural Information Processing, Proc. of the 23rd International Conference on Neural Information Processing (Springer LNCS 9950), pp.372–380 (2016)
  • [17] G.E.Hinton, A Practical Guide to Training Restricted Boltzmann Machines, Neural Networks, Tricks of the Trade, Lecture Notes in Computer Science (LNCS, vol.7700), pp.599–619 (2012)
  • [18] S.Kamada and T.Ichimura, An Adaptive Learning Method of Deep Belief Network by Layer Generation Algorithm, Proc. of IEEE TENCON2016, pp.2971–2974 (2016)
  • [19] G.E.Hinton, S.Osindero and Y.Teh, A fast learning algorithm for deep belief nets, Neural Computation, vol.18, no.7, pp.1527–1554 (2006)
  • [20] Y.LeCun, L.Bottou, Y.Bengio, and P.Haffner, Gradient-based learning applied to document recognition, Proc. of the IEEE, vol.86, no.11, pp.2278–2324 (1998)
  • [21] A.Krizhevsky, Learning Multiple Layers of Features from Tiny Images, Master's thesis, University of Toronto (2009)
  • [22] T.Ichimura, S.Kamada, Adaptive Learning Method of Recurrent Temporal Deep Belief Network to Analyze Time Series Data, Proc. of the 2017 International Joint Conference on Neural Network (IJCNN 2017), pp.2346–2353 (2017)
  • [23] http://www.cs.toronto.edu/~nitish/unsupervised_video/ (2019/7/23)
  • [24] N.Srivastava, E.Mansimov, R.Salakhutdinov, Unsupervised learning of video representations using LSTMs, Proc. of the 32nd International Conference on Machine Learning (ICML 2015), vol.37, pp.843–852 (2015)
  • [25] T.Ichimura, E.Tazaki and K.Yoshida, Extraction of fuzzy rules using neural networks with structure level adaptation -verification to the diagnosis of hepatobiliary disorders, International Journal of Biomedical Computing, Vol.40, No.2, pp.139–146 (1995)
  • [26] N.Boulanger-Lewandowski, Y.Bengio and P.Vincent, Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription, Proc. of the 29th International Conference on Machine Learning (ICML 2012), pp.1159–1166 (2012)
  • [27] I.Sutskever, G.E.Hinton, and G.W.Taylor, The Recurrent Temporal Restricted Boltzmann Machine, Proc. of Advances in Neural Information Processing Systems 21 (NIPS-2008) (2008)
  • [28] J.Elman, Finding structure in time, Cognitive Science, Vol.14, No.2 (1990)
  • [29] M.Jordan, Serial order: A parallel distributed processing approach, Tech. Rep. No. 8604. San Diego: University of California, Institute for Cognitive Science (1986)
  • [30] J.Hsieh, B.Liu, D.Huang, L.Fei-Fei, J.C.Niebles, Learning to Decompose and Disentangle Representations for Video Prediction, Procs. of Advances in Neural Information Processing Systems 31 (NIPS 2018) (2018)