I Introduction
Recently, Artificial Intelligence (AI) built on sophisticated technologies has become an essential part of daily life [1]. In particular, recent advances in deep learning have enabled higher performance on several kinds of big data than traditional methods [2, 3]. For example, Convolutional Neural Networks (CNNs) such as AlexNet [4], GoogLeNet [5], VGG16 [6], and ResNet [7] have greatly improved classification and detection accuracy in image recognition [8].
Following the progress in image recognition, deep learning has also been applied to video recognition [9]. Video recognition is a kind of fusion task that requires image recognition and time-series prediction simultaneously. That is, it requires a recurrent function that classifies a given image or detects an object while predicting the future. Understanding of time-series video is expected in various industrial fields, such as human detection, pose or facial estimation from video cameras, and autonomous driving systems [10].
LSTM (Long Short-Term Memory) is a well-known method for time-series prediction and is applied in deep learning methods [11]. It enables the traditional recurrent neural network to capture not only short-term memory but also long-term memory for given sequential data [12]. For video recognition with LSTM, convolutional filters can be used instead of one-dimensional neurons, since one frame of a sequential video can be seen as one image [13].
In our research, we proposed the adaptive structural learning method of DBN [14]. The adaptive structural learning can find a network structure of suitable size for the given input space during training. The neuron generation and annihilation algorithms [15, 16] were implemented on the Restricted Boltzmann Machine (RBM) [17], and the layer generation algorithm [18] was implemented on the Deep Belief Network (DBN) [19]. The adaptive structural learning of DBN (Adaptive DBN) has shown the highest classification capability in image recognition on benchmark data sets such as MNIST [20], CIFAR-10, and CIFAR-100 [21]. Moreover, the learning algorithm of Adaptive RBM and Adaptive DBN was extended to time-series prediction by using the idea of LSTM [22]. Whereas LSTM is often implemented on a CNN structure, we implemented it on our Adaptive RBM and DBN, and the proposed method showed higher prediction accuracy than the other methods on several time-series benchmark data sets, such as Nottingham (MIDI) and CMU (Motion Capture).
As a further step, in this paper the proposed method is applied to Moving MNIST [23], a benchmark data set for video recognition. We aim to reveal the capability of our proposed method in the video recognition research field, since video contains a rich source of visual information. Compared with the LSTM model [24], our method achieves higher prediction performance.
The remainder of this paper is organized as follows. Section II briefly explains the basic idea of the adaptive structural learning of DBN. Section III describes the extension of Adaptive DBN to time-series prediction. In Section IV, the effectiveness of the proposed method is verified on Moving MNIST. Section V gives some discussion and concludes this paper.
II Adaptive Learning Method of Deep Belief Network
This section explains the traditional RBM [17] and DBN [19] to describe the basic behavior of our proposed adaptive learning method of DBN.
II-A Neuron Generation and Annihilation Algorithm of RBM
Although recent deep learning models have high classification capability, determining the network structure and the values of some parameters remains a difficult task in AI research. To address this problem, we have developed the adaptive structural learning method for the RBM model (Adaptive RBM) [14]. An RBM, as shown in Fig. 1, is an unsupervised, energy-based graphical model with two kinds of layers: a visible layer for input and a hidden layer for the feature vector. The neuron generation algorithm of the Adaptive RBM can generate an optimal number of hidden neurons, so that the trained RBM has a structure suitable for the given input space.
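As a point of reference for the visible/hidden structure described above, the following is a minimal sketch of the standard binary RBM energy and conditional activation probabilities. The names `W`, `b`, and `c` (weights, visible biases, hidden biases) are assumptions chosen for illustration; the sketch is not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    """Logistic function used for the conditional probabilities."""
    return 1.0 / (1.0 + np.exp(-x))

def rbm_energy(v, h, W, b, c):
    """Energy of a binary RBM: E(v, h) = -b.v - c.h - v.W.h.

    v: (n_visible,), h: (n_hidden,), W: (n_visible, n_hidden).
    """
    return -(b @ v) - (c @ h) - (v @ W @ h)

def hidden_probabilities(v, W, c):
    """p(h_j = 1 | v) for every hidden neuron j."""
    return sigmoid(c + v @ W)

def visible_probabilities(h, W, b):
    """p(v_i = 1 | h) for every visible neuron i."""
    return sigmoid(b + W @ h)
```

Lower energy corresponds to a more probable joint configuration of visible and hidden neurons, which is why the energy can serve as a monitoring signal during training.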
The neuron generation is based on the idea of Walking Distance (WD), which is inspired by the multilayered neural network in [25]. WD is the difference between the previous variance and the current variance of the learning parameters. An RBM has three kinds of parameters: those of the visible neurons, those of the hidden neurons, and the weights of the connections between them. The Adaptive RBM monitors these parameters except for those of the visible neurons (the reason for excluding them is described in [14]). When the WD of the monitored parameters remains large during training, the existing hidden neurons alone cannot represent an ambiguous pattern, because there are too few hidden neurons. In order to express the ambiguous patterns, a new neuron is inserted that inherits the attributes of the parent hidden neuron, as shown in Fig. 2(a).
In addition to neuron generation, the neuron annihilation algorithm is applied to the Adaptive RBM after the neuron generation process, as shown in Fig. 2(b). Some unnecessary or redundant neurons may be generated by the neuron generation process. Therefore, such hidden neurons are removed according to their output activities.
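The generation and annihilation steps can be illustrated with a small sketch. It assumes that a per-neuron WD has already been computed for the hidden biases (`wd_c`) and the weights (`wd_W`), and that the thresholds `theta_g` and `theta_a`, the product-based generation test, and the noise scale are hypothetical choices; the exact conditions are defined in [14].

```python
import numpy as np

def generate_neurons(W, c, wd_W, wd_c, theta_g, rng, noise=0.01):
    """Insert a child neuron next to every hidden neuron whose WD stays large.

    W: (n_visible, n_hidden), c: (n_hidden,)
    wd_W, wd_c: (n_hidden,) per-neuron walking distances (assumed given).
    """
    parents = np.flatnonzero(wd_W * wd_c > theta_g)
    if parents.size == 0:
        return W, c
    # The child inherits the parent's parameters, slightly perturbed.
    child_W = W[:, parents] + noise * rng.standard_normal((W.shape[0], parents.size))
    child_c = c[parents].copy()
    return np.hstack([W, child_W]), np.concatenate([c, child_c])

def annihilate_neurons(W, c, hidden_probs, theta_a):
    """Remove hidden neurons whose mean activation over the data is below theta_a.

    hidden_probs: (n_samples, n_hidden) activation probabilities.
    """
    keep = hidden_probs.mean(axis=0) >= theta_a
    return W[:, keep], c[keep]
```

In this sketch, generation widens the hidden layer where the parameters are still fluctuating, and annihilation then prunes neurons that contribute little to the output.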
II-B Layer Generation Algorithm of DBN
A DBN is a hierarchical model that stacks several pre-trained RBMs. In the building process, the output (hidden neuron activations) of the $l$-th RBM is used as the input of the $(l+1)$-th RBM. Generally, a DBN with multiple RBMs has higher data representation power than a single RBM. Such a hierarchical model can represent the specified features from an abstract concept to a concrete object in the direction from the input layer to the output layer. However, the optimal number of RBMs depends on the target data space.
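The stacking described above can be sketched as a forward pass in which each RBM's hidden activation becomes the next RBM's input. The representation of a layer as a `(W, c)` pair is an assumption made for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbn_forward(v, layers):
    """Propagate an input through stacked RBMs.

    v: (n_samples, n_visible) input batch.
    layers: list of (W, c) pairs, with W of shape (n_in, n_out) and c of shape (n_out,).
    Returns the hidden activations of every layer, bottom to top.
    """
    activations = []
    x = v
    for W, c in layers:
        x = sigmoid(x @ W + c)   # output of this RBM is the input of the next one
        activations.append(x)
    return activations
```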
We developed Adaptive DBN, which can automatically adjust the network to an optimal structure by self-organization, in a way similar to WD monitoring. If neither the WD nor the energy function becomes small, a new RBM is generated to maintain suitable classification power for the data set, since the current RBM lacks the data representation power to capture the input patterns. Therefore, the condition for layer generation is defined using the total WD and the energy function. Fig. 3 shows an overview of layer generation in Adaptive DBN.
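A minimal sketch of such a layer generation test is given below. It assumes that a WD value and an energy-based value are recorded per layer, that both are scaled so that smaller means better fit, and that `theta_wd` and `theta_energy` are hypothetical thresholds; the exact condition is given in [14] and [18].

```python
def should_generate_layer(wd_per_layer, energy_per_layer, theta_wd, theta_energy):
    """Return True when neither the total WD nor the total energy has become small."""
    total_wd = sum(wd_per_layer)
    total_energy = sum(energy_per_layer)
    return total_wd > theta_wd and total_energy > theta_energy
```

For example, `should_generate_layer([0.4, 0.3], [1.0, 2.0], theta_wd=0.5, theta_energy=2.5)` evaluates to `True`, so a new RBM would be stacked on top in this sketch.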
III Adaptive RNN-DBN for Time-Series Prediction
In time-series prediction, some LSTM methods improve the prediction performance of the traditional recurrent neural network by using several gates, such as the forget gate, the peephole connection gate, and the full and gradient gate [11]. These gates can represent multiple patterns of a time-series sequence, that is, not only short-term memory but also long-term memory.
The Recurrent Neural Network Restricted Boltzmann Machine (RNN-RBM) [26] is an RBM-based recurrent model for time-series prediction. It extends the traditional Temporal RBM (TRBM) and Recurrent TRBM (RTRBM) [27], and it also uses an idea similar to LSTM for better performance.
Fig. 4 shows the network structure of RNN-RBM. Like an RNN, RNN-RBM has a recurrent structure based on a Markov process for a given time-series sequence. Let an input sequence of length $T$ be $\mathbf{V} = \{\mathbf{v}^{(1)}, \ldots, \mathbf{v}^{(t)}, \ldots, \mathbf{v}^{(T)}\}$. $\mathbf{v}^{(t)}$ and $\mathbf{h}^{(t)}$ are the input and hidden neurons at the $t$-th RBM, respectively. The time-dependent parameters $\mathbf{b}^{(t)}$, $\mathbf{c}^{(t)}$, and $\mathbf{u}^{(t)}$ are calculated from the past parameters at time $t-1$ by the following equations, which follow the formulation in [26].
$\mathbf{b}^{(t)} = \mathbf{b} + \mathbf{W}_{uv}\,\mathbf{u}^{(t-1)}$ (1)
$\mathbf{c}^{(t)} = \mathbf{c} + \mathbf{W}_{uh}\,\mathbf{u}^{(t-1)}$ (2)
$\mathbf{u}^{(t)} = \sigma\left(\mathbf{u} + \mathbf{W}_{uu}\,\mathbf{u}^{(t-1)} + \mathbf{W}_{vu}\,\mathbf{v}^{(t)}\right)$ (3)
where $\sigma(\cdot)$ is an activation function. For example, the $\tanh$ function was used in [26]. $\mathbf{u}^{(t)}$ represents the time-series context for the given input. $\mathbf{u}^{(0)}$ is the initial state, and its values are given randomly. $\theta = \{\mathbf{b}, \mathbf{c}, \mathbf{W}, \mathbf{u}, \mathbf{W}_{uv}, \mathbf{W}_{uh}, \mathbf{W}_{vu}, \mathbf{W}_{uu}\}$ is the set of learning parameters, which are not time-dependent. At each time $t$, the RBM learning updates the biases $\mathbf{b}^{(t)}$ and $\mathbf{c}^{(t)}$ at time $t$ and the weights $\mathbf{W}$ between the visible and hidden layers. After the error has been calculated up to time $T$, the gradient for $\theta$ is updated by tracing back from time $T$ to time $1$ with BPTT [28, 29]. BPTT is the Back Propagation Through Time method, which is often used in traditional recurrent neural networks.
We developed Adaptive RNN-RBM by applying the neuron generation and annihilation algorithm to RNN-RBM. A suitable number of hidden neurons is sought by the neuron generation and annihilation algorithm for better representation power, in the same way as in the usual Adaptive RBM [22]. In addition, a hierarchical model with layer generation was developed as Adaptive RNN-DBN by using the same monitoring function as Adaptive DBN. In other words, the output signal of the hidden neurons at the $l$-th Adaptive RNN-RBM is used as the input signal of the $(l+1)$-th Adaptive RNN-RBM, as shown in Fig. 3. The proposed Adaptive RNN-DBN was applied to several time-series benchmark data sets, such as Nottingham (MIDI) and CMU (Motion Capture). In [22], the prediction accuracy of the proposed method was higher than that of the traditional methods.
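To make the recurrence concrete, the sketch below computes the time-dependent visible bias, hidden bias, and context state for a whole sequence, following the RNN-RBM formulation of [26] with $\tanh$ as the activation. The function name, argument names, and array shapes are assumptions for illustration; the RBM update at each step and the BPTT pass are omitted.

```python
import numpy as np

def rnnrbm_time_dependent_params(V, b, c, u_bias, W_uv, W_uh, W_vu, W_uu, u0):
    """Compute b(t), c(t), and u(t) for t = 1..T.

    V: (T, n_visible) input sequence.
    b: (n_visible,), c: (n_hidden,), u_bias: (n_u,) time-independent biases.
    W_uv: (n_visible, n_u), W_uh: (n_hidden, n_u),
    W_vu: (n_u, n_visible), W_uu: (n_u, n_u).
    u0: (n_u,) initial context state.
    """
    u_prev = u0
    b_seq, c_seq, u_seq = [], [], []
    for v_t in V:
        # Biases of the RBM at time t depend on the context at time t-1.
        b_seq.append(b + W_uv @ u_prev)
        c_seq.append(c + W_uh @ u_prev)
        # The context at time t summarizes the input seen so far.
        u_prev = np.tanh(u_bias + W_uu @ u_prev + W_vu @ v_t)
        u_seq.append(u_prev)
    return np.stack(b_seq), np.stack(c_seq), np.stack(u_seq)
```

Because only the biases change over time, each time step can still be trained as an ordinary RBM with weights shared across the sequence.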
IV Experimental Results
In this paper, the effectiveness of our proposed method was verified on the Moving MNIST benchmark data set [24]. We aim to reveal the capability of our proposed method in the video recognition research field, since video contains a rich source of visual information. When visual objects, faces, emotions, actions, or events must be detected and identified in real time, state-of-the-art video recognition software for comprehensive video content analysis can provide advanced solutions for AI-based visual content search. Therefore, the prediction performance is compared with that of the traditional LSTM in this paper.
IV-A Data Set: Moving MNIST
Moving MNIST [24] is a benchmark data set for video recognition. It contains 10,000 samples: 8,000 for training and 2,000 for testing. Each sample consists of 20 sequential gray-scale images of a $64 \times 64$ patch, in which two digits move. The digits were chosen randomly from the training set. The selected digits are placed initially at random locations inside the patch. Each digit was assigned a velocity with a direction and a magnitude, both of which were also chosen randomly. Fig. 5 shows two samples of 20 sequential images.
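The construction of one such sample can be sketched as follows. The sketch assumes square digit images (for example $28 \times 28$ arrays with values in $[0, 1]$), a hypothetical speed range, and digits that bounce off the patch border; these details are illustrative assumptions rather than the exact generator used for the data set.

```python
import numpy as np

def make_moving_sequence(digits, rng, n_frames=20, canvas=64):
    """Render digits moving inside a canvas x canvas patch.

    digits: list of square 2-D arrays with values in [0, 1].
    Returns an array of shape (n_frames, canvas, canvas).
    """
    size = digits[0].shape[0]
    limit = canvas - size                                # largest valid top-left coordinate
    pos = rng.uniform(0, limit, size=(len(digits), 2))   # random initial location
    angle = rng.uniform(0, 2 * np.pi, size=len(digits))  # random direction
    speed = rng.uniform(2, 5, size=len(digits))          # random magnitude (assumed range)
    vel = np.stack([speed * np.sin(angle), speed * np.cos(angle)], axis=1)

    frames = np.zeros((n_frames, canvas, canvas), dtype=np.float32)
    for t in range(n_frames):
        for k, digit in enumerate(digits):
            y, x = np.round(pos[k]).astype(int)
            patch = frames[t, y:y + size, x:x + size]
            frames[t, y:y + size, x:x + size] = np.maximum(patch, digit)
        pos += vel
        # Reflect position and velocity when a digit would leave the patch.
        for k in range(len(digits)):
            for axis in range(2):
                if pos[k, axis] < 0:
                    pos[k, axis] = -pos[k, axis]
                    vel[k, axis] = -vel[k, axis]
                elif pos[k, axis] > limit:
                    pos[k, axis] = 2 * limit - pos[k, axis]
                    vel[k, axis] = -vel[k, axis]
    return frames
```

With two digits and the default arguments, the result has 20 frames, of which the first 10 can serve as input and the last 10 as the prediction target.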
Moving MNIST has been used in various studies, such as video decomposition [30] and future prediction [24]. This paper investigates the effectiveness of our LSTM-based model in video recognition and compares its recognition capability as a future predictor.
Since no teacher signal is provided for each sample in Moving MNIST, the LSTM in [24] was evaluated by the cross entropy between the given sequence (ground truth) and the predicted sequence. Fig. 6 shows the composite LSTM model in [24]. In that method, the first 10 frames are used as the input sequence of the model, and the remaining 10 frames are used to evaluate future prediction, as shown in Fig. 5.
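The cross entropy between a ground truth sequence and a predicted sequence can be computed as in the sketch below. It assumes pixel values in $[0, 1]$ and sums over the pixels of each frame; the normalization actually used in [24] may differ, so the function is an illustration only.

```python
import numpy as np

def sequence_cross_entropy(target, predicted, eps=1e-7):
    """Per-frame binary cross entropy between two frame sequences.

    target, predicted: arrays of shape (n_frames, height, width) with values in [0, 1].
    Returns an array of shape (n_frames,).
    """
    p = np.clip(predicted, eps, 1.0 - eps)   # avoid log(0)
    ce = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    return ce.reshape(ce.shape[0], -1).sum(axis=1)
```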
Fig. 7 shows an abstract procedure of prediction in our method. The same evaluation method as in [24] was used. Since our method predicts the next frame for a given frame, each predicted frame is used as the next input, and then the squared error and the prediction accuracy between the ground truth images and the predicted images were evaluated.
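The procedure of feeding each predicted frame back as the next input can be sketched as follows. Here `predict_next` is a hypothetical callable standing in for the trained model, and the pixel-wise accuracy with a 0.5 threshold is an assumed definition, since the exact accuracy measure is not specified in the text.

```python
import numpy as np

def rollout(predict_next, context, n_future):
    """Predict n_future frames, reusing each prediction as the next input.

    predict_next: callable taking the list of frames so far and returning the next frame.
    context: array of shape (n_context, height, width) of observed frames.
    """
    history = [frame for frame in context]
    predictions = []
    for _ in range(n_future):
        next_frame = predict_next(history)
        predictions.append(next_frame)
        history.append(next_frame)      # the prediction becomes the next input
    return np.stack(predictions)

def squared_error(target, predicted):
    """Per-frame sum of squared pixel differences."""
    diff = target - predicted
    return (diff ** 2).reshape(diff.shape[0], -1).sum(axis=1)

def pixel_accuracy(target, predicted, threshold=0.5):
    """Per-frame fraction of pixels whose binarized values agree (assumed measure)."""
    match = (target > threshold) == (predicted > threshold)
    return match.reshape(match.shape[0], -1).mean(axis=1)
```

For the setting in this paper, `context` would hold the first 10 frames and `n_future` would be 10, so that the metrics are computed for the 11th to the 20th frames.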
IV-B Experimental Results
Table I and Table II show the prediction results on Moving MNIST for training and test. The values in Table I are the cross entropy between the ground truth images and the predicted images at the last sequence. The result of the LSTM is cited from [24]. In Table II, the left value in each column is the squared error and the value in brackets is the prediction accuracy.
For the LSTM, the result for training and the prediction accuracy for test are not reported. For the Adaptive RNN-DBN, three settings of the learning rate (lr) were used for evaluation. In training, the cross entropy and the squared error of the Adaptive RNN-DBN were small (16.1 to 18.5 in Table I and 10.3 to 14.5 in Table II), and the prediction accuracy reached about 99%. On the test data, all three settings of the Adaptive RNN-DBN showed a smaller prediction error than the LSTM (134.0 to 165.3 versus 341.2 in Table I). The best performance of the Adaptive RNN-DBN was obtained when lr was 0.050.
We also investigated the intermediate results in the predicted images of the Adaptive RNN-DBN. Table III shows the future prediction accuracy for each predicted frame. The prediction accuracy decreased gradually from the 11th to the 20th frame, for example from 99.5% to 92.5% when lr was 0.050.
Table I: Cross entropy between the ground truth images and the predicted images on Moving MNIST ("-": not reported)
Model                          Training  Test
LSTM [24]                      -         341.2
Adaptive RNN-DBN (lr = 0.010)  18.5      165.3
Adaptive RNN-DBN (lr = 0.050)  16.1      134.0
Adaptive RNN-DBN (lr = 0.001)  17.0      140.6

Table II: Squared error (prediction accuracy in brackets) on Moving MNIST
Model                          Training       Test
Adaptive RNN-DBN (lr = 0.010)  11.4 (99.0%)   119.9 (91.8%)
Adaptive RNN-DBN (lr = 0.050)  10.3 (99.4%)   100.5 (92.5%)
Adaptive RNN-DBN (lr = 0.001)  14.5 (98.9%)   140.2 (89.4%)

Table III: Prediction accuracy for each predicted frame (columns are frame numbers in the sequence)
Model                          11     12     13     14     15     16     17     18     19     20
Adaptive RNN-DBN (lr = 0.010)  98.2%  97.9%  97.0%  95.2%  94.5%  93.9%  93.3%  92.8%  92.0%  91.8%
Adaptive RNN-DBN (lr = 0.050)  99.5%  98.0%  97.9%  96.5%  96.4%  94.5%  93.1%  92.9%  92.8%  92.5%
Adaptive RNN-DBN (lr = 0.001)  96.8%  96.8%  96.4%  94.8%  93.6%  92.9%  91.8%  91.0%  90.0%  89.4%
V Conclusion
Deep learning is widely used in various research fields, especially image recognition. In our research, Adaptive DBN, which can find the optimal network structure for given data, was developed. The method shows higher classification accuracy than existing deep learning methods on several benchmark data sets. In this paper, our proposed prediction method was applied to Moving MNIST, a benchmark data set for video recognition. Compared with the LSTM model, our method showed higher prediction performance (92.5% prediction accuracy on the test data in the best setting). The proposed method will be further improved for better prediction capability by evaluating it on other large video data, such as video streaming and defect detection in time-series video data.
Acknowledgment
This work was supported by JSPS KAKENHI Grant Numbers 19K12142 and 19K24365, and was obtained from commissioned research by the National Institute of Information and Communications Technology (NICT, 21405), Japan.
References
 [1] Markets and Markets, http://www.marketsandmarkets.com/MarketReports/deeplearningmarket107369271.html (accessed 28 November 2018) (2016)
 [2] Y.Bengio, Learning Deep Architectures for AI, Foundations and Trends in Machine Learning, vol.2, no.1, pp.1–127 (2009)
 [3] Q.V.Le, M.Ranzato, et al., Building high-level features using large scale unsupervised learning, Proc. of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, pp.8595–8598 (2013)
 [4] A.Krizhevsky, I.Sutskever, G.E.Hinton, ImageNet Classification with Deep Convolutional Neural Networks, Proc. of Advances in Neural Information Processing Systems 25 (NIPS 2012) (2012)
 [5] C.Szegedy, W.Liu, Y.Jia, P.Sermanet, S.Reed, D.Anguelov, D.Erhan, V.Vanhoucke, A.Rabinovich, Going Deeper with Convolutions, Proc. of CVPR2015 (2015)
 [6] K.Simonyan, A.Zisserman, Very deep convolutional networks for large-scale image recognition, Proc. of International Conference on Learning Representations (ICLR 2015) (2015)
 [7] K.He, X.Zhang, S.Ren, J.Sun, Deep residual learning for image recognition, Proc. of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.770–778 (2016)
 [8] O.Russakovsky, J.Deng, H.Su, J.Krause, S.Satheesh, S.Ma, Z.Huang, A.Karpathy, A.Khosla, M.Bernstein, et al., Imagenet large scale visual recognition challenge, International Journal of Computer Vision, vol.115, no.3, pp.211–252 (2015)
 [9] M.Mohammadi, A.Al-Fuqaha, S.Sorour, and M.Guizani, Deep Learning for IoT Big Data and Streaming Analytics: A Survey, IEEE Communications Surveys & Tutorials (2018)
 [10] H.Zhang, Y.Zhang, B.Zhong, et al., A Comprehensive Survey of Vision-Based Human Action Recognition Methods, Sensors, vol.19, no.5, pp.1–20 (2019)
 [11] Y.Bengio, P.Simard, and P.Frasconi, Learning long-term dependencies with gradient descent is difficult, IEEE Transactions on Neural Networks, vol.5, no.2, pp.157–166 (1994)
 [12] Z.C.Lipton, D.C.Kale, C.Elkan, and R.Wetzell, Learning to Diagnose with LSTM Recurrent Neural Networks, Proc. of International Conference on Learning Representations (ICLR 2016), pp.1–18 (2016)
 [13] X.Shi, Z.Chen, H.Wang, D.-Y.Yeung, W.-K.Wong, W.-C.Woo, Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting, Proc. of Advances in Neural Information Processing Systems 28 (NIPS 2015), pp.802–810 (2015)
 [14] S.Kamada, T.Ichimura, A.Hara, and K.J.Mackin, Adaptive Structure Learning Method of Deep Belief Network using Neuron Generation-Annihilation and Layer Generation, Neural Computing and Applications, pp.1–15 (2018)
 [15] S.Kamada and T.Ichimura, An Adaptive Learning Method of Restricted Boltzmann Machine by Neuron Generation and Annihilation Algorithm, Proc. of 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC2016), pp.1273–1278 (2016)
 [16] S.Kamada and T.Ichimura, A Structural Learning Method of Restricted Boltzmann Machine by Neuron Generation and Annihilation Algorithm, Neural Information Processing, Proc. of the 23rd International Conference on Neural Information Processing (Springer LNCS 9950), pp.372–380 (2016)
 [17] G.E.Hinton, A Practical Guide to Training Restricted Boltzmann Machines, Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science (LNCS, vol.7700), pp.599–619 (2012)
 [18] S.Kamada and T.Ichimura, An Adaptive Learning Method of Deep Belief Network by Layer Generation Algorithm, Proc. of IEEE TENCON2016, pp.2971–2974 (2016)
 [19] G.E.Hinton, S.Osindero, and Y.Teh, A fast learning algorithm for deep belief nets, Neural Computation, vol.18, no.7, pp.1527–1554 (2006)
 [20] Y.LeCun, L.Bottou, Y.Bengio, and P.Haffner, Gradient-based learning applied to document recognition, Proc. of the IEEE, vol.86, no.11, pp.2278–2324 (1998)
 [21] A.Krizhevsky, Learning Multiple Layers of Features from Tiny Images, Master's thesis, University of Toronto (2009)
 [22] T.Ichimura and S.Kamada, Adaptive Learning Method of Recurrent Temporal Deep Belief Network to Analyze Time Series Data, Proc. of the 2017 International Joint Conference on Neural Networks (IJCNN 2017), pp.2346–2353 (2017)
 [23] http://www.cs.toronto.edu/~nitish/unsupervised_video/ (accessed 23 July 2019)
 [24] N.Srivastava, E.Mansimov, R.Salakhutdinov, Unsupervised learning of video representations using LSTMs, Proc. of the 32nd International Conference on Machine Learning (ICML 2015), vol.37, pp.843–852 (2015)
 [25] T.Ichimura, E.Tazaki, and K.Yoshida, Extraction of fuzzy rules using neural networks with structure level adaptation verification to the diagnosis of hepatobiliary disorders, International Journal of Biomedical Computing, vol.40, no.2, pp.139–146 (1995)
 [26] N.Boulanger-Lewandowski, Y.Bengio, and P.Vincent, Modeling Temporal Dependencies in High-Dimensional Sequences: Application to Polyphonic Music Generation and Transcription, Proc. of the 29th International Conference on Machine Learning (ICML 2012), pp.1159–1166 (2012)
 [27] I.Sutskever, G.E.Hinton, and G.W.Taylor, The Recurrent Temporal Restricted Boltzmann Machine, Proc. of Advances in Neural Information Processing Systems 21 (NIPS 2008) (2008)
 [28] J.Elman, Finding structure in time, Cognitive Science, vol.14, no.2 (1990)
 [29] M.Jordan, Serial order: A parallel distributed processing approach, Tech. Rep. No. 8604, San Diego: University of California, Institute for Cognitive Science (1986)
 [30] J.Hsieh, B.Liu, D.Huang, L.Fei-Fei, J.C.Niebles, Learning to Decompose and Disentangle Representations for Video Prediction, Proc. of Advances in Neural Information Processing Systems 31 (NIPS 2018) (2018)