1 Introduction
Currently, Recurrent Neural Networks (RNNs) are used to generate a sequence from an input sequence. A sequence generation model based on an RNN is generally called a sequence-to-sequence RNN or RNN Encoder-Decoder (RED) (sutskever2014sequence, ; cho2014learning, ). In this paper, our purpose is to find the model that learns the same data most effectively. We assume the RED will be used for anomaly detection, as in previous research (chauhan2015anomaly, ; malhotra2015long, ; nanduri2016anomaly, ). In that research, the models were trained to predict the future sequence from the current sequence. We wondered whether a future prediction model is really the best choice for anomaly detection, because we think restoring the current sequence is easier than predicting the future.
We compare three RED models. The first is the future prediction model. The second is the future prediction model modified to restore the current sequence. The last is an intermediate model between the previous two. The details of each model are described in Section 2. In Section 3, we present our dataset and experimental results, and we conclude in the final section.
2 RNN Encoder-Decoder
The RNN Encoder-Decoder (RED) is similar to the AutoEncoder (AE) in that it generates the output from the inputs. The RED is generally used to predict the next sequence. For example, Sucheta Chauhan et al. used a RED for anomaly detection in Electrocardiography (ECG) (chauhan2015anomaly, ). Their RED was trained to predict the next (future) sequence from the input sequence. The ECG consists of a few waveforms that repeat, and each waveform has a one-to-one correspondence in the ECG: only one specific wave can appear after a given wave. So there is no problem in training or predicting with the previous RED. However, if several different waves can directly follow the same wave (a one-to-many correspondence), learning becomes confusing. The RED derives the output causally, as shown in Equations 1 to 5. When the model receives an input sequence that starts with such a wave, the RED becomes confused about which output to select among the various options, because the first vector of the output sequence is determined only by the first vector of the input sequence, without any prior information. We experimented with this attribute of the RED.
h_1 = f(x_1, h_0)   (1)
y_1 = g(h_1)   (2)
h_t = f(x_t, h_{t-1})   (3)
y_t = g(h_t)   (4)
p(y_1, \dots, y_T \mid x_1, \dots, x_T) = \prod_{t=1}^{T} p(y_t \mid x_1, \dots, x_t)   (5)
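To make the causal structure of Equations 1 to 5 concrete, the following is a minimal sketch of such a sequence-to-sequence RNN in Python with PyTorch. The LSTM cell, layer sizes, and variable names are our own assumptions for illustration; the sketch is not the exact architecture used in the experiments.

```python
import torch
import torch.nn as nn

class CausalRED(nn.Module):
    """Minimal sketch of a causal sequence-to-sequence RNN (Equations 1-5).

    At each step the recurrent state is updated from the current input
    vector (Equations 1 and 3) and the output vector is read out from that
    state (Equations 2 and 4).  The first output therefore depends only on
    the first input, because the initial state carries no prior information.
    Sizes are illustrative assumptions, not the paper's configuration.
    """

    def __init__(self, dim_in=15, dim_hidden=64, dim_out=15):
        super().__init__()
        self.rnn = nn.LSTM(dim_in, dim_hidden, batch_first=True)
        self.readout = nn.Linear(dim_hidden, dim_out)

    def forward(self, x):
        # x: (batch, T, dim_in); the initial recurrent state defaults to zeros.
        states, _ = self.rnn(x)        # h_t for every step
        return self.readout(states)    # y_t = g(h_t) for every step
```

Training such a model by maximum likelihood over the per-step outputs corresponds to the factorized probability in Equation 5.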
2.1 Models for comparison
In the RED, each RNN cell can be chosen from among the vanilla RNN (mikolov2010recurrent, ), Long Short-Term Memory (LSTM) (hochreiter1997long, ), and the Gated Recurrent Unit (GRU) (chung2015gated, ). The vanilla RNN suffers from the vanishing gradient problem when the sequence becomes long, whereas LSTM and GRU solve it.
We construct REDs using LSTM cells, as shown in Figure 1, because Chung et al. have already confirmed that neither LSTM nor GRU is clearly superior (chung2014empirical, ). The relationship between the input and output sequences can be adjusted for the purpose at hand. In previous research, the input was the current sequence and the output was the future sequence; in that case, the RED learned to generate (predict) the future sequence from the current sequence.
p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}   (6)
In Equation 6, X and Y are the input and output sequences, respectively. The term p(Y|X) is the uncertainty about the output, p(X|Y) is the likelihood, and p(Y) is the prior knowledge. The RED is trained by maximum likelihood estimation. If X and Y are almost the same, the likelihood is easier to learn, and vice versa, because it is easier to obtain Y from Y itself than to derive Y from the information in X.
We construct three models for the experiment. Model-A is the future prediction model, the same as the model in previous research (chauhan2015anomaly, ; malhotra2015long, ; nanduri2016anomaly, ). Model-C is a restoration model that generates an output sequence identical to the input sequence. Model-B is an intermediate model between Model-A and Model-C. We assume that Model-C will be easier to train when the dataset has the one-to-many correspondence mentioned earlier.
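To make the three input-output relationships concrete, the sketch below builds (input, target) windows from one long sequence. The non-overlapping windowing and the half-window shift used for Model-B are our reading of the examples in Section 3, not necessarily the authors' exact implementation.

```python
import numpy as np

def make_pairs(sequence, seq_len, model="C"):
    """Slice one long (T, dim) array into (input, target) windows.

    Model-A: the target is the window that immediately follows the input
             (future prediction).
    Model-B: the target is the input window shifted by half the window
             length (intermediate case; an assumption based on Section 3).
    Model-C: the target equals the input window (restoration).
    """
    shift = {"A": seq_len, "B": seq_len // 2, "C": 0}[model]
    inputs, targets = [], []
    for start in range(0, len(sequence) - seq_len - shift + 1, seq_len):
        inputs.append(sequence[start:start + seq_len])
        targets.append(sequence[start + shift:start + shift + seq_len])
    return np.stack(inputs), np.stack(targets)
```

For example, with a window length of 3 on the repeating pattern ABCDE, Model-A pairs ABC with DEA, Model-B pairs ABC with BCD, and Model-C pairs ABC with ABC, matching the rows of Table 6.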
3 Experiment
In this section, we experimentally compare the three RED models. We constructed a simple sequential dataset for the experiment, described in Section 3.1.
3.1 Dataset
We constructed the dataset with 15 simple vectors. We named each vector from A (Alpha) to O (Oscar) following the North Atlantic Treaty Organization (NATO) phonetic alphabet and converted each to a one-hot encoded vector, as shown in Equation 7.
A = (1, 0, 0, \dots, 0),\; B = (0, 1, 0, \dots, 0),\; \dots,\; O = (0, 0, 0, \dots, 1)   (7)
We constructed three datasets from the 15 vectors shown in Equation 7. One of the datasets has a pattern of 5 vectors, from A (Alpha) to E (Echo), and the other two have patterns of 15 vectors. The pattern of each dataset is shown in Table 1; each pattern is repeated up to a length of 3000. A notable point among the three datasets in Table 1 is the Alpha symbol of SetC. The symbol A (Alpha) appears in front of B (Bravo), D (Delta), F (Foxtrot), H (Hotel), and J (Juliett). That makes the pattern confusing for the RED to learn, because the RED cannot generate the output correctly when the sequence starts with Alpha.
Dataset  Pattern  Length 

SetA  ABCDE  5 
SetB  ABCDEFGHIJKLMNO  15 
SetC  ABCADEAFGAHIAJK  15 
Each dataset is subdivided into four subsets. One subset consists of the same pattern as Table 1, and another adds random noise to that original pattern. The dataset also contains new patterns that do not appear in Table 1, with and without added noise. We call the subsets 'Clear', 'Noise', 'Abnormal', and 'Abnoise' (abnormal with noise). The subsets 'Clear' and 'Noise' form the normal class, and the others form the abnormal class. We use only one normal-class subset for training; the test process uses all subsets.
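The sketch below shows one way to build these datasets and the 'Clear' and 'Noise' subsets; the Gaussian noise and its magnitude are assumptions, since the exact noise parameters are not stated here. The 'Abnormal' and 'Abnoise' subsets would be built the same way from patterns that do not appear in Table 1.

```python
import numpy as np

# A (Alpha) ... O (Oscar) as one-hot vectors (Equation 7).
SYMBOLS = list("ABCDEFGHIJKLMNO")
ONE_HOT = {s: np.eye(len(SYMBOLS))[i] for i, s in enumerate(SYMBOLS)}

# Patterns of Table 1.
PATTERNS = {"SetA": "ABCDE",
            "SetB": "ABCDEFGHIJKLMNO",
            "SetC": "ABCADEAFGAHIAJK"}

def build_subset(pattern, total_len=3000, noise_std=0.0, rng=None):
    """Repeat a pattern up to total_len symbols and one-hot encode it.

    noise_std > 0 yields the 'Noise' subset by adding Gaussian noise to
    the one-hot vectors (the noise type and magnitude are assumptions).
    """
    if rng is None:
        rng = np.random.default_rng(0)
    symbols = (pattern * (total_len // len(pattern) + 1))[:total_len]
    data = np.stack([ONE_HOT[s] for s in symbols])
    if noise_std > 0:
        data = data + rng.normal(0.0, noise_std, size=data.shape)
    return data.astype(np.float32)

# Example: the 'Clear' and 'Noise' subsets of SetC.
clear_c = build_subset(PATTERNS["SetC"])
noise_c = build_subset(PATTERNS["SetC"], noise_std=0.1)
```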
3.2 Comparison
We compared the three models on the three datasets, training each model with the subsets 'Clear' and 'Noise', respectively. We define a case as synchronous when the sequence length equals the pattern length or an integer multiple of it, and as asynchronous otherwise. We experimented with three synchronous cases and three asynchronous cases. Loss graphs of the training process are shown in Appendix A. We first evaluated the three models on the training set to confirm that they learned it well. Tables 2 and 3 show the results of this evaluation; a lower loss means the model generates the output more correctly. Model-C has a lower loss than the other models on every training set.
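A minimal training and evaluation routine for this comparison could look like the sketch below. The optimizer, learning rate, number of epochs, and mean-squared-error loss are assumptions rather than the paper's settings; the three models share the routine and differ only in how their target windows are built.

```python
import torch
import torch.nn as nn

def train_and_evaluate(model, inputs, targets, epochs=100, lr=1e-3):
    """Train a RED on (input, target) windows and return its average loss.

    The loss values reported in Tables 2-5 are averages of this kind over
    the windows of a subset; the hyper-parameters here are assumptions.
    """
    x = torch.as_tensor(inputs)    # (N, seq_len, 15)
    y = torch.as_tensor(targets)   # (N, seq_len, 15)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()       # the loss type is an assumption
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    with torch.no_grad():
        return criterion(model(x), y).item()
```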
Dataset (Subset)  Model-A  Model-B  Model-C 

SetA (Clear)  3.291  3.440  2.484 
SetB (Clear)  2.680  2.974  2.377 
SetC (Clear)  2.851  3.019  2.672 
Dataset (Subset)  Model-A  Model-B  Model-C 

SetA (Noise)  5.253  5.380  4.771 
SetB (Noise)  4.615  4.888  4.378 
SetC (Noise)  4.781  4.844  4.499 
We also evaluated the three models on the test set, as shown in Tables 4 and 5. The tables present the average loss for each class and model; the loss was measured separately on the normal and abnormal parts of each test set. As with the training set, a lower loss is better for the normal set, and the opposite holds for the abnormal set.
Dataset  Class  Model-A  Model-B  Model-C 

SetA  Normal  4.046  4.184  3.339 
Abnormal  5.180  5.238  4.484  
SetB  Normal  3.482  3.698  3.170 
Abnormal  4.351  4.554  3.955  
SetC  Normal  3.534  3.704  3.392 
Abnormal  4.929  4.905  4.116 
Dataset  Class  Model-A  Model-B  Model-C 

SetA  Normal  4.012  4.198  3.452 
Abnormal  5.169  5.353  4.571  
SetB  Normal  3.544  3.825  3.285 
Abnormal  4.365  4.603  3.285  
SetC  Normal  3.691  3.752  3.392 
Abnormal  4.961  4.874  4.103 
Since it is difficult to confirm superiority from the loss values alone, we also inspected the generated sequences. We present part of the test cases in Tables 6 and 7 and provide the full test results in our GitHub repository: https://github.com/YeongHyeon/Compare_REDs
Model  Subset  Input  Output  Ground Truth  Loss 

Model-A  Clear  ABC  DEA  DEA  0.244 
Noise  DEA  BCD  BCD  1.480  
Abnormal  BDE  EAB  ACC  2.106  
Abnoise  BBC  EAA  DEB  1.926  
Model-B  Clear  ABC  BCD  BCD  0.244 
Noise  DEA  EAB  EAB  1.104  
Abnormal  BDE  CDA  DEA  1.794  
Abnoise  BBC  CDE  BCD  2.283  
Model-C  Clear  ABC  ABC  ABC  0.244 
Noise  DEA  DEA  DEA  1.261  
Abnormal  BDE  BCD  BDE  1.707  
Abnoise  BBC  BCC  BBC  1.747 
Model  Subset  Input  Output  Ground Truth  Loss 

Model-A  Clear  ABCADEAF  EAHIAJKA  GAHIAJKA  0.913 
Noise  AFGAHIAJ  DABCADEA  KABCADEA  2.169  
Abnormal  JKBBBADE  AFGAHIAJ  AFGAHIAJ  0.243  
Abnoise  AKKKBCAD  DAFGAHIA  EAFGAHIK  2.468  
Model-B  Clear  ABCADEAF  BEAFGAHI  DEAFGHAI  0.917 
Noise  AFGAHIAJ  JIAJKABC  HIAJKAVC  2.100  
Abnormal  JKBBBADE  CADEAAGA  BADEAFGA  1.944  
Abnoise  AKKKBCAD  JAAAEAJA  BCADEAFG  3.119  
Model-C  Clear  ABCADEAF  ABCADEAF  ABCADEAF  0.241 
Noise  AFGAHIAJ  AFGAHIAJ  AFGAHIAJ  1.673  
Abnormal  JKBBBADE  JKABBADE  JKBBBADE  1.268  
Abnoise  AKKKBCAD  AKAKBCAD  AKKKBCAD  2.818 
Each model generates the output sequence correctly when it receives a normal sequence, but not when it receives an abnormal sequence, as shown in Table 6. We use this property for anomaly detection. However, as Table 7 confirms, Model-A did not correctly predict the next sequence on SetC regardless of whether the input was normal or abnormal. Model-A may therefore achieve high recall for anomaly detection, but only because it tends to judge every state as abnormal. Model-C correctly restores the input sequence as the output when it receives a normal sequence.
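This property can be turned into a simple decision rule: compute the loss of the model on an incoming window and flag the window as abnormal when the loss exceeds a threshold estimated from normal data. The sketch below uses a mean-plus-k-standard-deviations threshold; this thresholding scheme and the mean-squared-error loss are common heuristics that we assume here, not the authors' stated procedure.

```python
import numpy as np
import torch
import torch.nn as nn

def window_losses(model, inputs, targets):
    """Average loss per window between generated and expected sequences.

    For Model-C the targets equal the inputs (restoration); for Model-A
    and Model-B they are the shifted windows.
    """
    criterion = nn.MSELoss(reduction="none")   # the loss type is an assumption
    with torch.no_grad():
        pred = model(torch.as_tensor(inputs))
        losses = criterion(pred, torch.as_tensor(targets)).mean(dim=(1, 2))
    return losses.numpy()

def detect_anomalies(model, normal_inputs, normal_targets,
                     test_inputs, test_targets, k=3.0):
    """Flag windows whose loss exceeds mean + k*std of the normal losses
    (a common heuristic, assumed here)."""
    ref = window_losses(model, normal_inputs, normal_targets)
    threshold = ref.mean() + k * ref.std()
    return window_losses(model, test_inputs, test_targets) > threshold
```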
Model  Subset  Input  Output  Ground Truth  Loss 

Model-A  Clear  ABCADEAF  CAHIAJKA  GAHIAJKA  1.986 
Noise  AFGAHIAJ  CABCADEA  KABCADEA  0.858  
Abnormal  JKBBBADE  AFGAHIAJ  AFGAHIAJ  1.794  
Abnoise  AKKKBCAD  CFFGAHIA  EAFGAHIK  2.006  
Model-B  Clear  ABCADEAF  JEAFGAHI  DEAFGHAI  2.018 
Noise  AFGAHIAJ  FIAJKABC  HIAJKAVC  0.900  
Abnormal  JKBBBADE  CADEAFGA  BADEAFGA  2.412  
Abnoise  AKKKBCAD  FAAADAFG  BCADEAFG  2.263  
Model-C  Clear  ABCADEAF  ABCADEAF  ABCADEAF  1.996 
Noise  AFGAHIAJ  AFGAHIAJ  AFGAHIAJ  0.283  
Abnormal  JKBBBADE  JKABBIDE  JKBBBADE  2.699  
Abnoise  AKKKBCAD  AKKKACAD  AKKKBCAD  1.213 
Table 8 shows part of the test results when each model is trained with the subset 'Noise'. Unlike in Table 7, Model-A and Model-B cannot predict the next sequence correctly even for the subset 'Clear', whereas Model-C still restores it. Model-A is built to focus on predicting future sequences rather than the key features of the current sequence. Since Model-C is built to restore the current sequence, it can concentrate on learning the key features of a given pattern, so it remains robust whether or not noise is added to the sequence.
4 Conclusion
We experimentally compared three RED models, assuming the RED is used for anomaly detection. The RED generally generates or predicts the next sequence, but we wondered whether predicting the future is the best approach for anomaly detection. There is no fundamental difference between predicting the future sequence and then judging anomalies, and restoring the current sequence and judging its state immediately; the only difference is the difficulty of the training process.
The model selection and the setting of the sequence length are not important when the sequential data has the one-to-one correspondence property. But that case is highly idealized and rarely occurs. In reality, it is more likely that various patterns appear after a specific pattern, as in SetC. A future prediction model may produce the correct result for an irregular pattern like SetC by accident, but it is usually difficult.
As presented in Section 3.2, Model-C learned the same data more easily and effectively than the other models. We are certain that restoring the current sequence, as Model-C does, is better for anomaly detection than predicting the future sequence, as Model-A and Model-B do. A smoother training process means the model can perform better, and we confirmed this experimentally.
When Model-A is used for anomaly detection, it will face the problem of misjudging the normal state as abnormal. Model-C does not make correct judgments in every case, but it will at least perform much better than the other models. We also expect that Model-C can be applied to other types of time-series generation problems and perform well.
References
(1) I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in: Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
(2) K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, Y. Bengio, Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078.
(3) S. Chauhan, L. Vig, Anomaly detection in ECG time signals via deep long short-term memory networks, in: IEEE International Conference on Data Science and Advanced Analytics (DSAA), 2015, pp. 1–7.
(4) P. Malhotra, L. Vig, G. Shroff, P. Agarwal, Long short term memory networks for anomaly detection in time series, in: Proceedings, Presses universitaires de Louvain, 2015, p. 89.
(5) A. Nanduri, L. Sherry, Anomaly detection in aircraft data using Recurrent Neural Networks (RNN), in: Integrated Communications Navigation and Surveillance (ICNS), IEEE, 2016, pp. 5C2-1.
(6) T. Mikolov, M. Karafiát, L. Burget, J. Černocký, S. Khudanpur, Recurrent neural network based language model, in: Eleventh Annual Conference of the International Speech Communication Association, 2010.
(7) S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780.
(8) J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Gated feedback recurrent neural networks, arXiv preprint arXiv:1502.02367.
(9) J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint arXiv:1412.3555.
Appendix A
Appendix A contains the loss graphs of the training process. Each training run was performed with the three models and the three datasets, respectively. The left side of each figure presents the synchronous case and the right side the asynchronous case. The first three figures show the results of training with the subset 'Clear'; the last three figures show the results of training with the subset 'Noise'.
Figure: Loss graph of the training process with the subset 'Clear' of SetA. The loss value of each model converges at almost the same moment for each sequence length.