I Introduction
Forecasting is an essential but challenging part of time series data analysis. The type of time series data, along with the underlying context, are the dominant factors affecting the performance and accuracy of the analysis and forecasting techniques employed. Other domain-dependent factors, such as seasonality, economic shocks, unexpected events, and internal changes to the organization generating the data, also affect the prediction.
The conventional time series analysis techniques often utilize 1) linear regression for model fitting, and then 2) moving averages for prediction. The de facto standard of such techniques is the "AutoRegressive Integrated Moving Average," also known as ARIMA. This linear regression-based approach has evolved over the years, and many variations of the model have been developed, such as SARIMA (Seasonal ARIMA) and ARIMAX (ARIMA with Explanatory Variables). These models perform reasonably well for short-term forecasts (i.e., the next lag), but their performance deteriorates severely for long-term predictions.
Machine learning and, more notably, deep learning-based approaches are emerging techniques in AI-based data analysis. These learning-based approaches take analytical processes to another level, in which the models built are data-driven rather than model-driven. The best learning model can be trained with respect to the underlying application domain. For instance, convolutional neural networks (CNNs) are suitable for problems such as image recognition, whereas recurrent neural networks (RNNs) are better suited to modeling problems such as time series analysis.
There are several variations of RNN-based models, which differ mainly in their capability of remembering input data. In general, a vanilla RNN does not have the capability of remembering past data; in deep learning terminology, such models are feed-forward learning mechanisms. A special type of RNN model is the Long Short-Term Memory (LSTM) network, through which the relationships between longer input and output data are modeled. These models, called feedback-based models, are capable of learning from past data: several gates are employed in their network architecture in order to remember past data and thus build the prospective model with respect to past and current data. In these models, however, the input data are traversed only once (i.e., from left (input) to right (output)).
It has been reported that deep learning-based models outperform conventional ARIMA-based models in forecasting time series, in particular for long-term prediction problems [20]. Even though the performance of LSTM has been shown to be superior to ARIMA, an interesting question is whether its performance can be further improved by incorporating additional layers of training data into the LSTM.
To investigate whether incorporating additional layers of training into the architecture of an LSTM improves its prediction, this paper explores the performance of the Bidirectional LSTM (BiLSTM). In a BiLSTM model, the given input data are utilized twice for training (i.e., first from left to right, and then from right to left). In particular, we would like to perform a behavioral analysis of these two architectures while training their models. To do so, this paper reports the results of an experiment in which the performance and behavior of these two RNN-based architectures are compared. In particular, we are interested in addressing the following research questions:

Is the prediction improved when the time series data are learned from both directions (i.e., past-to-future and future-to-past)?

How differently do these two architectures (LSTM and BiLSTM) treat input data?

How fast do these two architectures reach equilibrium?
To address these questions, this paper conducts a series of experiments and reports the results. In particular, this paper makes the following key contributions:

Investigate whether additional layers of training improve prediction in the financial time series context.

Provide a performance analysis comparing the prediction accuracy of uni-LSTM and its extension, BiLSTM. The analysis shows that BiLSTM models outperform LSTMs, with a 37.78% reduction in error rates on average (Table II).

Conduct a behavioral analysis of the learning processes involved in training the LSTM- and BiLSTM-based models. According to the results, BiLSTMs train their models differently than LSTMs, fetching smaller batches of data for training. It was also observed that BiLSTM models reach equilibrium more slowly than uni-LSTMs.
This paper is structured as follows: Section II reviews the related works. The essential background and mathematical formulations are given in Section III. The procedure for experimental setup, data collection, and preparation is presented in Section IV. Section V presents the pseudocode of the developed algorithms. The results of the experiments are reported through Section VI. Section VII discusses the performance of the algorithms while factors are controlled. The conclusion of the paper and the possible future research directions are provided in Section VIII.
II Related Works
Traditional approaches to time series analysis and forecasting are primarily based on the Autoregressive Integrated Moving Average (ARIMA) model and its many variations, such as Seasonal ARIMA (SARIMA) and ARIMA with Explanatory Variables (ARIMAX) [6]. These techniques have long been used in modeling time series problems [17, 2, 1]. While these moving-average-based approaches perform reasonably well, they also suffer from some limitations [8]:

Since these models are regression-based approaches, they can hardly model data with nonlinear relationships between parameters.

Some assumptions about the data (e.g., constant standard deviation) must hold when conducting statistical tests in order to obtain a meaningful model.

They are less accurate for long-term predictions.
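As a concrete illustration of the first limitation, the sketch below fits a purely linear autoregressive model of order p by ordinary least squares: each prediction is a fixed linear combination of lagged values, so nonlinear dependence between parameters cannot be captured. The function names and the least-squares formulation are our own illustrative choices, not taken from the works cited above.

```python
import numpy as np

def fit_ar(series, p):
    """Fit an AR(p) model y_t = c + a_1*y_{t-1} + ... + a_p*y_{t-p}
    by ordinary least squares. Returns [c, a_1, ..., a_p].
    The model is linear by construction, which is exactly the
    limitation noted above."""
    n = len(series)
    # Design matrix: a column of ones plus the p lagged columns.
    X = np.column_stack(
        [np.ones(n - p)]
        + [series[p - k - 1 : n - k - 1] for k in range(p)]
    )
    y = series[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def forecast_next(series, coef):
    """One-step-ahead forecast from the fitted coefficients."""
    p = len(coef) - 1
    lags = series[-1 : -p - 1 : -1]  # most recent value first
    return coef[0] + coef[1:] @ lags
```

For a noiseless AR(1) series the fit recovers the generating coefficient exactly, but no linear coefficient vector can reproduce, say, a quadratic dependence on the lags.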
Machine and deep learning-based approaches have introduced a new avenue for analyzing time series data. Krauss et al. [19] used various forms of forecasting models, such as deep learning, gradient-boosted trees, and random forests, to model S&P 500 constituents. Krauss et al. also reported that training neural networks, and consequently deep learning-based algorithms, was very difficult. Lee and Yoo [21] introduced an RNN-based approach to predict stock returns, in which portfolios are built by adjusting the threshold levels of returns through the internal layers of the RNN. A similar work was performed by Fischer et al. [9] for financial data prediction.
The most similar papers, in which the performance of LSTM and its bidirectional variation is compared, are [18, 7]. Kim and Moon [18] report that a Bidirectional Long Short-Term Memory model based on multivariate time series data outperforms the unidirectional LSTM. Cui et al. [7] proposed stacking bidirectional and unidirectional LSTM networks for predicting network-wide traffic speed, and report that the stacked architecture outperforms both BiLSTMs and uni-LSTMs.
This article is based on the authors' previous research work, where the performance of ARIMA-based models was compared with that of LSTM-based models in the context of predicting economic and financial time series and parameter tuning [20], [26]. This paper takes an additional step by comparing the performance of three time series modeling standards: ARIMA, LSTM, and BiLSTM. While traditional prediction problems (such as building a scheduler [27] and predicting vulnerabilities in software systems [22]) can benefit largely from bidirectional training, it is unclear whether learning time series data, and in particular financial and economic data, from both directions is beneficial. This paper explores this research problem.
III Background
III-A Recurrent Neural Networks
Recurrent Neural Networks (RNNs) are an extension of the conventional feed-forward neural networks with the ability to manage variable-length sequence inputs. Unlike conventional feed-forward neural networks, which are generally unable to handle sequential inputs and whose inputs (and outputs) must all be independent of each other, RNN models provide gates to store previous inputs and leverage their sequential information. This special RNN memory is called the recurrent hidden state, and it gives RNNs the ability to predict what input is coming next in the sequence of input data. In theory, RNNs are able to leverage previous sequential information for arbitrarily long sequences. In practice, however, due to the RNNs' memory limitations, the length of the sequential information is limited to only a few steps back. To give a formal definition, let $x = (x_1, x_2, \ldots, x_T)$ represent a sequence of length $T$, and let $h_t$ represent the RNN memory at time step $t$; an RNN model updates its memory information using:

$$h_t = f(W x_t + U h_{t-1} + b) \quad (1)$$
where $f$ is a nonlinear function (e.g., logistic sigmoid, hyperbolic tangent, or rectified linear unit (ReLU)), $W$ and $U$ are weight matrices used in the deep learning model, and $b$ is a constant bias. In general, RNNs come in multiple types: one input to many outputs, many inputs to many outputs, and many inputs to one output. In this work, we only consider RNNs that produce one output, namely the probability of the next element of a sequence given its previous inputs. The sequence probability can be decomposed as follows:

$$p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}) \quad (2)$$

in which each conditional probability distribution is modeled as:

$$p(x_t \mid x_1, \ldots, x_{t-1}) = g(h_{t-1}) \quad (3)$$

where $g$ is a nonlinear output function and $h_{t-1}$ is calculated using Equation 1.
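The memory update of Equation 1 can be sketched in a few lines of numpy. The choice of tanh as the nonlinearity $f$, the weight shapes, and the function names below are illustrative assumptions, not a prescribed implementation.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, b):
    """One update of the RNN memory (Equation 1): h_t = f(W x_t + U h_{t-1} + b),
    here with f = tanh."""
    return np.tanh(W @ x_t + U @ h_prev + b)

def rnn_forward(xs, W, U, b):
    """Run the recurrence over a whole sequence, returning all hidden states."""
    h = np.zeros(U.shape[0])  # initial memory h_0 = 0
    hs = []
    for x_t in xs:
        h = rnn_step(x_t, h, W, U, b)
        hs.append(h)
    return np.array(hs)
```

Each hidden state depends on the entire prefix of the sequence through the recurrence, which is what allows the conditional distribution in Equation 3 to be computed from $h_{t-1}$ alone.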
One of the common problems of RNNs is called "vanishing gradients," which happens when the information about the input or gradient passes through many layers: it vanishes and washes out by the time it reaches the end or beginning layer. This problem makes it hard for RNNs to capture long-term dependencies, and as a result, training RNNs becomes extremely challenging. Another problem of RNNs, which happens more rarely, is called "exploding gradients": when the information about the input or gradient passes through many layers, it may accumulate and result in a very large gradient by the time it reaches the end or beginning layer, which also makes RNNs hard to train.
A gradient, mathematically defined as the partial derivative of the output of a function with respect to its inputs, essentially measures how much the output of a function changes with respect to changes in its inputs. In the vanishing gradients problem, the RNN training algorithm assigns smaller and smaller values to the weight matrix (i.e., a matrix used in the process of RNN training), and thus the RNN model stops learning. In the exploding gradients problem, on the other hand, the training algorithm assigns excessively large values to the weight matrix. This problem can be addressed by truncating or squashing the gradients [12].
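The truncating/squashing remedy mentioned above is commonly realized as gradient norm clipping. The following is a minimal sketch; the threshold of 5.0 is an arbitrary illustrative choice.

```python
import numpy as np

def clip_gradient(grad, max_norm=5.0):
    """Rescale the gradient vector if its L2 norm exceeds max_norm,
    preserving its direction (a common fix for exploding gradients)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```

Clipping leaves well-behaved gradients untouched and only rescales the occasional oversized one, so it addresses exploding gradients without changing the training dynamics elsewhere.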
III-B Long Short-Term Memory (LSTM) Models
As mentioned earlier, RNNs have difficulties in learning long-term dependencies. LSTM-based models are an extension of RNNs that addresses the vanishing gradient problem in a very clean way. LSTM models essentially extend the RNN memory, enabling it to keep and learn long-term dependencies of inputs. This memory extension can remember information over a longer period of time, enabling reading, writing, and deleting information from the memory. The LSTM memory is called a "gated" cell, where the word gate is inspired by the ability to decide whether to preserve or ignore the memory information. An LSTM model captures important features from the inputs and preserves this information over a long period of time. The decision of deleting or preserving information is made based on the weight values assigned to the information during training. Hence, an LSTM model learns which information is worth preserving or removing.
In general, an LSTM model consists of three gates: a forget gate, an input gate, and an output gate. The forget gate decides whether to preserve or remove the existing information, the input gate specifies the extent to which new information is added into the memory, and the output gate controls whether the existing value in the cell contributes to the output.
I) Forget Gate.
A sigmoid function is usually used for this gate to decide what information needs to be removed from the LSTM memory. This decision is essentially made based on the values of $h_{t-1}$ and $x_t$. The output of this gate is $f_t$, a value between 0 and 1, where 0 indicates completely getting rid of the learned value and 1 implies preserving the whole value. This output is computed as:

$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f) \quad (4)$$

where $b_f$ is a constant, called the bias value.
II) Input Gate. This gate decides whether or not new information is added into the LSTM memory. It consists of two layers: 1) a sigmoid layer, and 2) a "tanh" layer. The sigmoid layer decides which values need to be updated, and the tanh layer creates a vector of new candidate values to be added into the LSTM memory. The outputs of these two layers are computed as:

$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \quad (5)$$
$$\tilde{C}_t = \tanh(W_C [h_{t-1}, x_t] + b_C) \quad (6)$$

in which $i_t$ represents whether the value needs to be updated, and $\tilde{C}_t$ is the vector of new candidate values to be added into the LSTM memory. The combination of these two layers provides an update for the LSTM memory, in which the current value is forgotten using the forget gate layer through multiplication of the old value (i.e., $C_{t-1}$), followed by adding the new candidate value $i_t * \tilde{C}_t$. The following equation represents this update:

$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t \quad (7)$$

where $f_t$ is the result of the forget gate, a value between 0 and 1 where 0 indicates completely getting rid of the value, whereas 1 implies completely preserving it.
III) Output Gate. This gate first uses a sigmoid layer to decide what part of the LSTM memory contributes to the output. Then, it applies a nonlinear tanh function to map the values between $-1$ and $1$. Finally, the result is multiplied by the output of the sigmoid layer. The output is computed as:

$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \quad (8)$$
$$h_t = o_t * \tanh(C_t) \quad (9)$$

where $o_t$ is the output value, and $h_t$ is its representation scaled to lie between $-1$ and $1$.
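Equations (4)-(9) can be traced step by step in a small numpy sketch. The parameter-dictionary layout, the concatenation of the previous hidden state with the current input, and the function names are our illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM step following Equations (4)-(9). Each weight matrix acts on
    the concatenation [h_{t-1}, x_t]; params holds one (W, b) pair per gate."""
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])      # forget gate   (4)
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])      # input gate    (5)
    C_tilde = np.tanh(params["W_C"] @ z + params["b_C"])  # candidates    (6)
    C_t = f_t * C_prev + i_t * C_tilde                    # memory update (7)
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])      # output gate   (8)
    h_t = o_t * np.tanh(C_t)                              # hidden state  (9)
    return h_t, C_t
```

Note how the cell state $C_t$ in Equation (7) is updated additively, which is what lets gradients flow over long horizons without vanishing.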
III-C Deep Bidirectional LSTMs (BiLSTM)
Deep bidirectional LSTMs [25] are an extension of the described LSTM models in which two LSTMs are applied to the input data. In the first round, an LSTM is applied to the input sequence (i.e., the forward layer). In the second round, the reversed form of the input sequence is fed into an LSTM model (i.e., the backward layer). Applying the LSTM twice improves the learning of long-term dependencies and consequently the accuracy of the model [3].
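The forward/backward idea can be sketched as follows. For brevity, a plain tanh recurrence stands in for a full LSTM cell, and both directions share weights here purely for illustration; real BiLSTMs use a full LSTM cell and separate parameters per direction.

```python
import numpy as np

def run_lstm_like(xs, W, U, b):
    """Simplified recurrent pass (a tanh cell stands in for a full LSTM)."""
    h = np.zeros(U.shape[0])
    out = []
    for x_t in xs:
        h = np.tanh(W @ x_t + U @ h + b)
        out.append(h)
    return np.array(out)

def bidirectional(xs, W, U, b):
    """BiLSTM idea: one pass over the input (forward layer), one over its
    reverse (backward layer), hidden states concatenated per time step."""
    fwd = run_lstm_like(xs, W, U, b)
    bwd = run_lstm_like(xs[::-1], W, U, b)[::-1]  # re-align to original order
    return np.concatenate([fwd, bwd], axis=1)
```

Each time step's output thus sees both the past (via the forward half) and the future (via the backward half) of the sequence, which is the property examined throughout this paper.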
IV LSTM vs. BiLSTM: An Experimental Study
This paper compares the performance of ARIMA, LSTM, and BiLSTM in the context of predicting financial time series.
IV-A Data Set
The authors partially reused previously collected data [20], in which daily, weekly, and monthly time series of several stocks for the period of January 1985 to August 2018 were extracted from the Yahoo Finance website (https://finance.yahoo.com). The data included 1) the Nikkei 225 index (N225), 2) the NASDAQ composite index (IXIC), 3) the Hang Seng Index (HSI), 4) the S&P 500 commodity price index (GSPC), 5) the Dow Jones industrial average index (DJI), and 6) IBM stock data. The daily IBM stock data were collected for the period of July 2009 to July 2019.
IV-B Training and Test Data
The "Adjusted Close" variable was chosen as the only feature of the financial time series fed into the ARIMA, LSTM, and BiLSTM models. Each data set was divided into training and test sets, where 70% was used for training and 30% for testing the accuracy of the models. Table I provides the statistics of the number of observations per time series.
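The chronological 70/30 split described above can be sketched as follows; because this is time series data, the split preserves temporal order rather than shuffling. The function name and the rounding choice are ours.

```python
def chronological_split(series, train_frac=0.70):
    """Split a time-ordered series into train/test without shuffling,
    as used to build Table I (70% train, 30% test)."""
    cut = int(round(len(series) * train_frac))
    return series[:cut], series[cut:]
```

Shuffling before splitting would leak future observations into the training set and invalidate the forecasting evaluation.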
Stock  Train (70%)  Test (30%)  Total
N225.monthly  283  120  403 
IXIC.daily  8,216  3,521  11,737 
IXIC.weekly  1,700  729  2,429 
IXIC.monthly  390  168  558 
HSI.monthly  258  110  368 
GSPC.daily  11,910  5,105  17,015 
GSPC.monthly  568  243  811 
DJI.daily  57,543  24,662  82,205 
DJI.weekly  1,189  509  1,698 
DJI.monthly  274  117  391 
IBM.daily  1,762  755  2,517 
Total  84,093  36,039  120,132 
IV-C Assessment Metrics
Deep learning algorithms typically report "loss" values. Loss is essentially a penalty for a poor prediction: the loss value is zero if the model's prediction is perfect. Hence, the goal is to minimize the loss by obtaining a set of weights and biases that minimizes it. In addition to loss, researchers often use the Root Mean Square Error (RMSE) to assess prediction performance. RMSE measures the difference between actual and predicted values. The formula for computing RMSE is:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2} \quad (10)$$

where $N$ is the total number of observations, $y_i$ is the actual value, and $\hat{y}_i$ is the predicted value. The main benefit of using RMSE is that it penalizes large errors. It also scales the scores to the same units as the forecast values. Furthermore, we also used the percentage of reduction in RMSE as a measure of improvement, calculated as:

$$\%\,\mathrm{Reduction} = \frac{\mathrm{RMSE}_{\mathrm{baseline}} - \mathrm{RMSE}_{\mathrm{new}}}{\mathrm{RMSE}_{\mathrm{baseline}}} \times 100 \quad (11)$$
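Equations (10) and (11) translate directly into code. The sketch below is a straightforward rendering; as a sanity check, the usage in the comments reproduces a reduction value from Table II.

```python
import numpy as np

def rmse(actual, predicted):
    """Root Mean Square Error, Equation (10)."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.sqrt(np.mean((predicted - actual) ** 2))

def pct_reduction(rmse_baseline, rmse_new):
    """Percentage reduction in RMSE of a new model over a baseline,
    Equation (11)."""
    return (rmse_baseline - rmse_new) / rmse_baseline * 100.0
```

For example, pct_reduction(14.11, 3.16) gives roughly 77.60, the BiLSTM-over-LSTM value reported for DJI.daily in Table II.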
V The Algorithms
The general "feed-forward" Artificial Neural Networks (ANNs) (Figure 1(a)) train the model by traveling in one direction only, without considering any feedback from past input data. More specifically, ANN models travel directly from input (left) to output (right) without taking into account any feedback from already-trained past data. As a result, the output of any layer does not affect the training process performed on the same layer (i.e., there is no memory). These types of neural networks are useful for modeling the (linear or nonlinear) relationship between input and output variables, and thus functionally perform like regression-based modeling: through these networks, a functional mapping is performed from input data to output data. This type of neural network is heavily utilized in pattern recognition. Convolutional Neural Networks (CNNs) and conventional, basic autoencoder networks are typical ANN models.
On the other hand, recurrent-based neural networks (RNNs) remember parts of the past data through a methodology called feedback, in which training takes place not only from input to output (as in feed-forward networks), but also through a loop in the network that preserves some information and thus functions like a memory (Figure 1(b)). Unlike feed-forward ANNs, feedback-based neural networks are dynamic: their states change continuously until they reach an equilibrium status, at which point they are optimized. The states remain at equilibrium until new inputs arrive that demand a change in the equilibrium. The major problem of vanilla RNNs is that they cannot preserve, and thus do not remember, long inputs.
As an extension of RNNs, the Long Short-Term Memory (LSTM) network (Figure 1(c)) was introduced to remember long input data, so that the relationship between the long input data and the output is described along an additional dimension (e.g., time or spatial location). An LSTM network remembers long sequences of data through the utilization of several gates: 1) an input gate, 2) a forget gate, and 3) an output gate.
Deep bidirectional LSTM (BiLSTM) networks [25] are a variation of normal LSTMs (Figure 1(d)), in which the desired model is trained not only from inputs to outputs, but also from outputs to inputs. More precisely, given the input sequence of data, a BiLSTM model first feeds the input data to an LSTM model (the forward layer), and then repeats the training via another LSTM model on the reversed order of the input sequence (i.e., its Watson-Crick complement [10]). It has been reported that BiLSTM models outperform regular LSTMs [3]. The algorithm developed for the experiments reported in this paper is listed in Listing 1. Please note that the two algorithms (LSTM and BiLSTM) are incorporated into one, where lines 9-12 switch between the two. The rolling-based algorithm retrains the model each time a new observation is fetched: once a prediction is performed and its value compared with the actual value, the actual value is added to the training set (line 26), and the model is retrained (line 27).
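The rolling retraining scheme of Listing 1 (lines 26-27) can be sketched generically as a walk-forward loop; here `fit` and `predict_next` are placeholders for whichever model (LSTM, BiLSTM, or ARIMA) is being evaluated, not functions from the paper's actual code.

```python
def rolling_forecast(train, test, fit, predict_next):
    """Rolling (walk-forward) evaluation: predict one step ahead, then move
    the observed value into the training window and refit, mirroring the
    retraining step of Listing 1."""
    history = list(train)
    predictions = []
    for actual in test:
        model = fit(history)                          # (re)train on all data so far
        predictions.append(predict_next(model, history))
        history.append(actual)                        # grow the training set (line 26)
    return predictions
```

With a naive "last value" predictor this reduces to a persistence forecast, which is a useful baseline when checking the loop itself.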
VI Results
Table II reports the Root Mean Square Error (RMSE) achieved by each technique for forecasting the stock data. In most cases (except IXIC.weekly), a significant reduction in the magnitude of the RMSE values is observed.
In comparing the LSTM and BiLSTM models, the percentage of reduction varies between 12.93% for IXIC.daily and 77.60% for DJI.daily. On average, the RMSE values achieved for the LSTM- and BiLSTM-based models are 39.09 and 20.17, respectively, a 37.78% reduction on average. With respect to the data, it is apparent that BiLSTM models outperform regular uni-LSTM models significantly, with a large margin.
Table II also reports the results computed for ARIMA and the percentages of the reductions captured. More specifically, the average reduction obtained using BiLSTM over ARIMA is 93.11%, whereas the average percentage of reduction using LSTM over ARIMA is 88.07%. The results indicate that modeling using BiLSTM instead of LSTM or ARIMA indeed improves the prediction accuracy.
To illustrate the forecasts performed by the models, Figures 2(a)-(c) show the forecasts for the IBM stock estimated by ARIMA, LSTM, and BiLSTM, respectively. Please note that the parts colored in green and orange (i.e., the predicted parts) overlap the original values of the test data; as a result, the initial test data are less visible in the plots.
Stock  ARIMA [20]  LSTM  BiLSTM  BiLSTM over LSTM (%)  BiLSTM over ARIMA (%)  LSTM over ARIMA (%)
N225.monthly  766.45  102.49  23.13  77.43  96.98  86.66 
IXIC.daily  34.61  2.01  1.75  12.93  94.94  94.19 
IXIC.weekly  72.53  7.95  11.53  45.03  84.10  89.03 
IXIC.monthly  135.60  27.05  8.49  68.61  93.37  80.00 
HSI.monthly  1,306.95  172.58  121.71  29.47  90.68  86.79 
GSPC.daily  14.83  1.74  0.62  64.36  95.81  88.26 
GSPC.monthly  55.30  5.74  4.63  19.33  91.62  89.62 
DJI.daily  139.85  14.11  3.16  77.60  97.77  89.91 
DJI.weekly  287.60  26.61  23.05  13.37  91.98  90.74 
DJI.monthly  516.97  69.53  23.69  65.59  95.41  86.50 
IBM.daily  1.70  0.22  0.15  31.18  91.11  87.05 
Average  302.96  39.09  20.17  37.78  93.11  88.07 
VII Discussion
As the results show, BiLSTM models outperform regular unidirectional LSTMs. It seems that BiLSTMs are able to capture the underlying context better by traversing the input data twice (from left to right and then from right to left). The better performance of BiLSTM compared to the regular unidirectional LSTM is understandable for certain types of data, such as text parsing and predicting the next word of an input sentence. However, it was not clear whether training on numerical time series data twice, learning from the future as well as the past, would help in better forecasting of time series, since there may not exist the kind of context observable in text parsing. Our results show that BiLSTMs perform better than regular LSTMs even in the context of forecasting financial time series data. To understand the differences between LSTM and BiLSTM in further detail, several interesting questions can be posed and empirically addressed, allowing us to learn more about the behavior of these variations of recurrent neural networks and how they work.
VII-A Loss vs. Batch Steps (Epoch = 1)
In order to compare the loss values for both the LSTM and BiLSTM models, we ran the developed scripts on our data and captured the loss values each time the learning model fetched the next batch of data. Figures 3(a)-(b) illustrate the plots for the IBM sample data when the epoch is set to 1, where the y-axis and x-axis represent the loss value and batch steps, respectively.
As illustrated in Figure 3(a), for the unidirectional LSTM the loss value starts relatively high and then decreases once the third batch of data is fetched. After these three rounds of fetching batches of time series data, the loss value remains stable until all batches of data are fetched, reaching as low as 0.014 (Table III) by the last (i.e., the 42nd) iteration.
On the other hand, as shown in Figure 3(b), the BiLSTM loss starts higher and then, interestingly, increases to its highest value (0.087, per Table III) on the third round of fetching batches of data. It then starts to decrease slowly as the batches of data are captured and the parameters are trained. However, unlike the unidirectional LSTM for epoch = 1, the BiLSTM model never reaches the loss value achieved by the counterpart LSTM model, even after fetching and learning from all the batches of data. This observation may indicate that the BiLSTM model needs to fetch more training data than its unidirectional version (i.e., LSTM) to reach equilibrium.
As reported in Table III, the standard deviations of the loss values achieved by the unidirectional LSTM and BiLSTM models when the epoch is 1 are 0.007 and 0.012, respectively. This indicates that the unidirectional LSTM model reaches equilibrium faster than its counterpart, BiLSTM. The primary reason seems to be directly related to training on the underlying time series twice (first left-to-right and then right-to-left). As a result, the BiLSTM-based learning model needs to fetch additional data batches to tune its parameters.
Model  Min  Max  SD  #Batches 

Epoch = 1 
LSTM  0.014  0.061  0.007  42 
BiLSTM  0.026  0.087  0.012  71 
Epoch = 2, round 1 
LSTM  0.013  0.048  0.005  41 
BiLSTM  0.025  0.184  0.02  75 
Epoch = 2, round 2 
LSTM  0.01  0.23  0.004  42 
BiLSTM  0.022  0.135  0.013  73 
VII-B Loss vs. Batch Steps (Epoch = 2)
The authors also compared the behavioral training of the BiLSTM and LSTM models when the epoch is set to 2. Figures 3(c)-(f) illustrate the changes observed in the loss values after fetching each batch of the IBM data.
First, round 1 of the epoch is compared for both BiLSTM and LSTM. Figures 3(c)-(d) illustrate the changes in the loss values for round 1 of training when epoch = 2. We observe trends similar to those obtained for epoch = 1 (Figures 3(a)-(b)) for both the LSTM and BiLSTM models. More specifically, in round 1 the loss value of the LSTM model starts to stabilize once the 3rd batch is fetched, whereas for BiLSTM the loss values start to stabilize only after the 8th batch is fetched.
Table III lists the descriptive statistics for both LSTM and BiLSTM for round 1 of epoch = 2. As the table reports, a trend similar to epoch = 1 is observed: the larger standard deviation of the calculated loss values is an indication that BiLSTM requires more data to optimally tune its parameters.
The most intriguing observation concerns the changes of the loss values in round 2 of epoch = 2. Figures 3(e)-(f) demonstrate the trends of the loss changes for both LSTM and BiLSTM. For LSTM, the loss exhibits a stabilized trend: it starts low, remains stable, and after fetching all the batches of data shows only an insignificant increase. A closer comparison of Figure 3(c) with Figure 3(e) indicates that the LSTM model had already reached equilibrium during the first round, and during the second round of the epoch nothing valuable is learned.
On the other hand, the trend for round 2 of BiLSTM does not exhibit a pattern similar to that observed for LSTM. As Figure 3(f) illustrates, the training model keeps learning from the data and tuning its parameters: the loss value starts high, quickly falls, and after a minor fluctuation becomes stabilized once the 9th batch is fetched. A comparison of Figures 3(d) and 3(f) indicates that the BiLSTM model keeps training its parameters through the second round, whereas the LSTM model stops learning and tuning parameters after the first round. The numerical values and descriptive statistics are reported in Table III, where the standard deviations for LSTM and BiLSTM in this round are 0.004 and 0.013, respectively, indicating the additional training needed to optimize the BiLSTM model in comparison to LSTM.
VII-C Batch Sizes
The last column of Table III reports an interesting phenomenon: the number of batches into which each training model divides the same data. According to the experimental data, the LSTM model divided the data into 41-42 batches (larger chunks), whereas the BiLSTM model divided the same data into 71-75 batches (smaller chunks). A rationale for this behavior is the limitation associated with LSTM models in general: even though these models are capable of "remembering" sequences of data, they have limitations in remembering long sequences. In a regular LSTM model, since the input data are traversed only once from left to right, a certain number of input items can be fed into the training model. In a BiLSTM model, on the other hand, the network needs to train on the input data not only from left to right, but also from right to left. As a result, the length of training data that can be handled in each batch is almost half of the amount learned per batch by a regular LSTM.
VIII Conclusion and Future Work
This paper reported the results of an experiment through which the performance, accuracy, and behavioral training of ARIMA, unidirectional LSTM (uni-LSTM), and bidirectional LSTM (BiLSTM) models were analyzed and compared. The research question targeted by the experiment primarily focused on whether training on the data in the opposite direction (i.e., right to left), in addition to the regular form of training (i.e., left to right), has any positive and significant impact on the precision of time series forecasting. The results showed that the additional layer of training improves the accuracy of the forecast by 37.78 percent on average and is thus beneficial for modeling. We also observed an interesting phenomenon when conducting the behavioral analysis of the unidirectional LSTM and BiLSTM models: training a BiLSTM is slower, and it takes fetching additional batches of data to reach equilibrium. This observation indicates that there are some additional features of the data that are captured by BiLSTM but that unidirectional LSTM models are not capable of exposing, since their training is one-way only (i.e., from left to right). As a result, this paper recommends using BiLSTM instead of LSTM for forecasting problems in time series analysis. This research can be further expanded to forecasting problems for multivariate and seasonal time series.
Acknowledgment
This work is supported in part by the National Science Foundation (NSF) under grants 1821560 and 1723765.
References
 [1] A. A. Adebiyi, A. O. Adewumi, C. K. Ayo, "Stock Price Prediction Using the ARIMA Model," UKSim-AMSS 16th International Conference on Computer Modeling and Simulation, 2014.
 [2] A. M. Alonso, C. Garcia-Martos, "Time Series Analysis: Forecasting with ARIMA Models," Universidad Carlos III de Madrid, Universidad Politecnica de Madrid, 2012.
 [3] P. Baldi, S. Brunak, P. Frasconi, G. Soda, and G. Pollastri, “Exploiting the past and the future in protein secondary structure prediction,” Bioinformatics, 15(11), 1999.
 [4] J. Brownlee, "How to Create an ARIMA Model for Time Series Forecasting with Python," 2017.
 [5] J. Brownlee, "Time Series Prediction with LSTM Recurrent Neural Networks in Python with Keras," 2016.
 [6] G. Box, G. Jenkins, Time Series Analysis: Forecasting and Control, San Francisco: HoldenDay, 1970.
 [7] Z. Cui, R. Ke, Y. Wang, "Deep Stacked Bidirectional and Unidirectional LSTM Recurrent Neural Network for Network-wide Traffic Speed Prediction," arXiv:1801.02143, 2018.
 [8] A. Earnest, M. I. Chen, D. Ng, L. Y. Sin, “Using Autoregressive Integrated Moving Average (ARIMA) Models to Predict and Monitor the Number of Beds Occupied During a SARS Outbreak in a Tertiary Hospital in Singapore,” in BMC Health Service Research, 5(36), 2005.
 [9] T. Fischer, C. Krauss, "Deep Learning with Long Short-term Memory Networks for Financial Market Predictions," FAU Discussion Papers in Economics 11, 2017.
 [10] J. Gao, H. Liu, and E. T. Kool, "Expanded-size bases in naturally sized DNA: Evaluation of steric effects in Watson-Crick pairing," Journal of the American Chemical Society, 126(38), pp. 11826-11831, 2004.
 [11] F. A. Gers, J. Schmidhuber, F. Cummins, "Learning to Forget: Continual Prediction with LSTM," Neural Computation 12(10): 2451-2471, 2000.
 [12] S. Hochreiter, J. Schmidhuber, "Long Short-Term Memory," Neural Computation 9(8): 1735-1780, 1997.
 [13] N. Huck, "Pairs Selection and Outranking: An Application to the S&P 100 Index," European Journal of Operational Research 196(2): 819-825, 2009.
 [14] R. J. Hyndman, G. Athanasopoulos, Forecasting: Principles and Practice. OTexts, 2014.
 [15] R. J. Hyndman, “Variations on Rolling Forecasts,” 2014.
 [16] M. J. Kane, N. Price, M. Scotch, P. Rabinowitz, “Comparison of ARIMA and Random Forest Time Series Models for Prediction of Avian Influenza H5N1 Outbreaks,” BMC Bioinformatics, 15(1), 2014.
 [17] M. Khashei, M. Bijari, "A Novel Hybridization of Artificial Neural Networks and ARIMA Models for Time Series Forecasting," Applied Soft Computing 11(2): 2664-2675, 2011.
 [18] J. Kim, N. Moon, "BiLSTM model based on multivariate time series data in multiple field for forecasting trading area," Journal of Ambient Intelligence and Humanized Computing, pp. 1-10.
 [19] C. Krauss, X. A. Do, N. Huck, "Deep neural networks, gradient-boosted trees, random forests: Statistical arbitrage on the S&P 500," FAU Discussion Papers in Economics 03/2016, Friedrich-Alexander University Erlangen-Nuremberg, Institute for Economics, 2016.
 [20] S. S. Namini, N. Tavakoli, and A. S. Namin, "A Comparison of ARIMA and LSTM in Forecasting Time Series," 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 1394-1401, IEEE, 2018.
 [21] S. I. Lee, S. J. Yoo, "A Deep Efficient Frontier Method for Optimal Investments," Department of Computer Engineering, Sejong University, Seoul, 05006, Republic of Korea, 2017.
 [22] Y. Pang, X. Xue, A. S. Namin, "Predicting Vulnerable Software Components through N-Gram Analysis and Statistical Feature Selection," International Conference on Machine Learning and Applications (ICMLA), pp. 543-548, 2015.
 [23] J. Patterson, Deep Learning: A Practitioner's Approach, O'Reilly Media, 2017.
 [24] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, 61: 85-117, 2015.
 [25] M. Schuster, K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE Transactions on Signal Processing, 45(11), pp. 2673-2681, 1997.
 [26] N. Tavakoli, "Modeling Genome Data Using Bidirectional LSTM," IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), vol. 2, pp. 183-188, 2019.
 [27] N. Tavakoli, D. Dong, and Y. Chen, "Client-side straggler-aware I/O scheduler for object-based parallel file systems," Parallel Computing, vol. 82, pp. 3-18, 2019.