Corr_Prediction_ARIMA_LSTM_Hybrid
We applied an ARIMA-LSTM hybrid model to predict future price correlation coefficients of two assets
Predicting the price correlation of two assets for future time periods is important in portfolio optimization. We apply LSTM recurrent neural networks (RNN) to predict the stock price correlation coefficient of two individual stocks. RNNs are competent in understanding temporal dependencies, and the use of LSTM cells further enhances their long-term predictive properties. To encompass both linearity and nonlinearity in the model, we adopt the ARIMA model as well. The ARIMA model filters linear tendencies in the data and passes on the residual value to the LSTM model. The ARIMA-LSTM hybrid model is tested against other traditional predictive financial models such as the full historical model, constant correlation model, single-index model and the multi-group model. In our empirical study, the predictive ability of the ARIMA-LSTM model turned out superior to all other financial models by a significant margin. Our work implies that it is worth considering the ARIMA-LSTM model to forecast correlation coefficients for portfolio optimization.
When constructing and selecting a portfolio for investment, evaluation of its expected returns and risks is
considered the bottom line. Markowitz introduced the Modern Portfolio Theory, which proposes methods to
quantify the returns and risks of a portfolio, in his paper ‘Portfolio Selection’ (1952) [13]. With the derived return and
risk, we draw the efficient frontier, which is a curve that connects all the combinations of expected returns and risks
that yield the highest return-risk ratio. Investors then select a portfolio on the efficient frontier, depending on their
risk tolerance.
However, there have been criticisms of Markowitz’s assumptions. One of them is that the correlation
coefficient used in measuring risk is constant and fixed. According to Francois Chesnay & Eric Jondeau’s empirical
study on correlation coefficients, stock markets’ prices tend to have positive correlations during times of financial
turbulence [6]. This implies that the correlation of any two assets may well deviate from mean historical
correlation coefficients depending on financial conditions; thus, the correlation is not stable. Frank Fabozzi, Francis
Gupta and Harry Markowitz himself also briefly discussed the shortcomings of the Modern Portfolio Theory in
their paper, ‘The Legacy of Modern Portfolio Theory’ (2002) [7].
Acknowledging such pitfalls of the full historical correlation coefficient evaluation measure, numerous models for correlation coefficient prediction have been devised. One alternative is the Constant Correlation model, which sets all pairs of assets’ correlations equal to the mean of all the correlation coefficients of the assets in the portfolio [3]. Some other forecast models include the Multi-Group model and the Single-Index model. We cover these models in part 2 of our paper, ‘Various Financial Models for Correlation Prediction’. Although there have been many financial and statistical approaches to estimating future correlation, few have used neural networks to carry out the task. Neural networks are frequently used to predict future stock returns and have produced noteworthy results (Y. Yoon, G. Swales (1991) [18]; A. N. Refenes et al. (1994) [1]; K. Kamijo, T. Tanigawa (1990) [11]; M. Dixon et al. (2017) [12]). Given that stock correlation data can also be represented as time series data, by deriving the correlation coefficient dataset with a rolling time window, the application of neural networks to forecasting future correlation coefficients can be expected to have successful results as well. Rather than taking the roundabout route of predicting individual asset returns to compute the correlation coefficient, we cast predictions directly on the correlation coefficient value itself.
The imprecision of the full historical model for correlation prediction has largely been acknowledged [6, 7]. There have been numerous attempts to remedy its mispredictions. In this section, we discuss three other frequently used models, along with the full historical model. Three of the four are cited in Elton et al. (1978) [3] (the Full Historical model, the Constant Correlation model, and the Single-Index model), and the fourth, the Multi-Group model, in another paper of Elton et al. (1977) [5].
The Full Historical model is the simplest method to implement for portfolio correlation estimation. This model adopts the past correlation value to forecast the future correlation coefficient. That is, the correlation of two assets for a certain future time span is expected to be equal to the correlation value of a given past period [3].
$$\hat{\rho}_{ij}^{(t)} = \rho_{ij}^{(t-1)}$$ (1)
i, j : asset index in the correlation coefficient matrix
However, this model has been criticized for its relatively inferior prediction quality compared to other
equivalent models.
The Constant Correlation model assumes that the full historical model encompasses only the information of
the mean correlation coefficient [3]. Any deviation from the mean correlation coefficient is considered random
noise; it is sufficient to estimate the correlation of each pair of assets to be the average correlation of all pairs of
assets in a given portfolio. Therefore, applying the Constant Correlation model, all pairs of assets in a single portfolio have
the same correlation coefficient.
$$\hat{\rho}_{ij}^{(t)} = \frac{\sum_{i>j} \rho_{ij}^{(t-1)}}{n(n-1)/2}$$ (2)
i, j : asset index in the correlation coefficient matrix
n : number of assets in the portfolio
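The two baselines described so far can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the dictionary layout and the toy series are hypothetical and do not come from the paper's source code.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def full_historical(past_series):
    """Forecast each pair's correlation as its correlation over the past window."""
    return {(i, j): pearson(past_series[i], past_series[j])
            for i, j in combinations(sorted(past_series), 2)}


def constant_correlation(past_series):
    """Forecast every pair's correlation as the mean of all past pairwise values."""
    hist = full_historical(past_series)
    avg = mean(hist.values())
    return {pair: avg for pair in hist}


# Hypothetical toy data for three assets over one past window.
toy = {
    "A": [0.01, -0.02, 0.03, 0.00, 0.02],
    "B": [0.02, -0.01, 0.02, 0.01, 0.01],
    "C": [-0.01, 0.02, -0.03, 0.00, -0.02],
}
fh = full_historical(toy)
cc = constant_correlation(toy)
```

Here `fh` keeps one value per asset pair, while `cc` assigns the same averaged value to every pair, which is the only difference between the two models.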
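The Single-Index model presumes that asset returns move in a systematic way with the ‘single index’, that is,
the market return [3].
To quantify the systematic movement with respect to the market return, we need to specify the market return
itself. We call this specification the ‘market model’, which was contrived by H. M. Markowitz [4], and furthered
by Sharpe (1963) [17]. The ‘market model’ relates the return of asset i with the market return at time t, which is
represented in the following equation:
$$R_{i,t} = \alpha_i + \beta_i R_{m,t} + \epsilon_{i,t}$$
$R_{i,t}$ : return of asset i at time t
$R_{m,t}$ : return of the market at time t
$\alpha_i$ : risk adjusted excess return of asset i
$\beta_i$ : sensitivity of asset i to the market
$\epsilon_{i,t}$ : residual return; error term
such that $E(\epsilon_i) = 0$; $Var(\epsilon_i) = \sigma_{\epsilon_i}^2$
Here, we use the beta ($\beta$) of assets i and j to estimate the correlation coefficient. With the equation that,
$$Cov(R_i, R_j) = \rho_{ij}\sigma_i\sigma_j = \beta_i\beta_j\sigma_m^2$$
where $\sigma_i$, $\sigma_j$ and $\sigma_m$ are the standard deviations of the returns of asset i, asset j and the market, the estimated correlation coefficient would be,
$$\hat{\rho}_{ij}^{(t)} = \frac{\beta_i\beta_j\sigma_m^2}{\sigma_i\sigma_j}$$ (3)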
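As a rough illustration of the Single-Index estimate, the sketch below computes each asset's beta against the market and combines the two betas with the market variance and the assets' standard deviations. The helper names and the use of population (rather than sample) moments are assumptions made for this example.

```python
from statistics import mean, pstdev


def covariance(x, y):
    """Population covariance of two equal-length return series."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def beta(asset, market):
    """Sensitivity of an asset to the market: Cov(R_i, R_m) / Var(R_m)."""
    return covariance(asset, market) / covariance(market, market)


def single_index_correlation(asset_i, asset_j, market):
    """Estimate rho_ij as beta_i * beta_j * Var(R_m) / (sd_i * sd_j)."""
    var_m = covariance(market, market)
    return (beta(asset_i, market) * beta(asset_j, market) * var_m
            / (pstdev(asset_i) * pstdev(asset_j)))


# Hypothetical returns: both assets are exact linear functions of the market,
# so the estimated correlation should be 1.
market = [0.01, -0.02, 0.015, 0.005, -0.01]
asset_i = [2.0 * m for m in market]
asset_j = [0.5 * m + 0.001 for m in market]
rho_hat = single_index_correlation(asset_i, asset_j, market)
```

When the residual terms are not zero, the estimate shrinks toward zero because the denominators include the residual variance while the numerator does not.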
The Multi-Group model [5] takes the asset’s industry sector into account. Under the assumption that assets in
the same industry sector generally perform similarly, the model sets each correlation coefficient of asset pairs
identical to the mean correlation of the industry sector pair’s correlation value. In other words, the Multi-Group
model is a model that applies the Constant Correlation model to each pair of business sectors. For instance, if
company A and company B belong to industry sectors $\alpha$ and $\beta$ respectively, their correlation coefficient would be the
mean value of all the correlation coefficients of asset pairs with the same industry sector combination ($\alpha$, $\beta$).
The equation for the prediction is slightly different depending on whether the two industry sectors $\alpha$ and $\beta$
are identical or not. The equation is as follows.
$$\hat{\rho}_{ij}^{(t)} = \begin{cases} \dfrac{\sum_{i \in \alpha} \sum_{j \in \beta,\, i \neq j} \rho_{ij}^{(t-1)}}{n_\alpha n_\beta} & \text{where } \alpha \neq \beta \\ \dfrac{\sum_{i \in \alpha} \sum_{j \in \beta,\, i \neq j} \rho_{ij}^{(t-1)}}{n_\alpha (n_\alpha - 1)} & \text{where } \alpha = \beta \end{cases}$$ (4)
$\alpha$ / $\beta$ : industry sector notation
$n_\alpha$ / $n_\beta$ : the number of assets in each industry sector
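A minimal sketch of the Multi-Group idea follows: past pairwise correlations are grouped by the (unordered) pair of sectors they belong to, and every pair is forecast as its group's mean. The data layout, function name and sector labels are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def multi_group(past_corr, sector_of):
    """Replace each pair's correlation with the mean over its sector pair.

    past_corr: {(asset_i, asset_j): correlation}, each unordered pair once.
    sector_of: {asset: sector label}.
    """
    def key(i, j):
        # Sector pairs are unordered, so (tech, energy) == (energy, tech).
        return tuple(sorted((sector_of[i], sector_of[j])))

    groups = defaultdict(list)
    for (i, j), rho in past_corr.items():
        groups[key(i, j)].append(rho)
    group_mean = {k: mean(v) for k, v in groups.items()}
    return {(i, j): group_mean[key(i, j)] for (i, j) in past_corr}


# Hypothetical past correlations for four assets in two sectors.
sector_of = {"A": "tech", "B": "tech", "C": "energy", "D": "energy"}
past_corr = {
    ("A", "B"): 0.8, ("C", "D"): 0.6,
    ("A", "C"): 0.2, ("A", "D"): 0.4, ("B", "C"): 0.1, ("B", "D"): 0.3,
}
forecast = multi_group(past_corr, sector_of)
```

Averaging over each unordered pair once gives the same value as the double sums divided by the pair counts, since each within-sector pair is counted twice in both numerator and denominator.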
Time series data is assumed to be composed of a linear portion and a nonlinear portion [19]. Thus, we can
express it as follows.
$$x_t = L_t + N_t + \epsilon_t$$
$L_t$ represents the linearity of data at time $t$, while $N_t$ signifies nonlinearity. The $\epsilon_t$ value is the error term.
The Autoregressive Integrated Moving Average (ARIMA) model is one of the traditional statistical models for
time series prediction. The model is known to perform decently on linear problems. On the other hand, the Long
Short-Term Memory (LSTM) model can capture nonlinear trends in the dataset. So, the two models are
combined consecutively to encompass both linear and nonlinear tendencies in the model. The first stage is the
ARIMA model, and the second is the LSTM model.
The ARIMA model is fundamentally a linear regression model adapted to track linear tendencies in stationary time series data. The model is expressed as ARIMA(p,d,q). Parameters p, d, and q are integer values that decide the structure of the time series model; parameters p and q are the orders of the AR model and the MA model respectively, and parameter d is the level of differencing applied to the data. The mathematical representation of the ARMA model of order (p,q) is as follows.
$$x_t = c + \sum_{i=1}^{p} \phi_i x_{t-i} + \epsilon_t + \sum_{i=1}^{q} \theta_i \epsilon_{t-i}$$
Term $c$ is a constant; $\phi_i$ and $\theta_i$ are the coefficient values of the AR model variable $x_{t-i}$ and the MA model variable $\epsilon_{t-i}$. $\epsilon_t$ is the error term at period $t$. It is assumed that
$\epsilon_t$ has zero mean with constant variance, and satisfies the i.i.d. condition.
$$AIC = 2k - 2\ln(L)$$
The notation $L$ is the value of the likelihood function, and
$k$ is the degree of freedom, that is, the number of parameters used. A model that has a small AIC value is generally considered a better model. There are different ways to compute the likelihood function
$L$. We use the maximum likelihood estimator for the computation. This method tends to be slow, but produces accurate results. Lastly, in the model checking step, residual analysis is carried out to finalize the ARIMA model. If residual analysis concludes that the residual value does not meet the standards, the three steps are iterated until an optimal ARIMA model is attained. Here, we use the residual that is calculated from the ARIMA model as the input for the subsequent LSTM model. As the ARIMA model has identified the linear trend, the residual is assumed to encompass the nonlinear features [19].
$$x_t - L_t = N_t + \epsilon_t$$
The $\epsilon_t$ value would be the final error term of our model.
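To make the "linear part plus residual" idea concrete, here is a small self-contained sketch. It is not the paper's fitting procedure: a least-squares AR(1) fit on first differences is swapped in as a simple stand-in for an ARIMA(1,1,0) model fitted by maximum likelihood, and both function names are hypothetical.

```python
def fit_ar1_on_differences(series):
    """Least-squares fit of d_t = c + phi * d_{t-1} + e_t on first differences."""
    d = [b - a for a, b in zip(series, series[1:])]
    x, y = d[:-1], d[1:]
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    phi = sxy / sxx
    c = my - phi * mx
    return c, phi


def linear_part_and_residual(series, c, phi):
    """One-step-ahead linear forecasts and residuals for t >= 2."""
    linear, resid = [], []
    for t in range(2, len(series)):
        prev_diff = series[t - 1] - series[t - 2]
        forecast = series[t - 1] + c + phi * prev_diff
        linear.append(forecast)
        resid.append(series[t] - forecast)  # what the linear model missed
    return linear, resid


# Hypothetical series whose differences follow an exact AR(1) recursion,
# so the residuals should vanish.
series, diff = [0.30], 0.10
for _ in range(8):
    series.append(series[-1] + diff)
    diff = 0.02 + 0.5 * diff
c, phi = fit_ar1_on_differences(series)
linear, resid = linear_part_and_residual(series, c, phi)
```

On real correlation series the residuals would not vanish; that leftover sequence is what the hybrid approach hands to the LSTM as its input.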
Neural networks are known to perform well on nonlinear tasks. Because of their versatility, owing to the large number of parameters and the use of nonlinear activation functions in each layer, the models can adapt to nonlinear trends in the data. But empirical studies on financial data show that the performance of neural networks is rather mixed. For example, in D. M. Q. Nelson et al.’s study [2], the accuracy of an LSTM neural network for stock price prediction generally tops that of other non-neural-network models. However, there are overlapping portions in the accuracy range of each model, implying that the model does not always perform better than the others. This provides grounds for our paper to use an ARIMA-LSTM hybrid model that encompasses both linearity and nonlinearity, so as to produce a more sophisticated result compared to pure LSTM neural network models.
To understand the LSTM model, the mechanism of Recurrent Neural Networks (RNN) should first be discussed. The RNN is a type of sequential model that performs effectively on time series data. It takes a sequence of vectors of time series data as input X = [$x_1, x_2, \dots, x_t$] and outputs a vector value computed by the neural network structures in the model’s cell, symbolized as A in Figure 1. Vector X is time series data spanning t time periods. The values in vector X are sequentially passed through cell A. At each time step, the cell outputs a value, which is concatenated with the next time step data, and the cell state C. The output value and C serve as input for the next time step. The process is repeated up to the last time step data. Then, the Backward Propagation Through Time (BPTT) process, where the weight matrices are updated, begins. The BPTT process will not be further illustrated in this paper. For a detailed illustration, refer to S. Hochreiter and J. Schmidhuber’s paper on Long Short-Term Memory (1997) [16].
The $\sigma$ function, also denoted with the same symbol in Figure 2, is the logistic function, often called the sigmoid. It serves as the activation function that enables nonlinear capabilities for the model.
In the next phase, the input gate and the input candidate gate operate together to render the new cell state $C_t$, which will be passed on to the next time step as the renewed cell state. The input gate uses the sigmoid as the activation function and the input candidate utilizes the hyperbolic tangent, outputting $i_t$ and $\tilde{C}_t$ respectively. The $i_t$ selects which features in $\tilde{C}_t$ should be reflected in the new cell state $C_t$.
The $\tanh$ function, denoted ‘tanh’ in Figure 2 as well, is the hyperbolic tangent. Unlike the sigmoid, which renders values between 0 and 1, the hyperbolic tangent outputs values between -1 and 1.
Finally, the output gate decides what values are to be selected, combining $o_t$ with the tanh-applied state $C_t$ as output $h_t$. The new cell state is a combination of the forget-gate-applied former cell state $C_{t-1}$ and the new tanh-applied state $\tilde{C}_t$.
The cell state $C_t$ and output $h_t$ will be passed to the next time step, and will go through the same process. Depending on the task, further activation functions such as the softmax or the hyperbolic tangent can be applied to the $h_t$’s. In our paper’s case, which is a regression task that has output with values bounded between -1 and 1, we apply the hyperbolic tangent function to the output of the last element of data vector X. Figure 2 provides a visual illustration to aid understanding of the LSTM cell’s inner structure.
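The gate mechanics described above can be traced with a deliberately tiny example. The sketch below assumes a single LSTM unit with scalar input and scalar state, so every weight is a plain number; the parameter layout (`params` mapping gate names to weight triples) is hypothetical and chosen only for readability.

```python
import math


def sigmoid(z):
    """Logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One forward step of a single-unit LSTM cell with scalar input.

    params maps each gate name ("f", "i", "g", "o") to (w_h, w_x, b).
    """
    def pre(gate):
        w_h, w_x, b = params[gate]
        return w_h * h_prev + w_x * x_t + b

    f_t = sigmoid(pre("f"))          # forget gate
    i_t = sigmoid(pre("i"))          # input gate
    g_t = math.tanh(pre("g"))        # input candidate
    o_t = sigmoid(pre("o"))          # output gate
    c_t = f_t * c_prev + i_t * g_t   # new cell state
    h_t = o_t * math.tanh(c_t)       # output, always inside (-1, 1)
    return h_t, c_t


def run_sequence(xs, params):
    """Feed a sequence through the cell and return the last output."""
    h, c = 0.0, 0.0
    for x in xs:
        h, c = lstm_step(x, h, c, params)
    return h


# Hypothetical weights and a short residual-like input sequence.
params = {"f": (0.1, 0.2, 0.0), "i": (0.3, -0.4, 0.1),
          "g": (0.5, 0.6, 0.0), "o": (-0.2, 0.7, 0.0)}
last_output = run_sequence([0.05, -0.10, 0.02, 0.08], params)
```

A real layer holds weight matrices rather than scalars and runs many units in parallel, but the order of operations per time step is the same.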
In this paper, we use the ‘adjusted close’ price of the S&P500 firms (https://en.wikipedia.org/wiki/List_of_S%26P_500_companies, accessed 23 May, 2018). The price data of the S&P500 firms from 2008-01-01 to 2017-12-31 are downloaded with the ‘Quandl’ API (https://github.com/quandl/quandlpython). The data has a small ratio of missing values. The ratio of missing price data of each asset is around 0.1%, except for one asset with ticker ‘MMM’, which has a ratio of around 1.1%. Although MMM’s ratio is not that high, missing data imputation seems infeasible because the missing values are found on consecutive days, creating large gaps in the time series. This may cause distortion when computing the correlation coefficient. So we exclude MMM from our research. For the other assets, we impute the missing data at time t with the value at time t-1. Then, we randomly select 150 stocks from the fully imputed price dataset. The tickers of the 150 randomly selected firms are listed in ‘Appendices A’.
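The imputation rule (carry the previous day's value forward) can be written in a few lines. This is a sketch with a hypothetical function name, using `None` to mark a missing price.

```python
def forward_fill(prices):
    """Replace each missing value (None) with the most recent observed value.

    Leading missing values stay None because there is nothing to carry forward.
    """
    filled, last = [], None
    for p in prices:
        if p is not None:
            last = p
        filled.append(last)
    return filled


# Hypothetical price series with gaps.
filled = forward_fill([10.0, None, None, 10.4, None])
```

Long runs of missing days all receive the same stale price under this rule, which flattens the series over the gap and is consistent with the decision to drop an asset whose gaps span many consecutive days.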
[…] and to each we apply a rolling 100-day window with a 100-day stride until the end of the dataset. This process renders 55875 sets of time series data ($\binom{150}{2} \times 5$), each with 24 time steps. Finally, we generate the train, development, and test1 & test2 data sets from the 55875 × 24 data. We split the data as follows in order to implement the walk-forward optimization [15] in the model evaluation phase.
Train set : index 1 ~ 21
Development set : index 2 ~ 22
Test1 set : index 3 ~ 23
Test2 set : index 4 ~ 24
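The windowing and the one-step-shifted split can be sketched as follows. The function names and the `start` offset parameter are assumptions for illustration; the split takes 20 input steps plus one target step from each 21-step slice, matching the index ranges listed above.

```python
def rolling_windows(n_obs, window=100, stride=100, start=0):
    """(begin, end) index pairs of consecutive windows beginning at `start`."""
    return [(b, b + window)
            for b in range(start, n_obs - window + 1, stride)]


def walk_forward_split(series, n_x=20):
    """Split one 24-step series into (X, Y) pairs, shifted one step per set."""
    names = ["train", "dev", "test1", "test2"]
    return {name: (series[k:k + n_x], series[k + n_x])
            for k, name in enumerate(names)}


# A stand-in 24-step series labelled 1..24 so the indices are easy to read.
split = walk_forward_split(list(range(1, 25)))
windows = rolling_windows(250, start=20)
```

Because each set is the previous one shifted by a single step, the target of the train set becomes the last input of the development set, and so on through test2.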
Before fitting an ARIMA model, the order of the model must be specified. The ACF plot and the PACF plot aid the decision process. Most of the datasets showed an oscillatory trend that seemed close to white noise, as shown in Table 1. Other notable trends include an increasing/decreasing trend, occasional big dips amid an otherwise steady correlation coefficient, and mixed oscillatory-steady periods. Although the ACF/PACF plots indicate that a great portion of the datasets are close to white noise, several orders, (p, d, q) = (1, 1, 0), (0, 1, 1), (1, 1, 1), (2, 1, 1), (2, 1, 0), seem applicable. We fit the ARIMA model with the ‘pyramid’ module.
Algorithm 1. ARIMA model fitting algorithm
We use the residual values, derived from the ARIMA model, of the 150 randomly selected S&P500 stocks as input for the LSTM model. The datasets include the train X/Y, development X/Y, test1 X/Y, and test2 X/Y. Each X dataset has 55875 lines with 20 time steps, with a corresponding Y dataset for each time series (Figure 3). The data points are generally around 0, as the input is a residual dataset (Figure 4).
The architecture of the model for our task is an RNN that employs 25 LSTM units (we utilize the ‘keras’ module to train the LSTM model).
The dropout method is one of the widely used methods to prevent overfitting. It prevents the neurons from developing interdependency, which causes overfitting. This is executed by simply turning off neurons in the network during training with a fixed probability. Then, in the testing phase, dropout is disabled and each weight value is multiplied by the retention probability, to scale the output value down into the desired range. Moreover, dropout has the effect of training multiple neural networks and averaging the outputs [14].
The two $\lambda$ parameters determine the intensity of regularization of the cost function. If the lambda values are too high, the model will be undertrained. On the other hand, if they are too low,
the regularization effect will be minimal.
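The train-time and test-time behaviour of dropout can be illustrated with plain lists. This is a sketch under assumptions: `keep_prob` denotes the retention probability, and the scaling is applied to activations, which is equivalent to scaling the outgoing weights of a linear layer.

```python
import random


def dropout_train(activations, keep_prob, rng):
    """Zero each activation independently with probability 1 - keep_prob."""
    return [a if rng.random() < keep_prob else 0.0 for a in activations]


def dropout_test(activations, keep_prob):
    """At test time keep every unit and scale by keep_prob instead."""
    return [a * keep_prob for a in activations]


# Hypothetical layer of 10000 units that all output 1.0.
rng = random.Random(0)
train_out = dropout_train([1.0] * 10000, keep_prob=0.8, rng=rng)
test_out = dropout_test([1.0] * 10000, keep_prob=0.8)
train_mean = sum(train_out) / len(train_out)
test_mean = sum(test_out) / len(test_out)
```

The two means come out close to each other, which is the point of the test-time scaling: the next layer sees inputs of about the same magnitude whether or not units are being dropped.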
In our model, after trial and error, it turned out that not applying any regularization performs better. We tried more complex architectures with regularization, but for all architectures, models with no regularization had superior outputs.
Another problem to pay attention to when training a neural network model is the vanishing/exploding gradient.
This is particularly pronounced for RNNs. Due to deep propagation through time, the gradients far away from the
output layer tend to be very small or very large, preventing the model from training properly. The remedy for this
problem is the LSTM cell itself. The LSTM is capable of connecting large time intervals without loss of
information [16].
Other miscellaneous details about the training process include the use of mini-batches of size 500, the ADAM
optimization function, et cetera. For details, refer to the LSTM section source code in ‘Appendices B’.
The walk-forward optimization method [15] is used as the evaluation method. Walk-forward optimization
requires that a model be fitted for each rolling time interval. Then, for each time interval, the newly trained model
is tested on the next time step. This ensures the robustness of the model fitting strategy. However, this process is
computationally expensive. In addition, our paper’s motive is to fit the parameters of a model that generalizes well on
various assets as well as on different time periods. Thus, it is unnecessary to train multiple models to validate the
model-fitting strategy. Rather than training a new model for each rolling train-set window, we train a
single model with the first window and apply it to three time intervals: the development set and the test1/test2
sets.
We selected our optimal model with the Mean Squared Error (MSE) metric. That is, the cost function of our
model was the MSE. For further evaluation, the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE)
were also investigated.
$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
$$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$
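For reference, the three error metrics can be computed as below. This is a plain-Python sketch using the standard definitions; the function names are illustrative.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical correlation targets against an all-zero prediction.
y_true = [0.5, -0.5, 0.0, 1.0]
y_pred = [0.0, 0.0, 0.0, 0.0]
scores = (mse(y_true, y_pred), rmse(y_true, y_pred), mae(y_true, y_pred))
```

Since the MSE squares each error, a few badly mispredicted pairs weigh more heavily on it than on the MAE, which is one reason to report both.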
The selected optimal model is then tested on two recent time periods. We use two separate datasets to test the
model because the development set is deemed to be involved in the learning process as well.
If the model’s correlation coefficient prediction on the two time periods turns out decent as well, we then test our
model against the traditional financial predictive models. The MSE and MAE values are computed for the four financial
models as well. For the Constant Correlation model and the Multi-Group model, we regarded the 150 assets we
selected randomly as our portfolio constituents.
Algorithm 2. LSTM model training algorithm
After around 200 epochs, the train dataset’s MSE value and the development dataset’s MSE value started to converge (Figure 6). The MAE learning curve showed a similar trend as well. Among the models saved at each epoch, we selected a single epoch’s model. The epoch was decided based on both an overfitting metric and a performance metric. The overfitting metric was represented by the normalized value of the MSE difference between the train and development datasets. The performance metric was represented by the normalized value of the MSE sum of the train and development datasets. Then, the sum of the two normalized values was calculated to find the epoch that had the least value. The mathematical representation of the criterion is as follows, where $N(\cdot)$ denotes normalization across epochs and $e$ is the epoch index.
$$e^{*} = \arg\min_{e}\left[ N\left(MSE_{dev}^{(e)} - MSE_{train}^{(e)}\right) + N\left(MSE_{train}^{(e)} + MSE_{dev}^{(e)}\right) \right]$$
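The selection rule can be sketched as follows. Two assumptions are made here that the text does not settle: min-max scaling is used as the normalization, and the difference is taken as development MSE minus train MSE. Both function names are hypothetical.

```python
def min_max(values):
    """Scale values linearly into [0, 1]; assumes they are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_epoch(train_mse, dev_mse):
    """Index of the epoch with the smallest normalized gap plus normalized sum."""
    gap = min_max([d - t for t, d in zip(train_mse, dev_mse)])    # overfitting
    total = min_max([t + d for t, d in zip(train_mse, dev_mse)])  # performance
    score = [g + s for g, s in zip(gap, total)]
    return score.index(min(score))


# Hypothetical learning curves over four epochs: the last epoch has the lowest
# train error but a wide train/dev gap, so it should not be chosen.
train_curve = [0.30, 0.20, 0.15, 0.10]
dev_curve = [0.32, 0.22, 0.18, 0.25]
best = select_epoch(train_curve, dev_curve)
```

The rule trades raw performance against overfitting, so the chosen epoch is not necessarily the one with the lowest development MSE alone.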
With the selected ARIMA-LSTM hybrid model, the MSE, RMSE and MAE values of the prediction were
calculated. The MSE values on the development, test1, and test2 datasets were 0.1786, 0.1889, and 0.2154 respectively. The
values have small variations, which suggests that the model generalizes adequately.
Then, the metric values were compared with those of the other financial models. Among the financial models, the Constant Correlation model performed the best on our 150 S&P500 stocks’ dataset, just as the empirical
study of E. J. Elton et al. has shown [3]. However, its performance was nowhere near the ARIMA-LSTM hybrid
model’s predictive capacity. The ARIMA-LSTM’s MSE value was nearly two thirds of that of other equivalent models. The MAE metric also showed clear outperformance. Table 2
shows all metrics’ values for every dataset, for each model. The least value of each metric is in boldface.
Here, we can easily see that all the metric values of the ARIMA-LSTM model are in boldface.
For further investigation, we tested our final model on different assets among the S&P500 firms. Excluding the 150
assets we had already selected to train our model, we randomly selected 10 assets and generated datasets with
structures identical to the ones used in the model training and testing. This generates 180 lines of data. We then
pass the data into our ARIMA-LSTM hybrid model and evaluate the predictions with the MSE, RMSE and MAE
metrics. We iterate this process 10 times to check for model stability. The outputs of the 10 iterations are shown
in Table 3.
The MSE values of the 10 iterations range from 0.1447 to 0.2353. Although there is some variation in the results
compared to Test1 & 2, this may be due to the relatively small sample size, and the outstanding performance of
the model makes it negligible. Therefore, we may cautiously affirm that our ARIMA-LSTM model is robust.
The purpose of our empirical study was to propose a model that performs better than extant financial
correlation coefficient predictive models. We adopted the ARIMA-LSTM hybrid model in an attempt to first filter
out linearity in the ARIMA modeling step, then predict nonlinear tendencies in the LSTM recurrent neural network.
The testing results showed that the ARIMA-LSTM hybrid model performs far better than other equivalent
financial models. Model performance was validated on both different time periods and different combinations
of assets with various metrics such as the MSE, RMSE, and the MAE. The values were nearly half those of the Constant
Correlation model, which, in our experiment, turned out to perform best among the four financial models. Judging
from such outperformance, we may presume that the ARIMA-LSTM hybrid model has sufficient predictive
potential. Thus, the ARIMA-LSTM model is worth considering as a correlation coefficient predictor for portfolio
optimization. With a better predictor, the portfolio is optimized more precisely, thereby enhancing returns on
investments.
However, our experiment did not cover time periods before the year 2008. So our model may be susceptible
to specific financial conditions that were not present in the years between 2008 and 2017. But financial anomalies
and noise are always prevalent. It is impossible to incorporate all probable specific tendencies into the model. Hence,
further research into dealing with financial black swans is called for.
* This is the list of tickers of the 150 randomly selected S&P500 stocks.
CELG  PXD  WAT  LH  AMGN  AOS 
EFX  CRM  NEM  JNPR  LB  CTAS 
MAT  MDLZ  VLO  APH  ADM  MLM 
BK  NOV  BDX  RRC  IVZ  ED 
SBUX  GRMN  CI  ZION  COO  TIF 
RHT  FDX  LLL  GLW  GPN  IPGP 
GPC  HPQ  ADI  AMG  MTB  YUM 
SYK  KMX  AME  AAP  DAL  A 
MON  BRK  BMY  KMB  JPM  CCI 
AET  DLTR  MGM  FL  HD  CLX 
OKE  UPS  WMB  IFF  CMS  ARNC 
VIAB  MMC  REG  ES  ITW  NDAQ 
AIZ  VRTX  CTL  QCOM  MSI  NKTR 
AMAT  BWA  ESRX  TXT  EXR  VNO 
BBT  WDC  UAL  PVH  NOC  PCAR 
NSC  UAA  FFIV  PHM  LUV  HUM 
SPG  SJM  ABT  CMG  ALK  ULTA 
TMK  TAP  SCG  CAT  TMO  AES 
MRK  RMD  MKC  WU  CAN  HIG 
TEL  DE  ATVI  O  UNM  VMC 
ETFC  CMA  NRG  RHI  RE  FMC 
MU  CB  LNT  GE  CBS  ALGN 
SNA  LLY  LEN  MAA  OMC  F 
APA  CDNS  SLG  HP  XLNX  SHW 
AFL  STT  PAYX  AIG  FOX  MA 
* This source code is a simplified version; unnecessary portions were contracted or omitted. For original and other
relevant source codes, visit
‘https://github.com/imhgchoi/Corr_Prediction_ARIMA_LSTM_Hybrid’.
We thank the developers of the ‘Quandl’ API, the ‘Pyramid-arima’ module, and the ‘Keras’ module, who provided open source code that alleviated the burden of our research.
We also thank an anonymous commenter with the pseudonym ‘Moosefly’, who discovered a crucial error in the ARIMA modeling section source code.
Journal of Experimental and Theoretical Artificial Intelligence, 15(3):315–330, 2003.
Stock price pattern recognition - a recurrent neural network approach. 1990 IJCNN International Joint Conference on Neural Networks, vol. 1, San Diego, CA, USA, pp. 215–221, 1990.
Journal of Machine Learning Research, 15:1929–1958, 2014.
Stock trading with random forests, trend detection tests and force index volume indicators. Artificial Intelligence and Soft Computing, Lecture Notes in Computer Science, 7895:441–452, 2013.