We applied an ARIMA-LSTM hybrid model to predict future price correlation coefficients of two assets
Predicting the price correlation of two assets for future time periods is important in portfolio optimization. We apply LSTM recurrent neural networks (RNN) to predict the stock price correlation coefficient of two individual stocks. RNNs are competent in understanding temporal dependencies, and the use of LSTM cells further enhances their long-term predictive properties. To encompass both linearity and nonlinearity in the model, we adopt the ARIMA model as well. The ARIMA model filters linear tendencies in the data and passes on the residual value to the LSTM model. The ARIMA-LSTM hybrid model is tested against other traditional predictive financial models such as the full historical model, constant correlation model, single-index model and the multi-group model. In our empirical study, the predictive ability of the ARIMA-LSTM model turned out superior to all other financial models by a significant margin. Our work implies that it is worth considering the ARIMA-LSTM model to forecast correlation coefficients for portfolio optimization.
When constructing and selecting a portfolio for investment, evaluation of its expected returns and risks is
considered the bottom line. Markowitz introduced the Modern Portfolio Theory, which proposes methods to
quantify the returns and risks of a portfolio, in his paper ‘Portfolio Selection’ (1952). With the derived return and
risk, we draw the efficient frontier, which is a curve that connects all the combinations of expected returns and risks
that yield the highest return-risk ratio. Investors then select a portfolio on the efficient frontier, depending on their risk tolerance.
However, there have been criticisms of Markowitz’s assumptions. One of them is that the correlation coefficient used in measuring risk is constant and fixed. According to Francois Chesnay and Eric Jondeau’s empirical study on correlation coefficients, stock markets’ prices tend to have positive correlations during times of financial turbulence. This implies that the correlation of any two assets may well deviate from mean historical correlation coefficients subject to financial conditions; thus, the correlation is not stable. Frank Fabozzi, Francis Gupta and Harry Markowitz himself also briefly discussed the shortcomings of the Modern Portfolio Theory in their paper, ‘The Legacy of Modern Portfolio Theory’ (2002).
Acknowledging such pitfalls of the full historical correlation coefficient evaluation measures, numerous models for correlation coefficient prediction have been devised. One alternative is the Constant Correlation model, which sets all pairs of assets’ correlations equal to the mean of all the correlation coefficients of the assets in the portfolio. Some other forecast models include the Multi-Group model and the Single-Index model. We will cover these models in part 2, ‘Various Financial Models for Correlation Prediction’. Although there have been many financial and statistical approaches to estimate future correlation, few have implemented neural networks to carry out the task. Neural networks are frequently used to predict future stock returns and have produced noteworthy results (Y. Yoon, G. Swales (1991); A. N. Refenes et al. (1994); K. Kamijo, T. Tanigawa (1990); M. Dixon et al. (2017)). Given that stock correlation data can also be represented as time series data, by deriving the correlation coefficient dataset with a rolling time window, the application of neural networks in forecasting future correlation coefficients can be expected to have successful results as well. Rather than taking the roundabout route of predicting individual asset returns to compute the correlation coefficient, we cast predictions directly on the correlation coefficient value itself.
The imprecision of the full historical model for correlation prediction has largely been acknowledged [6, 7]. There have been numerous attempts to compensate for its mispredictions. In this section, we discuss three other frequently used models, along with the full historical model. Three of the four are cited in Elton et al. (1978): the Full Historical model, the Constant Correlation model, and the Single-Index model. The fourth, the Multi-Group model, is from another paper by Elton et al. (1977).
The Full Historical model is the simplest method to implement for portfolio correlation estimation. This model adopts the past correlation value to forecast the future correlation coefficient. That is, the correlation of two assets for a certain future time span is expected to be equal to the correlation value of a given past period.
$$\hat{\rho}_{ij}^{(t)} = \rho_{ij}^{(t-1)}$$
i, j : asset index in the correlation coefficient matrix
However, this model has been criticized for its relatively inferior prediction quality compared to other equivalent models.
The Constant Correlation model assumes that the full historical model encompasses only the information of
the mean correlation coefficient. Any deviation from the mean correlation coefficient is considered random
noise; it is sufficient to estimate the correlation of each pair of assets to be the average correlation of all pairs of
assets in a given portfolio. Therefore, applying the Constant Correlation model, all pairs of assets in a single portfolio have
the same correlation coefficient.
$$\hat{\rho}_{ij}^{(t)} = \frac{\sum_{i>j} \rho_{ij}^{(t-1)}}{n(n-1)/2}$$
i, j : asset index in the correlation coefficient matrix
n : number of assets in the portfolio
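As a minimal sketch of the Constant Correlation forecast described above, the snippet below averages the past-period correlations of all asset pairs and assigns that mean to every pair. The tickers and correlation values are hypothetical and serve only as an illustration.

```python
from itertools import combinations

def constant_correlation(past_corr):
    """Return the mean of the past pairwise correlations.

    past_corr maps an asset pair (i, j) to its correlation over the past
    period; with n assets it holds n(n-1)/2 entries.
    """
    return sum(past_corr.values()) / len(past_corr)

# Hypothetical past-period correlations for a three-asset portfolio.
assets = ["A", "B", "C"]
past = {("A", "B"): 0.6, ("A", "C"): 0.2, ("B", "C"): 0.4}

# Every pair receives the same forecast: the portfolio-wide mean.
mean_corr = constant_correlation(past)
forecast = {pair: mean_corr for pair in combinations(assets, 2)}
```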
The Single-Index model presumes that asset returns move in a systematic way with the ‘single-index’, that is,
the market return.
To quantify the systematic movement with respect to the market return, we need to specify the market return
itself. We call this specification the ‘market model’, which was devised by H. M. Markowitz and furthered
by Sharpe (1963). The ‘market model’ relates the return of asset i to the market return at time t, which is
represented in the following equation:
$$R_{i,t} = \alpha_i + \beta_i R_{m,t} + \epsilon_{i,t}$$
$R_{i,t}$ : return of asset i at time t
$R_{m,t}$ : return of the market at time t
$\alpha_i$ : risk adjusted excess return of asset i
$\beta_i$ : sensitivity of asset i to the market
$\epsilon_{i,t}$ : residual return; error term
such that $E(\epsilon_{i,t}) = 0$; $Var(\epsilon_{i,t}) = \sigma_{\epsilon_i}^2$
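Here, we use the beta ($\beta$) of assets i and j to estimate the correlation coefficient. With the equation that,
$$Cov(R_i, R_j) = \rho_{ij}\sigma_i\sigma_j = \beta_i\beta_j\sigma_m^2$$
$\sigma_i$ / $\sigma_j$ : standard deviation of asset i / j’s return
$\sigma_m$ : standard deviation of the market return
The estimated correlation coefficient would be,
$$\hat{\rho}_{ij}^{(t)} = \frac{\beta_i\beta_j\sigma_m^2}{\sigma_i\sigma_j}$$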
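The Single-Index estimate can be sketched in a few lines. The snippet assumes that each beta is obtained as the ordinary least squares slope of the asset return on the market return (covariance divided by market variance), which is a common choice but is not stated in the text; the return series are hypothetical.

```python
import math

def covariance(xs, ys):
    """Population covariance of two equally long sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def single_index_correlation(r_i, r_j, r_m):
    """Estimate corr(i, j) as beta_i * beta_j * var_m / (sd_i * sd_j)."""
    var_m = covariance(r_m, r_m)
    beta_i = covariance(r_i, r_m) / var_m  # assumed OLS slope on the market
    beta_j = covariance(r_j, r_m) / var_m
    sd_i = math.sqrt(covariance(r_i, r_i))
    sd_j = math.sqrt(covariance(r_j, r_j))
    return beta_i * beta_j * var_m / (sd_i * sd_j)

# Hypothetical returns: asset i doubles the market move, asset j moves
# against it, so the two assets are perfectly negatively correlated.
r_m = [0.01, -0.02, 0.015, 0.03, -0.01]
r_i = [2.0 * r for r in r_m]
r_j = [-0.5 * r for r in r_m]
rho = single_index_correlation(r_i, r_j, r_m)
```

Because both hypothetical assets are exact multiples of the market return, the estimate coincides with their true correlation; with real data the residual terms make the two differ.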
The Multi-Group model takes the asset’s industry sector into account. Under the assumption that assets in
the same industry sector generally perform similarly, the model sets the correlation coefficient of each asset pair
equal to the mean correlation of the corresponding industry sector pair. In other words, the Multi-Group
model applies the Constant Correlation model to each pair of business sectors. For instance, if
company A and company B belong to industry sectors $\alpha$ and $\beta$ respectively, their correlation coefficient would be the
mean value of all the correlation coefficients of asset pairs with the same industry sector combination ($\alpha$, $\beta$).
The equation for the prediction is slightly different depending on whether the two industry sectors $\alpha$ and $\beta$ are identical or not. The equation is as follows.
$$\hat{\rho}_{ij}^{(t)} = \begin{cases} \dfrac{\sum_{i \in \alpha} \sum_{j \in \beta} \rho_{ij}^{(t-1)}}{n_\alpha n_\beta}, & \alpha \neq \beta \\[2ex] \dfrac{\sum_{i \in \alpha} \sum_{j \in \alpha,\, j \neq i} \rho_{ij}^{(t-1)}}{n_\alpha (n_\alpha - 1)}, & \alpha = \beta \end{cases}$$
$\alpha$ / $\beta$ : industry sector notation
$n_\alpha$ / $n_\beta$ : the number of assets in each industry sector
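The Multi-Group forecast can be illustrated with a short sketch that groups asset pairs by their sector combination and averages within each group. The tickers, sector labels and correlation values below are hypothetical.

```python
def multi_group_forecast(past_corr, sector):
    """Forecast each pair's correlation as the mean over its sector pair."""
    groups = {}
    for (i, j), rho in past_corr.items():
        # Sort the two labels so (tech, energy) and (energy, tech) coincide.
        key = tuple(sorted((sector[i], sector[j])))
        groups.setdefault(key, []).append(rho)
    means = {key: sum(vals) / len(vals) for key, vals in groups.items()}
    return {
        (i, j): means[tuple(sorted((sector[i], sector[j])))]
        for (i, j) in past_corr
    }

# Hypothetical sectors and past-period correlations.
sector = {"A": "tech", "B": "tech", "C": "energy", "D": "energy"}
past = {
    ("A", "B"): 0.8, ("A", "C"): 0.1, ("A", "D"): 0.3,
    ("B", "C"): 0.2, ("B", "D"): 0.4, ("C", "D"): 0.6,
}
forecast = multi_group_forecast(past, sector)
```

Averaging over unordered pairs gives the same mean as the same-sector case of the formula, which counts each pair twice in both the numerator and the denominator.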
Time series data is assumed to be composed of a linear portion and a nonlinear portion. Thus, we can
express it as follows.
$$x_t = L_t + N_t + \epsilon_t$$
$L_t$ represents the linearity of the data at time $t$, while $N_t$ signifies nonlinearity. The $\epsilon_t$ value is the error term.
The Autoregressive Integrated Moving Average (ARIMA) model is one of the traditional statistical models for time series prediction. The model is known to perform decently on linear problems. On the other hand, the Long Short-Term Memory (LSTM) model can capture nonlinear trends in the dataset. So, the two models are combined in sequence to encompass both linear and nonlinear tendencies in the model. The first stage is the ARIMA model, and the second is the LSTM model.
The ARIMA model is fundamentally a linear regression model adapted to track linear tendencies in stationary time series data. The model is expressed as ARIMA(p,d,q). Parameters p, d, and q are integer values that decide the structure of the time series model; parameters p and q are the orders of the AR model and the MA model respectively, and parameter d is the level of differencing applied to the data. The mathematical representation of the ARMA model of order (p,q) is as follows.
$$x_t = c + \sum_{i=1}^{p} \phi_i x_{t-i} + \epsilon_t + \sum_{i=1}^{q} \theta_i \epsilon_{t-i}$$
Term $c$ is a constant; $\phi_i$ and $\theta_i$ are the coefficient values of the AR model variable $x_{t-i}$ and the MA model variable $\epsilon_{t-i}$. $\epsilon_t$ is the error at period $t$ ($\epsilon_t = x_t - \hat{x}_t$). It is assumed that
$\epsilon_t$ has zero mean with constant variance, and satisfies the i.i.d. condition.
In the AIC, $AIC = -2\ln(\hat{L}) + 2k$, the notation $\hat{L}$ is the value of the likelihood function, and
$k$ is the degree of freedom, that is, the number of parameters used. A model that has a small AIC value is generally considered a better model. There are different ways to compute the likelihood function $\hat{L}$. We use the maximum likelihood estimator for the computation. This method tends to be slow, but produces accurate results. Lastly, in the model checking step, residual analysis is carried out to finalize the ARIMA model. If residual analysis concludes that the residual value does not meet the standards, the three steps are iterated until an optimal ARIMA model is attained. Here, we use the residual that is calculated from the ARIMA model as the input for the subsequent LSTM model. As the ARIMA model has identified the linear trend, the residual is assumed to encompass the nonlinear features.
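To make the AIC comparison concrete, the sketch below computes $AIC = 2k - 2\ln(\hat{L})$ for several candidate orders and keeps the order with the smallest value. The log-likelihoods and parameter counts are hypothetical placeholders; in practice they would come from whichever ARIMA fitting routine is used.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical (log-likelihood, number of parameters) per order (p, d, q).
candidates = {
    (1, 1, 0): (-12.4, 2),
    (0, 1, 1): (-12.9, 2),
    (1, 1, 1): (-11.0, 3),
    (2, 1, 1): (-10.9, 4),
    (2, 1, 0): (-11.8, 3),
}

# Keep the order whose fitted model has the smallest AIC.
scores = {order: aic(ll, k) for order, (ll, k) in candidates.items()}
best_order = min(scores, key=scores.get)
```

Note how the order (2, 1, 1) has the highest likelihood in this toy data but is not selected, because the penalty term $2k$ outweighs its small gain in fit.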
The $\epsilon_t$ value would be the final error term of our model.
Neural networks are known to perform well on nonlinear tasks. Because of their versatility, which stems from the large number of parameters and the use of nonlinear activation functions in each layer, the models can adapt to nonlinear trends in the data. But empirical studies on financial data show that the performance of neural networks is rather mixed. For example, in the study by D. M. Q. Nelson et al., the accuracy of an LSTM neural network for stock price prediction generally tops other non-neural-network models. However, there are overlapping portions in the accuracy range of each model, implying that the LSTM does not always outperform the others. This provides a ground for our paper to use an ARIMA-LSTM hybrid model that encompasses both linearity and nonlinearity, so as to produce a more sophisticated result compared to pure LSTM neural network models.
To understand the LSTM model, the mechanism of Recurrent Neural Networks (RNN) should first be discussed. The RNN is a type of sequential model that performs effectively on time series data. It takes a sequence of vectors of time series data as input $X = [x_1, x_2, \dots, x_t]$ and outputs a vector value computed by the neural network structures in the model’s cell, symbolized as A in Figure 1. Vector X is time series data spanning t time periods. The values in vector X are sequentially passed through cell A. At each time step, the cell outputs a value, which is concatenated with the next time step data, and the cell state C. The output value and C serve as input for the next time step. The process is repeated up to the last time step data. Then, the Backward Propagation Through Time (BPTT) process, where the weight matrices are updated, initiates. The BPTT process will not be further illustrated in this paper. For a detailed illustration, refer to S. Hochreiter and J. Schmidhuber’s paper on Long Short-Term Memory (1997).
The $\sigma$ function, also denoted with the same symbol in Figure 2, is the logistic function, often called the sigmoid. It serves as the activation function that enables nonlinear capabilities for the model.
In the next phase, the input gate and the input candidate gate operate together to render the new cell state $C_t$, which will be passed on to the next time step as the renewed cell state. The input gate uses the sigmoid as the activation function and the input candidate utilizes the hyperbolic tangent, outputting $i_t$ and $\tilde{C}_t$ respectively. The $i_t$ selects which features in $\tilde{C}_t$ should be reflected in the new cell state $C_t$.
The $\tanh$ function, denoted ‘tanh’ in Figure 2 as well, is the hyperbolic tangent. Unlike the sigmoid, which renders values between 0 and 1, the hyperbolic tangent outputs values between -1 and 1.
Finally, the output gate decides what values are to be selected, combining $o_t$ with the $\tanh$-applied state $C_t$ as output $h_t$. The new cell state $C_t$ is a combination of the forget-gate-applied former cell state $C_{t-1}$ and the new $i_t$-applied state $\tilde{C}_t$.
The cell state $C_t$ and output $h_t$ will be passed to the next time step, and will go through the same process. Depending on the task, further activation functions such as the softmax or hyperbolic tangent can be applied to the $h_t$’s. In our paper’s case, which is a regression task that has output with values bounded between -1 and 1, we apply the hyperbolic tangent function to the output of the last element of data vector X. Figure 2 provides a visual illustration to aid understanding of the LSTM cell inner structure.
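The gate interactions described above can be sketched for a single LSTM unit with scalar input and state. This follows the standard LSTM formulation (forget, input, candidate and output gates); the weight layout and the numeric values are hypothetical and chosen only for readability.

```python
import math

def sigmoid(z):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell.

    params maps a gate name to (weight on h_prev, weight on x_t, bias).
    """
    def gate(name, activation):
        w_h, w_x, b = params[name]
        return activation(w_h * h_prev + w_x * x_t + b)

    f_t = gate("forget", sigmoid)            # share of the old state to keep
    i_t = gate("input", sigmoid)             # share of the candidate to admit
    c_tilde = gate("candidate", math.tanh)   # candidate state in (-1, 1)
    o_t = gate("output", sigmoid)            # share of the state to expose

    c_t = f_t * c_prev + i_t * c_tilde       # new cell state
    h_t = o_t * math.tanh(c_t)               # new output
    return h_t, c_t

# Hypothetical weights, shared by all gates for brevity.
gate_names = ("forget", "input", "candidate", "output")
params = {name: (0.5, 1.0, 0.0) for name in gate_names}

# Pass a short sequence through the cell, carrying h and c forward.
h, c = 0.0, 0.0
for x in [0.1, -0.2, 0.05]:
    h, c = lstm_step(x, h, c, params)
```

A full layer replaces the scalar weights with matrices acting on the concatenation of the previous output and the current input, but the flow of information between the gates is the same.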
In this paper, we utilize the ‘adjusted close’ price of the S&P500 firms (https://en.wikipedia.org/wiki/List_of_S%26P_500_companies, accessed 23 May, 2018). The price data from 2008-01-01 to 2017-12-31 of the S&P500 firms are downloaded using the ‘Quandl’ API. The data has a small ratio of missing values. The ratio of missing price data of each asset is around 0.1%, except for one asset with ticker ‘MMM’, which has a ratio around 1.1%. Although MMM’s ratio is not that high, missing data imputation seems infeasible because the missing values are found on consecutive days, creating large gaps in the time series. This may cause distortion when computing the correlation coefficient. So we exclude MMM from our research. For all other assets, we impute the missing data at time t with the value of time t-1. Then, we randomly select 150 stocks from the fully imputed price dataset. The randomly selected 150 firms’ tickers are listed in ‘Appendices A’.
Correlation coefficients are computed by applying a rolling 100-day window with a 100-day stride until the end of the dataset. This process renders 55875 sets of time series data ($\binom{150}{2} \times 5$), each with 24 time steps. Finally, we generate the train, development, and test1 & 2 datasets from the 55875 × 24 data. We split the data as follows in order to implement walk-forward optimization in the model evaluation phase.
Train set : index 1 ~ 21
Development set : index 2 ~ 22
Test1 set : index 3 ~ 23
Test2 set : index 4 ~ 24
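The data preparation above can be sketched as two small helpers: one that computes the correlation of two price series over consecutive 100-day windows, and one that cuts a 24-step correlation series into the four overlapping 21-step windows, each split into 20 input steps (X) and one target step (Y). The price series are synthetic, and the function names and the `start` offset parameter are illustrative assumptions rather than the authors' code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rolling_correlation(p_a, p_b, window=100, stride=100, start=0):
    """Correlation of two price series over consecutive windows."""
    return [
        pearson(p_a[s:s + window], p_b[s:s + window])
        for s in range(start, len(p_a) - window + 1, stride)
    ]

def walk_forward_split(series, n_x=20):
    """Cut a 24-step series into (X, Y) pairs, shifting one step per set."""
    names = ["train", "dev", "test1", "test2"]
    return {
        name: (series[k:k + n_x], series[k + n_x])
        for k, name in enumerate(names)
    }

# Synthetic price series covering 2400 trading days.
days = range(2400)
prices_a = [100 + 5 * math.sin(t / 7) + 0.01 * t for t in days]
prices_b = [50 + 3 * math.cos(t / 11) + 0.02 * t for t in days]

series = rolling_correlation(prices_a, prices_b)  # 24 correlation values
splits = walk_forward_split(series)
```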
Before fitting an ARIMA model, the order of the model must be specified. The ACF plot and the PACF plot aid the decision process. Most of the datasets showed an oscillatory trend that seemed close to white noise, as shown in Table 1. Other notable trends include an increasing/decreasing trend, occasional big dips within an otherwise steady correlation coefficient, and mixed oscillatory-steady periods. Although the ACF/PACF plots indicate that a great portion of the datasets are close to white noise, several orders (p, d, q) = (1, 1, 0), (0, 1, 1), (1, 1, 1), (2, 1, 1), (2, 1, 0) seem applicable. We fit the ARIMA model using the ‘pyramid’ module.
Algorithm 1. ARIMA model fitting algorithm
We use the residual values, derived from the ARIMA model, of the 150 randomly selected S&P500 stocks as input for the LSTM model. The datasets include the train X/Y, development X/Y, test1 X/Y, and test2 X/Y. Each X dataset has 55875 lines with 20 time steps, with a corresponding Y dataset for each time series (Figure 3). The data points are generally around 0, as the input is a residual dataset (Figure 4).
The architecture of the model for our task is an RNN that employs 25 LSTM units, trained with the ‘keras’ module (https://github.com/keras-team/keras). The last outputs of the 25 LSTM units are merged into a single value with a fully connected layer. Then, the value is passed through a doubled hyperbolic tangent activation function to output a single final prediction. The doubled hyperbolic tangent is simply the hyperbolic tangent function scaled by a factor of 2. Figure 5 shows a simplified architecture of the model.
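The output stage can be sketched as follows: a fully connected layer reduces the 25 LSTM outputs to one value, which then passes through the doubled hyperbolic tangent. The weights, bias and LSTM outputs are hypothetical placeholders, not trained values.

```python
import math

def double_tanh(z):
    """Hyperbolic tangent scaled by a factor of 2; output lies in (-2, 2)."""
    return 2.0 * math.tanh(z)

def output_layer(lstm_outputs, weights, bias):
    """Merge the LSTM outputs into one value and apply the activation."""
    z = sum(w * h for w, h in zip(weights, lstm_outputs)) + bias
    return double_tanh(z)

# Hypothetical last outputs of 25 LSTM units and dense-layer parameters.
lstm_outputs = [0.02 * k - 0.25 for k in range(25)]
weights = [0.1] * 25
prediction = output_layer(lstm_outputs, weights, bias=0.0)
```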
When training the model, it is crucial to keep an eye on overfitting. Overfitting occurs when the model fits excessively to the training dataset. Hence, the predictive performance on the train dataset will be high, but will be poor on newly introduced data. To monitor this problem, a separate development dataset is used. We train the LSTM model with the train dataset until the predictive performances on the train dataset and development dataset become similar to each other.
The dropout method is one of the widely used methods to prevent overfitting. It prevents the neurons from developing interdependency, which causes overfitting. This is executed by simply turning off neurons in the network during training with probability $p$. Then, in the testing phase, dropout is disabled and each weight value is multiplied by $1 - p$, to scale the output value into the desired range. Moreover, dropout has the effect of training multiple neural networks and averaging the outputs.
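The dropout scheme described above can be sketched as follows, taking the drop probability as a parameter named `p_drop` (a name chosen here for illustration). During training each activation is zeroed at random; at test time nothing is dropped and the values are scaled by the keep probability so that their expected magnitude matches training.

```python
import random

def dropout_train(activations, p_drop, rng):
    """Zero each activation independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]

def dropout_test(activations, p_drop):
    """Keep every activation but scale it by the keep probability."""
    return [a * (1.0 - p_drop) for a in activations]

# With many units, the training-time mean approaches the test-time value.
rng = random.Random(0)
acts = [1.0] * 10000
train_mean = sum(dropout_train(acts, 0.3, rng)) / len(acts)
test_mean = sum(dropout_test(acts, 0.3)) / len(acts)
```

Many libraries implement the equivalent "inverted" variant, which divides by the keep probability during training and leaves the test-time pass untouched.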
Parameters $\lambda_1$ and $\lambda_2$ determine the intensity of regularization of the cost function. If the lambda values are too high, the model will be under-trained. On the other hand, if they are too low, the regularization effect will be minimal.
In our model, after trial and error, it turned out that not applying any regularization performs better. We tried more complex architectures with regularization, but for all architectures, models with no regularization had superior outputs.
Another problem to pay attention to when training a neural network model is the vanishing/exploding gradient. This is particularly pronounced for RNNs. Due to deep propagation through time, the gradients far away from the output layer tend to be very small or very large, preventing the model from training properly. The remedy for this problem is the LSTM cell itself. The LSTM is capable of connecting large time intervals without loss of information.
Other miscellaneous details about the training process include the use of mini-batches of size 500 and the Adam optimizer. For details, refer to the LSTM section source codes in ‘Appendices B’.
The walk-forward optimization method is used as the evaluation method. Walk-forward optimization
requires that a model be fitted for each rolling time interval. Then, for each time interval, the newly trained model
is tested on the next time step. This ensures the robustness of the model fitting strategy. However, this process is
computationally expensive. In addition, our paper’s motive is to fit the parameters of a model that generalizes well on
various assets as well as on different time periods. Thus, it is unnecessary to train multiple models to validate the
model-fitting strategy. Rather than training a new model for each rolling train-set window, we train a
single model with the first window and apply it to three time intervals: the development set and the test1/test2 sets.
We selected our optimal model with the Mean Squared Error (MSE) metric. That is, the cost function of our model was the MSE. For further evaluation, the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) were also investigated.
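For reference, the three metrics can be written out as below. The true and predicted correlation values are hypothetical.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error, in the same unit as the target."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical true and predicted correlation coefficients.
y_true = [0.5, -0.2, 0.1]
y_pred = [0.4, 0.0, 0.3]
scores = {
    "MSE": mse(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
    "MAE": mae(y_true, y_pred),
}
```

The MSE penalizes large misses more heavily than the MAE, which is why both are worth reporting alongside each other.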
The selected optimal model is then tested on two recent time periods. We use two separate datasets to test the
model because the development set is deemed to be involved in the learning process as well.
If the model’s correlation coefficient predictions on the two time periods turn out decent as well, we then test our model against former financial predictive models. The MSE and MAE values are computed for the four financial models as well. For the constant correlation model and the multi-group model, we regarded the 150 assets we selected randomly to be our portfolio constituents.
Algorithm 2. LSTM model training algorithm
After around 200 epochs, the train dataset’s MSE value and the development dataset’s MSE value started to converge (Figure 6). The MAE learning curve showed a similar trend as well. Among the models, we selected a single epoch’s model. The epoch was decided based on both the overfitting metric and the performance metric. The overfitting metric was represented by the normalized value of the MSE difference between the train and development datasets. The performance metric was represented by the normalized value of the MSE sum of the train and development datasets. Then, the sum of the two normalized values was calculated to find the epoch that had the least value. The mathematical representation of the criterion is as follows.
$$e^{*} = \arg\min_{e} \left[ \mathrm{norm}\left( \left| MSE_{train}^{(e)} - MSE_{dev}^{(e)} \right| \right) + \mathrm{norm}\left( MSE_{train}^{(e)} + MSE_{dev}^{(e)} \right) \right]$$
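The epoch selection rule can be sketched as below. The text does not specify the normalization used, so this sketch assumes min-max scaling across epochs and takes the absolute train/development gap as the overfitting metric; both choices, as well as the MSE values, are assumptions for illustration.

```python
def min_max(values):
    """Scale values linearly into [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def select_epoch(train_mse, dev_mse):
    """Return the index of the epoch with the smallest combined score."""
    gap = [abs(t - d) for t, d in zip(train_mse, dev_mse)]    # overfitting
    total = [t + d for t, d in zip(train_mse, dev_mse)]       # performance
    score = [g + s for g, s in zip(min_max(gap), min_max(total))]
    return min(range(len(score)), key=score.__getitem__)

# Hypothetical learning curves over five epochs.
train_mse = [0.30, 0.24, 0.20, 0.17, 0.15]
dev_mse = [0.31, 0.26, 0.21, 0.22, 0.25]
best = select_epoch(train_mse, dev_mse)  # zero-based epoch index
```

In this toy data the later epochs keep lowering the train MSE, but the widening gap to the development MSE pushes their combined score up, so an earlier epoch is chosen.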
With the selected ARIMA-LSTM hybrid model, the MSE, RMSE and MAE values of the prediction were
calculated. The MSE values on the development, test1, and test2 datasets were 0.1786, 0.1889, and 0.2154 respectively. The
values have small variations, which means the model has generalized adequately.
Then, the metric values were compared with those of the other financial models. Among the financial models, the Constant Correlation model performed the best on our 150 S&P500 stocks’ dataset, just as the empirical study of E. J. Elton et al. has shown. However, its performance was nowhere near the ARIMA-LSTM hybrid model’s predictive capacity. The ARIMA-LSTM’s MSE value was nearly two thirds of that of the other equivalent models. The MAE metric also showed clear outperformance. Table 2 shows all metrics’ values for every dataset, for each model. The least value of each metric is boldfaced. Here, we can easily notice that all the metric values of the ARIMA-LSTM model are in boldface.
For further investigation, we tested our final model on different assets among the S&P500 firms. Excluding the 150 assets we had already selected to train our model, we randomly selected 10 assets and generated datasets with structures identical to the ones used in the model training and testing. This generates 180 lines of data. We then pass the data into our ARIMA-LSTM hybrid model and evaluate the predictions with the MSE, RMSE and MAE metrics. We iterate this process 10 times to check for model stability. The outputs of the 10 iterations are shown in Table 3.
The MSE values of the 10 iterations range from 0.1447 to 0.2353. Although there is some variation in the results compared to Test1 & 2, this may be due to the relatively small sample size, and the outstanding performance of the model makes it negligible. Therefore, we may cautiously affirm that our ARIMA-LSTM model is robust.
The purpose of our empirical study was to propose a model that outperforms existing financial
correlation coefficient predictive models. We adopted the ARIMA-LSTM hybrid model in an attempt to first filter
out linearity in the ARIMA modeling step, then predict nonlinear tendencies in the LSTM recurrent neural network.
The testing results showed that the ARIMA-LSTM hybrid model performs far better than other equivalent
financial models. Model performance was validated on both different time periods and on different combinations
of assets with various metrics such as the MSE, RMSE, and the MAE. The error values were nearly half those of the Constant
Correlation model, which, in our experiment, turned out to perform best among the four financial models. Judging
from such outperformance, we may presume that the ARIMA-LSTM hybrid model has sufficient predictive
potential. Thus, the ARIMA-LSTM model is worth considering as a correlation coefficient predictor for portfolio optimization.
With a better predictor, the portfolio is optimized more precisely, thereby enhancing returns.
However, our experiment did not cover time periods before the year 2008. So our model may be susceptible to specific financial conditions that were not present in the years between 2008 and 2017. Financial anomalies and noise are always prevalent, and it is impossible to incorporate all probable specific tendencies into the model. Hence, further research into dealing with financial black swans is called for.
* This is the list of tickers of the 150 randomly selected S&P500 stocks.
* This source code is a simplified version; unnecessary portions were contracted or omitted. For original and other
relevant source codes, visit
We thank developers of the ‘Quandl’ API, ‘Pyramid-arima’ module, and ‘Keras’ module, who provided open source codes that alleviated the burden of our research.
We also thank an anonymous commenter with the pseudonym ‘Moosefly’, who discovered a crucial error in the ARIMA modeling section source code.
Journal of Experimental And Theoretical Artificial Intelligence, 15:3:315–330, 2003.
Stock price pattern recognition - a recurrent neural network approach. 1990 IJCNN International Joint Conference on Neural Networks, vol. 1, San Diego, CA, USA:215–221, 1990.
Journal of Machine Learning Research, 15:1929–1958, 2014.
Stock trading with random forests, trend detection tests and force index volume indicators. Artificial Intelligence and Soft Computing, Lecture Notes in Computer Science, 7895:441–452, 2013.