Predicting stock returns is a task that has attracted a lot of attention not only among investors, financial managers and advisors, but also among academics. It matters to finance practitioners seeking to allocate their assets optimally and to academics building better and more accurate asset pricing models. Moreover, predicting stock returns has crucial implications for market efficiency. Interest in stock price movements dates back to kendall1953analysis, where it was noticed that stock prices seemed to move randomly over time. This is what is called “the random walk hypothesis”. This hypothesis is consistent with the efficient-market hypothesis, which views prices as a function of information. The efficient-market hypothesis states that current stock prices reflect all available information; if there are movements in subsequent periods, they come as a result of the release of new information or of random shocks. This means that it is impossible to predict future stock prices using past information. This work uses Artificial Neural Networks (hereafter ANNs) to question the efficient-market hypothesis by attempting to predict future individual stock prices using historical data.
1.1 Related Work
The efficient markets theory was first proposed by the French mathematician Louis Bachelier in 1900 (see bachelier1900theorie), but it started to draw a lot of attention only in the 1960s. A huge number of studies analyzed whether this hypothesis was true. The first variables used in predicting future movements were past prices, later joined by other predictive variables such as interest rates, default spreads, the dividend yield, the book-to-market ratio, and the earnings-price ratio (e.g. fama1977asset, fama1988dividend, rozeff1984dividend, shiller1984stock, flood1986evaluation, campbell1988dividend). The last three ratios have received the most interest in the literature. The hypothesis that dividend yields forecast stock returns is even older (for example, dow1920scientific and ball1978anomalies). All three of these ratios have prices in the denominator; thus, high ratios imply that the stock is undervalued, which in turn suggests high subsequent returns. This contradicts the view that returns (and thus prices) cannot be predicted using past information. Indeed, studies such as fama1988dividend showed that these ratios are positively correlated with subsequent returns.
Interest in the topic continued during the 1990s and early 2000s, when, in addition to the above-mentioned predictive variables, many authors, like lamont1998earnings, baker2000equity and lettau2001consumption, used financing activity, the consumption/wealth relation and valuations of high and low beta stocks. Setting aside differences in predictive variables, model specifications or ways of correcting for errors and/or biases, what all these studies have in common is the use of traditional approaches and methods for estimation and prediction. fama1988dividend use a linear least-squares model to predict monthly NYSE returns from dividend yields and find -statistics between and . In line with the criticism that this method is biased (e.g. stambaugh1985bias and mankiw1986we), nelson1993predictable replicate the study, correcting the bias using bootstrap simulations. They find -values between and . On the other hand, stambaugh1999predictive performs the estimation assuming that dividend yields follow a first-order autoregressive (AR1) process. He regresses NYSE returns during 1952-1996 on dividend yields, and the -value reported for the one-sided test is .
Although the focus has been on predictive variables such as the ratios mentioned above, stock prices are also very sensitive to social and political events, monetary policy, interest rates, and many other macroeconomic variables. Public news announcements, or periods of no public information release, also significantly affect fluctuations in stock prices (e.g. chen2000extensions and chaudhuri2004stock). The multivariate vector autoregression (VAR) model, a generalization of the univariate autoregressive (AR) model, has also been widely used in the literature. This is a stochastic process model that treats all variables as potentially endogenous, and it is used to capture the linear interdependencies among multiple time series.
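As a concrete illustration of these mechanics, a minimal bivariate VAR(1) can be estimated by ordinary least squares; the data below is simulated and hypothetical, not any series used in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate two interdependent series from a known VAR(1): y_t = A y_{t-1} + noise
A_true = np.array([[0.5, 0.1],
                   [0.2, 0.4]])
T = 500
y = np.zeros((T, 2))
for t in range(1, T):
    y[t] = A_true @ y[t - 1] + 0.1 * rng.standard_normal(2)

# Recover A by least squares: regress y_t on y_{t-1}
X, Y = y[:-1], y[1:]
A_hat = np.linalg.lstsq(X, Y, rcond=None)[0].T
print(np.round(A_hat, 2))
```

Because both variables appear on the left- and right-hand sides, each coefficient captures how one series feeds back into the other, which is exactly the "potentially endogenous" treatment described above.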
Nevertheless, given the chaotic, extremely volatile and nonparametric nature of stock prices, these methods too are questioned when it comes to obtaining reliable results. If stock prices do not follow random walks, what processes do they follow? When it comes to observing patterns, various studies have shown that stock returns exhibit reversal at weekly and 3-5 year intervals, and drift over 12-month periods (bondt1985does, lo1990contrarian, jegadeesh1993returns). Several models have been proposed to capture the predictability of stock returns using the approaches mentioned above. grossman1996equilibrium evaluates some popular models using a Kalman filter technique and finds that they have serious flaws. Popular models are usually too restrictive: they fail to perform well in empirical tests and perform even worse in out-of-sample tests.
Despite the difficulty of the task, the different estimation models and the different ways of correcting for biases, there is a consensus among financial economists nowadays that stock returns contain a significant predictable component based on in-sample tests (Campbell 2000). Nevertheless, in out-of-sample forecasting tasks, the predictive ability is quite low (bossaerts1999implementing, goyal2003predicting, welch2008goyal). This is mainly related to the dynamic and complex nature of the markets. Considering the difficulty of the task and the flaws of traditional estimation and forecasting approaches, ANNs are now widely used in finance and economics and have received special focus on the task of stock price forecasting. They have achieved high accuracy even on out-of-sample, or test, sets. ANNs have become an attractive tool for predicting stock market trends mostly because, provided some mild conditions are met, they can approximate any non-linear dynamic system to any desired accuracy.
In one of the earliest studies, kimoto1990stock used several learning algorithms and prediction methods for a Tokyo Stock Exchange Prices Index (TOPIX) prediction system. Their system used modular neural networks to learn the relationships among various factors. kamijo1990stock used recurrent neural networks for stock price pattern recognition, and ahmadi1990testability used a neural network to study arbitrage in the stock market. yoon1991predicting also performed predictions using neural networks. Some studies focused on forecasting the stock index futures market: trippi1992trading and choi1995trading predicted the daily direction of change in S&P 500 index futures using ANNs, and duke1993neural tried to predict daily movements of German government bond futures. Even though many of these did not achieve outstanding prediction accuracy, many others have shown that the ANN approach can outperform conventional methods (e.g. yao1995forecasting; van1996application; fernandez2000profitability).
Most of the studies that use ANNs to predict movements in stock prices use high frequency data, i.e. hourly or daily data. martinez2009artificial analyze intra-day price movements and predict the best times to trade and make profits within a day. Other studies, like senol2008stock, try to predict the direction of the movements rather than how much the stock will move, so the output is a categorical variable (goes up, goes down or stays the same). yao1995forecasting also use daily data to forecast the Kuala Lumpur Stock Exchange (KLSE) index. Many other studies focus on predicting indexes rather than individual stock prices, such as wanjawa2014ann and sheta2015comparison.
This work makes the following novel contributions to stock price forecasting: 1) it does not make assumptions about the functional form of the model, 2) it aims at predicting individual stock prices rather than price indexes, 3) it tries, given historical data, to predict one quarter or one month ahead rather than dealing with intraday or next-day predictions, 4) it only uses as input the historical stock price of each firm, and 5) it also offers a test for the relevance of the network inputs in the prediction task.
2 Data and Methodology
2.1 Data and statistics
Historical data on prices and dividend yields are taken from Quandl (QuandlDT), with data ranging from 1980 to 2017. Quarterly data on prices, i.e. the closing price at the end of each quarter, is used first; later on, monthly frequency data, i.e. the closing price at the end of each month, is used. Quandl provides stock price data for stocks listed on the NYSE, NASDAQ and AMEX. The number of companies used in the study is 439. On the left, Fig. 1 shows the time series of three sample stock prices. Some prices have a very steep time trend, for some the slope of the trend is lower, and some appear more stationary, oscillating around some average price value over time, even though the variance seems to change. These patterns highlight the highly non-stationary nature of the prices.
Each of the companies is observed for a different number of quarters. The shortest time series is restricted to 16 quarters; the longest one consists of 175 quarterly observations. The monthly time series are also restricted to a length of at least 16. Fig. 1 on the right shows the histogram of the number of years observed for the 439 firms in the sample.
Different models require different ways of dealing with time series of different lengths and different processing of the data. We consider two models to forecast future quarterly stock prices: a Recurrent Neural Network (hereafter RNN) and a Multilayer Perceptron (hereafter MLP). We also consider an MLP with monthly frequency data, and for the latter case we also offer a test of whether past changes in stock prices have any effect on future prices (Section 4). The next subsections explain in detail how the data is processed for each of the models considered. The MLP with monthly data is similar to the one with quarterly data.
In all the following sections, we use the TensorFlow library (tensorflow2015-whitepaper) in Python and an NVIDIA® GeForce® 940MX GPU.
2.2 Recurrent Neural Network
In this subsection we discuss how the data is processed in order to be used in an RNN (see graves2013speech). We then discuss the RNN architecture and, lastly, how learning is conducted.
Given the time series nature of the data, a natural choice for prediction is an RNN. Having chosen a model, the data must be processed accordingly. The data from Quandl is obtained in the form of a data frame and is then reconstructed as a dictionary, with the stock name as key and the associated matrix, , as value. For stock , where , the matrix is a matrix, where is the number of time periods for which data is available on stock and the three columns correspond to time, past prices and current prices, respectively. The second column is the third column shifted up by one row. The first observation is lost, since there is no price history with which to predict the price of the first observation. We need to store the data as a matrix, where . For this purpose, we need all the firms to have the same time dimension. Thus, we replace the non-available data (the missing values) of each matrix with zeros. The original matrix and the augmented matrix are shown below:
The time column is augmented by the same step (here it moves quarterly). The two other columns are padded with zeros. For this model we try using the variables in levels, in logs and in first differences of the logs. Using the first difference improves stationarity, increasing performance after training. Fig. 2 shows the time series of a sample stock in levels and in first differences of the logs.
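Both steps can be sketched as follows; the helper name, matrix values and price series are hypothetical, since the paper's own implementation is not shown:

```python
import numpy as np

def pad_to_length(M, T_max):
    """Pad a (T_i, 3) matrix [time, past price, current price] to
    (T_max, 3): extend the time column by its own (quarterly) step
    and fill the two price columns with zeros."""
    T_i = len(M)
    if T_i >= T_max:
        return M
    step = M[1, 0] - M[0, 0]
    pad = np.zeros((T_max - T_i, 3))
    pad[:, 0] = M[-1, 0] + step * np.arange(1, T_max - T_i + 1)
    return np.vstack([M, pad])

# Hypothetical short series for one stock
M = np.array([[0.00, 10.0, 11.0],
              [0.25, 11.0, 12.5]])
padded = pad_to_length(M, 4)

# First difference of logs of a trending price series (improves stationarity)
p = np.array([10.0, 12.0, 14.5, 17.0, 21.0])
r = np.diff(np.log(p))  # approximately the period-to-period return
print(padded)
print(np.round(r, 3))
```

The log-difference transform removes the multiplicative trend visible in the level series, which is why it helps with the non-stationarity discussed above.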
Now the augmented values of can be interpreted as zero changes in the stock price from one period to the next. A typical RNN architecture with one hidden layer is shown in Fig. 3.
Let be given by:
and the predicted value at time is given by:
, and denote the weights, and denote the biases, and and are the nonlinearities used. The process has an output at every step. First, we need to propagate forward to the end in order to obtain the cost, which we denote by . In this case we use the mean squared error:
where is the actual value of the stock price of company at time and is the predicted one given by Eq. 4. It is clear that, with this formulation, the calculation of the partial derivative of the cost with respect to requires back-propagating to the beginning at each step. One concern in this case is vanishing/exploding gradients, as well as exponential memory decay over time (see pascanu2013difficulty). To address this, we allow for manipulation of the memory of the system using LSTM and GRU cells. These are gated RNN architectures that control the information flow through a number of gates: the LSTM has 3 gates (input, forget and output), while the GRU has 2 (reset and update) (see hochreiter1997long and cho2014learning). These two mechanisms increase the memory of the network and are expected to give more accurate results.
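To make the recurrence and the cost concrete, here is a numpy sketch of the forward pass of a basic RNN with an output at every step and a mean squared error cost; all shapes, weight values and input series are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)

T, n_in, n_h = 8, 1, 4                       # time steps, input size, hidden size
Wx = 0.1 * rng.standard_normal((n_h, n_in))  # input-to-hidden weights
Wh = 0.1 * rng.standard_normal((n_h, n_h))   # hidden-to-hidden weights
Wy = 0.1 * rng.standard_normal((1, n_h))     # hidden-to-output weights
bh, by = np.zeros(n_h), np.zeros(1)

x = rng.standard_normal((T, n_in))  # inputs: past price changes (hypothetical)
y = rng.standard_normal(T)          # targets: current price changes (hypothetical)

h = np.zeros(n_h)
preds = np.zeros(T)
for t in range(T):
    h = np.tanh(Wx @ x[t] + Wh @ h + bh)  # hidden state update
    preds[t] = (Wy @ h + by)[0]           # output emitted at every step

cost = np.mean((preds - y) ** 2)          # mean squared error over the sequence
print(cost)
```

Because the hidden state at time t depends on every earlier state, differentiating the cost with respect to the weights unrolls this loop backwards, which is where the vanishing/exploding gradient problem arises.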
The optimization algorithm chosen for this model is the Adam optimizer (see kingma2014adam and le2011optimization). The batch size used is , and we use a piecewise constant learning rate with boundaries set at , and and learning rates of , , and , respectively. A hyperbolic tangent nonlinearity is used for all the hidden layers except the last one, where no nonlinearity is used. The sample is divided randomly into a train set consisting of of the total sample and a test set consisting of of the total sample.
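A piecewise constant schedule of this kind can be sketched as follows; the boundaries and rates below are placeholders, not the values actually used in the experiments:

```python
def piecewise_constant_lr(step, boundaries, rates):
    """Piecewise constant schedule: rates[i] applies while the global
    step is below boundaries[i]; the last rate applies afterwards."""
    for b, r in zip(boundaries, rates):
        if step < b:
            return r
    return rates[-1]

# Placeholder boundaries and learning rates (one more rate than boundaries)
boundaries = [1000, 5000, 10000]
rates = [1e-2, 1e-3, 1e-4, 1e-5]

print(piecewise_constant_lr(500, boundaries, rates))    # 0.01
print(piecewise_constant_lr(20000, boundaries, rates))  # 1e-05
```

TensorFlow provides an equivalent built-in schedule, so in practice this logic would be delegated to the library rather than hand-written.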
2.3 Multilayer Perceptron
In this subsection, the MLP (see lecun2015deep) is discussed, including how the data is processed, the architecture of the MLP and how learning is conducted.
The input in each period is a vector of past prices rather than a scalar. With such a model in mind, the data should be reprocessed accordingly. Now, for every firm, the input matrix is of the form shown in Eq. 6 and the output is of the form shown in Eq. 7:
The architecture of a one-layer perceptron is shown in Fig. 3, where, in order to predict each period's price of firm at , the inputs used are , , …, , for a choice of . Here denotes the element in the -th row and -st column of the matrix , denotes the element in the -th row and -nd column of , and so on.
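A sketch of how such lagged input rows can be built from a single price series; the helper name and the series are hypothetical:

```python
import numpy as np

def lag_matrix(p, k):
    """Each row of X holds the k prices preceding the corresponding
    target in y, so row t is used to predict p[t + k]."""
    X = np.column_stack([p[i:len(p) - k + i] for i in range(k)])
    y = p[k:]
    return X, y

# Hypothetical price series
p = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X, y = lag_matrix(p, 3)
print(X)  # each row: three lagged prices
print(y)  # the next price to predict
```

Each additional lag widens the input vector by one column and shortens the usable sample by one observation, which is the trade-off explored in the lag comparisons later in the paper.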
For this model, we use the variables in logarithmic form. The time series of the price of one of the sample stocks is shown in Fig. 4, together with its logarithmic transformation. The optimization algorithm chosen for this model is again the Adam optimizer. The loss function chosen is the mean squared error, shown in Eq. 10. The batch size used is , and we use a piecewise constant learning rate with boundaries set at , and and learning rates of , , and , respectively, for each interval.
3.1 Results from RNN model
This subsection discusses the results obtained from the RNN. Fig. 5 on the left shows, on a semilog scale, the mean squared error loss during training for every batches of samples. The model used is an RNN with hidden layers and basic cells. Fig. 5 on the right shows the predicted values of quarterly prices for random actual-predicted pairs in the test set. The black lines represent the actual time series of prices and the blue lines represent the predictions over time. The test loss reached after all the iterations is , and the results do not seem very satisfactory. The model fails to predict very high or low deviations from zero.
We repeat the experiment for one and five hidden layers and use basic RNN cell, LSTM cell and GRU cell. Table 1 shows the mean squared error loss in each of the cases.
From these results, we can see that LSTM and GRU achieve the lowest mean squared error with 5 hidden layers. This is because they allow for an increase in the memory of the network compared to the basic RNN cell.
3.2 Results from the MLP model
The MLP model results are discussed in the following subsections. The first subsection reports the results from an MLP where the train and test sets are split randomly. The next subsection shows the results when the train set consists of observations before 2016 and all other observations are in the test set.
3.2.1 Random split of the train and test set
First, we report results from an MLP model with hidden layer and lags of to predict the next price for each firm . The “lags” of a time series are all its past observations, i.e. , , …. The mean squared error loss during training, on a semilog scale, for every batches of samples is shown in Fig. 6 on the left.
The actual versus the predicted values from this model are shown in Fig. 6 on the right. Here we randomly pick from the test set values of prices for some firm in the sample at some point in time. We plot these values in black and, for each of them, we show in blue the predicted values from the model. The mean squared error loss is .
We also report the results of the model when we use different numbers of hidden layers and different numbers of lags of to predict the next values of for firm . These results are shown in Table 2. From these results, we can see that the minimum mean squared error is achieved when using hidden layer and lags of as input to predict .
3.2.2 Non random split of train and test set
For this section we again use the variables in logarithmic form. Fig. 7 on the left shows, on a semilog scale, the mean squared error loss during training for every batches of samples. Fig. 7 on the right shows the actual values (black line) versus the predicted values (blue line) for some randomly picked observations in the sample.
Table 3 compares different model specifications in terms of the mean squared error. The best specification seems to be the MLP with a value of and hidden layers, where the error reached is . The error is lower than in the model with the random split because the test set here is smaller: in the random split, the test set consists of of the sample, while here it contains only the observations in and after .
4 A test for the relevance of network inputs
In this section we formally test the null hypothesis that past prices do not affect present stock prices against the alternative that they do. For this task, we use the results from white2001statistical. The authors use a one-hidden-layer perceptron, with a continuous nonlinearity
for the hidden layer and a linear activation function for the output layer. Formally, the output is given by:
where are the weights, denotes the vector of inputs with , and is the number of neurons in the hidden layer. We use 1 hidden layer with 16 neurons and a hyperbolic tangent as the hidden-layer nonlinearity. Learning is done by minimizing the following:
where is the number of observations, denotes the target and is the function defined in Eq. 11. During training, we use a piecewise constant learning rate with boundaries at and and learning rate values of , and , respectively, for each interval. Observations before 2012 are used for the train set and observations after 2012 for the test set.
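The network in Eq. 11 — a tanh hidden layer feeding a linear output — can be sketched as follows; the weights here are random placeholders rather than trained values:

```python
import numpy as np

rng = np.random.default_rng(2)

q, d = 16, 5                               # hidden neurons, number of inputs
W = 0.1 * rng.standard_normal((q, d + 1))  # hidden weights (column 0 for the constant)
beta = 0.1 * rng.standard_normal(q)        # hidden-to-output weights

def f(x, W, beta):
    """One-hidden-layer perceptron: tanh hidden units, linear output."""
    x_tilde = np.concatenate(([1.0], x))   # prepend the constant input
    return float(beta @ np.tanh(W @ x_tilde))

x = rng.standard_normal(d)
print(f(x, W, beta))
```

Training would then adjust W and beta to minimize the mean squared error over the sample, as stated in the learning objective above.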
With the appropriate restrictions discussed in white2001statistical, it follows that a weight vector exists, and converges almost surely to , which solves:
The hypothesis we are interested in testing is:
where is the set of indexes specifying the inputs whose relevance we are interested in. The authors consider the following statistic:
where denotes . This statistic is if and only if is . The authors show that the statistic has an asymptotic mixture distribution. Since the computation of a critical value is complicated, they use the bootstrap method, with a bootstrap statistic given by:
Here are the weights obtained from the bootstrap sample and . The bootstrap method consists of the following steps:
Use the original sample to solve the minimization problem and obtain .
Draw a sample with replacement from the original sample and compute resampled weights by solving the minimization problem with the new sample.
Compute the bootstrap statistic, .
Repeat steps 2 and 3 times (e.g. or ).
Compute the -value.
We use bootstrap samples for each model. The weights for each bootstrap sample are learned by setting the initial values of the weights to .
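The resampling loop in steps 1-5 can be sketched generically; the data and the statistic below (a simple mean) are placeholders standing in for the paper's test statistic:

```python
import numpy as np

rng = np.random.default_rng(3)

def bootstrap_pvalue(sample, statistic, n_boot=999):
    """Steps 2-5: resample with replacement, recompute the statistic,
    and report the proportion of bootstrap statistics exceeding the
    original one (with a common +1 small-sample adjustment)."""
    original = statistic(sample)
    exceed = 0
    for _ in range(n_boot):
        resampled = rng.choice(sample, size=len(sample), replace=True)
        if statistic(resampled) > original:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)

# Placeholder data and statistic, just to show the mechanics
data = rng.standard_normal(200)
p = bootstrap_pvalue(data, np.mean)
print(p)
```

In the paper's setting, `statistic` would involve retraining the network on each resampled dataset, which is why the procedure is computationally intensive.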
Following the methods described in white2001statistical, one should train separately for each firm, for each frequency (daily, weekly, monthly), and for different time periods. Due to the computational intensity of experimenting with firms, different frequencies and different time periods, we only perform the test on the top firms, with monthly frequency and a time period from 1980 to 2017. Thus, our hypothesis test is now restricted to: “are past prices of those 5 firms relevant in predicting current prices?". The firms used are the top 5 companies on NASDAQ in terms of market capitalization (as of 2017): Apple, Inc. (AAPL), Comcast Corporation (CMCSA), Intel Corporation (INTC), Microsoft Corporation (MSFT) and PepsiCo, Inc. (PEP), at monthly frequency and with the 3 models in Eqs. 19, 18 and 17.
The -value is equal to the proportion of bootstrap samples that give a higher statistic than the original one. For example, if the number of samples exceeding the original statistic is and the total number of bootstrap statistics is , the -value is calculated as . In this work we use a ratio of to provide more robust estimates. Also, we use percentage price changes, and for a given firm we estimate the following three models:
We then have to test the null hypothesis across multiple models, with different lags and different firms. We need an upper bound on the -value for the joint null hypothesis that in none of a particular group of models do the lagged changes matter. For this, we use the Bonferroni inequality for multiple hypotheses. With the Bonferroni method, we reject the null at the level if , where is the number of models and is the smallest -value across all the models. The Bonferroni -value bound is thus . One could obtain tighter bounds through a modification of the Bonferroni inequality, as in hochberg1988sharper.
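The Bonferroni bound itself reduces to a one-liner; the per-model p-values below are placeholders:

```python
def bonferroni_bound(p_values):
    """Bonferroni upper bound on the joint p-value: the number of
    models times the smallest p-value, capped at 1."""
    return min(1.0, len(p_values) * min(p_values))

# Placeholder per-model p-values
print(bonferroni_bound([0.004, 0.03, 0.12]))  # ≈ 0.012
```

If this bound falls below the chosen significance level, the joint null (that lagged changes matter in none of the models) is rejected.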
shows the bootstrap statistics for all the firms, trained with the 5-lag model. The statistic does not seem to be normally distributed; nevertheless, the distribution appears approximately symmetric, with the exception of AAPL and PEP. The -values for all models we estimate are given in Fig. 9. With a value of equal to , and , we can reject the null hypothesis that in none of the models do lagged changes matter.
This work questions the efficient-market hypothesis using DNNs. For this purpose, we use an RNN and an MLP to forecast the next quarter's stock movements using historical stock prices. We train the MLP both with a random and with a non-random train/test split. We also conduct a formal statistical test of the null hypothesis that past prices do not affect current prices against the alternative that they do. The results reject the null hypothesis that lagged prices matter in none of the models we estimate. These results contradict the efficient-market hypothesis, in line with the work of basu1977investment, rosenberg1985persuasive and others. This encourages the use of better models, different specifications and different processing of the data to predict future stock movements more accurately.