Training a practical and effective model for stock selection has long been a major concern in the field of artificial intelligence. Because the financial market is uncertain and sensitive, many factors can influence stock prices, such as significant events, economic conditions, or political turmoil. Stock price time series exhibit nonlinearities, discontinuities, and high-frequency multi-polynomial components.[hadavandi2010integration] To handle these complicated components and make precise predictions, many scholars have applied machine learning methods to fit a model to such data.
1.1 Related Work
Even though some models from previous work have achieved good performance in the U.S. market using low-frequency data and features, training a suitable model on high-frequency stock data remains a problem worth exploring. A further challenge faces stock trading novices without experience in this field: although experts have created low-frequency features, constructing useful high-frequency features that carry high-level information is difficult for us. Moreover, many existing features are calculated from U.S. stock market indices, which differ from those of the Chinese stock market. Therefore, we prefer methods that do not require us to construct features ourselves. In this paper, we apply two machine learning algorithms, LSTM (long short-term memory) and CNN (convolutional neural network), to find the most profitable stock trading strategy in the Chinese stock market.
Using high-frequency price data from the preceding period (one day or several days), we construct two models that predict the expected rate of return of each stock for the day, and we select the stocks with the highest expected return at the opening to maximize the total return. The rest of the paper is organized as follows: Section 2 presents the general concepts, advantages, and detailed construction of the two models, together with our data preparation. Section 3 describes hyperparameter selection, regularization, and how we train our models. Section 4 compares the accuracy, confusion-matrix heat maps, and backtesting results of the models to analyze their strengths and weaknesses. Finally, we summarize our work and indicate several directions for future work.
2.1 Long Short-Term Memory
Long Short-Term Memory (LSTM) is a special kind of RNN (Recurrent Neural Network) that was first proposed by Hochreiter and Schmidhuber in 1997.[hochreiter1997long] Gers et al. improved its structure by introducing the forget gate in 2000.[gers1999learning] A few years later, H. Sak et al. and W. Zaremba et al. refined the LSTM framework and added more details.[sak2014long, zaremba2014recurrent]
Nowadays, LSTM is widely used in natural language processing and sentiment analysis.
2.1.1 Basic Theorem
Traditional neural networks are widely used in image processing and other problems where temporal order has no significant effect. However, some tasks depend on time series, such as natural language processing and stock price prediction. In such tasks, we want to receive new information while retaining preceding information, because understanding earlier events helps us interpret current material. RNNs address this issue by using networks with loops that allow information to persist.
If we unfold an RNN, we obtain the chain-like sequence of neural networks shown in Figure 1, which is closely tied to the time series. Although an RNN can connect earlier information to the present task, we may need long-term memory to inform the understanding of current work. Consider two stocks with similar price trends in the present week: based on this week's data alone, we hesitate over which one to buy. If we know that in the last month one of them kept hitting the upper price limit while the other kept falling, we can decide to buy the one that kept rising, based on our understanding of the earlier information. LSTM handles this situation and works tremendously well on time-series problems with long-term information.[colahblog] In the basic RNN model, if we unfold a one-layer RNN, we can regard it as a deep feedforward neural network whose dense layers all share the same weights. By adding more neural network layers and gates, LSTM becomes able to handle long-term dependencies.
2.1.2 Advantages of LSTM
Compared with other artificial neural networks, RNNs have an advantage on problems that are sensitive to temporal order, such as language translation and sentiment analysis. Although their purpose is to learn long-term dependencies, there is evidence that recurrent neural networks have trouble learning and keeping long-term memory. Fortunately, by imitating the human memory mechanism, LSTM addresses this problem. To learn which information is meaningful and which memory can be discarded, LSTM separates the unit state into a long-term memory state and a short-term memory state and adds a gate system. According to previous studies, LSTM is more effective than a traditional RNN: it not only converges quickly but also mitigates the vanishing and exploding gradient problems. Moreover, because LSTM can maintain long-term memory, it is sensitive to long-term dependencies in the data.
2.1.3 Structure of LSTM
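Figure 2 shows that the LSTM unit state consists of two vectors, $h_t$ and $c_t$, which represent the short-term memory state and the long-term memory state. Each cell has three gates (fully connected layers) at time $t$: $f_t$ (forget gate), $i_t$ (input gate), and $o_t$ (output gate). The activation function of the gates is the logistic function, whose output lies in the range from 0 to 1. The gates use element-wise multiplication to control how much data is retained.
$h_t$: short-term memory state
$c_t$: long-term memory state
$x_t$: a vector corresponding to the input time series
$f_t$: decides which part of $c_{t-1}$ from the previous time step is stored in the current $c_t$
$g_t$: analyzes $x_t$ and $h_{t-1}$
$i_t$: decides which part of the output returned by $g_t$ is added to the current $c_t$
$o_t$: decides which part of the current $c_t$ is regarded as the current $h_t$ and the output of the cell at time $t$
Their computational formulas are as follows:
$$
\begin{aligned}
f_t &= \sigma\left(W_{xf}^{\top} x_t + W_{hf}^{\top} h_{t-1} + b_f\right) \\
i_t &= \sigma\left(W_{xi}^{\top} x_t + W_{hi}^{\top} h_{t-1} + b_i\right) \\
o_t &= \sigma\left(W_{xo}^{\top} x_t + W_{ho}^{\top} h_{t-1} + b_o\right) \\
g_t &= \tanh\left(W_{xg}^{\top} x_t + W_{hg}^{\top} h_{t-1} + b_g\right) \\
c_t &= f_t \otimes c_{t-1} + i_t \otimes g_t \\
h_t &= o_t \otimes \tanh\left(c_t\right)
\end{aligned}
$$
$\sigma$ represents the function as follows:
$$\sigma(x) = \frac{1}{1 + e^{-x}}$$
$\tanh$ represents the function as follows:
$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$
The weight matrices $W$ and the biases $b$ are learned by the backpropagation algorithm.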
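To make the gate mechanism of this subsection concrete, the following is a minimal NumPy sketch of one LSTM time step, not the implementation used in our experiments. It assumes the standard formulation in which the forget, input, and output gates are logistic layers, the candidate vector is a tanh layer, the long-term state is updated as forget gate times old state plus input gate times candidate, and the short-term state is output gate times tanh of the long-term state. The function names, the hidden size of 8, and the random initialization are illustrative assumptions.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step.

    params maps a gate name ("f", "i", "o", "g") to a tuple (W_x, W_h, b).
    Returns the new short-term state h_t and long-term state c_t.
    """
    def affine(name):
        w_x, w_h, b = params[name]
        return x_t @ w_x + h_prev @ w_h + b

    f_t = sigmoid(affine("f"))    # forget gate: how much of c_prev to keep
    i_t = sigmoid(affine("i"))    # input gate: how much of g_t to add
    o_t = sigmoid(affine("o"))    # output gate: how much of c_t to expose
    g_t = np.tanh(affine("g"))    # candidate computed from x_t and h_prev
    c_t = f_t * c_prev + i_t * g_t
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t


def init_params(n_in, n_hidden, seed=0):
    """Small random weights and zero biases (an illustrative assumption)."""
    rng = np.random.default_rng(seed)
    return {
        name: (
            rng.normal(0.0, 0.1, (n_in, n_hidden)),
            rng.normal(0.0, 0.1, (n_hidden, n_hidden)),
            np.zeros(n_hidden),
        )
        for name in "fiog"
    }


def run_lstm(sequence, n_hidden=8, seed=0):
    """Feed a (time_steps, n_features) sequence through the cell step by step."""
    params = init_params(sequence.shape[1], n_hidden, seed)
    h = np.zeros(n_hidden)
    c = np.zeros(n_hidden)
    for x_t in sequence:
        h, c = lstm_step(x_t, h, c, params)
    return h, c
```

For example, `run_lstm(np.ones((16, 11)))` processes a dummy sequence of 16 steps with 11 features each and returns the final short-term and long-term states.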
2.1.4 LSTM Model Construction
Here, we use $x_t^{(s)}$ to represent the feature vector of stock $s$ at time $t$. We treat a time series $\left(x_1^{(s)}, \dots, x_T^{(s)}\right)$ as one instance and use it as the input of our LSTM network, which predicts the profit of the next day. To make the problem tractable, we cast it as a classification problem and build the LSTM-based classification model shown in Figure 3.
The LSTM layer of this model was introduced in the previous section. In the dropout layer, each cell stops working with a certain probability at each training step, which helps prevent the model from over-fitting. At prediction time, all neurons participate so that the model is used at full capacity. Earlier experiments have confirmed that adding this layer regularizes the model well.[hochreiter1997long]
The Dense layer has no activation function and outputs, for each category, a score for the predicted instance belonging to that category. The softmax layer then converts the scores to the corresponding probabilities using the following equation:
$$p_k = \frac{e^{s_k}}{\sum_{j=1}^{K} e^{s_j}}$$
In this equation, $p_k$ and $s_k$ represent the probability and the score, respectively, that the predicted instance belongs to category $k$, and $K$ is the number of categories. The category with the highest probability is regarded as the model's prediction for the instance.
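The conversion from scores to probabilities can be illustrated with a short NumPy sketch of the standard softmax function. The example scores are made up for illustration; subtracting the maximum score before exponentiating is a common numerical safeguard that does not change the result.

```python
import numpy as np


def softmax(scores):
    """Convert raw category scores into probabilities that sum to 1."""
    scores = np.asarray(scores, dtype=float)
    shifted = scores - scores.max(axis=-1, keepdims=True)  # avoids overflow
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hypothetical scores from the Dense layer for the four categories.
scores = np.array([2.0, 1.0, 0.5, -1.0])
probs = softmax(scores)
predicted_category = int(np.argmax(probs))  # index of the most likely category
```

The predicted category is simply the index of the largest probability, which is also the index of the largest score.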
2.2 Convolutional Neural Network
2.2.1 Basic Theorem
The convolutional neural network is a multilayer supervised learning network trained by back-propagation, first proposed by Y. LeCun in 1995.[lecun1995convolutional] CNN achieves some degree of invariance to distortion and deformation by using three architectural ideas: local receptive fields, shared weights, and spatial subsampling. Local receptive fields are inspired by the cortical structure of animal vision, in which only part of the neurons respond when sensing the external environment. They are used because adjacent pixels in an image are closely correlated, so each cell only needs to perceive a local area without connecting to the whole picture.[lawrence1997face] By combining shared weights with local receptive fields, CNN reduces the number of parameters.
Like other neural networks, a CNN has an input layer, an output layer, and several hidden layers. These three types of layers are the core components through which a CNN performs feature extraction. By using gradient descent to adjust the weight parameters layer by layer, the model minimizes the loss function and improves the accuracy of the network through repeated training iterations. The lower hidden layers of a convolutional neural network consist of convolutional layers and max-pooling layers. The upper layers are fully connected and correspond to the hidden layers and logistic regression classifier of a traditional multi-layer perceptron. The input of the first fully connected layer is the feature map, with spatial and temporal dependencies, obtained through feature extraction by the convolutional and sub-sampling layers. The output layer is a classifier, which can use logistic regression, an SVM (support vector machine), or other methods.
2.2.2 CNN Model Construction
As introduced above, by convolving a kernel with the regions of a two-dimensional input (e.g. a picture), a Convolutional Neural Network (CNN) converts the primary features of each region into higher-level features and thus serves as a feature extraction tool. This function of CNN can be applied to our task because the input data of a stock for one day are two-dimensional: one dimension is the set of features and the other is the time periods in a day. The input data are described in Table 2 in a later section. As for the shape of the convolution kernels, we do not need the shapes common in image recognition, such as 3×3 or 5×5. When dealing with high-frequency price-volume data, the shape of the convolution kernels should be designed for the specific purpose.
E. Hoseinzade and S. Haratizadeh introduced a CNN structure that predicts stock prices from a set of low-frequency features. The CNN framework proposed in this paper is inspired by their structure, because our task and data structure are very similar to theirs. The difference is that they use low-frequency data, with days rather than minutes as the time period and with low-frequency features such as EMA10. Their framework is described in detail when it is compared with ours in the following sections.
2.2.3 CNN Framework Proposed
In the framework by E. Hoseinzade and S. Haratizadeh [hoseinzade2019cnnpred], the first layer is a convolution layer in which convolution kernels are used to extract high-level features. The CNN framework proposed in this paper uses the same first layer. Specifically, since we use 11 price-volume features at a frequency of 15 minutes, we use convolution kernels of shape 1×11. Each of these filters covers all the variables of one time period and can combine them into a single higher-level feature. Different kernels can construct different combinations of the primary variables. The network can also drop useless features by setting their corresponding weights in the kernels to zero. So, this layer works as a feature extraction or feature selection module.
After the hyperparameter search, we decided to use 40 kernels, which balances the training cost of the model against its prediction effectiveness. Through this convolution layer, we obtain 40 high-level features extracted from the 11 primary features, and the two-dimensional data structure changes from 16×11 (time periods × original features) to 16×40 (time periods × high-level features).
Fully Connected Layers
In these layers, the 16×40 data generated by the convolution layer are flattened into a final feature vector. This feature vector is then converted to a final prediction through 2 fully connected layers (2 hidden layers). The output layer has 4 neurons with the softmax function as the activation function; they give the probabilities of a big rise, a small rise, a big drop, and a small drop of a stock, respectively. From these probabilities we calculate the expectation of the stock's daily return today to make stock purchase advice. The whole framework is shown in Figure 4.
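The data flow through this framework can be sketched in NumPy as follows. The sketch relies on the fact that a 1×11 kernel applied to each row of a 16×11 input is the same as one linear map from 11 features to 1 value per time period, so 40 kernels amount to a single 11×40 matrix product. The hidden layer sizes of 250 and 100 with ReLU follow Section 3.3.1. Applying ReLU after the convolution, the random weights, and all function names are assumptions made for illustration, not details taken from our trained model.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def softmax(z):
    exp = np.exp(z - z.max())
    return exp / exp.sum()


def init_params(time_steps=16, n_features=11, n_kernels=40, seed=0):
    """Random illustrative weights for the conv layer and the dense stack."""
    rng = np.random.default_rng(seed)
    sizes = [time_steps * n_kernels, 250, 100, 4]
    return {
        # 40 kernels of shape 1x11, stored as one 11x40 matrix.
        "conv_w": rng.normal(0.0, 0.1, (n_features, n_kernels)),
        "conv_b": np.zeros(n_kernels),
        "dense": [
            (rng.normal(0.0, 0.05, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])
        ],
    }


def cnn_forward(x, params):
    """x has shape (time_steps, 11); returns high-level features and 4 probabilities."""
    # Convolution with 1x11 kernels: (16, 11) -> (16, 40).
    # ReLU here is an assumption; the text does not name this activation.
    high_level = relu(x @ params["conv_w"] + params["conv_b"])
    hidden = high_level.reshape(-1)           # flatten to 640 values
    hidden_layers = params["dense"][:-1]
    w_out, b_out = params["dense"][-1]
    for w, b in hidden_layers:                # two hidden layers: 250, 100
        hidden = relu(hidden @ w + b)
    probs = softmax(hidden @ w_out + b_out)   # 4 category probabilities
    return high_level, probs
```

Calling `cnn_forward` on a 16×11 array returns a 16×40 array of high-level features and a vector of four probabilities, matching the shapes described above.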
3.1 Data Processing
3.1.1 Sources of Data
We choose the following features from the China A-share market: closing price, opening price, highest price, lowest price, trading volume, transaction amount, number of transactions, commission ratio, volume ratio, commission purchase, and commission sale. These 11 price-volume features are used to describe the state of a stock. To train on stocks that represent the market and to reduce data inconsistency (such as stock suspensions) and noise, we select the constituents of the CSI300 Index as the source of the sample data set. The model uses two types of data: 15-minute data and 120-minute data. Sample data are shown in Tables 2 and 3.
3.1.2 Time Period of Data
As the sample set, we choose the data from July 1, 2014 to December 31, 2018 (Figure 5), a period during which the Chinese stock market experienced sharp rises, sharp falls, slight rises, and slight falls, providing sufficient samples for each of our four labels. Because our computing power is limited, we used only the data from January 2019 to May 2019 when adjusting the model.
3.1.3 Data Normalization
To speed up the convergence of the neural network and to eliminate the negative influence of the differing scales of the features on the model, the feature vector of each stock at each time is normalized by the equation as follows:
The label is then converted from the rate of return to a category.
3.1.4 Label Selection
We divide the daily rate of return into four categories: a big rise, a small rise, a big drop, and a small drop. To avoid category imbalance, we use a sample to estimate the distribution of the whole data set: taking the daily rates of return (closing price minus opening price) of all A-shares from June 2014 to December 2018 as the sample, we use its quartiles as the cut points of the classification.
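The quartile-based labelling can be sketched as follows. This is an illustration under stated assumptions rather than our actual pipeline: the returns are synthetic random numbers, and the integer coding of the labels (0 for a big drop up to 3 for a big rise) is a convention chosen for this example.

```python
import numpy as np


def quartile_cut_points(sample_returns):
    """Return the 25%, 50% and 75% quantiles of the sample of daily returns."""
    return np.quantile(sample_returns, [0.25, 0.5, 0.75])


def label_returns(returns, cut_points):
    """Map each return to a category index using the cut points.

    Assumed coding: 0 = big drop, 1 = small drop, 2 = small rise, 3 = big rise.
    """
    return np.digitize(returns, cut_points)


# Synthetic daily returns standing in for the A-share sample.
rng = np.random.default_rng(0)
sample = rng.normal(0.0, 0.02, size=10_000)
cuts = quartile_cut_points(sample)
labels = label_returns(sample, cuts)
counts = np.bincount(labels, minlength=4)  # roughly equal by construction
```

Because the cut points are the quartiles of the sample itself, each of the four categories receives about a quarter of the sample, which is the balance property the text aims for.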
3.1.5 Division of Training Set and Test Set
Chinese stock market opening hours: 9:30-11:30 and 13:00-15:00.
We use the first 80% of the dates in the data set as the training set and the remaining 20% as the test set to avoid data snooping. Accuracy on the training set and on the test set is compared during training to determine whether the model is over-fitting. The same data set is shared by the LSTM model and the CNN model throughout this section, but the two models take different inputs. The LSTM model uses two types of data: the first is the time series of price data for every 15 minutes of the previous day (240 minutes, or 16 steps, in total), and the second is the price data for every 120 minutes over the previous 10 days, a time series of 20 steps. The CNN model uses one type of data: the price-volume data for every 15 minutes over the previous five days.
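A chronological split of this kind can be sketched in a few lines of NumPy. The function name and the toy dates below are hypothetical; the point of the sketch is that the split is made on the sorted unique dates, so every training sample precedes every test sample and no future information leaks into training.

```python
import numpy as np


def chronological_split(dates, train_fraction=0.8):
    """Return boolean masks (train, test) that split samples by date.

    dates holds one entry per sample; several samples (stocks) may share a date.
    """
    dates = np.asarray(dates)
    unique_dates = np.unique(dates)              # sorted ascending
    n_train = int(len(unique_dates) * train_fraction)
    first_test_date = unique_dates[n_train]
    train_mask = dates < first_test_date
    return train_mask, ~train_mask


# Toy example: 10 trading dates, 3 stocks per date, in shuffled order.
days = np.arange("2018-01-01", "2018-01-11", dtype="datetime64[D]")
sample_dates = np.repeat(days, 3)
np.random.default_rng(0).shuffle(sample_dates)
train_mask, test_mask = chronological_split(sample_dates)
```

In the toy example the first 8 dates go to the training set and the last 2 to the test set, regardless of the order in which the samples are stored.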
3.2 Experiment of the LSTM model
3.2.1 Hyperparameter Selection for Loss Function and Optimizer
Because the model addresses a classification problem, we chose the cross-entropy cost function as the loss function:
$$L = -\frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K} y_{ik} \log p_{ik}$$
In this equation, $N$ is the total number of samples; $K$ is the total number of categories; $y_{ik}$ represents the real category (only 0 or 1), and $p_{ik}$ is the probability of the corresponding category output by the softmax layer.
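As a small worked illustration, the cross-entropy loss can be computed in NumPy as below. The sketch assumes one-hot real categories and a loss averaged over the samples; the clipping constant `eps` is an implementation detail added here to avoid taking the logarithm of zero.

```python
import numpy as np


def cross_entropy(y_true, probs, eps=1e-12):
    """Mean cross-entropy between one-hot labels and predicted probabilities.

    y_true and probs both have shape (n_samples, n_categories).
    """
    y_true = np.asarray(y_true, dtype=float)
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    per_sample = -np.sum(y_true * np.log(probs), axis=1)
    return float(np.mean(per_sample))


# A model that outputs 25% for every category scores log(4) on any labels.
y = np.eye(4)
uniform = np.full((4, 4), 0.25)
baseline_loss = cross_entropy(y, uniform)
```

The value `log(4)`, about 1.386, is therefore the loss of a model that guesses uniformly among the four categories, which gives a reference point when reading the training curves.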
For the optimizer, we tested three adaptive optimizers, Adam, Adadelta, and RMSProp, each with a learning rate of 0.001. Their performance is shown in Figure 6. The test uses mini-batch gradient descent: each batch contains 30 samples, and training runs for 50 iterations over all samples. All the following tests use the same settings.
As the graph shows, the Adadelta optimizer is inferior, while the Adam and RMSProp optimizers are equally effective. Based on this result, we chose the Adam optimizer for our model.
Regarding the regularization method, we compared the performance on the same data of the dropout layer and L2 regularization (weight decay). L2 regularization adds a regularization term to the loss function:
$$J(\theta) = L(\theta) + \lambda \sum_{i=1}^{n} \theta_i^{2}$$
where $J$ is the objective function, $L$ is the original penalty function (the cross-entropy cost function), $\theta$ is the vector of model parameters, $n$ is the total number of model parameters, and $\lambda$ is a hyperparameter. We tried different values of $\lambda$ and plot the results in Figure 8.
Moreover, Figure 9 shows the confusion-matrix heat maps of the three models at the epoch chosen with early stopping, which is used to avoid overfitting. The heat maps are normalized by index.
Based on the results shown in the graphs, we conclude that although L2 regularization can effectively avoid overfitting, the dropout layer makes the model converge faster to a better solution. Hence, we use the dropout layer for regularization in our model.
3.2.3 Keep Probability of the Dropout Layer
The keep probability is the probability that each value entering the dropout layer is retained. We test different keep probabilities to see their effect on the model; the results are shown in Figure 10.
The confusion-matrix heat maps of the three models at the epoch chosen with early stopping (used to avoid overfitting), normalized by index, are shown in Figure 11.
The graphs show that a keep probability of 0.8 performs better than 0.7 or 0.9: the prediction accuracy of the model is higher, so we choose a keep probability of 0.8.
3.2.4 Training the LSTM Model
We consider that the previous day's trading has a greater impact on the current day, and that the model should understand it in a more sensitive and fine-grained way, which means that higher-frequency data should be used for training. At the same time, the model should not ignore the longer-term influence of fluctuations over the past several days, so it also needs longer time intervals to learn the correlation between the data of the present day and those of several days before. Therefore, we feed two different forms of data into the same network structure and train two prediction models that emphasize different aspects of the market. We then combine the prediction results of the two models to get the final output. The first model takes as input the time series of price data per 15 minutes of the previous day (240 minutes in total, i.e. 16 steps) and predicts the next day's return. The second model takes the price data per 120 minutes over the previous 10 days (20 steps) and likewise predicts the next day's return.
The training results are as follows:
Figure 12 shows that both models are valid compared with the 25% accuracy of random selection, and that the two models differ in prediction quality. The confusion-matrix heat maps of the two models at the selected epoch are shown in Figure 13, normalized by columns.
The brighter areas of the image are concentrated near the diagonal, which again demonstrates the validity of the models. The lower right and upper left corners are the brightest, which indicates that the models predict sharp rises and sharp falls more accurately; specifically, the precision is higher for these categories. The darker area in the middle indicates that the models do not distinguish well between small rises and small falls. Overall, we use an early-stopping strategy with checkpoints and take the two models at the checkpointed epoch as the final models for prediction.
3.3 Experiment of the CNN Model
The CNN framework proposed here uses the same data input as LSTM, so the data processing is identical. Moreover, as with LSTM, the cross-entropy cost function is chosen as the loss function and Adam as the optimizer. See Sections 3.1 and 3.2.1 for details. The following sections focus on the comparison of different framework structures, a regularization method specific to CNN, and the expansion of the input data. Experiment results are presented to support our explanations.
3.3.1 The Difference and Improvements from Framework by E. Hoseinzade and S. Haratizadeh
In this section, we explain the difference between the two frameworks. Experiment results show that the CNN framework proposed here performs better, and we try to explain why.
As the framework in Figure 14 shows, Ehsan Hoseinzade and Saman Haratizadeh's framework is more complicated, with 3 convolution layers, 2 pooling layers, and one fully connected layer. As mentioned before, the first convolution layer has the same function as that of our framework, which is to extract high-level features from primary ones. The second convolution layer generates durational features by aggregating the information in consecutive time periods. It uses 3×1 convolution kernels, which generate new durational features containing information from 3 consecutive time periods. This design is inspired by the observation that most famous candle-stick patterns, such as Three Line Strike and Three Black Crows, look for meaningful patterns in three consecutive days. The third layer is a pooling layer that performs 2×1 max pooling, a very common setting for pooling layers. Next, the third convolution layer is similar to the second one and is intended to extract further durational features. It is followed by another max pooling layer of the same kind and a fully connected layer. At the very beginning, we applied a similar framework to our task, using 2 convolution layers to extract high-level features and durational features respectively, followed by a max pooling layer and finally a fully connected layer.
However, during experiments, we find that an average pooling layer performs better than a max pooling one. Here and below, better performance means a higher accuracy rate on the test set. As Figure 15 shows, removing the pooling layer leads to even better performance. We therefore assume that the pooling layer is unsuitable for our task because it discards too much information, and we remove it. After that, we find that removing the second convolution layer also leads to better performance. Our explanation is as follows. The idea of this layer is inspired by candle-stick patterns like Three Line Strike, and it is a good idea to combine information from the past 3 consecutive days to predict today's daily return. However, the time span of each period being combined should match that of the period being predicted. For instance, it may be effective to combine information from 3 consecutive 15-minute periods to predict the price direction in the next 15-minute period, but this may not hold when the whole daily return of today is being predicted. So, we remove the second convolution layer. At this point, one convolution layer and one fully connected layer remain in our semi-finished framework. Through experiment, we discover that adding another fully connected layer leads to better performance: the change increases the depth of the framework, which can thus better capture the information hidden in the data. However, increasing the number of fully connected layers to 3 or more does not lead to obvious improvement. So, we decide to employ 2 fully connected layers, with 250 and 100 neurons respectively and ReLU as the activation function.
3.3.2 Comparison with DNN
We compare our framework of one convolution layer plus 2 fully connected layers with a framework with only 2 fully connected layers, and the result, shown in Figure 16, favors ours. This demonstrates that it is effective to use one convolution layer to extract high-level features.
In our CNN framework, besides the commonly used dropout method in the fully connected layers, our regularization methods also include SpatialDropout, a dropout method designed for CNN. Ordinary dropout randomly chooses some components of the input data, without any fixed spatial pattern, and sets them to zero, while SpatialDropout randomly chooses some rows or columns of the input data and sets them to zero. As illustrated in Figure 17, this method has proved effective in image recognition.
In our framework, when training the model, we use this method in each batch to randomly choose 3 columns of the input data and set them to zero. That is, in each batch of the training set we drop 3 of the 11 primary price-volume features and use the 8 remaining features to construct high-level features. In this way, the model learns to use different combinations of 8 primary features to construct high-level ones, which enhances the CNN's ability to extract different features. In addition, dropout methods can themselves be understood as a low-cost ensemble strategy, similar in essence to a random forest. In our task, SpatialDropout forces the framework to approach the best model using whichever features happen to remain. Each primary feature is thus expected to contribute to the final model, which prevents the model from placing extra emphasis on a few primary features in the training set and therefore from overfitting.
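The column-wise dropping described above can be sketched as follows. This is a simplified NumPy illustration, not a library implementation: it drops the same 3 feature columns for every sample in a batch, as the text describes, and it does not rescale the kept values, whereas dropout layers in deep learning libraries commonly rescale them during training. The function name and the batch shape are assumptions.

```python
import numpy as np


def drop_feature_columns(batch, n_drop=3, rng=None):
    """Zero out n_drop randomly chosen feature columns of a batch.

    batch has shape (batch_size, time_steps, n_features). The same columns
    are dropped for every sample and every time step in the batch.
    Returns the masked batch and the sorted indices of the dropped columns.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_features = batch.shape[-1]
    dropped = rng.choice(n_features, size=n_drop, replace=False)
    mask = np.ones(n_features)
    mask[dropped] = 0.0
    return batch * mask, np.sort(dropped)


# Toy batch: 30 samples, 16 time steps, 11 price-volume features.
batch = np.ones((30, 16, 11))
masked, dropped = drop_feature_columns(batch, rng=np.random.default_rng(0))
```

After masking, exactly 8 of the 11 feature columns are non-zero in the toy batch, and a fresh set of columns would be drawn for the next batch.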
3.3.4 Input Data of Five Days VS. One Day
After the improvements mentioned above, the accuracy on the test set is still not satisfactory. We assume that the input data of one day are not sufficient to predict today's daily return, so we consider lengthening the time span of the input data. In the end, we decide to use the 15-minute price-volume data of the past 5 consecutive days to predict today's daily return. The experiment result in Figure 18 shows that this indeed improves the accuracy on the test set. We think this is because the longer span enables the model to better recognize the trends of the features, which contributes to better predictions.
3.3.5 Selection of Final Framework
In the end, we decide to use CNN+2Dense as the final framework and feed it with the 15-minute price-volume trading data of the past 5 consecutive days.
3.4 Make Stock Purchase Advice
In the section on model construction, we mentioned that the softmax layer outputs the probabilities that the current sample belongs to each of the four categories, and that the category with the highest probability is the prediction result. In practical applications, we want the model not only to predict the category of the sample but also to weigh benefits and risks comprehensively and give stock purchase advice directly. To achieve this purpose, we use the formula as follows:
In this formula, the predicted probability of each category for the input sample is weighted by the mean of all yields belonging to that category in the sample of daily returns, which we use to represent the expected rate of return of the category. The result is the final prediction of the sample's yield. The formula takes into account both the possible rise and fall of the sample and the predictions of the two models. After sorting the final predicted rates of return, the higher a stock ranks, the more strongly it is recommended.
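The idea of turning category probabilities into an expected rate of return can be sketched as a probability-weighted sum. The sketch below is an assumption-laden illustration, not the exact formula of this paper: the per-category mean returns are placeholder numbers, and averaging the probabilities of two models is only one plausible way of combining their predictions.

```python
import numpy as np

# Category order follows the text: big rise, small rise, big drop, small drop.
# Placeholder per-category mean returns, not values estimated from real data.
CLASS_MEAN_RETURNS = np.array([0.02, 0.005, -0.02, -0.005])


def combine_models(probs_a, probs_b):
    """Average the category probabilities of two models (assumed rule)."""
    return (np.asarray(probs_a, dtype=float) + np.asarray(probs_b, dtype=float)) / 2.0


def expected_return(probs, class_mean_returns=CLASS_MEAN_RETURNS):
    """Probability-weighted mean return; works for one stock or a matrix of stocks."""
    return np.asarray(probs, dtype=float) @ np.asarray(class_mean_returns, dtype=float)


def rank_stocks(probs_by_stock, class_mean_returns=CLASS_MEAN_RETURNS):
    """Return stock indices sorted from highest to lowest expected return."""
    scores = expected_return(probs_by_stock, class_mean_returns)
    return np.argsort(-scores), scores


combined = combine_models([0.1, 0.2, 0.3, 0.4], [0.3, 0.2, 0.3, 0.2])
score = expected_return(combined)
order, scores = rank_stocks([[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]])
```

A stock whose probability mass sits on the big-rise category receives a high score and ranks first, while one dominated by the big-drop category ranks last.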
4.1 Result Analysis
Finally, we connect the LSTM model and the CNN model to the backtesting framework, using the CSI300 Index as the baseline, and simulate trading from January 16, 2019 to May 31, 2019. The specific trading operation is as follows: with the constituents of the CSI300 Index as the stock pool, we trade according to the purchase advice given by the model. To account for the transaction fee, we buy no more than 20 stocks, each with a predicted profit of 0.14% or more. The funds are allocated equally, and the portfolio is changed daily. Our results are presented in Figures 21 and 22 and Table 4.
Both models clearly outperform the baseline when the handling fee is ignored, which demonstrates their effectiveness from another perspective. When the handling fee is considered, the model slightly outperforms the baseline. The curves of the two models also fluctuate more strongly than the baseline, which means that the models tend to make more aggressive trading decisions. We therefore conclude that our models are effective for stock return prediction with high-frequency primary price-volume data.
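The daily selection rule described above (at most 20 stocks, predicted profit of at least 0.14%, equal allocation of funds) can be sketched as follows. The function name and the example predictions are hypothetical, and the sketch covers only the selection step, not the fee model or the rest of the backtesting framework.

```python
import numpy as np


def select_portfolio(predicted_returns, max_stocks=20, min_return=0.0014):
    """Pick the top stocks by predicted return and weight them equally.

    Returns a dict mapping stock index to portfolio weight; the dict is
    empty when no stock reaches the minimum predicted return.
    """
    predicted = np.asarray(predicted_returns, dtype=float)
    order = np.argsort(-predicted, kind="stable")   # best prediction first
    chosen = [int(i) for i in order if predicted[i] >= min_return][:max_stocks]
    return {i: 1.0 / len(chosen) for i in chosen}


# Hypothetical predictions for four stocks on one day.
weights = select_portfolio([0.002, 0.001, 0.005, 0.0014])
```

In this example, stock 1 is excluded because its predicted return is below the threshold, and the remaining three stocks each receive one third of the funds.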
4.2 Future Work
Through the above tests and analysis, we found that our two models have the advantages of fast convergence, strong generalization ability, and no need to construct factors. However, the prediction accuracy still falls short of our expectations. In stock forecasting, LSTM and CNN do not reach the level of performance they achieve in natural language processing and image processing. In future research, we want to combine the strengths of LSTM and CNN in a new model. Specifically, we hope to use CNN's ability to automatically extract high-level features and enhance important ones in order to construct factors. The time series of high-level factors generated by the CNN is then used as the input to the LSTM, whose sensitivity to time series is used to predict future stock price movements. This new model may overcome the difficulty of constructing factors and enhance the predictive ability of our model.
Also, the data used in our data set are price-volume features per 15 minutes and per 120 minutes. Because of insufficient computing power, we did not use 1-minute or 5-minute data. Such higher-frequency data can capture subtle information and trends in the stock market more precisely. Therefore, in the future, we can use price-volume features per 1 minute as our data set to obtain more primary features and improve prediction accuracy.
Besides, because in our case we trade stocks every day, a lot of unnecessary transaction fees are generated. Considering this expense, we want to apply reinforcement learning to maximize future profits. Moreover, it is easy to construct the environment of the complicated and sensitive stock market without constructing features ourselves. Hence, reinforcement learning is a feasible direction for future work.
This paper applies deep neural networks to construct two models, LSTM and CNN, that forecast the expected rate of return of a stock for the current day and maximize the total return by adopting an appropriate strategy. We analyze the performance of LSTM and CNN and verify their effectiveness and rationality for forecasting stock prices. Although these two models have overcome some difficulties, there is still room for improvement, such as avoiding unnecessary transaction fees. In our later work, we will focus on solving these problems.
We would like to express our sincere appreciation to Maxwell Liu from ShingingMidas Private Fund and Xingyu Fu from Sun Yat-sen University for their generous guidance throughout the project. We are also grateful to Kangkang Jiang from Sun Yat-sen University for his assistance all the way. Without their support, we could not have accomplished such a challenging task.