1 Introduction
Financial timeseries classification (FTC) is highly important for investors. It has attracted attention from a wide range of research fields, especially Artificial Intelligence (AI) kim2003financial. The classical financial theory, the Efficient Market Hypothesis (EMH) b1, suggests that every piece of information in the financial market affects the movement of the corresponding security price. Thus, numerous studies have investigated the impact of historical financial data on future security prices. Because a large amount of financial data is constantly produced, analyzing these data manually consumes massive labor from human experts. Consequently, technologies that can automatically process these data have been widely explored tsai2010combining; kara2011predicting; li2016empirical.
Based on the properties of timeseries, existing research on FTC can be divided into Multi-Scale (MS) oriented methods and Temporal Dependency (TD) oriented methods.
For the MS-oriented methods, existing research focuses on extracting MS features from financial timeseries. High-scale features of a financial timeseries reflect the long-run trend information of the financial market, while low-scale features embody short-term trend information. Methods with only single-scale features neglect the information at other scales; accordingly, these single-scale methods often fail to accurately describe the current state of the timeseries movement and tend to misjudge the category of financial timeseries data. In order to describe financial timeseries precisely, the MS property should be considered. In the financial area, the MS property of financial timeseries has been extensively investigated dacorogna1996changing. With similarity measured on multiple scales, the future price of a given security can be estimated by finding similar historical price sequences across different financial markets papadimitriou2006optimal. In the AI community, few studies have explored the MS property of financial timeseries. An early work, ScaleNet geva1998scalenet, decomposes the timeseries into different scales by the wavelet transform, and then extracts features from each scale with different neural networks to make a prediction. More recently, Cui et al. cui2016multi use a Convolutional Neural Network (CNN) to improve the feature extraction ability. Although the above methods have achieved remarkable improvement over methods with only single-scale features, they overlook the TD within the financial timeseries.
For the TD-oriented methods, nonlinear models are often used due to the non-stationarity of financial timeseries. Most previous studies use classical classification models. For example, Kim kim2003financial uses the Support Vector Machine (SVM) to predict the stock price index; compared to a Neural Network (NN), it achieves comparable results under their experimental setting. It is notable that these models are not specifically designed for modeling TD. More recently, deep learning models deng2017hierarchical have been introduced to improve feature extraction and representation for financial timeseries. For instance, the Recurrent Neural Network (RNN), which can handle TD effectively, is often used in this scenario lin2017hybrid. However, the above methods only use single-scale features and ignore the MS property of financial timeseries; consequently, they cannot describe the current state of the financial market precisely.
The MS and TD properties of financial timeseries, and the subtle relation between them, make FTC very challenging. Very few works have investigated the effect of employing both properties for FTC. Recently, the State Frequency Memory (SFM) hu2017state integrated Long Short-Term Memory (LSTM) and the Discrete Fourier Transform (DFT) to model the multiple frequency properties of stock price sequences. However, the DFT needs predefined parameters that are tricky to tune and cannot be learned automatically. In addition, the DFT is not a suitable choice for non-stationary financial timeseries. Therefore, a new method for FTC that can effectively utilize both properties of financial timeseries is needed.
To address the above problems, this paper proposes a Multi-Scale Temporal Dependency Recurrent Convolutional Neural Network (MS-TDRCNN) for financial timeseries classification. The proposed model is an effective end-to-end model that learns its parameters automatically. The major contributions of this paper are summarized as follows:

We propose a novel method for FTC that combines and utilizes both the MS and TD properties of financial timeseries. The proposed method integrates a CNN and an RNN to handle these two different properties.

MS features are extracted with CNN units from the single-scale input of the financial timeseries sequence. The parameters of each CNN unit are learned automatically; there is no need to tune predefined parameters, which is critical for methods such as the DFT.

Features from different scales are fused with an RNN. Benefiting from its structure for handling TD, the RNN can explore and learn the dependencies across different scales.

To evaluate MS-TDRCNN, we build three minute-level index price datasets sourced from the Chinese stock market. Since financial timeseries share identical structure and properties, it is feasible to extend our method to global financial markets.
The experimental results demonstrate that our model achieves superior performance compared to classical and state-of-the-art baseline models in both financial timeseries classification and simulated trading.
The rest of this paper is organized as follows: Section 2 introduces financial timeseries prediction, the Multi-Scale (MS) property of timeseries and the GRU; Section 3 illustrates the formulation of FTC and the architecture of MS-TDRCNN; Section 4 gives the experimental settings; Section 5 describes the experimental results and analysis; Section 6 presents the conclusions and future work.
2 Related works
2.1 Financial Timeseries Prediction
Financial timeseries prediction is essential for developing effective trading strategies in the financial market lee1991inferring. In past decades, it has attracted widespread attention from researchers in many areas, especially the Artificial Intelligence (AI) community kim2003financial. These studies mainly focus on a specific market, e.g., the stock market leung2000forecasting; saad1998comparative, the foreign exchange market frankel1990chartists; cheung2018exchange; das2017hybridized, and the futures market zirilli1996financial; kim2017intelligent. Unsurprisingly, the task is very challenging due to the irregular and noisy environment of these markets.
From the perspective of the learning target, existing research can be divided into regression approaches and classification approaches. The regression approaches treat this task as a regression problem 12; 4, aiming to predict the future value of a financial timeseries, while the classification-oriented approaches treat it as a classification problem 17; 8, focusing on financial timeseries classification (FTC).
In most cases, the classification approaches achieve higher profits than the regression ones leung2000forecasting. Accordingly, the effectiveness of various approaches to FTC has been widely explored 6; 7; 9; 10.
2.2 Multiscale of financial timeseries
The Multi-Scale (MS) property for timeseries classification has been widely studied peng1998multiple; papadimitriou2006optimal; stopar2018streamstory; yang2015deep; cui2016multi; wang2017time. The concept of MS is often used in Computer Vision (CV) tasks eigen2015predicting, e.g., image object detection cai2016unified. An image is a sample formed by sampling objects in the real world at a certain pixel level. Images at large scale provide global features, while images at small scale provide local features; the MS representation of an image thus provides more detailed information than single-scale features. Similar to images, timeseries also typically have the MS property. Previous works mainly focus on predicting the future value or movement direction based on the assumption that the movement pattern of a financial timeseries will repeat itself. Thus, timeseries similarity analysis approaches have been extensively investigated, e.g., the discrete wavelet transform geva1998scalenet. Among these approaches, the use of the MS property is one of the key factors in measuring the similarity between timeseries sequences, since the MS property is very effective for characterizing a timeseries.
Analyzing financial data is even more challenging due to its non-stationary characteristics and the noisy environment of the financial market. Therefore, this paper focuses on predicting the movement direction of financial timeseries by utilizing the MS property.
2.3 Temporal dependency of financial timeseries
Previous research has explored the effectiveness of methods that classify financial timeseries based on their Temporal Dependency (TD). Traditionally, these studies can be divided into three categories: the feature-oriented methods, the model-oriented methods, and the integrated methods.
For the feature-oriented methods, the key factor is to extract effective features from the financial timeseries data. Statistics-based approaches, such as Principal Component Analysis (PCA) tsai2010combining and Information Gain (IG) lee2009using, are often used. These methods help improve the performance of a given model by removing weakly relevant features. Some studies have introduced fuzzy logic to transform the features into more expressive representations guan2018novel; chang2008tsk; atsalakis2009forecasting, since these data are mainly numerical and only weakly expressive of category information. These studies transform the real value of a feature into a probability distribution over multiple categories, thereby improving the feature's expressiveness for category information.
For the model-oriented methods, the focus is on improving the fitting ability of the model. Traditionally, the Support Vector Machine (SVM) and Neural Network (NN) are considered very effective for financial timeseries classification kara2011predicting. However, due to their excessive parameter size, they easily overfit the training set. As a result, the Extreme Learning Machine (ELM) ma2016selected and Random Forest (RF) patel2015predicting were introduced for financial timeseries classification. ELM speeds up training and improves generalization performance through randomly generated hidden-layer units. RF ensembles multiple trees to achieve better prediction and generalization performance than a single model. More recently, some pioneering studies have explored the effectiveness of deep learning models in financial timeseries classification liu2017foreign; akita2016deep; deng2017hierarchical, since deep learning models have had many successful applications in Computer Vision (CV) krizhevsky2012imagenet and Natural Language Processing (NLP)
kim2014convolutional. For instance, TreNet lin2017hybrid integrates a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) for trend prediction.
The integrated methods often combine multiple artificial intelligence or statistics-based techniques into a pipeline for financial timeseries classification. Some studies integrate text classification DeFortuny2014; Shynkevich2015; 12 from NLP with a classification model to determine the direction of securities price movements. Kim and Han kim2000genetic proposed a feature selection method based on a Genetic Algorithm (GA) combined with an NN model to select useful features for predicting the trend of stock prices. Teixeira et al. teixeira2010method used technical indicators, which are common in technical analysis, as the representation of financial data and fed them into a classification model for FTC. Durán-Rosal et al. duran2017identifying used turning points obtained by piecewise linear regression to segment the target sequence, and then used an NN to predict these points. In this work, we explore the effect of integrating deep learning models with a statistics-based method (downsampling) for FTC.
3 Model
In this section, we first provide the formal definition of financial timeseries classification. Then, we present the proposed MS-TDRCNN model.
3.1 Problem formulation
In this paper, we focus on classifying financial timeseries sequences into different categories by their movement direction. The price of a given security in the financial market is typically a univariate data sequence. A financial timeseries dataset is denoted as $D = \{(x_i, y_i)\}_{i=1}^{N}$, where $N$ is the number of samples in the dataset, $x_i$ is the $i$-th sample with length $T$ and $y_i$ is the corresponding label. Each timeseries is denoted as $x_i = (x_i^1, x_i^2, \dots, x_i^T)$, where $x_i^t$ is the value at the $t$-th timestep and $T$ is the number of timesteps.
As a result, FTC aims to build a nonlinear mapping from an input timeseries to a predicted class label:
(1) $y = F(x)$
where $F(\cdot)$ is the nonlinear function we aim to learn.
Financial Timeseries Classification (FTC) has attracted attention from researchers in various fields. However, it is very challenging due to two major difficulties. Firstly, Multi-Scale (MS) features are required to describe the state of the financial market. Secondly, the Temporal Dependency (TD) across features of different scales needs to be captured and fused to make an accurate classification.
To address these problems, we propose a Multi-Scale Temporal Dependency Recurrent Convolutional Neural Network (MS-TDRCNN) model for FTC. The proposed model transforms the input sequence into MS sequences, extracts features from each scale, fuses these features and outputs the predicted category. Thus, the proposed model is an effective end-to-end model for FTC.
3.2 Model architecture
The architecture of MSTDRCNN is depicted in Fig. 1. Our model mainly has three components: the transform layer, the feature layer, and the fusion layer. The major functions of the three layers are described as follows:

For the transform layer, the input sequence is transformed into MS sequences. Specifically, the downsampling transformations in the time domain are used.

For the feature layer, different convolutional units are used to extract features from each scale. To this end, the convolution units for different scales are independent of each other. The feature maps output by the convolutions are padded to the same length and then concatenated together.

For the fusion layer, we feed the padded and concatenated feature maps to the GRU. The output of the GRU passes through the fully connected layers and the softmax layer to produce the final output.
Thus, our MS-TDRCNN model is a complete end-to-end system where all parameters are jointly trained through backpropagation.
3.2.1 Transform layer
In this layer, the single-scale input sequence is transformed into multiple new sequences with different scales. Here, downsampling is used to generate sketches of the financial data at different scales. These MS timeseries are potentially crucial to the prediction quality for this task, and they complement each other: high-scale features reflect slow trends, while low-scale features exhibit the subtle changes of fast trends.
Suppose there is an input sequence $x = (x^1, x^2, \dots, x^T)$ and the downsampling rate is $k$. Then every $k$-th data point is kept in the new sequence $x^{(k)}$, whose length is $\lfloor T/k \rfloor$. Through this method, multiple new sequences are generated with different downsampling rates. For simplicity, we use $\{x^{(k)}\}$ to denote the generated sequences.
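As a minimal sketch of the transform layer, downsampling with rate $k$ amounts to keeping every $k$-th point of the sequence (the starting offset is an implementation choice; here we keep the first point):

```python
import numpy as np

def downsample(x, k):
    """Keep every k-th point of a 1-D sequence (transform-layer sketch)."""
    return x[::k]

x = np.arange(30, dtype=float)
# One sequence per downsampling rate, e.g. rates 1, 2 and 3.
scales = {k: downsample(x, k) for k in (1, 2, 3)}
```

Each generated sequence is then fed to its own convolutional unit in the feature layer.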
3.2.2 Feature layer
This layer takes the MS sequences as input and outputs the concatenated features extracted from each scale. It has two major components: the convolutional units and the concatenation operation.
Convolutional units. The CNN units, which are often used as feature extractors in Computer Vision eigen2015predicting, are used to extract feature maps from the sequences at different scales. Specifically, one-dimensional CNNs are used to process the newly generated sequences. These CNN units share the same filter size and filter number across all sequences. Note that, with the same settings, a higher-scale sequence gets a larger receptive field with respect to the original sequence. In this way, each output of the convolution operation captures features with a different receptive field over the original sequence. An advantage of this process is that, by downsampling the input sequence instead of increasing the filter size, we greatly reduce the computation in the convolutional units.
Let $x^{(k)}$ denote the $k$-th scale timeseries. The corresponding kernel weights $w \in \mathbb{R}^{h}$ are used to extract features from the input sequence, where $h$ is the window size. For instance, the feature $c_i$ is calculated by
(2) $c_i = f(w * x^{(k)}_{i:i+h-1} + b)$
Here, $*$ indicates the convolution operation, $b$ is a bias term and $f$ is a nonlinear function such as the Rectified Linear Unit (ReLU). This filter is applied to the sequence $x^{(k)}$ to produce a feature map as follows
(3) $c^{(k)} = (c_1, c_2, \dots, c_{\lfloor T/k \rfloor - h + 1})$
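The convolution of Eqs. (2) and (3) can be sketched in NumPy as a sliding dot product followed by ReLU (a "valid" convolution without padding; biases and multiple filters are handled analogously):

```python
import numpy as np

def conv1d_feature_map(x, w, b):
    """Valid 1-D convolution plus ReLU: one feature map from one filter."""
    h = len(w)  # window size of the kernel
    c = np.array([float(np.dot(x[i:i + h], w)) + b
                  for i in range(len(x) - h + 1)])
    return np.maximum(c, 0.0)  # ReLU nonlinearity
```

A sequence of length $n$ convolved with a kernel of size $h$ yields a feature map of length $n - h + 1$, matching Eq. (3).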
Here, pooling layers are not used, since the transform layer does similar work: pooling layers increase the receptive field howard2017, while the transform layer converts the original sequence into different timescales before feature extraction. We believe this transformation before feature extraction achieves effects similar to pooling layers.
Concatenation operation. This operation concatenates the feature maps of different scales. Due to the different lengths of feature maps, padding is needed before concatenation.
Since $k \geq 1$, the length of the feature map decreases as the scale increases. For the convenience of calculation, we unify the feature maps of different scales to the same length $L$, that is, the feature-map length when $k = 1$. We align the feature maps of the other scales to length $L$ by zero-padding. For example, the alignment of the feature map for scale $k$ is as follows
(4) $\hat{c}^{(k)}_{j} = [\,c^{(k)}_{j},\, \mathbf{0}\,]$
where $\hat{c}^{(k)}_{j}$ is the padded feature map generated by the $j$-th kernel for scale $k$ and $\mathbf{0}$ is a zero sequence of length $L - |c^{(k)}_{j}|$.
Next, the padded feature maps are concatenated into a feature matrix. The concatenation process is described as follows
(5) $C = [\,\hat{c}^{(1)}_{1}; \dots; \hat{c}^{(k)}_{j}; \dots\,]$
where $C$ is the feature matrix with length $L$ and $m$ is the number of convolutional kernels per scale.
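The zero-padding and concatenation of Eqs. (4) and (5) can be sketched as follows (each row of the result corresponds to one padded feature map):

```python
import numpy as np

def pad_and_stack(feature_maps, L):
    """Right-pad each feature map with zeros to length L and stack into a matrix."""
    padded = [np.concatenate([m, np.zeros(L - len(m))]) for m in feature_maps]
    return np.stack(padded)

# Feature maps from two scales with different lengths, unified to L = 5.
M = pad_and_stack([np.ones(5), np.ones(3)], 5)
```

The resulting matrix is what the fusion layer consumes column by column.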
3.2.3 Fusion layer
The fusion layer fuses the features from multiple scales and generates a prediction. The output of the feature layer is similar to a language-model input in Natural Language Processing, which has Temporal Dependency (TD) among its nodes. The major difference is that the sequences from different scales have different fields of view. To fuse these features, we need a model that captures this dependency and variety. Recurrent Neural Networks (RNN) are often used as encoders in machine translation cho2014learning; sutskever2014sequence, where they capture the complex dependencies within and across languages. The structure of RNNs is good at handling variable-length sequence input by having a recurrent hidden state whose activation at each timestep depends on that of the previous timestep. Hence, we use an RNN to process the feature maps in this case.
Similar to a traditional neural network, we can use a modified backpropagation algorithm, Backpropagation Through Time (BPTT), to train an RNN mozer1989focused. Unfortunately, it is difficult to train an RNN to capture long-term dependencies, because the gradients tend to either vanish or explode bengio1994learning. Hochreiter and Schmidhuber hochreiter1997long proposed the Long Short-Term Memory (LSTM) unit and Cho et al. cho2014learning proposed the Gated Recurrent Unit (GRU) to deal with this problem effectively. To this end, we use the GRU to process the feature matrix.
A Gated Recurrent Unit (GRU) makes each recurrent unit adaptively capture dependencies at different timescales. The parameters are updated by the following equations
(6) $r_t = \sigma(W_r x_t + U_r h_{t-1})$
(7) $z_t = \sigma(W_z x_t + U_z h_{t-1})$
(8) $\tilde{h}_t = \tanh(W x_t + U(r_t \odot h_{t-1}))$
(9) $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
where $\sigma$ denotes the logistic sigmoid function, $\odot$ denotes element-wise multiplication, $r_t$ denotes the reset gate, $z_t$ denotes the update gate and $\tilde{h}_t$ denotes the candidate hidden state. In this paper, we apply the GRU as the feature summarization layer for stock trend prediction. Given the feature matrix $C$, the hidden state at the $t$-th timestep is calculated by
(10) $h_t = g(h_{t-1}, C_t; \theta)$
where $h_t \in \mathbb{R}^{d}$ is the hidden state of the encoder at time $t$, $d$ is the size of the hidden state, $C_t$ is the $t$-th column of the matrix $C$, $g$ is a nonlinear function, and $\theta$ denotes the parameters of the encoder. There are many choices for encoding a sequence of numerical data; in this case, we use the GRU as the nonlinear function. The final output of the GRU is deemed the encoding of the multiple scales of the input sequence.
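A single GRU step of this form can be sketched in NumPy as follows (the weight names `Wr`, `Ur`, etc. are illustrative, and bias terms are omitted for brevity):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, p):
    """One GRU step: p maps weight names to matrices (biases omitted)."""
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev)             # reset gate
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev)             # update gate
    h_cand = np.tanh(p["W"] @ x + p["U"] @ (r * h_prev))    # candidate state
    return (1.0 - z) * h_prev + z * h_cand                  # new hidden state
```

Iterating `gru_step` over the columns of the feature matrix yields the hidden-state sequence whose final state encodes the multi-scale input.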
The feature vector output by the GRU passes through multiple fully connected layers and then a softmax activation layer to obtain a probability distribution over the classes. The softmax activation function is calculated as follows
(11) $p_i = \frac{e^{o_i}}{\sum_{c=1}^{M} e^{o_c}}$
where $o_i$ indicates the result of the $i$-th output node and $M$ is the number of categories.
The cross-entropy loss function is used to measure the difference between the predicted classification distribution $p$ and the real distribution $y$:
(12) $\mathcal{L}(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{M} y^{(n)}_i \log p^{(n)}_i$
where $\theta$ represents all parameters of the model and $N$ is the total number of samples.
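The output stage can be sketched as a numerically stable softmax plus a cross-entropy term for a single one-hot target:

```python
import numpy as np

def softmax(o):
    """Stable softmax over the output nodes: subtract the max before exponentiating."""
    e = np.exp(o - o.max())
    return e / e.sum()

def cross_entropy(p, y):
    """Cross-entropy between predicted distribution p and one-hot target y."""
    return -float(np.sum(y * np.log(p + 1e-12)))  # epsilon guards log(0)
```

Averaging `cross_entropy` over all samples gives the training loss of Eq. (12).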
4 Experimental settings
In this section, we first give the details of datasets. Then, we introduce the baseline models in comparative evaluation. Last, the evaluation metrics are illustrated.
4.1 Datasets
We first describe the data source of the datasets. Then, we explain how to choose the threshold for the label and the window size for window sliding.
Three high-frequency stock index datasets are collected from the Chinese stock market.

SH000001: Shanghai Stock Exchange (SSE) Composite Index. Prepared and published by the SSE, it is an authoritative statistical index widely followed and used at home and abroad to measure the performance of China's securities market.

SZ399005: Shenzhen Stock Exchange Small & Medium Enterprises (SME Board) Price Index. SMEs play an important role in the economic and social development of China: they foster economic growth, generate employment and contribute to the development of a dynamic private sector.

SZ399006: ChiNext Price Index. ChiNext is a NASDAQ-style board of the Shenzhen Stock Exchange. It aims to attract innovative and fast-growing enterprises, especially high-tech firms. Its listing standards are less stringent than those of the Main and SME Boards of the Shenzhen Stock Exchange.
The data begin on January 1, 2016, and end on December 30, 2016, for a total of 58,000 data points. Window slicing is applied for data augmentation cui2016multi. Of these data points, 48,000 are used for the training set, 5,000 for the validation set, and 5,000 for the testing set.
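The window-slicing augmentation can be sketched as cutting the long series into overlapping fixed-length windows (the stride of 1 is an assumption; the paper does not state it):

```python
import numpy as np

def window_slice(seq, w, step=1):
    """Cut a long series into overlapping windows of length w with the given stride."""
    return np.stack([seq[i:i + w] for i in range(0, len(seq) - w + 1, step)])

# A series of 100 points sliced into length-30 windows.
X = window_slice(np.arange(100, dtype=float), 30)
```

Each window becomes one training sample, labeled by the direction of the following price change.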
There are three categorical values, defined as follows
(13) $y = \begin{cases} 1, & \Delta p > \tau \\ 0, & -\tau \leq \Delta p \leq \tau \\ -1, & \Delta p < -\tau \end{cases}$
Here, $y = 0$ means the price of the security at the next timestep is still, $y = 1$ means the price is going upward and $y = -1$ means the price is moving downward; $\tau$ is the threshold and $\Delta p$ is the change value compared to the previous timestep, calculated by
(14) $\Delta p = p_t - p_{t-1}$
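The labeling rule of Eq. (13) reduces to a simple threshold comparison:

```python
def label_change(delta, tau):
    """Map a price change to a trend category: 1 upward, -1 downward, 0 still."""
    if delta > tau:
        return 1
    if delta < -tau:
        return -1
    return 0
```

For example, with threshold 0.3, a change of +0.5 index points is labeled upward, while +0.1 is labeled still.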
To select the threshold for each dataset, we analyzed the distribution of price changes. As shown in Fig. 2 and Table 3, the price changes on the development set of each dataset are mostly clustered around zero. We choose the threshold that makes the categories on the development set approximately equally distributed. As a result, the threshold is set to 0.3, 0.2 and 0.8 for SH000001, SZ399006, and SZ399005, respectively. The distribution of each dataset is shown in Table 1.
To select the window size for the sliding window, we train a Random Forest (RF) on the training set and evaluate it on the development set under different window-size settings. As shown in Table 2, a window size of 30 yields the best performance for RF on all three datasets. Therefore, the window size is set to 30.
SH000001  SZ399005  SZ399006  

Category(trend)  Train(%)  Dev(%)  Test(%)  Train(%)  Dev(%)  Test(%)  Train(%)  Dev(%)  Test(%) 
Downward()  36.60  33.04  35.66  41.68  35.78  36.76  40.31  32.82  35.28 
Still()  26.99  32.16  31.90  18.67  32.42  31.60  18.50  33.16  29.24 
Upward()  36.41  34.80  32.44  39.65  31.80  31.64  41.19  34.02  35.48 
SH000001  SZ399005  SZ399006  

window size  Acc  F1  Acc  F1  Acc  F1 
10  0.5132  0.5123  0.4842  0.4718  0.5874  0.5826 
20  0.5212  0.5214  0.5018  0.4905  0.5974  0.5932 
30  0.5256  0.5252  0.5064  0.4938  0.5996  0.5956 
40  0.5210  0.5207  0.4856  0.4703  0.5950  0.5901 
50  0.5222  0.5212  0.4998  0.4842  0.5934  0.5877 
Bold numbers indicate the best results
SH000001  SZ399005  SZ399006  

Mean  0.0368  0.0016  0.0081 
Std  1.1915  0.9429  3.6840 
In order to avoid excessive correlation between these datasets, we calculate the Pearson Correlation Coefficient (PCC) between the three datasets. A negative PCC indicates that two series are negatively correlated, a positive PCC indicates a positive correlation, and a PCC of zero indicates no correlation. Table 4 lists the results, which indicate that there are no strong correlations between any pair of datasets.
SH000001  SZ399005  SZ399006  

SH000001  1.00  0.58  0.42 
SZ399005    1.00  0.42 
SZ399006      1.00 
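The PCC values in Table 4 can be computed directly from the paired index series, e.g.:

```python
import numpy as np

def pcc(a, b):
    """Pearson correlation coefficient between two equal-length series."""
    return float(np.corrcoef(a, b)[0, 1])
```

A value near +1 or -1 would signal a strong (and therefore undesirable) dependence between two datasets.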
4.2 Baselines
Six baseline models are used: first, two classical models for FTC are given; then, four advanced models for FTC are illustrated.

Support Vector Machine (SVM) kim2003financial
. It projects the input data into a higher dimensional space by the kernel function and separates different classes of data using a hyperplane. The tradeoff between margin and misclassification errors is controlled by the regularization parameter.

Random Forest (RF) kara2011predicting
. It belongs to the category of ensemble learning algorithms and uses decision trees as base learners. The idea of ensemble learning is that a single classifier is not sufficient for determining the class of test data. After the creation of n trees, the decision that the majority of trees agree on for a test sample is taken as the final output. This also mitigates the problem of overfitting.

Fuzzy Deep Neural Network (FDNN) deng2017hierarchical. FDNN uses fuzzy-neural layers and fully connected layers to learn the fuzzy representation and the neural representation separately. Then, these two representations are fused by a two-layer fully connected network. The fused representations are fed to a softmax activation to obtain the trend prediction results.

TreNet lin2017hybrid. TreNet hybridizes LSTM and CNN for stock trend classification. The LSTM learns the dependencies in the historical trend sequence, and the CNN learns local features from the raw timeseries data. The extracted features are then fused by a fully connected layer to generate a prediction.

StateFrequency Memory Recurrent Neural Networks (SFM) hu2017state
. It allows separating dynamic patterns across different frequency components and their impacts on modeling the temporal contexts of input sequences. By jointly decomposing memorized dynamics into state frequency components, the SFM is able to offer a finegrained analysis of temporal sequences by capturing the dependency of uncovered patterns in both time and frequency domains.

Multi-Scale CNN (MSCNN) cui2016multi. MSCNN uses different convolutional units to extract features from each timescale of the data. Then, these features are fused by two fully connected layers. In most cases, this model achieves better performance than a regular convolutional neural network in timeseries classification.
The parameters of our model are selected by performance on the validation set. The maximum number of epochs is set to 100. The model is trained with the Adam optimization algorithm with a learning rate of 0.0005. The batch size is set to 32. Multiple timescales are used; the convolution unit for each timescale has 16 filters, and the number of hidden units in the GRU is set to 48.
4.3 Evaluation metrics
Accuracy, F-score (F1) and the Confusion Matrix (CM) are used as classification metrics to evaluate the models, and the accumulated profit is used to evaluate their profitability.
Accuracy and F1 are calculated from the Confusion Matrix, which has four components: True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). The CM shows, for each pair of classes, how many samples of one class were incorrectly assigned to the other.
Accuracy is the rate of correct predictions and is calculated as in Equation 15.
(15) $Accuracy = \frac{TP + TN}{N}$
Here, $N$ is the total number of samples in the dataset. We report the weighted average of F1. The calculation of F1 is displayed in Equation 16.
(16) $F1 = \frac{2 \cdot P \cdot R}{P + R}$
where $R$ is the recall and $P$ is the precision, which are calculated as follows:
(17) $P = \frac{TP}{TP + FP}$
(18) $R = \frac{TP}{TP + FN}$
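Given the confusion-matrix counts for one class, precision, recall and F1 follow directly from Equations 16 to 18:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for one class from confusion-matrix counts."""
    p = tp / (tp + fp)          # precision
    r = tp / (tp + fn)          # recall
    return p, r, 2 * p * r / (p + r)
```

The weighted-average F1 reported in the tables repeats this per class and weights by class support.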
The simulated trading algorithm is based on the predicted trend $\hat{y}$, the real trend $y$ and the change value $\Delta p$ of the index. For each trading signal generated by the model, we buy or sell one unit of the security. For the upward and downward categories, we make a profit if the prediction is correct and suffer a loss if it is wrong. For the still category, the change value is set to zero. The transaction cost is set to zero. The accumulated profit is calculated by
(19) $Profit = \sum_{t} \mathbb{1}(\hat{y}_t \neq 0)\, \hat{y}_t\, \Delta p_t$
Here, $Profit$ indicates the profit in change points, and $\mathbb{1}(\cdot)$ is an indicator function that equals 1 when $\hat{y}_t \neq 0$, and 0 otherwise.
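Under the trading rule described above (signals in {-1, 0, 1}, one unit per signal, zero transaction cost), the accumulated profit can be sketched as:

```python
def accumulated_profit(preds, changes):
    """Sum of signed gains: a correct directional call earns the change, a wrong
    one loses it, and 'still' predictions (0) open no position."""
    return sum(p * c for p, c in zip(preds, changes) if p != 0)
```

For instance, predicting up before a +2.0 move and down before a -1.5 move earns 3.5 points, while a wrong upward call before a -0.5 move gives back 0.5.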
5 Results and Analysis
First, the model's performance is compared with the baseline models on three datasets. Then, the effects of the feature layer in extracting Multi-Scale (MS) features are analyzed. Third, the effects of the fusion layer in capturing Temporal Dependency (TD) are analyzed. Next, the profitability of the models is evaluated by simulated trading. Last, the reason driving the improvement in profitability is analyzed through the confusion matrix.
5.1 Comprehensive evaluation
Financial timeseries classification is a challenging task, and a minor improvement usually leads to large potential profits 4. To demonstrate the effectiveness of our MS-TDRCNN model, we compare it against the six baseline models on three datasets. The results are listed in Table 5. The t-test results between our model and the other models are listed in Table 6. Examining the experimental results, we reach the following conclusions.
Our model achieves the best performance in both accuracy and F1. In terms of accuracy, our model achieves the best results on all three datasets; MS-TDRCNN improves accuracy by 3.07%, 3.00%, and 2.13% over the best baseline models on SH000001, SZ399005, and SZ399006, respectively. In terms of F1, our model also achieves the best performance on the three datasets, with improvements of 4.14%, 2.21%, and 2.86% over the best baselines on SH000001, SZ399005, and SZ399006. In addition, the t-test results suggest that the results of our model are significantly different from those of the other models.

Our MS-TDRCNN model can effectively extract MS features from financial timeseries. First of all, all models share the same single-scale input, and only SFM, MSCNN and MS-TDRCNN are designed to utilize the MS property of timeseries. These models achieve a higher level of accuracy and F1 than the other baseline models in most cases. It can be concluded that FTC models that utilize the MS property are more effective than single-scale ones. Secondly, our model achieves the best accuracy and F1 among these three models, which indicates that MS-TDRCNN is more effective at extracting MS features than the other two.

Our MS-TDRCNN model is very effective in capturing the Temporal Dependency (TD) within financial timeseries. MS-TDRCNN achieves a significantly higher level of classification performance than MSCNN. These two models have a similar structure: both transform the single-scale input sequence into MS sequences and then use CNNs to extract MS features. To fuse the features, however, MS-TDRCNN uses a GRU while MSCNN uses a fully connected NN. We can therefore conclude that the performance improvement is likely due to the efficiency of MS-TDRCNN in capturing TD.
Model     | SH000001        | SZ399005        | SZ399006
          | ACC      F1     | ACC      F1     | ACC      F1
SVM       | 0.5150   0.5181 | 0.5038   0.4960 | 0.5620   0.5357
RF        | 0.5230   0.5196 | 0.5116   0.5054 | 0.5732   0.5654
TreNet    | 0.5238   0.5250 | 0.5120   0.5134 | 0.5964   0.5857
FDNN      | 0.5232   0.5245 | 0.5122   0.5035 | 0.5930   0.5510
SFM       | 0.5296   0.5227 | 0.5254   0.5232 | 0.5960   0.5740
MSCNN     | 0.5334   0.5287 | 0.5198   0.5201 | 0.6006   0.5954
MSTDRCNN  | 0.5498   0.5506 | 0.5454   0.5516 | 0.6134   0.6124
Bold numbers indicate the best results.
              | SVM     RF      TreNet  FDNN    SFM     MSCNN
MSTDRCNN ()   | 0.8***  0.1***  0.2***  2.5***  0.5***  0.7***
p-value: ***, p-value: **, p-value: *
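The significance tests above can be reproduced with a paired t-test over repeated runs. A minimal sketch in plain Python, assuming each model is trained several times and the per-run accuracies are paired (the accuracy values below are hypothetical):

```python
import math

def paired_t_statistic(a, b):
    """t-statistic of the paired differences a[i] - b[i]."""
    n = len(a)
    d = [ai - bi for ai, bi in zip(a, b)]
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)

# hypothetical per-run accuracies of our model and one baseline
acc_ours = [0.548, 0.551, 0.546, 0.553, 0.549]
acc_base = [0.533, 0.535, 0.531, 0.536, 0.532]
t = paired_t_statistic(acc_ours, acc_base)
```

The resulting statistic is then compared against the t-distribution with n - 1 degrees of freedom to obtain the p-value.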
5.2 Effects of multiscale features
To illustrate the effect of MSTDRCNN in exploiting the MultiScale (MS) property, we evaluate it under different scale settings. Since MSTDRCNN uses convolutional units to extract features from distinct scales, the effect of feature extraction can be assessed by the classification performance obtained with different scale settings. Hence, our model is evaluated with one, two, and three scales: the single-scale variant uses only the original data, the two-scale variant adds one coarser scale, and the three-scale variant uses all three scales.
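The scale settings can be illustrated as follows. This sketch assumes, for illustration only, that the k-th scale is the series smoothed with a non-overlapping moving average of window k (so scale 1 is the raw data); the paper's exact scale transform may differ.

```python
import numpy as np

def to_scales(x, n_scales):
    """Build multi-scale views of a 1-D series.

    Assumption: scale k averages non-overlapping windows of length k,
    so each coarser view is both smoothed and downsampled.
    """
    views = []
    for k in range(1, n_scales + 1):
        trimmed = x[: len(x) // k * k]          # drop the ragged tail
        views.append(trimmed.reshape(-1, k).mean(axis=1))
    return views

x = np.arange(12, dtype=float)
v1, v2, v3 = to_scales(x, 3)
# v1 is the raw series; v2 halves its length; v3 thirds it
```

The single-scale setting feeds only `v1` to the network, the two-scale setting `v1` and `v2`, and the three-scale setting all three views.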
Table 7 lists the classification results of MSTDRCNN with different scale settings. The accuracy and F1 of MSTDRCNN rise as the number of scales increases: the two-scale variant outperforms the single-scale one, and the three-scale variant outperforms the two-scale one. This is mainly due to the effect of MS features, since features from different scales complement each other. Moreover, MSTDRCNN achieves higher classification performance than the baselines even when using only single-scale data: the single-scale variant already surpasses the baselines in accuracy and F1 on all three datasets. This is likely due to the effect of the convolutional units in feature extraction.
Model                | SH000001        | SZ399005        | SZ399006
                     | ACC      F1     | ACC      F1     | ACC      F1
MSTDRCNN (1 scale)   | 0.5356   0.5373 | 0.5330   0.5210 | 0.6070   0.6040
MSTDRCNN (2 scales)  | 0.5454   0.5442 | 0.5414   0.5410 | 0.6122   0.6112
MSTDRCNN (3 scales)  | 0.5498   0.5506 | 0.5454   0.5516 | 0.6134   0.6124
Bold numbers indicate the best results.
5.3 Effects of temporal dependency
To show the effect of Temporal Dependency (TD), we compare the classification performance of MSCNN and our model under different scale settings. MSCNN and MSTDRCNN share a similar structure for feature extraction, so the differences in classification performance are mainly due to their different efficiency in capturing TD.
Fig. 3.a shows the accuracy results on the three datasets. Firstly, the performance of MSTDRCNN rises as the number of scales increases on all three datasets, while MSCNN shows no significant trend in most cases. This is likely because the fusion layer in MSTDRCNN can effectively fuse the features with TD: MSCNN uses fully connected layers to fuse features, whereas MSTDRCNN uses a GRU. Secondly, with the same input, MSTDRCNN achieves higher accuracy than MSCNN on all three datasets. The fusion layer can capture the temporal dependency, which is very important for timeseries classification. We can conclude that MSTDRCNN is more effective than MSCNN at capturing TD.
5.4 Simulated trading
The ultimate goal of financial timeseries classification is to make a profit. To estimate the models' profitability, we use a simulated trading algorithm (Equation 19) to evaluate these models based on their predictions on the testing sets. Table 8 lists the simulated trading results on the three datasets. The profitability of these models is compared to the Buy & Hold (B&H) baseline strategy, which buys the security at the beginning of the testing period and sells it at the end.
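The exact trading rule (Equation 19) is not reproduced in this section; the sketch below assumes a simple rule for illustration only: go long on a predicted "upward" (2), short on a predicted "downward" (1), and stay flat on "still" (0), then compare with B&H. The prices and predictions are hypothetical.

```python
def simulate_trading(prices, preds):
    """Accumulate profit as position * next price change.

    preds[t] is the class predicted for the move from prices[t] to
    prices[t + 1]: 0 = still (flat), 1 = downward (short), 2 = upward (long).
    """
    position = {0: 0, 1: -1, 2: 1}
    profit = 0.0
    for t, pred in enumerate(preds):
        profit += position[pred] * (prices[t + 1] - prices[t])
    return profit

def buy_and_hold(prices):
    # B&H baseline: buy at the start, sell at the end
    return prices[-1] - prices[0]

prices = [100.0, 102.0, 101.0, 105.0]
preds = [2, 1, 2]                                # up, down, up
model_profit = simulate_trading(prices, preds)   # 2 + 1 + 4 = 7
bh_profit = buy_and_hold(prices)                 # 5
```

Under such a rule a classifier can stay profitable in a falling market by shorting correctly predicted downward moves, which is why B&H can lose money while every model earns a profit.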
MSTDRCNN achieves the highest profit on all three datasets, exceeding the most profitable baseline model on each of them (see Table 8). Note that the B&H strategy suffers losses because the market is in a downward trend, while all models make a profit. The results show that our model is not only more accurate in classification but also more profitable than the baseline models.
Next, the confusion matrix of our model is analyzed to find the cause of profitability improvement.
Strategies  | SH000001  | SZ399005  | SZ399006
B&H         |  -233.13  |  -221.48  |  -557.53
SVM         |  1172.68  |  1177.10  |  7225.94
RF          |  1241.96  |  1247.25  |  7260.03
TreNet      |  1330.30  |  1255.96  |  7373.50
FDNN        |  1231.07  |  1273.94  |  7377.40
SFM         |  1316.16  |  1265.90  |  7459.88
MSCNN       |  1358.41  |  1262.12  |  7427.14
MSTDRCNN    |  1419.85  |  1400.93  |  7609.42
Bold numbers indicate the best results.
5.5 Confusion Matrix
To find the reason for the profitability improvement, we analyze the confusion matrix of our MSTDRCNN on the three datasets. For comparison, we also present the confusion matrix of MSCNN, since MSCNN shares a similar structure with our model and achieves nearly the best classification and profit performance among the baseline models.
There are three categories of financial timeseries: still (0), downward (1) and upward (2). Because of their impact on profitability, samples in the downward and upward categories are more important than the still ones. Consequently, misclassification between these two categories makes the model suffer losses in the simulated trading. In contrast, misclassifying the still category as either of the other two barely harms the model's profitability.
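The profit-relevant errors described above can be read directly off a 3x3 confusion matrix. A minimal sketch with hypothetical labels (the class encoding follows the paper; everything else is illustrative):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes=3):
    # cm[i, j] counts samples whose true class is i and predicted class is j
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# classes: 0 = still, 1 = downward, 2 = upward (labels are hypothetical)
y_true = [2, 2, 1, 1, 0, 2]
y_pred = [2, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred)

up_as_down = cm[2, 1]        # costly: predicted a fall during a rise
down_as_up = cm[1, 2]        # costly: predicted a rise during a fall
up_precision = cm[2, 2] / cm[:, 2].sum()
```

The two off-diagonal cells `cm[2, 1]` and `cm[1, 2]` are the loss-making errors in the simulated trading, while errors involving the still row or column are nearly harmless.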
Fig. 4 shows the confusion matrix of MSTDRCNN and MSCNN on SH000001, SZ399005, and SZ399006. The major observations are listed as follows:

MSTDRCNN misclassifies ”upward” as ”downward” less often than MSCNN; for instance, it makes fewer such errors on SH000001 (Fig. 4).

MSTDRCNN also misclassifies ”downward” as ”upward” less often than MSCNN; for example, it makes fewer such errors on SZ399005 (Fig. 4).

For the ”upward” and ”downward” categories, MSTDRCNN achieves higher precision than MSCNN; for example, on SZ399006, its ”downward” and ”upward” precision both exceed those of MSCNN (Fig. 4).
Compared to MSCNN, MSTDRCNN classifies the upward and downward categories more accurately and confuses these two categories less often. This accounts for our model achieving higher profitability than MSCNN, and is likely the reason it achieves the highest profitability in the simulated trading.
6 Conclusion and future works
This paper proposes a MultiScale Recurrent Convolutional Neural Network, denoted MSTDRCNN, for financial timeseries classification. The proposed method effectively combines the MultiScale (MS) property and Temporal Dependency (TD): convolutional units are integrated to extract MS features simultaneously, and a GRU captures the TD across the multiple scales. This enables the network to classify timeseries with the MS property by feedforwarding a single-scale input sequence, resulting in a very effective end-to-end classifier. The profitability of our model is also evaluated by a simulated trading algorithm. Extensive experimental results suggest that MSTDRCNN achieves state-of-the-art performance in financial timeseries classification.
In the future, we plan to explore three directions to improve MSTDRCNN. First, different feature-extractor structures: the more recent Transformer devlin2018bert is likely an even more effective structure than CNN. Second, an attention mechanism liu2019numerical can be introduced to handle the long-term dependency that cannot be captured by the RNN. Third, multiple sources of information can be used, especially textual information such as news.