
DoubleEnsemble: A New Ensemble Method Based on Sample Reweighting and Feature Selection for Financial Data Analysis

Modern machine learning models (such as deep neural networks and boosting decision tree models) have become increasingly popular in financial market prediction, due to their superior capacity to extract complex non-linear patterns. However, since financial datasets have very low signal-to-noise ratio and are non-stationary, complex models are often very prone to overfitting and suffer from instability issues. Moreover, as various machine learning and data mining tools become more widely used in quantitative trading, many trading firms have been producing an increasing number of features (aka factors). Therefore, how to automatically select effective features becomes an imminent problem. To address these issues, we propose DoubleEnsemble, an ensemble framework leveraging learning trajectory based sample reweighting and shuffling based feature selection. Specifically, we identify the key samples based on the training dynamics on each sample and elicit key features based on the ablation impact of each feature via shuffling. Our model is applicable to a wide range of base models, capable of extracting complex patterns, while mitigating the overfitting and instability issues for financial market prediction. We conduct extensive experiments, including price prediction for cryptocurrencies and stock trading, using both DNN and gradient boosting decision tree as base models. Our experiment results demonstrate that DoubleEnsemble achieves a superior performance compared with several baseline methods.



I Introduction

Financial markets are notoriously difficult to predict due to their competitive nature. Several common reasons partially explain why the prediction task is so difficult. First, there is the widely known efficient market hypothesis, which states that share prices reflect all information and that it is impossible to consistently outperform the overall market (see, e.g., the original paper by Samuelson [40]). Second, due to the existence of a large number of “noisy traders” [9] and other hidden factors that affect the movement of the market (e.g., government policy changes and breaking news), financial data is highly noisy, dynamic and volatile.

The multifactor model [39] is a popular model for asset pricing and market prediction. The model prices the asset or predicts the market movement based on multiple features (or factors), such as the firm size [5], the earnings’ yield [6], the leverage [8] and the book-to-market ratio [12]. Linear models have been the standard algorithm for the multifactor model but are greatly limited in their ability to exploit complex patterns. Recently, non-linear machine learning models (such as gradient boosting decision trees or deep learning models) have become popular due to their large model capacity [36, 4, 15, 19, 25]. However, these complex non-linear models are prone to overfitting and susceptible to noisy samples.

To provide the model with more information, quantitative traders or researchers often create hundreds or even thousands of features (aka factors) [27, 14, 18, 52]. However, training a prediction model with all the available features may lead to poor performance. Therefore, it is essential to select features that are not only informative but also uncorrelated with other features. For linear models (such as linear regression), we can select features with low correlations to alleviate the multicollinearity problem (see e.g., [17]). For highly complex non-linear models and highly noisy financial data, it is less clear how to effectively select features.

To address the aforementioned issues, we propose DoubleEnsemble, a new ensemble framework for financial market prediction. In particular, we construct sub-models in the ensemble one by one, where each sub-model is trained with both the weights of samples and carefully selected features. A wide range of base models can be used in learning the sub-models, such as the linear regression model, boosting decision trees, and deep neural networks. Each time, using our learning trajectory based sample reweighting scheme, we assign a weight to each sample in the original training set based on the training dynamics of the previous sub-model and the loss of the current ensemble. Moreover, we select features based on their contribution to the current ensemble via a shuffling technique.

There are three major contributions/features of our proposed DoubleEnsemble framework.

  1. Our method integrates sample reweighting and feature selection into a unified framework, and is named DoubleEnsemble. We ensemble diversified sub-models that are trained with not only different sample weights but also features. This property greatly alleviates the overfitting problem and makes DoubleEnsemble more stable and suitable for learning from highly noisy financial data.

  2. For the sample reweighting component, we propose a new learning trajectory based sample reweighting scheme, which fully incorporates the entire curve of the training loss into the construction of sample weights. This reweighting scheme can effectively reduce the weights of very easy and noisy samples and boost those of the key samples that are more informative for training the model. (Easy samples are those that the algorithm can classify correctly very easily. Fitting purely noisy samples may lead to overfitting. Hence, we would like the learning algorithm to focus less on these and more on the remaining samples. See Section III-A for the details.)

  3. For feature selection, traditional approaches (e.g., backward elimination and recursive feature elimination) usually attempt to remove redundant features according to their importance and retrain the whole model after removing each feature. In practice, retraining incurs a huge computational cost. Moreover, when training with neural networks, removing a feature could completely change the distribution of inputs, which leads to an extremely unstable training process. To address this challenge, we propose a new shuffling based feature selection method. Instead of removing a feature, we shuffle it across training samples and measure the change in the loss. A small change indicates that the feature is less relevant for the prediction task. Our feature selection approach is computationally efficient and has been shown to be effective on real financial datasets with a large number of factors.

In the experiments, we apply DoubleEnsemble to two financial markets: the cryptocurrency exchange OKEx and China’s A-share market, a securities market. These two markets have different trading rules and market participants, and therefore their historical data contain different types of noise and patterns. Moreover, we use DoubleEnsemble to construct prediction models to trade at different frequencies (from seconds to weeks). Our experiments show that DoubleEnsemble achieves superior performance in both markets. Specifically, DoubleEnsemble achieves a precision of 62.87% for predicting the direction of the cryptocurrency movement and an annualized return of over 51.37% with a Sharpe ratio of 4.941 in China’s A-share market.

The rest of our paper is organized as follows. We introduce the related work in Section II. Then, we introduce DoubleEnsemble in Section III and present the experiment results in Section IV. Finally, we summarize our work in Section V.

II Related Work

Ensemble Model. Ensemble is an effective way to enhance the model robustness. The key for an ensemble model is to construct good and diverse sub-models. The methods to construct sub-models can be divided into two categories. In the first category, individual but different models can be built separately, such as bagging [11]. This category is popular for financial market prediction. For example, Liang and Ng [30] use different base models to construct different sub-models; Xiang and Fu [46] and Zhai et al. [50] construct sub-models by selecting financial data from different time periods or different market environments respectively. The other category builds the sub-models based on the performance of those built previously, such as boosting [20]. The model built through this category of methods has better predictive accuracy but tends to overfit to the noise in the training data [32] and therefore is not currently widely used for financial market prediction.

Sample reweighting. Weighting the samples for the model training is shown to be effective in some computer vision applications: Saxena et al. [42] treat the weights of the samples as parameters and learn the weights via the gradient. Hu et al. [23] and Fan et al. [16] design a reward function for the weights and learn the weights via reinforcement learning. Ren et al. [38] train an additional neural network to learn the weights.

There is a conflict between the objective of boosting and denoising when assigning weights to the samples for the model training. Boosting increases the weights of the hard samples. This is similar to curriculum learning [7] where the model is trained to first fit the easy samples and then the hard samples. In financial market prediction, this can also be interpreted as learning another new pattern when the previous patterns are exploited. Examples of this trend of reweighting are [42] and [16]. On the other hand, for constructing an ensemble robust to the outliers and noisy samples, weights of these samples should be reduced. For instance, Jiang et al. [26], Liu et al. [31] and Nguyen et al. [35] reduce the weights of the samples that the model does not fit well. However, it is hard for us to distinguish between the hard samples and the outliers or the noisy samples. It is a challenge to reduce the weights of noisy samples while performing a boosting style of learning.

Feature selection. Conventionally, features for financial market prediction are manually selected [29, 33]. However, automation for feature selection is desired when the number of features increases. Xu et al. [47] and Booth et al. [10] recursively select the features based on the degree of performance degeneration when the values for the feature are permuted. De Prado [14] introduces several feature importance metrics for financial machine learning. Sun et al. [44] maximize the mutual information between selected features and labels. However, they do not study how to select features in conjunction with sample reweighting for better performance.

Noise reduction for finance. Noise reduction is crucial to extract information from the financial data with a low signal-to-noise ratio. In this paper, we focus on denoising in the phase of model training. Apart from reweighting the samples to denoise, Zhang et al. [53] and Xu et al. [48] design specific loss functions to denoise. Noise reduction can also be performed from the perspective of signal processing (e.g., filtering on the raw sequential data before extracting the features [2, 1]) or the perspective of financial risk control (e.g., controlling the extent of the risk exposure [37]).

III Method

In this section, we propose DoubleEnsemble, an ensemble model with two key components: trajectory-based sample reweighting and shuffling-based feature selection. We show the training process in Algorithm 1. In the process, we sequentially construct $K$ sub-models, $M_1, \dots, M_K$, each of which receives the features calculated from historical market information and gives a prediction. Given $k$, the ensemble model $\bar{M}^k(\cdot) = \frac{1}{k}\sum_{i=1}^{k} M_i(\cdot)$ is a simple average over the first $k$ sub-models. The training data $(X, y)$ contains the features and the labels. The features are denoted as $X = [x_1, \dots, x_N]^T \in \mathbb{R}^{N \times F}$, where $N$ is the number of samples and $F$ is the number of features. The labels are denoted as $y = (y_1, \dots, y_N)$. The $k$-th sub-model is trained based on the training data $(X, y)$ along with the weights for the samples $w^k = (w^k_1, \dots, w^k_N)$ and a set of selected features $f^k \subseteq [F]$. To train the next sub-model, we calculate the weights $w^{k+1}$ based on the learning trajectory and the loss on each sample, and select the features $f^{k+1}$ based on the feature importance to the current ensemble $\bar{M}^k$. During the training of the $k$-th sub-model, the learning trajectory on the $i$-th sample is denoted as $(c_{i,1}, \dots, c_{i,T})$, where $T$ is the number of iterations for training the sub-model and $c_{i,t}$ is the error of the $i$-th sample on the $t$-th iteration (for neural networks) or the error of this sample when the first $t$ trees are built (for boosting trees). The loss on the $i$-th sample is denoted as $l_i$, which is the error between the prediction $\bar{M}^k(x_i)$ and the label $y_i$. We denote the collection of the learning trajectory and the loss on all the samples as $C^k$ and $L^k$ respectively. We introduce the detailed rule for the sample reweighting process and the feature selection process in the subsequent subsections.

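1: Input: The training data $(X, y)$ and the number of sub-models $K$.
2: Set the weights $w^1 \leftarrow (1, \dots, 1)$
3: Select the features $f^1 \leftarrow [F]$
4: for $k = 1$ to $K$ do
5:     Train the sub-model $M_k$ with $(X, y)$, $w^k$ and $f^k$
6:     Retrieve the learning trajectory $C^k$ and the loss $L^k$
7:     $w^{k+1} \leftarrow \mathrm{SR}(C^k, L^k, k)$ (sample reweighting)
8:     $f^{k+1} \leftarrow \mathrm{FS}(\bar{M}^k, X, y)$ (feature selection)
9: Return: The ensemble model $\bar{M}^K$
Algorithm 1 DoubleEnsemble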


To extract the temporal information prior to the time point for prediction, we filter the signals (e.g., using the moving average, the Kalman filter, etc.) before calculating the features. We empirically found that this is more effective than filtering the signals using variants of recurrent neural networks (e.g., SFM [51]). Besides, in our model, the prediction of the ensemble model is a simple average of the predictions from all the sub-models. This is the simplest yet a robust way to aggregate the sub-models. We note that it is possible to set a weight for each sub-model or develop a stacked generalization ensemble (aka stacking). In general, a proper way to combine the sub-models can further improve the performance and we leave it as a future research direction.
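The sequential construction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the base model `train_linear_gd` (weighted least squares fitted by gradient descent), the squared-error loss, the layout of the trajectory array (one row per sample, one column per iteration) and the callables `reweight_fn` and `select_fn` are all assumptions introduced here.

```python
import numpy as np

def train_linear_gd(X, y, w, n_iters=50, lr=0.1):
    """Hypothetical base model: weighted least squares fitted by gradient descent.

    Returns the coefficients and the per-sample squared-error trajectory,
    stored with one row per sample and one column per iteration.
    """
    n, f = X.shape
    beta = np.zeros(f)
    traj = np.empty((n, n_iters))
    for t in range(n_iters):
        resid = X @ beta - y
        traj[:, t] = resid ** 2                      # error of every sample at iteration t
        beta -= lr * (X.T @ (w * resid)) / w.sum()   # weighted gradient step
    return beta, traj

def double_ensemble(X, y, n_models, reweight_fn, select_fn):
    """Build sub-models one by one; the ensemble is their simple average."""
    n, f = X.shape
    weights = np.ones(n)     # the first sub-model uses equal sample weights
    feats = np.arange(f)     # ... and all features
    submodels = []           # list of (feature indices, coefficients)

    def predict(X_new):
        # Simple average over the sub-models built so far.
        return np.mean([X_new[:, idx] @ b for idx, b in submodels], axis=0)

    for k in range(1, n_models + 1):
        beta, traj = train_linear_gd(X[:, feats], y, weights)
        submodels.append((feats, beta))
        loss = (predict(X) - y) ** 2            # loss of the current ensemble
        weights = reweight_fn(traj, loss, k)    # sample reweighting step
        feats = select_fn(predict, X, y)        # feature selection step
    return predict

# Usage with placeholder steps (equal weights, all features kept).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=200)
model = double_ensemble(
    X, y, n_models=3,
    reweight_fn=lambda traj, loss, k: np.ones(len(loss)),
    select_fn=lambda predict, X, y: np.arange(X.shape[1]),
)
mse = float(np.mean((model(X) - y) ** 2))
```

With the placeholder steps the loop reduces to averaging identical sub-models; plugging in real reweighting and selection functions is what makes the sub-models diverse.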

III-A Learning trajectory based sample reweighting

The details of the learning trajectory based sample reweighting (SR) process can be found in Algorithm 2. We first calculate the $h$-value for each sample and then divide all the samples into $B$ bins according to the $h$-value. Then, we assign the same weights to the samples in the same bin.

1: Inputs: The learning trajectory $C$, the loss $L$ and the index of the sub-model $k$
2: Parameters: The coefficients $\alpha_1$ and $\alpha_2$, the number of bins $B$ and the decay factor $\gamma$
3: Calculate the $h$-value for each sample based on (1)
4: Divide the samples into $B$ bins according to the $h$-values
5: Calculate the weight based on (2)
6: Return: The weights $w$
Algorithm 2 SR: Learning trajectory based sample reweighting

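Instead of using $C$ and $L$ directly, we use the normalized learning trajectory $\tilde{C} = \mathrm{norm}(C)$ and the normalized loss $\tilde{L} = \mathrm{norm}(L)$, where $\mathrm{norm}(\cdot)$ outputs the column-wise normalized matrix via ranking, i.e., $\mathrm{norm}(A)_{ij} = p$ if $A_{ij}$ is larger than a fraction $p$ of the elements in the $j$-th column of $A$. Then, the $h$-values for all the samples are calculated as follows:

$$h = \alpha_1 \underbrace{\left(-\tilde{L}\right)}_{h_1} + \alpha_2 \underbrace{\mathrm{norm}\!\left(\frac{\tilde{C}_{\mathrm{end}}}{\tilde{C}_{\mathrm{start}}}\right)}_{h_2}, \qquad (1)$$

where $\tilde{C}_{\mathrm{start}}$ and $\tilde{C}_{\mathrm{end}}$ represent the first and last row of the normalized learning trajectory respectively, and the division is element-wise. For stability, we use the mean of the first and the last 10% rows of $\tilde{C}$.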

To avoid extreme values for the weights, we further divide the samples into $B$ bins according to the $h$-values and assign the same weights to the samples in the same bin. Suppose the $i$-th sample is divided into the $b$-th bin. The weight of this sample is assigned as follows:

$$w_i = \frac{1}{\gamma^{k} h_b + 0.1}, \qquad (2)$$

where $h_b$ is the mean of the $h$-values that corresponds to the $b$-th bin. Further, we use a decay factor $\gamma$ to encourage the weight distribution to be more uniform in the latter sub-models of the ensemble. This technique is a simplified version of the concept of the self-paced factor in [31].
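A minimal NumPy sketch of this reweighting process is given below. It is an illustration under stated assumptions, not the reference implementation: the trajectory array is assumed to hold one row per sample and one column per iteration, ranks are taken across samples, the loss term uses the rank of the negated loss (so that it stays positive), bins are equal-width, and the defaults `alpha1`, `alpha2`, `n_bins`, `decay` and the smoothing constant `eps` are illustrative values.

```python
import numpy as np

def rank_pct(a):
    """Rank-normalize each column of `a` to (0, 1]; a 1-D input is one column."""
    order = np.argsort(np.argsort(a, axis=0), axis=0)
    return (order + 1) / a.shape[0]

def sample_reweight(traj, loss, k, alpha1=1.0, alpha2=1.0,
                    n_bins=10, decay=0.5, eps=0.1):
    """Weights from the learning trajectory (N x T) and the per-sample loss (N,)."""
    n, t = traj.shape
    traj_norm = rank_pct(traj)                       # rank across samples per iteration
    part = max(int(0.1 * t), 1)                      # first / last 10% of the iterations
    c_start = traj_norm[:, :part].mean(axis=1)
    c_end = traj_norm[:, -part:].mean(axis=1)
    h1 = rank_pct(-loss)                             # small loss -> large h1
    h2 = rank_pct(c_end / c_start)                   # rising trajectory -> large h2
    h = alpha1 * h1 + alpha2 * h2
    edges = np.linspace(h.min(), h.max(), n_bins + 1)
    bin_idx = np.digitize(h, edges[1:-1])            # equal-width bins 0 .. n_bins-1
    weights = np.empty(n)
    for b in np.unique(bin_idx):
        mask = bin_idx == b
        weights[mask] = 1.0 / (decay ** k * h[mask].mean() + eps)
    return weights

# Toy trajectories: easy samples decay quickly, hard samples decline slowly
# from a high loss, noisy samples stay flat.
rng = np.random.default_rng(0)
t = np.arange(20)
offset = rng.uniform(0, 1e-3, size=(300, 1))         # small per-sample offset
easy = 0.8 * np.exp(-t) * np.ones((100, 1))
hard = np.linspace(1.2, 0.3, 20) * np.ones((100, 1))
noisy = np.ones((100, 20))
traj = np.vstack([easy, hard, noisy]) + offset
weights = sample_reweight(traj, traj[:, -1], k=1)
```

In this toy setting the hard samples receive the largest average weight, because their rank within each iteration falls over training while their final loss is neither the smallest nor the largest.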

Fig. 1: A small instance to illustrate the learning trajectory based sample reweighting scheme. a) The instance is a binary classification task with the dotted line being the underlying true decision boundary. The easy samples (blue) are those with large margins to the true decision boundary while the hard samples (orange) are very close to the true decision boundary. There are some noisy samples (green) whose labels are random regardless of the decision boundary. b) We train a neural network with one hidden layer as the classifier with stochastic gradient descent and plot the normalized learning trajectories of several samples from the three types, i.e., the corresponding rows in $\tilde{C}$. c) The weights of the three types of samples calculated using Equation (2) with $h_1$, $h_2$ or $h$ as the $h$-value, which are denoted as , and respectively.

Now, we explain the intuition behind our design with a small example in Figure 1. Consider three types of samples in a classification task: the easy samples that are easily classified correctly, the hard samples that are close to the true decision boundary and may easily get misclassified, and the noisy samples that may mislead the model. We would like our reweighting scheme to boost the weights of hard samples while reducing the weights of the easy and the noisy samples, since easy samples can be fitted anyway and fitting noisy samples may lead to overfitting. The $h_1$ term helps to reduce the weights of easy samples. Specifically, the loss of an easy sample tends to be small, which leads to a large value for $h_1$ and therefore a small weight. However, this term also boosts the noisy samples since it is hard to distinguish the noisy samples from the hard samples solely based on the loss value. With $h_2$, we can distinguish them by their learning trajectories (cf. Figure 1b). Intuitively, we assign large weights to the samples with a descending normalized learning trajectory. Since the training process is driven by the majority of the samples, the loss of most of the samples tends to decrease while the loss of noisy samples usually stays the same or even increases. Therefore, the normalized learning trajectory of noisy samples will increase, which leads to large $h_2$ values and therefore small weights. For the easy samples, their normalized learning trajectories are more likely to remain the same or fluctuate slightly after a quick decay, which results in moderate $h_2$ values and therefore moderate weights. For the hard samples, their normalized learning trajectories slowly decline during the training, which indicates their contribution to the decision boundary. This results in small $h_2$ values and therefore large weights. We show the weights of the three types of samples calculated using $h_1$, $h_2$ and $h$ as the $h$-value respectively in Figure 1c. We observe that using $h_1$ boosts not only the weights of hard samples but also those of noisy samples, while using $h_2$ suppresses the weights of noisy samples. With $h_1$ and $h_2$ combined (i.e., $h$), we can effectively boost the hard samples and reduce the weights for the easy samples and the noisy samples.

III-B Shuffling based feature selection

We use the shuffling based feature selection (FS) process in DoubleEnsemble to select features for training the next sub-model. We show this process in Algorithm 3.

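1: Inputs: The ensemble model $\bar{M}^k$, the training data $(X, y)$
2: Parameters: The number of bins $D$ and the sample ratio $r_d\%$ for each bin
3: $L \leftarrow$ the losses of $\bar{M}^k$ on $(X, y)$
4: for each feature index $f$ in $[F]$ do
5:     $X^f \leftarrow X$ with the $f$-th column shuffled
6:     $L^f \leftarrow$ the losses of $\bar{M}^k$ on $(X^f, y)$
7:     $g_f \leftarrow \mathrm{mean}(L^f - L) / \mathrm{std}(L^f - L)$
8: Divide the features into $D$ bins according to the $g$-values
9: $f^{k+1} \leftarrow \emptyset$
10: for each bin $d$ do
11:     $f_d \leftarrow$ $r_d\%$ features randomly selected from the bin, where $r_d$ for $d = 1, \dots, D$ are different for each bin.
12:     $f^{k+1} \leftarrow f^{k+1} \cup f_d$
13: Return: The selected features $f^{k+1}$
Algorithm 3 FS: shuffling based feature selection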

We first calculate a $g$-value for each feature to measure the contribution of the feature to the current ensemble (i.e., feature importance). A large $g$-value indicates a large discrepancy between $L^f$ (the losses with the $f$-th feature shuffled) and $L$ (the original losses), which shows that this feature is important since the elimination of the feature (via shuffling) significantly increases the losses on the samples. For robustness against extreme $g$-values, we then divide all the features into $D$ bins according to the $g$-values and randomly select $r_d\% \times n_d$ features from the $d$-th bin, where $n_d$ is the number of features in this bin. Finally, we concatenate and return all the randomly selected features.

The reason for the design is as follows: To estimate the contribution of a feature to the model, we would like to compare the performance with and without the feature. One natural but costly way is to eliminate the feature, retrain and then re-evaluate the model. Instead of training a new model, we perturb the dataset to eliminate the contribution of the feature and compare the performance of the model on the perturbed dataset with that on the original dataset. Our scheme is computationally more efficient since there is no need to retrain a model.
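The shuffling based selection can be sketched as follows. This is a hedged illustration, not the authors' code: the squared-error loss, the importance score (mean of the loss increase divided by its standard deviation), the equal-width bins, the `ratios` values and the function name `shuffle_feature_select` are assumptions made for this example. Bins are visited from the most to the least important, and a larger share of the features is kept from the more important bins.

```python
import numpy as np

def shuffle_feature_select(predict, X, y, n_bins=5,
                           ratios=(0.8, 0.7, 0.6, 0.5, 0.4), seed=0):
    """Return the selected feature indices and the per-feature importance scores."""
    rng = np.random.default_rng(seed)
    base_loss = (predict(X) - y) ** 2                # per-sample loss on the original data
    n_feat = X.shape[1]
    g = np.zeros(n_feat)
    for f in range(n_feat):
        X_shuf = X.copy()
        X_shuf[:, f] = rng.permutation(X[:, f])      # shuffle one column, no retraining
        diff = (predict(X_shuf) - y) ** 2 - base_loss
        g[f] = diff.mean() / (diff.std() + 1e-7)     # large value = important feature
    edges = np.linspace(g.min(), g.max(), n_bins + 1)
    bin_idx = np.digitize(g, edges[1:-1])            # equal-width bins over the scores
    selected = []
    non_empty = sorted(np.unique(bin_idx), reverse=True)
    for rank, b in enumerate(non_empty):             # most important bin first
        members = np.flatnonzero(bin_idx == b)
        n_keep = int(np.ceil(ratios[rank] * len(members)))
        selected.extend(rng.choice(members, size=n_keep, replace=False))
    return np.sort(np.array(selected)), g

# Usage: only feature 0 drives the label, so it should get the top score.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 6))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=500)
selected, g = shuffle_feature_select(lambda A: 3.0 * A[:, 0], X, y)
```

Because the model is only re-evaluated, never retrained, the cost is one forward pass over the data per feature.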

Moreover, we argue that shuffling is more appropriate than replacing with zeros (or the mean of the feature). This is because many machine learning models are sensitive to the input data distribution. Shuffling keeps the marginal distribution of that feature, and replacing with zeros completely changes the distribution. For a simple example, consider a feature whose values are either or and the mean is . The trained model would focus on the regions where the feature value is or (regions with denser samples are better fitted). Hence, the region around feature value is not fitted well and the model may behave arbitrarily for samples with feature value replaced by , and cannot correctly reflect the performance when this feature is eliminated.
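The difference between the two perturbations can be checked directly. The sketch below assumes a two-valued feature taking the values -1 and +1; these values are chosen only for illustration. Shuffling leaves the multiset of feature values unchanged, whereas replacing the column with zeros feeds the model a value it never saw during training.

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.choice([-1.0, 1.0], size=1000)   # illustrative two-valued feature

shuffled = rng.permutation(feature)            # same values, different sample order
zeroed = np.zeros_like(feature)                # every value replaced with zero

# Shuffling preserves the marginal distribution exactly.
same_marginal = np.array_equal(np.sort(feature), np.sort(shuffled))

# Zero never occurs in the original feature, so the model is evaluated
# in a region it was never fitted on.
zero_is_unseen = 0.0 not in set(np.unique(feature))
```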

In addition, the shuffling based feature selection method has the following advantages: First, it considers the contribution of the feature to the model which is trained along with other features, instead of the quality of the feature itself such as the frequently used information coefficient and information ratio [21] in finance. Second, unlike other feature importance metrics that only apply to specific models (such as the information gain in boosting trees and the coefficients in Lasso [41]), the $g$-value is applicable to different base models.

IV Experiments

We apply DoubleEnsemble to prediction tasks in two different financial markets: OKEx (a cryptocurrency exchange) and China’s A-share market (a securities exchange).

In the first set of experiments on OKEx, we compare DoubleEnsemble with a set of baseline methods and several ablated variants of DoubleEnsemble to measure the effectiveness of the designs in DoubleEnsemble. Also, we design comparative experiments to quantify the robustness of our model to different levels of noise.

In the second set of experiments on China’s A-share market, we train predictors and then construct trading strategies based on the predictors via variants of DoubleEnsemble and several baselines. The experiments demonstrate that the superior performance of our predictors can be translated into profits from the induced strategy. We also conduct experiments under two different trading frequencies with different sets of features.

In the following experiments, we use sub-models. In the SR process, we use and bins. In the FS process, we use bins and the sample ratios are .

IV-A DoubleEnsemble to trade cryptocurrencies

Columns 3 to 6 are the 30% Noise setting and columns 7 to 10 are the 50% Noise setting. Each entry is mean/standard deviation.

Base model: MLP
Group | Method | ACC (%) | AUC (%) | F1 (%) | PCT (‰) | ACC (%) | AUC (%) | F1 (%) | PCT (‰)
DoubleEnsemble | SR | 60.78/0.65 | 52.54/0.54 | 75.83/0.51 | 2.20/1.01 | 60.05/0.43 | 53.49/0.17 | 75.04/0.34 | 1.89/0.67
DoubleEnsemble | SR (1st only) | 60.93/0.17 | 52.86/0.14 | 75.72/0.13 | 2.49/0.26 | 59.95/0.44 | 52.89/0.51 | 74.96/0.34 | 1.82/0.67
DoubleEnsemble | SR (2nd only) | 60.17/1.49 | 53.65/1.78 | 75.33/1.17 | 2.28/2.29 | 59.78/3.90 | 53.59/0.45 | 74.43/3.14 | 1.70/1.02
DoubleEnsemble | FS | 61.00/0.11 | 52.69/0.60 | 75.77/0.09 | 2.53/0.18 | 59.40/0.58 | 53.59/0.76 | 74.53/0.46 | 1.44/0.90
DoubleEnsemble | SR+FS | 62.10/0.87 | 53.56/0.76 | 76.62/0.66 | 3.18/1.35 | 60.94/0.94 | 54.27/0.55 | 75.72/0.73 | 2.49/1.44
Basic Methods | SingleModel | 58.03/0.46 | 52.57/0.39 | 73.44/0.28 | 0.50/0.60 | 58.10/0.52 | 52.68/0.51 | 74.29/0.26 | 0.73/0.52
Basic Methods | SimpleEnsemble | 59.77/0.46 | 53.47/0.97 | 74.82/0.36 | 1.69/0.70 | 59.63/0.24 | 53.25/0.94 | 74.71/0.18 | 1.59/0.36
Basic Methods | RandomEnsemble | 60.17/0.67 | 52.42/0.20 | 75.13/0.51 | 1.97/1.03 | 59.85/0.57 | 52.12/0.63 | 74.88/0.44 | 1.75/0.88
Baseline Methods | LDMI[48] | 58.61/0.51 | 52.09/0.47 | 73.91/0.41 | 0.90/0.78 | 57.52/1.72 | 51.73/0.69 | 73.01/1.41 | 0.15/2.63
Baseline Methods | LCCN[49] | 57.96/0.41 | 52.80/0.53 | 73.38/0.32 | 0.45/0.62 | 58.34/0.19 | 52.27/0.47 | 73.69/0.15 | 0.71/0.30
Baseline Methods | CoTeach[22] | 59.37/0.56 | 51.03/0.45 | 74.50/0.44 | 1.42/0.86 | 58.63/0.31 | 51.30/0.74 | 73.91/0.24 | 0.91/0.46
Baseline Methods | MentorNet[26] | 58.37/0.40 | 52.75/0.41 | 73.71/0.32 | 0.73/0.62 | 57.92/0.41 | 52.60/0.37 | 73.35/0.33 | 0.43/0.64
Baseline Methods | LearnReweight[38] | 58.72/0.56 | 52.50/0.53 | 73.98/0.44 | 0.97/0.86 | 56.06/0.15 | 51.46/0.14 | 71.84/0.12 | -0.85/0.22
Baseline Methods | Curriculum[7] | 60.39/0.36 | 52.38/0.50 | 75.62/0.26 | 2.16/0.55 | 60.15/0.62 | 53.12/0.95 | 75.12/0.48 | 1.96/0.95
No noise, SingleModel (ACC, AUC, F1, PCT): 61.20/0.82 | 52.85/0.74 | 75.93/0.62 | 2.68/1.25

Base model: GBM
Group | Method | ACC (%) | AUC (%) | F1 (%) | PCT (‰) | ACC (%) | AUC (%) | F1 (%) | PCT (‰)
DoubleEnsemble | SR | 61.73/0.40 | 52.53/0.34 | 76.34/0.31 | 3.04/0.62 | 60.54/0.68 | 54.33/0.28 | 75.42/0.53 | 2.22/1.05
DoubleEnsemble | SR (1st only) | 57.92/0.33 | 52.14/0.23 | 73.35/0.25 | 0.42/0.51 | 58.56/0.24 | 52.81/0.16 | 73.87/0.19 | 0.87/0.36
DoubleEnsemble | SR (2nd only) | 62.47/0.77 | 53.08/0.62 | 76.90/0.59 | 3.54/1.81 | 60.92/0.87 | 52.93/0.75 | 75.71/0.67 | 2.48/1.33
DoubleEnsemble | FS | 57.53/0.30 | 52.85/0.37 | 72.90/0.24 | 0.04/0.46 | 58.06/1.80 | 54.40/0.58 | 73.25/1.40 | -0.89/0.37
DoubleEnsemble | SR+FS | 62.87/1.07 | 54.15/0.80 | 77.67/0.83 | 3.83/1.64 | 61.49/0.58 | 53.71/0.21 | 76.16/0.45 | 2.87/0.90
Basic Methods | SingleModel | 56.17/0.36 | 52.71/0.46 | 71.93/0.29 | -0.77/0.55 | 55.13/0.59 | 54.05/0.55 | 71.07/0.49 | -1.44/1.01
Basic Methods | SimpleEnsemble | 56.04/0.30 | 53.35/0.42 | 71.82/0.24 | -0.87/0.45 | 54.42/0.19 | 54.61/0.49 | 70.48/0.16 | -1.49/0.91
Basic Methods | RandomEnsemble | 56.23/0.28 | 53.35/0.33 | 71.98/0.23 | -0.73/0.43 | 53.62/0.28 | 54.14/0.25 | 69.81/0.23 | -2.52/0.42
Baseline Methods | Curriculum[7] | 58.31/0.58 | 52.88/0.14 | 73.67/0.46 | 1.94/0.89 | 57.24/0.29 | 53.37/0.80 | 72.34/0.23 | 0.04/0.49
No noise, SingleModel (ACC, AUC, F1, PCT): 57.30/0.60 | 51.37/0.29 | 72.86/0.48 | 0.00/0.92

TABLE I: Experiment results on OKEx. See the detailed description of the experiment in Section IV-A. The numbers in each entry are the mean and the standard deviation from 5 independent runs respectively. The transaction fee is 0.2‰.

This set of experiments is based on data from OKEx, a cryptocurrency exchange where traders around the world can trade between different cryptocurrencies 24 hours a day. In this set of experiments, we use the data from four trading pairs: ETC/BTC, ETH/BTC, GAS/BTC and LTC/BTC. For each trading pair, one sample corresponds to one market snapshot, which is captured approximately every second. The training samples used in the experiments are from consecutive trading days, with a total number of million. The testing samples come from the following trading days, with a total number of million. We use features, which are calculated based on the microstructure information of the market (snapshots of the limit order book), such as order flow imbalance (OFI) [13] and relative strength index (RSI) [45].

We compare the algorithms under two settings with different noise levels. In the setting denoted by 30% noise, we add 20 additional random features and 30% random samples (i.e., the values of these features/samples are randomly drawn from ). In the setting denoted by 50% noise, we add 30 random features and 50% random samples. Next, we introduce the algorithms that we compare and the performance metrics that we use.

DoubleEnsemble variants

We use SR to denote the ensemble model that only uses the SR process, i.e., using all the features. We use 1st only and 2nd only to denote the variants that only use the first term (i.e., $h_1$) or the second term (i.e., $h_2$) in Equation (1) for the SR process respectively. We use FS to denote the ensemble model that only uses the FS process, i.e., using equal weights.

Basic methods


SingleModel: We use the training samples with all the available features and equal weights to train a single model. In the experiments, we use two types of base model: the neural network model (denoted as MLP) and the gradient boosting decision tree model (denoted as GBM). For the MLP model, we use a multi-layer perceptron with two hidden layers (each of which has neurons) followed by a dropout layer [43] and a batch-norm layer [24]. We use Mish [34] as the activation function and train the model for epochs with early stopping and exponentially decaying learning rate. For the GBM model, we use LightGBM [28] with trees, each of which has at most 32 leaves. In the later experiments, unless otherwise stated, the hyperparameters for training the sub-models are the same as used here. Notice that this single model is the same as the first sub-model in DoubleEnsemble.

SimpleEnsemble: This baseline model is an ensemble model that contains identical sub-models. The only difference between the sub-models is that they use different random seeds. We set this baseline to observe the performance difference brought by constructing an ensemble.

RandomEnsemble: This baseline model differs from the previous baseline SimpleEnsemble in that its sub-models not only use different random seeds but are also trained with samples that are assigned random weights. We note that randomly reweighting the samples may improve the performance because it increases the diversity of the sub-models. We set this baseline to isolate the performance difference arising from this effect. Constructing an ensemble by randomly reweighting samples is similar to bagging, where the samples are randomly selected to construct different sub-models [11].

Baseline methods

The following baseline methods are designed for noise robustness and we compare our algorithm with them in terms of noise sensitivity. LDMI [48] uses an information-theory based loss function for training a neural network robust to noisy samples. Latent class-conditioned noise model (LCCN) [49] is another model designed for training a robust deep learning model against the noise by modeling the noise transition. CoTeaching [22] simultaneously trains two neural networks and utilizes the communication between the two networks to select clean data. MentorNet [26] trains a mentor network to weight the samples based on their training dynamics for noise reduction. LearnReweight [38] sets the weights of the samples as parameters and learns the weights via gradient descent. The above baseline methods construct single models. Additionally, we design Curriculum to construct an ensemble model with sub-models, each of which uses the to of the easiest samples (the samples with lowest losses), which can be regarded as an ensemble version of curriculum learning [7].

Performance metrics

Precision: While standard classification problems care about the prediction accuracy on all the samples, the classification problems for financial market prediction care more about the accuracy for the retrieved samples. In financial market prediction, a retrieved sample corresponds to a trading signal and therefore relates to the profit of the trading strategy. Hence, we set the threshold such that approximately 1% of the samples are retrieved, and use precision as the performance metric. This corresponds to trading each pair for every seconds on average.

AUC: We also use the area under the ROC curve (ROC AUC) as the performance metric to summarize the performances of the predictor under different thresholds.

F1: In financial market prediction, we also care about the recall, which indicates the ability of the model to seize trading opportunities. Therefore, we also use the F1 score, which combines the precision and the recall, as a performance measure. It is defined as $\mathrm{F1} = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$.

PCT: Finally, we directly measure the profitability by PCT, which is the average return for each trading day if we follow the following strategy. Each time the sample corresponding to the current trading time point is retrieved by the predictor (which we call a trading signal), we long the base currency in the next trading time point and then close the position after seconds.
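The retrieval-based metrics above can be illustrated with a minimal sketch. It assumes binary labels (1 for a positive sample) and a real-valued score per sample; the function names `retrieve_top_fraction` and `precision_recall_f1` are hypothetical and are not taken from the paper's implementation.

```python
from typing import List, Sequence, Tuple


def retrieve_top_fraction(scores: Sequence[float], fraction: float = 0.01) -> List[bool]:
    """Mark roughly the top `fraction` of samples (by score) as retrieved."""
    n_retrieved = max(1, round(len(scores) * fraction))
    # Indices sorted by descending score; the first n_retrieved act as trading signals.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    retrieved = [False] * len(scores)
    for i in order[:n_retrieved]:
        retrieved[i] = True
    return retrieved


def precision_recall_f1(retrieved: Sequence[bool], labels: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for binary labels (1 = positive)."""
    true_pos = sum(1 for r, y in zip(retrieved, labels) if r and y == 1)
    n_retrieved = sum(1 for r in retrieved if r)
    n_positive = sum(1 for y in labels if y == 1)
    precision = true_pos / n_retrieved if n_retrieved else 0.0
    recall = true_pos / n_positive if n_positive else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

Setting the threshold through a retrieved fraction, rather than a fixed score cut-off, keeps the number of trading signals comparable across models whose raw scores are on different scales.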

(a) Results in the DAILY setting using MLP as the base model.
(b) Results in the DAILY setting using GBM as the base model.
(c) Results in the WEEKLY setting using MLP as the base model.
(d) Results in the WEEKLY setting using GBM as the base model.
Fig. 2: Hedged equity curves of different models under different settings. The transaction fee for each pair of trades is 0.3%. The blue bars in the background indicate the ICs for the overall prediction.

Experiment results

We show the experiment results for the cryptocurrency prediction in Table I. The first number in each entry is the mean of 5 runs with different random seeds, and the second number in the entry is the standard deviation of the 5 runs.

First, we observe that the DoubleEnsemble variants achieve a good performance in the two settings with different noise levels and the DoubleEnsemble algorithm (i.e., SR+FS) achieves the best performance. Besides, although the AUC difference between the DoubleEnsemble variants and other baselines is not significant, the precision and the profitability difference is notable. This indicates that DoubleEnsemble has a higher accuracy on the key samples (i.e., the distinguishable samples with high future returns) and therefore is more suitable for financial applications.

Second, the experiment results also demonstrate the role of the SR process. We can compare the SR models (the models that use SR) with SingleModel, SimpleEnsemble and RandomEnsemble. When using MLP as the base model, the performance improvement brought by the SR process comes not only from constructing an ensemble or from the diversity increase resulting from reweighting, but also from the reweighting scheme used in the SR process. When using GBM as the base model, the performance improvement mainly results from the reweighting scheme of the SR process. This quantifies the important role that SR plays in identifying and weighting the key samples. We also found that, although some baselines (such as LCCN) are robust to different noise levels, the SR models outperform the previous baseline methods that reweight the samples to denoise. The reason may be that the SR process is designed not only to denoise but also to improve the performance by boosting the key samples.

Finally, the experiment results show the performance improvement brought by the FS process. This improvement can be observed by comparing FS with RandomEnsemble or by comparing SR+FS with SR.

IV-B DoubleEnsemble to trade stocks

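| Base model | Group | Model | Ann.Ret. (DAILY) | Sharpe (DAILY) | MDD (DAILY) | IC/IR (DAILY) | Ann.Ret. (WEEKLY) | Sharpe (WEEKLY) | MDD (WEEKLY) | IC/IR (WEEKLY) |
|---|---|---|---|---|---|---|---|---|---|---|
| MLP | DoubleEnsemble | SR+FS | 51.37% | 4.941 | 5.98% | 0.115/1.035 | 25.67% | 4.448 | 2.41% | 0.078/0.773 |
| MLP | DoubleEnsemble | SR+Manual | 50.68% | 4.343 | 7.94% | 0.106/0.994 | 19.16% | 3.300 | 2.48% | 0.078/0.784 |
| MLP | DoubleEnsemble | SR+ALL | 37.25% | 2.933 | 14.34% | 0.103/0.966 | 15.36% | 3.051 | 2.32% | 0.070/0.691 |
| MLP | Baselines | SimpleEnsemble+All | 26.74% | 2.435 | 12.61% | 0.091/0.967 | 12.56% | 2.049 | 4.59% | 0.058/0.670 |
| MLP | Baselines | SimpleEnsemble+Manual | 46.49% | 3.813 | 11.75% | 0.097/0.963 | 16.78% | 2.817 | 2.45% | 0.068/0.757 |
| MLP | Baselines | TimeWeighted+Manual | 22.10% | 1.936 | 18.49% | 0.081/0.791 | 15.10% | 2.342 | 3.56% | 0.061/0.700 |
| MLP | Baselines | PCTWeighted+Manual | 28.65% | 2.269 | 10.32% | 0.094/0.940 | 17.07% | 3.704 | 2.84% | 0.070/0.735 |
| GBM | DoubleEnsemble | SR+FS | 46.60% | 4.151 | 8.60% | 0.103/0.861 | 16.77% | 3.160 | 3.23% | 0.068/0.668 |
| GBM | DoubleEnsemble | SR+Manual | 41.24% | 3.854 | 9.87% | 0.096/0.807 | 19.84% | 3.862 | 3.93% | 0.071/0.676 |
| GBM | DoubleEnsemble | SR+ALL | 29.75% | 3.594 | 7.13% | 0.097/0.816 | 15.76% | 3.379 | 4.04% | 0.070/0.670 |
| GBM | Baselines | SimpleEnsemble+All | 18.19% | 1.661 | 18.45% | 0.101/0.858 | 11.55% | 2.337 | 3.61% | 0.065/0.635 |
| GBM | Baselines | SimpleEnsemble+Manual | 26.74% | 2.435 | 12.61% | 0.097/0.815 | 15.48% | 2.902 | 3.52% | 0.068/0.650 |
| GBM | Baselines | TimeWeighted+Manual | 23.39% | 2.176 | 21.72% | 0.093/0.768 | 12.47% | 2.498 | 3.13% | 0.062/0.636 |
| GBM | Baselines | PCTWeighted+Manual | 22.20% | 1.669 | 13.68% | 0.093/0.832 | 14.49% | 2.355 | 4.22% | 0.066/0.642 |

TABLE II: Performance of the stock trading strategies. The transaction fee is 0.3%.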

In this set of experiments, we train predictors for the stock market and trade the stocks based on the prediction. We base our experiments on China’s A share market where over 3,000 stocks are traded. Each sample corresponds to one trading day of one stock.

Experiment settings

We conduct experiments in two different settings. In the first setting (denoted by DAILY), we long the top 20 stocks suggested by the predictor at the market closing of each trading day, and then sell these stocks upon the closing time of the next trading day. The predictions are based on 182 features that are calculated 3 minutes before the market closing of that trading day. In the second setting (denoted by WEEKLY), after the market closing on each trading day, we calculate 254 features based on the historical market information and make the prediction. In the next trading day, we long the top 10 stocks suggested by the prediction at the open price and hold these stocks for five trading days. Thereafter, we sell these stocks after the opening of the fifth trading day. In this setting, we are holding 50 stocks for most of the time. The features in both settings are composed of technical factors and fundamental factors, such as moving average convergence/divergence (MACD) [3] and price-to-book ratio (P/B) [12]. They are designed for the prediction at different frequencies and created by different trading firms. Therefore, they possess quite different underlying properties. Since there are more features in this experiment, we use three hidden layers with more neurons (256, 128 and 64 neurons respectively) in the MLP model and 250 trees in the GBM model.

We run the backtests for the models following a rolling scheme. We re-train the model every week, each time using the features of the latest 500 trading days (i.e., approximately the latest two years). The trading period for the two settings is from January 2017 to November 2019. For trading details, we exclude the stocks that hit the daily limit-up or that were listed within the past 3 months. We long the top stocks with equal weights. The transaction fee plus slippage is 0.3%. We did not specifically consider the impact of holidays and trading suspensions when making predictions and conducting the backtests.
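The rolling scheme and the equal-weight stock selection can be sketched as follows. This is an illustration under stated assumptions rather than the backtest code used in the experiments: one week is taken to be 5 trading days, days are represented by integer indices, and the names `rolling_splits` and `equal_weight_top_k` are hypothetical.

```python
from typing import Dict, Iterator, Set, Tuple


def rolling_splits(n_days: int, train_days: int = 500, step: int = 5) -> Iterator[Tuple[range, range]]:
    """Yield (train, trade) day-index ranges for a rolling backtest.

    Each model is trained on the `train_days` days preceding a trading block
    and is then used for the next `step` trading days (assumed: 5 days per week).
    """
    start = train_days
    while start < n_days:
        train = range(start - train_days, start)
        trade = range(start, min(start + step, n_days))
        yield train, trade
        start += step


def equal_weight_top_k(scores: Dict[str, float], excluded: Set[str], k: int) -> Dict[str, float]:
    """Long the k highest-scored eligible stocks with equal weights.

    `excluded` holds the stocks filtered out before ranking, e.g. those at the
    daily limit-up or those listed only recently.
    """
    eligible = [s for s in scores if s not in excluded]
    top = sorted(eligible, key=lambda s: scores[s], reverse=True)[:k]
    return {s: 1.0 / len(top) for s in top} if top else {}
```

Because every training window ends strictly before its trading block begins, no information from the trading period leaks into the model that trades it.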


In this set of experiments, we compare the DoubleEnsemble variants with a set of baselines.

In terms of sample reweighting, we compare the SR process with SimpleEnsemble and two other heuristic reweighting schemes designed for financial market prediction. Based on the observation that the patterns in the market vary with time, TimeWeighted gives larger weights to more recent samples to encourage the model to exploit current patterns. Also, since we care about the accuracy on the samples that trigger trading signals, the model should pay more attention to the samples that are likely to be retrieved. Accordingly, we design and compare against PCTWeighted, where the historical samples with high returns are assigned larger weights. We use PCT to refer to the percentage of price movement, i.e., the return.
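The text does not give the exact functional form of the two heuristic schemes, so the following is only a sketch of one plausible realization. The exponential decay with a `half_life`, the linear dependence on the positive part of the return, the `strength` parameter and both function names are assumptions made for illustration.

```python
from typing import List, Sequence


def time_weights(n_samples: int, half_life: float = 100.0) -> List[float]:
    """Weights that decay exponentially with age; samples are ordered oldest first.

    Assumed form: a sample that is `half_life` steps older gets half the weight.
    """
    raw = [0.5 ** ((n_samples - 1 - i) / half_life) for i in range(n_samples)]
    total = sum(raw)
    return [w * n_samples / total for w in raw]  # normalize to mean 1


def pct_weights(returns: Sequence[float], strength: float = 10.0) -> List[float]:
    """Weights that grow with the positive part of the realized return (PCT).

    Assumed form: 1 + strength * max(return, 0), normalized to mean 1.
    """
    raw = [1.0 + strength * max(r, 0.0) for r in returns]
    total = sum(raw)
    return [w * len(raw) / total for w in raw]
```

Both functions return weights with mean 1, so they can be passed as per-sample weights to a base model without changing the overall scale of the loss.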

In terms of feature selection, we compare the FS process with baselines that use fixed, manually selected features (Manual) or all features without selection (All). The manually selected features are obtained from a careful analysis of various aspects of the features, such as the historical performance, the information source and the risk. The two sets of features (for DAILY and WEEKLY respectively) are used in real trading and have been shown to be stable and effective in practice.

Performance metrics

Ann.Ret.: We use the hedged annualized return to measure by how much the return of the investment portfolio constructed by the model exceeds that of the market. We divide our daily funds into two equal parts, one to buy stocks and one to hedge the market. To hedge the market, we short the corresponding stock index futures. Moreover, we consider the compound return, i.e., $\text{Ann.Ret.} = (1 + \text{Total.Ret.})^{1/n} - 1$, where Total.Ret. is the total return during $n$ years.

Sharpe: The Sharpe ratio is one of the most commonly used metrics for stock investment; it reflects the risk-adjusted profitability. Specifically, $\text{Sharpe} = \text{Ann.Ret.} / \text{Ann.Vol.}$, where Ann.Vol. is the annualized volatility.

MDD: Maximum drawdown (MDD) is the maximum relative loss from a peak to a trough for a portfolio. MDD is an indicator of downside risk over a specified time period. MDD is related to investors’ maximum affordability and needs to be kept as low as possible.

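IC/IR: The information coefficient (IC) and information ratio (IR) indicate the quality of the prediction. In our experiments, we use $\text{IC} = \operatorname{mean}(\text{IC}_t)$ and $\text{IR} = \operatorname{mean}(\text{IC}_t) / \operatorname{std}(\text{IC}_t)$, with $\text{IC}_t = \operatorname{corr}(\hat{y}_t, y_t)$, where $\hat{y}_t$ is the prediction, $y_t$ is the truth, and $\text{IC}_t$ is the IC for each time step.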
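The four metrics can be computed from a series of hedged daily returns and from per-day predictions, as in the minimal sketch below. It makes several assumptions that are not stated in the text: 252 trading days per year, the sample standard deviation for volatility, a zero risk-free rate in the Sharpe ratio, and Spearman's rank correlation for the per-day IC. All function names are hypothetical.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def annualized_return(daily_returns: Sequence[float], days_per_year: int = 252) -> float:
    """Compound annualized return: (1 + total return) ** (1 / years) - 1."""
    total = 1.0
    for r in daily_returns:
        total *= 1.0 + r
    years = len(daily_returns) / days_per_year
    return total ** (1.0 / years) - 1.0


def sharpe_ratio(daily_returns: Sequence[float], days_per_year: int = 252) -> float:
    """Annualized return divided by annualized volatility (risk-free rate assumed 0)."""
    ann_vol = statistics.stdev(daily_returns) * math.sqrt(days_per_year)
    return annualized_return(daily_returns, days_per_year) / ann_vol


def max_drawdown(daily_returns: Sequence[float]) -> float:
    """Largest relative loss from a running peak of the equity curve."""
    equity, peak, mdd = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        mdd = max(mdd, (peak - equity) / peak)
    return mdd


def _ranks(values: Sequence[float]) -> List[float]:
    """Average ranks (tied values share the mean rank)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_ic(pred: Sequence[float], truth: Sequence[float]) -> float:
    """Per-day IC: Pearson correlation of the ranks of predictions and returns."""
    rp, rt = _ranks(pred), _ranks(truth)
    mp, mt = statistics.fmean(rp), statistics.fmean(rt)
    cov = sum((a - mp) * (b - mt) for a, b in zip(rp, rt))
    var_p = sum((a - mp) ** 2 for a in rp)
    var_t = sum((b - mt) ** 2 for b in rt)
    return cov / math.sqrt(var_p * var_t)


def ic_and_ir(daily_ics: Sequence[float]) -> Tuple[float, float]:
    """Mean IC, and IR as the mean IC divided by the standard deviation of the ICs."""
    mean_ic = statistics.fmean(daily_ics)
    return mean_ic, mean_ic / statistics.stdev(daily_ics)
```

In use, `spearman_ic` would be called once per trading day on the cross-section of stocks, and the resulting list of daily ICs passed to `ic_and_ir`.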

Experiment results

We run backtests for the aforementioned models and hedge the systemic risk of the market by holding a short position in the corresponding exchange traded funds (ETF). We plot the hedged equity curves for these models under different settings in Figure 2. We also list the performance measures of the backtest results in Table II.

In Figure 2, we show four sets of experiments, conducted under different settings (DAILY or WEEKLY) and using different base models (MLP or GBM). The curves in the figure are the hedged equity curves for different models, and the blue bars in the background indicate the information coefficient (IC) of the SR+FS model on each trading day. The information coefficient on a trading day is the Spearman's rank correlation coefficient between the continuous signals output by the model on that trading day and the actual future returns. While the equity curve reflects the prediction accuracy on the top retrieved samples, the information coefficient reflects the prediction accuracy on all the samples.

We can see that the performance of SR+FS (the red lines) is better than that of SR+ALL (the orange lines), where all the features are used in each of the sub-models without selection. This indicates the effectiveness of the FS process. However, the automatic feature selection by the FS process is not as good as the manually selected features, which are quite a strong benchmark. We leave it as a future research direction to discover an automatic end-to-end feature selection method that is comparable to or better than the manual selection.

Moreover, we observe that the models with the SR process achieve better performances than the models without the SR process (i.e., SimpleEnsemble). This can be observed by comparing the SR+Manual model (green solid line) with the SimpleEnsemble+Manual model (green dashed line) or by comparing the SR+ALL model (orange solid line) with the SimpleEnsemble+ALL (orange dashed line). This indicates that the SR process can improve the performance by paying more attention to the key samples.

Finally, we observe that the performance of PCTWeighted and TimeWeighted is not even as good as that of SimpleEnsemble in most of the settings, except that PCTWeighted+Manual is better than SimpleEnsemble+Manual in the WEEKLY setting when using MLP as the base model. Also, the performance of these two reweighting schemes varies widely across different settings and base models. The effectiveness of paying attention to the recent samples or to the samples with high future returns depends on the market environment. For example, if the market environment changes quickly, paying attention to the recent samples may avoid the interference of past samples that represent different market patterns. Paying attention to the samples with high future returns corresponds to emphasizing the positive samples instead of all the samples. This may improve the precision when the market environment is stable. Compared with these two heuristic reweighting schemes, the SR process weights the samples in a self-paced style and therefore is more robust across different settings.

In Table II, we use the hedged annualized return (Ann.Ret.), the Sharpe ratio (Sharpe), the maximum drawdown (MDD), the mean of the ICs (IC) and the information ratio (IR) as the performance measures for the trading strategies. The information ratio is the mean of the ICs divided by the standard deviation of the ICs.

We found that DoubleEnsemble (SR+FS) achieves an annualized return of more than 50% with low risk in the DAILY setting when using MLP as the base model. The Sharpe ratio is near 5 and the maximum drawdown is less than 6%. This demonstrates that the strategy induced by DoubleEnsemble has a superior and stable performance.

IV-C Discussion of computational complexity

Denote the time cost of training a sub-model as and predicting using a sub-model as . Empirically . The computational complexity of our algorithm is around where is the number of estimators. Compared to training a sub-model, the time cost of the reweighting step is negligible and the extra time cost for the feature selection step is where is the number of factors. Note that in the two tasks, the time cost in the training phase is not important since the models are trained offline. For example, we can update the model at weekends. For online prediction, the time cost will not increase either, because the time cost of model prediction is negligible compared with that of factor calculation. Moreover, the sub-models can be loaded in parallel to avoid an increase in time cost.

V Conclusion

In this paper, we proposed a robust and effective ensemble model, DoubleEnsemble, via learning trajectory based sample reweighting and shuffling based feature selection for financial market prediction. The learning trajectory based sample reweighting assigns different weights to samples of different difficulty, and hence is particularly suitable for highly noisy and irregular market data. The shuffling based feature selection can identify the contribution of the features to the model and select important and diverse features for different sub-models. We conducted experiments on two different financial markets and compared DoubleEnsemble with several ablated variants and baseline methods. Our experiments demonstrate that the designs in DoubleEnsemble are effective and lead to a profitable and robust trading strategy.


Jian Li and Chuheng Zhang are supported in part by the National Natural Science Foundation of China Grants 61822203, 61772297, 61632016, 61761146003, and the Zhongguancun Haihua Institute for Frontier Information Technology, Turing AI Institute of Nanjing and Xi’an Institute for Interdisciplinary Information Core Technology. Yuanqi Li is supported by National Key R&D Program of China No.2017YFC082070 from E-hualu. Xi Chen is supported by NSF via Grant IIS-1845444.


  • [1] M. Al Wadia and M. T. Ismail (2011) Selecting wavelet transforms model in forecasting financial time series data based on arima model. Applied Mathematical Sciences 5 (7), pp. 315–326. Cited by: §II.
  • [2] R. M. Alrumaih and M. A. Al-Fawzan (2002) Time series forecasting using wavelet denoising an application to saudi stock index. Journal of King Saud University-Engineering Sciences 14 (2), pp. 221–233. Cited by: §II.
  • [3] G. Appel (2005) Technical analysis: power tools for active investors. FT Press. Cited by: §IV-B.
  • [4] A. Arévalo, J. Niño, G. Hernández, and J. Sandoval (2016) High-frequency trading strategy based on deep neural networks. In International conference on intelligent computing, pp. 424–436. Cited by: §I.
  • [5] R. W. Banz (1981) The relationship between return and market value of common stocks. Journal of financial economics 9 (1), pp. 3–18. Cited by: §I.
  • [6] S. Basu (1983) The relationship between earnings’ yield, market value and return for nyse common stocks: further evidence. Journal of financial economics 12 (1), pp. 129–156. Cited by: §I.
  • [7] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pp. 41–48. Cited by: §II, §IV-A, TABLE I.
  • [8] L. C. Bhandari (1988) Debt/equity ratio and expected common stock returns: empirical evidence. The journal of finance 43 (2), pp. 507–528. Cited by: §I.
  • [9] R. Bloomfield, M. O’hara, and G. Saar (2009) How noise trading affects markets: an experimental analysis. The Review of Financial Studies 22 (6), pp. 2275–2302. Cited by: §I.
  • [10] A. Booth, E. Gerding, and F. Mcgroarty (2014) Automated trading with performance weighted random forests and seasonality. Expert Systems with Applications 41 (8), pp. 3651–3661. Cited by: §II.
  • [11] L. Breiman (1996) Bagging predictors. Machine learning 24 (2), pp. 123–140. Cited by: §II, §IV-A.
  • [12] L. K. Chan, Y. Hamao, and J. Lakonishok (1991) Fundamentals and stock returns in japan. the Journal of Finance 46 (5), pp. 1739–1764. Cited by: §I, §IV-B.
  • [13] R. Cont, A. Kukanov, and S. Stoikov (2014) The price impact of order book events. Journal of financial econometrics 12 (1), pp. 47–88. Cited by: §IV-A.
  • [14] M. L. De Prado (2018) Advances in financial machine learning. John Wiley & Sons. Cited by: §I, §II.
  • [15] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai (2016) Deep direct reinforcement learning for financial signal representation and trading. IEEE transactions on neural networks and learning systems 28 (3), pp. 653–664. Cited by: §I.
  • [16] Y. Fan, F. Tian, T. Qin, X. Li, and T. Liu (2018) Learning to teach. arXiv preprint arXiv:1805.03643. Cited by: §II, §II.
  • [17] D. E. Farrar and R. R. Glauber (1967) Multicollinearity in regression analysis: the problem revisited. The Review of Economics and Statistics, pp. 92–107. Cited by: §I.
  • [18] G. Feng, N. G. Polson, and J. Xu (2018) Deep learning factor alpha. arXiv preprint arXiv:1805.01104. Cited by: §I.
  • [19] T. Fischer and C. Krauss (2018) Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research 270 (2), pp. 654–669. Cited by: §I.
  • [20] Y. Freund and R. E. Schapire (1995) A desicion-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory, pp. 23–37. Cited by: §II.
  • [21] T. H. Goodwin (1998) The information ratio. Financial Analysts Journal 54 (4), pp. 34–43. Cited by: §III-B.
  • [22] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama (2018) Co-teaching: robust training of deep neural networks with extremely noisy labels. In Advances in neural information processing systems, pp. 8527–8537. Cited by: §IV-A, TABLE I.
  • [23] Z. Hu, B. Tan, R. R. Salakhutdinov, T. M. Mitchell, and E. P. Xing (2019) Learning data manipulation for augmentation and weighting. In Advances in Neural Information Processing Systems, pp. 15738–15749. Cited by: §II.
  • [24] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §IV-A.
  • [25] W. Jia, W. Chen, L. XIONG, and S. Hongyong (2019) Quantitative trading on stock market based on deep reinforcement learning. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §I.
  • [26] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei (2018) MentorNet: learning data-driven curriculum for very deep neural networks on corrupted labels. In International Conference on Machine Learning, pp. 2304–2313. Cited by: §II, §IV-A, TABLE I.
  • [27] Z. Kakushadze (2016) 101 formulaic alphas. Wilmott 2016 (84), pp. 72–81. Cited by: §I.
  • [28] G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu (2017) Lightgbm: a highly efficient gradient boosting decision tree. In Advances in neural information processing systems, pp. 3146–3154. Cited by: §IV-A.
  • [29] Y. Kwon and B. Moon (2007) A hybrid neurogenetic approach for stock forecasting. IEEE transactions on neural networks 18 (3), pp. 851–864. Cited by: §II.
  • [30] X. Liang and W. W. Ng (2012) Stock investment decision support using an ensemble of l-gem based on rbfnn diverse trained from different years. In 2012 International Conference on Machine Learning and Cybernetics, Vol. 1, pp. 394–399. Cited by: §II.
  • [31] Z. Liu, W. Cao, Z. Gao, J. Bian, H. Chen, Y. Chang, and T. Liu (2019) Self-paced ensemble for highly imbalanced massive data classification. arXiv preprint arXiv:1909.03500. Cited by: §II, §III-A.
  • [32] P. M. Long and R. A. Servedio (2010) Random classification noise defeats all convex potential boosters. Machine learning 78 (3), pp. 287–304. Cited by: §II.
  • [33] L. Luo and X. Chen (2013) Integrating piecewise linear representation and weighted support vector machine for stock trading signal prediction. Applied Soft Computing 13 (2), pp. 806–816. Cited by: §II.
  • [34] D. Misra (2019) Mish: a self regularized non-monotonic neural activation function. arXiv preprint arXiv:1908.08681. Cited by: §IV-A.
  • [35] D. T. Nguyen, C. K. Mummadi, T. P. N. Ngo, T. H. P. Nguyen, L. Beggel, and T. Brox (2019) SELF: learning to filter noisy labels with self-ensembling. arXiv preprint arXiv:1910.01842. Cited by: §II.
  • [36] C. N. Ochotorena, C. A. Yap, E. Dadios, and E. Sybingco (2012) Robust stock trading using fuzzy decision trees. In 2012 IEEE Conference on Computational Intelligence for Financial Engineering & Economics, pp. 1–8. Cited by: §I.
  • [37] E. Qian and R. Hua (2006) Active risk and information ratio. In The World Of Risk Management, pp. 151–167. Cited by: §II.
  • [38] M. Ren, W. Zeng, B. Yang, and R. Urtasun (2018) Learning to reweight examples for robust deep learning. In International Conference on Machine Learning, pp. 4334–4343. Cited by: §II, §IV-A, TABLE I.
  • [39] S. Ross (1976) The arbitrage pricing theory of capital asset pricing model. Journal of finance. Cited by: §I.
  • [40] P. A. Samuelson (2016) Proof that properly anticipated prices fluctuate randomly. In The world scientific handbook of futures markets, pp. 25–38. Cited by: §I.
  • [41] F. Santosa and W. W. Symes (1986) Linear inversion of band-limited reflection seismograms. SIAM Journal on Scientific and Statistical Computing 7 (4), pp. 1307–1330. Cited by: §III-B.
  • [42] S. Saxena, O. Tuzel, and D. DeCoste (2019) Data parameters: a new family of parameters for learning a differentiable curriculum. In Advances in Neural Information Processing Systems, pp. 11093–11103. Cited by: §II, §II.
  • [43] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §IV-A.
  • [44] J. Sun, K. Xiao, C. Liu, W. Zhou, and H. Xiong (2019) Exploiting intra-day patterns for market shock prediction: a machine learning approach. Expert Systems with Applications 127. Cited by: §II.
  • [45] J. W. Wilder (1978) New concepts in technical trading systems. Trend Research. Cited by: §IV-A.
  • [46] C. Xiang and W. Fu (2006) Predicting the stock market using multiple models. In 2006 9th International Conference on Control, Automation, Robotics and Vision, pp. 1–6. Cited by: §II.
  • [47] Y. Xu, Z. Li, and L. Luo (2013) A study on feature selection for trend prediction of stock trading price. In 2013 International Conference on Computational and Information Sciences, pp. 579–582. Cited by: §II.
  • [48] Y. Xu, P. Cao, Y. Kong, and Y. Wang (2019) L_DMI: a novel information-theoretic loss function for training deep nets robust to label noise. In Advances in Neural Information Processing Systems, pp. 6222–6233. Cited by: §II, §IV-A, TABLE I.
  • [49] J. Yao, H. Wu, Y. Zhang, I. W. Tsang, and J. Sun (2019) Safeguarded dynamic label regression for noisy supervision. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 9103–9110. Cited by: §IV-A, TABLE I.
  • [50] F. Zhai, Q. Wen, Z. Yang, and Y. Song (2010) Hybrid forecasting model research on stock data mining. In 4th International Conference on New Trends in Information Science and Service Science, pp. 630–633. Cited by: §II.
  • [51] L. Zhang, C. Aggarwal, and G. Qi (2017) Stock price prediction via discovering multi-frequency trading patterns. In Proceedings of the 23rd ACM SIGKDD, pp. 2141–2149. Cited by: §III.
  • [52] T. Zhang, Y. Li, Y. Jin, and J. Li (2020) AutoAlpha: an efficient hierarchical evolutionary algorithm for mining alpha factors in quantitative investment. Note: unpublished. Cited by: §I.
  • [53] Z. Zhang, H. Zhang, S. O. Arik, H. Lee, and T. Pfister (2019) IEG: robust neural network training to tackle severe label noise. arXiv preprint arXiv:1910.00701. Cited by: §II.