The Success of AdaBoost and Its Application in Portfolio Management

03/23/2021
by   Yijian Chuan, et al.
Peking University

We develop a novel approach to explain why AdaBoost is a successful classifier. By introducing a measure of the influence of the noise points (ION) in the training data for the binary classification problem, we prove that there is a strong connection between the ION and the test error. We further identify that the ION of AdaBoost decreases as the iteration number or the complexity of the base learners increases. We confirm that it is impossible to obtain a consistent classifier without deep trees as the base learners of AdaBoost in some complicated situations. We apply AdaBoost in portfolio management via empirical studies in the Chinese market, which corroborates our theoretical propositions.


1 Introduction

Equal-weighted portfolios are one of the most important strategies in portfolio management. They are portfolios with weights equally distributed across the selected securities in the long and/or short positions. In academic research, numerous studies have suggested that equal-weighted portfolios have better out-of-sample performance than other portfolios (e.g., jobson1981; james2003bias_variance; demiguel2007_1n). michaud1989markowitz and demiguel2007_1n argued that equal-weighted strategies do not suffer from the estimation error of the covariance matrix, which is vulnerable to outliers (tu2011). In industry, equal-weighted portfolios are popular in practice, particularly among hedge funds. The MSCI has issued many equal-weighted indexes, which are “some of the oldest and best-known factor strategies that have aimed to identify specific characteristics of stocks generating excess return” (https://www.msci.com/msci-equal-weighted-indexes).

The core of constructing equal-weighted portfolios is to forecast the long and/or short positions, which is a classification problem. Machine learning usually excels at classification problems, and the research of machine learning on classification is diverse. DGL1996 lists numerous traditional pattern recognition studies in computer science, where pattern recognition is an alias for the classification problem. book:esl analyzes and summarizes many machine learning methods, including many classification methods, such as Linear/Quadratic Discriminant Analysis, the Support Vector Machine, Boosting, the Random Forest, etc. The application of machine learning to classification has succeeded in various fields, such as email spam identification and handwritten digit recognition.

In portfolio management, de2018ml explained how to apply machine learning to managing funds for some of the most demanding institutional investors. Specifically, creamer2005AdaBoost; creamer2010automated applied the Boosting method in finance and revealed its practical value. creamer2012model used LogitBoost on high-frequency data for Euro futures and generated positive returns. wang2012DB invented the N-LASR (non-linear adaptive style rotation) model by applying AdaBoost to a stock factor strategy; they wisely incorporated the benefits of different factors in the N-LASR model, and an empirical study on the component stocks of the Russell 3000 showed a significant risk-adjusted portfolio return. fievet2018trees proposed a decision tree forecasting model and applied it to the S&P 500; it is capable of capturing arbitrary interaction patterns and generating positive returns. rasekhschaffe2019_ML_stock_selection provided an example of machine learning techniques used to forecast the cross-section of stock returns. gu2018_ML_asset_pricing and dhondt2020 gave comprehensive analyses of machine learning methods for the canonical problem of empirical asset pricing, attributing their predictive gains to non-linearity.

AdaBoost is a classification method in machine learning that has inspired a tremendous amount of innovation. AdaBoost has been developed for more than two decades since freund1996adaboost. Since it is less prone to overfitting, Breiman praised AdaBoost—“the best off-the-shelf classifier in the world”—at the 1996 NeurIPS conference (friedman2000additive). AdaBoost has made a significant impact on machine learning and statistics. To explain AdaBoost’s overfitting resistance, schapire1998boosting proposed the “margin” theory. Meanwhile, breiman1998arcing; breiman1999prediction and friedman2000additive discovered that AdaBoost is equivalent to an optimization algorithm, based on which friedman2001greedy put forward Gradient Boosting. Inspired by AdaBoost, breiman2001random invented the Random Forest (RF) and believed that there are some similarities between RF and AdaBoost. Subsequently, the Boosting methods were generalized, two of which are widely adopted: XGBoost (XGBoost) and LightGBM (LGBM). To date, the Boosting family flourishes and has become a considerable part of machine learning.

Although there are many studies explaining why AdaBoost is a successful method, people remain curious about its excellent performance. wyner2003OnBA; mease2008contrary believed that the available interpretations of AdaBoost are “incomplete”, particularly the explanation of its overfitting resistance. wyner2017AdaBoost_RandomForest introduced a novel perspective on AdaBoost and RF, and conjectured that their success could be explained by being “spiked-smooth”, where spiked is “in the sense that it is allowed to adapt in very small regions to noise points”, and smooth is “in the sense that it has been produced through averaging”. In other words, AdaBoost is a self-averaging interpolating method, localizing the effect of the noise points as the number of iterations increases.

Our work is motivated by questions from the industry: “May machine learning strategies outperform other traditional strategies in quantitative investment? Why and how do they work?” de2018ml gave a comprehensive and systematic approach to applying machine learning methods, and highly appraised Boosting: “We explored a number of standard black-box approaches. Among machine learning methods, we found gradient tree boosting to be more effective than others.” Besides, wang2012DB applied AdaBoost to select and combine factors with consistent and interpretable performance, and zhang2019boosting proposed a Boosting method for composing portfolios that performs well. These findings answer the first question. There is, however, limited research concerning the mechanism or the interpretability of machine learning in portfolio management, and interpretability is essential in investment (feng2017zoo). In detail, harvey2016and argued: “… a factor developed from first [economic] principles should have a lower threshold t-statistic than a factor that is discovered as a purely empirical exercise.” harvey2017presidential provided an example: they constructed portfolios based on the first, second, and third letters of ticker symbols and obtained significant excess returns; nevertheless, as they implied, most people would not adopt such a symbol-based portfolio. Thus, without interpretability, portfolio investment is vulnerable. We must pay more attention to the second and third questions.

To answer the “why and how” questions, we investigate AdaBoost in a statistical framework to find a theoretical explanation of its outperformance and to apply it in portfolio management. wyner2017AdaBoost_RandomForest pointed out that: “The statistics community was especially confounded by two properties of AdaBoost: 1) interpolation (perfect prediction in sample) was achieved after relatively few iterations, 2) generalization error continues to drop even after interpolation is achieved and maintained.” They introduced the concept of a “spiked-smooth” classifier created by a self-averaging and interpolating algorithm, conjectured that the “spiked-smooth” property underlies the success of AdaBoost, and provided many delicate simulated examples to support their viewpoint. Thus, we would like to narrow the gap between the theory and the simulation by strengthening their work from a statistical perspective. To explain the “spiked-smooth” property mathematically, we first need to distinguish the signal and the noise within the training set in a statistical framework. Then, one should connect the “spiked-smooth” property with the test error, explaining the property of overfitting resistance.

In addition, wyner2017AdaBoost_RandomForest pointed out that “boosting should be used like random forests: with large decision trees, without regularization or early stopping”. The point is that larger and deeper decision trees are preferred as the “weak” classifiers (base learners) of AdaBoost, since they can both “interpolate” the training set and realize the goal of being “spiked-smooth”. This contradicts the common wisdom in machine learning and statistics, where complexity is usually believed to lead to overfitting. Therefore, we wonder whether AdaBoost can boost shallow trees when the true model is very complicated, just as one cannot “make bricks without straw”. We try to find out under what populations AdaBoost is unable to achieve good performance if the base learners are very weak. We want to demonstrate the viewpoints of wyner2017AdaBoost_RandomForest in a mathematical framework.

In this paper, we show how AdaBoost can dig out more non-linear information from the training set without increasing the test error. Our work is composed of three parts. First, to make the abstract concept of “spiked-smooth” concrete and measurable, we define a measure of the influence of the noise points in the training set for a given method. The measure can also be regarded as a measure of the localization of the given method. We discover the connection between the measure and the out-of-sample performance: under certain conditions, if the influence of the noise points is not essential, then the test error will be low. A toy example clarifies the theorem, intuitively illustrating the influence of the noise and explaining why it controls the test error. For AdaBoost, we show that, as the number of iterations increases or the depth of the base learners grows, it becomes more robust to the influence of the noise, which thus leads to a lower test error. Therefore, we give a theoretical explanation of why AdaBoost performs well without overfitting on noisy training sets.

Second, we confirm that it is a better choice to use deeper/larger decision trees as base learners of AdaBoost in the sense of digging out complex information. Specifically, we propose several counterexamples that AdaBoost based on shallow decision trees fails to handle, even after infinitely many iterations. We generalize the results and indicate that AdaBoost based on shallow decision trees fails to recognize a certain kind of information, while the one based on deep decision trees solves it easily. Therefore, these findings suggest that AdaBoost based on deep decision trees may be better.

Third, the empirical studies in the Chinese market corroborate our theoretical propositions. The theoretical results about the interpolation and the localization of AdaBoost in the previous parts of this paper are verified by constructing an optimal portfolio strategy. Besides, the results also illustrate the good performance of the equal-weighted portfolio generated by the selected optimal classifier trained by AdaBoost.

The outline of this paper is as follows. Section 2 introduces a measure of “spiked-smooth”, illustrates the relationship between the measure and the test error, and explains the success of AdaBoost. Section 3 identifies that AdaBoost based on deep trees can dig out more information, while the one based on shallow trees fails. Section 4 provides empirical studies of AdaBoost in the Chinese stock market. Section 5 concludes.

2 The influence of the noise points and AdaBoost

In this section, we give a rigorous definition of the “spiked-smooth” property suggested by wyner2017AdaBoost_RandomForest within the framework of the Bayes classifier. Under this framework, we explain the success of AdaBoost by developing a concrete measure.

First, we describe a background model of the binary classification problem and the Bayes classifier, and define the signal/noise points for a given training set. Based on these concepts, we build a bridge between the Bayes classifier and the interpolating classifier. We define a measure of the influence of the noise points, and specify its property. Second, we explore the connection between the measure and the test error. Last, we explain the success of AdaBoost as the minimization of the influence of the noise points in the sense of the “spiked-smooth”, and reveal its potential applications in portfolio management.

2.1 The model of the binary classification

A prediction model consists of an input vector, an output, and a prediction classifier. For simplicity, let us assume that the distribution of the input is absolutely continuous with respect to the Lebesgue measure. We restate the definition of the Bayes classifier/error (DGL1996, p. 2) in Definition 1 below.

Definition 1 (Bayes Classifier/Error).

Given the population and , the Bayes classifier is

and the minimum is the Bayes error , i.e.,

According to the definition, the Bayes classifier minimizes the test error, and it can be represented by the conditional distribution in the population. One can show that the Bayes error is less than one half when the population satisfies certain canonical conditions. No classifier has a lower test error than the Bayes classifier. We give a general representation of the test error of any classifier in terms of the Bayes classifier in Lemma 1, under the condition that the noise and the input are independent.

Lemma 1.

Given the population , the Bayes classifier , and the Bayes error , if and are independent, then, for any classifier ,

(1)
Proof.

We have

A natural corollary of (1) is that minimizing the test error amounts to minimizing the disagreement with the Bayes classifier. In other words, the test error of a classifier is a linear function of its probability of disagreeing with the Bayes classifier.
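For concreteness, a standard form of this identity under the independence assumption is sketched below in illustrative notation (C* for the Bayes classifier, L* for the Bayes error, C for an arbitrary classifier); the paper's own notation may differ.

% Sketch of the identity behind Lemma 1 (illustrative notation, not the paper's).
% Using independence of the noise indicator 1{Y != C*(X)} from X:
\begin{align*}
P\{C(X)\neq Y\}
 &= P\{Y\neq C^{*}(X),\, C(X)=C^{*}(X)\} + P\{Y=C^{*}(X),\, C(X)\neq C^{*}(X)\}\\
 &= L^{*}\,P\{C(X)=C^{*}(X)\} + (1-L^{*})\,P\{C(X)\neq C^{*}(X)\}\\
 &= L^{*} + (1-2L^{*})\,P\{C(X)\neq C^{*}(X)\},
\end{align*}
% so the test error is an increasing affine function of P{C(X) != C*(X)} whenever L* < 1/2.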

Next, we introduce the concept of the signal/noise points of the training set in the following Definition 2.

Definition 2 (Signal/Noise Points).

Given a training set generated from the population and the Bayes classifier , a point is a signal point, if ; and it is a noise point, if .

In short, the Bayes classifier distinguishes the signal points from the noise points of a training set. Heuristically, the signal points are those whose output equals the output of the Bayes classifier, while the noise points are those whose output does not.

We recall the definition of the interpolating classifier proposed by wyner2017AdaBoost_RandomForest for coherence.

Definition 3 (Interpolating Classifier).

A classifier is an interpolating classifier on the training set , if for all .

Immediately, we obtain a property of the interpolating classifier: its training error on the training set is 0.

Though the Bayes classifier is the best classifier in the sense of minimizing the test error, it does not necessarily interpolate the given training set. By Definitions 2 and 3, the Bayes classifier violates interpolation at, and only at, the noise points. Thus, restricted to the training set, the difference between an interpolating classifier and the Bayes classifier lies only at the noise points. We therefore propose a definition of a purified training set, obtained by converting the noise points into signal points.

Definition 4 (Purified Training Set).

Given a training set from the population and the Bayes classifier , the purified training set of is defined as .

There is no noise point in . In other words, the Bayes classifier must interpolate the purified training set . We can also rewrite the definition of the purified training set as

The two training sets share the same inputs but differ in their outputs, and the difference lies only at the noise points. The purpose is to separate the influence of the noise points from the whole information contained in the original training set.

Last, based on the previous preparations, we propose a measure of the influence of the noise points (ION) for a given training set and a given method . It helps us to compare the properties of different methods, such as one nearest neighbor (1NN) or AdaBoost, on a given training set.

Definition 5 (ION).

Given the marginal probability measure of the input, we define the influence of the noise points (ION) as a function of the training set and the method:

(2)

where the first classifier is trained on the training set using the method, and the second is trained on the purified training set using the same method.

We interpret Definition 5. The method represents a specific algorithm, for instance 1NN: one applies it to the training set and to its purified counterpart, which generates two classifiers. Then, by comparing the two classifiers over the marginal distribution of the input, we obtain the value of ION(1NN, ·). The ION is thus defined according to two sets: the training set and its purified proxy generated by the Bayes classifier.

Although Definition 5 does not require the classifier to interpolate, the ION usually characterizes the behavior of a method that generates interpolating classifiers on a given training set. Meanwhile, the ION always lies between 0 and 1. If the ION is low, then the classifier generated by the given method is robust to the noise points in the training set, and vice versa.
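To make Definition 5 operational in a simulation where the Bayes classifier is known, the ION can be estimated by Monte Carlo: train the method on the training set and on its purified counterpart, then measure how often the two resulting classifiers disagree on fresh draws from the marginal distribution of the input. The sketch below assumes scikit-learn and a hypothetical toy population (the first coordinate carries the signal; labels are flipped with probability 0.1 independently of the input); the function name estimate_ion is ours, not the paper's.

import numpy as np
from sklearn.base import clone
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def sample_population(n):
    # Hypothetical population: Bayes classifier is sign(x1); labels flipped w.p. 0.1.
    X = rng.uniform(-1, 1, size=(n, 2))
    bayes = np.where(X[:, 0] > 0, 1, -1)
    flip = rng.random(n) < 0.1                      # noise independent of X
    y = np.where(flip, -bayes, bayes)
    return X, y, bayes

def estimate_ion(method, X_train, y_train, y_bayes, n_mc=100_000):
    # Monte Carlo estimate of ION (Definition 5): disagreement rate between the
    # classifier trained on T and the one trained on the purified set T'.
    clf_T = clone(method).fit(X_train, y_train)     # trained on T
    clf_Tp = clone(method).fit(X_train, y_bayes)    # trained on T'
    X_mc = rng.uniform(-1, 1, size=(n_mc, 2))       # fresh draws from the marginal of X
    return np.mean(clf_T.predict(X_mc) != clf_Tp.predict(X_mc))

X_tr, y_tr, bayes_tr = sample_population(400)
print("ION(1NN, T) ~", estimate_ion(KNeighborsClassifier(n_neighbors=1), X_tr, y_tr, bayes_tr))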

Interpolation is not necessarily bad if it is subject to some “mechanism” (wyner2017AdaBoost_RandomForest). Although some interpolating classifiers “can be shown to be inconsistent and have poor generalization error in environments with noise”, “the claim that all interpolating classifiers overfit is problematic”. The classifiers generated by 1NN and by the random forest are both interpolating classifiers, but their IONs may not be the same. Furthermore, wyner2017AdaBoost_RandomForest suggested: “an interpolated classifier, if sufficiently local, minimizes the influence of noise points in other parts of the data.” The next question is: what is the relationship between the ION and the so-called “spiked-smooth” classifier?

2.2 The ION and the test error

In this section, we reveal the connection between the ION and the test error from theoretical and numerical perspectives.

First, we prove that, under certain conditions, the lower the ION, the lower the test error.

Proposition 1.

Given the population such that is independent of , let and denote the classifiers generated from two different methods and on the training set , and and denote the ones generated from and on the purified training set , respectively. If

(3)

and

then

(4)
Proof.

Because of (3), we have

Thus, by Lemma 1, (4) holds. ∎

Proposition 1 shows that the ION controls the test error. Specifically, it means that, if the two methods can reach the Bayes classifier on the purified training set, then the method with the lower ION outperforms the other in terms of the test error. For instance, the first method might indicate 1NN, while the second might indicate AdaBoost.

However, condition (3) is somewhat unnatural: it is so strong that it holds only for a few particular training sets. We therefore weaken condition (3) and establish a more natural condition in Theorem 1 below.

Theorem 1.

Given the population such that is independent of , let and denote the classifiers generated from two different methods and on the training set , and and denote the ones generated from and on the purified training set , respectively. The size of the training set or is . If

(5)

and

(6)

then

(7)
Proof.

To begin with, by Lemma 1, (5) is equivalent to

(8)

Then,

By (8), we have

and thus . By (8), we also have

so, by (2), one can show that and share the same limit, . Therefore,

Because of (6), we have

Further, by Lemma 1, (7) holds. ∎

We interpret Theorem 1. First, (5) is a very weak condition: it assumes that the two methods are consistent on the purified training set. In fact, many classical methods have been proved to be consistent. Furthermore, because there is no noise point in the purified training set, consistency on it is easier to achieve than on the original training set. Even the notoriously easy-to-overfit 1NN is consistent on such a clean training set, though not necessarily on the original one, according to the Cover-Hart inequality (cover1967_1nn).

Second, (6) concerns the property of the methods with respect to a particular training set. Instead of subjectively describing the property of the methods, it objectively measures the influence of the noise points in that particular training set.

Third, according to Theorem 1, a decrease in the ION implies that the method is minimizing the influence of the noise points and thus enhancing its generalization ability. It means that, for most methods, the ION is a good indicator of the test error.

Fourth, there is no inherent conflict between interpolation and a low ION. For a classifier, the purpose of interpolating is to absorb as much of the information contained in the signal points as possible, while the goal of lowering the ION is to reduce the impact attributable to the noise points.

In order to have a concrete understanding of Proposition 1 and Theorem 1, we give a 2-dim toy example. First, the population is

where the input is uniformly distributed. In other words, only the first dimension of the input is relevant to the output, while the second dimension contributes no information. One can easily solve for the Bayes classifier and the Bayes error.

Second, we randomly generate a training set, shown in Fig. 1. The training set is composed of signal points and noise points. Roughly speaking, the yellow triangles on the left and the blue circles on the right are all noise points. In particular, in the lower-left part of the graph in Fig. 1, there is a solid triangle, which is the noise point discussed later.

Figure 1: The training set .

Third, we apply two methods, 1NN and AdaBoost (to be specific, the AdaBoost we use takes decision trees with a maximum depth of 4 as base learners and runs 50 iterations), to generate interpolating classifiers on the training set. Fig. 2(a) shows the classifier generated by 1NN, while Fig. 2(b) shows the one generated by AdaBoost. The purple vertical dotted line is the watershed of the Bayes classifier, while the black solid lines are the decision boundaries of the trained classifiers. Both classifiers interpolate the training set, even though they come from different methods.
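For readers who want to reproduce this kind of comparison, the sketch below fits 1NN and AdaBoost (depth-4 trees, 50 iterations, as in the specification above) on the same noisy sample from a hypothetical toy population with a known Bayes classifier, and reports the training error, an approximate test error, and a Monte Carlo ION estimate for each. It assumes a recent scikit-learn (the estimator keyword of AdaBoostClassifier); the population itself is ours, not the paper's exact specification.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

def sample(n, noise=0.1):
    # Hypothetical 2-dim population: Bayes classifier is sign(x1); 10% label noise.
    X = rng.uniform(-1, 1, size=(n, 2))
    bayes = np.where(X[:, 0] > 0, 1, -1)
    y = np.where(rng.random(n) < noise, -bayes, bayes)
    return X, y, bayes

X_tr, y_tr, bayes_tr = sample(200)
X_te, y_te, _ = sample(50_000)                       # large test set approximates the test error
X_mc = rng.uniform(-1, 1, size=(100_000, 2))         # fresh inputs for the ION estimate

methods = {
    "1NN": lambda: KNeighborsClassifier(n_neighbors=1),
    "AdaBoost": lambda: AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=4), n_estimators=50),
}
for name, make in methods.items():
    clf_T = make().fit(X_tr, y_tr)                   # trained on the noisy training set T
    clf_Tp = make().fit(X_tr, bayes_tr)              # trained on the purified set T'
    ion = np.mean(clf_T.predict(X_mc) != clf_Tp.predict(X_mc))
    train_err = np.mean(clf_T.predict(X_tr) != y_tr)
    test_err = np.mean(clf_T.predict(X_te) != y_te)
    print(f"{name}: ION~{ion:.3f}, training error={train_err:.2f}, test error~{test_err:.3f}")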

(a)
(b)
Figure 2: The training set and the classifiers: AdaBoost has lower ION than 1NN.

Fourth, we argue that the ION of AdaBoost is lower than that of 1NN on the training set. The classifiers in Fig. 2 are different: the decision boundary of 1NN in Fig. 2(a) is smooth and natural, while that of AdaBoost in Fig. 2(b) is sharp and uneven. However, we argue that the sharp and uneven boundary is better than the smooth and natural one in the sense of minimizing and localizing the influence of the noise points. For the isolated noise points, the regions surrounding them in Fig. 2(b) are smaller and narrower than those in Fig. 2(a). In detail, we focus on the particular noise point marked by the solid triangle. The area around it in Fig. 2(b) is very small, while that in Fig. 2(a) is a big irregular polygon: the influence of the noise point appears to be lower for AdaBoost than for 1NN. (It is noteworthy that the influence of the noise points acts jointly rather than individually, but this does not matter in this heuristic case.)

Last, we calculate the ION and the test error, as summarized in Table 1. We can observe that the results are in line with our theorem, i.e., the lower the ION, the lower the test error.

Method             ION    Training Error   Test Error
Bayes classifier    -           -             0.10
1NN                0.08         0             0.17
AdaBoost           0.04         0             0.13
Table 1: The ION and the test error.

Overall, this section connects the ION and the test error. Both the theoretical derivation and the simulated toy example demonstrate the importance of the ION. In particular, the toy example explains why 1NN easily overfits while AdaBoost does not. However, AdaBoost is only a general term for a class of methods, since both the base learners and the number of iterations need to be specified. By choosing different base learners and different numbers of iterations, we can generate a tremendous number of specific methods. In Section 2.3, we take a close look at the performance of AdaBoost with different hyperparameters from our new perspective: the ION.

2.3 The ION and AdaBoost

AdaBoost mainly has two hyperparameters. One is the complexity of the base learners. Decision trees are among the most popular base learners of AdaBoost and are the classical choice in the monograph book:esl. In this paper, we use the maximum depth of the decision trees to indicate the complexity of the base learners: the deeper the decision trees, the more complex the base learners, and the more complex the AdaBoost. The other is the number of iterations, i.e., the total number of base learners added: the higher the number of iterations, the more complex the AdaBoost.

This section corroborates the conclusion of wyner2017AdaBoost_RandomForest with our newly defined concept, the ION. We support their conclusion that AdaBoost based on large decision trees without early stopping is better. We show that AdaBoost generates interpolating classifiers, and that both the ION and the test error decrease as the depth of the base learners and the number of iterations increase. Instead of comparing AdaBoost with 1NN, we examine AdaBoost itself with different hyperparameters in detail via a high-dimensional simulated population.

The simulation population is

where the input is uniformly distributed. We randomly generate a training set and compare the results of AdaBoost with different hyperparameters.

In order to explain why AdaBoost without early stopping might be better, we compare the results of AdaBoost with different numbers of iterations but the same maximum depth of the base decision trees, which is set to 5.
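Curves of this kind can be traced without refitting a separate model for each iteration count, e.g. via scikit-learn's staged_predict applied both to the model trained on the noisy set T and to the one trained on the purified set T'. The population below is our own hypothetical stand-in (the paper's simulation design is not reproduced here); the base learners are depth-5 trees as in the text.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)

def sample(n, d=10, noise=0.1):
    # Hypothetical high-dimensional population with known Bayes classifier sign(x1).
    X = rng.uniform(-1, 1, size=(n, d))
    bayes = np.where(X[:, 0] > 0, 1, -1)
    y = np.where(rng.random(n) < noise, -bayes, bayes)
    return X, y, bayes

X_tr, y_tr, bayes_tr = sample(1000)
X_te, y_te, _ = sample(50_000)
X_mc, _, _ = sample(50_000)                      # fresh inputs for the ION estimate

make = lambda: AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=5),
                                  n_estimators=200)
ada_T = make().fit(X_tr, y_tr)                   # trained on the noisy training set T
ada_Tp = make().fit(X_tr, bayes_tr)              # trained on the purified set T'

# One row per iteration count: training error, test error, and estimated ION.
for k, (p_tr, p_te, p_T, p_Tp) in enumerate(zip(
        ada_T.staged_predict(X_tr), ada_T.staged_predict(X_te),
        ada_T.staged_predict(X_mc), ada_Tp.staged_predict(X_mc)), start=1):
    if k % 50 == 0:
        print(k, np.mean(p_tr != y_tr), np.mean(p_te != y_te), np.mean(p_T != p_Tp))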

The results are shown in Fig. 3. The x-axis in the figure represents the number of iterations. We clarify the three lines in detail. The green dashed line is the training error of the classifier on its training set, the red dashed-dotted line is its test error, and the blue solid line is the ION of AdaBoost on the training set. All three lines decrease sharply during the early iterations; afterwards, the training error remains 0, while the test error and the ION keep decreasing.

Figure 3: The performance of AdaBoost with respect to the number of iterations.

From Fig. 3, we have the following observations. First, AdaBoost is minimizing the influence of the noise points. After the early iterations, the test error decreases while the training error remains 0. A natural question is: what is AdaBoost doing? There are many explanations. wyner2017AdaBoost_RandomForest believed that AdaBoost is self-averaging and generates a “spiked-smooth” classifier by minimizing the influence of the noise points. We corroborate their work with the blue solid line, the ION. While the training error remains 0, the ION continues to decrease, which reflects the decreasing influence of the noise points. Thus, as the number of iterations increases, AdaBoost keeps interpolating and simultaneously minimizes the influence of the noise points. Second, the iterations of AdaBoost can be divided into two stages: the first stage is the sharp decrease of the training error, and the second stage is the decrease of the ION. The first stage can be considered the formation of the rough skeleton of the classifier, while the second stage can be treated as the refinement of the details with the minimization of the influence of the noise points. (However, the two stages cannot be divided sharply, because the ION may also play a role in the first stage.)

Next, we show that AdaBoost based on deep/large decision trees is better, and explain this via the ION. Specifically, we apply AdaBoost based on decision trees with different maximum depths but the same number of iterations, so that the number of terminal leaves of the base decision trees varies with the depth.

The results are presented in Fig. 4. The x-axis in the figure represents the maximum depth of the base learners. The three lines are the same as those in Fig. 3, and so are their interpretations.

Figure 4: The performance of AdaBoost with respect to the maximum depth of the base learners.

Overall, AdaBoost based on large decision trees without early stopping is better, which can be explained by the decrease of the ION. Given that the training error is 0, the influence of the noise points decreases as the depth of the base decision trees and the number of iterations increase.

Now, returning to the main line of the paper, we show that AdaBoost does not overfit even when interpolating, when digging out complex structures of factors for constructing equal-weighted portfolios. As emphasized by de2018ml, linear methods are awfully simplistic and useless, and would “fail to recognize the complexity of the data”. Academia and industry have shifted their focus to non-linear methods, and a tremendous number of machine learning methods have been applied to various data and fields in finance. Many of them suffer from overfitting and low interpretability. However, as illustrated above, AdaBoost is not heavily affected by these problems.

In Section 4, we give empirical studies of specific factors and strategies, and demonstrate the advantage of AdaBoost in portfolio management. But first, we clarify what kind of non-linear information AdaBoost can dig out.

3 Base learners of AdaBoost

AdaBoost is a boosting method. It boosts the performance of a series of base learners, or “weak classifiers”. People usually choose shallow trees (such as “stumps”, i.e., decision trees with only one layer) as base learners since they are “weak” enough and thus can avoid overfitting.

However, in many fields, especially in portfolio management, using stumps as base learners may fail to capture the nature of the population, since the population is usually rather complicated. wyner2017AdaBoost_RandomForest proposed that deep and large trees allow the base learners to interpolate the data without overfitting, and that it is a better choice to use deep trees as base learners. We have already demonstrated this point in Section 2.3 from the perspective of the ION.

In this section, we discuss the shortcomings of AdaBoost based on stumps. We first show that stumps cannot deal with the “XOR” classification problem. Then, we generalize the result and demonstrate that AdaBoost based on stumps cannot deal with populations without “comonotonicity”. Such populations are common in finance, since investment activities in the financial world are usually rather complicated and interactive.

3.1 The “XOR” population

In this section, we use a toy example to show that shallow trees (especially stumps) are not always capable of capturing the patterns of the population. We first introduce the Boolean operator “exclusive OR” (XOR).

Definition 6 (2-Xor).

The 2-XOR function is defined as such that

Definition 7 (-Xor).

For , the -XOR function, denoted as , is defined recursively as

(a) 2-XOR
(b) 3-XOR
Figure 5: Intuitions of 2-XOR and 3-XOR functions.

The Boolean operator n-XOR is an important function in computer science (for instance, in parity checks), and it is also a classical example in book:esl. It can bring insights into portfolio management, since the n-XOR can characterize the interaction among different factors, and there are many studies focusing on the interaction among various factors (asness2018size). Fig. 5 gives intuitive illustrations of the 2-XOR and 3-XOR functions: the outputs differ across adjacent quadrants (or octants), which is a common pattern of interaction.

Now we show that stumps cannot deal with classification problems whose Bayes classifier is an XOR function, even when the Bayes error is 0. For instance, if we use a stump to classify the 2-XOR function, one can easily show that its test error is always 1/2, no matter how the stump is trained. That is because a stump is equivalent to a partition of the input space along the direction of one axis; after the partition, both half-spaces still contain outputs of both classes (50% each), which leads to a test error of 1/2.
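A quick numerical check of this claim is sketched below, under our own conventions (labels in {-1, +1}, the 2-XOR realized as the parity of the coordinate signs, matching the adjacent-quadrant pattern of Fig. 5) and assuming scikit-learn. Since a greedy tree learner is not guaranteed to find the ideal depth-2 partition of the XOR, we contrast the stump with an unrestricted-depth tree.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)

def xor_sample(n, d=2):
    # Noiseless XOR-type data: the label is the parity of the signs of the coordinates
    # (our sign convention; the paper's exact convention may differ).
    X = rng.uniform(-1, 1, size=(n, d))
    y = np.where(np.prod(X, axis=1) > 0, 1, -1)
    return X, y

X_tr, y_tr = xor_sample(2000)
X_te, y_te = xor_sample(20000)

stump = DecisionTreeClassifier(max_depth=1).fit(X_tr, y_tr)    # a one-layer "weak" learner
deep = DecisionTreeClassifier(max_depth=None).fit(X_tr, y_tr)  # deep enough to capture the interaction
print("stump test accuracy:    ", np.mean(stump.predict(X_te) == y_te))  # stays near 1/2
print("deep tree test accuracy:", np.mean(deep.predict(X_te) == y_te))   # well above chance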

The conclusion can also be generalized to high-dimensional spaces. Let denote a decision tree whose depth is no more than , and denote a decision tree whose depth is . We have the following result in Theorem 2.

Theorem 2.

Applying a decision tree on the - classification problem will always lead to , where .

Proof.

We prove the theorem by induction. The case of has already been proved. Now we assume that our conclusion holds for . We want to prove that it also holds for .

Without loss of generality, we assume that the splitting variable of ’s first layer is the 1st feature , then

where and represent the left subtree and the right subtree of ’s top node respectively, and is the splitting value. Let , then

Assuming without loss of generality, then we have

and

Our inductive assumption tells us that, for the classification problem, both and will have a error. Hence, the three probabilities , and are all equal to . Finally,

In the proof above, we assume that each component of the input is split only once; in other words, once the CART algorithm (book:esl, p. 305) splits a decision tree on a given variable, it does not split on that variable again in other layers. This is just for clarity and conciseness, because one can use the law of total probability to handle the more complicated situations.

Although the n-XOR is a special case in which each factor interacts with all the other factors, it is enough to demonstrate that shallow decision trees (especially one-layer stumps) may be unable to deal with factors that are not independent of each other.

3.2 The population without “comonotonicity”

In this section, we show the shortcomings of AdaBoost based on stumps by introducing the concept of “comonotonicity”. The conclusion in this section can be regarded as an extension of Section 3.1, since the XOR function does not have the property of comonotonicity, as we will discuss later.

Definition 8 (Comonotonic Population).

A population is comonotonic, if its Bayes classifier satisfies: for any constant and any , there exists an such that for each and , the elements in

are all non-positive or all non-negative, where .

To give an intuition of comonotonicity, in Fig. 6 we give three examples of populations that are not comonotonic. In Fig. 6(a), 6(b), and 6(c), the decision boundaries of the Bayes classifiers form the shapes of an XOR, a ring, and a diagonal band, respectively. The yellow (light) regions take one value of the output, and the blue (dark) regions take the other. Note that, in each panel, there exist arrows pointing from regions of one value to regions of the other, and arrows pointing in the opposite direction. These arrows tell us that the populations are not comonotonic.

(a) XOR
(b) Ring
(c) Diagonal
Figure 6: The populations WITHOUT comonotonicity.

The following theorem shows that AdaBoost based on stumps cannot deal with populations without comonotonicity.

Theorem 3.

For a population with a Bayes classifier, a necessary condition for the classifier trained by AdaBoost based on stumps to converge to the Bayes classifier as the number of iterations grows is that the population is comonotonic.

Proof.

The AdaBoost.M1 algorithm in book:esl shows that, after a given number of iterations, the final strong classifier must take the form

where the summands are base learners (stumps). In other words, the final classifier must be a linear combination of base learners.

A stump “” with variables can be expressed as

where , and the splitting variable of the stump is the -th feature.

Without loss of generality, we require that can only be or . Then the linear combination of stumps trained by AdaBoost can be represented as

where is a constant, is the splitting variable of the -th stump, and is the splitting value of the -th stump. Since , we can adjust all inequality signs to the same direction:

For simplicity, let us consider the 2-dim case (). One can generalize the following conclusions to high-dimensional spaces similarly. According to the splitting variable of each stump, we can separate the stumps into two groups as

where is the number of stumps with as the splitting variable, is the number of stumps with as the splitting variable, and . Without loss of generality, we assume that , and .

Recall the definition of comonotonicity (Definition 8). For any constant , take any , and . We sort and together as

Then, from the expression of , we have

i.e., it does not depend on . Similarly, we also have that does not depend on . Let , and if the algorithm converges, then

and

will also be constants which do not depend on and , respectively. Therefore, according to the definition of comonotonicity, cannot converge to if the population is not comonotonic. ∎
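The structural core of this proof can be summarized as follows (a sketch in our own illustrative notation, which may differ from the paper's): every stump depends on a single coordinate, so the boosted score is additive across coordinates, and an additive score can only produce comonotonic sign patterns.

% Sketch of the additive structure of an AdaBoost-on-stumps score (illustrative notation).
F_{M}(x) \;=\; \sum_{m=1}^{M}\alpha_{m}h_{m}(x)
        \;=\; \sum_{j=1}^{p} g_{j}(x_{j}),
\qquad
g_{j}(t) \;=\; c_{j} + \sum_{m:\,j_{m}=j}\alpha_{m}s_{m}\mathbf{1}\{t > t_{m}\}.
% Changing one coordinate x_k (holding the others fixed) changes F_M by g_k(x_k') - g_k(x_k),
% an amount independent of the remaining coordinates, so the sign pattern of F_M is
% comonotonic; hence F_M cannot converge to a non-comonotonic Bayes classifier such as the XOR.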

To show the intuition behind the proof above, Fig. 7 illustrates the property of the final strong classifier.

(a)
(b) The bird’s-eye view of (a)
Figure 7: An example of the strong classifier .

Fig. 7 is a toy example of the strong classifier. Fig. 7(a) is the graph of the function, and Fig. 7(b) is its bird’s-eye view. The darker the color, the smaller the value the function takes. The values are written explicitly in Fig. 7(b), which shows that the values in row 2 are greater than those in row 1, the values in row 3 are greater than those in row 2, and so on. Similarly, the values in column 2 are greater than those in column 1, and the values in column 3 are greater than those in column 2. All numbers in the grid increase or decrease by the same amounts from left to right and from bottom to top. This is the pattern of comonotonicity.

We have already shown in Fig. 6(a) that the XOR function is not comonotonic. Therefore, if the Bayes classifier of a population is the XOR function, it is impossible to give a good answer to the classification problem by training AdaBoost based on stumps. The conclusion in this section can thus be regarded as a generalization of Section 3.1.

In portfolio management, it is very common for factors to interact with each other. Hence, non-comonotonic populations are not rare. Although AdaBoost based on stumps can achieve good results in some areas, in financial studies, relying only on stumps is far from reaching the desired goal. In Section 4, we use empirical studies to show that using deeper trees as base learners of AdaBoost is usually a better choice in portfolio management.
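The practical consequence of Theorem 3 can be checked numerically, as in the sketch below (our own toy setup, labels taken as the sign of x1·x2, assuming a recent scikit-learn): on noiseless 2-XOR data, AdaBoost built on stumps stays well below the Bayes accuracy of 100% no matter how many iterations are used, while the same AdaBoost built on depth-2 base trees gets much closer to it.

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)

def xor_sample(n):
    # Noiseless 2-XOR data: the Bayes classifier is the XOR itself and the Bayes error is 0.
    X = rng.uniform(-1, 1, size=(n, 2))
    return X, np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

X_tr, y_tr = xor_sample(2000)
X_te, y_te = xor_sample(20000)

for depth in (1, 2):   # stumps vs. depth-2 base learners
    ada = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=depth),
                             n_estimators=500).fit(X_tr, y_tr)
    acc = np.mean(ada.predict(X_te) == y_te)
    print(f"base-learner depth {depth}: test accuracy = {acc:.3f}")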

4 Empirical studies

In this section, we use data from the Chinese A-share market to carry out empirical studies of a factor investing strategy based on AdaBoost. How to construct a stock factor strategy is an open problem with a long history in portfolio management. From wang2012DB, who invented the N-LASR model, to fievet2018trees, who proposed a decision tree forecasting model, and gu2018_ML_asset_pricing and dhondt2020, who gave comprehensive analyses of machine learning methods for the canonical problem of empirical asset pricing, all agree that the strategy performance may improve if the prediction model can dig out non-linear and complex information.

Our empirical studies have two goals. On the one hand, by selecting an optimal portfolio management strategy based on AdaBoost, we want to verify the general theoretical results about the interpolation and localization of AdaBoost in Section 2 and Section 3. On the other hand, we want to illustrate the good performance of the equal-weighted strategy based on AdaBoost.

In order to achieve the first goal, we give a sensitivity analysis of the depth of the base learners (decision trees) and the number of iterations of AdaBoost on the training set and the test set. We specifically explain how AdaBoost digs out useful information efficiently while decreasing the test error.

4.1 Data

The empirical data start in June 2002 and end in June 2017, 181 months in total. All stocks traded in the Chinese A-share market are included. 60 factors are used in our strategy. The data on the factor exposures and the monthly returns are downloaded from the Wind Financial Terminal (https://www.wind.com.cn/en/Default.html). The 60 factors include not only fundamental factors but also technical factors, such as momentum and turnover. All 60 factors are listed in Table 2.

The original data have been preliminarily cleaned, but we still need to do some preprocessing before training. We remove all stocks that are not traded (or cannot be traded due to the limit-up or limit-down rules in the Chinese market) during the period we study. We remove the factors with an excessive proportion of missing data, and fill in the missing data of the other factors with 0. For each month, we assign the response variables of all stocks according to the cross-sectional ranks of their next-month returns: the response variables of the top-ranked stocks are set to +1, and those of the bottom-ranked stocks to -1.

We divide all the data into a training set and a test set manually. The 181 months of data are divided into two sets: the first 127 months (June 2002–December 2012) are taken as the training set, and the last 54 months (January 2013–June 2017) are taken as the test set. The size of the training set is 193,455 (the sum of the stock numbers over all months), and the size of the test set is 133,277. We use the training set to fit models, and then use the test set to evaluate the models and verify our conclusions in Sections 2 and 3.
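A sketch of this cross-sectional labeling and chronological split, written in pandas with a hypothetical schema (the column names, the top/bottom cutoffs, and the handling of middle-ranked stocks are our placeholders, not the paper's exact choices):

import numpy as np
import pandas as pd

rng = np.random.default_rng(5)

# Hypothetical panel: one row per (month, stock); in practice the 60 factor columns
# of Table 2 would also be present.
months = pd.period_range("2002-06", "2017-06", freq="M").astype(str)   # 181 months
df = pd.DataFrame({
    "month": np.repeat(months, 50),
    "stock": np.tile([f"S{i:03d}" for i in range(50)], len(months)),
    "next_month_return": rng.normal(size=181 * 50),
})

def label_one_month(g, top=0.3, bottom=0.3):
    # Rank stocks by next-month return within the month; label the top ranks +1
    # and the bottom ranks -1 (cutoffs are illustrative placeholders).
    r = g["next_month_return"].rank(pct=True)
    g = g.copy()
    g["y"] = 0
    g.loc[r >= 1 - top, "y"] = 1
    g.loc[r <= bottom, "y"] = -1
    return g[g["y"] != 0]

labeled = df.groupby("month", group_keys=False).apply(label_one_month)
train = labeled[labeled["month"] <= "2012-12"]   # June 2002 - December 2012
test = labeled[labeled["month"] >= "2013-01"]    # January 2013 - June 2017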

alr IR_bps_252 IR_netasset_126 IR_net_profit_63 IR_roe_252 net_assets
amount_21 IR_bps_63 IR_netasset_252 IR_oper_rev_126 IR_roe_63 oper_rev_ttm
avg_volume_63 IR_eps_126 IR_netasset_63 IR_oper_rev_252 IR_totasset_126 pb
bps IR_eps_252 IR_net_profit_126 IR_oper_rev_63 IR_totasset_252 q_eps
IR_bps_126 IR_eps_63 IR_net_profit_252 IR_roe_126 IR_totasset_63 q_grossprofitmargin
q_netprofitmargin q_ps rt_252 tot_assets ttm_pcf turnover_126
q_oper_rev q_roa rt_63 ttm_eps ttm_pe turnover_21
q_orps q_roe shr_float2tot ttm_grossprofitmargin ttm_ps turnover_252
q_pcf rt_126 s_dq_mv ttm_netprofitmargin ttm_roa turnover_63
q_pe rt_21 s_val_mv ttm_orps ttm_roe val_float2tot
Table 2: The 60 factors.

4.2 The performance of the AdaBoost classifiers

In this section, we analyze how the performance of the classifiers trained by AdaBoost varies with two hyperparameters: the depth of the base learners (decision trees) and the number of iterations. Both hyperparameters are typically regarded as sources of overfitting. More specifically, we consider the following hyperparameters:

  • Max_Depth: the maximum depth of the base learners (decision trees), which takes the values 2, 4, 6, 8, and 10;

  • N_Steps: the number of iterations, which takes the values 10, 20, 30, 40, 50, 100, 500, and 1000.

In order to analyze the influence of these two hyperparameters on the fitting ability of AdaBoost, we fix the other parameters and set the learning rate of all models to 0.1. On both the training set and the test set, we use the AUC (area under the ROC curve) to measure the performance of the models, as a supplement to the usual error calculation.
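The grid evaluation itself can be sketched as follows (assuming scikit-learn; the placeholder arrays stand for the factor exposures and the ±1 labels prepared in Section 4.1, and the reduced grid below would be extended to the full hyperparameter grid of Table 3):

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder data so the sketch runs standalone; replace with the real factor panel.
rng = np.random.default_rng(6)
X_train, y_train = rng.normal(size=(2000, 60)), rng.choice([-1, 1], 2000)
X_test, y_test = rng.normal(size=(1000, 60)), rng.choice([-1, 1], 1000)

results = []
for max_depth in (2, 4, 6):            # Table 3 goes further (up to depth 8 and 10)
    for n_steps in (10, 50):           # and up to 1000 iterations
        ada = AdaBoostClassifier(
            estimator=DecisionTreeClassifier(max_depth=max_depth),
            n_estimators=n_steps,
            learning_rate=0.1,          # fixed for all models, as in the text
        ).fit(X_train, y_train)
        for split, X, y in (("train", X_train, y_train), ("test", X_test, y_test)):
            score = ada.decision_function(X)            # real-valued score for the AUC
            results.append((max_depth, n_steps, split,
                            roc_auc_score(y, score),    # AUC
                            np.mean(ada.predict(X) != y)))  # error
for row in results:
    print(row)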

The performance results of all models we studied are summarized in Table 3. Based on the results, we observe that:

  • The training/test AUC and the training/test error are consistent: if the AUC is high, the error is low in almost every scenario. For example, when Max_Depth = 2 and N_Steps = 10 (the 1st model), the training AUC is 0.5412 while the training error is 0.4701; and when Max_Depth = 2 and N_Steps = 20 (the 2nd model), the training AUC is 0.5436 while the training error is 0.4700; the two measures move in opposite directions.

  • The training AUC increases monotonically as the complexity of the model increases. Specifically, from the first model to the last model, the complexity increases; meanwhile, the training AUC increases from 0.5412 to 0.6828, and the training error decreases accordingly.

  • The test AUC also increases almost monotonically as the complexity of the model increases. For instance, when Max_Depth = 2 and N_Steps increases from 10 to 50 (the 1st–5th models), the test AUC increases from 0.5433 to 0.5480; when N_Steps = 20 and Max_Depth increases from 2 to 6 (the 2nd, 7th, and 12th models), the test AUC also increases, from 0.5462 to 0.5490.

  • The changes in the test AUC are relatively small and stable compared with those in the training AUC. For example, over the first 15 models, the test AUC changes from 0.5433 to 0.5513, while the training AUC changes from 0.5412 to 0.5946. This suggests that a test AUC of around 0.55 may be a stable threshold for the model, which reflects the ability of our methods to dig out the market information contained in our dataset. It is noteworthy that 0.55 is not a bad result in the Chinese stock market, according to the experience of the industry.

  • On the training set, the performance is more sensitive to Max_Depth than to N_Steps. In detail, given Max_Depth = 2, the training AUC changes from 0.5412 to 0.5533 as N_Steps increases from 10 to 50 (the 1st–5th models); however, given N_Steps = 10, the training AUC changes from 0.5412 to 0.5818 as Max_Depth increases from 2 to 6 (the 1st, 6th, and 11th models).

We find that, as the depth of the trees and the number of iterations increase, the AUC on the test set increases steadily without dramatic changes. We can conclude that, in these cases, the more iteration steps, the better the classifier, and the more complex the base-learner trees, the better the classifier.

Model No.  Max_Depth  N_Steps  Training AUC  Training Error  Test AUC  Test Error
1 2 10 0.5412 0.4701 0.5433 0.4713
2 2 20 0.5436 0.4700 0.5462 0.4741
3 2 30 0.5499 0.4665 0.5468 0.4728
4 2 40 0.5511 0.4656 0.5476 0.4716
5 2 50 0.5533 0.4636 0.5480 0.4714
6 4 10 0.5628 0.4545 0.5463 0.4671
7 4 20 0.5682 0.4515 0.5487 0.4697
8 4 30 0.5699 0.4500 0.5489 0.4681
9 4 40 0.5713 0.4505 0.5498 0.4669
10 4 50 0.5723 0.4498 0.5500 0.4669
11 6 10 0.5818 0.4418 0.5458 0.4715
12 6 20 0.5870 0.4392 0.5490 0.4683
13 6 30 0.5913 0.4353 0.5502 0.4675
14 6 40 0.5930 0.4346 0.5506 0.4676
15 6 50 0.5946 0.4338 0.5513 0.4670
16 8 100 0.6300 0.4108 0.5519 0.4663
17 8 500 0.6356 0.4071 0.5531 0.4659
18 8 1000 0.6358