# Forest Representation Learning Guided by Margin Distribution

In this paper, we reformulate the forest representation learning approach as an additive model which boosts the augmented feature instead of the prediction. We substantially improve the upper bound of the generalization gap from $O(\sqrt{\ln m/m})$ to $O(\ln m/m)$ when $\lambda$, the margin ratio between the margin standard deviation and the margin mean, is small enough. This tighter upper bound inspires us to optimize the margin distribution ratio $\lambda$. We therefore design the margin distribution reweighting approach (mdDF) to achieve a small ratio $\lambda$ by boosting the augmented feature. Experiments and visualizations confirm the effectiveness of the approach in terms of both performance and representation learning ability. This study offers a novel understanding of the cascaded deep forest from the margin-theory perspective and further uses the mdDF approach to guide layer-by-layer forest representation learning.


## 1 Introduction

In recent years, deep neural networks have achieved excellent performance in many application scenarios such as face recognition and automatic speech recognition (ASR) (LeCun et al., 2015). However, deep neural networks are difficult to interpret. This defect severely restricts the development of deep learning in application scenarios where the model's interpretability is needed. Moreover, deep neural networks are very data-hungry due to their large model complexity, which means that performance can decrease significantly when the size of the training data decreases (Elsayed et al., 2018; Lv et al., 2018).

In many real tasks, due to the high cost of data collection and labeling, the amount of labeled training data may be insufficient to train a deep neural network. In such a situation, traditional learning methods such as random forest (R.F.) (Breiman, 2001), gradient boosting methods (Friedman, 2001; Chen and Guestrin, 2016), and support-vector machines (SVMs) (Cortes and Vapnik, 1995) are still good choices. Realizing that the essence of deep learning lies in layer-by-layer processing, in-model feature transformation, and sufficient model complexity (Zhou and Feng, 2018), Zhou and Feng (2017) proposed the deep forest model and the gcForest algorithm to achieve forest representation learning. It achieves excellent performance on a broad range of tasks, and performs well even on small- or medium-scale data. Later on, a more efficient improvement was presented (Pang et al., 2018), and it was shown that forests can act as auto-encoders, which had been thought to be a specialty of neural networks (Feng and Zhou, 2018). Tree-based multi-layer models can even perform distributed representation learning, which was also thought to be a special feature of neural networks (Feng et al., 2018). Utkin and Ryabinin (2018) proposed a Siamese deep forest as an alternative to Siamese neural networks for metric learning tasks.

Though deep forest has achieved great success, its theoretical exploration remains underdeveloped. Layer-by-layer representation learning is important for the cascaded deep forest; however, the cascade structure in deep forest models does not yet have a sound interpretation. We attempt to explain the benefits of the cascaded deep forest from the view of boosted representations.

### 1.1 Our results

In Section 2, we reformulate the cascaded deep forest as an additive model (strong classifier) optimizing the margin distribution:

$$F(x)=\sum_{t=1}^{T}\alpha_t h_t([x, f_{t-1}(x)]), \tag{1}$$

where $\alpha_t$ is a scalar determined by $D_t$, the margin-distribution-loss reweighting of the training samples. The inputs of the forest block $g_t$ are the raw feature $x$ and the augmented feature $f_{t-1}(x)$:

$$h_t(x)=g_t([x, f_{t-1}(x)])=g_t\Big(\Big[x, \sum_{l=1}^{t-1}\alpha_l h_l(x)\Big]\Big), \tag{2}$$

which is defined in this recursive form. Unlike traditional boosting, where all weak classifiers are chosen from the same hypothesis set $H$, the layer-$t$ hypothesis set $H_t$ in the cascaded deep forest contains that of the previous layer, i.e., $H_1\subseteq H_2\subseteq\cdots\subseteq H_T$, because $h_t$ is recursive. We name this cascaded representation learning algorithm margin distribution deep forest (mdDF).

In Section 3, we give a new upper bound on the generalization error of such an additive model:

$$\Pr_D[yF(x)<0]-\Pr_S[yF(x)<r]\le O\Big(\frac{\ln m}{m}+\lambda\sqrt{\frac{\ln m}{m}}\Big), \tag{3}$$

where $m$ is the size of the training set, $r$ is a margin parameter, $\lambda$ is the ratio between the margin standard deviation and the expected margin, and $yF(x)$ denotes the margin of a sample.

Margin distribution. We prove that the generalization error can be bounded by $O(\ln m/m+\lambda\sqrt{\ln m/m})$. When the margin distribution ratio $\lambda$ is small enough, our bound is dominated by the higher-order term $O(\ln m/m)$. This bound is tighter than previous bounds proved via Rademacher complexity (Cortes et al., 2014). This result inspires us to optimize the margin distribution by minimizing the ratio $\lambda$. Therefore, we utilize an appropriate margin distribution loss function to optimize the first- and second-order statistics of the margin.
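As a concrete illustration, the ratio $\lambda$ can be computed directly from the sample margins $yF(x)$; below is a minimal numpy sketch (the function name `margin_ratio` is ours, not from the paper):

```python
import numpy as np

def margin_ratio(margins):
    """Ratio between the margin standard deviation and the margin mean.

    `margins` holds the signed margins y * F(x) of the training samples;
    a small ratio means the margins are large on average and concentrated.
    """
    margins = np.asarray(margins, dtype=float)
    mean = margins.mean()
    if mean <= 0:
        raise ValueError("margin mean must be positive for the ratio to be meaningful")
    return margins.std() / mean

# A concentrated margin distribution yields a smaller ratio than a spread one.
tight = margin_ratio([0.9, 1.0, 1.1])
spread = margin_ratio([0.1, 1.0, 1.9])
```

Minimizing this ratio is exactly what the margin distribution loss in Section 4 aims at.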

Mixture coefficients. As for the overfitting risk of such a deep model, our bound inherits the conclusion of Cortes et al. (2014): the cardinality of the hypothesis set is controlled by the mixture coefficients $\alpha_t$ in equation 1. The hypothesis-set term in our bound implies that, while some hypothesis sets used for learning could have large complexity, this may not be detrimental to generalization if the corresponding total mixture weight is relatively small. In other words, the coefficients $\alpha_t$ need to minimize the expected margin distribution loss, which governs the generalization ability of the $T$-layer cascaded deep forest.

Extensive experiments validate that mdDF can effectively improve performance on classification tasks, especially on categorical and mixed-modeling tasks. More intuitively, the visualizations of the learned features in Figure 7 and Figure 9 show the strong in-model feature transformation of the mdDF algorithm. mdDF not only inherits all the merits of the cascaded deep forest but also boosts the learned features over the layers of the cascade forest structure.

The gcForest (Zhou and Feng, 2018) is constructed from a multi-grained scanning operation and a cascade forest structure. The multi-grained scanning operation aims to deal with raw data that holds spatial or sequential relationships. The cascade forest structure aims to achieve in-model feature transformation, i.e., layer-by-layer representation learning. It can be viewed as an ensemble approach that utilizes almost all categories of strategies for diversity enhancement, e.g., input feature manipulation and output representation manipulation (Zhou, 2012).

Krogh and Vedelsby (1995) gave a theoretical equation derived from the error-ambiguity decomposition:

$$E=\bar{E}-\bar{A}, \tag{4}$$

where $E$ denotes the error of the ensemble, $\bar{E}$ denotes the average error of the individual classifiers in the ensemble, and $\bar{A}$ denotes the average ambiguity, later called diversity, among the individual classifiers. This offers general guidance for ensemble construction; however, it cannot be taken as an objective function for optimization, because the ambiguity is defined mathematically within the derivation and cannot be operated on directly.
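For an averaged regression ensemble under squared error, the decomposition holds exactly and can be verified numerically; a small synthetic sketch (all variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(size=200)                           # regression targets
preds = y + rng.normal(scale=0.5, size=(5, 200))   # 5 noisy ensemble members

ensemble = preds.mean(axis=0)                      # averaged ensemble prediction
E = np.mean((ensemble - y) ** 2)                   # ensemble squared error
E_bar = np.mean((preds - y) ** 2)                  # average individual error
A_bar = np.mean((preds - ensemble) ** 2)           # average ambiguity (diversity)
# Error-ambiguity decomposition: E == E_bar - A_bar, so more diversity
# (larger A_bar) lowers the ensemble error for fixed individual errors.
```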

In this paper, we use the margin distribution theory to analyze the cascade structure in deep forest and to guide its layer-by-layer representation learning. Margin theory was first used to explain the generalization of the AdaBoost algorithm (Schapire et al., 1998; Breiman, 1999). Then a sequence of works (Reyzin and Schapire, 2006; Wang et al., 2011; Gao and Zhou, 2013) tried to establish the relationship between the generalization gap and the empirical margin distribution for boosting algorithms. Cortes et al. (2017) proposed a deep boosting algorithm which boosts the accuracy of decision trees of varying depth, and Cortes et al. (2017); Huang et al. (2018) offered a Rademacher-complexity analysis of deep neural networks. However, these theoretical results depend on the Rademacher complexity rather than the margin distribution. Since the Rademacher complexity of the forest module cannot be explicitly formulated, it cannot be taken as an objective function for optimization.

As shown in Figure 1, the cascaded deep forest is composed of stacked entities referred to as forest blocks. Each forest block consists of several forest modules, commonly RF (random forest) (Breiman, 2001) and CRF (completely-random forest) (Zhou and Feng, 2017). The cascade structure transmits the samples' representation layer by layer by concatenating the augmented feature $f_{t-1}(x)$ onto the original input feature $x$. In fact, we can name this operation "preconc" (prediction concatenation), because the augmented feature consists of the prediction scores of the forests in each layer. It is worth noting that "preconc" is completely different from the stacking operation (Wolpert, 1992; Breiman, 1996) in traditional ensemble learning. The second-level learners in stacking act on the prediction space composed of different base learners, and the information of the original input feature space is ignored. Using the stacking operation with more than two layers suffers seriously from overfitting in experiments, and cannot enable a deep model by itself. The cascade structure is the key to the success of forest representation learning; however, there has been no explicit explanation for this layer-by-layer process yet.
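A minimal sketch of one "preconc" step, using scikit-learn forests (ExtraTreesClassifier stands in for the completely-random forest; this illustrates the operation only, not the gcForest implementation):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, n_classes=3,
                           n_informative=6, random_state=0)

# One forest block: an RF and a CRF stand-in, each emitting class probabilities.
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
crf = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)
augmented = np.hstack([rf.predict_proba(X), crf.predict_proba(X)])

# "preconc": keep the ORIGINAL features and concatenate the prediction scores,
# unlike stacking, which would discard X and keep only the predictions.
X_next = np.hstack([X, augmented])
```

In a real deep forest the class probabilities would be produced by cross-validation to avoid overfitting the augmented feature; that detail is omitted here.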

## 2 Formulation

We first reformulate the cascaded deep forest as an additive model in this section. We consider training and test samples generated i.i.d. from a distribution $D$ over $X\times Y$, where $X$ is the input space and $Y$ is the label space. We denote by $S$ a training set of $m$ samples drawn according to $D$. $H_1,\dots,H_T$ denote families ordered by increasing complexity, i.e., $H_1\subseteq H_2\subseteq\cdots\subseteq H_T$.

A cascaded deep forest algorithm can be formalized as follows. We use a quadruple form $(g_t, h_t, f_t, D_t)$, where

- $g_t$ denotes the function computed by the $t$-th forest block, defined by equation 5;
- $h_t$ denotes the layer-$t$ cascaded forest, defined by equation 6 and drawn from the hypothesis set $H_t$;
- $f_t$ denotes the augmented feature in layer $t$, defined by equation 7;
- $D_t$ is the updated sample distribution in layer $t$.

$g_t$ is the $t$-level weak module returned by the random forest block algorithm $A_{rfb}$ (Algorithm 1). It is learned from the raw training samples, the augmented feature from the previous layer, and the reweighting distribution $D_t$:

$$g_t=\begin{cases}A_{rfb}\big([x_i; y_i]_{i=1}^{m}, D\big) & t=1,\\ A_{rfb}\big([x_i, f_{t-1}(x_i); y_i]_{i=1}^{m}, D_t\big) & t>1.\end{cases}\tag{5}$$

With these weak modules, we can define the layer-$t$ cascaded deep forest as:

$$h_t(x)=\begin{cases}g_t(x) & t=1,\\ g_t([x, f_{t-1}(x)]) & t>1.\end{cases}\tag{6}$$

The augmented feature $f_t(x)$ is defined as follows:

$$f_t(x)=\begin{cases}\alpha_t h_t(x) & t=1,\\ \alpha_t h_t([x, f_{t-1}(x)])+f_{t-1}(x) & t>1,\end{cases}\tag{7}$$

where $\alpha_t$ and $D_t$ need to be optimized and updated.

Here, we find that the layer-$t$ cascaded deep forest is defined in a recursive form:

$$h_t(x)=g_t([x, f_{t-1}(x)])=g_t\Big(\Big[x,\sum_{l=1}^{t-1}\alpha_l h_l(x)\Big]\Big).\tag{8}$$

Unlike traditional boosting, where all weak classifiers are chosen from the same hypothesis set $H$, the layer-$t$ hypothesis set $H_t$ in the cascaded deep forest contains that of the previous layer, similar to the hypothesis sets of deep neural networks (DNNs) of different depths, i.e., $H_1\subseteq H_2\subseteq\cdots\subseteq H_T$.

The entire cascaded model is defined as follows:

$$\tilde{F}(x)=\tilde{\sigma}(F(x))=\mathop{\arg\max}_{j\in\{1,2,\dots,s\}}\Big[\sum_{t=1}^{T}\alpha_t h_t^{j}(x)\Big],\tag{9}$$

where $\tilde{F}(x)$ is the final prediction of the cascaded deep forest for classification and $\tilde{\sigma}$ denotes a map from the average prediction score vector to a label.

Note. Here we generalize the formula of the cascaded deep forest as an additive model. In fact, the original version (Zhou and Feng, 2017) keeps the data distribution fixed and the augmented feature non-additive, and the final prediction vector is the direct output of the last layer. Through the generalization analysis in the next section, we explain why we need to optimize $\alpha_t$ and update $D_t$.
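The recursion of equations 5-9 can be sketched as a small training/prediction loop. This is a simplified illustration with a fixed mixture coefficient and no sample reweighting (both are optimized in mdDF), using a scikit-learn random forest as the forest block:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

ALPHA = 0.5  # fixed mixture coefficient; mdDF optimizes alpha_t per layer

def fit_cascade(X, y, n_layers=3):
    """Fit the additive cascade: layer t sees [x, f_{t-1}(x)]."""
    n_classes = len(np.unique(y))
    blocks, f = [], np.zeros((X.shape[0], n_classes))
    for t in range(n_layers):
        inputs = X if t == 0 else np.hstack([X, f])       # eq. 5/6
        g = RandomForestClassifier(n_estimators=30, random_state=t).fit(inputs, y)
        f = ALPHA * g.predict_proba(inputs) + f           # eq. 7: f_t = a_t h_t + f_{t-1}
        blocks.append(g)
    return blocks

def predict_cascade(blocks, X):
    f = np.zeros((X.shape[0], blocks[0].n_classes_))
    for t, g in enumerate(blocks):
        inputs = X if t == 0 else np.hstack([X, f])
        f = ALPHA * g.predict_proba(inputs) + f
    return f.argmax(axis=1)                               # eq. 9: argmax over scores

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
blocks = fit_cascade(X, y)
train_acc = (predict_cascade(blocks, X) == y).mean()
```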

## 3 Generalization Analysis

In this section, we analyze the generalization error to understand the complexity of the cascaded deep forest model. For simplicity, we consider the binary classification task. We define the strong classifier as $F(x)=\sum_{t=1}^{T}\alpha_t h_t(x)$, i.e., the cascaded deep forest reformulated as an additive model. The margin for a sample $(x,y)$ is defined as $yF(x)$, which reflects the confidence of the prediction. We assume that the hypothesis set $H$ of base classifiers can be decomposed as the union of $T$ families $H_1,\dots,H_T$ ordered by increasing complexity, where $H_1\subseteq H_2\subseteq\cdots\subseteq H_T$. Remarkably, the complexity term of these bounds admits an explicit dependency on the mixture coefficients defining the ensembles. Thus, the ensemble family we consider is $\mathcal{F}=\mathrm{conv}(\cup_t H_t)$, that is, the family of functions of the form $\sum_t\alpha_t h_t$, where $\alpha=(\alpha_1,\dots,\alpha_T)$ is in the simplex $\Delta$.

For a fixed $F=\sum_{t=1}^{T}\alpha_t h_t$, the coefficient vector $\alpha$ defines a distribution over $\{h_1,\dots,h_T\}$. Sampling according to $\alpha$ and averaging leads to functions $G=\frac{1}{n}\sum_{k=1}^{T}\sum_{j=1}^{N_k}g_{k,j}$ for some $N=(N_1,\dots,N_T)$, with $\sum_{t=1}^{T}N_t=n$ and $g_{k,j}\in H_k$. For any $N$ with $|N|=n$, we consider the family of functions

$$G_{F,N}=\Big\{\frac{1}{n}\sum_{k=1}^{T}\sum_{j=1}^{N_k}g_{k,j}\ \Big|\ \forall (k,j)\in[T]\times[N_k],\ g_{k,j}\in H_k\Big\},\tag{10}$$

and the union of all such families $G_{F,n}=\bigcup_{N}G_{F,N}$. For a fixed $N$, the size of $G_{F,N}$ can be bounded as follows:

$$\ln|G_{F,N}|\le\ln\Big(\prod_{t=1}^{T}|H_t|^{N_t}\Big)=\sum_{t=1}^{T}N_t\ln|H_t|=n\sum_{t=1}^{T}\alpha_t\ln|H_t|\le n\ln\sum_{t=1}^{T}\alpha_t|H_t|.\tag{11}$$

Our margin distribution theory is based on a new Bernstein-type bound as follows:

###### Lemma 1.

For $G\in G_{F,n}$ and $\epsilon>0$, we have

$$\Pr_{S,G_{F,n}}\big[yG(x)-yF(x)\ge\epsilon\big]\le\exp\Big(\frac{-n\epsilon^{2}}{2-2E_S^{2}[yF(x)]+4\epsilon/3}\Big).\tag{12}$$

Proof. For $\lambda>0$, according to Markov's inequality, we have

$$\begin{aligned}\Pr_{S,G_{F,n}}\big[yG(x)-yF(x)\ge\epsilon\big]&=\Pr_{S,G_{F,n}}\Big[\big(yG(x)-yF(x)\big)\tfrac{n\lambda}{2}\ge\tfrac{n\lambda\epsilon}{2}\Big]&(13)\\&\le\exp\Big(-\frac{\lambda n\epsilon}{2}\Big)\,E_{S,G_j\in G_{F,n}}\Big[\exp\Big(\frac{\lambda}{2}\sum_{j=1}^{n}\big(yG_j(x)-yF(x)\big)\Big)\Big]&(14)\\&=\exp\Big(-\frac{\lambda n\epsilon}{2}\Big)\prod_{j=1}^{n}E_{S,G_j\in G_{F,n}}\Big[\exp\Big(\frac{\lambda}{2}\big(yG_j(x)-yF(x)\big)\Big)\Big]&(15)\end{aligned}$$

where the last equality holds by the independence of the $G_j$'s. Notice that $|yG_j(x)-yF(x)|\le 2$ (the margin is bounded: $|yG_j(x)|\le 1$ and $|yF(x)|\le 1$); using Taylor's expansion, we get

$$\begin{aligned}E_{S,G_j\in G_{F,n}}\Big[\exp\Big(\frac{\lambda}{2}\big(yG_j(x)-yF(x)\big)\Big)\Big]&\le 1+E_{S,G_j\in G_{F,n}}\big[(yG_j(x)-yF(x))^{2}\big]\,\frac{e^{\lambda}-1-\lambda}{4}&(16)\\&\le 1+E_S\big[1-(yF(x))^{2}\big]\,\frac{e^{\lambda}-1-\lambda}{4}&(17)\\&\le\exp\Big(\big(1-E_S^{2}[yF(x)]\big)\,\frac{e^{\lambda}-1-\lambda}{4}\Big)&(18)\end{aligned}$$

where the last inequality holds by Jensen's inequality and $1+a\le e^{a}$. Therefore, we have

$$\Pr_{S,G_{F,n}}\big[yG(x)-yF(x)\ge\epsilon\big]\le\exp\Big(\frac{n(e^{\lambda}-1-\lambda)\big(1-E_S^{2}[yF(x)]\big)}{4}-\frac{\lambda n\epsilon}{2}\Big)\tag{19}$$

If $0<\lambda<3$, then we can use Taylor's expansion again to obtain

$$e^{\lambda}-\lambda-1=\sum_{i=2}^{\infty}\frac{\lambda^{i}}{i!}\le\frac{\lambda^{2}}{2}\sum_{m=0}^{\infty}\frac{\lambda^{m}}{3^{m}}=\frac{\lambda^{2}}{2(1-\lambda/3)}.\tag{20}$$

Now, by picking a suitable $\lambda\in(0,3)$, we have

$$-\frac{\lambda\epsilon}{2}+\frac{\lambda^{2}\big(1-E_S^{2}[yF(x)]\big)}{8(1-\lambda/3)}\le\frac{-\epsilon^{2}}{2-2E_S^{2}[yF(x)]+4\epsilon/3}\tag{21}$$

Combining equations 19 and 21 completes the proof. ∎
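The Taylor-expansion step in equation 20 can be checked numerically over $\lambda\in(0,3)$; a quick sketch:

```python
import numpy as np

# Check: e^lam - lam - 1 <= lam^2 / (2 * (1 - lam/3)) for 0 < lam < 3.
lams = np.linspace(1e-6, 2.99, 1000)
lhs = np.exp(lams) - lams - 1
rhs = lams ** 2 / (2 * (1 - lams / 3))
gap = rhs - lhs   # should be nonnegative everywhere on (0, 3)
```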

Since the gap between the margin of the strong classifier $F$ and the margins of the classifiers in the union family $G_{F,n}$ is bounded in terms of the margin mean, we can further obtain a margin distribution theorem as follows:

###### Theorem 1.

Let $D$ be a distribution over $X\times Y$ and $S$ be a sample of $m$ examples chosen independently at random according to $D$. With probability at least $1-\delta$, for $\delta>0$, the strong classifier $F$ (depth-$T$ mdDF) satisfies

$$\Pr_D[yF(x)<0]\le\inf_{r\in(0,1]}\Big[\hat{R}+\frac{1}{m}+\frac{3\sqrt{\mu}}{m^{3/2}}+\frac{7\mu}{3m}+\lambda\sqrt{\frac{3\mu}{m}}\Big],$$

where

$$\hat{R}=\Pr_S[yF(x)<r],\qquad \mu=\frac{\ln m\,\ln\big(2\sum_{t=1}^{T}\alpha_t|H_t|\big)}{r^{2}}+\ln\frac{2}{\delta},\qquad \lambda=\sqrt{\frac{\mathrm{Var}[yF(x)]}{E_S^{2}[yF(x)]}}.$$

Proof.

###### Lemma 2.

[Chernoff bound (Chernoff et al., 1952)] Let $X_1,\dots,X_m$ be i.i.d. random variables with values in $[-1,1]$. Then, for any $\epsilon>0$, we have

$$\Pr\Big[\frac{1}{m}\sum_{i=1}^{m}X_i\ge E[X]+\epsilon\Big]\le\exp\Big(-\frac{m\epsilon^{2}}{2}\Big),\tag{22}$$

$$\Pr\Big[\frac{1}{m}\sum_{i=1}^{m}X_i\le E[X]-\epsilon\Big]\le\exp\Big(-\frac{m\epsilon^{2}}{2}\Big).\tag{23}$$
###### Lemma 3.

[Gao and Zhou (2013)] For independent random variables $X_1,\dots,X_m$ with values in $[0,1]$, and for $\delta>0$, with probability at least $1-\delta$ we have

$$\frac{1}{m}\sum_{i=1}^{m}E[X_i]-\frac{1}{m}\sum_{i=1}^{m}X_i\le\sqrt{\frac{2\hat{V}_m\ln(2/\delta)}{m}}+\frac{7\ln(2/\delta)}{3m},\tag{24}$$

$$\frac{1}{m}\sum_{i=1}^{m}E[X_i]-\frac{1}{m}\sum_{i=1}^{m}X_i\ge-\sqrt{\frac{2\hat{V}_m\ln(2/\delta)}{m}}-\frac{7\ln(2/\delta)}{3m},\tag{25}$$

where $\hat{V}_m$ denotes the empirical variance of $X_1,\dots,X_m$.

For $G\in G_{F,n}$ we have $E_{G_{F,n}}[yG(x)]=yF(x)$. For $\beta>0$, the Chernoff bound in Lemma 2 gives

$$\begin{aligned}\Pr_D[yF(x)<0]&=\Pr_{D,G_{F,n}}\big[yF(x)<0,\,yG(x)\ge\beta\big]+\Pr_{D,G_{F,n}}\big[yF(x)<0,\,yG(x)<\beta\big]&(26)\\&\le\exp(-n\beta^{2}/2)+\Pr_{D,G_{F,n}}\big[yG(x)<\beta\big].&(27)\end{aligned}$$

Recall that $N_t=n\alpha_t$ for a fixed $N$. Therefore, for any $\delta>0$, combining the union bound with Lemma 3 guarantees that, with probability at least $1-\delta$ over the sample $S$, for any $G\in G_{F,N}$ and $\beta>0$,

$$\begin{aligned}\Pr_D[yG(x)<\beta]&\le\Pr_S[yG(x)<\beta]+\sqrt{\frac{2}{m}\hat{V}_m\ln\Big(\frac{2}{\delta}\prod_{t=1}^{T}|H_t|^{N_t}\Big)}+\frac{7}{3m}\ln\Big(\frac{2}{\delta}\prod_{t=1}^{T}|H_t|^{N_t}\Big)&(28)\\&=\Pr_S[yG(x)<\beta]+\sqrt{\frac{2}{m}\hat{V}_m\sum_{t=1}^{T}N_t\ln\Big(\frac{2|H_t|}{\delta}\Big)}+\frac{7}{3m}\sum_{t=1}^{T}N_t\ln\Big(\frac{2|H_t|}{\delta}\Big)&(29)\\&\le\Pr_S[yG(x)<\beta]+\sqrt{\frac{2n}{m}\hat{V}_m\sum_{t=1}^{T}\alpha_t\ln\Big(\frac{2|H_t|}{\delta}\Big)}+\frac{7n}{3m}\sum_{t=1}^{T}\alpha_t\ln\Big(\frac{2|H_t|}{\delta}\Big)&(30)\\&\le\Pr_S[yG(x)<\beta]+\sqrt{\frac{2n}{m}\hat{V}_m\ln\Big(\frac{2\sum_{t=1}^{T}\alpha_t|H_t|}{\delta}\Big)}+\frac{7n}{3m}\ln\Big(\frac{2\sum_{t=1}^{T}\alpha_t|H_t|}{\delta}\Big)&(31)\end{aligned}$$

where

$$\hat{V}_m=\frac{\sum_{i\ne j}\big(I[y_iG(x_i)<\beta]-I[y_jG(x_j)<\beta]\big)^{2}}{2m(m-1)}.\tag{33}$$

Inequality 30 holds with high probability when $n$ is large enough, and inequality 31 follows from Jensen's inequality. Since there are at most $T^{n}$ possible $N$-tuples with $|N|=n$, by the union bound, for any $\delta>0$, with probability at least $1-\delta$, for all $G\in G_{F,n}$ and $\beta>0$:

$$\Pr_D[yG(x)<\beta]\le\Pr_S[yG(x)<\beta]+\sqrt{\frac{2n}{m}\hat{V}_m\ln\Big(\frac{2\sum_{t=1}^{T}\alpha_t|H_t|}{\delta/T^{n}}\Big)}+\frac{7n}{3m}\ln\Big(\frac{2\sum_{t=1}^{T}\alpha_t|H_t|}{\delta/T^{n}}\Big)\tag{34}$$

Meanwhile, we can rewrite

$$\begin{aligned}\hat{V}_m&=\frac{\sum_{i\ne j}\big(I[y_iG(x_i)<\beta]-I[y_jG(x_j)<\beta]\big)^{2}}{2m(m-1)}&(35)\\&=\frac{2m^{2}\Pr_S[yG(x)<\beta]\,\Pr_S[yG(x)\ge\beta]}{2m(m-1)}&(36)\\&=\frac{m}{m-1}\hat{V}^{*}_m&(37)\end{aligned}$$

For any $\theta_1,\theta_2>0$, we utilize the Chernoff bound in Lemma 2 to get:

$$\begin{aligned}\hat{V}^{*}_m&=\Pr_S[yG(x)<\beta]\,\Pr_S[yG(x)\ge\beta]&(38)\\&\le 3\exp(-n\theta_1^{2}/2)+\Pr_S[yF(x)<\beta+\theta_1]\,\Pr_S[yF(x)\ge\beta-\theta_1]&(39)\\&\le 3\exp(-n\theta_1^{2}/2)+\Pr_S\big[yF(x)<\beta+\theta_1\,\big|\,E_S[yF(x)]\ge\beta+\theta_1+\theta_2\big]\,\Pr_S\big[yF(x)\ge\beta-\theta_1\,\big|\,E_S[yF(x)]\ge\beta+\theta_1+\theta_2\big]&(40)\\&\le 3\exp(-n\theta_1^{2}/2)+\frac{\mathrm{Var}[yF(x)]}{\theta_2^{2}}\qquad\text{(by Chebyshev's inequality)}&(41)\\&\le 3\exp(-n\theta_1^{2}/2)+\frac{\mathrm{Var}[yF(x)]}{\big(E_S[yF(x)]-\beta+\theta_1\big)^{2}}&(42)\\&\simeq 3\exp(-n\theta_1^{2}/2)+\frac{\mathrm{Var}[yF(x)]}{E_S^{2}[yF(x)]}&(43)\end{aligned}$$

where $\mathrm{Var}[yF(x)]$ is the variance of the margins.

From Lemma 1, we obtain

$$\Pr_S[yG(x)<\beta]\le\Pr_S[yF(x)<\beta+\theta_1]+\exp\Big(\frac{-n\theta_1^{2}}{2-2E_S^{2}[yF(x)]+4\theta_1/3}\Big)\tag{44}$$

Choosing suitable values of $\beta$, $\theta_1$ and $n$, and combining equations 27, 28, 43 and 44, the proof is completed. ∎

Remark 1. From Theorem 1, we know that the gap between the generalization error and the empirical margin loss is bounded by the forest-complexity term $\mu$ together with the ratio $\lambda$ between the margin standard deviation and the margin mean. This ratio implies that a larger margin mean and a smaller margin variance reduce the effective complexity of the model, which is crucial to alleviating the overfitting problem. When the margin distribution is good enough (the margin mean is large and the margin variance is small), the term $O(\ln m/m)$ dominates the order of the sample complexity. This is tighter than the previous theoretical work on deep boosting (Cortes et al., 2014, 2017; Huang et al., 2018).

Remark 2. Moreover, this novel bound inherits the property of the previous bound (Cortes et al., 2014): the hypotheses term admits an explicit dependency on the mixture coefficients $\alpha_t$. It implies that, while some hypothesis sets used for learning could have large complexity, this may not be detrimental to generalization if the corresponding total mixture weight is relatively small. This property also offers the potential to obtain a good generalization result by optimizing the $\alpha_t$'s.

It is worth noting that the analysis here only considers the cascade structure in deep forest. Due to the simplification of the model, we do not analyze the details about the "preconc" operation and the influence of adopting a different type of forests, though these two operations play an important role in practice. The advantages of these operations are evaluated empirically in Section 5.

## 4 Margin Distribution Optimization

The generalization theory shows the importance of optimizing the margin distribution ratio $\lambda$ and the mixture coefficients $\alpha_t$. Since we reformulate the cascaded deep forest as an additive model, we utilize a reweighting approach to minimize the expected margin distribution loss

$$E_{x\sim S}\big[\ell_{md}\circ F(x)\big]=E_{x\sim S}\Big[\ell_{md}\circ\sum_{t=1}^{T}\alpha_t h_t(x)\Big],\tag{45}$$

where the margin distribution loss function $\ell_{md}$ is designed to utilize the first- and second-order statistics of the margin distribution. The reweighting approach helps the model boost the augmented feature in deeper layers, i.e., focus on the samples that incur large loss in the previous layers. The scalar $\alpha_t$ is chosen by minimizing the expected loss of the $t$-layer model.

### 4.1 Algorithm for mdDF approach.

We denote by $[0,1]^{s}$ the prediction score space, where $s$ is the number of classes. When a sample passes through the cascaded deep forest model, it obtains an average prediction vector $h_t(x)=[h_t^{1}(x),\dots,h_t^{s}(x)]$ in each layer. Following Crammer and Singer (2001), we define the sample's margin for multi-class classification as:

$$\gamma_t(x):=h_t^{y}(x)-\max_{j\ne y}h_t^{j}(x),\tag{46}$$

that is, the prediction's confidence. For example, in the 3-class problem shown in Figure 2, the margin is the score of the true class minus the largest score among the other two classes.
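This confidence-rate margin is straightforward to compute; a minimal sketch with a hypothetical 3-class score vector (the numbers are illustrative, not those of Figure 2):

```python
import numpy as np

def multiclass_margin(scores, label):
    """Crammer-Singer style margin: true-class score minus the best other score."""
    scores = np.asarray(scores, dtype=float)
    others = np.delete(scores, label)
    return scores[label] - others.max()

margin = multiclass_margin([0.7, 0.2, 0.1], label=0)  # 0.7 - 0.2 = 0.5
```

A negative margin means the sample is misclassified by the averaged scores.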

The initial sample weights are uniform, $D_1(i)=1/m$, and we update the $i$-th sample weight by:

$$D_{t+1}(i)=\frac{\ell_{md}\big(\sum_{l=1}^{t}\alpha_l\gamma_l(x_i)\big)}{\sum_{i=1}^{m}\ell_{md}\big(\sum_{l=1}^{t}\alpha_l\gamma_l(x_i)\big)}.\tag{47}$$

The margin distribution loss function is defined as follows:

$$\ell_{md}(z)=\begin{cases}\dfrac{(z-\gamma)^{2}}{\gamma^{2}} & z\le\gamma,\\[6pt]\dfrac{\mu(z-\gamma)^{2}}{(1-\gamma)^{2}} & z>\gamma,\end{cases}\tag{48}$$

where the hyper-parameter $\gamma$ serves as the margin mean and $\mu$ trades off the two kinds of deviation (keeping the balance on both sides of the margin mean). In practice, we choose these two hyper-parameters from small finite candidate sets. The algorithm utilizing this margin distribution optimization is summarized in Algorithm 2.
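Equations 47 and 48 can be sketched together as one reweighting step; the hyper-parameter values below are illustrative picks, not the paper's grids:

```python
import numpy as np

def md_loss(z, gamma=0.5, mu=0.2):
    """Margin distribution loss (eq. 48): quadratic on both sides of gamma,
    with mu down-weighting deviations above the target margin mean."""
    z = np.asarray(z, dtype=float)
    return np.where(z <= gamma,
                    (z - gamma) ** 2 / gamma ** 2,
                    mu * (z - gamma) ** 2 / (1 - gamma) ** 2)

def reweight(cum_margins, gamma=0.5, mu=0.2):
    """Sample-weight update (eq. 47): losses of the cumulative margins,
    normalized to a distribution over the m training samples."""
    losses = md_loss(cum_margins, gamma, mu)
    return losses / losses.sum()

# Small margins receive large weights, so deeper layers focus on them.
D_next = reweight(np.array([0.1, 0.5, 0.9]))
```

Note that a sample whose cumulative margin equals $\gamma$ exactly gets zero weight under this sketch; an implementation may add a small smoothing constant.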

### 4.2 The intuition of margin distribution loss function.

Reyzin and Schapire (2006) found that the margin distribution of AdaBoost is better than that of arc-gv (Breiman, 1999), a boosting algorithm designed to maximize the minimum margin, and conjectured that the margin distribution matters more for generalization performance than the minimum margin alone. Gao and Zhou (2013) proved that utilizing both the margin mean and the margin variance portrays the relationship between the margin and the generalization performance of AdaBoost more precisely. We list several loss functions of margin-theory-based algorithms for comparison and plot them in Figure 3.

Compared with maximizing the minimum margin (as in SVMs), the optimal margin distribution principle (Gao and Zhou, 2013; Zhang and Zhou, 2017) conjectures that maximizing the margin mean and minimizing the margin variance is the key to achieving better generalization performance. Figure 4 shows that optimizing the margin distribution with first- and second-order statistics can utilize more information from the training data, e.g., the covariance among different features. Inspired by this idea, Zhang and Zhou (2017) proposed the optimal margin distribution machine (ODM), which can be formulated as equation 52:

$$\min_{w,\xi_i,\epsilon_i}\ \Omega(w)+\frac{\lambda}{m}\sum_{i=1}^{m}\frac{\xi_i^{2}+\mu\epsilon_i^{2}}{(1-\theta)^{2}}\quad\text{s.t.}\quad \gamma_h(x_i,y_i)\ge 1-\theta-\xi_i,\ \ \gamma_h(x_i,y_i)\le 1+\theta+\epsilon_i,\ \forall i,\tag{52}$$

where $\xi_i$ and $\epsilon_i$ are the deviations of the margin from the margin mean, $\mu$ is a parameter to trade off the two kinds of deviation (larger or smaller than the margin mean), and $\theta$ is the parameter of the zero-loss band, which can control the number of support vectors, i.e., the sparsity of the solution; the $(1-\theta)^{2}$ in the denominator scales the second term to be a surrogate for the 0-1 loss. Similar to support vector machines (SVMs), ODM admits the intuitive illustration in Figure 5. Just as SVMs can be formulated as a combination of the hinge loss and a regularization term, ODM can be represented by the margin distribution loss function defined in equation 51 together with a regularization term. Our simplified margin distribution loss function (48) is similar to that of ODM. Our forest representation learning approach requires as many samples as possible to train the model and generate the augmented features; therefore, we remove the parameter $\theta$ which controls the number of support vectors. Our loss function optimizes the margin distribution to minimize the margin distribution ratio $\lambda$, as discussed in Remark 1 of Section 3.