## 1 Introduction

In recent years, deep neural networks have achieved excellent performance in many application scenarios such as face recognition and automatic speech recognition (ASR)

(LeCun et al., 2015). However, deep neural networks are difficult to be interpreted. This defect severely restricts the development of deep learning in a few application scenarios, where the model’s interpretability is needed. Moreover, the deep neural networks are very data-hungry due to the large complexity of the models, which means that the model’s performance can decrease significantly when the size of the training data decreases

(Elsayed et al., 2018; Lv et al., 2018).In many real tasks, due to the high cost of data collection and labeling, the amount of labeled training data may be insufficient to train a deep neural network. In such a situation, traditional learning methods such as random forest (R.F.)

(Breiman, 2001), gradient boosting decision tree (GBDT)

(Friedman, 2001; Chen and Guestrin, 2016), support-vector machines (SVMs)

(Cortes and Vapnik, 1995), etc., are still good choices. By realizing that the essence of deep learning lies in the layer-by-layer processing, in-model feature transformation, and sufficient model complexity (Zhou and Feng, 2018), recently Zhou and Feng (2017) proposed the deep forest model and the gcForest algorithm to achieve*forest representation learning*. It can achieve excellent performance on a broad range of tasks, and can even perform well on small or middle-scale of data. Later on, a more efficient improvement was presented (Pang et al., 2018), and it shows that forest is able to do auto-encoder which thought to be a specialty of neural networks (Feng and Zhou, 2018)

. The tree-based multi-layer model can even do distributed representation learning which was thought to be a special feature of neural networks

(Feng et al., 2018). Utkin and Ryabinin (2018) proposed a Siamese deep forest as an alternative to the Siamese neural networks to solve the metric learning tasks.Though deep forest has achieved great success, its theoretical exploration is less developed. The layer-by-layer representation learning is important for the cascaded deep forest, however, the cascade structure in deep forest models does not have a sound interpretation. We attempt to explain the benefits of the cascaded deep forest in the view of boosted representations.

### 1.1 Our results

In Section 2

, we reformulate the cascade deep forest as an additive model (strong classifier) optimizing the margin distribution:

(1) |

where is a scalar determined by

- the margin distribution loss function reweighting the training samples. The input of forest block

are the raw feature and the augmented feature :(2) |

which is defined by such a recursion form. Unlike all the weak classifiers of traditional boosting are chosen from the same hypotheses set , the layer- hypotheses set in the cascade deep forest contains that of the previous layer, i.e., , due to is recursive. We name such a cascaded representation learning algorithm margin distribution deep forest (mdDF).

In Section 3, we give a new upper bound on the generalization error of such an additive model:

(3) |

where is the size of training set, is a margin parameter, is a ratio between the margin standard deviation and the expected margin, denotes the margin of the samples.

Margin distribution. We prove that the generalization error can be bounded by . When the margin distribution ratio is small enough, our bound will be dominated by the higher order term . This bound is tighter than previous bounds proved by Rademacher’s complexity (Cortes et al., 2014). This result inspires us to optimize the margin distribution by minimizing the ratio . Therefore, we utilize an appropriate margin distribution loss function to optimize the first- and second-order statistics of margin.

Mixture coefficients. As for the overfitting risk of such a deep model, our bound inherits the conclusion in Cortes et al. (2014). The cardinality of hypotheses set is controlled by the mixture coefficients s in equation 1. The hypotheses-set term in our bound implies that, while some hypothesis sets used for learning could have a large complexity, this may not be detrimental to generalization if the corresponding total mixture weight is relatively small. In other words, the coefficients s need to minimize the expected margin distribution loss , which implies the generalization ability of the layer cascaded deep forest.

Extensive experiments validate that mdDF can effectively improve the performance on classification tasks, especially for categorical and mixed modeling tasks. More intuitively, the visualizations of the learned features in Figure 7 and Figure 9 show great in-model feature transformation of the mdDF algorithm. The mdDF not only inherits all merits from the cascaded deep forest but also boosts the learned features over layers in cascade forest structure.

### 1.2 Additional related work

The gcForest (Zhou and Feng, 2018) is constructed by multi-grained scanning operation and cascade forest structure. The multi-grained scanning operation aims to dealing with the raw data which holds spatial or sequential relationships. The cascade forest structure aims to achieving in-model feature transformation, i.e., layer-by-layer representation learning. It can be viewed as an ensemble approach that utilizes almost all categories of strategies for diversity enhancement, e.g., input feature manipulation and output representation manipulation (Zhou, 2012).

Krogh and Vedelsby (1995) have given a theoretical equation derived from error-ambiguity decomposition:

(4) |

where denotes the error of an ensemble, denotes the average error of individual classifiers in the ensemble, and denotes the average ambiguity, later called diversity, among the individual classifiers. This offers general guidance for ensemble construction, however, it cannot be taken as an objective function for optimization, because the ambiguity is mathematically defined in the derivation and cannot be operated directly.

In this paper, we will use the margin distribution theory to analyze the cascade structure in deep forest and guide its layer-by-layer representation learning. The margin theory was first used to explain the generalization of the Adaboost algorithm (Schapire et al., 1998; Breiman, 1999). Then a sequence of research (Reyzin and Schapire, 2006; Wang et al., 2011; Gao and Zhou, 2013) tries to prove the relationship between the generalization gap and the empirical margin distribution for boosting algorithm. Cortes et al. (2017) propose a deep boosting algorithm which boosts the accuracy of variant depth decision trees, and Cortes et al. (2017); Huang et al. (2018) offer a Rademacher bcomplexity analysis to deep neural networks. However, these theoretical results depend on the Rademacher complexity rather than the margin distribution. Since the Rademacher complexity of the forest module cannot be explicitly formulized, it cannot be taken as an objective function for optimization.

## 2 Cascaded Deep Forest

As shown in Figure 1, the cascaded deep forest is composed of stacked entities referred to as forest block s. Each forest block consists of several forest modules, which are commonly RF (random forest) (Breiman, 2001) and CRF (Completely-random forest) (Zhou and Feng, 2017). Cascade structure transmits the samples’ representation layer-by-layer by concatenating the augmented feature onto the origin input feature . In fact, we can name this operation “preconc" (prediction concatenation), because the augmented feature is the prediction scores of forests in each layer. It is worth noting that “preconc" is completely different from the stacking operation (Wolpert, 1992; Breiman, 1996) in traditional ensemble learning. The second-level learners in stacking act on the prediction space composed of different base learners and the information of origin input feature space is ignored. Using the stacking operation with more than two layers would suffer seriously from overfitting in the experiment, and cannot enable a deep model by itself. The cascade structure is the key to success of forest representation learning, however, there has been no explicit explanation for this layer-by-layer process yet.

Firstly we reformulate the cascaded deep forest as an additive model mathematically in this section. We consider training and test samples generated i.i.d. from distribution over , where is the input space and is the label space. We denote by a training set of samples drawn according to . denote families ordered by increasing complexity, i.e., .

A cascaded deep forest algorithm can be formalized as follows. We use a quadruple form where

, where denotes the function computed by the -th forest block which is defined by equation 5;

, where denotes the layer cascaded forest defined by equation 6, and drawn from the hypothesis set ;

, where denotes the augmented feature in layer , which is defined by equation 7;

, where is the updated sample distribution in layer .

is the -level weak module returned by the random forest block algorithm 1. It is learned from the raw training sample and the augmented feature from the previous layer and the reweighting distribution :

(5) |

With these weak modules, we can define the layer cascaded deep forest as:

(6) |

The augmented feature is defined as follows:

(7) |

where and is need to be optimized and updated.

Here, we find that the layer cascaded deep forest is defined as a recursion form:

(8) |

Unlike all the weak classifiers of traditional boosting are chosen from the same hypotheses set , the layer- hypotheses set in the cascade deep forest contains that of the previous layer, similar to the hypotheses sets of the deep neural networks (DNNs) in different depth, i.e., .

The entire cascaded model is defined as follows:

(9) |

where is the final prediction vector of cascaded deep forest for classification and denotes a map from average prediction score vector to a label.

Note. Here we generalize the formula of cascaded deep forest as an additive model. In fact, the original version (Zhou and Feng, 2017) sets the data distribution invariable and the augmented feature non-additive , even the final prediction vector is the direct output . Through the generalization analysis in next section, we will explain why we need to optimize and update .

## 3 Generalization Analysis

In this section, we analyze the generalization error to understand the complexity of the cascaded deep forest model. For simplicity, we consider the binary classification task. We define the strong classifier as , i.e., the cascaded deep forest reformulated as an additive model. Now we define the margin for sample as , which implies the confidence of prediction. We assume that the hypotheses set of base classifiers can be decomposed as the union of families ordered by increasing complexity, where . Remarkably, the complexity term of these bounds admits an explicit dependency in terms of the mixture coefficients defining the ensembles. Thus, the ensemble family we consider is , that is the family of functions of the form , where is in the simplex .

For a fixed , any defines a distribution over . Sampling from according to and averaging leads to functions for some , with , and . For any with , we consider the family of functions

(10) |

and the union of all such families . For a fixed , the size of can be bounded as follows:

(11) |

Our margin distribution theory is based on a new Bernstein-type bound as follows:

###### Lemma 1.

For and , we have

(12) |

Proof. For , according to the Markov’s inequality, we have

(13) | ||||

(14) | ||||

(15) |

where the last inequality holds from the independent of . Notice that (the margin is bounded: ), using Taylor’s expansion, we get

(16) | ||||

(17) | ||||

(18) |

where the last inequality holds from Jensen’s inequality and . Therefore, we have

(19) |

If , then we could use Taylor’s expansion again to have

(20) |

Now by picking , we have

(21) |

By Combining the equation 19 and 21 together, we complete the proof. ∎

Since the gap between the margin of strong classifier and the margin of classifiers in the union family is bounded by the margin mean, we can further obtain a margin distribution theorem as follows:

###### Theorem 1.

Let be a distribution over and be a sample of examples chosen independently at random according to

. With probability at least

, for , the strong classifier (depth- mdDF) satisfies thatwhere

Proof.

###### Lemma 2.

[Chernoff bound (Chernoff et al., 1952)] Let be

i.i.d. random variables with

. Then, for any , we have(22) | |||

(23) |

###### Lemma 3.

[Gao and Zhou (2013)] For independent random variables with values in , and for , with probability at least we have

(24) | |||

(25) |

where

For and , we have . For , the Chernoff’s bound in Lemma 2 gives

(26) | ||||

(27) |

Recall that for a fixed . Therefore, for any , combining the union bound with Lemma 3 guarantees that with probability at least over sample , for any and

(28) | ||||

(29) | ||||

(30) | ||||

(31) |

where

(33) |

The inequality 30 is a large probability bound when is large enough and inequality 31 is according to the Jensen’s Inequality. Since there are at most possible -tuples with , by the union bound, for any , with probability at least , for all and :

(34) |

Meantime, we can rewrite

(35) | ||||

(36) | ||||

(37) |

For any , we utilize Chernoff’s bound in Lemma 3 to get:

(38) | ||||

(39) | ||||

(40) | ||||

(41) | ||||

(42) | ||||

(43) |

where

is the variance of the margins.

From Lemma 1, we obtain that

(44) |

Let , and , then we combine the equation 27,28,43 and 44, the proof is completed.

∎

Remark 1. From Theorem 1, we know that the gap between the generalization error and empirical margin loss is generally bounded by the forests complexity , which is controlled by the ratio between the margin standard deviation and the margin mean . This ratio implies that the smaller margin mean and larger margin variance can reduce the complexity of models properly, which is crucial to alleviate the overfitting problem. When the margin distribution is good enough (margin mean is large and margin variance is small), will dominate the order of the sample complexity. This is tighter than the previous theoretical work about deep boosting (Cortes et al., 2014, 2017; Huang et al., 2018) .

Remark 2. Moreover, this novel bound inherits the property of previous bound (Cortes et al., 2014), the hypotheses term admits an explicit dependency on the mixture coefficients s. It implies that, while some hypothesis sets used for learning could have a large complexity, this may not be detrimental to generalization if the corresponding total mixture weight is relatively small. This property also offers a potential to obtain a good generalization result through optimizing the s.

It is worth noting that the analysis here only considers the cascade structure in deep forest. Due to the simplification of the model, we do not analyze the details about the "preconc" operation and the influence of adopting a different type of forests, though these two operations play an important role in practice. The advantages of these operations are evaluated empirically in Section 5.

## 4 Margin Distribution Optimization

The generalization theory shows the importance of optimizing the margin distribution ratio and the mixture coefficients s. Since we reformulate the cascaded deep forest as a additive model, we utilize the reweighting approach to minimize the expected margin distribution loss

(45) |

where the margin distribution loss function is designed to utilize the first- and second-order statistics of margin distribution. The reweighting approach helps the model to boosts the augmented feature in deeper layer, i.e., focus on dealing with the samples which have the large loss in the previous layers. The choice of scalar is determined by minimizing the expected loss for -layer model.

### 4.1 Algorithm for mdDF approach.

We denote by a prediction score space, where is the number of classes. When any sample passes through the cascaded deep forest model, it will get an average prediction vector in each layer: . According to Crammer and Singer (2001), we can define the sample’s margin for multi-class classification as:

(46) |

that is, the prediction’s confidence-rate. For example, in the 3-class problem, as shown in Figure 2, the average prediction score is , and the margin is calculated as .

The initial sample weights are , and we update the -th sample weight by:

(47) |

The margin distribution loss function is defined as follows:

(48) |

where hyper-parameter is a parameter as the margin mean and is a parameter to trade off two different kinds of deviation (keeping the balance on both sides of the margin mean). In practice, we generally choose these two hyper-parameters from the finite sets and . The algorithm utilizing this margin distribution optimization is summarized in Algorithm 2.

### 4.2 The intuition of margin distribution loss function.

Since Reyzin and Schapire (2006) found that the margin distribution of AdaBoost is better than that of arc-gv (Breiman, 1999) which is a boosting algorithm designed to maximize the minimum margin, Reyzin and Schapire (2006) conjectured that margin distribution is more important to get a better generalization performance than the instance with the minimum margin. Gao and Zhou (2013) prove that utilizing both the margin mean and margin variance can portray the relationship between margin and generalization performance for AdaBoost algorithm more precisely. We list the several loss functions of the algorithms based on margin theory to compare and plot them in Figure. 3:

Compared with maximize the minimum margin (SVMs), the optimal margin distribution principle (Gao and Zhou, 2013; Zhang and Zhou, 2017) conjecture that maximizing the margin mean and minimizing the margin variance is the key to achieving a better generalization performance. Figure. 4 shows that optimizing the margin distribution with first- and second-order statistics can utilize more information on training data, e.g. the covariance among the different features. Inspired by this idea, Zhang and Zhou (2017) propose the optimal margin distribution machine (ODM) which can be formulated as equation 52:

(52) |

where and are the deviation of the margin to the margin mean, is a parameter to trade off two different kinds of deviation (larger or less than margin mean). is a parameter of the zero-loss band, which can control the number of support vectors, i.e., the sparsity of the solution, and in the denominator is to scale the second term to be a surrogate loss for 0-1 loss. Similar to support vector machines (SVMs), we can give ODM an intuitive illustration in Figure. 5. Similar to formulating support vector machines as a combination of the hinge loss and the regularization term, we can use margin distribution loss function defined in equation 51 and a regularization term to represent the ODM. Our simplified version margin distribution loss function (48) is similar to that of the ODM. Our forest representation learning approach requires as many samples as possible to train the model and generate the augmented features. Therefore, we remove the parameter which can control the number of support vectors. Our loss function is to optimize the margin distribution to minimize the margin distribution ratio , referring to Remark 2 in Section 3.

Comments

There are no comments yet.