Residual Networks Behave Like Boosting Algorithms

09/25/2019 ∙ by Chapman Siu, et al.

We show that Residual Networks (ResNet) are equivalent to boosting feature representations, without any modification to the underlying ResNet training algorithm. We prove a regret bound based on Online Gradient Boosting theory, which suggests that ResNet could achieve Online Gradient Boosting regret bounds through architectural changes: adding a shrinkage parameter to the identity skip-connections and using residual modules with max-norm bounds. Through this relation between ResNet and Online Boosting, novel feature representation boosting algorithms can be constructed by altering the residual modules. We demonstrate this by proposing decision tree residual modules to construct a new boosted decision tree algorithm, by demonstrating generalization error bounds for both approaches, and by relaxing constraints within the BoostResNet algorithm so that it can be trained in an out-of-core manner. We evaluate convolutional ResNet with and without shrinkage modifications to demonstrate its efficacy, and demonstrate that our online boosted decision tree algorithm is comparable to state-of-the-art offline boosted decision tree algorithms without the drawbacks of offline approaches.




1 Introduction

Residual Networks (ResNet) He et al. (2016) have received a lot of attention due to their performance and their ability to construct “deep” networks while largely avoiding the problem of vanishing (or exploding) gradients. Some attempts have been made to explain ResNets through: unravelling their representation Veit et al. (2016); observing identity loops and finding no spurious local optima Hardt and Ma (2017); and reinterpreting residual modules as weak classifiers, which allows sequential training under boosting theory Huang et al. (2018).

Empirical evidence shows that these deep residual networks, and subsequent architectures with the same parameterizations, are easier to optimize. They also outperform non-residual ones, and have consistently achieved state-of-the-art performance on various computer vision tasks such as CIFAR-10 and ImageNet He et al. (2016).

1.1 Summary of Results

We demonstrate the equivalence of ResNet and Online Boosting in Section 3. We show that the layer-by-layer boosting method of ResNet has an equivalent representation to the additive modelling approaches first demonstrated in LogitBoost Friedman et al. (2000), which boost feature representations rather than the label directly, i.e. ResNet can be framed as an Online Boosting algorithm with composite loss functions. Although traditional boosting results may not apply because ResNet is not a naive weighted ensemble, we can refer to Online Boosting analogues, which provide regret bound guarantees. We demonstrate that under “nice” conditions for the composite loss function, the regret bound for Online Boosting holds, and by extension also applies to ResNet architectures.

Taking inspiration from Online Boosting, we also modify the architecture of ResNet with an additional learnable shrinkage parameter (vanilla ResNet can be interpreted as an Online Boosting algorithm whose shrinkage factor is fixed rather than learned). As this approach only modifies the neural network architecture, the same underlying ResNet training algorithms can still be used.

Experimentally, we compare vanilla ResNet with our modified ResNet using convolutional neural network residual network (ResNet-CNN) on multiple image datasets. Our modified ResNet shows some improvement over vanilla ResNet architecture.

We also compare our boosted neural decision tree residual network against other decision tree ensemble methods on multiple benchmark datasets, including Deep Neural Decision Forests Kontschieder et al. (2016), neural decision trees ensembled via AdaNet Cortes et al. (2017), and off-the-shelf algorithms (gradient boosting decision tree/random forest) using LightGBM Ke et al. (2017). In our experiments, the neural decision tree residual network showed superior performance to the other neural decision tree variants, and comparable performance to traditional offline gradient boosting decision tree models.

1.2 Related Works

In recent years researchers have sought to understand why ResNet performs the way it does. The BoostResNet algorithm reinterprets ResNet as a multi-channel telescoping sum boosting problem for the purpose of introducing a new algorithm for sequential training Huang et al. (2018). Theoretical justification for the representational power of ResNet has been provided under linear neural network constraints Hardt and Ma (2017). One interpretation of residual networks is as a collection of many paths of differing lengths which behave like a shallow ensemble; empirical studies demonstrate that residual networks introduce short paths which can carry gradients throughout the extent of very deep networks Veit et al. (2016).

Comparison with BoostResNet and AdaNet

Combining neural networks and boosting has previously been explored in architectures such as AdaNet Cortes et al. (2017) and BoostResNet Huang et al. (2018). We seek to understand why ResNet achieves its level of performance, without altering how ResNet is trained.

In the case of BoostResNet, a distribution must be explicitly maintained over all examples during training, and parts of the ResNet are trained sequentially, so the model cannot be updated in a truly online, out-of-core manner. In the case of AdaNet, which does not always work for the ResNet structure, additional feature vectors are sequentially added and the network chooses its own structure during learning. Our proposed approach does not require these modifications, and the model can be trained in the same way as an unmodified ResNet. A ResNet-style architecture is a special case of AdaNet, so the AdaNet generalization guarantee applies here and our generalization analysis builds upon that work. Furthermore, we demonstrate that Neural Decision Trees belong to the same family of feedforward neural networks as AdaNet, so the AdaNet generalization guarantee also applies to Neural Decision Tree ResNet modules.

2 Preliminaries

In this section we cover the background of residual neural networks and boosting. We also explore the conditions which enable regret bounds in Online Gradient Boosting setting and the class of feedforward neural networks for AdaNet generalization bounds.

2.1 Residual Neural Networks

A residual neural network (ResNet) is composed of stacked entities referred to as residual blocks.

A residual block of ResNet contains a module and an identity loop. Let each module map its input $x$ to $f_t(x)$, where $t$ denotes the level of the module and $f_t$ is typically a sequence of convolutions, batch normalizations or non-linearities. These formulations may differ depending on context and the model architecture.

We denote the output of the $t$-th residual block by

$$g_{t+1}(x) = f_t(g_t(x)) + g_t(x), \tag{1}$$

where $x$ is the input of the ResNet.

Because the output of a ResNet follows the recursive relation in equation 1, the output of the $(T+1)$-th residual block equals the sum of the lower module outputs, i.e., $g_{T+1}(x) = \sum_{t=0}^{T} f_t(g_t(x))$, where $g_0(x) = 0$ and $f_0(g_0(x)) = x$. For classification tasks, the output of a ResNet is rendered after a linear classifier $w \in \mathbb{R}^{n \times C}$ on the representation $g_{T+1}(x)$, where $C$ is the number of classes and $n$ is the number of channels:

$$\hat{y} = \tilde{\sigma}(F(x)) = \tilde{\sigma}\big(w^\top g_{T+1}(x)\big) = \tilde{\sigma}\Big(w^\top \sum_{t=0}^{T} f_t(g_t(x))\Big), \tag{2}$$

where $\tilde{\sigma}(\cdot)$ denotes a map from classifier output to labels. For example, $\tilde{\sigma}$ could represent a softmax function.
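As a minimal sketch of the recursion described above, the following NumPy snippet unrolls a toy residual network and checks that the final representation equals the sum of all module outputs. The symbols `g` (block output), `module` (residual module) and `w` (linear classifier), the tanh modules and all sizes are assumptions chosen for illustration, not the architecture used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
n_channels, n_classes, n_blocks = 4, 3, 5

# Hypothetical residual modules: one small tanh layer each.
weights = [rng.normal(scale=0.1, size=(n_channels, n_channels)) for _ in range(n_blocks)]

def module(t, h):
    """Residual module number t applied to the representation h."""
    return np.tanh(weights[t] @ h)

def forward(x):
    """Return the final representation and the list of all module outputs."""
    g = x                     # the first block receives the raw input
    module_outputs = [x]      # by convention the zeroth "module output" is x itself
    for t in range(n_blocks):
        f = module(t, g)
        module_outputs.append(f)
        g = f + g             # identity loop: new output = module output + previous output
    return g, module_outputs

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.normal(size=n_channels)
w = rng.normal(size=(n_channels, n_classes))   # linear classifier on the representation
g_final, outs = forward(x)
telescoped = np.sum(outs, axis=0)              # sum of all module outputs
y_hat = softmax(w.T @ g_final)
```

The additive form is what lets the residual modules be read as terms of a boosted ensemble that share one classifier.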

2.2 Boosting

The goal of boosting is to combine weak learners into a strong learner. There are many variations of boosting. For example, in AdaBoost and its derivatives, the boosting algorithm must choose training sets for the weak classifier to force it to make novel inferences Friedman et al. (2000). This was the approach used by BoostResNet Huang et al. (2018). In gradient boosting, this requirement is removed by training against pseudo-residuals, and the method can even be extended to the online learning setting Beygelzimer et al. (2015).

In either scenario, boosting can be viewed as an additive model or linear combination of $T$ models, $F(x) = \sum_{t=1}^{T} \alpha_t f_t(x)$, where $f_t(x)$ is a function of the input $x$ and $\alpha_t$ is the corresponding multiplier for the $t$-th model Friedman et al. (2000); Beygelzimer et al. (2015).

LogitBoost is an algorithm first introduced in “Additive Logistic Regression: A Statistical View of Boosting” Friedman et al. (2000), which introduces boosting on the input feature representation, including neural networks. In the general binary classification scenario, the formulation relies on boosting over the logit or softmax transformation

$$\hat{y} = \tilde{\sigma}(F(x)) = \tilde{\sigma}\Big(\sum_{t=1}^{T} \alpha_t f_t(x)\Big), \tag{3}$$

where $\tilde{\sigma}$ represents the softmax function. This form is similar to the linear classifier layer used by the ResNet algorithm.

Online Boosting introduces a framework for training boosted models in an online manner. Within this formulation, two adjustments are required to make offline boosting models online. First, the partial sums of the weak learners' predictions are multiplied by a shrinkage factor, which is tuned using gradient descent. Second, the outputs of the partial sums must be bounded Beygelzimer et al. (2015).

The bounds presented for online gradient boosting are based on regret. The regret of a learner is defined as the difference between the total loss of the learner and the total loss of the best hypothesis in hindsight:

$$R(T) = \sum_{t=1}^{T} \ell_t(y_t) - \min_{f} \sum_{t=1}^{T} \ell_t(f(x_t)),$$

where $\ell_t$ is the loss at round $t$, $y_t$ is the learner's prediction and $f$ ranges over the comparator hypotheses.

Online gradient boosting regret bounds can be applied to any linear combination of a given base weak learner with a convex, linear loss function whose Lipschitz constant is bounded by $L$.
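To make the definition of regret concrete, here is a small sketch that computes it from recorded per-round losses. The function name `regret` and the toy loss values are assumptions for illustration only; the comparator set is taken to be a finite list of fixed hypotheses.

```python
import numpy as np

def regret(learner_losses, hypothesis_losses):
    """Total learner loss minus the total loss of the best fixed hypothesis.

    learner_losses: shape (T,), loss suffered by the online learner at each round.
    hypothesis_losses: shape (K, T), loss each of K fixed hypotheses would have suffered.
    """
    learner_total = np.sum(learner_losses)
    best_in_hindsight = np.min(np.sum(hypothesis_losses, axis=1))
    return learner_total - best_in_hindsight

learner = np.array([1.0, 0.5, 0.25, 0.25])      # total loss 2.0
comparators = np.array([
    [0.5, 0.5, 0.5, 0.5],                       # total loss 2.0
    [1.0, 0.0, 0.0, 0.5],                       # total loss 1.5, best in hindsight
])
r = regret(learner, comparators)                # 2.0 - 1.5 = 0.5
```

A regret bound that grows sublinearly in the number of rounds means the average per-round gap to the best hypothesis vanishes.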

Corollary 2.1

(From Corollary 1 Beygelzimer et al. (2015)) Let the learning rate , number of weak learners , be given parameters. Algorithm 1 is an online learning algorithm for for set of convex, linear loss functions with Lipschitz constant bounded by with the following regret bound for any :

where , or the initial error, and is the regret or excess loss for the base learner algorithm.

The regret bound in this corollary depends on two conditions: every weak learner must have a finite upper bound on its output, and the set of loss functions must admit an efficiently computable subgradient with a finite upper bound.

Compared with the boosting approach used in BoostResNet, which is based on AdaBoost Huang et al. (2018), the online gradient boosting algorithm does not require maintaining an explicit distribution of weights over the whole training data set and is a “true” online, out-of-core algorithm. Leveraging online gradient boosting allows us to overcome the constraints of the BoostResNet approach.

AdaNet generalization bounds for feedforward neural networks, defined to be multi-layer architectures where units in each layer are only connected to those in the layer below, have been provided by Cortes et al. (2017). They require the weights of each layer to be bounded in $l_p$-norm, with $p \geq 1$, and all activation functions between each layer to be coordinate-wise and 1-Lipschitz. This yields the following generalization error bounds provided by Lemma 2 from Cortes et al. (2017):

Corollary 2.2

(From Lemma 2 Cortes et al. (2017)) Let be distribution over and be a sample of examples chosen independently at a random according to

. With probability at least

, for , the strong decision tree classifier satisfies that

As this bound depends only logarithmically on the depth of the network, this demonstrates the importance of strong performance in the earlier layers of the feedforward network.

Now that we have a formulation for ResNet and boosting, we explore further properties of ResNet, and how we may evolve and create more novel architectures.

3 ResNet are Equivalent to a Boosted Model

As we recall from equations 2 and 3, ResNet indeed has a similar form to LogitBoost. In this scenario, both formulations aim to boost the underlying feature representation. One consequence of the ResNet formulation is that the linear classifier is shared across all ResNet modules.

Assumption 3.1

The -th residual module with a trainable linear classifier layer defined by


Is a weak learner for all . We will call this weak learner the hypothesis module.

This assumption is required to ensure that the hypothesis module is a weak learner, so that it adheres to the learning bounds in Corollary 2.1. We show that the different ResNet module variants used in our experiments satisfy this assumption in Section 6.3.

Overall this demonstrates that the proposed framework is equivalent to traditional boosting frameworks which boost on the feature representation. However, to further analyse the algorithmic results, we need to first consider additional restrictions which are placed within the “Online Boosting Algorithm” framework.

3.1 Online Boosting Considerations

Our representation is a special case of online gradient boosting, as shown in Algorithm 1, and our regret bound analysis builds upon the work in Beygelzimer et al. (2015). The regret bounds for an online boosting algorithm that competes with linear combinations of the base weak learner apply when it is used with a class of convex, linear loss functions with Lipschitz constant bounded by $L$.

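1:  Maintain $N$ copies of the algorithm $\mathcal{A}$, denoted $\mathcal{A}^1, \dots, \mathcal{A}^N$, and choose step size parameter $\eta \in [\frac{1}{N}, 1]$
2:  For each $i$, initialize $\sigma_1^i = 0$.
3:  for $t = 1$ to $T$ do
4:     Receive example $x_t$
5:     Define $y_t^0 = 0$
6:     for $i = 1$ to $N$ do
7:        $y_t^i = (1 - \sigma_t^i \eta)\, y_t^{i-1} + \eta\, \mathcal{A}^i(x_t)$, where $\sigma_t^i$ is our shrinkage factor for algorithm $\mathcal{A}^i$
8:     end for
9:     Predict $y_t = y_t^N$
10:     Obtain loss function $\ell_t$; the model suffers loss $\ell_t(y_t)$
11:     for $i = 1$ to $N$ do
12:        Pass the loss based on the partial sum $y_t^{i-1}$ to $\mathcal{A}^i$, i.e. the gradient $\nabla \ell_t(y_t^{i-1})$ in the descent direction
13:        Update $\sigma_t^i$ using online gradient descent
14:     end for
15:  end for
Algorithm 1 Online Boosting for Composite Loss Functions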
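The loop structure of Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative sketch only: the linear weak learners, the logistic loss, the clipping bound `bound`, the learning rate `lr` and the shrinkage step size `alpha` are all assumptions made here, loosely modelled on online gradient boosting, and are not the exact update rules analysed in the paper.

```python
import numpy as np

def online_boost(X, labels, n_learners=5, eta=0.2, bound=5.0, lr=0.1):
    """Online gradient boosting sketch with linear weak learners and logistic loss."""
    n_features = X.shape[1]
    thetas = np.zeros((n_learners, n_features))   # one weak learner per row
    sigmas = np.zeros(n_learners)                 # shrinkage factors, initialised at 0
    losses = []
    for t, (x, label) in enumerate(zip(X, labels), start=1):
        # Forward pass: build the partial sums one weak learner at a time.
        partial = np.zeros(n_learners + 1)
        for i in range(n_learners):
            weak = np.clip(thetas[i] @ x, -1.0, 1.0)             # bounded weak prediction
            y = (1.0 - sigmas[i] * eta) * partial[i] + eta * weak
            partial[i + 1] = np.clip(y, -bound, bound)           # keep partial sums bounded
        losses.append(np.log1p(np.exp(-label * partial[-1])))    # logistic loss of the final sum
        # Update pass: each learner sees the loss gradient at the preceding partial sum.
        alpha = 1.0 / (bound * np.sqrt(t))
        for i in range(n_learners):
            grad = -label / (1.0 + np.exp(label * partial[i]))   # derivative of the logistic loss
            thetas[i] -= lr * grad * x                           # gradient step on the linear loss
            sigmas[i] = np.clip(sigmas[i] + alpha * grad * partial[i], 0.0, 1.0)
    return thetas, sigmas, np.array(losses)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
labels = np.sign(X @ np.array([1.0, -2.0, 0.5]))   # synthetic, linearly separable labels
thetas, sigmas, losses = online_boost(X, labels)
```

Each example is seen once and no distribution over the training set is stored, which is the property that separates this style of boosting from AdaBoost-based schemes.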

Algorithm 1 attains the regret bound stated in Corollary 2.1, which follows directly from Corollary 1 of Beygelzimer et al. (2015). We provide further analysis of this algorithm, in particular the validity of composite loss functions, in Section 4.

4 Analysis of Online Boosting for Composite Loss Functions

In this section we provide analysis of Corollary 2.1. Corollary 2.1 holds for a learning algorithm whose losses belong to the class of convex, linear loss functions with Lipschitz constant bounded by $L$. Next we describe the conditions under which composite loss functions belong to this class.

A composite loss function, formed by composing a loss with the inverse of a link function, belongs to this class if the link is the canonical link function. Using the canonical link has been shown to be a sufficient but not necessary condition for the composite loss function to be convex Reid and Williamson (2010).

Lemma 4.1

Composite loss functions retain smoothness

Proof: If the link satisfies Lipschitz continuity (e.g. the logistic function/softmax, as its derivative is bounded everywhere), then the composite loss is also Lipschitz continuous, because the composition of Lipschitz continuous functions is Lipschitz continuous. If $f$ has Lipschitz constant $L_f$ and $g$ has Lipschitz constant $L_g$, then

$$|f(g(a)) - f(g(b))| \leq L_f\, |g(a) - g(b)| \leq L_f L_g\, |a - b|.$$

Hence, if the product of the Lipschitz constants of the loss function and of the link is bounded by $L$, the composite loss also has Lipschitz constant bounded by $L$ and belongs to the base loss function class. An example of such a link function is the logit function, which has a Lipschitz constant of 1 and is the canonical link function for log loss (cross entropy loss), which suggests that the composite loss function is indeed convex and belongs to this loss function class.

This demonstrates that ResNet, which boosts on the feature representation and has a logit link, satisfies the regret bound shown in Corollary 2.1.
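As a numerical sanity check of the argument above, the sketch below evaluates the log loss composed with the sigmoid link (the familiar logistic loss on the raw score) on a grid and confirms that its finite-difference slopes stay below 1 and that its second differences are non-negative. The grid range and the helper name `logistic_loss` are assumptions for illustration; a grid check is not a proof.

```python
import numpy as np

def logistic_loss(z, label):
    """Log loss composed with the sigmoid link, written in a numerically stable form."""
    return np.logaddexp(0.0, -label * z)

z = np.linspace(-20.0, 20.0, 4001)
values = logistic_loss(z, 1.0)

# Finite-difference slopes: bounded by the Lipschitz constant of the composite loss.
slopes = np.abs(np.diff(values) / np.diff(z))
max_slope = slopes.max()

# Second differences: non-negative for a convex function (up to rounding error).
second_diff = np.diff(values, n=2)
min_second_diff = second_diff.min()
```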

5 Recovering Loss For Intermediary Residual Modules


Figure 1: The architecture of a modified residual network (three residual modules) with shrinkage parameter and shared linear classifier

When relating ResNet to the Online Boosting algorithm, note that the Online Boosting algorithm requires the gradient of the underlying boosting function to be recovered as part of the update process, as shown in line 12 of Algorithm 1. One approach to this challenge was suggested in BoostResNet, where a common auxiliary linear classifier is used across all residual modules. However, this approach was not explored in that work because BoostResNet focused on sequential training of residual modules, and such a constraint was deemed inappropriate. Instead, BoostResNet constructs different linear classifier layers, which are dropped at every stage once the residual modules have been trained.

Our approach to remediate this is to formulate the ResNet architecture as a single-input, multi-output training problem, whereby each residual module will have an explicit ‘shortcut’ to the output labels whilst sharing the same linear classifier layer. This architecture is shown in Figure 1.

Remark: It has been demonstrated that, by carefully constructing a ResNet, the last layer need not be trained and can instead be a fixed random projection Hardt and Ma (2017). This has been demonstrated through theoretical justifications in linear neural networks.

1:  Maintain ResNet Modules , shrinkage layers , linear classifier layer and choose step size parameter , constructed as per Figure 2
2:  For each , initialize shrinkage layer .
3:  Define
4:  for  to  do
5:     {Feed Forward}
6:     Receive example
7:     for  to  do
9:        Predict and output
10:     end for
11:     {Back Propagation}
12:     for  to  do
13:        Update all layers in the subnetwork up to ResNet module via back propagation using the final prediction output
14:     end for
15:  end for
Algorithm 2 Online Boosting Algorithm as ResNet with Shrinkage
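The feed-forward half of Algorithm 2 can be sketched as a single-input, multi-output network in which every residual module feeds the same linear classifier. In this sketch the skip-connection is scaled by a per-module shrinkage value and module outputs are rescaled to respect a max-norm bound; the exact placement of the shrinkage, the tanh modules, and the names `forward_multi_output`, `max_norm` and `total_loss` are assumptions for illustration. Back propagation is omitted, since in practice an automatic differentiation framework would handle it.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def max_norm(v, bound):
    """Rescale v so that its Euclidean norm does not exceed bound."""
    norm = np.linalg.norm(v)
    return v if norm <= bound else v * (bound / norm)

def forward_multi_output(x, module_weights, shrinkage, classifier, bound=10.0):
    """Return one class-probability vector per residual module."""
    g = x
    outputs = []
    for W, alpha in zip(module_weights, shrinkage):
        f = max_norm(np.tanh(W @ g), bound)           # residual module with bounded output
        g = alpha * g + f                             # skip-connection scaled by the shrinkage value
        outputs.append(softmax(classifier.T @ g))     # shared classifier gives this module's prediction
    return outputs

def total_loss(outputs, label):
    """Sum of cross-entropy losses over all module outputs."""
    return -sum(np.log(p[label]) for p in outputs)

rng = np.random.default_rng(1)
n, C, T = 4, 3, 3
module_weights = [rng.normal(scale=0.5, size=(n, n)) for _ in range(T)]
shrinkage = np.ones(T)              # a value of 1 leaves the skip-connection unscaled
classifier = rng.normal(size=(n, C))
x = rng.normal(size=n)
outputs = forward_multi_output(x, module_weights, shrinkage, classifier)
loss = total_loss(outputs, label=2)
```

Summing the per-module losses gives every module a direct gradient path to the label, which is the role the shared classifier plays in the text.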

In our ResNet algorithm, if the linear classifier layer is shared, then the model is framed as a single-input, multi-output residual network, where the outputs all predict the same label. The predicted outputs of the network, one for each weak learner, correspond to the predictions on lines 8 and 9 of Algorithm 2. This setup allows each residual module to be updated by back propagation with respect to the label, in the same manner as line 12 in Algorithm 1. Similarly, the shrinkage layers in Algorithm 2 are updated as per line 13 of Algorithm 1.


Figure 2: The architecture of a modified residual network two modules and with shrinkage layers.

When a ResNet is unravelled, its paths are distributed in a binomial manner Veit et al. (2016): for a network with $n$ modules, there is one path that passes through all $n$ modules and $n$ paths that go through exactly one module, with an average path length of $n/2$ Veit et al. (2016). This means that even without a shared layer, there will be paths within the ResNet framework through which the gradient is recovered to the residual modules. This approach is shown in Figure 2, and has an identical setup to Algorithm 2 except that, in the back propagation step, we update all layers based on the whole network using the final output only.
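The binomial path count can be checked directly. In the short sketch below, each residual module is either taken or skipped, so the number of paths through exactly `k` of `n_modules` modules is a binomial coefficient; the function names are assumptions for illustration.

```python
from math import comb

def path_length_distribution(n_modules):
    """Number of paths that pass through exactly k residual modules, for k = 0..n_modules."""
    return [comb(n_modules, k) for k in range(n_modules + 1)]

def average_path_length(n_modules):
    counts = path_length_distribution(n_modules)
    total_paths = sum(counts)                  # every module is either taken or skipped
    return sum(k * c for k, c in enumerate(counts)) / total_paths

counts = path_length_distribution(3)   # [1, 3, 3, 1]
avg = average_path_length(3)           # 1.5
```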

Remark: if a residual network is reframed as the Online Boosting Algorithm 1, it is equivalent to fixing the shrinkage parameter so that it is untrainable. For the regret bounds to hold, we require the shrinkage parameter to be trainable, and the outputs of each residual module to be bounded by a predetermined max-norm.

In Section 6 we will provide empirical evidence validating both approaches.

5.1 Neural Decision Tree

Another popular application of boosting algorithms is through the construction of decision trees. In order to demonstrate how ResNet could be used to boost a variety of models with different residual module representations, we describe our construction for our Neural Decision Tree ResNet and the associated generalization error analysis.

5.1.1 Construction of Neural Decision Tree and Generalization Error Analysis

To demonstrate that the decision tree formulation based on Deep Neural Decision Forests belongs to this family of neural network models, consider the residual module shown in Figure 5, where the split functions are realized by randomized multi-layer perceptrons Kontschieder et al. (2016). This construction is a neural network with three sets of layers that belongs to the family of artificial neural networks defined by Cortes et al. (2017), which requires the weights of each layer to be bounded in $l_p$-norm, with $p \geq 1$, and all activation functions between each layer to be coordinate-wise and 1-Lipschitz. The sizes of these layers are based on a predetermined number of nodes $n$ with a corresponding number of leaves $n + 1$. Let the input space be $\mathcal{X}$ and, for any $x \in \mathcal{X}$, let $h_0 \in \mathbb{R}^k$ denote the corresponding feature vector.


Figure 3: Left: Iris Decision Tree by Scikit-Learn, Right: Corresponding Parameters for our Neural Network. Changing the softmax function to a deterministic routing will yield precisely the same result as the Scikit-Learn decision tree.


Figure 4: Decision Tree as a three layer Neural Network. The Neural Network has two trainable layers: the decision tree nodes, and the leaf nodes.

The first layer is the decision node layer. It is defined by trainable parameters $\Theta = \{W, b\}$, with $W \in \mathbb{R}^{k \times n}$ and $b \in \mathbb{R}^{n}$. Define $\tilde{W} = [W \oplus -W] \in \mathbb{R}^{k \times 2n}$ and $\tilde{b} = [b \oplus -b] \in \mathbb{R}^{2n}$, which represent the positive and negative routes of each node. Then the output of the first layer is $H_1(x) = \tilde{W}^\top x + \tilde{b}$. This is interpreted as the linear decision boundary which dictates how each node is to be routed.

The next layer is the probability routing layer, which is untrainable and is given by a predetermined binary matrix $Q$. This matrix is constructed to define an explicit form for routing within a decision tree, since the routes in a decision tree are fixed and predetermined. The routing matrix $Q$ describes the relationship between the nodes and the leaves. If there are $n$ nodes and $n + 1$ leaves, then $Q \in \mathbb{R}^{2n \times (n+1)}$, where the rows of $Q$ represent the presence of each binary decision of the $n$ nodes for the corresponding leaf. We define the activation function to be $\phi_2(x) = (\log \circ\, \mathrm{softmax})(x)$. Then the output of the second layer is $H_2(x) = Q^\top (\phi_2 \circ H_1)(x)$. As $\log(x)$ is a 1-Lipschitz bounded function in the domain $(0, 1)$ and the range of the softmax is $(0, 1)$, then by extension $\phi_2(x)$ is a 1-Lipschitz bounded function for $x \in \mathbb{R}$. As $Q$ is a binary matrix, the output of $H_2(x)$ must also be in the range $(-\infty, 0)$.

The final output layer is the leaf layer. This is a fully connected layer to the previous layer, defined by the parameter $\pi \in \mathbb{R}^{n+1}$, whose size is the number of leaves. The activation function is defined to be $\phi_3(x) = \exp(x)$. The output of the last layer is defined to be $H_3(x) = \pi^\top (\phi_3 \circ H_2)(x)$. Since $H_2(x)$ has range $(-\infty, 0)$, $\phi_3$ is a 1-Lipschitz bounded function, as $\exp(x)$ is 1-Lipschitz bounded in the domain $(-\infty, 0)$. As each activation function is 1-Lipschitz, our decision tree neural network belongs to the same family of artificial neural networks defined by Cortes et al. (2017), and thus our decision trees have the corresponding generalisation error bounds related to AdaNet.

The formulation of these equations and their parameters is shown in figure 3 which demonstrates how a decision tree trained in Python Scikit-Learn can have its parameters be converted to a neural decision tree, and figure 4 demonstrates the formulation of the three layer network which constructs this decision tree.
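The three-layer construction can be sketched for a small tree with three decision nodes and four leaves. This is an illustrative reading of the description above rather than the authors' implementation: the pairwise log-softmax over each node's positive and negative route, the particular routing matrix `Q`, and the per-leaf class distributions `leaf_values` are assumptions made here.

```python
import numpy as np

# Routing matrix for a depth-2 tree: rows are routes, columns are leaves.
# Node 0 is the root, node 1 its positive child, node 2 its negative child.
Q = np.array([
    # leaves: (+,+) (+,-) (-,+) (-,-)
    [1, 1, 0, 0],   # node 0 positive route
    [1, 0, 0, 0],   # node 1 positive route
    [0, 0, 1, 0],   # node 2 positive route
    [0, 0, 1, 1],   # node 0 negative route
    [0, 1, 0, 0],   # node 1 negative route
    [0, 0, 0, 1],   # node 2 negative route
])

def log_pair_softmax(z):
    """Log-softmax over each (z_j, -z_j) pair, returned as positive routes then negative routes."""
    log_norm = np.logaddexp(z, -z)
    return np.concatenate([z - log_norm, -z - log_norm])

def neural_tree(x, W, b, Q, leaf_values):
    h1 = W.T @ x + b                    # decision node layer: one linear boundary per node
    h2 = Q.T @ log_pair_softmax(h1)     # routing layer: sum log-probabilities along each path
    leaf_prob = np.exp(h2)              # probability of reaching each leaf
    return leaf_prob, leaf_prob @ leaf_values

rng = np.random.default_rng(2)
k, n_nodes, n_classes = 5, 3, 2
W = rng.normal(size=(k, n_nodes))
b = rng.normal(size=n_nodes)
leaf_values = rng.dirichlet(np.ones(n_classes), size=n_nodes + 1)   # one class distribution per leaf
x = rng.normal(size=k)
leaf_prob, prediction = neural_tree(x, W, b, Q, leaf_values)
```

Because every root-to-leaf path multiplies complementary probabilities, the leaf probabilities sum to one, so the output is a proper mixture of the leaf distributions.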

5.1.2 Extending Neural Decision Trees to ResNet

For our Neural Decision Tree ResNet, in order to ensure that the feature representation is invariant to the number of leaves in the tree, we add a linear projection so that the shortcut connections match the dimensions, as suggested in the original ResNet implementation He et al. (2016).


Figure 5: Decision Tree Residual Module based on Decision Tree Algorithm in “Deep Neural Decision Forests” Kontschieder et al. (2016)

In this way, we have demonstrated that our variations of residual modules retain the generalization bounds proved by Cortes et al. (2017) and retain true out-of-core online boosted learning, in contrast to existing algorithms such as BoostResNet Huang et al. (2018).

6 Experiments

Below, we perform experiments on two different ResNet architectures.

First, we examine the convolutional ResNet variant He et al. (2016), with and without the addition of a trainable shrinkage parameter. Both models are assessed on the Street View House Numbers (SVHN) Netzer et al. (2011) and CIFAR-10 Krizhevsky et al. (2012) benchmark datasets.

Second, we examine the efficacy of creating boosted decision tree models in the ResNet framework. Our approach was compared against other neural decision tree ensemble models and offline models, including Deep Neural Decision Forests Kontschieder et al. (2016), neural decision trees ensembled via AdaNet Cortes et al. (2017), and off-the-shelf algorithms (gradient boosting decision tree/random forest) using LightGBM Ke et al. (2017). All models were assessed using UCI datasets, which are detailed in Section B of the appendix.

In both scenarios, each dataset was divided into a training set and a test set.

6.1 Convolution Network ResNet

For both the CIFAR-10 and SVHN datasets we fit the same 20-layer ResNet. This ResNet consists of one $3 \times 3$ convolution, followed by stacks of $6n$ layers with $3 \times 3$ convolutions on feature maps of sizes $\{32, 16, 8\}$ respectively, with $2n$ layers for each feature map size. The numbers of filters are $\{16, 32, 64\}$ respectively. The subsampling is performed by convolutions with a stride of 2, and the network ends with global average pooling, a 10-way fully connected layer and softmax. The implementation is taken directly from the Keras CIFAR-10 ResNet sample code. The model was run without image augmentation, with a fixed batch size, for 200 epochs. To compare against the original ResNet, we augment the ResNet model by adding a trainable shrinkage parameter as described in Section 5 (ResNet-Shrinkage), and also consider our augmented ResNet model with both the shrinkage parameter and a shared linear layer (ResNet-Shared).

Model                 Train     Test
ResNet                0.93886   0.94630
ResNet (Shrinkage)    0.93917   0.94852
ResNet (Shared)       0.93689   0.94626
Table 1: Accuracies on the SVHN task. All models were trained with the same number of iterations (200 epochs, with the learning schedule defined in the original ResNet model).

Model                 Train     Test
ResNet                0.98412   0.91530
ResNet (Shrinkage)    0.98504   0.91870
ResNet (Shared)       0.94496   0.88570
Table 2: Accuracies on the CIFAR-10 task. All models were trained with the same number of iterations (200 epochs, with the learning schedule defined in the original ResNet model).

We find that the model with shrinkage has marginally higher accuracy than the vanilla ResNet-20 implementation on both datasets. The ResNet-Shared model is comparable on the SVHN task but falls short on the CIFAR-10 task. In general, adding shrinkage does not harm the performance of ResNet models and in certain cases improves it.

6.2 Neural Decision Tree ResNet

The next experiment conducted was to address whether ResNet could be used to boost a variety of models with different residual module representations. We compared our decision tree in ResNet (ResNet-DT), and ResNet with shared linear classifier layer (ResNet-DT Shared) with Deep Neural Decision Forests Kontschieder et al. (2016) (DNDF), neural decision trees ensembled via AdaNet Cortes et al. (2017) (AdaNet-DT), and off-the-shelf algorithms (gradient boosting decision tree/random forest) using LightGBM Ke et al. (2017) which we denote as LightGBDT, LightRF respectively.

The ResNet-DT, ResNet-DT Shared, DNDF, LightGBM and LightRF models all used an ensemble of 15 trees with a maximum depth of 5 (i.e. 32 nodes). Each of these models was run for 200 epochs.

For AdaNet-DT, the candidate sub-networks used are decision trees identical to implementation in DNDF. This means that at every iteration, a candidate neural decision tree was either added or discarded with no change to the ensemble. The complexity measure function was defined to be where is the number of hidden layers (i.e. number of nodes) in the decision tree Golowich et al. (2018). For AdaNet-DT, the algorithm started with tree, and was run times with epoch per iteration, allowing AdaNet to build up to trees. Once the final neural network structure was chosen, it was run for another 200 epoch and used for comparison with the other models.

To assess the efficacy, we used a variety of datasets from the UCI repository. Full results for the training and test data sets are provided in section B of the appendix.


Figure 6: Boxplot of Relative Performance with Deep Neural Decision Forest Model as Baseline on train dataset. High values indicate better performance.


Figure 7: Boxplot of Relative Performance with Deep Neural Decision Forest Model as Baseline on test dataset. High values indicate better performance.
Model                 Mean Improv.   Mean Reciprocal Rank
LightGBM              26.140%        0.545
LightRF               17.414%        0.2683
AdaNet-DT             16.519%        0.345
ResNet-DT             25.360%        0.5783
ResNet-DT (Shared)    20.306%        0.545
Table 3: Mean Improvement compared with Deep Neural Decision Forest Model and Mean Reciprocal Rank on train datasets.
Model                 Mean Improv.   Mean Reciprocal Rank
LightGBM              20.904%        0.67
LightRF               13.665%        0.3367
AdaNet-DT             14.771%        0.315
ResNet-DT             17.892%        0.5117
ResNet-DT (Shared)    12.949%        0.4067
Table 4: Mean Improvement compared with Deep Neural Decision Forest Model and Mean Reciprocal Rank on test datasets.
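The two summary statistics reported in Tables 3 and 4 can be computed along the following lines. This sketch assumes that improvement means relative error reduction against the baseline and that models are ranked per dataset by error; the paper does not spell out these details, and the error values below are made up purely to exercise the functions.

```python
import numpy as np

def mean_reciprocal_rank(errors):
    """errors: (n_datasets, n_models) array of error rates, lower is better."""
    # Rank 1 is the lowest error on each dataset; ties are broken by column order.
    order = np.argsort(errors, axis=1, kind="stable")
    ranks = np.empty_like(order)
    rows = np.arange(errors.shape[0])[:, None]
    ranks[rows, order] = np.arange(1, errors.shape[1] + 1)
    return (1.0 / ranks).mean(axis=0)

def mean_improvement(errors, baseline):
    """Mean relative error reduction of each model against the baseline error per dataset."""
    return ((baseline[:, None] - errors) / baseline[:, None]).mean(axis=0)

# Hypothetical error rates for 3 models on 2 datasets, plus a baseline model.
errors = np.array([
    [0.1, 0.2, 0.3],
    [0.3, 0.1, 0.2],
])
baseline = np.array([0.4, 0.4])
mrr = mean_reciprocal_rank(errors)
improvement = mean_improvement(errors, baseline)
```

Mean reciprocal rank rewards models that frequently rank first, while mean improvement is sensitive to the size of the error reduction, so the two columns can order models differently.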

In order to construct a baseline against which all models are comparable, the results are presented as the average and median error improvement relative to the DNDF models, as they were the worst performing models on these benchmarks. From the results in Table 4, LightGBM performed best, with the best average improvement in error relative to the baseline DNDF model. Interestingly, our ResNet-DT model performed second best, beating the LightRF and AdaNet-DT models. It is important to note that our setup for AdaNet-DT only allowed a “bushy” candidate model; this did not allow AdaNet-DT to build deeper layers as the ResNet-DT approach does, only allowing it to build a wider and shallower architecture by appending additional decision trees. Despite this, the AdaNet-DT implementation did outperform the DNDF implementation.

When examining relative improvement, it is important to understand how the values are distributed. Figures 6 and 7 contain boxplots of relative performance on the train and test datasets respectively. Our experiments suggest that the difference between ResNet-DT and ResNet-DT Shared is within the variance of the results. One interpretation is that joint training lowers the variability in performance and may possibly provide more stable models. As to whether joint training should be used, we believe it should be treated as an optional parameter that is learned at training time.

In general, it would appear our ResNet-DT performance is comparable to LightGBM models whilst providing the ability to update the tree ensemble in an online manner and producing non-greedy decision splits. As this approach can be performed in an online or mini-batch manner, it can be used to incrementally update and train over large datasets compared with LightGBM models which can only operate in an offline manner.


Figure 8: A ResNet-CNN module as defined in He et al. (2016)

6.3 Weak Learning Condition Check

We present a summarised proof demonstrating ResNet-CNN and ResNet-DT satisfy the weak learning condition as stated in Assumption 3.1. The full proof is provided in Section A of the appendix.

For both cases, it is sufficient to demonstrate that there exists a parameterization such that the residual module . Applying this parameterization over the recursive relation , suggests there exists a parameterization of the residual module such that . As is a learnable weight and a linear model, which is a known weak learner Mannor and Meir (2002), demonstrating that hypothesis modules created through residual modules are weak learners.

ResNet-CNN: We briefly demonstrate that a ResNet setup He et al. (2016) with dense layers can recover the identity. We defer the convolutional layer case to Section A of the appendix.

We will ignore the batch normalization function in ResNet, noting that batch normalization layer with centering value of and scale of is a valid parameterization. As such the residual module can be expressed as

where , are the appropriate weight matrices with , being the respective biases and is ReLu activation. Suppose is chosen to be the identity matrix and is chosen to be a matrix containing a single value representing , and . Hence there exists a parameterization of ResNet-CNN where as required.
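As an illustration of the dense-layer argument, the following Python sketch builds one parameterization under which a two-layer dense residual module with a ReLu returns its input unchanged. The specific choice here (identity weight matrices, a first bias that shifts every entry to be non-negative, and a second bias that undoes the shift) is an assumption made for illustration and is not necessarily the exact parameterization used in the proof. The names `dense_residual_module` and `identity_parameters` are hypothetical, and batch normalization is omitted, as in the text.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dense_residual_module(x, W1, b1, W2, b2):
    """Two dense layers with a ReLU in between: W2 @ relu(W1 @ x + b1) + b2."""
    return W2 @ relu(W1 @ x + b1) + b2

def identity_parameters(x_batch):
    """Assumed construction: shift inputs to be non-negative, then shift back."""
    d = x_batch.shape[1]
    # Smallest shift that makes every entry of the batch non-negative.
    c = max(0.0, -float(x_batch.min()))
    W1 = np.eye(d)
    b1 = np.full(d, c)
    W2 = np.eye(d)
    b2 = np.full(d, -c)
    return W1, b1, W2, b2

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))  # 5 example inputs of dimension 4
params = identity_parameters(X)
outputs = np.stack([dense_residual_module(x, *params) for x in X])
max_abs_error = float(np.abs(outputs - X).max())  # zero up to floating-point rounding
```

The construction relies on the inputs being bounded below, so that a finite shift exists for which the ReLu never clips.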

Remark: ResNets built under the constraint of a linear residual module with only convolution layers and ReLu activations have been shown to have perfect finite sample expressivity, which is a much stronger condition than recovering only the identity Hardt and Ma (2017).

ResNet-DT: The weak learning condition can be trivially demonstrated through routing the input in a deterministic manner to a single leaf with probability . Under this condition the final linear projection layer, projecting only the target leaf, results in an identity mapping. This demonstrates that a decision tree which routes only to one leaf will have a parameterization . This can also be interpreted as a “decision stump”, which is commonly used in boosting applications.

7 Conclusions and Future Work

We have demonstrated the equivalence between ResNet and the Online Boosting algorithm, and provided a regret bound for ResNet based on the interpretation of residual modules with the linear classifier as weak learners. We have proposed the addition of shrinkage parameters to ResNet, which, based on initial results, appears to be a promising approach for refining ResNet models. We have also demonstrated a method to remove the “offline” restriction of BoostResNet, which requires maintaining a distribution of weights over all training data, by extending it to an online gradient boosting algorithm. Together these provide insight into the interpretation of ResNet as well as extensions of residual modules to new feature representations, such as neural decision trees. These representations allow us to create new boosting variations of decision trees. We have additionally demonstrated that this approach is superior to other neural network decision tree ensemble variants and comparable with state-of-the-art offline variations, without the drawbacks of offline approaches. In addition, we have provided generalization bounds for our residual module implementations. The insights into the relation between boosting and ResNet could spur other changes to the default ResNet architecture, such as challenging the default size of the step parameter in the identity skip-connection. These insights may also change how residual modules are optimized and built, and encourage the development of new residual module architectures.

References


  • A. Beygelzimer, E. Hazan, S. Kale, and H. Luo (2015) Online gradient boosting. In Advances in Neural Information Processing Systems, pp. 2458–2466. Cited by: §2.2, §2.2, §2.2, Corollary 2.1, §3.1, §3.1.
  • J. A. Blackard and D. J. Dean (2000) Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture 24 (3), pp. 131–151. Cited by: 2nd item.
  • C. Cortes, X. Gonzalvo, V. Kuznetsov, M. Mohri, and S. Yang (2017) AdaNet: adaptive structural learning of artificial neural networks. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 874–883. Cited by: §1.1, §1.2, §2.2, Corollary 2.2, §5.1.1, §5.1.1, §5.1.2, §6.2, §6.
  • P. W. Frey and D. J. Slate (1991) Letter recognition using Holland-style adaptive classifiers. Machine Learning 6, pp. 161. Cited by: 8th item.
  • J. Friedman, T. Hastie, R. Tibshirani, et al. (2000) Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). The annals of statistics 28 (2), pp. 337–407. Cited by: §1.1, §2.2, §2.2, §2.2.
  • N. Golowich, A. Rakhlin, and O. Shamir (2018) Size-independent sample complexity of neural networks. In Proceedings of the 31st Conference On Learning Theory, S. Bubeck, V. Perchet, and P. Rigollet (Eds.), Proceedings of Machine Learning Research, Vol. 75, pp. 297–299. Cited by: §6.2.
  • I. Guyon, S. Gunn, A. Ben-Hur, and G. Dror (2005) Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems 17, L. K. Saul, Y. Weiss, and L. Bottou (Eds.), pp. 545–552. Cited by: 5th item.
  • M. Hardt and T. Ma (2017) Identity matters in deep learning. In International Conference on Learning Representations. Cited by: §1.2, §1, §5, §6.3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §1, §1, §5.1.2, Figure 8, §6.3, §6.
  • F. Huang, J. T. Ash, J. Langford, and R. E. Schapire (2018) Learning deep ResNet blocks sequentially using boosting theory. International Conference on Machine Learning 2018. External Links: 1706.04964 Cited by: §1.2, §1.2, §1, §2.2, §2.2, §5.1.2.
  • G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T. Liu (2017) LightGBM: a highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 3146–3154. Cited by: §1.1, §6.2, §6.
  • R. Kohavi (1996) Scaling up the accuracy of Naive-Bayes classifiers: a decision-tree hybrid. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 202–207. Cited by: 1st item.
  • P. Kontschieder, M. Fiterau, A. Criminisi, and S. R. Bulò (2016) Deep neural decision forests. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9-15 July 2016, pp. 4190–4194. Cited by: §1.1, Figure 5, §5.1.1, §6.2, §6.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105. Cited by: §6.
  • S. Mannor and R. Meir (2002) On the existence of linear weak learners and applications to boosting. Machine Learning 48 (1-3), pp. 219–251. Cited by: 1.§, §6.3.
  • K. Nakai and M. Kanehisa (1992) A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics 14, pp. 897–911. Note: MEDLINE Abstract Cited by: 7th item.
  • Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, Cited by: §6.
  • M. D. Reid and R. C. Williamson (2010) Composite binary losses. Journal of Machine Learning Research 11 (Sep), pp. 2387–2422. Cited by: §4.
  • M. Tan and L. J. Eshelman (1988) Using weighted networks to represent classification knowledge in noisy domains. In ML, pp. 121. Cited by: 6th item.
  • A. Veit, M. Wilber, and S. Belongie (2016) Residual networks behave like ensembles of relatively shallow networks. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, USA, pp. 550–558. External Links: ISBN 978-1-5108-3881-9 Cited by: §1.2, §1, §5.


A.1 Weak Learners

To demonstrate that both ResNet-CNN and ResNet-DT satisfy the weak learning condition, we first prove that the existence of the parameterization is a sufficient condition to construct a weak learner.

Lemma A.1

If there exists a parameterization , then for any , .

(By Induction Hypothesis) It is given that and , then using the definition and , we have .

Assume, for some , holds true, then

Since both the base case and the inductive step have been performed, then by mathematical induction for .

Lemma A.2

If there exists a parameterization , then the hypothesis module is a weak learner.

Using Lemma A.1, we can easily see that the hypothesis module

As is a learnable parameter, then the hypothesis module is a weak learner as linear models are weak learners Mannor and Meir (2002).


The ResNet-CNN modules generally consist of blocks of convolution, batch normalization and ReLu activation repeated several times (see implementation in Keras examples, under conv_block and identity_block).

Lemma A.3

There exists convolution layer and batch normalization weights such that

where is the weights of the convolution layer, BN is batch normalization function and is the ReLu function, and is some constant scalar.

As before, we will ignore the batch normalization function in ResNet, noting that batch normalization layer with centering value of and scale of is a valid parameterization. Then we construct convolution layer through choosing only the identity kernel, with bias constructed to be the absolute value of the minimal element in , which we will call . Then

As all elements in would be greater than , which negates the effect of the ReLu activation function.
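The construction in this proof can be checked numerically. The sketch below is a minimal single-channel illustration under stated assumptions: batch normalization is treated as a no-op (as in the text), the convolution uses a 3x3 identity kernel with zero padding, and the bias is the absolute value of the minimal element of the input. The helper `conv2d_same` is a hypothetical naive implementation written for this example, not a library routine.

```python
import numpy as np

def conv2d_same(x, kernel, bias):
    """Naive single-channel 2D cross-correlation with zero padding ('same' output size)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(x, ((ph, ph), (pw, pw)))  # zero padding on every side
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel) + bias
    return out

def relu(z):
    return np.maximum(z, 0.0)

# Identity kernel: a single 1 at the centre, zeros elsewhere.
identity_kernel = np.zeros((3, 3))
identity_kernel[1, 1] = 1.0

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 6))
c = abs(float(x.min()))  # bias: absolute value of the minimal element of x

# Batch normalization is omitted here (treated as a no-op).
y = relu(conv2d_same(x, identity_kernel, c))
```

With this bias every pre-activation value is non-negative, so the ReLu leaves it unchanged and the layer output equals the input shifted by the constant `c`.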

Lemma A.4

We define the residual module as the composition of arbitrarily many , as defined below

Using this definition for some constant scalar

This can trivially be shown via induction. This holds for by Lemma A.3. Assume it holds for , i.e. . Then for

For some constant scalar . Since both the base case and the inductive step have been performed, then by mathematical induction for some constant scalar , which suggests that there exists a parameterization for convolution ResNet modules of any depth which recovers the identity.
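The inductive argument can be illustrated by stacking several such layers and checking that the output is still the input plus a constant. This is a sketch only: it uses a 1D identity kernel with `np.convolve` for brevity, omits batch normalization, and recomputes the bias at each layer as the absolute value of the minimal element of that layer's input. The names `identity_conv_relu` and `stacked_module` are hypothetical.

```python
import numpy as np

def identity_conv_relu(x):
    """One convolution -> ReLU layer with a 1D identity kernel and bias |min(x)|.

    Returns the layer output and the bias that was added.
    """
    kernel = np.array([0.0, 1.0, 0.0])  # identity kernel
    c = abs(float(x.min()))
    out = np.maximum(np.convolve(x, kernel, mode="same") + c, 0.0)
    return out, c

def stacked_module(x, depth):
    """Compose `depth` layers and accumulate the total constant shift."""
    total_shift = 0.0
    h = x
    for _ in range(depth):
        h, c = identity_conv_relu(h)
        total_shift += c
    return h, total_shift

rng = np.random.default_rng(2)
x = rng.normal(size=16)
h, shift = stacked_module(x, depth=4)
```

Each layer adds only a constant, so the composition of any number of layers also adds only a constant, which mirrors the induction step.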

Therefore, using Lemma A.2 and Lemma A.4, the hypothesis module for the convolutional ResNet model is a weak learner.

| Dataset | AdaNet-DT Train | AdaNet-DT Test | ResNet-DT Train | ResNet-DT Test | ResNet-DT (Shared) Train | ResNet-DT (Shared) Test | DNDF Train | DNDF Test | LightGBM Train | LightGBM Test | LightRF Train | LightRF Test |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| adult | 0.8628 | 0.8533 | 0.8837 | 0.8386 | 0.8057 | 0.7733 | 0.8579 | 0.8538 | 0.8638 | 0.8613 | 0.8624 | 0.8596 |
| covtype | 0.7153 | 0.7106 | 0.9689 | 0.8302 | 0.9457 | 0.8246 | 0.6877 | 0.6825 | 0.8396 | 0.7831 | 0.8030 | 0.7598 |
| dna | 0.9919 | 0.9288 | 0.9888 | 0.8848 | 0.9830 | 0.9204 | 0.9794 | 0.9361 | 0.9745 | 0.9466 | 0.9632 | 0.9393 |
| glass | 0.7133 | 0.5781 | 0.7933 | 0.6719 | 0.7933 | 0.5781 | 0.5667 | 0.4688 | 0.8533 | 0.7031 | 0.7067 | 0.5938 |
| letter | 0.8524 | 0.8383 | 0.9934 | 0.9700 | 0.9868 | 0.9590 | 0.8602 | 0.8495 | 0.9481 | 0.9075 | 0.9056 | 0.8693 |
| sat | 0.8976 | 0.8695 | 0.8904 | 0.8710 | 0.9628 | 0.9125 | 0.8519 | 0.8275 | 0.9556 | 0.8885 | 0.9204 | 0.8835 |
| shuttle | 0.9965 | 0.9964 | 0.7860 | 0.7859 | 0.7860 | 0.7859 | 0.7860 | 0.7859 | 0.9997 | 0.9997 | 0.9992 | 0.9985 |
| mandelon | 0.8576 | 0.8478 | 0.9848 | 0.9078 | 0.9881 | 0.8856 | 0.8776 | 0.8722 | 0.9219 | 0.8456 | 0.8700 | 0.8322 |
| soybean | 0.9937 | 0.9069 | 0.9916 | 0.8873 | 0.9979 | 0.8922 | 0.6388 | 0.6127 | 0.9415 | 0.8971 | 0.8058 | 0.7255 |
| yeast | 0.6064 | 0.5618 | 0.7382 | 0.5910 | 0.5486 | 0.4876 | 0.4071 | 0.3865 | 0.7594 | 0.6090 | 0.6843 | 0.5708 |
| Number of wins | 1 | 1 | 3 | 3 | 3 | 1 | 0 | 0 | 3 | 5 | 0 | 0 |
| Mean Reciprocal Rank | 0.345 | 0.3150 | 0.5783 | 0.5117 | 0.5450 | 0.4067 | 0.1983 | 0.2283 | 0.5450 | 0.6700 | 0.2683 | 0.3367 |

Table 5: Full results for all datasets, measured by accuracy.

A.2 Description of Data Sets

The full results are shown in Table 5. All datasets are measured based on accuracy.
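For readers unfamiliar with the two summary rows of Table 5, the sketch below shows one way that the number of wins and the mean reciprocal rank can be computed from per-dataset accuracies. It uses the test accuracies of three datasets from Table 5. The tie rule (tied models share the best rank) is an assumption, as the procedure used for the reported values is not stated, so this sketch is not guaranteed to reproduce the figures in the table.

```python
import numpy as np

models = ["AdaNet-DT", "ResNet-DT", "ResNet-DT (Shared)", "DNDF", "LightGBM", "LightRF"]

# Test accuracies from Table 5: one row per dataset, one column per model.
test_acc = np.array([
    [0.8533, 0.8386, 0.7733, 0.8538, 0.8613, 0.8596],  # adult
    [0.7106, 0.8302, 0.8246, 0.6825, 0.7831, 0.7598],  # covtype
    [0.8383, 0.9700, 0.9590, 0.8495, 0.9075, 0.8693],  # letter
])

def ranks_descending(row):
    """Rank 1 = highest accuracy; tied values share the best rank (assumed tie rule)."""
    return np.array([1 + np.sum(row > value) for value in row])

ranks = np.array([ranks_descending(row) for row in test_acc])

# A win is a rank of 1 on a dataset; the reciprocal rank is averaged over datasets.
wins = dict(zip(models, (ranks == 1).sum(axis=0).tolist()))
mrr = dict(zip(models, (1.0 / ranks).mean(axis=0).tolist()))
```

The mean reciprocal rank rewards models that are consistently near the top, whereas the number of wins only counts first places.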

The datasets used come from the UCI repository and are listed as follows: