Neural Forest Learning

11/18/2019
by   Yun-Hao Cao, et al.

We propose Neural Forest Learning (NFL), a novel deep-learning-based random-forest-like method. In contrast to previous forest methods, NFL enjoys the benefits of end-to-end, data-driven representation learning, as well as pervasive support from deep learning software and hardware platforms, hence achieving faster inference speed and higher accuracy than previous forest methods. Furthermore, NFL learns non-linear feature representations in CNNs more efficiently than previous higher-order pooling methods, producing good results with negligible increase in parameters, floating point operations (FLOPs) and real running time. We achieve superior performance on 7 machine learning datasets when compared to random forests and GBDTs. On the fine-grained benchmarks CUB-200-2011, FGVC-aircraft and Stanford Cars, we achieve over 5.7%, 6.9% and 7.8% gains for VGG-16, respectively, with negligible extra cost; moreover, NFL can converge in much fewer epochs, further accelerating network training. On the large-scale ImageNet ILSVRC-12 validation set, integration of NFL into ResNet-18 achieves top-1/top-5 errors of 28.32%/9.77%, which outperforms ResNet-18 by 1.92% in top-1 error with negligible extra cost. The improvements are consistent under various architectures.


Code Repositories

NRS_pytorch: Neural Random Subspace (NRS) official PyTorch implementation.

1 Introduction

Deep convolutional neural networks (CNNs) have achieved remarkable advancement in a variety of computer vision tasks, such as image classification ([alexnet:krizhevsky:NIPS12, vgg:simonyan:ICLR15, resnet:he:CVPR16]), object detection ([rcnn:girshick:CVPR14, mask-rcnn:he:ICCV17]) and semantic segmentation ([fcn:long:CVPR15]). Despite the rapid development of CNNs, forest methods such as random forests [randomforest:breiman:ML01] or gradient boosting decision trees (GBDTs) [gbdt:friedman:2001] remain the dominant choice for vectorized inputs and are widely used in many real-world applications [kaggleresults:17]. Training and prediction with these models are computationally expensive for large problems. More importantly, such forest methods are mostly combinatorial rather than differentiable and they lack the capability of representation learning. In contrast, CNNs integrate representation learning and classifier learning in an end-to-end fashion, enjoying pervasive software (e.g., deep learning frameworks) and hardware (e.g., GPUs) support. Hence, one question arises: Can we design a forest method with both the capability of end-to-end representation learning and the support of existing software and hardware for deep learning?

The opposite perspective is to examine CNNs themselves. By stacking layers of convolution and non-linearity, CNNs effectively learn features from low level to high level and produce discriminative representations. As a standard module in deep CNN architectures, global average pooling (GAP) summarizes linear statistics of the last convolution layer. Recently, many higher-order pooling (HOP) methods (e.g., [bcnn:lin:ICCV15]) have been proposed to replace GAP in deep CNNs and to learn higher-order, non-linear feature representations, achieving impressive recognition accuracies. However, these HOP methods suffer from high computational cost because they need to compute covariance information over very high dimensional matrices. Therefore, another question is: Can we add non-linearity to the linear GAP to achieve both good accuracy and high efficiency?

In this paper, we take a step towards addressing these two problems jointly. We propose a Neural Forest Learning (NFL) model which is a deep learning based random-forest-like method. On one hand, it learns non-linear decision tree representations using both randomness and existing CNN layers, which enjoys the benefits of end-to-end, data-driven representation learning, as well as pervasive support from deep learning software and hardware platforms. Moreover, NFL handles vectorized inputs well and achieves both higher accuracy and faster inference speed than random forests and GBDTs, which are attractive for many real-world pattern recognition tasks.

On the other hand, NFL can be seamlessly installed after the convolution layers and a GAP layer at the end of a CNN for image recognition, where it non-linearly transforms the output of GAP. NFL achieves higher accuracy than standard GAP and is more efficient than HOP methods, with negligible increase in parameters, FLOPs and real running time. Furthermore, NFL can be installed across all layers in a CNN when integrated into Squeeze-and-Excitation (SE) networks [senet:hujie:arxiv], and it achieves comparable or better accuracy than SENet with fewer parameters and FLOPs.

Experimental results confirm the effectiveness of NFL. We achieve superior performance on 7 machine learning datasets when compared to random forests and GBDTs. On the fine-grained benchmarks CUB-200-2011 [cub200], FGVC-aircraft [aircrafts] and Stanford Cars [cars], by combining NFL we achieve over 5.7%, 6.9% and 7.8% gains for VGG-16, respectively, with negligible increase in parameters, FLOPs and real running time. On ImageNet ILSVRC-12 [ILSVRC2012:russakovsky:IJCV15], integration of NFL into ResNet-18 achieves top-1/top-5 errors of 28.32%/9.77%, which outperforms ResNet-18 by 1.92% in top-1 error with negligible extra cost.

2 Related Work

2.1 Forest Learning

Forest learning is a powerful learning paradigm which often uses decision trees as its base learners. Bagging and boosting, for instance, are the driving forces of random forests [randomforest:breiman:ML01] and GBDTs [gbdt:friedman:2001], respectively. They have become the method of choice for many industrial applications and data science projects, ranging from predicting clicks on ads [clickonads:he:ADKDD14] to numerous data science competitions on Kaggle (https://www.kaggle.com) and beyond. Note that the inputs to such models are vectors rather than images, which are not suitable for CNNs to process directly. Very recently, ThunderGBM [thundergbm:wen:19] provided GPU-based software to improve the efficiency of random forests and GBDTs, especially for large and high dimensional problems. Nevertheless, such systems are designed for specific algorithms and hardware. With the rapid development of deep learning, there have also been deep forest methods. Recently, [gcforest:zhou:IJCAI17] proposed gcForest, a deep forest ensemble with a cascade structure. mGBDT [mgbdt:feng:NIPS18] learns hierarchical distributed representations by stacking several layers of regression GBDTs as its building block. In contrast, our method integrates the forest with end-to-end, data-driven representation learning capabilities under the support of existing deep learning software and hardware platforms. NDF [neural-decision-forests:kontschieder:ICCV15] combines a single deep CNN with a random forest for image classification, where outputs of the top CNN layer are considered as nodes of the decision tree and a prediction loss is computed at each split node of the tree. Although both NDF and our method are bagging methods based on deep learning, our work differs as follows: (i) We implement the bagging process in a novel and simple way, which will be introduced in the next section. (ii) Our method is light-weight and more easily integrated into existing deep learning frameworks.

2.2 Non-linear representations in CNNs

Statistics higher than first-order have been successfully used in both classic and deep-learning-based classification scenarios. The Vector of Locally Aggregated Descriptors (VLAD) [vlad:jegou:CVPR10] and Fisher Vectors (FV) [fishervector:perronnin:ECCV10] use non-linear representations based on hand-crafted features (e.g., SIFT [sift:lowe:ICCV99]). By replacing hand-crafted features with features extracted from CNNs pretrained on ImageNet [ILSVRC2012:russakovsky:IJCV15], these models achieve state-of-the-art results on many recognition tasks [filterbank:compoi:CVPR15]. In these designs, image representation and classifier training are not jointly optimized and end-to-end training has not been fully studied. [bcnn:lin:ICCV15] proposes a bilinear CNN (B-CNN) that aggregates the outer products of convolutional features from two networks and allows end-to-end training for fine-grained visual classification. [isqrt:li:CVPR2018] proposes an iterative matrix square root normalization (iSQRT) method for fast training of global covariance pooling networks. These works have shown that higher-order, non-linear feature representations based on convolution outputs achieve impressive improvements over classic linear representations. However, they suffer from expensive computational overhead because they depend heavily on eigendecomposition or singular value decomposition of very high dimensional covariance matrices. Contrary to previous higher-order methods, our NFL learns non-linear feature representations with negligible increase in parameters, FLOPs and real running time while achieving higher accuracy.

3 Neural Forest Learning

In this section, we propose the NFL model, which mainly consists of random permutations and group convolutions. Furthermore, we show that it is in effect an ensemble of one-level trees, hence the name Neural Forest Learning (NFL).

3.1 Network architecture

The following notations are used in the rest of this paper. We use x to represent a d-dimensional feature vector and x_i to represent the i-th feature (1 ≤ i ≤ d) of x. We denote the expansion rate in depth as r, the expansion height as h, the expansion width as w, and the number of channels per group in a group convolution as C_g.

Figure 1: NFL architecture with one group convolution layer.

Our goal is to build a neural forest which combines the advantages of both forest learning and deep learning. We propose a novel NFL architecture to achieve this goal, as shown in Figure 1. For a d-dimensional feature vector x, we first generate N random permutations π_1, ..., π_N of {1, ..., d}, where N = r·h·w. Then we obtain a set of randomly permuted vectors x^{(1)}, ..., x^{(N)} from x correspondingly:

x^{(i)} = (x_{π_i(1)}, x_{π_i(2)}, ..., x_{π_i(d)}),   i = 1, ..., N.   (1)

Then, we concatenate these N d-dimensional feature vectors into one vector of N·d dimensions and reshape it into an order-3 tensor X of size h × w × C, where C = r·d. We denote the element at the location of the j-th row, k-th column and c-th channel in X as X_{j,k,c}. X_{j,k,c} is in fact generated by the i-th random permutation, where

i = (⌈c/d⌉ - 1)·h·w + (j - 1)·w + k.   (2)

Hence, X_{j,k,c} corresponds to the t-th element of x^{(i)} and we have

X_{j,k,c} = x^{(i)}_t = x_{π_i(t)},   (3)

where 1 ≤ j ≤ h, 1 ≤ k ≤ w, t = ((c - 1) mod d) + 1, and i is calculated by Equation (2). Then, we send the tensor X into a group convolution layer with kernel size k_h × k_w, output channel number r·d and group number r·d / C_g, without padding, obtaining a new order-3 tensor Y of size (h - k_h + 1) × (w - k_w + 1) × r·d. We can directly use only one group convolution layer with k_h = h and k_w = w, thus obtaining Y of size 1 × 1 × r·d, as is done in Figure 1. Alternatively, we can stack multiple group convolution layers with smaller kernels (e.g., 3 × 3) to make our forest deeper. Then, we apply ReLU non-linearity upon Y and obtain a tensor Z of size 1 × 1 × r·d. Finally, we send Z into fully connected layers plus a softmax layer for classification.
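To make the pipeline above concrete, the following is a minimal PyTorch sketch of the single-group-convolution-layer variant in Figure 1. The class name NFLBlock and all variable names are ours; this is an illustrative re-implementation of the description above, not the authors' released code (see the NRS_pytorch repository for the official implementation).

```python
import torch
import torch.nn as nn


class NFLBlock(nn.Module):
    """Random permutations -> order-3 tensor -> group convolution -> ReLU.

    Notation follows Sec. 3.1: expansion rate r, expansion height/width s
    (h = w = s), and C_g channels per group in the group convolution.
    """

    def __init__(self, d, r=1, s=3, c_per_group=1):
        super().__init__()
        self.d, self.r, self.s = d, r, s
        n_perm = r * s * s  # number of random permutations N = r * h * w
        # Fixed random permutations of {0, ..., d-1}, stored as an index buffer.
        perms = torch.stack([torch.randperm(d) for _ in range(n_perm)])
        self.register_buffer("perms", perms)
        channels = r * d
        assert channels % c_per_group == 0
        # One group convolution whose s x s kernel covers the whole spatial
        # extent (no padding), so its output is 1 x 1 x (r*d).
        self.conv = nn.Conv2d(channels, channels, kernel_size=s,
                              groups=channels // c_per_group, bias=True)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):  # x: (B, d)
        b = x.size(0)
        # Gather the N permuted copies of x: (B, N, d), cf. Eq. (1).
        xp = x[:, self.perms.reshape(-1)].reshape(b, -1, self.d)
        # Reshape into the order-3 tensor of size s x s with r*d channels,
        # following the index mapping of Eqs. (2)-(3).
        xp = xp.reshape(b, self.r, self.s, self.s, self.d)
        xp = xp.permute(0, 1, 4, 2, 3).reshape(b, self.r * self.d, self.s, self.s)
        y = self.relu(self.conv(xp))  # (B, r*d, 1, 1); cf. Sec. 3.2
        return y.flatten(1)           # (B, r*d), sent to fc + softmax layers


if __name__ == "__main__":
    # Example: a 512-dim vector expanded with r = 2, s = 3, C_g = 64.
    block = NFLBlock(d=512, r=2, s=3, c_per_group=64)
    print(block(torch.randn(8, 512)).shape)  # torch.Size([8, 1024])
```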

3.2 Neural forest learning via CNN implementation

The d-dimensional input feature vector can be either a hand-crafted feature vector in traditional machine learning or pattern recognition tasks, or a feature representation generated by CNNs (e.g., the output of a GAP layer). It is then transformed into the tensor X by random permutations.

Figure 2: Group convolution makes a forest. Each square of a different color corresponds to a different feature of the 5-dimensional input feature vector in Figure 1; e.g., all red squares correspond to the same input feature.

X includes a set of 2-D feature maps X_{:,:,c} (1 ≤ c ≤ r·d). X_{:,:,c}, of size h × w, is the feature map of the corresponding channel (the c-th channel). Each feature map X_{:,:,c} consists of h·w features, which are randomly sampled from the d original features. Then, each group convolution filter, which randomly chooses features, together with the subsequent ReLU layer, which acts upon an attribute (i.e., a linear combination of these selected features), can be considered as a one-level decision tree. We take a group convolution layer with C_g = 1 (i.e., depthwise convolution) as an example to illustrate this relationship, as shown in Figure 2. We use W^{(c)} to denote the c-th depthwise convolution filter (with bias b_c), X_{:,:,c} to denote the c-th channel of X and Y_c to denote the c-th channel of Y (1 ≤ c ≤ r·d). Then from Equation (3) we have

Y_c = Σ_{j=1}^{h} Σ_{k=1}^{w} W^{(c)}_{j,k}·X_{j,k,c} + b_c = Σ_{j=1}^{h} Σ_{k=1}^{w} W^{(c)}_{j,k}·x_{π_i(t)} + b_c.   (4)

Let σ denote the ReLU function; then Z_c can be computed as

Z_c = σ(Y_c) = max(0, Y_c).   (5)

From Equation (5) and Figure 2 we can see that each convolution filter plus the subsequent ReLU corresponds to a tree which outputs a linear combination of features and a decision based on it. Hence, all convolution filters form a forest of r·d different one-level trees. In fact, if we further use h = w = 1, each base learner reduces to a decision stump which learns from a single feature. In conclusion, the random permutation operation acts as resampling and the group convolution acts as aggregation. These operations in effect form a feature bagging process where each base learner learns from a random subset of the input features.

Although NFL is not strictly a forest because it does not match traditional trees or forest models precisely, we call it neural forest learning because NFL and forest models are similar in their motivations.

Also, when we increase r, the number of channels r·d gets larger and we get more group convolution filters. Hence, from Equation (4), more trees are integrated and our forest gets larger correspondingly. Furthermore, we can increase h and w to explicitly increase the number of features utilized in each tree, thus increasing the capacity of each base learner. Finally, by stacking more group convolution layers, we can make our forest deeper. In all our experiments, we set h = w for simplicity, so we denote both h and w as the expansion size s in the rest of this paper. We conduct ablation studies on r, C_g and s in Sec. 4.3.

4 Experimental Results

We will empirically study the benefits of our NFL method. On one hand, for vectorized inputs, we compare our method with other competitive forest methods on 6 machine learning classification datasets: satimage [satimage:hsu:2002], gisette [gisette:guyon:NIPS05], mnist [mnist:lecun:1998], letter [satimage:hsu:2002], usps [usps:hull:1994] and yeast [yeast:elisseeff:NIPS02], as well as 1 multivariate regression dataset, sarcos [sarcos:vijayakumar:ICML00]. On the other hand, NFL can be integrated into CNNs for improving non-linear capability either at the end of the network or across all layers in the network. We conduct experiments on CIFAR-10 [cifar], CIFAR-100 [cifar], fine-grained visual categorization benchmarks and the large-scale ImageNet ILSVRC-12 [ILSVRC2012:russakovsky:IJCV15] task with five widely used deep models: MobileNetV2 [mobilenetv2:sabdker:CVPR18], VGG [vgg:simonyan:ICLR15], ResNet [resnet:he:CVPR16], Inception-v3 [inceptionv3:szegedy:cvpr16] and SENet [senet:hujie:arxiv]. All our experiments were conducted using PyTorch on Tesla M40 GPUs.

4.1 Datasets

For machine learning classification and regression datasets, a brief description of them including the train-test split, the number of categories and feature dimensions is given in the appendix. The CIFAR-10 [cifar] consists of 50000 training images and 10000 test images in 10 classes and the CIFAR-100 [cifar] is just like the CIFAR-10, except it has 100 classes containing 600 images each. For fine-grained categorization, we use three popular fine-grained benchmarks, i.e., CUB-200-2011 (Birds) [cub200], FGVC-aircraft (Aircrafts) [aircrafts] and Stanford Cars (Cars) [cars]. The Birds dataset contains 11788 images from 200 species, with large intra-class variation but small inter-class variation. The Aircrafts dataset includes 100 aircraft classes and a total of 10000 images with small background noise but higher inter-class similarity. The Cars dataset consists of 16185 images from 196 classes. For large-scale image classification, we adopt the ImageNet ILSVRC-12 dataset [ILSVRC2012:russakovsky:IJCV15] with 1000 object categories. The dataset contains 1.28M images for training, 50K images for validation and 100K images for testing (without published labels). As in [resnet:he:CVPR16], we report the results on the validation set.

4.2 Machine learning datasets

We compare NFL with forest methods, e.g., random forests and GBDTs in terms of accuracy, training/testing time and model size. Furthermore, because we use NFL with 2 fully connected (fc) layers on these machine learning classification and regression datasets, we also compare it with multi-layer perceptrons (MLP).

Implementation details:

We build NFL with 1 group convolution layer and 2 fc layers with batch normalization on all datasets. We construct the MLP with 2 fc layers with batch normalization in the same way. We hold out part of the training data for validation to determine the total number of epochs separately for each dataset. We train all networks for 20~50 epochs, using Adam as the optimizer and initializing the learning rate to 0.0001. In Table 1, we set C_g to 1 and s to 3 for all datasets for simplicity. Considering the different feature dimensions of the datasets, we set r to 20, 10, 16, 100, 30, 20 and 40 for satimage, gisette, mnist, letter, usps, yeast and sarcos, respectively. For MLP, random forests and GBDTs, we carefully tune the parameters through 5-fold cross-validation on the training set and choose the best parameters for each dataset. We report the mean accuracy and standard deviation of 5 trials for all datasets except yeast, which is evaluated by 10-fold cross-validation. We use different experimental settings for the algorithms in Table 2 for a fair model size, speed and accuracy comparison; the experimental details are included in the appendix.
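As a rough sketch of this setup (the hidden width of 512, the function names and the hypothetical import are ours; NFLBlock is the class sketched in Sec. 3.1), the model and optimizer could be put together as follows:

```python
import torch
import torch.nn as nn

from nfl_sketch import NFLBlock  # hypothetical module containing the Sec. 3.1 sketch


def build_nfl_classifier(d, num_classes, r, s=3, c_per_group=1, hidden=512):
    """NFL head for vectorized inputs: 1 group conv layer + 2 fc layers with BN."""
    return nn.Sequential(
        NFLBlock(d, r=r, s=s, c_per_group=c_per_group),  # (B, d) -> (B, r*d)
        nn.Linear(r * d, hidden),
        nn.BatchNorm1d(hidden),
        nn.ReLU(inplace=True),
        nn.Linear(hidden, num_classes),
    )


def train(model, loader, epochs=30, lr=1e-4, device="cuda"):
    """Adam with learning rate 1e-4, as described above; 20-50 epochs per dataset."""
    model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:  # x: (B, d) float features, y: (B,) long labels
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            criterion(model(x), y).backward()
            opt.step()
    return model
```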

Datasets | NFL (ours) | MLP | NDF [neural-decision-forests:kontschieder:ICCV15] | Random Forests | GBDTs
satimage | 91.52±0.31 | 90.01±0.31 | 89.71±0.31 | 91.01±0.35 | 89.26±0.04
gisette | 98.26±0.05 | 98.08±0.12 | 97.24±0.29 | 96.98±0.13 | 97.18±0.04
mnist | 98.57±0.03 | 98.17±0.07 | 97.29±0.12 | 96.96±0.08 | 96.56±0.07
letter | 97.85±0.10 | 97.23±0.17 | 97.08±0.17 | 96.14±0.10 | 94.66±0.01
usps | 95.71±0.17 | 95.13±0.26 | 94.99±0.24 | 93.80±0.19 | 92.83±0.03
yeast | 62.81±2.61 | 60.57±3.45 | 60.31±3.37 | 62.81±3.47 | 60.71±2.35
 | NFL (ours) | MLP | ANT [ant:tanno:ICML19] | Random Forests | GBDTs
sarcos | 1.23±0.05 | 2.36±0.16 | 1.38* | 2.37±0.01 | 1.44±0.01
Table 1: Accuracy (%) on 6 machine learning classification datasets and MSE on the regression dataset sarcos. We report the average accuracy and standard deviation of 5 trials. NFL and MLP results are from the last epoch. * denotes that [ant:tanno:ICML19] did not report the standard deviation.
Dataset | Method | Model Size (MB) | Inference Time (s) | Training Time (s) | Accuracy (%)
gisette | NFL (ours) | 35 | 0.17 | 62.51 | 97.82
gisette | Random Forests | 3.6 | 0.12 | 0.67 | 96.70
gisette | ThunderGBM RFs | 0.6 | 2.77 | 24.96 | 93.60
gisette | GBDTs | 0.2 | 0.01 | 181.14 | 96.70
gisette | ThunderGBM GBDTs | 0.6 | 2.04 | 18.78 | 91.79
mnist | NFL (ours) | 9.6 | 0.17 | 194.45 | 98.42
mnist | Random Forests | 137 | 0.31 | 2.09 | 96.85
mnist | ThunderGBM RFs | 6.1 | 0.76 | 19.43 | 93.16
mnist | GBDTs | 1.7 | 0.42 | 2877.78 | 94.87
mnist | ThunderGBM GBDTs | 6.1 | 0.99 | 23.66 | 93.78
letter | NFL (ours) | 3.4 | 0.19 | 43.49 | 97.78
letter | Random Forests | 106 | 0.39 | 0.38 | 96.12
letter | ThunderGBM RFs | 15.9 | 0.35 | 16.03 | 93.29
letter | GBDTs | 4.5 | 0.27 | 50.88 | 92.04
letter | ThunderGBM GBDTs | 15.9 | 0.33 | 15.20 | 92.99
Table 2: Model size (MB), total inference/training time (s) and accuracy (%) comparison. The number of trees is set to 100 for random forests, GBDTs, ThunderGBM random forests (RFs) and ThunderGBM GBDTs for faster speed and smaller size.

Comparison among different algorithms: Table 1 shows that NFL achieves the highest accuracy in all classification datasets and the lowest mean square error (MSE) in the regression dataset compared with MLP, random forests, GBDTs, NDF [neural-decision-forests:kontschieder:ICCV15] and ANT [ant:tanno:ICML19]. Moreover, it is worth noting that although our method introduces randomness due to random permutations, it achieves a low standard deviation and is very robust, even more stable than MLP.

Using the 2 largest datasets and the highest-dimensional dataset among these 6 classification datasets, we compare the speed and size of NFL with random forests and GBDTs. Table 2 shows that although GPU-based ThunderGBM can greatly reduce the training time, especially for GBDTs, the inference process does not seem to benefit. Compared to these forest methods, NFL achieves the highest accuracy and the fastest inference speed on mnist and letter, and also the smallest model size on letter. NFL achieves the highest accuracy on gisette but its model size is larger than those of the other forest methods, indicating that NFL may be unfriendly, in terms of model size, to datasets with non-sparse high dimensions. Note that although we use smaller r values in Table 2 than in the experiments reported in Table 1, NFL's accuracies in Table 2 are still similar to those in Table 1. The effect of NFL's hyper-parameters such as r is studied in the ablative experiments below.

4.3 Ablation studies

We choose the 4 largest datasets among those 6, i.e., gisette, mnist, letter and usps, for ablative experiments. Ablation studies include three parts: expansion rate, number of channels per group and expansion height/width.

Expansion rate. As is well known for random forests and GBDTs, we can increase the number of decision trees to boost performance. Similarly, we can increase r in NFL to make our forest larger, and we conduct ablation studies on r. Here we set C_g to 1 for all experiments and other settings remain the same as in the previous subsection. The results in Figure 3(a) show that as r grows, the average accuracy increases and the standard deviation becomes smaller. This indicates that as r increases, more trees are integrated into our model, the performance becomes better and the model becomes more robust.

(a) Classification results using different expansion rates r.
(b) Classification results using different numbers of channels per group C_g.
(c) Classification results using different expansion sizes s.
Figure 3: Ablation studies of r, C_g and s on gisette, mnist and letter (from left to right in each sub-figure). We plot the average accuracy and standard deviation of 5 trials at each point.

Number of channels per group. We can also increase C_g to increase the number of features utilized in each tree. Here we set r to 1 for all experiments and other settings remain the same. The results in Figure 3(b) show that as C_g grows, the test accuracy increases at first and then becomes stable or decreases slightly. This means that as C_g increases, the capacity of each tree, and hence of the whole model, also increases, so the accuracy improves at first. However, the model is more likely to overfit with large C_g and model capacity, and we can see that the performance does not continue to improve.

Expansion height/width. We set r to 1, 5 and 50 for gisette, mnist and letter, respectively, and we set C_g to 1 for all these datasets. The results in Figure 3(c) show that when s is very small, i.e., when s equals 1, the result is poor, especially for gisette. As s grows, the result becomes better, and it does not continue to improve once s grows beyond 3. Therefore, 2 or 3 is a good choice for s in terms of accuracy and efficiency.

4.4 Fine-grained Visual Categorization

We then evaluate NFL in CNN architectures for image recognition. NFL is used after GAP to non-linearly transform the GAP output vector at the end of the network. First, this section evaluates NFL with ResNet-50 [resnet:he:CVPR16] and VGG-16 [vgg:simonyan:ICLR15] on the Birds, Aircrafts and Cars datasets. We compare our method with baseline models and one representative higher-order pooling method.

Implementation details: For fair comparisons, we follow [bcnn:lin:ICCV15] for the experimental setting and evaluation protocol. We crop patches as input images for all datasets. For baseline models, we replace the 1000-way softmax layer of ResNet-50 pretrained on ImageNet ILSVRC-12 [ILSVRC2012:russakovsky:IJCV15] with a K-way softmax layer for finetuning, where K is the number of classes in the fine-grained dataset. We replace all fc layers of pretrained VGG-16 with a GAP layer plus a K-way softmax layer to fit the input. We finetune all the networks using SGD with a batch size of 32, a momentum of 0.9 and a weight decay of 0.0001. We train the networks for 65 epochs, initializing the learning rate to 0.002, which is divided by 10 every 20 epochs. For NFL models, we replace the 1000-way softmax layer of pretrained ResNet-50 with our NFL module, specifically random permutations, 1 group convolution layer and a K-way softmax layer; the resulting model is called ResNet-50+NFL. Here we set r, C_g and s to 2, 64 and 3, respectively. Moreover, we also use the pretrained VGG-16 as our backbone network to construct VGG-16+NFL in a similar way, except that we set r to 6 considering its different feature dimension. We finetune our models under the same setting as the baseline models. The models integrating NFL are trained end-to-end, as are the baseline models.
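A hedged sketch of this construction, using torchvision's ResNet-50 and the NFLBlock class sketched in Sec. 3.1 (the function name and the hypothetical import are ours):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

from nfl_sketch import NFLBlock  # hypothetical module containing the Sec. 3.1 sketch


def resnet50_nfl(num_classes, r=2, s=3, c_per_group=64):
    """Replace the 1000-way classifier of a pretrained ResNet-50 with NFL + K-way layer."""
    model = resnet50(pretrained=True)
    d = model.fc.in_features  # 2048-dim GAP output
    model.fc = nn.Sequential(
        NFLBlock(d, r=r, s=s, c_per_group=c_per_group),  # (B, 2048) -> (B, 4096)
        nn.Linear(r * d, num_classes),                   # K-way layer (softmax is in the loss)
    )
    return model


# Finetuning settings stated above: SGD, batch size 32, momentum 0.9, weight
# decay 1e-4, lr 0.002 divided by 10 every 20 epochs, 65 epochs in total.
model = resnet50_nfl(num_classes=200)  # e.g. CUB-200-2011
optimizer = torch.optim.SGD(model.parameters(), lr=0.002, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
```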

Comparison among different algorithms: Table 3 shows that our NFL method achieves significant improvements compared to baseline models, with negligible increase in parameters, FLOPs and real running time. It is worth mentioning that VGG-16+NFL achieves 5.7%, 6.9% and 7.8% improvements over the baseline model on Birds, Aircrafts and Cars, respectively. Besides, our NFL performs consistently better than B-CNN [bcnn:lin:ICCV15] on all 3 datasets under the VGG-16 architecture despite using much fewer parameters, FLOPs and real running time. Furthermore, the learning curves in Figure 4 show that NFL can greatly accelerate convergence and achieves better results than the baseline methods in both accuracy and convergence speed (the red curves vs. the green curves). This indicates that NFL can effectively learn non-linear feature representations and achieves good results on fine-grained recognition.

Method | # Dim | # Params | # FLOPs | Inference Time (CPU / GPU) | Accuracy (Birds / Aircrafts / Cars)
ResNet-50 Baseline | 2K | 23.92M | 16.53G | 540.48 / 28.16 | 84.0 / 88.6 / 89.2
ResNet-50+NFL (ours) | 4K | 26.70M | 16.53G | 541.57 / 28.38 | 86.7 / 92.8 / 93.4
VGG-16 Baseline | 0.5K | 15.34M | 61.44G | 644.18 / 28.24 | 78.7 / 82.7 / 83.7
B-CNN ([bcnn:lin:ICCV15]) | 262K | 67.14M | 61.75G | 856.46 / 31.90 | 84.0 / 84.1 / 90.6
VGG-16+NFL (ours) | 3K | 17.11M | 61.44G | 645.19 / 28.58 | 84.4 / 89.6 / 91.5
Table 3: Comparison of representation dimensions, parameters, FLOPs, inference time per image (ms) and accuracy (%) on fine-grained benchmarks. The inference time is recorded with batch size of 1 on both CPU and GPU.
(a) Learning curves of Aircrafts
(b) Learning curves of Cars
Figure 4: Loss and accuracy learning curves. Both ResNet-50 and ResNet-50+NFL are trained under the same setting.

4.5 ImageNet ILSVRC-12

We then evaluate NFL on the large-scale ImageNet ILSVRC-12 task and also NFL is used after GAP at the end of the network.

Implementation details: We train a ResNet-50+NFL model from scratch on ImageNet, constructed as described in the previous subsection except that the last layer is a 1000-way softmax layer. The images are resized with shorter side = 256, then a crop is randomly sampled from the resized image with horizontal flip and mean-std normalization. The preprocessed images are then fed into the ResNet-50+NFL model. We train ResNet-50+NFL using SGD with a batch size of 256, a momentum of 0.9 and a weight decay of 1e-4 for 100 epochs. The initial learning rate starts from 0.1 and is divided by 10 every 30 epochs. A ResNet-18+NFL model is constructed and trained in a similar way, except that we set r and C_g to 4 and 32, respectively. For MobileNetV2 [mobilenetv2:sabdker:CVPR18], we set r and C_g to 1 and 32, respectively. We train the network using SGD with a batch size of 256, a momentum of 0.9 and a weight decay of 4e-5 for 150 epochs. We initialize the learning rate to 0.05 and use cosine learning rate decay.
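For reference, the stated optimization settings written out as PyTorch optimizer/scheduler objects (a sketch under the hyperparameters above; `net` is a stand-in for the corresponding NFL-augmented network):

```python
import torch
import torch.nn as nn

net = nn.Linear(8, 8)  # stand-in; replace with ResNet-50+NFL, ResNet-18+NFL or MobileNetV2+NFL

# ResNet-50+NFL / ResNet-18+NFL: SGD, batch size 256, lr 0.1 divided by 10
# every 30 epochs, momentum 0.9, weight decay 1e-4, 100 epochs.
opt_resnet = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
sched_resnet = torch.optim.lr_scheduler.StepLR(opt_resnet, step_size=30, gamma=0.1)

# MobileNetV2+NFL: SGD, lr 0.05 with cosine decay over 150 epochs, weight decay 4e-5.
opt_mbv2 = torch.optim.SGD(net.parameters(), lr=0.05, momentum=0.9, weight_decay=4e-5)
sched_mbv2 = torch.optim.lr_scheduler.CosineAnnealingLR(opt_mbv2, T_max=150)
```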

Comparison with baseline methods: Table 4 shows that NFL produces 0.70%, 1.92% and 0.73% lower top-1 error (1-crop) than the original MobileNetV2, ResNet-18 and ResNet-50 models, respectively, with negligible increase in parameters and FLOPs. It is worth noting that if we use larger r, C_g and s, we can get better results at the cost of more parameters and FLOPs; these results are included in the appendix. This indicates that our NFL method is also effective for large-scale recognition, achieving better performance consistently under various architectures.

Method | # Params | # FLOPs | Top-1 / Top-5 err.
Original ResNet-50 | 25.56M | 4.14G | 23.85 / 7.13
ResNet-50+NFL (ours) | 29.98M | 4.14G | 23.12 / 6.62
Original ResNet-18 | 11.69M | 1.82G | 30.24 / 10.92
ResNet-18+NFL (ours) | 13.82M | 1.83G | 28.32 / 9.77
Original MobileNetV2 | 3.50M | 0.33G | 28.12 / 9.71
MobileNetV2+NFL (ours) | 3.88M | 0.33G | 27.42 / 9.39
Table 4: Error rate (%, 1-crop prediction) comparison on ImageNet ILSVRC-12 under different architectures.
Method | CIFAR-10 (# Params / # FLOPs / Accuracy) | CIFAR-100 (# Params / # FLOPs / Accuracy)
original ResNet-20 | 0.27M / 41.62M / 92.75 | 0.28M / 41.63M / 69.33
SE-ResNet-20 | 0.27M / 41.71M / 93.28 | 0.28M / 41.72M / 70.35
SE-ResNet-20+NFL (ours) | 0.28M / 41.71M / 93.73 | 0.28M / 41.72M / 70.38
original ResNet-50 | 23.52M / 1311.59M / 95.78 | 23.71M / 1311.96M / 80.41
SE-ResNet-50 | 26.04M / 1318.42M / 95.59 | 26.22M / 1318.79M / 81.57
SE-ResNet-50+NFL (ours) | 23.67M / 1313.56M / 96.05 | 23.86M / 1313.93M / 81.48
original Inception-v3 | 22.13M / 3411.04M / 94.83 | 22.32M / 3411.41M / 79.62
SE-Inception-v3 | 23.79M / 3416.04M / 95.60 | 23.97M / 3416.41M / 80.44
SE-Inception-v3+NFL (ours) | 22.23M / 3412.85M / 95.67 | 22.42M / 3413.22M / 80.54
Table 5: Comparison of parameters, FLOPs and accuracy (%) on CIFAR-10 and CIFAR-100 under various architectures.
Method | # Params | # FLOPs | Inference Time (CPU / GPU) | Top-1/5 err. (reported in [senet:hujie:arxiv]) | Top-1/5 err. (our re-implementation)
Original ResNet-50 | 25.56M | 4.14G | 465.39 / 21.07 | 24.80 / 7.48 | 23.85 / 7.13
SE-ResNet-50 | 28.07M | 4.15G | 581.32 / 35.82 | 23.29 / 6.62 | 22.68 / 6.30
SE-ResNet-50+NFL (ours) | 25.71M | 4.14G | 523.97 / 32.80 | - / - | 22.89 / 6.57
Table 6: Comparison of parameters, FLOPs, inference time per image (ms) and error rate (%, 1-crop prediction) on ImageNet ILSVRC-12 under SENet architectures. The inference time is recorded with batch size of 1 on both CPU and GPU.

4.6 NFL across all layers

Motivated by the Squeeze-and-Excitation (SE) method [senet:hujie:arxiv], we use NFL to replace all the SE modules. We conduct experiments on CIFAR-10 [cifar], CIFAR-100 [cifar] and the ImageNet ILSVRC-12 task. We compare our method with baseline methods and SENet under various architectures.

Implementation details: We replace the 2 fc layers in each SE block with NFL, specifically random permutations and 1 group convolution layer followed by sigmoid activation, as shown in Figure 5; the resulting model is called SENet+NFL. In all our experiments in this section, we set r and C_g to 1 and s to 3 for SENet+NFL, and the reduction ratio is set to 16 for SENet as is done in [senet:hujie:arxiv]. For CIFAR-10 and CIFAR-100, we use ResNet-20 [resnet:he:CVPR16], ResNet-50 [resnet:he:CVPR16] and Inception-v3 [inceptionv3:szegedy:cvpr16] as the backbone networks. Mean subtraction, horizontal random flip and random crops after padding 4 pixels on each side were performed as data preprocessing and augmentation. We train all networks from scratch using SGD with 0.9 momentum, a weight decay of 5e-4 and a batch size of 128 for 350 epochs. The initial learning rate starts from 0.1 with cosine learning rate decay. For ImageNet, we follow the same setting as in [senet:hujie:arxiv]. The images are resized with shorter side = 256, then a crop is randomly sampled from the resized image with horizontal flip and mean-std normalization. We use SGD with a momentum of 0.9, a weight decay of 1e-4 and a batch size of 256; the initial learning rate is set to 0.15 and decreased by a factor of 10 every 30 epochs. Models are trained for 100 epochs from scratch.
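A minimal sketch of the SE+NFL block in Figure 5 under the settings stated here (r = 1, C_g = 1, s = 3); the class name and the standalone expansion code are ours:

```python
import torch
import torch.nn as nn


class SENFLBlock(nn.Module):
    """SE-style gate where the two fc layers are replaced by random
    permutations + one group convolution + sigmoid (r = 1, C_g = 1, s = 3)."""

    def __init__(self, channels, s=3):
        super().__init__()
        self.s = s
        perms = torch.stack([torch.randperm(channels) for _ in range(s * s)])  # r = 1
        self.register_buffer("perms", perms)
        self.conv = nn.Conv2d(channels, channels, kernel_size=s,
                              groups=channels, bias=True)  # C_g = 1 -> depthwise

    def forward(self, x):               # x: (B, C, H, W), output of a conv block
        b, c, _, _ = x.shape
        z = x.mean(dim=(2, 3))          # squeeze: global average pooling, (B, C)
        zp = z[:, self.perms.reshape(-1)].reshape(b, self.s * self.s, c)
        zp = zp.permute(0, 2, 1).reshape(b, c, self.s, self.s)   # s x s x C tensor
        gate = torch.sigmoid(self.conv(zp))                      # (B, C, 1, 1)
        return x * gate                 # channel-wise reweighting, as in SENet
```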

Comparison with baseline models: Table 5 shows that under ResNet-20, SENet+NFL achieves the highest accuracy on CIFAR-10 and CIFAR-100 with negligible increase in parameters and FLOPs. For ResNet-50 and Inception-v3 backbone, SENet+NFL achieves comparable or better accuracy than original SENet despite using fewer parameters and FLOPs, further demonstrating the effectiveness of NFL.

Table 6 shows that under ResNet-50, SENet+NFL achieves fewer parameters, FLOPs and real running time than original SENet while maintaining comparable accuracy. SENet+NFL also achieves higher accuracy than the baseline method with negligible increase in parameters and FLOPs. It indicates that NFL can be integrated not only at the end of a CNN as shown in the previous sections but also across all layers in a CNN to learn non-linear mapping effectively.

Figure 5: The schema of the original Residual module (left), the SE-ResNet module (middle) and the SE-ResNet+NFL module (right).

5 Conclusions

We proposed NFL, a deep learning based random-forest-like method. We introduced the feature bagging process into deep learning, with random permutations acting as resampling and group convolutions acting as aggregation, where each base learner learns from a subset of features. NFL handles vectorized inputs well and can be installed into CNNs seamlessly, both at the end of the network and across all layers in the network. On one hand, it enriches forest methods with the capability of end-to-end representation learning as well as pervasive deep learning software and hardware support. On the other hand, it effectively learns non-linear feature representations in CNNs with negligible increase in parameters, FLOPs and real running time. We have confirmed the effectiveness of NFL on standard machine learning datasets, the popular CIFAR datasets, challenging fine-grained benchmarks and the large-scale ImageNet dataset. In the future, we will continue exploring combinations of deep learning and traditional forest learning algorithms to better understand the relation between the different approaches. Furthermore, we will extend NFL to handle datasets with high-dimensional sparse features, as well as small datasets.

References

Appendix A Machine learning datasets

A brief description of the machine learning datasets, including the train-test split, the number of categories and the feature dimensions, is given in Table 7. Note that sarcos is a regression dataset, so the number of categories does not apply to it.

Datasets | # Category | # Training | # Testing | # Dim
satimage | 6 | 4435 | 2000 | 36
gisette | 2 | 6000 | 1000 | 5000
mnist | 10 | 60000 | 10000 | 780
letter | 26 | 15000 | 5000 | 16
usps | 10 | 7291 | 2007 | 256
yeast | 14 | 1500 | 917 | 8
sarcos | - | 44484 | 4449 | 21
Table 7: Attributes of the machine learning datasets. The first 6 datasets are classification datasets and sarcos is a regression dataset.

Appendix B Speed comparison experimental details

For a better trade-off between model size, speed and accuracy, we reduce the number of trees for random forests and GBDTs and the parameter r for NFL correspondingly. In Table 2, we use the same setting as in Table 1 except that we set r to 1, 5 and 50 for NFL on gisette, mnist and letter, respectively, for a better model size, speed and accuracy trade-off. Note that although we use smaller r values in Table 2 than in the experiments reported in Table 1, NFL's accuracies in Table 2 are still similar to those in Table 1 (e.g., 97.85 in Table 1 vs. 97.78 in Table 2 on letter). Correspondingly, we reduce the number of trees to 100 for random forests and GBDTs for faster speed and smaller size. We also use 5-fold cross-validation on the training set to choose the best values for the other parameters. Note that although we reduce the number of trees for random forests from 500 to 100 on mnist, the accuracy drops only slightly (from 96.96% to 96.85%) while the model size is reduced by a factor of 5 (from 680M to 137M). We record the total training time on the training set and the total inference time on the test set in seconds.

Appendix C More results

When using larger expansion sizes, we can use more group convolution layers to make our forest deeper. In the main paper, we set s to 3 and use one group convolution layer of kernel size 3 × 3. Here we set s to 5 and use two group convolution layers of kernel size 3 × 3, and we compare the results under these two settings.

C.1 Fine-grained Visual Categorization

First, we conduct experiments on fine-grained benchmarks for NFL with 2 group convolution layers.

Implementation details: When s = 5, our NFL module consists of random permutations, 2 group convolution layers and a K-way softmax layer (K is the number of classes). All other settings remain the same as described in Sec. 4.4.
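Sketched in PyTorch (the layer arrangement and function name are ours, following the description above; the random-permutation expansion producing the 5 × 5 input tensor is as in Sec. 3.1):

```python
import torch.nn as nn


def nfl_head_two_layers(d, r, c_per_group, num_classes):
    """Deeper NFL head for s = 5: two valid 3x3 group convolutions (5x5 -> 3x3 -> 1x1)."""
    channels = r * d
    groups = channels // c_per_group
    return nn.Sequential(
        # input: (B, r*d, 5, 5) produced by the random-permutation expansion
        nn.Conv2d(channels, channels, kernel_size=3, groups=groups),  # 5x5 -> 3x3
        nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, kernel_size=3, groups=groups),  # 3x3 -> 1x1
        nn.ReLU(inplace=True),
        nn.Flatten(),
        nn.Linear(channels, num_classes),                             # K-way softmax layer
    )
```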

Method | # Params | # FLOPs | Top-1 / Top-5 err.
Original ResNet-50 | 25.56M | 4.14G | 23.85 / 7.13
ResNet-50+NFL (2 layers) | 37.07M | 4.19G | 23.08 / 6.22
Original ResNet-18 | 11.69M | 1.82G | 30.24 / 10.92
ResNet-18+NFL (2 layers) | 20.02M | 1.86G | 29.03 / 10.03
Original MobileNetV2 | 3.50M | 0.33G | 28.12 / 9.71
MobileNetV2+NFL (2 layers) | 7.75M | 0.35G | 27.02 / 8.94
Table 8: Error rate (%, 1-crop prediction) comparison on ImageNet ILSVRC-12 under different architectures.
Method | # Params | # FLOPs | Top-1 / Top-5 err.
Original ResNet-50 | 25.56M | 4.14G | 22.57 / 6.24
ResNet-50+NFL (2 layers) | 37.07M | 4.19G | 21.44 / 5.72
Original ResNet-101 | 44.55M | 7.87G | 21.06 / 5.55
Table 9: Error rate (%, 10-crop prediction) comparison of ResNet-50+NFL with the original PyTorch ResNets on ImageNet ILSVRC-12.

Comparison among different settings: From Table 10 we can see that when we use more group convolution layers, we can get better results on all the 3 fine-grained benchmarks under VGG-16 at the cost of more parameters, FLOPs and real running time.

Method | # Dim | # Params | # FLOPs | Inference Time (CPU / GPU) | Accuracy (Birds / Aircrafts / Cars)
ResNet-50 Baseline | 2K | 23.92M | 16.53G | 540.48 / 28.16 | 84.0 / 88.6 / 89.2
ResNet-50 NFL (1 group conv layer) | 4K | 26.70M | 16.53G | 541.57 / 28.38 | 86.7 / 92.8 / 93.4
ResNet-50 NFL (2 group conv layers) | 4K | 29.07M | 16.55G | 543.04 / 30.77 | 86.6 / 92.8 / 93.1
VGG-16 Baseline | 0.5K | 15.34M | 61.44G | 644.18 / 28.24 | 78.7 / 82.7 / 83.7
B-CNN | 262K | 67.14M | 61.75G | 856.46 / 31.90 | 84.0 / 84.1 / 90.6
VGG-16 NFL (1 group conv layer) | 3K | 17.11M | 61.44G | 645.19 / 28.58 | 84.4 / 89.6 / 91.5
VGG-16 NFL (2 group conv layers) | 3K | 18.89M | 61.46G | 645.72 / 30.13 | 84.8 / 89.7 / 91.6
Table 10: Comparison of representation dimensions, FLOPs, inference time per image (ms) and accuracy (%) on fine-grained benchmarks. The inference time is recorded with batch size of 1 on both CPU and GPU.

C.2 ImageNet ILSVRC-12

Then, we evaluate NFL with 2 group convolution layers on the large-scale ImageNet ILSVRC-12 task.

Implementation details: Due to resource constraints, we finetune a ResNet-50+NFL model on ImageNet, constructed as described in the previous subsection except that the last layer is a 1000-way softmax layer and we set C_g to 128. We finetune ResNet-50+NFL using SGD with a batch size of 256, a momentum of 0.9 and a weight decay of 0.0005 for 30 epochs. The initial learning rate starts from 0.001 and is divided by 10 every 10 epochs. A ResNet-18+NFL model is constructed and finetuned in a similar way, except that we set r and C_g to 8 and 64, respectively. For MobileNetV2, we set r and C_g to 2 and 64, respectively. We finetune this network for 45 epochs using cosine learning rate decay.

Comparison with baseline methods: Table 8 shows that NFL produces 1.10%, 1.21% and 0.77% lower top-1 error (1-crop) than the original MobileNetV2, ResNet-18 and ResNet-50 models, respectively. Table 9 shows that, with 10-crop prediction, our method performs 1.13% better than the original PyTorch ResNet-50 while being comparable to ResNet-101. This indicates that our NFL method is also effective for large-scale recognition, matching the performance of deeper CNNs with a much shallower one. These results show that we can get better results by using more group convolution layers and setting r and C_g larger, at the cost of more parameters and FLOPs.