Progressive Neural Networks for Image Classification

04/25/2018 ∙ by Zhi Zhang, et al.

The inference structures and computational complexity of existing deep neural networks, once trained, are fixed and remain the same for all test images. However, in practice, it is highly desirable to establish a progressive structure for deep neural networks which is able to adapt its inference process and complexity for images with different visual recognition complexity. In this work, we develop a multi-stage progressive structure with integrated confidence analysis and decision policy learning for deep neural networks. This new framework consists of a set of network units to be activated in a sequential manner with progressively increased complexity and visual recognition power. Our extensive experimental results on the CIFAR-10 and ImageNet datasets demonstrate that the proposed progressive deep neural network is able to obtain more than 10 fold complexity scalability while achieving the state-of-the-art performance using a single network model satisfying different complexity-accuracy requirements.




1 Introduction

Recently, large and deep neural networks have demonstrated extraordinary performance on various computer vision and machine learning tasks [1, 2]. We notice that the inference structures, execution procedures, and computational complexity of existing deep neural networks, once trained, are fixed and remain the same for all test images, no matter how much they have been optimized for speed. In this work, our goal is to develop a new progressive framework for deep neural networks such that a single model can be evaluated at different performance levels with different computational complexity requirements. This single-model-variable-complexity property is very important in practice. We recognize that different images have different complexity levels of visual representation and different difficulty levels for visual recognition. For simple images with low visual recognition complexity, we can easily classify the image or recognize the object with simple networks at very low complexity. For example, it is very easy to detect a person standing in front of a clean background or to classify whether the person is male or female. For harder images, we need to extract sophisticated visual features using more layers of network representation to gain sufficient visual discriminative power so that the object can be successfully distinguished from those in other classes.

Figure 1: Left: Validation results of different models, images are sorted by average top-1 accuracies across models. Right: Individual image difficulties across different models.

To validate this observation, we conducted the following experiment. We collected 16 different pre-trained deep neural networks, ranging from the very low-complexity MobileNet [3] to the very high-complexity DenseNet201 [2]. We use these networks to classify the images in the ILSVRC2012 [1] 50k validation set. Fig. 1 (left) shows the classification results. Each row corresponds to a specific network. The horizontal axis represents the index of the test image. A blue line indicates that the image is successfully classified by the network, i.e., the correct label is ranked first among all categories. A magenta line indicates that the correct result is within the top 2 to 5 categories. A yellow line indicates that the correct result is beyond the top 5 results. As summarized in Fig. 1, 38% of the images are always correctly classified by all networks, no matter how simple the network is. These are the so-called simple images with low visual recognition complexity. This suggests that, if we can successfully identify this set of simple images, we can save a large amount of computation by choosing simple networks to analyze them. Furthermore, if we are able to model or predict the visual recognition complexity of the input image and to establish a progressive network, we can adapt the network complexity at run-time according to the visual recognition complexity of the input image, which saves computational resources significantly.
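The kind of analysis described above can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function name `summarize_difficulty` and the toy correctness matrix are hypothetical, and the real experiment uses 16 networks on 50k images.

```python
def summarize_difficulty(correct):
    """correct[m][i] is True if model m classifies image i correctly (top-1)."""
    num_models = len(correct)
    num_images = len(correct[0])
    # Average top-1 accuracy of each image across all models.
    per_image = [sum(correct[m][i] for m in range(num_models)) / num_models
                 for i in range(num_images)]
    # Fraction of "simple" images: those every model classifies correctly.
    always_correct = sum(1 for acc in per_image if acc == 1.0) / num_images
    # Image indices sorted from easiest to hardest, as in Fig. 1 (left).
    order = sorted(range(num_images), key=lambda i: -per_image[i])
    return per_image, always_correct, order


# Hypothetical toy data: 3 models (rows) evaluated on 4 images (columns).
toy = [
    [True, True, False, False],  # small model
    [True, True, True, False],   # medium model
    [True, True, True, True],    # large model
]
per_image, always_correct, order = summarize_difficulty(toy)
print(always_correct)  # 0.5
```

In this toy example half of the images are classified correctly by every model, which plays the role of the 38% figure reported for the real validation set.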

Figure 2: Each prediction result is produced by pre-trained model of different size. Top row: hard example which requires full network inference. Bottom row: easy example that is predicted confidently by even tiny models. Best viewed in color.

Let us look at one more example. Fig. 2 shows two images, a simple Ice Cream image and a hard Siamese Cat image with occlusion. We choose 5 networks with different computational complexity. The most complex network is the DenseNet201 [2] labeled with 100%. The complexity levels of the other four networks are shown in approximate relative percentage in Fig. 2. For example, the inference cost of the first network is about 5% of DenseNet201. From this experiment, we can see that, for the simple Ice Cream image, the confidence score for Ice Cream is much higher than other object categories, even with very simple networks. However, for the hard Siamese Cat image, the score distribution is more uniform for low-complexity networks. As the network becomes more complex and more powerful, the classifier is more and more confident about the result since the score for the correct label is getting much higher than other categories.

These two experiments strongly suggest that it is highly desirable to establish a progressive deep neural network structure which is able to adapt its network inference structure and computational complexity to images with different visual recognition complexity. This progressive structure should be able to scale up its discriminative power and achieve higher recognition capability by activating and executing more analysis units of the deep neural networks and accumulating more visual evidence for more challenging vision analysis tasks or image sets.

The major contributions of this work can be summarized as follows. (1) We have developed a multi-stage progressive structure, called ProgNet, for deep neural networks, with each stage being separately evaluated for output decision and all stages being activated in a sequential manner with progressively increased complexity and visual recognition power. (2) We present different structures for progressive network design and develop a Confidence Analysis and Decision Policy network to learn the classification confidence function for the progressive network and make run-time complexity-accuracy decisions for each input image. (3) Our extensive experimental results demonstrate that the proposed progressive framework for deep neural networks is able to outperform existing state-of-the-art networks or models, from MobileNet to DenseNet, using one single model whose complexity can be adaptively controlled. This progressive framework will provide an important and useful tool for practical deep neural network design and resource allocation in real-time applications.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 provides a conceptual overview of the proposed progressive framework. Detailed design and methods for progressive networks are presented in Section 4. Section 5 presents our confidence analysis and decision policy network. Experimental results are presented in Section 6. Section 7 concludes the paper.

2 Related Work

This work is closely related to complexity control and optimization and to confidence analysis of deep neural networks.

(A) Complexity Optimization of Deep Neural Networks. Deep neural networks often involve high computational complexity. A number of methods have been developed to accelerate the inference speed of deep neural networks or to reduce their computational resource requirements so that they can operate on lower-end devices, such as CPUs, embedded processors, and mobile devices. Network parameter pruning and quantization are two effective approaches. Gong et al. [4] and Wu [5] applied k-means scalar quantization to pre-trained parameters. Significant speed-up can be achieved by 8-bit integer and 16-bit floating-point quantization, as shown in [6] and [7]. Parameter pruning approaches [8, 9, 10] can be used to reduce network complexity. Low-rank factorization and decomposition [11], transferred learning and compact convolutional filter learning [12, 13], and knowledge distillation [14, 15, 16] methods have also been developed to reduce the network complexity. NoScope [17] tackled the problem of very expensive surveillance video object detection by using a shallow and fast CNN as an early estimator and dispatcher. Only complex scenes with significant inter-frame changes require a full run of the deep object detection network, so a higher analysis speed can be achieved. All of these methods aim to optimize the computational complexity and inference speed of deep neural networks, often at the cost of degraded visual recognition performance. Their inference structures, execution procedures, and computational complexity, once trained, are fixed and remain the same for all test images. They are not able to adapt their network inference structure and complexity to different resource supplies and input images.

(B) Confidence Analysis for Deep Neural Networks. Researchers have recognized that the decision scores of existing deep neural networks are poorly calibrated [18]. For example, higher scores often do not correspond to better or closer samples. The work in [19] argues that probability scores generated by softmax should not be considered as a confidence score or distance measure directly, since they are based on the norm of the pre-softmax inputs. To address this issue, various scalar, vector, matrix, and binning methods have been proposed to calibrate the confidence scores produced by the softmax function. Gaussian density modeling is proposed in [19] as a post-prediction calibration using prior information of the training data. Gal et al. [20] implemented a randomized dropout network [21] to estimate the uncertainty level of the network prediction. The open set deep network approach in [22] attempts to measure the uncertainty contributed by unknown categories. It should be noted that these methods are based on statistical modeling optimized on the entire validation set, and are therefore not suitable for confidence analysis of each individual input image.

3 Method Overview

Fig. 3 provides an overview of the proposed framework for deep neural networks. In this work, we divide the network into $K$ stages with deep neural network units $U_k$, $1 \le k \le K$. Each unit $U_k$ consists of a set of network layers, including convolution, pooling, ReLU, etc. The output of unit $U_k$ is a feature vector $F_k$. At decision stage 1, we use the feature $F_1$ to generate the decision output, i.e., the classification result for the input image, using an evaluation network $E_1$. The evaluation network consists of a set of network layers, including convolution, pooling, fully connected, and softmax layers. At stage 2, feature $F_1$ generated from unit $U_1$ and feature $F_2$ generated from unit $U_2$ are fused together using a fusion network $\Phi_2$ to produce the fused feature $\bar{F}_2$, which will be used by the evaluation network $E_2$ to produce the decision output. The fusion network consists of feature concatenation followed by convolution layers for normalization. The fused feature $\bar{F}_2$ produced at stage 2 will be forwarded to stage 3. At stage $k$, the network unit $U_k$ will produce feature $F_k$, which will be fused with $\bar{F}_{k-1}$ from the previous stages to produce the fused feature $\bar{F}_k$. We can see that the proposed progressive deep neural network is able to accumulate and fuse more and more visual features and scale up its visual recognition power by activating more network units, at the cost of more computational resources. The network inference can be terminated at any stage.
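The stage-by-stage flow described above can be illustrated with a short sketch. It rests on several assumptions: the units, evaluation networks, and stopping rule are toy callables, and the fusion network is replaced by plain list concatenation, whereas the paper uses concatenation followed by convolution layers. The name `progressive_forward` is hypothetical.

```python
def progressive_forward(image, units, evaluators, should_stop):
    """Run the units stage by stage and return (exit_stage, scores)."""
    fused = []      # accumulated feature, reused by every later stage
    scores = None
    for stage, (unit, evaluate) in enumerate(zip(units, evaluators), start=1):
        feature = unit(image)      # feature produced by this stage's unit
        fused = fused + feature    # stand-in for the fusion network (concatenation only)
        scores = evaluate(fused)   # decision output of this stage
        if stage == len(units) or should_stop(scores):
            return stage, scores
    return 0, scores               # only reached when no units are given


# Toy stand-ins: each unit extracts one number, and the evaluator becomes
# more confident as more features are accumulated.
units = [lambda x: [sum(x)], lambda x: [max(x)], lambda x: [min(x)]]
evaluators = [lambda f: [0.5 + 0.15 * len(f), 0.5 - 0.15 * len(f)]] * 3

stage, scores = progressive_forward(
    [1, 2, 3], units, evaluators, should_stop=lambda s: max(s) >= 0.75)
print(stage, scores)
```

With the toy stopping rule, inference ends at the second stage; with a rule that never fires, all three units are executed. The key design point is that `fused` is carried forward, so later stages reuse earlier computation instead of starting over.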

Figure 3: The proposed framework for progressive deep neural networks.

Let $O_k$ be the output result produced by the evaluation network $E_k$. At later stages, the progressive network accumulates more visual evidence for classification, so its classification accuracy or visual recognition power will be higher and the decision uncertainty will be lower. We therefore need a carefully designed module, called the Confidence Analysis and Decision Policy (CADP) network, to analyze the output results $O_k$ from each stage and its previous stage. It decides whether the current decision is reliable enough to terminate the inference process early, or whether we need to proceed to the next stage to gather more visual evidence. In this work, the CADP network is realized by a recurrent neural network (RNN) learned from the training data.

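The task of the CADP network is two-fold: (1) it needs to generate the decision of classification or other visual recognition tasks at the current stage using the evaluation results from the current and previous stages. (2) It needs to learn an optimal decision policy for early termination such that the overall computational complexity is minimized while maintaining the state-of-the-art classification accuracy achieved by existing non-progressive networks. Let $I_n$, $1 \le n \le N$, be the input image. We define

$$\delta_k(I_n) = \begin{cases} 1, & \text{if } I_n \text{ is correctly classified at stage } k, \\ 0, & \text{otherwise.} \end{cases} \tag{1}$$

We denote the decision policy in the CADP network by $\pi$. The CADP network decides that image $I_n$ be terminated at stage $\pi(I_n)$. Let $C_k$ be the computational complexity of stage $k$. Then the computational complexity for evaluating the input image $I_n$ will be

$$C(I_n) = \sum_{k=1}^{\pi(I_n)} C_k. \tag{2}$$

The overall accuracy for all test images will be given by

$$A(\pi) = \frac{1}{N} \sum_{n=1}^{N} \delta_{\pi(I_n)}(I_n). \tag{3}$$

Therefore, the optimal policy to be learned by the CADP network aims to minimize the overall complexity while maintaining the target accuracy:

$$\pi^{*} = \arg\min_{\pi} \sum_{n=1}^{N} C(I_n) \quad \text{subject to} \quad A(\pi) \ge A_0, \tag{4}$$

where $A_0$ denotes the target accuracy.
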
In the following sections, we will present our progressive deep neural network design and explain our method to learn the CADP network.
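To make the trade-off behind the policy concrete, the sketch below scores a given termination policy by its average complexity and overall accuracy. It assumes that the cost of an image is the sum of the costs of all stages executed up to its exit stage; the function name `evaluate_policy` and the toy numbers are hypothetical.

```python
def evaluate_policy(correct, stage_costs, exit_stage):
    """Return (average cost per image, overall accuracy) for a termination policy.

    correct[i][k-1] -- True if image i is classified correctly at stage k
    stage_costs[k-1] -- cost of running stage k
    exit_stage[i]   -- 1-based stage at which image i terminates
    """
    total_cost = 0.0
    hits = 0
    for i, row in enumerate(correct):
        k = exit_stage[i]
        total_cost += sum(stage_costs[:k])  # assumption: stages 1..k are all executed
        hits += int(row[k - 1])
    n = len(correct)
    return total_cost / n, hits / n


# Hypothetical toy problem: 3 images, 3 stages.
stage_costs = [1.0, 2.0, 4.0]
correct = [
    [True, True, True],     # easy image: correct from stage 1
    [False, True, True],    # medium image: needs stage 2
    [False, False, True],   # hard image: needs the full network
]
adaptive = evaluate_policy(correct, stage_costs, [1, 2, 3])
full = evaluate_policy(correct, stage_costs, [3, 3, 3])
print(adaptive, full)
```

In this toy case the adaptive policy reaches the same accuracy as always running the full network at roughly half the average cost, which is the behavior the optimal policy is meant to approach.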

4 Progressive Deep Neural Network Design

The concept of progressive inference is different from traditional network inference. It must be able to produce a sequence of complete prediction results. Early stages of the network should have small computational complexity. In addition, the features and results from previous stages should be reused and accumulated. As discussed in [23], the overall computation required by standard convolutional layers in a CNN is given by:

$$C = \sum_{l=1}^{L} K_l^2 \, S_l^2 \, C_l^{\mathrm{in}} \, C_l^{\mathrm{out}}, \tag{5}$$

where $K_l$, $S_l$, $C_l^{\mathrm{in}}$, and $C_l^{\mathrm{out}}$ are the kernel size, input feature map size, and input and output channels of layer $l$, respectively. To change the values of $K_l$ and $S_l$, we can choose different building blocks, such as residual [24] and dense [2]. In complexity-progressive network design, we focus on the remaining two sets of parameters: channels ($C_l^{\mathrm{in}}$, $C_l^{\mathrm{out}}$) and layers ($L$), which dominate the overall complexity since their values are typically very large.
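As a rough illustration of how the convolutional cost accumulates over layers, the sketch below counts multiply-accumulate operations for a small stack of standard convolutions. It assumes square kernels, square feature maps, and stride 1 with "same" padding; the function name `conv_cost` and the layer shapes are hypothetical.

```python
def conv_cost(layers):
    """Sum multiply-accumulate counts over standard convolution layers.

    Each layer is a tuple (kernel_size, feature_map_size, in_channels, out_channels).
    Assumes square kernels and square feature maps.
    """
    return sum(k * k * s * s * c_in * c_out for k, s, c_in, c_out in layers)


# Hypothetical two-layer stack: 3x3 kernels on 32x32 and 16x16 feature maps.
layers = [(3, 32, 16, 16), (3, 16, 32, 32)]
print(conv_cost(layers))  # 4718592
```

Because the channel counts enter the cost as a product, and every extra layer adds another term to the sum, width and depth are the two natural axes along which a network can be partitioned.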

It can be seen that these two sets of parameters represent two different dimensions, corresponding to two different ways of partitioning the network: horizontal and vertical. This leads to two different structures, parallel and serial, for progressive deep neural network design, which are explained in the following.

Figure 4: ProgNet prototype structures for image classification: parallel (left) and serial (right). Spatial reduction cells are omitted in the figures for simplicity.

(A) Parallel structures. In the parallel structure for deep neural networks, we partition the network into multiple stages by reducing the input and output channel sizes, $C_l^{\mathrm{in}}$ and $C_l^{\mathrm{out}}$. Let $r$ be the down-sampling ratio of $C_l^{\mathrm{in}}$ and $C_l^{\mathrm{out}}$. The complexity of the convolution layers can then be reduced by a factor of $r^2$ according to Eq. (5). As shown in Fig. 4 (left), at stage $k$, we use a thin network with small input and output channel sizes. The depth of the network could be as large as that of the original non-progressive network. We use existing building blocks developed in the literature. Available choices are residual [24], residual bottleneck [25], dense [2], inception [26], inception-residual [27], and NasNet [28] blocks. Similar to [29], a reduction block contains stride-2 convolution or down-sampling layers used to reduce the spatial size by a factor of 2. A normal block keeps the spatial dimension intact. In this parallel structure, the input image is analyzed by different network units with different channel sizes. The features generated by different units are fused and accumulated together using a concatenate operator.
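A back-of-the-envelope sketch of how cost accumulates across parallel branches is given below. It assumes that all branches have the same depth and that the cost of a branch scales with the square of its width multiplier, since both input and output channels scale with it. It ignores the fusion and evaluation layers and the fixed 3-channel image input, so the numbers are only indicative. The function name `parallel_stage_costs` is hypothetical; the multipliers are those listed for the 4-stage parallel model in Table 1.

```python
def parallel_stage_costs(multipliers):
    """Cumulative cost fraction after each stage of a parallel structure.

    Assumption: the cost of a branch is proportional to the square of its
    width multiplier (both channel counts scale with the multiplier).
    """
    branch_costs = [m * m for m in multipliers]
    total = sum(branch_costs)
    cumulative = []
    running = 0
    for cost in branch_costs:
        running += cost
        cumulative.append(running / total)
    return cumulative


fractions = parallel_stage_costs([1, 1, 2, 3])
print(fractions)
```

Under these assumptions the first stage costs roughly 1/15 of the full model, which is consistent in spirit with the more than 10-fold complexity range reported for ProgNet, although the exact ratios depend on the layers omitted here.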

(B) Serial structures. One limitation of the parallel structure is that the width of each unit or branch cannot be reduced to arbitrary numbers. In our experiments, 4 and 8 are the minimum effective widths of each unit on the CIFAR [30] and ImageNet [1] datasets, respectively, in order to maintain sufficient representation capacity. The serial structure partitions the network along the dimension of layers $L$. As shown in Fig. 4 (right), we extract features from different layers of the network, apply global pooling to them, and use a fully connected layer to generate the output decision. This feature is also concatenated with features extracted from the next layers to be used for the decision at the next stage. We can see that in this serial structure, the complexity of different stages increases linearly with the layer depth.


Designing and successfully training the progressive network structure is a challenging task. Specifically, we need to make sure that: (1) the overall accuracy at stage $k$ achieved by the evaluation network $E_k$ increases with $k$. Otherwise, the additional computational resources are wasted. (2) When we apply the full complexity, i.e., evaluate each input using the whole network, it is more complexity-accuracy effective than existing state-of-the-art networks.

Following the work in [31, 32] for multi-class classification, we use the cross-entropy loss as our joint loss function:

$$\mathcal{L} = \sum_{k=1}^{K} w_k \, \mathrm{CE}(O_k, y),$$

where $w_k$ and $O_k$ are the weight and output from stage $k$, respectively, and $y$ is the ground-truth label. If not otherwise required by the target application, we treat each stage with uniform weights ($w_1 = \dots = w_K$), i.e., outputs from all stages are equally important.
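A joint loss of this kind can be sketched as a weighted sum of per-stage cross-entropy terms. The sketch below is an illustration under stated assumptions, not the authors' training code: it works on a single example with raw (pre-softmax) scores, and it takes the uniform weights to be 1.0 for every stage. The function names are hypothetical.

```python
import math


def cross_entropy(logits, label):
    """Cross-entropy of one example from raw scores, via a stable log-sum-exp."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


def joint_loss(stage_logits, label, weights=None):
    """Weighted sum of the cross-entropy losses of all stage outputs."""
    if weights is None:
        weights = [1.0] * len(stage_logits)  # assumption: uniform weight of 1.0
    return sum(w * cross_entropy(z, label) for w, z in zip(weights, stage_logits))


# Toy outputs of three stages for a 3-class problem; the true class is 0.
stage_logits = [[0.2, 0.1, 0.0], [1.0, 0.2, -0.3], [3.0, 0.1, -1.0]]
print(joint_loss(stage_logits, label=0))
```

Summing the losses of all stages trains every evaluation output at once, so each stage is pushed to be a usable classifier on its own rather than only the last one.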

5 Confidence Analysis and Decision Policy Learning

In the previous section, we introduced ProgNet, which can perform network prediction at a sequence of stages. At each stage, ProgNet needs to determine whether the current evaluation output is reliable or confident enough, or whether it is necessary to proceed to the next stage to accumulate more visual evidence. During our experiments, we found that the decisions at different stages are inter-dependent. Specifically, the current stage needs to examine the evaluation results of all previous stages to make effective decisions. To address this issue, we propose a recurrent neural network (RNN) to learn the confidence analysis and decision policy (CADP). As illustrated in Fig. 5, the RNN CADP network uses the pre-softmax outputs of the evaluation networks of all previous stages to learn the confidence estimation and decision policy at the current stage.

Figure 5: Early termination control during inference by RNN controller.

More specifically, for each input (usually a mini-batch of images), an RNN controller takes as input the $M$-class pre-softmax output from the current CNN classifier and produces $M + 1$ outputs: $M$ new categorization results and one post-sigmoid confidence estimate. The post-sigmoid confidence score is compared with a user-defined threshold $t$ in the range $[0, 1]$. If the score exceeds the threshold, the output is emitted directly from the RNN classification results; otherwise, another stage of CNN-RNN decision is required.
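The thresholding step can be sketched as follows. The sketch assumes that the controller output for each stage is a flat list whose last entry is the raw confidence value and whose other entries are class scores, and that inference stops at the first stage whose post-sigmoid confidence exceeds the threshold. Both the layout and the function names are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def split_controller_output(outputs):
    """Split controller outputs into class scores and a post-sigmoid confidence."""
    return outputs[:-1], sigmoid(outputs[-1])


def exit_stage(per_stage_outputs, threshold):
    """Return (stage, predicted_class) for the first sufficiently confident stage.

    Falls back to the last stage when no stage exceeds the threshold.
    """
    if not per_stage_outputs:
        raise ValueError("at least one stage output is required")
    for stage, outputs in enumerate(per_stage_outputs, start=1):
        class_scores, confidence = split_controller_output(outputs)
        prediction = class_scores.index(max(class_scores))
        if confidence > threshold:
            return stage, prediction
    return len(per_stage_outputs), prediction


# Toy controller outputs for a 2-class problem over three stages.
outputs = [[0.2, 0.1, -1.0], [1.5, 0.3, 0.5], [3.0, -1.0, 3.0]]
print(exit_stage(outputs, threshold=0.5))  # (2, 0)
print(exit_stage(outputs, threshold=0.9))  # (3, 0)
```

Raising the threshold pushes more images to later stages, which is how a single trained model can be moved along the complexity-accuracy curve at run-time.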

Optimizing the RNN controller is the most challenging problem in this work. For each image, with a user-defined threshold $t$, the objective of the CADP network is to solve the following optimization problem, which minimizes the overall error rate subject to a computational complexity constraint:

subject to (9)

where is the number of stages and is the normalized computational cost of -th DNN unit. is a constant number once ProgNet is composed. is Heaviside(unit) step function which is for positive inputs and for negative inputs. is the correctness score ( if correctly classified, otherwise) of the -th stage which is the last rejected stage by . Finding is equivalent to finding the first index where and as shown in Eq. (8). Without loss of generality, we approximate the confidence integral using summation over discrete samples of within the range of . The above optimization problem becomes :


Combining standard cross entropy loss and , we have the following optimization objective function:


where and are weights and bias of the RNN controller, respectively. is the ground-truth of -th image, is the hyper-parameter controlling weights of classification and confidence losses. While it is possible to directly optimize Eq. (11) with constraints Eq. (9) using the method in [33], we found that it is more efficient and robust to convert the problem into:


where is the optimal confidence score. The problem in Eq. (12) can be solved with the Constrained Optimization by Linear Approximation (COBYLA) algorithm [34]. Eq. (13) can be solved using back-propagation with standard stochastic gradient descent (SGD).

Note that the desired target in Eqs. (10), (11), and (12) is the output from the RNN classifier. While it is possible to update this target after each batch or epoch, doing so is very expensive. In this work, we use the outputs from the evaluation networks for the initial controller training epochs. We then update the target using the latest controller output and continue training the controller for the remaining epochs. We implement the RNN controller using a three-layer LSTM, stacking 3 LSTM cells with 2 Dropout [21] layers in between. Before each forwarding step, the internal states of the controller are initialized with zeros. More training details are provided in Section 6.

6 Experimental Results

In this section, we evaluate our proposed ProgNet using the CIFAR-10 [30] and ImageNet (ILSVRC2012) [1] datasets. On both datasets, our goal is to train a single ProgNet model that provides progressive complexity-accuracy scalability while outperforming existing state-of-the-art networks in terms of complexity-accuracy performance. All ProgNet models are trained on AWS P3 8x large instances with 4 Tesla V100 GPUs. Testing and run-time benchmarks are executed on a local workstation with one Intel Xeon(R) CPU E5-1620 v3 @ 3.50GHz and 4 Pascal Titan GPUs. We implement ProgNet using the Gluon imperative Python APIs with the MXNet backend [35]. All reference networks for performance comparison are also benchmarked using MXNet unless otherwise specified.

(A) Network Configurations. Both parallel and serial structures of the ProgNet are flexible and highly expandable. In this work, we conducted extensive experiments using three different network settings for CIFAR-10 and two different settings for ImageNet, which are summarized in Table 1.

CIFAR-10

| Multiplier | [1, 1, 2, 3] | [1, 1, 1, 2, 3, 4] | [1] |
|---|---|---|---|
| Output | ProgNet-p4-residual | ProgNet-p6-residual | ProgNet-s9-dense, k=12 |
| | conv, stride 2 | conv, stride 2 | conv, stride 2 |
| | [res 2] - [s2 max pool] | [res 2] - [s2 max pool] | [d 2, fc] - [s2 max pool] |
| | [res 3] - [s2 max pool] | [res 3] - [s2 max pool] | [d 2, fc] - [s2 max pool] |
| | [res 3] - [global avg pool] | [res 3] - [global avg pool] | [d 3, fc] - [global avg pool] |
| | fc | fc | fc |

ImageNet

| Multiplier | [1, 1, 2, 3] | [1] |
|---|---|---|
| Output | ProgNet-p4-residual | ProgNet-s6-dense, k=18 |
| | conv, stride 2 | conv, stride 2 |
| | max pool, stride 2 | max pool, stride 2 |
| | [res ] - [res, stride 2] | [d ] - [d, stride 2, fc] |
| | [res ] - [res, stride 2] | [d ] - [d, stride 2, fc] |
| | [res ] - [global avg pool] | [d , fc, d , fc, d , fc, d ] - [global avg pool] |
| | fc | fc |

Table 1: Network Configurations for CIFAR-10 and ImageNet Datasets.

(B) Network Training and Inference. The base classifier and LSTM controller in ProgNet are trained separately using SGD optimization. For the base network, we use a batch size of 256. The numbers of training epochs for CIFAR-10 and ImageNet are 350 and 180, respectively. Following [2, 29, 28], we use an initial learning rate of 0.1, weight decay of 0.0001, and momentum of 0.9. The learning rate is lowered by a factor of 10 at 25%, 50%, and 80% of the total epochs. Parameters with the best mean accuracy over all DNN units are saved as our best model. We then start training the LSTM controller using this best model. We sample the early termination threshold $t$ within $(0, 1)$; $t = 0$ and $t = 1$ are excluded because $t = 0$ is considered as no network inference and $t = 1$ stands for full network inference. The controllers are optimized using SGD with a learning rate of 0.5, weight decay of 0, and momentum of 0 for all experiments.
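The step schedule described above can be written compactly. This is a sketch rather than the authors' training script: the function name `learning_rate` is hypothetical, and it assumes the rate drops at the first epoch whose index reaches each percentage of the total.

```python
def learning_rate(epoch, total_epochs, base_lr=0.1, milestones=(25, 50, 80), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each milestone.

    `milestones` are percentages of `total_epochs`; integer arithmetic is used
    for the comparison so that the boundaries are exact.
    """
    lr = base_lr
    for percent in milestones:
        if epoch * 100 >= percent * total_epochs:
            lr *= factor
    return lr


# ImageNet setting from the text: 180 epochs, drops at epochs 45, 90, and 144.
for epoch in (0, 44, 45, 90, 144, 179):
    print(epoch, learning_rate(epoch, total_epochs=180))
```

For the 350-epoch CIFAR-10 setting, the 25% point falls at 87.5, so under this assumption the first drop takes effect from epoch 88.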

To evaluate the impact of the RNN controllers during inference, we conducted experiments using the following two modes: dynamic and fixed. In the dynamic mode, users specify a confidence value as the early termination threshold. This is the desired behavior in this work. For comparison, we also allow ProgNet to follow a preset stage setting and run in fixed mode. Once set, ProgNet acts as a non-progressive network.

In our experiments, we evaluate two different progressive structures, parallel and serial, with different numbers of stages, such as 4, 6, and 9. We also evaluate two different modes, dynamic and fixed, for the CADP controller. For example, in our figures, the notation p6-dynamic indicates that the network has a parallel structure with 6 stages and a dynamic CADP mode, while s9-fixed indicates that the network has a serial structure with 9 stages and a fixed CADP mode.

6.1 Experimental Results on the CIFAR-10 Dataset

In our CIFAR-10 experiments, we follow previous practice [24] to augment the data. We pre-process the training images by converting them to float32, followed by up-sampling and random cropping. Random horizontal flipping is applied during training as a common strategy. The validation data is processed by converting to float32 without augmentation. We set the batch size to 50. For a given early termination threshold $t$, the confidence analysis and decision policy network in ProgNet decides when to stop the network inference for the input image. We record the corresponding average network inference time of ProgNet and the overall error rate. Fig. 6 shows the resulting curves for the following network settings: p6-dynamic, p4-dynamic, and s9-dynamic. We show the results on CPU (left) and GPU (right). We include results for ProgNet with fixed controller modes (shown in solid boxes). For comparison, we also include the complexity-accuracy results for state-of-the-art networks on the CIFAR-10 dataset: resnet20, resnet10, densenet100/12, and densenet100/24. It can be seen that our proposed ProgNet outperforms existing networks in complexity-accuracy performance using one single model. As we increase the number of stages, we can achieve a larger complexity-accuracy scalability range. The parallel structure is slightly better than the serial structure.

Figure 6: Error rate versus actual average inference speed on CIFAR-10 validation set. Black squares indicate results from previously published networks running on our machine.
Figure 7: Top-1 accuracy versus actual average inference speed on the ImageNet validation set. Black squares represent results from previously published networks.

6.2 Experimental Results on the ImageNet Dataset

Following the same procedures outlined in existing papers [25, 2, 36, 29, 28], we conduct the ProgNet training and testing experiments on the ILSVRC 2012 dataset [1]. We use a fixed input image size. Training images are augmented with random cropping under minimum and maximum aspect-ratio bounds, and random horizontal flipping is used in training. For training and validation, pixel means of [123, 117, 104] are subtracted from the images, which are then normalized by standard deviations of [58.4, 57.12, 57.38]. As in [24, 25, 29], we use 1.28 million images for training and 50,000 images for testing.
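The per-channel normalization can be sketched as below. The sketch assumes that the three values are given in RGB channel order and that pixel values lie in the 0 to 255 range; the function names are hypothetical and images are plain nested lists for simplicity.

```python
MEAN = (123.0, 117.0, 104.0)   # per-channel means from the text (RGB order assumed)
STD = (58.4, 57.12, 57.38)     # per-channel standard deviations from the text


def normalize_pixel(rgb):
    """Subtract the channel mean and divide by the channel standard deviation."""
    return tuple((value - mean) / std for value, mean, std in zip(rgb, MEAN, STD))


def normalize_image(image):
    """Normalize an image stored as rows of (R, G, B) tuples."""
    return [[normalize_pixel(pixel) for pixel in row] for row in image]


# Toy 1x2 image: one mean-colored pixel and one white pixel.
image = [[(123, 117, 104), (255, 255, 255)]]
print(normalize_image(image))
```

A pixel equal to the channel means maps to zero in every channel, so the network input is roughly zero-centered with unit scale.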

Similar to the above CIFAR-10 experiments, we record the complexity-accuracy curves for different ProgNet structures (parallel and serial) with different numbers of stages (4 and 6) using different controller modes (dynamic and fixed). Fig. 7 shows these curves for CPU (left) and GPU (right). We also include complexity-accuracy results achieved by existing networks, including resnet18, resnet50, densenet121, mobilenet-v1, and mobilenet-v2. Table 2 summarizes the complexity-accuracy performance of ProgNet in comparison with existing networks. We include the number of model parameters, the number of MACC (multiply-accumulate) operations, and the running times on CPU and GPU. It can be seen that our proposed ProgNet outperforms existing networks in complexity-accuracy performance by a large margin using one single model. For the same complexity, both ProgNet variants outperform MobileNet [3, 23], which has been significantly optimized, by 2% to 3% in classification accuracy. For the same accuracy, the ProgNet-p4-dynamic model is able to achieve 20% less inference time than MobileNet-v2.

| Model | Top-1 (%) | Params | MACC | CPU | GPU |
|---|---|---|---|---|---|
| ResNet-18 [24] | 68.0 | 11.69M | 1.83G | 52.37ms | 0.54ms |
| ResNet-50 [24] | 75.4 | 25.26M | 3.87G | 103.00ms | 1.86ms |
| DenseNet-121 [2] | 74.9 | 7.98M | 3.08G | 142.73ms | 1.95ms |
| MobileNet-v1 [3] | 70.6 | 4.2M | 575M | 41.98ms | 0.59ms |
| MobileNet-v2 [23] | 71.7 | 3.4M | 300M | 64.78ms | 0.89ms |
| ShuffleNet(1.5) [36] | 69.0 | 2.9M | 292M | - | - |
| NasNet-A [28] | 74.0 | 5.3M | 564M | - | - |
| ProgNet-p4-dynamic | 73.9 | 13.12M | 1.87G | 89.03ms | 1.17ms |
| ProgNet-s6-dynamic | 74.6 | 14.3M | 1.31G | 85.31ms | 1.23ms |
| ProgNet-p4-fixed @t=0.5 | 71.9 | 13.12M | - | 47.3ms | 0.67ms |
| ProgNet-p4-fixed @t=0.6 | 72.5 | 13.12M | - | 59.8ms | 0.81ms |
| ProgNet-s6-fixed @t=0.3 | 67.1 | 14.3M | - | 23.2ms | 0.31ms |

Table 2: Complexity-accuracy comparison between ProgNet and existing networks on the ImageNet dataset.

In our ProgNet design, the confidence analysis and decision policy (CADP) network plays a critical role since it controls whether the next network stage should be activated for the input image or whether its inference should be terminated early. This has a direct impact on the complexity-accuracy performance of ProgNet. To further understand the behavior and capability of the CADP controller, we implement a random controller in which the network inference of an input image is terminated at a random stage. We then run the experiment using this random controller on the ImageNet dataset 1000 times to simulate a partial brute-force search for the best control policy. In Fig. 8, each dot represents the complexity-accuracy result of one experimental run. The solid curve represents our CADP controller. We can see that the CADP is very effective, outperforming static control.
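A random-controller baseline of this kind can be simulated with a few lines of code. The sketch below is illustrative only: it assumes the exit stage of each image is drawn uniformly at random, that the cost of an image is the sum of the costs of the stages it executes, and it uses a small synthetic correctness table in place of real ImageNet results. All names are hypothetical.

```python
import random


def random_controller_run(correct, stage_costs, rng):
    """One run: every image exits at a uniformly random stage."""
    total_cost = 0.0
    hits = 0
    for row in correct:
        k = rng.randint(1, len(stage_costs))  # 1-based exit stage (uniform, by assumption)
        total_cost += sum(stage_costs[:k])    # stages 1..k are executed
        hits += int(row[k - 1])
    return total_cost / len(correct), hits / len(correct)


def simulate(correct, stage_costs, runs=1000, seed=0):
    """Collect (average cost, accuracy) points from repeated random runs."""
    rng = random.Random(seed)
    return [random_controller_run(correct, stage_costs, rng) for _ in range(runs)]


# Synthetic correctness table: later stages classify more images correctly.
correct = [[i % 3 == 0, i % 2 == 0 or i % 3 == 0, True] for i in range(60)]
stage_costs = [1.0, 2.0, 4.0]
points = simulate(correct, stage_costs)
print(len(points), points[0])
```

Plotting these points against the curve traced by a learned controller gives a comparison in the style of Fig. 8: a useful controller should reach higher accuracy than the random cloud at the same average cost.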

Figure 8: Complexity-accuracy performance comparison between our CADP controller and random termination.

7 Conclusion

In this work, we have established a progressive deep neural network structure which is able to adapt its network inference structure and computational complexity to images with different visual recognition complexity. This progressive structure is able to scale up its discriminative power and achieve higher recognition capability by activating and executing more analysis units of the deep neural networks for more difficult vision analysis tasks or image sets. We have developed a multi-stage progressive structure, called ProgNet, for deep neural networks, with each stage being separately evaluated for output decision and all stages being activated in a sequential manner with progressively increased complexity and visual recognition power. We developed a recurrent neural network method to learn the confidence analysis and decision policy for early termination. Our extensive experimental results on the CIFAR-10 and ImageNet datasets demonstrate that the proposed ProgNet is able to obtain more than 10-fold complexity scalability while achieving state-of-the-art performance with a single network model.