Image classification [akata2015evaluation, blot2016max, elsayed2018large, li2017improving, rastegari2016xnor, tang2015improving, wang2018ensemble, yan2012beyond], aiming at classifying an image into one of several predefined categories, is an essential problem in computer vision. In the last few decades, researchers focused on representing images with hand-crafted low-level descriptors [chen2009wld, lowe-2004-sift], and then discriminating them with a classifier (e.g., SVM [chang-2011-libsvm] or its variants [lu2007gait, maji2008classification]). However, due to the lack of high-level features, the performance saturated. Thanks to the availability of huge labeled datasets [lu-2014-twoclass, russakovsky2015imagenet] and powerful computational infrastructures, convolutional neural networks (CNNs) can automatically extract discriminative high-level features from the training images, significantly improving the state-of-the-art performance.
Although high-level features are more discriminative, adopting them alone to classify images is still challenging, since the possibilities of confusion increase with a growing number of categories. In addition, features in the early layers have been shown to be able to separate groups of classes at a higher level of the hierarchy [bilal-2017-convLearnHierarchy]. Therefore, researchers have attempted to combine high- and low-level features to exploit their complementary strengths [yu2017exploiting]. However, a simple combination yields features of relatively high dimensionality, hindering practical use.
Other researchers employ low-level features to make coarse decisions and then utilize high-level features to make finer ones, based on the idea of divide-and-conquer. This can be achieved by designing deep decision trees that implement traditional decision trees [quinlan-1986-DecisionTree] with CNNs. Given a hierarchical structure of categories, a straightforward way is to let the network at the root node identify the coarsest category, and then dynamically route to the network of a child node to determine the finer one recursively [kontschieder-2015-decForest]. However, hierarchical information about categories is not always available, so researchers are required to design suitable division solutions, making the training process extremely complex (e.g., multi-staged). Besides, current deep decision tree based methods face two other fatal weaknesses: (1) the network must store all the tree branches, making the number of parameters explosively larger than that of a single classification network; (2) once the decision routes to a false path, it can hardly be recovered.
To resolve the above issues, we propose a novel Decision Propagation Module (DPM). The key idea is that if we adopt an early layer to generate a category-coherent decision and then propagate it along the network, the latter layers can be guided to encode more discriminative features. By stacking a collection of DPMs into backbone network architectures for image classification, the generated Decision Propagation Networks (DP Nets) are explicitly formulated to progressively encode more descriptive features guided by the decisions made by early layers, and then to refine the decisions based on the newly generated features iteratively. In the view of residual learning [He-2016-ResNet], it is much easier to optimize this refining process than to optimize the making of an unreferenced new decision from scratch. Besides the advantage of easier optimization, the properties of DPM enable DP Nets to naturally overcome the weaknesses of common deep decision trees. Firstly, in contrast to dynamically routing between several different branches after making a decision, DPM applies the decision as a conditional code to the latter layers, similar to [mirza-2014-conditionalGAN], such that it can be propagated without introducing additional network branches. Thanks to this decision propagation scheme, DP Nets can also recover from false decisions made earlier, since no hard routing is involved. Furthermore, instead of explicitly designating what each intermediate decision indicates, with the weak supervision provided by three novel loss functions, DPM automatically learns a more suitable and coherent division to separate the categories instead of following the man-made category hierarchy, and can be trained in a totally end-to-end fashion with the backbone networks. In total, our contribution is threefold:
We design a novel DPM, which could propagate the decision made upon an early layer to guide the latter layers.
We propose three novel loss functions to enforce DPM to make category-coherent decisions.
We demonstrate a general way to integrate DPMs into various backbone networks to form DP Nets.
Extensive comparison results on four publicly available datasets validate that DPM consistently improves the classification performance and is superior to the state-of-the-art methods. Code will be made public upon paper acceptance.
2 Related Work
Category Hierarchy, which indicates that categories form a semantic hierarchy consisting of many levels of abstraction, has been well exploited [grauman2011learning, saha2018class2str, tousch-2012-semanticHierarchiesSurvey]. Deng et al. [deng-2014-LabelRelationGraphs] introduced hierarchy and exclusion graphs that capture semantic relations between any two labels to improve classification. Yan et al. [yan-2015-HierarchicalCNN] proposed a two-level hierarchical CNN, with the first layer separating easy classes using a coarse category classifier and the second layer handling difficult classes utilizing fine category classifiers. To mimic the high-level reasoning ability of humans, Goo et al. [goo-2016-taxonomy] introduced a regulation layer that abstracts and differentiates object categories based on a given taxonomy, significantly improving the performance. However, the man-made category hierarchy may not be a good division in the view of CNNs.
Deep Decision Trees/Forests.
The cascade of sample splitting in decision trees has been well explored by traditional machine learning approaches [quinlan-1986-DecisionTree]. With the rise of deep networks, researchers have attempted to design deep decision trees or forests [zhou-2017-deepForest] to solve the classification problem. Since prevailing approaches for decision tree training typically operate in a greedy and local manner, hindering representation learning with CNNs, Kontschieder et al. [kontschieder-2015-decForest] introduced a novel stochastic routing for decision trees, enabling split node parameter learning via backpropagation. Without requiring the user to set the number of trees, Murthy et al. [murthy-2016-decsion-mining-low-confident-recursive] proposed a "data-driven" deep decision network which stage-wisely introduces decision stumps to classify confident samples and partitions the remaining hard-to-classify data into smaller clusters for learning successive expert networks in the next stage. Ahmed et al. [ahmed-2016-NetworkOfExperts] further proposed to jointly train a generalist that discriminates coarse groupings of categories and experts aimed at accurate recognition of classes within each specialty, obtaining substantial improvement. Instead of clustering data based on image labels, Chen et al. [chen-2018-Semi-supervisedHierarchicalCNN] proposed a large-scale unsupervised maximum margin clustering technique to split images into a number of hierarchical clusters iteratively, learning cluster-level CNNs at parent nodes and category-level CNNs at leaf nodes.
Different from the above approaches, which implement each decision branch with a separate routing network, Xiong et al. [xiong-2015-conditionalNetwork] proposed a conditional CNN framework for face recognition which dynamically routes by activating a part of the kernels, making the deep decision tree more compact. Based on it, Baek et al. [baek-2017-deepDecisionJungle] proposed a fully connected "soft" decision jungle structure to make decisions recoverable, leading to more discriminative intermediate representations and higher accuracies.
Our DPM can be considered a deep decision tree based approach, and the most similar work to ours is [baek-2017-deepDecisionJungle]. However, the differences are at least three-fold. Firstly, instead of dynamically activating a part of the kernels to reduce parameters, which would make each kernel work for only a part of the decisions, our DPM adopts conditional codes to propagate decisions, enforcing each kernel to work for all the decisions and thus making full use of the neurons. Secondly, their approach requires layers whose channel number is larger than the category number, which can hardly be satisfied in real cases with 1k or more categories, while our solution has no such restriction. Last but not least, we design three novel loss functions to enforce DPM to make category-coherent decisions.
Belief Propagation in CNNs. Belief propagation has been well studied for a long time, especially by traditional methods [conitzer-2019-belief, Felzenszwalb-2006-EfficientBP]. The concept has also been exploited by various deep networks. Highway networks [Srivastava-2015-Highway] allow information to propagate unimpeded across several layers on information highways. ResNets [He-2016-identity] propagate identities via well-defined res-block structures. Compared with those skip-connection based methods, which propagate the identity feature maps directly, the intermediate decisions propagated by our approach are of relatively lower dimension but carry more explicit (category-coherent) guidance. Therefore, our DPM can be considered another feasible solution for belief propagation. Besides, we will also see that DPM can be easily integrated into skip-connection based networks to further improve their performance.
3 Decision Propagation Networks
In this section, we first define the category-coherent decision, and then introduce the structure of the Decision Propagation Module (DPM) together with three corresponding loss functions for training it. In addition, we discuss the large category issue that hinders DPM training and our solution to it; finally, we demonstrate several exemplar Decision Propagation Networks built by integrating DPMs into popular backbone network architectures.
3.1 The Category-coherent Decision
Given inputs with the same object category, if their corresponding decisions are similar, these decisions are called category-coherent decisions. Note that we also allow inputs of multiple categories to share the same decision result. In this paper, a category-coherent decision is a soft assignment over a fixed number of auxiliary categories (two or more), represented as a score vector that sums to one.
3.2 Structure of Decision Propagation Module
The Decision Propagation Module (DPM) is a computational unit that is devised to make a category-coherent early decision [zamir-2017-Feedback] based on the features encoded in an early layer and then propagate it to the following network layers to guide them. A diagram of DPM is shown in Fig. 1.
3.2.1 Make a Soft Decision
Given an intermediate feature map, our aim is to make a category-coherent decision to guide subsequent network layers without incurring too much additional computational cost. Therefore, instead of continuing to convolve on it, we adopt global average pooling (GAP) to drastically reduce the feature dimensions. As verified in [lin-2013-NetInNet], the pooled feature map with channel-wise statistics is usually discriminative enough for classification. We thus adopt a fully connected network with one or two layers to make a decision based on it. To allow this decision branch to be optimized together with the whole network in an end-to-end manner, we apply the softmax function to the output, obtaining a "soft" decision.
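The decision branch can be sketched in a few lines of plain Python (an illustrative sketch only: function names, weight shapes, and the omission of biases are our simplifications, not the paper's implementation):

```python
import math

def global_avg_pool(fmap):
    """Collapse a C x H x W feature map (nested lists) to a C-dim vector of channel means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def soft_decision(fmap, w1, w2):
    """GAP -> FC + ReLU -> FC -> softmax, yielding a 'soft' decision vector."""
    x = global_avg_pool(fmap)
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w1]  # first FC + ReLU
    logits = [sum(w * hi for w, hi in zip(row, h)) for row in w2]       # second FC
    return softmax(logits)
```

In a framework implementation the same branch would be a pooling layer followed by one or two `Linear` layers and a softmax; the pure-Python version above just makes the data flow explicit.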
3.2.2 Decision Propagation
To make use of the information aggregated in the intermediate decision, a straightforward idea is to dynamically route accordingly [murthy-2016-decsion-mining-low-confident-recursive], making the network form a deep decision tree. However, a deep decision tree brings an explosive parameter increment and cannot recover from previous false decisions. We thus follow [mirza-2014-conditionalGAN] and consider the intermediate decision as a conditional code, such that the category prediction process can be directed by conditioning on it. Specifically, we expand the decision vector to the same spatial resolution as the target feature map by copying the decision scores directly, and then concatenate the expanded decision with that feature map as additional channels (see Fig. 1). Note that the target feature map could be the input feature map itself or any feature map output by a subsequent network layer, and we also allow propagating one decision to multiple layers.
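The propagation step amounts to tiling each decision score into a constant spatial plane and appending the planes as extra channels; a minimal plain-Python sketch (list-based tensors, illustrative names):

```python
def propagate_decision(fmap, decision):
    """Expand a K-dim decision to K constant H x W planes and concatenate
    them to a C x H x W feature map as additional channels."""
    h, w = len(fmap[0]), len(fmap[0][0])
    expanded = [[[score] * w for _ in range(h)] for score in decision]
    return fmap + expanded  # shape: (C + K) x H x W
```

With tensor libraries this is typically an `expand`/broadcast followed by a channel-wise concatenation.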
3.3 Loss Functions for DPM
To enable DPM to make category-coherent decisions, we propose three novel loss functions to guide it. In the following, we will describe them one by one.
Notations. We denote $D \in \mathbb{R}^{N \times K}$ as the matrix collecting all the soft decisions in a mini-batch of size $N$, where the entry $d_{ik}$ is the decision score (confidence) of the $k$-th auxiliary category for the $i$-th instance in the batch, and $K$ is the auxiliary category number.
Decision Explicit Loss. If the intermediate decision made by DPM is ambiguous, the following layers can hardly get any useful information from it. Therefore, we introduce a decision explicit loss to encourage the decision scores of one or several auxiliary categories to have relatively large values, while preventing all the auxiliary categories from having similar scores. The loss function is defined as follows:

$$\mathcal{L}_{ex} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{k=1}^{K} d_{ik} \log d_{ik},$$

which is in the form of entropy, encouraging the decision scores of different auxiliary categories to vary a lot.
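A minimal plain-Python sketch of such an entropy-style objective (the exact normalization is our assumption):

```python
import math

def decision_explicit_loss(D, eps=1e-8):
    """Mean entropy of the per-instance decision distributions; minimizing it
    pushes each soft decision toward a confident (peaky) assignment.

    D: N x K list of softmax decision scores.
    """
    n = len(D)
    return -sum(d * math.log(d + eps) for row in D for d in row) / n
```

An ambiguous decision such as [0.5, 0.5] incurs a higher loss than a confident one such as [0.99, 0.01].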
Decision Consistent Loss. Simply enforcing the decision to be explicit is not enough. In addition, we wish the decisions for different instances with the same original category to be consistent; specifically, their decision scores for the same auxiliary category should be similar.
Therefore, we propose a decision consistent loss defined as follows:

$$\mathcal{L}_{co} = \frac{1}{CK}\sum_{c=1}^{C}\sum_{k=1}^{K} \sigma_{ck}^{2},$$

where $C$ is the number of original categories in the batch and $\sigma_{ck}^{2}$ is the sample variance of the scores $d_{ik}$ of the $k$-th auxiliary category for the instances whose original category is $c$.
Denote $M \in \{0, 1\}^{C \times N}$ as the indicator matrix for a batch of data: if the original category of the $i$-th instance in the batch is $c$, then $M_{ci} = 1$; otherwise $M_{ci} = 0$. Thus the mean decision score $\bar{d}_{ck}$ of the $k$-th auxiliary category for all the instances in the batch with original category $c$ can be calculated with the following equation:

$$\bar{d}_{ck} = \frac{\sum_{i=1}^{N} M_{ci}\, d_{ik}}{\sum_{i=1}^{N} M_{ci} + \epsilon},$$

where $\epsilon$ is a small value to avoid divide-zero errors. After that, we can calculate $\sigma_{ck}^{2}$ with

$$\sigma_{ck}^{2} = \frac{\sum_{i=1}^{N} M_{ci}\,(d_{ik} - \bar{d}_{ck})^{2}}{\sum_{i=1}^{N} M_{ci} + \epsilon}.$$
However, calculating these variances one by one is very time-consuming, so we leverage matrix operations to accelerate them. The derived equation is as follows:

$$V = \hat{M}(D \odot D) - (\hat{M}D) \odot (\hat{M}D),$$

which is a matrix with the value in the $c$-th row and $k$-th column being $\sigma_{ck}^{2}$, namely $V_{ck} = \sigma_{ck}^{2}$. Here $\hat{M}$ denotes $M$ with each row divided by $\sum_{i} M_{ci} + \epsilon$, and $\odot$ indicates element-wise multiplication, while the remaining products are matrix products. Although simple, this formulation is critical for training the module efficiently.
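The batched computation of the per-category variances can be sketched in plain Python via the identity Var[d] = E[d^2] - E[d]^2 (an illustration only; a real implementation would use framework tensor operations):

```python
def decision_consistent_loss(D, labels, eps=1e-8):
    """Average within-category variance of the decision scores, computed
    per (original category, auxiliary category) pair.

    D: N x K decision scores; labels: N original-category ids.
    """
    cats = sorted(set(labels))
    n, K = len(D), len(D[0])
    total = 0.0
    for c in cats:
        members = [i for i in range(n) if labels[i] == c]  # indicator row of M
        cnt = len(members) + eps
        for k in range(K):
            mean = sum(D[i][k] for i in members) / cnt          # E[d]
            mean_sq = sum(D[i][k] ** 2 for i in members) / cnt  # E[d^2]
            total += mean_sq - mean * mean                      # variance
    return total / (len(cats) * K)
```

If all instances of a category receive the same decision, the loss is (numerically) zero; divergent decisions within a category drive it up.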
Decision Balance Loss. Besides the above two losses, we also propose a decision balance loss to avoid the degenerate situation where, no matter what original category the input instance has, DPM explicitly assigns it to a single auxiliary category. The decision balance loss therefore encourages all auxiliary categories to be assigned in a balanced manner, in the form of the reverse of entropy:

$$\mathcal{L}_{ba} = \sum_{k=1}^{K} \bar{d}_{k} \log \bar{d}_{k}, \qquad \bar{d}_{k} = \frac{1}{N}\sum_{i=1}^{N} d_{ik},$$

which reaches its minimum when the batch-averaged decision scores of all auxiliary categories are equal.
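A plain-Python sketch of a balance term of this kind, computed on the batch-averaged scores (the exact form is our assumption):

```python
import math

def decision_balance_loss(D, eps=1e-8):
    """Negative entropy of the batch-averaged decision scores; minimized when
    the auxiliary categories are assigned equally often across the batch."""
    n, K = len(D), len(D[0])
    mean = [sum(row[k] for row in D) / n for k in range(K)]
    return sum(m * math.log(m + eps) for m in mean)
```

A batch split evenly between the two auxiliary categories scores lower (better) than a batch collapsed onto one of them.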
3.4 Large Category Issue
For all the original categories that appear in a batch, we expect their decision scores to be consistently, explicitly and balancedly distributed over all the auxiliary categories. However, due to the limitation of computational resources and the training issues of large-batch SGD [goyal-2017-largeBatchSGD], the batch size is normally set to a small number between 1 and 256. For tasks with 100, 1000 or even more original categories, randomly loading a batch of data results in only two or even fewer instances per original category, so the decision consistent loss can hardly be estimated reliably.
Therefore, instead of simply increasing the batch size to maintain more data samples, we propose a novel load-shuffle-split strategy which resolves the large category issue without enlarging the batch size significantly. Take Fig. 2 as an example, where the original category number is 4 and the mini-batch size is also 4. The load-shuffle-split strategy has three key steps: (1) instead of loading only 4 data samples, we load more samples in each iteration (e.g., 8); (2) we generate a list containing all the category IDs, shuffle it, and split the shuffled list into two halves; (3) we split the 8 data samples into two training batches according to the category IDs in the two split lists generated in the last step (see Fig. 2 again), and then train the two batches separately. In this way, the number of instances of each original category in a training batch is doubled.
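The three steps can be sketched as follows (illustrative plain Python; the function name and signature are our own):

```python
import random

def load_shuffle_split(samples, labels, num_splits=2, seed=0):
    """Load-shuffle-split: shuffle the category IDs, split them into
    num_splits lists, and split the loaded samples into training batches
    so that each batch contains only a subset of the categories."""
    rng = random.Random(seed)
    cats = sorted(set(labels))   # all category IDs present in the loaded data
    rng.shuffle(cats)            # shuffle the ID list
    step = (len(cats) + num_splits - 1) // num_splits
    chunks = [set(cats[i:i + step]) for i in range(0, len(cats), step)]
    # split the samples into one training batch per chunk of category IDs
    return [[s for s, y in zip(samples, labels) if y in chunk] for chunk in chunks]
```

With 8 loaded samples over 4 categories and two splits, each resulting batch of 4 samples covers only 2 categories, so every category present in a batch has twice as many instances as in a plain random batch of the same size.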
3.5 Exemplar Decision Propagation Networks
Our DPM is very flexible and could be integrated into various classification network architectures to form Decision Propagation Networks (DP Nets). As it is straightforward to apply it to VGG network [Simonyan-2014-VGG] or AlexNet [Krizhevsky-2012-AlexNet], in this section we only illustrate how to integrate DPMs into modern sophisticated architectures.
For residual networks, we take ResNet [He-2016-ResNet] as an example. As ResNet is organized by stacking multiple residual blocks, we integrate DPM into each residual unit; see Fig. 3 for a demonstration. The intermediate decision is propagated along the residual branch, so those neurons can be guided to learn better residuals. For other popular architectures, such as the Inception network, we also demonstrate how to integrate DPM into the Inception module in Fig. 4. The integration of DPM(s) with many other ResNet and Inception variants, such as ResNeXt [Xie-2016-ResNext] and Inception-ResNet [Szegedy-2017-InceptionResNet], can be constructed in a similar manner.
4 Experiments
In this section, we evaluate the image classification performance of our approach on four publicly available datasets. Our main focus is on demonstrating that DPM improves the performance of backbone CNNs on image classification, not on pushing the state-of-the-art results. Therefore, we devote more space to comparing our approach with popular baseline networks on three relatively small-scale datasets due to limited computational resources, and finally report results on the ImageNet 2012 dataset [Deng-2009-Imagenet] to validate the scalability of our approach.
4.1 Implementation Details
We implement DPM and reproduce all the evaluated networks with PyTorch [paszke2017-pytorch]. The decision network of DPM is constructed with two fully connected (FC) layers around a ReLU [Nair-2010-Relu], followed by a Softmax layer to normalize the decision scores. To limit the model complexity, we reduce the dimension in the first FC layer with a reduction ratio of 16. For all the DPMs integrated into a network, we assume their auxiliary category numbers are exactly the same to ease network construction, and set the number to 2 if not stated otherwise. For VGG, we add BatchNorm (BN) [ioffe-2015-batch] but no Dropout [Srivastava-2014-dropout], and use one FC layer. For Inception, we choose v1 with BN. All other models are identical to those in the original papers.
4.2 Dataset and Training Details
CIFAR-10 and CIFAR-100 [Krizhevsky-2009-Cifar] consist of 60k 32×32 images belonging to 10 and 100 categories respectively. We train the models on the whole training set of 50k images with a mini-batch size of 128, and evaluate them on the test set. We set the initial learning rate to 0.1, and multiply it by 0.2 at 60, 120 and 160 epochs for a total of 200 epochs. For data augmentation, we pad 4 pixels on each side of the image, randomly sample a 32×32 crop from the padded image or its horizontal flip, and then apply simple mean/std normalization.
CINIC-10 [storkey-2018-cinic-10] contains 270k 32×32 images belonging to 10 categories, equally split into three subsets: train, validation, and test. We train the models on the train set with a mini-batch size of 128 and evaluate them on the test set. The training starts with an initial learning rate of 0.1, cosine-annealed to zero over a total of 300 epochs, with the same data augmentation scheme as for the CIFARs.
ImageNet [Deng-2009-Imagenet] consists of 1.2 million training images and 50k validation images from 1k classes. We train the models with minimal data augmentation including random resized crop, flip and the simple mean/std normalization on the whole training set and report results on the validation set. The initial learning rate is set to 0.1 and decreased by a factor of 10 every 30 epochs to a total of 100 epochs.
During training, the three loss functions are calculated for all the DPMs in the network, and the average of each loss is added to the traditional cross-entropy loss for classification. We set the weight of the three loss terms to 0.01 on ImageNet and 0.1 on the other datasets. All the models are trained from scratch with SGD using default parameters as the optimizer, and the weights are initialized following [he-2015-Init]. We evaluate the single-crop performance at each epoch and report the best one.
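The overall objective can be sketched as follows (an illustration of the weighting scheme described above; the names are our own):

```python
def total_loss(ce_loss, dpm_losses, weight=0.1):
    """Cross-entropy plus the weighted average of each auxiliary loss over all
    DPMs in the network (weight 0.1 here; 0.01 would be used on ImageNet)."""
    n = len(dpm_losses)
    avg_explicit = sum(l["explicit"] for l in dpm_losses) / n
    avg_consistent = sum(l["consistent"] for l in dpm_losses) / n
    avg_balance = sum(l["balance"] for l in dpm_losses) / n
    return ce_loss + weight * (avg_explicit + avg_consistent + avg_balance)
```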
4.3 CIFAR and CINIC-10 Experiments
To evaluate the effectiveness of DPM, we first perform extensive ablation experiments on three relatively small datasets to verify that DP Nets integrating DPMs outperform the corresponding baseline networks without bells and whistles, and then compare them with the state-of-the-art methods to demonstrate their superiority.
4.3.1 Category Number in Each Batch
As mentioned before, to handle the large category issue that affects the estimation of the decision consistent loss, we proposed a load-shuffle-split strategy. In this part, we evaluate the effects brought by this strategy. Since CIFAR-100 consists of images of 100 categories while the mini-batch size is 128, we choose it for this investigation. Specifically, we test 4 different configurations, each with around 128 images in a training batch for fair comparison.
The results in Tab. 1 show that DP-ResNet-20 and DP-ResNet-56 both obtain the best results when each training batch is restricted to images of 25 categories, instead of the default 100 categories. We therefore conclude that our load-shuffle-split strategy is useful for training DP Nets on large category datasets. However, when the number of categories in a training batch keeps decreasing, the performance drops heavily. The reason is that CNNs require the training data to be i.i.d., and reducing the category number in each batch hurts this distribution, degrading the performance. To validate this, we also run the baseline ResNet-20 and ResNet-56 with 10 categories in each batch and find that their performance also drops. Even so, our approach can leverage the strategy to handle the large category issue.
4.3.2 Auxiliary Category Number
To investigate the effects of the auxiliary category number in DPM, we follow the best configuration from the above experiments, setting the category number in each training batch to 25, and report the experimental results in Tab. 2. It can be seen that the performance with 2 auxiliary categories is good and stable, while that with larger auxiliary category numbers varies a lot between the two models. The reason is probably that the current supervision is not enough to enforce DPM to make use of more auxiliary categories. Besides, we will show that it is the decision scores that encode meaningful information about the original category, rather than the auxiliary category itself (see Sec. 4.3.6). Therefore, we simply set the auxiliary category number to 2.
4.3.3 Three Loss Functions
These experiments evaluate the influence of each loss function for training DPM by ablating one of them. The results in Tab. 3 show that the performance drops if we ablate any loss function. Particularly, the decision balance loss has the largest influence on the classification results, and the accuracies drop nearly 1% for both models, which probably indicates that DPM easily degenerates to consistently and explicitly assigning all pooled feature maps to a single auxiliary category. In addition, we would like to point out that DP-ResNets with one loss function ablated still outperform the baseline networks whose results are reported in Tab. 1, validating the effectiveness of DPM.
| Method | #Params | MACs | CIFAR-10 | CIFAR-100 | CINIC-10 |
| DDN [murthy-2016-decsion-mining-low-confident-recursive] * | - | - | 90.32 | 68.35 | - |
| DCDJ [baek-2017-deepDecisionJungle] * | - | - | - | 68.80 | - |
| DP-ResNet-56 | 894.96k | 0.14G | 94.35 (0.73) | 73.76 (1.53) | 85.50 (0.76) |
| DP-ResNet-110 | 1.81M | 0.28G | 94.56 (0.58) | 74.85 (0.91) | 86.34 (1.16) |
| DP-VGG13 | 9.49M | 0.23G | 94.61 (0.43) | 74.94 (0.52) | 85.59 (0.55) |
4.3.4 Comparisons with the State-of-the-art Methods
We conduct extensive experiments on three challenging datasets, CIFAR-10, CIFAR-100 and CINIC-10, with various popular architectures as backbones, including ResNets [He-2016-ResNet], Network in Network (NIN) [lin-2013-NetInNet], GoogLeNet [Szegedy-2015-Inception] and VGG [Simonyan-2014-VGG]. The results in Tab. 4 show that by integrating DPMs, all the networks consistently obtain significantly better performance (e.g., more than 1.5% improvement for DP-ResNet-56 on CIFAR-100), validating the effectiveness and versatility of DPM. Particularly, we would like to point out that DP-ResNet-56 outperforms the original ResNet-110 on both the CIFAR-10 and CINIC-10 datasets, with nearly half the number of parameters and multiply-and-accumulates (MACs). In addition, we also compare our approach with two recent decision tree based methods: Deep Convolutional Decision Jungle (DCDJ) [baek-2017-deepDecisionJungle] and Deep Decision Network (DDN) [murthy-2016-decsion-mining-low-confident-recursive]. Since they have not released their code, we compare against the results reported in the original papers. It can be seen that our DP-NIN outperforms both DCDJ and DDN, all using NIN as the backbone network. Finally, we compare our DPM with the SE block [Hu-2017-SENet], whose motivation is to improve the performance of various backbone architectures via attention. From Tab. 4, we can see that DPM is comparable with the SE block on classification and sometimes superior to it.
4.3.5 Complexity Analysis
To enable practical use, DPM is expected to provide an effective trade-off between complexity and performance. Therefore, we report complexity statistics in Tab. 4. It can be seen that, after integrating DPMs, the increases in the numbers of parameters and multiply-and-accumulates (MACs) are less than 5% of the original values, while the previous subsections have validated that the resulting improvements are significant. Therefore, we conclude that the overhead brought by DPM is worthwhile.
4.3.6 Visualization and Discussion
To investigate what DPM learns, we visualize the decisions made by the 9 DPMs of DP-ResNet-20 on 512 images from the CIFAR-10 dataset in Fig. 5, with the DPMs ordered from the earliest layer to the latest layer, left to right. For the decisions made by each DPM, all 512 images are positioned according to the decision scores assigned to them: the larger the decision score assigned to the first auxiliary category (of the two in total), the lower the image is located, while the images are randomly spread along the horizontal direction. Since the supervision is limited, the decision scores are concentrated in a small range instead of the whole [0, 1] interval, but we will show that this is enough to distinguish the categories.
We can see that the images in the first column are distributed along the vertical direction with "blue" images located above and "green" images below, indicating that the first DPM probably makes decisions based on color information. Although simple, frogs and airplanes are separated quite well, validating that low-level information is useful for classification. The second DPM seems not to work, assigning equal decision scores to the two auxiliary categories; this behavior is similar to ResNet allowing some gated shortcuts to be closed. Interestingly, the 3rd-5th DPMs make almost reversed decisions (e.g., a score is about one minus the score made by the other DPM), indicating that the auxiliary categories learned by different DPMs can differ, and that the neural network is able to decode these decisions. If we invert the decisions in the 5th column, we find them somewhat consistent with the decisions in the first column, but with better semantic clustering along the vertical direction. For example, all the airplanes (see the red circle in Fig. 5) are located in the upper part, while a "green" airplane is mistakenly located in the lower part within the first decisions (see the black rectangle in the first zoomed view in Fig. 5), validating that our approach can recover from some false decisions made before. We also visualize the decisions made by the last DPM, and find that the instances of the same category (e.g., airplane, automobile) are located within a quite small range along the vertical axis and are well separated from some other categories; we therefore conclude that it is the decision score that encodes meaningful information about the object category, rather than the auxiliary category itself. From the three zoomed views, we can see that the decisions are progressively refined, validating our intuition to propagate decisions.
Particularly, trucks and automobiles are located in similar vertical ranges, which could be considered as belonging to the coarse category "man-made objects" in the man-made category hierarchy. However, other "man-made objects", such as airplanes and ships, are mixed with objects belonging to the coarse category "animals". Therefore, we conclude that the decision made by DPM is not based on the man-made category hierarchy, but on another division that is better in the view of CNNs.
4.4 ImageNet Experiments
We also evaluate the performance of various DP Nets on ImageNet [Deng-2009-Imagenet]. The results in Tab. 5 show that DP Nets outperform all the baseline networks by large margins (e.g., 0.6% for ResNet-18 and GoogLeNet, 1.0% for ResNet-50 and ResNet-101) even when all 1000-category instances are randomly sampled into a training batch, in which case the decision consistent loss can hardly contribute to the training. Therefore, we also conduct experiments with batch sizes of 1024, 768 and 512, while enforcing the category numbers in a batch to be 250, 250 and 200 respectively using the load-shuffle-split strategy. The improvements in top-1 accuracy for all DP Nets are then enlarged to 1.0%-1.6%, validating the scalability of our approach.
5 Conclusion
We have presented the Decision Propagation Module (DPM), a novel drop-in computational unit that propagates the category-coherent decision made upon an early layer of a CNN to guide the latter layers for image classification. Decision Propagation Networks, generated by integrating DPMs into existing classification networks, can be trained in an end-to-end fashion and bring consistent improvements with minimal additional computational cost. Extensive comparisons validate the effectiveness and superiority of our approach. We hope DPM will become an important component of various networks for image classification. In the future, we plan to extend our approach to handle more vision tasks, e.g., detection and segmentation.