Dynamic Hierarchical Mimicking Towards Consistent Optimization Objectives

03/24/2020, by Duo Li, et al.

While the depth of modern Convolutional Neural Networks (CNNs) surpasses that of the pioneering networks by a significant margin, the traditional way of appending supervision only over the final classifier and progressively propagating gradient flow upstream remains the training mainstay. The seminal Deeply-Supervised Networks (DSN) were proposed to alleviate the difficulty of optimization arising from gradient flow through a long chain. However, DSN is still vulnerable to issues including interference with the hierarchical representation generation process and inconsistent optimization objectives, as illustrated theoretically and empirically in this paper. Complementary to previous training strategies, we propose Dynamic Hierarchical Mimicking, a generic feature learning mechanism, to advance CNN training with enhanced generalization ability. Partially inspired by DSN, we fork delicately designed side branches from the intermediate layers of a given neural network. Each branch can emerge from certain locations of the main branch dynamically, which not only retains representations rooted in the backbone network but also generates more diverse representations along its own pathway. We go one step further to promote multi-level interactions among different branches through an optimization formula with probabilistic prediction matching losses, thus guaranteeing a more robust optimization process and better representation ability. Experiments on both category and instance recognition tasks demonstrate the substantial improvements of our proposed method over its corresponding counterparts using diverse state-of-the-art CNN architectures. Code and models are publicly available at https://github.com/d-li14/DHM




1 Introduction

Figure 1: Illustration of the Dynamic Hierarchical Mimicking mechanism. The proposed framework attaches three side branches to the main branch. In these branches, the green layers represent standard convolutional layers while the red ones represent downsampling layers. The purple dots at the end of existing classifiers represent the probabilistic prediction outputs. Bidirectional dashed lines densely connected to these dots represent the knowledge transfer process through pair-wise soft label mimicking. Best viewed in color.

Convolutional neural networks (CNNs) have become the mainstream models for tackling a wide array of computer vision problems such as image classification [3, 33, 8, 13], object detection [30, 29, 23] and semantic image segmentation [24, 43, 1]. The advent of AlexNet [18], which achieved groundbreaking results in the ImageNet Large Scale Visual Recognition Challenge 2012 [3], ignited the resurgence of deep CNN architectures. Recently, along with the growing abundance of computational resources and development frameworks, a remarkable trend in modern CNN architectures is that more and more convolutional layers are stacked, interwoven with indispensable non-linear activation layers and down-sampling layers. With tens of millions of parameters and innovatively engineered connection topologies [8, 13, 41], CNNs are now capable of mining the intrinsic characteristics of images as superhuman image descriptors. Embracing these sophisticated CNNs as modeling tools, the past years have witnessed unprecedented achievements on a variety of visual recognition competitions [3, 5, 22]. However, within very deep network architectures, inappropriately designed blocks can impede the gradient flow across layers, causing critical gradient vanishing or parameter redundancy problems [15, 13].

The aforementioned nuisances motivate us to ease CNN model training and enhance generalization ability. One promising line of exploration lays emphasis on intermediate feature representations and hidden layer supervision. In the Inception series [36, 37], auxiliary classifiers are connected to two intermediate layers. Although joint optimization of the weighted auxiliary probabilistic predictions together with the original one can combat the gradient vanishing problem as expected, the gain in model performance is relatively minor (around 0.5% [36] or 0.4% [37]) as reported by the authors. Another contemporary work is DSN [20], which supervises each hidden layer of the deep architecture with very simple auxiliary classifiers, i.e., SVM or Softmax. As [32] suggests, imposing a very discriminative hint for classification on intermediate layers might be too aggressive to achieve promising performance at the top-most classifier. Analogously, the more recent MSDNet [14] empirically shows that introducing early-exit classifiers at intermediate layers leads to accuracy degradation of the final classifier, and attributes this phenomenon to the collapse of the progressive bottom-up feature generation process.

Our reasoning arises primarily from two critical standpoints. First, hierarchical root locations endow different auxiliary classifiers with the ability to capture much more diverse prediction representations, without interfering with the information flow of the main branch, as long as each classifier is delicately architecturally engineered according to the location of the intermediate layer to which it attaches. Nevertheless, this brings limited benefit to model generalization and accuracy under the previously prevailing joint optimization scheme used in [20, 36]. Second, we attribute the barrier to improving model generalization and accuracy to insufficient collaboration among the recognition knowledge extracted at diverse stages. The optimization directions of different auxiliary branches may fail to conform with one another, so the gradients flowing upstream to their common stem can counteract each other, imposing little positive optimization effect on the parameters of the shallowest layers shared by them all. In other words, within the training dynamics, potentially inconsistent optimization sub-objectives of different auxiliary classifiers can give rise to a suboptimal solution for the whole neural network.

Motivated by the issues above, we propose Dynamic Hierarchical Mimicking (DHM), a generic training framework amenable to any state-of-the-art CNN model, which noticeably improves performance on supervised visual recognition tasks compared with both standard top-most supervised training and the deeply supervised training scheme. As illustrated in Figure 1, our mechanism comprises two components that advance the training procedure collaboratively. On the one hand, we attach carefully designed auxiliary branches to some intermediate nodes of the backbone network. For the side branches, the locations of the corresponding root nodes are sampled from certain distributions (e.g., a uniform discrete distribution). In principle, the diverse auxiliary towers both inherit multi-level knowledge from the backbone network and advance the inherited knowledge through stacked modules to generate more diverse representations. On the other hand, with differentiable hierarchical predictions (i.e., probabilistic distribution outputs over training examples) extracted within a single CNN model at hand, it is natural to enable comprehensive knowledge interactions. To this end, we propose a novel joint optimization formula containing a pairwise probabilistic distribution matching loss applied between any two branches. This additional loss function enhances the opportunity for knowledge sharing and enforces optimization consistency across the whole network. Notably, we focus on improving the training procedure and discard all side branches during inference, introducing no computational overhead compared to standard inference.

We evaluate our method on two challenging image classification datasets, including the large-scale ImageNet benchmark, as well as two widely accepted person re-identification datasets, using state-of-the-art CNN architectures. The presented results indicate that deep CNN models trained with our proposed mechanism achieve significant improvements in accuracy and generalization ability over their counterpart baseline models.

2 Related Work

We review some related approaches in prior literature, from which our method draws inspiration. We also analyze their differences from our mechanism.

Auxiliary Supervision. To accelerate convergence and combat the gradient vanishing problem, supervision signals can be delivered to hidden layers more directly through auxiliary towers built on top of these intermediate layers. Two concurrent works utilizing this supervision methodology are GoogLeNet [36] and DSN [20], which are benchmarked with primitive deep neural networks on fundamental image classification tasks. Since its publication, this idea has been extended to various application fields, including edge detection [40], pose estimation [26], scene parsing [52], semantic segmentation [51] and other visual recognition tasks [21, 47, 25]. We conjecture that there still exists room for performance improvement by encouraging explicit knowledge interactions between each sampled pair of auxiliary classifiers, and we demonstrate this conjecture with our proposed DHM training strategy.

Network Regularization. Driven to suppress the over-fitting issue, Dropout [35] multiplies each hidden activation of a layer by a Bernoulli random variable during training, which effectively impels hidden nodes to learn independent representations. Follow-up works validate the advantage of this idea, such as MaxOut [7], DropConnect [38], DropIn [34] and DropBlock [6]. Batch Normalization unifies the layer-wise activation distribution to zero mean and unit variance, which mitigates the need for Dropout in modern CNNs like ResNet and DenseNet. Stochastic Depth [15] shifts the focus from micro-architecture to macro-architecture by stochastically discarding entire blocks to improve network resilience. FractalNet [19] develops Drop Path in a parallel fashion to discourage co-adaptation of subnetworks in a group. Furthermore, over-fitting can be partially attributed to the inadequate supply of large-scale data. Data transformation techniques are widely applied to synthetically augment the original datasets, e.g., reflectional padding, horizontal flipping, random cropping, color jittering and linear interpolation [18, 36, 11, 4, 48]. Our method serves as an effective supplement to these existing methods, behaving like a strong regularizer during the training process.

Knowledge Transfer. Our method also has a connection with the research field of Knowledge Transfer (KT). Top-performing deep CNN models suffer from intensive computational demands, which hinders their deployment in resource-aware applications. To narrow the gap between theoretical performance and real-world feasibility, Dark Knowledge Distillation [9] takes the probabilistic distribution prediction of a powerful but resource-hungry teacher model, or an ensemble of teacher models, as a soft target to jointly regularize the optimization objective when training a smaller student model on the given image samples and their one-hot labels. Intermediate feature maps have been demonstrated to be effective hints that further advance the knowledge distillation process [32, 42, 45]. Extending the concept of knowledge distillation and its variants, Deep Mutual Learning [50] shows that the teacher model can in turn benefit from the knowledge of the student model, in contrast to the prevailing one-way teaching mode. This newly established idea was soon applied to person re-identification tasks [50, 49]. Different from the methods above in focus and formulation, our motivation is to resolve the inherent deficiency hidden in the deeply-supervised training procedure, using representation mimicking as a tool. Our proposed Dynamic Hierarchical Mimicking can be viewed as an internal knowledge transfer process, confined within one single neural network, among hierarchical auxiliary classifiers, which has not yet attracted enough attention from the research community. We include a more comprehensive analysis of the differences between our method and KT in the supplementary materials.

3 Approach

In this section, we shed light on the intrinsic deficiency within the traditional hidden layer supervision and joint optimization scheme [20, 36]. Furthermore, we elaborate on the improved optimization objective of our proposed mechanism, highlighting its intuition and theoretical insights.

3.1 Analysis of Deep Supervision

Given a fully annotated dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ including $N$ examples collected from $C$ predefined classes, where $x_i$ is the $i$-th training example and $y_i$ is the corresponding ground-truth label, let $\mathbf{W}_m$ ($m$ in subscript denotes main branch) be the learnable weight matrices of an $L$-layer deep neural network, $\mathbf{W}_m^{(l)}$ be the part of weights from the bottom layer up to the $l$-th layer, and $\mathbf{p}_m(x_i; \mathbf{W}_m)$ be the $C$-dimensional probability distribution prediction output by the network for the training sample $x_i$. Then, the optimization objective of the standard training scheme with only top-most supervision can be defined as

$$\mathcal{L} = \sum_{i=1}^{N} \ell\big(\mathbf{p}_m(x_i; \mathbf{W}_m), y_i\big) + \lambda\, r(\mathbf{W}_m), \tag{1}$$

where $\mathcal{L}$ is the total loss over all training examples and $r(\mathbf{W}_m)$ is the regularization term (usually an $\ell_2$ or Frobenius norm) with $\lambda$ as the positive weighting factor. Specifically, $\ell$ is typically defined as the cross-entropy loss function

$$\ell\big(\mathbf{p}_m(x_i; \mathbf{W}_m), y_i\big) = -\sum_{c=1}^{C} \mathbb{1}[y_i = c] \log p_{m,c}(x_i; \mathbf{W}_m). \tag{2}$$

Note that the regularization term is an inherent attribute of the model structure and has no relation to supervision signals; for brevity and a clearer presentation of our main method, this term will be omitted in the following analysis. Now, Equation 1 reduces to

$$\mathcal{L} = \sum_{i=1}^{N} \ell\big(\mathbf{p}_m(x_i; \mathbf{W}_m), y_i\big). \tag{3}$$
Besides the existing top-most classifier attached to the final fully-connected layer, Deeply-Supervised Nets [20] append auxiliary classifiers over all hidden layers to create a more transparent learning process, in which classification error information no longer needs to travel a long distance over stacked modules to update the weights of a shallow layer. Instead, gradients can flow back through the nearest side branch in a much more straightforward manner. Let $\mathbf{W}_s = \{\mathbf{W}_s^{(l)}\}_{l=1}^{L-1}$ ($s$ in subscript denotes side branch) be the set of weight matrices collected from auxiliary classifiers attached on top of intermediate layers, where $\mathbf{W}_s^{(l)}$ denotes the weight matrix of the auxiliary classifier rooted in the $l$-th hidden layer. Then, without loss of generality, the optimization objective of the training scheme with deep supervision can be expressed as

$$\mathcal{L}_{\mathrm{DSN}} = \sum_{i=1}^{N} \ell\big(\mathbf{p}_m(x_i; \mathbf{W}_m), y_i\big) + \mathcal{L}_s, \tag{4}$$

where $\mathcal{L}_s$ is the weighted sum of losses from all auxiliary classifiers over all training examples, with $\alpha_l$ being the weighting factor of the $l$-th auxiliary classifier. Namely, $\mathcal{L}_s$ is defined as

$$\mathcal{L}_s = \sum_{l=1}^{L-1} \alpha_l \sum_{i=1}^{N} \ell\big(\mathbf{p}_s^{(l)}(x_i; \mathbf{W}_m^{(l)}, \mathbf{W}_s^{(l)}), y_i\big), \tag{5}$$

where $\mathbf{p}_s^{(l)}$ denotes the probabilistic prediction from the $l$-th auxiliary classifier. Thus the optimization objective of the contemporary work GoogLeNet [36] can be considered a special case of Equation 4 that appends auxiliary towers selectively over two intermediate layers of its main branch. Through the newly introduced loss term in Equation 4, intermediate layers gather gradients not only from the top-most supervision signal but also from the deep supervision signals, which is empirically demonstrated to be an effective method to combat the gradient vanishing problem and ensure faster convergence.
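To make Equation 4 concrete, here is a minimal NumPy sketch of the deeply supervised objective (an illustrative reimplementation, not the authors' released code): the summed cross-entropy of the main classifier plus an alpha-weighted cross-entropy term per auxiliary classifier.

```python
import numpy as np

def cross_entropy(p, y):
    # Summed cross-entropy of predictions p (N, C) against integer labels y (N,).
    return -np.log(p[np.arange(len(y)), y] + 1e-12).sum()

def deeply_supervised_loss(p_main, p_aux, alphas, y):
    # Equation 4: main-branch loss plus the weighted classification
    # loss of every auxiliary classifier.
    loss = cross_entropy(p_main, y)
    for alpha, p in zip(alphas, p_aux):
        loss += alpha * cross_entropy(p, y)
    return loss

y = np.array([0, 1])
p_main = np.array([[0.7, 0.2, 0.1], [0.2, 0.6, 0.2]])
p_aux = [np.array([[0.5, 0.3, 0.2], [0.3, 0.5, 0.2]])]
total = deeply_supervised_loss(p_main, p_aux, alphas=[0.3], y=y)
```

Each auxiliary term simply adds its own supervised gradient path; nothing yet couples the branches' predictions to one another, which is the gap the following analysis targets.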

However, directly attaching simple classifiers to the hidden layers may degrade performance when training very deep CNN models. Huang et al. [14] reported a series of similar experimental results and attributed the unsatisfactory performance to intermediate auxiliary classifiers interfering with the bottom-up feature generation process. Deep neural networks represent hierarchical information as a function of depth: features learned in the shallow layers have fine spatial resolution but lack semantic meaning, so their knowledge about category discrimination is much weaker than that of the deeper layers. This hypothesis is further supported by the fact that accuracy degradation becomes more pronounced when auxiliary classifiers are attached to earlier layers. In this view, using a very discriminative supervision signal at intermediate layers may be too aggressive: the directly attached auxiliary classifier demands that low-level features also be discriminative across categories, resembling the high-level features, thereby deviating the shallow features toward short-term objectives. If the original coarse-level representation is sacrificed in this way, the goal of precise visual recognition becomes hard to achieve. Following promising methods that alleviate this issue [36, 51], we resort to auxiliary classifiers with relatively complex structures. Specifically, each side branch consists of building blocks (e.g., residual blocks in ResNet) of the same type as the main branch. Furthermore, both the auxiliary and the original branches keep the same down-sampling rate along their own pathways to the ending softmax classifier. We expect these heuristic design principles to maintain the progressive feature generation process in the main branch (we leave the architectural details regarding diverse networks to Section 4). Comparative experiments demonstrate that these well-designed auxiliary classifiers do improve performance, though to a minor extent.

3.2 Dynamic Hierarchical Mimicking

As stated in the previous subsection, the locations and architecture design of auxiliary classifiers are essential to training the whole network. GoogLeNet [36] and MSDNet [14] provide alternative solutions respectively. The former attaches auxiliary towers only to the endpoints of two relatively deep intermediate Inception blocks to ease the training, while the latter introduces horizontal and vertical connections, maintaining coarse-level information in the earlier layers while improving feature discrimination. However, as both works clearly state, their proposed methods bring no obvious improvement in main-classifier accuracy over the standard training scheme, i.e., either around 0.5% or almost no change with a single auxiliary classifier.

We revisit the formulation of direct loss summation in Equation 4 and speculate that a more intrinsic deficiency lies in the optimization inconsistency among the added entries. Specifically, each entry represents a sub-objective function of the corresponding auxiliary classifier. Discrepancy in their optimization directions can hamper the overall training procedure, leaving negative effects on the final model accuracy. Therefore, the deeper concern, rarely explored in related works [20, 36, 37, 14, 40, 26, 52], is the lack of comprehensive interactions among the prediction outputs of the auxiliary classifiers and the top-most classifier of the network. Denoting the probabilistic representation information gathered by each branch as knowledge, we invest substantial research effort in designing a robust strategy that facilitates aggregation of the hierarchical knowledge extracted by classifiers located at different depths of the backbone network and relieves the optimization inconsistency among them.

Our core contribution is a novel knowledge matching loss that regularizes the training process towards optimization consistency efficiently and robustly. Based on the aforementioned analysis, we first attach delicately designed auxiliary classifiers to particular intermediate layers of a given network. The locations of these intermediate layers are dynamically drawn from a given discrete probability distribution at each training epoch. Besides collecting the classification losses from the auxiliary classifiers for straightforward optimization, we focus on their diverse representations along the pathway and their meaningful probability distribution predictions at the end. Once this knowledge from the auxiliary classifiers and the top-most classifier is generated, we immediately apply a pairwise probabilistic information mimicking strategy, enabling on-the-fly comprehensive knowledge interactions and hierarchical information sharing. The objective function is yet another cross-entropy loss between the probabilistic prediction outputs of any two branches expected to be consistently optimized, partially resembling the KT procedure [9, 42, 50] in form but compactly combined with the deep supervision methodology within one single CNN model. Below, we describe the mathematical formulation of our proposed Dynamic Hierarchical Mimicking strategy.

Let $\mathcal{P}$ denote a probability space spreading over all the indices of intermediate layers and $\mathcal{S} \sim \mathcal{P}$ be a dynamically sampled set containing the indices of the intermediate layers to which auxiliary classifiers are to be attached. Let $\mathcal{A} = \mathcal{S} \cup \{L\}$ and $\mathcal{W} = \{\mathbf{W}_s^{(l)}\}_{l \in \mathcal{S}} \cup \{\mathbf{W}_m\}$ respectively denote the set of indices and the set of weight matrices collected from all selected auxiliary classifiers and the top-most classifier. Let $c_l$ be a binary indicator function as

$$c_l = \begin{cases} 1, & l \in \mathcal{S}, \\ 0, & \text{otherwise}, \end{cases} \tag{6}$$

where $c_l = 1$ means there exists an activated auxiliary classifier connected to the $l$-th layer of the backbone. Then, following Equation 4, the optimization objective of our method is defined as

$$\mathcal{L}_{\mathrm{DHM}} = \sum_{i=1}^{N} \ell\big(\mathbf{p}_m(x_i; \mathbf{W}_m), y_i\big) + \mathcal{L}_s + \mathcal{L}_{km}, \tag{7}$$

where the second term samples auxiliary classification losses from certain locations via the binary mask $c_l$, which is defined as

$$\mathcal{L}_s = \sum_{l=1}^{L-1} c_l\, \alpha_l \sum_{i=1}^{N} \ell\big(\mathbf{p}_s^{(l)}(x_i; \mathbf{W}_m^{(l)}, \mathbf{W}_s^{(l)}), y_i\big), \tag{8}$$

while the last term $\mathcal{L}_{km}$ ($km$ in subscript denotes knowledge matching loss) represents the pairwise probabilistic prediction mimicking loss summed over all selected classifiers in couples, which is defined as

$$\mathcal{L}_{km} = \sum_{\substack{j, k \in \mathcal{A} \\ j \neq k}} \beta_{j,k} \sum_{i=1}^{N} \ell\big(\mathbf{p}^{(j)}(x_i), \mathbf{p}^{(k)}(x_i)\big), \tag{9}$$
where $\beta_{j,k}$ is a positive coefficient indicating the confidence of the knowledge transfer process from one classifier to another. Empirically, we find that using the same weight for all entropy losses (i.e., setting all $\beta_{j,k}$ to a single shared constant) works well in practice, so we did not undertake cumbersome tuning of these weighting parameters. To stabilize the training process and avoid an excessive regularization effect, in our main experiments we fork auxiliary classifiers from each node of the backbone network with probabilities toggling between zero and one, following a binary sampling strategy along the axis of network forward propagation. In addition, we also explore Bernoulli distributions for comparison in the supplementary materials. As stated above, the knowledge interaction process between any pair of activated network classifiers is expressed as a dual cross-entropy minimization. This loss term can be optimized with an easily implemented alternative, the Kullback-Leibler divergence, which differs from the original cross-entropy by nothing but a constant term. In principle, taking the temporary probabilistic distribution predicted by one network classifier as a fixed soft-label target forces the prediction of its matched classifier to become as similar as possible. In this way, knowledge currently learned by one classifier is transferred to the other, which accepts the corresponding soft labels as a smoother hint for guidance. By enabling dynamic knowledge mimicking among different combinations of network classifier cohorts in an on-the-fly fashion, our method enhances the capability of feature reuse across the whole network.
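The overall scheme can be sketched in a few lines of NumPy (a hypothetical illustration under simplifying assumptions: the names, the per-branch coin-flip sampling, and the shared coefficients are our illustrative choices, not the authors' released implementation). Side branches are activated at random, every activated classifier is supervised by the ground truth, and every ordered pair of activated classifiers adds a cross-entropy term with the partner's output as a soft target.

```python
import numpy as np

def hard_ce(p, y):
    # Cross-entropy against integer ground-truth labels, summed over the batch.
    return -np.log(p[np.arange(len(y)), y] + 1e-12).sum()

def soft_ce(p, target):
    # Cross-entropy with a partner classifier's output as a soft target.
    return -(target * np.log(p + 1e-12)).sum()

def dhm_loss(preds, y, alpha=0.3, beta=1.0, p_active=0.5, rng=None):
    # `preds` maps classifier names to (N, C) outputs; the last entry is
    # the top-most classifier, which is always active.
    rng = rng or np.random.default_rng()
    names = list(preds)
    side = [n for n in names[:-1] if rng.random() < p_active]
    active = side + [names[-1]]
    loss = hard_ce(preds[names[-1]], y)                      # top-most term
    loss += sum(alpha * hard_ce(preds[n], y) for n in side)  # sampled branches
    for j in active:                                         # pairwise mimicking
        for k in active:
            if j != k:
                loss += beta * soft_ce(preds[j], preds[k])
    return loss

y = np.array([0, 1])
p_branch = np.array([[0.6, 0.4], [0.3, 0.7]])
p_top = np.array([[0.5, 0.5], [0.4, 0.6]])
total = dhm_loss({"branch": p_branch, "top": p_top}, y, p_active=1.0)
```

In an autograd framework, each soft target would additionally be detached from the computation graph so that every mimicking direction behaves as a fixed soft label; in this NumPy sketch that distinction is implicit.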

Theoretical Analysis. By penalizing the distance between the probabilistic distributions generated by different classifiers, the proposed mechanism behaves as a strong regularizer and improves model generalization ability. Without loss of generality, we select two matched branches for illustration. Let $f$ denote the network modules in their shared path, and $g_1$ and $g_2$ denote the approximate functions computed along the two separate branches. The total loss of one branch can be represented as

$$\mathcal{L}_1 = \ell\big(g_1(h + \epsilon), y\big) + \ell\big(g_1(h + \epsilon), g_2(h + \epsilon)\big),$$

where $h = f(x)$ is the intermediate representation for input $x$ with $y$ as the corresponding label, and $\epsilon$ is the perturbation introduced by the randomness of data augmentation, with zero expectation and variance $\sigma^2$. We lay analytic emphasis on the mimicking loss term and derive that (refer to the supplementary materials for the detailed derivation)

$$\mathbb{E}_{\epsilon}\Big[\ell\big(g_1(h+\epsilon), g_2(h+\epsilon)\big)\Big] \approx \ell\big(g_1(h), g_2(h)\big) + \frac{\sigma^2}{2}\,\big\|\nabla_h g_1(h) - \nabla_h g_2(h)\big\|^2.$$

In the approximation step, higher-order infinitesimals of $\epsilon$ are omitted. The first term matches the probabilistic predictions of the paired branches while the second term penalizes inconsistent gradients with respect to their shared parameters in the stem path, regularizing the overall training process robustly.

Theoretically and intuitively, aggregating the knowledge of multiple existing network classifiers in this way addresses the concern raised at the beginning of this subsection.
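The substitution of KL divergence for the mimicking cross-entropy invoked earlier rests on a generic identity: with a fixed soft target, cross-entropy equals KL divergence plus the target's entropy, and the entropy term is constant with respect to the prediction. A quick numerical check (generic math, not code from the paper):

```python
import numpy as np

q = np.array([0.7, 0.2, 0.1])   # detached soft target from one classifier
p = np.array([0.5, 0.3, 0.2])   # prediction of the matched classifier

cross_entropy = -(q * np.log(p)).sum()         # H(q, p)
kl = (q * (np.log(q) - np.log(p))).sum()       # KL(q || p)
target_entropy = -(q * np.log(q)).sum()        # constant w.r.t. p
```

Since the two objectives differ only by that constant, their gradients with respect to the prediction coincide.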

4 Experiments

We conduct extensive experiments on several benchmark datasets to evaluate the effectiveness of our DHM method: the CIFAR-100 [17] and ILSVRC 2012 [3] datasets for image classification, and the Market-1501 [53] and DukeMTMC-reID [31] datasets for person re-identification. We follow the prevalent training schemes used in [18, 8, 44, 39, 13, 12, 2] for the single-classifier-based method and in [20, 36, 37, 14] for the auxiliary-classifier-based ones. Experiments with different training strategies are conducted with exactly the same setups for fair comparison, including data preprocessing, mini-batch sizes, training epochs and other relevant optimization hyper-parameters.

4.1 Category Recognition

4.1.1 Cifar-100

The CIFAR-100 dataset [17] is comprised of 50k training images and 10k test images, where each sample is a 32×32 color image from one of 100 categories. For data augmentation, we use the same data preprocessing method as [8, 20]. During training, images are zero-padded by 4 pixels on each side and 32×32 regions are then randomly cropped from the padded images or their horizontal flips. The transformed image samples are finally normalized by subtracting the mean pixel value and dividing by the standard deviation. During evaluation, error rates on the original test set are reported based on five successive runs with random seeds.
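The final normalization step amounts to standardizing by dataset statistics; a generic sketch (the random array here stands in for the actual CIFAR training split):

```python
import numpy as np

def normalize(images, mean, std):
    # Standardize images by training-set mean and standard deviation,
    # the last preprocessing step described above.
    return (images - mean) / std

train = np.random.default_rng(0).uniform(0.0, 1.0, size=(8, 32, 32, 3))
mean = train.mean(axis=(0, 1, 2))   # per-channel statistics
std = train.std(axis=(0, 1, 2))
out = normalize(train, mean, std)
```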

Architecture Method Top-1 Error(%)
ResNet-32 baseline
ResNet-110 baseline
ResNet-1202 baseline
DenseNet (d=40, k=12) baseline
DenseNet (d=100, k=12) baseline
WRN-16-8 baseline
WRN-28-10 baseline
Table 1: Top-1 error comparisons on CIFAR-100. Our results were obtained by computing the mean and standard deviation over 5 runs (given in mean ± std format in the table). The best result for each network architecture is highlighted in bold. baseline denotes a model trained without auxiliary classifiers, DSL denotes a model facilitated by Deeply Supervised Learning, and DHM denotes a model further boosted with Dynamic Hierarchical Mimicking; the same notations are used below.

We apply different training strategies to three state-of-the-art CNN architectures for comprehensive comparisons, including ResNet [8], DenseNet [13] and WRN [44] with varied depths. For training, we use SGD with momentum as the default optimizer, with an initial learning rate of 0.1 and momentum of 0.9. We set the batch size to 128 and the weight decay to 0.0001 for all experiments. The learning rate annealing schedule follows the default settings of the original works proposing the corresponding network architectures. We fork two carefully designed auxiliary classifiers before or after each down-sampling layer of these CNN architectures, i.e., after every residual stage for ResNet and WRN and after every transition layer for DenseNet. All auxiliary branches have the same heuristically designed macro-structure as the main branch, i.e., stacked building blocks followed by a global average pooling layer and a fully connected layer (refer to the supplementary materials for architectural details).

Experimental results are summarized in Table 1 for clear comparison. Deep supervision with carefully designed auxiliary classifiers consistently improves model performance, though the gain is relatively minor. Comparatively, our DHM method brings considerable further gain over the deep supervision scheme and a more impressive gain over the baseline across all network architectures. These experiments validate the effectiveness of our method, especially the vital importance of the knowledge interaction process inside a single CNN model.

Architecture Method Top-1 / Top-5 Error(%)
ResNet-18 baseline 30.046 / 10.752
DSL 29.728 / 10.450
DHM 28.714 / 9.940
ResNet-50 baseline 23.990 / 7.166
DSL 23.874 / 7.074
DHM 23.430 / 6.764
ResNet-101 baseline 22.636 / 6.362
DSL 22.260 / 6.128
DHM 21.348 / 5.684
ResNet-152 baseline 21.894 / 5.886
DSL 21.602 / 5.824
DHM 20.810 / 5.396
Table 2: Top-1/Top-5 error comparisons on the ILSVRC 2012 validation set, with the single center crop testing method.

4.1.2 ImageNet

We also perform experiments on the large-scale ImageNet dataset [3], a much more challenging benchmark. It consists of around 1.2 million training images and 50k validation images, labeled with 1,000 object classes. For training data processing, we use scale and aspect ratio augmentation and horizontal flipping as in [36, 13]. Following common practice, top-1/top-5 error rates are reported on the validation set using single-crop testing.

We select the widely-used ResNet with varied depths as the backbone network for evaluation. We use the default input image resolution (224×224), batch size (256), training epochs (90) and optimizer (SGD with momentum 0.9 and weight decay 0.0001) for training. The learning rate starts at 0.1 and is decayed by a factor of 0.1 every 30 epochs. We choose the best model according to validation accuracy among all training epochs, since the baseline models are prone to over-fitting over the course of training. Noticing that our method behaves as a strong regularizer, scale augmentation is disabled for the relatively shallow ResNet models to avoid an excessive regularization effect when training with our hierarchical mimicking methodology. We attach two auxiliary classifiers, one after each of the two groups of residual blocks preceding the final stage (i.e., the conv3_x and conv4_x groups in [8]). All auxiliary classifiers are constructed with sequential residual blocks, a global average pooling layer and a fully connected layer, though they differ in the depth and width of their residual blocks (refer to the supplementary materials for architectural details).
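The stated schedule (initial rate 0.1, decayed by a factor of 0.1 every 30 of the 90 epochs) amounts to a simple step function; a minimal sketch:

```python
def step_lr(epoch, base_lr=0.1, gamma=0.1, step=30):
    # Piecewise-constant schedule: 0.1 for epochs 0-29,
    # 0.01 for 30-59, 0.001 for 60-89.
    return base_lr * gamma ** (epoch // step)
```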

Experimental results are summarized in Table 2. Since the ImageNet dataset is much more difficult than CIFAR-100, the performance gain over the baseline model is smaller in absolute terms. Nevertheless, the Deeply Supervised Learning method still boosts model accuracy, and our Dynamic Hierarchical Mimicking mechanism further outperforms the deep supervision strategy by relatively considerable margins in top-1/top-5 accuracy. Even with the very deep ResNet-101 and ResNet-152 architectures, a substantial improvement of over 1% top-1 accuracy over the baseline is achieved. Please refer to the supplementary materials for the complete training curves.

Architecture Method Market-1501 DukeMTMC
mAP R-1 mAP R-1
ResNet-50 (w/ pretrain)
baseline 70.3 88.5 59.4 78.2
DSL 72.0 88.2 60.5 78.8
DHM 76.7 90.3 65.4 81.1
MobileNet (w/o pretrain)
baseline 55.6 78.2 45.7 69.0
DSL 55.6 77.4 46.9 68.7
DHM 59.1 79.0 50.6 70.5
Table 3: Rank-1 accuracy and mAP on the Market-1501 and DukeMTMC-reID datasets. R-1 denotes Rank-1 accuracy. w/ pretrain and w/o pretrain means with and without ImageNet pre-trained weights loaded respectively.

4.2 Instance Recognition

We further conduct experiments on two popular person re-identification datasets to demonstrate the effectiveness of our method on this more challenging instance recognition problem. The Market-1501 [53] dataset has 32,668 bounding boxes drawn from 1,501 identities captured by 6 different cameras near a supermarket at Tsinghua University, including 12,936 training images, 15,913 gallery images and 3,368 query images, detected by DPM [27]. The DukeMTMC-reID [31] dataset, captured by 8 cameras (2 more than Market-1501), serves as one of the most challenging re-ID datasets to date; it contains 1,404 identities, 16,522 training examples, 17,661 gallery images and 2,228 queries.

We adopt the prevalent ResNet-50 [8] and the scalable MobileNet [10] as backbone networks and the simple cross-entropy as the loss function since our benchmark goal is to evaluate the newly proposed training mechanism. For the ResNet-50 backbone, training samples are resized slightly larger than the target size, then cropped to regions and augmented with horizontal flipping and normalization. We set the batch size as 32 and train for 60 epochs with the AMSGRAD [28] optimizer (, weight decay=0.0005). The learning rate starts at 0.0003 and is divided by 10 every 20 epochs. Architectural design of auxiliary classifiers for ResNet-50 is totally identical to the one for experiments on ImageNet. Before standard training, ImageNet pre-trained weights of corresponding layers in the ResNet architecture are loaded, while all newly introduced layers for the re-ID model together with auxiliary classifiers without available pre-trained weights are trained for 10 epochs in advance with the pre-trained layers fixed. We also apply label smoothing during training since images in the re-ID datasets are not diverse enough. For the MobileNet backbone, we do not load the pre-trained weights, so we keep all the other settings the same but increase the initial learning rate to 0.001 and train longer for 90 epochs in total, decayed every 30 epochs with the same factor. We leave detailed structure of auxiliary classifiers for MobileNet in the supplementary materials. We report both mAP and Rank-1 accuracy under the single-query mode.
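Label smoothing, mentioned above, mixes the one-hot target with a uniform distribution over classes. The sketch below uses an assumed smoothing factor eps = 0.1, which the text does not specify.

```python
def smooth_labels(target, num_classes, eps=0.1):
    """Return the smoothed target distribution: the one-hot vector for
    `target` mixed with a uniform distribution over all classes.
    `eps` is an assumed smoothing factor, not taken from the paper."""
    uniform = eps / num_classes
    return [uniform + (1.0 - eps) * (1.0 if c == target else 0.0)
            for c in range(num_classes)]
```

The cross-entropy loss is then computed against this soft distribution instead of the hard one-hot target.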

Side Branches | Top-1 / Top-5 Error (%)
none (baseline) | 30.046 / 10.752
B1 only | 29.276 / 10.272
B2 only | 29.248 / 10.258
B1 & B2 | 28.814 / 9.940
B1 & B2 & B3 | 29.220 / 10.044
Table 4: Performance comparison of different configurations of auxiliary branches. B1 and B2 denote the two default auxiliary branches and B3 denotes the extra branch forked from a shallower layer.

From the comparison in Table 3, we observe that the deep supervision strategy leads to performance comparable to the baseline method, if not worse. On the other hand, it is noteworthy that our hierarchical mimicking methodology outperforms the baseline with very compelling performance across different datasets and backbones. Especially under the more comprehensive mAP evaluation protocol, models trained with our proposed method achieve a margin of over 6% mAP on both datasets using the pre-trained ResNet-50 backbone.

4.3 Ablation Analysis

4.3.1 Knowledge Transfer Direction

From the view of Knowledge Transfer (KT), peer classifiers selected for mimicking can be deployed in a unidirectional or bidirectional mode. The unidirectional mode includes two specific configurations. One is the top-down configuration, in which only the probabilistic predictions from auxiliary classifiers connected to shallower layers are impelled to mimic those from deeper layers. The situation reverses in the bottom-up configuration. Inspired by [50], we heuristically prefer the bidirectional configuration, which combines the top-down and bottom-up mimicking directions; it is the default choice in our main experiments. For rigorous verification, we also report the results of the two unidirectional configurations using the ResNet-18 model on the ImageNet dataset. The bottom-up and top-down configurations yield slight decreases in top-1 error, to 29.670 and 29.385 respectively, compared with the baseline at 30.046 and the DSL method at 29.728. This comparison preliminarily validates the regularization effect of our method as mathematically revealed in Equation 3.2. Recalling that the resulting error of bidirectional mimicking in Table 2 is 28.814, we deduce that the bidirectional mode further boosts the knowledge interaction that remains insufficient in the one-way skewed scenarios above.
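A minimal sketch of the bidirectional matching term, written with plain-Python softmax and KL divergence for illustration; in practice each KL term would update only the mimicking side, a detail omitted here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    """KL divergence KL(p || q) between two probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bidirectional_mimicking_loss(logits_shallow, logits_deep):
    """Top-down term (shallow classifier mimics deep) plus bottom-up
    term (deep mimics shallow), matching the bidirectional mode above."""
    p = softmax(logits_shallow)
    q = softmax(logits_deep)
    return kl(q, p) + kl(p, q)
```

The two unidirectional configurations correspond to keeping only one of the two KL terms.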

Branch | Method | Top-1 / Top-5 Error (%)
B0 (main) | independent | 30.046 / 10.752
B0 (main) | mimicking | 28.814 / 9.940
B1 | independent | 27.988 / 9.560
B1 | mimicking | 27.626 / 9.176
B2 | independent | 31.458 / 11.522
B2 | mimicking | 29.544 / 10.370
Table 5: Influence of hierarchical mimicking on each branch, where B0 denotes the main branch and B1, B2 the auxiliary branches. Branches are evaluated in the same way as all the baseline experiments on ImageNet. 'independent' means an individual classifier isolated from the other branches; 'mimicking' means the classifier trained together with its peers using our DHM mechanism.

4.3.2 Design of Auxiliary Branches

Appropriate design of auxiliary classifiers is of vital importance to the final performance of deeply supervised learning and our proposed method. We perform experiments on the ImageNet dataset with ResNet-18 to analyze the influence of various configurations of auxiliary classifiers. We denote the main branch as B0, whose independent performance is identical to the baseline model. The auxiliary branches attached after the final and the penultimate group of residual blocks are denoted as B1 and B2 respectively. We experiment with discarding one of the auxiliary branches B1 and B2, or appending another auxiliary branch, B3, to shallower intermediate layers (refer to the supplementary materials for details about its location and architecture). From Table 4, we notice that models trained with our proposed mechanism consistently outperform the baseline model, regardless of the number of auxiliary branches; one extra auxiliary classifier is sufficient to boost performance by a non-negligible margin. Furthermore, we infer that the substantial gain of our proposed method does not arise from blindly increasing model capacity by adding more auxiliary branches, since the triple-branch model starts to show declining performance. Hence we adopt the double-branch model in our main experiments, which achieves more satisfactory performance regarding both efficacy and efficiency. We also shed light on the influence of hierarchical mimicking on each auxiliary classifier. To this end, we isolate each auxiliary branch from the main branch and train these classifiers separately. From the results shown in Table 5, it is obvious that all the auxiliary branches benefit from the regularization within our proposed optimization mechanism, compared to being optimized independently.

5 Conclusion

In this paper, we propose a general-purpose optimization mechanism named DHM, which effectively and robustly facilitates the CNN training process without introducing computational cost during inference. Through delving into the training dynamics of deep supervision, a novel representation mimicking loss is considered to advance gradient consistency among optimization objectives of different delicately designed auxiliary branches. We theoretically and empirically demonstrate that this approach is beneficial to improving the accuracy and generalization ability of powerful neural networks on various visual recognition tasks.

Appendix A Architectural Design of Auxiliary Classifiers

Following the descriptions above, we always attach two auxiliary branches on top of certain intermediate layers of the backbone networks. For brevity, we denote the main branch as B0 and the auxiliary branch close to (away from) the top-most classifier as B1 (B2). In the architecture engineering process, we heuristically follow three principles: (i) building blocks in the auxiliary branches are the same as those in the original main branch, for architectural identity; (ii) from the common input to the end of every branch, the number of down-sampling layers is kept the same, guaranteeing uninterrupted coarse-to-fine information flow; (iii) a broader pathway for B2 and a shorter pathway for B1 are preferable in our design.
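Principle (ii) can be checked mechanically. The helper below is a hypothetical utility (not from the paper) that counts stride-2 stages along a pathway so a branch can be compared against the main branch.

```python
def downsampling_count(strides):
    """Number of stride-2 (down-sampling) stages along one pathway,
    listed from the common input to that branch's classifier."""
    return sum(1 for s in strides if s == 2)

def satisfies_principle_ii(main_strides, branch_strides):
    """True if a branch performs the same number of down-sampling steps
    as the main branch, as required by principle (ii)."""
    return downsampling_count(main_strides) == downsampling_count(branch_strides)
```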

a.1 Various Networks on the CIFAR-100 dataset

We append two auxiliary branches to popular networks of varied depths. Refer to Tables 6, 7 and 8 for the detailed architectural design of these auxiliary branches in ResNet [8], DenseNet [13] and WRN [44] respectively.

layer name | output size | ResNet-32 | ResNet-110 | ResNet-1202
conv1 | 32×32 | 3×3, 16 | 3×3, 16 | 3×3, 16
conv2_x | 32×32 | ×5 | ×18 | ×200
conv3_x | 16×16 | ×5 / ×5 | ×18 / ×9 | ×200 / ×100
conv4_x | 8×8 | ×5 / ×3 / ×5 | ×18 / ×9 / ×18 | ×200 / ×100 / ×200
classifier | 1×1 | average pool, 100-d fc, softmax
Table 6: Architectures of the ResNet family with auxiliary branches for CIFAR-100. Each entry gives the number of stacked residual blocks on the main branch and, after the slashes, on the auxiliary branches. Downsampling is performed by conv3_1 and conv4_1 with a stride of 2.

layer name | output size | DenseNet-40 (k=12) | DenseNet-100 (k=12)
conv1 | 32×32 | 3×3, 2k | 3×3, 2k
conv2_x | 32×32 | [3×3, k] ×12 | [3×3, k] ×32
conv3_x | 16×16 | [3×3, k] ×12 / [3×3, k] ×12 | [3×3, k] ×32 / [3×3, k] ×16
conv4_x | 8×8 | [3×3, k] ×12 / [3×3, k] ×6 / [3×3, 3k] ×12 | [3×3, k] ×32 / [3×3, k] ×16 / [3×3, 3k] ×32
classifier | 1×1 | average pool, 100-d fc, softmax
Table 7: Architectures of the DenseNet family with auxiliary branches for CIFAR-100. Dense blocks are shown in brackets with the numbers of blocks stacked (main branch first, auxiliary branches after the slashes). Downsampling is performed by transition layers inserted between conv2_x, conv3_x and conv4_x with a stride of 2.
layer name | output size | WRN-16-8 | WRN-28-10
conv1 | 32×32 | 3×3, 16 | 3×3, 16
conv2_x | 32×32 | ×2 | ×4
conv3_x | 16×16 | ×2 / ×2 | ×4 / ×4
conv4_x | 8×8 | ×2 / ×1 / ×2 | ×4 / ×2 / ×4
classifier | 1×1 | average pool, 100-d fc, softmax
Table 8: Architectures of the Wide Residual Network family with auxiliary branches for CIFAR-100. Each entry gives the number of stacked residual blocks on the main branch and, after the slashes, on the auxiliary branches. Downsampling is performed by conv3_1 and conv4_1 with a stride of 2.

a.2 ResNet on the ImageNet dataset

We also append two auxiliary branches to certain locations of the ResNet [8] backbone for the main experiments on the ImageNet dataset. For the ablation study, we further consider a third branch, B3, connected to a shallower intermediate layer of ResNet-18 and named in accordance with the order of the subscripts. Refer to Table 9 for full configurations, including the specific numbers of residual blocks.

layer name | output size | 18-layer | 50-layer | 101-layer | 152-layer
conv1 | 112×112 | 7×7, 64, stride 2
conv2_x | 56×56 | 3×3 max pool, stride 2; then ×2 | ×3 | ×3 | ×3
conv3_x | 28×28 | ×2 / ×1 | ×4 | ×4 | ×8
conv4_x | 14×14 | ×2 / ×1 / ×1 | ×6 / ×3 | ×23 / ×12 | ×36 / ×18
conv5_x | 7×7 | ×2 / ×2 / ×2 / ×2 | ×3 / ×2 / ×3 | ×3 / ×3 / ×2 | ×3 / ×2 / ×3
classifier | 1×1 | average pool, 1000-d fc, softmax
Table 9: Architectures of the ResNet family with auxiliary branches for ImageNet. Each entry gives the number of stacked residual blocks on the main branch and, after the slashes, on the auxiliary branches (including B3 for the 18-layer ablation). Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2.

a.3 MobileNet on Re-ID datasets

For MobileNet used on the Re-ID tasks, we fork two auxiliary branches from the network stem, consisting of depthwise separable convolutions resembling the basic modules in the backbone. Refer to Table 10 for architectural details of both main and auxiliary branches.

Conv(3, 32) / s2
Conv(3, 32) dw / s1
Conv(1, 64) / s1
Conv(3, 64) dw / s2
Conv(1, 128) / s1
Conv(3, 128) dw / s1
Conv(1, 128) / s1
Conv(3, 128) dw / s2
Conv(1, 256) / s1
Conv(3, 256) dw / s1
Conv(1, 256) / s1
Conv(3, 256) dw / s2 Conv(3, 256) dw / s2
Conv(1, 256) / s1 Conv(1, 256) / s1
Conv(3, 512) dw / s1 Conv(3, 512) dw / s1
Conv(1, 512) / s1 Conv(1, 512) / s1
Conv(3, 512) dw / s2 Conv(3, 512) dw / s2 Conv(3, 512) dw / s2
Conv(1, 1024) / s1 Conv(1, 1024) / s1 Conv(1, 2048) / s1
Conv(3, 1024) dw / s2 Conv(3, 1024) dw / s2 Conv(3, 2048) dw / s2
Conv(1, 1024) / s1 Conv(1, 1024) / s1 Conv(1, 2048) / s1
Avg Pool / s1 Avg Pool / s1 Avg Pool / s1
FC / s1 FC / s1 FC / s1
Softmax Classifier / s1 Softmax Classifier / s1 Softmax Classifier / s1
Table 10: Architecture of the MobileNet body with auxiliary branches used in person re-identification tasks. Conv(k, c) denotes convolutional filters with kernel size k and output channel c, ‘dw’ denotes depthwise convolution, s1 and s2 specify the stride in the corresponding layer.

Appendix B Training Curves on the ImageNet dataset

We attach the training curves of the representative ResNet-101 and ResNet-152 on ImageNet in Figure 2. Very deep ResNets with tens of millions of parameters are prone to over-fitting. With our proposed Dynamic Hierarchical Mimicking, the training accuracy curve tends to be lower than those of both plain training and Deeply Supervised Learning, yet our methodology yields a substantial gain in validation accuracy over the other two. We infer that our training scheme implicitly achieves a strong regularization effect that enhances the generalization ability of deep convolutional neural networks.

Figure 2: Curves of top-1 training (solid lines) and validation (dashed lines) accuracy of ResNet-101 (left) and ResNet-152 (right) on the ImageNet dataset trained with different mechanisms. The zoomed-in region shows that the model trained with our DHM method achieves the lowest training accuracy but the highest validation accuracy. Best viewed in color.

Appendix C Implicit Penalty on Inconsistent Gradients

The derivation process of Equation 3.2 is presented here in detail. A similar analysis can be conducted on the paired branch.

Appendix D Effect of Bernoulli Sampling

In the main experiments, the auxiliary classifiers forked from certain locations of the backbone network are kept active throughout training. Here, as a justification for more sophisticated stochastic sampling methods, we use the CIFAR-100 dataset and the shallow ResNet-32 model as the test case. We maintain the original settings for the structures of the auxiliary classifiers and collect cross-entropy losses from all of these classifiers. We then stochastically discard some of the auxiliary branches, depending on i.i.d. samples drawn from a multivariate Bernoulli distribution (each variate associated with one auxiliary branch) with probability 0.5, when calculating mimicking losses at each training epoch. With stochastically activated branches for interaction, a much stronger regularization effect is achieved even with this small network: the ResNet-32 model trained with this Bernoulli sampling policy outperforms all of its counterparts in Table 1 in terms of (mean ± std.) top-1 error.
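The epoch-wise Bernoulli sampling can be sketched as follows; the helper names are illustrative, not taken from the released code.

```python
import random

def sample_active_branches(num_branches, p=0.5, rng=None):
    """Draw an i.i.d. Bernoulli(p) activation mask over the auxiliary
    branches, resampled once per training epoch."""
    rng = rng or random
    return [1 if rng.random() < p else 0 for _ in range(num_branches)]

def masked_mimicking_loss(pairwise_losses, mask):
    """Sum only the mimicking terms whose two endpoint branches are both
    active under the sampled mask; `pairwise_losses` maps branch-index
    pairs (i, j) to their loss values."""
    return sum(v for (i, j), v in pairwise_losses.items()
               if mask[i] and mask[j])
```

Cross-entropy losses are still collected from all classifiers; only the mimicking interactions are gated by the sampled mask.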

Appendix E Experiments on Corrupt Data

We further explore the flexibility of our method when applied to corrupt data [46], i.e., when part of the ground-truth labels in the dataset are replaced with random labels. The best-performing WRN-28-10 architecture from our spectrum of CIFAR-100 experiments is used as the testbed. We vary the corruption ratio from 0.2 to 0.5 and observe the corresponding performance change. When 20% of the training labels are corrupt, the top-1 accuracy of the baseline model drops by nearly 10 percentage points, while the model trained with our proposed mechanism degrades considerably less, a remarkable margin considering that the improvement on clean data is only around 2%. As the corruption ratio increases to 50%, the baseline model drops by another 10 percentage points, while ours maintains a margin of around 3%. From Figure 3, we observe that training accuracy approaches 100% even on corrupt data while validation accuracy suffers a sharp decline, which implies severe over-fitting. Intriguingly, our proposed hierarchical mimicking mechanism achieves a larger margin in this corrupt setting, demonstrating its powerful regularization effect in suppressing random-label disturbance.
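The label-corruption protocol described above can be reproduced with a few lines; the function below is a sketch of the procedure, with a fixed seed added for reproducibility.

```python
import random

def corrupt_labels(labels, num_classes, ratio, seed=0):
    """Replace a `ratio` fraction of ground-truth labels with labels
    drawn uniformly at random over all classes."""
    rng = random.Random(seed)
    labels = list(labels)
    n_corrupt = int(len(labels) * ratio)
    for i in rng.sample(range(len(labels)), n_corrupt):
        labels[i] = rng.randrange(num_classes)
    return labels
```

Note that a randomly drawn label may coincide with the original one, so the effective corruption rate is slightly below `ratio`.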

Figure 3: Curves of top-1 training and validation accuracy of WRN-28-10 on the corrupt CIFAR-100 dataset under different training mechanisms. 'baseline' denotes the plain optimization scheme without auxiliary branches; 'mimicking' denotes our proposed methodology. The left sub-figure is obtained on the CIFAR-100 training set with a corruption ratio of 0.2, the right one with a corruption ratio of 0.5. Results are bounded by the range of 5 successive runs. Best viewed in color.

Appendix F Experiments Using WRN with Dropout

Reminiscent of the regularization efficiency of dropout layers in Wide Residual Networks [44], we extend our CIFAR-100 experiments to WRN-28-10 equipped with dropout. There is an evident decrease in top-1 error compared with vanilla WRN-28-10. Applying our hierarchical mimicking method to the training of WRN-28-10 (dropout = 0.3) decreases the top-1 error further. We conclude that our proposed method has no counteractive effect on previously popular regularization techniques such as dropout, and is complementary to them in achieving higher accuracy with powerful CNNs.

Appendix G Comparison to Knowledge Transfer Research

Our knowledge matching loss is partially inspired by the line of Knowledge Transfer (KT) research, but we shift its primary focus away from the model compression pursued by conventional KT methods. The representative Dark Knowledge Distillation [9] requires a large teacher model to aid the optimization of a small student model by offering informative hints in the form of probabilistic predictions used as soft labels; in this framework, which aims at easing the optimization difficulty of small networks, a strong model must be available beforehand. In contrast, we concentrate on developing a deeply supervised training scheme that further boosts the optimization of state-of-the-art CNNs rather than compact models. Moreover, unlike the teacher and student in the distillation procedure, which are optimized sequentially without direct association during their separate training processes, our training strategy drives all auxiliary branch classifiers, together with the original classifier, to be optimized simultaneously, with a knowledge matching loss computed among them on the fly. Knowledge transfer thus occurs in a more compact way within our proposed mechanism, which enables knowledge sharing across hierarchical layers of a single network without requiring an extra teacher model. Our knowledge integration learning scheme is therefore ready to be deployed in the optimization of any convolutional neural network, lightweight or heavy.

Appendix H Visualization of Improved Representation Consistency

To visualize the improved intermediate features, we select a side branch and the main branch of the ResNet-152 model, take the maximum from each kernel of the middle layer in the residual blocks, and normalize the values across channels and filters. Correlation matrices are then calculated between the corresponding convolutional layers of these two branches. Representative comparisons are illustrated in Figure 4, in which our proposed method leads to clearly higher correlation values.
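A simplified, pure-Python sketch of this measurement, assuming per-kernel maxima are collected into equally-shaped matrices for the two branches; the Pearson correlation below is a stand-in for the heatmap computation, whose exact normalization the text does not fully specify.

```python
def kernel_max_matrix(weights):
    """weights: filters -> kernels (one per input channel) -> flat taps.
    Returns the (output-channel x input-channel) matrix of per-kernel maxima."""
    return [[max(kernel) for kernel in filt] for filt in weights]

def correlation(a, b):
    """Pearson correlation between two flattened, equally-shaped matrices
    from corresponding layers of the two branches."""
    fa = [v for row in a for v in row]
    fb = [v for row in b for v in row]
    ma, mb = sum(fa) / len(fa), sum(fb) / len(fb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(fa, fb))
    var = (sum((x - ma) ** 2 for x in fa)
           * sum((y - mb) ** 2 for y in fb)) ** 0.5
    return cov / var if var else 0.0
```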

Figure 4: Correlation heatmaps of conv4_1, conv4_10, conv4_17 and conv5_2 in the ResNet-152 model. In each sub-figure, the left panel corresponds to the model trained with Deeply Supervised Learning, while the right panel corresponds to the model trained with our proposed Dynamic Hierarchical Mimicking strategy. The x-axis and y-axis represent the input and output channel indices of a convolutional layer, respectively.


  • [1] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, Cited by: §1.
  • [2] Y. Chen, J. Li, H. Xiao, X. Jin, S. Yan, and J. Feng (2017) Dual path networks. In NIPS, Cited by: §4.
  • [3] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR, Cited by: §1, §4.1.2, §4.
  • [4] T. Devries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. CoRR abs/1708.04552. External Links: Link, 1708.04552 Cited by: §2.
  • [5] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman (2010-06-01) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338. External Links: ISSN 1573-1405, Document, Link Cited by: §1.
  • [6] G. Ghiasi, T. Lin, and Q. V. Le (2018) DropBlock: a regularization method for convolutional networks. In NeurIPS, Cited by: §2.
  • [7] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio (2013) Maxout networks. In ICML, Cited by: §2.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §A.1, §A.2, §1, §4.1.1, §4.1.1, §4.1.2, §4.2, §4.
  • [9] G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop, External Links: Link Cited by: Appendix G, §2, §3.2.
  • [10] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861. External Links: Link, 1704.04861 Cited by: §4.2.
  • [11] A. G. Howard (2013) Some improvements on deep convolutional neural network based image classification. CoRR abs/1312.5402. External Links: Link, 1312.5402 Cited by: §2.
  • [12] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In CVPR, Cited by: §4.
  • [13] G. Huang, Z. Liu, L. v. Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In CVPR, Cited by: §A.1, §1, §4.1.1, §4.1.2, §4.
  • [14] G. Huang, D. Chen, T. Li, F. Wu, L. van der Maaten, and K. Weinberger (2018) Multi-scale dense networks for resource efficient image classification. In ICLR, Cited by: §1, §3.1, §3.2, §3.2, §4.
  • [15] G. Huang, Y. Sun, Z. Liu, D. Sedra, and K. Q. Weinberger (2016) Deep networks with stochastic depth. In ECCV, Cited by: §1, §2.
  • [16] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, Cited by: §2.
  • [17] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto. Cited by: §4.1.1, §4.
  • [18] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NIPS, Cited by: §1, §2, §4.
  • [19] G. Larsson, M. Maire, and G. Shakhnarovich (2017) FractalNet: ultra-deep neural networks without residuals. In ICLR, Cited by: §2.
  • [20] C. Lee, S. Xie, P. Gallagher, Z. Zhang, and Z. Tu (2015) Deeply-supervised nets. In AISTATS, Cited by: §1, §1, §2, §3.1, §3.2, §3, §4.1.1, §4.
  • [21] C. Li, M. Z. Zia, Q. Tran, X. Yu, G. D. Hager, and M. Chandraker (2019) Deep supervision with intermediate concepts. IEEE TPAMI. Cited by: §2.
  • [22] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV, Cited by: §1.
  • [23] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In ECCV, Cited by: §1.
  • [24] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In CVPR, Cited by: §1.
  • [25] A. Mosinska, P. Márquez-Neila, M. Koziński, and P. Fua (2018) Beyond the pixel-wise loss for topology-aware delineation. In CVPR, Cited by: §2.
  • [26] A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In ECCV, Cited by: §2, §3.2.
  • [27] D. Ramanan, P. Felzenszwalb, and D. McAllester (2008) A discriminatively trained, multiscale, deformable part model. In CVPR, Cited by: §4.2.
  • [28] S. J. Reddi, S. Kale, and S. Kumar (2018) On the convergence of adam and beyond. In ICLR, Cited by: §4.2.
  • [29] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In CVPR, Cited by: §1.
  • [30] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, Cited by: §1.
  • [31] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision workshop on Benchmarking Multi-Target Tracking, Cited by: §4.2, §4.
  • [32] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio (2015) FitNets: hints for thin deep nets. In ICLR, Cited by: §1, §2.
  • [33] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §1.
  • [34] L. N. Smith, E. M. Hand, and T. Doster (2016) Gradual dropin of layers to train very deep neural networks. In CVPR, Cited by: §2.
  • [35] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, pp. 1929–1958. External Links: Link Cited by: §2.
  • [36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In CVPR, Cited by: §1, §1, §2, §2, §3.1, §3.1, §3.2, §3.2, §3, §4.1.2, §4.
  • [37] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In CVPR, Cited by: §1, §3.2, §4.
  • [38] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus (2013) Regularization of neural networks using dropconnect. In ICML, Cited by: §2.
  • [39] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In CVPR, Cited by: §4.
  • [40] S. Xie and Z. Tu (2015) Holistically-nested edge detection. In ICCV, Cited by: §2, §3.2.
  • [41] Y. Yang, Z. Zhong, T. Shen, and Z. Lin (2018) Convolutional neural networks with alternately updated clique. In CVPR, Cited by: §1.
  • [42] J. Yim, D. Joo, J. Bae, and J. Kim (2017) A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In CVPR, Cited by: §2, §3.2.
  • [43] F. Yu, V. Koltun, and T. Funkhouser (2017) Dilated residual networks. In CVPR, Cited by: §1.
  • [44] S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In BMVC, Cited by: §A.1, Appendix F, §4.1.1, §4.
  • [45] S. Zagoruyko and N. Komodakis (2017) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In ICLR, Cited by: §2.
  • [46] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2017) Understanding deep learning requires rethinking generalization. In ICLR, Cited by: Appendix E.
  • [47] H. Zhang, H. Wu, W. Sun, and B. Zheng (2018) Deeptravel: a neural network based travel time estimation model with auxiliary supervision. In IJCAI, Cited by: §2.
  • [48] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2018) Mixup: beyond empirical risk minimization. In ICLR, Cited by: §2.
  • [49] X. Zhang, H. Luo, X. Fan, W. Xiang, Y. Sun, Q. Xiao, W. Jiang, C. Zhang, and J. Sun (2017) AlignedReID: surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184. Cited by: §2.
  • [50] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu (2018) Deep mutual learning. In CVPR, Cited by: §2, §3.2, §4.3.1.
  • [51] Z. Zhang, X. Zhang, C. Peng, X. Xue, and J. Sun (2018) ExFuse: enhancing feature fusion for semantic segmentation. In ECCV, Cited by: §2, §3.1.
  • [52] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In CVPR, Cited by: §2, §3.2.
  • [53] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In ICCV, Cited by: §4.2, §4.