Backbone Can Not be Trained at Once: Rolling Back to Pre-trained Network for Person Re-Identification (AAAI2019)
In the person re-identification (ReID) task, because of the shortage of training data, it is common to fine-tune a classification network pre-trained on a large dataset. However, it is relatively difficult to sufficiently fine-tune the low-level layers of the network due to the gradient vanishing problem. In this work, we propose a novel fine-tuning strategy that allows low-level layers to be sufficiently trained by rolling back the weights of high-level layers to their initial pre-trained weights. Our strategy alleviates the vanishing-gradient problem in the low-level layers and robustly trains them to fit the ReID dataset, thereby increasing the performance of ReID tasks. The improved performance of the proposed strategy is validated via several experiments. Furthermore, without any add-ons such as pose estimation or segmentation, our strategy exhibits state-of-the-art performance using only a vanilla deep convolutional neural network architecture.
Person re-identification (ReID) refers to the task of connecting the same person, for instance a pedestrian, among multiple people detected in non-overlapping camera views. Different camera views capture pedestrians in various poses with different backgrounds, which interferes with correctly estimating the similarity among pedestrian candidates. These obstacles make it difficult to robustly recognize the identities of numerous pedestrians by comparing them with a limited number of person images of known identity. Furthermore, it is infeasible to obtain training datasets large enough to cover the appearance variation of pedestrians, making the ReID problem difficult to solve. When sufficient training data is not available, a common approach is to fine-tune a network pre-trained on another large dataset (e.g., ImageNet) that contains abundant information. The fine-tuning approach yields better performance than training networks from randomly initialized parameters, and is a practical approach used in many research areas [Ren et al.2015, Long, Shelhamer, and Darrell2015] to avoid overfitting. Likewise, previous ReID algorithms [Chang, Hospedales, and Xiang2018, Si et al.2018, Sun et al.2017] have utilized the fine-tuning approach. Most recent works in ReID research have attempted to utilize semantic information such as pose estimation [Zhao et al.2017, Xu et al.2018, Sarfraz et al.2018], segmentation masks [Song et al.2018], and semantic parsing [Kalayeh et al.2018] to improve ReID accuracy by considering additional pedestrian context.
In contrast to the previous studies, we are interested in incrementally improving ReID performance by enhancing the basic fine-tuning strategy applied to the pre-trained network. A few attempts have been made to improve learning methods by designing new loss functions or augmenting data in novel ways [Zhang et al.2017, Chen et al.2017, Zhong et al.2017b, Sun et al.2017]. However, there has been no research on improving the learning method to account for the characteristics of the filters in each layer.
Before suggesting our novel fine-tuning strategy for ReID, we first empirically analyze the importance of fine-tuning low-level layers for ReID problems. According to related research [Zeiler and Fergus2014, Mahendran and Vedaldi2015], the low-level layers concentrate on details of appearance to discriminate between samples, while the high-level layers contain semantic information. Thus, we need to sufficiently fine-tune the low-level layers to improve the discriminative power for the specific class 'person' in ReID, because the low-level layers of the pre-trained network encode detailed information about numerous classes. However, since the gradients delivered from the high-level layers to the low-level layers shrink during back-propagation, the low-level layers suffer from a gradient-vanishing problem, which causes early convergence of the entire network before the low-level layers are trained sufficiently.
To solve this problem, we propose a novel fine-tuning strategy in which a part of the network is intentionally perturbed when learning slows down. The proposed fine-tuning strategy can recover the vanished gradients by rolling back the weights in the high-level layers to their pre-trained values, which provides an opportunity for further tuning of the weights in the low-level layers. As shown in Figure 1, the proposed fine-tuning strategy allows the network to converge to a minimum in a basin with better generalization performance than the conventional fine-tuning method. We validate the proposed method, which uses no add-on schemes, via a number of experiments, and it outperforms state-of-the-art ReID methods that append additional context to the basic network architecture. Furthermore, we apply the proposed learning strategy to a fine-grained classification problem, which validates its generality for various computer vision tasks.
Traditionally, the ReID problem has been solved by using a metric learning method [Koestinger et al.2012] to narrow the distance among the images of the same person. Clothing provides an important hint in the ReID task, and some approaches [Pedagadi et al.2013, Kuo, Khamis, and Shet2013]
have used color-based histograms. With the development of deep learning, many ReID methods that learn discriminative features with deep architectures have appeared, dramatically increasing ReID performance [Sun et al.2017, Li, Zhu, and Gong2018, Hermans, Beyer, and Leibe2017]. Recently, state-of-the-art approaches [Si et al.2018, Song et al.2018, Zhong et al.2018] have also used advanced deep architectures, especially ones pre-trained on ImageNet [Deng et al.2009], as backbone networks.
To increase performance, many recent works based on deep architectures have tried to incorporate additional semantic information such as pedestrian poses and attention masks. One of the most popular approaches is to use off-the-shelf pose estimation algorithms [Cao et al.2017, Insafutdinov et al.2016] to tackle the misaligned poses of the candidate pedestrians. In [Su et al.2017], using the pose information, Su et al. aligned each part of a person, producing pose-normalized input to deal with the deformable variation of the ReID object. Sarfraz et al. [Sarfraz et al.2018] proposed a view predictor network that distinguishes the front, back, and sides of a person using pose information. In addition to pose estimation, one method [Song et al.2018] embeds a 4-channel input by concatenating the 3-channel RGB input image with a one-channel segmentation mask. Likewise, another algorithm [Kalayeh et al.2018] uses semantic parsing masks rather than a whole-body mask. In [Qian et al.2018], the authors generate realistic pose-normalized images; the synthesized images can be used as training data because the labels are preserved. [Xu et al.2018] proposed an attention-aware composition network, pointing out that conventional methods use pose information based on rigid body regions such as rectangular RoIs; instead, they obtained non-rigid parts through connectivity information between the human joints and matched them individually. In contrast to the previous ReID methods, we target improving the training method itself without any additional semantic information or extra architecture.
There are other studies to improve learning methods on pre-trained networks. Li and Hoiem [Li and Hoiem2017]
suggested a method that can learn a new task without forgetting existing tasks in transfer learning. In [Kornblith, Shlens, and Le2018], Kornblith et al. analyzed the conventional fine-tuning method and concluded that state-of-the-art ImageNet architectures yield state-of-the-art results over many tasks. In the ReID task, several methods have improved the learning strategy on pre-trained networks. The quadruplet loss was proposed in [Chen et al.2017]: Chen et al. developed an improved version of the triplet loss that not only pulls intra-class samples closer but also adds a second negative sample to push inter-class distances further apart. In [Zhang et al.2017], Zhang et al. were inspired by the distillation method [Hinton, Vinyals, and Dean2015] between teacher and student networks and proposed a learning method based on co-student networks that can be trained without a teacher network. However, there has been no research considering the fine-tuning characteristics of the ReID problem. In this paper, we propose a novel fine-tuning strategy adapted to the ReID task, which takes into account the layer-by-layer characteristics of the network.
In this section, we first analyze the conventional fine-tuning strategy to determine which layer is insufficiently trained for ReID problems. Based on the analysis, we propose a new fine-tuning strategy that alleviates the vanishing gradient in the poorly trained layers, consequently improving the generalization performance of the fine-tuned network.
Before describing the empirical analysis and the proposed fine-tuning strategy, we first introduce an overall framework including a network architecture with its training and testing processes. The notations defined in this section are used in the following sections.
In this paper, we use a classification-based network [Zheng, Yang, and Hauptmann2016] that treats each identity label as a class. We assume that the deep convolutional neural network consists of two components: a feature extractor and a classifier. The feature extractor is composed of multiple convolutional layers, and the classifier consists of several fully-connected (FC) layers. As the feature extractor, we utilize the convolutional layers of a pre-trained ResNet [He et al.2016], which are widely used in many ReID algorithms [Sun et al.2017, Zhong et al.2018, Qian et al.2018]. The three structures ResNet-34, ResNet-50, and ResNet-101 are used for the feature extractor to show the generality of the proposed fine-tuning strategy. According to the resolution of the convolutional layers, the feature extractor can be partitioned into five blocks, where each block contains several convolutional layers of the same resolution. The five blocks of ResNet-34, ResNet-50, and ResNet-101 contain {1, 6, 8, 12, 6}, {1, 9, 12, 18, 9}, and {1, 9, 12, 69, 9} convolutional layers, respectively. Following feature extraction, a feature vector is obtained by a global average pooling layer that averages the channel-wise values of the feature map resulting from the last convolutional layer. The resulting feature vector is a 2048-D vector for ResNet-50 and ResNet-101 and a 512-D vector for ResNet-34. The network infers the identity of the input sample by feeding the feature vector obtained from the feature extractor into the classifier. The classifier is newly defined as, in order, a 512-D FC layer, batch normalization, a leaky rectified linear unit, and a $C$-dimensional FC layer, where $C$ is the number of identities in the training set and varies between datasets. A soft-max layer follows the last FC layer.
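As a concrete illustration, the global average pooling step simply averages each channel of the final feature map over its spatial positions. A minimal pure-Python sketch (illustrative helper name, not the authors' code):

```python
def global_average_pool(feature_map):
    """Average each channel of a (C, H, W) feature map over its spatial
    positions, yielding a C-dimensional feature vector."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]

# A toy 2-channel, 2x2 feature map:
fmap = [
    [[1.0, 2.0], [3.0, 4.0]],   # channel 0 -> mean 2.5
    [[0.0, 0.0], [0.0, 8.0]],   # channel 1 -> mean 2.0
]
print(global_average_pool(fmap))  # [2.5, 2.0]
```

With a ResNet-50 backbone the real feature map has 2048 channels, so this pooling produces the 2048-D feature vector mentioned above.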
We train the network to classify the identities of training samples based on the cross-entropy loss. The weight parameters to be trained are denoted by $W = \{W_1, \dots, W_5, W_{FC}\}$, where $W_b$ and $W_{FC}$ are the weight parameters of the $b$-th block and the FC layers, respectively. Given $N$ training samples $\{x_i\}_{i=1}^{N}$ with identities $\{y_i\}$ and the corresponding one-hot vectors $\{t_i\}$, where $t_i \in \{0,1\}^C$, the probability that $x_i$ corresponds to each label is calculated as:

$p_i = g(f(x_i; W_1, \dots, W_5); W_{FC}), \quad (1)$

where $f(\cdot)$ denotes the feature extractor with weights $W_1, \dots, W_5$, and $g(\cdot)$ denotes the classifier (including the soft-max) with weights $W_{FC}$. The cross-entropy loss between the estimated $p_i$ and $t_i$ is calculated as follows:

$L = -\frac{1}{N} \sum_{i=1}^{N} t_i^{\top} \log p_i. \quad (2)$
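The soft-max probability and the cross-entropy loss can be sketched in a few lines of plain Python (a toy illustration on raw lists, not the training code):

```python
import math

def softmax(logits):
    """Numerically stable soft-max over a list of classifier logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(prob, one_hot):
    """Cross-entropy between predicted probabilities and a one-hot label."""
    return -sum(t * math.log(p) for p, t in zip(prob, one_hot) if t > 0)

logits = [2.0, 0.5, -1.0]        # classifier output for one sample, C = 3
p = softmax(logits)              # probabilities summing to 1
loss = cross_entropy(p, [1, 0, 0])
print(round(loss, 4))
```

In practice the loss is averaged over a mini-batch, matching the $1/N$ factor in Eq. (2).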
The identities in the testing set are completely different from those in the training set. Thus, the classifier trained in the training process cannot be used in the testing process. To find correspondences between pedestrian candidates without the classifier, we estimate the similarity of two pedestrians based on the distance between their feature vectors, extracted by the trained feature extractor. To evaluate the performance, the testing set is divided into a query set and a gallery set with $N_q$ and $N_g$ samples, respectively. The samples of the query and gallery sets are denoted by $\{q_i\}$ and $\{g_j\}$, respectively. Each sample in the query set is a person of interest, which should be matched to the candidate samples in the gallery set.
The distance between $q_i$ and $g_j$ is calculated by the L2 norm as follows:

$d(q_i, g_j) = \left\| f(q_i) - f(g_j) \right\|_2.$

The identity of the gallery sample with the lowest distance is assigned as the identity of the $i$-th query sample.
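The nearest-gallery matching rule can be sketched as follows (hypothetical helper names; real ReID code would operate on batched feature tensors):

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_query(query_feat, gallery_feats):
    """Return the index of the gallery feature closest to the query."""
    dists = [l2_distance(query_feat, g) for g in gallery_feats]
    return min(range(len(dists)), key=dists.__getitem__)

query = [0.9, 0.1]                                  # toy 2-D feature
gallery = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
print(match_query(query, gallery))                  # 1 (the closest sample)
```

Ranking all gallery samples by this distance (rather than keeping only the top match) is what produces the CMC curve and mAP reported in the experiments.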
This section examines which layers converge insufficiently under conventional fine-tuning. Figure 3 shows the convergence behavior, supporting the key ideas of the proposed fine-tuning strategy. 'baseline' denotes the conventional fine-tuning, while 'Block $b$' indicates the refine-tuning in which every block except Block $b$ is rolled back after the 'baseline' fine-tuning. Table 1 shows the generalization performance of each scheme. A meaningful discovery is that a rolling-back scheme that retains the low-level blocks (Block1, Block2, Block3) shows slower convergence than one that retains the high-level blocks (Block3, Block4). However, as shown in Table 1, the scheme that maintains the low-level blocks gives better generalization performance than the scheme preserving the high-level blocks. This indicates that the 'baseline' fine-tuning causes the low-level layers to converge prematurely, and suggests that rolling back the high-level layers while keeping the low-level layers intact may give the low-level layers an opportunity to learn further. An additional consideration is that not all weights can be restored to pre-trained states: the output layer of a deep network for a new task usually differs from that of the backbone network, so the FC layers must be initialized randomly, and rolling back the FC layers to random states provides no benefit. Thus, in our rolling-back scheme, the FC layers are excluded from rolling back, although they are high-level layers, to keep the learning of the low-level layers consistent.
The aforementioned analysis shows that a premature convergence degrades performance and rolling back high-level layers can be a beneficial strategy to mitigate the premature convergence problem in the low-level layers. For further tuning of the low-level layers, we designed a rolling-back refine-tuning scheme that trains the low-level layers incrementally from the front layer along with rolling back the remaining high-level layers. The detailed rolling back scheme is described in the following.
In the first fine-tuning period ($p = 1$), the weights $W_b$ ($b = 1, \dots, 5$) are initialized with the pre-trained weights $\tilde{W}_b$. The weights $W_{FC}$ in the FC layers are randomly initialized [He et al.2015]. Then the first period of fine-tuning is performed on the target dataset using Eq. (1) and Eq. (2). The updated weight of the $b$-th block is denoted by $\hat{W}_b^{(1)}$, which is obtained by minimizing the loss in Eq. (2).
From the refine-tuning periods with rolling back ($p \geq 2$), we roll back the high-level layers as in the following procedure. First, Block1 ($W_1$) is maintained in the state of the previous period, and all the remaining blocks ($W_2, \dots, W_5$) are rolled back to their pre-trained states $\tilde{W}_b$. In other words, Block1 continues learning, and the other blocks restart learning from the pre-trained initial weights. In this incremental manner, the next low-level block is added one by one to the set of blocks continuing the learning, while the remaining ones are rolled back. The rolling-back refine-tuning is repeated until all layers are included in the set of blocks continuing the learning. In summary, in the $p$-th refine-tuning period, the weights of the network are rolled back as

$W_b^{(p)} = \begin{cases} \hat{W}_b^{(p-1)}, & 1 \leq b \leq p-1 \\ \tilde{W}_b, & p \leq b \leq 5 \end{cases}$

where $\hat{W}_b^{(p-1)}$ are the updated weights from the $(p-1)$-th refine-tuning period. During the refine-tuning process, $W_{FC}$ is not rolled back, as mentioned above.
The detailed procedure of the refine-tuning scheme with rolling-back is summarized in Algorithm 1.
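The rolling-back rule can be summarized in a small schematic sketch (placeholder weight labels, assuming the five-block ResNet partition described above; not the authors' implementation):

```python
def rollback_weights(period, trained, pretrained):
    """Weights at the start of refine-tuning period p >= 2: blocks 1..p-1
    keep their weights from the previous period, blocks p..5 are rolled
    back to their pre-trained states, and the FC weights never roll back.
    `trained` and `pretrained` map block names to weight placeholders."""
    blocks = ["block1", "block2", "block3", "block4", "block5"]
    new = {}
    for b, name in enumerate(blocks, start=1):
        new[name] = trained[name] if b < period else pretrained[name]
    new["fc"] = trained["fc"]  # FC layers always continue learning
    return new

trained = {f"block{i}": f"hat_W{i}" for i in range(1, 6)}
trained["fc"] = "hat_Wfc"
pretrained = {f"block{i}": f"tilde_W{i}" for i in range(1, 6)}

# Period 2: only Block1 (and FC) keep their tuned weights.
print(rollback_weights(2, trained, pretrained))
```

After four periods (one fine-tuning plus three rolling-backs), every block has joined the set of continuously learning blocks, matching the schedule used in the experiments.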
Market-1501 [Zheng et al.2015] is a widely used dataset in person ReID. It contains 32,668 images of 1,501 identities. All the bounding-box images are results of detection by the DPM detector [Felzenszwalb et al.2010]. The dataset is divided into a training set of 751 identities and a test set of 750 identities.
Based on the multi-target and multi-camera tracking dataset, DukeMTMC [Zheng, Zheng, and Yang2017] has been specially designed for person ReID. DukeMTMC contains 36,411 images of 1,402 identities which are divided into a training set and a testing set of 702 and 702 identities, respectively.
CUHK03-np [Zhong et al.2017a] is a modified version of the original CUHK03 dataset. The hand-labeled (CUHK03-L) and DPM-detected [Felzenszwalb et al.2010] bounding boxes (CUHK03-D) are offered. CUHK03-np contains 14,096 images of 1,467 identities. The new version is split into two balanced sets containing 767 and 700 identities for training and testing, respectively.
Our method was implemented using the PyTorch [Paszke et al.2017] library. All inputs are resized to 288×144 and the batch size was set to 32. No augmentation other than horizontal flipping is used in our training process. The initial learning rate was set to 0.01 and 0.1 for the feature extractor and the classifier, respectively. The learning rates were multiplied by 0.1 every 20 epochs, and we trained for 40 epochs per refine-tuning period. In our experiments, the proposed refine-tuning strategy rolls back three times, so four refine-tuning periods of 40 epochs each are trained, for a total of 160 epochs. The learning rates of the rolled-back blocks are restored to 0.01 at the beginning of every period. In contrast, the blocks that are not rolled back begin with the low learning rate of 0.001, since a high learning rate on the sufficiently trained blocks might cause a sudden explosion of the loss. The optimizer used in this study was stochastic gradient descent (SGD) with Nesterov momentum [Nesterov1983]; the momentum rate and the weight decay were set to 0.9 and 5×10⁻⁴, respectively. At every rolling back, the gradient momentum was reset to zero. In the test process, the feature vector of the horizontally flipped input was added to that of the original input to form the final descriptor. We report the rank-1 accuracy of the Cumulative Matching Characteristics (CMC) curve and the mean Average Precision (mAP) for performance evaluation.
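Our reading of this per-block learning-rate schedule can be sketched as a small helper (an interpretation of the description above, not the authors' code; `block_lr` and its arguments are hypothetical):

```python
def block_lr(period, epoch_in_period, block, base=0.01):
    """Feature-extractor learning rate during refine-tuning.
    period: refine-tuning period (1-based); epoch_in_period: 0..39;
    block: 1..5. Rolled-back blocks (block >= period, for period >= 2)
    restart at the base rate 0.01, while blocks that keep learning start
    at the low rate 0.001; the rate decays 10x after 20 epochs."""
    if period == 1 or block >= period:
        lr = base            # first period, or a rolled-back block
    else:
        lr = base * 0.1      # a block that continues learning
    if epoch_in_period >= 20:
        lr *= 0.1            # step decay halfway through the period
    return lr

print(round(block_lr(2, 0, 1), 6))   # continuing block: 0.001
print(round(block_lr(2, 0, 3), 6))   # rolled-back block: 0.01
print(round(block_lr(2, 25, 3), 6))  # rolled-back block after decay: 0.001
```

The classifier would follow the same shape with its own base rate of 0.1.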
The network trained with the proposed strategy was verified via ablation tests on Market-1501, DukeMTMC, CUHK03-L, and CUHK03-D. The proposed refine-tuning strategy is applied to a network over four periods. As the refine-tuning periods progress, the continuously tuned blocks accumulate (e.g., B1+B2+FC in the third period), and the other blocks are rolled back to their original pre-trained states. As shown in Table 2, the performance increases as the refine-tuning periods progress, with the exception of DukeMTMC in the fourth period; even in this case, the gap is negligible. The improvement is most prominent in the second refine-tuning period, during which the first rolling back is performed. To verify the generality of our refine-tuning scheme, we conducted additional experiments with other networks, including ResNet-34 and ResNet-101 [He et al.2016], under the same settings. Table 3 shows the performance of each network on Market-1501 and DukeMTMC. The proposed refine-tuning scheme also showed consistent improvements on ResNet-34 and ResNet-101. The ablation results demonstrate that the proposed refine-tuning scheme is a broadly applicable method for enhancing generalization performance in the ReID problem, where only a limited amount of data is available.
To evaluate the effect of our rolling-back scheme, we compare it with the 'basecy' method, which rolls back none of the blocks but merely adjusts the learning rate on the same schedule as ours, as a perturbation driving the network toward other local basins. The 'basecy' scheme is similar to other studies [Loshchilov and Hutter2016, Smith2017] that perturb only the learning rate. Figure 4 shows the change in training loss and mAP over the whole course of the proposed refine-tuning and the basecy fine-tuning. After the first rolling back at 40 epochs, the training loss of the rolling-back scheme converges to a better value than that of basecy over epochs 70-80. After the second and third rolling backs, the training loss of basecy converges to a lower value than that of the proposed method, but basecy shows worse generalization performance (mAP) than the proposed method.
In this section, we discuss the difference between our method and FC warm-up training [He et al.2016]. As mentioned previously, the new FC layers start from random initialization. FC warm-up freezes the pre-trained weights in all hidden layers except the FC layers and trains the FC layers before starting the main fine-tuning. In the comparison experiment, the baseline was warmed up for 20 epochs. In our proposed method, period 1 (see Table 2) is similar to FC warm-up in that the FC layers start from random initialization; however, the proposed method does not freeze the pre-trained weights in period 1. The training loss and mAP for FC warm-up and our method are depicted in Figure 5. Both start fine/refine-tuning after training the FC layers. FC warm-up converges to a lower training loss than the proposed method, but the proposed method shows better generalization performance.
To learn discriminative features for the ReID task, it is important to distinguish the foreground from the background. Figure 6 shows that our method can generate a more distinguishable feature map in the last convolutional layer than the baseline of the conventional fine-tuning method.
We also compared the proposed method with state-of-the-art methods. Table 5 shows the comparison results when using ResNet-50. The proposed rolling-back refine-tuning scheme shows the best performance even though our method does not use any add-on scheme. Furthermore, compared to other methods without add-on schemes (SVDNet, HAP2S_P), our method achieves more than a 7% mAP improvement on Market-1501. Table 4 summarizes the results compared with the state-of-the-art methods on Market-1501, DukeMTMC, and CUHK03-L/D. According to the results, the rolling-back refine-tuning scheme makes a meaningful contribution to enhancing any backbone network, outperforming state-of-the-art algorithms that utilize add-on schemes.
In this paper, we proposed a refine-tuning method with a rolling-back scheme that further enhances the backbone network. The key idea of the rolling-back scheme is to restore the weights of part of the backbone network to the pre-trained weights when fine-tuning converges at a premature state. To escape from the premature state, we adopt an incremental refine-tuning strategy that applies fine-tuning repeatedly along with the rolling back. According to the experimental results, the rolling-back scheme makes a meaningful contribution to the enhancement of the backbone network, driving convergence to a local basin with good generalization performance. As a result, our method, without any add-on scheme, outperforms state-of-the-art methods that rely on add-on schemes.
This work was supported by the Next-Generation ICD Program through NRF funded by the Ministry of Science and ICT [2017M3C4A7077582] and the ICT R&D program of MSIP/IITP [2017-0-00306, Outdoor Surveillance Robots].
Bossard, L.; Guillaumin, M.; and Van Gool, L. 2014. Food-101 – mining discriminative components with random forests. In ECCV, 446–461. Springer.
Song, C.; Huang, Y.; Ouyang, W.; and Wang, L. 2018. Mask-guided contrastive attention model for person re-identification. In CVPR.
Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In CVPR, 2818–2826.