[ICPR 2020] Adaptive L2 Regularization in Person Re-Identification
We introduce an adaptive L2 regularization mechanism termed AdaptiveReID, in the setting of person re-identification. In the literature, it is common practice to utilize hand-picked regularization factors which remain constant throughout the training procedure. Unlike existing approaches, the regularization factors in our proposed method are updated adaptively through backpropagation. This is achieved by incorporating trainable scalar variables as the regularization factors, which are further fed into a scaled hard sigmoid function. Extensive experiments on the Market-1501, DukeMTMC-reID and MSMT17 datasets validate the effectiveness of our framework. Most notably, we obtain state-of-the-art performance on MSMT17, which is the largest dataset for person re-identification. Source code will be published at https://github.com/nixingyang/AdaptiveReID.
Person re-identification involves retrieving corresponding samples from a gallery set based on the appearance of a query sample across multiple cameras. It is a challenging task since images may differ significantly due to variations in factors such as illumination, camera angle and human pose. On account of the availability of large-scale datasets [33, 20, 29], remarkable progress has been witnessed in recent studies on person re-identification, e.g., utilizing local feature representations [27, 23], leveraging extra attribute labels [22, 16], improving policies for data augmentation [36, 4], adding a separate re-ranking step [35, 37] and switching to video-based datasets [17, 3].
L2 regularization imposes constraints on the parameters of neural networks and adds penalties to the objective function during optimization. It is a commonly adopted technique which can improve the model's generalization ability. Although some works [26, 9, 19, 15] provide insights into the underlying mechanism of L2 regularization, it is an understudied topic and has not received sufficient attention. In most of the literature, L2 regularization is taken for granted, and the text dedicated to it is typically shrunk into a single sentence. On the other hand, existing approaches assign constant values to the regularization factors throughout the training procedure, and such hyperparameters are hand-picked via hyperparameter optimization, which is a tedious and time-consuming process. The primary purpose of this work is to address this bottleneck of conventional L2 regularization and introduce a mechanism which learns the regularization factors and updates their values adaptively.
In this paper, our major contributions are twofold:
We introduce an adaptive L2 regularization mechanism, which optimizes each regularization factor adaptively as the training procedure progresses.
With the proposed framework, we obtain state-of-the-art performance on MSMT17, which is the largest dataset for person re-identification.
The rest of this paper is organized as follows. Section II reviews important works in person re-identification and regularization. In Section III, we present the essential components of our baseline, alongside the proposed adaptive regularization mechanism. Section IV
describes the details of our experiments, including datasets, evaluation metrics and a comprehensive analysis of our proposed method. Finally, Section V concludes the paper.
In this section, we give a brief overview of two research topics, namely, person re-identification and regularization.
Utilizing local feature representations that are specific to certain regions has been shown to be successful. Varior et al. propose a Long Short-Term Memory architecture which models the spatial dependency and thus extracts more discriminative local features. Sun et al. apply a uniform partition strategy which divides the feature maps evenly into individual parts, and the part-informed features are concatenated to form the final descriptor.
Besides, methods based on auxiliary features are advocated, aiming to utilize extra attributes in addition to the identity labels. Su et al. show that learning mid-level human attributes can help address the challenge of visual appearance variations. Specifically, an attribute prediction model is trained on an independent dataset which contains the attribute labels. Lin et al. manually annotate attribute labels which contain detailed local descriptions. A multi-task network is proposed to learn an embedding for re-identification and also predict the attribute labels. In addition to the performance improvement in re-identification, such a system can speed up the retrieval process by ten times.
By applying random manipulations to training samples, data augmentation has played an essential role in suppressing overfitting and improving the generalization of models. Zhong et al. introduce an approach which erases the pixel values in a random rectangular region during training. By contrast, Dai et al. suggest dropping the same region for all samples in the same batch. Such a feature-dropping branch strengthens the learned features of local regions.
Adding a separate re-ranking step to refine the initial ranking list can lead to significant improvements. Zhong et al.  develop a k-reciprocal encoding method based on the hypothesis that a gallery image is more likely to be a true match if it is similar to the probe in the k-reciprocal nearest neighbours. Zhou et al.  rank the predictions with a specified local metric by exploiting negative samples for each online query, rather than implementing a general global metric for all query probes.
Lastly, some works shift the emphasis from image-based to video-based person re-identification. Liu et al.  introduce a spatio-temporal body-action model which exploits the periodicity exhibited by a walking person in a video sequence. Alternatively, Dai et al.  present a learning approach which unifies two modules: one module extracts the features of consecutive frames, and the other module tackles the poor spatial alignment of moving pedestrians.
Laarhoven proves that L2 regularization does not regularize properly in the presence of normalization operations, i.e., batch normalization [11] and weight normalization. Instead, L2 regularization affects the scale of the weights, and therefore it has an influence on the effective learning rate.
Similarly, Hoffer et al. investigate how applying weight decay before batch normalization affects the learning dynamics. Combining weight decay and batch normalization constrains the weight norm to a small range of values and leads to a more stable step size for the weight direction. This enables better control over the effective step size through the learning rate.
Later on, Loshchilov et al. clarify a long-established misunderstanding that L2 regularization is equivalent to weight decay. This statement does not hold when applying adaptive gradient algorithms, e.g., Adam. Furthermore, they suggest decoupling the weight decay from the optimization steps, which leads back to the original formulation of weight decay.
Most recently, Lewkowycz et al. present an empirical study on the relations among the L2 coefficient, the learning rate, the number of training epochs and the performance of the model. In a similar manner to learning rate schedules, a manually designed schedule for the L2 parameter is proposed to increase training speed and boost the model's performance.
In this section, we first present a minimal setup for person re-identification. We then explain five components that contribute to significant improvements in performance and use the resulting method as the baseline in our study. Most importantly, we discuss the proposed adaptive L2 regularization mechanism at the end.
Backbone: ResNet50, initialized with ImageNet pre-trained weights, is selected as the backbone model. For convenience, it is separated into five individual blocks, i.e., blocks 1-5, as illustrated in Figure 2. Additionally, the stride of the first convolution layer in block 5 is set to 1, rather than the default value of 2. This enlarges the feature maps by a factor of 2 along both the height and width dimensions, while reusing the pre-trained weights and keeping the total number of parameters identical.
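As a rough illustration of this modification (the authors' released code is Keras-based, so the snippet below is a PyTorch sketch rather than their implementation; torchvision's layer4 plays the role of block 5 here):

```python
import torchvision

def build_backbone():
    # ImageNet pre-trained ResNet50; layer4 corresponds to "block 5" in the text.
    backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
    # Change the stride of the first bottleneck in layer4 from 2 to 1 so that the
    # output feature maps are twice as large along height and width, while the
    # pre-trained weights and the parameter count stay unchanged.
    backbone.layer4[0].conv2.stride = (1, 1)
    backbone.layer4[0].downsample[0].stride = (1, 1)
    return backbone
```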
Objective module: Figure 1 shows the structure of an objective module that converts the feature maps into learning objectives. A global average pooling layer squeezes the spatial dimensions of the feature maps, and the following batch normalization layer generates the normalized feature vectors. The concluding fully-connected layer does not contain a bias vector, and it produces the predicted probabilities of each unique identity so that the model can be optimized using the categorical cross-entropy loss. In the inference procedure, the feature embeddings before the batch normalization layer are extracted as the representations, and the cosine distance is adopted to measure the distance between two samples.
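A minimal sketch of such an objective module; the class and argument names below are our own choices for illustration, not taken from the paper:

```python
import torch
import torch.nn as nn

class ObjectiveModule(nn.Module):
    """Global average pooling -> batch normalization -> bias-free classifier."""

    def __init__(self, in_channels: int, num_identities: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # squeeze spatial dimensions
        self.bn = nn.BatchNorm1d(in_channels)         # normalized feature vector
        self.classifier = nn.Linear(in_channels, num_identities, bias=False)

    def forward(self, feature_maps: torch.Tensor):
        # Embedding before BN: used by the triplet loss and as the retrieval
        # representation at inference time (compared with cosine distance).
        embedding = self.pool(feature_maps).flatten(1)
        logits = self.classifier(self.bn(embedding))  # cross-entropy branch
        return embedding, logits
```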
Overall topology: The topology of the overall model is shown in Figure 2. Note that the minimal setup only contains the global branch. Given a batch of images, the individual blocks of the backbone model are applied successively, and an objective module is appended at the end.
The image is resized to the target resolution using bilinear interpolation. Besides, the image is flipped horizontally at random with probability 0.5. Zero padding is added to all sides of the image, i.e., the top, bottom, left and right sides. A random patch with the target resolution is subsequently cropped.
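A possible realization of this pre-processing with torchvision transforms; the target resolution and padding size below are placeholders for illustration, not values taken from the paper:

```python
from torchvision import transforms

TARGET_HEIGHT, TARGET_WIDTH = 256, 128   # assumed resolution for illustration
PADDING = 10                             # assumed padding size

train_transform = transforms.Compose([
    transforms.Resize((TARGET_HEIGHT, TARGET_WIDTH),
                      interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.Pad(PADDING, fill=0),                       # zero padding on all sides
    transforms.RandomCrop((TARGET_HEIGHT, TARGET_WIDTH)),  # crop back to target size
])
```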
Learning rate: The learning rate increases linearly from a low value to the pre-defined base learning rate in the early stage of the training procedure, and it is divided by ten once the performance on the validation set plateaus. On the one hand, the warmup strategy suppresses the distorted-gradient issue at the beginning of training. On the other hand, periodically reducing the learning rate boosts the performance even further.
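One way to realize this schedule with standard PyTorch schedulers; the warmup length, start factor and patience are illustrative assumptions rather than the paper's settings:

```python
import torch

def make_schedulers(optimizer, warmup_epochs: int = 10):
    # Linear warmup from 10% of the base learning rate to the full value.
    warmup = torch.optim.lr_scheduler.LinearLR(
        optimizer, start_factor=0.1, end_factor=1.0, total_iters=warmup_epochs)
    # Divide the learning rate by ten when the monitored validation metric
    # (e.g., mAP) stops improving.
    plateau = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="max", factor=0.1, patience=5)
    return warmup, plateau

# Usage sketch: call warmup.step() once per epoch during the warmup phase,
# then plateau.step(validation_map) after each subsequent epoch.
```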
Label smoothing: The label smoothing regularization is applied alongside the categorical cross-entropy loss function. Given a sample with ground truth label $y$, the one-hot encoded label $q_i$ equals $1$ only if the index $i$ is the same as the label $y$, and $0$ otherwise. The smoothed label introduces a hyperparameter $\epsilon$ and is calculated as
$$ q_i' = (1 - \epsilon)\, q_i + \frac{\epsilon}{N}, $$
where $N$ denotes the number of identities.
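For reference, a small sketch of cross-entropy with smoothed targets matching the formulation above; the value of $\epsilon$ is illustrative and not necessarily the paper's choice:

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits: torch.Tensor, labels: torch.Tensor,
                           epsilon: float = 0.1) -> torch.Tensor:
    """Cross-entropy with smoothed targets q' = (1 - eps) * one_hot + eps / N."""
    num_classes = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)
    one_hot = F.one_hot(labels, num_classes).float()
    smoothed = (1.0 - epsilon) * one_hot + epsilon / num_classes
    return -(smoothed * log_probs).sum(dim=1).mean()
```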
Triplet loss: As highlighted in Figure 1, the triplet loss is applied to the feature embeddings before the batch normalization layer. It mines moderately hard triplets instead of all possible combinations of triplets, since using all possible triplets may destabilize the training procedure. Considering that multiple loss functions are present, the weighting coefficient of each loss function is set to 1 for simplicity.
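Batch-hard mining is one common way of restricting the loss to moderately hard triplets; the sketch below follows that scheme, assumes each identity appears at least twice per batch (PK sampling), and uses an illustrative margin:

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                            margin: float = 0.3) -> torch.Tensor:
    dist = torch.cdist(embeddings, embeddings, p=2)      # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)    # same-identity mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos_mask = same & ~eye
    # Hardest positive: farthest sample with the same identity.
    hardest_pos = (dist * pos_mask.float()).max(dim=1).values
    # Hardest negative: closest sample with a different identity.
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()
```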
Regional branches: In addition to the global branch, two regional branches are integrated into the model. Figure 2 illustrates the diagram of those regional branches. Firstly, block 5 of the backbone model is replicated, and it is not shared with the global branch. Secondly, we adopt a uniform partition scheme: the slicing layer explicitly divides the feature maps into two horizontal stripes. Lastly, dimensionality reduction is performed with a convolutional layer on each stripe. Separate objective modules are appended afterwards. In the inference procedure, the feature embeddings from the multiple objective modules are concatenated.
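A sketch of the slicing and dimensionality-reduction steps; the channel sizes and the head constructor are illustrative placeholders, not the paper's configuration:

```python
import torch
import torch.nn as nn

class RegionalBranch(nn.Module):
    """Slice the feature maps into two horizontal stripes, reduce each stripe
    with a 1x1 convolution, and attach a separate head (objective module) per
    stripe."""

    def __init__(self, in_channels: int = 2048, reduced_channels: int = 512,
                 make_head=lambda channels: nn.Identity()):
        super().__init__()
        self.reduce = nn.ModuleList(
            nn.Conv2d(in_channels, reduced_channels, kernel_size=1) for _ in range(2))
        self.heads = nn.ModuleList(make_head(reduced_channels) for _ in range(2))

    def forward(self, feature_maps: torch.Tensor):
        stripes = torch.chunk(feature_maps, chunks=2, dim=2)  # split along height
        return [head(reduce(stripe)) for stripe, reduce, head
                in zip(stripes, self.reduce, self.heads)]
```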
Random erasing: In addition to random horizontal flipping, random erasing  is utilized in data augmentation. During training, it erases an area of original images to improve the robustness of the model, especially for the occlusion cases.
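Random erasing is available off the shelf, e.g., in torchvision; the parameters below are the library defaults rather than the paper's settings, and the transform is applied after converting the image to a tensor:

```python
from torchvision import transforms

erasing = transforms.Compose([
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.33), value=0),
])
```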
Clipping: The clipping layer is inserted between the global average pooling layer and the batch normalization layer in Figure 1. It performs element-wise value clipping so that the values in its output are contained in a closed interval. The clipping layer works in a similar manner to ReLU-n units, and it relieves optimization difficulties in the succeeding triplet loss.
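A minimal sketch of such a clipping layer; the interval bounds are illustrative assumptions, not values from the paper:

```python
import torch
import torch.nn as nn

class Clipping(nn.Module):
    """Element-wise clipping into a closed interval, analogous to ReLU-n units."""

    def __init__(self, lower: float = 0.0, upper: float = 6.0):
        super().__init__()
        self.lower, self.upper = lower, upper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.clamp(x, self.lower, self.upper)
```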
L2 regularization: Conventional L2 regularization is applied to all trainable parameters, i.e., the regularization factors remain constant throughout the training procedure. Additionally, those regularization factors need to be hand-picked via hyperparameter optimization.
A neural network consists of a set of distinct parameters $\theta = \{\theta_1, \theta_2, \dots, \theta_K\}$, with $\theta$ containing all trainable parameters. Each $\theta_k$ is an array which could be a vector, a matrix or a 3rd-order tensor. For example, the kernel and bias terms in a fully-connected layer are a matrix and a vector, respectively.
Conventional L2 regularization imposes an additional penalty term on the objective function, which can be formulated as follows:
$$ \tilde{L} = L + \lambda \sum_{k=1}^{K} \lVert \theta_k \rVert_2^2, $$
where $L$ and $\tilde{L}$ denote the original and updated objective functions, respectively. In our case (see Figures 1 and 2), $L$ is a weighted sum of the triplet loss and categorical cross-entropy loss functions. In addition, $\lVert \theta_k \rVert_2^2$ refers to the square of the L2 norm of $\theta_k$ (we define $\lVert \theta_k \rVert_2^2$ to denote the sum of squares of all elements also when $\theta_k$ is a matrix or a 3rd-order tensor), and the constant coefficient $\lambda$ defines the regularization strength.
One may wish to add penalties in a different way, e.g., applying lighter regularization in the early layers but stronger regularization in the last ones. Thus, it is possible to generalize even further, i.e., defining a unique coefficient $\lambda_k$ for each $\theta_k$:
$$ \tilde{L} = L + \sum_{k=1}^{K} \lambda_k \lVert \theta_k \rVert_2^2, $$
where each parameter $\theta_k$ is associated with an individual regularization factor $\lambda_k$.
Obviously, it is infeasible to manually fine-tune those regularization factors $\lambda_k$ one by one, since $K$ is in the order of 100 for models built on ResNet50. Therefore, we treat them as any other learnable parameters and find suitable values from the data itself.
To make the aforementioned regularization factors adaptive, a straightforward extension is obtained by replacing the pre-defined constants $\lambda_k$ with scalar variables which are trainable through backpropagation. After the modification, Equation 4 remains unchanged, while the $\lambda_k$ become trainable. However, such an approach without any constraints on $\lambda_k$ will fail. Namely, setting negative values for $\lambda_k$ allows naively increasing $\lVert \theta_k \rVert_2^2$ so that $\tilde{L}$ decreases sharply. In other words, the regularization penalties would become dominant in the optimization process. Thus the model collapses and does not learn useful feature embeddings.
To address the collapse problem, we apply a hard sigmoid function which ensures that the regularization factors always take non-negative values. The hard sigmoid function is defined as
$$ \mathrm{hardsigmoid}(x) = \max\!\Big(0, \min\!\Big(1, \frac{x + c}{2c}\Big)\Big), $$
where $c$ is a positive constant. In our experiments, we use a fixed value of $c$, but any other positive value can be used as well.
The regularization factor is obtained by applying the hard sigmoid to a raw parameter as
$$ \lambda_k = \mathrm{hardsigmoid}(w_k), $$
where the $w_k$ ($k = 1, \dots, K$) are the trainable scalar variables. Furthermore, we introduce a hyperparameter $\Omega$ which represents the amplitude. Hence, we get
$$ \lambda_k = \Omega \cdot \mathrm{hardsigmoid}(w_k). $$
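Putting the pieces together, the sketch below illustrates the mechanism in PyTorch; the raw-scalar initialization, the amplitude and the hard-sigmoid constant c are assumptions for illustration, and the authors' released Keras code may differ in detail:

```python
import torch
import torch.nn as nn

class AdaptiveL2(nn.Module):
    """One trainable raw scalar per distinct parameter tensor, mapped through a
    scaled hard sigmoid to obtain a non-negative regularization factor."""

    def __init__(self, model: nn.Module, amplitude: float = 1e-3, c: float = 3.0):
        super().__init__()
        self.amplitude, self.c = amplitude, c
        # References to the model's distinct parameter tensors theta_k.
        self.params = [p for p in model.parameters() if p.requires_grad]
        # One raw scalar w_k per theta_k, updated through backpropagation.
        self.raw = nn.Parameter(torch.zeros(len(self.params)))

    def hard_sigmoid(self, x: torch.Tensor) -> torch.Tensor:
        # Piecewise-linear sigmoid: 0 below -c, 1 above c, linear in between.
        return torch.clamp((x + self.c) / (2.0 * self.c), 0.0, 1.0)

    def penalty(self) -> torch.Tensor:
        factors = self.amplitude * self.hard_sigmoid(self.raw)   # lambda_k >= 0
        return sum(f * p.pow(2).sum() for f, p in zip(factors, self.params))

# Usage sketch: total_loss = task_loss + adaptive_l2.penalty(); both the model's
# parameters and adaptive_l2.raw are handed to the optimizer.
```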
In this section, we describe the datasets and evaluation metrics and present a comprehensive analysis of our proposed method.
Table I (excerpt): test set statistics of the three datasets.

| | Market-1501 | DukeMTMC-reID | MSMT17 |
| --- | --- | --- | --- |
| Test query samples | 3,368 | 2,228 | 11,659 |
| Test gallery samples | 15,913 | 17,661 | 82,161 |
We conduct experiments on three person re-identification datasets, namely, Market-1501 , DukeMTMC-reID  and MSMT17 . Table I makes a comparison of those datasets. The MSMT17 dataset outshines the other two due to its large scale.
The Market-1501 dataset is collected with six different cameras in total. It contains 32,217 images from 1,501 pedestrians, and at least two cameras capture each pedestrian. The training set includes 751 pedestrians with 12,936 images, while the test set consists of the remaining images from 750 pedestrians and one distractor class.
The DukeMTMC-reID dataset includes 1,404 pedestrians that appear in at least two cameras and 408 pedestrians that appear only in one camera. The training and test sets contain 16,522 and 19,889 images, respectively. The query and gallery samples in the test set are randomly split.
The MSMT17 dataset is the largest publicly available person re-identification dataset, as of July 2020. It contains 126,441 images from 4,101 pedestrians, captured by 3 indoor cameras and 12 outdoor cameras. In particular, the test set has approximately three times as many samples as the training set. Such a setting motivates the research community to leverage the limited number of available training samples, since data annotation is costly.
Following common practice, two evaluation metrics are applied to measure the performance, i.e., mean Average Precision (mAP) and Cumulative Matching Characteristic (CMC) rank-k accuracy. The metrics take the distance matrix between query and gallery samples, together with the ground truth identities and camera IDs, as input arguments. Gallery samples are discarded if they have been taken by the same camera as the query sample. As a result, greater emphasis is placed on performance in the cross-camera setting.
Since the query samples may have multiple ground truth matches in the gallery set, mAP is preferable to rank-k accuracy because it considers both precision and recall.
Table II (excerpt): comparison with existing approaches; mAP / rank-1 accuracy (%).

| Method | Venue | Backbone | Market-1501 | DukeMTMC-reID | MSMT17 |
| --- | --- | --- | --- | --- | --- |
| Annotators | arXiv 2017 | - | - / 93.5 | - / - | - / - |
| PCB | ECCV 2018 | ResNet50 | 81.6 / 93.8 | 69.2 / 83.3 | - / - |
| IANet | CVPR 2019 | ResNet50 | 83.1 / 94.4 | 73.4 / 87.1 | 46.8 / 75.5 |
| AANet | CVPR 2019 | ResNet50 | 82.5 / 93.9 | 72.6 / 86.4 | - / - |
| CAMA | CVPR 2019 | ResNet50 | 84.5 / 94.7 | 72.9 / 85.8 | - / - |
| DGNet | CVPR 2019 | ResNet50 | 86.0 / 94.8 | 74.8 / 86.6 | 52.3 / 77.2 |
| OSNet | ICCV 2019 | OSNet | 84.9 / 94.8 | 73.5 / 88.6 | 52.9 / 78.7 |
| MHN | ICCV 2019 | ResNet50 | 85.0 / 95.1 | 77.2 / 89.1 | - / - |
| BDB | ICCV 2019 | ResNet50 | 86.7 / 95.3 | 76.0 / 89.0 | - / - |
| BAT-net | ICCV 2019 | GoogLeNet | 87.4 / 95.1 | 77.3 / 87.7 | 56.8 / 79.5 |
| SNR | CVPR 2020 | ResNet50 | 84.7 / 94.4 | 72.9 / 84.4 | - / - |
| HOReID | CVPR 2020 | ResNet50 | 84.9 / 94.2 | 75.6 / 86.9 | - / - |
| RGA-SC | CVPR 2020 | ResNet50 | 88.4 / 96.1 | - / - | 57.5 / 80.3 |
| SCSN | CVPR 2020 | ResNet50 | 88.5 / 95.7 | 79.0 / 91.0 | 58.5 / 83.8 |
Table III (excerpt): ablation study of the baseline components; mAP / rank-1 accuracy (%).

| Setting | Market-1501 | DukeMTMC-reID | MSMT17 |
| --- | --- | --- | --- |
| + Triplet Loss | 79.9 / 92.0 | 68.8 / 82.2 | 44.0 / 70.9 |
| + Regional Branches | 81.3 / 93.3 | 71.2 / 84.2 | 47.9 / 74.2 |
| + Random Erasing | 85.8 / 94.4 | 76.6 / 87.0 | 54.1 / 77.0 |
The baseline differs from the minimal setup in five aspects, as discussed in Section III-B. Table III presents an ablation study to demonstrate how each component contributes to the performance on person re-identification. On the one hand, the triplet loss  brings the most significant improvements on all three datasets. The boost is due to the fact that the triplet loss is applied to the feature embeddings which are retrieved in the inference procedure (see Figure 1). Since the triplet loss directly optimizes the model in a manner comparable to similarity search, it closes the gap between the training and inference procedures. On the other hand, the other four components bring moderate improvements. It is conceivable that the model reaches better generalization by using hand-picked regularization factors which remain constant throughout the training procedure.
Table II shows performance comparisons among baseline, AdaptiveReID and existing approaches.
Firstly, all methods listed in Table II have surpassed the best-performing human annotators on the Market-1501 dataset. In light of the scale of the Market-1501 and DukeMTMC-reID datasets (see Table I), these two small-scale datasets might have become saturated, and more emphasis should be put on the MSMT17 dataset. Since mAP is preferable to rank-k accuracy, the mAP score on MSMT17 is the most reliable indicator of performance.
Secondly, our AdaptiveReID models are trained with the proposed adaptive L2 regularization mechanism, with the amplitude in Equation 8 kept fixed. On the one hand, the AdaptiveReID method achieves decent improvements over the baseline, especially on MSMT17, where the gain in mAP is the largest. On the other hand, among methods which utilize the ResNet50 backbone, AdaptiveReID obtains state-of-the-art performance on DukeMTMC-reID and MSMT17, and comes very close to the state-of-the-art performance on Market-1501.
Last but not least, deeper backbones (i.e., ResNet101 and ResNet152) further improve the performance at the cost of extra computation. With the addition of the re-ranking method, which exploits the test data in the inference procedure, new milestones in mAP have been accomplished on Market-1501, DukeMTMC-reID and MSMT17.
Depending on the associated distinct parameter (see Equation 2), the regularization factors can be classified into five categories: conv_kernel, conv_bias, bn_gamma, bn_beta and dense_kernel, where conv, bn and dense denote the convolutional, batch normalization and fully-connected layers, respectively. In the following, we examine the regularization factors of a model trained on MSMT17 with the ResNet50 backbone.
Figure 3 visualizes the median value of the regularization factors in each category with respect to the number of iterations. Note that the learning rate is reduced at two points during training. While conv_kernel, bn_gamma and bn_beta behave similarly, conv_bias remains constant throughout the training procedure and dense_kernel drops sharply in the early stage.
Figure 4 shows a histogram of the regularization factors in the last epoch, i.e., after the training procedure completes. The value range is divided evenly into five buckets. For regularization factors from the same category, the values can differ significantly; e.g., the regularization factors from conv_bias fall into different buckets. In particular, consider the regularization factors from conv_bias in the two Reduction blocks (see Figure 2). If we omit the effects of the Clipping layer in Figure 1, those convolutional layers are followed by batch normalization layers, which intrinsically cancel out the bias terms of the aforementioned convolutional layers. Consequently, such regularization factors converge to a distinctive value. In summary, this phenomenon reflects the advantage of our proposed method, in which each regularization factor is optimized separately.
Figure 5 illustrates selected query samples with corresponding top 5 matches from the gallery set. Although query samples and erroneous matches may have similar appearances, minor differences could be observed under careful inspection, e.g., the dissimilarity between backpacks. Furthermore, our models could retrieve correct matches even in the presence of large illumination changes, e.g., the two examples from the MSMT17 dataset.
In this work, we revisit L2 regularization in neural networks and propose an adaptive mechanism named AdaptiveReID. In contrast to existing approaches, which employ hand-picked regularization factors that remain constant, our proposed method optimizes those regularization factors adaptively through backpropagation. More specifically, we apply a scaled hard sigmoid function to trainable scalar variables and use the outputs as the regularization factors. Extensive experiments validate the effectiveness of our framework, and we obtain state-of-the-art performance on MSMT17, the largest person re-identification dataset.
Proceedings of the IEEE International Conference on Computer Vision, pp. 371–381. Cited by: TABLE II.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3300–3310. Cited by: TABLE II.
Convolutional deep belief networks on cifar-10. Unpublished manuscript 40 (7), pp. 1–9. Cited by: §III-B.
On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265. Cited by: §III-A.