Attention Network Robustification for Person ReID

10/15/2019 ∙ by Hussam Lawen, et al.

The task of person re-identification (ReID) has attracted growing attention in recent years, with improving performance but a lack of focus on real-world applications. Most state-of-the-art methods use large pre-trained models, e.g., ResNet50 (roughly 25M parameters), as their backbone, which makes it tedious to explore different architecture modifications. In this study, we focus on small-sized randomly initialized models, which enable us to easily introduce network and training modifications suitable for person ReID public datasets and real-world setups. We show the robustness of our network and training improvements by outperforming state-of-the-art results in terms of rank-1 accuracy and mAP on Market1501 (96.2, 89.7) and DukeMTMC (89.8, 80.3) with only 6.4M parameters and without using re-ranking. Finally, we show the applicability of the proposed ReID network for multi-object tracking.


1 Introduction

The objective in person re-identification (ReID) is to assign a stable ID to a person in multiple camera views. In this study we are interested in developing small models for ReID with high accuracy, for two main reasons. First, small models are beneficial for practical deployment and productization of ReID solutions. Second, research on high-accuracy models requires exploring many architecture variations and training schemes. When the backbone is heavy, re-training consumes a lot of time and computing resources, which we wish to avoid. Our approach differs from many SotA methods that rely on large pre-trained backbone models such as ResNet50 (e.g., [31, 26, 28, 13]).

Figure 1: Performance comparison of our approach and SotA ReID methods on Market1501 dataset. Top: Rank-1 accuracy vs. number of parameters. Bottom: mAP vs. number of parameters.

We argue that a cost-effective ReID model should be computationally efficient, capable of running on low-res video input, and robust to multiple camera settings. Hence, we propose an efficient ReID model and training schemes that demonstrate state-of-the-art performance under these requirements. To reduce the computational burden, we aim to decrease the number of parameters and use a relatively small ReID model. Figure 1 compares the current state-of-the-art results [1, 11, 13, 19, 24, 26, 28, 32, 36, 34, 35, 29, 2, 41] and their number of parameters with our proposed method on the popular Market1501 [37] dataset in terms of rank-1 accuracy and mAP. For some methods, the number of parameters was not known, so we used an estimated lower bound. Using our proposed training framework we achieve state-of-the-art results with a model that is an order of magnitude smaller than the best existing ReID CNN.

The importance of training “tricks” for deep person ReID has been discussed before in [13]. In this paper, we suggest training techniques and architecture modifications that robustify a harmonious attention network [11] to achieve similar or better results than much larger and more complicated models. The contribution of this paper is thus three-fold:

  • We propose a robust deep person ReID model. Our model achieves state-of-the-art results on two popular person ReID datasets (Market1501 and DukeMTMC ReID [21]) despite having fewer parameters, fewer FLOPs, and a lower-resolution input image than current leading methods.

  • We study a variety of training schemes and network choices. While we have not explored their effect on other models, we believe they could be of interest for others to examine.

  • We demonstrate the applicability of the proposed person ReID model by improving multi-target multi-camera tracking.

In the following section we describe the baseline ReID network we started with. The training techniques and architecture modifications that were explored in this study are presented in section 3. Next, the experimental results including an ablation study, additional analysis, and comparison to state of the art are presented (section 4). Finally, multi camera multi target tracking results are presented in section 5.

2 Baseline ReID network - HA-CNN

We wanted to obtain a robust model with a small number of parameters and capability to deal with low-res input images to reduce computational complexity.

We chose the Harmonious Attention CNN (HA-CNN) [11] as our primary baseline because it is a light-weight yet deep model that can be trained from scratch, which obviates the need to pre-train on additional data, and because it provides competitive results given its small number of parameters (2.7M). In addition, the input image size for this network is relatively small compared to other person ReID networks.

The HA-CNN is an attention network with several attention modules, including soft spatial and channel-wise attention and a hard attention that extracts local regions. The network architecture has two branches: a global one, and a local one that uses the regions extracted by the hard attention. The output vectors of both branches are concatenated to form the final person image descriptor. The two branches and multiple attention modules improve the network perception, and despite these features the HA-CNN keeps a small number of parameters, making it accurate and efficient. However, parts of the architecture, as well as the training scheme, can still be optimized to obtain a more accurate and robust model.

3 Methods

Deep learning model performance is highly dependent on the training schemes being used. Recent works have shown that adding different training procedure refinements can improve the model results significantly [13, 6]. Architecture modifications and refinements can also have an important impact on the final result. In this section we elaborate on the training schemes and architecture modifications we used in this study starting with HA-CNN as our baseline. The training techniques will be presented in section 3.1 and the architecture modifications will be presented in section 3.2. In section 3.3 we mention several modifications that didn’t improve the model performance.

3.1 Training techniques

The following training techniques were used in this study:

Random erasing augmentation (REA) [39] - Randomly erasing a rectangle in an image has been shown to improve the generalization ability of the model. We used REA, which is controlled by three parameters: the probability of erasing a region in an image, the range of the area ratio of the erased region, and the range of its aspect ratio.
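As an illustration of how the three REA parameters interact, here is a minimal pure-Python sketch. The function name `random_erase` and the default values for `p`, `area_range` and `aspect_range` are assumptions chosen for the example; they are not the settings used in this study.

```python
import random

def random_erase(img, p=0.5, area_range=(0.02, 0.4), aspect_range=(0.3, 3.3), rng=random):
    """Return a copy of img (a list of rows of pixel values) with one random rectangle overwritten.

    p, area_range and aspect_range are illustrative defaults, not the paper's values.
    """
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    if rng.random() >= p:                      # keep the image unchanged with probability 1 - p
        return out
    for _ in range(100):                       # retry until the sampled rectangle fits
        area = rng.uniform(*area_range) * h * w
        aspect = rng.uniform(*aspect_range)    # height / width of the erased rectangle
        eh = int(round((area * aspect) ** 0.5))
        ew = int(round((area / aspect) ** 0.5))
        if 0 < eh < h and 0 < ew < w:
            top = rng.randint(0, h - eh)
            left = rng.randint(0, w - ew)
            for y in range(top, top + eh):
                for x in range(left, left + ew):
                    out[y][x] = rng.random()   # fill with random values in [0, 1)
            return out
    return out
```

The input image is never modified in place, so the same source image can be augmented differently in every epoch.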

Warmup [4] - used to bootstrap the network for better performance. Starting with a smaller learning rate has been shown to improve the stability of the training process, especially when using a randomly initialized model. With warmup, we start the training with a small learning rate and then gradually increase it. We used the following learning rate scheme:

(1)
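A warmup schedule of this kind can be sketched as a linear ramp followed by step decay. The function name `warmup_step_lr` and every number below (base rate, warmup length, decay milestones, decay factor) are illustrative assumptions and are not the values of Equation (1).

```python
def warmup_step_lr(epoch, base_lr=0.1, warmup_epochs=10, milestones=(150, 250), gamma=0.1):
    """Learning rate for a 0-indexed epoch: linear warmup, then step decay.

    All default values are hypothetical and only illustrate the shape of the schedule.
    """
    if epoch < warmup_epochs:
        # ramp linearly from base_lr / warmup_epochs up to base_lr
        return base_lr * (epoch + 1) / warmup_epochs
    drops = sum(epoch >= m for m in milestones)   # number of milestones already passed
    return base_lr * gamma ** drops
```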

Label smoothing [27] - widely used for classification problems; it encourages the model to be less confident during training and helps prevent over-fitting. We used label smoothing in a similar way as proposed in [13].
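For illustration, one common formulation of label smoothing assigns the target class a probability of 1 - ε(N-1)/N and every other class ε/N, where N is the number of identities. The sketch below assumes this formulation; the function names and the value ε = 0.1 are assumptions for the example.

```python
import math

def smoothed_targets(label, num_classes, eps=0.1):
    """Target distribution: eps / N for every class, plus 1 - eps on the true class."""
    off = eps / num_classes
    q = [off] * num_classes
    q[label] = 1.0 - eps + off        # equals 1 - eps * (N - 1) / N
    return q

def label_smoothing_ce(logits, label, eps=0.1):
    """Cross-entropy between the smoothed targets and softmax(logits)."""
    m = max(logits)                   # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    log_probs = [z - log_z for z in logits]
    q = smoothed_targets(label, len(logits), eps)
    return -sum(qi * lp for qi, lp in zip(q, log_probs))
```

With `eps=0.0` the function reduces to the standard cross-entropy loss.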

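Weighted triplet loss - Triplet loss is widely used in person re-identification methods and in other computer vision tasks such as face recognition and few-shot learning. The original triplet loss was proposed by Schroff et al. [23]. We denote an anchor sample by $x_a$, positive samples as $x_p \in P(a)$ and negative samples as $x_n \in N(a)$; then the triplet loss can be written as:

$$L = \Big[ m + \sum_{x_p \in P(a)} w_p \, d(x_a, x_p) - \sum_{x_n \in N(a)} w_n \, d(x_a, x_n) \Big]_+ \tag{2}$$

where $m$ is the given inter-class separation margin, $d$ denotes distance of appearance, $w_p$ and $w_n$ are the weights of the positive and negative samples, and $[\cdot]_+ = \max(0, \cdot)$.

Hermans et al. [7] and Mischuk et al. [54] have proposed the batch-hard triplet loss, which selects only the most difficult positive and negative samples:

$$w_p = \mathbb{1}\Big[ x_p = \arg\max_{x \in P(a)} d(x_a, x) \Big], \qquad w_n = \mathbb{1}\Big[ x_n = \arg\min_{x \in N(a)} d(x_a, x) \Big] \tag{3}$$

In contrast to the original triplet loss, the batch-hard triplet loss emphasizes hard examples. However, it is sensitive to outlier samples and may discard useful information due to its hard selective approach. To deal with these problems, Ristani et al. proposed the batch-soft triplet loss:

$$w_p = \frac{e^{d(x_a, x_p)}}{\sum_{x \in P(a)} e^{d(x_a, x)}}, \qquad w_n = \frac{e^{-d(x_a, x_n)}}{\sum_{x \in N(a)} e^{-d(x_a, x)}} \tag{4}$$

One hyper-parameter that exists in all of the triplet loss variations shown above is the margin, denoted as $m$. In this paper we used a modified version of the batch-soft triplet loss presented in Equation (4) that eliminates the need to manually determine the margin value, as explained next.

Soft margin [7] - We used the softplus function $\ln(1 + \exp(\cdot))$ instead of $[m + \cdot]_+$ in the triplet loss function:

$$L = \ln\Big( 1 + \exp\Big( \sum_{x_p \in P(a)} w_p \, d(x_a, x_p) - \sum_{x_n \in N(a)} w_n \, d(x_a, x_n) \Big) \Big) \tag{5}$$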

This modification is called soft margin since it replaces the hard cut-off of the max function (as in the ReLU activation function) with a soft exponential decay. With the soft margin there is no need to choose a margin parameter. With a hard margin, once the negative samples’ distance exceeds the positive samples’ distance by more than the margin value, there is no incentive to pull the positive samples closer or push the negative samples further away. The soft margin encourages a continuous reduction of the positive distance to the anchor while increasing the negative distance. Figure 3 illustrates the difference between the soft and the hard margin. The examples in (a) and (b) obtain the same loss value of zero, since both satisfy the hard margin requirement. In (c), using the soft margin, the computed loss continues to pull the positive sample closer to the anchor while pushing the negative sample away.
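The difference between the two margins can be checked numerically with a small sketch of the batch-soft triplet loss for a single anchor. It assumes softmax weights over the positive distances and over the negated negative distances; the function names and the margin value 0.3 used in the usage note are assumptions for the example.

```python
import math

def softmax(values):
    m = max(values)
    e = [math.exp(v - m) for v in values]
    s = sum(e)
    return [x / s for x in e]

def softplus(x):
    # ln(1 + exp(x)), written in a numerically stable form
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def batch_soft_triplet(d_pos, d_neg, margin=None):
    """Weighted triplet loss for one anchor.

    d_pos / d_neg: distances from the anchor to its positive / negative samples.
    margin=None uses the soft margin (softplus); a number uses the hard margin.
    """
    w_p = softmax(d_pos)                   # distant positives get larger weights
    w_n = softmax([-d for d in d_neg])     # close negatives get larger weights
    gap = (sum(w * d for w, d in zip(w_p, d_pos))
           - sum(w * d for w, d in zip(w_n, d_neg)))
    if margin is None:
        return softplus(gap)
    return max(0.0, margin + gap)
```

For example, `batch_soft_triplet([0.1], [2.0], margin=0.3)` is exactly zero, while `batch_soft_triplet([0.1], [2.0])` is still positive and decreases further as the positive moves closer or the negative moves away.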

Figure 2: The softplus function $\ln(1 + e^{x})$ compared to $\max(0, x)$.
Figure 3: Example of hard margin and soft margin. Although (b) is more desirable than (a), using the hard margin in the triplet loss gives a loss value of zero for both. In (c), using the soft margin, the loss continues to pull the positive sample closer to the anchor while pushing the negative sample away, which encourages going from (a) to (b).

L2 normalization - Normalizing the feature vectors can be important when two different loss functions, such as cross-entropy and triplet loss, are optimized using different distance measures. Luo et al. [13] tackled the normalization problem by adding a batch normalization layer after the feature vectors (right before the fully connected layer). In our empirical studies we found that simply applying $L_2$ normalization to each feature vector (global and local) during training achieved better performance. Figure 4 shows the additional normalization used during training and inference.
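As a minimal sketch of this step, each branch output is scaled to unit length before the two vectors are concatenated into the descriptor. The function names and the `eps` guard are assumptions for the example.

```python
import math

def l2_normalize(v, eps=1e-12):
    """Scale a feature vector to unit Euclidean length (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]

def descriptor(global_feat, local_feat):
    """Normalize each branch separately, then concatenate them into one descriptor."""
    return l2_normalize(global_feat) + l2_normalize(local_feat)
```

Because both halves have unit norm, neither branch can dominate the distance between two descriptors simply by producing larger activations.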

Figure 4: Robust-ReID architecture diagram showing the proposed modifications over the original HA-CNN: $L_2$ normalization during training, GeM instead of average pooling, and soft triplet loss.

SWAG [15] - A common technique to further boost the performance of a model is to use an ensemble. The standard ensemble method uses several models at test time for the final prediction, which requires much more computing resources. Stochastic weight averaging (SWA) [9] instead forms an ensemble during training and outputs a single model for inference. SWA essentially takes a uniform average over several model weights traversed by SGD during training to reach a wider region of the loss minima. To use SWA, a modified learning rate scheduler is required. In this study we used a cosine annealing learning rate scheduler with a cycle length of 35 epochs and a cycle decay factor of 0.7 after each cycle. At the end of each cycle we average the weights of the current model with those of the previous models taken from the end of each earlier cycle. This differs from the original learning rate scheduler used in [9]; we adopted it because it performed better empirically in our experiments. Another variation of SWA is SWA-Gaussian (SWAG) [15]. SWAG fits a Gaussian distribution using the SWA solution and a diagonal covariance, forming an approximate posterior distribution over neural network weights. Next, SWAG performs Bayesian model averaging based on the Gaussian distribution. We used SWAG with the modified learning rate scheduler in this study.
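The scheduler and the end-of-cycle averaging can be sketched as follows. The cycle length (35) and decay factor (0.7) come from the text; `base_lr`, `min_lr` and all names are assumptions, and only the uniform SWA average is shown, not the Gaussian fitting that SWAG adds on top.

```python
import math

def cyclic_cosine_lr(epoch, base_lr=0.01, cycle_len=35, decay=0.7, min_lr=0.0):
    """Cosine annealing restarted every cycle_len epochs, with the peak decayed per cycle."""
    cycle, t = divmod(epoch, cycle_len)
    peak = base_lr * decay ** cycle
    return min_lr + 0.5 * (peak - min_lr) * (1 + math.cos(math.pi * t / cycle_len))

class WeightAverager:
    """Running uniform average of weight vectors, updated once at the end of each cycle."""

    def __init__(self):
        self.mean = None
        self.count = 0

    def update(self, weights):
        self.count += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            # incremental mean: m_new = m_old + (w - m_old) / n
            self.mean = [m + (w - m) / self.count for m, w in zip(self.mean, weights)]
        return self.mean
```

In a training loop, `update` would be called with the flattened model weights whenever `(epoch + 1) % cycle_len == 0`, and the averaged weights would be used for inference.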

3.2 Architecture modifications

In addition to the training techniques listed above, we further explored architecture modifications:

Shuffle blocks [14] - Our goal was to improve our network accuracy while maintaining a small number of parameters. To do this, we examined replacing the inception blocks with the shuffle blocks presented in Figure 5.

Shuffle-A is more efficient than the original inception block since it splits the input features into two equal branches, the first branch remains as is while three convolution operators are applied to the second branch. In addition, one of the convolution operators is depth-wise convolution. The Shuffle-A blocks can be used in a repeated sequence and still maintain the same number of parameters as the original inception block. Hence, we were able to build a deeper network with similar number of parameters. The Shuffle-B block is similar to Shuffle-A but it can be used for spatial down-sampling or channel expansion. These characteristics require convolution operators to be applied to the first branch as well. Table 1 summarizes the repeated sequences of Shuffle blocks used in our proposed architecture.
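The data flow of a Shuffle-A style block can be sketched on a plain list of channels. The sketch assumes that the two branches are concatenated and then interleaved by a channel shuffle, as in ShuffleNet-style designs; `branch_fn` is a hypothetical stand-in for the three convolutions, and the function names are assumptions.

```python
def channel_shuffle(channels, groups=2):
    """Interleave channels across groups: reshape to (groups, n // groups), transpose, flatten."""
    n = len(channels)
    assert n % groups == 0, "channel count must be divisible by the number of groups"
    per_group = n // groups
    return [channels[g * per_group + i] for i in range(per_group) for g in range(groups)]

def shuffle_a(channels, branch_fn):
    """Split channels in half, transform only the second half, concatenate, then shuffle."""
    half = len(channels) // 2
    left, right = channels[:half], channels[half:]
    right = branch_fn(right)          # placeholder for the three convolution operators
    return channel_shuffle(left + right)
```

Since the first half passes through unchanged, only half of the channels incur convolution cost in each block, and the shuffle lets information from both halves mix in the next block.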

Figure 5: The shuffle blocks used in this study to replace the original HA-CNN inception blocks.

| Branch | Name | Layer | Input | Stride | Repeat (1) | Output ch. (1) | Repeat (2) | Output ch. (2) |
|---|---|---|---|---|---|---|---|---|
| Global | Conv1 | Conv 3×3 | 160×64 | 2 | 1 | 32 | 1 | 36 |
| Global | Stage1 | Shuffle-B, Shuffle-A, Shuffle-B | 80×32 | 1, 1, 2 | 1, 7, 1 | 128 | 1, 8, 1 | 240 |
| Global | Soft-Attn1 | HA-Block | 40×16 | 1 | 1 | | 1 | |
| Local | Hard-Attn1 | Shuffle-B | 4×(24×28) | 1 | 1 | | 1 | |
| Global | Stage2 | Shuffle-B, Shuffle-A, Shuffle-B | 40×16 | 1, 1, 2 | 1, 10, 1 | 256 | 1, 11, 1 | 320 |
| Global | Soft-Attn2 | HA-Block | 20×8 | 1 | 1 | | 1 | |
| Local | Hard-Attn2 | Shuffle-B | 4×(12×14) | 1 | 1 | | 1 | |
| Global | Stage3 | Shuffle-B, Shuffle-A, Shuffle-B | 20×8 | 1, 1, 2 | 1, 7, 1 | 384 | 1, 8, 1 | 480 |
| Global | Soft-Attn3 | HA-Block | 10×4 | 1 | 1 | | 1 | |
| Local | Hard-Attn3 | Shuffle-B | 4×(6×7) | 1 | 1 | | 1 | |
| Global | Pooling | GeM | 10×4 | 1 | 1 | | 1 | |
| Local | Pooling | GeM | 4×(3×4) | 1 | 1 | | 1 | |
| Global | FC Global | Linear | 1×1 | 1 | 1 | 512 | 1 | 960 |
| Local | FC Local | Linear | 1×1 | 1 | 1 | | 1 | |

FLOPs: 0.72B (1), 1.68B (2)
# of params: 2.9M (1), 6.4M (2)

Table 1: Overall architecture of Robust-ReID, for two levels of complexity, (1) and (2). Since our architecture uses a low-resolution input of 160×64, we down-scale the feature maps by applying strided convolution only in the last layer of each stage and not in the beginning. This way the network can leverage a higher spatial resolution in most of the network.

Generalized Mean (GeM) [20] - In the original HA-CNN, the pooling method used just before the fully connected layer is global average pooling. However, we found that using global max pooling instead gives different results, and it was not clear which one performs better or why. Hence, we used the trainable GeM pooling, which generalizes both max and average pooling. The GeM operator for a single feature map $X_k$ can be described as:

$$f_k = \Big( \frac{1}{|X_k|} \sum_{x \in X_k} x^{p} \Big)^{1/p} \tag{6}$$

Here $p$ is the GeM parameter: $p = 1$ gives average pooling and $p \to \infty$ gives max pooling. The parameter $p$ is initialized to a fixed value in our experiments and then learned during training. Figure 4 shows where GeM is used during training and inference.
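To make the link between GeM, average pooling and max pooling concrete, here is a minimal sketch for one feature map given as a list of rows. The function name, the clamp `eps`, and the default `p=3.0` are assumptions for the example and not necessarily the initialization used in this study.

```python
def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean pooling of one non-negative feature map (list of rows).

    p = 1 reproduces average pooling; large p approaches max pooling.
    """
    values = [max(v, eps) for row in feature_map for v in row]   # clamp to keep powers well defined
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)
```

For the map `[[1, 2], [3, 4]]`, `p=1` gives the mean 2.5, and increasing `p` moves the result toward the maximum 4.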

Deeper and wider - We further propose a deeper and wider version of our architecture by modifying the number of shuffle blocks as well as the number of output channels in each stage. Table 1 presents these modifications in bold.

| Market1501 | HA-CNN [11] | a | b | c | d | e | f | g | h | Robust-ReID |
|---|---|---|---|---|---|---|---|---|---|---|
| Rank-1 | 91.2 | 93.2 | 94.5 | 94.9 | 95.1 | 95.3 | 95.4 | 95.8 | 95.7 | 96.2 |
| mAP | 75.7 | 82.0 | 85.7 | 86.6 | 87.1 | 87.9 | 87.3 | 88.7 | 88.1 | 89.7 |

Modifications listed in the first column: BagOfTricks [13], Soft triplet, L2 normalization, Shuffle blocks, Soft margin, GeM, Deeper & wider, SWAG.

Table 2: Ablation study on Market1501. The first column indicates the different training techniques and architecture modifications we tried, including some of the tricks mentioned in BagOfTricks [13]: warmup, random erase, label smoothing, and no bias in the classification layers. The baseline we started with, i.e. the original HA-CNN implementation, is presented in the second column for comparison. The last column shows the results of our proposed Robust-ReID network including all of the training techniques and architecture modifications proposed in this study. Columns a-h demonstrate the impact of each modification by turning it off.

3.3 Additional tricks we tried

Some tricks that were introduced in prior works failed to improve the performance when used with our baseline. The following tricks didn’t work:

  • As mentioned, max and average pooling provide different results, so one way to benefit from both is to concatenate their outputs. We replaced the global average pooling used in the original HA-CNN architecture with both pooling methods followed by concatenation. This resulted in similar accuracy with more parameters in the final model.

  • The batch norm neck used in [13] provided inferior results when compared to the simple $L_2$ normalization.

  • Hard triplet loss instead of the soft version was too sensitive to outliers.

  • Using Shuffle blocks without normalization or soft margin in the triplet loss.

  • Training for more epochs didn’t improve the performance. The only way it did lead to an improvement was using the Cyclic LR scheme.

  • The cyclic LR scheme didn’t improve the results when used from scratch at the beginning of the training. It only worked when used in additional training epochs after the model had converged.

4 Experimental results

In the following we evaluate the performance boost using each one of the methods discussed in section 3. Our models are evaluated on Market1501 and DukeMTMC ReID datasets based on rank-1 accuracy and mAP.
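For reference, the two metrics can be sketched for a single query as follows; mAP is the mean of the per-query average precision. The function name and inputs are assumptions, and the sketch omits the filtering of same-camera gallery images that standard ReID evaluation protocols usually apply.

```python
def rank1_and_ap(distances, gallery_ids, query_id):
    """Rank-1 hit (0 or 1) and average precision for one query.

    distances[i] is the distance between the query and gallery image i.
    """
    order = sorted(range(len(distances)), key=distances.__getitem__)
    matches = [gallery_ids[i] == query_id for i in order]
    rank1 = 1.0 if matches[0] else 0.0
    hits, precisions = 0, []
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / rank)   # precision at each correct retrieval
    ap = sum(precisions) / len(precisions) if precisions else 0.0
    return rank1, ap
```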

Implementation details - All person images are resized to 160×64. We used SGD for optimization with a learning rate schedule as in Equation (1) for 350 epochs. When using SWAG we train for 15 cycles of 35 epochs, which adds up to 525 additional epochs. We randomly sample 8 identities and 4 images per person in each training batch.
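The batch construction can be sketched as follows, with 8 identities and 4 images each giving 32 images per batch. The function name, the dictionary layout, and the fallback of sampling with replacement for identities with fewer than 4 images are assumptions for the example.

```python
import random

def sample_pk_batch(images_by_id, num_ids=8, imgs_per_id=4, rng=random):
    """Sample num_ids identities and imgs_per_id images for each of them.

    images_by_id maps a person ID to the list of that person's images.
    """
    ids = rng.sample(sorted(images_by_id), num_ids)
    batch = []
    for pid in ids:
        imgs = images_by_id[pid]
        if len(imgs) >= imgs_per_id:
            chosen = rng.sample(imgs, imgs_per_id)
        else:
            # assumption: sample with replacement when an identity has too few images
            chosen = [rng.choice(imgs) for _ in range(imgs_per_id)]
        batch.extend((pid, img) for img in chosen)
    return batch
```

Sampling several images per identity guarantees that every anchor in the batch has positives as well as negatives, which the triplet loss requires.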

4.1 Ablation study

To evaluate the different training techniques explored in this study we ran several experiments in an ablation study. Table 2 shows the different modifications starting from the original HA-CNN architecture. The first row indicates using some of the tricks from [13] that showed an improvement when tested on Market1501 using the HA-CNN architecture. These include warm-up, random erasing, and no-bias in the fully connected layers. These tricks alone (experiment a) provided an improvement of 2.0 points in rank-1 accuracy and 6.3 points in mean average precision compared to the original HA-CNN paper result (i.e. our baseline).

Next, to test the influence of our modifications, we removed one modification at a time. The largest drop relative to column h came from removing the $L_2$ normalization, which decreased rank-1 accuracy by 1.2 points and mAP by 2.4 points (column b). Removing other modifications such as shuffle blocks, soft margin, GeM, and the deeper and wider network also decreased performance, indicating the benefit of each.

Finally, we used the SWAG in two experiments: experiment g and the final Robust-ReID. Continuing the training with SWAG provided an improvement in both rank-1 and mAP in both experiments. The SWAG is used in this study as a post process for models that already achieve high accuracy to show its contribution on top of that.

| Setup | Update | Market1501 r = 1 | Market1501 mAP |
|---|---|---|---|
| 1 | - | 93.8 | 83.6 |
| 1 | +LR Scheme | 94.3 | 84.8 |
| 1 | +SWAG | 94.5 | 85.3 |
| 2 | - | 95.4 | 87.3 |
| 2 | +LR Scheme | 95.8 | 88.2 |
| 2 | +SWAG | 95.8 | 88.7 |
| 3 | - | 95.7 | 88.1 |
| 3 | +LR Scheme | 95.7 | 88.9 |
| 3 | +SWAG | 96.2 | 89.7 |

Table 3: Performance evaluation on Market1501 for SWAG with the cosine annealing learning rate scheme with a decay factor.

Figure 6: Performance evaluation of SWAG compared to SGD using the cosine annealing learning scheme on Market1501 dataset showing the average of 5 runs. Top: Rank-1 accuracy vs. epoch. Bottom: mAP vs. epoch.

4.2 Exploring SWAG

Our empirical experiments showed that the SWAG method consistently improved our model performance. However, it requires additional training time and uses a custom-made cosine annealing learning rate scheme with a decay factor. Therefore, we wanted to further explore the SWAG contribution by analyzing some of our experimental results. Table 3 shows the results when testing the learning rate scheme with and without SWAG for three different setups. In the first setup we used our proposed architecture minus three main modifications: GeM, Shuffle blocks, and deeper and wider. The second and third setups are experiments f and h in Table 2, respectively. Adding the LR scheme improved mAP in all three setups (by 0.8 to 1.2 points), and adding SWAG improved it further (by 0.5 to 0.8 points). The gains in rank-1 accuracy were smaller (0 to 0.5 points per step), so the most significant improvements were in terms of mAP.

Figure 6 presents the average over five experiments comparing SWAG and standard SGD in terms of Rank1 accuracy and mAP on Market1501 dataset. Using SWAG the accuracy trend seems more consistent and robust compared to standard SGD. In addition, it is consistently and significantly better in terms of mAP.

4.3 Comparison to the state of the art

We compare the performance of our models to different state-of-the-art methods (Table 4). Our best model achieves state-of-the-art results in terms of rank-1 accuracy and mAP on Market1501 (96.2, 89.7) and DukeMTMC (89.8, 80.3) with only 6.4M parameters. To the best of our knowledge, our model achieves the best performance on these public datasets. It should be noted that the smaller version of our model (2.9M parameters) also achieves state-of-the-art results on both datasets. In terms of FLOPs, our final network has 1.7B FLOPs while the ResNet50 used in the implementation of Luo et al. [13] has 4.1B FLOPs. We did not apply re-ranking, both for a clear comparison and because it is currently not relevant for real-world practice.

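| Type | Method | Market1501 r = 1 | Market1501 mAP | DukeMTMC r = 1 | DukeMTMC mAP |
|---|---|---|---|---|---|
| Mask-guided | SPReID [10] | 92.5 | 81.3 | 84.4 | 71.0 |
| Mask-guided | MaskReID [17] | 90.0 | 75.3 | 78.8 | 61.9 |
| Stripe-based | AlignedReID [33] | 90.6 | 77.7 | 81.2 | 67.4 |
| Stripe-based | SCPNet [5] | 91.2 | 75.2 | 80.3 | 62.6 |
| Stripe-based | LocalCNN [32] | 91.5 | 77.7 | 82.2 | 66.0 |
| Stripe-based | Pyramid [36] | 92.8 | 82.1 | - | - |
| Stripe-based | PCB [26] | 93.8 | 81.6 | 83.3 | 69.2 |
| Stripe-based | BFE [3] | 94.5 | 85.0 | 88.7 | 75.8 |
| Stripe-based | MGN [29] | 95.7 | 86.9 | 88.7 | 78.4 |
| Stripe-based | Pyramid [36] | 95.7 | 88.2 | 89.0 | 79.0 |
| Stripe-based | LocalCNN (MG) [32] | 95.9 | 87.4 | - | - |
| Dense-semantics | DSA [34] | 95.7 | 87.6 | 86.2 | 74.3 |
| GAN-based | Camstyle [40] | 88.1 | 68.7 | 75.3 | 53.5 |
| GAN-based | PN-GAN [18] | 89.4 | 72.6 | 73.6 | 53.2 |
| Global feature | IDE [38] | 79.5 | 59.9 | - | - |
| Global feature | SVDNet [25] | 82.3 | 62.1 | 76.7 | 56.8 |
| Global feature | TriNet [7] | 84.9 | 69.1 | - | - |
| Global feature | AWTL [22] | 89.5 | 75.7 | 79.8 | 63.4 |
| Global feature | OS-Net [41] | 94.8 | 84.9 | 88.6 | 73.5 |
| Global feature | BagOfTricks [13] | 94.5 | 85.9 | 86.4 | 76.4 |
| NAS | Auto-ReID [19] | 94.5 | 85.1 | 88.5 | 75.1 |
| Attention-based | HA-CNN [11] | 91.2 | 75.7 | 80.5 | 63.8 |
| Attention-based | DuATM [24] | 91.4 | 76.6 | 81.2 | 62.3 |
| Attention-based | Mancs [28] | 93.1 | 82.3 | 84.9 | 71.8 |
| Attention-based | ABD [2] | 95.6 | 88.3 | 89.0 | 78.6 |
| Attention-based | RGA-SC [35] | 95.8 | 88.1 | 86.1 | 74.9 |
| Attention-based | Ours (2.9M) | 95.8 | 88.7 | 88.8 | 78.9 |
| Attention-based | Ours (6.4M) | 96.2 | 89.7 | 89.8 | 80.3 |

Table 4: Comparison with state-of-the-art methods.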

5 Application to multi object tracking

Although the public datasets used in this study for person ReID are valuable for comparison between different architectures and models, we wanted to evaluate the model’s applicability by using it to improve multi target multi camera tracking in two different scenes.

5.1 Indoor multi target multi camera tracking

We first explore whether the proposed model can be used for tracking purposes in a room with people coming in and out. Testing the model in a real world setting such as tracking is much more challenging. A wrong ReID assignment can affect the assignment of other persons since we only compare each query image to tracks that are not active (not present in the room at the time of the query). In addition, for each query we need to decide if we open a new track or assign it to an existing track (ReID), meaning that in some cases the gallery does not include images of the person found in the query.

We used the LAB sequence, which is part of the Task-Decomposition database [8] of multi-view sequences for people tracking. The LAB sequence is about 12.5 minutes long (the database website mentions 3.5 minutes, but the downloaded videos are actually 12.5 minutes long). The tracking domain is about 5×6 meters, and the images were captured at 15 Hz with a resolution of 640×480 pixels by four cameras installed at the corners of the room. Throughout the sequence, people enter, walk around, sit down and exit the room randomly, causing frequent occlusions. The maximum number of people in the scene at the same time is 7. We first used internal software for global people tracking, which uses the calibration provided for each camera, and we report the results obtained with and without the ReID model in terms of MOTA and IDF1. We applied ReID each time a person entered the room by comparing the new detection to several images of each person who was not currently tracked inside the room.
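The decision between re-using an existing identity and opening a new track can be sketched as an open-set nearest-neighbour test. This is a hypothetical illustration: the function name, the use of the minimum distance over each track’s gallery images, and the threshold value are all assumptions, not a description of the internal tracking software.

```python
def assign_identity(query_dist_to_tracks, threshold=0.5):
    """Pick the inactive track closest to the query, or None to open a new track.

    query_dist_to_tracks maps a track ID to the distances between the query
    descriptor and the gallery images stored for that track.
    """
    best_id, best_dist = None, float("inf")
    for track_id, dists in query_dist_to_tracks.items():
        d = min(dists)                     # closest gallery image of this track
        if d < best_dist:
            best_id, best_dist = track_id, d
    if best_id is None or best_dist > threshold:
        return None                        # no sufficiently similar track: new identity
    return best_id
```

Returning `None` covers the case where the gallery does not contain the person in the query, for example the first time someone enters the room.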

Table 5 shows the results obtained using different models including: the original HA-CNN and our proposed Robust-ReID model. The Robust-ReID performed better than the original HA-CNN in terms of IDF1 using Market1501 or DukeMTMC for training. Due to the original resolution of the videos, the size of the bounding box of each query and gallery image can get very small in size. Our model showed robustness to the low-res images since it was trained on small sized input.

| Model | Trained Dataset | MOTA | IDF1 |
|---|---|---|---|
| Robust-ReID | DukeMTMC | 96.1 | 89.1 |
| Robust-ReID | Market1501 | 96.1 | 79.6 |
| HA-CNN | DukeMTMC | 96.1 | 78.9 |
| HA-CNN | Market1501 | 96.1 | 65.7 |
| No ReID | - | 96.1 | 57.1 |

Table 5: Multi camera multi target tracking results on LAB dataset using our proposed Robust-ReID model compared to the original HA-CNN.

5.2 Outdoor multi target tracking

To evaluate the performance of the proposed ReID model for outdoor object tracking we conducted an additional experiment on the MOT16 dataset [16]. We followed Long et al. [12], using the exact same tracker, and only replaced the GoogLeNet-based ReID model with our ReID model. As in [12], we did not train on MOT16, and we used the same validation set of 5 video sequences for comparison. The sequences vary in camera angle and in distance from the subjects. One sequence can have large person bounding boxes captured from a frontal view, while another can have small bounding boxes captured from a very different camera angle. Therefore, having a single ReID model that excels in all of these domains is challenging. For multi target tracking, our model was trained on multiple ReID datasets (DukeMTMC, Market1501 and MSMT17 [30]) in order to obtain a more generic representation that can cope with the variability in the MOT16 sequences. Note that ReID domain adaptation is still an active area of research and is outside the scope of this work. Table 6 shows the results in terms of IDF1 and MOTA. The proposed ReID model showed robustness to these difficulties, with a slight improvement in both metrics.

| Model | MOTA | IDF1 |
|---|---|---|
| Robust-ReID | 36.7 | 46.2 |
| Long et al. [12] (code) | 36.5 | 46.1 |
| Long et al. [12] (paper) | 35.7 | 45.3 |

Table 6: Multi target tracking results on MOT16 validation set using our proposed Robust-ReID model compared to Long et al. [12].

6 Conclusions

This paper explores several training techniques and architecture modifications and suggests how to integrate them into a harmonious attention based network for person ReID. Each training technique is tested, as well as some of the tricks presented in prior works. Using the proposed training scheme and network modifications we were able to outperform SotA works, achieving 96.2% rank-1 accuracy and 89.7% mAP on Market1501 and 89.8% rank-1 accuracy and 80.3% mAP on DukeMTMC with only 6.4M parameters. In addition, we show that even a smaller version (2.9M parameters) achieves state-of-the-art results. Finally, we show the applicability of our proposed model by using it to improve existing methods for multi object tracking on a public dataset. Future work entails more experiments using other deep ReID networks as our baseline, as well as tackling the cross-domain challenges in person ReID.

Acknowledgments

We thank Sagi Rorlich and Genadiy Vasserman for their help in some of the experiments.

References