Fast and Accurate Person Re-Identification with RMNet

12/06/2018 ∙ by Evgeny Izutov, et al. ∙ Intel

In this paper we introduce a new neural network architecture designed for embedded vision applications. It merges the best practices of network architectures such as MobileNets and ResNets into an architecture we name RMNet. We also focus on the key aspects of building mobile architectures that must operate within a limited computation budget. Additionally, to demonstrate the effectiveness of our architecture, we evaluate the RMNet backbone on the person re-identification task. The proposed approach is in the top 3 of state-of-the-art solutions on the Market-1501 challenge, while significantly outperforming them in inference speed.




I Introduction

CNN-based solutions have demonstrated the ability to solve a wide range of computer vision tasks, achieving human-level performance or even surpassing it. Moreover, CNNs have proved themselves not only on canonical tasks like ImageNet [1] classification or Cityscapes [2] segmentation but also on practical problems such as person re-identification, which is a key component of tracking pipelines.

Unfortunately, many researchers who propose dramatically new approaches that lift a problem to a new level of understanding aim only to beat the current state of the art, without any attention to computational performance. But for industry-useful solutions we must take into account the requirement of real-time inference on hardware that customers can afford.

For CNN-based solutions, the usual way to change the inference behavior is simply to swap the backbone. There are many backbone architectures, such as MobileNet ([3], [4]) and ShuffleNet ([5], [6]), designed for fast inference in embedded applications. Most significantly, for many users swapping in one of these backbones is the only change they make to adapt their approach for fast inference. Instead of thinking in terms of practices that satisfy their target requirements, users mix components from different and often incompatible areas and, as a result, obtain models that underperform what could be achieved.

In this paper, we address this issue by carefully designing an architecture directly for a specific and narrow task, person re-identification. Our aim is to show that this problem can be solved at a near state-of-the-art level of accuracy with significantly higher speed. Our contributions are as follows:

  • A new lightweight backbone architecture named RMNet for fast and accurate inference in mobile applications.

  • A re-thinking of manifold learning techniques for the person re-identification challenge.

  • A novel lightweight network head that combines the advantages of low- and high-level losses without growth in the number of parameters.

More broadly, this work demonstrates some ways to design a lightweight CNN-based solution for a specific (not general) task without having to accept that being fast means being inaccurate. The proposed model (a set of models with different trade-offs between speed and accuracy) is available as part of the Intel OpenVINO™ toolkit.

Index terms. Person re-identification, manifold learning, local and global structure losses, mobile network architecture, lightweight backbone, RMNet.

II Related Work

i Mobile architectures

In recent years Deep Learning (DL), as an independent tool of Machine Learning, has made a significant leap in CNN architecture development, starting from vanilla networks like VGG [7] and continuing with the ResNet [8] and Inception [9] families. Recent architectures bring some key understanding of how to deal with persistent DL problems: vanishing/exploding gradients, over-parametrization and the resulting over-fitting. As a byproduct useful for fast inference, we get some reduction in the computation budget when using ResNet-18 and similar models, which we can call "relatively small". For some simple tasks, the cheap inference speed-up obtained by reducing the depth of a default architecture is enough, and no further investigation is performed. But for mobile applications a further reduction of inference time is needed. To achieve it, techniques like model weight pruning [10] and quantization [11] are used.

The first one is based on the assumption that a trained CNN-based model has parameter redundancy [12] caused by imperfections of the Stochastic Gradient Descent (SGD) based training procedure, which tends to produce duplicate filters [13]. The main idea of pruning methods is to remove useless parameters without a significant drop in accuracy. Recent papers demonstrate model compression and inference speed-up quite well [10]. But the parameter redundancy problem can be viewed from another angle: instead of accepting the need for pruning, we can try to train a model directly without parameter redundancy. In this paper we investigate one possible way to achieve that.

We do not consider quantization, or the more restrictive binarization [14], because these techniques are tied more to edge-specific implementations than to general ideas applicable to a wide range of tasks.

A completely different approach is to design the network architecture directly, accepting some possible degradation in accuracy in exchange for a gain in computation time. The first significant step was the introduction of depth-wise separable convolutions [3]. This idea was simple but powerful. At present all mobile network architectures reuse it, including the one proposed in this paper. To further speed up the computations, the MobileNet-v2 [4] architecture focuses on fixing some internal problems with the ReLU [15] activation function by inverting the well-known bottlenecks. We agree with the authors that some problems arise from the incompatibility of SGD properties with the ReLU function. But we have found that replacing ReLU with the ELU [16] activation function, together with some other changes in the backbone, is a more flexible way.

The ShuffleNet [5] architecture occupies a special place: it brings the idea of multipath inference in a single network by channel shuffling. The next generation [6] of this architecture optimizes the memory consumption and reveals some analogies with the DenseNet [17] architecture.

ii Person re-identification

The person re-identification task is formulated as learning a parametric mapping function that maps semantically similar points from the image space onto close points in the embedding space. During inference a pair of input images is compared by the Euclidean (L2) or cosine distance between their embedding vectors.
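The pairwise comparison described above can be sketched as follows. This is a minimal NumPy illustration rather than the authors' code; the function names are chosen here for illustration only.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against zero vectors)."""
    x = np.asarray(x, dtype=np.float64)
    return x / max(np.linalg.norm(x), eps)

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical directions, 2 for opposite ones."""
    return 1.0 - float(np.dot(l2_normalize(a), l2_normalize(b)))

def euclidean_distance(a, b):
    """Plain L2 distance between two embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.linalg.norm(a - b))
```

For unit-length embeddings the two measures rank pairs identically, since the squared L2 distance equals twice the cosine distance.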

Currently the best working practices either use a Siamese network [18] with an appropriate target function like the triplet loss [19] or train the model as a classification task with Softmax and cross-entropy loss [20]. More recently, the AM-Softmax loss [21] has been adopted from the closely related face recognition problem.

Further improvements in person re-identification have come from joint training with both metric learning approaches (triplet and AM-Softmax losses), incorporating a form of attention by slicing images into horizontal stripes [22], aggregating embeddings from different levels [23], and mixing the previous approaches in a single network without regard for the computation budget [24].

Another line of work addresses person re-identification with hard sample mining techniques, both for the triplet loss and for joint training [25].

In this paper we focus on manually mixing different metric learning approaches to avoid the difficulties of triplet sampling and to incorporate manifold learning at different levels ([26], [27]).

III Backbone design

Figure 1: Flow of thoughts for building the target lightweight architecture: defining the key requirements and resolving the issues that follow from them.

i Top-Down architecture design

As previously said, the evolution of network architectures has progressed from regular structures, where the representation power comes from simple stacking of convolution layers, to architectures that fuse representations from different levels in a single stage. The latest trend is to focus network design on variations of the building blocks, like the bottlenecks in the ResNet architecture. This strategy is followed by the recent mobile architectures MobileNet and ShuffleNet.

When designing a network for mobile applications, we can follow one of two approaches. The first is a "bottom-up" approach based on discovering the inference bottlenecks and then fixing them. The most powerful example of this approach is the ShuffleNet-v2 [6] architecture. It sets a strong baseline by excluding as many memory-intensive operations as possible. Generally speaking, it is a good attempt to build a fast network first and foremost, but without any attention to the target task. The final accuracy in this case is mostly the result of a lucky choice of architecture; otherwise only increasing the model size is proposed.

The other is a "top-down" approach. It consists of defining the key requirements that cannot be omitted and then growing the network building blocks from them. Such requirements do not need to be at the same logical level: often the list is composed of high-level architecture decisions (shallow or deep network) and low-level operations. All subsequent steps aim to merge the requirements into a single multi-level solution. These steps are not limited to architecture design and may include initialization tricks and a more sophisticated training procedure.

Of course, the key requirements combined with a limited computation budget may turn out to be contradictory. Fortunately, our purpose is not to develop a solution for a general task (e.g. ImageNet [1] classification or COCO [28] detection). This gives hope that relaxing the generality of the model yields a realizable trade-off. In this paper we follow the "top-down" approach, and the next sections show our view on directly building a lightweight model for the specific task of person re-identification.

ii Deep vs Shallow networks

When discussing network design under a limited computation budget, we face the well-known dilemma of deep versus shallow architectures. Most often the choice is to cut down a general architecture to satisfy a restriction on the maximal number of FLOPs per network input. At the architecture level this means aggressive usage of pooling operations in the early stages (e.g. [6]). On the one hand, the pooling operator should bring some kind of transformation that is equivariant to translations. Unluckily, for the rest of the tasks aggressive pooling prevents the extraction of accurate high-level features.

On the other hand, we cannot give up pooling operators, because they are a lightweight way to control the number of FLOPs at each scale level by changing the spatial resolution of the feature map. In addition, we can vary the number of blocks at each scale and the width of each block. Unfortunately, for most users the easiest way is to restrict the number of blocks without changing the blocks themselves.

Figure 2: Diagram of the RMNet block. Left: regular bottleneck. Right: bottleneck for spatial reduction, with stride 2 for the max-pooling and internal convolution layers.

In this paper we defend the position that the key component of a robust feature extractor is the depth of the network (in terms of the number of convolutions on the longest path from the network input to its output). The choice of backbone design is dictated by the need to maintain gradient flow during training, so we build on the ResNet family. According to the results of [29], the residual structure of bottlenecks can be interpreted as iterative feature enhancement on a single representation level (the level borders are defined by the down-scaling operations). It is also important to note that the ResNet-18 and even ResNet-50 architectures are too "shallow" and do not satisfy our intuition. We should be talking about at least a hundred layers.

But when choosing deep architectures we face the need to make each residual block as light as possible. The simplest way is to follow the mainstream practice of using bottlenecks with two consecutive convolutions instead of the original design [30]. On the other hand, we can think about network depth in terms of representation power [31]. For us this settles the choice between two and three convolutions in the bottleneck in favor of three convolutions with a non-linearity after each.

Finally, we can formulate the list of key requirements that forms the basis of the presented backbone architecture (Figure 1 shows our flow of thoughts on the way to a lightweight network):

  • Very deep network with about a hundred layers.

  • ResNet-like architecture.

  • Residual blocks with three convolutions and a non-linearity after each.

iii RMNet backbone

We now have a general vision of the backbone design and the support points for fitting a model to the target computation budget. As mentioned earlier, ResNet-like bottlenecks consist of three convolutions: the first maps the input onto an internal representation while reducing the number of channels, the internal convolution carries out spatial mixing, and the last maps the internal representation back onto the input manifold.

Name      Times  Stride  Channels
Input     -      -       3
conv      1      2       32
RM-block  4      1       32
RM-block  1      2       64
RM-block  8      1       64
RM-block  1      2       128
RM-block  10     1       128
RM-block  1      2       256
RM-block  11     1       256
Table 1: RMNet backbone architecture
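As a quick check of the "about a hundred layers" requirement, the depth implied by Table 1 can be counted directly. The sketch below assumes that every RM-block contributes exactly three convolutions on the main path (as stated in the text) and ignores any projection layers on the shortcut branch, which the table does not describe.

```python
# Rows copied from Table 1: (name, times, stride, channels).
TABLE_1 = [
    ("conv", 1, 2, 32),
    ("RM-block", 4, 1, 32),
    ("RM-block", 1, 2, 64),
    ("RM-block", 8, 1, 64),
    ("RM-block", 1, 2, 128),
    ("RM-block", 10, 1, 128),
    ("RM-block", 1, 2, 256),
    ("RM-block", 11, 1, 256),
]

CONVS_PER_RM_BLOCK = 3  # three convolutions per bottleneck, per the text

def count_conv_layers(rows):
    """Number of convolutions on the longest input-to-output path."""
    total = 0
    for name, times, _stride, _channels in rows:
        per_unit = CONVS_PER_RM_BLOCK if name == "RM-block" else 1
        total += times * per_unit
    return total

def total_stride(rows):
    """Overall spatial down-scaling factor of the backbone."""
    factor = 1
    for _name, times, stride, _channels in rows:
        factor *= stride ** times
    return factor

n_blocks = sum(times for name, times, _s, _c in TABLE_1 if name == "RM-block")
n_convs = count_conv_layers(TABLE_1)    # 36 blocks * 3 + 1 stem conv = 109
downscale = total_stride(TABLE_1)       # 2 * 2 * 2 * 2 = 16
```

Under these assumptions the backbone has 36 RM-blocks and 109 convolution layers, and it reduces the spatial resolution by a factor of 16.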

The first step to reduce the number of operations is to replace the internal convolution with its depth-wise variant [3]. But in contrast to the depth-wise separable convolution practice [32], we preserve the non-linearity after the internal convolution to keep the representation power of the network unchanged. Unfortunately, this reduction is not enough, and the last support point must be used too: the channel reduction factor of the internal convolution. In this paper we need to use a strong factor. Moreover, the maximal number of channels is limited to 256.
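To give a feeling for the savings, the sketch below counts multiply-accumulate operations (MACs) of one bottleneck with a regular versus a depth-wise internal convolution. The 1×1, 3×3, 1×1 kernel layout and the reduction factor of 4 are assumptions made for illustration; the paper's actual factor is not given in this text.

```python
def bottleneck_macs(height, width, channels, reduction, depthwise):
    """MACs of a 1x1 -> 3x3 -> 1x1 bottleneck (biases and activations ignored)."""
    inner = channels // reduction
    positions = height * width
    reduce_macs = positions * channels * inner   # 1x1 channel reduction
    if depthwise:
        spatial_macs = positions * inner * 9     # one 3x3 filter per channel
    else:
        spatial_macs = positions * inner * inner * 9
    expand_macs = positions * inner * channels   # 1x1 channel expansion
    return reduce_macs + spatial_macs + expand_macs

# Hypothetical example: 256 channels, reduction factor 4, one spatial position.
regular = bottleneck_macs(1, 1, 256, 4, depthwise=False)   # 69632
light = bottleneck_macs(1, 1, 256, 4, depthwise=True)      # 33344
```

With a depth-wise internal convolution the cost is dominated by the two 1×1 convolutions, which is why the channel reduction factor becomes the remaining lever for fitting the budget.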

Another non-obvious question is the choice of activation function. The common practice is to use the ReLU [15] non-linearity. However, using ReLU in deep networks has been found to have a negative effect ([4], [30]) connected with the well-known sparsity of activations. It is easy to see that this sparsity in the forward pass affects the backward pass too, producing sparse gradients and consequently slowing convergence. Researchers propose different solutions, but we follow a simpler way and replace ReLU with the ELU [16] activation function. As described further, this dramatically changes the behavior of the network.
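The gradient sparsity argument can be illustrated numerically. The sketch below compares the fraction of exactly zero derivatives of ReLU and ELU on random pre-activations; the standard normal input distribution and alpha = 1 are assumptions for illustration, not settings from the paper.

```python
import numpy as np

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (x > 0).astype(np.float64)

def elu(x, alpha=1.0):
    """ELU: identity for positive inputs, alpha * (exp(x) - 1) otherwise."""
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def elu_grad(x, alpha=1.0):
    """Derivative of ELU: 1 for positive inputs, alpha * exp(x) otherwise."""
    return np.where(x > 0, 1.0, alpha * np.exp(np.minimum(x, 0.0)))

rng = np.random.default_rng(0)
pre_activations = rng.standard_normal(10_000)

# Fraction of units that pass no gradient at all to the previous layer.
relu_zero_fraction = float(np.mean(relu_grad(pre_activations) == 0.0))
elu_zero_fraction = float(np.mean(elu_grad(pre_activations) == 0.0))
```

About half of the ReLU derivatives are exactly zero for such inputs, while the ELU derivative stays strictly positive, so every unit keeps receiving a (possibly small) gradient.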

The next important question concerns the utilization of model parameters. Looking at the effectiveness of pruning methods [33], we should take into account that not all learnt model parameters are useful for the target task. For general architectures with millions of parameters this is expected behavior, but our network design with strong channel reduction cannot afford to leave any rudimentary parameters.

To tackle this issue we follow common practices: orthogonal weight initialization [34] (not for all filters), pre-training on huge general-purpose datasets [35] and dropout regularization in each bottleneck [36].

The final RMNet (Residual Mobile Network) block is presented in Figure 2 and the whole backbone design is reported in Table 1.

IV ReID network

Figure 3: Re-identification head to map the internal representation after backbone onto the final embedding vector.

i Manifold learning

As described earlier, the goal of metric-learning-based person re-identification is to learn a parametric function whose embedding vectors can be compared with a simple norm. For us this means that the learning process can be interpreted as forming a target manifold with desired properties.

Generally speaking, each loss function impacts different aspects of the final manifold. In light of this we can divide them into two big families: global and local structure losses. Consider a set of appearances of different instances. Our goal is to find a transformation after which the appearances of the same instance are closer to each other than to appearances of different instances. On the one hand, for this purpose we can select a single appearance (center) of each instance and learn the mapping by forcing the other appearances to be close to the center of their instance. In other words, we define a global rule for the mapping function. This is the nature of the first family of losses. Examples include different modifications of Softmax with cross-entropy loss (see eq. 1). In this paper we focus on a variant with large margins between classes, the AM-Softmax loss [21] (see eq. 2).
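For reference, a minimal NumPy sketch of the AM-Softmax loss in its commonly used form from [21]: cosine logits between L2-normalized embeddings and class weights, an additive margin subtracted from the target logit, and a scale factor before the cross-entropy. The scale of 30 and margin of 0.35 are the defaults suggested in [21] and are assumptions here, not the settings used for RMNet.

```python
import numpy as np

def am_softmax_loss(embeddings, class_weights, labels, scale=30.0, margin=0.35):
    """Mean AM-Softmax loss.

    embeddings:    (N, D) raw embedding vectors
    class_weights: (D, C) one weight column per identity
    labels:        (N,) integer identity indices
    """
    labels = np.asarray(labels)
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = class_weights / np.linalg.norm(class_weights, axis=0, keepdims=True)
    cos = x @ w                                   # (N, C) cosine similarities
    rows = np.arange(labels.size)
    logits = scale * cos
    logits[rows, labels] = scale * (cos[rows, labels] - margin)
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[rows, labels].mean())
```

Setting `margin=0` recovers a normalized Softmax with cross-entropy; a positive margin forces the target cosine to exceed the others by at least that amount before the loss vanishes.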


On the other hand, we can follow the Hebbian Learning Rule [37], which states that local rules of interaction between elements define the global order of the system. This learning strategy is implicitly represented by the triplet loss [19] family. Unluckily, the main drawback of triplets is the sampling procedure, which significantly impacts the final model accuracy [25].

Recent papers proposed merging both loss families into a single training procedure and achieved state-of-the-art results [24]. In our opinion, better performance can be achieved by eliminating the triplets and dividing them into two constituent forces: push and pull losses [38]. In this paper we follow the same strategy of dividing triplets into components, thereby overcoming the sampling issues, but we supplement the default margins with a "smart" variant as in [27]. In the end we have three local structure losses: Center (eq. 3), PushPlus (eq. 4) and GlobPushPlus (eq. 5).


The total loss used to train the model is a weighted sum of the global and local losses (the weights are estimated to equalize the impact of each loss in the total sum):


ii Re-identification head

The last component of our network is the re-identification head, which maps a point from the internal representation (backbone output) onto the final embedding, which can be compared with others by the cosine (or L2) distance. The common choice is to use a fully connected (FC) layer on top of the backbone output. Unfortunately, FC layers are too wasteful of computation resources and cannot be used in mobile networks.

Another option is to use global pooling operators such as max- or average-pooling. As reported in [39], this approach includes a form of spatial attention due to pooling over all spatial locations of the feature map. We follow the same solution and use the global max-pooling (GMP) operator to collapse the spatial dimensions. The proposed re-identification head is shown in Figure 3.

Our re-identification head has two key components. The first is an inverted bottleneck after the GMP operator: with a 1×1 convolution we increase the number of channels from 256 to 512 and then compress it back to 256 (an attempt to move into a high-dimensional space where class separation can be solved by a linear transformation). The second is the separation of the support points of the global and local structure losses: we extract an internal representation that is trained with local structure losses only, and then calibrate it by learning with global structure losses. For both representations we use L2 normalization to follow the restrictions on the embeddings proposed for AM-Softmax (to be compatible with the cosine similarity measure). Finally, the network output is the last, calibrated embedding.
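The data flow through the head can be sketched in NumPy as below. This covers only the GMP, the 256 → 512 → 256 inverted bottleneck and the final L2 normalization; the split into locally and globally supervised embeddings is defined by Figure 3 and is not modeled here. The ELU between the two layers and the weight shapes are assumptions made for illustration.

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU activation, applied element-wise."""
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def reid_head(feature_map, w_expand, w_compress, eps=1e-12):
    """Map a backbone feature map to a unit-length embedding.

    feature_map: (256, H, W) backbone output
    w_expand:    (512, 256) weights of the expanding 1x1 convolution
    w_compress:  (256, 512) weights of the compressing 1x1 convolution
    """
    pooled = feature_map.max(axis=(1, 2))     # global max-pooling -> (256,)
    hidden = elu(w_expand @ pooled)           # 1x1 conv on a 1x1 map -> (512,)
    embedding = w_compress @ hidden           # compress back -> (256,)
    return embedding / max(np.linalg.norm(embedding), eps)

# Usage with random (hypothetical) weights and a hypothetical 10x5 feature map.
rng = np.random.default_rng(0)
features = rng.standard_normal((256, 10, 5))
embedding = reid_head(
    features,
    rng.standard_normal((512, 256)) * 0.05,
    rng.standard_normal((256, 512)) * 0.05,
)
```

After global pooling the spatial size is 1×1, so each 1×1 convolution reduces to a matrix-vector product, which keeps the head cheap compared with an FC layer applied to the full feature map.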

V Implementation details

i Network architecture

The proposed network consists of two consecutive components: a lightweight feature extractor (the RMNet-based backbone) and a single re-identification head. To reduce the total inference time we follow the fully convolutional network (FCN) practice and do not use any FC layers. Moreover, we avoid multibranch solutions [24] and the concatenation of embeddings from different layers [23].

The network outputs a normalized embedding vector with 256 elements, which can be compared with another one in a pairwise manner using the cosine similarity measure.

ii Optimization

All experiments were carried out in the Caffe framework [40]. We use SGD with momentum as the optimization method and decay the learning rate every 50k iterations, starting with .

To initialize the network parameters we use a mixed strategy: the input convolutions of each bottleneck are initialized orthogonally [34] and the remaining weights are initialized using the MSRA method [41]. Before running the main experiment we pre-trained the backbone on the OpenImages dataset [42] by fitting a classification task on the extracted object crops ( input size).

One more important step in training a lightweight network that utilizes a significant part of its parameters, and thus avoids the need for pruning, is dropout regularization [43] in each block (dropout ratio is set to ). But dropout regularization reduces the total network capacity, which is unsuitable for our initially small model. To overcome this issue we disable dropout on the late iterations (when the learning rate is small enough) and continue training without it. This strategy allows us to form the manifold structure on early iterations without the threat of over-fitting and to use the whole network capacity later.

To solve the unbalanced data problem (a significant difference in the number of appearances of each identity) we follow the common practice of using a hard sample mining procedure [44]. Our implementation consists of the following steps:

  1. To sample augmented images for each identity from the training dataset.

  2. To estimate the value of the loss for each sample.

  3. To select top of hardest (with highest loss value) samples.

  4. To train the network in mini-batches as usual on hardest samples.

  5. To increase the difficulty of the augmentation and go to beginning.
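One round of the procedure above can be sketched as follows. This is an illustrative outline rather than the authors' Caffe implementation: `loss_fn`, `keep_fraction` and `batch_size` are hypothetical names, and the fraction of samples kept is not specified in this text.

```python
import numpy as np

def select_hardest(losses, keep_fraction):
    """Indices of the samples with the highest loss, hardest first."""
    losses = np.asarray(losses, dtype=np.float64)
    k = max(1, int(round(keep_fraction * losses.size)))
    order = np.argsort(-losses, kind="stable")
    return order[:k]

def hard_sample_mining_round(samples, loss_fn, keep_fraction, batch_size):
    """Yield mini-batches built only from the hardest sampled images.

    samples: list of already augmented samples
    loss_fn: hypothetical callable returning the current loss of one sample
    """
    losses = [loss_fn(sample) for sample in samples]
    hardest = select_hardest(losses, keep_fraction)
    for start in range(0, len(hardest), batch_size):
        yield [samples[i] for i in hardest[start:start + batch_size]]
```

The outer training loop would call this once per round, train on the yielded batches, then raise the augmentation difficulty before sampling again.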

The last component needed to train the network successfully is strong data augmentation with progressively increasing difficulty. The best choice is to use the random erasing augmentation [45] in addition to the standard horizontal flip and random crop methods.

Figure 4: Comparison of the ratio of learnt model weights for the different activation functions. Filters are ordered according to the ReLU ratios.

VI Experimental Results

i Data

To evaluate the proposed solution we use the Market-1501 dataset [46]. It is a benchmark for person re-identification with images from 6 cameras of different resolutions. It is annotated with 1501 identities, of which 751 are used for training and 750 for testing. The training set contains 12936 images and the query set contains 3368 images. The gallery set is composed of images from the 750 test identities and of distractor images, 19732 images in total. The most common evaluation scenario is the single-query one.

ii Metrics

We follow the standard procedure and report the mean average precision over all queries (mAP) and the cumulative matching curve (CMC) at rank-1, using the evaluation code provided by the benchmark.
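For orientation, both metrics can be computed from a query-by-gallery distance matrix as sketched below. This is a simplified illustration and not the benchmark's evaluation code: in particular, it omits the Market-1501 rule that excludes gallery images of the same identity taken by the same camera as the query.

```python
import numpy as np

def average_precision(ranked_matches):
    """AP of one query given a boolean match vector ordered by distance."""
    ranked_matches = np.asarray(ranked_matches, dtype=bool)
    if not ranked_matches.any():
        return 0.0
    hits = np.cumsum(ranked_matches)
    ranks = np.arange(1, ranked_matches.size + 1)
    return float((hits[ranked_matches] / ranks[ranked_matches]).mean())

def evaluate(dist, query_ids, gallery_ids):
    """Return (rank-1 accuracy, mAP) for a (num_queries, num_gallery) matrix."""
    dist = np.asarray(dist, dtype=np.float64)
    gallery_ids = np.asarray(gallery_ids)
    rank1_hits, aps = [], []
    for q, row in enumerate(dist):
        order = np.argsort(row, kind="stable")       # closest gallery item first
        matches = gallery_ids[order] == query_ids[q]
        rank1_hits.append(bool(matches[0]))
        aps.append(average_precision(matches))
    return float(np.mean(rank1_hits)), float(np.mean(aps))
```

Rank-1 only looks at the closest gallery image, while mAP rewards placing all images of the queried identity near the top of the ranking.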

It is worth saying that there are techniques that improve the final result in both metrics. The first common method is to estimate the embeddings of the original and the horizontally flipped image and then concatenate them into a single one (with an additional normalization step so that it can be used with the cosine similarity measure). In our opinion this is not a fair way to improve accuracy because it doubles the computation time. Unfortunately, some authors do not mark results for which flipping is used. However, to allow comparison with that approach, we also report results with horizontal flipping.
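The flipping trick amounts to two forward passes followed by concatenation and re-normalization. A minimal sketch, where `embed_fn` is a hypothetical callable standing in for the network and the image is assumed to be laid out as height × width × channels:

```python
import numpy as np

def flip_augmented_embedding(image, embed_fn, eps=1e-12):
    """Concatenate embeddings of an image and its horizontal mirror.

    image:    (H, W, C) array
    embed_fn: hypothetical callable mapping an image to a 1-D embedding
    """
    original = np.asarray(embed_fn(image), dtype=np.float64)
    flipped = np.asarray(embed_fn(image[:, ::-1, :]), dtype=np.float64)
    joint = np.concatenate([original, flipped])
    # Re-normalize so the result can be used with the cosine similarity measure.
    return joint / max(np.linalg.norm(joint), eps)
```

The embedding doubles in length and `embed_fn` is called twice per image, which is the doubled computation time mentioned above.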

The second method is based on re-ranking (RK) techniques [47]. In other words, it is a direct optimization over the comparable metrics. We report results with RK too.

iii Ablation study

As announced earlier, we first compare backbone implementations with different activation functions. Our main message in this paper is that the widely used ReLU activation is not a proper choice and leads to a number of problems. To show this we measure the ratio between absolute values of filter weights for each convolution layer in the network. Figure 4 shows these ratios for both the ReLU and ELU activation functions. A high ratio means that there are invalid filters at the current level. As can be seen, in the network trained with ReLU more than half of the filters are noisy; such filters are usually pruned for model compression purposes. The ELU activation function gives a different picture: a significant part of the filters is still useful and no capacity reduction is observed. Due to the low final quality of the model with ReLU activation, all further experiments are performed with ELU.

Table 2 shows the ablation study experiments. The starting point of our experiments is training on our dataset with the AM-Softmax loss only. Generally speaking, this approach should beat SOTA results with a general-purpose backbone. But in our case the model capacity is very limited and the default training fails. In other words, the task of training a lightweight but accurate person re-identification network is really challenging.

                         Market-1501
Method                   rank@1   mAP
AM-Softmax               78.00    60.74
+ HSM                    79.07    57.88
+ Center loss            81.53    60.42
+ Disabled dropout       85.24    65.94
+ Push loss              87.11    70.95
+ GlobPush loss          88.69    73.40
+ Smart margins          90.20    78.80
+ Weighted HSM           91.66    81.63
+ Increased resolution   92.37    82.53
Table 2: Ablation study on the Market-1501 dataset. HSM – hard sample mining procedure.

                     Market-1501
Method               rank@1  mAP    GFLOPs  MParams  FPS
GP-ReID [39]         92.2    81.2   8       24.66    64
Deep-Person [48]     92.3    79.5   8       24.66    64
PCB [22]             92.4    77.3   8       24.66    64
PCB+RPP [22]         93.8    81.6   8       24.66    64
HPM (flip) [23]      94.2    82.7   -       24.66    32
MGN (flip) [24]      95.7    86.9   -       68.75    16
Our (light)          91.7    81.6   0.12    0.81     923
Our (strong)         92.4    82.5   0.58    0.81     268
Our (strong, flip)   92.5    83.1   -       0.81     134
MGN (RK) [24]        96.6    94.2   -       68.75    16
Our (strong, RK)     93.1    91.1   0.58    0.81     268
Table 3: Comparison with state-of-the-art solutions on the Market-1501 dataset. Performance results (Frames Per Second) are measured with OpenVINO on Intel Core i7-6700K CPU@2.90GHz

The first step to improve the baseline is to tackle the data imbalance problem. As mentioned earlier, we use the hard sample mining (HSM) procedure (see the description above). In the first experiment only the AM-Softmax loss value is used to order the samples (instead of step ). The impact on the metrics is not significant, but it frees us from worrying about possible over-fitting due to training on plain samples.

In the next steps we dive into the manifold learning approach by introducing different local structure losses: Center, Push and GlobPush. Each step gives a further improvement in both metrics. The most significant impact is achieved by using smart margins for the Push and GlobPush losses. Moreover, as expected, smart margins mostly affect the mAP metric, which reflects the orderliness of the learnt manifold.

It is worth noting that our concerns about the model capacity being limited by dropout regularization are confirmed: disabling this regularization on the late iterations brings a significant leap in accuracy in both metrics.

The very last attempt to increase the metrics is to make the sample mining procedure more flexible (smarter in considering the sample complexity for different-level losses) by mixing multiple losses into the ranking criterion. This mostly improves the mAP metric.

To align with other state-of-the-art solutions we should also test different input resolutions. As shown in [39], simply increasing the input size can bring a significant leap in accuracy. For our model the main input resolution is . We also tested the higher resolution and, as can be seen, the result is slightly better but with the expected slowdown in inference time (both model variants can be found in the Intel OpenVINO™ toolkit).

iv Comparison with the state of the art

Table 3 compares the proposed solution to the state-of-the-art approaches. Our approach, without multibranching [24] or merging embeddings from different levels [23], achieves good accuracy while being more than an order of magnitude faster at inference. This is due to our lightweight RMNet backbone, used instead of the widely used ResNet-50 architecture.

The proposed combination of loss functions and the training strategy allows us to achieve comparable results even though our model has significantly fewer parameters (0.81 vs 25 MParams). Moreover, our solution is in the top 3 by the rank@1 metric and in the top 2 by the mAP metric.

To measure model performance we use the publicly available OpenVINO toolkit and run experiments on an Intel Core i7-6700K CPU. We significantly outperform other solutions in the Frames per Second (FPS) metric. It is worth saying that a person re-identification method can be considered real-time if it is able to perform several pairwise comparisons on each frame of the input stream in real time. For example, using our faster solution (light, 923 FPS) we can process about 30 persons on each frame in real time. No other state-of-the-art solution is able to do that at the same quality.
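The real-time claim can be checked with simple arithmetic on the Table 3 numbers. The sketch assumes that the reported FPS counts single-crop forward passes per second and that each person in a frame needs exactly one forward pass; the cost of the pairwise embedding comparisons is ignored.

```python
def video_frames_per_second(crops_per_second, persons_per_frame):
    """Video frame rate achievable when every person crop needs one forward pass."""
    return crops_per_second / persons_per_frame

PERSONS_PER_FRAME = 30  # the example used in the text

light_video_fps = video_frames_per_second(923, PERSONS_PER_FRAME)    # about 30.8
strong_video_fps = video_frames_per_second(268, PERSONS_PER_FRAME)   # about 8.9
resnet_video_fps = video_frames_per_second(64, PERSONS_PER_FRAME)    # about 2.1

# Parameter budget of the ResNet-50 based entries relative to RMNet (Table 3).
param_ratio = 24.66 / 0.81                                           # about 30.4
```

Under these assumptions only the light model sustains roughly 30 video frames per second with 30 persons per frame, while the 64 FPS entries from Table 3 would reach about 2 frames per second.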

VII Conclusion

In this paper we have proposed a novel lightweight backbone (RMNet) and a set of training practices to tackle the person re-identification problem. We have demonstrated that our solution is close to state-of-the-art approaches in accuracy but significantly outperforms them in inference time. We presume that our work gives new impetus to the development of lightweight solutions for a wide range of applications through the direct design of task-specific networks.