Person re-identification (re-id), an important task in computer vision, has developed rapidly in recent years and now faces the demands of real-world applications such as intelligent video surveillance and smart retailing. Many researchers build on open-source code, but modifications that are neither extensible nor reusable make results difficult to reproduce. Besides, there often exists a gap between academic research and model deployment, which makes it difficult for academic research models to be quickly transferred to production.
To accelerate progress in the person re-identification community, including researchers and practitioners in academia and industry, we now release a unified person re-identification library named FastReID. We have introduced a strongly modular, extensible design that allows researchers and practitioners to easily plug their own custom-designed modules into a re-id system without repeatedly rewriting the codebase, so that research ideas can move into production models rapidly. The manageable system configuration makes FastReID flexible and extensible, and it is easily extended to a range of tasks such as general image retrieval and face recognition. Based on FastReID, we provide many state-of-the-art pre-trained models for multiple tasks: person re-id, cross-domain person re-id, partial person re-id and vehicle re-id. Beyond providing the codebase and benchmarking results, we hope that the library can enable fair comparison between different approaches.
Recently, FastReID has become one of the most widely used open-source libraries in JD AI Research. We will continually refine it and add new features. We warmly welcome individuals and labs to use our open-source library, and look forward to cooperating with you to jointly accelerate AI research and achieve technological breakthroughs.
2 Highlight of FastReID
FastReID provides a complete toolkit for training, evaluation, finetuning and model deployment. Besides, FastReID provides strong baselines that are capable of achieving state-of-the-art performance on multiple tasks.
Modular and extensible design. In FastReID, we introduce a modular design that allows users to plug custom-designed modules into almost any part of the re-identification system. Therefore, many new researchers and practitioners can quickly implement their ideas without re-writing hundreds of thousands of lines of code.
Manageable system configuration.
FastReID, implemented in PyTorch, provides fast training on multi-GPU servers. Model definitions, training and testing are driven by YAML configuration files. FastReID supports many optional components, such as the backbone, head, aggregation layer and loss function, as well as the training strategy.
Richer evaluation system. At present, many researchers only report a single CMC metric. To meet the requirements of model deployment in practical scenarios, FastReID provides richer evaluation metrics, e.g., ROC and mINP, which better reflect the performance of models.
Engineering deployment. A model that is too deep is hard to deploy on edge computing hardware and AI chips due to time-consuming inference and layers that cannot be realized. FastReID implements a knowledge distillation module to obtain a precise and efficient lightweight model. FastReID also provides conversion tools, e.g., PyTorch→Caffe and PyTorch→TensorRT, to achieve fast model deployment.
State-of-the-art pre-trained models. FastReID provides state-of-the-art inference models covering person re-id, partial re-id, cross-domain re-id and vehicle re-id. We plan to release these pre-trained models. FastReID is very easy to extend to general object retrieval and face recognition. We hope that a common software platform will help advance new ideas into applications.
3 Architecture of FastReID
In this section, we elaborate on the pipeline of FastReID as shown in Fig. 1. The whole pipeline consists of four modules: image pre-processing, backbone, aggregation and head. We introduce them in detail one by one.
3.1 Image Pre-processing
The collected images are of different sizes, so we first resize them to a fixed size; the images can then be packed into batches and fed into the network. To obtain a more robust model, flipping mirrors the source images as a data augmentation method to make the data more diverse. Random erasing, Random patch [Zhou_2019_ICCV] and Cutout [devries2017improved] are augmentation methods that randomly select a rectangular region in an image and erase its pixels with random values, another image patch or zero values, respectively, which effectively reduces the risk of over-fitting and makes the model robust to occlusion. Auto-augment is based on an AutoML technique to achieve effective data augmentation for improving the robustness of feature representations. It uses an automatic search algorithm to find a fusion policy over multiple image processing functions such as translation, rotation and shearing.
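The rectangle-erasing family of augmentations described above can be sketched in a few lines of NumPy. This is a hypothetical stand-alone helper for illustration, not FastReID's actual implementation; the function name and parameters are our own:

```python
import numpy as np

def random_erasing(img, area_frac=0.25, value=None, rng=None):
    """Erase a random rectangle of an (H, W, C) image.

    value=None fills with random values (Random Erasing style);
    value=0 fills with zeros (Cutout style).
    """
    rng = rng or np.random.default_rng(0)
    h, w, c = img.shape
    # Side lengths chosen so the erased area is roughly area_frac of the image.
    eh = max(1, int(h * np.sqrt(area_frac)))
    ew = max(1, int(w * np.sqrt(area_frac)))
    top = rng.integers(0, h - eh + 1)
    left = rng.integers(0, w - ew + 1)
    out = img.copy()
    if value is None:
        out[top:top + eh, left:left + ew] = rng.uniform(0, 255, size=(eh, ew, c))
    else:
        out[top:top + eh, left:left + ew] = value
    return out
```

Replacing the rectangle with a patch cut from another training image would give the Random patch variant.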
Backbone is the network that maps an image to feature maps, such as a ResNet without the last average pooling layer. FastReID implements three different backbones: ResNet [he2016deep], ResNeXt [xie2017aggregated] and ResNeSt [zhang2020resnest]. We also add the attention-like non-local [wang2018non] module and the instance-batch normalization (IBN) [pan2018two] module into the backbones to learn more robust features.
The aggregation layer aims to aggregate the feature maps generated by the backbone into a global feature. We introduce four aggregation methods: max pooling, average pooling, GeM pooling and attention pooling. The pooling layer takes the feature maps $X \in \mathbb{R}^{W \times H \times C}$ as input and produces a vector $f = [f_1, \dots, f_C]^{\mathrm{T}}$ as output, where $W$, $H$ and $C$ respectively represent the width, the height and the channel of the feature maps. The global vector entry $f_c$ in the case of max pooling, average pooling, GeM pooling and attention pooling is respectively given by

$$f_c^{(\max)} = \max_{x \in X_c} x, \qquad f_c^{(\mathrm{avg})} = \frac{1}{|X_c|} \sum_{x \in X_c} x,$$

$$f_c^{(\mathrm{gem})} = \Big( \frac{1}{|X_c|} \sum_{x \in X_c} x^{\alpha} \Big)^{\frac{1}{\alpha}}, \qquad f_c^{(\mathrm{att})} = \frac{\sum_{x \in X_c,\, w \in W_c} w\,x}{\sum_{w \in W_c} w},$$

where $\alpha$ is a control coefficient and the $w \in W_c$ are the softmax attention weights.
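The four pooling variants can be sketched directly in NumPy. This is an illustrative, framework-free version under our own naming; FastReID's real implementation operates on PyTorch tensors, and the attention-pooling weights here are assumed to be given rather than learned:

```python
import numpy as np

def max_pool(X):
    """X: (W, H, C) feature maps -> (C,) global vector."""
    return X.max(axis=(0, 1))

def avg_pool(X):
    return X.mean(axis=(0, 1))

def gem_pool(X, alpha=3.0):
    # Generalized-mean pooling: alpha -> inf approaches max pooling,
    # alpha = 1 gives average pooling. Values are clamped to stay positive.
    return (np.maximum(X, 1e-6) ** alpha).mean(axis=(0, 1)) ** (1.0 / alpha)

def attention_pool(X, W):
    # W: (W, H, C) softmax attention weights over spatial positions.
    return (W * X).sum(axis=(0, 1)) / W.sum(axis=(0, 1))
```

By the power-mean inequality, the GeM output always lies between the average-pooled and max-pooled values for positive activations, which is what makes it a tunable compromise between the two.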
Head is the part that processes the global vector generated by the aggregation module, and includes the batch normalization (BN) head, the Linear head and the Reduction head. The three types of head are shown in Fig. 3: the linear head contains only a decision layer; the BN head contains a bn layer and a decision layer; and the reduction head contains a conv+bn+relu+dropout operation, a reduction layer and a decision layer.
Batch Normalization [ioffe2015batch] is used to address internal covariate shift, since it is very difficult to train models with saturating non-linearities. Given a batch of feature vectors $F = \{f_i\}_{i=1}^{m}$ (where $m$ is the number of samples in a batch), the bn feature vector can be computed as

$$\mu = \frac{1}{m} \sum_{i=1}^{m} f_i, \qquad \sigma^2 = \frac{1}{m} \sum_{i=1}^{m} (f_i - \mu)^2, \qquad \hat{f}_i = \gamma\,\frac{f_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,$$

where $\gamma$ and $\beta$ are trainable scale and shift parameters, and $\epsilon$ is a constant added to the mini-batch variance for numerical stability.
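The batch-normalization computation above can be written out in NumPy as a minimal sketch (inference-time running statistics and the trainable-parameter updates are omitted):

```python
import numpy as np

def batch_norm(F, gamma=1.0, beta=0.0, eps=1e-5):
    """F: (m, C) batch of feature vectors.

    Normalize each channel by the mini-batch mean and variance,
    then apply the trainable scale (gamma) and shift (beta).
    """
    mu = F.mean(axis=0)
    var = F.var(axis=0)
    return gamma * (F - mu) / np.sqrt(var + eps) + beta
```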
Reduction layer aims to turn the high-dimensional feature into a low-dimensional feature, e.g., 2048-dim → 512-dim.
Decision layer outputs the probabilities of the different categories, distinguishing the categories for the subsequent model training.
4.1 Loss Function
Four different loss functions are implemented in FastReID.
Cross-entropy loss is usually used for one-of-many classification and can be defined as

$$L_{ce} = -\sum_{i=1}^{C} y_i \log \hat{y}_i,$$

where $\hat{y}_i$ is the predicted probability of class $i$ and $y_i$ is the ground-truth label. Cross-entropy loss makes the predicted logit values approximate the ground truth. It encourages the differences between the largest logit and all others to become large, and this, combined with the bounded gradient, reduces the ability of the model to adapt, resulting in a model too confident about its predictions. This, in turn, can lead to over-fitting. To build a robust model that can generalize well, Label Smoothing was proposed by Google Brain to address the problem. It encourages the activations of the penultimate layer to be close to the template of the correct class and equally distant to the templates of the incorrect classes. So the ground-truth label in the cross-entropy loss can be redefined as

$$y_i = \begin{cases} 1 - \frac{C-1}{C}\,\varepsilon, & i = c, \\ \frac{\varepsilon}{C}, & i \neq c, \end{cases}$$

where $c$ is the ground-truth class and $\varepsilon$ is a small smoothing constant.
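A minimal NumPy sketch of cross-entropy with label smoothing, using the common formulation in which the true class keeps $1 - \frac{C-1}{C}\varepsilon$ of the probability mass and every class receives $\varepsilon/C$ (an assumption about the exact smoothing scheme; implementations vary):

```python
import numpy as np

def smoothed_cross_entropy(logits, y, eps=0.1):
    """Cross-entropy against a label-smoothed target distribution.

    logits: (C,) raw class scores; y: index of the true class.
    """
    C = logits.shape[-1]
    # Numerically stable log-softmax.
    z = logits - logits.max()
    log_p = z - np.log(np.exp(z).sum())
    # Smoothed ground-truth distribution (sums to 1).
    q = np.full(C, eps / C)
    q[y] = 1.0 - (C - 1) * eps / C
    return -(q * log_p).sum()
```

With `eps=0` this reduces to the plain cross-entropy loss; increasing `eps` penalizes over-confident predictions.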
Arcface loss [deng2019arcface] maps Cartesian coordinates to spherical coordinates. It transforms the logit as $W_j^{\mathrm{T}} f = \|W_j\| \|f\| \cos\theta_j$, where $\theta_j$ is the angle between the weight $W_j$ and the feature $f$. It fixes the individual weight $\|W_j\| = 1$ by $l_2$ normalisation, and also fixes the embedding feature $\|f\|$ by $l_2$ normalisation and re-scales it to $s$, so that $W_j^{\mathrm{T}} f = s \cos\theta_j$. To simultaneously enhance the intra-class compactness and inter-class discrepancy, Arcface adds an additive angular margin penalty $m$ to the intra-class measure, so $s \cos\theta_{y_i}$ can be rewritten as $s \cos(\theta_{y_i} + m)$.
Circle loss. The derivation of the circle loss is not described here in detail; readers can refer to [sun2020circle].
Triplet loss ensures that, in the image embedding space, an image (anchor) of a specific person is closer to all other images (positive) of the same person than to any image (negative) of any other person. Thus, we want $D(a, p) + m < D(a, n)$, where $D(\cdot, \cdot)$ measures the distance between a pair of person images. The triplet loss over $N$ samples is then defined as

$$L_{triplet} = \sum_{i=1}^{N} \big[\, m + D(a_i, p_i) - D(a_i, n_i) \,\big]_{+},$$

where $m$ is a margin that is enforced between the positive and negative pairs.
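The single-triplet term of this loss can be sketched in NumPy, using Euclidean distance as the measure $D$ (one common choice; batch-hard mining over a whole mini-batch is omitted for brevity):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge on the gap between anchor-positive and anchor-negative
    Euclidean distances: max(0, margin + D(a, p) - D(a, n))."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, margin + d_ap - d_an)
```

The loss is zero exactly when the negative is already further from the anchor than the positive by at least the margin, so well-separated triplets contribute no gradient.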
4.2 Training Strategy
Fig. 4 shows the training strategy, which contains many tricks, including a learning rate schedule over iterations, network warm-up and freezing.
Learning rate warm-up helps to slow down premature over-fitting to the mini-batches in the initial stage of model training, and also helps to maintain the stability of the deep layers of the model. Therefore, we use a very small learning rate at the start of training and then gradually increase it during the first 2k iterations. After that, the learning rate remains constant between 2k and 9k iterations. Then, the learning rate decays following a cosine rule after 9k iterations, and training finishes at 18k iterations.
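The warm-up / constant / cosine-decay schedule can be sketched as a pure function of the iteration count. The concrete values below (base rate `3.5e-4`, warm-up factor `0.01`) are illustrative assumptions, not values taken from the text; only the 2k/9k/18k milestones come from the schedule described above:

```python
import math

def learning_rate(it, base_lr=3.5e-4, warmup_iters=2000,
                  decay_start=9000, max_iters=18000, warmup_factor=0.01):
    """Warm-up + constant + cosine-decay learning-rate schedule.

    - linearly warm up from warmup_factor * base_lr over warmup_iters,
    - hold base_lr until decay_start,
    - then decay towards 0 along a cosine curve until max_iters.
    """
    if it < warmup_iters:
        alpha = it / warmup_iters
        return base_lr * (warmup_factor * (1 - alpha) + alpha)
    if it < decay_start:
        return base_lr
    progress = (it - decay_start) / (max_iters - decay_start)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
```

In practice the same shape can be assembled from standard PyTorch schedulers; a pure function makes the three phases easy to verify.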
To re-train a classification network to meet the requirements of our tasks, we use the data collected from the tasks to fine-tune an ImageNet pre-trained model. Generally, we append a classifier to the network (such as a ResNet), with the classifier parameters randomly initialized. To better initialize the classifier parameters, at the beginning of training (the first 2k iterations) we train only the classifier parameters while freezing the network parameters. After 2k iterations, we unfreeze the network parameters for end-to-end training.
5.1 Distance Metric.
Euclidean and cosine measures are implemented in FastReID. We also implement a local matching method: deep spatial reconstruction (DSR).
Deep spatial reconstruction. Suppose there is a pair of person images $x$ and $y$. Denote the spatial feature map from the backbone as $X$ for $x$ and as $Y$ for $y$, each consisting of $d$-dimensional features at the spatial locations. The spatial features from the $N$ query locations are aggregated into a matrix $X = [x_1, \dots, x_N] \in \mathbb{R}^{d \times N}$. Likewise, we construct the gallery feature matrix $Y = [y_1, \dots, y_M] \in \mathbb{R}^{d \times M}$. Then each $x_n$ finds the most similar spatial feature in $Y$ to match, with matching score $s_n$. We obtain such similarity scores for all spatial features of $X$ with respect to $Y$, and the final matching score can be defined as $s = \frac{1}{N} \sum_{n=1}^{N} s_n$.
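A greedy max-cosine approximation of this spatial matching can be sketched in NumPy. Note this is a simplified stand-in for illustration: the full DSR method reconstructs each query location from the gallery feature matrix, whereas here each location simply takes its best cosine match:

```python
import numpy as np

def matching_score(X, Y):
    """Greedy spatial matching between two feature maps.

    X: (d, N) query spatial features, Y: (d, M) gallery spatial features.
    Each query location picks its most similar gallery location by cosine
    similarity; the final score averages over all N query locations.
    """
    Xn = X / np.linalg.norm(X, axis=0, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=0, keepdims=True)
    sim = Xn.T @ Yn              # (N, M) cosine similarities
    return sim.max(axis=1).mean()
```

Because each query location matches independently, the score degrades gracefully when only part of the person is visible, which is the point of local matching for partial re-id.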
Two re-ranking methods, K-reciprocal coding [zhong2017re] and Query Expansion (QE) [bhagwan2004total], are implemented in FastReID.
Query expansion. Given a query image, we use it to find similar gallery images. The query feature is defined as $q$ and the top-$k$ similar gallery features as $g_1, \dots, g_k$. The new query feature is then constructed by averaging the verified gallery features and the original query feature:

$$q_{new} = \frac{1}{k+1} \Big( q + \sum_{i=1}^{k} g_i \Big).$$

After that, the new query feature is used for the subsequent image retrieval. QE can be easily used in practical scenarios.
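Query expansion is a few lines of NumPy; the sketch below selects the top-$k$ gallery features by cosine similarity (the selection criterion is an assumption, since the verification step is not specified in the text) and averages them with the query:

```python
import numpy as np

def query_expansion(q, gallery, k=5):
    """Average the query feature with its top-k most similar gallery
    features to form a new query feature.

    q: (d,) query feature; gallery: (n, d) gallery features.
    """
    qn = q / np.linalg.norm(q)
    gn = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    topk = np.argsort(gn @ qn)[::-1][:k]     # indices of k best matches
    return (q + gallery[topk].sum(axis=0)) / (k + 1)
```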
For performance evaluation, we employ the standard metrics used in most person re-identification literature, namely the cumulative matching characteristic (CMC) curve and the mean Average Precision (mAP). Besides, we also add two metrics: the receiver operating characteristic (ROC) curve and the mean inverse negative penalty (mINP) [ye2020deep].
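For a single query, CMC and Average Precision can be computed as a short NumPy sketch (a simplified textbook version; the standard Market1501-style protocol additionally filters same-camera junk images, which is omitted here):

```python
import numpy as np

def cmc_and_ap(scores, matches):
    """Single-query CMC curve and Average Precision.

    scores: (n,) similarity of the query against each gallery item;
    matches: (n,) booleans, True where the gallery identity is correct.
    """
    order = np.argsort(scores)[::-1]          # rank gallery by similarity
    hits = np.asarray(matches)[order]
    first = hits.argmax()                     # rank of first correct match
    cmc = (np.arange(len(hits)) >= first).astype(float)
    precision = hits.cumsum() / (np.arange(len(hits)) + 1)
    ap = (precision * hits).sum() / hits.sum()
    return cmc, ap
```

Averaging `cmc` and `ap` over all queries yields the reported CMC curve and mAP.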
We provide a rank-list visualization tool for retrieval results, which helps to inspect failure cases that our algorithm has not yet solved.
In general, the deeper the model, the better the performance. However, too deep a model is not easy to deploy in edge computing hardware and AI chips since 1) it needs time-consuming inference; 2) many layers are difficult to implement on AI chips. Considering these reasons, we implement the knowledge distillation module in FastReID to achieve a high-precision, high-efficiency lightweight model.
As shown in Fig. 5, given a pre-trained student model and a pre-trained teacher model on re-id datasets, the teacher model is a deeper model with the non-local module, the IBN module and some useful tricks, while the student model is simple and shallow. We adopt a two-stream approach to train the student model with the teacher backbone frozen. The student and teacher models respectively output classifier logits $l_s$, $l_t$ and features $f_s$, $f_t$. We want the student model to learn the classification ability of the teacher model as much as possible, so the logit learning loss $L_{logit}$ penalizes the discrepancy between the student logits $l_s$ and the teacher logits $l_t$.
| Method | Market1501 R-1 | Market1501 mAP | DukeMTMC R-1 | DukeMTMC mAP | MSMT17 R-1 | MSMT17 mAP |
|---|---|---|---|---|---|---|
| SPReID [Kalayeh_2018_CVPR] (CVPR'18) | 92.5 | 81.3 | 84.4 | 70.1 | - | - |
| PCB [sun2018beyond] (ECCV'18) | 92.3 | 77.4 | 81.8 | 66.1 | - | - |
| AANet [Tay_2019_CVPR] (CVPR'19) | 93.9 | 83.4 | 87.7 | 74.3 | - | - |
| IANet [Hou_2019_CVPR] (CVPR'19) | 94.4 | 83.1 | 87.1 | 73.4 | 75.5 | 45.8 |
| CAMA [Yang_2019_CVPR] (CVPR'19) | 94.7 | 84.5 | 85.8 | 72.9 | - | - |
| DGNet [Zheng_2019_CVPR] (CVPR'19) | 94.8 | 86.0 | 86.6 | 74.8 | - | - |
| DSAP [Zhang_2019_CVPR] (CVPR'19) | 95.7 | 87.6 | 86.2 | 74.3 | - | - |
| Pyramid [Zheng_2019_CVPR] (CVPR'19) | 95.7 | 88.2 | 89.0 | 79.0 | - | - |
| Auto-ReID [Quan_2019_ICCV] (ICCV'19) | 94.5 | 85.1 | - | - | 78.2 | 52.5 |
| OSNet [Zhou_2019_ICCV] (ICCV'19) | 94.8 | 84.9 | 88.6 | 73.5 | 78.7 | 52.9 |
| MHN [Chen_2019_ICCV] (ICCV'19) | 95.1 | 85.0 | 89.1 | 77.2 | - | - |
| P2-Net [Guo_2019_ICCV] (ICCV'19) | 95.2 | 85.6 | 86.5 | 75.1 | - | - |
| BDB [Dai_2019_ICCV] (ICCV'19) | 95.3 | 86.7 | 89.0 | 76.0 | - | - |
| FPR [He_2019_ICCV] (ICCV'19) | 95.4 | 86.6 | 88.6 | 78.4 | - | - |
| ABDNet [Chen_2019_ICCV] (ICCV'19) | 95.6 | 88.3 | 89.0 | 78.6 | 82.3 | 60.8 |
| SONA [Xia_2019_ICCV] (ICCV'19) | 95.7 | 88.7 | 89.3 | 78.1 | - | - |
| SCAL [Chen_2019_ICCV] (ICCV'19) | 95.8 | 89.3 | 89.0 | 79.6 | - | - |
| CAR [Zhou_2019_ICCV] (ICCV'19) | 96.1 | 84.7 | 86.3 | 73.1 | - | - |
| Circle Loss [sun2020circle] (CVPR'20) | 96.1 | 87.4 | 89.0 | 79.6 | 76.9 | 52.1 |
In order to ensure consistency between the student model and the teacher model in the feature-space distribution, a probabilistic knowledge transfer model based on the Kullback-Leibler divergence is used to optimize the student model:

$$L_{pkt} = \sum_{i} \sum_{j \neq i} p_{j|i}^{t} \log \frac{p_{j|i}^{t}}{p_{j|i}^{s}}, \qquad p_{j|i} = \frac{K(f_i, f_j)}{\sum_{k \neq i} K(f_i, f_k)},$$

where $K(\cdot, \cdot)$ is the cosine similarity measure and $p_{j|i}$ is the conditional probability distribution induced by the pairwise feature similarities within the batch.
At the same time, the student model needs the re-id loss to optimize the entire network. Therefore, the total loss is the sum of the re-id loss, the logit loss and the PKT loss. After training finishes, the student model is used for inference.
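The PKT term can be sketched in NumPy, following the standard probabilistic knowledge transfer formulation (pairwise cosine similarities rescaled to [0, 1] and row-normalized into conditional probabilities). This is our reading of the method; FastReID's implementation details may differ:

```python
import numpy as np

def pkt_loss(f_t, f_s, eps=1e-8):
    """KL divergence between the pairwise-similarity distributions of
    teacher and student features.

    f_t, f_s: (m, d) batches of teacher / student features.
    """
    def cond_prob(F):
        Fn = F / (np.linalg.norm(F, axis=1, keepdims=True) + eps)
        K = (Fn @ Fn.T + 1.0) / 2.0        # cosine similarity scaled to [0, 1]
        np.fill_diagonal(K, 0.0)           # exclude self-similarity
        return K / K.sum(axis=1, keepdims=True)

    P_t, P_s = cond_prob(f_t), cond_prob(f_s)
    mask = ~np.eye(len(f_t), dtype=bool)
    return np.sum(P_t[mask] * np.log((P_t[mask] + eps) / (P_s[mask] + eps)))
```

Matching these distributions, rather than the raw features, lets a low-dimensional student mimic the geometry of a higher-dimensional teacher.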
We also provide model conversion tools (PyTorch→Caffe and PyTorch→TensorRT) in the FastReID library.
7.1 Person Re-identification
Datasets. Three person re-id benchmarking datasets, Market1501 [bai2017scalable], DukeMTMC [zheng2017unlabeled] and MSMT17 [qian2019leader], are used for evaluating FastReID. We won't go into the details of these datasets here.
FastReID Setting. We use flipping, random erasing and auto-augment to process the training images. IBN-ResNet101 with a non-local module is used as the backbone. Gem pooling and bnneck are used as the aggregation layer and head, respectively. For the batch hard triplet loss function, one batch consists of 4 subjects, each with 16 different images, and we use circle loss and triplet loss to train the whole network.
Result. The state-of-the-art algorithms published at CVPR, ICCV and ECCV during 2018-2020 are listed in Table 1. FastReID achieves the best performance on Market1501 (96.3%/90.3%), DukeMTMC (92.4%/83.2%) and MSMT17 (85.1%/65.4%) at rank-1/mAP accuracy. Fig. 6 shows the ROC curves on the three benchmarking datasets.
7.2 Cross-domain Person Re-identification
Problem definition. Cross-domain person re-identification aims at adapting the model trained on a labeled source domain dataset to another target domain dataset without any annotation.
Setting. We propose a cross-domain method, FastReID-MLT, that adopts mixture label transport to learn pseudo labels with a multi-granularity strategy. We first train a model on a source-domain dataset and then fine-tune the pre-trained model with pseudo labels of the target-domain dataset. FastReID-MLT is implemented with a ResNet50 backbone, gem pooling and a bnneck head. For the batch hard triplet loss function, one batch consists of 4 subjects, each with 16 different images, and we use circle loss and triplet loss to train the whole network. The detailed configuration can be found on the GitHub website. The framework of FastReID-MLT is shown in Fig. 7.
| Method | D→M mAP | D→M R-1 | M→D mAP | M→D R-1 |
|---|---|---|---|---|
| TJ-AIDL [wang2018transferable] (CVPR'18) | 26.5 | 58.2 | 23.0 | 44.3 |
| SPGAN [deng2018image] (CVPR'18) | 22.8 | 51.5 | 22.3 | 41.1 |
| HHL [zhong2018generalizing] (ECCV'18) | 31.4 | 62.2 | 27.2 | 46.9 |
| ARN [li2018adaptation] (CVPR'18-WS) | - | - | - | - |
| ECN [zhong2019invariance] (CVPR'19) | 43.0 | 75.1 | 40.4 | 63.3 |
| UCDA [qi2019novel] (ICCV'19) | 30.9 | 60.4 | 31.0 | 47.7 |
| PDA-Net [li2019cross] (ICCV'19) | 47.6 | 75.2 | 45.1 | 63.2 |
| PCB-PAST [zhang2019self] (ICCV'19) | 54.6 | 78.4 | 54.3 | 72.4 |
| SSG [yang2019selfsimilarity] (ICCV'19) | 58.3 | 80.0 | 53.4 | 73.0 |
| MPLP+MMCL [WANG2020cvpr1] (CVPR'20) | 60.4 | 84.4 | 51.4 | 72.4 |
| AD-Cluster [zhai2020adcluster] (CVPR'20) | 68.3 | 86.7 | 54.1 | 72.6 |
| MMT [ge2020mutual] (ICLR'20) | 71.2 | 87.7 | 65.1 | 78.0 |
| Supervised learning (BOT [Luo2019CVPRWorkshops]) | - | - | - | - |
| Method | M→MSMT mAP | M→MSMT R-1 | D→MSMT mAP | D→MSMT R-1 |
|---|---|---|---|---|
| PTGAN [wei2018person] (CVPR'18) | 2.9 | 10.2 | 3.3 | 11.8 |
| ENC [zhong2019invariance] (CVPR'19) | 8.5 | 25.3 | 10.2 | 30.2 |
| SSG [yang2019selfsimilarity] (ICCV'19) | 13.2 | 31.6 | 13.3 | 32.2 |
| DAAM [Huang2020aaai] (AAAI'20) | 20.8 | 44.5 | 21.6 | 46.7 |
| MMT [ge2020mutual] (ICLR'20) | - | - | - | - |
| Supervised learning (BOT [Luo2019CVPRWorkshops]) | 48.3 | 72.3 | 48.3 | 72.3 |
Result. Table 3 shows the results on several datasets. FastReID-MLT achieves 92.7% (77.5%) and 82.7% (69.2%) rank-1 (mAP) under the D→M and M→D settings, respectively. These results are close to those of supervised learning.
7.3 Partial Person Re-identification
Problem definition. Partial person re-identification (re-id) is a challenging problem, where only several partial observations (images) of people are available for matching.
Setting. The setting is shown in Fig. 8.
| Method | PartialREID R-1 | PartialREID mAP | OccludedREID R-1 | OccludedREID mAP | Partial-iLIDS R-1 | Partial-iLIDS mAP |
|---|---|---|---|---|---|---|
| PCB [sun2018beyond] (ECCV'18) | 56.3 | 54.7 | 41.3 | 38.9 | 46.8 | 40.2 |
| SCPNet [fan2018scpnet] (ACCV'18) | 68.3 | - | - | - | - | - |
| DSR [he2018deep] (CVPR'18) | 73.7 | 68.1 | 72.8 | 62.8 | 64.3 | 58.1 |
| VPM [sun2019perceive] (CVPR'19) | 67.7 | - | - | - | 65.5 | - |
| FPR [he2019foreground] (ICCV'19) | 81.0 | 76.6 | 78.3 | 68.0 | 68.1 | 61.8 |
| HOReID [wang2020high] (CVPR'20) | 85.3 | - | 80.3 | 70.2 | 72.6 | - |
Result. Table 5 shows the results on the PartialREID, OccludedREID and Partial-iLIDS datasets. FastReID-DSR achieves 82.7% (76.8%), 81.6% (70.9%) and 73.1% (79.8%) at rank-1 (mAP) metrics.
7.4 Vehicle Re-identification
Datasets. Three vehicle re-id benchmarking datasets, VeRi, VehicleID and VERI-Wild, are used for evaluating FastReID. We won't go into the details of these datasets here.
Settings. The setting is shown in Fig. 9.
This paper introduces an open-source library named FastReID for re-identification. Experimental results demonstrate the versatility and effectiveness of FastReID on multiple tasks, such as person re-identification and vehicle re-identification. We are sharing FastReID because open-source research platforms are critical to the rapid advances in AI made by the entire community, including researchers and practitioners in academia and industry. We hope that releasing FastReID will continue to accelerate progress in the area of person/vehicle re-identification. We also look forward to collaborating with and learning from each other to advance the development of computer vision.
| Methods | mAP (%) | R-1 (%) | R-5 (%) |
|---|---|---|---|
| Siamese-CNN [iccv/ShenXLYW17] (ICCV'17) | 54.2 | 79.3 | 88.9 |
| FDA-Net [cvpr/LouB0WD19] (CVPR'19) | 55.5 | 84.3 | 92.4 |
| Siamese-CNN+ST [iccv/ShenXLYW17] (ICCV'17) | 58.3 | 83.5 | 90.0 |
| PROVID [tmm/LiuLMM18] (TMM'18) | 53.4 | 81.6 | 95.1 |
| PRN [cvpr/HeLZT19] (CVPR'19) | 70.2 | 92.2 | 97.9 |
| PRN [cvpr/HeLZT19] (CVPR'19) | 74.3 | 94.3 | 98.9 |