1 Introduction
Person re-identification aims at spotting the same person across non-overlapping camera views, and can be applied to crime suspect recognition, target customer identification and other scenarios.
As a key technology in intelligent video surveillance, person re-identification has received great attention from scholars, and many excellent results have emerged. In particular, since the introduction of deep learning, feature extraction and metric learning have been integrated, so that more discriminative feature representations and distance metrics can be learned, far exceeding the traditional two-stage approach of handcrafted features followed by separate distance metric learning.
Although person re-identification has made great progress, two main problems remain. First, existing deep learning methods are dominated by fully supervised networks that rely on large amounts of annotated data. In practical applications, the scene is complex and changes over time, and obtaining a large amount of annotated data suited to the application scene is expensive and impractical, which limits the applicability of supervised algorithms. Second, existing algorithms achieve good results on a limited set of public datasets, but when applied to real scenes they are subject to a series of complicating factors such as occlusion, illumination changes, pedestrian pose changes, viewing angle changes, and even camera model changes, which greatly reduce recognition accuracy. Existing research is therefore still far from the level required by practical applications.
Relying on a large amount of manually labeled data is expensive and impractical in practice, but a large number of unlabeled samples can be obtained from person detection and tracking in surveillance video. A few-example learning method needs only one labeled sample or a small number of labeled samples. By correctly estimating labels for the unlabeled samples and using them to train the network, it avoids the cost and time of manual labeling. Few-example person re-identification therefore has great research and practical value.
As illustrated in Figure 1, we iteratively estimate labels for unlabeled samples, incorporate them into the training set, and train a more robust network.
Our contributions can be summarized as follows:

A multi-branch network, PAM, that jointly learns local and global features is proposed. PAM has high accuracy and few parameters and converges fast, which makes it suitable for few-example person re-identification.

We propose the static relative distance sampling (SRD) strategy based on the relative distance between classes, which surpasses GPS on small-scale datasets. To address the problem that SRD cannot use all unlabeled samples, we propose the adaptive relative distance sampling (ARD) strategy.

In the one-example experiments, PAM+GPS reaches 86.9% rank-1 accuracy on DukeMTMC-VID and 47.26% mAP on MARS. PAM+ARD reaches 89.78%, 56.13% and 89.17% rank-1 accuracy on PRID2011, iLIDS-VID and DukeMTMC-VID respectively, and 45.36% mAP on MARS, which exceeds the state of the art by a large margin.
2 Related work
In recent years, approaches based on deep neural networks have dominated research on person re-identification. These approaches combine feature representation learning and distance metric learning in an end-to-end architecture.
Supervised Person ReID. Dong Yi et al. [17] introduce the siamese network to person re-identification and use cosine similarity to measure the similarity loss. Although the siamese network has many excellent properties, training can easily fail because the cross entropy loss is very sensitive to small changes in the feature vector. To solve this problem, S. Ding et al. [2] use a triplet composed of positive and negative sample pairs as the input of the siamese network. By learning from pairs of positive and negative samples, feature embeddings become more discriminative. Their work also combines the siamese network with a common classification network, using both the Softmax loss and an improved triplet loss. Zhao et al. [18] introduce saliency information into person re-identification, learning the saliency of pedestrians and matching the salient similarities of different pedestrians. W. Li et al. [8] propose a multi-branch network. The upstream single branch learns the global feature of pedestrians, and the downstream multiple branches learn multiple local features. Meanwhile, the interaction between soft and hard attention, as well as channel attention and spatial attention, is used. The network proposed by S. Li et al. [6] learns multiple spatial part attention models and adopts a diversity regularization term to ensure that these attention models focus on different areas of the body. Its temporal attention mechanism enables the network to learn the features of the face, torso, and other parts of the body from the best-conditioned frames in the sequence.
Few-example person ReID. In general, there are two main types of few-example methods. One is to establish a good model with only a small number of labeled data. These methods may use siamese networks, matching networks, meta-learning, and transfer learning. The other is to train a good network by estimating labels for the unlabeled data and then augmenting the training set with them.
In the first type, Koch et al. [4] propose a framework for solving the problem of one-shot classification. They first build a fully convolutional siamese network based on a verification loss, and then use this network to calculate the similarity between the image to be identified and the labeled samples. The image is then assigned to the category of the most similar labeled sample. Vinyals et al. [13] propose the matching network. During training, some samples are selected to form a support set and the remaining samples are used as training images. They construct different encoders for the support set and the training images. The classifier's output is a weighted sum of the predicted values between the support set and the training images. During testing, the one-shot samples are used as the support set to predict the category of new images. Rahimpour et al. [11] use meta-learning to learn multiple similar tasks, and build two encoders, one for the gallery and one for the probe. Based on these encoders, the embedding of each gallery image is computed according to the characteristics of the remaining gallery images, and the embedding of each probe image is computed according to the characteristics of the gallery images. In this way they obtain a more discriminative feature representation.
In the second type, Ye et al. [16] establish a graph for each camera. They view the labeled samples as the nodes of the graph, and the distances between video sequence features as the paths. Unlabeled samples are mapped into the different graphs (that is, their labels are estimated) so as to minimize the objective function. The graphs are updated dynamically. They continually estimate labels and train models until the algorithm converges. Liu et al. [10] first initialize the model with labeled samples. Then they compute the k nearest neighbors of each probe in the gallery. They remove the suspect samples and add the remaining samples to the training set. The procedure is iterated until the algorithm converges. Wu et al. [15] first initialize a CNN with labeled data, and then linearly incorporate pseudo-label samples into the training set according to their distance to the labeled samples. The CNN is then retrained with the new training set. Once all unlabeled samples have estimated labels and have been added to the training set, a validation set is used to select the best model.
3 Method
3.1 Framework Overview
The framework of our approach is shown in Figure 1. We first initialize the CNN with the labeled samples. The CNN here is the PAM network (section 3.2). Then we use the trained CNN to estimate labels for the unlabeled samples based on the distance between the labeled and unlabeled samples: the label of the labeled sample closest to an unlabeled sample in the feature space is used as the estimated label of that unlabeled sample. Unlabeled samples and their estimated labels form pseudo-label samples. Then we use the sampling strategy ARD (section 3.3) to select the correctly estimated ones and incorporate them into the training set. After that, the enlarged training set is used to retrain the CNN. We iterate this process until all unlabeled samples have been estimated and added to the training set. Since the training set is enlarged continuously during the iterations, we progressively learn a more stable model. When the algorithm converges, we obtain the most robust model, trained with all samples.
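The label estimation step described above can be illustrated with a minimal sketch. It assumes features are plain lists of floats and uses the Euclidean distance; the function name `estimate_labels` and the toy data are hypothetical and are not taken from the paper's implementation.

```python
import math

def estimate_labels(labeled, unlabeled):
    """Give each unlabeled feature the label of its nearest labeled feature.

    labeled:   list of (feature_vector, label) pairs
    unlabeled: list of feature_vectors
    Returns one (estimated_label, distance) pair per unlabeled sample.
    """
    pseudo = []
    for u in unlabeled:
        best_label, best_dist = None, float("inf")
        for feat, label in labeled:
            d = math.dist(u, feat)  # Euclidean (L2) distance in feature space
            if d < best_dist:
                best_label, best_dist = label, d
        pseudo.append((best_label, best_dist))
    return pseudo

# Toy 2-D features standing in for CNN embeddings (illustrative values only).
labeled = [([0.0, 0.0], "id_A"), ([4.0, 0.0], "id_B")]
unlabeled = [[1.0, 0.0], [3.5, 0.0]]
pseudo_labels = estimate_labels(labeled, unlabeled)
```

The returned distance is kept alongside the label because the sampling strategies discussed in section 3.3 decide which pseudo-label samples to trust on the basis of distances.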
3.2 Part Attention Model
This paper adopts a few-example method that iteratively estimates labels. If the model is too complex and has too many parameters, each iteration takes too long. If the model does not perform well, label estimation accuracy is low, the training set fills up with wrongly labeled samples, and the algorithm performs poorly. The network must therefore be simple in structure and high in accuracy. Most existing fully supervised networks with good performance are complex, with a large number of parameters and slow convergence, so they cannot be applied here, while networks with few parameters do not perform well enough. To solve this problem, this paper proposes a multi-branch network for joint learning of local and global features.
The network structure is shown in Figure 2: a multi-branch network that learns both local and global features. Frames are randomly sampled from a tracklet and fed into the network. The backbone produces a feature map for each frame. In the global branch, the whole feature map passes through global pooling and dimension reduction to give a per-frame feature matrix, and temporal average pooling over the frames then gives the global feature. In the local branch, the feature map is sliced vertically into p local feature maps; each local feature map is globally pooled and dimension-reduced to give p feature matrices, and temporal average pooling over the frames then gives p local features.
The backbone is ResNet50 pretrained on ImageNet [5]. The global branch is the original IDE network [18], and the local branch slices the feature map output by the last pooling layer of ResNet50 in a 1:2:2:1 ratio. The global feature and each local feature each train their own classifier. Given an input tracklet, the network extracts the global feature and the 4 local features of each frame, and concatenates the global feature and the four local features as the feature representation of the entire tracklet.
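To make the 1:2:2:1 slicing and the temporal average pooling concrete, here is a simplified sketch. It assumes a single-channel feature map stored as a list of rows, slices along the height axis, and leaves out the backbone, dimension reduction and classifiers; all function names are hypothetical.

```python
def slice_rows(height, ratios=(1, 2, 2, 1)):
    """Split `height` rows into stripes proportional to `ratios`.

    Returns (start, end) row ranges with `end` exclusive.
    """
    total = sum(ratios)
    if height % total != 0:
        raise ValueError("this sketch needs height divisible by sum(ratios)")
    unit = height // total
    bounds, start = [], 0
    for r in ratios:
        bounds.append((start, start + r * unit))
        start += r * unit
    return bounds

def mean_pool(rows):
    """Global average pooling over a list of rows (each a list of floats)."""
    values = [v for row in rows for v in row]
    return sum(values) / len(values)

def tracklet_feature(frames, ratios=(1, 2, 2, 1)):
    """Return [global, local_1, ..., local_p], averaged over all frames."""
    per_frame = []
    for fmap in frames:
        feats = [mean_pool(fmap)]  # global branch: pool the whole map
        for start, end in slice_rows(len(fmap), ratios):
            feats.append(mean_pool(fmap[start:end]))  # one local stripe
        per_frame.append(feats)
    n = len(per_frame)
    # Temporal average pooling: average each feature over the frames.
    return [sum(f[i] for f in per_frame) / n for i in range(len(per_frame[0]))]

# Toy frame: 6 rows x 2 columns, where every value in row r equals r.
fmap = [[float(r)] * 2 for r in range(6)]
feature = tracklet_feature([fmap, fmap])
```

With six rows, the stripes cover 1, 2, 2 and 1 rows, so the middle of the body receives wider stripes than the top and bottom, which is how the ratio reads in the text.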
3.3 Adaptively Sampling Distance Strategy
The task of the sampling strategy is to select as many correctly estimated pseudo-label samples as possible to join the training set. If the pseudo-label samples selected after each iteration are correct but few, adding them to the training set and retraining the CNN brings very limited improvement. The next round of label estimation is then only slightly better than the last, and the number of correct labels that can be selected grows very slowly, so the network keeps improving but training takes too long. On the other hand, if many pseudo-label samples are selected in each iteration but their accuracy is low, adding them to the training set degrades the network and defeats the purpose of enlarging the training set. The sampling strategy must therefore select many pseudo-label samples with high accuracy.
Wu et al. [15] adopt a linearly increasing sampling strategy. After each iteration, a nearest-neighbor label estimation strategy assigns a label to every unlabeled sample, and the L2 distance in the feature space between each unlabeled sample and its nearest labeled sample is used as the credibility of the label estimate. The samples are sorted by this distance in ascending order, and a number of the most reliable ones that grows linearly with the iteration is selected (t is the number of iterations and p is a tunable parameter). This strategy achieves good performance on the MARS and DukeMTMC-VID datasets. However, it has some problems. With only a limited number of labeled samples, label estimation is difficult. Each iteration adds a fixed number of samples more than the last, so the initial iterations include only simple samples that are easy to estimate, wasting the network's estimation ability, while the last few iterations add too many difficult samples whose labels are wrongly estimated, overdrawing the network's capability. To estimate and sample more robustly, the authors set p to 0.05 in their experiments, which requires a total of 20 iterations, so training takes too long. Moreover, choosing p is difficult: a smaller p brings better performance but the iteration time is too long, while a larger p causes severe performance degradation because too many wrongly labeled samples are added.
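The linear schedule can be sketched as follows. The exact expression is not legible in the text above, so this sketch assumes that the number of selected samples at iteration t is t * p times the number of unlabeled samples, which is consistent with p = 0.05 needing 20 iterations. Function names are hypothetical.

```python
def linear_sample_count(t, p, num_unlabeled):
    """Samples selected at iteration t (1-based), assuming count = t * p * N."""
    return min(num_unlabeled, round(t * p * num_unlabeled))

def select_most_reliable(distances, count):
    """Indices of the `count` samples with the smallest nearest-neighbor distance."""
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    return order[:count]

def steps_to_use_all(p, num_unlabeled):
    """Number of iterations before every unlabeled sample is selected."""
    t = 1
    while linear_sample_count(t, p, num_unlabeled) < num_unlabeled:
        t += 1
    return t

total_steps = steps_to_use_all(0.05, 1000)
chosen = select_most_reliable([0.9, 0.1, 0.5], 2)
```

Under this assumption the number of added samples per iteration is fixed in advance and does not depend on how reliable the current estimates are, which is the limitation the relative distance strategies below try to address.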
The few-example setting means that only a small number of samples are available when the network is initialized, so the network tends to learn simple and direct distinctions between samples, such as the color of the clothes, and ignores other higher-level discriminative information. If sampling depends only on the absolute distance between samples in the feature space, samples of different classes with similar shallow information (the first line in Figure 41) can be closer than samples of the same class whose shallow information is less similar (the second line in Figure 41), and are therefore added to the training set first. The result is catastrophic: any unlabeled sample with similar surface information (for example, wearing a yellow T-shirt) receives the same estimated label and is added to the training set because of its smaller absolute distance, forming a vicious positive feedback loop that damages the discriminative power of the network. This is also the root cause of the poor performance of PAM+GPS in the one-example setting on the small-scale datasets PRID2011 and iLIDS-VID.
3.3.1 Static Relative Distance Sampling
We propose Static Relative Distance Sampling (SRD), based on the distance relationship between classes. For an unlabeled sample, let $d_{\text{intra}}$ be its distance in the feature space to the nearest labeled sample, whose label becomes the estimated label, and let $d_{\text{inter}}$ be the minimum distance between that labeled sample and the labeled samples with a different label (i.e., not belonging to the same class). When $d_{\text{intra}} < k \cdot d_{\text{inter}}$ is satisfied, where k is a constant less than 1, the unlabeled sample and its estimated label are added to the training set. The algorithm iteratively trains the network, estimates labels, and samples to extend the training set. When the difference between the number of selected samples and the number selected in the previous iteration is less than b (a manually set hyperparameter, e.g. 0.01 or 0.03), the algorithm has converged and stops iterating.
The rationale is that a CNN can learn a good feature extractor and distance metric, so that after embedding, samples of the same class can be distinguished from samples of different classes. If the estimated label of an unlabeled sample is correct, the distance between the unlabeled sample and its nearest labeled sample is an intra-class distance. The distances between that labeled sample and the labeled samples with different labels (that is, not belonging to the same class) are inter-class distances, the smallest of which is the minimum inter-class distance. When the intra-class distance is less than the minimum inter-class distance, we have reason to believe that the estimated label is correct, so we add the sample to the training set. To improve accuracy, we multiply the minimum inter-class distance by a constant k less than 1 so that the training set is expanded more cautiously. The smaller k is, the higher the probability that an added sample has the correct label.
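A minimal sketch of the SRD acceptance test follows, under the reading that the inter-class distance is measured from the nearest labeled sample to labeled samples of other classes. The function name `srd_select` and the toy features are hypothetical.

```python
import math

def srd_select(labeled, unlabeled, k=0.8):
    """Return (index, estimated_label) for unlabeled samples passing the SRD test.

    labeled:   list of (feature_vector, label) pairs
    unlabeled: list of feature_vectors
    k:         constant below 1; smaller values are more cautious
    """
    selected = []
    for idx, u in enumerate(unlabeled):
        # Nearest labeled sample gives the estimated label and d_intra.
        near_feat, near_label = min(labeled, key=lambda fl: math.dist(u, fl[0]))
        d_intra = math.dist(u, near_feat)
        # Minimum distance from that labeled sample to any other class.
        others = [math.dist(near_feat, f) for f, lab in labeled if lab != near_label]
        if not others:
            continue  # only one class is labeled, so the test is undefined
        d_inter = min(others)
        if d_intra < k * d_inter:
            selected.append((idx, near_label))
    return selected

labeled = [([0.0, 0.0], "id_A"), ([4.0, 0.0], "id_B")]
unlabeled = [[1.0, 0.0], [2.0, 3.0]]
accepted = srd_select(labeled, unlabeled, k=0.8)
```

In the toy data the first sample is at distance 1 from `id_A` while the classes are 4 apart, so it passes; the second is about 3.6 away from both labeled samples, which exceeds 0.8 * 4, so it is held back for a later iteration.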
SRD static sampling achieves good performance, but it has two problems. First, although it converges after a few iterations, it cannot use up all the unlabeled samples by the time it converges. Second, as the number of iterations increases, the last few iterations add very few samples yet take up a lot of training time. We want the algorithm to make use of all the unlabeled samples and to add as many unlabeled samples as possible in each iteration, so as to avoid excessive iterations and reduce meaningless learning.
Based on the static sampling strategy SRD, we propose Adaptive Relative Distance Sampling (ARD), also based on the distance between classes. The core idea is to find an appropriate initial k value through a probe mechanism and then perform SRD. When the relative increase in the number of samples selected in the current iteration falls below a certain threshold, k is automatically increased and SRD is performed under the new k, so that more samples are added in the next iteration. This continues until k > 1, so that all unlabeled samples are used. ARD consists of a k-probe mechanism and an adaptive k-increase mechanism, which are explained in detail below.
k-probe mechanism. The goal of the k-probe mechanism is to find a suitable k value at which to start sampling. Since k is increased and iteration continues after the algorithm converges at the initial k, it is appropriate to start from a small value. However, if k is too small, there may be no qualifying pseudo-label samples, or too few may be added, leading to meaningless iterations that waste the predictive power of the network. We therefore propose the following k-probe mechanism: after the CNN is initialized by training on the labeled samples, it estimates labels for the unlabeled samples. Sampling is then tried with k = 0.6, 0.7, 0.8, 0.9, 1.0 in sequence, stopping as soon as the number of selected samples is large enough; that k is taken as the initial k value.
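The probe can be sketched as a simple search over the candidate k values. The stopping count is not legible in the text, so `min_samples` below is a hypothetical parameter, and `count_selected` is a hypothetical callable that reports how many pseudo-label samples SRD would accept for a given k.

```python
def probe_initial_k(count_selected, min_samples,
                    candidates=(0.6, 0.7, 0.8, 0.9, 1.0)):
    """Return the first candidate k whose SRD selection exceeds `min_samples`.

    count_selected: callable mapping k to the number of accepted samples
    min_samples:    assumed stopping count (not specified in the text)
    """
    for k in candidates:
        if count_selected(k) > min_samples:
            return k
    return candidates[-1]  # fall back to the largest candidate

# Toy stand-in: each value is d_intra / d_inter for one unlabeled sample,
# so a sample is accepted at a given k when its ratio is below k.
ratios = [0.65, 0.72, 0.75, 0.95]
initial_k = probe_initial_k(lambda k: sum(r < k for r in ratios), min_samples=1)
```

Here k = 0.6 accepts nothing and k = 0.7 accepts a single sample, so the probe moves on and settles at k = 0.8, where three samples qualify.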
Adaptive k-increase mechanism. For each k, record the difference in the number of selected samples between the previous two iterations at that k. If the difference between the number of samples selected in the current iteration and in the last iteration is less than the threshold for k, then k = k + 0.1 and training continues with the increased k. When k exceeds 1, training terminates and the algorithm has converged. The threshold is set dynamically: different k values have different thresholds, and a small k has a larger threshold. The threshold for each k serves to determine whether SRD sampling under the current k has converged. When the increase in the number of samples falls below the threshold, k is increased to start the next round of SRD, which achieves adaptive sampling. The threshold is determined by the difference between the number of samples of the previous two iterations at the current k, and so differs for different k. This threshold setting is a key contribution of this paper and is the core of the Adaptive Sampling Strategy (ARD).
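The update rule for k can be sketched as below. Two details are assumptions rather than statements from the text: the reference difference is taken as the gain between the first two iterations run at the current k, and the k-dependent coefficient is taken as 1.2 - k, a form inferred from the analysis in section 4.5. All names are hypothetical.

```python
def threshold(k, first_margin, coef=lambda k: 1.2 - k):
    """Assumed threshold: a k-dependent coefficient times the reference gain."""
    return coef(k) * first_margin

def update_k(k, counts_at_k, step=0.1):
    """Decide whether to raise k after the latest iteration.

    counts_at_k: selected-sample counts after each iteration at the current k
    Returns the k to use for the next iteration.
    """
    if len(counts_at_k) < 3:
        return k  # need the reference gain plus at least one later gain
    first_margin = counts_at_k[1] - counts_at_k[0]
    latest_gain = counts_at_k[-1] - counts_at_k[-2]
    if latest_gain < threshold(k, first_margin):
        return round(k + step, 1)  # SRD at this k has stalled; relax k
    return k

def training_finished(k):
    """Training stops once k exceeds 1."""
    return k > 1.0

next_k = update_k(0.7, [100, 200, 240])   # gain 40 < 0.5 * 100, so k rises
same_k = update_k(0.7, [100, 200, 280])   # gain 80 >= 50, so k is kept
```

Because the coefficient shrinks as k grows, a small k is abandoned sooner, which matches the statement that a small k has a larger threshold.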
4 Experiments
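| Methods | PRID2011 Rank-1 | PRID2011 mAP | iLIDS-VID Rank-1 | iLIDS-VID mAP | DukeMTMC Rank-1 | DukeMTMC mAP | MARS Rank-1 | MARS mAP |
|---|---|---|---|---|---|---|---|---|
| EUG (CVPR18) | 59.6 | 65.3 | – | – | 72.79 | 63.23 | 62.67 | 42.45 |
| DGM (ICCV17) | 82.4 | – | 37.1 | – | 44.36 | 33.62 | 36.81 | 16.87 |
| Stepwise (ICCV17) | 84.27 | 87.64 | – | – | 56.26 | 46.76 | 41 | 19.65 |
| BUC [7] (AAAI19) | – | – | – | – | 69.2 | 61.9 | 61.1 | 38.0 |
| TAUDL [12] (ECCV18) | 49.4 | – | 26.7 | – | – | – | 43.8 | 29.1 |
| PAM+SRD (Alg. 2) | 85.51 | 87.60 | 39.80 | 45.59 | 83.33 | 77.13 | 56.82 | 38.71 |
| PAM+ARD (Alg. 3) | 89.78 | 94 | 56.13 | 61.14 | 89.17 | 85.16 | 61.57 | 45.36 |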
4.1 Datasets and Settings
The iLIDS-VID dataset. The iLIDS-VID dataset [14] contains 300 different pedestrians captured by two non-overlapping cameras, with a total of 600 tracklets (each pedestrian has 2 tracklets). Tracklet lengths range from 23 to 192 frames, with an average of 73 frames. Because this dataset was captured by a multi-camera network in an airport arrival hall, clothing similarity, illumination, and viewing angle vary greatly, so it is more challenging.
The PRID2011 dataset. The PRID2011 dataset [3] contains 934 pedestrians captured by two static cameras with different viewpoints, with a total of 1134 tracklets. Camera 1 captured 385 tracklets of 385 pedestrians, camera 2 captured 749 tracklets of 749 pedestrians, and only the first 200 pedestrians appear in both cameras. Tracklet lengths range from 5 to 675 frames, with an average of 100 frames.
The DukeMTMC dataset. The DukeMTMC-VID dataset [15] is a large-scale video person re-ID dataset derived from the DukeMTMC image-based pedestrian re-identification dataset. It is captured by multiple cameras whose fields of view do not overlap and is manually labeled; tracklets are constructed by uniformly extracting 12 frames per second from continuous video. It contains 1,404 pedestrians (each with at least two tracklets captured by at least two different cameras) and 408 distractor pedestrians (each captured by only one camera), for a total of 4,832 tracklets. 2,196 tracklets of 702 pedestrians are used for training, and 2,636 tracklets of the remaining 702 pedestrians and the 408 distractor pedestrians are used for testing.
The MARS dataset. The MARS dataset [1] is the largest dataset for video-based person re-identification and is an extension of the Market1501 dataset. It was captured on a university campus by six near-synchronized cameras and contains 1,261 pedestrians with 1,631 segments of tracklets, plus 3,248 distractor tracklets (video sequences produced by detection or tracking errors). It is divided into a training set of 625 pedestrians and a test set of 636 pedestrians. Each pedestrian has on average 13 tracklets and 816 frames, and each pedestrian has at least two tracklets captured by different cameras. Another notable difference between MARS and the datasets above is that it uses algorithmic rather than manual annotation: pedestrian bounding boxes are detected with the Deformable Part Model (DPM) [19] and tracked with the multi-target tracking algorithm GMMCP tracker [20].
Experiment Setting. For one-example experiments, we use the same protocol as [21]. In both datasets, we randomly choose one tracklet in camera 1 for each identity as initialization. If there is no tracklet recorded by camera 1 for an identity, we randomly select one tracklet in the next camera to make sure each identity has one video tracklet for initialization. The few-example experiments are conducted on the MARS dataset: 20%, 40%, and 60% of the samples are randomly selected from the training set as the initial labeled data. The remaining samples of the training set are stripped of their labels and used as unlabeled samples.
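The one-example initialization can be sketched as follows. It assumes cameras are numbered with integers and reads "the next camera" as the lowest-numbered camera that recorded the identity; the function name and tuple layout are hypothetical.

```python
import random

def pick_initial_tracklets(tracklets, seed=0):
    """Choose one labeled tracklet per identity for initialization.

    tracklets: list of (identity, camera, tracklet_id) tuples
    Returns a dict mapping identity to the chosen tracklet_id.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    by_identity = {}
    for identity, camera, tracklet_id in tracklets:
        by_identity.setdefault(identity, {}).setdefault(camera, []).append(tracklet_id)
    initial = {}
    for identity, cameras in by_identity.items():
        first_camera = min(cameras)  # camera 1 if present, else the next one
        initial[identity] = rng.choice(cameras[first_camera])
    return initial

tracklets = [
    ("p1", 1, "t1"), ("p1", 2, "t2"),
    ("p2", 2, "t3"), ("p2", 3, "t4"),  # p2 was never seen by camera 1
]
initial = pick_initial_tracklets(tracklets)
```

Every remaining tracklet would then have its label removed and go into the unlabeled pool.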
Implementation Details. On the iLIDS-VID and PRID2011 datasets, we train for 70 epochs using stochastic gradient descent (SGD) with momentum; the momentum is set to 0.5, and the learning rate is initially 0.1 and is set to 0.01 after 55 epochs. On the DukeMTMC-VID and MARS datasets, we train for 50 epochs with the same optimizer; the momentum is set to 0.5, and the learning rate is initially 0.1 and is set to 0.01 after 40 epochs. The weight decay is set to 5e-4. Data augmentation uses random cropping, random horizontal flipping, and random erasing: random cropping to the input size, random horizontal flipping with a probability of 0.5, and random erasing with an erased-area ratio in [0.02, 0.2] and an aspect ratio in [0.3, 3.3], filled with pixel value [0.0, 0.0, 0.0]. The parameters of conv1, layer1 and layer2 of ResNet50 are fixed, and the learning rate of the rest of ResNet50 is set to 0.1 times the global learning rate. The loss-function parameter is set to 0.1, and K is the number of classes in the training stage (that is, the number of identities in the training set).
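The step schedule described above can be written down as a small helper. This is only a sketch: it assumes 0-based epoch indices and that "after 55 epochs" means from epoch index 55 onward; the names `SCHEDULE`, `learning_rate` and `backbone_learning_rate` are hypothetical.

```python
# (total epochs, epoch at which the rate drops) per dataset, from the text.
SCHEDULE = {
    "iLIDS-VID": (70, 55),
    "PRID2011": (70, 55),
    "DukeMTMC-VID": (50, 40),
    "MARS": (50, 40),
}

def learning_rate(epoch, dataset):
    """Global learning rate at a 0-based epoch index."""
    total, drop = SCHEDULE[dataset]
    if not 0 <= epoch < total:
        raise ValueError("epoch outside the training range")
    return 0.1 if epoch < drop else 0.01

def backbone_learning_rate(epoch, dataset):
    """Trainable ResNet50 layers use 0.1 times the global rate."""
    return 0.1 * learning_rate(epoch, dataset)

lr_start = learning_rate(0, "MARS")
lr_late = learning_rate(40, "MARS")
```

The frozen layers (conv1, layer1 and layer2) are not covered by this helper, since their parameters are not updated at all.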
4.2 Comparison with the State-of-the-Art Methods
From the experimental results on the four datasets, both static sampling SRD (Algorithm 2) and adaptive sampling ARD (Algorithm 3) perform well. On the small-scale datasets PRID2011 and iLIDS-VID in particular, the one-example performance exceeds the state of the art and is better than PAM+GPS (Algorithm 1). Because these strategies rely on the relative distance between classes, they use more metric information and overcome the problems of GPS incremental sampling described in section 4.1.1, which yields a significant performance boost on PRID2011 and iLIDS-VID. Static sampling SRD is not as good as GPS on the DukeMTMC-VID and MARS datasets, mainly because GPS can estimate more correct labels over multiple iterations by using a smaller n0 and a smaller sampling increase factor k. For SRD, when k is set too small the algorithm falls into a local optimum on large datasets and only a small number of unlabeled samples are added, whereas an excessively large k causes too many wrongly labeled samples to be added in the initial iterations, and the errors accumulate as the number of iterations increases, which limits the performance improvement. With adaptive sampling ARD, performance on the DukeMTMC-VID and MARS datasets improves significantly, which demonstrates the effectiveness of the adaptive strategy over the purely static one, especially on large-scale datasets. Adaptive sampling ARD (Algorithm 3) exceeds the previous algorithms on all four datasets: rank-1 accuracy reaches 89.78% and 56.13% on PRID2011 and iLIDS-VID respectively, and mAP reaches 85.16% and 45.36% on DukeMTMC-VID and MARS respectively. On MARS it is slightly inferior to the PAM+GPS algorithm proposed in Chapter 3, but it performs better on the small-scale datasets and needs fewer iterations, so its overall performance is the best.
4.3 Few-example Experiment
The results of the few-example settings on the MARS dataset using PAM and the Adaptive Sampling Algorithm (ARD) are shown in Table 1.
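| No. | Method | Type | MARS R1 | MARS mAP |
|---|---|---|---|---|
| 1 | AMOC+EpicFlow [8] | Super. | 68.3 | 52.9 |
| 2 | QAN [9] | Super. | 73.7 | 51.7 |
| 3 | PAM+ARD | Super. | 61.57 | 45.36 |
| 4 | PAM+ARD | Semi. (20%) | 68.38 | 52.61 |
| 5 | PAM+ARD | Semi. (40%) | 74.29 | 60.31 |
| 6 | PAM+ARD | Semi. (60%) | 77.98 | 65.74 |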
As can be seen from Table 1, PAM+ARD achieves 52.61% mAP when using 20% of the labeled data, which is better than the fully supervised algorithm QAN and only slightly below the fully supervised algorithm AMOC+EpicFlow. This shows that our method can match the performance of fully supervised algorithms when only 20% of the data is labeled, which further demonstrates the strength of the algorithm. Although the few-example setting requires more manual labeling than the one-example setting, performance improves greatly.
4.4 Ablation Studies
We perform a series of ablation experiments with PAM+ARD on the DukeMTMC-VID dataset under the one-example setting to verify the contribution of each part of the algorithm.
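| Methods | R1 | R5 | R10 | R20 | mAP |
|---|---|---|---|---|---|
| IDE+ARD | x | x | x | x | x |
| PAM+EUG (k=0.05) | x | x | x | x | x |
| PAM+ARD | 89.2 | 96.7 | 97.9 | 98.3 | 85.2 |

(All results on DukeMTMC-VID.)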
Part attention model. Table 3 is a controlled experiment for the Part Attention Model (PAM). The algorithm uses the adaptive sampling strategy, and the CNN is either IDE or PAM. The results show that PAM achieves a 2.7% improvement in Rank accuracy on the DukeMTMC-VID dataset, which demonstrates the effectiveness of the Part Attention Model.
Sampling strategy. Table 3 is also a controlled experiment for the adaptive sampling strategy (ARD). The algorithm uses the IDE network, and the sampling strategy is either the linearly increasing sampling of EUG (p=0.05) or adaptive sampling (ARD). The results show that ARD achieves a 2.7% improvement in Rank accuracy on the DukeMTMC-VID dataset, which demonstrates the effectiveness of the adaptive sampling strategy.
Dynamic coefficient for threshold. To verify the effectiveness of setting thresholds by multiplying the kmargin0 of each k value by a k-dependent coefficient, we designed a pair of comparative experiments: one multiplies the kmargin0 of each k value by the dynamic, k-dependent coefficient, and the other multiplies it by the fixed coefficient 0.3. The experimental results are shown in table 45. Using the k-dependent coefficient rather than the fixed coefficient 0.3 significantly reduces the number of iterations (from 15 to 12) and improves Rank-1 and mAP by 2.13% and 2.85% respectively. As can also be seen from table 45, compared with SRD, ARD reduces the number of iterations and significantly improves performance, which demonstrates the superiority of the ARD algorithm.
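| No. | Method | Total Steps | R1 | mAP |
|---|---|---|---|---|
| 1 | IDE+EUG | 20 | 72.79 | 63.23 |
| 2 | PAM+SRD | 16 | 83.33 | 77.13 |
| 3 | PAM+ARD (0.3) | 15 | 87.04 | 82.31 |
| 4 | PAM+ARD (k) | 12 | 89.17 | 85.16 |

(All results on DukeMTMC-VID.)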
4.5 Analysis and Visualization.
We visualize how the accuracy and the number of selected samples of adaptive sampling ARD (Algorithm 3) on the DukeMTMC-VID dataset change with the number of iterations, as shown in Figure 46:
It can be clearly seen that the number of samples grows only gently when k=1.0 and rises steadily when k = 0.7, 0.8 and 0.9, so learning in each iteration is meaningful. The mAP curve rises in four distinct steps, one for each value of k, and accuracy keeps increasing over the iterations. When the algorithm finally converges, accuracy is essentially at its highest, indicating that the model from the last iteration performs best and there is no need to select the best model with an additional validation set.
The absolute change in the number of samples with the iteration number for experiments 4 and 5 is shown in figure 4. It shows more clearly the role played in the iterations by multiplying the kmargin0 of each k value by the k-dependent coefficient (1.2 - k): the number of sampled samples grows more steadily and faster, which reduces the total number of iterations and improves the learning efficiency of the algorithm.
5 Conclusion
Since unlabeled person tracklets are cheap and easy to obtain, data-driven deep models can achieve promising results with label estimation for few-example person re-identification. The challenge is how to estimate labels correctly and select the reliable ones to enlarge the training set. In this paper, we propose PAM, a lightweight and fast-converging network that is suitable for few-example person re-ID. We also propose an adaptive sampling strategy to select the most reliable pseudo-label samples and gradually learn a more robust model. Our approach surpasses the state-of-the-art methods by 5.5, 19.0 and 16.4 points (absolute) in rank-1 accuracy on PRID2011, iLIDS-VID and DukeMTMC-VID respectively, and by 2.9 points (absolute) in mAP on MARS. The proposed approach is efficient and accurate for few-example person re-identification.
References
[1] MARS: A Video Benchmark for Large-Scale Person Re-Identification.
[2] Shengyong Ding, Liang Lin, Guangrun Wang, and Hongyang Chao. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition, 48(10):2993–3003, Oct. 2015.
[3] Martin Hirzer, Csaba Beleznai, Peter M. Roth, and Horst Bischof. Person Re-identification by Descriptive and Discriminative Classification. In Anders Heyden and Fredrik Kahl, editors, Image Analysis, volume 6688, pages 91–102. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.
[4] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese Neural Networks for One-shot Image Recognition.
[5] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.
[6] Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious Attention Network for Person Re-Identification. arXiv:1802.08122 [cs], Feb. 2018.
[7] Yutian Lin, Xuanyi Dong, Liang Zheng, Yan Yan, and Yi Yang. A bottom-up clustering approach to unsupervised person re-identification. In AAAI, 2019.
[8] Hao Liu, Zequn Jie, Karlekar Jayashree, Meibin Qi, Jianguo Jiang, Shuicheng Yan, and Jiashi Feng. Video-based Person Re-identification with Accumulative Motion Context. arXiv:1701.00193 [cs], Dec. 2016.
[9] Yu Liu, Junjie Yan, and Wanli Ouyang. Quality Aware Network for Set to Set Recognition. arXiv:1704.03373 [cs], Apr. 2017.
[10] Zimo Liu, Dong Wang, and Huchuan Lu. Stepwise Metric Promotion for Unsupervised Video Person Re-identification. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 2448–2457, Venice, Oct. 2017. IEEE.
[11] Alireza Rahimpour and Hairong Qi. Attention-based Few-Shot Person Re-identification Using Meta Learning. arXiv:1806.09613 [cs], June 2018.
[12] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond Part Models: Person Retrieval with Refined Part Pooling (and A Strong Convolutional Baseline). In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision – ECCV 2018, volume 11208, pages 501–518. Springer International Publishing, Cham, 2018.
[13] Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, and Daan Wierstra. Matching Networks for One Shot Learning. arXiv:1606.04080 [cs, stat], June 2016.
[14] Taiqing Wang, Shaogang Gong, Xiatian Zhu, and Shengjin Wang. Person Re-identification by Video Ranking. In David Fleet, Tomas Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision – ECCV 2014, volume 8692, pages 688–703. Springer International Publishing, Cham, 2014.
[15] Yu Wu, Yutian Lin, Xuanyi Dong, Yan Yan, Wanli Ouyang, and Yi Yang. Exploit the Unknown Gradually: One-Shot Video-Based Person Re-identification by Stepwise Learning. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5177–5186, Salt Lake City, UT, June 2018. IEEE.
[16] Mang Ye, Andy J. Ma, Liang Zheng, Jiawei Li, and Pong C. Yuen. Dynamic Label Graph Matching for Unsupervised Video Re-identification. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5152–5160, Venice, Oct. 2017. IEEE.
[17] Dong Yi, Zhen Lei, and Stan Z. Li. Deep Metric Learning for Practical Person Re-Identification. arXiv:1407.4979 [cs], July 2014.
[18] R. Zhao, W. Ouyang, and X. Wang. Person Re-Identification by Saliency Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(2):356–370, Feb. 2017.