Person re-identification (re-id) is the task of inferring subtle identity class information from detected person bounding box images captured under non-overlapping camera views [gray2008viewpoint, prosser2010person, zheng2013reidentification, market1501, XQDA, gong2014person, zhu2017exploiting]. It remains challenging due to the non-rigid structure of the human body, highly variable illumination conditions, and low-resolution person bounding box images. Most existing deep learning re-id methods train convolutional neural network (ConvNet) models in a supervised fashion [cuhk03, xiao2016learning, chen2017person, zhang2018deep, li2018harmonious, chang2018multi, qian2018pose, sun2018beyond]. A major limitation of supervised modelling is that it assumes the availability of a large set of inter-camera labelled training identity classes, collected through an exhaustive and expensive annotation process. This dramatically degrades the usability and scalability of such methods in real-world applications and large-scale deployments.
This problem has received significant attention recently. One intuitive approach is unsupervised person re-id. Existing methods of this kind can be broadly divided into three categories: (1) domain-generic feature design [gray2008viewpoint, farenzena2010person, market1501, XQDA, gog]; (2) unsupervised domain adaptation [peng2016unsupervised, deng2018image, wang2018Transfer, lin2018multi, yu2018unsupervised, yu2019unsupervised, zhong2019invariance, zhu2019icip, panda2017unsupervised]; (3) unsupervised learning [wang2016towards, chen2018deep, li2018taudl, lin2019BUC, li2019utal_pami]. By hand-crafting universal person appearance features, the first category aims to improve re-id model performance generically. However, methods in this category often yield markedly inferior generalisation capability because such representations carry limited information. The second category attempts to transfer the identity knowledge of a labelled source domain to an unlabelled target domain via image or feature adaptation. Unfortunately, such methods implicitly assume that the source and target domains have reasonably similar camera viewing conditions, so that sufficient knowledge is transferable. As a more scalable approach, the third category instead leverages only unlabelled target domain data during model training. To benefit from existing supervised learning algorithms, previous unsupervised re-id methods often turn to the idea of self-discovering the underlying identity association information. Compared with conventional manual labelling, this automated annotation is less accurate and less complete, leading to inferior model optimisation and generalisation capability.
In this work, we instead investigate person re-id scalability from the data annotation perspective. We consider a more scalable re-id problem with cheaper training data labelling, where person identity (ID) labels are annotated in each camera view without inter-camera association. This is based on our observation that inter-camera search is the most time-consuming and expensive sub-process of manual annotation (Fig. 1(a)). This is because generic (unframed) people usually take a priori unknown routes in open public spaces with complex space-time topology. On the other hand, labelling person identity classes in each camera view independently is much simpler and faster, and can possibly further benefit from off-the-shelf tracking algorithms in a single camera view (Fig. 1(b)). We name this new setting Intra-Camera Supervised (ICS) person re-id. Compared with conventional strong re-id supervision with labelled inter-camera identity association, this re-id problem focuses the learning algorithm on self-discovering the correspondence relationships between camera-specific identity spaces. It presents a new modelling challenge.
To address the ICS re-id problem, we propose a Multi-Task Multi-Label (MTML) deep learning model in this work. Since there is no inter-camera association in the proposed re-id data labelling, MTML is specially designed to self-discover the inter-camera identity correspondence through an inter-camera multi-label learning component under a joint multi-task inference framework. Some previous inter-camera identity association re-id methods [chen2018deep, li2018taudl, li2019utal_pami] learn discriminative representations by associating similar samples in the feature space. In contrast, we introduce the idea of assigning multiple labels to each person identity in inter-camera association across different label spaces, in order to better exploit the per-camera identity labelling information. In addition, MTML can efficiently learn discriminative re-id features from the provided per-camera identity labels based on multi-camera multi-task learning.
The contributions of this work are: (1) We reformulate supervised learning person re-id by removing explicitly the assumption for exhaustive inter-camera pairwise labelling in model training. This eliminates the most time-consuming and tedious inter-camera pairwise ID labelling task required in the conventional re-id model training. In return, a more challenging re-id model learning problem is presented with only per-camera independently annotated ID labels in the training dataset. Compared to completely unsupervised re-id, our introduced re-id problem enables the re-id model to benefit from per-camera view labelled ID information which can be easily annotated or generated using tracking algorithms. (2) We formulate a Multi-Task Multi-Label (MTML) learning method for ICS person re-id. MTML takes a multi-task learning framework to jointly account for independent camera-specific identity discriminative labelling information and self-discovering inter-camera identity association in a multi-labelling fashion. Three large-scale re-id datasets, i.e., Market-1501 [market1501], DukeMTMC-reID [DukeMTMC-reID, ristani2016MTMC], and MSMT17 [wei2018person], have been used in experiments with the proposed ICS setting. The results demonstrate the superiority of the proposed MTML method compared with the state-of-the-art person re-id models.
2 Related Work
As we are concerned with the re-id scalability issue from the person re-id dataset annotation perspective, this section will discuss and review supervised and unsupervised person re-id works in the literature.
Supervised learning based person re-id methods dominate the literature [xiao2016learning, wei2018person, chen2017person, wang2018reg, zhao2017deeply, zhu2017fast, chen2017tpami, li2017person, chang2018multi, zhang2018deep, Shen_2018_ECCV, li2018harmonious, qian2018pose, sun2018beyond]. Models of this type are trained in a strongly supervised manner on training images with inter-camera pairwise ID labels. They suffer from significant performance degradation when the test domain is dissimilar to the training domain. Moreover, supervised learning based re-id models are effective only when strongly labelled training data are available at large scale for every target domain. This limits their usefulness. Semi-supervised learning methods [liu2014semi, wang2016towards] reduce the amount of labelled training data but still require some cross-view pairwise labelling. Removing the expensive inter-camera pairwise labelling requirement for re-id model learning is desirable in practice.
Unsupervised learning based person re-id models have received increasing attention and come in three flavours, i.e., domain-generic feature design [gray2008viewpoint, farenzena2010person, market1501, XQDA, gog], unsupervised domain adaptation [wang2018Transfer, lin2018multi, zhong2018generalizing, yu2018unsupervised, zhong2019invariance] and unsupervised model learning [lisanti2015person, wang2016towards, kodirov2016person, ma2017person, chen2018deep, li2018taudl, lin2019BUC, li2019utal_pami]. None of these methods needs labelled training data from the target domain, which makes them more deployable. However, their re-id performance is much weaker than that of supervised learning based models (when training and test domains are similar).
Intra-camera supervised learning based person re-id is considered in this work, where the strong inter-camera identity association labels are removed from the training data. Without the need for manually annotating identity correspondences between every pair of camera views, we minimise the amount of labour required for person identity class annotation and make a re-id model more feasible to deploy at scale. To solve this re-id problem, we develop a re-id learning algorithm that makes full use of per-camera independently annotated ID labels and self-discovers the most likely person correspondences between different camera views, yielding stronger re-id models than their unsupervised learning counterparts.
In this section we formulate a person re-id learning method without inter-camera identity association in the training data. Suppose there are $M$ camera views in a camera network. For the $p$-th camera view, we independently annotate a set of samples $\mathcal{D}^p = \{\mathbf{I}^p_i\}$, where $\mathbf{I}^p_i$ is the $i$-th person image in $\mathcal{D}^p$. Each person image $\mathbf{I}^p_i$ is associated with an identity label $y^p_i \in \{1, \dots, N^p\}$ and the corresponding camera identity $p$. $N^p$ is the total number of unique person identities in $\mathcal{D}^p$. Due to the per-camera independent labelling, the same identity label value in any two camera views very likely refers to two different persons. By combining the camera-specific labelled data, we obtain the entire training set as $\mathcal{D} = \{\mathcal{D}^1, \dots, \mathcal{D}^M\}$. The presence of such multiple identity class label spaces prevents the training of a conventional supervised re-id model, and thus a new effective re-id method is needed.
We formulate a Multi-Task Multi-Label (MTML) deep learning method for addressing this more challenging and more scalable ICS re-id problem. Given the per-camera independent identity labelling, the key to model learning lies in two aspects: (1) how to effectively exploit the per-camera identity labels, and (2) how to associate the identity label classes across camera views (or label spaces). MTML addresses these two aspects by integrating two corresponding components: (i) multi-camera multi-task learning, which assigns a separate learning task to each individual camera view for modelling the respective identity space; and (ii) inter-camera multi-label learning, which automatically self-discovers the identity associations across camera views and assigns multiple labels to the associated identities for modelling the inter-camera person appearance variation. More details about these two components are presented in the following parts, and Fig. 2 gives an overview of the proposed MTML model.
3.1 Multi-Camera Multi-Task Learning
As shown in Fig. 2(b) and 2(c), we adopt a multi-task learning strategy [argyriou2007multi] given the per-camera independently labelled person identity information. This aims at better mining the common knowledge shared across camera views whilst enhancing model learning with augmented training data for each camera view. Each camera view is treated separately because of its independent labelling. Importantly, this also allows us to derive a person re-id representation with implicit inter-camera identity discriminative capability, which facilitates inter-camera identity association [li2019utal_pami].
Formally, we create a camera-shared feature representation upon which multi-task branches are rooted. Each branch is responsible for the classification task in a specific camera view. During model training, each task branch propagates the respective per-camera identity label information via the softmax cross-entropy loss function. For one sample $\mathbf{I}^p$ with identity label $y^p$ from the $p$-th camera, the softmax cross-entropy loss can be formulated as:

$$\mathcal{L}_{ce} = -\mathbb{1}(y^p)^{\top} \log g^p(\mathbf{x}^p), \tag{1}$$

where $\mathbf{x}^p \in \mathbb{R}^d$ specifies the camera-shared feature vector of the corresponding image $\mathbf{I}^p$ from the $p$-th camera, extracted after the fully connected layer FC-$d$ as shown in Fig. 2, in which $d$ is the dimension of a feature vector. $g^p(\cdot)$ denotes the classifier function of camera $p$, which outputs the softmax probabilities over the identity classes of that camera. The one-hot encoding function $\mathbb{1}(\cdot)$ returns a one-hot vector with value 1 for the element at the given index. For stochastic mini-batch deep learning, the multi-camera multi-task learning (MT) objective is designed as:

$$\mathcal{L}_{MT} = \frac{1}{N_{bs}} \sum_{p=1}^{M} \mathcal{L}^{p}_{ce}, \tag{2}$$

where $\mathcal{L}^{p}_{ce}$ denotes the accumulated cross-entropy loss (Eq. (1)) of all in-batch images from the $p$-th camera, $M$ is the number of camera views, and $N_{bs}$ is the mini-batch size. With this multi-camera multi-task learning, discriminative re-id features can be efficiently learned using the existing identity labels within each camera view.
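As a rough illustration of the MT objective described above, the following Python sketch accumulates the per-sample softmax cross-entropy losses through the camera-specific branches and divides by the mini-batch size. The names `softmax`, `linear_branch` and `mt_loss` are hypothetical, plain Python lists stand in for the ConvNet features, and the toy weights are invented for illustration; this is a sketch under those assumptions, not the implementation used in our experiments.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear_branch(weights):
    """Hypothetical per-camera classifier: one weight row per identity class."""
    def classify(feature):
        logits = [sum(w * x for w, x in zip(row, feature)) for row in weights]
        return softmax(logits)
    return classify

def mt_loss(batch, branches):
    """Multi-task loss over a mini-batch.

    batch    : list of (feature, camera, label); label indexes that camera's own label space
    branches : dict mapping camera -> classifier returning class probabilities
    """
    total = 0.0
    for feature, camera, label in batch:
        probs = branches[camera](feature)   # only the sample's own camera branch is used
        total += -math.log(probs[label])    # softmax cross-entropy for one sample
    return total / len(batch)               # normalise by the mini-batch size

# Toy setup (assumed): camera 0 has 2 identities, camera 1 has 4.
# Zero weights give uniform predictions in every branch.
branches = {
    0: linear_branch([[0.0, 0.0]] * 2),
    1: linear_branch([[0.0, 0.0]] * 4),
}
batch = [([1.0, 2.0], 0, 1), ([0.5, -1.0], 1, 3)]
loss = mt_loss(batch, branches)
```

With uniform predictions the two samples contribute log 2 and log 4 respectively, so the toy loss equals their mean. Note that the same label index in different cameras is routed to different branches and is therefore never confused across cameras.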
3.2 Inter-Camera Multi-Label Learning
In person re-id, inter-camera person appearance variation is one of the most significant factors during model training. Whilst this is implicitly learned in the multi-camera multi-task learning component as discussed above, that component alone is insufficient to fully capture the underlying inter-camera identity correspondence relationships. To address this problem, we design an inter-camera multi-label learning component that self-discovers the identity correspondence between camera-specific identity label spaces and encourages the re-id model to effectively model the inter-camera person appearance variation.
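Specifically, given an identity class $y^p_k$ from camera $p$, we want to find whether a true match exists in camera $q$ ($q \neq p$). To this end, all the person images of $y^p_k$ are mapped into the branch of camera $q$ and an average prediction of $y^p_k$ in camera $q$ is obtained as:

$$\bar{\mathbf{s}}^{p \to q}_{k} = \operatorname{avg}\big(\{\, g^q(\mathbf{x}^p_i) : y^p_i = k \,\}\big), \tag{3}$$

in which $\bar{\mathbf{s}}^{p \to q}_{k}$ is the averaged prediction of $y^p_k$ in camera $q$. As in Eq. (1), $g^q(\cdot)$ denotes the classifier function of camera $q$, and $\operatorname{avg}(\cdot)$ is the averaging function. We then nominate the identity class in camera $q$ with the maximum likelihood probability as the candidate matching identity:

$$l^{*} = \arg\max_{l} \; \bar{\mathbf{s}}^{p \to q}_{k}[l], \quad l \in \{1, \dots, N^q\}, \tag{4}$$

where $l$ is the index of an identity in the $q$-th camera and $N^q$ is the number of identities in that camera. To boost the accuracy and robustness of matching pairs, the identity $y^q_{l^{*}}$ is further mapped back to camera $p$ in a similar way as in Eq. (3), and the corresponding candidate matching identity $y^p_{k^{*}}$ is retrieved as in Eq. (4). This cyclic mapping and matching operation between every two camera views determines the inter-camera identity association result as:

$$(y^p_k,\, y^q_{l^{*}}) \text{ is } \begin{cases} \text{a matching pair}, & \text{if } k^{*} = k, \\ \text{not a matching pair}, & \text{otherwise.} \end{cases} \tag{5}$$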
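The cyclic mapping and matching procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names are hypothetical, the averaged predictions are given as plain lists of probabilities, and the toy numbers are invented. An identity pair is kept only when each side nominates the other.

```python
def average_prediction(prob_rows):
    """Average a list of per-image probability vectors element-wise."""
    n = len(prob_rows)
    return [sum(col) / n for col in zip(*prob_rows)]

def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)

def cyclic_matches(avg_p_to_q, avg_q_to_p):
    """Return identity pairs (k, l) that nominate each other.

    avg_p_to_q[k] : averaged prediction of identity k (camera p) in the branch of camera q
    avg_q_to_p[l] : averaged prediction of identity l (camera q) in the branch of camera p
    """
    pairs = []
    for k, scores in enumerate(avg_p_to_q):
        l = argmax(scores)                  # forward nomination: p -> q
        if argmax(avg_q_to_p[l]) == k:      # backward check: q -> p must return to k
            pairs.append((k, l))
    return pairs

# Toy example (invented numbers): 2 identities in camera p, 3 in camera q.
avg = average_prediction([[0.2, 0.8], [0.4, 0.6]])
avg_p_to_q = [[0.1, 0.8, 0.1], [0.6, 0.3, 0.1]]
avg_q_to_p = [[0.7, 0.3], [0.9, 0.1], [0.5, 0.5]]
pairs = cyclic_matches(avg_p_to_q, avg_q_to_p)
```

In this toy case identity 0 of camera p and identity 1 of camera q select each other and are associated, whereas identity 1 of camera p nominates identity 0 of camera q, which maps back to a different identity and is therefore rejected.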
To benefit model training from self-discovered identity matching pairs, a proper supervision function is designed. Considering the idea of inter-camera prediction based identity association, the inter-camera learning is performed by multi-labeling the associated person identities.
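In particular, given an identity matching pair ($y^p_k$, $y^q_{l^{*}}$), with $y^p_k$ from camera $p$ and $y^q_{l^{*}}$ from camera $q$ as defined in Eq. (5), the person images corresponding to the associated matching identities $y^p_k$ and $y^q_{l^{*}}$ are assigned multiple labels. The person images of $y^p_k$ are assigned the new label $y^q_{l^{*}}$ and, similarly, the person images of $y^q_{l^{*}}$ are assigned the new label $y^p_k$. After this inter-camera multi-labelling, the images of $y^p_k$ and $y^q_{l^{*}}$ carry identical multiple identity labels, i.e., $y^p_k$ and $y^q_{l^{*}}$, and thus $y^p_k$ and $y^q_{l^{*}}$ are inter-camera associated.

With these newly assigned labels, a simple but efficient loss function is designed based on the softmax cross-entropy loss, as for the MT loss in Eqs. (1) and (2). For one person image with feature vector $\mathbf{x}$ and new label $y^q_{l^{*}}$ as in Eq. (4), its ML loss can be formulated as:

$$\mathcal{L}_{ml} = -\mathbb{1}(l^{*})^{\top} \log g^q(\mathbf{x}), \tag{6}$$

and the overall ML objective is:

$$\mathcal{L}_{ML} = \sum_{q=1}^{M} \frac{1}{N^{q}_{ml}}\, \mathcal{L}^{q}_{ml}, \tag{7}$$

in which $\mathcal{L}^{q}_{ml}$ is the ML loss in camera $q$, i.e., the sum of $\mathcal{L}_{ml}$ (Eq. (6)) over all in-batch images with newly assigned labels in camera $q$. $N^{q}_{ml}$ is the number of images with newly assigned labels in camera $q$.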
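The multi-labelling step can be pictured as simple bookkeeping over associated identity pairs. The sketch below is illustrative only: identities are written as hypothetical `(camera, identity)` tuples and `assign_multi_labels` is an assumed helper name. After the call, both members of an associated pair carry the same set of labels.

```python
def assign_multi_labels(matches):
    """Attach the partner's label to each identity of an associated pair.

    matches : iterable of ((p, k), (q, l)) pairs of (camera, identity) tuples
    returns : dict mapping (camera, identity) -> set of labels, own label included
    """
    labels = {}
    for a, b in matches:
        labels.setdefault(a, {a}).add(b)   # images of a also receive label b
        labels.setdefault(b, {b}).add(a)   # images of b also receive label a
    return labels

# Toy example (invented): identity 3 in camera 0 is associated with
# identity 5 in camera 1 and with identity 1 in camera 2.
matches = [((0, 3), (1, 5)), ((0, 3), (2, 1))]
labels = assign_multi_labels(matches)
```

Each extra label in the resulting sets would contribute one softmax cross-entropy term through the branch of the camera that the label belongs to.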
3.3 Model Objective Loss Function
By combining the multi-camera multi-task and inter-camera multi-label learning functions, we obtain the final MTML model objective loss function as:

$$\mathcal{L}_{MTML} = \mathcal{L}_{MT} + \lambda\, \mathcal{L}_{ML}, \tag{8}$$

where $\lambda$ controls the relative weight of the two terms. In our experiments, $\lambda$ is chosen considering that inter-camera identity association is necessarily noisy, which adversely affects the quality of $\mathcal{L}_{ML}$.
3.4 Model Training and Inference
The stochastic gradient descent algorithm can be applied for optimising the proposed deep re-id model. In the considered re-id dataset annotation as shown in Fig. 1(b), per-camera identity labels are accurately annotated whilst the identity matching between camera views is likely to be inaccurate. Based on this observation, the proposed re-id model is first pre-trained using only the multi-camera multi-task learning loss (Eq. (2)). Then, based on this pre-trained re-id model, the inter-camera multi-label learning is performed iteratively using the model objective loss function in Eq. (8). In every iteration, the re-id model is first trained for a number of epochs, and then the cyclic prediction consistency and inter-camera multi-labelling are applied to associate inter-camera identities. The newly assigned multi-labels of associated identities are then included in the training dataset for model learning in the following iteration.
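The two-stage schedule described above can be outlined as follows. This is only a control-flow sketch under stated assumptions: `train_one_epoch` and `associate` are hypothetical callbacks standing in for the actual network update and the cyclic association step, and the stub functions below merely record how they are called.

```python
def train_mtml(train_one_epoch, associate, pretrain_epochs, iterations, epochs_per_iteration):
    """MT-only pre-training followed by iterative multi-label learning.

    train_one_epoch(multi_labels) : hypothetical callback; multi_labels is None during pre-training
    associate()                   : hypothetical callback returning newly discovered multi-labels
    """
    for _ in range(pretrain_epochs):
        train_one_epoch(None)                 # stage 1: MT loss only
    multi_labels = {}
    for _ in range(iterations):
        for _ in range(epochs_per_iteration):
            train_one_epoch(multi_labels)     # stage 2: full objective with current multi-labels
        multi_labels.update(associate())      # new associations feed the next iteration
    return multi_labels

# Stub callbacks that only record the schedule.
calls = []
found = iter([{"a": 1}, {"b": 2}, {"c": 3}])

def fake_epoch(multi_labels):
    calls.append(None if multi_labels is None else len(multi_labels))

def fake_associate():
    return next(found)

final = train_mtml(fake_epoch, fake_associate,
                   pretrain_epochs=2, iterations=3, epochs_per_iteration=2)
```

The recorded calls show that the set of multi-labels available to training grows only between iterations, never within one.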
In model inference, the trained MTML model is deployed to extract the camera-shared features of test person images as their re-id representations. For efficient re-id matching and ranking, the Euclidean distance metric is utilised to compute the probe-gallery pairwise similarity.
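For illustration, Euclidean-distance ranking of a gallery against one probe can be sketched as below. The feature vectors are invented toy values and the function names are hypothetical; in practice the vectors would be the camera-shared features extracted by the trained model.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_gallery(probe, gallery):
    """Return gallery indices sorted from the nearest to the farthest feature."""
    return sorted(range(len(gallery)), key=lambda i: euclidean(probe, gallery[i]))

probe = [0.0, 0.0]
gallery = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
order = rank_gallery(probe, gallery)
```

Smaller distance means higher similarity, so the first index in `order` is the top-ranked gallery image.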
4.1 Experimental Setup
Datasets. Three large-scale re-id datasets, i.e., Market-1501 [market1501], DukeMTMC-reID [DukeMTMC-reID, ristani2016MTMC], and MSMT17 [wei2018person], are selected for evaluating our proposed ICS problem and our MTML method. As no existing re-id datasets are annotated in the ICS fashion, we adapted these three fully labelled re-id datasets by independently annotating their identity labels in each camera view, as shown in Fig. 1. We still use the identical test data of each dataset for model performance evaluation. We will publicly release these ICS person re-id benchmarks.
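One way to picture the conversion of a fully labelled dataset into per-camera label spaces is sketched below. It assumes each record is an `(image, global_id, camera)` triple and assigns local labels in order of first appearance; both the record layout and the helper name `to_ics_labels` are hypothetical and chosen for illustration, not a description of the released benchmark files.

```python
from collections import defaultdict

def to_ics_labels(samples):
    """Map (image, global_id, camera) records to (image, camera, local_id).

    A local_id is only meaningful inside its own camera, so the cross-camera
    identity correspondence is discarded.
    """
    next_id = defaultdict(int)   # next free local label per camera
    local = {}                   # (camera, global_id) -> local label
    out = []
    for image, global_id, camera in samples:
        key = (camera, global_id)
        if key not in local:
            local[key] = next_id[camera]
            next_id[camera] += 1
        out.append((image, camera, local[key]))
    return out

samples = [
    ("a.jpg", 7, 0), ("b.jpg", 9, 0), ("c.jpg", 9, 1), ("d.jpg", 7, 0), ("e.jpg", 7, 1),
]
ics = to_ics_labels(samples)
```

In this toy example, label 0 denotes person 7 in camera 0 but person 9 in camera 1, which is exactly the ambiguity that a learning method under the ICS setting has to resolve.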
Performance metrics. We used the common Cumulative Matching Characteristic (CMC) and mean Average Precision (mAP) metrics for model performance measurement.
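For readers unfamiliar with these metrics, the sketch below computes a CMC curve and mAP from ranked gallery identities. It is a simplified illustration: the function names and toy rankings are assumptions, and dataset-specific protocol details such as filtering same-camera or junk gallery images are deliberately omitted.

```python
def average_precision(ranked_ids, probe_id):
    """Average precision of one probe given gallery identities in ranked order."""
    hits, precision_sum = 0, 0.0
    for rank, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == probe_id:
            hits += 1
            precision_sum += hits / rank   # precision at each true match
    return precision_sum / hits if hits else 0.0

def cmc_and_map(rankings, max_rank):
    """rankings: list of (probe_id, ranked gallery ids). Returns (CMC curve, mAP)."""
    cmc = [0.0] * max_rank
    aps = []
    for probe_id, ranked_ids in rankings:
        aps.append(average_precision(ranked_ids, probe_id))
        if probe_id in ranked_ids[:max_rank]:
            first = ranked_ids.index(probe_id)   # 0-based rank of the first true match
            for r in range(first, max_rank):
                cmc[r] += 1.0                    # counted as a hit at this rank and beyond
    n = len(rankings)
    return [c / n for c in cmc], sum(aps) / n

rankings = [
    (1, [2, 1, 1, 3]),
    (5, [5, 4, 5, 6]),
]
cmc, mean_ap = cmc_and_map(rankings, max_rank=3)
```

Here `cmc[0]` is the Rank-1 matching rate, while mAP additionally rewards retrieving all true matches of a probe early in the ranking.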
In practice, the ImageNet pre-trained ResNet-50 [resnet] is selected as the backbone CNN of our MTML model (layers after average pooling are removed). For multi-task learning, each branch is formed by an FC classification layer. Person bounding box images are resized to 256×128 pixels before being fed into the network. The standard stochastic gradient descent (SGD) optimizer is adopted for training the MTML model with an initial learning rate of 0.05. In pre-training the model using only the MT loss, the learning rate is decayed by a factor of 10 every 40 epochs and the number of epochs is set to 100. In inter-camera multi-label learning, the learning rate is decayed by a factor of 10 after 8 epochs. The number of epochs is set to 15 in each iteration and the number of iterations is set to 8. To balance the model training speed across camera views, we randomly select from each camera view the same number of persons (i.e., 2 persons) and the same number of images per person identity (i.e., 4 images). By default, the weight $\lambda$ in Eq. (8) is fixed to balance the losses $\mathcal{L}_{MT}$ and $\mathcal{L}_{ML}$.
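The pre-training learning rate schedule and the camera-balanced sampling can be sketched as follows. The helper names are hypothetical, and sampling with replacement when an identity has fewer than four images is our assumption for the sketch rather than a detail stated above.

```python
import random

def step_lr(epoch, base_lr=0.05, step=40, factor=0.1):
    """Learning rate decayed by `factor` every `step` epochs."""
    return base_lr * factor ** (epoch // step)

def sample_camera_batch(images_by_identity, persons=2, images_per_person=4, rng=random):
    """Draw `persons` identities and `images_per_person` images each from one camera."""
    chosen = rng.sample(sorted(images_by_identity), persons)
    batch = []
    for pid in chosen:
        pool = images_by_identity[pid]
        if len(pool) >= images_per_person:
            picks = rng.sample(pool, images_per_person)
        else:
            # Assumption: repeat images when an identity has too few of them.
            picks = rng.choices(pool, k=images_per_person)
        batch.extend((img, pid) for img in picks)
    return batch

# Toy single-camera data (invented file names).
camera_data = {
    0: ["a0", "a1", "a2", "a3", "a4"],
    1: ["b0", "b1"],
    2: ["c0", "c1", "c2", "c3"],
}
camera_batch = sample_camera_batch(camera_data, rng=random.Random(0))
```

With this sampling every camera contributes 2 × 4 = 8 images per mini-batch, so the mini-batch size grows linearly with the number of camera views.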
4.2 Evaluation on Person Re-Identification
Evaluated methods. Apart from the proposed MTML model, we further evaluated two methods particularly adapted to the newly introduced ICS person re-id setting:
(1) Ensemble of Per-Camera Supervised Learning (E-PCSL): Without inter-camera ID labels, we trained a separate re-id model for each camera on the corresponding labelled training data. We used ResNet-50 as the backbone CNN and the softmax cross-entropy loss function as the supervised objective. During deployment, given a test image, we extracted the feature vectors from all the per-camera models, concatenated them into a single representation vector, and utilised the Euclidean distance for re-id matching.
(2) Unsupervised Tracklet Association Learning (UTAL) [li2019utal_pami]: This method is designed for associating person tracklets in an unsupervised manner for video-based re-id, taking auto-detected tracklets as the form of imagery data in particular. To enable multi-shot image-based re-id as considered here, following UTAL we stacked the images with the same ID from the same camera into a single tracklet, forming the intra-camera supervision specifically for this model. In terms of experimental setting, UTAL assumes the same training data annotation as in our work. However, it is worth noting that this is due to the lack of spatio-temporal information in the existing image-based person re-id benchmarks. Conceptually, the two works investigate rather different person re-id scenarios, starting from distinct motivations and annotation assumptions.
In addition, the proposed MTML is also compared with state-of-the-art unsupervised domain adaptive re-id methods, which consider the re-id problem with fully labelled data in the source domain but no labels in the target domain. These methods include CAMEL [yu2017cross], PUL [fan18unsupervisedreid], TJ-AIDL [wang2018Transfer], CycleGAN [zhu2017unpaired], SPGAN [deng2018image], PTGAN [wei2018person], HHL [zhong2018generalizing], MAR [yu2019unsupervised], and ECN [zhong2019invariance]. This provides an overall quantitative evaluation and comparison between different person re-id settings, but not a like-for-like comparison, because different types of supervision are involved.
Results. Tables 1-3 give the re-id performance comparison between our MTML model and the other considered methods. Several observations can be made: (1) By independently exploiting camera-specific identity class annotation, the baseline E-PCSL yields the weakest re-id model generalisation. This is because it can neither leverage the shared knowledge between camera views nor mine the inter-camera identity matching information. (2) Model performance increases steadily with more recent unsupervised re-id models. In comparison, the proposed MTML model improves the performance noticeably. One reason is that our model benefits from the more scalable per-camera ID labelling, in addition to its superior formulation. (3) The proposed MTML model significantly outperforms both E-PCSL and UTAL, suggesting the performance superiority of our method in tackling the person re-id problem under the proposed cheaper annotation setting. Whilst MTML partly shares its model structure with UTAL in terms of the multi-task learning design, we observed a large performance difference between them. A plausible reason is the unique advantage of exploiting the cyclic prediction consistency based inter-camera identity association.
4.3 Further Analysis and Discussions
Model component analysis. We examined the effectiveness of the two model components in MTML, i.e., multi-camera multi-task (MT) and inter-camera multi-label (ML) learning. Table 4 shows that: (1) With the MT component alone, the model can already achieve very competitive performance, suggesting the significance of sharing labelling knowledge across all the camera views via joint multi-task inference. (2) After adding the ML component, the model generalisation capability is further boosted. This indicates the positive influence of leveraging the inter-camera identity association information through self-supervision, despite the risk of deriving false inter-camera identity associations and propagating their errors into the re-id model during training.
Inter-camera identity association dynamics. To further examine the benefits of inter-camera identity association, we tracked the evolving dynamics of self-discovered matching pairs during the model training process. Figure 3 shows that our model is able to reveal an increasing number of inter-camera identity matching pairs whilst maintaining high association accuracy. This helps explain the performance advantage of the proposed inter-camera multi-label learning idea.
In this work, we presented a more scalable intra-camera supervised (ICS) person re-identification problem, characterised by re-id model learning without cross-view pairwise labelling but with only per-camera independent person identity labels. The key idea is to eliminate the tedious process of manually and exhaustively annotating identity classes across every pair of camera views in a surveillance network, which is costly and yields only sparsely available labels. This reformulates conventional supervised re-id model learning into a weakly supervised learning problem with multiple independent ID label spaces across camera views. Consequently, it focuses the learning task on self-discovering inter-camera identity label associations. To that end, we introduced a Multi-Task Multi-Label (MTML) learning algorithm capable of fully exploiting the available weak re-id supervision whilst simultaneously self-mining inter-camera identity association through a cyclic classification consistency idea. Extensive evaluations were conducted on three re-id benchmarks to validate the advantages of the proposed MTML model over state-of-the-art alternative methods in the proposed ICS learning setting. A detailed component analysis is also provided to give insights into our model design.
This work was partially supported by Vision Semantics Limited, the Alan Turing Institute Fellowship Project on Deep Learning for Large-Scale Video Semantic Search, and the Innovate UK Industrial Challenge Project on Developing and Commercialising Intelligent Video Analytics Solutions for Public Safety (98111-571149).