[CVPR20] Weakly supervised discriminative learning with state information for person identification
Unsupervised learning of identity-discriminative visual features is appealing in real-world tasks where manual labelling is costly. However, the images of an identity can be visually discrepant when they are taken under different states, e.g. different camera views and poses. This visual discrepancy leads to great difficulty in unsupervised discriminative learning. Fortunately, in real-world tasks we can often know the states without human annotation, e.g. we can easily obtain the camera view labels in person re-identification and facial pose labels in face recognition. In this work we propose utilizing the state information as weak supervision to address the visual discrepancy caused by different states. We formulate a simple pseudo label model and utilize the state information to refine the assigned pseudo labels by the weakly supervised decision boundary rectification and weakly supervised feature drift regularization. We evaluate our model on unsupervised person re-identification and pose-invariant face recognition. Despite its simplicity, our method outperforms the state-of-the-art results on Duke-reID, MultiPIE and CFP datasets with a standard ResNet-50 backbone. We also find that our model performs comparably with the standard supervised fine-tuning results on the three datasets. Code is available at https://github.com/KovenYu/state-information
While deep discriminative feature learning has shown great success in many vision tasks, it depends highly on manually labelled large-scale visual data. This limits its scalability to real-world tasks where the labelling is costly and tedious, e.g. person re-identification [78, 55] and unconstrained pose-invariant face recognition. Thus, learning identity-discriminative features without manual labels has drawn increasing attention due to its promise to address the scalability problem [67, 18, 69, 68, 63].
However, the images of an identity can be drastically different when they are taken under different states such as different poses and camera views. For example, we observe great visual discrepancy in the images of the same pedestrian under different camera views in a surveillance scenario (see Figure 1). Such visual discrepancy caused by the different states induces great difficulty in unsupervised discriminative learning. Fortunately, in real-world discriminative tasks, we can often have some state information without human annotation effort. For instance, in person re-identification, it is straightforward to know from which camera view an unlabeled image is taken [67, 69, 68, 12], and in face recognition the pose and facial expression can be estimated by off-the-shelf estimators [70, 49] (see Figure 1). We aim to exploit the state information as weak supervision to address the visual discrepancy in unsupervised discriminative learning. We refer to our task as weakly supervised discriminative feature learning.
In this work, we propose a novel pseudo label model for weakly supervised discriminative feature learning. We assign every unlabeled image example to a surrogate class (i.e. an artificially created pseudo class) which is expected to represent an unknown identity in the unlabelled training set, and we construct the surrogate classification as a simple basic model. However, the unsupervised assignment is often incorrect, because the image features of the same identity are distorted due to the aforementioned visual discrepancy. When the visual discrepancy is moderate, in the feature space, an unlabeled example “slips away” from the correct decision region and crosses the decision boundary to the decision region of a nearby surrogate class (see the middle part of Figure 2). We refer to this effect as the feature distortion. We develop the weakly supervised decision boundary rectification to address this problem. The idea is to rectify the decision boundary to encourage the unlabeled example back to the correct decision region.
When the feature distortion is significant, however, the unlabeled example can be pushed far away from the correct decision region. Fortunately, the feature distortion caused by a state often follows a specific distortion pattern (e.g., extremely dark illumination in Figure 1 may suppress most visual features). Collectively, this causes a specific global feature drift (See the right part in Figure 2). Therefore, we alleviate the significant feature distortion to a moderate level (so that it can be addressed by the decision boundary rectification) by countering the global-scale feature drifting. Specifically, we achieve this by introducing the weakly supervised feature drift regularization.
We evaluate our model on two tasks, i.e. unsupervised person re-identification and pose-invariant face recognition. We find that our model performs comparably with standard supervised learning on the DukeMTMC-reID, Multi-PIE and CFP datasets. We also find that our model outperforms the state-of-the-art unsupervised models on DukeMTMC-reID and supervised models on Multi-PIE and CFP. To the best of our knowledge, this is the first work to develop a weakly supervised discriminative learning model that can be successfully applied to different tasks, leveraging different kinds of state information.
Learning with state information. State information has been explored separately in identification tasks. In person re-identification (RE-ID), several works leveraged the camera view label to help learn view-invariant features and distance metrics [34, 12, 7, 32, 83]. In face recognition, the pose label was also used to learn pose-invariant models [76, 74, 29, 86, 64, 51, 65]. Specifically, prior works visualized the feature embedding to illustrate the feature distortion problem for person re-identification and face recognition. However, most existing methods are based on supervised learning, and thus the prohibitive labelling cost could largely limit their scalability. Therefore, unsupervised RE-ID [68, 67, 31, 19, 58, 63, 61] and cross-domain transfer learning RE-ID [69, 13, 80, 81, 54, 82, 11, 73, 20] have been attracting increasing attention. These methods typically incorporate the camera view labels to learn the camera view-specific feature transforms [67, 68], to learn the soft multilabels, to provide associations between the video RE-ID tracklets, or to generate augmentation images [80, 81, 82]. Our work is different from the cross-domain transfer learning RE-ID methods in that we do not need any labeled data in the training stage. As for the unsupervised RE-ID methods, the most related works are [67, 68], where Yu et al. proposed asymmetric clustering in which the camera view labels were leveraged to learn a set of view-specific projections. However, they need to learn as many projections as there are camera views by solving a costly eigen problem, which limits their scalability. In contrast, we learn a generalizable feature for all kinds of states (camera views).
Weakly supervised learning. Our method is to iteratively refine pseudo labels with the state information which is regarded as weak supervision. The state information serves to guide the pseudo label assignments as well as to improve the feature invariance against distractive states.
In the literature, weak supervision is a broadly used term. Typical weak supervision includes image-level coarse labels for finer tasks like detection [5, 6] and segmentation [57, 41]. Another line of research that is more related to our work is utilizing large-scale inaccurate labels (typically collected online or from a database like Instagram or Flickr) to learn general features. Different from existing works, our objective is to learn identity-discriminative features that are directly applicable to identification tasks without supervised fine-tuning.
Unsupervised deep learning. Beyond certain vision applications, general unsupervised deep learning is a long-standing problem in the vision community. The typical lines of research include clustering based methods [9, 62, 2, 17], which discovered cluster structures in the unlabelled data and utilized the cluster labels, and generation based methods, which learned low-dimensional features that were effective for generative discrimination [45, 16, 23] or reconstruction [52, 30, 3].
Recently, self-supervised learning, a promising paradigm of unsupervised learning, has been quite popular. Self-supervised methods typically construct some pretext tasks where the supervision comes from the data. Typical pretext tasks include predicting relative patch positions, predicting future patches, solving jigsaw puzzles [38, 39] and related pretext tasks [42, 71, 72], and predicting image rotation. By solving the pretext tasks, they aimed to learn features that were useful for downstream real-world tasks.
Our goal is different from these works. Since they aim to learn useful features for various downstream tasks, they were designed to be downstream task-agnostic, and required supervised fine-tuning for each task. In contrast, we focus on the “fine-tuning” step, with the goal of reducing the need for manual labeling.
Let $\mathcal{U} = \{u_i\}_{i=1}^{N}$ denote the unlabelled training set, where $u_i$ is an unlabelled image example. We also know the state $s_i \in \{1, \dots, J\}$, e.g., the illumination of $u_i$ is dark, normal or bright. Our goal is to learn a deep network $f$ to extract the identity-discriminative feature, which is denoted by $x = f(u; \theta)$. A straightforward idea is to assume that in the feature space every $x$ belongs to a surrogate class which is modelled by a surrogate classifier $\mu$. A surrogate class is expected to model a potential unknown identity in the unlabeled training set. The discriminative learning can be done by a surrogate classification:

$$\min_{\theta, \{\mu_k\}} L_{\mathrm{surr}} = -\sum_{x} \log \frac{\exp(x^{\top} \mu_{\hat{y}})}{\sum_{k=1}^{K} \exp(x^{\top} \mu_k)},$$

where $\hat{y}$ denotes the surrogate class label of $x$, and $K$ denotes the number of surrogate classes. An intuitive method for surrogate class assignment is:

$$\hat{y} = \arg\max_{k} \exp(x^{\top} \mu_k).$$
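The surrogate classification and the intuitive assignment can be sketched in a few lines. This is a minimal illustration, assuming the loss is a softmax cross-entropy over inner products between a feature and the surrogate classifier vectors; the function names and the toy vectors are hypothetical and not taken from the released code.

```python
import math

def assign_surrogate(x, mus):
    """Return the index k that maximizes the inner product x^T mu_k."""
    scores = [sum(xi * mi for xi, mi in zip(x, mu)) for mu in mus]
    return max(range(len(mus)), key=lambda k: scores[k])

def surrogate_loss(x, mus, y_hat):
    """Softmax cross-entropy of feature x against the surrogate label y_hat."""
    scores = [sum(xi * mi for xi, mi in zip(x, mu)) for mu in mus]
    top = max(scores)  # subtract the max score for numerical stability
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_norm - scores[y_hat]

mus = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # toy surrogate classifiers
x = [0.2, 0.9]                                # toy feature vector
y_hat = assign_surrogate(x, mus)
loss = surrogate_loss(x, mus, y_hat)
```

Because `exp` is monotonic, taking the argmax over raw inner products gives the same assignment as the argmax over their exponentials.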
However, the visual discrepancy caused by the state leads to incorrect assignments. When the feature distortion is moderate, wrong assignments happen locally, i.e., the feature $x$ wrongly crosses the decision boundary into a nearby surrogate class’ decision region. We develop the Weakly supervised Decision Boundary Rectification (WDBR) to address it. The significant feature distortion, however, is extremely challenging, as $x$ is pushed far away from the correct decision region. To deal with it, we introduce the Weakly supervised Feature Drift Regularization (WFDR) to alleviate the significant feature distortion down to a moderate level that WDBR can address. We show an overview illustration in Figure 2.
We first consider the moderate visual feature distortion. It “nudges” an image feature to wrongly cross the decision boundary into a nearby surrogate class. For example, two persons wearing dark clothes are even harder to distinguish when they both appear in a dark camera view. Thus, these person images are assigned to the same surrogate class (see Figure 2 for illustration). In this case, a direct observation is that most members of the surrogate class are taken from the same dark camera view (i.e. the same state). Therefore, we quantify the extent to which a surrogate class is dominated by a state. We push the decision boundary toward a highly dominated surrogate class or even nullify it, in an attempt to correct these local boundary-crossing wrong assignments.
We quantify the extent by the Maximum Predominance Index (MPI). The MPI is defined as the proportion of the most common state in a surrogate class. Formally, the MPI of the $k$-th surrogate class is defined by:

$$R_k = \frac{\max_{j} |\mathcal{M}_k \cap \mathcal{Q}_j|}{|\mathcal{M}_k|} \in [0, 1],$$

where the denominator is the number of members in a surrogate class, formulated by the cardinality of the member set of the $k$-th surrogate class $\mathcal{M}_k$:

$$\mathcal{M}_k = \{x_i \mid \hat{y}_i = k\},$$

and the numerator is the number of presences of the most common state in $\mathcal{M}_k$. We formulate it by the intersection of $\mathcal{M}_k$ and the state subset corresponding to the $j$-th state $\mathcal{Q}_j$:

$$\mathcal{Q}_j = \{x_i \mid s_i = j\}. \tag{5}$$
Note that the member set is dynamically updated, as the surrogate class assignment (Eq. (15)) is on-the-fly along with the learning, and is improved upon better learned features.
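As a concrete illustration of the MPI, the following sketch counts states per surrogate class. The function names, the camera labels and the convention of returning 0 for an empty class are assumptions made here for illustration only.

```python
from collections import Counter

def max_predominance_index(states_of_members):
    """Proportion of the most common state among one surrogate class's members."""
    if not states_of_members:
        return 0.0  # convention chosen here for an empty class (assumption)
    counts = Counter(states_of_members)
    return max(counts.values()) / len(states_of_members)

def mpi_per_class(assignments, states, num_classes):
    """Group state labels by assigned surrogate class, then compute each MPI."""
    members = [[] for _ in range(num_classes)]
    for y_hat, s in zip(assignments, states):
        members[y_hat].append(s)
    return [max_predominance_index(m) for m in members]

assignments = [0, 0, 0, 0, 1, 1, 1, 1]  # surrogate class of each example
states = ["cam1", "cam1", "cam1", "cam1", "cam1", "cam2", "cam3", "cam2"]
mpi = mpi_per_class(assignments, states, num_classes=2)
```

In this toy case the first class is made up entirely of `cam1` images (MPI of 1), which is the pattern the text flags as suspicious, while the second class mixes three cameras (MPI of 0.5).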
As analyzed above, a higher MPI $R_k$ indicates that it is more likely that some examples have wrongly crossed the decision boundary into the surrogate class due to the feature distortion. Hence, we shrink that surrogate class’ decision boundary to purge the potential boundary-crossing examples from its decision region. Specifically, we develop the weakly supervised rectified assignment:

$$\hat{y} = \arg\max_{k} \, p(k) \exp(x^{\top} \mu_k),$$

where $p(k)$ is the rectifier function that is monotonically decreasing with $R_k$:

$$p(k) = \frac{1}{1 + \exp(a \cdot (R_k - b))} \in [0, 1],$$

where $a \geq 0$ is the rectification strength and $b \in [0, 1]$ is the rectification threshold, which we typically set to a high value. In particular, we consider $a = \infty$, and thus we have:

$$p(k) = \begin{cases} 1, & R_k \leq b, \\ 0, & \text{otherwise}. \end{cases}$$

This means that when the MPI of a surrogate class exceeds the threshold $b$, we nullify the class by shrinking its decision boundary to a single point. We show a plot of $p(k)$ in Figure 3(a).
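The rectified assignment can be sketched as follows, assuming a logistic rectifier with strength `a` and threshold `b` that degenerates to a step function when `a` is infinite. The helper names and the numeric values of `a`, `b` and the toy vectors are illustrative assumptions.

```python
import math

def rectifier(mpi, a, b):
    """Logistic rectifier in mpi; a = math.inf gives the hard step at b."""
    if math.isinf(a):
        return 1.0 if mpi <= b else 0.0
    z = a * (mpi - b)
    if z > 700.0:  # exp would overflow; the rectifier is effectively 0
        return 0.0
    return 1.0 / (1.0 + math.exp(z))

def rectified_assign(x, mus, mpis, a, b):
    """argmax_k p(k) * exp(x^T mu_k), computed in the log domain.

    Returns None if every surrogate class has been nullified.
    """
    best_k, best_score = None, -math.inf
    for k, (mu, r) in enumerate(zip(mus, mpis)):
        p = rectifier(r, a, b)
        if p == 0.0:
            continue  # nullified class: its decision region is empty
        score = math.log(p) + sum(xi * mi for xi, mi in zip(x, mu))
        if score > best_score:
            best_k, best_score = k, score
    return best_k

mus = [[1.0, 0.0], [0.0, 1.0]]   # toy surrogate classifiers
x = [0.9, 0.8]                   # slightly closer to class 0
mpis = [0.99, 0.40]              # class 0 is dominated by one state
hard = rectified_assign(x, mus, mpis, a=math.inf, b=0.95)
soft = rectified_assign(x, mus, mpis, a=5.0, b=0.95)
```

Without rectification the example would go to class 0. Here both the hard and the soft variant move it to class 1, because class 0 is almost entirely made up of a single state.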
For any two neighboring surrogate classes $\mu_1$ and $\mu_2$, the decision boundary is (where we leave the derivation to the supplementary material):

$$(\mu_1 - \mu_2)^{\top} x + \log \frac{p(1)}{p(2)} = 0.$$
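The full derivation is deferred to the supplementary material. A short sketch of one way to obtain such a boundary follows, under the assumption that the rectified assignment picks the class $k$ maximizing $p(k)\exp(x^{\top}\mu_k)$ and that both $p(1)$ and $p(2)$ are positive. The boundary is the set of points where the two rectified scores tie:

```latex
\begin{align*}
p(1)\exp(x^{\top}\mu_1) &= p(2)\exp(x^{\top}\mu_2) \\
\log p(1) + x^{\top}\mu_1 &= \log p(2) + x^{\top}\mu_2 \\
(\mu_1 - \mu_2)^{\top} x + \log\frac{p(1)}{p(2)} &= 0
\end{align*}
```

When $p(1) = p(2)$ the log term vanishes and the boundary reduces to the unrectified hyperplane $(\mu_1 - \mu_2)^{\top} x = 0$. A smaller $p(1)$ makes the log term negative, which shifts the boundary toward $\mu_1$ and shrinks its decision region.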
Discussion. To have a better understanding of the WDBR, let us first consider the hard rectifier function. When a surrogate class’ MPI exceeds the threshold $b$ (which is typically set to a high value), the decision region vanishes, and no example would be assigned to the surrogate class (i.e., it is completely nullified). Therefore, WDBR prevents the unsupervised learning from being misled by those severely affected surrogate classes. For example, if more than a fraction $b$ of the person images assigned to a surrogate class are from the same dark camera view, it is highly likely that this is simply because it is too dark to distinguish them, rather than because they are the same person. Thus, WDBR nullifies this poorly formed surrogate class.
When we use the soft rectifier function, the WDBR does not directly nullify the surrogate class that exceeds the threshold, but favors the surrogate class which has lower MPI (because they are less likely to have the boundary-cross problem) by moving the decision boundary. This can be seen from Figure 3(b) where we plot a set of decision boundaries in the two-class case. In some sense, the soft WDBR favors the state-balanced surrogate classes. This property may further improve the unsupervised learning, especially if the unlabelled training set is indeed state-balanced for most identities. However, if we do not have such a prior knowledge of state balance, using hard rectifier can be more desirable, because hard rectifier does not favor state-balanced surrogate classes. We will discuss more about this property upon real cases in Sec. 4.2.
In the supplementary material, we theoretically justify our model by showing that the rectified assignment is the maximum a posteriori optimal estimation of $\hat{y}$. However, the WDBR is a local mechanism, i.e. WDBR deals with the moderate feature distortion that nudges examples to slip into nearby surrogate classes. Its effectiveness might be limited when the feature distortion is significant.
A visually dominant state may cause a significant feature distortion that pushes an example far away from the correct surrogate class. This problem is extremely difficult to address by only considering a few surrogate classes in a local neighborhood. Nevertheless, such a significant feature distortion is likely to follow a specific pattern. For example, extremely low illumination may suppress all kinds of visual features: dim colors, indistinguishable textures, etc. Collectively, we can capture the significant feature distortion pattern at a global scale. In other words, such a state-specific feature distortion would cause many examples in the state subset to drift toward a specific direction (see Figure 2 for illustration). We capture this by the state sub-distribution and introduce the Weakly supervised Feature Drift Regularization (WFDR) to address it and complement the WDBR.
In particular, we define the state sub-distribution as $P(\mathcal{Q}_j)$, which is the distribution over the state subset $\mathcal{Q}_j$ defined in Eq. (5), for example, all the unlabeled person images captured from a dark camera view. We further denote the distribution over the whole unlabelled training set as $P(\mathcal{X})$, where $\mathcal{X} = f(\mathcal{U})$. Apparently, the state-specific feature distortion would lead to a specific sub-distributional drift, i.e., $P(\mathcal{Q}_j)$ drifts away from $P(\mathcal{X})$. For example, all person images from a dark camera view may be extremely low-valued in many feature dimensions, and this forms a specific distributional characteristic. Our idea is straightforward: we counter this “collective drifting force” by aligning the state sub-distribution $P(\mathcal{Q}_j)$ with the overall total distribution $P(\mathcal{X})$ to suppress the significant feature distortion. We formulate this idea as the Weakly supervised Feature Drift Regularization (WFDR):

$$L_{\mathrm{drift}} = \sum_{j} \left( \| m_j - m \|_2^2 + \| \sigma_j - \sigma \|_2^2 \right),$$

where $m_j$/$\sigma_j$ is the mean/standard deviation feature vector over the state subset $\mathcal{Q}_j$. Similarly, $m$/$\sigma$ is the mean/standard deviation feature vector over the whole unlabelled training set $\mathcal{X}$.
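A minimal sketch of a drift regularizer of this kind is given below. It assumes the alignment is measured by the squared differences between per-state and overall mean and standard deviation vectors, summed over states. The function names and toy features are hypothetical, and in practice the statistics would likely be computed on mini-batches of network features.

```python
import math

def mean_and_std(features):
    """Per-dimension mean and (population) standard deviation."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    std = [math.sqrt(sum((f[d] - mean[d]) ** 2 for f in features) / n)
           for d in range(dim)]
    return mean, std

def drift_regularizer(features, states):
    """Sum over states j of ||m_j - m||^2 + ||sigma_j - sigma||^2."""
    m, sigma = mean_and_std(features)
    total = 0.0
    for j in sorted(set(states)):
        subset = [f for f, s in zip(features, states) if s == j]
        m_j, sigma_j = mean_and_std(subset)
        total += sum((p - q) ** 2 for p, q in zip(m_j, m))
        total += sum((p - q) ** 2 for p, q in zip(sigma_j, sigma))
    return total

# Two states whose features sit far apart: a large drift penalty.
drifted = drift_regularizer([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]],
                            [0, 0, 1, 1])
# Two states with identical statistics: no penalty.
aligned = drift_regularizer([[0.0], [2.0], [0.0], [2.0]], [0, 0, 1, 1])
```

The penalty is zero only when every state subset shares the overall mean and standard deviation, which is the aligned situation the regularizer pushes toward.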
Ideally, WFDR alleviates the significant feature distortion down to a mild level (i.e., $x$ is regularized into the correct decision region) or a moderate level (i.e., $x$ is regularized into the neighborhood of the correct surrogate class) that the WDBR can address. Thus, it is mutually complementary to the WDBR. We note that the WFDR is mathematically akin to the soft multilabel learning loss in prior work, but they serve different purposes. The soft multilabel learning loss aligns the cross-view associations between unlabeled target images and labeled source images, while we aim to align the feature distributions of unlabeled images and do not need a source dataset.
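Finally, the loss function of our model is:

$$\min_{\theta, \{\mu_k\}} L = L_{\mathrm{surr}} + \lambda L_{\mathrm{drift}},$$

where $L_{\mathrm{surr}}$ is the surrogate classification loss, $L_{\mathrm{drift}}$ is the WFDR term, and $\lambda > 0$ is a hyperparameter to balance the two terms.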
In our implementation we used the standard ResNet-50 as our backbone network. We trained our model for approximately 1,600 iterations with batch size 384, momentum 0.9 and weight decay 0.005. Following prior work, we used SGD, set the learning rate to 0.001, and divided the learning rate by 10 after 1,000/1,400 iterations. We used a single SGD optimizer for both the network parameters $\theta$ and the surrogate classifiers $\{\mu_k\}$. Training took less than two hours using 4 Titan X GPUs. We initialized the surrogate classifiers $\{\mu_k\}$ by performing standard K-means clustering on the initial feature space and using the cluster centroids. For further details please refer to the supplementary material, where we also summarize our method as an algorithm.
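The centroid-based initialization can be illustrated with a plain Lloyd's K-means loop. This is only a sketch: the naive choice of the first `k` points as initial centroids, the iteration count and the function name are assumptions, and a library implementation would normally be used on real features.

```python
def kmeans_centroids(features, k, num_iters=20):
    """Plain Lloyd's K-means; returns k centroids to initialize the classifiers."""
    centroids = [list(f) for f in features[:k]]  # naive init: first k points
    for _ in range(num_iters):
        clusters = [[] for _ in range(k)]
        for f in features:
            dists = [sum((a - b) ** 2 for a, b in zip(f, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(f)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

features = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
init_mus = kmeans_centroids(features, k=2)
```

Each returned centroid would then serve as the starting value of one surrogate classifier vector.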
We evaluated our model on two real-world discriminative tasks with state information, i.e. person re-identification (RE-ID) and pose-invariant face recognition (PIFR) [27, 14]. In RE-ID, which aims to match person images across non-overlapping camera views, the state information is the camera view label, as illustrated in Figure 4(a) and 4(b). Note that each camera view has its specific conditions including illumination, viewpoint and occlusion (e.g. Figure 4(a) and 4(b)). In PIFR, which aims to identify faces across different poses, the state information is the pose, as illustrated in Figure 4(c). We note that on both tasks the training identities are completely different from the testing identities. Hence, these tasks are suitable to evaluate the discriminability and generalisability of the learned features.
Person re-identification (RE-ID). We evaluated on Market-1501 and DukeMTMC-reID [79, 46]. Market-1501 contains 32,668 person images of 1,501 identities. Images of each person were taken from at least 2 out of 6 disjoint camera views. We followed the standard evaluation protocol where the training set had 750 identities and the testing set had the other 751 identities. The performance was measured by the cumulative accuracy and the mean average precision (MAP). DukeMTMC-reID contains 36,411 person images of 1,404 identities. Images of each person were taken from at least 2 out of 8 disjoint camera views. We followed the standard protocol, which was similar to that of Market-1501. Following prior work, we pretrained the network with the standard softmax loss on the MSMT17 dataset, in which the scenario and identity pool were completely different from Market-1501 and DukeMTMC-reID. It should be pointed out that in fine-grained discriminative tasks like RE-ID and PIFR, the pretraining is important for unsupervised models because the class-discriminative visual clues are not general but highly task-dependent [21, 18, 54, 68], and therefore some extent of field-specific knowledge is necessary for successful unsupervised learning. We resized the images to a fixed resolution. In the unsupervised setting, the precise number of training classes (persons) (i.e. 750/700 for Market-1501/DukeMTMC-reID) should be unknown. Since our method was able to automatically discard excessive surrogate classes, an “upper bound” estimation could be reasonable. We set the same number of surrogate classes $K$ for both datasets.
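The two RE-ID metrics can be sketched for a single query as follows. This simplified version assumes the gallery has already been sorted by decreasing similarity to the query, and it omits the filtering of same-camera gallery images that the standard protocols apply. The function names and labels are illustrative.

```python
def rank_k_hit(ranked_labels, query_label, k):
    """Cumulative accuracy for one query: is a correct match in the top k?"""
    return query_label in ranked_labels[:k]

def average_precision(ranked_labels, query_label):
    """Mean of the precision values at each position holding a correct match."""
    hits, precisions = 0, []
    for position, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0

ranked = ["b", "a", "c", "a"]  # gallery identities, best match first
hit_at_1 = rank_k_hit(ranked, "a", 1)
hit_at_5 = rank_k_hit(ranked, "a", 5)
ap = average_precision(ranked, "a")
```

Averaging `rank_k_hit` over all queries gives the rank-k accuracy, and averaging `average_precision` over all queries gives the MAP.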
Pose-invariant face recognition (PIFR). We mainly evaluated on the large dataset Multi-PIE. Multi-PIE contains 754,200 images of 337 subjects taken with up to 20 illuminations, 6 expressions and 15 poses. For Multi-PIE, most experiments followed the widely-used setting which used all 337 subjects with neutral expression and 9 poses interpolated between -60° and 60°. The training set contained the first 200 persons, and the testing set contained the remaining 137 persons. When testing, one image per identity with the frontal view was put into the gallery set and all the other images into the query set. The performance was measured by the top-1 recognition rate. We detected and cropped the face images by MTCNN, resized the cropped images to a fixed resolution, and adopted publicly provided pretrained model weights. Similarly to the unsupervised RE-ID setting, we simply set $K$ to an upper-bound estimate. We also evaluated on an unconstrained dataset, CFP. The in-the-wild CFP dataset contains 500 subjects with 10 frontal and 4 profile images for each subject. We adopted the more challenging frontal-profile verification setting. We followed the official protocol to report the mean accuracy, equal error rate (EER) and area under curve (AUC).
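The gallery/query protocol for the top-1 recognition rate can be sketched as below. Cosine similarity is assumed here as the matching score, which the text does not specify; the identities and feature vectors are toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top1_rate(gallery, queries):
    """gallery: {identity: frontal feature}; queries: list of (identity, feature)."""
    hits = 0
    for true_id, feat in queries:
        predicted = max(gallery, key=lambda gid: cosine(feat, gallery[gid]))
        hits += predicted == true_id
    return hits / len(queries)

gallery = {"A": [1.0, 0.0], "B": [0.0, 1.0]}  # one frontal image per identity
queries = [("A", [0.9, 0.1]), ("B", [0.2, 0.8]), ("A", [0.4, 0.6])]
rate = top1_rate(gallery, queries)
```

The third query belongs to identity A but lies closer to B's gallery feature, so two of the three queries are recognized correctly.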
In the unsupervised RE-ID task, the camera view labels were naturally available [69, 68]. In PIFR we used groundtruth pose labels for better analysis. In the supplementary material we show the simulation results when we used estimated pose labels. The performance did not drop until the proportion of correctly estimated pose labels fell below the level reported there. In practice the facial pose is continuous and we need to discretize it to produce the pose labels. In our preliminary experiments on Multi-PIE we found that merging the pose labels into coarse-grained groups did not affect the performance significantly. Therefore, for fair comparison with other methods, we followed the conventional setting and used the default pose labels. We used the same values of $\lambda$ and $b$ for all datasets except Multi-PIE, which has more continual poses and for which we decreased both. We evaluated both the soft version (finite $a$) and the hard version ($a = \infty$). We provide evaluations and analysis for $K$, $\lambda$, $a$ and $b$ in the supplementary material.
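Table 1. Model evaluation on person re-identification (%).

| Method | DukeMTMC-reID rank-1 | DukeMTMC-reID rank-5 | DukeMTMC-reID mAP | Market-1501 rank-1 | Market-1501 rank-5 | Market-1501 mAP |
|---|---|---|---|---|---|---|
| K-means as labels | 37.3 | 52.1 | 25.2 | 47.3 | 63.1 | 25.6 |
| Basic + WDBR (hard) | 69.4 | 80.5 | 50.2 | 60.3 | 73.4 | 34.5 |
| Basic + WDBR (soft) | 63.6 | 77.2 | 45.4 | 60.0 | 75.6 | 34.3 |
| Basic + WFDR | 67.7 | 79.4 | 47.5 | 67.4 | 82.3 | 39.4 |
| Full model (hard) | 72.1 | 83.5 | 53.8 | 74.0 | 87.4 | 47.9 |
| Full model (soft) | 70.3 | 81.7 | 50.0 | 70.7 | 85.2 | 43.4 |

Table 2. Model evaluation on Multi-PIE (top-1 recognition rate, %).

| Method | avg | 0° | ±15° | ±30° | ±45° | ±60° |
|---|---|---|---|---|---|---|
| K-means as labels | 81.0 | 95.7 | 94.6 | 89.1 | 76.7 | 56.0 |
| Basic + WDBR (hard) | 91.7 | 98.9 | 98.7 | 97.5 | 91.2 | 75.9 |
| Basic + WDBR (soft) | 97.0 | 99.1 | 98.9 | 98.3 | 96.8 | 93.0 |
| Basic + WFDR | 95.7 | 98.4 | 98.1 | 97.0 | 95.5 | 91.0 |
| Full model (hard) | 95.7 | 98.3 | 98.1 | 97.0 | 95.3 | 91.1 |
| Full model (soft) | 97.1 | 99.1 | 98.9 | 98.3 | 96.8 | 93.1 |

Table 3. Model evaluation on CFP (%), reported as mean (standard deviation).

| Method | Accuracy | EER | AUC |
|---|---|---|---|
| Full model (soft) | 95.49 (0.70) | 4.74 (0.72) | 98.83 (0.29) |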
We decomposed our model for analysis. To ground the performance, we provided the standard supervised fine-tuning results (i.e. replacing our proposed loss with softmax loss with groundtruth class labels, and keeping other settings the same) which could be seen as an upper bound. As an unsupervised baseline, we used K-means cluster labels (i.e. we performed K-means once on the pretrained feature space to obtain the cluster labels, and used the cluster labels instead of the groundtruth labels for fine-tuning) and we denote this as “K-means as labels”. We also ablated both WDBR and WFDR from our full model to obtain a “Basic model”. The key difference between “K-means as labels” and “Basic model” is that the former uses fixed cluster labels while the latter dynamically infers pseudo labels every batch along with model training. We show the results in Table 1, 2 and 3. On CFP the observation was similar to Multi-PIE and we show the most significant results only.
Comparable performance to standard supervised fine-tuning. Compared to the standard supervised fine-tuning, we found that our model could perform comparably with the supervised results in both the person re-identification task on DukeMTMC-reID and the face recognition task on Multi-PIE and CFP. The overall effectiveness was clear when we grounded the performance by both the supervised results and the pretrained baseline results. For example, on DukeMTMC-reID, the supervised learning improved the pretrained network by 31.9% in rank-1 accuracy, while our model improved it by 29.0%, leaving a gap of only 2.9%. On Multi-PIE our model achieved a 97.1% average recognition rate, which was very close to the supervised result of 98.2%. On CFP our model even achieved approximately the same performance as supervised fine-tuning, probably because the small training set (6300 images) favored a regularization. We also notice that the significant improvements held both when the initial pretrained backbone network was weak (e.g. in RE-ID the initial rank-1 accuracy was below 50%) and when the initial backbone network was strong (i.e. in PIFR the initial recognition accuracy was over 80%). These comparisons verified the effectiveness of our model.
Soft vs. hard decision boundary rectification. We found that the soft rectification performed better on PIFR benchmarks while hard rectification excelled at RE-ID. We hypothesize that a key reason was that on the RE-ID datasets, different persons’ images were unbalanced, i.e., some IDs appeared in only two camera views while some may appear in up to six camera views. For example, for a person who appeared in 2 camera views, the MPI was at least 1/2, while this lower bound was 1/6 for another person who appeared in 6 camera views. Thus the soft rectifier may unfairly favor the surrogate class corresponding to the person appearing in more camera views. In contrast, the hard rectifier does not favor state balance: it only nullifies surrogate classes that are highly likely to be incorrect, i.e. those with very high MPI. Therefore, the hard rectification could be more robust to the state imbalance. On the other hand, for Multi-PIE and CFP, where the classes were balanced, soft rectification would fine-tune the decision boundary to a better position, and thus achieved better results. Hence, in this paper we used the hard WDBR for RE-ID and the soft WDBR for PIFR.
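The imbalance argument can be checked with a small worked example. It assumes a logistic soft rectifier and uses illustrative values for the strength `a` and threshold `b`; both surrogate classes are assumed to be perfectly formed, so each sits at its MPI lower bound of one over the number of camera views.

```python
import math

def soft_rectifier(mpi, a, b):
    """Logistic rectifier, decreasing in the MPI."""
    return 1.0 / (1.0 + math.exp(a * (mpi - b)))

def hard_rectifier(mpi, b):
    """Step rectifier: 1 up to the threshold, 0 above it."""
    return 1.0 if mpi <= b else 0.0

a, b = 5.0, 0.95  # illustrative values only

# A correctly formed class for a person seen in n views has MPI >= 1/n.
mpi_two_views = 1 / 2
mpi_six_views = 1 / 6

p_soft_two = soft_rectifier(mpi_two_views, a, b)
p_soft_six = soft_rectifier(mpi_six_views, a, b)
p_hard_two = hard_rectifier(mpi_two_views, b)
p_hard_six = hard_rectifier(mpi_six_views, b)
```

Even though both classes are correct, the soft rectifier gives the six-view class a larger weight than the two-view class, while the hard rectifier treats them identically.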
Complementary nature of WDBR and WFDR. Comparing the basic model (our model without WDBR or WFDR) to basic model with either WDBR or WFDR, the performance was consistently improved. With both WDBR and WFDR, the performance was further improved. This showed that the fine local-scale WDBR and the global-scale WFDR were complementarily effective.
We noticed that on Multi-PIE this complementary nature was less significant, as using WDBR alone could achieve similar results to the full model. This may be due to the continual nature of the pose variation on Multi-PIE: the variation from 0° to 60° is in a locally connected manifold, with 15°/30°/45° in between. Therefore, it was easier for our local mechanism to gradually “connect” some 0°/15° surrogate classes with some 15°/30° surrogate classes to finally have a global aligning effect. In contrast, in RE-ID this manifold nature is less apparent since it lacks evidence of inherent relations between each pair of camera views.
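Table 4. Comparison on person re-identification (%).

| Method | Reference | DukeMTMC-reID rank-1 | DukeMTMC-reID mAP | Market-1501 rank-1 | Market-1501 mAP |
|---|---|---|---|---|---|
| Wu et al. | ICCV’19 | 59.3 | 37.8 | 65.4 | 35.5 |

Table 6. Comparison on CFP (%), reported as mean (standard deviation).

| Method | Accuracy | EER | AUC |
|---|---|---|---|
| Deep Features | 84.91 (1.82) | 14.97 (1.98) | 93.00 (1.55) |
| Triplet Embedding | 89.17 (2.35) | 8.85 (0.99) | 97.00 (0.53) |
| Chen et al. | 91.97 (1.70) | 8.00 (1.68) | 97.70 (0.82) |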
We further demonstrated the effectiveness of our model by comparing to the state-of-the-art methods on both tasks. It should be pointed out that for RE-ID and PIFR, where the goal was to solve the real-world problem, there were no standards on architecture: different works used different networks and different pretraining data. Thus we simply kept using the standard ResNet-50 without task-specific improvements and using the public pretraining data. For a fairer comparison, we also compared our method with a recent unsupervised deep learning method, DeepCluster, which also uses a discriminative classification loss. We used the same architecture and pretraining as for our method. We show the results in Table 4, 5 and 6.
Superior performance across tasks and benchmarks. Compared to the reported results, our method could achieve the state-of-the-art performances. On the unsupervised RE-ID task, our method achieved a 5.0%/5.8% absolute improvement in rank-1 accuracy/MAP on DukeMTMC-reID, compared to the recent state-of-the-art RE-ID model MAR, which used exactly the same architecture and pretraining data as ours. Although a few recent domain adaptation methods achieve comparable performances to our method, it is worth noting that they rely on labeled source data for discriminative learning, while we do not use labeled data and our method can generalize to different tasks instead of specifically modeling a single RE-ID task. We note that many of the compared recent state-of-the-art RE-ID methods also exploited the camera view labels [80, 81, 69, 67, 68, 60, 44]. For instance, the domain adaptation RE-ID models HHL/ECN leveraged the camera view labels to synthesize cross-view person images for training data augmentation [80, 81], and MAR used view labels to learn the view-invariant soft multilabels. On the pose-invariant face recognition task, our model outperformed the state-of-the-art supervised results on both the Multi-PIE and the CFP benchmarks. We also note that most compared PIFR models exploited both the identity labels and the pose labels.
Our model also outperformed DeepCluster significantly on all four benchmarks. A major reason should be that some discriminative visual clues (e.g. fine-grained clothes patterns) of persons (/faces) were “overpowered” by the camera view (/pose) induced feature distortion. Without appropriate mechanisms to address this problem, the clustering might be misled by the feature distortion to produce inferior cluster separations. In contrast, our model addressed this problem via the weakly supervised decision boundary rectification and the feature drift regularization.
To provide visual insights into the problems our model tried to address, we show the t-SNE embedding of the learned features in Figure 5. Note that the shown identities were unseen during training, and thus the characteristics reflected in the qualitative results were generalisable.
Addressing intra-identity visual feature discrepancy. Let us compare Figure 5(a) and Figure 5(b) for an illustration. Figure 5(a) illustrated that the same person (see the highlighted brown points) appeared differently in two camera views, which had different viewpoints and backgrounds. This visual discrepancy caused significant feature distortion of the identity. Apparently, it was extremely difficult to address this problem if no effective mechanisms were provided besides feature similarity. From Figure 5(b) we observed that when equipped with the WDBR and WFDR, the feature distortion was significantly alleviated. This observation indicated that our model leveraged the state information to effectively alleviate the intra-identity visual discrepancy for better discriminative feature learning.
Addressing inter-identity visual feature entanglement. In a more complex case shown in Figure 5(c), we observed that some visually similar frontal face images (males with eye glasses) were entangled with each other in the feature space learned by the basic model. In particular, some magenta, red and dark green points highly overlapped with each other. This demonstrated that if we simply used feature similarity, it was also extremely difficult to address the inter-identity visual feature entanglement. Nevertheless, as shown in Figure 5(d), our full model could address this problem with the WDBR and WFDR. The learned feature space was much more desirable, and the inter-identity overlapping points were now distant from each other. In other words, our model could leverage the state information to help the unsupervised learning by alleviating the inter-identity visual feature entanglement.
|w/ only pose labels||94.6||96.5||96.2||95.9||94.9||90.6|
|w/ only illumination labels||30.3||73.3||65.8||24.9||7.0||2.2|
|w/ only expression labels||43.5||80.6||75.2||51.5||26.2||2.6|
|w/ all three kinds of labels||95.9||98.3||98.2||97.2||95.7||91.1|
Our method extends easily to multiple kinds of state information. We experimented on Multi-PIE with expression, illumination and pose as the state information, using all 6 expressions, 20 illuminations and 9 poses. We decomposed the rectifier into three components, one each for pose, illumination and expression. Accordingly, we also used three equally weighted feature drift regularization terms in the loss function. We used hard WDBR so that the rectifier function had a regular shape. We show the results in Table 7. Exploiting pose labels produced much better results than exploiting illumination or expression labels, indicating that pose was the most distractive state on Multi-PIE. Exploiting all three kinds of state information further improved the performance to 95.9%, which was close to the supervised result of 96.6%. This comparison showed that our model could be further improved when more valuable state information was available.
In this work we proposed a novel pseudo label method with state information. We found that some proper state information could help address the visual discrepancy caused by those distractive states. Specifically, we investigated the state information in person re-identification and face recognition and found the camera view labels and pose labels to be effective. Our results indicate that it is reasonable to make use of the free state information in unsupervised person re-identification and face recognition. Since the weakly supervised feature drift regularization (WFDR) is a simple, model-free loss term, it can be plugged into methods other than our proposed pseudo label method.
However, we should point out that our method works with the state information that corresponds to the visually distractive states. As for more general state information, it still remains an open problem to effectively utilize it.
Discriminative unsupervised feature learning with convolutional neural networks. In NIPS, 2014.
Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 2010.
Multi-view perceptron: a deep model for learning face identity and view representations. In NIPS, 2014.
We used standard data augmentation (random crop and horizontal flip) during training. We used the spherical feature embedding [53, 35], i.e., we enforced and . To address the gradient saturation in the spherical embedding , we followed the method introduced by  to scale every inner product in Eq. (1) up to 30. We updated every iterations, as we found it not sensitive in a broad range. We maintained a buffer for and as a reference, whereas and were estimated within each batch to obtain the gradient. We updated the buffer with a momentum for each batch where denoted the batch size and denotes the training set size.
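The effect of the spherical embedding and the scaling factor can be illustrated with a minimal NumPy sketch. It assumes that Eq. (1) is a softmax over inner products between features and surrogate class centers; the function names `scaled_cosine_logits` and `softmax` and the toy inputs are hypothetical and are not taken from the released code.

```python
import numpy as np

def scaled_cosine_logits(features, centers, scale=30.0):
    """Project features and centers onto the unit sphere, then scale inner products.

    Assumption: `centers` plays the role of the surrogate class parameters.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    c = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    return scale * (f @ c.T)

def softmax(logits):
    """Numerically stable row-wise softmax."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy example: three orthogonal centers, one feature aligned with the first.
centers = np.eye(3)
features = np.array([[2.0, 0.0, 0.0]])

probs_unscaled = softmax(scaled_cosine_logits(features, centers, scale=1.0))
probs_scaled = softmax(scaled_cosine_logits(features, centers, scale=30.0))
```

On the unit sphere every inner product lies in [-1, 1], so without scaling the softmax stays far from one-hot even for a perfectly aligned feature (`probs_unscaled`), whereas scaling by 30 lets it become confident (`probs_scaled`).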
In this section we first show that the weakly supervised decision boundary rectification (WDBR) is the maximum a posteriori (MAP) optimal estimation of the surrogate label under suitable assumptions. Specifically, if we assume that a surrogate class is modeled by a normal distribution [69, 26, 4] parameterized by a mean vector and covariance matrix , the likelihood that is generated by the -th surrogate class is:
From Eq. (14) we can see that the basic surrogate classification in Eq. (1) in the main manuscript is the Maximum Likelihood Estimation (MLE) of model parameters and . And the assignment
is the MLE optimal assignment. If we further consider the prior information of each surrogate class, i.e., which surrogate classes are more preferable to assign, we can improve the assignment to the Maximum a Posteriori (MAP) optimal assignment:
is the posterior probability. Eq. (16) is identical to the rectified assignment in Eq. (6) in the main manuscript. Hence, we can interpret the weakly supervised rectifier function as a prior probability that specifies our preference on the surrogate class. In particular, when we use the hard rectification, we actually specify that we dislike severely unbalanced surrogate classes. When we use the soft rectification, we specify that we favor the more balanced surrogate classes.
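The difference between the MLE and the MAP assignment can be made concrete with a small sketch. It assumes that the class score is the inner product between a feature and a surrogate class center, and that the rectifier enters as a prior added in log space; the function `assign_surrogate` and the toy values are hypothetical.

```python
import numpy as np

def assign_surrogate(features, centers, prior=None, scale=1.0):
    """Return argmax_k [log prior(k) + scale * <x, mu_k>] for each feature x.

    With prior=None this is the likelihood-only (MLE) assignment.
    Assumption: inner-product scores stand in for the log-likelihood.
    """
    scores = scale * (features @ centers.T)
    if prior is not None:
        prior = np.asarray(prior, dtype=float)
        with np.errstate(divide="ignore"):
            # log(0) = -inf, so a zero prior nullifies the class entirely.
            scores = scores + np.log(prior)
    return scores.argmax(axis=1)

centers = np.array([[1.0, 0.0], [0.0, 1.0]])
x = np.array([[0.8, 0.6]])  # slightly closer to surrogate class 0

mle = assign_surrogate(x, centers)                     # likelihood only
hard = assign_surrogate(x, centers, prior=[0.0, 1.0])  # class 0 nullified
soft = assign_surrogate(x, centers, prior=[0.4, 0.6])  # class 1 preferred
```

In this toy case the likelihood alone picks class 0, while both the hard prior (class 0 nullified) and the soft prior (class 1 mildly preferred) move the assignment to class 1.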
Derivation of the decision boundary. Here we consider the simplest case of two surrogate classes. It is straightforward to extend it to multi-surrogate class cases. From Eq. (16) we can see that the decision boundary between two surrogate classes and is:
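One way to write this boundary out is sketched below. The notation is assumed for illustration and is not recovered from the original equations: $x$ is a unit-norm feature, $\mu_k$ is the unit-norm mean of the $k$-th surrogate class, $p(k)$ is the rectifier viewed as a prior, and all classes share an identity covariance matrix.

```latex
% Assumed notation: x is a unit-norm feature, \mu_k the unit-norm mean of
% surrogate class k, p(k) the rectifier acting as a prior; covariance = identity.
\begin{align*}
p(x \mid k) &\propto \exp\!\left(-\tfrac{1}{2}\lVert x - \mu_k \rVert^2\right)
             = \exp\!\left(x^\top \mu_k - 1\right)
             \propto \exp\!\left(x^\top \mu_k\right), \\
\hat{y} &= \arg\max_k \; p(k)\, p(x \mid k)
         = \arg\max_k \; p(k) \exp\!\left(x^\top \mu_k\right), \\
p(1) \exp\!\left(x^\top \mu_1\right) &= p(2) \exp\!\left(x^\top \mu_2\right)
\;\Longleftrightarrow\;
(\mu_1 - \mu_2)^\top x + \log \frac{p(1)}{p(2)} = 0 .
\end{align*}
```

Under these assumptions the boundary is a hyperplane. With equal priors it bisects the two means, and a larger $p(1)$ enlarges the region assigned to the first class by shifting the boundary toward $\mu_2$.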
To provide insight and guidance on choosing the hyperparameter values, in this section we show evaluation results of the hyperparameters to reveal some behaviors and characteristics of our model. For person re-identification (RE-ID) we evaluated on the widely-used Market-1501 and DukeMTMC-reID datasets, and for pose-invariant face recognition (PIFR) we evaluated on the large-scale Multi-PIE dataset. For easier interpretation and more in-depth analysis, we used the hard rectification function on all datasets. This was because the hard rectification function could be interpreted as nullifying the surrogate classes with a high maximum predominance index, which were thus likely to be dominated by the extrinsic state.
: Number of surrogate classes. In each task (i.e. RE-ID or PIFR), we varied the number of surrogate classes by setting it to a multiple of the precise number of classes in the training set (e.g. for Market-1501 and for Multi-PIE). We show the results in Figure 6. From Figure 6(a) and 6(b) we can see that the optimal performances could be achieved when or . This might be because the dynamic nullification (i.e. hard rectification) reduced the effective in training, so that a larger could also be optimal. From a practical perspective, we might estimate an “upper bound” of and set to it according to some prior knowledge.
: Weight of feature drift regularization. We show the evaluation results in Figure 7. Here we removed the surrogate decision boundary rectification for PIFR to better understand the characteristic of . From Figure 7(a) and 7(b), we found that while the performances on RE-ID were optimal around , for PIFR it was near optimal within the range of .
Hard surrogate rectification/nullification threshold. We show the evaluation results on RE-ID datasets in Figure 8. The performances were optimal when was not too low, e.g. was optimal for both RE-ID datasets. A major reason was that it was difficult to form sufficient surrogate classes when the threshold was too low. To see this, in Figure 8 we also show the number of active (i.e. not nullified) surrogate classes in the final convergence epoch. Clearly, a lack of surrogate classes was harmful to the discriminative learning.
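The interplay between the threshold and the number of active surrogate classes can be sketched as follows. The sketch assumes that the maximum predominance index (MPI) of a surrogate class is the fraction of its members that share its most common state, and that a class is nullified when its MPI exceeds the threshold; the function names, the `<=` comparison and the toy data are hypothetical.

```python
import numpy as np

def max_predominance_index(assignments, states, num_classes):
    """Per surrogate class, the fraction of members sharing the most common state.

    Assumption: this is how the MPI is defined. Empty classes get an MPI of 0.
    """
    mpi = np.zeros(num_classes)
    for k in range(num_classes):
        member_states = states[assignments == k]
        if member_states.size:
            counts = np.bincount(member_states)
            mpi[k] = counts.max() / member_states.size
    return mpi

def hard_rectifier(mpi, threshold):
    """Return 1.0 for active surrogate classes and 0.0 for nullified ones."""
    return (mpi <= threshold).astype(float)

# Toy data: class 0 holds images from a single camera, class 1 mixes two cameras.
assignments = np.array([0, 0, 0, 0, 1, 1, 1, 1])
states = np.array([0, 0, 0, 0, 0, 1, 0, 1])

mpi = max_predominance_index(assignments, states, num_classes=2)
rectifier = hard_rectifier(mpi, threshold=0.9)
num_active = int(rectifier.sum())
```

Lowering the threshold nullifies more classes. In this toy case a threshold below 0.5 would leave no active surrogate class at all, which mirrors the lack of surrogate classes described above.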
Soft surrogate rectification. We show the evaluation results on Multi-PIE in Figure 9. As analyzed in the main manuscript, soft rectification consistently improved over the hard rectification due to the balanced classes on Multi-PIE. The optimal value of the soft rectification threshold was around because on Multi-PIE the five poses were evenly distributed and thus the optimal MPI should be slightly above . From a practical perspective, when we have prior knowledge of the unlabelled data, we might be able to estimate the soft rectification threshold. Nevertheless, even when we do not have reliable prior knowledge, the robust and conservative hard rectification could also be effective.
In this supplementary material we present the evaluation of our method’s robustness to pose label perturbation. This is a simulation of the more challenging real-world PIFR setting, where the pose labels are obtained by pose estimation models, and thus some pose labels might be incorrect. We note that this is not the case for person re-identification (RE-ID), because in RE-ID every image comes from a certain camera view of the surveillance camera network, so that no estimation is involved.
To simulate the pose label noise, we added perturbation to the groundtruth pose labels. We randomly reset some pose labels to incorrect values (e.g. we reset a randomly chosen pose label to 15° when it was actually 60°). The randomly reset pose labels were equally distributed over every degree. For example, when we reset 20% of the pose labels, 20% of the 60° pose labels were reset to incorrect values, 20% of the 45° pose labels were reset to incorrect values, and so forth for the other degrees. We varied the percentage of incorrect labels and show the results in Figure 10.
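The perturbation protocol described above can be sketched as follows. This is a minimal illustration rather than the script used in the experiments: the function name `perturb_state_labels`, the fixed seed and the uniform choice among incorrect values are assumptions, and it requires at least two distinct label values.

```python
import numpy as np

def perturb_state_labels(labels, fraction, seed=0):
    """Reset the same fraction of labels within every state to an incorrect value.

    Assumption: the incorrect value is drawn uniformly from the other states.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    values = np.unique(labels)
    noisy = labels.copy()
    for v in values:
        idx = np.flatnonzero(labels == v)
        n_reset = int(round(fraction * idx.size))
        # Pick which samples of this state get corrupted, without repetition.
        chosen = rng.choice(idx, size=n_reset, replace=False)
        # Draw replacements only from the other states, so every reset is wrong.
        wrong_values = values[values != v]
        noisy[chosen] = rng.choice(wrong_values, size=n_reset)
    return noisy

# Toy example: five poses (in degrees) with ten images each, 20% reset.
pose_labels = np.repeat([0, 15, 30, 45, 60], 10)
noisy_labels = perturb_state_labels(pose_labels, fraction=0.2)
```

Because the resets are drawn per pose, each pose loses exactly the requested share of correct labels, which matches the equal distribution over every degree.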
From Figure 10 we observed that the performance on PIFR did not drop significantly until fewer than 60% of the pose labels were correct. This observation indicated that our model could tolerate a moderate extent of pose label noise. A major reason was that when a few pose labels were incorrect, a highly affected surrogate class (whose members were mostly of the same pose) would still have a Maximum Predominance Index high enough for it to be nullified. In addition, when most pose labels were correct, the estimated manifestation sub-distributions should approximate the correct manifestation sub-distributions. Therefore, our model should be robust for the unsupervised PIFR task when a few pose labels were perturbed.