In real-world applications, e.g. law enforcement and video retrieval, one often needs to search for a certain person in long videos with just one portrait. This is much more challenging than the conventional settings of person re-identification, as the search may need to be carried out in environments different from where the portrait was taken. In this paper, we aim to tackle this challenge and propose a novel framework, which takes into account the identity invariance along a tracklet, thus allowing person identities to be propagated via both the visual and the temporal links. We also develop a novel scheme called Progressive Propagation via Competitive Consensus, which significantly improves the reliability of the propagation process. To promote the study of person search, we construct a large-scale benchmark, which contains 127K manually annotated tracklets from 192 movies. Experiments show that our approach remarkably outperforms mainstream person re-id methods, raising the mAP substantially from the 42.16% achieved by the best of them.
Searching for persons in videos is frequently needed in real-world scenarios. To catch a wanted criminal, the police may have to go through thousands of hours of videos collected from multiple surveillance cameras, probably with just a single portrait. To find the movie shots featuring a popular star, a retrieval system has to examine many hour-long films, with just a few facial photos as references. In applications like these, the reference photos are often taken in an environment that is very different from the target environments where the search is conducted. As illustrated in Figure 1, such settings are very challenging. Even state-of-the-art recognition techniques would find it difficult to reliably identify all occurrences of a person in the face of dramatic variations in pose, makeup, clothing, and illumination.
It is noteworthy that two related problems, namely person re-identification (re-id) and person recognition in albums, have drawn increasing attention from the research community. However, they are substantially different from the problem of person search with one portrait, which we aim to tackle in this work. Specifically, in typical settings of person re-id [44, 22, 38, 45, 13, 8, 16], the queries and the references in the gallery set are usually captured under similar conditions, from different cameras along a street, and within a short duration. Even though some queries can be subject to issues like occlusion and pose changes, they can still be identified via other visual cues, e.g. clothing. For person recognition in albums, one is typically given a diverse collection of gallery samples, which may cover a wide range of conditions and therefore can be directly matched to various queries. Hence, for both problems, the references in the gallery are often good representatives of the targets, and therefore methods based on visual cues can perform reasonably well [22, 1, 4, 3, 39, 44, 43, 15, 14]. On the contrary, our task is to bridge a single portrait with a highly diverse set of samples, which is much more challenging and requires new techniques that go beyond visual matching.
To tackle this problem, we propose a new framework that propagates labels through both visual and temporal links. The basic idea is to take advantage of the identity invariance along a person trajectory: all person instances along a continuous trajectory in a video should belong to the same identity. The connections induced by tracklets, which we refer to as the temporal links, are complementary to the visual links based on feature similarity. For example, a trajectory can sometimes cover a wide range of facial images that cannot be easily associated based on visual similarity. With both visual and temporal links incorporated, our framework can form a large connected graph, thus allowing identity information to be propagated over a very diverse collection of instances.
While the combination of visual and temporal links provides a broad foundation for identity propagation, carrying out the propagation reliably over a large real-world dataset remains very challenging. As we begin with only a single portrait, a few wrong labels during propagation can result in catastrophic errors downstream. Indeed, our empirical study shows that conventional schemes like linear diffusion [47, 46] even lead to substantially worse results. To address this issue, we develop a novel scheme called Progressive Propagation via Competitive Consensus, which performs the propagation prudently, spreading a piece of identity information only when there is high certainty.
To facilitate the research on this problem setting, we construct a dataset named Cast Search in Movies (CSM), which contains tracklets of cast identities from movies. The identities of all the tracklets are manually annotated, and each cast identity comes with a reference portrait. The benchmark is very challenging, as the person instances of each identity vary significantly in makeup, pose, clothing, illumination, and even age. On this benchmark, our approach achieves substantially higher mAP than the conventional visual-matching method under both test settings, which shows that matching by visual cues alone cannot solve this problem well, while our proposed framework, Progressive Propagation via Competitive Consensus, significantly raises the performance.
In summary, the main contributions of this work lie in four aspects: (1) We systematically study the problem of person search in videos, which often arises in real-world practice, but remains widely open in research. (2) We propose a framework, which incorporates both the visual similarity and the identity invariance along a tracklet, thus allowing the search to be carried out much further. (3) We develop the Progressive Propagation via Competitive Consensus scheme, which significantly improves the reliability of propagation. (4) We construct a dataset Cast Search in Movies (CSM) with manually annotated tracklets to promote the study on this problem.
Person Re-id. Person re-id [41, 6, 7], which aims to match pedestrian images (or tracklets) from different cameras within a short period, has drawn much attention in the research community. Many datasets [44, 22, 38, 45, 13, 8, 16] have been proposed to promote re-id research. However, their videos are captured by only a few cameras at nearby locations within a short period. For example, the Airport dataset is captured in an airport from 8 a.m. to 8 p.m. in one day. So instances of the same identity are usually similar enough to be identified by visual appearance, despite occlusion and pose changes. Based on this characteristic of the data, most re-id methods focus on how to match a query and a gallery instance by visual cues. In early works, the matching process is split into feature design [11, 9, 26, 27] and metric learning [28, 17, 23]. Recently, many deep learning based methods have been proposed to handle the matching problem jointly. Li et al. and Ahmed et al. designed siamese networks that employ a binary verification loss to train the parameters. Ding et al. and Cheng et al. exploit a triplet loss to train more discriminative features. Xiao et al. and Zheng et al. proposed to learn features by classifying identities. Although the feature learning methods of re-id can be adopted for the Person Search with One Portrait problem, the two tasks are substantially different: in person search, the query and the gallery have a huge visual appearance gap, which makes one-to-one matching fail.
Person Recognition in Photo Album. Person recognition [24, 43, 15, 19, 14] is another related problem, which usually focuses on persons in photo albums. It aims to recognize the identities of the queries given a set of labeled persons in the gallery. Zhang et al. proposed a Pose Invariant Person Recognition method (PIPER), which combines three types of visual recognizers based on ConvNets, respectively on face, full-body, and poselet-level cues. The PIPA dataset has been widely adopted as a standard benchmark to evaluate person recognition methods. Oh et al. evaluated the effectiveness of different body regions, and used a weighted combination of the scores obtained from different regions for recognition. Li et al. proposed a multi-level contextual model, which integrates person-level, photo-level, and group-level contexts. However, person recognition is also quite different from the person search problem we aim to tackle in this paper, since the samples of the same identity in the query and the gallery are still similar in visual appearance, and the methods mostly focus on recognition via visual cues and context.
Person Search. Some works focus on the person search problem. Xiao et al. proposed a person search task that aims to find the corresponding instances in gallery images without bounding box annotations. The associated data is similar to that in re-id; the key difference is that the bounding boxes are unavailable, so the task can be seen as a combination of pedestrian detection and person re-id. Other works search for persons with different modalities of data, such as language-based and attribute-based [35, 5] search, which target application scenarios different from the portrait-based problem we aim to tackle in this paper.
Label Propagation. Label propagation (LP) is widely used as a semi-supervised learning method. It relies on the idea of building a graph in which the nodes are data points (labeled and unlabeled) and the edges represent similarities between points, so that labels can propagate from labeled points to unlabeled points. Various LP-based approaches have been proposed in the computer vision community, for face recognition [18, 48], semantic segmentation, object detection, and saliency detection. In this paper, we develop a novel LP-based approach called Progressive Propagation via Competitive Consensus, which differs from conventional LP in two respects: (1) it propagates by competitive consensus rather than linear diffusion, and (2) it iterates in a progressive manner.
While there have been a number of public datasets for person re-id [44, 22, 38, 45, 13, 8, 16] and album-based person recognition, a dataset for our task, namely person search with a single portrait, remains lacking. In this work, we construct a large-scale dataset, Cast Search in Movies (CSM), for this task. CSM comprises a query set that contains the portraits of the cast (the actors and actresses) and a gallery set that contains 127K tracklets extracted from 192 movies.
We compare CSM with other datasets for person re-id and person recognition in Table 1. CSM is significantly larger, with several times more tracklets and instances than MARS, which is to our knowledge the largest dataset for person re-id. Moreover, CSM covers a much wider range of tracklet durations and instance sizes. Figure 2 shows several example tracklets together with their corresponding portraits, which are very diverse in pose, illumination, and clothing. It can be seen that the task is very challenging.
For each movie in CSM, we acquired the cast list from IMDB. For movies with many listed cast members, we keep only the top ones according to the IMDB order, which covers the main characters of most movies; we refer to them as the credited cast. For each credited cast member, we downloaded a portrait from either their IMDB or TMDB homepage, which serves as the query portrait in CSM.
We obtained the tracklets in the gallery set through the following steps:
Annotating bounding boxes on keyframes. We manually annotated the person bounding boxes on sampled keyframes.
Training a person detector. We trained a person detector with the annotated bounding boxes. Specifically, all the keyframes are partitioned into a training set and a testing set. We then finetuned a Faster-RCNN pre-trained on MSCOCO on the training set. On the testing set, the detector reaches a high mAP, which is good enough for tracklet generation.
Generating tracklets. With the person detector described above, we performed per-frame person detection over all the frames. By linking the detected bounding boxes across consecutive frames within each shot, we obtained the tracklets of the movies.
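The linking step above can be sketched as a simple greedy, IoU-based association within each shot. This is a minimal illustration, assuming boxes in (x1, y1, x2, y2) format and an IoU threshold of 0.5; the exact linking rule used for CSM is not specified in the text.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def link_tracklets(frames, thr=0.5):
    """Greedily link per-frame detections into tracklets within one shot.

    frames: list (one entry per frame) of lists of boxes.
    Returns a list of tracklets, each a list of (frame_idx, box).
    """
    active, finished = [], []
    for t, boxes in enumerate(frames):
        used = [False] * len(boxes)
        still_active = []
        for trk in active:
            last_box = trk[-1][1]
            # match the track to the best-overlapping unused detection
            best, best_iou = -1, thr
            for j, box in enumerate(boxes):
                if not used[j] and iou(last_box, box) >= best_iou:
                    best, best_iou = j, iou(last_box, box)
            if best >= 0:
                used[best] = True
                trk.append((t, boxes[best]))
                still_active.append(trk)
            else:
                finished.append(trk)  # track ends when no box overlaps
        # unmatched detections start new tracklets
        still_active += [[(t, boxes[j])] for j in range(len(boxes)) if not used[j]]
        active = still_active
    return finished + active
```

A detection that overlaps no active track starts a new tracklet, so a shot with two well-separated people yields two tracklets.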
Annotating identities. Finally, we manually annotated the identities of all the tracklets. Particularly, each tracklet is annotated as one of the credited cast or as “others”. Note that the identities of the tracklets in each movie are annotated independently to ensure high annotation quality with a reasonable budget. Hence, being labeled as “others” means that the tracklet does not belong to any credited cast of the corresponding movie.
In this work, we aim to develop a method to find all the occurrences of a person in a long video, e.g. a movie, with just a single portrait. The challenge of this task lies in the vast gap in visual appearance between the portrait (query) and the candidates in the gallery.
Our basic idea is to leverage the inherent identity invariance along a person tracklet and to propagate identities among instances via both visual and temporal links. The two types of links are complementary: using both allows identities to be propagated much further than using either type alone. However, propagating reliably over a large, diverse, and noisy dataset remains very challenging, considering that we begin with only a small number of labeled samples (the portraits). The key to overcoming this difficulty is to be prudent, propagating only the information we are certain about. To this end, we propose a new propagation framework called Progressive Propagation via Competitive Consensus, which can effectively identify confident labels in a competitive way.
The propagation is carried out over a graph of person instances. Specifically, the propagation graph is constructed as follows. Suppose there are $N_c$ cast in the query set and $N_t$ tracklets in the gallery set, and the length of the $i$-th tracklet (denoted by $T_i$) is $n_i$, i.e. it contains $n_i$ instances. The cast portraits and all the instances along the tracklets are treated as graph nodes. Hence, the graph contains $N_c + \sum_{i=1}^{N_t} n_i$ nodes. In particular, the identities of the cast portraits are known, and the corresponding nodes are referred to as labeled nodes, while the other nodes are called unlabeled nodes.
The propagation framework aims to propagate the identities from the labeled nodes to the unlabeled nodes through both the visual and the temporal links between them. The visual links are based on feature similarity. For each instance (say the $j$-th), we can extract a feature vector, denoted as $\mathbf{f}_j$. Each visual link is associated with an affinity value: the affinity between two instances $j$ and $k$ is defined to be their cosine similarity, $a_{j,k} = \cos(\mathbf{f}_j, \mathbf{f}_k)$. Generally, a higher affinity value indicates that $j$ and $k$ are more likely to be from the same identity. The temporal links capture the identity invariance along a tracklet: all instances along a tracklet should share the same identity. In this framework, we treat the identity invariance as a hard constraint, which is enforced via a competitive consensus mechanism.
For two tracklets with lengths $n_i$ and $n_j$, there can be $n_i \times n_j$ links between their nodes. Among all these links, the strongest one, i.e. the one between the most similar pair, best reflects the visual similarity. Hence, we keep only one strongest link for each pair of tracklets, as shown in Figure 4, which makes the propagation more reliable and efficient. Also, thanks to the temporal links, this reduction does not compromise the connectivity of the whole graph.
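The strongest-link reduction can be sketched as follows: scan all cross-tracklet instance pairs and keep only the most similar one. The feature vectors below are toy values and the helper names are illustrative, not the authors' implementation.

```python
import numpy as np

def cosine_affinity(f1, f2):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))

def strongest_link(tracklet_a, tracklet_b):
    """Keep only the strongest visual link between two tracklets.

    tracklet_a, tracklet_b: sequences of per-instance feature vectors.
    Returns (index_in_a, index_in_b, affinity) of the most similar pair.
    """
    best = (-1, -1, -1.0)
    for i, fa in enumerate(tracklet_a):
        for j, fb in enumerate(tracklet_b):
            a = cosine_affinity(fa, fb)
            if a > best[2]:
                best = (i, j, a)
    return best
```

Running this over every tracklet pair leaves at most one visual edge per pair, which is what keeps the graph sparse.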
As illustrated in Figure 4, the visual and temporal links are complementary. The former allows the identity information to be propagated among those instances that are similar in appearance, while the latter allows the propagation along a continuous trajectory, in which the instances can look significantly different. With only visual links, we can obtain clusters in the feature space. With only temporal links, we only have isolated tracklets. However, with both types of links incorporated, we can construct a more connected graph, which allows the identities to be propagated much further.
Each node of the graph is associated with a probability vector over the identities, which is iteratively updated as the propagation proceeds. To begin with, we set the probability vector of each labeled node to a one-hot vector indicating its label, and initialize all others to zero vectors. Due to the identity invariance along tracklets, we enforce all nodes along the $i$-th tracklet to share the same probability vector, denoted by $\mathbf{q}_i$. At each iteration, we traverse all tracklets and update their associated probability vectors one by one.
Linear diffusion is the most widely used propagation scheme, where a node updates its probability vector by taking a linear combination of those of its neighbors. In our setting with identity invariance, the linear diffusion scheme can be expressed as follows:

$$\mathbf{q}_i \leftarrow \sum_{v \in \mathcal{N}(T_i)} \hat{a}_{i,v}\, \mathbf{p}_v, \qquad \hat{a}_{i,v} = \frac{a_{i,v}}{\sum_{v' \in \mathcal{N}(T_i)} a_{i,v'}}. \qquad (1)$$

Here, $\mathcal{N}(T_i)$ is the set of all visual neighbors of the instances in the tracklet $T_i$, $\mathbf{p}_v$ is the probability vector of the neighbor node $v$, and $a_{i,v}$ is the affinity of $v$ to the tracklet $T_i$. Due to the constraint that there is only one visual link between two tracklets (see Sec. 4.1), each neighbor is connected to just one of the nodes in $T_i$, and $a_{i,v}$ is set to the affinity between the neighbor and that node.
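As a concrete sketch (with made-up neighbor probabilities and affinities), the linear-diffusion update is just a normalized weighted average of the neighbors' probability vectors:

```python
import numpy as np

def linear_diffusion_update(neighbor_probs, affinities):
    """Linear diffusion: q_i <- sum_v a_hat[v] * p_v, with a_hat normalized.

    neighbor_probs: (num_neighbors, num_classes) probability vectors p_v.
    affinities: (num_neighbors,) visual affinities of the neighbors.
    """
    a = np.asarray(affinities, dtype=float)
    a_hat = a / a.sum()  # normalized coefficients
    return a_hat @ np.asarray(neighbor_probs, dtype=float)
```

Note that every neighbor contributes to the update, so noisy neighbors mix their errors into the result, which is exactly the failure mode discussed next.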
However, we found that the linear diffusion scheme yields poor performance in our experiments, even far worse than the naive visual matching method. An important reason for the poor performance is that errors will be mixed into the updated probability vector and then propagated to other nodes. This can cause catastrophic errors downstream, especially in a real-world dataset that is filled with noise and challenging cases.
To tackle this problem, it is crucial to improve the reliability and propagate the most confident information only. Particularly, we should only trust those neighbors that provide strong evidence instead of simply taking the weighted average of all neighbors. Following this intuition, we develop a novel scheme called competitive consensus.
When updating $\mathbf{q}_i$, the probability vector for the tracklet $T_i$, we first collect the strongest evidence supporting each identity $c$ from all the neighbors in $\mathcal{N}(T_i)$, as

$$e_i(c) = \max_{v \in \mathcal{N}(T_i)} \hat{a}_{i,v}\, p_v(c), \qquad (2)$$

where the normalized coefficient $\hat{a}_{i,v}$ is defined in Eq. (1). Intuitively, an identity $c$ is strongly supported for $T_i$ if one of its neighbors assigns a high probability to it. Next, we turn the evidences for individual identities into a probability vector via a tempered softmax function as

$$q_i(c) = \frac{\exp(e_i(c)/T)}{\sum_{c'} \exp(e_i(c')/T)}. \qquad (3)$$

Here, $T$ is a temperature that controls how much the probabilities concentrate on the strongest identity. In this scheme, all identities compete for high probability values in $\mathbf{q}_i$ by collecting the strongest supports from their neighbors. This allows the strongest identity to stand out.
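A minimal sketch of this update: per-class max evidence over the neighbors, followed by a tempered softmax. The default temperature here is an assumption for illustration, not the value used in the paper.

```python
import numpy as np

def competitive_consensus_update(neighbor_probs, affinities, temperature=0.1):
    """Competitive consensus: per-class max evidence, then tempered softmax.

    neighbor_probs: (num_neighbors, num_classes) probability vectors p_v.
    affinities: (num_neighbors,) visual affinities of the neighbors.
    """
    p = np.asarray(neighbor_probs, dtype=float)
    a = np.asarray(affinities, dtype=float)
    a_hat = a / a.sum()
    evidence = (a_hat[:, None] * p).max(axis=0)  # strongest support per class
    logits = evidence / temperature              # tempered softmax
    logits -= logits.max()                       # numerical stability
    e = np.exp(logits)
    return e / e.sum()
```

With a small temperature, the output concentrates on the class with the strongest single-neighbor support, which is what lets the strongest identity stand out.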
Competitive consensus can be considered as a coordinate ascent method to solve

$$\max_{\{\mathbf{q}_i\},\, \mathbf{z}} \; \sum_i \sum_c q_i(c) \sum_{v \in \mathcal{N}(T_i)} z_{i,v}(c)\, \hat{a}_{i,v}\, p_v(c) \;+\; T \sum_i H(\mathbf{q}_i), \quad \text{s.t.} \; \sum_{v \in \mathcal{N}(T_i)} z_{i,v}(c) = 1, \; z_{i,v}(c) \in \{0, 1\}, \qquad (4)$$

where we introduce a binary variable $z_{i,v}(c)$ to indicate whether the neighbor $v$ is a trustable source of the class $c$ for the $i$-th tracklet. Here, $H$ is the entropy. The constraint means that one trustable source is selected for each class $c$ and tracklet $T_i$: fixing $\mathbf{q}$, the optimal $\mathbf{z}$ selects the neighbor with the strongest support; fixing $\mathbf{z}$, the optimal $\mathbf{q}_i$ is the tempered softmax of the collected evidence.
Figure 5 illustrates how linear diffusion and our Competitive Consensus work. Experiments on CSM also show that Competitive Consensus significantly improves performance on the person search problem.
In conventional label propagation, the labels of all the nodes are updated until convergence. This can be prohibitively expensive when the graph contains a large number of nodes. For the person search problem, it is also unnecessary: once we are very confident about the identity of a certain instance, we do not have to keep updating it.
Motivated by this, we propose a progressive propagation scheme to accelerate the propagation process. At each iteration, we fix the labels of a certain fraction of nodes that have the highest confidence, where the confidence is defined as the maximum value of the node's probability vector. We found empirically that a simple freezing schedule, which adds a fixed fraction of the instances to the label-frozen set at each iteration, can already bring notable benefits to the propagation process.
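A minimal sketch of the freezing step, assuming a fixed per-iteration freezing ratio; the function and schedule here are illustrative, not the authors' exact implementation.

```python
import numpy as np

def progressive_freeze(probs, frozen, ratio):
    """Freeze a fixed fraction of the most confident unfrozen tracklets.

    probs: (num_tracklets, num_classes) current probability vectors.
    frozen: boolean mask of already-frozen tracklets (updated in place).
    Confidence of a tracklet is the max value of its probability vector.
    """
    conf = probs.max(axis=1)
    candidates = np.where(~frozen)[0]
    k = max(1, int(round(ratio * len(probs))))
    # freeze the most confident not-yet-frozen tracklets
    order = candidates[np.argsort(-conf[candidates])]
    frozen[order[:k]] = True
    return frozen
```

Calling this once per iteration shrinks the set of nodes that still receive updates, which both reduces cost and stops uncertain nodes from perturbing confident ones.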
Note that the progressive scheme not only reduces the computational cost but also improves the propagation accuracy. The reason is that, without freezing, noisy and uncertain nodes keep affecting all the other nodes, which can sometimes cause additional errors. Experiments in Sec. 5.3 show more details.
The movies in CSM are partitioned into training (train), validation (val) and testing (test) sets. Statistics of these sets are shown in Table 3. Note that we make sure there is no overlap between the cast of different sets, i.e. the cast in the testing set do not appear in the training and validation sets. This ensures the reliability of the testing results.
Under the Person Search with One Portrait setting, one should rank all the tracklets in the gallery given a query. For this task, we use mean Average Precision (mAP) as the evaluation metric. We also report the recall of the tracklet identification results in terms of R@k. Here, we rank the identities for each tracklet according to their probabilities; R@k is the fraction of tracklets for which the correct identity is listed within the top k results.
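R@k can be computed as a simple hit-rate over the per-tracklet identity rankings; a minimal sketch with assumed input formats:

```python
def recall_at_k(ranked_ids, true_ids, k):
    """R@k: fraction of tracklets whose true identity is within the top-k list.

    ranked_ids: per-tracklet lists of identities, sorted by predicted probability.
    true_ids: the ground-truth identity of each tracklet.
    """
    hits = sum(1 for ranks, gt in zip(ranked_ids, true_ids) if gt in ranks[:k])
    return hits / len(true_ids)
```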
We consider two test settings in the CSM benchmark, named “search cast in a movie” (IN) and “search cast across all movies” (ACROSS). In the “IN” setting, the gallery consists of just the tracklets from one movie, including the tracklets of the credited cast and those of “others”. In the “ACROSS” setting, the gallery comprises all the tracklets of the credited cast in the testing set. We exclude the tracklets of “others” in the “ACROSS” setting because, as mentioned in Sec. 3, “others” only means that a tracklet does not belong to any of the credited cast of a particular movie, rather than of all the movies in the dataset. Table 3 shows the query/gallery sizes of each setting.
We use two kinds of visual features in our experiments. The first is the IDE feature widely used in person re-id. The IDE descriptor is a CNN feature of the whole person instance, extracted by a ResNet-50 that is pre-trained on ImageNet and finetuned on the training set of CSM. The second is the face feature, extracted by a ResNet-101 trained on MS-Celeb-1M. For each instance, we extract its IDE feature, together with the face feature of the face region localized by a face detector. All the visual similarities in the experiments are computed as the cosine similarity between the visual features.
We set up four baselines for comparison: (1) FACE: matching the portrait with each tracklet in the gallery by face feature similarity, where the mean feature of all the instances in a tracklet is used to represent it. (2) IDE: similar to FACE, except that the IDE features are used rather than the face features. (3) IDE+FACE: combining the face similarity and the IDE similarity for matching with fixed weights. (4) LP: conventional label propagation with linear diffusion over both visual and temporal links. Specifically, we use the face similarity as the visual links between portraits and candidates, and the IDE similarity as the visual links between different candidates. We also evaluate two settings of the proposed Progressive Propagation via Competitive Consensus method: (5) PPCC-v: using only visual links. (6) PPCC-vt: the full configuration with both visual and temporal links.
From the results in Table 4, we can see that: (1) Even with a very powerful CNN trained on a large-scale dataset, matching portraits and candidates by visual cues cannot solve the person search problem well, due to the large gap in visual appearance between the portraits and the candidates. Although face features are generally more stable than IDE features, they fail when the faces are invisible, which is very common in real-world videos like movies. (2) Label propagation with linear diffusion gets very poor results, even worse than the matching-based methods. (3) Our approach raises the performance by a considerable margin, and the gain is especially remarkable in the more challenging “ACROSS” setting.
Competitive Consensus. To show the effectiveness of Competitive Consensus, we study different settings of the scheme in two aspects: (1) The max operation used to collect the evidence can be relaxed to a top-$k$ average, where $k$ indicates the number of neighbors to receive information from; when $k = 1$, it reduces to taking only the maximum, which is what we use in PPCC. Performances obtained with different $k$ are shown in Fig. 6. (2) We also study the temperature of the tempered softmax and compare the results under different values, also shown in Fig. 6. Clearly, using a smaller softmax temperature significantly boosts the performance. This study supports what we claimed when designing Competitive Consensus: we should propagate only the most confident information in this task.
Progressive Propagation. Here we compare our progressive updating scheme with the conventional scheme that updates all the nodes at each iteration. For progressive propagation, we try two freezing mechanisms: (1) the step scheme, where a freezing ratio is set for each iteration and raised step by step; (2) the threshold scheme, where at each iteration we freeze the nodes whose maximum probability for a particular identity exceeds a preset threshold. The results are shown in Table 5, from which we can see the effectiveness of the progressive scheme.
We show some samples that are correctly identified in different iterations in Fig. 7. The easy cases, which usually contain clear frontal faces, are identified at the beginning. After iterative propagation, the information reaches harder samples. By the end of the propagation, even some very hard samples, which are non-frontal, blurred, occluded, or under extreme illumination, are assigned the correct identity.
In this paper, we studied a new problem named Person Search in Videos with One Portrait, which is challenging but practical in the real world. To promote research on this problem, we constructed a large-scale dataset, CSM, which contains 127K manually annotated tracklets of cast from 192 movies. To tackle the problem, we proposed a new framework that incorporates both visual and temporal links for identity propagation, with a novel Progressive Propagation via Competitive Consensus scheme. Both quantitative and qualitative studies show the challenges of the problem and the effectiveness of our approach.
This work is partially supported by the Big Data Collaboration Research grant from SenseTime Group (CUHK Agreement No. TS1610626), the General Research Fund (GRF) of Hong Kong (No. 14236516).
Ahmed, E., Jones, M., Marks, T.K.: An improved deep learning architecture for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3908–3916 (2015)
Cheng, D., Gong, Y., Zhou, S., Wang, J., Zheng, N.: Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1335–1344 (2016)
Ding, S., Lin, L., Wang, G., Chao, H.: Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition 48(10), 2993–3003 (2015)
Li, W., Zhao, R., Xiao, T., Wang, X.: Deepreid: Deep filter pairing neural network for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 152–159 (2014)
Rohrbach, M., Ebert, S., Schiele, B.: Transfer learning in a transductive setting. In: Advances in neural information processing systems. pp. 46–54 (2013)
Wang, J., Jebara, T., Chang, S.F.: Graph transduction via alternating minimization. In: Proceedings of the 25th international conference on Machine learning. pp. 1144–1151. ACM (2008)