1 Introduction
The problem of identifying face images in video and clustering them together by identity is a natural precursor to high-impact applications such as video understanding and analysis. This general problem area was popularized by the paper "Hello! My name is... Buffy" [9], which used text captions and face analysis to name people in each frame of a full-length video. In this work, we use only raw video (with no captions), and group faces by identity rather than naming the characters. In addition, unlike face clustering methods that start with detected faces, we include detection as part of the problem. This means we must deal with false positives and false negatives, both algorithmically and in our evaluation method. We make three contributions:

A new approach to combining high-quality face detection [15] and generic tracking [31]
to improve both precision and recall of our video face detection.

A new method, Erdős–Rényi clustering, for large-scale clustering of images and video tracklets. We argue that effective large-scale face clustering requires face verification with fewer false positives, and we introduce rank-1 counts verification, showing that it indeed achieves better true positive rates in low false positive regimes. Rank-1 counts verification, used with simple link-based clustering, achieves high-quality clustering results on three separate video data sets.

A principled evaluation for the end-to-end problem of face detection and clustering in videos. Until now, there has been no clear way to evaluate the quality of such an end-to-end system, only ways to evaluate its individual parts (detection and clustering).
We structure the paper as follows. In Section 2 we discuss related work. In Section 3, we describe the first phase of our system, in which we use a face detector and generic tracker to extract face tracklets. In Section 4, we introduce Erdős–Rényi clustering and rank-1 counts verification. Sections 5 and 6 present experiments and discussions.
2 Related Work
In this section, we first discuss face tracking and then the problem of naming TV (or movie) characters. We divide the character-naming work into two categories: fully unsupervised methods and methods with some supervision. We then discuss prior work using reference images. Related work on clustering is covered in Section 5.2.
Recent work on robust face tracking [36, 29, 24] has gradually expanded the length of face tracklets, starting from face detection results. Ozerov et al. [24] merge results from different detectors by clustering based on spatio-temporal similarity. Clusters are then merged, interpolated, and smoothed to create face tracklets. Similarly, Roth et al. [29] generate low-level tracklets by merging detection results, form high-level tracklets by linking low-level tracklets, and apply the Hungarian algorithm to form even longer tracklets. Tapaswi et al. [36] improve on this [29] by removing false positive tracklets.

With the development of multi-face tracking techniques, the problem of naming TV characters has also been widely studied [35, 13, 9, 2, 39, 40, 37]. (A related problem is person re-identification [44, 18, 6], in which the goal is to tell whether a person of interest seen in one camera has been observed by another camera. Re-identification typically uses the whole body on short time scales, while naming TV characters focuses on faces, but over a longer period of time.) Given precomputed face tracklets, the goal is to assign a name or an ID to a group of face tracklets with the same identity. Wu et al. [39, 40] iteratively cluster face tracklets and link clusters into longer tracks in a bootstrapping manner. Tapaswi et al. [37] train classifiers to find thresholds for joining tracklets in two stages: within a scene and across scenes. Like these works, we aim to generate face clusters in a fully unsupervised manner.
Though solving this problem may yield a better result for face tracking, some forms of supervision specific to the video or characters in the test data can improve performance. Tapaswi et al. [35] perform face recognition, clothing clustering and speaker identification, where face models and speaker models are first trained on other videos containing the same main characters as in the test set. In [9, 2], subtitles and transcripts are used to obtain weak labels for face tracks. More recently, Haurilet et al. [13] solve the problem without transcripts by resolving name references only in subtitles. Our approach is more broadly applicable because it does not use subtitles, transcripts, or any other supervision related to the identities in the test data, unlike these other works [35, 13, 9, 2].
As in the proposed verification system, some existing work [4, 12] uses reference images. For example, index code methods [12] map each single image to a code based on a set of reference images, and then compare these codes. Our method is different: it compares the relative distance between two images with the distance of one of the images to the reference set. In addition, we use the newly defined rank-1 counts, rather than the traditional Euclidean or Mahalanobis distance measures [4, 12], to compare images.

3 Detection and Tracking
Our goal is to take raw videos, with no captions or annotations, and to detect all faces and cluster them by identity. We start by describing our method for generating face tracklets, or continuous sequences of the same face across video frames. We wish to generate clean face tracklets that contain face detections from just a single identity. Ideally, exactly one tracklet should be generated for an identity from the moment his or her face appears in a shot until the moment it disappears or is completely occluded.
To achieve this, we first detect faces in each video frame using the Faster R-CNN object detector [28], retrained on the WIDER face data set [41], as described by Jiang et al. [15]. Even with this advanced detector, face detection sometimes fails under challenging illumination or pose. In videos, such faces can be detected before or after the challenging circumstances by a tracker that tracks both forward and backward in time. We use the distribution field tracker [31], a general object tracker that is not trained specifically for faces. Unlike a face detector, the tracker's goal is to find in the next frame the object most similar to the target in the current frame. The extra faces found by the tracker compensate for missed detections (Fig. 2, bottom of block 2). Tracking helps not only to catch false negatives, but also to link faces of the same identity across frames.
One simple approach to combining a detector and tracker is to run the tracker forward and backward in time from every single face detection for some fixed number of frames, producing a large number of "mini-tracks". A Viterbi-style algorithm [10, 5] can then combine these mini-tracks into longer sequences. This approach is computationally expensive, since the tracker is run many times on overlapping subsequences, producing heavily redundant mini-tracks. To improve performance, we developed the following novel method for combining a detector and tracker. Happily, it also improves precision and recall, since it takes advantage of the tracker's ability to form long face tracks of a single identity.
The method starts by running the face detector in each frame. When a face is first detected, a tracker is initialized with that face. In subsequent frames, faces are again detected. In addition, we examine each current tracklet to see where it might be extended by the tracking algorithm in the current frame. We then check the agreement between detection and tracking results. We use the intersection over union (IoU) between detections and tracking results with a threshold of 0.3, and apply the Hungarian algorithm [16] to establish correspondences among multiple matches. If a detection matches a tracking result, the detection is stored in the current face sequence, so that the tracker can search the next frame starting from the detection result. For each detection that has no matched tracking result, a new tracklet is initiated. If a tracking result has no associated detection, then either a) the tracker could not find an appropriate area in the current frame, or b) the tracking result is correct and the detector failed to find the face. The algorithm postpones its decision about the tracked region for a fixed number of subsequent frames. If the face sequence matches a detection within that window, the algorithm keeps the tracking results; otherwise, it removes the tracking-only results. The second block of Fig. 2 summarizes our proposed face tracklet generation algorithm and shows examples corrected by our joint detection-tracking strategy. Next, we describe our approach to clustering based on low false positive verification.
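As an illustration, the per-frame association step can be sketched as follows. This is our own sketch, not the authors' code: it uses a greedy best-first matching as a simple stand-in for the Hungarian assignment the paper uses, with the same IoU threshold of 0.3; all function and variable names are ours.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def match_detections_to_tracks(dets, tracks, iou_thresh=0.3):
    """Greedy best-first IoU matching between this frame's detections and
    the tracker's predicted boxes. Returns matched (det, track) index pairs,
    detections that should start new tracklets, and tracks whose fate is
    postponed (no matching detection this frame)."""
    pairs = sorted(
        ((iou(d, t), i, j) for i, d in enumerate(dets) for j, t in enumerate(tracks)),
        reverse=True,
    )
    used_d, used_t, matches = set(), set(), []
    for score, i, j in pairs:
        if score < iou_thresh:
            break  # remaining pairs overlap too little to be the same face
        if i in used_d or j in used_t:
            continue  # one-to-one assignment
        used_d.add(i)
        used_t.add(j)
        matches.append((i, j))
    new_tracklets = [i for i in range(len(dets)) if i not in used_d]
    pending_tracks = [j for j in range(len(tracks)) if j not in used_t]
    return matches, new_tracklets, pending_tracks
```

In a full system, `pending_tracks` entries would be kept alive for a few frames and discarded if no detection confirms them, as described above.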
4 Erdős–Rényi Clustering and Rank-1 Counts Verification
In this section, we describe our approach to clustering face images, or, in the case of videos, face tracklets. We adopt the basic paradigm of linkage clustering, in which each pair of points (images or tracklets) is evaluated for linking, and clusters are then formed among all points connected by linked pairs. We name our general approach Erdős–Rényi clustering, since it is inspired by classic results in graph theory due to Erdős and Rényi [7], as described next.
Consider a graph with n vertices and probability p of each possible edge being present. This is the Erdős–Rényi random graph model [7]. The expected number of edges is p · n(n−1)/2. One of the central results of this work is that, for ε > 0 and n sufficiently large, if

p > (1 + ε) ln n / n,   (1)

then the graph will almost surely be connected (there exists a path from each vertex to every other vertex). Fig. 3 shows this effect on different graph sizes, obtained through simulation.
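The connectivity threshold is easy to confirm empirically. Below is a small simulation sketch (ours, not from the paper) that estimates the probability that an Erdős–Rényi graph G(n, p) is connected, using a union-find connectivity check; with p a few times ln n / n, the sampled graphs are almost always connected, while well below the threshold they almost never are.

```python
import random

def is_connected(n, edges):
    """Union-find check that the graph on n vertices with these edges is connected."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(i) for i in range(n)}) == 1

def connected_fraction(n, p, trials=200, seed=0):
    """Fraction of sampled G(n, p) graphs that are connected."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        edges = [(u, v) for u in range(n) for v in range(u + 1, n)
                 if rng.random() < p]
        hits += is_connected(n, edges)
    return hits / trials
```

For n = 100, ln n / n ≈ 0.046; sampling at p three times that value yields connected graphs almost every time, while p at a fifth of the threshold almost never does.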
Consider a clustering system in which links are made between tracklets by a verifier (a face verification system), whose job is to say whether a pair of tracklets shows the "same" person or two "different" people. While graphs obtained in clustering problems are not uniformly random graphs, the results of Erdős and Rényi suggest that this verifier can have fairly low recall (the percentage of "same" pairs that are linked) and still do a good job connecting large clusters. On the other hand, false matches may connect large clusters of different identities, dramatically hurting clustering performance. This motivates us to build a verifier that focuses on low false positives rather than high recall. In the next section, we present our approach to building a verifier designed to have good recall at low false positive rates, and hence appropriate for clustering problems with large clusters, like grouping cast members in movies.
4.1 Rank-1 counts for fewer false positives
Our method compares images by comparing their multi-dimensional feature vectors. More specifically, we count the number of feature dimensions in which the two images are closer in value than the first image is to any of a set of reference images. We call this number the rank-1 count similarity. Intuitively, two images whose feature values are "very close" for many different dimensions are more likely to be the same person. Here, an image is considered "very close" to a second image in one dimension if it is closer to the second image in that dimension than to any of the reference images.

More formally, to compare two images A and B, our first step is to obtain feature vectors F_A and F_B for these images. We extract 4096-D feature vectors from the fc7 layer of a standard pre-trained face recognition CNN [26]. In addition to these two images, we use a fixed reference set with G images (we typically set G = 50), and compute CNN feature vectors for each of these reference images. (The reference images may overlap in identity with the clustering set, but we choose reference images so that there is no more than one occurrence of each person in the reference set.) Let the CNN feature vectors for the reference images be F_{R_1}, ..., F_{R_G}. We sample reference images from the TV Human Interactions Dataset [27], since these are likely to have a similar distribution to the images we want to cluster.
For each feature dimension j (of the 4096), we ask whether

|F_A(j) − F_B(j)| < min_i |F_A(j) − F_{R_i}(j)|.

That is, is the value in dimension j closer between A and B than between A and all the reference images? If so, then we say that the j-th feature dimension is rank-1 between A and B. The cumulative rank-1 counts feature R_{AB} is simply the number of rank-1 dimensions across all 4096 features:

R_{AB} = Σ_{j=1}^{4096} 1[ |F_A(j) − F_B(j)| < min_i |F_A(j) − F_{R_i}(j)| ],

where 1[·] is an indicator function which is 1 if the expression is true and 0 otherwise.
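The count defined above reduces to a few vectorized operations. Below is a NumPy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def rank1_count(fa, fb, refs):
    """Rank-1 count between feature vectors fa and fb (shape (d,)),
    given reference features refs (shape (G, d)): the number of
    dimensions where |fa - fb| beats fa's distance to every reference."""
    d_ab = np.abs(fa - fb)                 # per-dimension distance A to B
    d_ar = np.abs(refs - fa).min(axis=0)   # per-dimension closest reference to A
    return int((d_ab < d_ar).sum())
```

For random, unrelated vectors the count concentrates near d/(G+1) (about 80 for d = 4096 and G = 50), matching the expectation derived below.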
Taking inspiration from Barlow's notion that the brain takes special note of "suspicious coincidences" [1], each rank-1 feature dimension can be considered a suspicious coincidence. It provides some weak evidence that A and B may be two images of the same person. On the other hand, in comparing all 4096 feature dimensions, we expect to obtain quite a large number of rank-1 feature dimensions even if A and B are not the same person.
When two images A and B and the reference set are selected randomly from a large distribution of faces (in which case A and B are usually different people), the probability that F_A(j) is closer to F_B(j) in a particular feature dimension than to any of the G reference images is just 1/(G+1). Repeating this over all 4096 dimensions means that the expected number of rank-1 counts is simply 4096/(G+1), since expectations are linear (even in the presence of statistical dependencies among the feature dimensions). Note that this calculation is a fairly tight upper bound on the expected number of rank-1 features conditioned on the images being of different identities, since most pairs of images in large clustering problems are different, and conditioning on "different" will tend to reduce the expected rank-1 count. Now if two images A and B have a large rank-1 count, it is likely they represent the same person. The key question is how to set the threshold on these counts to obtain the best verification performance.
Recall that our goal, as guided by the Erdős–Rényi random graph model, is to find a threshold on the rank-1 counts so that we obtain very few false positives (declaring two different faces to be "same") while still achieving good recall (a large number of same faces declared to be "same"). Fig. 4 shows distributions of rank-1 counts for various subsets of image pairs from Labeled Faces in the Wild (LFW) [14]. The red curve shows the distribution of rank-1 counts for mismatched pairs, drawn from all possible mismatched pairs in the entire data set (not just the test sets). Notice that the mean is exactly where we would expect with a gallery size of 50, at 4096/51 ≈ 80. The green curve shows the distribution of rank-1 counts for the matched pairs, which is clearly much higher. The challenge for clustering, of course, is that we do not have access to these distributions, since we do not know which pairs are matched and which are not. The yellow curve shows the rank-1 counts for all pairs of images in LFW, which is nearly identical to the distribution of mismatched rank-1 counts, since the vast majority of possible pairs in all of LFW are mismatched. This is the distribution to which the clustering algorithm has access.
If the 4096 CNN features were statistically independent (but not identically distributed), then the distribution of rank-1 counts would be a binomial distribution (blue curve). In this case, it would be easy to set a threshold on the rank-1 counts that guarantees a small number of false positives, by simply setting the threshold near the right end of the mismatched (red) distribution. However, the dependencies among the CNN features prevent the mismatched rank-1 counts distribution from being binomial, and so this approach is not possible.

4.2 Automatic determination of rank-1 count threshold
Ideally, if we could obtain the rank-1 count distribution of mismatched pairs of a test set, we could set the threshold so that the number of false positives becomes very low. However, it is not clear how to get the actual distribution of rank-1 counts for mismatched pairs at test time. Instead, we estimate the shape of the mismatched-pair rank-1 count distribution from one distribution (LFW), and use it to estimate the distribution of mismatched rank-1 counts for the test distribution. We do this by fitting the left half of the LFW distribution to the left half of the clustering distribution using scale and location parameters. We use the left half because this part of the rank-1 counts distribution is almost exclusively influenced by mismatched pairs. The right side of the fitted distribution then gives us an approximate way to threshold the test distribution to obtain a given false positive rate. This is the method we use to report the results in the leftmost column of Table 3.
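One simple way to realize such a fit is sketched below. This is our own illustration, not the authors' exact procedure: it matches the median and lower-quartile spread of the validation distribution (statistics dominated by the left, mismatched half) to those of the test distribution, then reads a high quantile off the rescaled validation distribution as the threshold.

```python
import numpy as np

def estimate_threshold(val_mismatched_counts, test_counts, target_quantile=0.999):
    """Shift and scale the validation rank-1 count distribution so its left
    half matches the test distribution's left half, then take a high quantile
    of the shifted distribution as the verification threshold."""
    v = np.asarray(val_mismatched_counts, dtype=float)
    t = np.asarray(test_counts, dtype=float)
    # location: medians; scale: median-to-lower-quartile spread, both
    # dominated by the mismatched mass on the left of each distribution
    v_med, t_med = np.median(v), np.median(t)
    v_spread = v_med - np.percentile(v, 25)
    t_spread = t_med - np.percentile(t, 25)
    scale = t_spread / v_spread if v_spread > 0 else 1.0
    shifted = (v - v_med) * scale + t_med
    return float(np.quantile(shifted, target_quantile))
```

When the test distribution really is a shifted and scaled copy of the validation one, the estimated threshold lands near the corresponding quantile of the test distribution itself.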


Table 1. True positive rates (recall) at fixed false positive rates (FPR) for verification over all pairs of LFW; all methods use the same base CNN representation.

FPR   | Rank-1 counts | L2     | Template Adaptation [4] | Rank-order Distance [45]
1E-9  | 0.0252        | 0.0068 | 0.0016                  | 0.0086
1E-8  | 0.0342        | 0.0094 | 0.0017                  | 0.0086
1E-7  | 0.0614        | 0.0330 | 0.0034                  | 0.0086
1E-6  | 0.1872        | 0.1279 | 0.0175                  | 0.0086
1E-5  | 0.3800        | 0.3154 | 0.0767                  | 0.0427
1E-4  | 0.6096        | 0.5600 | 0.2388                  | 0.2589
1E-3  | 0.8222        | 0.7952 | 0.5215                  | 0.8719
1E-2  | 0.9490        | 0.9396 | 0.8204                  | 0.9656
1E-1  | 0.9939        | 0.9915 | 0.9776                  | 0.9861

A key property of our rank-1 counts verifier is that it has good recall across a wide range of the low false positive regime. Thus, our method is relatively robust to the setting of the rank-1 counts threshold. To show that the rank-1 counts feature performs well on the types of verification problems that arise in clustering, we construct a verification problem using all possible pairs of the LFW database [14]. In this case, the number of mismatched pairs (quadratic in the number of images) is much greater than the number of matched pairs. As shown in Table 1, our verifier has higher recall than three competing methods (all of which use the same base CNN representation) at low false positive rates.
Using rank-1 counts verification for tracklet clustering. In our face clustering application, we consider every pair of tracklets, calculate a value akin to the image-pair rank-1 count, and join the tracklets if it exceeds the threshold. To calculate this value for a pair of tracklets, we sample a random subset of 10 face images from each tracklet, compute a rank-1 count for each pair of images (one from each tracklet), and take the maximum of the resulting values.
4.3 Averaging over gallery sets
While our basic algorithm uses a fixed (but randomly selected) reference gallery, it is susceptible to the case in which one of the gallery images happens to be similar in appearance to a person with a large cluster, resulting in a large number of false negatives. To mitigate this effect, we implicitly average the rank-1 counts over an exponential number of random galleries, as follows.
The idea is to sample random galleries of size G = 50 from a larger super-gallery with N = 1000 images. We are interested in rank-1 counts, in which image A's feature value is closer to B's than to any of the G gallery images. Suppose we know that among the N = 1000 super-gallery images, there are c that are closer to F_A(j) than F_B(j) is. The probability that a random selection (with replacement) of G images from the super-gallery would contain none of the c closer images (and hence yield a rank-1 count) is

p_j = (1 − c/N)^G.

That is, p_j is the probability of having a rank-1 count in dimension j with a random gallery, and using p_j as the count is equivalent to averaging over all possible random galleries. In our final algorithm, we sum these probabilities rather than the deterministic rank-1 counts.
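Since the per-dimension probability has a closed form, the soft count can be computed directly. A minimal sketch (function names are ours), assuming the count c of closer super-gallery images is known for each dimension:

```python
def rank1_probability(c, N=1000, G=50):
    """Probability that none of the c super-gallery images closer to A's
    value than B's value appear in a random gallery of G images drawn
    with replacement from the N-image super-gallery."""
    return (1.0 - c / N) ** G

def soft_rank1_count(closer_counts, N=1000, G=50):
    """Sum of per-dimension probabilities: the rank-1 count implicitly
    averaged over all possible random galleries."""
    return sum(rank1_probability(c, N, G) for c in closer_counts)
```

A dimension with no closer super-gallery image contributes a full count of 1; a dimension where every super-gallery image is closer contributes 0.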
4.4 Efficient implementation
For simplicity, we discuss the computational complexity of our fixed-gallery algorithm; the complexity of the average-gallery algorithm is similar. With d, G, and n indicating the feature dimensionality, the number of gallery images, and the number of face tracklets to be clustered, the time complexity of the naive rank-1 count algorithm is O(dGn²). However, for each feature dimension, we can sort the test image feature values and the gallery image feature values, in total time O(d(n+G) log(n+G)). Then, for each value in test image A, we find the closest gallery value and increment the rank-1 count for the test images that are closer to A. Let s be the average number of steps needed to find the closest gallery value; this is typically much smaller than G. The time complexity is then O(d(n+G) log(n+G) + dns), plus work proportional to the total number of rank-1 count increments, which in practice is far smaller than dGn².
4.5 Clustering with do-not-link constraints
It is common in clustering applications to incorporate constraints such as do-not-link or must-link, which specify that certain pairs must be in separate clusters or in the same cluster, respectively [38, 32, 19, 17, 21]. Such constraints are also common in the face clustering literature [3, 39, 40, 25, 37, 43]. The constraints can be either rigid, meaning they must be enforced [38, 32, 21, 25], or soft, meaning that violations increase the loss function but may be tolerated if other considerations are more important in reducing the loss [19, 17, 39, 40, 43].

In this work, we assume that if two faces appear in the same frame, they must belong to different people, and hence their face images obey a do-not-link constraint. Furthermore, we extend this hard constraint to the tracklets that contain the faces: if two tracklets overlap at all in time, the entire tracklets form a do-not-link constraint.

We enforce these constraints in our clustering procedure. Note that connecting all pairs below a certain dissimilarity threshold, followed by transitive closure, is equivalent to single-linkage agglomerative clustering with a joining threshold. In agglomerative clustering, the pair of closest clusters is found and joined at each iteration, until a single cluster remains or a threshold is met. A naive implementation simply searches and updates the dissimilarity matrix at each iteration, making the whole process O(n³) in time. There are faster algorithms achieving the optimal O(n²) time complexity for single-linkage clustering [34, 22]. Many of these algorithms perform a dissimilarity update at each iteration, i.e., they update d(i∪j, k) for every other cluster k after combining clusters i and j (using i as the id of the resulting cluster). If pairs with do-not-link constraints are initialized with infinite dissimilarity, the update rule can be modified to incorporate the constraints without affecting the time and space complexity:

d(i∪j, k) = ∞ if d(i, k) = ∞ or d(j, k) = ∞, and min(d(i, k), d(j, k)) otherwise.
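The constrained update rule can be sketched as follows (an illustrative Python implementation, not the authors' code), with the dissimilarity matrix stored as a dict over unordered cluster-id pairs and do-not-link pairs initialized to infinity:

```python
import math

INF = math.inf

def merge_update(dist, i, j):
    """Single-linkage update after merging clusters i and j (the result
    keeps id i). dist maps frozenset({a, b}) -> dissimilarity; do-not-link
    pairs hold INF, and INF is sticky: if either i or j may not link with
    k, neither may the merged cluster."""
    others = {a for pair in dist for a in pair} - {i, j}
    for k in others:
        dik = dist.pop(frozenset({i, k}), INF)
        djk = dist.pop(frozenset({j, k}), INF)
        dist[frozenset({i, k})] = (
            INF if (dik == INF or djk == INF) else min(dik, djk)
        )
    dist.pop(frozenset({i, j}), None)  # the merged pair itself disappears
    return dist
```

Because infinity propagates through every merge, no sequence of joins can ever place two constrained tracklets in the same cluster.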
5 Experiments
We evaluate our proposed approach on three video data sets: the Big Bang Theory (BBT), Season 1, Episodes 1-6 (s01e01-e06) [2]; Buffy the Vampire Slayer (Buffy), Season 5, Episodes 1-6 (s05e01-e06) [2]; and Hannah and Her Sisters (Hannah) [24]. Each episode of BBT and Buffy contains 5-8 and 11-17 characters, respectively, while Hannah has annotations for 235 characters (we removed garbage classes such as 'unknown' or 'false_positive'). Buffy and Hannah have many occlusions, which make the face clustering problem more challenging. In addition to the video data sets, we also evaluate our clustering algorithm on LFW [14], which contains 5730 subjects (all known ground truth errors are removed).


Table 3. F₁ clustering scores. The leftmost column shows results when the rank-1 count threshold is set automatically; for the rest of the columns, we show F-scores using optimal (oracle-supplied) thresholds. For BBT and Buffy, we show average scores over six episodes. The full table with individual episode results is given in Appendix A.

Verification system + link-based clustering:
  R1C-auto = Rank-1 Count (automatic threshold), R1C = Rank-1 Count, L2,
  TA = Template Adaptation [4], ROD = Rank-order Distance [45]
Other clustering algorithms:
  ROC = Rank-order based Clustering [45], AP = Affinity Propagation [11],
  DBSCAN [8], SC = Spectral Clustering [33], Birch [42], MBKM = Mini-Batch KMeans [30]

Test set              | R1C-auto | R1C   | L2    | TA    | ROD   | ROC   | AP    | DBSCAN | SC    | Birch | MBKM
Video: BBT s01 [2]    | .7728    | .7828 | .7365 | .7612 | .6692 | .6634 | .1916 | .2936  | .6319 | .2326 | .1945
Video: Buffy s05 [2]  | .5661    | .6299 | .3931 | .5845 | .2990 | .5439 | .1601 | .1409  | .5351 | .1214 | .1143
Video: Hannah [24]    | .6436    | .6813 | .2581 | .3620 | .4123 | .3955 | .1886 | .1230  | .3344 | .1240 | .1052
Image: LFW [14]       | .8532    | .8943 | .8498 | .3735 | .5989 | .5812 | .3197 | .0117  | .2538 | .4520 | .3133

An end-to-end evaluation metric.
There are many evaluation metrics for independently evaluating detection, tracking, and clustering. Previously, it has been difficult to compare the relative performance of two end-to-end systems because of the complex trade-offs among detection, tracking, and clustering performance. Some researchers have attempted to overcome this problem by providing a reference set of detections with suggested metrics [20], but this approach precludes optimizing complete system performance. To support evaluation of the full video-to-identity pipeline, in which false positives, false negatives, and clustering errors are handled in a common framework, we introduce unified pairwise precision (UPP) and unified pairwise recall (UPR) as follows.
Given a set of annotations and a set of detections, we consider the union of three sets of tuples: false positives, resulting from unannotated face detections; valid face detections; and false negatives, resulting from unmatched annotations. Fig. 5 visualizes every possible pair of tuples, ordered by false positives, valid detections, and false negatives, for the first few minutes of the Hannah data set. Further, groups of tuples have been ordered by identity to show blocks of identity and aid understanding of the visualization, although the order is inconsequential for the numerical analysis.
In Fig. 5, the large blue region (and the regions it contains) represents all pairs of annotated detections, where we have valid detections corresponding to their best annotations. In this region, white pairs are correctly clustered, magenta pairs are the same individual but not clustered, cyan pairs are clustered but not the same individual, and blue pairs are unclustered pairs from different individuals. The upper left portion of the matrix represents false positives with no corresponding annotation. The green pairs in this region correspond to any false positive matched with any valid detection. The lower right portion of the matrix corresponds to the false negatives. The red pairs in this region correspond to any missed clustered pairs resulting from these missed detections. The ideal result would contain only blue and white pairs, with no green, red, cyan, or magenta.
The unified pairwise precision (UPP) is the fraction of pairs within all clusters that have matching identities, i.e., the number of white pairs divided by the number of white, cyan, and green pairs. UPP decreases if two matched detections in a cluster do not correspond to the same individual, if a matched detection is clustered with a false positive, for each false positive regardless of its clustering, and for false positives clustered with valid detections. Similarly, the unified pairwise recall (UPR) is the fraction of pairs within all identities that have been properly clustered, i.e., the number of white pairs divided by the number of white, magenta, and red pairs. UPR decreases if two matched detections of the same identity are not clustered, if an annotation should be matched but there is no corresponding detection, for each false negative, and for false negative pairs that should have been detected and clustered. The only way to achieve perfect UPP and UPR is to detect every face with no false positives and cluster all faces correctly. At a glance, our visualization in Fig. 5 shows that our detection produces few false negatives, many more false positives, and is less aggressive in clustering. Using this unified metric, others can tune their own detection, tracking, and clustering algorithms to optimize the unified performance metrics. Note that for image matching without any detection failures, UPP and UPR reduce to standard pairwise precision and pairwise recall.
The UPP and UPR can be summarized with a single F-measure (the weighted harmonic mean), providing a single, unified performance measure for the entire process. It can be weighted to alter the relative value of precision and recall performance:

F_β = (1 + β²) · (UPP · UPR) / (β² · UPP + UPR),   (2)

where β = 1 denotes a balanced F-measure, F₁.
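To make the pair bookkeeping concrete, here is a simplified sketch of the metric (ours, not the authors' evaluation code). Each tuple is (identity, cluster), with identity `None` for a false positive and cluster `None` for a false negative; this version counts penalties only through the pairs each tuple participates in, omitting the per-item penalties described above.

```python
from itertools import combinations

def unified_pair_scores(tuples, beta=1.0):
    """tuples: list of (identity, cluster) for false positives
    (identity None), valid detections, and false negatives (cluster None).
    Returns (UPP, UPR, F_beta)."""
    white = wrong_cluster = missed_pair = 0
    for (id1, c1), (id2, c2) in combinations(tuples, 2):
        same_id = id1 is not None and id1 == id2
        same_cluster = c1 is not None and c1 == c2
        if same_id and same_cluster:
            white += 1            # correctly clustered pair
        elif same_cluster:
            wrong_cluster += 1    # cyan/green: clustered, not same person
        elif same_id:
            missed_pair += 1      # magenta/red: same person, not clustered
    upp = white / (white + wrong_cluster) if white + wrong_cluster else 0.0
    upr = white / (white + missed_pair) if white + missed_pair else 0.0
    denom = beta ** 2 * upp + upr
    f = (1 + beta ** 2) * upp * upr / denom if denom else 0.0
    return upp, upr, f
```

Note how a false negative drags down recall through every pair it forms with detections of the same identity, and a clustered false positive drags down precision through every pair it forms within its cluster.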
5.1 Threshold for rank1 counts
The leftmost column in Table 3 shows our clustering results when the threshold is set automatically using a validation set. We used LFW as the validation set for BBT, Buffy, and Hannah, while Hannah was used as the validation set for LFW. Note that the proposed method is very competitive even when the threshold is set automatically.
5.2 Comparisons
We divide other clustering algorithms into two broad categories: link-based clustering algorithms (like ours) that use a different similarity function, and clustering algorithms that are not link-based (such as spectral clustering [33]). Table 3 shows comparisons to various distance functions [4, 23, 45] combined with our link-based clustering algorithm. L2 shows competitive performance on LFW, but its performance drops dramatically when a test set has large pose variations. We also compare against a recent "template adaptation" method [4], which also requires a reference set; it takes 2nd and 3rd place on Buffy and BBT. In addition, we compare to the rank-order method [45] in two different ways: link-based clustering using their rank-order distance, and their rank-order distance based clustering. Finally, we compare against several generic clustering algorithms (Affinity Propagation [11], DBSCAN [8], Spectral Clustering [33], Birch [42], KMeans [30]), with L2 distance as the pairwise metric. For the algorithms that take a similarity matrix as input (Affinity Propagation, DBSCAN, Spectral Clustering), do-not-link constraints are applied by setting the distance between the corresponding pairs to infinity. Note that this is only an approximation and in general does not guarantee the constraints in the final clustering result (for single-linkage agglomerative clustering, the modified update rule of Section 4.5 is also needed).
Note that all other settings (feature encoding, tracklet generation) are common across all methods. In Table 3, except for the leftmost column, we report the best scores using optimal (oracle-supplied) thresholds for the number of clusters or the distance. The link-based clustering algorithm with rank-1 counts outperforms the state of the art on all four data sets in F₁ score. Figures 6 and 7 show some clustering results on Buffy and BBT.
6 Discussion
We have presented a system for end-to-end clustering of faces in full-length videos and movies. In addition to a careful combination of detection and tracking and a new end-to-end evaluation metric, we have introduced a novel approach to link-based clustering that we call Erdős–Rényi clustering. We demonstrated a method for automatically estimating a good decision threshold for a verification method based on rank-1 counts, by estimating the underlying portion of the rank-1 counts distribution due to mismatched pairs. This decision threshold was shown to give good recall at a low false positive operating point. Such operating points are critical for large clustering problems, since the vast majority of pairs are from different clusters, and false positive links that incorrectly join clusters can have a large negative effect on clustering performance.
Several things could disrupt our algorithm: a) if a high percentage of "different" pairs are highly similar (e.g., family members), b) if only a small percentage of pairs are "different" (e.g., one cluster contains 90% of the images), and c) if "same" pairs lack many matching features (e.g., every cluster is a pair of images of the same person under extremely different conditions). Nevertheless, we showed excellent results on three popular video data sets. Not only do we dominate other methods when thresholds are optimized for clustering, we outperform other methods even when our thresholds are picked automatically.
References
 [1] H. Barlow. Cerebral cortex as model builder. In Matters of Intelligence, pages 395–406. Springer, 1987.
 [2] M. Bauml, M. Tapaswi, and R. Stiefelhagen. Semi-supervised learning with constraints for person identification in multimedia data. In Proc. CVPR, 2013.
 [3] R. G. Cinbis, J. Verbeek, and C. Schmid. Unsupervised metric learning for face identification in TV video. In Proc. ICCV, 2011.
 [4] N. Crosswhite, J. Byrne, C. Stauffer, O. M. Parkhi, Q. Cao, and A. Zisserman. Template adaptation for face verification and identification. In Face and Gesture, 2017.
 [5] S. J. Davey, M. G. Rutten, and B. Cheung. A comparison of detection performance for several track-before-detect algorithms. EURASIP Journal on Advances in Signal Processing, 2008:41, 2008.
 [6] B. DeCann and A. Ross. Modeling errors in a biometric re-identification system. IET Biometrics, 4(4):209–219, 2015.
 [7] P. Erdős and A. Rényi. On the evolution of random graphs. Publications of the Mathematical Institute of the Hungarian Academy of Sciences, 5:17–61, 1960.
 [8] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD, 96(34):226–231, 1996.
 [9] M. Everingham, J. Sivic, and A. Zisserman. “Hello! My name is… Buffy” – Automatic naming of characters in TV video. In Proc. BMVC, 2006.
 [10] G. D. Forney. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, 1973.
 [11] B. J. Frey and D. Dueck. Clustering by passing messages between data points. Science, 315(5814):972–976, 2007.
 [12] A. Gyaourova and A. Ross. Index codes for multi-biometric pattern retrieval. IEEE Transactions on Information Forensics and Security (TIFS), 7(2):518–529, April 2012.
 [13] M.-L. Haurilet, M. Tapaswi, Z. Al-Halah, and R. Stiefelhagen. Naming TV characters by watching and analyzing dialogs. In Proc. CVPR, 2016.
 [14] G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In The Workshop on Faces in Real-Life Images at ECCV, 2008.
 [15] H. Jiang and E. Learned-Miller. Face detection with the Faster R-CNN. In Face and Gesture, 2017.
 [16] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1–2):83–97, 1955.
 [17] Z. Li, J. Liu, and X. Tang. Pairwise constraint propagation by semidefinite programming for semi-supervised classification. In Proc. ICML, 2008.
 [18] G. Lisanti, I. Masi, A. D. Bagdanov, and A. D. Bimbo. Person re-identification by iterative re-weighted sparse ranking. TPAMI, 37(8):1629–1642, August 2015.
 [19] Z. Lu and T. K. Leen. Penalized probabilistic clustering. Neural Computation, 19(6):1528–1567, 2007.
 [20] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler. MOT16: A benchmark for multi-object tracking. arXiv:1603.00831 [cs], Mar. 2016.
 [21] S. Miyamoto and A. Terami. Semi-supervised agglomerative hierarchical clustering algorithms with pairwise constraints. In Fuzzy Systems (FUZZ), pages 1–6. IEEE, 2010.
 [22] F. Murtagh and P. Contreras. Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1):86–97, 2012.
 [23] C. Otto, D. Wang, and A. K. Jain. Clustering millions of faces by identity. TPAMI, Mar. 2017.
 [24] A. Ozerov, J.-R. Vigouroux, L. Chevallier, and P. Pérez. On evaluating face tracks in movies. In Proc. ICIP, 2013.
 [25] A. Ozerov, J.-R. Vigouroux, L. Chevallier, and P. Pérez. On evaluating face tracks in movies. In Proc. ICIP. IEEE, 2013.
 [26] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In Proc. BMVC, 2015.
 [27] A. Patron-Perez, M. Marszałek, A. Zisserman, and I. D. Reid. High five: Recognising human interactions in TV shows. In Proc. BMVC, 2010.
 [28] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proc. NIPS, 2015.
 [29] M. Roth, M. Bauml, R. Nevatia, and R. Stiefelhagen. Robust multi-pose face tracking by multi-stage tracklet association. In Proc. ICPR, 2012.
 [30] D. Sculley. Web-scale k-means clustering. In Proc. WWW, pages 1177–1178. ACM, 2010.
 [31] L. Sevilla-Lara and E. Learned-Miller. Distribution fields for tracking. In Proc. CVPR, 2012.
 [32] N. Shental, A. Bar-Hillel, T. Hertz, and D. Weinshall. Computing Gaussian mixture models with EM using equivalence constraints. In Proc. NIPS, 2004.
 [33] J. Shi and J. Malik. Normalized cuts and image segmentation. TPAMI, 22(8):888–905, 2000.
 [34] R. Sibson. SLINK: an optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16(1):30–34, 1973.
 [35] M. Tapaswi, M. Bauml, and R. Stiefelhagen. “Knock! Knock! Who is it?” Probabilistic person identification in TV series. In Proc. CVPR, 2012.
 [36] M. Tapaswi, C. C. Corez, M. Bauml, H. K. Ekenel, and R. Stiefelhagen. Cleaning up after a face tracker: False positive removal. In Proc. ICIP, 2014.
 [37] M. Tapaswi, O. M. Parkhi, E. Rahtu, E. Sommerlade, R. Stiefelhagen, and A. Zisserman. Total cluster: A person agnostic clustering method for broadcast videos. In Proc. ICVGIP, 2014.
 [38] K. Wagstaff, C. Cardie, S. Rogers, S. Schrödl, et al. Constrained k-means clustering with background knowledge. In Proc. ICML, 2001.
 [39] B. Wu, S. Lyu, B.-G. Hu, and Q. Ji. Simultaneous clustering and tracklet linking for multi-face tracking in videos. In Proc. ICCV, 2013.
 [40] B. Wu, Y. Zhang, B.-G. Hu, and Q. Ji. Constrained clustering and its application to face clustering in videos. In Proc. CVPR, 2013.
 [41] S. Yang, P. Luo, C. C. Loy, and X. Tang. WIDER FACE: A face detection benchmark. In Proc. CVPR, 2016.
 [42] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proc. SIGMOD. ACM, 1996.
 [43] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Joint face representation adaptation and clustering in videos. In Proc. ECCV, 2016.
 [44] L. Zheng, Y. Yang, and A. G. Hauptmann. Person re-identification: Past, present and future. arXiv, Oct. 2016.
 [45] C. Zhu, F. Wen, and J. Sun. A rank-order distance based clustering algorithm for face tagging. In Proc. CVPR, 2011.
Appendix A Performance Comparisons
In Table 3, except for the leftmost column of results, we report the best scores using optimal (oracle-supplied) thresholds for both the distance threshold (a parameter shared by all of the algorithms) and the number of clusters (a parameter required by a subset of the algorithms, such as k-nearest neighbors). The comparison shows that the proposed link-based clustering algorithm with rank-1 counts outperforms the state of the art on all four data sets in F-score. Unlike other clustering algorithms, our proposed approach scales from small clustering problems (5–8 subjects in BBT) to large clustering problems (5,730 subjects in LFW).
In Table 4, we also report traditional measures (pairwise precision, pairwise recall, and F-measure) on the subset of true positive tracklets that are given to each algorithm.
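For reference, these pairwise measures can be computed with a straightforward O(n²) pass over all item pairs (function name is ours; a sketch rather than the evaluation code used in the paper):

```python
from itertools import combinations

def pairwise_prf(pred, truth):
    """Pairwise precision, recall, and F-measure for a clustering.

    pred, truth: predicted and ground-truth cluster labels, one per item.
    A pair counts as positive when both items share a label.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(pred)), 2):
        same_pred = pred[i] == pred[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            tp += 1        # correctly linked pair
        elif same_pred:
            fp += 1        # linked but actually different
        elif same_true:
            fn += 1        # same identity but not linked
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f
```

Note the asymmetry that motivates low false-positive verification: because most pairs are negatives, even a tiny pairwise false-positive rate produces many `fp` pairs and sharply lowers precision.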