Multi-target tracking (MTT) aims at locating all targets of interest (e.g., faces, pedestrians, players, and cars) and inferring their trajectories in a video over time while maintaining their identities. This problem is at the core of numerous computer vision applications such as video surveillance, robotics, and sports analysis. Multi-face tracking is one important domain of MTT that can be applied to high-level video understanding tasks such as face recognition, content-based retrieval, and interaction analysis.
The problem of multi-face tracking is particularly challenging in unconstrained scenarios where the videos are generated from multiple moving cameras with different views or scenes as shown in Figure 1. Examples include automatic character tracking in movies, TV sitcoms, or music videos. It has attracted increasing attention in recent years due to the fast-growing popularity of such videos on the Internet. Unlike tracking in the constrained counterparts (e.g., a video from a single stationary or moving camera) where the main challenge is to deal with occlusions and intersections, multi-face tracking in unconstrained videos needs to address the following issues: (1) A video often consists of many shots and the contents of two neighboring shots may be dramatically different; (2) It entails dealing with re-identifying faces with large appearance variations due to changes in scale, pose, expression, illumination, and makeup in different shots or scenes; and (3) The results of face detection may be unreliable due to low resolution, occlusion, nonrigid deformation, motion blurring and complex backgrounds.
Multi-target tracking has been extensively studied in the literature with the primary focus on humans. Recent approaches often address multi-face tracking with tracking-by-detection techniques. These methods first apply an object detector to locate faces in every frame, and then apply data association approaches [1, 2, 3, 4, 5] that use visual cues (e.g., appearance, position, motion, and size) in an affinity model to link detections or tracklets (track fragments) into trajectories. Such methods are effective when the targets are continuously detected and when the camera is either stationary or slowly moving. However, for unconstrained videos with many shot changes and intermittent appearances of targets, the data association problem becomes more difficult because assumptions such as appearance and size consistency and continuous motion no longer hold in neighboring shots. Therefore, the design of discriminative features plays a critical role in identifying faces across shots in unconstrained scenarios.
Existing methods typically use hand-crafted features (e.g., color histograms or HOG) to construct an appearance model for each target. However, these hand-crafted features are often not sufficiently discriminative to identify faces with large appearance changes. For example, low-level features extracted from the faces of two different persons under the same pose (e.g., frontal poses) are likely more similar than those extracted from faces of the same person under different poses (e.g., frontal and profile poses).
Deep convolutional neural networks (CNNs) have demonstrated significant performance improvements on recognition tasks, e.g., image classification . The features extracted from the activation of a pre-trained CNN have been shown to be effective for generic visual recognition tasks . In particular, CNN-based features have shown impressive performance on face recognition and verification tasks [9, 10, 11, 12]. These models are often trained using large-scale face recognition datasets in a fully supervised manner and then serve as feature extractors for unseen face images. However, these models may not achieve good performance in unconstrained videos as the visual domains of the training and test sets may be significantly different.
In this paper, we address this domain shift by adapting a pre-trained CNN to the specific videos. Due to the lack of manual annotations of target identities, we collect a large number of training samples of faces by exploiting contextual constraints of tracklets in the video. With these automatically discovered training samples, we adapt the pre-trained CNN so that the Euclidean distance between the embedded features reflects the semantic distance between face images. We incorporate these discriminative features into a hierarchical agglomerative clustering algorithm to link tracklets across multiple shots into trajectories. We analyze the contribution of each component in the proposed algorithm and demonstrate the effectiveness of the learned features to identify characters in 10 long TV sitcom episodes and singers in 8 challenging music videos. We further apply our adaptive feature learning approach to other objects (e.g., pedestrians) and show competitive performance on pedestrian tracking across cameras. The preliminary results have been presented in .
We make the following contributions in this work:
Unlike existing work that uses linear metric learning on hand-crafted features, we account for large appearance variations of faces in videos by learning video-specific features with the deep contrastive and triplet-based metric learning on automatically discovered samples through contextual constraints.
We propose an improved triplet loss function that helps the learned model simultaneously pull positive pairs closer and push negative samples away from the positive pairs.
In contrast to prior work that often uses face tracks with false positives manually removed, we take raw video as the input and perform detection, tracking, clustering, and feature adaptation in a fully automatic way.
We develop a new dataset with 8 music videos from YouTube containing annotations of 3,845 face tracklets and 117,598 face detections. This benchmark dataset is challenging (with frequent shot changes, large appearance variations, and rapid camera motion) and crucial for evaluating multi-face tracking algorithms in unconstrained environments.
We demonstrate the proposed adaptive feature learning approach can be extended to tracking other objects, and present empirical results on pedestrian tracking across cameras using the DukeMTMC dataset .
2 Related Work and Problem Context
Multi-target tracking. In recent years, numerous multi-target tracking methods have been developed that apply a pre-learned object detector to locate instances in every frame and determine the trajectories by solving a data association problem [1, 2, 3, 4, 5]. A plethora of global optimization methods have been developed for association based on the Viterbi decoding scheme , the Hungarian algorithm [16, 17, 18], quadratic Boolean programming , maximum weight-independent sets [20, 21], energy minimization [22, 23], and min-cost network flow . Some methods tackle the data association problem with a hierarchical association framework. For example, Xing et al.  propose a two-stage method that combines local and global tracklet association to track multiple targets. Huang et al.  propose a three-level association approach that first links detections from consecutive frames into short tracklets at the bottom level and then applies an iterative Hungarian algorithm and an EM algorithm at the higher levels. Yang et al.  extend the three-level work in  by learning an online discriminative appearance model.
Data association can also be formulated as a linear assignment problem. Existing algorithms typically integrate appearance and motion cues into an affinity model to infer and link detections (or tracklets) into trajectories [1, 4, 23, 27, 28]. However, these MTT methods do not perform well in unconstrained videos where abrupt changes across different shots occur, and the assumptions of smooth appearance change no longer hold.
To identify targets across shots, discriminative appearance features are required to discern targets in various circumstances. Most existing multi-target tracking methods [4, 29, 25, 30, 31] use color histograms as features and the Bhattacharyya distance or correlation coefficient as affinity measures. Several methods [23, 22, 32, 33, 34] use hand-crafted features, e.g., Haar-like , SIFT [36, 37], or HOG  features, or combinations thereof [32, 33, 34]. For robustness, some approaches [38, 39, 26, 3] adaptively select the most discriminative features for a specific video [40, 41, 42]. However, these hand-crafted feature representations are not tailored for faces and are thus less effective at handling the large appearance variations in unconstrained scenarios.
Visual constraints in multi-target tracking. Several approaches [43, 44, 33, 31, 45] exploit visual constraints from videos to improve tracking performance. These visual constraints are often derived from the spatio-temporal relationships among the extracted tracklets [43, 45]. Two types of constraints are commonly used: 1) all samples in the same tracklet represent the same object, and 2) a pair of tracklets appearing in the same frame indicates that two different objects are present. Prior work uses these constraints either implicitly, for learning a cast-specific metric [43, 33], or explicitly, for linking clusters or tracklets [45, 31].
Numerous cues from contextual constraints have also been used for tracking, e.g., clothing , script [47, 48, 49], speech , gender , video editing style , clustering priors , and dynamic clustering constraints . For example, the methods in [54, 55, 56] incorporate clothing appearance to improve person identification accuracy. Lin et al.  present a probabilistic context model to jointly tag people across the domains of people, events, and locations. The work in [49, 56] exploits speaker analysis to improve face labeling.
The proposed algorithm differs from previous methods in three aspects. First, existing approaches often rely on hand-crafted features and learn a linear transformation over the extracted features, which may not be effective in modeling the large appearance variations of faces. We learn a discriminative face feature representation specific to a video by adapting all layers in a deep neural network. Second, previous work often uses face tracks with false positives removed manually [31, 43, 45, 44, 58]. In contrast, our approach takes a raw video as the input and performs detection, tracking, clustering, and feature adaptation without these pre-processing steps. Third, unlike the work in  that discovers negative pairs by thresholding feature distances, we discover negative pairs by transitively propagating the relationships among co-occurring tracklets using the proposed contextual constraints.
CNN-based representation learning. Recent face recognition and verification methods focus on learning identity-preserving feature representations with deep neural networks. While the models may differ, these CNN-based face representations (e.g., DeepID , DeepFace , FaceNet , VGG-Face ) are learned by training CNNs on large-scale datasets in a fully supervised manner. These CNNs then operate as feature extractors for face recognition, identification, and clustering. In this work, we also use a CNN to learn identity-preserving features from a face recognition dataset. The main difference lies in that we further adapt the pre-trained representation to a specific video, thereby improving the specificity of the model and enhancing its discriminative strength. In addition, we introduce a symmetric triplet-based loss function and demonstrate its effectiveness over the commonly used contrastive loss and triplet loss.
Long-term object tracking. The goal of long-term object tracking [61, 62] is to locate a specific target over time even when the target leaves and re-enters the scene. These trackers perform well on various types of targets such as cars and faces. However, online trackers are designed for scenes recorded by a stationary or slow-moving camera and are thus not effective at tracking faces in unconstrained videos for two reasons. First, these trackers are prone to drift due to online model updates with noisy examples. Second, hand-crafted features are not sufficiently discriminative to re-identify faces across shots. We tackle the first issue by processing the video offline, i.e., applying a face detector in every frame and associating all tracklets in the video. For the second issue, we learn an adaptive discriminative representation to account for the large appearance variations of faces across shots or scenes.
3 Algorithmic Overview
Our goal is to track multiple faces across shots in an unconstrained video while maintaining identities of the persons of interest. To achieve this, we learn discriminative features that are adapted to the appearance variations in the specific videos. We then use a hierarchical clustering algorithm to link tracklets across shots into long trajectories. The main steps of the proposed algorithm are summarized below and in Figure 2.
Discovering training samples: We detect shot changes and divide a video into non-overlapping shots. Within each shot, we apply a face detector and link adjacent detections into short tracklets. We discover a large collection of training samples (in pairs or triplets) from tracklets based on the spatio-temporal and contextual constraints (Section 4.2).
Learning video-specific features: We adapt the pre-trained CNN model using the automatically discovered training samples to account for large appearance changes of faces pertaining to a specific video (Section 4.3). We present an improved triplet loss to enhance the discriminative ability of the learned features.
Linking tracklets: Within each shot, we use a conventional multi-face tracking method to link tracklets into short trajectories. We use a hierarchical clustering algorithm to link trajectories across shots. Finally, we assign the tracklets in each cluster with the same identity (Section 5).
4 Learning Discriminative Features
In this section, we present the algorithmic details for learning video-specific features. After describing how the generic face features are obtained from the pre-training step, we introduce the process of discovering training examples and learning discriminative features using the proposed symmetric triplet loss function.
4.1 Supervised Pre-training
We learn identity-preserving features by pre-training a deep neural network on a large-scale face recognition dataset. Based on the AlexNet architecture , we replace the output layer with one node per subject, where each node corresponds to a specific person. We train the network on the external CASIA-WebFace dataset  (494,414 images of 10,575 subjects) for face recognition in a fully supervised manner. We select a subset of the subjects, using 431,300 images for training and the remaining 47,140 images as the validation set. Each face image is normalized to the network input size before training.
4.2 Discovering Training Samples
Shot detection and tracklet linking. We first use a shot change detection method to divide each input video into non-overlapping shots (http://sourceforge.net/projects/shot-change/). Next, we use a face detector  to locate faces in each frame. Given the face detections in each frame, we use a two-threshold strategy  to generate tracklets within each shot by linking the detected faces in adjacent frames based on similarities in appearance, position, and scale. Note that the two-threshold strategy for linking detections could be replaced by more sophisticated methods, e.g., tracking with particle filters [27, 40]. All tracklets shorter than five frames are discarded. The extracted face tracklets are formed in a conservative manner, with temporal spans limited to the length of each shot.
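As a concrete illustration of the two-threshold linking strategy, the sketch below greedily links per-frame detections whose appearance affinity to an active tracklet is both high (above a high threshold) and unambiguous (the second-best match falls below a low threshold). The cosine affinity, threshold values, and data layout are illustrative assumptions, not the exact implementation.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def link_detections(frames, t_high=0.8, t_low=0.4):
    """Greedily link per-frame face detections into tracklets.

    frames: list of per-frame lists of feature vectors.
    Returns a list of tracklets, each a list of (frame_idx, det_idx).
    """
    tracklets = []   # each: {"items": [(frame, det)], "feat": last feature}
    active = []      # indices of tracklets that were extended at frame t-1
    for t, dets in enumerate(frames):
        next_active, used = [], set()
        for j, feat in enumerate(dets):
            # similarity of this detection to every still-unclaimed tracklet
            sims = [(cosine(feat, tracklets[i]["feat"]), i)
                    for i in active if i not in used]
            sims.sort(reverse=True)
            best = sims[0] if sims else (-1.0, None)
            second = sims[1][0] if len(sims) > 1 else -1.0
            # two-threshold rule: link only if confident AND unambiguous
            if best[0] > t_high and second < t_low:
                i = best[1]
                tracklets[i]["items"].append((t, j))
                tracklets[i]["feat"] = feat
                used.add(i)
                next_active.append(i)
            else:
                tracklets.append({"items": [(t, j)], "feat": feat})
                next_active.append(len(tracklets) - 1)
        active = next_active
    return [tr["items"] for tr in tracklets]
```

Because linking is restricted to adjacent frames, any tracklet that misses a frame is terminated, which matches the conservative, shot-limited tracklet construction described above.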
Spatio-temporal constraints. Existing methods typically exploit spatio-temporal constraints from tracklets to generate training samples from the video. Given a set of tracklets, we can discover a large collection of positive and negative training sample pairs belonging to the same/different persons: (1) all pairs of faces in one tracklet are from one person and (2) two face tracklets that appear in the same frame contain faces of different persons.
Let $T_i = \{x^i_1, x^i_2, \ldots, x^i_{n_i}\}$ denote the $i$-th face tracklet of length $n_i$. We generate a set of positive pairs $\mathcal{P}$ by collecting all within-tracklet face pairs:

$\mathcal{P} = \big\{ (x^i_k, x^i_l) \,\big|\, k \neq l \big\}.$
Similarly, if tracklets $T_i$ and $T_j$ overlap in some frames, we can generate a set of negative pairs $\mathcal{N}$ by collecting all between-tracklet face pairs:

$\mathcal{N} = \big\{ (x^i_k, x^j_l) \big\}.$
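The spatio-temporal constraints translate directly into code. The sketch below assumes a simple tracklet representation (a set of frame indices plus a list of face-sample ids) rather than the exact data structures used in the implementation.

```python
from itertools import combinations

def spatiotemporal_pairs(tracklets):
    """Generate training pairs from tracklets.

    tracklets: list of dicts with "frames" (set of frame indices) and
    "faces" (list of face-sample ids). Within-tracklet pairs are positive;
    pairs drawn from two temporally overlapping tracklets are negative.
    """
    pos, neg = [], []
    for tr in tracklets:
        # constraint (1): all faces in one tracklet share an identity
        pos += list(combinations(tr["faces"], 2))
    for a, b in combinations(tracklets, 2):
        # constraint (2): co-occurring tracklets are different persons
        if a["frames"] & b["frames"]:
            neg += [(x, y) for x in a["faces"] for y in b["faces"]]
    return pos, neg
```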
Contextual constraints. With the spatio-temporal constraints, we can obtain a large number of face pairs without manual labeling. These training pairs, however, may have some biases. First, the positive (within-tracklet) pairs occur close in time (e.g., only several frames apart in one shot), which means that the positive face pairs often have small appearance variations. Second, the negative pairs are all generated from tracklets that co-occur in the same shot. Consequently, we are not able to train the model so that it can distinguish or link faces across shots (as we do not have training samples for these cases).
To address these problems, we mine additional positive and negative face pairs for learning our video-specific features. The idea is to exploit contextual information beyond facial regions for identifying a person across shots. Specifically, we identify the clothing region and extract features from it using the AlexNet model. Given a face detection in one frame, we locate the torso region using a probabilistic mask over the pixels of the current frame (see Figure 4). The mask is learned from the statistics of the spatial relationships of body parts on the Humans in 3D dataset . We then concatenate the face and clothing features. Next, we apply the HAC algorithm (see Section 5.2) to group tracklets and label the grouped tracklets with high confidence as positive pairs. These grouped tracklets generally contain faces with similar clothing in different shots or scenes and thus provide additional positive tracklet pairs. In addition, by leveraging the additional positive pairs, we can discover more negative pairs by transitively propagating the relationships among tracklets. For example, if tracklets A and B represent different persons, an additional positive constraint on tracklets A and C automatically implies that tracklets B and C are different persons.
Figure 3 illustrates the generation process of contextual constraints. Here, tracklets A and B co-occur in one shot, and tracklets C, D, and E co-occur in another shot. Using only spatio-temporal constraints, we are not able to obtain training samples from different shots; as a result, two tracklets of different persons in different shots may be incorrectly identified as the same person. However, from the contextual cues, we may be able to identify that tracklets A and C are the same person. Using this additional positive constraint, we can automatically generate additional negative constraints, e.g., that C is a different person from B, and that A is a different person from D and E.
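One way to implement this transitive propagation is with a union-find structure over the contextual positive pairs: tracklets joined by positive constraints form identity components, and each known negative pair expands to all members of the two components. This is an illustrative sketch, and the data layout (tracklet ids as strings) is an assumption.

```python
def propagate_negatives(neg_pairs, pos_pairs):
    """If A != B and A == C, then transitively C != B.

    neg_pairs: tracklet pairs known to be different persons (co-occurring).
    pos_pairs: tracklet pairs known to be the same person (contextual cues).
    Returns the expanded set of negative pairs (order-normalized tuples).
    """
    parent = {}

    def find(x):                      # union-find with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pos_pairs:            # merge same-person components
        parent[find(a)] = find(b)

    # group tracklets by identity component
    groups = {}
    for x in set(parent) | {t for p in neg_pairs for t in p}:
        groups.setdefault(find(x), set()).add(x)

    # expand each negative pair across both components
    expanded = set()
    for a, b in neg_pairs:
        for x in groups[find(a)]:
            for y in groups[find(b)]:
                expanded.add((min(x, y), max(x, y)))
    return expanded
```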
4.3 Learning Adaptive Discriminative Features
With the training pairs discovered from both contextual and spatio-temporal constraints, we optimize the embedding function $f(\cdot)$ such that the Euclidean distance in the embedding space,

$D(x_i, x_j) = \big\| f(x_i) - f(x_j) \big\|_2,$

reflects the semantic similarity of two face images $x_i$ and $x_j$.
We set the feature dimension of $f(\cdot)$ to 64 in all of our experiments. We first describe two commonly used loss functions for optimizing the embedding space, (1) the contrastive loss and (2) the triplet loss, and then present a symmetric triplet loss function for feature learning.
Contrastive loss. The Siamese network [68, 69] consists of two identical CNNs with a shared architecture and parameters as shown in Figure 5. Minimizing the contrastive loss function encourages a small distance between two images of the same person and a large distance otherwise. Denote $(x_i, x_j, y_{ij})$ as a pair of training images generated with the spatio-temporal constraints, where $y_{ij} = 1$ if $x_i$ and $x_j$ are from the same person and $y_{ij} = 0$ otherwise. Similar to [68, 69], the contrastive loss function is:

$L_c(x_i, x_j) = \frac{1}{2}\, y_{ij}\, D^2(x_i, x_j) + \frac{1}{2}\,(1 - y_{ij}) \max\big(0, \tau - D(x_i, x_j)\big)^2,$

where $\tau$ is the margin (fixed across all our experiments). Intuitively, if $x_i$ and $x_j$ are from the same person, the loss is $\frac{1}{2} D^2(x_i, x_j)$ and we aim to decrease $D(x_i, x_j)$; otherwise, we increase $D(x_i, x_j)$ until it is larger than the margin $\tau$.
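As a sketch of this loss, assuming the standard hinge-based contrastive form with a margin (the margin value below is illustrative, not the value used in the experiments):

```python
import numpy as np

def contrastive_loss(f_i, f_j, same, margin=1.0):
    """Per-pair contrastive loss on embedded features.

    f_i, f_j: embedding vectors; same: True if the pair shares an identity.
    """
    d = np.linalg.norm(f_i - f_j)
    if same:
        return 0.5 * d ** 2                       # pull positives together
    return 0.5 * max(0.0, margin - d) ** 2        # push negatives past margin
```

Note that a negative pair already separated by more than the margin contributes zero loss, so training effort concentrates on the hard negatives.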
Triplet loss. The triplet-based network  consists of three identical CNNs with a shared architecture and parameters as shown in Figure 5. One triplet consists of two face images of the same person and one face image of another person. We generate a set of triplets $(x_a, x_p, x_n)$ from two tracklets $T_i$ and $T_j$ belonging to different persons, where the anchor $x_a$ and the positive sample $x_p$ are drawn from $T_i$, and the negative sample $x_n$ is drawn from $T_j$. Here we aim to ensure that the embedded distance of the positive pair is smaller than that of the negative pair by a distance margin $\alpha$. For one triplet, the triplet loss is of the form:

$L_t(x_a, x_p, x_n) = \frac{1}{2} \max\big(0,\; D^2(x_a, x_p) - D^2(x_a, x_n) + \alpha\big).$
Figure 6: (a) Conventional triplet loss. (b) SymTriplet loss.
Symmetric triplet loss. The conventional triplet loss in (5) takes only two of the three pairwise distances into consideration, $D(x_a, x_p)$ and $D(x_a, x_n)$, although there are three distances among the triplet. We illustrate the problem of the conventional triplet loss by analyzing the gradients of the loss function. We denote the difference vectors among the triplet $(x_a, x_p, x_n)$:

$\Delta_1 = f(x_a) - f(x_p), \quad \Delta_2 = f(x_a) - f(x_n), \quad \Delta_3 = f(x_p) - f(x_n).$

For a non-zero triplet loss in (5), we can compute the gradients as

$\frac{\partial L_t}{\partial f(x_a)} = \Delta_1 - \Delta_2, \qquad \frac{\partial L_t}{\partial f(x_p)} = -\Delta_1, \qquad \frac{\partial L_t}{\partial f(x_n)} = \Delta_2.$
Figure 6(a) shows the negative gradient directions for each sample. There are two issues with the triplet loss in (5). First, the loss function pushes the negative data point $x_n$ away from only one sample of the positive pair, $x_a$, rather than from both $x_a$ and $x_p$. Second, the gradients on the positive pair $(x_a, x_p)$ are not symmetric with respect to the negative data point $x_n$.
To address these issues, we propose a symmetric triplet loss function (SymTriplet) that considers all three distances:

$L_s(x_a, x_p, x_n) = \max\Big(0,\; 2\,D^2(x_a, x_p) - \big(D^2(x_a, x_n) + D^2(x_p, x_n)\big) + \alpha\Big),$

where $\alpha$ is the distance margin. For a non-zero loss, the gradients induced by the proposed SymTriplet loss are

$\frac{\partial L_s}{\partial f(x_a)} = 4\big(f(x_a) - f(x_p)\big) - 2\big(f(x_a) - f(x_n)\big),$
$\frac{\partial L_s}{\partial f(x_p)} = -4\big(f(x_a) - f(x_p)\big) - 2\big(f(x_p) - f(x_n)\big),$
$\frac{\partial L_s}{\partial f(x_n)} = 2\big(f(x_a) - f(x_n)\big) + 2\big(f(x_p) - f(x_n)\big).$
Figure 6(b) shows the negative gradient directions. The proposed SymTriplet loss directly optimizes the embedding space such that the two positive samples are pulled closer to each other and the negative sample $x_n$ is pushed away from both positive samples $x_a$ and $x_p$. This property improves the discriminative strength of the learned features.
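The following NumPy sketch contrasts the two losses. It assumes a SymTriplet of the hinge form $\max(0,\; 2D^2(x_a,x_p) - (D^2(x_a,x_n) + D^2(x_p,x_n)) + \alpha)$ and a margin of 1, which reflect our reading of the loss above rather than the exact implementation.

```python
import numpy as np

def triplet_loss(fa, fp, fn, alpha=1.0):
    """Conventional triplet loss on embedded features (anchor, pos, neg)."""
    gap = float(np.sum((fa - fp) ** 2) - np.sum((fa - fn) ** 2))
    return 0.5 * max(0.0, gap + alpha)

def symtriplet_loss(fa, fp, fn, alpha=1.0):
    """SymTriplet: the positive-pair distance is penalized against BOTH
    negative distances, so the negative is pushed away from both positives."""
    d_pos = float(np.sum((fa - fp) ** 2))
    d_neg = float(np.sum((fa - fn) ** 2) + np.sum((fp - fn) ** 2))
    return max(0.0, 2.0 * d_pos - d_neg + alpha)
```

Both losses vanish once the negative is far from the positive pair; the difference is that SymTriplet keeps a nonzero gradient on the negative with respect to both positives while the hinge is active.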
4.4 Training Algorithm
We train the triplet network model with the SymTriplet loss function and stochastic gradient descent with momentum. We compute the derivatives of (8) by the chain rule over the network parameters $W$:

$\frac{\partial L_s}{\partial W} = \frac{\partial L_s}{\partial f(x_a)} \frac{\partial f(x_a)}{\partial W} + \frac{\partial L_s}{\partial f(x_p)} \frac{\partial f(x_p)}{\partial W} + \frac{\partial L_s}{\partial f(x_n)} \frac{\partial f(x_n)}{\partial W}.$

We can compute the gradients for each input triplet given the values of $\frac{\partial L_s}{\partial f(x_a)}$, $\frac{\partial L_s}{\partial f(x_p)}$, $\frac{\partial L_s}{\partial f(x_n)}$ and $\frac{\partial f(x_a)}{\partial W}$, $\frac{\partial f(x_p)}{\partial W}$, $\frac{\partial f(x_n)}{\partial W}$, which can be obtained using the standard forward and backward propagations separately for each image in the triplet. We summarize the main training steps in Algorithm 1.
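A single SGD-with-momentum parameter update (with the weight-decay term folded into the gradient) can be sketched as follows; the default hyperparameter values follow Section 6.1, and the helper name is ours.

```python
def sgd_momentum_step(w, grad, vel, lr=1e-5, mu=0.9, wd=1e-4):
    """One SGD-with-momentum update on a parameter (scalar or array).

    w: parameter value, grad: loss gradient, vel: previous velocity.
    Returns the updated (parameter, velocity) pair.
    """
    vel = mu * vel - lr * (grad + wd * w)   # momentum plus weight decay
    return w + vel, vel
```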
5 Multi-face Tracking via Tracklet Linking
We use a two-step procedure to link face tracklets generated in Section 4.2: (1) linking the face tracklets within each shot into shot-level tracklets, and (2) merging shot-level tracklets across multiple shots into trajectories.
5.1 Linking Tracklets Within Each Shot
We use a typical multi-object tracking framework for linking tracklets within each shot. First, we extract features from each detected face using the learned deep network. We then measure the linking probabilities between two tracklets using temporal, kinematic, and appearance information. Finally, we use the Hungarian algorithm to determine a globally optimal label assignment [25, 24], and tracklets with the same label are linked into shot-level tracklets.
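For the handful of tracklets in a single shot, the globally optimal assignment can be illustrated by exhaustive search over permutations; the Hungarian algorithm returns the same optimum in polynomial time for larger problems. The helper below is a sketch, not the actual pipeline code.

```python
from itertools import permutations

def best_assignment(cost):
    """Globally optimal one-to-one assignment over a square cost matrix.

    cost[i][j] could be, e.g., the negative log linking probability between
    tracklet i and label j. Returns (assignment permutation, total cost).
    """
    n = len(cost)
    best_cost, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        c = sum(cost[i][perm[i]] for i in range(n))
        if c < best_cost:
            best_cost, best_perm = c, perm
    return best_perm, best_cost
```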
5.2 Linking Tracklets Across Shots
To link tracklets across multiple shots, we apply a bottom-up hierarchical agglomerative clustering (HAC) algorithm with a stopping threshold and learned appearance features as follows:
Given the tracklets from all shots, we start by treating each tracklet as a singleton cluster.
We evaluate all pairwise distances between tracklets and use the mean distance as the similarity measure. Given $T_i$ and $T_j$, the distance is defined as:

$D(T_i, T_j) = \frac{1}{n_i n_j} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} \big\| f(x^i_k) - f(x^j_l) \big\|_2,$

where $x^i_k$ denotes the $k$-th face detection in the $i$-th tracklet, and $f(\cdot)$ denotes the feature extracted from the embedding layer of the triplet network.
For tracklets which have overlapping frames, we set the corresponding distance to infinity.
We determine the pair of clusters with the shortest distance and merge them into a new cluster. We then update the distances from the new cluster to all other clusters. For clusters that have overlapping frames with the new cluster, the corresponding distances are set to infinity.
We repeat the previous step until the shortest distance is larger than a threshold $\theta$.
We remove clusters containing fewer than 4 tracklets and fewer than 50 frames. The tracklets in each cluster are labeled with the same identity to form trajectories.
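The across-shot clustering steps above can be sketched as follows, assuming a simple tracklet representation (frame set plus embedding vectors). As a simplification, the cluster-to-cluster distance here averages the tracklet-level mean distances rather than averaging over all face pairs; the threshold in the test is illustrative.

```python
import numpy as np

def tracklet_distance(ta, tb):
    """Mean pairwise Euclidean distance between two tracklets' embeddings."""
    return float(np.mean([np.linalg.norm(a - b)
                          for a in ta["feats"] for b in tb["feats"]]))

def hac_link(tracklets, threshold):
    """Agglomerative linking of tracklets across shots.

    tracklets: list of dicts with "frames" (set) and "feats" (list of arrays).
    Tracklets with overlapping frames are never merged (distance = infinity).
    Returns one cluster label per tracklet.
    """
    clusters = [{i} for i in range(len(tracklets))]

    def dist(ca, cb):
        fa = set().union(*(tracklets[i]["frames"] for i in ca))
        fb = set().union(*(tracklets[i]["frames"] for i in cb))
        if fa & fb:                          # co-occurring: different persons
            return float("inf")
        return float(np.mean([tracklet_distance(tracklets[i], tracklets[j])
                              for i in ca for j in cb]))

    while len(clusters) > 1:
        pairs = [(dist(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        d, a, b = min(pairs)
        if d > threshold:                    # stopping criterion
            break
        clusters[a] |= clusters[b]
        del clusters[b]

    labels = [0] * len(tracklets)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels
```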
6 Experimental Results
We first describe the implementation details, datasets, and evaluation metrics. Next, we present the evaluation results of the proposed algorithm against the state-of-the-art methods. More experimental results and videos are available in the supplementary material at http://vllab1.ucmerced.edu/~szhang/FaceTracking/. The source code and annotated datasets will be made available to the public.
6.1 Implementation Details
We adapt the pre-trained CNN with the proposed SymTriplet loss. For feature embedding, we replace the classification layer in the pre-trained network with 64 output nodes. We use stochastic gradient descent with the momentum term set to 0.9. For the network training, we use a fixed learning rate of 0.00001 for fine-tuning and a weight decay of 0.0001. We use a mini-batch size of 128 and train the network for 2,000 epochs.
Linking tracklets. To determine the threshold $\theta$ in Section 5.2, we perform parameter sweeping (with an interval of 0.1) using cross-validation. For features trained with the Siamese network, we empirically set the threshold of the HAC algorithm from this sweep. For features trained with the triplet network, we use a common threshold for both the triplet and SymTriplet loss functions.
6.2 Datasets

We evaluate the proposed algorithm on three types of videos containing multiple persons:
(1) Videos in a laboratory setting: Frontal
(2) TV sitcoms: the BBT and BUFFY datasets
(3) Music videos from YouTube
Frontal video. Frontal is a short video in a constrained scene acquired indoors with a fixed camera. Four persons facing the camera move around and occlude each other.
BBT dataset. We select the first 7 episodes from Season 1 of the Big Bang Theory TV Sitcom (referred to as BBT01-07). Each video is about 23 minutes long with the main cast of 5-13 people and is recorded mostly indoors. The main difficulty lies in identifying faces of the same person from frequent changes of camera views and scenes, where there are large appearance variations in viewing angle, pose, scale, and illumination.
BUFFY dataset. The BUFFY dataset has been widely evaluated in the context of automatic face labeling [48, 49, 66]. The dataset contains three episodes (episode 2, 5 and 6) from Season 5 of the TV series Buffy the Vampire Slayer (referred to as BUFFY02, BUFFY05, and BUFFY06). Each video is about 40 minutes long with the main cast of 13-19 people. The illumination condition in this video dataset is more challenging than that in the BBT dataset as it contains many scenes with dim light.
Music video dataset. We introduce a new dataset of 8 music videos from YouTube. It is challenging to track multiple faces in these videos due to large variations caused by frequent shot/scene changes, large appearance variations, and rapid camera motion. Three sequences (T-ara, Westlife and Pussycat Dolls) are recorded from live music performance with multiple cameras in different views. The other sequences (Bruno Mars, Apink, Hello Bubble, Darling and Girls Aloud) are MTV videos. Faces in these videos often undergo large appearance variations due to changes in pose, scale, makeup, illumination, camera motion, and occlusions.
6.3 Evaluation Metrics
We evaluate the proposed method in two main aspects. First, to evaluate the effectiveness of the learned video-specific features, we use a bottom-up HAC algorithm to merge pairs of tracklets until all tracklets have been merged into a pre-defined number of clusters (i.e., the actual number of people in the video). We measure the quality of clustering using the weighted purity:

$W = \frac{1}{M} \sum_{c=1}^{C} m_c \cdot p_c,$

where each cluster $c$ contains $m_c$ elements and its purity $p_c$ is measured as the fraction of the largest number of faces from the same person to $m_c$, and $M$ denotes the total number of faces in the video.
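Under this definition, the weighted purity reduces to the total number of majority-identity faces divided by the total number of faces. A minimal sketch:

```python
from collections import Counter

def weighted_purity(clusters):
    """clusters: one list of ground-truth person ids per predicted cluster.

    Each cluster's purity is the fraction of its most frequent identity,
    weighted by the cluster's share of all faces.
    """
    total = sum(len(c) for c in clusters)
    return sum((len(c) / total) * (Counter(c).most_common(1)[0][1] / len(c))
               for c in clusters)
```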
Second, we evaluate the method with the metrics commonly used in multi-target tracking , including Recall, Precision, F1, FAF, IDS, Frag, MOTA, and MOTP. We list the definitions of these metrics in the supplementary material available at http://vllab1.ucmerced.edu/~szhang/FaceTracking/. The up and down arrows indicate whether higher or lower scores are better for each metric.
6.4 Evaluation on Features
We evaluate the proposed adaptive features against several alternatives summarized in Table I.
| Method Name | Feature Dimension | Architecture | Training Loss | Description |
|---|---|---|---|---|
| HOG | 4,356 | - | - | A conventional hand-crafted feature |
| AlexNet | 4,096 | AlexNet | Softmax loss | A generic feature representation |
| Pre-trained | 4,096 | AlexNet | Softmax loss | Face representation trained on the WebFace dataset |
| VGG-Face | 4,096 | VGG-16 | Softmax loss | A publicly available face descriptor |
| VGG-Face-ULDML | 4,096 | VGG-16 | ULDML | A Mahalanobis mapping from the VGG-Face features |
| Ours-Triplet | 64 | AlexNet | Triplet loss | Trained with traditional spatio-temporal constraints |
| Ours-Siamese | 64 | AlexNet | Contrastive loss | Trained with traditional spatio-temporal constraints |
| Ours-SymTriplet | 64 | AlexNet | SymTriplet loss | Trained with traditional spatio-temporal constraints |
| Ours-SymTriplet-Contx | 64 | AlexNet | SymTriplet loss | Trained with the contextual constraints |
| Ours-SymTriplet-BBT02 | 64 | AlexNet | SymTriplet loss | Trained on the BBT02 video with spatio-temporal constraints |
Tables II and III: clustering purity on the BBT and BUFFY datasets and on the music videos (T-ara, Pussycat Dolls, Bruno Mars, Hello Bubble, Darling, Apink, Westlife, Girls Aloud).
Adaptive features vs. off-the-shelf features. We evaluate the proposed features (Ours-Siamese, Ours-Triplet, Ours-SymTriplet, and Ours-SymTriplet-Contx) adapted to a specific video against the off-the-shelf features (HOG, AlexNet, Pre-trained, and VGG-Face) in Tables II and III. Identity-preserving features (Pre-trained and VGG-Face) trained offline on face datasets achieve better performance than generic feature representations (e.g., AlexNet and HOG). Our video-specific features trained with the Siamese and triplet networks perform favorably against the other alternatives, highlighting the importance of learning video-specific features. For example, on the Darling sequence, the proposed method with the Ours-SymTriplet-Contx features achieves a weighted purity of 0.76, significantly outperforming the off-the-shelf features (VGG-Face: 0.20, AlexNet: 0.18, and HOG: 0.19). Overall, the results with the proposed features are more than twice as accurate as those using off-the-shelf features on the music videos. For the BBT dataset, the proposed feature adaptation consistently outperforms the off-the-shelf features.
Measuring the effectiveness of features via clustering. Here, we validate the effectiveness of the proposed features compared to the baselines. Figure 7 shows the results in terms of clustering purity versus the number of clusters on 7 BBT sequences and 5 music videos. The ideal line (purple dashed line) means that all faces are correctly grouped with weighted purity $W = 1$. More effective features approach a weighted purity of 1 at a faster rate. For each feature type, we show the weighted purity at the ideal number of clusters (i.e., the number of people in the video) in the legend.
Figure 7 panels: (a) T-ara, (b) Pussycat Dolls, (c) Bruno Mars, (d) HelloBubble, (e) Darling, (f) Apink, (g) Westlife, (h) Girls Aloud, (i) BUFFY02, (j) BUFFY05, (k) BUFFY06, (l) BBT01, (m) BBT02, (n) BBT03, (o) BBT04, (p) BBT05, (q) BBT06, (r) BBT07.
Figure 8 shows a 2D visualization of the extracted features from the T-ara sequence using the t-SNE algorithm . The visualization illustrates the difficulty of handling large appearance variations in unconstrained videos. For HOG features, there are no clear cluster structures, and faces of the same person are scattered around. Although the AlexNet and Pre-trained features increase inter-person distances, the clusters of the same person do not appear in close proximity. In contrast, the proposed adaptive features form tighter clusters for the same person and greater separation between different persons.
Figure 8 panels: HOG (4356-D), AlexNet (4096-D), Pre-trained (4096-D), Ours-SymTriplet (64-D).
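A visualization in the spirit of Figure 8 can be produced with an off-the-shelf t-SNE implementation. The sketch below uses scikit-learn on toy Gaussian features as a stand-in for extracted face features; the library and parameter choices are illustrative, not those of the paper.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Toy stand-in for face features: two identities, 64-D embeddings.
features = np.vstack([rng.normal(0, 1, (50, 64)),
                      rng.normal(3, 1, (50, 64))])
labels = np.array([0] * 50 + [1] * 50)

# Project to 2D for visualization.
emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(features)

plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="coolwarm", s=8)
plt.title("t-SNE of face features (toy data)")
plt.savefig("tsne_faces.png")
```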
Nonlinear and linear metric learning. Unlike several existing approaches [43, 31, 45, 44, 58] that rely on hand-crafted features and linear metric learning, we use deep nonlinear metric learning by fine-tuning all layers to learn discriminative face representations. To demonstrate the contribution of nonlinear metric learning, we compare our adaptive features with VGG-Face-ULDML, which learns a Mahalanobis distance on the VGG-Face features, in Table II. The proposed method with adaptive features and the nonlinear metric achieves higher clustering purity than VGG-Face-ULDML on all videos. For example, on the T-ara sequence, the clustering purity of Ours-SymTriplet-Contx and VGG-Face-ULDML is 0.84 and 0.26, respectively.
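For reference, the linear baseline's distance has a simple closed form: ULDML-style methods learn a Mahalanobis metric M = LᵀL over fixed features, which is equivalent to Euclidean distance after a single linear projection. A minimal sketch (the actual ULDML optimization is not reproduced here):

```python
import numpy as np

def mahalanobis_sq(x, y, L):
    """Linear metric learning: d_M(x, y) = (x - y)^T M (x - y) with
    M = L^T L, i.e., Euclidean distance after the linear map L."""
    d = L @ (x - y)
    return float(d @ d)

# With L = identity, this reduces to the squared Euclidean distance.
x, y = np.array([1.0, 0.0]), np.array([0.0, 1.0])
L = np.eye(2)
print(mahalanobis_sq(x, y, L))  # 2.0
```

In contrast, the proposed approach replaces the single linear map L with the full nonlinear mapping of the fine-tuned CNN.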
SymTriplet and conventional Siamese/Triplet loss. We demonstrate the effectiveness of the proposed SymTriplet loss (Ours-SymTriplet) with comparisons to the contrastive loss (Ours-Siamese) and the triplet loss (Ours-Triplet) on all videos in Table II. The proposed method with the SymTriplet loss performs well against the other methods since positive sample pairs are pulled closer and negative samples are pushed away from both positives. For example, on the BBT05 sequence, Ours-SymTriplet (0.85) achieves higher clustering purity than Ours-Triplet (0.68) and Ours-Siamese (0.70).
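Under the description above, the symmetric triplet objective can be sketched as follows, assuming squared Euclidean distances over a triplet in which (f1, f2) is the positive pair and f3 the negative; the margin value is an illustrative hyperparameter.

```python
import numpy as np

def sq_dist(a, b):
    return float(np.sum((a - b) ** 2))

def symtriplet_loss(f1, f2, f3, margin=1.0):
    """Symmetric triplet loss: (f1, f2) is the positive pair, f3 the
    negative. Pulls the positive pair together while pushing the
    negative away from BOTH positives symmetrically."""
    pos = sq_dist(f1, f2)
    neg = 0.5 * (sq_dist(f1, f3) + sq_dist(f2, f3))
    return max(0.0, pos - neg + margin)

# Well-separated negative -> zero loss; nearby negative -> positive loss.
a, b = np.array([0.0, 0.0]), np.array([0.1, 0.0])
far, near = np.array([5.0, 0.0]), np.array([0.2, 0.0])
print(symtriplet_loss(a, b, far))   # 0.0
print(symtriplet_loss(a, b, near))  # > 0
```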
Contextual and spatio-temporal constraints. We evaluate the effectiveness of contextual constraints. Using the SymTriplet loss, we compare the features learned from only spatio-temporal constraints (Ours-SymTriplet) with those learned from both contextual and spatio-temporal constraints (Ours-SymTriplet-Contx). Table II shows that Ours-SymTriplet-Contx achieves better performance than Ours-SymTriplet on all videos. We attribute the improvement to the additional positive and negative face pairs discovered through contextual cues and the transitive constraint propagation.
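The transitive propagation of pairwise constraints can be sketched with a union-find structure: must-link (positive) pairs merge tracklet groups, and each cannot-link (negative) pair then generalizes to all cross-pairs of the two groups. This is an illustrative sketch, not the paper's exact implementation.

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def propagate(n, positives, negatives):
    """Transitively expand pairwise constraints over n tracklets."""
    uf = UnionFind(n)
    for a, b in positives:          # must-link: merge groups
        uf.union(a, b)
    groups = {}
    for i in range(n):
        groups.setdefault(uf.find(i), []).append(i)
    # A cannot-link constraint generalizes to every cross-pair
    # between the two merged groups.
    neg_pairs = set()
    for a, b in negatives:
        for i in groups[uf.find(a)]:
            for j in groups[uf.find(b)]:
                neg_pairs.add((min(i, j), max(i, j)))
    return groups, neg_pairs

groups, negs = propagate(4, positives=[(0, 1), (2, 3)], negatives=[(0, 2)])
print(sorted(negs))  # [(0, 2), (0, 3), (1, 2), (1, 3)]
```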
Comparisons with other face clustering algorithms. We compare our method with five recent state-of-the-art face clustering algorithms [31, 43, 45, 58, 53] on the Frontal, BBT01, BUFFY02, and Notting Hill videos. Table IV shows the clustering accuracy over faces and tracklets (using the same datasets and metrics as [31, 45]; the code and data of some methods are not available). In contrast to the methods in [31, 43, 45, 58], which learn linear transformations over the extracted features, our work learns nonlinear metrics by adapting all layers of the CNN and performs favorably on the Frontal, BBT01, and BUFFY02 sequences. Both [53] and our method discover informative face pairs to adapt the pre-trained models to learn discriminative face representations, and they achieve similar clustering performance on the BUFFY02 and Notting Hill videos.
Comparisons with different feature dimensions. We investigate the effect of the dimensionality of the embedded features. Figure 9 shows the clustering purity versus the number of clusters for different feature dimensions on the Bruno Mars sequence. In general, clustering accuracy is not very sensitive to the feature dimension. However, large feature dimensions (e.g., 512 and 1024) do not perform as well as smaller ones, which we attribute to insufficient training samples. This evaluation also validates our choice of 64-dimensional features for both accuracy and efficiency.
6.5 Multi-face Tracking
Comparisons with state-of-the-art multi-target trackers. We compare the proposed algorithm with several state-of-the-art MTT trackers, including modified versions of TLD, ADMM, and IHTLS, and the methods by Wu et al. [31, 45]. The TLD scheme is a long-term single-target tracker which can re-detect targets of interest when they leave and re-enter a scene. We implement two extensions of TLD for multi-face tracking. The first is the mTLD scheme: in each sequence, we run multiple TLD trackers for all targets, and each TLD tracker is initialized with the ground-truth bounding box in the first frame. For the second extension, we integrate mTLD into our framework (referred to as Ours-mTLD). We use mTLD to generate shot-level trajectories within each shot instead of using the two-threshold and Hungarian algorithms. At the beginning of each shot, we initialize TLD trackers with untracked detections and link the detections in the following frames according to their overlap scores with the TLD outputs.
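The Hungarian linking step mentioned above can be illustrated with SciPy's assignment solver: detections are matched to existing tracks by maximizing total IoU, and low-overlap assignments are rejected. The affinity choice and threshold here are illustrative, not the paper's exact settings.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def link(tracks, detections, min_iou=0.3):
    """Assign detections to existing tracks by maximizing total IoU
    (Hungarian algorithm on the 1 - IoU cost matrix)."""
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= min_iou]

tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(21, 21, 31, 31), (1, 1, 11, 11)]
print(link(tracks, dets))  # [(0, 1), (1, 0)]
```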
Table V shows quantitative results of the proposed algorithm, mTLD, ADMM, and IHTLS on the BBT, BUFFY, and music video datasets. We also show the tracking results with the pre-trained features without adaptation to a specific video. Note that the results in Table V are based on the overall evaluation; results for each individual sequence are provided in the supplementary material (see http://vllab1.ucmerced.edu/~szhang/FaceTracking/).
The mTLD method does not perform well on either dataset in terms of recall, precision, F1, and MOTA. The ADMM and IHTLS schemes often generate numerous identity switches and fragments, as neither method re-identifies persons well when abrupt camera motions or shot changes occur. The tracker with pre-trained features is not effective at re-identifying faces in different shots and achieves low MOTA. The Ours-mTLD scheme has more identity switches (IDS) and fragments (Frag) than Ours-SymTriplet: the shot-level trajectories determined by mTLD are short and noisy, since TLD trackers sometimes drift or fail when large appearance changes occur. In contrast, both Ours-SymTriplet and Ours-SymTriplet-Contx perform well in terms of precision, F1, and MOTA, with significantly fewer identity switches and fragments.
Contribution of linking tracklets. We evaluate the design choices of the two-step linking process: 1) linking tracklets within the shot and 2) linking shot-level tracklets across shots. To this end, we evaluate two other alternatives:
Ours-SymTriplet-noT: without linking tracklets within each shot
Ours-SymTriplet-noC: without clustering the shot-level tracklets across different shots
Table VI shows that both Ours-SymTriplet-noT and Ours-SymTriplet-noC degrade performance in terms of Recall, F1, IDS, Frag, and MOTA. Without linking tracklets within each shot, Ours-SymTriplet-noT cannot recover several missed faces and thus yields lower Recall. Although some separated tracklets within a shot can be grouped by the HAC clustering algorithm, many tracklets may be grouped incorrectly because spatio-temporal coherence within the shot is ignored, which explains the increase in IDS and Frag. Without clustering tracklets across different shots, Ours-SymTriplet-noC assigns each short tracklet in each shot a different identity, which results in significantly more identity switches and lower MOTA.
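The across-shot grouping step can be sketched with SciPy's hierarchical agglomerative clustering over mean tracklet features; average linkage and the cluster-count stopping criterion here are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Mean feature per shot-level tracklet: two identities, 64-D (toy data).
tracklet_feats = np.vstack([rng.normal(0, 0.1, (3, 64)),
                            rng.normal(2, 0.1, (3, 64))])

# Agglomerative clustering with average linkage; stop at 2 identities.
Z = linkage(tracklet_feats, method="average", metric="euclidean")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # the first three and last three tracklets get distinct labels
```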
Qualitative results. Figure 10 shows sample tracking results of our algorithm with the Ours-SymTriplet-Contx features on all eight music videos. Figure 11 shows the results on three BUFFY videos and three selected BBT sequences. The numbers and colors indicate the inferred identities of the targets. The proposed algorithm tracks multiple faces well despite large appearance variations in unconstrained videos. In Figure 10, for example, there are significant changes in scale and appearance (due to makeup and hairstyle) in the Hello Bubble sequence (first row). In the fourth row, the six singers look similar, making multi-face tracking particularly challenging within and across shots. Nonetheless, our approach distinguishes the faces and tracks them reliably with few identity switches. The results in the other rows illustrate that our method generates correct identities and trajectories when the same person appears in different shots or scenes.
6.6 Pedestrian Tracking Across Cameras
In this section, we show that the proposed method for learning adaptive discriminative features from tracklets is also applicable to other objects, e.g., pedestrians or cars in surveillance videos. We validate our approach on the task of pedestrian tracking from multiple non-overlapping cameras.
The problem of multiple target tracking across cameras is challenging, as we need to re-identify people from images acquired at different viewing angles under different imaging conditions. In unconstrained scenes, the appearance of a person can also differ significantly across cameras. Motion cues are unreliable because the views do not overlap and the camera configurations are not known a priori. The re-identification problem becomes even more challenging when a large number of people need to be tracked across views.
Similar to pre-training a CNN on a face recognition dataset for learning identity-preserving features, we first train a CNN for person re-identification using the Market1501 dataset, which contains 32,668 images of 1,501 identities. We evaluate our method on the DukeMTMC dataset, which contains surveillance footage from 8 cameras with approximately 85 minutes of video each.
We conduct the experiment using images from cameras 2 and 5 because they are disjoint and contain the largest number of people. We first use the two-threshold strategy to generate tracklets on the videos from both cameras. Next, we collect training samples from the tracklets using spatio-temporal and contextual constraints. As in the multi-face tracking experiments, we fine-tune the pre-trained CNN with the discovered training samples using the SymTriplet loss function.
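One common form of the two-threshold strategy links a detection pair between adjacent frames only when its affinity is both high in absolute terms and clearly better than any conflicting alternative; the sketch below follows that conservative rule with illustrative threshold values.

```python
def two_threshold_link(affinity, high=0.7, margin=0.2):
    """Conservatively link detections in frame t to frame t+1.
    A pair is linked only if its affinity exceeds `high` AND beats
    every conflicting alternative in its row/column by `margin`."""
    links = []
    n_rows, n_cols = len(affinity), len(affinity[0])
    for i in range(n_rows):
        for j in range(n_cols):
            a = affinity[i][j]
            if a < high:
                continue
            # Conflicting alternatives: same row or same column.
            rivals = [affinity[i][k] for k in range(n_cols) if k != j]
            rivals += [affinity[k][j] for k in range(n_rows) if k != i]
            if not rivals or a - max(rivals) >= margin:
                links.append((i, j))
    return links

# Detection 0 matches detection 1 unambiguously; row 1 is ambiguous.
aff = [[0.1, 0.9],
       [0.6, 0.55]]
print(two_threshold_link(aff))  # [(0, 1)]
```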
After extracting the learned features for each detection, we first link the tracklets within each camera into camera-level trajectories. We then group these camera-level trajectories into tracking results across the two cameras. Following Ristani et al., we measure the tracking performance using identification precision (IDP), identification recall (IDR), and the corresponding F1 score IDF1, as well as other metrics. The identification precision (recall) is the fraction of computed (ground-truth) detections that are correctly identified. The IDF1 metric is the ratio of correctly identified detections to the average number of ground-truth and computed detections. ID precision and ID recall reflect tracking trade-offs, while IDF1 ranks all methods on a single scale that balances identification precision and recall through their harmonic mean. Table VII shows the tracking results on both cameras of the DukeMTMC dataset. Overall, the proposed method performs favorably against the other methods in terms of IDS, MOTA, IDP, and IDF1. We show sample visual results on the DukeMTMC dataset in Figure 12. Persons 237 and 283 both appear in Camera 2 and Camera 5 and are correctly matched across cameras by our method.
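Given the count of correctly identified detections (IDTP) from the optimal truth-to-result identity matching, the identity metrics follow directly from their definitions; a minimal sketch, assuming the counts have already been computed:

```python
def id_metrics(idtp, num_pred, num_gt):
    """Identity precision/recall and IDF1 from the number of correctly
    identified detections (idtp), total predicted detections, and
    total ground-truth detections."""
    idp = idtp / num_pred                  # fraction of predictions correctly ID'd
    idr = idtp / num_gt                    # fraction of ground truth correctly ID'd
    idf1 = 2 * idtp / (num_pred + num_gt)  # harmonic mean of IDP and IDR
    return idp, idr, idf1

idp, idr, idf1 = id_metrics(idtp=80, num_pred=100, num_gt=160)
print(idp, idr, round(idf1, 4))  # 0.8 0.5 0.6154
```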
|Ristani et al.|866|1929|49.2%|61.7%|69.1%|63.8%|66.3%|
|Ristani et al.|162|292|73.1%|70.5%|84.9%|68.0%|75.5%|
While the proposed algorithm performs favorably against state-of-the-art face tracking and clustering methods on challenging video sequences, it has three main limitations. First, as our algorithm takes face detections as inputs, the tracking performance depends on whether faces can be reliably detected. For example, in the fourth row of Figure 10, the leftmost person is not detected in frame 419 and the next few frames due to occlusion. In addition, falsely detected faces can be incorrectly linked into a trajectory, e.g., the Marilyn Monroe image on the T-shirt in frame 5,704 in the eighth row of Figure 10.
Second, the proposed algorithm may not perform well on sequences where many shots contain only one single person. We show in Figure 13 two failure cases in the Darling and Apink sequences. In such cases, the proposed method does not generate negative face pairs for training the Siamese/triplet network for distinguishing similar faces. As such, different persons are incorrectly identified as the same one. One remedy is to exploit other weak supervision signals (e.g., scripts, voice, contextual information) to generate visual constraints for different scenarios.
Third, the CNN fine-tuning process is time-consuming: around 1 hour on an NVIDIA GTX 980 Ti GPU for 10,000 back-propagation iterations. Two approaches may alleviate this issue. First, we may use faster training algorithms. Second, for TV sitcom episodes we can use one or a few videos for feature adaptation and apply the learned features to all other episodes; the features only need to be adapted once, since the main characters are the same. In Table II, we train the Ours-SymTriplet features on BBT02 (referred to as Ours-SymTriplet-BBT02) and evaluate on the other episodes. Although the weighted purity of Ours-SymTriplet-BBT02 is slightly inferior to that of Ours-SymTriplet, it still outperforms the pre-trained and VGG-Face features.
In this paper, we tackle the multi-face tracking problem in unconstrained videos by learning video-specific features. We first pre-train a CNN on a large-scale face recognition dataset to learn an identity-preserving face representation. We then adapt the pre-trained CNN using training samples extracted through spatio-temporal and contextual constraints. To learn discriminative features that handle the large appearance variations of faces in a specific video, we propose the SymTriplet loss function. Using the learned features to model face tracklets, we apply a hierarchical clustering algorithm to link face tracklets across multiple shots. Beyond multi-face tracking, we demonstrate that the proposed algorithm also applies to other domains such as pedestrian tracking across multiple cameras. Experimental results show that the proposed algorithm outperforms state-of-the-art methods in terms of clustering accuracy and tracking performance. As the performance of our approach depends on the automatically discovered visual constraints in the video, we believe that exploiting multi-modal information (e.g., sound/script alignment) is a promising direction for further improvement.
The work is supported by National Basic Research Program of China (973 Program, 2015CB351705), NSFC (61332018, 61703344), Office of Naval Research (N0014-16-1-2314), R&D programs by NRF (2014R1A1A2058501) and MSIP/IITP (IITP-2016-H8601-16-1005) of Korea, NSF CAREER (1149783) and gifts from Adobe, Panasonic, NEC, and NVIDIA.
-  W. Brendel, M. Amer, and S. Todorovic, “Multiobject tracking as maximum weight independent set,” in CVPR, 2011.
-  R. T. Collins, “Multitarget data association with higher-order motion models,” in CVPR, 2012.
-  B. Yang and R. Nevatia, “Multi-target tracking by online learning of non-linear motion patterns and robust appearance models,” in CVPR, 2012.
-  L. Zhang, Y. Li, and R. Nevatia, “Global data association for multi-object tracking using network flows,” in CVPR, 2008.
-  X. Zhao, D. Gong, and G. Medioni, “Tracking using motion patterns for very crowded scenes,” in ECCV, 2012.
-  N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012.
-  J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition,” in ICML, 2014.
-  Y. Sun, X. Wang, and X. Tang, “Deep learning face representation from predicting 10,000 classes,” in CVPR, 2014.
-  Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” in NIPS, 2014.
-  F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” CVPR, 2015.
-  J. Hu, J. Lu, and Y.-P. Tan, “Discriminative deep metric learning for face verification in the wild,” in CVPR, 2014.
-  S. Zhang, Y. Gong, J.-B. Huang, J. Lim, J. Wang, N. Ahuja, and M.-H. Yang, “Tracking persons-of-interest via adaptive discriminative features,” in ECCV, 2016.
-  E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” in ECCVW, 2016.
-  M. Andriluka, S. Roth, and B. Schiele, “People-tracking-by-detection and people-detection-by-tracking,” in CVPR, 2008.
-  C. Stauffer, “Estimating tracking sources and sinks,” in CVPR, 2003.
-  A. A. Perera, C. Srinivas, A. Hoogs, G. Brooksby, and W. Hu, “Multi-object tracking through simultaneous long occlusions and split-merge conditions,” in CVPR, 2006.
-  R. Kaucic, A. A. Perera, G. Brooksby, J. Kaufhold, and A. Hoogs, “A unified framework for tracking through occlusions and across sensor gaps,” in CVPR, 2005.
-  B. Leibe, K. Schindler, and L. Van Gool, “Coupled detection and trajectory estimation for multi-object tracking,” in ICCV, 2007.
-  H. Jiang, S. Fels, and J. J. Little, “A linear programming approach for multiple object tracking,” in CVPR, 2007.
-  J. Berclaz, F. Fleuret, E. Turetken, and P. Fua, “Multiple object tracking using k-shortest paths optimization,” PAMI, vol. 33, no. 9, pp. 1806–1819, 2011.
-  A. Andriyenko and K. Schindler, “Multi-target tracking by continuous energy minimization,” in CVPR, 2011.
-  A. Andriyenko, K. Schindler, and S. Roth, “Discrete-continuous optimization for multi-target tracking,” in CVPR, 2012.
-  J. Xing, H. Ai, and S. Lao, “Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses,” in CVPR, 2009.
-  C. Huang, B. Wu, and R. Nevatia, “Robust object tracking by hierarchical association of detection responses,” in ECCV, 2008.
-  B. Yang and R. Nevatia, “Online learned discriminative part-based appearance models for multi-human tracking,” in ECCV, 2012.
-  C. Huang, Y. Li, H. Ai et al., “Robust head tracking with particles based on multiple cues,” in ECCVW, 2006.
-  Y. Li, H. Ai, T. Yamashita, S. Lao, and M. Kawade, “Tracking in low frame rate video: A cascade particle filter with discriminative observers of different lifespans,” in CVPR, 2007.
-  H. Ben Shitrit, J. Berclaz, F. Fleuret, and P. Fua, “Tracking multiple people under global appearance constraints,” in ICCV, 2011.
-  Y. Li, C. Huang, and R. Nevatia, “Learning to associate: Hybridboosted multi-target tracker for crowded scene,” in CVPR, 2009.
-  B. Wu, S. Lyu, B.-G. Hu, and Q. Ji, “Simultaneous clustering and tracklet linking for multi-face tracking in videos,” in ICCV, 2013.
-  M. Roth, M. Bauml, R. Nevatia, and R. Stiefelhagen, “Robust multi-pose face tracking by multi-stage tracklet association,” in ICPR, 2012.
-  B. Wang, G. Wang, K. L. Chan, and L. Wang, “Tracklet association with online target-specific metric learning,” in CVPR, 2014.
-  C.-H. Kuo and R. Nevatia, “How does person identity recognition help multi-person tracking?” in CVPR, 2011.
-  P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in CVPR, 2001.
-  B. Fulkerson, A. Vedaldi, and S. Soatto, “Localizing objects with smart dictionaries,” in ECCV, 2008.
-  D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, vol. 60, no. 2, pp. 91–110, 2004.
-  S. Zhang, J. Wang, Z. Wang, Y. Gong, and Y. Liu, “Multi-target tracking by learning local-to-global trajectory models,” PR, vol. 48, no. 2, pp. 580–590, 2015.
-  C.-H. Kuo, C. Huang, and R. Nevatia, “Multi-target tracking by on-line learned discriminative appearance models,” in CVPR, 2010.
-  M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool, “Robust tracking-by-detection using a detector confidence particle filter,” in ICCV, 2009.
-  R. T. Collins, Y. Liu, and M. Leordeanu, “Online selection of discriminative tracking features,” PAMI, vol. 27, no. 10, pp. 1631–1643, 2005.
-  H. Grabner and H. Bischof, “On-line boosting and vision,” in CVPR, 2006.
-  R. G. Cinbis, J. Verbeek, and C. Schmid, “Unsupervised metric learning for face identification in tv video,” in ICCV, 2011.
-  M. Tapaswi, O. M. Parkhi, E. Rahtu, E. Sommerlade, R. Stiefelhagen, and A. Zisserman, “Total cluster: A person agnostic clustering method for broadcast videos,” in ICVGIP, 2014.
-  B. Wu, Y. Zhang, B.-G. Hu, and Q. Ji, “Constrained clustering and its application to face clustering in videos,” in CVPR, 2013.
-  E. El Khoury, C. Senac, and P. Joly, “Face-and-clothing based people clustering in video content,” in ICMR, 2010.
-  M. Bauml, M. Tapaswi, and R. Stiefelhagen, “Semi-supervised learning with constraints for person identification in multimedia data,” in CVPR, 2013.
-  J. Sivic, M. Everingham, and A. Zisserman, ““Who are you?” – Learning person specific classifiers from video,” in CVPR, 2009.
-  M. Everingham, J. Sivic, and A. Zisserman, ““Hello! My name is… Buffy” – Automatic naming of characters in tv video,” in BMVC, 2006.
-  G. Paul, K. Elie, M. Sylvain, O. Jean-Marc, and D. Paul, “A conditional random field approach for audio-visual people diarization,” in ICASSP, 2014.
-  C. Zhou, C. Zhang, H. Fu, R. Wang, and X. Cao, “Multi-cue augmented face clustering,” in ACM MM, 2015.
-  Z. Tang, Y. Zhang, Z. Li, and H. Lu, “Face clustering in videos with proportion prior.” in IJCAI, 2015.
-  Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Joint face representation adaptation and clustering in videos,” in ECCV, 2016.
-  D. Ramanan, S. Baker, and S. Kakade, “Leveraging archival video for building face datasets,” in ICCV, 2007.
-  M. Tapaswi, M. Bauml, and R. Stiefelhagen, ““Knock! Knock! Who is it?” probabilistic person identification in tv-series,” in CVPR, 2012.
-  D. Anguelov, K.-c. Lee, S. B. Gokturk, and B. Sumengen, “Contextual identity recognition in personal photo albums,” in CVPR, 2007.
-  D. Lin, A. Kapoor, G. Hua, and S. Baker, “Joint people, event, and location recognition in personal photo collections using cross-domain context,” in ECCV, 2010.
-  S. Xiao, M. Tan, and D. Xu, “Weighted block-sparse low rank representation for face clustering in videos,” in ECCV, 2014.
-  Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “DeepFace: Closing the gap to human-level performance in face verification,” in CVPR, 2014.
-  O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in BMVC, 2015.
-  Z. Kalal, K. Mikolajczyk, and J. Matas, “Tracking-learning-detection,” PAMI, vol. 34, no. 7, pp. 1409–1422, 2012.
-  F. Pernici, “Facehugger: The alien tracker applied to faces,” in ECCV, 2012.
-  D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from scratch,” arXiv, 2014.
-  Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACM MM, 2014.
-  M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool, “Face detection without bells and whistles,” in ECCV, 2014.
-  M. Du and R. Chellappa, “Face association for videos using conditional random fields and max-margin markov networks,” PAMI, vol. 38, no. 9, pp. 1762–1773, 2016.
-  L. Bourdev and J. Malik, “Poselets: Body part detectors trained using 3d human pose annotations,” in ICCV, 2009.
-  S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in CVPR, 2005.
-  R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in CVPR, 2006.
-  L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE,” JMLR, vol. 9, pp. 2579–2605, 2008.
-  M. Ayazoglu, M. Sznaier, and O. I. Camps, “Fast algorithms for structured robust principal component analysis,” in CVPR, 2012.
-  C. Dicle, O. I. Camps, and M. Sznaier, “The way they move: Tracking multiple targets with similar appearance,” in ICCV, 2013.
-  L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in ICCV, 2015.
-  Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, “Neural networks with few multiplications,” arXiv, 2015.