Tracking Persons-of-Interest via Unsupervised Representation Adaptation

Multi-face tracking in unconstrained videos is a challenging problem as faces of one person often appear drastically different in multiple shots due to significant variations in scale, pose, expression, illumination, and make-up. Existing multi-target tracking methods often use low-level features which are not sufficiently discriminative for identifying faces with such large appearance variations. In this paper, we tackle this problem by learning discriminative, video-specific face representations using convolutional neural networks (CNNs). Unlike existing CNN-based approaches which are only trained on large-scale face image datasets offline, we use the contextual constraints to generate a large number of training samples for a given video, and further adapt the pre-trained face CNN to specific videos using discovered training samples. Using these training samples, we optimize the embedding space so that the Euclidean distances correspond to a measure of semantic face similarity via minimizing a triplet loss function. With the learned discriminative features, we apply the hierarchical clustering algorithm to link tracklets across multiple shots to generate trajectories. We extensively evaluate the proposed algorithm on two sets of TV sitcoms and YouTube music videos, analyze the contribution of each component, and demonstrate significant performance improvement over existing techniques.




1 Introduction

Multi-target tracking (MTT) aims at locating all targets of interest (e.g., faces, pedestrians, players, and cars) and inferring their trajectories in a video over time while maintaining their identities. This problem is at the core of numerous computer vision applications such as video surveillance, robotics, and sports analysis. Multi-face tracking is one important domain of MTT that can be applied to high-level video understanding tasks such as face recognition, content-based retrieval, and interaction analysis.

The problem of multi-face tracking is particularly challenging in unconstrained scenarios where the videos are generated from multiple moving cameras with different views or scenes as shown in Figure 1. Examples include automatic character tracking in movies, TV sitcoms, or music videos. It has attracted increasing attention in recent years due to the fast-growing popularity of such videos on the Internet. Unlike tracking in the constrained counterparts (e.g., a video from a single stationary or moving camera) where the main challenge is to deal with occlusions and intersections, multi-face tracking in unconstrained videos needs to address the following issues: (1) A video often consists of many shots and the contents of two neighboring shots may be dramatically different; (2) It entails dealing with re-identifying faces with large appearance variations due to changes in scale, pose, expression, illumination, and makeup in different shots or scenes; and (3) The results of face detection may be unreliable due to low resolution, occlusion, nonrigid deformation, motion blurring and complex backgrounds.

Fig. 1: Multi-face tracking. We tackle the problem of tracking multiple faces of people while maintaining their identities in unconstrained videos. Such videos consist of many shots from different cameras. The main challenge is to address large appearance variations of faces from different shots due to changes in pose, view angle, scale, makeup, illumination, camera motion and heavy occlusions.

Multi-target tracking has been extensively studied in the literature with the primary focus on humans. Recent approaches often address multi-face tracking by tracking-by-detection techniques. These methods first apply an object detector to locate faces in every frame, and then apply data association approaches [1, 2, 3, 4, 5] that use visual cues (e.g., appearance, position, motion, and size) in an affinity model to link detections or tracklets (track fragments) into trajectories. Such methods are effective when the targets are continuously detected and when the camera is either stationary or slowly moving. However, for unconstrained videos with many shot changes and intermittent appearance of targets, the data association problem becomes more difficult because assumptions such as appearance and size consistency and continuous motion no longer hold in neighboring shots. Therefore, the design of discriminative features plays a critical role in identifying faces across shots in unconstrained scenarios.

Existing MTT methods [3, 4, 5] use combinations of low-level features such as color histograms, Haar-like features, or HOG [6] to construct an appearance model for each target. However, these hand-crafted features often are not sufficiently discriminative to identify faces with large appearance changes. For example, low-level features extracted from faces of two different persons under the same pose (e.g., frontal poses) are likely more similar than those extracted from faces of the same person under different poses (e.g., frontal and profile poses).

Deep convolutional neural networks (CNNs) have demonstrated significant performance improvements on recognition tasks, e.g., image classification [7]. The features extracted from the activation of a pre-trained CNN have been shown to be effective for generic visual recognition tasks [8]. In particular, CNN-based features have shown impressive performance on face recognition and verification tasks [9, 10, 11, 12]. These models are often trained using large-scale face recognition datasets in a fully supervised manner and then serve as feature extractors for unseen face images. However, these models may not achieve good performance in unconstrained videos as the visual domains of the training and test sets may be significantly different.

In this paper, we address this domain shift by adapting a pre-trained CNN to the specific videos. Due to the lack of manual annotations of target identities, we collect a large number of training samples of faces by exploiting contextual constraints of tracklets in the video. With these automatically discovered training samples, we adapt the pre-trained CNN so that the Euclidean distance between the embedded features reflects the semantic distance between face images. We incorporate these discriminative features into a hierarchical agglomerative clustering algorithm to link tracklets across multiple shots into trajectories. We analyze the contribution of each component in the proposed algorithm and demonstrate the effectiveness of the learned features to identify characters in 10 long TV sitcom episodes and singers in 8 challenging music videos. We further apply our adaptive feature learning approach to other objects (e.g., pedestrians) and show competitive performance on pedestrian tracking across cameras. The preliminary results have been presented in [13].

We make the following contributions in this work:

  • Unlike existing work that uses linear metric learning on hand-crafted features, we account for large appearance variations of faces in videos by learning video-specific features with the deep contrastive and triplet-based metric learning on automatically discovered samples through contextual constraints.

  • We propose an improved triplet loss function which helps the learned model simultaneously pull positive pairs closer and push negative samples away from the positive pairs.

  • In contrast to prior work that often uses face tracks with false positives manually removed, we take raw video as the input and perform detection, tracking, clustering, and feature adaptation in a fully automatic way.

  • We develop a new dataset with 8 music videos from YouTube containing annotations of 3,845 face tracklets and 117,598 face detections. This benchmark dataset is challenging (with frequent shot changes, large appearance variations, and rapid camera motion) and crucial for evaluating multi-face tracking algorithms in unconstrained environments.

  • We demonstrate the proposed adaptive feature learning approach can be extended to tracking other objects, and present empirical results on pedestrian tracking across cameras using the DukeMTMC dataset [14].

2 Related Work and Problem Context

Multi-target tracking. In recent years, numerous multi-target tracking methods have been developed by applying a pre-learned object detector to locate instances in every frame, and determining the trajectories by solving a data association problem [1, 2, 3, 4, 5]. A plethora of global optimization methods have been developed for association based on the Viterbi decoding scheme [15], Hungarian algorithm [16, 17, 18], quadratic Boolean programming [19], maximum weight-independent sets [1], linear programming [20, 21], energy minimization [22, 23], and min-cost network flow [4]. Some methods tackle the data association problem with a hierarchical association framework. For example, Xing et al. [24] propose a two-stage association method to combine local and global tracklet association to track multiple targets. Huang et al. [25] propose a three-level association approach by first linking the detections from consecutive frames into short tracklets at the bottom level and then applying an iterative Hungarian algorithm and an EM algorithm at higher levels. Yang et al. [26] extend the three-level work in [25] through learning an online discriminative appearance model.

Fig. 2: Algorithm pipeline. Our multi-face tracking algorithm has four main steps: (a) Pre-training a CNN on a large-scale face recognition dataset to learn identity-preserving features, (b) Generating face pairs or face triplets from the tracklets in a specific video with the proposed spatio-temporal constraints and contextual constraints, (c) Adapting the pre-trained CNN to learn video-specific features from the automatically generated training samples, and (d) Linking tracklets within each shot and then across shots to form the face trajectories.

Data association can also be formulated as a linear assignment problem. Existing algorithms typically integrate appearance and motion cues into an affinity model to infer and link detections (or tracklets) into trajectories [1, 4, 23, 27, 28]. However, these MTT methods do not perform well in unconstrained videos where abrupt changes across different shots occur, and the assumptions of smooth appearance change no longer hold.

To identify targets across shots, discriminative appearance features are required to discern targets in various circumstances. Most existing multi-target tracking methods [4, 29, 25, 30, 31] use color histograms as features and Bhattacharyya distance or correlation coefficient as affinity measures. Several methods [23, 22, 32, 33, 34] use hand-crafted features, e.g., Haar-like [35], SIFT [36, 37], HOG [6] features, or their combination [32, 33, 34]. For robustness, some approaches [38, 39, 26, 3] adaptively select the most discriminative features for a specific video [40, 41, 42]. However, none of these hand-crafted feature representations are tailored for faces, and thus they are less effective at handling the large appearance variations in unconstrained scenarios.

Visual constraints in multi-target tracking. Several approaches [43, 44, 33, 31, 45] exploit visual constraints from videos for improving tracking performance. These visual constraints are often derived from the spatio-temporal relationships among the extracted tracklets [43, 45]. Two types of constraints are commonly used: 1) all samples in the same tracklet represent the same object and 2) a pair of tracklets in the same frame indicates that two different objects are present. Prior work uses these constraints either implicitly for learning a cast-specific metric [43, 33] or explicitly for linking clusters or tracklets [45, 31].

Numerous cues from contextual constraints have also been used for tracking, e.g., clothing [46], script [47, 48, 49], speech [50], gender [51], video editing style [44], clustering prior [52], and dynamic clustering constraints [53]. For example, the methods in [54, 55, 56] incorporate clothing appearance for improving person identification accuracy. Lin et al. [57] present a probabilistic context model to jointly tag people across multiple domains of people, events, and locations. The work in [49, 56] exploits speaker analysis to improve face labeling.

In this work, we exploit visual constraints generated in a way similar to [45, 44, 58, 53]. The proposed algorithm differs from previous methods in three aspects. First, existing approaches often rely on hand-crafted features and learn a linear transformation over the extracted features, which may not be effective in modeling large appearance variations of faces. We learn a discriminative face feature representation specific to a video by adapting all layers in a deep neural network. Second, previous work often uses face tracks with false positives removed manually [31, 43, 45, 44, 58]. In contrast, our approach takes a raw video as the input and performs detection, tracking, clustering, and feature adaptation without these pre-processing steps. Third, unlike the work in [53] that discovers negative pairs by thresholding based on feature distances, we discover negative pairs by transitively propagating the relationship among co-occurring tracklets using the proposed contextual constraints.

CNN-based representation learning. Recent face recognition and verification methods focus on learning identity-preserving feature representations from deep neural networks. While the models may differ, these CNN-based face representations (e.g., DeepID [9], DeepFace [59], FaceNet [11], VGG-Face [60]) are learned by training CNNs using large-scale datasets in a fully supervised manner. These CNNs then operate as feature extractors for face recognition, identification, and face clustering. In this work, we also use a CNN to learn identity-preserving features from a face recognition dataset. The main difference is that we further adapt the pre-trained representation to a specific video, thereby improving the specificity of the model and enhancing its discriminative strength. In addition, we introduce a symmetric triplet-based loss function and demonstrate its effectiveness over the commonly used contrastive loss and triplet loss.

Long-term object tracking. The goal of long-term object tracking [61, 62] is to locate a specific target over time even when the target leaves and re-enters the scene. These trackers perform well on various types of targets such as cars and faces. However, online trackers are designed to handle scenes recorded by a stationary or slow-moving camera and are thus not effective in tracking faces in unconstrained videos for two reasons. First, these trackers are prone to drift due to online model updates with noisy examples. Second, hand-crafted features are not sufficiently discriminative to re-identify faces across shots. We tackle the first issue by processing the video offline, i.e., applying a face detector in every frame and associating all tracklets in the video. For the second issue, we learn an adaptive discriminative representation to account for large appearance variations of faces across shots or scenes.

3 Algorithmic Overview

Our goal is to track multiple faces across shots in an unconstrained video while maintaining identities of the persons of interest. To achieve this, we learn discriminative features that are adapted to the appearance variations in the specific videos. We then use a hierarchical clustering algorithm to link tracklets across shots into long trajectories. The main steps of the proposed algorithm are summarized below and in Figure 2.

  1. Pre-training: We pre-train a CNN model based on the AlexNet [7] using an external face dataset to learn identity-preserving features (Section 4.1).

  2. Discovering training samples: We detect shot changes and divide a video into non-overlapping shots. Within each shot, we apply a face detector and link adjacent detections into short tracklets. We discover a large collection of training samples (in pairs or triplets) from tracklets based on the spatio-temporal and contextual constraints (Section 4.2).

  3. Learning video-specific features: We adapt the pre-trained CNN model using the automatically discovered training samples to account for large appearance changes of faces pertaining to a specific video (Section 4.3). We present an improved triplet loss to enhance the discriminative ability of the learned features.

  4. Linking tracklets: Within each shot, we use a conventional multi-face tracking method to link tracklets into short trajectories. We use a hierarchical clustering algorithm to link trajectories across shots. Finally, we assign the same identity to all tracklets in each cluster (Section 5).

4 Learning Discriminative Features

In this section, we present the algorithmic details for learning video-specific features. After describing how the generic face features are obtained from the pre-training step, we introduce the process of discovering training examples and learning discriminative features using the proposed symmetric triplet loss function.

4.1 Supervised Pre-training

We learn identity-preserving features by pre-training a deep neural network on a large-scale face recognition dataset. Based on the AlexNet architecture [7], we replace the output layer with nodes, each of which corresponds to a specific person. We train the network on the external CASIA-WebFace dataset [63] (494,414 images of 10,575 subjects) for face recognition in a fully supervised manner. We select a subset of persons and use part of their images (431,300 images) for training and the remaining (47,140 images) as the validation set. Each face image is normalized to a fixed size in pixels. We use stochastic gradient descent with an initial learning rate of 0.01 that decreases by a factor of 10 every 20,000 iterations, using the Caffe [64] toolbox.
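As an illustration of the learning rate schedule described above, the following minimal sketch computes the rate at a given iteration. The function name and argument names are hypothetical; the sketch assumes a plain step decay (the rate is multiplied by 0.1 after every 20,000 iterations), which is how we read the description, and it is not taken from the original training configuration.

```python
def step_learning_rate(iteration, base_lr=0.01, gamma=0.1, step_size=20000):
    """Step decay: multiply the base rate by `gamma` once per `step_size` iterations.

    The default values follow the numbers quoted in the text; the function
    itself is an illustrative assumption, not the authors' code.
    """
    return base_lr * gamma ** (iteration // step_size)


if __name__ == "__main__":
    for it in (0, 19999, 20000, 40000):
        print(it, step_learning_rate(it))
```

Under this reading, the rate stays at 0.01 for the first 20,000 iterations and drops by a factor of 10 at each subsequent multiple of 20,000.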

4.2 Discovering Training Samples

Fig. 3: Contextual constraints generation. Here, we label the faces in $T_1$ and $T_3$ as the same identity given the sufficiently high similarity between the contextual features of $T_1$ and $T_3$. With this additional constraint, we can propagate the constraints transitively and derive that the faces from $T_1$ and $T_4$ (or $T_1$ and $T_5$) in fact belong to different identities, and the faces from $T_3$ and $T_2$ are from different people.

Shot detection and tracklet linking. We first use a shot change detection method to divide each input video into non-overlapping shots. Next, we use a face detector [65] to locate faces in each frame. Given the face detections for each frame, we use a two-threshold strategy [25] to generate tracklets within each shot by linking the detected faces in adjacent frames based on similarities in appearances, positions, and scales. Note that the two-threshold strategy for linking detections could be replaced by more sophisticated methods, e.g., tracking using particle filters [27, 40]. All tracklets shorter than five frames are discarded. The extracted face tracklets are formed in a conservative manner with limited temporal spans up to the length of each shot.

Spatio-temporal constraints. Existing methods typically exploit spatio-temporal constraints from tracklets to generate training samples from the video. Given a set of tracklets, we can discover a large collection of positive and negative training sample pairs belonging to the same/different persons: (1) all pairs of faces in one tracklet are from one person and (2) two face tracklets that appear in the same frame contain faces of different persons.

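Let $T_i = \{x_1^i, \dots, x_{n_i}^i\}$ denote the $i$-th face tracklet of length $n_i$. We generate a set of positive pairs $P^u$ by collecting all within-tracklet face pairs:

$$P^u = \{(x_k^i, x_l^i)\}, \quad \text{s.t. } \forall k, l = 1, \dots, n_i, \; k \neq l. \tag{1}$$

Similarly, if tracklets $T_i$ and $T_j$ overlap in some frames, we can generate a set of negative pairs $N^u$ by collecting all between-tracklet face pairs:

$$N^u = \{(x_k^i, x_l^j)\}, \quad \text{s.t. } \forall k = 1, \dots, n_i, \; \forall l = 1, \dots, n_j. \tag{2}$$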
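The two spatio-temporal rules can be sketched in a few lines of Python. The data layout below (a dictionary mapping a tracklet identifier to a dictionary from frame index to face sample) and the function name are assumptions made for this sketch only. For brevity it collects unordered within-tracklet pairs, so each pair appears once rather than in both orders.

```python
from itertools import combinations, product


def spatio_temporal_pairs(tracklets):
    """Return (positive_pairs, negative_pairs) from a dict of tracklets.

    `tracklets` maps a tracklet id to {frame_index: face_sample}.
    This layout is hypothetical and chosen only for illustration.
    """
    positives, negatives = [], []
    # Rule 1: all faces within one tracklet belong to the same person.
    for faces in tracklets.values():
        positives.extend(combinations(faces.values(), 2))
    # Rule 2: tracklets sharing at least one frame show different persons.
    for i, j in combinations(sorted(tracklets), 2):
        if tracklets[i].keys() & tracklets[j].keys():
            negatives.extend(product(tracklets[i].values(), tracklets[j].values()))
    return positives, negatives


if __name__ == "__main__":
    toy = {
        "T1": {0: "a0", 1: "a1", 2: "a2"},
        "T2": {1: "b1", 2: "b2"},      # overlaps T1 in frames 1 and 2
        "T3": {10: "c10", 11: "c11"},  # overlaps nothing
    }
    pos, neg = spatio_temporal_pairs(toy)
    print(len(pos), len(neg))
```

In the toy input, the third tracklet contributes no negative pairs because it never co-occurs with another tracklet, which is exactly the limitation the contextual constraints are meant to address.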


Contextual constraints. With the spatio-temporal constraints, we can obtain a large number of face pairs without manual labeling. These training pairs, however, may have some biases. First, the positive (within-tracklet) pairs occur close in time (e.g., only several frames apart in one shot), which means that the positive face pairs often have small appearance variations. Second, the negative pairs are all generated from tracklets that co-occur in the same shot. Consequently, we are not able to train the model so that it can distinguish or link faces across shots (as we do not have training samples for these cases).

To address these problems, we mine additional positive and negative face pairs for learning our video-specific features. The idea is to exploit contextual information beyond facial regions for identifying the person across shots. Specifically, we identify the clothing region following [66] and extract features using the AlexNet model. Given a face detection in one frame, we locate the torso region by using a probabilistic mask defined over the pixels of the current frame (see Figure 4). The mask is learned from the statistics of the spatial relationships of body parts on the Human in 3D dataset [67]. We then concatenate the face features and the clothing features. Next, we apply the HAC algorithm (see Section 5.2) to group tracklets and label the grouped tracklets with high confidence as positive pairs. These grouped tracklets generally contain faces with similar clothing in different shots or scenes and thus provide additional positive tracklet pairs. In addition, by leveraging the additional positive pairs, we can discover more negative pairs by transitively propagating the relationship among tracklets. For example, suppose we know tracklets A and B represent different persons; an additional positive constraint on tracklets A and C automatically implies that tracklets B and C are different persons.
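The transitive propagation in the A, B, C example can be sketched as follows. This is a minimal illustration, assuming that positive constraints are merged with a union-find structure and that every known negative constraint is then extended to all members of the two groups involved. The function name, the input format, and the choice to skip contradictory constraints are assumptions of this sketch.

```python
def propagate_constraints(tracklet_ids, positive_links, negative_links):
    """Derive all tracklet-level negative pairs implied by the given links."""
    parent = {t: t for t in tracklet_ids}

    def find(t):
        # Union-find lookup with path halving.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    # Merge tracklets connected by positive (same-person) constraints.
    for a, b in positive_links:
        parent[find(a)] = find(b)

    groups = {}
    for t in tracklet_ids:
        groups.setdefault(find(t), set()).add(t)

    derived = set()
    for a, b in negative_links:
        group_a, group_b = groups[find(a)], groups[find(b)]
        if group_a is group_b:
            continue  # contradictory evidence; ignored in this sketch
        for u in group_a:
            for v in group_b:
                derived.add(frozenset((u, v)))
    return derived


if __name__ == "__main__":
    negatives = propagate_constraints(
        ["A", "B", "C"], positive_links=[("A", "C")], negative_links=[("A", "B")]
    )
    print(sorted(sorted(pair) for pair in negatives))
```

With A and B known to be different and A and C linked as the same person, the sketch returns both the original pair (A, B) and the derived pair (B, C).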

Fig. 4: Face detections and clothing regions.

Figure 3 illustrates the generation process of contextual constraints. Here, the tracklets $T_1$ and $T_2$ co-occur in one shot and the tracklets $T_3$, $T_4$, and $T_5$ co-occur in another shot. Using only spatio-temporal constraints, we are not able to obtain training samples from different shots. As a result, the tracklets $T_1$ and $T_4$ may be incorrectly identified as the same person. However, from the contextual cues, we may be able to identify that the tracklets $T_1$ and $T_3$ are the same person. Using this additional positive constraint, we can automatically generate additional negative constraints, e.g., $T_1$ is a different person from $T_4$ and $T_5$.

4.3 Learning Adaptive Discriminative Features

With the discovered training pairs from applying both contextual and spatio-temporal constraints, we optimize the embedding function $f(\cdot)$ such that the distance $D(f(x_1), f(x_2))$ in the embedding space reflects the semantic similarity of two face images $x_1$ and $x_2$:

$$D(f(x_1), f(x_2)) = \| f(x_1) - f(x_2) \|_2^2. \tag{3}$$

We set the feature dimension of $f(\cdot)$ as 64 in all of our experiments. We first describe two commonly used loss functions for optimizing the embedding space: (1) contrastive loss and (2) triplet loss, and then present a symmetric triplet loss function for feature learning.

Fig. 5: Siamese vs. Triplet network. Illustration of the Siamese network (left) with pairs as inputs and the triplet network (right) with triplets as inputs for learning discriminative features adaptively. The Siamese network consists of two CNNs and uses a contrastive loss. The Triplet network consists of three CNNs and uses a triplet loss. The CNNs in each network share the same architectures and parameters and are initialized with parameters of the CNN pre-trained on the large-scale face recognition dataset.

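Contrastive loss. The Siamese network [68, 69] consists of two identical CNNs with shared architecture and parameters, as shown in Figure 5. Minimizing the contrastive loss function encourages a small distance between two images of the same person and a large distance otherwise. Denote $(x_1, x_2)$ as a pair of training images generated with the spatio-temporal constraints. Similar to [68, 69], the contrastive loss function is:

$$L_p = \begin{cases} \frac{1}{2} D(f(x_1), f(x_2)), & \text{if } (x_1, x_2) \text{ is a positive pair}, \\ \frac{1}{2} \max\big(0, \tau - D(f(x_1), f(x_2))\big), & \text{if } (x_1, x_2) \text{ is a negative pair}, \end{cases} \tag{4}$$

where $\tau$ is the margin (fixed in all our experiments). Intuitively, if $x_1$ and $x_2$ are from the same person, the loss is $\frac{1}{2} D(f(x_1), f(x_2))$ and we aim to decrease $D(f(x_1), f(x_2))$. Otherwise, we increase $D(f(x_1), f(x_2))$ until it is larger than the margin $\tau$.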
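To make the behavior of the contrastive loss concrete, the sketch below evaluates it for one positive and two negative pairs. It assumes the squared Euclidean distance and a factor of one half in the loss; both are assumptions of this sketch, as are the function names. The margin of 1.0 is an arbitrary illustrative value, not the setting used in the experiments.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def contrastive_loss(f1, f2, same_person, margin):
    """Contrastive loss for one pair of embeddings (illustrative form)."""
    d = squared_distance(f1, f2)
    if same_person:
        return 0.5 * d  # pull the positive pair together
    return 0.5 * max(0.0, margin - d)  # push negatives beyond the margin


if __name__ == "__main__":
    margin = 1.0  # arbitrary value chosen for the example
    print(contrastive_loss([0.0, 0.0], [3.0, 4.0], True, margin))
    print(contrastive_loss([0.0, 0.0], [3.0, 4.0], False, margin))
    print(contrastive_loss([0.0, 0.0], [0.5, 0.0], False, margin))
```

A negative pair that is already farther apart than the margin contributes nothing, so only negatives that are still too close drive the update.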

Triplet loss. The triplet-based network [11] consists of three identical CNNs with shared architecture and parameters, as shown in Figure 5. One triplet consists of two face images of the same person and one face image from another person. We generate a set of triplets $S$ from two tracklets $T_i$ and $T_j$ belonging to different persons: $S = \{(x_k^i, x_l^i, x_m^j)\}$. Here we aim to ensure that the embedded distance of the positive pair $(x_k^i, x_l^i)$ is smaller than that of the negative pair $(x_k^i, x_m^j)$ by a distance margin $\alpha$. For one triplet, the triplet loss is of the form:

$$L_t = \frac{1}{2} \max\big(0, D(f(x_k^i), f(x_l^i)) - D(f(x_k^i), f(x_m^j)) + \alpha\big). \tag{5}$$
Fig. 6: Triplet loss vs. SymTriplet loss. Illustration of the negative partial gradient direction for each triplet sample: (a) the conventional triplet loss; (b) the SymTriplet loss. The triplet samples $x_k^i$, $x_l^i$ and $x_m^j$ are highlighted in blue, red and magenta, respectively. The circles denote faces from the same person whereas the triangle denotes a different person. The gradient directions are color-coded.

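Symmetric triplet loss. The conventional triplet loss in (5) takes only two of the three distances into consideration, $D(f(x_k^i), f(x_l^i))$ and $D(f(x_k^i), f(x_m^j))$, although there are three distances between each pair. We illustrate the problem of the conventional triplet loss by analyzing the gradients of the loss function. We denote the difference vectors between the embedded samples of the triplet ($x_k^i$, $x_l^i$, and $x_m^j$) as:

$$\Delta_{lk} = f(x_l^i) - f(x_k^i), \quad \Delta_{ml} = f(x_m^j) - f(x_l^i), \quad \Delta_{mk} = f(x_m^j) - f(x_k^i). \tag{6}$$

For non-zero triplet loss in (5), we can compute the gradients as

$$\frac{\partial L_t}{\partial f(x_k^i)} = \Delta_{ml}, \quad \frac{\partial L_t}{\partial f(x_l^i)} = \Delta_{lk}, \quad \frac{\partial L_t}{\partial f(x_m^j)} = -\Delta_{mk}. \tag{7}$$

Figure 6(a) shows the negative gradient directions for each sample. There are two issues with the triplet loss in (5). First, the loss function pushes the negative data point $x_m^j$ away from only one sample of the positive pair, $x_k^i$, rather than from both $x_k^i$ and $x_l^i$. Second, the gradients on the positive pair $(x_k^i, x_l^i)$ are not symmetric with respect to the negative data point $x_m^j$.

To address these issues, we propose a symmetric triplet loss function (SymTriplet) by considering all three distances as:

$$L_s = \max\Big(0, D(f(x_k^i), f(x_l^i)) - \frac{1}{2}\big(D(f(x_k^i), f(x_m^j)) + D(f(x_l^i), f(x_m^j))\big) + \alpha\Big), \tag{8}$$

where $\alpha$ is the distance margin. For non-zero loss, the gradients induced from the proposed SymTriplet loss are

$$\frac{\partial L_s}{\partial f(x_k^i)} = -2\Delta_{lk} + \Delta_{mk}, \quad \frac{\partial L_s}{\partial f(x_l^i)} = 2\Delta_{lk} + \Delta_{ml}, \quad \frac{\partial L_s}{\partial f(x_m^j)} = -(\Delta_{mk} + \Delta_{ml}). \tag{9}$$

Figure 6(b) shows the negative gradient directions. The proposed SymTriplet loss directly optimizes the embedding space such that the positive pair is pulled closer together and the negative sample ($x_m^j$) is pushed away from the two positive samples ($x_k^i$, $x_l^i$). This property allows us to improve the discriminative strength of the learned features.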
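The difference between the two losses can be checked numerically with a small sketch. It assumes squared Euclidean distances, a factor of one half in the conventional triplet loss, and a SymTriplet loss that subtracts the average of the two positive-to-negative distances; the gradient expressions in the code are derived from that assumed form. All function names and the margin of 1.0 are illustrative assumptions.

```python
def sqdist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(fk, fl, fm, alpha):
    """Conventional triplet loss: uses only D(k, l) and D(k, m)."""
    return 0.5 * max(0.0, sqdist(fk, fl) - sqdist(fk, fm) + alpha)


def symtriplet_loss(fk, fl, fm, alpha):
    """Symmetric triplet loss: uses all three pairwise distances."""
    return max(0.0, sqdist(fk, fl) - 0.5 * (sqdist(fk, fm) + sqdist(fl, fm)) + alpha)


def symtriplet_grads(fk, fl, fm, alpha):
    """Gradients of the SymTriplet loss w.r.t. the three embeddings."""
    if symtriplet_loss(fk, fl, fm, alpha) <= 0.0:
        zero = [0.0] * len(fk)
        return zero, zero, zero
    d_lk = [l - k for l, k in zip(fl, fk)]
    d_mk = [m - k for m, k in zip(fm, fk)]
    d_ml = [m - l for m, l in zip(fm, fl)]
    g_k = [-2 * a + b for a, b in zip(d_lk, d_mk)]
    g_l = [2 * a + b for a, b in zip(d_lk, d_ml)]
    g_m = [-(a + b) for a, b in zip(d_mk, d_ml)]
    return g_k, g_l, g_m


if __name__ == "__main__":
    fk, fl, fm = [0.0, 0.0], [1.0, 0.0], [0.5, 0.5]
    print(triplet_loss(fk, fl, fm, 1.0), symtriplet_loss(fk, fl, fm, 1.0))
    print(symtriplet_grads(fk, fl, fm, 1.0))
```

In this example the negative sample sits midway between the two positives along the first axis. Following the negative gradients, the two positive embeddings move toward each other by equal amounts and the negative embedding moves away from both, which is the symmetric behavior described above.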

4.4 Training Algorithm

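We train the triplet network model with the SymTriplet loss function using stochastic gradient descent with momentum. Denoting the network parameters by $W$, we compute the derivatives of (8) with the chain rule as follows:

$$\frac{\partial L_s}{\partial W} = \frac{\partial L_s}{\partial f(x_k^i)} \frac{\partial f(x_k^i)}{\partial W} + \frac{\partial L_s}{\partial f(x_l^i)} \frac{\partial f(x_l^i)}{\partial W} + \frac{\partial L_s}{\partial f(x_m^j)} \frac{\partial f(x_m^j)}{\partial W}. \tag{10}$$

We can compute the gradients from each input triplet given the values of $f(x_k^i)$, $f(x_l^i)$, $f(x_m^j)$ and $\partial f(x_k^i)/\partial W$, $\partial f(x_l^i)/\partial W$, $\partial f(x_m^j)/\partial W$, which can be obtained using the standard forward and backward propagations separately for each image in the triplet. We summarize the main training steps in Algorithm 1.

Algorithm 1 Stochastic gradient descent with SymTriplet loss
Input: Training samples $\{(x_k^i, x_l^i, x_m^j)\}$.
Output: Network parameters $W$.
for each iteration up to the maximum number of iterations do
     for all training triplet samples $(x_k^i, x_l^i, x_m^j)$ do
         Compute $f(x_k^i)$, $f(x_l^i)$ and $f(x_m^j)$ by forward propagation;
         Compute $\partial f(x_k^i)/\partial W$, $\partial f(x_l^i)/\partial W$ and $\partial f(x_m^j)/\partial W$ by back propagation;
         Compute $\partial L_s/\partial W$ according to (10).
     end for
     Update the parameters $W$.
end for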
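The training loop of Algorithm 1 can be illustrated on a toy problem. The sketch below replaces the CNN with a linear embedding $f(x) = Wx$, for which the chain rule reduces to an outer product between the embedding gradient and the input. The linear model, the toy triplets, the function names, and the hyperparameter values are all assumptions of this sketch; only the overall structure (forward pass, gradient accumulation over triplets, momentum update) follows the algorithm.

```python
def embed(W, x):
    """Linear embedding f(x) = W x, a stand-in for the CNN in this sketch."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def symtriplet_with_grads(fk, fl, fm, alpha):
    """SymTriplet loss and its gradients w.r.t. the three embeddings."""
    loss = sqdist(fk, fl) - 0.5 * (sqdist(fk, fm) + sqdist(fl, fm)) + alpha
    if loss <= 0.0:
        zero = [0.0] * len(fk)
        return 0.0, (zero, zero, zero)
    gk = [k - 2 * l + m for k, l, m in zip(fk, fl, fm)]
    gl = [l - 2 * k + m for k, l, m in zip(fk, fl, fm)]
    gm = [k + l - 2 * m for k, l, m in zip(fk, fl, fm)]
    return loss, (gk, gl, gm)


def mean_loss(W, triplets, alpha):
    losses = [symtriplet_with_grads(*[embed(W, x) for x in t], alpha)[0] for t in triplets]
    return sum(losses) / len(losses)


def train(W, triplets, alpha=1.0, lr=0.01, momentum=0.9, iterations=300):
    rows, cols = len(W), len(W[0])
    W = [row[:] for row in W]  # do not modify the caller's matrix
    velocity = [[0.0] * cols for _ in range(rows)]
    for _ in range(iterations):
        grad = [[0.0] * cols for _ in range(rows)]
        for triplet in triplets:
            feats = [embed(W, x) for x in triplet]  # forward propagation
            _, grads = symtriplet_with_grads(*feats, alpha)
            for g, x in zip(grads, triplet):  # chain rule for f(x) = W x
                for r in range(rows):
                    for c in range(cols):
                        grad[r][c] += g[r] * x[c] / len(triplets)
        for r in range(rows):  # momentum update of the parameters
            for c in range(cols):
                velocity[r][c] = momentum * velocity[r][c] - lr * grad[r][c]
                W[r][c] += velocity[r][c]
    return W


if __name__ == "__main__":
    triplets = [
        ([1.0, 0.0], [0.9, 0.1], [0.8, 0.3]),
        ([0.8, 0.3], [0.7, 0.35], [1.0, 0.0]),
    ]
    W0 = [[1.0, 0.0], [0.0, 1.0]]
    W1 = train(W0, triplets)
    print(mean_loss(W0, triplets, 1.0), mean_loss(W1, triplets, 1.0))
```

In the actual method the derivatives of the embedding with respect to the parameters come from back propagation through the network rather than from this closed-form outer product.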

5 Multi-face Tracking via Tracklet Linking

We use a two-step procedure to link face tracklets generated in Section 4.2: (1) linking the face tracklets within each shot into shot-level tracklets, and (2) merging shot-level tracklets across multiple shots into trajectories.

5.1 Linking Tracklets Within Each Shot

We use a typical multi-object tracking framework for linking tracklets within each shot. First, we extract features from each detected face using the learned deep network. We measure the linking probabilities between two tracklets using temporal, kinematic and appearance information. Then, we use the Hungarian algorithm to determine a globally optimal label assignment [25, 24], and tracklets with the same label are linked into shot-level tracklets.

5.2 Linking Tracklets Across Shots

To link tracklets across multiple shots, we apply a bottom-up hierarchical agglomerative clustering (HAC) algorithm with a stopping threshold and learned appearance features as follows:

  1. Given the tracklets from all shots, we start by treating each tracklet as a singleton cluster.

  2. We evaluate all pairwise distances between two tracklets and use the mean distance as the similarity measure. Given two tracklets $T_i$ and $T_j$ of lengths $n_i$ and $n_j$, the distance is defined as:

    $$D_{i,j} = \frac{1}{n_i n_j} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} D(f(x_k^i), f(x_l^j)),$$

    where $x_k^i$ denotes the $k$-th face detection in the $i$-th tracklet, and $f(\cdot)$ denotes the feature extracted from the embedding layer in the Triplet network.

  3. For tracklets that have overlapping frames, we set the corresponding distance to infinity.

  4. We determine the pair of clusters that has the shortest distance and merge them into a new cluster. We update all distances from the new cluster to all other clusters. For those clusters that have overlapping frames with the new cluster, the corresponding distances to the new cluster are set to infinity.

  5. Repeat step 4 until the shortest distance is larger than a stopping threshold.

We remove clusters containing fewer than 4 tracklets and fewer than 50 frames. The tracklets in each cluster are labeled with the same identity to form trajectories.
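The clustering steps above can be sketched as follows. This is a minimal illustration, assuming each tracklet is given as a set of frame indices plus a list of embedded face features, and assuming that the distance between merged clusters is recomputed as the mean over all face pairs. The data layout, the function names, and the threshold value are assumptions of this sketch; the final removal of small clusters is omitted.

```python
import math
from itertools import combinations


def sqdist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def cluster_distance(a, b):
    """Mean pairwise feature distance; infinite if the clusters share a frame."""
    if a["frames"] & b["frames"]:
        return math.inf
    total = sum(sqdist(u, v) for u in a["features"] for v in b["features"])
    return total / (len(a["features"]) * len(b["features"]))


def link_tracklets(tracklets, threshold):
    """Greedy bottom-up merging until the closest pair exceeds `threshold`."""
    clusters = [
        {"ids": [i], "frames": set(t["frames"]), "features": list(t["features"])}
        for i, t in enumerate(tracklets)
    ]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        if cluster_distance(clusters[i], clusters[j]) > threshold:
            break
        a, b = clusters[i], clusters[j]
        merged = {
            "ids": a["ids"] + b["ids"],
            "frames": a["frames"] | b["frames"],
            "features": a["features"] + b["features"],
        }
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return [sorted(c["ids"]) for c in clusters]


if __name__ == "__main__":
    toy = [
        {"frames": {0, 1}, "features": [[0.0, 0.0], [0.1, 0.0]]},
        {"frames": {0, 1}, "features": [[0.05, 0.0], [0.0, 0.05]]},  # co-occurs with 0
        {"frames": {10, 11}, "features": [[0.0, 0.1], [0.1, 0.1]]},
        {"frames": {20, 21}, "features": [[5.0, 5.0], [5.1, 5.0]]},
    ]
    print(link_tracklets(toy, threshold=1.0))
```

In the toy input, tracklets 0 and 1 have nearly identical features but appear in the same frames, so they are never merged; once tracklet 2 joins tracklet 1, its cluster also becomes unlinkable with tracklet 0.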

6 Experimental Results

We first describe the implementation details, datasets, and evaluation metrics. Next, we present the evaluation results of the proposed algorithm against the state-of-the-art methods. More experimental results and videos are available in the supplementary material. The source code and annotated datasets will be made available to the public.

6.1 Implementation Details

CNN fine-tuning. We adapt the pre-trained CNN with the proposed SymTriplet loss. For feature embedding, we replace the classification layer in the pre-trained network with 64 output nodes. We use stochastic gradient descent with the momentum term set to 0.9. For the network training, we use a fixed learning rate of 0.00001 for fine-tuning and a weight decay of 0.0001. We use a mini-batch size of 128 and train the network for 2,000 epochs.

Linking tracklets. For determining the threshold in Section 5.2, we perform parameter sweeping (with an interval of 0.1) using cross-validation. We set the threshold of the HAC algorithm empirically: one value is used for features trained from the Siamese network, and another value is shared by the triplet and SymTriplet loss functions for features trained from the triplet network.

6.2 Datasets

We evaluate the proposed algorithm on three types of videos containing multiple persons:

  1. Videos in a laboratory setting: Frontal [31]

  2. TV sitcoms: The Big Bang Theory (BBT) [31, 45] and the Buffy the Vampire Slayer (BUFFY) [48, 49, 66] datasets

  3. Music videos from YouTube

Frontal video. Frontal is a short video in a constrained scene acquired indoors with a fixed camera. Four persons facing the camera move around and occlude each other.

BBT dataset. We select the first 7 episodes from Season 1 of the Big Bang Theory TV Sitcom (referred to as BBT01-07). Each video is about 23 minutes long with the main cast of 5-13 people and is recorded mostly indoors. The main difficulty lies in identifying faces of the same person from frequent changes of camera views and scenes, where there are large appearance variations in viewing angle, pose, scale, and illumination.

BUFFY dataset. The BUFFY dataset has been widely evaluated in the context of automatic face labeling [48, 49, 66]. The dataset contains three episodes (episodes 2, 5, and 6) from Season 5 of the TV series Buffy the Vampire Slayer (referred to as BUFFY02, BUFFY05, and BUFFY06). Each video is about 40 minutes long with a main cast of 13-19 people. The illumination conditions in this dataset are more challenging than those in the BBT dataset, as it contains many dimly lit scenes.

Music video dataset. We introduce a new dataset of 8 music videos from YouTube. It is challenging to track multiple faces in these videos due to large variations caused by frequent shot/scene changes, large appearance variations, and rapid camera motion. Three sequences (T-ara, Westlife and Pussycat Dolls) are recorded from live music performance with multiple cameras in different views. The other sequences (Bruno Mars, Apink, Hello Bubble, Darling and Girls Aloud) are MTV videos. Faces in these videos often undergo large appearance variations due to changes in pose, scale, makeup, illumination, camera motion, and occlusions.

6.3 Evaluation Metrics

We evaluate the proposed method in two main aspects. First, to evaluate the effectiveness of the learned video-specific features, we use a bottom-up HAC algorithm to merge pairs of tracklets until all tracklets have been merged into the pre-defined number of clusters (i.e., the actual number of people in the video). We measure the quality of clustering using the weighted purity:


$$W = \frac{1}{M} \sum_{c} m_c \cdot p_c,$$

where each cluster $c$ contains $m_c$ elements and its purity $p_c$ is measured as the fraction of the largest number of faces from the same person to $m_c$, and $M$ denotes the total number of faces in the video.
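As a worked illustration of the weighted purity, write $m_c$ for the size of cluster $c$, $p_c$ for its purity, and $M$ for the total number of faces, so that the measure is $\frac{1}{M}\sum_c m_c \, p_c$. The sketch below computes it from per-face cluster assignments and ground-truth identities; the function name and input layout are assumptions made for this example.

```python
from collections import Counter


def weighted_purity(cluster_labels, true_ids):
    """cluster_labels[i] and true_ids[i] describe face i."""
    by_cluster = {}
    for cluster, identity in zip(cluster_labels, true_ids):
        by_cluster.setdefault(cluster, []).append(identity)
    # m_c * p_c equals the count of the most frequent identity in cluster c.
    majority = sum(Counter(ids).most_common(1)[0][1]
                   for ids in by_cluster.values())
    return majority / len(true_ids)


# Cluster 0 holds two faces of "a" and one of "b"; cluster 1 holds two of "b".
print(weighted_purity([0, 0, 0, 1, 1], ["a", "a", "b", "b", "b"]))
```

In this toy case four of the five faces agree with the majority identity of their cluster, so the weighted purity is 4/5.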

Second, we evaluate the method with the metrics commonly used in multi-target tracking [38], including Recall, Precision, F1, FAF, IDS, Frag, MOTA, and MOTP. We list the definitions of these metrics in the supplementary material. The up and down arrows indicate whether higher or lower scores are better for each metric.
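The exact metric definitions are deferred to the supplementary material. As a rough guide, the sketch below assumes the standard definitions of Recall, Precision, and F1 and the standard CLEAR MOT form of MOTA; the function names and count-based inputs are hypothetical.

```python
def detection_scores(true_pos, false_pos, false_neg):
    """Return (recall, precision, f1) from detection counts."""
    recall = true_pos / (true_pos + false_neg)
    precision = true_pos / (true_pos + false_pos)
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1


def mota(false_neg, false_pos, id_switches, num_gt):
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + IDS) / GT."""
    return 1.0 - (false_neg + false_pos + id_switches) / num_gt


print(detection_scores(true_pos=80, false_pos=20, false_neg=20))
print(mota(false_neg=10, false_pos=5, id_switches=5, num_gt=100))
# MOTA drops below zero when the errors outnumber the ground-truth objects.
print(mota(false_neg=90, false_pos=20, id_switches=0, num_gt=100))
```

Under this definition MOTA has no lower bound, which is why a tracker with very low recall can receive a negative score.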

6.4 Evaluation on Features

We evaluate the proposed adaptive features against several alternatives summarized in Table I.

| Method name | Feature dimension | Architecture | Training loss | Description |
|---|---|---|---|---|
| HOG [6] | 4,356 | - | - | A conventional hand-crafted feature |
| AlexNet [7] | 4,096 | AlexNet | Softmax loss | A generic feature representation |
| Pre-trained | 4,096 | AlexNet | Softmax loss | Face representation trained on the WebFace dataset |
| VGG-Face [60] | 4,096 | VGG-16 | Softmax loss | A publicly available face descriptor |
| VGG-Face-ULDML | 4,096 | VGG-16 | ULDML [43] | A Mahalanobis mapping from the VGG-Face features |
| Ours-Triplet | 64 | AlexNet | Triplet loss | Trained with traditional spatio-temporal constraints |
| Ours-Siamese | 64 | AlexNet | Contrastive loss | Trained with traditional spatio-temporal constraints |
| Ours-SymTriplet | 64 | AlexNet | SymTriplet loss | Trained with traditional spatio-temporal constraints |
| Ours-SymTriplet-Contx | 64 | AlexNet | SymTriplet loss | Trained with the contextual constraints |
| Ours-SymTriplet-BBT02 | 64 | AlexNet | SymTriplet loss | Trained on the BBT02 video with spatio-temporal constraints |
TABLE I: Summary of the evaluated features.
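| Method | BBT01 | BBT02 | BBT03 | BBT04 | BBT05 | BBT06 | BBT07 | BUFFY02 | BUFFY05 | BUFFY06 |
|---|---|---|---|---|---|---|---|---|---|---|
| HOG [6] | 0.37 | 0.31 | 0.37 | 0.36 | 0.29 | 0.26 | 0.30 | 0.21 | 0.38 | 0.25 |
| AlexNet [7] | 0.47 | 0.31 | 0.45 | 0.36 | 0.29 | 0.26 | 0.39 | 0.33 | 0.37 | 0.26 |
| Pre-trained | 0.86 | 0.71 | 0.73 | 0.59 | 0.51 | 0.50 | 0.73 | 0.26 | 0.42 | 0.32 |
| VGG-Face [60] | 0.91 | 0.85 | 0.85 | 0.54 | 0.65 | 0.46 | 0.79 | 0.22 | 0.51 | 0.41 |
| VGG-Face-ULDML | 0.92 | 0.87 | 0.86 | 0.60 | 0.68 | 0.23 | 0.85 | 0.34 | 0.54 | 0.43 |
| Ours-Siamese | 0.94 | 0.95 | 0.87 | 0.74 | 0.70 | 0.70 | 0.89 | 0.44 | 0.67 | 0.61 |
| Ours-Triplet | 0.94 | 0.95 | 0.92 | 0.74 | 0.68 | 0.70 | 0.89 | 0.45 | 0.66 | 0.70 |
| Ours-SymTriplet | 0.94 | 0.95 | 0.92 | 0.78 | 0.85 | 0.75 | 0.91 | 0.46 | 0.68 | 0.73 |
| Ours-SymTriplet-Contx | 0.95 | 0.95 | 0.93 | 0.84 | 0.86 | 0.83 | 0.92 | 0.58 | 0.70 | 0.75 |
| Ours-SymTriplet-BBT02 | 0.90 | 0.95 | 0.87 | 0.74 | 0.79 | 0.67 | 0.88 | - | - | - |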
TABLE II: Clustering results on 7 BBT videos and 3 BUFFY videos. The weighted purity of each video is measured on the ideal number of clusters. Red text indicates the best and blue text indicates the second-best performance.
Music dataset

| Method | T-ara | Pussycat Dolls | Bruno Mars | Hello Bubble | Darling | Apink | Westlife | Girls Aloud |
|---|---|---|---|---|---|---|---|---|
| HOG [6] | 0.22 | 0.28 | 0.36 | 0.33 | 0.20 | 0.20 | 0.27 | 0.29 |
| AlexNet [7] | 0.24 | 0.32 | 0.35 | 0.31 | 0.19 | 0.21 | 0.37 | 0.30 |
| Pre-trained | 0.31 | 0.31 | 0.49 | 0.34 | 0.25 | 0.28 | 0.32 | 0.33 |
| VGG-Face [60] | 0.23 | 0.46 | 0.44 | 0.29 | 0.21 | 0.24 | 0.27 | 0.31 |
| VGG-Face-ULDML | 0.26 | 0.44 | 0.47 | 0.34 | 0.28 | 0.26 | 0.41 | 0.32 |
| Ours-Siamese | 0.69 | 0.77 | 0.88 | 0.54 | 0.46 | 0.48 | 0.54 | 0.67 |
| Ours-Triplet | 0.68 | 0.77 | 0.83 | 0.60 | 0.49 | 0.60 | 0.52 | 0.67 |
| Ours-SymTriplet | 0.69 | 0.78 | 0.90 | 0.64 | 0.70 | 0.72 | 0.56 | 0.69 |
| Ours-SymTriplet-Contx | 0.84 | 0.83 | 0.91 | 0.69 | 0.72 | 0.74 | 0.66 | 0.75 |
TABLE III: Clustering results on 8 music videos. The weighted purity of each video is measured on the ideal number of clusters.

Adaptive features vs. off-the-shelf features. We evaluate the proposed features (Ours-Siamese, Ours-Triplet, Ours-SymTriplet, and Ours-SymTriplet-Contx) adapted to a specific video against the off-the-shelf features (HOG, AlexNet, Pre-trained, and VGG-Face) in Tables II and III. Identity-preserving features (Pre-trained and VGG-Face) trained offline on face datasets achieve better performance than generic feature representations (e.g., AlexNet and HOG). Our video-specific features trained with Siamese and triplet networks perform favorably against the alternatives, highlighting the importance of learning video-specific features. For example, on the Darling sequence, the proposed method with the Ours-SymTriplet-Contx features achieves a weighted purity of 0.72, significantly outperforming the off-the-shelf features (VGG-Face: 0.21, AlexNet: 0.19, and HOG: 0.20). Overall, the results with the proposed features are more than twice as accurate as those using off-the-shelf features on the music videos. On the BBT dataset, the proposed feature adaptation consistently outperforms the off-the-shelf features.

Measuring the effectiveness of features via clustering. Here, we validate the effectiveness of the proposed features compared to the baselines. Figure 7 shows the clustering purity versus the number of clusters on the 8 music videos, 3 BUFFY videos, and 7 BBT sequences. The ideal line (purple dashed line) means that all faces are correctly grouped, with a weighted purity of 1. For more effective features, the weighted purity approaches 1 at a faster rate. For each feature type, we show the weighted purity at the ideal number of clusters (i.e., the number of people in a video) in the legend.

Fig. 7: Clustering performance. Clustering purity versus the number of clusters for different features on the YouTube music video, Big Bang Theory, and BUFFY datasets. The ideal line indicates that all faces are correctly grouped into ideal clusters, with a weighted purity of 1. For a more effective feature, the purity approaches 1 faster as the number of clusters increases. The legend lists the purity at the ideal number of clusters for each feature. Panels: (a) T-ara, (b) Pussycat Dolls, (c) Bruno Mars, (d) Hello Bubble, (e) Darling, (f) Apink, (g) Westlife, (h) Girls Aloud, (i) BUFFY02, (j) BUFFY05, (k) BUFFY06, (l) BBT01, (m) BBT02, (n) BBT03, (o) BBT04, (p) BBT05, (q) BBT06, (r) BBT07.

Figure 8 shows a 2D visualization of the features extracted from the T-ara sequence using the t-SNE algorithm [70]. The visualization illustrates the difficulty of handling large appearance variations in unconstrained videos. For HOG features, there is no clear cluster structure, and faces of the same person are scattered. Although the AlexNet and pre-trained features increase inter-person distances, the faces of the same person do not lie in close proximity. In contrast, the proposed adaptive features form tighter clusters for the same person and greater separation between different persons.

Fig. 8: 2D t-SNE visualization of all face features from the T-ara sequence. Features from the proposed fine-tuned CNN, adapted to video-specific variations, are compared with HOG, AlexNet, and pre-trained features. T-ara has six main cast members. The faces of different people are color coded. Panels: HOG (4356-D), AlexNet (4096-D), Pre-trained (4096-D), Ours-SymTriplet (64-D).

Nonlinear and linear metric learning. Unlike several existing approaches [43, 31, 45, 44, 58] that rely on hand-crafted features and linear metric learning, we use deep nonlinear metric learning, fine-tuning all layers to learn discriminative face representations. To demonstrate the contribution of nonlinear metric learning, we compare our adaptive features with VGG-Face-ULDML, which learns a Mahalanobis distance on the VGG-Face features, in Tables II and III. The proposed method with adaptive features and the nonlinear metric achieves higher clustering purity than VGG-Face-ULDML on all videos. For example, on the T-ara sequence, the clustering purities of Ours-SymTriplet-Contx and VGG-Face-ULDML are 0.84 and 0.26, respectively.

SymTriplet and conventional Siamese/Triplet loss. We demonstrate the effectiveness of the proposed SymTriplet loss (Ours-SymTriplet) with comparisons to the contrastive loss (Ours-Siamese) and the triplet loss (Ours-Triplet) on all videos in Tables II and III. The proposed method with the SymTriplet loss performs well against the other methods since positive sample pairs are pulled closer and negative samples are pushed away from the positive pairs. For example, on the BBT05 sequence, Ours-SymTriplet (0.85) achieves higher clustering purity than Ours-Triplet (0.68) and Ours-Siamese (0.70).
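The difference between the two triplet-style losses can be illustrated numerically. The sketch below assumes squared Euclidean distances and assumes that the SymTriplet loss penalizes the positive-pair distance against the average of the two positive-to-negative distances; the authoritative definition is the one given where the loss is introduced. The margin value and all function names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin):
    """Conventional triplet loss: only the anchor-negative distance is used."""
    return max(0.0, sq_dist(anchor, positive)
               - sq_dist(anchor, negative) + margin)


def sym_triplet_loss(pos_a, pos_b, negative, margin):
    """Assumed SymTriplet form: the negative is compared with both positives."""
    push = 0.5 * (sq_dist(pos_a, negative) + sq_dist(pos_b, negative))
    return max(0.0, sq_dist(pos_a, pos_b) - push + margin)


# The negative lies close to pos_b but farther from pos_a.
pos_a, pos_b, negative = [0.0, 0.0], [1.0, 0.0], [1.2, 0.0]
print(triplet_loss(pos_a, pos_b, negative, margin=1.0))
print(sym_triplet_loss(pos_a, pos_b, negative, margin=1.0))
```

In this configuration the symmetric form yields the larger penalty, because the negative sample sits next to the second positive, a violation that the conventional loss anchored at the first positive does not see.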

Contextual and spatio-temporal constraints. We evaluate the effectiveness of contextual constraints. Using the SymTriplet loss, we compare the features learned using only spatio-temporal constraints (Ours-SymTriplet) with those learned using both contextual and spatio-temporal constraints (Ours-SymTriplet-Contx). Tables II and III show that Ours-SymTriplet-Contx matches or outperforms Ours-SymTriplet on every video. We attribute the performance improvement to the additional positive and negative face pairs discovered through contextual cues and the transitive constraint propagation.

Comparisons with other face clustering algorithms. We compare our method with five recent state-of-the-art face clustering algorithms [31, 43, 45, 58, 53] on the Frontal, BBT01, BUFFY02, and Notting Hill videos. Table IV shows the clustering accuracy over faces and tracklets, using the same datasets and metrics as [31, 45]. (The code and data of some methods, e.g., [44], are not available.) In contrast to the methods in [31, 43, 45, 58], which learn linear transformations over the extracted features, our work learns nonlinear metrics by adapting all layers of the CNN and performs favorably on the Frontal, BBT01, and BUFFY02 sequences. Both [53] and our method discover more informative face pairs to adapt the pre-trained models and learn discriminative face representations, and achieve similar clustering performance on the BUFFY02 and Notting Hill videos.

| Method | Frontal (faces) | Frontal (tracklets) | BBT01 (faces) | BBT01 (tracklets) | BUFFY02 (faces) | Notting Hill (faces) |
|---|---|---|---|---|---|---|
| HOG [6] | 0.411 | 0.402 | 0.495 | 0.472 | 0.304 | 0.451 |
| AlexNet [7] | 0.591 | 0.435 | 0.716 | 0.698 | 0.426 | 0.634 |
| Pre-trained | 0.777 | 0.381 | 0.747 | 0.775 | 0.516 | 0.791 |
| Cinbis-ICCV-11 [43] | 0.844 | 0.861 | 0.581 | 0.565 | 0.416 | 0.732 |
| Wu-CVPR-13 [45] | 0.950 | 0.907 | 0.626 | 0.596 | 0.503 | 0.844 |
| Wu-ICCV-13 [31] | 0.950 | 0.907 | 0.665 | 0.668 | - | - |
| Xiao-ECCV-14 [58] | 0.962 | 0.938 | 0.694 | 0.721 | 0.628 | 0.963 |
| Zhang-ECCV-16 [53] | - | - | - | - | 0.921 | 0.990 |
| Ours-SymTriplet-Contx | 0.998 | 0.998 | 0.946 | 0.982 | 0.926 | 0.980 |
TABLE IV: Clustering accuracy on the Frontal, BBT01, BUFFY02 and Notting Hill videos. We compare our results with three baseline features and five other state-of-the-art face clustering methods [31, 43, 45, 58, 53] based on the same face tracks input and metrics as in [31, 45].

Comparisons with different numbers of feature dimensions. We investigate the effect of the dimensionality of the embedded features. Figure 9 shows the clustering purity versus the number of clusters for different feature dimensions on the Bruno Mars sequence. In general, the clustering accuracy is not very sensitive to the choice of feature dimension. However, we observe that large feature dimensions (e.g., 512 and 1024) do not perform as well as smaller ones. We attribute this to insufficient training samples. The evaluation also supports the choice of 64-dimensional features for accuracy and efficiency.

Fig. 9: Effect of feature dimensionality. The legend shows the weighted purity over the number of clusters on Bruno Mars sequence.

6.5 Multi-face Tracking

Comparisons with the state-of-the-art multi-target trackers. We compare the proposed algorithm with several state-of-the-art MTT trackers including modified versions of TLD [61], ADMM [71], IHTLS [72], and methods by Wu et al. [31, 45]. The TLD [61] scheme is a long-term single-target tracker which can re-detect targets of interest when targets leave and re-enter a scene. We implement two extensions of TLD for multi-face tracking. The first one is the mTLD scheme where in each sequence, we run multiple TLD trackers for all targets and each TLD tracker is initialized with the ground truth bounding box in the first frame. For the second extension of TLD, we integrate the mTLD into our framework (referred to as Ours-mTLD). We use the mTLD to generate shot-level trajectories within each shot instead of using the two-threshold and Hungarian algorithms. At the beginning of each shot, we initialize TLD trackers with untracked detections and link the detections in the following frames according to the overlap scores with TLD outputs.
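Linking detections to tracker outputs by overlap score can be sketched with the intersection-over-union (IoU) measure. The code below is an illustrative assumption rather than the authors' implementation: boxes are `(x1, y1, x2, y2)` tuples, the 0.5 overlap threshold is a placeholder, and detections that match no tracker are returned as `None` so that a caller could use them to initialize new trackers.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_detections(tracker_boxes, detections, min_overlap=0.5):
    """Map each detection to the tracker id with the highest IoU, or None."""
    assignment = []
    for det in detections:
        best_id, best_score = None, min_overlap
        for tracker_id, box in tracker_boxes.items():
            score = iou(box, det)
            if score >= best_score:
                best_id, best_score = tracker_id, score
        assignment.append(best_id)
    return assignment


trackers = {0: (0, 0, 10, 10), 1: (20, 20, 30, 30)}
detections = [(1, 1, 11, 11), (50, 50, 60, 60)]
print(assign_detections(trackers, detections))
```

The first detection overlaps tracker 0 with an IoU of 81/119, while the second overlaps no tracker and is left unassigned.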

Table V shows quantitative results of the proposed algorithm, mTLD [61], ADMM [71], and IHTLS [72] on the BBT, BUFFY, and music video datasets. We also show the tracking results with the pre-trained features without adapting to a specific video. Note that the results shown in Table V are based on the overall evaluation. We report the results for each individual sequence in the supplementary material.

The mTLD method does not perform well on any of the datasets in terms of recall, precision, F1, and MOTA. The ADMM [71] and IHTLS [72] schemes often generate numerous identity switches and fragments, as neither method re-identifies persons well when abrupt camera motions or shot changes occur. The tracker with the pre-trained features is not effective at re-identifying faces in different shots and achieves low MOTA. The Ours-mTLD scheme has more IDS and Frag than the Ours-SymTriplet method on the BUFFY and music video datasets. The shot-level trajectories determined by the mTLD method are short and noisy since TLD trackers sometimes drift or do not perform well when large appearance changes occur. In contrast, both Ours-SymTriplet and Ours-SymTriplet-Contx perform well in terms of precision, F1, and MOTA, with significantly fewer identity switches than ADMM and IHTLS.

BBT dataset

| Method | Recall ↑ | Precision ↑ | F1 ↑ | FAF ↓ | IDS ↓ | Frag ↓ | MOTA ↑ | MOTP ↑ |
|---|---|---|---|---|---|---|---|---|
| mTLD [61] | 1.1% | 8.1% | 1.9% | 0.18 | 8 | 83 | -11.2% | 73.2% |
| ADMM [71] | 78.3% | 56.8% | 65.8% | 0.49 | 2709 | 4623 | 39.5% | 72.7% |
| IHTLS [72] | 77.7% | 63.4% | 69.8% | 0.49 | 2648 | 4496 | 39.2% | 72.7% |
| Pre-trained | 45.0% | 76.8% | 56.8% | 0.19 | 908 | 2435 | 30.0% | 77.9% |
| Ours-mTLD | 63.7% | 78.8% | 70.5% | 0.24 | 1224 | 3487 | 44.6% | 77.6% |
| Ours-Siamese | 74.5% | 81.4% | 77.8% | 0.24 | 884 | 4051 | 56.1% | 77.4% |
| Ours-Triplet | 76.2% | 80.2% | 78.1% | 0.27 | 944 | 4223 | 55.8% | 77.3% |
| Ours-SymTriplet | 76.6% | 81.0% | 78.7% | 0.26 | 846 | 4261 | 57.2% | 77.2% |
| Ours-SymTriplet-Contx | 76.8% | 81.7% | 79.2% | 0.23 | 817 | 4073 | 59.6% | 77.6% |

BUFFY dataset

| Method | Recall ↑ | Precision ↑ | F1 ↑ | FAF ↓ | IDS ↓ | Frag ↓ | MOTA ↑ | MOTP ↑ |
|---|---|---|---|---|---|---|---|---|
| mTLD [61] | 4.6% | 21.5% | 7.6% | 0.32 | 192 | 453 | -8.2% | 69.1% |
| ADMM [71] | 78.3% | 64.9% | 70.9% | 0.37 | 1420 | 2445 | 31.6% | 70.1% |
| IHTLS [72] | 78.0% | 68.9% | 73.2% | 0.30 | 1558 | 2424 | 38.1% | 70.2% |
| Pre-trained | 52.3% | 72.1% | 60.6% | 0.12 | 405 | 2672 | 38.3% | 68.8% |
| Ours-mTLD | 65.4% | 73.5% | 69.2% | 0.27 | 413 | 2503 | 45.3% | 70.2% |
| Ours-Siamese | 66.1% | 74.3% | 70.0% | 0.19 | 389 | 2470 | 45.6% | 70.2% |
| Ours-Triplet | 67.3% | 74.6% | 70.8% | 0.20 | 388 | 2462 | 47.4% | 70.2% |
| Ours-SymTriplet | 68.1% | 74.7% | 71.2% | 0.19 | 363 | 2460 | 47.6% | 70.2% |
| Ours-SymTriplet-Contx | 70.9% | 77.5% | 74.0% | 0.18 | 293 | 2446 | 49.4% | 70.2% |

Music video dataset

| Method | Recall ↑ | Precision ↑ | F1 ↑ | FAF ↓ | IDS ↓ | Frag ↓ | MOTA ↑ | MOTP ↑ |
|---|---|---|---|---|---|---|---|---|
| mTLD [61] | 9.7% | 36.1% | 15.3% | 0.39 | 280 | 621 | -7.7% | 68.4% |
| ADMM [71] | 75.5% | 61.8% | 68.0% | 0.50 | 2382 | 2959 | 51.7% | 63.7% |
| IHTLS [72] | 75.5% | 68.0% | 71.6% | 0.41 | 2013 | 2880 | 56.2% | 63.7% |
| Pre-trained | 60.1% | 88.8% | 71.7% | 0.17 | 931 | 2140 | 51.5% | 79.5% |
| Ours-mTLD | 69.1% | 88.1% | 77.4% | 0.21 | 1914 | 2786 | 57.7% | 80.1% |
| Ours-Siamese | 71.5% | 89.4% | 79.5% | 0.19 | 986 | 2512 | 62.3% | 64.0% |
| Ours-Triplet | 71.8% | 88.8% | 79.4% | 0.20 | 902 | 2546 | 61.8% | 64.2% |
| Ours-SymTriplet | 71.8% | 89.7% | 79.8% | 0.19 | 699 | 2563 | 62.8% | 64.3% |
| Ours-SymTriplet-Contx | 73.2% | 90.5% | 80.9% | 0.19 | 625 | 2417 | 64.1% | 64.2% |
TABLE V: Quantitative comparison with other state-of-the-art multi-target tracking methods on the BBT, BUFFY, and music video datasets.

Contribution of linking tracklets. We evaluate the design choices of the two-step linking process: 1) linking tracklets within the shot and 2) linking shot-level tracklets across shots. To this end, we evaluate two other alternatives:

  • Ours-SymTriplet-noT: without linking tracklets within each shot

  • Ours-SymTriplet-noC: without clustering the shot-level tracklets across different shots

Table VI shows that both Ours-SymTriplet-noT and Ours-SymTriplet-noC decrease the performance in terms of Recall, F1, IDS, Frag and MOTA metrics. Without linking tracklets within each shot, Ours-SymTriplet-noT cannot recover several missed faces, and thus yields lower performance on Recall. Although some separated tracklets in each shot can be grouped together by the HAC clustering algorithm, many tracklets may be grouped incorrectly due to the lack of consideration of spatio-temporal coherence within each shot. This explains the increase of IDS and Frag. Without clustering tracklets across different shots, Ours-SymTriplet-noC assigns each short tracklet in each shot with different identities, which results in significantly more identity switches and lower MOTA.

BBT dataset

| Method | Recall ↑ | Precision ↑ | F1 ↑ | FAF ↓ | IDS ↓ | Frag ↓ | MOTA ↑ | MOTP ↑ |
|---|---|---|---|---|---|---|---|---|
| Ours-SymTriplet-Contx | 76.8% | 81.7% | 79.2% | 0.23 | 817 | 4073 | 59.6% | 77.6% |
| Ours-SymTriplet-noT | 65.3% | 81.3% | 72.4% | 0.24 | 952 | 5109 | 43.1% | 77.4% |
| Ours-SymTriplet-noC | 68.7% | 81.2% | 74.4% | 0.25 | 1496 | 4082 | 46.2% | 77.2% |

TABLE VI: Quantitative comparison with Ours-SymTriplet-noT and Ours-SymTriplet-noC on the BBT dataset.
Fig. 10: Tracking results on YouTube Music video dataset. Shown from the top to bottom are Hello Bubble, Apink, Darling, T-ara, Bruno Mars, Girls Aloud, Westlife and Pussycat Dolls. The faces of the different people are color coded.







Fig. 11: Tracking results on the BUFFY and BBT datasets. The faces of the different people are color coded.

Qualitative results. Figure 10 shows sample tracking results of our algorithm with Ours-SymTriplet-Contx features on all eight music videos. Figure 11 shows the results on three BUFFY videos and three selected BBT sequences. The numbers and the colors indicate the inferred identities of the targets. The proposed algorithm is able to track multiple faces well despite large appearance variations in unconstrained videos. In Figure 10, for example, there are significant changes in scale and appearance (due to makeup and hairstyle) in the Hello Bubble sequence (first row). In the fourth row, the six singers look similar, which makes multi-face tracking particularly challenging within and across shots. Nonetheless, our approach can distinguish the faces and track them reliably with few identity switches. The results in other rows illustrate that our method is able to generate correct identities and trajectories when the same person appears in different shots or different scenes.

6.6 Pedestrian Tracking Across Cameras

In this section, we show that the proposed method for learning adaptive discriminative features from tracklets is also applicable to other objects, e.g., pedestrians or cars in surveillance videos. We validate our approach on the task of pedestrian tracking from multiple non-overlapping cameras.

The problem of multiple target tracking across cameras is challenging as we need to re-identify people from different images acquired at different viewing angles and imaging conditions. In unconstrained scenes, the appearances of people also exhibit significant differences across cameras. The motion cues of people are unreliable because the views do not overlap and the camera configurations are not known a priori. The re-identification problem becomes even more challenging when a large number of people need to be tracked across views.

Similar to pre-training a CNN on a face recognition dataset to learn identity-preserving features, we first train a CNN for person re-identification using the Market1501 dataset [73], which contains 32,668 images of 1,501 identities. We evaluate our method on the DukeMTMC dataset [14], which contains surveillance footage from 8 cameras with approximately 85 minutes of video each.

We conduct the experiment using images from cameras 2 and 5 because they are disjoint and have the largest number of people. We first use the two-threshold strategy to generate tracklets on the videos from both cameras. Next, we collect training samples based on the tracklets using spatio-temporal and contextual constraints. Similar to the experiments on multi-face tracking, we fine-tune the pre-trained CNN with the discovered training samples using the SymTriplet loss function.

After extracting the learned features for each detection, we first link the tracklets within one camera into camera-level trajectories. We then group these camera-level trajectories into tracking results across the two cameras. Following [14], we measure the tracking performance using identification precision (IDP), identification recall (IDR), and the corresponding F1 score (IDF1), as well as other metrics. The identification precision (recall) is the fraction of computed (ground-truth) detections that are correctly identified. The IDF1 metric is the ratio of correctly identified detections over the average number of ground-truth and computed detections. Both IDP and IDR indicate tracking trade-offs, while the IDF1 score allows ranking all methods on a single scale that balances identification precision and recall through their harmonic mean. Table VII shows the tracking results on both cameras in the DukeMTMC dataset. Overall, the proposed method performs favorably against the method in [14] in terms of IDS, MOTA, IDP, and IDF1. We show sample visual results on the DukeMTMC dataset in Figure 12. Person 237 and person 283 both appear in Camera 2 and Camera 5, and are correctly matched across cameras with our method.
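The identification measures described above can be written out directly. The sketch below assumes the counts are already available from the identity matching step of [14]: `idtp` (correctly identified detections), `idfp` (computed detections that are not correctly identified), and `idfn` (ground-truth detections that are not correctly identified). The function name is hypothetical.

```python
def identification_scores(idtp, idfp, idfn):
    """Return (IDP, IDR, IDF1) from identity-level detection counts."""
    idp = idtp / (idtp + idfp)   # fraction of computed detections identified
    idr = idtp / (idtp + idfn)   # fraction of ground-truth detections identified
    # Correct detections over the average of computed and ground-truth counts.
    idf1 = 2 * idtp / (2 * idtp + idfp + idfn)
    return idp, idr, idf1


idp, idr, idf1 = identification_scores(idtp=60, idfp=20, idfn=40)
print(idp, idr, idf1)
# IDF1 coincides with the harmonic mean of IDP and IDR.
print(2 * idp * idr / (idp + idr))
```

The two printed IDF1 values agree up to floating-point rounding, since the ratio form and the harmonic-mean form are algebraically identical.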

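| Camera | Method | IDS ↓ | Frag ↓ | MOTA ↑ | MOTP ↑ | IDP ↑ | IDR ↑ | IDF1 ↑ |
|---|---|---|---|---|---|---|---|---|
| Camera 2 | Ristani et al. [14] | 866 | 1929 | 49.2% | 61.7% | 69.1% | 63.8% | 66.3% |
| Camera 2 | Ours-SymTriplet-Contx | 802 | 2018 | 51.4% | 60.9% | 69.6% | 64.8% | 66.7% |
| Camera 5 | Ristani et al. [14] | 162 | 292 | 73.1% | 70.5% | 84.9% | 68.0% | 75.5% |
| Camera 5 | Ours-SymTriplet-Contx | 147 | 316 | 76.2% | 68.5% | 86.2% | 69.5% | 77.0% |

TABLE VII: Tracking results on the DukeMTMC dataset.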
Fig. 12: Sample pedestrian tracking results. Shown from the top to bottom are Camera 2 and Camera 5 of the DukeMTMC dataset. The different people are color coded.

6.7 Discussions

Fig. 13: Failure cases. Our method incorrectly identifies different persons as the same one across shots on the Apink and Darling sequences. Numbers and colors of rectangles indicate the ground truth identities of persons. The red rectangles show the predicted locations and are tracked as one person by our method. On the Apink sequence on the top row, Persons 1, 3, 4 and 6 are incorrectly assigned with the same identity. On the Darling sequence on the bottom row, our method incorrectly identifies Persons 1 and 4 as the same one across shots.

While the proposed algorithm performs favorably against the state-of-the-art face tracking and clustering methods in handling challenging video sequences, there are three main limitations. First, as our algorithm takes face detections as inputs, the tracking performance depends on whether faces can be reliably detected. For example, in the fourth row of Figure 10, the leftmost person is not detected in frame 419 and the next few frames due to occlusion. In addition, falsely detected faces could be incorrectly linked into a trajectory, e.g., the Marilyn Monroe image on the T-shirt in frame 5,704 in the eighth row of Figure 10.

Second, the proposed algorithm may not perform well on sequences where many shots contain only one single person. We show in Figure 13 two failure cases in the Darling and Apink sequences. In such cases, the proposed method does not generate negative face pairs for training the Siamese/triplet network for distinguishing similar faces. As such, different persons are incorrectly identified as the same one. One remedy is to exploit other weak supervision signals (e.g., scripts, voice, contextual information) to generate visual constraints for different scenarios.

Third, the CNN fine-tuning process is time-consuming. It takes around 1 hour on an NVIDIA GTX 980 Ti GPU for 10,000 back-propagation iterations. There are two approaches that may alleviate this issue. First, we may use faster training algorithms [74]. Second, for TV sitcom episodes we can use one or a few videos for feature adaptation and apply the learned features to all other episodes. Note that we only need to adapt features once as the main characters are the same. In Table II, we train Ours-SymTriplet features on BBT02 (referred to as Ours-SymTriplet-BBT02) and evaluate on the other episodes. Although the weighted purity of Ours-SymTriplet-BBT02 is slightly inferior to that of Ours-SymTriplet, it still outperforms the pre-trained features on all episodes and the VGG-Face features on all episodes except BBT01.

7 Conclusions

In this paper, we tackle the multi-face tracking problem in unconstrained videos by learning video-specific features. We first pre-train a CNN on a large-scale face recognition dataset to learn an identity-preserving face representation. We then adapt the pre-trained CNN using training samples extracted through the spatio-temporal and contextual constraints. To learn discriminative features for handling large appearance variations of faces in a specific video, we propose the SymTriplet loss function. Using the learned features to model face tracklets, we apply a hierarchical clustering algorithm to link face tracklets across multiple shots. In addition to multi-face tracking, we demonstrate that the proposed algorithm can also be applied to other domains such as pedestrian tracking across multiple cameras. Experimental results show that the proposed algorithm outperforms the state-of-the-art methods in terms of clustering accuracy and tracking performance. As the performance of our approach depends on the automatically discovered visual constraints in the video, we believe that exploiting multi-modal information (e.g., sound/script alignment) is a promising direction for further improvement.


The work is supported by National Basic Research Program of China (973 Program, 2015CB351705), NSFC (61332018, 61703344), Office of Naval Research (N0014-16-1-2314), R&D programs by NRF (2014R1A1A2058501) and MSIP/IITP (IITP-2016-H8601-16-1005) of Korea, NSF CAREER (1149783) and gifts from Adobe, Panasonic, NEC, and NVIDIA.


  • [1] W. Brendel, M. Amer, and S. Todorovic, “Multiobject tracking as maximum weight independent set,” in CVPR, 2011.
  • [2] R. T. Collins, “Multitarget data association with higher-order motion models,” in CVPR, 2012.
  • [3] B. Yang and R. Nevatia, “Multi-target tracking by online learning of non-linear motion patterns and robust appearance models,” in CVPR, 2012.
  • [4] L. Zhang, Y. Li, and R. Nevatia, “Global data association for multi-object tracking using network flows,” in CVPR, 2008.
  • [5] X. Zhao, D. Gong, and G. Medioni, “Tracking using motion patterns for very crowded scenes,” in ECCV, 2012.
  • [6] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in CVPR, 2005.
  • [7] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in NIPS, 2012.
  • [8] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, “Decaf: A deep convolutional activation feature for generic visual recognition,” in ICML, 2014.
  • [9] Y. Sun, X. Wang, and X. Tang, “Deep learning face representation from predicting 10,000 classes,” in CVPR, 2014.
  • [10] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” in NIPS, 2014.
  • [11] F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in CVPR, 2015.
  • [12] J. Hu, J. Lu, and Y.-P. Tan, “Discriminative deep metric learning for face verification in the wild,” in CVPR, 2014.
  • [13] S. Zhang, Y. Gong, J.-B. Huang, J. Lim, J. Wang, N. Ahuja, and M.-H. Yang, “Tracking persons-of-interest via adaptive discriminative features,” in ECCV, 2016.
  • [14] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” in ECCVW. Springer, 2016, pp. 17–35.
  • [15] M. Andriluka, S. Roth, and B. Schiele, “People-tracking-by-detection and people-detection-by-tracking,” in CVPR, 2008.
  • [16] C. Stauffer, “Estimating tracking sources and sinks,” in CVPR, 2003.
  • [17] A. A. Perera, C. Srinivas, A. Hoogs, G. Brooksby, and W. Hu, “Multi-object tracking through simultaneous long occlusions and split-merge conditions,” in CVPR, 2006.
  • [18] R. Kaucic, A. A. Perera, G. Brooksby, J. Kaufhold, and A. Hoogs, “A unified framework for tracking through occlusions and across sensor gaps,” in CVPR, 2005.
  • [19] B. Leibe, K. Schindler, and L. Van Gool, “Coupled detection and trajectory estimation for multi-object tracking,” in ICCV, 2007.
  • [20] H. Jiang, S. Fels, and J. J. Little, “A linear programming approach for multiple object tracking,” in CVPR, 2007.
  • [21] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua, “Multiple object tracking using k-shortest paths optimization,” PAMI, vol. 33, no. 9, pp. 1806–1819, 2011.
  • [22] A. Andriyenko and K. Schindler, “Multi-target tracking by continuous energy minimization,” in CVPR, 2011.
  • [23] A. Andriyenko, K. Schindler, and S. Roth, “Discrete-continuous optimization for multi-target tracking,” in CVPR, 2012.
  • [24] J. Xing, H. Ai, and S. Lao, “Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses,” in CVPR, 2009.
  • [25] C. Huang, B. Wu, and R. Nevatia, “Robust object tracking by hierarchical association of detection responses,” in ECCV, 2008.
  • [26] B. Yang and R. Nevatia, “Online learned discriminative part-based appearance models for multi-human tracking,” in ECCV, 2012.
  • [27] C. Huang, Y. Li, H. Ai et al., “Robust head tracking with particles based on multiple cues,” in ECCVW, 2006.
  • [28] Y. Li, H. Ai, T. Yamashita, S. Lao, and M. Kawade, “Tracking in low frame rate video: A cascade particle filter with discriminative observers of different lifespans,” in CVPR, 2007.
  • [29] H. Ben Shitrit, J. Berclaz, F. Fleuret, and P. Fua, “Tracking multiple people under global appearance constraints,” in ICCV, 2011.
  • [30] Y. Li, C. Huang, and R. Nevatia, “Learning to associate: Hybridboosted multi-target tracker for crowded scene,” in CVPR, 2009.
  • [31] B. Wu, S. Lyu, B.-G. Hu, and Q. Ji, “Simultaneous clustering and tracklet linking for multi-face tracking in videos,” in ICCV, 2013.
  • [32] M. Roth, M. Bauml, R. Nevatia, and R. Stiefelhagen, “Robust multi-pose face tracking by multi-stage tracklet association,” in ICPR, 2012.
  • [33] B. Wang, G. Wang, K. L. Chan, and L. Wang, “Tracklet association with online target-specific metric learning,” in CVPR, 2014.
  • [34] C.-H. Kuo and R. Nevatia, “How does person identity recognition help multi-person tracking?” in CVPR, 2011.
  • [35] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in CVPR, 2001.
  • [36] B. Fulkerson, A. Vedaldi, and S. Soatto, “Localizing objects with smart dictionaries,” in ECCV, 2008.
  • [37] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, vol. 60, no. 2, pp. 91–110, 2004.
  • [38] S. Zhang, J. Wang, Z. Wang, Y. Gong, and Y. Liu, “Multi-target tracking by learning local-to-global trajectory models,” PR, vol. 48, no. 2, pp. 580–590, 2015.
  • [39] C.-H. Kuo, C. Huang, and R. Nevatia, “Multi-target tracking by on-line learned discriminative appearance models,” in CVPR, 2010.
  • [40] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool, “Robust tracking-by-detection using a detector confidence particle filter,” in ICCV, 2009.
  • [41] R. T. Collins, Y. Liu, and M. Leordeanu, “Online selection of discriminative tracking features,” PAMI, vol. 27, no. 10, pp. 1631–1643, 2005.
  • [42] H. Grabner and H. Bischof, “On-line boosting and vision,” in CVPR, 2006.
  • [43] R. G. Cinbis, J. Verbeek, and C. Schmid, “Unsupervised metric learning for face identification in tv video,” in ICCV, 2011.
  • [44] M. Tapaswi, O. M. Parkhi, E. Rahtu, E. Sommerlade, R. Stiefelhagen, and A. Zisserman, “Total cluster: A person agnostic clustering method for broadcast videos,” in ICVGIP, 2014.
  • [45] B. Wu, Y. Zhang, B.-G. Hu, and Q. Ji, “Constrained clustering and its application to face clustering in videos,” in CVPR, 2013.
  • [46] E. El Khoury, C. Senac, and P. Joly, “Face-and-clothing based people clustering in video content,” in ICMR, 2010.
  • [47] M. Bauml, M. Tapaswi, and R. Stiefelhagen, “Semi-supervised learning with constraints for person identification in multimedia data,” in CVPR, 2013.
  • [48] J. Sivic, M. Everingham, and A. Zisserman, ““Who are you?” – Learning person specific classifiers from video,” in CVPR, 2009.
  • [49] M. Everingham, J. Sivic, and A. Zisserman, ““Hello! My name is… Buffy” – Automatic naming of characters in tv video,” in BMVC, 2006.
  • [50] P. Gay, E. Khoury, S. Meignier, J.-M. Odobez, and P. Deleglise, “A conditional random field approach for audio-visual people diarization,” in ICASSP, 2014.
  • [51] C. Zhou, C. Zhang, H. Fu, R. Wang, and X. Cao, “Multi-cue augmented face clustering,” in ACM MM, 2015.
  • [52] Z. Tang, Y. Zhang, Z. Li, and H. Lu, “Face clustering in videos with proportion prior,” in IJCAI, 2015.
  • [53] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Joint face representation adaptation and clustering in videos,” in ECCV, 2016.
  • [54] D. Ramanan, S. Baker, and S. Kakade, “Leveraging archival video for building face datasets,” in ICCV, 2007.
  • [55] M. Tapaswi, M. Bauml, and R. Stiefelhagen, ““Knock! Knock! Who is it?” probabilistic person identification in tv-series,” in CVPR, 2012.
  • [56] D. Anguelov, K.-c. Lee, S. B. Gokturk, and B. Sumengen, “Contextual identity recognition in personal photo albums,” in CVPR, 2007.
  • [57] D. Lin, A. Kapoor, G. Hua, and S. Baker, “Joint people, event, and location recognition in personal photo collections using cross-domain context,” in ECCV, 2010.
  • [58] S. Xiao, M. Tan, and D. Xu, “Weighted block-sparse low rank representation for face clustering in videos,” in ECCV, 2014.
  • [59] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “DeepFace: Closing the gap to human-level performance in face verification,” in CVPR, 2014.
  • [60] O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in BMVC, 2015.
  • [61] Z. Kalal, K. Mikolajczyk, and J. Matas, “Tracking-learning-detection,” PAMI, vol. 34, no. 7, pp. 1409–1422, 2012.
  • [62] F. Pernici, “Facehugger: The alien tracker applied to faces,” in ECCV, 2012.
  • [63] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Learning face representation from scratch,” arXiv, 2014.
  • [64] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACM MM, 2014.
  • [65] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool, “Face detection without bells and whistles,” in ECCV, 2014.
  • [66] M. Du and R. Chellappa, “Face association for videos using conditional random fields and max-margin markov networks,” PAMI, vol. 38, no. 9, pp. 1762–1773, 2016.
  • [67] L. Bourdev and J. Malik, “Poselets: Body part detectors trained using 3d human pose annotations,” in ICCV, 2009, pp. 1365–1372.
  • [68] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in CVPR, 2005.
  • [69] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in CVPR, 2006.
  • [70] L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE,” JMLR, vol. 9, pp. 2579–2605, 2008.
  • [71] M. Ayazoglu, M. Sznaier, and O. I. Camps, “Fast algorithms for structured robust principal component analysis,” in CVPR, 2012.
  • [72] C. Dicle, O. I. Camps, and M. Sznaier, “The way they move: Tracking multiple targets with similar appearance,” in ICCV, 2013.
  • [73] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, “Scalable person re-identification: A benchmark,” in ICCV, 2015.
  • [74] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio, “Neural networks with few multiplications,” arXiv, 2015.