1 Introduction
In this paper, we are interested in measuring the similarity of one particular source of variation among videos, such as the subject identity. The motivation of this work is as follows. Given a face video visually affected by confounding factors such as identity and head pose, we would like to compare it against another video while measuring only the similarity of the subject identity, even if the frame-level feature characterizes mixed information. Indeed, deep features from Convolutional Neural Networks (CNNs) trained on face images with identity labels are generally not robust to variation of the head pose, which refers to the face's orientation relative to the camera and is the primary challenge in uncontrolled environments. Therefore, the emphasis of this paper is not the deep learning of frame-level features. Instead, we care about improving the descriptiveness of the video-level representation so that it rules out confounding factors (e.g., pose) and induces the similarity of the factor of interest (e.g., identity).

If we treat the frame-level feature vector of a video as a random vector, we may assume that the highly correlated feature vectors are identically distributed. When the task is to represent the whole image sequence rather than to model temporal dynamics such as state transitions, we may use the sample mean and variance to approximate the true distribution, which implicitly assumes a normal distribution. While this assumption might hold for natural image statistics, it can be untrue for a particular video. Even if the features are Gaussian random vectors, taking the mean makes sense only if the frame-level feature characterizes the identity alone, because by construction there is no variation of identity within a video. However, even CNN face features normally contain both identity and pose cues, so the feature mean will characterize both identity and pose. What is worse, there is no way to decouple the two cues once we take the mean. Instead, if we want the video feature to represent only the subject identity, we had better preserve the overall pose diversity that very likely exists among frames. Disregarding minor factors, identity will then be the only source of variation across videos, since pose varies even within a single video. The proposed frame selection algorithm retains frames that preserve the pose diversity. Based on the selection, we further design an algorithm that computes the identity similarity between two sets of deep face features by pooling the max correlation.

Instead of pooling from all frames, the frame selection algorithm first quantizes poses via K-means and then selects poses by their distances to the K-means centroids. It reduces the number of features from tens or hundreds to $K$ while still preserving the overall pose diversity, which makes it possible to process a video stream in real time. Fig. 1 shows an example sequence from the YouTube Faces (YTF) dataset [16]. This algorithm also serves as a way to sample the video frames (down to images). Once the key frames are chosen, we pool a single similarity number between two videos from many pairs of images. The metric used to pool many correlations is normally the mean or the max. Taking the max is essentially finding the nearest neighbor, a typical metric for measuring the similarity or closeness of two point sets. In our work, the max correlation between two bags of frame-wise CNN features is employed to measure how likely two videos represent the same person. In the end, a video is represented by a single frame's feature, the one inducing the nearest neighbors between the two sets of selected frames if we treat each frame as a data point. This is essentially a pairwise max pooling process. On the official 5,000 video pairs of the YTF dataset [16], our algorithm achieves performance comparable with the state of the art, which averages over the deep features of all frames.
2 Related Works
The cosine similarity and the correlation are both well-defined metrics for measuring the similarity of two images. A simple adaptation to videos would be to randomly sample a frame from each of the two videos. However, the correlation between two random image samples might characterize cues other than identity (say, the pose similarity). There are existing works on measuring the similarity of two videos using manifold-to-manifold distance [6]. However, the straightforward extension of image-based correlation is preferred for its simplicity, such as temporal max or mean pooling [11]. The impact of different spatial pooling methods in CNNs, such as mean pooling, max pooling and $\ell_2$ pooling, has been discussed in the literature [3, 2]. However, pooling over the time domain is not as straightforward as spatial pooling. The frame-wise feature mean is a straightforward video-level representation and yet not a robust statistic. Despite that, temporal mean pooling is a conventional way to represent a video, such as average pooling for video-level representation [1], mean encoding for face recognition [4], feature averaging for action recognition [5] and mean pooling for video captioning [15].

Measuring the similarity of subject identity is useful for face recognition, certainly for face verification and for face identification as well. Face verification decides whether two modalities containing faces represent the same person or two different people, and thus is important for access-control or re-identification tasks. Face identification involves one-to-many similarity, namely a ranked list of one-to-one similarities, and thus is important for watch-list surveillance or forensic search tasks. In identification, we gather information about a specific set of individuals to be recognized (i.e., the gallery). At test time, a new image or group of images is presented (i.e., the probe).
In this deep learning era, face verification on a number of benchmarks such as the Labeled Faces in the Wild (LFW) dataset [9] has been well solved by DeepFace [14], DeepID [13], FaceNet [12] and so on. The Visual Geometry Group at the University of Oxford released their deep face model, the VGG-Face Descriptor [10], which also gives a comparable performance on LFW. In the real world, however, pictures are often taken in uncontrolled environments (the so-called in-the-wild versus in-the-lab setting). Considering the number of image parameters allowed to vary simultaneously, it is logical to consider a divide-and-conquer approach: studying each source of variation separately while keeping all other variations constant in a controlled experiment. Such a separation of variables has been widely used in physics and biology for multivariate problems. In this data-driven machine learning era, it may seem fine to retain all variations in realistic data, the idea being to let deep neural networks learn the variations present in an enormous amount of data. For example, FaceNet [12], trained on a private dataset of over 200M face images, is indeed robust to poses, as illustrated in Fig. 2. However, the CNN features from conventional networks such as DeepFace [14] and VGG-Face [10] are normally not. Moreover, unconstrained data with fused variations may contain biases towards factors other than identity, since the feature might characterize mixed information of identity and low-level factors such as pose, illumination, expression, motion and background. For instance, pose similarities normally outweigh subject-identity similarities, leading to matching based on pose rather than identity. As a result, it is critical to decouple pose and identity. If facial expression confuses the identity as well, it is necessary to decouple them too. In this paper, facial expression is not considered, as it is minor compared with pose. Similarly, if we wanted to measure the similarity of facial expression, we would need to decouple it from identity. For example, in [xiang2015hierarchical] for facial expression recognition, one class of training data is formed by face videos with the same expression yet across different people.

Moreover, there are many different application scenarios for face verification. For Web-based applications, verification is conducted by comparing images to images. The images may be of the same person but taken at different times or under different conditions. Other than identity, high-level factors such as age, gender and ethnicity are not considered in this paper, as they remain the same within a video. For online face verification, live video rather than still images is used. More specifically, existing video-based verification solutions assume that gallery face images are taken under controlled conditions [6]. However, the gallery is often built under uncontrolled conditions. In practice, a camera can take a picture as well as capture a video.

While a video contains more information describing identity than a single image does, using the full live video stream requires expensive computational resources. Normally, we need video sampling or a temporal sliding window.
3 Pose Selection by Diversity-Preserving K-Means
In this section, we explain our treatment of real-world images with various head poses, such as the images in YTF. Many existing methods such as [xiang2015hierarchical] make assumptions which hold only when faces are properly aligned.
By construction (say, face tracking by detection), each video contains a single subject. Each video is formalised as a set of frames, where each frame contains a face. Given the homography and the correspondences of facial landmarks, it is entirely possible to estimate the 3D rotation angles (yaw, pitch and roll) for each 2D face frame. Concretely, some head pose estimator gives a set $\mathcal{P} = \{\mathbf{p}_1, \ldots, \mathbf{p}_N\}$ where $\mathbf{p}_i$ is a 3D rotation-angle vector $(\text{yaw}, \text{pitch}, \text{roll})$.

After pose estimation, we would like to select key frames with significant head poses. Our intuition is to preserve pose diversity while down-sampling the video in the time domain. We learn from Fig. 2 of Google's FaceNet that face features learned by a deep CNN trained on identity-labelled data can be invariant to head poses as long as the training inputs for a particular identity class include almost all possible poses. That is also true for other minor sources of variation such as illumination, expression, motion and background, among others. Then, identity will be the only source of variation across classes, since any factor other than identity varies even within a single class.
Without such huge training data as Google has, we instead hope that the testing inputs for a particular identity class include poses as diverse as possible. A straightforward way is to use the full video, which indeed preserves all possible pose variations in that video, but computing deep features for all the frames is computationally expensive. Taking the representation of a line in a 2D coordinate system as an example, we only need either two parameters, such as the intercept and the gradient, or any two points on that line. Similarly, our problem now becomes finding a compact pose representation of a testing video, which involves the following two criteria.
First, the pose representation is compact in terms of non-redundancy and closeness. For non-redundancy, we hope to retain as few frames as possible. For pose closeness, we observe from Fig. 3 that certain patterns exist in the head pose distribution: close points tend to cluster together. That observation holds for other sequences as well. As a result, we want to select key frames out of a video by clustering the 3D head poses. The widely used K-means clustering aims to partition the point set into $K$ subsets so as to minimize the within-cluster Sum of Squared Distances (SSD). If we treat each cluster as a class, we want to minimize the intra-class, or within-cluster, distance.
Second, the pose representation is representative in terms of diversity (i.e., difference, distance). Intuitively, we want to retain key faces with poses as different as possible. If we treat each frame's estimated 3D pose as a point, then the approximate polygon formed by the selected points should be as close as possible to the true polygon formed by all the points. We measure the diversity using the SSD between any two selected key points (the SSD within the set formed by the centroids, if we use them as key points). And we want to maximize such an inter-class, or between-cluster, distance.
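As a small illustration, this diversity term is the SSD over all pairs of selected pose points; a minimal numpy sketch with hypothetical pose values in degrees:

```python
import numpy as np

def pairwise_ssd(points):
    """Sum of squared Euclidean distances over all unordered pairs of points.

    `points` is an (n, d) array, e.g. n head poses as (yaw, pitch, roll).
    """
    diffs = points[:, None, :] - points[None, :, :]   # (n, n, d)
    # Each unordered pair is counted twice in the full grid, so halve the total.
    return 0.5 * np.sum(diffs ** 2)

# Three hypothetical key poses: frontal, 30-degree yaw, 20-degree pitch.
poses = np.array([[0.0, 0.0, 0.0],
                  [30.0, 0.0, 0.0],
                  [0.0, 20.0, 0.0]])
print(pairwise_ssd(poses))  # 900 + 400 + 1300 = 2600.0
```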
Now, we put all criteria together in a single objective. Given a set of pose observations $\mathcal{P} = \{\mathbf{p}_1, \ldots, \mathbf{p}_N\}$, we aim to partition the observations into $K$ ($K \le N$) disjoint subsets $\mathcal{S} = \{S_1, \ldots, S_K\}$ so as to minimize the within-cluster SSD and maximize the between-cluster SSD, while still keeping the number of clusters small:
$$\min_{\mathcal{S},\,K}\ \sum_{k=1}^{K}\sum_{\mathbf{p}\in S_k}\left\|\mathbf{p}-\boldsymbol{\mu}_k\right\|^2 \;-\; \sum_{k=1}^{K}\sum_{k'\neq k}\left\|\boldsymbol{\mu}_k-\boldsymbol{\mu}_{k'}\right\|^2 \qquad (1)$$
where $\boldsymbol{\mu}_k$ is the mean of the points in $S_k$. This objective differs from that of K-means only in considering the between-cluster distance, which makes it somewhat similar to multi-class LDA (Linear Discriminant Analysis). However, it is still essentially K-means. To solve it, we do not really need alternating minimization, because $K$, with its limited number of choices, is empirically enumerated by cross-validation. Once $K$ is fixed, solving Eqn. 1 follows a procedure similar to multi-class LDA, while there is no mixture of classes or clusters because every point is hard-assigned to a single cluster, as done in K-means. The subsequent selection of key poses is straightforward (by the distances to the K-means centroids). The selected key poses form a subset of $\mathcal{P}$ indexed by $\boldsymbol{\delta}$, an $N$-dimensional sparse impulse vector of binary values 1/0 indicating whether each index is chosen or not, respectively.
The selection of frames follows the index activation vector as well. Such a selection reduces the number of images required to represent the face from tens or hundreds to $K$ while preserving the pose diversity, which is accounted for in the formation of the clusters. We then frontalize the chosen faces, which is called face alignment or pose correction/normalization. All the above operations are summarized in Algorithm 1.
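To make the selection step concrete, the following is a minimal numpy sketch of the core of Algorithm 1 (pose quantization by K-means, then nearest-to-centroid selection). The function and variable names are ours, and the actual implementation may differ:

```python
import numpy as np

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's K-means on 3D pose vectors; returns (centroids, labels)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each pose to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centroids (keep the old centroid if a cluster empties).
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return centroids, labels

def select_key_frames(poses, k):
    """Per pose cluster, keep the frame whose (yaw, pitch, roll) vector is
    closest to the cluster centroid; returns sorted frame indices."""
    poses = np.asarray(poses, dtype=float)
    centroids, labels = kmeans(poses, k)
    selected = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        dists = np.linalg.norm(poses[members] - centroids[j], axis=1)
        selected.append(int(members[dists.argmin()]))
    return sorted(selected)
```

The returned indices then index both the pose set and the frame set, playing the role of the binary activation vector described above.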
Note that not all landmarks can be perfectly aligned. Priority is given to salient ones such as the eye centers and corners, the nose tip, the mouth corners and the chin. Other properties such as symmetry are also preserved. For example, we mirror the detected eyes horizontally. However, a profile face will not be frontalized.
4 Pooling Max Correlation for Measuring Similarity
In this section, we explain our max-correlation-guided pooling over a set of deep face features and verify whether the selected key frames can represent identity well regardless of pose variation.
After face alignment, some feature descriptor, a function $f$, maps each corrected frame to a feature vector of dimensionality $d$ with unit Euclidean norm. Then the video is represented as a bag of normalized frame-wise CNN features. We can also arrange the feature vectors column by column to form a matrix $\mathbf{X}$. For example, the VGG-Face network [10] has been verified to produce features that represent identity well. It has 24 layers, including several stacked convolution-pooling layers, 2 fully-connected layers and one softmax layer. Since the model was trained for face identification with respect to 2,622 identities, we use the output of the second-to-last fully-connected layer as the feature descriptor, which returns a 4,096-dim feature vector for each input face.
Given a pair of videos of subjects $a$ and $b$ respectively, we want to measure the similarity between $a$ and $b$. Since we claim that the proposed bag of CNN features can represent the identity well, we will instead measure the similarity between the two sets of CNN features, defined as the max correlation among all possible pairs of CNN features, namely the max element of the correlation matrix (see Fig. 4):
$$s(a,b)=\max\Big(\big(\mathbf{X}_a^{\top}\mathbf{X}_b\big)(:)\Big) \qquad (2)$$
where $\mathbf{X}_a$ and $\mathbf{X}_b$ are the two feature matrices with $K_a$ and $K_b$ columns, respectively. Notably, the notation $(:)$ indicates all elements of a matrix, following the MATLAB convention. Now, instead of comparing all frame pairs, with Sec. 3 we only need to compute $K_a \times K_b$ correlations, from which we further pool a single scalar as the similarity measure. In the time domain, it also serves to push from many images down to just one image. The pooling metric can be the mean, median, max or the majority from a histogram, while the mean and max are the most widely used. The insight behind not taking the mean is that a frame highly correlated with another video usually does not appear twice in a temporal sliding window. If we plot the two bags of features in the common feature space, the similarity is essentially the closeness between the two sets of points. If the two sets are non-overlapping, one measure of the closeness between two point sets is the distance between nearest neighbors, which is essentially what pooling the max correlation computes. Similar to spatial pooling for invariance, taking the max of the correlation matrix shown in Fig. 4 provides temporal invariance, in that the largest correlation can appear at any time step among the selected frames. Since the identity is consistent within one video, we can claim that two videos contain a similar person as long as one pair of frames, one from each video, is highly correlated. The computation of two videos' identity similarity is summarized in Algorithm 2.
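A minimal numpy sketch of this pairwise max pooling (the core of Algorithm 2), with our own variable names:

```python
import numpy as np

def identity_similarity(feat_a, feat_b):
    """Max correlation between two bags of frame-level features (Eqn. 2).

    feat_a is (d, Ka) and feat_b is (d, Kb); columns are L2-normalized so
    that the inner product of two columns equals their cosine correlation.
    """
    Xa = feat_a / np.linalg.norm(feat_a, axis=0, keepdims=True)
    Xb = feat_b / np.linalg.norm(feat_b, axis=0, keepdims=True)
    corr = Xa.T @ Xb             # (Ka, Kb) correlation matrix
    return float(corr.max())     # pairwise max pooling over all frame pairs
```

Swapping `corr.max()` for `corr.mean()` would give the temporal mean pooling baseline discussed above.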
5 Experiments
5.1 Implementation
We develop the programs using OpenCV (http://opencv.org/), DLib (http://dlib.net/) and VGG-Face (http://www.robots.ox.ac.uk/~vgg/software/vgg_face/).

Face detection: frame-by-frame detection using DLib's HOG+SVM-based detector, trained on 3,000 cropped face images from LFW. It works better for faces in the wild than OpenCV's cascaded Haar-like+boosting-based (Viola-Jones) detector.

Facial landmarking: DLib's landmark model, trained via a regression-tree ensemble.

Head pose estimation: OpenCV's solvePnP, recovering 3D coordinates from 2D coordinates using the Direct Linear Transform followed by Levenberg-Marquardt optimization.

Face alignment: OpenCV's warpAffine, affine-warping to center the eyes and mouth.

Deep face representation: the second-to-last layer output (4,096-dim) of VGG-Face [10] using Caffe [7] (code is available at https://github.com/eglxiang/vgg_face). For convenience, you may consider using MatConvNet/VLFeat instead of Caffe. VGG-Face was trained on face images of size 224x224 with the average face image subtracted, and is used for our verification purpose without any retraining. However, such average-face subtraction is unavailable and unnecessary for a new input image. As a result, we directly input the face image to the VGG-Face network without any mean-face subtraction.
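As a minimal illustration of the alignment step (not our exact implementation), the 2x3 affine matrix that maps three detected landmarks, say the two eye centers and the mouth center, onto hypothetical canonical template positions can be solved directly; this is what would then be passed to OpenCV's warpAffine, and it mirrors cv2.getAffineTransform:

```python
import numpy as np

def alignment_affine(src_pts, dst_pts):
    """2x3 affine transform mapping three source landmarks onto three
    canonical template positions.

    Solves A @ [x, y, 1]^T = [x', y']^T exactly for the three point pairs,
    analogous to cv2.getAffineTransform.
    """
    src = np.hstack([np.asarray(src_pts, float), np.ones((3, 1))])  # (3, 3)
    dst = np.asarray(dst_pts, float)                                # (3, 2)
    # src @ A.T = dst  =>  A.T = src^{-1} @ dst
    return np.linalg.solve(src, dst).T                              # (2, 3)
```

With more than three landmarks, the same system would be solved in the least-squares sense instead, which is why only the salient landmarks can be aligned exactly.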
5.2 Evaluation on video-based face verification
Among video-based face recognition databases: EPFL captured 152 people facing webcam and mobile-phone cameras in controlled environments; however, the faces are frontal and thus of no use to us. The University of Surrey and the University of Queensland captured 295 and 45 subjects, respectively, under various well-quantized poses in controlled environments. Since the poses are already well quantized, we can hardly verify our pose quantization and selection algorithm on them. McGill and NICTA captured 60 videos of 60 subjects and 48 surveillance videos of 29 subjects, respectively, in uncontrolled environments; however, the database sizes are far too small. The YouTube Faces (YTF) dataset and the Indian Movie Face Database (IMFDB) collect 3,425 videos of 1,595 people and 100 videos of 100 actors, respectively, in uncontrolled environments. There are few existing works evaluated on IMFDB. As a result, the YTF dataset (available at http://www.cs.tau.ac.il/~wolf/ytfaces/) [16] is chosen to verify the proposed video-based similarity measure for face verification. YTF was built by using the 5,749 names of subjects included in the LFW dataset [9] to search YouTube for videos of these same individuals. Then, a screening process reduced the original set of 18,899 videos of 3,345 subjects to 3,425 videos of 1,595 subjects.
In the same way as LFW, the creators of YTF provide an initial official list of 5,000 video pairs with ground truth (same person or not, as shown in Fig. 5). Our experiments can be replicated by following our tutorial (code with a tutorial at https://github.com/eglxiang/ytf). The empirically selected $K$ turns out to be on average the best for the YTF dataset. Fig. 6 presents the Receiver Operating Characteristic (ROC) curve obtained after we compute the 5,000 video-video similarity scores. One way to read an ROC curve is to first fix the level of false positive rate we can bear (say, 0.1) and then see how high the true positive rate is (here, roughly 0.9). Another way is to see how close the curve comes to the top-left corner; namely, we measure the Area Under the Curve (AUC) and hope it is as large as possible. In this test, the AUC is 0.9419, which is quite close to that of VGG-Face [10], which uses temporal mean pooling. However, our selective pooling strategy requires much less computation, credited to the key-face selection. We do run cross-validation here, as we do not have any training.
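For reference, the AUC reported above can be computed directly from the similarity scores and labels without plotting the curve; a minimal sketch (not the evaluation code we used) via the rank (Mann-Whitney U) formulation:

```python
import numpy as np

def roc_auc(scores, same_labels):
    """AUC of the ROC curve for similarity scores against binary
    same/not-same ground truth.

    Equals the fraction of (positive, negative) pairs whose scores are
    ordered correctly, counting ties as half.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(same_labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    correct = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (correct + 0.5 * ties) / (len(pos) * len(neg))
```

On the official protocol, `scores` would be the 5,000 pooled max correlations and `same_labels` the provided same/not-same ground truth.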
Later on, the creators of YTF sent a list of errors in the ground-truth label file and provided a corrected list of video pairs with updated ground-truth labels. As a result, we ran the proposed algorithm again on the corrected 4,999 video pairs. Fig. 7 updates the ROC curve with an AUC of 0.9418, which is nearly identical to the result on the initial list.
6 Conclusion
In this work, we propose a frame selection algorithm and an identity similarity measure that employ simple correlations and no learning. They are verified on fast video-based face verification on YTF and achieve performance comparable with VGG-Face. In particular, the selection and pooling significantly reduce the computational expense of processing videos. Further verification of the proposed algorithm will include evaluation on video-based facial expression recognition. As shown in Fig. 5 of [xiang2015hierarchical], the assumption of group sparsity might not hold under imperfect alignment. The extended Cohn-Kanade dataset includes mostly well-aligned frontal faces and thus is not suitable for our research purpose. Our further experiments are being conducted on the BU-4DFE database (http://www.cs.binghamton.edu/~lijun/Research/3DFE/3DFE_Analysis.html), which contains 101 subjects, each displaying 6 acted facial expressions with moderate head pose variations. A generic problem underneath is variable disentanglement in real data, and a take-home message is that employing geometric cues can improve the descriptiveness of deep features.
References
[1] Abu-El-Haija, S., Kothari, N., Lee, J., Natsev, P., Toderici, G., Varadarajan, B., Vijayanarasimhan, S.: YouTube-8M: A large-scale video classification benchmark. arXiv:1609.08675 (September 2016)

[2] Boureau, Y.L., Bach, F., LeCun, Y., Ponce, J.: Learning mid-level features for recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2010)

[3] Boureau, Y.L., Ponce, J., LeCun, Y.: A theoretical analysis of feature pooling in visual recognition. In: Proceedings of the International Conference on Machine Learning (2010)

[4] Crosswhite, N., Byrne, J., Parkhi, O.M., Stauffer, C., Cao, Q., Zisserman, A.: Template adaptation for face verification and identification. arXiv (April 2016)

[5] Donahue, J., Hendricks, L.A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., Darrell, T.: Long-term recurrent convolutional networks for visual recognition and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2625–2634 (2015)

[6] Huang, Z., Shan, S., Wang, R., Zhang, H., Lao, S., Kuerban, A., Chen, X.: A benchmark and comparative study of video-based face recognition on COX face database. IEEE Transactions on Image Processing 24, 5967–5981 (2015)

[7] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093 (2014)

[8] Kemelmacher-Shlizerman, I., Seitz, S.M., Miller, D., Brossard, E.: The MegaFace benchmark: 1 million faces for recognition at scale. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)

[9] Learned-Miller, E., Huang, G.B., RoyChowdhury, A., Li, H., Hua, G.: Labeled faces in the wild: A survey. Advances in Face Detection and Facial Image Analysis, pp. 189–248 (2016)

[10] Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. In: British Machine Vision Conference (2015)

[11] Pigou, L., van den Oord, A., Dieleman, S., Van Herreweghe, M., Dambre, J.: Beyond temporal pooling: Recurrence and temporal convolutions for gesture recognition in video. arXiv (June 2015)

[12] Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: A unified embedding for face recognition and clustering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)

[13] Sun, Y., Chen, Y., Wang, X., Tang, X.: Deep learning face representation by joint identification-verification. In: Advances in Neural Information Processing Systems (2014)

[14] Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: DeepFace: Closing the gap to human-level performance in face verification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)

[15] Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., Saenko, K.: Translating videos to natural language using deep recurrent neural networks. In: Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (2014)

[16] Wolf, L., Hassner, T., Maoz, I.: Face recognition in unconstrained videos with matched background similarity. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2011)