While there has been a substantial amount of research in still image face recognition, there has been comparatively less on video face recognition. Video typically provides much more information for recognition compared to still images, including temporal and multiview information. However, one of the major challenges in video is deciding how to maximise the use of available information while ensuring the system can still run in a scalable and timely manner. For example, despite typically having many frames of face information available from video, a key design decision for any video face recognition system is how many faces to use and, if not all, how to select them. There is potentially a trade-off between computational effort and recognition accuracy, which can be influenced by the number of faces used.
Additionally, we are interested in addressing real-world video recognition problems where the environment is uncontrolled and subjects may not be actively cooperating with the camera. Furthermore, the quality of images can vary quite dramatically. For instance, in surveillance contexts, CCTV video suffers from low quality, resolution mismatches, and varying pose and lighting from camera to camera, as well as within the same camera depending on time of day (changes in lighting and shadows). Another example of an uncontrolled environment is handheld mobile video, which often suffers from quality issues such as lens smudging, blur, and pose and lighting changes due to variation between scenes (indoor/outdoor). In addition to the above image variations, face detection and alignment also have a great influence on recognition performance. Many face recognition algorithms assume the faces are well aligned and normalised, which may not be the case, especially for low quality video. Thus, to address these issues, not only does the face recognition system need to be scalable and efficient, but it also has to be robust to common issues that affect recognition accuracy.
This paper describes a system for video-to-video face recognition which uses an adapted form of the probabilistic Multi-Region Histogram (MRH) method originally developed for still-to-still face recognition. We have chosen to extend it to video-to-video recognition as it has shown robustness to alignment errors as well as variations in illumination, pose and image quality. Furthermore, MRH is relatively computationally efficient, making it suitable as a starting point for developing a scalable video-to-video recognition system.
Within the video-based system, we characterise the impact on recognition accuracy when using various methods of face selection to choose only a subset of faces for recognition. We also contrast those methods with the alternative of using information from all faces through feature clustering. We examine the trade-offs between computational effort and recognition performance present in face selection and clustering, and suggest situations where the two approaches might be best utilised.
The paper proceeds as follows: Sections 2 and 3 provide background on face selection and feature clustering in video; Section 4 describes our video-based face recognition framework; Section 5 discusses the experiments on the MOBIO dataset to compare the different approaches in face selection, and contrasts that to the utilisation of all faces through feature clustering. Conclusions and directions for future work are given in Section 6.
2 Background: Face Selection
As larger video datasets, such as the Mobile Biometrics (MOBIO) dataset, have been made available, face selection has become a more prominent topic to investigate. MOBIO has 17,480 videos and over 3 million frames; with such a large amount of information, balancing computational efficiency with recognition performance becomes very necessary.
The approach of independently using all or a subset of face images with a still image-based recognition method has been called the ‘key-frame (or exemplar) based approach’. In most cases ad-hoc heuristics are used to select key-frames. A common way of selecting a subset of faces is through a metric based on face detection confidence after the face detection step. Face confidence metrics can be based on located facial features (such as eyes and nose) within the face, or on face classification using pre-trained binary classifiers [9, 10]. The number of selected faces is typically chosen in a heuristic manner, such as taking a fixed number of faces or all faces above a certain confidence threshold.
There are typically two main reasons for not using all faces: the first is computational effort due to the size of the dataset, the second is that the marginal gain in recognition accuracy decreases after a certain number of faces. We will discuss the computational effort trade-offs in Section 5.3, and our experiments in Section 5.1 will analyse the second reason in more detail.
3 Background: Face Clustering
In the cases where computation time is less of a limitation, such as offline or batch processing, there is potential to utilise information from all faces in a video. However, using all faces for recognition must still be done in a tractable manner, either in the recognition step itself or in a pre-processing step. We propose to use facial feature clustering as such a pre-processing step.
Historically, video face recognition methods originate from still-image techniques, which are applied over the multiple face frames by treating each frame as a still image, with the distance calculation modified to accommodate multiple identification hypotheses. These approaches are classified by Matta and Dugelay as approaches that neglect temporal information. This class of video face recognition makes up the majority of published face recognition systems. They include video extensions of PCA, LDA, Active Appearance Models and Elastic Graph Matching. The major drawback of these approaches is that they may become computationally intractable to store and search for any significant amount of video. They also do not take advantage of the fact that sequential faces may be very similar and thus may be grouped together to reduce redundancy.
Temporal model and image-set matching approaches address this issue by modelling the distribution of face images over time or by their features. These approaches tend to integrate the information expressed by all the face images into a single model.
One such image-set matching solution is to cluster similar faces by feature similarity. Lee et al. proposed to learn a low-dimensional manifold, which is approximated by piecewise linear subspaces. To construct the representation, exemplars are first sampled from videos by finding frames with the largest distance to each other, corresponding to head pose changes in the video; these are further clustered using k-means clustering. Each cluster models face appearance in nearby poses, represented by a linear subspace computed by PCA. Arandjelovic et al. model the face appearance distribution as Gaussian Mixture Models (GMMs) on low-dimensional manifolds. In further work, they derived a local manifold illumination invariant, and formulated the face appearance distribution as a collection of Gaussian distributions corresponding to clusters obtained by k-means.
We propose a similar approach of clustering face features as a collection of Gaussians, as a precursor to face recognition for any system. This will be demonstrated using Multi-Region Histogram features rather than the previously used local manifolds.
4 Video-to-Video Matching
A generic face recognition system has the components of face detection, feature extraction, and face matching. The two system components proposed and analysed in this paper, face selection and facial feature clustering, fall in between the face detection and recognition steps, as illustrated in Fig. 1. The face recognition system presented here uses OpenCV for face detection in conjunction with a modified form of MRH for feature extraction. Details of the system components are given below.
4.1 Face Localisation
For face localisation, OpenCV’s Haar Feature-based Cascade Classifier is used to detect and localise faces in each frame. Eyes are located within each face using a Haar-based classifier. If no eyes are found, their locations are approximated based on the size of the localised face. The faces are then resized and cropped such that the eyes are at predefined locations with a 32-pixel inter-eye distance. The final face is a closely cropped ‘inner’ face of size 64×64 pixels (as later seen in Fig. 3), which attempts to exclude image areas susceptible to disguises, such as the hair and chin.
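The geometric normalisation above can be sketched as follows. This is a simplified illustration assuming roughly upright faces (no rotation handled); the 32-pixel inter-eye distance and 64×64 output size come from the text, but the target eye row and the horizontal centring of the eyes are illustrative assumptions, as are all names.

```python
def crop_box_for_eyes(left_eye, right_eye, eye_row=24, eye_gap=32, size=64):
    """Given detected eye centres (x, y) in the source frame, return the
    scale factor and the top-left corner of the square source region that,
    once resized to size x size pixels, places the eyes eye_gap pixels
    apart on row eye_row of the output crop."""
    lx, ly = left_eye
    rx, ry = right_eye
    scale = (rx - lx) / float(eye_gap)        # source pixels per output pixel
    left_eye_x_out = (size - eye_gap) / 2.0   # centre the eye pair horizontally
    x0 = lx - left_eye_x_out * scale
    y0 = (ly + ry) / 2.0 - eye_row * scale
    return scale, (x0, y0)
```

The returned box would then be cut from the frame and resampled to 64×64 with any image library.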
4.2 Face Selection
One approach to face selection is based on a metric of face detection confidence, i.e. the confidence of a face classifier that the region of interest is a face. The implementation of this metric varies. A generic method is to apply a post-processing step of a face/non-face binary classifier to all detected faces to obtain a confidence measure. We compare a face confidence method based on where landmarks (such as eyes and nose) are detected within the face, against the more naive methods of random and sequential selection.
Given any video $V$ for person $p$, let $N$ be the number of faces extracted from the video. Faces from $V$ are sorted chronologically and indexed as $f_1, \dots, f_N$. We can then select $K$ faces from $V$ to form a face set $F$. For random selection, $F = \{f_{r(1)}, \dots, f_{r(K)}\}$, where $r(\cdot)$ generates a unique random number between $1$ and $N$. For sequential selection, we select the first $K$ faces from $V$, that is $F = \{f_1, \dots, f_K\}$. For confidence selection, each face is processed by the face detector to obtain the confidence of the detection $c_i$; the top $K$ faces with the highest confidence are selected.
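The three selection strategies can be sketched as follows; `select_faces` and its arguments are illustrative names, not from the paper, and the per-face confidences are assumed to be supplied by the face detector.

```python
import random

def select_faces(faces, k, method="confidence", confidences=None, seed=0):
    """Select k faces from a chronologically ordered list of faces."""
    n = len(faces)
    k = min(k, n)
    if method == "sequential":        # first k faces in time order
        idx = list(range(k))
    elif method == "random":          # k unique random indices in [0, n)
        rng = random.Random(seed)
        idx = rng.sample(range(n), k)
    elif method == "confidence":      # k faces with highest detector confidence
        idx = sorted(range(n), key=lambda i: confidences[i], reverse=True)[:k]
    else:
        raise ValueError("unknown method: %s" % method)
    return [faces[i] for i in idx]
```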
4.3 Feature Extraction using MRH
The MRH approach is motivated by the concept of ‘visual words’ (originally used in image categorisation) as well as the semi-loose spatial constraints between face parts in 2D Hidden Markov Models. It can be briefly described as follows. A given face is divided into several fixed and adjacent regions (e.g. $3 \times 3$) that are further divided into small overlapping blocks (with a size of $8 \times 8$ pixels). For region $r$, a set of low-dimensional feature vectors $X_r = \{ \boldsymbol{x}_{r,1}, \dots, \boldsymbol{x}_{r,B} \}$ is obtained from the blocks in that region. Each block is normalised to have zero mean and unit variance, and descriptive features are extracted from each block via 2D DCT decomposition. Each feature vector $\boldsymbol{x}_{r,i}$ obtained from region $r$ is then represented as a high-dimensional probabilistic histogram:

$$ \boldsymbol{h}_{r,i} = \left[ \frac{w_1\, p_1(\boldsymbol{x}_{r,i})}{\sum_{g=1}^{G} w_g\, p_g(\boldsymbol{x}_{r,i})}, \; \dots, \; \frac{w_G\, p_G(\boldsymbol{x}_{r,i})}{\sum_{g=1}^{G} w_g\, p_g(\boldsymbol{x}_{r,i})} \right] \qquad (1) $$

where the $g$-th element in $\boldsymbol{h}_{r,i}$ is the posterior probability of $\boldsymbol{x}_{r,i}$ according to the $g$-th component of a ‘visual dictionary’ model, with an associated weight of $w_g$. The dictionary is a Gaussian Mixture Model with $G = 1024$ components, built from low-dimensional 2D DCT features extracted from training faces. The mean of each Gaussian in the dictionary can be thought of as a particular ‘visual word’. Robustness to face misalignment is achieved by representing each region as one average histogram:

$$ \boldsymbol{h}_{r,\mathrm{avg}} = \frac{1}{B} \sum_{i=1}^{B} \boldsymbol{h}_{r,i} \qquad (2) $$

For faces with a size of $64 \times 64$ pixels, there are 9 regions arranged in a $3 \times 3$ layout. This results in an MRH signature composed of 9 histograms, with each histogram having 1024 components:

$$ \mathrm{signature} = \left[ \boldsymbol{h}_{1,\mathrm{avg}}, \; \boldsymbol{h}_{2,\mathrm{avg}}, \; \dots, \; \boldsymbol{h}_{9,\mathrm{avg}} \right] \qquad (3) $$

One MRH signature is used to represent each face in each frame.
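The per-region posterior histogram and its average over blocks can be sketched as follows, using a toy diagonal-covariance GMM in place of the 1024-component visual dictionary; function names are illustrative.

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    # Diagonal-covariance Gaussian density evaluated at vector x.
    d = x - mean
    return np.exp(-0.5 * np.sum(d * d / var)) / np.sqrt(np.prod(2 * np.pi * var))

def mrh_region_histogram(block_features, weights, means, variances):
    """Average posterior histogram for one region.

    block_features: (B, D) array of DCT feature vectors from the region's
    blocks. weights/means/variances parameterise a G-component 'visual
    dictionary' GMM. Returns the length-G mean over blocks of the per-block
    component posteriors.
    """
    G = len(weights)
    hists = []
    for x in block_features:
        likes = np.array([weights[g] * gaussian_pdf(x, means[g], variances[g])
                          for g in range(G)])
        hists.append(likes / likes.sum())  # posterior over dictionary components
    return np.mean(hists, axis=0)          # averaging gives misalignment robustness
```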
4.4 Feature Clustering
We choose the widely known k-means algorithm to group a set of faces into k clusters and represent each cluster by its centroid [17, 18]. We adapt it to videos and MRH face signatures by seeding the clusters with faces spaced at regular intervals within a video; the distance metric used during the clustering process is described in Eqn. (4).
Once the k MRH clusters have been generated, the average of each cluster’s member signatures is used as its representative signature. The special case of k = 1 is just an average MRH signature over all available faces. In the experiments we also apply clustering on faces from multiple videos belonging to the same person.
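A minimal sketch of the adapted k-means step, assuming flattened MRH signatures as feature vectors, interval-based seeding as described, and the raw (un-normalised) L1 distance during clustering; all names are illustrative.

```python
import numpy as np

def cluster_signatures(sigs, k, iters=20):
    """k-means over flattened MRH signatures (one per face), seeded with the
    signatures of faces spaced at regular intervals through the video.
    Returns the k centroids (average MRH signatures) and cluster labels."""
    sigs = np.asarray(sigs, dtype=float)
    n = len(sigs)
    # Seed: faces at regular temporal intervals within the video.
    centroids = sigs[np.linspace(0, n - 1, k).astype(int)].copy()
    labels = np.zeros(n, dtype=int)
    for _ in range(iters):
        # Assign each signature to its nearest centroid under L1 distance.
        d = np.abs(sigs[:, None, :] - centroids[None, :, :]).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = sigs[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)  # average MRH signature
    return centroids, labels
```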
4.5 MRH Signature Comparison
Two MRH signatures, $A$ and $B$, are compared using an $L_1$-norm based distance measure:

$$ d_{\mathrm{raw}}(A, B) = \frac{1}{R} \sum_{r=1}^{R} \left\| \boldsymbol{h}_{r,\mathrm{avg}}^{[A]} - \boldsymbol{h}_{r,\mathrm{avg}}^{[B]} \right\|_1 \qquad (4) $$

where $R = 9$ is the number of regions. A decision on whether $A$ and $B$ represent the same person (i.e., matched pair) or two different persons (i.e., mismatched pair) can be obtained by comparing $d_{\mathrm{raw}}(A, B)$ to a threshold. However, in order to provide further robustness to varying image conditions present in $A$ and $B$, a normalised distance can be obtained by adapting the cohort normalisation approach originally used in speech processing [2, 19]:

$$ d_{\mathrm{norm}}(A, B) = \frac{d_{\mathrm{raw}}(A, B)}{\frac{1}{2M} \sum_{i=1}^{M} \left[ d_{\mathrm{raw}}(A, C_i) + d_{\mathrm{raw}}(B, C_i) \right]} \qquad (5) $$

Here, $C_i$ is the $i$-th cohort face and $M$ is the number of cohorts, with the cohort faces taken from the training set.
For probes and galleries with multiple MRH signatures, each of the probe’s $P$ MRH signatures is individually compared to each of the gallery’s $Q$ MRH signatures, resulting in $P \cdot Q$ distances. The lower the distance, the more similar two signatures are; thus the minimum distance is taken as the final distance between a probe-gallery video pair. The minimum distance is then compared to a threshold to obtain the final match/mismatch decision.
An appropriate threshold can be determined using a labelled set by finding the value which minimises the combined number of false positives (non-matching probe and gallery identities with a distance less than the threshold) and false negatives (matching probe and gallery identities with a distance greater than the threshold). This is also referred to as the minimum error rate and is used in the experiments.
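The distance measures and the minimum-distance video matching described in this section can be sketched as follows, with each signature stored as an (R, G) array of R region histograms; function names are illustrative.

```python
import numpy as np

def raw_distance(sig_a, sig_b):
    """L1 distance between two MRH signatures, averaged over regions."""
    return float(np.mean(np.abs(np.asarray(sig_a) - np.asarray(sig_b)).sum(axis=1)))

def normalised_distance(sig_a, sig_b, cohort):
    """Cohort-normalised distance: the raw A-B distance divided by the
    average raw distance of A and B to M cohort signatures."""
    m = len(cohort)
    denom = sum(raw_distance(sig_a, c) + raw_distance(sig_b, c)
                for c in cohort) / (2.0 * m)
    return raw_distance(sig_a, sig_b) / denom

def video_pair_distance(probe_sigs, gallery_sigs, cohort):
    """Minimum normalised distance over all probe/gallery signature pairs,
    as used for the final match decision between two videos."""
    return min(normalised_distance(p, g, cohort)
               for p in probe_sigs for g in gallery_sigs)
```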
5 Experiments and Discussion
In our experiments we used the large-scale ‘Mobile Biometry’ (MOBIO) dataset, which was created as part of a European project focusing on biometric person recognition from portable devices. The dataset is split into three distinct sets: one for training, one for development and one for testing. No persons are shared across any of the three sets.
The protocol for enrolling and testing is the same for the development set and the test set. There are five enrolment videos for each user and 75 test client (positive sample) videos for each user (15 from each session). When producing impostor scores all the other clients are used; for instance, if in total there were 50 clients, then the other 49 clients would perform an impostor attack.
For the development set, there are 20 female and 27 male users, which results in 30,000 probe to user comparisons for females and 54,675 for males. For the test set, there are 22 female and 39 male users, resulting in 36,300 comparisons for females and 114,075 for males.
The MOBIO experiment protocol involves evaluating a face recognition system on the development and test subsets for males and females. In this paper, we present the results of the four subsets with minimum error rate (MER), given by:

$$ \mathrm{MER} = \min_{t} \; \tfrac{1}{2} \left[ \mathrm{FAR}(t) + \mathrm{FRR}(t) \right] \qquad (6) $$

where $\mathrm{FAR}(t)$ and $\mathrm{FRR}(t)$ are the false acceptance rate and false rejection rate obtained at threshold $t$. An equal weighting was chosen for FAR and FRR to remain application neutral. MER is a variant of the equal error rate (EER), but is considered to be more reliable as it does not make any assumptions about the shape of the FAR and FRR curves.
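The MER criterion can be computed from a labelled set of probe-gallery distances by sweeping the decision threshold, as in this sketch (genuine pairs labelled True; a pair is accepted when its distance is at or below the threshold; names are illustrative).

```python
def minimum_error_rate(scores, labels):
    """Return the threshold t and MER = min over t of (FAR(t) + FRR(t)) / 2,
    where scores are distances and labels mark genuine (True) pairs."""
    n_imp = sum(1 for l in labels if not l)
    n_gen = sum(1 for l in labels if l)
    best_t, best_mer = None, 1.0
    for t in sorted(set(scores)):
        # False accepts: impostor pairs at or below the threshold.
        fa = sum(1 for s, l in zip(scores, labels) if s <= t and not l)
        # False rejects: genuine pairs above the threshold.
        fr = sum(1 for s, l in zip(scores, labels) if s > t and l)
        mer = 0.5 * (fa / n_imp + fr / n_gen)
        if mer < best_mer:
            best_t, best_mer = t, mer
    return best_t, best_mer
```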
5.1 Face Selection
In the face selection approach, a subset of faces is chosen for recognition based on a particular selection metric. We aim to characterise whether different metrics can improve recognition performance, and if so, by how much.
Our first experiment compares the recognition accuracy of the following three selection methods for an increasing number of selected faces: (i) face detection confidence, (ii) random selection, and (iii) sequential selection. After the faces are selected, the average MRH signature over all selected faces is used for recognition. The MER results are presented in column (a) of Fig. 2. The following three main observations can be made:
Using multiple faces always performs better than using only one face, but using all faces does not guarantee the best performance. This implies that average MRH signatures are generally a good representation of the varied samples, as the performance after averaging is always better than that of a single face.
The face selection method itself affects the recognition rate drastically. Random selection seems to provide slightly better performance in terms of minimum error rate when compared to sequential sampling of faces. From an information point of view, the reason is that sequential faces are likely to have much less variation compared to faces sampled randomly throughout the video. Face confidence, the most computationally expensive metric tested, typically gives better performance overall compared to the other two. The reason might be better alignment (i.e., a more frontally aligned face), as the confidence is related to how well facial landmarks are located within the face.
The optimal number of faces (the number which gives the lowest error) varies drastically across face selection methods as well as across the MOBIO subsets. Typically, training data is used for setting parameters such as the number of faces to use, and is assumed to have similar characteristics to test data. We can see that even within the same dataset, such as MOBIO, this assumption does not hold true. This highlights the fragility of face selection: it depends on heuristic choices (i.e., the number of faces or a confidence threshold) which are very dependent on the data and method. As such, face selection is not likely to translate well across various datasets.
5.2 Face Feature Clustering
In the cases where computation time is less of a limitation, such as offline or batch processing, there is potential to utilise information from all faces in a video. We propose clustering of facial features as a precursor to face recognition to make the face matching stage computationally tractable and more memory efficient (storing and comparing just the cluster centroids). In the MRH framework, clustering also takes advantage of the observation made in Section 5.1, where higher recognition accuracy was achieved by using multiple faces rather than a single face. This suggests average MRH signatures (such as the centroid of an MRH cluster) provide better signatures for recognition.
The experiments for k-means clustering were done using a single video (where faces from just one video of a person are clustered) and multiple videos (where faces from all videos of the same person are clustered). The recognition results for both cases using varying k are presented in column (b) of Fig. 2. The following three main observations can be made:
For every subset, the optimal k was greater than 1. As an example, the female development subset seems to give the best results for clustering at k = 2. Fig. 3 shows a few images from a video in that subset for two clusters. As can be observed in Fig. 3, clustering yielded visually discernible differences between the images. Cluster 2 has more closely cropped faces (with borders cutting off the edges of the face), whereas Cluster 1 shows more of the chin, hair and a bit of background. This reflects face alignment errors due to inaccuracies in eye localisation in the first stage of the video-based recognition system, and indicates that clustering may be a good way to minimise errors introduced in earlier stages of the system.
The clustering of faces from multiple videos was nearly always better (in terms of finding the overall minimum error rate for a subset) than clustering of faces from a single video alone. In MOBIO, each client gallery consisted of five separate videos. Based on the clustering results, these videos seem to be recorded in similar environments. Given this overlap in conditions, clustering by gallery likely performed better because each cluster contains more samples, providing more robust MRH signatures.
The optimal k varied depending on the MOBIO subset. However, unlike face selection, where heuristics are used to select the optimal number of faces, there is extensive literature on clustering methods which find the ‘natural clusters’ that fit the data. Thus clustering can be more robust in terms of maintaining optimal recognition accuracy for a video recognition system across many different datasets.
5.3 Comparing Face Selection & Clustering
While face selection and feature clustering are not mutually exclusive components in a video-based recognition system, both can separately contribute to reducing the computational effort to different degrees – face selection reduces the computational requirements for the subsequent steps of feature extraction and matching (distance calculation), while feature clustering reduces the computation requirements for the matching step only.
The computation times for each step of the face recognition process are provided in Table 1. Feature extraction is the most computationally expensive part, taking 0.390 seconds per face; its time scales linearly with the number of faces selected ($N$). The calculation of the normalised distance for a pair of MRH signatures takes approximately 0.002 seconds. However, if no clustering is performed and the distance is calculated naively on a pairwise basis per face per video, the matching time scales with $N^2$ and can quickly exceed the computation time for feature extraction.
For the experimental system (Table 1), the number of faces at which naive matching exceeds the feature extraction time can be found by solving $0.002N^2 = 0.390N$, which gives $N = 195$. On the other hand, when clustering is used, the distance calculation scales with the square of the number of clusters $k$, where $k$ will always be less than or equal to $N$. This demonstrates how computationally inefficient it is to not use clustering or some other modelling method for reducing signature sets.
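The break-even point can be reproduced directly from the timings quoted above:

```python
# Timings from Table 1 as quoted in the text.
t_extract = 0.390   # seconds per face for feature extraction
t_dist = 0.002      # seconds per signature-pair distance

# Extracting features for N faces costs t_extract * N seconds, while naive
# pairwise matching of two N-face videos costs t_dist * N**2 seconds.
# Matching starts to dominate once t_dist * N**2 > t_extract * N,
# i.e. once N exceeds t_extract / t_dist.
n_break = t_extract / t_dist   # break-even number of faces
```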
In terms of performance for face recognition accuracy, Table 2 shows that utilising information from all faces through clustering consistently shows better accuracy than using a subset of faces.
| Subset | Best Face Selection | Best Feature Clustering |
|---|---|---|
| Male dev subset | any (all faces) | multiple videos |
| Female dev subset | face conf. | multiple videos |
| Male test subset | random | single video |
| Female test subset | face conf. | multiple videos |
As separately observed in both the face selection and feature clustering experiments, the optimal value (i.e., number of faces or clusters) varied depending on the dataset and method. It was noted that for face selection, since the thresholds are chosen heuristically, this approach is particularly fragile to variations between datasets, which would lead to suboptimal performance. For feature clustering, variations in the optimal k are less of an issue, as there is extensive work on finding the ‘natural clusters’ that fit the data. Thus clustering is a more robust and reliable means of consistently boosting face recognition accuracy.
The above trade-offs between computation and accuracy are interesting to characterise as they aid in determining which approach is most suitable for particular applications. For example, face selection is more suitable for systems which have real-time requirements (such as live video monitoring) or limited computation restrictions (such as mobile phones). In contrast, feature clustering is more suitable for batch or offline processing such as forensic applications and pre-processed galleries for watchlists.
6 Conclusions and Future Work
In this paper, we examined two approaches of improving the performance of a video-based face recognition system — face selection and face feature clustering. Three methods of face selection were investigated: face detection confidence, random selection and sequential selection.
In comparing the three selection methods, it was found that: (i) using multiple faces is always better than using a single face alone, (ii) the face detection confidence metric typically provides better results when using a subset of faces, and (iii) the optimal number of faces to use varies drastically across selection methods and datasets (subsets of MOBIO).
For feature clustering, we used a k-means approach and found that the optimal k varied across datasets, and that more faces provided better cluster representations for recognition (i.e., clustering faces from multiple videos together is better than clustering faces from a single video).
When compared to face selection, the lowest error rates were always obtained through clustering, at the expense of higher computational effort. With face selection, the computational cost of the selection metric is typically low. However, its major drawback is that the number of faces is selected in a heuristic manner, and is as such highly dependent on both the dataset and the face selection metric. The parameters of face selection are not likely to translate well across datasets, thus potentially giving sub-optimal results.
With face feature clustering, the optimal number of clusters may vary across datasets; however, there are many principled methods of adaptive clustering for finding the optimal number of clusters. As such, the clustering approach is more robust and transferable across datasets. However, its main drawback is the computational effort required for face detection in all frames and the subsequent feature extraction.
Based on the above trade-offs, our experiments suggest that designers of video-based recognition systems should use facial feature clustering if they are able to process videos in a batch fashion (offline), as clustering can robustly maximise recognition accuracy. This would also be applicable to galleries of online systems as the galleries are typically processed in batch. In contrast, if the application has real-time requirements such as live video monitoring in surveillance, the selection of faces using a good face confidence metric may make the most sense.
Though we investigated face selection and face feature clustering individually, it is still worth exploring the combination of these two techniques, such as clustering on selected faces. In addition, it would also be worthwhile to further examine face feature clustering on other face recognition techniques.
NICTA is funded by the Australian Government’s Department of Broadband, Communications and Digital Economy as well as the Australian Research Council through Backing Australia’s Ability and the ICT Research Centre of Excellence programs. NICTA’s Queensland Laboratory is in part funded by the Queensland State Government.
This work is financially supported by the Australian Government through the National Security Science and Technology Branch within the Department of the Prime Minister and Cabinet. This support does not represent an endorsement of the contents or conclusions of the project.
-  G. B. Huang, V. Jain, and E. Learned-Miller, “Unsupervised joint alignment of complex images,” in International Conference on Computer Vision (ICCV), 2007.
-  C. Sanderson and B. C. Lovell, “Multi-region probabilistic histograms for robust and scalable identity inference,” in International Conference on Biometrics, Lecture Notes in Computer Science (LNCS), vol. 5558, 2009, pp. 199–208.
-  F. Matta and J.-L. Dugelay, “Person recognition using facial video information: A state of the art,” Journal of Visual Languages and Computing, vol. 20, pp. 180–187, 2009.
-  H. Wang, Y. Wang, and Y. Cao, “Video-based face recognition: A survey,” World Academy of Science, Engineering and Technology, vol. 60, pp. 293–302, 2009.
-  W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, “Face recognition: A literature survey,” ACM Comput. Surv., vol. 35, no. 4, pp. 399–458, 2003.
-  C. Shan, “Face recognition and retrieval in video,” Studies in Computational Intelligence, vol. 287, pp. 235–260, 2010.
-  S. Marcel, C. McCool, T. Ahonen, and J. Cernocky, “Mobile Biometry (MOBIO) Face and Speaker Verification Evaluation,” Martigny, Switzerland, IDIAP Research Report RR-09-2010, 2010.
-  D. Gorodnichy, “On importance of nose for face tracking,” in Proc. IEEE International Conference on Automatic Face and Gesture Recognition (FG), 2002, pp. 181–186.
-  M. Villegas and R. Paredes, “Simultaneous learning of a discriminative projection and prototypes for nearest-neighbor classification,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR), 2008.
-  S. Berrani and C. Garcia, “Enhancing face recognition from video sequences using robust statistics,” in Proc. IEEE International Conference on Advanced Video and Signal-Based Surveillance (AVSS), 2005, pp. 324–329.
-  K. Lee, J. Ho, M. Yang, and D. Kriegman, “Video-based face recognition using probabilistic appearance manifolds,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2003, pp. 313–320.
-  O. Arandjelovic, G. Shakhnarovich, J. Fisher, R. Cipolla, and T. Darrell, “Face recognition with image sets using manifold density divergence,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, 2005, pp. 581–588.
-  O. Arandjelovic and R. Cipolla, “Face set classification using maximally probable mutual modes,” in Proc. International Conference on Pattern Recognition (ICPR), vol. 1, 2006, pp. 511–514.
-  P. A. Viola and M. J. Jones, “Robust real-time face detection,” International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
-  E. Nowak, F. Jurie, and B. Triggs, “Sampling strategies for bag-of-features image classification,” in European Conf. Computer Vision (ECCV), Part IV, Lecture Notes in Computer Science (LNCS), vol. 3954, 2006, pp. 490–503.
-  R. Gonzalez and R. Woods, Digital Image Processing, 3rd ed. Prentice Hall, 2007.
-  A. K. Jain, “Data clustering: 50 years beyond K-means,” Pattern Recognition Letters, vol. 31, no. 8, pp. 651–666, June 2010.
-  D. MacKay, Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.
-  S. Furui, “Recent advances in speaker recognition,” Pattern Recognition Letters, vol. 18, no. 9, pp. 859–872, 1997.