Learning to Cluster Faces (CVPR 2019, CVPR 2020)
Face recognition has seen remarkable progress in recent years, and its performance has reached a very high level. Taking it to the next level requires substantially larger data, which would involve prohibitive annotation cost. Hence, exploiting unlabeled data becomes an appealing alternative. Recent works have shown that clustering unlabeled faces is a promising approach, often leading to notable performance gains. Yet how to cluster effectively, especially on a large-scale (i.e., million-level or above) dataset, remains an open question. A key challenge lies in the complex variations of cluster patterns, which make it difficult for conventional clustering methods to reach the needed accuracy. This work explores a novel approach, namely, learning to cluster instead of relying on hand-crafted criteria. Specifically, we propose a framework based on graph convolutional networks, which combines a detection and a segmentation module to pinpoint face clusters. Experiments show that our method yields significantly more accurate face clusters, which in turn lead to further performance gains in face recognition.
Thanks to the advances in deep learning techniques, the performance of face recognition has been remarkably boosted [25, 22, 27, 3]. However, it should be noted that the high accuracy of modern face recognition systems relies heavily on the availability of large-scale annotated training data. While one can easily collect a vast quantity of facial images from the Internet, annotating them is prohibitively expensive. Therefore, exploiting unlabeled data, e.g., through unsupervised or semi-supervised learning, becomes a compelling option and has attracted considerable interest from both academia and industry [30, 1].
A natural idea to exploit unlabeled data is to cluster them into “pseudo classes”, such that they can be used like labeled data and fed to a supervised learning pipeline. Recent works have shown that this approach can bring performance gains. Yet, current implementations of this approach still leave a lot to be desired. Particularly, they often resort to unsupervised methods, such as K-means, spectral clustering [11], hierarchical clustering [31], and approximate rank-order, to group unlabeled faces. These methods rely on simplistic assumptions: K-means implicitly assumes that the samples in each cluster lie around a single center, spectral clustering requires that the cluster sizes are relatively balanced, and so on. Consequently, they cannot cope with complicated cluster structures and often produce noisy clusters, especially when applied to large-scale datasets collected from real-world settings. This problem seriously limits the performance improvement.
Hence, to effectively exploit unlabeled face data, we need to develop an effective clustering algorithm that is able to cope with the complicated cluster structures arising frequently in practice. Clearly, relying on simple assumptions would not provide this capability. In this work, we explore a fundamentally different approach, that is, to learn how to cluster from data. Particularly, we desire to draw on the strong expressive power of graph convolutional network to capture the common patterns in face clusters, and leverage them to help to partition the unlabeled data.
We propose a framework for face clustering based on graph convolutional networks. This framework adopts a pipeline similar to that of Mask R-CNN for instance segmentation, i.e., generating proposals, identifying the positive ones, and then refining them with masks. These steps are accomplished respectively by an iterative proposal generator based on super-vertices, a graph detection network, and a graph segmentation network. It should be noted that while we are inspired by Mask R-CNN, our framework still differs essentially: Mask R-CNN operates on a 2D image grid, while our framework operates on an affinity graph with arbitrary structures. As shown in Figure 1, by relying on the structural patterns learned with a graph convolutional network instead of some simplistic assumptions, our framework is able to handle clusters with complicated structures.
The proposed method significantly improves the clustering accuracy on large-scale face data, achieving an F-score that is not only superior to the best result obtained by unsupervised clustering methods but also higher than a recent state of the art. Using this clustering framework to process the unlabeled data, we improve the performance of a face recognition model on MegaFace to a level that is quite close to the performance obtained by supervised learning on all the data.
The main contributions lie in three aspects: (1) We make the first attempt to perform top-down face clustering in a supervised manner. (2) It is the first work that formulates clustering as a detection and segmentation pipeline based on graph convolutional networks. (3) Our method achieves state-of-the-art performance in large-scale face clustering, and when the discovered clusters are used for training, it boosts the face recognition model close to the supervised result.
Clustering is a basic task in machine learning. Jain et al. provide a survey of classical clustering methods. Most existing clustering methods are unsupervised. Face clustering provides a way to exploit massive unlabeled data. The study along this direction remains at an early stage. The question of how to cluster faces on large-scale data remains open.
Early works use hand-crafted features and classical clustering algorithms. For example, Ho et al. used gradient and pixel intensity as face features. Cui et al. used LBP features. Both of them adopt spectral clustering. Recent methods make use of learned features. One approach performed top-down clustering in an unsupervised way. Finley et al. proposed an SVM-based supervised method in a bottom-up manner. Otto et al. used deep features from a CNN-based face model and proposed an approximate rank-order metric to link image pairs into clusters. Lin et al. designed a similarity measure based on a linear SVM trained on the nearest neighbours of data samples. Shi et al. proposed Conditional Pairwise Clustering, formulating clustering as a conditional random field to cluster faces by pair-wise similarities. Lin et al. proposed to exploit local structures of deep features by introducing minimal covering spheres of neighbourhoods to improve the similarity measure. Zhan et al. trained an MLP classifier to aggregate information and thus discover more robust linkages, then obtained clusters by finding connected components.
Though using deep features, these works mainly concentrate on designing new similarity metrics, and still rely on unsupervised methods to perform clustering. Unlike all the works above, our method learns how to cluster in a top-down manner, based on a detection-segmentation paradigm. This allows the model to handle clusters with complicated structures.
Graph Convolutional Networks (GCNs) extend CNNs to process graph-structured data. Existing work has shown the advantages of GCNs, such as the strong capability of modeling complex graphical patterns. On various tasks, the use of GCNs has led to considerable performance improvement [15, 9, 26, 29]. For example, Kipf et al. applied GCNs to semi-supervised classification. Hamilton et al. leveraged GCNs to learn feature representations. Berg et al. showed that GCNs are superior to other methods in link prediction. Yan et al. employed GCNs to model human joints for skeleton-based action recognition.
In this paper, we adopt the GCN as the basic machinery to capture cluster patterns on an affinity graph. To the best of our knowledge, this is the first work that uses a GCN to learn how to cluster in a supervised way.
In large-scale face clustering, the complex variations of the cluster patterns become the main challenge for further performance gain. To tackle the challenge, we explore a supervised approach, that is, to learn the cluster patterns based on graph convolutional networks. Specifically, we formulate this as a joint detection and segmentation problem on an affinity graph.
Given a face dataset, we extract the feature for each face image with a trained CNN, forming a set of features $\mathcal{D} = \{f_i\}_{i=1}^{N}$, where $f_i$ is a $d$-dimensional vector. To construct the affinity graph, we regard each sample as a vertex and find its $K$ nearest neighbors under cosine similarity. By connecting neighbors, we obtain an affinity graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ for the whole dataset. Alternatively, the affinity graph $\mathcal{G}$ can be represented by a symmetric adjacency matrix $A \in \mathbb{R}^{N \times N}$, where the element $a_{i,j}$ is the cosine similarity between $f_i$ and $f_j$ if the two vertices are connected, and zero otherwise. The affinity graph is a large-scale graph with millions of vertices. From such a graph, we desire to find clusters that have the following properties: (1) different clusters contain images with different labels; and (2) images in one cluster share the same label.
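The graph construction described above can be illustrated with a minimal sketch. It assumes brute-force cosine similarity over a tiny feature list; the function names `cosine` and `knn_affinity` and the toy features are illustrative, and at million scale an approximate nearest-neighbor search would presumably replace the quadratic loop.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def knn_affinity(features, k):
    """Return a symmetric sparse affinity {(i, j): similarity}.

    Each vertex is linked to its k most similar vertices; entries that
    are absent from the dict correspond to zeros in the adjacency matrix.
    """
    n = len(features)
    edges = {}
    for i in range(n):
        sims = [(cosine(features[i], features[j]), j) for j in range(n) if j != i]
        sims.sort(reverse=True)
        for sim, j in sims[:k]:
            edges[(i, j)] = sim
            edges[(j, i)] = sim  # keep the matrix symmetric
    return edges

# Toy example: two well-separated pairs of feature vectors.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
affinity = knn_affinity(feats, k=1)
```

With `k=1`, vertices 0 and 1 link to each other, as do vertices 2 and 3, and no edge crosses the two pairs.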
As shown in Figure 2, our clustering framework consists of three modules, namely proposal generator, GCN-D, and GCN-S. The first module generates cluster proposals, i.e., sub-graphs likely to be clusters, from the affinity graph. With all the cluster proposals, we then introduce two GCN modules, GCN-D and GCN-S, to form a two-stage procedure, which first selects high-quality proposals and then refines the selected proposals by removing the noise therein. Specifically, GCN-D performs cluster detection. Taking a cluster proposal as input, it evaluates how likely the proposal constitutes a desired cluster. Then GCN-S performs the segmentation to refine the selected proposals. Particularly, given a cluster, it estimates the probability of being noise for each vertex, and prunes the cluster by discarding the outliers. According to the outputs of these two GCNs, we can efficiently obtain high-quality clusters.
Instead of processing the large affinity graph directly, we first generate cluster proposals. This is inspired by the way region proposals are generated in object detection [7, 6]. Such a strategy can substantially reduce the computational cost, since only a limited number of cluster candidates need to be evaluated. A cluster proposal $\mathcal{P}_i$ is a sub-graph of the affinity graph $\mathcal{G}$. All the proposals compose a set $\mathcal{P} = \{\mathcal{P}_i\}$. The cluster proposals are generated based on super-vertices, and all the super-vertices form a set $\mathcal{S} = \{\mathcal{S}_i\}$. In this section, we first introduce the generation of super-vertices, and then devise an algorithm to compose cluster proposals thereon.
Super-Vertex. A super-vertex is a sub-graph containing a small number of vertices that are closely connected to each other. Hence, it is natural to use connected components to represent super-vertices. However, the connected components directly derived from the graph can be overly large. To maintain high connectivity within each super-vertex, we remove those edges whose affinity values are below a threshold $e_\tau$ and constrain the size of super-vertices to be below a maximum $s_{max}$. Alg. 1 shows the detailed procedure to produce the super-vertex set $\mathcal{S}$. Generally, an affinity graph is partitioned into many super-vertices, each containing only a small number of vertices on average.
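Alg. 1 itself is not reproduced in this text, so the following is only a sketch of one way to realize the description above: prune edges below a threshold, take connected components, and re-split any component that exceeds the size limit with a stricter threshold. The function names and the `step` increment are assumptions made for illustration.

```python
def connected_components(vertices, edges):
    """Return the connected components (as sets) of an undirected graph."""
    adj = {v: set() for v in vertices}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, comps = set(), []
    for v in vertices:
        if v in seen:
            continue
        seen.add(v)
        stack, comp = [v], set()
        while stack:
            u = stack.pop()
            comp.add(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps

def super_vertices(vertices, affinity, e_tau, s_max, step=0.05):
    """Split a graph into components of at most s_max vertices.

    affinity maps undirected edges (i, j) to similarities. Oversized
    components are re-processed with the threshold raised by `step`
    (a hypothetical increment), so the loop ends once edges run out.
    """
    result, pending = [], [(set(vertices), e_tau)]
    while pending:
        verts, threshold = pending.pop()
        kept = [(i, j) for (i, j), sim in affinity.items()
                if i in verts and j in verts and sim >= threshold]
        for comp in connected_components(sorted(verts), kept):
            if len(comp) <= s_max:
                result.append(comp)
            else:
                pending.append((comp, threshold + step))
    return result

# Toy chain graph: the weak edge (1, 2) is cut once the threshold rises.
aff = {(0, 1): 0.9, (1, 2): 0.62, (2, 3): 0.9, (3, 4): 0.95}
sv = super_vertices(range(5), aff, e_tau=0.6, s_max=3)
```

In this toy run the single five-vertex component is too large at threshold 0.6, so it is split at the next threshold into `{0, 1}` and `{2, 3, 4}`.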
Proposal Generation. Compared with the desired clusters, the super-vertex is a conservative formation. Although the vertices in a super-vertex are very likely to describe the same person, the samples of a person may be spread across several super-vertices. Inspired by the multi-scale proposals in object detection [7, 6], we design an algorithm to generate multi-scale cluster proposals. As Alg. 2 shows, we construct a higher-level graph on top of the super-vertices, with the centers of super-vertices as the vertices and the affinities between these centers as the edges. With this higher-level graph, we can apply Alg. 1 again and obtain proposals of larger sizes. By iteratively applying this construction $I$ times, we obtain proposals with multiple scales.
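The iterative construction can be sketched as follows. This is a simplified illustration rather than Alg. 2: it assumes the center of a super-vertex is the mean of its member features, and it merges groups with a plain similarity threshold and union-find instead of re-running the size-constrained procedure. All names here are hypothetical.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def center(features, members):
    """Mean feature vector of a group of vertices (assumed center definition)."""
    dim = len(features[0])
    return [sum(features[m][d] for m in members) / len(members) for d in range(dim)]

def merge_level(groups, features, threshold):
    """Link groups whose centers are similar and return the merged groups."""
    centers = [center(features, g) for g in groups]
    parent = list(range(len(groups)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a in range(len(groups)):
        for b in range(a + 1, len(groups)):
            if cosine(centers[a], centers[b]) >= threshold:
                parent[find(a)] = find(b)

    merged = {}
    for idx, g in enumerate(groups):
        merged.setdefault(find(idx), set()).update(g)
    return list(merged.values())

def multi_scale_proposals(super_vertices, features, threshold, iterations):
    """Collect proposals from every level of the hierarchy."""
    proposals = [set(s) for s in super_vertices]
    level = proposals
    for _ in range(iterations - 1):
        level = merge_level(level, features, threshold)
        for g in level:
            if g not in proposals:  # skip groups that did not change
                proposals.append(g)
    return proposals

feats = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.3], [0.0, 1.0]]
props = multi_scale_proposals([{0, 1}, {2}, {3}], feats, threshold=0.9, iterations=2)
```

Here the second iteration adds the larger proposal `{0, 1, 2}` while the original super-vertices are kept, so proposals of several scales coexist.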
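We devise GCN-D, a module based on a graph convolutional network (GCN), to select high-quality clusters from the generated cluster proposals. Here, the quality is measured by two metrics, namely IoU and IoP scores. Given a cluster proposal $\mathcal{P}$, these scores are defined as

$$\mathrm{IoU}(\mathcal{P}) = \frac{|\mathcal{P} \cap \hat{\mathcal{P}}|}{|\mathcal{P} \cup \hat{\mathcal{P}}|}, \qquad \mathrm{IoP}(\mathcal{P}) = \frac{|\mathcal{P} \cap \hat{\mathcal{P}}|}{|\mathcal{P}|}, \tag{1}$$

where $\hat{\mathcal{P}}$ is the ground-truth set comprising all the vertices with label $l(\mathcal{P})$, and $l(\mathcal{P})$ is the majority label of the cluster $\mathcal{P}$, i.e., the label that occurs the most in $\mathcal{P}$. Intuitively, IoU reflects how close $\mathcal{P}$ is to the desired ground-truth $\hat{\mathcal{P}}$, while IoP reflects the purity, i.e., the proportion of vertices in $\mathcal{P}$ that carry the majority label $l(\mathcal{P})$.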
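As a small worked example of the two quality scores, the sketch below computes them for a toy proposal. The function name `iou_iop` and the label dictionary are illustrative assumptions.

```python
from collections import Counter

def iou_iop(proposal, labels):
    """Compute (IoU, IoP) of a proposal against its majority-label class.

    proposal: set of vertex ids; labels: dict mapping every vertex in the
    dataset to its ground-truth label.
    """
    majority, _ = Counter(labels[v] for v in proposal).most_common(1)[0]
    ground_truth = {v for v, lab in labels.items() if lab == majority}
    inter = len(proposal & ground_truth)
    return inter / len(proposal | ground_truth), inter / len(proposal)

labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b"}
iou, iop = iou_iop({0, 1, 3}, labels)
```

The proposal `{0, 1, 3}` has majority label `a`, whose ground-truth set is `{0, 1, 2}`. Two vertices are shared, the union has four vertices, and the proposal has three, so IoU is 2/4 and IoP is 2/3.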
We assume that high-quality clusters usually exhibit certain structural patterns among the vertices. We introduce a GCN to identify such clusters. Specifically, given a cluster proposal $\mathcal{P}_i$, the GCN takes the visual features associated with its vertices (denoted as $F_0(\mathcal{P}_i)$) and the affinity sub-matrix (denoted as $A(\mathcal{P}_i)$) as input, and predicts both the IoU and IoP scores.
The GCN consists of $L$ layers, and the computation of each layer can be formulated as

$$F_{l+1}(\mathcal{P}_i) = \sigma\left(\tilde{D}(\mathcal{P}_i)^{-1} \left(A(\mathcal{P}_i) + I\right) F_l(\mathcal{P}_i) W_l\right), \tag{2}$$

where $A(\mathcal{P}_i)$ is the affinity sub-matrix of the proposal $\mathcal{P}_i$, $\tilde{D}(\mathcal{P}_i)$ is the diagonal degree matrix of $A(\mathcal{P}_i) + I$, $F_l(\mathcal{P}_i)$ contains the embeddings of the $l$-th layer, $W_l$ is a matrix to transform the embeddings, and $\sigma$ is the nonlinear activation function (ReLU is chosen in this work). Intuitively, this formula expresses a procedure of taking a weighted average of the embedded features of each vertex and its neighbors, transforming them with $W_l$, and then feeding them through a nonlinear activation. This is similar to a typical block in a CNN, except that it operates on a graph with arbitrary topology. On the top-level embeddings $F_L(\mathcal{P}_i)$, we apply max pooling over all the vertices in $\mathcal{P}_i$ and obtain a feature vector that provides an overall summary. Two fully-connected layers are then employed to predict the IoU and IoP scores, respectively.
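The layer described above (average each vertex with its neighbors, transform, apply ReLU, then max-pool over vertices) can be sketched in plain Python. This assumes self-loops are added to the affinity sub-matrix and rows are normalized by degree; the helper names and the tiny matrices are illustrative, and a real implementation would use a tensor library with sparse operations.

```python
def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, feats, weight):
    """One propagation step: ReLU(D^-1 (A + I) F W)."""
    n = len(adj)
    # Add self-loops so each vertex also keeps its own feature.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Row-normalize by degree, turning aggregation into a weighted average.
    norm = [[a_hat[i][j] / sum(a_hat[i]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, x) for x in row] for row in out]

def max_pool(feats):
    """Max over vertices for every feature channel."""
    return [max(col) for col in zip(*feats)]

adj = [[0.0, 1.0], [1.0, 0.0]]      # two connected vertices
feats = [[1.0, -2.0], [3.0, 4.0]]   # one 2-d feature per vertex
weight = [[1.0, 0.0], [0.0, 1.0]]   # identity transform for clarity
hidden = gcn_layer(adj, feats, weight)
summary = max_pool(hidden)
```

With an identity weight, both vertices end up with the average of the two input features, `[2.0, 1.0]`, and max pooling returns the same vector as the proposal-level summary.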
Given a training set with class labels, we can obtain the ground-truth IoU and IoP scores following Eq. (1) for each cluster proposal $\mathcal{P}_i$. Then we train the GCN-D module with the objective of minimizing the mean square error (MSE) between ground-truth and predicted scores. We experimentally show that, without any fancy techniques, the GCN can give accurate predictions. During inference, we use the trained GCN-D to predict both the IoU and IoP scores for each proposal. The IoU scores will be used in Sec. 3.5 to first retain proposals with high IoU. The IoP scores will be used in the next stage to determine whether a proposal needs to be refined.
The top proposals identified by GCN-D may not be completely pure. These proposals may still contain a few outliers, which need to be eliminated. To this end, we develop a cluster segmentation module, named GCN-S, to exclude the outliers from the proposal.
The structure of GCN-S is similar to that of GCN-D. The differences mainly lie in the values to be predicted. Instead of predicting quality scores of an entire cluster, GCN-S outputs a probability value for each vertex to indicate how likely it is a genuine member instead of an outlier.
To train the GCN-S, we need to prepare the ground truth, i.e., identify the outliers. This is nontrivial. A natural way is to treat all the vertices whose labels differ from the majority label as outliers. However, as shown in Fig. 3, this may encounter difficulties for a proposal that contains an almost equal number of vertices from two different classes. To avoid overfitting to manually defined outliers, we encourage the model to learn different segmentation patterns. As long as the segmentation result contains vertices from one class, whether or not that class is the majority label, it is regarded as a reasonable solution. Specifically, we randomly select a vertex in the proposal as the seed. The vertices that have the same label as the seed are regarded as positive vertices, while the others are considered outliers. We apply this scheme multiple times with randomly chosen seeds and thus acquire multiple training samples from each proposal, which may be annotated differently.
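The seed-based annotation scheme can be sketched as follows. The function names, the fixed random seed, and the toy proposal are assumptions for illustration only.

```python
import random

def seed_targets(proposal, labels, rng):
    """Pick a random seed vertex and mark same-label vertices as positive."""
    seed = rng.choice(proposal)
    targets = [1 if labels[v] == labels[seed] else 0 for v in proposal]
    return seed, targets

def training_samples(proposal, labels, num_seeds, rng=None):
    """Draw several differently annotated samples from one proposal."""
    rng = rng or random.Random(0)  # fixed seed only for reproducibility here
    return [seed_targets(proposal, labels, rng) for _ in range(num_seeds)]

labels = {0: "a", 1: "a", 2: "b", 3: "b"}
proposal = [0, 1, 2, 3]  # an evenly mixed proposal
samples = training_samples(proposal, labels, num_seeds=4)
```

For the evenly mixed proposal above, a seed from class `a` yields the target `[1, 1, 0, 0]` and a seed from class `b` yields `[0, 0, 1, 1]`; both count as valid segmentations.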
With the process above, we can prepare a set of training samples from the retained proposals. Each sample contains a set of feature vectors (one per vertex), an affinity matrix, and a binary vector indicating whether each vertex is positive. Then we train the GCN-S module, using the vertex-wise binary cross-entropy as the loss function. During inference, we also draw multiple hypotheses for a generated cluster proposal, and only keep the predicted result that has the most positive vertices after thresholding the predicted probabilities. This strategy avoids being misled by the case where a vertex associated with very few positive counterparts is chosen as the seed.
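The inference-time selection among hypotheses can be illustrated with a short sketch. The per-vertex probabilities would come from GCN-S; here they are hard-coded, and the function name and the threshold value passed in the example are assumptions.

```python
def select_hypothesis(vertices, hypotheses, threshold):
    """Keep the hypothesis with the most vertices predicted as positive.

    hypotheses: list of per-vertex probability lists, one list per seed.
    Returns the vertices retained under the selected hypothesis.
    """
    best = max(hypotheses, key=lambda probs: sum(p >= threshold for p in probs))
    return [v for v, p in zip(vertices, best) if p >= threshold]

vertices = [10, 11, 12, 13]
hypotheses = [
    [0.9, 0.8, 0.2, 0.1],  # seed drawn from the dominant group
    [0.1, 0.2, 0.9, 0.3],  # seed that happened to be an outlier
]
kept = select_hypothesis(vertices, hypotheses, threshold=0.5)
```

The first hypothesis marks two vertices as positive and the second only one, so the first is kept and the outlier-seeded hypothesis is discarded.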
We only feed proposals with intermediate IoP scores, between a lower and an upper threshold, to GCN-S. When a proposal is very pure, the outliers are usually hard examples that need not be removed. When a proposal is very impure, it is probable that no class dominates, so the proposal might not be suitable for processing by GCN-S. With the GCN-S predictions, we remove the outliers from the proposals.
The three stages described above result in a collection of clusters. However, it is still possible that different clusters may overlap, i.e. sharing certain vertices. This may cause an adverse effect to the face recognition training performed thereon. Here, we propose a simple and fast de-overlapping algorithm to tackle this problem. Specifically, we first rank the cluster proposals in descending order of IoU scores. We sequentially collect the proposals from the ranked list, and modify each proposal by removing the vertices seen in preceding ones. The detailed algorithm is described in Alg. 3.
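Alg. 3 is not reproduced in this text; the following is a minimal sketch of the ranking-and-removal procedure described above. The function name is illustrative, and dropping proposals that become empty is an assumption.

```python
def de_overlap(proposals, iou_scores):
    """Assign every vertex to the highest-scoring proposal containing it."""
    order = sorted(range(len(proposals)), key=lambda i: iou_scores[i], reverse=True)
    seen, clusters = set(), []
    for i in order:
        remaining = set(proposals[i]) - seen  # drop vertices already claimed
        if remaining:
            clusters.append(remaining)
            seen |= remaining
    return clusters

proposals = [{0, 1, 2}, {2, 3}, {3, 4, 5}]
scores = [0.9, 0.5, 0.8]  # predicted IoU per proposal
clusters = de_overlap(proposals, scores)
```

Each vertex is visited a bounded number of times after sorting, which is why the procedure is cheap compared with pairwise suppression. In the example, the lowest-ranked proposal `{2, 3}` is fully absorbed by the two higher-ranked ones.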
Compared to Non-Maximum Suppression (NMS) in object detection, the de-overlapping method is more efficient. Particularly, NMS has a complexity of $O(N^2)$ in the number of proposals $N$, while de-overlapping has $O(N)$. This process can be further accelerated by setting a threshold of IoU for de-overlapping.
Training set. MS-Celeb-1M is a large-scale face recognition dataset. Because its original identity labels are obtained automatically from webpages, they are very noisy. We clean the labels based on the annotations from ArcFace, yielding a reliable subset. The cleaned dataset is randomly split into 10 parts with an almost equal number of identities. We randomly select 1 part as labeled data and use the other 9 parts as unlabeled data. The YouTube Faces dataset contains videos, from which we extract frames for evaluation. Particularly, we use part of the frames for training and the other images for testing.
Testing set. MegaFace is the largest public benchmark for face recognition. It includes a probe set of images from FaceScrub and a gallery set. IJB-A is another face recognition benchmark.
Metrics. We assess the performance on two tasks, namely face clustering and face recognition. Face clustering aims to group all the images of the same identity into one cluster, and its performance is measured by pairwise recall and pairwise precision. To consider both precision and recall, we report the widely used F-score, i.e., the harmonic mean of precision and recall. Face recognition is evaluated with the face identification benchmark in MegaFace and the face verification protocol of IJB-A. For MegaFace, we adopt the top-1 identification hit rate, which ranks the top-1 image from the gallery images and computes the hit rate. For IJB-A, we adopt the protocol of face verification, which is to determine whether two given face images are from the same identity, and report the true positive rate at a fixed false positive rate.
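The pairwise metrics can be computed without enumerating all pairs, as the sketch below shows. It assumes the usual definition in which a pair is positive when both images share a cluster (prediction) or an identity (ground truth); the function names are illustrative.

```python
from collections import Counter

def num_pairs(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2

def pairwise_f_score(pred, true):
    """Return (precision, recall, F-score) over all same-cluster pairs."""
    same_pred = sum(num_pairs(c) for c in Counter(pred).values())
    same_true = sum(num_pairs(c) for c in Counter(true).values())
    # Pairs that agree in both the predicted cluster and the true identity.
    same_both = sum(num_pairs(c) for c in Counter(zip(pred, true)).values())
    precision = same_both / same_pred if same_pred else 0.0
    recall = same_both / same_true if same_true else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

pred = [0, 0, 0, 1, 1]            # predicted cluster ids
true = ["a", "a", "b", "b", "b"]  # ground-truth identities
precision, recall, f_score = pairwise_f_score(pred, true)
```

In this example there are four predicted same-cluster pairs and four true same-identity pairs, of which two coincide, so precision, recall, and F-score are all 0.5.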
Implementation Details. We use a GCN with two hidden layers in our experiments. It is trained with momentum SGD. Proposals are generated with the edge threshold $e_\tau$ and the maximum super-vertex size $s_{max}$ of Alg. 1.
We compare the proposed method with a series of clustering baselines. These methods are briefly described below.
(1) K-means, the most commonly used clustering algorithm. With a given number of clusters $k$, K-means minimizes the total intra-cluster variance.
(2) DBSCAN, a density-based clustering algorithm. It extracts clusters according to a designed density criterion and leaves the sparse background as noise.
(3) HAC, hierarchical agglomerative clustering, a bottom-up approach that iteratively merges close clusters based on some criteria.
(4) Approximate Rank Order, an algorithm that is a form of HAC. It performs only one iteration of clustering with a modified distance measure.
(5) CDP, a recent work that proposes a graph-based clustering algorithm. It better exploits the pairwise relationship in a bottom-up manner.
(6) GCN-D, the first module of the proposed method. It applies a GCN to learn cluster patterns in a supervised way.
(7) GCN-D + GCN-S, the two-stage version of the proposed method. GCN-S refines the output of GCN-D by detecting and discarding noise inside clusters.
To control the experimental time, we randomly select one part of the data for evaluation. Tab. 1 compares the performance of different methods on this set. The clustering performance is evaluated by both F-score and time cost. We also report the number of clusters, pairwise precision, and pairwise recall to better show the advantages and disadvantages of each method.
The results show:
(1) For K-means, the performance is influenced greatly by the number of clusters $k$. We vary $k$ over a range of values and report the result with a high F-score.
(2) DBSCAN reaches a high precision but suffers from low recall. It may fail to deal with large density differences in large-scale face clustering.
(3) HAC gives more robust results than the previous methods. Note that the standard algorithm consumes $O(N^2)$ memory, which exceeds the memory capacity when the number of samples $N$ is large. We use an adapted hierarchical clustering for comparison, which requires only $O(Nd)$ memory, where $d$ is the feature dimension.
(4) Approximate Rank Order is very efficient due to its one-iteration design, but its performance is inferior to other methods in our setting.
(5) As a recent work designed to exploit unlabeled data for face recognition, CDP achieves a good balance of precision and recall. For a fair comparison, we compare with the single-model version of CDP. Note that the ideas of CDP and our approach are complementary and can be combined to further improve the performance.
(6) Our method applies a GCN to learn cluster patterns. It improves the precision and recall simultaneously. Tab. 2 demonstrates that our method is robust and can be applied to datasets with different distributions. Since the GCN is trained using multi-scale cluster proposals, it may better capture the properties of the desired clusters. As shown in Fig. 8, our method is capable of pinpointing some clusters with complex structure.
(7) The GCN-S module further refines the cluster proposals from the first stage. It improves the precision by sacrificing a little recall, resulting in an overall performance gain.
In the whole procedure of our method, proposals are generated on a CPU, and the inference of GCN-D and GCN-S runs on a GPU. To compare the runtime fairly, we also test all our modules on CPU, where our method is still faster than most of the methods we compared. The speed gain of using a GPU is not very significant in this work, as the main computing cost lies in the GCN. Since the GCN relies on sparse matrix multiplication, it cannot make full use of GPU parallelism. The runtime of our method grows linearly with the amount of unlabeled data, and the process can be further accelerated by increasing the batch size or parallelizing over more GPUs.
With the trained clustering model, we apply it to unlabeled data to obtain pseudo labels. We investigate how the unlabeled data with pseudo labels enhance the performance of face recognition. Particularly, we use the following steps to train face recognition models: (1) train the initial recognition model with labeled data in a supervised way; (2) train the clustering model on the labeled set, using the feature representation derived from the initial model; (3) apply the clustering model to group unlabeled data of various amounts (1, 3, 5, 7, 9 parts), and thus attach “pseudo-labels” to them; and (4) train the final recognition model using the whole dataset, with both the original labeled data and the data with assigned pseudo-labels. The model trained only on the 1 part of labeled data is regarded as the lower bound, while the model supervised by all the parts with ground-truth labels serves as the upper bound in our problem. For all clustering methods, each unlabeled image belongs to a unique cluster after clustering. We assign each image a pseudo label equal to its cluster id.
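The pseudo-label assignment in step (3) can be sketched as below. Offsetting cluster ids by the number of labeled classes, so that pseudo classes do not collide with real ones when the two sets are merged for training, is an assumption of this sketch, as are the function and image names.

```python
def assign_pseudo_labels(clusters, num_labeled_classes):
    """Map every clustered image to a pseudo class id.

    clusters: list of disjoint sets of image ids. Cluster k receives the
    class id num_labeled_classes + k (hypothetical offset convention).
    """
    pseudo = {}
    for cluster_id, members in enumerate(clusters):
        for image in members:
            pseudo[image] = num_labeled_classes + cluster_id
    return pseudo

clusters = [{"img_a", "img_b"}, {"img_c"}]
pseudo = assign_pseudo_labels(clusters, num_labeled_classes=100)
```

The resulting mapping can then be concatenated with the labeled set, and the final recognition model is trained as an ordinary classifier over the combined classes.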
Fig. 5 indicates that the performance of face clustering is crucial for improving face recognition. For K-means and HAC, although the recall is good, the low precision indicates noisy predicted clusters. When the ratio of unlabeled to labeled data is small, the noisy clusters severely impair face recognition training. As the ratio of unlabeled to labeled data increases, the gain brought by the additional unlabeled data alleviates the influence of noise. However, the overall improvement is limited. Both CDP and our approach benefit from the increase of unlabeled data. Owing to the performance gain in clustering, our approach outperforms CDP consistently and improves the performance of the face recognition model on MegaFace to a level close to the fully supervised upper bound.
We randomly select one part of the unlabeled data to study some important design choices in our framework.
Cluster proposal generation is the fundamental module in our framework. With a fixed $K$ and different $I$, $e_\tau$ and $s_{max}$, we generate a large number of proposals with multiple scales. Generally, a larger number of proposals results in better clustering performance. There is a trade-off between performance and computational cost in choosing the proper number of proposals. In Fig. 4, each point represents the F-score under a certain number of proposals, and different colors denote different iteration steps. (1) In the first iteration, only the super-vertices generated by Alg. 1 are used. By choosing different $e_\tau$, more proposals are obtained to increase the F-score. The performance gradually saturates as the number of proposals grows. (2) In the second iteration, different combinations of super-vertices are added to the proposals. Recall that this step leverages the similarity between super-vertices, so it effectively enlarges the receptive field of the proposals. With a small number of proposals added, it boosts the F-score. (3) In the third iteration, similar proposals from previous stages are further merged to create proposals with larger scales, which continues to contribute to the performance gain. However, with the increasing proposal scales, more noise is introduced into the proposals, hence the performance gain saturates.
Table 3 (only row f survives):
| Setting | Channels | Pooling | Vertex feature | F-score |
| f | 256, 128, 64 | max | ✓ | 77.95 |
Although the training of GCNs does not require any fancy techniques, there are some important design choices. As Tabs. 3a, 3b and 3c indicate, the pooling method has a large influence on the F-score. Both mean pooling and sum pooling impair the clustering results compared with max pooling. Sum pooling is sensitive to the number of vertices and tends to produce large proposals. Large proposals result in high recall but low precision, ending up with a low F-score. On the other hand, mean pooling better describes the graph structures, but may suffer from the outliers in the proposal. Besides the pooling methods, Tabs. 3c and 3d show that lacking the vertex feature significantly reduces the GCNs’ prediction accuracy. This demonstrates the necessity of leveraging both vertex features and graph structure during GCN training. In addition, as shown in Tabs. 3c, 3e and 3f, widening the channels of GCNs can increase their expressive power, but a deeper network may drive the hidden features of vertices to be similar, resulting in an effect like mean pooling.
In our framework, GCN-S is used as a de-noising module after GCN-D. However, it can also act as an independent module combined with previous methods. Given the clustering results of K-means, HAC and CDP, we regard them as cluster proposals and feed them into GCN-S. As Fig. 7 shows, GCN-S can improve their clustering performance by discarding the outliers inside clusters, yielding a performance gain for each of these methods.
NMS is a widely used post-processing technique in object detection, which can be an alternative to de-overlapping. Given an IoU threshold, it keeps the proposal with the highest predicted IoU while suppressing other overlapping proposals. The computational complexity of NMS is $O(N^2)$. Compared with NMS, de-overlapping does not suppress other proposals and thus retains more samples, which increases the clustering recall. As shown in Fig. 7, de-overlapping achieves better clustering performance and can be computed in linear time.
This paper proposes a novel supervised face clustering framework based on graph convolutional networks. Particularly, we formulate clustering as a detection and segmentation paradigm on an affinity graph. The proposed method outperforms previous methods on face clustering by a large margin, which consequently boosts the face recognition performance close to the supervised result. Extensive analyses further demonstrate the effectiveness of our framework.
Acknowledgement This work is partially supported by the Collaborative Research grant from SenseTime Group (CUHK Agreement No. TS1610626 & No. TS1712093), the Early Career Scheme (ECS) of Hong Kong (No. 24204215), the General Research Fund (GRF) of Hong Kong (No. 14236516, No. 14203518 & No. 14241716), and Singapore MOE AcRF Tier 1 (M4012082.020).
Supervised clustering with support vector machines. In ICML. ACM, 2005.
Finding groups in data: an introduction to cluster analysis, volume 344. John Wiley & Sons, 2009.