Facial image retrieval is an interesting yet challenging task because of its practical use in computational forensics. It is even more difficult when the target image is not known to the system but only exists in the user’s mind (a mental image). The user’s descriptions of the target image are therefore necessary for reasonable retrieval results. However, compared to describing the perceived face in absolute terms, people naturally find it easier to refer to an existing image and describe the differences. For example, in Figure 1, when a user is trying to identify a character without knowing the character’s name but with only an impression of the character’s face, a reference image is shown to the user and the user responds with differences between the reference image and the mental image. Upon receiving these responses, the system shows another reference image for further feedback. Ideally, the reference images shown to the user are refined and move closer to the mental image over time. This process may last for a limited number of rounds or until the mental image is retrieved.
In reality, when people are asked to describe the facial appearance of another person, the descriptions can be roughly divided into two categories: basic descriptions and advanced descriptions. Basic descriptions contain objective facts such as hair color or whether the person is wearing a hat or eyeglasses, which can be conveniently mapped to attributes with definite values. Advanced descriptions involve subjective impressions that people form from facial appearance, such as beauty, aging, or friendliness. Previous studies have demonstrated the instability of advanced descriptions due to ambiguities in human perception (Sorokowski et al., 2013; Blais et al., 2008; Miellet et al., 2013; Engelmann and Pogosyan, 2013). Moreover, researchers have expressed concerns that verbally attending to facial differences might alter a witness’s memory of the original face, which can be detrimental to forensic applications (Brown and Lloyd-Jones, 2005). Therefore, we take an attribute-aware approach (e.g. “with or without glasses”) in which users can describe and respond easily and efficiently. An appropriate dataset for this research goal is CelebA, which will be described in Subsection 4.1.
Recent research (Kovashka et al., 2015; Ferecatu and Geman, 2007; Zavesky and Chang, 2008) shows that interactive image retrieval can integrate user feedback and improve retrieval performance through relevance feedback. Therefore, we build an interactive retrieval system that collects the user’s feedback and refines the retrieval results over multiple rounds. Apart from the interactive framework, our model introduces an extra mechanism, progressiveness, during the interactions.
Because it is difficult for real people to describe and compare facial images held in mind, they can hardly provide thorough feedback at each round. Thus, we design a mechanism that discloses the feedback progressively. Specifically, in each round of retrieval, the system is given only part of the relevance feedback, and the rest is masked. In the following rounds, the ratio of masked relevance feedback is gradually decreased, allowing more information to be disclosed. This setting mimics a progressive disclosure that might better reflect how human memory works. It outperforms all other settings in our experiments and might rest on a principle similar to dropout.
Since the system knows only the attributes of the target image and not the image itself, one naive approach would be to annotate the attributes of every image in the database and search for the closest one. However, annotating a dataset is expensive and exhausting. Thus, our model retrieves the image by combining image features with instant feedback, without prior annotations. When the system returns a candidate image in each round of retrieval, the user is only required to provide relevance feedback between it and the mental target image. We believe this instantaneous response process is light and feasible, since our system runs for only a few rounds with only one image per round.
The key contributions of our work are as follows:
A new retrieval problem setting on human facial images under interactive search where users are allowed to convey their mental images to the system and iteratively refine the retrieved results.
An end-to-end interactive Content-Based Image Retrieval (CBIR) framework that addresses the above problem setting with a supervised learning approach.
A novel progressive disclosure mechanism in collecting relevance feedback from users during multiple rounds of interaction. The mechanism reaches the best performance while mimicking human behaviors.
An instant feedback setting for interactive applications. The setting can help to reduce workload of manual annotations necessary for learning about the annotator.
The paper is organized in sections. After reviewing related work in Section 2, we describe the structure of our framework (in Subsection 3.1) and the algorithm it runs by (in Subsection 3.2) in Section 3. We then continue to Section 4, where we demonstrate the validity with a baseline experiment and the robustness with a series of ablation studies. Finally in Section 5, we make final remarks about our work in terms of its applications and limitations.
2. Related Work
In this section, we introduce related work in the fields of image retrieval and facial recognition.
2.1. Image Retrieval Researches
With countless images generated every day, efficient navigation demands intuitive approaches that are aware of image content, giving rise to the field of Content-Based Image Retrieval (CBIR) (Gudivada and Raghavan, 1995). To further facilitate retrieval, interactive querying methods have been developed over the past decades. A prominent approach in this area is relevance feedback (RF) (Rui et al., 1998). In traditional RF settings, users evaluate how relevant one retrieved image is to their desired result, report this perceived relevance as a numerical value, and expect a refined result from the next retrieval.
However, for complex images, a single relevance value can be ambiguous and thus misleading (Geman and Moquet, 2000). To improve the specificity of feedback, relative attributes (RA) have been proposed as a new mechanism (Kovashka et al., 2012). In RA-enabled CBIR, the user dictates which attribute(s) of the candidate image(s) should be tweaked, and optionally by how much. Users may also tune the parameters of the system with an emphasis on specific attributes and image features (Flickner et al., 1995; Ma and Manjunath, 1999; Iqbal and Aggarwal, 2002).
A feedback mechanism implies that each retrieval involves multiple rounds of information exchange, or a dialog. Each round provides the CBIR system with extra information to refine its results. This refinement process can take the form of a decision tree (MacArthur et al., 2002) or a neural network (NN) (Wang et al., 2006). In NN-based implementations, reinforcement learning (RL) can be employed to reduce training-time supervision (Das et al., 2017). Owing to the descriptive nature of feedback, the CBIR experience can be further enhanced with natural language processing (NLP) (Harada et al., 1997). Moreover, a combination of RL and NLP has proved useful in a shoe-shopping setting (Guo et al., 2018).
2.2. Facial Image Researches
In our work, we study the retrieval of human face images instead of footwear. Face image retrieval has socially critical applications in industries such as forensics (Monroe, 2009). A typical use case would be having a witness identify a specific person in closed-circuit television (CCTV) recordings. The footage repository can be huge and impossible for untrained eyes to examine thoroughly. This workload demands facial recognition technology combined with CBIR techniques.
Plenty of CBIR research involving RF has focused on algorithmically generated visual descriptors, such as MPEG-7 (Wong et al., 2005). These low-level features (hue, angle, slope, etc.) are infamously difficult to map to high-level concepts (glasses on, oval face, etc.) (Rui et al., 1997). To bridge this gap, our pipeline employs a pre-trained Convolutional Neural Network (CNN) for extracting facial features.
3.1. Model Architecture
Our model, shown in Figure 2, consists of four components, each described in detail below.
3.1.1. The User Simulator
The user simulator mimics a human user who has a target image in mind and provides feedback at each round.
A human user 1) annotates the attributes of a given candidate image, 2) compares specific attributes of the candidate image with those of the target image (which only exists in the user’s mind), and 3) reports the result to the encoder model. Note that the target image is not known to the system because it represents the image in the witness’s memory, and its attributes are not explicitly defined either.
During training, we use the existing attribute annotations in CelebA to avoid the additional inaccuracies that a new annotator would introduce. In testing, we assume that online annotations made by a person would match the existing annotations under ideal circumstances. Therefore, we again use the existing annotations to evaluate our model.
3.1.2. The Encoder Model
This encoder model encodes signals from different spaces into one unified representation. Besides the candidate image annotations and the relevance feedback provided by the user simulator, the candidate image itself is also referenced so the encoder model can learn the correspondences between image features and attributes to produce a shared representation of them.
3.1.3. The Aggregator
3.1.4. The Retrieval Model
The retrieval model searches the database for a new candidate image that best matches the representation and returns it to the user.
Upon receiving the aggregated representation from the aggregator, it computes the distance between that representation and the representation of each image in the database, and selects the nearest neighbors.
During training, it returns a random image among these nearest neighbors for the sake of robustness; in testing, we use a greedy approach that returns exactly the nearest one.
At the beginning of each retrieval, the user simulator randomly selects an image from the database as its target image. The user simulator then annotates and stores the target image for convenience when calculating relevance feedback in each round. For simplicity, we use the existing annotations in the dataset, which form a fixed-length Boolean vector with one entry per attribute. Before the first round of retrieval, the retrieval model randomly returns a candidate image from the database as a starting point.
After initialization, the system executes the following steps iteratively until termination.
In the User Simulator.
At each round of retrieval, the user simulator first annotates the current candidate image and stores the annotation as a Boolean attribute vector of the same dimension as the target annotation. It then calculates the relevance between the candidate image and the target image as the elementwise product of the two attribute vectors. Attributes, like all other Boolean values in the system, use one fixed numeric value for False and another for True. This encoding makes the relevance binary as well, with the same two values as the attributes: a term in the relevance takes one value if the corresponding attribute of the candidate image differs from that of the target image, and the other value if they are identical. To realize progressiveness during the retrieval, a certain proportion of the relevance feedback is replaced by a mask value; this proportion represents the masked part. As the round index increases, the masked proportion gradually decreases, so that more and more of the relevance feedback is disclosed. In our implementation, we set . The computed relevance is then fed to the encoder model along with the annotated attributes.
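The user simulator step can be illustrated with a minimal sketch. Several details here are assumptions rather than values from this paper: attributes are encoded as -1 (False) and +1 (True), which is the encoding under which an elementwise product marks agreement and disagreement; masked entries are set to 0; and the linear masking schedule `masked_ratio_at` is purely hypothetical, since the schedule used in our implementation is not reproduced here.

```python
import random

def relevance(candidate_attrs, target_attrs):
    """Elementwise product of two +1/-1 attribute vectors:
    +1 where the attributes agree, -1 where they differ."""
    return [c * t for c, t in zip(candidate_attrs, target_attrs)]

def masked_ratio_at(round_idx, max_rounds):
    """Hypothetical schedule: the masked ratio shrinks linearly to 0."""
    return 1.0 - round_idx / max_rounds

def mask_relevance(rel, masked_ratio, rng):
    """Replace a `masked_ratio` fraction of entries with 0 (assumed mask value)."""
    n_masked = round(masked_ratio * len(rel))
    masked_idx = set(rng.sample(range(len(rel)), n_masked))
    return [0 if i in masked_idx else r for i, r in enumerate(rel)]

rng = random.Random(0)
target = [1, -1, 1, 1, -1, -1, 1, -1]     # toy target annotation
candidate = [1, 1, 1, -1, -1, 1, 1, -1]   # toy candidate annotation
rel = relevance(candidate, target)
for t in range(1, 5):
    ratio = masked_ratio_at(t, 4)
    # Later rounds disclose more of the relevance vector.
    print(t, ratio, mask_relevance(rel, ratio, rng))
```

With this encoding, the disclosed entries tell the encoder exactly which attributes of the candidate to keep and which to flip, while masked entries carry no information for that round.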
In the Encoder Model.
First, the annotated attributes and the relevance feedback are concatenated and embedded by a linear transformation, which we call the indication layer, with the intuition that some attributes (such as gender, compared to nose size) are more indicative than others. Meanwhile, a CNN extracts the features of the candidate image, which are passed through a second linear transformation. In our implementation, the CNN is a pre-trained SE-ResNet (Hu et al., 2018). The outputs of these two linear transformations are concatenated and fused in a multi-layer perceptron (MLP).
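The data flow of the encoder can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the layer sizes, the two-layer MLP with a ReLU, the random weights, and all names (`linear`, `encode`, `params`) are hypothetical, and the CNN features are stood in for by a fixed toy vector.

```python
import random

def linear(weights, bias, x):
    """Affine map y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def random_layer(n_out, n_in, rng):
    """Small random weights; a trained model would learn these."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def encode(attrs, rel, image_features, params):
    # Indication layer: embed the concatenated attributes and relevance.
    u = linear(*params["indication"], attrs + rel)
    # Second linear transformation on the (precomputed) CNN features.
    v = linear(*params["image"], image_features)
    # Fuse both embeddings with a small MLP (assumed: two layers, ReLU).
    hidden = relu(linear(*params["mlp1"], u + v))
    return linear(*params["mlp2"], hidden)

rng = random.Random(0)
n_attrs, n_feat, n_embed = 4, 6, 3   # toy sizes, not the paper's
params = {
    "indication": random_layer(n_embed, 2 * n_attrs, rng),
    "image": random_layer(n_embed, n_feat, rng),
    "mlp1": random_layer(n_embed, 2 * n_embed, rng),
    "mlp2": random_layer(n_embed, n_embed, rng),
}
fused = encode([1, -1, 1, 1], [1, 0, -1, 0],
               [0.2, -0.1, 0.4, 0.0, 0.3, -0.5], params)
print(len(fused))
```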
In the Aggregator.
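Historical information is then incorporated by a GRU followed by a third linear transformation: the GRU takes the fused representation of the current round together with its hidden state from the previous round, and its output is passed through the linear transformation. The final representation thus combines the history of previous rounds with the information of the current round.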
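For readers unfamiliar with GRUs, the aggregation can be sketched with a textbook GRU cell in plain Python. The gate layout below is one common formulation and is an assumption, as are the parameter names (`Wz`, `Wr`, `Wn`, `Wo`) and the toy sizes; it is not claimed to match the exact cell used in our implementation.

```python
import math
import random

def affine(weights, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h_prev, p):
    """One GRU update from input x and previous hidden state h_prev."""
    xh = x + h_prev
    z = [sigmoid(v) for v in affine(p["Wz"], p["bz"], xh)]  # update gate
    r = [sigmoid(v) for v in affine(p["Wr"], p["br"], xh)]  # reset gate
    gated = x + [ri * hi for ri, hi in zip(r, h_prev)]
    n = [math.tanh(v) for v in affine(p["Wn"], p["bn"], gated)]  # candidate
    # Interpolate between the candidate state and the previous state.
    return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h_prev)]

def aggregate(round_inputs, p, hidden_size):
    """Run the GRU over all rounds so far, then apply the output layer."""
    h = [0.0] * hidden_size
    for x in round_inputs:
        h = gru_step(x, h, p)
    return affine(p["Wo"], p["bo"], h)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_in, n_hid = 3, 2   # toy sizes
p = {k: rand_matrix(n_hid, n_in + n_hid, rng) for k in ("Wz", "Wr", "Wn")}
p.update({k: [0.0] * n_hid for k in ("bz", "br", "bn", "bo")})
p["Wo"] = rand_matrix(n_hid, n_hid, rng)
print(aggregate([[0.1, 0.2, 0.3], [0.0, -0.1, 0.4]], p, n_hid))
```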
In the Retrieval Model.
Next, the final representation is sent to the retrieval model. For every image in the database, the retrieval model calculates the distance between the final representation and the feature representation of that image. Using these distances, it finds the top nearest neighbors of the final representation. We model the sampling probability with a softmax distribution over these top nearest neighbors.
Two approaches can be adopted to choose the next candidate image:
In training, we choose a random image sampled from this distribution.
During testing, we choose the nearest image.
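The two selection strategies can be sketched as follows. The sketch assumes Euclidean distance and a softmax over negative distances, so that nearer neighbors receive higher probability; both choices, along with the function names and the toy database, are illustrative assumptions.

```python
import math
import random

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k_neighbors(query, database, k):
    """Return (index, distance) pairs of the k entries nearest to query."""
    dists = [(i, l2_distance(query, feat)) for i, feat in enumerate(database)]
    return sorted(dists, key=lambda pair: pair[1])[:k]

def softmax_over_neighbors(neighbors):
    """Softmax over negative distances (assumed form of the distribution)."""
    logits = [-d for _, d in neighbors]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_candidate(query, database, k, training, rng):
    neighbors = top_k_neighbors(query, database, k)
    if not training:
        return neighbors[0][0]  # greedy: exactly the nearest image
    probs = softmax_over_neighbors(neighbors)
    indices = [i for i, _ in neighbors]
    return rng.choices(indices, weights=probs, k=1)[0]

database = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.1, 0.0]]
query = [0.0, 0.0]
print(next_candidate(query, database, 2, training=False, rng=random.Random(0)))
print(next_candidate(query, database, 2, training=True, rng=random.Random(0)))
```

Sampling during training exposes the model to candidates other than the single nearest image, which is the robustness argument given for the retrieval model.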
The loop terminates when the user simulator reports that the candidate image is the target image, or when the maximum number of rounds (default ) is reached.
3.3. End-to-end Training
In practice, the system might return multiple candidate images to the user in each turn and collect relevance feedback on each of them for better retrieval performance. In our work, however, we simplify the scenario by returning a single image in each turn. Our framework can also be extended to the practical case by letting the user choose one preferred image out of multiple candidate images and provide relevance feedback on it.
Aiming to improve the ranking position of the target image, we train the model with a supervised learning objective. At the beginning, all parameters of the network are randomly initialized. For the loss function, we follow Guo et al. (2018) and use a triplet loss objective.
In this loss, the positive sample is the feature vector of the target image and the negative sample is the feature vector of a random image sampled from the database. The margin is a constant hyper-parameter, and distances are measured by a vector norm. Although the ranking position cannot be learned directly because it is not differentiable, the triplet loss objective lets us improve the rank of the target image by encouraging proximity between the target image and the candidate images.
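As an illustration, a standard hinge-style triplet loss can be written as below. The sketch assumes that the anchor is the representation produced by the model, that the norm is Euclidean, and that the margin value 0.5 is arbitrary; none of these specifics are taken from this paper.

```python
import math

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin):
    """Hinge triplet loss: pull the anchor toward the positive (target)
    features and push it at least `margin` farther from the negative."""
    return max(0.0,
               l2_distance(anchor, positive)
               - l2_distance(anchor, negative)
               + margin)

anchor = [0.0, 0.0]          # assumed: representation output by the model
target_feat = [3.0, 4.0]     # positive: features of the target image
near_negative = [0.0, 1.0]   # negative closer than the target: penalized
far_negative = [6.0, 8.0]    # negative already far enough: zero loss
print(triplet_loss(anchor, target_feat, near_negative, 0.5))  # 4.5
print(triplet_loss(anchor, target_feat, far_negative, 0.5))   # 0.0
```

The loss is zero once the target is closer to the anchor than the negative by at least the margin, which indirectly moves the target up in the distance-based ranking.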
As for evaluation, we report the average ranking percentile over all the images in the training or testing set. More details on the ranking percentile will be described later.
All experiments are conducted on an NVIDIA 1080 Ti GPU, and training for 14 epochs takes about 14 hours. Our implementation of the framework is partially based on (Guo et al., 2018).
We employ the CelebA dataset for benchmark purposes. CelebA contains facial images from identities. It is about times larger than the Shoes dataset () studied in (Guo et al., 2018). Each face image is labeled with binary attributes, such as “big nose” and “bald”. The dataset contains unique combinations of attributes, and the frequencies of the 10 most common attribute combinations are . However, because the same person may appear with different poses and angles in different images, we use the whole dataset to train and test our model.
4.1.2. Reasons to Use CelebA
Using attributes can help avoid ambiguity in human perception as discussed in Subsection 2.2. CelebA also covers various ethnicities and genders, making it popular among facial image researchers (Zhang et al., 2016; Güçlütürk et al., 2016). Its binary attributes, massive coverage, and wide acceptance made the dataset a sufficient choice for our purposes.
4.2. Model Setup
We first experiment with different selections of hyper-parameters (Table 2) and use the best selection, in which the constant margin is set to . Then, we experiment with data pre-processing: we reshape the images from to and zero-center them before extracting their features. We use the first images, sorted by file name, as the training set and the rest as the testing set. There are ( of all identities) individuals whose images appear in both the training set and the testing set. The number of rounds is fixed and set to . The learning rate is set to . We use Adam (Kingma and Ba, 2014) as the optimizer, and the weight decay is set to .
The metric is the ranking percentile, adopted directly from (Guo et al., 2018). We do not use precision because each query (target image) has only one relevant answer in a huge pool ( ). Unlike QA systems or search engines, where multiple items can be labeled as relevant, this task is challenging because it demands highly refined retrieval to reach the single relevant answer. Therefore, we calculate the ranking percentile of the target image in the whole search space according to the distance between representations (which are computed offline using the SE-ResNet model). Note that even if some images have exactly the same attributes as the target image and are ranked at the top, we only count the ranking percentile of the target image itself, which might be lower. The higher the percentile, the more accurate the model and the more likely it is to retrieve the correct target image (even though some images with the same attributes may not be the target image). We do not perform per-attribute evaluation. Moreover, additional images that share the same attributes can only degrade the measured performance; they cannot inflate it.
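One plausible way to compute such a ranking percentile is sketched below. The exact normalization is an assumption: here the percentile is the fraction of the other database images that are not closer to the query representation than the target, so 1.0 means the target is ranked first and 0.0 means it is ranked last. The function names and toy data are hypothetical.

```python
import math

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ranking_percentile(query, database, target_idx):
    """Fraction of non-target images ranked at or below the target."""
    target_dist = l2_distance(query, database[target_idx])
    closer = sum(
        1 for i, feat in enumerate(database)
        if i != target_idx and l2_distance(query, feat) < target_dist
    )
    others = len(database) - 1
    return (others - closer) / others

database = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.1, 0.0]]
query = [0.0, 0.0]  # representation produced by the model for this round
for idx in range(len(database)):
    print(idx, ranking_percentile(query, database, idx))
```

Only the target's own position matters in this computation; other images that happen to share its attributes simply occupy positions above or below it.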
Attribute-based Retrieval: To retrieve a target image in mind, we can refer to its attributes, search the database, and return the image with the closest description. Since the attributes of images are not known in advance in our scenario, we train a forty-dimensional classifier on the training set and test its performance on the testing set. Note that, for a fair comparison, the training set and testing set for the baseline experiment are the same as those used for our model. The evaluation metric is slightly different: instead of calculating distance, we sort the images in the database by the number of attributes they share with the target image and report the ranking position of the target image.
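The baseline ranking can be sketched as follows. Because several images can obtain the same match count, the sketch reports the best and worst positions the target could occupy among its ties, plus their mean. Defining ties by equal match counts, the 0/1 attribute encoding, and all names and toy data are simplifying assumptions for illustration.

```python
def matched(a, b):
    """Number of attribute positions on which two annotations agree."""
    return sum(1 for x, y in zip(a, b) if x == y)

def baseline_rank_bounds(predicted_attrs, target_attrs, target_idx):
    """Rank database images by match count with the target's attributes.

    Returns (head, tail, expectation): the best and worst 1-based
    positions of the target within its group of ties, and their mean.
    """
    scores = [matched(p, target_attrs) for p in predicted_attrs]
    target_score = scores[target_idx]
    better = sum(1 for s in scores if s > target_score)
    tied = sum(1 for s in scores if s == target_score)  # includes target
    head = better + 1
    tail = better + tied
    return head, tail, (head + tail) / 2

# Toy example: classifier predictions for five database images.
predicted = [
    [1, 0, 1, 1],
    [1, 0, 1, 0],   # the target image, with one misclassified attribute
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 0, 1, 0],
]
target_attrs = [1, 0, 1, 1]  # attributes of the image in the user's mind
print(baseline_rank_bounds(predicted, target_attrs, target_idx=1))
```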
4.5. Ablation Studies
4.5.1. Different Types of Input
Full disclosure with attributes of candidate image: Unlike progressive disclosure, this setting reflects an extreme situation where the thorough disclosure is provided by a complete comparison between the target image and candidate image at each round.
Full disclosure without attributes of candidate image: Instead of encoding the attributes of the candidate image and the relevance feedback together, we experiment with omitting the attribute information of the candidate image. This setting can reduce the cost of providing feedback for real users, since they only need to say “unlike” or “like” rather than “unlike the big nose” or “like the curly hair”.
Progressive disclosure: As described in our work.
4.5.2. Different Features of Images
Instead of learning the image features during training, we employ the following pre-trained models to extract them, which saves considerable computational cost.
SE-ResNet on VGGFace2 (Cao et al., 2018): A face recognition model trained on the VGGFace2 dataset. The features are extracted as a 256-dimensional representation.
SE-ResNet on CelebA: Based on SE-ResNet on VGGFace2, this model is fine-tuned on CelebA to classify attributes. Besides serving as our baseline experiment, it provides the features used in our model, which we extract from its last layer as a 256-dimensional representation.
4.6. Results and Analysis
4.6.1. Baseline and Our Model
Realistically, multiple samples might share the same combination of attributes, in which case they are sorted into a consecutive sequence in the database. Retrieving any of them would then be considered a valid operation for the system. Thus, we report in Table 1 the positions of the head and tail of the sequence as the upper bound and lower bound respectively, and calculate the expectation as their mean. Note that all other settings are identical and set to the best configuration.
|Method||Upper Bound||Lower Bound||Expectation|
The results demonstrate better performance from our model. Though the numbers might look close to each other, the absolute difference, , is amplified by the enormous size of the image pool. For a large dataset such as CelebA, which contains more than images, this absolute difference means that more than images are examined and excluded as irrelevant samples by our model. We believe this is significant because it saves users the effort of looking at more than images unnecessarily.
4.6.2. Explorations of Hyper-parameters
We experiment on the value of margin in the loss objective. Also we experiment with different ways of pre-processing the images to obtain the best performance. Note that the feature extractor we use here is SE-ResNet on VGGFace2 and full disclosure is provided in each round. When experimenting with margin, we set reshaped size as . When experimenting with reshaped size, we set margin as .
4.6.3. Different Choices of Feedback
As shown in Figure 4, without the attributes of the candidate image, the results are limited. Although progressive disclosure cannot use the information completely in the beginning, as the disclosure is progressively increased, it ultimately performs slightly better than full disclosure with attributes of the candidate image. With full disclosure at each round, the growth in ranking percentile stagnates as early as the second round. In contrast, with progressive disclosure the ranking percentile climbs continuously and does not converge until the fourth round.
While the ranking percentiles are very close between full disclosure and progressive disclosure with image attributes, we can further investigate their performances by referring to their loss curves in Figure 5. It is obvious that in the beginning, using progressive disclosure incurs more loss than the other two which matches with their performances at round 1. However, at round 5, the loss for progressive disclosure drops to while it is , about times as much as the former, for full disclosure with image attributes. Combining the loss with their performances at round 5, we can conclude that using progressive disclosure is more capable of fitting the data.
4.6.4. Different Ways of Feature Extractions
Except for the features extracted by the different networks, we keep all other settings the same as in our best configuration. In Table 3, SE-ResNet on CelebA shows a clear advantage. The results suggest that its extracted features might have representations more similar to those of the attributes, enabling the model to learn their connections well afterwards.
|SE-ResNet on VGGFace2||95.80%||0.0434|
|SE-ResNet on CelebA||98.66%||0.0297|
Apart from exhibiting the target image and candidate images during interaction between the system and users, we also calculate the number of attributes that are the same between the target image and the candidate image at each round to provide further investigation from another perspective.
From the visualization results, we make some interesting observations. First, for male target images, the system is more likely to converge at an early stage, such as round three (see the first and the third rows). For female target images, however, the system may still change its result in the final round for a better match (see the second row). These differences might come from the distribution of the dataset, in which female images are more diverse than male images, enabling the system to refine the retrieval results with finer granularity.
Another interesting observation concerns the number of matched attributes. We might assume that an increase in the number of matched attributes is consistent with improvements in performance. However, although the number of matched attributes does increase in most cases, this is not always true. For example, from round 3 to round 4 in row four, the number of matched attributes drops from to . Although the image from round 4 has fewer matched attributes, it appears to be a better match for the target image: both share a Western look with blond hair and high cheekbones, while the image from round 3 shows a typical East Asian face. Likewise, for the images from round 2 to round 3 in row two and from round 1 to round 2 in rows one and three, the decline in the number of matched attributes still leads to visibly better results. This may indicate that our model benefits from combining image features with attribute information for better retrieval performance.
In our work, we shed light on the facial image retrieval problem and propose an end-to-end interactive framework with progressive disclosure. We also explore different settings in various scenarios and applications. Though not perfect, our work is sufficient to deal with many cases with over ranking percentile. In the future, this retrieval problem can be upgraded to a conditional generation task that assists with suspect sketching. Moreover, the forms of feedback can be expanded and enriched to capture subtlety, for example with fine-grained attributes or even verbal descriptions, which would enable a smoother and more natural interaction. The abundant ambiguity in the perception of facial images makes it particularly difficult but crucial to have an intelligent and accurate method for bridging the semantic gap, and we hope our work can be a stepping stone for interested researchers.
I am grateful for all the support from Yuwen Xiong on this work. Though he is not an official author, his encouragement and guidance were among the most important reasons I was able to finish this paper.
- OpenFace: a general-purpose face recognition library with mobile applications. Technical report CMU-CS-16-118, CMU School of Computer Science. Cited by: 1st item.
- Culture shapes how we look at faces. PLOS ONE 3 (8), pp. 1–8. External Links: Cited by: §1.
- Verbal facilitation of face recognition. Memory & Cognition 33 (8), pp. 1442–1456. Cited by: §1.
- Vggface2: a dataset for recognising faces across pose and age. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pp. 67–74. Cited by: 2nd item.
- Learning cooperative visual dialog agents with deep reinforcement learning. Proceedings of the IEEE International Conference on Computer Vision, pp. 2951–2960. Cited by: §2.1.
- Emotion perception across cultures: the role of cognitive mechanisms. Frontiers in psychology 4, pp. 118. Cited by: §1.
- Interactive search for image categories by mental matching. In 2007 IEEE 11th International Conference on Computer Vision, External Links: Cited by: §1.
- Query by image and video content: the qbic system. Computer 28 (9), pp. 23–32. External Links: Cited by: §2.1.
- A stochastic feedback model for image retrieval. In Proc. RFIA, Vol. 3, pp. 173–180. Cited by: §2.1.
- Convolutional sketch inversion. In European Conference on Computer Vision, pp. 810–824. Cited by: §4.1.2.
- Content based image retrieval systems. Computer 28 (9), pp. 18–22. External Links: Cited by: §2.1.
- Dialog-based interactive image retrieval. In Advances in Neural Information Processing Systems, pp. 678–688. Cited by: §2.1, §3.3, §4.1.1, §4.3, §4.
- Interactive image retrieval by natural language. Optical Engineering. External Links: Cited by: §2.1.
- Squeeze-and-excitation networks. Cited by: §3.2.
- Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst. Cited by: 1st item.
- CIRES: a system for content-based retrieval in digital image libraries. In 7th International Conference on Control, Automation, Robotics and Vision, 2002. ICARCV 2002., Vol. 1, pp. 205–210 vol.1. External Links: Cited by: §2.1.
- Adam: a method for stochastic optimization. International Conference on Learning Representations, pp. . Cited by: §4.2.
- Whittlesearch: image search with relative attribute feedback. 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2973–2980. Cited by: §2.1.
- WhittleSearch: interactive image search with relative attribute feedback. International Journal of Computer Vision 115 (2), pp. 185–210. External Links: Cited by: §1.
- Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), Cited by: From A Glance to “Gotcha”: Interactive Facial Image Retrieval with Progressive Relevance Feedback.
- NeTra: a toolbox for navigating large image databases. Multimedia Systems 7 (3), pp. 184–198. External Links: Cited by: §2.1.
- Interactive content-based image retrieval using relevance feedback. Computer Vision and Image Understanding 88 (2), pp. 55–75. Cited by: §2.1.
- Mapping face recognition information use across cultures. Frontiers in Psychology 4, pp. 34. External Links: Cited by: §1.
- Method for incorporating facial recognition technology in a multimedia surveillance system. Google Patents. Note: US Patent 7,634,662 Cited by: §2.2.
- Relevance feedback techniques in interactive content-based image retrieval. In Storage and Retrieval for Image and Video Databases VI, Vol. 3312, pp. 25–36. Cited by: §2.2.
- Relevance feedback: a power tool for interactive content-based image retrieval. IEEE Trans. Circuits Syst. Video Techn. 8, pp. 644–655. Cited by: §2.1.
- Is beauty in the eye of the beholder but ugliness culturally universal? facial preferences of polish and yali (papua) people. Evolutionary Psychology 11 (4), pp. 147470491301100400. External Links: Cited by: §1.
- Relevance feedback technique for content-based image retrieval using neural network learning. 2006 International Conference on Machine Learning and Cybernetics, pp. 3692–3696. Cited by: §2.1.
- MIRROR: an interactive content based image retrieval system. In 2005 IEEE International Symposium on Circuits and Systems, pp. 1541–1544. Cited by: §2.2.
- CuZero: embracing the frontier of interactive visual search for informed users. In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, MIR ’08, pp. 237–244. External Links: Cited by: §1.
- Gender and smile classification using deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 34–38. Cited by: §4.1.2.