Revealing Interpretable Representations learned by Face Inference Models
This paper presents Hierarchical Network Dissection, a general pipeline to interpret the internal representation of face-centric inference models. Using a probabilistic formulation, Hierarchical Network Dissection pairs units of the model with concepts in our "Face Dictionary" (a collection of facial concepts with corresponding sample images). Our pipeline is inspired by Network Dissection, a popular interpretability method for object-centric and scene-centric models. However, our formulation addresses two important challenges of face-centric models that Network Dissection cannot handle: (1) spatial overlap of concepts: there are different facial concepts that simultaneously occur in the same region of the image, like "nose" (facial part) and "pointy nose" (facial attribute); and (2) global concepts: there are units with affinity to concepts that do not refer to specific locations of the face (e.g. apparent age). To validate the effectiveness of our unit-concept pairing formulation, we first conduct controlled experiments on biased data. These experiments illustrate how Hierarchical Network Dissection can be used to discover bias in the training data. Then, we dissect different face-centric inference models trained on widely-used facial datasets. The results show that models trained for different tasks have different internal representations. Furthermore, the interpretability results reveal some biases in the training data and some interesting characteristics of the face-centric inference tasks.
Over the years, significant improvements have been made in developing face-centric deep inference models for tasks such as face recognition Parkhi et al. (2015), person re-identification Rao et al. (2019), age and gender estimation Eidinger et al. (2014), and facial attribute classification Liu et al. (2019), among many others. However, when applied in commercial settings at scale, these models have been scrutinized for displaying biases towards underrepresented classes Buolamwini (2017); Buolamwini and Gebru (2018). Understanding the underlying reasons for these biases, which is necessary for making progress towards bias mitigation strategies, has been hindered because little to no work has investigated the representations learned by these models. Such representations have mainly been studied for object-centric and scene-centric models Zeiler and Fergus (2014); Simonyan et al. (2014); Mahendran and Vedaldi (2015); Bau et al. (2017); Nguyen et al. (2016), showing how model interpretation is useful for understanding the outputs of the network and, more generally, the behaviour of the trained model Zhou et al. (2018).
In this paper, we present Hierarchical Network Dissection, a general pipeline to interpret face inference models. Concretely, our pipeline reveals which units of a face-centric inference model act as detectors of facial concepts. Thus, the result of our pipeline is a pairing of network units and facial concepts. As shown in Fig. 1, we dissect one layer of the network at a time and identify such unit-concept pairs to quantify their interpretability. This discovery of concepts provides us with the data to construct a dissection report that showcases the extent to which a layer focuses on certain concepts. Furthermore, we show how model interpretability allows us to formulate hypotheses on potential model biases which, in turn, are often biases present in the training data. Hierarchical Network Dissection uses our visual dictionary of facial concepts, called “Face Dictionary”, which includes elements like facial parts, attributes, action units, gender, or skin tone category, as described in Section 3.1. The formulation of Hierarchical Network Dissection to reveal and quantify which units of a face-centric inference model act as detectors of these facial concepts is described in Section 3.
Our formulation is inspired by the Network Dissection approach proposed by Bau et al. (2017); Zhou et al. (2018), which is based on the observation that some units in scene-centric models act as object detectors Zhou et al. (2015). Our pipeline for dissecting face-centric models, however, uses Network Dissection in just one of its three stages. Overall, there are important technical differences between the two formulations, since interpreting face-centric models presents new challenges compared to interpreting scene-centric models. First, Network Dissection is based on the Broadly and Densely Labeled Dataset (Broden) visual dictionary, which is a collection of scene-centric visual concepts, like colors, textures, materials, parts of objects, whole objects, and scenes. Notice that, while one can expect to find detectors of these concepts in scene-centric images, these concepts are barely relevant to faces. Thus, dissecting face-centric models requires a completely new face-centric dictionary (“Face Dictionary”). Second, the algorithmic formulation of Network Dissection for pairing units and concepts assumes that concepts are all localizable in the image. The pairing is based on an intersection over union (IoU) criterion between the area of the image that produces the strongest activation of a unit and the segmentation of each concept in the Broden visual dictionary. In contrast, there are different facial concepts that simultaneously occur in the same region of the image, like “mouth” (a facial part), “smiling” (a facial attribute), “wearing lipstick” (a facial attribute), and “AU-20” (the action unit corresponding to lip stretcher). We refer to this challenge as the spatial overlap of concepts. Another important challenge is that some concepts are not located in a specific region of the face, such as the gender or the skin tone. We refer to this challenge as the global concepts. Because Network Dissection relies on IoU, it cannot be used to quantify global concepts. As described in Section 3.2, the proposed Hierarchical Network Dissection is designed to deal with both the spatial overlap of concepts and the global concepts challenges.
To validate our unit and concept pairing algorithm, we first conduct controlled experiments on biased datasets (Section 4). Concretely, we consider the task of apparent gender recognition and create biased training sets for a specific concept in our Face Dictionary. (Here, we work with datasets with binary labels for gender classification; however, we acknowledge that in essence the concept of gender is more fluid and non-binary.) For example (Section 4.1), for the concept “eyeglasses”, we create a training set where P% of the males wear glasses and (100 - P)% of the females wear glasses (for P ranging from 50 to 100). In this case, when applying Hierarchical Network Dissection to interpret the model, we observe that the stronger the bias is (i.e. the higher P is), the more units in the model act as “eyeglasses” detectors. We can make two important observations from these results. First, during training the model can learn that “eyeglasses” is a discriminative concept that helps to perform the target task (apparent gender recognition), and for this reason units that act as “eyeglasses” detectors emerge. Second, the Hierarchical Network Dissection formulation is capable of revealing this model behaviour. Similar experiments are conducted for a global concept (Section 4.2), showing that Hierarchical Network Dissection can also reveal biases in global concepts.
The last part of the paper, Section 5, presents dissection results of different face-centric inference models that have been trained with current state-of-the-art architectures and publicly available facial datasets. These dissections result in several interesting observations. For example, the results revealed a gender bias in the training data of the Smile classification task, as well as the relevance of facial expression for the Beauty classifier. Our dissection results and further observations and discussions can be found in Section 5.
Both our Face Dictionary and the Hierarchical Network Dissection code for face-centric inference models will be publicly released. We think these two will be useful tools for the research communities working in visual face inference.
There are different techniques to understand the representations learned by deep convolutional neural networks (CNNs). Broadly, these techniques are based on: (i) variants of backpropagation to visualize salient patterns, features, or image regions Zeiler and Fergus (2014); Simonyan et al. (2014); Mahendran and Vedaldi (2015); (ii) detecting patches in the image that strongly activate the units of the trained model Zhou et al. (2015); Zeiler and Fergus (2014); or (iii) analyzing to what extent units behave as detectors or classifiers for specific concepts using original images Zhou et al. (2015); Bau et al. (2017) or synthesized ones Nguyen et al. (2016). In this last direction, one of the most popular approaches is Network Dissection, proposed by Bau et al. (2017); Zhou et al. (2018), which can be formulated to analyze the representations learned by both classification and generative models Bau et al. (2019, 2020), making Network Dissection a very versatile approach. While Network Dissection has been widely used on object-centric and scene-centric models, its use on face-centric inference models has remained unexplored. Our Hierarchical Network Dissection is inspired by the Network Dissection algorithm, whose technical details are described at the end of this section.
Interpretability for Bias Discovery – In this paper, we demonstrate how model interpretability can be used as a tool for bias discovery. In the context of fairness in artificial intelligence (AI), revealing bias in the data has lately gained a lot of attention in the research community. Wang et al. (2020a) recently presented a detailed study and a computational tool to discover biases in image datasets. Their tool revealed very interesting insights on gender-based representation biases. Nonetheless, their approach is in general scene-centric and the tool is currently not targeted at face-centric datasets. At the same time, Wang et al. (2020b) recently presented a benchmark to compare bias mitigation methods. The benchmark is created by introducing bias in an object classification dataset. In particular, for each class, a specific percentage of images is converted to grey-scale, creating an unbalanced representation (i.e. some classes have most of the images in grey-scale, while others have most of the images in color). The design of this benchmark has inspired our controlled experiments presented in Section 4.
Interpretability vs. Explainability – Model interpretability is related to the topic of explainable AI Arrieta et al. (2020); Gunning (2017). There have been recent interesting efforts in explainable subject recognition systems Yin et al. (2019); Zee et al. (2019); Williford et al. (2020); RichardWebster et al. (2018) and also in explainable face-centric generative adversarial networks (GANs) Shen et al. (2020). However, although explainable AI and model interpretability are related topics, there are fundamental differences between the two concepts. While explainability focuses on generating explanations about the output, model interpretability focuses on understanding the internal representation of the model. Our work is centered around the latter topic, revealing the interpretable representation of face-centric inference models.
Network Dissection – As first introduced in Bau et al. (2017), Network Dissection is a general framework for quantifying the interpretability of the hidden units in any convolutional layer of a neural network trained for an object-centric or a scene-centric task. It requires a broad range of visual concepts in order to compare the activation maps of these hidden units to the concepts' binary segmentation labels and compute their corresponding spatial affinity. This collection of concepts is presented as a visual dictionary called Broden, and it contains concepts that range from low level (e.g. colors, textures, or materials, like red, dotted or metal) to high level (e.g. parts of objects, objects, or scenes, like leg, wheel, floor, car, or swimming pool). The pipeline evaluates the binary segmentation masks of these concepts against the activation maps of a given layer in the network to generate an IoU score for each unit-concept pair. For each unit, the concept with the highest IoU is reported, and if the highest IoU is less than 0.04, the unit is called uninterpretable.
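The IoU-based pairing described above can be sketched in a few lines. This is a minimal illustration and not the authors' code: it assumes the activation maps have already been upsampled to the mask resolution, and it thresholds each unit at a top-activation quantile (the default of 0.995 follows the original Network Dissection paper and is not stated in this text). The function names `unit_concept_iou` and `best_concept` are hypothetical.

```python
import numpy as np

def unit_concept_iou(activations, concept_masks, quantile=0.995):
    """Dataset-wide IoU between one unit's thresholded activations and one concept.

    activations:   float array (num_images, H, W), assumed upsampled to mask size
    concept_masks: bool array  (num_images, H, W), binary segmentation of the concept
    quantile:      assumed per-unit activation quantile used as the threshold
    """
    threshold = np.quantile(activations, quantile)
    unit_masks = activations > threshold
    intersection = np.logical_and(unit_masks, concept_masks).sum()
    union = np.logical_or(unit_masks, concept_masks).sum()
    return float(intersection / union) if union > 0 else 0.0

def best_concept(activations, masks_by_concept, min_iou=0.04, quantile=0.995):
    """Return (concept, iou) for the highest IoU; concept is None below min_iou."""
    scores = {
        name: unit_concept_iou(activations, masks, quantile)
        for name, masks in masks_by_concept.items()
    }
    name = max(scores, key=scores.get)
    iou = scores[name]
    # A unit whose best IoU is below 0.04 is reported as uninterpretable.
    return (name if iou >= min_iou else None), iou
```

Reporting only the single best concept per unit is exactly the behaviour that becomes misleading when several concepts share a face region.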
Notice that, as discussed in Section 1, Network Dissection cannot be directly used to dissect face inference models. First, we need a face-centric dictionary instead of Broden. Second, we need to reformulate the unit-concept pairing algorithm to deal with the spatial overlap of concepts and the global concepts challenges. We address these points in the next section.
The unit-concept pairing procedure of Hierarchical Network Dissection is based on our Face Dictionary, which is composed of several facial concepts. The concepts of the dictionary are organized in three main categories. Following this three category structure, Hierarchical Network Dissection uses three stages to determine what units in a CNN model are acting as concept detectors.
Our Face Dictionary consists of 38 local concepts and 12 global concepts. The total number of images with segmentation labels for the local concepts is 24632, and the histogram of the number of labelled instances per concept is displayed in Fig. 3. The images are organized in three types or categories: (1) Global, (2) Facial Parts, and (3) Local. Fig. 2.a lists all the concepts included in the Face Dictionary. For the local concepts, our dictionary also includes a segmentation mask for each example image. Certain concepts occur more frequently than others; however, the overall distribution amongst Action Units, Attributes, and Facial Parts is similar because many images share multiple concepts from one of these types. Fig. 2.b shows some examples of local concepts and their corresponding segmentation masks. The total number of images for the entire global set is 11000, where Age, Gender, and Ethnicity have 4000 images each and Skin Tone has 3000 images. Each sub-type has an equal number of images within each concept, although our algorithm is able to account for an uneven distribution amongst these concepts.
There are several datasets that provide labels for action units and facial attributes merely indicating whether a particular attribute or action unit is present in a given image, without providing the corresponding location in the image. However, our dictionary needs the segmentation of the local concepts to be used during the concept-unit pairing. For this reason, we have assembled the “Face Dictionary” dataset by estimating the regions for these concepts using facial landmarks. Our dictionary consists of images taken from the EmotioNet Benitez-Quiroz et al. (2016), UTKFace Zhang et al. (2017) and CelebA Liu et al. (2015) datasets. We run a landmark detection algorithm on our chosen images as we use the landmarks to estimate the face region where our local concepts lie. We estimate the center and covariance matrix of a 2D Gaussian confidence ellipse around a particular concept and generate a binary mask for each concept per image.
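As a rough illustration of the mask-generation step, the sketch below fits a 2D Gaussian to the landmark points of one concept and marks every pixel inside its confidence ellipse. It is an assumption-laden sketch rather than the authors' implementation: the function name `ellipse_mask`, the choice of landmarks per concept, and the 95% confidence level (chi-square value 5.991 for two degrees of freedom) are all hypothetical choices that the text does not specify.

```python
import numpy as np

def ellipse_mask(landmarks, image_shape, chi2_threshold=5.991):
    """Binary mask of the Gaussian confidence ellipse fitted to landmark points.

    landmarks:      float array (K, 2) of (x, y) points for one concept;
                    needs at least 3 non-collinear points so the covariance is invertible
    image_shape:    (height, width) of the output mask
    chi2_threshold: squared Mahalanobis radius; 5.991 is the 95% level for
                    2 degrees of freedom (an assumed confidence level)
    """
    landmarks = np.asarray(landmarks, dtype=float)
    center = landmarks.mean(axis=0)           # ellipse center
    cov = np.cov(landmarks, rowvar=False)     # 2x2 covariance matrix
    cov_inv = np.linalg.inv(cov)

    height, width = image_shape
    ys, xs = np.mgrid[0:height, 0:width]
    diff = np.stack([xs - center[0], ys - center[1]], axis=-1)  # (H, W, 2)

    # Squared Mahalanobis distance of every pixel to the center.
    d2 = np.einsum("...i,ij,...j->...", diff, cov_inv, diff)
    return d2 <= chi2_threshold
```

The resulting boolean array is indexed as `mask[y, x]` and can be stored alongside the image as the segmentation label for that concept.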
Our unit-concept pairing formulation has three stages and each of the stages focuses on one of the concept types included in our Face Dictionary as follows.
Stage I: Global Concepts – This first stage is based on the idea that Global concepts belonging to the same category are mutually exclusive (for example, we assume that the apparent age of a face cannot be [0-20] and [20-40] at the same time). To pair a unit with a Global concept in one of the categories (e.g. Apparent Age), we take a forward pass across all images in our dictionary for each global concept in the corresponding category and record the feature maps of the layer being dissected, while retaining the information about which map belongs to which subgroup. In order to compare the activations from these feature maps, we assign a rank to each map based on its maximum activation score, so that the map with the lowest score receives the lowest rank and the map with the highest score receives the highest rank. Then, we initialize a concept score $C_s$ for each subgroup $s$ and increment it as we iterate through all the maps (Eq. 1), using the rank $r$ and the maximum activation score $a$ of each map; each score $C_s$ is only incremented by maps that belong to $s$. We use this formulation to establish a pecking order among the different subgroups by accounting for the strength of their activations relative to each other. We then normalize each score using $N_s$, the number of images that belong to $s$ (Eq. 2), to make the scores comparable. Finally, we divide each score by the sum of all scores to obtain a set of relative probabilities such that $\sum_{s \in g} P_s = 1$, where $P_s$ is the probability of $s$ and $g$ is the global concept being analyzed.
The range of the rank $r$ remains constant for each individual analysis. Normalizing the rank itself is unnecessary since, after Eq. 1 and Eq. 2, we normalize each concept score by the sum of all scores to obtain relative probabilities. Hence, only the maximum activation scores of the feature maps need to be normalized, since each model produces activations in a different range. Using these probabilities, we classify a unit as biased towards $s$ if $P_s$ exceeds a threshold that depends on $g$: one threshold when $g$ is “Age” or “Ethnicity” and another when $g$ is “Gender” or “Skin Tone”. The thresholds differ because of the number of subgroups in each global concept. For Skin Tone and Gender, the threshold is 0.55 because there are only two subgroups: the sum of their probabilities for each unit is 1, so $P_s > 0.55$ is a significant increase from the balanced value of 0.5, and we conclude there is bias towards $s$. Similarly, Age and Ethnicity have 4 subgroups each, as shown in Fig. 1, and a probability above their threshold is much higher than the unbiased value of 0.25.
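To make the Stage I scoring concrete, here is a minimal sketch for a single unit. It rests on assumptions that the text does not spell out: each map contributes the product of its rank and its maximum activation to the score of its subgroup, the lowest rank is 1, the per-subgroup normalization is a division by the number of images in that subgroup, and the activations passed in are already normalized. The function names are hypothetical.

```python
import numpy as np

def global_concept_probabilities(max_activations, subgroup_ids):
    """Relative subgroup probabilities for one unit and one global concept.

    max_activations: (num_images,) maximum activation of the unit's map per image
                     (assumed already normalized across models)
    subgroup_ids:    (num_images,) subgroup label of each image
    """
    acts = np.asarray(max_activations, dtype=float)
    ids = np.asarray(subgroup_ids)

    # Rank maps by maximum activation: lowest gets rank 1 (assumption), highest gets N.
    ranks = np.empty(len(acts))
    ranks[np.argsort(acts, kind="stable")] = np.arange(1, len(acts) + 1)

    subgroups = np.unique(ids)
    scores = []
    for s in subgroups:
        members = ids == s
        # Assumed increment: rank * max activation, summed over maps of subgroup s,
        # then divided by the number of images in s to make scores comparable.
        scores.append((ranks[members] * acts[members]).sum() / members.sum())
    scores = np.array(scores)

    total = scores.sum()
    if total > 0:
        probs = scores / total
    else:
        probs = np.full(len(subgroups), 1.0 / len(subgroups))
    return dict(zip(subgroups.tolist(), probs.tolist()))

def biased_subgroups(probs, threshold):
    """Subgroups whose probability exceeds the threshold (0.55 for two subgroups)."""
    return [s for s, p in probs.items() if p > threshold]
```

For a two-subgroup concept such as Gender, calling `biased_subgroups(probs, 0.55)` applies the threshold stated in the text; the threshold for four-subgroup concepts has to be supplied by the caller.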
Stage II: Facial Parts – This stage is focused on pairing units with Facial Part concepts. Similar to Stage I, we take a forward pass across all the images from the local set in our dictionary to store the activation maps of the layer being dissected, and run Network Dissection as described in Bau et al. (2017) to generate IoU scores for each local concept per unit. Since a unit generally produces strong IoU scores for multiple concepts within the same region, it is highly misleading to report only the single concept with the highest IoU as interpretable. This is why we identify the region of the face that the concept with the highest IoU belongs to, and evaluate the relevance to the unit of every other concept in this region (as shown in Fig. 2) by establishing a probabilistic hierarchy amongst them, using the same formulation introduced above with a minor variation.
Stage III: Local Concepts – For each unit that has been paired with a Facial Part in Stage II, we extract all the activation maps that have labelled instances of at least one of the concepts belonging to this region. Then, we use the same steps as in Eq. (1) and Eq. (2), but this time, instead of subgroups, we have clustered concepts, and because these concepts can be present simultaneously, one activation map can contribute to the concept score of more than one concept. This overlap of images among concepts can lead to scores that are not clearly distinguishable, which is why we add one more step to demarcate these concepts: we scale the concept scores by their respective IoUs estimated during Stage II.
Then, we replicate the formulation and obtain relative probabilities per concept. If any concept’s probability crosses a threshold determined by $n$, the total number of concepts in the region identified, we deem the unit to be interpretable for this concept. This allows a unit to be interpretable for more than one concept by accounting for all aspects of affinity in the unit-concept pair.
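The final Stage III step can be sketched as follows. This is an illustration under explicit assumptions: "scaling by IoU" is taken to mean multiplying each concept score by its Stage II IoU, and the default threshold of 1/n (the uniform probability over the n concepts in the region) is a guess, since the exact threshold is not recoverable from the text. Function names are hypothetical.

```python
import numpy as np

def local_concept_probabilities(concept_scores, concept_ious):
    """Relative probabilities of the concepts in one face region for one unit.

    concept_scores: dict concept -> normalized rank/activation score
    concept_ious:   dict concept -> IoU from Stage II
    """
    names = list(concept_scores)
    # Assumed scaling: multiply each concept score by its IoU.
    scaled = np.array([concept_scores[c] * concept_ious[c] for c in names])
    total = scaled.sum()
    if total > 0:
        probs = scaled / total
    else:
        probs = np.full(len(names), 1.0 / len(names))
    return dict(zip(names, probs.tolist()))

def interpretable_concepts(probs, threshold=None):
    """Concepts whose probability exceeds the threshold.

    threshold defaults to 1/n, an assumed value that depends on the number
    of concepts n in the region.
    """
    if threshold is None:
        threshold = 1.0 / len(probs)
    return [c for c, p in probs.items() if p > threshold]
```

Because several concepts can exceed the threshold at once, a single unit can be reported as a detector for more than one concept in the same region.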
To validate the Hierarchical Network Dissection formulation for detecting interpretable units corresponding to both local and global concepts, we performed two experiments in biased settings. Our objective here is to determine how well the formulations explained in Section 3.2 discover bias, for both types of concepts, in the hidden units of a convolutional layer. Concretely, we consider the gender classification task and explicitly introduce different degrees of bias in the training set, using first the local concept “Eyeglasses” (Section 4.1) and second the global concept “Gray-scale” (Section 4.2).
In our first experiment, we use the local concept “Eyeglasses” from our Face Dictionary to create six different biased datasets. Concretely, a percentage P of males in the training set will be wearing eyeglasses, while a percentage 100 - P of females will be wearing eyeglasses. Fig. 4 illustrates the training data for the two extreme cases: P=50 (no bias introduced) and P=100 (maximum degree of bias, where detecting faces with eyeglasses is equivalent to doing gender classification). Notice that the closer P is to 100, the more useful it is for the model to have an internal representation that focuses on the eye region, since the concept “Eyeglasses” belongs to the eye region and can help to discriminate on the main task (i.e. gender classification).
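The construction of such a biased training set can be sketched as below. This is a hypothetical helper, not the authors' data pipeline: it assumes each sample is a dictionary with a `gender` key (`"male"` or `"female"`) and a boolean `eyeglasses` key, and that the pool holds enough images of every combination.

```python
import random

def build_biased_split(samples, p, per_gender, seed=0):
    """Draw a training set where p% of males and (100 - p)% of females wear eyeglasses.

    samples:    list of dicts with keys "gender" and "eyeglasses" (assumed schema)
    p:          bias percentage, from 50 (balanced) to 100 (fully biased)
    per_gender: number of images to draw for each gender
    """
    rng = random.Random(seed)
    target_fraction = {"male": p / 100.0, "female": (100 - p) / 100.0}

    chosen = []
    for gender, fraction in target_fraction.items():
        n_with = round(per_gender * fraction)
        n_without = per_gender - n_with
        with_glasses = [s for s in samples
                        if s["gender"] == gender and s["eyeglasses"]]
        without_glasses = [s for s in samples
                           if s["gender"] == gender and not s["eyeglasses"]]
        # random.sample raises ValueError if the pool is too small.
        chosen += rng.sample(with_glasses, n_with)
        chosen += rng.sample(without_glasses, n_without)

    rng.shuffle(chosen)
    return chosen
```

Keeping `per_gender` fixed across all values of `p` keeps the gender balance constant, so that only the correlation between eyeglasses and gender changes between datasets.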
We train a ResNet-50 architecture and dissect the second-to-last convolution layer for each case. When we look at the number of interpretable units for “Eyeglasses” in particular, only the models with P=50 (balanced) and P=100 (fully biased) show a contrast: the fully biased model yields more than three times as many units as the balanced model, whereas the models in between yield a number of units similar to that of the balanced model. In Fig. 6, we show, for each degree of bias, the number of interpretable units in terms of facial parts. Notice that the number of units that focus on the eye region increases as P increases. This result reveals how the representation of the model focuses more on the eye region as the discriminant information contributed by the eye region increases. Interestingly, we also observe that the focus on the cheek region decreases as P increases. This again hints at the fact that hidden units may be discriminative only in terms of facial region and not so much for individual facial concepts.
In Fig. 5, we show a set of qualitative results to visualize the top 50 activated images per gender, female (left) and male (right), for each model. We observe that the share of images with “Eyeglasses” goes up for males as P increases from 50 to 100, and the exact opposite can be seen for the female class, where the number of images with “Eyeglasses” steadily decreases. This shows that the model’s learning becomes increasingly biased as the bias in the dataset goes up. The images are arranged in descending order based on their maximum activation scores.
Our second experiment follows a similar protocol to the previous experiment, but the bias is introduced on the global concept “Grayscale”. Thus, our training datasets have an unbalanced representation of color images and gray-scale images: for males, a percentage P of the images are gray-scale, and for females a percentage 100 - P of the images are gray-scale. We synthesize 6 different datasets and train gender classification models, with the percentage bias P going from 50 to 100.
We use Stage I of Hierarchical Network Dissection with a separate test set to generate unit probabilities for all the units in the second-to-last convolution layer of blocks #1, #2, #3, and #4 of the ResNet-50 architecture. For any given unit $u$, the probability of color $P_c$ and the probability of gray-scale $P_g$ represent the affinity of the unit to a color scheme, where $P_c + P_g = 1$. We perform this at all stages of the network because color scheme is a low-level concept and can be detected in earlier stages of the network. We compare these unit probabilities across the different models for all 4 layers, each with its own number of convolution filters, to interpret how the bias within each layer increases or decreases as the bias rises from P=50 to P=100.
In Fig. 7, we plot the concept probabilities ($P_c$ and $P_g$) for all the units in ascending order to show the gradual shift in the number of biased units. If the majority of units in a layer are not biased, we should get a very gentle slope, while a highly biased group of units should produce a steep slope. In this case, we clearly observe that the balanced set with P=50 displays a gentle slope across all 4 layers and, as P increases, the slope of the probabilities gradually becomes steeper. It is easy to infer from the figure that in the case of P=100 (even in the early layers), there are very few units that do not have a strong affinity to one of the color schemes, which verifies that our formulation can classify the units within a layer and support hypotheses about how biased their representations are.
For each layer, we also count the number of biased units to compare how biased the representations are at different stages of the network. A unit is said to be biased if either $P_c > 0.55$ or $P_g > 0.55$, staying consistent with the bias analysis introduced for global concepts with two subgroups. In Fig. 8, we observe that all 4 layers show a clear increase in the number of biased units as we move from the unbiased set to the completely biased one. Apart from Layer 3, which shows a perfectly uniform increase, the other layers display fluctuation in the level of bias but eventually show a steep increase as P reaches its maximum value. We also observe that the last layer has a far greater percentage of biased units across all values of P, which is reasonable when we think about how crucial this perceived bias is to the model’s performance due to its proximity to the final fully connected layer. This experiment, in tandem with Section 4.1, emphasizes the ability of our formulation to discover and quantify the biases that exist within the representations, which are usually hidden from us due to our inability to interpret these models quantitatively.
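The per-layer summary described above (sorted probability curve and biased-unit count) can be sketched as follows. This is a small illustration, assuming the per-unit color probabilities have already been computed by Stage I and that the two-subgroup threshold of 0.55 applies; the function name `biased_unit_summary` is hypothetical.

```python
import numpy as np

def biased_unit_summary(p_color, threshold=0.55):
    """Summarize color/gray-scale bias for all units of one layer.

    p_color:   (num_units,) probability of "color" per unit; the gray-scale
               probability is taken as 1 - p_color
    threshold: a unit is biased if either probability exceeds this value
    """
    p_color = np.asarray(p_color, dtype=float)
    p_gray = 1.0 - p_color
    biased = (p_color > threshold) | (p_gray > threshold)
    return {
        # Ascending curve: a gentle slope means few biased units, a steep one many.
        "sorted_p_color": np.sort(p_color),
        "num_biased": int(biased.sum()),
        "fraction_biased": float(biased.mean()),
    }
```

Comparing `fraction_biased` rather than `num_biased` across layers accounts for layers having different numbers of units.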
| Task (Abbreviation) | Dataset | Architecture | Layer Dissected | Performance (Metric) |
|---|---|---|---|---|
| Age Estimation (AGE) | IMDB-WIKI | ResNet-50 | Block 4 - Layer2(conv1) | 6.64 (MAE) |
| Gender Classification (GENDER) | IMDB-WIKI | ResNet-50 | Block 4 - Layer2(conv1) | 90.8% (Acc.) |
| Beauty Estimation (BEAUTY) | SCUT-FBP5500 | ResNet-18 (pretrained) | Block 4 - Layer1(conv1) | 0.137 (MSE) |
| Facial Recognition (FACENET) | VGGface2 | Inception ResNetV1 (pretrained) | Block 8 - branch1(conv3) | 99.6% (Acc.) |
| Facial Recognition (FAIRFACE) | FairFace | ResNet-50 | Block 4 - Layer2(conv1) | 86.8% (Acc.) |
| Smile Classification (SMILE) | CelebA | ResNet-50 | Block 4 - Layer2(conv1) | 91.2% (Acc.) |
We perform our dissection on five common face-centric inference tasks: age estimation, gender classification, beauty estimation, facial recognition, and smile classification. We try to use the same model architectures for most tasks in order to contrast the representations learned by similar architectures for different tasks.
Table 1 summarizes the details of the dissected models, specifies the layer that we dissect in our experiments, and shows the performance obtained per model. Notice that we focus our interpretability experiments in the deeper layers of the models, since the concepts in our Face Dictionary are high level concepts. According to the previous studies, the high level concepts are mainly found in the layers that are closer to the output Zhou et al. (2015); Bau et al. (2017).
We train both the age and gender models on the IMDB-WIKI dataset Rothe et al. (2018) with a ResNet-50 backbone. The IMDB-WIKI dataset consists of half a million celebrity images crawled from IMDB and Wikipedia with age and gender labels. Even though the dataset presents age estimation as a classification problem, we choose to train the model as a regression task, because most of the labels are derived from dates of birth and image timestamps crawled from the web and may therefore be mislabelled. Such noise may lead to poor accuracy in classification, whereas minor labelling errors are better handled by regression. We also preprocess the data to remove invalid age entries and corrupt images. After training for 2 epochs, the model converges with a mean absolute error (MAE) of 6.6015 on the test set. The gender model also converges within 2 epochs, giving an accuracy of 90.88% on the test set.
For the beauty estimation problem we use SCUT-FBP5500 Liang et al. (2018), which is a diverse benchmark for multi-paradigm facial beauty prediction. It consists of 5500 images, each with a beauty score in the range [1, 5]. The dataset provides a train-test split of 60/40 with 5-fold cross validation. The authors of the dataset provide pretrained models, and we have dissected the ResNet-18 architecture, which has a mean squared error of 0.137 on the validation set.
For the facial recognition task we dissect two models. One is FaceNet, which is an InceptionResNetV1 pre-trained on VGGface2, and the other is FairFace, a ResNet-50 we trained on the FairFace dataset. The pre-trained model has a validation accuracy of 99.6% on LFW Huang et al. (2007), while the ResNet-50 trained on FairFace reaches a validation accuracy of 86.7%. We could not replicate the extremely high performance of the FaceNet model because hardware restrictions prevented us from using the necessary large batch sizes, and also because the dataset is smaller. FairFace only has about 100K images, which we have used to generate 80,000 triplets per epoch. Notice, however, that the obtained accuracies are competitive and good enough to extract meaningful observations on model interpretability. Recent works such as Deng et al. (2018) and Liu et al. (2017) have attained accuracies greater than 99% for facial verification and identification on massive datasets with millions of images, such as MSCeleb-1M Guo et al. (2016) and VGGface2 Cao et al. (2017).
Finally, we train a smile classification model with ResNet-50 on the subset of the CelebA dataset that has “Smiling” attribute labels. We ensure that none of the images in this training set are a part of our dictionary to avoid biased dissection. The training and validation sets have 6000 and 1200 images, respectively. After 4 training epochs, the model achieves an accuracy of 91.2% on the validation data.
Fig. 9.a shows the number of interpretable units found in each model in terms of facial parts (first step of our dissection algorithm). We observe that, in general, the region with the most units is the Cheek region for all the models except the Beauty model, which has more units in the eye region. Fig. 9.b shows the distribution over the types of concepts. We observe that attributes are the most common type in all the models, with two exceptions: Beauty and FairFace. For Beauty, the most represented type of concept is action units. One explanation might be that facial expressions play an important role in the perception of facial attractiveness Tatarunaite et al. (2005). For FairFace, we observe that the most represented type of concept is facial parts.
Fig. 10 shows the histograms for the interpretable units across all the dissected models, while some qualitative examples of detected local concepts per each model are given in Fig. 11. By comparing the histogram visualizations for each model of Hierarchical Network Dissection (left) and Network Dissection (right), we can see the interpretability results of Hierarchical Network Dissection are more complete compared to the results obtained by Network Dissection: (i) There is a clear increase in the sheer number of interpretable units that we find per concept using our approach since we do not restrict a unit’s ability to interpret multiple concepts. (ii) We observe that some of the concepts that are not revealed by Network Dissection are actually detected through our hierarchy, which shows a more diverse distribution of interpretable concepts. (iii) In contrast to Network Dissection, our hierarchical approach can also pair units with global concepts. Furthermore, our approach allows us to determine how many of the units that are interpretable for a local concept are also interpretable for a global concept, allowing us to create richer visualizations with the overlap of local and global concepts. If we analyze the similarities of Hierarchical Network Dissection and Network Dissection results, we see that both approaches on average are prone to detect more concepts that are spatially dominant such as Cheek Region concepts and very few concepts in the Nose Region as a result of varying IoU scores.
The results of Fig. 10 show the representations learned by the models and, in turn, reveal certain biases of the data they were trained on. For example, the Smile model focuses heavily on “No Beard” and “Rosy Cheeks”, which are gender-skewed concepts and hence point to a gender bias in the model. This is backed up by the number of biased units in the “Male” and “Female” section (red bars) and by the fact that a majority of the units in “No Beard” and “Rosy Cheeks” are gender biased. After obtaining these results, we checked the gender annotations in the smile detection training data and observed that females are more represented in the smiling category than males (vs. , respectively). Similarly, for the Age model we observe that a very high number of units are paired with specific age groups, which reflects the model’s ability to discriminate between people of different ages. A similar observation can be made for the Gender model, which has more units paired with gender concepts than the other models. We also see that the Beauty model is the one with the most units paired with Action Units, which points to the relevance of these concepts to the subjective nature of beauty and the influence of facial expressions on beauty perception, as argued before. An interesting observation can be made about the FaceNet model: despite having far fewer units in the chosen layer, it has the largest number of unique concept detectors, i.e. more diversity in its internal representation. This holds not only for local concepts: it is also the only model with a high number of units paired with every single global concept. We believe this is a consequence of it being the most comprehensively trained model, trained on a massive dataset and exposed to a greater variety of concepts than the other models, which were trained on moderate-sized datasets.
For the qualitative results, in Fig. 11 we display 2 units per model where our formulation generates a higher probability score for a concept which has a lower IoU than the concept identified by Stage II. We display the concept chosen purely by IoU at the top and the concept with the highest probability from our formulation at the bottom. As stated earlier, when the IoU approach of the original Network Dissection is applied to the images in our dictionary, a concept whose IoU is similar to that of the top concept cannot be ruled out as an interpretable concept without a deeper analysis of the activations generated for that concept with respect to every other concept in the same facial region. By establishing a hierarchy among the concepts of that facial region, we learn that more than one concept can have a high affinity with any given unit. This behavior can be explained by the spatial proximity of these concepts and by how difficult it is for a unit to distinguish between concepts that look alike but have different characteristics that separate them. For example, for the Gender model, in Unit 142 of the second-to-last convolutional layer of ResNet-50, the original approach identifies “5 o Clock Shadow” as the top concept with a very high IoU score of 0.1914. It is easy to see from the images that the unit does a very good job of localizing the region with this concept. However, when we use our formulation to estimate probabilities for all the concepts in the Cheek Region, we learn that “Rosy Cheeks”, despite having a far lower IoU score of 0.1109, generates a probability of 0.2276, which is much higher than the 0.1884 generated by “5 o Clock Shadow”. This supports our assumption that ignoring such concepts, as the IoU approach does, would lead to an incomplete understanding of the model’s representations.
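The disagreement between the two criteria for Unit 142 can be made concrete with a few lines of Python. This is only an illustrative sketch: the four scores are the ones reported above, while the dictionary layout and the helper `top_concept` are hypothetical names introduced for this example. Only the two concepts discussed in the text are included, not the full set of Cheek Region concepts.

```python
# Scores reported in the text for Unit 142 of the Gender model (Cheek Region).
scores = {
    "5 o Clock Shadow": {"iou": 0.1914, "probability": 0.1884},
    "Rosy Cheeks": {"iou": 0.1109, "probability": 0.2276},
}


def top_concept(concept_scores, score_name):
    """Return the concept with the largest value of the given score."""
    return max(concept_scores, key=lambda concept: concept_scores[concept][score_name])


# Ranking purely by IoU versus ranking by the probability from our formulation.
top_by_iou = top_concept(scores, "iou")
top_by_probability = top_concept(scores, "probability")
print("Top concept by IoU:", top_by_iou)
print("Top concept by probability:", top_by_probability)
```

Ranking by IoU selects “5 o Clock Shadow”, whereas ranking by probability selects “Rosy Cheeks”, which is the case shown for this unit in Fig. 11.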
Our current dictionary has a combination of local and global concepts, and our approach (described in Section 3.2) pairs individual units of a layer with them. However, it is important to understand that many facial concepts are not included in the dictionary, which is why it is difficult to quantify how much more information we can derive from the various layers of a network. We have dissected the deeper layers of the models for all six tasks discussed above using our hierarchical formulation as well as the original Network Dissection, and their coverage in terms of the number of interpretable units is displayed in Fig. 12 and Fig. 13. ResNet-50 and ResNet-18 have 4 blocks of convolutional layers. For the models in 1 with ResNet as their backbone, we have dissected all the convolutional layers in Block3 and Block4. For the FaceNet model, we dissect the final convolution block as well as all the convolution layers of the 5 Inception-C blocks (which are named repeat3 in the figures). We avoid dissecting the earlier layers of the network because of the extremely high resolution of the feature maps at this stage of the network, as well as a model’s inability to localize higher-level concepts, which was shown by Bau et al. (2017). We can observe that, except for the FaceNet model, which has extremely rich representations capable of detecting many concepts at several stages of the network, most of the other networks show roughly 70-80% coverage across multiple layers with the hierarchical approach (Fig. 12) and around 50-60% coverage with the Network Dissection approach (Fig. 13). Surprisingly, we do not observe a striking difference in the percentage of interpretable units across the upper and deeper layers, which suggests that the models have a relatively strong understanding of the underlying concepts even at the middle stages.
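As a minimal sketch, the coverage of a layer can be read as the fraction of its units that are paired with at least one concept. The code below assumes this reading. The function name `coverage`, the mapping from unit index to a set of concept names, and the toy pairings are hypothetical and are not taken from our results.

```python
def coverage(unit_concepts, num_units):
    """Fraction of a layer's units that are paired with at least one concept.

    unit_concepts: dict mapping a unit index to the set of concepts paired with it
        (a unit may be paired with several concepts, local or global).
    num_units: total number of units in the layer.
    """
    if num_units <= 0:
        raise ValueError("num_units must be positive")
    interpretable = sum(1 for concepts in unit_concepts.values() if concepts)
    return interpretable / num_units


# Made-up pairings for a toy layer with 4 units: units 0 and 1 are interpretable,
# unit 2 is missing from the mapping and unit 3 has no paired concept.
toy_pairings = {0: {"Rosy Cheeks", "Female"}, 1: {"No Beard"}, 3: set()}
toy_coverage = coverage(toy_pairings, num_units=4)
print(f"Coverage: {toy_coverage:.0%}")
```

Under this reading, a unit counts once regardless of how many concepts it is paired with, so pairing a unit with both a local and a global concept does not inflate the coverage.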
We must point out, however, that because the original dissection cannot account for global concepts, it is not strictly fair to compare the two approaches directly in terms of coverage. Our formulation allows us to account for a variety of concepts, which helps us create a more complete understanding of the model’s representations and highlights its superiority. This primarily ensures a high coverage across most models, since a unit that cannot effectively localize certain concepts may be susceptible to biases towards a subgroup of global concepts. This is easy to observe in Fig. 13, where the FairFace and Beauty models show the lowest coverage across all layers, with very few layers reaching 50% coverage. This coverage may be impacted by several factors, such as the distribution of correlated and uncorrelated attributes within a dataset and the training process of the model. This is why it is beneficial for the models to be exposed to as many concepts as possible in order to improve the estimation of these relationships derived from Hierarchical Network Dissection. A more diverse and rich version of the dictionary with several other concepts would allow us to push the limits of this formulation and create extremely accurate reports for every single face inference model.
In this paper, we present Hierarchical Network Dissection, a general pipeline to interpret the internal representation of face-centric inference models. Our approach is inspired by the Network Dissection work, which is a well-known object-centric and scene-centric model interpretability pipeline. The proposed Hierarchical Network Dissection formulation can deal with two challenges of face-centric model interpretability: “spatial overlap of concepts” and “global concepts”. In summary, the main contributions of our work are: (1) we created our “Face Dictionary”, a collection of face-centric concepts with corresponding image samples; (2) we introduced an algorithmic approach to pair units and concepts that can deal with both local and global concepts; (3) we performed controlled experiments on biased data to validate our unit-concept pairing algorithm and to empirically show how the dissection of the model can be used for bias discovery; and (4) we dissected a collection of face-centric inference models that have been trained on popular facial datasets and tasks, and provided an extended discussion on how model interpretability can be used to discover biases in the model and also to better understand how the model works. Revealing bias is relevant to improving the creation of datasets. We believe that understanding the interpretability of the model reveals what the model is paying attention to. For example, if age and smile are uncorrelated, then a smile classifier should not have many units devoted to age estimation. The presence of age units in a smile classifier means that: (1) the training set is biased; and (2) the inference might have age bias. Both are highly undesirable for attaining reliable models. Our “Face Dictionary” and the code of Hierarchical Network Dissection will be publicly released. We hope they will be useful tools for the computer vision community working on face inference models.
Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6541–6549. Cited by: §1, §1, §2, §2, §3.2, §3.2, §5.1, §5.3.
Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Advances in neural information processing systems, pp. 3387–3395. Cited by: §1, §2.
Facial attractiveness: a longitudinal study. American Journal of Orthodontics and Dentofacial Orthopedics 127 (6), pp. 676–682. Cited by: §5.2.
Age progression/regression by conditional adversarial autoencoder. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5810–5818. Cited by: §3.1.