3.1 Gender Issues in MS COCO Captioning Dataset
MS COCO [chen2015microsoft] is one of the largest and most widely used datasets for the task of image captioning. The latest (2017) version of the dataset contains 118,288 images for training and 5,000 images for validation. Most images come with 5 ground-truth captions collected from human annotators. The high variability of the images makes the dataset quite useful for training captioning models. However, this widely accepted dataset is not safe from gender bias either. There are two major causes of bias in the dataset.
First is the bias and stereotyping that human annotations carry. Several annotations have incorrect gender labels. Often, for an image of a person with inadequate (or confusing) gender cues, human annotators label the image with a gender deduced from context or personal perception. For instance, a person on a motorcycle is labelled as a ‘man’ when the person’s gender cannot be identified, or in rare cases, even when the person can be identified as a woman (Fig. 1.1).
Added to the human bias is the statistical bias due to the unbalanced data distribution. [zhao2017men] have shown that the number of images is not balanced between males and females with respect to different activities. Certain phrases are used more often with one gender than the other. This is not an issue in itself; however, the unbalanced distribution manifests as an amplified bias in models trained on the dataset [zhao2017men]. For instance, the activity cooking is 33% more likely to involve females than males in the training set, and a trained model further amplifies the disparity to 68% at test time. So objects/contexts that occur more often with a particular gender in the training data are coupled even more strongly in the model predictions. This has serious repercussions, including ignoring the image and relying unusually heavily on priors.
[zhao2017men] discuss the second source of bias in the MS COCO dataset. In this section, we throw light on the first source. These biased/incorrect annotations show a different facet of societal bias in captions. Such biases find their way into the models and lead to biased predictions. We first discuss the bias in images with a gender-indeterminate person, then shift to images with obvious genders. We also give a quantitative analysis of the bias. Last, we briefly discuss the bias in captions mentioning two or more people.
3.1.1 Biased labelling of Gender-neutral images
Images often depict a person whose gender cannot be identified from the image. However, annotators tend to assign a gender of their choice instead of using gender-neutral words. We believe that for a gender-sensitive model, using gender-neutral words for images with indeterminate gender is as important as identifying the gender correctly in gendered images. This evaluates the model’s understanding of gender and stands as a proof of concept that the model recognizes the cues that convey gender. For trained models to possess this ability, we need such instances in the training data, i.e. images depicting a person but lacking gender cues, with annotations (captions) containing gender-neutral phrases for the person. COCO contains several images where costume, scale, pose, etc. make gender identification difficult. However, several of these have gendered words in the annotations. Below we discuss two key issues observed in the ground-truth annotations of gender-indeterminate images: inconsistent genders, and coordinated genders reflecting bias.
Figure 3.1: Instances from MS COCO where the gender of the person cannot be estimated from the image but annotators use gendered language. Different annotators make different choices, leading to conflicting and inconsistent gender labels: 2 say male and 2 female in each of the images.
Conflicting/inconsistent gender in captions-
For an image of a person lacking sufficient cues to predict a particular gender, annotators sometimes try to guess a gender. Such guesses are very likely to result in conflicting captions, where the same person is referred to as male by some annotators and female by others. This conflict further corroborates the idea that the image does not possess sufficient cues to label the person’s gender with certainty. Figure 3.1 shows two such instances from the training data. The gender of the person snowboarding in Fig. 3.1(a) cannot be predicted satisfactorily from the image due to the small size and obscured face. But the annotators make their personal choices: two label it as ‘male’, two as ‘female’, and only one annotator chooses to use the gender-neutral word ‘snowboarder’. Fig. 3.1(b) shows a young infant. Babies/infants tend to possess very few gender cues. However, 2 annotators label it as a boy and 2 as a girl, while only one correctly sticks to the gender-neutral reference ‘child’.
Societal Bias in captions- When the gender is not visible in the image and the annotators refrain from using gender-neutral words, one would expect conflicting annotations, as in the previous section. However, there are numerous instances of images with no clear gender cues where the annotators agree on the gender. On a quick look at these instances, we realize that the prior in our mind suggests a gender even though nothing in the image rules out the other gender. The image may not have enough cues to judge the gender of the person, but annotators manage to deduce one from the context. This demonstrates the influence of the expected norms of society on our perspective, and on the annotations. Fig. 3.2 shows instances where the person is unidentifiable. The person surfing in Fig. 3.2(a) and the one performing a stunt on a motorcycle in Fig. 3.2(b) are attributed the male gender owing to the context, though nothing prevents them from being female. Similarly, the image of a person lying in bed cuddled up in Fig. 3.2(c) makes the majority of annotators infer a female. Likewise, annotators perceive a female in Fig. 3.2(d).
A key observation is that, qualitatively, instances where annotators assume the gender to be male appear far more numerous than instances where annotators assume it to be female. Quantifying this in the dataset is difficult, since it is hard to set apart the images where the gender is indeterminate and the annotations are mere assumptions (predictions of a model trained on this dataset exhibit similar behaviour; refer to Table 4.7). There are numerous instances where a person is barely visible in the frame but labelled as male. Figure 3.3 gives instances of highly obscured people referred to as males (this could be due to the overall higher presence of males in the dataset, which makes annotators find ‘male’ more likely, or due to instance-specific image-based bias).
The instances where an indeterminate person is labelled with gendered words stand against the idea of person-focused gender identification. It makes learning more challenging because a model trained on these images could:
learn that contexts drive the gender rather than the person in the image.
learn to predict a man in cases where it cannot clearly see the person in the image.
3.1.2 Inconsistency in annotations for gendered images
In the previous section, we looked at images where the person is not identifiable and confusion/bias tends to creep in. However, there are even more alarming instances of bias in the dataset: instances where the gender is obvious, yet annotators label it with the wrong gender, resulting in contradicting captions. Figure 3.4 shows two images where one can easily identify the person as a woman carrying a surfboard. However, two annotators identify the person as male. These instances demonstrate the harsh gender insensitivity in the annotations.
3.1.3 Quantitative measure of inconsistent/biased gender annotations
Inconsistent/Ambiguous gender labels - Table 3.1 gives the counts of images in the MS COCO training data (2017 version) that have conflicting gender descriptions. By conflicting descriptions, we mean images wherein some captions label the person as male while others label it as female. The instances are chosen such that all captions describe a single person using exactly one word from a predefined list of male, female and gender-neutral words.
Table 3.1 (columns): Male | Female | Neutral | Number of images
The COCO dataset has a much higher number of images of males (about 5000 with all captions male) than of females (about 3000 with all captions female), but Table 3.1 suggests that the inconsistent images are more often of females than of males. To understand this, first assume that if 4 out of 5 captions of an image agree on a gender, then that gender is most likely correct. Now compare row 4, i.e. 1 male, 4 female, 0 neutral, with the last row, i.e. 4 male, 1 female, 0 neutral. The former, with a count of 77 images, represents inconsistent images that are with high probability (w.h.p.) female, and the latter, with a count of 47 images, represents images that are w.h.p. male. Since in the entire dataset the number of images of males is much higher than of females, the higher count of female images here is atypical. Hence, we infer that annotators tend to confuse a female for a male more often than the other way round. This could be due to the male dominance in the images, which sets a prior in the annotators’ minds. This is interesting because if such bias can influence human annotators, then models can certainly be influenced by it.
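The selection procedure behind Table 3.1 (keep only images whose captions each use exactly one subject word, then count male/female/neutral mentions) can be sketched as follows. The word lists here are short illustrative stand-ins for the predefined lists used in the analysis:

```python
from collections import Counter

# Hypothetical word lists; the analysis uses its own predefined lists
# of male, female and gender-neutral subject words.
MALE = {"man", "boy", "guy", "gentleman", "male"}
FEMALE = {"woman", "girl", "lady", "gal", "female"}
NEUTRAL = {"person", "child", "snowboarder", "surfer", "rider"}

def gender_signature(captions):
    """Return (n_male, n_female, n_neutral) for an image's captions,
    or None if any caption does not use exactly one subject word."""
    sig = Counter()
    for cap in captions:
        words = set(cap.lower().split())
        hits = [("male", words & MALE), ("female", words & FEMALE),
                ("neutral", words & NEUTRAL)]
        matched = [g for g, h in hits if h]
        total = sum(len(h) for _, h in hits)
        if total != 1:          # require exactly one subject word
            return None
        sig[matched[0]] += 1
    return (sig["male"], sig["female"], sig["neutral"])

caps = ["a man rides a wave", "a woman rides a wave",
        "a man on a surfboard", "a woman surfing",
        "a person in the ocean"]
print(gender_signature(caps))   # (2, 2, 1): a conflicting image
```

An image yielding a signature like (2, 2, 1) is exactly the kind of conflicting instance counted in Table 3.1.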
Gender-neutral vs Gender labels - In Figure 3.5(a), we look at the patterns in the distribution of gendered and gender-neutral captions for males and females. For the plot for males (and likewise for females), we count the images in train2017 where all captions refer to a single person using a word from the list of male-gendered subjects (female-gendered subjects) and gender-neutral subjects. Two key observations follow:
the number of images increases with the number of gendered captions, indicating that more images with identifiable gender exist (than with indeterminate gender) and that more people tend to use a gendered word (than a gender-neutral one).
The instances with 4 gendered and 1 neutral caption can be considered to have the corresponding gendered label as ground truth, with 1 annotator happening to use a gender-neutral reference. We check whether the probability with which annotators use a gender-neutral word is equal for male and female images. The contingency table in Fig. 3.5(b) is used for a chi-squared independence test; the resulting p-value shows that the two are not independent. Annotators tend to use gender-neutral terms more commonly for male images than for female ones. Put another way, annotators tend to use female-gendered words more explicitly when they see a female in the image.
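The independence test can be reproduced in a few lines of plain Python (the Pearson chi-squared statistic for a 2x2 table with 1 degree of freedom). The counts below are illustrative placeholders, not the actual values from the thesis's contingency table:

```python
import math

def chi2_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table.
    Returns (chi2 statistic, p-value with 1 degree of freedom)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row = [a + b, c + d]
    col = [a + c, b + d]
    chi2 = 0.0
    for i, obs_row in enumerate(table):
        for j, obs in enumerate(obs_row):
            exp = row[i] * col[j] / n     # expected count under independence
            chi2 += (obs - exp) ** 2 / exp
    # survival function of the chi-squared distribution with 1 dof:
    # P(X > chi2) = erfc(sqrt(chi2 / 2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Made-up counts for illustration: rows = male / female images,
# columns = 5 gendered captions vs. 4 gendered + 1 neutral caption.
chi2, p = chi2_2x2([[3000, 420], [2100, 180]])
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")
```

A p-value below the chosen significance level (e.g. 0.05) rejects independence, i.e. neutral-word usage depends on the gender in the image.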
3.1.4 Bias in captions mentioning more than 1 person
There are 6282 captions which contain both a male-gendered singular word and a female-gendered singular word. Out of these, 3153 (close to 50%) have the phrase ‘(male-word) and (female-word)’ or ‘(male-word) and a (female-word)’, and 385 have ‘(female-word) and (male-word)’ or ‘(female-word) and a (male-word)’. So for the images with 1 male and 1 female, fewer than half of the captions assign different activities to the two persons.
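Counting such coordinated phrases amounts to a simple regex scan over the captions; a sketch with hypothetical word lists (the actual lists used for the counts above may differ):

```python
import re

# Hypothetical singular subject words; the analysis uses its own lists.
MALE = "man|boy|guy|gentleman"
FEMALE = "woman|girl|lady|gal"

# '(male-word) and (female-word)' or '(male-word) and a (female-word)'
male_first = re.compile(rf"\b({MALE}) and a? ?({FEMALE})\b")
female_first = re.compile(rf"\b({FEMALE}) and a? ?({MALE})\b")

captions = [
    "a man and a woman sitting on a bench",
    "a man and woman cutting a cake",
    "a woman and a man riding horses",
    "a man talks while a woman cooks",
]
mf = sum(bool(male_first.search(c)) for c in captions)
fm = sum(bool(female_first.search(c)) for c in captions)
print(mf, fm)  # 2 1
```

Only the last caption assigns distinct activities to the two persons; the coordinated-phrase captions convey no per-gender information.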
In Figure 3.6(a), we plot the number of images against an increasing number of captions having gendered plural words (for example, men, boys for males; women, girls for females), with the rest using gender-neutral plural words (people, persons, etc.). The plots are mostly similar for males and females, with the key difference being in the values for 4 and 5 gendered captions. Fig. 3.6(b) gives the contingency table for the chi-squared test, giving a p-value of 0.06, suggesting that the usage of gender-neutral words is not significantly different whether many males or many females are present in the image.
3.2 Issues with Trained Models
A model trained on biased data learns the bias and gives biased predictions. [zhao2017men] give an analysis of the bias amplification by a model when trained with biased data. They show that the coupling of gender with activity/object is not only learned but also amplified by the model. In this section, we give further analysis of the gender bias present in the predictions of a model trained on the MS COCO image captioning dataset.
3.2.1 Biased inference of gender from context
Gendered images - Figure 3.7 gives some instances where bias from the dataset comes into play while predicting the gender in images. Each of the 3 images has sufficient cues to recognize the person as a female. However, since the activities/objects, namely riding a bike, riding a boat and playing soccer, are biased towards male subjects in the training dataset, the model predicts a ‘male’ subject word, while for the context of a ‘kitchen’ the model predicts a ‘female’ subject word.
Gender-Indeterminate images - Since the training data does not always contain gender-neutral words for images where the gender is unidentifiable, the model also resists using gender-neutral words. In Figure 3.8, none of the 3 images has sufficient cues to confidently suggest a gender: the person riding the bike, the person riding the horse and the little kid in the garden could each be either male or female. But the model predicts gendered subject words.
3.2.2 Phrase bias - ‘man and woman’ occurrence increased
The pre-trained convolutional image captioning model [aneja2018convolutional], tested on COCO val2017, gives 95 instances wherein both a male singular subject word and a female singular subject word are predicted in a caption. However, a staggering 89 of those (around 93%) have the phrase ‘(male word) and (a) (female word)’. We observed in Section 3.1.4 that in the training data, out of the instances wherein both male and female singular subject words occur, approximately 50% had the phrase ‘(male word) and (a) (female word)’. This is another instance of bias amplification (from 50% to 93%). In several instances where 2 people can be seen, the model annotates them as ‘man and woman’. Figure 3.9 gives 3 such instances. The fact that the two genders are almost never associated with different predicates in a predicted caption is alarming. However, it strongly corroborates our idea of independent image captioning and gender identification. In other words, in predictions with 1 male and 1 female, the model in [aneja2018convolutional] mostly does not provide unique information by differentiating between the genders, and hence our approach of dropping the gender during caption prediction would not affect the quality. In future work, we hope to get more informative captions with different activities per subject and use attention maps to localize the subject, and thereafter obtain the gender.
Another interesting point is that the remaining 6 predictions (where a male word and a female word occur separately in the caption) also show strong bias influence. 3 of them are almost identical to ‘a man on a motorcycle with a woman on the back’. Figure 3.10 shows 3 instances: in Fig. 3.10(a), the prediction says ‘woman holding an umbrella’ though the man is holding the umbrella (this can be attributed to more women holding umbrellas in the training data). In Fig. 3.10(b), a woman is on the front seat while a man sits at the back, but the prediction is the other way round. In Fig. 3.10(c), the caption conveys no unique information for the 2 genders.
3.2.3 Gender influencing the other words in the caption
The captions are generated word by word, with each word conditioned on the previous words. Since the gendered subject word appears early in the caption, it contributes to the prediction of all the subsequent words in the sentence. This coupling could be helpful if we are indeed looking for predictions where the subject and predicate linkage is essential. However, such couplings can also result in predictions where the image is overlooked to accommodate the bias. Figure 3.11 gives 3 images where the model wrongly predicts ‘a man in a suit and tie’, without any visible sign of a suit or a tie.
This issue can be understood via the label bias problem [lafferty2001conditional], which stems from local normalization. The high probability assigned to ‘a man in a suit’ (433 occurrences of ‘a man in a suit’ out of 2658 occurrences of ‘a man in a’ in the training data) causes the model to predict ‘suit’ while overlooking the image. However, for females, the prediction does not say suit even when there is one. The label bias can be reduced, or rather made consistent across males and females, by making the data gender-neutral before training the model. To solve the problem completely, the model architecture and/or loss formulation would also need alteration.
4.1 Training and Evaluation Datasets
This section discusses the details of the datasets used for training and testing our two sub-tasks. It also includes specifications about the anti-stereotypical dataset, specially designed for evaluating the effect of bias.
Gender Neutral Image Captioning - In order to train a gender neutral image captioning model, we modify the COCO dataset to remove the notion of gender from the captions. We replace all occurrences of male and female gendered words by their gender-neutral counterparts. We differentiate between the ages by using different replacements for man/woman and boy/girl. The following replacements are made to achieve gender neutral settings:
woman/man/lady/gentleman (etc.) → person
plural forms of woman/man/lady/gentleman (etc.) → people
plural forms of boy/girl/guy/gal → youngsters
gendered pronouns (his/her/hers/him/himself/herself) → it/itself/its
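The replacements can be sketched as a small regex-based routine. The word lists below are abbreviated stand-ins for the full lists used for the dataset modification:

```python
import re

# Abbreviated replacement table; the full word lists are longer.
# Patterns are tried in order; \b keeps 'man' from matching inside 'woman'.
REPLACEMENTS = {
    r"\b(woman|man|lady|gentleman)\b": "person",
    r"\b(women|men|ladies|gentlemen)\b": "people",
    r"\b(boy|girl|guy|gal)\b": "youngster",
    r"\b(boys|girls|guys|gals)\b": "youngsters",
    r"\b(his|her|hers)\b": "its",
    r"\b(him)\b": "it",
    r"\b(himself|herself)\b": "itself",
}

def neutralize(caption):
    """Map gendered subject words and pronouns to neutral counterparts."""
    for pattern, sub in REPLACEMENTS.items():
        caption = re.sub(pattern, sub, caption, flags=re.IGNORECASE)
    return caption

print(neutralize("a man and a woman walk their dog"))
# a person and a person walk their dog
```

The same routine can be applied both to the training captions (to build GN-CNN's data) and to CNN's predictions (to build CNN-GN).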
Note that for all our experiments, we use the train/val/test splits described in [karpathy2015deep], also used for the CNN model [aneja2018convolutional]. They contain 113,287 training images, 5,000 images for validation, and 5,000 for testing.
Gender Identification - To teach the model to correctly refrain from predicting gender in images where the gender cannot be inferred, we use 3 classes in our gender classifier: male, female and indeterminate. There is no dataset with full-scene images that specifically labels the person’s gender, and the existing image captioning datasets have the issue of incorrect ground-truth annotations (as discussed before). So, for our task, we build a dataset with almost-sure labels from MS COCO. Using the same splits as for gender-neutral captioning, we select the images where all annotations speak of only 1 person. We subsample the images where at most 1 gender is mentioned (either only male or only female across captions) and where most captions agree on the gender. Obtaining images for the ‘gender-indeterminate’ class is more difficult, since the usage of gender-neutral words could be the annotators’ personal choice even for gendered images. For this class, we restrict to images where at least 4 annotators use gender-indeterminate subject words. From each image, we pick the largest annotated bounding box (by area) among all person-labelled bounding boxes. We carry out gender classification at 2 different levels of granularity: bounding box and person mask. We get the image crops for the bounding box and person mask annotations from the COCO ground-truth annotations. After this procedure, we obtain 14620 male, 7243 female and 2819 gender-indeterminate images from the training split. For validation, we have 649, 339 and 121 respectively; the test data has 600, 312 and 134.
Evaluation Data for the Overall Task - Unbiased gender prediction is desirable, but it would not necessarily result in increased Bleu scores on the test data. The validation and test data of MS COCO have been collected identically to the training data. Hence, they tend to possess similar flaws in the ground-truth annotations and similarly biased gender-context distributions. Removing the bias could adversely affect the performance on these. To effectively test our approach, we form a specialized anti-stereotypical dataset called the unusual dataset.
Unusual/Hard Dataset - Instead of focusing on the person for gender prediction and on the context for the activity, existing models tend to use priors and learned bias. To evaluate our approach, we needed a dataset with instances that do not follow the statistical norm of the training data. We create such a dataset by first computing the gender bias [zhao2017men] in the training data for each word. The male bias for a word is computed as:

bias_male(word) = c(word, male) / (c(word, male) + c(word, female))

where c(word, male) refers to the co-occurrence count (number of captions) of the word with male subject words, and c(word, female) refers to the count with female subject words. We obtain 2 sets of words, one with the highest male bias and the other with the highest female bias, then filter images from the test data where the ground-truth caption contains a male gender with any of the female-biased words, or a female gender with any of the male-biased words; for instance, pairs like female-snowboard, male-knitting, and male-heels. This gives us the ‘unusual dataset’ containing 69 instances of males (with female-biased contexts) and 113 instances of females (with male-biased contexts). Fig. 4.1 shows instances of females from the hard dataset.
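The bias computation can be sketched as follows, with tiny illustrative word sets and captions (the real computation runs over all training captions):

```python
from collections import Counter

# Hypothetical subject word sets; the analysis uses its own lists.
MALE = {"man", "boy", "guy"}
FEMALE = {"woman", "girl", "lady"}

def male_bias(captions):
    """bias(word) = c(word, male) / (c(word, male) + c(word, female)),
    counting caption-level co-occurrence with gendered subject words."""
    with_male, with_female = Counter(), Counter()
    for cap in captions:
        words = set(cap.lower().split())
        target = with_male if words & MALE else (
                 with_female if words & FEMALE else None)
        if target is None:
            continue  # caption mentions no gendered subject
        for w in words - MALE - FEMALE:
            target[w] += 1
    return {w: with_male[w] / (with_male[w] + with_female[w])
            for w in set(with_male) | set(with_female)}

caps = ["a man riding a snowboard", "a man on a snowboard",
        "a woman riding a snowboard", "a woman knitting a scarf"]
b = male_bias(caps)
print(b["snowboard"], b["knitting"])  # snowboard is male-biased here
```

Words with bias close to 1 go into the male-biased set, close to 0 into the female-biased set; the unusual dataset then pairs each gender with the opposite set.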
4.2 Experiments and Results
4.2.1 Gender-neutral image captioning
To train a model for image captioning that is unaware of gender, we make the training and test datasets gender-neutral. Male and female words are replaced by their gender-neutral counterparts. We call this model Gender-Neutral CNN (GN-CNN). We compare against the pre-trained CNN model [aneja2018convolutional], which we call CNN. For a fair comparison, we neutralize the predictions of CNN in the same way as the training captions and compare against similarly neutralized ground-truth captions, calling this CNN-GN (predictions of the CNN post-processed to make them gender-neutral). We train the model with 5 reference captions per image, and a beam search of size 1 is used for predictions. The model and training details are exactly the same as in [aneja2018convolutional], with only the dataset modified. When evaluating on data with similar bias, one would expect performance to drop due to the loss of the statistical support provided by the bias. However, we observe similar performance scores for both models, summarised in Table 4.1.
Comparable performance by the two models means that training on gender unaware captions has a mixed influence overall. We hypothesize that the following losses/gains could be the reason:
Gain - Our model does not possess gender-context bias. There is no gender prediction to influence the remaining words in the caption and hence the image can contribute better to the captions.
Gain - Gender specific label bias is removed due to an apparent merge of those states into a gender-neutral one.
Loss - The test data comes from the same distribution as the training data and possesses similar annotation bias, so the gender bias could help in getting captions (activities) closer to the ground truth. For example, the high occurrence of ‘a man in a suit and tie’ made the prediction of the phrase easy at test time once ‘a man in a’ was predicted. In the absence of the gender bias, the model cannot predict the phrase with as high a probability and needs to learn to infer from the image.
Loss - Predictions from CNN-GN undergo the exact same gender-neutralization method as the ground-truth captions. For instance, ‘a man and woman’ is changed to ‘a person and person’, while our gender-unaware model could learn to predict ‘two people’ instead, as all occurrences of two males or two females are converted to ‘two people’. On the other hand, our model could predict ‘a person and a person’ even for two males/females, which would have similar repercussions.
To investigate further, we look at human and non-human instances separately. Table 4.2 summarises the performance on the single-human dataset containing instances with a reference to one human in the caption (singular gendered and gender-neutral references): 1942 images. On the other hand, Table 4.3 summarises the performance on the nature dataset containing images whose captions have no human reference. The CIDEr score of our model on both these datasets is reasonably higher than that of the CNN model, and the other metrics are nearly equal. It is surprising that the nature dataset also gets different scores, when the only difference between the two models is the use of gendered references in CNN and gender-neutral ones in GN-CNN during training. Figs. 4.2, 4.3, 4.4 and 4.5 also contain some non-human instances with a stark difference in the captions predicted by the two models.
Evaluation on the hard dataset that we designed with anti-stereotypical instances is summarised in Table 4.4. Our model shows significant improvement across all metrics, particularly for the female images. This emphasizes that the gender-context coupling really harms the overall caption quality in anti-stereotypical cases. This further corroborates that decoupling the two tasks not only improves gender correctness but also the caption quality.
Issue with the metric - We use the conventional metrics to evaluate caption quality and the changes upon removal of the bias. To analyze what specific changes happen in the predictions, we pick instances from the complete test data for which the metric differs the most between the two models (either could be higher). The instances with the highest changes in Bleu1 scores are presented in Fig. 4.2, Bleu2 in Fig. 4.3, Bleu3 in Fig. 4.4 and Bleu4 in Fig. 4.5, with the gender-neutral captions predicted by our model and the gender-neutralised captions of CNN. The examples demonstrate that in most instances of maximum score change, it is difficult to guess which caption would have received the higher score, showing the ambiguity. These metrics are thus not ideal for analyzing the bias, and better metrics could be helpful.
4.2.2 Gender identification
We train a ResNet50 [he2016deep] model to classify an image into 3 classes: male, female and person. We use a weighted cross-entropy loss with weights 1, 5 and 3 for the male, person and female categories respectively (since the dataset is unbalanced). For our classifiers, all crops for bounding boxes and body masks are obtained from the COCO ground-truth annotations.
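The effect of the class weights can be sketched in plain Python (in a deep learning framework this is typically passed as the weight argument of the cross-entropy loss); the logits below are made up for illustration:

```python
import math

# Class order and weights as above: male=1, person=5, female=3.
WEIGHTS = {"male": 1.0, "person": 5.0, "female": 3.0}
CLASSES = ["male", "person", "female"]

def weighted_ce(logits, label):
    """Weighted cross-entropy for one sample: -w[label] * log softmax."""
    m = max(logits)                                   # for numerical stability
    z = sum(math.exp(l - m) for l in logits)
    log_prob = logits[CLASSES.index(label)] - m - math.log(z)
    return -WEIGHTS[label] * log_prob

# An equally confident mistake on a 'female' sample costs 3x as much
# as on a 'male' sample, counteracting the class imbalance.
loss_f = weighted_ce([0.1, 0.2, 2.5], "female")
loss_m = weighted_ce([2.5, 0.2, 0.1], "male")
print(loss_f, loss_m)  # loss_f is exactly 3x loss_m here
```

The minority classes (person, female) thus contribute more gradient per sample, discouraging the classifier from collapsing onto the majority male class.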
Table 4.5 (columns): Classifier | Male Acc. | Person Acc. | Female Acc. | Overall Acc.
Table 4.5 summarizes the class-wise and overall accuracy for the two classifiers on the test set containing 600 male, 312 female and 134 gender-neutral instances. We also compare with the CNN model of [aneja2018convolutional] by extracting the gender from the captions. Table 4.6 gives the confusion matrices for our gender classifiers, and Table 4.7 gives the confusion matrix for the CNN model. Notice that the improvement for females is the highest (from 66% to 83%). The performance on the gender-neutral class is not very good. This could be due to fewer training instances; also, the ground-truth label could be wrong in some instances. The performance can be improved by increasing the amount of data with almost-sure ground-truth labels. The current datasets are quite small to address the high intra-class variability in poses, angles, etc. and the subtle inter-class differences.
4.2.3 Gendered image captioning - combining the two
Once we obtain the gender-neutral captions and the gender labels, the gendered captions are generated by merging the two results. One could follow any criteria of choice for injecting the gender back into the captions. For the merge, we use the following rules:
For each caption mentioning a singular person, we extract the largest bounding box and obtain a label for it. The occurrence of ‘person’ is changed to ‘man’ or ‘woman’ or left unchanged depending on the predicted gender class, while occurrences of ‘youngster’ are changed to ‘boy’, ‘girl’ or ‘child’ (since youngster is not as common in the training data).
For captions mentioning two people, we extract the two largest person boxes, and get gender labels for both. Now the substitutions are made depending on the classes for both the labels. We substitute the phrase ‘two people/youngsters’ with ‘a genderlabel1 and a genderlabel2’.
For captions mentioning more than 2 people through phrases like ‘group of people’, we extract at most the 6 largest bounding boxes and get a gender label for each. If all belong to the same gender, we substitute the plural form of that gender; if there is a mixture of gender labels, we leave the caption as is. The word ‘youngsters’ is changed to the more common word ‘children’.
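The merge rules above can be sketched as follows; `classify` is a hypothetical stand-in for the trained 3-way gender classifier, applied to person crops:

```python
# Substitution tables for the classifier's 3 output classes.
SINGULAR = {"male": "man", "female": "woman", "person": "person"}
YOUNG = {"male": "boy", "female": "girl", "person": "child"}

def inject_gender(caption, crops, classify):
    """Re-inject gender into a neutral caption given person crops and a
    classifier returning 'male', 'female' or 'person' per crop."""
    labels = [classify(c) for c in crops]
    if "person" in caption and len(labels) == 1:
        caption = caption.replace("person", SINGULAR[labels[0]])
    elif "youngster" in caption and len(labels) == 1:
        caption = caption.replace("youngster", YOUNG[labels[0]])
    elif "two people" in caption and len(labels) == 2:
        caption = caption.replace(
            "two people",
            f"a {SINGULAR[labels[0]]} and a {SINGULAR[labels[1]]}")
    elif "group of people" in caption:
        if all(l == "male" for l in labels):
            caption = caption.replace("people", "men")
        elif all(l == "female" for l in labels):
            caption = caption.replace("people", "women")
        # mixed labels: leave the caption unchanged
    return caption

print(inject_gender("two people riding horses", ["crop1", "crop2"],
                    lambda c: "female"))
# a woman and a woman riding horses
```

The two largest person boxes are matched to the two mentions in order here; better person-to-mention correspondence (e.g. via attention maps) is left as noted in the text.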
The final scores on the hard dataset (Table 4.9) are significantly higher for our model than for the CNN model; it outperforms the CNN model on all metrics. The performance after substitution on the entire test data is summarized in Table 4.8. The scores are similar, with our model achieving slightly lower values. Since our model outperforms the CNN model on the single-human and nature datasets, we believe that the images with multiple persons are pulling the scores down. Improvements in correctly extracting the corresponding persons from the image and better substitutions could improve the scores. In Fig. 4.6, we compare some of the final captions with the CNN predictions.