Exposing and Correcting the Gender Bias in Image Captioning Datasets and Models

12/02/2019 ∙ by Shruti Bhargava, et al. ∙ 11

The task of image captioning implicitly involves gender identification. However, due to the gender bias in data, gender identification by an image captioning model suffers. Also, because captions are predicted word by word, the gender-activity bias influences the other words in the caption, resulting in the well-known problem of label bias. In this work, we investigate gender bias in the COCO captioning dataset and show that it arises not only from the statistical distribution of genders with contexts but also from flawed annotation by the human annotators. We look at the issues created by this bias in the trained models. We propose a technique to remove the bias by splitting the task into 2 subtasks: gender-neutral image captioning and gender classification. With this decoupling, the gender-context influence can be eliminated. We train a gender-neutral image captioning model, which gives results comparable to a gendered model even when evaluated against a dataset that possesses a bias similar to that of the training data. Interestingly, the predictions of this model on images with no humans are also visibly different from those of the model trained on gendered captions. We train gender classifiers using the available bounding box and mask-based annotations for the person in the image. This allows us to discard the context and focus on the person to predict the gender. By substituting the genders into the gender-neutral captions, we get the final gendered predictions. Our predictions achieve performance similar to a model trained with gender, and at the same time are devoid of gender bias. Finally, our main result is that on an anti-stereotypical dataset, our model outperforms a popular image captioning model which is trained with gender.




3.1 Gender Issues in MS COCO Captioning Dataset

MS COCO [chen2015microsoft] is one of the largest and most widely used datasets for the task of image captioning. The latest (2017) version of the dataset contains 118,287 images for training and 5000 images for validation. Most images come with 5 ground-truth captions collected from human annotators. The high variability of images makes the dataset quite useful for training captioning models. However, this widely accepted dataset is not free from gender bias. There are two major causes of bias in the dataset.

  • First is the bias and stereotype that human annotations possess. Several annotations have incorrect gender labels. Often, in an image of a person with inadequate (or confusing) gender cues, human annotators label the image with a gender deduced from the context or random perception. For instance, a person on a motorcycle is labelled as a ‘man’ when the person’s gender cannot be identified, or in rare cases, even when the person can be identified as a woman (Fig.1.1).

  • Added to the human bias is the statistical bias due to unbalanced data distribution. [zhao2017men] have illustrated that the number of images is not balanced for males and females with respect to different activities. Certain phrases are used more often with one gender than the other. This is not an issue in itself; however, the unbalanced distribution manifests as an amplified bias in the models trained on the dataset [zhao2017men]. For instance, the activity cooking is 33% more likely to involve females than males in a training set, and a trained model further amplifies the disparity to 68% at test time. So objects/contexts that occur more often with a particular gender in the training data are even more strongly coupled in the model predictions. This has serious repercussions, including the model ignoring the image and depending unduly on priors.

[zhao2017men] discuss the second source of bias in the MS COCO dataset. In this section, we shed light on the first source. These biased/incorrect annotations show a different facet of societal bias in captions. Such biases find their way into the models and lead to biased predictions. We first discuss the bias in images with a gender-indeterminate person, then shift to images with obvious genders. We also give a quantitative analysis of the bias. Last, we briefly discuss the bias in captions mentioning two or more people.

3.1.1 Biased labelling of Gender-neutral images

Images often depict a person whose gender cannot be identified from the image. However, annotators tend to assign a gender of their choice instead of using gender-neutral words. We believe that for a gender-sensitive model, using gender-neutral words for images with indeterminate gender is as important as identifying gender correctly in gendered images. This tests the model’s understanding of gender and serves as a proof of concept that the model recognizes the cues that convey gender. For the trained models to possess this ability, we need such instances in the training data, i.e. we need images with a person but lacking gender cues, and annotations (captions) containing gender-neutral phrases for the person. COCO contains several images where costume, scale, pose etc. make gender identification difficult. However, several of these have gendered words in the annotations. Below we discuss two key issues observed in the ground truth annotations of gender-indeterminate images - inconsistent genders, and coordinated genders reflecting bias.

(a) i) A woman is snowboarding down a snowy hill
ii) a woman snowboards down the side of a snow covered hill
iii) A man is snowboarding down a snowy hill.
iv) A guy on a snowboard skiing down a mountain slop.
v) a snowboarder riding the slopes amongst a skier
(b) i) A young child dressed nicely in a blue sports jacket and tie leaning on a rail.
ii)A young girl in a school uniform leans on a rail.
iii)A little boy in a school uniform standing near a rail
iv)Young boy wearing coat and tie school uniform.
v)a little girl is dressed in a uniform outside
Figure 3.1: Instances from MS COCO where the gender of the person cannot be estimated from the image but annotators use gendered language. Different annotators make different choices, leading to conflicting and inconsistent gender labels - 2 say male and 2 say female in each of the images.

Conflicting/inconsistent gender in captions - For an image of a person lacking sufficient cues to predict a particular gender, annotators sometimes try to guess a gender. Such guesses are very likely to result in conflicting captions where the same person is referred to as male by some annotators and female by others. This conflict further corroborates the idea that the image does not possess sufficient cues to label the gender of the person with certainty. Figure 3.1 shows two such instances from the training data. The gender of the person snowboarding in Fig. 3.1(a) can’t be predicted satisfactorily from the image due to the small size and obscured face. But the annotators make their personal choices: two label the person as male, two as female, and only 1 annotator chooses to use the gender-neutral word ‘snowboarder’. In Fig. 3.1(b), the image shows a young child. Babies and young children tend to possess very few gender cues. However, 2 annotators label the child as a boy and 2 as a girl, and only one correctly sticks to the gender-neutral reference ‘child’.

Societal Bias in captions - When the gender is not visible in the image and the annotators refrain from using gender-neutral words, one would expect conflicting annotations, as discussed above. However, there are numerous instances of images with no clear gender cues where the annotators agree on the gender. A quick look at these instances shows that the prior in our mind suggests a gender, though nothing in the image rules out the other gender. The image may not have enough cues to judge the gender of the person, but annotators manage to deduce one from the context. This demonstrates the influence of the expected norms of society on our perspective and on the annotations. Fig. 3.2 shows instances where the person is unidentifiable. However, the person surfing in Fig. 3.2(a) and the one performing a stunt on a motorcycle in Fig. 3.2(b) are attributed the male gender owing to the context, though nothing prevents either from being a female. Similarly, the image of a person lying in bed cuddled up in Fig. 3.2(c) leads the majority of annotators to infer a female. Likewise, annotators perceive a female in Fig. 3.2(d).

(a) i) A man in black surfing in wild choppy water.
ii)A man riding a surfboard in the rapids of a river
iii)Man in a wet suit surfing river rapids
iv)a person surfing in a large deep river
v)A person surfs waves in a river.
(b) i)A blurry motion picture of a person doing a wheelie on a motorcycle.
ii)A person is performing a wheelie on a motorcycle.
iii)A man sitting on a motorcycle doing a stunt.
iv)A man doing stunts on his motorcycle at a show.
v)A man is riding a motorcycle on one wheel.
(c) i) A young towel-wrapped child is on a bed, facing a window.
ii)A woman laying in bed under a window in a bedroom.
iii)a cute girl cuddled with her blanket catching some sleep
iv)a woman snuggled up in her blanket sleeping on bed
v)A small child that is laying in bed.
(d) i)A person holding a red umbrella sitting on a pier.
ii)a person sitting on a bench with a red umbrella and is looking at some water
iii)A woman sits on a beach looking at the ocean under an umbrella.
iv)A woman holding a red umbrella sits on a bench facing the sea.
v)Lone woman with umbrella on a bench looking at the ocean
Figure 3.2: Instances from MS COCO where the gender of the person cannot be estimated from the image but annotators use gendered language. Furthermore, the annotators who use gendered words agree on the gender, demonstrating context-based societal bias.
(a) i) A man playing a first person shooter video game
ii)A person holding a white Xbox video game controller with the game on the television in the background.
iii)a close up of a person holding an xbox 360 controller
iv)A man is using a video game controller.
v)A man is playing a video game on his TV.
(b) i)This is a painting of a boy painting.
ii)This mural shows a person drawing a mural.
iii)Painting of a boy in baseball cap writing on a bus window.
iv)The large painting is of a child looking through a bus window.
v)A painting of a boy writing on a subway train’s window
(c) i)A man doing tricks on a skateboard at a skate park.
ii)A man jumping a skateboard over a ramp.
iii)A skateboarder performing a stunt near graffiti covered concrete.
iv)The man is riding his skateboard practicing his tricks outside.
v)a person is doing a trick on a skateboard
Figure 3.3: Instances from MS COCO where a person is barely in the frame, yet annotators choose ‘male’ annotations. More gender-indeterminate images seem to be associated with males than with females, which could be attributed to the overall higher presence of males encouraging annotators to say male, to image-based bias, or to other unknown reasons.

A key observation is that, qualitatively, instances where annotators assume the gender to be male seem far more numerous than instances where annotators assume the gender to be female. Quantifying this in the dataset is difficult, since it is hard to set apart images where the gender is indeterminate and the annotations are mere assumptions (predictions of a model trained on this dataset exhibit such behaviour, refer to Table 4.7). There are numerous instances where a person is barely visible in the frame but is labelled as male. Figure 3.3 gives instances of highly obscure people referred to as males. This could be due to the overall higher presence of males in the dataset, which leads annotators to find ‘males’ more likely, or due to instance-specific image-based bias.

The instances where an indeterminate person is labelled with gendered words stand against the idea of person-focused gender identification. It makes learning more challenging because a model trained on these images could:

  • learn that contexts drive the gender rather than the person in the image.

  • learn to predict a man for cases when it can’t clearly see the person in the image

3.1.2 Inconsistency in annotations for gendered images

In the previous section, we looked at images where the person in the image is not identifiable and confusion/bias tends to creep in. However, there are even more alarming instances of bias in the dataset: instances where the gender is obvious, yet some annotators label the person with the wrong gender, resulting in contradicting captions. Figure 3.4 shows two images where one can easily identify the person as a woman carrying a surfboard. However, two annotators identify the person as male. These instances demonstrate the severe gender insensitivity in the annotations.

(a) i)A woman in a wetsuit carries a surfboard and walks with a dog.
ii)A sexy lady in a wet suit walking with a dog on a beach.
iii)A man carries a surf board as a dog walks beside him.
iv)A man and his dog are walking away from the ocean with a surfboard.
v)a person is walking with a dog and a surfboard
(b) i)this lady is walking along the shore on a beach
ii)A man walking barefoot on the ground with stuff in his hands.
iii)A man prepares to surf with his para-sale
iv)A woman in blue and black wetsuit with windsurfing gear.
v)A woman carrying a kite board across pavement.
Figure 3.4: Instances from COCO where the annotations are inconsistent in gender even though the gender is obvious from the image. 2 annotators refer to the person as male though the person can be clearly inferred as female from the body and facial features.

3.1.3 Quantitative measure of inconsistent/biased gender annotations

Inconsistent/Ambiguous gender labels - Table 3.1 gives the counts of images in the MS COCO training data (2017 version) that have conflicting gender descriptions. By conflicting descriptions we mean images wherein some captions label the person as male, while some captions label it as female. The instances are chosen such that all captions describe a single person using exactly one word from a predefined list of male, female and gender-neutral words.

Male  Female  Neutral  Number of images
1     1       3        129
1     2       2        72
1     3       1        60
1     4       0        77
2     1       2        97
2     2       1        35
2     3       0        11
3     1       1        77
3     2       0        10
4     1       0        47
Table 3.1: Number of images with conflicting gender in captions, i.e. a person is identified as male and female by different annotators. Columns 1-3 give the number of captions per image using a male, female or gender-neutral word. Even though the number of images with males is much higher in the dataset, the number of images with inconsistent genders is comparable for both genders; in fact, more female images have inconsistency than male images (compare row 4 and row 10).
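The tally behind Table 3.1 can be sketched roughly as follows. This is only an illustration of the counting rule described above, not the authors' script: the word lists are small hypothetical stand-ins for the predefined lists, and the function names are invented for this example. The sample captions are those of Figure 3.1(a).

```python
from collections import Counter

# Hypothetical word lists; the predefined lists used for Table 3.1 are not given here.
MALE = {"man", "boy", "guy", "gentleman"}
FEMALE = {"woman", "girl", "lady", "gal"}
NEUTRAL = {"person", "child", "kid", "snowboarder", "skateboarder"}


def caption_label(caption):
    """Return 'male', 'female' or 'neutral' if the caption uses exactly one
    subject word from the lists above, otherwise None."""
    words = [w.strip(".,!?'\"").lower() for w in caption.split()]
    hits = [w for w in words if w in MALE | FEMALE | NEUTRAL]
    if len(hits) != 1:
        return None
    if hits[0] in MALE:
        return "male"
    if hits[0] in FEMALE:
        return "female"
    return "neutral"


def conflict_table(captions_by_image):
    """Count images per (male, female, neutral) caption split, keeping only
    images where every caption is usable and both genders appear."""
    table = Counter()
    for captions in captions_by_image.values():
        labels = [caption_label(c) for c in captions]
        if None in labels:
            continue  # some caption does not describe exactly one listed subject
        counts = Counter(labels)
        if counts["male"] > 0 and counts["female"] > 0:
            table[(counts["male"], counts["female"], counts["neutral"])] += 1
    return table


example = {
    "fig_3_1_a": [
        "A woman is snowboarding down a snowy hill",
        "a woman snowboards down the side of a snow covered hill",
        "A man is snowboarding down a snowy hill.",
        "A guy on a snowboard skiing down a mountain slop.",
        "a snowboarder riding the slopes amongst a skier",
    ],
}
table = conflict_table(example)
print(table)
```

Each key of the resulting counter corresponds to one row of Table 3.1, so the image above would fall into the row with 2 male, 2 female and 1 neutral caption.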

The COCO dataset has a much higher number of images of males (5000 with all captions male) than of females (3000 with all captions female), but Table 3.1 suggests that more of the inconsistent images are of females than of males. To understand this, first assume that if 4 of the 5 captions of an image agree on a gender, then that gender is most likely correct. Now, compare row 4 (1 male, 4 female, 0 neutral) with the last row (4 male, 1 female, 0 neutral). The former, with a count of 77 images, represents inconsistent images which are with high probability female, and the latter, with a count of 47 images, represents images which are with high probability male. Since the number of images of males in the entire dataset is much higher than that of females, the higher count of female images here is atypical. Hence, we infer that annotators tend to confuse a female for a male more often than the other way round. This could be due to the male dominance in the images, which sets a prior in the annotators’ minds. This is interesting because if such bias can influence human annotators, then models are also likely to be influenced by it.

(a) Plot for number of images vs number of captions with gendered singular words. Gender inconsistent images are not considered.
(b) Contingency table for chi-squared test of singular gender annotation and singular gender neutral annotation. Gives a chi-squared value of 76.36 with 1 degree of freedom, which corresponds to a p-value well below 0.001.
Figure 3.5: Comparison in the usage of gender-neutral singular words with male and female words. Here we do not consider images with conflicting genders (discussed previously). In Fig. 3.5(a), the solid line with dots shows the number of images with x singular male gendered captions and (5-x) singular gender neutral captions. The dashed line depicts the same for female gendered captions. The contingency table on the right indicates that males have a higher chance than females of being referred to by gender neutral singular words like ‘person’.

Gender-neutral vs Gender labels - In Figure 3.5(a), we look at the patterns in the distribution of gendered captions and gender-neutral captions for males and females. For the plot of males (and likewise for females), we look at counts of images in train2017 where all captions talk of a single person using a word from the list of male-gendered subjects (female-gendered subjects) and gender-neutral subjects. Two key observations follow:

  1. The number of images increases with the number of gendered captions, indicating that there are more images with identifiable gender than with indeterminate gender, and that more people tend to use a gendered word than a gender-neutral one.

  2. The instances with 4 gendered captions and 1 neutral caption can be considered to have the corresponding gendered label as ground truth, with 1 annotator choosing to use a gender-neutral reference. We check whether the probability with which the annotators use a gender-neutral word is equal for male and female images. The contingency table in Fig. 3.5(b) is used for a chi-squared independence test, giving a p-value well below 0.001 (chi-squared value of 76.36 with 1 degree of freedom), which indicates that the two are not independent. Annotators tend to use gender neutral terms for male images more commonly than for female images. Put another way, annotators tend to use female gendered words more explicitly if they see a female in the image.
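As a rough illustration of the independence test in item 2, the sketch below computes a Pearson chi-squared statistic for a 2x2 contingency table and converts it to a p-value. The counts in `example_table` are made up, because the cell counts of the contingency table are not reproduced in the text, and the sketch assumes no continuity correction, which may differ from the authors' setup. For 1 degree of freedom the chi-squared survival function reduces to `erfc(sqrt(x / 2))`, so only the standard library is needed.

```python
import math


def p_value_1dof(stat):
    """Survival function of the chi-squared distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def chi2_2x2(table):
    """Pearson chi-squared statistic (no continuity correction) and p-value
    for a 2x2 contingency table given as [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, p_value_1dof(stat)


# Made-up counts, for illustration only.
# Rows: male images, female images. Columns: neutral word used, gendered word used.
example_table = [[30, 10], [20, 40]]
stat, p = chi2_2x2(example_table)
print(f"chi2 = {stat:.2f}, p = {p:.2g}")

# p-values implied by the statistics reported in Figures 3.5(b) and 3.6(b).
print(p_value_1dof(76.36))
print(p_value_1dof(3.42))
```

The last two lines only convert the reported statistics into p-values; they do not reproduce the underlying tables.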

(a) Plot for number of images vs number of captions with gendered plural words. Gender inconsistent images are not considered.
(b) Contingency table for chi-squared test of plural gender annotation and plural gender neutral annotation. Gives a value of 3.42, 1 degree of freedom and p value 0.06
Figure 3.6: Comparison in the usage of gender-neutral plural words with male and female plural words. In Fig. 3.6(a), the solid line with dots shows the number of images with x plural male gendered captions and (5-x) gender neutral captions. The dashed line depicts the same for female gendered captions. The contingency table on the right suggests, though not at the 0.05 significance level, that males have a higher chance than females of being referred to by gender neutral plural words like ‘people’.

3.1.4 Bias in captions mentioning more than 1 person

There are 6282 captions which contain both a male-gendered singular word and a female-gendered singular word. Out of these, 3153 (close to 50%) have the phrase ‘(male-word) and (female-word)’ or ‘(male-word) and a (female-word)’, and 385 have ‘(female-word) and (male-word)’ or ‘(female-word) and a (male-word)’. So for the images with 1 male and 1 female, of the captions assign different activities to the two persons,
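The phrase counts above could be obtained with a simple pattern match along the following lines. This is a minimal sketch: the word lists are illustrative stand-ins for the lists of gendered singular words, and `phrase_order` is a name invented for this example.

```python
import re

# Illustrative word lists, not the lists actually used for the counts above.
MALE = ["man", "boy", "guy", "gentleman"]
FEMALE = ["woman", "girl", "lady", "gal"]

m = "|".join(MALE)
f = "|".join(FEMALE)
# '(male-word) and (female-word)' or '(male-word) and a (female-word)', and the reverse order.
MALE_FIRST = re.compile(rf"\b({m}) and (a )?({f})\b")
FEMALE_FIRST = re.compile(rf"\b({f}) and (a )?({m})\b")


def phrase_order(caption):
    """Classify a caption by which conjunction phrase, if any, it contains."""
    text = caption.lower()
    if MALE_FIRST.search(text):
        return "male_first"
    if FEMALE_FIRST.search(text):
        return "female_first"
    return "other"


print(phrase_order("a man and a woman on skis standing on a snowy slope"))
print(phrase_order("A woman and a man walking a dog"))
print(phrase_order("a man on a motorcycle with a woman on the back"))

# Shares implied by the counts reported in the text.
male_first_share = 3153 / 6282
female_first_share = 385 / 6282
print(f"{male_first_share:.1%} male-first, {female_first_share:.1%} female-first")
```

The word boundaries in the patterns keep ‘man’ from matching inside ‘woman’, which matters for the female-first case.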

In Figure 3.6(a), we plot the number of images against the number of captions having gendered plural words (for example men, boys for males; women, girls for females), with the rest using gender neutral plural words (people, persons etc.). The plots are mostly similar for males and females, and the key difference is in the values for 4 and 5 gendered captions. Fig. 3.6(b) gives the contingency table for the chi-squared test, which yields a p-value of 0.06, suggesting that the usage of gender-neutral words does not differ significantly when many males or females are present in the image.

3.2 Issues with Trained Models

A model trained on biased data learns the bias and gives biased predictions. [zhao2017men] give an analysis of the bias amplification by a model when trained with biased data. They show that the coupling of gender with activity/object is not only learned but also amplified by the model. In this section, we give further analysis of the gender bias present in the predictions of a model trained on the MS COCO image captioning dataset.

3.2.1 Biased inference of gender from context

Gendered images - Figure 3.7 gives some instances where the bias from the dataset comes into play while predicting the gender in images. Each of the 3 images has sufficient cues to enable the person to be recognized as a female. However, since the activities/objects, namely riding a bike, riding a boat, playing soccer have a bias for male subjects in the training dataset, the model predicts a ‘male’ subject word, while for the context of a ‘kitchen’ the model predicts a ‘female’ subject word.

(a) a man riding a bike with a dog on the back
(b) a man is standing on a boat in the water
(c) a group of young men playing a game of soccer
(d) a woman standing in a kitchen with a dog
(e) a man riding a skateboard on a ramp
Figure 3.7: Predicted captions by the pre-trained model from [aneja2018convolutional]. All the images have persons with obvious genders but the model predicts the other gender owing to the learned bias

Gender-Indeterminate images - Since the training data does not always contain gender-neutral words for images where the gender is unidentifiable, the model also tends to avoid gender-neutral words. In Figure 3.8, none of the 3 images has sufficient cues to confidently suggest a gender: the person riding the bike, the person riding the horse and the little kid in the garden could each be either male or female. But the model predicts gendered subject words.

(a) a man riding a motorcycle with a dog on the back
(b) a man riding a horse on a dirt road
(c) a little girl holding a purple flower flower in a garden
Figure 3.8: Predicted captions by Convolutional image captioning model where gender is not recognizable from the images but model predicts a gendered word. We also observed this in the ground truth annotations, which partially explains such behaviour by the model.

3.2.2 Phrase bias - ‘man and woman’ occurrence increased

The pre-trained Convolutional image captioning model [aneja2018convolutional], tested on COCO val2017, gives 95 instances wherein both a male singular subject word and a female singular subject word are predicted in a caption. However, a staggering 89 of those (around 93%) have the phrase ‘male word’ and (a) ‘female word’. We observed in Section 3.1.4 that in the training data, out of the instances wherein both male and female singular subject words occur, approximately 50% had the phrase ‘male word’ and (a) ‘female word’. This is another instance of bias amplification (from 50% to 93%). In several instances where 2 people can be seen, the model annotates them as ‘man and woman’. Figure 3.9 gives 3 such instances. The fact that the two genders are almost never associated with different predicates in a predicted caption is alarming. However, this strongly corroborates our idea of independent image captioning and gender identification. In other words, in predictions with 1 male and 1 female, the model in [aneja2018convolutional] mostly does not provide unique information by differentiating between the genders, and hence our approach of dropping the gender during caption prediction would not affect the quality. In future work, we hope to get more informative captions with different activities per subject, use attention maps to localize the subject, and thereafter obtain the gender.

(a) a man and a woman on skis standing on a snowy slope
(b) a man and a woman are standing next to a truck
(c) a man and a woman are standing next to a table
Figure 3.9: Phrase amplification - Predicted captions by Convolutional image captioning model containing the phrase ‘a man and a woman’ when the image need not have a male and a female

Another interesting point is that the remaining 6 predictions (where ‘male word’ and ‘female word’ occur separately in the caption) also show strong bias influence. 3 of them are almost identical to ‘a man on a motorcycle with a woman on the back’. Figure 3.10 shows 3 instances. In Fig. 3.10(a), the prediction says ‘woman holding an umbrella’ though the man is holding the umbrella (possibly because more women hold umbrellas in the training data). In Fig. 3.10(b), a woman is on the front seat while a man is sitting at the back, but the prediction is the other way round. In Fig. 3.10(c), the caption conveys no unique information for the 2 genders.

(a) a woman holding a umbrella and a man in a city
(b) a man on a motorcycle with a woman on the back
(c) a man in a kitchen with a woman in a kitchen
Figure 3.10: Predicted captions by Convolutional image captioning model with both male and female subject words. Though these captions do not contain the phrase ‘man and woman’, they either contain incorrect labeling or repeat the same predicate for the man and the woman. These instances also suggest that concealing the gender in the predicted captions could result in better captions, but may result in worse Bleu scores.

3.2.3 Gender influencing the other words in the caption

The captions are generated word-by-word in the model with each word conditioned on the previous words. Since gender is the first word to be predicted in the captions, it contributes to the prediction of all the other words in the sentence. This bias could be helpful if we are indeed looking for predictions where the subject and the predicate linkage is essential. However, such couplings can also result in predictions where the image is overlooked to accommodate the bias. Figure 3.11 gives 3 images where the model wrongly predicts ‘a man in a suit and tie’, without any visible sign of a suit or a tie.

(a) a man in a suit and tie standing in a bathroom
(b) a man in a suit and tie looking at a laptop
(c) a man in a suit and tie looking at a cell phone
Figure 3.11: Predicted captions by Convolutional image captioning model with ‘man in a suit and tie’, though none of the images depict a person wearing a suit or a tie.

This issue can be understood through the label bias problem [lafferty2001conditional], which stems from local normalization. The high probability (433 occurrences of ‘a man in a suit’ out of 2658 occurrences of ‘a man in a’ in training data) assigned to ‘a man in a suit’ causes the model to predict ‘suit’ while overlooking the image. However, for females, the prediction does not say suit, even when there is one. The label bias can be reduced, or rather made consistent across males and females, by making the data gender neutral before training the model. To solve the problem completely, the model architecture and/or loss formulation will also need alteration.
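The prior that drives this behaviour can be estimated directly from caption text. The sketch below counts which word follows a given prefix; it is a simplified illustration on a made-up toy corpus with whitespace tokenization, not the procedure used to obtain the counts quoted above, and `next_word_distribution` is a name invented for this example.

```python
from collections import Counter


def next_word_distribution(captions, prefix):
    """Relative frequency of the word that immediately follows `prefix`."""
    prefix_tokens = prefix.lower().split()
    k = len(prefix_tokens)
    counts = Counter()
    for caption in captions:
        tokens = caption.lower().split()
        # Stop early enough that a following word always exists.
        for i in range(len(tokens) - k):
            if tokens[i:i + k] == prefix_tokens:
                counts[tokens[i + k]] += 1
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()} if total else {}


# Toy corpus, for illustration only.
toy_captions = [
    "a man in a suit and tie",
    "a man in a suit",
    "a man in a kitchen",
    "a woman in a kitchen",
]
dist = next_word_distribution(toy_captions, "a man in a")
print(dist)

# Conditional frequency implied by the counts quoted in the text.
p_suit = 433 / 2658
print(f"P(suit | a man in a) is roughly {p_suit:.3f}")
```

A locally normalized decoder that has already emitted ‘a man in a’ starts from a prior of this kind, which is why the image evidence has to be strong to override ‘suit’.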

4.1 Training and Evaluation Datasets

This section discusses the details of the datasets used for training and testing our two sub-tasks. It also includes specifications about the anti-stereotypical dataset, specially designed for evaluating the effect of bias.

Gender Neutral Image Captioning - In order to train a gender neutral image captioning model, we modify the COCO dataset to remove the notion of gender from the captions. We replace all occurrences of male and female gendered words by their gender-neutral counterparts. We differentiate between the ages by using different replacements for man/woman and boy/girl. The following replacements are made to achieve gender neutral settings:
woman/man/lady/gentleman (etc) → person
boy/girl/guy/gal → youngster
Plural form of woman/man/lady/gentleman (etc) → people
Plural form of boy/girl/guy/gal → youngsters
Gendered pronouns (his/her/hers/him/himself/herself) → it/itself/its
Note that for all our experiments, we use the train/val/test splits described in the work [karpathy2015deep], also used for the CNN model [aneja2018convolutional]. It contains 113287 training images, 5000 images for validation, and 5000 for testing.
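The replacement rules listed above can be sketched as a small word-level substitution. This is an illustration rather than the preprocessing actually used: the word list is not exhaustive (the "etc" entries are left out), and the assignment of each pronoun to ‘it’, ‘its’ or ‘itself’ is an assumption, since only the target set is stated.

```python
import re

# Mapping follows the replacement rules above; pronoun targets are an assumption.
REPLACEMENTS = {
    "man": "person", "woman": "person", "lady": "person", "gentleman": "person",
    "boy": "youngster", "girl": "youngster", "guy": "youngster", "gal": "youngster",
    "men": "people", "women": "people", "ladies": "people", "gentlemen": "people",
    "boys": "youngsters", "girls": "youngsters", "guys": "youngsters", "gals": "youngsters",
    "his": "its", "her": "its", "hers": "its", "him": "it",
    "himself": "itself", "herself": "itself",
}

# Whole-word match so that 'man' does not fire inside 'woman' or 'human'.
PATTERN = re.compile(
    r"\b(" + "|".join(sorted(REPLACEMENTS, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def neutralize(caption):
    """Replace gendered words with gender-neutral counterparts (lowercased)."""
    return PATTERN.sub(lambda match: REPLACEMENTS[match.group(0).lower()], caption)


print(neutralize("a man and a woman are standing next to a truck"))
print(neutralize("A man doing stunts on his motorcycle at a show."))
print(neutralize("a group of young men playing a game of soccer"))
```

The same function can be applied to training captions, ground truth captions and model predictions, so that all three are neutralized consistently.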

Gender Identification - To teach the model to correctly refrain from predicting gender in images where the gender cannot be inferred, we use 3 classes in our gender classifier - male, female and indeterminate. There is no dataset with full-scene images that specifically labels the person’s gender, and the existing image captioning datasets have the issue of incorrect ground truth annotations (as discussed before). So, for our task, we build a dataset with almost sure labels from MS COCO. Using the same splits as for the gender-neutral captioning, we select the images where all annotations talk of only 1 person. We subsample the images where at most 1 gender is mentioned (either only male or only female across captions) and where most captions agree on the gender. Obtaining images for the ‘gender-indeterminate’ class is more difficult since the usage of gender-neutral words could be the annotators’ personal choice even for gendered images. For this class, we restrict ourselves to the images where at least 4 annotators use gender-indeterminate subject words. From the image, we pick the largest annotated bounding box (by area) among all person-labeled bounding boxes. We carry out gender classification using 2 different levels of granularity - bounding box and person mask. We get the image crops for bounding box and person mask annotations from the COCO ground truth annotations. After this procedure, we obtain 14620 male images, 7243 female images and 2819 gender-indeterminate images from the training split. The validation split has 649, 339 and 121 images, and the test split has 600, 312 and 134 images, for the male, female and indeterminate classes respectively.
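The selection rules in this paragraph might look roughly like the sketch below. Two points are assumptions: "most captions agree" is read as a strict majority, and an image with at least 4 neutral captions is assigned to the indeterminate class even if the remaining caption is gendered. The annotation dictionaries follow the COCO convention of `bbox = [x, y, width, height]` with person as category id 1; the function names are invented for this example.

```python
def gender_label(n_male, n_female, n_neutral):
    """Assign a class from per-image caption counts, or None to skip the image."""
    if n_male > 0 and n_female > 0:
        return None  # conflicting genders across captions
    if n_neutral >= 4:
        return "indeterminate"
    total = n_male + n_female + n_neutral
    if n_male > total / 2:
        return "male"
    if n_female > total / 2:
        return "female"
    return None  # no clear majority for either gender


def largest_person_box(annotations, person_category_id=1):
    """Return the person bounding box with the largest area, or None."""
    boxes = [a["bbox"] for a in annotations if a["category_id"] == person_category_id]
    return max(boxes, key=lambda box: box[2] * box[3], default=None)


# Toy annotations in COCO style, for illustration only.
annotations = [
    {"category_id": 1, "bbox": [10.0, 20.0, 50.0, 100.0]},
    {"category_id": 1, "bbox": [200.0, 40.0, 80.0, 160.0]},
    {"category_id": 18, "bbox": [0.0, 0.0, 300.0, 300.0]},  # not a person
]
print(gender_label(4, 0, 1))
print(gender_label(0, 1, 4))
print(largest_person_box(annotations))
```

The bounding-box crop (or the person mask inside it) would then be the input to the 3-class classifier.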

Evaluation Data for the Overall Task - Unbiased gender prediction is desirable, but it would not necessarily result in increased Bleu scores on the test data. The validation and test data of MS COCO were collected in the same way as the training data. Hence, they tend to possess similar flaws in ground truth annotations and similarly biased gender-context distributions. Removing the bias could adversely affect the performance on these. To effectively test our approach, we form a specialized anti-stereotypical dataset called the unusual dataset.

Unusual/Hard Dataset - Instead of focusing on the person for gender prediction and the context for activity, the existing models tend to use priors and learned bias. To evaluate our approach, we needed a dataset with instances that do not follow the statistical norm of the training data. We create such a dataset by first computing the gender bias [zhao2017men] in the training data for each word. Male bias for a word is computed as:

$$\mathrm{bias}(\text{word}, \text{male}) = \frac{c(\text{word}, \text{male})}{c(\text{word}, \text{male}) + c(\text{word}, \text{female})}$$

(a) girl-baseball
(b) woman-suit
(c) girl-skateboard
Figure 4.1: Instances from the hard dataset formed by finding anti-stereotypical gender-context pairs in the test captions. The gender-biased pairs are extracted from the training data captions using the bias definition in [zhao2017men]

where c(word, male) refers to the co-occurrence counts (number of captions) of the word with male subject words, and c(word, female) refers to the counts with female subject words. We obtain 2 sets of words: one with the highest male bias and the other with the highest female bias. We then filter images from the test data where the ground truth caption contains a male gender word with any of the female biased words, or a female gender word with any of the male biased words. Examples of such pairs are female-snowboard, male-knitting, and male-heels. This gives us the ‘unusual dataset’ containing 69 instances of males (with female biased contexts) and 113 female instances (with male biased contexts). Fig. 4.1 shows instances of females from the hard dataset.
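A minimal sketch of this filtering step is given below, assuming the ratio form of the bias from [zhao2017men], i.e. c(word, male) divided by the sum of c(word, male) and c(word, female). The word lists, the toy captions, the `min_count` threshold and the function names are all illustrative assumptions; how many top-biased words are kept is not stated in the text.

```python
from collections import Counter

# Illustrative subject-word lists, not the lists used in the experiments.
MALE = {"man", "boy", "guy", "gentleman"}
FEMALE = {"woman", "girl", "lady", "gal"}


def male_bias_by_word(captions, min_count=1):
    """Per-word male bias: c(word, male) / (c(word, male) + c(word, female))."""
    c_male, c_female = Counter(), Counter()
    for caption in captions:
        tokens = set(caption.lower().split())
        has_male, has_female = bool(tokens & MALE), bool(tokens & FEMALE)
        if has_male == has_female:
            continue  # skip captions with no gender word or with both genders
        target = c_male if has_male else c_female
        for word in tokens - MALE - FEMALE:
            target[word] += 1
    bias = {}
    for word in set(c_male) | set(c_female):
        total = c_male[word] + c_female[word]
        if total >= min_count:
            bias[word] = c_male[word] / total
    return bias


def most_biased(bias, k):
    """Return (male-biased words, female-biased words), k of each."""
    ranked = sorted(bias, key=bias.get)
    return set(ranked[-k:]), set(ranked[:k])


def is_unusual(caption, male_biased, female_biased):
    """True if the caption pairs one gender with a word biased towards the other."""
    tokens = set(caption.lower().split())
    has_male, has_female = bool(tokens & MALE), bool(tokens & FEMALE)
    male_in_female_context = has_male and not has_female and bool(tokens & female_biased)
    female_in_male_context = has_female and not has_male and bool(tokens & male_biased)
    return male_in_female_context or female_in_male_context


# Toy training captions, for illustration only.
train = [
    "a man riding a snowboard",
    "a man on a snowboard",
    "a woman knitting a scarf",
    "a woman on a snowboard",
]
bias = male_bias_by_word(train)
print(bias["snowboard"], bias["knitting"])
print(is_unusual("a woman riding a snowboard", {"snowboard"}, {"knitting"}))
print(is_unusual("a man riding a snowboard", {"snowboard"}, {"knitting"}))
```

In practice a minimum co-occurrence count would likely be needed so that rare words with extreme ratios do not dominate the two word sets.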

4.2 Experiments and Results

4.2.1 Gender-neutral image captioning

To train a model for image captioning that is unaware of gender, we make the training and test datasets gender-neutral. Male and female words are replaced by their gender-neutral counterparts. We call this model Gender neutral CNN (GN-CNN). We compare against the pre-trained CNN model [aneja2018convolutional], which we call CNN. For a fair comparison, we neutralize the predictions of CNN in the same way as the training dataset’s captions and compare them with similarly neutralized ground truth captions; we call this CNN-GN (predictions of the CNN post-processed to make them gender neutral). We train the model with 5 reference captions per image, and a beam search of size 1 is used for predictions. The model and training details are exactly the same as in [aneja2018convolutional], with only the dataset modified. When evaluating on data with a similar bias, one would expect performance to drop due to the loss of the statistical support provided by the bias. However, we observe similar performance scores for both models, summarised in Table 4.1.

Model Bleu1 Bleu2 Bleu3 Bleu4 METEOR ROUGE_L CIDEr SPICE
GN-CNN 0.714 0.544 0.401 0.293 0.246 0.528 0.915 0.178
CNN-GN 0.715 0.546 0.402 0.293 0.247 0.526 0.910 0.180
Table 4.1: Performance summary for comparing gender neutral predictions with gender neutral ground truth captions. GN-CNN is our model which is trained on gender neutral captions, while CNN-GN is the pre-trained model from [aneja2018convolutional] with post processing of predictions to make them gender neutral

The comparable performance of the two models suggests that training on gender-unaware captions has a mixed influence overall. We hypothesize that the following gains and losses could be the reason:

  • Gain - Our model does not possess gender-context bias. There is no gender prediction to influence the remaining words in the caption and hence the image can contribute better to the captions.

  • Gain - Gender specific label bias is removed due to an apparent merge of those states into a gender-neutral one.

  • Loss - The test data comes from the same distribution as the training data and possesses a similar annotation bias, so the gender bias could help in getting captions (activity) closer to the ground truth. For example, the high occurrence of ‘a man in a suit and tie’ makes the phrase easy to predict at test time once ‘a man in a’ is predicted. In the absence of gender bias, the model cannot predict the phrase with as high a probability and needs to learn to infer it from the image.

  • Loss - Predictions from CNN-GN undergo exactly the same gender neutralization as the ground truth captions. For instance, ‘a man and woman’ is changed to ‘a person and person’, while our gender-unaware model could learn to predict ‘two people’ instead, since all occurrences of two males and two females are converted to two people. Conversely, our model could predict ‘a person and a person’ even for two males or two females, which would have similar repercussions.

To investigate further, we look at human and non-human instances separately. Table 4.2 summarises the performance on the single human dataset, which contains the 1942 images whose captions refer to one human (singular gendered and gender-neutral references). Table 4.3 summarises the performance on the nature dataset, which contains the images whose captions have no human reference. The CIDEr score of our model is slightly higher than that of the CNN model on both datasets (0.860 vs. 0.853 and 0.918 vs. 0.904), and the other metrics are nearly equal. It is surprising that the scores also differ on the nature dataset, given that the only difference between the two models is the use of gendered references for CNN and gender-neutral ones for GN-CNN during training. Figs. 4.2, 4.3, 4.4 and 4.5 also contain some non-human instances with a stark difference between the captions predicted by the two models.

Model Bleu1 Bleu2 Bleu3 Bleu4 METEOR ROUGE_L CIDEr SPICE
GN-CNN 0.731 0.562 0.418 0.311 0.259 0.544 0.860 0.183
CNN-GN 0.728 0.559 0.415 0.307 0.258 0.539 0.853 0.184
Table 4.2: Performance summary for comparing GN-CNN and CNN-GN on single human dataset containing 1942 instances
Model Bleu1 Bleu2 Bleu3 Bleu4 METEOR ROUGE_L CIDEr SPICE
GN-CNN 0.710 0.539 0.393 0.282 0.239 0.523 0.918 0.174
CNN-GN 0.709 0.540 0.393 0.281 0.239 0.521 0.904 0.175
Table 4.3: Performance summary for comparing GN-CNN and CNN-GN on nature dataset containing 2441 instances. CIDEr score of our model is slightly higher, while other metrics are almost identical.

The evaluation on the hard dataset that we designed with anti-stereotypical instances is summarised in Table 4.4. Our model improves on all metrics, particularly for the female images. This indicates that the gender-context coupling harms the overall caption quality in anti-stereotypical cases, and corroborates that decoupling the two tasks improves not only gender correctness but also caption quality.

Model Data Bleu1 Bleu2 Bleu3 Bleu4 METEOR Rouge_L CIDEr SPICE
Ours F .759 .604 .460 .346 .275 .578 .913 .196
CNN-GN F .722 .567 .424 .316 .268 .556 .853 .178
Ours M .743 .583 .438 .321 .268 .556 .862 .173
CNN-GN M .723 .559 .417 .301 .252 .532 .733 .168
Table 4.4: Performance summary for our gender-neutral image captioning model and CNN-GN on the unusual dataset with gender-neutralised ground truth captions. F refers to the instances with females (113) and M refers to the instances with males (69).

Issue with the metric - We use the conventional metrics to evaluate caption quality and how it changes when the bias is removed. To analyze what specifically changes in the predictions, we pick the instances from the complete test data for which a metric differs the most between the two models (either model could score higher). The instances with the largest differences in Bleu1 scores are presented in Fig. 4.2, Bleu2 in Fig. 4.3, Bleu3 in Fig. 4.4 and Bleu4 in Fig. 4.5, showing the gender-neutral captions predicted by our model and the gender-neutralised captions of CNN. The examples demonstrate that for most of the instances with maximum score change, it is difficult to guess which caption received the higher score, which shows how ambiguous these metrics are. They are therefore not well suited to analyzing bias, and better metrics for this purpose would be helpful.
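The selection of these instances can be sketched as a ranking by absolute per-image score difference. The function name, the image ids and the dictionary input format are hypothetical; the scores in the example are the Bleu1 values shown in Fig. 4.2.

```python
def largest_score_gaps(scores_a, scores_b, k=3):
    """Return the k image ids whose scores differ most between two models.

    scores_a, scores_b: dicts mapping image id -> per-image metric score.
    Either model may have the higher score, so the absolute gap is used.
    """
    common = sorted(scores_a.keys() & scores_b.keys())
    return sorted(common,
                  key=lambda i: abs(scores_a[i] - scores_b[i]),
                  reverse=True)[:k]


# Hypothetical ids; Bleu1 values of the three instances in Fig. 4.2.
ours = {"img_a": 0.136, "img_b": 0.207, "img_c": 0.164}
theirs = {"img_a": 0.295, "img_b": 0.056, "img_c": 0.018}
top = largest_score_gaps(ours, theirs, k=2)
```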

4.2.2 Gender identification

We train a ResNet50 [he2016deep] model to classify an image into 3 classes - male, female, person. Since the dataset is unbalanced, we use a weighted cross-entropy loss with weights 1, 5, 3 for the male, person, and female categories respectively. For our classifiers, all crops for bounding boxes and body masks are obtained from the COCO ground truth annotations.
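The effect of the class weights can be illustrated with a plain-Python version of the loss. This is a sketch rather than the training code: the weights are those given above, while the normalization by the summed weights (a weighted mean, as some deep learning frameworks do by default) and the input format are assumptions.

```python
import math

# Class weights from the text: male 1, person 5, female 3.
CLASS_WEIGHTS = {"male": 1.0, "person": 5.0, "female": 3.0}


def weighted_cross_entropy(probs, labels, weights=CLASS_WEIGHTS):
    """Weighted cross-entropy over a batch.

    probs:  list of dicts mapping class name -> predicted probability.
    labels: list of ground truth class names.
    Dividing by the summed weights is an assumption (weighted mean).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        weighted_sum += -w * math.log(p[y])
        weight_total += w
    return weighted_sum / weight_total


# The same poor prediction (probability 0.1 on the true class) contributes
# five times more to the summed loss for 'person' than for 'male'.
poor = {"male": 0.1, "person": 0.1, "female": 0.8}
good = {"male": 0.9, "person": 0.9, "female": 0.9}
loss_person_wrong = weighted_cross_entropy([poor, good], ["person", "male"])
loss_male_wrong = weighted_cross_entropy([poor, good], ["male", "person"])
```

With these weights, errors on the under-represented person and female classes are penalized more heavily than errors on the male class.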

Classifier Male Acc. Person Acc. Female Acc. Overall Acc.
Bounding box 85.6 47.3 83.3 80.05
Body mask 86.0 42.9 81.0 79.57
CNN 85.5 44.1 66.3 69.61
Table 4.5: Gender identification accuracy on the test set. We compare our bounding box-based and body mask-based classifiers with the pre-trained model of [aneja2018convolutional], referred to as CNN. The accuracy for CNN is computed by extracting the gender (or gender-neutral) word from the captions predicted for the full image.
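Bounding box-based classifier:
Male Person Female
Male 513 49 37
Person 48 63 22
Female 36 16 259
Body mask-based classifier:
Male Person Female
Male 515 48 36
Person 51 57 25
Female 46 13 252
Table 4.6: Confusion matrices for the bounding box-based (first) and body mask-based (second) classifiers on the test data. Rows are ground truth classes and columns are predicted classes.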
Male Person Female
Male 482 37(19) 26
Person 55 50(2) 11
Female 74 14(12) 197
Table 4.7: Confusion matrix based on captions provided by convolutional image captioning model on the test set created for gender classification. The numbers in brackets indicate the count of usage of ‘people’. Some captions (35, 16, 14 for male, person and female respectively) have no human subject words and hence the rows do not add up to the total number of images in the set.

Table 4.5 summarizes the class-wise and overall accuracy of the two classifiers on the test set, which contains 600 male, 312 female and 134 gender-neutral instances. We also compare with the CNN model of [aneja2018convolutional] by extracting the gender from its captions. Table 4.6 gives the confusion matrices for our gender classifiers, and Table 4.7 gives the confusion matrix for the CNN model. The improvement is highest for females (from 66% to 83%). The performance on the gender-neutral class is poor (below 50% for all three models). This could be due to the smaller number of training instances; the ground truth label could also be wrong in some instances. The performance of the model could be improved by adding data with reliable ground truth labels. The current datasets are too small to capture the high intra-class variability in poses, angles, etc. and the subtle inter-class differences.
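The relation between the confusion matrices and the accuracies can be checked with a few lines of code. The sketch below uses the bounding box-based matrix from Table 4.6 and assumes that rows are ground truth classes and columns are predictions, an interpretation that reproduces the bounding box row of Table 4.5 up to rounding. The function names are illustrative.

```python
CLASSES = ["male", "person", "female"]

# Bounding box-based confusion matrix from Table 4.6.
# Assumed layout: rows = ground truth, columns = prediction.
BBOX = [
    [513, 49, 37],
    [48, 63, 22],
    [36, 16, 259],
]


def per_class_accuracy(cm):
    """Fraction of each ground truth class that is predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(cm)]


def overall_accuracy(cm):
    """Fraction of all instances on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


per_class = per_class_accuracy(BBOX)   # order follows CLASSES
overall = overall_accuracy(BBOX)
```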

4.2.3 Gendered image captioning - combining the two

Once we obtain the gender-neutral captions and the gender labels, the gendered captions are generated by merging the two results. Any criterion could be used to inject the gender back into the captions. For the merge, we apply the following rules:

  • For each caption mentioning a single person, we extract the largest bounding box and obtain the gender label for it. Each occurrence of ‘person’ is changed to ‘man’ or ‘woman’, or left unchanged, depending on the predicted gender class, while each occurrence of ‘youngster’ is changed to ‘boy’, ‘girl’ or ‘child’ (since ‘youngster’ is not as common in the training data).

  • For captions mentioning two people, we extract the two largest person boxes, and get gender labels for both. Now the substitutions are made depending on the classes for both the labels. We substitute the phrase ‘two people/youngsters’ with ‘a genderlabel1 and a genderlabel2’.

  • For captions mentioning more than 2 people through phrases like ‘group of people’, we extract at most the 6 largest bounding boxes and get a gender label for each. If all belong to the same gender, we substitute the plural form of that gender, while if there is a mixture of gender labels, we leave the caption as is. The word ‘youngsters’ is changed to the more common word ‘children’.
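The rules above can be sketched as a small substitution function. This is an illustrative reading of the rules, not the exact implementation: the function name and the input format (a list of predicted classes, ordered from the largest person box to the smallest) are hypothetical. The sketch also makes two assumptions that the rules leave open: every singular mention receives the label of the largest box, and ‘youngsters’ always becomes ‘children’.

```python
# Substitutions for singular mentions, keyed by predicted class.
SINGULAR = {
    "person": {"male": "man", "female": "woman", "person": "person"},
    "youngster": {"male": "boy", "female": "girl", "person": "child"},
}
PLURAL = {"male": "men", "female": "women"}


def inject_gender(caption, labels):
    """Insert gender words into a gender-neutral caption.

    labels: predicted classes ('male', 'female', 'person') for the person
    boxes, sorted from largest to smallest box (hypothetical input format).
    """
    tokens = caption.split()

    # Rule 2: 'two people/youngsters' -> 'a <label1> and a <label2>'.
    for noun, singular in (("people", "person"), ("youngsters", "youngster")):
        phrase = "two " + noun
        if phrase in caption and len(labels) >= 2:
            first = SINGULAR[singular][labels[0]]
            second = SINGULAR[singular][labels[1]]
            return caption.replace(phrase, f"a {first} and a {second}")

    # Rule 3: groups, using at most the 6 largest boxes.
    if "people" in tokens or "youngsters" in tokens:
        group = labels[:6]
        same = len(set(group)) == 1 and group[0] in PLURAL
        out = []
        for tok in tokens:
            if tok == "people" and same:
                out.append(PLURAL[group[0]])
            elif tok == "youngsters":
                out.append("children")
            else:
                out.append(tok)
        return " ".join(out)

    # Rule 1: singular mentions take the label of the largest box.
    if not labels:
        return caption
    return " ".join(SINGULAR[t][labels[0]] if t in SINGULAR else t
                    for t in tokens)


example = inject_gender("a person riding a horse", ["female"])
```

For instance, ‘two people are sitting on a bench’ with labels male and female becomes ‘a man and a woman are sitting on a bench’, while a ‘group of people’ with mixed labels is left unchanged.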

The final scores on the hard dataset (Table 4.9) are higher for our model than for the CNN model on all metrics. The performance after substitution on the entire test data is summarized in Table 4.8. There the scores are similar, with our model scoring slightly lower. Since our model outperforms the CNN model on the single human and nature datasets, we believe that the images with multiple persons are pulling the scores down. Extracting the corresponding persons from the image more accurately and making better substitutions could improve the scores. In Fig. 4.6, we compare some of the final captions with the CNN predictions.

Model Bleu1 Bleu2 Bleu3 Bleu4 METEOR ROUGE_L CIDEr
our model .706 .531 .388 .280 .241 .518 .893
CNN .710 .538 .394 .286 .243 .521 .902
Table 4.8: Performance summary for gendered captions on the entire test dataset
Model Data Bleu1 Bleu2 Bleu3 Bleu4 METEOR Rouge_L CIDEr SPICE
Ours F .731 .553 .401 .288 .247 .543 .842 .166
CNN F .686 .508 .362 .261 .241 .512 .768 .148
Ours M .724 .557 .409 .292 .256 .535 .827 .157
CNN M .714 .540 .391 .277 .245 .522 .705 .156
Table 4.9: Performance summary on the unusual dataset for our gendered captions (after substitution) and the CNN model. F refers to the instances with females (113) and M refers to the instances with males (69).
(a) our: a snow covered road with a ski lift(0.136)
their: a snow covered hill with a snow covered trees and a snow covered trees(0.295)
(b) our: a person riding a horse and a person on a horse(0.207)
their: two people are sitting on a horse(0.056)
(c) our: a person is holding a cup of coffee and a person in a(0.164)
their: two people are eating a meal(0.018)
Figure 4.2: Comparing our captions with the gender-neutralised captions of [aneja2018convolutional]. These are the top 3 instances whose Bleu1 scores differ the most between the 2 models. The scores are higher for the caption that is comparatively less meaningful. The repeated occurrence of correct words, though it results in an ill-formed sentence, gets a higher score.
(a) our: a truck with a double decker bus on the side of the road(1.533)
their: a large truck is parked in front of a building(0.157)
(b) our: a black and white photo of a car and a bus(0.156)
their: a truck driving down a street next to a car()
(c) our: a person is jumping in the air with a skateboard(0.196)
their: a person in a black shirt is doing a trick(0.058)
Figure 4.3: Comparing our captions with the gender neutralised captions of [aneja2018convolutional]. These are the top 3 instances whose Bleu2 scores differ the most for the 2 models. Observe the unexpected difference in captions from the two models for images with no humans.
(a) our: a pizza sitting on top of a wooden cutting board(0.183)
their: a pizza with a knife and a knife on it(5.6)
(b) our: a person skiing down a snowy hill on skis(6.9)
their: a person riding skis down a snow covered slope(0.180)
(c) our: a person flying through the air while riding a snowboard(0.176)
their: a person on a snowboard jumping over a hill(4.7)
(d) our: a person wearing glasses and a tie and a tie(0.156)
their: a person in a suit and tie standing in front of a(5.6)
(e) our: a person is riding a motorcycle down a street(0.162)
their: a group of people riding motorcycles down a street(3.9)
(f) our: two giraffes standing next to each other in a zoo(4.78)
their: a couple of giraffes are standing in a field(0.154)
Figure 4.4: Comparing our captions with the gender-neutralised captions of [aneja2018convolutional]. These are instances from the top 10 predictions whose Bleu3 scores differ the most between the 2 models. Observe the label bias cropping up in 4.4(d).
(a) our: a herd of sheep standing on top of a lush green field(higher)
their: a group of sheep standing in a fenced in field
(b) our: a red traffic light sitting on the side of a road(higher)
their: a traffic light at an intersection with a red light
(c) our: a person sitting on a beach with a umbrella
their: a large umbrella sitting on top of a sandy beach(higher)
(d) our: a person in a suit is taking a selfie in a mirror
their: a person is taking a picture of itself in the mirror(higher)
(e) our: a person standing in front of a table holding a glass of wine(higher)
their: a person in a black suit and glasses is holding a glass
(f) our: a bunch of different types of donuts on a table(higher)
their: a bunch of white and blue flowers on a table
Figure 4.5: Comparing our captions with the gender-neutralised captions of [aneja2018convolutional]. These are instances from the top 10 predictions whose Bleu4 scores differ the most between the 2 models. Observe the unexpected differences in captions for images with no humans, particularly the drastic change in 4.5(f) (when the only change in our model is the removal of gender during training).
(a) our: a woman in a tie is looking at her cell phone
their: a man in a suit and tie sitting on a couch
(b) our: a woman is playing tennis on a tennis court
their: a tennis player is swinging a racket at a ball
(c) our: a woman in a yellow shirt is riding a surfboard
their: a man in a yellow shirt and black shorts and a white
(d) our: a young girl holding a cell phone in her hand
their: a woman in a blue shirt talking on a cell phone
(e) our: a person in a red tie and a tie
their: a man in a suit and tie standing in a room
(f) our: a young girl sitting on a skateboard on a sidewalk
their: a young boy riding a skateboard on a sidewalk
Figure 4.6: Comparing our final captions with the CNN model [aneja2018convolutional]. These are instances from the female hard dataset, wherein Bleu2 scores differ the most for the 2 models.