Mitigating Gender Bias in Captioning Systems

by Ruixiang Tang, et al.

Image captioning has made substantial progress with huge supporting image collections sourced from the web. However, recent studies have pointed out that captioning datasets, such as COCO, contain gender bias found in web corpora. As a result, learning models could heavily rely on the learned priors and image context for gender identification, leading to incorrect or even offensive errors. To encourage models to learn correct gender features, we reorganize the COCO dataset and present two new splits, COCO-GB V1 and V2, in which the train and test sets have different gender-context joint distributions. Models relying on contextual cues will suffer from huge gender prediction errors on the anti-stereotypical test data. Benchmarking experiments reveal that most captioning models learn gender bias, leading to high gender prediction errors, especially for women. To alleviate the unwanted bias, we propose a new Guided Attention Image Captioning model (GAIC), which provides self-guidance on visual attention to encourage the model to capture correct gender visual evidence. Experimental results validate that GAIC can significantly reduce gender prediction errors while retaining competitive caption quality. Our code and the designed benchmark datasets are available at



1 Introduction

Automatically understanding and describing visual contents is an important and challenging interdisciplinary research topic Fang et al. (2015); Hossain et al. (2019); Karpathy and Fei-Fei (2015); Mao et al. (2014); Vinyals et al. (2015); Xu et al. (2015). Over the past years, considerable effort has been dedicated to improving the overall caption quality, without considering the unwanted bias learned by models. In this work, we focus on gender bias and investigate the gender words (e.g., woman and man) generated by the learning models. A biased model may rely on incorrect visual features to provide gender descriptions, e.g., describing a person as a woman because of the kitchen background, which leads to incorrect gender predictions on samples that do not match the learned prior.

Gender bias widely exists in captioning models mainly for two reasons. First, many caption datasets originally contain severe gender bias. For example, the COCO dataset Lin et al. (2014) has an imbalanced 3:1 men to women ratio, and the object-gender joint distribution further exacerbates the imbalance, e.g., 95% of skateboard images co-occur with men. The biased dataset not only results in a biased model, but also makes it hard to detect the bias learned by models. For instance, in the COCO dataset, images with women comprise such a small portion that gender prediction errors on them could be ignored. Second, the existing training paradigm makes models prone to replicating bias in datasets. Our preliminary experimental results show that most models inevitably learn the bias existing in the training dataset.

A straightforward solution to mitigate the gender bias is to train the model based on a gender-balanced dataset. Unfortunately, our experimental results indicate that simply balancing image numbers for each gender has limited improvement in bias mitigation. From the training perspective, an alternative approach is to increase the loss weight of gender words, which also doesn’t achieve a satisfactory result. In addition to designing mitigation algorithms, another challenge is that gender bias may be underestimated when models are evaluated on the test set with similar bias to training samples.

To bridge the gap, we propose new benchmark datasets to quantify gender bias. In this work, we focus on the COCO captioning dataset and create two new splits. COCO-GB v1 is designed to systematically measure gender bias in existing models. COCO-GB v2 is designed to further assess capabilities of models in gender bias mitigation. We also propose a novel Guided Attention Image Captioning model (GAIC) to mitigate gender bias. GAIC has two complementary streams to encourage the model to explore correct gender features. The training pipeline can seamlessly add extra supervision to accelerate the self-exploration process. Besides, GAIC is model-agnostic and can be easily applied to various captioning models.

We report the performance of several baseline models on our new benchmark datasets. The key observation is that most models learn or even exaggerate the gender bias contained in the training data, which causes a significantly higher gender prediction error for women. Another important finding is that common evaluation metrics, such as BLEU Papineni et al. (2002) and CIDEr Vedantam et al. (2015), mainly focus on the overall caption quality and are not sensitive to gender errors. Our experimental results show that a model can achieve competitive caption quality scores even when it has misclassified 27% of women as men, which further demonstrates that the high performance achieved by current models should be revisited. Experimental results validate that our proposed GAIC method can significantly reduce gender prediction errors while retaining competitive caption quality. Visualization evaluations further indicate that GAIC learns to utilize correct gender evidence for gender word generation.

2 COCO-GB: Dataset Creation and Analysis

The COCO-GB datasets are created for quantifying gender bias in models. We construct COCO-GB v1 based on a widely used split and create a gender-balanced secret test dataset. COCO-GB v2 is created by reorganizing the train/test split so that the gender-object joint distribution in the training set is very different from that in the test set. The dataset creation procedure is as follows.

Gender Labeling:

An important reason why previous work ignores the gender bias problem is that the COCO dataset doesn't have explicit gender annotations. Hence our first step is to annotate the gender of people in the images. Because many images do not show a clear human face, normal face recognition models cannot provide a reliable gender prediction. Alternatively, inspired by Hendricks et al. (2018), we utilize the 5 training captions available for each image to annotate the gender. Images are labeled as "men" if at least one of the five descriptions contains male gender words and none of them includes female gender words. Similarly, images are labeled as "women" if at least one of the five descriptions contains female gender words and none of them includes male gender words. Images whose captions mention both "man" and "woman" are discarded. To further improve the labeling accuracy, we only consider images that contain one major person and also ignore pictures in which the person is too small. (The gender word list and examples of gender labeling are shown in Sec. A.1.)
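The caption-based labeling rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word lists below are short, hypothetical stand-ins for the full lists in Sec. A.1, the function name `label_image` is ours, and the additional filters (one major person, person not too small) are not modeled.

```python
import re

# Hypothetical, abbreviated word lists; the full lists are given in Sec. A.1.
MALE_WORDS = {"man", "men", "boy", "boys", "he", "his", "male", "gentleman"}
FEMALE_WORDS = {"woman", "women", "girl", "girls", "she", "her", "female", "lady"}


def label_image(captions):
    """Return "men", "women", or None (image discarded) for one image's captions."""
    has_male = has_female = False
    for caption in captions:
        # Compare whole lowercase tokens so that "woman" never matches "man".
        tokens = set(re.findall(r"[a-z]+", caption.lower()))
        has_male = has_male or bool(tokens & MALE_WORDS)
        has_female = has_female or bool(tokens & FEMALE_WORDS)
    if has_male and not has_female:
        return "men"
    if has_female and not has_male:
        return "women"
    return None  # both genders mentioned, or no gender word at all
```

For example, `label_image(["A man rides a skateboard.", "A person on a board."])` yields `"men"`, while an image whose captions mention both a man and a woman is discarded.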

COCO-GB v1: COCO-GB v1 is created for evaluating gender bias in existing models. Hence we follow the widely used Karpathy split Karpathy and Fei-Fei (2015) and create a gender-balanced secret test dataset. Captioning models trained on the Karpathy split can directly evaluate their gender bias on this secret test set without retraining. Compared to previous work Hendricks et al. (2018), which only balances the number of images for each gender, we further balance the gender-object joint distribution in the secret dataset. In this way, we can comprehensively show the bias learned by the model. We compute the gender bias towards men for each object category with the metric proposed in Zhao et al. (2017):

$$\text{bias}(\text{object}) = \frac{c(\text{object}, \text{men})}{c(\text{object}, \text{men}) + c(\text{object}, \text{women})}$$

where men and women refer to images labeled as "men" and "women", and $c(\cdot, \cdot)$ counts how often the object co-occurs with each label. Ideally, an object with a bias ratio of 0.5 indicates that women and men have the same probability of co-occurring with it. We show the gender bias of several objects in Fig. 1. Our statistical analysis shows that the COCO training dataset has a severe gender bias: the training data has an average bias ratio of 0.65, and 90% of objects are biased towards men. Similar bias is also found in the original test split, which demonstrates that directly evaluating models on the original test dataset can underestimate the bias learned by models. We then utilize a greedy algorithm to select a secret test dataset from the original test split so that each category has a nearly balanced gender ratio. Finally, our selected secret test dataset has 500 images for each gender and obtains an average bias ratio of 0.52. We then utilize this COCO-GB v1 dataset to evaluate existing models and quantify the unwanted gender bias captured by models.
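As a small worked example of the bias ratio, the sketch below assumes the ratio is the share of an object's gender-labeled images that are labeled "men", following Zhao et al. (2017). The function names and the toy counts are hypothetical; only the 95% skateboard figure echoes the text.

```python
def bias_ratio(men_count, women_count):
    """Share of an object's gender-labeled images that are labeled "men"."""
    total = men_count + women_count
    if total == 0:
        raise ValueError("object never co-occurs with a gender-labeled image")
    return men_count / total


def average_bias(counts):
    """counts maps an object name to a (men_count, women_count) pair."""
    ratios = [bias_ratio(men, women) for men, women in counts.values()]
    return sum(ratios) / len(ratios)


# Toy counts for illustration only (not taken from COCO).
toy_counts = {"skateboard": (95, 5), "umbrella": (40, 60)}
skateboard_bias = bias_ratio(*toy_counts["skateboard"])  # 0.95, biased towards men
toy_average = average_bias(toy_counts)                   # (0.95 + 0.40) / 2
```

A ratio of 0.5 corresponds to a balanced object; values above 0.5 indicate bias towards men.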

COCO-GB v2: This dataset is designed to further assess the robustness of captioning models when exposed to novel gender-object pairs at test time. To create the new split, we first sort the 80 object categories in COCO according to their gender bias. Unlike the balanced test dataset created for COCO-GB v1, we start from the most biased object and greedily add selected data into the test set. As a result, the distribution difference between the training and test datasets is dramatically enlarged. We guarantee that there are sufficient images from each category during training, but at test time the model will face novel compositions of gender-object pairs, e.g., women with a skateboard. The final split has 118,062/5,000/10,000 images in train/val/test respectively. We utilize this dataset to further evaluate our proposed bias mitigation approach. (The gender-object joint distributions of COCO, COCO-GB v1 and COCO-GB v2 are shown in Sec. A.2.)

Figure 1: We select several objects and show their gender bias in (a) the COCO training set, (b) the COCO original test set and (c) the COCO-GB v1 secret test set. There is a significant bias in the training set: more than 90% of objects have a higher probability of co-occurring with men. Similar bias also exists in the original test set, while the secret test dataset in COCO-GB v1 has a balanced gender ratio.

3 Benchmarking Captioning Models on COCO-GB v1

To reveal the gender bias in existing models, we utilize the gender prediction performance to quantify the bias learned by models Hendricks et al. (2018); Zhao et al. (2017). Models are trained on the Karpathy split, their caption quality is measured on the original test split, and their gender prediction performance is evaluated on the COCO-GB v1 secret test dataset.

Baselines: We consider models both with and without an attention mechanism, where "attention" refers to a model's ability to focus on specific image regions Rohrbach et al. (2018). For non-attention models, we consider FC Rennie et al. (2017), which initializes the LSTM with features extracted directly by a CNN, and LRCN Donahue et al. (2015), which leverages visual information at each time step. For attention models, we select Att Xu et al. (2015), which first applied a visual attention mechanism to caption generation; AdapAtt Lu et al. (2017), which automatically determines when and where to utilize visual attention; and Att2in Rennie et al. (2017), which modifies the architecture of Att and inputs the attention features only to the cell node of the LSTM. In addition, we choose models that utilize extra visual grounding information. TopDown Anderson et al. (2018) proposes a specific "top-down attention" mechanism based on the Faster R-CNN model Ren et al. (2015). NBT Lu et al. (2018) generates a sentence "template" with slot locations and fills the slots using an object detection model.

Learning Objective and Implementation: Baselines are trained with cross-entropy loss. Besides, the FC, LRCN, Att2in and TopDown models are also trained with self-critical loss Rennie et al. (2017), which uses reinforcement learning to directly optimize the non-differentiable CIDEr metric. For a fair comparison, all baselines utilize visual features extracted by the ResNet-101 network; the NBT and TopDown models obtain extra grounded information from the Faster R-CNN model.

Results analysis: In Tab. 1, we report caption quality as well as gender prediction performance where gender is predicted correctly/incorrectly or "neutrally" when no gender-specific words are generated. An unbiased model should have similar outcome and low error rate for each gender. From the results, we reach the following conclusions.


  • Across all models, the error rate for women is substantially higher than for men. The average error rate of all models is 16.7% and 7.7% for women and men respectively. An interesting finding is that models with an attention mechanism have much higher women error rates (average of 22.6%) compared to non-attention models (average of 15.8%). One possible explanation is that the attention mechanism enhances the model's ability to capture visual features, which also makes it easier for models to learn vision bias. Models utilizing extra visual grounding information have a much lower women error rate (average of 9.1%), which indicates that the extra visual features provided by Faster R-CNN are unbiased. The shortcoming is that the NBT and TopDown models require training multiple sub-models, and the gender accuracy highly depends on the extra object detection model.

  • Models with a high gender error rate can still receive competitive caption quality scores, e.g., the AdapAtt model obtains decent caption quality in three quality metrics but has the highest women error rate (26.8%) of all models, which means that the model misclassifies 26.8% of images with women as men. These experimental results demonstrate that current caption quality metrics mainly focus on the overall quality and are not sensitive to gender prediction errors.

  • Self-critical loss can improve models' overall caption quality but also amplifies the gender error rate. After training with self-critical loss, the CIDEr metric obtains an average improvement of 6.2%. However, the error rate of women in the FC, Att2in, and TopDown models increases by an average of 5.1%, and the error rate of men in the LRCN model increases by 4.2%.
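The group averages quoted in the first observation can be re-derived from the women "wrong" column of Table 1. The short check below is only arithmetic on the reported numbers for the cross-entropy models; the grouping of models follows the baseline description, and the variable names are ours.

```python
# Women "wrong" rates (%) of the cross-entropy models, copied from Table 1.
women_error = {
    "FC": 14.8, "LRCN": 16.8,                      # no attention
    "Att": 24.7, "AdapAtt": 26.8, "Att2in": 16.3,  # attention
    "TopDown": 9.0, "NBT": 9.3,                    # extra grounding (Faster R-CNN)
}
groups = {
    "non_attention": ["FC", "LRCN"],
    "attention": ["Att", "AdapAtt", "Att2in"],
    "grounded": ["TopDown", "NBT"],
}


def group_mean(names):
    """Average women error rate over the named models."""
    return sum(women_error[name] for name in names) / len(names)


averages = {group: group_mean(names) for group, names in groups.items()}
# attention ~ 22.6, non_attention ~ 15.8, grounded ~ 9.15 (reported as 9.1)
```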

4 Image Captioning Model with Guided Attention

We propose a novel Guided Attention Image Captioning model (GAIC) to mitigate gender bias by self-supervising the model's visual attention. The attention mechanism has been widely used in image captioning, where it significantly improves the quality of generated descriptions Anderson et al. (2018); Lu et al. (2017); Xu et al. (2015). However, most existing work uses visual attention mainly for improving the overall caption quality. Few studies have considered the potential bias learned by the attention mechanism. GAIC considers a situation with no grounded supervision, in which the model explores the correct gender evidence by itself. The GAIC_es model (GAIC with extra supervision) considers a semi-supervised scenario in which a small amount of labeled data is available, and fine-tunes the model attention with this extra supervision.

| Model | B-4 | C | M | Women correct | Women wrong | Women neutral | Men correct | Men wrong | Men neutral |
|---|---|---|---|---|---|---|---|---|---|
| FC | 31.4 | 95.8 | 24.9 | 60.5 | 14.8 | 24.7 | 64.3 | 10.3 | 25.4 |
| LRCN | 30.0 | 90.8 | 23.9 | 61.7 | 16.8 | 21.6 | 64.7 | 11.2 | 24.0 |
| Att | 31.0 | 95.1 | 24.8 | 42.8 | 24.7 | 32.5 | 61.0 | 4.3 | 34.7 |
| AdapAtt | 31.1 | 98.2 | 25.2 | 47.9 | 26.8 | 27.4 | 75.7 | 3.8 | 20.5 |
| Att2in | 32.8 | 102.0 | 23.3 | 61.5 | 16.3 | 22.1 | 70.2 | 6.5 | 23.3 |
| TopDown† | 34.6 | 107.6 | 26.7 | 65.5 | 9.0 | 25.5 | 63.5 | 7.4 | 29.0 |
| NBT† | 34.1 | 105.1 | 26.2 | 72.3 | 9.3 | 18.3 | 77.6 | 4.3 | 18.0 |
| FC* | 33.4 | 103.9 | 25.0 | 61.6 | 20.7 | 17.7 | 71.9 | 6.8 | 21.3 |
| LRCN* | 29.4 | 93.0 | 23.5 | 68.1 | 11.3 | 20.6 | 60.6 | 15.5 | 23.9 |
| Att2in* | 33.6 | 106.7 | 25.7 | 61.8 | 19.7 | 18.5 | 73.8 | 6.2 | 20.0 |
| TopDown†* | 34.9 | 117.2 | 27.0 | 69.0 | 15.1 | 15.8 | 73.3 | 7.5 | 19.1 |

Table 1: Gender bias analysis on the COCO-GB v1 split. We utilize BLEU-4 (B-4), CIDEr (C) and METEOR (M) to evaluate caption quality; all results are generated with beam size 5. Caption quality is obtained on the original test dataset, and gender bias is evaluated on the COCO-GB v1 secret test dataset. † denotes models that utilize extra grounded information from the Faster R-CNN network. * denotes models that are trained with self-critical loss.

4.1 Self-Guidance on the Caption Attention

To achieve the goal of self-supervision, we design a two-stream training pipeline. As shown in Fig. 2, GAIC contains a caption generation stream and a gender evidence mining stream. The two streams share the same parameters. The purpose of the caption generation stream is to generate high-quality descriptions as well as to synthesize attention maps of gender words. The gender evidence mining stream then forces the attention to be focused on the correct gender evidence. The two complementary streams encourage the attention regions of gender words to gradually move to correct gender features, such as the human appearance, and to keep away from biased features, e.g., laptops and umbrellas.

Figure 2: GAIC has two streams of networks that share parameters. The caption generation stream finds the regions that help the model classify gender, and the gender evidence mining stream tries to make sure that all selected regions are correct gender evidence. The attention map is generated online, and the two streams are trained jointly with the Language Quality Loss and the Gender Evidence Loss. The GAIC_es model seamlessly adds a small amount of extra supervision, shown as a third stream, to further refine model attention.

In the caption generation stream, after inputting an image $I$, the captioning model generates the corresponding description, which is supervised by the Language Quality Loss $L_{lq}$, such as cross-entropy loss. At the same time, the stream generates a visual attention map for each inferred word. We can directly get attention maps from models with an attention mechanism. For non-attention models, attention maps can be obtained with post-hoc interpretation methods, such as Grad-CAM Selvaraju et al. (2017) and saliency maps Simonyan et al. (2014). Here we focus on the attention maps of gender words, denoted by $A_g$. We use $A_g$ to generate a soft mask and apply the mask to the original input image. In this way, the features that the caption generation stream relies on for gender inference are removed from the image. $I^{*}$ denotes the image without the gender features captured by the caption generation stream, and is defined as follows:

$$I^{*} = I - \big(T(A_g) \odot I\big) \tag{1}$$

where $\odot$ denotes element-wise multiplication and $T(\cdot)$ is a masking function based on a thresholding operation. To normalize it, we use the Sigmoid function as an approximation, defined in Eq. 2:

$$T(A_g) = \frac{1}{1 + \exp\big(-\omega (A_g - \sigma)\big)} \tag{2}$$

where $\sigma$ is a matrix whose elements all equal a threshold value, and the parameter $\omega$ is a scale ensuring that $T(A_g)_{i,j}$ approximately equals 1 when $(A_g)_{i,j}$ is larger than the threshold, and 0 otherwise. We feed $I^{*}$ into the gender evidence mining stream and generate the corresponding captions. The loss on $I^{*}$ is denoted as the Gender Evidence Loss $L_{ge}$, which is defined as follows:


Here each caption word is compared against the sets of neutral, female and male gender words, and a word is replaced with a neutral gender word if it belongs to the male or female gender words. With the guidance of the Gender Evidence Loss, the GAIC model learns to focus on correct gender evidence. Suppose the caption generation stream utilizes biased gender features for gender prediction; then the masked image produced from the gender-word attention map will have the biased visual features, e.g., a laptop, removed. Thus the gender evidence mining stream will generate an incomplete caption because the important features are missing, and will receive a penalty from the Gender Evidence Loss. The loss can be minimized only when the model focuses on correct gender features such as the person's appearance (an example is shown in Fig. 2).
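The soft masking step of the gender evidence mining stream (a sigmoid-approximated threshold on the gender-word attention map, followed by removing the attended pixels from the image) can be sketched as follows. This is a toy illustration, not the training code: it uses a single-channel image stored as nested lists, assumes the attention map has already been resized to the image resolution, and the default values of `sigma` and `omega` are hypothetical.

```python
import math


def soft_threshold(attention, sigma=0.5, omega=10.0):
    """Sigmoid approximation of thresholding: close to 1 where attention > sigma."""
    return [[1.0 / (1.0 + math.exp(-omega * (a - sigma))) for a in row]
            for row in attention]


def remove_gender_evidence(image, attention, sigma=0.5, omega=10.0):
    """Return the masked image I* = I - T(A) * I, computed element-wise."""
    mask = soft_threshold(attention, sigma, omega)
    return [[pixel - m * pixel for pixel, m in zip(image_row, mask_row)]
            for image_row, mask_row in zip(image, mask)]


# Toy 2x2 example: only the top-left location is attended for the gender word.
image = [[0.8, 0.6], [0.4, 0.2]]
attention = [[1.0, 0.0], [0.0, 0.0]]
masked = remove_gender_evidence(image, attention)
# The attended pixel is almost erased; the other pixels are almost unchanged.
```

Because the mask is a smooth function of the attention map, the masking step stays differentiable, which is what allows the second stream's loss to influence where the first stream attends.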

Besides capturing correct gender features, we also expect the model to predict cautiously and to use gender-neutral words such as people or person when the gender evidence is vague. Since most gender features have been removed from the masked image, we correspondingly replace the gender words in its target caption with gender-neutral words.
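A minimal sketch of this target-side replacement is shown below. The mapping is a hypothetical example built from the neutral words mentioned in the text (person, people); the exact replacement list is not reproduced here, and the function name `neutralize` is ours.

```python
# Hypothetical gender-word -> neutral-word mapping, for illustration only.
NEUTRAL_REPLACEMENT = {
    "man": "person", "woman": "person",
    "boy": "person", "girl": "person",
    "men": "people", "women": "people",
}


def neutralize(tokens):
    """Replace gendered tokens with neutral ones; leave all other tokens unchanged."""
    return [NEUTRAL_REPLACEMENT.get(token.lower(), token) for token in tokens]


target = neutralize("a woman typing on a laptop".split())
# target == ["a", "person", "typing", "on", "a", "laptop"]
```

The neutralized caption would then serve as the target for the masked image, so the model is rewarded for staying neutral once the gender evidence has been removed.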

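Finally, we combine the Language Quality Loss $L_{lq}$ and the Gender Evidence Loss $L_{ge}$ as the self-guidance gender discrimination loss $L_{self}$:

$$L_{self} = L_{lq} + \alpha L_{ge}$$

where $\alpha$ is the weighting parameter, for which we use a single fixed value in all our experiments. With this joint optimization, the model learns to generate high-quality captions as well as to focus on the real visual features that contribute to gender recognition.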

4.2 Integrating with Extra Supervision

In addition to self-exploration training, we also consider semi-supervised scenarios where a small amount of extra labeled data is added to accelerate the self-exploration process. More specifically, we utilize pixel-level person segmentation masks to regularize attention, and denote the resulting model as GAIC_es. GAIC_es utilizes these masks as a boundary regularization and forces the attention of gender words to stay within the mask. In addition to the self-guidance loss, GAIC_es has another loss, the Gender Attention Loss $L_{ga}$, for attention supervision, which is defined as follows:


where $M$ is a binary mask in which 1 indicates pixels belonging to a person and 0 represents the background. The Gender Attention Loss $L_{ga}$ forces the attention maps of gender words to stay within the mask. The final loss for GAIC_es is defined as follows:

$$L = L_{self} + \beta L_{ga}$$

where $L_{self}$ is the self-guidance loss and $\beta$ is the weighting parameter that adjusts the strength of $L_{ga}$. Since labeling pixel-level segmentation maps is extremely time-consuming, we use only a small amount of data with external supervision, such as 10% in our experiment. In Fig. 2, this supervision is shown as a third stream; all three streams share the same parameters and are optimized in an end-to-end manner.
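One plausible way to realize an attention penalty of this kind is to sum the attention weight that a gender word places on background pixels. The sketch below is an assumption for illustration, not the paper's exact Gender Attention Loss; the function names and the value of `beta` used in the example are hypothetical.

```python
def attention_outside_mask(attention, person_mask):
    """Sum of attention weights that fall on background pixels (mask value 0)."""
    return sum(a * (1 - m)
               for attention_row, mask_row in zip(attention, person_mask)
               for a, m in zip(attention_row, mask_row))


def total_loss(self_guidance_loss, attention_loss, beta):
    """Self-guidance loss plus the attention penalty weighted by beta."""
    return self_guidance_loss + beta * attention_loss


# Toy 2x2 example: the person occupies the left column of the image.
attention = [[0.7, 0.1], [0.1, 0.1]]
person_mask = [[1, 0], [1, 0]]
penalty = attention_outside_mask(attention, person_mask)  # 0.1 + 0.1
loss = total_loss(1.5, penalty, beta=0.5)                 # 1.5 + 0.5 * 0.2
```

The penalty is zero exactly when all attention mass lies inside the person mask, which matches the stated goal of keeping gender-word attention within the mask.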

5 Experiments

5.1 Settings and Baselines

The Att model Xu et al. (2015) is a classic captioning model, and many captioning models are developed from its architecture. Our benchmarking experiments indicate that the Att model has a severe gender bias. Thus we adopt the Att model as the baseline as well as the base model on which to build GAIC. For the GAIC_es model, we add 10% of images with person segmentation masks as extra supervision. We compare our proposed model with the following common debiasing approaches.

Balanced: A subset with a 1:1 men to women ratio is selected from the training data. Then we train a new model on this instance-level balanced dataset. In order to obtain a subset with a sufficient amount of training data, we can only balance the number of images for each gender. Note that unless we rebuild the whole dataset and add more images with women, the gender bias existing in the gender-object joint distribution cannot be effectively removed.

UpWeight: We also conduct an experiment in which we amplify the weight of the cross-entropy loss on gender words during training. Specifically, we label the gender word positions for each sentence in the training dataset and train the model by multiplying the loss for gender words by a constant value of 10. Intuitively, this encourages models to predict gender words accurately.
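The UpWeight baseline amounts to a per-token weighted cross-entropy. A minimal sketch is given below; it assumes per-token negative log-likelihoods are already available and that the sentence loss is their weighted sum (whether the original implementation sums or averages is not stated). The gender word set is a hypothetical, abbreviated list.

```python
# Hypothetical, abbreviated gender word list for illustration.
GENDER_WORDS = {"man", "woman", "men", "women", "boy", "girl"}


def upweighted_loss(tokens, token_nll, gender_weight=10.0):
    """Weighted sum of per-token losses, with gender words multiplied by gender_weight."""
    weights = [gender_weight if token.lower() in GENDER_WORDS else 1.0
               for token in tokens]
    return sum(w * nll for w, nll in zip(weights, token_nll))


tokens = ["a", "woman", "surfing"]
token_nll = [0.1, 0.5, 0.2]  # toy per-token negative log-likelihoods
loss = upweighted_loss(tokens, token_nll)  # 0.1 + 10 * 0.5 + 0.2
```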

Pixel-level Supervision (PixelSup): As a variant of GAIC model, we remove the self-exploration streams and directly use Eq. 6 to fine-tune model attention maps with 10% extra data.

5.2 Evaluation Metrics

Gender Accuracy: Unlike traditional binary gender classification tasks, the predictions of captioning models can be correct, wrong or neutral. Models should first of all have a low error rate. On this basis, to make decisions more discriminative, captioning models should also obtain a low neutral rate, generating gender-neutral words only when the visual evidence is too vague to distinguish gender.

Gender Divergence: Besides high accuracy, we also expect our model to treat each gender fairly, so that the outcome distributions for the two genders are similar. Here, we utilize the cosine similarity between the correct/wrong/neutral rates for men and women to measure the fairness of the model, which resembles the fairness definition of Equality of Odds Hardt et al. (2016).

Attention Correctness: To measure whether attention focuses on the correct gender features, we compare the attention maps of gender words with the person segmentation mask. We adopt two evaluation metrics: Pointing Game Zhang et al. (2018), which checks whether the highest-attention point falls inside the person mask, and Attention Sum Liu et al. (2017), which sums the attention weights that fall inside the mask.
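As a worked example of the divergence, the sketch below assumes that the reported value D is one minus the cosine similarity between the women and men outcome vectors (correct, wrong, neutral). This reading is our assumption; it reproduces the baseline value of 0.075 in Table 2 when applied to the baseline row.

```python
import math


def gender_divergence(women_rates, men_rates):
    """1 - cosine similarity between the (correct, wrong, neutral) rate vectors."""
    dot = sum(w * m for w, m in zip(women_rates, men_rates))
    norm_women = math.sqrt(sum(w * w for w in women_rates))
    norm_men = math.sqrt(sum(m * m for m in men_rates))
    return 1.0 - dot / (norm_women * norm_men)


# Baseline row of Table 2: women (42.8, 24.7, 32.5), men (61.0, 4.3, 34.7).
baseline_d = gender_divergence([42.8, 24.7, 32.5], [61.0, 4.3, 34.7])
```

A value of 0 means the two genders have proportionally identical outcome distributions; larger values indicate a larger disparity.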

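| Model | C | M | Women correct | Women wrong | Women neutral | Men correct | Men wrong | Men neutral | D |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | 95.1 | 24.8 | 42.8 | 24.7 | 32.5 | 61.0 | 4.3 | 34.7 | 0.075 |
| Balanced | 93.9 | 24.9 | 54.3 | 20.3 | 25.3 | 69.7 | 7.4 | 22.8 | 0.032 |
| UpWeight | 95.8 | 24.9 | 70.6 | 26.4 | 2.9 | 81.4 | 10.5 | 8.9 | 0.028 |
| PixelSup | 92.0 | 24.5 | 46.8 | 20.7 | 32.5 | 58.0 | 7.3 | 34.7 | 0.031 |
| GAIC | 93.7 | 24.6 | 62.0 | 16.9 | 21.1 | 77.3 | 7.0 | 15.7 | 0.021 |
| GAIC_es | 94.6 | 24.7 | 64.1 | 13.1 | 22.8 | 75.3 | 5.2 | 19.5 | 0.011 |

Table 2: Gender bias analysis on the COCO-GB v1 split. C and M denote CIDEr and METEOR; D denotes gender divergence.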
| Accuracy | Women | Men | Average |
|---|---|---|---|
| Baseline | 25.5 | 21.2 | 23.4 |
| Balanced | 25.0 | 21.4 | 23.2 |
| UpWeight | 26.7 | 23.3 | 25.0 |
| PixelSup | 30.0 | 28.1 | 29.0 |
| GAIC | 27.4 | 24.3 | 25.6 |
| GAIC_es | 32.5 | 28.5 | 30.1 |

(a) Attention Sum

| Accuracy | Women | Men | Average |
|---|---|---|---|
| Baseline | 64.8 | 57.4 | 61.1 |
| Balanced | 66.2 | 59.6 | 62.9 |
| UpWeight | 66.2 | 59.3 | 62.8 |
| PixelSup | 67.2 | 60.5 | 63.9 |
| GAIC | 67.2 | 61.2 | 64.2 |
| GAIC_es | 67.8 | 61.5 | 64.7 |

(b) Pointing Game

Table 3: Attention correctness on the COCO-GB v1 split. We adopt Pointing Game Zhang et al. (2018) and Attention Sum Liu et al. (2017) as evaluation metrics. The GAIC and GAIC_es models can significantly improve attention correctness.
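The two attention correctness metrics can be illustrated per image as follows. This is a sketch under simplifying assumptions: attention maps and person masks are nested lists of the same size, Pointing Game is scored as a hit or miss on the single highest-attention location, and Attention Sum is the fraction of total attention weight inside the mask. Dataset-level scores would then be averages over images; the function names are ours.

```python
def pointing_game_hit(attention, person_mask):
    """True if the highest-attention location lies inside the person mask."""
    best_row, best_col = max(
        ((r, c) for r, row in enumerate(attention) for c in range(len(row))),
        key=lambda rc: attention[rc[0]][rc[1]],
    )
    return person_mask[best_row][best_col] == 1


def attention_sum(attention, person_mask):
    """Fraction of the total attention weight that falls inside the person mask."""
    total = sum(sum(row) for row in attention)
    if total == 0:
        return 0.0
    inside = sum(a * m
                 for attention_row, mask_row in zip(attention, person_mask)
                 for a, m in zip(attention_row, mask_row))
    return inside / total


# Toy 2x2 example: the person occupies only the top-left location.
attention = [[0.5, 0.2], [0.2, 0.1]]
person_mask = [[1, 0], [0, 0]]
hit = pointing_game_hit(attention, person_mask)  # True
share = attention_sum(attention, person_mask)    # about 0.5
```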

5.3 Experimental Results

Gender Accuracy: We report the gender prediction performance on COCO-GB v1 in Tab. 2. Our key observation is that GAIC significantly improves the gender prediction performance compared to the baseline: the gender accuracy for women increases from 42.8% to 62.0%, and the error rate for women is reduced from 24.7% to 16.9%. Although the UpWeight method obtains the highest accuracy for both women and men, it also causes a significantly higher error rate for each gender. There is no substantial difference between the Balanced model and the baseline model, and a similar trend has been found by Bolukbasi et al. (2016), which indicates that models learn gender bias mainly at the feature level (gender-object co-occurrence), and that balancing the number of images for each gender cannot remove the bias in the dataset. PixelSup obtains noticeable improvements, which indicates that supervising the attention maps directly is also helpful. GAIC_es obtains consistently better performance than GAIC, and we empirically find that adding extra supervision can accelerate the self-exploration process and make training more stable. For fairness evaluation, we compare the gender divergence of the different models. GAIC and GAIC_es obtain the lowest divergence, which indicates that these models treat each gender in a fair manner.

Experiments on COCO-GB v2 show trends similar to those on COCO-GB v1 (a detailed analysis of the COCO-GB v2 results is in Sec. C). All baselines obtain worse gender prediction results on COCO-GB v2. This is mainly because the unseen gender-object pairs in the test dataset increase the prediction difficulty. In comparison, our proposed GAIC and GAIC_es models obtain performance comparable to that on COCO-GB v1, which further demonstrates the robustness of the self-exploration training strategy.

Caption Quality: Besides high gender accuracy, we also expect our model to obtain decent caption quality. We use METEOR (M) and CIDEr (C) to evaluate caption quality. As the results in Tab. 2 show, GAIC and GAIC_es only cause a minor performance drop compared to the baseline (from 95.1 to 94.6 on CIDEr and from 24.8 to 24.7 on METEOR). In Fig. 3, examples show that the sentences generated by GAIC and GAIC_es are linguistically fluent, with more correct gender descriptions.

Attention Correctness: To quantitatively evaluate attention correctness, we extract the attention maps of gender words and compare them with person segmentation masks. Quantitative results are shown in Tab. 3. We observe that GAIC and GAIC_es achieve consistent improvements over the baseline model and all model variants, which indicates that our proposed models can focus on the described person for gender word prediction. A qualitative comparison is shown in Fig. 3. We observe that the baseline model may utilize biased visual features for gender prediction and thus make incorrect gender predictions, e.g., describing a woman as a boy because of the tennis ball. In comparison, GAIC and GAIC_es correctly focus on the described person for gender prediction.

Figure 3: Qualitative comparison of baseline and our proposed models.

6 Related Work

Gender Bias in Dataset: Gender bias in datasets has been studied in a variety of domains Bender and Friedman (2018); Bolukbasi et al. (2016); Bordia and Bowman (2019); Brunet et al. (2019); Buolamwini and Gebru (2018); Font and Costa-jussà (2019), especially in natural language processing Hendricks et al. (2018); Lambrecht and Tucker (2019); Rudinger et al. (2018); Stock and Cisse (2018); Vanmassenhove et al. (2018); Zhao et al. (2019, 2017). For image captioning, several studies have considered the gender bias problem in the dataset. Mithun et al. (2018) analyzes the stereotypes in the Flickr30k dataset and mentions gender-related issues. Hendricks et al. (2018) indicates that caption annotators in the COCO dataset tend to infer a person's gender from the image context when the gender cannot be confirmed, e.g., a baseball player is labeled as "man" even if the gender evidence is occluded.

Vision and Language Bias: Bias in captioning models can be divided into vision bias and language bias Rohrbach et al. (2018). Vision bias refers to learning from wrong visual evidence, while language bias refers to capturing unwanted language priors. For example, the phrase "on the beach" often follows the word "Surfboard." Due to the RNN's recurrent mechanism, models can learn this language prior and always generate the phrase "on the beach" if the word "Surfboard" has been inferred. We notice that gender words are usually mentioned at the beginning of the sentence (on average at position 2, with an average sentence length of 9), and that the words before gender words, e.g., "a" and "the," do not have a gender preference. Hence gender bias in captioning systems should mainly come from the vision part.

Mitigating Gender Bias: A few initial attempts have been made to design captioning models that overcome gender biases in datasets. One solution is to break the task into two steps Bhargava (2019). It first locates and recognizes the person in the image. Then a language model utilizes the grounded information to generate captions. Hence the gender accuracy of this approach highly depends on the extra object detection model. In another work, two novel losses are designed to reduce unwanted bias towards specific words, including gender words. Their approach requires segmentation masks for each image, which is costly and impractical for many datasets Hendricks et al. (2018).

7 Conclusions

In this paper, two novel COCO splits are created for studying the gender bias problem in the image captioning task. We provide extensive baseline experiments for benchmarking different models and training strategies, as well as a comprehensive analysis of the dataset. Our experimental results indicate that many captioning models have a severe gender bias problem, leading to an undesirable gender prediction error towards women. We propose a novel training framework, GAIC, which can significantly reduce gender bias through self-guided supervision. Besides, the GAIC_es model can seamlessly add extra supervision and further improve the gender prediction accuracy. Quantitative and qualitative results further validate that our proposed model can focus on correct visual evidence for gender prediction.

8 Broader Impact

In this work, we reveal the severe gender bias problem that widely exists in most captioning models, which leads to an undesirable gender prediction error towards women. The experimental results remind researchers to revisit the high performance achieved by current captioning models and encourage the community to put more effort into promoting the fairness of captioning systems. The two novel COCO splits proposed in this work enable future work to efficiently quantify gender bias in systems, which will have a strong impact on promoting a fairer platform for emerging and future computer vision as well as natural language processing systems. The self-exploration training strategy proposed in this work can significantly reduce gender bias in learning models, and it helps in understanding how interpretation methods such as the attention mechanism could be utilized to further advance the fairness of machine learning methods, thereby broadly impacting the machine learning field.

This work also plays an integral part in educating and training students. The research will be tightly integrated with related courses on data science at the authors' university. The courses will show how bias in data can be learned by models and potentially lead to unintentional discrimination. They will include a section that teaches students to design fair machine learning algorithms. We will actively encourage undergraduate participation in this bias mitigation project.


  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6077–6086.
  • [2] E. M. Bender and B. Friedman (2018) Data statements for natural language processing: toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics 6, pp. 587–604.
  • [3] S. Bhargava (2019) Exposing and correcting the gender bias in image captioning datasets and models. Ph.D. Thesis.
  • [4] T. Bolukbasi, K. Chang, J. Y. Zou, V. Saligrama, and A. T. Kalai (2016) Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, pp. 4349–4357.
  • [5] S. Bordia and S. Bowman (2019) Identifying and reducing gender bias in word-level language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pp. 7–15.
  • [6] M. Brunet, C. Alkalay-Houlihan, A. Anderson, and R. Zemel (2019) Understanding the origins of bias in word embeddings. In International Conference on Machine Learning, pp. 803–811.
  • [7] J. Buolamwini and T. Gebru (2018) Gender shades: intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, pp. 77–91.
  • [8] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell (2015) Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2625–2634.
  • [9] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. (2015) From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1473–1482.
  • [10] J. E. Font and M. R. Costa-jussà (2019) Equalizing gender bias in neural machine translation with word embeddings techniques. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pp. 147–154.
  • [11] M. Hardt, E. Price, and N. Srebro (2016) Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, pp. 3315–3323.
  • [12] L. A. Hendricks, K. Burns, K. Saenko, T. Darrell, and A. Rohrbach (2018) Women also snowboard: overcoming bias in captioning models. In European Conference on Computer Vision, pp. 793–811.
  • [13] M. Hossain, F. Sohel, M. F. Shiratuddin, and H. Laga (2019) A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR) 51 (6), pp. 118.
  • [14] A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137.
  • [15] A. Lambrecht and C. Tucker (2019) Algorithmic bias? An empirical study of apparent gender-based discrimination in the display of STEM career ads. Management Science 65 (7), pp. 2966–2981.
  • [16] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
  • [17] C. Liu, J. Mao, F. Sha, and A. Yuille (2017) Attention correctness in neural image captioning. In Thirty-First AAAI Conference on Artificial Intelligence.
  • [18] J. Lu, C. Xiong, D. Parikh, and R. Socher (2017) Knowing when to look: adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 375–383.
  • [19] J. Lu, J. Yang, D. Batra, and D. Parikh (2018) Neural baby talk. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7219–7228.
  • [20] J. Mao, W. Xu, Y. Yang, J. Wang, Z. Huang, and A. Yuille (2014) Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint arXiv:1412.6632.
  • [21] N. C. Mithun, R. Panda, E. E. Papalexakis, and A. K. Roy-Chowdhury (2018) Webly supervised joint embedding for cross-modal image-text retrieval. In Proceedings of the 26th ACM International Conference on Multimedia, pp. 1856–1864.
  • [22] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318.
  • [23] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
  • [24] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024.
  • [25] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko (2018) Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4035–4045.
  • [26] R. Rudinger, J. Naradowsky, B. Leonard, and B. Van Durme (2018) Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 8–14.
  • [27] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626.
  • [28] K. Simonyan, A. Vedaldi, and A. Zisserman (2014) Deep inside convolutional networks: visualising image classification models and saliency maps.
  • [29] P. Stock and M. Cisse (2018) ConvNets and ImageNet beyond accuracy: understanding mistakes and uncovering biases. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 498–512.
  • [30] E. Vanmassenhove, C. Hardmeier, and A. Way (2018) Getting gender right in neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3003–3008.
  • [31] R. Vedantam, C. Lawrence Zitnick, and D. Parikh (2015) CIDEr: consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575.
  • [32] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164.
  • [33] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048–2057.
  • [34] J. Zhang, S. A. Bargal, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff (2018) Top-down neural attention by excitation backprop. International Journal of Computer Vision 126 (10), pp. 1084–1102.
  • [35] J. Zhao, T. Wang, M. Yatskar, R. Cotterell, V. Ordonez, and K. Chang (2019) Gender bias in contextualized word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 629–634.
  • [36] J. Zhao, T. Wang, M. Yatskar, V. Ordonez, and K. Chang (2017) Men also like shopping: reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2979–2989.

Appendix A COCO-GB dataset

A.1 Gender Annotation

We show the gender word list in Tab. 4. Gender words are selected based on their frequency in the COCO dataset, and the less frequent gender words are removed. The words "woman" and "man" are the most frequent gender-specific words and account for more than 60% of all gender-specific words.

Female words: woman, women, girl, sister, daughter, wife, girlfriend
Male words: man, men, boy, brother, son, husband, boyfriend
Gender-neutral words: people, person, human, baby
Table 4: Gender word list

Some labeled examples are shown in Fig. 4. We label an image as "women" when at least one caption mentions a female word, and as "men" when at least one caption mentions a male word. Images whose captions mention both male and female words are discarded. In Fig. 4 (c), we show that when gender evidence is occluded, annotators may infer the gender from contextual cues or social stereotypes. Hence our gender annotations may contain this kind of social bias.
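The labeling rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' annotation code: the function name `label_image`, the regex-based tokenization, and the choice to return `None` for images without any gender word are assumptions made for the example. The word sets are copied from Table 4.

```python
import re

# Word lists copied from Table 4.
FEMALE_WORDS = {"woman", "women", "girl", "sister", "daughter", "wife", "girlfriend"}
MALE_WORDS = {"man", "men", "boy", "brother", "son", "husband", "boyfriend"}


def label_image(captions):
    """Return "women", "men", or None for one image's list of captions.

    Hypothetical helper: None covers both discarded images (both genders
    mentioned) and images without any gender word (an assumption).
    """
    tokens = set()
    for caption in captions:
        # Assumed tokenization: lowercase alphabetic words only.
        tokens.update(re.findall(r"[a-z]+", caption.lower()))
    has_female = bool(tokens & FEMALE_WORDS)
    has_male = bool(tokens & MALE_WORDS)
    if has_female and has_male:
        return None  # captions mention both genders: image is discarded
    if has_female:
        return "women"
    if has_male:
        return "men"
    return None  # only neutral words, or no person word at all
```

Because matching is done on whole tokens, "woman" does not accidentally trigger the male word "man".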

Figure 4: (a) An image labeled as women. (b) An image labeled as men. (c) An image labeled as men, although the gender evidence is actually occluded.

A.2 COCO Gender Distribution

We show the gender-object joint distributions of the COCO training dataset, the COCO test dataset, the COCO-GB v1 secret test dataset, and the COCO-GB v2 test dataset below. Objects are sorted by their bias rate in the training dataset. For the COCO-GB datasets, we choose 63 of the 80 object categories in COCO according to their image counts. We observe that COCO-GB v1 has a balanced gender-object joint distribution, while COCO-GB v2 has a distribution opposite to that of the training set.
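To make the gender-object joint distribution concrete, the sketch below counts co-occurrences and sorts objects by a bias rate. The bias rate is not defined in this appendix, so the code assumes a common definition: the share of male-labeled images among all gender-labeled images that contain the object. The function names and the toy data are hypothetical.

```python
from collections import Counter


def gender_object_counts(samples):
    """Count (object, gender) co-occurrences.

    `samples` is an iterable of (gender, objects) pairs, one per image,
    with gender in {"men", "women"}. Each object counts once per image.
    """
    counts = Counter()
    for gender, objects in samples:
        for obj in set(objects):
            counts[(obj, gender)] += 1
    return counts


def bias_rate(counts, obj):
    """Assumed definition: fraction of images with `obj` that are labeled "men"."""
    men = counts[(obj, "men")]
    women = counts[(obj, "women")]
    total = men + women
    return men / total if total else None


# Toy data, for illustration only.
samples = [
    ("men", ["skateboard"]),
    ("men", ["skateboard", "bench"]),
    ("women", ["skateboard"]),
    ("women", ["umbrella"]),
    ("women", ["bench"]),
]
counts = gender_object_counts(samples)
objects = sorted({obj for obj, _ in counts}, key=lambda o: bias_rate(counts, o), reverse=True)
```

Under this definition, a balanced split has bias rates near 0.5 for every object, and an anti-stereotypical split has rates mirrored relative to the training set.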

Figure 5: Gender-object joint distribution of COCO training dataset
Figure 6: Gender-object joint distribution of original COCO test dataset
Figure 7: Gender-object joint distribution of COCO-GB v1 secret test dataset
Figure 8: Gender-object joint distribution of COCO-GB v2 test dataset

Appendix B Caption Generation with Visual Attention

Given an image $I$ and the corresponding caption $y = (y_1, \dots, y_T)$, the objective of an encoder-decoder image captioning model is to maximize the following log-likelihood:

$$\theta^{*} = \arg\max_{\theta} \sum_{(I, y)} \log p(y \mid I; \theta)$$

where $\theta$ denotes the trainable parameters of the captioning model. We use the chain rule to decompose the joint probability distribution into ordered conditionals, $\log p(y \mid I) = \sum_{t=1}^{T} \log p(y_t \mid y_1, \dots, y_{t-1}, I)$. A recurrent neural network (RNN) predicts each conditional probability as follows:

$$\log p(y_t \mid y_1, \dots, y_{t-1}, I) = f(h_t, c_t)$$

where $f$ is a nonlinear function, for which we adopt the classic LSTM in this paper, and $h_t$ is the hidden state of the RNN at step $t$.

$c_t$ is the visual context vector extracted from image $I$ for $y_t$. Generally, $c_t$ is important extra information in image captioning models, since it provides visual evidence for caption generation. We follow [33] to compute $c_t$. A CNN extracts a set of image features from its last convolutional layer, which we denote as $A = \{a_1, \dots, a_L\}$. Each $a_i$ corresponds to the features extracted at a different image location, where $a_i \in \mathbb{R}^{D}$. We calculate the attention value $\alpha_{ti}$ for each $a_i$ as follows:

$$e_{ti} = f_{att}(a_i, h_{t-1}), \qquad \alpha_{ti} = \frac{\exp(e_{ti})}{\sum_{k=1}^{L} \exp(e_{tk})}$$

where $f_{att}$ is a multi-layer perceptron conditioned on the previous hidden state. For each location $i$, $\alpha_{ti}$ represents the importance of region $i$ for generating the $t$-th word. Once we obtain the attention weights, the context vector is computed by

$$c_t = \sum_{i=1}^{L} \alpha_{ti} a_i$$
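The attention step described above can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: it assumes the softmax normalization of [33], and it takes the scoring function as a parameter. The `dot_score` function is a hypothetical stand-in for the multi-layer perceptron, used only to keep the example short.

```python
import math


def soft_attention(features, h_prev, score):
    """Compute attention weights and the context vector for one decoding step.

    features: list of L region feature vectors (each a list of D floats).
    h_prev:   previous hidden state of the decoder.
    score:    function (a_i, h_prev) -> float; the paper uses an MLP here.
    """
    scores = [score(a, h_prev) for a in features]
    # Softmax over regions, shifted by the max score for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    norm = sum(exps)
    alphas = [e / norm for e in exps]
    # Context vector: attention-weighted sum of the region features.
    dim = len(features[0])
    context = [sum(alpha * a[d] for alpha, a in zip(alphas, features)) for d in range(dim)]
    return alphas, context


def dot_score(a, h):
    """Hypothetical scoring function (dot product) standing in for the MLP."""
    return sum(x * y for x, y in zip(a, h))


features = [[1.0, 0.0], [0.0, 1.0]]
uniform_alphas, uniform_context = soft_attention(features, [0.0, 0.0], dot_score)
peaked_alphas, peaked_context = soft_attention(features, [10.0, 0.0], dot_score)
```

With a zero hidden state every region receives the same score, so the weights are uniform and the context vector is the mean of the features. With a hidden state aligned to the first region, almost all weight moves to that region.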


Appendix C Experimental Results on COCO-GB v2

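| Model | C | M | Women correct | Women wrong | Women neutral | Men correct | Men wrong | Men neutral | D |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | 98.2 | 27.2 | 51.6 | 28.3 | 20.1 | 77.9 | 4.9 | 17.1 | 0.094 |
| Balanced | 97.5 | 27.3 | 57.9 | 25.5 | 26.6 | 71.1 | 11.5 | 17.4 | 0.034 |
| UpWeight | 95.8 | 26.9 | 72.2 | 26.1 | 1.7 | 86.1 | 11.7 | 2.1 | 0.023 |
| PixelSup | 96.8 | 27.1 | 54.2 | 25.1 | 20.5 | 76.4 | 6.2 | 17.2 | 0.062 |
| GAIC | 97.8 | 26.9 | 67.1 | 18.0 | 14.9 | 68.9 | 10.7 | 20.3 | 0.008 |
| GAIC_es | 98.1 | 27.0 | 69.1 | 15.2 | 15.7 | 71.4 | 8.1 | 20.5 | 0.007 |

Table 5: Gender bias analysis on COCO-GB V2 split.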

Compared to COCO-GB v1, all baseline and common debiasing approaches obtain a higher error rate for women (an average increase of 3.25%) on COCO-GB v2. The GAIC model improves the gender prediction accuracy for women from 51.6% to 67.1% and reduces the error rate for women from 28.3% to 18.0%. Although the UpWeight method obtains the highest accuracy for both women and men, it has an undesirably high error rate for each gender. There is no substantial difference between the Balanced model and the baseline model, and a similar trend was found on COCO-GB v1. GAIC_es (GAIC with extra supervision) consistently outperforms the GAIC model. For fairness evaluation, we compare the gender divergence of the different models. GAIC and GAIC_es obtain the lowest divergence, which indicates that these models treat each gender in a fair manner. For caption quality, GAIC and GAIC_es cause only a minor performance drop compared to the baseline.
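The gender divergence D is not restated in this appendix. As a hedged illustration, the sketch below assumes that D is the cosine distance (one minus cosine similarity) between the women and men outcome vectors (correct, wrong, neutral). This assumption reproduces the D values reported in Table 5 for the Baseline and GAIC rows to three decimals; the function name `gender_divergence` is hypothetical.

```python
import math


def gender_divergence(women, men):
    """Cosine distance between two (correct, wrong, neutral) outcome vectors.

    Assumed definition of D; it matches the Baseline and GAIC rows of Table 5.
    """
    dot = sum(w * m for w, m in zip(women, men))
    norm_w = math.sqrt(sum(w * w for w in women))
    norm_m = math.sqrt(sum(m * m for m in men))
    return 1.0 - dot / (norm_w * norm_m)


# Outcome rates (%) taken from Table 5.
baseline_d = gender_divergence((51.6, 28.3, 20.1), (77.9, 4.9, 17.1))
gaic_d = gender_divergence((67.1, 18.0, 14.9), (68.9, 10.7, 20.3))
print(round(baseline_d, 3), round(gaic_d, 3))  # prints 0.094 0.008
```

Under this reading, a divergence near zero means the model distributes correct, wrong, and neutral predictions in nearly the same proportions for both genders, regardless of the absolute accuracy.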

Appendix D More on Implementation Details

Benchmarking Baselines:

All the baseline models use visual features extracted from the fourth layer of ResNet-101. All models except for NBT, TopDown, Att and AdaptAtt are implemented in the same open-source framework, and we directly use the test caption results provided for them. The other models are implemented by ourselves, and we make sure that their caption quality is close to the results reported in the original papers. Caption quality is evaluated with the official COCO evaluation tool.

GAIC model and debiasing approaches:

We select a subset of the original training set that contains 4,000 images for each gender. The baseline Att model is trained on COCO for 5 epochs. For the Balanced baseline, we directly fine-tune the original Att model on this subset. For the PixelSup baseline, we fine-tune the model on the subset with extra person segmentation annotations. For the GAIC model, we use the two-stream pipeline to fine-tune the model on the subset, and set

. For the GAIC_es model (GAIC with extra supervision), we fine-tune the model on the subset with extra person segmentation annotations, and set and . All of the above-mentioned models are fine-tuned for 1 extra epoch on the subset.
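Building the gender-balanced fine-tuning subset can be sketched as follows. The text does not say how the 4,000 images per gender are chosen, so uniform random sampling with a fixed seed is an assumption here, and the function name `balanced_subset` and the input format (a mapping from image id to gender label) are hypothetical.

```python
import random


def balanced_subset(labels, per_gender=4000, seed=0):
    """Pick `per_gender` image ids for each gender label.

    labels: dict mapping image id -> "women" or "men".
    Sampling uniformly at random is an assumption, not the paper's stated method.
    """
    rng = random.Random(seed)
    subset = []
    for gender in ("women", "men"):
        ids = sorted(i for i, g in labels.items() if g == gender)
        if len(ids) < per_gender:
            raise ValueError(f"only {len(ids)} images labeled {gender!r}")
        subset.extend(rng.sample(ids, per_gender))
    return subset


# Toy labels: 10 "women" images and 20 "men" images.
toy_labels = {i: ("women" if i % 3 == 0 else "men") for i in range(30)}
toy_subset = balanced_subset(toy_labels, per_gender=5)
```

Sorting the ids before sampling keeps the result reproducible for a given seed, independent of dictionary insertion order.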

Appendix E More Qualitative Results

Figure 9: Qualitative comparison of baselines and our proposed model. At the top, we show success cases in which our proposed models predict the correct gender and use the correct visual evidence. The bottom case shows that when gender evidence is vague, our model tends to use gender-neutral words, such as "person", to describe the person.