Automatically understanding and describing visual content is an important and challenging interdisciplinary research topic Fang et al. (2015); Hossain et al. (2019); Karpathy and Fei-Fei (2015); Mao et al. (2014); Vinyals et al. (2015); Xu et al. (2015). Over the past years, much effort has been dedicated to improving overall caption quality, without considering the unwanted bias learned by models. In this work, we focus on gender bias and investigate the gender words (e.g., woman and man) generated by learning models. A biased model may rely on incorrect visual features to produce gender descriptions, e.g., describing a person as a woman because of the kitchen background, which leads to incorrect gender predictions when the model is evaluated on samples that do not match the learned prior.
Gender bias widely exists in captioning models, mainly for two reasons. First, many caption datasets contain severe gender bias to begin with. For example, the COCO dataset Lin et al. (2014) has an imbalanced 3:1 men-to-women ratio, and the object-gender joint distribution further exacerbates the imbalance, e.g., 95% of skateboard images co-occur with men. A biased dataset not only results in a biased model, but also makes it hard to detect the bias learned by models. For instance, in the COCO dataset, images with women comprise such a small portion that gender prediction errors for women can be overlooked. Second, the existing training paradigm makes models prone to replicating bias in datasets. Our preliminary experimental results show that most models inevitably learn the bias present in the training dataset.
A straightforward solution to mitigating gender bias is to train the model on a gender-balanced dataset. Unfortunately, our experimental results indicate that simply balancing the number of images for each gender yields limited improvement in bias mitigation. From the training perspective, an alternative approach is to increase the loss weight of gender words, which also does not achieve satisfactory results. Beyond designing mitigation algorithms, a further challenge is that gender bias may be underestimated when models are evaluated on a test set whose bias resembles that of the training samples.
To bridge the gap, we propose new benchmark datasets to quantify gender bias. In this work, we focus on the COCO captioning dataset and create two new splits. COCO-GB v1 is designed to systematically measure gender bias in existing models. COCO-GB v2 is designed to further assess the capability of models to mitigate gender bias. We also propose a novel Guided Attention Image Captioning model (GAIC) to mitigate gender bias. GAIC has two complementary streams that encourage the model to explore correct gender features, and its training pipeline can seamlessly incorporate extra supervision to accelerate the self-exploration process. Moreover, GAIC is model-agnostic and can be easily applied to various captioning models.
We report the performance of several baseline models on our new benchmark datasets. The key observation is that most models learn or even exaggerate the gender bias contained in the training data, which causes a significantly higher gender prediction error for women. Another important finding is that common evaluation metrics, such as BLEU Papineni et al. (2002) and CIDEr Vedantam et al. (2015), mainly focus on overall caption quality and are not sensitive to gender errors. Our experimental results show that a model can achieve competitive caption quality scores even when it misclassifies 27% of women as men, which further demonstrates that the high performance achieved by current models should be revisited. Experimental results validate that our proposed GAIC method can significantly reduce gender prediction error while maintaining competitive caption quality. Visualization evaluations further indicate that GAIC learns to utilize correct gender evidence for gender word generation.
2 COCO-GB: Dataset Creation and Analysis
The COCO-GB datasets are created for quantifying gender bias in models. We construct COCO-GB v1 based on a widely used split and create a gender-balanced secret test dataset. COCO-GB v2 is created by reorganizing the train/test split so that the gender-object joint distribution in the training set differs substantially from that of the test set. The dataset creation procedures are as follows.
An important reason why previous work overlooks the gender bias problem is that the COCO dataset does not have explicit gender annotations. Hence our first step is to annotate the gender of people in the images. Because many images do not contain a clear human face, standard face recognition models cannot provide reliable gender predictions. Alternatively, inspired by Hendricks et al. (2018), we utilize the 5 training captions available for each image to annotate gender. Images are labeled as "men" if at least one of the five descriptions contains male gender words and none includes female gender words. Similarly, images are labeled as "women" if at least one of the five descriptions contains female gender words and none includes male gender words. Images whose captions mention both "man" and "woman" are discarded. To further improve labeling accuracy, we only consider images that contain one major person, and we ignore images in which the person is too small. (The gender word list and examples of gender labeling are shown in Sec. A.1.)
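The caption-based labeling rule above can be sketched as follows. The gender word lists here are short illustrative placeholders, not the full lists from Sec. A.1:

```python
# Illustrative subsets of the gender word lists (full lists in Sec. A.1).
MALE_WORDS = {"man", "men", "male", "boy", "boys", "gentleman", "he", "his"}
FEMALE_WORDS = {"woman", "women", "female", "girl", "girls", "lady", "she", "her"}

def label_gender(captions):
    """Label an image 'man', 'woman', or None from its training captions.

    An image gets a gender label iff at least one caption mentions that
    gender and no caption mentions the other gender; otherwise it is
    discarded (None).
    """
    tokens = [set(c.lower().replace(".", "").split()) for c in captions]
    has_male = any(t & MALE_WORDS for t in tokens)
    has_female = any(t & FEMALE_WORDS for t in tokens)
    if has_male and not has_female:
        return "man"
    if has_female and not has_male:
        return "woman"
    return None  # both genders (or neither) mentioned -> discard image
```

A real implementation would additionally filter out images with multiple people or a very small person, as described above.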
COCO-GB v1: COCO-GB v1 is created for evaluating gender bias in existing models. Hence we follow the widely used Karpathy split Karpathy and Fei-Fei (2015) and create a gender-balanced secret test dataset. Captioning models trained on the Karpathy split can directly evaluate their gender bias on this secret test set without retraining. Compared to previous work Hendricks et al. (2018), which only balances the number of images per gender, we further balance the gender-object joint distribution in the secret dataset. In this way, we can comprehensively expose the bias learned by the model. We compute the gender bias towards men for each object category using the metric proposed in Zhao et al. (2017):
$$\mathrm{bias}(o) = \frac{c(o, \text{men})}{c(o, \text{men}) + c(o, \text{women})} \qquad (1)$$

where $c(o, \text{men})$ and $c(o, \text{women})$ denote the numbers of images containing object $o$ that are labeled as "men" and "women", respectively. Ideally, an object with a bias ratio of 0.5 indicates that women and men have the same probability of co-occurring with it. We show the gender bias of several objects in Fig. 1. Our statistical analysis shows that the COCO training dataset has severe gender bias: the training data have an average bias ratio of 0.65, and 90% of objects are biased towards men. Similar bias is also found in the original test split, which demonstrates that directly evaluating models on the original test dataset can underestimate the bias learned by models. We then utilize a greedy algorithm to select a secret test dataset from the original test split so that each category has a nearly balanced gender ratio. The selected secret test dataset has 500 images for each gender and an average bias ratio of 0.52. We utilize this COCO-GB v1 dataset to evaluate existing models and quantify the unwanted gender bias they capture.
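The per-object bias ratio and the dataset-level statistics above (average ratio, fraction of objects skewed towards men) reduce to simple co-occurrence counting; a minimal sketch, with illustrative counts:

```python
def bias_ratio(men_count, women_count):
    """Fraction of images containing an object that are labeled 'men'.

    0.5 means the object co-occurs equally with both genders;
    values above 0.5 indicate a bias towards men.
    """
    return men_count / (men_count + women_count)

def dataset_bias(cooccurrence):
    """cooccurrence: {object: (men_count, women_count)} -> summary stats.

    Returns per-object ratios, the average ratio, and the fraction of
    objects biased towards men (ratio > 0.5).
    """
    ratios = {o: bias_ratio(m, w) for o, (m, w) in cooccurrence.items()}
    avg = sum(ratios.values()) / len(ratios)
    frac_male_biased = sum(r > 0.5 for r in ratios.values()) / len(ratios)
    return ratios, avg, frac_male_biased
```

For example, a skateboard category with 95 men images and 5 women images has `bias_ratio(95, 5) == 0.95`, matching the 95% figure quoted above.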
COCO-GB v2: This dataset is designed to further assess the robustness of captioning models when they are exposed to novel gender-object pairs at test time. To create the new split, we first sort the 80 object categories in COCO according to their gender bias. Unlike the balanced test dataset in COCO-GB v1, we start from the most biased object and greedily add selected data into the test set. As a result, the distribution difference between the training and test datasets is dramatically enlarged. We guarantee that there are sufficient images from each category during training, but at test time the model faces novel compositions of gender-object pairs, e.g., a woman with a skateboard. The final split has 118,062/5,000/10,000 images in train/val/test, respectively. We utilize this dataset to further evaluate our proposed bias mitigation approach. (The gender-object joint distributions of COCO, COCO-GB v1, and COCO-GB v2 are shown in Sec. A.2.)
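The greedy construction of COCO-GB v2 can be sketched as below. The scoring rule and field names are illustrative assumptions, not the exact implementation: images whose gender contradicts the training-set bias of their objects (e.g., woman + skateboard) are sent to the test set first.

```python
def greedy_split(images, category_bias, test_quota):
    """Sketch: route 'novel' gender-object pairs into the test set.

    images: dicts with 'gender' in {'man', 'woman'} and 'categories'
    (a set of object names); category_bias: {object: bias ratio towards men}.
    """
    def novelty(img):
        # How strongly does this image contradict the dataset bias?
        # A woman with a male-biased object (ratio >> 0.5) scores high,
        # and symmetrically for a man with a female-biased object.
        scores = []
        for c in img["categories"]:
            r = category_bias.get(c, 0.5)
            scores.append(r - 0.5 if img["gender"] == "woman" else 0.5 - r)
        return max(scores, default=0.0)

    ranked = sorted(images, key=novelty, reverse=True)
    return ranked[test_quota:], ranked[:test_quota]  # train, test
```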
3 Benchmarking Captioning Models on COCO-GB v1
To reveal the gender bias in existing models, we utilize gender prediction performance to quantify the bias learned by models Hendricks et al. (2018); Zhao et al. (2017). Models are trained on the Karpathy split, obtain caption quality scores from the original test split, and evaluate gender prediction performance on the COCO-GB v1 secret test dataset.
Baselines: We consider models both with and without an attention mechanism, where "attention" refers to a model's ability to focus on specific image regions Rohrbach et al. (2018). For non-attention models, we consider FC Rennie et al. (2017), which initializes the LSTM with features extracted directly by a CNN, and LRCN Donahue et al. (2015), which leverages visual information at each time step. For attention models, we select Att Xu et al. (2015), which first applied the visual attention mechanism to caption generation; AdapAtt Lu et al. (2017), which automatically determines when and where to utilize visual attention; and Att2in Rennie et al. (2017), which modifies the architecture of Att and inputs the attention features only to the cell node of the LSTM. Besides, we also choose models that utilize extra visual grounding information. TopDown Anderson et al. (2018) proposes a specific "top-down attention" mechanism based on the Faster R-CNN model Ren et al. (2015). NBT Lu et al. (2018) generates a sentence "template" with slot locations and fills the slots using an object detection model.
Learning Objective and Implementation: Baselines are trained with cross-entropy loss. In addition, the FC, LRCN, Att2in, and TopDown models are also trained with self-critical loss Rennie et al. (2017), which uses reinforcement learning to directly optimize the non-differentiable CIDEr metric. For a fair comparison, all baselines utilize visual features extracted by the ResNet-101 network; the NBT and TopDown models obtain extra grounding information from the Faster R-CNN model.
Results analysis: In Tab. 1, we report caption quality as well as gender prediction performance, where gender is predicted correctly, incorrectly, or "neutrally" (when no gender-specific words are generated). An unbiased model should have similar outcomes and a low error rate for each gender. From the results, we reach the following conclusions.
Across all models, the error rate for women is substantially higher than for men: the average error rate over all models is 16.7% for women and 7.7% for men. An interesting finding is that models with an attention mechanism have much higher women error rates (average of 22.6%) compared to non-attention models (average of 15.8%). One possible explanation is that the attention mechanism enhances the model's ability to capture visual features, which also makes models more prone to learning visual bias. Models utilizing extra visual grounding information have a much lower women error rate (average of 9.1%), which suggests that the extra visual features provided by Faster R-CNN are relatively unbiased. The shortcoming is that the NBT and TopDown models require training multiple sub-models, and their gender accuracy depends heavily on the extra object detection model.
Models with a high gender error rate can still receive competitive caption quality scores. For example, the AdapAtt model obtains decent performance on three quality metrics but has the highest women error rate (26.8%) among all models, meaning it misclassifies 26.8% of images with women as men. These experimental results demonstrate that current caption quality metrics mainly focus on overall quality and are not sensitive to gender prediction errors.
Self-critical loss can improve models' overall caption quality but also amplifies the gender error rate. After training with self-critical loss, the CIDEr metric obtains an average improvement of 6.2%. However, the error rate for women in the FC, Att2in, and TopDown models increases by an average of 5.1%, and the error rate for men in the LRCN model increases by 4.2%.
4 Image Captioning Model with Guided Attention
We propose a novel Guided Attention Image Captioning model (GAIC) to mitigate gender bias by self-supervising the model's visual attention. The attention mechanism has been widely used in image captioning and significantly improves the quality of generated descriptions Anderson et al. (2018); Lu et al. (2017); Xu et al. (2015). However, most existing work uses visual attention mainly to improve overall caption quality; few studies have considered the potential bias learned by the attention mechanism. GAIC considers a setting with no grounded supervision, in which the model explores the correct gender evidence by itself. A variant of GAIC further considers a semi-supervised scenario in which a small amount of labeled data is available and fine-tunes the model's attention with this extra supervision (Sec. 4.2).
4.1 Self-Guidance on the Caption Attention
To achieve self-supervision, we design a two-stream training pipeline. As shown in Fig. 2, GAIC contains a caption generation stream and a gender evidence mining stream, which share the same parameters. The purpose of the caption generation stream is to generate high-quality descriptions as well as to synthesize attention maps for gender words. The gender evidence mining stream then forces the attention to focus on the correct gender evidence. The two complementary streams encourage the attention regions of gender words to gradually move towards correct gender features, such as the person's appearance, and away from biased features, e.g., laptops and umbrellas.
In the caption generation stream, given an input image $I$, the captioning model generates the corresponding description, which is supervised by the Language Quality Loss $L_q$, such as cross-entropy loss. At the same time, the model produces a visual attention map for each inferred word. We can directly obtain attention maps from models with an attention mechanism; for non-attention models, attention maps can be obtained by post-hoc interpretation methods such as Grad-CAM Selvaraju et al. (2017) and saliency maps Simonyan et al. (2014). Here we focus on the attention map of the gender word, denoted by $A^g$. We utilize $A^g$ to generate a soft mask and apply the mask to the original input image; in this way, the features related to gender inference are removed from the image. Let $I^{*}$ denote the image without the gender features captured by $A^g$, defined as follows:

$$I^{*} = I \odot \big(1 - T(A^g)\big) \qquad (2)$$
where $\odot$ denotes element-wise multiplication and $T(\cdot)$ is a masking function based on a thresholding operation. To normalize it, we use the Sigmoid function as an approximation, defined in Eq. 3:

$$T(A^g) = \mathrm{Sigmoid}\big(\omega (A^g - \Sigma)\big) \qquad (3)$$
where $\Sigma$ is a matrix whose elements all equal a threshold $\sigma$, and $\omega$ is a scale parameter ensuring that $T(A^g)$ approximately equals 1 where $A^g$ is larger than $\sigma$, and 0 otherwise. We feed $I^{*}$ into the gender evidence mining stream and generate the corresponding caption. The loss on $I^{*}$ is denoted as the Gender Evidence Loss $L_g$, defined as follows:

$$L_g = -\sum_{t} \log p\big(w_t \mid I^{*}, w_{1:t-1}\big) \qquad (4)$$
where $w_t$ denotes the $t$-th caption word, and $w_n$, $w_f$, and $w_m$ refer to neutral, female, and male gender words, respectively. We replace $w_t$ with $w_n$ if $w_t$ belongs to the male or female gender words. With the guidance of $L_g$, the GAIC model learns to focus on correct gender evidence. Suppose the caption generation stream utilizes biased gender features for gender prediction; then $I^{*}$, generated from $A^g$, will have those biased visual features removed, e.g., a laptop. The gender evidence mining stream will therefore generate an incomplete caption because important features are missing, and it receives a penalty from $L_g$. The loss $L_g$ can be minimized only when the model focuses on correct gender features such as the person's appearance (an example is shown in Fig. 2).
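A numerical sketch of the masking step: soft-threshold the gender-word attention map with a sharp Sigmoid, then erase those regions from the image. For simplicity a grayscale image is represented as a 2-D list; `sigma` (threshold) and `omega` (scale) stand for the hyperparameters named above.

```python
import math

def soft_mask(attn, sigma=0.5, omega=100.0):
    """Sharp-sigmoid approximation of thresholding (Eq. 3): ~1 where
    attention exceeds sigma, ~0 elsewhere, but differentiable."""
    return [[1.0 / (1.0 + math.exp(-omega * (a - sigma))) for a in row]
            for row in attn]

def mask_gender_features(image, attn, sigma=0.5, omega=100.0):
    """Masked image: I * (1 - T(A_g)), element-wise; regions the model
    attends to for the gender word are zeroed out."""
    t = soft_mask(attn, sigma, omega)
    return [[p * (1.0 - ti) for p, ti in zip(ri, rt)]
            for ri, rt in zip(image, t)]
```

A pixel under high attention (0.9) is almost fully erased, while a pixel under low attention (0.1) is left nearly untouched.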
Besides capturing correct gender features, we also expect the model to predict cautiously and use gender-neutral words such as people and person when the gender evidence is vague. Since $I^{*}$ has removed most gender features, we correspondingly change the gender words in the target caption to their neutral counterparts, e.g., "woman"/"man" → "person" and "women"/"men" → "people".
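The neutral-word replacement amounts to a token-level substitution over the target caption; the mapping below is illustrative (the full gender word lists are in Sec. A.1):

```python
# Illustrative gender-to-neutral mapping (full word lists in Sec. A.1).
NEUTRAL = {"man": "person", "woman": "person",
           "men": "people", "women": "people",
           "boy": "person", "girl": "person"}

def neutralize(caption):
    """Replace gender words with neutral words, token by token."""
    return " ".join(NEUTRAL.get(w, w) for w in caption.split())
```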
Finally, we combine $L_q$ and $L_g$ as the self-guidance gender discrimination loss:

$$L_{self} = L_q + \lambda L_g \qquad (5)$$
where $\lambda$ is the weighting parameter, which we keep fixed in all our experiments. With the joint optimization of $L_{self}$, the model learns to generate high-quality captions while focusing on the real visual features that contribute to gender recognition.
4.2 Integrating with Extra Supervision
In addition to self-exploration training, we also consider a semi-supervised scenario where a small amount of extra labeled data is added to accelerate the self-exploration process. More specifically, we utilize pixel-level person segmentation masks to regularize the attention, and we denote this supervised variant as GAIC with extra supervision. It treats the masks as a boundary regularization and forces the attention of gender words to stay within the mask. In addition to $L_{self}$, the variant has another loss, the Gender Attention Loss $L_a$, for attention supervision, defined as follows:

$$L_a = \sum_{i,j} A^g_{ij}\,(1 - M_{ij}) \qquad (6)$$
where $M$ is a binary mask in which 1 indicates pixels belonging to a person and 0 represents the background. The loss $L_a$ forces the attention maps of gender words to remain within the mask. The final loss for the supervised variant is defined as follows:

$$L = L_{self} + \beta L_a \qquad (7)$$
where $\beta$ is the weighting parameter that adjusts the strength of $L_a$. Since labeling pixel-level segmentation masks is extremely time-consuming, we use only a small amount of data with external supervision, e.g., 10% in our experiments. In Fig. 2, this supervised stream is shown alongside the other two; all three streams share the same parameters and are optimized in an end-to-end manner.
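A minimal sketch of the Gender Attention Loss as described above: attention mass of a gender word falling outside the person segmentation mask is penalized. The exact formulation in the model may differ; this is an assumption consistent with the boundary regularization just described, with the combined objective as a weighted sum of the three losses.

```python
def gender_attention_loss(attn, person_mask):
    """Sum of gender-word attention weights outside the person mask.

    attn: HxW attention map (nested lists); person_mask: HxW binary mask
    where 1 marks person pixels, 0 marks background.
    """
    return sum(a * (1 - m)
               for row_a, row_m in zip(attn, person_mask)
               for a, m in zip(row_a, row_m))

def total_loss(l_quality, l_gender, l_attn, lam=1.0, beta=1.0):
    """Combined objective sketch: L = L_q + lam * L_g + beta * L_a."""
    return l_quality + lam * l_gender + beta * l_attn
```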
5 Experiments

5.1 Settings and Baselines
The Att model Xu et al. (2015) is a classic captioning model, and many captioning models are developed from its architecture. Our benchmarking experiments indicate that the Att model has severe gender bias. We therefore adopt Att both as the baseline and as the base model on which GAIC is built. For the supervised variant, we add 10% of images with person segmentation masks as extra supervision and fix the weighting parameter $\beta$. We compare our proposed model with the following common debiasing approaches.
Balanced: A subset with a 1:1 men-to-women ratio is selected from the training data, and a new model is trained on this instance-level balanced dataset. In order to obtain a subset with a sufficient amount of training data, we can only balance the number of images per gender. Note that unless we rebuild the whole dataset and add more images with women, the gender bias in the gender-object joint distribution cannot be effectively removed.
UpWeight: We also conduct an experiment in which we amplify the weight of the gender words' cross-entropy loss during training. Specifically, we label the gender word position for each sentence in the training dataset and train the model by multiplying the loss of gender words by a constant factor of 10. Intuitively, this encourages models to predict gender words accurately.
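The UpWeight baseline amounts to scaling the cross-entropy terms at gender word positions; a minimal sketch, taking per-token ground-truth probabilities and the labeled gender positions as inputs (the function name and interface are our own):

```python
import math

def weighted_caption_loss(token_probs, gender_positions, weight=10.0):
    """Cross-entropy over a caption, upweighting gender word positions.

    token_probs: model probability assigned to each ground-truth token;
    gender_positions: set of indices of gender words in the caption.
    """
    loss = 0.0
    for i, p in enumerate(token_probs):
        w = weight if i in gender_positions else 1.0
        loss += -w * math.log(p)
    return loss
```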
Pixel-level Supervision (PixelSup): As a variant of the GAIC model, we remove the self-exploration streams and directly use Eq. 6 to fine-tune the model's attention maps with the 10% extra data.
5.2 Evaluation Metrics
Gender Accuracy: Unlike traditional binary gender classification tasks, the prediction of a captioning model can be correct, wrong, or neutral. A model should first have a low error rate. On this basis, to make its decisions more discriminative, a captioning model should also obtain a low neutral rate, generating gender-neutral words only when the visual evidence is too vague to distinguish gender.
Besides high accuracy, we also expect our model to treat each gender fairly, so that the outcome distributions for the two genders should be similar. Here, we utilize the Cosine Similarity between the correct/wrong/neutral rates for men and women to measure the fairness of the model, which resembles the fairness definition of Equality of Odds Hardt et al. (2016).

Attention Correctness: To measure whether attention focuses on the correct gender features, we compare the attention maps of gender words with the person segmentation masks. We adopt two evaluation metrics: Pointing Game Zhang et al. (2018), which counts whether the highest attention point is contained in the person mask, and Attention Sum Liu et al. (2017), which calculates the attention weights inside the mask.
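The fairness and attention-correctness measures above can be sketched as follows; inputs are simplified (flat lists for the attention map and mask), and the function names are our own:

```python
import math

def outcome_similarity(rates_women, rates_men):
    """Cosine similarity between two (correct, wrong, neutral) rate
    vectors; 1.0 means both genders get identical outcome distributions."""
    dot = sum(a * b for a, b in zip(rates_women, rates_men))
    norm_w = math.sqrt(sum(a * a for a in rates_women))
    norm_m = math.sqrt(sum(b * b for b in rates_men))
    return dot / (norm_w * norm_m)

def pointing_game_hit(attn, person_mask):
    """Pointing Game: True iff the single highest attention point lies
    inside the person mask (flattened map and mask of equal length)."""
    flat_idx = max(range(len(attn)), key=lambda i: attn[i])
    return bool(person_mask[flat_idx])
```

Attention Sum would analogously accumulate `attn[i] * person_mask[i]` over all positions.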
5.3 Experimental Results
Gender Accuracy: We report gender prediction performance on COCO-GB v1 in Tab. 2. Our key observation is that GAIC significantly improves gender prediction performance compared to the baseline: the gender accuracy for women increases from 42.8% to 62.0%, and the error rate for women reduces from 24.7% to 16.9%. Although the UpWeight method obtains the highest accuracy for both women and men, it also causes a significantly higher error rate for each gender. There is no substantial difference between the Balanced model and the baseline model; a similar trend was found by Bolukbasi et al. (2016), which indicates that models learn gender bias mainly at the feature level (gender-object co-occurrence), so balancing the number of images per gender cannot remove the bias in the dataset. PixelSup obtains sensible improvements, which indicates that supervising directly on attention maps is also helpful. The supervised variant obtains consistently better performance than the GAIC model trained with self-guidance alone, and we empirically find that adding extra supervision accelerates the self-exploration process and makes training more stable. For the fairness evaluation, we compare the gender divergence of different models. GAIC and its supervised variant obtain the lowest divergence, which indicates that they treat each gender in a fair manner.
Experiments on COCO-GB v2 show trends very similar to those on COCO-GB v1 (a detailed analysis of the COCO-GB v2 results is in Sec. C). All baselines on COCO-GB v2 receive worse gender prediction results, mainly because the unseen gender-object pairs in the test dataset increase the prediction difficulty. In comparison, our proposed GAIC model and its supervised variant obtain performance comparable to that on COCO-GB v1, which further demonstrates the robustness of the self-exploration training strategy.
Caption Quality: Besides high gender accuracy, we also expect our model to obtain decent caption quality. We use METEOR (M) and CIDEr (C) to evaluate caption quality. As the results in Tab. 2 show, GAIC and its supervised variant cause only a minor performance drop compared to the baseline (from 24.8 to 24.7 on METEOR and from 95.1 to 94.6 on CIDEr). In Fig. 3, examples show that the sentences generated by GAIC and its supervised variant are linguistically fluent with more correct gender descriptions.
Attention Correctness: To quantitatively evaluate attention correctness, we extract attention maps of gender words and compare them with person segmentation masks. Quantitative results are shown in Tab. 3. We observe that GAIC and its supervised variant receive consistent improvements over the baseline model and all model variants, which indicates that our proposed models focus on the described person when predicting gender words. A qualitative comparison is shown in Fig. 3. We observe that the baseline model may utilize biased visual features for gender prediction and thus make incorrect predictions, e.g., describing a woman as a boy because of the tennis ball. In comparison, GAIC and its supervised variant correctly focus on the described person for gender prediction.
6 Related Work
Gender Bias in Datasets: Gender bias in datasets has been studied in a variety of domains Bender and Friedman (2018); Bolukbasi et al. (2016); Bordia and Bowman (2019); Brunet et al. (2019); Buolamwini and Gebru (2018); Font and Costa-jussà (2019), especially in natural language processing Hendricks et al. (2018); Lambrecht and Tucker (2019); Rudinger et al. (2018); Stock and Cisse (2018); Vanmassenhove et al. (2018); Zhao et al. (2019, 2017). For image captioning, several studies have considered the gender bias problem in datasets. Mithun et al. (2018) analyzes the stereotypes in the Flickr30k dataset and mentions gender-related issues. Hendricks et al. (2018) indicates that caption annotators for the COCO dataset tend to infer a person's gender from the image context when the gender cannot be confirmed, e.g., a baseball player is labeled as "man" even if the gender evidence is occluded.
Vision and Language Bias: Bias in captioning models can be divided into vision bias and language bias Rohrbach et al. (2018). Vision bias refers to learning wrong visual evidence, while language bias refers to capturing unwanted language priors. For example, the phrase "on the beach" often follows the word "surfboard." Due to the RNN's recurrent mechanism, models can learn this language prior and always generate the phrase "on the beach" once the word "surfboard" has been inferred. We notice that gender words are usually mentioned at the beginning of the sentence (on average at position 2, with an average sentence length of 9), and the words before gender words, e.g., "a" and "the," carry no gender preference. Hence, gender bias in captioning systems should mainly come from the vision part.
Mitigating Gender Bias: A few initial attempts have been made to design captioning models that overcome gender bias in datasets. One solution is to break the task into two steps Bhargava (2019): first locate and recognize the person in the image, then let a language model utilize the grounded information to generate captions. The gender accuracy of this approach therefore depends heavily on the extra object detection model. In another work, two novel losses are designed to reduce unwanted bias towards specific words, including gender words; this approach requires segmentation masks for each image, which is costly and impractical for many datasets Hendricks et al. (2018).
7 Conclusion

In this paper, two novel COCO splits are created for studying the gender bias problem in the image captioning task. We provide extensive baseline experiments for benchmarking different models and training strategies, as well as a comprehensive analysis of the dataset. Our experimental results indicate that many captioning models have a severe gender bias problem, leading to an undesirable gender prediction error towards women. We propose a novel training framework, GAIC, which can significantly reduce gender bias through self-guided supervision. Moreover, the GAIC model can seamlessly incorporate extra supervision to further improve gender prediction accuracy. Quantitative and qualitative results further validate that our proposed model focuses on correct visual evidence for gender prediction.
8 Broader Impact
In this work, we reveal the severe gender bias problem that widely exists in captioning models, which leads to an undesirable gender prediction error towards women. The experimental results remind researchers to revisit the high performance achieved by current captioning models and encourage the community to put more effort into promoting the fairness of captioning systems. The two novel COCO splits proposed in this work enable future work to efficiently quantify gender bias in such systems, which will help promote fairer platforms for emerging and future computer vision and natural language processing systems. The self-exploration training strategy proposed in this work can significantly reduce gender bias in learning models, and understanding how interpretation methods such as the attention mechanism can be utilized to further advance fairness will broadly impact the machine learning field.
This work also plays an integral part in educating and training students. The research will be tightly integrated with related courses on data science at the authors' university. The courses will show how bias in data can be learned by models and potentially lead to unintentional discrimination in learning models, and will include a section that teaches students to design fair machine learning algorithms. We will actively encourage undergraduate participation in this bias mitigation project.
-  (2018) Bottom-up and top-down attention for image captioning and visual question answering. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6077–6086. Cited by: §3, §4.
-  (2018) Data statements for natural language processing: toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics 6, pp. 587–604. Cited by: §6.
-  (2019) Exposing and correcting the gender bias in image captioning datasets and models. Ph.D. Thesis. Cited by: §6.
-  (2016) Man is to computer programmer as woman is to homemaker? debiasing word embeddings. In Advances in neural information processing systems, pp. 4349–4357. Cited by: §5.3, §6.
-  (2019) Identifying and reducing gender bias in word-level language models. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop, pp. 7–15. Cited by: §6.
-  (2019) Understanding the origins of bias in word embeddings. In International Conference on Machine Learning, pp. 803–811. Cited by: §6.
-  (2018) Gender shades: intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pp. 77–91. Cited by: §6.
-  (2015) Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2625–2634. Cited by: §3.
-  (2015) From captions to visual concepts and back. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1473–1482. Cited by: §1.
-  (2019) Equalizing gender bias in neural machine translation with word embeddings techniques. In Proceedings of the First Workshop on Gender Bias in Natural Language Processing, pp. 147–154. Cited by: §6.
-  (2016) Equality of opportunity in supervised learning. In Advances in neural information processing systems, pp. 3315–3323. Cited by: §5.2.
-  (2018) Women also snowboard: overcoming bias in captioning models. In European Conference on Computer Vision, pp. 793–811. Cited by: §2, §2, §3, §6, §6.
-  (2019) A comprehensive survey of deep learning for image captioning. ACM Computing Surveys (CSUR) 51 (6), pp. 118. Cited by: §1.
-  (2015) Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3128–3137. Cited by: §1, §2.
-  (2019) Algorithmic bias? an empirical study of apparent gender-based discrimination in the display of stem career ads. Management Science 65 (7), pp. 2966–2981. Cited by: §6.
-  (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §1.
-  (2017) Attention correctness in neural image captioning. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §5.2, Table 3.
-  (2017) Knowing when to look: adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 375–383. Cited by: §3, §4.
-  (2018) Neural baby talk. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7219–7228. Cited by: §3.
-  (2014) Deep captioning with multimodal recurrent neural networks (m-rnn). arXiv preprint arXiv:1412.6632. Cited by: §1.
-  (2018) Webly supervised joint embedding for cross-modal image-text retrieval. In Proceedings of the 26th ACM international conference on Multimedia, pp. 1856–1864. Cited by: §6.
-  (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §1.
-  (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §3.
-  (2017) Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7008–7024. Cited by: §3, §3.
-  (2018) Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4035–4045. Cited by: §3, §6.
-  (2018) Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 8–14. Cited by: §6.
-  (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: §4.1.
-  (2014) Deep inside convolutional networks: visualising image classification models and saliency maps. Cited by: §4.1.
-  (2018) Convnets and imagenet beyond accuracy: understanding mistakes and uncovering biases. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 498–512. Cited by: §6.
-  (2018) Getting gender right in neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3003–3008. Cited by: §6.
-  (2015) Cider: consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575. Cited by: §1.
-  (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164. Cited by: §1.
-  (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: Appendix B, §1, §3, §4, §5.1.
-  (2018) Top-down neural attention by excitation backprop. International Journal of Computer Vision 126 (10), pp. 1084–1102. Cited by: §5.2, Table 3.
-  (2019) Gender bias in contextualized word embeddings. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 629–634. Cited by: §6.
-  (2017) Men also like shopping: reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2979–2989. Cited by: §2, §3, §6.
Appendix A COCO-GB dataset
A.1 Gender Annotation
We show the gender word list in Tab. 4. Gender words are selected based on their frequency in the COCO dataset, and less frequent gender words are removed. The words "woman" and "man" are the most frequent gender-specific words and account for more than 60% of all gender-specific word occurrences.
| female words | woman, women, girl, sister, daughter, wife, girlfriend |
| male words | man, men, boy, brother, son, husband, boyfriend |
| gender neutral words | people, person, human, baby |
Some labeled examples are shown in Fig. 4. We label an image as "women" when at least one caption mentions a female word, and as "men" when at least one caption mentions a male word. Images whose captions mention both male and female words are discarded. In Fig. 4 (c), we show that when gender evidence is occluded, annotators may provide a gender prediction based on context cues or social stereotypes. Hence our gender annotations may contain this kind of social bias.
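The labeling rule above can be sketched in a few lines of Python. The word lists follow Tab. 4; the input format (a list of caption strings per image) and the helper name `label_image` are illustrative assumptions, not the actual annotation script:

```python
# Sketch of the caption-based gender annotation rule described above.
# Word lists follow Tab. 4; "captions" is a list of caption strings per image.

FEMALE = {"woman", "women", "girl", "sister", "daughter", "wife", "girlfriend"}
MALE = {"man", "men", "boy", "brother", "son", "husband", "boyfriend"}

def label_image(captions):
    """Return 'women'/'men' for an image, or None when the captions
    mention both genders (discarded) or no gender word at all."""
    words = {w.strip(".,").lower() for c in captions for w in c.split()}
    has_f, has_m = bool(words & FEMALE), bool(words & MALE)
    if has_f and has_m:
        return None          # mixed-gender images are discarded
    if has_f:
        return "women"
    if has_m:
        return "men"
    return None              # no gender-specific word mentioned
```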
A.2 COCO Gender Distribution
We show the gender-object joint distributions of the COCO training set, the COCO test set, the COCO-GB v1 secret test set, and the COCO-GB v2 test set as follows. Objects are sorted by their bias rate in the training set. For the COCO-GB dataset, we choose 63 of the 80 objects in the COCO dataset according to their image numbers. We observe that COCO-GB v1 has a balanced gender-object joint distribution, while COCO-GB v2 has a distribution opposite to the training set.
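A per-object bias rate of the kind used to sort objects above could be computed as follows. This is a minimal sketch under the assumption that the bias rate of an object is the fraction of its images labeled "men" (the 95% skateboard-men co-occurrence mentioned in the introduction suggests such a definition); the function name and input format are hypothetical:

```python
from collections import defaultdict

def bias_rates(samples):
    """samples: iterable of (objects_in_image, gender_label) pairs,
    with gender_label in {'men', 'women'}.  Returns, per object, the
    fraction of that object's images labeled 'men' (assumed bias rate)."""
    men = defaultdict(int)
    total = defaultdict(int)
    for objs, gender in samples:
        for obj in objs:
            total[obj] += 1
            if gender == "men":
                men[obj] += 1
    return {obj: men[obj] / total[obj] for obj in total}

# Objects can then be sorted by how skewed they are toward one gender:
# sorted(rates, key=rates.get, reverse=True)
```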
Appendix B Caption Generation with Visual Attention
Given an image $I$ and the corresponding caption $S = (S_1, \dots, S_T)$, the objective of an encoder-decoder image captioning model is to maximize the following formula:

$$\theta^* = \arg\max_{\theta} \sum_{(I,S)} \log p(S \mid I; \theta)$$

$\theta$ are the trainable parameters of the captioning model. We utilize the chain rule to decompose the joint probability distribution into ordered conditionals:

$$\log p(S \mid I) = \sum_{t=1}^{T} \log p(S_t \mid S_{1:t-1}, I)$$

A recurrent neural network (RNN) predicts each conditional probability as follows:

$$p(S_t \mid S_{1:t-1}, I) = f(h_t, c_t)$$

where $f$ is a nonlinear function, and we adopt the classic LSTM as $f$ in this paper. $h_t$ is the hidden state of the RNN at step $t$.

$c_t$ is the visual context vector extracted from image $I$ for step $t$. Generally, $c_t$ is important extra information in image captioning models, which can provide visual evidence for caption generation. We follow the work Xu et al. (2015) to compute $c_t$. A CNN extracts a set of image features from its last convolutional layer, which we denote as $A = \{a_1, \dots, a_L\}$, $a_i \in \mathbb{R}^D$. Each $a_i$ corresponds to the features extracted at a different image location $i$, where $i \in \{1, \dots, L\}$. We calculate attention values $\alpha_{t,i}$ for each $a_i$ as follows:

$$e_{t,i} = f_{att}(a_i, h_{t-1}), \qquad \alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{L} \exp(e_{t,k})}$$

where $f_{att}$ is a multi-layer perceptron conditioned on the previous hidden state $h_{t-1}$. For each location $i$, $\alpha_{t,i}$ represents the importance of region $a_i$ for generating the $t$-th word. Once we obtain the attention weights, the context vector is computed by

$$c_t = \sum_{i=1}^{L} \alpha_{t,i} \, a_i$$
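One step of the soft attention described above can be sketched numerically. This is a minimal NumPy illustration of additive attention in the style of Xu et al. (2015); the parameter names `W_a`, `W_h`, `w` and the specific MLP form are illustrative assumptions:

```python
import numpy as np

def soft_attention(features, h_prev, W_a, W_h, w):
    """One step of additive soft attention.
    features: (L, D) image features a_i from the CNN's last conv layer.
    h_prev:   (H,) previous RNN hidden state h_{t-1}.
    W_a (K, D), W_h (K, H), w (K,): parameters of the attention MLP f_att."""
    # e_{t,i} = w^T tanh(W_a a_i + W_h h_{t-1}) for every location i
    e = np.tanh(features @ W_a.T + h_prev @ W_h.T) @ w      # (L,)
    # alpha_{t,i} = softmax over locations (max-shifted for stability)
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()
    # c_t = sum_i alpha_{t,i} a_i
    context = alpha @ features                               # (D,)
    return alpha, context
```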
Appendix C Experimental Results on COCO-GB v2
Compared to COCO-GB v1, all baseline and common debiasing approaches obtain a higher error rate on women (an average increase of 3.25%) on COCO-GB v2. The GAIC model improves the gender prediction accuracy for women from 41.6% to 67.1% and reduces the error rate on women from 28.3% to 18.0%. Although the UpWeight method obtains the highest accuracy for both women and men, it has an undesirably high error rate for each gender. There is no substantial difference between the Balanced model and the baseline model, and a similar trend was found on COCO-GB v1. The GAIC variant trained with extra segmentation supervision obtains consistently better performance than the base GAIC model. For fairness evaluation, we compare the gender divergence of different models. GAIC and its supervised variant obtain the lowest divergence, which indicates that these models treat each gender in a fair manner. For caption quality, GAIC and its supervised variant cause only a minor performance drop compared to the baseline.
Appendix D More on Implementation Details
Benchmarking Baselines: All baseline models obtain visual features extracted from the fourth layer of ResNet-101. All models except NBT, TopDown, Att, and AdaptAtt are implemented in the same open-source framework from https://github.com/ruotianluo/self-critical.pytorch, and we directly use the test caption results from https://github.com/LisaAnne/Hallucination. We implement the other models ourselves and verify that their caption quality is close to the results reported in the original papers. Caption quality is evaluated with the official COCO evaluation tool: https://github.com/tylin/coco-caption.
GAIC model and debiasing approaches:
We select a subset of the original training set which contains 4,000 images for each gender. The baseline Att model is trained on COCO for 5 epochs. For the Balanced baseline, we directly fine-tune the original Att model on this subset. For the PixelSup baseline, we fine-tune the model on the subset with extra person segmentation annotations. For the GAIC model, we use the two-stream pipeline to fine-tune the model on the subset. For the GAIC variant with extra supervision, we fine-tune the model with extra person segmentation annotations. For all the above models, we fine-tune for 1 extra epoch on the subset.