Fairness in AI is an emerging topic in computer vision [Kaiyu20, Vries2019] and is indispensable for developing unbiased AI models that are fair and inclusive to individuals from any background. Recent studies [Eidinger14, Kaerkkaeinen19, Zhifei17] suggest that top-performing AI models trained on datasets created without considering fair distribution across sub-groups, and thus quite unbalanced, do not necessarily reflect real-world outcomes. On the contrary, they may perform poorly and may be biased towards certain groups of people.
Deepfake detectors differentiate between real and fake videos by accumulating classifier responses on individual frames. Although aggregating per-frame predictions removes most outliers for a robust and accurate classification, bias in face detectors may change the final result drastically and cause deepfake detectors to fail. Are these detectors, e.g. [King09, Zhang16, Seferbekov2020, WM2020, NTechLab2020, Eighteen2020, Medics2020], capable of detecting faces from various age groups, genders or skin tones? If not, how often do deepfake detectors fail for this reason? Is there a way we can measure the vulnerabilities of deepfake detectors?
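The frame-level aggregation described above can be sketched as follows. This is a minimal illustration, not any winner's actual pipeline; the names `video_prediction`, `frame_scores` and `threshold` are ours, and the sketch also shows how a face detector that finds no faces for a subgroup leaves the detector with nothing to aggregate.

```python
# Hypothetical sketch of per-frame score aggregation in a deepfake
# detector. Averaging suppresses per-frame outliers, but if the face
# detector misses faces (e.g. for a particular subgroup), the
# video-level decision degrades or fails outright.
from statistics import mean

def video_prediction(frame_scores, threshold=0.5):
    """Aggregate per-frame fake probabilities into a video-level label.

    frame_scores: list of per-frame 'fake' probabilities in [0, 1].
    Returns True (fake), False (real), or None when no faces were found.
    """
    if not frame_scores:          # face detector found no faces at all
        return None               # the detector effectively fails here
    return mean(frame_scores) >= threshold
```

A biased face detector shrinks or empties `frame_scores` for some subjects, which is exactly the failure mode the questions above probe.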
To address the aforementioned concerns, we propose a dataset composed of video recordings containing 3,011 individuals with a diverse set of ages, genders and apparent skin types. Participants, who were paid actors and gave their permission for their likeness to be used for improving AI, casually speak about various topics in the video recordings and sometimes depict a range of facial expressions. Thus, we call the dataset Casual Conversations. The dataset includes a unique identifier and age, gender and apparent skin type annotations for each subject. A distinguishing feature of our dataset is that the age and gender annotations are provided by the subjects themselves. We prefer this human-centered approach and believe it allows our data to have a relatively unbiased view of age and gender. As a third dimension in our dataset, we annotated the apparent skin tone of each subject using the Fitzpatrick scale [FitzPatrick75]; we also label videos recorded in low ambient lighting. This set of attributes allows us to measure model robustness on four dimensions: age, gender, apparent skin tone and ambient lighting.
Although Casual Conversations is intended to evaluate the robustness of AI models across several facial attributes, we believe its value is greater and indispensable for many other open challenges. Image inpainting, developing temporally consistent models, audio understanding, responsible AI on facial attribute classification, and handling low-light scenarios in the aforementioned problems are potential application areas of this dataset.
We organize the paper as follows: Section 2 provides a comprehensive background on fairness in AI, up-to-date facial attribute datasets, deepfake detection and current challenges in personal attribute classification. Section 3 describes the data acquisition process and the annotation pipeline for our dataset. Section 4 analyzes the biases of the top five DFDC winners as well as state-of-the-art apparent age and gender classification models, using our dataset. Finally, we summarize our findings and provide an overview of the results in Section 5.
2 Related Work
Fairness in AI challenges the field of artificial intelligence to be more inclusive, fair and responsible. Research has clearly shown that deep networks that achieve high performance on certain datasets are likely to favor only sub-groups of people due to the imbalanced distribution of categories in the data [Larrazabal20]. Buolamwini and Gebru [Buolamwini18] pointed out that the IJB-A [Klare15] and Adience [Eidinger14] datasets are composed of mostly lighter-skinned subjects. Raji et al. [Raji19, Raji20] analyze the commercial impact of Gender Shades [Buolamwini18] and discuss ethical concerns in auditing facial processing technologies. Du et al. [Du20] provide a comprehensive review of recent developments in fairness in deep learning and discuss potential fairness mitigation approaches. Meanwhile, Barocas et al. [Barocas19] are in the process of compiling a book that intends to give a fresh perspective on machine learning with fairness as a central concern, and discusses possible mitigation strategies for the ethical challenges in AI.
Facial attribute datasets [Eidinger14, Kaerkkaeinen19, Zhifei17] are created to train and validate face recognition, age, and gender classification models. However, the facial attributes provided in these datasets are hand-labelled by third parties. Although it has been claimed that the annotations are uniformly distributed over different attributes, such as age and gender, there is no guarantee of the accuracy of these annotations. An individual's visual appearance may differ significantly from their own self-identification, which will thus result in bias in the dataset. In contrast, we provide age and gender annotations that are self-identified by the subjects. Aside from age and gender, public benchmarks tend to also provide annotated ethnicity labels. However, we find that labeling the ethnicity of subjects could lead to inaccuracies: raters may have unconscious biases towards certain ethnic groups that reduce the accuracy of the provided annotations. In the FairFace [Kaerkkaeinen19] dataset paper, the authors claim that skin tone is a one-dimensional concept in comparison to ethnicity, because lighting is a big factor when deciding on skin tone and a subject's skin tone may vary over time. Although these claims sound reasonable, the ethnicity attribute is still ill-defined and can conceptually cause confusion in many respects; for example, there may be no difference in facial appearance between African-American and African people, although they may be referred to by two distinct racial categories. We therefore have opted to annotate the apparent skin tone of each subject. Our dataset is composed of multiple recordings (on avg. 15) per actor, so annotators voted based on sampled frames of these videos. Since these videos were captured in varying ambient lighting conditions, we alleviate the aforementioned concerns stated in [Kaerkkaeinen19].
Deepfake detection in media forensics is a burgeoning field in machine learning that attempts to classify realistic, artificially-generated videos. Fake content generation has rapidly evolved in the past several years, and recent studies [Nyugen19] show that the gap between deepfake generators and detectors has been growing quickly. The DeepFake Detection Challenge (DFDC) [DFDCPreview, DFDC2020] provided researchers the opportunity to develop state-of-the-art detectors to tackle fake videos. However, there are still open questions to address, such as the robustness of these detectors across various age, gender and apparent skin tone groups and ambient lighting conditions. While detectors rely on face detection methods that are used by a majority of researchers, there is still no clear understanding of how accurately these face detectors perform on various subgroups of people, such as specific genders, darker skin tones, or younger people. For this reason, we believe that the Casual Conversations dataset will provide a valuable tool to measure the robustness of state-of-the-art deepfake detection approaches.
Apparent age and gender classification has been a rapidly growing research field for over a decade, but has recently attracted more attention after the tremendous increase in social media usage. Apparent age and gender prediction thus remains an active research field, investigated in automated human biometrics and facial attribute classification methods. Levi & Hassner [LeviHassner15] proposed an end-to-end trained Convolutional Neural Network (CNN) on the Adience benchmark [Eidinger14] to predict apparent age and gender from faces. Lee et al. [Lee2018] further developed a system for mobile devices and proposed an efficient, lightweight multi-task CNN for simultaneous apparent age and gender classification. Serengil [Serengil2020] recently presented a hybrid face recognition framework composed of state-of-the-art face recognition approaches. Nevertheless, none of these models were evaluated against apparent skin type variations and ambient lighting conditions. We therefore present a close look at the results of these methods on our dataset to measure the fairness of recent face technologies.
3 Casual Conversations Dataset
The Casual Conversations dataset is composed of approximately fifteen one-minute video recordings for each of our 3,011 subjects. Videos were captured in the United States in the cities of Atlanta, Houston, Miami, New Orleans, and Richmond. The subjects that participated in this study are of diverse ages (18-85), genders and skin tones. In most recordings only one subject is present; however, there are also videos in which two subjects appear simultaneously. Nonetheless, we only provide one set of labels, for the current subject of interest.
In this dataset, we provide annotations for age, gender, apparent skin tone and whether or not the video was recorded in low ambient lighting. Age and gender attributes are provided by the subjects themselves. All other publicly available datasets provide hand- or machine-labelled annotations, and therefore introduce a drastic bias towards the appearance of a person rather than their actual age and gender. Gender in our dataset is categorized as Male, Female, Other and N/A (preferred not to say, or removed during data cleaning). We are aware that this categorization is overly simplistic and does not sufficiently capture the diversity of genders that exist. We hope that in the future there is more progress on enabling data analysis that captures this additional diversity, while continuing to respect people's privacy and any data ethics concerns.
In addition to self-identified age and gender labels, we also provide skin tone annotations using the Fitzpatrick scale [FitzPatrick75]. Although the relative merits of ethnicity versus skin tone labels are still disputed [Kaerkkaeinen19], we believe ethnicity is ill-defined, considering that the apparent ethnicity of a person may differ from their actual ethnicity, thereby causing algorithms to classify incorrectly. Skin tone, on the other hand, is an expressive and generic way to group people, which is necessary to measure the bias of state-of-the-art methods. The Fitzpatrick scale [FitzPatrick75] is commonly used in the classification of apparent skin tones. It constitutes six skin types based on the skin's reaction to ultraviolet light. The scale ranges from Type I to VI, where Type I skin is pale to fair, never tans and always burns, whereas Type VI skin is very dark, always tans and never burns (see example face crops in Figure 2). Additionally, the Fitzpatrick scale has limitations in capturing diversity outside of Western theories of race-related skin tone and does not perform as well for people with darker skin tones [Sambasivan20, Ware20]. Three out of the six skin types cover white skin, two cover brown skin, and there is only one skin type for black skin, which clearly does not encompass the diversity within brown and black skin tones. A common procedure to alleviate this bias is to group the Fitzpatrick skin types into three buckets: light [Types I & II], medium [Types III & IV], and dark [Types V & VI]. Our annotations provide the full, non-bucketed skin types so that others can decide how they would like to group them.
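The three-bucket grouping described above is a simple mapping; a minimal sketch follows. The bucket names (light/medium/dark) come from the text, while the function name is ours.

```python
# Sketch of the common light/medium/dark bucketing of Fitzpatrick
# skin types described above. Type numbering follows the paper (I-VI
# written here as 1-6); the function name is illustrative.
def bucket_fitzpatrick(skin_type: int) -> str:
    """Map a Fitzpatrick type (1-6) to a coarse tone bucket."""
    if skin_type in (1, 2):
        return "light"
    if skin_type in (3, 4):
        return "medium"
    if skin_type in (5, 6):
        return "dark"
    raise ValueError(f"invalid Fitzpatrick type: {skin_type}")
```

Because the dataset ships the full, non-bucketed types, users can swap this mapping for any grouping that suits their analysis.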
To annotate apparent skin types, eight individuals (raters) were appointed to annotate all subjects and to flag the subjects they were not confident about. For the final skin type annotations, we accumulated weighted histograms over the eight votes (uncertain votes counted as half) and picked the most voted skin type as the ground-truth annotation.
Figure 1 shows the per-category distributions over our 3,011 subjects. As shown in the figures, we have decently balanced distributions over gender and age groups. For the skin type annotations, each paired group of Types I & II, III & IV and V & VI is almost equal to one-third of the dataset. The near-uniform distribution of annotations allows us to reduce the impact of bias in our measurements and hence better evaluate model robustness.
Figure 1 also depicts the percentage of bright versus dark videos over all 45,186 videos. To obtain a balanced lighting distribution, we sub-sample our dataset to include only one pair of videos per subject, a total of 6,022 videos. When possible, we chose one dark and one bright video. Note that sub-sampling only affects the lighting distribution, because there is only one set of labels per subject in the dataset. After re-sampling, we end up with 37.3% dark videos in the smaller dataset. In all experiments, we use this mini Casual Conversations dataset except in the DFDC evaluation (Section 4).
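The per-subject sub-sampling just described can be sketched as below. This is a hedged reconstruction, not the authors' actual script; the field name `is_dark` and the preference ordering are our assumptions.

```python
# Hypothetical sketch of the mini-dataset sub-sampling: keep one pair
# of videos per subject, preferring one dark and one bright clip when
# both exist. The 'is_dark' field is an assumed annotation name.
def subsample_pair(videos):
    """videos: list of video records for ONE subject,
    each a dict carrying an 'is_dark' flag. Returns up to two videos."""
    dark = [v for v in videos if v["is_dark"]]
    bright = [v for v in videos if not v["is_dark"]]
    if dark and bright:            # ideal case: one of each
        return [dark[0], bright[0]]
    pool = dark or bright          # otherwise take any pair available
    return pool[:2]
```

Subjects with only bright (or only dark) recordings still contribute a pair, which is why the resulting mini dataset is not exactly 50% dark.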
Although we have nearly uniform distributions per category, it is still very important to preserve uniformity in paired breakdowns. Figure 3 shows the paired distributions of categories; for example, in Figures 3(g), 3(e) and 3(a), all distributions are fairly uniform over all subcategories. In the rest of the paper, we refer to our four-dimensional attributes as fairness annotations.
Our dataset will be publicly available (https://ai.facebook.com/datasets/casual-conversations-dataset) for general use, and we encourage users to extend the annotations of our dataset for various computer vision applications, in line with our data use agreement.
The DeepFake Detection Challenge (DFDC) [DFDCPreview, DFDC2020] provided an opportunity for researchers to develop robust deepfake detection models and test them on a challenging private test set. The top five winners of the competition had relatively low performance on the dataset, achieving only up to 70% accuracy. However, within the scope of DFDC, AI models were only evaluated on a binary classification task: whether a video is fake or not. Since a portion of the DFDC private test set is constructed using videos from Casual Conversations, we complete this missing dimension of DFDC by matching the overlapping 4,945 DFDC test videos (almost half of the private test set) with their ground-truth fairness annotations (age, gender, apparent skin tone and lighting), and display the ROC curves of the top five winners on each dimension in Figure 6. Interestingly, we found that in terms of balanced predictions, the third-place winner, NTechLab [NTechLab2020], performs more consistently across three dimensions (age, gender, lighting) than all other winners.
| Winner | Overall | 18-30 | 31-45 | 46-85 | Female | Male | Type I | Type II | Type III | Type IV | Type V | Type VI | Bright | Dark |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Selim Seferbekov [Seferbekov2020] | -2.500 | -2.437 | -2.476 | -2.610 | -2.443 | -2.569 | -1.427 | -2.356 | -2.360 | -1.851 | -2.714 | -3.098 | -2.569 | -2.184 |
| Eighteen Years Old [Eighteen2020] | -2.027 | -2.112 | -2.110 | -1.876 | -2.026 | -1.996 | -1.081 | -1.764 | -1.975 | -1.095 | -2.288 | -2.651 | -2.108 | -1.647 |
| The Medics [Medics2020] | -2.688 | -2.453 | -2.604 | -2.877 | -2.597 | -2.719 | -1.677 | -2.598 | -2.678 | -2.303 | -2.790 | -3.115 | -2.732 | -2.501 |
In Table 1, we present the log of the weighted precision [DFDC2020] per category. Surprisingly, the best performing method on this subset of the private test set is NTechLab [NTechLab2020], as opposed to the top winner Selim Seferbekov [Seferbekov2020]. However, despite its higher overall performance, the results show that NTechLab's [NTechLab2020] model is particularly biased towards paler skin tones (Type I). Moreover, all winning approaches struggle to identify fake videos containing subjects with darker skin tones (Types V and VI). Aside from skin tone, the data indicates that Eighteen Years Old [Eighteen2020] significantly outperforms the other winning models on older individuals (ages 46-85).
| Winner | Overall | 18-30 | 31-45 | 46-85 | Female | Male | Type I | Type II | Type III | Type IV | Type V | Type VI | Bright | Dark |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Selim Seferbekov [Seferbekov2020] | 0.195 | 0.201 | 0.184 | 0.203 | 0.185 | 0.210 | 0.147 | 0.180 | 0.173 | 0.148 | 0.208 | 0.265 | 0.198 | 0.178 |
| Eighteen Years Old [Eighteen2020] | 0.186 | 0.205 | 0.180 | 0.178 | 0.193 | 0.175 | 0.143 | 0.179 | 0.204 | 0.144 | 0.177 | 0.209 | 0.187 | 0.183 |
| The Medics [Medics2020] | 0.213 | 0.204 | 0.207 | 0.221 | 0.205 | 0.216 | 0.164 | 0.218 | 0.226 | 0.189 | 0.202 | 0.219 | 0.214 | 0.211 |
In addition to precision, we also report the log-loss of each model per fairness category in Table 2. Similar to the weighted precision results, NTechLab [NTechLab2020] and Eighteen Years Old [Eighteen2020] have the lowest overall log-loss. As expected, the log-loss of each method increases dramatically on darker skin tones.
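The log-loss reported in Table 2 is the standard binary cross-entropy, computed per fairness category over (label, fake-probability) pairs. A minimal sketch, with probability clipping to avoid taking the log of zero:

```python
# Binary cross-entropy (log-loss), as used to compare the DFDC winners
# per fairness category. A minimal sketch; eps-clipping keeps the
# logarithm finite for predictions of exactly 0 or 1.
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log-loss over (label, fake-probability) pairs.

    y_true: 0/1 labels (1 = fake); y_prob: predicted fake probabilities.
    """
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip for numerical stability
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)
```

Lower is better; a jump in this value on Type V & VI subsets is what the text above reports as bias towards lighter skin tones.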
Lastly, we also investigated the False Negatives (FNs) of each method, because the video clips of those examples in the DFDC are benign real samples without any face perturbations (though some image augmentations may be applied). We show the FN ratios of each winner per category in Table 3, and present the FN ratios on the videos where all methods failed in the last row (intersection). Considering the distributions in Figure 1, we conclude that most of the scores are in alignment with the data distribution, except for the darker skin tones, Types V & VI.
Figures 4 and 5 present face crops from the original video clips that all top winners failed to classify correctly. Figure 4 shows example face crops for FNs. As stated above, all top five winners struggle to identify the originality of videos when the subject of interest has a darker skin tone.
| Winner | # | 18-30 | 31-45 | 46-85 | Female | Male | Type I | Type II | Type III | Type IV | Type V | Type VI | Bright | Dark |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Selim Seferbekov [Seferbekov2020] | 246 | 28.46 | 33.33 | 36.99 | 54.07 | 43.90 | 2.03 | 23.58 | 22.76 | 4.07 | 20.33 | 27.24 | 84.55 | 15.45 |
| Eighteen Years Old [Eighteen2020] | 141 | 31.91 | 39.01 | 27.66 | 56.74 | 40.43 | 2.13 | 19.86 | 24.82 | 2.84 | 21.99 | 28.37 | 86.52 | 13.48 |
| The Medics [Medics2020] | 307 | 23.13 | 31.92 | 40.39 | 53.75 | 41.04 | 2.28 | 24.43 | 25.41 | 5.86 | 18.24 | 23.78 | 82.08 | 17.92 |
4.1 Apparent age and gender prediction
We compared the apparent age and gender prediction results of three state-of-the-art models evaluated on our dataset. In the following experiments, we used the reduced (mini) dataset. We first detect faces in each frame with DLIB [King2009] and evaluate the models on 100 sampled face crops per video. Final predictions are calculated by aggregating results over these samples (most voted gender and median age). Levi & Hassner [LeviHassner15] and LMTCNN [Lee2018] predict age in predefined brackets, and we therefore map their age predictions to our predefined age groups.
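The per-video aggregation just described (most voted gender, median age over the sampled crops) can be sketched in a few lines; the function name is illustrative.

```python
# Sketch of the per-video aggregation described above: over the 100
# sampled face crops, take the majority-voted gender label and the
# median of the per-crop age estimates.
from collections import Counter
from statistics import median

def aggregate_predictions(genders, ages):
    """genders: per-crop gender labels; ages: per-crop age estimates.
    Returns the video-level (gender, age) prediction."""
    final_gender = Counter(genders).most_common(1)[0][0]
    final_age = median(ages)
    return final_gender, final_age
```

Median age and majority gender are both robust to a handful of bad crops, which matters when the face detector occasionally returns a poor detection.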
Levi & Hassner [LeviHassner15] is one of the early works that used deep neural networks. It is the least accurate method among the three, but is almost as good at apparent gender classification as LMTCNN [Lee2018]. LightFace [Serengil2020], on the other hand, is more successful at predicting apparent age and gender. Nevertheless, the apparent gender precision of the state-of-the-art methods on darker skin types (Types V & VI) is drastically lower, by more than 20% on average.
| Method | Overall | Female | Male | Other | Type I | Type II | Type III | Type IV | Type V | Type VI | Bright | Dark |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Levi & Hassner [LeviHassner15] | 38.05 | 37.44 | 39.48 | 66.67 | 39.56 | 38.72 | 40.84 | 36.47 | 36.47 | 34.89 | 38.49 | 37.04 |
| Method | Overall | 18-30 | 31-45 | 46-85 | Type I | Type II | Type III | Type IV | Type V | Type VI | Bright | Dark |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Levi & Hassner [LeviHassner15] | 39.42 | 39.29 | 40.65 | 54.00 | 47.51 | 56.81 | 55.97 | 53.97 | 35.89 | 35.30 | 40.21 | 38.12 |
We presented the Casual Conversations dataset, designed to measure the robustness of AI models across four main dimensions: age, gender, apparent skin type and lighting. As previously stated, a unique factor of our dataset is that the age and gender labels are provided by the participants themselves. The dataset has uniform distributions over all categories and can be used to evaluate various AI methods, such as face detection and apparent age and gender classification, or to assess robustness to various ambient lighting conditions.
As an application of our dataset, we presented an analysis of the DeepFake Detection Challenge's top five winners on the four main dimensions of the dataset. We presented the precision and log-loss scores of the winners on the intersection of the DFDC private test set and the Casual Conversations dataset. From our investigation, we conclude that all methods carry a large bias towards lighter skin tones, since they mostly fail on darker-skinned subjects.
Moreover, we also discussed the results of recent apparent age and gender prediction models on our dataset. In both applications, we noticed an obvious algorithmic bias towards lighter-skinned subjects. Apparent gender classification methods are most successful on older people (46-85 years old) and are generally as good on darker videos as on brighter ones.
Beyond the aforementioned research topics, our dataset enables researchers to develop and thoroughly evaluate models for more inclusive and responsible AI.
We would like to thank Ida Cheng and Tashrima Hossain for their help in regards to annotating the dataset for the Fitzpatrick skin type.