Towards measuring fairness in AI: the Casual Conversations dataset

04/06/2021 ∙ by Caner Hazirbas, et al. ∙ Facebook 0

This paper introduces a novel dataset to help researchers evaluate their computer vision and audio models for accuracy across a diverse set of age, genders, apparent skin tones and ambient lighting conditions. Our dataset is composed of 3,011 subjects and contains over 45,000 videos, with an average of 15 videos per person. The videos were recorded in multiple U.S. states with a diverse set of adults in various age, gender and apparent skin tone groups. A key feature is that each subject agreed to participate for their likenesses to be used. Additionally, our age and gender annotations are provided by the subjects themselves. A group of trained annotators labeled the subjects' apparent skin tone using the Fitzpatrick skin type scale. Moreover, annotations for videos recorded in low ambient lighting are also provided. As an application to measure robustness of predictions across certain attributes, we provide a comprehensive study on the top five winners of the DeepFake Detection Challenge (DFDC). Experimental evaluation shows that the winning models are less performant on some specific groups of people, such as subjects with darker skin tones and thus may not generalize to all people. In addition, we also evaluate the state-of-the-art apparent age and gender classification methods. Our experiments provides a through analysis on these models in terms of fair treatment of people from various backgrounds.



There are no comments yet.


page 1

page 2

page 4

page 6

page 8

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Figure 1: Casual Conversations dataset per-category distributions. Age and gender distributions are pretty balanced. Only 0.1% of participants identified themselves as Other

gender (purple bar in the “Gender” column). Consecutive pairs of skin types can be grouped into three sub-categories for a uniform distribution. To balance lighting, we also provide sub-sampled dataset consists of two videos per actor, where one is

Dark when possible.

Fairness in AI is an emerging topic in computer vision [Kaiyu20, Vries2019] and has proven indispensable to develop unbiased AI models that are fair and inclusive to individuals from any background. Recent studies [Eidinger14, Kaerkkaeinen19, Zhifei17] suggest that top performing AI models trained on datasets that are created without considering fair distribution across sub-groups and thus quite unbalanced, do not necessarily reflect the outcome in real world. On the contrary, they may perform poorly and may be biased towards certain groups of people.

Figure 2: Example face crops from the Casual Conversations dataset, categorized by their apparent Fitzpatrick skin types.

Deepfake detectors are able to differentiate between real and fake videos by accumulating classifier responses on individual frames. Although aggregating per-frame predictions removes most outliers for a robust and accurate classification, bias in face detectors may change the final result drastically and cause deepfake detectors to fail. Are these detectors, e.g.

[King09, Zhang16, Seferbekov2020, WM2020, NTechLab2020, Eighteen2020, Medics2020], capable of detecting faces from various age groups, genders or skin tones? If so, how often do deepfake detectors fail based on this reason? Is there a way we can measure the vulnerabilities of deepfake detectors?

To address the aforementioned concerns, we propose a dataset composed of video recordings containing 3,011 individuals with a diverse set of age, genders and apparent skin types. Participants, who were paid actors gave their permission for their likeness to be used for improving AI, in the video recordings casually speak about various topics and sometimes depict a range of facial expressions. Thus, we call the dataset Casual Conversations. The dataset includes a unique identifier and age, gender, apparent skin type annotations for each subject. A distinguishing feature of our dataset is that age and gender annotations are provided by the subjects themselves. We prefer this human-centered approach and believe it allows our data to have a relatively unbiased view of age and gender. As a third dimension in our dataset, we annotated the apparent skin tone of each subject using the Fitzpatrick[FitzPatrick75] scale; we also label videos recorded in low ambient lighting. This set of attributes allow us to measure model robustness on four dimensions: age, gender, apparent skin tone and ambient lighting.

Although Casual Conversations

is intended to evaluate robustness of AI models across several facial attributes, we believe that its value is greater and indispensable for many other open challenges. Image inpainting, developing temporally consistent models, audio understanding, responsible AI on facial attribute classification and handling low-light scenarios in the aforementioned problems are potential application areas of this dataset.

We organize the paper as follows; Section 2 provides a comprehensive background on fairness in AI, up-to-date facial attribute datasets, deepfake detection and current challenges in personal attribute classification. Section 3 describes the data acquisition process and the annotation pipeline for our dataset.  Section 4 analyzes the biases of top five DFDC winners as well as the state-of-the-art apparent age and gender classification models, using our dataset. Consequently, we finalize our findings and provide an overview of the results in Section 5.

2 Related Work

Fairness in AI 

challenges the field of artificial intelligence to be more inclusive, fair and responsible. Research has clearly shown that deep networks that achieve a high performance on certain datasets are likely to favor only sub-groups of people due to the imbalanced distribution of the categories in the data 

[Larrazabal20]. Buolamwini and Gebru [Buolamwini18]

pointed out that the IJB-A 

[Klare15] and Adience [Eidinger14] datasets are composed of mostly lighter skin toned subjects. Raji  [Raji19, Raji20] analyze the commercial impact Gender Shades [Buolamwini18] and discuss ethical concerns auditing facial processing technologies. Du  [Du20]

provide a comprehensive review on recent developments of fairness in deep learning and discuss potential fairness mitigation approaches in deep learning. Meanwhile, Barocas  


are in the process of compiling a book that intends to give a fresh perspective on machine learning in regards to fairness as a central concern and discusses possible mitigation strategies on the ethical challenges in AI.

Facial attribute datasets 

[Eidinger14, Kaerkkaeinen19, Zhifei17]

are created to train and validate face recognition, age, and gender classification models. However, provided facial attributes in these datasets are hand-labelled and annotated by third-parties. Although it has been claimed that the annotations are uniformly distributed over different attributes, age and gender, there is no guarantee on the accuracy of these annotations. An individual’s visual appearance may differ significantly from their own self-identification which will thus result as bias in the dataset. In contrast, we provide age and gender annotations that are

self-identified by the subjects. Aside from age and gender, public benchmarks tend to also provide annotated ethnicity labels. However, we find that labeling the ethnicity of subjects could lead to inaccuracies. Raters may have unconscious biases towards certain ethnic groups that may reduce the labelling accuracy in the provided annotations. In the FairFace [Kaerkkaeinen19] dataset paper, the authors claim that skin tone is a one dimensional concept in comparison to ethnicity because lighting is a big factor when deciding on the skin tone as a subject’s skin tone may vary over time. Although these claims sound reasonable, the ethnicity attribute is still ill-defined and can conceptually cause confusions in many aspects; for example there may no difference in facial appearance of African-American and African people, although, they may be referred to with two distinct racial categories. We, therefore, have opted to annotate the apparent skin tone of each subject. Our dataset is composed of multiple recordings ( on avg. 15) per actor, so annotators voted based on the sampled frames of these videos. Since these videos were captured in varying ambient lighting conditions, we alleviate the aforementioned concerns stated in [Kaerkkaeinen19].

Deepfake detection 

in media forensics is a burgeoning field in machine learning that attempts to classify realistic artificially-generated videos. Fake content generation has rapidly evolved in the past several years and recent studies [Nyugen19] show that the gap between the deepfake generators and detectors has been growing quickly. The DeepFake Detection Challenge (DFDC)  [DFDCPreview, DFDC2020] provided researchers the opportunity to develop state-of-the-art detectors to tackle fake videos. However, there are still open questions to address such as the robustness of these detectors across various age, gender, apparent skin tone groups and ambience lighting conditions. While detectors rely on face detection methods that are used by a majority of researchers, there is still no clear understanding as to how accurately these face detectors perform on various subgroups of people, such as specific genders, darker skin tones, younger people, For this reason, we believe that the Casual Conversations dataset will provide a valuable tool to measure robustness of state-of-the-art deepfake detection approaches.

Apparent age and gender classification 

has been a rapidly growing research field over a decade but recently took more attention after tremendous increase in social media usage. Therefore, apparent age and gender prediction is still an active research field investigated in automated human biometrics and facial attribute classification methods. Levi & Hassner [LeviHassner15]

proposed an end-to-end trained Convolutional Neural Network (CNN) on Adience benchmark 

[Eidinger14] to predict apparent age and gender from faces. Lee  [Lee2018] further developed a system for mobile and proposed an efficient, lightweight multi-task CNN for simultaneous apparent age and gender classification. Serengil  [Serengil2020] recently presented a hybrid face recognition framework composed of the state-of-the-art face recognition approaches. Nevertheless, none of these models were evaluated against apparent skin type variations and ambient lighting conditions. Therefore, we present a close look into the results of these methods on our dataset to measure fairness of the recent face technologies.

(a) Age breakdown by Gender
(b) Gender breakdown by Age
(c) Age breakdown by Skin type
(d) Skin type breakdown by Age
(e) Gender breakdown by Skin type
(f) Skin type breakdown by Gender
(g) Age breakdown by Lighting
(h) Lighting breakdown by Age
(i) Gender breakdown by Lighting
(j) Lighting breakdown by Gender
(k) Lighting breakdown by Skin type
(l) Skin type breakdown by Lighting
Figure 3: Casual Conversations dataset per-category distributions across subgroups defined by each of the remaining categories. The per-category distributions are generally well balanced across subgroups.

3 Casual Conversations Dataset

The Casual Conversations dataset is composed of approximately fifteen one minute video recordings for each of our 3,011 subjects. Videos are captured in the United States in the cities of Atlanta, Houston, Miami, New Orleans, and Richmond. The subjects that participated in this study are from diverse age (), gender and skin tones. In most recordings, only one subject is present; however, there are videos in which two subjects are present simultaneously as well. Nonetheless, we only provide one set of labels and it is for the current subject of interest.

In this dataset, we provide annotations for age, gender, apparent skin tone and whether or not the video was recorded in low ambient lighting. Age and gender attributes of subjects are provided by subjects themselves. All other publicly available datasets provide hand or machine labelled annotations and therefore introduce a drastic bias towards appearance of a person other than the actual age and gender. Gender in our dataset is categorized as Male, Female, Other and N/A (preferred not to say or removed during data cleaning). We are aware that this categorization is over simplistic and does not sufficiently capture the diversity of genders that exist, and that we hope in the future there is more progress on enabling data analysis that captures this additional diversity while continuing to respect people’s privacy and any data ethics concerns.

In addition to self-identified age and gender labels, we also provide skin tone annotations using the Fitzpatrick scale [FitzPatrick75]. Although the debate on ethnicity versus skin tone is still disputed  [Kaerkkaeinen19], we believe it is ill-defined considering that the apparent ethnicity of a person may differ from their actual ethnicity, thereby causing algorithms to classify incorrectly. On the other hand, skin tone is an expressive and generic way to group people, which is necessary to measure the bias of the-state-of-the-art methods. The Fitzpatrick scale [FitzPatrick75] is commonly used in classification of apparent skin tones. The Fitzpatrick scale constitutes six skin types based on the skin’s reaction to Ultraviolet light. The scale ranges from Type I to VI, where Type I skin is pale to fair, never tans but always burns whereas Type VI skin is very dark, always tans but never burns (see example face crops in Figure 2). Additionally, the Fitzpatrick scale has limitations in capturing diversity outside of the Western theories of race related skin tone and does not perform as well for people with darker skin tones [Sambasivan20, Ware20]. Three of out the six skin types cover white skin, two cover brown skin, and there is only skin type for black skin, which clearly does not encompass the diversity within brown and black skin tones. A common procedure to alleviate this bias is to group the Fitzpatrick skin types into three buckets of light [types I, II], medium [types III & IV], and dark skins [type V & VI]. Our annotations provide the full, non-bucketed skin types such that others can decide how they’d to group the skin types.

In order to annotate for apparent skin types, eight individuals (raters) were appointed to annotate all subjects and to also flag the subjects that they are not confident about. As the final skin type annotations, we accumulated the weighted histograms over eight votes (uncertain votes are counted as half) and pick the most voted skin type as the ground-truth annotation.

Figure 1 shows the per-category distributions over our 3,011 subjects. As shown in the figures, we have decently balanced distributions over gender and age groups. For the skin type annotations, each paired group of types I & II, III & IV and V & VI would be almost equal to one-third of the dataset. Uniform distributions of the annotations allow us to reduce the impact of bias in our measurements and hence let us better evaluate model robustness.

In Figure 1 the percentage of bright versus dark videos over all 45,186 videos is also depicted. To have a balanced lighting distribution, we sub-sample our dataset to include only one pair of videos per subject, a total of 6,022 videos. When possible, we chose one dark and one bright video. Note that sub-sampling only affects the lighting distribution because there is only one set of labels per subject in the dataset. After re-sampling, we end up with 37.3% dark videos in the smaller dataset. In all experiments, we use the mini Casual Conversations dataset except in DFDC evaluation (Section 4).

Although we have nearly uniform distributions per category, it is still very important to preserve uniformity in paired breakdowns. Figure 3 shows the paired distributions of categories. For example, in Figures 2(g), 2(e) and 2(a), all distributions are fairly uniform over all subcategories. In the rest of the paper, we refer to our four dimensional attributes as fairness annotations.

Our dataset will be publicly available111 for general use and we encourage users to extend annotations of our dataset for various computer vision applications, in line with our data use agreement.

4 Experiments

The DeepFake Detection Challenge (DFDC) [DFDCPreview, DFDC2020] provided an opportunity for researchers to develop robust deepfake detection models and test on a challenging private test set. The top five winners of the competition had relatively low performance on the dataset, achieving only up to 70% accuracy. However, in the scope of DFDC, AI models were only evaluated on a binary classification task, whether a video is fake or not. Since a portion of the DFDC private test set is constructed using videos from Casual Conversations, to complete the missing dimension of DFDC, we match the overlapping 4,945 DFDC test videos (almost half of the private test set) with their ground truth fairness annotations (age, gender, apparent skin tone and lighting) and display the ROC curves of the top five winners on each dimension in Figure 6. Interestingly, we found that in terms of balanced predictions the third placed winner, NTechLab [NTechLab2020], performs more consistently across three dimensions (age, gender, lighting) in comparison to all other winners.

Age Gender skin type Lighting
Winner Overall 18-30 31-45 46-85 Female Male Type I Type II Type III Type IV Type V Type VI Bright Dark
Selim Seferbekov [Seferbekov2020] -2.500 -2.437 -2.476 -2.610 -2.443 -2.569 -1.427 -2.356 -2.360 -1.851 -2.714 -3.098 -2.569 -2.184
WM [WM2020] -2.276 -2.140 -2.316 -2.337 -2.319 -2.140 -0.814 -2.113 -2.303 -1.792 -2.567 -2.649 -2.362 -1.865
NTechLab [NTechLab2020] -2.008 -2.008 -2.007 -2.023 -2.050 -1.914 -1.417 -1.729 -2.047 -1.357 -2.253 -2.497 -2.081 -1.668
Eighteen Years Old [Eighteen2020] -2.027 -2.112 -2.110 -1.876 -2.026 -1.996 -1.081 -1.764 -1.975 -1.095 -2.288 -2.651 -2.108 -1.647
The Medics [Medics2020] -2.688 -2.453 -2.604 -2.877 -2.597 -2.719 -1.677 -2.598 -2.678 -2.303 -2.790 -3.115 -2.732 -2.501
Table 1: Precision comparison of the DFDC winners, with a breakdown by fairness categories. Log of the weighted precision [DFDC2020] is reported. Note that overall scores slightly differ from those reported in [DFDC2020] due to the number of examples (see Section 4).

In Table 1, we present the log of the weighted precision [DFDC2020] per category. Surprisingly, the best performing method on the subset of the private test set is NTechLab [NTechLab2020] as opposed to top winner Selim Seferbekov [Seferbekov2020]. However, despite higher performance, the results show that NTechLab’s [NTechLab2020] model is particularly biased towards paler skin tones (Type I). Nevertheless, all winning approaches struggle to identify fake videos containing subjects with darker skin tones (Type V and VI). Aside from skin tone, the data indicates that Eighteen Years Old [Eighteen2020] significantly outperforms other winning models on older individuals (age ).

Age Gender Skin type Lighting
Winner Overall 18-30 31-45 46-85 Female Male Type I Type II Type III Type IV Type V Type VI Bright Dark
Selim Seferbekov [Seferbekov2020] 0.195 0.201 0.184 0.203 0.185 0.210 0.147 0.180 0.173 0.148 0.208 0.265 0.198 0.178
WM [WM2020] 0.176 0.174 0.172 0.184 0.185 0.163 0.110 0.176 0.175 0.146 0.183 0.202 0.179 0.163
NTechLab [NTechLab2020] 0.166 0.175 0.169 0.159 0.168 0.166 0.142 0.154 0.165 0.099 0.176 0.209 0.164 0.171
Eighteen Years Old [Eighteen2020] 0.186 0.205 0.180 0.178 0.193 0.175 0.143 0.179 0.204 0.144 0.177 0.209 0.187 0.183
The Medics [Medics2020] 0.213 0.204 0.207 0.221 0.205 0.216 0.164 0.218 0.226 0.189 0.202 0.219 0.214 0.211
Table 2: Log-loss comparison of the DFDC winners, with a breakdown by fairness categories. These results showcase that it is more difficult for all methods to identify deepfakes when darker skin tones are present.
Figure 4: Example face crops of False Negatives (FNs), where all methods labelled the videos as fake although they are benign examples. Methods mostly fail on the darker skin-tones.
Figure 5: Example face crops of False Positives (FPs), where all methods labelled the videos as real although they are perturbed. Note that we show the unperturbed (benign) face crops in the figure.

In addition to precision, we also report the log-loss of each model per fairness category in Table 2. Similar to the weighted precision results, NTechLab [NTechLab2020] and Eighteen Years Old [Eighteen2020] have the lowest overall log-loss. As expected, the log-loss of each methods increases dramatically on darker skin tones.

Lastly, we also investigated the False Negatives (FNs) of each method because video clips of those examples in the DFDC are benign real samples without any face perturbations (may have some image augmentations applied). Therefore, we show the FN ratios of each winner per category in Table 3. We also present the FN ratios on the videos where all methods failed in the last row (intersection). Considering the distributions in Figure 1, we conclude that most of the scores are in alignment with the data distribution except the darker skin tones, types V & VI.

Figures 5 and 4 present face crops from the original video clips where all top winners failed to classify correctly. Figure 4 shows example face crops for FNs. As also stated above, all top five winners struggle to identify the originality of videos when the subject in interest has a darker skin tone.

Age Gender Skin type Lighting
Winner # 18-30 31-45 46-85 Female Male Type I Type II Type III Type IV Type V Type VI Bright Dark
Selim Seferbekov [Seferbekov2020] 246 28.46 33.33 36.99 54.07 43.90 2.03 23.58 22.76 4.07 20.33 27.24 84.55 15.45
WM [WM2020] 190 26.32 36.84 33.68 60.00 35.26 1.05 23.68 27.37 4.74 21.05 22.11 86.84 13.16
NTechLab [NTechLab2020] 142 30.99 32.39 34.51 59.15 37.32 3.52 19.72 28.17 4.23 19.72 24.65 85.92 14.08
Eighteen Years Old [Eighteen2020] 141 31.91 39.01 27.66 56.74 40.43 2.13 19.86 24.82 2.84 21.99 28.37 86.52 13.48
The Medics [Medics2020] 307 23.13 31.92 40.39 53.75 41.04 2.28 24.43 25.41 5.86 18.24 23.78 82.08 17.92
intersection 38 23.68 44.74 28.95 55.26 39.47 5.26 13.16 28.95 2.63 21.05 28.95 84.21 15.79
Table 3: False Negative (FN) ratios of the DFDC winners, with a breakdown by fairness categories. Scores in category are normalized by the number of examples. The data above shows that methods performs significantly worse on examples of darker skinned subjects, considering the skin type distribution in 1.

4.1 Apparent age and gender prediction

We compared the apparent age and gender prediction results of the three state-of-the-art models, evaluated on our dataset. In the following experiments, we used the reduced (mini) dataset. We first detect faces in each frame with DLIB [King2009] and evaluate the models on the sampled 100 face crops per video. Final predictions are calculated by aggregating results over these samples (most voted gender and median age). Levi & Hassner [LeviHassner15] and LMTCNN [Lee2018] predicts age in predefined brackets and therefore we map their age prediction to our predefined age groups in 1.

Tables 5 and 4 show the precision of the models on apparent age and gender, respectively. Levi & Hassner [LeviHassner15]

is one of the early work that used deep neural networks. It is comparatively less accurate method among all, however, almost as good in apparent gender classification as LMTCNN 

[Lee2018]. LightFace [Serengil2020], on the other hand, is more successful on predicting accurate apparent age and gender. Nevertheless, state-of-the-art methods’ apparent gender precision on darker skin types (Type V & VI) is drastically lower by more than 20% on average.

Gender Skin type Lighting
Overall Female Male Other Type I Type II Type III Type IV Type V Type VI Bright Dark
Levi & Hassner [LeviHassner15] 38.05 37.44 39.48 66.67 39.56 38.72 40.84 36.47 36.47 34.89 38.49 37.04
LMTCNN [Lee2018] 42.26 42.28 44.53 100.00 42.33 41.78 42.30 42.79 42.44 37.99 42.94 41.12
LightFace [Serengil2020] 54.32 54.21 56.18 83.33 46.51 55.52 54.59 55.78 53.78 52.57 54.17 55.20
Table 4: Precision comparison of the apparent age classification methods, with a breakdown by fairness categories.
Age Skin type Lighting
Overall 18-30 31-45 46-85 Type I Type II Type III Type IV Type V Type VI Bright Dark
Levi & Hassner [LeviHassner15] 39.42 39.29 40.65 54.00 47.51 56.81 55.97 53.97 35.89 35.30 40.21 38.12
LMTCNN [Lee2018] 41.56 41.57 43.14 56.29 50.38 58.64 58.22 55.85 38.62 39.68 41.44 41.75
LightFace [Serengil2020] 44.12 44.33 45.40 59.85 55.73 62.39 61.12 61.81 41.90 41.66 44.29 43.86
Table 5: Precision comparison of the apparent gender classification methods, with a breakdown by fairness categories.
Figure 6: DFDC top winner’s ROC curves, breakdown on the Fairness annotations. Selim Seferbekov [Seferbekov2020] (top winner of the DFDC) has similar ROC curves for age groups but it is more accurate on Female examples and more sensitive to pale or very dark skin tones (Type I and Type VI). WM [WM2020] generally preserves the balance across all categories except the skin type. On the other hand, NTechLab [NTechLab2020] curves are more class-balanced over age, gender and lighting, while Eighteen Years Old [Eighteen2020] is invariant to lighting. The Medics [Medics2020], on the other hand, does not preserve balance in most of the categories. Axes of ROC curves are zoomed-in for better visualization.

5 Conclusions

We presented the Casual Conversations Dataset, a dataset designed to measure robustness of AI models across four main dimensions, age, gender, apparent skin type and lighting. As previously stated, a unique factor of our dataset is that the age and gender labels are provided by the participants themselves. The dataset has uniform distributions over all categories and could be used to measure various AI methods, such as face detection, apparent age and gender classification, or to assess robustness to various ambient lighting conditions.

As an application of our dataset, we presented an analysis of the DeepFake Detection Challenge’s top five winnerson the four main dimensions of the dataset. We presented the precision and log-loss scores of winners on the intersection of the DFDC private test set and the Casual Conversations dataset. From our thorough investigation, we conclude that all methods carry a large bias towards lighter skin tones since they mostly fail on darker skinned subjects.

Moreover, we also discussed the results of the recent apparent age and gender prediction models on our dataset. In both of the applications, we noticed an obvious algorithmic bias towards lighter skinned subjects. Apparent gender classification methods are most successful on older people ( years old) and generally as good on darker videos as on brighter ones.

Beyond aforementioned research topics, our dataset enables researchers to develop and also thoroughly evaluate models for more inclusive and responsible AI.


We would like to thank Ida Cheng and Tashrima Hossain for their help in regards to annotating the dataset for the Fitzpatrick skin type.