Artificial mental phenomena: Psychophysics as a framework to detect perception biases in AI models

12/15/2019
by Lizhen Liang, et al.
Syracuse University

Detecting biases in artificial intelligence has become difficult because of the impenetrable nature of deep learning. The central difficulty is in relating unobservable phenomena deep inside models with observable, outside quantities that we can measure from inputs and outputs. For example, can we detect gendered perceptions of occupations (e.g., female librarian, male electrician) using questions to and answers from a word embedding-based system? Current techniques for detecting biases are often customized for a task, dataset, or method, affecting their generalization. In this work, we draw from Psychophysics in Experimental Psychology—meant to relate quantities from the real world (i.e., "Physics") into subjective measures in the mind (i.e., "Psyche")—to propose an intellectually coherent and generalizable framework to detect biases in AI. Specifically, we adapt the two-alternative forced choice task (2AFC) to estimate potential biases and the strength of those biases in black-box models. We successfully reproduce previously-known biased perceptions in word embeddings and sentiment analysis predictions. We discuss how concepts in experimental psychology can be naturally applied to understanding artificial mental phenomena, and how psychophysics can form a useful methodological foundation to study fairness in AI.


1 Introduction

Recent artificial intelligence models have shown remarkable performance in a variety of tasks that were once thought to be solvable only by humans [stone2016one]. With such promising results, companies and governments began deploying such systems for increasingly critical tasks, including job candidate screening [dastin2018amazon], justice system decisions [angwin2016machine], and credit scoring [hurley2016credit]. Because they are trained on data that might contain biases, however, deep learning models inadvertently fit those biases and make decisions that discriminate based on gender and other protected statuses. If we suspected such biases in humans, we could interrogate them and determine whether the biases are present. Several researchers have attempted to develop methods for detecting biases in AI models, but these methods are specific to the task (e.g., [stock2017convnets]), data (see [chen2018my]), or type of model [celis2019classification], which hinders their potential adoption. Here we entertain the idea of using Experimental Psychology to develop novel and coherent methods for probing AI systems. Experimental Psychology has a very rich tradition of treating human consciousness as a black box, developing methods to extract potential biases from subjective judgements in behavioral tasks [fechner]. We hypothesize that we can adapt these methods to uncover biases in AI models in a similar way. In particular, Psychophysics and signal detection theory offer a concrete set of tools for querying black boxes and extracting useful measures of the direction and strength of bias. In this work, we describe how we adapt the standard two-alternative forced choice (2AFC) task, the workhorse of Psychophysics, to extract biases in word embeddings and sentiment analysis predictions.

The dramatic increase in the use of AI systems has raised concerns about potential biases against vulnerable groups. Part of the issue is that current systems have exploded in complexity, going from hundreds of parameters linearly related to outputs to billions of parameters non-trivially related to outputs (as discussed broadly in [hastie2005elements], [tan2019efficientnet], and [o2016weapons]). If biases are present in modern AI systems, they are therefore significantly harder to detect just by inspecting fitted parameters. This has resulted in dramatic cases of discrimination in recidivism prediction [angwin2016machine] and credit scoring [hurley2016credit], which were discovered only after the systems had been deployed. One proposed solution to the problem of veiled discrimination, recently implemented in Europe's General Data Protection Regulation (GDPR), is to force AI models to be "explainable" [dovsilovic2018explainable]. While forcing explainability appears to be a natural countermeasure, the well-known interpretability-accuracy trade-off predicts that such systems will have decreased accuracy [sarkar2016accuracy], which is not always desirable [domingos2015master].

One path toward preventing biases in AI models is to develop techniques for detecting them, as a natural first step to fixing them. There have been several research programs aimed at detecting biases in AI models. Many of them, however, require detailed knowledge of the inner workings of the algorithm or the datasets. For example, in the work of [mcduff2019characterizing], the authors propose a form of "classifier interrogation" that requires labeled data to explore the space of parameters that might cause biases. Also, techniques for detecting biases tend to be task specific and difficult to generalize. In [caliskan2017semantics], for example, the authors adapt the Implicit Association Test (IAT) for detecting biases in word embeddings. While this is a natural application of the original intention of the IAT, it is unclear how to move beyond bias detection in unsupervised settings. Recently, researchers at DeepMind proposed Psychlab, a highly complex synthetic environment to test AI models as if they were humans living in a virtual world [leibo2018psychlab]. Detecting biases in AI models is an important step, but it would be beneficial to develop simpler, more general-purpose techniques.

Interestingly, Experimental Psychology has long had to develop methods for understanding latent, mental phenomena based on questions and answers that are expressed verbally or physically. In particular, the field of Psychophysics uses such methods to measure the perceptual biases and sensory accuracy of animals [green1966signal]. Importantly, a key requirement of Psychophysics is to avoid relying on verbal or other highly cognitive responses that are prone to noise, and instead to use simple behavioral responses that are hard to fake (e.g., movements, yes/no answers). Within Psychophysics, perhaps one of the oldest and most well-developed techniques for performing estimations based on these cues is the two-alternative forced choice (2AFC) task [acuna2015using]. This method has been used, for example, to measure how the size of objects biases our perception of weight [charpentier1891analyse] and to measure the precision of the human retina when detecting light [harmening2014mapping]. We hypothesize that we can adapt these techniques to measure biases, and the strength of those biases, in AI models. Thus, Experimental Psychology is a rich area of research with potential applications to examining artificial intelligence decisions.

In this work, we develop a framework to study biases in AI models. Our primary goal is a framework that is coherent across datasets, tasks, and algorithms, allowing researchers to describe biased perception using a common language. We draw inspiration from Psychophysics, a field of Experimental Psychology, and adapt the two-alternative forced choice (2AFC) task. As examples, we examine potential biases in word embeddings and sentiment analysis predictions, and we validate our results with real-world data. Our findings show that we are able to detect biases, and the strength of those biases, in decisions that involve gender and occupations. In sum, our work contributes to the field of fairness, accountability, and transparency in the following ways:

  • A discussion of the current bias detection techniques for AI models

  • A new method for detecting biases and measuring the strength of those biases based on a coherent set of concepts and language drawn from Experimental Psychology

  • A demonstration of the application of the technique to word embeddings and sentiment analysis prediction

2 Background

2.1 Different Kinds of Biases

Psychophysical bias Perceptual biases are decision deviations about stimuli that should be perceived exactly the same. A classic example is the size-weight illusion (also known as the Charpentier illusion), in which people underestimate the weight of a large object compared to small objects of the same mass [charpentier1891analyse]. Formulated differently, if presented with a small object of a known weight, subjects tend to judge heavier, larger objects as having the same weight as the small object. Therefore, even if the known small object has the same weight as a larger object of unknown weight, subjects would not judge the unknown-weight object as weighing the same: they have a bias against small sizes in weight space. This is the kind of bias that we expect to detect with the techniques introduced here.

Discriminatory bias/Anti-discrimination laws Discrimination based on gender, race, sex, or ethnicity is generally considered a violation of the Fourteenth Amendment [bornstein2018antidiscriminatory]. This kind of bias is defined within the judicial system, where laws and acts have been designed to protect the rights of certain groups of people. Some AI models used by companies, governments, and other organizations may inherit discriminatory biases that are unlawful in this sense [o2016weapons]. This is the kind of bias we want to help detect by adapting the psychophysical experiments explained above. However, a human judge or an external validation must determine whether the detected biases violate, for example, anti-discrimination laws [benthall2019racial].

Statistical Biases This bias represents the difference between an estimated data distribution and the real data distribution [hastie2005elements]. A statistical model with low bias typically has low training error but may overfit and perform poorly on testing (i.e., out-of-sample) data. In this context, bias can help prevent overfitting by forcing the model not to fit the training data too closely. Several techniques in machine learning and statistical modeling (such as prior probabilities, regularization, dropout, and data augmentation [murphy2012machine]) are meant to introduce bias into the system with the purpose of preventing overfitting.

Counterfactual Biases In this kind of bias, the question is more specific: would the response of the system change had one protected attribute in the input been different? In this instance, of course, it is not possible to go back in time and change the situation, and thus many assumptions must be made. With the mathematical framework of causal reasoning (e.g., [pearl2009causality, imbens2015causal]), researchers have proposed ways to use counterfactual reasoning to think about these issues (e.g., [kusner2017counterfactual, kilbertus2019sensitivity, wu2019counterfactual]). Counterfactual biases are orthogonal and complementary to psychophysical biases. In general, Psychophysics does not deal with causal inference because it is believed that the experimenter controls for potential confounders or relies on randomization to assign subjects to experimental conditions.

Figure 1: Artificial Psychophysics: two-alternative forced choice task (2AFC). A) The experimenter presents a stimulus that is a blend of two cues; the mixture amount is controlled by a parameter α. The AI system responds with one of two choices. B) A psychometric curve that best fits a set of responses across several values of α. The point at which the curve crosses the 50/50 line is called the Point of Subjective Equivalence (PSE), and the slope of the curve is inversely related to the Just-Noticeable Difference (JND). C) We can use the responses of the system to create a sampler based on Markov Chain Monte Carlo (MCMC) to extract the distribution of stimuli that produces a certain output. The panel shows an example where the sampler uses responses to estimate stimuli (e.g., embeddings) of positive sentiment.

2.2 Psychophysics and the two-alternative forced choice task (2AFC)

Psychophysics is perhaps one of the oldest parts of Psychology, established in a book by Fechner in 1860 [fechner]. It emphasizes the quantification of the relationship between physical stimuli (light, sound, touch) and the contents of "consciousness," which are unmeasurable. (This stands in contrast to other approaches based on behavior and verbal interviews [murray1993perspective].) Psychophysics had an early influence from sophisticated mathematical tools rooted in signal detection theory, a field that seeks to model the responses of systems based on a mathematical and statistical treatment of signals [green1966signal]. One of the simplest and perhaps most widely used methods in Psychophysics is the two-alternative forced choice (2AFC) task. In this task, subjects are asked to repeatedly answer one of two questions based on stimuli carefully selected by the experimenter. Based on the set of answers, Psychophysics then uses curve fitting and interpretation to extract perception biases. This task has been used to establish many important findings about our memory, visual, and auditory systems [green1966signal]. Thus, Psychophysics is one of the oldest branches of Psychology, and the 2AFC is a workhorse paradigm widely used to measure biases.

In the 2AFC task, there are two important quantities that are obtained from a repeated set of questions. One quantity is the Point of Subjective Equivalence (PSE). As its name indicates, in the 2AFC task a subject is "forced" to make one of two choices about a stimulus even if neither of the answers seems correct. The stimulus at which this extreme confusion happens is called the Point of Subjective Equivalence (PSE) because both choices are "subjectively" similar. Another quantity is the Just-Noticeable Difference (JND), which is the amount by which the experimenter needs to modify the stimulus in order for the subject to reliably shift answers. If the PSE is not where the experimenter expects it to be, then we say that the subject has a bias [green1966signal]. This entire process of querying and estimating the PSE and JND can be observed in Fig. 1A. There, the experimenter presents a stimulus selected from a spectrum of choices. This stimulus is then taken as an input by the subject (the AI system), who is asked to decide which of two cues was used to generate the stimulus. Based on many of these responses, a psychometric curve is fit using standard cumulative functions such as the sigmoid or the cumulative normal distribution (Fig. 1B). The point at which this psychometric curve passes through 50/50 is the PSE. The JND is related to the inverse of the steepness of the curve.

Mathematically, and without loss of generality, we can assume that the subject (or AI system) is presented with a stimulus $s(\alpha)$ selected from a stimulus spectrum. This stimulus in turn is a combination (e.g., linear) of two cues $c_1$ and $c_2$ using a mixture parameter $\alpha \in [0, 1]$, as follows

$$s(\alpha) = (1 - \alpha)\, c_1 + \alpha\, c_2. \qquad (1)$$

The subject only perceives a noisy version of Eq. 1, denoted by $\hat{s}$. The subject has priors on the general values of cue 1 and cue 2, $p(c_1)$ and $p(c_2)$, respectively. Also, the subject has a general idea of the perception of the stimulus, $p(\hat{s} \mid s)$, given a hypothesized stimulus $s$. With all these pieces of information, the subject can estimate the distribution of the real stimulus $s$ given the perceived stimulus $\hat{s}$ using Bayes' rule

$$p(s \mid \hat{s}) \propto p(\hat{s} \mid s)\, p(s). \qquad (2)$$

Based on Eq. 2, and using some scoring function $f(s, c_i)$ relating the stimulus $s$ with cue $c_i$, the decision of the subject, $d(\hat{s})$, for a given perception $\hat{s}$ and a hypothesized stimulus $s$ is

$$d(\hat{s}) = \operatorname*{arg\,max}_{i \in \{1, 2\}} \; \mathbb{E}_{p(s \mid \hat{s})}\!\left[ f(s, c_i) \right]. \qquad (3)$$

Because there is noise in the perception (i.e., $\hat{s} \neq s$), this decision might change from trial to trial for a given stimulus $s(\alpha)$. This is largely similar to the Bayesian treatment of the 2AFC task (see [acuna2015using, berniker2010learning, kording2004bayesian, ernst2002humans, yuille1993bayesian, knill1996perception]). As it is generally assumed that there is no bias in $\hat{s}$ with respect to the real $s$, the hypothesized stimulus $s$ (or its perception $\hat{s}$) is largely determined through the mixture $\alpha$. A function that produces the probability of picking $c_2$ over $c_1$ is called a psychometric curve $\psi(\alpha)$, and it is defined as follows

$$\psi(\alpha) = p\!\left( d(\hat{s}) = c_2 \mid s(\alpha) \right). \qquad (4)$$

This psychometric curve is usually assumed to be a cumulative distribution function and thus monotonically increasing in $\alpha$.

In this context, then, the Point of Subjective Equivalence (PSE) can be defined as the value of $\alpha$ for which the psychometric curve has a 50/50 chance of answering either cue 1 or cue 2,

$$\psi(\alpha_{\mathrm{PSE}}) = \tfrac{1}{2}. \qquad (5)$$

And the Just-Noticeable Difference (JND) is the change in $\alpha$ needed to produce a noticeable difference in the decisions of the psychometric curve of, say, 50%,

$$\mathrm{JND} = \psi^{-1}(0.75) - \psi^{-1}(0.25). \qquad (6)$$

One of our interests in this study is to understand biases in the perception of cues. If there were no biases, it would be expected that the PSE is $\alpha = \tfrac{1}{2}$, because a mixture of 50% of each cue should make the stimulus equally similar to cue 1 and cue 2. However, this is not always the case. A PSE $> \tfrac{1}{2}$ can be interpreted as a bias for cue 1 (or against cue 2), as a higher than 50% proportion of cue 2 (and lower than 50% proportion of cue 1) would be needed to make the cues perceptually indistinguishable. Conversely, a PSE $< \tfrac{1}{2}$ can be interpreted as a bias for cue 2 (or against cue 1). The value of the JND depends on the perception noise of the task. A large JND means that perceptions are noisy and biases (if any) are less sharply defined. It is typically assumed that there is no correlation between the PSE and the JND.
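To make the fitting procedure concrete, the psychometric curve, PSE, and JND can be estimated as in the following sketch; the cumulative-normal form, the simulated responses, and the function names are illustrative assumptions rather than the exact implementation used in our experiments.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def psychometric(alpha, pse, sigma):
    """Cumulative-normal psychometric curve: P(answer cue 2 | mixture alpha)."""
    return norm.cdf(alpha, loc=pse, scale=sigma)

# Illustrative responses: proportion of "cue 2" answers at each mixture level.
alphas = np.linspace(0, 1, 11)                 # mixture parameter alpha
n_trials = 50
true_pse, true_sigma = 0.42, 0.08              # a biased, fairly sharp observer (assumed)
responses = np.random.binomial(n_trials, psychometric(alphas, true_pse, true_sigma)) / n_trials

# Fit the curve and derive the PSE (Eq. 5) and the JND (Eq. 6).
(pse, sigma), _ = curve_fit(psychometric, alphas, responses, p0=[0.5, 0.1])
jnd = norm.ppf(0.75, loc=pse, scale=sigma) - norm.ppf(0.25, loc=pse, scale=sigma)
print(f"PSE = {pse:.3f} (0.5 = unbiased), JND = {jnd:.3f}")
```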

2.3 Markov Chain Monte Carlo (MCMC) for stimulus representations

While the 2AFC task allows us to measure biases, and the strength of those biases, based on stimuli chosen by the experimenter, it would be desirable to reverse the process. That is, it would be interesting to understand the distribution of stimuli for a given response. In a classification task, for example, this would be useful for understanding the distribution of texts that give rise to positive sentiment predictions. This idea has been explored before in the context of psychological experiments [sanborn2010uncovering] using a specific type of Markov Chain Monte Carlo (MCMC) sampler. Because in our experiments we can control how we treat the probabilistic output of a classifier, we can use a highly efficient MCMC method such as the No-U-Turn Sampler [hoffman2014no].

Concretely, imagine that we want to understand how the input of a classifier, $x$, is related to its decision, $y$. Without loss of generality, we assume that we have access to the classifier's distribution on $y$ given the input $x$ as $p(y \mid x)$. We reverse this distribution by simply applying Bayes' rule

$$p(x \mid y) \propto p(y \mid x)\, p(x). \qquad (7)$$

When the dimensionality of the input $x$ is high, as in most modern deep learning applications, estimating $p(x \mid y)$ directly is prohibitively expensive because we need to integrate out all dimensions of $x$ from the joint distribution $p(x, y)$. Therefore, we can use a Markov Chain Monte Carlo (MCMC) scheme where, in a similar fashion to the 2AFC task, we repeatedly ask the system for its judgements about an input. In our context, such a sampler, being at a certain embedding, attempts to move to another embedding, which is only accepted if the MCMC acceptance criterion is met (Fig. 1C). For more information on MCMC, see [gilks1995markov].
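As a sketch of this reversal, the sampler can be set up with Pyro's NUTS kernel as follows; the standard-normal prior over embeddings and the differentiable classifier returning p(y = positive | x) are assumptions of this sketch, not details fixed by the framework.

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS

def posterior_model(classifier, dim=100):
    # Prior p(x) over embeddings; a standard normal is an illustrative assumption.
    x = pyro.sample("x", dist.Normal(torch.zeros(dim), torch.ones(dim)).to_event(1))
    # The classifier gives p(y = positive | x); condition on observing a positive label.
    p_positive = classifier(x).squeeze()
    pyro.sample("y", dist.Bernoulli(probs=p_positive), obs=torch.tensor(1.0))

def sample_positive_embeddings(classifier, num_samples=10_000, warmup=10_000):
    """Draw embeddings from p(x | y = positive) using the No-U-Turn Sampler."""
    kernel = NUTS(posterior_model)
    mcmc = MCMC(kernel, num_samples=num_samples, warmup_steps=warmup)
    mcmc.run(classifier)
    return mcmc.get_samples()["x"]
```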

3 Proposed method

3.1 Estimating bias and bias strength in word embeddings using the 2AFC task

Based on artificial neural networks, word embedding models compute a continuous representation of a word using contextual word co-occurrences within documents. These representations work especially well for language translation and word analogy tasks [schnabel2015evaluation]. For our experiments, we use a word embedding method called GloVe [pennington2014glove], but we believe any other embedding method should produce similar results.

To examine potential biases in word embeddings, we design an artificial 2AFC task where a system is asked to answer questions about a concept that should be unbiased with respect to two potentially biasing concepts. Consider a real 2AFC task examining genderless occupations and their relationship to gender. For example, we could ask participants to guess the gender of an electrician (e.g., a person with a voltage meter and a blue coat) whose face and body have been experimentally manipulated to be a blend between a male and a female appearance. By modifying the percentage of maleness in the blend, we would obtain a psychometric curve based on the responses. If such a psychometric curve crosses the 50/50 threshold away from a 50/50 gender blend, it would suggest a biased perception of the occupation. It is worth mentioning that this experiment would be challenging to perform with humans because of inter-trial memory effects and because the visual blending of faces and bodies needs to be believable. With an AI model, however, these issues are not present.

To create an artificial 2AFC task with word embeddings, we use a simple question-answering system based solely on distances. In word embeddings, close relationships between words are well correlated with the angles between their respective embeddings (i.e., cosine distances). The adaptation of the task explained above would be as follows: we would ask the AI system about gender with the question "What is the gender of this [manipulated gendered pronoun] electrician?", with the answers being a female attribute or a male attribute (e.g., female/male, woman/man, her/him). The manipulated gendered pronoun would be the stimulus, and "electrician" would be the occupation of interest. The stimulus would be represented by a mixture $s(\alpha)$ of a female attribute embedding $e_f$ and a male attribute embedding $e_m$ (as in Eq. 1), and the occupation would be represented by the occupation's embedding $e_o$. Each answer embedding $a_i$ would then be given a score

$$\mathrm{score}(a_i) = \cos\!\big(s(\alpha) + e_o,\; a_i\big), \qquad (8)$$

where $\cos(u, v)$ is the cosine similarity between embeddings $u$ and $v$. The method then picks the answer with the highest score. To produce the psychometric curve, we modify the value of $\alpha$ and obtain several responses. From the combination of all responses for a particular occupation, cue 1, and cue 2, we can fit a function to build the psychometric curve (Eq. 4). Based on this psychometric curve, we can then extract the PSE and JND. If the PSE is not exactly at $\tfrac{1}{2}$, we might conclude that the system is biased; depending on which side of $\tfrac{1}{2}$ it falls, we might say that there is a bias against cue 1 or against cue 2. An example of several psychometric curves is in Fig. 2.
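The querying step of this artificial 2AFC task can be sketched as follows; the gensim KeyedVectors model, the choice of cue words ("she"/"he"), and the scoring form of Eq. 8 as reconstructed above are assumptions of this sketch, not the exact implementation used in our experiments.

```python
import numpy as np
from gensim.models import KeyedVectors

def unit(v):
    return v / np.linalg.norm(v)

def cos(u, v):
    return float(np.dot(unit(u), unit(v)))

def two_afc_responses(wv, occupation, cue_f="she", cue_m="he",
                      answers=("female", "male"), alphas=np.linspace(0, 1, 21)):
    """Probe the embedding with blended stimuli and record which answer wins.

    alpha is the proportion of the male cue in the blend; the returned array
    holds 1 when the 'male' answer gets the higher score (cf. Eq. 8)."""
    e_f, e_m, e_o = wv[cue_f], wv[cue_m], wv[occupation]
    a_f, a_m = wv[answers[0]], wv[answers[1]]
    responses = []
    for alpha in alphas:
        s = (1 - alpha) * e_f + alpha * e_m       # blended stimulus, Eq. 1
        score_f = cos(s + e_o, a_f)               # score for the female answer
        score_m = cos(s + e_o, a_m)               # score for the male answer
        responses.append(int(score_m > score_f))
    return alphas, np.array(responses)

# Example (assuming a trained model saved to disk):
# wv = KeyedVectors.load("glove_wiki.kv")
# alphas, r = two_afc_responses(wv, "electrician")
```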

In our work, the word embedding model we used is based on skip-gram. It maps each word into a 100-dimensional continuous vector. If the input contains multiple words, the combined embedding is obtained by averaging the embeddings of the individual words.

3.2 Estimating distribution of inputs conditioned on outputs

Using the representation of word embeddings, we can examine the distribution of embeddings conditioned on the classifications that we can make using those embeddings. For example, we could compute the posterior distribution of GloVe embeddings using a classifier that predicts sentiments based on those embeddings. In particular, we train a classifier of positive (or negative) sentiment for an embedding $x$, and we are able to estimate the distribution of embeddings conditioned on positive sentiment. We create the distribution $p(y \mid x)$ using a multilayer perceptron.

4 Experiments

4.1 Datasets

We use several datasets with curated labels of gender and other demographic information. We also use a dataset for training the word embedding and another dataset for training sentiment analysis.

Labor statistics

For some analyses, we need to validate our estimated biases against external data. We use data from labor statistics on occupations, the number of workers in those occupations, and the gender breakdown of those workers. The data comes from the U.S. Bureau of Labor Statistics [bureau2012employed]. This data was also used in [caliskan2017semantics] to externally validate their method.

Wikipedia dataset

For training the word embedding vectors, we use a dump of the English Wikipedia from March 2019.

Large Movie Review

For training the sentiment analysis predictor, we use the Large Movie Review Dataset, a very popular sentiment dataset [maas-EtAl:2011:ACL-HLT2011]. It contains movie reviews from the Internet Movie Database (IMDB) in which reviews with more than 7 stars (on a scale from 1 to 10) are assigned a positive sentiment and reviews with fewer than 4 stars are assigned a negative sentiment.

Equity Evaluation Corpus (EEC)

To evaluate biases in sentiment analysis, we use a dataset of names and relationship words associated with genders. For example, John and uncle are male, and Alice and aunt are female. These associations are part of the EEC dataset by [DBLP:journals/corr/abs-1805-04508].

4.2 Bootstrapping word embedding estimation

For each word embedding model, we were able to obtain a PSE from word pairs given a target word. However, to form a proper psychometric curve, we need to understand the uncertainty that exists in the model. We can think of these variations as the noise in the perceptual system of the AI model, related to the noisy perception $\hat{s}$ in Eq. 1. To estimate this uncertainty, we bootstrapped 32 GloVe word embedding models trained on the Wikipedia dataset. Given the size of the dataset, it was infeasible to perform a direct bootstrap by keeping all data in memory and sampling with replacement. Instead, we performed a streaming bootstrap, repeating each line a random number of times sampled from a Poisson distribution with mean 1 (see [43157]). After fitting a psychometric curve to these decisions, if we find any biases in the PSE, the JND helps us understand how stringent these biases are. High confidence, in this case, is represented by a low JND. An example of a set of psychometric curves from this bootstrap process is depicted in Fig. 2.
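A minimal sketch of such a streaming Poisson bootstrap is shown below; the file paths are placeholders, and the rate of 1 follows the standard form of this resampling scheme.

```python
import numpy as np

def poisson_bootstrap_corpus(in_path, out_path, lam=1.0, seed=0):
    """Write a bootstrap replicate of a line-based corpus without loading it into memory.

    Each line is emitted k times, with k drawn from Poisson(lam); over the whole
    corpus this approximates sampling lines with replacement."""
    rng = np.random.default_rng(seed)
    with open(in_path, encoding="utf-8") as fin, open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            for _ in range(rng.poisson(lam)):
                fout.write(line)

# Example: 32 bootstrap replicates of the corpus, one per embedding model.
# for b in range(32):
#     poisson_bootstrap_corpus("wiki_corpus.txt", f"wiki_corpus_boot{b}.txt", seed=b)
```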

Figure 2: Artificial psychometric curves for the electrician occupation. The average point at which the curves cross from Female to Male responses is the Point of Subjective Equivalence (PSE), and the standard deviation is the Just-Noticeable Difference (JND). In this case, the PSE is to the left of the 50%/50% gender stimulus, suggesting a bias toward male: even though less than 50% maleness is perceived, the model thinks that the cue is male.

4.3 Sentiment analysis and sampler

We use a multilayer perceptron to learn the probability distribution $p(y \mid x)$ of the sentiment $y$ given a word embedding $x$. We learn a 100-dimensional word2vec embedding using the movie review dataset, which consists of 12,500 positive and 12,500 negative reviews. We tokenized the reviews, removed symbols, processed each review using the skip-gram word embedding model, and generated an embedding for each review by averaging the word2vec vectors of its words. In cross-validation, the classifier achieves an AUC of 0.953.
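This pipeline can be sketched with gensim's Word2Vec and scikit-learn's MLPClassifier as follows; hyperparameters and variable names not reported above are assumptions of this sketch rather than the settings used in our experiments.

```python
import numpy as np
from gensim.models import Word2Vec
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def train_sentiment_model(tokenized_reviews, labels, dim=100):
    """tokenized_reviews: list of token lists; labels: 1 = positive, 0 = negative."""
    # Skip-gram word2vec embeddings (sg=1) of the requested dimensionality.
    w2v = Word2Vec(sentences=tokenized_reviews, vector_size=dim, sg=1, min_count=2, workers=4)

    # Represent each review as the average of its word vectors.
    def review_embedding(tokens):
        vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

    X = np.vstack([review_embedding(toks) for toks in tokenized_reviews])
    y = np.asarray(labels)

    # Multilayer perceptron estimating p(positive | embedding), with cross-validated AUC.
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300)
    auc = cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()
    clf.fit(X, y)
    return w2v, clf, auc
```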

For the sampler, we use the Python package Pyro to perform MCMC with a No-U-Turn Sampler [hoffman2014no]. The sampler used 10,000 warm-up samples and ran for 10,000 steps after that.

5 Results

In this paper, we wanted to adapt methods developed in experimental psychology to detect biases in decisions made by artificial intelligence models. In particular, we adapted the two-alternative forced choice (2AFC) task to understand biases in word embeddings. We developed two kinds of experiments: a signal detection theory experiment that estimates the bias and the uncertainty of that bias, and a sampling experiment that estimates the distribution of word embeddings conditioned on positive sentiment. Both types of experiments provide a window into how powerful the analogy of psychophysics could be for uncovering biases in artificial intelligence methods.

5.1 Measured biases of word embeddings based on 2AFC

The results show relatively intuitive trends in occupations. We first examine whether the psychometric curves for an example occupation ("electrician") versus the female–male continuum stimuli based on the bootstrap produce sensible results (Fig. 2). The model indeed exhibits a bias against female. A more systematic examination of the phenomenon for a sample of occupations and a set of gendered attributes reveals an intuitive pattern (Fig. 3a). Each point in this graph represents one PSE and JND extracted from a psychometric curve of one pair of female/male attributes out of the set female/male, woman/man, girl/boy, sister/brother, she/he, her/him, hers/his, and daughter/son (from [caliskan2017semantics]). More female-perceived occupations have a bias against male, and vice versa. There are some biases about which the model is more certain, which can be observed in the JND results (Fig. 3b). For example, hairdresser is an occupation highly biased against male but with high uncertainty. On the other hand, lawyer is an occupation relatively biased against female with significantly lower uncertainty than hairdresser. It is important to correlate our results with real datasets that may point to some ground truth. Therefore, we externally validate the results on a real-world dataset of gender and occupations based on labor statistics. We found that the PSE correlates well with the percentage of workers within each occupation who are men () and that the JND correlates well with the standard deviation of that proportion ().
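This external validation can be computed along the following lines; the dictionaries holding the estimated PSEs and the labor-statistics percentages are hypothetical placeholders for illustration.

```python
from scipy.stats import pearsonr

def validate_pse_against_labor_stats(pse_by_occupation, pct_male_by_occupation):
    """Correlate estimated PSEs with the percentage of men in each occupation."""
    common = sorted(set(pse_by_occupation) & set(pct_male_by_occupation))
    pse = [pse_by_occupation[o] for o in common]
    pct = [pct_male_by_occupation[o] for o in common]
    r, p = pearsonr(pse, pct)
    return r, p
```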

(a) PSE of occupations. To the left of 50%/50% there is a bias against female; to the right, a bias against male.
(b) JND of occupations. The lower the JND, the higher the confidence in the judgement of bias (PSE).
Figure 3: 2AFC results for occupations. Taking a set of occupations from the US Labor Statistics dataset reported in [caliskan2017semantics], we calculate the PSE for each of those occupations based on the distances between the embeddings of the occupation and the two cues. For each occupation, we obtain a PSE, which indicates the percentage of "maleness" or "femaleness" in the question at which the AI agent could not decide whether the occupation is male or female. Each dot represents the PSE or JND of a pair from female/male, woman/man, girl/boy, sister/brother, she/he, her/him, hers/his, and daughter/son.

We perform a similar analysis to the one above, but now choosing a stimulus that is a combination of love and hate. We found interesting patterns here as well: intuitively more likable occupations such as teacher have a bias in favor of love, whereas (medical) examiner has a bias in favor of hate (Fig. 4a). Additionally, the bias for teacher is highly confident, as can be observed in the JND estimations (not shown). For this dataset, however, we do not have external validation.

Figure 4: PSE of the 2AFC task for occupations vs love–hate stimulus spectrum. Only one pair of stimuli was tried on this semantic relationship.

5.2 Measured biases of sentiment analysis predictions based on MCMC

Figure 5: Autocorrelation of posterior embeddings conditioned on positive sentiment

We first wanted to check that the MCMC sampler had achieved stationarity. We used the standard approach of computing autocorrelations for all dimensions. The sampler has a warm-up of 10,000 steps followed by 10,000 sampling steps. Visual inspection of the autocorrelation reveals a sharp decline after a time lag of 1 (Fig. 5), which is a sign that the Markov chain has properly mixed [gilks1995markov].
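Such a check can be sketched as follows, assuming samples is the array of posterior draws produced by the sampler above (e.g., of shape (num_samples, 100), converted to NumPy).

```python
import numpy as np

def autocorrelation(x, max_lag=50):
    """Normalized autocorrelation of a 1-D chain for lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    var = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / var for k in range(max_lag + 1)])

def max_abs_autocorr_at_lag(samples, lag=1):
    """Largest |autocorrelation| at the given lag across all embedding dimensions."""
    acs = [autocorrelation(samples[:, d], max_lag=lag)[lag] for d in range(samples.shape[1])]
    return float(np.max(np.abs(acs)))

# e.g. max_abs_autocorr_at_lag(samples, lag=1) close to 0 suggests good mixing.
```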

Figure 6: Posterior distribution of embeddings given positive sentiment responses projected using PCA. Two example words from a curated dictionary of positive words have intuitive log-likelihood differences in the posterior distribution.

The posterior distribution of embeddings is hard to visualize because it has 100 dimensions, so we perform a dimensionality reduction using PCA (Fig. 6). The projected distributions of the posteriors conditioned on positive and on negative sentiment are slightly different, a fact that could be used to examine where the two distributions differ. However, because we use sentiment analysis of movie reviews, there is no simple way to extract a review from an embedding, since our representation is the average embedding of the words in the review.
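A two-dimensional view like the one in Fig. 6 can be produced as in the sketch below, assuming positive_samples and negative_samples are NumPy arrays of posterior draws.

```python
import numpy as np
from sklearn.decomposition import PCA

def project_posteriors(positive_samples, negative_samples, n_components=2):
    """Fit PCA on the pooled posterior samples and project each set into 2-D."""
    pooled = np.vstack([positive_samples, negative_samples])
    pca = PCA(n_components=n_components).fit(pooled)
    return pca.transform(positive_samples), pca.transform(negative_samples)
```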

Figure 7: Average distance of female/male names or relationship words to the posterior distribution of embeddings conditioned on positive sentiment. We take a set of gender-associated names and relationship words from the Equity Evaluation Corpus. For each word, we calculate the distances between the word and all the sampled embeddings and average them; we then average these distances within each gender.

We then wanted to measure whether there are biases in the estimated distributions, in particular gender biases in the posterior distribution. Specifically, we measured whether names and relationship words that are usually associated with a gender are closer to or farther from the posterior distribution conditioned on positive sentiment (Fig. 7). Indeed, we found that female names are significantly farther away from this posterior distribution than male names, suggesting that the posterior distribution is dominated by more male-associated embeddings. However, these results should be interpreted with caution because we are using movie reviews to learn the embeddings.

word sentiment distance
wonderful positive 0.4888
unforgettable positive 0.4895
devilish negative 0.4901
amazing positive 0.4907
versatility positive 0.4910
heartfelt positive 0.4911
wondrous positive 0.4911
enthusiastically positive 0.4911
cherished positive 0.4911
terrific positive 0.4912
(a) Closest words, conditioned on positive
word sentiment distance
dismally negative 0.5114
incompetent negative 0.5116
redundant negative 0.5117
hopelessly negative 0.5117
mess negative 0.5118
hideously negative 0.5118
lifeless negative 0.5119
substandard negative 0.5126
smother negative 0.5129
dreadfully negative 0.5131
(b) Most distant, conditioned on positive
word sentiment distance
substandard negative 0.4754
hideously negative 0.4765
dreadfully negative 0.4768
smother negative 0.4768
lifeless negative 0.4773
hopelessly negative 0.4773
mess negative 0.4777
dismally negative 0.4778
redundant negative 0.4780
pointless negative 0.4780
(c) Closest, conditioned on negative
word sentiment distance
beautifully positive 0.5153
uncompromising negative 0.5154
supremacy positive 0.5156
heartfelt positive 0.5156
devilish negative 0.5162
amazing positive 0.5165
versatility positive 0.5167
cherished positive 0.5170
unforgettable positive 0.5186
wonderful positive 0.5191
(d) Most distant, conditioned on negative
Table 1: Average distance of words from a curated sentiment dictionary to the posterior embeddings conditioned on positive and on negative sentiment, ranked by distance.

We externally validate the sampler using a curated dictionary of sentiment words. We compute the average distance of the embeddings of all these words to the posterior distributions conditioned on positive and on negative sentiment. We found a negative correlation between a word's positivity and its distance to the posterior conditioned on positive sentiment (, , ). Similarly, we found a positive correlation between a word's positivity and its distance to the posterior conditioned on negative sentiment. These results suggest that the distance to the distribution provides not only a predictor of the sentiment of words but also a natural ordering of those words with respect to the conditioning of the MCMC sampler.
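This ranking can be computed as in the sketch below, assuming wv is the word embedding model from the earlier sketches, samples is a NumPy array of posterior draws, and distance is one minus cosine similarity.

```python
import numpy as np

def average_distance_to_posterior(wv, word, samples):
    """Mean cosine distance between a word's embedding and the MCMC posterior samples."""
    e = wv[word] / np.linalg.norm(wv[word])
    s = samples / np.linalg.norm(samples, axis=1, keepdims=True)
    return float(np.mean(1.0 - s @ e))

def rank_words_by_distance(wv, words, samples):
    """Rank sentiment words by their average distance to the conditioned posterior."""
    dists = {w: average_distance_to_posterior(wv, w, samples) for w in words if w in wv}
    return sorted(dists.items(), key=lambda kv: kv[1])

# e.g. rank_words_by_distance(wv, ["wonderful", "terrific", "dreadfully"], positive_samples)
```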

6 Discussion

In this work, we use artificial psychophysics to detect biases in AI and show its application to word embeddings and sentiment analysis predictions. Our method was able to capture biases similar to those reported in the literature, but using a more coherent and perhaps simpler set of ideas. However, there are some shortcomings, which we now discuss.

We need a more systematic evaluation of the method. We need to see whether we find effects similar to those found by more specialized techniques such as the IAT. For example, while we find a correlation between the labor statistics data and the PSE (Pearson's correlation coefficient ), this correlation was not as strong as the one found with the IAT in [caliskan2017semantics] (Pearson's correlation coefficient ). One disadvantage of the IAT is that it needs a basket of words to represent the attributes that one wants to analyze. For example, while our method only needs the embedding of "woman" for one option and the embedding of "man" for the other, the IAT used a set of female names and a set of male names. We could easily extend our technique to include all pairwise PSEs and JNDs, which would then be averaged and could perhaps improve the correlation. However, this seems unlikely given the large dispersion of values in the current estimations. We will explore this approach in the future.

The two-alternative forced choice (2AFC) task has limitations that also apply to our task. For one, it can only handle two alternatives at a time, which makes it inefficient for exploring multiple, simultaneous relationships. A possible fix to this issue is to fit several pairwise psychometric curves, but the interpretation becomes significantly more cumbersome [green1966signal]. Nevertheless, we believe that the rich history and theoretical foundation of the method outweigh the issues of multiple comparisons. If anything, our use of MCMC can perform multiple, simultaneous examinations of the underlying method, but the way to apply it is not as straightforward as the 2AFC.

As with all methods in this area, the evaluation of our results is difficult. While we have a dataset from labor statistics that relates to the PSE of occupation versus female-male attributes, if we had not found such a relationship, the bias might still have been there (a false negative). This is apparent for the occupation versus love-hate experiment. We found intuitive relationships between lovable and undesirable occupations (e.g., teacher being the most lovable and (medical) examiner being the most hated), but so far we do not have a validation set for it. In the future, we will attempt to validate these results using survey information, such as the Pew Research Center survey on trust in different occupations [funk2017public]. Similarly, checking the posterior distribution of the MCMC result is perhaps even more challenging. In a sense, we need a way to generate interpretable data from a point in the posterior distribution. For example, in our sentiment analysis predictor, the posterior is on the embedding space, making it almost impossible to map an embedding back to a movie review. While this might seem to defeat the purpose of the posterior, we are still able to capture trends in the data, in that the embeddings of positive words are intuitively closer to the posterior distribution conditioned on positive sentiment. In the future, we will explore representations that are more interpretable and that will therefore allow us to examine the posterior more easily. One obvious experiment to try is an embedding that involves images.

The biases we detect do not necessarily constitute a problematic feature of an AI system that is attempting to make the most accurate predictions. After all, biases in the statistical sense can help a system prevent overfitting, and a great deal of modern machine learning uses an array of methods for introducing biases explicitly; regularization, dropout, and data augmentation can all be seen as increasing bias [Goodfellow-et-al-2016]. Also, humans themselves seem to incorporate biases in an optimal manner in the form of a priori knowledge [tenenbaum2011grow]. However, these kinds of biases are best understood in supervised learning scenarios where there is a clear measure of performance [hastie2005elements]. As such, it seems problematic that unsupervised systems, such as word embeddings, contain biases on protected attributes. Companies and the public seem to agree: anecdotally, almost a year ago we were able to reproduce biases using the built-in word embedding in the Python software package spaCy [spacy2]. When we re-attempted this experiment with a more recent version of the software, however, we were no longer able to find such biases. This suggests that software companies and the broader machine learning community recognize that these types of biases might be unacceptable (e.g., [buolamwini2018gender]).

One of the ideal goals of this work is not only to detect biases but also to fix them. There are many proposals for fixing biases in AI models, but we are not aware of debiasing methods based on experimental psychology or psychophysics. Perhaps building a system that fixes these issues would greatly inform how we design our detection task. Maybe some of the biases that we detect are not fixable, or biases that we think are hard to detect are easily fixable. We will explore this interplay in the future.

We believe that our proposal can open the door to collaboration across disciplines. For example, there is a rich literature on quantitative methods from cognitive-behavioral therapy for cognitive debiasing [croskerry2013cognitive]. More interestingly, perhaps the methods that researchers have developed for detecting and fixing biases in AI systems can be transported back to cognitive-behavioral therapy.

7 Conclusion

In this work, we proposed a method for detecting biases in AI models using a coherent intellectual framework rooted in Experimental Psychology and Psychophysics. We adapted the two-alternative forced choice (2AFC) task and a sampling mechanism based on MCMC to examine these biases. We evaluated gender biases in a word embedding model trained on Wikipedia and in a sentiment analysis model trained on movie reviews. Our results suggest that we are able to detect these biases while keeping a conceptual language common to that used in Psychophysics.

In the future, we will explore how to adapt other ideas from experimental psychology to detect and even fix issues found in AI models. We believe that many of the issues found in AI can be fixed effectively without significant loss of performance. Also, we believe that akin to how humans who have been subjected to racist, sexist, and extremist views can be rehabilitated through deradicalization and disengagement [stern2010deradicalization], AI models can also be rehabilitated.

Acknowledgements

L. Liang and D. E. Acuna were partially funded by NSF grant #1800956 and ORI grant “Methods and tools for scalable figure reuse detection with statistical certainty reporting”. The authors would like to thank Xinxuan Wei for preliminary discussions and contributions to the work.

References