Quantifying the Impact of Cognitive Biases in Question-Answering Systems

Crowdsourcing can identify high-quality solutions to problems; however, individual decisions are constrained by cognitive biases. We investigate some of these biases in an experimental model of a question-answering system. In both natural and controlled experiments, we observe a strong position bias in favor of answers appearing earlier in a list of choices. This effect is enhanced by three cognitive factors: the attention an answer receives, its perceived popularity, and cognitive load, measured by the number of choices a user has to process. While separately weak, these effects synergistically amplify position bias and decouple user choices of best answers from their intrinsic quality. We conclude by discussing novel ways these findings can be applied to substantially improve how high-quality answers are found in question-answering systems.

Introduction

According to the wisdom of crowds, a large group can collectively find a better solution to a problem than a typical individual [de Condorcet1976, Galton1908, Surowiecki2005, Kaniovski and Zaigraev2011]. This effect has become the foundation of crowdsourcing on the web, including systems for content creation [Kittur and Kraut2008], product review [Lim and Van Der Heide2015], peer recommendation [Stoddard2015, Glenski, Johnston, and Weninger2015], and question-answering (Q&A) [Adamic et al.2008, Yao et al.2015]. In many cases, the crowd’s solution aggregates many users’ recommendations or votes as they are sequentially added. Recent work has suggested that this method should optimally determine the best items [Celis, Krafft, and Kobe2016, Krafft et al.2016], and displaying item popularity is a simple way to make high-quality items easier to find. However, individual decisions can be affected by cognitive biases, which may compound to decouple the relation between wisdom (the quality of ideas) and crowds (popularity).

For example, recent research has demonstrated that social influence introduces correlations between decision makers, which can reduce the quality of collective solutions [Lorenz et al.2011, Kaniovski and Zaigraev2011] and make them less predictable [Muchnik, Aral, and Taylor2013, Glenski, Johnston, and Weninger2015, Salganik, Dodds, and Watts2006]. Empirical studies of crowdsourcing systems suggest that users' bounded rationality, and their reliance on heuristics like item position (position bias), are even more important limiting factors in collective performance [Stoddard2015, Burghardt et al.2017].

Our Contribution

In this paper, we examine cognitive factors that affect crowd performance in order to understand how to enhance the wisdom of crowds, and better correlate item popularity with “quality”. We define quality as properties of the answer’s text, such as how well the answer addresses the question or how well it is written, which are independent of where or how the answer is shown to users. Furthermore, we explore these factors in the context of Q&A, a popular crowdsourcing task, because we can compare a natural experiment and empirical data to novel controlled experiments. The questions we address are:

Q1: What cognitive factors contribute to answer popularity?

Q2: How does popularity relate to quality?

We explore these questions with two complementary approaches: (1) simulating Q&A with experiments in which parameters are precisely controlled, and (2) using empirical data to compare our experimental results to real-world systems. The experiment allows us to carefully tease out why answers become popular without confounding variables, while comparing experimental and empirical data allows us to check the ecological validity of the experiment. Specifically, we create an experimental model of Stack Exchange (SE), a popular Q&A platform, using Amazon Mechanical Turk (MT). We use actual questions and answers from the English Language Learners forum of SE, and ask MT workers to pick the best answer to each question. Our model allows us to replicate some of the functionality of SE in a controlled setting, such as the number of answers and the scores MT workers see for each answer. We also record where the workers move their mouse, a proxy for attention that agrees well with eye-tracking data [Chu, Anderson, and Sohn2001].

In addition, we use data from SE to analyze a natural experiment in which users vote for answers that are ordered in different ways while controlling for perceived popularity. The major takeaway from the empirical data is that moving an answer even slightly higher in the list can increase the probability it is chosen by up to 30% compared to ordering answers at random, and that this increase is in large part aided by cognitive load. We further find qualitative agreement between experimental data (where answers are ordered arbitrarily) and empirical data (where answers are presumably ordered by their quality, if popularity and quality are correlated) on the probability of voting for answers versus the answer position. This agreement suggests that users choose answers in large part because of their position on the webpage and not due to their quality.

The rest of the paper is organized as follows. We first discuss the background of crowd wisdom and cognitive biases in user behavior. Next, we describe our MT experiment and the natural experiment in SE, and then show our results in greater detail. We finish by discussing the positive and negative consequences of position bias in Q&A systems, and ways in which crowdsourcing can be improved.

Related Work

The wisdom of crowds, in which a large group, even one composed of uninformed individuals, can collectively reach a better decision than individual experts, was first theoretically predicted in a study of juries [de Condorcet1976]. When jurors are homogeneous and their decisions uncorrelated, the majority decision of a jury is almost always of higher quality than any individual juror's. This presaged the work by Galton and others [Galton1908, Surowiecki2005], who empirically found that the mean of many individual guesses was a better prediction than that of any expert, because the mean averages out uncorrelated errors. In recent years, the wisdom of crowds has been applied to several different fields, such as crowdsourced information [Kittur and Kraut2008] and Q&A forums [Adamic et al.2008, Yao et al.2015], prediction markets [Ray2006], and peer recommendation [Stoddard2015, Glenski, Johnston, and Weninger2015]. Importantly, these applications rely on the assumption that an item's popularity indicates its quality, which, although not necessarily a corollary of early work on crowd wisdom, is sometimes a reasonable assumption [Yucesoy and Barabási2016].

Recent interest, however, has focused on how item popularity can enhance or reduce the wisdom of crowds. Sequential voting, which is commonly used in crowdsourcing systems [Stoddard2015, Muchnik, Aral, and Taylor2013, Glenski, Johnston, and Weninger2015, Adamic et al.2008, Yao et al.2015], can theoretically improve collective quality compared to non-sequential votes [Celis, Krafft, and Kobe2016, Krafft et al.2016]. However, Lorenz et al. [Lorenz et al.2011] found that when users knew the guesses of other users, the variance of guesses dropped dramatically, and confidence increased in guesses that still differed significantly from the correct answer. Similarly, theoretical work suggests that correlated opinions can reduce the quality of collective guesses [Kaniovski and Zaigraev2011]. Salganik, Dodds, and Watts [Salganik, Dodds, and Watts2006] also found that social influence can affect which items become popular independent of intrinsic factors of an item. However, more recent models of the same data suggest that item position was an underappreciated [Krumme et al.2012], and potentially much more important, factor explaining why an item was chosen [Stoddard2015].

Position bias is an effect in which items listed first are more likely to be chosen. This effect occurs, for example, in voter ballots [Ho and Imai2006, Ho and Imai2008], search engines [Joachims et al.2005, Craswell et al.2008], information aggregation sites [Stoddard2015], and peer evaluation [Lerman and Hogg2014]. Although the effect is usually observed when people choose among many items, only recently have researchers explored how this effect appears when there are few items to choose from [Burghardt et al.2017]. In addition, mechanisms underlying position bias have not been fully characterized. The primacy effect, in which items seen first are more likely to be chosen [Mantonakis et al.2009], may play a role, but other effects such as attention [Krajbich and Rangel2011], trust [Joachims et al.2005], or popularity could contribute to position bias as well. One goal of this paper is to decouple these factors in order to determine what causes position bias in real systems.

Similar to some previous work [Yao et al.2015, Adamic et al.2008], our MT experiment tests crowd wisdom by exploring why certain answers are upvoted or accepted by askers in Q&A boards. A recent paper suggested that users appear to vote or accept answers in Q&A boards for reasons other than the intrinsic qualities of an item, especially when there are many answers to choose from [Burghardt et al.2017]. Our paper improves upon this work by disentangling position bias from popularity and quality, and exploring the many cognitive factors that can increase position bias.

We will show that the number of answers users see increases position bias independent of answer quality. A plausible reason for this effect is that users can only process a limited amount of information. When more information is visible, users decrease the effort they are willing to spend evaluating each piece of information due to information overload [Baron1986, Nematzadeh et al.2016]. Finally, we use natural and controlled experiments because empirical data alone could be affected by correlations between attributes, while experiments alone may not capture all relevant aspects of real-world scenarios. Agreement between experiments and empirical data increases confidence in the results, and separately validates each approach [Herbst and Mas2015].

Methods

We used both a controlled experiment with MT workers and a natural experiment resulting from a change in the SE platform.

Mechanical Turk Experiment

Our experimental model of Q&A directs registered Amazon Mechanical Turk workers to a web page instructing them to “choose the most correct answer for each of ten questions” (see Fig. 1 as an example). The page models Stack Exchange, specifically the English Language Learners (ELL) forum, from which the questions were selected. Each of the questions had at least 8 answers. The workers are mostly from the US, Canada, or Britain ( among workers sampled, based on IP addresses), where English is commonly spoken. Our choice of the ELL forum is meant to increase the similarity between workers and SE users, because the questions and answers were meant to be accessible for both native and non-native English speakers. We only show ten questions to each user because we expect the quality of workers’ choices to decline appreciably if they are asked a large number of questions, according to recent research on performance depletion in Q&A systems [Ferrara et al.2017]. Further, we limit the total number of answers workers could see and vote on in order to create sufficient statistics on the popularity of each answer.

In the experimental model, each MT worker is assigned to one of two experimental conditions. In the first (“random”) condition, workers see answers listed in a random order below the question (independently for each worker) and no score is shown. In the second (“social influence”) condition, the scores are shown next to each answer, and answers are ordered by score. In both cases, the 2 or 8 oldest answers from the ELL website were listed below their associated question, which is consistent with the answers SE users would have seen.

The workers assigned to the social influence condition are told that “scores listed next to each answer denote the number of individuals who chose this answer in the past” (as on SE). However, in reality, the “scores” are independent and randomly generated numbers from 0 to 100 in the 2-answer scenario and from 0 to 25 in the 8-answer scenario (such that the scores add up to 100 on average). We generate these numbers independently for each worker. In short, answers are ordered randomly in both experimental conditions, but in one, workers think that other workers upvoted particular answers.
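As an illustration, the score generation can be sketched as follows. This is a minimal Python sketch under the assumptions stated above (independent uniform integer draws per answer and per worker); the function name is ours, not part of the experiment's codebase.

```python
import random

def generate_displayed_scores(n_answers):
    """Draw the fake 'scores' shown to one worker in the social influence
    condition: uniform integers in [0, 100] for 2 answers or [0, 25] for
    8 answers, so that the displayed scores sum to roughly 100 on average."""
    upper = 100 if n_answers == 2 else 25
    return [random.randint(0, upper) for _ in range(n_answers)]

# Example: one worker's view of an 8-answer question.
print(generate_displayed_scores(8))
```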

We recruited workers with an approval rate of over and more than 1000 Human Intelligence Tasks (HITs) completed. For Trial 1 in the random experimental condition (shown in Table 1), we requested “Masters” workers, i.e., people Amazon labels as especially high performing. Their voting behavior was statistically similar to that of other users. Because it takes orders of magnitude more time to find enough workers who are Masters, we dropped this requirement in later experiments. Workers were given up to one hour to complete an assignment (the median time is minutes in the random condition, and minutes in the social influence condition). Each worker was paid for completing the assignment. The equivalent hourly wage is half the rate originally planned because the tasks took unexpectedly long compared to initial tests in which the authors were subjects.

Figure 1: An example screen used in our experiment, showing a question from SE (http://ell.stackexchange.com/questions/30/what-is-the-difference-between-nope-and-no). In this screenshot, the numbers of votes are visible, representing the social influence experimental condition (in the random experimental condition, the numbers of votes are hidden), and a checkmark is next to the chosen answer. After an answer is chosen, users click an “accept” button to progress to the next question.
# Answers    Trial      # Questions (Random)   # Questions (Social Influence)
2 Answers    Trial 1    440                     438
             Trial 2    473                     174
8 Answers    Trial 1    410                     412
             Trial 2    930                     1256
             Trial 3    447                     -

Table 1: Number of questions in each experiment trial. *270 & 228 workers are in the random and social influence conditions, respectively.

Once workers choose an answer, they click a button to advance to the next question. After completing the last question, workers are given a six-digit ID, which they submit in order for us to associate a particular worker with a given experiment. Experiments that are retaken by workers, or that lack an associated ID, are removed from the analysis. Occasionally, so many workers submitted their completed tasks to our server at once that lines of data overlapped, making the raw data difficult to parse; data from these cases are discarded. In the random experimental condition, we have clean data from workers out of completed experiments, and in the social influence experimental condition, we have clean data from workers out of completed experiments. We perform multiple trials for each condition, as listed in Table 1. In the experiments, we record:

  1. the question number

  2. the number of answers for each question

  3. the time a worker answers each question

  4. the answer a worker chooses

  5. the order answers are listed for each question

  6. the times when workers move their computer mouse (or trackpad) cursor over an answer

  7. each answer’s score (if applicable), and

  8. the start and end time for a worker to complete all questions

In the first and last trials in the random condition, and trial 1 in the social influence condition (see Table 1), the number of answers is randomly chosen to be either 2 or 8 for each question. Trial 2 in both conditions always has 8 answers. We check the consistency of behavior across trials by comparing the probability a worker chooses an answer versus its position between trials 1 & 2 using the Kolmogorov-Smirnov (KS) test. This test shows no statistically significant differences between the distributions (p-values ). We further compare the popularity of answers (averaging over the answer position) across the two trials and find high correlations ( & for 2 and 8 answers visible, respectively); therefore, separate trials and experiments produce consistent results.
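A minimal sketch of these consistency checks, with placeholder arrays standing in for our logged data (the exact data layout and the correlation measure used here are assumptions for illustration):

```python
import numpy as np
from scipy import stats

# Placeholder arrays: the list position (1-8) of the answer chosen on each
# vote, one array per trial (real values come from our experiment logs).
chosen_positions_trial1 = np.random.randint(1, 9, size=410)
chosen_positions_trial2 = np.random.randint(1, 9, size=930)

# Two-sample KS test on the distributions of chosen positions.
ks_stat, ks_p = stats.ks_2samp(chosen_positions_trial1, chosen_positions_trial2)

# Per-answer popularity (fraction of votes received, averaged over the
# positions in which the answer appeared), one vector per trial.
popularity_trial1 = np.array([0.20, 0.15, 0.12, 0.14, 0.10, 0.11, 0.09, 0.09])
popularity_trial2 = np.array([0.22, 0.14, 0.11, 0.13, 0.11, 0.10, 0.10, 0.09])
corr, corr_p = stats.pearsonr(popularity_trial1, popularity_trial2)

print(f"KS p-value: {ks_p:.3f}, popularity correlation: {corr:.2f}")
```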

Natural Experiment

In August 2009, SE changed how it ordered answers with the same score from chronological (oldest to newest) to random order [Oktay, Taylor, and Jensen2010]. This change forms a natural experiment for how answer ordering affects users’ choices. To exploit this change, we select an appropriate part of SE and create a dataset as follows.

We look at all votes on all technical and meta boards on SE from August 2008 until September 2014 (https://archive.org/details/stackexchange). Boards labeled as “technical” by SE typically cover programming questions. Meta boards provide a forum to discuss a specific board. For example, “Meta Stack Overflow” discusses topics relating to the board “Stack Overflow”. We split the data to control for Simpson’s paradox, in which behavior seen in aggregated data can differ significantly from the disaggregated data [Simpson1951]. We do not analyze votes where (1) more than 2 answers have the same score as the answer voted on, and (2) there is an accepted answer, which can affect answer positions and may provide an additional social signal to a voter. Finally, we split data by the number of answers users would see when they cast their vote, based on the day they voted. The way in which we reconstruct the score for each answer, in order to determine which answers have the same score, is discussed below. In preliminary work, we further divided the data by the position the 2 answers appeared in (the top or bottom of the page), and when the votes occurred (6 months before and after the change, or the entire time period), but this does not affect our conclusions, therefore we re-aggregated the data, and only focus on the relative position of the 2 answers. We focus on the Stack Overflow data because it is the largest board (with 200K votes in the months before, and 2.8M votes in the months after, the rule change), but we find qualitatively similar behavior in other technical boards or meta boards aggregated together.

The data we have on votes tell us the millisecond when each answer was made, the day each vote was made, the order of votes, which answer was voted on, and whether it was an upvote or downvote. From the date a vote was cast, we can determine how many answers each user saw (assuming the vote was made at the end of the day), while sequentially adding votes to the appropriate answer allows us to determine the score for each answer just before a vote was cast. For the natural experiment, we use this data to determine the order of answers with the same score, and the number of answers seen, just before a vote was cast. This allows us to better understand how answer position and cognitive load (i.e., the number of answers seen) affect voter decisions. Similarly, by knowing the order of answers with different scores, we can better understand how score, coupled with position, increases position bias. In the latter case, we focus on the probability of voting for an answer after August 2009, restricted to votes cast before any answer is accepted, so that answers with the same score are ordered at random relative to each other.
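The score-reconstruction step can be sketched as follows; the field names and data structures below are illustrative, not those of the SE data dump:

```python
from collections import defaultdict

def replay_votes(votes, answer_creation_day):
    """Reconstruct each answer's score, and the number of answers visible,
    just before every vote was cast.

    votes: list of dicts in voting order with keys 'answer_id',
           'vote' (+1 for an upvote, -1 for a downvote), and 'day'.
    answer_creation_day: dict mapping answer_id -> day the answer was posted.
    """
    score = defaultdict(int)
    snapshots = []
    for v in votes:
        # We assume votes are cast at the end of the day, so the voter saw
        # every answer created on or before that day.
        n_visible = sum(1 for day in answer_creation_day.values() if day <= v['day'])
        snapshots.append({
            'answer_id': v['answer_id'],
            'score_before_vote': score[v['answer_id']],
            'n_answers_visible': n_visible,
        })
        score[v['answer_id']] += v['vote']
    return snapshots
```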

Results

In this section, we discuss the factors affecting answer popularity. We will show that, controlling for answer position, the effect of perceived popularity is negligible. That said, position bias is affected by the

  1. the number of answers a user sees (cognitive load),

  2. the score next to each answer (perceived popularity), and

  3. the attention an answer receives.

Separately, these factors are small, but together they help explain position bias we see in real-world data. A natural experiment in which answers with the same score are first ordered chronologically and then at random allows us to determine position bias when controlling for perceived popularity. Furthermore, position bias for experiments in which scores are generated randomly is in surprisingly close agreement with empirical data in which answers are supposedly upvoted due to quality. Together these results suggest that position bias can strongly decouple answer quality from answer popularity.

Origins of the Position Bias

Figure 2: Probability to choose an answer versus its position in the experiment. (a) 8 answers visible in the random condition (inset: 2 answers visible) and (b) 8 answers visible in the social influence condition (inset: 2 answers visible). Also shown is the null attention model, discussed in the main text. Error bars for all data are smaller than the plot markers.

Our experiment disentangles some of the factors contributing to position bias, such as information or cognitive load, score, and attention, and how these factors affect worker decisions. Figure 2 shows the experimental probability that a worker chooses an answer as the best answer as a function of the answer’s position in the list of answers under the random experimental condition (solid blue lines), the social influence condition (solid green lines), and a null model (described below) in which users choose answers based on the amount of attention they receive. Main figures report conditions where 8 answers are shown, while insets report cases where 2 answers are shown. To allow for the best agreement between the null attention model and the data, we removed cases in which users chose an answer that was moused over for less than an arbitrary threshold of 0.01 seconds (, , , and of votes were removed from Figs. 2a–b, respectively). The trends shown in the figures are the same when all votes are included, and when the threshold is larger, such as 0.1 seconds.

When 2 answers are shown in the random condition (Fig. 2a inset), workers are less likely to choose the first answer than the last one (p-value), while when 8 answers are shown, they prefer top answers to those shown in lower positions (Fig. 2a). Future work is necessary to understand why the last answer was more likely to be chosen when 2 answers were shown. This finding is not affected by including data where the chosen answer is not moused over. That said, the growing preference for answers appearing earlier in a list as the number of answers increases agrees with previous research [Burghardt et al.2017]. In the social influence experimental condition, on the other hand, workers are more likely to choose answers in top positions when both 2 and 8 answers are shown (Fig. 2b). Just as in the case when scores were not shown, the top half of the answers is more likely to be chosen as the number of answers increases (58% and 68% for 2 and 8 answers, respectively), although the overall probability to choose top answers increases significantly when scores are shown. Interestingly, when we control for position, there is no statistically significant correlation between score and the probability an answer is chosen (all p-values are greater than ), so scores appear to amplify position bias rather than drive choices directly.

We determine how attention contributes to position bias by using mouse movement, which correlates with eye tracking [Chu, Anderson, and Sohn2001]. Specifically, we only record when the mouse is moving over or clicking on an answer, rather than when users scroll over it with their scroll wheel, because we want greater confidence that mouse movements were intentional. Although mouse tracking data is not perfect, we believe it is a practical way to measure attention. To check this, we compared the probability to choose an answer versus the fraction of time a worker mouses over it, which we call time share. We find that the larger the time share, the greater the probability a worker will choose the answer (Cragg & Uhler’s pseudo-R² values are between using logistic regression), in qualitative agreement with previous research [Krajbich, Armel, and Rangel2010, Krajbich and Rangel2011], in which users were more likely to pick items that received more attention.
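A minimal sketch of this check using synthetic data (the real analysis uses our per-answer mouse logs; the pseudo-R² shown is the standard Cragg & Uhler/Nagelkerke definition computed from the model log-likelihoods):

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-in for per-(worker, answer) records: time_share is the
# fraction of mouse-over time an answer received; chosen is 1 if picked.
rng = np.random.default_rng(0)
time_share = rng.uniform(0, 1, 500)
chosen = rng.binomial(1, 1 / (1 + np.exp(-(4 * time_share - 2))))

# Logistic regression of choice on time share.
X = sm.add_constant(time_share)
fit = sm.Logit(chosen, X).fit(disp=0)

# Cragg & Uhler's (Nagelkerke) pseudo-R^2 from the log-likelihoods.
n = len(chosen)
cox_snell = 1 - np.exp((2 / n) * (fit.llnull - fit.llf))
cragg_uhler = cox_snell / (1 - np.exp((2 / n) * fit.llnull))
print(f"Cragg & Uhler pseudo-R^2: {cragg_uhler:.2f}")
```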

We create a null model in which users choose an answer based on the amount of attention it receives. In this model, the probability a user chooses an answer is directly proportional to the share of time the answer is moused over. The dashed lines in Figure 2 compare this null model to the experimental data. We find that position bias is much stronger than the null model predicts when 8 answers are visible (the two differ significantly, p-values ), therefore position bias cannot be fully explained by this model, though the model does appear to partly explain position bias when scores are visible.
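A minimal sketch of the null attention model for a single worker and question (the numbers are illustrative; averaging these per-worker probability vectors by position yields the dashed curves in Fig. 2):

```python
import numpy as np

def null_attention_choice_probs(mouse_over_seconds):
    """Null model: the probability of choosing each answer equals the share
    of the worker's total mouse-over time that the answer received."""
    t = np.asarray(mouse_over_seconds, dtype=float)
    return t / t.sum()

# Illustrative mouse-over times (seconds) for 8 answers, top to bottom.
print(null_attention_choice_probs([12.0, 8.0, 6.0, 5.0, 4.0, 3.0, 2.0, 2.0]))
```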

These observations lead us to the following conclusions.

  1. Cognitive load (the number of answers visible) increases position bias,

  2. Perceived popularity increases position bias,

  3. Perceived popularity, when corrected for position, is not a significant factor, and

  4. Attention increases position bias.

To better understand the last point, we examine the percentage of times an answer is moused over versus its position (Fig. 3). We find that, although the top answer is moused over with equal regularity in both the score and no-score conditions, users mouse over later answers 5% less on average when scores are visible than when they are not (p-values ). Top answers therefore receive more attention, which subsequently increases the probability that an answer is picked.

Figure 3: The role of attention in workers’ choices of answers. The mean percentage of times an answer is moused over versus its position in the random and social influence experimental conditions. Error bars are standard errors.

Comparison with Empirical Data

To confirm that our results are not specific to the design of the experiment, or traits specific to MT workers, we perform comparative analysis with data from SE. We used anonymized data representing all questions, answers, and votes from August 2008 until September 2014 (https://archive.org/details/stackexchange). Results of empirical analysis, including the natural experiment, are strongly consistent with the experiment, giving confidence about its ecological validity.

Figure 4: Position bias versus number of answers visible in a natural experiment. The probability to vote for the older of 2 answers with the same score on Stack Overflow when the oldest answer is shown first (red squares) or at random (cyan circles).

Position Bias: Evidence from a Natural Experiment

In August 2009, SE changed how it ordered answers with the same score from chronological to random order. This change allows us to test how answer ordering affects users’ choices, which we plot in Fig. 4. We first notice that older answers are more likely to be chosen than newer answers when 2 or 3 answers are visible, but the preference shifts towards newer answers as the number of answers increases. There is also an increasing preference for the answer listed first as the number of answers increases. In fact, an older answer is up to 30% more likely to be chosen when listed first than when answer positions are randomized (p-value for all plot markers shown). In comparison, a previous study of this natural experiment did not find a statistically significant position bias when all data were apparently aggregated together [Oktay, Taylor, and Jensen2010].

Figure 5: Comparison between experiment and empirical data. The probability SE users in non-technical boards (yellow bars) and MT workers (green squares) vote for an answer when scores are visible and (a) 2 or (b) 8 answers are visible. Error bars are smaller than the plot markers.

Perceived Popularity: Evidence from Empirical Data

Although the previous findings suggest that answer position strongly affects the probability to choose an answer, they do not address the effect of perceived popularity. We therefore used the vote data from all non-technical SE forums from August 2009 through September 2014 to determine the aggregate fraction of votes users give to answers at each position, measured just before each vote is cast (Fig. 5). For non-technical forums, this produces 790K votes when 2 answers are visible and 43K votes when 8 answers are visible. We see similar behavior in data from other forums, e.g., Stack Overflow. The ELL forum alone had too little data to make an adequate comparison. Answers are ordered, by default, from highest to lowest score, which provides a direct comparison between our experiment results and the results from the data.
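A minimal sketch of this aggregation, using hypothetical per-vote records reconstructed as described in the Methods section (column names are ours):

```python
import pandas as pd

# Hypothetical per-vote records: 'position' is the rank (1 = top) of the
# voted-on answer just before the vote, 'n_visible' is how many answers
# the voter saw at that time.
votes = pd.DataFrame({
    'position':  [1, 2, 1, 1, 2, 1, 2, 2, 1, 1],
    'n_visible': [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
})

# Fraction of votes received by each position among questions with a given
# number of visible answers (here 2), i.e., the quantity plotted in Fig. 5a.
subset = votes[votes['n_visible'] == 2]
vote_fraction = subset['position'].value_counts(normalize=True).sort_index()
print(vote_fraction)
```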

We find that the experiment agrees qualitatively with observed user behavior on SE non-technical forums. For example, in Fig. 5, the experiment’s probability to choose an answer versus its position is almost exactly the same as in the empirical data. This is surprising because answers are presumably ordered by their quality in the empirical data, while our experiment orders answers arbitrarily with artificial scores. Much of the probability to choose an answer in real data could therefore be due to its position and not its quality. That said, when 8 answers are visible, there is a stronger preference for the top answer in the empirical data than in the experiment, therefore our experiment’s assumption of arbitrary answer order agrees less well with the data when many answers are visible. However, as noted previously, the vast majority of questions have few answers (where agreement is strongest); therefore, in most cases, answer position seems to be a bigger factor in answer popularity than answer quality.

Discussion: A Good Answer is Hard To Find

Our experimental model of Q&A helps elucidate factors affecting user choices of the best answers to questions. We find that an answer’s position plays an important role in this decision and is strongly enhanced by perceived popularity, information load, and the attention given to top answers. We see strong agreement between our experimental model and empirical data, which demonstrates the experiment’s success: it captures many aspects of real Q&A systems, despite differences in the populations, and the different motivations, of MT workers and SE users.

The empirical data can also distinguish position bias from other factors that contribute to answer popularity. We first discover from the natural experiment that moving an answer up by one position increases the probability that it is chosen by up to 30% compared to when answers are ordered at random. The probability of choosing an answer listed one position higher than another, however, could be almost double that of the lower-positioned answer, depending on the number of answers visible. This is because the randomized-position probability (around 0.35 when 15 answers are visible) is a mixture of the probability to choose an answer when it is listed first (around 0.45) and the probability to choose it when it is listed last (which would presumably have to be around 0.25 if 0.35 is the mean value). Furthermore, and more troubling, the probability to choose an answer versus its position is very similar in the experiment, where answers were ordered arbitrarily, and in the empirical data, where answers were presumably ordered based on their quality. This suggests that position bias, rather than quality, is a major factor contributing to answer popularity.
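The arithmetic behind this point, using the approximate values quoted above:

```python
# Approximate values quoted above for 15 visible answers.
p_randomized = 0.35  # older answer chosen when the tied answers are ordered at random
p_first      = 0.45  # older answer chosen when it is listed first of the tied pair

# If random ordering is an even mixture of first and last placements, the
# implied probability when listed last is:
p_last = 2 * p_randomized - p_first   # 0.25
print(p_first / p_last)               # ~1.8, i.e., almost double
```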

What cognitive mechanisms may affect position bias? When controlling for score and attention, one hypothesis is that users have a trust bias, i.e., they believe answers are ordered by their quality and therefore trust top answers [Joachims et al.2005]. Alternately, they may pick answers that grab their attention [Krajbich and Rangel2011]. Finally, the primacy effect, in which individuals prefer the first object they see, may play a role. The primacy effect applies to the temporal order of items, but we find that almost all workers scroll from the top to the bottom of a page, mousing over answers in turn, therefore top answers are typically the first ones seen.

In the random condition, the primacy effect probably plays a role in creating the position bias because it adequately explains why users prefer the answers they see first when 8 answers are visible, even though it cannot explain why users pick the last answer when 2 answers are visible. Attention plays a lesser role because attention seems to be flat regardless of answer position. We cannot completely rule out the trust bias but workers are not told how answers are ordered, and presumably will not assume a relationship between answer order and quality. Furthermore, it would seem strange that users would use this heuristic to prefer the last answer when 2 answers are visible and the first few answers when 8 are visible.

In the social influence condition, trust bias could instead play a larger role. After controlling for position, the effect of scores is negligible. It therefore seems plausible that the relative position of scores contributes to position bias, meaning users prefer top answers because they trust they are of higher quality than lower ones. This is the trust bias where the assessment of quality is due to the heuristic of answer scores (popularity). It is also interesting that scores draw people’s attention towards the top answers more than the bottom answers, therefore scores increase attention given to top answers, which is known to bias choice preferences [Krajbich and Rangel2011]. This is not the sole reason for position bias because the experiment often has a stronger position bias than the attention-based model would predict. Other attention-based null models, including picking the most-moused-over answer and choosing random answers that were moused over, produce similar results.

How do we improve the wisdom of crowds in Q&A systems? If we first determine the quality of answers after controlling for position bias, we can list answers from highest to lowest quality. The highest-quality answers will then be more visible because they are at the top of the webpage. In this way, position bias can improve user interactions by making good answers much easier to find.

This requires finding high-quality answers. One way to do so is to initially randomize the position of answers for each user and not show the answer scores, but still record the position-aggregated popularity of each answer. This reduces the correlations between users by reducing perceived popularity and position bias. We find in our experiments that position aggregation consistently finds the same answers to be popular regardless of the number of answers visible, presumably due to answer quality. For example, the popularity of the oldest 2 answers (averaged over answer position) is highly correlated between conditions where 2 or 8 answers are visible (Spearman rank correlations are 0.74 & 0.58 for the social influence and random conditions, respectively, with p-values ).
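A minimal sketch of this consistency check, with placeholder popularity values standing in for our position-aggregated estimates:

```python
import numpy as np
from scipy.stats import spearmanr

# Placeholder values: position-aggregated popularity of the oldest 2 answers
# to each question, estimated once with 2 answers visible and once with 8.
popularity_2_visible = np.array([0.62, 0.38, 0.55, 0.45, 0.70, 0.30])
popularity_8_visible = np.array([0.25, 0.10, 0.22, 0.15, 0.30, 0.08])

rho, p = spearmanr(popularity_2_visible, popularity_8_visible)
print(f"Spearman rank correlation: {rho:.2f} (p = {p:.3f})")
```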

Conclusion & Future Work

This paper analyzes factors that can affect the performance of Q&A systems. First, we create a controlled experiment on MT that replicates the main functionality of SE by asking workers to choose the best answers to questions taken from an actual Q&A forum. Controlling how, and in what order, the answers are shown to workers enables us to disentangle the effects contributing to their choices of best answers. We find that an answer’s position strongly affects the probability that it will be chosen, and that this effect increases with cognitive load, perceived popularity, and attention. Perceived popularity alone, however, is not a significant factor. Next, we use empirical data and a natural experiment to elucidate the cognitive factors affecting answer popularity. Overall, we find broad agreement with our experiment results, which gives us confidence about its ecological validity. Because our results apply to different types of Q&A boards on SE, we believe that they are widely applicable, at least among Q&A systems. Furthermore, our observations are in line with recent work showing that position bias, when coupled with popularity-based ranking, reduces collective ability to identify highest-quality items [Abeliuk et al.2017].

Although we do not find a correlation between the probability to choose an answer and perceived popularity when we correct for answer position, our current experiment does not completely decouple answer score from position, because answers are ranked from highest to lowest score. This motivates experiments where answer score and position are uncorrelated. Initial results from these experiments, however, agree with our current results: the effect of perceived popularity alone, especially when many answers are visible, is minimal. Future work will further explore these results.

Furthermore, although we find strong qualitative agreement between our model and empirical data, a common critique of any experiment is ecological validity: how similar is our experimental setting to the real world? For example, MT and SE users represent different populations; e.g., there may be more fluent English-speaking workers in our experiment than among users of the board, which caters to individuals who are still learning English. Furthermore, workers have extrinsic motivations (getting paid), while SE users may be more intrinsically motivated, although the “reputation” and badges they receive for good answers and questions could also be interpreted as a form of extrinsic motivation. Our experimental design and choice of questions, focusing on general-interest questions about the English language, minimizes this risk, but cannot rule it out completely. To address this potential critique, experiments should be created on MT in which workers sequentially upvote answers to questions (in the same way scores are created on SE). One can compare the popularity of answers versus their position on SE and MT using this method to determine whether both groups of people have the same motivations to choose answers. Furthermore, this experiment can help determine whether answer popularity through sequential voting agrees well with position-averaged answer popularity (a proxy for quality). Strong disagreement would add further evidence that answer score and answer quality are decoupled.

Acknowledgments

Our work is supported by the Army Research Office under contract W911NF-15-1-0142 and by the Defense Advanced Research Projects Agency under contract W911NF-18-C-0011.

References

  • [Abeliuk et al.2017] Abeliuk, A.; Berbeglia, G.; Hentenryck, P. V.; Hogg, T.; and Lerman, K. 2017. Taming the unpredictability of cultural markets with social influence. In Proceedings of the 26th International World Wide Web Conference (WWW2017). Republic and Canton of Geneva, Switzerland: International World Wide Web Conferences Steering Committee.
  • [Adamic et al.2008] Adamic, L. A.; Zhang, J.; Bakshy, E.; and Ackerman, M. S. 2008. Knowledge sharing and yahoo answers: Everyone knows something. In Proceedings of the 17th international conference on World Wide Web, 665–674. New York, NY: ACM.
  • [Baron1986] Baron, R. S. 1986. Distraction-conflict theory: Progress and problems. Advances in experimental social psychology 19:1–39.
  • [Burghardt et al.2017] Burghardt, K.; Alsina, E. F.; Girvan, M.; Rand, W.; and Lerman, K. 2017. The myopia of crowds: A study of collective evaluation on stack exchange. PLOS ONE 12(3):e0173610.
  • [Celis, Krafft, and Kobe2016] Celis, L. E.; Krafft, P. M.; and Kobe, N. 2016. Sequential voting promotes collective discovery in social recommendation systems. In Proceedings of the Tenth International AAAI Conference on Web and Social Media (ICWSM 2016), 42–51. AAAI Press.
  • [Chu, Anderson, and Sohn2001] Chu, M.; Anderson, J. R.; and Sohn, M. H. 2001. What can a mouse cursor tell us more?: correlation of eye/mouse movements on web browsing. In CHI EA ’01 CHI ’01 Extended Abstracts on Human Factors in Computing Systems, 281–282.
  • [Craswell et al.2008] Craswell, N.; Zoeter, O.; Taylor, M.; and Ramsey, B. 2008. An experimental comparison of click position-bias models. In WSDM ’08 Proceedings of the 2008 International Conference on Web Search and Data Mining, 87–94.
  • [de Condorcet1976] de Condorcet, M. 1976. “Essay on the Application of Mathematics to the Theory of Decision-Making.” Reprinted in Condorcet: Selected Writings. Indianapolis, Indiana: Bobbs-Merrill.
  • [Ferrara et al.2017] Ferrara, E.; Alipoufard, N.; Burghardt, K.; Gopal, C.; and Lerman, K. 2017. Dynamics of content quality in collaborative knowledge production. In ICWSM ’17 Proceedings of the 11th International AAAI Conference on Web and Social Media.
  • [Galton1908] Galton, F. 1908. Vox populi. Nature 75:450–451.
  • [Glenski, Johnston, and Weninger2015] Glenski, M.; Johnston, T. J.; and Weninger, T. 2015. Random voting effects in social-digital spaces: A case study of reddit post submissions. In Proceedings of the 26th ACM Conference on Hypertext & Social Media, 293–297. New York, NY: ACM.
  • [Herbst and Mas2015] Herbst, D., and Mas, A. 2015. Peer effects on worker output in the laboratory generalize to the field. Science 350(6260):545–549.
  • [Ho and Imai2006] Ho, D. E., and Imai, K. 2006. Randomization inference with natural experiments. J. Amer. Statist. Assoc. 101(475):888–900.
  • [Ho and Imai2008] Ho, D. E., and Imai, K. 2008. Estimating causal effects of ballot order from a randomized natural experiment. Public Opinion Quarterly 72(2):216–240.
  • [Joachims et al.2005] Joachims, T.; Granka, L.; Pan, B.; Hembrooke, H.; and Gay, G. 2005. Accurately interpreting clickthrough data as implicit feedback. In SIGIR ’05 Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, 154–161.
  • [Kaniovski and Zaigraev2011] Kaniovski, S., and Zaigraev, A. 2011. Optimal jury design for homogeneous juries with correlated votes. Theory Dec. 71:439–459.
  • [Kittur and Kraut2008] Kittur, A., and Kraut, R. E. 2008. Harnessing the wisdom of crowds in wikipedia: Quality through coordination. In CSCW ’08 Proceedings of the 2008 ACM conference on Computer supported cooperative work, 37–46.
  • [Krafft et al.2016] Krafft, P. M.; Zheng, J.; Pan, W.; Penna, N. D.; Altshuler, Y.; Shmueli, E.; Tenenbaum, J. B.; and Pentland, A. 2016. Human collective intelligence as distributed bayesian inference. arXiv preprint:1608.01987.
  • [Krajbich and Rangel2011] Krajbich, I., and Rangel, A. 2011. Multialternative drift-diffusion model predicts the relationship between visual fixations and choice in value-based decisions. PNAS 108(33):13852–13857.
  • [Krajbich, Armel, and Rangel2010] Krajbich, I.; Armel, C.; and Rangel, A. 2010. Visual fixations and the computation and comparison of value in simple choice. Nature Neuroscience 13(10).
  • [Krumme et al.2012] Krumme, C.; Cebrian, M.; Pickard, G.; and Pentland, S. 2012. Quantifying social influence in an online cultural market. PLoS ONE 7(5):e33785.
  • [Lerman and Hogg2014] Lerman, K., and Hogg, T. 2014. Leveraging position bias to improve peer recommendation. PLOS ONE 9(6):e98914.
  • [Lim and Van Der Heide2015] Lim, Y.-s., and Van Der Heide, B. 2015. Evaluating the wisdom of strangers: The perceived credibility of online consumer reviews on yelp. Journal of Computer-Mediated Communication 20(1):67–82.
  • [Lorenz et al.2011] Lorenz, J.; Rauhut, H.; Schweitzer, F.; and Helbing, D. 2011. How social influence can undermine the wisdom of crowd effect. Proceedings of the National Academy of Sciences 108(22):9020–9025.
  • [Mantonakis et al.2009] Mantonakis, A.; Rodero, P.; Lesschaeve, I.; and Hastie, R. 2009. Order in choice: Effects of serial position on preferences. Psychol. Sci. 20(11):1309–1312.
  • [Muchnik, Aral, and Taylor2013] Muchnik, L.; Aral, S.; and Taylor, S. J. 2013. Social influence bias: A randomized experiment. Science 341:647–651.
  • [Nematzadeh et al.2016] Nematzadeh, A.; Ciampaglia, G. L.; Ahn, Y.-Y.; and Flammini, A. 2016. Information overload in group communication: From conversation to cacophony in the twitch chat. arXiv:1610.06497.
  • [Oktay, Taylor, and Jensen2010] Oktay, H.; Taylor, B. J.; and Jensen, D. D. 2010. Causal discovery in social media using quasi-experimental designs. In SOMA ’10 Proceedings of the First Workshop on Social Media Analytics, 1–9. ACM.
  • [Ray2006] Ray, R. 2006. Prediction markets and the financial “wisdom of crowds”. Journal of Behavioral Finance 7(1):2–4.
  • [Salganik, Dodds, and Watts2006] Salganik, M.; Dodds, P.; and Watts, D. 2006. Experimental study of inequality and unpredictability in an artificial cultural market. Science 311:854–856.
  • [Simpson1951] Simpson, E. 1951. The interpretation of interaction in contingency tables. J. R. Stat. Soc. 13:238–241.
  • [Stoddard2015] Stoddard, G. 2015. Popularity dynamics and intrinsic quality in reddit and hacker news. In Proceedings of the Ninth International AAAI Conference on Web and Social Media, 416–425.
  • [Surowiecki2005] Surowiecki, J. 2005. The wisdom of crowds. New York: Anchor.
  • [Yao et al.2015] Yao, Y.; Tong, H.; Xie, T.; Akoglu, L.; Xu, F.; and Lu, J. 2015. Detecting high-quality posts in community question answering sites. Information Sciences 302:70–82.
  • [Yucesoy and Barabási2016] Yucesoy, B., and Barabási, A.-L. 2016. Untangling performance from success. EPJ Data Science 5(1):1.