Distinguishing Question Subjectivity from Difficulty for Improved Crowdsourcing

02/12/2018 ∙ by Yuan Jin, et al. ∙ Deakin University ∙ Monash University

The questions in a crowdsourcing task typically exhibit varying degrees of difficulty and subjectivity. Their joint effects give rise to the variation in responses to the same question by different crowd-workers. This variation is low when the question is easy to answer and objective, and high when it is difficult and subjective. Unfortunately, current quality control methods for crowdsourcing consider only the question difficulty to account for the variation. As a result, these methods cannot distinguish workers' personal preferences for different correct answers of a partially subjective question from their ability/expertise to avoid objectively wrong answers for that question. To address this issue, we present a probabilistic model which (i) explicitly encodes question difficulty as a model parameter and (ii) implicitly encodes question subjectivity via latent preference factors for crowd-workers. We show that question subjectivity induces grouping of crowd-workers, revealed through clustering of their latent preferences. Moreover, we develop a quantitative measure of the subjectivity of a question. Experiments show that our model (1) improves the performance of both quality control for crowd-sourced answers and next answer prediction for crowd-workers, and (2) can potentially provide coherent rankings of questions in terms of their difficulty and subjectivity, so that task providers can refine their designs of the crowdsourcing tasks, e.g. by removing highly subjective questions or inappropriately difficult questions.


1. Introduction

Outsourcing tasks to a flexible online workforce (aka crowdsourcing) has proven a successful paradigm for data collection in numerous fields due primarily to its overall lower costs and shorter turnaround time as compared to in-house expert-based data collection. The downside of online crowdsourcing is that the quality of the answers collected from crowd-workers is usually not guaranteed, even when multiple responses are collected and aggregated for each question, and workers are trained and vetted using gold-standard questions. To address this issue, many quality control methods for the crowdsourced answers have been proposed (Whitehill et al., 2009a; Welinder et al., 2010). These methods rely on the assumptions that most crowd-workers are reliable when answering the questions and that a given worker is more likely to be reliable should she agree with the majority of her co-workers on the majority of their jointly answered questions. Thus, the methods have focused on modelling the accuracy/ability/expertise of individual workers, assuming this to be correlated with the quality of the responses (Dawid and Skene, 1979; Raykar et al., 2010). In recent years, it has become popular for quality control methods to also model the influence that individual questions exert on the quality of the responses (Whitehill et al., 2009b; Bachrach et al., 2012). Broadly speaking, the following two key properties of questions have drawn the modelling attention:


  • Difficulty.

    The modelling of question difficulty is founded on the assumption that greater agreement on workers’ answers to a particular question indicates less difficulty for them in determining the correct response. Quality control methods often encode this assumption using a function in which worker expertise counteracts question difficulty for predicting the probability of a correct response. The probability is also known as the quality of the response: the more difficult the question, the lower the quality of a response, and vice versa. In addition, some methods (e.g. Kamar et al., 2015) also consider the existence of deceptive questions which are so difficult that the assumption that the majority of worker responses are correct no longer holds.

  • Subjectivity. In crowdsourcing, there are also many tasks that contain (purely or partially) subjective questions (Nguyen et al., 2016). Intuitively, the degree of subjectivity of a question (or equivalently, the data item described by it) depends on the number of answer options that are correct. Being purely subjective means all of the options are correct, while being partially subjective means more than one but not all of them are correct. Unless it is explicitly announced by the task provider that a question accepts all options (e.g. movie rating by workers to build a movie recommender system (Lee et al., 2013)), the number of correct answers to a question is unknown and assumed by most quality control methods to be one, meaning the question is objective. However, it is widely known that even expert assessors can disagree with each other on the correct answer to a question in typical crowdsourcing tasks like relevance judgement, which is deemed to be “quite subjective” (or equivalently, at least partially subjective) (Voorhees, 2000). In this case, the objectivity assumption on the questions does not hold, and most of the quality control methods based on this assumption cannot distinguish the answering accuracy/quality of workers from their preferences for the different answers of questions.

(a) Product Matching task: crowd-workers asked whether two product descriptions referred to the same item or not.
(b) Fashion Judgement task: crowd-workers asked whether a picture contains a “fashion related item” or not.
Figure 1. Heatmaps showing inter-worker response similarity (% of response agreement) for two different tasks: (a) a relatively objective product matching task and (b) a more subjective fashion judging task, both involving binary worker responses. Hierarchical clustering was performed to order workers such that similar workers are close together. The three yellow blocks in the figure for task (b) indicate three groups of response behaviour and higher subjectivity for task (b).

For crowdsourcing tasks that contain questions whose subjectivity is either unknown or known to be at least partial, novel quality control methods need to be developed to capture any underlying answering/labelling pattern that results from the subjectivity of the questions. One such pattern uncovered by collaborative filtering (Koren et al., 2009) is that crowd-workers who share similar preferences tend to respond similarly towards subjective questions which share certain (latent) features. This pattern can also be observed in crowdsourcing, where groups emerge amongst crowd-workers in terms of the answers they give to partially subjective questions. Figure 1 illustrates this phenomenon by providing heat maps of pairwise worker similarity for two tasks: (a) a relatively objective task and (b) a more subjective one. The objective task required workers to judge whether a pair of products were the same based on their names, descriptions and prices, while the subjective task asked workers to judge whether an image contained “fashion related items” (the datasets for the two tasks are listed in section 5). The similarity between a pair of workers is calculated as the percentage of agreement across their jointly answered questions (pairs of crowd-workers sharing no questions were assigned a fixed default similarity), and hierarchical clustering has been performed to group similar workers together. The three yellow boxes along the diagonal for the more subjective task (b) indicate three distinct groups of worker response behaviour for this task, which were absent in the more objective task (a). Since the workers were mostly reliable on both tasks, we conjecture that the grouping of response behaviour for the workers in the fashion judgement task reflects the underlying structures in their tendencies for selecting the different correct answers of the same questions (due to their subjectivity).
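The pairwise similarity behind Figure 1 can be sketched in a few lines. The following is only an illustration, assuming responses are stored as a nested dictionary; the function name `pairwise_agreement` and the `default` value used for worker pairs without shared questions are hypothetical choices, not taken from the paper.

```python
from itertools import combinations


def pairwise_agreement(responses, default=0.0):
    """Fraction of identical answers on jointly answered questions.

    responses: dict mapping worker -> dict mapping question -> answer.
    default:   similarity given to pairs with no shared question
               (an assumed placeholder value).
    """
    similarity = {}
    for a, b in combinations(sorted(responses), 2):
        shared = responses[a].keys() & responses[b].keys()
        if shared:
            agree = sum(responses[a][q] == responses[b][q] for q in shared)
            similarity[(a, b)] = agree / len(shared)
        else:
            similarity[(a, b)] = default
    return similarity


# Toy data: w1 and w2 share three questions, w3 shares none with them.
responses = {
    "w1": {"q1": 1, "q2": 0, "q3": 1},
    "w2": {"q1": 1, "q2": 1, "q3": 1},
    "w3": {"q4": 0},
}
sim = pairwise_agreement(responses)
```

The resulting dictionary can be turned into a distance matrix (one minus similarity) and passed to any hierarchical clustering routine to order the workers for a heat map.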

To enable the answer quality control for the above tasks and generally, any crowdsourcing task that exhibits arbitrary degrees of question subjectivity and difficulty, we are motivated to develop a statistical model encoding both these properties. The resulting model is able to explain both the randomness and the correlations in the answering behaviour of crowd-workers. More specifically, when a task contains only purely subjective questions, groupings of workers start to emerge due to the subjectivity of questions. A group captures a particular correlation between the crowd-workers within it and the latent correct answers for each of the questions. We model such a correlation by factorising it into the latent preferences of the workers and the latent features of the questions. The assumption is that the workers with similar latent preferences tend to have similar perceptions of what constitutes a correct answer for each of the questions. For instance, asking workers “which colour for this shirt do you like?” is a purely subjective question for which one group of workers who like blue colour in general will answer “blue”, whereas another group who like green colour will choose “green”. There is no reason to believe one of these two groups answer the question more correctly than the other, and their distinct answering patterns tend to remain consistent across similar questions asking about their colour favourites for other items (e.g. trousers, hats).

If a question is partially subjective, it possesses (i) a certain degree of subjectivity, which corresponds to its tendency to have two or more correct answers, and (ii) a certain level of difficulty. This difficulty corrupts the crowd-workers’ perceptions of (what tend to be) the correct answers of the question determined by its subjectivity, to an extent that depends on the difficulty level relative to the workers’ levels of expertise. We model a greater extent of corruption as a lower probability that the worker’s answer equals the subjective (worker-specific) correct answer to the question, and hence as lower quality of the worker’s answer. This subjective correct answer characterizes the particular group to which the worker belongs by sharing similar preferences with some other workers (see the shirt-colour example in the previous paragraph).

In this paper we introduce a new quality-control framework for crowdsourcing that models both the subjective (i.e. worker-specific) truths regarding the correct answers to individual questions and also the difficulty-dependent probability that a worker’s answer to a question will equal her perceived subjective truth. We now summarise the contributions of the paper as follows:


  • A novel statistical model is proposed which encodes the question difficulty explicitly and the question subjectivity implicitly via latent variables for worker preferences and corresponding question features. The model accounts for both the random and the systematic parts of the variance in crowdsourced answers to refine the quality control over them.

  • A Monte Carlo simulation approach is provided for quantifying question subjectivity as the expected number of subjective truths perceived by different groupings of crowd-workers with respect to their preferences.

  • A meaningful ranking of questions in terms of either difficulty or subjectivity is derived from the model parameter estimates. This can bring practical benefits to crowdsourcing such as improving designs of tasks by helping requesters to detect and remove highly subjective questions from the tasks intended to be objective.

(a) model for fully-objective items

(b) model for fully-subjective items

(c) model for partially-subjective items
Figure 2. (a) shows GLAD with a latent variable for each objective truth, (b) shows a collaborative filtering model without objective truths, and (c) is the proposed subjectivity-and-difficulty response (SDR) model for partially-subjective questions that is able to distinguish question difficulty from subjectivity.

2. Related Work

2.1. Latent variable modelling in crowdsourcing

Most state-of-the-art answer/label quality control methods in crowdsourcing have operated under the assumption that each question is purely objective. These methods are primarily based on statistical modelling of the interactions between crowd-workers and questions which determine either the marginal probabilities of the workers’ answers being equal to the corresponding correct answers (Whitehill et al., 2009b; Bachrach et al., 2012; Welinder et al., 2010) or the conditional probabilities of the answers given the correct answers (Dawid and Skene, 1979; Venanzi et al., 2014; Kamar et al., 2015). In comparison, the marginal probabilistic modelling is simpler than the conditional modelling, and also better at mitigating the answer sparsity problem in crowdsourcing (Jung and Lease, 2013; Jung, 2014). The basic marginal probabilistic model is GLAD (Whitehill et al., 2009b), which models the correctness of each answer as a logistic function where the question difficulty counteracts the expertise of the responding worker. Its graphical representation is shown in Figure 2(a), with the following generative scheme for a response $r_{ij}$ given by worker $i$ to question $j$: $\boldsymbol{\pi} \sim \mathrm{Dirichlet}(\boldsymbol{\gamma})$, $l_j \sim \mathrm{Discrete}(\boldsymbol{\pi})$, $r_{ij} \mid l_j \sim \mathrm{Discrete}(\boldsymbol{\theta}_{ij})$. This means a correct answer $l_j$ is drawn for question $j$ from a discrete distribution parametrised by $\boldsymbol{\pi}$, which was previously drawn from a Dirichlet distribution parametrised by $\boldsymbol{\gamma}$. Then, a response $r_{ij}$ conditioned on $l_j$ is drawn from a discrete distribution with the $k$-th component of its parameters calculated as follows:

$$\theta_{ijk} = \sigma(e_i / d_j)^{\delta(k = l_j)} \left( \frac{1 - \sigma(e_i / d_j)}{|K| - 1} \right)^{1 - \delta(k = l_j)}, \qquad \sigma(x) = \frac{1}{1 + e^{-x}} \tag{1}$$

In this case, $K$ is the set of answer options and the function $\sigma$ takes in the expertise factor $e_i$ of worker $i$ and the difficulty factor $d_j > 0$ of question $j$. The output of the function is the probability of the response being correct. When $e_i \to +\infty$, or $d_j \to 0$ with $e_i > 0$, this probability grows, indicating a stronger positive correlation between $r_{ij}$ and $l_j$. When $e_i \to 0$ or $d_j \to +\infty$ and the question has binary options, the probability approaches 0.5, leading to no correlation between the two, which suggests $r_{ij}$ is a random binary pick. When $e_i \to -\infty$, the probability decreases to 0, indicating a stronger negative correlation.
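The limiting behaviour of the GLAD correctness probability described above can be checked numerically. The following is a minimal sketch, assuming the logistic function is applied to expertise divided by a strictly positive difficulty, and assuming the remaining probability mass is spread uniformly over the wrong options; the function and parameter names are illustrative, not taken from the paper.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def glad_response_probs(expertise, difficulty, true_answer, num_options):
    """Response distribution of one worker on one question.

    Assumptions (not from the paper's text): difficulty > 0, and wrong
    options share the leftover probability uniformly.
    """
    if difficulty <= 0 or num_options < 2:
        raise ValueError("need difficulty > 0 and at least two options")
    p_correct = sigmoid(expertise / difficulty)
    p_wrong = (1.0 - p_correct) / (num_options - 1)
    return [p_correct if k == true_answer else p_wrong
            for k in range(num_options)]


# Binary question whose correct answer is option 0.
neutral = glad_response_probs(0.0, 1.0, 0, 2)        # zero expertise
expert = glad_response_probs(20.0, 1.0, 0, 2)        # very high expertise
adversarial = glad_response_probs(-20.0, 1.0, 0, 2)  # very negative expertise
hard = glad_response_probs(2.0, 1000.0, 0, 2)        # very difficult question
```

In this sketch, zero expertise or a very large difficulty both push the correct-answer probability towards 0.5 on a binary question, while strongly negative expertise pushes it towards 0.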

Although efficient in quality control of answers to objective questions, current models based on marginal probabilistic modelling have hardly considered modelling the subjectivity of questions, let alone inferring their possible subjective truths. One of the only two papers that have made progress in this regard is (Tian and Zhu, 2012). It assumes that a higher (lower) joint degree of difficulty and subjectivity for an entire crowdsourcing task can increase (decrease) the number of groups of answers given by the crowd-workers to the questions. The expected size of each group becoming smaller (larger) indicates overall weaker (stronger) correlations of answers given to the questions. Despite attributing the variance of answers to both difficulty and subjectivity, the paper makes no attempt to separate the two when it is supposed to be only the difficulty accounting for the quality of answers. Moreover, this work requires every question in a task to be answered by every worker, which is unrealistic in practice. The other work (Nguyen et al., 2016)

has focused on modelling partially subjective questions with just ordinal answers. It assumes each response to a question is generated by a Normal distribution the mean and the variance of which are linearly regressed over the observed features of the question. This means the model will poorly fit any multi-modal distribution of answers to a question.

2.2. Latent variable modelling in collaborative filtering

In model-based collaborative-filtering (Koren et al., 2009), matrix factorization is typically applied to predicting ordinal ratings provided by users to items (e.g. movies, songs). Its categorical version, shown in Figure 2(b), is less commonly applied but is important for the construction of our model for the quality control of crowdsourced categorical answers. It has a generative process $r_{ij} \sim \mathrm{Discrete}(S(\mathbf{u}_i, \mathbf{v}_j))$ for the response $r_{ij}$, where $K$ is the set of answer options, with its $k$-th component calculated as:

$$S_k(\mathbf{u}_i, \mathbf{v}_j) = \frac{\exp(\mathbf{u}_{ik}^{\top} \mathbf{v}_{jk})}{\sum_{k' \in K} \exp(\mathbf{u}_{ik'}^{\top} \mathbf{v}_{jk'})} \tag{2}$$

Here, $S$ is also called the soft-max function, and $\mathbf{u}_{ik}$ and $\mathbf{v}_{jk}$ are respectively the latent preferences of worker $i$ and the latent features of item $j$ in relation to the $k$-th answer option. The inner product term $\mathbf{u}_{ik}^{\top} \mathbf{v}_{jk}$ indicates how strongly worker $i$ tends to respond to item $j$ with the $k$-th answer option.
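As an illustration of the categorical factorisation above, the sketch below computes soft-max response probabilities from inner products of latent vectors. It assumes that both the worker preferences and the item features are indexed by answer option; the names `worker_prefs`, `item_feats` and `option_probs` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable soft-max over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def option_probs(worker_prefs, item_feats):
    """Probability of each answer option for one worker-item pair.

    worker_prefs[k] and item_feats[k] are the latent vectors tied to
    option k (assumed layout); their inner product is the option's score.
    """
    scores = [sum(u * v for u, v in zip(worker_prefs[k], item_feats[k]))
              for k in range(len(item_feats))]
    return softmax(scores)


# Two options, two latent dimensions: option 0 scores 2.0, option 1 scores 0.5.
probs = option_probs([[1.0, 0.0], [0.0, 1.0]],
                     [[2.0, 0.0], [0.0, 0.5]])
```

Subtracting the maximum score before exponentiating leaves the probabilities unchanged and avoids overflow for large inner products.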

3. Proposed model

Our proposed model endeavours to combine the key characteristics of the latent variable models specified in section 2.1 and section 2.2. We call it the SDR model (Subjectivity-and-Difficulty Response model). It comprises an upstream module, which generates a subjective truth for a question based on the worker’s perception of the correct answer, and a downstream module, which imposes a difficulty-dependent corruption on the subjective truth to generate the actual response from the worker to the question. More specifically, in the upstream module, the latent subjective truth $l_{ij}$ of question $j$ as perceived by crowd-worker $i$ is drawn from the soft-max function specified by Eq. (2), except that the original $r_{ij}$ in the equation is now replaced by $l_{ij}$. This function explains how the worker’s latent preferences interact with the question’s latent features to generate the subjective truth behind her response to the question. In the downstream module, conditioned on the latent subjective truth $l_{ij}$, the response $r_{ij}$ actually given by worker $i$ to question $j$ is determined by the logistic function $\sigma$. It encodes how the worker expertise counteracts the question difficulty to corrupt the subjective truth into the response, and will be defined later in this section. Essentially, the above perception-corruption process is a generalisation of the corruption process of the correct answer signals from objective questions modelled in (Welinder et al., 2010), obtained by additionally considering the question subjectivity.

Unfortunately, the upstream+downstream model described above suffers from an over-parameterisation issue whereby both the upstream component (which determines the worker-specific correct answer) and the downstream component (which determines the noise resulting from worker inaccuracy) can independently and adequately explain the variance observed in worker responses to the same question. In other words, the varied responses from different workers to the same question could equally be due to different perceptions of what constitutes the correct answer to the question or to the difficulty of the question causing low accuracy amongst the respondents. To remedy this situation, we explicitly enforce a group structure over workers in order to limit the variation in the perceptions across workers. This is done by changing the upstream module to have sparsity-inducing priors over the latent preferences of crowd-workers. In this paper, we use Latent Dirichlet Allocation (LDA) (Blei et al., 2003) as such priors. The final graphical representation of the SDR model is shown in Figure 2(c).

The new upstream module of our model assigns a probability vector $\boldsymbol{\phi}_i$, which follows a Dirichlet distribution with a concentration parameter $\alpha$, to each worker $i$. Each component of this probability vector reflects the worker’s tendency to show a particular preference among the set of preferences $T$ she possesses when answering any question. Then, a preference assignment $z_{ij}$ is drawn from $\boldsymbol{\phi}_i$ to determine the specific preference worker $i$ will show when answering question $j$. Each preference $t \in T$ has a weight $w_{tk}$ for each answer option $k$, reflecting how likely that option is to be selected given the preference shown by any worker. In this paper, we fix the dimension of $w_{tk}$ to be strictly 1. This weight is multiplied with the latent feature $v_{jk}$ of question $j$ and the result is input to a soft-max function for drawing the subjective truth $l_{ij}$ behind the response $r_{ij}$. The above generative process can be formulated as $\boldsymbol{\phi}_i \sim \mathrm{Dirichlet}(\alpha)$, $z_{ij} \sim \mathrm{Discrete}(\boldsymbol{\phi}_i)$, $l_{ij} \sim \mathrm{Discrete}(S(\mathbf{w}_{z_{ij}}, \mathbf{v}_j))$, with the $k$-th component of the soft-max function calculated as:

$$S_k(\mathbf{w}_{z_{ij}}, \mathbf{v}_j) = \frac{\exp(w_{z_{ij}k} \, v_{jk})}{\sum_{k' \in K} \exp(w_{z_{ij}k'} \, v_{jk'})} \tag{3}$$

Embodying the sparsity-inducing effect of LDA, the preference probabilities $\boldsymbol{\phi}_i$ are dedicated to revealing the underlying groups of crowd-workers, while the soft-max specified by Eq. (3) governs the positive correlations between the latent correct answers to the same questions perceived by the workers within the same group. When the number of preferences in the preference set $T$ is 1, the probability of the only preference is 1. This has a two-fold meaning: each question has one correct answer, and every worker should perceive the correct answer of any question in the same way. When the size of $T$ is greater than 1, this indicates a certain number of underlying worker groups, which we can recover by applying K-means clustering to the estimated preference probabilities $\hat{\boldsymbol{\phi}}_i$, using the Elbow method to determine the right number of groups.

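The downstream module corrupts the correlations between the subjective truth $l_{ij}$ and the response $r_{ij}$. It draws $r_{ij}$ from a discrete probability distribution $\mathrm{Discrete}(\boldsymbol{\theta}_{ij})$ specified in Eq. (1), except that the logistic function, with $e_i$ denoting the expertise of worker $i$ and $d_j$ the difficulty of question $j$, has the following definition from (Rasch, 1993):

$$\sigma(e_i - d_j) = \frac{1}{1 + \exp(-(e_i - d_j))} \tag{4}$$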

The term $e_i - d_j$ naturally explains the type of biases induced by deceptive questions when the difficulty $d_j$ is much larger than the expertise $e_i$, which is not captured in Eq. (1) as the term $d_j$ there is never smaller than 0, meaning questions never bias workers to answer incorrectly due to their difficulty. Moreover, when the estimated values of $e_i - d_j$ are greater than zero for most responses, SDR deems them more likely to be correct. With more of them deemed correct, the number of inferred correct answers to any question tends to increase. As a result, the size of the latent preference set should grow, from the perspective of SDR, to fit the seemingly more diverse set of correlations between latent correct answers across the questions. Thus, for our model to recover the right number of latent preferences for crowd-workers from their responses, the priors for $e_i$ and $d_j$ need to be set properly, which is elaborated in section 5.1.
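The two-stage generative story of this section (preference, then subjective truth, then corrupted response) can be simulated as follows. This is a sketch under stated assumptions: scalar weights and features per option, a logistic of expertise minus difficulty for the correctness probability, and uniform spreading of the error mass over the other options. All names are illustrative rather than the authors' implementation.

```python
import math
import random


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_discrete(probs, rng):
    """Draw an index with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def sample_sdr_response(pref_probs, weights, features,
                        expertise, difficulty, rng):
    """Simulate one response; returns (preference, subjective truth, response).

    pref_probs: the worker's probabilities over latent preferences.
    weights[t][k]: weight of preference t for option k (scalar).
    features[k]: the question's latent feature for option k (scalar).
    """
    num_options = len(features)
    # Upstream: pick a preference, then the subjective truth it implies.
    z = sample_discrete(pref_probs, rng)
    truth = sample_discrete(
        softmax([weights[z][k] * features[k] for k in range(num_options)]),
        rng)
    # Downstream: difficulty-dependent corruption of the subjective truth.
    p_correct = 1.0 / (1.0 + math.exp(-(expertise - difficulty)))
    p_wrong = (1.0 - p_correct) / (num_options - 1)  # assumed uniform errors
    response = sample_discrete(
        [p_correct if k == truth else p_wrong for k in range(num_options)],
        rng)
    return z, truth, response


rng = random.Random(0)
weights = [[2.0, -2.0], [-2.0, 2.0]]  # two preferences, two options
features = [1.0, 1.0]
# An expert worker on an easy question reproduces her subjective truth.
draws = [sample_sdr_response([0.5, 0.5], weights, features, 50.0, 0.0, rng)
         for _ in range(200)]
```

When the difficulty far exceeds the expertise, the same function produces responses that systematically differ from the subjective truth, which mirrors the deceptive-question behaviour discussed above.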

4. Estimation

4.1. Model parameter estimation

We now provide equations used for parameter estimation, using the notation from Eq. (3) and from Eq. (4) to simplify the equations. The conditional probability for the preference assignment to worker when answering question given the model parameters is:

(5)

where denotes the number of questions excluding question answered by worker given her preference . The joint probability of the other parameters given and the hyper-parameters is:

(6)

The partial derivatives for with respect to the other parameters:

(7)
(8)
(9)
(10)

Here, and . The parameter estimation involves two alternating procedures: sample according to Eq. (5) and optimize in Eq. (6) using LBFGS based on Eq. (7), (8), (9) and (10).

4.2. True answer estimation

A single worker-specific correct answer $l_{ij}$ (as perceived by worker $i$) fails to provide overall information about the correct answers to question $j$. Thus, we should gather the $l_{ij}$ values from all workers who answer each question. However, in practice, each question is assigned to only a limited number (usually 3 or 5) of workers, making the estimate of the true answer distribution poor. Our solution to improving this estimate is to first find underlying clusters of workers (across all questions) by applying K-means with the Elbow method based on 10-fold cross validation to the posterior means of the latent preference probabilities of all the workers. With the centroid $\hat{\boldsymbol{\phi}}_c$ of each resulting cluster $c$, we then calculate the probability that the true answer $l_{cj}$ (as perceived by the workers in cluster $c$) takes the value $k$ as follows:

$$p(l_{cj} = k) = \sum_{t \in T} \hat{\phi}_{ct} \, \frac{\exp(\hat{w}_{tk} \hat{v}_{jk})}{\sum_{k' \in K} \exp(\hat{w}_{tk'} \hat{v}_{jk'})} \tag{11}$$

where $T$ is the set of latent preferences, $K$ is the set of answer options, and $\hat{w}_{tk}$ and $\hat{v}_{jk}$ are the estimates of the weight for preference $t$ and the latent feature of question $j$, both specific to option $k$. The best guess regarding the correct answer according to the workers assigned to cluster $c$ is then:

$$\hat{l}_{cj} = \arg\max_{k \in K} \; p(l_{cj} = k) \tag{12}$$

Now we have a set of correct answer estimates $\hat{L}_j = \{\hat{l}_{cj}\}_{c \in C}$ for question $j$ from all the worker clusters (with $C$ being the set of the clusters). For the task of true answer prediction, we can either arbitrarily choose one from $\hat{L}_j$ as the estimate of the correct answer or choose by following certain strategies. Two simple strategies are to choose from the cluster with the highest average expertise over its workers, or from the cluster with the largest proportion of workers assigned to it. The first strategy states that the correct answer perceived by, on average, the most expert group of workers is the most appropriate, while the second assumes it to be the one perceived by the largest group of workers, which represents the mainstream school of thought. In this paper, we apply the second strategy because most crowdsourcing datasets used in the experiments correspond to relatively simple tasks, where we believe the provided correct answers are more likely to be mainstream opinions. As for the first strategy, it might be more useful than the second for revealing a minority group of expert workers who show distinct preferences on partially or purely subjective questions from the majority of less expert workers.
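The cluster-level truth estimate and the "largest cluster" strategy described above can be sketched as follows. The sketch assumes the cluster centroids, preference weights and question features have already been estimated, and that weights and features are scalars per option; all function and variable names are hypothetical.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cluster_truth_probs(centroid, weights, features):
    """Mixture over preferences of the per-preference soft-max.

    centroid[t]: cluster-level probability of preference t.
    weights[t][k], features[k]: scalar estimates per option k.
    """
    num_options = len(features)
    probs = [0.0] * num_options
    for t, phi in enumerate(centroid):
        per_pref = softmax([weights[t][k] * features[k]
                            for k in range(num_options)])
        for k in range(num_options):
            probs[k] += phi * per_pref[k]
    return probs


def predict_truth(centroids, cluster_sizes, weights, features):
    """Best guess of the cluster with the most workers (second strategy)."""
    largest = max(range(len(centroids)), key=lambda c: cluster_sizes[c])
    probs = cluster_truth_probs(centroids[largest], weights, features)
    return max(range(len(probs)), key=lambda k: probs[k])


weights = [[2.0, -2.0], [-2.0, 2.0]]
features = [1.0, 1.0]
centroids = [[0.9, 0.1], [0.2, 0.8]]
mainstream = predict_truth(centroids, [30, 12], weights, features)
flipped = predict_truth(centroids, [5, 12], weights, features)
```

Swapping `cluster_sizes` for average cluster expertise in `predict_truth` would give the first strategy mentioned in the text.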

4.3. Subjectivity estimation

Despite not being directly estimated in the model, question subjectivity can still be quantified after the model has been estimated. This is achieved based on the reasonable assumption that the subjectivity of each question is proportional to the number of correct answers it affords. Although we do not know the actual number of correct answers $N_j$ to question $j$, we can estimate the value by taking its expectation with respect to the clusters of workers $C$ derived in section 4.2. More precisely, the expected number of correct answers to question $j$ with respect to worker clusters $C$ is: $\mathbb{E}[N_j \mid C] = \sum_{n=1}^{|K|} n \, p(N_j = n \mid C)$. In this equation, $n$ iterates over the possible number of correct answers (from 1 to the size of $K$, the set of answer options). The probability $p(N_j = n \mid C)$ denotes how likely it is that the number of correct answers to question $j$ equals $n$, with respect to the worker clusters $C$. It is, however, difficult to calculate this probability when $|C|$ and $|K|$ are large due to a combinatorial explosion. Thus we apply Monte Carlo simulation to estimate (a measure of) the subjectivity of question $j$ as $\mathbb{E}[N_j \mid C]$ using Algorithm 1.

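Input: cluster centroids $\hat{\boldsymbol{\phi}}_c$ for $c \in C$, weight estimates $\hat{\mathbf{w}}$, feature estimates $\hat{\mathbf{v}}_j$, number of iterations $M$.
Output: subjectivity estimate $n_j$ for question $j$.
$n_j \leftarrow 0$. /* Initialise number of correct answers for question $j$ to zero */
for $m = 1, \dots, M$ do /* Sample over $M$ iterations. */
       $L_j^{(m)} \leftarrow \emptyset$. /* Initialise set of correct answers to be sampled at iteration $m$. */
       for $c \in C$ do
             $z_{cj} \sim \mathrm{Discrete}(\hat{\boldsymbol{\phi}}_c)$. /* Sample group preference assignment $z_{cj}$. */
             $l_{cj} \sim \mathrm{Discrete}(S(\hat{\mathbf{w}}_{z_{cj}}, \hat{\mathbf{v}}_j))$. /* Sample correct answer perceived by worker cluster $c$. */
             $L_j^{(m)} \leftarrow L_j^{(m)} \cup \{l_{cj}\}$ only if $l_{cj} \notin L_j^{(m)}$. /* Add sampled $l_{cj}$ to $L_j^{(m)}$ when it first appears. */
       end for
       $n_j \leftarrow n_j + |L_j^{(m)}|$. /* Increase $n_j$ by number of distinct correct answers sampled at $m$. */
end for
$n_j \leftarrow n_j / M$. /* Divide $n_j$ by $M$ to estimate $\mathbb{E}[N_j \mid C]$ as the question’s subjectivity. */
Algorithm 1 Subjectivity estimation for question $j$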
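A runnable counterpart of Algorithm 1 might look like the sketch below. It assumes scalar per-option weights and features and takes the cluster centroids as given; the function name `estimate_subjectivity` and its parameters are illustrative and not part of the paper.

```python
import math
import random


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sample_discrete(probs, rng):
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def estimate_subjectivity(centroids, weights, features, num_iters, rng):
    """Monte Carlo estimate of the expected number of distinct correct
    answers perceived across worker clusters for one question."""
    num_options = len(features)
    total = 0
    for _ in range(num_iters):
        sampled = set()  # distinct correct answers seen in this iteration
        for centroid in centroids:
            z = sample_discrete(centroid, rng)  # cluster-level preference
            truth = sample_discrete(
                softmax([weights[z][k] * features[k]
                         for k in range(num_options)]),
                rng)
            sampled.add(truth)
        total += len(sampled)
    return total / num_iters


rng = random.Random(0)
weights = [[50.0, -50.0], [-50.0, 50.0]]  # near-deterministic preferences
features = [1.0, 1.0]
# Two clusters with opposite preferences versus two clusters that agree.
split = estimate_subjectivity([[1.0, 0.0], [0.0, 1.0]],
                              weights, features, 1000, rng)
agree = estimate_subjectivity([[1.0, 0.0], [1.0, 0.0]],
                              weights, features, 1000, rng)
```

The estimate always lies between 1 and the smaller of the number of answer options and the number of clusters, so values near 1 suggest an objective question under this measure.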

5. Experiments

The evaluation of our proposed model consists of four parts. The first part is its sensitivity to various degrees of subjectivity in different crowdsourcing tasks. The second and the third parts are its performance in predicting, respectively, the provided correct answers of questions and the answers to be given by crowd-workers to unseen questions. The last part is its consistency with human assessors in assessing the difficulty and the subjectivity of questions. We have used 10 crowdsourcing datasets to evaluate the performance of our model in the experiments corresponding to the four parts. Table 1 summarises these datasets as being either (primarily) objective or partially subjective. Among them, the identification tasks of event time ordering, dog and duck breeds, and same products concern objective factual knowledge, while the judgement tasks of image beauty, document relevance 1&2, facial expression and adult content intrinsically contain certain degrees of subjectivity. The questions of relevance judgement task 2 come from the part of the TREC 2011 crowdsourcing track (Lease and Kazai, 2011) that does not contain the questions of relevance judgement task 1; we collected crowdsourced judgements for task 2 from CrowdFlower.

5.1. SDR hyper-parameter setup

As discussed at the end of section 3, to find the right number of latent preferences for crowd-workers, the hyper-parameters of the expertise and the difficulty in the SDR model need to be carefully set. This is achieved through held-out validation, which leverages noise within worker responses to detect signs that SDR may be overfitting the responses by introducing more latent preferences than necessary. More specifically, we construct a held-out validation dataset by randomly sampling a response from each worker. Thus, the size of such a dataset equals the number of workers participating in a task. Then, given a certain setting of the hyper-parameters, we learn our model based on the remaining responses and use the parameter estimates from the learned model to calculate the prediction accuracy, 1 − MAE (where MAE is the Mean Absolute Error), over the held-out dataset. We repeat the model learning process with each hyper-parameter setting over the same 100 random held-out validation data subsets. We then obtain the average prediction accuracy for our model across these subsets for each hyper-parameter setting. Finally, we choose the setting (including the number of latent preferences) that yields the highest average prediction accuracy for use in the experiments.
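The held-out construction described above (one randomly sampled response per worker) and the accuracy score can be sketched as follows. The representation of responses as `(worker, question, answer)` tuples, the function names, and the reading of the score as one minus the mean absolute error over numerically coded answers are assumptions made for illustration.

```python
import random


def split_one_per_worker(responses, rng):
    """Hold out one randomly chosen response from each worker.

    responses: list of (worker, question, answer) tuples.
    Returns (train, heldout).
    """
    by_worker = {}
    for idx, (worker, _, _) in enumerate(responses):
        by_worker.setdefault(worker, []).append(idx)
    held_idx = {rng.choice(idxs) for idxs in by_worker.values()}
    train = [r for i, r in enumerate(responses) if i not in held_idx]
    heldout = [r for i, r in enumerate(responses) if i in held_idx]
    return train, heldout


def one_minus_mae(predicted, observed):
    """1 - mean absolute error; equals plain accuracy for 0/1 answers."""
    errors = [abs(p - o) for p, o in zip(predicted, observed)]
    return 1.0 - sum(errors) / len(errors)


responses = [
    ("w1", "q1", 1), ("w1", "q2", 0), ("w1", "q3", 1),
    ("w2", "q1", 1), ("w2", "q3", 0),
    ("w3", "q2", 0), ("w3", "q3", 1),
]
train, heldout = split_one_per_worker(responses, random.Random(0))
score = one_minus_mae([1, 0, 1, 1], [1, 1, 1, 0])
```

Repeating the split with different seeds and averaging the score per hyper-parameter setting gives the selection loop the section describes; note that a worker with a single response would contribute nothing to training under this split.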

5.2. Sensitivity analysis

We first verify whether our model is sensitive to various degrees of subjectivity in different crowdsourcing tasks. If a task is (almost entirely) objective, the optimal size of the latent preference set should be 1, meaning that every crowd-worker perceives the correct answers in the same way. Consequently, the probabilities of latent preferences $\boldsymbol{\phi}_i$ for worker $i$ collapse to the single value 1, and the set of correct answers $\hat{L}_j$ for question $j$ collapses to a single correct answer $\hat{l}_j$. In this case, we conduct the held-out validation on our model across the objective datasets, each with the 100 randomly sampled data subsets described in section 5.1. We expect that the average held-out prediction accuracy for our model across these data subsets will decrease when the number of latent preferences increases from 1 to 2, since in this case the model starts to overfit by learning the noise in the training responses to those objective tasks.

If a task is sufficiently subjective, our model should uncover the right number of underlying groups of workers along with the right number of latent preferences. We conduct the experiment in the same way as above to see the difference in average prediction accuracy on held-out unseen responses with the number of preferences increasing from 1 to 3 over the partially subjective datasets. We expect the average prediction accuracy to be higher when the number of preferences is greater than 1. Moreover, since Tian and Zhu (Tian and Zhu, 2012) provided us with the number of worker clusters emerging respectively from the five sub-tasks which constitute the image data in Table 1, we thus compare the corresponding numbers of clusters derived from our model with theirs.

Objective datasets               # Workers   # Items   # Responses
Time (Snow et al., 2008)                76       462         4,620
Dog (Zhou et al., 2012)                109       807         8,070
Duck (Welinder et al., 2010)            53       240         9,600
Product (Wang et al., 2012)            176     8,315        24,945

Partially subjective datasets    # Workers   # Items   # Responses
Image (Tian and Zhu, 2012)             402        60        24,120
Rel1 (Buckley et al., [n. d.])         642     1,787        13,310
Rel2 (Lease and Kazai, 2011)            83       585         1,755
Fashion (Loni et al., 2013)            199     3,837        11,511
Face (Mozafari et al., 2012)            27       584         5,242
Adult (adu, [n. d.])                   269       333         3,324

Table 1. The objective and the partially subjective datasets used in this paper.

5.3. Question correct answer prediction

To verify the ability of the SDR model to predict the true answers of questions, we compare it with the following state-of-the-art quality control methods for crowdsourcing. All of these methods assume that each question has a single correct answer.

  • Majority Vote (MV): The predicted correct answer for each question is the one chosen by the majority of the workers.

  • Multi-dimensional Wisdom of Crowds (MdWC) (Welinder et al., 2010): This model endows both crowd-workers and questions with multi-dimensional latent factors, and provides the workers with additional variables to account for their answering biases.

  • Generative model of Labels, Abilities, & Difficulties (GLAD) (Whitehill et al., 2009b): This model resembles MdWC except that its latent factors (interpreted respectively as expertise and difficulty) are uni-dimensional, and it does not have worker-specific bias variables.

  • Dawid-Skene (DS) (Dawid and Skene, 1979): Unlike GLAD and MdWC, which model the correctness probability of each worker’s response, this model focuses on the (worker-specific) conditional probability of each response option given the correct answer to each question.

  • Community Dawid-Skene (CDS) (Venanzi et al., 2014): This model extends DS by clustering workers over some latent structures imposed on their conditional response probability matrices (given all correct answer possibilities) to alleviate the response sparsity problem.

The performance measure, correct answer prediction accuracy, is the fraction of questions whose predicted correct answer matches the ground-truth answer. For each baseline, the predicted answer is the one inferred by that baseline. For our model, it is the correct answer, calculated by Eq. (12), of the worker cluster to which the largest number of workers is assigned after Elbow K-means. The hyper-parameters for each baseline except MV are optimised using the held-out validation specified in section 5.1 on the exact same random held-out validation subsets of each dataset in Table 1.
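The MV baseline and the accuracy measure described above can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the function names, the `(worker, question, answer)` triple format, the tie-breaking rule and the toy data are all assumptions. The helper `largest_cluster` only illustrates the idea of picking the cluster with the most workers; it does not reproduce Elbow K-means or Eq. (12).

```python
from collections import Counter


def majority_vote(responses):
    """Map each question id to its most frequent answer.

    `responses` is an iterable of (worker_id, question_id, answer) triples.
    Ties go to the tied answer that was seen first (an assumption).
    """
    votes = {}
    for _worker, question, answer in responses:
        votes.setdefault(question, Counter())[answer] += 1
    return {q: counts.most_common(1)[0][0] for q, counts in votes.items()}


def accuracy(predicted, truth):
    """Fraction of ground-truth questions whose predicted answer matches."""
    hits = sum(1 for q, t in truth.items() if predicted.get(q) == t)
    return hits / len(truth)


def largest_cluster(assignments):
    """Return the id of the cluster containing the most workers.

    `assignments` maps worker_id -> cluster_id (hypothetical format).
    """
    return Counter(assignments.values()).most_common(1)[0][0]


# Hypothetical toy data, not taken from any dataset in Table 1.
responses = [
    ("w1", "q1", 1), ("w2", "q1", 1), ("w3", "q1", 0),
    ("w1", "q2", 0), ("w2", "q2", 1), ("w3", "q2", 0),
    ("w1", "q3", 1), ("w2", "q3", 0), ("w3", "q3", 0),
]
truth = {"q1": 1, "q2": 0, "q3": 1}

predicted = majority_vote(responses)
mv_accuracy = accuracy(predicted, truth)
biggest = largest_cluster({"w1": "A", "w2": "B", "w3": "B"})
print(predicted, mv_accuracy, biggest)
```

On the toy data, majority vote gets two of the three questions right, because the single worker holding the true answer to `q3` is outvoted.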

5.4. Worker answer prediction

Predicting the answers that crowd-workers will give to unseen questions matters much more for (partially) subjective crowdsourcing tasks than for objective ones, as the former place more value on the different ways in which workers respond. For example, worker answer prediction is crucial for testing a recommender system built on crowdsourced ratings. In this experiment, we evaluate the performance of all the models except MV on predicting the next answer from each worker. We first sample one answer from each worker to create a held-out test dataset, and then learn all the models from the rest of the data, with their hyper-parameters optimised as described in section 5.1 using the exact same random validation data subsets. Finally, we evaluate the prediction performance of the models on the held-out test data using (1 - MAE), where MAE is the mean absolute error. Due to limited computing power, in this experiment we reduce the number of held-out validation iterations for each model to 15 before a single iteration of held-out test is conducted. We perform 15 such random tests and report the average performance of each model.
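The test protocol above (hold out one answer per worker, then score with 1 - MAE) can be sketched as follows. This is a minimal illustration under assumptions: the triple format, the function names, the fixed seed and the toy values are hypothetical, and no model is trained here.

```python
import random


def split_one_per_worker(responses, seed=0):
    """Hold out one randomly chosen response per worker.

    `responses` is a list of (worker_id, question_id, answer) triples.
    Returns (train, test). A worker with a single response ends up with
    no training data, which a real pipeline would need to handle.
    """
    rng = random.Random(seed)
    by_worker = {}
    for idx, (worker, _question, _answer) in enumerate(responses):
        by_worker.setdefault(worker, []).append(idx)
    held_out = {rng.choice(indices) for indices in by_worker.values()}
    train = [r for i, r in enumerate(responses) if i not in held_out]
    test = [r for i, r in enumerate(responses) if i in held_out]
    return train, test


def one_minus_mae(predicted, actual):
    """1 - mean absolute error; equals accuracy when answers are 0/1."""
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return 1.0 - sum(errors) / len(errors)


# Hypothetical toy responses: 3 workers, 3 questions each.
responses = [(w, q, (i + j) % 2)
             for i, w in enumerate(["w1", "w2", "w3"])
             for j, q in enumerate(["q1", "q2", "q3"])]

train, test = split_one_per_worker(responses)
score = one_minus_mae([1, 0, 1], [1, 1, 1])  # one of three predictions wrong
print(len(train), len(test), score)
```

For binary answers each absolute error is either 0 or 1, so 1 - MAE coincides with plain accuracy; for multi-valued ratings the score depends on how the answers are scaled, which this sketch leaves open.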

Figure 3. (a) The 3 worker clusters on identifying sky from images; (b) the 4 worker clusters on judging beautiful images.

5.5. Subjectivity and difficulty coherence

In this experiment, we investigate whether the estimates of the difficulty and the subjectivity of questions derived from the SDR model are consistent with the judgements of five human assessors. We focused on the object identification & image aesthetics task (in the latter, crowd-workers are asked whether an image is beautiful or not) from (Tian and Zhu, 2012), as its total number of questions is 60, a manageable workload for the assessors to provide good-quality judgements with sufficient effort and concentration. The assessors are either PhD or Master students, three of whom are avid photographers with adequate knowledge of what constitutes a beautiful image, while the other two are novices who, during the group discussion, provided suggestions as to how novices might react to different images. We asked them to rank the images with respect to (i) difficulty and (ii) subjectivity. The respective instructions we gave them were: "rank all these images by how hard they are for crowd-workers to judge correctly by avoiding possible incorrect answers" and "rank them this time by how subjective they are for crowd-workers to judge". The assessors first came up with their two rankings independently. In the process, they could redo the two ranking tasks until they felt confident enough to submit. The assessors then worked together to merge their rankings into a single ranking for difficulty and a single ranking for subjectivity through group discussion and majority vote. The resulting rankings were then compared with the corresponding rankings based on the estimates from the learned SDR model. In addition to ranking the images, the assessors were also asked to categorise each image into one of three levels of difficulty (easy, medium, and hard) and into one of three levels of subjectivity (objective, partially subjective, and purely subjective). We did this to see whether there was any correlation between the difficulty or subjectivity levels assigned to the images and their corresponding estimates from the model.

6. Results

Figure 4. (a) and (d) show the correlation of the difficulty estimates and that of the subjectivity estimates, respectively, with the corresponding rankings judged by human assessors, while (b) and (e) show the correlation of the difficulty estimates and that of the subjectivity estimates, respectively, with the corresponding levels into which the images have been categorised by the assessors. Finally, (c) shows the images as points whose coordinates are the difficulty and the subjectivity estimates, with some images with noteworthy coordinates highlighted, while (f) shows these images.

The results of the sensitivity analysis described in section 5.2 are shown in Tables 2 and 3. We can see from Table 2 that the average prediction accuracy of the SDR model with 1 latent preference is consistently higher than that of the model with 2 preferences over all the objective datasets. According to section 5.2, this result indicates that, for each of these tasks, there is just one underlying group of workers who perceive the questions’ correct answers in the same way. It also shows that even though the expertise-difficulty corruption introduced noise to the objective truths to form the actual responses, our model was still able to recover the number of underlying groups of workers as 1. From Table 3, our model with 2 preferences outperforms the same model with 1 preference across all the partially subjective tasks. This means that multiple groups of workers have emerged due to the sufficient subjectivity of these tasks. Moreover, the table shows that further increasing the number of latent preferences to 3 no longer improves the performance. This is most likely caused by over-fitting, and also suggests that a two-dimensional latent space is accurate enough to explain the worker clustering effects that emerge from these tasks. To further show that our model with 2 preferences can uncover the underlying groups of workers who have perceived the partially subjective tasks differently, we plot the density of the workers’ latent preference probabilities estimated by our model from the image data (Tian and Zhu, 2012). Due to space limits, we only show two of the sub-tasks in Figure 3. According to (Tian and Zhu, 2012), the sub-task of judging whether images are beautiful is more subjective than the sub-task of identifying skies in images. This is confirmed by our model, whose number of worker clusters for the former sub-task is greater than that for the latter, as shown by Figures 3(a) and 3(b).

           The SDR model
Dataset    m = 1     m = 2
Time       0.8967    0.8915
Dog        0.6970    0.6625
Duck       0.8427    0.8388
Product    0.8396    0.8291

Table 2. Average accuracy of our model with 1 and 2 latent preferences on predicting the held-out validation response of each worker over 4 objective tasks.

            The SDR model
Dataset     m = 1     m = 2     m = 3
Beauty 1    0.6736    0.6944    0.6924
Beauty 2    0.6914    0.6998    0.6937
Sky         0.8889    0.8962    0.8862
Building    0.8997    0.9026    0.9007
Computer    0.8098    0.8117    0.8074
Rel1        0.3956    0.3985    0.3983
Rel2        0.4426    0.4481    0.4481
Fashion     0.7517    0.7589    0.7522
Face        0.7181    0.7203    0.7123
Adult       0.7469    0.7494    0.7446

Table 3. Average accuracy of our model with 1, 2 and 3 latent preferences on predicting the held-out validation response of each worker over 10 partially subjective tasks, the first 5 of which are sub-tasks of the Image task in (Tian and Zhu, 2012).
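The selection rule implied by the sensitivity analysis, namely choosing the number of latent preferences m with the highest average held-out accuracy, can be checked directly against the figures in Tables 2 and 3. The sketch below is illustrative only: the function name `best_m` and the rule of preferring the smaller m on ties are assumptions, not part of the SDR model.

```python
# Held-out validation accuracies copied from Tables 2 and 3;
# list position i holds the accuracy for m = i + 1.
TABLE2 = {
    "Time": [0.8967, 0.8915],
    "Dog": [0.6970, 0.6625],
    "Duck": [0.8427, 0.8388],
    "Product": [0.8396, 0.8291],
}
TABLE3 = {
    "Beauty 1": [0.6736, 0.6944, 0.6924],
    "Beauty 2": [0.6914, 0.6998, 0.6937],
    "Sky": [0.8889, 0.8962, 0.8862],
    "Building": [0.8997, 0.9026, 0.9007],
    "Computer": [0.8098, 0.8117, 0.8074],
    "Rel1": [0.3956, 0.3985, 0.3983],
    "Rel2": [0.4426, 0.4481, 0.4481],
    "Fashion": [0.7517, 0.7589, 0.7522],
    "Face": [0.7181, 0.7203, 0.7123],
    "Adult": [0.7469, 0.7494, 0.7446],
}


def best_m(accuracies):
    """Return the m with the highest accuracy.

    list.index returns the first maximum, so ties go to the smaller m
    (an assumed tie-breaking rule that favours the simpler model).
    """
    return accuracies.index(max(accuracies)) + 1


objective_choice = {name: best_m(acc) for name, acc in TABLE2.items()}
subjective_choice = {name: best_m(acc) for name, acc in TABLE3.items()}
print(objective_choice)
print(subjective_choice)
```

Under this rule every objective dataset selects m = 1 and every partially subjective task selects m = 2. Rel2 is the only tie (m = 2 and m = 3 both reach 0.4481), and it resolves to m = 2.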

The results of the question correct answer prediction described in section 5.3 are listed in Table 4. Across all the partially subjective datasets except the image data, the SDR model, using the largest-group strategy for choosing the best worker cluster, is superior to the other 5 baselines. (The performance of all the models on the image data in Table 1 was too close to indicate which of them is better, since the number of questions in that data, 60, is too small.) In particular, for the tasks of relevance judgement 1 and 2 and fashion judgement, our model outperforms the best baselines by 3, 1.5 and 0.3 percentage points, corresponding to approximately 54, 9 and 13 more correctly predicted question answers respectively. Since our model reduces to something very similar to GLAD when dealing with the objective datasets, it achieves very similar results to GLAD in correct answer prediction over all the objective datasets except the Duck data (Welinder et al., 2010), on which our model is superior to GLAD (0.69 versus 0.62). This suggests that our model is at least as robust as GLAD when predicting correct answers for objective tasks.

           Question correct answer prediction
Dataset    SDR       MV        GLAD      DS        CDS       MdWC
Rel1       0.4998    0.4522    0.4457    0.4309    0.4697    0.4674
Rel2       0.4752    0.4544    0.4567    0.4512    0.4604    0.4586
Fashion    0.8733    0.8580    0.8689    0.8415    0.8463    0.8700
Face       0.6423    0.6404    0.6130    0.5924    0.5986    0.6079
Adult      0.7598    0.7568    0.7587    0.7534    0.7582    0.7556

Table 4. Accuracy of all the models on predicting the true answers of the five partially subjective datasets (the results for the Image task are not included as the number of items in this task is too small to show any significant difference in the performance of different models).

The results of the worker answer prediction described in section 5.4 are shown in Table 5. Our model is not the best on 3 of the 10 partially subjective datasets, each of which is topped by a different baseline. Even so, our model is second best on those datasets. We conjecture that this is because all 3 of these datasets have binary answer options, which intrinsically constrain the answering behaviour of crowd-workers. This results in overall weaker correlations both in the worker responses and in the underlying correct answers across the questions. Of the other 7 datasets, 5 have more than two answer options, and thus contain stronger answer correlations for our model to exploit to achieve better performance. To examine whether the difference in worker answer prediction accuracy between any two algorithms is significant, we conducted the Nemenyi post-hoc test (Demšar, 2006) based on Table 5. The result is shown in Figure 5, according to which the performance difference between SDR and each of CDS, GLAD and DS is beyond the critical difference (CD), and is therefore statistically significant.

            Unseen worker answer prediction
Dataset     SDR       GLAD      DS        CDS       MdWC
Beauty 1    0.6974    0.6884    0.6256    0.6927    0.6912
Beauty 2    0.7006    0.7011    0.6796    0.6842    0.6998
Sky         0.9014    0.8772    0.8801    0.8862    0.8903
Building    0.8987    0.8912    0.8956    0.9006    0.8976
Computer    0.8284    0.8139    0.8115    0.8196    0.8336
Rel1        0.4067    0.4035    0.3654    0.3972    0.3987
Rel2        0.4386    0.4312    0.4257    0.4304    0.4340
Fashion     0.7659    0.7593    0.6977    0.7621    0.7633
Face        0.7224    0.7193    0.6625    0.7081    0.7148
Adult       0.7386    0.7347    0.6767    0.7312    0.7354

Table 5. Average accuracy of all the models on predicting the unseen held-out test response of each worker across all the partially subjective datasets.
Figure 5. Critical difference (CD) diagram of the Nemenyi post-hoc test. The performance difference between two algorithms is significant if the gap between their ranks is larger than CD. There is a horizontal line connecting the two algorithms if the rank gap between them is smaller than CD.
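The rank comparison behind the critical difference diagram can be sketched from the figures in Table 5, following the procedure in (Demšar, 2006): rank the algorithms on each dataset, average the ranks, and compare rank gaps with CD = q_alpha * sqrt(k(k+1) / (6N)) for k algorithms and N datasets. The significance level is not recoverable from the text here, so alpha = 0.10 is an assumption, with the critical value q = 2.459 for k = 5 taken from Demšar's table. The function names are hypothetical.

```python
import math

MODELS = ["SDR", "GLAD", "DS", "CDS", "MdWC"]

# Accuracies copied from Table 5, in the column order of MODELS.
TABLE5 = {
    "Beauty 1": [0.6974, 0.6884, 0.6256, 0.6927, 0.6912],
    "Beauty 2": [0.7006, 0.7011, 0.6796, 0.6842, 0.6998],
    "Sky": [0.9014, 0.8772, 0.8801, 0.8862, 0.8903],
    "Building": [0.8987, 0.8912, 0.8956, 0.9006, 0.8976],
    "Computer": [0.8284, 0.8139, 0.8115, 0.8196, 0.8336],
    "Rel1": [0.4067, 0.4035, 0.3654, 0.3972, 0.3987],
    "Rel2": [0.4386, 0.4312, 0.4257, 0.4304, 0.4340],
    "Fashion": [0.7659, 0.7593, 0.6977, 0.7621, 0.7633],
    "Face": [0.7224, 0.7193, 0.6625, 0.7081, 0.7148],
    "Adult": [0.7386, 0.7347, 0.6767, 0.7312, 0.7354],
}


def average_ranks(scores_by_dataset, k):
    """Average rank per algorithm (rank 1 = highest accuracy).

    Assumes no ties within a dataset, which holds for Table 5.
    """
    totals = [0.0] * k
    for scores in scores_by_dataset.values():
        order = sorted(range(k), key=lambda i: scores[i], reverse=True)
        for rank, i in enumerate(order, start=1):
            totals[i] += rank
    return [t / len(scores_by_dataset) for t in totals]


def critical_difference(q_alpha, k, n):
    """Nemenyi critical difference for k algorithms over n datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))


Q_ALPHA = 2.459  # assumed: alpha = 0.10, k = 5 (Demsar, 2006)

ranks = dict(zip(MODELS, average_ranks(TABLE5, len(MODELS))))
cd = critical_difference(Q_ALPHA, len(MODELS), len(TABLE5))
significant = {name: (ranks[name] - ranks["SDR"]) > cd for name in MODELS[1:]}
print(ranks, round(cd, 3), significant)
```

Under these assumptions the average ranks are 1.3 (SDR), 2.4 (MdWC), 3.2 (CDS), 3.3 (GLAD) and 4.8 (DS), and CD is about 1.74, so the gaps to CDS, GLAD and DS exceed CD while the gap to MdWC does not. With the stricter critical value for alpha = 0.05 (2.728, giving CD of about 1.93), the gap to CDS of 1.9 would fall just short.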

The results of the subjectivity and difficulty coherence evaluation are summarised in Figure 4, which consists of 6 sub-figures. Figures 4(a) and 4(d) show that overall there is a strong negative correlation between the model estimates and the rank positions assigned by the human assessors. The correlation is negative because rank 1 denotes the most difficult or most subjective image: the larger the estimate for either the difficulty or the subjectivity of an image, the higher it tends to be ranked by the human assessors. Moreover, Figures 4(b) and 4(e) show clear positive correlations between the levels of difficulty and subjectivity into which the images are categorised by the human assessors and the estimated values of these two properties inferred by the SDR model.
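To illustrate why agreement between model estimates and human rankings shows up as a negative correlation, the sketch below computes a Spearman rank correlation on hypothetical toy values. The text does not state which correlation coefficient underlies Figure 4, so Spearman is an assumption here, as are the function names and the data.

```python
import math


def ascending_ranks(values):
    """Rank 1 = smallest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ascending_ranks(x), ascending_ranks(y))


# Hypothetical toy values for four images (not from the paper).
difficulty_estimates = [0.9, 0.1, 0.5, 0.7]  # larger = harder
human_rank_position = [1, 4, 3, 2]           # 1 = ranked hardest

rho = spearman(difficulty_estimates, human_rank_position)
print(rho)
```

Here the assessors and the estimates agree perfectly on the ordering, yet the coefficient is -1 rather than +1, because the hardest image carries the largest estimate but the smallest rank position.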

To support our argument about the efficacy of the SDR model in revealing the two key properties of images, we have selected four images, highlighted in different colours in Figure 4(c) together with their image ids. Image 34 is inferred by our model to be both easy and objective, as both of its estimates shown in Figure 4(c) are the smallest. This is confirmed by visual inspection of the image in Figure 4(f): it is easy to see that there is no sky in image 34. Image 29 has been identified by our model as hard with low subjectivity according to its estimates shown in Figure 4(c). This is reasonable, as the image contains an extraterrestrial sky which is hard for novice workers to recognise, while expert workers are able to recognise it and find the image objective. Images 2 and 23 both belong to the image beauty judgement task from (Tian and Zhu, 2012), which requires workers to select the 6 most beautiful images from 12 images. Our model has identified image 2 as more subjective and harder to judge. This is probably because image 2 shows a view of scenery, which is more likely to resonate with workers, while image 23 merely shows an object. As a result, workers tend to have more varied feelings and opinions towards image 2. On the other hand, image 23 has better image quality and is thus easier for workers to judge as beautiful or not.

7. Conclusions

In this paper, we have proposed the SDR (Subjectivity-and-Difficulty Response) model, a novel quality-control framework for crowdsourcing that is able to distinguish question subjectivity, which causes worker-specific truths for individual questions, from question difficulty, which determines the probability that a worker’s response to each question equals her perceived subjective truth. Experimental results show that our model improves both the correct answer prediction for questions and the held-out unseen response prediction for crowd-workers compared to five baselines across numerous partially subjective crowdsourcing datasets. Moreover, our model is robust on both the objective and the partially subjective datasets, discovering the right numbers of underlying worker groups for them. Finally, our model is able to provide estimates of the difficulty and the subjectivity of questions that are consistent with the judgements of human assessors.

References

  • adu ([n. d.]) [n. d.]. Adult Dataset. https://github.com/ipeirotis/Get-Another-Label/tree/master/data. ([n. d.]). Accessed: 2017-07-30.
  • Bachrach et al. (2012) Yoram Bachrach, Thore Graepel, Tom Minka, and John Guiver. 2012. How to grade a test without knowing the answers—a Bayesian graphical model for adaptive crowdsourcing and aptitude testing. arXiv preprint arXiv:1206.6386 (2012).
  • Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, Jan (2003), 993–1022.
  • Buckley et al. ([n. d.]) Chris Buckley, Matthew Lease, and Mark D Smucker. [n. d.]. Overview of the TREC 2010 relevance feedback track (notebook).
  • Dawid and Skene (1979) Alexander Philip Dawid and Allan M Skene. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied statistics (1979), 20–28.
  • Demšar (2006) Janez Demšar. 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7 (2006), 1–30.
  • Jung (2014) Hyun Joon Jung. 2014. Quality assurance in crowdsourcing via matrix factorization based task routing. In Proceedings of the 23rd International Conference on World Wide Web. ACM, 3–8.
  • Jung and Lease (2013) Hyun Joon Jung and Matthew Lease. 2013. Crowdsourced task routing via matrix factorization. arXiv preprint arXiv:1310.5142 (2013).
  • Kamar et al. (2015) Ece Kamar, Ashish Kapoor, and Eric Horvitz. 2015. Identifying and accounting for task-dependent bias in crowdsourcing. In Third AAAI Conference on Human Computation and Crowdsourcing.
  • Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009).
  • Lease and Kazai (2011) Matthew Lease and Gabriella Kazai. 2011. Overview of the TREC 2011 crowdsourcing track. In Proceedings of the Text Retrieval Conference (TREC).
  • Lee et al. (2013) Jongwuk Lee, Myungha Jang, Dongwon Lee, Won-Seok Hwang, Jiwon Hong, and Sang-Wook Kim. 2013. Alleviating the sparsity in collaborative filtering using crowdsourcing. In Workshop on Crowdsourcing and Human Computation for Recommender Systems (CrowdRec), Vol. 5.
  • Loni et al. (2013) Babak Loni, Maria Menendez, Mihai Georgescu, Luca Galli, Claudio Massari, Ismail Sengor Altingovde, Davide Martinenghi, Mark Melenhorst, Raynor Vliegendhart, and Martha Larson. 2013. Fashion-focused Creative Commons Social Dataset. In Proceedings of the 4th ACM Multimedia Systems Conference (MMSys ’13). ACM, New York, NY, USA, 72–77.
  • Mozafari et al. (2012) Barzan Mozafari, Purnamrita Sarkar, Michael J. Franklin, Michael I. Jordan, and Samuel Madden. 2012. Active Learning for Crowd-Sourced Databases. CoRR abs/1209.3686 (2012).
  • Nguyen et al. (2016) An Thanh Nguyen, Matthew Halpern, Byron C Wallace, and Matthew Lease. 2016. Probabilistic modeling for crowdsourcing partially-subjective ratings.
  • Rasch (1993) Georg Rasch. 1993. Probabilistic models for some intelligence and attainment tests. ERIC.
  • Raykar et al. (2010) Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. 2010. Learning from crowds. Journal of Machine Learning Research 11, Apr (2010), 1297–1322.
  • Snow et al. (2008) Rion Snow, Brendan O’Connor, Daniel Jurafsky, and Andrew Y Ng. 2008. Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing. Association for Computational Linguistics, 254–263.
  • Tian and Zhu (2012) Yuandong Tian and Jun Zhu. 2012. Learning from crowds in the presence of schools of thought. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 226–234.
  • Venanzi et al. (2014) Matteo Venanzi, John Guiver, Gabriella Kazai, Pushmeet Kohli, and Milad Shokouhi. 2014. Community-based bayesian aggregation models for crowdsourcing. In Proceedings of the 23rd international conference on World wide web. ACM, 155–164.
  • Voorhees (2000) Ellen M Voorhees. 2000. Variations in relevance judgments and the measurement of retrieval effectiveness. Information processing & management 36, 5 (2000), 697–716.
  • Wang et al. (2012) Jiannan Wang, Tim Kraska, Michael J Franklin, and Jianhua Feng. 2012. Crowder: Crowdsourcing entity resolution. Proceedings of the VLDB Endowment 5, 11 (2012), 1483–1494.
  • Welinder et al. (2010) Peter Welinder, Steve Branson, Pietro Perona, and Serge J Belongie. 2010. The multidimensional wisdom of crowds. In Advances in neural information processing systems. 2424–2432.
  • Whitehill et al. (2009a) Jacob Whitehill, Paul Ruvolo, Tingfan Wu, Jacob Bergsma, and Javier R. Movellan. 2009a. Whose Vote Should Count More: Optimal Integration of Labels from Labelers of Unknown Expertise. In 23rd Annual Conference on Neural Information Processing Systems, NIPS’09. 2035–2043.
  • Whitehill et al. (2009b) Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier R Movellan, and Paul L Ruvolo. 2009b. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems. 2035–2043.
  • Zhou et al. (2012) Denny Zhou, Sumit Basu, Yi Mao, and John C. Platt. 2012. Learning from the Wisdom of Crowds by Minimax Entropy. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 2195–2203. http://papers.nips.cc/paper/4490-learning-from-the-wisdom-of-crowds-by-minimax-entropy.pdf