Visual Intelligence through Human Interaction

11/12/2021
by   Ranjay Krishna, et al.
Stanford University

Over the last decade, Computer Vision, the branch of Artificial Intelligence aimed at understanding the visual world, has evolved from simply recognizing objects in images to describing pictures, answering questions about images, aiding robots in maneuvering around physical spaces, and even generating novel visual content. As these tasks and applications have modernized, so too has the reliance on more data, either for model training or for evaluation. In this chapter, we demonstrate that novel interaction strategies can enable new forms of data collection and evaluation for Computer Vision. First, we present a crowdsourcing interface for speeding up paid data collection by an order of magnitude, feeding the data-hungry nature of modern vision models. Second, we explore a method to increase volunteer contributions using automated social interventions. Third, we develop a system to ensure that human evaluations of generative vision models are reliable, affordable and grounded in psychophysics theory. We conclude with future opportunities for Human-Computer Interaction to aid Computer Vision.


1 Introduction

Today, Computer Vision applications are ubiquitous. They filter our pictures, control our cars, aid medical experts in disease diagnosis, analyze sports games, and even generate completely new content. This recent emergence of Computer Vision tools has been made possible by a shift in the underlying techniques used to train models; this shift has transferred attention away from hand-engineered features Lowe (1999); Dalal and Triggs (2005) towards deep learning Deng et al. (2009); Krizhevsky et al. (2012a). With deep learning techniques, vision models have surpassed human performance on fundamental tasks, such as object recognition Russakovsky et al. (2014). Today, vision models are capable of an entirely new host of applications, such as generating photo-realistic images Karras et al. (2019) and 3D spaces Mildenhall et al. (2020). These advances have made possible numerous vision-powered applications Zhang et al. (2020); Yue et al. (2017); Xia et al. (2020); Laielli et al. (2019); Huang and Canny (2019).

This shift is also reflective of yet another change in Computer Vision: algorithms are becoming more generic and data has become the primary hurdle in performance. Today’s vision models are data-hungry; they feed on large amounts of annotated training data. In some cases, data needs to be continuously annotated in order to evaluate models; for new tasks such as image generation, model-generated images can only be evaluated if people provide realism judgments. To support data needs, Computer Vision has relied on a specific pipeline for data collection — one that focuses on manual labeling using online crowdsourcing platforms such as Amazon Mechanical Turk Deng et al. (2009); Krishna et al. (2017b).

Unfortunately, this data collection pipeline has numerous limiting factors. First, crowdsourcing can be insufficiently scalable, and it remains too expensive for use in the production of many industry-size datasets Josephy et al. (2013). Cost is bound to the amount of work completed per minute of effort, and existing techniques for speeding up labeling are not scaling as quickly as the volume of data we are now producing that must be labeled Thomee et al. (2016). Second, while cost issues may be mitigated by relying on volunteer contributions, it remains unclear how best to incentivize such contributions. Even though there has been substantial work in Social Psychology exploring strategies to incentivize volunteer contributions to online communities Kraut and Resnick (2011); Burke et al. (2014); Markey (2000); Darley and Latané (1968); Yang and Kraut (2017); Wang et al. (2015), it remains unclear how we can employ such strategies to develop automated mechanisms that incentivize volunteer data annotation useful for Computer Vision. Third, existing data annotation methods are ad-hoc, each executed idiosyncratically without proof of reliability or grounding in theory, resulting in high variance in their estimates Salimans et al. (2016); Denton et al. (2015); Olsson et al. (2018). While high variance in labels might be tolerable when collecting training data, it becomes debilitating when such ad-hoc methods are used to evaluate models.

Human-Computer Interaction’s opportunity is to look to novel interaction strategies to break away from this traditional data collection pipeline. In this chapter, we showcase three projects Krishna et al. (2016); Park et al. (2019); Zhou et al. (2019) that have helped meet modern Computer Vision data needs. The first two projects introduce new training data collection interfaces Krishna et al. (2016) and interactions Park et al. (2019), while the third introduces a reliable system for evaluating vision models with humans Zhou et al. (2019). Our contributions (1) speed up data collection by an order of magnitude in both speed and cost, (2) incentivize volunteer contributions that provide labels through conversational interactions over social media, and (3) enable reliable human evaluation of vision models.

In the first section, we highlight work that accelerates human interactions in microtask crowdsourcing, a core process through which computer vision and machine learning datasets are predominantly curated Krishna et al. (2016). Microtask crowdsourcing has enabled dataset advances in social science and machine learning, but existing crowdsourcing schemes are too expensive to scale up with the expanding volume of data. To scale and widen the applicability of crowdsourcing, we present a technique that produces extremely rapid judgments for binary and categorical labels. Rather than punishing all errors, which causes workers to proceed slowly and deliberately, our technique speeds up workers’ judgments to the point where errors are acceptable and even expected. We demonstrate that it is possible to rectify these errors by randomizing task order and modeling response latency. We evaluate our technique on a breadth of common labeling tasks such as image verification, word similarity, sentiment analysis and topic classification. Where prior work typically achieves a 0.25× to 1× speedup over fixed majority vote, our approach often achieves an order-of-magnitude (10×) speedup.

In the second section, we turn our attention from paid crowdsourcing to volunteer contributions; we explore how to design social interventions to improve volunteer contributions when curating datasets Park et al. (2019). To support the massive data requirements of modern supervised machine learning algorithms, crowdsourcing systems match volunteer contributors to appropriate tasks. Such systems learn what types of tasks contributors are interested in completing. In this paper, instead of focusing on what to ask, we focus on learning how to ask: how to make relevant and interesting requests that encourage crowdsourcing participation. We introduce a new technique that augments questions with learning-based request strategies drawn from social psychology. We also introduce a contextual bandit algorithm to select which strategy to apply for a given task and contributor. We deploy our approach to collect volunteer data from Instagram for the task of visual question answering, an important task in computer vision and natural language processing that has enabled numerous human-computer interaction applications. For example, when encountering a user’s Instagram post that contains the ornate Trevi Fountain in Rome, our approach learns to augment its original raw question “Where is this place?” with image-relevant compliments such as “What a great statue!” or with travel-relevant justifications such as “I would like to visit this place”, increasing the user’s likelihood of answering the question and thus providing a label. We deploy our agent on Instagram to ask questions about social media images, finding that learning-based strategies improve the response rate over both unaugmented questions and baseline rule-based strategies.
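As a rough illustration of the strategy-selection loop, a generic epsilon-greedy contextual bandit over per-strategy linear reward models might look like the sketch below. This is not the algorithm or feature set used in Park et al. (2019); the strategy names and context features are hypothetical.

```python
# A generic epsilon-greedy contextual bandit sketch, not the algorithm used in
# Park et al. (2019). Strategy names and context features are hypothetical.
import random
import numpy as np

STRATEGIES = ["compliment", "justification", "plain"]  # illustrative strategies

class EpsilonGreedyBandit:
    def __init__(self, n_features, epsilon=0.1, lr=0.05):
        self.weights = {s: np.zeros(n_features) for s in STRATEGIES}
        self.epsilon = epsilon
        self.lr = lr

    def select(self, context):
        """context: numpy feature vector describing the post and contributor."""
        if random.random() < self.epsilon:
            return random.choice(STRATEGIES)  # explore
        return max(STRATEGIES, key=lambda s: float(self.weights[s] @ context))

    def update(self, strategy, context, reward):
        """reward: 1 if the contributor answered the augmented question, else 0."""
        prediction = float(self.weights[strategy] @ context)
        self.weights[strategy] += self.lr * (reward - prediction) * context
```

Each deployed question then contributes one (context, strategy, reward) observation that updates the policy over time.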

Finally, in the third section, we spotlight our work on constructing a reliable human evaluation system for generative computer vision models Zhou et al. (2019). Generative models often use human evaluations to measure the perceived quality of their outputs. Automated metrics are noisy indirect proxies because they rely on heuristics or pretrained embeddings. However, up until now, direct human evaluation strategies have been ad-hoc, neither standardized nor validated. Our work establishes a gold standard human benchmark for generative realism. We construct Human eYe Perceptual Evaluation (HYPE), a human benchmark that is grounded in psychophysics research in perception, reliable across different sets of randomly sampled outputs from a model, able to produce separable model performances, and efficient in cost and time. We introduce two variants: one that measures visual perception under adaptive time constraints to determine the threshold exposure time at which a model’s outputs appear real, and a less expensive variant that measures human error rate on fake and real images without time constraints. We test HYPE across six state-of-the-art generative adversarial networks and two sampling techniques on conditional and unconditional image generation using four datasets: CelebA, FFHQ, CIFAR-10, and ImageNet. We find that HYPE can track the relative improvements between models, and we confirm via bootstrap sampling that these measurements are consistent and replicable.
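As a small illustration of the untimed variant’s scoring, the measure is simply the fraction of real/fake judgments that are wrong; the data format below is an assumption for illustration, not the benchmark’s actual interface.

```python
# A minimal sketch of the untimed HYPE variant's score: the fraction of images
# evaluators judge incorrectly (fakes rated real, or reals rated fake).
# The judgment format is an assumption for illustration.

def hype_error_rate(judgments):
    """judgments: list of (is_real, judged_real) booleans, one per image shown."""
    if not judgments:
        return 0.0
    errors = sum(1 for is_real, judged_real in judgments if is_real != judged_real)
    return errors / len(judgments)

# Example: two of four judgments are wrong, so the error rate is 0.5.
# Higher error rates on fake images indicate more realistic generated outputs.
print(hype_error_rate([(False, True), (True, True), (False, False), (True, False)]))
```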

2 Data annotation by speeding up human interactions

Social science Kittur et al. (2008); Mason and Suri (2012), interactive systems Fast et al. (2014); Kumar et al. (2013) and machine learning Deng et al. (2009); Lin et al. (2014b) are becoming more and more reliant on large-scale, human-annotated data. Increasingly large annotated datasets have unlocked a string of social scientific insights Gilbert and Karahalios (2009); Burke and Kraut (2013) and machine learning performance improvements Krizhevsky et al. (2012a); Girshick et al. (2014); Vinyals et al. (2014). One of the main enablers of this growth has been microtask crowdsourcing Snow et al. (2008). Microtask crowdsourcing marketplaces such as Amazon Mechanical Turk offer a scale and cost that makes such annotation feasible. As a result, companies are now using crowd work to complete hundreds of thousands of tasks per day Marcus and Parameswaran (2015).

However, even microtask crowdsourcing can be insufficiently scalable, and it remains too expensive for use in the production of many industry-size datasets Josephy et al. (2013). Cost is bound to the amount of work completed per minute of effort, and existing techniques for speeding up labeling (reducing the amount of required effort) are not scaling as quickly as the volume of data we are now producing that must be labeled Thomee et al. (2016). To expand the applicability of crowdsourcing, the number of items annotated per minute of effort needs to increase substantially.

Figure 1:

(a) Images are shown to workers at 100ms per image. Workers react whenever they see a dog. (b) The true labels are the ground truth dog images. (c) The workers’ keypresses are slow and occur several images after the dog images have already passed. We record these keypresses as the observed labels. (d) Our technique models each keypress as a delayed Gaussian to predict (e) the probability of an image containing a dog from these observed labels.

In this paper, we focus on one of the most common classes of crowdsourcing tasks Ipeirotis (2010): binary annotation. These tasks are yes-or-no questions, typically identifying whether or not an input has a specific characteristic. Examples of these types of tasks are topic categorization (e.g., “Is this article about finance?”) Schapire and Singer (2000), image classification (e.g., “Is this a dog?”) Deng et al. (2009); Lin et al. (2014b); Li and Ogihara (2003), audio styles Seetharaman and Pardo (2014) and emotion detection Li and Ogihara (2003) in songs (e.g., “Is the music calm and soothing?”), word similarity (e.g., “Are shipment and cargo synonyms?”) Miller and Charles (1991) and sentiment analysis (e.g., “Is this tweet positive?”) Pang and Lee (2008).

Previous methods have sped up binary classification tasks by minimizing worker error. A central assumption behind this prior work has been that workers make errors because they are not trying hard enough (e.g., “a lack of expertise, dedication [or] interest” Sheng et al. (2008)). Platforms thus punish errors harshly, for example by denying payment. Current methods calculate the minimum redundancy necessary to be confident that errors have been removed Sheng et al. (2008); Smyth et al. (1994, 1995). These methods typically result in a 0.25× to 1× speedup beyond a fixed majority vote Peng Dai and Weld (2010); Russakovsky et al. (2015); Sheng et al. (2008); Karger et al. (2014).

We take the opposite position: that designing the task to encourage some error, or even make errors inevitable, can produce far greater speedups. Because platforms strongly punish errors, workers carefully examine even straightforward tasks to make sure they do not represent edge cases Martin et al. (2014); Irani and Silberman (2013). The result is slow, deliberate work. We suggest that there are cases where we can encourage workers to move quickly by telling them that making some errors is acceptable. Though individual worker accuracy decreases, we can recover from these mistakes post-hoc algorithmically (Figure 1).

We manifest this idea via a crowdsourcing technique in which workers label a rapidly advancing stream of inputs. Workers are given a binary question to answer, and they observe as the stream automatically advances via a method inspired by rapid serial visual presentation (RSVP) Li et al. (2002); Fei-Fei et al. (2007). Workers press a key whenever the answer is “yes” for one of the stream items. Because the stream is advancing rapidly, workers miss some items and have delayed responses. However, workers are reassured that the requester expects them to miss a few items. To recover the correct answers, the technique randomizes the item order for each worker and models workers’ delays as a normal distribution whose variance depends on the stream’s speed. For example, when labeling whether images have a “barking dog” in them, a self-paced worker takes 1.7s per image on average. With our technique, workers are shown a stream at 100ms per image. The technique models the delays experienced at different input speeds and estimates the probability of intended labels from the key presses.

We evaluate our technique by comparing the total worker time necessary to achieve the same precision on an image labeling task as a standard setup with majority vote. The standard approach takes three workers an average of 1.7s each, for a total of 5.1s. Our technique achieves identical precision (97%) with five workers at 100ms each, for a total of 500ms of work. The result is an order-of-magnitude speedup of 10.2×.

This relative improvement is robust across both simple tasks, such as identifying dogs, and complicated tasks, such as identifying “a person riding a motorcycle” (interactions between two objects) or “people eating breakfast” (understanding relationships among many objects). We generalize our technique to other tasks such as word similarity detection, topic classification and sentiment analysis. Additionally, we extend our method to categorical classification tasks through a ranked cascade of binary classifications. Finally, we test workers’ subjective mental workload and find no measurable increase.

Overall, we make the following contributions: (1) We introduce a rapid crowdsourcing technique that makes errors normal and even inevitable. We show that it can be used to effectively label large datasets by achieving a speedup of an order of magnitude on several binary labeling crowdsourcing tasks. (2) We demonstrate that our technique can be generalized to multi-label categorical labeling tasks, combined independently with existing optimization techniques, and deployed without increasing worker mental workload.

2.1 Related Work

The main motivation behind our work is to provide an environment where humans can make decisions quickly. We encourage a margin of human error in the interface that is then rectified by inferring the true labels algorithmically. In this section, we review prior work on crowdsourcing optimization and other methods for motivating contributions. Much of this work relies on artificial intelligence techniques: we complement this literature by changing the crowdsourcing interface rather than focusing on the underlying statistical model.

Our technique is inspired by rapid serial visual presentation (RSVP), a technique for consuming media rapidly by aligning it within the foveal region and advancing between items quickly Li et al. (2002); Fei-Fei et al. (2007). RSVP has already been proven to be effective at speeding up reading rates Wobbrock et al. (2002). RSVP users can react to a single target image in a sequence of images even at 125ms per image with 75% accuracy Potter (1976). However, when trying to recognize concepts in images, RSVP only achieves an accuracy of 10% at the same speed Potter and Levy (1969). In our work, we integrate multiple workers’ errors to successfully extract true labels.

Many previous papers have explored ways of modeling workers to remove bias or errors from ground truth labels Whitehill et al. (2009); Welinder et al. (2010); Zhou et al. (2012); Peng Dai and Weld (2010); Ipeirotis et al. (2010). For example, an unsupervised method for judging worker quality can be used as a prior to remove bias on binary verification labels Ipeirotis et al. (2010). Individual workers can also be modeled as projections into an open space representing their skills in labeling a particular image Whitehill et al. (2009). Workers may have unknown expertise that may in some cases prove adversarial to the task. Such adversarial workers can be detected by jointly learning the difficulty of labeling a particular datum along with the expertise of workers Welinder et al. (2010). Finally, a generative model can be used to model workers’ skills by minimizing the entropy of the distribution over their labels and the unknown true labels Zhou et al. (2012). We draw inspiration from this literature, calibrating our model using a similar generative approach to understand worker reaction times. We model each worker’s reaction as a delayed Gaussian distribution.

In an effort to reduce cost, many previous papers have studied the tradeoffs between speed (cost) and accuracy on a wide range of tasks Wah et al. (2014); Branson et al. (2010); Wah et al. (2011); Russakovsky et al. (2014). Some methods estimate human time with annotation accuracy to jointly model the errors in the annotation process Wah et al. (2014); Branson et al. (2010); Wah et al. (2011). Other methods vary both the labeling cost and annotation accuracy to calculate a tradeoff between the two Jain and Grauman (2013); Deng et al. (2014). Similarly, some crowdsourcing systems optimize a budget to measure confidence in worker annotations Karger et al. (2011, 2014). Models can also predict the redundancy of non-expert labels needed to match expert-level annotations Sheng et al. (2008). Just like these methods, we show that non-experts can use our technique and provide expert-quality annotations; we also compare our methods to the conventional majority-voting annotation scheme.

Another perspective on rapid crowdsourcing is to return results in real time, often by using a retainer model to recall workers quickly Bernstein et al. (2011); Lasecki et al. (2011); Laput et al. (2015). Like our approach, real-time crowdsourcing can use algorithmic solutions to combine multiple in-progress contributions Lasecki et al. (2012). These systems’ techniques could be fused with ours to create crowds that can react to bursty requests.

Figure 2: (a) Task instructions inform workers that we expect them to make mistakes since the items will be displayed rapidly. (b) A string of countdown images prepares them for the rate at which items will be displayed. (c) An example image of a “dog” shown in the stream—the two images appearing behind it are included for clarity but are not displayed to workers. (d) When the worker presses a key, we show the last four images below the stream of images to indicate which images might have just been labeled.

One common method for optimizing crowdsourcing is active learning, which involves learning algorithms that interactively query the user. Examples include training image recognition Song et al. (2011) and attribute recognition Parkash and Parikh (2012) models with fewer examples. Comparative models for ranking attributes have also optimized crowdsourcing using active learning Liang and Grauman (2014). Similar techniques have explored optimization of the “crowd kernel” by adaptively choosing the next questions asked of the crowd in order to build a similarity matrix between a given set of data points Tamuz et al. (2011). Active learning needs to decide on a new task after each new piece of data is gathered from the crowd, and such models tend to be quite expensive to compute. Other methods have been proposed to decide on a set of tasks instead of just one task Vijayanarasimhan et al. (2010). We draw on this literature: in our technique, after all the images have been seen by at least one worker, we use active learning to decide the next set of tasks. We determine which images to discard and which images to group together and send this set to another worker to gather more information.

Finally, there is a group of techniques that attempt to optimize label collection by reducing the number of questions that must be answered by the crowd. For example, a hierarchy in label distribution can reduce the annotation search space Deng et al. (2014), and information gain can reduce the number of labels necessary to build large taxonomies using a crowd Chilton et al. (2013); Bragg et al. (2013). Methods have also been proposed to maximize accuracy of object localization in images Su et al. (2012) and videos Vondrick et al. (2013). Previous labels can also be used as a prior to optimize acquisition of new types of annotations Branson et al. (2014). One of the benefits of our technique is that it can be used independently of these others to jointly improve crowdsourcing schemes. We demonstrate the gains of such a combination in our evaluation.

2.2 Error-Embracing Crowdsourcing

Current microtask crowdsourcing platforms like Amazon Mechanical Turk incentivize workers to avoid rejections Irani and Silberman (2013); Martin et al. (2014), resulting in slow and meticulous work. But is such careful work necessary to build an accurate dataset? In this section, we detail our technique for rapid crowdsourcing by encouraging less accurate work.

The design space of such techniques must consider which tradeoffs are acceptable to make. The first relevant dimension is accuracy. When labeling a large dataset (e.g., building a dataset of ten thousand articles about housing), precision is often the highest priority: articles labeled as on-topic by the system must in fact be about housing. Recall, on the other hand, is often less important, because there is typically a large amount of available unlabeled data: even if the system misses some on-topic articles, the system can label more items until it reaches the desired dataset size. We thus develop an approach for producing high precision at high speed, sacrificing some recall if necessary.

The second design dimension involves the task characteristics. Many large-scale crowdsourcing tasks involve closed-ended responses such as binary or categorical classifications. These tasks have two useful properties. First, they are time-bound by users’ perception and cognition speed rather than motor (e.g., pointing, typing) speed Cheng et al. (2015), since acting requires only a single button press. Second, it is possible to aggregate responses automatically, for example with majority vote. Open-ended crowdsourcing tasks such as writing Bernstein et al. (2010) or transcription are often time-bound by data entry motor speeds and cannot be automatically aggregated. Thus, with our technique, we focus on closed-ended tasks.

Rapid crowdsourcing of binary decision tasks. Binary questions are one of the most common classes of crowdsourcing tasks. Each yes-or-no question gathers a label on whether each item has a certain characteristic. In our technique, rather than letting workers focus on each item too carefully, we display each item for a specific period of time before moving on to the next one in a rapid slideshow. For example, in the context of an image verification task, we show workers a stream of images and ask them to press the spacebar whenever they see a specific class of image. In the example in Figure 2, we ask them to react whenever they see a “dog.”

The main parameter in this approach is the length of time each item is visible. To determine the best option, we begin by allowing workers to work at their own pace. This establishes an initial average time period, which we then slowly decrease in successive versions until workers start making mistakes Cheng et al. (2015). Once we have identified this error point, we can algorithmically model workers’ latency and errors to extract the true labels.

To avoid stressing out workers, it is important that the task instructions convey the nature of the rapid task and the fact that we expect them to make some errors. Workers are first shown a set of instructions (Figure 2(a)) for the task. They are warned that reacting to every single correct image on time is not feasible and thus not expected. We also warn them that we have placed a small number of items in the set that we know to be positive items. These help us calibrate each worker’s speed and also provide us with a mechanism to reject workers who do not react to any of the items.

Once workers start the stream (Figure 2(b)), it is important to prepare them for the pace of the task. We thus show a film-style countdown for the first few seconds that decrements to zero at the same interval as the main task. Without these countdown images, workers use up the first few seconds getting used to the pace and speed. Figure 2(c) shows an example “dog” image that is displayed in front of the user. The dimensions of all items (images) shown are held constant to avoid having to adjust to larger or smaller visual ranges.

When items are displayed for less than 400ms, workers tend to react to all positive items with a delay. If the interface only reacts with a simple confirmation when workers press the spacebar, many workers worry that they are too late because another item is already on the screen. Our solution is to also briefly display the last four items previously shown when the spacebar is pressed, so that workers see the one they intended and also gather an intuition for how far back the model looks. For example, in Figure 2(d), we show a worker pressing the spacebar on an image of a horse. We anticipate that the worker was probably delayed, and we display the last four items to acknowledge that we have recorded the keypress. We ask all workers to first complete a qualification task in which they receive feedback on how quickly we expect them to react. They pass the qualification task only if they achieve a recall of 0.6 and precision of 0.9 on a stream of 200 items with 25 positives. We measure precision as the fraction of worker reactions that were within 500ms of a positive cue.
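A minimal sketch of this qualification check follows; the 500ms window and the 0.6/0.9 thresholds come from the text above, while the data format is an illustrative assumption.

```python
# A minimal sketch of the qualification check: a keypress counts as a detection
# only if it falls within 500 ms after a known positive item's onset.

def qualification_passed(positive_onsets, keypresses, window=0.5,
                         min_recall=0.6, min_precision=0.9):
    """positive_onsets: onset times (s) of the known positive items in the stream.
    keypresses: times (s) at which the worker pressed the spacebar."""
    detected = {onset for onset in positive_onsets
                if any(0.0 <= press - onset <= window for press in keypresses)}
    valid_presses = [press for press in keypresses
                     if any(0.0 <= press - onset <= window for onset in positive_onsets)]
    recall = len(detected) / len(positive_onsets) if positive_onsets else 0.0
    precision = len(valid_presses) / len(keypresses) if keypresses else 1.0
    return recall >= min_recall and precision >= min_precision
```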

In Figure 3, we show two sample outputs from our interface. Workers were shown images for 100ms each. They were asked to press the spacebar whenever they saw an image of “a person riding a motorcycle.” The images with blue bars underneath them are ground truth images of “a person riding a motorcycle.” The images with red bars show where workers reacted. The important element is that red labels are often delayed behind blue ground truth and occasionally missed entirely. Both Figures 3(a) and 3(b) have 100 images each with 5 correct images.

Figure 3: Example raw worker outputs from our interface. Each image was displayed for 100ms and workers were asked to react whenever they saw images of “a person riding a motorcycle.” Images are shown in the same order they appeared in for the worker. Positive images are shown with a blue bar below them and users’ keypresses are shown as red bars below the image to which they reacted.

Because of workers’ reaction delay, the data from one worker has considerable uncertainty. We thus show the same set of items to multiple workers in different random orders and collect independent sets of keypresses. This randomization will produce a cleaner signal in aggregate and later allow us to estimate the images to which each worker intended to react.

Given the speed of the images, workers are not able to detect every single positive image. For example, the last positive image in Figure 3(a) and the first positive image in Figure 3(b) are not detected. Previous work on RSVP found a phenomenon called “attentional blink” Broadbent and Broadbent (1987), in which a worker is momentarily blind to successive positive images. However, we find that even if two images of “a person riding a motorcycle” occur consecutively, workers are able to detect both and react twice (Figures 3(a) and 3(b)). If workers are forced to react at intervals of less than 400ms, though, the signal we extract is too noisy for our model to estimate the positive items.

Multi-Class Classification for Categorical Data. So far, we have described how rapid crowdsourcing can be used for binary verification tasks. Now we extend it to handle multi-class classification. Theoretically, any multi-class classification can be broken down into a series of binary verifications: if there are N classes, we can ask N binary questions of whether an item is in each class. Given a list of items, we use our technique to classify them one class at a time. After every iteration, we remove all the positively classified items for a particular class and use the remaining items to detect the next class.

Assuming all the classes contain an equal number of items, the order in which we detect classes should not matter. A simple baseline approach would choose a class at random and attempt to detect all items for that class first. However, if the distribution of items is not equal among classes, this method would be inefficient. Consider the case where we are trying to classify items into 10 classes, and one class has 1,000 items while every other class has 10 items. In the worst case, if we classify the class with 1,000 examples last, those 1,000 images would go through our interface 10 times (once for every class). Instead, if we had detected the large class first, we would be able to classify those 1,000 images and they would only go through our interface once. With this intuition, we propose a class-optimized approach that classifies the most common class of items first, as sketched below. We maximize the number of items we classify at every iteration, reducing the total number of binary verifications required.
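The sketch below illustrates this class-optimized ordering; the `run_binary_stream` helper stands in for launching one of our rapid binary-verification streams and is an assumption for illustration, not a real API.

```python
# A minimal sketch of class-optimized multi-class classification: verify the
# (estimated) largest class first so fewer items pass through the stream again.
# `run_binary_stream(items, class_name) -> set of verified items` is assumed.

def classify_largest_class_first(items, class_counts, run_binary_stream):
    """items: iterable of unlabeled item ids.
    class_counts: {class_name: estimated item count}, e.g. from a small pilot."""
    labels = {}
    remaining = set(items)
    for cls in sorted(class_counts, key=class_counts.get, reverse=True):
        if not remaining:
            break
        positives = run_binary_stream(remaining, cls)
        for item in positives:
            labels[item] = cls
        remaining -= positives
    return labels, remaining  # anything left over is unclassified
```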

2.3 Model

To translate workers’ delayed and potentially erroneous actions into identifications of the positive items, we need to model their behavior. We do this by calculating the probability that a particular item is in the positive class given that the user reacted a given period after the item was displayed. By combining these probabilities across several workers who saw the same images in different random orders, we can identify the correct items.

We use maximum likelihood estimation to predict the probability of an item being a positive example. Given a set of items $I = \{I_1, \dots, I_n\}$, we send them to $W$ workers, each in a different random order. From each worker $w$ we collect a set of keypresses $C^w = \{c^w_1, \dots, c^w_{k_w}\}$, where $k_w$ is the total number of keypresses from worker $w$. Our aim is to calculate the probability of a given item $I_i$ being a positive example given the keypresses we collect from the $W$ workers. By Bayes’ rule:

$$P(I_i \mid C^1, \dots, C^W) = \frac{P(C^1, \dots, C^W \mid I_i)\, P(I_i)}{P(C^1, \dots, C^W)} \qquad (1)$$

where $P(C^1, \dots, C^W)$ is the probability of a particular set of items being keypresses. We set $P(C^1, \dots, C^W)$ to be constant, assuming that it is equally likely that a worker might react to any item, which leaves:

$$P(I_i \mid C^1, \dots, C^W) \propto P(C^1, \dots, C^W \mid I_i)\, P(I_i) \qquad (2)$$

Here $P(I_i)$ models our estimate of item $I_i$ being positive. It can be a constant, or it can be an estimate from a domain-specific machine learning algorithm Kamar et al. (2012). For example, if we were trying to scale up a dataset of “dog” images, we would use a small set of known “dog” images to train a binary classifier and use it to calculate $P(I_i)$ for all the unknown images. With image tasks, we use a pretrained convolutional neural network Simonyan and Zisserman (2014) to extract image features and train a linear support vector machine to calculate $P(I_i)$.

We model $P(C^1, \dots, C^W \mid I_i)$ as a set of independent keypresses:

$$P(C^1, \dots, C^W \mid I_i) = \prod_{w=1}^{W} \prod_{j=1}^{k_w} P(c^w_j \mid I_i) \qquad (3)$$

Finally, we model each keypress as a Gaussian distribution given a positive item. We train the mean and variance by running rapid crowdsourcing on a small set of items for which we already know the positive items. Here, the mean and variance of the distribution are modeled to estimate the delays that a worker makes when reacting to a positive item.

Intuitively, the model works by treating each keypress as creating a Gaussian “footprint” of positive probability on the images about 400ms before the keypress (Figure 1). The model combines these probabilities across several workers to identify the images with the highest overall probability.

Now that we have a set of probabilities for each item, we need to decide which ones should be classified as positive. We order the items by their likelihood of being in the positive class, $P(I_i \mid C^1, \dots, C^W)$, and set all items above a certain threshold as positive. This threshold is a hyperparameter that can be tuned to trade off precision vs. recall.
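For concreteness, the following is a minimal sketch of how equations (1)-(3) might be applied in practice; it is not the authors' implementation. To keep it short, it accumulates Gaussian delay weights per item rather than forming the full product in equation (3), and the data format, the variance, and the threshold are illustrative assumptions (the delay mean comes from the calibration study below).

```python
# A minimal sketch (not the authors' implementation) of scoring items from
# delayed keypresses. Data format, sigma, and the threshold are assumptions.
import math
from collections import defaultdict

def gaussian_pdf(x, mu, sigma):
    """Density of a Gaussian reaction delay."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def score_items(worker_sessions, mu=0.378, sigma=0.1, prior=None):
    """Combine keypresses from several workers into per-item positive scores.

    worker_sessions: one dict per worker with
      'onsets'    : {item_id: time (s) the item appeared in that worker's random order}
      'keypresses': list of times (s) the worker pressed the spacebar
    mu, sigma: mean and std of the reaction delay, fit on calibration items
               (378 ms mean from our calibration study; sigma here is illustrative).
    prior: optional {item_id: P(item is positive)}, e.g. from a pretrained classifier.
    """
    scores = defaultdict(float)
    for session in worker_sessions:
        for press in session['keypresses']:
            for item, onset in session['onsets'].items():
                delay = press - onset
                if 0.0 <= delay <= 1.0:  # only items shown shortly before the press
                    scores[item] += gaussian_pdf(delay, mu, sigma)
    if prior is not None:
        for item in scores:
            scores[item] *= prior.get(item, 1.0)
    return dict(scores)

def label_positives(scores, threshold):
    """Items whose combined score clears a tuned threshold are labeled positive."""
    return {item for item, score in scores.items() if score >= threshold}
```

In use, the scores would be computed over the randomized streams from all workers for a task, and the threshold tuned on a small labeled subset as described below.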

In total, this model has two hyperparameters: (1) the threshold above which we classify images as positive and (2) the speed at which items are displayed to the user. We tune both hyperparameters on a per-task (image verification, sentiment analysis, etc.) basis. For a new task, we first estimate how long it takes to label each item in the conventional setting with a small set of items. Next, we continuously reduce the time each item is displayed until we reach a point where the model is unable to achieve the same precision as the untimed case.
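A sketch of this per-task calibration loop follows, assuming a hypothetical helper `run_stream_and_score` that launches a timed stream on a small labeled set and returns the precision recovered by the model.

```python
# A minimal sketch of the per-task speed calibration described above.
# `run_stream_and_score(display_seconds) -> precision` is a hypothetical helper.

def find_display_speed(self_paced_seconds, run_stream_and_score,
                       target_precision, shrink=0.75, floor=0.05):
    """Shrink the display time until the model can no longer match the
    precision of the untimed (self-paced) setting."""
    best = self_paced_seconds
    candidate = self_paced_seconds * shrink
    while candidate >= floor:
        if run_stream_and_score(candidate) < target_precision:
            break  # errors are no longer recoverable at this speed
        best = candidate
        candidate *= shrink
    return best  # fastest display time that preserved precision
```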


Figure 4: We plot the change in recall as we vary the percentage of positive items in a task. We experiment at varying display speeds ranging from 100ms to 500ms. We find that recall is inversely proportional to the rate of positive stimuli and not to the percentage of positive items.
Task                        | Conventional Time (s) | Conventional Precision | Conventional Recall | Our Time (s) | Our Precision | Our Recall | Speedup
Image Verification (Easy)   | 1.50  | 0.99 | 0.99 | 0.10 | 0.99 | 0.94 | 9.00
Image Verification (Medium) | 1.70  | 0.97 | 0.99 | 0.10 | 0.98 | 0.83 | 10.20
Image Verification (Hard)   | 1.90  | 0.93 | 0.89 | 0.10 | 0.90 | 0.74 | 11.40
Image Verification (All)    | 1.70  | 0.97 | 0.96 | 0.10 | 0.97 | 0.81 | 10.20
Sentiment Analysis          | 4.25  | 0.93 | 0.97 | 0.25 | 0.94 | 0.84 | 10.20
Word Similarity             | 6.23  | 0.89 | 0.94 | 0.60 | 0.88 | 0.86 | 6.23
Topic Detection             | 14.33 | 0.96 | 0.94 | 2.00 | 0.95 | 0.81 | 10.75

Table 1: We compare the conventional approach for binary verification tasks (image verification, sentiment analysis, word similarity and topic detection) with our technique and compute precision and recall scores. Precision scores, recall scores and speedups are calculated using 3 workers in the conventional setting. Image verification, sentiment analysis and word similarity used 5 workers with our technique, while topic detection used only 2 workers. We also show the time taken (in seconds) for 1 worker to do each task.

Figure 5: We study the precision (left) and recall (right) curves for detecting “dog” (top), “a person on a motorcycle” (middle) and “eating breakfast” (bottom) images with a redundancy ranging from 1 to 5. There are 500 ground truth positive images in each experiment. We find that our technique works for simple as well as hard concepts.

2.4 Calibration: Baseline Worker Reaction Time

Our technique hypothesizes that guiding workers to work quickly and make errors can lead to results that are faster yet with similar precision. We begin evaluating our technique by first studying worker reaction times as we vary the length of time for which each item is displayed. If worker reaction times have low variance, we can model them accurately. Existing work on RSVP estimated that humans usually react about 400ms after being presented with a cue Weichselgartner and Sperling (1987); Reeves and Sperling (1986). Similarly, the model human processor Card et al. (1983) estimated that humans perceive, understand and react at least 240ms after a cue. We first measure worker reaction times, then analyze how frequently positive items can be displayed before workers are unable to react to them in time.

Method. We recruited 1,000 workers on Amazon Mechanical Turk with 96% approval rating and over 10,000 tasks submitted. Workers were asked to work on one task at a time. Each task contained a stream of 100 images of polka dot patterns of two different colors. Workers were asked to react by pressing the spacebar whenever they saw an image with polka dots of one of the two colors. Tasks could vary by two variables: the speed at which images were displayed and the percentage of the positively colored images. For a given task, we held the display speed constant. Across multiple tasks, we displayed images for 100ms to 500ms. We studied two variables: reaction time and recall. We measured the reaction time to the positive color across these speeds. To study recall (percentage of positively colored images detected by workers), we varied the ratio of positive images from 5% to 95%. We counted a keypress as a detection only if it occurred within 500ms of displaying a positively colored image.

Results. Workers’ reaction times corresponded well with estimates from previous studies. Workers tend to react an average of 378ms after seeing a positive image. This consistency is an important result for our model because it assumes that workers have a consistent reaction delay.

As expected, recall is inversely proportional to the speed at which the images are shown. A worker is more likely to miss a positive image at very fast speeds. We also find that recall decreases as we increase the percentage of positive items in the task. To measure the effects of positive frequency on recall, we record the percentage threshold at which recall begins to drop significantly at different speeds and positive frequencies. From Figure 4, at 100ms, we see that recall drops when the percentage of positive images is more than 35%. As we increase the time for which an item is displayed, however, we notice that the drop in recall occurs at a much higher percentage. At 500ms, the recall drops at a threshold of 85%. We thus infer that recall is inversely proportional to the rate of positive stimuli and not to the percentage of positive images. From these results we conclude that at faster speeds, it is important to maintain a smaller percentage of positive images, while at slower speeds, the percentage of positive images has a lesser impact on recall. Quantitatively, to maintain a recall higher than 0.7, it is necessary to limit the frequency of positive cues to one every 400ms.
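A conservative planning helper that encodes this rule of thumb (at most one positive cue every 400ms) is sketched below; note that the empirical thresholds in Figure 4, such as 35% positives at 100ms, are somewhat more permissive than this bound suggests.

```python
# A conservative planning helper based on the rule of thumb above: keep positive
# cues to at most one every 400 ms. The empirical thresholds in Figure 4 (e.g.,
# recall drops above ~35% positives at 100 ms) are somewhat more permissive.

def max_positive_fraction(display_ms, min_gap_ms=400):
    """Upper bound on the fraction of positive items to place in a stream."""
    return min(1.0, display_ms / min_gap_ms)

# e.g., max_positive_fraction(100) -> 0.25, max_positive_fraction(500) -> 1.0
```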

2.5 Study 1: Image Verification

In this study, we deploy our technique on image verification tasks and measure its speed relative to the conventional self-paced approach. Many crowdsourcing tasks in computer vision require verifying that a particular image contains a specific class or concept. We measure precision, recall and cost (in seconds) by the conventional approach and compare against our technique.

Some visual concepts are easier to detect than others. For example, detecting an image of a “dog” is a lot easier than detecting an image of “a person riding a motorcycle” or “eating breakfast.” While detecting a “dog” is a perceptual task, “a person riding a motorcycle” requires understanding of the interaction between the person and the motorcycle. Similarly, “eating breakfast” requires workers to fuse concepts of people eating a variety of foods like eggs, cereal or pancakes. We test our technique on detecting three concepts: “dog” (easy concept), “a person riding a motorcycle” (medium concept) and “eating breakfast” (hard concept). In this study, we compare how workers fare on each of these three levels of concepts.

Method. In this study, we compare the conventional approach with our technique on three (easy, medium and hard) concepts. We evaluate each of these comparisons using precision scores, recall scores and the speedup achieved. To test each of the three concepts, we labeled 10,000 images, where each concept had 500 examples. We divided the 10,000 images into streams of 100 images for each task. We paid workers $0.17 to label a stream of 100 images (resulting in a wage of $6 per hour Salehi et al. (2015)). We hired over 1,000 workers for this study satisfying the same qualifications as the calibration task.

The conventional method of collecting binary labels is to present a crowd worker with a set of items. The worker proceeds to label each item, one at a time. Most datasets employ multiple workers to label each task because majority voting Snow et al. (2008) has been shown to improve the quality of crowd annotations. These datasets usually use a redundancy of 3 to 5 workers Sheshadri and Lease (2013). In all our experiments, we used a redundancy of 3 workers as our baseline.

When launching tasks using our technique, we tuned the image display speed to 100ms. We used a redundancy of 5 workers when measuring precision and recall scores. To calculate speedup, we compare the total worker time taken by the 5 workers using our technique with the total worker time taken by the 3 workers using the conventional method. Additionally, we vary redundancy on all the concepts from 1 to 10 workers to see its effect on precision and recall.
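To make the speedup computation concrete, the all-concepts image verification comparison in Table 1 (three self-paced workers at 1.70 s each versus five workers at 100 ms each) works out to:

$$\text{speedup} = \frac{3 \times 1.70\,\text{s}}{5 \times 0.10\,\text{s}} = \frac{5.1\,\text{s}}{0.5\,\text{s}} = 10.2\times$$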

Results. Self-paced workers take 1.70s on average to label each image with a concept in the conventional approach (Table 1). They are quicker at labeling the easy concept (1.50s per worker) while taking longer on the medium (1.70s) and hard (1.90s) concepts.

Using our technique, even with a redundancy of 5 workers, we achieve a speedup of 10.2× across all concepts, with speedups of 9.0×, 10.2× and 11.4× on the easy, medium and hard concepts respectively. Overall, across all concepts, the precision and recall achieved by our technique are 0.97 and 0.81, while the precision and recall of the conventional method are 0.97 and 0.96. We thus achieve the same precision as the conventional method. As expected, recall is lower because workers are not able to detect every single true positive example. As argued previously, lower recall can be an acceptable tradeoff when it is easy to find more unlabeled images.

Now, let’s compare precision and recall scores across the three concepts, shown in Figure 5. Workers perform slightly better at finding “dog” images and find it most difficult to detect the more challenging “eating breakfast” concept. With a redundancy of 5, the three concepts achieve precisions of 0.99, 0.98 and 0.90 respectively, at recalls of 0.94, 0.83 and 0.74 (Table 1). These precisions are close to those of the conventional approach, while the recall scores are slightly lower. The recall for a more difficult cognitive concept (“eating breakfast”) is much lower, at 0.74, than for the other two concepts. More complex concepts usually have a lot of contextual variance. For example, “eating breakfast” might include a person eating a “banana,” a “bowl of cereal,” “waffles” or “eggs.” We find that while some workers react to one variety of the concept (e.g., “bowl of cereal”), others react to another variety (e.g., “eggs”).

When we increase the redundancy of workers to 10 (Figure 6), our model is able to better identify the positive images. We see diminishing increases in both recall and precision as redundancy increases. At a redundancy of 10, we increase recall to the same level as the conventional approach (0.96), while maintaining a high precision (0.99) and still achieving a roughly 5× speedup.

We conclude from this study that our technique (with a redundancy of 5) can speed up image verification with easy, medium and hard concepts by an order of magnitude while still maintaining high precision. We also show that recall can be compensated by increasing redundancy.

Figure 6: We study the effects of redundancy on recall by plotting precision and recall curves for detecting “a person on a motorcycle” images with a redundancy ranging from 1 to 10. We see diminishing increases in precision and recall as we increase redundancy. We achieve the same precision and recall scores as the conventional approach with a redundancy of 10 while still achieving a roughly 5× speedup.

2.6 Study 2: Non-Visual Tasks

So far, we have shown that rapid crowdsourcing can be used to collect image verification labels. We next test the technique on a variety of other common crowdsourcing tasks: sentiment analysis Pang and Lee (2008), word similarity Snow et al. (2008) and topic detection Lewis and Hayes (1994).

Method. In this study, we measure precision, recall and speedup achieved by our technique over the conventional approach. To determine the stream speed for each task, we followed the prescribed method of running trials and speeding up the stream until the model starts losing precision. For sentiment analysis, workers were shown a stream of tweets and asked to react whenever they saw a positive tweet. We displayed tweets at 250ms with a redundancy of 5 workers. For word similarity, workers were shown a word (e.g., “lad”) for which we wanted synonyms. They were then rapidly shown other words at 600ms and asked to react if they see a synonym (e.g., “boy”). Finally, for topic detection, we presented workers with a topic like “housing” or “gas” and presented articles of an average length of 105 words at a speed of 2s per article. They reacted whenever they saw an article containing the topic we were looking for. For all three of these tasks, we compare precision, recall and speed against the self-paced conventional approach with a redundancy of 3 workers. Every task, for both the conventional approach and our technique, contained 100 items.

To measure the cognitive load on workers for labeling so many items at once, we ran the widely-used NASA Task Load Index (TLX) Colligan et al. (2015) on all tasks, including image verification. TLX measures the perceived workload of a task. We ran the survey on 100 workers who used the conventional approach and 100 workers who used our technique across all tasks.

Results. We present our results in Table 1 and Figure 7. For sentiment analysis, we find that workers in the conventional approach classify tweets in 4.25s; with a redundancy of 3 workers, the conventional approach takes 12.75s of worker time per tweet with a precision of 0.93. Using our method and a redundancy of 5 workers, we spend 1.25s of worker time per item (250ms per worker per item) with a precision of 0.94. Our technique therefore achieves a 10.2× speedup.

Likewise, for word similarity, workers take around 6.23s per item in the conventional task, while our technique displays each item for 600ms. We capture a comparable precision of 0.88 using 5 workers, against a precision of 0.89 in the conventional method with 3 workers. Since finding synonyms is a higher-level cognitive task, workers take longer on word similarity than on image verification and sentiment analysis. We achieve a 6.2× speedup.

Finally, for topic detection, workers spend significant time analyzing articles in the conventional setting (14.33s on average); with 3 workers, the conventional approach takes 43s of worker time per article. In comparison, our technique allots 2s to each article. With a redundancy of only 2 workers, we achieve a precision of 0.95, similar to the 0.96 achieved by the conventional approach. The total worker time to label one article using our technique is 4s, a 10.75× speedup.

The mean TLX workload for the control condition was 58.5, compared to 62.4 for our technique. Unexpectedly, the difference between conditions was not statistically significant. The “temporal demand” scale item appeared to be elevated for our technique (61.1 vs. 70.0), but this difference was also not significant. We conclude that our technique can be used to scale crowdsourcing on a variety of tasks without a statistically significant increase in worker workload.

Figure 7: Precision (left) and recall (right) curves for sentiment analysis (top), word similarity (middle) and topic detection (bottom) images with a redundancy ranging from 1 to 5. Vertical lines indicate the number of ground truth positive examples.

2.7 Study 3: Multi-class Classification

In this study, we extend our technique from binary to multi-class classification to capture an even larger set of crowdsourcing tasks. We use our technique to create a dataset where each image is classified into one category (“people,” “dog,” “horse,” “cat,” etc.). We compare our technique with a conventional technique Deng et al. (2009) that collects binary labels for each image for every single possible class.

Method. Our aim is to classify a dataset of 2,000 images into 10 categories, where each category contains between 100 and 250 examples. We compared three methods of multi-class classification: (1) a naive approach that collected 10 binary labels (one for each class) for each image, (2) a baseline approach that used our interface and classified images one class (chosen randomly) at a time, and (3) a class-optimized approach that used our interface to classify images starting from the class with the most examples. When using our interface, we broke tasks into streams of 100 images displayed for 100ms each. We used a redundancy of 3 workers for the conventional interface and 5 workers for our interface. We calculated the precision and recall scores for each of these three methods as well as the cost (in seconds) of each method.

Results. (1) In the naive approach, we need to collect 20,000 binary labels that take 1.7s each. With a redundancy of 3 workers, this takes 102,000s ($170 at a wage rate of $6/hr). (2) Using the baseline approach, it takes 12,342s ($20.57), a speedup of roughly 8.3× over the naive approach. (3) Finally, the class-optimized approach detects the most common class first and hence reduces the number of times an image is sent through our interface. It takes 11,700s ($19.50), a speedup of roughly 8.7× over the naive approach. While the difference between the baseline and the class-optimized methods is small here, it would grow on a larger dataset with more classes.
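For concreteness, the naive cost above is consistent with 2,000 images, 10 binary questions per image, 1.7 s per label, and a redundancy of 3:

$$2{,}000 \times 10 \times 1.7\,\text{s} \times 3 = 102{,}000\,\text{s} \approx 28.3\,\text{hours} \approx \$170 \text{ at } \$6/\text{hr}$$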

2.8 Application: Building ImageNet

Our method can be combined with existing techniques Deng et al. (2014); Song et al. (2011); Parkash and Parikh (2012); Biswas and Parikh (2013) that optimize binary verification and multi-class classification by preprocessing data or using active learning. One such method Deng et al. (2014) annotated ImageNet (a popular large dataset for image classification) effectively with a useful insight: its classes can be grouped together into higher-level semantic concepts. For example, “dog,” “rabbit” and “cat” can be grouped into the concept “animal.” By utilizing the hierarchy of labels that is specific to this task, they were able to preprocess and reduce the number of labels needed to classify all images. As a case study, we combine our technique with their insight and evaluate the speedup in collecting a subset of ImageNet.
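A minimal sketch of how the hierarchy insight composes with our rapid streams is shown below; the hierarchy dictionary and the `run_binary_stream` helper are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of combining a label hierarchy with rapid binary streams:
# verify a broad parent concept first, then ask fine-grained classes only of the
# items that passed. `run_binary_stream(items, concept) -> set` is assumed.

def hierarchical_classify(items, hierarchy, run_binary_stream):
    """hierarchy: {parent_concept: [child_class, ...]}, e.g. {"animal": ["dog", "cat"]}."""
    labels = {}
    remaining = set(items)
    for parent, children in hierarchy.items():
        candidates = run_binary_stream(remaining, parent)  # one broad pass
        for child in children:
            positives = run_binary_stream(candidates, child)
            for item in positives:
                labels[item] = child
            candidates -= positives
        remaining -= set(labels)
    return labels
```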

Method. We focused on a subset of the dataset with 20,000 images and classified them into 200 classes. We conducted this case study by comparing three ways of collecting labels: (1) The naive approach asked 200 binary questions for each image in the subset, where each question asked if the image belonged to one of the 200 classes. We used a redundancy of 3 workers for this task. (2) The optimal-labeling method used the insight to reduce the number of labels by utilizing the hierarchy of image classes. (3) The combined approach used our technique for multi-class classification combined with the hierarchy insight to reduce the number of labels collected. We used a redundancy of 5 workers for this technique with tasks of 100 images displayed at 250ms.

Results. (1) Using the naive approach, we would ask 4 million binary verification questions. Given that each binary label takes 1.7s (Table 1), we estimate that labeling the entire subset would take 6.8 million seconds ($11,333 at a wage rate of $6/hr). (2) The optimal-labeling method is estimated to take 1.13 million seconds ($1,888) Deng et al. (2014). (3) Combining the hierarchical questions with our interface, we annotate the subset in 136,800s ($228). By combining our speedup with the speedup from intelligent question selection, we achieve a roughly 50× speedup in total.

2.9 Discussion

Absence of Concepts. We focused our technique on positively identifying concepts; we also tested its effectiveness at classifying the absence of a concept. Instead of asking workers to react when they see a “dog,” if we ask them to react when they do not see a “dog,” our technique performs poorly: recall is far lower than when detecting the presence of “dog”s, and recovering it requires slowing the feed down considerably, which shrinks the speedup. We conclude that our technique performs poorly for anomaly detection tasks, where the presence of a concept is common but its absence, an anomaly, is rare. More generally, this exercise suggests that some cognitive tasks are less robust to rapid judgments. Preattentive processing can help us find “dog”s, but ensuring that there is no “dog” requires a linear scan of the entire image.

Typicality. To better understand the active mechanism behind our technique, we turn to concept typicality. A recent study Iordan et al. (2015) used fMRI to measure humans’ recognition speed for different object categories, finding that images of the most typical exemplars of a class were recognized faster than the least typical ones. They calculated typicality scores for a set of image classes based on how quickly humans recognized them. In our image verification task, many of the false negatives were likewise atypical examples. Not detecting atypical images might lead to the curation of image datasets that are biased towards more common categories. For example, when curating a dataset of dogs, our technique would be more likely to find common breeds like “dalmatians” and “labradors” and miss rare breeds like “romagnolos” and “otterhounds.” More generally, this approach may amplify biases and reduce clarity on edge cases. Slowing down the feed reduces atypical false negatives, resulting in a smaller speedup but a higher recall for atypical images.

Conclusion. We have suggested that crowdsourcing can speed up labeling by encouraging a small amount of error rather than forcing workers to avoid it. We introduce a rapid slideshow interface where items are shown too quickly for workers to get all items correct. We algorithmically model worker errors and recover their intended labels. This interface can be used for binary verification tasks like image verification, sentiment analysis, word similarity and topic detection, achieving speedups of , , and respectively. It can also extend to multi-class classification and achieve a speedup of . Our approach is only one possible interface instantiation of the concept of encouraging some error; we suggest that future work may investigate many others. Speeding up crowdsourcing enables us to build larger datasets to empower scientific insights and industry practice. For many labeling goals, this technique can be used to construct datasets that are an order of magnitude larger without increasing cost.

3 Data acquisition through social interactions

Modern supervised machine learning (ML) systems in domains such as computer vision rely on mountains of human-labeled training data. These labeled images, for example the fourteen million images in ImageNet Deng et al. (2009), require basic human knowledge such as whether an image contains a chair. Unfortunately, this knowledge is both so simple that labeling it is extremely tedious for humans, and yet so tacit that human annotators are still required. In response, crowdsourcing efforts often recruit volunteers to help create labels via intrinsic interest, curiosity or gamification Lintott et al. (2008); Law et al. (2016); Willis et al. (2017); von Ahn and Dabbish (2004a).

The general approach of these crowdsourcing efforts is to focus on what to ask each contributor. Specifically, from a large set of possible tasks, many systems formalize an approach to route or recommend tasks to specific contributors Geiger and Schader (2014); Lin et al. (2014a); Ambati et al. (2011); Difallah et al. (2013). Unfortunately, many of these volunteer efforts are restricted to labels for which contributions can be motivated, leaving incomplete any task that is uninteresting to contributors Reich et al. (2012); Hill (2013); Healy and Schussman (2003); Warncke-Wang et al. (2015).


Figure 8: We introduce an approach that increases crowdsourcing participation rates by learning to augment requests with image- and text-relevant question asking strategies drawn from social psychology. Given a social media image post and a question, our approach selects a strategy and generates a natural language phrase to augment the question.
Figure 9: Our agent chooses appropriate social strategies and contextualizes questions to maximize crowdsourcing participation.

Our paper specifically studies an instantiation of this common ailment in the context of visual question answering (VQA). VQA generalizes numerous computer vision tasks, including object detection Deng et al. (2009), relationship prediction Lu et al. (2016), and action prediction Niebles et al. (2008). Progress in VQA supports the development of many human-computer interaction systems, including VizWiz Bigham et al. (2010), TapTapSee, BeMyEyes, and CamFind (applications can be found at https://taptapsee.com/, https://www.bemyeyes.com/, and https://camfindapp.com/). VQA is a data-hungry machine learning task for which it is challenging to motivate contributors. Existing VQA crowdsourcing strategies have suggested using social media to incentivize online participants to answer visual questions for assistive users Bigham et al. (2010); Brady et al. (2015), but many such questions remain unanswered Brady et al. (2013).

To meet the needs of modern ML systems, we argue that crowdsourcing systems can automatically generate plans not just for what to ask about, but also for how to make that request. Social psychology and social computing research have made clear that how a request is structured can have substantial effects on resulting contribution rates Kraut and Resnick (2011); Yang and Kraut (2017). However, while it is feasible to manually design a single request such as one email message to all users in an online community, or one motivational message on all web pages on Wikipedia, in real life (as in VQA) there exist a wide variety of situations that must each be approached differently. Supporting this variety in how a request is made has remained out of reach; in this paper, we contribute algorithms to achieve it.

Consider, for example, that we are building a dataset of images with their tagged geolocations (Figure 8). When we encounter an image of a person wearing a black shirt next to beautiful scenery, existing machine learning systems can generate questions such as “where is this place?”. However, prior work reports that such requests seem mechanical, resulting in lower response rates Brady et al. (2013). In our approach, requests might be augmented by content compliment strategies Robert (1984) reactive to the image content, such as “What a great statue!” or “That’s a beautiful building!”, or by interest matching strategies Cialdini (2016) reactive to the image content, such as “I love visiting statues!” or “I love seeing old buildings!”

Augmenting requests with social strategies requires (1) defining a set of possible social strategies, (2) developing a method to generate content for each strategy conditioned on an image, and (3) choosing the appropriate strategy to maximize response conditioned on the user and their post. In this paper, we tackle these three challenges. First, we adopt a set of social strategies that social psychologists have demonstrated to be successful in human-human communication Cialdini (2016); Robert (1984); Langer et al. (1978); Taylor and Thomas (2008); Hoffman (1981). While our set is not exhaustive, it represents a diverse list of strategies — some that augment questions conditioned on the image and others conditioned on the user’s language. While previous work has explored the use of ML models to generate image-conditioned natural language fragments, such as captions and questions, ours is the first method that employs these techniques to generate strategies that increase worker participation.

To test the efficacy of our approach, we deploy our system on Instagram, a social media image-sharing platform. We collect datasets and develop machine learning-based models that use a convolutional neural network (CNN) to encode the image contents and a long short-term memory network (LSTM) to generate each social strategy across a large set of different kinds of images. We compare our ML strategies against baseline rule-based strategies using linguistic features extracted from the user’s post Li et al. (2010). We show a sample of augmented questions in Figure 9. We find that choosing appropriate strategies and augmenting requests leads to a significant absolute participation increase of over no strategy when using ML strategies and a increase when using rule-based strategies. We also find that no specific strategy is the universal best choice, implying that knowing when to use a strategy is important. While we specifically focus on VQA and Instagram, our approach generalizes to other crowdsourcing systems that support language-based interaction with contributors.

3.1 Related work

Our work is motivated by research in crowdsourcing, peer production and social computing that increases contributors’ levels of intrinsic motivation. We thread this work together with advances in natural language generation technologies to contribute generative algorithms that modulate the form of requests to increase contribution rates.

Crowdsourcing strategies. The HCI community has investigated different ways to incentivise people to participate in data-labeling tasks Hill (2013); Healy and Schussman (2003); Reich et al. (2012). Designing for curiosity, for example, increases crowdsourcing participation Law et al. (2016). Citizen science projects like GalaxyZoo mobilize volunteers by motivating them to work on a domain that aligns with their interests Lintott et al. (2008). Unlike the tasks typically explored by such methods, image-labeling is typically not an intrinsically motivated task, and is instead completed by paid ghost work Gray and Suri (2019). To improve image-labeling, the ESP Game harnessed game design to solve annotation tasks as by-products of entertainment activities von Ahn and Dabbish (2004b). However, games result in limited kinds of labels, and need to be designed specifically to attain certain types of labels. Instead, we ask directed questions through conversations to label data and use social strategies to motivate participation.

Interaction through conversations. The use of natural language as a medium for interaction has galvanized many systems Huang et al. (2018); Lasecki et al. (2013). Natural language has been proposed as a medium to gather new data from online participants Bigham et al. (2010) or guide users through workflows Fast et al. (2018). Conversational agents have also been deployed through products like Apple’s Siri, Amazon’s Echo, and Microsoft’s Cortana. Substantial effort has been placed on teaching people how to talk to such assistants. Noticing this limitation, more robust crowd-powered conversational systems have been created by hiring professionals, as in the case of Facebook M Hempel (2015), or crowd workers Lasecki et al. (2013); Bohus and Rudnicky (2009). Unlike these approaches, where people have a goal and invoke a passive conversational agent, we build active agents that reach out to people with questions to increase human participation.

Social interaction with machines. To design an agent capable of eliciting a user’s help, we need to understand how a user views the interaction. The Media Equation proposes that people adhere to similar social norms in their interactions with computers as they do in interactions with other people Reeves and Nass (1996). It shows that agents that seem more human-like, in terms of behaviour and gestures, prompt users to treat them more like a person Cassell and Thórisson (1999); Cerrato and Ekeklint (2002); Nass and Brave (2007). Consistent with these observations, prior work has also shown that people are more likely to resolve misunderstandings with more human-like agents Corti and Gillespie (2016). This leads us to ask whether a human-like conversational agent can encourage more participation from online contributors. Prior work on interactions with machines investigates social norms that a machine can mimic in a binary capacity — either it respects the norm correctly or violates it through negligence Sardar et al. (2012); Chidambaram et al. (2012). Instead, we place social interaction on a spectrum — some social strategies are more successful than others in a given context — and learn a selection strategy that maximizes participation.

Structuring requests to enhance motivation. There have been many proposed social strategies to enhance the motivation to contribute in online communities Kraut and Resnick (2011). For example, asking a specific question rather than making a statement or asking an open-ended question increases the likelihood of getting a response Burke et al. (2014). Requests succeed significantly more often when contributors are addressed by name Markey (2000). Emergencies receive more responses than requests without time constraints Darley and Latané (1968). Prior work has shown that factors that increase the contributor’s affinity for the requester increase the persuasive power of the message on online crowdfunding sites Yang and Kraut (2017). It has also been observed that different behaviours elicit different kinds of support from online support groups, with self-disclosure eliciting emotional support and question asking resulting in informational support Wang et al. (2015). The severity of the outcome of responding to a request can also influence motivation Chaiken (1989). Our work incorporates some of these established social strategies and leverages language generation algorithms to build an agent that can deploy them across a wide variety of different requests.

Figure 10: Given a social media post and a question we want to ask, we augment the question with a social strategy. Our system contains two components. First, a selection component featurizes the post and user and chooses a social strategy. Second, a generation component creates a natural language augmentation for the question given the image and the chosen strategy. The contributor’s response or silence is used to generate a feedback reward for the selection module.

3.2 Social strategies

The goal of our system is to draw on theories of how people ask other people for help and favors, then learn how to emulate those strategies. Drawing on prior work, we sampled a diverse set of nine social strategies. While the set of nine social strategies we explore is not exhaustive, we believe it represents a wide enough range of possible strategies to demonstrate the method and effects of teaching social strategies to machines. The social strategies we explore are:

  1. Content compliment: Compliment the image or an object in the image before asking the question. This increases the liking between the agent and the contributor, making them more likely to reciprocate with the request Robert (1984).

  2. Expertise compliment: Compliment the knowledge of the contributor who posted the image. This commits the contributor as an “expert”, resulting in a thoughtful response Robert (1984).

  3. Interest matching: Show interest in the topic of the contributor’s post. This creates a sense of unity between the agent and contributor Cialdini (2016).

  4. Valence matching: Match the valence of the contributor based on their image’s caption. People evolved to act kindly to others who exhibit behaviors from a similar culture Taylor and Thomas (2008).

  5. Answer attempt: Guess an answer and ask for a validation. Recognizing whether a shown answer is correct or not is cognitively an easier task for the listener than recalling the correct answer Gillund and Shiffrin (1984).

  6. Time scarcity: Specify an arbitrary deadline for the response. People are more likely to act if the opportunity is deemed to expire, even if they neither need nor want the opportunity Robert (1984).

  7. Help request: Explicitly request the contributor’s help. People are naturally inclined to help others when they are asked and able to do so Hoffman (1981).

  8. Logical justification: Give a logical reason for asking the question to persuade the contributor at a cognitive level Langer et al. (1978).

  9. Random justification: Give a random reason for asking the question. People are more likely to help if a justification is provided, even if it does not actually entail the request Langer et al. (1978).

3.3 System Design

In this section, we describe our approach for augmenting requests with social strategies. Our approach is divided into two components: generation and selection. A high-level system diagram is depicted in Figure 10. Given a social media post, we featurize the post metadata, question, and caption, then send them to the selection component. The main goal of the selection component is to choose an effective social strategy to use for the given post. This strategy, along with a generated question to ask Krishna et al. (2019), and the social media post are sent to the generation component, which augments the question by generating a natural language phrase for the chosen social strategy. The augmented request is then shared with the contributor. The selection module gathers feedback: positive if the contributor responds in an informative manner, negative if the response is uninformative or absent.

Figure 11: Example augmentations generated by each of our social strategies.

Selection: Choosing a social strategy We model our selection component as a contextual bandit. Contextual bandits are a common reinforcement learning technique for efficiently exploring different options and exploiting the best choices over time, generalizing from previous trials to uncommonly observed situations Li et al. (2010). The component receives a feature vector and outputs its choice of an arm (option) that it expects to result in the highest expected reward.

Each social media post is represented as a feature vector that encodes information about the user, the post, and the caption. User features include: the number of posts the user has posted, the number of followers, the number of accounts the user is following, the number of other users tagged in their posts, the filters and AR effects the user frequently uses on the platform, the user’s engagement with videos, whether the user is a verified business or an influencer, the user’s privacy settings, engagement with Instagram features such as highlight reels and resharing, and sentiment analysis of their biography. Post features include the number of users who liked the post and the number of users who commented on it. User and post features are drawn from Instagram’s API and featurized as bag-of-words or one-hot vectors. Lastly, caption features include sentiment extracted using Vader Hutto and Gilbert (2014) and hashtags extracted using regular expressions.

We train a contextual bandit model to choose a social strategy given the extracted features, conditioned on the success of each social strategy used on similar social media posts in the past. The arms that the contextual bandit considers represent each of the nine social strategies that the system can use. If a chosen social strategy receives a response, we parse and check if the response contains an answer Devlin et al. (2018). If so, the model receives a positive reward for choosing the social strategy. If a chosen social strategy does not receive a response, or if the response does not contain an answer, the model receives a negative reward.

Our contextual bandit implementation uses an adaptive greedy algorithm to balance the trade-off between exploration and exploitation. During training, the algorithm chooses an option whose reward the model is still highly uncertain about. If there is no option with high uncertainty, the algorithm chooses a random option to explore. The uncertainty threshold decreases as the model is exposed to more data. During inference, the model predicts the social strategy with the highest expected reward Zhang (2004).
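To make the selection component concrete, the following is a minimal sketch of an adaptive-greedy contextual bandit over the nine strategies. The strategy names come from the list in Section 3.2; the ridge-regression reward estimates, uncertainty measure, decay schedule, and feature dimensionality are illustrative assumptions rather than our exact implementation.

```python
import numpy as np

STRATEGIES = [
    "content_compliment", "expertise_compliment", "interest_matching",
    "valence_matching", "answer_attempt", "time_scarcity",
    "help_request", "logical_justification", "random_justification",
]

class AdaptiveGreedyBandit:
    """Simplified adaptive-greedy contextual bandit over the nine strategies.

    Each arm keeps a ridge-regression estimate of reward; during training we
    explore an arm whose estimate is still uncertain (or a random arm if none
    are), and at inference we exploit the arm with the highest prediction.
    """

    def __init__(self, n_features, threshold=1.0, decay=0.999):
        self.A = [np.eye(n_features) for _ in STRATEGIES]     # per-arm design
        self.b = [np.zeros(n_features) for _ in STRATEGIES]   # per-arm response
        self.threshold, self.decay = threshold, decay

    def _estimates(self, x):
        means, widths = [], []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            means.append(A_inv @ b @ x)              # expected reward
            widths.append(np.sqrt(x @ A_inv @ x))    # reward uncertainty
        return np.array(means), np.array(widths)

    def select(self, x, explore=True):
        means, widths = self._estimates(x)
        if not explore:                              # inference: exploit
            return int(np.argmax(means))
        self.threshold *= self.decay                 # explore less over time
        uncertain = np.flatnonzero(widths > self.threshold)
        if len(uncertain) > 0:
            return int(np.random.choice(uncertain))
        return int(np.random.randint(len(STRATEGIES)))

    def update(self, arm, x, reward):
        # reward: +1 if the reply contained an answer, -1 otherwise
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Usage: featurize a post, pick a strategy, then record the outcome.
bandit = AdaptiveGreedyBandit(n_features=32)
x = np.random.rand(32)                # stand-in for post/user/caption features
arm = bandit.select(x)
bandit.update(arm, x, reward=+1)      # e.g., an informative response arrived
print("chose strategy:", STRATEGIES[arm])
```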

Generation: Augmenting questions The generation component receives the social media post (an image and a caption) and a raw question automatically generated by existing visual question generation algorithms (e.g., “Where is this place?”). It produces a natural language contextualization of the question using one of the nine social strategies chosen by the selection component.

We build nine independent natural language generation systems that each receive a social media post as input and produce a comment using the corresponding social strategy as output. Four of the social strategies require knowledge about the content of images, and are implemented using machine learning-based models. These strategies cannot be templatized, as there is substantial variation in the kinds of images found online and the approaches must be personalized to the content of the image. We use the other five social strategies as baseline strategies that only require knowledge about the speaking style of the social media user, and are implemented as rule-based expert systems in conjunction with natural language processing techniques. We discuss these two types of models below.

Machine learning-based social strategies. To generate sentences specific to the image of each post, we train one machine learning model for each of the four social strategies that require knowledge about the image: expert compliment, content compliment, interest matching, and logical justification.

We build a dataset of k social media posts alongside examples of questions that use each of the four social strategies, with the help of crowd workers on Amazon Mechanical Turk. This process results in a dataset of k questions, each with social strategy augmentations. The posts are randomly selected by polling Instagram for images with one of the top most popular hashtags and filtering for those that refer to visual content, such as animal, travel, shopping, food, etc. Crowdworkers are assigned to one of the four strategy categories and trained using examples and a qualifying task, which we manually evaluate. Each task contains social media posts (images and captions) and the generated questions. Workers are asked to submit a natural language sentence that can be pre- or post-pended to the question while adhering to the social strategy they are trained to emulate. The workers are paid a compensation that is equivalent to an hour for their work. (The dataset of social media posts and social strategies for training the reinforcement learning model, as well as the trained contextual bandit model, is publicly available at http://cs.stanford.edu/people/ranjaykrishna/socialstrategies.)

We adopt a traditional image-to-sequence machine learning model to generate the sentence for each strategy. Each model encodes the social media image using a convolutional neural network (CNN) Krizhevsky et al. (2012b) and generates a social strategy sentence, conditioned on image features, using a long short term memory (LSTM) network Hochreiter and Schmidhuber (1997). We train each model on the dataset of k posts dedicated to its assigned strategy using stochastic gradient descent with a learning rate of for epochs.
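The following PyTorch sketch illustrates this image-to-sequence setup (a CNN encoder conditioning an LSTM decoder). The backbone, vocabulary size, hidden dimensions, and learning rate are placeholders, not the configuration we trained.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class StrategyGenerator(nn.Module):
    """CNN image encoder + LSTM decoder that emits a strategy phrase."""

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(weights=None)           # placeholder backbone
        cnn.fc = nn.Linear(cnn.fc.in_features, embed_dim)
        self.encoder = cnn
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # Condition the LSTM on image features by prepending them as the
        # first "token" of the sequence (teacher forcing on the rest).
        feats = self.encoder(images).unsqueeze(1)          # (B, 1, E)
        tokens = self.embed(captions[:, :-1])              # (B, T-1, E)
        hidden, _ = self.lstm(torch.cat([feats, tokens], dim=1))
        return self.out(hidden)                            # (B, T, vocab)

# One illustrative SGD training step with next-token cross-entropy.
model = StrategyGenerator(vocab_size=10_000)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)   # placeholder LR
criterion = nn.CrossEntropyLoss()

images = torch.randn(4, 3, 224, 224)                       # dummy batch
captions = torch.randint(0, 10_000, (4, 12))               # dummy token ids
logits = model(images, captions)
loss = criterion(logits.reshape(-1, logits.size(-1)), captions.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```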

Figure 12: Example responses to expertise compliment, help request, logical justification, content compliment and valence matching in the travel domain.

Baseline rule-based social strategies. To generate social strategy sentences that are relevant to the caption of each social media post, we create a rule-based expert system for each of the five remaining social strategies: valence matching, answer attempt, help request, time scarcity, and random justification. While these algorithms use statistical machine learning approaches for natural language processing, we call them rule-based systems to clarify that the generation itself is a deterministic process, unlike the sentences generated by the LSTM networks.

Valence matching detects the emotional valence of the caption through punctuation parsing and sentiment analysis using an implementation of the Vader algorithm Hutto and Gilbert (2014). The algorithm generates a sentence whose emotional valence is approximately equal to that of the caption by matching the type and number of punctuation marks and adding appropriate exclamations like “Wow!” or “Aw”.
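As an illustration, a minimal version of this rule can be written with the vaderSentiment package; the thresholds and exclamation choices below are hypothetical stand-ins for the deployed rules.

```python
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def match_valence(caption, question):
    """Prepend an exclamation whose valence roughly matches the caption's."""
    compound = analyzer.polarity_scores(caption)["compound"]   # in [-1, 1]
    exclamations = caption.count("!")              # mirror punctuation intensity

    if compound >= 0.5:                            # clearly positive caption
        prefix = "Wow" + "!" * max(1, exclamations)
    elif compound <= -0.05:                        # negative caption: soften
        prefix = "Aw."
    else:                                          # neutral caption
        prefix = "Hey!"
    return f"{prefix} {question}"

print(match_valence("Best sunset of the trip!!!", "Where is this place?"))
```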

Answer attempt guesses a probable answer for the input post based on the raw question and the hashtags of the post. To guess a probable answer, we manually curate a set of likely answers for each problem domain and caption keyword, and randomly choose one from the set. For example, when asking where we could buy the same item on a post that references the word “jean” in the “shopping” domain, the set of probable answers is a list of brands that sell jeans to consumers. Deployments of this strategy do not have to rely on a curated list and can instead use existing answering models Antol et al. (2015).

Help request augments the agent’s question with variations of words and sentence structures that humans use to request help from one another. Time scarcity augments the agent’s question with variations of a sentence that requests the answer to be provided within hours. Random justification augments the agent’s question with a justification that is chosen irrespective of the social media post. Specifically, we store a list of justification sentences generated from the logical justification system for other posts, and retrieve one at random. Figure 11 visualizes example augmentations generated by each of our nine strategies, conditioned on the post.

3.4 Experiments

We evaluate the utility of augmenting questions with social strategies through a real-world deployment on Instagram. Our aim is to increase online crowdsourcing participation from Instagram users when we ask them questions about the contents of their images. We begin our experiments by first describing the experimental setup, the metrics used, the baselines, and the strategies surveyed. Next, we study how generated social strategies impact participation. Finally, we study the importance of selecting the correct social strategy.

Experimental setup We poll images from Instagram, featurize the post, select a social strategy, and generate the question augmentation. We post the augmented question and wait for a response.

Images and raw questions. We source images from Instagram across four domains: travel, animals, shopping and food. Images from each domain are polled by searching for posts with the hashtags: travel, animals, shopping, and food. Images in these four domains constitute an upper bound of of all images posted with one of the top popular hashtags that represent visual content. Since we are studying the impact of using different social strategies by directly interacting with real users on Instagram, we cannot post multiple questions, each augmented with a different strategy, to the same image post. Ideally, in online crowdsourcing deployments, the raw questions generated would be conditioned on the post or image. In our case, however, we use only one question per domain so that all users are exposed to the same basic question. For each domain, we hold the raw question constant: for example, “Where is this place?” for travel, “What animal is that?” for animals, “Where can I get that?” for shopping, and “What is this food?” for food.

Metrics. To measure improvements in crowdsourcing participation, we report the percentage of informative responses. After a question is posted on Instagram, we wait hours to check whether a response was received. If the question results in no response, if the response does not answer the question, or if the user appears confused (e.g. “huh?” or “I don’t understand”), the interaction is not counted as an informative response. To verify whether a response is informative, we send all responses to Amazon Mechanical Turk (AMT) workers, who report whether the question was actually answered; we include gold standard responses to guarantee quality.

Source domain (%) Target domain (%)
Expertise compliment 72.90 29.55
Content compliment 59.11 68.96
Interest matching 45.31 85.38
Logical justification 55.17 19.7
Answer attempt 41.37 42.69
Help request 31.52 32.84
Valence matching 37.43 36.12
Time scarcity 24.63 26.27
Random justification 17.73 32.84
ML based strategies 58.12 50.89
Rule based strategies 30.54 34.15
No strategy 15.76 13.13
Table 2: Response rates achieved by different strategies on posts in the source and target domains. The bottom of the table shows a comparison between average performance of ML based strategies, average performance of rule-based strategies and baseline un-augmented questions
Figure 13: Difference between response rate of the agent and humans for each social strategy. Green indicates the agent is better than people and red indicates the opposite.

Strategies surveyed. We use all nine strategies described earlier and add a baseline and an oracle strategy. The baseline case posts the raw question with no augmentation. The oracle method asks AMT workers to modify the question without any restrictions to maximize the chances of receiving the answer. They don’t have to follow any of our outlined social strategies.

Dataset of online interactions. To study the impact of using social strategies, we collect a dataset of k posts for each of the ML social strategies, resulting in a dataset of k questions with augmentations. The rule strategies don’t require any training data. Once trained, we post questions per strategy to Instagram, resulting in total posts. To further study the scalability and transfer of strategies learned in one domain and applied to another, we train augmentation models using data from a “source” domain and test its effect on posts from “target” domains. For example, we train models using data collected from the travel source domain and test on the rest as target domains.

To train the selection model, we gather k posts from Instagram and generate augmentations with each of the social strategies. Each post, with all of its augmented questions, is sent to AMT workers, who are asked to pick the strategies that would be appropriate to use. We choose to train the selection model using AMT instead of Instagram as it allows us to quickly collect large amounts of training data and negate the impact of other confounds. Each AMT task included ten social media posts. One out of the ten posts contained an attention checker in the question to verify that the workers were actually reading the questions. Workers were compensated at a rate of per hour.

Augmenting questions with social strategies Our goal in the first set of experiments is to study the effect of using social strategies to augment questions.

Informative responses. Before we inspect the effects of social strategies, we first report the quality of responses from Instagram users. We manually annotate all our responses and find that of questions are both relevant and answerable. Out of the relevant questions, of responses were informative, i.e., the responses contained the correct answer to the question. Figure 12 visualizes a set of example responses for different posts with different social strategies in the travel domain. While all social strategies outperformed the baseline in receiving responses, the quality of the responses differed across strategies.

Effect of social strategies. Table 2 reports the informative response rate across all the social strategies. We find that, compared to the baseline case where no strategy is used, rule-based strategies improve participation by percentage points. An unpaired t-test confirms that participation increases by designing appropriate rule-based social strategies (, ). When social strategy data is collected and used to train ML strategies, performance increases by percentage points and percentage points when compared against un-augmented (, ) and rule-based strategies (, ), as confirmed by unpaired t-tests. Overall, we find that expertise compliment and logical justification performed strongly in the shopping domain, but weakly in the animals and food domains.

Figure 14: Example strategy selection and augmentations in the travel domain. (a) Our system learns to focus on different aspects of the image. (b) The system is able to discern between very similar images and understand that the same objects can have different connotations. (c, d) Example failure case when objects were misclassified.

To test the scalability of our strategies across image domains, we train models on a source domain and deploy them on a target domain. We find that expertise compliment drops in performance while interest matching improves. The drop implies that machine learning models that depend heavily on the example data points used during training are not robust in new domains. Therefore, while machine learning strategies are the most effective, they require strategy data collected for the domain in which they are deployed. Even with the drop in performance, however, they still improve response rate, demonstrating that machine learning strategies scale across domains but that their impact diminishes as the image content distribution diverges from the source domain. The increase in performance of interest matching indicates that different domains may have different dominating social strategies, i.e. no single dominant strategy exists across all domains, and that a selection component is necessary.

Agent versus human augmentations. We compare the augmentations generated by our agent against those created by crowdworkers. We report the difference in response rate between the agent and the human augmentations across the different strategies in Figure 13. A two-way ANOVA finds that the strategy used has a significant effect on the response rate (, ) but the poster has no significant effect on the response rate (, ). The ANOVA also found a significant interaction effect between the strategy and the poster on response rate (, ). A posthoc Tukey test indicates that the agent using machine learning strategies achieves a significantly higher response rate than the agent using rule-based strategies () or humans using rule-based strategies (). This demonstrates that a machine learning model that has witnessed examples of social strategies can outperform rule-based systems. However, there is no significant difference between the agent using machine learning strategies and humans using the same social strategies.

Learning to select a social strategy In our previous experiment we established that different domains have different strategies that perform best. Now, we evaluate how well our selection component performs at selecting the most effective strategy. Specifically, we test how well our selection model performs (1) against a random strategy, (2) against the most effective strategy (expertise compliment) from the previous experiment, and (3) against the oracle strategy generated by crowdworkers. Recall that the oracle strategy does not constrain workers to use any particular strategy.

Since this test requires testing multiple strategies on the same post, we perform our evaluation on AMT. Workers are shown two strategies for a given post and asked to choose which strategy is most likely to receive a response. We perform pairwise comparisons between our selection model and a random strategy across k posts, against expertise compliment across posts, and against open-ended human questions across posts.

Effect of selection. A binomial test indicates that our selection method was chosen more often than a random strategy . It was chosen more often than expertise compliment . And finally, it was chosen more often than the oracle human-generated questions . We conclude that our selection model outperforms all existing baselines.
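Each of these pairwise comparisons reduces to a binomial test on preference counts; the sketch below shows the computation with scipy, using placeholder counts.

```python
from scipy.stats import binomtest

# Placeholder counts: how often workers preferred the selection model's
# augmentation over the alternative in a pairwise comparison.
preferred_ours, total_pairs = 620, 1000

# Null hypothesis: both options are equally likely to be preferred (p = 0.5).
result = binomtest(preferred_ours, total_pairs, p=0.5, alternative="greater")
print(f"preference rate = {preferred_ours / total_pairs:.2f}, "
      f"p = {result.pvalue:.3g}")
```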

Qualitative analysis. Figure 14(a) shows that the agent can choose to focus on different aspects of the image even when the subject of the image is roughly the same: old traditional buildings. In one, the agent compliments the statue, which is the most salient feature of the old European building shown in the image. In the other, it shows appreciation for the overall architecture of the old Asian building, which does not have a single defining feature like a statue.

Figure 14(b) shows two images that both contain water and have similar color composition. In one, the agent compliments the water seen on the beach as refreshing and, in the other, the fish seen underwater as cute. Referring to a fish in the beach photo would have been incorrect, as would describing the water as refreshing in the underwater photo.

Though social strategies are useful, they can also lead to new errors. Figure 14(c, d) showcases example questions where the agent fails to recognize mountains and food and generates phrases referring to beaches and flowers.

3.5 Discussion

Intended use. This work demonstrates that it is possible to train an AI agent to use social strategies found in human-to-human interaction to increase the likelihood that a person responds to a crowdsourcing request. Such responses suggest a future in which supervised ML models can be trained on authentic online data provided by willing helpers rather than by paid workers. We expect that such strategies can lead to adaptive ML systems that learn during their deployment by asking their users whenever they are uncertain about their environment. Unlike existing paid crowdsourcing techniques, whose cost grows linearly with the number of annotations, our method is a fixed-cost solution: social strategy data needs to be collected once for a specific domain and can then be deployed to encourage volunteers.

Negative usage. It is also important that we pause to note the potential negative implications of computing research, and how they can be addressed. The psychology techniques that our work relies on have been used in negotiations and marketing campaigns for decades. Automating such techniques can also lead to influencing emotions or behavior at a magnitude far greater than any single human-human interaction Kramer et al. (2014); Ferrara et al. (2016). For this reason, when using natural language techniques, we advocate that agents continue to self-identify as bots. There is a need for online communities to establish a standard for acceptable use of such techniques and for how contributors should be informed about the intentions behind an agent’s request.

Limitations and future work. Our social strategies are inspired by social psychology research; they are by no means an exhaustive list of possible strategies. Future research could follow a more “bottom-up” approach of directly learning to emulate strategies by observing human-human interactions. Currently, our requests involve exactly one dialogue turn, and we do not yet explore multi-turn conversations. This can be important: for example, the answer attempt strategy may be more effective at getting an answer now, but might also decrease the probability that the contributor will want to continue cooperating in the long term. Future work can explore how to guide conversations to enable more complex labeling schemes.

Conclusion Our work: (1) identifies social strategies that can be repurposed to improve crowdsourcing requests for visual question answering, (2) trains and deploys machine learning and rule-based models that apply these strategies to increase crowdsourcing participation, and (3) demonstrates that these models significantly improve participation on Instagram, that no single strategy is optimal, and that a selection model can choose the appropriate strategy.

4 Model evaluation using human perception

Generating realistic images is regarded as a focal task for measuring the progress of generative models. Automated metrics are either heuristic approximations Rössler et al. (2019); Salimans et al. (2016); Denton et al. (2015); Karras et al. (2018); Brock et al. (2018); Radford et al. (2015) or intractable density estimations, shown to be inaccurate on high-dimensional problems Hinton (2002); Bishop (2006); Theis et al. (2015). Human evaluations, such as those given on Amazon Mechanical Turk Rössler et al. (2019); Denton et al. (2015), remain ad-hoc because “results change drastically” Salimans et al. (2016) based on details of the task design Liu et al. (2016); Le et al. (2010); Kittur et al. (2008). With both noisy automated and noisy human benchmarks, measuring progress over time has become akin to hill-climbing on noise. Even widely used metrics, such as Inception Score Salimans et al. (2016) and Fréchet Inception Distance Heusel et al. (2017), have been discredited for their application to non-ImageNet datasets Barratt and Sharma (2018); Rosca et al. (2017); Borji (2018); Ravuri et al. (2018). Thus, to monitor progress, generative models need a systematic gold standard benchmark. In this paper, we introduce a gold standard benchmark for realistic generation, demonstrating its effectiveness across four datasets, six models, and two sampling techniques, and using it to assess the progress of generative models over time.

Figure 15: Our human evaluation metric, HYPE, consistently distinguishes models from each other: here, we compare different generative models’ performance on FFHQ. A score of represents results indistinguishable from real, while a score above represents hyper-realism.

Realizing the constraints of available automated metrics, many generative modeling tasks resort to human evaluation and visual inspection Rössler et al. (2019); Salimans et al. (2016); Denton et al. (2015). These human measures are (1) ad-hoc, each executed idiosyncratically without proof of reliability or grounding to theory, and (2) high variance in their estimates Salimans et al. (2016); Denton et al. (2015); Olsson et al. (2018). These characteristics combine into a lack of reliability, and downstream, (3) a lack of clear separability between models. Theoretically, given sufficiently large sample sizes of human evaluators and model outputs, the law of large numbers would smooth out the variance and reach eventual convergence; but this would occur at (4) a high cost and a long delay.

We present HYPE (Human eYe Perceptual Evaluation) to address these criteria in turn. HYPE: (1) measures the perceptual realism of generative model outputs via a grounded method inspired by psychophysics methods in perceptual psychology, (2) is a reliable and consistent estimator, (3) is statistically separable to enable a comparative ranking, and (4) ensures a cost and time efficient method through modern crowdsourcing techniques such as training and aggregation. We present two methods of evaluation. The first, called , is inspired directly by the psychophysics literature Klein (2001); Cornsweet (1962), and displays images using adaptive time constraints to determine the time-limited perceptual threshold a person needs to distinguish real from fake. The score is understood as the minimum time, in milliseconds, that a person needs to see the model’s output before they can distinguish it as real or fake. For example, a score of ms on indicates that humans can distinguish model outputs from real images at ms exposure times or longer, but not under ms. The second method, called , is derived from the first to make it simpler, faster, and cheaper while maintaining reliability. It is interpretable as the rate at which people mistake fake images and real images, given unlimited time to make their decisions. A score of on means that people differentiate generated results from real data at chance rate, while a score above represents hyper-realism in which generated images appear more real than real images.

We run two large-scale experiments. First, we demonstrate HYPE’s performance on unconditional human face generation using four popular generative adversarial networks (GANs) Gulrajani et al. (2017); Berthelot et al. (2017); Karras et al. (2017, 2018) across  Liu et al. (2015). We also evaluate two newer GANs Miyato et al. (2018); Brock et al. (2018) on  Karras et al. (2018). HYPE indicates that GANs have clear, measurable perceptual differences between them; this ranking is identical in both and . The best performing model, StyleGAN trained on FFHQ and sampled with the truncation trick, only performs at , suggesting substantial opportunity for improvement. We can reliably reproduce these results with confidence intervals using human evaluators at in a task that takes minutes.

Second, we demonstrate the performance of beyond faces on conditional generation of five object classes in ImageNet Deng et al. (2009) and unconditional generation of  Krizhevsky and Hinton (2009). Early GANs such as BEGAN are not separable in when generating : none of them produce convincing results to humans, verifying that this is a harder task than face generation. The newer StyleGAN shows separable improvement, indicating progress over the previous models. With , GANs have improved on classes considered “easier” to generate (e.g., lemons), but resulted in consistently low scores across all models for harder classes (e.g., French horns).

HYPE is a rapid solution for researchers to measure their generative models, requiring just a single click to produce reliable scores and measure progress. We deploy HYPE at https://hype.stanford.edu, where researchers can upload a model and retrieve a HYPE score. Future work will extend HYPE to additional generative tasks such as text, music, and video generation.

4.1 Hype: A benchmark for Human eYe Perceptual Evaluation

HYPE displays a series of images one by one to crowdsourced evaluators on Amazon Mechanical Turk and asks the evaluators to assess whether each image is real or fake. Half of the images are real images, drawn from the model’s training set (e.g., FFHQ, CelebA, ImageNet, or CIFAR-10). The other half are drawn from the model’s output. We use modern crowdsourcing training and quality control techniques Mitra et al. (2015) to ensure high-quality labels. Model creators can choose to perform two different evaluations: , which gathers time-limited perceptual thresholds to measure the psychometric function and report the minimum time people need to make accurate classifications, and , a simplified approach which assesses people’s error rate under no time constraint.

: Perceptual fidelity grounded in psychophysics Our first method, , measures time-limited perceptual thresholds. It is rooted in the psychophysics literature, a field devoted to the study of how humans perceive stimuli, and evaluates the time thresholds at which humans can perceive an image. Our evaluation protocol follows the procedure known as the adaptive staircase method (Figure 17) Cornsweet (1962). An image is flashed for a limited length of time, after which the evaluator is asked to judge whether it is real or fake. If the evaluator consistently answers correctly, the staircase descends and flashes the next image with less time. If the evaluator is incorrect, the staircase ascends and provides more time.

Figure 16: Example images sampled with the truncation trick from StyleGAN trained on FFHQ. Images on the right exhibit the highest scores, the highest human perceptual fidelity.


Figure 17: The adaptive staircase method shows images to evaluators at different time exposures, decreasing when correct and increasing when incorrect. The modal exposure measures their perceptual threshold.

This process requires sufficient iterations to converge to the evaluator’s perceptual threshold: the shortest exposure time at which they can maintain effective performance Cornsweet (1962); Greene and Oliva (2009); Fei-Fei et al. (2007). The process produces what is known as the psychometric function Wichmann and Hill (2001), the relationship of timed stimulus exposure to accuracy. For example, for an easily distinguishable set of generated images, a human evaluator would immediately drop to the lowest millisecond exposure.

displays three blocks of staircases for each evaluator. An image evaluation begins with a 3-2-1 countdown clock, each number displaying for ms Krishna et al. (2016). The sampled image is then displayed for the current exposure time. Immediately after each image, four perceptual mask images are rapidly displayed for ms each. These noise masks are distorted to prevent retinal afterimages and further sensory processing after the image disappears Greene and Oliva (2009). We generate masks using an existing texture-synthesis algorithm Portilla and Simoncelli (2000). Upon each submission, reveals to the evaluator whether they were correct.

Image exposures are in the range [ms, ms], derived from the perception literature Fraisse (1984). All blocks begin at ms and last for images (% generated, % real), values empirically tuned from prior work Cornsweet (1962); Dakin and Omigie (2009). Exposure times are raised at ms increments and reduced at ms decrements, following the -up/-down adaptive staircase approach, which theoretically leads to a accuracy threshold that approximates the human perceptual threshold Levitt (1971); Greene and Oliva (2009); Cornsweet (1962).
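A minimal sketch of the staircase loop is shown below; the exposure bounds, starting exposure, and step sizes are placeholders standing in for the parameters described above, and the toy evaluator stands in for a human judgment.

```python
import random
from collections import Counter

def run_staircase(judge, images, start_ms=500, min_ms=100, max_ms=1000,
                  step_down=10, step_up=30):
    """Adaptive staircase sketch: shorten the exposure after a correct answer,
    lengthen it after a mistake, and report the modal exposure as the
    evaluator's threshold. All numeric parameters are placeholders."""
    exposure, history = start_ms, []
    for image in images:
        history.append(exposure)
        if judge(image, exposure):                   # evaluator was correct
            exposure = max(min_ms, exposure - step_down)
        else:                                        # mistake: give more time
            exposure = min(max_ms, exposure + step_up)
    return Counter(history).most_common(1)[0][0]     # modal exposure time

def toy_judge(image, exposure_ms):
    """Reliable above 250 ms, at chance below it (stand-in for a human)."""
    return exposure_ms >= 250 or random.random() < 0.5

print("modal exposure (ms):", run_staircase(toy_judge, images=range(150)))
```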

Every evaluator completes multiple staircases, called blocks, on different sets of images. As a result, we observe multiple measures for the model. We employ three blocks, to balance quality estimates against evaluators’ fatigue Krueger (1989); Rzeszotarski et al. (2013); Hata et al. (2017). We average the modal exposure times across blocks to calculate a final value for each evaluator. Higher scores indicate a better model, whose outputs take longer time exposures to discern from real.

: Cost-effective approximation Building on the previous method, we introduce : a simpler, faster, and cheaper method after ablating to optimize for speed, cost, and ease of interpretation. shifts from a measure of perceptual time to a measure of human deception rate, given infinite evaluation time. The score gauges total error on a task of 50 fake and 50 real images (we explicitly reveal this ratio to evaluators; Amazon Mechanical Turk forums would enable evaluators to discuss and learn about this distribution over time, thus altering how different evaluators approach the task, so making the ratio explicit gives all evaluators the same prior entering the task). This enables the measure to capture errors on both fake and real images, as well as the effects of hyper-realistic generation when fake images look even more realistic than real images (hyper-realism is relative to the real dataset on which a model is trained; some datasets already look less realistic because of lower resolution and/or lower diversity of images). requires fewer images than to find a stable value, empirically producing a x reduction in time and cost ( minutes per evaluator instead of minutes, at the same rate of per hour). Higher scores are again better: indicates that only of images deceive people, whereas indicates that people are mistaking real and fake images at chance, rendering fake images indistinguishable from real. Scores above suggest hyperrealistic images, as evaluators mistake images at a rate greater than chance.

shows each evaluator a total of images: real and fake. We calculate the proportion of images that were judged incorrectly, and aggregate the judgments over the evaluators on images to produce the final score for a given model.
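Computing the score from the collected judgments is a simple aggregation; the sketch below assumes boolean arrays of judgments and ground truth, with a toy evaluator guessing at chance.

```python
import numpy as np

def hype_infinity(judged_fake, is_fake):
    """Deception rate over a balanced set of real and fake images.

    judged_fake: boolean array, True where the evaluator labeled the image fake.
    is_fake:     boolean array, True for generated images, False for real ones.
    Returns the percentage of images judged incorrectly: 50% means fakes are
    indistinguishable from real, and scores above 50% suggest hyper-realism.
    """
    errors = np.asarray(judged_fake) != np.asarray(is_fake)
    return 100.0 * errors.mean()

# Toy example: one evaluator, 50 real + 50 fake images, guessing at chance.
is_fake = np.array([False] * 50 + [True] * 50)
judged_fake = np.random.rand(100) < 0.5
print(f"HYPE-infinity score: {hype_infinity(judged_fake, is_fake):.1f}%")
```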

4.2 Consistent and reliable design

To ensure that our reported scores are consistent and reliable, we need to sample sufficiently from the model as well as hire, qualify, and appropriately pay enough evaluators.

Sampling sufficient model outputs. The selection of images to evaluate from a particular model is a critical component of a fair and useful evaluation. We must sample a number of images large enough to fully capture a model’s generative diversity, yet balance that against tractable evaluation costs. We follow existing work on evaluating generative output by sampling generated images from each model Salimans et al. (2016); Miyato et al. (2018); Warde-Farley and Bengio (2016) and real images from the training set. From these samples, we randomly select images to give to each evaluator.

Quality of evaluators. To obtain a high-quality pool of evaluators, each is required to pass a qualification task. Such a pre-task filtering approach, sometimes referred to as a person-oriented strategy, is known to outperform process-oriented strategies that perform post-task data filtering or processing Mitra et al. (2015). Our qualification task displays images ( real and fake) with no time limits. Evaluators must correctly classify of both real and fake images. This threshold should be treated as a hyperparameter and may change depending upon the GANs used in the tutorial and the desired discernment ability of the chosen evaluators. We choose based on the cumulative binomial probability of 65 binary choice answers out of 100 total answers: there is only a one in one-thousand chance that an evaluator will qualify by random guessing. Unlike in the task itself, fake qualification images are drawn equally from multiple different GANs to ensure an equitable qualification across all GANs. The qualification is designed to be taken occasionally, such that a pool of evaluators can assess new models on demand.
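The qualification threshold can be sanity-checked with the cumulative binomial probability mentioned above; the sketch below computes the chance that random guessing passes, assuming 100 qualification images split evenly between real and fake.

```python
from scipy.stats import binom

# Chance that pure guessing (p = 0.5) answers at least 65 of 100
# qualification images correctly, i.e., passes the threshold by luck.
p_joint = binom.sf(64, 100, 0.5)          # P(X >= 65), X ~ Binomial(100, 0.5)

# The qualification applies the threshold to real and fake images separately;
# assuming a 50/50 split, guessing must clear 65% on each half independently.
p_separate = binom.sf(32, 50, 0.5) ** 2   # P(X >= 33)^2, X ~ Binomial(50, 0.5)

print(f"P(pass jointly by guessing)      = {p_joint:.2e}")
print(f"P(pass on both halves by chance) = {p_separate:.2e}")
```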

Payment. Evaluators are paid a base rate of for working on the qualification task. To incentivize evaluators to remain engaged throughout the task, all further pay after the qualification comes from a bonus of per correctly labeled image, typically totaling a wage of /hr.

4.3 Experimental setup

Datasets. We evaluate on four datasets. (1)  Liu et al. (2015) is a popular dataset for unconditional image generation with k images of human faces, which we align and crop to px. (2)  Karras et al. (2018) is a newer face dataset with k images of size px. (3) consists of k images, sized px, across classes. (4) is a subset of classes with k images at px from the ImageNet dataset Deng et al. (2009), which have been previously identified as easy (lemon, Samoyed, library) and hard (baseball player, French horn) Brock et al. (2018).

Architectures. We evaluate on four state-of-the-art models trained on and : StyleGAN Karras et al. (2018), ProGAN Karras et al. (2017), BEGAN Berthelot et al. (2017), and WGAN-GP Gulrajani et al. (2017). We also evaluate on two models, SN-GAN Miyato et al. (2018) and BigGAN Brock et al. (2018) trained on ImageNet, sampling conditionally on each class in . We sample BigGAN with ( Brock et al. (2018)) and without the truncation trick.

We also evaluate on StyleGAN Karras et al. (2018) trained on with ( Karras et al. (2018)) and without truncation trick sampling. For parity on our best models across datasets, StyleGAN instances trained on and are also sampled with the truncation trick.

We sample noise vectors from the -dimensional spherical Gaussian noise prior during training and test times. We specifically opted to use the same standard noise prior for comparison, yet are aware of other priors that optimize for FID and IS scores Brock et al. (2018). We select training hyperparameters published in the corresponding papers for each model.

Evaluator recruitment. We recruit evaluators from Amazon Mechanical Turk, or 30 for each run of HYPE. To maintain a between-subjects study in this evaluation, we recruit independent evaluators across tasks and methods.

Metrics. For , we report the modal perceptual threshold in milliseconds. For , we report the error rate as a percentage of images, as well as the breakdown of this rate on real and fake images separately. To show that our results for each model are separable, we report a one-way ANOVA with Tukey pairwise post-hoc tests to compare all models.

Reliability is a critical component of HYPE, as a benchmark is not useful if a researcher receives a different score when rerunning it. We use bootstrapping Felsenstein (1985), repeated resampling from the empirical label distribution, to measure variation in scores across multiple samples with replacement from a set of labels. We report bootstrapped confidence intervals (CIs), along with the standard deviation of the bootstrap sample distribution, by randomly sampling evaluators with replacement from the original set of evaluators across iterations.
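A sketch of this bootstrap procedure is given below; the per-evaluator scores and the number of resampling iterations are placeholders.

```python
import numpy as np

def bootstrap_ci(per_evaluator_scores, n_iterations=10_000, ci=95, seed=0):
    """Resample evaluators with replacement and report the standard deviation
    and confidence interval of the resulting score distribution."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_evaluator_scores, dtype=float)
    means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_iterations)
    ])
    lo, hi = np.percentile(means, [(100 - ci) / 2, 100 - (100 - ci) / 2])
    return means.std(), (lo, hi)

# Toy example: per-evaluator scores (%) from 30 evaluators of one model.
scores = np.random.normal(loc=40, scale=5, size=30)
std, (lo, hi) = bootstrap_ci(scores)
print(f"bootstrap std = {std:.2f}, 95% CI = [{lo:.1f}, {hi:.1f}]")
```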

Rank GAN (ms) Std. 95% CI
1 363.2 32.1 300.0 – 424.3
2 240.7 29.9 184.7 – 302.7
Table 3: on and trained on .

Experiment 1: We run two large-scale experiments to validate HYPE. The first focuses on the controlled evaluation and comparison of against on established human face datasets. We recorded a total of (models × evaluators per model × responses per evaluator) = k responses for our evaluation and (models × evaluators per model × responses per evaluator) = k responses for our evaluation.

Experiment 2: The second experiment evaluates on general image datasets. We recorded (models × evaluators per model × responses per evaluator) = k total responses.

4.4 Experiment 1: and on human faces

We report results on and demonstrate that the results of approximate those from at a fraction of the cost and time.

We find that resulted in the highest score (modal exposure time), at a mean of ms, indicating that evaluators required nearly a half-second of exposure to accurately classify images (Table 3). is followed by ProGAN at ms, a drop in time. BEGAN and WGAN-GP are both easily identifiable as fake, tied in last place around the minimum available exposure time of ms. Both BEGAN and WGAN-GP exhibit a bottoming out effect — reaching the minimum time exposure of ms quickly and consistently.

To demonstrate separability between models, we report results from a one-way analysis of variance (ANOVA) test, where each model’s input is the list of modal exposure times from its evaluators. The ANOVA results confirm that there is a statistically significant omnibus difference (). Pairwise post-hoc analysis using Tukey tests confirms that all pairs of models are separable (all ) except BEGAN and WGAN-GP ().
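The separability analysis can be reproduced with standard statistical tooling; the sketch below uses scipy and statsmodels on placeholder per-evaluator modal exposure times (the model names come from the architectures listed above; the numbers do not).

```python
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Placeholder modal exposure times (ms) from 30 evaluators per model.
rng = np.random.default_rng(0)
modes = {
    "StyleGAN": rng.normal(350, 30, 30),
    "ProGAN":   rng.normal(250, 30, 30),
    "BEGAN":    rng.normal(110, 10, 30),
    "WGAN-GP":  rng.normal(105, 10, 30),
}

# Omnibus test: is there any difference among the models?
f_stat, p_value = f_oneway(*modes.values())
print(f"one-way ANOVA: F = {f_stat:.1f}, p = {p_value:.2g}")

# Pairwise post-hoc Tukey test: which specific pairs are separable?
values = np.concatenate(list(modes.values()))
labels = np.repeat(list(modes.keys()), [len(v) for v in modes.values()])
print(pairwise_tukeyhsd(values, labels))
```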

. We find that resulted in a higher exposure time than , at ms and ms, respectively (Table 3). While the confidence intervals represent a very conservative overlap of ms, an unpaired t-test confirms that the difference between the two models is significant ().

. Table 4 reports results for on . We find that resulted in the highest score, fooling evaluators of the time. It is followed by ProGAN at , BEGAN at , and WGAN-GP at . No confidence intervals overlap and the ANOVA test is significant (). Pairwise post-hoc Tukey tests show that all pairs of models are separable (all ). Notably, yields separable results for BEGAN and WGAN-GP, unlike , where they were not separable due to the bottoming-out effect.

Rank GAN (%) Fakes Error Reals Error Std. 95% CI KID FID Precision
1 50.7% 62.2% 39.3% 1.3 48.2 – 53.1 0.005 131.7 0.982
2 ProGAN 40.3% 46.2% 34.4% 0.9 38.5 – 42.0 0.001 2.5 0.990
3 BEGAN 10.0% 6.2% 13.8% 1.6 7.2 – 13.3 0.056 67.7 0.326
4 WGAN-GP 3.8% 1.7% 5.9% 0.6 3.2 – 5.7 0.046 43.6 0.654
Table 4: on four GANs trained on . Counterintuitively, real errors increase with the errors on fake images, because evaluators become more confused and distinguishing factors between the two distributions become harder to discern.

. We observe a consistently separable difference between and , and clear delineations between models (Table 5). ranks () above () with no overlapping CIs. Separability is confirmed by an unpaired t-test ().

Rank GAN (%) Fakes Error Reals Error Std. 95% CI KID FID Precision
1 27.6% 28.4% 26.8% 2.4 22.9 – 32.4 0.007 13.8 0.976
2 19.0% 18.5% 19.5% 1.8 15.5 – 22.4 0.001 4.4 0.983
Table 5: on and trained on . Evaluators were deceived most often by . Similar to , fake errors and real errors track each other as the line between real and fake distributions blurs.

Cost tradeoffs with accuracy and time

Figure 18: Effect of more evaluators on CI.

One of HYPE’s goals is to be cost and time efficient. When running HYPE, there is an inherent tradeoff between accuracy and time, as well as between accuracy and cost. This is driven by the law of large numbers: recruiting additional evaluators in a crowdsourcing task often produces more consistent results, but at a higher cost (each evaluator is paid for their work) and a longer time to completion (more evaluators must be recruited, and each must complete their work).

To manage this tradeoff, we run an experiment with on . We completed an additional evaluation with evaluators and computed bootstrapped confidence intervals while varying the number of sampled evaluators from to (Figure 18). We see that the CI begins to converge around evaluators, our recommended number of evaluators to recruit.
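One way to reproduce this analysis, sketched under the assumption that each evaluator contributes a single aggregate score: for each candidate panel size k, bootstrap k evaluators with replacement and track how the width of the 95% CI shrinks as k grows.

```python
import numpy as np

def ci_width_vs_evaluators(evaluator_scores, max_k=None, n_iter=2000, seed=0):
    """For each panel size k, bootstrap k evaluators with replacement and
    report the width of the 95% CI of the mean score."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(evaluator_scores, dtype=float)
    max_k = max_k or len(scores)
    widths = {}
    for k in range(2, max_k + 1):
        means = np.array([
            scores[rng.integers(0, len(scores), size=k)].mean()
            for _ in range(n_iter)
        ])
        lo, hi = np.percentile(means, [2.5, 97.5])
        widths[k] = hi - lo
    # Choose the smallest k after which the width stops shrinking appreciably.
    return widths
```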

At evaluators, the cost of running on one model was approximately , while the cost of running on the same model was approximately . Payment per evaluator for both tasks was approximately /hr. Evaluators spent an average of one hour each on a task and minutes each on a task. Thus, achieves its goal of being significantly cheaper to run while maintaining consistency.

Comparison to automated metrics. As FID Heusel et al. (2017) is one of the most frequently used evaluation methods for unconditional image generation, it is imperative to compare HYPE against FID on the same models. We also compare to two newer automated metrics: KID Bińkowski et al. (2018), an unbiased estimator independent of sample size, and (precision) Sajjadi et al. (2018), which captures fidelity independently. We show through Spearman rank-order correlation coefficients that HYPE scores are not correlated with FID (), where a Spearman correlation of is ideal because lower FID and higher HYPE scores indicate stronger models. We therefore find that FID is not highly correlated with human judgment. Meanwhile, and exhibit strong correlation (), where is ideal because they are directly related. We calculate FID across the standard protocol of K generated and K real images for both and , reproducing scores for . KID () and precision () both show a statistically insignificant but medium level of correlation with humans.
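The correlation analysis itself is a single call to scipy; the sketch below uses the HYPE (%) and FID columns of Table 4 as example inputs.

```python
from scipy.stats import spearmanr

# HYPE (%) and FID columns from Table 4, one value per model.
hype_scores = [50.7, 40.3, 10.0, 3.8]
fid_scores = [131.7, 2.5, 67.7, 43.6]

rho, p_value = spearmanr(hype_scores, fid_scores)
# rho near -1 would mean FID tracks human judgment (lower FID, higher HYPE);
# rho near 0 means it does not.
```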

during model training. HYPE can also be used to evaluate progress during model training. We find that scores increased as StyleGAN training progressed, from at k epochs, to at k epochs, to at k epochs ().

4.5 Experiment 2: beyond faces

We now turn to another popular image generation task: objects. As Experiment 1 showed to be an efficient and cost-effective variant of , here we focus exclusively on .

We evaluate conditional image generation on five ImageNet classes (Table 6). We also report FID Heusel et al. (2017), KID Bińkowski et al. (2018), and (precision) Sajjadi et al. (2018) scores. To evaluate the relative effectiveness of the three GANs within each object class, we compute five one-way ANOVAs, one for each object class. We find that the scores are separable for images from three easy classes: samoyeds (dogs) (), lemons (), and libraries (). Pairwise post-hoc Tukey tests reveal that this difference is only significant between SN-GAN and the two BigGAN variants. We also observe that models have unequal strengths, e.g., SN-GAN is better suited to generating libraries than samoyeds.

Comparison to automated metrics. Spearman rank-order correlation coefficients on all three GANs across all five classes show a low to moderate correlation between the scores and both KID () and FID (), and a negligible correlation with precision (). Some correlation for our task is expected, as these metrics use pretrained ImageNet embeddings to measure differences between generated and real data.

Interestingly, we find that this correlation depends upon the GAN: considering only SN-GAN, we find stronger coefficients for KID (), FID (), and precision (). Considering only BigGAN, we find far weaker coefficients for KID (), FID (), and precision (). This illustrates an important flaw with these automatic metrics: how well they correlate with human judgment depends upon the generative model being evaluated, varying by model and by task.

GAN Class (%) Fakes Error Reals Error Std. 95% CI KID FID Precision
Easy classes:
 Lemon 18.4% 21.9% 14.9% 2.3 14.2–23.1 0.043 94.22 0.784
 Lemon 20.2% 22.2% 18.1% 2.2 16.0–24.8 0.036 87.54 0.774
SN-GAN Lemon 12.0% 10.8% 13.3% 1.6 9.0–15.3 0.053 117.90 0.656
 Samoyed 19.9% 23.5% 16.2% 2.6 15.0–25.1 0.027 56.94 0.794
 Samoyed 19.7% 23.2% 16.1% 2.2 15.5–24.1 0.014 46.14 0.906
SN-GAN Samoyed 5.8% 3.4% 8.2% 0.9 4.1–7.8 0.046 88.68 0.785
 Library 17.4% 22.0% 12.8% 2.1 13.3–21.6 0.049 98.45 0.695
 Library 22.9% 28.1% 17.6% 2.1 18.9–27.2 0.029 78.49 0.814
SN-GAN Library 13.6% 15.1% 12.1% 1.9 10.0–17.5 0.043 94.89 0.814
Hard classes:
 French Horn 7.3% 9.0% 5.5% 1.8 4.0–11.2 0.031 78.21 0.732
 French Horn 6.9% 8.6% 5.2% 1.4 4.3–9.9 0.042 96.18 0.757
SN-GAN French Horn 3.6% 5.0% 2.2% 1.0 1.8–5.9 0.156 196.12 0.674
 Baseball Player 1.9% 1.9% 1.9% 0.7 0.8–3.5 0.049 91.31 0.853
 Baseball Player 2.2% 3.3% 1.2% 0.6 1.3–3.5 0.026 76.71 0.838
SN-GAN Baseball Player 2.8% 3.6% 1.9% 1.5 0.8–6.2 0.052 105.82 0.785
Table 6: on three models trained on ImageNet and conditionally sampled on five classes. BigGAN routinely outperforms SN-GAN. and are not separable.
GAN (%) Fakes Error Reals Error Std. 95% CI KID FID Precision
23.3% 28.2% 18.5% 1.6 20.1–26.4 0.005 62.9 0.982
PROGAN 14.8% 18.5% 11.0% 1.6 11.9–18.0 0.001 53.2 0.990
BEGAN 14.5% 14.6% 14.5% 1.7 11.3–18.1 0.056 96.2 0.326
WGAN-GP 13.2% 15.3% 11.1% 2.3 9.1–18.1 0.046 104.0 0.654
Table 7: Four models on CIFAR-10. can generate realistic images from .

For the difficult task of unconditional generation on , we use the same four model architectures as in Experiment 1: . Table 7 shows that was able to separate from the earlier BEGAN, WGAN-GP, and ProGAN, indicating that StyleGAN is the first among them to make human-perceptible progress on unconditional object generation with .

Comparison to automated metrics. Spearman rank-order correlation coefficients on all four GANs show medium, yet statistically insignificant, correlations with KID (), FID (), and precision ().

4.6 Related work

Cognitive psychology. We leverage decades of cognitive psychology to motivate how we use stimulus timing to gauge the perceptual realism of generated images. It takes an average of ms of focused visual attention for people to process and interpret an image, but only ms to respond to faces, because our inferotemporal cortex has dedicated neural resources for face detection Rayner et al. (2009); Chellappa et al. (2010). Perceptual masks are placed between a person’s response to a stimulus and their perception of it to eliminate post-processing of the stimuli after the desired time exposure Sperling (1963). Prior work in determining human perceptual thresholds Greene and Oliva (2009) generates masks from their test images using the texture-synthesis algorithm Portilla and Simoncelli (2000). We leverage this literature to establish feasible lower bounds on the exposure time of images, the time between images, and the use of noise masks.

Success of automatic metrics. Common generative modeling tasks include realistic image generation Goodfellow et al. (2014), machine translation Bahdanau et al. (2014), image captioning Vinyals et al. (2015), and abstract summarization Mani (1999), among others. These tasks often resort to automatic metrics like the Inception Score (IS) Salimans et al. (2016) and Fréchet Inception Distance (FID) Heusel et al. (2017) to evaluate images and BLEU Papineni et al. (2002), CIDEr Vedantam et al. (2015) and METEOR Banerjee and Lavie (2005) scores to evaluate text. While we focus on how realistic generated content appears, other automatic metrics also measure diversity of output, overfitting, entanglement, training stability, and computational and sample efficiency of the model Borji (2018); Lucic et al. (2018); Barratt and Sharma (2018). Our metric may also capture one aspect of output diversity, insofar as human evaluators can detect similarities or patterns across images. Our evaluation is not meant to replace existing methods but to complement them.

Limitations of automatic metrics. Prior work has asserted a coarse correlation between human judgment and FID Heusel et al. (2017) and IS Salimans et al. (2016), leading to their widespread adoption. Both metrics depend on the Inception-v3 network Szegedy et al. (2016), a pretrained ImageNet model, to calculate statistics on the generated output (for IS) and on the real and generated distributions (for FID). The validity of these metrics when applied to other datasets has been repeatedly called into question Barratt and Sharma (2018); Rosca et al. (2017); Borji (2018); Ravuri et al. (2018). Perturbations imperceptible to humans can alter their values, similar to the behavior of adversarial examples Kurakin et al. (2016). Finally, similar to our metric, FID depends on a set of real examples and a set of generated examples to compute high-level differences between the distributions, and there is inherent variance in the metric depending on the number of images and which images were chosen. In fact, there is a correlation between accuracy and compute budget when improving FID scores, because spending more time, and thus more money, on computation yields better FID scores Lucic et al. (2018). Nevertheless, this cost is still lower than that of paid human annotators per image.

Human evaluations. Many human-based evaluations have been attempted with varying degrees of success in prior work, either to evaluate models directly Denton et al. (2015); Olsson et al. (2018) or to motivate the use of automated metrics Salimans et al. (2016); Heusel et al. (2017). Prior work also used people to evaluate GAN outputs on CIFAR-10 and MNIST, and even provided immediate feedback after every judgment Salimans et al. (2016). They found that generated MNIST samples have saturated human performance, i.e., people cannot distinguish generated digits from real MNIST digits, while still finding a error rate on CIFAR-10 with the same model Salimans et al. (2016). This suggests that different datasets will have different levels of complexity for crossing realistic or hyper-realistic thresholds. The closest recent work to ours compares models using a tournament of discriminators Olsson et al. (2018). However, this comparison was not rigorously evaluated with humans, nor were human discriminators presented experimentally. The framework we present would enable such a tournament evaluation to be performed reliably and easily.

4.7 Discussion

Envisioned Use. We created HYPE as a turnkey solution for human evaluation of generative models. Researchers can upload their model, receive a score, and compare progress via our online deployment. During periods of high usage, such as competitions, a retainer model Bernstein et al. (2011) enables evaluation using in minutes, instead of the default minutes.

Limitations. Extensions of HYPE may require different task designs. For text generation (translation, caption generation), will require much longer and higher-range adjustments to the perceptual time thresholds Krishna et al. (2017a); Weld et al. (2015). Beyond realism, metrics such as diversity, overfitting, entanglement, training stability, and computational and sample efficiency are additional benchmarks that could be incorporated but are outside the scope of this paper. Some may be better suited to fully automated evaluation Borji (2018); Lucic et al. (2018). Similar to related work in evaluating text generation Hashimoto et al. (2019), we suggest that diversity can be incorporated using the automated recall score, which measures diversity independently from precision Sajjadi et al. (2018).

Conclusion. HYPE provides two human evaluation benchmarks for generative models that (1) are grounded in psychophysics, (2) provide task designs that produce reliable results, (3) separate model performance, and (4) are cost and time efficient. We introduce two benchmarks: , which uses perceptual time thresholds, and , which reports the error rate without time constraints. We demonstrate the efficacy of our approach on image generation across six models {StyleGAN, SN-GAN, BigGAN, ProGAN, BEGAN, WGAN-GP}, four image datasets {, , , }, and two sampling methods {with, without the truncation trick}.

5 Conclusion

Popular culture has long depicted vision as a primary interaction modality between people and machines; vision is a necessary sensing capability for humanoid robots such as C-3PO from “Star Wars”, Wall-E from the eponymous film, and even disembodied Artificial Intelligence such as Samantha the smart operating system from the movie “Her”. These fictional machines paint a potential real future where machines can tap into the expressive range of non-intrusive information that Computer Vision affords. Our expressions, gestures, and relative position to objects carry a wealth of information that intelligent interactive machines can use, enabling new applications in domains such as healthcare Haque et al. (2020), sustainability Jean et al. (2016), human-interpretable actions Dragan et al. (2013), and mixed-initiative interactions Horvitz (1999).

While Human-Computer Interaction (HCI) researchers have long discussed and debated what human-AI interaction should look like Shneiderman and Maes (1997); Horvitz (1999), we have rarely provided concrete, immediately operational goals to Computer Vision researchers. Instead, we’ve largely left this job up to the vision community itself, which has produced a variety of immediately operational tasks to work on. These tasks play an important role in the AI community; some of them ultimately channel the efforts of thousands of AI researchers and set the direction of progress for years to come. The tasks range from object recognition Deng et al. (2009), to scene understanding Krishna et al. (2017b), to explainable AI Adadi and Berrada (2018), to interactive robot training Thomaz and Breazeal (2008). And while many such tasks have been worthwhile endeavors, we often find that the models they produce don’t work in practice or don’t fit end-users’ needs as hoped Mitchell et al. (2019); Buolamwini and Gebru (2018).

If the tasks that guide the work of thousands of AI researchers do not reflect the HCI community’s understanding of how humans can best interact with AI-powered systems, then the resulting AI-powered systems will not reflect it either. We therefore believe there is an important opportunity for HCI and Computer Vision researchers to begin closing this gap by collaborating to directly integrate HCI’s insights and goals into immediately actionable vision tasks, model designs, data collection protocols, and evaluation schemes. One such example of this type of work is the HYPE benchmark mentioned earlier in this chapter Zhou et al. (2019), which aimed to push GAN researchers to focus directly on a high-quality measurement of human perception when creating their models. Another is the approach taken by the social strategies project mentioned earlier in this chapter Park et al. (2019), which aimed to push data collection protocols to consider social interventions designed to elicit volunteer contributions.

What kind of tasks might HCI researchers work to create? First, explainable AI aims to help people understand how computer vision models work, but its methods are often developed without real consideration of how humans will ultimately use explanations to interact with these models. HCI researchers might propose design choices, grounded in human-subjects experiments, for how to introduce and explain vision models Khadpe et al. (2020); Buçinca et al. (2020). Second, perceptual robots can learn to complete new tasks by incorporating human rewards, but current approaches do not consider how people actually want to provide feedback to robots Thomaz and Breazeal (2008). If we want deployed robots to incorporate an understanding of how humans want to give feedback, then HCI researchers might propose new training paradigms with ecologically valid human interactions. Third, multi-agent vision systems Jain et al. (2020) are developed in ways that ignore key aspects of the human psyche, such as our tendency to choose non-optimal behaviors, despite foundational work in HCI noting the perils of such assumptions in AI planning Suchman (1987). Without incorporating human behavioral priors, these multi-agent systems work well when AI agents collaborate with one another but fail when one of the agents is replaced by a human Carroll et al. (2019). If we want multi-agent vision systems that understand the biases people have when performing actions, then HCI researchers might propose human-AI collaboration tasks and benchmarks in which agents are forced to encounter realistic human actors (indeed, non-vision work has begun to move in this direction Kwon et al. (2020)).

Acknowledgement. The first project was supported by the National Science Foundation award 1351131. The second project was partially funded by the Brown Institute of Media Innovation and by Toyota Research Institute (“TRI”). The third project was partially funded by a Junglee Corporation Stanford Graduate Fellowship, an Alfred P. Sloan fellowship and by TRI. This chapter solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.

Bibliography

  • [1] A. Adadi and M. Berrada (2018) Peeking inside the black-box: a survey on explainable artificial intelligence (xai). IEEE Access 6, pp. 52138–52160. Cited by: §5.
  • [2] V. Ambati, S. Vogel, and J. Carbonell (2011) Towards task recommendation in micro-task markets. External Links: Link Cited by: §3.
  • [3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh (2015) Vqa: visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433. Cited by: §3.3.
  • [4] D. Bahdanau, K. Cho, and Y. Bengio (2014) Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. Cited by: §4.6.
  • [5] S. Banerjee and A. Lavie (2005) METEOR: an automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pp. 65–72. Cited by: §4.6.
  • [6] S. Barratt and R. Sharma (2018) A note on the inception score. arXiv preprint arXiv:1801.01973. Cited by: §4.6, §4.6, §4.
  • [7] M. S. Bernstein, J. Brandt, R. C. Miller, and D. R. Karger (2011) Crowds in two seconds: enabling realtime crowd-powered interfaces. In Proceedings of the 24th annual ACM symposium on User interface software and technology, pp. 33–42. Cited by: §2.1, §4.7.
  • [8] M. S. Bernstein, G. Little, R. C. Miller, B. Hartmann, M. S. Ackerman, D. R. Karger, D. Crowell, and K. Panovich (2010) Soylent: a word processor with a crowd inside. In Proceedings of the 23rd annual ACM symposium on User interface software and technology, pp. 313–322. Cited by: §2.2.
  • [9] D. Berthelot, T. Schumm, and L. Metz (2017) BEGAN: boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717. Cited by: §4.3, §4.
  • [10] J. P. Bigham, C. Jayant, H. Ji, G. Little, A. Miller, R. C. Miller, R. Miller, A. Tatarowicz, B. White, S. White, et al. (2010) VizWiz: nearly real-time answers to visual questions. In Proceedings of the 23rd annual ACM symposium on User interface software and technology, pp. 333–342. Cited by: §3.1, §3.
  • [11] M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton (2018) Demystifying mmd gans. arXiv preprint arXiv:1801.01401. Cited by: §4.4, §4.5.
  • [12] C. M. Bishop (2006) Pattern recognition and machine learning. springer. Cited by: §4.
  • [13] A. Biswas and D. Parikh (2013) Simultaneous active learning of classifiers & attributes via relative feedback. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pp. 644–651. Cited by: §2.8.
  • [14] D. Bohus and A. I. Rudnicky (2009) The ravenclaw dialog management framework: architecture and systems. Computer Speech & Language 23 (3), pp. 332–361. Cited by: §3.1.
  • [15] A. Borji (2018) Pros and cons of gan evaluation measures. Computer Vision and Image Understanding. Cited by: §4.6, §4.6, §4.7, §4.
  • [16] E. L. Brady, Y. Zhong, M. R. Morris, and J. P. Bigham (2013) Investigating the appropriateness of social network question asking as a resource for blind users. In Proceedings of the 2013 conference on Computer supported cooperative work, pp. 1225–1236. Cited by: §3, §3.
  • [17] E. Brady, M. R. Morris, and J. P. Bigham (2015) Gauging receptiveness to social microvolunteering. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, CHI ’15, New York, NY, USA, pp. 1055–1064. External Links: ISBN 978-1-4503-3145-6, Link, Document Cited by: §3.
  • [18] J. Bragg, M. Daniel, and D. S. Weld (2013) Crowdsourcing multi-label classification for taxonomy creation. In First AAAI conference on human computation and crowdsourcing, Cited by: §2.1.
  • [19] S. Branson, K. E. Hjorleifsson, and P. Perona (2014) Active annotation translation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 3702–3709. Cited by: §2.1.
  • [20] S. Branson, C. Wah, F. Schroff, B. Babenko, P. Welinder, P. Perona, and S. Belongie (2010) Visual recognition with humans in the loop. In Computer Vision–ECCV 2010, pp. 438–451. Cited by: §2.1.
  • [21] D. E. Broadbent and M. H. Broadbent (1987) From detection to identification: response to multiple targets in rapid serial visual presentation. Perception & psychophysics 42 (2), pp. 105–113. Cited by: §2.2.
  • [22] A. Brock, J. Donahue, and K. Simonyan (2018) Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096. Cited by: §4.3, §4.3, §4.3, §4, §4.
  • [23] Z. Buçinca, P. Lin, K. Z. Gajos, and E. L. Glassman (2020) Proxy tasks and subjective measures can be misleading in evaluating explainable ai systems. In Proceedings of the 25th International Conference on Intelligent User Interfaces, pp. 454–464. Cited by: §5.
  • [24] J. Buolamwini and T. Gebru (2018) Gender shades: intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pp. 77–91. Cited by: §5.
  • [25] M. Burke, R. Kraut, and E. Joyce (2014) Membership claims and requests: some newcomer socialization strategies in online communities. Small Group Research. Cited by: §1, §3.1.
  • [26] M. Burke and R. Kraut (2013) Using facebook after losing a job: differential benefits of strong and weak ties. In Proceedings of the 2013 conference on Computer supported cooperative work, pp. 1419–1430. Cited by: §2.
  • [27] S. K. Card, A. Newell, and T. P. Moran (1983) The psychology of human-computer interaction. Cited by: §2.4.
  • [28] M. Carroll, R. Shah, M. K. Ho, T. Griffiths, S. Seshia, P. Abbeel, and A. Dragan (2019) On the utility of learning about humans for human-ai coordination. In Advances in Neural Information Processing Systems, pp. 5174–5185. Cited by: §5.
  • [29] J. Cassell and K. R. Thórisson (1999) The power of a nod and a glance: envelope vs. emotional feedback in animated conversational agents. Applied Artificial Intelligence 13, pp. 519–538. Cited by: §3.1.
  • [30] L. Cerrato and S. Ekeklint (2002) Different ways of ending human-machine dialogues. Cited by: §3.1.
  • [31] S. Chaiken (1989) Heuristic and systematic information processing within and beyond the persuasion context. Unintended thought, pp. 212–252. Cited by: §3.1.
  • [32] R. Chellappa, P. Sinha, and P. J. Phillips (2010) Face recognition by computers and humans. Computer 43 (2), pp. 46–55. Cited by: §4.6.
  • [33] J. Cheng, J. Teevan, and M. S. Bernstein (2015) Measuring crowdsourcing effort with error-time curves. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 1365–1374. Cited by: §2.2, §2.2.
  • [34] V. Chidambaram, Y. Chiang, and B. Mutlu (2012) Designing persuasive robots: how robots might persuade people using vocal and nonverbal cues. In Proceedings of the seventh annual ACM/IEEE international conference on Human-Robot Interaction, pp. 293–300. Cited by: §3.1.
  • [35] L. B. Chilton, G. Little, D. Edge, D. S. Weld, and J. A. Landay (2013) Cascade: crowdsourcing taxonomy creation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1999–2008. Cited by: §2.1.
  • [36] R. Cialdini (2016) Pre-suasion: a revolutionary way to influence and persuade. Simon and Schuster. Cited by: item 3, §3, §3.
  • [37] L. Colligan, H. W. Potts, C. T. Finn, and R. A. Sinkin (2015) Cognitive workload changes for nurses transitioning from a legacy system with paper documentation to a commercial electronic health record. International journal of medical informatics 84 (7), pp. 469–476. Cited by: §2.6.
  • [38] T. N. Cornsweet (1962) The staircase-method in psychophysics. Cited by: §4.1, §4.1, §4.1, §4.
  • [39] K. Corti and A. Gillespie (2016) Co-constructing intersubjectivity with artificial conversational agents: people are more likely to initiate repairs of misunderstandings with agents represented as human. Computers in Human Behavior 58, pp. 431 – 442. External Links: ISSN 0747-5632, Document, Link Cited by: §3.1.
  • [40] S. C. Dakin and D. Omigie (2009) Psychophysical evidence for a non-linear representation of facial identity. Vision research 49 (18), pp. 2285–2296. Cited by: §4.1.
  • [41] N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), Vol. 1, pp. 886–893. Cited by: §1.
  • [42] J. M. Darley and B. Latané (1968) Bystander intervention in emergencies: diffusion of responsibility.. Journal of personality and social psychology 8 (4p1), pp. 377. Cited by: §1, §3.1.
  • [43] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1, §1, §2.7, §2, §2, §3, §3, §4.3, §4, §5.
  • [44] J. Deng, O. Russakovsky, J. Krause, M. S. Bernstein, A. Berg, and L. Fei-Fei (2014) Scalable multi-label annotation. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 3099–3102. Cited by: §2.1, §2.1, §2.8, §2.8.
  • [45] E. L. Denton, S. Chintala, R. Fergus, et al. (2015) Deep generative image models using a laplacian pyramid of adversarial networks. In Advances in neural information processing systems, pp. 1486–1494. Cited by: §1, §4.6, §4, §4.
  • [46] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3.3.
  • [47] D. E. Difallah, G. Demartini, and P. Cudré-Mauroux (2013) Pick-a-crowd: tell me what you like, and i’ll tell you what to do. In Proceedings of the 22Nd International Conference on World Wide Web, WWW ’13, New York, NY, USA, pp. 367–374. External Links: ISBN 978-1-4503-2035-1, Link, Document Cited by: §3.
  • [48] A. D. Dragan, K. C. Lee, and S. S. Srinivasa (2013) Legibility and predictability of robot motion. In 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pp. 301–308. Cited by: §5.
  • [49] E. Fast, B. Chen, J. Mendelsohn, J. Bassen, and M. S. Bernstein (2018) Iris: a conversational agent for complex tasks. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 473. Cited by: §3.1.
  • [50] E. Fast, D. Steffee, L. Wang, J. R. Brandt, and M. S. Bernstein (2014) Emergent, crowd-scale programming practice in the ide. In Proceedings of the 32nd annual ACM conference on Human factors in computing systems, pp. 2491–2500. Cited by: §2.
  • [51] L. Fei-Fei, A. Iyer, C. Koch, and P. Perona (2007) What do we perceive in a glance of a real-world scene?. Journal of vision 7 (1), pp. 10–10. Cited by: §2.1, §2, §4.1.
  • [52] J. Felsenstein (1985) Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39 (4), pp. 783–791. Cited by: §4.3.
  • [53] E. Ferrara, O. Varol, C. Davis, F. Menczer, and A. Flammini (2016) The rise of social bots. Communications of the ACM 59 (7), pp. 96–104. Cited by: §3.5.
  • [54] P. Fraisse (1984) Perception and estimation of time. Annual review of psychology 35 (1), pp. 1–37. Cited by: §4.1.
  • [55] D. Geiger and M. Schader (2014) Personalized task recommendation in crowdsourcing information systems — current state of the art. Decision Support Systems 65, pp. 3 – 16. Note: Crowdsourcing and Social Networks Analysis External Links: ISSN 0167-9236, Document, Link Cited by: §3.
  • [56] E. Gilbert and K. Karahalios (2009) Predicting tie strength with social media. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 211–220. Cited by: §2.
  • [57] G. Gillund and R. M. Shiffrin (1984) A retrieval model for both recognition and recall.. Psychological review 91 (1), pp. 1. Cited by: item 5.
  • [58] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 580–587. Cited by: §2.
  • [59] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §4.6.
  • [60] M. Gray and S. Suri (2019) Ghost work: how to stop silicon valley from building a new global underclass. Eamon Dolan. External Links: ISBN 1328566242 Cited by: §3.1.
  • [61] M. R. Greene and A. Oliva (2009) The briefest of glances: the time course of natural scene understanding. Psychological Science 20 (4), pp. 464–472. Cited by: §4.1, §4.1, §4.1, §4.6.
  • [62] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pp. 5767–5777. Cited by: §4.3, §4.
  • [63] A. Haque, A. Milstein, and L. Fei-Fei (2020) Illuminating the dark spaces of healthcare with ambient intelligence. Nature 585 (7824), pp. 193–202. Cited by: §5.
  • [64] T. B. Hashimoto, H. Zhang, and P. Liang (2019) Unifying human and statistical evaluation for natural language generation. arXiv preprint arXiv:1904.02792. Cited by: §4.7.
  • [65] K. Hata, R. Krishna, L. Fei-Fei, and M. S. Bernstein (2017) A glimpse far into the future: understanding long-term crowd worker quality. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, pp. 889–901. Cited by: §4.1.
  • [66] K. Healy and A. Schussman (2003) The ecology of open-source software development. Technical report, University of Arizona, USA. Cited by: §3.1, §3.
  • [67] J. Hempel (2015) Facebook launches m, its bold answer to siri and cortana. Wired. Retrieved January 1, pp. 2017. Cited by: §3.1.
  • [68] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637. Cited by: §4.4, §4.5, §4.6, §4.6, §4.6, §4.
  • [69] B. M. Hill (2013) Almost wikipedia: eight early encyclopedia projects and the mechanisms of collective action. Massachusetts Institute of Technology, pp. 1–38. Cited by: §3.1, §3.
  • [70] G. E. Hinton (2002) Training products of experts by minimizing contrastive divergence. Neural computation 14 (8), pp. 1771–1800. Cited by: §4.
  • [71] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §3.3.
  • [72] M. L. Hoffman (1981) Is altruism part of human nature?. Journal of Personality and social Psychology 40 (1), pp. 121. Cited by: item 7, §3.
  • [73] E. Horvitz (1999) Principles of mixed-initiative user interfaces. In Proceedings of the SIGCHI conference on Human Factors in Computing Systems, pp. 159–166. Cited by: §5, §5.
  • [74] F. Huang and J. F. Canny (2019) Sketchforme: composing sketched scenes from text descriptions for interactive applications. In Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology, pp. 209–220. Cited by: §1.
  • [75] T. K. Huang, J. Chang, and J. Bigham (2018) Evorus: a crowd-powered conversational assistant built to automate itself over time. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 295. Cited by: §3.1.
  • [76] C. J. Hutto and E. Gilbert (2014) Vader: a parsimonious rule-based model for sentiment analysis of social media text. In Eighth international AAAI conference on weblogs and social media, Cited by: §3.3, §3.3.
  • [77] M. C. Iordan, M. R. Greene, D. M. Beck, and L. Fei-Fei (2015) Basic level category structure emerges gradually across human ventral visual cortex. Journal of cognitive neuroscience. Cited by: §2.9.
  • [78] P. G. Ipeirotis, F. Provost, and J. Wang (2010) Quality management on amazon mechanical turk. In Proceedings of the ACM SIGKDD workshop on human computation, pp. 64–67. Cited by: §2.1.
  • [79] P. G. Ipeirotis (2010) Analyzing the amazon mechanical turk marketplace. XRDS: Crossroads, The ACM Magazine for Students 17 (2), pp. 16–21. Cited by: §2.
  • [80] L. C. Irani and M. Silberman (2013) Turkopticon: interrupting worker invisibility in amazon mechanical turk. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 611–620. Cited by: §2.2, §2.
  • [81] S. D. Jain and K. Grauman (2013) Predicting sufficient annotation strength for interactive foreground segmentation. In Computer Vision (ICCV), 2013 IEEE International Conference on, pp. 1313–1320. Cited by: §2.1.
  • [82] U. Jain, L. Weihs, E. Kolve, A. Farhadi, S. Lazebnik, A. Kembhavi, and A. Schwing (2020) A cordial sync: going beyond marginal policies for multi-agent embodied tasks. In European Conference on Computer Vision, pp. 471–490. Cited by: §5.
  • [83] N. Jean, M. Burke, M. Xie, W. M. Davis, D. B. Lobell, and S. Ermon (2016) Combining satellite imagery and machine learning to predict poverty. Science 353 (6301), pp. 790–794. Cited by: §5.
  • [84] T. Josephy, M. Lease, and P. Paritosh (2013) CrowdScale 2013: crowdsourcing at scale workshop report. Cited by: §1, §2.
  • [85] E. Kamar, S. Hacker, and E. Horvitz (2012) Combining human and machine intelligence in large-scale crowdsourcing. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 1, pp. 467–474. Cited by: §2.3.
  • [86] D. R. Karger, S. Oh, and D. Shah (2011) Budget-optimal crowdsourcing using low-rank matrix approximations. In Communication, Control, and Computing (Allerton), 2011 49th Annual Allerton Conference on, pp. 284–291. Cited by: §2.1.
  • [87] D. R. Karger, S. Oh, and D. Shah (2014) Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research 62 (1), pp. 1–24. Cited by: §2.1, §2.
  • [88] T. Karras, T. Aila, S. Laine, and J. Lehtinen (2017) Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196. Cited by: §4.3, §4.
  • [89] T. Karras, S. Laine, and T. Aila (2018) A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948. Cited by: §4.3, §4.3, §4.3, §4, §4.
  • [90] T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4401–4410. Cited by: §1.
  • [91] P. Khadpe, R. Krishna, L. Fei-Fei, J. T. Hancock, and M. S. Bernstein (2020) Conceptual metaphors impact perceptions of human-ai collaboration. Proceedings of the ACM on Human-Computer Interaction 4 (CSCW2), pp. 1–26. Cited by: §5.
  • [92] A. Kittur, E. H. Chi, and B. Suh (2008) Crowdsourcing user studies with mechanical turk. In Proceedings of the SIGCHI conference on human factors in computing systems, pp. 453–456. Cited by: §2, §4.
  • [93] S. A. Klein (2001) Measuring, estimating, and understanding the psychometric function: a commentary. Perception & psychophysics 63 (8), pp. 1421–1455. Cited by: §4.
  • [94] A. D. Kramer, J. E. Guillory, and J. T. Hancock (2014) Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences 111 (24), pp. 8788–8790. Cited by: §3.5.
  • [95] R. E. Kraut and P. Resnick (2011) Encouraging contribution to online communities. Building successful online communities: Evidence-based social design, pp. 21–76. Cited by: §1, §3.1, §3.
  • [96] R. A. Krishna, K. Hata, S. Chen, J. Kravitz, D. A. Shamma, L. Fei-Fei, and M. S. Bernstein (2016) Embracing error to enable rapid crowdsourcing. In Proceedings of the 2016 CHI conference on human factors in computing systems, pp. 3167–3179. Cited by: §1, §1, §4.1.
  • [97] R. Krishna, M. Bernstein, and L. Fei-Fei (2019) Information maximizing visual question generation. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3.3.
  • [98] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles (2017) Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 706–715. Cited by: §4.7.
  • [99] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li, D. A. Shamma, et al. (2017) Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73. Cited by: §1, §5.
  • [100] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §4.
  • [101] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1, §2.
  • [102] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105. Cited by: §3.3.
  • [103] G. P. Krueger (1989) Sustained work, fatigue, sleep loss and performance: a review of the issues. Work & Stress 3 (2), pp. 129–141. Cited by: §4.1.
  • [104] R. Kumar, A. Satyanarayan, C. Torres, M. Lim, S. Ahmad, S. R. Klemmer, and J. O. Talton (2013) Webzeitgeist: design mining the web. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 3083–3092. Cited by: §2.
  • [105] A. Kurakin, I. Goodfellow, and S. Bengio (2016) Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533. Cited by: §4.6.
  • [106] M. Kwon, E. Biyik, A. Talati, K. Bhasin, D. P. Losey, and D. Sadigh (2020) When humans aren’t optimal: robots that collaborate with risk-aware humans. In Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction, pp. 43–52. Cited by: §5.
  • [107] M. Laielli, J. Smith, G. Biamby, T. Darrell, and B. Hartmann (2019) LabelAR: a spatial guidance interface for fast computer vision image collection. In Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology, pp. 987–998. Cited by: §1.
  • [108] E. J. Langer, A. Blank, and B. Chanowitz (1978) The mindlessness of ostensibly thoughtful action: the role of “placebic” information in interpersonal interaction. Journal of personality and social psychology 36 (6), pp. 635. Cited by: item 8, item 9, §3.
  • [109] G. Laput, W. S. Lasecki, J. Wiese, R. Xiao, J. P. Bigham, and C. Harrison (2015) Zensors: adaptive, rapidly deployable, human-intelligent sensor feeds. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 1935–1944. Cited by: §2.1.
  • [110] W. Lasecki, C. Miller, A. Sadilek, A. Abumoussa, D. Borrello, R. Kushalnagar, and J. Bigham (2012) Real-time captioning by groups of non-experts. In Proceedings of the 25th annual ACM symposium on User interface software and technology, pp. 23–34. Cited by: §2.1.
  • [111] W. S. Lasecki, K. I. Murray, S. White, R. C. Miller, and J. P. Bigham (2011) Real-time crowd control of existing interfaces. In Proceedings of the 24th annual ACM symposium on User interface software and technology, pp. 23–32. Cited by: §2.1.
  • [112] W. S. Lasecki, R. Wesley, J. Nichols, A. Kulkarni, J. F. Allen, and J. P. Bigham (2013) Chorus: a crowd-powered conversational assistant. In Proceedings of the 26th annual ACM symposium on User interface software and technology, pp. 151–162. Cited by: §3.1.
  • [113] E. Law, M. Yin, J. Goh, K. Chen, M. A. Terry, and K. Z. Gajos (2016) Curiosity killed the cat, but makes crowdwork better. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pp. 4098–4110. Cited by: §3.1, §3.
  • [114] J. Le, A. Edmonds, V. Hester, and L. Biewald (2010) Ensuring quality in crowdsourced search relevance evaluation: the effects of training question distribution. In SIGIR 2010 workshop on crowdsourcing for search evaluation, Vol. 2126, pp. 22–32. Cited by: §4.
  • [115] H. Levitt (1971) Transformed up-down methods in psychoacoustics. The Journal of the Acoustical society of America 49 (2B), pp. 467–477. Cited by: §4.1.
  • [116] D. D. Lewis and P. J. Hayes (1994-07) Guest editorial. ACM Transactions on Information Systems 12 (3), pp. 231. Cited by: §2.6.
  • [117] F. F. Li, R. VanRullen, C. Koch, and P. Perona (2002) Rapid natural scene categorization in the near absence of attention. Proceedings of the National Academy of Sciences 99 (14), pp. 9596–9601. Cited by: §2.1, §2.
  • [118] L. Li, W. Chu, J. Langford, and R. E. Schapire (2010) A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th international conference on World wide web, pp. 661–670. Cited by: §3.3, §3.
  • [119] T. Li and M. Ogihara (2003) Detecting emotion in music. In ISMIR, Vol. 3, pp. 239–240. Cited by: §2.
  • [120] L. Liang and K. Grauman (2014) Beyond comparing image pairs: setwise active learning for relative attributes. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 208–215. Cited by: §2.1.
  • [121] C. Lin, E. Kamar, and E. Horvitz (2014) Signals in the silence: models of implicit feedback in a recommendation system for crowdsourcing. External Links: Link Cited by: §3.
  • [122] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In Computer Vision–ECCV 2014, pp. 740–755. Cited by: §2, §2.
  • [123] C. J. Lintott, K. Schawinski, A. Slosar, K. Land, S. Bamford, D. Thomas, M. J. Raddick, R. C. Nichol, A. Szalay, D. Andreescu, et al. (2008) Galaxy zoo: morphologies derived from visual inspection of galaxies from the sloan digital sky survey. Monthly Notices of the Royal Astronomical Society 389 (3), pp. 1179–1189. Cited by: §3.1, §3.
  • [124] A. Liu, S. Soderland, J. Bragg, C. H. Lin, X. Ling, and D. S. Weld (2016) Effective crowd annotation for relation extraction. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 897–906. Cited by: §4.
  • [125] Z. Liu, P. Luo, X. Wang, and X. Tang (2015) Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV), Cited by: §4.3, §4.
  • [126] D. G. Lowe (1999) Object recognition from local scale-invariant features. In Proceedings of the seventh IEEE international conference on computer vision, Vol. 2, pp. 1150–1157. Cited by: §1.
  • [127] C. Lu, R. Krishna, M. Bernstein, and L. Fei-Fei (2016) Visual relationship detection with language priors. In European Conference on Computer Vision, pp. 852–869. Cited by: §3.
  • [128] M. Lucic, K. Kurach, M. Michalski, S. Gelly, and O. Bousquet (2018) Are gans created equal? a large-scale study. In Advances in neural information processing systems, pp. 698–707. Cited by: §4.6, §4.6, §4.7.
  • [129] I. Mani (1999) Advances in automatic text summarization. MIT press. Cited by: §4.6.
  • [130] A. Marcus and A. Parameswaran (2015) Crowdsourced data management: industry and academic perspectives. Foundations and Trends in Databases. Cited by: §2.
  • [131] P. M. Markey (2000) Bystander intervention in computer-mediated communication. Computers in Human Behavior 16 (2), pp. 183–188. Cited by: §1, §3.1.
  • [132] D. Martin, B. V. Hanrahan, J. O’Neill, and N. Gupta (2014) Being a turker. In Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing, pp. 224–235. Cited by: §2.2, §2.
  • [133] W. Mason and S. Suri (2012) Conducting behavioral research on amazon’s mechanical turk. Behavior research methods 44 (1), pp. 1–23. Cited by: §2.
  • [134] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) Nerf: representing scenes as neural radiance fields for view synthesis. arXiv preprint arXiv:2003.08934. Cited by: §1.
  • [135] G. A. Miller and W. G. Charles (1991) Contextual correlates of semantic similarity. Language and cognitive processes 6 (1), pp. 1–28. Cited by: §2.
  • [136] M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru (2019) Model cards for model reporting. In Proceedings of the conference on fairness, accountability, and transparency, pp. 220–229. Cited by: §5.
  • [137] T. Mitra, C. J. Hutto, and E. Gilbert (2015) Comparing person-and process-centric strategies for obtaining quality data on amazon mechanical turk. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 1345–1354. Cited by: §4.1, §4.2.
  • [138] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957. Cited by: §4.2, §4.3, §4.
  • [139] C. Nass and S. Brave (2007) Wired for speech: how voice activates and advances the human-computer relationship. The MIT Press. External Links: ISBN 0262640651 Cited by: §3.1.
  • [140] J. C. Niebles, H. Wang, and L. Fei-Fei (2008) Unsupervised learning of human action categories using spatial-temporal words. International journal of computer vision 79 (3), pp. 299–318. Cited by: §3.
  • [141] C. Olsson, S. Bhupatiraju, T. Brown, A. Odena, and I. Goodfellow (2018) Skill rating for generative models. arXiv preprint arXiv:1808.04888. Cited by: §1, §4.6, §4.
  • [142] B. Pang and L. Lee (2008) Opinion mining and sentiment analysis. Foundations and trends in information retrieval 2 (1-2), pp. 1–135. Cited by: §2.6, §2.
  • [143] K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §4.6.
  • [144] J. Park, R. Krishna, P. Khadpe, L. Fei-Fei, and M. Bernstein (2019) AI-based request augmentation to increase crowdsourcing participation. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, Vol. 7, pp. 115–124. Cited by: §1, §1, §5.
  • [145] A. Parkash and D. Parikh (2012) Attributes for classifier feedback. In Computer Vision–ECCV 2012, pp. 354–368. Cited by: §2.1, §2.8.
  • [146] M. D. Peng Dai and S. Weld (2010) Decision-theoretic control of crowd-sourced workflows. In In the 24th AAAI Conference on Artificial Intelligence (AAAI’10, Cited by: §2.1, §2.
  • [147] J. Portilla and E. P. Simoncelli (2000) A parametric texture model based on joint statistics of complex wavelet coefficients. International journal of computer vision 40 (1), pp. 49–70. Cited by: §4.1, §4.6.
  • [148] M. C. Potter and E. I. Levy (1969) Recognition memory for a rapid sequence of pictures.. Journal of experimental psychology 81 (1), pp. 10. Cited by: §2.1.
  • [149] M. C. Potter (1976) Short-term conceptual memory for pictures.. Journal of experimental psychology: human learning and memory 2 (5), pp. 509. Cited by: §2.1.
  • [150] A. Radford, L. Metz, and S. Chintala (2015) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434. Cited by: §4.
  • [151] S. Ravuri, S. Mohamed, M. Rosca, and O. Vinyals (2018) Learning implicit generative models with the method of learned moments. arXiv preprint arXiv:1806.11006. Cited by: §4.6, §4.
  • [152] K. Rayner, T. J. Smith, G. L. Malcolm, and J. M. Henderson (2009) Eye movements and visual encoding during scene perception. Psychological science 20 (1), pp. 6–10. Cited by: §4.6.
  • [153] A. Reeves and G. Sperling (1986) Attention gating in short-term visual memory.. Psychological review 93 (2), pp. 180. Cited by: §2.4.
  • [154] B. Reeves and C. I. Nass (1996) The media equation: how people treat computers, television, and new media like real people and places.. Cambridge university press. Cited by: §3.1.
  • [155] J. Reich, R. Murnane, and J. Willett (2012) The state of wiki usage in us k–12 schools: leveraging web 2.0 data warehouses to assess quality and equity in online learning environments. Educational Researcher 41 (1), pp. 7–15. Cited by: §3.1, §3.
  • [156] C. Robert (1984) Influence: the psychology of persuasion. William Morrow and Company, Nowy Jork. Cited by: item 1, item 2, item 6, §3, §3.
  • [157] M. Rosca, B. Lakshminarayanan, D. Warde-Farley, and S. Mohamed (2017) Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987. Cited by: §4.6, §4.
  • [158] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner (2019) FaceForensics++: learning to detect manipulated facial images. arXiv preprint arXiv:1901.08971. Cited by: §4, §4.
  • [159] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F. Li (2014) Imagenet large scale visual recognition challenge. International Journal of Computer Vision, pp. 1–42. Cited by: §1, §2.1.
  • [160] O. Russakovsky, L. Li, and L. Fei-Fei (2015) Best of both worlds: human-machine collaboration for object annotation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2121–2131. Cited by: §2.
  • [161] J. M. Rzeszotarski, E. Chi, P. Paritosh, and P. Dai (2013) Inserting micro-breaks into crowdsourcing workflows. In First AAAI Conference on Human Computation and Crowdsourcing, Cited by: §4.1.
  • [162] M. S. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, and S. Gelly (2018) Assessing generative models via precision and recall. In Advances in Neural Information Processing Systems, pp. 5228–5237. Cited by: §4.4, §4.5, §4.7.
  • [163] N. Salehi, L. C. Irani, and M. S. Bernstein (2015) We are dynamo: overcoming stalling and friction in collective action for crowd workers. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 1621–1630. Cited by: §2.5.
  • [164] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. In Advances in Neural Information Processing Systems, pp. 2234–2242. Cited by: §1, §4.2, §4.6, §4.6, §4.6, §4, §4.
  • [165] A. Sardar, M. Joosse, A. Weiss, and V. Evers (2012) Don’t stand so close to me: users’ attitudinal and behavioral responses to personal space invasion by robots. In Proceedings of the seventh annual ACM/IEEE international conference on Human-Robot Interaction, pp. 229–230. Cited by: §3.1.
  • [166] R. E. Schapire and Y. Singer (2000) BoosTexter: a boosting-based system for text categorization. Machine learning 39 (2), pp. 135–168. Cited by: §2.
  • [167] P. Seetharaman and B. Pardo (2014) Crowdsourcing a reverberation descriptor map. In Proceedings of the ACM International Conference on Multimedia, pp. 587–596. Cited by: §2.
  • [168] V. S. Sheng, F. Provost, and P. G. Ipeirotis (2008) Get another label? improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 614–622. Cited by: §2.1, §2.
  • [169] A. Sheshadri and M. Lease (2013) Square: a benchmark for research on computing crowd consensus. In First AAAI Conference on Human Computation and Crowdsourcing, Cited by: §2.5.
  • [170] B. Shneiderman and P. Maes (1997-11) Direct manipulation vs. interface agents. Interactions 4 (6), pp. 42–61. External Links: ISSN 1072-5520, Link, Document Cited by: §5.
  • [171] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. Cited by: §2.3.
  • [172] P. Smyth, M. C. Burl, U. M. Fayyad, and P. Perona (1994) Knowledge discovery in large image databases: dealing with uncertainties in ground truth.. In KDD Workshop, pp. 109–120. Cited by: §2.
  • [173] P. Smyth, U. Fayyad, M. Burl, P. Perona, and P. Baldi (1995) Inferring ground truth from subjective labelling of venus images. Cited by: §2.
  • [174] R. Snow, B. O’Connor, D. Jurafsky, and A. Y. Ng (2008) Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing, pp. 254–263. Cited by: §2.5, §2.6, §2.
  • [175] Z. Song, Q. Chen, Z. Huang, Y. Hua, and S. Yan (2011) Contextualizing object detection and classification. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pp. 1585–1592. Cited by: §2.1, §2.8.
  • [176] G. Sperling (1963) A model for visual memory tasks. Human factors 5 (1), pp. 19–31. Cited by: §4.6.
  • [177] H. Su, J. Deng, and L. Fei-Fei (2012) Crowdsourcing annotations for visual object detection. In Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence, Cited by: §2.1.
  • [178] L. A. Suchman (1987) Plans and situated actions: the problem of human-machine communication. Cambridge university press. Cited by: §5.
  • [179] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: §4.6.
  • [180] O. Tamuz, C. Liu, S. Belongie, O. Shamir, and A. T. Kalai (2011) Adaptively learning the crowd kernel. arXiv preprint arXiv:1105.1033. Cited by: §2.1.
  • [181] P. J. Taylor and S. Thomas (2008) Linguistic style matching and negotiation outcome. Negotiation and Conflict Management Research 1 (3), pp. 263–281. Cited by: item 4, §3.
  • [182] L. Theis, A. v. d. Oord, and M. Bethge (2015) A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844. Cited by: §4.
  • [183] A. L. Thomaz and C. Breazeal (2008) Teachable robots: understanding human teaching behavior to build more effective robot learners. Artificial Intelligence 172 (6-7), pp. 716–737. Cited by: §5.
  • [184] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L. Li (2016) YFCC100M: the new data in multimedia research. Communications of the ACM 59 (2). Cited by: §1, §2.
  • [185] R. Vedantam, C. Lawrence Zitnick, and D. Parikh (2015) Cider: consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4566–4575. Cited by: §4.6.
  • [186] S. Vijayanarasimhan, P. Jain, and K. Grauman (2010) Far-sighted active learning on a budget for image and video recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 3035–3042. Cited by: §2.1.
  • [187] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2014) Show and tell: a neural image caption generator. arXiv preprint arXiv:1411.4555. Cited by: §2.
  • [188] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164. Cited by: §4.6.
  • [189] L. von Ahn and L. Dabbish (2004) Labeling images with a computer game. In Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 319–326. Cited by: §3.
  • [190] L. von Ahn and L. Dabbish (2004) Labeling images with a computer game. In Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 319–326. Cited by: §3.1.
  • [191] C. Vondrick, D. Patterson, and D. Ramanan (2013) Efficiently scaling up crowdsourced video annotation. International Journal of Computer Vision 101 (1), pp. 184–204. Cited by: §2.1.
  • [192] C. Wah, S. Branson, P. Perona, and S. Belongie (2011) Multiclass recognition and part localization with humans in the loop. In Computer Vision (ICCV), 2011 IEEE International Conference on, pp. 2524–2531. Cited by: §2.1.
  • [193] C. Wah, G. Van Horn, S. Branson, S. Maji, P. Perona, and S. Belongie (2014) Similarity comparisons for interactive fine-grained categorization. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pp. 859–866. Cited by: §2.1.
  • [194] Y. Wang, R. E. Kraut, and J. M. Levine (2015) Eliciting and receiving online support: using computer-aided content analysis to examine the dynamics of online social support. Journal of Medical Internet Research 17 (4), pp. e99. Cited by: §1, §3.1.
  • [195] D. Warde-Farley and Y. Bengio (2016) Improving generative adversarial networks with denoising feature matching. Cited by: §4.2.
  • [196] M. Warncke-Wang, V. Ranjan, L. Terveen, and B. Hecht (2015) Misalignment between supply and demand of quality content in peer production communities. In Ninth International AAAI Conference on Web and Social Media, Cited by: §3.
  • [197] E. Weichselgartner and G. Sperling (1987) Dynamics of automatic and controlled visual attention. Science 238 (4828), pp. 778–780. Cited by: §2.4.
  • [198] D. S. Weld, C. H. Lin, and J. Bragg (2015) Artificial intelligence and collective intelligence. Handbook of Collective Intelligence, pp. 89–114. Cited by: §4.7.
  • [199] P. Welinder, S. Branson, P. Perona, and S. J. Belongie (2010) The multidimensional wisdom of crowds. In Advances in neural information processing systems, pp. 2424–2432. Cited by: §2.1.
  • [200] J. Whitehill, T. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo (2009) Whose vote should count more: optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems, pp. 2035–2043. Cited by: §2.1.
  • [201] F. A. Wichmann and N. J. Hill (2001) The psychometric function: I. Fitting, sampling, and goodness of fit. Perception & psychophysics 63 (8), pp. 1293–1313. Cited by: §4.1.
  • [202] C. G. Willis, E. Law, A. C. Williams, B. F. Franzone, R. Bernardos, L. Bruno, C. Hopkins, C. Schorn, E. Weber, D. S. Park, et al. (2017) CrowdCurio: an online crowdsourcing platform to facilitate climate change studies using herbarium specimens. New Phytologist 215 (1), pp. 479–488. Cited by: §3.
  • [203] J. O. Wobbrock, J. Forlizzi, S. E. Hudson, and B. A. Myers (2002) WebThumb: interaction techniques for small-screen browsers. In Proceedings of the 15th annual ACM symposium on User interface software and technology, pp. 205–208. Cited by: §2.1.
  • [204] H. Xia, J. Jacobs, and M. Agrawala (2020) Crosscast: adding visuals to audio travel podcasts. In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, pp. 735–746. Cited by: §1.
  • [205] D. Yang and R. E. Kraut (2017) Persuading teammates to give: systematic versus heuristic cues for soliciting loans. Proceedings of the ACM on Human-Computer Interaction 1 (CSCW), pp. 114:1–114:21. Cited by: §1, §3, §3.1.
  • [206] Y. Yue, Y. Yang, G. Ren, and W. Wang (2017) SceneCtrl: mixed reality enhancement via efficient scene editing. In Proceedings of the 30th Annual ACM Symposium on User Interface Software and Technology, pp. 427–436. Cited by: §1.
  • [207] H. Zhang, C. Sciutto, M. Agrawala, and K. Fatahalian (2020) Vid2player: controllable video sprites that behave and appear like professional tennis players. arXiv preprint arXiv:2008.04524. Cited by: §1.
  • [208] T. Zhang (2004) Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the twenty-first international conference on Machine learning, pp. 116. Cited by: §3.3.
  • [209] D. Zhou, S. Basu, Y. Mao, and J. C. Platt (2012) Learning from the wisdom of crowds by minimax entropy. In Advances in Neural Information Processing Systems, pp. 2195–2203. Cited by: §2.1.
  • [210] S. Zhou, M. Gordon, R. Krishna, A. Narcomey, L. F. Fei-Fei, and M. Bernstein (2019) HYPE: a benchmark for human eye perceptual evaluation of generative models. In Advances in Neural Information Processing Systems, pp. 3449–3461. Cited by: §1, §5.