Reputation Agent: Prompting Fair Reviews in Gig Markets

by Carlos Toxtli, et al.
West Virginia University

Our study presents a new tool, Reputation Agent, to promote fairer reviews from requesters (employers or customers) on gig markets. Unfair reviews, created when requesters consider factors outside of a worker's control, are known to plague gig workers and can result in lost job opportunities and even termination from the marketplace. Our tool leverages machine learning to implement an intelligent interface that: (1) uses deep learning to automatically detect when an individual has included unfair factors into her review (factors outside the worker's control per the policies of the market); and (2) prompts the individual to reconsider her review if she has incorporated unfair factors. To study the effectiveness of Reputation Agent, we conducted a controlled experiment over different gig markets. Our experiment illustrates that across markets, Reputation Agent, in contrast with traditional approaches, motivates requesters to review gig workers' performance more fairly. We discuss how tools that bring more transparency to employers about the policies of a gig market can help build empathy thus resulting in reasoned discussions around potential injustices towards workers generated by these interfaces. Our vision is that with tools that promote truth and transparency we can bring fairer treatment to gig workers.




1. Introduction

Figure 1. Overview of how Reputation Agent functions.

Gig marketplaces are online spaces where almost anyone can contract independent workers (e.g., freelancers) to conduct labor or deliver services in the form of short-term engagements (De Stefano, 2015). Gig markets facilitate transactions between strangers, as people typically have to hire and manage a crowd of workers they have never met (Todolí-Signes, 2017; Bederson and Quinn, 2011). Similarly, workers on these platforms often coordinate with other workers (Irani and Silberman, 2013) and offer services to requesters (customers or employers) who are all unfamiliar to them (Prassl, 2018). These direct interactions between strangers mean that gig markets must have mechanisms through which people can truthfully assess each other, since workers earn money by entrusting their hard labor to strangers (Uslaner, 1999; Qiu et al., 2018; Siberman and Harmon, 2017; Felstiner, 2011; McInnis et al., 2016). Among the most popular instruments for helping people to assess strangers and to choose whom to hire are reputation systems, which function by asking individuals to provide feedback on others’ work. These systems are generally based on the platform’s review metrics. For gig workers, reputation has become especially important because it is critical for accessing higher-paying jobs (Martin et al., 2014; Ma et al., 2018) or even staying employed (Gray and Suri, 2019).

In this context, it is important to understand that in the power dynamics of most gig markets, the platform takes the side of requesters (Todolí-Signes, 2017) or the platforms manipulate the market to the detriment of the worker (Calo and Rosenblat, 2017; Todolí-Signes, 2017). Therefore, if a requester invests time to write a lengthy complaint about a worker (even if the requester is incorrect), the market will side with the requester, potentially leading to unfair termination. For instance, the following advice is from an Uber driver for other gig drivers (Dough, ): “…[one of the main reasons for Uber to terminate a driver is that the] passenger makes a serious complaint about you [the driver]. If a passenger goes out of their way to tell Uber that you were rude, or that you’re a bad driver, or that you made them uncomfortable in any way, you can be immediately deactivated without prior notice. You aren’t likely to be reactivated after a major passenger complaint…”

This environment where workers have limited mechanisms to negotiate or even discuss reviews has led workers to distrust gig markets altogether (Gray and Suri, 2019; Horton and Golden, 2015). Thus, it is crucial to ensure that the reviews about workers are fair in order to improve trust and the general operation of gig markets (Beal and Strauss, 2009). Fairness within gig markets typically involves ensuring that the policies on the market are transparent, concise, and accessible to workers (Graham and Woodcock, 2018; Silberman and Metall, 2009). However, we argue that fairness is not just about empowering workers to understand the policies of the markets in which they participate; it is also about guiding the requester in the market to evaluate workers based on the market’s established policies (Fan et al., 2005).

Requesters should have a clear understanding of which metrics they should consider and which are inappropriate. Thus, it becomes critical to have mechanisms through which requesters can discriminate between the interactions and labor that workers are expected to control (according to the policies of the market) and those that were outside workers’ control (Dellarocas, 2005). The first type of interaction is known as “mission-critical” and the latter as “non-critical” (Xiong and Liu, 2003). Gig marketplaces have historically had difficulties in ensuring that requesters focus on evaluating mission-critical factors (Saenger et al., 2015; Gaikwad et al., 2015). To help, practitioners and researchers have started to investigate different interfaces for facilitating the generation of more mission-critical reviews (Filippas et al., 2017), some of which use drop-down lists to guide people on what metrics to focus on (Irani and Silberman, 2013). However, these interfaces rarely focus on guiding people on the written reviews. As a consequence, unjust written reviews continue to plague gig markets (Lu et al., 2010; Benson et al., 2019) and have resulted in problematic outcomes, such as termination of workers’ accounts, which can eliminate an important source of income (Rosenblat and Stark, 2016).

Seeing the need to motivate requesters to write fairer written reviews on gig markets, we present a novel tool, Reputation Agent, which is an intelligent web plugin that detects when requesters have written reviews that consider factors outside a worker’s control. In such cases, Reputation Agent prompts people to reflect and focus on the performance metrics that are actually within the worker’s sphere of control. We designed Reputation Agent as a web plugin to empower platform maintainers to easily integrate the tool into their existing gig markets without having to change any of their front-end interfaces. Fig. 1 presents an overview of how Reputation Agent functions. We conducted a study to evaluate how effective Reputation Agent was in prompting people to generate reviews that focus on metrics within workers’ control. In order to investigate our tool in depth, we chose various gig markets (Uber, Upwork, and Grubhub) and recruited 480 reviewers to evaluate gig workers across several scenarios. Across these different gig markets, reviewers working with Reputation Agent were motivated to focus significantly more on metrics that the worker could control instead of metrics outside of the worker’s scope.

Our paper contributes a new tool which leverages machine learning for fostering fairer written reviews about gig workers. Our design also provides a novel understanding of requesters who didn’t change their reviews for the following reasons: truthfulness, empathy, warning, and agency. Through the ease of implementing our plug-in tool and the understanding gained about why requesters don’t change their reviews, we hope that further studies can elaborate on the importance of integrating requesters’ decision-making process into their studies in order to achieve fairer reviews for workers. Our discussion: (i) focuses on how tools such as Reputation Agent can motivate requesters to write more accurate performance reviews in a manner that provides productive feedback for the crowd market community; and (ii) explores how tools like Reputation Agent could help to develop empathy and motivate reflections on the type of policies and agencies that people desire within a gig market. Our hope is that systems like Reputation Agent can initiate a future environment where workers operate in a fairer, more truthful space; an atmosphere in which all participants have a clearer understanding of the policies and labor conditions of the crowd market. We also hope that such systems will guide a future where gig workers no longer fear unjust termination.

2. Related Work

The design of Reputation Agent is based on two main areas: (1) tools for written reviews; and (2) reputation systems.

2.1. Gig Marketplaces

Gig markets bring new jobs to the marketplace (Broughton et al., 2018). However, due to the nontraditional nature of the gig economy, researchers still need to investigate criteria and tools that improve labor conditions and ensure a safe and fair working environment for gig workers (Berg et al., 2018; Siberman and Harmon, 2017; Hara et al., 2018; Hitlin, 2016; Horton, 2011; Irani and Silberman, 2013, 2016; Silberman and Metall, 2009; Benson et al., 2019; Kaplan et al., 2018; Harmon and Silberman, 2018). Gig markets rely heavily on reviews to help requesters identify which workers to hire and help workers ensure fair compensation (Kittur et al., 2013; Scholz, 2017). Our study focuses on this reliance on reviews, because bad reviews can pose an obstacle to workers: gig markets have been plagued with unfair reviews which contain inaccurate reputation signals about workers. These unfair reviews can ultimately limit workers’ future job opportunities and can also result in workers not getting paid or even being terminated from the marketplace. Unfair reviews are generally created because employers have a hard time differentiating the factors within the workers’ control from the ones that have little to do with their performance (e.g., when they complain about an Uber driver getting stuck in traffic). However, because market power is typically placed in the hands of employers, a bad worker review can result in the worker losing her entire livelihood. It is important to research how tools can be implemented to protect gig workers (Williams et al., 2019).

2.2. Tools for Written Reviews

Platforms for improving people’s written reviews can be divided into two main types: Interface or Artificial Intelligence based.

2.2.1. Interface Based.

Several interfaces have emerged that focus on driving people to provide better-written reviews about others. One set of these systems has focused on guiding better reviews within educational systems (Kulkarni et al., 2015; Cambre et al., 2018). Cook et al. (Cook et al., 2019) explored how the use of interfaces that have guiding questions can facilitate the generation of better reviews within project-based learning. In our research, we build on the ideas behind these systems to now imagine interfaces that guide requesters to write fairer reviews about their fellow workers.

2.2.2. Artificial Intelligence Based.

Another subset of related tools has focused on using artificial intelligence to help reviewers. The work of Krause et al. (Krause et al., 2017) explored how natural language models could be used to guide designers to provide higher quality reviews about the work of their peers (which was not necessarily fair). Inspired by these ideas, we explored with Reputation Agent how different language models could be used to now guide requesters to write fairer reviews. For Reputation Agent we also used language models to identify when a requester is writing a review that is unfair and to guide requesters to write reviews more focused on factors that workers controlled.

Some of the first intelligent tools around reviews were automated methods that inferred the expected reputation scores that people would input based on their written reviews (Alexandridis et al., 2019; Qu et al., 2010). Others developed sentiment analysis methods to detect the polarity (positive, negative or neutral) of reviews (Du et al., 2016; Collomb et al., 2014). Sentiment analysis has played an important role in improving the automated analysis and understanding of text reviews (Luiz et al., 2018; Guzman and Maalej, 2014; Elshenawy et al., 2016; Bartoli et al., 2016). Similarly, developments in deep learning algorithms have further facilitated the automated understanding and even categorization of marketplace reviews (Kokkodis, 2012; Kumar et al., 2017). Deep learning algorithms and other related methods have facilitated automatically detecting more complex metrics aside from sentiment, such as the expected level of helpfulness of a review (Tang et al., 2013), who was to blame for a car accident based on a car insurance report (Estival and Gayral, 1995), health risks in restaurants based on people’s reviews on Twitter (Sadilek et al., 2013), or biased Amazon reviews (Elmurngi and Gherbi, 2018). We use inspiration from these intelligent systems to envision how deep learning could be used to automatically flag unfair reviews.

2.2.3. Fairness In Crowd-Powered Text Reviews

Within the context of gig markets, fairness usually relates to the conditions of the workers laboring on these platforms (Fieseler et al., 2019). Graham et al. (Graham et al., 2019) recently created a framework to score gig markets based on how fair they are to workers. Some of the variables considered were whether the platform paid gig workers the minimum wage and ensured their health and safety at work. Other measures revolved around whether the contracts and policies were transparent, concise, and accessible to workers. Our goal with Reputation Agent, inspired by the latter point, was to facilitate mechanisms through which the policies of a gig market could be presented in a clearer, more conscientious manner. However, our focus was not just on presenting the policies to workers, but rather on facilitating an understanding of policies by requesters, who must judge workers and can, ultimately, have a lasting effect on their future job opportunities. Additionally, we focused on designing a tool that could be easily adapted to any gig market. We believe fairer marketplaces can be constructed by presenting the roles of workers more clearly to employers.

Ensuring fairness in performance evaluations is a common challenge across gig markets (Borromeo et al., 2017; Mehrotra et al., 2018). Unfair evaluations can come from individuals or groups (Allahbakhsh et al., 2014). Several systems have implemented different mechanisms to ensure that the evaluations that people generate about others are fair. However, most of these systems operate only at the score or metrics level (Schiffner et al., 2011) and generally do not take any action to correct “nasty” reviews. Leaving unfair textual reviews intact means that a review can continue to affect the person long after the interaction took place. With Reputation Agent we focus on addressing this problem by detecting unfair reviews and guiding employers to take action to correct them.

2.3. Reputation Systems

A reputation system is any platform that evaluates businesses or peers based on an algorithm or customer rankings, ratings, or written comments (Zacharia and Maes, 2000). The premise is to have parties rate each other which results in a score. This score should assist other parties in deciding whether or not to continue interacting with that party in the future (Jøsang et al., 2007). To operate effectively, reputation systems require at least three properties: long-lived entities that inspire an expectation of future interactions; capture and distribution of feedback about current interactions (information must be visible in the future); and use of feedback to guide trust decisions (Resnick et al., 2000).

The end goal of reputation systems is to strengthen the quality of markets and communities by providing an incentive for good behavior and quality services, and by sanctioning bad behavior and low-quality services (Jøsang and Golbeck, 2009). In order to achieve that goal, some reputation systems have implemented diverse workflows and validations. PowerTrust (Zhou and Hwang, 2007) takes the power-law distribution in user feedback to get a more accurate global reputation. Whitby et al. (Whitby et al., 2004) describe a statistical filtering technique for excluding unfair ratings via a Bayesian reputation system. Notice that prior work focused on improving score based reputations, while our work is based on the foundations of these systems to now assist gig markets in reducing the number of unfair written reputation signals.

2.3.1. Reputation Systems For Gig Markets

Within gig markets, reputation systems typically focus on evaluating the different actors involved in the market (workers, requesters and the platform itself) (Allahbakhsh et al., 2012). Reputation systems within the context of gig markets have become a key component for selecting the workers and clients with whom one will collaborate. A worker’s income positively correlates with higher reputation scores (Gandini et al., 2016; Horton and Golden, 2015). Therefore, “bad reviews” can affect workers’ access to employment and, overall, their livelihood. Thus, designing accessible tools can promote fairness and enable a shift in the power dynamics (Williams et al., 2019).

Different interfaces have been introduced to prompt and guide people to write reviews that better match the labor of workers, are potentially fairer, and maintain more accurate reputations on the marketplace. For instance, Gaikwad et al. (Gaikwad et al., 2015) developed Boomerang in Daemo Crowd Market, and explored interfaces that benefited requesters by sharing more accurate information about workers and penalizing requesters who shared inaccuracies. Such mechanisms might not only help workers to obtain better assessments of their work, but can also help to address the ballot stuffing problem (where people get too many positive reviews, and it thus becomes difficult to assess who is “good”). Our research is inspired by these prior mechanisms to drive fairer reviews and prevent assessments that may unfairly affect the reputation and even the income of gig workers.

3. Reputation Agent

We argue that a way to enable fairer reviews in gig markets is via systems that can present requesters with transparent policies that pertain to workers without interrupting their review writing process. This information should only be highlighted in cases when the system identifies that the reviewer has included unfair factors (i.e., factors, per the market’s policies, outside a worker’s control). For this purpose, our research explores: (1) machine learning techniques that detect when an individual is focusing her review on factors outside the worker’s sphere of control; and (2) interfaces that use that information to then prompt the person to reconsider her review in order to refocus on factors within the worker’s control. Reputation Agent has two main components: a ‘Smart Validator’, to detect elements of a review that include factors outside a worker’s control; and a ‘Fairness Promoter’, to guide people to focus their review on the factors that were within the worker’s control. Figure 1 shows how Reputation Agent enhances existing review forms with these two main parts.

Smart Validator. This component learns to detect when a review has factors outside the worker’s scope according to the policies of the market. The Smart Validator has an end-to-end workflow for training a machine learning model. The steps are:

A. PREPARE TRAINING DATA. This piece focuses on collecting real reviews about gig workers. It functions as a web crawler that collects data from websites, such as SiteJabber and ConsumerAffairs, that share real-world reviews about gig workers. Once the data is collected, the module focuses on labeling each review based on whether it focuses on mission-critical metrics or not (i.e., factors that the worker controlled or not). The labeling is done by analyzing the policies of each gig market.

B. TRAIN AND TEST INTELLIGENT MODEL. Given a set of labeled reviews (i.e., reviews that are labeled as to whether they are fair or unfair), Reputation Agent uses stratified sampling to split the labeled data into training, test, and validation sets under proportions of 80%, 10%, and 10% (the validation set helps to avoid overfitting). Using Python 3 and the Keras framework with TensorFlow, we trained eight models to learn to distinguish reviews that evaluate workers based on mission-critical metrics from those based on non-critical ones. Our goal was to identify the machine learning models which worked the best for different gig markets. For this purpose, we trained different machine learning models which used as feature vectors either word vectors or embeddings:

Word ngram + LR: Logistic regression with word ngrams.

Char ngram + LR: Logistic regression with character ngrams.

(Word + Char ngram) + LR: Logistic regression with word and character ngrams.

RNN no embedding: Recurrent neural network (bidirectional GRU) without pre-trained embeddings.

RNN + GloVe embedding: Recurrent neural network (bidirectional GRU) with GloVe pre-trained embeddings.

CNN (multi-channel): Multi-channel Convolutional Neural Network.

RNN + CNN: Recurrent neural network (bidirectional GRU) + Convolutional Neural Network.

Google BERT (Devlin et al., 2018): Bidirectional Encoder Representations from Transformers, a new method of pre-training language representations which obtains state-of-the-art results on a wide range of Natural Language Processing (NLP) tasks.
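The stratified 80%/10%/10% split described in step B can be illustrated with a minimal sketch in plain Python. This is not the authors' code: the function name `stratified_split`, the seed, and the 600/400 class balance are assumptions made only for illustration. The idea is simply that each label is split separately, so every subset keeps roughly the same proportion of fair and unfair reviews.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, test, validation) index lists that preserve label proportions."""
    rng = random.Random(seed)

    # Group example indices by their label so each class is split on its own.
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, test, validation = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_test = round(fractions[1] * len(indices))
        train.extend(indices[:n_train])
        test.extend(indices[n_train:n_train + n_test])
        # Whatever remains (about 10%) becomes the validation set.
        validation.extend(indices[n_train + n_test:])
    return train, test, validation


# Hypothetical labels: 1 = mission-critical (fair), 0 = non-critical (unfair).
# The 600/400 balance is invented; the paper does not report the class balance.
labels = [1] * 600 + [0] * 400
train, test, validation = stratified_split(labels)
print(len(train), len(test), len(validation))
```

In a Keras workflow, the resulting index lists would select the rows passed to training, early-stopping validation, and final testing, respectively.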

Figure 2. Text classifier benchmark. The Recurrent Neural Network (RNN) + Convolutional Neural Network (CNN) approach performed better across conditions.

We implemented early stopping to halt training once model performance stops improving on a held-out validation dataset. For the deep learning models, we used a binary cross-entropy loss function, ADAM as an optimizer, and a learning rate of 0.001. Fig. 2 presents an overview of the benchmark of the trained models (i.e., the figure shows the performance metrics of each model). We note that different machine learning models performed better on different gig markets. However, the RNN-based approach (a recurrent neural network combined with a CNN, see Fig. 2) performed best in general across all gig markets. This was the reason we eventually chose to utilize this model. After the model has been trained, it is exposed as a REST web service that exchanges JSON with our front-end interface. The service is consumed directly by Reputation Agent’s Fairness Promoter, which displays the messages accordingly.
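To make the training criteria above concrete, the following is a small sketch of a binary cross-entropy loss and an early-stopping rule in plain Python. It is not the Keras code used in the study: the function names, the `patience` value of 3, and the loss values are assumptions for illustration, since the text does not state how many non-improving epochs were tolerated.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for target, prob in zip(y_true, y_pred):
        # Clip probabilities so that log(0) is never evaluated.
        prob = min(max(prob, eps), 1 - eps)
        total -= target * math.log(prob) + (1 - target) * math.log(1 - prob)
    return total / len(y_true)


def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would stop.

    Training stops once the validation loss has failed to improve on its best
    value for `patience` consecutive epochs (patience is an assumed setting).
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1


# Invented validation losses: the best value occurs at epoch 2.
stop_epoch = early_stopping_epoch([0.70, 0.55, 0.48, 0.49, 0.50, 0.51, 0.47])
print(stop_epoch)
```

With these invented losses, training halts before the later, lower loss is ever reached, which illustrates the trade-off that the patience setting controls.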

Fairness Promoter. This component displays messages to the reviewer to prompt them to avoid considering factors outside the worker’s control based on the policies of the market. It displays the prompt messages that the platform maintainer defines, and the messages are triggered based on the predictions from the Smart Validator. The Fairness Promoter is a web plugin for jQuery, a JavaScript library, and works as a form validation plugin (commonly used to prevent forms from submitting data that do not fulfill a website’s validation or formatting criteria). Reputation Agent’s Fairness Promoter is linked to a text control that triggers a request to the Smart Validator every time the text control stops being used by the reviewer. The Fairness Promoter sends the reviewer’s current text to the Smart Validator in a JSON format. The Smart Validator analyzes the text and returns the likelihood that the review is focusing on factors that were within the control of the gig worker. If the review is not focused on such factors, the Fairness Promoter then displays its configured messages to prompt reviewers to reconsider their review and focus instead on variables that the worker was able to control.
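The exchange between the Fairness Promoter and the Smart Validator can be sketched as follows. The actual plugin is written for jQuery, so this Python version only mirrors the logic. The JSON field names (`review`, `within_control_likelihood`), the 0.5 threshold, the prompt wording, and the keyword-based `toy_predict` stand-in for the trained model are all hypothetical and not taken from the paper.

```python
import json

# Assumed cut-off: below this likelihood the review is treated as unfair.
FAIRNESS_THRESHOLD = 0.5

# Hypothetical maintainer-defined message: polite, explains the problem,
# and outlines a solution.
PROMPT_MESSAGE = (
    "Your review may mention factors outside the worker's control. "
    "Please consider focusing on what the worker was responsible for."
)


def build_request(review_text):
    """Client side: wrap the reviewer's current text in a JSON payload."""
    return json.dumps({"review": review_text})


def handle_request(request_body, predict):
    """Server side: score the text with a model and answer in JSON."""
    text = json.loads(request_body)["review"]
    return json.dumps({"within_control_likelihood": predict(text)})


def prompt_for(response_body, message=PROMPT_MESSAGE):
    """Client side: return the prompt to display, or None if the review looks fair."""
    likelihood = json.loads(response_body)["within_control_likelihood"]
    return None if likelihood >= FAIRNESS_THRESHOLD else message


def toy_predict(text):
    """Stand-in for the trained classifier (a keyword check, purely illustrative)."""
    return 0.1 if "traffic" in text.lower() else 0.9


unfair = prompt_for(handle_request(build_request("Late because of traffic!"), toy_predict))
fair = prompt_for(handle_request(build_request("The driver was rude."), toy_predict))
print(unfair is not None, fair is None)
```

In a deployment, `handle_request` would sit behind the REST endpoint and `toy_predict` would be replaced by the trained model's prediction function.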

In our design of the Fairness Promoter we chose in-form prompting instead of popups. The logic behind this decision is that this design can lead to faster completion times (Hofseth et al., 2019). This is an important decision due to the limited time that customers usually spend evaluating services. We also considered that some users might have popup blockers in their browsers that could prevent them from seeing the prompt. Additionally, we chose to show the prompt message after the end-user finished writing her review, instead of while she was completing it. We made this decision because prior research has shown that people tend to be in either a form-completion mode or a problem-resolution mode (Bargas-Avila et al., 2007). If people are in a form-completion mode, they tend to ignore alert messages (and hence Reputation Agent would be less effective). Furthermore, we decided to place Reputation Agent’s prompting messages close to the review text box, since previous work (Seckler et al., 2012) has shown that such placement is more effective than placement at the top or at the bottom of the review form. While our prompting messages can be edited by platform maintainers to publish whatever message they desire, we aimed for the initial boxed messages to follow guidelines that prior work has deemed the most effective. In particular, we follow the design guidelines from Bargas-Avila et al. (Bargas-Avila et al., 2010), which state that prompting messages should be polite, explain the problem, and outline a solution. Our explanation aimed to convey to requesters how their review might be considered unfair based on the policies of the market. We also aimed to briefly explain what type of factors are considered to be outside the control of a worker, and offered people the solution of rewriting the parts of their review that contain those unfair factors.

4. Evaluation

Reputation Agent instantiates our design hypothesis that by flagging reviews with factors outside a worker’s control and then presenting to employers what the policies of the gig market highlight as workers’ responsibilities, we can prompt fairer assessments of gig workers. To test this hypothesis, we conducted a between-subjects study comparing Reputation Agent with control interfaces. We had participants evaluate a gig worker, given a scenario where the customer had experienced a “bad outcome” on the gig market. However, it was not the fault of the worker being evaluated (i.e., factors outside the worker’s control were to blame). We study whether people using Reputation Agent generated fairer reviews than people using control interfaces to review gig workers under the same circumstances. Given that it was also important for us to evaluate our tool within different gig markets (considering that it could be used in diverse niches), we evaluated our tool on marketplaces similar to: Uber, GrubHub and Upwork.

4.1. Method

Our study focused on three popular gig markets (Grubhub, Uber, and Upwork). We randomized participants into one of our experimental conditions, each of which represented a particular gig market and interface for reviewing workers (Reputation Agent or control). Participants had to imagine they were a gig market customer or employer who had to write a review about a gig worker after experiencing a “bad” outcome on the marketplace (which was not the worker’s fault). The scenarios that participants had to consider were:

1) Uber Scenario. Participants are passengers in a ride-sharing platform (e.g., Uber) where the driver had followed the recommended GPS route, had a clean car, had picked them up and dropped them off in the correct locations, and was polite. However, due to the heavy traffic, they experienced a delayed trip and had to pay an overpriced fee. The tardiness of their ride resulted in them missing an important meeting with a client and losing a contract.

2) GrubHub Scenario. Participants have an important lunch with a client and order the meal through an on-demand food delivery platform (GrubHub). The delivery person dropped off the meal on time. However, the meal contained an ingredient that caused the client to have an allergic reaction, making the client very sick (the order had included a request for this ingredient to be removed). Due to the bad experience, the client decided to cancel her contract with our participants.

3) Upwork Scenario. Participants used a freelancing platform (Upwork) to hire someone to translate an important report from English to French for a French client. The translation was delivered on time and seemed to be of high quality. However, due to a glitch in the system, the last part of the report was truncated. Because the participants did not know French, they had not realized that the report was truncated. They gave the truncated translation to their French client, making a bad impression and losing the contract with the client.

For each of these three gig markets, we trained our tool to detect reviews that involved factors outside a worker’s control. For this purpose, for each gig market we: (1) collected 1,000 real-world reviews from SiteJabber (for the three scenarios); and (2) had two independent college graduate coders classify each of these reviews according to whether they involved the worker’s performance or factors outside the worker’s control. We provided summaries of what factors were considered to be within the worker’s control and examples of which ones were not. Coders were also given a link to the policies of each of the three gig markets to better assess the variables that the marketplace considers to be under a worker’s control. Some explained examples were given to coders so that they had a common understanding when dealing with ambiguous cases. The two coders agreed on the classification of 94.7% of all the reviews (Cohen’s kappa = .86, strong agreement). We then asked a third college graduate coder to act as a tiebreaker in cases of disagreement. After this step, for all three types of gig markets, we had a labeled set of reviews. The labeled data was provided as input to Reputation Agent’s Smart Validator to train its models. Reputation Agent uses stratified sampling to split the labeled data into training, test, and validation sets under proportions of 80%, 10%, and 10% (the validation set helps to avoid overfitting). We implemented early stopping to halt training once model performance stops improving on a held-out validation dataset. For our deep learning module, we used a binary cross-entropy loss function, ADAM (Kingma and Ba, 2014) as an optimizer, and a learning rate of 0.001.
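For readers unfamiliar with the agreement statistics reported above, the sketch below computes raw percent agreement and Cohen's kappa for two coders. Kappa corrects the observed agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The ten toy labels are invented for illustration and are not the study's data.

```python
from collections import Counter


def cohen_kappa(coder_a, coder_b):
    """Return (observed agreement, Cohen's kappa) for two equal-length label lists."""
    n = len(coder_a)

    # p_o: fraction of items on which both coders chose the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    # p_e: chance agreement, from each coder's marginal label frequencies.
    count_a = Counter(coder_a)
    count_b = Counter(coder_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    # Note: kappa is undefined when expected == 1 (both coders always use one label).
    return observed, (observed - expected) / (1 - expected)


# Invented example: the coders disagree on exactly one of ten reviews.
coder_a = ["fair"] * 6 + ["unfair"] * 4
coder_b = ["fair"] * 5 + ["unfair"] * 5
observed, kappa = cohen_kappa(coder_a, coder_b)
print(round(observed, 2), round(kappa, 2))
```

The toy example shows why kappa is lower than raw agreement: part of the 90% agreement would be expected even if both coders labeled at random with the same marginal frequencies.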

Across conditions, participants wrote their review based on their fictional scenario and using one of the four possible interfaces:

1) Control (written text). The end-user is presented with a traditional textbox where they must write a review about the worker.

2) Control + Rating. The end-user is presented with a traditional textbox where they must write a review about the worker, as well as complete traditional 5-star rating questions. These rating questions match the ratings that are currently present in the particular gig market in which the participant is operating (e.g., participants in the Uber scenario were presented with the rating questions that Uber uses to review drivers).

3) Reputation Agent. The end-user is presented with a traditional textbox where they must write a review about the gig worker while receiving prompting from Reputation Agent.

4) Reputation Agent + Rating. The end-user is presented with a traditional textbox where they must write a review about the gig worker while receiving prompting from Reputation Agent. The end-user is also asked to complete traditional 5-star rating questions about the worker. Here, the rating questions again match the questions present in the given gig market.

Figure 3. General Interfaces per condition.
Figure 4. Rating interfaces per Gig Market.

Fig. 3 presents a general overview of how the interfaces looked in each condition. Fig. 4 presents the different rating interfaces we considered per gig market. We designed these rating interfaces to mimic the ratings that particular gig markets use, as we were interested in studying how our tool performed within mainstream settings. Note that in the Reputation Agent conditions, we stored all the review attempts to analyze how people’s behavior changed.

Our between-subjects study had 12 conditions that involved three different fictional scenarios (three types of gig markets) where four different interfaces for reviewing workers were used. Each condition had 40 participants. People’s participation consisted of writing the review for the worker they were assigned and then completing a follow-up survey to provide feedback about their experiences. Specifically, the survey questioned people about: (1) Who or what was responsible for the bad service they had received on the gig platform? (gig worker, requester, platform algorithms, client, or other) (2) How much fault did each of those actors have? (3) How much did they think that their review would affect the worker’s reputation? (4) As a customer of gig platforms, what type of review interface (written or 5-star rating reviews) did they prefer? (5) As a worker or requester of gig markets, what type of reputation mechanism (written or 5-star rating reviews) did they prefer? (6) How much did they feel that the interface helped them to give more accurate feedback about worker’s performance? (7) How did their review process (if any) change after completing the review with their interface? Once participants had submitted their review and completed the follow-up survey, we analyzed whether the reviews they submitted were fair, specifically whether they integrated factors outside a gig worker’s control or not (to study the effectiveness of Reputation Agent). For this purpose, we had two independent college graduate coders read each of the final reviews that participants generated and categorize whether the review blamed the worker on factors outside the workers’ control or not (coders were also given the policies of each gig market to help their categorization, examples and summaries of the policies). The two coders agreed on the classification of 95.1% of all the reviews produced by participants (Cohen’s kappa =.87: Strong agreement). 
In cases where there was disagreement, we asked a third college graduate coder to act as a tiebreaker. In all cases, we categorized the first and last reviews submitted in order to determine how much Reputation Agent led people to change their reviewing behavior.
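As an illustration of the agreement statistic reported above, the sketch below computes Cohen's kappa for two coders and resolves disagreements with a third coder's label. The labels are made up for the example, and the helper names `cohens_kappa` and `resolve` are hypothetical; this is not the authors' analysis script.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("coders must label the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)

def resolve(coder_a, coder_b, tiebreaker):
    """Keep labels the two coders agree on; otherwise use the third coder's."""
    return [a if a == b else t for a, b, t in zip(coder_a, coder_b, tiebreaker)]

# Made-up labels for eight reviews (not data from the study).
coder_a = ["unfair", "unfair", "fair", "fair", "unfair", "fair", "unfair", "fair"]
coder_b = ["unfair", "unfair", "fair", "unfair", "unfair", "fair", "unfair", "fair"]
coder_c = ["unfair", "unfair", "fair", "fair", "unfair", "fair", "unfair", "fair"]

kappa = cohens_kappa(coder_a, coder_b)
final_labels = resolve(coder_a, coder_b, coder_c)
print(round(kappa, 2), final_labels[3])
```

Kappa discounts the agreement two coders would reach by chance given how often each uses each label, which is why it is lower than the raw percentage agreement.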

We recruited a total of 480 participants using university mailing lists, social media, and postings on gig markets. Note that these are the same methods used by prior work to recruit requesters for studies (Gaikwad et al., 2015). Importantly, all our participants had been a requester (employer or customer) at least once on the gig market to which they were assigned. 55% of our participants were male, and 45% were female. Participants were between 21 and 70 years old, with a median age of 35. All had at least a high school degree, 59% had a bachelor’s degree, 17% a master’s degree, and 2% a Ph.D. Some of our participants had been workers at least once on gig platforms: 18% on Uber, 18% on Upwork, and 12% on GrubHub (which is normal given that gig markets allow people to take on both roles). Participants were paid $2.00 USD to participate, and the study took at most 15 minutes.

5. Results

Figure 5. Percentage of fair reviews created per gig market scenario and when using a particular interface.

Figure 5 presents, across gig markets, the percentage of fair reviews that were generated with each interface. Table 1 presents examples of reviews that were classified as unfair and fair. Across gig markets, the people using Reputation Agent wrote a larger number of fair reviews than those using the control interfaces. In certain scenarios the increase was substantial. For instance, in the Uber scenario, only 10% of the reviews written with the control interface were fair (i.e., 90% were unfair reviews where people blamed their driver for factors outside her control, e.g., bad traffic). However, when using Reputation Agent, the share of fair reviews increased to 70%.

We also note that in some scenarios, having Reputation Agent operate with both a textbox and numerical ratings led people to write a slightly higher number of fair reviews than when using only a textbox (this is the case for the GrubHub scenario). However, for Upwork and Uber there were more fair reviews when Reputation Agent operated only with a textbox (and no numerical rating). To test whether these observed differences were significant, we conducted a series of statistical tests. After determining that our data did not meet the normality assumption, we ran an omnibus non-parametric Kruskal-Wallis test (p < .00001, H = 81.9303) and Mann–Whitney U tests with Bonferroni correction (p < .00001, z = 9.05126, U = 20400) to identify post hoc effects over conditions. Through this analysis, we found that there was indeed a significant difference in the number of fair reviews that people generated when using Reputation Agent compared to reviews generated with traditional interfaces. In other words, we found that participants are significantly less likely to write unfair reviews when using Reputation Agent than with normal interfaces.
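To make the post hoc step concrete, the sketch below runs one tie-corrected Mann-Whitney U test (normal approximation, no continuity correction) and applies a Bonferroni adjustment. The two samples are reconstructed from the percentages reported for the Uber scenario (10% fair with the control interface and 70% with Reputation Agent, 40 participants each); they are not the authors' raw data. The choice of six comparisons is an assumption (all pairs among four interfaces), and the omnibus Kruskal-Wallis test is not reproduced here.

```python
import math
from collections import Counter

def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test; returns (U for x, z, p)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    counts = Counter(x + y)
    # Midrank of each distinct value: mean of the 1-based ranks it occupies.
    midrank, start = {}, 1
    for value in sorted(counts):
        t = counts[value]
        midrank[value] = start + (t - 1) / 2
        start += t
    u = sum(midrank[v] for v in x) - n1 * (n1 + 1) / 2
    # Variance of U under the null, reduced by the tie correction term.
    tie_term = sum(t ** 3 - t for t in counts.values()) / (n * (n - 1))
    sigma = math.sqrt(n1 * n2 / 12 * ((n + 1) - tie_term))
    if sigma == 0:
        return u, 0.0, 1.0  # all observations identical
    z = (u - n1 * n2 / 2) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return u, z, p

def bonferroni(p, n_comparisons):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p * n_comparisons)

# 1 = fair review, 0 = unfair review; counts follow the reported percentages.
control = [1] * 4 + [0] * 36
agent = [1] * 28 + [0] * 12

u, z, p = mann_whitney_u(agent, control)
p_adj = bonferroni(p, n_comparisons=6)  # assumed number of pairwise tests
print(u, round(z, 2), p_adj)
```

Because fairness is a binary outcome, almost all observations are tied, so the tie correction to the variance matters here far more than it would for continuous data.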

We also investigated how much Reputation Agent helped people to start changing their reviewing behavior. For this purpose, we measured the number of people who, while using Reputation Agent, changed their original review from being unfair to fair (see Table 2). On average, across gig markets, Reputation Agent was able to convert two out of every three of the reviews that were originally unfair to fair (67.1%). For when people decided to not change their unfair review, we analyzed the reasons for this behavior by observing what they stated in the survey. This analysis can help us to identify some of the challenges that Reputation Agent has in ensuring fairness in gig markets. We used open coding to extract initial concepts from people’s responses (Mihas, 2019). We aimed for these initial concepts to consider the themes that related work had derived on people’s motivations for writing certain types of reviews (Goodman and Paolacci, 2017; Levitt and List, 2007). Next, we discussed these initial concepts as a group to iterate on them and created a codebook (list of themes).

Unfair: Blamed the worker for things outside their control.
Uber “I am beyond pissed. This driver took a ridiculous route causing a 45min delay and had the nerve to overcharge me on it! I was late for work and obviously, my boss was not pleased. Thanks a lot, Uber.”
GrubHub “The delivery person failed to perform one of the basic functions of their job, which was ensuring the product they picked up was correct. He brought lunch to one of my clients and she couldn’t even eat it due to allergy concerns. Our food had peanut butter when I clearly stated no peanut butter. Very annoyed. I wasted my time and my money.”
Upwork “Working with this person was a pain in the ass! I had a very important proposal to give to an important potential client who only spoke French. As I only speak English I contracted with this worker to translate it for me. They submitted too close to the deadline with little wiggle room, making it impossible to fix or check for any issues. They sent accidentally incomplete work. It cost me a contract due to his negligence. I do not recommend him, as he’s not meticulous.”
Fair: Assessed factors that the worker could control.
Uber “The GPS of my driver leads me to a very congested route today and it took me a lot more time and money to get to work. But these things happen. I could have made the same choice driving myself so the driver can’t be blamed for something out of their control. Driver and his car were very nice. I am however very unhappy with the service and will be immediately unsubscribing soon.”
GrubHub “The delivery person got the food to me on time. It was not their fault that the order was wrong. The order was wrong because of the people at the restaurant. They messed up what was suppose to be a great lunch. The driver did all he was supposed to do and I give him a good review for that..”
Upwork “I thought the worker did a good job and the work was presented well, it is a shame however that the system failed at the last minute and I got an incomplete submission.”
Table 1. Examples of reviews that were written by study participants and categorized as “unfair” and “fair” across gig market scenarios.
Gig market | Condition | Corrected reviews
Uber | Reputation Agent | 66.6%
Uber | Reputation Agent + Rating | 70.2%
GrubHub | Reputation Agent | 80.9%
GrubHub | Reputation Agent + Rating | 73.3%
Upwork | Reputation Agent | 57.8%
Upwork | Reputation Agent + Rating | 53.4%
Table 2. Corrected reviews after Reputation Agent prompted participants. 2 out of every 3 reviews were corrected.

The codebook with examples was shared with two coders, who agreed on 91% of the reasons (Cohen’s kappa = .81, strong agreement); a third college graduate coder acted as a tiebreaker in cases of disagreement. We detected the following categories describing participants’ reasons for not changing their reviews:

5.0.1. Truthfulness.

Some reviewers (21% of all reviewers who chose not to change their review after prompts from Reputation Agent) felt that changing their review implied lying about how they felt and wanted to keep their review as it was because it was truthful. Examples of their reasoning:

“I always give honest and detailed reviews and will continue to do so. This interface will not impact on my reviewing process. I will always stand by how I have done things in the past: with the truth!” Uber Reviewer 44.

“Why on earth would my process change? […] if I get bad work, I’m going to leave an honest review regardless of whether it was the ”platform” or the worker’s fault…”, Grubhub Reviewer 15.

“…I like being honest and factual with reviews. That’s why I don’t like changing my reviews…” Upwork Reviewer 43.

5.0.2. Agency.

These reviewers (25%) felt that although a gig market’s policies might dictate that certain actors were not to blame, they believed that such actors should have had more agency in their decisions despite the policies of the gig market.

“Drivers should be able to tell which routes are clean by instinct based on the day and the time of the day without even looking at the GPS. The drivers should know the city that they are driving very well…” Uber Reviewer 23.

“Well the delivery guy didn’t listen to a word I said so now my client can’t eat the meal. If he has any type of peanut in their food they can go into anaphylactic shock and die. That is not what I want for their lunch? Is that what you want them to have for lunch? Death? The delivery guy really needs to be responsible for this.” Grubhub Reviewer 32.

“…Let’s be honest, the worker did not perform as well as expected. She missed that some words were cut off. The bottom line is that she needs to learn to handle herself responsibly in the world.” Upwork Reviewer 19.

5.0.3. Empathy.

These reviewers (36%) felt they needed more time to analyze the scenario before changing their review. They appeared to have empathy for all actors involved and wanted to truly understand their situation before changing their review.

“I like to think about all the circumstances before writing reviews. I like to use empathy. In my future review I will probably be a bit less harsh on drivers. I will think about the driver and how they treated me as well. But before I make those change I will try to calm down first…” Uber Reviewer 21.

“…I wanted to focus on the bad aspects of my meal, but then I realized I was supposed to focus on the deliverer only. So I switched it and focused on the delivery worker instead. I try to take all factors I am aware of into account when writing my review, and wait a bit so that any emotions associated with the work would not affect my review. I think I would wait to re-write my review so I am not angry and really think about the worker…” Grubhub Reviewer 5.

“I like to take software and platform issues into account before completely blasting a worker in a review (either with stars or a written review). I won’t change my review now because I have to stop and think about things. Instead of just getting mad and going off on the worker immediately. I would also want to have more communication between worker and purchaser before the review process, and have a way to discuss the review if either party truly found it to be in error.” Upwork Reviewer 39.

5.0.4. Warning.

Some participants (13%) maintained their review because it was important for them to have a space where they could caution others about what they experienced. They did not care about whether they blamed the incorrect person.

“…I just detailed the problems I encountered with the driver. I wanted others to know about my issue so that it doesn’t happen to them as well.” Uber Reviewer 49.

“…I usually never provide reviews for delivery people […] But in this case the service was exceptionally bad and my experience would serve as a cautionary tale to others. So that is why I can’t change it [the review]” GrubHub Reviewer 5.

“…I think my review would let people know of the risks about using this worker/platform. They could potentially avoid situations like the one that I was in. It’s important for me to keep my review to warn others….” Upwork Reviewer 20.

67% of the people using Reputation Agent reported that they felt that the interface helped them to be more aware of inaccuracies in their reviews. Participants across conditions reported that they felt their review would affect worker’s reputation. People in the control condition had the perception that their review would be the most harmful (mean 3.9 of 5). This is notable when compared to people using Reputation Agent who on average thought that their review would not be as harmful (mean 2.9 of 5).

6. Discussion

Our experiments demonstrated the potential of using intelligent web plugins to detect unfair reviews on gig markets, and then prompt fairer assessments by presenting micro-information about gig workers’ conditions and policies. Across different marketplaces, the majority of people using Reputation Agent ended up writing fairer reviews. Our study provides a novel insight into how marketplaces could use this type of smart web plugin to bring more fairness to workers. In this section, we discuss opportunities and challenges we see with Reputation Agent, and highlight design implications for future systems that operate within the gig marketplace.

6.1. Building Empathy In Gig Markets

Taking empathy into account in the human-centered design process can align designers with the values and needs of people who may use the platforms (Bennett and Rosner, 2019). Mencl and May define empathy as “a positive moral emotion that aids reasoning” (Mencl and May, 2009). Our study highlighted that prompting people to reconsider their reviews, and to reason more deeply about the worker and what her actual job was, helped reviewers to be fairer.

While all our participants considered that their reviews would have an impact on workers’ lives, the level of harm that people attributed to their review varied across conditions. People using the control (text only) interface tended to believe their reviews were the most harmful, while people using the Reputation Agent + Rating interface felt they were doing the least harm to workers. The “tension between reason and emotion when making decisions” (Frith and Singer, 2008) allows us to see the benefit of a tool such as Reputation Agent in prompting requesters to reconsider their written review. Thus, our results highlight that providing more metrics and guidance helped people feel as if they were doing less harm to workers while still submitting a review that was accurate.

We see tools like Reputation Agent as a way to help requesters have a more humane perspective of workers by providing more transparency and awareness of what the current labor conditions are in gig markets. Through Reputation Agent we offer a way in which requesters can be guided to better understand the actual job expectations for workers. We believe that through this transparency and highlighting of boundaries denoting gig workers’ labor that we will be able to build more consideration for workers within gig markets (Irani and Silberman, 2016). Several of our participants who Reputation Agent prompted to change their reviews discussed how the tool helped them to better understand workers’ conditions.

From our study, we also identified that there were cases where people even after being prompted by Reputation Agent, refused to change their review at all. Many of these individuals were people who felt that workers needed to have more agency. For instance, in the Uber case, some passengers believed that their Uber driver should not have followed the recommended GPS route, but instead selected a shortcut and better route. These individuals blamed their Uber drivers for not taking the initiative and knowing enough of their city to understand that the GPS algorithm was wrong. We believe that in these cases, it might be worth designing interventions where the policies and responsibilities of gig workers are explained in detail to these individuals from the outset. We believe there is an opportunity in using systems like Reputation Agent as a way to create more empathy between requesters and workers. Additionally, it might be worth explaining to workers the perspectives of these requesters in order to facilitate their understanding of why certain requesters might expect them to not always follow an algorithm and be more “proactive.” In these cases, we visualize platforms that do not penalize workers for not being proactive (i.e., by following the instructions of the algorithm), but rather help workers open their minds to other perspectives and help them to see that having more agency in their decision process could provide growth opportunities, e.g., to eventually become a manager.

6.2. Supporting Reflection In Gig Markets

Our study contained 46 individuals who refused to modify their review even after being prompted. We considered it important to understand the reasons these individuals had for not changing their written assessments. In some cases, requesters used the review process as a chance to communicate to the platform that it would be more efficient (in terms of time and cost) if the worker was allowed more agency. We believe there are benefits to gig markets when they understand the type of agency that requesters want to see in the market. Platforms could consider alternate methods/systems for capturing this type of feedback in order to protect workers.

Williams et al. found that tools based only on distributing ratings and reviews for task-choosing decisions tend to create fragmentation and discrimination, affecting the platform’s fairness (Williams et al., 2019). We argue that it is important to tie tools like Reputation Agent with platforms focused on driving citizen discussions and citizen reflection (Mahyar et al., 2018). On this point, it is important to consider the findings of Li et al. concerning the influence that embodied conversational agents (ECAs) have on persuading people to consider feedback that is offered to them (Li et al., 2007). “The use of agents that resemble users” might be the necessary factor that allows requesters to consider the promptings of Reputation Agent to be more valid. We must also ask: what type of agency should we expect from the different actors of the market, and why is it important that we expect such agency from them? Further research is needed to investigate the type of interfaces and workflows that could be used to incentivize and guide quality reflections about what people expect from workers, requesters, and the different policies of a marketplace. This type of system could uncover pain points that exist in current crowd markets and where policy changes might be needed.

Our study also highlighted another reason why requesters did not want to change their review: the importance for them of using the space to truthfully share their experiences. In the widely cited paper “The Market for Goods and the Market for Ideas” (Coase, 1974), it is argued that in the market for goods (i.e., the market where consumer goods and services are exchanged), government regulation is desirable; whereas in the market for ideas (i.e., the market where opinions or beliefs are interchanged), government regulation should be limited. Online reviews can be seen as something that delivers both “goods” and “ideas”. On one hand, having a person write an objective review of the work someone did could be seen as delivering a good. The good, in this case, corresponds to the overall assessment of the labor that the worker did. This assessment not only helps the market better contextualize and measure the labor that is being produced (Jagabathula et al., 2014), it can also boost the SEO of the marketplace (Shenoy and Prabhu, 2016), helping it appear higher in search engine results and ultimately bring in more customers (Rognerud, 2008). The review can also help the worker get better credentials, access higher pay, and more requesters (i.e., the review might persuade other requesters to hire the worker). Reviews as goods deliver services to the marketplace, workers, and even other requesters. However, reviews also have the capacity to deliver opinions and beliefs, and hence can also belong within the market for ideas. Thus, we see value in being able to actively regulate the activities that belong to the market for goods, while permitting freedom of expression for activities that relate to the market for ideas (Coase, 1974). Reputation Agent offers an advancement in this direction by providing a way to regulate reviews within the market for goods and flagging reviews that might pertain more to the market for ideas. Future work could pursue this avenue to design review interfaces that support both forms of reviews.

6.3. Rating Systems and Fairness

Rating systems and fairness are essential elements of gig markets, whether they pertain to rating workers or rating requesters (Siberman and Harmon, 2017; Irani and Silberman, 2013; Dow et al., 2012). Creating a fair working environment with structures designed to protect workers’ rights to receive fair compensation for their labor ensures the reputation and success of gig markets (Felstiner, 2011; Adam et al., 2016; Metall, 2016; Benner, 2014; Siberman and Harmon, 2017). Thus, devising a tool that gig markets can implement to ensure fair reviews for workers brings us one step closer to achieving this.

To this end, our finding that people tended to write fairer reviews with Reputation Agent when working with the written interface is an important addition to the tools available to gig markets. In our study, we also discovered that people tended to write a larger number of unfair reviews when Reputation Agent was tied with numerical ratings. This was specifically the case with Uber and Upwork, where having numerical ratings tied with a written review led to a larger number of unfair reviews than when working with just Reputation Agent and a written interface. Upon closer inspection, we identified that the problem was that the rating systems of these gig markets did not distinguish the worker’s performance from factors outside the worker’s control. For instance, when assessing a driver’s rating on Uber, the market provides a list of possible issues and presents “poor route” as an option even though Uber policies outline that drivers should always follow the recommended GPS route (unless explicitly instructed otherwise by the passenger). As a result, several participants selected “poor route” as an issue and then wrote lengthy reviews blaming the driver for the traffic (despite the prompts from Reputation Agent). Similarly, we noted that markets which differentiated between metrics pertaining to workers vs. the platform led people to generate fairer reviews. For instance, GrubHub has a rating system that differentiates between these two types of metrics, and we saw a decrease in the number of unfair reviews generated by participants. We see, then, the necessity for gig markets not only to incorporate tools that promote fairer reviews, but also to clarify and communicate the metrics that pertain to the workers.

Unfair reviews may also be the product of biases which do not necessarily reflect a worker’s performance, i.e., when a worker gets more positive ratings than expected given the service she provided. These types of reviews can be influenced by cognitive biases such as confirmation bias (Klayman, 1995), driven by having prior beliefs; anchoring effect (Caputo, 2013), relying more on the first piece of information offered and hence the current performance does not matter; or perception bias (Greenberg, 1991), motivated by how others might perceive you as the reviewer. Reputation Agent provides the opportunity to educate people about possible biases they might have and how those might be impacting their reviews. Here, we envisage that Reputation Agent’s prompts might provide information about biases in addition to information about the policies of gig markets. Future research could focus on personalized feedback according to personality or cultural biases that might exist (Krzystofiak et al., 1988; Martell and Willis, 1993; Hogan, 1987; Zwikael et al., 2005; Li and Karakowsky, 2001).

6.4. Feasibility And Maintainability

Through our controlled experiments, we identified that Reputation Agent was able to lead requesters to generate fairer reviews than when they worked with the control interfaces. However, to accomplish these results, it was necessary to have labeled data for each gig market on what constitutes fair reviews and what constitutes unfair reviews. While the labeled data sample that we used was relatively small in comparison to the large number of reviews that are generated on these marketplaces daily (Baj-Rogowska, 2017), it is possible that new gig markets might have a difficult time collecting and labeling review data for Reputation Agent.

We have released our system to help gig markets easily adopt and use our tool. Additionally, given that Reputation Agent can be easily implemented as a validation module, platform maintainers could change their front-end review interface without having to worry about Reputation Agent suddenly not working. Because Reputation Agent is based on deep learning, if a gig market changes its policies, platform maintainers with minimal knowledge of artificial intelligence can easily re-train Reputation Agent to reflect the changes (Fandango, 2018; Toxtli et al., 2018). On our website we have shared training examples for Reputation Agent’s learning module so that website maintainers can easily start using our tool.

6.5. Key Design Considerations

Our investigation unraveled design considerations for technology to support the generation of fairer interactions on gig markets.

6.5.1. Tools for Learning about Gig Market Policies.

Reputation Agent can be seen as a tool that helps highlight the policies of a gig market. For instance, when people are writing a review for Uber, Reputation Agent shares Uber’s policies. Future work could explore how heuristics and hard-coded rules can lead people to better understand the policies of a marketplace and comprehend what falls under worker vs. platform responsibilities. The visualization of different privacy policy representations can improve the understanding of the different actors (Lipford et al., 2010).

6.5.2. Tools for Better Moderation.

Integrating artificial intelligence (AI) into gig markets can go beyond flagging unfair reviews (Dai et al., 2011). AI can also be intermixed with human moderators to facilitate a better understanding of the perspectives of requesters. For instance, given that Reputation Agent is able to store all the review attempts that people make, the system could detect cases where, even after being prompted to reconsider her review, the end-user kept everything the same. In such cases, the system could trigger an alert to human moderators to take a closer look at the review. We view Reputation Agent as a tool that can alleviate moderators’ labor. The sustainability and self-management of the tool also depend on the neutrality of the training data. Human-in-the-loop mechanisms can give different actors agency, so that everyone, not just the few who can code (Williams et al., 2019), has decision power over how the tool learns and makes decisions.
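The escalation idea described above can be sketched as a small rule over the stored review attempts. Everything in this sketch is hypothetical: `ReviewSession`, `needs_moderation`, and the keyword-based `toy_is_unfair` stand-in are invented for illustration, and a real deployment would call Reputation Agent's trained classifier instead of a keyword check.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class ReviewSession:
    """All review attempts one requester submitted for one job."""
    reviewer_id: str
    attempts: List[str] = field(default_factory=list)

    def submit(self, text: str) -> None:
        self.attempts.append(text)

def needs_moderation(session: ReviewSession,
                     is_unfair: Callable[[str], bool]) -> bool:
    """Flag sessions whose final review is still unfair after a prompt."""
    if len(session.attempts) < 2:
        return False  # no resubmission yet, so nothing to escalate
    was_prompted = is_unfair(session.attempts[0])
    return was_prompted and is_unfair(session.attempts[-1])

def toy_is_unfair(text: str) -> bool:
    """Placeholder classifier (assumption): keyword match, not a real model."""
    return any(word in text.lower() for word in ("traffic", "route"))

unchanged = ReviewSession("reviewer-a")
unchanged.submit("The driver took a terrible route and I was late.")
unchanged.submit("The driver took a terrible route and I was late.")

revised = ReviewSession("reviewer-b")
revised.submit("Awful traffic, this driver made me late.")
revised.submit("The driver was polite and the car was clean.")

print(needs_moderation(unchanged, toy_is_unfair),
      needs_moderation(revised, toy_is_unfair))
```

Passing the classifier in as a function keeps the escalation rule independent of the model, so the rule would not need to change when the model is re-trained for new policies.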

7. Limitations and Future Work

The insights from this work are limited by the methodology and population we studied. Our controlled experiment allowed us to begin understanding how users engage with Reputation Agent. However, we cannot extrapolate how people would respond if this approach were implemented in a field deployment with conditions such as limited time and reduced willingness to reconsider their reviews. We attempted to counter this issue by implementing interfaces and creating scenarios that mimicked various gig markets and circumstances. Still, future work could benefit from analyzing how systems like Reputation Agent are used when people are on the go and under time constraints. While the scenarios we studied resembled very specific real-world situations, our results might not yet generalize to populations at large or to different types of situations. Further analysis is needed to understand how studies that leverage real gig market actors and Reputation Agent play out in helping users to give more objective reviews.

Reputation Agent was designed to limit the number of extra interface controls that platform maintainers would have to implement. The aim of this work is to provide a smart validation mechanism for existing interfaces, i.e., one that is easy to implement and not invasive. Future work could also explore how adaptations in the workflow and interface controls, such as a separate textbox for the reviews that are generated with Reputation Agent, could lead to reducing unfair reviews. This work explored the effect of using Reputation Agent in two settings: with ratings tied to text reviews and with just written text reviews (without any ratings). We studied whether in these settings Reputation Agent could guide people to change their reviews to fair ones. We chose to focus on written reviews because a single bad written review could not only jeopardize a worker’s reputation, but also lead to her account being terminated. Future work could explore how integrating fairness validators might also influence the numerical ratings that people give to workers. Our work replicates the current conditions of gig markets, where people are never initially prompted or reminded to be fair in their reviews, i.e., Reputation Agent prompts only when unfair reviews are given. We established this setting because we considered that customers would likely be busy individuals who simply wanted their service delivered. Therefore, constant reminders of the gig market’s policies could be considered invasive: if a customer has not written an unfair review, the policies might not need highlighting. Future work could explore how promoting fairness at different points in time (e.g., directly when starting to write the review or at the end) can lead to fairer reviews. Our study may also be subject to novelty effects; longitudinal studies are needed to examine whether Reputation Agent continues to promote fair reviews over time. 
This was a controlled experiment and not a deployment, i.e., there was never money at stake and no real harm done to the worker. Future work can compare how our results differ from deployments in the real world.

Acknowledgements. Special thanks to Amy Ruckes, Ben Hanrahan, and Six Silberman for the immense feedback and iterations on this work. This work was partially supported by NSF grant FW-HTF-19541.

