Micro-blogging social media platforms have become very popular in recent years. One of the most popular platforms is Twitter, which allows users to broadcast short texts (initially 140 characters, raised to 280 characters in a later platform update) in real time with almost no restrictions on content. Twitter is a source of people’s attitudes, opinions, and thoughts toward the things that happen in their daily life. Twitter data are publicly accessible through the Twitter application programming interface (API), and several tools exist to download and process these data. Twitter is increasingly used as a valuable instrument for surveillance research and predictive analytics in many fields, including epidemiology, psychology, and the social sciences. For example, Bian et al. explored the relation between promotional information and laypeople’s discussion on Twitter by using topic modeling and sentiment analysis [bian_using_2017]. Zhao et al. assessed the mental health signals among sexual and gender minorities using Twitter data [zhao_assessing_2018]. Twitter data can be used to study and predict population-level targets, such as disease incidence [eichstaedt_psychological_2015], political trends [gayo-avello_meta-analysis_2013], earthquake detection [sakaki_earthquake_2010], and crime prediction [hutchison_automatic_2012], as well as individual-level outcomes or life events, such as job loss [lossio-ventura_operational_2019], depression [leis_detecting_2019], and adverse events [wang_adverse_2018].
Since tweets are unstructured textual data, natural language processing (NLP) and machine learning, especially deep learning nowadays, are often used for preprocessing and analytics. However, many studies [finin_annotating_2010, mozetic_multilingual_2016, stowe_developing_2018], especially those that analyze individual-level targets, need manual annotation of several thousand tweets, often by experts, to create gold-standard training datasets. These datasets are then fed to NLP and machine learning tools for the subsequent, reliable automated processing of millions of tweets. Manual annotation is labor-intensive and time-consuming.
Crowdsourcing can scale up manual labor by distributing tasks to a large set of workers working in parallel instead of a single person working serially [carenini_extractive_2008]. Commercial platforms such as Amazon’s Mechanical Turk (MTurk, https://www.mturk.com/) make it easy to recruit a large crowd of people working remotely to perform time-consuming manual tasks such as entity resolution [arasu_active_2010, bellare_active_2012] and image or sentiment annotation [vijayanarasimhan_cost-sensitive_2011, pak_twitter_2010]. The annotation tasks published on MTurk can be done on a piecework basis and, given the very large pool of workers usually available (even when selecting a subset of those who have, say, a college degree), the tasks can be done almost immediately. However, any crowdsourcing service that relies solely on human workers will eventually become expensive when large datasets are needed, which is often the case when creating training datasets for NLP and deep learning tasks. Therefore, reducing the training dataset size (without losing performance and quality) would improve efficiency while containing costs.
Query optimization techniques (e.g., active learning) can reduce the number of tweets that need to be labeled, while yielding comparable performance for the downstream machine learning tasks [marcus_counting_2012, franklin_crowddb:_2011, parameswaran_crowdscreen:_2012]. Active learning algorithms have been widely applied in various areas including NLP [tang_active_2002] and image processing [wang_cost-effective_2017]. In a pool-based active learning scenario, data samples for training a machine learning algorithm (e.g., a classifier for identifying job loss events) are drawn from a pool of unlabeled data according to an informativeness measure (a.k.a. an active learning strategy [settles_active_2009]), and the most informative instances are then selected to be annotated. For a classification task, in essence, an active learning strategy should pick the “best” samples to be labeled, that is, those that will improve the classification performance the most.
In this study, we integrated active learning into a crowdsourcing pipeline for the classification of life events based on individual tweets. We analyzed the quality of crowdsourcing annotations and then experimented with different machine/deep learning classifiers combined with different active learning strategies to answer the following two research questions (RQs):
RQ1. How does (1) the amount of time that a human worker spends on and (2) the number of workers assigned to each annotation task impact the quality of annotation results?
RQ2. Which active learning strategy is most efficient and cost-effective to build event classification models using Twitter data?
We first collected tweets based on a list of job loss-related keywords. We then randomly selected a set of sample tweets and had these tweets annotated (i.e., whether the tweet describes a job loss event) using the Amazon MTurk platform. With these annotated tweets, we then evaluated 4 different active learning strategies (i.e., least confident, entropy, vote entropy, and Kullback-Leibler (KL) divergence) through simulations.
2.1 Data Collection
Our data were collected from two data sources based on a list of job loss-related keywords. The keywords were developed using a snowball sampling process, where we started with an initial list of 8 keywords that indicate a job-loss event (e.g., “got fired” and “lost my job”). Using these keywords, we then queried (1) Twitter’s own search engine (i.e., https://twitter.com/search-home?lang=en), and (2) a database of public random tweets that we collected using the Twitter streaming application programming interface (API) from January 1, 2013 to December 30, 2017, to identify job loss-related tweets. We then manually reviewed a sample of randomly selected tweets to discover new job loss-related keywords. We repeated this search-then-review process iteratively until no new keywords were found. Through this process, we found 33 keywords from the historical random tweet database and 57 keywords through Twitter web search. We then (1) collected tweets from the historical random tweet database based on the combined list of 68 unique keywords, and (2) crawled new Twitter data using the Twitter search API from December 10, 2018 to December 26, 2018 (17 days).
2.2 Data Preprocessing
We preprocessed the collected data to eliminate tweets that were (1) duplicated or (2) not written in English. For building classifiers, we preprocessed the tweets following the preprocessing steps used by GloVe [pennington_glove:_2014] with minor modifications as follows: (1) all hashtags (e.g., “#gotfired”) were replaced with “hashtag PHRASE” (e.g., “hashtag gotfired”); (2) user mentions (e.g., “Rob_Bradley”) were replaced with “user”; (3) web links (e.g., “https://t.co/fMmFWAHEuM”) were replaced with “url”; and (4) all emojis were replaced with “emoji.”
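As an illustration, the four replacement rules above could be sketched with regular expressions as follows. This is a minimal sketch, not the code used in the study: the patterns, the assumption that mentions begin with “@”, the approximate emoji code-point ranges, and the function name `preprocess` are all our own illustrative choices.

```python
import re

# Hypothetical patterns; the exact expressions used in the study are not given.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")       # assumes mentions start with "@"
HASHTAG_RE = re.compile(r"#(\w+)")
# Rough emoji coverage: common pictograph, flag, and symbol code-point ranges.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U0001F1E6-\U0001F1FF\u2600-\u27BF]+"
)


def preprocess(tweet: str) -> str:
    """Replace URLs, mentions, hashtags, and emojis with placeholder tokens."""
    # URLs go first so that "#" or "@" inside a link is not rewritten.
    text = URL_RE.sub(" url ", tweet)
    text = MENTION_RE.sub(" user ", text)
    text = HASHTAG_RE.sub(r" hashtag \1 ", text)
    text = EMOJI_RE.sub(" emoji ", text)
    # Collapse the extra whitespace introduced by the substitutions.
    return " ".join(text.split())
```

Replacing links before mentions and hashtags is a design choice of this sketch; it keeps characters inside a URL from being mistaken for a mention or a hashtag.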
2.3 Classifier Selection
Machine learning and deep learning have been widely used in tweet classification tasks. We evaluated 8 different classifiers: 4 traditional machine learning models (i.e., logistic regression [LR], Naïve Bayes [NB], random forest [RF], and support vector machine [SVM]) and 4 deep learning models (i.e., convolutional neural network [CNN], recurrent neural network [RNN], long short-term memory [LSTM] RNN, and gated recurrent unit [GRU] RNN). Of the 7,220 tweets annotated through Amazon MTurk, 3,000 were used for classifier training (n = 2,000) and testing (n = 1,000). The rest of the MTurk-annotated dataset was used for the subsequent active learning experiments. Each classifier was trained 10 times, and 95% confidence intervals (CIs) of the mean were reported. We explored two language models as features for the classifiers (i.e., n-gram and word embedding). All the machine learning classifiers were developed with n-gram features, while we used both n-gram and word-embedding features with the CNN classifier to test which feature set is more suitable for deep learning classifiers. The CNN classifier with word-embedding features performed better, which is consistent with other studies [le_comparative_2018, badjatiya_deep_2017].
We then selected one machine learning classifier and one deep learning classifier based on prediction performance (i.e., F-score). Logistic regression was used as the baseline classifier.
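The mean and 95% CI over 10 training runs can be computed as in the sketch below. The text does not state which interval method was used, so the Student's t interval here is an assumption, as are the helper name `mean_ci95` and the example scores, which are made up for illustration and are not results from the study.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 9 degrees of freedom
# (10 runs). Assumption: the study does not name its CI method.
T_CRIT_DF9 = 2.262


def mean_ci95(scores):
    """Return (mean, lower, upper) for exactly 10 scores using a t interval."""
    if len(scores) != 10:
        raise ValueError("T_CRIT_DF9 is only valid for 10 runs")
    mean = statistics.mean(scores)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    half_width = T_CRIT_DF9 * sem
    return mean, mean - half_width, mean + half_width


# Made-up F-scores for illustration only; not results from the study.
f_scores = [0.74, 0.76, 0.73, 0.75, 0.74, 0.77, 0.72, 0.75, 0.74, 0.76]
mean, low, high = mean_ci95(f_scores)
print(f"F-score {mean:.3f} (95% CI [{low:.3f}, {high:.3f}])")
```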
2.4 Pool-based Active Learning
In pool-based sampling for active learning, instances are drawn from a pool of samples according to an informativeness measure, and the most informative instances are then selected to be annotated. This is the most common scenario in active learning studies [min_efficient_2017]. The informativeness measures of the pool instances are called active learning strategies (or query strategies). We evaluated 4 active learning strategies (i.e., least confident, entropy, vote entropy, and KL divergence). Fig 1.C shows the workflow of our pool-based active learning experiments: for a given active learning strategy and classifiers trained with an initial set of training data, (1) the classifiers make predictions on the remaining to-be-labeled dataset; (2) a set of samples is selected using the specific active learning strategy and annotated by human reviewers; and (3) the classifiers are retrained with the newly annotated set of tweets. We repeated this process iteratively until the pool of data was exhausted. For the least confident and entropy active learning strategies, we used the best-performing machine learning classifier and the best-performing deep learning classifier plus the baseline classifier (LR). Note that vote entropy and KL divergence are query-by-committee strategies, which were tested on three deep learning classifiers (i.e., CNN, RNN, and LSTM) and three machine learning classifiers (i.e., LR, RF, and SVM) as two separate committees.
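To make the four informativeness measures concrete, the sketch below follows their standard textbook definitions as we understand them from the active learning literature [settles_active_2009]. It is not the implementation used in the study; the function names and the `top_k` helper are hypothetical.

```python
import math
from collections import Counter


def least_confident(probs):
    """1 - P(most likely class); higher means the model is less confident."""
    return 1.0 - max(probs)


def entropy(probs):
    """Shannon entropy of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def vote_entropy(votes):
    """Entropy of the hard labels voted by the committee members."""
    counts = Counter(votes)
    return entropy([n / len(votes) for n in counts.values()])


def kl_divergence_to_mean(member_probs):
    """Average KL divergence of each member from the committee consensus."""
    n_classes = len(member_probs[0])
    # Consensus = mean of the members' predicted distributions.
    consensus = [
        sum(p[k] for p in member_probs) / len(member_probs)
        for k in range(n_classes)
    ]
    total = 0.0
    for probs in member_probs:
        total += sum(
            p * math.log(p / q) for p, q in zip(probs, consensus) if p > 0
        )
    return total / len(member_probs)


def top_k(scores, k):
    """Indices of the k highest-scoring (most informative) pool instances."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

The first two functions take the class probabilities of a single classifier, whereas the two query-by-committee measures take the votes or the probability vectors of all committee members for one instance.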
3.1 Data Collection
Our data came from two different sources as shown in Table 1. First, we collected 2,803,164 tweets using the Twitter search API [noauthor_twitter_nodate] from December 10, 2018 to December 26, 2018 based on a list of job loss-related keywords (n = 68). After filtering out duplicates and non-English tweets, 1,952,079 tweets were left. Second, we used the same list of keywords to identify relevant tweets from a database of historical random public tweets we collected from January 1, 2013 to December 30, 2017. We found 1,733,905 relevant tweets in this database. Due to the different mechanisms behind the two Twitter APIs (i.e., streaming API vs. search API), the volumes of tweets from the two data sources were substantially different. With the Twitter search API, users can retrieve most of the public tweets related to the provided keywords within 10 to 14 days before the time of data collection, while the Twitter streaming API returns a random sample (i.e., roughly 1% to 20%, varying across the years) of all public tweets at the time and covers a wide range of topics. After integrating the tweets from the two data sources, there were 3,685,984 unique tweets.
|Data source||Year||# of tweets||# of English tweets|
3.2 RQ1. How does (1) the amount of time that a human worker spends on and (2) the number of workers assigned to each annotation task impact the quality of annotation results?
We randomly selected 7,220 tweets from our Twitter data based on keyword distributions and had those tweets annotated by workers recruited through Amazon MTurk. Each tweet was also annotated by an expert annotator (i.e., one of the authors). We treated the consensus answer of the crowdsourcing workers (i.e., at least 5 annotators for each tweet assignment) and the expert annotator as the gold standard. Using control tweets is a common strategy to identify workers who cheat (e.g., randomly select an answer without reading the instructions and/or tweets) on annotation tasks. We introduced two control tweets in each annotation assignment, where each annotation assignment contained a total of 12 tweets (including the 2 control tweets). Only responses in which both control tweets were answered correctly were considered valid, and only for those responses did the worker receive the 10-cent incentive.
The amount of time that a worker spends on a task is another factor associated with annotation quality. We measured the time one spent clicking through the annotation task without thinking about the content and repeated the experiment five times. The mean amount of time spent on the task was 57.01 (95% CI [47.19, 66.43]) seconds. Thus, responses completed in less than 47 seconds were considered invalid regardless of how the control tweets were answered.
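The two validity rules described above (both control tweets correct, and at least 47 seconds spent) can be expressed as a simple filter. The sketch below is illustrative only; the `Assignment` record and its field names are hypothetical and do not correspond to the MTurk data format.

```python
from dataclasses import dataclass

MIN_SECONDS = 47              # click-through lower bound reported above
CONTROLS_PER_ASSIGNMENT = 2   # control tweets in each 12-tweet assignment


@dataclass
class Assignment:
    """Hypothetical record of one worker's response to one assignment."""
    worker_id: str
    seconds_spent: float
    controls_correct: int  # number of control tweets answered correctly


def is_valid(a: Assignment) -> bool:
    """Valid only if both control tweets are correct and time is >= 47 s."""
    return (
        a.controls_correct == CONTROLS_PER_ASSIGNMENT
        and a.seconds_spent >= MIN_SECONDS
    )
```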
We then conducted two experiments to explore the relation between the amount of time that workers spend on annotation tasks and annotation quality. Fig 2.A shows annotation quality for different lower cut-off times (i.e., only assignments where workers spent more time than the cut-off were considered valid responses), which tests whether the annotation is of low quality when workers spent more time on the task. The performance of the crowdsourcing workers was measured by the agreement (i.e., Cohen’s kappa) between the labels from each crowdsourcing worker and the gold-standard labels. Fig 2.B shows annotation quality for different upper cut-off times (i.e., only assignments where workers spent less time than the cut-off were kept), which tests whether the annotation is of low quality when workers spent less time on the task. As shown in Fig 2.A and B, spending more time on the task did not affect annotation quality, whereas annotation quality was significantly lower when the worker spent less than 90 seconds on the task.
We also compared the annotation reliability (i.e., Fleiss’ kappa score) of using 3 workers vs. 5 workers. The Fleiss’ kappa score of 3 workers is 0.53 (95% CI [0.46, 0.61]). The Fleiss’ kappa score of 5 workers is 0.56 (95% CI [0.51, 0.61]). Because the two confidence intervals overlap substantially, using 3 rather than 5 workers made no meaningful difference in annotation reliability, and using only 3 workers is cheaper.
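For reference, Fleiss’ kappa for a fixed number of raters per item can be computed as in the following sketch, which follows the standard definition of the statistic. The input layout (one row of per-category rating counts per tweet) and the function name `fleiss_kappa` are assumptions made for illustration.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters who put item i in category j.

    Every item must be rated by the same number of raters. The statistic is
    undefined (division by zero) if all ratings fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number of ratings")

    # Per-item agreement: fraction of rater pairs that agree on the item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    n_categories = len(counts[0])
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

With 3 workers and two labels (job loss event or not), each row would hold two counts that sum to 3, for example `[2, 1]` when two workers chose the first label.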
3.3 RQ2. Which active learning strategy is most efficient and cost-effective to build event classification models using Twitter data?
We randomly selected 3,000 tweets from the 7,220 MTurk-annotated tweets to build the initial classifiers. Two thousand of the 3,000 tweets were used to train the classifiers, and the remaining 1,000 tweets were used as an independent test dataset to benchmark their performance. We explored 4 machine learning classifiers (i.e., Logistic Regression [LR], Naïve Bayes [NB], Random Forest [RF], and Support Vector Machine [SVM]) and 4 deep learning classifiers (i.e., Convolutional Neural Network [CNN], Recurrent Neural Network [RNN], Long Short-Term Memory [LSTM], and Gated Recurrent Unit [GRU]). Each classifier was trained 10 times. The performance was measured in terms of precision, recall, and F-score. The 95% confidence intervals (CIs) of the mean F-score across the ten runs were also reported. Table 2 shows the performance of the classifiers. We chose logistic regression as the baseline model. RF and CNN were chosen for the subsequent active learning experiments, since they outperformed the other machine learning and deep learning classifiers, respectively.
|Feature||Model name||Precision||Recall||F-score||95% CIs of F-score|
|LSTM (GRU)||0.75||0.75||0.74||(0.72, 0.76)|
We implemented a pool-based active learning pipeline to test which combination of classifier and active learning strategy is most efficient for building an event classifier for Twitter data. We queried the top 300 most “informative” tweets from the rest of the pool (i.e., excluding the tweets used for training the classifiers) at each iteration. Table 3 shows the active learning and classifier combinations that we evaluated. The performance of the classifiers was measured by F-score. Fig 3 shows the results of the different active learning strategies combined with LR (i.e., the baseline), RF (i.e., the best-performing machine learning model), and CNN (i.e., the best-performing deep learning model). For both machine learning models (i.e., LR and RF), the entropy strategy reached the optimal performance the quickest (i.e., with the fewest tweets), whereas the least confident strategy did not show any clear advantage over random selection. For the deep learning model (i.e., CNN), none of the active learning strategies tested improved the classifier’s performance. Fig 4 shows the results of the query-by-committee algorithms (i.e., vote entropy and KL divergence) combined with the machine learning and deep learning ensemble classifiers. The query-by-committee algorithms were slightly better than random selection when applied to the machine learning ensemble classifier. However, they were not useful for the deep learning ensemble classifier.
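The simulation loop described in this section, which queries a batch of the most informative tweets, reveals their existing labels, retrains, and repeats until the pool is empty, can be sketched as follows. This is a simplified illustration rather than the pipeline used in the study; the callables `fit`, `predict_proba`, and `strategy` are hypothetical placeholders for a classifier and a query strategy.

```python
def simulate_active_learning(labeled, pool, fit, predict_proba, strategy,
                             batch_size=300):
    """Pool-based active learning simulation.

    labeled, pool: lists of (text, label) pairs; pool labels stay hidden
    from the model until an instance is queried.
    fit(pairs) -> model; predict_proba(model, text) -> class probabilities;
    strategy(probs) -> informativeness score (higher = query sooner).
    Yields (number of labeled instances, model) after each (re)training.
    """
    labeled, pool = list(labeled), list(pool)
    model = fit(labeled)
    yield len(labeled), model
    while pool:
        # Score every remaining pool instance with the current model.
        scores = [strategy(predict_proba(model, text)) for text, _ in pool]
        ranked = sorted(range(len(pool)), key=lambda i: scores[i], reverse=True)
        chosen = set(ranked[:batch_size])
        # "Annotate" the chosen instances by revealing their stored labels.
        labeled += [pool[i] for i in ranked[:batch_size]]
        pool = [item for i, item in enumerate(pool) if i not in chosen]
        model = fit(labeled)
        yield len(labeled), model
```

Evaluating each yielded model on a held-out test set gives a curve of F-score against the number of labeled tweets. A random-query baseline can be obtained by passing a strategy that ignores its input and returns a random score.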
|LR||Random query, least confident, entropy|
|RF||Random query, least confident, entropy|
|CNN||Random query, least confident, entropy|
|Ensemble||Vote entropy, KL divergence|
The goal of our study was to test the feasibility of building classifiers by using crowdsourcing and active learning strategies. We had 7,220 sample job loss-related tweets annotated using Amazon MTurk, tested 8 classification models, and evaluated 4 active learning strategies to answer our two RQs.
The key benefit of crowdsourcing is having a large number of workers available to carry out tasks on a piecework basis. This means that the crowd is likely to start working on tasks almost immediately and that a large number of tasks can be completed quickly. However, even well-trained workers are only human and can make mistakes. Our first RQ was to find an optimal and economical way to get reliable annotations from crowdsourcing. Beyond using control tweets, we tested different cut-off times to assess how the amount of time workers spent on the task would affect annotation quality. We found that the annotation quality was low if the tasks were finished within 90 seconds. We also found that the annotation quality was not affected by the number of workers (i.e., a 3-worker group vs. a 5-worker group), which was also demonstrated by Mozafari et al. [mozafari_scaling_2014].
For the second RQ, we aimed to find which active learning strategy is most efficient and cost-effective for building event classification models using Twitter data. We started by selecting representative machine learning and deep learning classifiers. Among the 4 machine learning classifiers (i.e., LR, NB, RF, and SVM), the LR and RF classifiers had the best performance on the task of identifying job loss events from tweets. Among the 4 deep learning methods (i.e., CNN, RNN, LSTM, and GRU), CNN had the best performance.
In active learning, the learning algorithm proactively selects a subset of available examples to be manually labeled next from a pool of as yet unlabeled instances. The fundamental idea behind the concept is that a machine learning algorithm could potentially achieve better accuracy more quickly and with less training data if it were allowed to choose the most informative data to learn from. In our experiment, we found that the entropy strategy was the fastest and most efficient way to build machine learning models. Vote entropy and KL divergence, the query-by-committee active learning methods, were helpful for training the machine learning ensemble classifier. However, none of the active learning strategies we tested worked well with the deep learning model (i.e., CNN) or the deep learning-based ensemble classifier.
We also recognize the limitations of our study. First, we only tested 5 classifiers (i.e., LR, RF, CNN, a machine learning ensemble classifier, and a deep learning ensemble classifier) and 4 active learning strategies (i.e., least confident, entropy, vote entropy, and KL divergence). Other state-of-the-art methods for building tweet classifiers (e.g., BERT [devlin_bert:_2018]) and other active learning strategies (e.g., variance reduction [yang_variance_2018]) are worth exploring. Second, other crowdsourcing quality control methods, such as using prequalification questions to identify high-quality workers, also warrant further investigation. Third, the crowdsourcing and active learning pipeline can potentially be applied to other data and tasks. However, more experiments are needed to test its feasibility. Fourth, the current study only focused on which active learning strategy is most efficient and cost-effective for building event classification models using crowdsourced labels. Other research questions, such as how the correctness of the crowdsourced labels would impact classifier performance, warrant future investigation.
In sum, our study demonstrated that crowdsourcing combined with active learning is a feasible way to build machine learning classifiers efficiently. However, the active learning strategies we tested did not benefit the deep learning classifiers in our study.
This study was supported by NSF Award #1734134.