Twitter Job/Employment Corpus: A Dataset of Job-Related Discourse Built with Humans in the Loop

by   Tong Liu, et al.
Rochester Institute of Technology

We present the Twitter Job/Employment Corpus, a collection of tweets annotated by a humans-in-the-loop supervised learning framework that integrates crowdsourcing contributions and expertise on the local community and employment environment. Previous computational studies of job-related phenomena have used corpora collected from workplace social media that are hosted internally by the employers, and so lack independence from latent job-related coercion and the broader context that an open-domain, general-purpose medium such as Twitter provides. Our new corpus promises to be a benchmark for the extraction of job-related topics and for advanced analysis and modeling, and can potentially benefit a wide range of research communities in the future.





Working American adults spend more than one third of their daily time on job-related activities [Bureau of Labor Statistics2013]—more than on anything else. Any attempt to understand a working individual’s experiences, state of mind, or motivations must take into account their life at work. In the extreme, job dissatisfaction poses serious health risks and even leads to suicide [Bureau of Labor Statistics2009, Hazards Magazine2014].

Conversely, behavioral and mental problems greatly affect employees’ productivity and loyalty. 70% of US workers are disengaged at work [Gallup2013]. Each year, lost productivity costs between 450 and 550 billion dollars. Disengaged workers are 87% more likely to leave their jobs than their more satisfied counterparts [Gallup2013]. Deaths by suicide among working-age people (25-64 years old) cost more than $44 billion annually [Centers for Disease Control and Prevention2013]. By contrast, behaviors such as helpfulness, kindness and optimism predict greater job satisfaction and positive or pleasurable engagement at work [Harzer and Ruch2013].

A number of computational social scientists have studied organizational behavior, professional attitudes, working mood and affect [Yardi, Golder, and Brzozowski2008, Kolari et al.2007, Brzozowski2009, De Choudhury and Counts2013], but in each case the data they investigated were collected from internal interactive platforms hosted by the workers’ employers.

These studies are valuable in their own right, but one evident limitation is that each dataset is limited to depicting a particular company and excludes the populations who have no access to such restricted networks (e.g., people who are not employees of that company). Moreover, the workers may be unwilling to express, e.g., negative feelings about work (“I don’t wanna go to work today”), unprofessional behavior (“Got drunk as hell last night and still made it to work”), or a desire to work elsewhere (“I want to go work at Disney World so bad”) on platforms controlled by their employers.

A major barrier to studying job-related discourse on general-purpose, public social media—one that the previous studies did not face—is the problem of determining which posts are job-related in the first place. There is no authoritative training data available to model this problem. Since the datasets used in previous work were collected in the workplace during worktime, the content is implicitly job-related. By contrast, the subject matter of public social media is much more diverse. People with various life experiences may have different criteria for what constitutes a “job” and describe their jobs differently.

For instance, a tweet like “@SOMEONE @SOMEONE shit manager shit players shit everything” contains the job-related signal word “manager,” yet the presence of “players” ultimately suggests that this tweet is talking about a sports team. Another example, “@SOMEONE anytime for you boss lol,” might seem job-related, but “boss” here could also simply refer to a friend in an informal, familiar register.

Extracting job-related information from Twitter can be valuable to a range of stakeholders. For example, public health specialists, psychologists and psychiatrists could use such first-hand reportage of work experiences to monitor job-related stress at a community level and provide professional support if necessary. Employers might analyze these data and use it to improve how they manage their businesses. It could help employees to maintain better online reputations for potential job recruiters as well. It is also meaningful to compare job-related tweets against non-job-related discourse to observe and understand the linguistic and behavioral similarities and differences between on- and off-hours.

Our main contributions are:

  1. We construct and provide a corpus of annotated tweets, the Twitter Job/Employment Corpus, which contains approximately 0.2 million job-related tweets and 6.8 million not-job-related tweets. To the best of our knowledge, we are the first to extract and study job-related discourse in general-purpose, public social media.

  2. We develop and improve an effective humans-in-the-loop classification framework for open-domain concepts such as job/employment that alternates between human annotation and automatic predictions by machine learning techniques over multiple iterations. This integrated mechanism largely reduces the human efforts in corpus annotation.

  3. We propose a qualified heuristic to separate business accounts from personal accounts relying on their linguistic styles and posts history.

Background and Related Work

Social media accounts for about 20% of the time spent online [comScore2011]. Online communication can embolden people to reveal their cognitive state in a natural, un-self-conscious manner [iKeepSafe2014]. Mobile phone platforms help social media to capture personal behaviors whenever and wherever possible [De Choudhury et al.2013, Sadilek et al.2014]. These signals are often temporal, and can reveal how phenomena change over time. Thus, aspects about individuals or groups, such as preferences and perspectives, affective states and experiences, communicative patterns, and socialization behaviors can, to some degree, be analyzed and computationally modeled continuously and unobtrusively [De Choudhury et al.2013].

Twitter has drawn much attention from researchers in various disciplines, in large part because of the volume and granularity of its publicly available social data. This micro-blogging website, which was launched in 2006, had attracted more than 500 million registered users by 2012, with 340 million tweets posted every day. Twitter supports directional connections (followers and followees) in its social network, and allows for geographic information about where a tweet was posted if a user enables location services. The large volume and desirable features provided by Twitter make it a well-suited source of data for our task.

We focus on a broad discourse and narrative theme that touches most adults worldwide. Measures of the volume, content and affect of job-related discourse on social media may help understand the behavioral patterns of working people, predict labor market changes, monitor and control satisfaction/dissatisfaction with respect to their workplaces or colleagues, and help people strive for positive change [De Choudhury and Counts2013]. The language differences exposed in social media have been observed and analyzed in relation to location [Cheng, Caverlee, and Lee2010], gender, age, regional origin, and political orientation [Rao et al.2010].

However, probably because of the natural challenges of Twitter messages (the conversational style of interactions, the lack of traditional spelling rules, and the 140-character limit of each message), we barely see similar public Twitter datasets investigating open-domain problems like job/employment in the computational linguistics or social science fields. Li et al. [Li et al.2014] proposed a pipelined system to extract a wide variety of major life events, including job, from Twitter. Their key strategy was to build a relatively clean training dataset from a large volume of Twitter data with minimum human effort. Their real-world testing demonstrates the capability of their system to identify major life events accurately. The most closely related work that we can leverage here is the method and corpus developed by Liu et al. [Liu et al.2016], an effective supervised learning system to detect job-related tweets from individual and business accounts. To fully utilize the existing resources, we build upon the corpus of [Liu et al.2016] to construct and contribute our more fine-grained corpus of job-related discourse, with improvements to the classification methods.

Data and Methods

Figure 1 shows the workflow of our humans-in-the-loop framework. It has multiple iterations of human annotations and automatic machine learning predictions, followed by some linguistic heuristics, to extract job-related tweets from personal and business accounts.

Figure 1: Our humans-in-the-loop framework collects labeled data by alternating between human annotation and automatic prediction models over multiple rounds. Each diamond represents an automatic classifier (C), and each trapezoid represents human annotations (R). Each classifier filters and provides machine-predicted labels to tweets that are published to human annotators in the consecutive round. The human-labeled tweets are then used as training data by the succeeding automatic classifier. We use two types of classifiers: rule-based classifiers (C0 and C4) and support vector machines (C1, C2, C3 and C5). This framework serves to reduce the amount of human effort needed to acquire large amounts of high-quality labeled data.

Compared to the framework introduced in [Liu et al.2016], our improvements include: introducing a new rule-based classifier (C4), conducting an additional round of crowdsourcing annotations (R4) to enrich the human-labeled data, and training a classification model with enhanced performance (C5), which was ultimately used to label the unseen data.

Data Collection

Using the DataSift Firehose, we collected historical tweets from public accounts with geographical coordinates located in a 15-county region surrounding a medium-sized US city from July 2013 to June 2014. This one-year data set contains over 7 million geo-tagged tweets (approximately 90% written in English) from around 85,000 unique Twitter accounts. This particular locality has geographical diversity, covering both urban and rural areas and providing mixed and balanced demographics. We could apply local knowledge to the construction of our final job-related corpus, which proved very helpful in the later experiments.

Initial Classifier

In order to identify probable job-related tweets that talk about paid positions of regular employment while excluding noise (such as students discussing homework or school-related activities, or people complimenting others), we defined a simple term-matching classifier with inclusion and exclusion terms in the first step (see Table 1).

Classifier C0 consists of two rules: a matched tweet must contain at least one word in the Include lexicon, and it cannot contain any word in the Exclude lexicon. Before applying the filtering rules, we pre-processed each tweet by (1) converting all words to lower case; (2) stripping out punctuation and special characters; and (3) normalizing the tweets by mapping out-of-vocabulary phrases (such as abbreviations and acronyms) to standard phrases using a dictionary of more than 5,400 Internet slang terms.

This filtering yielded over 40,000 matched tweets having at least five words, referred to as job-likely.

Include   job, jobless, manager, boss, my/your/his/her/their/at work
Exclude   school, class, homework, student, course, finals, good/nice/great job, boss ass*
* boss ass: describes something awesome in a sense of utter dominance, magical superiority, or being ridiculously good.
Table 1: The lexicons used by C0 to extract the job-likely set.
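The two rules of the initial term-matching classifier can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the slash-separated entries of Table 1 are expanded by hand, and the slang-normalization step is omitted because the dictionary is not part of this text.

```python
import re

# Lexicons transcribed from Table 1, with slash-separated alternatives expanded.
INCLUDE = ["job", "jobless", "manager", "boss",
           "my work", "your work", "his work", "her work", "their work", "at work"]
EXCLUDE = ["school", "class", "homework", "student", "course", "finals",
           "good job", "nice job", "great job", "boss ass"]


def preprocess(tweet: str) -> str:
    """Lowercase the tweet and replace punctuation/special characters with spaces."""
    text = re.sub(r"[^a-z0-9\s]", " ", tweet.lower())
    return " ".join(text.split())


def contains_term(text: str, term: str) -> bool:
    """Match a word or phrase on whole-token boundaries."""
    return f" {term} " in f" {text} "


def is_job_likely(tweet: str, min_words: int = 5) -> bool:
    """Apply both rules: at least one Include term and no Exclude term."""
    text = preprocess(tweet)
    if len(text.split()) < min_words:
        return False
    included = any(contains_term(text, term) for term in INCLUDE)
    excluded = any(contains_term(text, term) for term in EXCLUDE)
    return included and not excluded
```

Note that a bare "work" is not in the Include lexicon, so a tweet such as "still made it to work" would not match under this sketch unless another term is present.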

Crowdsourced Annotation R1

Our conjecture about crowdsourced annotations, based on the experiments and conclusions of [Snow et al.2008], is that non-expert contributors can produce annotations of comparable quality to gold-standard annotations from experts. Likewise, tweets labeled with high inter-annotator agreement among multiple non-expert annotators from crowdsourcing platforms should be as effective for building robust models as expert-labeled data.

We randomly chose around 2,000 job-likely tweets and split them equally into 50 subsets of 40 tweets each. In each subset, we additionally duplicated five randomly chosen tweets in order to measure intra-annotator agreement and consistency. We then constructed Amazon Mechanical Turk (AMT) Human Intelligence Tasks (HITs) to collect reference annotations from crowdsourcing workers. We assigned 5 crowdworkers to each HIT, an empirical scale for crowdsourced linguistic annotation tasks suggested by previous studies [Callison-Burch2009, Evanini, Higgins, and Zechner2010]. Crowdsourcing workers were required to live in the United States and to have an approval rating of 90% or better. They were instructed to read each tweet and answer the following question: “Is this tweet about job or employment?” The answer Y represents job-related and N represents not job-related. Workers were allowed to work on as many distinct HITs as they liked.

We paid each worker $1.00 per HIT and gave extra bonuses to those who completed multiple HITs. We rejected workers who did not provide consistent answers to the duplicate tweets in each HIT. Before publishing the HITs to crowdsourcing workers, we consulted with Turker Nation to ensure that we treated and compensated workers fairly for their requested tasks.

Given the sensitive nature of this work, we anonymized all tweets to minimize any inadvertent disclosure of personal information (names) or cues about an individual’s online identity (URLs) before publishing tweets to crowdsourcing workers. We replaced names with @SOMEONE, and recognizable URLs with HTTP://URL. No attempt was ever made to contact or interact with any user.
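A minimal sketch of such an anonymization step is shown below. It assumes that names appear as @-mentions and that the placeholders are the `@SOMEONE` and `HTTP://URL` tokens visible in the example tweets quoted in this paper; the regular expressions and the function name are illustrative, not the authors' code.

```python
import re

MENTION = re.compile(r"@\w+")                      # Twitter-style user mentions
URL = re.compile(r"https?://\S+", re.IGNORECASE)   # http/https links


def anonymize(tweet: str) -> str:
    """Replace user mentions and URLs with fixed placeholder tokens."""
    tweet = MENTION.sub("@SOMEONE", tweet)
    return URL.sub("HTTP://URL", tweet)
```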

This labeling round yielded 1,297 tweets labeled with unanimous agreement among five workers, i.e., all five workers gave the same label to a tweet. Of these, 1,027 were labeled job-related and the remaining 270 not job-related. They compose the first part of our human-annotated dataset, named Part-1.
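Filtering for unanimous agreement is straightforward to express in code. The sketch below assumes annotations are stored as a mapping from a tweet id to the list of Y/N labels it received; this data layout and the function names are hypothetical.

```python
from collections import Counter


def vote(labels):
    """Return (most_common_label, number_of_votes_for_it)."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count


def unanimous_subset(annotations):
    """Keep only the tweets on which every annotator gave the same label."""
    kept = {}
    for tweet_id, labels in annotations.items():
        label, count = vote(labels)
        if count == len(labels):
            kept[tweet_id] = label
    return kept
```

For example, a tweet labeled Y by three workers and N by two is dropped from the unanimous set, while one labeled Y by all five is kept with label Y.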

Training Helper Labeler

Feature Preparation

We relied on textual representations, a feature space of n-grams (unigrams, bigrams and trigrams), for training. Due to the noisy nature of Twitter, where users frequently write short messages with informal spelling and grammar, we pre-processed the input data in the following steps: (1) we used a revised Twokenizer system, which was specially trained on Twitter texts [Owoputi et al.2013], to tokenize raw messages, and (2) we completed stemming and lemmatization using the WordNet Lemmatizer [Bird, Klein, and Loper2009].
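The n-gram feature space can be illustrated with a small sketch. It assumes the tweet has already been tokenized; plain whitespace splitting stands in for the Twokenizer here, and no stemming or lemmatization is applied, so this shows only the shape of the features, not the authors' pipeline.

```python
from collections import Counter


def ngrams(tokens, n_max=3):
    """Return all unigrams, bigrams and trigrams (up to n_max) as strings."""
    feats = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            feats.append(" ".join(tokens[i:i + n]))
    return feats


def featurize(tweet):
    """Bag-of-n-grams counts for one tweet (whitespace tokenization is an assumption)."""
    return Counter(ngrams(tweet.lower().split()))
```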

Parameter Selection

Considering the class imbalance situations in the training dataset, we selected the optimal learning parameters by grid-searching on a range of class weights for the positive (job-related) and negative (not job-related) classes, and then chose the estimator that optimized F1 score, using 10-fold cross validation.

First Helper

In the Part-1 set, there are 1,027 job-related and 270 not job-related tweets. To construct a balanced training set for C1, we randomly chose 757 tweets outside the job-likely set (which were classified as negative by C0). Admittedly, these additional samples do not necessarily represent true negative (not job-related) tweets, as they have not been manually checked. The noise introduced into the framework would be handled by the next round of crowdsourced annotations.

We trained our first SVM classification model C1 and then used it to label the remaining data in our data pool.

Crowdsourced Annotation R2

We conducted the second round of labeling on a subset of C1-predicted data to evaluate the effectiveness of the aforementioned helper C1 and to collect more human-labeled data for building a class-balanced set (for training more robust models).

After separating positive- and negative-labeled (job-related vs. not job-related) tweets, we sorted each class in descending order of confidence score. We then spot-checked the tweets to estimate how the frequency of job-related tweets changes with the confidence score. We discovered that about half of the top-ranked tweets in the positive class are truly job-related, whereas almost none of the tweets near the separating hyperplane (i.e., where the confidence scores are near zero) are.

We randomly selected 2,400 tweets from those in the top 80th percentile of confidence scores in the positive class (Type-1). The Type-1 tweets are automatically classified as positive, but some of them may not be job-related in the ground truth. Such tweets are the ones on which C1 fails even though it is very confident. We also randomly selected about 800 tweets from those having confidence scores closest to zero on the positive side, and another 800 tweets from the negative side (Type-2). These 1,600 tweets have very low confidence scores, representing those that C1 cannot clearly distinguish. Thus the automatic predictions for the Type-2 tweets have a high chance of being wrong. Hence, we considered both the clearer core and the gray-zone periphery of this meaningful phenomenon.
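The sampling of confident and borderline tweets can be sketched as follows. This is one possible reading, under stated assumptions: a positive confidence score means the job-related side of the hyperplane, "top 80th percentile" is interpreted as the top 20% of positive scores, and the borderline tweets are taken deterministically as those closest to zero rather than sampled. All names are hypothetical.

```python
import random


def select_for_annotation(scored, n_type1, n_type2_each, top_fraction=0.2, seed=0):
    """scored: list of (tweet_id, confidence); returns (type1, type2) lists."""
    rng = random.Random(seed)
    # Sort each side in descending order of confidence score.
    positives = sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)
    negatives = sorted((s for s in scored if s[1] < 0), key=lambda s: s[1], reverse=True)

    # Type-1: random sample from the most confident positive predictions.
    top = positives[:max(1, int(len(positives) * top_fraction))]
    type1 = rng.sample(top, min(n_type1, len(top)))

    # Type-2: predictions closest to zero, approached from each side.
    near_pos = positives[max(0, len(positives) - n_type2_each):]
    near_neg = negatives[:n_type2_each]
    return type1, near_pos + near_neg
```

With very small pools the two groups can overlap; a real run would need to deduplicate them.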

Crowdworkers again were asked to annotate this combination of Type-1 and Type-2 tweets in the same fashion as in R1. Table 2 records annotation details.

          Number of agreements among 5 annotators
          job-related              not job-related
          3      4      5          3      4      5
Type-1    129    280    713        50     149    1,079
Type-2    11     7      8          16     67     1,489
Table 2: Summary of annotations in R2 (showing when 3 / 4 / 5 of 5 annotators agreed).

Grouping Type-1 and Type-2 tweets with unanimous labels in R2 (the columns for 5 agreeing annotators in Table 2), we obtained the second part of our human-labeled dataset (Part-2).

Training Helper Labeler

Combining the Part-1 and Part-2 data into one training set of 4,586 annotated tweets with perfect inter-annotator agreement (1,748 job-related and 2,838 not job-related), we trained the machine labeler C2 in the same way as C1.

Community Annotation R3

Having conducted two rounds of crowdsourced annotations, we noticed that crowdworkers could not reach consensuses on a number of tweets which were not unanimously labeled. This observation intuitively suggests that non-expert annotators inevitably have diverse types of understanding about the job topic because of its subjectivity and ambiguity. Table 3 provides examples (selected from both R1 and R2) of tweets in six possible inter-annotator agreement combinations.

Sample Tweet
Really bored….., no entertainment at work today
two more days of work then I finally get a day off.
Leaving work at 430 and driving in this snow is going to be the death of me
Being a mommy is the hardest but most rewarding job a women can have #babyBliss #babybliss
These refs need to
One of the best Friday nights I’ve had in a while
Table 3: Inter-annotator agreement combinations and sample tweets.

Two experts from the local community with prior experience in employment were brought into this phase to review the tweets on which crowdworkers disagreed and to provide their own labels. The tweets with unanimous labels in the two rounds of crowdsourced annotations were not re-annotated by the experts, because unanimous votes are hypothesized to be as reliable as experts’ labels. Table 4 records the numbers of tweets these two community annotators corrected.

We have our third part of human-annotated data (Part-3): tweets reviewed and corrected by the community annotators.

R1 + R2      job-related    not job-related
Y Y Y Y N    644            5
Y Y Y N N    185            17
Y Y N N N    57             51
Y N N N N    11             301
Total        897            374
Table 4: Summary of R3 community-based reviewed-and-corrected annotations.

Training Helper Labeler

Combining Part-3 with all unanimously labeled data from the previous rounds (Part-1 and Part-2) yielded 2,645 gold-standard-labeled job-related and 3,212 not job-related tweets. We trained C3 on this entire training set.

Crowdsourced Validation of C0, C1, C2 and C3

These three learned labelers (C1, C2, and C3) are capable of annotating unseen tweets automatically. Their performance may vary due to the progressively increasing size of the training data.

To evaluate the models from different stages uniformly, including the initial rule-based classifier C0, we adopted a post-hoc evaluation procedure. We sampled 400 distinct, previously unused tweets from the data pool labeled by each of C0, C1, C2 and C3 (there is no intersection between any two sets of samples). We had these four classifiers label the combined test set of 1,600 samples. We then asked crowdsourcing workers to validate these 1,600 unique samples, using the same settings as in the previous rounds of crowdsourced annotations (R1 and R2). We took the majority votes (where at least 3 out of 5 crowdsourcing workers agreed) as reference labels for these testing tweets.

Table 5 displays the classification measures of the labels predicted by each model against the reference labels provided by crowdsourcing workers, and shows that C3 outperforms C0, C1 and C2.

Model   Class         P      R      F1
C0      job           0.72   0.33   0.45
        notjob        0.68   0.92   0.78
        avg / total   0.70   0.69   0.65
C1      job           0.79   0.82   0.80
        notjob        0.88   0.86   0.87
        avg / total   0.85   0.84   0.84
C2      job           0.82   0.95   0.88
        notjob        0.97   0.86   0.91
        avg / total   0.91   0.90   0.90
C3      job           0.83   0.96   0.89
        notjob        0.97   0.87   0.92
        avg / total   0.92   0.91   0.91
Table 5: Crowdsourced validations of samples identified by models C0, C1, C2 and C3.
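The per-class measures in such a validation follow the standard definitions of precision, recall and F1, computed against majority-vote reference labels. A minimal sketch, with hypothetical function names and label strings:

```python
from collections import Counter


def majority_label(votes):
    """Reference label: the label chosen by most annotators (e.g. 3 of 5)."""
    return Counter(votes).most_common(1)[0][0]


def precision_recall_f1(predicted, reference, positive):
    """Precision, recall and F1 for one class, given parallel label lists."""
    pairs = list(zip(predicted, reference))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Calling the function once with `positive="job"` and once with `positive="notjob"` gives the two class rows reported for each model.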

Crowdsourced Annotation R4

Even though C3 achieves the highest performance among the four, it has scope for improvement. We manually checked the tweets in the test set that were incorrectly classified as not job-related and focused on the language features we had ignored in preparation for the model training. After performing some pre-processing on the tweets in the false negative and true positive groups from the above testing phase, we ranked and compared their distributions of word frequencies. These two rankings reveal the differences between the two categories (false negative vs. true positive) and help us discover signal words that were prominent in the false negative group but not in the true positive group. If our trained models were able to recognize these features when forming the separating boundaries, the false negative rates would decrease and the overall performance would further improve.
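One simple way to carry out such a comparison is sketched below. The scoring rule (relative frequency among false negatives minus relative frequency among true positives) is an assumption made for illustration; the text only states that the two frequency rankings were compared. Whitespace tokenization and the function names are likewise hypothetical.

```python
from collections import Counter


def word_counts(tweets):
    """Count lowercase whitespace-separated tokens over a list of tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.lower().split())
    return counts


def candidate_signal_words(false_negatives, true_positives, top_k=10):
    """Words relatively more frequent in false negatives than in true positives."""
    fn_counts = word_counts(false_negatives)
    tp_counts = word_counts(true_positives)
    fn_total = sum(fn_counts.values()) or 1
    tp_total = sum(tp_counts.values()) or 1
    # Difference of relative frequencies; a Counter returns 0 for unseen words.
    scores = {w: c / fn_total - tp_counts[w] / tp_total for w, c in fn_counts.items()}
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return [w for w in ranked if scores[w] > 0][:top_k]
```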

Our fourth classifier C4 is again rule-based and is designed to extract more potential job-related tweets, especially those that would have been misclassified by our trained models. The lexicon in C4 includes the following signal words: career, hustle, wrk, employed, training, payday, company, coworker and agent.

We ran C4 on our data pool and randomly selected about 2,000 tweets that were labeled as positive by C4 and had never been used previously (i.e., not annotated, trained on or tested in C0, C1, C2, and C3). We published these tweets to crowdsourcing workers using the same settings as R1 and R2. The tweets with unanimously agreed labels in R4 form the last part of our human-labeled dataset (Part-4).

Table 6 summarizes the results from multiple crowdsourced annotation rounds (R1, R2 and R4).

      Number of agreements among 5 annotators
      job-related              not job-related
      3      4      5          3      4      5
R1    104    389    1,027      82     116    270
R2    140    287    721        68     216    2,568
R4    214    192    338        317    414    524
Table 6: Summary of crowdsourced annotations (R1, R2 and R4).

Training Labeler

Aggregating the separate parts of human-labeled data (Part-1 to Part-4), we obtained an integrated training set with 2,983 job-related tweets and 3,736 not job-related tweets and trained C5 upon it. We tested C5 using the same data as in the crowdsourced validation phase (1,600 tested tweets) and found that C5 beats the performance of the other models (Table 7).

Model   Class         P      R      F1
C5      job           0.83   0.97   0.89
        notjob        0.98   0.87   0.92
        avg / total   0.92   0.91   0.91
Table 7: Performances of C5.
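The training-set sizes quoted at each stage follow directly from the per-part counts reported in the annotation rounds. The short check below accumulates them; the counts are taken from the text and tables, while the dictionary layout and function name are illustrative.

```python
# (job-related, not job-related) counts for each human-labeled part.
PARTS = {
    "Part-1": (1027, 270),   # unanimous labels in R1
    "Part-2": (721, 2568),   # unanimous labels in R2
    "Part-3": (897, 374),    # expert-reviewed labels in R3
    "Part-4": (338, 524),    # unanimous labels in R4
}


def cumulative_sizes(parts):
    """Running totals of (job, notjob) after adding each part in order."""
    job = notjob = 0
    sizes = {}
    for name, (j, n) in parts.items():
        job += j
        notjob += n
        sizes[name] = (job, notjob)
    return sizes
```

The running totals are (1,748, 2,838) after Part-2, (2,645, 3,212) after Part-3 and (2,983, 3,736) after Part-4, matching the training sets described for the successive models.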

Table 8 lists the top 15 features for both classes in C5 with their corresponding weights. Positive features (job-related) unearth expressions about personal job satisfaction (lovemyjob) and announcements of working schedules (day off, break) beyond the rules defined in C0 and C4. Negative features (not job-related) identify phrases that comment on others’ work (your work, amazing job, awesome job, nut job) even though they contain “work” or “job,” and show that school- or game-themed messages (college career, play) are not classified into the job class, which meets our original intention.

job-related   weights    not job-related   weights
job           1.77       your work         -0.61
manager       1.71       like it           -0.60
work          1.69       amazing job       -0.59
wrk           1.44       did               -0.55
payday        1.23       nut               -0.45
my bos        1.06       nut job           -0.45
jobs          0.83       bos as            -0.43
lovemyjob     0.81       play              -0.41
at work       0.81       awesome job       -0.38
working       0.75       college career    -0.37
my career     0.74       high              -0.36
day off       0.73       doing             -0.35
boss          0.73       hustle            -0.35
service       0.71       you guy           -0.33
break         0.70       love your         -0.33
Table 8: Top 15 features for both classes of C5.

End-to-End Evaluation

The class distribution in the machine-labeled test data is roughly balanced, which is not the case in real-world scenarios, where not-job-related tweets are much more common than job-related ones.

We proposed an end-to-end evaluation: to what degree can our trained automatic classifiers (C1, C2, C3 and C5) identify job-related tweets in the real world? We introduced the estimated effective recall under the assumption that, for each model, the error rates in our test samples (1,600 tweets) are proportional to the actual error rates in the entire one-year data set, which resembles the real world. We labeled the entire data set using each classifier and defined the estimated effective recall $\hat{R}_E$ for each classifier as

$$\hat{R}_E = \frac{\frac{Y}{Y_s} \cdot R}{\frac{Y}{Y_s} \cdot R + \frac{N}{N_s} \cdot (1 - R)}$$

where $Y$ is the total number of classifier-labeled job-related tweets in the entire one-year data set, $N$ is the total number of not job-related tweets in the entire one-year data set, $Y_s$ is the number of classifier-labeled job-related tweets in our 1,600-sample test set, $N_s = 1{,}600 - Y_s$, and $R$ is the recall of the job class in our test set, as reported in Tables 5 and 7.

                             C1           C2           C3           C5
Y                            115,696      195,442      190,471      233,187
N                            6,990,633    6,910,887    6,915,858    6,873,142
Y_s                          512          691          707          729
N_s                          1,088        909          893          871
R                            0.82         0.95         0.96         0.97
Estimated effective recall   0.14         0.41         0.46         0.57
Table 9: Estimated effective recalls for different trained models (C1, C2, C3 and C5) to identify job-related tweets in real world setting.
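The values in Table 9 can be checked numerically. The sketch below assumes the estimated effective recall takes the form (Y/Y_s)·R / ((Y/Y_s)·R + (N/N_s)·(1−R)); this form is an assumption, adopted because it reproduces the last row of the table from the other rows for all four columns. The function and variable names are illustrative.

```python
def estimated_effective_recall(Y, N, y_s, n_s, recall):
    """Scale test-set recall by the ratio of full-data to test-set class sizes."""
    found = (Y / y_s) * recall            # weight of job tweets the classifier finds
    missed = (N / n_s) * (1.0 - recall)   # weight of job tweets it misses
    return found / (found + missed)


# Columns of Table 9, left to right: (Y, N, Y_s, N_s, R).
TABLE_9_COLUMNS = [
    (115_696, 6_990_633, 512, 1_088, 0.82),
    (195_442, 6_910_887, 691, 909, 0.95),
    (190_471, 6_915_858, 707, 893, 0.96),
    (233_187, 6_873_142, 729, 871, 0.97),
]

estimates = [round(estimated_effective_recall(*col), 2) for col in TABLE_9_COLUMNS]
```

Under this assumption, `estimates` matches the reported row of 0.14, 0.41, 0.46 and 0.57.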

Table 9 shows that C5 can be used as a good classifier to automatically label the topic of unseen data as job-related or not.

Determining Sources of Job-Related Tweets

Through observation we noticed some patterns like:

Panera Bread: Baker - Night (#Rochester, NY) HTTP://URL #Hospitality #VeteranJob #Job #Jobs #TweetMyJobs

in the class of job-related tweets. Nearly every job-related tweet that contained at least one of the following hashtags: #veteranjob, #job, #jobs, #tweetmyjobs, #hiring, #retail, #realestate, #hr also had a URL embedded. We counted the tweets containing only the listed hashtags, and the tweets having both the queried hashtags and embedded URL, and summarized the statistics in Table 10. By spot checking we found such tweets always led to recruitment websites. This observation suggests that these tweets with similar “hashtags + URL” patterns originated from business agencies or companies instead of personal accounts, because individuals by common sense are unlikely to post recruitment advertising.

                hashtag only    hashtag + URL    %
#veteranjob     18,066          18,066           100.00
#job            79,359          79,326           99.96
#jobs           59,882          59,864           99.97
#tweetmyjobs    39,007          39,007           100.00
#hiring         622             619              99.52
#retail         17,107          17,105           99.99
#realestate     113             112              99.12
#hr             406             405              99.75
Table 10: Counts of tweets containing the queried hashtags only, and their subsets of tweets with URL embedded.

This motivated a simple heuristic that appeared surprisingly effective at determining which kind of accounts each job-related tweet was posted from: if an account had more job-related tweets matching the “hashtags + URL” patterns than tweets in other topics, we labeled it a business account; otherwise it is a personal account. We validated its effectiveness using the job-related tweets sampled by the models in crowdsourced evaluations phase. It is essential to note that when crowdsourcing annotators made judgment about the type of accounts as personal or business, they were shown only one target tweet—without any contexts or posts history which our heuristics rely on.
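The account-level heuristic can be sketched as follows. This is an illustrative reading, not the authors' code: every tweet that does not match the "hashtags + URL" pattern is counted as "other," URLs are detected by an `http(s)://` prefix (which also matches the anonymized `HTTP://URL` token), and the input is assumed to be a list of (account id, tweet text) pairs.

```python
import re
from collections import defaultdict

RECRUITING_HASHTAGS = {"#veteranjob", "#job", "#jobs", "#tweetmyjobs",
                       "#hiring", "#retail", "#realestate", "#hr"}
URL = re.compile(r"https?://\S+", re.IGNORECASE)
HASHTAG = re.compile(r"#\w+")


def matches_pattern(text):
    """True if the tweet has at least one listed hashtag and an embedded URL."""
    tags = {tag.lower() for tag in HASHTAG.findall(text)}
    return bool(tags & RECRUITING_HASHTAGS) and URL.search(text) is not None


def classify_accounts(tweets):
    """Label each account business if pattern tweets outnumber its other tweets."""
    pattern = defaultdict(int)
    other = defaultdict(int)
    for account, text in tweets:
        if matches_pattern(text):
            pattern[account] += 1
        else:
            other[account] += 1
    accounts = set(pattern) | set(other)
    return {a: "business" if pattern[a] > other[a] else "personal" for a in accounts}
```

Because the decision uses an account's whole post history, a single personal tweet that happens to carry #job and a link does not flip the account to business.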

Table 11 records the performance metrics and confirms that our heuristics to determine the sources of job-related tweets (personal vs. business accounts) are consistently accurate and effective.

From   Class       P      R      F1
C0     personal    1.00   0.98   0.99
       business    0.98   1.00   0.99
       avg/total   0.99   0.99   0.99
C1     personal    1.00   0.99   0.99
       business    0.99   1.00   0.99
       avg/total   0.99   0.99   0.99
C2     personal    1.00   0.99   0.99
       business    0.99   1.00   0.99
       avg/total   0.99   0.99   0.99
C3     personal    1.00   0.99   0.99
       business    0.99   1.00   0.99
       avg/total   0.99   0.99   0.99
Table 11: Evaluations of heuristics to determine the type of accounts (personal vs. business), job-related tweets sampled by different models in Table 5.

We used C5 to detect (not) job-related tweets, and applied our linguistic heuristics to further separate accounts into personal and business groups automatically.

Annotation Quality

To assess the labeling quality of multiple annotators in the crowdsourced annotation rounds (R1, R2 and R4), we calculated Fleiss’ kappa [Fleiss1971] and Krippendorff’s alpha [Krippendorff2004] using the online tool [Geertzen2016] to assess inter-annotator reliability among the five annotators of each HIT. We then calculated the average and standard deviation of the inter-annotator scores over the multiple HITs in each round. Table 12 records the inter-annotator agreement scores in the three rounds of crowdsourced annotations.

Round   Fleiss’ kappa   Krippendorff’s alpha
R1      0.62 ± 0.14     0.62 ± 0.14
R2      0.81 ± 0.09     0.81 ± 0.08
R4      0.42 ± 0.27     0.42 ± 0.27
Table 12: Inter-annotator agreement performance for our three rounds of crowdsourced annotations. Average ± stdev agreements are Good, Very Good and Moderate [Altman1991], respectively.
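For readers who want to reproduce this kind of score, Fleiss' kappa can be computed directly from per-item label counts. The sketch below implements the standard formula; it is not the online tool cited above, and the input layout (one row per tweet, one column per label, each row summing to the number of annotators) is an assumption.

```python
def fleiss_kappa(counts):
    """counts[i][j]: number of annotators who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Overall proportion of assignments falling in each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    # Observed agreement for each item.
    P_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    P_bar = sum(P_i) / n_items
    P_e = sum(p * p for p in p_j)   # agreement expected by chance
    if P_e == 1.0:
        return float("nan")         # undefined when only one category is ever used
    return (P_bar - P_e) / (1.0 - P_e)
```

For one HIT with five annotators and Y/N labels, each row would be `[number_of_Y, number_of_N]`; averaging the per-HIT scores gives a per-round figure.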

The inter-annotator agreement between the two expert annotators from the local community was assessed using Cohen’s kappa [Cohen1960], which indicates almost excellent agreement. Their joint efforts corrected more than 90% of the tweets that had collected divergent labels from crowdsourcing workers in R1 and R2.

We observe in Table 12 that annotators in R2 achieved the highest average inter-annotator agreement and the lowest standard deviation among the three rounds, suggesting that tweets in R2 have the highest level of confidence of being related to job/employment. As shown in Figure 1, the annotated tweets in R1 are the outputs from C0, the tweets in R2 are from C1, and the tweets in R4 are from C4. C1 is a supervised SVM classifier, while both C0 and C4 are rule-based classifiers. The higher agreement scores in R2 indicate that a trained SVM classifier can provide more reliable and less noisy predictions (i.e., labeled data). Further, the higher agreement scores in R1 than in R4 indicate that the rules in C4 are not as intuitive as those in C0 and introduce ambiguities. For example, the tweets “What a career from Vince young!” and “I hope Derrick Rose plays the best game of his career tonight” both use career but convey different information: the first tweet was talking about this professional athlete’s accomplishments, while the second tweet was actually commenting on the game the user was watching. Hence crowdsourcing workers working on C4 tasks read more ambiguous tweets and solved more difficult problems than those working on C0 tasks did. Considering that, it is not surprising that the inter-annotator agreement scores of R4 are the worst.

Dataset Description

Our dataset is available as a plain text file in JSON format. Each line represents one unique tweet with five attributes: the tweet id (tweet_id, a unique identification number generated by Twitter for each tweet), the topic label, job vs. notjob, assigned by human (topic_human) and machine (topic_machine), and the source label, personal vs. business, assigned by human (source_human) and machine (source_machine). NA represents “not applicable.” An example of a tweet in our corpus is shown as follows:


Table 13 provides the main statistics of our dataset with respect to the topic and source labels provided by human and machine.

Count of Labels        Human      Machine
Topic    job           2,978      233,187
         notjob        3,736    6,873,142
         NA              842
Source   personal      1,357    7,025,203
         business        232       81,126
         NA            5,966
Table 13: Statistics of our dataset labeled by human and machine.
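
As an illustration of how a file in this format might be consumed, the sketch below reads one JSON object per line and tallies the four label attributes, which is the kind of count summarized in Table 13. The attribute names come from the description above; the sample records, their placeholder ids, the string-valued fields, and the function name are assumptions made up for this example and are not taken from the corpus.

```python
import json
from collections import Counter

# Label attributes named in the dataset description.
LABEL_FIELDS = ("topic_human", "topic_machine", "source_human", "source_machine")


def count_labels(lines):
    """Tally label values per attribute from an iterable of JSON lines."""
    counts = {field: Counter() for field in LABEL_FIELDS}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        for field in LABEL_FIELDS:
            # Assumption: a missing attribute is treated like an explicit "NA".
            counts[field][record.get(field, "NA")] += 1
    return counts


# Made-up records with placeholder ids, for illustration only.
SAMPLE_LINES = [
    '{"tweet_id": "1", "topic_human": "job", "topic_machine": "job", '
    '"source_human": "personal", "source_machine": "personal"}',
    '{"tweet_id": "2", "topic_human": "NA", "topic_machine": "notjob", '
    '"source_human": "NA", "source_machine": "personal"}',
    '{"tweet_id": "3", "topic_human": "notjob", "topic_machine": "notjob", '
    '"source_human": "NA", "source_machine": "business"}',
]

sample_counts = count_labels(SAMPLE_LINES)
```

To process a real file, the same function can be given an open file handle, for example `count_labels(open(path, encoding="utf-8"))`, where `path` is wherever the downloaded file was saved.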

Terms and Conditions

According to the Twitter agreement and policy, we are only allowed to distribute tweet ids when providing downloadable datasets to third parties. This guarantees prompt response to content changes reported through the Twitter API, such as deletions or status changes (public/protected) of tweets.


We presented the Twitter Job/Employment Corpus and our approach for extracting discourse on work from public social media. We developed and improved an effective, humans-in-the-loop active learning framework that uses human annotation and automatic predictions over multiple rounds to automatically label data as job-related or not job-related. We accurately determine whether Twitter accounts are personal or business-related, according to their linguistic characteristics and posting history. Our crowdsourced evaluations suggest that these labels are precise and reliable. Our classification framework could be extended to other open-domain problems that similarly lack high-quality labeled ground truth data.


  • [Altman1991] Altman, D. 1991. Inter-rater agreement. Practical statistics for medical research 5:403–409.
  • [Bird, Klein, and Loper2009] Bird, S.; Klein, E.; and Loper, E. 2009. Natural language processing with Python. O’Reilly Media, Inc.
  • [Brzozowski2009] Brzozowski, M. J. 2009. Watercooler: exploring an organization through enterprise social media. In Proceedings of the ACM 2009 international conference on Supporting group work, 219–228. ACM.
  • [Bureau of Labor Statistics2009] Bureau of Labor Statistics. 2009. Occupational suicides – census of fatal occupational injuries.
  • [Bureau of Labor Statistics2013] Bureau of Labor Statistics. 2013. Time use on an average work day for employed persons ages 25 to 54 with children.
  • [Callison-Burch2009] Callison-Burch, C. 2009. Fast, cheap, and creative: evaluating translation quality using amazon’s mechanical turk. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, 286–295. Association for Computational Linguistics.
  • [Centers for Disease Control and Prevention2013] Centers for Disease Control and Prevention. 2013. Cost estimates of violent deaths: Figures and tables.
  • [Cheng, Caverlee, and Lee2010] Cheng, Z.; Caverlee, J.; and Lee, K. 2010. You are where you tweet: a content-based approach to geo-locating twitter users. In Proceedings of the 19th ACM international conference on Information and knowledge management, 759–768. ACM.
  • [Cohen1960] Cohen, J. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1):37–46.
  • [comScore2011] comScore. 2011. It’s a social world: Top 10 need-to-knows about social networking and where it’s headed.
  • [De Choudhury and Counts2013] De Choudhury, M., and Counts, S. 2013. Understanding affect in the workplace via social media. In Proceedings of the 2013 conference on Computer supported cooperative work, 303–316. ACM.
  • [De Choudhury et al.2013] De Choudhury, M.; Gamon, M.; Counts, S.; and Horvitz, E. 2013. Predicting depression via social media. In ICWSM,  2.
  • [Evanini, Higgins, and Zechner2010] Evanini, K.; Higgins, D.; and Zechner, K. 2010. Using amazon mechanical turk for transcription of non-native speech. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk, 53–56. Association for Computational Linguistics.
  • [Fleiss1971] Fleiss, J. L. 1971. Measuring nominal scale agreement among many raters. Psychological bulletin 76(5):378.
  • [Gallup2013] Gallup. 2013. State of the american workplace.
  • [Geertzen2016] Geertzen, J. 2016. Inter-Rater Agreement with Multiple Raters and Variables. [Online; accessed 17-February-2016].
  • [Harzer and Ruch2013] Harzer, C., and Ruch, W. 2013. The application of signature character strengths and positive experiences at work. Journal of Happiness Studies 14(3):965–983.
  • [Hazards Magazine2014] Hazards Magazine. 2014. Work suicide.
  • [iKeepSafe2014] iKeepSafe. 2014. Suicide: Using technology for detection and intervention. [Online; accessed 16-December-2016].
  • [Kolari et al.2007] Kolari, P.; Finin, T.; Lyons, K.; Yesha, Y.; Yesha, Y.; Perelgut, S.; and Hawkins, J. 2007. On the structure, properties and utility of internal corporate blogs. Growth 45000:50000.
  • [Krippendorff2004] Krippendorff, K. 2004. Content analysis: An introduction to its methodology. Sage.
  • [Li et al.2014] Li, J.; Ritter, A.; Cardie, C.; and Hovy, E. H. 2014. Major life event extraction from twitter based on congratulations/condolences speech acts. In EMNLP, 1997–2007.
  • [Liu et al.2016] Liu, T.; Homan, C. M.; Alm, C. O.; White, A. M.; Lytle, M. C.; and Kautz, H. A. 2016. Understanding discourse on work and job-related well-being in public social media. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 1044–1053. Berlin, Germany: Association for Computational Linguistics.
  • [Owoputi et al.2013] Owoputi, O.; O’Connor, B.; Dyer, C.; Gimpel, K.; Schneider, N.; and Smith, N. A. 2013. Improved part-of-speech tagging for online conversational text with word clusters. Association for Computational Linguistics.
  • [Rao et al.2010] Rao, D.; Yarowsky, D.; Shreevats, A.; and Gupta, M. 2010. Classifying latent user attributes in twitter. In Proceedings of the 2nd international workshop on Search and mining user-generated contents, 37–44. ACM.
  • [Sadilek et al.2014] Sadilek, A.; Homan, C.; Lasecki, W. S.; Silenzio, V.; and Kautz, H. 2014. Modeling fine-grained dynamics of mood at scale. In Workshop on Diffusion Networks and Cascade Analytics in Web Search and Data Mining.
  • [Snow et al.2008] Snow, R.; O’Connor, B.; Jurafsky, D.; and Ng, A. Y. 2008. Cheap and fast—but is it good?: evaluating non-expert annotations for natural language tasks. In Proceedings of the conference on empirical methods in natural language processing, 254–263. Association for Computational Linguistics.
  • [Yardi, Golder, and Brzozowski2008] Yardi, S.; Golder, S.; and Brzozowski, M. 2008. The pulse of the corporate blogosphere. In Conf. Supplement of CSCW 2008, 8–12.