IPOD: Corpus of 190,000 Industrial Occupations

10/22/2019 ∙ by Junhua Liu, et al.

Job titles are the most fundamental building blocks for occupational data mining tasks, such as Career Modelling and Job Recommendation. However, there is no publicly available dataset to support such efforts. In this work, we present the Industrial and Professional Occupations Dataset (IPOD), a comprehensive corpus that consists of over 190,000 job titles crawled from over 56,000 Linkedin profiles. To the best of our knowledge, IPOD is the first dataset released for industrial occupations mining. We use a knowledge-based approach for sequence tagging, creating a gazetteer with domain-specific named entities tagged by 3 experts. All title NE tags are populated by the gazetteer using the BIOES scheme. Finally, we develop 4 baseline models for the dataset on the NER task, including Logistic Regression, CRF, LSTM and the state-of-the-art bi-directional LSTM-CRF. Both CRF and LSTM-CRF outperform humans in both exact-match accuracy and F1 score.




1 Introduction

Interest in occupational analysis tasks, such as career modelling and job recommendation, has grown rapidly in recent years. The advancement of AI and robotics is changing every industry and every sector, challenging the employability of the workforce, especially in jobs with a high level of repetition. On the other hand, young professionals require strong references to plan ahead and build a successful career, while senior executives push their boundaries to stay competitive in the job market. From the employer’s perspective, companies are competing to hire and retain talent, which requires effective candidate filtering and selection during recruitment, and predictive turnover analysis for employee retention.

Despite the rising demand, occupational analysis tasks remain challenging for several reasons, the main one being the lack of publicly available datasets. For many years, relevant data has resided with a small number of large enterprises and has been kept private to help these companies remain competitive in the industry.

To address the need for career analysis in industry, we present and make available the Industrial and Professional Occupation Dataset (IPOD). The dataset consists of 192,295 occupation entries, drafted by working professionals for their Linkedin profiles with the aim of displaying career achievements, attracting recruiters or expanding professional networks.

Existing corpora for Named Entity Recognition (NER) tasks [finkel2005incorporating, sang2003introduction, weischedel2013ontonotes, borchmann2018approaching] use general tags such as LOCation, PERson, ORGanization and MISCellaneous. In contrast, IPOD provides domain-specific NE tags that denote the properties of occupations, such as responsibility, function and location. All named entities are tagged using a comprehensive gazetteer created by three industrial experts, which shows high inter-rater reliability, achieving 0.853 on Percentage Agreement [viera2005understanding] and 0.778 on Cohen’s Kappa [artstein2008inter], with no instances where all 3 annotators disagree.

The labels are further processed by adding prefixes according to the BIOES tagging scheme [Ratinov2009design], which indicate the positional features of each token in a title: Begin, Inside, Ending and Single, with O indicating that a token belongs to no chunk.
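The gazetteer lookup and BIOES prefixing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the five-entry `GAZETTEER`, the function name `tag_title`, the rule that adjacent tokens with the same label form one chunk, and the fallback of unknown tokens to O are all assumptions made for the example. The sample title and its expected tags are taken from Section 2.3.

```python
from typing import Dict, List

# Hypothetical mini-gazetteer for illustration; the actual gazetteer
# covers the 1,500 most frequent tokens of the corpus.
GAZETTEER: Dict[str, str] = {
    "chief": "RES", "financial": "FUN", "officer": "RES",
    "asia": "LOC", "pacific": "LOC",
}


def tag_title(title: str, gazetteer: Dict[str, str]) -> List[str]:
    """Look up each token, then add BIOES positional prefixes per chunk."""
    # Assumption: tokens missing from the gazetteer fall back to "O".
    labels = [gazetteer.get(tok, "O") for tok in title.lower().split()]
    tags: List[str] = []
    for i, label in enumerate(labels):
        if label == "O":
            tags.append("O")
            continue
        # Assumption: a chunk is a maximal run of adjacent identical labels.
        starts = i == 0 or labels[i - 1] != label
        ends = i == len(labels) - 1 or labels[i + 1] != label
        if starts and ends:
            prefix = "S"  # single-token chunk
        elif starts:
            prefix = "B"  # first token of a longer chunk
        elif ends:
            prefix = "E"  # last token of a longer chunk
        else:
            prefix = "I"  # token inside a chunk
        tags.append(f"{prefix}-{label}")
    return tags


print(tag_title("chief financial officer asia pacific", GAZETTEER))
# ['S-RES', 'S-FUN', 'S-RES', 'B-LOC', 'E-LOC']
```

Because the prefix depends only on the neighbouring labels, the same function works for any gazetteer that maps tokens to RES, FUN or LOC.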

Lastly, we develop baseline models for the NER classification, alongside human performance. The models include a Logistic Regression model, a Conditional Random Field (CRF) model, a Long Short-Term Memory (LSTM) model and a Bidirectional LSTM-CRF model [liu2018empower]. These models are chosen to represent linear models, statistical models, sequence models and state-of-the-art NER models, respectively.

Figure 1: An example of an occupational title and its domain-specific NE tags. Tokens in a title indicate the person’s responsibility (RES), function (FUN), and location (LOC). The NE tags are also given positional prefixes using the BIOES scheme, i.e., Begin, Inside, Others, Ending and Single.
Literature              | Source                | Size | Available
IPOD (this work)        | Linkedin              | 190K | Yes
Mimno et al., 2008      | Resumes               | 54K  | No
Lou et al., 2010        | Linkedin              | 67K  | No
Paparrizos et al., 2011 | Web                   | 5M   | No
Zhang et al., 2014      | A job hunting website | 7K   | No
Liu et al., 2016        | Social network        | 30K  | No
Li et al., 2017         | Linkedin              | -    | No
Li et al., 2017         | A high tech company   | -    | No
Yang et al., 2017       | Resumes               | 823K | No
Zhu et al., 2018        | Job portals           | 2M   | No
James et al., 2018      | APS                   | 60K  | Yes
Yang et al., 2018       | Various channels      | -    | No
Xu et al., 2018         | Online Prof. Networks | 20M  | No
Qin et al., 2018        | A high tech company   | 1M   | No
Lim et al., 2018        | Linkedin              | 10K  | No
Shen et al., 2018       | A high tech company   | 14K  | No
Table 1: A survey of datasets used in related works. None of these datasets is publicly available except a dataset of publications and authors from the American Physical Society (APS) [james2018prediction], which only describes the names and affiliations of physicists, without titles.
Figure 2: Histogram of occupation entries.

2 Dataset

2.1 Data Collection

We obtain over 192K job titles from Linkedin profiles, most of which are from Singapore and the United States. The raw data then undergoes a series of processing steps, including converting to lowercase, substituting meaningful punctuation with words (i.e. changing to ) and removing special symbols. We decide not to lemmatize or stem the words because the original form of a word suggests its most accurate named entity, e.g., strategist is labeled as RES while strategy is labeled as FUN.
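The three cleaning steps above (lowercasing, punctuation substitution, symbol removal) can be sketched as below. The `PUNCT_TO_WORD` mapping is purely an assumption for illustration, since the concrete substitution used for the corpus is not recoverable from this text; the function name `normalize_title` and the choice to keep only ASCII letters, digits and spaces are likewise hypothetical. No stemming or lemmatization is applied, in line with the decision stated above.

```python
import re

# Assumed mapping for illustration only; the substitutions actually
# applied to the corpus are not stated in the recoverable text.
PUNCT_TO_WORD = {"&": " and "}


def normalize_title(raw: str) -> str:
    """Lowercase, spell out meaningful punctuation, drop other symbols."""
    text = raw.lower()
    for symbol, word in PUNCT_TO_WORD.items():
        text = text.replace(symbol, word)
    # Assumption: anything that is not a letter, digit or space is "special".
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse the whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


print(normalize_title("Sales & Marketing Manager (APAC)"))
# sales and marketing manager apac
```

A blanket mapping like this one would also split tokens such as r&d, so a real pipeline would need an exception list or a different rule for such cases.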

Statistic | All | US  | Asia
minimum   | 1   | 1   | 1
maximum   | 21  | 17  | 21
average   | 3.0 | 3.1 | 2.9
median    | 3   | 3   | 2
Table 2: Statistics of entry length (number of words per title)
Named Entity | Count
RES          | 310,570
FUN          | 255,974
LOC          | 9,998
O            | 66,948
Table 3: NE counts

2.2 Dataset Analysis

This section discusses the Exploratory Data Analysis conducted to better understand the properties of IPOD. The statistics and the histogram of job title lengths can be found in Table 2 and Figure 2, respectively. The corpus comprises 192,295 English occupation entries from 56,648 unique profiles. These profiles are mainly from the United States () and Asia (). Both Table 2 and Figure 2 suggest that most titles contain at most 5 words, accounting for 91.7% of the entries. The median statistics and the histogram also suggest that job titles written by Asian professionals tend to be shorter, i.e., within two words, than those written by US professionals.

Figure 3 shows the distribution of the top 20 unigrams and bigrams in IPOD. In the unigram case, the most popular token, manager, appears in 34,065 entries, about twice as often as the next most popular ones, i.e., and (18,466), senior (16,475), engineer (15,593) and director (14,182). In contrast, the bigram case shows a gentler curve, with project manager (3,536) and vice president (3,458) being the top two.
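A count of this kind can be reproduced with a few lines of Python. The sketch below is illustrative only: the function name `ngram_entry_counts` and the four sample titles are hypothetical, and it assumes that an n-gram is counted once per entry in which it appears, which matches the "appears in ... entries" wording above.

```python
from collections import Counter
from typing import Iterable


def ngram_entry_counts(titles: Iterable[str], n: int) -> Counter:
    """Count, for each n-gram, the number of titles it appears in."""
    counts: Counter = Counter()
    for title in titles:
        tokens = title.split()
        # A set ensures each title contributes at most once per n-gram.
        grams = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        counts.update(grams)
    return counts


# Hypothetical sample titles, already lowercased.
titles = ["project manager", "senior project manager",
          "vice president", "sales manager"]
unigrams = ngram_entry_counts(titles, 1)
bigrams = ngram_entry_counts(titles, 2)
print(unigrams.most_common(2))  # [('manager', 3), ('project', 2)]
print(bigrams["project manager"])  # 2
```

Sorting the unigram counts in descending order and keeping the most frequent tokens is also the starting point for the gazetteer described in Section 2.3.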

Figure 3: N-grams Analysis of occupation entries.
Managerial level: lead, supervisor, manager, director, president
Operational role: engineer, designer, accountant, technician
Seniority: junior, vice, associate, assistant, senior
Departments: sales, marketing, finance, operations, strategy
Scope: enterprise, project, customer, national, site
Content: data, r&d, security, training, integration, education
Regions: APAC, SEA, Asia, European, north, central
Countries/States/Cities: China, America, Singapore, Colorado
Table 4: Examples of occupational NE tags.

2.3 Domain-specific Sequence Tagging

Figure 1 shows an example of a typical occupational title and its NE tags. Conceptually, job titles serve as a concise indicator for one’s level of responsibility, seniority and scope of work, described with a combination of responsibility, function and location.

Responsibility, as its name suggests, describes the responsibilities of a working professional. As shown in Figure 1, responsibility may include indicators of managerial level, such as director, manager and lead, seniority level, such as vice, assistant and associate, and operational role, such as engineer, accountant and technician. Together, the three sub-categories give the full picture of one’s responsibility.

Function describes business functions along several dimensions. Specifically, Departments describes the department a staff member belongs to, such as sales, marketing and operations; Scope indicates one’s scope of work, such as enterprise, project and national; lastly, Content indicates the content of one’s work, such as data, r&d and security.

Finally, Location indicates the geographic scope that the title owner is responsible for. Examples of this NE tag include geographic regions such as APAC, Asia and European, and countries/states/cities such as China, America and Colorado.

Formally, we define the occupational domain-specific NE tags as RES, FUN, LOC and O, indicating responsibility, function, location and others, respectively. For instance, the job title chief financial officer asia pacific is tagged as S-RES S-FUN S-RES B-LOC E-LOC with the BIOES scheme [Ratinov2009design]. The distribution of the 4 labels is shown in Table 3.

We adopt a knowledge-based NE tagging strategy by creating a gazetteer of word tokens. We first run a unigram analysis of the job titles and sort the tokens by frequency in descending order. The top 1,500 tokens are then tagged by 3 annotators: an HR professional, a senior recruiter and a seasoned business professional. Among the 1,500 tokens tagged, every tag is agreed upon by at least two annotators: 1,169 (77.9%) are agreed upon by all three annotators and 331 (22.1%) by two. We further assess Inter-Rater Reliability with two inter-coder agreement measures, achieving 0.853 on Percentage Agreement [viera2005understanding] and 0.778 on Cohen’s Kappa [artstein2008inter]. Finally, the job titles are labelled with NE tags using the BIOES scheme and formatted for NER tasks.
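The reported Percentage Agreement can be checked against the agreement counts above. The sketch below assumes that the measure is the share of agreeing annotator pairs, averaged over the three possible pairs; the text does not state the aggregation, so this is one plausible reading rather than the authors' stated procedure. Cohen’s Kappa cannot be recomputed this way, because it needs the per-annotator label distributions.

```python
total_tokens = 1500   # tokens tagged by all three annotators
all_three = 1169      # tokens where all three annotators agree
two_of_three = 331    # tokens where exactly two annotators agree

# Three annotators form three pairs per token: (A1, A2), (A1, A3), (A2, A3).
pairs_per_token = 3

# Full agreement means all 3 pairs agree; a 2-vs-1 split means 1 pair agrees.
agreeing_pairs = all_three * 3 + two_of_three * 1
percentage_agreement = agreeing_pairs / (total_tokens * pairs_per_token)

print(round(percentage_agreement, 3))  # 0.853
```

Under this reading the counts give 3,838 agreeing pairs out of 4,500, which rounds to the reported 0.853.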

3 Methods

3.1 NER Models

To provide a wide range of NER baselines, we construct three classifiers that represent classic approaches and one that represents the state of the art. Concretely, we use a Logistic Regression model as a representative of discrete modelling, a Conditional Random Field (CRF) as a representative of statistical modelling, a Long Short-Term Memory (LSTM) model as a representative of recurrent sequence models, and finally a bidirectional LSTM-CRF model, which has recorded remarkable performance in task-aware sequence tagging [liu2018empower]. The logistic and LSTM classifiers are decoded using a softmax layer, while the CRF and bidirectional LSTM-CRF are decoded using a first-order Viterbi algorithm [forney1973viterbi] that finds the sequence of NE tags with the highest score.
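To illustrate the difference between the two decoding strategies, the following is a minimal first-order Viterbi decoder in plain Python. It is a generic sketch, not the decoder used for the baselines: the function signature, the three-tag inventory and all emission and transition scores are hypothetical values chosen for the example.

```python
from typing import Dict, List


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the tag sequence with the highest total score.

    emissions[t][tag]      : score of `tag` at position t
    transitions[prev][tag] : score of moving from `prev` to `tag`
    """
    tags = list(emissions[0])
    best = {tag: emissions[0][tag] for tag in tags}
    backpointers: List[Dict[str, str]] = []
    for scores in emissions[1:]:
        new_best: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for tag in tags:
            # Best predecessor for `tag` under first-order dependencies.
            prev = max(tags, key=lambda p: best[p] + transitions[p][tag])
            new_best[tag] = best[prev] + transitions[prev][tag] + scores[tag]
            pointers[tag] = prev
        best = new_best
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Hypothetical scores for the two tokens "asia pacific".
emissions = [
    {"B-LOC": 2.0, "E-LOC": 1.5, "O": 0.0},
    {"B-LOC": 1.4, "E-LOC": 1.2, "O": 0.5},
]
transitions = {
    "B-LOC": {"B-LOC": -5.0, "E-LOC": 1.0, "O": -5.0},
    "E-LOC": {"B-LOC": 0.0, "E-LOC": -5.0, "O": 0.0},
    "O": {"B-LOC": 0.0, "E-LOC": -5.0, "O": 0.0},
}
print(viterbi(emissions, transitions))  # ['B-LOC', 'E-LOC']
```

With these scores, choosing the highest emission at each position independently would yield B-LOC B-LOC, an invalid BIOES sequence, whereas the transition scores steer Viterbi to B-LOC E-LOC.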

3.2 Word Embedding

We decide not to construct job title embeddings, either from scratch or by applying transfer learning to pre-trained models, but instead use a pre-trained model directly, i.e., without fine-tuning. This is because the contextual meaning of job title tokens is highly similar, if not identical, to that in other texts; e.g., the word director appearing in a job title has the same meaning as in a Wikipedia article. For this work, we use pre-trained ELMo embeddings [peters2018deep] to encode the contextual meaning of the job titles.

3.3 Hyper-parameter Settings

Since the goal of solving the NER task is to create baselines for future work, we aim to find a common set of hyper-parameters that performs reasonably well for all 4 models. To improve the quality of the models, they are tuned with a small-scale hyper-parameter search with 2 epochs. The search space includes the learning rate (i.e., , , or ), the number of LSTM hidden layers (1 or 2), the number of hidden states (128 or 256) and the mini-batch size (16 or 32).
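The size of this search space can be made concrete with a small grid enumeration. This is only a sketch: the three learning-rate entries are placeholder labels, because the actual values are not recoverable from this text, and the dictionary keys are hypothetical names. Whether the search was an exhaustive grid is also an assumption.

```python
from itertools import product

search_space = {
    # Placeholder labels: the three candidate learning rates are not
    # recoverable from the text, so no numeric values are assumed here.
    "learning_rate": ["lr_1", "lr_2", "lr_3"],
    "lstm_layers": [1, 2],
    "hidden_size": [128, 256],
    "batch_size": [16, 32],
}

# Cartesian product of all options, one dict per configuration.
configs = [dict(zip(search_space, values))
           for values in product(*search_space.values())]

print(len(configs))  # 24
```

The two LSTM-specific dimensions only apply to the LSTM and LSTM-CRF models, so the Logistic Regression and CRF baselines would see fewer distinct configurations.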

After the hyper-parameter search, a common set of hyper-parameters is chosen for all baseline models. We use a Cross Entropy loss function, optimized by the Adam optimizer with an initial learning rate of and a mini-batch size of . Word Dropout and Variational Dropout [kingma2015variational] are used to prevent over-fitting, with probabilities of 0.05 and 0.5, respectively. For the LSTM and LSTM-CRF models, we use a single hidden layer with a state size of 256.

4 Experiments

4.1 Metrics

Our work uses two metrics to assess the performance of the machine models and of humans, namely Exact Match (EM) and the F1 score, formally defined as $F_1 = 2 \cdot \frac{P \cdot R}{P + R}$, where $P$ and $R$ denote Precision and Recall. The EM metric measures the percentage agreement between the ground truth and the predicted labels with exact matches, while the F1 metric measures the average overlap between the ground truth and the prediction. Furthermore, the overall Precision and Recall of all models are also reported.
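Both metrics are simple to compute. The sketch below uses the standard harmonic-mean definition of F1 and, as an assumption, treats EM at the title level, i.e., a title counts as a match only if its whole tag sequence is predicted correctly; the text does not state the granularity. The function names and the two sample titles are hypothetical, while the precision and recall values are the LogReg figures from Table 5.

```python
from typing import Sequence


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def exact_match(gold: Sequence[Sequence[str]],
                pred: Sequence[Sequence[str]]) -> float:
    """Share of titles whose full predicted tag sequence equals the gold one.

    Assumption: EM is computed per title, not per token.
    """
    hits = sum(list(g) == list(p) for g, p in zip(gold, pred))
    return hits / len(gold)


# LogReg precision and recall from Table 5.
print(round(f1_score(90.8, 93.2), 1))  # 92.0

# Hypothetical gold and predicted tag sequences for two titles.
gold = [["S-RES", "S-FUN", "S-RES"], ["B-LOC", "E-LOC"]]
pred = [["S-RES", "S-FUN", "S-RES"], ["B-LOC", "S-LOC"]]
print(exact_match(gold, pred))  # 0.5
```

Applying `f1_score` to the other precision and recall pairs in Table 5 reproduces the reported F1 values to one decimal place.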

4.2 Human Performance

We construct the human performance baseline for IPOD using the NE tags annotated by the three domain experts. We choose the labels tagged by annotator 1 as the ground truth and evaluate the other two annotation sets against them. We then take the average EM and F1 to indicate human performance. We record an EM accuracy of 91.3% and an F1 of 95.4%. This shows strong human performance compared to that on other datasets, such as 91% EM for the CHEMDNER corpus [Martin2015The], 86.8% EM and 89.5% F1 for SQuAD2.0 [Rajpurkar2018know], and 77.0% EM and 86.8% F1 for SQuAD1.0 [RajpurkarZLL16].

4.3 Model Performance

Models   | P    | R    | EM   | F1
LogReg   | 90.8 | 93.2 | 85.1 | 92.0
CRF      | 96.5 | 96.7 | 93.5 | 96.6
LSTM     | 94.6 | 95.0 | 90.1 | 94.8
LSTM-CRF | 96.6 | 96.4 | 93.3 | 96.5
Human    | 91.6 | 99.6 | 91.3 | 95.4
Table 5: Overall results of Job Title NER
Models   | EM   | F1   | EM (LOC) | F1 (LOC) | EM   | F1
LogReg   | 78.3 | 87.8 | 93.7     | 96.8     | 90.1 | 94.8
CRF      | 89.1 | 94.2 | 98.7     | 99.4     | 96.6 | 98.2
LSTM     | 85.8 | 92.5 | 62.3     | 76.7     | 94.5 | 97.2
LSTM-CRF | 89.6 | 94.5 | 85.8     | 92.3     | 96.3 | 98.1
Table 6: Performance stratified by NE tags (EM, F1)

Table 5 shows the overall performance of our models and of humans on IPOD, in terms of precision (P), recall (R), exact match (EM) and F1. The CRF and LSTM-CRF models perform very close to each other (within 0.3 points on every metric). Both models outperform the human baseline in precision, exact match and F1, but fall short of its 99.6 recall.

Table 6 shows the per-tag breakdown of NER results, in terms of EM and F1. In line with the overall performance, CRF and LSTM-CRF perform similarly to each other on two of the three tags, where both outperform LogReg and LSTM. On LOC tags, CRF shows a significant advantage (98.7 EM and 99.4 F1), whereas LSTM-CRF (85.8 EM and 92.3 F1) falls behind both CRF and LogReg (93.7 EM and 96.8 F1).

5 Related Work

In this section, we review related work in the areas of Occupational Analysis, Contextual Embedding and Named Entity Recognition (NER).

5.1 Occupational Data Mining

Prior works on Occupational Data Mining aim to accomplish a wide range of industrial tasks, such as Career Modeling and Job Recommendation. In the area of Career Modeling, prior works address problems including career path modeling [liu2016fortune, mimno2008modeling], career movement prediction [li2017nemo, james2018prediction, yang2018one], professional career development [li2017prospecting], job title ranking [xu2018extracting], and employability [massoni2009career]. In Job Recommendation, past works focus on analysing Person-Job Fit [zhu2018person, shen2018joint, qin2018enhancing], which commonly aims to assess employment suitability for companies, and on Job Recommendation [paparrizos2011machine, malinowski2006matching, zhang2014research, guo2014analysis], which instead provides decision analysis for job seekers. These works commonly leverage real-world data from different sources, including Linkedin [liu2016fortune, li2017nemo], resumes [mimno2008modeling, yang2018one], job portals [zhang2014research, zhu2018person] and tech companies [shen2018joint, qin2018enhancing], among others.

The proposed solutions to these problems take different approaches. The majority take machine learning approaches, such as linear classification models [liu2016fortune, james2018prediction, li2017prospecting, yang2018one, guo2014analysis], generative models [mimno2008modeling, xu2018extracting, limyou2018are] and neural networks [li2017nemo, mimno2008modeling, zhu2018person, qin2018enhancing]. Some take algorithmic approaches, such as statistical inference [mimno2008modeling, james2018prediction, shen2018joint], graph-theoretic models [massoni2009career, lou2010machine, paparrizos2011machine] and recommender systems with content-based and collaborative filtering [zhang2014research, malinowski2006matching]. Some works report their time complexity to be polynomial [liu2016fortune, li2017prospecting].

5.2 Natural Language Processing

Word Embedding. Classic word embedding methods construct word-level vector representations to capture the contextual meaning of words. A Neural Network Language Model (NNLM) was proposed in [bengio2003neural], which led to a series of NNLM-based embedding works [turian2010word], including the Continuous Bag of Words (CBoW) and skip-gram models. Pennington et al. (2014) proposed GloVe [pennington2014glove], which uses a much simpler approach, i.e., constructing global vectors to represent contextual knowledge of the vocabulary, and achieves good results. More recently, a series of high-quality embedding models have been proposed, such as ELMo [peters2018deep], FastText [bojanowski2017enriching] and Flair [akbik2018contextual]. Both word-level contextualization and character-level features are commonly used in these works.

Document Embedding. While word embedding constructs static continuous vectors at the word level, recent works also propose methods to construct document-level embeddings. Transformer-based approaches have gained popularity in recent literature. They use transformer-based models pre-trained on very large datasets to construct document-level embeddings, such as BERT [devlin2018bert], XLNet [yang2019xlnet], RoBERTa [liu2019roberta], GPT [radford2019language], XLM [lample2019cross] and TransformerXL [dai2019transformer]. This approach enables contextual embedding at both the word level and the document level. Lample et al. (2016) propose a remarkable stacked embedding approach that constructs a hierarchical document-level embedding by stacking word-level embeddings with character-level features and concatenating them with an RNN, which performs well in NER tasks [lample2016neural].

Named Entity Recognition. Named entity recognition is a challenging task that traditionally involves a high degree of manually crafted features for different domains. Fortunately, numerous large-scale corpora, such as CoNLL-2003 [sang2003introduction] and Ontonotes [weischedel2013ontonotes], have been made available for training deep neural architectures. State-of-the-art NER models are LSTM-based [collobert2011natural, zhang2018chinese, huang2015bidirectional, lample2016neural, teng2016bidirectional, liu2018empower], feeding sentence embeddings into a uni- or bi-directional LSTM encoder. Instead of decoding directly, some works add a Conditional Random Field (CRF) layer at the end when training the classifier, and use Viterbi [forney1973viterbi] to decode the probabilistic output into NE labels. The recently popular transformer-based models [devlin2018bert, yang2019xlnet, liu2019roberta, radford2019language, lample2019cross, dai2019transformer] are also capable of producing good results.

Since manually tagging a large dataset requires a tremendous amount of effort, prior works leverage knowledge-based gazetteers developed with various unsupervised or semi-supervised learning approaches [toral2006proposal, kazama2008inducing, nadeau2006unsupervised, saha2008gazetteer, smith2006using, kozareva2006bootstrapping], or rely on generative models [nallapati2010blind, zhou2002named, mukund2009ne, zhao2008unsupervised, mohit2005syntax]. Tags can be further formatted with tagging schemes such as IOB [ramshaw1999text] or BIOES [Ratinov2009design] to indicate the position of each token in a chunk.

6 Conclusion

We presented the IPOD corpus, which comprises a large number of job titles. The corpus comes with a knowledge-based gazetteer that includes manual NE tags from three domain expert annotators. Working on an NER task, we constructed several representative machine learning baselines. Despite a strong human performance of 91.3% EM and 95.4% F1, two of our models, namely CRF and LSTM-CRF, outperform humans in EM and F1 for overall results and per-tag breakdown. Finally, we release the corpus to serve as a basic building block for larger-scale industrial natural language understanding tasks and occupational data mining.