IPOD: Corpus of 190,000 Industrial Occupations

10/22/2019
by Junhua Liu, et al.

Job titles are the most fundamental building blocks for occupational data mining tasks, such as career modelling and job recommendation. However, there is no publicly available dataset to support such efforts. In this work, we present the Industrial and Professional Occupations Dataset (IPOD), a comprehensive corpus of over 190,000 job titles crawled from over 56,000 LinkedIn profiles. To the best of our knowledge, IPOD is the first dataset released for industrial occupations mining. We use a knowledge-based approach for sequence tagging, creating a gazetteer of domain-specific named entities tagged by 3 experts. The named-entity (NE) tags of all titles are populated from the gazetteer using the BIOES scheme. Finally, we develop 4 baseline models for the NER task on the dataset: Linear Regression, CRF, LSTM and the state-of-the-art bi-directional LSTM-CRF. Both CRF and LSTM-CRF outperform humans in both exact-match accuracy and F1 score.
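
To make the gazetteer-based BIOES tagging described in the abstract concrete, here is a minimal sketch in Python. It is not the authors' implementation: the gazetteer entries, the label names (`RES`, `FUN`, `LOC`), the lowercase whitespace tokenisation and the longest-match-first lookup are all assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical gazetteer: token tuples mapped to illustrative NE labels.
# Neither the entries nor the label names are taken from the IPOD release.
GAZETTEER: Dict[Tuple[str, ...], str] = {
    ("senior",): "RES",
    ("vice", "president"): "RES",
    ("software", "engineer"): "FUN",
    ("asia", "pacific"): "LOC",
}


def bioes_tag(tokens: List[str],
              gazetteer: Dict[Tuple[str, ...], str] = GAZETTEER) -> List[str]:
    """Tag tokens with BIOES labels using longest-match gazetteer lookup."""
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)          # default: Outside any entity
    max_len = max(len(key) for key in gazetteer)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first (an assumed tie-breaking rule).
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = gazetteer.get(tuple(lowered[i:i + n]))
            if label is None:
                continue
            if n == 1:
                tags[i] = f"S-{label}"              # Single-token entity
            else:
                tags[i] = f"B-{label}"              # Begin
                for j in range(i + 1, i + n - 1):
                    tags[j] = f"I-{label}"          # Inside
                tags[i + n - 1] = f"E-{label}"      # End
            i += n
            matched = True
            break
        if not matched:
            i += 1
    return tags


if __name__ == "__main__":
    title = "Vice President of Asia Pacific".split()
    for token, tag in zip(title, bioes_tag(title)):
        print(f"{token}\t{tag}")
```

Under these assumptions, a multi-token entry such as "Vice President" receives a `B-`/`E-` pair, a single-token entry receives an `S-` tag, and any token not covered by the gazetteer stays `O`.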


research
08/09/2015

Bidirectional LSTM-CRF Models for Sequence Tagging

In this paper, we propose a variety of Long Short-Term Memory (LSTM) bas...
research
03/04/2016

End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF

State-of-the-art sequence labeling systems traditionally require large a...
research
08/12/2022

Building a Chatbot on a Closed Domain using RASA

In this study, we build a chatbot system in a closed domain with the RAS...
research
09/27/2017

Application of a Hybrid Bi-LSTM-CRF model to the task of Russian Named Entity Recognition

Named Entity Recognition (NER) is one of the most common tasks of the na...
research
01/16/2018

Adversarial Learning for Chinese NER from Crowd Annotations

To quickly obtain new labeled data, we can choose crowdsourcing as an al...
research
05/15/2020

Corpus and Models for Lemmatisation and POS-tagging of Classical French Theatre

This paper describes the process of building an annotated corpus and tra...
