Towards Lingua Franca Named Entity Recognition with BERT

11/19/2019
by Taesun Moon, et al.

Information extraction is an important task in NLP, enabling the automatic extraction of data to populate relational databases. Historically, research and data were produced for English text, followed in subsequent years by datasets in Arabic, Chinese (ACE/OntoNotes), Dutch, Spanish, German (CoNLL evaluations), and many others. The natural tendency has been to treat each language as a different dataset and build optimized models for each. In this paper we investigate a single Named Entity Recognition model, based on multilingual BERT, that is trained jointly on many languages simultaneously and is able to decode these languages with better accuracy than models trained on only one language. To improve the initial model, we study the use of regularization strategies such as multitask learning and partial gradient updates. In addition to being a single model that can tackle multiple languages (including code-switching), the model could be used to make zero-shot predictions out of the box on a new language, even one for which training data is not available. The results show that this model not only performs competitively with monolingual models, but also achieves state-of-the-art results on the CoNLL02 Dutch and Spanish datasets and the OntoNotes Arabic and Chinese datasets. Moreover, it performs reasonably well on unseen languages, achieving state-of-the-art zero-shot results on three CoNLL languages.
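As a rough illustration of what "trained jointly on many languages simultaneously" can mean at the data level, the sketch below pools per-language tagged sentences into a single shuffled stream and cuts it into mixed-language batches. This is an assumption about one plausible setup, not the authors' pipeline: the function names `pool_languages` and `batches`, the BIO labels, and the toy sentences are all hypothetical, and the model itself (multilingual BERT with a tagging head) is left out.

```python
import random
from typing import Dict, Iterator, List, Tuple

# A tagged sentence is (tokens, BIO labels). All data here is invented for illustration.
Sentence = Tuple[List[str], List[str]]
Example = Tuple[str, Sentence]  # (language code, tagged sentence)


def pool_languages(corpora: Dict[str, List[Sentence]], seed: int = 0) -> List[Example]:
    """Merge per-language corpora into one list and shuffle it deterministically."""
    pooled = [(lang, sent) for lang, sents in corpora.items() for sent in sents]
    random.Random(seed).shuffle(pooled)
    return pooled


def batches(pooled: List[Example], batch_size: int) -> Iterator[List[Example]]:
    """Yield consecutive batches; a batch may mix several languages."""
    for start in range(0, len(pooled), batch_size):
        yield pooled[start:start + batch_size]


# Hypothetical toy corpora, one sentence per language.
corpora = {
    "en": [(["Paris", "is", "nice"], ["B-LOC", "O", "O"])],
    "es": [(["Madrid", "es", "grande"], ["B-LOC", "O", "O"])],
    "nl": [(["Amsterdam", "is", "mooi"], ["B-LOC", "O", "O"])],
}

pooled = pool_languages(corpora)
batch_list = list(batches(pooled, batch_size=2))
```

Each batch would then be fed to one shared tagger, so every gradient step can see examples from more than one language. How the paper balances languages or applies its regularization strategies is not specified in the abstract and is not modeled here.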

research
08/13/2020

ANDES at SemEval-2020 Task 12: A jointly-trained BERT multilingual model for offensive language detection

This paper describes our participation in SemEval-2020 Task 12: Multilin...
research
04/11/2022

Entities, Dates, and Languages: Zero-Shot on Historical Texts with T0

In this work, we explore whether the recently demonstrated zero-shot abi...
research
02/26/2019

Polyglot Contextual Representations Improve Crosslingual Transfer

We introduce a method to produce multilingual contextual word representa...
research
01/31/2017

Robust Multilingual Named Entity Recognition with Shallow Semi-Supervised Features

We present a multilingual Named Entity Recognition approach based on a r...
research
10/11/2022

HUE: Pretrained Model and Dataset for Understanding Hanja Documents of Ancient Korea

Historical records in Korea before the 20th century were primarily writt...
research
11/04/2022

Multilingual Name Entity Recognition and Intent Classification Employing Deep Learning Architectures

Named Entity Recognition and Intent Classification are among the most im...
research
07/03/2020

Playing with Words at the National Library of Sweden – Making a Swedish BERT

This paper introduces the Swedish BERT ("KB-BERT") developed by the KBLa...
