Everything Is All It Takes: A Multipronged Strategy for Zero-Shot Cross-Lingual Information Extraction

09/14/2021
by Mahsa Yarmohammadi, et al.

Zero-shot cross-lingual information extraction (IE) describes the construction of an IE model for some target language, given existing annotations exclusively in some other language, typically English. While the advance of pretrained multilingual encoders suggests an easy optimism of "train on English, run on any language", we find through a thorough exploration and extension of techniques that a combination of approaches, both new and old, leads to better performance than any one cross-lingual strategy in particular. We explore techniques including data projection and self-training, and how different pretrained encoders impact them. We use English-to-Arabic IE as our initial example, demonstrating strong performance in this setting for event extraction, named entity recognition, part-of-speech tagging, and dependency parsing. We then apply data projection and self-training to three tasks across eight target languages. Because no single set of techniques performs the best across all tasks, we encourage practitioners to explore various configurations of the techniques described in this work when seeking to improve on zero-shot training.
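As an illustration of the data-projection idea mentioned above, here is a minimal sketch (not the paper's actual pipeline; the function, data, and alignment are hypothetical toy examples): given English tokens with BIO NER tags and a word alignment into the target language, each non-`O` English tag is copied onto its aligned target token, yielding "silver" target-language annotations. Self-training works analogously, except the silver labels come from a model's own confident predictions rather than from alignments.

```python
def project_tags(src_tags, alignment, tgt_len):
    """Project BIO tags from source to target tokens.

    src_tags:  BIO tag per source (English) token.
    alignment: list of (src_idx, tgt_idx) word-alignment pairs,
               e.g. from an automatic aligner.
    tgt_len:   number of target-language tokens.
    """
    tgt_tags = ["O"] * tgt_len
    for src_i, tgt_i in alignment:
        if src_tags[src_i] != "O":
            tgt_tags[tgt_i] = src_tags[src_i]
    return tgt_tags

# Toy example: "Obama visited Paris" aligned to a target sentence
# whose word order differs (alignment pairs are invented).
src_tags = ["B-PER", "O", "B-LOC"]
alignment = [(0, 1), (1, 0), (2, 2)]
print(project_tags(src_tags, alignment, 3))  # ['O', 'B-PER', 'B-LOC']
```

A real projection system must additionally handle one-to-many alignments and span boundaries (keeping B-/I- prefixes consistent), which this sketch omits.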


