Genre as Weak Supervision for Cross-lingual Dependency Parsing

by Max Müller-Eberstein, et al.

Recent work has shown that monolingual masked language models learn to represent data-driven notions of language variation which can be used for domain-targeted training data selection. Dataset genre labels are already frequently available, yet remain largely unexplored in cross-lingual setups. We harness this genre metadata as a weak supervision signal for targeted data selection in zero-shot dependency parsing. Specifically, we project treebank-level genre information to the finer-grained sentence level, with the goal of amplifying information implicitly stored in unsupervised contextualized representations. We demonstrate that genre is recoverable from multilingual contextual embeddings and that it provides an effective signal for training data selection in cross-lingual, zero-shot scenarios. For 12 low-resource language treebanks, six of which are test-only, our genre-specific methods significantly outperform competitive baselines as well as recent embedding-based methods for data selection. Moreover, genre-based data selection provides new state-of-the-art results for three of these target languages.
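To make the idea of projecting treebank-level genre labels down to individual sentences more concrete, the following is a minimal sketch and not the authors' method. It swaps in a simple nearest-centroid heuristic: sentence embeddings are pooled per genre across all treebanks that list that genre, and each sentence is then assigned the genre (among those its own treebank lists) whose centroid is closest by cosine similarity. The function names, treebank identifiers, genre names, and two-dimensional toy embeddings are all hypothetical; in practice the vectors would come from a multilingual contextual encoder.

```python
import math
from collections import defaultdict


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def project_genres(sentences, treebank_genres):
    """Assign one genre per sentence (hypothetical nearest-centroid heuristic).

    sentences: list of (treebank_id, embedding) pairs.
    treebank_genres: dict mapping treebank_id to the set of genres it lists.
    """
    # Pool embeddings per genre using only the weak, treebank-level labels.
    pooled = defaultdict(list)
    for tb, emb in sentences:
        for genre in treebank_genres[tb]:
            pooled[genre].append(emb)
    centroids = {genre: mean(vecs) for genre, vecs in pooled.items()}

    # Each sentence may only receive a genre that its own treebank lists.
    labels = []
    for tb, emb in sentences:
        candidates = sorted(treebank_genres[tb])  # sorted for stable ties
        labels.append(max(candidates, key=lambda g: cosine(emb, centroids[g])))
    return labels


def select_for_genre(labels, target_genre):
    """Indices of training sentences whose projected genre matches the target."""
    return [i for i, genre in enumerate(labels) if genre == target_genre]


# Toy data (assumed for illustration only).
treebank_genres = {
    "tb_a": {"news"},
    "tb_b": {"news", "social"},
    "tb_c": {"social"},
}
sentences = [
    ("tb_a", [1.0, 0.0]),
    ("tb_a", [0.9, 0.1]),
    ("tb_b", [0.8, 0.2]),
    ("tb_b", [0.1, 0.9]),
    ("tb_c", [0.0, 1.0]),
    ("tb_c", [0.2, 0.8]),
]

labels = project_genres(sentences, treebank_genres)
selected = select_for_genre(labels, "social")
```

Here the mixed-genre treebank `tb_b` is split at the sentence level, and `selected` holds the indices of sentences that would be kept as training data for a target known to be in the `social` genre. A real setup would likely replace the centroid step with a stronger clustering or classification method.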






Code Repositories


Genre-driven Data Selection for Zero-shot Parsing (EMNLP 2021)
