A General-Purpose Multilingual Document Encoder

05/11/2023
by Onur Galoğlu, et al.

Massively multilingual pretrained transformers (MMTs) have tremendously pushed the state of the art in multilingual NLP, and in cross-lingual transfer of NLP models in particular. While a large body of work has leveraged MMTs to mine parallel data and induce bilingual document embeddings, much less effort has been devoted to training a general-purpose (massively) multilingual document encoder that can be used for both supervised and unsupervised document-level tasks. In this work, we pretrain a massively multilingual document encoder as a hierarchical transformer model (HMDE), in which a shallow document transformer contextualizes sentence representations produced by a state-of-the-art pretrained multilingual sentence encoder. We leverage Wikipedia as a readily available source of comparable documents to create training data, and we train HMDE with a cross-lingual contrastive objective, further exploiting Wikipedia's category hierarchy to create difficult negatives. We evaluate the effectiveness of HMDE on two of the arguably most common and prominent cross-lingual document-level tasks: (1) cross-lingual transfer for topical document classification and (2) cross-lingual document retrieval. HMDE is significantly more effective than (i) aggregations of segment-based representations and (ii) a multilingual Longformer. Crucially, owing to its massively multilingual lower transformer, HMDE successfully generalizes to languages unseen in document-level pretraining. We publicly release our code and models at https://github.com/ogaloglu/pre-training-multilingual-document-encoders.
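
To make the described architecture and training setup concrete, below is a minimal PyTorch sketch of an HMDE-style hierarchical encoder together with a cross-lingual contrastive objective over comparable document pairs. This is an illustration of the idea in the abstract, not the released implementation: the `SentenceEncoderStub`, the class and function names, the number of document-transformer layers, and the InfoNCE formulation with a fixed temperature are all assumptions; the actual pretrained multilingual sentence encoder, hyperparameters, and category-based hard-negative sampling are in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SentenceEncoderStub(nn.Module):
    """Stand-in for the pretrained multilingual sentence encoder: it mean-pools
    (randomly initialized) token embeddings into one vector per sentence."""

    def __init__(self, vocab_size: int = 250_000, dim: int = 768):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim, padding_idx=0)

    def forward(self, input_ids, attention_mask):
        x = self.emb(input_ids)                        # (n_sents, n_tokens, dim)
        mask = attention_mask.unsqueeze(-1).float()
        return (x * mask).sum(1) / mask.sum(1).clamp(min=1.0)   # mean pooling


class HierarchicalDocumentEncoder(nn.Module):
    """HMDE-style encoder: a multilingual sentence encoder produces one vector
    per sentence, and a shallow document transformer contextualizes those
    sentence vectors into a single document embedding."""

    def __init__(self, sentence_encoder: nn.Module, dim: int = 768,
                 num_layers: int = 2, num_heads: int = 12):
        super().__init__()
        self.sentence_encoder = sentence_encoder
        self.cls = nn.Parameter(torch.randn(1, 1, dim) * 0.02)  # document-level [CLS]
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.doc_transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, sent_input_ids, sent_attention_mask, doc_sentence_mask):
        # sent_input_ids, sent_attention_mask: (batch, max_sents, max_tokens)
        # doc_sentence_mask: (batch, max_sents), 1 for real sentences, 0 for padding
        b, s, t = sent_input_ids.shape
        sent_vecs = self.sentence_encoder(
            sent_input_ids.view(b * s, t),
            sent_attention_mask.view(b * s, t),
        ).view(b, s, -1)                               # (batch, max_sents, dim)
        x = torch.cat([self.cls.expand(b, -1, -1), sent_vecs], dim=1)
        pad = torch.cat([torch.zeros(b, 1, dtype=torch.bool, device=x.device),
                         ~doc_sentence_mask.bool()], dim=1)   # True = ignore
        x = self.doc_transformer(x, src_key_padding_mask=pad)
        return x[:, 0]                                 # document embedding


def cross_lingual_contrastive_loss(src_docs, tgt_docs, temperature=0.05):
    """InfoNCE-style objective over a batch of comparable document pairs
    (e.g. aligned Wikipedia articles in two languages): each source document
    should be most similar to its aligned target; all other in-batch targets
    act as negatives. Category-based difficult negatives could be appended as
    extra rows of tgt_docs (not shown here)."""
    src = F.normalize(src_docs, dim=-1)
    tgt = F.normalize(tgt_docs, dim=-1)
    logits = src @ tgt.T / temperature                 # pairwise cosine similarities
    labels = torch.arange(src.size(0), device=src.device)
    return F.cross_entropy(logits, labels)
```

The design point the sketch reflects is that only the shallow document transformer has to learn document-level composition; cross-lingual alignment is largely inherited from the massively multilingual sentence encoder underneath, which is what allows generalization to languages unseen during document-level pretraining.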

Related research

02/03/2023 · Modeling Sequential Sentence Relation to Improve Cross-lingual Dense Retrieval
Recently multi-lingual pre-trained language models (PLM) such as mBERT a...

01/21/2021 · Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval
Pretrained multilingual text encoders based on neural Transformer archit...

08/01/2022 · BabelBERT: Massively Multilingual Transformers Meet a Massively Multilingual Lexical Resource
While pretrained language models (PLMs) primarily serve as general purpo...

12/16/2021 · DOCmT5: Document-Level Pretraining of Multilingual Language Models
In this paper, we introduce DOCmT5, a multilingual sequence-to-sequence ...

09/07/2021 · Mixed Attention Transformer for Leveraging Word-Level Knowledge to Neural Cross-Lingual Information Retrieval
Pretrained contextualized representations offer great success for many d...

05/04/2020 · WikiUMLS: Aligning UMLS to Wikipedia via Cross-lingual Neural Ranking
We present our work on aligning the Unified Medical Language System (UML...

06/20/2019 · Hierarchical Document Encoder for Parallel Corpus Mining
We explore using multilingual document embeddings for nearest neighbor m...
