Are the Best Multilingual Document Embeddings simply Based on Sentence Embeddings?

04/28/2023
by   Sonal Sannigrahi, et al.

Dense vector representations for textual data are crucial in modern NLP. Word and sentence embeddings estimated from raw text are key to achieving state-of-the-art results in various tasks requiring semantic understanding. However, obtaining embeddings at the document level is challenging due to computational requirements and a lack of appropriate data. Instead, most approaches fall back on computing document embeddings from sentence representations. Although architectures and models exist that encode documents fully, they are generally limited to English and a few other high-resource languages. In this work, we provide a systematic comparison of methods for producing document-level representations from sentences, based on the LASER, LaBSE, and Sentence-BERT pre-trained multilingual models. We compare input-token truncation, sentence averaging, simple windowing, and, in some cases, new augmented and learnable approaches on three multi- and cross-lingual tasks in eight languages belonging to three different language families. Our task-based extrinsic evaluations show that, independently of the language, a clever combination of sentence embeddings is usually better than encoding the full document as a single unit, even when this is possible. We demonstrate that while a simple sentence average yields a strong baseline for classification tasks, more complex combinations are necessary for semantic tasks.
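As a concrete illustration of the sentence-averaging baseline described above, the sketch below builds a document embedding by mean-pooling sentence embeddings from a pre-trained multilingual encoder. This is not the authors' implementation; it assumes the sentence-transformers package and the "sentence-transformers/LaBSE" checkpoint, and any of the encoders compared in the paper could be substituted.

```python
# Minimal sketch (assumption: sentence-transformers is installed and the
# "sentence-transformers/LaBSE" checkpoint is available): a document embedding
# obtained by averaging the embeddings of its sentences.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/LaBSE")

def document_embedding(sentences: list[str]) -> np.ndarray:
    """Encode each sentence and return the mean vector as the document embedding."""
    sentence_vectors = model.encode(sentences)  # shape: (n_sentences, dim)
    return sentence_vectors.mean(axis=0)        # shape: (dim,)

doc = [
    "Dense vector representations for textual data are crucial in modern NLP.",
    "Most approaches compute document embeddings from sentence representations.",
]
print(document_embedding(doc).shape)
```

The paper's point is that even this simple pooling is a strong baseline for classification, whereas semantic tasks benefit from more elaborate combinations (e.g., windowing or learnable weighting over sentence vectors).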

