Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases

07/26/2022
by Sławomir Dadas, et al.

Sentence embeddings are commonly used in text clustering and semantic retrieval tasks. State-of-the-art sentence representation methods are based on artificial neural networks fine-tuned on large collections of manually labeled sentence pairs. A sufficient amount of annotated data is available for high-resource languages such as English or Chinese. For less popular languages, multilingual models have to be used, which offer lower performance. In this publication, we address this problem by proposing a method for training effective language-specific sentence encoders without manually labeled data. Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora. We then use the collected data to fine-tune a Transformer language model with an additional recurrent pooling layer. Our sentence encoder can be trained in less than a day on a single graphics card, achieving high performance on a diverse set of sentence-level tasks. We evaluate our method on eight linguistic tasks in Polish, comparing it with the best available multilingual sentence encoders.
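To make the described architecture concrete, the sketch below shows one way a Transformer encoder can be combined with a recurrent pooling layer to produce fixed-size sentence embeddings. It is a minimal illustration, not the authors' released code: the base model name (xlm-roberta-base), the choice of a bidirectional GRU as the pooling layer, and its hidden size are assumptions made for the example, and the fine-tuning objective over the mined paraphrase pairs (e.g., a contrastive loss) is omitted.

```python
# Illustrative sketch (not the authors' exact implementation): a Transformer
# encoder with an additional recurrent pooling layer over the token states.
# Model name, GRU size, and pooling details are assumptions for illustration.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class RecurrentPoolingEncoder(nn.Module):
    def __init__(self, model_name: str = "xlm-roberta-base", hidden_size: int = 768):
        super().__init__()
        self.transformer = AutoModel.from_pretrained(model_name)
        # Bidirectional GRU reads the Transformer's contextual token representations.
        self.pooler = nn.GRU(
            input_size=self.transformer.config.hidden_size,
            hidden_size=hidden_size,
            batch_first=True,
            bidirectional=True,
        )

    def forward(self, input_ids, attention_mask):
        token_states = self.transformer(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                         # (batch, seq_len, dim)
        _, last_hidden = self.pooler(token_states)  # (2, batch, hidden_size)
        # Concatenate the final forward and backward states as the sentence embedding.
        return torch.cat([last_hidden[0], last_hidden[1]], dim=-1)


# Usage: encode a (hypothetical) paraphrase pair and compare the embeddings.
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = RecurrentPoolingEncoder()
batch = tokenizer(
    ["Ala ma kota.", "Ala posiada kota."],
    padding=True, return_tensors="pt",
)
with torch.no_grad():
    embeddings = encoder(batch["input_ids"], batch["attention_mask"])
similarity = torch.cosine_similarity(embeddings[0], embeddings[1], dim=0)
```

In this setup, the recurrent layer pools the contextual token representations into a single vector per sentence; the whole encoder would then be fine-tuned on the automatically mined paraphrase pairs so that paraphrases receive similar embeddings.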

Related research

07/03/2020 - Language-agnostic BERT Sentence Embedding
We adapt multilingual BERT to produce language-agnostic sentence embeddi...

07/09/2019 - Multilingual Universal Sentence Encoder for Semantic Retrieval
We introduce two pre-trained retrieval focused multilingual sentence enc...

06/12/2019 - Probing Multilingual Sentence Representations With X-Probe
This paper extends the task of probing sentence representations for ling...

12/21/2021 - On Cross-Lingual Retrieval with Multilingual Text Encoders
In this work we present a systematic empirical study focused on the suit...

06/16/2020 - How to Probe Sentence Embeddings in Low-Resource Languages: On Structural Design Choices for Probing Task Evaluation
Sentence encoders map sentences to real valued vectors for use in downst...

11/03/2020 - Finding Friends and Flipping Frenemies: Automatic Paraphrase Dataset Augmentation Using Graph Theory
Most NLP datasets are manually labeled, so suffer from inconsistent labe...

09/07/2021 - PAUSE: Positive and Annealed Unlabeled Sentence Embedding
Sentence embedding refers to a set of effective and versatile techniques...
