
Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases

07/26/2022
by   Sławomir Dadas, et al.

Sentence embeddings are commonly used in text clustering and semantic retrieval tasks. State-of-the-art sentence representation methods are based on artificial neural networks fine-tuned on large collections of manually labeled sentence pairs. A sufficient amount of annotated data is available for high-resource languages such as English or Chinese. For less popular languages, multilingual models have to be used, which offer lower performance. In this publication, we address this problem by proposing a method for training effective language-specific sentence encoders without manually labeled data. Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora. We then use the collected data to fine-tune a Transformer language model with an additional recurrent pooling layer. Our sentence encoder can be trained in less than a day on a single graphics card, achieving high performance on a diverse set of sentence-level tasks. We evaluate our method on eight linguistic tasks in Polish, comparing it with the best available multilingual sentence encoders.


I Introduction

Using artificial neural networks for generating dense vector representations of text has become a common practice in natural language processing (NLP). While most research is focused on word and subword representations, some applications require encoding larger chunks of text, such as sentences or paragraphs. Neural sentence encoders are used in semantic search, question answering, document clustering, dataset augmentation, plagiarism detection, and other tasks which involve measuring semantic similarity between sentences. Typically, these types of models are utilized for transforming text fragments to their corresponding dense representations, which are then processed independently by a vector search engine or other information retrieval system. This allows billions of vectors to be compared and searched efficiently. The quality of these representations, therefore, has a significant impact on the performance of the whole system.

In recent years, a number of methods for encoding sentences have been introduced. State-of-the-art models, producing high-quality vector representations, are usually trained in a supervised way. Training semantically meaningful representations, suitable for search and retrieval problems, requires specific types of labeled datasets such as natural language inference (NLI) or paraphrase pairs. Large datasets of manually labeled sentence pairs exist only for high-resource languages. For English, the SNLI [bowman2015large] and MultiNLI [williams2018broad] datasets are available, among others, each with several hundred thousand records. For other languages, there are either no NLI datasets, or the size of the existing datasets is insufficient for training high-quality neural sentence encoders. Therefore, for low-resource languages, multilingual or unsupervised sentence encoders need to be used, which offer lower performance than the best methods trained for English.

This publication addresses this problem by proposing a method for training effective language-specific sentence encoders without manually labeled data. In the first step, we perform automatic extraction of paraphrase pairs in the target language using a large corpus of parallel sentences from the OPUS project [tiedemann2012parallel]. Next, we fine-tune a siamese network composed of two Transformer-based [vaswani2017attention] language models on the collected data to discriminate paraphrases from non-paraphrases. For fine-tuning, we employ an additional LSTM (Long Short-Term Memory) [hochreiter1997long] layer as a pooling operation, of which the last hidden state is used as the sentence representation. We show that it is possible to train a neural sentence encoder for a lower-resource language on a single GPU in under 24 hours, achieving better performance than state-of-the-art multilingual models trained with significantly more data and compute power.

II Contributions

We make the following contributions in this work:

  1. We propose a framework for training neural sentence encoders which does not require manually labeled data. Our method involves automatic extraction of paraphrase pairs for a target language utilizing sentence-aligned cross-lingual corpora. The resulting dataset is then used for fine-tuning a Transformer language model to produce high-quality dense representations of sentences.

  2. Our architecture generates sentence vectors from an additional LSTM pooling layer. Other popular Transformer-based sentence encoders use simple non-parametric pooling operations such as mean or max pooling, which restrict the dimension of the resulting vector. Using a recurrent layer allows us to produce arbitrarily sized sentence representations.

  3. We validate our approach on the Polish language. We conduct an evaluation on eight tasks, including the following problems: paraphrase identification, sentiment analysis, natural language inference, semantic relatedness, and topic classification.

  4. For the purposes of the evaluation, we publish the Polish Paraphrase Corpus (PPC). It is a new dataset consisting of 7000 manually labeled sentence pairs from different sources, each assigned to one of three categories: exact paraphrases, close paraphrases, and non-paraphrases. Most of the examples have high semantic overlap, which makes the task challenging for classification models.

  5. We make the source code for paraphrase mining, model fine-tuning, and evaluation publicly available at https://github.com/sdadas/polish-sentence-evaluation. The code allows training sentence encoders and replicating our results for any language.

III Related work

Since the popularization of word embedding models such as Word2Vec [mikolov2013efficient, mikolov2013distributed], GloVe [pennington2014glove], or FastText [bojanowski2017enriching], there have been efforts to develop effective vector representations for larger fragments of text. Early approaches employed simple aggregation techniques over the individual word vectors, usually by computing an arithmetic or weighted mean. More advanced methods based on static word representations have also been developed. arora2017a introduced Smooth Inverse Frequency (SIF), a weighted mean combined with principal component analysis (PCA), while shen2018baseline showed that concatenating mean and max pooled vectors improves the quality of sentence embeddings. Aggregation-based approaches offer intuitive and easy-to-use baselines, but the quality of the resulting representations is inferior to more recent models.
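
As a concrete illustration of these aggregation baselines, the sketch below (assuming word_vectors is a token-to-vector mapping loaded from a static embedding model such as GloVe or FastText) builds a sentence embedding by concatenating mean- and max-pooled word vectors:

```python
import numpy as np

def aggregate_sentence_vector(tokens, word_vectors):
    """Concatenate mean and max pooling of static word vectors,
    in the spirit of the aggregation baselines described above."""
    # word_vectors maps a token string to a 1-D numpy array
    stacked = np.stack([word_vectors[t] for t in tokens if t in word_vectors])
    return np.concatenate([stacked.mean(axis=0), stacked.max(axis=0)])
```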

Most of the modern methods are based on artificial neural networks trained on a sentence-level optimization objective. Some of these models employ self-supervised learning and can be trained using only raw text corpora, while others are fully supervised and require labeled datasets for training. The model proposed by le2014distributed was one of the first self-supervised approaches. It is a simple neural architecture for learning fixed-length paragraph vectors from word embedding models. The network was trained to predict the next word in the document from the representation of previously encoded words. Skip-Thought vectors [kiros2015skipthought] are another notable example of self-supervised methods. It is an encoder-decoder network in which the input sentence is first encoded to a dense representation, and then the model is expected to reconstruct the previous and the next sentence from the same document. This method was followed by other similar architectures that improved on the original model [gan2017learning, logeswaran2018an]. Recent neural sentence encoders typically do not involve optimizing the model from scratch; instead, they rely on pre-trained language models, the most popular of which are Transformer-based. The latest self-supervised approaches propose fine-tuning a language model using denoising or contrastive objectives. TSDAE [wang-etal-2021-tsdae-using] tries to reconstruct the original sentence from a damaged input, SimCSE [gao-etal-2021-simcse] learns to identify two versions of the same sentence encoded with different dropout masks, and Contrastive Tension [carlsson2021semantic] trains two independent models on a noise-contrastive task.

State-of-the-art sentence embedding models, achieving the highest performance on semantic retrieval tasks, are optimized using supervised learning. Some of the earlier popular models of this type include InferSent [conneau2017supervised] and Universal Sentence Encoder [cer2018universal], both trained on the SNLI [bowman2015large] corpus. Recently, several new Transformer-based approaches to learning sentence encoders have been developed as a part of the Sentence-Transformers library (https://www.sbert.net/). The original method [reimers2019sentence] is based on a siamese neural network architecture composed of two Transformer models with shared parameters. Fine-tuning the network involves minimizing the distance between representations of similar sentences and maximizing it for different sentences. The authors experimented with several loss functions, including cross-entropy loss, mean squared error loss on the cosine similarity between vectors, triplet loss, and multiple negatives ranking loss. The last one proved to produce the highest quality sentence embeddings. Pre-trained models are provided along with the library, the most recent of which were trained on a dataset of over one billion English sentence pairs.

The availability of pre-trained models and labeled datasets of sentence pairs is lower for languages other than English. Currently, the best option for these languages is to use multilingual sentence encoders, which offer reasonably good performance on semantic tasks. We can consider multilingual models as a separate group of methods since they are usually learned in a different way than the approaches described above. More specifically, these models are trained using large cross-lingual text corpora, exploiting the semantic similarity between aligned sentence pairs in different languages. Since these models exploit the large volume of cross-lingual data available on the Internet, training an effective multilingual encoder is computationally intensive and typically requires at least several hundred gigabytes of text. In recent years, a few pre-trained multilingual sentence encoders have been published. 10.1162/tacl_a_00288 released LASER, a neural sentence encoder which can handle 93 languages. A multilingual version of the Universal Sentence Encoder supporting 16 languages [yang-etal-2020-multilingual] has also been made publicly available. LaBSE [feng2020language] is another popular model, based on multilingual BERT [devlin-etal-2019-bert] and fine-tuned for semantic retrieval tasks in 112 languages. A different approach to training multilingual encoders has been shown in reimers-gurevych-2020-making. The publication proposes a method for transferring knowledge from a pre-trained English sentence encoder (teacher) to a pre-trained multilingual language model (student) by minimizing the distance between the English sentence vector and the sentence vectors corresponding to translated sentences. This fine-tuning procedure requires less data than training multilingual encoders from scratch. Several pre-trained models created with this method have already been published in the Sentence-Transformers library.

Fig. 1: A diagram showing our procedure for paraphrase extraction from a bilingual corpus.

IV Methodology

In this section, we describe our approach to training Transformer-based sentence encoders from automatically mined paraphrases. First, we characterize our method of extracting paraphrase pairs from parallel corpora. Next, we present the architecture of the neural model employed in this study and describe the procedure of training the model on the collected data.

IV-A Paraphrase extraction

The main idea of our paraphrase extraction approach is to utilize sentence-aligned cross-lingual text corpora. One of the largest available collections of such corpora is the OPUS project [tiedemann2012parallel], which contains several dozen datasets covering almost all modern languages. In this study, two of them were used, which proved to work best with our method: OpenSubtitles [lison-tiedemann-2016-opensubtitles2016] and CCMatrix [schwenk-etal-2021-ccmatrix]. They both include multiple alternative translations of the same sentence, a characteristic that can be exploited for paraphrase mining.

A single run of our algorithm extracts paraphrases from a bilingual dataset. First, we select the source and target language. The target language is the one on which the model is trained, while the source language can be chosen arbitrarily. Usually, it is preferred to use English as the source language since most bilingual data is available for it. After downloading the sentence-aligned corpus, we perform a data filtering step using a pre-trained multilingual sentence encoder. This step is necessary to improve the quality of the resulting paraphrases because the original corpus is often noisy, containing translation errors and misaligned sentences. We compute the cosine similarity between a pair of sentences and discard pairs that are below a certain threshold. In our experiments, we used the paraphrase-xlm-r-multilingual-v1 model from the Sentence-Transformers library to filter the data, and set a threshold of 0.7. In the next step, we group sentences in the target language corresponding to the same source sentences. We can then generate paraphrases from all groups containing at least two sentences. Within each group, we randomly select sentence pairs such that the resulting pairs contain at least one occurrence of each sentence. The dataset created in this way can then be used to train a sentence encoder. The procedure described above is shown graphically in Figure 1.
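
The procedure can be summarized with the sketch below. It is not the authors' released implementation; it assumes the corpus is already loaded as a list of (source, target) sentence pairs and reuses the filtering model and threshold mentioned above, while the pairing strategy inside each group (chaining shuffled sentences) is one possible way to guarantee that every sentence appears in at least one pair:

```python
import random
from collections import defaultdict

import torch
from sentence_transformers import SentenceTransformer

def mine_paraphrases(pairs, threshold=0.7, seed=0):
    """pairs: list of (source_sentence, target_sentence) tuples taken from
    a sentence-aligned bilingual corpus (e.g. English-Polish)."""
    filter_model = SentenceTransformer("paraphrase-xlm-r-multilingual-v1")
    src = filter_model.encode([s for s, _ in pairs], convert_to_tensor=True)
    tgt = filter_model.encode([t for _, t in pairs], convert_to_tensor=True)
    # Step 1: discard noisy alignments below the cosine similarity threshold
    scores = torch.nn.functional.cosine_similarity(src, tgt, dim=1)
    kept = [p for p, s in zip(pairs, scores.tolist()) if s >= threshold]
    # Step 2: group target-language sentences by their source sentence
    groups = defaultdict(set)
    for source, target in kept:
        groups[source].add(target)
    # Step 3: within each group of alternative translations, sample pairs
    # so that every sentence occurs in at least one paraphrase pair
    rng = random.Random(seed)
    paraphrases = []
    for sentences in groups.values():
        sentences = list(sentences)
        if len(sentences) < 2:
            continue
        rng.shuffle(sentences)
        paraphrases.extend(zip(sentences, sentences[1:]))
    return paraphrases
```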

IV-B Neural sentence encoder

Currently, a common approach to training sentence encoders is to use a pre-trained language model such as BERT [devlin-etal-2019-bert] or RoBERTa [liu2019roberta]. Such models already contain sufficient semantic knowledge, and they only need to be fine-tuned to efficiently encode representations of sentences for semantic retrieval tasks. In this study, we employ an approach similar to the one proposed by reimers2019sentence. We create a siamese network composed of two Transformer models with tied weights, initialized with a pre-trained language model. This architecture is then trained on a dataset of sentence pairs. Each Transformer network independently produces a sentence vector, and the vectors are compared using a similarity function. The sentence representation is generated by the model from individual token embeddings using a pooling operation.

reimers2019sentence experimented with simple pooling strategies, such as using the first (CLS) token or computing the mean or max over all token vectors. These pooling methods are fast to compute but have a significant drawback: they restrict the size of the sentence embedding to the size of the token vectors.

We believe that higher-dimensional sentence representations would preserve more semantic information, allowing higher performance on some tasks. Therefore, we propose a pooling operation based on an additional LSTM layer. The layer is placed after the last encoder block of the Transformer model. It takes a sequence of token vectors as input and returns a single vector representing the whole sentence. For the sentence embedding, we use the last hidden state of the LSTM cell. The architecture of the proposed sentence encoder is shown in Figure 2. This approach allows an arbitrary number of dimensions to be set for the resulting sentence vector, since the size of the LSTM cell does not depend on the size of the input.
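
A minimal sketch of this architecture in PyTorch is shown below. It is an illustration rather than the exact training code: the base checkpoint name is a placeholder (the study initializes from a Polish RoBERTa base model), and padding is not masked inside the LSTM for brevity.

```python
from torch import nn
from transformers import AutoModel

class LSTMPoolingEncoder(nn.Module):
    """Transformer encoder followed by an LSTM pooling layer; the last
    hidden state of the LSTM is used as the sentence embedding."""

    def __init__(self, model_name="xlm-roberta-base", embedding_dim=2048):
        super().__init__()
        self.transformer = AutoModel.from_pretrained(model_name)
        token_dim = self.transformer.config.hidden_size
        # The LSTM cell size defines the sentence embedding dimension,
        # independently of the Transformer's hidden size.
        self.lstm = nn.LSTM(token_dim, embedding_dim, batch_first=True)

    def forward(self, input_ids, attention_mask):
        token_states = self.transformer(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state                    # (batch, seq_len, token_dim)
        _, (last_hidden, _) = self.lstm(token_states)
        return last_hidden.squeeze(0)          # (batch, embedding_dim)
```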

Fig. 2: The architecture of our neural sentence encoder. The model is based on a standard Transformer architecture with an embedding layer and a number of self-attention blocks. The input of the model is a sequence of tokens, and the last of the encoder blocks outputs a vector representation of each token. The sentence representation is built from the individual token representations using an additional LSTM layer. The last hidden state of the recurrent layer is used as the sentence embedding.

We fine-tune the model with a mini-batch version of the AdamW [loshchilov2018decoupled] algorithm. Every batch of size $N$ is composed of sentence pairs represented by their embeddings $(u_i, v_i)$ encoded by the model:

$$X = \{(u_1, v_1), (u_2, v_2), \ldots, (u_N, v_N)\} \qquad (1)$$

Let us call the first sentence in each pair the anchor. The second sentence in the same pair is its positive sentence, and the second sentences in all other pairs are its negative sentences. During training, every anchor $u_i$ is compared to all representations $v_j$. We expect the similarity of the vectors to be high only for the positive pairs and low for all other pairs. Specifically, we use the multiple negatives ranking loss function [Henderson2017EfficientNL]:

$$\mathcal{L}(X; \theta) = -\frac{1}{N} \sum_{i=1}^{N} \left[ \mathrm{sim}(u_i, v_i) - \log \sum_{j=1}^{N} e^{\mathrm{sim}(u_i, v_j)} \right] \qquad (2)$$

where $X$ is the current mini-batch, $\theta$ denotes the model parameters, and $\mathrm{sim}(u, v)$ is a similarity function comparing the sentence representations $u$ and $v$. In our case, the cosine similarity is used as the similarity function, defined as follows:

$$\mathrm{sim}(u, v) = \frac{\sum_{k=1}^{d} u_k v_k}{\sqrt{\sum_{k=1}^{d} u_k^2}\,\sqrt{\sum_{k=1}^{d} v_k^2}} \qquad (3)$$

where both $u$ and $v$ are $d$-dimensional vectors, and $u_k$ denotes the $k$-th dimension of $u$.
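
A sketch of this loss in PyTorch, assuming the anchors and positives have already been encoded into (N, d) tensors, is given below. With scale=1.0 the cross-entropy over each similarity row reproduces Eq. (2); the optional scaling of the cosine similarities is a common implementation detail rather than part of the equation:

```python
import torch
import torch.nn.functional as F

def multiple_negatives_ranking_loss(anchors, positives, scale=1.0):
    """anchors, positives: (N, d) sentence embeddings of the paired
    sentences in the current mini-batch. The positive of pair i acts
    as a negative example for every other anchor j != i."""
    # Cosine similarity between every anchor and every positive: (N, N)
    sim = F.cosine_similarity(anchors.unsqueeze(1), positives.unsqueeze(0), dim=-1)
    labels = torch.arange(anchors.size(0), device=anchors.device)
    # Cross-entropy over each row pushes sim(u_i, v_i) above sim(u_i, v_j)
    # for all j != i, which is exactly the objective in Eq. (2).
    return F.cross_entropy(sim * scale, labels)
```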

V Experiments

In this section, we present the results of our experiments. The method presented in this paper was evaluated on eight linguistic tasks in Polish and compared with other publicly available neural sentence encoders. In our experiments, we follow the evaluation approach of the SentEval [conneau-kiela-2018-senteval] toolkit. The evaluated sentence encoders are not fine-tuned for specific tasks; they are only used to generate sentence embeddings. For each task, a simple neural network with one hidden layer is trained, which takes these static sentence representations as input and outputs a class label or regression score. For classification tasks, we use accuracy as the evaluation metric. For semantic relatedness, Spearman's rank correlation coefficient is used.
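
To make the protocol concrete, a classification probe along these lines could look as follows. This is a sketch using scikit-learn; the hidden layer size and the regularization grid are illustrative choices, not the exact SentEval configuration:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def evaluate_classification(train_x, train_y, val_x, val_y, test_x, test_y):
    """Train a single-hidden-layer probe on frozen sentence embeddings,
    choosing the L2 regularization strength on the validation split."""
    best_model, best_val = None, -np.inf
    for alpha in (1e-4, 1e-3, 1e-2, 1e-1):
        clf = MLPClassifier(hidden_layer_sizes=(256,), alpha=alpha, max_iter=500)
        clf.fit(train_x, train_y)
        score = clf.score(val_x, val_y)
        if score > best_val:
            best_val, best_model = score, clf
    return best_model.score(test_x, test_y)   # accuracy on the test split
```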

For the purpose of our experiments, we created a new manually annotated dataset for paraphrase identification: the Polish Paraphrase Corpus. First, we describe the corpus and the process of its development. Next, we present the other datasets used in the evaluation. We then describe the neural sentence encoders for Polish trained by us, as well as the other methods included in the comparison. The section concludes with a discussion of the results.

V-A Polish Paraphrase Corpus

Polish Paraphrase Corpus contains 7000 manually labeled sentence pairs. The dataset was divided into training, validation and test splits. The training part includes 5000 examples, while the other parts contain 1000 examples each. The main purpose of creating such a dataset was to verify how machine learning models perform in the challenging problem of paraphrase identification, where most records contain semantically overlapping parts. Technically, this is a three-class classification task, where each record can be assigned to one of the following categories:

  • Exact paraphrases - Sentence pairs that convey exactly the same information. We are interested only in the semantic meaning of the sentence; therefore, this category also includes sentences that are semantically identical but, for example, have a different emotional emphasis.

  • Close paraphrases - Sentence pairs with similar semantic meaning. This category includes all pairs which share the same information but may additionally contain other, semantically non-overlapping parts. It also covers context-dependent paraphrases - sentence pairs that may have the same meaning in some contexts but differ in others.

  • Non-paraphrases - All other cases, including contradictory sentences and semantically unrelated sentences.

The corpus contains 2911, 1297, and 2792 examples for the above three categories, respectively. The process of annotating the dataset was preceded by an automated generation of candidate pairs, which were then manually labeled. We experimented with two popular techniques for generating possible paraphrases: backtranslation with a set of neural machine translation models, and paraphrase mining using a pre-trained multilingual sentence encoder. The extracted sentence pairs are drawn from different data sources: Tatoeba (https://tatoeba.org/), Polish news articles, Wikipedia, and the Polish version of the SICK dataset [dadas-etal-2020-evaluation]. Since most of the sentence pairs obtained in this way fell into the first two categories, some of the examples were manually modified to convey different information in order to balance the dataset. As a result, even negative examples often have high semantic overlap, making this problem difficult for machine learning models.
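
For illustration, candidate mining with a multilingual encoder can be done with the paraphrase_mining utility from the Sentence-Transformers library; the model name and the toy sentences below are examples, not the exact setup used during annotation:

```python
from sentence_transformers import SentenceTransformer, util

# Mining paraphrase candidates with a multilingual encoder (illustrative only).
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")
sentences = [
    "Kot siedzi na macie.",
    "Na macie siedzi kot.",
    "Jutro będzie padać deszcz.",
]
# Returns a list of [score, i, j] triples sorted by decreasing similarity.
for score, i, j in util.paraphrase_mining(model, sentences, top_k=5):
    print(f"{score:.3f}\t{sentences[i]}\t{sentences[j]}")
```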

V-B Other datasets

Below we briefly describe the other datasets used in our experiments:

  • Wroclaw Corpus of Consumer Reviews Sentiment (WCCRS) [Kocon2019] - A Polish sentiment analysis dataset containing consumer reviews of products and services, assigned to four classes: positive, negative, neutral and ambiguous. The corpus contains opinions relating to four domains: hotels, medical services, products, and education. Two of those domains, hotels and medical services, contain sentence-level annotations. We include them in our evaluation as Consumer Reviews - Hotels (CR-H) and Consumer Reviews - Medicine (CR-M) tasks.

  • Sentences Involving Compositional Knowledge (SICK) [dadas-etal-2020-evaluation] - This corpus is a manually translated version of the English SICK dataset [marelli2014a], containing 10,000 sentence pairs with two types of annotations. Each pair has a numerical score of semantic relatedness between the sentences and, additionally, a natural language inference (NLI) category label: entailment, neutral, or contradiction. We therefore include two evaluation tasks based on this dataset: classification (SICK-E) and regression (SICK-R).

  • Compositional Distributional Semantics Corpus (CDSC) [wroblewska-krasnowska-kieras-2017-polish] - A different corpus using the same annotation format as the SICK dataset. Like the original, this corpus also contains 10,000 examples annotated with a semantic relatedness score and an NLI label. As with the previously described dataset, we include two evaluation tasks in this case: CDSC-E and CDSC-R.

  • 8TAGS [dadas-etal-2020-evaluation] - A collection of sentences relating to popular topics discussed on the Internet. It contains about 50,000 sentences annotated with 8 topic labels: film, history, food, medicine, motorization, work, sport and technology. A multi-class classification task (8TAGS) based on this corpus was included in our experiments.

V-C Details of the experiments

Neural sentence encoders developed as part of this study were trained on an automatically extracted collection of over 7 million sentence pairs. The dataset was constructed by running our paraphrase mining algorithm on English-Polish bilingual data from OpenSubtitles [lison-tiedemann-2016-opensubtitles2016] and CCMatrix [schwenk-etal-2021-ccmatrix]. We trained four neural architectures with different pooling layers. The first model used standard mean pooling, while the other three employed LSTM pooling with increasing cell memory sizes: 1024, 2048, and 4096. The weights of all sentence encoders were initialized with a pre-trained Polish RoBERTa base language model [dadas2020pre], and the same set of hyperparameters was used for fine-tuning. We used a mini-batch size of 64 and employed a training scheduler with a linearly decreasing learning rate and a warmup phase for the first 10% of update steps. The peak learning rate was set to . Each model was fine-tuned for three epochs, which took about 24 hours on a single Nvidia V100 GPU.
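
The optimization setup can be sketched as follows. The peak learning rate is a placeholder argument, since its exact value is not reproduced above; the batch size, warmup fraction, and epoch count follow the description:

```python
import torch
from transformers import get_linear_schedule_with_warmup

def configure_optimization(model, num_pairs, batch_size=64, epochs=3, peak_lr=1e-5):
    """AdamW with a linear decay schedule and a warmup phase covering the
    first 10% of update steps. peak_lr is a placeholder, not the paper's value."""
    steps_per_epoch = (num_pairs + batch_size - 1) // batch_size
    total_steps = steps_per_epoch * epochs
    optimizer = torch.optim.AdamW(model.parameters(), lr=peak_lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * total_steps),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```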

All datasets used in the evaluation have separate training, validation, and test parts. The results reported by us refer to the performance of the models on the test split of each dataset. As in SentEval [conneau-kiela-2018-senteval], the training part is used for training a single-layer classifier on top of the generated sentence embeddings, and the validation part is used for selecting the optimal regularization term for this classifier.

In addition to the models described in this paper, we test several other neural sentence representations for Polish. We verify how pre-trained neural language models perform on sentence-level tasks without additional fine-tuning on paraphrase data. For Polish, there are two high-quality models that were included in our study, Polish RoBERTa [dadas2020pre] and HerBERT [mroczkowski-etal-2021-herbert], each available in base and large variants. We also test all available multilingual models from the Sentence-Transformers library, fine-tuned using the multilingual knowledge distillation method [reimers-gurevych-2020-making]. Finally, we include other commonly used neural sentence encoders: LASER [10.1162/tacl_a_00288], mUSE [yang-etal-2020-multilingual], and LaBSE [feng2020language].

V-D Results

| Model | Avg. | PPC | CR-H | CR-M | CDSC-E | SICK-E | CDSC-R | SICK-R | 8TAGS |
|---|---|---|---|---|---|---|---|---|---|
| Pre-trained language models + mean pooling | | | | | | | | | |
| Polish RoBERTa (base) | 75.13 | 70.60 | 85.14 | 79.21 | 81.40 | 71.89 | 79.53 | 62.26 | 71.00 |
| HerBERT (base) | 75.58 | 67.10 | 86.62 | 82.32 | 84.80 | 68.55 | 83.66 | 56.86 | 74.70 |
| Polish RoBERTa (large) | 76.64 | 65.40 | 87.90 | 83.56 | 82.60 | 68.59 | 83.89 | 61.84 | 79.30 |
| HerBERT (large) | 78.84 | 66.40 | 87.58 | 83.31 | 84.90 | 75.13 | 85.27 | 69.19 | 78.91 |
| Sentence encoders (Sentence-Transformers library) | | | | | | | | | |
| paraphrase-multilingual-MiniLM-L12-v2 | 78.62 | 76.80 | 80.24 | 78.81 | 85.70 | 78.45 | 88.13 | 71.60 | 69.19 |
| distilbert-multilingual-nli-stsb-quora-ranking | 78.80 | 80.00 | 80.44 | 74.02 | 86.70 | 79.66 | 87.23 | 73.61 | 68.73 |
| distiluse-base-multilingual-cased-v2 | 78.85 | 73.80 | 79.92 | 75.95 | 87.90 | 78.68 | 90.54 | 73.12 | 70.86 |
| xlm-r-bert-base-nli-stsb-mean-tokens | 79.91 | 80.10 | 81.15 | 80.69 | 86.00 | 81.02 | 85.69 | 75.25 | 69.37 |
| paraphrase-xlm-r-multilingual-v1 | 81.44 | 82.80 | 82.50 | 80.74 | 86.70 | 82.08 | 90.14 | 76.08 | 70.49 |
| paraphrase-multilingual-mpnet-base-v2 | 81.72 | 81.30 | 85.46 | 82.67 | 86.70 | 79.49 | 89.95 | 75.68 | 72.51 |
| Sentence encoders (other methods) | | | | | | | | | |
| mUSE | 78.79 | 72.70 | 79.02 | 74.22 | 86.30 | 81.96 | 90.23 | 76.61 | 69.26 |
| LASER | 80.07 | 80.10 | 81.21 | 78.12 | 87.90 | 82.21 | 89.30 | 76.65 | 65.07 |
| LaBSE | 81.20 | 77.10 | 86.10 | 80.64 | 86.70 | 81.63 | 90.02 | 76.04 | 71.36 |
| Our sentence encoder with different pooling layers | | | | | | | | | |
| Mean pooling | 81.53 | 79.40 | 85.78 | 80.94 | 87.90 | 80.35 | 88.57 | 76.56 | 72.78 |
| LSTM pooling (1024) | 81.74 | 80.00 | 85.07 | 81.38 | 86.90 | 80.55 | 89.46 | 77.40 | 73.19 |
| LSTM pooling (2048) | 82.48 | 83.40 | 85.26 | 81.58 | 87.30 | 81.49 | 89.70 | 78.48 | 72.60 |
| LSTM pooling (4096) | 82.52 | 82.50 | 86.49 | 81.04 | 86.90 | 82.39 | 90.06 | 78.99 | 71.75 |

TABLE I: Evaluation of sentence representations on eight tasks for Polish: paraphrase identification (PPC), sentiment analysis (CR-H, CR-M), natural language inference (CDSC-E, SICK-E), semantic relatedness (CDSC-R, SICK-R), and topic classification (8TAGS). For classification tasks we use accuracy as the evaluation metric; for semantic relatedness tasks (CDSC-R and SICK-R) Spearman's rank correlation coefficient is used.

The results of our experiments are shown in Table I. The table is divided into four sections, each corresponding to a specific group of sentence encoding methods. The first group includes pre-trained neural language models based on the Transformer architecture. In this part of the evaluation, we utilized the original model weights without fine-tuning them for encoding sentences. In this case, the sentence representation is constructed by aggregating the individual token vectors generated by the last layer of the network using a mean pooling operation. As we can see, these representations perform well on conventional classification tasks such as sentiment analysis or topic classification but perform considerably worse on tasks that require measuring the semantic relationship between sentences. On the SICK, CDSC, and PPC datasets, their results are up to 13% lower compared to the other evaluated solutions.

The second group consists of sentence encoders from the Sentence-Transformers library trained using the multilingual knowledge distillation method [reimers-gurevych-2020-making]. These models offer varying performance on Polish language tasks. However, two of them stand out: paraphrase-xlm-r-multilingual-v1 and paraphrase-multilingual-mpnet-base-v2. Both achieved an average score above 81% across all tasks. The former was particularly good on the semantic retrieval tasks while scoring lower on typical classification tasks. The latter, on the other hand, obtained balanced scores on all types of tasks.

Three popular neural sentence embedding methods published by Google and Facebook are included in the third group. Each of them performs well on semantic relatedness and natural language inference problems, while results on the other tasks often fall short of the performance offered by competing approaches. Of the three architectures compared, LaBSE performed best by achieving an average score above 81%. Reasonable performance is also offered by the LASER model with an average score of around 80%. However, in both cases their results were lower than those of the best models available in the Sentence-Transformers library.

The last section presents the results of our sentence encoders with different pooling methods. The model using standard mean pooling achieved an average score of 81.53%, which is the second-best result when compared with the previously discussed methods. Models incorporating an additional LSTM-based pooling layer further improved this result. We can see that increasing the dimension of the sentence embedding has a positive effect on model performance. The difference is especially noticeable between LSTM layers with a hidden size of 1024 and 2048, while increasing the dimension further brings smaller improvements. The two largest models achieved an average score over all tasks of about 82.5%. It is also worth noting that all our models offer balanced performance across different tasks. On conventional classification problems, only the pre-trained language models perform better, whereas on paraphrase identification, natural language inference, and semantic relatedness, our encoders perform as well as or better than the best multilingual sentence encoders.

VI Conclusion

In this paper, we have proposed a method for training neural sentence encoders without manually labeled data. Our approach involves automatic extraction of paraphrase pairs from sentence-aligned bilingual text corpora. Such datasets are readily available for many languages, so our technique is particularly suitable for training models in low-resource languages for which there is little or no annotated data. Using the extracted sentence pairs, we fine-tune a language model based on the Transformer architecture with an additional recurrent pooling layer responsible for generating sentence embeddings. The method allows us to train an effective language-specific sentence encoder in a short time on a single GPU, outperforming state-of-the-art multilingual models whose training required significantly more computational resources.

We validated our approach on eight tasks for the Polish language. For evaluation purposes, we also developed a new dataset that includes 7000 manually annotated sentence pairs, the Polish Paraphrase Corpus (PPC). Our sentence embeddings have shown high quality on a variety of linguistic problems, performing well both on semantic retrieval tasks and on sentence-level classification tasks.

References