
Paraphrasing via Ranking Many Candidates

by Joosung Lee, et al.

We present a simple and effective way to generate a variety of paraphrases and find a good-quality paraphrase among them. As previous studies have shown, it is difficult to ensure that one generation method always produces the best paraphrase across domains. Therefore, we focus on finding the best candidate among multiple candidates, rather than assuming that a single combination of generative model and decoding option suffices. Our approach is easy to apply in various domains and performs sufficiently well compared to previous methods. In addition, our approach can be used for data augmentation that extends the downstream corpus, and we show that it helps improve performance on English and Korean datasets.



1 Introduction

Paraphrasing is the task of rewriting a given source sentence with different words and phrases while maintaining its meaning. A paraphrase system can be used to add variability to a source sentence and expand it into sentences containing more linguistic information. Paraphrasing has been studied in close association with various NLP tasks such as data augmentation, information retrieval, and question answering.

In the supervised approach patro-etal-2018-learning, the model can be trained to generate the paraphrase directly, but this requires a parallel dataset. Such parallel datasets are expensive to create and rarely cover diverse domains. Therefore, in recent years, many studies bowman-etal-2016-generating; Miao_Zhou_Mou_Yan_Li_2019; liu-etal-2020-unsupervised have explored unsupervised approaches that learn paraphrase generation using only a non-parallel corpus. In addition, there are studies mallinson-etal-2017-paraphrasing; thompson-post-2020-paraphrase that attempt to paraphrase with machine translation models trained on widely available public translation corpora (e.g., the language pairs provided in WMT). Various models have been developed with these methods, but no single model can guarantee the best performance on all datasets. Therefore, our goal is not to design language models or machine translation systems, but to find the best candidates among paraphrases generated by various methods and use them for downstream tasks.

We paraphrase based on a machine translation model whose encoder maps sentences with the same meaning in different languages to the same latent representation. Our system paraphrases the source sentences with two frameworks and several decoding options, as described in Section 2. Paraphrase candidates generated by these combinations are ranked according to fluency, diversity, and semantic scores. Finally, the system selects a paraphrase that uses different words from the source sentence but is natural and semantically similar.

The performance and effectiveness of the proposed system are verified in two ways. First, our model is evaluated on datasets that provide paraphrase pairs. We use QQP (Quora Question Pairs) patro-etal-2018-learning and the medical domain dataset medical, and evaluate with multiple metrics by comparing the generated paraphrases with the gold references. Second, we use our system for data augmentation in downstream tasks. We augment financial phrasebank Malo2014GoodDO and hate speech (eng) gibert2018hate in English and hate speech (kor) moon-et-al-2020-beep in Korean to improve performance on the classification tasks.

Our system outperforms the previous supervised and unsupervised approaches in terms of the semantic and fluency scores, and its diversity score is similar to that of the latest unsupervised approaches. In addition, our system improves the performance of downstream tasks in a scenario where training data is limited. Finally, our paraphrasing has the advantage that it can be applied not only to English but also to various other languages.

2 Methods

2.1 Pre-trained Model

We use M2M100 fan2020beyond as the backbone model so that the system can be used not only in English but also in various other languages. M2M100 is a multilingual encoder-decoder model that can handle 100 languages; we use two versions, M2M100-small and M2M100-large.

2.2 Generate Paraphrase Candidates

We generate paraphrase candidates with two frameworks, which differ in how the encoder and decoder are combined.

2.2.1 Src-Encoder+Src-Decoder

Framework-1 uses only one language (i.e., the source language). The decoder generates paraphrase candidates directly from the encoded vector of the source sentence. This framework is similar to an auto-encoder, but since the paraphrase model is based on a translation system, its purpose is to generate a sentence with the same meaning rather than to reconstruct the input.

2.2.2 Back-Translation

If candidate sentences are generated only with the framework in Section 2.2.1, diversity decreases, so framework-2 uses two languages to generate more candidates. In other words, we use back-translation sennrich-etal-2016-improving to translate the source sentence into a target language and then translate it back into the source language. Because back-translation depends on the performance of the translation system, context information can sometimes be lost, but it can generate varied candidates. M2M100 supports 100 languages, but we selected English, Korean, French, Japanese, Chinese, German, and Spanish as the language pool.
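The round trip described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `translate` callable is a hypothetical stand-in for any translation backend (such as M2M100), and the ISO 639-1 language codes are an assumed encoding of the language pool named in the text.

```python
from typing import Callable, List

# Language pool listed in the text; the two-letter codes are an assumption.
LANGUAGE_POOL = ["en", "ko", "fr", "ja", "zh", "de", "es"]

# Hypothetical translator signature: (text, src_lang, tgt_lang) -> translated text.
Translate = Callable[[str, str, str], str]


def back_translate(sentence: str, src_lang: str, translate: Translate,
                   pool: List[str] = LANGUAGE_POOL) -> List[str]:
    """Return one round-trip candidate per pivot language in the pool."""
    candidates = []
    for pivot in pool:
        if pivot == src_lang:
            continue  # same-language decoding belongs to framework-1
        forward = translate(sentence, src_lang, pivot)   # source -> pivot
        candidates.append(translate(forward, pivot, src_lang))  # pivot -> source
    return candidates
```

With the seven-language pool and an English source, this yields six candidates, one per pivot language.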

2.2.3 Decoder Options

When generating paraphrase candidates, we expand the set of candidates by adding various options to the decoder.

In framework-1, beam search with 10 beams is used and the top-5 candidate sentences are generated. In addition, the following blocking restrictions are applied. (1) Output tokens are restricted so that no consecutive run of tokens shared with the source sentence is longer than half the length of the source sentence. (2) The output sentence is prevented from containing repeated 3-grams.

In framework-2, beam search with 3 beams is used in both the forward and backward passes, and the top-1 candidate sentence is generated; the remaining settings are the same as in framework-1.
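To make the two blocking restrictions concrete, the sketch below checks them on an already generated output. It is only an illustration under assumptions: the paper applies the restrictions during decoding, whereas these hypothetical helpers verify them afterwards on whitespace-split tokens.

```python
from typing import List


def has_repeated_ngram(tokens: List[str], n: int = 3) -> bool:
    """True if any n-gram occurs more than once in the token sequence."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False


def longest_shared_run(source: List[str], output: List[str]) -> int:
    """Length of the longest contiguous token run common to both sequences."""
    best = 0
    prev = [0] * (len(output) + 1)
    for s_tok in source:
        curr = [0] * (len(output) + 1)
        for j, o_tok in enumerate(output, start=1):
            if s_tok == o_tok:
                curr[j] = prev[j - 1] + 1  # extend the run ending at j - 1
                best = max(best, curr[j])
        prev = curr
    return best


def satisfies_blocking(source: List[str], output: List[str]) -> bool:
    """Restriction (1): shared run at most half the source length.
    Restriction (2): no repeated 3-gram in the output."""
    return (longest_shared_run(source, output) <= len(source) / 2
            and not has_repeated_ngram(output, 3))
```

For a six-token source, an output may share a run of at most three consecutive tokens with it.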

2.3 Ranking and Filtering

We filter the paraphrase candidates through several scores to select the best paraphrase. All ranking and filtering steps compute scores on lowercased text to eliminate differences due to letter case. The candidates with poor scores in each filtering step are discarded.

2.3.1 Overlapping

We remove candidates that are identical to the source sentence and remove duplicates among the remaining candidates. Candidates that differ only in spaces or in letter case are considered to be the same sentence. The sentences that remain after this step form the candidate set for the next step.
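A minimal sketch of this step is shown below. The function names are hypothetical, and treating "differ only in spaces" as "identical once all whitespace is removed" is one reading of the text, labeled here as an assumption.

```python
from typing import List


def normalize(sentence: str) -> str:
    """Lowercase and drop all whitespace (assumed reading of the text)."""
    return "".join(sentence.lower().split())


def remove_overlapping(source: str, candidates: List[str]) -> List[str]:
    """Drop candidates equal to the source or to an earlier candidate."""
    seen = {normalize(source)}
    kept = []
    for cand in candidates:
        key = normalize(cand)
        if key not in seen:
            seen.add(key)
            kept.append(cand)  # keep the first surface form encountered
    return kept
```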

2.3.2 Diversity

We measure diversity by comparing the candidates remaining after Section 2.3.1 with the source sentence. We use word error rate (WER) MorrisMG04 as the diversity metric, where a higher score means higher diversity. WER is based on the Levenshtein distance between the source sentence and a candidate, and works at the word level instead of the phoneme level. WER was originally proposed to measure the performance of automatic speech recognition systems, but we use it to measure the difference between sentences. In this step, only the min(5, N/2) sentences with the highest diversity scores are kept, where N is the number of candidates remaining after Section 2.3.1.
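The diversity filter can be sketched as below. The normalization by source length follows the usual definition of WER. Rounding N/2 down and keeping at least one sentence are assumptions, since the text does not specify them.

```python
from typing import List


def word_error_rate(source: str, candidate: str) -> float:
    """Word-level Levenshtein distance divided by the source length."""
    ref, hyp = source.lower().split(), candidate.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution or match
        prev = curr
    return prev[-1] / max(len(ref), 1)


def keep_most_diverse(source: str, candidates: List[str]) -> List[str]:
    """Keep the min(5, N/2) candidates with the highest WER."""
    k = max(1, min(5, len(candidates) // 2))  # floor and lower bound assumed
    ranked = sorted(candidates,
                    key=lambda c: word_error_rate(source, c), reverse=True)
    return ranked[:k]
```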

2.3.3 Fluency

To evaluate fluency, we measure PPL (perplexity) using a language model. Fluency indicates the naturalness of the sentence, and the lower the PPL, the better the fluency. We use GPT2-medium radford2019language as the language model and keep only the min(3, N/2) sentences with the lowest PPL, where N is the number of candidates remaining after Section 2.3.2.
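As a rough sketch, perplexity is the exponential of the negative mean token log-probability. The code below assumes the per-token natural-log probabilities have already been obtained from a language model such as GPT2-medium. The helper names, the floor on N/2, and the lower bound of one sentence are assumptions.

```python
import math
from typing import Dict, List, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def keep_most_fluent(ppl_by_sentence: Dict[str, float]) -> List[str]:
    """Keep the min(3, N/2) sentences with the lowest perplexity."""
    k = max(1, min(3, len(ppl_by_sentence) // 2))  # floor and bound assumed
    return sorted(ppl_by_sentence, key=ppl_by_sentence.get)[:k]
```

A sentence whose tokens each have probability 0.5 has a perplexity of 2, which matches the intuition of choosing between two equally likely options at every step.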

2.3.4 Semantic

The semantic score is measured using a bidirectional pre-trained language model. BERTScore Zhang*2020BERTScore leverages contextual embeddings and matches words in the candidates and the source sentence by cosine similarity. A higher score means greater semantic similarity, and we use RoBERTa-large liu2020roberta in BERTScore. We measure the semantic score using the source sentence as the reference and the sentences remaining after Section 2.3.3 as candidates.
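The matching idea behind BERTScore can be illustrated with the simplified sketch below. It assumes token embeddings are already available as plain vectors and omits the optional importance weighting and baseline rescaling of the full metric, so it is not a substitute for the reference implementation.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_f1(reference: List[Vector], candidate: List[Vector]) -> float:
    """Greedy-matching F1: each token is paired with its most similar
    token on the other side."""
    recall = sum(max(cosine(r, c) for c in candidate)
                 for r in reference) / len(reference)
    precision = sum(max(cosine(c, r) for r in reference)
                    for c in candidate) / len(candidate)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```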

2.4 Details

If the source sentence is very short or has a simple structure, we obtain more candidates by tightening the decoder options in Section 2.2.3 so that the source and output sentences share no overlapping run longer than a 2-gram.

| Dataset | train | dev | test |
|---|---|---|---|
| Financial Phrasebank | 1834 | 203 | 227 |
| Hate Speech (eng) | 1081 | 220 | 255 |
| Hate Speech (kor) | 1421 | 789 | 471 |

Table 1: Downstream Datasets

3 Experiments

Our training and testing are run on a single V100 GPU, and the details are described in this section.

3.1 Paraphrasing

3.1.1 Dataset

To measure the performance of paraphrase systems, we use the Quora Question Pairs (QQP) test data with 30,000 pairs used in patro-etal-2018-learning and the medical domain dataset medical.

3.1.2 Evaluation Metrics

We measure the semantic, diversity, and fluency scores of paraphrases. So that the evaluation metrics differ from those used for ranking in Section 2.3, diversity is measured with isacrebleu (inverse sacrebleu). Isacrebleu is calculated as 100 - sacrebleu, so the higher the number of overlapping n-grams between candidates and source sentences, the lower the score. The semantic score is measured with Bleurt sellam-etal-2020-bleurt by comparing the paraphrase with the gold references provided by the dataset. Bleurt is a learned evaluation metric based on BERT that can model human judgments from a small number of possibly biased training examples. We use bleurt-base-128 as the model for Bleurt. When measuring fluency, GPT2-small is used as the language model.

3.2 Downstream Task

To demonstrate the usefulness of our approach, we paraphrase several downstream datasets and examine the effect of data augmentation. We test sentence classification on financial phrasebank Malo2014GoodDO and hate speech gibert2018hate to check usefulness in various domains. We also paraphrase the Korean hate speech dataset moon-et-al-2020-beep to check usefulness not only in English but also in other languages.

| | Methods | QQP Semantic (Bleurt) | QQP Diversity (isacrebleu) | QQP Fluency (PPL) | Medical Semantic (Bleurt) | Medical Diversity (isacrebleu) | Medical Fluency (PPL) |
|---|---|---|---|---|---|---|---|
| supervised | Edlp | -1.066 | 86.843 | 585.384 | - | - | - |
| | Edlps | -0.857 | 83.504 | 597.024 | - | - | - |
| unsupervised | UPSA | -0.729 | 65.749 | 392.833 | -1.351 | 89.418 | 476.069 |
| | CGMH(50) | -0.842 | 65.35 | 556.163 | -1.405 | 88.95 | 818.307 |
| | M2M100 | 0.036 | 43.539 | 346.17 | -0.561 | 35.688 | 296.672 |
| | Ours | 0.083 | 69.421 | 171.61 | -0.508 | 68.735 | 158.76 |
| Source | input sentence | 0.124 | 0 | 270.781 | -0.523 | 0 | 249.107 |
| | gold reference | 1 | 72.002 | 278.163 | 1 | 88.632 | 171.786 |

Table 2: Paraphrasing performance of our approach and previous studies on QQP and Medical. The number in parentheses after CGMH is the number of sampling iterations used to modify the sentence. Bold text means the best performance.

We download the datasets using Hugging Face's datasets library. Financial phrasebank and hate speech (eng) are randomly divided into training, validation, and test data because only training data is provided. Hate speech (kor) provides training and test data, so a portion of the training data is used as validation data. Since our purpose is to confirm the performance improvement from paraphrase-augmented data in a scenario where data is insufficient, we preprocess hate speech as follows. (1) In hate speech (eng), the classes are unbalanced, so examples of the over-represented class are discarded at random to balance the data. Also, since the amount of existing training data is sufficiently large, we use only 50% of the randomly balanced training data to simulate a scenario where data is insufficient. (2) Hate speech (kor) similarly has enough training data, so only 20% of the training data is randomly used for training. Table 1 shows the statistics of the processed downstream datasets, and performance is measured by accuracy.
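The preprocessing steps above can be sketched as follows. This is an illustration under assumptions: the example format, function names, and use of a seeded `random.Random` are hypothetical, and the exact sampling procedure used by the authors is not given in the text.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Example = Tuple[str, int]  # (text, label); hypothetical record format


def balance_classes(examples: List[Example], rng: random.Random) -> List[Example]:
    """Randomly downsample every class to the size of the smallest class."""
    by_label: Dict[int, List[Example]] = defaultdict(list)
    for ex in examples:
        by_label[ex[1]].append(ex)
    smallest = min(len(group) for group in by_label.values())
    balanced: List[Example] = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


def subsample(examples: List[Example], fraction: float,
              rng: random.Random) -> List[Example]:
    """Keep a random fraction of the examples (0.5 for eng, 0.2 for kor)."""
    return rng.sample(examples, int(len(examples) * fraction))
```

For hate speech (eng), the two functions would be applied in sequence; for hate speech (kor), only `subsample` with a fraction of 0.2 applies.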

4 Results

4.1 Paraphrasing

Table 2 shows the paraphrasing performance. Edlp and Edlps are supervised learning models introduced in patro-etal-2018-learning; ED, L, P, and S stand for encoder-decoder, cross-entropy loss, pair-wise discriminator loss, and parameter sharing, respectively. CGMH Miao_Zhou_Mou_Yan_Li_2019 uses Metropolis-Hastings sampling in word space to generate constrained sentences. UPSA liu-etal-2020-unsupervised generates Unsupervised Paraphrases through Simulated Annealing, searching the sentence space towards its objective by performing a sequence of local edits. M2M100 is the M2M100-large model paraphrasing source sentences with greedy search (top-1) in framework-1.

Our approach achieves the best semantic and fluency scores among the previous supervised and unsupervised methods. Its diversity score is not the best, but it is comparable to those of the other models. M2M100, which generates a paraphrase using the same model as ours, achieves the second-best semantic score, but its diversity is worse than that of the previous methods. That is, simply generating with a translation model under a single decoding option is not sufficient: on the QQP dataset, M2M100 copies the source sentence in 8.41% of cases.

4.2 Downstream Task

Table 3 shows the performance on sentence classification, the downstream task. BERT-base is a bidirectional pre-trained language model. Transformer has the same architecture but is trained from scratch. Both models are trained five times, and the reported values are the averages of the measured performances. We observe that model performance improves when the augmented corpus is used for training.

Because BERT is a language model pre-trained on large corpora, it is able to extract contextual knowledge. Nevertheless, adding the corpus augmented with paraphrases improves performance, which shows that augmentation helps even when fine-tuning a pre-trained language model. Transformers trained from scratch have no general knowledge of the language, so the performance changes from data augmentation are large. Performance improves greatly on financial phrasebank and hate speech (eng), but data augmentation with the Transformer degrades performance on hate speech (kor). We find that the Transformer can learn rich representations from paraphrases of the training data, but performance can degrade on a fixed test set when the amount of data is small.

Data augmentation through M2M shows a similar pattern to ours, but its performance improvements are smaller and its performance degradation is larger. We infer that this is due to the difference in paraphrase quality shown in Section 4.1 and to the fact that M2M generates some sentences that overlap with the source.

| Methods | Augmentation | Financial | Hate Speech (eng) | Hate Speech (kor) |
|---|---|---|---|---|
| BERT-base | none | 95.3 | 64.94 | 52.78 |
| | M2M | 95.15 | 66.2 | 54.52 |
| | Ours | 96.33 | 68.31 | 55.03 |
| Transformer | none | 80.47 | 53.24 | 52.27 |
| | M2M | 85.9 | 55.69 | 49.26 |
| | Ours | 86.49 | 63.14 | 51.04 |

Table 3: Accuracy of fine-tuned models in downstream tasks. The performance of each model is the average of the values measured over five runs.

5 Conclusion

We propose a system that generates various paraphrase candidates and finds the best candidate through multiple scores, which avoids the risk of relying on one model and one decoding option. Our approach captures semantic information better than the previous supervised and unsupervised methods and generates more natural sentences. Its diversity score is also similar to that of the state-of-the-art unsupervised method. However, our approach may suffer from speed issues because several heavy models must be run for inference in parallel on one server. For practical paraphrasing, it would be effective to also extract candidates with a simple model such as an n-gram model.

Our system shows that when data is insufficient in various domains, classification performance can be improved through data augmentation with our paraphrasing. Our approach is easily extensible across many domains and languages, and we hope it will help with a variety of NLP tasks, such as classification tasks with little data.