We present a simple and effective way to generate a variety of paraphrases and select a high-quality one among them. As previous studies have shown, it is difficult to ensure that any single generation method produces the best paraphrase across domains. We therefore focus on selecting the best candidate from multiple candidates, rather than assuming that one combination of generative model and decoding options is always optimal. Our approach is easy to apply in various domains and performs competitively with previous methods. In addition, it can be used for data augmentation to extend a downstream corpus, improving performance on English and Korean datasets.
Paraphrasing is the task of reconstructing a given source sentence with different words and phrases while preserving its semantic meaning. A paraphrase system can add variability to a source sentence and expand it into sentences carrying more linguistic information. Paraphrasing has been studied in close association with various NLP tasks such as data augmentation, information retrieval, and question answering.
In the supervised approach (patro-etal-2018-learning), a model is trained to generate paraphrases directly, but this requires a parallel dataset. Such parallel datasets are expensive to create and rarely cover diverse domains. In recent years, therefore, many studies (bowman-etal-2016-generating; Miao_Zhou_Mou_Yan_Li_2019; liu-etal-2020-unsupervised) have explored unsupervised approaches that learn paraphrase generation from a corpus alone. Other studies (mallinson-etal-2017-paraphrasing; thompson-post-2020-paraphrase) attempt to paraphrase with machine translation models trained on widely released translation corpora (e.g., the language pairs of WMT, http://www.statmt.org/wmt20/). Various models have been developed along these lines, but no single model can guarantee the best performance on all datasets. Our goal is therefore not to design a new language model or translation system, but to find the best candidates among paraphrases generated by various methods and use them for downstream tasks.
We paraphrase using a machine translation model whose encoder maps sentences with the same meaning in different languages to the same latent representation. Our system paraphrases source sentences with two frameworks and several decoding options, described in Section 2. Paraphrase candidates generated from the various combinations are ranked by fluency, diversity, and semantic score. Finally, the system selects a paraphrase that uses different words from the source sentence yet remains natural and semantically similar.
The performance and effectiveness of the proposed system are verified in two ways. First, our model is evaluated on datasets that provide paraphrase pairs: we use QQP (Quora Question Pairs) (patro-etal-2018-learning) and a medical domain dataset (medical), comparing generated paraphrases with gold references under multiple metrics. Second, we use our system for data augmentation in downstream tasks: we augment the financial phrasebank (Malo2014GoodDO) and hate speech (eng) (gibert2018hate) datasets in English, and the hate speech (kor) (moon-et-al-2020-beep) dataset in Korean, to improve classification performance.
Our system outperforms previous supervised and unsupervised approaches in terms of semantic and fluency scores and shows diversity comparable to the latest unsupervised approaches. In addition, our system improves the performance of downstream tasks in scenarios where training data is limited. Finally, our paraphrasing can be applied not only to English but also to various other languages.
We use M2M100 (fan2020beyond) as the backbone model so that the system can be used not only in English but also in various other languages. M2M100 is a multilingual encoder-decoder model covering 100 languages; we use two versions, M2M100-small and M2M100-large.
We generate paraphrase candidates with the following two methods, according to the combination of encoder and decoder.
Framework-1 uses only one language (i.e., the source language): the decoder generates paraphrase candidates directly from the encoded vector of the source sentence. This framework resembles an auto-encoder, but since the paraphrase model is based on a translation system, its objective is to generate the same meaning rather than to reconstruct the input.
If candidate sentences are generated only as in Section 2.2.1, diversity decreases, so framework-2 uses two languages to generate more candidates. In other words, we use back-translation (sennrich-etal-2016-improving) to translate the source sentence into a target language and then back into the source language. Because back-translation depends on the quality of the translation system, context can sometimes be lost, but it generates varied candidates. M2M100 supports 100 languages, but we selected English, Korean, French, Japanese, Chinese, German, and Spanish as the language pool.
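The back-translation path can be sketched with the translation model abstracted behind a callable. The `translate` signature and the function name below are our assumptions, standing in for the two M2M100 translation calls:

```python
from typing import Callable, List

def back_translate(source: str,
                   src_lang: str,
                   pivot_langs: List[str],
                   translate: Callable[[str, str, str], str]) -> List[str]:
    """Generate paraphrase candidates by translating the source sentence
    into each pivot language and back into the source language.

    `translate(text, from_lang, to_lang)` is assumed to wrap the
    underlying translation model (e.g., M2M100)."""
    candidates = []
    for pivot in pivot_langs:
        forward = translate(source, src_lang, pivot)    # src -> pivot
        backward = translate(forward, pivot, src_lang)  # pivot -> src
        if backward != source:                          # keep only changed outputs
            candidates.append(backward)
    return candidates
```

In the paper's setting, `src_lang` would be e.g. "en" and the pivot pool the remaining six languages (ko, fr, ja, zh, de, es).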
When generating paraphrase candidates, we expand the candidate set by varying the decoding options.
In framework-1, 10-beam search is used and the top-5 candidate sentences are kept. The following blocking restrictions are also applied: (1) output tokens may not consecutively overlap with the source tokens for more than half the length of the source sentence; (2) repeated 3-grams within the output sentence are prevented.
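The two blocking rules can be illustrated as stand-alone validators. The paper applies them during decoding; these post-hoc checks and their function names are our assumptions:

```python
def has_repeated_ngram(tokens, n=3):
    """True if any n-gram occurs more than once in the token sequence."""
    seen = set()
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            return True
        seen.add(gram)
    return False

def longest_consecutive_overlap(src_tokens, out_tokens):
    """Length of the longest contiguous token span shared by source and output."""
    best = 0
    for i in range(len(src_tokens)):
        for j in range(len(out_tokens)):
            k = 0
            while (i + k < len(src_tokens) and j + k < len(out_tokens)
                   and src_tokens[i + k] == out_tokens[j + k]):
                k += 1
            best = max(best, k)
    return best

def passes_blocking(source, output):
    """Rule (1): consecutive overlap at most half the source length.
    Rule (2): no repeated 3-gram inside the output."""
    src, out = source.split(), output.split()
    return (not has_repeated_ngram(out, 3)
            and longest_consecutive_overlap(src, out) <= len(src) // 2)
```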
In framework-2, 3-beam search is used on both the forward and backward paths and the top-1 candidate sentence is kept; the rest is the same as in framework-1.
We filter the paraphrase candidates through several scores to select the best one. All ranking and filtering measure scores on lowercased text to eliminate differences in capitalization. Candidates with poor scores at each filtering step are discarded.
We first remove candidates that duplicate the source sentence or each other. Candidates that differ only in whitespace or in upper/lower case are considered the same sentence. The sentences that survive this filtering step form the filtered candidate set.
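This normalization-based deduplication can be sketched as follows (the function names are our assumptions):

```python
import re

def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so candidates that differ only in
    case or spacing compare as equal."""
    return re.sub(r"\s+", " ", sentence.strip().lower())

def dedup_candidates(source, candidates):
    """Drop candidates identical to the source or to an earlier candidate
    after normalization; preserve original order and surface form."""
    seen = {normalize(source)}
    kept = []
    for cand in candidates:
        key = normalize(cand)
        if key not in seen:
            seen.add(key)
            kept.append(cand)
    return kept
```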
We measure diversity by comparing the filtered candidates with the source sentence. We use word error rate (WER) (MorrisMG04) as the diversity metric; the higher the score, the higher the diversity. WER is the Levenshtein distance between the source sentence and a candidate, computed at the word level rather than the phoneme level. WER was originally proposed to measure the performance of automatic speech recognition systems, but we use it to measure the difference between sentences. In this step, only the min(5, #candidates/2) sentences with the highest diversity scores are kept.
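The word-level WER and the top-k selection above can be sketched in a few lines. The whitespace tokenization, function names, and the floor rounding of the min(5, ·/2) cutoff are our assumptions:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            sub = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, sub)
        prev = curr
    return prev[-1] / max(len(ref), 1)

def keep_most_diverse(source, candidates, cap=5):
    """Keep the min(cap, n/2) candidates with the highest WER to the source."""
    k = min(cap, len(candidates) // 2)
    ranked = sorted(candidates, key=lambda c: wer(source, c), reverse=True)
    return ranked[:k]
```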
To evaluate fluency, we measure perplexity (PPL) with a language model. Fluency indicates the naturalness of a sentence: the lower the PPL, the better the fluency. We use GPT2-medium (radford2019language) as the language model and keep only the min(3, #candidates/2) sentences with the lowest PPL.
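Given per-token log-probabilities, perplexity and the fluency cutoff can be sketched as follows. The `log_prob` callable is a stand-in for the GPT2-medium scoring call, and the floor rounding of the cutoff is our assumption:

```python
import math
from typing import Callable, List

def perplexity(tokens: List[str],
               log_prob: Callable[[List[str], int], float]) -> float:
    """PPL = exp(-(1/N) * sum_i log p(token_i | tokens_<i)).

    `log_prob(tokens, i)` is assumed to return the language model's
    natural-log probability of tokens[i] given the preceding tokens."""
    n = len(tokens)
    total = sum(log_prob(tokens, i) for i in range(n))
    return math.exp(-total / n)

def keep_most_fluent(candidates, ppl, cap=3):
    """Keep the min(cap, n/2) candidates with the lowest perplexity."""
    k = min(cap, len(candidates) // 2)
    return sorted(candidates, key=ppl)[:k]
```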
The semantic score is measured with a bidirectional pre-trained language model. BERTScore (Zhang*2020BERTScore) leverages contextual embeddings and matches words in the candidate and the source sentence by cosine similarity; a higher score means greater semantic similarity. We use RoBERTa-large (liu2020roberta) in BERTScore, with the source sentence as the reference and the remaining sentences as candidates.
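The greedy cosine matching behind BERTScore can be illustrated with a small stand-alone sketch. Real BERTScore uses RoBERTa-large contextual embeddings (with optional IDF weighting); the plain lists of vectors here are placeholders:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_f1(cand_emb, ref_emb):
    """BERTScore-style F1: each candidate token is matched to its most
    similar reference token (precision), each reference token to its most
    similar candidate token (recall); F1 is their harmonic mean."""
    precision = sum(max(cosine(c, r) for r in ref_emb)
                    for c in cand_emb) / len(cand_emb)
    recall = sum(max(cosine(r, c) for c in cand_emb)
                 for r in ref_emb) / len(ref_emb)
    return 2 * precision * recall / (precision + recall)
```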
If the source sentence is very short or has a simple structure, the decoder options in Section 2.2.3 are tightened so that the source and output sentences may not overlap in more than 2-grams, in order to obtain more distinct candidates.
Hate Speech (eng): 1081 train / 220 validation / 255 test
Hate Speech (kor): 1421 train / 789 validation / 471 test
All training and evaluation run on a single V100 GPU; the details are described in this section.
To measure the performance of paraphrase systems, we use the Quora Question Pairs (QQP) test set of 30,000 pairs used in patro-etal-2018-learning and the medical domain dataset (medical).
We measure the semantic, diversity, and fluency scores of the generated paraphrases. To use metrics different from those in Section 2.3, diversity is evaluated with iSacrebleu (inverse sacrebleu), computed as 100 − sacrebleu (post-2018-call): the more n-grams the candidate shares with the source sentence, the lower the score. The semantic score is measured against the gold references provided by the dataset using Bleurt (sellam-etal-2020-bleurt), an evaluation metric trained so that BERT can model human judgments; we use the bleurt-base-128 model. For fluency, GPT2-small is used as the language model.
To demonstrate the usefulness of our approach, we paraphrase several downstream datasets and examine the effect of data augmentation. We test sentence classification on the financial phrasebank (Malo2014GoodDO) and hate speech (gibert2018hate) datasets to check usefulness across domains, and on the Korean hate speech dataset (moon-et-al-2020-beep) to check usefulness beyond English.
We download the datasets using huggingface's datasets library (https://huggingface.co/datasets). Financial phrasebank and hate speech (eng) provide only training data, so we randomly split them into training, validation, and test sets. Hate speech (kor) provides training and test data, so a portion of the training data is used for validation. Since our purpose is to confirm the performance improvement from paraphrase-based augmentation in a scenario with insufficient data, we preprocess the hate speech datasets as follows. (1) In hate speech (eng), the classes are unbalanced, so examples of the over-represented class are discarded at random to balance the data; since the remaining training data is still large, we use only 50% of the randomly balanced training data to simulate a data-scarce scenario. (2) Hate speech (kor) similarly has ample training data, so only 20% of it is randomly sampled for training. Table 1 shows the statistics of the processed downstream tasks; performance is measured by accuracy.
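The balancing-and-subsampling preprocessing for hate speech (eng) can be sketched as follows. The function name, seed handling, and exact sampling order are our assumptions; the paper does not specify them:

```python
import random
from collections import defaultdict

def balance_and_subsample(examples, labels, keep_ratio=0.5, seed=0):
    """Randomly downsample each class to the minority-class count, then
    keep only `keep_ratio` of the balanced data to simulate a
    low-resource scenario."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for x, y in zip(examples, labels):
        by_label[y].append(x)
    n_min = min(len(xs) for xs in by_label.values())
    balanced = []
    for y, xs in by_label.items():
        for x in rng.sample(xs, n_min):   # random subset per class
            balanced.append((x, y))
    rng.shuffle(balanced)
    return balanced[: int(len(balanced) * keep_ratio)]
```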
Table 2 shows paraphrase performance. Edlp and Edlps are supervised models introduced in patro-etal-2018-learning; ED, L, P, and S stand for encoder-decoder, cross-entropy loss, pair-wise discriminator loss, and parameter sharing, respectively. CGMH (Miao_Zhou_Mou_Yan_Li_2019) uses Metropolis-Hastings sampling in word space to generate constrained sentences. UPSA (liu-etal-2020-unsupervised) generates unsupervised paraphrases through simulated annealing, searching the sentence space with a sequence of local edits. M2M100 denotes the M2M100-large model paraphrasing source sentences with greedy search (top-1) in framework-1.
Our approach achieves better semantic and fluency scores than previous supervised and unsupervised methods. Its diversity score is not the best, but is comparable to the other models. M2M100, which generates paraphrases with the same backbone as ours, achieves the second-best semantic score, but its diversity is worse than the previous methods. In other words, simply using a translation model with a single decoding option is not sufficient: on the QQP dataset, M2M100 copies the source sentence outright in 8.41% of cases.
Table 3 shows performance on the downstream sentence classification tasks. BERT-base is a bidirectional pre-trained language model; Transformer has the same architecture but is trained from scratch. Both models are trained five times and we report the average performance. We observe that performance improves when the augmented corpus is used for training.
Because BERT is pre-trained on numerous corpora, it can already extract contextual knowledge. Nevertheless, adding the paraphrase-augmented corpus improves performance, showing that augmentation helps even when fine-tuning a pre-trained language model. Transformers trained from scratch have no general knowledge of the language, so the performance changes from data augmentation are larger: performance improves greatly on financial phrasebank and hate speech (eng), but augmentation degrades the Transformer's performance on hate speech (kor). We find that the Transformer can learn richer representations from paraphrased training data, but performance degradation can occur on a small, fixed test set.
Data augmentation with M2M100 shows a similar pattern to ours, but its performance gains are smaller and its degradations larger. We attribute this to the paraphrase-quality gap shown in Section 4.1 and to M2M100 generating sentences that overlap the source.
We propose a system that generates diverse paraphrase candidates and finds the best one through multiple scores, avoiding the risk of relying on a single model and a single decoding option. Our approach captures semantic information better than previous supervised and unsupervised methods and generates more natural sentences. Its diversity score is similar to that of the state-of-the-art unsupervised method. However, our approach may suffer from speed issues, as it runs several heavy models in parallel on one server; for practical use, it would be effective to extract candidates with a simpler model, such as an n-gram model, alongside.
Our system shows that when data is insufficient in various domains, classification performance can be improved through data augmentation with our paraphrases. Our approach extends easily across domains and languages, and we hope it will help a variety of NLP tasks, such as classification with little data.