Unsupervised Paraphrasing without Translation

05/29/2019 ∙ by Aurko Roy, et al. ∙ Google

Paraphrasing exemplifies the ability to abstract semantic content from surface forms. Recent work on automatic paraphrasing is dominated by methods leveraging Machine Translation (MT) as an intermediate step. This contrasts with humans, who can paraphrase without being bilingual. This work proposes to learn paraphrasing models from an unlabeled monolingual corpus only. To that end, we propose a residual variant of vector-quantized variational auto-encoder. We compare with MT-based approaches on paraphrase identification, generation, and training augmentation. Monolingual paraphrasing outperforms unsupervised translation in all settings. Comparisons with supervised translation are more mixed: monolingual paraphrasing is interesting for identification and augmentation; supervised translation is superior for generation.




1 Introduction

Many methods have been developed to generate paraphrases automatically Madnani and Dorr (2010). Approaches relying on Machine Translation (MT) have proven popular due to the scarcity of labeled paraphrase pairs Callison-Burch (2007); Mallinson et al. (2017); Iyyer et al. (2018). Recent progress in MT with neural methods Bahdanau et al. (2014); Vaswani et al. (2017) has popularized this strategy. Conceptually, translation is appealing since it abstracts semantic content from its linguistic realization. For instance, assigning the same source sentence to multiple translators will result in a rich set of semantically close sentences Callison-Burch (2007). At the same time, bilingualism does not seem necessary for humans to generate paraphrases.

This work evaluates whether data in two languages is necessary for paraphrasing. We consider three settings: supervised translation (parallel bilingual data is used), unsupervised translation (non-parallel corpora in two languages are used) and monolingual (only unlabeled data in the paraphrasing language is used). Our comparison devises comparable encoder-decoder neural networks for all three settings. While the literature on supervised translation Bahdanau et al. (2014); Cho et al. (2014); Vaswani et al. (2017) and unsupervised translation Lample et al. (2018a); Artetxe et al. (2018); Lample et al. (2018b) offers solutions for the bilingual settings, monolingual neural paraphrase generation has not received the same attention.

We consider discrete and continuous auto-encoders in an unlabeled monolingual setting, and contribute improvements in that context. We introduce a model based on Vector-Quantized Variational Auto-Encoders, VQ-VAE van den Oord et al. (2017), for generating paraphrases in a purely monolingual setting. Our model introduces residual connections parallel to the quantized bottleneck. This lets us interpolate from a classical continuous auto-encoder Vincent et al. (2010) to VQ-VAE. Compared to VQ-VAE, our architecture offers better control over the decoder entropy and eases optimization. Compared to a continuous auto-encoder, our method permits the generation of diverse but semantically close sentences from an input sentence.

We compare paraphrasing models over intrinsic and extrinsic metrics. Our intrinsic evaluation covers paraphrase identification and generation. Our extrinsic evaluation reports the impact of augmenting training data with paraphrases on text classification. Overall, monolingual approaches can outperform unsupervised translation in all settings. Comparison with supervised translation shows that parallel data provides valuable information for paraphrase generation compared to purely monolingual training.

2 Related Work

Paraphrase Generation Paraphrases express the same content with alternative surface forms. Their automatic generation has been studied for decades: rule-based McKeown (1980); Meteer and Shaked (1988) and data-driven methods Madnani and Dorr (2010) have been explored. Data-driven approaches have considered different sources of training data, including multiple translations of the same text Barzilay and McKeown (2001); Pang et al. (2003) or alignments of comparable corpora, such as news from the same period Dolan et al. (2004); Barzilay and Lee (2003).

Machine translation later emerged as a dominant method for paraphrase generation. Bannard and Callison-Burch (2005) identify equivalent English phrases mapping to the same non-English phrases from an MT phrase table. Kok and Brockett (2010) perform random walks across multiple phrase tables. Translation-based paraphrasing has recently benefited from neural networks for MT Bahdanau et al. (2014); Vaswani et al. (2017). Neural MT can generate paraphrase pairs by translating one side of a parallel corpus Wieting and Gimpel (2018); Iyyer et al. (2018). Paraphrase generation with pivot/round-trip neural translation has also been used Mallinson et al. (2017); Yu et al. (2018).

Although less common, monolingual neural sequence models have also been proposed. In supervised settings, Prakash et al. (2016); Gupta et al. (2018) learn sequence-to-sequence models on paraphrase data. In unsupervised settings, Bowman et al. (2016) apply a VAE to paraphrase detection while Li et al. (2017) train a paraphrase generator with adversarial training.

Paraphrase Evaluation Evaluation can be performed by human raters, assessing both text fluency and semantic similarity. Automatic evaluation is more challenging but necessary for system development and larger scale statistical analysis Callison-Burch (2007); Madnani and Dorr (2010). Automatic evaluation and generation are actually linked: if an automated metric could reliably assess the semantic similarity and fluency of a sentence pair, one could generate by searching the space of sentences to maximize that metric. Automated evaluation can report the overlap with a reference paraphrase, as in translation Papineni et al. (2002) or summarization Lin (2004). BLEU, METEOR and TER metrics have been used Prakash et al. (2016); Gupta et al. (2018). These metrics do not evaluate whether the generated paraphrase differs from the input sentence, and large amounts of input copying are not penalized. Galley et al. (2015) compare overlap with multiple references, weighted by quality, while Sun and Zhou (2012) explicitly penalize overlap with the input sentence. Grangier and Auli (2018) alternatively compare systems which have first been calibrated to a reference level of overlap with the input. We follow this strategy and calibrate the generation overlap to match the average overlap observed in paraphrases from humans.

In addition to generation, probabilistic models can be assessed through scoring. For a sentence pair (s, s'), the model estimate of P(s'|s) can be used to discriminate between paraphrase and non-paraphrase pairs Dolan and Brockett (2005). The correlation of model scores with human judgments Cer et al. (2017) can also be assessed. We report both types of evaluation.

Finally, paraphrasing can also impact downstream tasks, e.g. to generate additional training data by paraphrasing training sentences Marton et al. (2009); Zhang et al. (2015); Yu et al. (2018). We evaluate this impact for classification tasks.

3 Residual VQ-VAE for Unsupervised Monolingual Paraphrasing

Auto-encoders can be applied to monolingual paraphrasing. Our work combines Transformer networks Vaswani et al. (2017) and VQ-VAE van den Oord et al. (2017), building upon recent work in discrete latent models for translation Kaiser et al. (2018); Roy et al. (2018). VQ-VAEs, as opposed to continuous VAEs, rely on discrete latent variables. This is interesting for paraphrasing as it equips the model with explicit control over the latent code capacity, allowing the model to group multiple related examples under the same latent assignment, similarly to classical clustering algorithms Macqueen (1967). This is conceptually simpler and more effective than rate regularization Higgins et al. (2016) or denoising objectives Vincent et al. (2010) for continuous auto-encoders. At the same time, training auto-encoders with a discrete bottleneck is difficult Roy et al. (2018). We address this difficulty with a hybrid model using a continuous residual connection around the quantization module.

We modify the Transformer encoder Vaswani et al. (2017) as depicted in Figure 1. Our encoder maps a sentence into a fixed-size vector. This is simple and avoids choosing a fixed-length compression rate between the input and the latent representation Kaiser et al. (2018). Our strategy to produce a fixed-size representation from a transformer is analogous to the special token employed for sentence classification in Devlin et al. (2018).

At the first layer, we extend the input sequence with one or more fixed positions which are part of the self-attention stack. At the output layer, the encoder output is restricted to these special positions, which constitute the encoder's fixed-size output. As in Kaiser et al. (2018), this vector is split into multiple heads (sub-vectors of equal dimension), each of which goes through a quantization module. For each head i, the encoder output e_i is quantized as

q(e_i) = c_k  with  k = argmin_j ||e_i − c_j||_2,

where c_1, …, c_K denote the codebook vectors. The codebook is shared across heads and training combines straight-through gradient estimation and exponentiated moving averages van den Oord et al. (2017). The quantization module is completed with a residual connection with a learnable weight α,

r(e_i) = α e_i + (1 − α) q(e_i).

One can observe that residual vectors and quantized vectors always have similar norms by definition of the VQ module. This is a fundamental difference with classical continuous residual networks, where the network can reduce the activation norms of some modules to rely mostly on the residual path. This makes α an important parameter to trade off continuous and discrete auto-encoding. Our learning objective encourages the quantized path with a squared penalty λα².
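As a concrete illustration, the forward pass of this bottleneck can be sketched in plain Python. The function names and the convex-combination form of the residual are our assumptions for illustration; straight-through gradients and EMA codebook updates are omitted since only the forward pass is shown.

```python
def quantize(e, codebook):
    """Map vector e to its nearest codebook entry (squared Euclidean distance)."""
    best = min(range(len(codebook)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(e, codebook[j])))
    return codebook[best], best

def residual_vq(e, codebook, alpha):
    """Hybrid bottleneck: weight alpha on the continuous residual path and
    (1 - alpha) on the quantized path (assumed combination form)."""
    q, idx = quantize(e, codebook)
    return [alpha * ei + (1 - alpha) * qi for ei, qi in zip(e, q)], idx

# Toy example: one 2-dimensional head vector and a 3-entry codebook.
codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
e = [0.9, 1.2]

z, idx = residual_vq(e, codebook, alpha=0.25)
# alpha = 0 recovers pure VQ-VAE; alpha = 1 recovers a continuous auto-encoder.
```

Setting alpha to 0 makes the bottleneck purely discrete, while alpha near 1 behaves like a continuous auto-encoder, which is the interpolation the architecture is designed to offer.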

After residual addition, the multiple heads of the resulting vector are presented as a matrix to which a regular transformer decoder can attend. Models are trained to maximize the likelihood of the training set with the Adam optimizer, using the learning-rate schedule from Vaswani et al. (2017).

Figure 1: Encoder Architecture
                          Identification           Generation
                          MRPC   STS    MTC        BLEU   Human
Supervised Translation    70.6   46.0   78.6       8.73   36.8
 + Distillation           66.5   60.0   55.6       7.08   —
Unsupervised Translation  66.0   13.2   65.8       6.59   28.1
 + Distillation           66.9   45.0   52.0       6.45   —
Mono. DN-AE               66.8   46.2   91.6       5.13   —
Mono. VQ-VAE              66.3   10.6   69.0       3.85   —
 + Residual               73.3   59.8   94.0       7.26   31.9
 + Distillation           71.3   54.3   88.4       6.88   —
Table 1: Paraphrase Identification & Generation. Identification is evaluated with accuracy on MRPC, Pearson correlation on STS and ranking accuracy on MTC. Generation is evaluated with BLEU and human preferences on MTC.
                          SST-2            TREC
                          Acc.   F1        Acc.   F1
NB-SVM (trigram)          81.93  83.15     89.77  84.81
Supervised Translation    81.55  82.75     90.78  85.44
 + Distillation           81.16  66.59     90.38  86.05
Unsupervised Translation  81.87  83.18     88.17  83.42
 + Distillation           81.49  82.78     89.18  84.41
Mono. DN-AE               81.11  82.48     89.37  84.08
Mono. VQ-VAE              81.98  82.95     89.17  83.64
 + Residual               82.12  83.23     89.98  84.31
 + Distillation           81.60  82.81     89.78  84.31
Table 2: Paraphrasing for Data Augmentation: Accuracy and F1-scores of a Naive Bayes-SVM classifier on sentiment (SST-2) and question (TREC) classification.

4 Experiments & Results

We compare neural paraphrasing with and without access to bilingual data. For bilingual settings, we consider supervised and unsupervised translation using round-trip translation Mallinson et al. (2017); Yu et al. (2018) with German as the pivot language. Supervised translation trains the transformer base model Vaswani et al. (2017) on the WMT’17 English-German parallel data Bojar et al. (2017). Unsupervised translation considers a pair of comparable corpora for training, German and English WMT-Newscrawl corpora, and relies on the transformer models from Lample et al. (2018b). Both MT cases train a model from English to German and from German to English to perform round-trip MT. For each model, we also distill the round-trip model into a single artificial English to English model by generating a training set from pivoted data. Distillation relies on the billion word corpus, LM1B Chelba et al. (2013).
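The round-trip pivot setup can be sketched as follows. The word-level lexicon translators here are toy stand-ins for the actual neural English-German MT models, and all names are ours.

```python
def round_trip_paraphrase(sentence, en_de, de_en):
    """Pivot paraphrasing: translate into the pivot language and back."""
    return de_en(en_de(sentence))

def make_word_translator(lexicon):
    """Toy word-by-word translator; unknown words pass through unchanged."""
    return lambda s: " ".join(lexicon.get(w, w) for w in s.split())

# Hypothetical toy lexicons standing in for the trained MT models.
en_de = make_word_translator({"the": "das", "car": "Auto"})
de_en = make_word_translator({"das": "the", "Auto": "automobile"})

print(round_trip_paraphrase("the car", en_de, de_en))  # prints "the automobile"
```

Distillation then amounts to running this round trip over a large monolingual corpus and training a single English-to-English model on the resulting (input, output) pairs.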

Monolingual Residual VQ-VAE is trained only on LM1B, with multiple heads and a fixed-size window. We also evaluate a plain VQ-VAE to highlight the value of our residual modification. We further compare with a monolingual continuous denoising auto-encoder (DN-AE), with noising from Lample et al. (2018b).

Paraphrase Identification For classification of sentence pairs over the Microsoft Research Paraphrase Corpus (MRPC) from Dolan and Brockett (2005), we train logistic regression on the model scores P(s'|s) and P(s|s'), complemented with encoder outputs in fixed context settings. We also perform paraphrase quality regression on Semantic Textual Similarity (STS) from Cer et al. (2017) by training ridge regression on the same features.

Finally, we perform paraphrase ranking on Multiple Translation Chinese (MTC) from Huang et al. (2002). MTC contains English paraphrases collected as translations of the same Chinese sentences from multiple translators Mallinson et al. (2017). We pair each MTC sentence with a paraphrase and 100 randomly chosen non-paraphrases. We compare the paraphrase score to the non-paraphrase scores and report the fraction of comparisons where the paraphrase score is higher.
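The ranking protocol can be sketched as follows. The Jaccard word-overlap scorer is a toy stand-in for the model score P(s'|s), and the function names are ours.

```python
import random

def ranking_accuracy(score, pairs, pool, n_neg=100, seed=0):
    """Fraction of comparisons where the paraphrase outscores a sampled non-paraphrase."""
    rng = random.Random(seed)
    wins = total = 0
    for s, p in pairs:
        positive = score(s, p)
        for _ in range(n_neg):
            negative = score(s, rng.choice(pool))
            wins += positive > negative
            total += 1
    return wins / total

def overlap_score(a, b):
    """Toy scorer: Jaccard word overlap instead of a model probability."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

pairs = [("the cat sat", "the cat rested")]
pool = ["stocks fell sharply", "rain is forecast"]
acc = ranking_accuracy(overlap_score, pairs, pool, n_neg=10)  # acc == 1.0 here
```

With a real model, `score` would be the conditional log-likelihood of the second sentence given the first.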

Table 1 (left) reports that our residual model outperforms the alternatives in all identification settings, except for STS, where our Pearson correlation is slightly below supervised translation.

Paraphrases for Data Augmentation

We augment the training set of text classification tasks for sentiment analysis on the Stanford Sentiment Treebank (SST-2) (Socher et al., 2013) and question classification on the Text REtrieval Conference (TREC) dataset (Voorhees and Tice, 2000). In both cases, we double the training set size by paraphrasing each sentence and train Support Vector Machines with Naive Bayes features Wang and Manning (2012).
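The doubling scheme can be sketched as follows; `paraphrase` stands in for any of the generation models above, and the function name is ours.

```python
def augment(sentences, labels, paraphrase):
    """Double a labeled training set: each sentence gains one paraphrase
    carrying the same label as its source."""
    aug_x = list(sentences) + [paraphrase(s) for s in sentences]
    aug_y = list(labels) + list(labels)
    return aug_x, aug_y

# Toy paraphraser: a single word substitution.
toy = lambda s: s.replace("movie", "film")

x, y = augment(["a great movie", "a dull movie"], ["pos", "neg"], toy)
# x == ["a great movie", "a dull movie", "a great film", "a dull film"]
# y == ["pos", "neg", "pos", "neg"]
```

The augmented pairs are then fed to the classifier exactly like the original training examples.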

In Table 2, augmentation with monolingual models yields the best performance for SST-2 sentiment classification. TREC question classification benefits more from supervised translation augmentation. Unfortunately, our monolingual training set LM1B does not contain many question sentences. Future work will revisit monolingual training on larger, more diverse resources.

In: a worthy substitute
Out: A worthy replacement.
In: Local governments will manage the smaller enterprises.
Out: Local governments will manage smaller companies.
In: Inchon is 40 kilometers away from the border of North Korea.
Out: Inchon is 40 km away from the North Korean border.
In: Executive Chairman of Palestinian Liberation Organization, Yasar Arafat, and other leaders are often critical of aiding countries not fulfilling their promise to provide funds in a timely fashion.
Out: Yasar Arafat, executive chairman of the Palestinian Liberation Organization and other leaders are often critical of helping countries meet their pledge not to provide funds in a timely fashion.
Table 3: Examples of generated paraphrases from the monolingual residual model (Greedy search).

Paraphrase Generation Paraphrase generation is evaluated on MTC. We select the 4 best translators according to the MTC documentation and keep paraphrase pairs whose length ratio is bounded. Our evaluation prevents trivial copying solutions: we select a sampling temperature for each model such that its generation overlap with the input is 20.9 BLEU, the average overlap between humans on MTC. We report BLEU overlap with the target and run a blind human evaluation where raters pick the best generation among supervised translation, unsupervised translation and monolingual paraphrasing.
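The calibration step can be sketched as a grid search over temperatures. The clipped bigram precision below is a crude stand-in for BLEU, the toy generator mimics an output whose input overlap decreases with temperature, and all names are ours.

```python
from collections import Counter

def bigram_precision(hyp, ref):
    """Crude stand-in for BLEU: clipped bigram precision of hyp against ref."""
    h = Counter(zip(hyp, hyp[1:]))
    r = Counter(zip(ref, ref[1:]))
    overlap = sum(min(c, r[g]) for g, c in h.items())
    return overlap / max(sum(h.values()), 1)

def input_overlap(outputs, inputs):
    """Mean bigram precision of each output against its own input, in percent."""
    scores = [bigram_precision(o.split(), i.split()) for o, i in zip(outputs, inputs)]
    return 100 * sum(scores) / len(scores)

def calibrate(generate, inputs, target, temperatures):
    """Pick the sampling temperature whose input overlap is closest to target."""
    def gap(t):
        outs = [generate(s, t) for s in inputs]
        return abs(input_overlap(outs, inputs) - target)
    return min(temperatures, key=gap)

# Toy generator: higher temperature rewrites a larger suffix of the sentence.
def toy_generate(s, t):
    words = s.split()
    keep = max(1, int(round(len(words) * (1 - t))))
    return " ".join(words[:keep] + ["x"] * (len(words) - keep))

best = calibrate(toy_generate, ["a b c d", "e f g h"], target=50.0,
                 temperatures=[0.0, 0.5, 1.0])  # best == 0.5
```

With a real sampler, `input_overlap` would be replaced by corpus BLEU against the inputs and the target set to the 20.9 BLEU human-overlap level.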

Table 3 shows examples. Table 1 (right) reports that monolingual paraphrasing compares favorably with unsupervised translation while supervised translation is the best technique. This highlights the value of parallel data for paraphrase generation.

5 Discussions

Our experiments highlight the importance of the residual connection for paraphrase identification. From Table 1, we see that a model without the residual connection obtains 66.3 accuracy on MRPC, 10.6 Pearson correlation on STS and 69.0 ranking accuracy on MTC. Adding the residual connection improves these to 73.3, 59.8 and 94.0 respectively.

The examples in Table 3 show paraphrases generated by the model. The overlap with the input in these examples is high. It is possible to generate sentences with less overlap at higher sampling temperatures; we observe, however, that this strategy impairs fluency and adequacy. We plan to explore strategies which allow conditioning the decoding process on an overlap requirement instead of varying sampling temperatures (Grangier and Auli, 2018).

6 Conclusion

We compared neural paraphrasing with and without access to bilingual data. Bilingual settings considered supervised and unsupervised translation. Monolingual settings considered auto-encoders trained on unlabeled text and introduced continuous residual connections for discrete auto-encoders. This method is advantageous over both discrete and continuous auto-encoders. Overall, we showed that monolingual models can outperform bilingual ones for paraphrase identification and data augmentation through paraphrasing. We also reported that generation quality from monolingual models can be higher than that of models based on unsupervised translation, but not supervised translation. Access to parallel data is therefore still advantageous for paraphrase generation, and our monolingual method can be a helpful resource for languages where such data is not available.


We thank the anonymous reviewers for their suggestions. We thank the authors of the Tensor2tensor library used in our experiments Vaswani et al. (2018).