Label Dependent Deep Variational Paraphrase Generation

11/27/2019 ∙ by Siamak Shakeri, et al. ∙ Amazon

Generating paraphrases that are lexically similar but semantically different is a challenging task. Paraphrases of this form can be used to augment data sets for various NLP tasks such as machine reading comprehension and question answering with non-trivial negative examples. In this article, we propose a deep variational model to generate paraphrases conditioned on a label that specifies whether the paraphrases are semantically related or not. We also present new training recipes and KL regularization techniques that improve the performance of variational paraphrasing models. Our proposed model demonstrates promising results in enhancing the generative power of the model by employing label-dependent generation on paraphrasing datasets.




1 Introduction

Paraphrase generation refers to the task of generating a sequence of tokens given an input sequence while preserving the overall meaning of the input. Extracting paraphrases from various English translations of the same text is explored in [Barzilay and McKeown2001]. A multiple-sequence alignment approach is proposed in [Barzilay and Lee2003] to learn paraphrase generation from unannotated parallel corpora. Deep learning-based paraphrasing has gained momentum recently. Text generation from continuous space using Variational Autoencoders (VAEs) [Kingma and Welling2013] is proposed in [Bowman et al.2015]. The authors of [Gupta et al.2017] suggest using VAEs in paraphrase generation. The ability of VAE generative models to produce diverse sequences makes them a suitable candidate for paraphrasing tasks [Jain, Zhang, and Schwing2017].

Several publicly available paraphrasing datasets such as Quora Question Pairs [Shankar Iyar and Csernai2016] and Microsoft Research Paraphrasing Dataset [Dolan and Brockett2005] include a binary label indicating whether the paraphrase sequence is semantically different from the original sentence. Table 1 shows samples from Quora Question Pairs. Consider the last two rows of the table. In the first of the two, the original sequence and the paraphrase share fewer tokens than in the second; however, the paraphrase still conveys the same meaning as the original sequence. This is not the case in the second of the two, where only one token differs between the paraphrase and the original sequence, yet the meaning changes. We believe that the paraphrase generated when the binary label is 1 follows a different distribution compared to when it is 0. Therefore, including the input label in the neural sequence model would enhance the generative power of such models.

| Original Sequence | Paraphrase | Label |
|---|---|---|
| What is the step by step guide to invest in share market in india? | What is the step by step guide to invest in share market? | 0 |
| What is the best free web hosting for php? | What are the best free web hosting services? | 0 |
| How will I open account in Quora? | How do I open an account on Quora? | 1 |
| How should I begin learning Python? | What are some tips for learning python? | 1 |
| What are the possible ways to stop smoking? | How do I quit smoking? | 1 |
| What is black hat SEO? | What is white hat SEO? | 0 |

Table 1: Data Samples of Identical versus Non-Identical Paraphrases

To the best of our knowledge, paraphrase generation models that take advantage of the label in generating paraphrases have not been explored before. In this article, we propose using the variational autoencoder framework to develop models that can produce both semantically similar and dissimilar paraphrases.

The proposed variational paraphrase generation model is in the family of conditional variational autoencoders (CVAE) of [Sohn, Lee, and Yan2015]. Further independence assumptions and modifications to the loss function are proposed to the vanilla CVAE. Making the generation of the hidden variable conditional on the paraphrase label is comparable with the GMM priors employed in the TGVAE model [Wang et al.2019]. However, TGVAE relies on combined neural topic and sequence modeling in the generative process, while our work assumes the hidden variable is sampled from the GMM component that corresponds to the given input label.

The experimental results demonstrate that our proposed model outperforms baseline VAE and non-variational sequence to sequence models on paraphrasing datasets where data samples have a binary label. This label-dependent paraphrase generation can be used to extend an existing training set in various NLP tasks such as question answering, ranking, and paraphrase detection.

To summarize, the contributions of this paper are:

  • We propose label-dependent paraphrase generation for semantically identical and non-identical paraphrasing.

  • We present a new neural VAE model, DVPG, which benefits from labeled generation as well as the variational autoencoding framework.

  • We suggest several sampling and training schedules that considerably improve the performance of the proposed model.

In section 2, the proposed model is described and its evidence lower bound, also known as ELBO [Blei, Kucukelbir, and McAuliffe2016], is derived. Section 3 elaborates on choices in training schedules, variational sampling, model parameters, and measurement metrics. Experimental results are discussed in section 4. Finally, section 5 summarizes the article and provides future directions for this work.

2 Model

The proposed generative model is depicted in Figure 1. It contains an observed label and an observed text sequence, together with a hidden variable z. Figures 1(a) and 1(b) show the proposed model versus vanilla VAE [Kingma and Welling2013]. We believe the proposed DVPG (Deep Variational Paraphrase Generation) model is more capable than the vanilla VAE in probability density estimation of label-dependent paraphrasing datasets due to the inclusion of label information in the generation of the hidden state.

Figure 1: DVPG and VAE graphical models. (a) DVPG; (b) VAE.

The generation and inference paths of DVPG follow the graphical model in Figure 1(a). In the following part, the derivation of the evidence lower bound (ELBO) of the proposed model is explained.
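As background for the derivation, the standard identity behind any ELBO can be written for a generic latent-variable model. The symbols here are assumptions chosen for illustration: x is an observed variable, z a latent variable, p_θ the generative model, and q_φ the approximate posterior. This is the textbook form, not the label-conditioned factorization of DVPG.

```latex
\log p_\theta(x)
  = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
    - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z)\big)}_{\text{ELBO}}
  + \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big)
```

Because the last KL term is non-negative, the bracketed quantity is a lower bound on the log-likelihood. A label-dependent model changes which variables the prior and posterior are conditioned on, but the bound keeps this overall shape of a reconstruction term minus a KL term.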

2.1 Factorization and Objective

Maximizing the likelihood of the observed variables, , is used as the training objective. In the following, derivation and parameterization of the objective function are explained.


Where KL denotes the Kullback-Leibler divergence. Using the independence assumptions from Figure 1(a):


Using 2, we can rewrite 2.1:

Therefore, the Evidence Lower Bound can be written as :


2.2 Variational Parameterization

In order to simplify the calculation of KL-divergence loss and being able to take advantage of the reparameterization trick [Kingma and Welling2013], we made the following assumptions:

The superscripts indicate the parameterization. Each distribution is Gaussian, specified by a mean and a standard deviation. The entire set of parameters are : , and where . The optimization problem is to maximize the following:


We propose including KL divergence terms to regularize and to avoid degeneration of those pdfs. With regularization terms:


Since is known, its term is removed from the equation (2.1), and not included in further derivations of ELBO. During the training of the model, where the objective in 2.2 is maximized, the following path is followed for each instance: . and are used as the prediction of the model and ground truth, respectively, to compute the cross-entropy loss and other measurement metrics (section 3.7). Following this approach during the training, term does not appear in the training path. Therefore, a modified ELBO can be formed by setting . This will result in the following:


Experiments using equations 4, 2.2 and 2.2 as ELBO were performed and results reported in sections 3 and 4.
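The two ingredients that the Gaussian assumptions make tractable, reparameterized sampling and a closed-form KL term, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the per-label prior table `PRIORS`, the posterior values, and the function names are hypothetical, and diagonal Gaussians are assumed throughout.

```python
import math
import random


def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) for diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += (
            math.log(sp / sq)
            + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2)
            - 0.5
        )
    return total


# Hypothetical label-dependent priors: one Gaussian (mean, std) per label.
PRIORS = {
    0: ([-1.0, 0.0], [1.0, 1.0]),
    1: ([1.0, 0.0], [1.0, 1.0]),
}

rng = random.Random(0)
mu_q, sigma_q = [0.8, 0.1], [0.5, 0.9]  # made-up posterior parameters
z = reparameterize(mu_q, sigma_q, rng)
kl_to_label_1 = gaussian_kl(mu_q, sigma_q, *PRIORS[1])
kl_to_label_0 = gaussian_kl(mu_q, sigma_q, *PRIORS[0])
```

In this toy setup the posterior mean sits closer to the label-1 prior, so `kl_to_label_1` is smaller than `kl_to_label_0`. Selecting the prior by label is the sketch's stand-in for sampling from the mixture component that matches the input label.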

Figure 2: Independent vs Aggregated Labeled Variational Sampling in DVPG. (a) Independent; (b) Aggregated.

3 Experiments

The baseline neural network used in the experiments is the CopyNet sequence to sequence model introduced by [Gu et al.2016]. We chose this model due to its ability to select sub-phrases of the input sentence to be included in the output [See, Liu, and Manning2017]. Since paraphrasing requires the generation of a sequence that is lexically similar to the input sequence, the CopyNet model is a fitting choice [Li et al.2017].

The encoding layer applies a Transformer network [Vaswani et al.2017] to BERT [Devlin et al.2018] contextualized word embeddings of the input sequence. An LSTM [Hochreiter and Schmidhuber1997] decoder augmented with a copy mechanism and cross attention over the encoder outputs generates the paraphrase sequence.

The following models, loss types and training schedules were explored to measure the performance of the proposed approach:

3.1 Models

  • VAE: Vanilla VAE as in [Kingma and Welling2013].

  • Baseline: Non-variational CopyNet baseline as in [Gu et al.2016].

  • DVPG: Deep Variational Paraphrase Generation model proposed in this work.

3.2 Losses

The lower bound of likelihood, derived in 2.1, consists of a cross-entropy term and KL-divergence term, which will be referred to as KL. The proposed variations of the KL term, as derived in 2.2 and 2.2, are enumerated as below:

  • Loss 1: KL loss in equation 2.2.

  • Loss 2: KL loss in equation 2.1 without any regularization terms added.

  • Loss 3: KL loss in equation 2.2.

3.3 Training Schedules

Avoiding posterior collapse, where the KL term vanishes and the decoder ignores the hidden variable, is one of the challenges when training variational autoencoders. [Bowman et al.2015] suggest KL cost annealing to mitigate this issue, where the KL term is multiplied by a coefficient which gradually increases from zero to one as the training progresses. We employed this method in training all of the variational models in this work.

We also experiment with curriculum training [Bengio et al.2009]: for a fixed number of batches at the beginning of the training, we discard the variational variables and the KL loss, so the model is trained as a non-variational encoder-decoder. After the model is trained with the fixed number of batches, the variational variables and KL term are added, and KL coefficient annealing is applied as well. This curriculum learning scheme divides the training into two distinct phases: CE training, where the decoder language model, copy mechanism and encoder are trained to fit the training data, and variational training, where the prior is trained. The experimental results show that this approach combined with cost annealing outperforms the rest. This training schedule is referred to as two-step in this article.
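The two-step schedule can be sketched as a function from the training step to a KL coefficient. This is a minimal sketch under stated assumptions: the linear annealing shape and the `anneal_steps` value are hypothetical, while the 12000 CE-only batches match the figure given in section 4.5.

```python
def kl_weight(step, ce_only_steps, anneal_steps):
    """KL coefficient for two-step training with linear cost annealing.

    Returns None during the CE-only phase, where the variational variables
    and the KL term are switched off, then a weight rising from 0 to 1.
    """
    if step < ce_only_steps:
        return None
    progress = (step - ce_only_steps) / anneal_steps
    return min(1.0, progress)


def total_loss(ce, kl, step, ce_only_steps=12000, anneal_steps=4000):
    """Combine cross-entropy and KL according to the schedule above."""
    weight = kl_weight(step, ce_only_steps, anneal_steps)
    if weight is None:
        return ce  # phase 1: plain encoder-decoder training
    return ce + weight * kl  # phase 2: annealed variational training


# With CE = 10.0 and KL = 2.0, the loss moves from 10.0 towards 12.0.
losses = [total_loss(10.0, 2.0, s) for s in (0, 12000, 14000, 20000)]
```

Returning `None` rather than `0.0` in the first phase keeps the distinction between "KL term weighted by zero" and "no variational variables at all", which is what separates two-step training from plain annealing.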

3.4 Variational Sampling

Similar to [Gupta et al.2017], the summation of original encoder outputs of the CopyNet and sampled variational variables (z) is used in the decoder and copy mechanism to generate the output. Since CopyNet requires the encodings of each of the non-masked input tokens to generate the output tokens, two approaches are proposed to sample z:

  • Independent: a sample is obtained for each of the input token encodings independently, resulting in one z vector per non-masked input token (a matrix of sequence length by hidden dimension).

  • Aggregated: the encoder outputs are aggregated by average pooling, and a single z vector of the hidden dimension is sampled from the resulting aggregated vector.

Figure 2 visualizes the two proposed approaches when applied to DVPG.
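The difference between the two sampling approaches can be illustrated with plain Python lists. This is a sketch, not the model's code: `to_mu_sigma` is a hypothetical stand-in for the learned layers that map an encoding to a mean and standard deviation, and adding the single aggregated z to every token encoding is an assumption about how the summation is carried out.

```python
import random


def sample_gaussian(mu, sigma, rng):
    """Reparameterized draw: mu + sigma * eps, eps ~ N(0, 1) per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def mean_pool(encodings):
    """Average the token encodings into one vector of the hidden dimension."""
    n = len(encodings)
    return [sum(col) / n for col in zip(*encodings)]


def independent_sampling(encodings, to_mu_sigma, rng):
    """One z per token, each added to its own token encoding."""
    zs = [sample_gaussian(*to_mu_sigma(h), rng) for h in encodings]
    return [[a + b for a, b in zip(h, z)] for h, z in zip(encodings, zs)]


def aggregated_sampling(encodings, to_mu_sigma, rng):
    """One z for the whole sequence, sampled from the pooled encoding
    and added to every token encoding (assumed broadcast)."""
    z = sample_gaussian(*to_mu_sigma(mean_pool(encodings)), rng)
    return [[a + b for a, b in zip(h, z)] for h in encodings]


# Toy encoder outputs: 2 tokens, hidden dimension 2.
enc = [[1.0, 2.0], [3.0, 4.0]]
noisy = lambda h: (h, [0.1] * len(h))  # hypothetical mean/std mapping
rng = random.Random(0)
out_independent = independent_sampling(enc, noisy, rng)
out_aggregated = aggregated_sampling(enc, noisy, rng)
```

Both functions return one vector per token, which is what a copy mechanism needs. In the aggregated case all tokens share the same perturbation, whereas in the independent case each token is perturbed separately.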

3.5 Data

Each data sample is a tuple consisting of the original sequence, its paraphrase, and a label indicating whether the paraphrase is semantically identical or not. The Quora question pairs dataset [Shankar Iyar and Csernai2016] is used. This dataset consists of 400K tuples, where each tuple consists of a pair of questions and a label. Although sanitation methods have been applied to this dataset, the ground truth labels are noisy. Only the pairs where both the original sequence and the paraphrase are shorter than 14 tokens after being tokenized by the WordPiece tokenizer [Sennrich, Haddow, and Birch2015] are selected. This is done to reduce training time. Besides, since the dataset is noisy, we observed that limiting it to shorter phrases improves the quality. The resulting training, development, and test sets include 97k, 21k, and 21k pairs, respectively. The pairs that are labeled 0 are not entirely different questions, but questions where only a small fraction of tokens is different.
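The length filter described above can be sketched as follows. Whitespace splitting is used here only as a stand-in for WordPiece, which would usually produce more tokens per sentence; the function names are hypothetical and the two sample rows are taken from Table 1.

```python
def keep_pair(original, paraphrase, tokenize, max_len=14):
    """Keep a pair only if both sides have fewer than max_len tokens."""
    return (
        len(tokenize(original)) < max_len
        and len(tokenize(paraphrase)) < max_len
    )


def filter_pairs(rows, tokenize=str.split, max_len=14):
    """Filter (original, paraphrase, label) tuples by token length."""
    return [r for r in rows if keep_pair(r[0], r[1], tokenize, max_len)]


rows = [
    ("What is black hat SEO?", "What is white hat SEO?", 0),
    (
        "What is the step by step guide to invest in share market in india?",
        "What is the step by step guide to invest in share market?",
        0,
    ),
]
kept = filter_pairs(rows)
```

Under whitespace tokenization the second original question has exactly 14 tokens, so that pair is dropped by the strict "less than 14" rule while the first pair is kept.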

3.6 Training Parameters

AllenNLP [Gardner et al.2017] and PyTorch [Paszke et al.2017] are used as the development and experimentation environments. ADAM optimizer [Kingma and Ba2014] with learning rate of is used for training. The Transformer encoder consists of 1 layer and 8 attention heads. Projection, feedforward, and hidden dimensions of the encoder are 256, 128, and 128, respectively. The target vocabulary is pruned to include only the 5000 most frequent tokens, tokenized by WordPiece. The number of decoding steps is limited to the maximum input sequence length of 13 tokens. The target embedding dimension is set to 768. During evaluation decoding, beam search of size 16 is used. Each of the models has approximately 7 million parameters. We chose this set of parameters such that the baseline model would perform well on the dataset. Models are trained for 20 epochs, and the best model is chosen based on Max-BLEU score on the development set.

All the hyperparameters are fixed during the training of all the models, and no parameter or hyperparameter tuning is done. The training is done on Amazon EC2 using p3.16xlarge instances, which have Tesla V100 GPUs.

| Method | Model | Max-BLEU | Min-TER | Max-ROUGE-1 | Max-ROUGE-2 | Max-ROUGE-3 |
|---|---|---|---|---|---|---|
| Type I | DVPG Loss 3 | 37.10±0.27 | 45.39±0.08 | 61.43±0.24 | 41.17±0.15 | 28.41±0.07 |
| Type I | DVPG Loss 2 | 36.68±0.28 | 45.50±0.17 | 61.16±0.26 | 40.87±0.21 | 28.20±0.24 |
| Type I | DVPG Loss 1 | 36.98±0.20 | 45.46±0.10 | 61.36±0.12 | 41.06±0.08 | 28.27±0.05 |
| Type I | VAE | 37.041±0.17 | 45.4±0.09 | 61.32±0.09 | 41.03±0.13 | 28.25±0.19 |
| Type II | DVPG Loss 3 | 36.61±0.34 | 45.35±0.31 | 60.45±0.14 | 40.22±0.13 | 27.38±0.11 |
| Type II | DVPG Loss 2 | 37.82±0.10 | 44.42±0.22 | 61.39±0.06 | 41.31±0.05 | 28.42±0.09 |
| Type II | DVPG Loss 1 | 36.88±0.18 | 45.3±0.14 | 60.83±0.22 | 40.63±0.27 | 27.77±0.32 |
| Type II | VAE | 36.87±0.48 | 45.24±0.32 | 60.85±0.26 | 40.49±0.31 | 27.55±0.22 |
| Type III | DVPG Loss 3 | 36.04±0.08 | 46.11±0.09 | 61.23±0.07 | 40.61±0.15 | 27.63±0.21 |
| Type III | DVPG Loss 2 | 35.72±0.12 | 46.17±0.07 | 60.67±0.22 | 40.27±0.14 | 27.56±0.11 |
| Type III | DVPG Loss 1 | 35.94±0.06 | 46.20±0.13 | 61.08±0.16 | 40.54±0.03 | 27.63±0.14 |
| Type III | VAE | 35.85±0.16 | 46.27±0.19 | 61.22±0.12 | 40.58±0.07 | 27.61±0.13 |
| Type IV | DVPG Loss 3 | 38.13±0.13 | 44.46±0.24 | 62.58±0.34 | 41.97±0.16 | 28.75±0.13 |
| Type IV | DVPG Loss 2 | 38.42±0.19 | 44.09±0.26 | 62.55±0.43 | 42.10±0.27 | 28.92±0.23 |
| Type IV | DVPG Loss 1 | 38.33±0.11 | 44.20±0.14 | 62.52±0.30 | 42.03±0.21 | 28.84±0.16 |
| Type IV | VAE | 38.03±0.42 | 44.42±0.28 | 62.21±0.47 | 41.73±0.37 | 28.6±0.36 |
| Best Loss Type | | 2 | 2 | 3 | 2 | 2 |
| Best Training Type | | IV | IV | IV | IV | IV |
| - | Seq2Seq Baseline | 29.53±0.08 | 51.46±0.04 | 56.60±0.27 | 35.29±0.14 | 22.79±0.10 |

Table 2: Comparison of Best-metric scores (mean ± standard deviation). Average and standard deviation of results are calculated over three runs of each experiment with different initial seeds.

3.7 Metrics

Metrics frequently used in text generation applications such as machine translation [Bahdanau, Cho, and Bengio2016], summarization [Cheng and Lapata2016] and paraphrasing are employed to measure the performance of the models. They are as follows: ROUGE-1, ROUGE-2, ROUGE-3, BLEU-4, and TER. As suggested in [Jain, Zhang, and Schwing2017], generating only one sample for each paraphrase tuple and calculating the metrics on it does not reasonably demonstrate the generative power of variational models. The variational variable z, as discussed in section 2.2, encourages the decoder to generate sentences that are more diverse, both token-wise and semantically, than those of the baseline sequence to sequence model. With a single sample, this diversity could lead to lower scores than the non-variational baseline. One approach suggested in [Jain, Zhang, and Schwing2017] is to generate multiple paraphrase sequences for each input sequence and measure the best performing sequence based on the selected criteria, thereby giving the generative model more chances of generating a paraphrase that matches the oracle sequence more closely.

Following this approach, during the evaluation on development and test sets, a fixed number of paraphrases is generated for each input tuple, and the following values are calculated for each of the metrics listed above:

  • Avg-metric: for each generated sample, the desired metric is measured with respect to the reference paraphrase, and the average is calculated over all the generated samples.

  • Best-metric: Similar to Avg-metric, except the sequence showing the best performance with respect to the metric is selected and used in calculating the desired metric. This is referred to in [Vijayakumar et al.2016] as Oracle metric.
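The two aggregation rules can be sketched for a single input with several generated samples. The scoring function `unigram_f1` is a toy stand-in written for this illustration, not BLEU, ROUGE, or TER, and the sample sentences are made up around a reference from Table 1.

```python
def unigram_f1(candidate, reference):
    """Toy overlap score in [0, 1]; a stand-in for a real metric."""
    cand, ref = candidate.split(), reference.split()
    overlap = sum(min(cand.count(t), ref.count(t)) for t in set(cand))
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def avg_and_best(samples, reference, metric, higher_is_better=True):
    """Return (Avg-metric, Best-metric) over the samples for one input."""
    scores = [metric(s, reference) for s in samples]
    best = max(scores) if higher_is_better else min(scores)
    return sum(scores) / len(scores), best


reference = "how do i quit smoking"
samples = [
    "how do i quit smoking",
    "how can i stop smoking",
    "what is smoking",
]
avg_score, best_score = avg_and_best(samples, reference, unigram_f1)
```

For an error-style metric such as TER, where lower is better, `higher_is_better=False` selects the minimum instead, which corresponds to the Min-TER column in the result tables. Corpus-level numbers would then be obtained by aggregating these per-input values over the evaluation set.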

4 Results

Experiments were done for the models in section 3.1, losses in section 3.2, training schedules in section 3.3, and variational sampling discussed in section 3.4. When performing two-step training, only CE minimization is performed for the first 6 epochs, after which the variational variable and KL loss minimization are also included in the training process. We chose this number because we observed that after 6 epochs, the non-variational baseline achieves a competitive BLEU score on the development set. During the evaluation of variational models, 10 samples are generated for each development and test set tuple when calculating Max-metrics and Avg-metrics. Each experiment is performed with 3 different seeds. The average and standard deviation of each of the metrics are calculated and compared.

Configurations used in training the variational models are enumerated as follows: Type I: Independent variational sampling, Type II: Independent variational sampling + two-step training, Type III: Aggregated variational sampling, Type IV: Aggregated variational sampling + two-step training. When reporting performance of DVPG models, the applied KL loss (3.2) is appended to the model tag.

Figure 3: Changes to average and best metric values when changing the number of samples for the DVPG model. (a) Average Metrics; (b) Best Metrics.

The models are trained on the training set; the development set is used to select the best model. Max-BLEU is used as the selection metric. Once the best model is selected for each of the settings of loss, training schedule and model type, the model is run over the test set, and the results are reported in Tables 2 and 3.

| Method | Model | Avg-BLEU | Avg-TER | Avg-ROUGE-1 | Avg-ROUGE-2 | Avg-ROUGE-3 | Total Loss |
|---|---|---|---|---|---|---|---|
| Type I | DVPG Loss 3 | 28.37±0.17 | 52.43±0.11 | 55.56±0.23 | 34.19±0.19 | 21.70±0.16 | 16.94±0.30 |
| Type I | DVPG Loss 2 | 28.71±0.27 | 52.07±0.15 | 55.5±0.51 | 34.33±0.35 | 21.94±0.31 | 16.53±0.11 |
| Type I | DVPG Loss 1 | 28.49±0.03 | 52.33±0.07 | 55.63±0.02 | 34.31±0.03 | 21.82±0.04 | 16.99±0.36 |
| Type I | VAE | 27.97±0.29 | 52.86±0.38 | 54.97±0.20 | 33.76±0.20 | 21.32±0.24 | 17.49±0.89 |
| Type II | DVPG Loss 3 | 24.10±1.09 | 55.74±0.93 | 51.01±1.05 | 29.48±1.22 | 17.49±1.03 | 17.93±0.42 |
| Type II | DVPG Loss 2 | 26.07±1.03 | 54.02±0.85 | 52.88±1.17 | 31.58±1.11 | 19.34±1.02 | 17.45±0.33 |
| Type II | DVPG Loss 1 | 25.65±0.67 | 54.48±0.58 | 52.66±0.75 | 31.23±0.78 | 19.02±0.64 | 17.54±0.29 |
| Type II | VAE | 24.53±0.67 | 55.79±0.64 | 51.37±0.69 | 29.96±0.75 | 17.96±0.64 | 18.0±0.25 |
| Type III | DVPG Loss 3 | 28.59±0.39 | 52.41±0.37 | 55.72±0.42 | 34.49±0.41 | 21.96±0.37 | 18.06±0.92 |
| Type III | DVPG Loss 2 | 29.16±0.07 | 51.72±0.08 | 55.8±0.27 | 34.76±0.15 | 22.37±0.09 | 16.5±0.15 |
| Type III | DVPG Loss 1 | 28.812±0.11 | 52.09±0.20 | 55.99±0.08 | 34.71±0.03 | 22.15±0.06 | 17.42±0.75 |
| Type III | VAE | 28.82±0.16 | 52.15±0.26 | 56.15±0.15 | 34.79±0.13 | 22.19±0.10 | 17.6±0.42 |
| Type IV | DVPG Loss 3 | 26.37±0.74 | 54.82±0.90 | 53.38±1.05 | 32.14±0.85 | 19.92±0.69 | 18.88±0.27 |
| Type IV | DVPG Loss 2 | 26.02±0.24 | 54.95±0.40 | 52.86±0.30 | 31.66±0.21 | 19.52±0.17 | 18.71±0.08 |
| Type IV | DVPG Loss 1 | 25.88±0.27 | 55.23±0.43 | 52.66±0.33 | 31.5±0.26 | 19.42±0.20 | 19.02±0.04 |
| Type IV | VAE | 25.82±0.30 | 55.14±0.21 | 52.59±0.45 | 31.41±0.39 | 19.31±0.32 | 19.04±0.09 |
| Best Loss Type | | 2 | 2 | 3 | 2 | 2 | 2 |
| Best Training Type | | III | III | III | III | III | III |
| - | Seq2Seq Baseline | 29.53±0.08 | 51.46±0.04 | 56.60±0.27 | 35.29±0.14 | 22.79±0.10 | 15.92±0.04 |

Table 3: Comparison of Avg-metric scores (mean ± standard deviation). Average and standard deviation of results are calculated over three runs of each experiment with different initial seeds.

4.1 Best-Metrics

Table 2 shows the Best-metric scores for the proposed training types and models compared with the variational and non-variational baselines. The following can be observed:

  • Variational models outperform the non-variational model by a wide margin. Absolute improvements of 9% in BLEU, 7.4% in TER, 6% in ROUGE-1, 7% in ROUGE-2, and 6% in ROUGE-3 are observed. As similarly reported by [Jain, Zhang, and Schwing2017], this indicates the generative power of variational models in producing diverse outputs.

  • The DVPG model outperforms the baseline VAE by modest but consistent margins. Absolute improvements of 0.39% in BLEU, 0.33% in TER, 0.37% in ROUGE-1, 0.37% in ROUGE-2 and 0.32% in ROUGE-3 are seen when comparing the best DVPG model to the best VAE model. Considering that no parameter tuning is done and the results are averaged over three seeds, this demonstrates the efficacy of the proposed model.

  • Two-step training, as discussed in section 3.3, contributes to improvement in Best-metrics when used with aggregated variational sampling (Type IV). However, when used with independent variational sampling (Type II), it does not demonstrate consistent gains.

  • Aggregated variational sampling generally outperforms the Independent sampling method, as shown in Table 2 by comparing training types (Type III, Type IV) versus (Type I, Type II). This supports the hypothesis that the independence assumption underlying Independent variational sampling does not hold for sequential input, where there are dependencies between the tokens.

| Model | Max-BLEU | Min-TER | Max-ROUGE-1 | Max-ROUGE-2 | Max-ROUGE-3 |
|---|---|---|---|---|---|
| DVPG Loss 3 | 48.24±0.97 | 47.15±0.4 | 69.35±0.20 | 51.45±0.42 | 40.99±0.66 |
| DVPG Loss 2 | 46.32±1.48 | 46.69±0.26 | 69.3±0.12 | 51.54±0.29 | 41.29±0.56 |
| DVPG Loss 1 | 47.08±1.42 | 46.48±0.21 | 69.32±0.12 | 51.67±0.35 | 41.57±0.72 |
| VAE | 47.29±0.85 | 46.70±0.49 | 69.35±0.19 | 51.66±0.39 | 41.47±0.9 |
| Seq2Seq Baseline | 45.11±0.67 | 49.72±0.62 | 65.56±0.95 | 48.71±0.79 | 39.37±0.59 |

Table 4: Comparison of Best-metric scores (mean ± standard deviation) with Microsoft Research Paraphrasing Corpus

4.2 Average-metrics

Average-metric values on the test set are shown in Table 3. It is important to note that the best models are not picked based on the best average value. The best model is picked based on the highest Max-BLEU score; therefore, the Average-metrics would not necessarily deliver a fair judgment on the superiority of one model over another. Besides, a model with larger diversity in its outputs has a higher chance of producing paraphrases that are, on average, less similar to the ground truth than a model that introduces less diversity in the generated sequences. Hence, such a model, while having higher Max-metrics, might suffer from lower Average-metrics. Nonetheless, Average-metrics provide a measure of the average quality of the generated paraphrases.

Similar to the Best-metrics results, the DVPG model with Loss 2 performs the best amongst the DVPG-based models in Average-metrics. Additionally, its performance exceeds VAE's in BLEU by 0.34%, in TER by 0.43%, in ROUGE-3 by 0.18% and in Total Loss by 1.1. The CE loss is not normalized by the length of the sequence, hence the large values. Experiments were done with normalizing CE by the sequence length, and the results did not demonstrate higher Max or Average metrics.

VAE performs better in ROUGE-1 and ROUGE-2 by 0.16% and 0.03% absolute, respectively, although the latter is well within the reported standard deviation.

Comparing the Average-metrics of the variational models against the Seq2Seq Baseline, it can be observed that the non-variational model exceeds the performance of the best variational model. Furthermore, the setting that demonstrated the best performance in Best-metrics (DVPG Loss 2 Type IV) is not the same setting that produces the best Average-metrics amongst the variational models (DVPG Loss 2 Type III). This observation is attributed to the trade-off between diversity and average performance, as discussed previously.

4.3 Generative Power

To measure the limit of the generative power of the proposed variational models, the number of samples used in variational sampling (section 3.4) is changed from 1 to 20 during the evaluation, and the best model for each sample is selected. The changes in Average and Best metrics are depicted in Figures 3(a) and 3(b), respectively. DVPG Loss 2 Type IV (section 3.1) is used. The improvement in the Best-metrics from increasing the number of samples diminishes for values larger than 10. For example, increasing the sample size from 1 to 10 results in a 9-point increase in Max-BLEU, while increasing it from 10 to 20 yields a 1.4-point improvement. This indicates the model has reached its generative capacity with respect to the given dataset, and further enhancement of diversity requires changes in the underlying model architecture. A similar trend can be observed in the other Best-metrics.

The change in Average-metrics in Figure 3(a) reinforces the argument regarding the trade-off between diversity and average performance. However, the degradation of Average-metrics with increasing sample size is not proportional to the increase in the Best-metrics. Taking TER as an example, changing the sample size from 1 to 20 produces a 3-point absolute increase in Average-TER, compared to a 10-point decrease in Best-TER. Therefore, we can infer from this observation that the generated output sequences, while being diverse, are still close to the gold output sequence.

4.4 Microsoft Research Paraphrasing Dataset

Table 4 reports the results of running the Baseline Seq2Seq, DVPG, and VAE models with Type IV training on the Microsoft Research Paraphrasing Dataset [Dolan and Brockett2005]. This dataset comprises 5800 tuples, where each tuple consists of an original string, the paraphrased sequence, and a binary label indicating whether the paraphrased sequence is semantically identical to the original sequence. The dataset is split into three sets of 4100, 850, and 850 tuples as training, development and test sets, respectively. In the interest of being succinct, only Best-metrics are shown. As can be observed, the best DVPG variant matches or outperforms VAE on every metric. Most notably, absolute improvements of 0.95% in Max-BLEU and 0.22 in Min-TER are achieved by using the proposed generative process. The improvements in ROUGE metrics are marginal. We suspect the relatively small size of the dataset diminishes the improved generative power of the DVPG model. Evidence for this conjecture is the smaller gap between the variational models and the non-variational baseline when compared with the results in Tables 2 and 3, where the much larger Quora dataset was employed.

An interesting observation is that, contrary to the results with the Quora dataset, DVPG with the more regularized KL losses outperforms the un-regularized Loss 2. This can be attributed to the smaller size of the dataset, which makes such regularization more necessary.

4.5 Analysis of KL Losses

When the two-step schedule is applied, the CE-only training is done for the first 12000 batches, or 6 epochs, after which the KL loss(es) are included in the loss function. We speculate that minimizing only the CE loss for several epochs not only facilitates encoding a larger volume of information in the variational parameter but also enhances the decoder. When training complex sequence to sequence models, where training such encoders and decoders requires multiple epochs, two-step training would be more effective than vanilla KL cost annealing [Bowman et al.2015].

5 Conclusion

Paraphrase generation when there are two classes of paraphrases is explored in this article. A new graphical model is introduced, and the corresponding ELBO is derived. Our experiments on Quora Question Pairs and Microsoft Research Paraphrasing Dataset showed that the proposed model outperforms vanilla VAE and the non-variational baseline across the metrics that measure the generative power of the models, therefore supporting the hypothesis that label-dependent paraphrase generation can better learn the distribution of labeled paraphrasing datasets. Furthermore, the proposed variational sampling and training schedules showed consistent improvements with the variational models. One future direction of this work is to explore the setting where the label variable is not observed, thereby extending its application to unannotated paraphrase corpora. Applying it to NLP tasks such as machine reading comprehension or answer ranking is another continuation of this work.