Large-scale Hierarchical Alignment for Author Style Transfer

10/18/2018 ∙ by Nikola I. Nikolov, et al. ∙ ETH Zurich

We propose a simple method for extracting pseudo-parallel monolingual sentence pairs from comparable corpora representative of two different author styles, such as scientific papers and Wikipedia articles. Our approach is to first hierarchically search for nearest document neighbours and then for sentences therein. We demonstrate the effectiveness of our method through automatic and extrinsic evaluation on two tasks: text simplification from Wikipedia to Simple Wikipedia and style transfer from scientific journal articles to press releases. We show that pseudo-parallel sentences extracted with our method not only improve existing parallel data, but can even lead to competitive performance on their own.






1 Introduction

Parallel corpora are indispensable resources for advancing monolingual and multilingual text rewriting tasks. Because parallel corpora are scarce and costly to create manually, a number of methods have been proposed that perform large-scale sentence alignment: automatic extraction of pseudo-parallel sentence pairs from raw, comparable corpora (corpora that contain documents on similar topics). While pseudo-parallel data is beneficial for machine translation Munteanu and Marcu (2005); Marie and Fujita (2017), there has been little work on large-scale sentence alignment for monolingual text-to-text rewriting tasks, such as simplification Nisioi et al. (2017) or style transfer Liu et al. (2016). The lack of parallel data has impeded the application of data-driven methods on these tasks.

Figure 1: Illustration of large-scale hierarchical alignment (LHA). For each document in a source dataset, document alignment retrieves matching documents from a target dataset. In turn, sentence alignment retrieves matching sentence pairs from within each document pair.

In this paper, we propose a simple unsupervised method, Large-scale Hierarchical Alignment (LHA) (Figure 1; Section 3), for extracting pseudo-parallel sentence pairs from two raw monolingual corpora which contain documents that are representative of two different author styles, such as scientific papers and press releases. LHA hierarchically searches for document and sentence nearest neighbors within the two corpora, extracting sentence pairs that have high semantic similarity, yet preserve the stylistic characteristics representative of their original datasets. LHA is robust to noise and is fast and memory efficient, enabling its application to datasets on the order of hundreds of millions of sentences. Its generality makes it relevant to a wide range of monolingual text rewriting tasks in NLP.

We demonstrate the effectiveness of LHA on automatic benchmarks for alignment (Section 4), as well as extrinsically, by training neural machine translation (NMT) systems on two style transfer tasks: text simplification from Wikipedia to the Simple Wikipedia (Section 5.2) and style transfer from scientific journal articles to press releases (Section 5.3). We show that pseudo-parallel datasets obtained by LHA are not only useful for augmenting existing parallel data, boosting the performance on automatic measures, but can even be useful on their own where no parallel data is available.

2 Background

2.1 Style transfer

The goal of style transfer is to transform an input text to satisfy specific stylistic constraints, such as text simplicity Nisioi et al. (2017) or a more general author style, such as political (e.g. democratic to republican) or gender (e.g. male to female) Prabhumoye et al. (2018); Shen et al. (2017). Style transfer systems can be valuable when preparing a text for multiple audiences, such as simplification for language learners Siddharthan (2002) or people with reading disabilities Inui et al. (2003). They can also be used to improve the accessibility of technical documents, e.g. to simplify terms in clinical records for laymen Kandula et al. (2010); Abrahamsson et al. (2014).

Style transfer can be cast as a data-driven text-to-text rewriting task, where transformations are learned using large collections of parallel sentences. Limited availability of high-quality parallel data is a major bottleneck for this approach. Recent work on Wikipedia and the Simple Wikipedia Coster and Kauchak (2011); Kajiwara and Komachi (2016) and on the Newsela dataset of simplified news articles for children Xu et al. (2015) explore supervised, data-driven approaches to text simplification. Such approaches typically rely on statistical Xu et al. (2016) or neural Štajner and Nisioi (2018) machine translation.

Other work has started to investigate unsupervised style transfer, without parallel corpora, but using variational Fu et al. (2017) or cross-aligned Shen et al. (2017) autoencoders to learn latent representations of content separate from style. In Prabhumoye et al. (2018), the authors model style transfer as a back-translation task by translating input sentences into an intermediate language. They use the translations to train separate English decoders for each target style by combining the decoder loss with the loss of a style classifier, separately trained to distinguish between the target styles.

2.2 Sentence alignment

The goal of sentence alignment is to extract from raw corpora sentence pairs suitable as training examples for text-to-text rewriting tasks such as machine translation or text simplification. When the documents in the corpora are parallel (labelled document pairs, such as identical articles in two languages), the task is to identify suitable sentence pairs from each document. This problem has been extensively studied both in the multilingual (Brown et al., 1991; Moore, 2002) and monolingual (Hwang et al., 2015; Kajiwara and Komachi, 2016; Štajner et al., 2018) case. The limited availability of parallel corpora led to the development of large-scale sentence alignment methods, which is the focus of this work. The aim there is to extract pseudo-parallel sentence pairs from within raw, non-aligned corpora. Potentially, millions of examples for any task occur naturally within existing textual resources, amply available on the internet.

The majority of previous work on large-scale sentence alignment is in machine translation, where adding pseudo-parallel pairs to an existing parallel dataset has been shown to boost the translation performance Munteanu and Marcu (2005); Uszkoreit et al. (2010). The work that is most closely related to ours is Marie and Fujita (2017), which relies on nearest neighbour search of word and sentence embeddings trained on target translation languages. The authors extract a large set of rough sentence pairs from all sentences in the datasets, without relying on document-level information. They then filter the rough pairs using a classifier trained on parallel translation data, abundantly available for many language pairs. More recently, Grégoire and Langlais (2018) use a Siamese Recurrent Neural Network (RNN) classifier, trained to distinguish valid from non-valid translation sentence pairs. The need for some parallel data limits the application of such approaches in settings in which there is only scarce parallel data available, for example for author style transfer from scientific papers to press releases.

There is little work on large-scale sentence alignment focusing on monolingual tasks. In Barzilay and Elhadad (2003), authors develop a hierarchical alignment approach which first clusters paragraphs on similar topics, before performing alignment on the sentence level. They argue that, for monolingual data, pre-clustering of larger textual units is more robust to noise compared to sentence matching applied directly on the dataset level.

3 Large-scale hierarchical alignment (LHA)

Given two datasets which contain comparable documents written in two different author styles: a source dataset consisting of documents (e.g. all Wikipedia articles) and a target dataset consisting of documents (e.g. all articles from the Simple Wikipedia), our approach to large-scale monolingual alignment is hierarchical, consisting of two consecutive steps: document alignment followed by sentence alignment (see Figure 1).

3.1 Document alignment

For each source document, document alignment retrieves its nearest neighbours from the target dataset; the source document and each retrieved neighbour form a pseudo-parallel document pair. The aim is to select document pairs across the datasets that have high semantic similarity, and thus potentially contain good pseudo-parallel sentence pairs representative of the document styles of each dataset.

To find nearest neighbours, we rely on two components: document embedding and approximate nearest neighbour search methods. For each dataset, we pre-compute document embeddings. We employ nearest neighbour search methods (we use the Annoy library) to partition the embedding space, enabling fast and efficient nearest neighbour retrieval of similar documents across the source and target datasets. This enables us to find the nearest target document embeddings for each source embedding. We additionally filter out any document pairs whose similarity is below a manually selected document threshold. In Section 4, we evaluate a range of different document embedding approaches, as well as alternative similarity metrics.
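The retrieval step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: an exhaustive cosine-similarity search stands in for the approximate Annoy index, and the function names, the toy embeddings, and the values of `k` and `threshold` are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def align_documents(src_emb, tgt_emb, k=2, threshold=0.5):
    """Return (src_index, tgt_index, similarity) for the k nearest targets of each source.

    Exhaustive search is used here in place of an approximate index.
    k and threshold are illustrative values, not the paper's settings.
    """
    pairs = []
    for i, s in enumerate(src_emb):
        scored = sorted(((cosine(s, t), j) for j, t in enumerate(tgt_emb)), reverse=True)
        for sim, j in scored[:k]:
            if sim >= threshold:  # drop pairs below the document threshold
                pairs.append((i, j, sim))
    return pairs

# Toy 3-dimensional "document embeddings" (made up for illustration).
source = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
target = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.1, 0.9, 0.0]]
doc_pairs = align_documents(source, target, k=2, threshold=0.5)
print([(i, j) for i, j, _ in doc_pairs])
```

On corpora with millions of documents the exhaustive loop would be far too slow, which is presumably why an approximate index is used in practice.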

3.2 Sentence alignment

Given a pseudo-parallel document pair that contains a source document consisting of sentences and a target document consisting of sentences, sentence alignment extracts pseudo-parallel sentence pairs that are highly similar.

To implement sentence alignment, we first embed each sentence in the source and target documents and compute an inter-sentence similarity matrix over all sentence pairs in the two documents. From this matrix we extract the nearest neighbours of each source and each target sentence; a source sentence may have several target sentences as its neighbours, and vice versa. We remove all sentence pairs with similarity below a manually set sentence threshold. We then merge all overlapping sets of nearest sentences in the documents to produce pseudo-parallel sentence sets (e.g. when a source sentence is closest to several target sentences, and one of those target sentences is in turn closest to several source sentences). This approach, inspired by Štajner et al. (2018), gives our framework the flexibility to model multi-sentence interactions, such as sentence splitting or compression, as well as individual sentence-to-sentence reformulations. Note that when only a single nearest neighbour is retrieved per sentence, we only retrieve individual sentence pairs.
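One way to read the merging step is sketched below. It is an assumption-laden illustration rather than the paper's exact rule: a pair is kept only when the two sentences appear in each other's neighbour lists and pass the threshold, and overlapping pairs are then merged as connected components. The names `top_k` and `align_sentences`, the toy matrix, and the parameter values are hypothetical.

```python
def top_k(scores, k):
    """Indices of the k highest scores."""
    order = sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)
    return set(order[:k])

def align_sentences(sim, k=2, threshold=0.5):
    """Group sentence indices given sim[i][j] for source sentence i and target sentence j."""
    n_src, n_tgt = len(sim), len(sim[0])
    src_nn = [top_k(sim[i], k) for i in range(n_src)]
    tgt_nn = [top_k([sim[i][j] for i in range(n_src)], k) for j in range(n_tgt)]
    # Keep pairs that are mutual neighbours and pass the sentence threshold.
    edges = [(i, j) for i in range(n_src) for j in sorted(src_nn[i])
             if i in tgt_nn[j] and sim[i][j] >= threshold]

    # Union-find over nodes ("s", i) and ("t", j) merges overlapping sets.
    parent = {}
    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for i, j in edges:
        parent[find(("s", i))] = find(("t", j))

    groups = {}
    for i, j in edges:
        src, tgt = groups.setdefault(find(("s", i)), (set(), set()))
        src.add(i)
        tgt.add(j)
    return sorted((sorted(s), sorted(t)) for s, t in groups.values())

# Toy similarity matrix: source sentence 0 matches target sentences 0 and 1.
sim = [
    [0.9, 0.8, 0.1],
    [0.2, 0.3, 0.1],
    [0.1, 0.2, 0.7],
]
print(align_sentences(sim, k=2, threshold=0.5))
print(align_sentences(sim, k=1, threshold=0.5))
```

With `k=2` the sketch groups source sentence 0 with target sentences 0 and 1 (a sentence-splitting pattern); with `k=1` the mutual-neighbour rule can only yield one-to-one pairs, in line with the remark above about single-neighbour retrieval.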

The final output of sentence alignment is a list of pseudo-parallel sentence pairs that have high semantic similarity, yet preserve the stylistic characteristics of each of their original datasets. The pseudo-parallel pairs can be used to either augment an existing parallel dataset (as in Section 5.2), or independently, to solve a new author style transfer task for which there is no parallel data available (as in Section 5.3).

3.3 System variants

The aforementioned framework gives us the flexibility to explore diverse variants, by exchanging document/sentence embeddings or textual similarity metrics. We compare all variants in an automatic evaluation in Section 4.

Text embeddings

We experiment with three text embedding methods. Avg is the average of the constituent word embeddings of a text (we use the Google News Word2Vec models), a simple approach that has proved to be a strong baseline for many text similarity tasks. In Sent2Vec Pagliardini et al. (2017) (we use the unigram Wikipedia model provided by the authors), the word embeddings are specifically optimized towards additive combination over the sentence, using an unsupervised objective function. This approach performed well on many unsupervised and supervised text similarity tasks, often outperforming more sophisticated supervised recurrent or convolutional architectures, while remaining very fast to compute. InferSent Conneau et al. (2017) (we use the GloVe-based model provided by the authors) is a more sophisticated supervised embedding approach based on bidirectional LSTMs, trained on natural language inference data.
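As a concrete illustration of the Avg method, the sketch below averages the vectors of the in-vocabulary tokens of a text. The tiny two-dimensional vocabulary is invented for the example; a real setup would look tokens up in pretrained word vectors instead.

```python
def average_embedding(tokens, word_vectors):
    """Average the vectors of all in-vocabulary tokens; return None if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]

# Toy vocabulary (made up); out-of-vocabulary tokens are simply skipped.
toy_vectors = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "pet": [0.5, 0.5]}
print(average_embedding(["cat", "dog", "unknown"], toy_vectors))  # [0.5, 0.5]
```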

Word similarity

We additionally test four word-based approaches for computing text similarity. Those can be used either on their own, or to refine the nearest neighbour search across documents or sentences. 1) We use the string overlap between source tokens and target tokens (excluding punctuation, numbers and stopwords). 2) We use the BM25 ranking function Robertson et al. (2009), an extension of TF-IDF. 3) We use Word Mover’s Distance (WMD) Kusner et al. (2015), which measures the distance the embedded words of one document need to travel to reach the embedded words of another document. WMD has recently achieved good results on text retrieval Kusner et al. (2015) and sentence alignment Kajiwara and Komachi (2016). 4) We use the Relaxed Word Mover’s Distance (RWMD) Kusner et al. (2015), which is a fast approximation of the WMD.
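To make the relaxed variant concrete, here is a small sketch of RWMD under the simplifying assumption that every word in a text carries equal weight. Each word moves to its closest word in the other text, and the larger of the two one-sided costs is returned. The function names and the two-dimensional toy vectors are illustrative only.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def one_sided(doc_a, doc_b):
    """Mean distance from each word vector in doc_a to its closest vector in doc_b."""
    return sum(min(euclidean(a, b) for b in doc_b) for a in doc_a) / len(doc_a)

def relaxed_wmd(doc_a, doc_b):
    """Relaxed WMD with uniform word weights: the larger of the two one-sided costs."""
    return max(one_sided(doc_a, doc_b), one_sided(doc_b, doc_a))

# Each text is a list of word vectors (toy values).
text_a = [[0.0, 0.0], [1.0, 0.0]]
text_b = [[0.0, 0.0]]
print(relaxed_wmd(text_a, text_b))  # 0.5
```

Because each word only needs its single nearest counterpart, the cost is quadratic in the number of words, whereas the full WMD requires solving a transport problem.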

4 Automatic evaluation on Wikipedia

| Approach | EDim | Doc. max F1 | Doc. TP | Doc. threshold | doc/sec | Sent. max F1 | Sent. TP | Sent. threshold | sent/sec |
|---|---|---|---|---|---|---|---|---|---|
| Avg | 300 | 0.66 | 43% | 0.69 | 260 | 0.675 | 46% | 0.82 | 1458 |
| Sent2Vec | 600 | 0.78 | 61% | 0.62 | 343 | 0.692 | 48% | 0.69 | 1710 |
| InferSent | 4096 | 0.36 | 13% | 0.79 | 6.4 | 0.69 | 49% | 0.88 | 110 |
| Overlap | - | 0.53 | 29% | 0.66 | 120 | 0.63 | 40% | 0.5 | 1600 |
| BM25 | - | 0.46 | 16% | 0.257 | 60 | 0.52 | 27% | 0.43 | 20K |
| RWMD | 300 | 0.713 | 51% | 0.67 | 60 | 0.704 | 50% | 0.379 | 1050 |
| WMD | 300 | 0.49 | 24% | 0.3 | 1.5 | 0.726 | 54% | 0.353 | 180 |
| Hwang et al. (2015) | - | - | - | - | - | 0.712 | - | - | - |
| Kajiwara and Komachi (2016) | - | - | - | - | - | 0.724 | - | - | - |

Table 1: Automatic evaluation of Document (left) and Sentence alignment (right). EDim is the embedding dimensionality. TP is the percentage of true positives obtained at the maximum F1 score. Speed is calculated on a single CPU thread.

| Approach | Max F1 | TP | Time |
|---|---|---|---|
| LHA (Sent2Vec) | 0.54 | 31% | 33s |
| LHA (Sent2Vec WMD) | 0.57 | 33% | 1m45s |
| Global (Sent2Vec) | 0.339 | 12% | 15s |
| Global (WMD) | 0.291 | 12% | 30m45s |

Table 2: Evaluation on large-scale sentence alignment: identifying the good sentence pairs without any document-level information. We pre-compute the embeddings and use the Annoy ANN library. For the WMD-based approaches, we re-compute the top 50 sentence nearest neighbours of Sent2Vec.

We perform an automatic evaluation of LHA using an annotated dataset for sentence alignment Hwang et al. (2015). The dataset contains 46 article pairs from Wikipedia and the Simple Wikipedia. The 67k potential sentence pairs in the dataset were manually labelled as either good simplifications (277 pairs), good with a partial overlap (281 pairs), partial or non-valid. We perform three comparisons using this dataset: evaluating document and sentence alignment separately, as well as jointly.

For sentence alignment, the task is to retrieve the good sentence pairs out of the possible sentence pairs in total, while minimizing the number of false positives. To evaluate document alignment, we add randomly sampled articles from Wikipedia and the Simple Wikipedia as noise, resulting in article pairs in total. The goal of document alignment is to identify the original document pairs out of possible document combinations.

This set-up additionally enables us to jointly evaluate document and sentence alignment, which best resembles the target effort of retrieving good sentence pairs from noisy documents. The two aims of the joint alignment task are to identify the good sentence pairs from within either document pairs or sentence pairs, without relying on any document-level information whatsoever.

4.1 Results

Our results are summarized in Tables 1 and 2. For all experiments, we set and report the maximum F1 score () obtained from varying the document threshold and the sentence threshold . We also report the percentage of true positive () document or sentence pairs that were retrieved when the F1 score was at its maximum, as well as the average speed of each approach.
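The threshold sweep behind the maximum F1 score can be illustrated with a short sketch. The scored pairs, gold labels and candidate thresholds below are invented for the example, and `best_f1` is a hypothetical helper, not part of any released code.

```python
def best_f1(scored_pairs, gold, thresholds):
    """Return (max F1, threshold) over the candidate thresholds.

    scored_pairs maps a (source, target) pair to its similarity;
    gold is the set of pairs labelled as correct.
    """
    best = (0.0, None)
    for th in thresholds:
        predicted = {pair for pair, score in scored_pairs.items() if score >= th}
        tp = len(predicted & gold)
        if tp == 0:
            continue  # F1 is zero when nothing correct is retrieved
        precision = tp / len(predicted)
        recall = tp / len(gold)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[0]:
            best = (f1, th)
    return best

# Toy data: one gold pair ("s5", "t5") is never proposed by the aligner.
scores = {("s1", "t1"): 0.9, ("s2", "t2"): 0.7, ("s3", "t9"): 0.6, ("s4", "t4"): 0.3}
gold = {("s1", "t1"), ("s2", "t2"), ("s5", "t5")}
max_f1, threshold = best_f1(scores, gold, thresholds=[0.2, 0.5, 0.65, 0.8])
print(round(max_f1, 2), threshold)
```

In this toy case the best trade-off is reached at the threshold 0.65, where both retrieved pairs are correct and two of the three gold pairs are found.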

On document alignment (Table 1, left), the Sent2Vec approach achieved the best score, outperforming the other embedding methods, as well as the word-based similarity measures. On sentence alignment (Table 1, right), the WMD achieves the best performance, matching the result from Kajiwara and Komachi (2016). When evaluating document and sentence alignment jointly (Table 2), we compare our hierarchical approach (LHA) to global alignment applied directly on the sentence level (Global). Global computes the similarities between all sentence pairs in the entire evaluation dataset. LHA significantly outperforms Global, successfully retrieving around three times more valid sentence pairs (31-33% vs. 12% true positives), while remaining fast to compute. This result demonstrates that document alignment is beneficial, successfully filtering some of the noise, while also reducing the overall number of sentence similarities to be computed.

The Sent2Vec approach to LHA achieves good performance on both document and sentence alignment, while being the fastest. We therefore use it as the default approach for the following experiments on style transfer.

5 Style transfer experiments

To test the suitability of pseudo-parallel data extracted with LHA, we perform empirical experiments on two author style transfer tasks: text simplification from the normal Wikipedia to the Simple Wikipedia and style transfer from scientific papers to press releases. We choose text simplification because some parallel data are already available for this task, allowing us to experiment with mixing parallel and pseudo-parallel datasets. The second style transfer task is novel: there is currently no parallel data available for it.

We compare the performance of neural machine translation (NMT) systems trained under three different scenarios: 1) using existing parallel data for training; 2) using a mixture of parallel and pseudo-parallel data extracted with LHA; and 3) using pseudo-parallel data on its own.

5.1 Experimental setup

NMT model

For all experiments, we use a single-layer LSTM encoder-decoder model Cho et al. (2015) with an attention mechanism Bahdanau et al. (2014) (we use the fairseq library). We train the model on the subword level Sennrich et al. (2015), capping the vocabulary size to 50k. We re-learn the subword rules separately for each dataset, and train all models until convergence using the Adam optimizer Kingma and Ba (2014), 20 epochs on average. After training, we use beam search with a beam of 5 to generate all final outputs.

Evaluation metrics

| Dataset | Type | Documents | Tokens | Sentences | Tok. per sent. | Sent. per doc. |
|---|---|---|---|---|---|---|
| Wikipedia | Articles | 5.5M | 2.2B | 92M | 25 ± 16 | 17 ± 32 |
| Simple Wikipedia | Articles | 134K | 62M | 2.9M | 27 ± 68 | 22 ± 34 |
| Gigaword | News | 8.6M | 2.5B | 91M | 28 ± 12 | 11 ± 7 |
| PubMed | Scientific papers | 1.5M | 5.5B | 230M | 24 ± 14 | 180 ± 98 |
| MEDLINE | Scientific abstracts | 16.8M | 2.6B | 118M | 26 ± 13 | 7 ± 4 |
| EurekAlert | Press releases | 358K | 206M | 8M | 29 ± 15 | 23 ± 13 |

Table 3: Datasets used to extract pseudo-parallel monolingual sentence pairs in our experiments.

We report a diverse range of automatic metrics and statistics. SARI Xu et al. (2016) is a recently proposed metric for text simplification which correlates well with simplicity in the output. SARI takes into account the total number of changes (additions, deletions) of the input when scoring the model outputs. BLEU Papineni et al. (2002) is a precision-based metric for machine translation commonly used in research on text simplification Xu et al. (2016); Štajner and Nisioi (2018) and style transfer Shen et al. (2017). Recent work has indicated that BLEU is not suitable for assessing simplicity Sulem et al. (2018); it correlates better with meaning preservation and grammaticality, in particular when using multiple references. We also report the average Levenshtein distance (LD) from the model outputs to the input or to the target reference. On simplification tasks, LD correlates well with meaning preservation and grammaticality Sulem et al. (2018), complementing the BLEU metric.
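For readers unfamiliar with the LD statistic, the following sketch computes a Levenshtein distance and a length-normalised variant. Normalising by the longer sequence is an assumption made here (the reported LD values lie between 0 and 1, but the exact normalisation is not stated in the text), and the function works on either characters or token lists.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # delete x
                            curr[j - 1] + 1,           # insert y
                            prev[j - 1] + (x != y)))   # substitute, or match at no cost
        prev = curr
    return prev[-1]

def normalized_ld(a, b):
    """Edit distance divided by the length of the longer sequence (assumed normalisation)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

print(levenshtein("kitten", "sitting"))  # 3
print(normalized_ld("the cat sat".split(), "the dog sat".split()))
```

Under this normalisation, copying the input unchanged gives a distance of 0 to the input, which matches the zero entries reported for the input rows in the result tables.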

Extracting pseudo-parallel data

We use LHA with Sent2Vec (Section 3) to extract pseudo-parallel sentence pairs for the two tasks (statistics of the datasets we use for alignment are in Table 3). To ensure some degree of lexical similarity, we exclude pairs whose string overlap (defined in Section 3.3) is below , and pairs in which the target sentence is more than times longer than the source sentence. We use in all of our alignment experiments, with the majority of the examples extracted containing a single sentence, and - of the source examples and - of the target examples containing multiple sentences (see Table 4 for additional statistics). Most multi-sentence examples contain two sentences, while - contain to sentences. Examples of pairs that were extracted are in Table 7 in the supplementary material.
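The two post-filters described above (minimum string overlap and maximum target-to-source length ratio) might look like the sketch below. Everything specific in it is an assumption: the overlap is defined here as shared content tokens divided by the size of the smaller token set, the stopword list is a tiny stand-in, and `min_overlap` and `max_len_ratio` are placeholder values rather than the paper's settings.

```python
import string

STOPWORDS = {"the", "a", "an", "of", "in", "is", "and", "to"}  # tiny illustrative list

def content_tokens(sentence):
    """Lowercased tokens without punctuation, numbers or stopwords."""
    tokens = (tok.strip(string.punctuation).lower() for tok in sentence.split())
    return {t for t in tokens if t and not t.isdigit() and t not in STOPWORDS}

def overlap(src, tgt):
    """Shared content tokens divided by the size of the smaller token set (assumed definition)."""
    s, t = content_tokens(src), content_tokens(tgt)
    if not s or not t:
        return 0.0
    return len(s & t) / min(len(s), len(t))

def keep_pair(src, tgt, min_overlap=0.4, max_len_ratio=2.0):
    """Keep a pair with enough lexical overlap whose target is not much longer than its source."""
    len_ratio = len(tgt.split()) / max(len(src.split()), 1)
    return overlap(src, tgt) >= min_overlap and len_ratio <= max_len_ratio

print(keep_pair("The cell divides in 2 stages.", "A cell divides in two stages."))
print(keep_pair("The cell divides.", "Stock markets fell sharply today."))
```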

5.2 Text simplification from Wikipedia to the Simple Wikipedia

Parallel data

As a parallel baseline dataset, we use an existing dataset from Hwang et al. (2015). The dataset consists of 282K sentence pairs obtained after aligning the parallel articles from Wikipedia and the Simple Wikipedia. This dataset allows us to compare our results to previous work on data-driven text simplification. We use two versions of the dataset in our experiments: full contains all 282K pairs, while partial contains 71K pairs, or 25% of the full dataset.

Evaluation data

We evaluate our simplification models on the testing dataset from Xu et al. (2016), which consists of 358 sentence pairs from the normal and simple Wikipedia. In addition to the ground truth Simple Wikipedia simplifications, each input sentence comes with 8 additional references, manually simplified by Amazon Mechanical Turk workers. We compute BLEU and SARI on the 8 manual references.

Pseudo-parallel data

We align two dataset pairs, obtaining pseudo-parallel sentence pairs for text simplification. First, we align the normal Wikipedia to the Simple Wikipedia (see Tables 3 and 4), using and , producing two datasets: wiki-simp-72 and wiki-simp-65. Because LHA has no access to document-level information in this dataset, alignment leads to new sentence pairs, some of which may be distinct from the pairs present in the existing parallel dataset. We monitor for and exclude pairs that overlap with the testing dataset. Second, we align Wikipedia to the Gigaword news article corpus Napoles et al. (2012), using and , resulting in two additional pseudo-parallel datasets: wiki-news-74 and wiki-news-70. With these datasets, we investigate whether pseudo-parallel data extracted from a different domain can be beneficial for text simplification.

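| Dataset | Pairs | Mean src tokens | Mean tgt tokens | Src multi-sentence | Tgt multi-sentence |
|---|---|---|---|---|---|
| wiki-simp-72 | 25K | 26.72 | 22.83 | 16% | 11% |
| wiki-simp-65 | 80K | 23.37 | 15.41 | 17% | 7% |
| wiki-news-74 | 133K | 25.66 | 17.25 | 19% | 2% |
| wiki-news-70 | 216K | 26.62 | 16.29 | 19% | 2% |
| paper-press-74 | 80K | 28.01 | 16.06 | 22% | 1% |
| paper-wiki-78 | 286K | 25.84 | 24.69 | 20% | 11% |

Table 4: Statistics of the pseudo-parallel datasets extracted with LHA. The token columns give the mean src/tgt token counts, while the multi-sentence columns report the percentage of src/tgt items that contain more than one sentence.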

| Method or Dataset | Total pairs (% pseudo) | H1 SARI | H1 BLEU | H1 Tok. | H1 LD src | H1 LD tgt | H2 SARI | H2 BLEU | H2 Tok. | H2 LD src | H2 LD tgt |
|---|---|---|---|---|---|---|---|---|---|---|---|
| input | - | 26 | 99.37 | 22.7 | 0 | 0.26 | - | - | - | - | - |
| reference | - | 38.1 | 70.21 | 22.3 | 0.26 | 0 | - | - | - | - | - |
| Previous studies | | | | | | | | | | | |
| SBMT | 282K (0%) | 37.91 | 72.36 | - | - | - | - | - | - | - | - |
| NTS | 282K (0%) | 30.54 | 84.69 | - | - | - | 35.78 | 77.57 | - | - | - |
| Parallel + Pseudo-parallel or Randomly sampled data (using full parallel dataset, 282K parallel pairs) | | | | | | | | | | | |
| our-baseline-full | 282K (0%) | 30.72 | 85.71 | 18.3 | 0.18 | 0.37 | 36.16 | 82.64 | 19 | 0.19 | 0.36 |
| wiki-simp-72 | 307K (8%) | 30.2 | 87.12 | 19.43 | 0.14 | 0.34 | 36.02 | 81.13 | 19.03 | 0.19 | 0.36 |
| wiki-simp-65 | 362K (22%) | 30.92 | 89.64 | 19.8 | 0.13 | 0.33 | 36.48 | 83.56 | 19.37 | 0.18 | 0.35 |
| wiki-news-74 | 414K (32%) | 30.84 | 89.59 | 19.67 | 0.13 | 0.33 | 36.57 | 83.85 | 19.13 | 0.18 | 0.35 |
| wiki-news-70 | 498K (43%) | 30.82 | 89.62 | 19.6 | 0.13 | 0.33 | 36.45 | 83.11 | 18.98 | 0.19 | 0.36 |
| rand-sm | 382K (26%) | 30.52 | 88.46 | 19.7 | 0.14 | 0.34 | 36.96 | 82.86 | 19 | 0.2 | 0.36 |
| rand-med | 482K (41%) | 29.47 | 80.65 | 19.3 | 0.18 | 0.36 | 34.36 | 74.67 | 18.93 | 0.23 | 0.38 |
| rand-big | 582K (52%) | 28.68 | 75.61 | 19.57 | 0.23 | 0.4 | 32.34 | 68.9 | 18.35 | 0.3 | 0.43 |
| Parallel + Pseudo-parallel data (using partial parallel dataset, 71K parallel pairs) | | | | | | | | | | | |
| our-baseline-partial | 71K (0%) | 31.16 | 69.53 | 17.45 | 0.29 | 0.44 | 32.92 | 67.29 | 19.14 | 0.3 | 0.44 |
| wiki-simp-65 | 150K (52%) | 31.0 | 81.52 | 18.26 | 0.21 | 0.38 | 35.12 | 77.38 | 18.16 | 0.25 | 0.39 |
| wiki-news-70 | 286K (75%) | 31.01 | 80.03 | 17.82 | 0.23 | 0.4 | 34.14 | 76.44 | 17.31 | 0.28 | 0.43 |
| Pseudo-parallel data only | | | | | | | | | | | |
| wiki-simp-all | 104K (100%) | 29.93 | 60.81 | 18.05 | 0.36 | 0.47 | 30.13 | 57.46 | 18.53 | 0.39 | 0.49 |
| wiki-news-all | 348K (100%) | 22.06 | 28.51 | 13.68 | 0.6 | 0.63 | 23.08 | 29.62 | 14.01 | 0.6 | 0.64 |
| pseudo-all | 452K (100%) | 30.24 | 71.32 | 17.82 | 0.3 | 0.43 | 31.41 | 65.65 | 17.65 | 0.33 | 0.45 |

These outputs are not generated using Beam Search.

Table 5: Empirical results on text simplification from Wikipedia to the Simple Wikipedia. H1 and H2 are beam hypotheses 1 and 2; Tok. is the mean output length in tokens; LD src and LD tgt are the Levenshtein distances to the input and to the reference. The highest SARI/BLEU results from each category are in bold.

Randomly sampled pairs

We also experiment with adding random sentence pairs to the parallel dataset (rand-sm, rand-med and rand-big datasets). The random pairs are sampled pairwise from the Wikipedia and the Simple Wikipedia. With the random data, we aim to investigate how model performance changes as we add an increasing number of sentence pairs that are non-parallel but are still representative of the two dataset styles.
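The random-pair set-up can be sketched as follows. The helper names and the toy sentences are hypothetical; the added amounts of 100K, 200K and 300K pairs are inferred from the totals in Table 5 (382K, 482K and 582K minus the 282K parallel pairs) rather than stated in the text.

```python
import random

def add_random_pairs(parallel, src_pool, tgt_pool, n_random, seed=0):
    """Append n_random (source, target) pairs drawn independently from the two pools."""
    rng = random.Random(seed)
    noise = [(rng.choice(src_pool), rng.choice(tgt_pool)) for _ in range(n_random)]
    return parallel + noise

def noise_share(n_parallel, n_added):
    """Fraction of the mixed dataset made up of added pairs."""
    return n_added / (n_parallel + n_added)

# Toy pools standing in for Wikipedia and Simple Wikipedia sentences.
mixed = add_random_pairs([("src a", "tgt a")], ["w1", "w2"], ["s1", "s2"], n_random=3)
print(len(mixed))  # 4

for added in (100_000, 200_000, 300_000):
    print(added, round(100 * noise_share(282_000, added)))
```

The computed shares (26%, 41% and 52%) agree with the percentages listed for rand-sm, rand-med and rand-big in Table 5.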

5.3 Style transfer from scientific papers to press releases

We also test LHA on the novel task of style transfer from scientific journal articles to press releases. This task is well-suited for our system, as there are many open access scientific repositories with millions of articles freely available.

Evaluation dataset

We download press releases from the EurekAlert aggregator and identify the digital object identifier (DOI) in the text of each press release using regular expressions. Given the DOIs, we query the full text of a paper using the Elsevier ScienceDirect API. Using this approach, we were able to compose 26k parallel pairs of papers and their press releases. We then applied our sentence aligner with Sent2Vec (Section 3.2) to extract 11k parallel sentence pairs, which we used in the evaluation.
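A DOI lookup of the kind described above could be sketched with a regular expression like the one below. The pattern is a commonly used approximation for modern DOIs, not the expression the authors used, and the DOI in the example text is made up.

```python
import re

# Approximate pattern for modern DOIs: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[-._;()/:A-Za-z0-9]+")

def find_dois(text):
    """Return the DOIs found in text, in order of appearance and without duplicates."""
    found = []
    for match in DOI_PATTERN.findall(text):
        doi = match.rstrip(".,;)")  # trim sentence punctuation captured by the suffix class
        if doi not in found:
            found.append(doi)
    return found

release = ("The study (doi: 10.1016/j.example.2018.01.001) appears this week. "
           "Reference: 10.1016/j.example.2018.01.001.")
print(find_dois(release))
```

Trimming trailing punctuation is a heuristic: it would also shorten the rare DOI that legitimately ends in one of those characters.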

Pseudo-parallel data

We used LHA with Sent2Vec to create pseudo-parallel pairs for training. EurekAlert contains additional press releases, for which we were unable to retrieve the full text of a paper. We aligned EurekAlert to two large repositories of scientific papers: PubMed, which contains the full text of open access papers, and MEDLINE, which contains scientific abstracts (see Table 3). After alignment with manually selected document and sentence thresholds, we extracted the pseudo-parallel pairs of the paper-press-74 dataset (80K pairs; see Table 4). We additionally aligned PubMed and MEDLINE to all Wikipedia articles, obtaining out-of-domain pairs for this task (paper-wiki-78 dataset, 286K pairs).

6 Results

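| Method | Total pairs | SARI | BLEU | Cl | Tok. | LD src | LD tgt |
|---|---|---|---|---|---|---|---|
| input | - | 19.95 | 45.88 | 43.71% | 31.54 | 0 | 0.39 |
| reference | - | - | - | 78.72% | 26.0 | 0.39 | 0 |
| Pseudo-parallel data only | | | | | | | |
| paper-press-74 | 80K | 32.83 | 24.98 | 82.01% | 19.43 | 0.49 | 0.56 |
| paper-wiki-78 | 286K | 33.67 | 28.87 | 69.1% | 27.06 | 0.43 | 0.53 |
| combined | 366K | 35.27 | 34.98 | 70.18% | 22.98 | 0.37 | 0.49 |

Table 6: Empirical results on style transfer from scientific papers to press releases. Cl is the style classifier prediction; Tok. is the mean output length in tokens; LD src and LD tgt are the Levenshtein distances to the input and to the reference.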

6.1 Text simplification

The simplification results in Table 5 are organized in several sections, according to the type of dataset used for training. We report the results of the top two beam search hypotheses produced by our models, considering that the second hypothesis often generates simpler outputs Štajner and Nisioi (2018).

In Table 5, input is copying the Wikipedia input sentences, without making any changes. reference reports the score of the original Wikipedia references with respect to the other 8 references available for this dataset. SBMT is a statistical machine translation model from Xu et al. (2016), which optimizes the SARI metric directly. NTS is the previously best reported result on text simplification using neural sequence models Štajner and Nisioi (2018). our-baseline is our parallel LSTM baseline.

The models trained on a mixture of parallel and pseudo-parallel data generate longer outputs on average, and their output is more similar to the input, as well as to the original Simple Wikipedia reference, in terms of the LD. Adding pseudo-parallel data consistently yields BLEU improvements on both Beam hypotheses: over the NTS system, as well as over our baselines trained solely on parallel data. In terms of SARI, the scores remain either similar or slightly better than the baselines, indicating that simplicity in the output is preserved. The second Beam hypothesis obtains higher SARI scores than the first one, in agreement with Štajner and Nisioi (2018). Interestingly, adding out-of-domain pseudo-parallel news data (wiki-news-* datasets) results in an increase in BLEU despite the change in style of the target sequence.

Larger pseudo-parallel datasets can lead to bigger improvements; however, noisy data can result in a decrease in performance, motivating careful data selection. In our parallel and random set-up, we find that an increasing number of random pairs added to the parallel data progressively degrades model performance. However, those models still manage to perform surprisingly well, even when over half of the pairs in the dataset have nothing in common. Thus, neural machine translation can successfully learn target transformations despite substantial data corruption, demonstrating robustness to noisy or non-parallel data for certain tasks.

When training solely on pseudo-parallel data, we observe lower performance on average in comparison to the parallel models. However, the results are encouraging, demonstrating the potential of our approach in tasks for which there is no parallel data available. As expected, the out-of-domain news data is less suitable for simplification than the in-domain data, because of the change in output style of the former. Results are best when mixing all pseudo-parallel pairs into a single dataset (pseudo-all). Having access to a small amount of in-domain pseudo-parallel data, in addition to out-of-domain pairs, seems to be very beneficial to the success of this approach. In Table 8 in the supplementary material, we report example model outputs.

6.2 Style transfer from papers to press releases

Our results on the press release task are summarized in Table 6, where input is the score obtained from copying the input sentences, and reference is the score of the original press release references. In addition to the previously described automatic measures, we also report the classification prediction of a Convolutional Neural Network (CNN) sentence classifier Kim (2014) (Cl) trained to distinguish between the two target styles in this task: papers vs. press releases. To obtain training data for the classifier, we randomly sample sentences from each of the PubMed and EurekAlert datasets, and we measure the classification accuracy of the CNN model on a held-out set. With the classifier, we aim to investigate to what extent the overall style of the press releases is captured by the model outputs.

All of our models outperform the input in terms of SARI and produce outputs that are closer to the target style, according to the classifier. The in-domain paper-press dataset performs worse than the out-of-domain paper-wiki dataset, most likely because of the larger size of the latter. paper-wiki generates longer outputs on average than reference, which may be a source of lower classification score. The best performance is achieved by the combined dataset, which also produces outputs that are the closest to input and reference, in terms of LD. Example outputs for this task are reported in Table 9 in the supplementary material.

7 Conclusion

We developed a hierarchical method for extracting pseudo-parallel sentence pairs from two monolingual comparable corpora composed of different writing styles. We evaluated the performance of our method on automatic alignment benchmarks and on two author style transfer tasks. We find improvements arising from adding pseudo-parallel sentence pairs to existing parallel datasets, as well as promising results when using the pseudo-parallel data on its own.

Our results demonstrate that careful engineering of pseudo-parallel datasets can be a successful approach for improving performance on existing monolingual text-to-text rewriting tasks, as well as for tackling novel tasks. The pseudo-parallel data could also be a useful resource for dataset inspection and analysis. Future work could focus on improving our system, for example through refined approaches to sentence pairing.


References

  • Abrahamsson et al. (2014) Emil Abrahamsson, Timothy Forni, Maria Skeppstedt, and Maria Kvist. 2014. Medical text simplification using synonym replacement: Adapting assessment of word difficulty to a compounding language. In Proceedings of the 3rd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR), pages 57–65.
  • Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
  • Barzilay and Elhadad (2003) Regina Barzilay and Noemie Elhadad. 2003. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 conference on Empirical methods in natural language processing, pages 25–32. Association for Computational Linguistics.
  • Brown et al. (1991) Peter F Brown, Jennifer C Lai, and Robert L Mercer. 1991. Aligning sentences in parallel corpora. In Proceedings of the 29th annual meeting on Association for Computational Linguistics, pages 169–176. Association for Computational Linguistics.
  • Cho et al. (2015) Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. 2015. Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17(11):1875–1886.
  • Conneau et al. (2017) Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. 2017. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.
  • Coster and Kauchak (2011) William Coster and David Kauchak. 2011. Simple English Wikipedia: a new text simplification task. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers-Volume 2, pages 665–669. Association for Computational Linguistics.
  • Fu et al. (2017) Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2017. Style transfer in text: Exploration and evaluation. arXiv preprint arXiv:1711.06861.
  • Grégoire and Langlais (2018) Francis Grégoire and Philippe Langlais. 2018. Extracting parallel sentences with bidirectional recurrent neural networks to improve machine translation. arXiv preprint arXiv:1806.05559.
  • Hwang et al. (2015) William Hwang, Hannaneh Hajishirzi, Mari Ostendorf, and Wei Wu. 2015. Aligning sentences from standard Wikipedia to Simple Wikipedia. In HLT-NAACL, pages 211–217.
  • Inui et al. (2003) Kentaro Inui, Atsushi Fujita, Tetsuro Takahashi, Ryu Iida, and Tomoya Iwakura. 2003. Text simplification for reading assistance: a project note. In Proceedings of the second international workshop on Paraphrasing-Volume 16, pages 9–16. Association for Computational Linguistics.
  • Kajiwara and Komachi (2016) Tomoyuki Kajiwara and Mamoru Komachi. 2016. Building a monolingual parallel corpus for text simplification using sentence similarity based on alignment between word embeddings. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1147–1158.
  • Kandula et al. (2010) Sasikiran Kandula, Dorothy Curtis, and Qing Zeng-Treitler. 2010. A semantic and syntactic text simplification tool for health content. In AMIA annual symposium proceedings, volume 2010, page 366. American Medical Informatics Association.
  • Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
  • Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kusner et al. (2015) Matt J Kusner, Yu Sun, Nicholas I Kolkin, Kilian Q Weinberger, et al. 2015. From word embeddings to document distances. In ICML, volume 15, pages 957–966.
  • Liu et al. (2016) Gus Liu, Pol Rosello, and Ellen Sebastian. 2016. Style transfer with non-parallel corpora.
  • Marie and Fujita (2017) Benjamin Marie and Atsushi Fujita. 2017. Efficient extraction of pseudo-parallel sentences from raw monolingual data using word embeddings. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 392–398.
  • Moore (2002) Robert C Moore. 2002. Fast and accurate sentence alignment of bilingual corpora. In Conference of the Association for Machine Translation in the Americas, pages 135–144. Springer.
  • Munteanu and Marcu (2005) Dragos Stefan Munteanu and Daniel Marcu. 2005. Improving machine translation performance by exploiting non-parallel corpora. Computational Linguistics, 31(4):477–504.
  • Napoles et al. (2012) Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 95–100. Association for Computational Linguistics.
  • Nisioi et al. (2017) Sergiu Nisioi, Sanja Štajner, Simone Paolo Ponzetto, and Liviu P Dinu. 2017. Exploring neural text simplification models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 85–91.
  • Pagliardini et al. (2017) Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2017. Unsupervised learning of sentence embeddings using compositional n-gram features. arXiv preprint arXiv:1703.02507.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311–318. Association for Computational Linguistics.
  • Prabhumoye et al. (2018) Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W Black. 2018. Style transfer through back-translation. arXiv preprint arXiv:1804.09000.
  • Robertson et al. (2009) Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389.
  • Sennrich et al. (2015) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
  • Shen et al. (2017) Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems, pages 6833–6844.
  • Siddharthan (2002) Advaith Siddharthan. 2002. An architecture for a text simplification system. In Language Engineering Conference, 2002. Proceedings, pages 64–71. IEEE.
  • Štajner and Nisioi (2018) Sanja Štajner and Sergiu Nisioi. 2018. A Detailed Evaluation of Neural Sequence-to-Sequence Models for In-domain and Cross-domain Text Simplification.
  • Sulem et al. (2018) Elior Sulem, Omri Abend, and Ari Rappoport. 2018. BLEU is not suitable for the evaluation of text simplification. In EMNLP.
  • Uszkoreit et al. (2010) Jakob Uszkoreit, Jay M Ponte, Ashok C Popat, and Moshe Dubiner. 2010. Large scale parallel document mining for machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 1101–1109. Association for Computational Linguistics.
  • Štajner et al. (2018) Sanja Štajner, Marc Franco-Salvador, Paolo Rosso, and Simone Paolo Ponzetto. 2018. CATS: A Tool for Customized Alignment of Text Simplification Corpora. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
  • Xu et al. (2015) Wei Xu, Chris Callison-Burch, and Courtney Napoles. 2015. Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics, 3:283–297.
  • Xu et al. (2016) Wei Xu, Courtney Napoles, Ellie Pavlick, Quanze Chen, and Chris Callison-Burch. 2016. Optimizing statistical machine translation for text simplification. Transactions of the Association for Computational Linguistics, 4:401–415.

Appendix A Supplementary material

Dataset | Source | Target
wiki-simp-65 | The dish is often served shredded with a dressing of oil, soy sauce, vinegar and sugar, or as a salad with vegetables. | They are often eaten in a kind of salad, with soy sauce or vinegar.
wiki-simp-65 | The standard does not define bit rates for transmission, except that it says it is intended for bit rates lower than 20,000 bits per second. | These speeds are raw bit rates (in Million bits per second).
wiki-simp-65 | In Etruscan mythology, Laran was the god of war. However, among his attributes is his responsibility to maintain peace. Laran’s consort was Turan, goddess of love and fertility, who was equated with the Latin Venus. Laran was the Etruscan equivalent of the Greek Ares and the Roman Mars. | ”’Laran”’ is the god of war and bloodlust in Etruscan mythology. He was the lover of Turan, goddess of love, beauty and fertility; Laran himself was the Etruscan equivalent of the Greek god Ares (Mars in Roman mythology).
wiki-simp-65 | However, Jimmy Wales, Wikipedia’s co-founder, denied that this was a crisis or that Wikipedia was running out of admins, saying, ”The number of admins has been stable for about two years, there’s really nothing going on.” | But the co-founder Wikipedia, Jimmy Wales, did not believe that this was a crisis. He also did not believe Wikipedia was running out of admins.
wiki-news-74 | Prior to World War II, Japan’s industrialized economy was dominated by four major zaibatsu: Mitsubishi, Sumitomo, Yasuda and Mitsui. | Until Japan ’s defeat in World War II , the economy was dominated by four conglomerates , known as “ zaibatsu ” in Japanese . These were the Mitsui , Mitsubishi , Sumitomo and Yasuda groups .
wiki-news-74 | Thailand’s coup leaders Thursday banned political parties from holding meetings or from conducting any other activities, according to a statement read on national television. ”Political gatherings of more than five people have already been banned, but political activities can resume when normalcy is restored,” the statement said. | “ In order to maintain law and order , meetings of political parties and conducting of other political activities are banned , ” the council said in its televised statement . “ Political activities can resume when normalcy is restored , ” it said .
paper-press-74 | Such studies have suggested that some pterosaurs may have fed like modern-day skimmers, a rarified group of shorebirds, belonging to the genera Rynchops, that fly along the surface of still bodies of water scooping up small fish and crustaceans with their submerged lower jaw. | Previous studies have suggested that some pterosaurs may have fed like modern-day ’skimmers’, a rare group of shorebirds, belonging to the Rynchops group. These sea-birds fly along the surface of lakes and estuaries scooping up small fish and crustaceans with their submerged lower jaw.
paper-press-74 | Obsessions are defined as recurrent, persistent, and intrusive thoughts, images, or urges that cause marked anxiety, and compulsions are defined as repetitive behaviors or mental acts that the patient feels compelled to perform to reduce the obsession-related anxiety [26]. | Obsessions are recurrent and persistent thoughts, impulses, or images that are unwanted and cause marked anxiety or distress.
paper-wiki-78 | Then height and weight were used to calculate body mass index (kg/m2) as a general measure of weight status. | Body mass index is a mathematical combination of height and weight that is an indicator of nutritional status.
paper-wiki-78 | Men were far more likely to be overweight (21.7%, 21.3-22.1%) or obese (22.4%, 22.0-22.9%) than women (9.5% and 9.9%, respectively), while women were more likely to be underweight (21.3%, 20.9-21.7%) than men (5.9%, 5.6-6.1%). | A 2008 report stated that 28.6% of men and 20.6% of women in Japan were considered to be obese. Men were more likely to be overweight (67.7%) and obese (25.5%) than women (30.9% and 23.4% respectively).
Table 7: Example pseudo-parallel pairs extracted by LHA.
Method | Example
input | jeddah is the principal gateway to mecca , islam ' s holiest city , which able-bodied muslims are required to visit at least once in their lifetime .
reference | jeddah is the main gateway to mecca , the holiest city of islam , where able-bodied muslims must go to at least once in a lifetime .
our-baseline-full-H1 | it is the highest gateway to mecca , islam .
our-baseline-full-H2 | it is the main gateway to mecca .
wiki-simp-65-H1 | jeddah is the main gateway to mecca , islam ’s holiest city .
wiki-simp-65-H2 | it is the main gateway to mecca , islam ’s holiest city .
wiki-news-74-H1 | it is the main gateway to mecca , islam ' s holiest city .
wiki-news-74-H2 | the main gateway to mecca is islam ' s holiest city .
pseudo-all-H1 | islam is the main gateway to mecca , islam ’s holiest city .
pseudo-all-H2 | islam is the holiest city in the world .
input | this closely resembles the unix philosophy of having multiple programs each doing one thing well and working together over universal interfaces .
reference | this is very similar to the unix philosophy of having multiple programs doing one thing well and working together over universal interfaces .
our-baseline-full-H1 | this closely resembles the unix philosophy of having multiple programs each doing one thing well and working together over universal interfaces .
our-baseline-full-H2 | this closely resembles the unix method of having multiple programs each doing one thing well and working together over universal interfaces .
wiki-simp-65-H1 | this closely resembles the unix philosophy of having multiple programs each doing one thing well and working together over universal interfaces .
wiki-simp-65-H2 | this closely resembles the unix philosophy of having multiple programs each doing one thing well .
wiki-news-74-H1 | this closely resembles the unix philosophy of having multiple programs each doing one thing well and working together over universal interfaces .
wiki-news-74-H2 | this closely resembles the unix philosophy of having multiple programs each doing one thing well .
pseudo-all-H1 | this is like the unix philosophy of having multiple programs each doing one thing well and working together over universal interfaces .
pseudo-all-H2 | this is like the unix philosophy of having multiple programs each doing one thing well .
Table 8: Examples for text simplification from Wikipedia to Simple Wikipedia.
Method | Example
input | here we report evidence for this hypothesis and show that age-related declines in rppc grey matter volume better account for age-related changes in risk preferences than does age per se .
reference | ” this means that gray matter volume accounts for age-related changes in risk attitude more than age itself ” .
paper-press-74 | the authors suggest that age-related declines in rfc grey matter volume better account for age-related changes in risk preferences than does age per se .
paper-wiki-78 | ” we report evidence for this hypothesis and show that age-related declines in rypgrey matter volume better account for age-related changes in risk preferences than does age per se ” .
combined | ” we report evidence that age-related declines in the brain volume better account for age-related changes in risk preferences than does age per se ” .
input | it is one of the most common brain disorders worldwide , affecting approximately 15 % 20 % of the adult population in developed countries ( global burden of disease study 2013 collaborators , 2015 ) .
reference | migraine is one of the most common brain disorders worldwide , affecting approximately 15-20 % of the adults in developed countries .
paper-press-74 | it is one of the most common brain disorders in developed countries .
paper-wiki-78 | it is one of the most common neurological disorders in the united states , affecting approximately 10 % of the adult population .
combined | it is one of the most common brain disorders worldwide , affecting approximately 15 % of the adult population in developed countries .
Table 9: Examples for style transfer from papers to press releases.