Early Text Simplification (TS) approaches relied heavily on manually crafted simplification rules [Chandrasekar et al.1996, Siddharthan2014] due to a lack of resources from which to automatically learn or extract them. The arrival of complex-to-simple parallel corpora effectively changed the nature of TS contributions.
The most widely used resource of this kind is the Wikipedia and Simple Wikipedia parallel corpus [Kauchak and Barzilay2006]. Using this corpus, authors have devised Syntactic Simplification approaches that learn simplification rules from aligned sentences [Siddharthan2006, Woodsend and Lapata2011], Machine Translation approaches that translate from complex to simple text [Zhu et al.2010, Bach et al.2011], Lexical Simplifiers that extract complex-to-simple word correspondences from word alignments [Biran et al.2011, Horn et al.2014], and even hybrid lexico-syntactic simplifiers [Paetzold and Specia2013, Feblowitz and Kauchak2013]. Other examples of parallel corpora that have been used in TS are the medical-domain corpus of [Deléger and Zweigenbaum2009], and the newly introduced Newsela corpus [Xu et al.2015].
Most of these corpora come, however, as a set of complex-to-simple articles aligned at document level. In order for them to be used by Machine Translation systems, Lexical Simplification strategies and Tree Transduction frameworks, for example, these documents need to be aligned at sentence level.
Various strategies have been created to address this problem. [Barzilay and Elhadad2003] present an approach that first aligns paragraphs with sophisticated clusterers and classifiers learned from manually annotated instances, then produces sentence alignments through a dynamic programming algorithm that finds an alignment path between the sentences in aligned paragraphs. [Coster and Kauchak2011] use the same sentence alignment approach, but opt instead for a much simpler paragraph alignment algorithm: they simply align any pair of paragraphs whose TF-IDF similarity is higher than a manually set threshold of 0.5. Similarly, [Bott and Saggion2011] make use of an elaborate, difficult-to-replicate supervised approach in order to learn the likelihood of two portions of text being aligned. [Smith et al.2010] bypass the step of paragraph alignment and instead use a supervised Conditional Random Fields ranker for sentence alignment directly.
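The TF-IDF thresholding strategy of [Coster and Kauchak2011] amounts to the following minimal sketch, assuming simple whitespace tokenization and a smoothed IDF; all function names are ours:

```python
import math
from collections import Counter

def tfidf_vectors(paragraphs):
    """Build a sparse TF-IDF vector (word -> weight) for each paragraph."""
    counts = [Counter(p.lower().split()) for p in paragraphs]
    n = len(counts)
    df = Counter(w for c in counts for w in c)       # document frequency
    idf = {w: 1.0 + math.log(n / df[w]) for w in df}  # smoothed IDF
    return [{w: tf * idf[w] for w, tf in c.items()} for c in counts]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    num = sum(w * v.get(t, 0.0) for t, w in u.items())
    den = math.sqrt(sum(w * w for w in u.values())) * \
          math.sqrt(sum(w * w for w in v.values()))
    return num / den if den else 0.0
```

A paragraph pair would then be aligned whenever its cosine score reaches the 0.5 threshold.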
Although these algorithms have proven useful, they do not exploit the comparability between the aligned documents’ paragraphs and sentences, hence the need for elaborate supervised classifiers and rankers. The sentence aligner used by both [Barzilay and Elhadad2003] and [Coster and Kauchak2011] suffers from yet another limitation: it does not allow for multiple consecutive sentence skips, which can lead to incorrect alignments in scenarios where the numbers of sentences in a pair of aligned paragraphs differ considerably.
In this paper, we introduce a new set of paragraph and sentence alignment algorithms designed to address these limitations.
2 Clues in Comparable Corpora
By inspecting a handful of documents from the Wikipedia-Simple Wikipedia and the Newsela corpora, we noticed that they are divided into numerous paragraphs, and that the order in which the information is presented is consistent across different versions of the same article. This means that, most of the time, one can safely assume that, if the ith paragraph of one document and the jth paragraph of the other are aligned, then the next alignment (k, l) will be such that k ≥ i and l ≥ j.
Since paragraphs offer valuable alignment clues, we believe that algorithms that bypass paragraph alignment, such as that of [Smith et al.2010], are not suitable for this task. We also noticed that the order of the information within aligned paragraphs is consistent, but that oftentimes aligned paragraphs have a noticeably large disparity in the number of sentences they contain. Consequently, sentence alignment algorithms that do not allow for multiple consecutive skips, such as that of [Barzilay and Elhadad2003], may not be suitable for this task either.
[Xu et al.2015] produced sentence alignments for the Newsela corpus by simply calculating the Jaccard similarity score of each pair of sentences in a set of documents, then selecting any pairs that achieve a similarity score above a manually set threshold. This approach, however, also suffers from several limitations, since it does not exploit the aforementioned observations in any capacity, nor does it allow for N-1 or N-N alignments.
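The Jaccard-based selection of [Xu et al.2015] can be sketched as follows (whitespace tokenization and the threshold value are illustrative assumptions, and the function names are ours):

```python
def jaccard(sent_a, sent_b):
    """Jaccard similarity between the token sets of two sentences."""
    a, b = set(sent_a.lower().split()), set(sent_b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def align_by_jaccard(complex_sents, simple_sents, threshold=0.5):
    """Select every sentence pair scoring above the threshold.
    Each pair is selected independently, so no 1-N or N-N alignments
    can be produced, and paragraph order is ignored."""
    return [(i, j)
            for i, c in enumerate(complex_sents)
            for j, s in enumerate(simple_sents)
            if jaccard(c, s) >= threshold]
```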
We address this problem by introducing new, flexible paragraph and sentence alignment algorithms that better exploit the properties of document-aligned corpora. We describe them in what follows.
3 Paragraph Alignment Algorithm
In order to produce paragraph alignments, we employ a search algorithm that allows 1-1, 1-N, N-1, N-N and null alignments (i.e. unaligned paragraphs between alignments).
Algorithm 1 exploits a similarity matrix M of dimensions n × m, where n represents the number of paragraphs in a document D1, m the number of paragraphs in a document D2, and M[i][j] the similarity between the ith paragraph in D1 and the jth paragraph in D2. As a similarity metric, we use the maximum TF-IDF cosine similarity between all possible pairs of sentences in the two paragraphs, i.e. the similarity between two paragraphs is equal to that of the most similar pair of sentences in them. We choose this metric because, even though the sentences in equivalent paragraphs of documents with different reading levels are usually very distinct in both form and vocabulary, oftentimes paragraphs will share at least one very similar (or even identical) sentence, which is a strong indicator of a good paragraph alignment. These sentences are often similar because of long quotes from subjects interviewed for the news articles.
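This paragraph-level metric can be sketched as below, where cosine similarity over raw term counts stands in for the TF-IDF weighting and all names are ours:

```python
from math import sqrt
from collections import Counter

def sent_sim(sent_a, sent_b):
    """Cosine similarity over raw term counts (a stand-in for TF-IDF)."""
    u, v = Counter(sent_a.lower().split()), Counter(sent_b.lower().split())
    num = sum(c * v[w] for w, c in u.items())
    den = sqrt(sum(c * c for c in u.values())) * \
          sqrt(sum(c * c for c in v.values()))
    return num / den if den else 0.0

def paragraph_similarity(par_a, par_b):
    """Similarity of two paragraphs (lists of sentences): the score of
    their single most similar cross-paragraph sentence pair."""
    return max(sent_sim(a, b) for a in par_a for b in par_b)
```

A shared quote drives the score toward 1.0 even when every other sentence pair is dissimilar.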
The goal of Algorithm 1 is to find a path P in which P[i][j] = 1 if there is an alignment between paragraphs i and j, and P[i][j] = 0 otherwise. The search for path P starts from the assumption that P[1][1] = 1, or in other words, that the first paragraphs of D1 and D2 are aligned. We exploit this assumption because the first paragraph in most document-aligned corpora available refers to the article’s title. The algorithm then initializes a control coordinate c = (i, j) that represents the point from which to search for the next alignment in M. The next alignment is then searched for in the first vicinity V1, which represents 1-1, 1-N and N-1 alignments. If there is no paragraph pair (k, l) in V1 such that M[k][l] ≥ α, where α is a minimum similarity threshold, then the next alignment is searched for in the second vicinity V2, which represents single paragraph skips. If no pair in V2 has M[k][l] ≥ α, then the algorithm searches the third and final vicinity V3, which represents long-distance paragraph skips, for the pair (k, l) with M[k][l] ≥ α that has the shortest Euclidean distance to c. Finally, the update c = (k, l) is made, and the alignment is added to P. This process is repeated until there are no more paragraphs left to be aligned, or until the algorithm finds no suitable pair in V3 to follow the current alignment. Notice that, although we use only three vicinities, the algorithm can be easily adapted to support as many distinct vicinities as necessary. After P is found, the paragraphs in all 1-N, N-1 and N-N alignments are concatenated so that our sentence alignment algorithm can more easily search for equivalent sentences in them.
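The vicinity-driven search can be sketched compactly as follows. The vicinity shapes and the threshold alpha are our simplified reading of the description above, and within each vicinity we take the first (i.e. nearest) admissible cell rather than ranking candidates by score:

```python
def align_paragraphs(M, alpha=0.5):
    """Vicinity-driven alignment over a similarity matrix M (n x m).
    Starts from (0, 0) (first paragraphs, e.g. titles, assumed aligned)
    and greedily follows three vicinities: V1 (adjacent cells: 1-1, 1-N,
    N-1), V2 (single skips), V3 (any later cell, nearest first)."""
    n, m = len(M), len(M[0])
    path = [(0, 0)]
    i, j = 0, 0
    while True:
        v1 = [(i + 1, j + 1), (i, j + 1), (i + 1, j)]          # 1-1, 1-N, N-1
        v2 = [(i + 2, j + 2), (i + 1, j + 2), (i + 2, j + 1)]  # single skips
        v3 = sorted(((k, l) for k in range(i + 1, n) for l in range(j + 1, m)),
                    key=lambda c: (c[0] - i) ** 2 + (c[1] - j) ** 2)  # long skips
        step = None
        for vicinity in (v1, v2, v3):
            admissible = [(k, l) for k, l in vicinity
                          if k < n and l < m and M[k][l] >= alpha]
            if admissible:
                step = admissible[0]
                break
        if step is None:       # no admissible cell anywhere: stop
            return path
        path.append(step)
        i, j = step
```

Each step strictly advances at least one coordinate, so the loop always terminates.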
4 Sentence Alignment Algorithm
Our sentence alignment algorithm employs the same principles behind our paragraph alignment algorithm: it also allows for 1-1, 1-N, N-1 and null alignments, and works under the assumption that N-N alignments can be inferred from consecutive 1-1 alignments.
Algorithm 2 exploits a similarity matrix M of dimensions n × m, where n represents the number of sentences in a paragraph P1, m the number of sentences in a paragraph P2, and M[i][j] the similarity between the ith sentence in P1 and the jth sentence in P2. As a metric, we use TF-IDF cosine similarity.
In search of path P, our algorithm first finds the initial alignment point (i, j) with M[i][j] ≥ α that has the shortest Euclidean distance to (1, 1). Notice that this algorithm cannot exploit the assumption that the first sentences in a pair of paragraphs will be aligned. The next alignment is then searched for in the immediate vicinity V1. If the best candidate similarity in V1 is smaller than the minimum threshold α, then the algorithm searches for the next pair (k, l) with M[k][l] ≥ α that has the shortest Euclidean distance to c = (i, j) in vicinity V2. Otherwise, if the best candidate is the diagonal (i+1, j+1), then a new 1-1 alignment is added to P. However, if a 1-N ((i, j+1)) or N-1 ((i+1, j)) alignment is found, then the algorithm performs a secondary loop in order to find the size N of the alignment. N is incremented until the accumulated similarity for the concatenated sentences given the current size of N is smaller than the similarity for N−1 minus a slack value β, or until the similarity of the adjacent diagonal ((i+1, j+N) or (i+N, j+1)) is larger than the accumulated similarity. This process is repeated until there are no more sentences left to be aligned, or until the algorithm finds no suitable new alignment candidates.
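The secondary loop that grows a 1-N alignment can be sketched as below. Jaccard similarity stands in for the accumulated TF-IDF similarity, the adjacent-diagonal check is omitted for brevity, and the slack value beta is illustrative:

```python
def jaccard(a, b):
    """Jaccard similarity between the token sets of two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def grow_alignment(src_sent, tgt_sents, j, beta=0.05):
    """Grow a 1-N alignment starting at target sentence j: keep appending
    the next target sentence while the similarity of src_sent against the
    concatenation does not drop more than beta below the previous value."""
    size = 1
    prev = jaccard(src_sent, tgt_sents[j])
    while j + size < len(tgt_sents):
        cand = jaccard(src_sent, " ".join(tgt_sents[j:j + size + 1]))
        if cand < prev - beta:   # concatenating hurt: stop growing
            break
        prev = cand
        size += 1
    return size
```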
5 Final Remarks
We presented a new set of vicinity-driven paragraph and sentence alignment algorithms for document-aligned corpora. Our algorithms address the limitations of previous approaches by exploiting clues in comparable documents that are often neglected. Unlike many earlier strategies, they allow for 1-N, N-1, N-N and long-distance null alignments.
In the future, we aim to conduct both intrinsic and extrinsic performance comparisons between our algorithms and earlier approaches. We also aim to employ our algorithms in the creation of new paragraph and sentence-aligned corpora to be made available to the public.
- [Bach et al.2011] Nguyen Bach, Qin Gao, Stephan Vogel, and Alex Waibel. 2011. Tris: A statistical sentence simplifier with log-linear models and margin-based discriminative training. In IJCNLP.
- [Barzilay and Elhadad2003] Regina Barzilay and Noemie Elhadad. 2003. Sentence alignment for monolingual comparable corpora. In Proceedings of the 2003 EMNLP, pages 25–32.
- [Biran et al.2011] Or Biran, Samuel Brody, and Noémie Elhadad. 2011. Putting it simply: a context-aware approach to lexical simplification. In Proceedings of the 49th ACL, pages 496–501.
- [Bott and Saggion2011] Stefan Bott and Horacio Saggion. 2011. An unsupervised alignment algorithm for text simplification corpus construction. In Proceedings of the 2011 MTTG, pages 20–26.
- [Chandrasekar et al.1996] Raman Chandrasekar, Christine Doran, and Bangalore Srinivas. 1996. Motivations and methods for text simplification. In Proceedings of the 16th COLING, pages 1041–1044.
- [Coster and Kauchak2011] William Coster and David Kauchak. 2011. Simple english wikipedia: A new text simplification task. In Proceedings of the 49th ACL, pages 665–669.
- [Deléger and Zweigenbaum2009] Louise Deléger and Pierre Zweigenbaum. 2009. Extracting lay paraphrases of specialized expressions from monolingual comparable medical corpora. In Proceedings of the 2nd Workshop on Building and Using Comparable Corpora: from Parallel to Non-parallel Corpora, pages 2–10.
- [Feblowitz and Kauchak2013] Dan Feblowitz and David Kauchak. 2013. Sentence simplification as tree transduction. In Proceedings of the 2nd Workshop on Predicting and Improving Text Readability for Target Reader Populations, pages 1–10.
- [Horn et al.2014] Colby Horn, Cathryn Manduca, and David Kauchak. 2014. Learning a lexical simplifier using wikipedia. In Proceedings of the 52nd ACL, pages 458–463.
- [Kauchak and Barzilay2006] David Kauchak and Regina Barzilay. 2006. Paraphrasing for automatic evaluation. In Proceedings of the 2006 NAACL, pages 455–462.
- [Paetzold and Specia2013] Gustavo H. Paetzold and Lucia Specia. 2013. Text simplification as tree transduction. In Proceedings of the 9th STIL, pages 116–125.
- [Siddharthan2006] Advaith Siddharthan. 2006. Syntactic simplification and text cohesion. Research on Language and Computation, 4(1):77–109, March.
- [Siddharthan2014] Advaith Siddharthan. 2014. A survey of research on text simplification. International Journal of Applied Linguistics, pages 259–298.
- [Smith et al.2010] Jason R Smith, Chris Quirk, and Kristina Toutanova. 2010. Extracting parallel sentences from comparable corpora using document level alignment. In Proceedings of the 2010 HLT, pages 403–411.
- [Woodsend and Lapata2011] Kristian Woodsend and Mirella Lapata. 2011. Learning to simplify sentences with quasi-synchronous grammar and integer programming. In Proceedings of the 2011 EMNLP, pages 409–420.
- [Xu et al.2015] Wei Xu, Chris Callison-Burch, and Courtney Napoles. 2015. Problems in current text simplification research: New data can help. Transactions of the Association for Computational Linguistics, 3:283–297.
- [Zhu et al.2010] Zhemin Zhu, Delphine Bernhard, and Iryna Gurevych. 2010. A monolingual tree-based translation model for sentence simplification. In Proceedings of the 23rd COLING, pages 1353–1361.