
Integrating Transformer and Paraphrase Rules for Sentence Simplification

by   Sanqiang Zhao, et al.
University of Pittsburgh

Sentence simplification aims to reduce the complexity of a sentence while retaining its original meaning. Current models for sentence simplification adopted ideas from machine translation studies and implicitly learned simplification mapping rules from normal-simple sentence pairs. In this paper, we explore a novel model based on a multi-layer and multi-head attention architecture and we propose two innovative approaches to integrate the Simple PPDB (A Paraphrase Database for Simplification), an external paraphrase knowledge base for simplification that covers a wide range of real-world simplification rules. The experiments show that the integration provides two major benefits: (1) the integrated model outperforms multiple state-of-the-art baseline models for sentence simplification in the literature, and (2) through analysis of the rule utilization, the model seeks to select more accurate simplification rules. The code and models used in the paper are available at Sanqiang/text_simplification.



1 Introduction

Sentence simplification aims to reduce the complexity of a sentence while retaining its original meaning. It can benefit individuals with low literacy skills watanabe2009facilita, including children, non-native speakers, and individuals with language impairments such as dyslexia rello2013dyswebxia and aphasia carroll1999simplifying.

Most previous studies tackled this task in a way similar to machine translation xu2015show; zhang2017sentence, in which models are trained on a large number of sentence pairs, each consisting of a normal sentence and a simplified sentence. Statistical and neural network modeling are the two major methods used for this task. Statistical models have the benefit of integrating easily with human-curated rules and features, so they generally perform well even when they are trained with a limited amount of data. In contrast, neural network models can learn the simplifying rules automatically without the need for feature engineering, but at the cost of requiring a huge amount of training data. Even though models based on neural networks have outperformed statistical methods in multiple Natural Language Processing (NLP) tasks, their performance in sentence simplification is still inferior to that of statistical models xu2015show; zhang2017sentence. We speculate that current training datasets may not be large and broad enough to cover common simplification situations. However, human-created resources do exist that can provide abundant knowledge for simplification. This motivates us to investigate whether it is possible to train neural network models with these types of resources.

Another limitation of existing neural network models for sentence simplification is that they are only able to capture frequent transformations; they have difficulty learning rules that are not frequently observed, despite the significance of those rules. This may be due to the nature of neural networks feng2017memory: during training, a neural network tunes its parameters to learn how to simplify different aspects of the sentence, which means that all the simplification rules are actually contained in the shared parameters. Therefore, if one simplification rule appears more frequently than others, the model will be trained to focus on it more than on the infrequent ones. Meanwhile, models tend to treat infrequent rules as noise if they are trained using sentence pairs alone. If we can leverage an additional memory component to maintain simplification rules individually, it would prevent the model from forgetting low-frequency rules and help it distinguish real rules from noise. Therefore, we propose the Deep Memory Augmented Sentence Simplification (DMASS) model. For comparison, we also introduce another approach, the Deep Critic Sentence Simplification (DCSS) model, which encourages applying the less frequently occurring rules by revising the loss function. In this way, simplification rules are encouraged to be maintained internally in the shared parameters, while avoiding the consumption of an unwieldy amount of additional memory.
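The idea of keeping rules in a separate memory, rather than only in shared parameters, can be illustrated with a generic soft key-value lookup. This is a minimal sketch and not the DMASS model itself: the function `read_memory`, the use of dot-product scores, and the toy two-dimensional vectors are all assumptions made for illustration. Each memory slot holds one rule as a key (standing in for the rule's context) and a value (standing in for the rule's output), so an infrequent rule keeps its own slot regardless of how often it appeared in training.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    return [x / total for x in exps]


def read_memory(query, keys, values):
    """Soft lookup: weight each slot by how well its key matches the query.

    Hypothetical helper for illustration; returns the blended value vector
    and the attention weights over the memory slots.
    """
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    read = [sum(w * value[i] for w, value in zip(weights, values))
            for i in range(dim)]
    return read, weights


# Two toy "rules": slot 0 and slot 1 have orthogonal keys.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]

# A query that matches the first key retrieves mostly the first value.
read, weights = read_memory([5.0, 0.0], keys, values)
```

Because each rule occupies its own slot, the lookup depends on how well the current context matches the rule's key, not on how many times the rule was seen during training.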

In this study, we propose two improvements to the neural network models for sentence simplification. For the first improvement, we propose to use a multi-layer, multi-head attention architecture vaswani2017attention. Compared to RNN/LSTM (Recurrent Neural Network / Long Short-Term Memory) models, the multi-layer, multi-head attention model would be able to selectively choose the correct words in the normal sentence and simplify them more accurately.

Secondly, we propose two new approaches to integrate neural networks with human-curated simplification rules. Note that previous studies rarely tried to incorporate explicit human language knowledge into the encoder-decoder model. Our first approach, DMASS, maintains additional memory to recognize the context and output of each simplification rule. Our second approach, DCSS, follows a more traditional approach and encodes the context and output of each simplification rule into the shared parameters.

Our empirical study demonstrates that our models outperform all the previous sentence simplification models. They achieve both good coverage of the rules to be applied (recall) and high accuracy in applying the correct rules (precision).

2 Related Work

Sentence Simplification

For statistical modeling, zhu2010monolingual proposed a tree-based sentence simplification model drawing inspiration from statistical machine translation. woodsend2011learning employed quasi-synchronous grammar and integer programming to score the simplification rules. wubben2012sentence proposed a two-stage model, PBMT-R, in which a standard phrase-based machine translation (PBMT) model was trained on normal-simple aligned sentence pairs, and several of the best generations from PBMT were re-ranked based on how dissimilar they were to the normal sentence. Hybrid, a model proposed by narayan2014hybrid, was also a two-stage model, combining deep semantic analysis with a machine translation framework. SBMT-SARI xu2016optimizing achieved state-of-the-art performance by employing an external knowledge base to promote simplification. In terms of neural network models, zhang2017sentence argued that the RNN/LSTM model can generate sentences but lacks the capability to simplify them. They proposed DRESS and DRESS-LS, which employ reinforcement learning to reward simpler outputs. As they indicated, the performance is still inferior due to the lack of external knowledge. Our proposed model is designed to address this deficiency of current neural network models, which are not able to integrate an external knowledge base.
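The re-ranking step described for PBMT-R, which prefers candidates that differ more from the input, can be sketched as follows. This is a hedged illustration only: it assumes word-level edit distance as the dissimilarity measure and ignores the translation model's own scores, and the function names and example sentences are hypothetical rather than taken from the cited work.

```python
def word_edit_distance(source, candidate):
    """Levenshtein distance computed over whitespace-separated words."""
    src, cand = source.split(), candidate.split()
    prev = list(range(len(cand) + 1))
    for i, src_word in enumerate(src, 1):
        cur = [i]
        for j, cand_word in enumerate(cand, 1):
            cost = 0 if src_word == cand_word else 1
            cur.append(min(prev[j] + 1,          # delete a source word
                           cur[j - 1] + 1,       # insert a candidate word
                           prev[j - 1] + cost))  # substitute or match
        prev = cur
    return prev[-1]


def rerank_by_dissimilarity(source, candidates):
    """Order candidates from most to least different from the source."""
    return sorted(candidates,
                  key=lambda c: word_edit_distance(source, c),
                  reverse=True)


source = "the feline sat on the mat"
n_best = [
    "the feline sat on the mat",  # unchanged copy of the input
    "the cat sat on the mat",     # one substitution
    "the cat sat on a rug",       # three substitutions
]
ranked = rerank_by_dissimilarity(source, n_best)
```

Ranking purely by dissimilarity would also reward candidates that drift away from the original meaning, so a practical system would presumably restrict the candidates to a small n-best list that the translation model already considers plausible.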

Augmented Dynamic Memory

Despite the positive results obtained so far, a particular problem with the neural network approach is its tendency to favor frequent observations while overlooking special cases that are not frequently observed. This weakness with regard to infrequent cases has been noticed by a number of researchers, who propose an augmented dynamic memory for multiple applications, such as language models daniluk2017frustratingly; grave2016improving, question answering miller2016key, and machine translation feng2017memory; tu2017learning. We find that current sentence simplification models suffer from a similar neglect of infrequent simplification rules, which inspires us to explore augmented dynamic memory.

3 Our Sentence Simplification Models

3.1 Multi-Layer, Multi-Head Attention

Our basic neural network-based sentence simplification model utilizes a multi-layer and multi-head attention architecture vaswani2017attention. As shown in Figure 1, our model based on the Transformer architecture works as follows: given a pair consisting of a normal sentence and a simple sentence, the model learns the mapping from the normal sentence to the simple sentence.

Figure 1: Diagram of the Transformer architecture

The encoder part of the model (see the left part of Figure 1) encodes the normal sentence with a stack of $L$ identical layers. Each layer has two sublayers: a multi-head self-attention sublayer and a fully connected feed-forward neural network for transformation. The multi-head self-attention sublayer encodes the output from the previous layer into the hidden state $e_{s,l}$ (step $s$ and layer $l$) as shown in Equation 1, where $\alpha^{enc}_{s',l}$ indicates the attention distribution over step $s'$ at layer $l$. Each hidden state summarizes the hidden states in the previous layer through the multi-head attention function $a$ vaswani2017attention, where $H$ refers to the number of heads.

The right part of Figure 1 denotes the decoder for generating the simplified sentence. The decoder also consists of a stack of $L$ identical layers. In addition to the same two sublayers as in the encoder, the decoder inserts another multi-head attention sublayer that attends to the encoder outputs. The bottom multi-head self-attention sublayer plays the same role as the one in the encoder, and its hidden state $d_{s,l}$ is computed in Equation 2. The upper multi-head attention sublayer is used to seek relevant information from the encoder outputs. Through the same mechanism, the context vector $c_{s,l}$ (step $s$ and layer $l$) is computed in Equation 3.

$$e_{s,l} = \sum_{s'} \alpha^{enc}_{s',l}\, e_{s',l-1}, \qquad \alpha^{enc}_{s',l} = a(e_{s,l}, e_{s',l-1}, H) \tag{1}$$

Footnote 1: The lowest hidden state is the word embedding.

$$d_{s,l} = \sum_{s'} \alpha^{dec}_{s',l}\, d_{s',l-1}, \qquad \alpha^{dec}_{s',l} = a(d_{s,l}, c_{s',l-1}, H) \tag{2}$$

Footnote 2: The lowest context vector is the word embedding.

$$c_{s,l} = \sum_{s'} \alpha^{dec2}_{s',l}\, e_{s',L}, \qquad \alpha^{dec2}_{s',l} = \ldots \tag{3}$$
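The attention-weighted sums above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the paper's implementation: the function `attend` is hypothetical, heads are formed by slicing each vector into $H$ equal chunks, scores are scaled dot products, the queries are taken from the previous layer, and the learned projection matrices, residual connections, and feed-forward sublayer of a real Transformer are omitted.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    return [x / total for x in exps]


def split_heads(vector, num_heads):
    """Slice a vector into num_heads equal chunks (assumes even division)."""
    size = len(vector) // num_heads
    return [vector[h * size:(h + 1) * size] for h in range(num_heads)]


def attend(queries, memory, num_heads):
    """For every query step, build a weighted sum of the memory states.

    Each head computes its own attention distribution over the memory
    steps; the per-head results are concatenated back into one vector.
    """
    memory_heads = [split_heads(state, num_heads) for state in memory]
    outputs = []
    for query in queries:
        combined = []
        for h, q in enumerate(split_heads(query, num_heads)):
            chunks = [heads[h] for heads in memory_heads]
            scale = math.sqrt(len(q))
            scores = [sum(a * b for a, b in zip(q, chunk)) / scale
                      for chunk in chunks]
            weights = softmax(scores)  # attention distribution over steps
            combined.extend(
                sum(w * chunk[i] for w, chunk in zip(weights, chunks))
                for i in range(len(q)))
        outputs.append(combined)
    return outputs


# Toy previous-layer states: 2 steps, dimension 4, H = 2 heads.
previous_layer = [[1.0, 0.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0, 0.0]]

# Encoder-style self-attention: each step attends over the previous layer.
encoder_states = attend(previous_layer, previous_layer, num_heads=2)

# Decoder-style context vectors: decoder steps attend over encoder outputs.
decoder_steps = [[1.0, 0.0, 1.0, 0.0]]
context = attend(decoder_steps, encoder_states, num_heads=2)
```

The same `attend` function serves both roles: passing the previous layer as queries and memory gives the self-attention sum, while passing decoder states as queries and the top encoder states as memory gives the context vector.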