In recent years, research on chat dialogue systems has attracted much attention. A typical chat dialogue system selects an appropriate response to the user's input utterance from a database Ji et al. (2014). However, it may not always be possible to find an appropriate response if the coverage of the database is limited. Therefore, for the system to provide appropriate responses consistently, the database must be augmented beforehand.
In this research, we address this problem with a method that generates a complex sentence from a simple sentence by assigning a modifier clause to the simple sentence. Instead of using a manually created corpus for complex sentence generation, we extract a pseudo-parallel corpus for modifier clause generation and use it to train a generator model at low cost. As shown in the first example of Table 1, the input of this method is a sentence that has a modifiable noun phrase, and the output is the input sentence with a modifier clause assigned to it. In this way, the database can be augmented with a variety of complex sentences derived from simple sentences.
The main contributions of this research are as follows.
- We propose a technique to generate a response by inserting a modifier clause into the input response.
- We propose a method to automatically create a corpus for inserting a modifier clause from a response database, and to learn a complex sentence generation model with neural networks.
- We examine evaluation metrics for complex sentence generation, and show that a pipeline method improves both fluency and diversity in comparison with the baseline method, in which the entire generation is done in an end-to-end fashion.
2 Divide and Generate: Neural Generation of a Modifier Clause
Our objective in this research is to generate various kinds of new responses by inserting a modifier clause into a simple sentence in a response database to create a complex sentence. As our generator model for the complex sentences, we use the Encoder-Decoder with Attention Bahdanau et al. (2015). In order to train the model, we propose a technique to first automatically build an annotated parallel corpus of pseudo-simple sentences from a raw corpus of complex sentences. We then present two approaches to learning the generator model: (1) an end-to-end model, which jointly inserts and generates a modifier clause, and (2) a pipeline model, which separates the insertion and generation processes to guide the generation of a natural modifier clause.
In Section 2.1, we explain how to create a parallel corpus for modifier clauses. In Section 2.2, we propose a model that inserts a modifier clause. The evaluation metrics for generated sentences are explained in Section 2.3.
2.1 Pseudo-Simple Corpus
For learning a generation model, it is necessary to have a parallel corpus of complex sentences, in each of which the modifier clause is annotated. However, it is expensive to annotate the corpus manually. Therefore, we create a pseudo-simple corpus by removing the modifier clauses from a raw corpus consisting of complex sentences. We collected training data including modifier clauses and used it for training of complex sentence generation. In this research, we extracted a modifier clause from each complex sentence using Algorithm 1.
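The pairing step described above can be illustrated with a minimal sketch. It is not Algorithm 1: it assumes the modifier clause span has already been identified (for example, from a dependency parse), and the function name `make_pseudo_pair` and the token segmentation in the example are hypothetical.

```python
from typing import List, Tuple


def make_pseudo_pair(
    tokens: List[str], clause_span: Tuple[int, int]
) -> Tuple[List[str], List[str]]:
    """Build one (pseudo-simple, complex) training pair.

    `clause_span` is a half-open token range [start, end) covering the
    modifier clause. How the span is found is outside this sketch.
    """
    start, end = clause_span
    if not (0 <= start < end <= len(tokens)):
        raise ValueError("clause_span must be a non-empty range inside the sentence")
    # Deleting the clause yields the pseudo-simple sentence (model input);
    # the untouched complex sentence is the target.
    simple = tokens[:start] + tokens[end:]
    return simple, list(tokens)


# Hypothetical segmentation of the complex sentence from Table 1.
complex_tokens = ["彼", "に", "借り", "た", "車", "に", "乗り", "まし", "た"]
simple_tokens, target_tokens = make_pseudo_pair(complex_tokens, (0, 4))
print("".join(simple_tokens), "->", "".join(target_tokens))
```

Because the pair is derived by deletion, no manual annotation is needed; the quality of the pair depends entirely on how accurately the clause span is detected.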
Note that we can use different corpora for training the generator model and decoding an actual sentence at test time. If the domain of the test corpus is far from that of the training data, unnatural modifier clauses may be generated at test time. Thus, the domain of the training corpus should be carefully chosen.
2.2 Generator Models
The baseline for comparison used in this research is an end-to-end model in which a complex sentence, with a modifier clause given to an input sentence, is generated in an end-to-end fashion by using Encoder-Decoder. Since the end-to-end model simultaneously detects the position to insert a modifier clause and generates the modifier clause for the input sentence, the task gets complicated and is likely to suffer from data sparseness.
To overcome the limitations of an end-to-end approach, we propose a pipeline model that generates a modifier clause more robustly, by detecting the insertion position and generating the modifier clause separately.
In the pipeline model, the insertion position is detected by a set of rules, and marked with a special token on the input side. The modifier clause is generated by an Encoder-Decoder trained on the pseudo-parallel corpus that includes special tokens that mark the insertion position. Algorithm 2 (Appendix) shows the rule-based algorithm for detecting the insertion position. After the insertion position is detected, the Encoder-Decoder model creates a complex sentence by generating a modifier clause hinted by the rule. By using a special token to mark the insertion position, our pipeline model can find the noun to modify and generate a modifier clause easily and robustly.
As shown in the second example of Table 1, the pseudo-parallel corpus of the modifier clause includes a special symbol inserted before and after the word to be modified.
| Corpus | Input (pseudo-simple sentence) | Output (complex sentence) |
|---|---|---|
| Original corpus | 車に乗りました (I got on a car) | 彼に借りた車に乗りました (I got on a car I borrowed from him) |
| Marked corpus | <ins> 方法 </ins> を探しています (I am looking for <ins> ways </ins>) | この先に進む方法を探しています (I am looking for ways to move forward) |
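The marking step of the pipeline model can be sketched as follows. This is a minimal illustration that assumes the index of the noun to modify has already been chosen by the rule-based detector; the helper names `mark_noun` and `strip_marks` and the token segmentation are hypothetical, while the `<ins>` and `</ins>` tokens follow the marked corpus example in Table 1.

```python
from typing import List

INS_OPEN, INS_CLOSE = "<ins>", "</ins>"


def mark_noun(tokens: List[str], noun_index: int) -> List[str]:
    """Wrap the token at `noun_index` in the special insertion markers."""
    if not 0 <= noun_index < len(tokens):
        raise IndexError("noun_index is outside the sentence")
    return (
        tokens[:noun_index]
        + [INS_OPEN, tokens[noun_index], INS_CLOSE]
        + tokens[noun_index + 1:]
    )


def strip_marks(tokens: List[str]) -> List[str]:
    """Remove the markers, e.g. if they are copied into the decoder output."""
    return [t for t in tokens if t not in (INS_OPEN, INS_CLOSE)]


# Hypothetical segmentation of the marked-corpus input from Table 1.
sentence = ["方法", "を", "探し", "て", "い", "ます"]
marked = mark_noun(sentence, 0)
print(" ".join(marked))
```

Keeping the markers as ordinary vocabulary items means the Encoder-Decoder itself needs no architectural change; only the input text differs between the two models.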
2.3 Evaluation Metric
In this research, we generate a complex sentence by inserting a modifier clause into a simple sentence. Since there is no single correct answer for such a task, it is not appropriate to evaluate the generated sentence with a metric that relies on a reference sentence, such as BLEU or ROUGE.
The purpose of our complex sentence generation task is to improve diversity in the responses, without compromising on fluency to the extent possible. Although our generator model takes simple sentences as input, our goal is to augment the response database, so the quality of the generated data can be evaluated by looking only at the generated data.
Since it is desirable to have fluent sentences, we assess the fluency of the generator model by perplexity under an N-gram language model built from the test data.
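For reference, perplexity can be computed from per-token log-probabilities as sketched below. The sketch assumes the natural-log probabilities have already been obtained from some N-gram language model; the function name `perplexity` is hypothetical, and no particular toolkit is implied.

```python
import math
from typing import Sequence


def perplexity(log_probs: Sequence[float]) -> float:
    """Perplexity = exp of the negative mean natural-log token probability."""
    if not log_probs:
        raise ValueError("at least one token log-probability is required")
    return math.exp(-sum(log_probs) / len(log_probs))


# A model that assigns probability 0.25 to every token is exactly as
# uncertain as a uniform choice among four words.
uniform_example = perplexity([math.log(0.25)] * 4)
print(uniform_example)
```

Lower values mean the language model finds the generated text more predictable, which is used here as a proxy for fluency.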
In addition, we naturally expect the generated sentence to carry more information than the original sentence. Therefore, we use the number of word types in the sentence as a measure of the amount of information.
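The word-type measure can be sketched in a few lines. This is an illustrative reading in which each sentence is a list of tokens and the score is the mean number of distinct tokens per sentence; the function name `average_word_types` is hypothetical.

```python
from typing import List, Sequence


def average_word_types(sentences: Sequence[List[str]]) -> float:
    """Mean number of distinct tokens (word types) per sentence."""
    if not sentences:
        raise ValueError("at least one sentence is required")
    return sum(len(set(tokens)) for tokens in sentences) / len(sentences)


toy_corpus = [["a", "b", "a"], ["c"]]
toy_score = average_word_types(toy_corpus)  # (2 + 1) / 2
print(toy_score)
```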
We extracted conversation sentences from novels posted on an online forum for sharing Japanese novels (https://syosetu.com). We crawled the site and obtained 2,782,577 sentences as of May 2017. We then created a pseudo-simple corpus as described in Section 2.1. We used CaboCha Kudo and Matsumoto (2002) for dependency analysis, and MeCab Kudo et al. (2004) with IPAdic (http://chasen.naist.jp/stable/ipadic/) for morphological analysis. Since a simple sentence is assumed as the input at test time, the test and development data consist of simple sentences only. In order to prevent the generated modifier clause from being biased toward a generic but meaningless clause (e.g., “that I know”), we kept only one training instance for each distinct modifier clause. The training data consists of 95,234 sentences, and the test and development data contain 1,000 sentences each.
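The filtering step that keeps one instance per modifier clause might look like the sketch below. The `Example` record, its field names, and the choice to keep the first occurrence are assumptions made for illustration; the source does not say which instance is retained.

```python
from typing import Iterable, List, NamedTuple


class Example(NamedTuple):
    simple: str   # pseudo-simple input sentence
    clause: str   # modifier clause that was removed
    full: str     # original complex sentence


def keep_one_per_clause(examples: Iterable[Example]) -> List[Example]:
    """Keep only the first example seen for each distinct modifier clause."""
    seen = set()
    kept = []
    for example in examples:
        if example.clause in seen:
            continue
        seen.add(example.clause)
        kept.append(example)
    return kept


# Toy data with placeholder strings; two examples share the same clause.
data = [
    Example("s1", "that I know", "c1"),
    Example("s2", "that I know", "c2"),
    Example("s3", "I borrowed from him", "c3"),
]
filtered = keep_one_per_clause(data)
print(len(filtered))
```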
We also tested whether the model trained on our corpus can correctly assign a modifier clause to sentences in an out-of-domain setting. For the out-of-domain data, we used simple sentences taken from a chat dialogue corpus Higashinaka et al. (2016). This corpus is a typed online dialogue corpus consisting of utterances of a system and a user. In this research, we extracted the user utterances and generated a modifier clause for each utterance.
We conducted experiments with both the end-to-end model and the pipeline model. In addition, we examined the output of the pipeline model when beam search was performed with a beam width of 10. For the neural networks, we used a vocabulary size of 10,000, an embedding layer of size 512, a hidden layer of size 512, and a batch size of 128. Word vectors were initialized with word2vec embeddings learned from the training data. The optimization algorithm was Adagrad with a learning rate of 0.01. For model selection, we trained each model for up to 20 epochs and selected the epoch at which BLEU on the development set was highest.
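The settings above can be collected into a configuration together with the epoch selection rule, as in the following sketch. The hyperparameter values come from the text; the dictionary keys, the function name `select_epoch`, and the development BLEU scores in the example are hypothetical.

```python
from typing import Dict

# Values reported in the text; key names are illustrative only.
CONFIG = {
    "vocab_size": 10_000,
    "embed_size": 512,
    "hidden_size": 512,
    "batch_size": 128,
    "optimizer": "Adagrad",
    "learning_rate": 0.01,
    "max_epochs": 20,
    "beam_width": 10,
}


def select_epoch(dev_bleu: Dict[int, float]) -> int:
    """Return the epoch with the highest dev BLEU (earliest epoch on ties)."""
    if not dev_bleu:
        raise ValueError("no development scores were provided")
    # max() returns the first maximal item, so iterating epochs in
    # ascending order resolves ties in favor of the earlier epoch.
    return max(sorted(dev_bleu), key=lambda epoch: dev_bleu[epoch])


# Made-up scores, used only to exercise the selection rule.
made_up_scores = {1: 10.2, 2: 12.5, 3: 12.5, 4: 11.9}
best_epoch = select_epoch(made_up_scores)
print(best_epoch)
```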
We automatically evaluated the sentences generated by each model using the perplexity of an N-gram (N = 4) language model with modified Kneser-Ney smoothing and the average number of word types. In addition, as a manual evaluation, we subjectively evaluated the fluency of 210 sentences randomly sampled from each system, performing a pairwise comparison of the outputs of the two models.
As shown in Table 2, perplexity is lower in the pipeline model than in the end-to-end model. This indicates that the pipeline model produces more fluent results. Moreover, since the number of word types is greater in the pipeline model, it can output a larger variety of sentences than the end-to-end model.
As for the subjective evaluation, the end-to-end model won 32 times; the pipeline model won 68 times; and there were 110 ties. These results demonstrate that the pipeline model was able to generate sentences with higher fluency than the end-to-end model.
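The counts above can be checked with simple arithmetic. The following sketch is not part of the original evaluation: it computes the pipeline model's share of the non-tied judgments and an exact two-sided sign test over those pairs. The choice of a sign test and the function name `sign_test_two_sided` are assumptions made for illustration.

```python
from math import comb

END_TO_END_WINS, PIPELINE_WINS, TIES = 32, 68, 110
TOTAL = END_TO_END_WINS + PIPELINE_WINS + TIES


def sign_test_two_sided(wins_a: int, wins_b: int) -> float:
    """Exact two-sided sign test p-value; ties are excluded beforehand."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    # Probability of k or more wins out of n under a fair coin.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


pipeline_share = PIPELINE_WINS / (PIPELINE_WINS + END_TO_END_WINS)
p_value = sign_test_two_sided(PIPELINE_WINS, END_TO_END_WINS)
print(f"judged pairs: {TOTAL}, pipeline share of non-ties: {pipeline_share:.2f}")
print(f"two-sided sign test p-value: {p_value:.2g}")
```

Note that more than half of the judgments are ties, so the comparison rests on the 100 pairs where the annotators expressed a preference.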
The output of each model is shown in Table 3 (Appendix). As in the first example, the end-to-end model sometimes output the same word redundantly in the modifier clause, whereas the pipeline model did not. Also, in the second example, “大手” (“major”) in the input sentence changes to “大型” (“big”) in the output of the end-to-end model, which is similar in meaning but unnatural. In addition, as in the third and fifth examples, there were many cases where garbled words appeared in the output of the end-to-end model, making the output sentence syntactically and semantically invalid. Although the pipeline model did not output such sentences, as in the fourth and fifth examples, it sometimes inserted a modifier clause at an unnatural position or generated a modifier clause that was not appropriate for the modified noun.
The output of the pipeline model in an out-of-domain setting is shown in Table 5 (Appendix). As in the first example, the pipeline model could successfully generate a modifier clause for a word contained in the vocabulary of the model in an out-of-domain context. On the other hand, when a modified word is out-of-vocabulary, there were many cases in which inappropriate modifier clauses were generated as shown in the third example.
In both the end-to-end model and the pipeline model, it commonly occurred that a generic but meaningless modifier clause was inserted into the input sentence. This problem is related to the evaluation metric used for model selection. In this research, we chose the model for which BLEU was highest, but for the modifier generation task, the model with the highest BLEU does not always produce fluent and diverse modifier clauses. As the learning process proceeds, the generated modifier clauses tend to vary more, whereas loss and perplexity on the development data continue to rise, so it is necessary to balance the trade-off between fluency and diversity.
The end-to-end model tends to output words with similar meaning, or garbled words, resulting in the generation of unnatural sentences. This indicates that the information to predict the position of the modifier clause is better kept as a special token, rather than as a distributed representation in a hidden layer. It is known that style information is better encoded as a special token Sennrich et al. (2016); Yamagishi et al. (2017), and our finding is consistent with previous work.
Table 4 (Appendix) shows the top three sentences produced by beam search with a beam width of 10 in the pipeline model. Each sentence has high fluency and differs in meaning from the others. Therefore, it may be possible to avoid outputting sentences that contain the same words redundantly by imposing a penalty for duplication and re-ranking the candidates in the beam.
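One possible form of such re-ranking is sketched below. It was not evaluated in this work: the scoring rule (log-probability minus a fixed penalty per repeated token), the default penalty weight, and the names `duplicate_count` and `rerank` are all assumptions.

```python
from collections import Counter
from typing import List, Tuple

Candidate = Tuple[List[str], float]  # (tokens, model log-probability)


def duplicate_count(tokens: List[str]) -> int:
    """Number of token occurrences beyond the first for each word type."""
    return sum(count - 1 for count in Counter(tokens).values())


def rerank(candidates: List[Candidate], penalty: float = 1.0) -> List[Candidate]:
    """Sort candidates by log-probability minus a per-duplicate penalty."""
    return sorted(
        candidates,
        key=lambda cand: cand[1] - penalty * duplicate_count(cand[0]),
        reverse=True,
    )


# Toy beam: the first candidate scores higher but repeats a word.
beam = [(["a", "a", "b"], -1.0), (["a", "b", "c"], -1.5)]
reranked = rerank(beam)
print(reranked[0][0])
```

This naive count also penalizes legitimately repeated function words such as particles, so a practical version would probably restrict the count to content words.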
We evaluated the diversity of the model by the average number of word types in the generated corpus. In this evaluation metric, although a sentence containing more kinds of words receives a better evaluation score, a longer sentence tends to be over-estimated because there is no penalty on the sentence length. Thus, it is necessary to take the sentence length into consideration, possibly weighted by the number of content words in the sentence.
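One simple way to account for sentence length is to divide the number of word types by the number of tokens, as in the sketch below. This type-token ratio is only one candidate normalization and does not implement the content-word weighting mentioned above; the function name `mean_type_token_ratio` is hypothetical.

```python
from typing import List, Sequence


def mean_type_token_ratio(sentences: Sequence[List[str]]) -> float:
    """Average of (distinct tokens / total tokens) over non-empty sentences."""
    ratios = [len(set(tokens)) / len(tokens) for tokens in sentences if tokens]
    if not ratios:
        raise ValueError("at least one non-empty sentence is required")
    return sum(ratios) / len(ratios)


# A long repetitive sentence no longer outscores a short varied one.
toy_sentences = [["a", "b", "a", "b"], ["c", "d"]]
toy_ratio = mean_type_token_ratio(toy_sentences)  # (0.5 + 1.0) / 2
print(toy_ratio)
```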
5 Related Work
There is a thread of research on generating sentences based on training data without any input Bowman et al. (2016). Their research is similar to ours in that they generate a sentence according to the probability distribution learned from the data beforehand, but we generate a sentence based additionally on a given input. In other words, since the output can be controlled by the input, we can output a sentence in a specific domain or include a specific keyword.
There are also other studies that delve into the topic of complex sentences Derr and McKeown (1984); Sato (1980). The former discusses when to output complex sentences, and therefore it has a purpose that is different from our research. The latter generates Japanese sentences automatically by providing the frames and specifications of a sentence. In his work, it is necessary to select the frame and provide information regarding a subordinate clause in a sentence, while in our work we automatically predict a subordinate clause suitable for main clauses based on training data and generate a modifier clause.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In ICLR.
- Bowman et al. (2016) Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew Dai, Rafal Józefowicz, and Samy Bengio. 2016. Generating sentences from a continuous space. In CoNLL.
- Derr and McKeown (1984) Marcia A. Derr and Kathleen McKeown. 1984. Using focus to generate complex and simple sentences. In COLING.
- Higashinaka et al. (2016) Ryuichiro Higashinaka, Kotaro Funakoshi, Yuka Kobayashi, and Michimasa Inaba. 2016. The dialogue breakdown detection challenge: Task description, datasets, and evaluation metrics. In LREC.
- Ji et al. (2014) Zongcheng Ji, Zhengdong Lu, and Hang Li. 2014. An information retrieval approach to short text conversation. arXiv.
- Kudo and Matsumoto (2002) Taku Kudo and Yuji Matsumoto. 2002. Japanese dependency analysis using cascaded chunking. In CoNLL.
- Kudo et al. (2004) Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. 2004. Applying conditional random fields to Japanese morphological analysis. In EMNLP.
- Sato (1980) Taisuke Sato. 1980. SGS: A system for mechanical generation of Japanese sentences. In COLING.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Controlling politeness in neural machine translation via side constraints. In HLT-NAACL.
- Yamagishi et al. (2017) Hayahide Yamagishi, Shin Kanouchi, Takayuki Sato, and Mamoru Komachi. 2017. Improving Japanese-to-English neural machine translation by voice prediction. In IJCNLP(2).