Discrete Structural Planning for Neural Machine Translation

08/14/2018 ∙ by Raphael Shu, et al. ∙ The University of Tokyo

Structural planning is important for producing long sentences, yet it is missing from current language generation models. In this work, we add a planning phase to neural machine translation to control the coarse structure of output sentences. The model first generates some planner codes, then predicts the real output words conditioned on them. The codes are learned to capture the coarse structure of the target sentence. To obtain the codes, we design an end-to-end neural network with a discretization bottleneck, which predicts the simplified part-of-speech tags of target sentences. Experiments show that the translation performance is generally improved by planning ahead. We also find that translations with different structures can be obtained by manipulating the planner codes.


1 Introduction

When humans speak, it is difficult to ensure grammatical or logical correctness without some form of planning. Linguists have found evidence, through speech errors and particular behaviors, indicating that speakers plan ahead (Redford, 2015). Such planning can happen at the discourse or sentence level, and we may sometimes notice it through inner speech.

In contrast to humans, a neural machine translation (NMT) model has no explicit planning phase when it is asked to generate a sentence. Although one can argue that planning is done implicitly in the hidden layers, such structural information remains uncertain in the continuous vectors until the concrete words are sampled. In tasks such as machine translation, a source sentence can have multiple valid translations with different syntactic structures. As a consequence, in each step of generation, the model is unaware of the “big picture” of the sentence to produce, resulting in uncertainty in the choice of words.

Figure 1: Illustration of the proposed sentence generation framework. The model predicts the planner codes before generating real output words.

In this research, we try to let the model plan the coarse structure of the output sentence before decoding the real words. As illustrated in Fig. 1, in our proposed framework, we insert planner codes at the beginning of the target sentences. The sentence structure of the translation is governed by the codes.

An NMT model takes an input sentence X and produces a translation Y. Let S_Y denote the syntactic structure of the translation. Indeed, the input sentence X already provides rich information about the target-side structure S_Y.

For example, given the Spanish sentence in Fig. 1, we can easily tell that the translation will have a noun, a pronoun and a verb. Such obvious structural information carries no uncertainty and thus does not require planning. In this example, the uncertain part is the order of the noun and the pronoun. Thus, we want to learn a set of planner codes that disambiguate such uncertain information about the sentence structure. By conditioning on the codes, we can potentially improve the effectiveness of beam search, as the search space can be properly regulated.

In this work, we use simplified POS tags to annotate the structure S_Y. We learn the planner codes C_Y by putting a discretization bottleneck in an end-to-end network that reconstructs S_Y with both X and C_Y. The codes are merged with the target sentences in the training data, so no modification to the NMT model is required. Experiments show that the translation performance is generally improved with structural planning. More interestingly, we can control the structure of output sentences by manipulating the planner codes.

2 Learning Structural Planners

In this section, we first extract the structural annotation by simplifying the POS tags. Then we explain the code learning model for obtaining the planner codes.

2.1 Structural Annotation with POS Tags

To reduce uncertainty in the decoding phase, we want a structural annotation that describes the “big picture” of the sentence. For instance, the annotation can tell whether the sentence to generate follows an “NP VP” order. The uncertainty of local structures can be efficiently resolved by beam search or the NMT model itself.

In this work, we extract such coarse structural annotations through a simple two-step process that simplifies the POS tags of the target sentence:

  1. Remove all tags other than “N”, “V”, “PRP”, “,” and “.”. Note that all tags beginning with “N” (e.g. NNS) are mapped to “N”, and all tags beginning with “V” (e.g. VBD) are mapped to “V”.

  2. Remove duplicated consecutive tags.

The following list gives an example of the process:

Input:    He found a fox behind the wall.
POS Tags: PRP VBD DT NN IN DT NN .
Step 1:   PRP V N N .
Step 2:   PRP V N .
Note that many other annotations can also be considered to represent the syntactic structure, which is left for future work to explore.
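For concreteness, the two-step simplification can be sketched in a few lines of Python (the function name and the handling of tags not shown in the example above are our own assumptions):

```python
def simplify_pos_tags(tags):
    """Simplify a POS tag sequence into a coarse structural annotation.

    Step 1: keep only tags starting with "N" or "V" (mapped to "N"/"V"),
            plus "PRP", "," and "."; drop everything else.
    Step 2: remove duplicated consecutive tags.
    """
    mapped = []
    for tag in tags:
        if tag.startswith("N"):
            mapped.append("N")
        elif tag.startswith("V"):
            mapped.append("V")
        elif tag in ("PRP", ",", "."):
            mapped.append(tag)
        # all other tags are removed
    simplified = []
    for tag in mapped:
        if not simplified or simplified[-1] != tag:
            simplified.append(tag)
    return simplified

# Example from the text: "He found a fox behind the wall."
print(simplify_pos_tags(["PRP", "VBD", "DT", "NN", "IN", "DT", "NN", "."]))
# -> ['PRP', 'V', 'N', '.']
```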

2.2 Code Learning

Figure 2: Architecture of the code learning model. The discretization bottleneck is shown as the dashed lines.

Next, we learn the planner codes to remove the uncertainty of the sentence structure when producing a translation. For simplicity, we use the notation S and C in place of S_Y and C_Y in this section.

We first compute the discrete codes C based on the simplified POS tags S = (s_1, ..., s_T):

  h_t = LSTM(E(s_t), h_{t+1})          (1)
  z_i = W_i h_1 + b_i                  (2)
  c_i = GumbelSoftmax(z_i)             (3)

where the tag sequence is first encoded using a backward LSTM (Hochreiter and Schmidhuber, 1997) and E denotes the embedding function. Then, we compute a set of vectors z_1, ..., z_N, which are subsequently discretized into approximately one-hot vectors c_1, ..., c_N using the Gumbel-Softmax trick (Jang et al., 2016; Maddison et al., 2016).

We then combine the information from X and C to initialize a decoder LSTM that sequentially predicts S:

  u = [c_1 ; c_2 ; ... ; c_N]                    (4)
  h^x_t = LSTM(E(x_t), h^x_{t-1})                (5)
  d_0 = W_d [h^x_{|X|} ; u] + b_d                (6)

where [· ; ·] denotes a concatenation of one-hot vectors. Note that only the source encoding h^x is computed with a forward LSTM. Both the transformation in Eq. 2 and the transformation in Eq. 6 are affine. Finally, we predict the probability of emitting each tag s_t with

  p(s_t | s_{<t}, X, C) = softmax(W_o d_t + b_o)   (7)

where d_t is the hidden state of the decoder LSTM at step t.

The architecture of the code learning model is depicted in Fig. 2, which can be seen as a sequence auto-encoder with an extra context input to the decoder. The parameters are optimized with cross-entropy loss.

Once the code learning model is trained, we can obtain the planner codes for all target sentences in the training data using the encoder part.
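To make the architecture concrete, the following is a minimal PyTorch sketch of a code learning model of this kind: a backward LSTM over the tags, a Gumbel-Softmax bottleneck producing N approximately one-hot codes with K types each, and a tag decoder initialized from the source context and the codes. Layer sizes, module names and details such as the straight-through estimator and temperature are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CodeLearningModel(nn.Module):
    """Sequence auto-encoder over simplified POS tags with a discretization
    bottleneck (N codes, K types each) and an extra source-side context."""

    def __init__(self, tag_vocab, src_vocab, hidden=256, N=2, K=4):
        super().__init__()
        self.N, self.K = N, K
        self.tag_emb = nn.Embedding(tag_vocab, hidden)
        self.src_emb = nn.Embedding(src_vocab, hidden)
        self.tag_enc = nn.LSTM(hidden, hidden, batch_first=True)  # run backward
        self.src_enc = nn.LSTM(hidden, hidden, batch_first=True)  # run forward
        self.to_code_logits = nn.Linear(hidden, N * K)
        self.init_dec = nn.Linear(hidden + N * K, hidden)  # affine decoder init
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tag_vocab)

    def encode_codes(self, tags):
        """tags: (B, T) simplified tag ids -> (B, N, K) ~one-hot codes."""
        rev = torch.flip(self.tag_emb(tags), dims=[1])   # backward encoding
        _, (h, _) = self.tag_enc(rev)
        logits = self.to_code_logits(h[-1]).view(-1, self.N, self.K)
        return F.gumbel_softmax(logits, tau=1.0, hard=True)

    def forward(self, src, tags):
        """src: (B, Ts) source word ids; tags: (B, Tt) simplified tag ids."""
        codes = self.encode_codes(tags)
        _, (h_src, _) = self.src_enc(self.src_emb(src))  # forward LSTM over X
        ctx = torch.cat([h_src[-1], codes.flatten(1)], dim=-1)
        h0 = torch.tanh(self.init_dec(ctx)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        # shift the tags right for teacher forcing (assume id 0 is <bos>)
        bos = torch.zeros_like(tags[:, :1])
        dec_in = self.tag_emb(torch.cat([bos, tags[:, :-1]], dim=1))
        dec_out, _ = self.decoder(dec_in, (h0, c0))
        return self.out(dec_out)   # (B, Tt, tag_vocab) reconstruction logits
```

Training minimizes the cross-entropy between the returned logits and the tag sequence; once the model is trained, only encode_codes is needed to label the training corpus with planner codes.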

3 NMT with Structural Planning

The training data of a machine translation dataset is composed of (X, Y) sentence pairs. With the planner codes we obtained, our training data now becomes a list of (X, [C; Y]) pairs. As shown in Fig. 1, we connect the planner codes and the target sentence with an “<eoc>” token.

With the modified dataset, we train a regular NMT model. We use beam search when decoding sentences, so the planner codes are searched before the real words are emitted. The codes are removed from the translation results before evaluation.
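As a minimal sketch of this preprocessing and post-processing step (assuming the codes are written as <cK> tokens separated from the words by <eoc>, as in Table 3; the helper names are ours):

```python
def attach_codes(code_ids, target_tokens):
    """Prepend planner codes to a tokenized target sentence, separated by <eoc>,
    e.g. [2, 1] + ['we', ...] -> ['<c2>', '<c1>', '<eoc>', 'we', ...]."""
    return ["<c%d>" % i for i in code_ids] + ["<eoc>"] + target_tokens

def strip_codes(output_tokens):
    """Remove the code prefix from a decoded hypothesis before evaluation."""
    if "<eoc>" in output_tokens:
        return output_tokens[output_tokens.index("<eoc>") + 1:]
    return output_tokens

# Example
print(attach_codes([2, 1], "we described the process of AP .".split()))
# ['<c2>', '<c1>', '<eoc>', 'we', 'described', 'the', 'process', 'of', 'AP', '.']
```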

4 Related Work

Recently, several methods have been proposed to improve the syntactic correctness of translations. Stahlberg et al. (2016) restrict the search space of the NMT decoder using a lattice produced by a statistical machine translation system. Eriguchi et al. (2017) take a multi-task approach, letting the NMT model parse a dependency tree and combining the parsing loss with the original loss.

Several works further incorporate the target-side syntactic structures explicitly. Nadejde et al. (2017) interleave CCG supertags with normal output words on the target side. Instead of predicting words, Aharoni and Goldberg (2017) train an NMT model to generate linearized constituent parse trees. Wu et al. (2017) propose a model that generates words and parse actions simultaneously, with the word predictions and action predictions conditioned on each other. However, none of these methods plan the structure before translation.

Similar to our code learning approach, some works also learn discrete codes, though for different purposes. Shu and Nakayama (2018) compress word embeddings by learning concept codes to represent each word. Kaiser et al. (2018) break down the dependencies among words with shorter code sequences; decoding can be made faster by predicting the shorter artificial codes first.

5 Experiments

We evaluate our models on the IWSLT 2014 German-to-English task (Cettolo et al., 2014) and the ASPEC Japanese-to-English task (Nakazawa et al., 2016), containing 178K and 3M bilingual pairs respectively. We use Kytea (Neubig et al., 2011) to tokenize Japanese texts and the Moses toolkit (Koehn et al., 2007) for the other languages. Using byte-pair encoding (Sennrich et al., 2016), we force the vocabulary size of each language to be 20K for the IWSLT dataset and 40K for the ASPEC dataset.

For the IWSLT 2014 dataset, we concatenate all five TED/TEDx development and test corpora to form a test set containing 6750 pairs. For evaluation, we report tokenized BLEU computed with the Moses tool.

5.1 Evaluation of Planner Codes

In the code learning model, all hidden layers have 256 hidden units. The model is trained using Nesterov’s accelerated gradient (NAG) (Nesterov, 1983) for a maximum of 50 epochs. We test different settings of the code length N and the number of code types K; the information capacity of the codes is then N log2(K) bits. In Table 1, we evaluate the learned codes under different settings. The S accuracy measures how accurately S is reconstructed given the source sentence X and the code C. The C accuracy reflects the chance of guessing the correct code C given X.

Code Setting | Capacity | S acc. | C acc.
N=1, K=4     | 2 bits   | 27%    | 63%
N=2, K=2     | 2 bits   | 23%    | 67%
N=2, K=4     | 4 bits   | 35%    | 41%
N=4, K=2     | 4 bits   | 22%    | 44%
N=4, K=4     | 8 bits   | 44%    | 27%

Table 1: A comparison of different code settings on the IWSLT 2014 dataset. We report the accuracy of reconstructing S in the code learning model and the accuracy of predicting C in the NMT model.

We can see a clear trade-off between S accuracy and C accuracy. When the code has more capacity, it can recover S more accurately; however, this results in a lower probability of the NMT model guessing the correct code. We found the setting N=2, K=4 to have a balanced trade-off.
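The capacity column in Table 1 follows directly from N·log2(K); a quick check in Python:

```python
import math

for N, K in [(1, 4), (2, 2), (2, 4), (4, 2), (4, 4)]:
    print(f"N={N}, K={K}: {N * math.log2(K):.0f} bits")
# N=1, K=4: 2 bits
# N=2, K=2: 2 bits
# N=2, K=4: 4 bits
# N=4, K=2: 4 bits
# N=4, K=4: 8 bits
```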

5.2 Evaluation of NMT Models

To make a strong baseline, we use two layers of bi-directional LSTM encoders with two layers of LSTM decoders in the NMT model. The hidden layers have 256 units for the IWSLT De-En task and 1000 units for the ASPEC Ja-En task. We apply key-value attention (Miller et al., 2016) in the first decoder layer. A residual connection (He et al., 2016) is used to combine the hidden states of the two decoder layers. Dropout is applied everywhere outside of the recurrent functions. To train the NMT models, we also use the NAG optimizer with a learning rate of 0.25, which is annealed by a factor of 10 if no improvement of the loss value is observed for 20K iterations. The best parameters are chosen on a validation set.
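As a rough sketch of this learning-rate schedule (the bookkeeping and names are our assumptions; the paper does not describe the training loop itself):

```python
def update_learning_rate(lr, best_loss, current_loss, stale_iters,
                         patience=20000, factor=10.0):
    """Anneal the learning rate by `factor` when the loss has not improved
    for `patience` iterations (20K in our setting). Returns the updated
    (lr, best_loss, stale_iters) triple to carry into the next iteration."""
    if current_loss < best_loss:
        return lr, current_loss, 0          # improvement: reset the counter
    if stale_iters + 1 >= patience:
        return lr / factor, best_loss, 0    # no improvement for 20K iterations
    return lr, best_loss, stale_iters + 1
```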

Dataset | Model           | BS=1  | BS=3  | BS=5
De-En   | baseline        | 27.90 | 29.26 | 29.52
De-En   | plan (N=2, K=4) | 28.35 | 29.59 | 29.78
Ja-En   | baseline        | 23.92 | 25.08 | 25.26
Ja-En   | plan (N=2, K=4) | 22.79 | 25.53 | 25.69

Table 2: A comparison of translation performance (BLEU, %) with different beam sizes (BS).

As shown in Table 2, by conditioning the word prediction on the generated planner codes, the translation performance is generally improved over a strong baseline. The improvement may be the result of properly regulating the search space.

However, when we apply greedy search on the Ja-En dataset, the BLEU score is much lower than the baseline's. We also tried beam-searching the planner codes and then switching to greedy search, but the results did not change significantly. We hypothesize that it is important to simultaneously explore multiple candidates with drastically different structures in the Ja-En task. By planning ahead, more diverse candidates can be explored, which improves beam search but not greedy search. If so, the results are in line with a recent study (Li et al., 2016) showing that the performance of beam search depends on the diversity of the candidates.

5.3 Qualitative Analysis

Instead of letting beam search decide the planner codes, we can also choose the codes manually. Table 3 gives an example of the candidate translations produced by the model when conditioning on different planner codes.

input  | AP no katei ni tsuite nobeta. (Japanese)
code 1 | <c4> <c1> <eoc> | the process of AP is described .
code 2 | <c1> <c1> <eoc> | this paper describes the process of AP .
code 3 | <c3> <c1> <eoc> | here was described on process of AP .
code 4 | <c2> <c1> <eoc> | they described the process of AP .

Table 3: Examples of translation results conditioned on different planner codes in the Ja-En task.

As shown in Table 3, we can obtain translations with drastically different structures by manipulating the codes. The results show that the proposed method can be useful for sampling paraphrased translations with high diversity.
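Manually choosing the codes amounts to forcing the decoder to start from a fixed code prefix and only then searching over real words. Below is a minimal greedy-search sketch, assuming a hypothetical step function that maps a decoded prefix to next-token log-probabilities; the interface is ours, not the authors' API.

```python
def decode_with_forced_code(step, code_prefix, eos="</s>", max_len=100):
    """Decode greedily after forcing a planner-code prefix such as
    ['<c1>', '<c1>', '<eoc>']; returns only the word part of the hypothesis."""
    hyp = list(code_prefix)                 # forced planner codes
    while len(hyp) < max_len:
        log_probs = step(hyp)               # dict: next token -> log-probability
        next_tok = max(log_probs, key=log_probs.get)
        if next_tok == eos:
            break
        hyp.append(next_tok)
    return hyp[len(code_prefix):]
```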

Figure 3: Distribution of assigned planner codes for English sentences in ASPEC Ja-En dataset

The distribution of the codes learned for the 3M English sentences in the ASPEC Ja-En dataset is shown in Fig. 3. We found that the code “<c1> <c1>” is assigned to 20% of the sentences, whereas “<c4> <c3>” is not assigned to any sentence. The skewed distribution may indicate that the capacity of the codes is not fully exploited, which leaves room for further improvement.

6 Discussion

Instead of learning discrete codes, we could also directly predict the structural annotations (e.g. the simplified POS tags) and then translate based on the predicted structure. However, as the simplified POS tags are still long sequences, errors in predicting the tags propagate to word generation. In our experiments, doing so degrades the performance by around 8 BLEU points on the IWSLT dataset.

7 Conclusion

In this paper, we add a planning phase to neural machine translation, which generates planner codes to control the structure of the output sentence. To learn the codes, we design an end-to-end neural network with a discretization bottleneck that predicts the simplified POS tags of target sentences. Experiments show that the proposed method generally improves the translation performance. We also confirm the effect of the planner codes by showing that translations with drastically different structures can be sampled by choosing different codes.

The planning phase helps the decoding algorithm by removing the uncertainty of the sentence structure. The framework described in this paper can be extended to plan other latent factors, such as the sentiment or topic of the sentence.

References

Appendix A Examples of Generated Translations

We show some random translation examples from the ASPEC Ja-En task. The length of the input sentences is limited to below 10 words. The second code tends to be “<c1>”, which may be because it learns to capture information for long sentences.

Input: saigo ni, shorai tenbo ni tsu ite kijutsu .
<c3> <c2> <eoc> finally , the future prospects are described .
<c4> <c1> <eoc> future prospects are also described .

Input: DNA kaiseki no gijutsu wo kaisetu shi ta .
<c4> <c1> <eoc> the technology of DNA analysis is explained .
<c1> <c1> <eoc> this paper explains the technology of DNA analysis .

Input: ekisho tokusei hyouka souchi no shoukai de a ru .
<c3> <c1> <eoc> this is an introduction to liquid crystal property evaluation equipment .
<c4> <c1> <eoc> the liquid crystal property evaluation equipment is introduced .

Input: gaiyou zai de chiryou shi ta .
<c2> <c1> <eoc> it was treated with external preparation .
<c1> <c1> <eoc> the patient was treated with external preparation .

Input: ukeire no kahi ha shichou ga handan suru .
<c1> <c1> <eoc> the city length is judged the propriety of the acceptance .
<c4> <c1> <eoc> the propriety of the acceptance judges the city length .

Input: fuku sayou ha na ka ta .
<c3> <c1> <eoc> there was no side effect .
<c4> <c1> <eoc> no side effect was observed .

Input: heriumu reidou shisutemu no shiyo wo kaisetsu shi ta .
<c4> <c1> <eoc> the specification of the helium refrigeration system was explained .
<c1> <c1> <eoc> this paper explains the specification of the helium refrigeration system .

Input: kenkyu han no gaiyou wo shoukai shi ta .
<c4> <c1> <eoc> the outline of the research team is introduced .
<c1> <c1> <eoc> this paper introduces the outline of the research team .

Input: kahen bunsan hoshou ki wo kaihatsu shi ta .
<c4> <c1> <eoc> a variable dispersion compensator has been developed .
<c2> <c1> <eoc> we have developed a variable dispersion compensator .

Input: kairo no sekai to tokusei wo setumei shi ta .
<c1> <c1> <eoc> this paper explains the design and characteristics of the circuit .
<c4> <c1> <eoc> the design and characteristics of the circuit are explained .