Synchronous Bidirectional Inference for Neural Sequence Generation

02/24/2019, by Jiajun Zhang et al.

In sequence-to-sequence generation tasks (e.g. machine translation and abstractive summarization), inference is generally performed in a left-to-right manner to produce the result token by token. Neural approaches, such as LSTM and self-attention networks, can now make full use of all the predicted history hypotheses from the left side during inference, but they cannot access any future (right-side) information and usually generate unbalanced outputs in which the left parts are much more accurate than the right ones. In this work, we propose a synchronous bidirectional inference model that generates outputs using both left-to-right and right-to-left decoding simultaneously and interactively. First, we introduce a novel beam search algorithm that facilitates synchronous bidirectional decoding. Then, we present the core approach which enables left-to-right and right-to-left decoding to interact with each other, so as to utilize both the history and future predictions simultaneously during inference. We apply the proposed model to both LSTM and self-attention networks. In addition, we propose two strategies for parameter optimization. Extensive experiments on machine translation and abstractive summarization demonstrate that our synchronous bidirectional inference model achieves remarkable improvements over strong baselines.


1 Introduction

Many tasks in natural language processing, such as machine translation, abstractive summarization and chatbots, can be formalized as a sequence-to-sequence (seq2seq) generation problem which takes a sequence as input (e.g. a source-language sentence) and produces another sequence as output (e.g. the target-language translation). Generally, the seq2seq framework performs inference in a left-to-right (L2R) manner and predicts the current output token conditioned on the previously generated tokens. Existing methods mainly focus on how to fully exploit the already predicted outputs on the left, and the recently proposed neural architectures for sequence generation, including recurrent networks sutskever2014sequence ; bahdanau2015neural , convolutional networks gehring2017convolutional and self-attention networks (known as the Transformer) vaswani2017attention , facilitate the exploration of all the history information during inference.

Architecture Direction First Four Last Four
LSTM L2R 36.35% 31.64%
LSTM R2L 31.22% 34.01%
Transformer (Self-Attention Network) L2R 40.21% 35.10%
Transformer (Self-Attention Network) R2L 35.67% 39.47%
Table 1: Matching accuracy of the first and last four tokens between model predictions and references in NIST Chinese-English machine translation tasks. L2R denotes conventional left-to-right inference while R2L denotes right-to-left inference.

However, conventional seq2seq models cannot access the future predictions on the right and usually produce unbalanced outputs in which left parts are much more accurate than right ones. The phenomenon is similar for right-to-left (R2L) inference where the right parts are better. In order to have a more intuitive understanding, we have investigated both L2R and R2L inferences using LSTM hochreiter1997long and self-attention networks (SAN) vaswani2017attention on the typical sequence generation task, namely machine translation. Table 1 shows the matching accuracy of the first and last four tokens between model predictions and references. It is obvious that left-to-right inference performs much better on predictions of head tokens while right-to-left inference excels in tail token predictions. Intuitively, it is a promising direction to combine the merits of bidirectional inferences and make full use of both history and future contexts.
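For readers who want to reproduce this kind of analysis, the following Python sketch computes the first-four/last-four matching accuracy between hypotheses and references. The exact position-wise comparison used here is an assumption of ours; the paper does not spell out its matching procedure.

```python
def edge_match_accuracy(hypotheses, references, k=4):
    """Fraction of the first-k and last-k reference tokens reproduced at the same
    edge positions of the hypothesis (an assumed, simplified matching criterion)."""
    head_hits = tail_hits = total = 0
    for hyp, ref in zip(hypotheses, references):
        n = min(k, len(hyp), len(ref))
        head_hits += sum(h == r for h, r in zip(hyp[:n], ref[:n]))
        tail_hits += sum(h == r for h, r in zip(hyp[-n:], ref[-n:]))
        total += n
    return head_hits / total, tail_hits / total

hyps = ["mandela says south african national congress violated human rights".split()]
refs = ["mandela acknowledges human rights violations by african national congress".split()]
print(edge_match_accuracy(hyps, refs))
```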

Researchers have made great efforts to take advantage of both L2R and R2L inferences. liu2016agreementb ; zhang2018regularizing enforce the agreement between L2R and R2L predictions during training, so that L2R inference is improved accordingly. liu2016agreementa ; wang2017sogou employ an R2L model to rerank the k-best hypotheses of the L2R model. zhang2018asynchronous first obtains the R2L outputs and then optimizes the L2R inference model based on both the original input and the R2L outputs. Despite the performance improvement, these approaches suffer from two issues: on one hand, they have to train two separate seq2seq models for L2R and R2L inferences respectively; on the other hand, the two models cannot interact with each other during inference.

In this article, we propose a synchronous bidirectional inference model that produces outputs using both L2R and R2L decoding simultaneously and interactively. We first introduce a novel beam search algorithm to accommodate L2R and R2L inferences at the same time. At each timestep during inference, one half of the beam retains the hypotheses from L2R inference and the other half retains those from R2L inference, and each hypothesis is generated by utilizing the already predicted outputs from both directions. The interaction between L2R and R2L inferences is realized through a synchronous attention model that attempts to leverage both the history and future sequential predictions simultaneously during inference. Fig. 1 gives a simple illustration of the proposed synchronous bidirectional inference model. The colored middle part on the right of Fig. 1 is the core of our model: L2R and R2L inferences interact with each other in an implicit way illustrated by the colored part. The arrows indicate the information passing flow. Solid arrows in black show the conventional history context dependence, while dotted arrows in color introduce the future context dependence on the other inference direction. For example, besides its own past predictions, L2R inference can also utilize the future contexts already generated by the R2L inference when predicting the next token.

Figure 1: A simple illustration of our synchronous bidirectional inference model in which the L2R and R2L models interact with each other. The left part is the input sequence and the right part denotes the scheme of synchronous bidirectional inference. Take the generation of a token by the L2R inference as an example: we can not only utilize the past predictions of the L2R inference, but also leverage the future contexts which have already been predicted by the R2L inference. It is similar when the R2L inference predicts its next token. Note that the L2R output is not necessarily the reverse of the R2L output. The final output sequence will be the L2R hypothesis if the L2R inference wins, and the reversed R2L hypothesis otherwise.

As we mentioned above, there are many sequence-to-sequence models. To test the generalization capacity of our model, we apply the proposed synchronous bidirectional inference model to two representative seq2seq frameworks using LSTM and self-attention networks. Furthermore, we propose two optimization strategies to train the network parameters. We choose machine translation and abstractive sentence summarization as the testbeds to verify the effectiveness of the models. The extensive experiments demonstrate that our proposed model remarkably outperforms the strong baselines.

2 Synchronous Bidirectional Inference

The task of sequence-to-sequence learning is to find the most probable output sequence $y = (y_1, y_2, \ldots, y_n)$ which maximizes the following conditional probability given the input sequence $x = (x_1, x_2, \ldots, x_m)$:

$\hat{y} = \arg\max_{y} P(y \mid x)$   (1)

Unlike sequential labeling tasks in which the output $y$ shares the same length as the input $x$, the output length $n$ is unknown until the inference process ends, and in most cases the numbers of tokens differ between the output and input sequences in seq2seq learning tasks. For instance, the resulting summary (output sequence) should be much shorter than the original text (input sequence) in the summarization task.

2.1 Unidirectional Inference

Conventionally, Equation 1 is decomposed in a left-to-right manner as follows:

$P(y \mid x) = \prod_{i=1}^{n} P(y_i \mid y_{<i}, x)$   (2)

Since the search space at each step (the target vocabulary) is very large and contains tens of thousands of entries in most cases, a beam search algorithm is usually employed to approximately find the most probable output sequence according to the history predictions $y_{<i}$ and the input sequence $x$. Currently, neural methods such as LSTM, convolutional networks and self-attention networks can model the conditional probability $P(y_i \mid y_{<i}, x)$ more and more accurately, but they leave the future contexts $y_{>i}$ unexplored.

In order to leverage the right-hand information, Equation 1 can also be decomposed in a right-to-left manner as follows:

$P(y \mid x) = \prod_{i=1}^{n} P(y_i \mid y_{>i}, x)$   (3)

Using this decomposition, the right-side hypotheses $y_{>i}$ are available when predicting $y_i$, while the left-side predictions $y_{<i}$ are still missing.
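As a concrete point of reference, the following is a minimal left-to-right beam search sketch in Python that approximates the decomposition in Equation 2. The `log_probs(prefix, x)` callback and the toy copy model are placeholders of our own, not the systems used in the paper.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

EOS = "</s>"

def beam_search_l2r(x: Sequence[str],
                    log_probs: Callable[[Tuple[str, ...], Sequence[str]], Dict[str, float]],
                    beam_size: int = 4,
                    max_len: int = 20) -> List[str]:
    """Approximately maximize prod_i P(y_i | y_<i, x) by keeping the best partial prefixes."""
    beam: List[Tuple[Tuple[str, ...], float]] = [((), 0.0)]   # (prefix, cumulative log-prob)
    complete: List[Tuple[Tuple[str, ...], float]] = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            for tok, lp in log_probs(prefix, x).items():      # P(y_i | y_<i, x)
                candidates.append((prefix + (tok,), score + lp))
        candidates.sort(key=lambda h: h[1], reverse=True)
        beam = []
        for prefix, score in candidates[:beam_size]:
            (complete if prefix[-1] == EOS else beam).append((prefix, score))
        if not beam:
            break
    best = max(complete or beam, key=lambda h: h[1])
    return list(best[0])

# Toy stand-in model: strongly prefers to copy the input and then emit EOS.
def toy_log_probs(prefix, x):
    target = list(x) + [EOS]
    vocab = sorted(set(target))
    i = len(prefix)
    return {t: (math.log(0.9) if i < len(target) and t == target[i]
                else math.log(0.1 / (len(vocab) - 1))) for t in vocab}

print(beam_search_l2r(["a", "b", "c"], toy_log_probs))
```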

2.2 Synchronous Bidirectional Beam Search

Input: Input sequence x, beam size K, maximum output length T
Output: Optimal output sequence y
1 Initialize the complete hypothesis list C = ∅, the left-to-right temporary hypothesis list T_l2r = ∅ and partial hypothesis list H_l2r = {⟨l2r⟩}, and the right-to-left temporary hypothesis list T_r2l = ∅ and partial hypothesis list H_r2l = {⟨r2l⟩}; // ⟨l2r⟩ and ⟨r2l⟩ are tags indicating inference direction
2 for t = 1 to T do
3       T_l2r = ExpandHypo(H_l2r, H_r2l);
4       T_r2l = ExpandHypo(H_r2l, H_l2r);
5       (H_l2r, C) = UpdateHypo(T_l2r, H_l2r, C);
6       (H_r2l, C) = UpdateHypo(T_r2l, H_r2l, C);
7       keep the top-K hypotheses of H_l2r;
8       keep the top-K hypotheses of H_r2l;
9 end for
10 if C ≠ ∅ then
11       sort C in descending order of score;
12       y = C[0];
13 else if score(H_l2r[0]) ≥ score(H_r2l[0]) then
14       y = H_l2r[0];
15 else
16       y = H_r2l[0];
17 end if
18 if y carries the ⟨l2r⟩ tag then
19       return y;
20 else
21       return reversed y;
22 end if
Algorithm 1: Synchronous Bidirectional Beam Search Algorithm

Ideally, we expect to utilize both the past and future contexts ($y_{<i}$ and $y_{>i}$) when determining the best prediction of $y_i$. However, this is contradictory to some extent: predicting $y_i$ needs the future tokens $y_{>i}$ on the right, while determining those future tokens requires $y_i$ on the left. Obviously, it is impractical to use the whole contexts of both sides ($y_{<i}$ and $y_{>i}$) in a single inference model. We therefore take a step back and attempt to explore as much of the bidirectional context as possible, if not all of it.

We propose a synchronous bidirectional inference model in which left-to-right and right-to-left inferences perform in parallel while interacting with each other. In this way, Equation 1 is decomposed as follows:

$P(\overrightarrow{y} \mid x) = \prod_{i} P(\overrightarrow{y}_i \mid \overrightarrow{y}_{<i}, \overleftarrow{y}_{<i}, x), \quad P(\overleftarrow{y} \mid x) = \prod_{i} P(\overleftarrow{y}_i \mid \overleftarrow{y}_{<i}, \overrightarrow{y}_{<i}, x)$   (4)

Equation 4 says that the bidirectional inference model accommodates L2R and R2L decoding at the same time. At timestep $i$, we have already generated the left hypotheses $\overrightarrow{y}_{<i}$ with L2R inference and the right predictions $\overleftarrow{y}_{<i}$ with R2L inference. Thus, different from Equation 2 and Equation 3, the predictions of both sides can be utilized as contexts in the above bidirectional decomposition.

Algorithm 1 shows the beam search procedure of the synchronous bidirectional inference model. The working flow is similar to the unidirectional beam search. We keep three kinds of lists: C is employed to store complete hypotheses, while the temporary lists (T_l2r, T_r2l) and the partial lists (H_l2r, H_r2l) maintain the temporary and partial hypotheses at each decoding timestep for the L2R and R2L inferences respectively. Lines 2-9 in Algorithm 1 are the main part of the beam search algorithm. At timestep t, L2R and R2L inferences perform in parallel but interactively to expand the partial hypotheses H_l2r and H_r2l from the previous timestep (lines 3-4). Then, the complete hypothesis list C and the partial hypothesis lists are updated according to the temporary hypothesis lists (lines 5-6).

Algorithm 2 and Algorithm 3 respectively detail the hypothesis expansion procedure and the hypothesis update process. Algorithm 3 is trivial and is the same as in conventional unidirectional beam search. The algorithm ExpandHypo is the key to synchronous bidirectional inference (SBInfer). In Algorithm 2, for each partial hypothesis in the current-direction list, we calculate the probability of every candidate token in the target vocabulary by utilizing both the history context and the future information (from the opposite-direction list) through the function SBInfer.

Obviously, the function SBInfer is the most important part and requires specific design for different seq2seq architectures. Next, we introduce how to implement the function SBInfer for both LSTM-based and self-attention based seq2seq networks.

Input: Partial hypothesis list H for the current decoding direction and H' for the opposite decoding direction
Output: Temporary hypothesis list T
1 T = ∅;
2 for each partial hypothesis h in H do
3       for each candidate token w in the target vocabulary V do
4             p = SBInfer(h, H', w, x);
5             T = T ∪ {(h ⊕ w, score(h) + log p)};
6       end for
7 end for
8 sort T in descending order of score;
9 keep the top-K hypotheses of T;
10 return T;
Algorithm 2: ExpandHypo
Input: Temporary hypothesis list T, partial hypothesis list H, and complete hypothesis list C
Output: Partial hypothesis list H and complete hypothesis list C
1 H = ∅;
2 for each hypothesis h in T do
3       if h ends with the end-of-sequence token then
4             C = C ∪ {h};
5             if |C| ≥ K then
6                   break;
7             end if
8       else
9             H = H ∪ {h};
10       end if
11 end for
12 return (H, C);
Algorithm 3: UpdateHypo
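The following Python sketch mirrors the control flow of Algorithms 1-3. It assumes a scoring callback `sb_infer(prefix, opposite_beam, x)` that returns next-token log-probabilities and plays the role of SBInfer; the data structures and names are illustrative rather than the paper's implementation.

```python
EOS = "</s>"

def expand_hypo(beam, opposite, x, sb_infer, k):
    """Algorithm 2 (sketch): extend every partial hypothesis with every candidate token,
    scoring it with both its own history and the opposite direction's predictions."""
    cands = []
    for prefix, score in beam:
        for tok, lp in sb_infer(prefix, opposite, x).items():
            cands.append((prefix + (tok,), score + lp))
    cands.sort(key=lambda h: h[1], reverse=True)
    return cands[:k]

def update_hypo(temp, complete, k):
    """Algorithm 3 (sketch): move finished hypotheses to the complete list."""
    partial = []
    for prefix, score in temp:
        if prefix[-1] == EOS:
            complete.append((prefix, score))
            if len(complete) >= k:
                break
        else:
            partial.append((prefix, score))
    return partial

def sb_beam_search(x, sb_infer, k=4, max_len=50):
    """Algorithm 1 (sketch): run the L2R and R2L beams in parallel and let each
    direction condition on the prefixes already produced by the other."""
    l2r = [(("<l2r>",), 0.0)]
    r2l = [(("<r2l>",), 0.0)]
    complete = []
    for _ in range(max_len):
        temp_f = expand_hypo(l2r, r2l, x, sb_infer, k)   # L2R step sees R2L prefixes
        temp_b = expand_hypo(r2l, l2r, x, sb_infer, k)   # R2L step sees L2R prefixes
        l2r = update_hypo(temp_f, complete, k)
        r2l = update_hypo(temp_b, complete, k)
        if not l2r and not r2l:
            break
    best, _score = max(complete or (l2r + r2l), key=lambda h: h[1])
    tokens = [t for t in best[1:] if t != EOS]           # drop direction tag and EOS
    return tokens if best[0] == "<l2r>" else tokens[::-1]
```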

3 Synchronous Bidirectional Inference for LSTM-based Seq2Seq Framework

No matter what kind of network architecture is used, all seq2seq frameworks consist of an encoder and a decoder. Given an input sequence $x = (x_1, \ldots, x_m)$, the encoder transforms $x$ into a sequence of abstract context representations $\mathbf{h} = (h_1, \ldots, h_m)$ whose size is the same as the length of the input text. Then, from the context vectors $\mathbf{h}$, the decoder generates the output sequence $y$ one token at a time by maximizing the probability $P(y_i \mid y_{<i}, x)$ with a left-to-right inference model.

Hereafter, we use $e_{x_i}$ and $e_{y_j}$ to denote the word embeddings corresponding to the input and output tokens $x_i$ and $y_j$. Next, we briefly review the encoder, introducing how to obtain $\mathbf{h}$, and the decoder, addressing how to calculate $P(y_i \mid y_{<i}, x)$, for the conventional LSTM-based seq2seq architecture. Then, we propose to enable synchronous bidirectional inference (SBInfer) in the LSTM-based architecture.

3.1 LSTM-based Seq2Seq Framework

The encoder employs stacked LSTM layers to learn the context vectors $\mathbf{h}$. In the $l$-th layer ($l > 1$), $h_i^l$ is calculated as follows:

$h_i^l = \mathrm{LSTM}(h_{i-1}^l, h_i^{l-1})$   (5)

In the first layer ($l = 1$), $h_i^1$ is obtained through a bidirectional LSTM:

$\overrightarrow{h}_i^1 = \mathrm{LSTM}(\overrightarrow{h}_{i-1}^1, e_{x_i})$   (6)
$\overleftarrow{h}_i^1 = \mathrm{LSTM}(\overleftarrow{h}_{i+1}^1, e_{x_i})$   (7)

Given $\overrightarrow{h}_i^1$ and $\overleftarrow{h}_i^1$, $h_i^1$ is calculated with a feed-forward neural network: $h_i^1 = \mathrm{FFN}([\overrightarrow{h}_i^1; \overleftarrow{h}_i^1])$.

The decoder computes the conditional probability with the help of the attention mechanism bahdanau2015neural , which leverages a different input context at each decoding time step:

$P(y_i \mid y_{<i}, x) = \mathrm{softmax}(W_o z_i)$   (8)

where $z_i$ is the attention output:

$z_i = \tanh(W_z [s_i^L; c_i])$   (9)

in which $s_i^L$ is the top hidden state of the decoder network, and the decoder hidden state $s_i^l$ in the $l$-th layer is computed using the following formula:

$s_i^l = \mathrm{LSTM}(s_{i-1}^l, s_i^{l-1})$   (10)

If $l = 1$, $s_i^1$ will be calculated by combining $z_{i-1}$ as feed input luong2015effective :

$s_i^1 = \mathrm{LSTM}(s_{i-1}^1, [e_{y_{i-1}}; z_{i-1}])$   (11)

The dynamic context $c_i$ is the weighted sum of the source-side context vectors $\mathbf{h}$ and is calculated by the attention model:

$c_i = \sum_{j=1}^{m} \alpha_{ij} h_j$   (12)

where $\alpha_{ij}$ is a normalized weight calculated as follows:

$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{j'=1}^{m} \exp(e_{ij'})}$   (13)
$e_{ij} = v_a^{\top} \tanh(W_a s_i^L + U_a h_j)$   (14)

The greater the value of $\alpha_{ij}$, the more the $j$-th input token contributes to the generation of the $i$-th output word. The left part of Fig. 2 gives the overall illustration of this unidirectional inference model for the LSTM-based seq2seq framework. Note that residual connections and layer normalization are employed as well; they are omitted from the description for simplicity.

Figure 2: Bidirectional inference model for seq2seq framework with LSTM architecture.
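As a concrete reading of Eqs. 12-14, here is a NumPy sketch of the additive attention used to build the dynamic source context. The weight matrices and dimensions are arbitrary placeholders, not the trained parameters of the systems above.

```python
import numpy as np

def additive_attention(s_i, H, W_a, U_a, v_a):
    """Eqs. 12-14 (sketch): score every source state against the decoder state,
    normalize with softmax, and return the weighted source context c_i."""
    # e_ij = v_a^T tanh(W_a s_i + U_a h_j), computed for all j at once
    scores = np.tanh(s_i @ W_a.T + H @ U_a.T) @ v_a          # shape (m,)
    alpha = np.exp(scores - scores.max())
    alpha = alpha / alpha.sum()                               # Eq. 13
    return alpha @ H, alpha                                   # Eq. 12

# Toy dimensions: m=5 source states of size d=8, attention size a=16.
rng = np.random.default_rng(0)
d, a, m = 8, 16, 5
c_i, alpha = additive_attention(rng.normal(size=d), rng.normal(size=(m, d)),
                                rng.normal(size=(a, d)), rng.normal(size=(a, d)),
                                rng.normal(size=a))
print(alpha.round(3), c_i.shape)
```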

3.2 Synchronous Bidirectional Inference for LSTM-based Architecture

In synchronous bidirectional inference, $P(\overrightarrow{y}_i \mid \overrightarrow{y}_{<i}, \overleftarrow{y}_{<i}, x)$ is calculated with both history and future contexts according to Equation 4. The previous section introduces the way to use the input and history contexts in Equation 8. The synchronous bidirectional inference adopts the same mechanism:

$P(\overrightarrow{y}_i \mid \overrightarrow{y}_{<i}, \overleftarrow{y}_{<i}, x) = \mathrm{softmax}(W_o \overrightarrow{z}_i)$   (15)

Different from unidirectional inference, the synchronous bidirectional inference calculates the attention output with both L2R and R2L predictions:

$\overrightarrow{z}_i = \tanh(W_z [\overrightarrow{s}_i^L; \overrightarrow{c}_i; \overrightarrow{d}_i])$   (16)

where the future context $\overrightarrow{d}_i$ is obtained using another attention model over the R2L decoder states, as illustrated with green arrows in Fig. 2:

$\overrightarrow{d}_i = \sum_{j<i} \beta_{ij} \overleftarrow{s}_j^L$   (17)

in which $\beta_{ij}$ is a normalized coefficient:

$\beta_{ij} = \frac{\exp(\tilde{e}_{ij})}{\sum_{j'<i} \exp(\tilde{e}_{ij'})}$   (18)
$\tilde{e}_{ij} = v_b^{\top} \tanh(W_b \overrightarrow{s}_i^L + U_b \overleftarrow{s}_j^L)$   (19)

It should be noted that L2R and R2L inferences perform simultaneously in parallel. Thus, when calculating $\overrightarrow{z}_i$, we can compute $\overleftarrow{z}_i$ at the same time in a similar way:

$P(\overleftarrow{y}_i \mid \overleftarrow{y}_{<i}, \overrightarrow{y}_{<i}, x) = \mathrm{softmax}(W_o \overleftarrow{z}_i)$   (20)
$\overleftarrow{z}_i = \tanh(W_z [\overleftarrow{s}_i^L; \overleftarrow{c}_i; \overleftarrow{d}_i])$   (21)

where the left context $\overleftarrow{d}_i$ is obtained using a similar attention model over the L2R decoder states, as illustrated with blue arrows in Fig. 2:

$\overleftarrow{d}_i = \sum_{j<i} \gamma_{ij} \overrightarrow{s}_j^L$   (22)
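The following NumPy sketch illustrates the shape of the computation in Eqs. 16-19: the attention output combines the decoder state with a source context and a future context gathered from the R2L decoder states. Dot-product scoring is used here purely for brevity; it stands in for the attention model described above, and all names and dimensions are placeholders.

```python
import numpy as np

def attn(query, states):
    """Dot-product attention weights over `states` (a simplification of Eqs. 17-19)."""
    scores = states @ query
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ states

def bidirectional_attention_output(s_fwd_i, H_src, S_bwd_prefix, W_z):
    """Sketch of Eq. 16: combine the decoder state, the source context c_i and the
    future context from the already-generated R2L states into one attention output."""
    c_src = attn(s_fwd_i, H_src)            # source-side context (dot-product variant of Eq. 12)
    c_fut = attn(s_fwd_i, S_bwd_prefix)     # future context from R2L decoder states (Eq. 17)
    return np.tanh(W_z @ np.concatenate([s_fwd_i, c_src, c_fut]))

d = 8
rng = np.random.default_rng(1)
z = bidirectional_attention_output(rng.normal(size=d), rng.normal(size=(5, d)),
                                   rng.normal(size=(3, d)), rng.normal(size=(d, 3 * d)))
print(z.shape)
```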

4 Synchronous Bidirectional Inference for Self-attention based Framework

The self-attention based seq2seq framework is known as the Transformer vaswani2017attention . In this section, we first give an overview of the Transformer and then propose the implementation of synchronous bidirectional inference (SBInfer) in the Transformer.

4.1 Transformer

The Transformer also follows the encoder-decoder architecture. The encoder includes $N$ identical layers, and each layer is composed of two sub-layers: the self-attention sub-layer followed by the feed-forward sub-layer.

The decoder also consists of $N$ identical layers. Each layer has three sub-layers: the first is the masked self-attention mechanism, the second is the decoder-encoder attention sub-layer, and the third is the feed-forward sub-layer (residual connections and layer normalization are applied around each sub-layer in both the encoder and the decoder).

Obviously, the key component is the attention mechanism (in fact, multi-head attention is employed; we introduce only the basic attention for simplicity). The three kinds of attention mechanisms can be formalized into the same formula:

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$   (23)

where $Q$, $K$ and $V$ stand for a query, the key list and the value list respectively, and $d_k$ is the dimension of the keys.
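Equation 23 (and the masking used later in Eq. 24) can be written compactly in NumPy. This is a generic sketch of scaled dot-product attention, not the multi-head implementation used in the actual systems.

```python
import numpy as np

def attention(Q, K, V, mask=None):
    """Eq. 23: scaled dot-product attention. Q: (n, d_k), K: (m, d_k), V: (m, d_v).
    `mask` is an optional (n, m) boolean array; False positions are not attended to."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)     # block masked positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ V

# Causal (masked) self-attention over 4 positions with model size 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
causal = np.tril(np.ones((4, 4), dtype=bool))     # position i attends only to j <= i
out = attention(X, X, X, mask=causal)
print(out.shape)
```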

For the self-attention in the encoder, the queries, keys and values come from the same layer. For example, suppose we calculate the output of the first encoder layer at the $i$-th position. The query is the vector $q_i$ (the sum of the input token embedding and the positional embedding), and the keys and values are the same, namely the matrix of all such vectors $(q_1, \ldots, q_m)$. Using Equation 23 followed by a feed-forward network, we get the representation of the second layer. After $N$ layers, we obtain the input contexts $\mathbf{h} = (h_1, \ldots, h_m)$.

The masked self-attention in the decoder is similar to that of the encoder, except that the query at the $i$-th position can only attend to positions before $i$, since the predictions after the $i$-th position are not available in auto-regressive unidirectional inference:

$s_i^h = \mathrm{Attention}(q_i, K_{<i}, V_{<i})$   (24)

where $K_{<i}$ and $V_{<i}$ are built from the already generated history predictions.

The decoder-encoder attention mechanism is the same as that of the LSTM-based seq2seq architecture. The query is the output of the masked self-attention sub-layer, and the keys and values are the encoder contexts $\mathbf{h}$. The feed-forward sub-layer is then applied to yield the output of the whole layer. After $N$ such layers, we obtain the final hidden states. The softmax function (as in Equation 8) is then employed to predict the output $y_i$. The left part of Fig. 3 depicts the overall architecture of the Transformer.

4.2 Synchronous Bidirectional Inference for Transformer

In synchronous bidirectional inference, the essential difference lies in the improvement over the masked self-attention mechanism of the decoder. In the standard Transformer, the masked self-attention model calculates the output $s_i^h$ using only the history contexts. In contrast, synchronous bidirectional inference performs L2R and R2L decoding in parallel and interactively. At the $i$-th timestep, L2R and R2L inferences have already generated the outputs $\overrightarrow{y}_{<i}$ and $\overleftarrow{y}_{<i}$. Therefore, both $\overrightarrow{y}_{<i}$ and $\overleftarrow{y}_{<i}$ can be employed to compute $\overrightarrow{y}_i$.

Accordingly, we design two self-attention mechanisms to handle history contexts and future contexts respectively. In addition to Equation 24, which utilizes the history information, we propose another self-attention mechanism to leverage the future information generated by the opposite inference direction:

$s_i^f = \mathrm{Attention}(q_i, \overleftarrow{K}_{<i}, \overleftarrow{V}_{<i})$   (25)

where $q_i$ is the query at the $i$-th position from the L2R decoder, and $\overleftarrow{K}_{<i}$ and $\overleftarrow{V}_{<i}$ are the keys and values (i.e. the representations of the previous predictions $\overleftarrow{y}_{<i}$) from the R2L decoder. Fig. 3 illustrates how synchronous bidirectional inference performs. When producing $\overrightarrow{y}_i$, the orange lines denote the original masked self-attention over the history while the green lines indicate the self-attention over the future contexts.

Finally, we introduce a function to combine $s_i^h$ and $s_i^f$ into a new representation $s_i$ that encodes both past and future contexts:

$s_i = s_i^h + \lambda \, s_i^f$   (26)

where $\lambda$ is the single additional parameter introduced by the combination.

For R2L decoding, $\overleftarrow{s}_i$ can be calculated similarly and in parallel:

$\overleftarrow{s}_i^f = \mathrm{Attention}(\overleftarrow{q}_i, \overrightarrow{K}_{<i}, \overrightarrow{V}_{<i})$   (27)
$\overleftarrow{s}_i = \overleftarrow{s}_i^h + \lambda \, \overleftarrow{s}_i^f$   (28)
Figure 3: Bidirectional inference model for seq2seq framework with self-attention architecture.
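Putting Eqs. 24-26 together, a minimal NumPy sketch of the bidirectional decoder state looks as follows. The single-parameter interpolation for Eq. 26 is an assumption based on the later statement (Section 7.1.2) that only one extra parameter is introduced; all other names are placeholders.

```python
import numpy as np

def attention(q, K, V):
    """Single-query scaled dot-product attention (Eq. 23)."""
    w = np.exp(q @ K.T / np.sqrt(K.shape[-1]))
    return (w / w.sum()) @ V

def bidirectional_decoder_state(q_i, K_hist, V_hist, K_fut, V_fut, lam=0.1):
    """Sketch of Eqs. 24-26: history self-attention plus future self-attention over
    the opposite direction's predictions, combined with a single parameter lambda."""
    s_hist = attention(q_i, K_hist, V_hist)    # Eq. 24: attend to own L2R history
    s_fut = attention(q_i, K_fut, V_fut)       # Eq. 25: attend to R2L predictions
    return s_hist + lam * s_fut                # Eq. 26 (assumed interpolation form)

d = 8
rng = np.random.default_rng(2)
s_i = bidirectional_decoder_state(rng.normal(size=d),
                                  rng.normal(size=(3, d)), rng.normal(size=(3, d)),
                                  rng.normal(size=(2, d)), rng.normal(size=(2, d)))
print(s_i.shape)
```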

5 Training

Since our synchronous bidirectional inference performs L2R and R2L decoding in parallel, the L2R decoder aims to generate the gold reference $\overrightarrow{y}$ during training while the R2L decoder attempts to produce the reversed gold reference $\overleftarrow{y}$ at the same time. Given the training data $D$ consisting of sentence pairs $(x, y)$, the objective is to maximize the log-likelihood over the training data:

$L(\theta) = \sum_{(x, y) \in D} \big[ \log P(\overrightarrow{y} \mid x; \theta) + \log P(\overleftarrow{y} \mid x; \theta) \big]$   (29)

When calculating $P(\overrightarrow{y} \mid x; \theta)$, the L2R decoder usually employs the gold reference as the history condition $\overrightarrow{y}_{<i}$. In synchronous bidirectional inference, a problem arises if we directly utilize the gold reference from the other side to compute the future context. For example, in the calculation of $P(\overrightarrow{y}_i \mid \overrightarrow{y}_{<i}, \overleftarrow{y}_{<i}, x)$, the gold future context $\overleftarrow{y}_{<i}$ (the last $i-1$ reference tokens in reversed order) can include $y_i$, which means that $y_i$ is used to predict itself. Obviously, this is not reasonable. To address this issue during training, we propose two optimization strategies to learn the network parameters.

5.1 Two-pass Training

In the first training pass, we learn independent L2R and R2L inference models on the training data. Then, the L2R and R2L models are employed to decode the input sentences of the training data, resulting in model predictions $\overrightarrow{y}^{*}$ and $\overleftarrow{y}^{*}$ respectively. During the second training pass, $P(\overrightarrow{y} \mid x; \theta)$ is calculated using $\overleftarrow{y}^{*}$ as the future context, so the future context consists of model predictions rather than the gold reference. Similarly, we calculate $P(\overleftarrow{y} \mid x; \theta)$ using $\overrightarrow{y}^{*}$.
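Schematically, the second-pass data can be prepared as below; `decode_l2r` and `decode_r2l` are placeholders for the first-pass models, and the record layout is illustrative only.

```python
def build_second_pass_data(pairs, decode_l2r, decode_r2l):
    """Two-pass training (sketch): first-pass models translate the training inputs,
    and their outputs replace the gold references as the *future* context for the
    opposite direction in the second pass."""
    examples = []
    for x, y in pairs:                      # y is the gold reference token list
        y_hat_fwd = decode_l2r(x)           # first-pass L2R prediction
        y_hat_bwd = decode_r2l(x)           # first-pass R2L prediction (reversed order)
        # L2R decoder: predict y left-to-right, seeing y_hat_bwd as the future context.
        examples.append({"src": x, "target": y, "opposite": y_hat_bwd, "dir": "l2r"})
        # R2L decoder: predict reversed y, seeing y_hat_fwd as the future context.
        examples.append({"src": x, "target": y[::-1], "opposite": y_hat_fwd, "dir": "r2l"})
    return examples
```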

5.2 Fine-tuning Strategy

In the fine-tuning strategy, we first train the parallel inference for L2R and R2L without interaction, just as Equation 30 shows. Each training instance for this step is a triple $(x, \overrightarrow{y}, \overleftarrow{y})$.

$L(\theta) = \sum_{(x, \overrightarrow{y}, \overleftarrow{y}) \in D} \big[ \sum_{i} \log P(\overrightarrow{y}_i \mid \overrightarrow{y}_{<i}, x; \theta) + \sum_{i} \log P(\overleftarrow{y}_i \mid \overleftarrow{y}_{<i}, x; \theta) \big]$   (30)

After this simple training procedure converges, we use this model to decode a small subset of the source sentences in the original training data (e.g. 10% of the dataset) and obtain new triples in which the model predictions serve as the opposite-direction contexts. Then, we fine-tune our synchronous bidirectional inference model in the same way as the second pass of the two-pass training strategy.

Compared to the two-pass training strategy, the fine-tuning strategy is much cheaper to implement since there is no need to train two separate models, to decode the entire training set and to do the second training over the whole dataset. In the experiments, we mainly employ the two-pass strategy and compare these two strategies in the experimental analysis part.

6 Experimental Setup

In our experiments, two typical seq2seq tasks of machine translation and abstractive summarization are employed to test the effectiveness of our synchronous bidirectional inference model.

6.1 Machine Translation

6.1.1 Dataset

We evaluate the proposed synchronous bidirectional inference model on both Chinese-to-English and English-to-German translation tasks. For the Chinese-to-English task, the training data consists of about 2.1M sentence pairs extracted from LDC corpora (LDC2000T50, LDC2002L27, LDC2002T01, LDC2002E18, LDC2003E07, LDC2003E14, LDC2003T17, LDC2004T07). We choose the NIST 2002 (MT02) dataset for validation. For testing, we employ the NIST 2003-2006 (MT03-06) datasets. We apply Byte-Pair Encoding (BPE) sennrich2016neural with 30K merge operations and limit the source and target vocabularies to the most frequent 30K tokens.

For the English-to-German task, we utilize the same subset of the WMT 2014 training corpus employed by vaswani2017attention ; luong2015effective ; shen2016minimum ; zhou2016deep . It contains 4.5M sentence pairs (the preprocessed dataset and vocabulary can be downloaded directly from the tensor2tensor website: https://drive.google.com/open?id=0B_bZck-ksdkpM25jRUN2X2UxMm8). The concatenation of news-test 2012 and news-test 2013 is used as the validation set, and news-test 2014 is employed as the test set.

6.1.2 Training and Evaluation Details

For our synchronous bidirectional inference model with the LSTM-based architecture, we implement the system by reusing and modifying the open-source toolkit Zoph_RNN (https://github.com/isi-nlp/Zoph_RNN), which is written in C++/CUDA and provides efficient training across multiple GPUs. The encoder includes two stacked LSTM layers and the first layer employs bidirectional LSTMs. The decoder also contains two stacked LSTM layers followed by the softmax layer. The dimension of the word embeddings and the size of the hidden layers are all set to 1000. The dropout rate is set to 0.2. At test time, we employ beam search with beam size $K$.

For the synchronous bidirectional inference model with the Transformer, we modify the tensor2tensor toolkit (https://github.com/tensorflow/tensor2tensor) for training and evaluation. We employ the Adam optimizer with $\beta_1$=0.9 and $\beta_2$=0.998. The warmup and decay strategy for the learning rate is the same as in vaswani2017attention , with 16,000 warmup steps. During training, we employ label smoothing of value 0.1. For evaluation, we use beam search with length penalty $\alpha$=0.6. Additionally, we use 6 encoder and 6 decoder layers. In each layer, we employ 16 attention heads, a feed-forward inner-layer dimension of 4096, and a dropout rate of 0.1. Our settings are close to the transformer_big setting defined in vaswani2017attention . We employ three Titan Xp GPUs to train English-to-German translation and one GPU for Chinese-to-English. In addition, we average the last 20 checkpoints to get the final model for English-to-German, but do not perform checkpoint averaging for Chinese-to-English.
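For reference, the stated Transformer hyperparameters can be summarized as a plain dictionary. These are not actual tensor2tensor flag names, and values not given in the text (e.g. hidden size and beam size) are deliberately omitted.

```python
# Plain-Python summary of the Transformer settings stated above (illustrative only).
transformer_config = {
    "encoder_layers": 6,
    "decoder_layers": 6,
    "attention_heads": 16,
    "ffn_inner_dim": 4096,
    "dropout": 0.1,
    "label_smoothing": 0.1,
    "optimizer": "Adam",
    "adam_beta1": 0.9,
    "adam_beta2": 0.998,
    "warmup_steps": 16000,
    "length_penalty": 0.6,        # used at decoding time
    "checkpoint_averaging": 20,   # last 20 checkpoints, English-to-German only
}
```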

We evaluate the final translation quality with case-insensitive BLEU papineni2002bleu for Chinese-to-English and with case-sensitive BLEU for English-to-German. Significance test is performed using the pairwise re-sampling approach koehn2004sig .

6.1.3 Translation Systems

We use BI-RNMT to denote our proposed synchronous bidirectional inference model implemented in the LSTM-based recurrent neural machine translation system. The proposed synchronous bidirectional inference model for the Transformer is named BIFT (our code is freely available on GitHub at https://github.com/ZNLP/sb-nmt).

We compare the proposed models against the following state-of-the-art NMT systems:

  • RNMT wu2016google : it is a state-of-the-art LSTM-based NMT system with the same setting as BI-RNMT.

  • RNMT (R2L): it is a variant of RNMT and produces translations from right to left.

  • Transformer: it is the state-of-the-art machine translation system with self-attention mechanism using the default left-to-right generation vaswani2017attention .

  • Transformer (R2L): it is a variant of Transformer which performs translation in a right-to-left manner.

  • Rerank-NMT: following liu2016agreementa , we first run beam search for L2R and R2L inference models independently to obtain two k-best lists, and then re-score the union of these two k-best lists. This method assumes that some source sentences are appropriate to translate from left to right, while others are better to translate from right to left.

  • ABD-NMT: it is an asynchronous bidirectional inference model for NMT that performs L2R inference with the results generated by the R2L inference model zhang2018asynchronous . During inference, a two-pass decoding scheme is employed: first, the R2L inference model generates the backward hidden states and the corresponding translation results; then, ABD-NMT performs the L2R inference with the help of the backward hidden states.

For fair comparison, Rerank-NMT and ABD-NMT are all reimplemented based on strong Transformer models.

6.2 Abstractive Summarization

6.2.1 Dataset

Abstractive sentence summarization is a task that generates a title-like summary for a long sentence. Our training data is a (text, summary) parallel corpus from the Annotated English Gigaword dataset napoles2012annotated ; rush2015neural . It contains about 3.8M text-summary pairs for training and 189K pairs for validation. The encoder and decoder share the same vocabulary of about 90K word types.

For the test set, we use both DUC 2004 and the English Gigaword. In the test set of DUC 2004, there are 500 examples and each example pairs a document with four different human-written reference summaries. For the test set of the English Gigaword, we employ the same randomly selected subset of 2000 text-summary pairs as rush2015neural ; zhou2017selective .

6.2.2 Training and Evaluation Details

For both LSTM-based architecture and Transformer framework, we use the same model settings as neural machine translation.

For evaluation, we use ROUGE lin2004rouge as the metric. ROUGE measures the quality of a summary by computing overlapping lexical units, such as unigrams, bigrams, trigrams, and the longest common subsequence (LCS). It has become the standard evaluation metric for the DUC shared tasks and is popular for summarization evaluation. Following previous work, we use ROUGE-1 (unigram), ROUGE-2 (bigram) and ROUGE-L (LCS) as our evaluation metrics in the reported experimental results.
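As a reminder of what ROUGE-N measures, here is a minimal sketch of ROUGE-N recall; real evaluations use the official ROUGE toolkit, which adds options such as stemming and F-measure variants.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N recall (sketch): overlapping n-grams divided by reference n-grams."""
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    return overlap / max(1, sum(ref.values()))

ref = "mandela acknowledges human rights violations".split()
cand = "mandela acknowledges rights violations by congress".split()
print(round(rouge_n_recall(cand, ref, 1), 3), round(rouge_n_recall(cand, ref, 2), 3))
```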

6.2.3 Summarization Systems

We compare our proposed model with the following state-of-the-art baselines.

  • ABS: rush2015neural first proposed the abstractive summarization task and used an attentive CNN encoder and an NNLM (neural network language model) decoder to perform this task.

  • LSTM-Sum: it is the abstractive summarization system with the same architecture as RNMT, in which the encoder and decoder are both LSTM-based recurrent neural networks.

  • Feats2S: it is also an RNN encoder-decoder model using gated recurrent units (GRU) bahdanau2015neural and provides more features (e.g. POS and NER) to enrich the encoder nallapati2016abstractive .

  • Selective-Enc: zhou2017selective proposed a selective mechanism to select important information from the encoder before generating the summary.

  • Transformer: it is a Transformer model applied to the abstractive sentence summarization task.

7 Results and Analysis

7.1 Machine Translation

7.1.1 Overall Translation Quality

Model MT03 MT04 MT05 MT06 AVE Δ
RNMT 42.07 43.40 40.73 41.11 41.83 -
RNMT (R2L) 41.47 43.13 40.62 40.94 41.54 -0.29
BI-RNMT 43.50 43.98 41.37 42.48 42.83 +1.00
Transformer 47.63 48.32 47.51 45.31 47.19 -
Transformer (R2L) 46.79 47.01 46.50 44.13 46.11 -1.08
Rerank-NMT 48.23 48.91 48.73 46.51 48.10 +0.91
ABD-NMT 49.47 48.01 48.19 47.09 48.19 +1.00
BIFT 51.87 51.50 51.23 49.83 51.11 +3.92
Table 2: Translation quality for the Chinese-to-English tasks using case-insensitive BLEU scores. Our models BI-RNMT and BIFT are significantly better than the corresponding baselines (p < 0.01).

Table 2 reports the translation performance of different systems on the Chinese-English task. The results are mainly divided into two parts. The first part in this table shows the BLEU scores of the systems based on LSTM framework while the second part gives the results of various systems based on the Transformer architecture.

Comparing the baselines using different architectures, we can easily see that the self-attention based Transformer remarkably outperforms the LSTM-based RNMT, with the average improvement of 5.36 BLEU points (47.19 vs. 41.83), suggesting the superiority of the Transformer.

As conventional decoding proceeds from left to right, a question may arise as to which inference direction is better. It is easy to find from Table 2 that right-to-left decoding performs worse than the left-to-right style no matter which neural network architecture is adopted (RNMT (R2L) vs. RNMT, Transformer (R2L) vs. Transformer). Specifically, the gap under the Transformer architecture is much bigger (1.08 vs. 0.29), indicating that the Transformer is more sensitive to the inference direction.

We also investigate previous methods that take advantage of two inference directions. The second part in table 2 shows that both the reranking approach Rerank-NMT liu2016agreementa and the asynchronous bidirectional decoding method ABD-NMT zhang2018asynchronous can get a significant improvement over the strong Transformer baseline. The average gains can be up to 1.0 BLEU point (48.19 vs. 47.19), indicating that L2R decoding and R2L decoding can be complementary to each other.

We go a step further and exploit synchronous bidirectional inference that makes full use of L2R and R2L decoding. It is obvious from Table 2 that our proposed method performs best. The first part of Table 2 shows that the synchronous bidirectional inference model under the LSTM framework, BI-RNMT, obtains an average improvement of 1.0 BLEU point over RNMT. The second part demonstrates that our synchronous bidirectional inference model under the Transformer architecture, BIFT, achieves promising BLEU gains, and the gap can be as large as 3.92 BLEU points on average. The remarkable improvements suggest that, compared to asynchronous bidirectional decoding, our synchronous bidirectional inference can better explore the history and future contexts on the target side.

Model TEST (WMT 14)
GNMT wu2016google 24.61
Conv gehring2017convolutional 25.16
AttIsAll vaswani2017attention 28.40
RNMT 22.85
RNMT (R2L) 22.17
BI-RNMT 23.97
Transformer 27.72 (the Transformer BLEU score is reproduced in our hardware environment and is slightly lower than AttIsAll vaswani2017attention ; chen2018best also reported that their reproduction is lower than the original result in vaswani2017attention . In our experiments, we use only 3 GPUs for English-to-German, whereas vaswani2017attention ; chen2018best adopted TPUs for model training.)
Transformer (R2L) 27.13
Rerank-NMT 27.81
ABD-NMT 28.22
BIFT 29.21
Table 3: Translation results on WMT14 English-to-German task using case-sensitive BLEU.

Similar phenomena can be observed from the English-German translation results shown in Table 3. BI-RNMT outperforms RNMT by 1.12 BLEU points, and BIFT performs best among all the systems, including GNMT wu2016google , Conv gehring2017convolutional and AttIsAll vaswani2017attention . In addition, BIFT achieves state-of-the-art performance of 29.21 on this dataset. Considering that only one reference is available for English-to-German translation, the improvements are very promising.

7.1.2 Model Size and Efficiency

Our synchronous bidirectional inference model is slightly more complicated than the conventional L2R or R2L inference model, so it is interesting to figure out the model size and efficiency of our system compared to the baseline systems. Table 4 reports the corresponding statistics of the different NMT models. The model size denotes the total number of network parameters. Since the synchronous bidirectional inference model only introduces one parameter in Equation 26 for BIFT, its model size is the same as that of the Transformer. In contrast, Rerank-NMT has double the number of parameters compared to the Transformer because it requires two individual encoder-decoder models for L2R and R2L decoding respectively. As for the asynchronous bidirectional decoding model, ABD-NMT shares one encoder and has two decoders, and thus contains roughly 60% more parameters than the Transformer baseline.

Model Model Size Train Test
Transformer 207.8M 2.07 19.97
Transformer (R2L) 207.8M 2.07 19.81
Rerank-NMT 415.6M 1.03 6.51
ABD-NMT 333.8M 1.18 7.20
BIFT 207.8M 1.26 17.87
Table 4: Comparison of model size, training efficiency and testing efficiency. Train denotes the number of global training steps processed per second with the same batch size; Test indicates the number of sentences translated per second.

The Train column shows the number of global training steps per second. Because the training procedure of BIFT needs to match both the L2R and R2L references, it takes more time to converge. However, regarding decoding efficiency, our synchronous bidirectional inference model performs on par with the Transformer baseline, with only about a 10% slowdown (17.87 vs. 19.97 sentences per second), whereas Rerank-NMT and ABD-NMT are much slower. The statistics suggest that the decoding efficiency of BIFT is acceptable.

7.1.3 Performance Trends on Sentence Length

Figure 4: BLEU score trends for different intervals of sentence length.

In previous sections, we argued that BI-RNMT and BIFT can utilize both the history and future contexts during translation. A natural question is whether our proposed model performs much better on long sentences. To answer this question, we group the source sentences of similar lengths in the test set and calculate the corresponding BLEU scores for each length interval.

Fig. 4 displays the statistics. The left part of Fig. 4 represents the results of the systems using the LSTM-based framework, while the right one shows the statistics of the Transformer-based systems. Overall, no matter which architecture is adopted, our proposed models (BI-RNMT and BIFT) are superior to the baselines across all sentence lengths. Generally, the gap becomes bigger as the length grows. Comparing the two architectures, we find that the Transformer-based BIFT excels at long sentence translation. The results indicate that our proposed synchronous bidirectional inference model is better at translating long sentences with the help of both the history and future contexts during decoding.

7.1.4 Translation Precision over Different Positions

We mentioned in the introduction that L2R inference is good at predicting the prefix while R2L inference is adept at suffix prediction. One may wonder how our synchronous bidirectional inference model performs on prefix and suffix prediction. Furthermore, which part of the translated sentence is improved most: the prefix, the middle part or the suffix?

Model First Four Last Four
RNMT 36.35% 31.64%
RNMT (R2L) 31.22% 34.01%
BI-RNMT 36.88% 34.65%
Transformer 40.21% 35.10%
Transformer (R2L) 35.67% 39.47%
Rerank-NMT 38.98% 38.91%
ABD-NMT 38.36% 38.11%
BIFT 40.89% 40.08%
Table 5: Matching accuracy of the first and last four tokens between model predictions and references in NIST Chinese-English machine translation tasks for different NMT systems.
Figure 5: Match precision of translation tokens for different positions.

To figure out these questions, we first analyze the prediction precision of the first-four and last-four tokens of the translation compared to the references for different NMT systems. Table 5 reports the comparison results. Obviously, the proposed synchronous bidirectional inference model performs best for matching precision of both the first-four and last-four tokens, showing the superiority of our methods.

Then, we go a step further and conduct a deeper analysis. We divide each translation hypothesis and its reference into 10 equal parts and calculate the average word prediction accuracy for each part. In this way, we attempt to investigate the contribution of the synchronous bidirectional inference model over different positions. Fig. 5 illustrates the comparison results. It is interesting to see from this figure that both ends of the translation hypothesis are much easier to predict. In contrast, the prediction accuracy of the middle part is much lower, suggesting a greater demand for both the history and future contexts. Our proposed BI-RNMT and BIFT facilitate the usage of the left and right predictions, leading to large improvements over the middle part (40-80% in Fig. 5).
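A simple way to reproduce this position-wise analysis is sketched below; splitting by token index and comparing tokens at identical positions is our own simplification of the procedure described above.

```python
def positional_match_precision(hypotheses, references, buckets=10):
    """Split each hypothesis/reference pair into `buckets` equal parts and compute
    the token match rate inside each part (a simplified, assumed matching rule)."""
    hits = [0] * buckets
    totals = [0] * buckets
    for hyp, ref in zip(hypotheses, references):
        n = min(len(hyp), len(ref))
        for i in range(n):
            b = min(buckets - 1, i * buckets // n)
            totals[b] += 1
            hits[b] += int(hyp[i] == ref[i])
    return [h / t if t else 0.0 for h, t in zip(hits, totals)]

hyp = ["mandela says south african national congress violated human rights".split()]
ref = ["mandela acknowledges human rights violations by african national congress".split()]
print([round(p, 2) for p in positional_match_precision(hyp, ref)])
```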

7.1.5 Two-pass Training vs. Fine-tuning

In this subsection, we investigate the effects of different parameter optimization strategies for our synchronous bidirectional inference model. We employ BIFT on the Chinese-to-English translation task to compare the two-pass training strategy and the fine-tuning strategy. In the fine-tuning step, we randomly choose 10% of the source sentences of the training data. Table 6 reports the comparison results.

As shown in the table, both training strategies remarkably improve the translation performance compared to the Transformer baseline. Although the fine-tuning strategy is not as powerful as the two-pass training strategy, it still achieves a big improvement of about 2 BLEU points on average over the strong baseline. Considering that the fine-tuning strategy is much easier and cheaper for system deployment, we believe this strategy will be more popular than the two-pass strategy.

Model MT03 MT04 MT05 MT06 AVE Δ
Transformer 47.63 48.32 47.51 45.31 47.19 -
BIFT (two-pass) 51.87 51.50 51.23 49.83 51.11 +3.92
BIFT (fine-tuning) 50.76 49.72 48.32 47.91 49.18 +1.99
Table 6: Translation results of different training strategies for our synchronous bidirectional inference model on Chinese-to-English tasks.

7.2 Abstractive Summarization

7.2.1 Summarization Quality

Model DUC-2004 (R-1 / R-2 / R-L) English Gigaword (R-1 / R-2 / R-L)
ABS 26.55 7.06 22.05 29.55 11.32 26.42
Feats2S 28.35 9.46 24.59 32.67 15.59 30.64
Selective-Enc 29.21 9.56 25.51 36.15 17.54 33.63
RNMT 28.22 10.21 25.14 34.54 16.85 32.32
BI-RNMT 29.05 10.90 26.05 35.47 17.62 32.90
Transformer 28.09 9.52 24.91 34.12 16.04 31.46
BIFT 29.17 10.30 26.05 35.68 17.39 32.89
Table 7: Abstractive summarization quality on DUC 2004 and English Gigaword for different methods.

Abstractive sentence summarization is another well-known testbed for sequence to sequence learning. We then apply our synchronous bidirectional inference model into this task. Table 7 presents the results of different systems over two test sets DUC-2004 and English Gigaword.

The first three rows show the performance of previous state-of-the-art abstractive summarization models. Among these three baselines, both Feats2S and Selective-Enc aim at improving the summarization quality by enhancing the representation learning of the encoder, and Selective-Enc, which augments the encoder with key information selection, performs best.

In contrast, our method attempts to improve the decoder (inference module) by enabling synchronous bidirectional decoding. The last four rows in Table 7 show that the LSTM-based RNMT stably outperforms the self-attention based Transformer, which is quite different from neural machine translation, where the Transformer is the better one. After applying our synchronous bidirectional inference model, BI-RNMT and BIFT achieve significantly better results on the two test sets over RNMT and Transformer respectively, although these two models cannot outperform Selective-Enc on the Gigaword test set. The reason may be that we apply BI-RNMT and BIFT to abstractive summarization without any task-specific adaptation. From another perspective, the two kinds of models improve the encoder and the decoder respectively, and can be complementary to each other. Nevertheless, the statistics given in the table further show the effectiveness of our synchronous bidirectional inference model beyond machine translation.

7.2.2 Some Examples

Input president nelson mandela acknowledged saturday the african national congress violated human rights during apartheid , setting him at odds with his deputy president over a report that has divided much of south africa .

Reference mandela acknowledges human rights violations by african national congress
Transformer mandela acknowledges human rights violation at odds with deputy president
Transformer (R2L) mandela says south africa violated human rights
BIFT mandela says south african national congress violated human rights
Input the new york times said in an editorial on monday , oct. UNK : since the deadly bombing of two american embassies in africa in august , there has been a troubling accumulation of evidence that the state department inexplicably ignored warnings of possible terrorist attacks against the installations .
Reference editorial claims state department ignored warnings of terrorist
Transformer new york times says evidence of terror attacks is ignored
Transformer (R2L) new york times warns of possible terrorist attacks on u.s. embassies in africa
BIFT new york times says state department ignored terror warnings
Table 8: Abstractive summarization examples comparing our synchronous bidirectional inference model to other baselines.

To better understand the models, we further investigate some specific examples which are listed in Table 8. For each example, the input is a long sentence and the output is a title-like summary.

In the first example, human rights and african national congress are two key contents. The baseline Transformer fails to generate african national congress in the tail part, and Transformer (R2L) neglects national congress in the head part. In contrast, our model BIFT renders all of the key points.

As for the second example, the baselines have made similar mistakes. Transformer omits by state department at the end, and the summary generated by Transformer (R2L) expresses the wrong meaning due to the absence of says state department fails to after new york times. However, our model BIFT can generate the summary with correct and complete contents. The examples demonstrate the superiority of our synchronous bidirectional inference model over the modeling of both history and future contexts.

8 Related Work

This work addresses synchronous bidirectional inference for sequence to sequence learning tasks, aiming to take full advantage of history and future predictions on the output. Generally, the related work can be divided into two categories, namely bidirectional inference and future context usage.

Bidirectional inference is well studied in sequential labeling tasks toutanova2003feature ; tsuruoka2005bidirectional ; shen2007guided , in which each input token corresponds to an output label and the output shares the same length as the input. In general, there are multiple ways to decompose the conditional probability for an $n$-token input sequence, since each token can be predicted after its left neighbor (from left to right) or its right neighbor (from right to left). In this way, bidirectional inference is not difficult for sequential labeling. However, it is not trivial to leverage bidirectional inference for sequence generation problems, mainly due to the length nondeterminacy of the output. liu2016agreementa ; liu2016agreementb ; zhang2018regularizing added agreement constraints to enforce the L2R inference output to be consistent with the R2L inference output for sequence generation tasks. serdyuk2018twin proposed the twin network that encourages the target hidden states of the L2R and R2L inferences at the same position to be as close as possible, so as to predict the same token during training. Recently, zhang2018asynchronous introduced an asynchronous bidirectional inference model for neural machine translation: they first obtained the translation hypothesis using R2L inference and then optimized the L2R inference model with the help of the R2L inference result. Despite the performance improvement, all these studies require two individual inference models, making the architecture more complicated. Furthermore, the interactions between L2R and R2L inferences are not adequate; taking the asynchronous bidirectional inference model as an example, the L2R model can utilize the information of the R2L model, but the R2L inference cannot use the L2R predictions. In contrast, our synchronous bidirectional inference model has only a single decoder in which L2R and R2L inferences interact with each other at each decoding time step.

Using future contexts has drawn more and more attention in sequence prediction tasks. Intuitively, an R2L inference model can be employed to re-rank the k-best hypotheses of the L2R inference model, preferring balanced outputs sennrich2016edinburgh ; sennrich2017university ; hoang2017towards ; tan2017xmu ; deng2018alibaba ; liu2018comparable . To use the future context which is unavailable in the conventional inference model, bahdanau2017actor ; he2017decoding ; li2017learning proposed reinforcement learning methods to estimate the possible future information. To mimic human cognitive behaviors, xia2017deliberation presented a deliberation network, which leverages the global information with the help of both forward and backward predictions in sequence generation through a deliberation process. zheng2018modeling introduced two additional recurrent layers to model translated past contents and untranslated future contents. These methods much improve the sequence generation quality, but at the cost of model complexity: they either employ a two-pass decoding strategy or add more layers to the original network. Compared to this kind of work, our proposed model exploits both history and future predictions by allowing L2R and R2L inferences to perform in parallel but interactively. In our previous work zhou2019synchronous , we addressed synchronous bidirectional decoding for neural machine translation. In the current work, we generalize the decoding model into synchronous bidirectional inference for general sequence-to-sequence models (LSTM and Transformer) and general sequence generation tasks (translation and summarization), and we further propose and investigate two optimization strategies to learn the network parameters.

9 Conclusion and Future Work

This work proposes a synchronous bidirectional inference model for sequential generation tasks. We first presented a synchronous bidirectional beam search algorithm for sequence generation, in which left-to-right and right-to-left decoding perform in parallel but interactively. We have exploited the usage of synchronous bidirectional inference model on both LSTM-based and Transformer-based seq2seq architectures. We have also proposed and investigated two parameter optimization strategies. The comprehensive experiments on machine translation and abstractive summarization have demonstrated that our proposed synchronous bidirectional inference model remarkably outperforms the strong baselines. The deep analysis further shows that our model can indeed take full advantage of both history and future predictions during inference.

In future work, we plan to apply our synchronous bidirectional inference model to other sequence generation tasks, such as question answering, chatbots and image captioning.

References

  • (1) I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in: Proceedings of NIPS, 2014.
  • (2) D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in: Proceedings of ICLR, 2015.
  • (3) J. Gehring, M. Auli, D. Grangier, D. Yarats, Y. N. Dauphin, Convolutional sequence to sequence learning, in: Proceedings of ICML, 2017.
  • (4) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, arXiv preprint arXiv:1706.03762.
  • (5) S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (8) (1997) 1735–1780.

  • (6) L. Liu, A. M. Finch, M. Utiyama, E. Sumita, Agreement on target-bidirectional lstms for sequence-to-sequence learning, in: AAAI, 2016, pp. 2630–2637.
  • (7) Z. Zhang, S. Wu, S. Liu, M. Li, M. Zhou, E. Chen, Regularizing neural machine translation by target-bidirectional agreement, arXiv preprint arXiv:1808.04064.
  • (8) L. Liu, M. Utiyama, A. Finch, E. Sumita, Agreement on target-bidirectional neural machine translation, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 411–416.
  • (9) Y. Wang, S. Cheng, L. Jiang, J. Yang, W. Chen, M. Li, L. Shi, Y. Wang, H. Yang, Sogou neural machine translation systems for wmt17, in: Proceedings of the Second Conference on Machine Translation, 2017, pp. 410–415.
  • (10) X. Zhang, J. Su, Y. Qin, Y. Liu, R. Ji, H. Wang, Asynchronous bidirectional decoding for neural machine translation, in: Proceedings of AAAI, 2018.
  • (11) M.-T. Luong, H. Pham, C. D. Manning, Effective approaches to attention-based neural machine translation, in: Proceedings of EMNLP, 2015.
  • (12) R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in: Proceedings of ACL, 2016.
  • (13) S. Shen, Y. Cheng, Z. He, W. He, H. Wu, M. Sun, Y. Liu, Minimum risk training for neural machine translation, in: Proceedings of ACL, 2016.
  • (14) J. Zhou, Y. Cao, X. Wang, P. Li, W. Xu, Deep recurrent models with fast-forward connections for neural machine translation, arXiv preprint arXiv:1606.04199.
  • (15) K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of ACL, 2002, pp. 311–318.
  • (16) P. Koehn, Statistical significance tests for machine translation evaluation., in: Proceedings of EMNLP, 2004, pp. 388–395.
  • (17) Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., Google’s neural machine translation system: Bridging the gap between human and machine translation, arXiv preprint arXiv:1609.08144.
  • (18) C. Napoles, M. Gormley, B. Van Durme, Annotated gigaword, in: Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, 2012, pp. 95–100.
  • (19) A. M. Rush, S. Chopra, J. Weston, A neural attention model for abstractive sentence summarization, in: Proceedings of EMNLP, 2015.
  • (20) Q. Zhou, N. Yang, F. Wei, M. Zhou, Selective encoding for abstractive sentence summarization, in: Proceedings of ACL, 2017.
  • (21) C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization branches out: Proceedings of the ACL-04 workshop, Vol. 8, 2004.

  • (22) R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang, Abstractive text summarization using sequence-to-sequence rnns and beyond, in: Computational Natural Language Learning, 2016.
  • (23) M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, N. Parmar, M. Schuster, Z. Chen, et al., The best of both worlds: Combining recent advances in neural machine translation, in: Proceedings of ACL, 2018, pp. 76–86.
  • (24) K. Toutanova, D. Klein, C. D. Manning, Y. Singer, Feature-rich part-of-speech tagging with a cyclic dependency network, in: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, 2003, pp. 173–180.
  • (25) Y. Tsuruoka, J. Tsujii, Bidirectional inference with the easiest-first strategy for tagging sequence data, in: Proceedings of the conference on human language technology and empirical methods in natural language processing, 2005, pp. 467–474.
  • (26) L. Shen, G. Satta, A. Joshi, Guided learning for bidirectional sequence classification, in: Proceedings of the 45th annual meeting of the association of computational linguistics, 2007, pp. 760–767.
  • (27) D. Serdyuk, N. R. Ke, A. Sordoni, A. Trischler, C. Pal, Y. Bengio, Twin networks: Matching the future for sequence generation, in: International Conference on Learning Representations, 2018.
  • (28) R. Sennrich, B. Haddow, A. Birch, Edinburgh neural machine translation systems for wmt 16, in: Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, 2016, pp. 371–376.
  • (29) R. Sennrich, A. Birch, A. Currey, U. Germann, B. Haddow, K. Heafield, A. V. M. Barone, P. Williams, The university of edinburgh’s neural mt systems for wmt17, in: Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, 2017, pp. 389–399.
  • (30) C. D. V. Hoang, G. Haffari, T. Cohn, Towards decoding as continuous optimisation in neural machine translation, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 146–156.
  • (31) Z. Tan, B. Wang, J. Hu, Y. Chen, X. Shi, Xmu neural machine translation systems for wmt 17, in: Proceedings of the Second Conference on Machine Translation, 2017, pp. 400–404.
  • (32) Y. Deng, S. Cheng, J. Lu, K. Song, J. Wang, S. Wu, L. Yao, G. Zhang, H. Zhang, P. Zhang, et al., Alibaba’s neural machine translation systems for wmt18, in: Proceedings of the Third Conference on Machine Translation: Shared Task Papers, 2018, pp. 368–376.
  • (33) Y. Liu, L. Zhou, Y. Wang, Y. Zhao, J. Zhang, C. Zong, A comparable study on model averaging, ensembling and reranking in nmt, in: CCF International Conference on Natural Language Processing and Chinese Computing, Springer, 2018, pp. 299–308.
  • (34) D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, Y. Bengio, An actor-critic algorithm for sequence prediction, in: Proceedings of ICLR, 2017.
  • (35) D. He, H. Lu, Y. Xia, T. Qin, L. Wang, T.-Y. Liu, Decoding with value networks for neural machine translation, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems 30, 2017, pp. 178–187.
  • (36) J. Li, W. Monroe, D. Jurafsky, Learning to decode for future success, arXiv preprint arXiv:1701.06549.
  • (37) Y. Xia, F. Tian, L. Wu, J. Lin, T. Qin, N. Yu, T.-Y. Liu, Deliberation networks: Sequence generation beyond one-pass decoding, in: Advances in Neural Information Processing Systems, 2017, pp. 1784–1794.
  • (38) Z. Zheng, H. Zhou, S. Huang, L. Mou, X. Dai, J. Chen, Z. Tu, Modeling past and future for neural machine translation, in: TACL, Vol. 6, 2018, pp. 145–157.
  • (39) L. Zhou, J. Zhang, C. Zong, Synchronous bidirectional neural machine translation, in: TACL, 2019.