1 Introduction
Many tasks in natural language processing, such as machine translation, abstractive summarization and chatbot response generation, can be formalized as a sequence-to-sequence (seq2seq) generation problem, which takes a sequence as input (e.g., a source-language sentence) and produces another sequence as output (e.g., the target-language translation). Generally, the seq2seq framework performs inference in a left-to-right (L2R) manner and predicts the current output token conditioned on previously generated tokens. Existing methods mainly focus on how to fully exploit the already predicted outputs on the left, and recently proposed neural architectures for sequence generation, including recurrent networks
sutskever2014sequence ; bahdanau2015neural , convolutional networks gehring2017convolutional and self-attention ones (known as the Transformer) vaswani2017attention facilitate the exploration of all the history information during inference.

Table 1: Matching accuracy of the first four and last four tokens for L2R and R2L inference.

Architecture  Direction  First Four  Last Four
LSTM  L2R  36.35%  31.64%
LSTM  R2L  31.22%  34.01%
Transformer (Self-Attention Network)  L2R  40.21%  35.10%
Transformer (Self-Attention Network)  R2L  35.67%  39.47%
However, conventional seq2seq models cannot access the future predictions on the right and usually produce unbalanced outputs in which the left parts are much more accurate than the right ones. The phenomenon is mirrored for right-to-left (R2L) inference, where the right parts are better. In order to gain a more intuitive understanding, we investigated both L2R and R2L inference using LSTM hochreiter1997long and self-attention networks (SAN) vaswani2017attention on a typical sequence generation task, namely machine translation. Table 1 shows the matching accuracy of the first and last four tokens between model predictions and references. It is obvious that left-to-right inference performs much better on predictions of head tokens, while right-to-left inference excels at tail token predictions. Intuitively, it is a promising direction to combine the merits of bidirectional inference and make full use of both history and future contexts.
Researchers have made great efforts to take advantage of both L2R and R2L inference. liu2016agreementb ; zhang2018regularizing enforce the agreement between L2R and R2L predictions during training, so that L2R inference is improved accordingly. liu2016agreementa ; wang2017sogou employ an R2L model to rerank the best hypotheses of the L2R model. zhang2018asynchronous first obtains the R2L outputs and then optimizes the L2R inference model based on both the original input and the R2L outputs. Despite the performance improvements, these approaches suffer from two issues. On the one hand, they have to train two separate seq2seq models for L2R and R2L inference respectively. On the other hand, the two models cannot interact with each other during inference.
In this article, we propose a synchronous bidirectional inference model that produces outputs using both L2R and R2L decoding simultaneously and interactively. We first introduce a novel beam search algorithm to accommodate L2R and R2L inference at the same time. At each timestep during inference, each half of the beam retains the hypotheses from L2R and R2L inference respectively, and each hypothesis is generated by utilizing already predicted outputs from both directions. The interaction between L2R and R2L inference is realized through a synchronous attention model that attempts to leverage both the history and future sequential predictions simultaneously during inference. Fig.
1 gives a simple illustration of the proposed synchronous bidirectional inference model. The middle part in color on the right of Fig. 1 is the core of our model. L2R and R2L inference interact with each other in an implicit way illustrated by the colored part. The arrows indicate the information-passing flow. Solid arrows in black show the conventional history-context dependence, while dotted arrows in color introduce the future-context dependence on the other inference direction. For example, besides its own past predictions, L2R inference can also utilize the future contexts generated by the R2L inference when predicting the current token.

As we mentioned above, there are many sequence-to-sequence models. To test the generalization capacity of our model, we apply the proposed synchronous bidirectional inference model to two representative seq2seq frameworks using LSTM and self-attention networks. Furthermore, we propose two optimization strategies to train the network parameters. We choose machine translation and abstractive sentence summarization as the testbed to verify the effectiveness of the models. Extensive experiments demonstrate that our proposed model remarkably outperforms the strong baselines.
2 Synchronous Bidirectional Inference
The task of sequence-to-sequence learning is to find the most probable output sequence $\mathbf{y}=(y_1,\ldots,y_n)$ which maximizes the following conditional probability given the input sequence $\mathbf{x}=(x_1,\ldots,x_m)$:
$$\mathbf{y}^{*}=\operatorname*{arg\,max}_{\mathbf{y}}P(\mathbf{y}|\mathbf{x})\qquad(1)$$
Unlike sequential labeling tasks, in which $\mathbf{y}$ shares the same length as $\mathbf{x}$, the output length $n$ is unknown until the inference process ends, and in most cases the token counts differ between output and input sequences in seq2seq learning tasks. For instance, the resulting summary (output sequence) should be much shorter than the original text (input sequence) in the summarization task.
2.1 Unidirectional Inference
Conventionally, Equation 1 is decomposed in a left-to-right manner as follows:
$$P(\mathbf{y}|\mathbf{x})=\prod_{i=1}^{n}P(y_i|y_{<i},\mathbf{x})\qquad(2)$$
Since the search space (the target vocabulary) is very large and contains tens of thousands of entries in most cases, a beam search algorithm is usually employed to approximately find the most probable output sequence according to the history predictions $y_{<i}$ and the input sequence $\mathbf{x}$. Currently, neural methods such as LSTM, convolutional networks and self-attention ones can model the conditional probability more and more accurately, but they leave the future contexts unexplored.
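To make the left-to-right factorization of Equation 2 concrete, the score of a full output sequence is simply the sum of per-step conditional log-probabilities. A toy sketch (the `cond_prob` callable is a hypothetical stand-in for a trained model's softmax output, not the authors' implementation):

```python
import math

def score_l2r(tokens, cond_prob):
    """Score an output sequence under the L2R factorization
    P(y|x) = prod_i P(y_i | y_<i, x) (Equation 2), in log space.

    cond_prob(prefix, token) stands in for the model's conditional
    distribution given the already-generated prefix.
    """
    logp = 0.0
    for i, tok in enumerate(tokens):
        logp += math.log(cond_prob(tokens[:i], tok))
    return logp

# Toy "model": every step assigns probability 0.25 to any token.
uniform = lambda prefix, tok: 0.25
print(score_l2r(["a", "b", "c"], uniform))  # equals 3 * log(0.25)
```

Beam search then keeps only the top-scoring prefixes at each step rather than enumerating the exponential space of sequences.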
In order to leverage the right-hand information, Equation 1 can also be decomposed in a right-to-left manner as follows:
$$P(\mathbf{y}|\mathbf{x})=\prod_{i=1}^{n}P(y_i|y_{>i},\mathbf{x})\qquad(3)$$
Using this decomposition, the right-side hypotheses $y_{>i}$ are available when predicting $y_i$, while the left-side predictions are still missing.
2.2 Synchronous Bidirectional Beam Search
Ideally, we expect to utilize both the past and future contexts ($y_{<i}$ and $y_{>i}$) when determining the best prediction of $y_i$. However, this is contradictory to some extent: predicting $y_i$ needs the future outputs $y_{>i}$ on the right, while determining $y_{>i}$ in turn requires the predictions on the left. Obviously, it is impractical to use the whole contexts of both sides ($y_{<i}$ and $y_{>i}$) in a single inference model. We take a step back and attempt to explore as much of the bidirectional contexts as possible, if not all.
We propose a synchronous bidirectional inference model in which left-to-right and right-to-left inference perform in parallel while interacting with each other. In this way, Equation 1 is decomposed as follows:
$$P(\mathbf{y}|\mathbf{x})=\prod_{i}P(\overrightarrow{y}_i|\overrightarrow{y}_{<i},\overleftarrow{y}_{<i},\mathbf{x})\cdot P(\overleftarrow{y}_i|\overleftarrow{y}_{<i},\overrightarrow{y}_{<i},\mathbf{x})\qquad(4)$$
Equation 4 says that the bidirectional inference model accommodates L2R and R2L decoding at the same time. At timestep $i$, we have already generated the left hypotheses $\overrightarrow{y}_{<i}$ with L2R inference and the right predictions $\overleftarrow{y}_{<i}$ with R2L inference. Thus, different from Equation 2 and Equation 3, predictions from both sides can be utilized as contexts in the above bidirectional decomposition.
Algorithm 1 shows the beam search procedure of the synchronous bidirectional inference model. The working flow is similar to unidirectional beam search. We keep three kinds of lists: one stores the complete hypotheses, and the other two maintain the temporary and partial hypotheses at each decoding timestep for the L2R and R2L inferences respectively. Lines 2–9 in Algorithm 1 form the main part of the beam search algorithm. At timestep $i$, L2R and R2L inference perform in parallel but interactively to expand the partial hypotheses from the previous timestep (lines 3–4). Then, the complete hypothesis list and the partial hypothesis lists are updated according to the temporary hypothesis lists (lines 5–6).
Algorithm 2 and Algorithm 3 respectively detail the hypothesis expansion procedure and the hypothesis update process. Algorithm 3 is trivial and is the same as in conventional unidirectional beam search. The algorithm ExpandHypo is the key to synchronous bidirectional inference (SBInfer). In Algorithm 2, for each partial hypothesis, we calculate the probability of every candidate token in the target vocabulary by utilizing both the history context and the future information through the function SBInfer.
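Under simplifying assumptions, the overall search procedure can be sketched as follows. The `step_l2r` and `step_r2l` callables are hypothetical stand-ins for SBInfer (they map a hypothesis plus the other direction's best context to scored candidate tokens), and plain lists replace whatever data structures the original algorithms use; this is an illustrative sketch, not the authors' implementation.

```python
import heapq

def sb_beam_search(step_l2r, step_r2l, beam=4, max_len=10, eos="</s>"):
    """Sketch of synchronous bidirectional beam search (Algorithms 1-3).

    Each half of the beam holds L2R / R2L hypotheses respectively; every
    expansion sees the best hypothesis of the opposite direction.
    """
    l2r = [(0.0, [])]          # partial L2R hypotheses: (score, tokens)
    r2l = [(0.0, [])]          # partial R2L hypotheses (tokens reversed)
    complete = []              # finished hypotheses
    for _ in range(max_len):
        ctx_for_l2r = max(r2l)[1]   # other-direction context (simplified)
        ctx_for_r2l = max(l2r)[1]
        # ExpandHypo: grow every partial hypothesis with scored candidates.
        new_l2r = [(s + lp, h + [t]) for s, h in l2r
                   for t, lp in step_l2r(h, ctx_for_l2r)]
        new_r2l = [(s + lp, h + [t]) for s, h in r2l
                   for t, lp in step_r2l(h, ctx_for_r2l)]
        # UpdateHypo: each direction keeps half of the beam; finished
        # hypotheses move to the complete list.
        l2r, r2l = [], []
        for s, h in heapq.nlargest(beam // 2, new_l2r):
            (complete if h[-1] == eos else l2r).append((s, h))
        for s, h in heapq.nlargest(beam // 2, new_r2l):
            (complete if h[-1] == eos else r2l).append((s, h))
        if not l2r or not r2l:
            break
    return max(complete + l2r + r2l)
```

With a toy step function that always offers a continuation token and an end-of-sequence token, the search returns the highest-scoring finished hypothesis, as expected.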
Obviously, the function SBInfer is the most important part and requires a specific design for each seq2seq architecture. Next, we introduce how to implement SBInfer for both LSTM-based and self-attention-based seq2seq networks.
3 Synchronous Bidirectional Inference for LSTM-based Seq2Seq Framework
No matter what kind of network architecture is used, all seq2seq frameworks consist of an encoder and a decoder. Given an input sequence $\mathbf{x}=(x_1,\ldots,x_m)$, the encoder transforms it into a sequence of abstract context representations $\mathbf{h}=(\mathbf{h}_1,\ldots,\mathbf{h}_m)$ whose size is the same as the length of the input text. Then, from the context vectors $\mathbf{h}$, the decoder generates the output sequence one token at a time by maximizing the probability of $y_j$ with a left-to-right inference model.

Hereafter, we use $\mathbf{x}_i$ and $\mathbf{y}_j$ to denote the word embeddings corresponding to the input and output tokens $x_i$ and $y_j$. Next, we briefly review the encoder (introducing how to obtain $\mathbf{h}$) and the decoder (addressing how to calculate $P(y_j|y_{<j},\mathbf{x})$) for the conventional LSTM-based seq2seq architecture. Then, we propose to enable synchronous bidirectional inference (SBInfer) in the LSTM-based architecture.
3.1 LSTM-based Seq2Seq Framework
The encoder employs $L$ stacked LSTM layers to learn the context vectors $\mathbf{h}$. In the $l$-th layer ($l>1$), the hidden state $\mathbf{h}_i^{l}$ is calculated as follows:
$$\mathbf{h}_i^{l}=\mathrm{LSTM}(\mathbf{h}_{i-1}^{l},\mathbf{h}_i^{l-1})\qquad(5)$$
In the first layer ($l=1$), $\mathbf{h}_i^{1}$ is obtained through a bidirectional LSTM:
$$\overrightarrow{\mathbf{h}}_i^{1}=\mathrm{LSTM}(\overrightarrow{\mathbf{h}}_{i-1}^{1},\mathbf{x}_i),\quad \overleftarrow{\mathbf{h}}_i^{1}=\mathrm{LSTM}(\overleftarrow{\mathbf{h}}_{i+1}^{1},\mathbf{x}_i)\qquad(6)$$
$$\mathbf{h}_i^{1}=[\overrightarrow{\mathbf{h}}_i^{1};\overleftarrow{\mathbf{h}}_i^{1}]\qquad(7)$$
The decoder computes the conditional probability $P(y_j|y_{<j},\mathbf{x})$ with the help of the attention mechanism bahdanau2015neural , which leverages a different input context at each decoding timestep:
$$P(y_j|y_{<j},\mathbf{x})=\mathrm{softmax}(\mathbf{W}_o\mathbf{z}_j)\qquad(8)$$
where $\mathbf{z}_j$ is the attention output:
$$\mathbf{z}_j=\tanh(\mathbf{W}_z[\mathbf{s}_j^{L};\mathbf{c}_j])\qquad(9)$$
in which $\mathbf{s}_j^{L}$ is the top hidden state of the decoder network, and the hidden state $\mathbf{s}_j^{l}$ in the $l$-th layer ($l>1$) is computed using the following formula:
$$\mathbf{s}_j^{l}=\mathrm{LSTM}(\mathbf{s}_{j-1}^{l},\mathbf{s}_j^{l-1})\qquad(10)$$
If $l=1$, $\mathbf{s}_j^{1}$ is calculated by combining $\mathbf{z}_{j-1}$ as feed input luong2015effective :
$$\mathbf{s}_j^{1}=\mathrm{LSTM}(\mathbf{s}_{j-1}^{1},[\mathbf{y}_{j-1};\mathbf{z}_{j-1}])\qquad(11)$$
The dynamic context $\mathbf{c}_j$ is the weighted sum of the source-side context vectors $\mathbf{h}=(\mathbf{h}_1,\ldots,\mathbf{h}_m)$ and is calculated by the attention model:
$$\mathbf{c}_j=\sum_{i=1}^{m}\alpha_{ji}\mathbf{h}_i\qquad(12)$$
where $\alpha_{ji}$ is a normalized weight calculated as follows:
$$\alpha_{ji}=\frac{\exp(e_{ji})}{\sum_{i'=1}^{m}\exp(e_{ji'})}\qquad(13)$$
$$e_{ji}=\mathbf{v}_a^{\top}\tanh(\mathbf{W}_a\mathbf{s}_j^{L}+\mathbf{U}_a\mathbf{h}_i)\qquad(14)$$
The greater the value of $\alpha_{ji}$, the more the $i$-th input token contributes to the generation of the $j$-th output word. The left part of Fig. 2 gives the overall illustration of this unidirectional inference model for the LSTM-based seq2seq framework. Note that residual connections and layer normalization are employed as well; they are omitted from the description for simplicity.
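The additive attention of Equations 12–14 can be sketched in a few lines of NumPy. This is a minimal single-query version with randomly initialized, hypothetical parameter matrices (`W`, `U`, `v`), not the authors' implementation:

```python
import numpy as np

def additive_attention(s, H, W, U, v):
    """Additive attention (Equations 12-14): score each encoder state h_i
    against decoder state s_j, normalize with a softmax, and return the
    weighted sum c_j. Shapes: s (d,), H (m, d), W and U (d, d), v (d,)."""
    e = np.tanh(s @ W.T + H @ U.T) @ v        # scores e_{ji}, shape (m,)
    a = np.exp(e - e.max())                   # numerically stable softmax
    a /= a.sum()                              # weights alpha_{ji}
    return a @ H                              # context c_j = sum_i a_i h_i

rng = np.random.default_rng(0)
d, m = 8, 5
c = additive_attention(rng.normal(size=d), rng.normal(size=(m, d)),
                       rng.normal(size=(d, d)), rng.normal(size=(d, d)),
                       rng.normal(size=d))
print(c.shape)  # (8,)
```

The returned context vector has the same dimensionality as a single encoder state, since it is a convex combination of the rows of `H`.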
3.2 Synchronous Bidirectional Inference for LSTM-based Architecture
In synchronous bidirectional inference, the output probability is calculated with both history and future contexts according to Equation 4: $P(y_j|\overrightarrow{y}_{<j},\overleftarrow{y}_{<j},\mathbf{x})$. The previous section introduced the way to use the input and history contexts in Equation 8: $P(y_j|y_{<j},\mathbf{x})=\mathrm{softmax}(\mathbf{W}_o\mathbf{z}_j)$. Synchronous bidirectional inference adopts the same mechanism:
$$P(y_j|\overrightarrow{y}_{<j},\overleftarrow{y}_{<j},\mathbf{x})=\mathrm{softmax}(\mathbf{W}_o\mathbf{z}_j)\qquad(15)$$
Different from unidirectional inference, synchronous bidirectional inference calculates the attention output with both L2R and R2L predictions:
$$\mathbf{z}_j=\tanh(\mathbf{W}_z[\mathbf{s}_j^{L};\mathbf{c}_j;\mathbf{d}_j])\qquad(16)$$
where the future context $\mathbf{d}_j$ is obtained using another attention model over the hidden states of the R2L inference, as illustrated with green arrows in Fig. 2:
$$\mathbf{d}_j=\sum_{k<j}\beta_{jk}\overleftarrow{\mathbf{s}}_k^{L}\qquad(17)$$
in which $\beta_{jk}$ is a normalized coefficient:
$$\beta_{jk}=\frac{\exp(e_{jk})}{\sum_{k'<j}\exp(e_{jk'})}\qquad(18)$$
$$e_{jk}=\mathbf{v}_b^{\top}\tanh(\mathbf{W}_b\mathbf{s}_j^{L}+\mathbf{U}_b\overleftarrow{\mathbf{s}}_k^{L})\qquad(19)$$
It should be noted that L2R and R2L inference perform simultaneously in parallel. Thus, when calculating $\mathbf{z}_j$ for the L2R direction, we can compute its R2L counterpart $\overleftarrow{\mathbf{z}}_j$ at the same time in a similar way:
$$P(\overleftarrow{y}_j|\overleftarrow{y}_{<j},\overrightarrow{y}_{<j},\mathbf{x})=\mathrm{softmax}(\mathbf{W}_o\overleftarrow{\mathbf{z}}_j)\qquad(20)$$
$$\overleftarrow{\mathbf{z}}_j=\tanh(\mathbf{W}_z[\overleftarrow{\mathbf{s}}_j^{L};\overleftarrow{\mathbf{c}}_j;\overleftarrow{\mathbf{d}}_j])\qquad(21)$$
where the left context $\overleftarrow{\mathbf{d}}_j$ is obtained using a similar attention model over the L2R hidden states, as illustrated with blue arrows in Fig. 2:
$$\overleftarrow{\mathbf{d}}_j=\sum_{k<j}\gamma_{jk}\overrightarrow{\mathbf{s}}_k^{L}\qquad(22)$$
4 Synchronous Bidirectional Inference for Self-attention based Framework
The self-attention based seq2seq framework is known as the Transformer vaswani2017attention . In this section, we first give an overview of the Transformer and then propose the implementation of synchronous bidirectional inference (SBInfer) in the Transformer.
4.1 Transformer
The Transformer also follows the encoder-decoder architecture. The encoder includes $N$ identical layers, and each layer is composed of two sublayers: the self-attention sublayer followed by the feed-forward sublayer.

The decoder also consists of $N$ identical layers. Each layer has three sublayers: the first is the masked self-attention mechanism, the second is the decoder-encoder attention sublayer, and the third is the feed-forward sublayer (residual connection and layer normalization are performed for each sublayer in both the encoder and the decoder).

Obviously, the key component is the attention mechanism (in fact, multi-head attention is employed; we introduce only basic attention for simplicity). The three kinds of attention mechanisms can be formalized into the same formula:
$$\mathrm{Attention}(\mathbf{q},\mathbf{K},\mathbf{V})=\mathrm{softmax}\!\left(\frac{\mathbf{q}\mathbf{K}^{\top}}{\sqrt{d_k}}\right)\mathbf{V}\qquad(23)$$
where $\mathbf{q}$, $\mathbf{K}$ and $\mathbf{V}$ stand for a query, the key list and the value list respectively, and $d_k$ is the dimension of the keys.
For the self-attention in the encoder, the queries, keys and values come from the same layer. For example, suppose we calculate the output of the first layer in the encoder at the $i$-th position. The query is the vector $\mathbf{x}_i$ (the sum of the input token embedding and the positional embedding). The keys and values are the same: both are the embedding matrix $\mathbf{X}=(\mathbf{x}_1,\ldots,\mathbf{x}_m)$. Using Equation 23 followed by a feed-forward network, we get the representation of the second layer. After $N$ layers, we obtain the input contexts $\mathbf{H}=(\mathbf{h}_1,\ldots,\mathbf{h}_m)$.
The masked self-attention in the decoder is similar to that of the encoder, except that the query at the $j$-th position can only attend to the positions before it, since the predictions after the $j$-th position are not available in autoregressive unidirectional inference:
$$\mathbf{s}_j=\mathrm{Attention}(\mathbf{q}_j,\mathbf{K}_{<j},\mathbf{V}_{<j})\qquad(24)$$
The decoder-encoder attention mechanism is the same as that of the LSTM-based seq2seq architecture: the query is the output of the masked self-attention sublayer $\mathbf{s}_j$, and the keys and values are both the encoder contexts $\mathbf{H}$. The feed-forward sublayer is then applied to yield the output of a whole layer. After $N$ such layers, we obtain the final hidden state $\mathbf{s}_j^{N}$. The softmax function (Equation 8) is then employed to predict the output $y_j$. The left part of Fig. 3 depicts the overall architecture of the Transformer.
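The masked scaled dot-product attention of Equations 23–24 can be illustrated with a minimal single-head NumPy sketch. The masking trick (setting future scores to negative infinity before the softmax) is standard Transformer practice; the variable names here are ours:

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Masked scaled dot-product attention: position j may only attend to
    positions up to j, enforced by setting future scores to -inf before
    the softmax. Q, K, V all have shape (n, d_k)."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                         # (n, n)
    causal = np.tril(np.ones((n, n), dtype=bool))           # lower triangle
    scores = np.where(causal, scores, -np.inf)              # mask the future
    w = np.exp(scores - scores.max(axis=-1, keepdims=True)) # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

X = np.random.default_rng(1).normal(size=(4, 6))
out = masked_self_attention(X, X, X)
# The first position can only attend to itself, so its output equals V[0].
print(np.allclose(out[0], X[0]))  # True
```

Dropping the mask (or replacing the keys and values with the other decoder's states, as in the next subsection) turns the same routine into unmasked cross-direction attention.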
4.2 Synchronous Bidirectional Inference for Transformer
In synchronous bidirectional inference, the essential difference lies in the improvement over the masked self-attention mechanism in the decoder. In the standard Transformer, the masked self-attention model calculates its output using only the history contexts. In contrast, synchronous bidirectional inference performs L2R and R2L decoding in parallel and interactively. At the $j$-th timestep, the L2R and R2L inferences have already generated the outputs $\overrightarrow{y}_{<j}$ and $\overleftarrow{y}_{<j}$. Therefore, both $\overrightarrow{y}_{<j}$ and $\overleftarrow{y}_{<j}$ can be employed to compute the hidden state.

Accordingly, we design two self-attention mechanisms to handle the history contexts and the future contexts respectively. In addition to Equation 24, which utilizes the history information (we denote its output $\overrightarrow{\mathbf{s}}_j$ in the bidirectional setting), we propose another self-attention mechanism to leverage the future information generated by the opposite inference direction:
$$\overleftarrow{\mathbf{s}}_j=\mathrm{Attention}(\mathbf{q}_j,\overleftarrow{\mathbf{K}}_{<j},\overleftarrow{\mathbf{V}}_{<j})\qquad(25)$$
where $\mathbf{q}_j$ is the query (i.e., the embedding at the $j$-th output position) from the L2R decoder, and $\overleftarrow{\mathbf{K}}_{<j}$ and $\overleftarrow{\mathbf{V}}_{<j}$ are the keys and values (i.e., the embeddings of the previous predictions $\overleftarrow{y}_{<j}$) from the R2L decoder. Fig. 3 illustrates how synchronous bidirectional inference performs. When producing $y_j$, the orange lines denote the original masked self-attention over the history, while the green lines indicate the self-attention over the future contexts.
Finally, we introduce a function to combine $\overrightarrow{\mathbf{s}}_j$ and $\overleftarrow{\mathbf{s}}_j$ into a new representation that encodes both past and future contexts:
$$\mathbf{s}_j=\overrightarrow{\mathbf{s}}_j+\lambda\cdot\overleftarrow{\mathbf{s}}_j\qquad(26)$$
where $\lambda$ is a single learnable interpolation parameter.
For R2L decoding, the combined representation can be calculated similarly in parallel:
$$\overrightarrow{\mathbf{s}}'_j=\mathrm{Attention}(\overleftarrow{\mathbf{q}}_j,\overrightarrow{\mathbf{K}}_{<j},\overrightarrow{\mathbf{V}}_{<j})\qquad(27)$$
$$\mathbf{s}'_j=\overleftarrow{\mathbf{s}}'_j+\lambda\cdot\overrightarrow{\mathbf{s}}'_j\qquad(28)$$
where $\overleftarrow{\mathbf{s}}'_j$ is the R2L decoder's own masked self-attention output over $\overleftarrow{y}_{<j}$, and $\overrightarrow{\mathbf{s}}'_j$ attends to the predictions of the L2R decoder.
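The fusion of history and future representations can be sketched as follows. The linear form with a single scalar `lam` is our reading of Equation 26 in light of the model-size analysis later in the paper (the bidirectional model adds only one parameter); treat the exact functional form as an assumption:

```python
import numpy as np

def combine(s_history, s_future, lam=0.5):
    """Fuse the masked self-attention output over the history with the
    attention output over the other direction's predictions, using a
    single scalar interpolation parameter lam (assumed linear form)."""
    return s_history + lam * s_future

h = np.ones(4)          # toy history representation
f = np.full(4, 2.0)     # toy future representation
print(combine(h, f, lam=0.5))  # [2. 2. 2. 2.]
```

Because `lam` is a scalar, this fusion leaves the parameter count of the rest of the network unchanged.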
5 Training
Since our synchronous bidirectional inference performs L2R and R2L decoding in parallel, the L2R decoder aims to generate the gold reference $\overrightarrow{\mathbf{y}}$ during training while the R2L decoder attempts to produce the reversed gold reference $\overleftarrow{\mathbf{y}}$ at the same time. Given training data consisting of $T$ sentence pairs $\{(\mathbf{x}^{(t)},\mathbf{y}^{(t)})\}_{t=1}^{T}$, the objective is to maximize the log-likelihood over the training data:
$$\mathcal{L}(\theta)=\sum_{t=1}^{T}\left[\log P(\overrightarrow{\mathbf{y}}^{(t)}|\mathbf{x}^{(t)};\theta)+\log P(\overleftarrow{\mathbf{y}}^{(t)}|\mathbf{x}^{(t)};\theta)\right]\qquad(29)$$
When calculating $P(\overrightarrow{y}_j|\overrightarrow{y}_{<j},\overleftarrow{y}_{<j},\mathbf{x})$, the L2R decoder usually employs the gold reference as the condition $\overrightarrow{y}_{<j}$. In synchronous bidirectional inference, a problem arises if we directly utilize the gold reference from the other side as $\overleftarrow{y}_{<j}$: for some positions $j$, the reversed gold prefix $\overleftarrow{y}_{<j}$ already contains the token $y_j$, which means $y_j$ would be used to predict itself. Obviously, this is not reasonable. To address this issue during training, we propose two optimization strategies to learn the network parameters.
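A tiny example makes the overlap concrete. Assuming the R2L context at step j is simply the first j-1 tokens of the reversed reference (a simplified reading of the setup), the context available when predicting the fourth token of a five-token sentence already contains that very token:

```python
def r2l_gold_prefix(y, j):
    """Reversed gold reference available at step j: the first j-1
    tokens of y read from the end (simplified illustration)."""
    return list(reversed(y))[: j - 1]

y = ["y1", "y2", "y3", "y4", "y5"]
# When predicting y4 (j = 4), the naive R2L gold context already
# includes y4 itself -- the token we are trying to predict.
print(r2l_gold_prefix(y, 4))          # ['y5', 'y4', 'y3']
print("y4" in r2l_gold_prefix(y, 4))  # True
```

Replacing the reversed gold reference with model-generated R2L predictions, as both training strategies below do, removes this leakage.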
5.1 Two-pass Training
In the first training pass, we learn independent L2R and R2L inference models on the training data. Then, the L2R and R2L models are employed to decode the input sentences of the training data, resulting in predictions $\hat{\overrightarrow{\mathbf{y}}}$ and $\hat{\overleftarrow{\mathbf{y}}}$ respectively. During the second training pass, $P(\overrightarrow{y}_j|\cdot)$ is calculated using $\hat{\overleftarrow{y}}_{<j}$, indicating that the future context consists of model predictions rather than the gold reference. Similarly, we calculate $P(\overleftarrow{y}_j|\cdot)$ using $\hat{\overrightarrow{y}}_{<j}$.
5.2 Fine-tuning Strategy
In the fine-tuning strategy, we first train the parallel inference for L2R and R2L without interaction, as Equation 30 shows. Each training instance for this step is a triple $(\mathbf{x},\overrightarrow{\mathbf{y}},\overleftarrow{\mathbf{y}})$:
$$\mathcal{L}(\theta)=\sum_{t=1}^{T}\left[\log P(\overrightarrow{\mathbf{y}}^{(t)}|\mathbf{x}^{(t)};\theta)+\log P(\overleftarrow{\mathbf{y}}^{(t)}|\mathbf{x}^{(t)};\theta)\right]\qquad(30)$$
where each direction conditions only on its own previous predictions.
After this simple training procedure converges, we use the model to decode a small subset of the source sentences in the original training data (e.g., 10% of the dataset) and obtain new triples $(\mathbf{x},\hat{\overrightarrow{\mathbf{y}}},\hat{\overleftarrow{\mathbf{y}}})$. Then, we fine-tune our synchronous bidirectional inference model in a way similar to the second pass of the two-pass training strategy.
Compared to the two-pass training strategy, the fine-tuning strategy is much cheaper, since there is no need to train two separate models, to decode the entire training set, or to perform a second training pass over the whole dataset. In the experiments, we mainly employ the two-pass strategy and compare the two strategies in the analysis part.
6 Experimental Setup
In our experiments, two typical seq2seq tasks, machine translation and abstractive summarization, are employed to test the effectiveness of our synchronous bidirectional inference model.
6.1 Machine Translation
6.1.1 Dataset
We evaluate the proposed synchronous bidirectional inference model on both Chinese-to-English and English-to-German translation tasks. For the Chinese-to-English task, the training data consists of about 2.1M sentence pairs extracted from LDC corpora (LDC2000T50, LDC2002L27, LDC2002T01, LDC2002E18, LDC2003E07, LDC2003E14, LDC2003T17, LDC2004T07). We choose the NIST 2002 (MT02) dataset for validation. For testing, we employ the NIST 2003–2006 (MT03–06) datasets. We apply Byte-Pair Encoding (BPE) sennrich2016neural with 30K merge operations and limit the source and target vocabularies to the most frequent 30K tokens.
For the English-to-German task, we utilize the same subset of the WMT 2014 training corpus employed by vaswani2017attention ; luong2015effective ; shen2016minimum ; zhou2016deep . It contains 4.5M sentence pairs (the preprocessed dataset and vocabulary can be downloaded from the tensor2tensor website: https://drive.google.com/open?id=0B_bZckksdkpM25jRUN2X2UxMm8). The concatenation of newstest 2012 and newstest 2013 is used as the validation set, and newstest 2014 is employed as the test set.
6.1.2 Training and Evaluation Details
For our synchronous bidirectional inference model with the LSTM-based architecture, we implement the system by reusing and modifying the open-source toolkit Zoph_RNN (https://github.com/isi-nlp/Zoph_RNN), which is written in C++/CUDA and provides efficient training across multiple GPUs. The encoder includes two stacked LSTM layers, and the first layer employs bidirectional LSTMs. The decoder also contains two stacked LSTM layers followed by the softmax layer. The dimension of the word embeddings and the size of the hidden layers are all set to 1000. The dropout rate is set to 0.2. At test time, we employ beam search.

For the synchronous bidirectional inference model with the Transformer, we modify the tensor2tensor toolkit (https://github.com/tensorflow/tensor2tensor) for training and evaluation. We employ the Adam optimizer with $\beta_1=0.9$ and $\beta_2=0.998$. The warmup and decay strategy for the learning rate is the same as in vaswani2017attention , with 16,000 warmup steps. During training, we employ label smoothing with value 0.1. For evaluation, we use beam search with a length penalty of 0.6. Additionally, we use 6 encoder and decoder layers; in each layer, we employ 16 attention heads and a 4096-dimensional feed-forward inner layer, with a dropout rate of 0.1. Our settings are close to the transformer_big setting defined in vaswani2017attention . We employ three Titan Xp GPUs to train English-to-German translation and one GPU for Chinese-to-English. In addition, we average the last 20 checkpoints to obtain the final model for English-to-German but do not perform checkpoint averaging for Chinese-to-English.
We evaluate the final translation quality with case-insensitive BLEU papineni2002bleu for Chinese-to-English and with case-sensitive BLEU for English-to-German. Significance tests are performed using the pairwise resampling approach koehn2004sig .
6.1.3 Translation Systems
We use BIRNMT to denote our proposed synchronous bidirectional inference model implemented in LSTM-based recurrent neural machine translation. The proposed synchronous bidirectional inference model for the Transformer is named BIFT (our code is freely available on GitHub: https://github.com/ZNLP/sbnmt).

We compare the proposed models against the following state-of-the-art NMT systems:

RNMT wu2016google : it is a state-of-the-art LSTM-based NMT system with the same setting as BIRNMT.

RNMT (R2L): it is a variant of RNMT and produces translations from right to left.

Transformer: it is the state-of-the-art machine translation system with the self-attention mechanism, using the default left-to-right generation vaswani2017attention .

Transformer (R2L): it is a variant of the Transformer which performs translation in a right-to-left manner.

RerankNMT: following liu2016agreementa , we first run beam search for the L2R and R2L inference models independently to obtain two k-best lists, and then rescore the union of these two k-best lists. This method assumes that some source sentences are better translated from left to right, while others are better translated from right to left.

ABDNMT: it is an asynchronous bidirectional inference model for NMT that performs L2R inference with the results generated by the R2L inference model zhang2018asynchronous . During inference, a two-pass decoding scheme is employed: first, the R2L inference model generates the backward hidden states and the corresponding translation results; then, ABDNMT optimizes the L2R inference with the help of the backward hidden states.
For a fair comparison, RerankNMT and ABDNMT are both reimplemented based on the strong Transformer model.
6.2 Abstractive Summarization
6.2.1 Dataset
Abstractive sentence summarization is a task that generates a title-like summary for a long sentence. Our training data is a (text, summary) parallel corpus from the Annotated English Gigaword dataset napoles2012annotated ; rush2015neural . It contains about 3.8M text-summary pairs for training and 189K pairs for validation. The encoder and decoder share the same vocabulary of about 90K word types.
For the test set, we use both DUC 2004 and the English Gigaword. The test set of DUC 2004 contains 500 examples, and each example pairs a document with four different human-written reference summaries. For the test set of the English Gigaword, we employ the same randomly selected subset of 2000 text-summary pairs as rush2015neural ; zhou2017selective .
6.2.2 Training and Evaluation Details
For both the LSTM-based architecture and the Transformer framework, we use the same model settings as in neural machine translation.
For evaluation, we use ROUGE lin2004rouge as the metric. ROUGE measures the quality of a summary by computing overlapping lexical units, such as unigrams, bigrams, trigrams, and the longest common subsequence (LCS). It has become the standard evaluation metric for the DUC shared tasks and is popular for summarization evaluation. Following previous work, we use ROUGE-1 (unigram), ROUGE-2 (bigram) and ROUGE-L (LCS) as the evaluation metrics in the reported experimental results.
6.2.3 Summarization Systems
We compare our proposed model with the following stateoftheart baselines.
ABS: rush2015neural first proposed the abstractive summarization task and used an attentive CNN encoder and an NNLM (neural network language model) decoder to perform it.
LSTMSum: it is the abstractive summarization system with the same architecture as RNMT, in which the encoder and decoder are both LSTM-based recurrent neural networks.

Feats2S: it is also an RNN encoder-decoder model using gated recurrent units (GRU) bahdanau2015neural and provides additional features (e.g., POS and NER) to enrich the encoder nallapati2016abstractive .

SelectiveEnc: zhou2017selective proposed a selective mechanism to select important information from the encoder before generating the summary.
Transformer: it is the Transformer model applied to the abstractive sentence summarization task.
7 Results and Analysis
7.1 Machine Translation
7.1.1 Overall Translation Quality
Model  MT03  MT04  MT05  MT06  AVE  Δ
RNMT  42.07  43.40  40.73  41.11  41.83  –
RNMT (R2L)  41.47  43.13  40.62  40.94  41.54  −0.29
BIRNMT  43.50  43.98  41.37  42.48  42.83  +1.00
Transformer  47.63  48.32  47.51  45.31  47.19  –
Transformer (R2L)  46.79  47.01  46.50  44.13  46.11  −1.08
RerankNMT  48.23  48.91  48.73  46.51  48.10  +0.91
ABDNMT  49.47  48.01  48.19  47.09  48.19  +1.00
BIFT  51.87  51.50  51.23  49.83  51.11  +3.92
Table 2 reports the translation performance of different systems on the ChineseEnglish task. The results are mainly divided into two parts. The first part in this table shows the BLEU scores of the systems based on LSTM framework while the second part gives the results of various systems based on the Transformer architecture.
Comparing the baselines using different architectures, we can easily see that the selfattention based Transformer remarkably outperforms the LSTMbased RNMT, with the average improvement of 5.36 BLEU points (47.19 vs. 41.83), suggesting the superiority of the Transformer.
As conventional decoding proceeds left to right, a question arises as to which inference direction is better. It is easy to see from Table 2 that right-to-left decoding performs worse than the left-to-right style no matter which neural network architecture is adopted (RNMT (R2L) vs. RNMT, Transformer (R2L) vs. Transformer). Specifically, the gap under the Transformer architecture is much bigger (1.08 vs. 0.29 BLEU points), indicating that the Transformer is more sensitive to the inference direction.

We also investigate previous methods that take advantage of the two inference directions. The second part of Table 2 shows that both the reranking approach RerankNMT liu2016agreementa and the asynchronous bidirectional decoding method ABDNMT zhang2018asynchronous achieve a significant improvement over the strong Transformer baseline. The average gain can be up to 1.0 BLEU point (48.19 vs. 47.19), indicating that L2R and R2L decoding are complementary to each other.

We go a step further and exploit synchronous bidirectional inference, which makes full use of L2R and R2L decoding. It is obvious from Table 2 that our proposed method performs best. The first part of Table 2 shows that the synchronous bidirectional inference model under the LSTM framework, BIRNMT, obtains an average improvement of 1.0 BLEU point over RNMT. The second part demonstrates that our synchronous bidirectional inference model under the Transformer architecture, BIFT, achieves promising BLEU gains, with a gap as large as 3.92 BLEU points on average. The remarkable improvements suggest that, compared to asynchronous bidirectional decoding, our synchronous bidirectional inference can better exploit the history and future contexts on the target side.
Model  TEST (WMT14)
GNMT wu2016google  24.61
Conv gehring2017convolutional  25.16
AttIsAll vaswani2017attention  28.40
RNMT  22.85
RNMT (R2L)  22.17
BIRNMT  23.97
Transformer  27.72
Transformer (R2L)  27.13
RerankNMT  27.81
ABDNMT  28.22
BIFT  29.21

Note: the BLEU score of the Transformer model is reproduced in our hardware environment and is slightly lower than AttIsAll vaswani2017attention . chen2018best also reported a reproduction lower than the original result in vaswani2017attention . In our experiments, we use only 3 GPUs for English-to-German, whereas vaswani2017attention ; chen2018best adopted TPUs for model training.
Similar phenomena can be observed from the English-German translation results shown in Table 3. BIRNMT outperforms RNMT by 1.12 BLEU points. BIFT performs best among all the systems, including GNMT wu2016google , Conv gehring2017convolutional and AttIsAll vaswani2017attention , and achieves a state-of-the-art performance of 29.21 BLEU on this dataset. Considering that only one reference is available for English-to-German translation, the improvements are very promising.
7.1.2 Model Size and Efficiency
Our synchronous bidirectional inference model is slightly more complicated than a conventional L2R or R2L inference model, so it is interesting to examine its model size and efficiency compared to the baseline systems. Table 4 reports the corresponding statistics of the different NMT models. The model size denotes the total number of network parameters. Since the synchronous bidirectional inference model only introduces one extra parameter in Equation 26 for BIFT, its model size is essentially the same as that of the Transformer. In contrast, RerankNMT has double the number of parameters of the Transformer because it requires two individual encoder-decoder models for L2R and R2L decoding respectively. As for the asynchronous bidirectional decoding model, ABDNMT shares one encoder but has two decoders, and thus contains roughly 60% more parameters than the Transformer baseline.
Model  Model Size  Train (steps/sec)  Test (sents/sec)
Transformer  207.8M  2.07  19.97
Transformer (R2L)  207.8M  2.07  19.81
RerankNMT  415.6M  1.03  6.51
ABDNMT  333.8M  1.18  7.20
BIFT  207.8M  1.26  17.87
The Train column shows the number of global training steps per second. Because the training procedure of BIFT needs to match both the L2R and R2L references, it takes more time to converge. However, regarding decoding efficiency, our synchronous bidirectional inference model performs on par with the Transformer baseline, with only about a 10% slowdown (17.87 vs. 19.97 sentences per second), whereas RerankNMT and ABDNMT are much slower. The statistics suggest that BIFT is acceptable in terms of decoding efficiency.
7.1.3 Performance Trends on Sentence Length
In previous sections, we argued that BIRNMT and BIFT can utilize both the history and future contexts during translation. A natural question is whether our proposed models perform much better on long sentences. To answer this question, we group the source sentences of similar lengths in the test set and calculate the corresponding BLEU scores for each length interval.

Fig. 4 displays the statistics. The left part of Fig. 4 presents the results of the systems using the LSTM-based framework, while the right part shows the statistics of the Transformer-based systems. Overall, no matter which architecture is adopted, our proposed models (BIRNMT and BIFT) are superior to the baselines over sentences of all lengths. Generally, the gap widens as the sentence length grows. Comparing the two architectures, we find that the Transformer-based BIFT excels at long-sentence translation. The results indicate that our proposed synchronous bidirectional inference models are better at translating long sentences with the help of both history and future contexts during decoding.
7.1.4 Translation Precision over Different Positions
We mentioned in the introduction that L2R inference is good at predicting the prefix of a sentence while R2L inference is adept at suffix prediction. One may wonder how our synchronous bidirectional inference model performs on prefix and suffix prediction. Furthermore, which part of the translation will be improved most: the prefix, the middle part, or the suffix?
Model  First Four  Last Four
RNMT  36.35%  31.64%
RNMT (R2L)  31.22%  34.01%
BIRNMT  36.88%  34.65%
Transformer  40.21%  35.10%
Transformer (R2L)  35.67%  39.47%
RerankNMT  38.98%  38.91%
ABDNMT  38.36%  38.11%
BIFT  40.89%  40.08%
To answer these questions, we first analyze, for the different NMT systems, the prediction precision of the first four and last four tokens of the translations compared to the references. Table 5 reports the comparison results. The proposed synchronous bidirectional inference model clearly performs best in matching precision for both the first four and last four tokens, showing the superiority of our methods.
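A sketch of the head/tail matching computation behind Table 5 follows; we assume token-level exact match at head- and tail-aligned positions, which is our reading of the metric rather than a definition given in the paper:

```python
def head_tail_accuracy(hypotheses, references, k=4):
    """Fraction of the first k and last k reference token positions at
    which the hypothesis produces exactly the reference token; positions
    the hypothesis is too short to cover count as misses."""
    head_hits = tail_hits = total = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        for i in range(min(k, len(h), len(r))):
            head_hits += h[i] == r[i]            # aligned from the left
            tail_hits += h[-1 - i] == r[-1 - i]  # aligned from the right
        total += min(k, len(r))
    return head_hits / total, tail_hits / total
```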
Then, we go a step further and conduct a deeper analysis. We divide each translation hypothesis and its reference into 10 equal parts and calculate the average word prediction accuracy for each part. In this way, we investigate the contribution of the synchronous bidirectional inference model over different positions. Fig. 5 illustrates the comparison results. Interestingly, both ends of the translation hypothesis are much easier to predict, whereas the prediction accuracy of the middle part is much lower, suggesting a greater demand for both history and future contexts there. Our proposed BIRNMT and BIFT facilitate the usage of the left and right predictions, leading to large improvements over the middle part (40–80% in Fig. 5).
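The per-position curves of Fig. 5 can be reproduced in the same spirit. Below, each reference is split into 10 equal parts and the position-wise match rate is averaged within each part (again a sketch under our alignment assumption):

```python
def positional_accuracy(hypotheses, references, parts=10):
    """Average exact-match rate per relative-position decile of the
    reference; positions beyond the hypothesis length count as misses."""
    hits, counts = [0] * parts, [0] * parts
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        for i, token in enumerate(r):
            part = min(i * parts // len(r), parts - 1)
            counts[part] += 1
            hits[part] += i < len(h) and h[i] == token
    return [hit / c if c else 0.0 for hit, c in zip(hits, counts)]
```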
7.1.5 Two-pass Training vs. Fine-tuning
In this subsection, we investigate the effects of different parameter optimization strategies for our synchronous bidirectional inference model. We employ BIFT on the Chinese-to-English translation task to compare the two-pass training strategy with the fine-tuning strategy. In the fine-tuning step, we randomly choose 10% of the source sentences of the training data. Table 6 reports the comparison results.
As shown in the table, both training strategies remarkably improve the translation performance compared to the Transformer baseline. Although the fine-tuning strategy is not as powerful as the two-pass training strategy, it still achieves a substantial improvement of about 2 BLEU points on average over the strong baseline. Considering that the fine-tuning strategy is much easier and cheaper for system deployment, we believe it will be more popular than the two-pass strategy.
Model  MT03  MT04  MT05  MT06  AVE
Transformer  47.63  48.32  47.51  45.31  47.19
BIFT (two-pass)  51.87  51.50  51.23  49.83  51.11  +3.92
BIFT (fine-tuning)  50.76  49.72  48.32  47.91  49.18  +1.99
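The data selection for the fine-tuning strategy amounts to drawing a random 10% subset of the training pairs; a minimal sketch, where the function name and fixed seed are illustrative:

```python
import random

def sample_finetune_subset(corpus, fraction=0.1, seed=42):
    """Randomly pick a fraction of (source, target) pairs to run the
    extra fine-tuning pass on; the seed keeps the subset reproducible."""
    rng = random.Random(seed)
    k = max(1, int(len(corpus) * fraction))
    return rng.sample(corpus, k)
```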
7.2 Abstractive Summarization
7.2.1 Summarization Quality
Model  DUC2004 (R1 / R2 / RL)  English Gigaword (R1 / R2 / RL)
ABS  26.55 / 7.06 / 22.05  29.55 / 11.32 / 26.42
Feats2s  28.35 / 9.46 / 24.59  32.67 / 15.59 / 30.64
SelectiveEnv  29.21 / 9.56 / 25.51  36.15 / 17.54 / 33.63
RNMT  28.22 / 10.21 / 25.14  34.54 / 16.85 / 32.32
BIRNMT  29.05 / 10.90 / 26.05  35.47 / 17.62 / 32.90
Transformer  28.09 / 9.52 / 24.91  34.12 / 16.04 / 31.46
BIFT  29.17 / 10.30 / 26.05  35.68 / 17.39 / 32.89
Abstractive sentence summarization is another well-known testbed for sequence-to-sequence learning, and we also apply our synchronous bidirectional inference model to this task. Table 7 presents the results of different systems on two test sets, DUC2004 and English Gigaword.
The first three rows show the performances of previous state-of-the-art abstractive summarization models. Among these three baselines, Feats2s and SelectiveEnv both aim to improve summarization quality by enhancing the representation learning of the encoder, and SelectiveEnv, which augments the encoder with key information selection, performs best.
In contrast, our method attempts to improve the decoder (inference module) by enabling synchronous bidirectional decoding. The last four rows in Table 7 show that the LSTM-based RNMT consistently outperforms the self-attention based Transformer, which is quite different from neural machine translation, where the Transformer is the better one. After applying our synchronous bidirectional inference model, BIRNMT and BIFT achieve significantly better results on the two test sets than RNMT and Transformer respectively, although the two models cannot outperform SelectiveEnv on the Gigaword test set. The reason may be that we apply BIRNMT and BIFT to abstractive summarization without any task-specific adaptation. From another perspective, the two kinds of models improve the encoder and the decoder respectively, and can thus be complementary to each other. Nevertheless, the statistics in the table further demonstrate the effectiveness of our synchronous bidirectional inference model beyond machine translation.
7.2.2 Some Examples
Input  president nelson mandela acknowledged saturday the african national congress violated human rights during apartheid , setting him at odds with his deputy president over a report that has divided much of south africa . 
Reference  mandela acknowledges human rights violations by african national congress 
Transformer  mandela acknowledges human rights violation at odds with deputy president 
Transformer (R2L)  mandela says south africa violated human rights 
BIFT  mandela says south african national congress violated human rights 
Input  the new york times said in an editorial on monday , oct. UNK : since the deadly bombing of two american embassies in africa in august , there has been a troubling accumulation of evidence that the state department inexplicably ignored warnings of possible terrorist attacks against the installations . 
Reference  editorial claims state department ignored warnings of terrorist 
Transformer  new york times says evidence of terror attacks is ignored 
Transformer (R2L)  new york times warns of possible terrorist attacks on u.s. embassies in africa 
BIFT  new york times says state department ignored terror warnings 
To better understand the models, we further investigate some specific examples, which are listed in Table 8. For each example, the input is a long sentence and the output is a title-like summary.
In the first example, human rights and african national congress are the two key contents. The baseline Transformer fails to generate african national congress in the tail part, and Transformer (R2L) neglects national congress in the head part. In contrast, our model BIFT renders all of the key points.
As for the second example, the baselines make similar mistakes. Transformer omits by the state department at the end, and the summary generated by Transformer (R2L) expresses the wrong meaning, since it attributes the warning to the new york times instead of stating that the state department ignored it. However, our model BIFT generates a summary with correct and complete contents. These examples demonstrate the superiority of our synchronous bidirectional inference model in modeling both history and future contexts.
8 Related Work
This work addresses synchronous bidirectional inference for sequence to sequence learning tasks, aiming to take full advantage of history and future predictions on the output. Generally, the related work can be divided into two categories, namely bidirectional inference and future context usage.
Bidirectional inference is well studied in sequence labeling tasks toutanova2003feature ; tsuruoka2005bidirectional ; shen2007guided , in which each input token corresponds to an output label and the output shares the same length with the input. In general, there are two ways to decompose the conditional probability of an n-token output sequence, since each token can be predicted conditioned on its left context (from left to right) or its right context (from right to left). Hence, bidirectional inference is not difficult for sequence labeling. However, it is not trivial to leverage bidirectional inference for sequence generation problems, mainly because the length of the output is not known in advance. liu2016agreementa ; liu2016agreementb ; zhang2018asynchronous added agreement constraints to enforce the L2R inference output to be consistent with the R2L inference output for sequence generation tasks. serdyuk2018twin proposed the twin network, which during training encourages the target hidden states of the L2R and R2L inferences at the same position to be as close as possible when predicting the same token. Recently, zhang2018asynchronous introduced an asynchronous bidirectional inference model for neural machine translation: they first obtain the translation hypothesis using R2L inference and then optimize the L2R inference model with the help of the R2L inference result. Despite the performance improvements, all these studies require two individual inference models, making the architecture more complicated. Furthermore, the interactions between L2R and R2L inferences are not adequate. Taking the asynchronous bidirectional inference model as an example, the L2R model can utilize the information of the R2L model, but the R2L inference cannot use the L2R predictions. In contrast, our synchronous bidirectional inference model has only a single decoder in which L2R and R2L inferences interact with each other at each decoding time step.
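For an output sequence $\mathbf{y} = (y_1, \ldots, y_n)$ given input $\mathbf{x}$, the two decomposition directions mentioned above are the standard factorizations:

```latex
% Left-to-right (L2R):
P(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{n} P\bigl(y_t \mid y_1, \ldots, y_{t-1}, \mathbf{x}\bigr)
% Right-to-left (R2L):
P(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{n} P\bigl(y_t \mid y_{t+1}, \ldots, y_n, \mathbf{x}\bigr)
```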
Using future contexts has drawn more and more attention in sequence prediction tasks. Intuitively, an R2L inference model can be employed to rerank the best hypotheses of the L2R inference model, preferring balanced outputs sennrich2016edinburgh ; sennrich2017university ; hoang2017towards ; tan2017xmu ; deng2018alibaba ; liu2018comparable . To use the future context, which is unavailable in a conventional inference model, bahdanau2017actor ; he2017decoding ; li2017learning proposed reinforcement learning methods to estimate the possible future information. To mimic human cognitive behaviors, xia2017deliberation presented a deliberation network, which leverages global information in sequence generation with the help of both forward and backward predictions through a deliberation process. zheng2018modeling introduced two additional recurrent layers to model the translated past contents and the untranslated future contents. These methods considerably improve the sequence generation quality at the cost of model complexity: they either employ a two-pass decoding strategy or add more layers to the original network. Compared to this kind of work, our proposed model exploits both history and future predictions by allowing L2R and R2L inferences to proceed in parallel but interactively. In our previous work zhou2019synchronous , we addressed bidirectional decoding for neural machine translation. In the current work, we generalize the decoding model into synchronous bidirectional inference for general sequence-to-sequence models (LSTM and Transformer) and general sequence generation tasks (translation and summarization). We further propose and investigate two optimization strategies to learn the network parameters.

9 Conclusion and Future Work
This work proposes a synchronous bidirectional inference model for sequence generation tasks. We first presented a synchronous bidirectional beam search algorithm, in which left-to-right and right-to-left decoding proceed in parallel but interactively. We then explored the usage of the synchronous bidirectional inference model on both LSTM-based and Transformer-based seq2seq architectures, and proposed and investigated two parameter optimization strategies. Comprehensive experiments on machine translation and abstractive summarization demonstrate that our synchronous bidirectional inference model remarkably outperforms the strong baselines. Further analysis shows that our model can indeed take full advantage of both history and future predictions during inference.
In future work, we plan to apply our synchronous bidirectional inference model to other sequence generation tasks, such as question answering, chatbots, and image captioning.
References
 (1) I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in: Proceedings of NIPS, 2014.
 (2) D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, in: Proceedings of ICLR, 2015.
 (3) J. Gehring, M. Auli, D. Grangier, D. Yarats, Y. N. Dauphin, Convolutional sequence to sequence learning, in: Proceedings of ICML, 2017.
 (4) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, arXiv preprint arXiv:1706.03762.
 (5) S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (8) (1997) 1735–1780.
 (6) L. Liu, A. M. Finch, M. Utiyama, E. Sumita, Agreement on target-bidirectional lstms for sequence-to-sequence learning, in: AAAI, 2016, pp. 2630–2637.
 (7) Z. Zhang, S. Wu, S. Liu, M. Li, M. Zhou, E. Chen, Regularizing neural machine translation by target-bidirectional agreement, arXiv preprint arXiv:1808.04064.
 (8) L. Liu, M. Utiyama, A. Finch, E. Sumita, Agreement on target-bidirectional neural machine translation, in: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 411–416.
 (9) Y. Wang, S. Cheng, L. Jiang, J. Yang, W. Chen, M. Li, L. Shi, Y. Wang, H. Yang, Sogou neural machine translation systems for wmt17, in: Proceedings of the Second Conference on Machine Translation, 2017, pp. 410–415.
 (10) X. Zhang, J. Su, Y. Qin, Y. Liu, R. Ji, H. Wang, Asynchronous bidirectional decoding for neural machine translation, in: Proceedings of AAAI, 2018.
 (11) M.T. Luong, H. Pham, C. D. Manning, Effective approaches to attention-based neural machine translation, in: Proceedings of EMNLP, 2015.
 (12) R. Sennrich, B. Haddow, A. Birch, Neural machine translation of rare words with subword units, in: Proceedings of ACL, 2016.
 (13) S. Shen, Y. Cheng, Z. He, W. He, H. Wu, M. Sun, Y. Liu, Minimum risk training for neural machine translation, in: Proceedings of ACL, 2016.
 (14) J. Zhou, Y. Cao, X. Wang, P. Li, W. Xu, Deep recurrent models with fast-forward connections for neural machine translation, arXiv preprint arXiv:1606.04199.
 (15) K. Papineni, S. Roukos, T. Ward, W.J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of ACL, 2002, pp. 311–318.
 (16) P. Koehn, Statistical significance tests for machine translation evaluation., in: Proceedings of EMNLP, 2004, pp. 388–395.
 (17) Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., Google’s neural machine translation system: Bridging the gap between human and machine translation, arXiv preprint arXiv:1609.08144.
 (18) C. Napoles, M. Gormley, B. Van Durme, Annotated gigaword, in: Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, 2012, pp. 95–100.
 (19) A. M. Rush, S. Chopra, J. Weston, A neural attention model for abstractive sentence summarization, in: Proceedings of EMNLP, 2015.
 (20) Q. Zhou, N. Yang, F. Wei, M. Zhou, Selective encoding for abstractive sentence summarization, in: Proceedings of ACL, 2017.
 (21) C.Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization branches out: Proceedings of the ACL04 workshop, Vol. 8, 2004.
 (22) R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang, Abstractive text summarization using sequence-to-sequence rnns and beyond, in: Computational Natural Language Learning, 2016.
 (23) M. X. Chen, O. Firat, A. Bapna, M. Johnson, W. Macherey, G. Foster, L. Jones, N. Parmar, M. Schuster, Z. Chen, et al., The best of both worlds: Combining recent advances in neural machine translation, in: Proceedings of ACL, 2018, pp. 76–86.
 (24) K. Toutanova, D. Klein, C. D. Manning, Y. Singer, Feature-rich part-of-speech tagging with a cyclic dependency network, in: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Volume 1, 2003, pp. 173–180.
 (25) Y. Tsuruoka, J. Tsujii, Bidirectional inference with the easiest-first strategy for tagging sequence data, in: Proceedings of the conference on human language technology and empirical methods in natural language processing, 2005, pp. 467–474.
 (26) L. Shen, G. Satta, A. Joshi, Guided learning for bidirectional sequence classification, in: Proceedings of the 45th annual meeting of the association of computational linguistics, 2007, pp. 760–767.
 (27) D. Serdyuk, N. R. Ke, A. Sordoni, A. Trischler, C. Pal, Y. Bengio, Twin networks: Matching the future for sequence generation, in: International Conference on Learning Representations, 2018.
 (28) R. Sennrich, B. Haddow, A. Birch, Edinburgh neural machine translation systems for wmt 16, in: Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, 2016, pp. 371–376.
 (29) R. Sennrich, A. Birch, A. Currey, U. Germann, B. Haddow, K. Heafield, A. V. M. Barone, P. Williams, The university of edinburgh’s neural mt systems for wmt17, in: Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, 2017, pp. 389–399.
 (30) C. D. V. Hoang, G. Haffari, T. Cohn, Towards decoding as continuous optimisation in neural machine translation, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 146–156.
 (31) Z. Tan, B. Wang, J. Hu, Y. Chen, X. Shi, Xmu neural machine translation systems for wmt 17, in: Proceedings of the Second Conference on Machine Translation, 2017, pp. 400–404.
 (32) Y. Deng, S. Cheng, J. Lu, K. Song, J. Wang, S. Wu, L. Yao, G. Zhang, H. Zhang, P. Zhang, et al., Alibaba’s neural machine translation systems for wmt18, in: Proceedings of the Third Conference on Machine Translation: Shared Task Papers, 2018, pp. 368–376.
 (33) Y. Liu, L. Zhou, Y. Wang, Y. Zhao, J. Zhang, C. Zong, A comparable study on model averaging, ensembling and reranking in nmt, in: CCF International Conference on Natural Language Processing and Chinese Computing, Springer, 2018, pp. 299–308.
 (34) D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, Y. Bengio, An actor-critic algorithm for sequence prediction, in: Proceedings of ICLR, 2017.
 (35) D. He, H. Lu, Y. Xia, T. Qin, L. Wang, T.Y. Liu, Decoding with value networks for neural machine translation, in: I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems 30, 2017, pp. 178–187.
 (36) J. Li, W. Monroe, D. Jurafsky, Learning to decode for future success, arXiv preprint arXiv:1701.06549.
 (37) Y. Xia, F. Tian, L. Wu, J. Lin, T. Qin, N. Yu, T.Y. Liu, Deliberation networks: Sequence generation beyond one-pass decoding, in: Advances in Neural Information Processing Systems, 2017, pp. 1784–1794.
 (38) Z. Zheng, H. Zhou, S. Huang, L. Mou, X. Dai, J. Chen, Z. Tu, Modeling past and future for neural machine translation, in: TACL, Vol. 6, 2018, pp. 145–157.
 (39) L. Zhou, J. Zhang, C. Zong, Synchronous bidirectional neural machine translation, in: TACL, 2019.