1 Introduction
Neural machine translation (NMT), typically with an attention-based encoder-decoder framework (Bahdanau et al., 2015), has recently become the dominant approach to machine translation and has already been deployed in online translation services (Wu et al., 2016). Recurrent neural networks (RNNs), e.g., LSTMs (Hochreiter and Schmidhuber, 1997) or GRUs (Chung et al., 2014), are widely used as the encoder and decoder for NMT. To alleviate the vanishing gradient issue found in simple recurrent neural networks (SRNN) (Elman, 1990), recurrent units in LSTMs or GRUs normally introduce different gates to create shortcuts for gradient information to pass through.
Notwithstanding the capability of these gated recurrent networks in learning long-distance dependencies, they use remarkably more matrix transformations (i.e., more parameters) than SRNN. With many non-linear functions modeling inputs, hidden states and outputs, they are also less transparent than SRNN. As a result, NMT based on these gated RNNs suffers not only from inefficient training and inference, due to recurrency and heavy computation in recurrent units (Vaswani et al., 2017), but also from difficulty in producing interpretable models (Lee et al., 2017). These drawbacks also hinder the deployment of NMT models, particularly on memory- and computation-limited devices.
In this paper, our key interest is to simplify recurrent units in RNN-based NMT. In doing so, we want to investigate how far we can advance RNN-based NMT in terms of the number of parameters (i.e., memory consumption), running speed and interpretability. This simplification should preserve the capability of modeling long-distance dependencies in LSTMs/GRUs and the expressive power of recurrent non-linearities in SRNN. It should also reduce the computation load and physical memory consumption in recurrent units on the one hand, and allow us to take a good look into the inner workings of RNNs on the other.
To achieve this goal, we propose an addition-subtraction twin-gated recurrent network (ATR) for NMT. In the recurrent units of ATR, we keep only the most essential weight matrices: one over the input and the other over the history (similar to SRNN). Compared with previous RNN variants (e.g., LSTM or GRU), ATR has the smallest number of weight matrices, which reduces the computation load of matrix multiplication. ATR also uses gates to bypass the vanishing gradient problem so as to capture long-range dependencies. Specifically, we use the addition and subtraction operations between the weighted history and input to estimate an input gate and a forget gate respectively. These add-sub operations not only distinguish the two gates, so that we do not need different weight matrices for them, but also make the two gates dynamically correlate with each other. Finally, we remove some non-linearities in recurrent units.
Due to these simplifications, we can easily show that each new state in ATR is an unnormalized weighted sum of previous inputs, similar to recurrent additive networks (Lee et al., 2017). This property not only allows us to trace each state back to the inputs that contribute most, but also establishes unnormalized forward self-attention between the current state and all its previous inputs. The self-attention mechanism has already proved very useful in non-recurrent NMT (Vaswani et al., 2017).
We build our NMT systems on the proposed ATR with a single-layer encoder and decoder. Experiments on WMT14 English-German and English-French translation tasks show that our model yields competitive results compared with GRU/LSTM-based NMT. When we integrate an orthogonal context-aware encoder (still single layer) into ATR-based NMT, our model (yielding 24.97 and 39.06 BLEU on English-German and English-French translation respectively) is even comparable to deep RNN and non-RNN NMT models, all of which use multiple encoder/decoder layers. In-depth analyses demonstrate that ATR is more efficient than LSTM/GRU in terms of NMT training and decoding speed.
We adapt our model to other language translation and natural language processing tasks, including NIST Chinese-English translation, natural language inference and Chinese word segmentation. Our conclusions still hold on all these tasks.
2 Related Work
The most widely used RNN models are LSTM (Hochreiter and Schmidhuber, 1997) and GRU (Chung et al., 2014), both of which are good at handling the vanishing gradient problem, a notorious bottleneck of the simple RNN (Elman, 1990). The design of gates in our model follows the gate philosophy in LSTM/GRU.
Our work is closely related to the recurrent additive network (RAN) proposed by Lee et al. (2017). They empirically demonstrate that many non-linearities commonly used in RNN transition dynamics can be removed, and that recurrent hidden states computed purely as the weighted sum of input vectors can be quite efficient in language modeling. Our work follows the same spirit of simplifying recurrent units. However, our proposed ATR differs significantly from RAN in three aspects. First, ATR is simpler than RAN, with even fewer parameters. There are only two weight matrices in ATR, whereas the simplest version of RAN has four (two for each gate). Second, since the only difference between the input and forget gate in ATR is the addition/subtraction operation between the history and input, the two gates can be learned to be highly correlated, as shown in our analysis. Finally, although RAN has been verified to be effective in language modeling, our experiments show that ATR is better than RAN in machine translation in terms of both speed and translation quality.
To speed up RNN models, a line of work has attempted to remove recurrent connections. For example, Bradbury et al. (2016) propose the quasi-recurrent neural network (QRNN), which uses convolutional layers and a minimalist recurrent pooling function to improve parallelism. Very recently, Lei and Zhang (2017) propose a simple recurrent unit (SRU). With the cuDNN optimization, their RNN model can be trained as fast as CNNs. However, to obtain promising results, QRNN and SRU have to use deep architectures. In practice, a 4-layer QRNN encoder and decoder are used to reach translation quality comparable to that of single-layer LSTM/GRU NMT. By contrast, our one-layer model achieves significantly higher performance than a 10-layer SRU system.
Finally, our work is also related to efforts in developing alternative architectures for NMT models. Zhou et al. (2016) introduce fast-forward connections between adjacent stacked RNN layers to ease gradient propagation. Wang et al. (2017a) propose a linear associative unit to reduce the gradient propagation length along layers in deep NMT. Gehring et al. (2017b) and Vaswani et al. (2017) explore purely convolutional and attentional architectures as alternatives to RNNs for neural translation. With careful configurations, their deep models achieve state-of-the-art performance on various datasets.
3 Addition-Subtraction Twin-Gated Recurrent Network
Given a sequence $x = \{x_1, x_2, \ldots, x_n\}$, an RNN updates the hidden state $h_t$ recurrently as follows:
$$ h_t = \phi(x_t, h_{t-1}) \tag{1} $$
where $h_{t-1}$ is the previous hidden state, which is considered to store information from all previous inputs, and $x_t$ is the current input. The function $\phi(\cdot)$ is a non-linear recurrent function, abstracting away from details in recurrent units.
GRU can be considered a simplified version of LSTM. In this paper, we use GRU as our theoretical benchmark and propose a new recurrent unit to further simplify it. The GRU function is defined as follows (see Figure 1(b)):
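$$ z_t = \sigma(W_z x_t + U_z h_{t-1}) \tag{2} $$
$$ r_t = \sigma(W_r x_t + U_r h_{t-1}) \tag{3} $$
$$ \tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1})) \tag{4} $$
$$ h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t \tag{5} $$
where $\odot$ denotes an element-wise multiplication. The reset gate $r_t$ and update gate $z_t$ enable manageable information flow from the history and the current input to the new state respectively. Despite the success of these two gates in handling gradient flow, they require extensive matrix transformations and weight parameters.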
We argue that many of these matrix transformations are not essential. We therefore propose an addition-subtraction twin-gated recurrent unit (ATR), formulated as follows (see Figure 1(c)):
$$ p_t = W_h x_t, \quad q_t = U_h h_{t-1} \tag{6} $$
$$ i_t = \sigma(p_t + q_t) \tag{7} $$
$$ f_t = \sigma(p_t - q_t) \tag{8} $$
$$ h_t = i_t \odot p_t + f_t \odot h_{t-1} \tag{9} $$
The hidden state $h_t$ in ATR is a weighted mixture of both the current input $p_t$ and the history $h_{t-1}$, controlled by an input gate $i_t$ and a forget gate $f_t$ respectively. Notice that we use the transformed representation $p_t$ for the current input rather than the raw vector $x_t$ due to the potential mismatch in dimensions between $h_t$ and $x_t$.
Similar to GRU, we use gates, especially the forget gate, to control the back-propagated gradient flow so that gradients neither vanish nor explode. We also preserve the non-linearities of SRNN in ATR, but only in the two gates.
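The recurrence described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the update takes the form p = W x, q = U h, i = σ(p + q), f = σ(p − q), h = i ⊙ p + f ⊙ h_prev, omits bias terms, and uses toy dimensions; the names `atr_step` and `atr_run` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def atr_step(x_t, h_prev, W, U):
    """One ATR update (assumed form, no bias terms)."""
    p = W @ x_t              # transformed current input
    q = U @ h_prev           # transformed history
    i = sigmoid(p + q)       # input gate: addition
    f = sigmoid(p - q)       # forget gate: subtraction
    return i * p + f * h_prev

def atr_run(xs, W, U):
    """Apply atr_step over a sequence, starting from a zero state."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in xs:
        h = atr_step(x_t, h, W, U)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
d_in, d_h, n = 4, 3, 5                      # toy sizes, chosen for illustration
W = rng.uniform(-0.08, 0.08, (d_h, d_in))   # the only weight over the input
U = rng.uniform(-0.08, 0.08, (d_h, d_h))    # the only weight over the history
xs = rng.normal(size=(n, d_in))
H = atr_run(xs, W, U)
print(H.shape)  # (5, 3)
```

Both gates reuse the same two products `p` and `q`, which is why only two matrix transformations are needed per step in this sketch.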
There are three significant differences between ATR and GRU, some of which are due to the simplifications introduced in ATR. First, we reduce the number of weight matrices in gate calculation from four to two (see Equations (2) and (3) versus Equations (7) and (8)). In all existing gated RNNs, the inputs to gates are weighted sums of the previous hidden state and the input. To distinguish these gates, the weight matrices over the previous hidden state and the current input must differ from gate to gate. The number of different weight matrices in gates is therefore 2 × #gates in previous gated RNNs. In contrast, ATR uses different operations (i.e., addition and subtraction) between the weighted history and input to distinguish the input and forget gates. Therefore, the weight matrices over the previous state/input can be shared by the two gates in ATR. Second, we keep only the most essential non-linearities, namely those in the two gates. In ATR, the role of $p_t$ is similar to that of $\tilde{h}_t$ in GRU (see Equation (4)). However, we completely remove the recurrent non-linearity $\tanh$ in $p_t$ (i.e., $p_t = W_h x_t$). Lee et al. (2017) show that this non-linearity is not necessary in language modeling. We further empirically demonstrate that it is not essential in machine translation either. Third, in GRU the gates for $\tilde{h}_t$ and $h_{t-1}$ are coupled and normalized to 1, while we do not explicitly associate the two gates in ATR. Instead, they can be learned to be correlated in an implicit way, as shown in the next subsection and our empirical analysis in Section 5.1.
3.1 Twin-Gated Mechanism
Unlike GRU, we use an addition and a subtraction operation over the transformed current input and history to differentiate the two gates in ATR. As the two gates share the same weights for their input components and differ only in the operation between those components, they act like twins. We term the two gates in ATR twin gates and the procedure, shown in Equations (7) and (8), the twin-gated mechanism. This mechanism endows our model with the following two advantages: 1) both addition and subtraction operations are completely linear, so fast computation can be expected; and 2) no other weight parameters are introduced for gates, so our model is more memory-compact.
A practical question for the twin-gated mechanism is whether twin gates are really capable of dynamically weighting the input and history information. To this end, we plot the surface of the one-dimensional $\sigma(x + y) - \sigma(x - y)$ in Figure 2. It is clear that both gates are highly non-linearly correlated, and that there are regions where $\sigma(x + y)$ is equal to, greater than or smaller than $\sigma(x - y)$. In other words, by adapting the distribution of input and forget gates, the twin-gated mechanism has the potential to automatically seek suitable regions in Figure 2 to control its preference between the new and past information. We argue that the input and forget gates are negatively correlated after training, and empirically show their actual correlation in Section 5.1.
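The regions mentioned above can be checked numerically. The following sketch evaluates the two gates on a grid, assuming they take the form σ(x + y) and σ(x − y), with x standing in for the transformed input and y for the transformed history; the grid range is an arbitrary choice for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-4.0, 4.0, 81)   # stands in for the transformed input
y = np.linspace(-4.0, 4.0, 81)   # stands in for the transformed history
X, Y = np.meshgrid(x, y, indexing="ij")
input_gate = sigmoid(X + Y)
forget_gate = sigmoid(X - Y)
diff = input_gate - forget_gate  # the one-dimensional surface

pos, neg = y > 1e-9, y < -1e-9
print(bool(np.all(diff[:, pos] > 0)))   # input gate larger wherever y > 0
print(bool(np.all(diff[:, neg] < 0)))   # forget gate larger wherever y < 0

# For a fixed x, the gates move in opposite directions as y grows.
print(bool(np.all(np.diff(input_gate, axis=1) > 0)))
print(bool(np.all(np.diff(forget_gate, axis=1) < 0)))
```

Because the sigmoid is monotonic, the sign of the history term alone decides which gate is larger, while the input term shifts both gates in the same direction.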
3.2 Computation Analysis
Here we provide a systematic comparison of computations in LSTM, GRU, RAN and our ATR with respect to the number of weight matrices and matrix transformations. Notice that all these units are building blocks of RNNs, so the total computational complexity and the minimum number of sequential operations required are unchanged, i.e., $O(n \cdot d^2)$ and $O(n)$ respectively, where $n$ is the sequence length and $d$ is the dimensionality of hidden states. However, the actual number of matrix transformations in the unit significantly affects the running speed of RNNs in practice.
We summarize the results in Table 1. LSTM contains three different gates and a cell state, comprising 4 different neural layers with 8 weight matrices and transformations. GRU simplifies LSTM by removing a gate, but still involves two gates and a candidate hidden state. It comprises 3 different neural layers with 6 weight matrices and transformations. RAN further simplifies GRU by removing the non-linearity in the state transition and therefore contains 4 weight matrices in its simplest version. Although our ATR also has two gates, it has only 2 weight matrices and transformations, i.e., only a third and a quarter of those in GRU and LSTM respectively. To the best of our knowledge, ATR has the smallest number of weight transformations among existing gated RNN units. We provide a detailed empirical analysis of the speed in Section 5.2.
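Model   # WM   # MT
LSTM    8      8
GRU     6      6
RAN     4      4
ATR     2      2
Table 1: Number of weight matrices (# WM) and matrix transformations (# MT) in each recurrent unit.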
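To make the counts in Table 1 concrete, the sketch below converts the number of weight matrices into a number of recurrent weight parameters. It is an illustration under stated assumptions: half of each unit's matrices are assumed to act on the input and half on the previous hidden state, biases are ignored, and the sizes (620-dimensional embeddings, 1000 hidden units) are borrowed from the training setup; the function name is hypothetical.

```python
# Number of weight matrices per recurrent unit, as listed in Table 1.
WEIGHT_MATRICES = {"LSTM": 8, "GRU": 6, "RAN": 4, "ATR": 2}

def recurrent_weight_params(unit, d_in, d_h):
    """Weight parameters inside one recurrent unit (biases ignored).

    Assumes half of the matrices map the input (d_h x d_in) and the
    other half map the previous hidden state (d_h x d_h).
    """
    half = WEIGHT_MATRICES[unit] // 2
    return half * d_h * d_in + half * d_h * d_h

d_in, d_h = 620, 1000  # embedding and hidden sizes from the training setup
for unit in WEIGHT_MATRICES:
    print(unit, recurrent_weight_params(unit, d_in, d_h))
```

Under these assumptions the ratios between units equal the ratios of the matrix counts, so ATR carries a third of the GRU weights and a quarter of the LSTM weights inside the recurrent unit.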
3.3 Interpretability Analysis of Hidden States
An appealing property of the proposed ATR is its interpretability. This can be demonstrated by unrolling Equation (9) as follows:
$$ h_t = i_t \odot p_t + f_t \odot h_{t-1} = i_t \odot W_h x_t + \sum_{k=1}^{t-1} i_k \odot \Big( \prod_{l=1}^{t-k} f_{k+l} \Big) \odot W_h x_k \approx \sum_{k=1}^{t} g_k \odot W_h x_k \tag{10} $$
where $g_k$ can be considered as an approximate weight assigned to the $k$-th input. Similar to the RAN model (Lee et al., 2017), the hidden state in ATR is a component-wise weighted sum of the inputs. This not only enables ATR to build up essential dependencies between preceding inputs and the current hidden state, but also allows us to easily detect which previous words have the strongest impact on the current state. This desirable property makes ATR highly interpretable.
Additionally, this form of weighted sum is also related to self-attention (Vaswani et al., 2017). It can be considered as a forward unnormalized self-attention where each hidden state attends to all its previous positions. As the self-attention mechanism has proved very useful in NMT (Vaswani et al., 2017), we conjecture that this property of ATR partially contributes to its success in machine translation, as shown in our experiments. We visualize the dependencies captured by Equation (10) in Section 5.3.
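The weighted-sum view can be verified on a toy example. The sketch below runs the recurrence from a zero initial state, computes per-input weights as the input gate at step k times the product of all later forget gates, and checks that the weighted sum of transformed inputs reproduces the hidden state. The update form (p = W x, q = U h, i = σ(p + q), f = σ(p − q), h = i ⊙ p + f ⊙ h_prev), the zero initial state and all function names are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def atr_forward(xs, W, U):
    """Run ATR from a zero state; return states, input gates, forget gates."""
    h = np.zeros(U.shape[0])
    hs, i_gates, f_gates = [], [], []
    for x_t in xs:
        p, q = W @ x_t, U @ h
        i, f = sigmoid(p + q), sigmoid(p - q)
        h = i * p + f * h
        hs.append(h)
        i_gates.append(i)
        f_gates.append(f)
    return np.stack(hs), np.stack(i_gates), np.stack(f_gates)

def unrolled_weights(i_gates, f_gates, t):
    """g[k] = i_k * f_{k+1} * ... * f_t for k = 0..t (0-based positions)."""
    g = np.empty_like(i_gates[: t + 1])
    carry = np.ones_like(i_gates[0])
    for k in range(t, -1, -1):
        g[k] = i_gates[k] * carry
        carry = carry * f_gates[k]
    return g

rng = np.random.default_rng(1)
W = rng.normal(scale=0.5, size=(3, 4))
U = rng.normal(scale=0.5, size=(3, 3))
xs = rng.normal(size=(6, 4))
hs, ig, fg = atr_forward(xs, W, U)

t = 5
g = unrolled_weights(ig, fg, t)
recon = (g * (xs[: t + 1] @ W.T)).sum(axis=0)  # sum_k g_k * (W x_k)
print(np.allclose(recon, hs[t]))  # True
```

Each row of `g` is a per-dimension weight on one earlier input, which is the quantity one would inspect to see which inputs dominate a given state.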
System  Architecture  Vocab  tok BLEU  detok BLEU
Buck et al. (2014)  WMT14 winner system: phrase-based + large LM  -  -  20.70
Existing deep NMT systems (perhaps different tokenization)
Zhou et al. (2016)  LSTM with 16 layers + FF connections  160K  20.60  -
Lei and Zhang (2017)  SRU with 10 layers  50K  20.70  -
Antonino and Federico (2018)  SR-NMT with 4 layers  32K  23.32  -
Wang et al. (2017a)  GRU with 4 layers + LAU + PosUnk  80K  23.80  -
Wang et al. (2017a)  GRU with 4 layers + LAU + PosUnk + ensemble  80K  26.10  -
Wu et al. (2016)  LSTM with 8 layers + RL-refined WPM  32K  24.61  -
Wu et al. (2016)  LSTM with 8 layers + RL-refined ensemble  80K  26.30  -
Vaswani et al. (2017)  Transformer with 6 layers + base model  37K  27.30  -
Comparable NMT systems (the same tokenization)
Luong et al. (2015a)  LSTM with 4 layers + local att. + unk replace  50K  20.90  -
Zhang et al. (2017a)  GRU with gated attention + BPE  40K  23.84  -
Gehring et al. (2017b)  CNN with 15 layers + BPE  40K  25.16  -
Gehring et al. (2017b)  CNN with 15 layers + BPE + ensemble  40K  26.43  -
Zhang et al. (2018a)  Transformer with 6 layers + aan + base model  32K  26.31  -
Our end-to-end NMT systems
this work  RNNSearch + GRU + BPE  40K  22.54  22.06
this work  RNNSearch + LSTM + BPE  40K  22.96  22.39
this work  RNNSearch + RAN + BPE  40K  22.14  21.40
this work  RNNSearch + ATR + BPE  40K  22.48  21.99
this work  RNNSearch + ATR + CA + BPE  40K  23.31  22.70
this work  GNMT + ATR + BPE  40K  24.16  23.59
this work  RNNSearch + ATR + CA + BPE + ensemble  40K  24.97  24.33
Table 2: Tokenized (tok) and detokenized (detok) BLEU scores on the WMT14 English-German translation task. “RL” and “WPM” are the reinforcement learning optimization and word piece model used in Wu et al. (2016). “CA” is the context-aware recurrent encoder (Zhang et al., 2017b). “LAU” and “FF” denote the linear associative unit and the fast-forward architecture proposed by Wang et al. (2017a) and Zhou et al. (2016) respectively. “aan” denotes the average attention network proposed by Zhang et al. (2018a).
4 Experiments
4.1 Setup
We conducted our main experiments on WMT14 English-German and English-French translation tasks. Translation quality is measured by the case-sensitive BLEU-4 metric (Papineni et al., 2002). Details about each dataset are as follows:
 English-German
 English-French: We used the WMT 2014 training data. The corpus contains 12M selected sentence pairs. We used the concatenation of newstest2012 and newstest2013 as our dev set, and newstest2014 as our test set.
Our NMT system is an attention-based encoder-decoder system, which employs a bidirectional recurrent network as its encoder and a two-layer hierarchical unidirectional recurrent network as its decoder, accompanied by an additive attention mechanism (Bahdanau et al., 2015). We replaced the recurrent unit with our proposed ATR model. More details are given in Appendix A.1.
We also conducted experiments on Chinese-English translation, natural language inference and Chinese word segmentation. Details and experiment results are provided in Appendix A.2.
4.2 Training
We set the maximum length of training instances to 80 words for both the English-German and English-French tasks. We used the byte pair encoding compression algorithm (Sennrich et al., 2016) to reduce the vocabulary size as well as to deal with the issue of rich morphology. We set the vocabulary size of both source and target languages to 40K for all translation tasks. All out-of-vocabulary words were replaced with a token “unk”.
We used 1000 hidden units for both encoder and decoder. All word embeddings had dimensionality 620. We initialized all model parameters randomly according to a uniform distribution ranging from -0.08 to 0.08. These tunable parameters were then optimized using the Adam algorithm (Kingma and Ba, 2015) with the two momentum parameters set to 0.9 and 0.999 respectively. Gradient clipping at 5.0 was applied to avoid the gradient explosion problem. We trained all models with a learning rate and batch size 80. We decayed the learning rate by a factor of 0.5 after each training epoch. Translations were generated by a beam search algorithm based on log-likelihood scores normalized by sentence length. We used a beam size of 10 in all the experiments. We also applied dropout on the output layer for the English-German and English-French tasks to avoid overfitting, with the dropout rate set to 0.2.
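The effect of normalizing log-likelihood scores by sentence length can be seen in a two-hypothesis example. This is only a sketch: dividing the summed log-probability by the number of tokens is an assumed form of normalization, and the helper name and the toy scores are hypothetical.

```python
def normalized_score(token_log_probs):
    """Average per-token log-likelihood of one hypothesis (assumed form)."""
    return sum(token_log_probs) / len(token_log_probs)

short_hyp = [-0.5, -0.5]              # total log-likelihood -1.0
long_hyp = [-0.4, -0.4, -0.4, -0.4]   # total log-likelihood -1.6

best_raw = max([short_hyp, long_hyp], key=sum)
best_norm = max([short_hyp, long_hyp], key=normalized_score)
print(best_raw is short_hyp, best_norm is long_hyp)  # True True
```

Without normalization the shorter hypothesis wins simply because it accumulates fewer negative terms; averaging per token removes that bias toward short outputs.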
To train deep NMT models, we adopted the GNMT architecture (Wu et al., 2016). We kept all the above settings, except the dimensionality of word embeddings and hidden states, which we set to 512.
4.3 Results on English-German Translation
The translation results are shown in Table 2. We also provide results of several existing systems that are trained with experimental settings comparable to ours. In particular, our single model yields a detokenized BLEU score of 21.99. To show that the proposed model can be orthogonal to previous methods that improve LSTM/GRU-based NMT, we integrate a single-layer context-aware (CA) encoder (Zhang et al., 2017b) into our system. The ATR+CA system further reaches 22.70 detokenized BLEU, outperforming the winner system (Buck et al., 2014) by a substantial 2 BLEU points. Enhanced with the deep GNMT architecture, the GNMT+ATR system yields a gain of 0.89 detokenized BLEU points over RNNSearch+ATR+CA and 1.60 detokenized BLEU points over RNNSearch+ATR. Notice that, unlike our system, which was trained on the parallel corpus alone, the winner system used a huge monolingual corpus to enhance its language model.
System  Architecture  Vocab  tok BLEU  detok BLEU
Existing end-to-end NMT systems
Jean et al. (2015)  RNNSearch (GRU) + unk replace + large vocab  500K  34.11  -
Luong et al. (2015b)  LSTM with 6 layers + PosUnk  40K  32.70  -
Sutskever et al. (2014)  LSTM with 4 layers  80K  30.59  -
Shen et al. (2016)  RNNSearch (GRU) + MRT + PosUnk  30K  34.23  -
Zhou et al. (2016)  LSTM with 16 layers + FF connections + 36M data  80K  37.70  -
Wu et al. (2016)  LSTM with 8 layers + RL-refined WPM + 36M data  32K  38.95  -
Wang et al. (2017a)  RNNSearch (GRU) with 4 layers + LAU  30K  35.10  -
Gehring et al. (2017a)  Deep Convolutional Encoder 20 layers with kernel width 5  30K  35.70  -
Vaswani et al. (2017)  Transformer with 6 layers + 36M data + base model  32K  38.10  -
Gehring et al. (2017b)  ConvS2S with 15 layers + 36M data  40K  40.46  -
Vaswani et al. (2017)  Transformer with 6 layers + 36M data + big model  32K  41.80  -
Wu et al. (2016)  LSTM with 8 layers + RL WPM + 36M data + ensemble  32K  41.16  -
Our end-to-end NMT systems
this work  RNNSearch + GRU + BPE  40K  35.89  33.41
this work  RNNSearch + LSTM + BPE  40K  36.95  34.15
this work  RNNSearch + ATR + BPE  40K  36.89  34.00
this work  RNNSearch + ATR + CA + BPE  40K  37.88  34.96
this work  GNMT + ATR + BPE  40K  38.59  35.67
this work  RNNSearch + ATR + CA + BPE + ensemble  40K  39.06  36.06
Table 3: Tokenized (tok) and detokenized (detok) BLEU scores on the WMT14 English-French translation task.
Compared with the existing LSTM-based deep NMT system of Luong et al. (2015a), our shallow/deep model achieves a gain of 2.41/3.26 tokenized BLEU points respectively. Under the same training condition, our ATR outperforms RAN by a margin of 0.34 tokenized BLEU points, and achieves competitive results against its GRU/LSTM counterparts. This suggests that although our ATR is much simpler than GRU, LSTM and RAN, it still possesses strong modeling capacity.
In comparison to several advanced deep NMT models, such as the Google NMT (8 layers, 24.61 tokenized BLEU) (Wu et al., 2016) and the LAU-connected NMT (4 layers, 23.80 tokenized BLEU) (Wang et al., 2017a), the performance of our shallow model (23.31) is competitive. In particular, when replacing LSTM in the Google NMT with our ATR model, the GNMT+ATR system achieves a BLEU score of 24.16, merely 0.45 BLEU points lower. Notice that although all systems use the same WMT14 training data, the tokenization in these works might differ from ours. Nevertheless, the overall results indicate the competitive strength of our model. In addition, SRU (Lei and Zhang, 2017), a recently proposed efficient recurrent unit, obtains a BLEU score of 20.70 with 10 layers, far behind ATR's.
We further ensemble eight likelihood-trained models with different random initializations for the ATR+CA system. The variance in the tokenized BLEU scores of these models is 0.07. As can be seen from Table 2, the ensemble system achieves a tokenized and detokenized BLEU score of 24.97 and 24.33 respectively, obtaining a gain of 1.66 and 1.63 BLEU points over the single model. The final result of the ensemble system, to the best of our knowledge, is a very promising result that can be reached by single-layer NMT systems on WMT14 English-German translation.
4.4 Results on English-French Translation
Unlike the above translation task, the WMT14 English-French translation task provides a significantly larger dataset. The full training data have approximately 36M sentence pairs, from which we only used 12M instances for experiments, following previous work (Jean et al., 2015; Gehring et al., 2017a; Luong et al., 2015b; Wang et al., 2017a). We show the results in Table 3.
Our shallow model achieves a tokenized BLEU score of 36.89, and 37.88 when it is equipped with the CA encoder, outperforming almost all the listed systems except the Google NMT (Wu et al., 2016), the ConvS2S (Gehring et al., 2017b) and the Transformer (Vaswani et al., 2017). Enhanced with the deep GNMT architecture, the GNMT+ATR system reaches a BLEU score of 38.59, which beats the base model version of the Transformer by a margin of 0.49 BLEU points. When we use four ensemble models (the variance in the tokenized BLEU scores of these ensemble models is 0.16), the ATR+CA system obtains another gain of 0.47 BLEU points, reaching a tokenized BLEU score of 39.06, which is comparable with several state-of-the-art systems.
5 Analysis
5.1 Analysis on Twin-Gated Mechanism
We provide an illustration of the actual relation between the learned input and forget gates in Figure 3. Clearly, these two gates show a strong negative correlation. When the input gate opens with high values, the forget gate tends to be closed. Quantitatively, on the whole test set, the Pearson's r of the input and forget gates is -0.9819, indicating a strong negative correlation.
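For reference, the correlation statistic can be computed as in the sketch below. The gate values here are synthetic, drawn from random numbers rather than taken from a trained model, so the resulting value says nothing about the reported figure; the gates are assumed to take the form σ(p + q) and σ(p − q), and the helper `pearson_r` is a hypothetical name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pearson_r(a, b):
    """Pearson correlation coefficient over all flattened entries."""
    a = np.ravel(a) - np.mean(a)
    b = np.ravel(b) - np.mean(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

rng = np.random.default_rng(0)
p = rng.normal(scale=0.3, size=2000)   # synthetic transformed inputs
q = rng.normal(scale=2.0, size=2000)   # synthetic transformed history
input_gate = sigmoid(p + q)
forget_gate = sigmoid(p - q)

r = pearson_r(input_gate, forget_gate)
print(r < 0)
print(np.isclose(r, np.corrcoef(input_gate, forget_gate)[0, 1]))
```

In this synthetic setup the history term varies more than the input term, which pushes the two gates in opposite directions and yields a negative coefficient.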
5.2 Analysis on Speed and Model Parameters
As mentioned in Section 3.2, ATR has far fewer model parameters and matrix transformations. We provide more details in this section by comparing against the following two NMT systems:

DeepRNNSearch (GRU): a deep GRU-equipped RNNSearch model (Wu et al., 2016) with 5 layers. We set the dimension of word embedding and hidden state to 620 and 1000 respectively.

Transformer: a purely attentional translator (Vaswani et al., 2017). We set the dimension of word embedding and filter size to 512 and 2048 respectively. The model was trained with a minibatch size of 256.
We also compare with the GRU- and LSTM-based RNNSearch. Unless otherwise mentioned, all other experimental settings for these models are the same as for our model. We implement all these models using the Theano library, and test the speed on one GeForce GTX TITAN X GPU card. We show the results in Table 4.
Model              #PMs     Train   Test
RNNSearch+GRU      83.5M    1996    168 (0.133)
RNNSearch+LSTM     93.3M    1919    167 (0.139)
RNNSearch+RAN      79.5M    2192    170 (0.129)
DeepRNNSearch      143.0M   894     70 (0.318)
Transformer        70.2M    4961    44 (0.485)
RNNSearch+ATR      67.8M    2518    177 (0.123)
RNNSearch+ATR+CA   63.1M    3993    186 (0.118)
Table 4: Number of model parameters (#PMs) and speed in words per second during training (Train) and decoding (Test).
We observe that the Transformer achieves the best training speed, processing 4961 words per second. This is reasonable since the Transformer can be trained fully in parallel. In contrast, DeepRNNSearch is the slowest system. As an RNN operates sequentially, stacking more RNN layers inevitably reduces the training efficiency. However, the situation is reversed when it comes to the decoding procedure. The Transformer generates merely 44 words per second while DeepRNNSearch reaches 70. This is because during decoding, all these beam search-based systems must generate the translation one word after another. Therefore the parallelization advantage of the Transformer disappears. In comparison to DeepRNNSearch, the Transformer spends extra time on performing self-attention over all previous hidden states.
Our model with the CA structure, using only 63.1M parameters, processes 3993 words per second during training and generates 186 words per second during decoding, which yields substantial speed improvements over the GRU- and LSTM-equipped RNNSearch. This is due to the light matrix computation in the recurrent units of ATR. Notice that the speed increase of ATR over GRU and LSTM does not reach 3x. This is because at each decoding step, there are mainly two types of computation: the recurrent unit and the softmax layer. The latter accounts for most of the computation, which, however, is the same for different models (LSTM/GRU/ATR).
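A back-of-the-envelope estimate illustrates why the softmax layer limits the overall speedup. The sketch below counts multiply-accumulate operations for one decoding step under simplifying assumptions that are not taken from the paper: a single recurrent unit per step, a softmax projection directly from the hidden state to the 40K vocabulary, and no attention, embedding or memory-bandwidth costs. The helper `macs` is a hypothetical name.

```python
def macs(rows, cols):
    """Multiply-accumulate count of one matrix-vector product."""
    return rows * cols

d_emb, d_h, vocab = 620, 1000, 40_000        # sizes from the training setup
pair = macs(d_h, d_emb) + macs(d_h, d_h)     # one input + one history matrix
recurrent = {"LSTM": 4 * pair, "GRU": 3 * pair, "ATR": 1 * pair}
softmax = macs(vocab, d_h)                   # assumed: hidden state -> vocabulary

for name, cost in recurrent.items():
    share = cost / (cost + softmax)
    print(f"{name}: recurrent share of step = {share:.1%}")

speedup = (recurrent["GRU"] + softmax) / (recurrent["ATR"] + softmax)
print(f"GRU -> ATR step speedup under this model: {speedup:.2f}x")
```

Even though the recurrent cost drops by a factor of three in this toy model, the step-level speedup stays close to one because the vocabulary projection is the same for every unit.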
5.3 Analysis on Dependency Modeling
As shown in Section 3.3, a hidden state in our ATR can be formulated as a weighted sum of the previous inputs. In this section, we quantitatively analyze the weights $g_k$ in Equation (10) induced from Equation (13). Inspired by Lee et al. (2017), we visualize the captured dependencies of an example in Figure 4, where we connect each word to the corresponding previous word with the highest weight $g_k$.
Our model can clearly discover strong local dependencies. For example, the tokens “unglück@@” and “lichen” should be a single word. Our model successfully associates “unglück@@” closely with the generation of “lichen” during decoding. In addition, our model can also detect non-consecutive long-distance dependencies. In particular, the prediction of “Parteien” relies heavily on the token “unglücklichen”, which actually entails an amod linguistic dependency relationship. These captured dependencies make our model more interpretable than LSTM/GRU.
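The linking rule used for such a visualization can be sketched as follows. The code takes per-step input and forget gates, computes for each position the unrolled weight of every earlier input (input gate at step k times all later forget gates), and picks the earlier position with the largest weight. Reducing the weight vector to a scalar by its mean is an assumption made for this sketch, as are the function name and the hand-made gate values.

```python
import numpy as np

def strongest_previous(i_gates, f_gates):
    """For each position t >= 1, return the index k < t with the largest
    mean unrolled weight i_k * f_{k+1} * ... * f_t (assumed scalar score)."""
    links = {}
    for t in range(1, len(i_gates)):
        carry = np.array(f_gates[t], dtype=float)
        best_k, best_score = t - 1, -np.inf
        for k in range(t - 1, -1, -1):
            score = float(np.mean(i_gates[k] * carry))
            if score > best_score:
                best_k, best_score = k, score
            carry = carry * f_gates[k]
        links[t] = best_k
    return links

# Hand-made one-dimensional gates: position 0 is written strongly,
# position 1 only weakly, and later steps forget little.
i_gates = np.array([[0.9], [0.1], [0.5]])
f_gates = np.array([[0.5], [0.9], [0.9]])
print(strongest_previous(i_gates, f_gates))  # {1: 0, 2: 0}
```

In the toy example, position 2 links back to position 0 rather than to its immediate neighbour, which is the kind of non-consecutive dependency the visualization is meant to expose.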
6 Conclusion and Future Work
This paper has presented a twin-gated recurrent network (ATR) to simplify neural machine translation. There are only two weight matrices and matrix transformations in the recurrent units of ATR, making it efficient in physical memory usage and running speed. To avoid the vanishing gradient problem, ATR introduces a twin-gated mechanism to generate an input gate and a forget gate through linear addition and subtraction operations respectively, without introducing any additional parameters. The simplifications allow ATR to produce interpretable results.
Experiments on English-German and English-French translation tasks demonstrate the effectiveness of our model. They also show that ATR can be orthogonal to and applied with methods that improve LSTM/GRU-based NMT, as indicated by the promising performance of the ATR+CA system. Further analyses reveal that ATR can be trained more efficiently than GRU. It is also able to transparently model long-distance dependencies.
We also adapt our ATR to other natural language processing tasks. Experiments show encouraging performance of our model on Chinese-English translation, natural language inference and Chinese word segmentation, demonstrating its generality and applicability to various NLP tasks.
Acknowledgments
The authors were supported by National Natural Science Foundation of China (Grants No. 61672440, 61622209 and 61861130364), the Fundamental Research Funds for the Central Universities (Grant No. ZK1024), and Scientific Research Project of National Language Committee of China (Grant No. YB13549). Biao Zhang greatly acknowledges the support of the Baidu Scholarship. We also thank the reviewers for their insightful comments.
References
 Antonino and Federico (2018) M. Antonino and M. Federico. 2018. Deep Neural Machine Translation with WeaklyRecurrent Units. ArXiv eprints.
 Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proc. of ICLR.
 Bowman et al. (2015) Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large annotated corpus for learning natural language inference. In Proc. of EMNLP. Association for Computational Linguistics.
 Bowman et al. (2016) Samuel R. Bowman, Jon Gauthier, Abhinav Rastogi, Raghav Gupta, Christopher D. Manning, and Christopher Potts. 2016. A fast unified model for parsing and sentence understanding. In Proc. of ACL, pages 1466–1477.
 Bradbury et al. (2016) James Bradbury, Stephen Merity, Caiming Xiong, and Richard Socher. 2016. Quasirecurrent neural networks. CoRR, abs/1611.01576.
 Buck et al. (2014) Christian Buck, Kenneth Heafield, and Bas van Ooyen. 2014. Ngram counts and language models from the common crawl. In Proc. of LREC, pages 3579–3584, Reykjavik, Iceland.
 Chen et al. (2015) Xinchi Chen, Xipeng Qiu, Chenxi Zhu, Pengfei Liu, and Xuanjing Huang. 2015. Long shortterm memory neural networks for chinese word segmentation. In Proc. of EMNLP, pages 1197–1206.
 Cheng et al. (2016) Jianpeng Cheng, Li Dong, and Mirella Lapata. 2016. Long shortterm memorynetworks for machine reading. In Proc. of EMNLP, pages 551–561.
 Chung et al. (2014) Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR.
 Elman (1990) Jeffrey L Elman. 1990. Finding structure in time. Cognitive science, 14(2):179–211.
 Gehring et al. (2017a) Jonas Gehring, Michael Auli, David Grangier, and Yann N. Dauphin. 2017a. A convolutional encoder model for neural machine translation. In Proc. of ACL, pages 123–135.
 Gehring et al. (2017b) Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017b. Convolutional sequence to sequence learning.
 Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9:1735–1780.
 Jean et al. (2015) Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proc. of ACL-IJCNLP, pages 1–10.
 Kingma and Ba (2015) Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. Proc. of ICLR.
 Lee et al. (2017) Kenton Lee, Omer Levy, and Luke Zettlemoyer. 2017. Recurrent additive networks. CoRR, abs/1705.07393.
 Lei and Zhang (2017) T. Lei and Y. Zhang. 2017. Training RNNs as Fast as CNNs. ArXiv e-prints.
 Liu et al. (2016) Yang Liu, Chengjie Sun, Lei Lin, and Xiaolong Wang. 2016. Learning natural language inference using bidirectional LSTM model and inner-attention. CoRR, abs/1605.09090.
 Luong et al. (2015a) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015a. Effective approaches to attention-based neural machine translation. In Proc. of EMNLP, pages 1412–1421.
 Luong et al. (2015b) Thang Luong, Ilya Sutskever, Quoc Le, Oriol Vinyals, and Wojciech Zaremba. 2015b. Addressing the rare word problem in neural machine translation. In Proc. of ACL-IJCNLP, pages 11–19.
 Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proc. of ACL, pages 311–318.
 Pei et al. (2014) Wenzhe Pei, Tao Ge, and Baobao Chang. 2014. Max-margin tensor neural network for Chinese word segmentation. In Proc. of ACL, pages 293–303, Baltimore, Maryland. Association for Computational Linguistics.
 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Proc. of EMNLP, pages 1532–1543.
 Rocktäschel et al. (2016) Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, and Phil Blunsom. 2016. Reasoning about entailment with neural attention. In Proc. of ICLR.
 Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proc. of ACL, pages 1715–1725.
 Shen et al. (2016) Shiqi Shen, Yong Cheng, Zhongjun He, Wei He, Hua Wu, Maosong Sun, and Yang Liu. 2016. Minimum risk training for neural machine translation. In Proc. of ACL, pages 1683–1692, Berlin, Germany. Association for Computational Linguistics.
 Sproat and Emerson (2003) Richard Sproat and Thomas Emerson. 2003. The first international Chinese word segmentation bakeoff. In Proceedings of the Second SIGHAN Workshop on Chinese Language Processing - Volume 17, SIGHAN ’03, pages 133–143.
 Su et al. (2018a) Jinsong Su, Shan Wu, Deyi Xiong, Yaojie Lu, Xianpei Han, and Biao Zhang. 2018a. Variational recurrent neural machine translation. arXiv preprint arXiv:1801.05119.
 Su et al. (2018b) Jinsong Su, Jiali Zeng, Deyi Xiong, Yang Liu, Mingxuan Wang, and Jun Xie. 2018b. A hierarchy-to-sequence attentional neural machine translation model. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 26(3):623–632.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 3104–3112. Curran Associates, Inc.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc.
 Vendrov et al. (2015) Ivan Vendrov, Ryan Kiros, Sanja Fidler, and Raquel Urtasun. 2015. Order-embeddings of images and language. CoRR, abs/1511.06361.
 Wang et al. (2017) M. Wang, Z. Lu, J. Zhou, and Q. Liu. 2017. Deep Neural Machine Translation with Linear Associative Unit. ArXiv e-prints.
 Wang et al. (2017a) Mingxuan Wang, Zhengdong Lu, Jie Zhou, and Qun Liu. 2017a. Deep neural machine translation with linear associative unit. In Proc. of ACL, pages 136–145, Vancouver, Canada.
 Wang and Jiang (2016) Shuohang Wang and Jing Jiang. 2016. Learning natural language inference with LSTM. In Proc. of NAACL, pages 1442–1451.
 Wang et al. (2017b) Zhiguo Wang, Wael Hamza, and Radu Florian. 2017b. Bilateral multi-perspective matching for natural language sentences. CoRR, abs/1702.03814.
 Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Lukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.
 Xue et al. (2005) Naiwen Xue, Fei Xia, Fudong Chiou, and Marta Palmer. 2005. The Penn Chinese Treebank: Phrase structure annotation of a large corpus. Nat. Lang. Eng., 11(2):207–238.
 Zhang et al. (2017a) Biao Zhang, Deyi Xiong, and Jinsong Su. 2017a. A GRU-gated attention model for neural machine translation. CoRR, abs/1704.08430.
 Zhang et al. (2018a) Biao Zhang, Deyi Xiong, and Jinsong Su. 2018a. Accelerating neural transformer via an average attention network. In Proc. of ACL, pages 1789–1798. Association for Computational Linguistics.
 Zhang et al. (2017b) Biao Zhang, Deyi Xiong, Jinsong Su, and Hong Duan. 2017b. A context-aware recurrent encoder for neural machine translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, PP(99):1–1.
 Zhang et al. (2017c) Jinchao Zhang, Mingxuan Wang, Qun Liu, and Jie Zhou. 2017c. Incorporating word reordering knowledge into attention-based neural machine translation. In Proc. of ACL, pages 1524–1534, Vancouver, Canada. Association for Computational Linguistics.
 Zhang et al. (2018b) Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Rongrong Ji, and Hongji Wang. 2018b. Asynchronous bidirectional decoding for neural machine translation. CoRR, abs/1801.05122.
 Zheng et al. (2013) Xiaoqing Zheng, Hanyang Chen, and Tianyu Xu. 2013. Deep learning for Chinese word segmentation and POS tagging. In Proc. of EMNLP, pages 647–657.
 Zheng et al. (2017) Zaixiang Zheng, Hao Zhou, Shujian Huang, Lili Mou, Xinyu Dai, Jiajun Chen, and Zhaopeng Tu. 2017. Modeling past and future for neural machine translation. CoRR, abs/1711.09502.
 Zhou et al. (2016) Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, and Wei Xu. 2016. Deep recurrent models with fast-forward connections for neural machine translation. Transactions of the Association for Computational Linguistics, 4:371–383.
Appendix A Appendix
| System | MT05 | MT02 | MT03 | MT04 | MT06 | MT08 |
|---|---|---|---|---|---|---|
| Existing systems | | | | | | |
| Coverage (Wang et al., 2017) | 34.91 | - | 34.49 | 38.34 | 34.25 | - |
| MemDec (Wang et al., 2017) | 35.91 | - | 36.16 | 39.81 | 35.98 | - |
| DeepLAU (Wang et al., 2017) | 38.07 | - | 39.35 | 41.15 | 37.29 | - |
| Distortion (Zhang et al., 2017c) | 36.71 | - | 38.33 | 40.11 | 35.29 | - |
| CAEncoder (Zhang et al., 2017b) | 36.44 | 40.12 | 37.63 | 39.83 | 35.44 | 27.34 |
| FPNMT (Zheng et al., 2017) | 36.75 | 39.65 | 37.90 | 40.37 | 34.55 | - |
| ASDBNMT (Zhang et al., 2018b) | 38.84 | - | 40.02 | 42.32 | 38.38 | - |
| Our end-to-end NMT systems | | | | | | |
| this work | 39.71 | 42.95 | 41.71 | 43.71 | 39.61 | 31.14 |
A.1 Neural Machine Translation with ATR
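We replace LSTM/GRU with our proposed ATR to build NMT models under the attention-based encoder-decoder framework (Bahdanau et al., 2015). The encoder that reads a source sentence is a bidirectional recurrent network. Formally, given a source sentence $\mathbf{x} = \{x_1, \ldots, x_n\}$, the encoder is formulated as follows:
$$\overrightarrow{h}_i = \mathrm{ATR}(\overrightarrow{h}_{i-1}, x_i), \qquad \overleftarrow{h}_i = \mathrm{ATR}(\overleftarrow{h}_{i+1}, x_i) \qquad (11)$$
where $\mathrm{ATR}(\cdot)$ is defined by Equations (6) and (9). The forward and backward hidden states are concatenated to represent the $i$-th word: $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$.
The decoder is a conditional language model that predicts the $j$-th target word via a multilayer perceptron:
$$p(y_j \mid \mathbf{x}, \mathbf{y}_{<j}) = g(y_{j-1}, s_j, c_j) \qquad (12)$$
where $\mathbf{y}_{<j}$ is a partial translation, $c_j$ is the translation-sensitive semantic vector computed via the attention mechanism (Bahdanau et al., 2015) based on the source states $\{h_i\}$ and the internal target state $\tilde{s}_j$, and $s_j$ is the $j$-th target-side hidden state calculated through a two-level hierarchy:
$$\tilde{s}_j = \mathrm{ATR}(s_{j-1}, y_{j-1}), \qquad s_j = \mathrm{ATR}(\tilde{s}_j, c_j) \qquad (13)$$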
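As a rough illustration of the bidirectional encoder described in this section, the following sketch runs a recurrent cell left-to-right and right-to-left over a sentence and concatenates the two states at each position. The names `run_direction`, `bidirectional_encode` and `toy_cell` are hypothetical, the zero initial state is an assumption, and `toy_cell` is a placeholder running sum rather than the ATR unit.

```python
from typing import Callable, List

Vector = List[float]
Cell = Callable[[Vector, Vector], Vector]  # (previous state, input) -> new state


def run_direction(cell: Cell, inputs: List[Vector], hidden_size: int,
                  reverse: bool = False) -> List[Vector]:
    """Run a recurrent cell over the inputs, right-to-left if reverse is set."""
    positions = range(len(inputs))
    order = reversed(positions) if reverse else positions
    states: List[Vector] = [[] for _ in inputs]
    h = [0.0] * hidden_size  # zero initial state (an assumption)
    for i in order:
        h = cell(h, inputs[i])
        states[i] = h
    return states


def bidirectional_encode(cell_fw: Cell, cell_bw: Cell, inputs: List[Vector],
                         hidden_size: int) -> List[Vector]:
    """Concatenate forward and backward states position by position."""
    forward = run_direction(cell_fw, inputs, hidden_size)
    backward = run_direction(cell_bw, inputs, hidden_size, reverse=True)
    return [f + b for f, b in zip(forward, backward)]


def toy_cell(h_prev: Vector, x: Vector) -> Vector:
    """Placeholder cell (running sum); stands in for a real recurrent unit."""
    return [hp + xi for hp, xi in zip(h_prev, x)]


sentence = [[1.0], [2.0], [3.0]]  # three one-dimensional "word embeddings"
encoded = bidirectional_encode(toy_cell, toy_cell, sentence, hidden_size=1)
print(encoded)
```

Each output vector has twice the hidden size, since it holds the forward summary of the prefix followed by the backward summary of the suffix.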
A.2 Additional Experiments
A.2.1 Experiments on Chinese-English Translation
Our training data consists of 1.25M sentence pairs, with 27.9M Chinese words and 34.5M English words (the corpus contains LDC2002E18, LDC2003E07, LDC2003E14, the Hansards portion of LDC2004T07, LDC2004T08 and LDC2005T06). We used the NIST 2005 dataset as our dev set, and the NIST 2002, 2003, 2004, 2006 and 2008 datasets as our test sets. Unlike the WMT14 translation tasks, we used a word-based vocabulary for Chinese-English, preserving the top 30K most frequent source and target words. The case-insensitive BLEU-4 metric was used to evaluate translation quality.
Translation Results
We compare our model against several advanced models on the same dataset, including:

Coverage (Wang et al., 2017): an attention-based NMT system enhanced with a coverage mechanism to handle the over-translation and under-translation problems.

MemDec (Wang et al., 2017): an attention-based NMT system that replaces the vanilla decoder with a memory-enhanced decoder to better capture important information for translation.

DeepLAU (Wang et al., 2017): a deep attention-based NMT system integrated with linear associative units that handle gradient propagation better.

Distortion (Zhang et al., 2017c): an attention-based NMT system that incorporates word reordering knowledge to encourage more accurate attention.

CAEncoder (Zhang et al., 2017b): the same architecture as our model, but with GRU units.

FPNMT (Zheng et al., 2017): an attention-based NMT system that leverages past and future information to improve the attention model and the decoder states, also using addition and subtraction operations.

ASDBNMT (Zhang et al., 2018b): an attention-based NMT system equipped with a backward decoder to explore bidirectional decoding.
Table 5 summarizes the results. Although our model does not involve any sub-networks for modeling coverage, distortion, memory or future context, it outperforms all of these advanced models on every reported set, achieving an average BLEU score of 39.82 over the five test sets. This suggests that 1) shallow models are also capable of generating high-quality translations, and 2) our ATR model is indeed able to capture translation correspondence in spite of its simplicity.
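The reported average can be checked directly from the "this work" row of Table 5. The short sketch below averages the five test-set scores (MT05 is the dev set and is excluded); the variable names are illustrative only.

```python
# Test-set BLEU scores for "this work", copied from Table 5.
# MT05 is the development set and is therefore left out.
test_bleu = {
    "MT02": 42.95,
    "MT03": 41.71,
    "MT04": 43.71,
    "MT06": 39.61,
    "MT08": 31.14,
}

average_bleu = sum(test_bleu.values()) / len(test_bleu)
print(f"{average_bleu:.2f}")  # prints 39.82
```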
A.2.2 Experiments on Natural Language Inference
Given two sentences, namely a premise and a hypothesis, this task aims at recognizing whether the premise entails the hypothesis. We used the Stanford Natural Language Inference Corpus (SNLI) (Bowman et al., 2015) for this experiment, which is a collection of 570k human-written English sentence pairs manually labeled for balanced classification with the labels entailment, contradiction, and neutral. We formulated this problem as a three-way classification task.
We employed the attentional architecture of Rocktäschel et al. (2016) as our basic model, and replaced its recurrent unit with our ATR model. Word embeddings were initialized with the pre-trained 300-D GloVe vectors (Pennington et al., 2014) and kept fixed. The hidden size of ATR was also set to 300. We optimized model parameters using the Adam method (Kingma and Ba, 2015) with hyperparameters $\beta_1$ and $\beta_2$. The learning rate was fixed at 0.0005. The mini-batch size was set to 128. Dropout was applied to both the word embedding layer and the pre-classification layer to avoid overfitting, with a rate of 0.15. The maximum number of training epochs was set to 20.
Classification Results
| Model | Dim. | #Params | Train acc. (%) | Test acc. (%) |
|---|---|---|---|---|
| LSTM encoders (Bowman et al., 2016) | 300 | 3.0m | 83.9 | 80.6 |
| GRU encoders w/ pre-training (Vendrov et al., 2015) | 1024 | 15m | 98.8 | 81.4 |
| BiLSTM encoders with intra-attention (Liu et al., 2016) | 600 | 2.8m | 84.5 | 84.2 |
| LSTMs w/ word-by-word attention (Rocktäschel et al., 2016) | 100 | 250k | 85.3 | 83.5 |
| mLSTM word-by-word attention model (Wang and Jiang, 2016) | 300 | 1.9m | 92.0 | 86.1 |
| LSTMN with deep attention fusion (Cheng et al., 2016) | 450 | 3.4m | 88.5 | 86.3 |
| BiMPM (Wang et al., 2017b) | 100 | 1.6m | 90.9 | 87.5 |
| this work with GRU | 300 | 3.2m | 91.0 | 84.6 |
| this work with ATR | 300 | 1.5m | 90.9 | 85.6 |

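| Model | MSRA P | MSRA R | MSRA F | CTB6 P | CTB6 R | CTB6 F |
|---|---|---|---|---|---|---|
| (Zheng et al., 2013) | 92.9 | 93.6 | 93.3 | 94.0 | 93.1 | 93.6 |
| (Pei et al., 2014) | 94.6 | 94.2 | 94.4 | 94.4 | 93.4 | 93.9 |
| (Chen et al., 2015) | 96.7 | 96.2 | 96.4 | 95.0 | 94.8 | 94.9 |
| this work + LSTM | 95.5 | 94.9 | 95.2 | 93.3 | 93.1 | 93.2 |
| this work + GRU | 95.2 | 95.1 | 95.1 | 93.3 | 93.0 | 93.2 |
| this work + ATR | 95.3 | 95.1 | 95.2 | 94.0 | 93.9 | 94.0 |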
Table 6 shows the results. The GRU-equipped model in our implementation achieves a test accuracy of 84.6% with about 3.2m trainable parameters, outperforming the LSTM-based counterpart (Rocktäschel et al., 2016) by a margin of 1.1 points. By contrast, the same architecture with the ATR model yields a test accuracy of 85.6% with only 1.5m parameters. In other words, with fewer than half the parameters, our ATR model gains a further 1.0 points over the GRU variant, reaching performance comparable to some deep architectures (Cheng et al., 2016).
A.2.3 Experiments on Chinese Word Segmentation
Chinese word segmentation (CWS) is a fundamental preprocessing step for Chinese-related NLP tasks. Unlike many other languages, Chinese is written without explicit word delimiters. Therefore, before performing in-depth modeling, researchers need to segment each sentence into a sequence of tokens, which is exactly the goal of CWS.
Following previous work (Chen et al., 2015), we formulate CWS as a sequence labeling task. Each character in a sentence is assigned a label from the set {B, M, E, S}, where B, M and E indicate the Begin, Middle and End of a multi-character word respectively, and S denotes a Single-character word. Given a sequence of characters, we first embed them individually through a character embedding layer, followed by a bidirectional RNN layer that generates a context-sensitive representation for each character. The output representations are then passed through a CRF inference layer to capture dependencies among character labels. The whole model is optimized using a max-margin objective that minimizes the differences between predicted and gold label sequences.
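To make the {B, M, E, S} labeling scheme concrete, the sketch below converts a segmented sentence into character labels and back. The function names `words_to_bmes` and `bmes_to_words` and the example sentence are illustrative and not taken from the original implementation.

```python
from typing import List


def words_to_bmes(words: List[str]) -> List[str]:
    """Map a list of words to one B/M/E/S label per character."""
    labels: List[str] = []
    for word in words:
        if not word:
            continue  # ignore empty tokens
        if len(word) == 1:
            labels.append("S")
        else:
            labels.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return labels


def bmes_to_words(chars: str, labels: List[str]) -> List[str]:
    """Recover words by closing a word at every E or S label."""
    words: List[str] = []
    current = ""
    for ch, label in zip(chars, labels):
        current += ch
        if label in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # unfinished word at the end of a malformed label sequence
        words.append(current)
    return words


gold_words = ["我", "喜欢", "机器翻译"]
gold_labels = words_to_bmes(gold_words)
print(gold_labels)  # ['S', 'B', 'E', 'B', 'M', 'M', 'E']
print(bmes_to_words("".join(gold_words), gold_labels))
```

The second function corresponds to the decoding step: once the model has predicted a label sequence, the segmentation is read off by cutting after each E or S.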
We used the MSRA and CTB6 datasets to evaluate our model. The former is provided by the second International Chinese Word Segmentation Bakeoff (Sproat and Emerson, 2003), and the latter is from Chinese TreeBank 6.0 (LDC2007T36) (Xue et al., 2005). For the MSRA dataset, we used the first 90% of the training sentences as the training set and the rest as the development set. For the CTB6 dataset, we divided the training, development and test sets in the same way as Chen et al. (2015). Precision, recall, F1-score and out-of-vocabulary (OOV) word recall, calculated by the standard bakeoff scoring program, were used for evaluation.
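For readers unfamiliar with word-level segmentation scoring, the sketch below computes precision, recall and F1 by comparing character-offset spans of predicted and gold words. It is a simplified stand-in, not the bakeoff scoring program itself, and it omits OOV recall; the names `to_spans` and `segmentation_prf` are hypothetical.

```python
from typing import List, Set, Tuple


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Turn a word sequence into a set of (start, end) character offsets."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_prf(gold_words: List[str], pred_words: List[str]):
    """Precision, recall and F1 over words whose boundaries both match."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["我", "喜欢", "机器翻译"]
pred = ["我", "喜欢", "机器", "翻译"]  # the last word is over-segmented
p, r, f = segmentation_prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

In this toy case two of the four predicted words match gold words exactly, so precision is 2/4, recall is 2/3, and F1 is their harmonic mean.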
We set the dimensionality of both the character embeddings and the RNN hidden state to 300. Model parameters were tuned with the Adam algorithm (Kingma and Ba, 2015) using default hyperparameters ($\beta_1 = 0.9$, $\beta_2 = 0.999$) and a mini-batch size of 128. Gradients were clipped when their norm exceeded 1.0 to avoid gradient explosion. We applied dropout to both the character embedding layer and the pre-CRF layer with a rate of 0.2. The discount parameter in the max-margin objective was set to 0.2. The maximum number of training epochs was set to 50. The learning rate was initially set to 0.0005 and halved after each epoch.
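The clipping rule and learning-rate schedule above can be sketched in a few lines of plain Python. This is an illustration under assumptions: the text does not say whether the norm is taken over all parameters jointly or per tensor, so the sketch clips one flat gradient vector by its L2 norm, and the function names are hypothetical.

```python
import math
from typing import List


def clip_by_norm(grads: List[float], max_norm: float = 1.0) -> List[float]:
    """Rescale the gradient vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        scale = max_norm / norm
        return [g * scale for g in grads]
    return list(grads)


def learning_rate(epochs_completed: int, initial: float = 0.0005) -> float:
    """Learning rate after halving once per completed epoch."""
    return initial * 0.5 ** epochs_completed


clipped = clip_by_norm([3.0, 4.0])          # norm 5.0 is rescaled to norm 1.0
schedule = [learning_rate(e) for e in range(4)]
print(clipped)
print(schedule)
```

Under this schedule the step size shrinks geometrically, so by the later epochs of a 50-epoch budget the updates are vanishingly small.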
Model Performance
Table 7 shows the overall performance. We observe that our ATR model performs as well as both GRU and LSTM on this task. ATR yields an F1-score of 95.2% and 94.0% on the MSRA and CTB6 datasets respectively, matching or exceeding GRU (95.1%/93.2%) and LSTM (95.2%/93.2%). In particular, ATR achieves better results on CTB6, with a gain of 0.8 F1 points over GRU and LSTM. This further demonstrates the effectiveness of the proposed ATR model.