This task studies how well word order information is learned by different neural networks.
Self-attention networks (SAN) have attracted a lot of interest due to their high parallelization and strong performance on a variety of NLP tasks, e.g. machine translation. Because it lacks a recurrence structure such as that of recurrent neural networks (RNN), SAN is presumed to be weak at learning positional information of words for sequence modeling. However, this speculation has not been empirically confirmed, nor have explanations been explored for its strong performance on machine translation tasks when "lacking positional information". To this end, we propose a novel word reordering detection task to quantify how well word order information is learned by SAN and RNN. Specifically, we randomly move one word to another position, and examine whether a trained model can detect both the original and inserted positions. Experimental results reveal that: 1) SAN trained on word reordering detection indeed has difficulty learning positional information even with position embedding; and 2) SAN trained on machine translation learns better positional information than its RNN counterpart, in which position embedding plays a critical role. Although the recurrence structure makes the model more universally effective at learning word order, learning objectives matter more in downstream tasks such as machine translation.
Self-attention networks (SAN) have shown promising empirical results in a variety of natural language processing (NLP) tasks, such as machine translation Vaswani et al. (2017), semantic role labelling Strubell et al. (2018), and language representations Devlin et al. (2019). The popularity of SAN lies in its high parallelization in computation, and its flexibility in modeling dependencies regardless of distance by explicitly attending to all the signals. Position embedding Gehring et al. (2017) is generally deployed to capture sequential information for SAN Vaswani et al. (2017); Shaw et al. (2018).
Recent studies claimed that SAN with position embedding is still weak at learning word order information, due to the lack of a recurrence structure that is essential for sequence modeling Shen et al. (2018a); Chen et al. (2018); Hao et al. (2019). However, such claims are mainly based on a theoretical argument, which has not been empirically validated. In addition, this argument cannot explain why SAN-based models outperform their RNN counterparts in machine translation, a benchmark sequence modeling task Vaswani et al. (2017).
Our goal in this work is to empirically assess the ability of SAN to learn word order. We focus on asking the following research questions:
Q1: Is recurrence structure obligatory for learning word order, and does the conclusion hold in different scenarios (e.g., translation)?
Q2: Is the model architecture the critical factor for learning word order in downstream tasks such as machine translation?
Q3: Is position embedding powerful enough to capture word order information for SAN?
We approach these questions with a novel probing task – word reordering detection (WRD), which aims to detect the positions of randomly reordered words in the input sentence. We compare SAN with RNN, as well as directional SAN (DiSAN, Shen et al., 2018a) that augments SAN with recurrence modeling. In this study, we focus on the encoders implemented with different architectures, so as to investigate their abilities to learn word order information of the input sequence. The encoders are trained on objectives like detection accuracy and machine translation, to study the influences of learning objectives.
Our experimental results reveal that: (Q1) SAN indeed underperforms the architectures with recurrence modeling (i.e. RNN and DiSAN) on the WRD task, while this conclusion does not hold in machine translation: SAN trained with the translation objective outperforms both RNN and DiSAN on detection accuracy; (Q2) Learning objectives matter more than model architectures in downstream tasks such as machine translation; and (Q3) Position encoding is good enough for SAN in machine translation, while DiSAN is a more universally-effective mechanism to learn word order information for SAN.
The key contributions are:
We design a novel probing task along with the corresponding benchmark model, which can assess the abilities of different architectures to learn word order information (the data and code are released at: https://github.com/baosongyang/WRD).
Our study dispels the doubt on the inability of SAN to learn word order information in machine translation, indicating that the learning objective can greatly influence the suitability of an architecture for downstream tasks.
In order to investigate the ability of self-attention networks to extract word order information, in this section, we design an artificial task to evaluate the abilities of the examined models to detect the erroneous word orders in a given sequence.
Given a sentence $X = \{x_1, \dots, x_N\}$, we randomly pop a word $x_i$ and insert it into another position $j$ ($1 \le i, j \le N$ and $i \ne j$). The objective of this task is to detect both the position from which the word is popped out (labeled as “O”) and the position at which the word is inserted (labeled as “I”). In the example in Figure 1 (a), the word “hold” is moved from its original slot to a new slot; accordingly, these two slots are labelled as “O” and “I”, respectively. To detect word reordering exactly, the examined models have to learn to recognize both normal and abnormal word order in a sentence.
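The construction of a single WRD instance can be sketched in a few lines of Python. This is a minimal illustration and not the released data pipeline: the function name `make_wrd_instance`, the 0-based indexing, and the convention that “I” is the index of the moved word in the reordered sentence while “O” is its index in the original sentence are all assumptions made for this example.

```python
import random

def make_wrd_instance(tokens, rng):
    """Move one randomly chosen word to a different position.

    Returns (reordered_tokens, orig_idx, ins_idx). orig_idx is the index the
    word was popped from ("O"); ins_idx is the index it occupies after the
    move ("I"). Both conventions are assumptions of this sketch.
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to reorder")
    orig_idx = rng.randrange(len(tokens))
    # Draw the insertion index from every position except the original one.
    ins_idx = rng.choice([j for j in range(len(tokens)) if j != orig_idx])
    reordered = list(tokens)
    word = reordered.pop(orig_idx)
    reordered.insert(ins_idx, word)
    return reordered, orig_idx, ins_idx

rng = random.Random(0)
sentence = "the boy dropped the ball".split()
reordered, orig_idx, ins_idx = make_wrd_instance(sentence, rng)
print(reordered, "O =", orig_idx, "I =", ins_idx)
```

Note that when two identical tokens are adjacent, a move can leave the surface string unchanged; a real pipeline may want to resample in that case.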
Figure 1 (a) depicts the architecture of the position detector. Let the sequential representations $H = \{h_1, \dots, h_N\}$ be the output of each encoder noted in Section 3, which are fed to the output layer (Figure 1 (b)). Since only one pair of “I” and “O” labels should be generated in the output sequence, we cast the task as a pointer detection problem Vinyals et al. (2015). To this end, we turn to an output layer that is commonly used in the reading comprehension task Wang and Jiang (2017); Du and Cardie (2017), which aims to identify the start and end positions of the answer in the given text. (Contrary to reading comprehension, in which the start and end positions are ordered, “I” and “O” do not have to be ordered in our task; that is, the popped word can be inserted at a position either to the left or to the right.)
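The output layer consists of two sub-layers, which progressively predict the probabilities of each position being labelled as “I” and “O”. The probability distribution of the sequence being labelled as “I” is calculated as:
$$P_I = \mathrm{SoftMax}\left(U_I^{\top} \tanh(W_I H)\right) \in \mathbb{R}^{N}$$
where $H \in \mathbb{R}^{d \times N}$ is the sequence of encoder representations, $W_I \in \mathbb{R}^{d \times d}$ and $U_I \in \mathbb{R}^{d}$ are trainable parameters, and $d$ is the dimensionality of $H$.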
The second layer aims to locate the original position “O”, conditioned on the predicted popped word at the position “I”. (We also tried to predict the position of “O” without feeding the approximate embedding, i.e. predicting “I” and “O” individually; this slightly underperforms the current model.) To make the learning process differentiable, we follow Xu et al. (2017) and use the weighted sum of hidden states as the approximate embedding of the popped word. The embedding subsequently serves as a query that attends to the sequence to find which position is most similar to the original position of the popped word. The resulting attention distribution over the sequence gives the probability of each position being labelled as “O”.
In the training process, the objective is to minimize the cross entropy of the true inserted and original positions, which is the sum of the negative log probabilities of the ground-truth indices under the predicted distributions:
$$\mathcal{L} = -\left(Q_I^{\top} \log P_I + Q_O^{\top} \log P_O\right)$$
where $P_I$ and $P_O$ are the predicted distributions for “I” and “O”, and $Q_I$ and $Q_O$ are one-hot vectors indicating the ground-truth indices for the inserted and original positions. During prediction, we choose the positions with the highest probabilities from the distributions $P_I$ and $P_O$ as “I” and “O”, respectively. In the instance in Figure 1 (a), the highest-probability position in each distribution is labelled as the inserted position “I” and the original position “O”, respectively.
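The two-step output layer and its training loss can be illustrated with a small pure-Python sketch. It is written under several assumptions and is not the authors' implementation: the names `W_i` and `u_i` for the first sub-layer's parameters are hypothetical, the second sub-layer uses plain scaled dot-product attention without any learned query or key projection, and the toy hidden states are made up for the example.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(W, v):
    return [dot(row, v) for row in W]

def detect(H, W_i, u_i):
    """H is a list of N hidden vectors of size d. Returns (p_i, p_o)."""
    d = len(H[0])
    # First sub-layer: score every position, then normalise over the sequence.
    p_i = softmax([dot(u_i, [math.tanh(z) for z in matvec(W_i, h)]) for h in H])
    # Approximate embedding of the moved word: p_i-weighted sum of hidden states.
    e = [sum(p * h[k] for p, h in zip(p_i, H)) for k in range(d)]
    # Second sub-layer: the embedding is the query of a dot-product attention
    # over the sequence (the 1/sqrt(d) scaling is an assumption of this sketch).
    p_o = softmax([dot(e, h) / math.sqrt(d) for h in H])
    return p_i, p_o

def wrd_loss(p_i, p_o, gold_i, gold_o):
    # Sum of negative log probabilities of the two ground-truth indices.
    return -math.log(p_i[gold_i]) - math.log(p_o[gold_o])

H = [[0.1, 0.3], [0.9, -0.2], [-0.4, 0.5]]   # toy encoder outputs, N=3, d=2
W_i = [[0.5, -0.1], [0.2, 0.7]]
u_i = [1.0, -1.0]
p_i, p_o = detect(H, W_i, u_i)
pred_i = max(range(len(p_i)), key=p_i.__getitem__)
pred_o = max(range(len(p_o)), key=p_o.__getitem__)
print(pred_i, pred_o, wrd_loss(p_i, p_o, gold_i=1, gold_o=2))
```

Because the approximate embedding is a soft mixture rather than a hard argmax selection, gradients from the “O” loss can flow back into the “I” scores, which is the point of the weighted-sum construction described above.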
In this study, we strive to empirically test whether SAN is indeed weak at learning positional information, and to explain the strong performance of SAN on machine translation. In response to the three research questions in Section 1, we use the following experimental settings:
Q1: We compare SAN with two recurrence architectures, RNN and DiSAN, on the WRD task, to quantify their abilities to learn word order (Section 3.1).
Q2: To compare the effects of learning objectives and model architectures, we train each encoder under two scenarios, i.e. trained on objectives like WRD accuracy and on machine translation (Section 3.2).
Q3: The strength of position encoding is appraised by ablating position encoding and recurrence modeling for SAN.
RNN and SAN are commonly used to produce sentence representations on NLP tasks Cho et al. (2014); Lin et al. (2017); Chen et al. (2018). As shown in Figure 2, we investigate three architectures in this study. Mathematically, let $X = \{x_1, \dots, x_N\}$ be the embedding matrix of the input sentence, and $H = \{h_1, \dots, h_N\}$ be the output sequence of representations.
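RNN sequentially produces each state:
$$h_n = f(h_{n-1}, x_n)$$
where $f(\cdot)$ is GRU Cho et al. (2014) in this study, $x_n$ is the input embedding at position $n$, and $h_n$ is the state at position $n$. RNN is particularly hard to parallelize due to its inherent dependence on the previous state $h_{n-1}$.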
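DiSAN Shen et al. (2018a) augments SAN with the ability to encode word order:
$$h_n = \mathrm{ATT}(q_n, K_{\le n})\, V_{\le n}$$
where $q_n$ is the query at position $n$, and $K_{\le n}$ and $V_{\le n}$ indicate the leftward elements of the keys and values, e.g., $K_{\le n} = \{k_1, \dots, k_n\}$.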
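The restriction to leftward elements can be made concrete with a small sketch of directional attention. It is only an illustration under stated assumptions: a single head, scaled dot-product scoring, no learned projections, and only the forward direction (a bi-directional setting would add a second pass over the reversed sequence). The function name `directional_attention` is hypothetical.

```python
import math

def directional_attention(Q, K, V):
    """Each position n attends only to positions 0..n (leftward elements)."""
    d = len(Q[0])
    outputs = []
    for n, q in enumerate(Q):
        # Scores are computed against keys up to and including position n only.
        scores = [sum(a * b for a, b in zip(q, K[m])) / math.sqrt(d)
                  for m in range(n + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        outputs.append([sum(weights[m] * V[m][k] for m in range(n + 1))
                        for k in range(len(V[0]))])
    return outputs

Q = K = V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = directional_attention(Q, K, V)
print(out[0])  # the first position can only attend to itself
```

Because position n never sees anything to its right, the order of the inputs changes the outputs even without position embeddings, which is how this variant injects word order into an otherwise order-agnostic attention.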
To enable a fair comparison of architectures, we only vary the sub-layer of sequence modeling (e.g. the SAN sub-layer) in the Transformer encoder Vaswani et al. (2017), and keep the other components the same for all architectures. We use the bi-directional setting for RNN and DiSAN, and apply position embedding for SAN and DiSAN. We follow Vaswani et al. (2017) to set the configurations of the encoders, which consist of 6 stacked layers with a layer size of 512.
In this study, we employ two strategies to train the encoders, which differ in the learning objectives and the data used to train the associated parameters. Note that in both strategies, the output layer in Figure 2 is trained on the WRD data with the word reordering detection objective.
We first directly train the encoders on the WRD data, to evaluate the abilities of model architectures. The WRD encoders are randomly initialized and co-trained with the output layer. Accordingly, the detection accuracy can be treated as the learning objective of this group of encoders. Meanwhile, we can investigate the reliability of the proposed WRD task by checking whether the performances of different architectures (i.e. RNN, SAN, and DiSAN) are consistent with previous findings on other benchmark NLP tasks Shen et al. (2018a); Tang et al. (2018); Tran et al. (2018); Devlin et al. (2019).
To quantify how well different architectures learn word order information with the learning objective of machine translation, we first train the NMT models (both encoder and decoder) on a bilingual corpus using the same configuration reported by Vaswani et al. (2017). Then, we fix the parameters of the encoder, and only train the parameters associated with the output layer on the WRD data. In this way, we can probe the representations learned by NMT models for their ability to capture the word order of input sentences.
For the WRD task, all the models were trained for 600K steps, each of which is allocated a batch of 500 sentences. The training set is shuffled after each epoch. We use the Adam optimizer Kingma and Ba (2015). The learning rate linearly warms up over the first 4,000 steps, and decreases thereafter proportionally to the inverse square root of the step number. We use a dropout rate of 0.1 on all layers.
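The warm-up and decay schedule described above can be written as a one-line function. The sketch below assumes the common Transformer form of this schedule, in which the overall scale is the inverse square root of the layer size (512 here); that scale factor is an assumption, since the text only specifies the shape of the curve and the 4,000 warm-up steps.

```python
def learning_rate(step, warmup_steps=4000, d_model=512):
    """Linear warm-up for warmup_steps, then inverse-square-root decay."""
    step = max(step, 1)  # avoid division by zero at step 0
    # The d_model ** -0.5 scale is an assumed constant, not taken from the text.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for s in (1000, 4000, 16000):
    print(s, learning_rate(s))
```

The two arguments of `min` cross exactly at `warmup_steps`, so the rate peaks there: it is a quarter of the peak at step 1,000 and half of the peak at step 16,000.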
We pre-train NMT models on the benchmark WMT14 English-German (En-De) data, which consists of 4.5M sentence pairs. The validation and test sets are newstest2013 and newstest2014, respectively. To demonstrate the universality of the findings in this study, we also conduct experiments on WAT17 English-Japanese (En-Ja) data. Specifically, we follow Morishita et al. (2017) to use the first two sections of the WAT17 dataset as the training data, which consists of approximately 2.0M sentence pairs. We use newsdev2017 as the validation set and newstest2017 as the test set.
We conduct this task on English sentences, which are extracted from the source side of the WMT14 En-De data with a maximum length of 80. For each sentence in the different sets (i.e. training, validation, and test sets), we construct an instance by randomly moving a word to another position. Finally, we construct 7M, 10K and 10K samples for training, validation and testing, respectively. Note that a sentence can be sampled multiple times, thus each dataset in the WRD data contains more instances than the corresponding machine translation dataset.
All the English and German data are tokenized using the scripts in Moses. The Japanese sentences are segmented by the word segmentation toolkit KyTea Neubig et al. (2011). To reduce the vocabulary size, all the sentences are processed by byte-pair encoding (BPE) Sennrich et al. (2016) with 32K merge operations.
We return to the central questions originally posed, that is, whether SAN is indeed weak at learning positional information. Using the above experimental design, we give the following answers:
We first check the performance of each WRD encoder on the proposed WRD task from two aspects: 1) WRD accuracy; and 2) learning ability.
[Table columns: En-De | En-Ja | En-De Enc. | En-Ja Enc. | WRD Enc.]
The detection results are summarized in Table 1. As seen, both RNN and DiSAN significantly outperform SAN on our task, indicating that the recurrence structure (RNN) indeed performs better than the parallel structure (SAN) at capturing word order information in a sentence. Nevertheless, the drawback can be alleviated by applying directional attention functions. The comparable result between DiSAN and RNN confirms the hypothesis by Shen et al. (2018a) and Devlin et al. (2019) that directional SAN indeed improves the ability of SAN to learn word order. The consistency between prior studies and our results verifies the reliability of the proposed WRD task.
We visualize the learning curve of the training. As shown in Figure 3, SAN converges much more slowly than the others, showing that SAN has a harder time learning word order information than RNN and DiSAN. This is consistent with our intuition that a parallel structure has more difficulty learning word order information than models with a sequential process. As for DiSAN, although it learns slightly more slowly at the early stage of training, it is able to achieve accuracy comparable to RNN in the middle and late phases of training.
We investigate whether SAN indeed lacks the ability to learn word order information in the machine translation context. The results are summarized in Table 2. We first report the effectiveness of the compared models on translation tasks. For En-De translation, SAN outperforms RNN, which is consistent with the results reported in Chen et al. (2018). The same tendency holds on En-Ja, which is a distant language pair Bosch and Sebastián-Gallés (2001); Isozaki et al. (2010). Moreover, DiSAN incrementally improves the translation quality, demonstrating that modeling directional information benefits translation quality. The consistent translation performances make the following evaluation on WRD accuracy convincing.
Concerning the performances of NMT encoders on the WRD task:
It is surprising to see that SAN yields even higher accuracy on the WRD task than the other pre-trained NMT encoders, despite its lower translation quality compared with DiSAN. The results not only dispel the doubt about the inability of the SAN-based encoder to learn word order in machine translation, but also demonstrate that SAN learns to retain more features with respect to word order during the training of machine translation.
In addition, the NMT encoders underperform the WRD encoders on the detection task across models and language pairs. (The En-Ja pre-trained encoders yield lower accuracy on the WRD task than the En-De pre-trained encoders. We attribute this to the difference between the source sentences in the pre-training corpus (En-Ja) and those of the WRD data (from the En-De dataset). Despite this, the tendency of the results is consistent across language pairs.) The only difference between the two kinds of encoders is the learning objective. This raises the hypothesis that the learning objective sometimes serves as a more critical factor than the model architecture in modeling word order.
In order to assess the importance of position encoding, we redo the experiments after removing the position encoding from SAN and DiSAN (“- Pos_Emb”). Clearly, the SAN-based encoder without position embedding fails on both machine translation and our WRD task, indicating the necessity of position encoding for learning word order. It is encouraging to see that SAN yields a higher BLEU score and detection accuracy than “DiSAN-Pos_Emb” in the machine translation scenario. This means that, for SAN, position embedding is more suitable for capturing word order information in machine translation than modeling recurrence. Considering both scenarios, DiSAN-based encoders achieve detection accuracies comparable to the best models, revealing their effectiveness and universality in learning word order.
In response to the above results, we provide further analyses to verify our hypothesis on NMT encoders. We discuss three questions in this section: 1) Does the learning objective indeed affect the extraction of word order information? 2) How does SAN derive word order information from position encoding? and 3) Is retaining more word order information useful for machine translation?
We further investigate the accuracy on the WRD task according to the distance between the position from which the word is popped out and the position at which it is inserted. As shown in Figure 4 (a), the performance of the WRD encoders drops only marginally as the distance increases. However, this kind of stability is destroyed when we pre-train each encoder with the learning objective of machine translation. As seen in Figure 4 (b) and (c), the performance of the pre-trained NMT encoders becomes clearly worse on long-distance cases across language pairs and model variants. This is consistent with prior observations on NMT systems that both RNN and SAN fail to fully capture long-distance dependencies Tai et al. (2015); Yang et al. (2017); Tang et al. (2018).
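A breakdown of detection accuracy by reordering distance can be computed as in the following sketch. It makes two assumptions that the text does not spell out: a prediction counts as correct only when both the “O” and the “I” positions match the gold labels, and distances are grouped into fixed-width buckets (the `bucket_size` of 5 is a hypothetical choice, as is the function name).

```python
from collections import defaultdict

def accuracy_by_distance(examples, bucket_size=5):
    """examples: iterable of (gold_o, gold_i, pred_o, pred_i) index tuples.

    Returns {(low, high): accuracy} keyed by the distance range |gold_o - gold_i|.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for gold_o, gold_i, pred_o, pred_i in examples:
        # Distances start at 1 because the word always moves to a new position.
        bucket = (abs(gold_o - gold_i) - 1) // bucket_size
        totals[bucket] += 1
        hits[bucket] += int((pred_o, pred_i) == (gold_o, gold_i))
    return {
        (b * bucket_size + 1, (b + 1) * bucket_size): hits[b] / totals[b]
        for b in sorted(totals)
    }

toy = [(2, 4, 2, 4), (2, 4, 2, 5), (0, 9, 0, 9)]
print(accuracy_by_distance(toy))
```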
According to the information bottleneck principle Tishby and Zaslavsky (2015); Alemi et al. (2016), our NMT models are trained to maximally maintain the relevant information between source and target, while abandoning irrelevant features in the source sentence, e.g. a portion of the word order information. Different NLP tasks have distinct requirements on linguistic information Conneau et al. (2018). For machine translation, local patterns (e.g. phrases) matter more Luong et al. (2015); Yang et al. (2018, 2019), while long-distance word order information plays a relatively minor role in understanding the meaning of a source sentence. Recent studies also pointed out that abandoning irrelevant features in the source sentence benefits some downstream NLP tasks Lei et al. (2016); Yu et al. (2017); Shen et al. (2018b). An immediate consequence of this kind of data processing inequality Schumacher and Nielsen (1996) is that information about word order that is lost in the encoder cannot be recovered in the detector, which consequently lowers the performance on our WRD task. The results verify that, in our case, the learning objective indeed affects the learning of word order information more than the model architecture does.
Some researchers may suspect that the parallel structure of SAN leads to a failure to capture word order information at higher layers, since the position embeddings are only injected at the input layer. Accordingly, we further probe the representations at each layer on our WRD task to explore how SAN learns word order information. As seen in Figure 5, SAN achieves better performance than the other NMT encoders on the proposed WRD task across almost all the layers. The result dispels the doubt about the inability of position encoding and confirms the speculation by Vaswani et al. (2017) and Shaw et al. (2018), who suggested that SAN can profit from the use of residual networks, which propagate the positional information to higher layers. Moreover, both SAN and RNN gradually increase their performance on our task as layers are stacked. This shared tendency demonstrates that, for SAN, position encoding is able to provide the same manner of learning word order as a recurrent structure does. Both results confirm the strength of position encoding in bringing word order properties into SAN.
We strive to explain why SAN captures even more word order information in the machine translation task. Yin et al. (2017) and Tran et al. (2018) found that approaches with a recurrence structure (e.g. RNN) have an easier time learning syntactic information than models with a parallel structure (e.g. CNN, SAN). Inspired by their findings, we argue that SAN tries to partially compensate for the disadvantage of its parallel structure by preserving more word order information, thus helping the encoding of the deeper linguistic properties required by machine translation. Recent studies on multi-layer learning have shown that different layers tend to learn distinct linguistic information Peters et al. (2018); Raganato and Tiedemann (2018); Li et al. (2019). The better accuracy achieved by SAN across layers indicates that SAN indeed tries to preserve more word order information while learning other linguistic properties for translation purposes.
For humans, a small number of erroneous word orders in a sentence usually does not affect comprehension. For example, we can understand the meaning of the English sentence “Dropped the boy the ball.”, despite its erroneous word order. It is intriguing to ask whether an NMT model can tolerate such word order noise. To this end, we add word order noise to the English-German development set by moving one word to another position, and evaluate the drop in the translation quality of each model. As shown in Figure 6, SAN and DiSAN yield smaller drops in translation quality than their RNN counterpart, demonstrating the effectiveness of self-attention in handling word order noise. We attribute this to the fact that models (e.g. RNN-based models) will not learn to be robust to errors since they are never observed Sperber et al. (2017); Cheng et al. (2018). In contrast, since the SAN-based NMT encoder is good at recognizing and preserving anomalous word order information in the NMT context, it may improve the ability of the decoder to handle noise that occurs in the training set, and thus be more robust in translating sentences with anomalous word order.
SAN has yielded strong empirical performance in a variety of NLP tasks Vaswani et al. (2017); Tan et al. (2018); Li et al. (2018); Devlin et al. (2019). In response to these impressive results, several studies have emerged with the goal of understanding various properties of SAN. For example, Tran et al. (2018) compared SAN and RNN on language inference tasks, and pointed out that SAN is weaker at learning hierarchical structure than its RNN counterpart. Moreover, Tang et al. (2018) conducted experiments on subject-verb agreement and word sense disambiguation tasks. They found that SAN is good at extracting semantic properties, while it underperforms RNN at capturing long-distance dependencies. This is in contrast to our intuition that SAN is good at capturing long-distance dependencies. In this work, we focus on exploring the ability of SAN to model word order information.
To open the black box of networks, probing tasks are used as a first step, which facilitates comparing different models at a much finer-grained level. Most work has focused on probing fixed-sentence encoders, e.g. sentence embedding Adi et al. (2017); Conneau et al. (2018). Among them, Adi et al. (2017) and Conneau et al. (2018) proposed to probe the sensitivity to legal word orders by detecting, given a sentence embedding, whether a pair of permuted words exists in the sentence. However, analysis of sentence encodings may introduce confounds, making it difficult to infer whether the relevant information is encoded within the specific position of interest or rather inferred from diffuse information elsewhere in the sentence Tenney et al. (2019). In this study, we directly probe the token representations for word- and phrase-level properties, an approach which has been widely used for probing token-level representations learned in neural machine translation systems, e.g. part-of-speech, semantic tags, morphology as well as constituent structure Shi et al. (2016); Belinkov et al. (2017); Blevins et al. (2018).
In this paper, we introduce a novel word reordering detection task which can probe the ability of a model to extract word order information. With the help of the proposed task, we evaluate RNN, SAN and DiSAN within the Transformer framework to empirically test the theoretical claim that SAN lacks the ability to learn word order. The results reveal that RNN and DiSAN indeed perform better than SAN at extracting word order information when they are trained individually for our task. However, there is no evidence that SAN learns less word order information in the machine translation context.
Our further analyses of the encoders pre-trained on the NMT data suggest that 1) the learning objective sometimes plays a crucial role in learning a specific feature (e.g. word order) in a downstream NLP task; 2) modeling recurrence is universally effective for learning word order information in SAN; and 3) RNN is more sensitive to erroneous word order noise in machine translation systems. These observations facilitate the understanding of different tasks and model architectures at a finer-grained level, rather than merely through an overall score (e.g. BLEU). As our approach is not limited to NMT encoders, it is also interesting to explore how models trained on other NLP tasks learn word order information.
The work was partly supported by the National Natural Science Foundation of China (Grant No. 61672555), the Joint Project of Macao Science and Technology Development Fund and National Natural Science Foundation of China (Grant No. 045/2017/AFJ) and the Multi-Year Research Grant from the University of Macau (Grant No. MYRG2017-00087-FST). We thank the anonymous reviewers for their insightful comments.
A Decomposable Attention Model for Natural Language Inference. In EMNLP.
Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In ACL.