How Does Selective Mechanism Improve Self-Attention Networks?

05/03/2020 ∙ by Xinwei Geng, et al. ∙ Tencent

Self-attention networks (SANs) with a selective mechanism have produced substantial improvements in various NLP tasks by concentrating on a subset of input words. However, the underlying reasons for their strong performance have not been well explained. In this paper, we bridge the gap by assessing the strengths of selective SANs (SSANs), which are implemented with a flexible and universal Gumbel-Softmax. Experimental results on several representative NLP tasks, including natural language inference, semantic role labeling, and machine translation, show that SSANs consistently outperform the standard SANs. Through well-designed probing experiments, we empirically validate that the improvement of SSANs can be attributed in part to mitigating two commonly-cited weaknesses of SANs: word order encoding and structure modeling. Specifically, the selective mechanism improves SANs by paying more attention to content words that contribute to the meaning of the sentence. The code and data are released at https://github.com/xwgeng/SSAN.


1 Introduction

Self-attention networks (SANs) lin2017structured

have achieved promising progress in various natural language processing (NLP) tasks, including machine translation 

Vaswani:2017:NIPS, natural language inference Shen:2018:AAAI, semantic role labeling tan2018deep; strubell2018linguistically and language representation devlin2019bert. The appealing strength of SANs derives from high parallelism as well as flexibility in modeling dependencies among all the input elements.

Recently, there has been a growing interest in integrating a selective mechanism into SANs, which has produced substantial improvements in a variety of NLP tasks. For example, some researchers incorporated a hard constraint into SANs to select a subset of input words, on top of which self-attention is conducted Shen:2018:IJCAI; Hou:2019:SelectiveAB; Yang:2019:NAACL. Yang:2018:EMNLP and guo2019gaussian proposed a soft mechanism by imposing a learned Gaussian bias over the original attention distribution to enhance its ability to capture local contexts. Shen:2018:IJCAI incorporated reinforced sampling to dynamically choose a subset of input elements, which are then fed to SANs.

Although the general idea of the selective mechanism works well across NLP tasks, previous studies only validate their own implementations on a few tasks, either only on classification tasks Shen:2018:IJCAI; guo2019gaussian or only on sequence generation tasks Yang:2018:EMNLP; Yang:2019:NAACL. This poses a potential threat to the conclusive effectiveness of the selective mechanism. In response to this problem, we adopt a flexible and universal implementation of the selective mechanism using Gumbel-Softmax jang2016categorical, called selective self-attention networks (i.e., SSANs). Experimental results on several representative types of NLP tasks, including natural language inference (i.e., classification), semantic role labeling (i.e., sequence labeling), and machine translation (i.e., sequence generation), demonstrate that SSANs consistently outperform the standard SANs (§3).

Despite demonstrating the effectiveness of SSANs, the underlying reasons for their strong performance have not been well explained, which poses great challenges for further refinement. In this study, we bridge this gap by assessing the strength of the selective mechanism in capturing essential linguistic properties via well-designed experiments. The starting point of our approach is a set of recent findings: the standard SANs suffer from two representation limitations, namely word order encoding shaw2018self; yang:2019:assessing and syntactic structure modeling tang2018self; hao:2019:multi, both of which are essential for natural language understanding and generation. Experimental results on targeted linguistic evaluation lead to the following observations:

  • SSANs can identify the improper word orders in both local (§4.1) and global (§4.2) ranges by learning to attend to the expected words.

  • SSANs produce representations that embed richer syntactic information (§5.1), owing to a better modeling of structure by selective attention (§5.2).

  • The selective mechanism improves SANs by paying more attention to content words that possess semantic content and contribute to the meaning of the sentence (§5.3).

2 Methodology

2.1 Self-Attention Networks

SANs lin2017structured, as a variant of attention models bahdanau2015neural; luong2015effective, compute attention weights between each pair of elements in a single sequence. Given the input layer $H = \{h_1, \dots, h_n\}$, SANs first transform the layer into the queries $Q$, the keys $K$, and the values $V$ with three separate weight matrices. The output layer $O$ is calculated as:

$$O = \operatorname{softmax}(e)\, V \quad\quad (1)$$

where the alternatives for the attention energy $e$ can be additive attention bahdanau2015neural or dot-product attention luong2015effective. Due to time and space efficiency, we used the dot-product attention in this study, which is computed as:

$$e = \frac{Q K^{\top}}{\sqrt{d}} \quad\quad (2)$$

where $\sqrt{d}$ is the scaling factor with $d$ being the dimensionality of the layer states Vaswani:2017:NIPS.
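For concreteness, the following is a minimal NumPy sketch of the dot-product self-attention above (Equations 1–2), single-head and without masking; the helper names and the random initialization are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, W_q, W_k, W_v):
    """Dot-product self-attention over a single sequence H of shape (n, d)."""
    Q, K, V = H @ W_q, H @ W_k, H @ W_v      # queries, keys, values
    d = Q.shape[-1]
    e = Q @ K.T / np.sqrt(d)                 # Equation (2): scaled dot-product energy
    weights = softmax(e, axis=-1)            # attention distribution per query
    return weights @ V, weights              # Equation (1): weighted sum of values

# toy usage
rng = np.random.default_rng(0)
n, d = 5, 8
H = rng.normal(size=(n, d))
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
O, A = self_attention(H, W_q, W_k, W_v)
print(O.shape, A.shape)  # (5, 8) (5, 5)
```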

2.2 Weaknesses of Self-Attention Networks

Although SANs have demonstrated their effectiveness on various NLP tasks, recent studies empirically revealed that SANs suffer from two representation limitations: word order encoding yang:2019:assessing and syntactic structure modeling tang2018self. In this work, we concentrate on these two commonly-cited issues.

Word Order Encoding

SANs rely merely on the attention mechanism, with neither recurrence nor convolution structures. In order to incorporate sequence order information, Vaswani:2017:NIPS proposed to inject position information into the input word embeddings with an additional position embedding. Nevertheless, SANs are still weak at learning word order information yang:2019:assessing. Recent studies have shown that incorporating recurrence Chen:2018:ACL; Hao-2019-towards; hao2019modeling, convolution song-etal-2018-double; Yang:2019:NAACL, or advanced position encodings shaw2018self; Wang:2019:EMNLP into vanilla SANs can further boost their performance, confirming their shortcomings at modeling sequence order.

Structure Modeling

Due to the lack of supervision signals for learning structural information, recent studies have paid widespread attention to incorporating syntactic structure into SANs. For instance, strubell2018linguistically utilized one attention head to learn to attend to the syntactic parents of each word. Towards generating better sentence representations, several researchers proposed phrase-level SANs by performing self-attention across words inside an n-gram phrase or syntactic constituent wu-etal-2018-phrase; hao:2019:multi; wang:2019:tree. These studies show that the introduction of syntactic information can achieve further improvement over SANs, demonstrating their potential weakness in structure modeling.

2.3 Selective Self-Attention Networks

In this study, we implement the selective mechanism on SANs by introducing an additional selector, namely SSANs, as illustrated in Figure 1. The selector aims to select a subset of elements from the input sequence, on top of which the standard self-attention (Equation 1) is conducted. We implement the selector with Gumbel-Softmax, which has proven effective for computer vision tasks shen:2018:sharp; yang:2019:modeling.

Figure 1: Illustration of SSANs, which select a subset of input elements with an additional selector network, on top of which self-attention is conducted. In this example, the word “talk” performs the attention operation over the input sequence, where the words “Bush”, “held” and “Sharon” are chosen as the truly-significant words.

Selector

Formally, we parameterize the selection action $a_i \in \{\text{SELECT}, \text{DISCARD}\}$ for each input element with an auxiliary policy network, where SELECT indicates that the element is selected for self-attention while DISCARD means that the element is abandoned. The output action sequence $A = \{a_1, \dots, a_n\}$ is calculated as:

$$A \sim \pi = \operatorname{sigmoid}(e^s) \quad\quad (3)$$

$$e^s = \frac{Q^s (K^s)^{\top}}{\sqrt{d}} \quad\quad (4)$$

where $Q^s$ and $K^s$ are transformed from the input layer $H$ with distinct weight matrices. We utilize $\operatorname{sigmoid}$ as the activation function to calculate the distribution $\pi$ for choosing the action SELECT with probability $\pi$ or DISCARD with probability $1 - \pi$.

Gumbel Relaxation

There are two challenges for training the selector: (1) the ground-truth labels indicating which words should be selected are unavailable; and (2) the discrete variables in $A$ lead to a non-differentiable objective function. In response to this problem, jang2016categorical proposed Gumbel-Softmax to give a continuous approximation to sampling from the categorical distribution. We adopt a similar approach by adding Gumbel noise gumbel1954statistical in the sigmoid function, which we refer to as Gumbel-Sigmoid. Since the sigmoid can be viewed as a special 2-class case of softmax ($e^s$ and $0$ in our case), we derive the Gumbel-Sigmoid as:

$$\text{Gumbel-Sigmoid}(e^s) = \operatorname{sigmoid}\left(\frac{e^s + G' - G''}{\tau}\right) \quad\quad (5)$$

where $G'$ and $G''$ are two independent Gumbel noises gumbel1954statistical, and $\tau$ is a temperature parameter. As $\tau$ diminishes to zero, a sample from the Gumbel-Sigmoid distribution becomes cold and resembles a one-hot sample. At training time, we use Gumbel-Sigmoid to obtain a differentiable sample of the selection actions as $\pi = \text{Gumbel-Sigmoid}(e^s)$. In inference, we choose the action with maximum probability as the final output.
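The following NumPy sketch illustrates how the selector and the Gumbel-Sigmoid relaxation can be realized. The dot-product parameterization of the energy and the helper names (gumbel_noise, selector, W_qs, W_ks) are assumptions for illustration, not necessarily the authors' exact implementation.

```python
import numpy as np

def gumbel_noise(shape, rng, eps=1e-10):
    """Standard Gumbel noise: -log(-log(U)) with U ~ Uniform(0, 1)."""
    u = rng.uniform(eps, 1.0, size=shape)
    return -np.log(-np.log(u))

def gumbel_sigmoid(energy, tau, rng):
    """Equation (5): continuous relaxation of sampling SELECT/DISCARD actions."""
    g1 = gumbel_noise(energy.shape, rng)  # G'
    g2 = gumbel_noise(energy.shape, rng)  # G''
    return 1.0 / (1.0 + np.exp(-(energy + g1 - g2) / tau))

def selector(H, W_qs, W_ks, tau=0.5, training=True, rng=None):
    """Selection gates over (query, key) pairs (hypothetical parameterization)."""
    rng = rng or np.random.default_rng(0)
    Qs, Ks = H @ W_qs, H @ W_ks
    energy = Qs @ Ks.T / np.sqrt(Qs.shape[-1])     # e^s, as in Equation (4)
    if training:
        return gumbel_sigmoid(energy, tau, rng)    # soft, differentiable gates
    return (1.0 / (1.0 + np.exp(-energy)) > 0.5).astype(float)  # hard gates at inference
```

One natural way to use the resulting gates is to mask the attention energies of Equation 2 so that discarded elements receive (near-)zero attention; this is an illustrative choice, and the exact integration can be found in the released code.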

3 NLP Benchmarks

To demonstrate the robustness and effectiveness of SSANs, we evaluate them on three representative NLP tasks: natural language inference, semantic role labeling, and machine translation. We use these tasks as NLP benchmarks, covering the classification, sequence labeling, and sequence generation categories. Specifically, the performance of semantic role labeling and language inference models heavily relies on structural information strubell2018linguistically, while machine translation models need to learn word order and syntactic structure Chen:2018:ACL; hao2019modeling.

3.1 Experimental Setup

Natural Language Inference

aims to classify the semantic relationship between a pair of sentences, i.e., a premise and its corresponding hypothesis. We conduct experiments on the Stanford Natural Language Inference (SNLI) dataset snli:emnlp2015, which has three classes: Entailment, Contradiction and Neutral.

We followed Shen:2018:AAAI to use a token2token SAN layer followed by a source2token SAN layer to generate a compressed vector representation of the input sentence. The selector is integrated into the token2token SAN layer. Taking the premise representation $p$ and the hypothesis representation $h$ as input, their semantic relationship is represented by the concatenation of $p$, $h$, $p - h$, and $p \odot h$, which is passed to a classification module to generate a categorical distribution over the three classes. We initialize the word embeddings with 300D GloVe 6B pre-trained vectors pennington2014glove, and the hidden size is set to 300.
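As a pointer, below is a tiny sketch of the matching features described above, assuming the common [p; h; p − h; p ⊙ h] scheme following Shen:2018:AAAI; dimensions and names are illustrative.

```python
import numpy as np

def nli_features(p, h):
    """Concatenate premise/hypothesis vectors with their difference and product."""
    return np.concatenate([p, h, p - h, p * h], axis=-1)

p, h = np.random.rand(300), np.random.rand(300)
print(nli_features(p, h).shape)  # (1200,) -> fed to the classification module
```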

Semantic Role Labeling

is a shallow semantic parsing task, which aims to recognize the predicate-argument structure of a sentence, such as “who did what to whom”, “when” and “where”. Typically, it assigns labels to words that indicate their semantic role in the sentence. Our experiments are conducted on the CoNLL-2012 dataset provided by toward:2013:conll.

We evaluated the selective mechanism on top of DEEPATT (https://github.com/XMUNLP/Tagger) tan2018deep, which consists of stacked SAN layers and a following softmax layer. Following their configurations, we set the number of SAN layers to 10 with a hidden size of 200, the number of attention heads to 8, and the dimension of word embeddings to 100. We use the GloVe embeddings pennington2014glove, which are pre-trained on Wikipedia and Gigaword, to initialize our networks, but they are not fixed during training. We choose the better-performing feed-forward network (FFN) variant of DEEPATT as our standard setting.

Machine Translation

is a conditional generation task, which aims to translate a sentence from a source language to its counterpart in a target language. We carry out experiments on several widely-used datasets, including small English⇒Japanese (En⇒Ja) and English⇒Romanian (En⇒Ro) corpora, as well as a relatively large English⇒German (En⇒De) corpus. For En⇒De and En⇒Ro, we respectively follow Li:2018:EMNLP and He:Layer:NIPS to prepare the WMT2014 (http://www.statmt.org/wmt14) and IWSLT2014 (https://wit3.fbk.eu/mt.php?release=2014-01) corpora. For En⇒Ja, we use the KFTT dataset (http://www.phontron.com/kftt) provided by neubig11kftt. All the data are tokenized and then segmented into subword symbols using BPE sennrich2015neural with 32K operations.

We implemented the approach on top of the advanced Transformer model Vaswani:2017:NIPS. On the large-scale En⇒De dataset, we followed the base configuration to train the NMT model, which consists of 6 stacked encoder and decoder layers with the layer size being 512 and the number of attention heads being 8. On the small-scale En⇒Ro and En⇒Ja datasets, we followed He:Layer:NIPS to decrease the layer size to 256 and the number of attention heads to 4.

For all the tasks, we applied the selector to the first layer of the encoder to better capture lexical and syntactic information, a choice that is empirically validated by our further analyses in Section 4.

3.2 Experimental Results

Task Size SANs SSANs Δ
Natural Language Inference (Accuracy)
SNLI 550K 85.60 86.30 +0.8%
Semantic Role Labeling (F1 score)
CoNLL 312K 82.48 82.88 +0.5%
Machine Translation (BLEU)
En⇒Ro 0.18M 23.22 23.91 +3.0%
En⇒Ja 0.44M 31.56 32.17 +1.9%
En⇒De 4.56M 27.60 28.50 +3.3%
Table 1: Results on the NLP benchmarks. “Size” indicates the number of training examples, and “Δ” denotes relative improvements over the vanilla SANs.

Table 1 shows the results on the three NLP benchmarks. Clearly, introducing the selective mechanism significantly and consistently improves performance on all tasks, demonstrating the universality and effectiveness of the selective mechanism for SANs. Concretely, SSANs relatively improve performance over SANs by +0.8% and +0.5% on the NLI and SRL tasks respectively, showing their superiority in structure modeling. Shen:2018:IJCAI pointed out that SSANs can better capture dependencies among semantically important words, and our results and further analyses (§5) support this claim.

In the machine translation tasks, SSANs consistently outperform SANs across language pairs. Encouragingly, the improvement in translation performance is maintained on the large-scale training data. The relative improvements on the En⇒Ro, En⇒Ja, and En⇒De tasks are +3.0%, +1.9%, and +3.3%, respectively. We attribute the improvement to the strengths of SSANs in word order encoding and structure modeling, which are empirically validated in Sections 4 and 5.

Shen:2018:IJCAI implemented the selection mechanism with the REINFORCE algorithm. jang2016categorical revealed that, compared with Gumbel-Softmax Maddison:NIPS:sampling, REINFORCE Williams1992Simple suffers from high variance, which consequently leads to slow convergence. In our preliminary experiments, we also implemented REINFORCE-based SSANs, but they underperform the Gumbel-Softmax approach on the benchmark En⇒De translation task (BLEU: 27.90 vs. 28.50, not shown in the paper). This conclusion is consistent with jang2016categorical, and we thus use Gumbel-Softmax instead of REINFORCE in this study.

4 Evaluation of Word Order Encoding

In this section, we investigate the ability of SSANs to capture both local and global word order on the bigram order shift detection (§4.1) and word reordering detection (§4.2) tasks.

4.1 Detection of Local Word Reordering

Task Description

conneau2018you propose a bigram order shift detection task to test whether an encoder is sensitive to local word orders. Given a monolingual corpus, a certain portion of sentences are randomly extracted to construct instances with illegal word order. Specifically, given a sentence $X = \{x_1, \dots, x_n\}$, two adjacent words (i.e., $x_i$, $x_{i+1}$) are swapped to generate an illegal instance as a substitute for $X$. Given the processed data, which consists of intact and inverted sentences, the examined models are required to distinguish intact sentences from inverted ones. To detect the shift of bigram word order, the models should learn to recognize normal and abnormal word orders.
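A small sketch of how such inverted instances can be constructed from whitespace-tokenized sentences; the exact preprocessing and label names of conneau2018you's released data may differ.

```python
import random

def make_bigram_shift_instance(sentence, invert_prob=0.5, rng=random):
    """Return (tokens, label), where label is 'inverted' if one adjacent bigram
    has been swapped and 'intact' otherwise."""
    tokens = sentence.split()
    if len(tokens) < 2 or rng.random() > invert_prob:
        return tokens, "intact"
    i = rng.randrange(len(tokens) - 1)
    tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]  # swap x_i and x_{i+1}
    return tokens, "inverted"

print(make_bigram_shift_instance("he knew exactly what he wanted", invert_prob=1.0))
```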

The model consists of 6-layer SANs and a 3-layer MLP classifier. The layer size is 128, and the filter size is 512. We trained the model on the open-source dataset provided by conneau2018you (https://github.com/facebookresearch/SentEval/tree/master/data/probing). The accuracy of the SAN-based encoder is higher than the previously reported result on the same task Li:2019:NAACL (52.23 vs. 49.30).

Model Layer Acc. Δ
SANs – 52.23 –
SSANs 1 62.55 +19.8%
SSANs 2 53.73 +2.9%
SSANs 3 54.65 +4.6%
SSANs 4 54.29 +3.9%
SSANs 5 54.78 +4.9%
SSANs 6 54.23 +3.8%
Table 2: Results on the local bigram order shift detection task when SSANs are applied to different layers. “Δ” denotes relative improvements over SANs.

Detection Accuracy

Table 2 lists the results on the local bigram order shift detection task, in which SSANs are applied to different encoder layers. Clearly, all the SSANs variants consistently outperform SANs, demonstrating the superiority of SSANs in capturing local order information. Applying the selective mechanism to the first layer achieves the best performance, which improves the prediction accuracy by +19.8% over SANs. The performance gap between the SSANs variants is very large (i.e., 19.8% vs. around 4%), which we attribute to the fact that the detection of local word reordering depends more on the lexical information embedded in the bottom layer.

Figure 2: Attention weights over attended words with different relative distances from the query word on the local reordering task. SSANs pay more attention to the adjacent words (distance=1) than SANs.
(a) SANs
(b) SSANs
Figure 3: Visualization of attention weights from an example on the local reordering detection task. We highlight the attended word (Y-axis) with maximum attention weight for each query (X-axis) in red rectangles.

Attention Behaviors

The objective of the local reordering task is to distinguish the swap of two adjacent words, which requires the examined model to pay more attention to adjacent words. Starting from this intuition, we investigate the attention distribution over the attended words with different relative distances from the query word, as illustrated in Figure 2. We find that both SANs and SSANs focus on neighbouring words (e.g., distance ≤ 3), and SSANs pay more attention to the adjacent words (distance=1) than SANs (14.6% vs. 12.4%). The results confirm our hypothesis that the selective mechanism helps to exploit more bigram patterns to accomplish the task objective. Figure 3 shows an example, in which SSANs attend most to the adjacent words except for the inverted bigram “he what”. In addition, the surrounding words “exactly” and “wanted” also pay more attention to the exceptional word “he”. We believe such features help to distinguish the abnormal local word order.
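The distance analysis behind Figure 2 can be reproduced roughly as follows, assuming an attention matrix `A` of shape (n, n) whose rows sum to one; the helper name is hypothetical.

```python
import numpy as np
from collections import defaultdict

def attention_by_distance(A):
    """Average attention weight as a function of |query position - key position|."""
    n = A.shape[0]
    mass, count = defaultdict(float), defaultdict(int)
    for q in range(n):
        for k in range(n):
            dist = abs(q - k)
            mass[dist] += A[q, k]
            count[dist] += 1
    return {dist: mass[dist] / count[dist] for dist in sorted(mass)}

# toy usage: uniform attention over a 6-token sentence
n = 6
print(attention_by_distance(np.full((n, n), 1.0 / n)))
```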

4.2 Detection of Global Word Reordering

Task Description

yang:2019:assessing propose a word reordering detection (WRD) task to investigate the ability of SAN-based encoders to extract global word order information. Given a sentence $X = \{x_1, \dots, x_n\}$, a random word $x_i$ is popped and inserted into another position $j$ ($i \neq j$). The objective is to detect both the original position from which the word is popped out (labeled as “O”), and the position where the word is inserted (labeled as “I”).
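A sketch of constructing a WRD instance under the setup above (pop a random word and reinsert it elsewhere); the exact position-labeling convention of the released data may differ.

```python
import random

def make_wrd_instance(sentence, rng=random):
    """Pop one random word and insert it at a different position.

    Returns the reordered tokens, the insertion index in the new sequence ("I"),
    and the index the word was popped from in the original sentence ("O")."""
    tokens = sentence.split()
    assert len(tokens) > 1, "need at least two words to reorder"
    i = rng.randrange(len(tokens))           # original position
    word = tokens.pop(i)
    j = rng.randrange(len(tokens) + 1)       # insertion position
    while j == i:                            # ensure the word actually moves
        j = rng.randrange(len(tokens) + 1)
    tokens.insert(j, word)
    return tokens, j, i

print(make_wrd_instance("the current rules will remain in place for now"))
```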

The model consists of 6-layer SANs and an output layer. The layer size is 512, and the filter size is 2048. We trained the model on the open-source dataset provided by yang:2019:assessing (https://github.com/baosongyang/WRD).

Model Layer Insert Original Both
SANs – 73.20 66.00 60.10
SSANs 1 81.52 72.19 66.77
2 80.14 70.01 63.97
3 79.82 69.69 63.93
4 79.08 70.22 63.67
5 80.19 69.84 64.12
6 80.27 69.50 63.73
Table 3: Performance on the global word reordering detection (WRD) task.

Detection Accuracy

Table 3 lists the results on the global reordering detection task, in which all the SSANs variants improve the prediction accuracy. Similarly, applying the selective mechanism to the first layer achieves the best performance, which is consistent with the results on the local bigram order shift detection task (Table 2). However, the performance gap between the SSANs variants is much smaller than that on the local reordering task (i.e., 4% vs. 15%). One possible reason is that the detection of global word reordering may also need syntactic and semantic information, which are generally embedded in the higher layers Peters:2018:NAACL.

Figure 4: Attention weights over attended words with different relative distances from the query word on the global WRD task. SSANs pay more attention to distant words than SANs.
(a) SANs
(b) SSANs
Figure 5: Visualization of attention weights from an example on the global reordering detection task. We highlight the attended word (Y-axis) with maximum attention weight for each query (X-axis) in red rectangles.

Attention Behaviors

The objective of the WRD task is to distinguish a global reordering, which requires the examined model to pay more attention to distant words. Figure 4 depicts the attention distribution according to different relative distances. SSANs alleviate the leaning-to-local nature of SANs and pay more attention to distant words, which better accomplishes the task of detecting global reordering. Figure 5 illustrates an example, in which more queries in SSANs attend most to the inserted word “the” than in SANs. In particular, SANs pay more attention to the surrounding words, while the inserted word “the” only receives subtle attention. In contrast, SSANs dispense much attention over words centred on the inserted position (i.e., “the”) regardless of distance, especially for the queries “current rules for now”. We speculate that SSANs benefit from such features in detecting the global word reordering.

5 Evaluation of Structure Modeling

In this section, we investigate whether SSANs better capture the structural information of sentences. To this end, we first empirically evaluate the syntactic structure knowledge embedded in the learned representations (§5.1). Then we investigate the attention behaviors by extracting constituency trees from the attention distributions (§5.2).

5.1 Structures Embedded in Representations

Class Ratio SANs SSANs Δ
5 6.9% 68.66 75.22 +9.6%
6 14.3% 56.10 64.09 +14.2%
7 16.3% 46.63 55.05 +18.1%
8 17.9% 39.68 50.88 +28.2%
9 17.4% 38.33 50.97 +33.0%
10 15.3% 35.54 49.88 +40.3%
11 11.9% 48.86 56.39 +15.4%
All 100% 45.68 55.90 +22.4%
Table 4: Results on the TreeDepth task, broken down by tree depth.

Type Ratio SANs SSANs Δ
Ques. 10% 95.90 97.06 +1.2%
Decl. 60% 88.48 91.34 +3.2%
Clau. 25% 72.78 78.32 +7.6%
Other 5% 50.67 61.13 +20.6%
All 100% 83.78 87.25 +4.1%
Table 5: Results on the TopConst task, grouped into four sentence-type categories.

Task Description

We leverage two linguistic probing tasks to assess the syntactic information embedded in a given representation. Both tasks are cast as multi-class classification problems based on the representation of a given sentence, which is produced by an examined model:

Tree Depth (TreeDepth) task conneau2018you checks whether the examined model can group sentences by the depth of the longest path from root to any leaf in their parsing tree. Tree depth values range from 5 to 11, and the task is to categorize sentences into the class corresponding to their depth (7 classes).

Top Constituent (TopConst) task shi2016does classifies the sentence in terms of the sequence of top constituents immediately below the root node, such as “ADVP NP VP .”. The top constituent sequences fall into 20 categories: 19 classes for the most frequent top constructions, and one for all other constructions.

We trained the model on the open-source dataset provided by conneau2018you, and used the same model architecture as in Section 4.1.
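For illustration, the TreeDepth label can be read off a bracketed parse by counting nesting levels, as in the small helper below; whether leaf tokens add one extra level depends on the convention of conneau2018you, so treat this as a sketch.

```python
def tree_depth(bracketed):
    """Maximum bracket nesting depth of a parse string, e.g.
    '(S (NP (DT the) (NN dog)) (VP (VBZ barks)))' -> 3."""
    depth, max_depth = 0, 0
    for ch in bracketed:
        if ch == "(":
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == ")":
            depth -= 1
    return max_depth

print(tree_depth("(S (NP (DT the) (NN dog)) (VP (VBZ barks)))"))  # 3
```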

Probing Accuracy

Table 4 lists the results on the TreeDepth task. SSANs significantly outperform SANs by 22.4% in overall performance. Concretely, the performance of SANs dramatically drops as the depth of the sentences increases (the only exception is the class “11”, which we attribute to the model learning features that associate “very complex sentence” with the maximum depth “11”). On the other hand, SSANs are more robust to the depth of the sentences, demonstrating the superiority of SSANs in capturing complex structures.

Table 5 shows the results on the TopConst task. We categorize the 20 classes into 4 categories based on the types of sentences: question sentences (“* SQ .”), declarative sentences (“* NP VP *”, etc.), clause sentences (“SBAR *” and “S *”), and others (“OTHER”). Similarly, the performance of SANs drops as the complexity of the sentence patterns increases (e.g., from “Ques.” to “Others”: 95.90 vs. 50.67). SSANs significantly improve the prediction F1 score as the complexity of the sentences increases, which reconfirms the superiority of SSANs in capturing complex structures.

5.2 Structures Modeled by Attention

Metric SANs SSANs Δ
BP 21.09 22.07 +4.7%
BR 22.05 23.07 +4.6%
F1 21.56 22.56 +4.2%
Table 6: Evaluation of constituency trees generated from the attention distributions. BP, BR, and F1 denote bracketing precision, bracketing recall, and bracketing F1 score, respectively.
(a) SANs
(b) SSANs
Figure 6: Example of constituency trees generated from the attention distributions.
Type TreeDepth TopConst En⇒De Translation
     SANs SSANs Δ SANs SSANs Δ SANs SSANs Δ

Content
Noun 0.149 0.245 +64.4% 0.126 0.196 +55.6% 0.418 0.689 +64.8%
Verb 0.165 0.190 +15.2% 0.165 0.201 +21.8% 0.146 0.126 -13.7%
Adj. 0.040 0.069 +7.3% 0.033 0.054 +63.6% 0.077 0.074 -3.9%
Total 0.354 0.504 +42.4% 0.324 0.451 +39.2% 0.641 0.889 +38.7%

Content-Free
Prep. 0.135 0.082 -39.3% 0.123 0.119 -3.3% 0.089 0.032 -64.0%
Dete. 0.180 0.122 -32.2% 0.103 0.073 -29.1% 0.070 0.010 -85.7%
Punc. 0.073 0.068 -6.8% 0.078 0.072 -7.7% 0.098 0.013 -86.7%
Others 0.258 0.224 -13.2% 0.373 0.286 -23.3% 0.102 0.057 -41.1%
Total 0.646 0.496 -23.3% 0.676 0.549 -18.8% 0.359 0.111 -69.1%

Table 7: Attention distributions over linguistic roles for the structure modeling probing tasks (§5.1, “TreeDepth” and “TopConst”) and the constituency tree generation task (§5.2, “En⇒De Translation”). For each task, the three columns list SANs, SSANs, and the relative change “Δ”.

Task Description

We evaluate the ability of self-attention on structure modeling by constructing constituency trees from the attention distributions. Under the assumption that the attention distribution within phrases is stronger than across phrases, marecek-rosa-2018-extracting define the score of a constituent spanning positions $i$ to $j$ as the attention concentrated inside the span, denoted as $score(i, j)$. Based on these scores, a binary constituency tree is generated by recurrently splitting the sentence. When splitting a phrase with span $(i, j)$, the target is to look for a split position $k$ that maximizes the scores of the two resulting phrases:

$$\operatorname{split}(i, j) = \operatorname*{argmax}_{i \le k < j} \big[\, score(i, k) + score(k+1, j) \,\big] \quad\quad (6)$$
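A sketch of the recursive splitting in Equation 6, assuming `score(i, j)` is the attention mass whose query and key both fall inside the span, normalized by the span length; the precise scoring of marecek-rosa-2018-extracting may differ in detail.

```python
import numpy as np

def span_score(A, i, j):
    """Attention from queries in [i, j] that stays inside [i, j], per token."""
    return A[i:j + 1, i:j + 1].sum() / (j - i + 1)

def build_tree(A, i, j):
    """Recursively split span [i, j] at the position k maximizing Equation (6)."""
    if i == j:
        return i
    best_k, best_score = i, -np.inf
    for k in range(i, j):
        s = span_score(A, i, k) + span_score(A, k + 1, j)
        if s > best_score:
            best_k, best_score = k, s
    return (build_tree(A, i, best_k), build_tree(A, best_k + 1, j))

# toy usage: a 4-token sentence with block-diagonal attention
A = np.array([[.4, .4, .1, .1],
              [.4, .4, .1, .1],
              [.1, .1, .4, .4],
              [.1, .1, .4, .4]])
print(build_tree(A, 0, 3))  # ((0, 1), (2, 3))
```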

We utilized the Stanford CoreNLP toolkit to annotate English sentences with gold constituency trees. We used EVALB (http://nlp.cs.nyu.edu/evalb) to evaluate the generated constituency trees, reporting bracketing precision, bracketing recall, and bracketing F1 score.

Parsing Accuracy

As shown in Table 6, SSANs consistently outperform SANs by over 4% on all the metrics, demonstrating that SSANs better model structures than SANs. Figure 6 shows an example of the generated trees. As seen, the phrases “he ran” and “heart pumping” are well composed by both SANs and SSANs. However, SANs fail to parse the phrase structure “legs churning”, which is correctly parsed by SSANs.

5.3 Analysis on Linguistic Properties

In this section, we follow He:2019:EMNLP to analyze the linguistic characteristics of the attended words in the above structure modeling tasks, as listed in Table 7. A larger relative increase (“Δ”) denotes more attention assigned by SSANs. Clearly, SSANs pay more attention to content words in all cases, although there are considerable differences among the NLP tasks.

Content words possess semantic content and contribute to the meaning of the sentence, which is essential in various NLP tasks. For example, the depth of a constituency tree mainly relies on the nouns, while the modifiers (e.g., adjectives and content-free words) generally make smaller contributions. The top constituents mainly consist of the VP (95% of examples) and NP (75% of examples) categories, whose head words are verbs and nouns respectively. In machine translation, content words carry essential information, which should be fully transferred to the target side to produce adequate translations. Without explicit annotations, SANs are able to learn the required linguistic features, especially on the machine translation task (e.g., dominating attention on nouns). SSANs further enhance this strength by paying more attention to the content words.

However, due to their high frequency and limited vocabulary (e.g., around 150 words in English; https://en.wikipedia.org/wiki/Function_word), content-free words, or function words, generally receive a lot of attention, although they have very little substantive meaning. This is more serious in the structure probing tasks (i.e., TreeDepth and TopConst), since the scalar guiding signal (i.e., the class label) for a whole sentence is non-informative, as it does not necessarily preserve a picture of the intermediate syntactic structure of the sentence that is being generated for the prediction. On the other hand, the problem with content-free words is alleviated on the machine translation task due to the informative sequence-level signals. SSANs further alleviate this problem in all cases with a better modeling of structures.

6 Conclusion

In this work, we make an early attempt to assess the strengths of the selective mechanism for SANs, which is implemented with a flexible Gumbel-Softmax approach. Through several well-designed experiments, we empirically reveal that the selective mechanism mitigates two major weaknesses of SANs, namely word order encoding and structure modeling, which are essential for natural language understanding and generation. Future directions include validating our findings on other SAN architectures (e.g., BERT devlin2019bert) and more general attention models bahdanau2015neural; luong2015effective.

Acknowledgments

We thank the anonymous reviewers for their insightful comments. We also thank Xiaocheng Feng, Heng Gong, Zhangyin Feng, and Xiachong Feng for helpful discussion. This work was supported by the National Key R&D Program of China via grant 2018YFB1005103 and the National Natural Science Foundation of China (NSFC) via grants 61632011 and 61772156.

References
