Contextualized Non-local Neural Networks for Sequence Learning

11/21/2018 · Pengfei Liu et al. · HEC Montréal, McGill University, The Ohio State University, Fudan University

Recently, a large number of neural mechanisms and models have been proposed for sequence learning, of which self-attention, as exemplified by the Transformer model, and graph neural networks (GNNs) have attracted much attention. In this paper, we propose an approach that combines and draws on the complementary strengths of these two methods. Specifically, we propose contextualized non-local neural networks (CN^3), which can both dynamically construct a task-specific structure of a sentence and leverage rich local dependencies within a particular neighborhood. Experimental results on ten NLP tasks in text classification, semantic matching, and sequence labeling show that our proposed model outperforms competitive baselines and discovers task-specific dependency structures, thus providing better interpretability to users.


1 Introduction

Learning the representation of sequences is a fundamental task, which requires a deep understanding of both the complex structure of sentences [Biber et al.1998] and the contextualized representation of words [Peters et al.2018]. Recent successful approaches replace the classical compositional functions in neural networks (i.e. CNNs or RNNs) with mechanisms based on self-attention, of which the most effective model is the Transformer, which has achieved state-of-the-art performance on machine translation [Vaswani et al.2017] and parsing [Kitaev and Klein2018]. The success of the Transformer can be attributed to its non-local structure bias, in which dependencies between any pair of words can be modelled [Baerman2015, Wang et al.2017]. This property allows it to dynamically learn both the syntactic and semantic structures of sentences [Vaswani et al.2017]. Despite its success, the lack of local bias (local dependencies that exist over adjacent words [Baerman2015, Futrell2017]) limits its capacity for learning contextualized representations of words.¹

¹ Our experiments also show that the Transformer performs worse than models with local bias, especially on sequence labeling tasks.

Challenges | Bias | G. | T. | Ours
Com-Structure | Non-local | ✗ | ✓ | ✓
Con-Representation | Local | ✓ | ✗ | ✓
Table 1: Com-Structure denotes “complicated structures of sentences” while Con-Representation denotes “contextualized representations of words”. G. and T. represent graph neural networks and the Transformer respectively. The first two columns show the main challenges for sentence learning and the corresponding required bias. Non-local bias: dependencies can be built with any neighbor [Baerman2015, Wang et al.2017]. Local bias: local dependencies exist over adjacent words [Baerman2015, Futrell2017]. The last three columns compare typical sequence learning approaches in terms of incorporating local and non-local bias.

In contrast to Transformer, another line of work aims to model sequences with graph neural networks (GNNs). While GNNs are capable of learning local contextual information flexibly by encoding attribute features [Battaglia et al.2018], it is less clear how to effectively utilize GNNs for sequence learning, since there is no single structural representation of a sentence that is well suited to all tasks.

A common choice is to use the syntactic dependencies between words in a sentence to determine the sentential graph structure, which reduces the model to a tree-structured neural network [Tai et al.2015]. Inspired by Transformer [Vaswani et al.2017], we aim to improve upon this fixed structure by learning a task-dependent graphical representation, which should better capture the dependencies that matter for the end task.

In this paper, we draw from both lines of research and propose a model which benefits from their complementary strengths. On the one hand, the success of the Transformer motivates us to explore dynamic “graph” construction. Rather than defining a hard-coded graph for the sentence, we incorporate non-local bias into GNNs, learning the sentence structure dynamically for different tasks. On the other hand, the advantage of the graph neural network framework is that it provides rich local information by encoding node or edge attributes, which can make up for this deficiency of the Transformer.

Consequently, we propose a contextualized non-local network (CN^3) by extending self-attention (the Transformer) to GNNs, which supports highly flexible representations in two ways: first, in the representation of attributes (feature encodings of nodes and edges); second, in the structure of the graph itself (dynamic learning of task-dependent structures). Notably, these two advantages of the proposed model enable us to better learn the representation of sequences in terms of both the contextualized representation of words and the complicated structures of sentences.

We conduct extensive experiments on ten sequence learning tasks, from text classification and semantic matching to sequence labelling. Experimental results show that our proposed approach outperforms competitive baselines or achieves comparable results on all tasks. Additionally, we are able to discover task-dependent structures among words and provide better interpretability to users.

Models | Interaction Scope | Locality Bias | Eligible Tasks
BOW [Fan et al.2008] | None | Local | T-Sent & T-Word
S-Graph [Mihalcea and Tarau2004] | All pairs of words | Non-local | T-Sent
CNN [Kalchbrenner et al.2014] | Fixed window | Local | T-Sent
SeqRNN [Liu et al.2015] | Adjacent words | Local | T-Sent
TreeRNN [Tai et al.2015, Zhu et al.2015, Liu et al.2017] | Child words | Local | T-Sent
GraphCNN [Marcheggiani and Titov2017] | Child words | Local | T-Word
SelfAtt1 [Yang et al.2016b, Lin et al.2017] | Adjacent words | Local | T-Sent
SelfAtt2 [Cheng et al.2016, Liu et al.2016] | Preceding words | Local | T-Sent & T-Word
SelfAtt3 [Vaswani et al.2017] | All pairs of words | Non-local | T-Word*
Our Work | All pairs of words | Contextualized non-local | T-Sent & T-Word
Table 2: A comparison of published approaches for sentence representations. S-Graph denotes statistical graph. The 2nd column shows which words can interact with each other, the 3rd column lists the locality bias that the interaction method relies on, and the last column shows which tasks the corresponding models can handle. “T-Sent” denotes tasks that require sentence-level representations, such as text classification, while “T-Word” denotes tasks that rely on word-level representations, such as sequence labelling and machine translation. S-Graph usually calls for a complicated ranking algorithm and cannot directly produce a real-valued vector representing sentence meaning. “*”: SelfAtt3 is proposed for machine translation rather than sentence-level representation learning.

We summarize our contributions as follows:

  1. We analyze two challenges for sequence learning (structures of sentences and contextualized word representations) from the perspective of models’ locality biases, which leads us to identify the complementarity between the Transformer and graph neural networks.

  2. We draw on the complementary strengths of the Transformer and GNNs, proposing a contextualized non-local neural network (CN^3), in which the structures of sentences can be learned on the fly and the words in a sentence can be contextualized flexibly.

  3. We perform a comprehensive analysis of existing sequence learning approaches in a unified way and conduct extensive experiments on a number of natural language understanding tasks. Experimental results show that our proposed models achieve comparable results in these tasks as well as offer better interpretability to users.

2 Related Work

In this paper, we propose to utilize the notion of locality bias to systematically analyze existing sequence learning models in a unified way. Here, a locality bias is a kind of inductive bias describing over which neighborhood dependencies between words are built; local and non-local biases are two common locality biases implicitly present in different models for sequence learning. The local bias, stating that local dependencies exist over adjacent words, was first established in phonology [van der Lely2005] and later explored in natural language [Baerman2015, Futrell2017].

Next, we elaborate on the related lines of work. As shown in Table 2, we present a summary of existing models, highlighting the differences in how words interact with each other, the locality bias, and the eligible tasks that the corresponding methods can handle.

Neural network-based Sequence Modelling.

Neural networks provide an effective way to model the dependencies between different words via parameterized composition functions. Examples of composition functions include recurrent neural networks with long short-term memory (LSTM) [Liu et al.2015], convolutional neural networks (CNN) [Kalchbrenner et al.2014], and tree-structured neural networks [Tai et al.2015, Zhu et al.2015]. There are two major differences between these methods: one is the scope within which words can interact with each other; the other is the compositional functions they use. For example, an LSTM explicitly models the dependencies between the current word and the preceding ones, while in a CNN words can interact with others within the same window. More examples can be found in Table 2.

Attention-based Sequence Modelling.

Several attention-based mechanisms have been introduced to expand the interaction scope, enabling more words to interact with each other. For example, Yang et al. [2016b] and Lin et al. [2017] utilize a learnable query vector to aggregate the weighted information of each word, which can be used to obtain sentence-level representations for classification tasks. Cheng et al. [2016] augment neural networks with a re-reading ability while processing each word. Vaswani et al. [2017] propose to model the dependencies between words based entirely on self-attention, without any recurrent or convolutional layers. The so-called Transformer (SelfAtt3) has achieved state-of-the-art results on machine translation. In contrast with previous work, we introduce different locality biases to learn the complex structure of sentences and leverage rich local dependencies within a particular neighbourhood.

In the computer vision community, Wang et al. [2017] also incorporate a non-local bias into neural networks for image representation. However, in this work we focus on sequence learning, which differs from image processing and requires rich contextual information.

Graph Convolutional Neural Network.

Graph convolutional neural networks (GCNNs) have been used to learn representations of graph-structured data. Kipf and Welling [2016] propose a GCNN for semi-supervised graph classification by learning node representations; Gilmer et al. [2017] propose a general neural message passing algorithm to predict the properties of molecular structures; Velickovic et al. [2017] propose graph attention networks to model graph-structured data such as protein and citation networks. While graphs have been well studied in the above areas, it is still less clear how to build a graph for a sentence. Besides, our work differs from [Velickovic et al.2017] in the following aspects: 1) we propose to incorporate non-local bias into GNNs, whereas Velickovic et al. [2017] aim to combine attention with GNNs; 2) we focus on sentence understanding, which also calls for contextualized word representations. Some existing work [Marcheggiani and Titov2017] constructs a graph from a sequence based on pre-defined structures, such as syntactic dependencies, which do not fit the complexity of sentence structures, since for a specific task the true dependency structure among the words in a sentence can differ significantly from the given input structure [Vaswani et al.2017]. In this paper, we integrate non-local bias into GNNs, aiming to learn the structure of sentences on the fly for different tasks.

3 Contextualized Non-local Neural Networks (CN^3)

To better learn the representation of sequences in terms of contextualized representations of words [Peters et al.2018] and complex structures [Biber et al.1998], we propose the contextualized non-local network (CN^3), which incorporates non-local bias into graph neural networks. Generally, CN^3 not only supports the encoding of local contextual information, but also learns task-dependent structures on the fly.

Next, we introduce the contextualized non-local network, which involves two steps. The first is to construct graphs from sentences and introduce contextualized information from attributes. The second is to dynamically update the graph structures for specific tasks. The overall learning process for learning task-dependent graphs is illustrated in Algorithm 1.

3.1 Graph Construction

The first part of our framework is to construct meaningful graphs from sentences in order to encode different types of local contextual information.

Nodes

Given a sentence, we treat each word as a node. For each node, we define node attributes as different types of extra information beyond the word identity. In this paper, we consider several typical node attributes, described as follows:

Position Information

The position of each word in a sentence.

Contextual Information

Information of nodes can also be enriched by short-term contexts, which can be obtained with a convolutional neural network or a long short-term memory network. Introducing this information means that we augment the model with a local bias: adjacent units are prone to provide useful information [Battaglia et al.2018, Cheng et al.2016].

Tag Information

More prior knowledge such as POS tags can be encoded into the nodes.

Edges

For each edge between any pair of words, we define edge attributes as combined relationships, such as lexical relations, syntactic dependencies obtained from a parse tree, or co-occurrence relations.

Contextualized Node Representations

Each word and attribute is first converted into a real-valued vector through a look-up table; if only the word identity is used, the node representations are simply the word representations. The information of each word is then enriched by fusing its attributes, thereby constructing contextualized node representations.
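To make the construction concrete, the following is a minimal sketch of how such contextualized node representations could be assembled in PyTorch; the module name, dimensions, and the particular attribute set (position, a BiLSTM context vector, and a POS tag) are our own illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class NodeEncoder(nn.Module):
    """Sketch: fuse word identity with node attributes (position, short-term
    context from a BiLSTM, POS tag) into contextualized node representations.
    Dimensions and attribute choices are illustrative assumptions."""
    def __init__(self, vocab_size, tag_size, max_len, d_word=100, d_attr=25):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_word)
        self.pos_emb = nn.Embedding(max_len, d_attr)        # position information
        self.tag_emb = nn.Embedding(tag_size, d_attr)       # POS-tag information
        # local bias: short-term context from adjacent words
        self.context = nn.LSTM(d_word, d_attr, batch_first=True, bidirectional=True)

    def forward(self, words, tags):
        # words, tags: (batch, seq_len) integer ids
        batch, seq_len = words.shape
        w = self.word_emb(words)                             # (batch, seq_len, d_word)
        positions = torch.arange(seq_len, device=words.device).expand(batch, seq_len)
        p = self.pos_emb(positions)
        t = self.tag_emb(tags)
        c, _ = self.context(w)                               # contextual information
        # contextualized node representation = word vector fused with its attributes
        return torch.cat([w, p, t, c], dim=-1)
```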

3.2 Dynamic Graph Structure Updating

The non-local bias in CN^3 is realized as pairwise interactions between any two words, which makes it possible to dynamically learn sentence structures that are geared towards specific tasks. Specifically, the model first encodes each node as a low-dimensional vector. Afterwards, different nodes communicate with each other via messages, which updates the graph structure and the node representations. Once the nodes are represented by low-dimensional vectors, we stack multiple message passing layers, each of which updates either the graph structure or the node representations.

Graph Structure Updating

The graph structures are updated by promoting communication between different nodes. Specifically, the strength of the relationship between any two nodes is represented by an edge weight, which is updated according to the following formulas:

(1)
(2)
(3)

where the update depends on the node attributes, the edge attributes between the two nodes, the representation of each node in the current layer, and the word embeddings; the remaining terms are parameters learned by backpropagation.

Input: a text sequence and its tag sequence from a specific task;
       a set of node attributes and edge attributes.
for each training instance do
      for each node do                           // Graph Construction
            // Contextualized representations
            node representation = Concat-RealValue(word, node attributes)
            for each other node do
                  edge representation = Concat-RealValue(edge attributes)
            end for
      end for
      for each message passing layer do          // Graph Learning
            for each node pair do
                  // Dynamic Structure Updating
                  edge weight = Edge-Update(node and edge representations)
            end for
            node representations = Node-Update(edge weights, node representations)
      end for                                    // Application Layer
      if Node-Level task then
            feed the node representations into a CRF layer (Eq. 7)
      else if Graph-Level task then
            average the node representations
            feed the result into a softmax classifier
      end if
end for
Algorithm 1: Learning process of contextualized non-local neural networks for sequences.

Node Updating

Once the graph structure is updated, each node first aggregates information from its neighbors. Specifically, the information collected from the neighbors of a node in a given graph layer can simply be calculated as:

(4)

Afterwards, we update the node representations based on the current node representations and the information aggregated from their neighbors. Here, we choose a gating-based updating method:

(5)

where ⊙ denotes element-wise multiplication, and

(6)

where the gate is computed with the logistic function and the associated weights are learnable parameters.
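Since the layer equations are only summarized in prose here, the following is a minimal sketch of one plausible reading of a single CN^3 layer: pairwise edge weights computed from the contextualized node representations and edge attributes (dynamic structure updating), a weighted aggregation of messages, and a gated node update. All module and parameter names are our own assumptions, not the authors' exact parameterization.

```python
import torch
import torch.nn as nn

class CN3Layer(nn.Module):
    """Sketch of one message-passing layer: dynamic structure updating followed
    by gated node updating. A plausible reading of Section 3.2, not the authors'
    exact parameterization."""
    def __init__(self, d_node, d_edge):
        super().__init__()
        self.query = nn.Linear(d_node, d_node)
        self.key = nn.Linear(d_node, d_node)
        self.edge_bias = nn.Linear(d_edge, 1)      # inject edge attributes
        self.gate = nn.Linear(2 * d_node, d_node)

    def forward(self, h, edge_attr):
        # h: (batch, n, d_node) contextualized node representations at layer l
        # edge_attr: (batch, n, n, d_edge) edge attributes between word pairs
        # --- dynamic structure updating: pairwise (non-local) edge weights ---
        scores = torch.matmul(self.query(h), self.key(h).transpose(1, 2))
        scores = scores + self.edge_bias(edge_attr).squeeze(-1)
        A = torch.softmax(scores, dim=-1)          # learned task-specific structure
        # --- aggregate messages from all other nodes (weighted sum) ---
        m = torch.matmul(A, h)
        # --- gated node updating with a logistic gate ---
        g = torch.sigmoid(self.gate(torch.cat([h, m], dim=-1)))
        h_new = g * m + (1.0 - g) * h
        return h_new, A
```

Stacking several such layers and reading out either the per-node vectors (for labelling) or their average (for classification) mirrors the application layers described in Section 3.3.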

3.3 Application Layers

After multiple steps of graph structure updating, we obtain the node-level representations of the final layer, which can be used for different NLP tasks when followed by different output layers. Here, we show how to utilize these representations at the node level for POS tagging, chunking, and NER, and at the graph level for text classification and semantic matching.

Node-level Representations

A direct application in this scenario is sequence labelling, which aims to assign a tag sequence to a text sequence. The output at each time step can be regarded as the representation of the preceding subsequence, and is fed into a CRF layer that calculates scores for the corresponding tag categories. In the CRF layer, the conditional probability of a tag sequence given the text sequence is formalized as:

(7)

where the scores are produced by the CRF layer with learnable parameters. For a detailed formulation of the CRF layer, see the additional resources.
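As a reference point, the standard linear-chain CRF that such a layer typically implements can be written as follows; the emission score s(·) and transition matrix T below are the usual ingredients of this formulation, not notation taken from the paper.

```latex
p(Y \mid X) \;=\;
\frac{\exp\Big(\sum_{t=1}^{n} \big( s(h_t, y_t) + T_{y_{t-1}, y_t} \big)\Big)}
     {\sum_{Y'} \exp\Big(\sum_{t=1}^{n} \big( s(h_t, y'_t) + T_{y'_{t-1}, y'_t} \big)\Big)}
```

where $h_t$ is the node-level representation at position $t$, $s(h_t, y_t)$ is its emission score for tag $y_t$, and $T$ is a learned tag-transition matrix.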

Graph-level Representations

In this scenario, we compute a representation for the whole graph. The simplest way is to take the average of all the node representations. The resulting graph-level representation can then be fed into a softmax function to yield a probability distribution over the output labels for text classification or semantic matching tasks.
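A minimal sketch of this graph-level readout (mean pooling over the final-layer node representations followed by a softmax classifier); the layer names are illustrative.

```python
import torch
import torch.nn as nn

class GraphReadout(nn.Module):
    """Average the final-layer node representations and classify the sequence."""
    def __init__(self, d_node, n_classes):
        super().__init__()
        self.classifier = nn.Linear(d_node, n_classes)

    def forward(self, h):
        # h: (batch, n, d_node) node representations from the last layer
        g = h.mean(dim=1)                      # graph-level representation
        return torch.softmax(self.classifier(g), dim=-1)
```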

4 Experiments

In this section, we evaluate the effectiveness of our proposed method on ten tasks. Next, we give brief descriptions of the tasks and datasets. For more details of the datasets and training hyper-parameters, please see the Additional Resources section.

4.1 Tasks and Datasets

We evaluate our models on five text classification tasks, two semantic matching tasks and three sequence labelling tasks.

  • Text Classification: QC (question classification); SST2 (the Stanford Sentiment Treebank); MR (movie reviews with two classes [Pang and Lee2005]); SUBJ (subjectivity classification); IMDB (long movie reviews).

  • Semantic Matching: Given two sentences A and B, the model is used to determine the semantic relationship between them. We choose two typical datasets, SICK [Marelli et al.2014] and SNLI [Bowman et al.2015], for this task.

  • Sequence Labelling: We choose POS, Chunking and NER as evaluation tasks on Penn Treebank, CoNLL 2000 and CoNLL 2003 respectively.

Models | QC | MR | SST | SUBJ | IMDB | SICK | SNLI | POS | Chunking | NER
NBOW | 88.2 | 77.2 | 80.5 | 91.3 | 87.5 | 73.4 | 75.1 | 96.38 | 90.51 | 87.91
Neural networks with different structure biases:
CNN | 92.2 | 81.5 | 88.1 | 93.2 | 88.5 | 75.4 | 77.8 | 97.20 | 93.63 | 88.67
RNN | 90.2 | 77.2 | 85.0 | 92.1 | 87.0 | 74.9 | 76.8 | 97.30 | 92.56 | 88.50
LSTM | 91.3 | 77.7 | 85.8 | 92.5 | 88.0 | 76.3 | 77.6 | 97.55 | 94.46 | 90.10
TreeLSTM | 92.8 | 78.7 | 88.0 | 93.3 | - | 77.5 | 78.3 | - | - | -
PDGraph | 93.2 | 80.2 | 86.8 | 92.5 | - | 78.8 | 78.9 | - | - | -
Attention-based models:
SelfAtt1 | 93.2 | 80.3 | 86.4 | 92.9 | 90.3 | 78.4 | 77.4 | * | * | *
SelfAtt2 | 93.5 | 79.8 | 87.0 | 93.1 | 89.2 | 79.8 | 79.2 | - | - | -
SelfAtt3 | 90.1 | 77.4 | 83.6 | 92.2 | 88.5 | 76.3 | 76.9 | 96.42 | 91.12 | 87.61
Task-dependent graphs (CN^3 with different node/edge attributes):
CN^3 | 94.2 | 81.3 | 87.5 | 93.5 | 91.5 | 80.2 | 81.3 | 97.12 | 93.81 | 89.33
CN^3 | 94.8 | 80.7 | 87.1 | 94.3 | 91.0 | 81.4 | 82.1 | - | - | -
CN^3 | 94.6 | 82.5 | 88.0 | 94.0 | 92.5 | 81.5 | 82.7 | 97.65 | 94.82 | 90.51
CN^3 | 95.0 | 81.0 | 87.8 | 94.6 | * | 81.9 | 83.5 | - | - | -
CN^3 | - | - | - | - | - | - | - | 97.78 | 95.13 | 91.10
Table 3: Performance of the proposed models on all datasets compared to typical baselines. Missing entries (“-” or “*”) indicate that the corresponding model cannot work on the task (e.g., because the sentences are too long to be processed by a parser, or because the model cannot be used for sequence labelling) or that the corresponding publication does not evaluate on the task. The superscripts of CN^3 denote edge attributes while the subscripts denote node attributes: node attributes include enriching each node with a long short-term memory unit, an indicator of whether the first letter of a word is capitalized, POS tags, and character information; the edge attribute indicates whether two nodes share an edge in the syntactic dependency tree. NBOW: sums up the word vectors for the graph-level representation and concatenates word and positional embeddings for the node-level representation. CNN: we use [Kim2014] for graph-level and [Collobert et al.2011] for node-level representations. LSTM: we use [Tai et al.2015] for graph-level and [Huang et al.2015] for node-level representations. TreeLSTM: LSTM over tree structures [Tai et al.2015]. PDGraph: pre-defined graph neural network based on syntactic dependencies [Marcheggiani and Titov2017]. SelfAtt1: a structured self-attentive sentence embedding [Lin et al.2017]. SelfAtt2: proposed by [Cheng et al.2016]. SelfAtt3: also known as the Transformer, proposed by [Vaswani et al.2017].

We parse the sentences in the datasets with Stanford NLP toolkit [Manning et al.2014] to obtain dependency relations and Part-of-Speech tags for our models and several competitor models.

4.2 Settings

To minimize the objective, we use stochastic gradient descent with the diagonal variant of AdaDelta [Zeiler2012]. The word embeddings for all of the models are initialized with GloVe vectors [Pennington et al.2014]. The other parameters are initialized by randomly sampling from a uniform distribution over a small symmetric interval. For each task, we take the hyperparameters which achieve the best performance on the development set via grid search. Other detailed settings of our models can be found in the Additional Resources section.
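A hedged sketch of this training setup (AdaDelta, GloVe-initialized word embeddings, uniform initialization of the remaining parameters); the initialization bound and the `word_emb` attribute name are placeholders, since the exact values are deferred to the paper's additional resources.

```python
import torch
import torch.nn as nn

def configure_training(model, glove_matrix, init_bound=0.05):
    """Initialize parameters as described in Section 4.2 (bound is a placeholder)."""
    for name, param in model.named_parameters():
        if "word_emb" not in name:
            nn.init.uniform_(param, -init_bound, init_bound)  # uniform initialization
    # word embeddings initialized with pre-trained GloVe vectors
    model.word_emb.weight.data.copy_(torch.as_tensor(glove_matrix, dtype=torch.float))
    # diagonal variant of AdaDelta for optimization
    return torch.optim.Adadelta(model.parameters())
```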

4.3 Quantitative Evaluation

The proposed models support highly flexible graph representations in two ways: first, in terms of the representation of the attributes (feature encoding of nodes and edges); and second, in terms of the structure of the graph itself (dynamic learning of task-dependent structure). Next, we will elaborate on these evaluations.

Evaluation on Dynamic Structure Learning

Table 3 shows the performance of different models on the ten tasks. We have the following observations:

  • For graph-level tasks, CN^3 consistently outperforms neural networks with different structural biases (sequential, tree, pre-defined graph) and attention mechanisms, indicating the effectiveness of the non-local bias and the dynamic learning of sentence structures. In particular, compared with PDGraph, CN^3 achieves better performance, does not rely on an external syntactic tree, and can handle longer texts, such as IMDB. Compared with LSTM, the improvement of CN^3 also indicates that local and non-local biases are complementary.

  • For node-level tasks (sequence labelling), CN^3 achieves results comparable to CNN (which utilizes a multi-task learning framework) and LSTM (which additionally introduces many external features, such as gazetteer and spelling features).

Evaluation on Attributes

The proposed model has the advantage of being able to contextualize words by encoding node and edge attributes. Here we use superscripts and subscripts to indicate edge and node attributes respectively. Table 3 illustrates the following:

  • Compared to SelfAtt3 (the Transformer), CN^3 obtains substantial improvements, especially on the SST dataset, which contains many sentences with complicated sentence patterns. On node-level tasks, CN^3 surpasses SelfAtt3, and we attribute this success to its ability both to learn structures dynamically and to encode short-term contextual information.

  • Performance can be further enhanced when rich node or edge attributes are taken into account, and different attributes have different effects. Specifically, one CN^3 variant achieves the best performance on the MR and IMDB datasets, while another obtains the best performance on QC, SUBJ, SICK and SNLI. The reason is that the texts in QC, SUBJ, SICK and SNLI are more formal, so external tools (parser or tagger) achieve higher accuracy on them, whereas MR and IMDB contain more informal expressions such as “cooool” and “gooood”, which cause external linguistic tools to fail but can be handled by character-aware models.

  • When the same spelling features as the LSTM model are introduced, the performance of CN^3 on tagging tasks can be further improved. To make a complete comparison, we also list the performance of our models on tagging tasks against state-of-the-art models in Table 4.² Competitor models in Table 4 either introduce multi-task learning methods [Collobert and Weston2008, Yang et al.2016a], utilize more unsupervised knowledge [Peters et al.2018, Yasunaga et al.2018], or design more handcrafted features [Ma and Hovy2016]. In contrast, the proposed models utilize less knowledge in terms of data or external features while achieving comparable results.

² Since it is hard to integrate this table into Table 3.

Model | Chunking | NER | POS
Collobert and Weston [2008] | 94.32 | 89.59 | 97.29
Yang et al. [2016a] | 95.41 | 90.94 | 97.55
Peters et al. [2018] | - | 92.22 | -
Yasunaga et al. [2018] | - | - | 97.58
Ma and Hovy [2016] | - | 91.21 | 97.55
Ours | 95.13 | 91.10 | 97.78
Table 4: Performance of our model against state-of-the-art models.
 | Interpretable Sub-Structures | Explanations
Semantic | (visualized sub-structures) | Key sentence patterns for question classification and sentiment analysis tasks
Syntactic | (visualized sub-structure pairs) | Key subgraph pairs for the semantic matching task, relating to different types of phrases (verb-adverb, verb-object) and voices (active or passive)
Table 5: Multiple interpretable sub-structures generated for different tasks. For text classification tasks, we show the sub-structure of a sentence that serves as a key pattern for the final prediction. For semantic matching, since there are two sentences as input, we simultaneously give a pair of sub-structures marked with lines of different colors. For example, the subgraph pair in the second row denotes that the structure “is leaping over” is learned in the first sentence and, simultaneously, “is jumping into” is learned in the second sentence.

4.4 Qualitative Analysis

In this section, we aim to better understand our models in terms of the following two aspects: 1) How do learned task-dependent structures contribute to different tasks? 2) How do these attributes influence the learned structures? We design a series of experiments to address these questions.

Analysis on Task-dependent Structures

Since the proposed models can explicitly learn the relationship between each word pair by computing edge weights, we randomly sample several examples across different tasks (text classification and semantic matching) and visualize their learned structures.

As shown in Table 5, there are multiple interpretable sub-structures. We observe the following points:

  • For several simple tasks such as text classification, the words in a sentence are usually organized so as to accurately express certain semantic information.

    For example, for a question classification task, the informative patterns constructed by the question word “what” can be easily learned. For a sentiment classification task, the word “movie” is prone to connecting to sentimental words, such as “terrible”, “no sense”.

  • For more complex tasks such as semantic matching, a ground-truth understanding of the syntactic structure is important. In this context, we find that, given a sentence pair, our model is more likely to learn their syntactic information. For example, for the sentence pair “A man is rising from a swamp/A man is coming out of the water”, our model learns that “is rising from” and “is coming out of” are two informative patterns in two sentences respectively, which are crucial for accurately predicting the relationship of the sentence pair (“Entailment”).

Analysis on Attributes

As the above results demonstrate, attributes (Part-of-Speech tags or dependency relations) influence the learning process of the relationships among different nodes. In order to obtain a better intuitive understanding, we randomly pick samples from the QC dataset and compare the dependencies among words learned by the CN^3 variants with and without edge attributes.

As shown in Figure 1 (b-c), given a sentence, the variant with edge attributes makes a correct prediction about the type of this question while the variant without them fails. We note the following points.

1) With the help of edge attributes, more useful patterns can be built into the graph structure. For example, “What” strongly connects to the word “birthday”, thereby capturing the latent sentence pattern “What ... birthday”, which is the key to predicting the type of this question.

2) The relationships between different words are task-specific rather than exactly matching the pre-defined syntactic dependencies. Although our model gets a hint from the dependency parser that there is a strong connection between “What” and “is”, as shown in Figure 1 (a), CN^3 regards it as less informative. We think this simply reflects the task-specific nature of the learned structures.

Figure 1: (a) The dependency tree obtained by the Stanford Parser. (b-c) The relationships between “What” and other words, learned by the CN^3 variants with and without edge attributes respectively. The correct label of “What is Susan B. Anthony ’s birthday ?” is “Number”, indicating that it is a question asking about a number.

5 Conclusion

In this paper, we first analyze two challenges for sequence learning from the perspective of a model’s locality biases, which motivates us to examine the complementary natures of Transformer and graph neural networks. Then, we draw on their complementary strengths to propose a contextualized non-local neural network. Experimental results show that learning task-dependent structures of sentences and contextualized word representations are crucial to many NLP tasks.

Acknowledgments

The authors wish to thank the anonymous reviewers for their helpful comments. This work was partially funded by National Natural Science Foundation of China (No. 61751201, 61672162), STCSM (No.16JC1420401, No.17JC1404100), and Natural Sciences and Engineering Research Council of Canada (NSERC).

References

  • [Baerman2015] Matthew Baerman. The Oxford handbook of inflection. Oxford Handbooks in Linguistics, 2015.
  • [Battaglia et al.2018] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
  • [Biber et al.1998] Douglas Biber, Susan Conrad, and Randi Reppen. Corpus linguistics: Investigating language structure and use. Cambridge University Press, 1998.
  • [Bowman et al.2015] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on EMNLP, 2015.
  • [Cheng et al.2016] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on EMNLP, pages 551–561, 2016.
  • [Collobert and Weston2008] Ronan Collobert and Jason Weston.

    A unified architecture for natural language processing: Deep neural networks with multitask learning.

    In Proceedings of ICML, 2008.
  • [Collobert et al.2011] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. The JMLR, 12:2493–2537, 2011.
  • [Fan et al.2008] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. Liblinear: A library for large linear classification. JMLR, 9(Aug):1871–1874, 2008.
  • [Futrell2017] Richard Landy Jones Futrell. Memory and locality in natural language. PhD thesis, Massachusetts Institute of Technology, 2017.
  • [Gilmer et al.2017] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In ICML, pages 1263–1272, 2017.
  • [Huang et al.2015] Zhiheng Huang, Wei Xu, and Kai Yu. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991, 2015.
  • [Kalchbrenner et al.2014] Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. A convolutional neural network for modelling sentences. In Proceedings of ACL, 2014.
  • [Kim2014] Yoon Kim. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on EMNLP, pages 1746–1751, 2014.
  • [Kipf and Welling2016] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • [Kitaev and Klein2018] Nikita Kitaev and Dan Klein. Constituency parsing with a self-attentive encoder. arXiv preprint arXiv:1805.01052, 2018.
  • [Lin et al.2017] Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.
  • [Liu et al.2015] Pengfei Liu, Xipeng Qiu, Xinchi Chen, Shiyu Wu, and Xuanjing Huang. Multi-timescale lstm for modelling sentences and documents. In Proceedings of EMNLP, 2015.
  • [Liu et al.2016] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Deep fusion LSTMs for text semantic matching. In Proceedings of ACL, 2016.
  • [Liu et al.2017] Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. Dynamic compositional neural networks over tree structure. In Proceedings of IJCAI, 2017.
  • [Ma and Hovy2016] Xuezhe Ma and Eduard Hovy. End-to-end sequence labeling via bi-directional lstm-cnns-crf. In Proceedings of the 54th Annual Meeting of ACL, volume 1, pages 1064–1074, 2016.
  • [Manning et al.2014] Christopher Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven Bethard, and David McClosky. The stanford corenlp natural language processing toolkit. In Proceedings of 52nd ACL, 2014.
  • [Marcheggiani and Titov2017] Diego Marcheggiani and Ivan Titov. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on EMNLP, pages 1506–1515, 2017.
  • [Marelli et al.2014] Marco Marelli, Luisa Bentivogli, Marco Baroni, Raffaella Bernardi, Stefano Menini, and Roberto Zamparelli. Semeval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In SemEval-2014, 2014.
  • [Mihalcea and Tarau2004] Rada Mihalcea and Paul Tarau. Textrank: Bringing order into text. In Proceedings of conference on EMNLP, 2004.
  • [Pang and Lee2005] Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the ACL, pages 115–124, 2005.
  • [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. Proceedings of the EMNLP, 12:1532–1543, 2014.
  • [Peters et al.2018] Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the 2018 Conference of NAACL, volume 1, pages 2227–2237, 2018.
  • [Tai et al.2015] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd ACL, pages 1556–1566, July 2015.
  • [van der Lely2005] HKJ van der Lely. Domain-specific cognitive systems: Insight from grammatical specific language impairment. Trends in Cognitive Sciences, 9(2):53–59, 2005.
  • [Vaswani et al.2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in NIPS, 2017.
  • [Velickovic et al.2017] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks. arXiv preprint arXiv:1710.10903, 2017.
  • [Wang et al.2017] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. arXiv preprint arXiv:1711.07971, 2017.
  • [Yang et al.2016a] Zhilin Yang, Ruslan Salakhutdinov, and William Cohen. Multi-task cross-lingual sequence tagging from scratch. arXiv preprint arXiv:1603.06270, 2016.
  • [Yang et al.2016b] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of NAACL, pages 1480–1489, 2016.
  • [Yasunaga et al.2018] Michihiro Yasunaga, Jungo Kasai, and Dragomir Radev. Robust multilingual part-of-speech tagging via adversarial training. In Proceedings of the NAACL, 2018.
  • [Zeiler2012] Matthew D Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
  • [Zhu et al.2015] Xiao-Dan Zhu, Parinaz Sobhani, and Hongyu Guo. Long short-term memory over recursive structures. In ICML, pages 1604–1612, 2015.