
Semantics-Aware Inferential Network for Natural Language Understanding

For natural language understanding tasks such as machine reading comprehension and natural language inference, both semantic awareness and inference ability are desirable features of the underlying model. We therefore propose a Semantics-Aware Inferential Network (SAIN). Taking explicit contextualized semantics as a complementary input, the inferential module of SAIN performs a series of reasoning steps over semantic clues through an attention mechanism. By stringing these steps together, the network learns to perform iterative reasoning that incorporates both explicit semantics and contextualized representations. With well pre-trained language models as the front-end encoder, our model achieves significant improvement on 11 tasks spanning machine reading comprehension and natural language inference.



1 Introduction

Recent studies Zhang et al. (2020); Mihaylov and Frank (2019); Sun et al. (2019); Zhang et al. (2019, 2018) have shown that introducing extra commonsense or linguistic knowledge into language representations can further enhance natural language understanding (NLU) tasks that latently require reasoning ability, such as natural language inference (NLI) Wang et al. (2019); Bowman et al. (2015) and machine reading comprehension (MRC) Rajpurkar et al. (2018); Kočiský et al. (2018). Zhang et al. (2020) propose incorporating explicit semantics as well-formed linguistic knowledge by concatenating the pre-trained language model embedding with a semantic role labeling embedding, and obtain significant gains on SNLI Bowman et al. (2015) and the GLUE benchmark Wang et al. (2019). Mihaylov and Frank (2019) use semantic information to strengthen a multi-head self-attention model and achieve substantial improvement on NarrativeQA Kočiský et al. (2018). In this work, we propose a Semantics-Aware Inferential Network (SAIN) that refines the use of semantic structures by decomposing text into different semantic structures for compositional processing in an inferential network.

Questions in NLU tasks are usually not compositional, so most existing inferential networks Weston et al. (2014); Yu et al. (2019) feed the same text at each reasoning step, which is not efficient for iterative reasoning. To overcome this problem, we use semantic role labeling to decompose the text into different semantic structures, which can be regarded as different semantic representations of the sentence Khashabi et al. (2018); Mihaylov and Frank (2019).

Figure 1: Examples in MRC and NLI with necessary semantic annotations. The connected predicates have important arguments for predicting the answer.
Figure 2: Overview of the framework, showing the inputs and outputs of the first and last reasoning steps. The encoding module outputs semantic representations that integrate both the contextualized and semantic embeddings. In each step, the model attends to one semantic structure of the passage and one of the question. The final memory state is used to predict the answer.

Semantic role labeling (SRL) over a sentence discovers who did what to whom, when and why with respect to the central meaning (usually the verbs) of the sentence, and presents the semantic relationship as a predicate-argument structure. This naturally matches the requirements of MRC and NLI tasks, because questions in MRC are usually formed with who, what, how, when and why, and verbs in NLI play an important role in determining the answer. Furthermore, when there are several predicate-argument structures in one sentence, multiple contextual semantics arise. Previous neural models usually give little consideration to modeling these multiple semantic structures, which can be critical for predicting the answer.

In Figure 1, to correctly answer the MRC question, the model first needs to recognize that the author of Armada is Ernest Cline, and then that Ernest Cline's novel Ready Player One was made into a movie by Steven Spielberg. This requires iteratively reasoning over the two predicates written and made, because they have very similar arguments to the corresponding predicates written and adapted in the context. For the NLI example, if the model recognizes the predicate infected as the central meaning in S2 and ignores the true central word live, it will probably make the wrong prediction entailment, because S1 also has a similar structure predicated on infected. So it may be helpful to refine the use of semantic clues by integrating all the semantic information into the inference.

We are motivated to model these semantic structures by presenting SAIN, which consists of a set of reasoning steps. In SAIN, each step attends to one predicate-argument structure and can be viewed as a cell consisting of three units: a control unit, a read unit and a write unit, which operate over dual control and memory hidden states. The cells are recursively connected, where the result of the previous step acts as the context of the next step. The interaction between the cells is regulated by structural constraints to perform iterative reasoning in an end-to-end way.

This work will focus on two typical NLU tasks, natural language inference (SNLI Bowman et al. (2015), QNLI Rajpurkar et al. (2016), RTE Bentivogli et al. (2009) and MNLI Williams et al. (2018)) and machine reading comprehension (SQuAD Rajpurkar et al. (2016, 2018) and MRQA Fisch et al. (2019)). Experiment results indicate that our proposed model achieves significant improvement over the strong baselines on these tasks and obtains the state-of-the-art performance on SNLI and MRQA datasets.

2 Approach

The model framework is shown in Figure 2. Our model includes: 1) a contextualized encoding module, which obtains the joint representation of the pre-trained language model embedding and the semantic embedding; 2) an inferential module, which consists of a set of recurrent reasoning steps/cells, where each step/cell attends to one predicate-argument structure of one sentence; and 3) an output module, which predicts the answer based on the final memory state of the inferential module.

2.1 Task Definition

For the MRC task, given a passage (P) and a question (Q), the goal is to predict the answer from the given passage. For the NLI task, given a pair of sentences, the goal is to judge the relationship between their meanings, such as entailment, neutral or contradiction. Our model is introduced against the background of the MRC task; the corresponding NLI implementation can be regarded as a simplified case of MRC, since the passage and question in MRC correspond to the two sentences in NLI.

2.2 Semantic Role Labeling

Semantic role labeling (SRL) is generally formulated as multi-step classification subtasks in pipeline systems to identify the semantic structures. There are a few formal semantic frameworks, including FrameNet Baker et al. (1998) and PropBank Palmer et al. (2005), which generally present the semantic relationship as a predicate-argument structure. When several argument-taking predicates are recognized in one sentence, we obtain multiple semantic representations of the sentence. For example, given the context sentence in Figure 3 with target predicates loves and eat, there are two semantic structures, labeled as follows:

[The cat]ARG0 [loves]V [to eat fish]ARG1.

[The cat]ARG0 [loves to]O [eat]V [fish]ARG1.

where ARG0 and ARG1 represent argument roles 0 and 1 of the predicate V, respectively.
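The two labelings above can be represented concretely as per-word tag sequences, one per predicate. The following is a minimal sketch using BIO-style tags as produced by common SRL taggers (the exact tag format is an assumption for illustration):

```python
# Two SRL structures for the sentence in Figure 3, one per predicate.
words = ["The", "cat", "loves", "to", "eat", "fish"]

# Structure 1: predicate "loves" -> [The cat]ARG0 [loves]V [to eat fish]ARG1
tags_loves = ["B-ARG0", "I-ARG0", "B-V", "B-ARG1", "I-ARG1", "I-ARG1"]

# Structure 2: predicate "eat" -> [The cat]ARG0 [loves to]O [eat]V [fish]ARG1
tags_eat = ["B-ARG0", "I-ARG0", "O", "O", "B-V", "B-ARG1"]

structures = [tags_loves, tags_eat]
# Each structure is a distinct semantic representation of the same
# sentence; every tag sequence has the same length as the word sequence.
assert all(len(t) == len(words) for t in structures)
```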

Figure 3: Different semantic representations of one sentence combined by contextualized embedding and semantic embedding.

2.3 Contextual Encoding

Semantic Embedding Given a sentence with n words and m predicates (m = 2 in Figure 3), there are m corresponding labeled SRL sequences, each of length n. Note that this labeling is done during data preprocessing, and the labels are not updated by the following modules. The semantic role labels are mapped into vectors of dimension d_s, so each sequence is embedded as an n × d_s matrix.

Contextualized Embedding With an adopted contextualized encoder, the input sequence is embedded into a matrix of size L × d_h, where d_h is the hidden state size of the encoder and L is the tokenized sequence length.

Joint Embedding Note that the input sequence may be tokenized into subwords, so the tokenized sequence of length L is usually longer than the SRL sequence of length n. To align the two sequences, we extend the SRL sequence to length L by assigning each subword the same label as its original word (for example, if a word is tokenized into three subwords, its label is repeated for each of them). The aligned contextualized and semantic embeddings are then concatenated as the joint embedding of the sequence. (We also tried summation and multiplication, but experiments show that concatenation works best.)
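The subword alignment described here can be sketched as follows, using a toy tokenizer as a stand-in for a real subword tokenizer (the tokenizer and function names are assumptions for illustration):

```python
def align_srl_to_subwords(words, srl_tags, tokenize):
    """Extend word-level SRL tags to subword level: every subword
    inherits the tag of its original word."""
    subword_tags = []
    for word, tag in zip(words, srl_tags):
        subword_tags.extend([tag] * len(tokenize(word)))
    return subword_tags

def toy_tokenize(w):
    # Toy stand-in for a real subword tokenizer (assumption): split
    # words longer than 3 characters into two pieces.
    return [w[:3], w[3:]] if len(w) > 3 else [w]

tags = align_srl_to_subwords(["The", "cat", "loves"],
                             ["ARG0", "ARG0", "V"], toy_tokenize)
# "loves" splits into two subwords, and both inherit the tag V.
```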

Different sentences have various numbers of predicate-argument structures; for ease of calculation, we set the maximum number of structures to K. Sentences without enough semantic structures are padded to K structures in which all labels are O; for example, the sentence in Figure 3 would be padded with the structure [The cat loves to eat fish]O. So for MRC, the passage and question each have K encoded representations, whose lengths equal the subword counts of the passage and question, respectively.
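The padding scheme can be sketched as follows (function and parameter names are assumptions for illustration):

```python
def pad_structures(structures, n_words, k):
    """Pad a list of per-predicate SRL tag sequences to exactly k
    structures; padding structures label every word as O."""
    padded = structures[:k]
    while len(padded) < k:
        padded.append(["O"] * n_words)
    return padded

# A sentence with only one predicate, padded to k = 3 structures.
padded = pad_structures([["B-ARG0", "B-V"]], n_words=2, k=3)
```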

2.4 Inferential Network

The inferential module performs explicit multi-step reasoning by stringing together K cells, each of which attends to one semantic structure of the sentence. Each cell has three operation units: a control unit, a read unit and a write unit, which iteratively aggregate information from different semantic structures.

For MRC, each reasoning step attends to one semantic structure of the passage and one of the question. The i-th semantic representations of the passage and question are therefore the input sequences for step i. Besides, we use a biLSTM to obtain the overall question representation q.

Reasoning Cell The reasoning cell is a recurrent cell designed to capture information from different semantic structures. For each step i in the reasoning process, the cell maintains two hidden states: control c_i and memory m_i, both of dimension d. The control c_i retrieves information from the question by calculating a soft attention-based weighted average of the question words. The memory m_i holds the intermediate result of the reasoning process up to step i, integrating the preceding hidden state m_{i-1} with the new information r_i retrieved from the passage.

Figure 4: The control unit.

There are three units in each cell: the control unit, the read unit and the write unit, which work together to perform iterative reasoning. The control unit retrieves information from the question, updating the control hidden state c_i. The read unit extracts relevant information from the passage and outputs the extracted information r_i. The write unit integrates r_i with the preceding memory m_{i-1}, producing the new memory m_i. In the following, we give the details of these three units. All vectors are of dimension d unless otherwise stated.

The control unit (Figure 4) attends to the i-th semantic structure of the question at step i and updates the control state accordingly. First, we combine the overall question representation q and the preceding reasoning operation c_{i-1} into cq_i through a linear layer. Subsequently, we calculate the similarity between cq_i and each question word q_j and pass the result through a softmax layer, yielding an attention distribution over the question words. Finally, we sum the words over this distribution to get the new control c_i. The calculation details are as follows:

cq_i = W_cq [q; c_{i-1}] + b_cq
ca_{i,j} = softmax_j (w_a (cq_i ∘ q_j))
c_i = Σ_j ca_{i,j} · q_j

where W_cq, b_cq and w_a are learnable parameters, ∘ denotes element-wise multiplication, and j ranges over the subwords of the question.
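The control-unit computation described above can be sketched in NumPy. This is a minimal illustration under assumed parameter shapes, not the authors' implementation:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def control_unit(q_words, q_repr, c_prev, W_cq, b_cq, w_a):
    """One reasoning step of the control unit (sketch).

    q_words: (L, d) contextual question-word representations
    q_repr:  (d,)   overall question representation (from the biLSTM)
    c_prev:  (d,)   control state of the previous step
    W_cq: (d, 2d), b_cq: (d,), w_a: (d,) learnable parameters
    """
    # Combine the question summary with the previous reasoning operation.
    cq = W_cq @ np.concatenate([q_repr, c_prev]) + b_cq   # (d,)
    # Similarity between cq and every question word, then softmax.
    attn = softmax(q_words @ (w_a * cq))                  # (L,)
    # New control state: attention-weighted average of question words.
    return attn @ q_words                                 # (d,)
```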

The read unit (Figure 5) inspects the i-th semantic structure of the passage at step i and retrieves the information used to update the memory. First, we compute the interaction between every passage word p_j and the memory m_{i-1}, resulting in I_{i,j}, which measures the relevance of the passage word to the preceding memory. Then, I_{i,j} and p_j are concatenated and passed through a linear transformation, yielding I'_{i,j}, which considers both the new information from p_j and the information related to the prior intermediate result. Finally, aiming to retrieve the information relevant to the question, we measure the similarity between c_i and I'_{i,j} and pass the result through a softmax layer, which produces an attention distribution over the passage words; this distribution is used to compute the weighted average r_i over the passage. The calculation is detailed as follows:

I_{i,j} = (W_m m_{i-1}) ∘ (W_p p_j)
I'_{i,j} = W_I [I_{i,j}; p_j] + b_I
ra_{i,j} = softmax_j (w_r (c_i ∘ I'_{i,j}))
r_i = Σ_j ra_{i,j} · p_j

where all the W and b are learnable parameters and j ranges over the subwords of the passage.
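The read-unit computation can likewise be sketched in NumPy; again the parameter shapes are assumptions and this is an illustration, not the authors' implementation:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - x.max())
    return e / e.sum()

def read_unit(p_words, m_prev, c_i, W_m, W_p, W_I, b_I, w_r):
    """One reasoning step of the read unit (sketch).

    p_words: (L, d) contextual passage-word representations
    m_prev:  (d,)   memory state of the previous step
    c_i:     (d,)   control state of the current step
    W_m, W_p: (d, d), W_I: (d, 2d), b_I: (d,), w_r: (d,)
    """
    # Interaction between every passage word and the previous memory.
    I = (W_m @ m_prev) * (p_words @ W_p.T)                   # (L, d)
    # Combine the interaction with the raw word representations.
    I2 = np.concatenate([I, p_words], axis=1) @ W_I.T + b_I  # (L, d)
    # Attend to the passage with respect to the current control state.
    attn = softmax(I2 @ (w_r * c_i))                         # (L,)
    # Retrieved information: weighted average over the passage words.
    return attn @ p_words                                    # (d,)
```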

Figure 5: The read unit.
Figure 6: The write unit.

The write unit (Figure 6) is responsible for integrating the information r_i retrieved by the read unit with the preceding memory m_{i-1}, guided by the reasoning operation c_i from the question. Specifically, a sigmoid gate is used when combining the previous memory state and the new memory candidate. The calculation details are as follows:

m'_i = W_w [r_i; m_{i-1}] + b_w
g_i = sigmoid(w_g c_i + b_g)
m_i = g_i · m_{i-1} + (1 − g_i) · m'_i

where W_w, b_w, w_g and b_g are learnable parameters.
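The gated memory update can be sketched as follows, with assumed parameter shapes (an illustration, not the authors' implementation):

```python
import numpy as np

def write_unit(r_i, m_prev, c_i, W_w, b_w, w_g, b_g):
    """One reasoning step of the write unit (sketch).

    r_i:    (d,) information retrieved by the read unit
    m_prev: (d,) previous memory state
    c_i:    (d,) current control state
    W_w: (d, 2d), b_w: (d,), w_g: (d,), b_g: scalar
    """
    # Candidate memory from the retrieved information and old memory.
    m_cand = W_w @ np.concatenate([r_i, m_prev]) + b_w   # (d,)
    # Sigmoid gate conditioned on the current reasoning operation.
    g = 1.0 / (1.0 + np.exp(-(w_g @ c_i + b_g)))         # scalar
    # Interpolate between the previous memory and the candidate.
    return g * m_prev + (1.0 - g) * m_cand
```

When the gate saturates at 1, the cell keeps its previous memory untouched; when it saturates at 0, the memory is replaced by the candidate.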

2.5 Output Module

For MRC, the output module predicts the final answer to the question based on the set of memory states {m_1, …, m_K} produced by the inferential module. We calculate the similarity between each memory m_i and each passage word in the corresponding semantic passage representation, and concatenate the results as the final passage representation, which is then passed to a linear layer to obtain the start and end probability distributions over positions. Finally, a cross-entropy loss is computed against the true start and end distributions:

loss = CE(p_start, y_start) + CE(p_end, y_end)

where y_start and y_end are the true start and end probability distributions and CE indicates the cross-entropy function.

For NLI, the final memory state m_K is directly passed to a linear layer to produce the probability distribution over the labels, and cross entropy is used as the loss: loss = CE(p, y), where p is the predicted probability distribution over the labels and y is the true label distribution.
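The NLI output head can be sketched as a linear layer plus softmax over the label set, with a standard cross-entropy loss (shapes and names are assumptions for illustration):

```python
import numpy as np

def nli_predict(m_final, W_out, b_out):
    """Sketch of the NLI output head: a linear layer over the final
    memory state followed by softmax over the labels."""
    logits = W_out @ m_final + b_out
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cross_entropy(probs, true_label):
    # Cross-entropy against a one-hot true label distribution.
    return -np.log(probs[true_label])
```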

3 Experiments

3.1 Data and Task Description

Machine Reading Comprehension We evaluate our model on extractive MRC datasets such as SQuAD Rajpurkar et al. (2018) and MRQA Fisch et al. (2019), where the answer is a span of the passage. MRQA is a collection of existing question-answering MRC datasets, such as SearchQA Dunn et al. (2017), NewsQA Trischler et al. (2017), NaturalQuestions Kwiatkowski et al. (2019), TriviaQA Joshi et al. (2017), etc. All these datasets, as shown in Table 1, are transformed into SQuAD style, where given the passage and question, the answer is a span of the passage.

Natural Language Inference Given a pair of sentences, the target of natural language inference is to judge the relationship between their meanings, such as entailment, neutral and contradiction. We evaluate on 4 diverse datasets: Stanford Natural Language Inference (SNLI) Bowman et al. (2015), Multi-Genre Natural Language Inference (MNLI) Williams et al. (2018), Question Natural Language Inference (QNLI) Rajpurkar et al. (2016) and Recognizing Textual Entailment (RTE) Bentivogli et al. (2009).

Dataset #train #dev len(P) len(Q)
NewsQA 74,160 4,211 599 8
TriviaQA 61,688 7,785 784 16
SearchQA 117,384 16,980 749 17
HotpotQA 72,928 5,904 232 22
NaturalQA 104,071 12,836 153 9
Table 1: Statistics of the MRQA datasets. #train and #dev are the numbers of examples in the train and dev sets; len(P) and len(Q) denote the average passage and question lengths in tokens.
NewsQA TriviaQA SearchQA HotpotQA NaturalQA (Avg.)
MTL Fisch et al. (2019) 66.8 71.6 76.7 76.6 77.4 73.8
MTL Fisch et al. (2019) 66.3 74.7 79.0 79.0 79.8 75.8
CLER Takahashi et al. (2019) 69.4 75.6 79.0 79.8 79.8 76.7
BERT Joshi et al. (2019) 68.8 77.5 81.7 78.3 79.9 77.3
HLTC Su et al. (2019) 72.4 76.2 79.3 80.1 80.6 77.7
SemBERT Zhang et al. (2020) 69.1 78.6 82.4 78.6 80.3 77.8
SpanBERT Joshi et al. (2019) 73.6 83.6 84.8 83.0 82.5 81.5
BERT 66.2 71.5 77.0 75.0 77.5 73.4
BERT 69.2 77.4 81.5 78.2 79.4 77.2
SpanBERT 73.0 83.1 83.5 82.5 81.9 80.8
Our Models
SAINBERT 67.9 72.3 77.8 77.4 78.6 74.8
SAINBERT 72.1 80.1 83.4 79.4 82.0 79.4
SAINSpanBERT 74.2 84.5 84.4 83.4 82.7 81.9
Table 2: Performance (F1) on five MRQA tasks. Results with are our implementations. Avg indicates the average score of these datasets. All these results are from single models.
Model MNLI-m/mm QNLI RTE SNLI (Avg.) SQuAD 1.1 SQuAD 2.0
Acc Acc Acc Acc Acc EM F1 EM F1
BERT 84.6 83.4 89.3 66.4 90.7 82.9 80.8 88.5 77.1 80.3
BERT 86.7 85.9 92.7 70.1 91.1 85.3 84.1 90.9 80.0 83.3
SemBERT 84.4 84.0 90.9 69.3 91.0 83.9 _ _ _ _
SemBERT 87.6 86.3 94.6 70.9 91.6 86.2 84.5 91.3 80.9 83.6

Our Models
SAINBERT 84.9 85.0 92.1 72.0 91.2 85.1 82.2 89.3 79.4 82.0
SAINBERT 87.7 87.3 94.5 73.9 91.7 87.1 85.4 91.9 82.8 85.4
Table 3: Experiment results on MNLI, QNLI, RTE, SNLI and SQuAD 1.1, SQuAD 2.0. The results of BERT and SemBERT are from Devlin et al. (2019) and Zhang et al. (2020). indicates the results of SemBERT without random restarts and distillation. Results of SQuAD are tested on development sets. Results with are our implementations. Avg indicates the average score of these datasets. All these results are from single models.

3.2 Implementation Details

To obtain the semantic role labels, we use the SRL system of He et al. (2017) as implemented in AllenNLP Gardner et al. (2018), which splits sentences into tokens and predicts SRL tags such as ARG0 and ARG1 for each verb. We use O for non-argument words and V for predicates. The dimension of the SRL embedding is set to 30; performance does not change significantly when setting it to 10, 50 or 100. The maximum number of predicate-argument structures (reasoning steps) is set to 3 or 4 depending on the task.

Our model framework is based on the PyTorch implementation of Transformers. We use Adam as our optimizer with an initial learning rate of 1e-5 and a warm-up rate of 0.1. The batch size is selected from {8, 16, 32} depending on the task. The total number of parameters varies from 355M to 362M depending on the number of reasoning steps, an increase of 20M to 27M parameters over BERT (335M).

3.3 Overall Results

Our main comparison models are the BERT baselines (BERT Devlin et al. (2019) and SpanBERT Joshi et al. (2019)) and SemBERT Zhang et al. (2020). SemBERT improves the language representation by concatenating the BERT embedding and semantic embedding, where embeddings from different predicate-argument structures are simply fused as one semantic representation by using one linear layer. We compare our model to these baselines on 11 benchmarks including 5 MRQA datasets, 4 NLI tasks and 2 SQuAD datasets in Tables 2 and 3.

SAIN vs. BERT/SpanBERT baselines Compared to BERT Devlin et al. (2019), our model achieves general improvements: 2.1% (79.4 vs. 77.3), 1.8% (87.1 vs. 85.3) and 1.6% (88.7 vs. 87.1) average improvements on the 5 MRQA, 4 NLI and 2 SQuAD datasets, respectively. Our model also outperforms other BERT-based models, CLER Takahashi et al. (2019) and HLTC Su et al. (2019), on MRQA. We further compare with SpanBERT Joshi et al. (2019) on the MRQA datasets, outperforming this baseline by 0.4% (81.9 vs. 81.5) in average F1 score. To the best of our knowledge, we achieve state-of-the-art performance on MRQA (dev sets) and SNLI.

RTE SQuAD 1.1 SQuAD 2.0
SAIN 73.4 91.9 85.4
w/o IM 71.2 (-2.3) 90.6 (-1.3) 82.5 (-2.9)
w/o SI 71.5 (-1.9) 90.1 (-1.8) 83.1 (-2.3)
w/o IR 72.0 (-1.4) 90.9 (-1.0) 83.2 (-2.2)
Table 4: Ablation study on RTE, SQuAD 1.1 and SQuAD 2.0 (F1). We use BERT as contextual encoder here. The definition of IM, SI and IR is detailed in Section 3.4.

SAIN vs. SemBERT Our SAIN outperforms SemBERT on all tasks, with 1.6% (79.4 vs. 77.8), 0.9% (87.1 vs. 86.2) and 1.3% (86.4 vs. 85.1) average improvements on the MRQA, NLI and SQuAD datasets, respectively. We attribute the superiority of SAIN to its more refined use of semantic clues through the inferential network, whereas SemBERT simply encodes all predicate-argument structures into one embedding.

3.4 Ablation Study

To evaluate the contribution of the key components of our model, we perform ablation studies on the RTE, SQuAD 1.1 and SQuAD 2.0 dev sets, as shown in Table 4. We focus on three components: (1) the whole inferential module (IM); (2) the semantic information (SI); (3) iterative reasoning (IR), i.e., having different reasoning cells attend to different predicate-argument structures. We evaluate their contributions respectively by: (1) IM: removing the inferential module and simply combining the BERT embedding with the semantic embeddings from different predicate-argument structures; (2) SI: removing all the semantic embeddings; (3) IR: combining all semantic embeddings from different predicate-argument structures into one, so that every reasoning step takes the same semantic embedding.

Figure 7: Performance on different question types, tested on the SQuAD 1.1 development set. BERT is used as contextual encoder here. The definition of IM, SI and IR is detailed in Section 3.4.

As displayed in Table 4, ablating any of the evaluated components results in a performance drop, which indicates that all the key components (the inferential module, the semantic information and the iterative reasoning process) are indispensable to the model. In particular, the ablation on iterative reasoning shows that it is indeed helpful for the model to attend to different predicate-argument structures in different reasoning steps.

Furthermore, Figure 7 shows the ablation results on different question types, tested on sampled examples from SQuAD 1.1. The full SAIN model outperforms all ablated models on all question types except where-type questions, which again shows that integrating the semantic information of who did what to whom, when and why helps boost performance on MRC tasks, whose questions are usually formed with who, what, how, when and why.

3.5 Influence of Semantic Information

To further investigate the influence of semantic information, Figure 8 compares performance with and without semantic information for different numbers of reasoning steps (from 1 to 7). The highest performance is achieved with 3 steps on SQuAD and 4 on RTE. The results indicate that semantic information consistently contributes to performance, even though the inferential network alone is already strong.

To investigate the influence of the labeler's accuracy, we randomly turn a given proportion [0, 20%, 40%] of labels into erroneous ones. The resulting scores on SQuAD 2.0 and RTE are [85.4, 83.2, 82.6] and [73.4, 71.8, 71.2] respectively, indicating that the model benefits from a high-accuracy labeler but can still maintain its performance with noisy labels.
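The label-corruption procedure can be sketched as follows; the tag set and the exact corruption scheme are assumptions for illustration:

```python
import random

def corrupt_srl_tags(tags, proportion, tagset, seed=0):
    """Replace a given proportion of SRL tags with random wrong tags,
    mimicking the robustness experiment described above (a sketch)."""
    rng = random.Random(seed)
    tags = list(tags)
    n_corrupt = int(round(proportion * len(tags)))
    for idx in rng.sample(range(len(tags)), n_corrupt):
        # Pick any tag other than the current (correct) one.
        wrong = [t for t in tagset if t != tags[idx]]
        tags[idx] = rng.choice(wrong)
    return tags

clean = ["ARG0", "V", "ARG1", "O", "O"]
noisy = corrupt_srl_tags(clean, 0.4, tagset=["ARG0", "ARG1", "V", "O"])
```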

Figure 8: Results on the dev sets of SQuAD 2.0 and RTE when selecting different reasoning steps . We use BERT as contextual encoder here. SQuAD/RTE-w/o SI indicates the results without using any semantic information.

3.6 Influence of Inferential Mechanism

To obtain better insight into the underlying reasoning process, we visualize the attention distributions during the iterative computation; examples are provided in Table 5 and Figure 9. Table 5 shows a relatively complex question that is correctly answered by our model but wrongly predicted by SemBERT Zhang et al. (2020). In this example, there is a misleading contextual similarity between the words "store and transmit" in sentence S1 and "transport and storage" in the question, which may lead a model to a wrong answer from S1, such as "fuel" predicted by SemBERT. To overcome this, the model needs to recognize the central connecting predicates "demand" and "requires" between the question and passage, and then extract the correct answer "special training" from S2.

Figure 9 shows how our model retrieves information from different semantic structures of the question in each reasoning step. The model first focuses on the word "what", working to retrieve a noun. It then focuses on the arguments "transport" and "storage" in step 2, but moves past these words in step 3 and attends to the second verb phrase "dealing with oxygen", drawing the model's attention away from sentence S1. Finally, the model focuses on the main meaning of the question, "demand for safety", and predicts the correct answer "special training" in sentence S2, based on the semantic similarity between "demand for safety" and "requires to ensure". This example intuitively explains why our model benefits from iterative reasoning in which each step attends to only one semantic representation.

Passage: (S1) Steel pipes and storage vessels used to store and transmit both gaseous and liquid oxygen will act as a fuel; (S2) and therefore the design and manufacture of oxygen systems requires special training to ensure that ignition sources are minimized.
Question: What does the transport and storage demand for safety in dealing with oxygen?
Golden Answer: special training
SemBERT: fuel
SAIN: special training
Table 5: One example that is correctly predicted by SAIN, but wrongly predicted by SemBERT.

4 Related Work

Semantic Information for MRC Using semantic information to enhance a question answering system is an effective way to boost performance. Narayanan and Harabagiu (2004) first stress the importance of semantic roles in dealing with complex questions. Shen and Lapata (2007) introduce a general framework for answer extraction which exploits semantic role annotations in the FrameNet Baker et al. (1998) paradigm. Yih et al. (2013) propose to solve the answer selection problem using enhanced lexical semantic models. More recently, Zhang et al. (2020) propose to strengthen the language model representation by fusing explicit contextualized semantics, and Mihaylov and Frank (2019) apply linguistic annotations to a discourse-aware semantic self-attention encoder for reading comprehension on narrative texts. In this work, we propose to use an inferential model to recurrently retrieve different predicate-argument structures, which presents a more refined way of using semantic clues and is thus essentially different from all previous methods.

Inferential Network

To support inference in neural networks, existing models rely either on structured rule-based matching methods Sun et al. (2018) or on multi-layer memory networks Weston et al. (2014); Liu and Perez (2017), which respectively lack an end-to-end design or a prior structure to subtly guide the reasoning direction.

Other related work is on Visual QA, which aims to answer questions about a given image. In particular, Santoro et al. (2017) propose a relation network, but it is restricted to relational questions such as comparisons. Later, for compositional questions, Hudson and Manning (2018) introduce an iterative network that separates memory and control to improve interpretability. Our work leverages this separated design, dedicating it to inferential NLI and MRC tasks, where questions are usually not compositional.

To overcome the difficulty of applying inferential networks to general NLU tasks, and at the same time refine the use of multiple semantic structures, we propose SAIN, which naturally decomposes text into different semantic structures for compositional processing in an inferential network.

Figure 9: Transformation of attention distribution at each reasoning step, showing how the model iteratively retrieves information from the question.

5 Conclusion

This work focuses on two typical NLU tasks, machine reading comprehension and natural language inference, by refining the use of semantic clues within an inferential model. The proposed Semantics-Aware Inferential Network (SAIN) takes multiple semantic structures as input to an inferential network, closely integrating semantics with the reasoning steps. Experimental results on 11 benchmarks, including 4 NLI tasks and 7 MRC tasks, show that our model outperforms all previous strong baselines, consistently indicating its general effectiveness. (Our model can easily be adapted to other language models such as ALBERT, which we leave for future work.)