
Fine-tuning Multi-hop Question Answering with Hierarchical Graph Network

In this paper, we present a two-stage model for multi-hop question answering. The first stage is a hierarchical graph network, which reasons over a multi-hop question and captures different levels of granularity using the natural structure of documents (i.e., paragraphs, questions, sentences, and entities). The reasoning process is converted into a node classification task over paragraph nodes and sentence nodes. The second stage is a language model fine-tuning task. In short, stage one uses a graph neural network to select supporting sentences and concatenate them into one paragraph, and stage two finds the answer span under the language model fine-tuning paradigm.



1 Introduction

In one-hop question answering, also known as machine reading comprehension, the answer span can be derived from a single paragraph. Numerous neural models have been proposed (Seo et al. (2017), Chen et al. (2017), Clark & Gardner (2018), Feldman & El-Yaniv (2019)) and have achieved strong performance on several data sets, such as SQuAD (Rajpurkar et al. (2016), Rajpurkar et al. (2018)) and TriviaQA (Joshi et al. (2017)). On such tasks, language models have been shown to outperform humans since the release of BERT (Devlin et al. (2019)), and many strong follow-up works have appeared, such as Retro-Reader on ALBERT (Zhang et al. (2020)), XLNet + SG-Net Verifier (Zhang et al. (2019)), or simply fine-tuning a pre-trained language model such as ALBERT (Lan et al. (2019)).

Naturally, extending the reading capacity of language models to multi-hop question answering is a challenging problem. WikiHop (Welbl et al. (2018)), ComplexWebQuestions (Talmor & Berant (2018)) and HotpotQA (Yang et al. (2018)) are popular multi-hop reasoning data sets. These data sets require multi-hop reasoning over multiple supporting documents to find the answer. An example from HotpotQA is illustrated in Figure 1. In order to correctly answer the question ("What city is the band that recorded Renegade from?"), the model first needs to identify Paragraph 1 as a relevant paragraph, whose title contains a keyword that appears in the question ("Renegade"). The first sentence of Paragraph 1 is then verified as a supporting fact, which leads to the next-hop paragraph, Paragraph 2, through the bridge entity "Styx". From Paragraph 2, the span "Chicago" is selected as the predicted answer.

Question: What city is the band that recorded Renegade from?
Paragraph 1, Renegade (Styx song): "Renegade" is a 1979 hit song recorded by the American rock band Styx.
Paragraph 2, Styx (band): Styx is an American rock band from Chicago that formed in 1972 and became famous for its albums released in the late 1970s and early 1980s.
Answer: Chicago
Supporting facts: [['Renegade (Styx song)', 0], ['Styx (band)', 0]]

Figure 1: An example from HotpotQA. Underlining denotes the bridge entity (unlabeled in the data set). "Supporting facts" is shown in the original format of the data set.

Most existing studies approach the multi-hop task from one of two directions. The first direction focuses on applying or adapting frameworks that have been successful in single-hop QA to multi-hop QA (e.g., Dhingra et al. (2018), Nishida et al. (2019), Zhong et al. (2019)).

The other direction treats the connectivity of a Graph Neural Network (GNN) as a reasoning chain, so that the multi-hop task is converted into a path (or sub-graph) selection problem or a node classification problem. Many prominent works follow this direction (Cao et al. (2019), De Cao et al. (2019), Tu et al. (2019), Ding et al. (2019), Qiu et al. (2019), Asai et al. (2019)). Graph neural networks have demonstrated promising potential in many of these recent works.

Despite this success, current approaches to multi-hop QA still have several limitations. First, the entity graph is widely used for predicting answers or extending reasoning paths, but it is insufficient for finding supporting facts. An entity graph contains little information compared to sentences or paragraphs, so relying heavily on it limits the model capacity. Second, almost all existing methods work directly on all documents by simply concatenating them, regardless of the fact that most of the context is not related to the question or not helpful in finding the answer. In the pretrained language model fine-tuning paradigm, the context length is restricted to a fixed number of tokens (e.g., 512 or 1024), but little work has been done on designing a sentence-level filter to remove redundant context. Motivated by the Hierarchical Graph Network (HGN) (Fang et al. (2019)), we propose a two-stage model that combines the reasoning capacity of HGN with the reading capacity of pretrained language models. The original work proposed a multi-task learning model in which the HGN part and the reading comprehension part share the same context encoding generated by BERT, and the model learns to classify nodes (i.e., choose supporting sentences) and to predict the answer span at the same time. We decompose the model into two stages because purely fine-tuning the pretrained language model is a better way to fully explore the LM's potential. In addition, our method initializes the nodes in a different way: we use the simpler [CLS] token features rather than a bi-LSTM.

Our model procedure is constructed intuitively. Given a question and a set of paragraphs (the HotpotQA distractor setting):
1. Identify supporting paragraphs and sentences.
2. Concatenate all selected sentences as the context.
3. Fine-tune a language model to find the answer span in the context.
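The three steps above can be sketched as a small pipeline. This is only an illustrative sketch: the function names, the `reader` callable standing in for the fine-tuned language model, the scores, and the distractor sentence are all hypothetical and not taken from the paper's implementation.

```python
from typing import Callable, List


def select_support_sentences(
    sentences: List[str], scores: List[float], top_n: int = 3
) -> List[str]:
    """Step 1 (stage-one output): keep the top_n sentences by descending score."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:top_n]]


def build_context(support_sentences: List[str]) -> str:
    """Step 2: concatenate the selected sentences into one pseudo-paragraph."""
    return " ".join(support_sentences)


def answer_question(
    question: str,
    sentences: List[str],
    scores: List[float],
    reader: Callable[[str, str], str],
    top_n: int = 3,
) -> str:
    """Step 3: hand the question and the reduced context to a span reader."""
    context = build_context(select_support_sentences(sentences, scores, top_n))
    return reader(question, context)


# Toy usage with the Figure 1 example plus one made-up distractor sentence.
question = "What city is the band that recorded Renegade from?"
sentences = [
    '"Renegade" is a 1979 hit song recorded by the American rock band Styx.',
    "This is an unrelated distractor sentence.",
    "Styx is an American rock band from Chicago that formed in 1972.",
]
scores = [0.95, 0.10, 0.90]  # hypothetical stage-one sentence scores


def toy_reader(question: str, context: str) -> str:
    """Hypothetical stand-in for the fine-tuned LM: looks for one known span."""
    return "Chicago" if "Chicago" in context else ""


print(answer_question(question, sentences, scores, toy_reader, top_n=2))
```

Because the reader only ever sees the selected sentences, the second stage reduces to single-paragraph reading comprehension.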

Figure 2: Architecture of our model. feat is the feature vector of a sentence.

Details of the Hierarchical Graph Network:
1. Four types of nodes: question, paragraphs, sentences, and entities (see Figure 2).
2. Initialization of nodes: following the "[CLS] text1 text2" format, we initialize different types of nodes with different text pairs. See section xxx.
3. Seven types of edges. See section xxx for details.

Our two-stage model makes the following contributions:
1. By taking advantage of the Hierarchical Graph Network to select supporting sentences, we convert multi-hop reasoning into single-hop reading comprehension.
2. We explore the potential of pretrained language models for the question answering fine-tuning task.

2 Related work

Language Model (LM)

Since the release of BERT (Devlin et al. (2019)), LMs have performed better than humans on the machine reading comprehension task, one of the sub-tasks they are applied to. A large body of work attempting to improve BERT has followed. RoBERTa (Liu et al. (2019)) incorporates many training tricks and slightly modifies the original loss function; Transformer-XL (Dai et al. (2019)) extends the model to variable-length input sequences with a recurrent structure; XLNet (Yang et al. (2019)), which is an upgrade of Transformer-XL, creates a permutation attention mask matrix to address the bias introduced by [MASK] tokens; A Lite BERT (ALBERT) (Lan et al. (2019)) incorporates two parameter reduction techniques to accelerate both training and inference. The T5 model (Raffel et al. (2019)) is a generative model that introduces a unified framework converting every language problem into a text-to-text format. Benefiting from these models, solving question answering tasks with the transfer learning paradigm is a promising trend.

Graph Neural Network for Multi-hop QA

A GNN is a powerful tool for reasoning via message passing between neighboring nodes. Recent studies on multi-hop QA focus on building graphs based on entities. MHQA-GRN (Song et al. (2018)) and Coref-GRN (Dhingra et al. (2018)) construct an entity graph based on co-reference resolution or sliding windows. Entity-GCN (De Cao et al. (2019)) connects different documents via entity mentions. BAG (Cao et al. (2019)) extends the BiDAF framework to learn graph representations. Cognitive Graph QA (Ding et al. (2019)) mimics the human cognitive process and iteratively generates an entity graph to find the reasoning path. Dynamically Fused Graph Network (DFGN) (Qiu et al. (2019)) constructs a dynamic entity graph, where in each reasoning step irrelevant entities are softly masked out, and a fusion module is designed to improve the interaction between the entity graph and the documents.

Unlike entity graphs, our hierarchical graph models all granularities from paragraphs to entities. Unlike the original HGN, which is a multi-task learning model, our two-stage model demonstrates the capacities of HGN and the LM separately.

3 Hierarchical Graph Network

The Hierarchical Graph Network (HGN) consists of four main components:
(i) Graph Construction Module, through which nodes are created according to the natural structure of the data;
(ii) Feature Generation Module, where initial representations of graph nodes are obtained via a pretrained language model encoder;
(iii) Graph Reasoning Module, where a graph-attention-based message passing algorithm is applied to jointly update node representations;
(iv) Node Classify Module, which converts the task of choosing supporting paragraphs into classifying paragraph nodes.

The following sub-sections describe each component in detail.

3.1 Graph Construction

As stated above, HGN consists of four types of nodes: questions, paragraphs, sentences, and entities. According to the structure of the data set, each question comes with a set of paragraphs, one or more of which are labeled as supporting. Within the labeled paragraphs, one or more sentences are labeled "support"; these provide the context necessary for answering the question, although the answer span may lie in only one of them. Therefore, we can create the question, paragraph, and sentence nodes directly. Entity nodes come from sentences: we extract all the entities in a sentence and add edges between the sentence node and these entity nodes. Entities act as bridges, which are defined as hyperlinks. We use an external tool to identify and add hyperlinks between sentences and paragraph titles whenever an entity appears in both of them.

Seven different types of edges are defined as follows:
(i) edges between question node and paragraph nodes;
(ii) edges between question node and entity nodes that appear in the question;
(iii) edges between paragraph nodes and their sentence nodes (sentences within the paragraph);
(iv) edges between sentence nodes and their linked paragraph nodes (linked through hyperlinks);
(v) edges between sentence nodes and their corresponding entity nodes (entities appearing in the sentences);
(vi) edges between paragraph nodes;
(vii) edges between sentence nodes that appear in the same paragraph.
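The seven edge types can be made concrete with a small graph builder. This is a sketch under assumptions: the node identifiers, the edge-type labels, the input layout, and the choice to connect every pair of paragraphs and every pair of sentences within a paragraph are illustrative and not confirmed by the paper.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical labels for edge types (i) to (vii).
EDGE_TYPES = ("q-p", "q-e", "p-s", "s-p_link", "s-e", "p-p", "s-s")


def build_graph(question_entities, paragraphs, hyperlinks):
    """Return a dict mapping each edge type to a list of (node, node) pairs.

    paragraphs: list of {"title": str, "sentences": [{"text": str, "entities": [str]}]}
    hyperlinks: list of ((paragraph_idx, sentence_idx), target_paragraph_idx)
    """
    edges = defaultdict(list)
    q = ("q",)
    for i, para in enumerate(paragraphs):
        p = ("p", i)
        edges["q-p"].append((q, p))                      # (i)
        sent_nodes = []
        for j, sent in enumerate(para["sentences"]):
            s = ("s", i, j)
            sent_nodes.append(s)
            edges["p-s"].append((p, s))                  # (iii)
            for ent in sent["entities"]:
                edges["s-e"].append((s, ("e", ent)))     # (v)
        edges["s-s"].extend(combinations(sent_nodes, 2))  # (vii), all pairs assumed
    for ent in question_entities:
        edges["q-e"].append((q, ("e", ent)))             # (ii)
    for (i, j), target in hyperlinks:
        edges["s-p_link"].append((("s", i, j), ("p", target)))  # (iv)
    para_nodes = [("p", i) for i in range(len(paragraphs))]
    edges["p-p"].extend(combinations(para_nodes, 2))     # (vi), all pairs assumed
    return edges


# Toy usage with the Figure 1 example; the entity lists are filled in by hand.
paragraphs = [
    {"title": "Renegade (Styx song)",
     "sentences": [{"text": '"Renegade" is a 1979 hit song recorded by Styx.',
                    "entities": ["Renegade", "Styx"]}]},
    {"title": "Styx (band)",
     "sentences": [{"text": "Styx is an American rock band from Chicago.",
                    "entities": ["Styx", "Chicago"]}]},
]
graph = build_graph(["Renegade"], paragraphs, hyperlinks=[((0, 0), 1)])
for edge_type in EDGE_TYPES:
    print(edge_type, len(graph[edge_type]))
```

Keeping one edge list per type makes it straightforward to give each type its own attention parameters during graph reasoning.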

3.2 Node initialization

Every node is represented by a feature vector. In order to capture semantic information, we use the feature of the LM's [CLS] token, as is commonly done. We denote the question text as $q$, the paragraph title text as $t$, the sentence text as $s$, and the entity text as $e$. Accordingly, $\mathrm{feat}_i$ denotes the feature vector of node $i$, where $i \in \{q, p, s, e\}$.

question node

Passing the raw question text to the LM is a reasonable and effective way to obtain the features.

$\mathrm{feat}_q$ = [CLS] in LM("[CLS] $q$ [SEP]")

sentence node

It is important to mention that an LM has a limited maximum sequence length (e.g., 512 or 1024). For training, the input dimension has to be a constant number, even when using XLNet. This is a crucial limitation, but it is rare for a single sentence to contain more than 512 tokens. The original HGN model (the multi-task learning version) extracts sentence encodings from the paragraph encoding, which is made up of sentences, using sentence offsets. Paragraphs are much more likely to exceed this token length limit.

= [CLS] in LM(“[CLS] [SEP] [SEP]”)

paragraph node

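Intuitively, a paragraph is made up of a title and a set of sentences. Therefore, we simply add the title feature and the sentence features.

$\mathrm{feat}_p = \mathrm{[CLS]}_t + \mathrm{sum}(\mathrm{feat}_s)$

where $\mathrm{[CLS]}_t$ = [CLS] in LM("[CLS] $t$ [SEP]"), $t$ is the paragraph title text, and the sum runs over the sentences $s$ of the paragraph.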

entity node

We only consider the entity name and the paragraph title as context.

= [CLS] in LM(“[CLS] [SEP] [SEP]”)
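The initialization scheme of this section can be sketched as follows. The helper names and the toy encoder are hypothetical; in practice the [CLS] feature would come from the pretrained LM, and real tokenizers usually insert the special tokens themselves. The sketch only shows the "[CLS] text1 [SEP] text2 [SEP]" input layout and the paragraph feature as the title feature plus the sum of sentence features.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def format_input(text1: str, text2: Optional[str] = None) -> str:
    """Lay out a single text or a text pair in the [CLS]/[SEP] format."""
    if text2 is None:
        return f"[CLS] {text1} [SEP]"
    return f"[CLS] {text1} [SEP] {text2} [SEP]"


def paragraph_feature(title_feat: Vector, sentence_feats: Sequence[Vector]) -> Vector:
    """Paragraph feature = title [CLS] feature + element-wise sum of sentence features."""
    out = list(title_feat)
    for feat in sentence_feats:
        out = [a + b for a, b in zip(out, feat)]
    return out


def toy_cls_encoder(text: str, dim: int = 4) -> Vector:
    """Hypothetical stand-in for the LM: maps a formatted input to a fixed-size vector."""
    return [float((len(text) + k) % 7) for k in range(dim)]


# Toy usage: one title and two sentence features of the same dimension.
title_feat = toy_cls_encoder(format_input("Styx (band)"))
sentence_feats = [toy_cls_encoder(format_input(s)) for s in ("First sentence.", "Second one.")]
print(paragraph_feature(title_feat, sentence_feats))
```

Summing keeps the paragraph feature at the same dimension as the other node features, regardless of how many sentences the paragraph contains.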

3.3 Graph Reasoning

After node initialization, the node features are updated via a graph neural network. We use a heterogeneous version of the Graph Attention Network (GAT) (Velickovic et al. (2018)) to pass messages over the hierarchical graph. Specifically, GAT aggregates the information of all neighbors with learnable weights to update a node feature. Formally,

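$$h_i' = \sigma\Big(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j\Big)$$

where $h_i'$ is the next-step hidden state of node $i$, $\mathcal{N}_i$ is the set of neighbors of node $i$, $W$ is a weight matrix, $\sigma$ denotes an activation function, and $\alpha_{ij}$ is the attention coefficient, which is calculated by:

$$\alpha_{ij} = \frac{\exp\big(f(W_{e_{ij}}[h_i \,\|\, h_j])\big)}{\sum_{k \in \mathcal{N}_i} \exp\big(f(W_{e_{ik}}[h_i \,\|\, h_k])\big)}$$

where $W_{e_{ij}}$ is the weight matrix with respect to the edge type $e_{ij}$ between the $i$-th and $j$-th nodes, $[\cdot \,\|\, \cdot]$ denotes concatenation, and $f$ is the LeakyReLU nonlinearity of the standard GAT formulation. In summary, after graph reasoning, we obtain $h'$, the updated representations for each node.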
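A single message-passing step of this kind can be sketched in plain Python. This is a minimal sketch assuming the standard GAT form with one attention vector per edge type, a shared projection matrix, LeakyReLU inside the attention score, and ReLU as the output activation; all names and the toy inputs are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def matvec(W: List[Vector], h: Vector) -> Vector:
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def leaky_relu(x: float, slope: float = 0.2) -> float:
    return x if x > 0 else slope * x


def gat_step(
    h: Dict[int, Vector],
    neighbors: Dict[int, List[int]],
    edge_type: Dict[Tuple[int, int], str],
    W: List[Vector],
    w_edge: Dict[str, Vector],
) -> Dict[int, Vector]:
    """Update every node from its neighbors with edge-type-specific attention."""
    new_h = {}
    for i, h_i in h.items():
        nbrs = neighbors.get(i, [])
        if not nbrs:
            # Assumption: an isolated node is only projected and activated.
            new_h[i] = [max(0.0, x) for x in matvec(W, h_i)]
            continue
        # Unnormalized scores: edge-type weights applied to [h_i || h_j].
        scores = [
            leaky_relu(sum(a * x for a, x in zip(w_edge[edge_type[(i, j)]], h_i + h[j])))
            for j in nbrs
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]
        messages = [matvec(W, h[j]) for j in nbrs]
        agg = [sum(a * m[d] for a, m in zip(alpha, messages)) for d in range(len(W))]
        new_h[i] = [max(0.0, x) for x in agg]  # ReLU as the activation
    return new_h


# Toy usage: node 0 attends to two neighbors with identical features.
h = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [0.0, 2.0]}
neighbors = {0: [1, 2]}
edge_type = {(0, 1): "s-s", (0, 2): "s-s"}
W = [[1.0, 0.0], [0.0, 1.0]]
w_edge = {"s-s": [0.1, 0.1, 0.1, 0.1]}
updated = gat_step(h, neighbors, edge_type, W, w_edge)
print(updated[0])
```

In the toy usage both neighbors receive equal attention, so node 0 ends up with their shared feature.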

3.4 Node Classify

The data set provides labels for supporting paragraphs and sentences, so we directly use a 2-layer perceptron to reduce the dimension from the hidden size to two, turning node selection into a binary classification problem. It is noteworthy that the answer type "comparison" has two options: yes or no. Intuitively, answering a question requires reading all the context words, but deciding whether a question is of type "comparison" or "word span" only requires the question text. Specifically, the question node features are sufficient, so we add a classifier on the question node. The predicted question type is passed to the second-stage model, where the final choice is made; otherwise, the second-stage model finds the answer span in the token sequence. Formally, we define three loss terms:


denotes a loss function, denotes the last hidden state and subscript denotes node type, denotes initial hidden states.

4 Language Model Fine-tuning

The second-stage model is a language model with minimal modification. Following the guidance of T5 (Raffel et al. (2019)), fine-tuning all of the model's parameters can lead to suboptimal results, particularly on low-resource tasks. In the first strategy, we only add a small classifier that is fed the sentence embeddings produced by a fixed pre-trained LM.

The second alternative fine-tuning method we consider is "gradual unfreezing" (Howard & Ruder (2018)). In gradual unfreezing, the model's parameters are unfrozen and fine-tuned from the top layer to the bottom layer over time. In our setting, the additional head is first trained for a fixed number of steps, and then the attention blocks are unfrozen gradually.
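A gradual unfreezing schedule of this kind can be written as a small function that reports which Transformer blocks are trainable at a given step. The function name, the parameter names, and the fixed interval between unfreezing events are assumptions made for illustration.

```python
from typing import List


def trainable_blocks(
    step: int, num_blocks: int, head_only_steps: int, steps_per_block: int
) -> List[int]:
    """Return the indices of the blocks that are trainable at `step`.

    Block num_blocks - 1 is the top (last) block. The task head is assumed to be
    trainable throughout, so it is not listed here.
    """
    if step < head_only_steps:
        return []  # only the additional head is being trained
    unfrozen = 1 + (step - head_only_steps) // steps_per_block
    unfrozen = min(unfrozen, num_blocks)
    return list(range(num_blocks - unfrozen, num_blocks))


# Toy usage: 12 blocks, 100 head-only steps, then one more block every 50 steps.
for step in (0, 100, 150, 10_000):
    print(step, trainable_blocks(step, num_blocks=12, head_only_steps=100, steps_per_block=50))
```

A training loop would call this once per step and enable gradients only for the returned blocks.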

The third approach we tried is adding "adapter layers". Adapter layers are additional dense-ReLU-dense blocks that are added after each of the preexisting feed-forward networks in each block of the Transformer. Such layers have only one hyperparameter: the hidden dimension $d$. We experiment with various values for $d$.
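An adapter block is small enough to write out directly. The sketch below is a dense-ReLU-dense block with hidden dimension d; the residual connection around the block and the parameter names are assumptions, since the text only specifies the dense-ReLU-dense structure.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def adapter(x: Vector, W_down: Matrix, b_down: Vector, W_up: Matrix, b_up: Vector) -> Vector:
    """Apply dense -> ReLU -> dense, then add the input back (assumed residual)."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W_down, b_down)
    ]
    out = [sum(w * hi for w, hi in zip(row, hidden)) + b for row, b in zip(W_up, b_up)]
    return [xi + oi for xi, oi in zip(x, out)]


def adapter_param_count(model_dim: int, d: int) -> int:
    """Number of trainable parameters in one adapter block (weights plus biases)."""
    return 2 * model_dim * d + d + model_dim


# Toy usage: model dimension 2, adapter hidden dimension d = 1.
x = [1.0, 2.0]
print(adapter(x, W_down=[[1.0, 1.0]], b_down=[0.0], W_up=[[1.0], [2.0]], b_up=[0.0, 0.0]))
print(adapter_param_count(768, 16))
```

With a zero up-projection the block reduces to the identity, so inserting adapters does not change the pretrained model's output before training starts.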


Formally, given a set of supporting sentences predicted by the stage-one model (the size of the set is a hyperparameter), the targets of model 2 are:


where , ,

denote logits in corresponding positions,

denotes a kind of fine-tuning method. means selecting the top N sentences as support evidences.

sentence permutations

We notice that the LM sums positional embeddings and word embeddings when sentences are fed in, and that positional embeddings are crucial for the model to capture sequence information. However, the order of the sentences predicted by the GNN cannot be guaranteed, so sentences may occur in any order. Hence, we permute the set of sentences to form a set of context paragraphs as training data.
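The permutation step can be sketched with the standard library. The function name and the optional cap on the number of generated contexts are assumptions; the cap is included because the number of orderings grows factorially with the number of sentences.

```python
from itertools import permutations
from typing import List, Optional


def permuted_contexts(sentences: List[str], max_contexts: Optional[int] = None) -> List[str]:
    """Build one context paragraph per ordering of the supporting sentences."""
    contexts = []
    for k, order in enumerate(permutations(sentences)):
        if max_contexts is not None and k >= max_contexts:
            break
        contexts.append(" ".join(order))
    return contexts


# Toy usage: three supporting sentences give 3! = 6 training contexts.
for context in permuted_contexts(["Sentence A.", "Sentence B.", "Sentence C."]):
    print(context)
```

Since the answer span moves when the sentences are reordered, its start and end positions would need to be recomputed for each generated context.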

5 Experiments

5.1 Experimental Setup


HotpotQA is a question answering data set that requires reading multiple sentences across multiple documents to find the final answer span. It was constructed by asking crowd workers to provide a question given multiple documents. The data set also provides gold annotations for each question, namely the supporting sentences and paragraphs and the answer span. There are about 90K training samples, 7.4K development samples, and 7.4K test samples. Please refer to the original paper (Yang et al. (2018)) for more details.

HotpotQA presents two tasks: answer span prediction and supporting facts prediction. Results are evaluated based on the Exact Match (EM) and F1 scores of the two tasks. To evaluate the overall performance, joint EM and joint F1 scores are also reported. We train our two-stage model on the training set and tune hyperparameters on the development set.
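The metrics can be illustrated with a simplified scorer. The sketch below assumes the joint scores are formed as in the HotpotQA evaluation, where joint EM is the product of the two EM scores and joint F1 is computed from the products of the two precisions and the two recalls. The normalization here is only lowercasing and whitespace splitting, which is simpler than the official script, and the function names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the prediction equals the gold answer after light normalization."""
    return float(pred.strip().lower() == gold.strip().lower())


def precision_recall_f1(
    pred: Sequence[Hashable], gold: Sequence[Hashable]
) -> Tuple[float, float, float]:
    """Overlap-based precision, recall and F1 over tokens or supporting facts."""
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    p = overlap / len(pred)
    r = overlap / len(gold)
    return p, r, 2 * p * r / (p + r)


def joint_scores(ans_em, ans_p, ans_r, sup_em, sup_p, sup_r) -> Tuple[float, float]:
    """Joint EM and joint F1 from the answer and supporting-fact scores."""
    p, r = ans_p * sup_p, ans_r * sup_r
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return ans_em * sup_em, f1


# Toy usage: a correct answer with one extra (made-up) supporting fact predicted.
ans_em = exact_match("Chicago", "chicago")
ans_p, ans_r, _ = precision_recall_f1("chicago".split(), "chicago".split())
pred_sup = [("Renegade (Styx song)", 0), ("Styx (band)", 0), ("Styx (band)", 1)]
gold_sup = [("Renegade (Styx song)", 0), ("Styx (band)", 0)]
sup_p, sup_r, sup_f1 = precision_recall_f1(pred_sup, gold_sup)
sup_em = float(set(pred_sup) == set(gold_sup))
print(joint_scores(ans_em, ans_p, ans_r, sup_em, sup_p, sup_r))
```

The joint metrics reward a model only when it both finds the answer and identifies the right supporting facts, which is why a single extra supporting fact already brings joint EM to zero in the toy usage.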

Implementation Details

Our implementation is based on the pre-trained language models provided by the Transformers library. We use RoBERTa-base for generating graph node features, for tokenization, and for fine-tuning the question answering model. In the graph construction step, we use spaCy to recognize entities in sentences and questions. According to the statistics of HotpotQA, 80% of the questions require 3 supporting sentences, so we use topN = 3 as one of the baseline settings. In the fine-tuning step, we compare three different methods suggested by T5 (Raffel et al. (2019)) and use the single-MLP head as the baseline setting.

Fine-tuning strategies

We use three kinds of fine-tuning strategies. First, we follow the same fine-tuning procedure as Devlin et al. (2019) and create a start vector and an end vector during fine-tuning, which amounts to a single-layer MLP. Second, we consider a method named "gradual unfreezing" (Howard & Ruder (2018)). In gradual unfreezing, more and more of the model's parameters are fine-tuned over time. Starting from the last layer (in the LM, a self-attention block is considered the minimum unit), model components are unfrozen step by step, each after training for a certain number of updates. In practice, we notice that both BERT (Devlin et al. (2019)) and XLNet (Yang et al. (2019)) augment their training data with additional QA data sets, whereas we only fine-tune using the provided SQuAD training data. The third method, "adapter layers" (Houlsby et al. (2019), Bapna & Firat (2019)), is motivated by the goal of keeping most of the original model fixed while fine-tuning. Adapter layers are additional MLP blocks (dense, activation function, dense) that are added after each of the preexisting feed-forward networks in each block of the Transformer. During training, only these MLP blocks are updated. Therefore, the only hyperparameter is the hidden dimension of the linear layer at the bottom of the MLP blocks.

5.2 Results

Model | Ans EM | Ans F1 | Sup EM | Sup F1 | Joint EM | Joint F1
Baseline Model (Yang et al. (2018)) | 45.60 | 59.02 | 20.32 | 64.49 | 10.83 | 40.16
DFGN (Qiu et al. (2019)) | 56.31 | 69.69 | 51.50 | 81.62 | 33.62 | 59.82
HGN (Fang et al. (2019)) | 66.07 | 79.36 | 60.33 | 87.33 | 43.57 | 71.03
Two stage model (ours) | 2 | 3 | 4 | 5 | 6 | 7
Table 1: Results on the test set of HotpotQA in the Distractor setting.

In Table 1, we compare the performance of different models on the leaderboard. Our method improves by more than xx% and xx% absolute in terms of joint EM and F1 scores over the baseline model. Compared to the original HGN work, our model …

stage 1 model

HGN Sup. precision Sup. recall Sup. F1
base model (topN=3) xx xx xx
large model (topN=3) xx xx xx
base model (topN=4) xx xx xx
large model (topN=4) xx xx xx
Table 2: Results of model 1 on the dev set of HotpotQA.

In Table 2, we report the performance of HGN. The model reaches xx precision and xx recall, which suggests that graph neural networks have great potential for modeling reasoning relationships. We also set the topN parameter to 4 for comparison, in which case 4 sentences are predicted as supporting sentences. In the two-stage framework, we do not have to change stage 2, as it is able to find the answer span as long as it appears in the topN sentences.

stage 2 model

LM/strategy Ans. precision (EM) Ans. recall Ans. F1
single MLP:
 BERT-base xx xx xx
 RoBERTa-base xx xx xx
unfreeze last 1 layer:
 BERT-base xx xx xx
 RoBERTa-base xx xx xx
unfreeze last 2 layer:
 BERT-base xx xx xx
 RoBERTa-base xx xx xx
adapter layer with d=16:
 BERT-base xx xx xx
 RoBERTa-base xx xx xx
adapter layer with d=32:
 BERT-base xx xx xx
 RoBERTa-base xx xx xx
Table 3: Results of model 2 on the dev set of HotpotQA.

The performance of the second model is shown in Table 3. The base model reaches xx precision and xx recall. (add more)

5.3 Ablation studies

In order to better understand how the performance is affected by the different modules, we conduct several ablation studies on the development set. The ablation test on the LM is the same as the comparison of different heads for the LM fine-tuning task, which has been studied in section xx. Therefore, in this section, we focus on model 1.

If we remove the edge types and treat all edges equally, the accuracy and recall drop by xx and xx, respectively. This indicates that edge type information is important for the GNN.

if … the acc degrades by xx…

5.4 Results analysis

To investigate the model more deeply, we analyze the results by reasoning type on the development set. Every question belongs to one category, either "bridge" or "comparison", which is provided in the data set. "Bridge" means that answering the question requires reading multiple sentences connected by at least one "bridge entity". "Comparison" means that the answer is inferred by comparing attributes of different entities. We calculate the joint EM and F1 for each category and compare our model with the baseline model and the DFGN model under these two reasoning types.

In Table 4 …

6 Conclusion


