Self-supervised Knowledge Triplet Learning for Zero-shot Question Answering

05/01/2020 · Pratyay Banerjee, et al.

The aim of all Question Answering (QA) systems is to be able to generalize to unseen questions. Most current methods rely on learning every possible scenario, which requires expensive data annotation. Moreover, such annotations can introduce unintended bias, which makes systems focus more on the bias than on the actual task. In this work, we propose Knowledge Triplet Learning, a self-supervised task over knowledge graphs. We propose methods for using such a model to perform zero-shot QA, and our experiments show considerable improvements over large pre-trained generative models.


1 Introduction

The ability to understand natural language and answer questions is one of the core focuses of natural language processing. To measure and study the different aspects of question answering, several datasets have been developed, such as SQuAD Rajpurkar et al. (2018), HotpotQA Yang et al. (2018), and Natural Questions Kwiatkowski et al. (2019), which require systems to perform extractive question answering. On the other hand, datasets such as SocialIQA Sap et al. (2019b), CommonsenseQA Talmor et al. (2018), Swag Zellers et al. (2018), and Winogrande Sakaguchi et al. (2019) require systems to choose the correct answer from a given set. These multiple-choice question answering datasets are very challenging, but recent large pre-trained language models such as BERT Devlin et al. (2018), XLNET Yang et al. (2019), and RoBERTa Liu et al. (2019b) have shown very strong performance on them. Moreover, as shown in Winogrande Sakaguchi et al. (2019), acquiring unbiased labels requires a "carefully designed crowdsourcing procedure", which adds to the cost of data annotation. This is also quantified in other natural language tasks such as Natural Language Inference Gururangan et al. (2018) and Argument Reasoning Comprehension Niven and Kao (2019), where such annotation artifacts lead to a "Clever Hans effect" in the models Kaushik and Lipton (2018); Poliak et al. (2018).

Figure 1: Knowledge Triplet Learning framework: given a triple (h, r, t), we learn to generate one of the elements given the other two.

One way to resolve this is to design and create datasets carefully, as in Winogrande Sakaguchi et al. (2019); the other is to ignore the data annotations and build systems that perform unsupervised question answering Teney and Hengel (2016); Lewis et al. (2019). In this paper, we focus on building unsupervised zero-shot multiple-choice question answering systems.

The task of unsupervised question answering is very challenging. Recent work Fabbri et al. (2020); Lewis et al. (2019) tries to generate a synthetic dataset from a text corpus such as Wikipedia to solve extractive QA. Other work Bosselut and Choi (2019); Shwartz et al. (2020) uses large pre-trained generative language models such as GPT-2 Radford et al. (2019) to generate knowledge, questions, and answers, and compares them against the given choices.

In this work, we utilize the information present in Knowledge Graphs such as ATOMIC Sap et al. (2019a) and ConceptNet Liu and Singh (2004) and define a new task, Knowledge Triplet Learning. Knowledge Triplet Learning is similar to Knowledge Representation Learning but not limited to it. Knowledge Representation Learning Lin et al. (2018) learns low-dimensional projected and distributed representations of the entities and relations defined in a knowledge graph. As shown in Figure 1, we define a triple (h, r, t), and given any two of its elements we try to recover the third. This forces the system to learn all the possible relations between the three inputs. We map the question answering task to Knowledge Triplet Learning by mapping the context, question, and answer to h, r, and t respectively. We define two different ways to perform self-supervised Knowledge Triplet Learning: the task can be designed as a representation generation task or as a language modeling task, and we compare both strategies in this work. We show how to use models trained on this task to perform zero-shot question answering without any additional knowledge or supervision. We also show that models pre-trained on this task perform considerably well compared to strong pre-trained language models in few-shot learning. We evaluate our approach on the SocialIQA dataset.

The contributions of this paper are summarized as follows:


  • We define the Knowledge Triplet Learning task over Knowledge Graphs and show how to use it for zero-shot question answering.

  • We compare two strategies for the above task.

  • We achieve state-of-the-art results for zero-shot question answering and propose a strong baseline for the few-shot question answering task.

2 Knowledge Triplet Learning

We define the task of Knowledge Triplet Learning (KTL) in this section. We define G = (V, E) as a Knowledge Graph, where V is the set of vertices and E is the set of edges. V consists of entities, which can be phrases or named entities depending on the given input Knowledge Graph. Let T be a set of fact triples with the format (h, r, t), where h and t belong to the set of vertices V and r belongs to the set of edges E. Here h and t indicate the head and tail entities, whereas r indicates the relation between these entities.

For example, from the ATOMIC knowledge graph, ("PersonX puts PersonX's trust in PersonY", "How is PersonX seen as?", "faithful") is one such triple. Here the head is "PersonX puts PersonX's trust in PersonY", the relation is "How is PersonX seen as?", and the tail is "faithful". Note that V does not contain homogeneous entities, i.e., both "faithful" and "PersonX puts PersonX's trust in PersonY" belong to V.

We define the task of KTL as follows: given an input triple (h, r, t), we learn the following three functions:

f_t(h, r) → t,    f_h(r, t) → h,    f_r(h, t) → r    (1)

That is, each function learns to generate one component of the triple given the other two. The intuition behind learning these three functions is as follows. Take the above example: ("PersonX puts PersonX's trust in PersonY", "How is PersonX seen as?", "faithful"). The first function, f_t, learns to generate the answer given the context and the question. The second function, f_h, learns to generate a context in which the question and the answer may be valid. The final function, f_r, is a Jeopardy-style task: generating the question that connects the context and the answer.
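To make the construction concrete, here is a minimal sketch (in Python, with illustrative names; not the authors' code) of how one ATOMIC-style triple yields three self-supervised training instances, one per function in equation (1):

```python
# Illustrative only: one fact triple produces three (inputs, target) instances,
# matching the three functions f_t, f_h, f_r of equation (1).
triple = ("PersonX puts PersonX's trust in PersonY",  # h: context / event
          "How is PersonX seen as?",                   # r: question / relation
          "faithful")                                   # t: answer / tail
h, r, t = triple

instances = [
    {"inputs": (h, r), "target": t},  # f_t: generate the answer
    {"inputs": (r, t), "target": h},  # f_h: generate a plausible context
    {"inputs": (h, t), "target": r},  # f_r: Jeopardy-style question generation
]
```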

In multiple-choice QA, given the context, two answer choices may be true for two different questions. Similarly, given the question, two answer choices may be true for two different contexts. For example, given the context "PersonX puts PersonX's trust in PersonY", the answers "PersonX is considered trustworthy by others" and "PersonX is polite" are true for the two different questions "How does this affect others?" and "How is PersonX seen as?". Learning these three functions enables us to score these relations between the context, question, and answers.

2.1 Using KTL to perform QA

After learning these functions in a self-supervised way, we can use them to perform question answering. Given a triple (C, Q, A), we define the following scoring function:

score(C, Q, A) = D(f_t(C, Q), A) + D(f_h(Q, A), C) + D(f_r(C, A), Q)    (2)

where C is the context, Q is the question, and A is one of the answer options. D is a distance function that measures the distance between the generated output and the ground truth. The distance function varies depending on the instantiation of the framework, which we study in the following sections. The final answer is selected as:

A* = argmin_{A_i} score(C, Q, A_i)    (3)

Since the scores are distances from the ground truth, we select the answer with the minimum score.
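A hedged sketch of how equations (2) and (3) can be applied at inference time; f_t, f_h, f_r and the distance D stand for whichever instantiation (KRL or SMLM) is used, and the function names here are ours:

```python
# Zero-shot answer selection with the KTL score (equations 2 and 3).
def ktl_score(C, Q, A, f_t, f_h, f_r, D):
    """Sum of distances between each generated element and its ground truth."""
    return D(f_t(C, Q), A) + D(f_h(Q, A), C) + D(f_r(C, A), Q)

def select_answer(C, Q, options, f_t, f_h, f_r, D):
    # Scores are distances to the ground truth, so the best answer is the minimum.
    return min(options, key=lambda A: ktl_score(C, Q, A, f_t, f_h, f_r, D))
```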

In the following sections, we define the different ways we can implement this framework.

2.2 Knowledge Representation Learning

In this implementation, we use Knowledge Representation Learning to learn equation (1). In contrast to standard Knowledge Representation Learning, where systems try to learn a score function f(h, r, t) that indicates whether the fact triple (h, r, t) is true or false, in this work we learn to generate the input vector representations, i.e., f_t(h, r) → v_t. We can view the functions in equation (1) as generator functions which, given two inputs, learn to generate a vector representation of the third. As our triples can have many-to-many relations between each pair, we first project the two inputs from the input encoding space to a different space, similar to the work of TransD Ji et al. (2015). We use a Transformer encoder Enc to encode our triples into the encoding space. We learn two projection functions, p_1 and p_2, to project the two inputs, and a third projection function p_3 to project the entity to be generated. We combine the two projected inputs using a function g. These functions can be implemented using feedforward networks. For f_t, for example:

v_t = g(p_1(Enc(h)), p_2(Enc(r))),    u_t = p_3(Enc(t))

where Enc(x) is the input encoding, v_t is the generated output vector, and u_t is the projected ground-truth vector. The projection and combination functions are learned using fully connected networks. In our implementation, we use RoBERTa as the Transformer encoder, with the output representation of the start token as the phrase representation.
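A rough PyTorch sketch of this KRL instantiation under the assumptions stated above (RoBERTa-large encoder, start-token representation, feedforward projections); class and variable names are ours and this is not the released implementation:

```python
import torch
import torch.nn as nn
from transformers import RobertaModel

class KRLModel(nn.Module):
    """Encode each element with RoBERTa, project the two inputs, combine them,
    and project the target element so the two vectors can be compared."""
    def __init__(self, dim=1024):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-large")
        self.p1 = nn.Linear(dim, dim)   # projection of the first input
        self.p2 = nn.Linear(dim, dim)   # projection of the second input
        self.p3 = nn.Linear(dim, dim)   # projection of the target element
        self.g = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                               nn.Linear(dim, dim))  # combination function

    def encode(self, input_ids, attention_mask):
        # Start-token output is used as the phrase representation.
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return out.last_hidden_state[:, 0]

    def forward(self, in1, in2, target):
        # Each argument is a dict with "input_ids" and "attention_mask".
        z1 = self.p1(self.encode(**in1))
        z2 = self.p2(self.encode(**in2))
        generated = self.g(torch.cat([z1, z2], dim=-1))   # generated vector v
        projected = self.p3(self.encode(**target))        # projected ground truth u
        return generated, projected
```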

We train this model using two types of loss functions: L2Loss, where we minimize the L2 norm between the generated vector and the projected ground truth, and Noise Contrastive Estimation Gutmann and Hyvärinen (2010), where along with the ground truth we use noise samples. These noise samples are selected from other triples such that the target output does not form another true fact triple, i.e., (h, r, t') is false. The NCELoss is defined as:

L_NCE = -log [ exp(sim(v, u)) / ( exp(sim(v, u)) + Σ_{j=1..K} exp(sim(v, u'_j)) ) ]

where u'_j are the projected noise samples, sim is the similarity function, which can be the L2 norm or cosine similarity, v is the generated output vector, and u is the projected ground-truth vector.

The distance function D in equation (2) for such a model is defined by the distance used in the loss function: for L2Loss it is the L2 norm, and in the case of NCELoss we use the sim function.
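A minimal PyTorch sketch of the NCE-style objective, assuming cosine similarity and the softmax form written above; `generated` is the generated vector, `target` the projected ground truth, and `noise` a (K, dim) tensor of projected noise samples (names are ours):

```python
import torch
import torch.nn.functional as F

def nce_loss(generated, target, noise):
    """generated, target: (dim,) tensors; noise: (K, dim) tensor."""
    pos = F.cosine_similarity(generated.unsqueeze(0), target.unsqueeze(0))      # (1,)
    neg = F.cosine_similarity(generated.unsqueeze(0).expand_as(noise), noise)   # (K,)
    logits = torch.cat([pos, neg]).unsqueeze(0)                                  # (1, K+1)
    # Cross-entropy with the true pair at index 0 is the softmax form of the loss.
    return F.cross_entropy(logits, torch.zeros(1, dtype=torch.long))
```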

2.3 Span Masked Language Modeling

In Span Masked Language Modeling (SMLM), we model equation (1) as a masked language modeling task. We tokenize and concatenate the triple with a separator token between the elements, i.e., h [SEP] r [SEP] t. For the function f_t(h, r), we mask all the tokens present in t. We feed these tokens to a Transformer encoder and use a feed-forward network to unmask the sequence of tokens. Similarly, we mask h to learn f_h(r, t) and r to learn f_r(h, t).

We train the same Transformer encoder to perform all three functions. We use the cross-entropy loss to train the model:

L_CE = - Σ_{i ∈ M(t)} log P(w_i | h, r, t_masked)

where P(w_i | h, r, t_masked) is the masked language modeling probability of the token w_i, given the unmasked tokens of h and r and the other masked tokens in t, and M(t) is the set of masked positions. Do note that we do not perform progressive unmasking, i.e., all the masked tokens are predicted jointly.

The distance function D in equation (2) for this model is the same as the loss function defined above.
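A hedged sketch of the SMLM distance with Python/transformers, under the assumptions above (RoBERTa masked LM, the whole answer span masked jointly in one pass); the exact tokenization and separator handling in the original implementation may differ:

```python
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
model = RobertaForMaskedLM.from_pretrained("roberta-large").eval()

def smlm_distance(context, question, answer):
    """Cross-entropy of the jointly masked answer tokens given context and question."""
    sep, cls, mask = tokenizer.sep_token_id, tokenizer.cls_token_id, tokenizer.mask_token_id
    ctx = tokenizer(context, add_special_tokens=False)["input_ids"]
    qst = tokenizer(question, add_special_tokens=False)["input_ids"]
    ans = tokenizer(answer, add_special_tokens=False)["input_ids"]
    ids = [cls] + ctx + [sep] + qst + [sep] + ans + [sep]
    input_ids = torch.tensor([ids])
    labels = torch.full_like(input_ids, -100)          # -100 = ignored in the loss
    start = len(ids) - len(ans) - 1                    # position of the answer span
    labels[0, start:start + len(ans)] = torch.tensor(ans)
    input_ids[0, start:start + len(ans)] = mask        # mask all answer tokens at once
    with torch.no_grad():
        loss = model(input_ids=input_ids, labels=labels).loss
    return loss.item()
```

The same routine, with the question or the context masked instead, would give the other two terms of the score.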

3 Datasets

3.1 SocialIQA

To study our framework, we evaluate it on SocialIQA. This dataset is about reasoning over social interactions and the implications of social events. Each instance in this dataset contains a context, which is a social situation, a question about this situation, and three answer options. There are several question types derived from the different ATOMIC inference dimensions, such as Intent, Effect, Attributes, etc. There are 33,410 training samples and 1,954 validation samples, with a withheld test set.

Though this dataset is derived from the ATOMIC knowledge graph, the fact triples present in the graph are considerably different from the contexts, questions, and answers present in the dataset, as the latter are crowdsourced. The average length of an event description in ATOMIC is 10, with a maximum of 18, whereas in SocialIQA the average length is 36 and the maximum is 124. This shows the varied types of questions present in SocialIQA, which poses a much greater challenge for unsupervised learning.

4 Experiments

4.1 Baselines

We compare our models to three strong baselines. The first is a pre-trained language model, GPT-2 (all sizes), scored using the language modeling cross-entropy loss: we concatenate the context and the question, compute the cross-entropy loss for each of the answer choices, and choose the answer with the minimum loss. The second baseline is another pre-trained language model, RoBERTa-large, for which we follow the same Span Masked Language Model (SMLM) scoring using the pre-trained RoBERTa model. For the third baseline, we finetune the RoBERTa-large model using the original Masked Language Modeling task over our concatenated fact triples h [SEP] r [SEP] t.
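For concreteness, a sketch of the GPT-2 baseline scoring, under one reasonable reading of the description above (the language modeling loss of the full concatenation is compared across options); function names are ours:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-large")
model = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()

def gpt2_answer(context, question, options):
    """Return the option whose concatenation has the lowest LM cross-entropy."""
    losses = []
    for option in options:
        enc = tokenizer(f"{context} {question} {option}", return_tensors="pt")
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])  # shifted LM loss
        losses.append(out.loss.item())
    return options[losses.index(min(losses))]
```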

4.2 KTL Training

We train the Knowledge Representation Learning (KRL) model using both L2Loss and NCELoss. For NCELoss, we also train with both the L2 norm and cosine similarity. Both the KRL and SMLM models use RoBERTa-large as the Transformer encoder. We train with the following hyper-parameters: batch sizes {16, 32}; learning rate in the range [1e-5, 5e-5]; warm-up proportion in the range [0, 0.1]. We use the transformers package Wolf et al. (2019). From the ATOMIC knowledge graph, we generate 595,595 unique triplets, all of which are positive facts, and we learn using these triplets. For NCE, we choose K equal to 10, i.e., 10 negative samples. We perform 3 hyper-parameter trials for each model and train models with 3 different seeds [0, 21, 42].
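A small sketch of how the noise samples for NCE might be drawn, following the constraint above that a sampled target must not complete another true fact triple; the helper name and the K = 10 default are ours:

```python
import random

def sample_noise_tails(h, r, all_triples, true_facts, K=10):
    """Draw K tails from other triples such that (h, r, tail) is not a true fact."""
    noise = []
    while len(noise) < K:
        candidate = random.choice(all_triples)[2]   # tail of a randomly chosen triple
        if (h, r, candidate) not in true_facts:     # reject accidental true triples
            noise.append(candidate)
    return noise
```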

Type           | Model        | Train-Val | Val  | Deviation
Baselines      | Majority     | 33.5      | 33.6 |
               | GPT2-L       | 41.3      | 41.1 |
               | RoBERTa      | 33.6      | 33.4 |
               | RoBERTa-MLM  | 36.4      | 34.3 | +/-0.8
Shwartz et al. | Self-Talk    | 46.5      | 46.2 |
Ours           | KRL-L2       | 43.2      | 43.8 | +/-0.3
               | KRL-NCE-L2   | 46.4      | 46.2 | +/-0.2
               | KRL-NCE-Cos  | 46.6      | 46.4 | +/-0.2
               | SMLM         | 48.7      | 48.5 | +/-0.4
Supervised     | Sup RoBERTa  |           | 76.6 | +/-0.6
               | Human        |           | 86.9 |
Table 1: Accuracy comparison with our baseline models on the SocialIQA dataset. We compare the models on the Zero-shot task. We compare them on both Train-Val split (35k) and the Validation split (2k), to enable measuring better statistical significance.

5 Discussion

Table 1 shows our evaluation and baseline comparisons for the zero-shot task. We observe that our KTL-trained models perform significantly better than the baselines. When comparing the different KRL models, NCELoss with cosine similarity performs the best. This might be due to the additional supervision provided by the negative samples, as the L2Loss model only tries to minimize the distance between the two projections. When comparing the different KTL instantiations, we see that the SMLM model performs the best overall but has a slightly higher deviation; we are analyzing our model to understand this phenomenon better. Our KRL model performs on par with the current state-of-the-art model, Self-Talk Shwartz et al. (2020), which uses two GPT-type models; our models have half the parameters of Self-Talk. We also compare our models on the entire Train-Val set of 35,364 questions in the zero-shot setting to better gauge the statistical significance of the different model accuracies.

Model Val Deviation
RoBERTa 44.4 +/-1.2
RoBERTa+MLM 46.8 +/-0.6
KRL-NCE-Cos 58.6 +/-0.8
SMLM 69.1 +/-0.4
Table 2: Accuracy comparison on few-shot question answering. We re-use our Transformer encoders for the few-shot task with only 8% of the training samples.

Table 2 compares the different pre-trained Transformer encoders on the few-shot question answering task. We randomly sample three sets of 2,400 samples from the training set as training data, train our models on these three sets, and measure the validation set accuracy. The Transformer encoders trained with KTL perform significantly better than the baseline models in this setting, which shows that encoders trained on KTL are able to learn from only a few samples. We plan to continue analyzing our models and to evaluate the KTL framework on additional datasets.
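A small sketch of one way the few-shot sampling could be set up; whether the three subsets are tied to the training seeds mentioned earlier is our assumption, and the helper name is illustrative:

```python
import random

def few_shot_splits(train_set, n=2400, seeds=(0, 21, 42)):
    """Return three random subsets of n training samples, one per seed."""
    return [random.Random(seed).sample(train_set, n) for seed in seeds]
```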

6 Related Work

6.1 Unsupervised Question Answering

Recent work on unsupervised question answering approaches the problem in two ways: as a domain adaptation or transfer learning problem Chung et al. (2018), or as a data augmentation problem Yang et al. (2017); Dhingra et al. (2018); Wang et al. (2018); Alberti et al. (2019). The work of Lewis et al. (2019); Fabbri et al. (2020); Puri et al. (2020) uses style transfer or template-based generation of question, context, and answer triples and learns from these to perform unsupervised extractive question answering. Another line of work learns generative models such as GPT-2 Radford et al. (2019) that generate the answer given a question, or clarifying explanations and/or questions, to perform unsupervised question answering Shwartz et al. (2020); Bosselut and Choi (2019); Bosselut et al. (2019). In contrast, our work focuses on learning from knowledge graphs and generates vector representations or sequences of tokens, not restricted to the answer but also including the context and the question, using the masked language modeling objective.

6.2 Use of External Knowledge for Question Answering

There are several approaches to adding external knowledge to models to improve question answering. Broadly, they can be classified into two categories: learning from unstructured knowledge and learning from structured knowledge. In learning from unstructured knowledge, recent large pre-trained language models Peters et al. (2018); Radford et al. (2019); Devlin et al. (2018); Liu et al. (2019b); Clark et al. (2020); Lan et al. (2019); Joshi et al. (2020); Bosselut et al. (2019) learn general-purpose text encoders from a huge text corpus. On the other hand, learning from structured knowledge includes learning knowledge-enriched word embeddings from structured knowledge bases Yang and Mitchell (2017); Bauer et al. (2018); Mihaylov and Frank (2018); Wang and Jiang (2019); Sun et al. (2019). Using structured knowledge to refine pre-trained contextualized representations learned from unstructured knowledge is another approach Peters et al. (2019); Yang et al. (2019); Zhang et al. (2019); Liu et al. (2019a).

Another approach to using external knowledge involves retrieving knowledge sentences from a text corpus Mitra et al. (2019); Banerjee et al. (2019); Baral et al. (2020); Das et al. (2019); Chen et al. (2017); Lee et al. (2019); Banerjee and Baral (2020); Banerjee (2019) or knowledge triples from knowledge bases Min et al. (2019); Wang et al. (2020) that are useful for answering a specific question. In our work, we use knowledge graphs to learn a self-supervised generative task in order to perform zero-shot multiple-choice QA.

6.3 Knowledge Representation Learning

Over the years, several methods have been proposed for knowledge representation learning, i.e., embedding the entities and relations of knowledge graphs in a low-dimensional continuous vector space. We mention a few of them here: TransE Bordes et al. (2013), which views relations as a translation vector between head and tail entities; TransH Wang et al. (2014), which overcomes TransE's inability to model complex relations; and TransD Ji et al. (2015), which aims to reduce the number of parameters by proposing two different mapping matrices for head and tail entities. For a more detailed overview, we refer to the survey by Lin et al. (2018). KRL has been used in various ways to generate natural answers Yin et al. (2016); He et al. (2017) and to generate factoid questions Serban et al. (2016). In our work, we modify TransD and adapt it to our KTL framework to perform zero-shot QA.

7 Conclusion

In this work, we propose Knowledge Triplet Learning, a new self-supervised framework over Knowledge Graphs. We show that learning all three possible functions, f_t, f_h, and f_r, helps the model perform zero-shot multiple-choice question answering. We learn from the ATOMIC knowledge graph and evaluate our framework on the SocialIQA dataset. Our framework achieves state-of-the-art results on the zero-shot question answering task and sets a strong baseline for the few-shot question answering task.

References

  • C. Alberti, D. Andor, E. Pitler, J. Devlin, and M. Collins (2019) Synthetic QA corpora generation with roundtrip consistency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6168–6173. External Links: Link, Document Cited by: §6.1.
  • P. Banerjee and C. Baral (2020) Knowledge fusion and semantic knowledge ranking for open domain question answering. arXiv preprint arXiv:2004.03101. Cited by: §6.2.
  • P. Banerjee, K. K. Pal, A. Mitra, and C. Baral (2019) Careful selection of knowledge to solve open book question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6120–6129. External Links: Link, Document Cited by: §6.2.
  • P. Banerjee (2019) ASU at TextGraphs 2019 shared task: explanation ReGeneration using language models and iterative re-ranking. In Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13), Hong Kong, pp. 78–84. External Links: Link, Document Cited by: §6.2.
  • C. Baral, P. Banerjee, K. K. Pal, and A. Mitra (2020) Natural language qa approaches using reasoning with external knowledge. arXiv preprint arXiv:2003.03446. Cited by: §6.2.
  • L. Bauer, Y. Wang, and M. Bansal (2018) Commonsense for generative multi-hop question answering tasks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4220–4230. External Links: Link, Document Cited by: §6.2.
  • A. Bordes, N. Usunier, A. Garcia-Duran, J. Weston, and O. Yakhnenko (2013) Translating embeddings for modeling multi-relational data. In Advances in neural information processing systems, pp. 2787–2795. Cited by: §6.3.
  • A. Bosselut and Y. Choi (2019) Dynamic knowledge graph construction for zero-shot commonsense question answering. arXiv preprint arXiv:1911.03876. Cited by: §1, §6.1.
  • A. Bosselut, H. Rashkin, M. Sap, C. Malaviya, A. Celikyilmaz, and Y. Choi (2019) Comet: commonsense transformers for automatic knowledge graph construction. arXiv preprint arXiv:1906.05317. Cited by: §6.1, §6.2.
  • D. Chen, A. Fisch, J. Weston, and A. Bordes (2017) Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1870–1879. External Links: Link, Document Cited by: §6.2.
  • Y. Chung, H. Lee, and J. Glass (2018) Supervised and unsupervised transfer learning for question answering. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1585–1594. External Links: Link, Document Cited by: §6.1.
  • K. Clark, M. Luong, Q. V. Le, and C. D. Manning (2020) Electra: pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555. Cited by: §6.2.
  • R. Das, S. Dhuliawala, M. Zaheer, and A. McCallum (2019) Multi-step retriever-reader interaction for scalable open-domain question answering. arXiv preprint arXiv:1905.05733. Cited by: §6.2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1, §6.2.
  • B. Dhingra, D. Danish, and D. Rajagopal (2018) Simple and effective semi-supervised question answering. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 582–587. External Links: Link, Document Cited by: §6.1.
  • A. R. Fabbri, P. Ng, Z. Wang, R. Nallapati, and B. Xiang (2020) Template-based question generation from retrieved sentences for improved unsupervised question answering. arXiv preprint arXiv:2004.11892. Cited by: §1, §6.1.
  • S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. Bowman, and N. A. Smith (2018) Annotation artifacts in natural language inference data. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 107–112. Cited by: §1.
  • M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304. Cited by: §2.2.
  • S. He, C. Liu, K. Liu, and J. Zhao (2017) Generating natural answers by incorporating copying and retrieving mechanisms in sequence-to-sequence learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 199–208. Cited by: §6.3.
  • G. Ji, S. He, L. Xu, K. Liu, and J. Zhao (2015) Knowledge graph embedding via dynamic mapping matrix. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 687–696. Cited by: §2.2, §6.3.
  • M. Joshi, K. Lee, Y. Luan, and K. Toutanova (2020) Contextualized representations using textual encyclopedic knowledge. External Links: 2004.12006 Cited by: §6.2.
  • D. Kaushik and Z. C. Lipton (2018) How much reading does reading comprehension require? a critical investigation of popular benchmarks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 5010–5015. Cited by: §1.
  • T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al. (2019) Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, pp. 453–466. Cited by: §1.
  • Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut (2019) ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942. Cited by: §6.2.
  • K. Lee, M. Chang, and K. Toutanova (2019) Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6086–6096. External Links: Link, Document Cited by: §6.2.
  • P. Lewis, L. Denoyer, and S. Riedel (2019) Unsupervised question answering by cloze translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4896–4910. External Links: Link, Document Cited by: §1, §1, §6.1.
  • Y. Lin, X. Han, R. Xie, Z. Liu, and M. Sun (2018) Knowledge representation learning: a quantitative review. arXiv preprint arXiv:1812.10901. Cited by: §1, §6.3.
  • H. Liu and P. Singh (2004) ConceptNet — a practical commonsense reasoning tool-kit. BT Technology Journal 22, pp. 211–226. Cited by: §1.
  • W. Liu, P. Zhou, Z. Zhao, Z. Wang, Q. Ju, H. Deng, and P. Wang (2019a) K-bert: enabling language representation with knowledge graph. arXiv preprint arXiv:1909.07606. Cited by: §6.2.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019b) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: §1, §6.2.
  • T. Mihaylov and A. Frank (2018) Knowledgeable reader: enhancing cloze-style reading comprehension with external commonsense knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 821–832. External Links: Link, Document Cited by: §6.2.
  • S. Min, D. Chen, L. Zettlemoyer, and H. Hajishirzi (2019) Knowledge guided text retrieval and reading for open domain question answering. arXiv preprint arXiv:1911.03868. Cited by: §6.2.
  • A. Mitra, P. Banerjee, K. K. Pal, S. Mishra, and C. Baral (2019) Exploring ways to incorporate additional knowledge to improve natural language commonsense question answering. arXiv preprint arXiv:1909.08855. Cited by: §6.2.
  • T. Niven and H. Kao (2019) Probing neural network comprehension of natural language arguments. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 4658–4664. Cited by: §1.
  • M. E. Peters, M. Neumann, R. Logan, R. Schwartz, V. Joshi, S. Singh, and N. A. Smith (2019) Knowledge enhanced contextual word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 43–54. External Links: Link, Document Cited by: §6.2.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237. Cited by: §6.2.
  • A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, and B. Van Durme (2018) Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pp. 180–191. Cited by: §1.
  • R. Puri, R. Spring, M. Patwary, M. Shoeybi, and B. Catanzaro (2020) Training question answering models from synthetic data. arXiv preprint arXiv:2002.09599. Cited by: §6.1.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. OpenAI Blog 1 (8), pp. 9. Cited by: §1, §6.1, §6.2.
  • P. Rajpurkar, R. Jia, and P. Liang (2018) Know what you don’t know: unanswerable questions for squad. arXiv preprint arXiv:1806.03822. Cited by: §1.
  • K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi (2019) WINOGRANDE: an adversarial winograd schema challenge at scale. arXiv preprint arXiv:1907.10641. Cited by: §1, §1.
  • M. Sap, R. L. Bras, E. Allaway, C. Bhagavatula, N. Lourie, H. Rashkin, B. Roof, N. A. Smith, and Y. Choi (2019a) ATOMIC: an atlas of machine commonsense for if-then reasoning. ArXiv abs/1811.00146. Cited by: §1.
  • M. Sap, H. Rashkin, D. Chen, R. LeBras, and Y. Choi (2019b) Socialiqa: commonsense reasoning about social interactions. arXiv preprint arXiv:1904.09728. Cited by: §1.
  • I. V. Serban, A. Garcia-Duran, C. Gulcehre, S. Ahn, S. Chandar, A. Courville, and Y. Bengio (2016) Generating factoid questions with recurrent neural networks: the 30M factoid question-answer corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 588–598. Cited by: §6.3.
  • V. Shwartz, P. West, R. L. Bras, C. Bhagavatula, and Y. Choi (2020) Unsupervised commonsense question answering with self-talk. arXiv preprint arXiv:2004.05483. Cited by: §1, Table 1, §5, §6.1.
  • H. Sun, T. Bedrax-Weiss, and W. Cohen (2019) PullNet: open domain question answering with iterative retrieval on knowledge bases and text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2380–2390. External Links: Link, Document Cited by: §6.2.
  • A. Talmor, J. Herzig, N. Lourie, and J. Berant (2018) Commonsenseqa: a question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937. Cited by: §1.
  • D. Teney and A. v. d. Hengel (2016) Zero-shot visual question answering. arXiv preprint arXiv:1611.05546. Cited by: §1.
  • C. Wang and H. Jiang (2019) Explicit utilization of general knowledge in machine reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2263–2272. External Links: Link, Document Cited by: §6.2.
  • L. Wang, S. Li, W. Zhao, K. Shen, M. Sun, R. Jia, and J. Liu (2018) Multi-perspective context aggregation for semi-supervised cloze-style reading comprehension. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 857–867. External Links: Link Cited by: §6.1.
  • R. Wang, D. Tang, N. Duan, Z. Wei, X. Huang, C. Cao, D. Jiang, M. Zhou, et al. (2020) K-adapter: infusing knowledge into pre-trained models with adapters. arXiv preprint arXiv:2002.01808. Cited by: §6.2.
  • Z. Wang, J. Zhang, J. Feng, and Z. Chen (2014) Knowledge graph embedding by translating on hyperplanes. In Twenty-Eighth AAAI Conference on Artificial Intelligence. Cited by: §6.3.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771. Cited by: §4.2.
  • A. Yang, Q. Wang, J. Liu, K. Liu, Y. Lyu, H. Wu, Q. She, and S. Li (2019) Enhancing pre-trained language representations with rich knowledge for machine reading comprehension. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2346–2357. External Links: Link, Document Cited by: §6.2.
  • B. Yang and T. Mitchell (2017) Leveraging knowledge bases in LSTMs for improving machine reading. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1436–1446. External Links: Link, Document Cited by: §6.2.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) Xlnet: generalized autoregressive pretraining for language understanding. In Advances in neural information processing systems, pp. 5754–5764. Cited by: §1.
  • Z. Yang, J. Hu, R. Salakhutdinov, and W. Cohen (2017) Semi-supervised QA with generative domain-adaptive nets. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1040–1050. External Links: Link, Document Cited by: §6.1.
  • Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning (2018) Hotpotqa: a dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600. Cited by: §1.
  • J. Yin, X. Jiang, Z. Lu, L. Shang, H. Li, and X. Li (2016) Neural generative question answering. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 2972–2978. Cited by: §6.3.
  • R. Zellers, Y. Bisk, R. Schwartz, and Y. Choi (2018) Swag: a large-scale adversarial dataset for grounded commonsense inference. arXiv preprint arXiv:1808.05326. Cited by: §1.
  • Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun, and Q. Liu (2019) ERNIE: enhanced language representation with informative entities. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1441–1451. External Links: Link, Document Cited by: §6.2.
