Reading comprehension is a challenging task which requires deep understanding of natural language. The cloze test is a particular form of reading comprehension: given a passage with blanks, an examinee is required to fill in the missing word (or phrase) that best fits the context surrounding each blank. Recently, cloze-style reading comprehension has drawn growing interest from the NLP research community, since such a task meets practical requirements and is relatively easy to design.
Much of this interest centers on large-scale datasets, which are automatically constructed by randomly or periodically deleting a word from an original passage. Though the automatically generated datasets usually consist of a large quantity of labeled data and make it possible to train large neural network models, they are by nature far removed from real-world language understanding problems and have serious ambiguity issues [Chen et al.2016]. As a result, state-of-the-art systems for such cloze tests have almost reached the performance ceiling and lost their direction for improvement due to the limitations of the corpora [Chen et al.2016]. In this situation, Xie et al. [Xie et al.2017] argue that it is more reliable to assess language proficiency with questions carefully designed by professional teachers, and release a novel corpus, CLOTH. The CLOTH dataset brings the new challenge of a comprehensive evaluation of language proficiency and specifically divides the questions into several types, including matching, reasoning and grammar. Table 1 shows several example questions from CLOTH.
From the experiments of Xie et al. [Xie et al.2017], we can see that the Stanford attention reader [Chen et al.2016], which achieves near state-of-the-art performance on CNN/Daily Mail, performs far worse on CLOTH, and there exists a huge performance gap between humans and popular machine learning models. The main reason is that attention models are mainly good at processing matching questions (e.g., the first example in Table 1 matches “police” with “accident” and “man died”), which account for a smaller percentage of CLOTH than of CNN/Daily Mail. Xie et al. also present the word-predicting potential of language models (LMs), which can handle lexical collocation well (e.g., “take” and “exam” in the second example), given a large volume of unlabeled data and high computation power. Furthermore, Xie et al. point out that the most difficult questions belong to the long-term-reasoning type (e.g., the third example question), which requires more semantics to deal with.
To comprehensively address the progress and open questions on CLOTH, we propose modeling multiple perspectives to arrive at the correct answer, given limited computation power. Our multi-perspective network consists of several parallel modules, where each module aggregates context information from a unique perspective. We model long-distance matching with attentive readers, global semantics with iterative dilated convolutions, and lexical collocation with both an n-gram model and a neural language model (LM). The outputs of the aggregation modules are further integrated and fed into a one-timestep pointer network [Vinyals et al.2015] to get the final answer.
A further challenge is how to effectively train our multi-perspective network given the insufficiency of labeled data. To overcome this problem, Xie et al. [Xie et al.2017] present a representativeness-based weighted loss function. Their approach has two drawbacks: first, it requires training another model to predict a candidate’s representativeness score; second, it is not sample-efficient, since every word, including uninformative stop words, becomes a training example. In this paper, we improve on this approach and develop a semi-supervised learning method that matches the distribution of candidates between labeled and unlabeled data. The intuition is to make automatically constructed data as similar as possible to the existing labeled data. Stop words, named entities and out-of-vocabulary words should be downsampled, while meaningful content words should be kept for training.
Our method is simple and straightforward, and shows better performance with only a fraction of the training examples. Experiments show that our semi-supervised multi-perspective network outperforms state-of-the-art results on the CLOTH dataset.
Formally, the task of cloze-style reading comprehension requires choosing the correct answer from a set of candidates, given a sequence of words as context. A candidate can be a word or a phrase. For the CLOTH dataset, each question has four candidates.
2.1 MPNet: Multi-Perspective Context Aggregation Network
The overall architecture of our proposed MPNet is shown in Figure 1. It consists of an input layer, a multi-perspective aggregation layer and an output layer.
Input Layer Given the passage as a variable-length word sequence, we embed each word into word embeddings using GloVe vectors. Then, we apply a bidirectional GRU (BiGRU) to get contextualized word representations [McCann et al.2017, Peters et al.2018], since the GRU is computationally more efficient and shows slightly better performance than the LSTM. We also use another GRU to encode candidates into fixed-length vectors, as candidates may be multi-word phrases.
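As an illustration, the input layer's BiGRU encoding can be sketched in plain numpy. The GRU cell below uses random toy weights and small dimensions; it is a sketch of the mechanism, not the trained parameters or hidden sizes of our actual model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell with randomly initialized weights (illustration only)."""
    def __init__(self, d_in, d_h, seed=0):
        rng = np.random.default_rng(seed)
        # stacked weights for update gate z, reset gate r and candidate state n
        self.W = rng.normal(0, 0.1, (3, d_h, d_in))
        self.U = rng.normal(0, 0.1, (3, d_h, d_h))
        self.b = np.zeros((3, d_h))
    def step(self, x, h):
        z = sigmoid(self.W[0] @ x + self.U[0] @ h + self.b[0])
        r = sigmoid(self.W[1] @ x + self.U[1] @ h + self.b[1])
        n = np.tanh(self.W[2] @ x + self.U[2] @ (r * h) + self.b[2])
        return (1 - z) * h + z * n

def bigru_encode(embeddings, d_h=8):
    """Run a forward and a backward GRU and concatenate their states per position."""
    T, d_in = embeddings.shape
    fwd, bwd = GRUCell(d_in, d_h, seed=1), GRUCell(d_in, d_h, seed=2)
    h_f, h_b = np.zeros(d_h), np.zeros(d_h)
    fs, bs = [], []
    for t in range(T):
        h_f = fwd.step(embeddings[t], h_f)
        fs.append(h_f)
    for t in reversed(range(T)):
        h_b = bwd.step(embeddings[t], h_b)
        bs.append(h_b)
    bs.reverse()
    return np.concatenate([np.stack(fs), np.stack(bs)], axis=1)  # (T, 2 * d_h)

emb = np.random.default_rng(0).normal(size=(5, 4))  # 5 words, 4-dim toy embeddings
H = bigru_encode(emb)
assert H.shape == (5, 16)
```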
Multi-Perspective Aggregation Layer This layer consists of several independent aggregation modules. Computation can be easily parallelized since the modules are independent. Each module takes the contextualized word representations and the candidates’ encodings as input and outputs a vector, which reflects the information from that module’s perspective. We also assume the aggregation modules can have access to additional inputs. For cloze-style reading comprehension, each module should be able to distill some knowledge that judges whether a candidate fits a given context from its particular perspective.
The aggregation modules that we use are listed below:
Selective Copying Assuming the index of the blank is given, this module simply selects the hidden representation of the blank, copies it directly to the output and ignores everything else. Note that this representation is the output of the BiGRU and already incorporates context information from both the forward and backward directions. This resembles a bidirectional language model without a softmax output layer. Words near the blank receive more attention, which is consistent with our intuition about filling in a blank.
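A minimal sketch of this module, assuming toy contextualized representations `H` and a known blank index:

```python
import numpy as np

# Toy contextualized representations from a BiGRU: T positions, hidden size d.
T, d = 7, 6
H = np.arange(T * d, dtype=float).reshape(T, d)

def selective_copy(H, blank_index):
    """Output the hidden state at the blank position, ignoring everything else."""
    return H[blank_index]

v = selective_copy(H, blank_index=3)
assert v.shape == (d,) and np.allclose(v, H[3])
```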
Attentive Reader A large portion of questions involve matching candidates with related words which may be far away from each other, such as the second example in Table 1. The attentive reader proposed by Chen et al. [Chen et al.2016] directly attends to the entire context and therefore avoids the difficulty of modeling long-range dependence. The original bilinear attention function [Chen et al.2016] is slightly modified by introducing a bias term to model attention bias towards each word position. The candidate is represented by its encoded vector; we omit its subscript for simplicity.
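One plausible reading of this biased bilinear attention can be sketched in numpy. The score form `h_i^T W c + b_i` and all weights below are illustrative assumptions, not the exact parameterization used in our model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def biased_bilinear_attention(H, c, W, b):
    """Attention over context positions: score_i = h_i^T W c + b_i,
    where b_i is a learned bias toward position i (e.g., near the blank)."""
    scores = H @ W @ c + b
    alpha = softmax(scores)          # attention weights over positions
    return alpha @ H                 # attention-weighted context summary

rng = np.random.default_rng(0)
T, d, dc = 6, 5, 4
H = rng.normal(size=(T, d))          # contextualized word representations
c = rng.normal(size=dc)              # candidate encoding
W = rng.normal(size=(d, dc))
b = np.zeros(T)
b[2] = 2.0                           # bias toward position 2 (assumed blank vicinity)
out = biased_bilinear_attention(H, c, W, b)
assert out.shape == (d,)
```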
Iterative Dilated Convolution Convolutional neural networks have been a successful method for modeling both natural language [Kim2014] and images. Multiple layers of CNNs can extract features in a hierarchical way, which mirrors the compositional property of natural language. Dilated convolution is a variant of traditional convolution and is more efficient for multi-scale context aggregation [Yu and Koltun2015, Strubell et al.2017]. In this work, we use two blocks, where each block consists of two dilated convolutions with dilation rates set to 1 and 3 respectively. Max pooling across filters is applied to get the final output.
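A toy numpy sketch of one such block follows; the kernel width, channel sizes and ReLU nonlinearity are illustrative assumptions.

```python
import numpy as np

def dilated_conv1d(X, K, dilation):
    """'Same'-padded 1D dilated convolution over a (T, d_in) sequence.
    K has shape (width, d_in, d_out); the receptive field grows with dilation."""
    width, d_in, d_out = K.shape
    pad = dilation * (width - 1) // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    T = X.shape[0]
    out = np.zeros((T, d_out))
    for t in range(T):
        for k in range(width):
            out[t] += Xp[t + k * dilation] @ K[k]
    return np.maximum(out, 0.0)  # ReLU

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))              # 10 words, 4 input channels
K1 = rng.normal(0, 0.1, size=(3, 4, 8))
K3 = rng.normal(0, 0.1, size=(3, 8, 8))
# one block: dilation rates 1 and 3, then max pooling to get a fixed-length vector
h = dilated_conv1d(dilated_conv1d(X, K1, dilation=1), K3, dilation=3)
pooled = h.max(axis=0)
assert h.shape == (10, 8) and pooled.shape == (8,)
```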
N-gram Statistics To explicitly incorporate collocation information, we use this module to output logarithmic n-gram counts from English Wikipedia. The logarithmic function avoids the optimization difficulty caused by extremely large counts.
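A toy sketch of this module, with a few-word corpus standing in for English Wikipedia; the particular n-gram orders and feature layout below are assumptions for illustration.

```python
import math
from collections import Counter

# Toy background corpus standing in for English Wikipedia.
corpus = "he decided to take the exam and she will take the exam too".split()

def build_ngram_counts(tokens, n_values=(1, 2, 3)):
    """Count all n-grams of the given orders."""
    counts = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

def log_ngram_feature(counts, left, candidate, right):
    """log(1 + count) of the n-grams formed by placing the candidate in the blank."""
    grams = [
        (candidate,),
        (left, candidate),
        (candidate, right),
        (left, candidate, right),
    ]
    return [math.log1p(counts[g]) for g in grams]

counts = build_ngram_counts(corpus)
# "... the ___ too" with candidates "exam" vs "apple"
f_exam = log_ngram_feature(counts, "the", "exam", "too")
f_apple = log_ngram_feature(counts, "the", "apple", "too")
assert f_exam[0] > f_apple[0]   # "exam" is attested, "apple" is not
assert f_exam[1] > f_apple[1]   # "the exam" is an attested bigram
```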
Note that the output from the selective copying module and the output from the iterative dilated convolution module do not depend on the candidates. We therefore get the context representation by concatenating these two outputs. Similarly, we get the representation vector for each candidate by concatenating the output from the attentive reader module, the output from the n-gram statistics module, and the output from the candidate encoder.
Output Layer We use a one-timestep pointer network [Vinyals et al.2015] to choose the correct answer from the candidates. Given the context representation and the candidates’ representations, we first refine the candidates’ representations with a gating mechanism, where σ is the sigmoid function and ⊙ denotes pointwise multiplication. Then we calculate the distribution over candidates of being the correct answer with a bilinear function. The result is a probability distribution, and the pointer points to the candidate with the highest probability.
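A numpy sketch of the output layer under one plausible reading of the gating and bilinear scoring; the exact parameterization (shapes of `Wg` and `Wb`) is an assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def pointer_output(q, C, Wg, Wb):
    """One-timestep pointer: gate each candidate representation with a gate
    vector computed from the context q, then score candidates with a bilinear
    function and normalize with softmax."""
    g = sigmoid(Wg @ q)            # (dc,) gate vector from the context
    C_gated = C * g                # pointwise multiplication, broadcast over candidates
    scores = C_gated @ Wb @ q      # bilinear score per candidate
    return softmax(scores)

rng = np.random.default_rng(0)
dq, dc, K = 6, 5, 4                # CLOTH-style: 4 candidates
q = rng.normal(size=dq)            # context representation
C = rng.normal(size=(K, dc))       # candidate representations
p = pointer_output(q, C, rng.normal(size=(dc, dq)), rng.normal(size=(dc, dq)))
assert p.shape == (K,) and abs(p.sum() - 1.0) < 1e-9
answer = int(np.argmax(p))         # the pointer's choice
assert 0 <= answer < K
```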
Model Learning The model is trained by minimizing the standard cross-entropy loss.
Discussion Different aggregation modules summarize the context from different perspectives. In order to precisely locate the correct answer, a set of complementary aggregation modules is preferred, where one module may focus only on lexical collocation while another may be sensitive to global matching. It is worth noting that our MPNet framework can be easily extended by adding other effective aggregation modules.
In addition, the main idea of MPNet is to some extent connected with the mixture of experts (MoE) [Masoudnia and Ebrahimpour2014]. If each aggregation module can be seen as an expert, then multiple aggregation modules become MoE. One key difference is that aggregation modules in MPNet have heterogeneous network structures while traditional MoE models usually consist of homogeneous experts.
2.2 SemiMPNet: Semi-supervised Learning with Distribution Matching
SemiMPNet is the semi-supervised variant of our proposed MPNet in Section 2.1, with exactly the same network architecture. Though CLOTH contains many questions, it is generally not enough to train large neural models, and semi-supervised learning comes to the rescue. We propose to sample from unlabeled text to construct training examples automatically. In order to train effectively, we need to make the automatically generated data similar to the labeled data, ensuring that candidates have a similar distribution in the original labeled data and in the generated data. We formulate this candidate distribution matching between the two datasets as two sampling problems, as follows:
How to sample positive candidates? We assume D is a collection of unlabeled documents, C is the collection of all candidates in the CLOTH dataset, and the vocabulary is composed of all the candidates occurring in CLOTH. Each word w is associated with an unknown sampling probability p(w). To match the distribution of candidates between D and C, the following constraints on p(w) should be satisfied:
The function freq(w, D) returns the frequency of word w in corpus D. The second constraint makes sure p is a valid probability distribution, and the third constraint makes full use of the data. There is generally no exact solution to Equation (5), since the constraints may conflict for some words. Instead, we use an approximate solution:
The coefficient can be interpreted as the average probability of sampling a word; we set it based on validation data. With this strategy, we sample the positive candidates and use the corresponding passages as their contexts.
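One plausible reading of the approximate solution is p(w) = min(1, λ · freq(w, C) / freq(w, D)), which satisfies the matching constraint wherever it does not clip at 1. A toy sketch under that assumption:

```python
import random
from collections import Counter

# Toy stand-ins: C = candidate words observed in the labeled CLOTH data,
# D = a large unlabeled background corpus.
labeled_candidates = ["festival", "birthday", "exam", "exam", "festival"]
background = ("the exam was on her birthday so the festival and the exam "
              "were moved and the birthday party began").split()

freq_C = Counter(labeled_candidates)
freq_D = Counter(background)

def sampling_prob(w, lam=1.0):
    """p(w) = min(1, lam * freq(w, C) / freq(w, D)): words frequent as CLOTH
    candidates but rare in D are kept; words never used as candidates
    (e.g., stop words) are never sampled."""
    if freq_D[w] == 0:
        return 0.0
    return min(1.0, lam * freq_C[w] / freq_D[w])

assert sampling_prob("the") == 0.0        # stop word, never a candidate
assert sampling_prob("festival") == 1.0   # content word, always kept
assert sampling_prob("birthday") == 0.5   # downsampled: frequent in D

rng = random.Random(0)
positives = [w for w in background if rng.random() < sampling_prob(w)]
assert "the" not in positives and "festival" in positives
```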
How to sample negative candidates? Given a positive candidate w, the probability of a word v being sampled as a negative candidate can be calculated as follows. The first term is based on the co-occurrence counts of w and v as candidates in the labeled dataset; intuitively, the co-occurrence probability of w and v should match between the constructed data and the labeled data. The second term is the probability of randomly selecting a word from the entire vocabulary, similar to the exploration mechanism in reinforcement learning. It makes our model more robust to overfitting, and we keep its weight fixed throughout the experiments.
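A toy sketch of this mixture, with made-up co-occurrence counts and an assumed exploration weight ε:

```python
from collections import Counter

# Made-up co-occurrence counts of word pairs appearing together as candidates
# in the labeled data (illustration only).
cooc = Counter({("exam", "test"): 4, ("exam", "quiz"): 2})
vocab = ["test", "quiz", "party"]

def negative_prob(w, v, eps=0.1):
    """Mix the co-occurrence distribution with uniform exploration over the
    vocabulary, similar in spirit to epsilon-greedy exploration."""
    total = sum(cooc[(w, u)] for u in vocab)
    cooc_p = cooc[(w, v)] / total if total > 0 else 0.0
    return (1 - eps) * cooc_p + eps / len(vocab)

probs = {v: negative_prob("exam", v) for v in vocab}
assert abs(sum(probs.values()) - 1.0) < 1e-9     # valid distribution
assert probs["test"] > probs["quiz"] > probs["party"] > 0.0
```

Words that frequently co-occur with the positive candidate in the labeled data are favored, while the ε term keeps every vocabulary word reachable.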
In the case that candidates are multi-word phrases, our method is still applicable: we simply expand the vocabulary to include such phrases.
3.1 Experimental Setup
Dataset and Evaluation Metrics We use the CLOTH [Xie et al.2017] dataset for training and evaluation. The RACE [Lai et al.2017] dataset and English Wikipedia (https://dumps.wikimedia.org/enwiki/) serve as background text corpora for semi-supervised learning. The RACE dataset consists of reading comprehension passages from high-school examinations. We delete passages whose Jaccard similarity with any passage in the CLOTH dataset exceeds a threshold. Furthermore, the background text corpora also include training passages from the CLOTH dataset, obtained by filling the correct answer back into the corresponding blank.
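The Jaccard-based filtering can be sketched as follows; the similarity threshold here is a placeholder assumption, not the value used in our experiments.

```python
def jaccard(a, b):
    """Jaccard similarity between two passages, treated as sets of word types."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

cloth = ["the man crossed the road after the accident"]
race = [
    "the man crossed the road after the accident",   # near-duplicate, filtered
    "students prepare for the final exam in june",
]
threshold = 0.8   # placeholder; the paper's threshold is not shown here
kept = [p for p in race if all(jaccard(p, q) <= threshold for q in cloth)]
assert kept == ["students prepare for the final exam in june"]
```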
Accuracy is used as the evaluation metric. To make a fair comparison with Xie et al. [Xie et al.2017], we also report performance on CLOTH-M (middle school questions) and CLOTH-H (high school questions).
Our model is implemented with TensorFlow [Abadi et al.2016]. Hyperparameters are optimized with random search based on validation data. All our models are run on a single GPU (Tesla P40). NLTK [Bird and Loper2004] is used for tokenization. Word embeddings are initialized with 300-dimensional GloVe [Pennington et al.2014] vectors; only the vectors of the most frequent words are fine-tuned during training. Our network is trained with the Adam algorithm [Kingma and Ba2014], with the learning rate decayed twice over the course of training. For input, we use a fixed-size context window around the blank. For 1D dilated convolution, we use two blocks. Dropout is applied to the output of the BiGRU.
| Model | + constructed data? | CLOTH | CLOTH-M | CLOTH-H |
| --- | --- | --- | --- | --- |
| LSTM [Xie et al.2017] | No | 48.4% | 51.8% | 47.1% |
| Stanford Attention Reader [Chen et al.2016] | No | 48.7% | 52.9% | 47.1% |
| MPNet - ngram | No | 50.1% | 53.2% | 49.0% |
| Language Model [Xie et al.2017] | Yes | 54.8% | 64.6% | 50.6% |
| Representativeness [Xie et al.2017] | Yes | 56.5% | 66.5% | 52.6% |
| LSTM + Representativeness [Xie et al.2017] | Yes | 58.3% | 67.3% | 54.9% |
| SemiMPNet - ngram | Yes | 60.9% | 67.6% | 58.3% |
LSTM is a baseline model by Xie et al. [Xie et al.2017]. First, a BiLSTM layer is applied to the context word embeddings; then, the outputs near the blank are used to calculate the probability of each candidate being the correct answer.
Stanford Attention Reader is an attention-based neural model for reading comprehension presented by Chen et al. [Chen et al.2016]. Experimental results are from Xie et al. [Xie et al.2017].
Language Model To overcome the difficulty of insufficient labeled data, Xie et al. [Xie et al.2017] propose to train a neural language model on passages from the CLOTH dataset. The candidate that results in the highest probability is chosen as the predicted answer. It is fair to say the Language Model is a simple data augmentation approach that treats every word as a training example with equal weight.
Representativeness is another semi-supervised data augmentation approach by Xie et al. [Xie et al.2017]. It assigns different weights to different constructed examples based on a representativeness score, which can be interpreted as the probability of a given word being selected as a blank by a human. For more technical details, please refer to Xie et al. [Xie et al.2017].
One-billion-word-LM is a state-of-the-art neural language model [Jozefowicz et al.2016] trained on the one-billion-word benchmark [Chelba et al.2013]. It has more than a billion parameters and is publicly available (https://github.com/tensorflow/models/tree/master/research/lm_1b).
3.3 Main Results
We evaluate our model’s performance in two experimental settings: with and without external data. For the setting without external data, we only use passages from CLOTH for training and semi-supervised data augmentation. Though GloVe vectors are trained on external text corpora, using pretrained embeddings has become standard practice in NLP; GloVe vectors are therefore used in both settings, as they are in the work by Xie et al. [Xie et al.2017].
Results w/o External Data Results are shown in Table 2. When trained only on labeled data, both the LSTM by Xie et al. [Xie et al.2017] and our proposed MPNet perform poorly, though MPNet slightly outperforms the LSTM in overall accuracy. The accuracy on middle school questions (CLOTH-M) is consistently higher than on high school questions (CLOTH-H) across all of our experiments, since middle school questions are relatively easier.
Table 2 clearly shows that constructed data can significantly boost both models’ performance. Xie et al. [Xie et al.2017] explore several different ways of data augmentation: the Language Model treats every word equally, while the Representativeness method assigns different weights to different words by training a representativeness prediction network, which further improves the accuracy. Our proposed method instead adopts a new sampling scheme that samples words under distribution constraints, which makes training more efficient.
As shown in Table 3, stop words (e.g., “I” and “the”), named entities (e.g., “Frank” and “California”) and common phrases (e.g., “thank you”) have a low probability of being sampled, while content words such as “festivals” and “birthday” are more likely to be sampled. One limitation of our sampling method is its inability to handle synonyms; however, since synonyms tend to co-occur as candidates in the labeled dataset, this problem is not as severe as it appears. “SemiMPNet - ngram” beats all baseline methods and achieves the highest accuracy, though human performance remains much higher. The effectiveness of constructed data indicates that the lack of labeled data has become a bottleneck.
| Model | CLOTH | CLOTH-M | CLOTH-H |
| --- | --- | --- | --- |
| SemiMPNet + One-billion-word-LM | 74.9% | 79.0% | 73.3% |
Results with External Data As shown in Table 4, incorporating the RACE dataset for semi-supervised learning improves accuracy. However, MPNet and SemiMPNet still underperform One-billion-word-LM [Jozefowicz et al.2016], a pretrained state-of-the-art neural language model trained on a corpus of nearly a billion words. In contrast, SemiMPNet is trained on a corpus that is two orders of magnitude smaller, which makes the remaining gap in accuracy rather impressive. Once again this shows the power of transferring knowledge from unlabeled text corpora.
As a further discussion, we would like to point out that although a language model can achieve good results, it is not the most efficient approach: the experimental results in Table 2 show that the language model underperforms SemiMPNet given the same amount of text. A fair comparison would be training SemiMPNet on the one-billion-word benchmark; however, given the size of that corpus, applying our semi-supervised method to it directly would require a sizable amount of computing power. Here we design an approximate method, “SemiMPNet + One-billion-word-LM”, which combines SemiMPNet and One-billion-word-LM by linear interpolation of their output probabilities: the output probability of SemiMPNet is mixed with the normalized probability of One-billion-word-LM (the normalization makes sure the probabilities of all candidates sum to 1). The interpolation weight is set on validation data, and a hyperparameter search shows the results are quite robust to a wide range of values. Our model “SemiMPNet + One-billion-word-LM” achieves a new state-of-the-art performance of 74.9%, improving over One-billion-word-LM alone. This also shows the complementarity of SemiMPNet and One-billion-word-LM: the two models learn different aspects of the contexts.
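The interpolation can be sketched as follows; the probabilities and the interpolation weight below are illustrative, not values from our experiments.

```python
import numpy as np

def interpolate(p_mpnet, p_lm, lam):
    """Convex combination of SemiMPNet's candidate distribution and the
    (renormalized) language-model distribution."""
    p_lm = np.asarray(p_lm, dtype=float)
    p_lm = p_lm / p_lm.sum()                 # normalize over the four candidates
    return lam * np.asarray(p_mpnet) + (1 - lam) * p_lm

p_mpnet = [0.10, 0.60, 0.20, 0.10]           # toy SemiMPNet distribution
p_lm_raw = [0.02, 0.02, 0.05, 0.01]          # toy unnormalized LM scores
lam = 0.5                                    # illustrative; tuned on validation data
p = interpolate(p_mpnet, p_lm_raw, lam)
assert abs(p.sum() - 1.0) < 1e-9             # still a valid distribution
assert int(np.argmax(p)) == 1                # combined prediction
```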
3.4 Ablation Study
Our proposed MPNet consists of four aggregation modules. To examine the effect of each module, we conduct an ablation study. The results are shown in Table 5.
| Ablation | Accuracy |
| --- | --- |
| w/o selective copying | 69.4% (-1.0) |
| w/o attentive reader | 67.6% (-2.8) |
| w/o dilated convolution | 69.6% (-0.8) |
| w/o n-gram statistics | 63.0% (-7.4) |
N-gram statistics turn out to be the single most influential factor: overall performance decreases by 7.4% absolute without them. On one hand, this result further highlights the importance of distilling knowledge from large text corpora; on the other hand, it indicates that our background corpus is not large enough for neural models to learn reliable lexical collocation information.
The attentive reader also has a significant impact on overall performance. The attention mechanism is able to locate useful information regardless of its positional distance from the blank; in contrast, RNNs need to preserve such information over a long distance, which is nontrivial.
Besides, the results in Table 5 support an important intuition in this paper: different modules capture context information from different perspectives, and removing any one of them would result in decreased performance.
3.5 Examining Effects of Background Corpus
For our semi-supervised learning model SemiMPNet, background corpus is used to construct training examples. The choice of background corpus can make a big difference. In this section, we conduct an experiment to examine such effects. Three different corpora are used: passages from the training set of CLOTH, RACE and English Wikipedia.
Results are shown in Figure 2. Unsurprisingly, more data leads to better performance. Moreover, given the same amount of text, CLOTH consistently beats RACE, and RACE consistently beats Wikipedia. CLOTH and RACE consist of passages designed for high school students, while Wikipedia entries are written for the general public and therefore have a different word distribution. Thus, how to make use of huge amounts of unlabeled data is key for performance improvement, since training corpora of higher quality are generally smaller in scale.
4 Related Work
Reading Comprehension or machine reading is drawing more and more interest in the NLP research community. The CNN/Daily Mail [Hermann et al.2015] and CBT [Hill et al.2015] datasets are two automatically generated cloze-style datasets. Though they can be large in scale, the quality of automatically generated questions is generally lower than that of manually labeled ones. SQuAD [Rajpurkar et al.2016] adopts a crowd-sourcing approach to ensure its quality: each passage is accompanied by one or more questions, and the answer is a text span of the given passage, for the convenience of automatic evaluation. Rapid progress has been made with neural network based models [Wang et al.2016], and the performance of state-of-the-art models on SQuAD, such as QANet [Yu et al.2018] and ELMo [Peters et al.2018], is already very close to that of humans.
There are also some datasets focusing on answering questions from real-world scenarios; MS MARCO [Nguyen et al.2016] and DuReader [He et al.2017] are two typical examples. Such datasets are usually harder, as they require the ability of both comprehension and language generation. BLEU and ROUGE are often used as evaluation metrics, and one potential problem is that answers with high BLEU scores may have very different semantic meanings.
Cloze Test is a particular form of reading comprehension task and has been widely adopted as a method for assessing students’ language proficiency. Zweig and Burges [Zweig and Burges2011] presented a challenging dataset for sentence completion, but its scale is small. CNN/Daily Mail [Hermann et al.2015], CBT [Hill et al.2015], LAMBADA [Paperno et al.2016] and CLOTH [Xie et al.2017] are all large-scale cloze-test datasets, with the difference that each question in CLOTH has four candidate options. The recently proposed Story Cloze [Mostafazadeh et al.2017] is a cloze-test dataset that goes beyond words and phrases, requiring the choice of a sentence as the appropriate story ending.
Semi-supervised methods for reading comprehension are widely studied due to the scarcity of labeled data. One major approach is pretraining a model for text representation and reusing the weights during supervised learning; autoencoders [Hewlett et al.2017], machine translation [McCann et al.2017] and language models [Peters et al.2018] can be used for representation learning. Another approach aims to directly construct training examples from unlabeled text corpora, where a weighted loss function [Xie et al.2017] or reinforcement learning [Yang et al.2017] can be used to alleviate the discrepancy between human-labeled data and automatically constructed data.
In this paper, we propose a multi-perspective network MPNet for cloze-style reading comprehension. MPNet consists of several parallel context aggregation modules. Each module summarizes the variable-length context and candidates into a fixed-length vector from a unique perspective. We explore four effective implementations of aggregation modules in experiments. The architecture of MPNet is very flexible and can be easily extended by adding more task-specific modules.
To overcome the difficulty of limited labeled data, we turn to semi-supervised learning by automatically constructing training examples from unlabeled text corpora. Experiments on the CLOTH dataset show that our semi-supervised MPNet achieves new state-of-the-art performance. In future work, we would like to develop more effective methods to tackle this challenge.
We would like to thank three anonymous reviewers for their insightful comments, and COLING 2018 organizers for their efforts.
- [Abadi et al.2016] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. 2016. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283.
- [Bird and Loper2004] Steven Bird and Edward Loper. 2004. NLTK: the natural language toolkit. In Proceedings of the ACL 2004 on Interactive poster and demonstration sessions, page 31. Association for Computational Linguistics.
- [Chelba et al.2013] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
- [Chen et al.2016] Danqi Chen, Jason Bolton, and Christopher D Manning. 2016. A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2358–2367.
- [He et al.2017] Wei He, Kai Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, et al. 2017. DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications. arXiv preprint arXiv:1711.05073.
- [Hermann et al.2015] Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.
- [Hewlett et al.2017] Daniel Hewlett, Llion Jones, Alexandre Lacoste, et al. 2017. Accurate supervised and semi-supervised machine reading for long documents. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2001–2010.
- [Hill et al.2015] Felix Hill, Antoine Bordes, Sumit Chopra, and Jason Weston. 2015. The goldilocks principle: Reading children’s books with explicit memory representations. arXiv preprint arXiv:1511.02301.
- [Jozefowicz et al.2016] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.
- [Kim2014] Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
- [Kingma and Ba2014] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [Lai et al.2017] Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683.
- [Masoudnia and Ebrahimpour2014] Saeed Masoudnia and Reza Ebrahimpour. 2014. Mixture of experts: a literature survey. Artificial Intelligence Review, pages 1–19.
- [McCann et al.2017] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. arXiv preprint arXiv:1708.00107.
- [Mostafazadeh et al.2017] Nasrin Mostafazadeh, Michael Roth, Annie Louis, Nathanael Chambers, and James F Allen. 2017. LSDSem 2017 Shared Task: The Story Cloze Test. LSDSem 2017, page 46.
- [Nguyen et al.2016] Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
- [Paperno et al.2016] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. The LAMBADA dataset: Word prediction requiring a broad discourse context. arXiv preprint arXiv:1606.06031.
- [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 1532–1543.
- [Peters et al.2018] Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. NAACL.
- [Rajpurkar et al.2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.
- [Strubell et al.2017] Emma Strubell, Patrick Verga, David Belanger, and Andrew McCallum. 2017. Fast and accurate entity recognition with iterated dilated convolutions. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2660–2670.
- [Vinyals et al.2015] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700.
- [Wang et al.2016] Zhiguo Wang, Haitao Mi, Wael Hamza, and Radu Florian. 2016. Multi-perspective context matching for machine comprehension. arXiv preprint arXiv:1612.04211.
- [Xie et al.2017] Qizhe Xie, Guokun Lai, Zihang Dai, and Eduard Hovy. 2017. Large-scale cloze test dataset designed by teachers. arXiv preprint arXiv:1711.03225.
- [Yang et al.2017] Zhilin Yang, Junjie Hu, Ruslan Salakhutdinov, and William W Cohen. 2017. Semi-Supervised QA with Generative Domain-Adaptive Nets. arXiv preprint arXiv:1702.02206.
- [Yu and Koltun2015] Fisher Yu and Vladlen Koltun. 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
- [Yu et al.2018] Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V Le. 2018. Qanet: Combining local convolution with global self-attention for reading comprehension. ICLR.
- [Zweig and Burges2011] Geoffrey Zweig and Christopher JC Burges. 2011. The Microsoft Research sentence completion challenge. Technical report, Technical Report MSR-TR-2011-129, Microsoft.