Given a passage (context) and a question about it, a reading comprehension system should be able to read the passage and answer the question. While this is not a hard task for a human, it requires that a system both understand natural language and have knowledge about the world. Thanks to the renaissance of neural networks and the availability of large-scale datasets, great progress has recently been made in reading comprehension. For example, according to the leaderboard of SQuAD 1.0 [Rajpurkar et al.2016], a large number of systems have been submitted, and human performance has been surpassed. In practice, reimplementing and comparing these systems are necessary but laborious tasks, because researchers usually build their models from scratch and in different environments. Meanwhile, the rapid construction of original prototypes is difficult, even though reading comprehension models often share similar components and architectures.
In this paper, we present the Sogou Machine Reading Comprehension toolkit (https://github.com/sogou/SMRCToolkit), which has the goal of allowing the rapid and efficient development of modern machine comprehension models, including both published models and original prototypes. First, the toolkit simplifies dataset reading by providing reader modules that support popular datasets. Second, a flexible preprocessing pipeline allows vocabulary building, linguistic feature extraction, and related operations to work together seamlessly. Third, the toolkit offers frequently used neural network components, a trainer module, and a save/load function, which accelerate the construction of custom models. Last but not least, several published models are implemented in the toolkit, making model comparison and modification convenient. The toolkit is built on the Tensorflow library (https://github.com/tensorflow/tensorflow) [Abadi et al.2016].
2 Toolkit Framework
As shown in Figure 1, the architecture of our toolkit mainly contains four modules: the Dataset Reader, Data Preprocessing, Model Construction, and Model Training modules. These four modules are designed as a pipeline flow and can be used for most machine reading comprehension tasks. In the following, we will introduce each part in detail.
2.1 Dataset Reader
One reason for the rapid progress of machine reading comprehension that cannot be ignored is the release of a variety of large-scale, high-quality question answering datasets. Preprocessing and evaluation are essential steps when doing research on these datasets.
Reader To avoid repeatedly developing dataset reading code, the toolkit provides reader modules for several typical datasets: SQuAD 1.0 [Rajpurkar et al.2016], SQuAD 2.0 [Rajpurkar et al.2018], and CoQA [Reddy et al.2018]. To enhance language diversity, we also support a Chinese dataset, CMRC2018 [Cui et al.2018]. The reader modules first tokenize texts and generate labels (e.g., start/end positions), and then transform data instances into nested structure objects whose fields are uniformly named. This makes data serialization/deserialization convenient and helps in error analysis. By inheriting from the base reader, users can develop custom readers for other datasets.
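As an illustration of the label-generation step a reader performs for extractive QA, the following self-contained sketch (hypothetical helper names, not the toolkit's actual reader code) converts a character-level answer span into start/end token labels:

```python
# Illustrative sketch: mapping a character-level answer span to
# start/end token labels, as an extractive-QA dataset reader must do.

def tokenize_with_offsets(text):
    """Whitespace tokenizer that also records each token's character offsets."""
    tokens, offsets, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)
        tokens.append(tok)
        offsets.append((start, start + len(tok)))
        pos = start + len(tok)
    return tokens, offsets

def answer_span_labels(context, answer_start, answer_text):
    """Return tokens plus the token indices covering the answer span."""
    tokens, offsets = tokenize_with_offsets(context)
    answer_end = answer_start + len(answer_text)
    start_label = end_label = None
    for i, (s, e) in enumerate(offsets):
        if s <= answer_start < e:
            start_label = i
        if s < answer_end <= e:
            end_label = i
    return tokens, start_label, end_label

context = "The toolkit was released by Sogou in 2019 ."
tokens, start, end = answer_span_labels(context, context.index("Sogou"), "Sogou")
```

A real reader additionally attaches these labels, together with the question and other fields, to a uniformly named data-instance object.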
Evaluator Most datasets offer official evaluation scripts. To ease the model validation and early stopping, we integrate these evaluation scripts into the toolkit and simplify the evaluation in the model training process.
2.2 Data Preprocessing
To prepare data for model training, we need to build a vocabulary, extract linguistic features, and map discrete features to indices. The toolkit provides modules fulfilling these requirements.
Vocabulary Builder By scanning the training data from the dataset reader, the Vocabulary Builder maintains a corresponding vocabulary of words (and characters if needed). Adding special tokens or setting the whole vocabulary manually is allowed as well. Another important function of the Vocabulary Builder is creating an embedding matrix from pretrained word embeddings: if a pretrained embedding file such as Glove (https://nlp.stanford.edu/projects/glove/) [Pennington et al.2014] is fed to the Vocabulary Builder, it produces a word embedding matrix for its inner vocabulary.
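The embedding-matrix construction can be sketched as follows (a minimal illustration with hypothetical names, not the toolkit's API): collect words from the training data, reserve special tokens, and copy pretrained vectors in where available, falling back to random initialization for out-of-vocabulary words.

```python
import numpy as np

PAD, UNK = "<pad>", "<unk>"

def build_vocab(token_lists):
    """Scan tokenized training data and assign an index to every word."""
    vocab = {PAD: 0, UNK: 1}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def make_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Random init for unknown words, zeros for padding, pretrained otherwise."""
    rng = np.random.RandomState(seed)
    matrix = rng.normal(scale=0.1, size=(len(vocab), dim)).astype("float32")
    matrix[vocab[PAD]] = 0.0  # padding row stays zero
    for word, idx in vocab.items():
        if word in pretrained:
            matrix[idx] = pretrained[word]
    return matrix

vocab = build_vocab([["the", "cat"], ["the", "dog"]])
pretrained = {"the": np.ones(4, dtype="float32")}  # stand-in for a Glove file
emb = make_embedding_matrix(vocab, pretrained, dim=4)
```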
Feature Extractor Linguistic features are used in many machine reading comprehension models, such as DrQA [Chen et al.2017] and FusionNet [Huang et al.2017], and have been proven effective. The Feature Extractor supports commonly used features, e.g., part-of-speech (POS) tags and named entity recognition (NER) tags, along with normalized term frequency (TF) and word-level exact matching. Because it simply adds new feature fields, the Feature Extractor does not break the serializability and readability of data instance objects. Meanwhile, the Feature Extractor also builds vocabularies for discrete features like POS and NER tags, which are used in later steps for index mapping.
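Two of these features can be illustrated in a few lines (an assumed formulation, not the toolkit's code): word-level exact match marks each context token that also appears in the question, and normalized TF divides each token's count in the context by the context length.

```python
from collections import Counter

def exact_match_feature(context_tokens, question_tokens):
    """1.0 for context tokens that appear (case-insensitively) in the question."""
    question = {q.lower() for q in question_tokens}
    return [1.0 if tok.lower() in question else 0.0 for tok in context_tokens]

def normalized_tf_feature(context_tokens):
    """Each token's frequency in the context, normalized by context length."""
    counts = Counter(tok.lower() for tok in context_tokens)
    total = len(context_tokens)
    return [counts[tok.lower()] / total for tok in context_tokens]

ctx = ["Sogou", "released", "the", "toolkit", "the"]
q = ["Who", "released", "the", "toolkit", "?"]
em = exact_match_feature(ctx, q)  # only "Sogou" is absent from the question
tf = normalized_tf_feature(ctx)   # "the" occurs twice among five tokens
```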
Batch Generator The last step of preprocessing is to pack all of the features up and fit them to the form of the model input. In the Batch Generator, we first map words and tags to indices, pad length-variable features, and transform all of the features into tensors, and then batch them. To make these steps efficient, we implement the Batch Generator on top of the Tensorflow Dataset API (https://www.tensorflow.org/api_docs/python/tf/data/Dataset), which parallelizes data transformation and provides fundamental functions such as dynamic padding and data shuffling, making it behave consistently with a Python generator. The Batch Generator is designed to be flexible and compatible with the feature types frequently used in machine reading comprehension tasks.
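Dynamic padding and batching can be sketched in plain Python (the toolkit itself delegates this to the Tensorflow Dataset API; the names here are hypothetical): each batch is padded only to the length of its own longest sequence, rather than to a global maximum.

```python
def batch_generator(sequences, batch_size, pad_id=0):
    """Yield batches of index sequences, padded to each batch's own max length."""
    for i in range(0, len(sequences), batch_size):
        batch = sequences[i:i + batch_size]
        max_len = max(len(seq) for seq in batch)
        yield [seq + [pad_id] * (max_len - len(seq)) for seq in batch]

data = [[1, 2], [3, 4, 5], [6], [7, 8, 9, 10]]
batches = list(batch_generator(data, batch_size=2))
# first batch padded to length 3, second to length 4
```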
2.3 Model Construction
The core of a machine reading comprehension system is constructing an effective and efficient model for generating answers from given passages. The toolkit supports two approaches: building your own model or using a built-in model. For the first, we implement the neural network components frequently used in machine reading comprehension. We follow the idea of a functional API and wrap them as MRC-specific supplements to the Tensorflow layers.
Embedding Besides a vanilla embedding layer, the toolkit also provides PartiallyTrainableEmbedding, as used in [Chen et al.2017] and [Huang et al.2017], and pretrained contextualized representation layers, including CoVeEmbedding, ElmoEmbedding, and BertEmbedding.
Recurrent BiLSTM and BiGRU are the basic recurrent layers, and their CuDNN versions, CudnnBiLSTM and CudnnBiGRU, are also available.
Similarity Function Functions are available for calculating the word-level similarities between texts, e.g., DotProduct, TriLinear, and MLP.
Attention Attention layers are usually used together with the Similarity Function, e.g., BiAttention, UniAttention, and SelfAttention.
Basic Layer Some basic layers are used in machine reading comprehension models, e.g., VariationalDropout, Highway, and ReduceSequence.
Basic Operation These are mainly masking operations, e.g., masked_softmax and mask_logits. By inheriting from the base model class and combining the components above, developers should be able to construct most mainstream machine reading comprehension models. To build a custom model, a developer needs to define three member methods.
Training functions (train_and_evaluate, evaluate, and inference) should also be inherited if needed.
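Two of the components above can be illustrated with a minimal numpy sketch (assumed shapes and placeholder weights, not the toolkit's Tensorflow code): the TriLinear similarity function, sim(u, v) = w1·u + w2·v + w3·(u ∘ v) for every context/question token pair, and masked_softmax built on mask_logits, which pushes padded positions toward zero probability.

```python
import numpy as np

VERY_NEGATIVE = -1e30

def trilinear_similarity(context, question, w1, w2, w3):
    """TriLinear similarity: context (n, d), question (m, d) -> (n, m) matrix."""
    part_c = context @ w1                    # (n,): w1 . u per context token
    part_q = question @ w2                   # (m,): w2 . v per question token
    part_cq = (context * w3) @ question.T    # (n, m): covers the w3 . (u * v) term
    return part_c[:, None] + part_q[None, :] + part_cq

def mask_logits(logits, mask):
    """mask is 1.0 for real tokens and 0.0 for padding."""
    return logits * mask + (1.0 - mask) * VERY_NEGATIVE

def masked_softmax(logits, mask):
    """Softmax that assigns (numerically) zero probability to padded positions."""
    masked = mask_logits(logits, mask)
    exp = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.RandomState(0)
c, q = rng.randn(4, 8), rng.randn(3, 8)
w1, w2, w3 = rng.randn(8), rng.randn(8), rng.randn(8)
sim = trilinear_similarity(c, q, w1, w2, w3)  # (4, 3) word-pair similarities

probs = masked_softmax(np.array([2.0, 1.0, 3.0]), np.array([1.0, 1.0, 0.0]))
# the padded third position receives ~0 probability
```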
The toolkit also provides simple interfaces for using the built-in models. We will introduce the details in Section 3.
2.4 Model Training
When training a model, we usually care about how the metrics change on the train/dev sets, when to perform early stopping, how many epochs the model needs to converge, and so on. Because most models share a similar training strategy, the toolkit provides a Trainer module whose main functions include baby-sitting the training, evaluation, and inference processes, saving the best weights, cooperating with an exponential moving average, and recording the training summary. Each model also provides interfaces for saving and loading model weights.
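The baby-sitting logic can be sketched as a toy loop (hypothetical logic for illustration, not the Trainer's implementation): track the best dev metric, remember the epoch whose weights should be saved, and stop early when the metric has not improved for a fixed number of evaluations.

```python
def train_with_early_stopping(dev_scores, patience=2):
    """dev_scores: the dev-set metric after each epoch.
    Returns the best score and the epoch that achieved it."""
    best, best_epoch, bad_rounds = float("-inf"), -1, 0
    for epoch, score in enumerate(dev_scores):
        if score > best:
            best, best_epoch, bad_rounds = score, epoch, 0
            # a real trainer would save the model weights here
        else:
            bad_rounds += 1
            if bad_rounds >= patience:
                break  # early stopping: no improvement for `patience` rounds
    return best, best_epoch

best, best_epoch = train_with_early_stopping([60.1, 66.4, 71.0, 70.2, 70.8])
```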
3 Using Built-In Models
3.1 Have a Try
We will show an example of running the BiDAF model on the SQuAD 1.0 dataset in this section.
First, the data file of SQuAD 1.0 is loaded using SquadReader. Meanwhile, we also create an evaluator for validation.
Second, we build a vocabulary and corresponding word embedding matrix.
Third, data instances are fed to Batch Generator for the necessary preprocessing and batching.
Last, we use the built-in BiDAF model and compile it with default hyperparameters. train_and_evaluate will handle the training process and save the best model weights for inference.
With our toolkit, users can try different machine reading comprehension models in a neat and fast way.
3.2 Model Zoo
In this section, we briefly introduce the machine reading comprehension models implemented in this toolkit.
BiDAF was introduced by [Seo et al.2016]. Unlike the attention mechanisms in previous work, the core idea of BiDAF is bidirectional attention, which models both the query-to-context and context-to-query attention.
DrQA was proposed by [Chen et al.2017] and aims at tackling open-domain question answering. DrQA uses word embeddings, basic linguistic features, and a simple attention mechanism, and proves that simple models without sophisticated architectural designs can also achieve strong results in machine reading comprehension.
FusionNet Based on an analysis of the attention approaches in previous work, [Huang et al.2017] proposed FusionNet, which extends the attention from three perspectives. They proposed the use of the “history of word” and fully aware attention, which let the model combine the information flows from different semantic levels. In addition, the idea was also applied to natural language inference.
R-Net The main contribution of R-Net was the self-matching attention mechanism. After the gated matching of the context and question, passage self-matching is introduced to aggregate evidence from the whole passage and refine the passage representation.
QANet The architecture of QANet [Yu et al.2018] was adapted from the Transformer [Vaswani et al.2017] and contains only convolution and self-attention. By avoiding recurrent layers, QANet achieves a 3–13-fold speedup in training and a 4–9-fold speedup in inference.
IARNN In our toolkit, two types of Inner Attention-based RNNs (IARNNs) [Wang et al.2016] are implemented, which are advantageous for sentence representation and efficient in the answer selection task. IARNN-word weights the word representations of the context with respect to the question before they are input into the RNN. Unlike IARNN-word, which acts only on the input word embeddings, IARNN-hidden can capture the relationships between multiple words by adding additional context information to the calculation of the attention weights.
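The IARNN-word idea can be sketched in numpy (an illustration under our reading of [Wang et al.2016], with hypothetical variable names): each context word embedding is scaled by a question-conditioned gate before it enters the RNN.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def iarnn_word_input(context_emb, question_vec, M):
    """context_emb: (n, d) word embeddings; question_vec: (d,) question
    representation; M: (d, d) bilinear attention matrix.
    Returns gated word embeddings to be fed into the RNN."""
    gates = sigmoid(context_emb @ M @ question_vec)  # one scalar gate per word
    return context_emb * gates[:, None]

rng = np.random.RandomState(1)
ctx = rng.randn(5, 6)      # five context words, dimension six
qvec = rng.randn(6)
M = rng.randn(6, 6)
weighted = iarnn_word_input(ctx, qvec, M)
```

Because the gates lie in (0, 1), each embedding is attenuated rather than amplified, letting question-relevant words dominate the recurrent encoding.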
BiDAF++ [Clark and Gardner2018] introduced a model for multi-paragraph machine reading comprehension. Based on BiDAF, BiDAF++ adds a self-attention layer to increase the model capacity. Following [Yatskar2018], we also apply the model to CoQA for conversational question answering.
Pretrained Representations Pretrained contextualized representations have shown great efficacy in many natural language processing tasks. In our toolkit, we use BERT [Devlin et al.2018], ELMo [Peters et al.2018], and CoVe [McCann et al.2017] as embedding layers to provide strong contextualized representations. Meanwhile, we also include the BERT model for machine reading comprehension, as well as our modified version. The results of the models in our toolkit are listed in Section 4.
4 Experiments
We conducted experiments on the supported datasets with the models in the toolkit. Following the experimental settings in the original papers, we attempted to reproduce the results of the models on the different datasets. It is worth mentioning that slight modifications were applied when necessary; the scripts and hyperparameters for producing the results shown below are included in the toolkit.
| Model | toolkit implementation | original paper |
| --- | --- | --- |
In Table 1, we report the results of the implemented models on the development set of SQuAD 1.0. A sophisticated and effective attention mechanism is necessary for building a high-performance model, according to the table. In addition, pretrained models like ELMo and BERT give reading comprehension a big boost and have become a new trend in natural language processing. Our toolkit also wraps commonly used attention and pretrained models in a high-level layer and allows flexible combinations.
| Model | toolkit implementation | original paper |
| --- | --- | --- |
| BiDAF++ + ELMo | 67.6/64.8 | 67.6/65.1 |
| Model | toolkit implementation | original paper |
| --- | --- | --- |
| BiDAF++ + ELMo | 74.5 | 69.2 |
Because SQuAD 2.0 and CoQA differ from SQuAD 1.0 in a variety of respects, the models are not directly transferable between these datasets. Following [Levy et al.2017] and [Yatskar2018], we implement several effective models. Moreover, our implemented BiDAF achieves solid Exact Match and F1 scores on the CMRC dataset, providing a strong baseline.
To investigate the effect of the word representation, we selected two popular models and tested their performance with different embeddings. Table 4 suggests that DrQA is more sensitive to the word embedding and that ELMo consistently improves the scores (when ELMo was used, no word embedding was concatenated).
5 Conclusion and Future Work
In this paper, we presented the Sogou Machine Reading Comprehension toolkit, which aims to allow the rapid and efficient development of modern machine comprehension models, including both published models and original prototypes.
In the future, we plan to extend the toolkit to more tasks, e.g., multi-paragraph and multi-document question answering, and to provide more models.
- [Abadi et al.2016] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek Gordon Murray, Benoit Steiner, Paul A. Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2016. Tensorflow: A system for large-scale machine learning. In Kimberly Keeton and Timothy Roscoe, editors, 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2016, Savannah, GA, USA, November 2-4, 2016, pages 265–283. USENIX Association.
- [Chen et al.2017] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading wikipedia to answer open-domain questions. In Regina Barzilay and Min-Yen Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, pages 1870–1879. Association for Computational Linguistics.
- [Clark and Gardner2018] Christopher Clark and Matt Gardner. 2018. Simple and effective multi-paragraph reading comprehension. In Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 845–855. Association for Computational Linguistics.
- [Cui et al.2018] Yiming Cui, Ting Liu, Li Xiao, Zhipeng Chen, Wentao Ma, Wanxiang Che, Shijin Wang, and Guoping Hu. 2018. A span-extraction dataset for chinese machine reading comprehension. CoRR, abs/1810.07366.
- [Devlin et al.2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.
- [Huang et al.2017] Hsin-Yuan Huang, Chenguang Zhu, Yelong Shen, and Weizhu Chen. 2017. Fusionnet: Fusing via fully-aware attention with application to machine comprehension. CoRR, abs/1711.07341.
- [Levy et al.2017] Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Roger Levy and Lucia Specia, editors, Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), Vancouver, Canada, August 3-4, 2017, pages 333–342. Association for Computational Linguistics.
- [McCann et al.2017] Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6297–6308.
- [Pennington et al.2014] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In Alessandro Moschitti, Bo Pang, and Walter Daelemans, editors, Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, pages 1532–1543. ACL.
- [Peters et al.2018] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pages 2227–2237.
- [Rajpurkar et al.2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100, 000+ questions for machine comprehension of text. In Jian Su, Xavier Carreras, and Kevin Duh, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 2383–2392. The Association for Computational Linguistics.
- [Rajpurkar et al.2018] Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don’t know: Unanswerable questions for squad. In Iryna Gurevych and Yusuke Miyao, editors, Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, pages 784–789. Association for Computational Linguistics.
- [Reddy et al.2018] Siva Reddy, Danqi Chen, and Christopher D. Manning. 2018. Coqa: A conversational question answering challenge. CoRR, abs/1808.07042.
- [Seo et al.2016] Min Joon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2016. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603.
- [Vaswani et al.2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6000–6010.
- [Wang et al.2016] Bingning Wang, Kang Liu, and Jun Zhao. 2016. Inner attention based recurrent neural networks for answer selection. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.
- [Yatskar2018] Mark Yatskar. 2018. A qualitative comparison of coqa, squad 2.0 and quac. CoRR, abs/1809.10735.
- [Yu et al.2018] Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. Qanet: Combining local convolution with global self-attention for reading comprehension. CoRR, abs/1804.09541.