On the Multi-Property Extraction and Beyond

06/15/2020 ∙ by Tomasz Dwojak, et al.

In this paper, we investigate the Dual-source Transformer architecture on the WikiReading information extraction and machine reading comprehension dataset. The proposed model outperforms the current state-of-the-art by a large margin. Next, we introduce WikiReading Recycled - a newly developed public dataset, supporting the task of multiple property extraction. It keeps the spirit of the original WikiReading but does not inherit the identified disadvantages of its predecessor.




1 Introduction

The WikiReading dataset proposed by Hewlett et al. (2016) is built on top of Wikipedia articles and properties taken from the WikiData database Vrandečić and Krötzsch (2014). Its objective is to determine property-value pairs for a provided text, e.g. to extract or infer information regarding a described person's occupation, spouse, alma mater, or place of birth, given a related biographical article. An important part of the task is to create a model that is able to extract properties that have not appeared during training.

Our approach to the aforementioned dataset relies on the Transformer architecture, modified to support two source sequences Vaswani et al. (2017); Junczys-Dowmunt and Grundkiewicz (2018). The proposed model consists of a single decoder that generates property values and two encoders with shared weights: one for the property names and one for the article to analyze.

Our work on WikiReading inspired us to create the WikiReading Recycled dataset, in which multiple properties of the same object are extracted at once. The dataset uses the same data as WikiReading but, unlike the original, its validation and test sets share no articles with the train set. Additionally, the test set contains properties not seen during training, posing a challenging subset for current state-of-the-art systems. The human-evaluated test set contains only those properties that can be inferred from the article. Finally, a strong Dual-source Transformer baseline for the WikiReading Recycled task is provided.

2 Related Work

Early work in relation extraction revolved around problems crafted using distant supervision methods Craven and Kumlien (1999). Encoder-decoder models, which had previously been applied to NMT problems Bahdanau et al. (2014), have recently been used to solve information extraction problems formulated with (property name, property value, item) triples Vu et al. (2016), as well as the similar problem of Question Answering Feng et al. (2015). The difference between WikiReading and QA problems lies in how questions are asked, namely whether they are formulated in natural language or given as a raw property name.

In response to this growing interest, the WikiReading dataset with millions of training instances was proposed Hewlett et al. (2016). Many baseline methods were evaluated alongside the dataset. The best-performing model (Placeholder seq2seq) uses placeholders to allow rewriting out-of-vocabulary words to the output, achieving the highest Mean-F1 score among the evaluated baselines.

The following work of Choi et al. (2017) re-evaluated the Placeholder seq2seq model and reached a Mean-F1 score of 75.6. Moreover, the authors proposed a reinforcement learning approach which improved results on the challenging subset of the 10% longest articles. This framework was extended by Wang and Jin (2019) with the addition of a self-correcting action that removes an inaccurate answer from the GRU-based Chung et al. (2014) answer-generation module and continues reading, reaching a 75.8 Mean-F1 score on the whole of WikiReading.

Hewlett et al. (2017) holds the state of the art on WikiReading with their SWEAR model, a hierarchical approach that attends over a sliding window's GRU-generated representations in order to reduce a document to a single vector, from which another GRU network generates the answer. Additionally, the authors set up a strong semi-supervised solution trained on a 1% subset of the data.


3 Dual-source Transformer

The Transformer architecture proposed by Vaswani et al. (2017) was further extended to support two inputs by Junczys-Dowmunt and Grundkiewicz (2018) and successfully utilized in Automatic Post-Editing. We propose to apply this Dual-source Transformer in information extraction and machine reading comprehension tasks.

The architecture consists of two encoders that share parameters and a single decoder. Moreover, both the encoders and decoder share embeddings and vocabulary. In our approach, the first encoder is fed with the text of an article, and the second one takes the names of properties to determine.
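The two-encoder, one-decoder layout can be sketched as follows. This is a minimal NumPy illustration of the data flow only, not the actual Marian implementation: a single linear projection stands in for the full encoder stack, and all weights and dimensions are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # model dimension (illustrative)

# One shared weight matrix: both sources are encoded with the
# same parameters, mirroring the weight sharing described above.
W_enc = rng.normal(size=(d, d)) / np.sqrt(d)

def encode(tokens):
    """Shared-weight encoder stub (stands in for the full stack)."""
    return tokens @ W_enc

def attention(q, k, v):
    """Scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

article = rng.normal(size=(20, d))  # 20 article tokens
props = rng.normal(size=(3, d))     # 3 property-name tokens

enc_article = encode(article)  # same weights ...
enc_props = encode(props)      # ... for both sources

# The decoder cross-attends over both encoded sources at once.
decoder_state = rng.normal(size=(1, d))
memory = np.concatenate([enc_article, enc_props], axis=0)
context = attention(decoder_state, memory, memory)
```

The key point is that the decoder's cross-attention sees the article and the property names as one joint memory, so property names can steer which spans of the article the model attends to.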

Datasets were processed with a SentencePiece model Kudo (2018) trained on a concatenated corpus of inputs and outputs with a vocabulary size of 32,000. Dynamic batching was applied during training in order to use GPU memory optimally (nevertheless, the average batch size was around 100). The model was implemented in the Marian NMT Toolkit Junczys-Dowmunt et al. (2018) and its specification followed Marian's default settings for Transformer models. The only difference was a reduction of the encoder and decoder depths to 4 (the complete configuration file will be available on GitHub).
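The idea behind dynamic batching can be sketched as follows: batches are bounded by a total token budget rather than a fixed number of examples, so long articles do not overflow GPU memory. This is a simplified illustration with an invented `max_tokens` budget, not the exact Marian heuristic.

```python
def dynamic_batches(examples, max_tokens=200):
    """Group examples into batches bounded by total token count.
    Sorting by length first keeps padding waste low; an example
    longer than the budget still gets a batch of its own."""
    batch, batch_tokens = [], 0
    for ex in sorted(examples, key=len):
        if batch and batch_tokens + len(ex) > max_tokens:
            yield batch
            batch, batch_tokens = [], 0
        batch.append(ex)
        batch_tokens += len(ex)
    if batch:
        yield batch

# Toy "articles" represented by their token counts only.
examples = ["a" * n for n in [5, 50, 120, 7, 300, 40]]
batches = list(dynamic_batches(examples))
```

With a token budget, batch size varies per step, which is why the text reports only an average batch size.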

Figure 1: Architecture of Dual-source Transformer as proposed by Junczys-Dowmunt and Grundkiewicz (2018) for Automatic Post-Editing. In the case of WikiReading Recycled and WikiReading, the encoder transforms an article and the corresponding properties separately.

4 WikiReading Recycled

WikiReading Recycled introduces the problem of multi-property information extraction with the goal of evaluating systems that extract any number of given properties at once from the same source text. It is built on WikiReading, the biggest publicly available dataset for information extraction, with improved design and human annotation. In order to achieve that, we propose to merge data instances from all splits (training, validation, and test sets) that contain the same articles by combining their property names and values. The resulting dataset contains approximately 4.1M instances with 703 distinct properties that we split into new training, validation, and test sets.
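The merging step can be illustrated with a small sketch. The instance layout and field names here are invented for the example; the released dataset's exact schema may differ.

```python
from collections import defaultdict

def merge_by_article(instances):
    """Merge single-property (article, property, values) instances
    that share the same article into one multi-property instance,
    as done when building WikiReading Recycled from the splits."""
    merged = defaultdict(dict)
    for article, prop, values in instances:
        merged[article].setdefault(prop, []).extend(values)
    return [{"article": a, "properties": p} for a, p in merged.items()]

# Toy instances that originally lived in different splits.
train = [("Ada Lovelace was a mathematician...", "occupation", ["mathematician"])]
test = [("Ada Lovelace was a mathematician...", "date of birth", ["1815"])]
merged = merge_by_article(train + test)
```

After merging, one article carries all of its known property-value pairs, which is what makes the multi-property extraction setting possible.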

We perceive the model’s generalization abilities (i.e. to extract unseen properties) as an important metric. Therefore, we assigned 20% of the properties to the test set only. In order to make the validation set a good approximation of the test set, another 20% of the properties are validation-only and a set of 10% of the properties are shared between the test and validation splits. This leads to a design where as much as 50% of the properties cannot be seen in the training split, while the remaining 50% of the properties can appear in any split.
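Under these proportions, a property-to-split assignment could be sketched as follows. The random shuffling and seed are illustrative assumptions, not the published procedure.

```python
import random

def assign_property_splits(properties, seed=0):
    """Partition properties: 20% test-only, 20% validation-only,
    10% shared by test and validation, and the remaining 50%
    usable in any split (including training)."""
    props = sorted(properties)
    random.Random(seed).shuffle(props)
    n = len(props)
    a, b, c = int(0.2 * n), int(0.4 * n), int(0.5 * n)
    return {
        "test_only": set(props[:a]),
        "valid_only": set(props[a:b]),
        "test_and_valid": set(props[b:c]),
        "any_split": set(props[c:]),
    }

splits = assign_property_splits([f"P{i}" for i in range(100)])
```

The first three groups together form the 50% of properties that never appear in training, which is what the generalization evaluation relies on.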

The quality of test sets plays a pivotal role in reasoning about systems' performance. Therefore, a group of annotators went through the instances of the test set and assessed whether each property either appears in the article or can be inferred from it. The relevance of this verification is demonstrated by the fact that the Mean-F1 measured on the set before verification was noticeably lower, and 8% of the articles were removed completely. In total, 28% of property values were marked as unanswerable and were removed.

This led to the creation of a new test set, in which the proportions of the properties changed slightly: 27% of the properties in the test set were not seen during training and 15% are test-set-only. Similarly, 36% of the validation-set properties were not seen during training.

It was determined that 46% of expected values in the test set were present in the article explicitly, whereas 54% of test set values were possible to infer.

Data split       Total      Overlap    Overlap %
validation set   1,452,591  1,374,820  94.65
test set         821,409    780,639    95.04
Table 1: The size of the WikiReading splits. The Total column shows the number of unique Wikipedia articles in each split, the Overlap column shows how many of them also appear in the train set, and the last column shows the percentage overlap between the considered set and the train set.

5 Evaluation

The performance of systems is evaluated using the F1 metric, adapted to the specifics of WikiReading Recycled. For WikiReading, Mean-F1 follows the originally proposed metric and computes an F1 score for each property instance, which is then averaged over the whole test split. Because the instance definition has changed, we extend this metric to a new one, Mean-MultiLabel-F1, which handles multiple properties, each of which can have multiple answers. The Mean-MultiLabel-F1 score is calculated for each property name, then averaged per article, and finally averaged over all articles. Mean-MultiLabel-F1 is invariant to the order of generated answers.
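A sketch of the metric under these definitions, assuming exact string matching between answers (the official implementation may normalize answers differently):

```python
def f1(pred, gold):
    """Order-invariant F1 between two answer sets for one property."""
    pred, gold = set(pred), set(gold)
    if not pred or not gold:
        return float(pred == gold)  # both empty counts as correct
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    p, r = tp / len(pred), tp / len(gold)
    return 2 * p * r / (p + r)

def mean_multilabel_f1(predictions, references):
    """predictions/references: one dict per article, mapping a
    property name to its list of answers. F1 is computed per
    property, averaged per article, then averaged over articles."""
    per_article = []
    for pred, gold in zip(predictions, references):
        props = set(pred) | set(gold)
        per_article.append(
            sum(f1(pred.get(p, []), gold.get(p, [])) for p in props) / len(props)
        )
    return sum(per_article) / len(per_article)

gold = [{"occupation": ["mathematician"], "instance of": ["human"]}]
partial = [{"occupation": ["mathematician"]}]
```

Because answers are compared as sets per property, generating the same answers in a different order does not change the score.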

It is worth noting that WikiReading Recycled instances can contain multiple property names. As a consequence, models trained on WikiReading Recycled are able to use the context of one property alone to deduce correct answers about other properties. A model trained only on property names, without seeing the article content, achieves up to a 0.18 Mean-MultiLabel-F1 score. One such example is answering the property instance of with human just by seeing another property name, educated at.

5.1 Baseline

To compare with previous results, we reproduced the basic sequence-to-sequence model from Hewlett et al. (2016). Since the model's description omitted some important details, they had to be assumed before training. We assumed that the model consisted of unidirectional LSTMs and was trained with mean (per-word) cross-entropy loss until no progress was observed for 10 consecutive validations, occurring every 10,000 updates. Input and output sequences were tokenized and lowercased. In addition, truecasing was applied to the output: the syntok tokenizer (https://github.com/fnl/syntok) and a simple RNN-based truecaser proposed by Susanto et al. (2016) were used. During inference, we used a beam size of 8. The rest of the parameters followed the description provided by the authors (the complete configuration file will be available on GitHub).

5.2 Results on WikiReading

The reproduced Basic seq2seq model achieved a Mean-F1 score of 74.8, which is 3 points higher than reported by Hewlett et al. (2016) and less than 1 point lower than the Placeholder seq2seq reimplemented by Choi et al. (2017). The results of our reimplementation may suggest that the methods proposed in the initial WikiReading paper suffered from poor optimization.

We evaluated two training approaches for the dual-encoder model. In the first scenario, we merge all property names related to a given article (Multi-property). In the second one, we train the model on each property name separately (Single-property). In both cases, the evaluation was performed in a single-property manner.

The dual-encoder solution outperforms previous state-of-the-art models. The single-property model achieves a slightly higher performance of 79.9%.

Model                               Mean-F1
Basic s2s                           74.8
Placeholder s2s Choi et al. (2017)  75.6
SWEAR Hewlett et al. (2017)         76.8
Dual-source Transformer
  Multi-property                    79.4
  Single-property                   79.9
Table 2: Results on WikiReading (test set). Basic s2s denotes the re-implemented model described in Section 5.1.

5.3 Results on WikiReading Recycled

Finally, we propose two models as baselines for WikiReading Recycled: the reproduced Basic seq2seq and the dual-encoder model. In addition, we evaluate an ensemble of the four checkpoints that performed best on the validation set. Table 3 presents Mean-MultiLabel-F1 scores on the test set. The dual-encoder model outperforms Basic seq2seq, as in the case of the WikiReading task, achieving a Mean-MultiLabel-F1 of 79.5%. Additionally, the ensemble submission improved on the single-best model by 0.5 points.

Moreover, the test set was split into two subsets for analytic purposes. The first subset contains property values that appear in the article explicitly (exact matches, EM), whereas the second contains the rest of the data, i.e. the property values that must be inferred (IN). Since the precise computation of precision is impossible in this scenario (one cannot determine which incorrect values were predicted for which expected ones), we report only recall on these subsets. The single-best model achieves the highest scores on both subsets: 73.3% on the exact-match subset and 73.9% on the inferable one.
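The subset evaluation reduces to recall over the expected values, which can be sketched as:

```python
def recall(pred_values, gold_values):
    """Fraction of expected values the model produced. Precision is
    ill-defined here: a stray prediction cannot be attributed to a
    specific expected value, so only recall is reported."""
    gold = set(gold_values)
    if not gold:
        return 1.0  # nothing expected, nothing missed
    return len(gold & set(pred_values)) / len(gold)
```

For example, predicting {"mathematician", "writer"} against an expected {"mathematician", "countess"} yields a recall of 0.5, regardless of the extra incorrect value.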

Additionally, we evaluated both models on the subset of property names that did not appear in the training set. To our surprise, both models perform poorly: the Basic seq2seq model achieves a Mean-MultiLabel-F1 of only 2.4%, whereas the dual-source model ignores those properties and does not generate answers for them at all.

Model                    MM-F1  EM    IN
Basic s2s                75.2   66.3  70.8
Dual-source Transformer
  Single                 79.5   73.3  73.9
  Ensemble               80.0   73.1  73.8
Table 3: Results on WikiReading Recycled. We chose the model with the highest score on the validation set for the final submission (Single) and an ensemble of the four best-performing checkpoints (Ensemble). MM-F1 stands for Mean-MultiLabel-F1, EM for the exact-match subset, and IN for the inferable-property subset. Note that the subsets were evaluated with recall instead of MM-F1.

6 Summary

We showed that the Dual-source Transformer outperforms the previous state-of-the-art model on WikiReading by a large margin. The architecture was successfully adapted from Automatic Post-Editing systems to information extraction and machine reading comprehension tasks.

Moreover, WikiReading Recycled was introduced — to the best of our knowledge, the first multi-property information extraction dataset with a human-annotated test set. In this case, a different setting of Dual-source Transformer was applied, significantly outperforming the presented baseline approach.

Both the dataset and models, as well as their detailed configurations required for reproducibility, have been made publicly available.

An analysis of our results on a challenging subset of unseen properties reveals that despite high overall scores, existing systems fail to provide satisfactory performance. Low scores indicate an opportunity to improve, as these properties were verified by annotators and are expected to be answerable. We look forward to seeing models closing this gap and leading to remarkable progress in the field of machine reading comprehension.