Integrating diverse extraction pathways using iterative predictions for Multilingual Open Information Extraction

by Bhushan Kotnis, et al.

In this paper we investigate a simple hypothesis for the Open Information Extraction (OpenIE) task: that it may be easier to extract some elements of a triple if the extraction is conditioned on prior extractions that are themselves easier to extract. We successfully exploit this and propose MiLIE, a neural multilingual OpenIE system that iteratively extracts triples by conditioning extractions on different elements of the triple, leading to a rich set of extractions. The iterative nature of MiLIE also allows for seamlessly integrating rule based extraction systems with a neural end-to-end system, leading to improved performance. MiLIE outperforms SOTA systems on multiple languages ranging from Chinese to Galician thanks to its ability to combine multiple extraction pathways. Our analysis confirms that certain elements of an extraction are indeed easier to extract than others. Finally, we introduce OpenIE evaluation datasets for two low resource languages, namely Japanese and Galician.





1 Introduction

Open Information Extraction (OpenIE) aims to extract structured facts in the form of triples from sentences Etzioni et al. (2008). For example, given a sentence "Tokyo, the capital of Japan, is also the most populous city in Japan.", an OpenIE system is expected to extract the following triples: (Tokyo; capital of; Japan) and (Tokyo; most populous city in; Japan). The different parts of the triple are also referred to as subject, relation and object, respectively. OpenIE extractions are human understandable intermediate representations of facts in source texts Mausam (2016), which are useful in a variety of information extraction end tasks such as summarization Christensen et al. (2013), question answering Khot et al. (2017) and automated schema extraction Nimishakavi et al. (2016).

OpenIE systems largely come in two flavors: (1) unsupervised OpenIE systems that use fine grained rules based on dependency parse trees Del Corro and Gemulla (2013); Gashteovski et al. (2017); Lauscher et al. (2019), and (2) supervised neural OpenIE systems, trained end-to-end with large training datasets Stanovsky et al. (2018); Ro et al. (2020); Kolluru et al. (2020a). The majority of studies indicate that neural OpenIE systems perform better than rule-based systems Bhardwaj et al. (2019); Ro et al. (2020) on English language text. However, a recent study, Gashteovski et al. (2021), challenges these results and demonstrates that traditional dependency parse based methods perform better. This motivates our work of unifying these two paradigms. Additionally, supervised neural systems require a lot of training data, which is not available for specialized domains such as the legal, finance and biomedical domains. Such domains also require high precision extractions, which are typically obtained using rule based systems because the lack of training data makes it infeasible to train a supervised system. This problem is even more acute for languages other than English, due to the lack of resources for training neural models and the need for expert linguists to write sophisticated rules for triple extraction. It is these very problems that we attempt to address using a novel iterative multilingual open information extraction system that combines the best of both worlds, the rule based and the neural paradigm. We term this system MiLIE.

We propose MiLIE, MultiLingual (Iterative) Information Extraction, a system that seamlessly integrates supervised neural end-to-end prediction with linguistic rules for extracting OpenIE triples in multiple languages.

Although MiLIE can be used as a purely neural OpenIE system, it can also be supplemented with linguistic rules. These rule based systems need be neither exhaustive nor complete, i.e., they need not predict the entire triple but only part of it, such as the predicate, subject or object. MiLIE can use such incomplete extractions and predict the remaining parts of the triple. MiLIE can therefore be a boon to existing applications wishing to transition from a rule based information extraction system to a neural one, because it allows them to do so without throwing away their rule based system.

We demonstrate that our system of iterative extraction performs much better compared to current systems in the multilingual setting which includes diverse languages from Chinese to Galician. Additionally we show that MiLIE also meaningfully combines diverse extraction pathways for obtaining a rich set of triples. Further analysis uncovers useful insights on how different languages either make it easy or difficult for OpenIE systems to extract individual elements of the triple. We evaluate MiLIE on both triple extraction as well as the n-ary extraction task, Ro et al. (2020), which involves extracting arguments connected with the object.

Our contributions are summarized as follows:

  1. We propose MiLIE, a multilingual iterative OpenIE system that can seamlessly integrate knowledge from any other neural or rule based information extraction systems for improving performance.

  2. We perform analysis based on ablation studies and uncover interesting insights about the nature of information extraction tasks in different languages.

  3. Extensive experiments on a variety of languages including English demonstrate that MiLIE outperforms recent SOTA systems by a wide margin, especially on languages other than English.

2 Related Work

Since the introduction of the first OpenIE system, TextRunner Etzioni et al. (2008), the area has seen a flurry of work, ranging from feature based pattern learners Fader et al. (2011); Etzioni et al. (2011); Christensen et al. (2011); Mausam et al. (2012); Saha et al. (2017), to unsupervised rule based systems using dependency trees Del Corro and Gemulla (2013); Angeli et al. (2015); Gashteovski et al. (2017); Saha and Mausam (2018), to the recent supervised neural Stanovsky et al. (2018); Roy et al. (2019); Zhan and Zhao (2020) and transformer based systems Ro et al. (2020); Kolluru et al. (2020b, a).

The recent trend of end-to-end neural supervised OpenIE systems is possible thanks to the availability of large scale training data in English, obtained either by filtering outputs of unsupervised OpenIE systems, Roy et al. (2019), or by automatically transforming outputs of QA-SRL systems to OpenIE triples, Stanovsky et al. (2018); Zhan and Zhao (2020); Ro et al. (2020). Stanovsky et al. (2018) train a Recurrent Neural Network (RNN) based architecture using the QA-SRL data.

Roy et al. (2019) use an ensemble model for obtaining training data from several unsupervised rule based systems for training an RNN model. Such models work by casting OpenIE as a sequence tagging task where each token is tagged as subject, predicate or object using a BIO-like tagging scheme. In contrast, SpanOIE, Zhan and Zhao (2020), instead of tagging tokens, first extracts the predicate by predicting its token spans and then, for each extracted predicate, extracts the arguments, again using spans. Another approach is to generate the triples. Sun et al. (2018); Kolluru et al. (2020b) use a sequence-to-sequence model with a copy mechanism for generating triples. IMoJIE, Kolluru et al. (2020b), uses a BERT Devlin et al. (2019) encoder with an LSTM decoder which iteratively extracts triples, i.e., it extracts a complete triple, and the extracted triple is then added to the sentence, marked using [SEP], for extracting the next triple. This is done to prevent extracting redundant triples. This decoding differs from MiLIE, which iteratively extracts individual elements of a triple and does not add the entire triple to the sentence to avoid extracting redundant triples. Additionally, unlike IMoJIE, MiLIE does not generate the extractions but uses a tagging scheme, which eliminates the need for a learnable copy mechanism.

Recently, Kolluru et al. (2020a) proposed the OpenIE6 model, a supervised OpenIE model with a novel iterative grid labelling scheme using a BERT model trained with linguistic constraints. Similar to IMoJIE, OpenIE6 also extracts a complete triple and iterates it through a self-attention mechanism, adding the extracted label embeddings to the sentence embeddings for obtaining additional triples. Moreover, it uses linguistic constraints along with CALM Saha and Mausam (2018), a coordination analyzer, for improving performance. Such linguistic constraints cannot be readily ported to other languages; consequently, OpenIE6 is only evaluated on English. To address multilinguality, Ro et al. (2020) recently proposed Multi2OIE, a neural model that leverages a pretrained multilingual BERT model for transfer learning. Multi2OIE uses BIO tagging to first extract predicates and then, for each predicate, extracts the arguments connected to it using the predicate embedding and a self-attention mechanism. Like MiLIE, Multi2OIE only uses a pretrained BERT model for transfer learning. However, unlike MiLIE, it can neither integrate simple rule based output for further improving multilingual extraction nor extract the triple using different decoding pathways.


Figure 1: MiLIE system architecture.
Figure 2: Iterative extraction dynamics for a decoding pathway. The numbers indicate the iteration number. Iterations are color coded: black is the predicate extraction, green the subject extraction, blue the object extraction and red the argument extraction. Adding markers for extracted elements increases the length of the sequence in subsequent iterations.

The iterative nature of our OpenIE system was motivated by a simple question: is it easier to extract some elements of a triple, say predicates, compared to others, like subjects, for certain types of sentences? And does this vary across languages? For example, consider the sentence "Barack Obama became the US President in the year 2008", which contains two triples, (Barack Obama; became; US President) and (Barack Obama; became US President in; 2008). Extracting the predicate "became US President in" for the second triple is tricky, because the object of the first triple (US President) overlaps with the predicate of the second triple. But if the extraction system were provided with the object (2008) and asked to extract a triple conditioned on this object, then the predicate extraction would be easier. Likewise, there are other sentence constructions where first extracting objects or subjects might be beneficial. Our hypothesis is that iteratively extracting a triple by conditioning on different elements of the triple leads to richer extractions: combining triples extracted from a single sentence using different extraction patterns improves recall, especially for zero shot multilingual tasks. Motivated by this, we propose MiLIE.

We formally describe the task in the following section and the training procedure in section 3.2.

3.1 Iterative Prediction

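| Likelihood function (target given conditioning) | Input Sentence | Target |
|---|---|---|
| predicate, given the sentence | The Taj Mahal was built by Shah Jahan in 1643 | built by |
| subject, given the sentence and predicate | The Taj Mahal was <P>built by<P> Shah Jahan in 1643 | Taj Mahal |
| object, given the sentence, subject and predicate | The <S>Taj Mahal<S> was <P>built by<P> Shah Jahan in 1643 | Shah Jahan |
| argument, given the sentence, subject, predicate and object | The <S>Taj Mahal<S> was <P>built by<P> <O>Shah Jahan<O> in 1643 | in 1643 |
| predicate, given the sentence, subject and object | The <S>Taj Mahal<S> was built by <O>Shah Jahan<O> in 1643. | built by |
| subject, given the sentence and object | The Taj Mahal was built by <O>Shah Jahan<O> in 1643. | Taj Mahal |
| object, given the sentence | The Taj Mahal was built by Shah Jahan in 1643. | Shah Jahan |

Table 1: The table gives a few examples of sentence inputs and corresponding log likelihood functions minimized.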

The transformer model expects a sentence in the form of a sequence of words (tokens) along with either their dependency tags or their part of speech tags. The tokens and their corresponding tags are provided as inputs to the transformer model, up to a fixed maximum number of tokens. We use a language specific dependency tagger for obtaining the tags. Although we target languages that are low resource for the OpenIE task, such languages can be high resource for other tasks such as PoS tagging or dependency parsing, especially since the introduction of universal dependencies Nivre et al. (2016). The task is to extract the subset of tokens which belong to a triple. To do so, we use the BIO tagging scheme. MiLIE consists of four output heads, in charge of predicting the subject, object, predicate and argument, respectively. Each output head outputs a BIO label for every token. Fig. 1 illustrates the multi output head architecture. The output heads use the final transformer hidden state to predict these labels.
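To make the tagging scheme concrete, the following minimal sketch decodes the BIO labels of a single output head into token spans. The function name `bio_to_spans` and the whitespace tokenization are illustrative assumptions, not part of MiLIE.

```python
from typing import List, Tuple


def bio_to_spans(labels: List[str]) -> List[Tuple[int, int]]:
    """Turn a B/I/O label sequence into (start, end) spans, end exclusive."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        if label == "B":
            if start is not None:      # a new span begins right after another one
                spans.append((start, i))
            start = i
        elif label == "I":
            if start is None:          # tolerate an I without a preceding B
                start = i
        else:                          # "O": close any open span
            if start is not None:
                spans.append((start, i))
                start = None
    if start is not None:              # span running to the end of the sentence
        spans.append((start, len(labels)))
    return spans


# Hypothetical output of the predicate head for the Table 1 sentence.
tokens = "The Taj Mahal was built by Shah Jahan in 1643".split()
predicate_labels = ["O", "O", "O", "O", "B", "I", "O", "O", "O", "O"]
predicates = [" ".join(tokens[s:e]) for s, e in bio_to_spans(predicate_labels)]
```

Each of the four heads would produce its own label sequence over the same tokens, so the same decoding applies to subjects, objects and arguments.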

The order in which the different triple parts are extracted can be varied. This allows us to investigate how difficult it is to extract triple elements in a specific order in different languages. Additionally, different pathways aid different kinds of extractions, and combining them results in a richer set of extractions. Choosing a particular order defines a decoding pathway: a sequence of output heads, each drawn from the subject, predicate, object and argument heads. For example, the pathway predicate → subject → object → argument denotes the sequence of output functions obtained by applying the predicate head, the subject head, the object head and the argument head in that order. Crucially, each output head is conditioned on the input and on the output labels extracted by the previous function. This feature allows MiLIE to seamlessly integrate rule based systems with neural systems, since the conditioning can also be done on extractions obtained from rule based systems. The output labels of the previous function are added to the input tokens as shown in Fig. 2.

During training we minimize different log-likelihood functions that ensure the model learns to predict conditioned on previous extractions. The function minimized depends on the training instance, so each training instance may optimize a different log likelihood function. We describe the log likelihood functions along with a few examples of training instances in Table 1. During training we use random sampling, described in section 3.2, to ensure that each likelihood function is minimized with enough training instances.

At prediction time, along with the input sentence, the model also expects the extractions predicted by the previous iterations. To provide this information we add special symbols to the sentence that explicitly mark the previous extractions. We surround the predicate with the symbol <P>, the subject with <S> and the object with <O>. For example, for predicting the object given the predicate extracted in the previous iteration, the extracted predicate is marked in the sentence using the <P> symbol and the sentence is subsequently passed through the transformer for predicting the object using the object head.
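The marking step can be sketched as follows. This is a minimal illustration that assumes extractions are given as token spans and that each marker is inserted as a separate token; the helper name `mark_extractions` is hypothetical, and the resulting whitespace differs slightly from the rendering in Table 1.

```python
from typing import Dict, List, Tuple

# Marker symbols as described in the text; the same symbol opens and closes a span.
MARKERS = {"predicate": "<P>", "subject": "<S>", "object": "<O>"}


def mark_extractions(tokens: List[str], spans: Dict[str, Tuple[int, int]]) -> List[str]:
    """Surround each previously extracted span (end exclusive) with its marker."""
    opening: Dict[int, List[str]] = {}
    closing: Dict[int, List[str]] = {}
    for role, (start, end) in spans.items():
        opening.setdefault(start, []).append(MARKERS[role])
        closing.setdefault(end, []).append(MARKERS[role])
    marked: List[str] = []
    for i, token in enumerate(tokens):
        marked.extend(closing.get(i, []))   # close spans that ended before token i
        marked.extend(opening.get(i, []))   # open spans that start at token i
        marked.append(token)
    marked.extend(closing.get(len(tokens), []))
    return marked


tokens = "The Taj Mahal was built by Shah Jahan in 1643".split()
# Iteration 2: the predicate from iteration 1 is marked before predicting the subject.
step_two_input = mark_extractions(tokens, {"predicate": (4, 6)})
# Iteration 3: subject and predicate are both marked before predicting the object.
step_three_input = mark_extractions(tokens, {"predicate": (4, 6), "subject": (1, 3)})
```

Every marked element adds two tokens, which is why the sequence grows from one iteration to the next.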

We always extract the arguments at the last iteration, and therefore we do not mark the arguments in the sentence. There are two reasons for doing so: the first is the added computational cost of considering all possible permutations of the argument order, and the second is that our preliminary experiments suggested that arguments are best predicted after the rest of the triple is predicted.

3.2 Training

For effectively extracting different elements of the triple conditioned on different elements, the model needs to see such combinations during training. However, enumerating all possible combinations exhaustively is prohibitively expensive. We propose a sampling technique that ensures that the model sees all possible combinations of a target and prior extractions. This is done by creating a training set that simulates a prior extraction and forces the model to predict the next extraction. To ensure that the number of training data points does not explode, we randomly sample one order for each training instance. Some elements in the triple are marked in the sentence while others are used as target labels. Note that we allow for multiple instances of the target labels, but there is only one instance of the marked element. For example, given one subject, the target could be multiple predicates. This procedure trains the model to predict an appropriate label conditioned on a variety of previous predictions.
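A minimal sketch of this sampling idea is given below. It assumes one gold extraction per training instance stored as a dictionary of strings, draws one random order, and then picks how many elements are treated as already extracted; the function name and data layout are illustrative assumptions, and the handling of multiple target instances mentioned above is not modelled.

```python
import random
from typing import Dict, List, Tuple

ROLES = ("subject", "predicate", "object")


def sample_training_instance(
    extraction: Dict[str, str], rng: random.Random
) -> Tuple[Dict[str, str], str, str]:
    """Sample which elements are marked in the sentence and which one is the target."""
    order: List[str] = list(ROLES)
    rng.shuffle(order)                # one random extraction order per instance
    order.append("argument")          # arguments are always extracted last
    k = rng.randrange(len(order))     # number of elements simulated as prior extractions
    marked = {role: extraction[role] for role in order[:k]}
    target_role = order[k]
    return marked, target_role, extraction[target_role]


extraction = {
    "subject": "Taj Mahal",
    "predicate": "built by",
    "object": "Shah Jahan",
    "argument": "in 1643",
}
samples = [sample_training_instance(extraction, random.Random(seed)) for seed in range(50)]
```

Because the order is drawn at random per instance, the training set grows linearly with the number of extractions while still covering the different conditioning patterns across the dataset.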

We train and evaluate MiLIE on an NVIDIA Titan RTX with 24 GB RAM. The training is done for a maximum of two epochs and each epoch takes about 9 hours to complete. The maximum sentence length in the English training and validation datasets is about 100 tokens. Because extracted triple element markers are added to the input, we allow a slack of 20 tokens, thus fixing the maximum sentence length to 120. We use the maximum batch size that fits on the GPU, which is 192. We use the Adam optimizer with linear warmup and tune the learning rate and warmup percentage.
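For readers unfamiliar with linear warmup, the sketch below shows one common form of such a schedule in plain Python. The linear decay after warmup, the function name, and all numeric values are assumptions for illustration; the text above only states that linear warmup is used and that the learning rate and warmup percentage are tuned.

```python
def linear_warmup_lr(step: int, total_steps: int, base_lr: float, warmup_fraction: float) -> float:
    """Learning rate at a given step: linear ramp-up, then (assumed) linear decay to zero."""
    warmup_steps = int(total_steps * warmup_fraction)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / remaining)


# Hypothetical values, chosen only to show the shape of the schedule.
schedule = [linear_warmup_lr(s, total_steps=1000, base_lr=5e-5, warmup_fraction=0.1)
            for s in (0, 50, 100, 550, 1000)]
```

The warmup fraction plays the role of the warmup percentage mentioned above: it sets how many of the first updates use a reduced learning rate.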

Example 1
English: The stock pot should be chilled and the solid lump of dripping which settles when chilled should be scraped clean and re-chilled for future use.
Ro et al. (2020) translation: La olla de caldo debe ser enfriado y la masa sólida de goteo que se asienta cuando [se] enfriada se debe raspar limpio y re-enfriada para uso futuro.
Error explanation:

  • "enfriado": the gender of the adjective doesn't match the noun.

  • "[se]": missing reflexive particle.

  • "enfriada": wrong use of the participle.

  • "raspar limpio": syntactic error.

Example 2
English: However, StatesWest isn't abandoning its pursuit of the much-larger Mesa.
Ro et al. (2020) translation: Sin embargo, StatesWest no abandona su búsqueda de la tan - Mesa grande.
Error explanation:

  • "tan - Mesa grande": syntactically and semantically incorrect.

Example 3
English: The rest of the group reach a small shop, where Brady attempts to phone the Sheriff, but the crocodile breaks through a wall and devours Annabelle.
Ro et al. (2020) translation: El resto del grupo llega a una pequeña tienda, donde Brady intentos de teléfono del Sheriff, pero los saltos de cocodrilo a través de una pared, y devora a Annabelle.
Error explanation:

  • "intentos": the number and the gender don't match the noun.

  • "de teléfono del Sheriff": teléfono cannot be used as a verb.

  • "los saltos de cocodrilo a través de una pared": semantically incorrect.

Table 2: Examples of incorrectly translated sentences. Using red we highlight mistranslated words, using blue, missing words, and with a strikethrough the parts that are semantically or syntactically incorrect.

3.3 Negative Sampling

Iterative prediction is prone to error amplification, i.e., if an error is made during the first iteration then the error propagates and affects subsequent extractions. Anticipating this, we train MiLIE to recognize extraction errors made in the previous iteration. To this end we augment the training data with corrupted data points containing incorrectly marked extractions. For each incorrect extraction the model is trained to predict a blank extraction, i.e., to predict the outside label for all tokens. We use a sampling procedure similar to the one described in the previous section. For each of a fixed number of training data points, we create two negative samples, one simulating conditioning on one incorrect extraction and the other simulating conditioning on two incorrectly extracted elements of the triple. For each negative sample we corrupt the extraction using three techniques: (1) corrupting the predicate by replacing it with randomly chosen tokens from the sentence, (2) corrupting the subject and object by exchanging them, and (3) mismatching the subject object pairs from different triples.
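The three corruption techniques can be sketched as below. This is an illustration under simplifying assumptions: triples are dictionaries of strings, and the corrupted predicate is drawn as a random contiguous span of the same length (the text only says "randomly chosen tokens"). All function names are hypothetical.

```python
import random
from typing import Dict, List

Triple = Dict[str, str]


def corrupt_predicate(tokens: List[str], triple: Triple, rng: random.Random) -> Triple:
    """(1) Replace the predicate with a random span of equal length from the sentence."""
    width = len(triple["predicate"].split())
    candidates = [" ".join(tokens[i:i + width]) for i in range(len(tokens) - width + 1)]
    candidates = [c for c in candidates if c != triple["predicate"]]
    return {**triple, "predicate": rng.choice(candidates)}


def swap_subject_object(triple: Triple) -> Triple:
    """(2) Exchange subject and object."""
    return {**triple, "subject": triple["object"], "object": triple["subject"]}


def mismatch_pair(triple: Triple, other: Triple) -> Triple:
    """(3) Pair the subject of one triple with the object of another."""
    return {**triple, "object": other["object"]}


def blank_target(tokens: List[str]) -> List[str]:
    """Target for every negative sample: the outside label for all tokens."""
    return ["O"] * len(tokens)


tokens = "Barack Obama became the US President in the year 2008".split()
first = {"subject": "Barack Obama", "predicate": "became", "object": "US President"}
second = {"subject": "Barack Obama", "predicate": "became US President in", "object": "2008"}

bad_predicate = corrupt_predicate(tokens, first, random.Random(0))
swapped = swap_subject_object(first)
mismatched = mismatch_pair(first, second)
target = blank_target(tokens)
```

In each case the corrupted elements would be marked in the sentence as if they were prior extractions, and the model is trained to output the blank target.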

3.4 Aggregating Decoding Pathways

Fixing the n-ary argument extraction in the final iteration, we obtain six decoding pathways, one for each ordering of predicate (P), subject (S) and object (O), each followed by the arguments (A): P→S→O→A, P→O→S→A, S→P→O→A, S→O→P→A, O→P→S→A and O→S→P→A. Consider a given decoding pathway, say P→S→O→A. In this case all the predicates are extracted first, then for each predicate the subjects are extracted, then for each predicate subject pair the objects are extracted, and finally for every extracted predicate, subject, object tuple all the n-ary arguments are extracted. This extraction procedure preserves the relationships between the extracted elements, resulting in correctly extracting multiple triples. Fig. 2 illustrates this procedure.
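The nested procedure can be sketched as follows. The model is replaced by a toy `oracle_extract` function that looks up gold triples for the Tokyo sentence from the introduction, so the example only illustrates the control flow; the function names and the dictionary representation are assumptions.

```python
from itertools import permutations
from typing import Callable, Dict, List, Tuple

# Six pathways: every ordering of predicate, subject and object, with arguments last.
PATHWAYS: List[Tuple[str, ...]] = [
    order + ("argument",) for order in permutations(("predicate", "subject", "object"))
]

GOLD = [
    {"subject": "Tokyo", "predicate": "capital of", "object": "Japan"},
    {"subject": "Tokyo", "predicate": "most populous city in", "object": "Japan"},
]


def oracle_extract(role: str, sentence: str, context: Dict[str, str]) -> List[str]:
    """Toy stand-in for an output head: gold values consistent with the prior extractions."""
    if role == "argument":
        return []  # the toy triples have no n-ary arguments
    matches = [g for g in GOLD if all(g[r] == v for r, v in context.items())]
    return sorted({g[role] for g in matches})


def decode(sentence: str, pathway: Tuple[str, ...], extract: Callable) -> List[dict]:
    """Extract one element per iteration, conditioning on everything extracted so far."""
    partial: List[dict] = [{}]
    for role in pathway:
        extended = []
        for context in partial:
            found = extract(role, sentence, context)
            if role == "argument":
                extended.append({**context, "argument": found})
            else:
                extended.extend({**context, role: value} for value in found)
        partial = extended
    return partial


sentence = "Tokyo, the capital of Japan, is also the most populous city in Japan."
results = {pathway: decode(sentence, pathway, oracle_extract) for pathway in PATHWAYS}
```

With a real model the pathways would generally disagree, which is what the aggregation step in this section exploits.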

We hypothesize that some triples are easier to predict when the predicate is extracted first, while others may be obtained more easily using subject first extraction, and that this can vary across languages. This also means that some decoding pathways are more error prone than others. Therefore, aggregating triples from different decoding pathways may improve recall.

We propose a simple algorithm, which we term Water Filling (WF), for aggregating the extractions. It is inspired by the power allocation problem in the communication engineering literature Kumar et al. (2008). Imagine a thirsty person with access to different pots of water with varying levels of purity, with the caveat that the amount of water in a pot is inversely proportional to its purity. The natural solution is to first drink the high purity water and then move on to the pots in decreasing order of purity until the thirst is quenched. We use the same idea. Treating each decoding pathway as an expert, we assume that the triples extracted by all 6 pathways are more accurate than those extracted by only 5 pathways, 4 pathways and so on. This can be thought of as triples obtaining votes from experts. Starting with an empty set, for each sentence we add triples to the set in order of decreasing number of received votes. The normalized number of votes a triple receives is used as the confidence value of the triple.
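A minimal sketch of this voting scheme is shown below, assuming each pathway returns its triples as tuples of strings. The optional `max_triples` cut-off is a hypothetical parameter standing in for "until the thirst is quenched"; the text does not specify a stopping criterion.

```python
from collections import Counter
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]


def water_filling(
    pathway_outputs: List[List[Triple]], max_triples: Optional[int] = None
) -> List[Tuple[Triple, float]]:
    """Rank triples by how many pathways extracted them; confidence = normalized votes."""
    votes: Counter = Counter()
    for triples in pathway_outputs:
        votes.update(set(triples))            # each pathway votes at most once per triple
    ranked = sorted(votes.items(), key=lambda item: (-item[1], item[0]))
    if max_triples is not None:
        ranked = ranked[:max_triples]
    return [(triple, count / len(pathway_outputs)) for triple, count in ranked]


a = ("Tokyo", "capital of", "Japan")
b = ("Tokyo", "most populous city in", "Japan")
c = ("Tokyo", "is", "capital")                # a spurious triple found by one pathway only
aggregated = water_filling([[a, b], [a], [a, c]])
```

Thresholding on the confidence value then trades precision against recall: a high threshold keeps only triples most pathways agree on.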

Although the procedure is explained in a sequential manner, it can be parallelized by replicating the model weights and running all 6 decoding pathways in parallel. After obtaining the triples from each decoding pathway in parallel, they can be combined in a sequential manner.

3.5 Binarizing n-ary Extractions

We train MiLIE using RE-OIE2016, which is an n-ary OpenIE dataset. In addition, we would also like to evaluate MiLIE on a binary triple extraction dataset, i.e., with only (subject, predicate, object) extractions. One simple way to convert the n-ary extractions to binary extractions is to ignore the n-ary arguments. However, this leads to a decrease in recall, because the n-ary arguments may not be part of other extracted triples due to the initial n-ary extraction. Another method is to simply treat the extracted n-ary arguments as objects of the same subject, predicate pair. This ensures that the extracted arguments are not dropped, but it may result in a drop in precision since the n-ary argument may not attach to the same predicate. For example, consider the extraction (Barack Obama; became; US President; in the year 2008). Treating n-ary arguments as objects results in (Barack Obama; became; US President) and (Barack Obama; became; in the year 2008).

The iterative nature of MiLIE allows us to elegantly address the problem of converting n-ary extractions into a binary format. We treat the extracted n-ary arguments as hypothesized objects. We then provide the extracted subject and hypothesized object pair to the model, which extracts a new predicate conditioned on the previously extracted subject and the hypothesized object. This creates a possibility of extracting the correct predicate, something that is not possible with existing n-ary OpenIE systems.
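The binarization step can be sketched as follows. The model call is replaced by a toy lookup table, and the predicate it returns is made up purely for illustration; `binarize` and `toy_predict_predicate` are hypothetical names.

```python
from typing import Callable, List, Optional, Tuple

NaryExtraction = Tuple[str, str, str, List[str]]   # subject, predicate, object, arguments
Triple = Tuple[str, str, str]


def binarize(
    extraction: NaryExtraction,
    predict_predicate: Callable[[str, str], Optional[str]],
) -> List[Triple]:
    """Turn one n-ary extraction into binary triples via hypothesized objects."""
    subject, predicate, obj, arguments = extraction
    triples = [(subject, predicate, obj)]
    for argument in arguments:
        # Re-predict the predicate conditioned on the subject and the hypothesized object.
        new_predicate = predict_predicate(subject, argument)
        if new_predicate is not None:          # a blank prediction drops the hypothesis
            triples.append((subject, new_predicate, argument))
    return triples


# Toy stand-in for the predicate head; a real system would run the model here.
TOY_PREDICTIONS = {("Taj Mahal", "in 1643"): "was built"}


def toy_predict_predicate(subject: str, hypothesized_object: str) -> Optional[str]:
    return TOY_PREDICTIONS.get((subject, hypothesized_object))


nary = ("Taj Mahal", "built by", "Shah Jahan", ["in 1643", "by hand"])
binary_triples = binarize(nary, toy_predict_predicate)
```

Unlike simply reusing the original predicate for every argument, this leaves room for the model to supply a different predicate or to reject the hypothesized object altogether.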

3.6 Integrating Linguistic Rule based systems

The iterative nature of prediction allows us to predict parts of a triple conditioned on any other part of the triple. For example, if a linguistic rule based system works well for extracting objects, MiLIE can complete the missing parts of the triple conditioned on the objects. We simulate this by using ClausIE to extract objects and then use MiLIE for extracting the corresponding subject and predicate conditioned on the ClausIE objects.

We treat the output of the rule based system as potential objects paired with subjects and extract the predicate connecting them. If the rule based extraction is incorrect, then MiLIE can detect the error and extract nothing. This results in more accurate extractions compared to simply post-processing the extracted tokens using linguistic rules.

4 Experiments

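| BenchIE | Chinese F1 | Chinese Prec. | Chinese Rec. | German F1 | German Prec. | German Rec. | Japanese F1 | Japanese Prec. | Japanese Rec. | Galician F1 | Galician Prec. | Galician Rec. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| M2OIE | 17.06 | 25.71 | 12.78 | 4.00 | 8.95 | 2.57 | | | | | | |
| MiLIE | 20.50 | 25.15 | 17.30 | 8.54 | 13.44 | 6.35 | | | | | | |
| MiLIE - DEP | 19.25 | 19.82 | 18.71 | 8.42 | 11.28 | 6.72 | | | | | | |
| MiLIE - NS | 17.32 | 19.64 | 15.49 | 10.28 | 14.33 | 8.01 | | | | | | |
| MiLIE - Bin | 20.04 | 22.0 | 18.41 | 8.98 | 13.54 | 6.72 | | | | | | |

Table 3: MiLIE performance comparison on the multilingual BenchIE benchmark. M2OIE was trained using code supplied in Ro et al. (2020) and zero-shot evaluated on the multilingual BenchIE test sets.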

4.1 Datasets and Evaluation

We use the RE-OIE2016 training dataset used in Ro et al. (2020) and introduced by Zhan and Zhao (2020). This training dataset contains n-ary extractions allowing MiLIE to be evaluated on both n-ary as well as binary extraction benchmarks. We use the CaRB benchmark introduced in Bhardwaj et al. (2019) for evaluating English OpenIE n-ary extraction. However, the CaRB benchmark also suffers from serious shortcomings.

Gashteovski et al. (2021) discovered that a simple OpenIE system that breaks the sentence into a triple at the verb boundary achieves recall and precision. This is a problem because it indicates that simply adding extraneous words to the extraction results in improved recall. For this reason we also evaluate on BenchIE, an exhaustive fact based multilingual OpenIE benchmark proposed by Gashteovski et al. (2021). The BenchIE evaluation benchmark evaluates explicit binary extractions in English, Chinese and German. The authors also provide an annotation tool, AnnIE Friedrich et al. (2021), for extending the benchmark to additional languages. We used the AnnIE annotation tool for creating BenchIE style benchmarks for Japanese and Galician with the help of native Japanese and Galician speakers.

Additionally we also evaluate MiLIE on the Spanish and Portuguese multilingual datasets introduced in Ro et al. (2020), using the lexical match evaluation strategy. The lexical match evaluation strategy was introduced in Stanovsky et al. (2018) and its numerous shortcomings are discussed in Bhardwaj et al. (2019). Although problematic, we still use the lexical match procedure for fair comparison with the Multi2OIE baseline. Ro et al. (2020) translated sentences and n-ary extractions from the CaRB test set to Spanish and Portuguese using the Google Translate API. We managed to obtain 100 random samples from these test sets evaluated by native Spanish and Portuguese speakers. To our surprise we discovered that around 70 percent of the sentence or extraction translations were inaccurate. Table 2 shows a few examples of the incorrect translations.

For an accurate and clean comparison with Multi2OIE we also cleaned up part of the Spanish test set by re-translating about 100 sentences and their extractions into Spanish. These translations were done by native Spanish speakers. In summary, we evaluate MiLIE on (1) the CaRB and BenchIE benchmarks for English, (2) the noisy Spanish and Portuguese datasets from Ro et al. (2020) along with the Spanish dataset cleaned by us, and finally (3) the BenchIE datasets in German and Chinese introduced by Gashteovski et al. (2021), in addition to the BenchIE style datasets in Japanese and Galician introduced in this paper. In total we evaluate MiLIE on English and six other low resource languages.

For hyperparameter tuning we use the CaRB English validation set and use the F1 scores obtained using the CaRB evaluation procedure for comparing models with different hyperparameters.

| Lexical Match | Spanish F1 | Spanish Prec. | Spanish Rec. | Portuguese F1 | Portuguese Prec. | Portuguese Rec. | Spanish-Clean F1 | Spanish-Clean Prec. | Spanish-Clean Rec. |
|---|---|---|---|---|---|---|---|---|---|
| M2OIE | 60.2 | 59.10 | 61.20 | 59.10 | 56.10 | 62.50 | | | |
| MiLIE | 64.21 | 69.50 | 59.67 | 65.60 | 70.19 | 61.59 | | | |

Table 4: Performance comparison on Spanish and Portuguese datasets from Ro et al. (2020) with Lexical Match. Numbers for Spanish and Portuguese are obtained from Ro et al. (2020) while we train M2OIE using the supplied code and evaluate it on the cleaned Spanish set.

4.2 Results

We compare MiLIE with both unsupervised and supervised baselines. Specifically, we compare MiLIE with ClausIE, MinIE, Stanford-OIE, RNN-OIE, OIE6 and Multi2OIE on English. Out of all these systems only Multi2OIE is capable of extracting triples from multiple languages. We did not find any other neural multilingual OpenIE system in the literature capable of extracting triples in the languages studied in this paper. Therefore we compare only with the Multi2OIE system on languages other than English. Evaluation on languages other than English is always zero-shot, i.e., the model is trained using only the English RE-OIE2016 dataset and validated on the English CaRB dataset.

The MiLIE model is trained using negative sampling and includes the dependency tag information. For BenchIE, MiLIE uses the binarization function described in Section 3.5, but for CaRB and lexical match it does not do so because they evaluate n-ary extraction. On the CaRB English benchmark we use results for baselines reported in Ro et al. (2020) and Kolluru et al. (2020a). For evaluating on BenchIE, we run all the baselines on the BenchIE English evaluation benchmark. For multilingual BenchIE we train Multi2OIE using the code and hyperparameters supplied in the paper.

4.2.1 Multilingual

In Table 3, we compare MiLIE with Multi2OIE (M2OIE) on the multilingual BenchIE evaluation benchmark. MiLIE performs significantly better than Multi2OIE on all the languages in the zero shot setting. For German, both Multi2OIE and MiLIE perform much worse than on other languages, even though German is closer to English than Chinese and Japanese are. The reason is the presence of separable prefixes in German verbs, which cannot be extracted using BIO tags. The BIO tagging scheme assumes continuity of phrases, which does not hold for most German verbs present in predicates, resulting in extremely low recall. Ablation results also indicate the usefulness of adding the dependency tags as well as the use of negative sampling.
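The continuity problem can be seen in a small example. The sentence below is our own illustration, not taken from the benchmark: in "Er ruft seine Mutter an" ("He calls his mother"), the verb "anrufen" is split into "ruft" and the separated prefix "an". Tagging both pieces as predicate tokens yields two unrelated spans once the BIO labels are decoded.

```python
from itertools import groupby

tokens = ["Er", "ruft", "seine", "Mutter", "an"]
# Best-case BIO labelling of the predicate "ruft ... an": two separate B tags.
predicate_labels = ["O", "B", "O", "O", "B"]

# Group consecutive non-O labels into contiguous spans, as a BIO decoder would.
spans = []
for is_inside, group in groupby(enumerate(predicate_labels), key=lambda pair: pair[1] != "O"):
    if is_inside:
        spans.append(" ".join(tokens[i] for i, _ in group))
```

The decoder recovers "ruft" and "an" as two independent predicates, and nothing in the label sequence records that they belong to the same verb.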

Table 4 shows that MiLIE performs better than Multi2OIE on the noisy Spanish and Portuguese test as well as the cleaned Spanish test set. Note that the numbers are higher due to the lexical matching evaluation procedure.

4.2.2 English

In Table 5, we compare MiLIE with several unsupervised and supervised baselines in English using the CaRB and BenchIE datasets and evaluation algorithms. Relative to the other neural baselines, MiLIE performs much better on BenchIE than on CaRB. The reason is that CaRB punishes compact extractions while rewarding overly long extractions Gashteovski et al. (2021).

English    CaRB-nary              BenchIE-binary
           F1     Prec.  Rec.     F1     Prec.  Rec.
ClausIE    44.9   n/a    n/a      33.89  50.29  25.56
MinIE      41.9   n/a    n/a      33.72  42.91  27.78
Stanford   23.0   n/a    n/a      12.99  11.08  15.70
R-OIE      46.7   55.6   40.2     12.97  37.32  7.85
S-OIE      49.4   60.9   41.6     n/a    n/a    n/a
OIE6       52.7   n/a    n/a      25.36  31.11  21.41
M2OIE      52.3   60.9   45.8     22.80  39.24  16.07
MiLIE      44.98  48.64  41.83    27.88  36.65  22.37
n/a        41.22  44.16  38.65    26.71  31.10  23.40
- NS       44.71  47.58  42.17    25.80  29.57  22.88
- Bin      n/a    n/a    n/a      27.72  34.63  23.11
Table 5: MiLIE performance comparison on the CaRB and BenchIE English benchmarks. For CaRB, the numbers for OpenIE6 (OIE6) are taken from Kolluru et al. (2020a), and the numbers for ClausIE, MinIE, Stanford-OIE, RNN-OIE (R-OIE) and Span-OIE (S-OIE) are taken from Ro et al. (2020). For BenchIE, we run all the baselines on the BenchIE English dataset. An entry of n/a marks a value or row label that is not available.

4.2.3 Hybrid OpenIE

MiLIE can easily integrate any rule-based system that extracts even a part of the triple. To evaluate this, we simulate a system that extracts only the object and use MiLIE to extract the other parts of the triple. We do this in two ways: (1) we use the ClausIE system to extract triples from the BenchIE English data and keep only the object, discarding the rest of the triple, and (2) we use the objects from the BenchIE English test set and use MiLIE to complete the rest of the triple. The first setting evaluates how well MiLIE completes a triple when combined with a good but error-prone object extraction system. The second evaluates how well MiLIE completes a triple when given an accurate extraction.

We use the Multi2OIE system as a baseline: we extract triples using Multi2OIE and eliminate those triples whose objects are not among the objects extracted by ClausIE or, in the second setting, the gold objects from BenchIE. We take the objects from ClausIE or from the BenchIE gold annotations because neural systems are not good at extracting objects (Kolluru et al., 2020a). This is also seen in the additional experiments detailed in Section 5. By combining objects extracted by a rule-based system with the other elements extracted by a neural system, we investigate whether it is possible to obtain the best of both worlds.
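The filtering step used for this baseline can be sketched as follows. This is a minimal illustration under stated assumptions: the `Triple` type, the `filter_by_objects` helper, the example triples, and the case-insensitive exact-match criterion are all hypothetical, since the text does not specify how objects are matched.

```python
from typing import List, NamedTuple, Set


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def filter_by_objects(triples: List[Triple], allowed_objects: Set[str]) -> List[Triple]:
    """Keep only the triples whose object appears in the allowed set.

    Assumption: objects match if they are equal after trimming and lowercasing.
    """
    normalized = {o.strip().lower() for o in allowed_objects}
    return [t for t in triples if t.obj.strip().lower() in normalized]


# Hypothetical output of a neural extractor for one sentence.
neural_triples = [
    Triple("Marie Curie", "won", "the Nobel Prize"),
    Triple("Marie Curie", "was born in", "Warsaw"),
    Triple("Marie Curie", "won", "Prize"),
]
# Hypothetical objects from a rule-based system or from gold annotations.
reference_objects = {"the Nobel Prize", "Warsaw"}

kept = filter_by_objects(neural_triples, reference_objects)
```

In this toy case the third triple is dropped because its object "Prize" is not in the reference set. A looser criterion, such as token overlap, would keep more triples and is an equally plausible reading of the setup.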

Table 6 confirms that combining rule-based object extraction with neural MiLIE improves performance. Note that we use ClausIE only as a method to extract objects for studying hybrid systems; this is not an attempt to fuse ClausIE with MiLIE.

English          F1     Precision  Recall
ClausIE          33.98  50.44      25.63
MiLIE            27.88  36.65      22.37
MiLIE + Object   29.71  32.35      27.48
Table 6: Performance comparison of Hybrid MiLIE on English BenchIE. All scores are percentages.

5 Analysis

F1-Score  EN     DE    ZH     JP   ES(Clean)
SPOA      26.27  8.65  20.32  n/a  n/a
SOPA      24.95  8.19  18.24  n/a  n/a
PSOA      27.70  8.75  19.46  n/a  n/a
POSA      27.42  8.12  19.43  n/a  n/a
OSPA      22.35  7.96  17.09  n/a  n/a
OPSA      22.24  7.93  17.47  n/a  n/a
WF        27.88  8.54  20.50  n/a  n/a
Table 7: Comparison between different decoding schemes. DYN denotes dynamic decoding and WF water filling. An entry of n/a marks a value that is not available.

We hypothesized that it is MiLIE's ability to extract triples using different extraction patterns that results in improved performance on multilingual data. We test this hypothesis with a simple experiment: we compare MiLIE with water filling aggregation against MiLIE using each individual decoding pathway. Additionally, we compare with a dynamic decoding scheme in which MiLIE chooses a decoding pathway based on the sentence. To do this, we split off a part of the English training set and, for each sentence in the split, record the extraction pathway that gives MiLIE the best F1 score under the CaRB evaluation. We then use this as training data for another mBERT model, which classifies each sentence into one of six classes, where each class represents an extraction pathway. While creating the training split, we found that for most sentences one particular extraction path results in the best F1 score, which unsurprisingly also carries over to test time.
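The label construction for the dynamic decoding classifier can be sketched as below. The pathway names follow Table 7; the helpers `best_pathway` and `build_labels`, the tie-breaking rule, and all F1 values are hypothetical and do not reflect the actual results.

```python
from collections import Counter
from typing import Dict, List

# The six single-pathway decoding schemes listed in Table 7.
PATHWAYS = ["SPOA", "SOPA", "PSOA", "POSA", "OSPA", "OPSA"]


def best_pathway(f1_by_pathway: Dict[str, float]) -> str:
    """Return the pathway with the highest F1.

    Assumption: ties go to the pathway that comes first in PATHWAYS.
    """
    return max(PATHWAYS, key=lambda p: f1_by_pathway[p])


def build_labels(per_sentence_f1: List[Dict[str, float]]) -> List[int]:
    """Map each sentence to the class index of its best pathway."""
    return [PATHWAYS.index(best_pathway(scores)) for scores in per_sentence_f1]


# Made-up per-sentence F1 scores, one dictionary per sentence.
per_sentence_f1 = [
    {"SPOA": 0.40, "SOPA": 0.35, "PSOA": 0.55, "POSA": 0.50, "OSPA": 0.20, "OPSA": 0.25},
    {"SPOA": 0.60, "SOPA": 0.60, "PSOA": 0.45, "POSA": 0.30, "OSPA": 0.10, "OPSA": 0.15},
    {"SPOA": 0.30, "SOPA": 0.20, "PSOA": 0.70, "POSA": 0.65, "OSPA": 0.25, "OPSA": 0.20},
]

labels = build_labels(per_sentence_f1)
label_counts = Counter(PATHWAYS[i] for i in labels)
```

Inspecting `label_counts` on the real split is what would reveal the class imbalance noted above: if one pathway wins for most sentences, a classifier trained on these labels will mostly predict that pathway.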

Table 7 details the performance of the different decoding schemes. All the decoding schemes except WF use only one decoding pathway, while WF combines multiple pathways. As can be seen, WF performs much better on all languages, even better than the dynamic decoding. This shows that combining triple extractions from multiple pathways is better than dynamically choosing a single pathway. This confirms our hypothesis that one can extract a richer set of triples by extracting triples repeatedly from the same sentence using multiple extraction pathways.
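Why combining pathways yields a richer set of triples can be illustrated with a plain order-preserving union. This is explicitly not the water filling procedure, which is defined elsewhere in the paper; `merge_pathways`, the normalization rule, and the example triples are assumptions made for illustration only.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]


def merge_pathways(extractions: Dict[str, Iterable[Triple]]) -> List[Triple]:
    """Union the triples from all pathways, dropping duplicates.

    Assumption: two triples are duplicates if all three parts are equal
    after trimming and lowercasing. The first occurrence is kept.
    """
    seen = set()
    merged = []
    for triples in extractions.values():
        for triple in triples:
            key = tuple(part.strip().lower() for part in triple)
            if key not in seen:
                seen.add(key)
                merged.append(triple)
    return merged


# Hypothetical extractions for one sentence from two decoding pathways.
by_pathway = {
    "PSOA": [("Marie Curie", "won", "the Nobel Prize")],
    "SPOA": [
        ("Marie Curie", "won", "the Nobel Prize"),
        ("Marie Curie", "was born in", "Warsaw"),
    ],
}

merged = merge_pathways(by_pathway)
```

In this toy case each pathway alone misses or repeats a triple, while the union contains both distinct triples once. A union can also admit more incorrect triples, which is one reason an aggregation scheme more selective than this sketch may be preferable.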

Table 7 also provides an interesting insight: predicate-first extraction seems to be the best, followed by subject-first and then object-first extraction. This probably happens because predicates are easier to extract, so fewer errors are propagated along the chain. In general, we can conclude that predicates are easier to extract than subjects, and subjects are easier to extract than objects. We suspect that this could be due to differences in linguistic variability among the predicate, subject and object. To test this hypothesis, we measured the entropy of the distribution of dependency and part-of-speech tags in the predicate, subject and object elements in the BenchIE English and multilingual test sets. Although this is not a measure of linguistic complexity, the results shown in Table 8 do suggest that the linguistic complexity of objects is higher than that of predicates and subjects. However, additional analyses are required to understand why objects are more difficult to extract than predicates and subjects.
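The entropy measurement described above can be sketched as the Shannon entropy of the empirical tag distribution. The function `tag_entropy`, the choice of base 2, and the example tag lists are assumptions; the text does not state the logarithm base or the tag inventory that was used.

```python
import math
from collections import Counter
from typing import Iterable


def tag_entropy(tags: Iterable[str]) -> float:
    """Shannon entropy, in bits, of the empirical distribution of tags."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


# Hypothetical dependency tags collected from subject and object spans.
subject_tags = ["nsubj", "nsubj", "nsubj", "nsubj"]
object_tags = ["obj", "obl", "nmod", "obj"]

subject_entropy = tag_entropy(subject_tags)
object_entropy = tag_entropy(object_tags)
```

With these made-up tags the subject distribution has zero entropy because every tag is the same, while the object distribution is more spread out and therefore has higher entropy. The same computation applied to part-of-speech tags would give the second kind of entropy reported in Table 8.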

Entropy  Subject        Predicate      Object
         Dep.   POS     Dep.   POS     Dep.   POS
EN       1.719  1.588   2.443  1.831   2.286  1.861
ZH       2.464  1.827   2.497  1.476   2.602  1.943
DE       1.587  1.567   1.811  1.457   2.115  2.095
Table 8: Entropy of dependency (Dep.) and part-of-speech (POS) tags for subjects, predicates and objects in the BenchIE test data.

6 Conclusion

In this paper, we introduced MiLIE, a multilingual OpenIE system that extracts triples by extracting parts of the triple iteratively. MiLIE allowed us to answer the question of how different sentence constructions and languages affect the difficulty of extracting different elements of a triple. We found that certain triples are indeed easier to extract if the predicate is extracted first, while other triples are more easily extracted with subject-first extraction. We exploited such variations by aggregating extractions from multiple extraction strategies, resulting in improved performance. We also demonstrated how MiLIE can be combined seamlessly with rule-based systems to improve performance, especially in the multilingual setting. Although our experiments focused on the OpenIE task, we believe that the insights gained can be transferred to other information extraction tasks where extractions are coupled to each other. We plan to explore such connections in the future.