Multi-facet Universal Schema

03/29/2021 ∙ by Rohan Paul, et al. ∙ University of Massachusetts Amherst

Universal schema (USchema) assumes that two sentence patterns that share the same entity pairs are similar to each other. This assumption is widely adopted for solving various types of relation extraction (RE) tasks. Nevertheless, each sentence pattern could contain multiple facets, and not every facet is similar to all the facets of another sentence pattern co-occurring with the same entity pair. To address the violation of the USchema assumption, we propose multi-facet universal schema, which uses a neural model to represent each sentence pattern as multiple facet embeddings and encourages one of these facet embeddings to be close to that of another sentence pattern if they co-occur with the same entity pair. In our experiments, we demonstrate that multi-facet embeddings significantly outperform their single-facet embedding counterpart, compositional universal schema (CUSchema) (Verga et al., 2016), in distantly supervised relation extraction tasks. Moreover, we can also use multiple embeddings to detect the entailment relation between two sentence patterns when no manual label is available.

1 Introduction

Relation extraction (RE) is a crucial step in automatic knowledge base construction (AKBC). A major challenge of RE is that the frequency of relations in the real world follows a long-tail distribution, yet collecting sufficient human annotations for every relation is infeasible (Han et al., 2020).

Distant supervision was proposed to alleviate the issue (Mintz et al., 2009). Distant supervision assumes that a sentence pattern expresses a relation if the sentence pattern co-occurs with an entity pair and the entity pair has the relation. For example, we assume the sentence pattern “$ARG1, the partner of fellow $ARG2” is likely to express the spouse relation if we observe a text clip “… Angelina Jolie, the partner of fellow Brad Pitt …” in our training corpus and a knowledge base tells us that Angelina Jolie and Brad Pitt have the spouse relation. Accordingly, we can infer that another entity pair is likely to have the spouse relation if we observe the text “, the partner of fellow” between them in a new corpus.
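The distant supervision assumption can be illustrated with a minimal sketch. The toy knowledge base, the toy corpus, and the helper names `label_patterns` and `infer_relations` are hypothetical and only serve to show how a pattern inherits a relation from a known entity pair and then transfers it to a new pair.

```python
from collections import defaultdict

# Toy knowledge base: (head, tail) -> set of relations (hypothetical data).
kb = {("Angelina Jolie", "Brad Pitt"): {"per:spouse"}}

# Toy corpus, already reduced to (head, sentence pattern, tail) triples.
corpus = [
    ("Angelina Jolie", "$ARG1, the partner of fellow $ARG2", "Brad Pitt"),
    ("Kristen Bell", "$ARG1, the partner of fellow $ARG2", "Dax Shepard"),
]

def label_patterns(kb, corpus):
    """Map each pattern to the KB relations of the entity pairs it co-occurs with."""
    labels = defaultdict(set)
    for head, pattern, tail in corpus:
        labels[pattern] |= kb.get((head, tail), set())
    return dict(labels)

def infer_relations(pattern_labels, head, pattern, tail):
    """Propose relations for an entity pair observed with a labeled pattern."""
    return {(head, rel, tail) for rel in pattern_labels.get(pattern, set())}

pattern_labels = label_patterns(kb, corpus)
predictions = infer_relations(
    pattern_labels, "Kristen Bell", "$ARG1, the partner of fellow $ARG2", "Dax Shepard"
)
```

Here the pattern is labeled with the spouse relation because of one known pair, and the same label is then proposed for a pair that the knowledge base does not contain. The sketch also makes the weakness visible: every relation of every co-occurring pair is attached to the pattern, whether or not the pattern expresses it.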

Figure 1: Comparison between the multi-facet and compositional universal schema. In our training loss, we encourage one of the facet embeddings from a sentence pattern to be similar to its co-occurred entity pair.

Universal schema (Riedel et al., 2013) extends this assumption by treating every sentence pattern as a relation, which means we assume that sentence patterns or relations in a knowledge base are similar if they co-occur with the same entity pair. For example, we assume “$ARG1, the partner of fellow $ARG2” and “$ARG1, the wife of fellow $ARG2” are similar if they both co-occur with (Kristen Bell, Dax Shepard). Consequently, we can infer that “$ARG1, the wife of fellow $ARG2” also implies the spouse relation, as “$ARG1, the partner of fellow $ARG2” does, even if the knowledge base does not record the spouse relation between Kristen Bell and Dax Shepard.

Compositional universal schema (Verga et al., 2016) realizes the idea by using an LSTM (Hochreiter and Schmidhuber, 1997) to encode each sentence pattern into an embedding and encouraging the embedding to be similar to the embedding of the co-occurring entity pair. As in the lower part of Figure 1, the model makes the embeddings of two sentence patterns similar if they co-occur with the same entity pair. Baldini Soares et al. (2019) rely on a similar assumption and achieve state-of-the-art results on supervised RE tasks by replacing the LSTM with a large pre-trained language model.

Figure 2: An illustration of the proposed method. The training signal comes from the co-occurrence matrices of the KB and training text corpus on the right. On the lower left, we visualize our neural encoder, which captures the compositional meaning of tokens in the sentence pattern, and our neural decoder, which models the dependency among multiple facet embeddings. When a sentence pattern co-occurs with an entity pair, the training loss minimizes the distance between the entity pair embedding and the closest facet embedding of the sentence pattern (e.g., the distance of 0.2 shown in the figure). Trainable parameters in our model are highlighted using red borders. On the upper left, we visualize the embedding space to establish the connection between our method and clustering.

The variants of universal schema have many different applications, including multilingual RE (Verga et al., 2016), knowledge base construction (Toutanova et al., 2015; Verga et al., 2017), question answering (Das et al., 2017), document-level RE (Verga et al., 2018), N-ary RE (Akimoto et al., 2019), open information extraction (Zhang et al., 2019), and unsupervised relation discovery (Percha and Altman, 2018).

Nevertheless, one sentence pattern could contain multiple facets, and each facet could imply a different relation. In Figure 1, “$ARG1, the partner of fellow $ARG2” could imply the entity pair has the spouse relation, the co-worker relation, or both. “$ARG1 moved in with $ARG2” could imply the spouse relation, the parent relation, and so on. If we squeeze the facets of a sentence pattern into a single embedding, the embedding is more likely to be affected by the irrelevant facets from other patterns that co-occur with the same entity pair (e.g., “$ARG1 moved in with $ARG2” might incorrectly imply the co-worker relation).

Another limitation is that a single-embedding representation can only provide a symmetric similarity measurement between two sentence patterns. Thus, an open research challenge is to predict the entailment direction between two sentence patterns based only on their co-occurring entity pair information.

To overcome the challenges, we propose multi-facet universal schema, where we assume that two sentence patterns share a similar facet if they co-occur with the same entity pair. As in Figure 1, we use a neural encoder and decoder to predict multiple facet embeddings of each sentence pattern and encourage one of the facet embeddings to be similar to the entity pair embedding. As a result, the facets that are irrelevant to the relation between the entity pairs are less likely to affect the embeddings of entity pairs and other related sentence patterns. For example, the parent facet of “$ARG1 moved in with $ARG2” could be excluded when updating the embeddings of (Angelina Jolie, Brad Pitt).

In our experiments, we first compare the multi-facet embeddings with the single-facet embedding in distantly supervised RE tasks. The results demonstrate that multiple facet embeddings significantly improve the similarity measurement between the sentence patterns and knowledge base relations. Besides RE, we also apply multi-facet embeddings to unsupervised entailment detection tasks. In a newly collected dataset, we show that multi-facet universal schema significantly outperforms the other unsupervised baselines.

2 Methods

Our method is illustrated in Figure 2. In Section 2.1, we first provide our problem setup: We are given a knowledge base (KB) and a text corpus during training. Our goal is to extract relations by measuring the similarity between KB relations and an (unseen) sentence pattern or to detect entailment between two sentence patterns. In Section 2.2, we introduce our neural model, which predicts multi-facet embeddings of each sentence pattern. Next, in Section 2.3, we describe our objective function, which encourages the embeddings of co-occurred entity pairs to be close to the embeddings of their closest pattern facets. Finally, in Section 2.4, we explain that multi-facet embeddings could be viewed as the cluster centers of possibly co-occurred entity pairs, and in Section 2.5, we provide our scoring functions for distantly supervised RE and unsupervised entailment tasks.

2.1 Background and Problem Setup

Our RE problem setup is the same as that of compositional universal schema (Verga et al., 2016). First, we run named entity recognition (NER) and entity linking on a raw corpus. After identifying the entity pairs in each sentence, we prepare a co-occurrence matrix as in Figure 2. Similarly, we represent the KB relations between entity pairs as a co-occurrence matrix and merge the matrices from the KB and the training corpus. The merged matrix has $y_{i,j} = 1$ if the $i$th sentence pattern or KB relation co-occurs with the $j$th entity pair and $y_{i,j} = 0$ otherwise.
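As a minimal sketch of the merged matrix, the snippet below stacks sentence patterns and KB relations as rows and entity pairs as columns. The function name `build_cooccurrence` and the toy triples are hypothetical; a real corpus would need a sparse representation rather than nested lists.

```python
def build_cooccurrence(text_triples, kb_triples):
    """Return (rows, cols, y) with y[i][j] = 1 iff row i co-occurs with entity pair j.

    Rows are sentence patterns and KB relations; columns are (head, tail) pairs.
    """
    rows, cols, cells = [], [], set()

    def index(items, key):
        # Assign the next free index to unseen keys.
        if key not in items:
            items.append(key)
        return items.index(key)

    for head, row_name, tail in list(text_triples) + list(kb_triples):
        cells.add((index(rows, row_name), index(cols, (head, tail))))

    y = [[1 if (i, j) in cells else 0 for j in range(len(cols))]
         for i in range(len(rows))]
    return rows, cols, y

text_triples = [
    ("Angelina Jolie", "$ARG1, the partner of fellow $ARG2", "Brad Pitt"),
    ("Kristen Bell", "$ARG1, the partner of fellow $ARG2", "Dax Shepard"),
]
kb_triples = [("Angelina Jolie", "per:spouse", "Brad Pitt")]
rows, cols, y = build_cooccurrence(text_triples, kb_triples)
```

In this toy case the sentence pattern row has two non-zero entries and the KB relation row has one, which already shows how sparse each row tends to be.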

During testing, we use NER to extract an entity pair and the sentence pattern, which might not have been seen in the training corpus. Next, we extract relations by computing the similarity between the sentence pattern embeddings and the embeddings of the applicable KB relations. Besides RE, we also detect the entailment between two sentence patterns by comparing their embeddings.

2.2 Neural Encoder and Decoder

We use a neural model to predict the facet embeddings of each sentence pattern. The goal is similar to that of Chang et al. (2021), who predict a fixed number of embeddings for a sentence, so we adopt their neural model as shown in Figure 2.

For the $i$th sentence pattern $s_i = w^i_1 \dots w^i_{n_i}$, we append an <eos> to its end and use a 3-layer Transformer encoder $TE$ (Vaswani et al., 2017) to model the compositional meaning of the input word sequence: $h^i_1 \dots h^i_{n_i} h^i_{\text{eos}} = TE(w^i_1 \dots w^i_{n_i}\ \text{<eos>})$, where $h^i_t$ is the embedding of the $t$th token contextualized by the encoder. In the experiment, we also replace the Transformer with a bidirectional LSTM (bi-LSTM) to show that the improvement of multi-facet embeddings is independent of the encoder choice.

The embedding $h^i_{\text{eos}}$ represents the whole sentence pattern; we use $K$ different linear layers $L_1, \dots, L_K$ to transform the embedding into the inputs of our decoder: $q^i_k = L_k(h^i_{\text{eos}})$ for $k = 1, \dots, K$.

The facets in a sentence pattern often have some dependency. For example, the patterns that express the partnership between two people might also express the collaboration relation between two companies. To leverage the dependency, we use another 3-layer Transformer as our decoder $TD$. Besides the self-attention, we allow the hidden states in the decoder to query the contextualized word embeddings from the encoder (Vaswani et al., 2017) and output the embeddings corresponding to the different facets: $\underline{f}^i_1 \dots \underline{f}^i_K = TD(q^i_1 \dots q^i_K;\ h^i_1 \dots h^i_{n_i} h^i_{\text{eos}})$. Notice that we do not use autoregressive decoding as in Vaswani et al. (2017), so our decoder could also be viewed as another encoder with attention to the output of the encoder $TE$. Finally, to convert the hidden state size to the entity embedding size, we let the outputs of the decoder go through another linear layer $L_O$ to get the facet embeddings (i.e., sentence pattern embeddings): $f^i_k = L_O(\underline{f}^i_k)$.

2.3 Objective Function

When measuring the distance between the $j$th entity pair and the $i$th sentence pattern, we compute the Euclidean distance between the entity pair embedding $e_j$ and its closest facet embedding among the facet embeddings $f^i_1, \dots, f^i_K$ of the $i$th sentence pattern. The distance is defined as

(1)

where the entity pair embedding $e_j$ is normalized (i.e., $\lVert e_j \rVert = 1$). During testing, we ignore the magnitude of facet embeddings, so we use a scaling factor to eliminate the magnitude of facet embeddings during training. We do not allow a negative scaling factor, to prevent the gradient flow from pushing a facet embedding $f^i_k$ toward the inverse direction of $e_j$, and we constrain the scaling factor to keep the neural model from outputting $f^i_k$ with a very small magnitude.

As in Figure 2, we minimize the distance in our loss function when the $i$th sentence pattern co-occurs with the $j$th entity pair (i.e., $y_{i,j} = 1$ in the merged co-occurrence matrix). For negative samples (i.e., $y_{i,j} = 0$), we maximize the distance instead. That is, the major term of our loss function is defined as

(2)

and the other regularization term in the loss function is described in the appendix. The loss is summed over a set that includes all positive and negative samples. Positive samples are the pairs of the $i$th sentence pattern and the $j$th entity pair that co-occur (i.e., $y_{i,j} = 1$), and the negative samples are constructed by pairing a randomly selected sentence pattern with the $j$th entity pair. To balance the influence of popular entity pairs (i.e., entity pairs that co-occur with many sentence patterns) and rare entity pairs on our model, we assign a weight to each pair in the loss.
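The closest-facet distance and the sign convention of the loss can be sketched as follows. Several details here are assumptions rather than the formulation above: the distance is squared, the scaling factor is the projection coefficient clipped to the range [0, 1], the loss is a plain signed and weighted sum, and the weights are supplied by the caller. The function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def scaled_sq_distance(e, f):
    """Squared distance between unit vector e and a rescaled copy of facet f.

    Assumption: the scale is the projection coefficient clipped to [0, 1].
    """
    dot = sum(a * b for a, b in zip(e, f))
    norm_sq = sum(b * b for b in f)
    gamma = min(1.0, max(0.0, dot / norm_sq)) if norm_sq > 0 else 0.0
    return sum((a - gamma * b) ** 2 for a, b in zip(e, f))

def closest_facet_distance(e, facets):
    """Distance from an entity pair embedding to its closest facet embedding."""
    return min(scaled_sq_distance(e, f) for f in facets)

def loss(samples, entity_emb, facet_emb):
    """samples: (pattern id, entity pair id, y, weight); y = 1 positive, 0 negative."""
    total = 0.0
    for i, j, y, w in samples:
        d = closest_facet_distance(entity_emb[j], facet_emb[i])
        total += w * d if y == 1 else -w * d  # pull positives closer, push negatives away
    return total

entity_emb = {0: [1.0, 0.0], 1: normalize([1.0, 1.0])}
facet_emb = {0: [[2.0, 0.0], [0.0, 3.0]]}
example_loss = loss([(0, 0, 1, 1.0), (0, 1, 0, 1.0)], entity_emb, facet_emb)
```

Under these assumptions, a facet embedding that points in the right direction with a large norm incurs no cost, because the scale can shrink it onto the entity pair embedding, while a facet with a very small norm cannot be stretched and is therefore penalized.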

We generate the embeddings for KB relations in a similar way. We use a single token to represent the relation and append an <eos> (e.g., per:spouse <eos>) to form the input of our neural model. The KB relations usually co-occur with more entity pairs, so we set the number of facet embeddings for KB relations to be larger than the number of facet embeddings for sentence patterns.

2.4 Connection to Clustering

If a sentence pattern contains multiple facets that describe different relations between the entity pairs, the pattern often co-occurs with different kinds of entity pairs. For example, “$ARG1 ’s partner $ARG2” in Figure 2 could express the collaboration relationship between two companies or the partnership between two people, so the sentence patterns could co-occur with two companies such as (Google, Facebook) and two people such as (Bob Bryan, Mike Bryan).

Different kinds of entity pairs often have very different embeddings, so we could discover the facets of sentence patterns by clustering the embeddings of entity pairs. Here, a facet refers to a mode of the embedding distribution of the entity pairs that could possibly co-occur with the sentence pattern. A facet could be represented by multiple facet embeddings, and each facet embedding corresponds to a cluster center of the entity pair embeddings. Hence, although the number of facet embeddings $K$ is fixed for all the sentence patterns, our model can capture the facets of the sentence patterns well when the number of facets is less than $K$.

In equation 1, we choose the closest facet embedding of the sentence pattern for each co-occurring entity pair embedding and minimize their distance. For example, one facet embedding of the pattern and the embedding of (Bob Bryan, Mike Bryan) are pulled closer in Figure 2. Minimizing equation 1 by passing the gradient through the scaled facet embedding is the same as minimizing a Kmeans loss, so the loss term induced by positive sample pairs encourages each facet embedding to become the cluster center of its nearby co-occurring entity pair embeddings. The details of our training algorithm can be found in the appendix.
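To make the clustering analogy concrete, here is a minimal sketch of one explicit Kmeans update over the entity pair embeddings that co-occur with a single pattern. This is only an illustration: the model described above reaches its cluster centers through gradient steps on the output of a neural network, not through this closed-form recentering, and the function name `kmeans_step` is hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_step(facets, entity_embs):
    """Assign each entity pair embedding to its closest facet, then recenter facets."""
    groups = [[] for _ in facets]
    for e in entity_embs:
        k = min(range(len(facets)), key=lambda idx: sq_dist(e, facets[idx]))
        groups[k].append(e)

    new_facets = []
    for f, members in zip(facets, groups):
        if not members:
            # A facet with no assigned entity pair stays where it is.
            new_facets.append(list(f))
        else:
            new_facets.append(
                [sum(m[d] for m in members) / len(members) for d in range(len(f))]
            )
    return new_facets

facets = [[0.0, 0.0], [10.0, 10.0]]
entity_embs = [[1, 1], [3, 1], [9, 11]]
updated = kmeans_step(facets, entity_embs)
```

The assignment step corresponds to choosing the closest facet embedding for each co-occurring entity pair, and the recentering step corresponds to the direction in which the positive term of the loss moves that facet.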

The co-occurrence matrices in RE tasks are usually extremely sparse, and most of the sentence patterns only co-occur with a few entity pairs, which makes it difficult to derive multiple high-quality embeddings by clustering the co-occurring entity pair embeddings as in multi-sense word embedding methods such as Neelakantan et al. (2014). The proposed method solves this sparsity challenge by predicting the cluster centers using a neural model. For instance, even if “$ARG1 ’s partner $ARG2” does not co-occur with many entity pairs, its embeddings are encouraged to be close to the embeddings of entity pairs that co-occur with other similar patterns (e.g., “$ARG1 and her partner $ARG2”).

Figure 3: Comparison of the asymmetric similarities. The asymmetric similarity on the left is higher because the average cosine distance on the left is smaller than that on the right.

2.5 Scoring Functions

In compositional universal schema, the similarity between two sentence patterns is measured by the symmetric cosine similarity of their embeddings. When using multiple embeddings to represent a sentence pattern, we can compute the asymmetric similarity as

(3)

In the example of Figure 3, a red square is close to all the blue points, which leads to a high asymmetric similarity.

Between two sentence patterns with an entailment relation, we empirically find that the embeddings of a premise (the more specific pattern) often have some facet embeddings that are far away from all the embeddings of its hypothesis (the more general pattern). Relying on this tendency, we can detect the direction of the entailment relation. For example, the sentence pattern shown as red squares in Figure 3 is more likely to be the premise if it has an entailment relation with the sentence pattern shown as blue circles.
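The tendency described above can be illustrated with one possible asymmetric score. The specific form below (the average, over the facets of one pattern, of the best cosine match among the facets of the other pattern) and the sign convention of `direction_score` are assumptions made for this sketch; they are not taken from equation 3, and the function names and toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def asym_sim(facets_a, facets_b):
    """Average, over facets of A, of the best cosine match among facets of B."""
    return sum(max(cosine(fa, fb) for fb in facets_b) for fa in facets_a) / len(facets_a)

def direction_score(facets_a, facets_b):
    """Positive when A has facets that B does not cover, i.e. A looks like the premise."""
    return asym_sim(facets_b, facets_a) - asym_sim(facets_a, facets_b)

# The specific pattern has an extra facet that the general pattern lacks.
specific = [[1.0, 0.0], [0.0, 1.0]]
general = [[1.0, 0.0], [1.0, 0.1]]
score = direction_score(specific, general)
```

With these toy vectors, every facet of the general pattern has a close match in the specific pattern, but one facet of the specific pattern is far from all facets of the general pattern, so the score is positive and the specific pattern is flagged as the likely premise.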

We suspect the reason is that more specific patterns could contain more words that are similar to the words of other patterns expressing different relations. For example, “$ARG1 , the wife of fellow $ARG2” has a facet embedding for the spouse relation and another facet embedding for the co-worker relation because the pattern has a high word overlap with “$ARG1 , the wife of $ARG2” and “$ARG1 and her fellow $ARG2”. Another possible reason is that the articles in our corpus tend to use more specific patterns to express the relation between a pair of entities (Shwartz et al., 2017).

When performing RE, we compute the symmetric similarity between a sentence pattern and a KB relation by

(4)

3 Experiments

We primarily compare our method with compositional universal schema (CUSchema) (Verga et al., 2016) because CUSchema is one of the state-of-the-art RE methods in the small model regime (without using large pre-trained language models) (Chang et al., 2016; Chaganty et al., 2017). [1]

[1] We have not yet applied the multi-facet embeddings approach to the models that rely on a large pretrained language model (LM) (Baldini Soares et al., 2019) due to computational and evaluation considerations. Computationally speaking, training state-of-the-art models requires intensive GPU resources. Besides, a smaller model size might be desired when we need to construct a knowledge base from a large corpus in real time. Moreover, there is no existing pretrained LM in some domains (Zhang et al., 2019), and training the LM in a new domain from scratch requires even more GPU resources. In terms of the evaluation consideration, our method is an improvement over CUSchema, so we want to compare it with CUSchema fairly. Furthermore, evaluating entailment between two full sentences is more difficult than between the sentence patterns, and we are not aware of an LM-based model that only considers the text between the entity pairs.

In Section 3.1, we visualize and analyze the facet embeddings. Next, we use distantly supervised RE tasks to evaluate our symmetric similarity measurement in Section 3.2, and detect entailment between sentence patterns to evaluate our asymmetric similarity measurement in Section 3.3.

Figure 4: Facet embedding visualization of Ours (Single-Trans) on the left and Ours (Trans) on the right. Dots are the facet embeddings output by our models and crosses are their nearby entity pair embeddings.

3.1 Embedding Visualization

We visualize the embeddings of sentence patterns and a KB relation from the single embedding model and multi-facet embedding model that perform the best in the RE tasks (i.e., Ours (Single-Trans) and Ours (Trans) in Table 1). We project the facet embeddings to a 2-dimensional space using multidimensional scaling (MDS) (Borg and Groenen, 2005) and visualize the embeddings of one KB relation and three related sentence patterns in Figure 4. The three sentence patterns are selected from our validation set, so the model is not aware of the entity pairs that actually co-occur with the patterns during training. For each facet embedding, we show two of its five closest entity pairs to visualize the meaning of the embedding space. [2]

[2] Notice that our training signal is sparse and noisy and the projection does not necessarily preserve the original distances, so the entity pairs with similar relations might be relatively far away from each other.

In the single embedding model, the embedding of org:city_of_headquarter is close to the embedding of (school, location) while “$ARG1 headoffice in $ARG2” is close to (company, location) and “$ARG1 headquarter in $ARG2”.

In the multi-facet embedding model, some embeddings of org:city_of_headquarter are closer to (school, location) and others are closer to (company, location). In addition to these entity pairs, “$ARG1 headoffice in $ARG2” and “$ARG1 headquarter in $ARG2” also co-occur with (people, location) and (people/organization, year). Using the visualization of multi-facet embedding, we can understand which facets of org:city_of_headquarter are similar or dissimilar to “$ARG1 headoffice in $ARG2”, which cannot be done if all facets are averaged into a single embedding as in the traditional models.

The facet embeddings of “$ARG1 is now at $ARG2” are close to (people, organization) where the organization could be a school, a sports team, or a company. Using multiple embeddings could avoid enforcing the closeness of these entity pairs with different relations. The results also indicate that our model can output reasonable cluster centers despite learning from the sparse and noisy training data. Finally, we can see that if a sentence pattern has fewer facets than the number of facet embeddings, our model learns to output some very similar facet embeddings, which makes the performance less sensitive to the setting of that number.

3.2 Relation Extraction

We use the same training data and testing protocol as compositional universal schema (CUSchema) (Verga et al., 2016) (https://github.com/patverga/torch-relation-extraction) to highlight the benefit of predicting multiple facet embeddings, and the relation extraction step in TAC KBP slot-filling tasks is used to compare the different models.

Method | TAC 2012 (Validation): Prec | Recall | F1 | TAC 2013: Prec | Recall | F1 | TAC 2014: Prec | Recall | F1
USchema* | 34.8 | 23.7 | 28.2 | 42.6 | 29.4 | 34.8 | 35.5 | 24.3 | 28.8
CUSchema (LSTM)* | 27.0 | 32.7 | 29.6 | 39.6 | 32.2 | 35.5 | 32.9 | 27.3 | 29.8
Ours (Single-LSTM) | 25.7 | 21.7 | 23.5 | 30.4 | 26.3 | 28.2 | 22.1 | 20.5 | 21.3
Ours (Single-Trans) | 26.1 | 21.6 | 23.7 | 29.5 | 25.2 | 27.2 | 19.0 | 21.2 | 20.0
Ours (LSTM) | 32.0 | 28.9 | 30.3 | 41.3 | 33.9 | 37.2 | 34.1 | 29.5 | 31.6
Ours (Trans) | 33.6 | 27.7 | 30.4 | 42.5 | 33.2 | 37.3 | 34.6 | 28.5 | 31.3
USchema + CUSchema (LSTM)* | 29.3 | 32.8 | 30.9 | 41.9 | 34.4 | 37.7 | 29.3 | 34.1 | 31.5
USchema + Ours (LSTM) | 29.2 | 33.7 | 31.3 | 38.1 | 38.9 | 38.5 | 31.5 | 34.4 | 32.9
USchema + Ours (Trans) | 30.4 | 33.9 | 32.1 | 39.0 | 38.8 | 38.9 | 32.0 | 34.0 | 33.0

Table 1: Distantly supervised relation extraction using different versions of the universal schema. All numbers are %. CUSchema refers to compositional universal schema. Trans is an abbreviation of Transformer. The best scores of the single models and ensemble models are highlighted. *The results on TAC 2013 and 2014 are copied from Verga et al. (2016).

Setup: The training data for our RE models are prepared by distant supervision without requiring any manually labeled data. The relations in Freebase (Bollacker et al., 2008) are mapped to TAC relations (e.g., org:city_of_headquarter), and the NER tagger and entity linker are run on a raw text corpus. Then, the training data are cleaned using the methods in Roth et al. (2013).

During testing, we are given a query containing the head entity and a query TAC relation in the slot-filling tasks, and the goal is to extract the tail entity from the candidate sentences. The NER tagger and query expansion are used to gather the candidate sentence patterns, and we compute the similarity scores from different models between the candidate sentence patterns and the query relation. Finally, we compare the extracted tail entity with the ground truth using exact string matching and report the precision, recall, and F1 scores.

Following Verga et al. (2016), we use TAC 2012 as our validation set to determine the threshold score for each TAC relation. Each model’s hyperparameters are tuned separately using the validation set (TAC 2012) to ensure a fair comparison.
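The evaluation and threshold selection can be sketched in a few lines. The exact-match scoring follows the description above; choosing the threshold that maximizes F1 on the validation set is an assumption of this sketch, and the function names `prf` and `pick_threshold` as well as the toy answers are hypothetical.

```python
def prf(predicted, gold):
    """Precision, recall, and F1 with exact string matching of answers."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def pick_threshold(scored, gold):
    """scored: list of (similarity score, answer) for one relation.

    Tries every observed score as a cut-off and keeps the one with the best F1.
    """
    best_t, best_f1 = None, -1.0
    for t in sorted({s for s, _ in scored}):
        kept = [a for s, a in scored if s >= t]
        f1 = prf(kept, gold)[2]
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

validation = [(0.9, "Seattle"), (0.8, "Redmond"), (0.3, "Paris")]
threshold, f1 = pick_threshold(validation, {"Seattle", "Redmond"})
```

In practice one such threshold would be selected per TAC relation on the validation year and then applied unchanged to the test years.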

We compare the following methods:
Ours (Trans): Our method that measures the similarity between the sentence pattern and the TAC relation using the symmetric similarity in equation 4. Trans is an abbreviation of the Transformer encoder. We set the number of facet embeddings for sentence patterns and for KB relations based on the validation set.
Ours (LSTM): The same as Ours (Trans) except that we use a bi-LSTM as our encoder instead.
Ours (Single-*): Our methods that use a single facet embedding to represent each sentence pattern or KB relation. When the number of facet embeddings is set to one, our decoder becomes the interleaving feedforward layers and cross-attention layers attending to the output embeddings of the encoder.
CUSchema (LSTM): Compositional universal schema (Verga et al., 2016). The method is similar to Ours (Single-LSTM) but uses a different loss function, neural architecture (no decoder), and hyperparameter search procedure.
USchema: Universal schema (Riedel et al., 2013) estimates every sentence pattern embedding by factorizing the co-occurrence matrices (i.e., replacing the bi-LSTM in CUSchema with a look-up table).
USchema + *: Verga et al. (2016) show that taking the maximum of the similarity scores from the USchema and CUSchema models improves F1. We apply the same merging procedure to our model.
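The merging procedure in the last item amounts to an element-wise maximum over the scores of two models. A minimal sketch, assuming each model exposes its scores as a dictionary keyed by (sentence pattern, relation); the function name `merge_scores` and the toy scores are hypothetical.

```python
def merge_scores(uschema_scores, neural_scores):
    """Keep the larger similarity per (pattern, relation) key.

    A key that only one model scored keeps that model's score.
    """
    missing = float("-inf")
    keys = set(uschema_scores) | set(neural_scores)
    return {
        k: max(uschema_scores.get(k, missing), neural_scores.get(k, missing))
        for k in keys
    }

uschema = {("$ARG1 married $ARG2", "per:spouse"): 0.9}
neural = {
    ("$ARG1 married $ARG2", "per:spouse"): 0.7,
    ("$ARG1 happily married $ARG2", "per:spouse"): 0.8,
}
merged = merge_scores(uschema, neural)
```

The second key shows why the combination helps recall: a look-up table has no score for a pattern it never saw, while the neural encoder can still score it from its words.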

Results: In Table 1, the proposed method Ours (Trans) significantly outperforms CUSchema (LSTM) before and after combining with universal schema. As far as we know, our proposed multi-facet embedding is the first method that outperforms compositional universal schema using the same training signal in the distantly supervised RE benchmark they proposed.

The recall of USchema is low because it does not exploit the similarity between the patterns (e.g., “$ARG1 happily married $ARG2” is similar to “$ARG1 married $ARG2”). For the same reason, USchema has a high precision: it cannot be misled by such similarity (e.g., “$ARG1, and his wife $ARG2” expresses the spouse relation but “$ARG1, his wife, and $ARG2” does not) (Verga et al., 2016). Thus, combining USchema and Ours (Trans) leads to the best performance.

Ours (Trans) and Ours (LSTM) perform similarly. Furthermore, Ours (LSTM) performs much better than Ours (Single-LSTM), which demonstrates the effectiveness of using multiple embeddings. Notice that multiple facet embeddings could improve the performance even after the training data have been cleaned. This indicates that our method is complementary to the noise removal methods in Roth et al. (2013).

Premise (Specific Pattern) | Hypothesis (General Pattern) | Label | Ours | CUSchema | Ours Diff | Freq Diff
$ARG1 , the president of the $ARG2 | $ARG1 , the leader of the $ARG2 | Entailment | 0.98 | 0.94 | + | +
$ARG1 ’s chairman , $ARG2 | $ARG1 ’s leader , $ARG2 | Entailment | 0.95 | 0.87 | + | -
$ARG1 ’s father , $ARG2 | $ARG1 ’s leader , $ARG2 | Other | 0.08 | 0.52 | NA | NA
$ARG1 worked with $ARG2 | $ARG1 deal with $ARG2 | Entailment | 0.92 | 0.83 | + | -
$ARG1 had with $ARG2 | $ARG1 deal with $ARG2 | Other | 0.96 | 0.88 | NA | NA
$ARG1 said the $ARG2 | $ARG1 say the $ARG2 | Paraphrase | 0.93 | 0.92 | NA | NA

Table 2: Examples of sentence pattern pairs, their labels, and our predictions in our entailment experiment. Ours and Ours Diff are the predictions from Ours (Trans). Freq Diff is the frequency difference baseline.

3.3 Entailment Detection

Entailment is a common and fundamental relation between two sentence patterns. Some examples can be seen in Table 2. Unsupervised hypernym detection (i.e., entailment at the word level) is extensively studied (Shwartz et al., 2017), but we are not aware of any previous work on unsupervised entailment detection at the sentence level, nor any existing entailment dataset between sentence patterns. Thus, we create one.

Dataset Creation: We use WordNet (Miller, 1998) to discover the entailment candidates of sentence pattern pairs and manually label the candidates. For each sentence pattern in the training data of Verga et al. (2016), we replace one word at a time with its hypernym based on the WordNet hierarchy. The two sentence patterns before and after replacement form an entailment candidate.
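The candidate generation step can be sketched as follows. A small hand-written dictionary stands in for the WordNet hierarchy, so the `HYPERNYMS` table, its entries, and the function name `entailment_candidates` are hypothetical; a real pipeline would query WordNet instead.

```python
# Toy hypernym table standing in for WordNet (hypothetical entries).
HYPERNYMS = {"president": ["leader"], "chairman": ["leader"], "wife": ["spouse"]}

def entailment_candidates(pattern, hypernyms=HYPERNYMS):
    """Replace one token at a time with a hypernym.

    Returns (original pattern, generalized pattern) pairs. Each pair is only a
    candidate: it still needs a manual label (entailment, paraphrase, or other).
    """
    tokens = pattern.split()
    candidates = []
    for pos, tok in enumerate(tokens):
        for hyper in hypernyms.get(tok, []):
            general = tokens[:pos] + [hyper] + tokens[pos + 1:]
            candidates.append((pattern, " ".join(general)))
    return candidates

pairs = entailment_candidates("$ARG1 , the president of the $ARG2")
```

Because the replacement ignores the rest of the pattern, many generated pairs will not be true entailments, which is why the manual labeling step is needed.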

We label 1,500 candidate pairs formed from the most popular sentence patterns, i.e., those that co-occur with the highest number of unique entity pairs. Each candidate could be labeled as entailment, paraphrase, or other. Finally, around 20% of the candidates are randomly chosen to form the validation set, and the rest are in the test set. More details of the dataset creation process can be found in the appendix.

In this dataset, only 22% and 10% of candidates are labeled as entailment and paraphrase, respectively. This suggests that the entailment relation between two sentence patterns is hard to infer from the hypernym relation in WordNet alone (i.e., from the entailment relation at the word level).

Setup: We evaluate entailment detection using the typical setup and metrics in hypernym detection (Shwartz et al., 2017). Negative examples include the candidates labeled as paraphrase or other. We compare the average precision of different methods (i.e., the area under the precision-recall curve) (Hastie et al., 2009). In addition, we predict the direction of the entailment relation in each pair (i.e., which pattern is the premise) and report the accuracy. Many hypotheses have the same hypernyms, such as the leader in Table 2, so we also report the macro accuracy of direction detection averaged across every hypernym in the hypotheses.
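The metrics in this setup can be sketched as follows. Average precision is computed with the usual ranking-based sum, and macro accuracy groups the direction predictions by the hypernym in the hypothesis. The function names and the toy inputs are hypothetical.

```python
def average_precision(scores, labels):
    """labels: 1 for entailment, 0 for paraphrase or other. Ranks by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda x: -x[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at each positive example
    return total / hits if hits else 0.0

def micro_macro_accuracy(records):
    """records: list of (hypernym, is_correct) for direction predictions.

    Micro averages over all pairs; macro averages the per-hypernym accuracies.
    """
    micro = sum(c for _, c in records) / len(records)
    groups = {}
    for hypernym, correct in records:
        groups.setdefault(hypernym, []).append(correct)
    macro = sum(sum(v) / len(v) for v in groups.values()) / len(groups)
    return micro, macro

ap = average_precision([0.9, 0.8, 0.3], [1, 0, 1])
micro, macro = micro_macro_accuracy(
    [("leader", True), ("leader", True), ("leader", False), ("deal", False)]
)
```

The toy records show why both accuracies are reported: a frequent hypernym such as leader dominates the micro accuracy, while the macro accuracy gives each hypernym the same weight.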

The task is challenging because all the candidates have a word-level entailment relation if their compositional meaning is ignored. Furthermore, we cannot infer the entailment direction based on the tendency that longer sentence patterns tend to be more specific because most of the candidate pairs in this dataset have the same length.

As described in Section 2.5, our models detect the direction by computing Ours Diff as the difference between the asymmetric similarities of the two sentence patterns in the two directions, and predict which sentence pattern is the premise from the sign of Ours Diff. When performing entailment classification, we score each candidate pair with our asymmetric similarity. We report the performance of Ours (Trans), which is also the best model in the RE experiment.

In entailment classification, we compare the results with the cosine similarity from Ours (Single-Trans) and CUSchema. We also test the frequency difference, which is a strong baseline in hypernym direction detection (Chang et al., 2018). Freq Diff is the difference between the frequencies of the two sentence patterns, where the frequency of a sentence pattern is the number of unique entity pairs that co-occur with it. The baseline predicts the less frequent pattern to be the premise because more general sentence patterns should co-occur with more entity pairs. As a reference, we also report the performance of random scores.
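The frequency difference baseline is simple enough to sketch in full. The function names, the tie handling (returning `None`), and the toy co-occurrences are assumptions of this sketch.

```python
def pattern_frequency(cooccurrences):
    """cooccurrences: iterable of (pattern, entity pair).

    Counts the number of unique entity pairs seen with each pattern.
    """
    pairs = {}
    for pattern, pair in cooccurrences:
        pairs.setdefault(pattern, set()).add(pair)
    return {p: len(s) for p, s in pairs.items()}

def predict_premise(freq, pattern_a, pattern_b):
    """Return the less frequent pattern as the premise, or None on a tie."""
    fa, fb = freq.get(pattern_a, 0), freq.get(pattern_b, 0)
    if fa == fb:
        return None
    return pattern_a if fa < fb else pattern_b

freq = pattern_frequency([
    ("$ARG1 ’s leader , $ARG2", ("Apple", "Tim Cook")),
    ("$ARG1 ’s leader , $ARG2", ("France", "Emmanuel Macron")),
    ("$ARG1 ’s leader , $ARG2", ("Apple", "Tim Cook")),  # duplicate pair, counted once
    ("$ARG1 ’s chairman , $ARG2", ("Apple", "Tim Cook")),
])
premise = predict_premise(freq, "$ARG1 ’s chairman , $ARG2", "$ARG1 ’s leader , $ARG2")
```

The baseline uses no embeddings at all, so any gain of the multi-facet model over it reflects information beyond raw pattern frequency.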

Method | Classification: AP@all | Direction Detection: Micro Acc | Direction Detection: Macro Acc
Random | 21.9 | 50.0 | 50.0
Freq Diff | 21.4 | 54.5 | 47.3
CUSchema | 31.2 | 50.0 | 50.0
Ours (Single) | 23.6 | 50.0 | 50.0
Ours | 37.8 | 63.1 | 55.4

Table 3: Comparison of entailment detection methods. AP and Acc are average precision and accuracy, respectively. All numbers are %. Our methods use a Transformer as their encoder.

Results: The quantitative and qualitative comparisons are presented in Table 3 and Table 2, respectively. Our model that uses multi-facet embeddings significantly outperforms the other baselines. We hypothesize that a major reason is that the sentence patterns with an entailment relation are often similar in some but not all of the facets, and our asymmetric similarity measurement is better at capturing the facet overlap.

4 Related Work

Relation extraction (RE) is widely studied. Han et al. (2020) summarize the trend of recent studies and point out one of the major challenges is the cost of collecting the labels. Distant supervision (Mintz et al., 2009) and its follow-up work enable us to collect a large amount of training data at a low cost, but the violation of its assumptions often introduces substantial noise into the supervision signal. Our goal is to alleviate the noise issue by representing every sentence pattern using multiple embeddings.

Other noise reduction methods have also been proposed (Roth et al., 2013). For instance, we can adopt multi-instance learning techniques (Yao et al., 2010; Surdeanu et al., 2012; Amin et al., 2020), a global topic model (Alfonseca et al., 2012), or both (Roth and Klakow, 2013). We can also reduce the noise by counting the number of shared entity pairs between a sentence pattern and a KB relation (Takamatsu et al., 2012; Su et al., 2018). Nevertheless, these studies focus on mitigating the noise caused by assuming similarity between the sentence patterns and KB relations that co-occur with the same entity pairs, while our method can also reduce the noise from two sentence patterns sharing the same entity pair. Besides, our method is complementary to popular noise reduction methods because our improvement is shown on training data that have already been cleaned (Verga et al., 2016).

Our method is conceptually related to some studies in lexical semantics. For example, word sense induction or unsupervised hypernymy detection can be addressed by clustering the co-occurring words (Neelakantan et al., 2014; Athiwaratkun and Wilson, 2017; Chang et al., 2018). However, the clustering-based methods do not apply to RE because the co-occurrence matrix for RE is much sparser (see Section 2.4 for more details).

Finally, our work is inspired by Chang et al. (2021), but they focus on improving the sentence representation rather than RE. We encourage the facet embeddings to become the centers in K-means clustering instead of the NNSC (non-negative sparse coding) clustering used in Chang et al. (2021), due to its simplicity, efficiency, and better RE performance. Moreover, we discover that an additional regularization described in the appendix is crucial to overcome the sparsity challenge in RE.

5 Conclusion

In this work, we address the limitation of representing each sentence pattern using only a single embedding, and our approach improves the distantly supervised RE performance of compositional universal schema.

Relying on only a very sparse co-occurrence matrix between the sentence patterns and entity pairs, we show that it is possible to predict reasonable cluster centers of entity pair embeddings and to predict the entailment relation between two sentence patterns without any labels.

Acknowledgements

We thank Ge Gao for the preliminary exploration of this project. We also thank the anonymous reviewers for their constructive feedback.

This work was supported in part by the Center for Data Science and the Center for Intelligent Information Retrieval, in part by the Chan Zuckerberg Initiative under the project Scientific Knowledge Base Construction, in part using high performance computing equipment obtained under a grant from the Collaborative R&D Fund managed by the Massachusetts Technology Collaborative, and in part by the National Science Foundation (NSF) grant numbers DMR-1534431 and IIS-1514053.

Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.

References

  • K. Akimoto, T. Hiraoka, K. Sadamasa, and M. Niepert (2019) Cross-sentence n-ary relation extraction using lower-arity universal schemas. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 6225–6231.
  • E. Alfonseca, K. Filippova, J. Delort, and G. Garrido (2012) Pattern learning for relation extraction with a hierarchical topic model. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Jeju Island, Korea, pp. 54–59.
  • S. Amin, K. A. Dunfield, A. Vechkaeva, and G. Neumann (2020) A data-driven approach for noise reduction in distantly supervised biomedical relation extraction. In Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing, Online, pp. 187–194.
  • S. Arora, Y. Liang, and T. Ma (2017) A simple but tough-to-beat baseline for sentence embeddings. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.
  • B. Athiwaratkun and A. Wilson (2017) Multimodal word distributions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1645–1656.
  • L. Baldini Soares, N. FitzGerald, J. Ling, and T. Kwiatkowski (2019) Matching the blanks: distributional similarity for relation learning. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 2895–2905.
  • K. D. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor (2008) Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, J. T. Wang (Ed.), pp. 1247–1250.
  • I. Borg and P. J. Groenen (2005) Modern multidimensional scaling: theory and applications. Springer Science & Business Media.
  • A. Chaganty, A. Paranjape, P. Liang, and C. D. Manning (2017) Importance sampling for unbiased on-demand evaluation of knowledge base population. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 1038–1048.
  • H. Chang, A. Agrawal, and A. McCallum (2021) Extending multi-sense word embedding to phrases and sentences for unsupervised semantic applications. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence.
  • H. Chang, A. Munir, A. Liu, J. T. Wei, A. Traylor, A. Nagesh, N. Monath, P. Verga, E. Strubell, and A. McCallum (2016) Extracting multilingual relations under limited resources: TAC 2016 cold-start KB construction and slot-filling using compositional universal schema. In Proceedings of the 2016 Text Analysis Conference, TAC 2016, Gaithersburg, Maryland, USA, November 14-15, 2016.
  • H. Chang, Z. Wang, L. Vilnis, and A. McCallum (2018) Distributional inclusion vector embedding for unsupervised hypernymy detection. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 485–495.
  • R. Das, M. Zaheer, S. Reddy, and A. McCallum (2017) Question answering on knowledge bases and text using universal schema and memory networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada, pp. 358–365.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • X. Han, T. Gao, Y. Lin, H. Peng, Y. Yang, C. Xiao, Z. Liu, P. Li, J. Zhou, and M. Sun (2020) More data, more relations, more context and more openness: a review and outlook for relation extraction. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Suzhou, China, pp. 745–758.
  • T. Hastie, R. Tibshirani, and J. H. Friedman (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edition. Springer Series in Statistics, Springer. ISBN 9780387848570.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Comput. 9 (8), pp. 1735–1780.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.).
  • G. A. Miller (1998) WordNet: an electronic lexical database. MIT press.
  • M. Mintz, S. Bills, R. Snow, and D. Jurafsky (2009) Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Suntec, Singapore, pp. 1003–1011.
  • A. Neelakantan, J. Shankar, A. Passos, and A. McCallum (2014) Efficient non-parametric estimation of multiple embeddings per word in vector space. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1059–1069.
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543.
  • B. Percha and R. B. Altman (2018) A global network of biomedical relationships derived from text. Bioinform. 34 (15), pp. 2614–2624.
  • S. Riedel, L. Yao, A. McCallum, and B. M. Marlin (2013) Relation extraction with matrix factorization and universal schemas. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, pp. 74–84.
  • B. Roth, T. Barth, M. Wiegand, and D. Klakow (2013) A survey of noise reduction methods for distant supervision. In Proceedings of the 2013 workshop on Automated knowledge base construction.
  • B. Roth and D. Klakow (2013) Feature-based models for improving the quality of noisy training data for relation extraction. In 22nd ACM International Conference on Information and Knowledge Management, CIKM’13, San Francisco, CA, USA, October 27 - November 1, 2013, Q. He, A. Iyengar, W. Nejdl, J. Pei, and R. Rastogi (Eds.), pp. 1181–1184.
  • V. Shwartz, E. Santus, and D. Schlechtweg (2017) Hypernyms under siege: linguistically-motivated artillery for hypernymy detection. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain, pp. 65–75.
  • Y. Su, H. Liu, S. Yavuz, I. Gür, H. Sun, and X. Yan (2018) Global relation embedding for relation extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 820–830.
  • M. Surdeanu, J. Tibshirani, R. Nallapati, and C. D. Manning (2012) Multi-instance multi-label learning for relation extraction. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Jeju Island, Korea, pp. 455–465.
  • S. Takamatsu, I. Sato, and H. Nakagawa (2012) Reducing wrong labels in distant supervision for relation extraction. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jeju Island, Korea, pp. 721–729.
  • K. Toutanova, D. Chen, P. Pantel, H. Poon, P. Choudhury, and M. Gamon (2015) Representing text for joint embedding of text and knowledge bases. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1499–1509.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008.
  • P. Verga, D. Belanger, E. Strubell, B. Roth, and A. McCallum (2016) Multilingual relation extraction using compositional universal schema. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 886–896.
  • P. Verga, A. Neelakantan, and A. McCallum (2017) Generalizing to unseen entities and entity pairs with row-less universal schema. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, Valencia, Spain, pp. 613–622.
  • P. Verga, E. Strubell, and A. McCallum (2018) Simultaneously self-attending to all mentions for full-abstract biological relation extraction. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 872–884.
  • L. Yao, S. Riedel, and A. McCallum (2010) Collective cross-document relation extraction without labelled data. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, pp. 1013–1023.
  • D. Zhang, S. Mukherjee, C. Lockard, L. Dong, and A. McCallum (2019) OpenKI: Integrating Open Information Extraction and Knowledge Bases with Relation Inference. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 762–772.

Appendix A Method Details

Our objective function includes the loss defined in equation 2 and a regularization term. We describe the regularization term in Section A.1 and some implementation details in Section A.2. Finally, our training procedure is summarized in the algorithm listing of this appendix.

A.1 Regularization by Autoencoder

The co-occurrence matrix between the sentence patterns and entity pairs is very sparse because most of the sentence patterns only co-occur with a few entity pairs. The sparsity might make the training process of multi-facet embeddings sensitive to the hyperparameters.

We discover that adding a simple autoencoder regularization is an effective way to stabilize the training. This regularization term aims to make the average of the facet embeddings of a sentence pattern similar to the weighted average of the word embeddings in that sentence pattern. The regularization is a kind of autoencoder because it reconstructs the weighted average embedding of the words in the input sentence pattern from the output facet embeddings.

The regularization term (equation 5) is scaled by a regularization weight and is defined over the set of all positive and negative training pairs. For each sentence pattern, a randomly selected index among all the sentence patterns serves as our negative example, and a label distinguishes the positive pairs from the negative ones. The term compares the average of the facet embeddings of one sentence pattern with the weighted average embedding of the words in the paired sentence pattern, after that word average passes through a linear transformation.

Weighting each word embedding by a smoothed inverse frequency provides a better text similarity measurement (Arora et al., 2017) because frequently occurring words (e.g., stop words) often contribute little to the semantic meaning. We compute the weighted average in the same way (equation 6): the weight of each word depends on a constant set as suggested in Arora et al. (2017) and on the frequency of the word divided by the number of words in the corpus, the weights are applied to the pretrained embeddings of the words, and a linear layer transforms the result into the entity pair embedding space. We use 840B GloVe (Pennington et al., 2014) as our word embedding in this work.
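As a rough illustration of the weighted word average described above, the following Python sketch assumes the standard smoothed inverse frequency weight a / (a + p(w)) from Arora et al. (2017) and a plain matrix for the linear transformation. The function names, the default value of `a`, and the toy vectors are hypothetical and are not taken from our implementation.

```python
import numpy as np

def sif_weighted_average(words, embeddings, word_prob, a=1e-3):
    """Average word vectors using smoothed inverse frequency weights.

    Assumption: each word w gets the weight a / (a + p(w)), where p(w) is the
    word frequency divided by the corpus size. The default a is a placeholder.
    """
    weights = np.array([a / (a + word_prob[w]) for w in words])
    vectors = np.stack([embeddings[w] for w in words])
    # Weighted sum of the word vectors, normalized by the number of words.
    return (weights[:, None] * vectors).sum(axis=0) / len(words)

def project(avg_vec, linear_map):
    """Map the averaged word vector into the entity pair embedding space."""
    return linear_map @ avg_vec

# Toy example: a frequent word ("the") and a rarer word ("partner").
embeddings = {"the": np.array([1.0, 0.0]), "partner": np.array([0.0, 1.0])}
word_prob = {"the": 0.05, "partner": 1e-3}
avg = sif_weighted_average(["the", "partner"], embeddings, word_prob)
projected = project(avg, np.eye(2))  # identity map used only for illustration
```

In this toy example the rarer word dominates the average, which is the intended effect of down-weighting frequent words.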

Model                 Trainable parameters
Ours (Single-Trans)   1.0M
Ours (Trans)          1.1M
BERT Base             86.0M
Ours (Single-LSTM)    2.6M
Ours (LSTM)           3.0M
Table 4: Comparison of neural model sizes (i.e., the number of trainable parameters excluding the word embedding layer).


A.2 Implementation Details

In our Transformer encoder and decoder, we set the number of layers to be 3 and the size of the hidden state to be 300. In our bi-LSTM encoder, the number of layers is set to 2. After the bi-LSTM encoder, we use the last embedding in the hidden state to encode a sequence into a single embedding. In Table 4, we can see that the size of the neural model outputting multi-facet embeddings is similar to the size of the single embedding model, and both are much smaller than the BERT base model (Devlin et al., 2019). The size of CUSchema is smaller than ours because it uses a smaller hidden state size, but we find that increasing the model size of CUSchema does not lead to better performance.

Before training, we initialize the embeddings of entity pairs using USchema as in Verga et al. (2016). In addition, we initialize the word embedding layer in our neural model using GloVe. CUSchema initializes its word embedding layer randomly, and we find that initializing it with pre-trained word embeddings does not increase its performance.

We use Adam (Kingma and Ba, 2015) to optimize the parameters of our neural model and the entity pair embeddings. For the linear layer in equation 6, we adopt SGD to make the training more stable. Because our model is small, a single 1080Ti GPU is sufficient to train it in 3 days.

Appendix B Experiment Details

The details of preparing the entailment dataset are included in Section B.1, and the details of our experiment setup are included in Section B.2. We present the results of our ablation studies in Section B.3.

B.1 Entailment Dataset Creation

When finding the entailment candidates using WordNet, we iterate over all the words in every sentence pattern. For each word, we retrieve all of its senses/synsets and the possible hypernym synsets. Finally, we replace the word with every lemma of each hypernym synset. After the replacement, if the sentence pattern appears in our training data, we pair the sentence patterns before and after replacement as a candidate.
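The candidate search above can be sketched in Python as follows. To keep the example self-contained, a small hand-written dictionary `hypernym_lemmas` stands in for the WordNet lookup (all senses of a word and the lemmas of their hypernym synsets); the dictionary, the function name, and the toy patterns are hypothetical.

```python
def entailment_candidates(pattern, hypernym_lemmas, training_patterns):
    """Pair a sentence pattern with its hypernym-replaced variants.

    A variant is kept only if it also appears in the training data.
    Returns a list of (pattern before replacement, pattern after replacement).
    """
    tokens = pattern.split()
    candidates = []
    for idx, word in enumerate(tokens):
        # hypernym_lemmas[word] is assumed to hold every lemma of every
        # hypernym synset over all senses of the word.
        for lemma in hypernym_lemmas.get(word, []):
            replaced = " ".join(tokens[:idx] + [lemma] + tokens[idx + 1:])
            if replaced in training_patterns:
                candidates.append((pattern, replaced))
    return candidates

# Toy stand-in for WordNet and for the training patterns.
hypernym_lemmas = {"wife": ["spouse", "partner"], "city": ["municipality"]}
training_patterns = {"$ARG1 , spouse of $ARG2", "$ARG1 , wife of $ARG2"}
pairs = entailment_candidates(
    "$ARG1 , wife of $ARG2", hypernym_lemmas, training_patterns
)
```

Here only the "spouse" replacement survives, because the "partner" variant does not occur in the toy training set.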

Our goal is to find entailment rather than paraphrase relations, so we exclude the candidates where the replaced word is both a hypernym and a hyponym of the replacing word. To measure each sentence pattern’s popularity, we compute the number of unique entity pairs co-occurring with the sentence pattern as the pattern frequency and take the minimum of the frequencies of the two sentence patterns in a candidate as the candidate frequency. For each hypothesis, we only consider the top 6 premise candidates with the highest frequencies to diversify the hypotheses in our dataset. The hypothesis popularity score is the average of the candidate frequencies across its top 6 premises.
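The frequency and popularity computation can be illustrated with a short Python sketch. The data layout (a list of (sentence pattern, entity pair) co-occurrences and a list of (premise, hypothesis) candidates), the function names, and the toy data are assumptions made for this example.

```python
from collections import defaultdict

def pattern_frequency(cooccurrences):
    """Count the unique entity pairs that co-occur with each sentence pattern."""
    pairs_by_pattern = defaultdict(set)
    for pattern, entity_pair in cooccurrences:
        pairs_by_pattern[pattern].add(entity_pair)
    return {p: len(pairs) for p, pairs in pairs_by_pattern.items()}

def hypothesis_popularity(candidates, freq, top_k=6):
    """Keep the top_k most frequent premises per hypothesis and average them.

    The candidate frequency is the minimum of the two pattern frequencies.
    Returns {hypothesis: (top premises, popularity score)}.
    """
    by_hypothesis = defaultdict(list)
    for premise, hypothesis in candidates:
        cand_freq = min(freq[premise], freq[hypothesis])
        by_hypothesis[hypothesis].append((cand_freq, premise))
    result = {}
    for hypothesis, scored in by_hypothesis.items():
        top = sorted(scored, reverse=True)[:top_k]
        score = sum(f for f, _ in top) / len(top)
        result[hypothesis] = ([p for _, p in top], score)
    return result

# Toy co-occurrences; the duplicate ("wife of", ("A", "B")) is counted once.
cooccurrences = [
    ("wife of", ("A", "B")), ("wife of", ("C", "D")), ("wife of", ("A", "B")),
    ("spouse of", ("A", "B")), ("spouse of", ("C", "D")), ("spouse of", ("E", "F")),
    ("husband of", ("E", "F")),
]
freq = pattern_frequency(cooccurrences)
popularity = hypothesis_popularity(
    [("wife of", "spouse of"), ("husband of", "spouse of")], freq
)
```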

Before labeling, we sort the hypotheses based on their popularity scores and label the most popular 1,500 candidates (with the highest minimal frequencies). The labeling is done by a PhD student who has RE research experience because we hypothesize that it is hard to clearly explain the task to crowdsourcing workers. After the dataset is built, we separate the validation set and test set such that all hypernyms in the test hypotheses are unseen in the validation set.

B.2 Experiment Setup Details

In our visualization experiment, we filter out a few entity pairs that co-occur with fewer than 5 sentence patterns or that become far away from their closest facet embedding after the projection. To prevent the facet embeddings from overlapping, we add a small random vector to each facet embedding.

In the co-occurrence matrix, we use 5% of the unique sentence patterns as our validation set. All the sentence patterns in the validation set are unseen in the training set. We use the validation set to tune the number of epochs during training.
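A minimal Python sketch of such a pattern-level split is given below. It assumes the co-occurrence matrix is stored as a list of (sentence pattern, entity pair) entries; the function name, the random seed, and the toy data are hypothetical.

```python
import random

def split_by_pattern(cooccurrences, valid_fraction=0.05, seed=0):
    """Hold out a fraction of the unique sentence patterns for validation.

    Every co-occurrence of a held-out pattern goes to the validation set,
    so no validation pattern is seen during training.
    """
    patterns = sorted({p for p, _ in cooccurrences})
    rng = random.Random(seed)
    rng.shuffle(patterns)
    n_valid = max(1, round(valid_fraction * len(patterns)))
    valid_patterns = set(patterns[:n_valid])
    train = [(p, e) for p, e in cooccurrences if p not in valid_patterns]
    valid = [(p, e) for p, e in cooccurrences if p in valid_patterns]
    return train, valid

# Toy data: 40 patterns, each co-occurring with two entity pairs.
toy = [(f"pattern_{i}", (f"e{i}", f"e{i + k}")) for i in range(40) for k in (1, 2)]
train_set, valid_set = split_by_pattern(toy)
```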

We use coordinate descent (i.e., changing one hyperparameter at a time) to search for the hyperparameters that result in the best F1 score on TAC 2012 during training. Compared with the grid search used in CUSchema, this tuning method is less computationally expensive and less likely to overfit the validation data.
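The idea of changing one hyperparameter at a time can be sketched as follows. This is only an illustration under stated assumptions: `score_fn` stands in for training a model and measuring its F1 score, and the hyperparameter names, candidate values, and toy scoring function are hypothetical.

```python
def coordinate_descent(score_fn, search_space, start, max_rounds=3):
    """Tune one hyperparameter at a time while keeping the others fixed.

    score_fn maps a configuration dict to a score (higher is better).
    """
    best = dict(start)
    best_score = score_fn(best)
    for _ in range(max_rounds):
        improved = False
        for name, values in search_space.items():
            for value in values:
                if value == best[name]:
                    continue  # this configuration has already been scored
                trial = dict(best, **{name: value})
                trial_score = score_fn(trial)
                if trial_score > best_score:
                    best, best_score, improved = trial, trial_score, True
        if not improved:
            break  # a full sweep over all coordinates changed nothing
    return best, best_score

# Toy score that peaks at dropout = 0.3 and epochs = 50.
def toy_score(cfg):
    return -((cfg["dropout"] - 0.3) ** 2) - ((cfg["epochs"] - 50) ** 2) / 1e4

search_space = {"dropout": [0.1, 0.3, 0.5], "epochs": [30, 50, 70]}
best_cfg, best_val = coordinate_descent(
    toy_score, search_space, {"dropout": 0.1, "epochs": 30}
)
```

With two hyperparameters of three values each, the sweep scores far fewer configurations than a full grid would as the number of hyperparameters grows, which is the cost advantage mentioned above.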

We first optimize the hyperparameters in Ours (Trans). The search ranges are , , , encoder dropout rate , learning rate for updating and maximal epoch number . Our best Transformer model Ours (Trans) used , , , encoder dropout rate , learning rate for updating and maximal number of epochs = 50. Then, we start from these best hyperparameters for Ours (Trans) and tune only encoder dropout rate , and maximal epoch number for Ours (LSTM). The best performing LSTM model used maximal 30 epochs while all other hyperparameters are found to be the same as the best Transformer model. Finally, we fix and tune the hyperparameters using the same range as above, for Ours (Single-Trans) and Ours (Single-LSTM).

B.3 Ablation Study

In Table 5, we justify using the autoencoder loss and using different facet numbers for sentence patterns and for KB relations. We can see that the performance drops if we remove either technique from our Transformer-based model.

Method 2012 2013 2014
Ours 30.4 37.3 31.3
Ours () 29.9 36.1 29.8
Ours (No autoencoder) 27.5 33.5 30.2
Table 5: Ablation study on TAC datasets. All numbers are F1 (%). All models use the Transformer encoder.