## 1. Introduction

Relation extraction (RE) is the task of predicting the semantic relation for a pair of entities in a text sequence. As training data is difficult to obtain, data annotation strategies such as crowd-sourcing and distant supervision are used to reduce the workload of domain experts. In distant supervision, for example, a pair of entities in a text sequence is labeled with a relation type based on given knowledge bases (KBs). In a KB, a triplet represents a pair of entities together with their relation type. In general, the labeling rule assumes that every sentence containing both entities expresses that same relation type. As a result, the labels usually suffer from the noisy labeling problem, as in the example in Figure 1.

To alleviate the impact of the noisy data caused by the annotation, various solutions have been proposed. One earlier work (Shingo_ACL_2012) tries to remove noisy sentences from the training data. However, it is not scalable, since it relies heavily on manual rules. Other works use the model itself to reduce noise rather than filtering out noisy data. For example, Riedel_ECML_2010 make the at-least-one assumption: if two entities participate in a relation, at least one sentence that mentions these two entities might express that relation. Lin_ACL_2016 propose a sentence-level selective attention mechanism over a bag of sentences sharing the same entity pair, gaining significant improvement over previous methods. Although both the at-least-one assumption and the selective attention mechanism can significantly reduce noise, they cannot handle the situation in which, for a given entity pair, none of the sentences in the bag expresses the semantic relation decided by KBs. Wu_AAAI_2019 propose a model with a neural noise converter and a conditional optimal selector. They regard the relation label decided by KBs as a noisy label for the corresponding group of sentences and build a neural noise converter to connect the distributions of the true labels and the noisy labels. The final prediction based on the conditional optimal selector can outperform the selective attention mechanism. However, exactly identifying the connection between the true labels and noisy labels is an ‘ill-posed’ problem. For example, given a softmax output, the logits that produced it are not identifiable. Thus, Wu_AAAI_2019 impose a very strict constraint on the label transition probabilities, which assumes that only the no-relation type can be mistakenly labeled as other types of relations.

In this paper, we propose a novel approach to model the distribution of both true labels and noisy labels for relation extraction. We first define a probabilistic model with two latent variables, which represent the underlying true label and an indicator of whether the true label equals the noisy label. From the Bayesian viewpoint, we develop an Expectation-Maximization (EM) algorithm to optimize an explicit classification loss function. The EM algorithm characterizes the connection between the distributions of true labels and noisy labels, and the loss function is derived from the negative log-likelihood of noisy data points by integrating out the latent variables. In addition, in line with the research of Wu_AAAI_2019, we adopt an Implicit Classification Loss as well. However, the transition we design is based on the invertible Planar Flow (rezende2015variational), and the condition we propose is more robust. Collaboratively optimizing the two loss functions achieves better performance in practice.

For sentence feature extraction and representation, we take advantage of pre-trained language models. Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai_NIPS_2015; Peters_arxiv_2017; OpenAI_2018_tech; Ruder_ACL_2018; bert_Jacobv_corr_bert_2018). In relation extraction, both convolutional neural networks (CNN) (Santos_ACL_2015; Nguyen_NAACL_2015; Zeng_emnlp_2015; Adel_naacl_2016; Wang_ACL_2016) and recurrent neural networks (RNN) (Cai_ACL_2016; Zhou_ACL_2016) have been successfully applied. The recently proposed language model BERT (bert_Jacobv_corr_bert_2018) is particularly effective. We build our sentence representation upon the BERT model to obtain a relational BERT, by adding other necessary components that incorporate relational information.

In summary, our contributions are threefold: (1) We propose an innovative approach with a doubly transitional loss to effectively handle the noisy data in relation extraction. (2) We adopt a pre-trained language model for relation feature extraction. (3) We achieve a new state-of-the-art in the distant supervision task on the NYT dataset.

## 2. Related Work

In recent years, deep neural networks have been successfully applied to relation extraction, typically based on CNNs or RNNs (Santos_ACL_2015; Nguyen_NAACL_2015; Zeng_emnlp_2015; Adel_naacl_2016; Wang_ACL_2016; Cai_ACL_2016; Zhou_ACL_2016). To deal with noisy labels in relation extraction, some approaches reduce the impact of wrong labels by cleaning the noisy training data, such as (Shingo_ACL_2012). Other works rely directly on the model to reduce the impact of the noisy data during training, such as (Riedel_ECML_2010; Hoffmann_ACL_2011; Surdeanu_emnlp_2012; Zeng_emnlp_2015). Lin_ACL_2016 propose an instance-level selective attention model for distantly supervised relation extraction. Wu_AAAI_2019 add a neural noise converter and a conditional optimal selector on top of PCNN (a variant of the convolutional neural network) for distantly supervised relation extraction. Other NLP works with a noise model include (Fang_CONLL_2016) and (Luo_ACL_2017). In computer vision, researchers have also proposed many robust strategies, including a bootstrapping mechanism (Reed_CoRR_2014), a linear noise layer (Sukhbaatar_ICLR_2015) and an amortized transition matrix (Misra_CVPR_2016).

In general, dealing with noisy labels is important in machine learning, especially when the training data is large. Methods that address the noisy label problem can roughly be classified into three categories (Xiang_2018_ieee). The first category designs classification models that are robust to label noise through mechanisms such as robust loss functions (Beigman_2009_ACL). The second category (Wilson_ICML_1997) tries to identify the wrongly labeled instances in order to improve the quality of the training data. The third category (Lawrence_ICML_2001) aims to model the distribution of noisy labels during the training process. In this way, the information in the noisy labels is used for training.

Recently, as large-scale deep learning has become popular, training models with noisily labeled data has drawn a lot of attention. The reason is that deep learning typically relies on large training data, and it is expensive to obtain a sufficient amount of accurately labeled data. For deep learning methods, Chiyuan_2017_ICLR show that a deep network with large enough capacity can memorize the labels of the training set even when they are randomly generated. Hence, general deep networks are particularly susceptible to noisy labels. Mnih_2012_ICML propose two robust loss functions for aerial images with noisy labels. The limitation of their model is that it is only applicable to binary classification. Sukhbaatar_ICLR_2015 propose a model that tries to learn the noise distribution. They build a transition matrix which represents the conversion probability from true labels to noisy labels. The transition matrix is constructed by adding a constrained linear “noise” layer on top of the softmax layer, and it is learned by back-propagating the cross-entropy loss through the top layer down into the base model.

Reed_CoRR_2014 use a simple mechanism to handle noisy labels by incorporating a notion of perceptual consistency into the usual prediction objective. They combine the training labels and the predictions from the current model to generate training targets. GhoshKS_AAAI_2017 investigate the robustness of different loss functions, such as the mean squared loss, mean absolute loss and cross-entropy loss. ZhangS_NIPS_2018 combine the advantages of the mean absolute loss and the cross-entropy loss to obtain a better loss function. None of these previous works combine an explicit loss and an implicit loss, as we do, to tackle the noisy labeling problem.

## 3. Problem Formulation

Before we introduce our relation extraction system for noisy labels, we first formally define the problem in a probabilistic model, and then we concentrate on the details of each module of the proposed approach. Given a set of sentences, each containing a corresponding pair of entities, the purpose of this task is to predict the relation type for each entity pair. Traditionally this is a supervised learning task: a mapping from a sentence and its entity pair to a relation type is fitted on the training dataset. However, we mainly explore the practical scenario in which the labels of the training data are possibly contaminated; in other words, the revealed relation type may not correspond to the underlying correct one.

The noisy training dataset is usually denoted as a tuple of four elements: the sentence, its two entities, and the noisy relation label. The challenge is that the ultimate goal remains the same as with clean data, namely to predict the true label as accurately as possible. In order to build the connection between the noisy and true labels, we propose a probabilistic framework as follows.

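$$
p(\tilde{r} \mid s, e_1, e_2) = \int p(\tilde{r}, r \mid s, e_1, e_2)\, dr = \sum_{r} p(\tilde{r} \mid r, s, e_1, e_2)\, p(r \mid s, e_1, e_2) \tag{1}
$$

where $s$ is the sentence, $e_1$ and $e_2$ are its pair of entities, $\tilde{r}$ is the noisy relation label and $r$ is the true relation type. We substitute the summation for the integration in the second equality, because the relation type is a discrete variable. In addition, the true relation type $r$ is a latent variable that we need to infer. Thus, $p(r \mid s, e_1, e_2)$ is what we are ultimately interested in.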

## 4. Methodology

Figure 2 shows the overall neural network architecture of our model. It primarily consists of two parts: the Sentence Feature Extractor and the Noisy Label Classifier. The Sentence Feature Extractor processes a given sentence with a pair of target entities and produces a vector representation. Since we can only access the noisy labels, it is not principled to use them directly to build a classifier. Instead, after constructing the vector representation of the sentence, we employ a noisy classifier with our proposed doubly transitional module, which either converts the hidden state of the true labels to the hidden state of the noisy labels or directly converts the classification result from the true labels to other labels by following a well-designed probability distribution. Our experimental results demonstrate that the two transitions can achieve mutual improvement in training the underlying true label classifier.

### 4.1. Sentence Feature Extractor

BERT (bert_Jacobv_corr_bert_2018) is a recently proposed pre-trained language model built on a multi-layer bidirectional Transformer encoder (Vaswani_NIPS_2017), achieving state-of-the-art results in many NLP tasks. In this section, we briefly introduce the modifications we make on top of the BERT model so that it can readily adapt to a relation extraction task.

#### 4.1.1. BERT for Relation Extraction

The left part of Figure 2 illustrates the major architecture of our sentence feature extractor. First, we follow the convention of BERT and insert a special token ‘[CLS]’ at the beginning of each input sequence. In our case, each sentence contains two target entities. To make the BERT module capture the location information of the two entities, we insert two different special tokens ‘$’ and ‘#’ at the beginning and the end of both entities.
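The preprocessing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `insert_entity_markers` is hypothetical, entity spans are assumed to be non-overlapping `(start, end)` token indices with exclusive end, and we assume that ‘$’ wraps the first entity while ‘#’ wraps the second.

```python
def insert_entity_markers(tokens, span1, span2, m1="$", m2="#"):
    """Return the token list with '[CLS]' prepended and both entities wrapped.

    tokens       : list of tokens for one sentence
    span1, span2 : (start, end) token indices of the two entities, end exclusive
    Assumption: m1 wraps the first entity and m2 wraps the second one.
    """
    (s1, e1), (s2, e2) = span1, span2
    out = ["[CLS]"]
    for i, tok in enumerate(tokens):
        # Opening markers go right before the first token of an entity.
        if i == s1:
            out.append(m1)
        if i == s2:
            out.append(m2)
        out.append(tok)
        # Closing markers go right after the last token of an entity.
        if i == e1 - 1:
            out.append(m1)
        if i == e2 - 1:
            out.append(m2)
    return out
```

For the sentence "Steve Jobs founded Apple in California" with entity spans `(0, 2)` and `(3, 4)`, the function yields `[CLS] $ Steve Jobs $ founded # Apple # in California`. In a real pipeline the markers would be inserted before or consistently with word-piece tokenization so that the spans remain aligned.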

Given a sentence with its two target entities, after the aforementioned preprocessing, the pre-trained BERT model outputs a latent representation for each token. The hidden state corresponding to ‘[CLS]’ is supposed to summarize the information of the entire sentence. For the first entity, we take the output hidden states of its tokens and obtain a fixed-size summary vector by applying an averaging operation followed by a fully connected (FC) layer with tanh activation. The summary vector of the second entity is calculated with the same FC layer, i.e., sharing the weights and bias. Since the averaging operation is not necessary for the ‘[CLS]’ hidden state, we simply pass it through a different fully connected layer. Note that we also apply dropout before each fully connected layer during training. Finally, the extracted sentence feature is obtained by concatenating these three vectors.
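A rough NumPy sketch of this pooling step is given below. It is illustrative only: the names `fc_tanh` and `sentence_feature` are hypothetical, dropout is omitted (inference mode), the entity spans are assumed to index entity tokens only, and using tanh in the ‘[CLS]’ branch as well is an assumption rather than something stated in the text.

```python
import numpy as np

def fc_tanh(x, W, b):
    """Fully connected layer followed by tanh."""
    return np.tanh(x @ W + b)

def sentence_feature(hidden, span1, span2, params):
    """Build the sentence feature from per-token hidden states.

    hidden : (seq_len, d) array of encoder outputs, hidden[0] is '[CLS]'
    span1, span2 : (start, end) token index ranges of the two entities
    params : (W_cls, b_cls, W_ent, b_ent); the entity layer is shared
    """
    W_cls, b_cls, W_ent, b_ent = params
    # '[CLS]' needs no averaging and uses its own layer (tanh assumed here).
    h_cls = fc_tanh(hidden[0], W_cls, b_cls)
    # Each entity is the mean of its token states, passed through the shared layer.
    h_e1 = fc_tanh(hidden[span1[0]:span1[1]].mean(axis=0), W_ent, b_ent)
    h_e2 = fc_tanh(hidden[span2[0]:span2[1]].mean(axis=0), W_ent, b_ent)
    # The final feature is the concatenation of the three vectors.
    return np.concatenate([h_cls, h_e1, h_e2])
```

Sharing `W_ent` and `b_ent` between the two entities mirrors the parameter sharing described above, while the separate `W_cls` and `b_cls` reflect the different layer used for ‘[CLS]’.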

### 4.2. Explicit Label Transition

We first extract the sentence feature as the concatenation of the three vector representations from relational BERT. Thus, the distribution of interest can be simplified by conditioning on this feature vector alone. However, we can barely access the true labels of the training data. Instead of directly training a classifier by pretending that all noisy labels are true, we propose a probability transition matrix to capture to what extent a true label will become an incorrect one. In this way, the true label can be treated as a latent variable to be inferred.

We employ a probabilistic framework to describe our approach, introducing another latent variable that indicates whether the noisy label is correct or not. In particular, the indicator takes the value 1 when the noisy label is exactly the true label, while 0 means the noisy label can be any other incorrect label. In accordance with Bayes' rule, the overall probabilistic latent variable model can be factorized as follows.

(2) |

where is the parameters of the neural networks in our model, and the transition matrix related term is

(3) |

Note that i) $I$ is the identity matrix, ii) to make it mathematically concrete, the true label and the noisy label are both one-hot vectors, and iii) the probability transition matrix has non-negative elements and a zero-valued diagonal, and each of its columns sums to 1. Each off-diagonal element gives the probability that a true label is annotated as another label. Since we have no information on the latent indicator either, we propose to use an Expectation-Maximization (EM) framework to infer the distribution of the two latent variables.

#### 4.2.1. E-step and M-step

Since the missing value follows a discrete distribution, it is possible to further factorize the probabilistic model by enumerating its finite set of values. In our case, the indicator can only take binary values, allowing us to integrate out this latent variable in the E-step. According to Eq (2), we can write the joint distribution for the latent variables in detail.

(4) |

where denotes the one hot vector where the -th element is 1, and is the index of non-zero element of . is used for normalization in Bayesian Rule. We simplify the notation by removing the model parameter and the probability transition matrix .

The $Q$ function in the EM algorithm is the expected value of the log-likelihood function with respect to the current conditional distribution of the latent variables, given the observed evidence and the current estimate of the transition matrix. We can write the $Q$ function over the training dataset as follows to conclude the E-step.

The subsequent M-step maximizes this quantity with respect to the transition matrix. Fortunately, the solution can be derived analytically by the method of Lagrange multipliers. Due to space limitations, we only sketch the derivation of the solution as follows.

From the definition of , we can further write it as

(5) |

where is the index of non-zero value for one-hot vector , and the Constant only depends on model parameter and the at previous iteration, but does not depend on any element of . Thus, with Lagrange multipliers method, we consider the new objective

(6) |

After making the derivative of with respective to each element of , such that

(7) | ||||

(8) |

Together with the property , we have

(9) |
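The closed-form update obtained from the Lagrangian has the usual EM shape of normalized expected counts. The sketch below illustrates that shape under stated assumptions and is not a transcription of Eq (9): the function name `m_step_transition` is hypothetical, `q[n, j]` is assumed to be the E-step posterior that sample `n` has true label `j` and a flipped noisy label, and column `j` of the matrix is assumed to hold the distribution over noisy labels given true label `j`.

```python
import numpy as np

def m_step_transition(q, noisy, num_classes, eps=1e-12):
    """Re-estimate a column-stochastic transition matrix with zero diagonal.

    q     : (N, K) posterior responsibilities from the E-step (assumed layout)
    noisy : (N,) integer noisy labels
    """
    counts = np.zeros((num_classes, num_classes))
    # counts[i, j] accumulates the responsibility of "true = j" over samples
    # whose observed noisy label is i.
    np.add.at(counts, noisy, q)
    # A flipped label can never equal the true label, so the diagonal is zero.
    np.fill_diagonal(counts, 0.0)
    # Normalizing each column enforces the sum-to-one constraint that the
    # Lagrange multipliers encode; empty columns are left at zero.
    col_sums = counts.sum(axis=0, keepdims=True)
    return counts / np.maximum(col_sums, eps)
```

Columns that receive no responsibility mass stay at zero here; a practical implementation would need a fallback for that case, for example keeping the previous estimate.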

#### 4.2.2. Explicit Classification Loss

We use a binary classifier network and a softmax classifier network to parameterize the two probability distributions required by the EM algorithm, as shown on the right side of Figure 2. For a single observation, we would usually minimize the negative log-likelihood of this data point. However, since it is intractable in our case, we instead optimize an upper bound on it as our explicit classification loss function. We also sketch the derivation of our loss function.

In general, we want to optimize the negative log-likelihood as our loss function in the following formulation.

(11) |

Notice the fact that , thus we can use the Jensen’s inequality to obtain

(12) |

where , then we use the Jensen’s inequality again.

(13) |
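Both steps above rest on the same fact: because the negative logarithm is convex, the negative log of a weighted average is never larger than the weighted average of the negative logs, provided the weights are non-negative and sum to one. The following numerical check illustrates only this generic inequality; the function names are hypothetical and the weights and probabilities are arbitrary stand-ins, not the quantities in Eq (12) or Eq (13).

```python
import numpy as np

def nll_of_mixture(weights, probs):
    """Negative log-likelihood of a mixture: -log(sum_i w_i * p_i)."""
    return -np.log(np.dot(weights, probs))

def jensen_upper_bound(weights, probs):
    """Jensen upper bound on the above: sum_i w_i * (-log p_i)."""
    return -np.dot(weights, np.log(probs))
```

With weights `[0.3, 0.7]` and probabilities `[0.2, 0.9]`, the mixture term is about 0.371 while the bound is about 0.557; the two coincide when all probabilities are equal. Minimizing the bound therefore pushes down the intractable negative log-likelihood without evaluating it directly.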

We use the upper bound of negative log-likelihood as our proposed explicit classification loss function:

where the row term is the summation of the corresponding row of the transition matrix, the logits are those produced before the softmax operation for predicting the true label, and XE denotes the cross-entropy loss with respect to logits. Note that i) since the latent indicator is unknown, we integrate it out in the loss function by taking advantage of its two possible binary values, ii) only the network parameters are optimized by gradient descent through this loss, and iii) we train iteratively by alternating the following two steps: fixing the transition matrix and optimizing the above loss for a number of iterations with stochastic gradient descent based optimization, and then updating the transition matrix with the proposed EM algorithm.

### 4.3. Implicit Label Transition

Instead of directly building a connection between the true labels and the noisy labels, we alternatively use another transition mechanism to capture the underlying relation between the two types of labels. In this case, we assume the existence of a probability distribution over noisy labels and of “ghost” logits that produce it through the softmax operation. We naturally have the following total probability equation,

(14) |

Our intuition is to build an implicit transformation between the two sets of logits rather than between the labels. However, the softmax operation is not invertible, which prevents us from using such a transition unless a constraint is imposed on the logits.

We notice the fact that if every single element of the logits changes, the final probability distribution will be completely rescaled. Therefore, we use a two-step transformation modified from planar flow (rezende2015variational),

(15) |

where is a scalar, are learnable vectors, and indicates first element of while means a vector by setting the first element of as 0. The proposed transformation does not theoretically guarantee the identifiability. However, the following properties can guarantee it if our optimization can proceed properly.

###### Property 1.

Mapping is invertible if (rezende2015variational).

###### Property 2.

Mapping cannot guarantee the identifiability, since adding the same constant to every logit results in the same probability after the softmax operation. It is an identifiable parameterization for Eq (14) if where is a constant. (Wu_AAAI_2019 did not mention this crucial condition.)
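The two properties can be illustrated numerically. The sketch below uses the standard planar flow of rezende2015variational, $f(h) = h + u \tanh(w^\top h + b)$, which is invertible when $w^\top u \ge -1$; it is not the modified two-step transformation of Eq (15), and the function names are hypothetical. It also shows the shift invariance of softmax that makes unconstrained logits non-identifiable.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax of a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def planar_flow(h, u, w, b):
    """Standard planar flow: f(h) = h + u * tanh(w.h + b)."""
    return h + u * np.tanh(w @ h + b)

def constrain_u(u, w):
    """Reparameterize u so that w.u > -1 (Rezende and Mohamed, 2015)."""
    wu = w @ u
    m = -1.0 + np.log1p(np.exp(wu))  # m(x) = -1 + softplus(x)
    return u + (m - wu) * w / (w @ w)

def planar_jacobian_det(h, u, w, b):
    """det(df/dh) = 1 + (1 - tanh^2(w.h + b)) * (w.u)."""
    return 1.0 + (1.0 - np.tanh(w @ h + b) ** 2) * (w @ u)
```

After `constrain_u`, the Jacobian determinant stays strictly positive for every input, which is what makes the mapping invertible. Separately, `softmax(x)` and `softmax(x + c)` agree for any scalar `c`, so some constraint on the logits is needed before a transformation between logits can be identified.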

A simple implicit classification loss can then be defined on the transformed logits; its parameters include those of the transformation, while the transition matrix of the explicit loss is no longer needed. Note that we use a pre-training schedule in which the implicit loss alone is optimized for several epochs, with the remaining parameters held fixed. Then we set all parameters to be trainable, and train the two loss functions alternately. The overall training procedure is summarized in Algorithm 1.

In comparison with Wu_AAAI_2019, our approach has two advantages. First, the first non-linear invertible mapping can make a more complicated approximation, since the implicit transition in Eq (14) is obviously non-linear. Though a linear transition in logit space can achieve a non-linear transition in probability space, our approach is more flexible. The second mapping is actually a linear transformation that perturbs each element of the logits with a differently scaled value. Such simple transformations work well in practice, as in NICE (dinh2014nice). However, Wu_AAAI_2019 omit a crucial condition in their proof. Second, our proposed explicit loss function shares the parameters of the true label classifier with the implicit loss, and empirically we find that this benefits the overall training.

One may suggest using a more complicated multi-layer neural network to simulate the transition. However, we empirically find that a fully connected layer as the transition is difficult to optimize, resulting in worse performance. We strongly suspect that this is due to the identifiability problem.

## 5. Experiments

In this section, we demonstrate the effectiveness of our solution by incorporating the doubly transitional loss with the pre-trained language model on two datasets, the NYT dataset and the SemEval 2018 dataset. The NYT dataset is a distantly supervised relation extraction task, and the SemEval 2018 dataset is a general relation extraction task. For distant supervision, we describe the dataset and the metrics used for our experimental evaluation, compare with other baselines, and conduct an ablation study to further analyze each component of our approach. For the SemEval competition, we briefly describe the dataset and then compare our results with other methods.

### 5.1. Distantly Supervised Relation Extraction

#### 5.1.1. Dataset and Evaluation Metrics

To evaluate our model, we conduct experiments on the NYT dataset developed by Riedel_ECML_2010, which has also been used by (Hoffmann_ACL_2011; Surdeanu_emnlp_2012; Zeng_emnlp_2015; Lin_ACL_2016; Wu_AAAI_2019). Similar to previous works, we use the preprocessed version (https://github.com/thunlp/OpenNRE) made publicly available by the Tsinghua NLP Lab.

This dataset was generated from New York Times corpus (NYT) with relations aligned with Freebase. The dataset is separated into training set and testing set by first dividing the Freebase related entity pairs into training and testing parts. The training data set is then created by aligning the sentences from the corpus of the years 2005-2006 with the training entity pairs, and the testing data set is created by aligning the sentences from the corpus of the year 2007 with the testing entity pairs.

In total, there are 53 possible relation types, including a special relation type ‘NA’ which indicates no relation between the two entities. The training data set contains 570,088 sentences and the testing data set contains 172,448 sentences.

Similar to previous works (Mintz_ACL_2009; Lin_ACL_2016; Wu_AAAI_2019), the evaluation is on the held-out testing data. The evaluation compares the extracted relations of the entity pairs from the sentences in the test data set against the Freebase relation data. It makes the assumption that the inference model has similar performance in relation instances inside and outside Freebase. We report the precision/recall curves, Precision@N (P@N), and average precision as the metrics in our experiments.
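For concreteness, the ranking metrics can be computed as in the sketch below. It assumes that predictions have been sorted by decreasing confidence and reduced to a list of 0/1 correctness flags, and it uses one common definition of average precision (the mean of the precision values at the ranks of correct predictions); the exact variant used in the experiments is not specified here, and the function names are hypothetical.

```python
def precision_at_n(ranked_correct, n):
    """Fraction of correct predictions among the top-n ranked predictions."""
    if n <= 0:
        raise ValueError("n must be positive")
    top = ranked_correct[:n]
    return sum(top) / len(top)

def average_precision(ranked_correct, total_positives=None):
    """Mean of precision values taken at each rank holding a correct prediction.

    total_positives : number of gold positives; defaults to the number of
                      correct predictions present in the ranking.
    """
    hits = 0
    precisions = []
    for rank, correct in enumerate(ranked_correct, start=1):
        if correct:
            hits += 1
            precisions.append(hits / rank)
    denom = total_positives if total_positives is not None else hits
    return sum(precisions) / denom if denom else 0.0
```

For the ranking `[1, 0, 1, 1, 0]`, P@3 is 2/3 and the average precision is (1/1 + 2/3 + 3/4) / 3, roughly 0.806.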

#### 5.1.2. Parameter Settings

We use a batch size of 32 and a maximum sentence length of 128 in all experiments. Longer sentences are truncated. If an entity is truncated, we set its position to the end of the sentence after tokenization. We use the Adam (kingma2014adam) optimizer with an initial learning rate of 5e-5 and a weight decay of 0.01 for model regularization. The dropout rate during training is 0.1 for the fully connected layers and the BERT module (the uncased base version).

We pretrain our model with Implicit Classification Loss and without Explicit Classification Loss for 2 epochs as Algorithm 1. We then fine-tune our model with all trainable parameters. During the fine-tuning, we use the following strategy for the initialization of . We define a ratio , and assign , and the rest elements to be . We set in practice.

When training with explicit classification loss, we first initialize transition matrix by setting the diagonal elements to be 0 and all other elements to be . We efficiently update by calculating the statistics from every 100,000 sentences instead of the whole dataset.
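A possible initialization is sketched below. The off-diagonal value did not survive in the text above, so the uniform choice 1/(K-1) is an assumption, picked because it is the only constant that makes every column of a zero-diagonal K-by-K matrix sum to one. The function name `init_transition` is hypothetical.

```python
import numpy as np

def init_transition(num_classes):
    """Zero-diagonal transition matrix with uniform off-diagonal entries.

    Assumption: every off-diagonal entry equals 1 / (num_classes - 1),
    so that each column sums to one.
    """
    phi = np.full((num_classes, num_classes), 1.0 / (num_classes - 1))
    np.fill_diagonal(phi, 0.0)
    return phi
```

With the 53 relation types of the NYT dataset, `init_transition(53)` gives off-diagonal entries of 1/52, which the EM updates then refine from batches of sentence statistics.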

#### 5.1.3. Predictions

The original prediction output of our system is the probability distribution over relation types for each test sentence with its corresponding entity pair. After obtaining these distributions for all sentences, we assign relation types to entity pairs with the following strategy. For an entity pair, if all sentences containing the pair are predicted to be negative (i.e., the ‘no-relation’ type has the highest probability for each sentence), we predict ‘no-relation’ for the entity pair. Otherwise, if any sentence is predicted to express some positive relation, we base the prediction on the sentences with positive relation predictions, regardless of the sentences predicted to be ‘no-relation’. The selection among positive labels is based on their probability values: we pick the positive label with the maximum probability value among all positive labels as the prediction for the entity pair.
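This aggregation rule can be written compactly, as in the sketch below. It reflects one reading of the strategy, namely that each positively predicted sentence contributes its top label and that label's probability; the function name `predict_entity_pair` and the convention that index 0 is the ‘no-relation’ type are assumptions.

```python
def predict_entity_pair(sentence_probs, na_index=0):
    """Aggregate per-sentence distributions into one label for an entity pair.

    sentence_probs : list of per-sentence probability lists over relation types
    na_index       : index of the 'no-relation' type (assumed to be 0)
    """
    best_label, best_prob = na_index, None
    for probs in sentence_probs:
        # Label with the highest probability for this sentence.
        top = max(range(len(probs)), key=lambda k: probs[k])
        if top == na_index:
            continue  # sentences predicted as 'no-relation' are ignored
        # Keep the most confident positive prediction seen so far.
        if best_prob is None or probs[top] > best_prob:
            best_label, best_prob = top, probs[top]
    # If no sentence was predicted positive, the pair stays 'no-relation'.
    return best_label
```

For a bag with distributions `[0.9, 0.05, 0.05]`, `[0.2, 0.7, 0.1]` and `[0.1, 0.1, 0.8]`, the first sentence is ignored and the pair is assigned label 2, since 0.8 exceeds 0.7.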

#### 5.1.4. Comparison with Baseline Methods in Literature

To evaluate our proposed approach, we select several baseline methods for comparison on the held-out test dataset, including (Mintz_ACL_2009; Hoffmann_ACL_2011; Surdeanu_emnlp_2012; Lin_ACL_2016; Wu_AAAI_2019).

Mintz_ACL_2009 propose a traditional distantly supervised model. Hoffmann_ACL_2011 propose a probabilistic graphical model for multi-instance learning that can handle overlapping relations. Surdeanu_emnlp_2012 model both multiple instances and multiple relations. Lin_ACL_2016 first represent a sentence with PCNN and then use sentence-level selective attention to model a group of sentences with the same entity pair. Wu_AAAI_2019 also represent a sentence with the PCNN architecture and then apply a neural noise converter and a conditional optimal selector to reduce the impact of noisy data.

Figure 3 shows the precision/recall curves for all methods, including ours. Among the baseline methods, the method in (Lin_ACL_2016) performs much better than the methods in (Mintz_ACL_2009), (Hoffmann_ACL_2011) and (Surdeanu_emnlp_2012), which demonstrates the effectiveness of sentence-level selective attention. The method proposed in (Wu_AAAI_2019) further improves on the method in (Lin_ACL_2016) and the other baselines. Although the method in (Wu_AAAI_2019) shows significant improvement over the other baselines, our method still gains a large improvement over it. In particular, Table 1 compares the precision@N (P@N) of our model with the methods in (Lin_ACL_2016) and (Wu_AAAI_2019). Our method achieves the highest values for P@100, P@200 and P@300, with a mean value 5.5 points higher than the method in (Wu_AAAI_2019) and 14.6 points higher than the method in (Lin_ACL_2016).

We further compare the average precision of our method against the algorithms in (Lin_ACL_2016) and (Wu_AAAI_2019) in the last column of Table 1, achieving 28.2 and 21.2 points higher average precision respectively. Compared with P@100, P@200 and P@300, our method shows a much larger improvement on average precision. The reason can be summarized intuitively as follows. P@100, P@200 and P@300 correspond to the precision on the precision/recall curve when the recall is low (roughly in the range of 0.05 to 0.15) in Figure 3. Our method improves precision much more when the recall is high than when the recall is low. For example, from Figure 3, we can see that when the recall is 0.5, the precision of our method is 0.68, while the precision of the method in (Wu_AAAI_2019) is 0.43. In addition, we visualize the explicit transition matrix in the trained model and the matrix estimated from the test dataset in Figure 4. Most of the elements of the two matrices are similar, demonstrating that the transition matrix derived by the EM algorithm generalizes to the test data.

Table 1: Precision@N (P@N) and average precision (AP) of our method and the baselines.

Method | P@100 (%) | P@200 (%) | P@300 (%) | Mean (%) | AP (%) |
---|---|---|---|---|---|
Lin et al., 2016 | 76.2 | 73.1 | 67.4 | 72.2 | 36.5 |
Wu et al., 2019 | 85.0 | 82.0 | 77.0 | 81.3 | 43.5 |
ours | 92.0 | 86.0 | 82.3 | 86.8 | 64.7 |

### 5.2. Model Analysis

#### 5.2.1. Verification of Transitional Loss

The method in (Wu_AAAI_2019) has been shown to significantly outperform previous methods, benefiting from its Noise-Converter and Conditional Optimal Selector components. Like the other strong method of Lin_ACL_2016, Wu_AAAI_2019 use a PCNN component for sentence representation. The two methods differ only in the components above the sentence representation: Wu_AAAI_2019 use the Noise-Converter and Conditional Optimal Selector, whereas Lin_ACL_2016 use sentence-level selective attention.

To demonstrate the superiority of our method, we conduct another experiment that compares our Transition_Loss components with the components proposed in (Wu_AAAI_2019) on top of the same sentence representation. For this purpose, we create a new baseline method, which uses the same sentence representation component as ours and places the Noise-Converter and Conditional Optimal Selector components of (Wu_AAAI_2019) on top of it. The sentence representation component is the left part of Figure 2, up to the sentence feature vector. We label this newly created baseline method Relation_BERT+noise_converter. Our method is named Relation_BERT+Transition_Loss.

Figure 5 shows the precision/recall curves for the newly created baseline Relation_BERT+noise_converter and our method Relation_BERT+Transition_Loss. We can see that our method performs better than the created baseline. Table 2 compares the precision@N (P@N) and average precision (AP) of the two methods. The comparison of P@1000 for the two methods is shown in Table 3. Our method achieves higher values than the baseline on all of these metrics.

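Table 2: P@N and AP of the two methods built on the same sentence representation.

Method | P@100 (%) | P@200 (%) | P@300 (%) | Mean (%) | AP (%) |
---|---|---|---|---|---|
Relation_BERT+noise_converter | 91.0 | 85.0 | 80.7 | 85.6 | 64.0 |
Relation_BERT+Transition_Loss | 92.0 | 86.0 | 82.3 | 86.8 | 64.7 |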

Table 3: P@1000 of the two methods built on the same sentence representation.

Methods | P@1000 (%) |
---|---|
Relation_BERT+noise_converter | 69.8 |
Relation_BERT+Transition_Loss | 71.5 |


#### 5.2.2. Ablation Study

Two critical questions for our approach are to what extent the prevalent BERT model benefits the relation extraction task, and how much our proposed doubly transitional loss further boosts the performance of the model.

To understand the impact of the components in our approach, we conduct several more experiments. One experiment is that we use the BERT model alone, and the input for the BERT model is sentences containing the target entities without inserting the special tokens. The output hidden vector of ‘[CLS]’ is then connected to a fully connected layer for classification. This method is named BERT.

In another experiment, we remove the noise-related components of our approach and keep the rest. That is to say, we insert the special tokens around the two target entities, and use the output hidden vector of the first token and the output hidden vectors of the two target entities for classification. More specifically, in Figure 2, we remove all components to the right of the sentence feature vector, and add a fully connected layer and a softmax layer over it for classification. This method is named Relation_BERT. Additionally, our method is named Relation_BERT+Transition_Loss.

Figure 6 illustrates the precision/recall curves of all compared methods. The general method BERT performs much worse than the other two methods. Although BERT has shown very strong performance in general text classification (bert_Jacobv_corr_bert_2018), without the locations of the target entities and their corresponding output hidden vectors it does not show any advantage on this task. In contrast, Relation_BERT performs much better than general BERT, which demonstrates the effectiveness of the way we incorporate the target entity information into the model. Furthermore, Relation_BERT+Transition_Loss, which incorporates all components of our model architecture, gains a significant improvement over Relation_BERT. At a recall of 0.3, Relation_BERT+Transition_Loss reaches a precision of 0.74, while Relation_BERT reaches 0.6, an absolute improvement of 14 percentage points at this recall level. Table 4 compares the P@1000 of the three methods. Relation_BERT+Transition_Loss has the highest P@1000 value, an absolute improvement of 11.6 points over Relation_BERT and 38.8 points over BERT. Table 4 further shows that Relation_BERT+Transition_Loss has the highest average precision.

Methods | P@1000 (%) | AP (%) |
---|---|---|
BERT | 32.7 | 23.1 |
Relation_BERT | 59.9 | 61.9 |
Relation_BERT+Transition_Loss | 71.5 | 64.7 |

We also conduct two more experiments: one uses only the explicit classification loss, and the other uses only the implicit classification loss. Table 5 reports the average precision of these two variants along with that of the method using both losses. Using both loss functions yields the highest average precision.

Methods | AP (%) |
---|---|
Explicit Classification Loss | 64.2 |
Implicit Classification Loss | 64.1 |
Both Losses | 64.7 |

### 5.3. General Relation Extraction Task

In this task, we evaluate on the SemEval 2018 Task 7 dataset. We focus on subtask 1.1, whose test set is composed of clean entities with manual labels. Like several other solutions (Jonathan_semeval2018_arxiv), we combine the training sets of subtask 1.1 and subtask 1.2 into a new training set. The entities in the subtask 1.2 data are automatically annotated and hence contain some noise. In this setting, we train on a noisy dataset and evaluate on a clean test set. Originally there are 6 relation types, 5 asymmetrical and 1 symmetrical. For each of the 5 asymmetrical relations, we create a corresponding reversed relation, giving 11 relations in total. We consider the relation between two entities only when they appear in the same sentence. After pre-processing, the training set contains 2476 sentences (1248 from subtask 1.1 and 1228 from subtask 1.2), and the test set contains 355 sentences. This is a much smaller dataset than the NYT dataset.
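The construction of the 11-way label set can be sketched as follows. The relation names are assumed from the task's label set and should be checked against the dataset files; the `_REVERSE` suffix and both function names are hypothetical choices made only for this illustration.

```python
from typing import List, Sequence

# Assumed relation names; verify against the SemEval 2018 Task 7 data.
ASYMMETRIC = ["USAGE", "RESULT", "MODEL-FEATURE", "PART_WHOLE", "TOPIC"]
SYMMETRIC = ["COMPARE"]


def build_label_set(
    asymmetric: Sequence[str],
    symmetric: Sequence[str],
    suffix: str = "_REVERSE",
) -> List[str]:
    """Each asymmetric relation gets an extra reversed label;
    symmetric relations keep a single label."""
    labels: List[str] = []
    for name in asymmetric:
        labels.append(name)
        labels.append(name + suffix)
    labels.extend(symmetric)
    return labels


def encode_label(
    relation: str,
    is_reversed: bool,
    symmetric: Sequence[str],
    suffix: str = "_REVERSE",
) -> str:
    """Map an annotated (relation, direction) pair to one of the labels."""
    if is_reversed and relation not in symmetric:
        return relation + suffix
    return relation


labels = build_label_set(ASYMMETRIC, SYMMETRIC)
num_labels = len(labels)  # 5 original + 5 reversed + 1 symmetric
```

Encoding direction in the label keeps the input sentence unchanged, at the cost of a larger output space with fewer training examples per class.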

For the hyper-parameters, we use values similar to those for the NYT dataset. Since this dataset is much smaller, the following settings differ: we perform an update every 16 sentences, set the batch size to 16, and fine-tune our model for 5 epochs.

Methods / Teams | Macro F1 (%) |
---|---|
Talla | 74.2 |
ClaiRE | 74.9 |
SIRIUS-LTG-UiO | 76.7 |
UWNLP | 78.9 |
ETHDS3Lab | 81.7 |
Ours | 80.7 |

We use the official tool to generate the final macro F1 scores. Table 6 compares our method with the top 5 teams (conf_semeval2018_Gabor) that participated in the task competition (https://lipn.univ-paris13.fr/~gabor/semeval2018task7/). Our method would rank 2nd, slightly below the top team. The top team applies several engineering tricks to achieve its high macro F1 score, such as using class weights (which boosted their F1 by 1.6 points) and reversing sentences rather than adding reversed relation types (which boosted their score by 2.0 points) (Jonathan_semeval2018_arxiv). We did not apply any such tricks. The results show that our method is also promising for the general relation extraction task, since training data inevitably contains some noise.
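For readers unfamiliar with the metric, the sketch below computes a macro-averaged F1 as the unweighted mean of per-class F1 scores. It is not the official scoring tool, whose details (for example, how directionality is handled) may differ; the function name and the toy labels are hypothetical.

```python
from typing import Sequence


def macro_f1(gold: Sequence[str], pred: Sequence[str], labels: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 over the given label set."""
    f1_scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example with two classes.
gold = ["A", "A", "B", "B"]
pred = ["A", "B", "B", "B"]
score = macro_f1(gold, pred, labels=["A", "B"])
```

Because every class contributes equally, rare relation types influence the score as much as frequent ones, which is one reason class weighting can matter on a small dataset.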

## 6. Discussion and Limitation

In this paper, we construct two types of transitional loss to tackle the relation extraction problem, and the resulting model shows significant improvement over competitive baseline methods. Optimizing the two losses together also outperforms using only one of them. We partially attribute this improvement to the alternating training. Switching between the two losses during optimization potentially helps the model escape from local minima, because a local optimum of one loss function is not necessarily a local optimum of the other. Since the direction of gradient descent is determined by the local landscape around the current parameters, we believe that the two types of transitional loss generally have different gradients at the same point.
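The intuition can be illustrated with a deliberately simplified toy, which is not our training procedure: two quadratic losses share a single scalar parameter, and gradient steps alternate between them. The quadratic losses, the learning rate, and all names below are assumptions chosen only to show that a stationary point of one loss still receives a non-zero gradient from the other.

```python
from typing import Callable, List


def grad_loss_a(theta: float) -> float:
    """Gradient of the toy loss (theta - 1)^2, minimised at theta = 1."""
    return 2.0 * (theta - 1.0)


def grad_loss_b(theta: float) -> float:
    """Gradient of the toy loss (theta - 3)^2, minimised at theta = 3."""
    return 2.0 * (theta - 3.0)


def alternate_descent(
    grad_a: Callable[[float], float],
    grad_b: Callable[[float], float],
    theta: float,
    lr: float,
    steps: int,
) -> List[float]:
    """Take a gradient step on loss A at even steps and on loss B at odd steps."""
    history = [theta]
    for step in range(steps):
        grad = grad_a(theta) if step % 2 == 0 else grad_b(theta)
        theta = theta - lr * grad
        history.append(theta)
    return history


# At the minimiser of loss A the gradient of A vanishes, but that of B does not,
# so switching losses keeps the shared parameter moving.
stuck_for_a = grad_loss_a(1.0)
still_moving_for_b = grad_loss_b(1.0)

history = alternate_descent(grad_loss_a, grad_loss_b, theta=0.0, lr=0.1, steps=200)
final_theta = history[-1]
```

In this toy the shared parameter settles between the two minimisers rather than at either one; in the real model the two losses are non-convex and share only part of the parameters, so the behaviour is correspondingly more complex.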

A critical point in our work is the relationship between the two losses. First, they share the same sentence representation component and the logits vector used for generating true labels (see Figure 2 for the left part up to vector ). During back-propagation, both of them optimize the parameters of the shared components. Second, they differ in how they model the connection between the noisy (observed) labels and the true labels.

While the evaluation of our model based on Doubly Transitional Loss shows better performance in practice, it still has two potential limitations. First, since the model is based on the pretrained BERT and includes an EM optimization process, the computational cost is relatively high, and prediction latency will be a practical issue for production deployment. Second, since our EM algorithm for the explicit loss needs to gather enough instances to compute the sufficient statistics for estimating the desired parameters, the estimates will have high variance if the training set is small. This phenomenon can be observed in the experiments on the SemEval 2018 Task 7 dataset, which contains fewer than 3 thousand instances. Although our solution is very competitive on this dataset, it does not outperform other methods by a large margin.

When comparing with the baselines, we should explain why we do not use the same baselines for both the NYT dataset and the SemEval 2018 Task 7 dataset. The latter targets general relation extraction, a slightly different task from the former, which targets distantly supervised relation extraction. The baselines of Lin_ACL_2016 and Wu_AAAI_2019 are designed for distantly supervised datasets, and hence it is fair to compare them with our method on the NYT dataset. Methods for the two problems would normally be fundamentally different, but we attempt to unify them with our proposed approach. Although the comparison with the top-ranked teams on the SemEval 2018 Task 7 dataset is not strictly fair, our method still achieves an acceptable result.

## 7. Conclusions

In this paper, we propose an innovative approach with doubly transitional loss to effectively handle noisy data in relation extraction. The explicit classification loss is derived from a probabilistic model with latent variables. The implicit classification loss represents an end-to-end noisy transition framework. In the experiments, we validate the effectiveness of each component of our approach and show that it achieves comparable or state-of-the-art performance on relation extraction with noisy data.
