We present a simple and accurate span-based model for semantic role labeling (SRL). Our model directly takes into account all possible argument spans and scores them for each label. At decoding time, we greedily select higher scoring labeled spans. One advantage of our model is that it allows us to design and use span-level features that are difficult to use in token-based BIO tagging approaches. Experimental results demonstrate that our ensemble model achieves state-of-the-art results, 87.4 F1 and 87.0 F1 on the CoNLL-2005 and 2012 datasets, respectively.
Semantic Role Labeling (SRL) is a shallow semantic parsing task whose goal is to recognize the predicate-argument structure of each predicate. Given a sentence and a target predicate, SRL systems have to predict semantic arguments of the predicate. Each argument is a span, a unit that consists of one or more words. A key to the argument span prediction is how to represent and model spans.
One approach is based on BIO tagging. Using features induced by neural networks, models of this kind predict a BIO tag for each word. Words at the beginning and inside of argument spans have the “B” and “I” tags, and words outside argument spans have the tag “O.” While yielding high accuracies, this approach reconstructs argument spans from the predicted BIO tags instead of directly predicting the spans.
Another approach is based on labeled span prediction Täckström et al. (2015); FitzGerald et al. (2015). This approach scores each span with its label. One advantage of this approach is that it allows us to design and use span-level features that are difficult to use in BIO tagging approaches. However, its performance has lagged behind that of the state-of-the-art BIO-based neural models.
To fill this gap, this paper presents a simple and accurate span-based model. Inspired by recent span-based models in syntactic parsing and coreference resolution Stern et al. (2017); Lee et al. (2017), our model directly scores all possible labeled spans based on span representations induced from neural networks. At decoding time, we greedily select higher scoring labeled spans. The model parameters are learned by optimizing log-likelihood of correct labeled spans.
We evaluate the performance of our span-based model on the CoNLL-2005 and 2012 datasets Carreras and Màrquez (2005); Pradhan et al. (2012). Experimental results show that the span-based model outperforms the BiLSTM-CRF model. In addition, by using contextualized word representations, ELMo Peters et al. (2018), our ensemble model achieves the state-of-the-art results, 87.4 F1 and 87.0 F1 on the CoNLL-2005 and 2012 datasets, respectively. Empirical analysis on these results shows that the label prediction ability of our span-based model is better than that of the CRF-based model. Another finding is that ELMo improves the model performance for span boundary identification.
In summary, our main contributions include:
A simple span-based model that achieves the state-of-the-art results.
Quantitative and qualitative analysis on strengths and weaknesses of the span-based model.
Empirical analysis on the performance gains by ELMo.
Our code and scripts are publicly available at https://github.com/hiroki13/span-based-srl.
We treat SRL as span selection, in which we select appropriate spans from a set of possible spans for each label. This section formalizes the problem and provides our span selection model.
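Given a sentence that consists of $T$ words $w_{1:T} = w_1, \dots, w_T$ and the target predicate position index $p$, the goal is to predict a set of labeled spans $Y = \{\langle i, j, r \rangle_k\}_{k=1}^{|Y|}$.
Each labeled span $\langle i, j, r \rangle$ consists of word indices $i$ and $j$ in the sentence ($1 \leq i \leq j \leq T$) and a semantic role label $r \in \mathcal{R}$.
One simple method to predict $Y$ is to select the highest scoring span $(i, j)$ from all possible spans $\mathcal{S}$ for each label $r$,
\[ \operatorname*{arg\,max}_{(i, j) \in \mathcal{S}} \mathrm{Score}_r(i, j), \quad r \in \mathcal{R}. \tag{1} \]
Function $\mathrm{Score}_r(i, j)$ returns a real value for each span $(i, j) \in \mathcal{S}$ (described in Section 2.2 in more detail). The number of possible spans $\mathcal{S}$ in the input sentence $w_{1:T}$ is $\frac{T(T+1)}{2}$, and $\mathcal{S}$ is defined as follows,
\[ \mathcal{S} = \{ (i, j) \mid 1 \leq i \leq j \leq T \}. \]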
Note that some semantic roles may not appear in the sentence. To deal with the absence of some labels, we define the predicate position span $(p, p)$ as a Null span and train a model to select the Null span when there is no span for the label. (Since the predicate itself can never be an argument of its own, we define this position as the Null span.)
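Consider the following sentence with the set of correct labeled spans $Y$.
[A0 She] kept [A1 a cat]
\[ Y = \{ \langle 1, 1, \mathrm{A0} \rangle, \langle 3, 4, \mathrm{A1} \rangle, \langle 2, 2, \mathrm{A2} \rangle, \dots \} \]
The input sentence is $w_{1:4}$ = “She kept a cat”, and the target predicate position is $p = 2$. The correct labeled span $\langle 1, 1, \mathrm{A0} \rangle$ indicates that the A0 argument is “She”, and $\langle 3, 4, \mathrm{A1} \rangle$ indicates that the A1 argument is “a cat”. The other labeled spans $\langle 2, 2, r \rangle$, which pair the Null span with each remaining label $r$, indicate that there are no arguments for those labels.
All the possible spans in this sentence are as follows,
\[ \mathcal{S} = \{ (1,1), (1,2), (1,3), (1,4), (2,2), (2,3), (2,4), (3,3), (3,4), (4,4) \}, \]
where the predicate span $(2, 2)$ is treated as the Null span. Among these candidates, we select the highest scoring span for each label. As a result, we can obtain the correct labeled spans $Y$.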
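The argmax selection over all spans can be illustrated with a small sketch. This is not the authors' implementation: the function names (`all_spans`, `argmax_spans`, `score`) and the score values are hypothetical and chosen only so that the toy sentence above yields the expected selection, with the Null span winning for a label that has no argument.

```python
from itertools import combinations_with_replacement

def all_spans(T):
    """Enumerate every span (i, j) with 1 <= i <= j <= T (1-indexed)."""
    return list(combinations_with_replacement(range(1, T + 1), 2))

def argmax_spans(score, T, labels):
    """For each label, pick the single highest-scoring span."""
    spans = all_spans(T)
    return {r: max(spans, key=lambda s: score(s[0], s[1], r)) for r in labels}

# Toy setup for "She kept a cat" with the predicate "kept" at position 2.
words = ["She", "kept", "a", "cat"]
p = 2
# Hypothetical scores: the gold spans score high, the Null span (p, p)
# scores 1.0 for every label, and all other spans score 0.0.
toy_scores = {(1, 1, "A0"): 3.0, (3, 4, "A1"): 2.5}

def score(i, j, r):
    if (i, j, r) in toy_scores:
        return toy_scores[(i, j, r)]
    return 1.0 if (i, j) == (p, p) else 0.0

selected = argmax_spans(score, len(words), ["A0", "A1", "A2"])
print(len(all_spans(len(words))))  # number of candidate spans for T = 4
print(selected)
```

With these toy scores, A0 and A1 receive their argument spans, while A2 falls back to the Null span at the predicate position.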
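As the scoring function for each span in Eq. 1, we model a normalized distribution over all possible spans $\mathcal{S}$ for each label $r$,
\[ \mathrm{Score}_r(i, j) = P_\theta(i, j \mid r) = \frac{\exp(F_\theta(i, j, r))}{\sum_{(i', j') \in \mathcal{S}} \exp(F_\theta(i', j', r))}, \tag{2} \]
where function $F_\theta$ returns a real value.
We train the parameters $\theta$ of $F_\theta$ on a training set $\mathcal{D}$ of pairs $(X, Y)$, where $X$ is an input (a sentence and a target predicate position) and $Y$ is its set of correct labeled spans.
To train the parameters $\theta$, we minimize the cross-entropy loss function,
\[ \mathcal{L}(\theta) = \sum_{(X, Y) \in \mathcal{D}} \ell_\theta(X, Y), \tag{3} \]
\[ \ell_\theta(X, Y) = \sum_{\langle i, j, r \rangle \in Y} -\log P_\theta(i, j \mid r), \tag{4} \]
where function $\ell_\theta(X, Y)$ is a loss for each sample.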
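Function $F_\theta$ in Eq. 2 consists of three types of functions: the base feature function $f_{base}$, the span feature function $f_{span}$, and the labeling function $f_{label}$, as follows,
\[ \mathbf{h}_{1:T} = f_{base}(w_{1:T}, p), \quad \mathbf{h}_s = f_{span}(\mathbf{h}_{1:T}, s), \quad F_\theta(i, j, r) = f_{label}(\mathbf{h}_s, r). \tag{5} \]
$f_{base}$ calculates a base feature vector $\mathbf{h}_t$ for each word $w_t$. Then, from a sequence of the base feature vectors $\mathbf{h}_{1:T}$, $f_{span}$ calculates a span feature vector $\mathbf{h}_s$ for a span $s = (i, j)$. Finally, using $\mathbf{h}_s$, $f_{label}$ calculates the score for the span $s = (i, j)$ with a label $r$.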
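The normalized distribution over spans and the per-sample cross-entropy loss described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released code: the raw scores are passed in as plain dictionaries rather than computed by a neural network, and the names `span_log_probs` and `sample_loss` are hypothetical.

```python
import math

def span_log_probs(scores):
    """Log-softmax over the candidate spans for one label.

    scores: dict mapping a span (i, j) to its real-valued score.
    """
    m = max(scores.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in scores.values()))
    return {span: v - log_z for span, v in scores.items()}

def sample_loss(scores_by_label, gold):
    """Negative log-likelihood of the gold labeled spans of one sample.

    scores_by_label: dict label -> {span: score}
    gold: list of (i, j, label) triples; a label without an argument
          is paired with the Null span (the predicate position).
    """
    return -sum(span_log_probs(scores_by_label[r])[(i, j)] for i, j, r in gold)

# Toy example: three candidate spans with uniform (hypothetical) scores.
uniform = {(1, 1): 0.0, (2, 2): 0.0, (1, 2): 0.0}
loss = sample_loss({"A0": uniform, "A1": uniform}, [(1, 1, "A0"), (2, 2, "A1")])
print(loss)
```

Because the normalization runs over spans for a fixed label, each label contributes one softmax over the same candidate set, which is what makes this a span selection rather than a label selection objective.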
The simple argmax inference (Eq. 1) selects one span for each label. While this argmax inference is computationally efficient, it faces the following two issues.
(a) The argmax inference sometimes selects spans that overlap with each other.
(b) The argmax inference cannot select multiple spans for one label.
In terms of (a), for example, when labeled spans such as $\langle 1, 3, \mathrm{A0} \rangle$ and $\langle 2, 4, \mathrm{A1} \rangle$ are selected, these two spans partially overlap.
In terms of (b), consider the following sentence.
[A0 He] came [A4 to the U.S.] [TMP yesterday] [TMP at 5 p.m.]
In this example, the label TMP is assigned to the two spans (“yesterday” and “at 5 p.m.”). Semantic role labels are mainly categorized into (i) core labels or (ii) adjunct labels. In the above example, the labels A0 and A4 are regarded as core labels, which indicate obligatory arguments for the predicate. In contrast, the labels like TMP are regarded as adjunct labels, which indicate optional arguments for the predicate. As the example shows, adjunct labels can be assigned to multiple spans.
To deal with these issues, we use a greedy search that keeps the consistency among spans and can return multiple spans for adjunct labels. Specifically, we greedily select higher scoring labeled spans subject to two constraints.
Any spans that overlap with the selected spans cannot be selected.
While multiple spans can be selected for each adjunct label, at most one span can be selected for each core label.
Appendix A gives the pseudo code of this algorithm together with a detailed explanation.
To compute the score for each span, we have introduced three functions ($f_{base}$, $f_{span}$, and $f_{label}$) in Section 2.3. As an instantiation of each function, we use neural networks. This section describes our neural networks for each function and the overall network architecture.
Figure 1 illustrates the overall architecture of our model. The first component uses bidirectional LSTMs (BiLSTMs) Schuster and Paliwal (1997); Graves et al. (2005, 2013) to calculate the base features. From the base features, the second component extracts span features. Based on them, the final component calculates the score for each labeled span. In the following, we describe these three components in detail.
As the base feature function $f_{base}$, we use BiLSTMs,
\[ f_{base}(w_{1:T}, p) = \mathrm{BiLSTM}(w_{1:T}, p). \tag{6} \]
There are some variants of BiLSTMs. Following the deep SRL models proposed by Zhou and Xu (2015) and He et al. (2017), we stack BiLSTMs in an interleaving fashion. The stacked BiLSTMs process an input sequence in a left-to-right manner at odd-numbered layers and in a right-to-left manner at even-numbered layers.
The first layer of the stacked BiLSTMs receives word embeddings $\mathbf{x}^{word} \in \mathbb{R}^{d^{word}}$ and predicate mark embeddings $\mathbf{x}^{mark} \in \mathbb{R}^{d^{mark}}$. As the word embeddings, we can use existing word embeddings. The mark embeddings are created from the mark feature, which has a binary value. The value is 1 if the word is the target predicate and 0 otherwise. For example, at the bottom part of Figure 1, the word “bought” is the target predicate and is assigned 1 as its mark feature.
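From the base features induced by the BiLSTMs, we create the span feature representations,
\[ f_{span}(\mathbf{h}_{1:T}, s) = [\mathbf{h}_i + \mathbf{h}_j \, ; \, \mathbf{h}_i - \mathbf{h}_j], \tag{7} \]
where the addition and subtraction features of the $i$-th and $j$-th hidden states are concatenated and used as the feature for a span $s = (i, j)$. The resulting vector $\mathbf{h}_s$ is a $2d^{hidden}$-dimensional vector.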
The middle part of Figure 1 shows an example of this process. For the span , the span feature function receives the rd and th features ( and ). Then, these two vectors are added, and the th vector is subtracted from the rd vector. The resulting vectors are concatenated and given to the labeling function .
Our design of the span features is inspired by the span (or segment) features used in syntactic parsing Wang and Chang (2016); Stern et al. (2017); Teranishi et al. (2017). While these neural span features cannot be used in BIO-based SRL models, they can easily be incorporated into span-based models.
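Taking a span representation $\mathbf{h}_s$ as input, the labeling function $f_{label}$ returns the score for the span $s = (i, j)$ with a label $r$. Specifically, we use the following labeling function,
\[ f_{label}(\mathbf{h}_s, r) = \mathbf{W}[r] \cdot \mathbf{h}_s, \tag{8} \]
where $\mathbf{W} \in \mathbb{R}^{|\mathcal{R}| \times 2d^{hidden}}$ has a row vector associated with each label $r$, and $\mathbf{W}[r]$ denotes the $r$-th row vector. As the result of the inner product of $\mathbf{W}[r]$ and $\mathbf{h}_s$, we obtain the score for a span $(i, j)$ with a label $r$.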
The upper part of Figure 1 shows an example of this process. The span representation for the span is created from addition and subtraction of and . Then, we calculate the inner product of and . The score for the label A0 is , and the score for the label A1 is . In the same manner, by calculating the scores for all the spans and labels , we can obtain the score matrix (at the top part of Figure 1).
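The span feature and labeling steps can be sketched as follows. This is a minimal illustration, not the authors' code: the base features are given as plain lists instead of BiLSTM outputs, the function names `span_feature` and `label_score` are hypothetical, and all vector values are made up for the example.

```python
def span_feature(h, i, j):
    """Concatenate (h_i + h_j) and (h_i - h_j); positions are 1-indexed."""
    hi, hj = h[i - 1], h[j - 1]
    added = [a + b for a, b in zip(hi, hj)]
    subtracted = [a - b for a, b in zip(hi, hj)]
    return added + subtracted

def label_score(W, r, h_s):
    """Inner product of the row vector of label r with the span feature."""
    return sum(w * x for w, x in zip(W[r], h_s))

# Hypothetical 2-dimensional base features for a 3-word sentence.
h = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]]
# Hypothetical label matrix with one 4-dimensional row per label.
W = {"A0": [1.0, 0.0, 0.0, 0.0], "A1": [0.0, 0.0, 1.0, 1.0]}

h_s = span_feature(h, 1, 3)
score_a0 = label_score(W, "A0", h_s)
score_a1 = label_score(W, "A1", h_s)
print(h_s, score_a0, score_a1)
```

Note that the span feature is twice as long as a base feature, which is why each label row has four entries here.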
We propose an ensemble model that uses span representations from multiple models. Each base model trained with different random initializations has variance in span representations. To take advantage of it, we introduce a variant of a mixture of experts (MoE) Shazeer et al. (2017). (One popular ensemble model for SRL is the product of experts (PoE) model FitzGerald et al. (2015); He et al. (2017); Tan et al. (2018). In our preliminary experiments, we tried the PoE model, but it did not improve the performance.)
\[ \mathbf{h}^{moe}_s = \mathbf{W}^{moe}_s \sum_{m=1}^{M} \alpha_m \mathbf{h}^{(m)}_s, \tag{9} \]
\[ f^{moe}_{label}(\mathbf{h}^{moe}_s, r) = \mathbf{W}^{moe}_r[r] \cdot \mathbf{h}^{moe}_s. \tag{10} \]
Firstly, we combine span representations $\mathbf{h}^{(m)}_s$ from each model $m \in \{1, \dots, M\}$. $\mathbf{W}^{moe}_s$ is a parameter matrix and $\{\alpha_m\}_{m=1}^{M}$ are trainable, softmax-normalized parameters. Then, using the combined span representation $\mathbf{h}^{moe}_s$, we calculate the score in the same way as Eq. 8. We use the same greedy search algorithm used for our base model (Section 2.4).
During training, we update only the parameters of the ensemble model, i.e., $\{\mathbf{W}^{moe}_s, \mathbf{W}^{moe}_r, \alpha_m\}$. That is, we fix the parameters of each trained model $m$. As the loss function, we use the cross-entropy (Eq. 3).
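The combination step of the ensemble can be sketched as a softmax-weighted sum of the base models' span representations followed by a linear map. This is an illustration under assumptions rather than the released implementation: the order of operations (mix first, then apply the matrix) is our reading of the description above, the names `softmax` and `combine` are hypothetical, and the zero-valued mixture weights and identity matrix are chosen only to keep the example easy to check.

```python
import math

def softmax(xs):
    """Normalize real-valued weights into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def combine(span_reps, alpha_raw, W_s):
    """Mix per-model span representations, then apply a linear map.

    span_reps: list of M vectors, one per base model, for the same span.
    alpha_raw: M unnormalized mixture weights (softmax-normalized here).
    W_s: square matrix (list of rows) applied to the mixed vector.
    """
    alpha = softmax(alpha_raw)
    dim = len(span_reps[0])
    mixed = [sum(a * rep[k] for a, rep in zip(alpha, span_reps)) for k in range(dim)]
    return [sum(w * x for w, x in zip(row, mixed)) for row in W_s]

# Two hypothetical base models with 2-dimensional span representations.
reps = [[1.0, 3.0], [3.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
h_moe = combine(reps, [0.0, 0.0], identity)
print(h_moe)
```

With equal raw weights and an identity matrix, the combined representation is simply the average of the base representations, so the ensemble starts from a plain averaging baseline and can move away from it as the mixture parameters are trained.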
We use the CoNLL-2005 and 2012 datasets. For CoNLL-2012, we use the version of OntoNotes downloaded at http://cemantix.org/data/ontonotes.html. We follow the standard train-development-test split and use the official evaluation script from the CoNLL-2005 shared task on both datasets. The script can be downloaded at http://www.lsi.upc.edu/~srlconll/soft.html.
[Table 1: Results on the CoNLL-2005 dataset, with columns Development, Test WSJ, Test Brown, and Test ALL.]
For comparison, as a model based on BIO tagging approaches, we use the BiLSTM-CRF model proposed by Zhou and Xu (2015). The BiLSTMs for the base feature function $f_{base}$ are the same as those used in our BiLSTM-span model.
Word embeddings have a great influence on SRL models. To validate the model performance, we use two types of word embeddings: SENNA Collobert et al. (2011) and ELMo Peters et al. (2018).
SENNA and ELMo can be regarded as different types of embeddings in terms of context sensitivity. SENNA and other typical word embeddings always assign an identical vector to each word regardless of the input context. In contrast, ELMo assigns different vectors to each word depending on the input context. In this work, we use these word embeddings that have different properties. (In our preliminary experiments, we also used the GloVe embeddings Pennington et al. (2014), but the performance was worse than SENNA.) These embeddings are fixed during training.
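As the objective function, we use the cross-entropy in Eq. 3 with L2 weight decay,
\[ \mathcal{L}(\theta) = \sum_{(X, Y) \in \mathcal{D}} \ell_\theta(X, Y) + \frac{\lambda}{2} \lVert \theta \rVert^2, \tag{11} \]
where $\ell_\theta(X, Y)$ is the loss for each sample in the training set $\mathcal{D}$ and the hyperparameter $\lambda$ is the coefficient governing the L2 weight decay.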
We report averaged scores across five different runs of the model training.
Tables 1 and 2 show the experimental results on the CoNLL-2005 and 2012 datasets. Overall, our span-based ensemble model using ELMo achieved the best F1 scores, 87.4 F1 and 87.0 F1 on the CoNLL-2005 and CoNLL-2012 datasets, respectively. In comparison with the CRF-based single model, our span-based single model consistently yielded better F1 scores regardless of the word embeddings, SENNA and ELMo. Although the performance difference between these models was small when using ELMo, this seems natural because both models achieved much better results and approached the performance upper bound.
Table 3 shows the comparison with existing models in F1 scores. Our single and ensemble models using ELMo achieved the best F1 scores on all the test sets except the Brown test set.
To better understand our span-based model, we addressed the following questions and obtained the following findings.
What are strengths and weaknesses of our span-based model compared with the CRF-based model?
What aspect of SRL does ELMo improve?
ELMo improves the model performance for span boundary identification (Section 5.1).
In addition, we have conducted qualitative analysis on span and label representations learned in the span-based model (Section 5.3).
We analyze the results predicted by the single models. We evaluate F1 scores only for the span boundary match, shown by Table 4. We regard a predicted boundary as correct if it matches the gold annotation regardless of its label.
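The boundary-only evaluation can be sketched as an F1 computation that ignores labels. This is a simplified illustration for the arguments of a single predicate, not the official CoNLL-2005 evaluation script; the function name `boundary_f1` and the example spans are hypothetical.

```python
def boundary_f1(pred, gold):
    """F1 over span boundaries; labels are ignored.

    pred, gold: iterables of (i, j, label) triples for one predicate.
    """
    pred_spans = {(i, j) for i, j, _ in pred}
    gold_spans = {(i, j) for i, j, _ in gold}
    true_positives = len(pred_spans & gold_spans)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_spans)
    recall = true_positives / len(gold_spans)
    return 2 * precision * recall / (precision + recall)

# The first predicted span has a wrong label but a correct boundary,
# so it still counts as a match; the third span is spurious.
pred = [(1, 1, "A1"), (3, 4, "A1"), (5, 6, "TMP")]
gold = [(1, 1, "A0"), (3, 4, "A1")]
f1 = boundary_f1(pred, gold)
print(f1)
```

Separating boundary matching from labeling in this way is what allows the two error sources to be analyzed independently.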
On both datasets, the CRF-based models achieved better F1 than the span-based models. Also, compared with SENNA, ELMo yielded much better F1, by over 3.0 points. This suggests that one factor in the overall SRL performance gain by ELMo is the improved ability of the model to identify span boundaries.
We analyze labels of the predicted results. For labeled spans whose boundaries match the gold annotation, we evaluate the label accuracies. As Table 5 shows, the span-based models outperformed the CRF-based models. Also, interestingly, the performance gap between SENNA and ELMo was not as large as that for span boundary identification.
Table 6 shows F1 scores for frequent labels on the CoNLL-2005 and 2012 datasets. For A0 and A1, the performances of the CRF-based and span-based models were almost the same. For A2, the span-based models outperformed the CRF-based model by about 1.0 F1 on both datasets. (The PNC label got low scores on the CoNLL-2012 dataset in Table 6. Almost all the gold PNC (purpose) labels are assigned to only the news article domain texts of the CoNLL-2012 dataset. The other 6 domain texts have no or very few PNC labels. This can lead to the low performance.)
Figure 2 shows a confusion matrix for labeling errors of the span-based model using ELMo. (We have observed the same tendency of labeling confusions between the models using ELMo and SENNA.) Following He et al. (2017), we only count predicted arguments that match the gold span boundaries.
The span-based model confused A0 and A1 arguments the most.
In particular, the model confused them for ergative verbs.
Consider the following two sentences:
[A0 People] start their own business …
… [A1 Congress] has started to jump on …
where the constituents located at the syntactic subject position fulfill a different role, A0 or A1, according to their semantic properties, such as animacy. Such arguments are difficult for SRL models to correctly identify.
Another point is the confusion of A2 with DIR and LOC. As He et al. (2017) pointed out, A2 in a lot of verb frames represents semantic relations such as direction or location, which can cause the confusion of A2 with such location-related adjuncts. To remedy these two issues, a promising approach is to incorporate frame knowledge into SRL models by using verb frame dictionaries.
|“ toy makers to move [ across the border ] .”|
|Nearest neighbors of “across the border”|
|Rank|Gold label|Span|
|1|DIR|across the Hudson|
|2|DIR|outside their traditional tony circle|
|3|DIR|across the floor|
|4|DIR|through this congress|
|5|A2|off their foundations|
|6|DIR|off its foundation|
|7|DIR|off the center field wall|
|8|A3|out of bed|
|9|A2|through cottage rooftops|
|10|DIR|through San Francisco|
Our span-based model computes and uses span representations (Eq. 7) for label prediction. To investigate a relation between the span representations and predicted labels, we qualitatively analyze nearest neighbors of each span representation with its predicted label. Specifically, for each predicted span in the development set, we collect 10 nearest neighbor spans with their gold labels from the training set.
Table 7 shows 10 nearest neighbors of a span “across the border” for the predicate “move”. The label of this span was misclassified, i.e., the predicted label is DIR but the gold label is A2. Its nearest neighbor spans have different gold labels, such as DIR, A2 and A3. As in this case, we have observed that spans with a misclassified label often have nearest neighbors with inconsistent labels.
We analyze the label embeddings in the labeling function (Eq. 8). Figure 3 shows the distribution of the learned label embeddings. The adjunct labels are close to each other, which suggests that they are less discriminative. Also, the core label A2 is close to the adjunct label DIR, and these two labels are often confused by the model. To enhance the discriminative power, it is promising to apply techniques that keep label representations far away from each other Wen et al. (2016); Luo et al. (2017).
Automatic SRL has been widely studied Gildea and Jurafsky (2002). There have been two main styles of SRL: FrameNet-style SRL and PropBank-style SRL.
In this paper, we have tackled PropBank-style SRL. (Detailed descriptions of FrameNet-style and PropBank-style SRL can be found in Baker et al. (1998); Das et al. (2014); Kingsbury and Palmer (2002); Palmer et al. (2005).)
In PropBank-style SRL, there have been two main task settings: span-based SRL and dependency-based SRL.
Figure 4 illustrates an example of span-based and dependency-based SRL. In dependency-based SRL (at the upper part of Figure 4), the correct A2 argument for the predicate “hit” is the word “with”. On the other hand, in span-based SRL (at the lower part of Figure 4), the correct A2 argument is the span “with the bat”.
For span-based SRL, the CoNLL-2004 and 2005 shared tasks Carreras and Marquez (2004); Carreras and Màrquez (2005) provided the task settings and datasets. In the task settings, various SRL models, from traditional pipeline models to recent neural ones, have been proposed and competed with each other Pradhan et al. (2005); He et al. (2017); Tan et al. (2018). For dependency-based SRL, the CoNLL-2008 and 2009 shared tasks Surdeanu et al. (2008); Hajič et al. (2009) provided the task settings and datasets. As in span-based SRL, recent neural models achieved high-performance in dependency-based SRL Marcheggiani et al. (2017); Marcheggiani and Titov (2017); He et al. (2018b); Cai et al. (2018). This paper focuses on span-based SRL.
State-of-the-art SRL models use neural networks based on the BIO tagging approach. The pioneering neural SRL model was proposed by Collobert et al. (2011). They use convolutional neural networks (CNNs) and CRFs. Instead of CNNs, Zhou and Xu (2015) and He et al. (2017) used stacked BiLSTMs and achieved strong performance without syntactic inputs. Tan et al. (2018) replaced stacked BiLSTMs with self-attention architectures. Strubell et al. (2018) improved the self-attention SRL model by incorporating syntactic information.
Word representations. Typical word representations, such as SENNA Collobert et al. (2011) and GloVe Pennington et al. (2014), have been used and contributed to the performance improvement Collobert et al. (2011); Zhou and Xu (2015); He et al. (2017). Recently, Peters et al. (2018) integrated a contextualized word representation, ELMo, into the model of He et al. (2017) and improved the performance by 3.2 F1 score. Strubell and McCallum (2018) also integrated ELMo into the model of Strubell et al. (2018) and reported a performance improvement.
Typically, in this approach, models first identify candidate argument spans (argument identification) and then classify each span into one of the semantic role labels (argument classification). For inference, several effective methods have been proposed, such as structural constraint inference by using integer linear programming Punyakanok et al. (2008) or dynamic programming Täckström et al. (2015); FitzGerald et al. (2015).
Recent span-based model. A very recent work, He et al. (2018a), proposed a span-based SRL model similar to our model.
They also used BiLSTMs to induce span representations in an end-to-end fashion.
A main difference is that while they model $P(r \mid i, j)$, we model $P(i, j \mid r)$.
In other words, while their model seeks to select an appropriate label for each span (label selection), our model seeks to select appropriate spans for each label (span selection).
This point distinguishes their model from ours.
FrameNet span-based model. For FrameNet-style SRL, Swayamdipta et al. (2017) used a segmental RNN Kong et al. (2016), combining bidirectional RNNs with semi-Markov CRFs Sarawagi and Cohen (2004). Their model computes span representations using BiLSTMs and learns a conditional distribution over all possible labeled spans of an input sequence. Although we cannot directly compare our results with theirs, our model is simpler and is effective for PropBank-style SRL.
In syntactic parsing, Wang and Chang (2016) proposed an LSTM-based sentence segment embedding method named LSTM-Minus. Stern et al. (2017) and Kitaev and Klein (2018) incorporated LSTM-Minus into their parsing models and achieved the best results in constituency parsing. In coreference resolution, Lee et al. (2017, 2018) presented an end-to-end coreference resolution model, which considers all spans in a document as potential mentions and learns distributions over possible antecedents for each. Our model can be regarded as an extension of their model.
We have presented a simple and accurate span-based model. We treat SRL as span selection and our model seeks to select appropriate spans for each label. Experimental results have demonstrated that despite the simplicity, the model outperforms a strong BiLSTM-CRF model. Also, our span-based ensemble model using ELMo achieves the state-of-the-art results on the CoNLL-2005 and 2012 datasets. Through empirical analysis, we have obtained some interesting findings. One of them is that the span-based model is better at label prediction compared with the CRF-based model. Another one is that ELMo improves the model performance for span boundary identification.
An interesting direction for future work concerns evaluating span representations from our span-based model. Since the investigation on the characteristics of the representations can lead to interesting findings, it is worthwhile evaluating them intrinsically and extrinsically. Another promising direction is to explore methods of incorporating frame knowledge into SRL models. We have observed that a lot of label confusions arise due to the lack of such knowledge. The use of frame knowledge to reduce these confusions is a straightforward approach.
This work was partially supported by JST CREST Grant Number JPMJCR1513 and JSPS KAKENHI Grant Number 18K18109. We are grateful to the members of the NAIST Computational Linguistics Laboratory, the members of Tohoku University Inui-Suzuki Laboratory, Kentaro Inui, Jun Suzuki, Yuichiro Matsubayashi, and the anonymous reviewers for their insightful comments.
Journal of Machine Learning Research.
Jan Hajič, Massimiliano Ciaramita, Richard Johansson, Daisuke Kawahara, Maria Antònia Martí, Lluís Màrquez, Adam Meyers, Joakim Nivre, Sebastian Padó, Jan Štěpánek, Pavel Straňák, Mihai Surdeanu, Nianwen Xue, and Yi Zhang. 2009. The CoNLL-2009 shared task: Syntactic and semantic dependencies in multiple languages. In Proceedings of CoNLL, pages 1–18.
Lingpeng Kong, Chris Dyer, and Noah A. Smith. 2016. Segmental recurrent neural networks. In Proceedings of ICLR.
Yandong Wen, Kaipeng Zhang, Zhifeng Li, and Yu Qiao. 2016. A discriminative feature learning approach for deep face recognition. In Proceedings of ECCV, pages 499–515.
Algorithm 1 describes the pseudo code of the greedy search algorithm introduced in Section 2.4. This algorithm receives three inputs (lines 1-3): the score matrix illustrated at the top part of Figure 1 in Section 3, each cell of which represents the score of a labeled span; the target predicate position index $p$; and the set of core labels. At line 4, the variable “spans” is initialized. This variable stores the selected spans to be returned as the output. At line 5, the variable “used_cores” is initialized. This variable keeps track of the already selected core labels.
At line 6, the score matrix is converted to a list of candidate tuples, each holding a span, its label, and its score. At line 7, we remove from the candidates the tuples that fall into either of the following cases: (i) the tuples whose boundary overlaps with the predicate position, or (ii) the tuples whose score is lower than that of the predicate span tuple. In terms of (i), since spans whose boundary overlaps with the predicate position $p$ can never be a correct argument, we remove such tuples. In terms of (ii), in Section 2 we define the predicate span as the Null span, which implies that a span whose score is lower than that of the Null span is an inappropriate argument. Thus, we remove such tuples from the set of candidates.
The main processing starts from line 8. Based on the scores, the tuples are sorted in descending order. Lines 9-10 impose the constraints on output spans. At line 9, the condition that the label is not in “used_cores” represents the constraint that at most one span can be selected for each core label. At line 10, an overlap-checking function takes as input a span and the set of the selected spans, and returns a boolean value (“True” or “False”) that represents whether the span overlaps with any of the selected spans.
At line 11, the span is added to the set of the selected spans. At lines 12-13, if the label is included in the core labels, the label is added to “used_cores”. At line 14, as the final output, the set of the selected spans “spans” is returned.
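The steps described above can be sketched in Python as follows. This is a reading of the prose description, not the authors' Algorithm 1: the scores are given as a dictionary keyed by labeled span rather than as a score matrix, "overlaps with the predicate position" is interpreted as "contains the predicate position", the Null span is compared per label, and the names `overlaps` and `greedy_search`, the core label set, and the score values are hypothetical.

```python
def overlaps(span, selected):
    """Return True if span shares a word with any already selected span."""
    i, j = span
    return any(i <= b and a <= j for a, b, _ in selected)

def greedy_search(scores, p, core_labels):
    """Greedily select non-overlapping labeled spans.

    scores: dict mapping (i, j, label) to a score, including the Null
            span (p, p, label) for each label.
    p: 1-indexed position of the target predicate.
    core_labels: labels that may be selected at most once.
    """
    candidates = [
        (i, j, r, s)
        for (i, j, r), s in scores.items()
        if not (i <= p <= j)  # (i) drop spans containing the predicate
        and s >= scores.get((p, p, r), float("-inf"))  # (ii) drop spans below the Null span
    ]
    candidates.sort(key=lambda t: t[3], reverse=True)

    spans, used_cores = [], set()
    for i, j, r, _ in candidates:
        if r not in used_cores and not overlaps((i, j), spans):
            spans.append((i, j, r))
            if r in core_labels:
                used_cores.add(r)
    return spans

# "He came to the U.S. yesterday at 5 p.m." with the predicate "came" (p = 2).
scores = {
    (1, 1, "A0"): 5.0, (1, 3, "A0"): 4.5, (4, 5, "A0"): -1.0,
    (3, 5, "A4"): 4.0, (3, 4, "A4"): 3.5,
    (6, 6, "TMP"): 3.0, (7, 9, "TMP"): 2.5, (6, 9, "TMP"): 2.0,
    (2, 2, "A0"): 0.0, (2, 2, "A4"): 0.0, (2, 2, "TMP"): 0.0,
}
result = greedy_search(scores, p=2, core_labels={"A0", "A1", "A2", "A3", "A4", "A5"})
print(result)
```

In this toy run the adjunct label TMP is selected twice, while the second A4 candidate is rejected because A4 is a core label that has already been used.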
In particular, we use the stacked BiLSTMs in an interleaving fashion Zhou and Xu (2015); He et al. (2017). The stacked BiLSTMs process an input sequence in a left-to-right manner for odd-numbered layers and in a right-to-left manner for even-numbered layers.
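The stacked BiLSTMs consist of $L$ layers. The hidden state in each layer $\ell \in \{1, \dots, L\}$ is calculated as follows,
\[ \mathbf{h}^{(\ell)}_t = \begin{cases} \mathrm{LSTM}^{(\ell)}(\mathbf{x}^{(\ell)}_t, \mathbf{h}^{(\ell)}_{t-1}) & \text{if } \ell \text{ is odd}, \\ \mathrm{LSTM}^{(\ell)}(\mathbf{x}^{(\ell)}_t, \mathbf{h}^{(\ell)}_{t+1}) & \text{if } \ell \text{ is even}. \end{cases} \]
Both the odd- and even-numbered layers receive $\mathbf{x}^{(\ell)}_t$ as the first input of the LSTM. For the second input, odd-numbered layers receive $\mathbf{h}^{(\ell)}_{t-1}$, whereas even-numbered layers receive $\mathbf{h}^{(\ell)}_{t+1}$.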
Between the LSTM layers, we use the following connection Zhou and Xu (2015),
Here, we firstly concatenate and , and then calculate the inner product of the concatenated vector and the parameter matrix
with the rectified linear units (ReLU). As a result, we obtain the input representationfor the next (-th) LSTM layer.
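In the first layer, the LSTM receives an input feature vector $\mathbf{x}^{(1)}_t$. Following He et al. (2017), we create this vector by concatenating a word embedding and a predicate mark embedding,
\[ \mathbf{x}^{(1)}_t = [\mathbf{x}^{word}_t \, ; \, \mathbf{x}^{mark}_t], \]
where $\mathbf{x}^{word}_t \in \mathbb{R}^{d^{word}}$ and $\mathbf{x}^{mark}_t \in \mathbb{R}^{d^{mark}}$. The mark embedding is created from the binary mark feature. The value is 1 if the word is the target predicate and 0 otherwise.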
|Name|Value|
|Word Embedding|50-dimensional SENNA|
|Mark Embedding|50-dimensional vector|
|LSTM Hidden Units|300 dimensions|
|Dropout Ratio for BiLSTMs|0.1|
|Dropout Ratio for ELMo|0.5|
Table 8 lists the hyperparameters used for our span-based model.
Word representation setup. As word embeddings $\mathbf{x}^{word}$, we use two types of embeddings, (i) SENNA Collobert et al. (2011), 50-dimensional word vectors ($d^{word} = 50$), and (ii) ELMo Peters et al. (2018), 1024-dimensional vectors ($d^{word} = 1024$).
During training, we fix these word embeddings (i.e., we do not update them).
As predicate mark embeddings $\mathbf{x}^{mark}$, we use randomly initialized 50-dimensional vectors ($d^{mark} = 50$).
During training, we update them.
Network setup. As the base feature function $f_{base}$, we use stacked BiLSTMs (2 forward and 2 backward LSTMs) with 300-dimensional hidden units ($d^{hidden} = 300$). Following He et al. (2017), we initialize all the parameter matrices in BiLSTMs with random orthonormal matrices Saxe et al. (2013). Other parameters are initialized following Glorot and Bengio (2010), and bias parameters are initialized with zero vectors.
Regularization We set the coefficient for the L2 weight decay (Eq. 11 in Section 4.3) to .
We apply dropout Srivastava et al. (2014) to the input vectors of each LSTM with dropout ratio of 0.1 and the ELMo embeddings with dropout ratio of 0.5.
Training To optimize the parameters, we use Adam Kingma and Ba (2014) with and
. The learning rate is initialized to 0.001. After training 50 epochs, we halve the learning rate every 25 epochs. Parameter updates are performed in mini-batches of 32. The number of training epochs is set to 100. We save the parameters that achieve the best F1 score on the development set and evaluate them on the test set. Training our model on the CoNLL-2005 training set takes about one day and on the CoNLL-2012 training set takes about two days on a single GPU, respectively.
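The learning rate schedule can be written as a small function. This sketch follows one possible reading of the description, in which the first halving is applied once 50 epochs have been completed and further halvings follow every 25 epochs; the source text does not settle whether the first halving happens at epoch 50 or at epoch 75, so the exact boundary is an assumption. The function name `learning_rate` and its parameter names are hypothetical.

```python
def learning_rate(completed_epochs, base_lr=0.001, constant_epochs=50, interval=25):
    """Learning rate to use after a given number of completed epochs.

    Assumed reading: constant for the first `constant_epochs` epochs,
    then halved immediately and again every `interval` epochs.
    """
    if completed_epochs < constant_epochs:
        return base_lr
    n_halvings = (completed_epochs - constant_epochs) // interval + 1
    return base_lr * 0.5 ** n_halvings

# Learning rates over a 100-epoch run under this reading.
schedule = [learning_rate(epoch) for epoch in range(100)]
print(schedule[0], schedule[50], schedule[75])
```

Under the alternative reading, the `+ 1` in the halving count would be dropped so that the first halving takes effect at epoch 75.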
Our ensemble model uses span representations from $M$ base models (Section 3.2).
We use 5 base models ($M = 5$) learned over different runs.
Note that, during training, we fix the parameters of the five base models and update only the parameters of the ensemble model.
) is initialized with the identity matrix. The scalar parameters(Eq. 9) are initialized with . Each row vector of the parameter matrix (Eq. 10) is initialized with the averaged vector over the row vectors of each model , i.e., .
Training To optimize the parameters, we use Adam with and . The learning rate is set to 0.0001. Parameter updates are performed in mini-batches of 8. The number of training epochs is set to 20. We save the parameters that achieve the best F1 score on the development set and evaluate them on the test set. Training one ensemble model on the CoNLL-2005 and 2012 training sets takes about one day on a single GPU.