Low Resource Multi-Task Sequence Tagging – Revisiting Dynamic Conditional Random Fields

05/01/2020 ∙ by Jonas Pfeiffer, et al. ∙ Technische Universität Darmstadt

We compare different models for low resource multi-task sequence tagging that leverage dependencies between label sequences for different tasks. Our analysis is aimed at datasets where each example has labels for multiple tasks. Current approaches use either a separate model for each task or standard multi-task learning to learn shared feature representations. However, these approaches ignore correlations between label sequences, which can provide important information in settings with small training datasets. To analyze which scenarios can profit from modeling dependencies between labels in different tasks, we revisit dynamic conditional random fields (CRFs) and combine them with deep neural networks. We compare single-task, multi-task and dynamic CRF setups for three diverse datasets at both sentence and document levels in English and German low resource scenarios. We show that including silver labels from pretrained part-of-speech taggers as auxiliary tasks can improve performance on downstream tasks. We find that especially in low-resource scenarios, the explicit modeling of inter-dependencies between task predictions outperforms single-task as well as standard multi-task models.


1 Introduction

We consider the problem of multi-task sequence tagging (MTST) with small training datasets, where each token in a sequence has multiple labels, each corresponding to a different task. Many advances in sequence labeling for NLP stem from combining new types of deep neural networks (DNNs) with conditional random fields (CRFs; Lafferty et al., 2001; Sutton and McCallum, 2012). In these approaches, such as Huang et al. (2015), Lample et al. (2016) and Ma and Hovy (2016), DNNs extract rich vector representations from raw text sequences that facilitate classification, while CRFs capture the dependencies between labels in a sequence. However, as the DNNs contain many parameters, strong performance is achieved by training on large labeled datasets, which are unavailable for many domain-specific span annotation tasks.

Figure 1: General setup of multi-task sequence tagging (MTST) for a sequence of tokens $x_1$ to $x_T$. Standard MTL does not model the flow of inter-dependency information (red arrow) between the output layers (i.e., CRFs).

The performance of large DNNs can often be improved by multi-task learning (MTL), which trains a shared data representation for several related tasks so that the representation is learned from a larger pool of data Collobert and Weston (2008). While this suggests MTL may be a solution for multi-task sequence tagging, standard MTL assumes that the sequences of labels for each task are conditionally independent given the shared data representations, as depicted in Figure 1. These multi-task setups therefore do not model the dependencies between CRFs, illustrated by the red arrow. This modeling decision may result in information loss if important dependencies between tasks are not adequately modeled by the shared representations alone. Nonetheless, the go-to strategy in recent works is to either tackle the tasks separately without MTL Lee et al. (2017b), or employ an MTL setup with multiple independent linear-chain CRFs in the prediction layer Schulz et al. (2019c).

As an alternative to the standard CRF, dynamic CRFs, such as the factorial CRF Sutton et al. (2007), explicitly model dependencies between multiple sequences of labels, but have not previously been integrated with DNNs, so until now have relied on fixed text representations that cannot be improved through training. In this work, we adopt factorial CRFs into a neural setting, finding that especially for difficult tasks and low resource settings, modeling task inter-dependencies outperforms both single task and multi-task setups that do not model the inter-dependencies, indicating that this additional flow of information helps performance considerably.

Our core contributions are: (1) a review of different CRF architectures for multi-task sequence tagging (MTST) in a neural network setting; (2) three new MTST models that integrate factorial CRFs with deep neural networks to exploit dependencies between tasks; and (3) an empirical analysis of different CRF architectures, showing situations where factorial CRF approaches are more suitable than traditional multi-task learning or single-task setups.

Our implementation extends the popular sequence labeling framework FLAIR111https://github.com/zalandoresearch/flair Akbik et al. (2019). To make future experiments and reproducibility easy, our experiments use existing publicly available datasets and we make our code available under https://github.com/UKPLab/multi-task-sequence-tagging.

2 Related Work

In recent years, research into sequence labeling has focused on representations of the input text. Several architectures were introduced that combine word and character embeddings as inputs to a DNN, evolving from the BiLSTM-CRF Huang et al. (2015) to BiLSTM-LSTM-CRF Lample et al. (2016) and BiLSTM-CNN-CRF Ma and Hovy (2016). These approaches have been enhanced by leveraging pretrained language models Peters et al. (2017); Liu et al. (2018) or using contextual embedding representations Akbik et al. (2018) such as ELMO Peters et al. (2018) or BERT Devlin et al. (2019). However, all of these approaches focus on the data representation and use a linear-chain CRF as the prediction head, so do not model task dependencies in an MTST scenario.

Multi-task learning (MTL) has been widely used in NLP to exploit multiple datasets for representation learning Collobert and Weston (2008); Liu et al. (2016); Nam et al. (2014); Liu et al. (2017). The general architecture of MTL systems consists of two components: (1) a shared data representation, and (2) an (independent) task-specific output or prediction layer Caruana (1997); Collobert and Weston (2008); Nam et al. (2014); Liu et al. (2016, 2017); Zhang and Yang (2017); Ruder (2017); Ruder et al. (2019); Sanh et al. (2019). Søgaard and Goldberg (2016) show that in MTL setups, different tasks perform better if the prediction layer is on different layers of multi-layer LSTMs for part-of-speech tagging (POS), syntactic chunking and CCG supertagging. Bingel and Søgaard (2017) provide an in-depth ablation study on which task combinations, such as POS, multi-word expressions, super-sense tagging, etc., profit from one another, while Changpinyo et al. (2018) design different strategies for sharing weights between tasks. More recently, Simpson et al. (2020) use variational inference to combine the predictions of multiple taggers trained on different tasks. Greenberg et al. (2018) train a single CRF from multiple datasets using marginal likelihood training to mitigate missing labels. However, these design traits are necessary because each task has different data with task-specific idiosyncrasies that require different encodings. We eliminate the need for such design traits by focusing on MTST settings where multiple labels are provided for the same set of sentences, thus all tasks share a single data representation.

At first glance, the MTST setup seems closely related to Nested NER (NNER) Alex et al. (2007); Finkel and Manning (2009), which introduces a hierarchical structure of dependent entities. However, NNER does not necessarily focus on different tasks at the hierarchical level, but allows the same label from the same task to be tagged over the same span multiple times. This is significantly different to MTST, where all overlapping spans correspond to distinct tasks.

In summary, existing work focuses on task specific solutions that either model a specific hierarchy in NNER Alex et al. (2007); Finkel and Manning (2009); Ju et al. (2018); Lin et al. (2019); Luan et al. (2019); Li et al. (2019), assume independence between tasks given the shared data representation Søgaard and Goldberg (2016); Bingel and Søgaard (2017); Greenberg et al. (2018); Changpinyo et al. (2018), or combine predictions from completely independent taggers Simpson et al. (2020). In contrast, we focus on simple CRF procedures that do not require task-specific adaptations, can be integrated as a prediction layer with any underlying data representation model, and learn jointly from multiple labels for the same sequence.

3 Multi-Task Sequence Tagging

(a) Linear-Chain CRF
(b) Multi-Head CRF
(c) Factorial CRF
(d) Weighted Factorial CRF
Figure 2: Different CRF architectures that take as input the output of an LSTM at each time-step in a sequence of two tokens. The circles represent predictions of the labels at each time-step, and the filled squares represent transformations.

We first define a feature function $f$ as any arbitrary function, with parameters $\theta$, that maps the $t$-th token, $x_t$, in an input text sequence to a vector representation or embedding, $f(x_t)$, to facilitate tasks such as sequence labeling. While feature functions have traditionally been defined by feature engineering, recent state-of-the-art models employ DNNs in the form of CNNs, LSTMs and, more recently, Transformers Vaswani et al. (2017) to model sequence-based features. As mentioned in Section 2, most sequence labeling approaches feed the output of a feature function into a basic linear-chain CRF to improve performance. In this work, we do not investigate new neural feature functions, which may have a task-specific nature. Instead, we evaluate and extend dynamic CRF models initially introduced by Sutton et al. (2007), using them to construct new neural architectures for sequence labeling that can be applied to arbitrary feature functions. We thus denote a feature function, $f$, as an arbitrary neural model for sequence tagging and refer to any combination of a feature function and a CRF as $f$-CRF. In the following sections we introduce different CRF models for multi-label sequence tagging; in our experiments, we then combine the dynamic CRF models with DNNs for the first time.

3.1 $f$-CRF

Linear-chain CRFs, illustrated in Figure 2(a), model a sequence under the first-order Markov assumption that the labels are only conditionally dependent on the label of the previous time-step and the features of the current time-step. A linear-chain CRF thus factorizes the conditional distribution of a sequence of labels $\mathbf{y}$ given the sequence of tokens $\mathbf{x}$ into two main terms:

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{t=1}^{T} \big(A f(x_t)\big)_{y_t} + B_{y_{t-1}, y_t} \Big) \quad (1)$$

where $A$ is an affine transformation from the output of $f$ to the prediction space, $y_t$ denotes the index of the label at time-step $t$, and $B$ is a transition matrix with entries $B_{y_{t-1}, y_t}$ that define the log probability of the label at the current time-step given the label at the previous time-step. We reduce the notation from $B_{y_{t-1}, y_t}$ to $B$ for simplicity. $Z(\mathbf{x})$ denotes the normalization factor over all possible sequence states of $\mathbf{y}$:

$$Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\Big( \sum_{t=1}^{T} \big(A f(x_t)\big)_{y'_t} + B_{y'_{t-1}, y'_t} \Big) \quad (2)$$
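As a concrete illustration of Equations (1) and (2), the sketch below computes the log-likelihood of a label sequence under a linear-chain CRF, with the affine scores $A f(x_t)$ precomputed as an emission matrix. The function name and array layout are our own illustration, not the paper's implementation.

```python
import numpy as np

def crf_log_likelihood(emissions, transitions, tags):
    """Log-likelihood of a tag sequence under a linear-chain CRF.

    emissions:   (T, K) array of unary scores A f(x_t) per time-step
    transitions: (K, K) array, transitions[i, j] = score of label j after label i
    tags:        length-T list of gold label indices
    """
    T, K = emissions.shape
    # Score of the gold path: unary plus transition terms (numerator of Eq. 1).
    score = emissions[0, tags[0]]
    for t in range(1, T):
        score += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    # log Z(x) via the forward algorithm (Eq. 2), in log-space for stability.
    alpha = emissions[0].copy()
    for t in range(1, T):
        # alpha[j] = logsumexp_i(alpha[i] + transitions[i, j]) + emissions[t, j]
        alpha = np.logaddexp.reduce(alpha[:, None] + transitions, axis=0) + emissions[t]
    log_z = np.logaddexp.reduce(alpha)
    return score - log_z
```

Exponentiating the log-likelihoods of all possible tag sequences sums to one, which is a useful sanity check for the forward recursion.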

3.2 Multi-Head $f$-CRF

To predict sequence labels for multiple tasks, we need to adapt the architecture of the $f$-CRF. This can be done with a standard multi-task setup where the different tasks share the same function $f$, with separate CRFs as prediction layers for each task. Our multi-head $f$-CRF architecture, illustrated in Figure 2(b), thus jointly learns the weights of a shared $f$ (the output of $f$ is the same for each task), but learns distinct affine transformations $A^{(m)}$ and transition matrices $B^{(m)}$ for each task $m$:

$$p\big(\mathbf{y}^{(1)}, \ldots, \mathbf{y}^{(M)} \mid \mathbf{x}\big) = \prod_{m=1}^{M} \frac{1}{Z^{(m)}(\mathbf{x})} \exp\Big( \sum_{t=1}^{T} \big(A^{(m)} f(x_t)\big)_{y^{(m)}_t} + B^{(m)}_{y^{(m)}_{t-1}, y^{(m)}_t} \Big) \quad (3)$$
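The product over independent per-task CRFs in Equation (3) becomes a sum of per-task losses in log-space. The following NumPy sketch illustrates this with hypothetical names; it is not the FLAIR-based implementation used in the paper.

```python
import numpy as np

def crf_nll(emissions, transitions, tags):
    """Negative log-likelihood of one linear-chain CRF (forward algorithm)."""
    score = emissions[0, tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    alpha = emissions[0].copy()
    for t in range(1, emissions.shape[0]):
        alpha = np.logaddexp.reduce(alpha[:, None] + transitions, axis=0) + emissions[t]
    return np.logaddexp.reduce(alpha) - score

def multi_head_nll(features, heads, gold):
    """Multi-head loss: a shared feature sequence f(x_1..T) feeds one
    independent CRF head (projection A, transitions B) per task, and the
    per-task negative log-likelihoods are summed."""
    total = 0.0
    for head, tags in zip(heads, gold):
        emissions = features @ head["A"]  # task-specific unary scores A^(m) f(x_t)
        total += crf_nll(emissions, head["B"], tags)
    return total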

3.3 Factorial $f$-CRF

While multi-head $f$-CRFs jointly learn the weights of $f$ for each of the tasks, they introduce a conditional independence assumption between the predicted labels given $f$. To mitigate this, we revisit factorial CRFs, illustrated in Figure 2(c), which are a special case of dynamic CRFs introduced by Sutton et al. (2007). Factorial CRFs model the conditional dependency between multiple tasks by introducing the log joint probability matrix $C$:

$$p\big(\mathbf{y}^{(1)}, \ldots, \mathbf{y}^{(M)} \mid \mathbf{x}\big) = \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{m=1}^{M} \sum_{t=1}^{T} \Big[ \big(A^{(m)} f(x_t)\big)_{y^{(m)}_t} + B^{(m)}_{y^{(m)}_{t-1}, y^{(m)}_t} \Big] + \sum_{m<n} \sum_{t=1}^{T} C^{(m,n)}_{y^{(m)}_t, y^{(n)}_t} \Big) \quad (4)$$

This encodes the dependency between tasks at each time-step. Since $C^{(m,n)} = \big(C^{(n,m)}\big)^{\top}$, the log joint probability matrix between task $m$ and task $n$ is shared between the two tasks.
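To make the coupling term concrete, the sketch below scores a pair of label sequences for two tasks under Equation (4), adding the shared matrix $C$ at every time-step; the function name and array layout are illustrative assumptions.

```python
import numpy as np

def factorial_score(em1, em2, B1, B2, C, tags1, tags2):
    """Unnormalized log-score of two tag sequences under a two-task
    factorial CRF: per-task unary and transition terms, plus the shared
    co-occurrence matrix C linking the labels of both tasks at each step."""
    T = em1.shape[0]
    score = em1[0, tags1[0]] + em2[0, tags2[0]] + C[tags1[0], tags2[0]]
    for t in range(1, T):
        score += em1[t, tags1[t]] + B1[tags1[t - 1], tags1[t]]
        score += em2[t, tags2[t]] + B2[tags2[t - 1], tags2[t]]
        score += C[tags1[t], tags2[t]]  # inter-task dependency at step t
    return score
```

Normalizing these joint scores over all label-pair sequences yields the distribution in Equation (4); raising an entry of `C` raises the score of every path where that label pair co-occurs.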

3.4 Weighted Factorial $f$-CRF

In practice, the labels for the other task, $\mathbf{y}^{(n)}$, are also uncertain. Therefore, we enhance factorial CRFs by introducing a new variant that weights the matrix $C$ according to this uncertainty. For this, we scale the log joint probability matrix by the likelihood of the label for the respective other task, $p(y^{(n)}_t \mid \mathbf{x})$:

$$p\big(\mathbf{y}^{(1)}, \ldots, \mathbf{y}^{(M)} \mid \mathbf{x}\big) = \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{m=1}^{M} \sum_{t=1}^{T} \Big[ \big(A^{(m)} f(x_t)\big)_{y^{(m)}_t} + B^{(m)}_{y^{(m)}_{t-1}, y^{(m)}_t} \Big] + \sum_{m \neq n} \sum_{t=1}^{T} p\big(y^{(n)}_t \mid \mathbf{x}\big)\, C^{(m,n)}_{y^{(m)}_t, y^{(n)}_t} \Big) \quad (5)$$

This is illustrated in Figure 2(d).
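The weighting in Equation (5) can be read as taking each entry of $C$ under the other task's predicted label distribution. A minimal sketch, with a hypothetical helper name and assuming the other task's per-step marginals are available as a probability vector:

```python
import numpy as np

def weighted_coupling(C, other_marginals, label_m):
    """Coupling score for label `label_m` of task m at one time-step:
    the row of the log joint probability matrix C is weighted by the
    marginal likelihood of the other task's labels, so uncertain
    predictions contribute proportionally less."""
    return float(other_marginals @ C[label_m])
```

With a one-hot (fully confident) marginal this reduces to the plain factorial coupling $C_{y^{(m)}_t, y^{(n)}_t}$; with a uniform marginal the coupling is averaged over the other task's labels.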

3.5 Cascaded Weighted Factorial $f$-CRF

To avoid modeling dependencies between the labels for all pairs of tasks, we specify a cascaded factorial CRF Sutton et al. (2007), which defines a hierarchy of dependencies. This decreases the complexity of inference as there are no longer circular dependencies between the labels for different tasks, meaning we can avoid expensive loopy dynamic programming Murphy et al. (1999). We can define a hierarchical setup by specifying an ordered list of tasks $(1, \ldots, M)$, where each task $m$ is dependent on task $n$ iff $n < m$. For this type of hierarchical setup, cascaded factorial CRFs can be defined as:

$$p\big(\mathbf{y}^{(1)}, \ldots, \mathbf{y}^{(M)} \mid \mathbf{x}\big) = \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{m=1}^{M} \sum_{t=1}^{T} \Big[ \big(A^{(m)} f(x_t)\big)_{y^{(m)}_t} + B^{(m)}_{y^{(m)}_{t-1}, y^{(m)}_t} \Big] + \sum_{m=1}^{M} \sum_{n<m} \sum_{t=1}^{T} p\big(y^{(n)}_t \mid \mathbf{x}\big)\, C^{(m,n)}_{y^{(m)}_t, y^{(n)}_t} \Big) \quad (6)$$

This structure is similar to the weighted factorial model depicted in Figure 2(d), except that the connections between tasks are only present for task pairs with indices $n < m$.
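The restriction to pairs with $n < m$ makes the dependency graph acyclic. A tiny sketch (hypothetical helper name) enumerating exactly the task pairs that remain coupled in the cascaded setup:

```python
def cascaded_couplings(num_tasks):
    """Task pairs (n, m) whose labels are coupled in a cascaded factorial
    CRF: task m depends on task n iff n < m, so the resulting dependency
    graph is acyclic and inference needs no loopy message passing."""
    return [(n, m) for m in range(num_tasks) for n in range(m)]
```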

Figure 3: Example from the Streusle dataset Schneider and Smith (2015), including the supersenses, indicated by the noun and verb classes, and MWEs, indicated by the BIO tags.

First I wanted to see if the problem was new, so I checked the teacher’s observations . As it was the same back then, I ruled out a trauma or another dramatic event. I was then undecided between autism and ADHD, since his social behaviour seems to be problematic and that’s a sign for both diagnoses. In the end, I settled on ADHD since his script seems chaotic and unorganised and because he seems to have some friends despite his difficult behaviour.

Figure 4: Example text from the TEd dataset, with highlighted spans for EG (green), EE (underlined), DC (yellow), HG (blue).

4 Datasets

We evaluate the different CRF architectures on three very diverse datasets in English and German. The Streusle dataset Schneider and Smith (2015) focuses on extracting the semantics of the text, introducing many different supersense categories and identifying multi-word expressions. The MalwareTextDB Lim et al. (2017), on the other hand, targets the extraction of malicious entities, a very difficult NER task. While these first two datasets are annotated at the sentence level and in English, FAMULUS Schulz et al. (2019a) is a document-level sequence labeling task in German. It focuses on diagnostic reasoning in the medical and teacher education domains and consists of 4 interdependent tasks.

Streusle

The Streusle dataset Schneider and Smith (2015) consists of three tasks: POS tagging, supersense categories (SSC) and multi-word expressions (MWE). SSC refers to top-level hypernyms from WordNet Miller (1998), which are designed to be broad enough to encompass all nouns and verbs Miller (1990); Fellbaum (1990). In total, the SSC task consists of 26 noun and 15 verb categories. MWEs consist of single- and multi-word noun and verb expressions with supersenses that encompass idioms, light verb constructions, verb-particle constructions, and compounds Sag et al. (2002). We provide an example annotation of the Streusle dataset in Figure 3. The dataset is divided into train, dev, and test splits.

Malware

The MalwareTextDB Lim et al. (2017) consists of 39 annotated Advanced Persistent Threat (APT) reports released by APTnotes222https://github.com/aptnotes/. The dataset targets the cybersecurity domain, with the goal of automatically detecting malicious entities. The sentences333https://github.com/juand-r/entity-recognition-datasets from all 39 reports are split into train, development, and test sets. We extend this dataset with an additional task (described below) by using the spacy.io framework444https://spacy.io/ to obtain silver part-of-speech tags.
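Attaching silver auxiliary labels amounts to adding an extra label column to the token-level data. A minimal sketch with a hypothetical helper name, assuming the silver tags (e.g. POS tags from a pretrained spaCy tagger) are already aligned with the tokenization:

```python
def attach_silver_labels(tokens, gold_tags, silver_tags):
    """Add silver auxiliary labels (e.g. POS tags from a pretrained
    tagger) as an extra column next to the gold task labels, yielding
    CoNLL-style lines with one token per line."""
    if not (len(tokens) == len(gold_tags) == len(silver_tags)):
        raise ValueError("all columns must align token-by-token")
    return [f"{tok}\t{gold}\t{silver}"
            for tok, gold, silver in zip(tokens, gold_tags, silver_tags)]
```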

Famulus

Overlap     Med              TEd
            #     av. len    #     av. len
EG/EE       5     3.8        8     7.9
HG/DC       4     8.5        2     22.0
DC/EE       342   9.8        143   10.9
EG/HG       0     -          3     6.0
HG/EE       12    5.7        8     11.1
EG/DC       4     6.8        3     11.7
Table 1: Corpus statistics in terms of absolute number (#) and average number of tokens (av. len), where EG/EE (and similar) denotes an overlap of an EG and EE segment.

The FAMULUS datasets Schulz et al. (2019a, b) comprise diagnostic reasoning annotations in the Medical (Med) and Teacher Education (TEd) domains. Each dataset contains summaries written by students of virtual patients (cases), in which the students reason over possible symptomatic diagnoses. The argumentative structure of the diagnoses is categorized into diagnostic activities (Fischer et al., 2014), covered by sub-spans of the text.

The dataset consists of 4 diagnostic activity classes: hypothesis generation (HG; the derivation of possible answers to the problem), evidence generation (EG; the derivation of evidence, e.g., through deductive reasoning or observing phenomena), evidence evaluation (EE; the assessment of whether and to which degree evidence supports an answer to the problem), and drawing conclusions (DC; the aggregation and weighing up of evidence and knowledge to derive a final answer to the problem), discussed in detail by Schulz et al. (2019a). A translated labeled example is shown in Figure 4.

While the datasets consist of 4 tasks, only two of them (DC and EE) have many examples of overlapping labels, as can be seen in Table 1. This means that while DC and EE are highly dependent on each other, EG and HG are mostly disjoint from the other tasks. Both the TEd and the Med datasets are divided into train, development, and test splits.

5 Experiments

In this section we describe our experimental procedures, including how we simulate low resource scenarios, our hyper-parameter search, and our inference strategy.

5.1 Low Resource Training Splits

We simulate low resource settings for the Streusle and Malware datasets by randomly splitting the training data into smaller sets. We create four training sets, consisting of 100, 500, and 1000 random samples, and the full dataset, respectively. To keep performance comparable, we use the full development and test sets for all scenarios.
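The subsampling procedure can be sketched as drawing nested random subsets of the training data while leaving the evaluation sets untouched; the function name, sizes, and fixed seed below are illustrative assumptions, not the paper's exact script.

```python
import random

def low_resource_splits(train, sizes=(100, 500, 1000), seed=42):
    """Simulate low-resource settings by drawing nested random subsets
    of the training data; dev and test sets are kept at full size so
    that results remain comparable across scenarios."""
    rng = random.Random(seed)
    shuffled = train[:]
    rng.shuffle(shuffled)
    # Prefixes of one shuffle give nested subsets: the 100-sample set
    # is contained in the 500-sample set, and so on.
    splits = {n: shuffled[:n] for n in sizes if n <= len(shuffled)}
    splits[len(train)] = train  # the full-data setting
    return splits
```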

5.2 Feature Function and Hyper-parameters

As the feature function for the representation of the time-steps, we use the commonly-used BiLSTM architecture throughout all our experiments. This architecture has been shown to perform well on various sequence labeling tasks in combination with CRFs Huang et al. (2015); Lample et al. (2016); Søgaard and Goldberg (2016); Ma and Hovy (2016); Reimers and Gurevych (2017); Bingel and Søgaard (2017); Akbik et al. (2018); Schulz et al. (2019c). As input to the BiLSTM, we combine character and pretrained word embedding representations. For the English Streusle and Malware datasets we use pretrained GloVe embeddings Pennington et al. (2014), and for the German FAMULUS datasets we use pretrained FastText embeddings Bojanowski et al. (2017).

Hyperparameter Values
# Layers
Hidden Size
Batch Size
Table 2: Hyper-parameter settings for the BiLSTM feature function which we randomly sample over for all experiments

We follow the widely-used training procedure for BiLSTM-CRFs: we first compute $f(x_t)$ using bidirectional LSTMs, concatenating the outputs of the forward and backward LSTM at the respective time-steps. The inputs to the LSTM at each time-step are embedding representations of the respective words and the output of a learned character language model, following Akbik et al. (2018). We then compute the forward and backward pass to obtain the gradients for the LSTM parameters and the matrices A, B and C. For more detail we refer to Lafferty et al. (2001); Sutton and McCallum (2012).

For all experimental setups we randomly sample from the hyper-parameter settings listed in Table 2. We use Adam Kingma and Ba (2014) for optimization with default settings; however, we perform linear learning rate warm-up over the first epoch and apply gradient clipping.

We follow Reimers and Gurevych (2017) by conducting 5 random seed runs for each hyper-parameter setting. We average the results of each run on the development set. We train all models until convergence on the loss of the development set and perform inference on the development set subsequently. In our results, we report the average test set scores for the best average development setting.

5.3 Inference

For decoding the optimal tags during inference, we use the Viterbi dynamic programming algorithm Forney (1973). Due to the inter-dependent nature of the factorial CRF structure, we additionally require a loopy inference algorithm to infer the most likely labels of the dependent tasks. For this, we run loopy belief propagation Murphy et al. (1999) at each time-step between all the tasks in the factorial settings. This procedure is greedy with respect to the dependent tasks, meaning that it is not guaranteed to find the optimal solution, but it resolves otherwise intractable computation and has been shown to work well in practice Murphy et al. (1999).
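For the single-chain case, Viterbi decoding can be sketched as follows; the factorial models additionally interleave loopy message passing between tasks, which is omitted here. The function name and array layout are our own illustration.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Most likely label sequence for a single linear-chain CRF.

    emissions:   (T, K) unary scores; transitions: (K, K) pairwise scores.
    """
    T, K = emissions.shape
    delta = emissions[0].copy()         # best score of any path ending in each label
    back = np.zeros((T, K), dtype=int)  # backpointers to the best previous label
    for t in range(1, T):
        scores = delta[:, None] + transitions  # scores[prev, cur]
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + emissions[t]
    # Follow backpointers from the best final label.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

On small instances, the output can be verified against brute-force enumeration of all $K^T$ paths.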

6 Results

In this section we report and discuss the results from our three datasets, Streusle, Malware and FAMULUS. We compare the different models introduced in Section 3 trained using the setup described in Section 5. Here we would like to point out that the multi-head (MH) model is the traditional multi-task setup that includes a shared feature function , such as a BiLSTM, but has separate CRFs for each task, such that the inter-dependency is not explicitly modeled.

6.1 Streusle

Task # Train ST MH Fac WFac CFac
POS 100 79.91 78.93 75.72 78.30 78.77
500 88.53 87.76 86.43 87.42 88.28
1000 91.00 90.94 89.78 90.21 91.00
2723 93.25 92.91 91.13 92.40 92.87
SSC 100 35.57 33.76 32.65 32.53 31.65
500 50.73 49.30 46.58 47.62 47.96
1000 57.70 56.66 52.92 53.68 55.23
2723 63.06 62.32 58.21 60.37 60.83
MWE 100 3.23 7.66 15.27 17.32 12.03
500 25.07 32.34 32.71 39.10 23.91
1000 40.00 38.67 38.53 45.05 34.04
2723 50.51 50.17 50.51 51.46 43.22
Table 3: F1-Results of the different CRF architectures on the Streusle tasks. Single task (ST), multi-head (MH), factorial (Fac), weighted factorial (WFac), cascaded factorial (CFac).
Figure 5: Results on the MWE task of the Streusle dataset. The lines represent the mean results with bands indicating one standard deviation.

The results of the different architectures for SSC and MWE are presented in Table 3. We can see that sharing representations between the tasks improves performance on the MWE task. Especially in the sparse scenario where we only train on 100 instances, the single task setup is outperformed by 14 points. We also find that factorial CRFs outperform the multi-head CRFs over all training data sizes, which is illustrated in Figure 5.

On the other hand, the results of the multi-head as well as the factorial settings for the POS and SSC tasks are consistently a few points below the single task setting, indicating interference between tasks, as has been reported frequently for multi-task learning McCloskey and Cohen (1989); French (1999); Lee et al. (2017a). However, for the harder MWE task we see consistent performance gains, indicating that there is a strong inter-dependency between the tasks of the Streusle dataset. By explicitly modeling this inter-dependency, large performance gains can be achieved for the MWE task.

6.2 Malware

Malw ST MH WFac
100 4.52 15.88 16.98
500 24.57 32.51 29.61
1000 36.93 38.70 36.70
4952 48.25 46.94 47.05
Table 4: F1-Results of the different CRF architectures on the MalwareTextDB Lim et al. (2017) dataset. Single task (ST), multi-head (MH), weighted factorial (WFac)
Figure 6: Results on the Malware dataset. The lines are the mean results, bands indicate one standard deviation.

For the MalwareTextDB dataset we probe whether it is possible to leverage silver labels from a pretrained part-of-speech tagger555https://spacy.io/ as an auxiliary task to increase performance on the actual task. We find that especially in the low resource settings with only 100 or 500 training examples, both the multi-head $f$-CRF and the factorial $f$-CRF outperform the single task model (Table 4). This indicates that the silver POS labels introduce additional supervision for the task in these settings. However, as the training data increases, the performances of all models are on par (Figure 6). This is in line with what can be expected: given sufficient data, the sequential representations learned by LSTMs are known to be powerful enough to implicitly capture syntactic features such as POS tags, mitigating the need to explicitly induce them as labels.

6.3 Famulus

Med ST MH WFac
DC 58.63 59.92 62.14
EG 71.67 66.41 65.25
EE 85.31 85.80 85.89
HG 59.05 54.93 56.56
Table 5: F1-Results of the different CRF architectures on the Med FAMULUS Schulz et al. (2019c) dataset. Single task (ST), multi-head (MH), weighted factorial (WFac)
TEd ST MH WFac
DC 50.15 53.57 54.28
EG 76.57 74.49 74.44
EE 84.09 85.07 85.33
HG 42.96 38.89 36.05
Table 6: F1-Results of the different CRF architectures on the TEd FAMULUS Schulz et al. (2019c) dataset. Single task (ST), multi-head (MH), weighted factorial (WFac)

In Tables 5 and 6 we present the results for the different architecture setups for the FAMULUS Med and TEd datasets respectively. In line with the overlapping labels for each task presented in Table 1, we find that the shared representation models (multi-head and factorial) outperform the single task models for DC and EE. However, for EG and HG, the single task models are better than the joint representations. This is in line with what can be expected as EG and HG have almost no overlapping labels with the other tasks. Modeling inter-dependency between the tasks thus hurts performance as the model tries to find a joint representation between the tasks. For the dependent tasks DC and EE, we find that the weighted factorial model outperforms the multi-head model for both tasks, but with a larger margin for DC for both Med and TEd. This indicates that modeling the inter-dependency between the tasks helps the model generalize better by leveraging the prediction of the respective other task.

7 Discussion

Figure 7: Heatmaps of the joint probability matrix C between POS and MWE labels of the Streusle dataset. The top heatmap shows positive correlations between tasks as non-black entries. The bottom heatmap represents combinations that are unlikely to occur together, thus the values are negative. Rows correspond to the beginning and inside token labels of the MWE spans; columns correspond to POS tags.

In our experiments, we found that by explicitly modeling the task inter-dependencies, performance gains can be achieved for many scenarios. This effect can be seen especially in the low-resource settings, in which the weighted factorial model (WFac) outperforms the single task (ST) as well as the traditional multi-task (MH) models. This is also true for cases where we make use of cheap silver labels from pretrained POS taggers.

The strongest performance gains can be achieved when combining multiple related tasks with spans that appear infrequently in the dataset, such as SSC and MWE in the Streusle dataset. There is a strong inter-dependency between the tasks which the model is not able to implicitly learn in the multi-head or single task setting, compared to the explicit dependency representation of WFac.

An example of the explicit dependencies modeled by WFac is illustrated in Figure 7. Here, we plot two heatmaps of the C matrices that encode the dependencies between each of the POS tags and the MWE labels for the Streusle dataset. The top heatmap shows positive correlations, i.e., the label combinations that are likely to occur. The bottom heatmap shows negative dependencies, where labels are unlikely to coincide. The heatmaps show that the model has learned dependencies between the MWE labels and specific POS tags. It uses these values to downscale the probability of labels which are unlikely to co-occur and to upscale those which are likely to appear at the same token. By modeling the dependencies explicitly, WFac can directly leverage the predictions of other tasks. In contrast, multi-head models only share the feature function $f$, so require more data to learn to encode the dependency within the deep model.

While we consistently see performance gains for the multi-task approaches for a subset of the tasks, the performances for other tasks simultaneously deteriorate. For the Streusle dataset we observe gains for MWE across all multi-task settings over the single task setting, however for the two other tasks POS and SSC the single task setting performs the best. Similarly, for the FAMULUS dataset, tasks EG and HG perform the best in the single-task setting as these do not have many overlapping labels with the respective other tasks. Similar observations of interference between tasks for multi-task learning have been reported frequently in literature McCloskey and Cohen (1989); French (1999); Lee et al. (2017a), indicating that sharing the entirety of parameters can be harmful for performance for a subset of the tasks. However, when leveraging additional labels for auxiliary tasks, such as the silver POS tags for the Malware dataset, performance drops on the auxiliary tasks can be disregarded as the performance gain on the target task is the objective.

8 Conclusion

In this paper, we investigated multi-task sequence tagging, introducing neural factorial CRF models that explicitly model the inter-dependencies between different task labels. We compared different methods for datasets where multiple labels are available for each example, including single task learning, standard multi-task learning, and factorial CRFs, finding strong performance for factorial models in low resource settings where spans of different tasks coincide.

Similar to what has been reported in literature, we observe interference between tasks in multi-task learning settings, indicating that sharing the entirety of parameters decreases performance on a subset of the tasks. In the future we will investigate recent Adapter approaches Rebuffi et al. (2017); Houlsby et al. (2019) which train new parameters within each layer of pre-trained models, to combine them for multi-task learning as proposed by Pfeiffer et al. (2020).

Based on our results, we believe that modeling the inter-dependencies between tasks could be beneficial during the early stages of dataset creation, where only small amounts of data are available. Employing such models in a bootstrapping setup to provide annotators with label suggestions can increase the speed of dataset creation as well as improve the inter-annotator agreement Schulz et al. (2019b); Pfeiffer et al. (2019).

Acknowledgments

This work has been supported by the German Federal Ministry of Education and Research (BMBF) under the reference 16DHL1041 (FAMULUS).

References

  • A. Akbik, T. Bergmann, D. Blythe, K. Rasul, S. Schweter, and R. Vollgraf (2019) FLAIR: an easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, Minnesota, pp. 54–59.
  • A. Akbik, D. Blythe, and R. Vollgraf (2018) Contextual string embeddings for sequence labeling. In Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018), Santa Fe, New Mexico, USA, pp. 1638–1649.
  • B. Alex, B. Haddow, and C. Grover (2007) Recognising nested named entities in biomedical text. In Proceedings of the Workshop on BioNLP 2007: Biological, Translational, and Clinical Language Processing, pp. 65–72.
  • J. Bingel and A. Søgaard (2017) Identifying beneficial task relations for multi-task learning in deep neural networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 164–169.
  • P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov (2017) Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics 5, pp. 135–146.
  • R. Caruana (1997) Multitask learning. Machine Learning 28 (1), pp. 41–75.
  • S. Changpinyo, H. Hu, and F. Sha (2018) Multi-task learning for sequence tagging: an empirical study. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 2965–2977.
  • R. Collobert and J. Weston (2008) A unified architecture for natural language processing: deep neural networks with multitask learning. In Proceedings of the 25th International Conference on Machine Learning, pp. 160–167.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • C. Fellbaum (1990) English verbs as a semantic net. International Journal of Lexicography 3 (4), pp. 278–301.
  • J. R. Finkel and C. D. Manning (2009) Nested named entity recognition. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, pp. 141–150.
  • F. Fischer, I. Kollar, S. Ufer, B. Sodian, H. Hussmann, R. Pekrun, B. Neuhaus, B. Dorner, S. Pankofer, M. R. Fischer, J. Strijbos, M. Heene, and J. Eberle (2014) Scientific reasoning and argumentation: advancing an interdisciplinary research agenda in education. Frontline Learning Research 4, pp. 28–45.
  • G. D. Forney (1973) The Viterbi algorithm. Proceedings of the IEEE 61 (3), pp. 268–278.
  • R. M. French (1999) Catastrophic forgetting in connectionist networks. Trends in Cognitive Sciences 3 (4), pp. 128–135.
  • N. Greenberg, T. Bansal, P. Verga, and A. McCallum (2018) Marginal likelihood training of BiLSTM-CRF for biomedical named entity recognition from disjoint label sets. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2824–2829.
  • N. Houlsby, A. Giurgiu, S. Jastrzkebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly (2019) Parameter-efficient transfer learning for NLP. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), Long Beach, California, USA, pp. 2790–2799.
  • Z. Huang, W. Xu, and K. Yu (2015) Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
  • M. Ju, M. Miwa, and S. Ananiadou (2018) A neural layered model for nested named entity recognition. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1446–1459.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • J. Lafferty, A. McCallum, and F. C. Pereira (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, pp. 282–289.
  • G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016) Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 260–270.
  • J. Lee, K. Cho, and T. Hofmann (2017a) Fully character-level neural machine translation without explicit segmentation. Transactions of the Association for Computational Linguistics 5, pp. 365–378.
  • J. Lee, S. Eger, J. Daxenberger, and I. Gurevych (2017b) UKP TU-DA at GermEval 2017: deep learning for aspect-based sentiment detection. In Proceedings of the GSCL GermEval Shared Task on Aspect-based Sentiment in Social Media Customer Feedback, pp. 22–29.
  • X. Li, J. Feng, Y. Meng, Q. Han, F. Wu, and J. Li (2019) A unified MRC framework for named entity recognition. arXiv preprint arXiv:1910.11476.
  • S. K. Lim, A. O. Muis, W. Lu, and C. H. Ong (2017) MalwareTextDB: a database for annotated malware articles. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1557–1567.
  • H. Lin, Y. Lu, X. Han, and L. Sun (2019) Sequence-to-nuggets: nested entity mention detection via anchor-region networks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5182–5192.
  • L. Liu, J. Shang, X. Ren, F. F. Xu, H. Gui, J. Peng, and J. Han (2018) Empower sequence labeling with task-aware neural language model. In Thirty-Second AAAI Conference on Artificial Intelligence.
  • P. Liu, X. Qiu, and X. Huang (2016) Recurrent neural network for text classification with multi-task learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 2873–2879.
  • P. Liu, X. Qiu, and X. Huang (2017) Adversarial multi-task learning for text classification. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1–10.
  • Y. Luan, D. Wadden, L. He, A. Shah, M. Ostendorf, and H. Hajishirzi (2019) A general framework for information extraction using dynamic span graphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 3036–3046.
  • X. Ma and E. Hovy (2016) End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1064–1074.
  • M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of Learning and Motivation, Vol. 24, pp. 109–165.
  • G. A. Miller (1990) Nouns in WordNet: a lexical inheritance system. International Journal of Lexicography 3 (4), pp. 245–264.
  • G. A. Miller (1998) WordNet: an electronic lexical database. MIT Press.
  • K. P. Murphy, Y. Weiss, and M. I. Jordan (1999) Loopy belief propagation for approximate inference: an empirical study. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 467–475.
  • J. Nam, J. Kim, E. L. Mencía, I. Gurevych, and J. Fürnkranz (2014) Large-scale multi-label text classification – revisiting neural networks. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 437–452.
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
  • M. Peters, W. Ammar, C. Bhagavatula, and R. Power (2017) Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1756–1765.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237.
  • J. Pfeiffer, A. Kamath, A. Rücklé, K. Cho, and I. Gurevych (2020) AdapterFusion: non-destructive task composition for transfer learning. arXiv preprint.
  • J. Pfeiffer, C. M. Meyer, C. Schulz, J. Kiesewetter, J. M. Zottmann, M. Sailer, E. Bauer, F. Fischer, M. R. Fischer, and I. Gurevych (2019) FAMULUS: interactive annotation and feedback generation for teaching diagnostic reasoning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019): System Demonstrations, Hong Kong, China, pp. 73–78.
  • S. Rebuffi, H. Bilen, and A. Vedaldi (2017) Learning multiple visual domains with residual adapters. In Advances in Neural Information Processing Systems 30, Long Beach, CA, USA, pp. 506–516.
  • N. Reimers and I. Gurevych (2017) Reporting score distributions makes a difference: performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 338–348.
  • S. Ruder, J. Bingel, I. Augenstein, and A. Søgaard (2019) Latent multi-task architecture learning. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), Honolulu, Hawaii, USA, pp. 4822–4829.
  • S. Ruder (2017) An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
  • I. A. Sag, T. Baldwin, F. Bond, A. Copestake, and D. Flickinger (2002) Multiword expressions: a pain in the neck for NLP. In International Conference on Intelligent Text Processing and Computational Linguistics, pp. 1–15.
  • V. Sanh, T. Wolf, and S. Ruder (2019) A hierarchical multi-task approach for learning embeddings from semantic tasks. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), Honolulu, Hawaii, USA, pp. 6949–6956.
  • N. Schneider and N. A. Smith (2015) A corpus and model integrating multiword expressions and supersenses. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 1537–1547.
  • C. Schulz, C. M. Meyer, and I. Gurevych (2019a) Challenges in the automatic analysis of students’ diagnostic reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 6974–6981.
  • C. Schulz, C. M. Meyer, J. Kiesewetter, M. Sailer, E. Bauer, M. R. Fischer, F. Fischer, and I. Gurevych (2019b) Analysis of automatic annotation suggestions for hard discourse-level tasks in expert domains. arXiv preprint arXiv:1906.02564.
  • C. Schulz, C. M. Meyer, and I. Gurevych (2019c) Challenges in the automatic analysis of students’ diagnostic reasoning. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence (AAAI), Honolulu, HI, USA.
  • E. Simpson, J. Pfeiffer, and I. Gurevych (2020) Low resource sequence tagging with weak labels. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
  • A. Søgaard and Y. Goldberg (2016) Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany, pp. 231–235.
  • C. Sutton and A. McCallum (2012) An introduction to conditional random fields. Foundations and Trends in Machine Learning 4 (4), pp. 267–373.
  • C. Sutton, K. Rohanimanesh, and A. McCallum (2007) Dynamic conditional random fields: factorized probabilistic models for labeling and segmenting sequence data. Journal of Machine Learning Research 8, pp. 693–723.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
  • Y. Zhang and Q. Yang (2017) A survey on multi-task learning. arXiv preprint arXiv:1707.08114.