
Learning to Contextually Aggregate Multi-Source Supervision for Sequence Labeling

Sequence labeling is a fundamental framework for various natural language processing problems. Its performance is largely influenced by the annotation quality and quantity in supervised learning scenarios. In many cases, ground truth labels are costly and time-consuming to collect or even non-existent, while imperfect ones can be easily accessed or transferred from different domains. In this paper, we propose a novel framework named Consensus Network (ConNet) to conduct training with imperfect annotations from multiple sources. It learns the representation for every weak supervision source and dynamically aggregates them by a context-aware attention mechanism. Finally, it leads to a model reflecting the consensus among multiple sources. We evaluate the proposed framework in two practical settings of multi-source learning: learning with crowd annotations and unsupervised cross-domain model adaptation. Extensive experimental results show that our model achieves significant improvements over existing methods in both settings.





1 Introduction

Figure 1: Illustration of the task settings for the two applications in this work: (a) learning consensus model from crowd annotations; (b) unsupervised cross-domain model adaptation.

Sequence labeling is a general approach encompassing various natural language processing (NLP) tasks including part-of-speech (POS) tagging (ratnaparkhi1996maximum), noun phrase chunking (sang2000introduction), word segmentation (low2005maximum), and named entity recognition (NER) (nadeau2007survey). Typically, existing methods follow the supervised learning paradigm and require high-quality annotations. While gold standard annotation is expensive and time-consuming, imperfect annotations are relatively easy to obtain from crowdsourcing (noisy labels) or other domains (out-of-domain). Besides being cheap, such supervision can usually be obtained from several different sources, and it has been shown that multi-source weak supervision has the potential to perform similarly to gold annotations (ratner2016data).

Specifically, we are interested in two scenarios: 1) learning with crowd annotations and 2) unsupervised cross-domain model adaptation. Both situations suffer from imperfect annotations, and benefit from multiple sources. Therefore, the key challenge here is to aggregate multi-source imperfect annotations for learning a model without knowing the underlying ground truth label sequences in the target domain.

Our intuition mainly comes from the observation that different sources of supervision have different strengths and are more proficient in distinct situations. Therefore, their importance should not stay fixed when supervision is aggregated: aggregating multiple sources for a specific input should be a dynamic process that depends on the sentence context. To better model this, we need to (1) explicitly model the unique traits of different sources during training and (2) find the most suitable sources when generalizing the learned model to unseen sentences.

In this paper, we propose a novel framework, named Consensus Network (ConNet), for sequence labeling with multi-source supervision. We represent the annotation patterns as different biases of annotators over a shared behavior pattern. Both annotator-invariant patterns and annotator-specific biases are modeled in a decoupled way. The former are captured by sharing part of the low-level model parameters in a multi-task learning schema. To learn the biases, we decouple them from the model as transformations on the top-level tagging model parameters, such that they can capture the unique strength of each annotator. With such decoupled source representations, we further learn an attention network that dynamically assigns the best sources to every unseen sentence by composing a transformation that represents the consensus. Extensive experimental results in two scenarios show that our model outperforms strong baseline methods. ConNet achieves state-of-the-art performance on real-world crowdsourcing datasets and improves significantly over existing work in most unsupervised cross-domain adaptation tasks. In addition to sequence labeling, it is also effective on text classification tasks.

2 Related Work

There exist three threads of related work regarding the topics in this paper, which are sequence labeling, crowd-sourced annotation and unsupervised domain adaptation.

Neural Sequence Labeling.

Traditional approaches for sequence labeling usually need significant efforts in feature engineering for graphical models like hidden Markov models (HMMs) (Rabiner1989IEEE) and conditional random fields (CRFs) (Lafferty2001CRF). Recent research efforts in neural network models have shown that end-to-end learning with convolutional neural networks (CNNs) or bidirectional long short-term memory networks (BLSTMs) (lample2016neural) can largely eliminate human-crafted features. Together with a final CRF layer, these BLSTM-CRF models have achieved promising performance and are used as our base sequence tagging model in this paper.

Crowd-sourced Annotation. Crowd-sourcing has been demonstrated to be an effective way of fulfilling the label consumption of neural models (Guan2017CoRR; lin2019alpacatag). It collects annotations from non-expert contributors at lower cost and higher speed, but suffers from some degradation in quality. Dawid1979EM proposes the pioneering work to aggregate crowd annotations to estimate true labels, and Snow2008EMNLP shows its effectiveness with Amazon’s Mechanical Turk system. Later works (Dempster1977MLL; dredze2009sequence; raykar2010learning) focus on Expectation-Maximization (EM) algorithms to jointly learn the model and annotator behavior on classification problems. Recent research shows the strength of multi-task frameworks in semi-supervised learning (lan2018semi; clark2018semi) and cross-type learning (Wang2018BioNER). Nguyen2017ACL and Rodrigues2018AAAI regard crowd annotations as noisy versions of gold labels and construct crowd components to model annotator-specific bias, which are discarded during inference. It is worth mentioning that even human-curated annotations have been found to contain label noise that hinders model performance (wang2019crossweigh).

Unsupervised Domain Adaptation. Unsupervised cross-domain adaptation aims to transfer knowledge learned from high-resource domains (source domains) to boost performance on low-resource domains (target domains) of interest, such as social media messages (lin2017multi). Different from supervised adaptation (Lin2018NeuralAL), we assume there are no labels at all for target corpora. saito2017asymmetric and ruder2018strong explored bootstrapping with a multi-task tri-training approach, which requires unlabeled data from the target domain. The method is developed for one-to-one domain adaptation and does not model the differences among multiple source domains. yang2015unsupervised represented each domain with a vector of metadata domain attributes and used domain vectors to train the model to deal with domain shift, which is highly dependent on prior domain knowledge. Another approach uses an auto-encoder method that jointly trains a predictor for source labels and a decoder that reproduces target input, with a shared encoder. The decoder acts as a normalizer to force the model to learn shared knowledge between source and target domains. An adversarial penalty can be applied to the loss function to make models learn domain-invariant features only (fernando2015joint; long2014transfer; ming2015unsupervised). However, it does not exploit domain-specific information.

3 Multi-source Supervised Learning

We formulate the multi-source sequence labeling problem as follows. Given sources of supervision, we regard each source as an imperfect annotator (non-expert human tagger or models trained in related domains). For the -th source data set , we denote its -th sentence as which is a sequence of tokens: . The tag sequence of the sentence is marked as . We define the sentence set of each annotators as , and the whole training domain as the union of all sentence sets: . The goal of the multi-source learning task is to use such imperfect annotations to train a model for predicting the tag sequence for any sentence in a target corpus . Note that the target corpus can either share the same distribution with (Application I) or be significantly different (Application II). In the following two subsections, we formulate two typical tasks in this problem as shown in Fig. 1.

Application I: Learning with Crowd Annotations. When learning with crowd-sourced sequence labeling data, we regard each worker as an imperfect annotator (), who may make mistakes or skip sentences in its annotations. Note that for crowd-sourcing data, different annotators tag subsets of the same given dataset (), and thus we assume there are no input distribution shifts among . Also, we only test sentences in the same domain such that the distribution in target corpus is the same as well. That is, the marginal distribution of target corpus is the same with that for each individual source dataset, i.e. . However, due to imperfectness of the annotations in each source, is shifted from the underlying truth (illustrated in the top-left part of Fig. 1). The multi-source learning objective here is to learn a model for supporting inference on any new sentences in the same domain.

Application II: Unsupervised Cross-Domain Model Adaptation. We assume annotations are available in several source domains, but not in an unseen target domain. We assume that the input distributions in different source domains vary a lot, and such annotations can hardly be adapted for training a target domain model. That is, the prediction distribution of each domain model () is close to the underlying truth distribution () only when . For target corpus sentences , such a source model again differs from underlying ground truth for the target domain and can be seen as an imperfect annotator. Our objective in this setting is also to jointly model while noticing that there are significant domain shifts between and any other .

4 Consensus Network

Figure 2: Overview of the ConNet framework. The decoupling phase constructs the shared model (yellow) and source-specific matrices (blue). The aggregation phase dynamically combines crowd components into a consensus representation (blue) by a context-aware attention module (red) for each sentence .

In this section, we present our two-phase framework ConNet for multi-source sequence labeling. As shown in Figure 2, our proposed framework first uses a multi-task learning schema with a special objective to decouple annotator representations as different parameters of a transformation around CRF layers. This decoupling phase (Section 4.2) separates the model parameters into a set of annotator-invariant model parameters and a set of annotator-specific representations. Secondly, the dynamic aggregation phase (Section 4.3) learns to contextually utilize the annotator representations with a lightweight attention mechanism to find the most suitable transformation for each sentence, so that the model can achieve a context-aware consensus among all sources. The inference process is described in Section 4.4.

4.1 The Base Model: BiLSTM-CRF

Many recent sequence labeling frameworks (Ma2016ACL; misawa-etal-2017-character) share a very basic structure: a bidirectional LSTM network followed by a CRF tagging layer (i.e. BLSTM-CRF). The BLSTM encodes an input sequence into a sequence of hidden state vectors . The CRF takes as input the hidden state vectors and computes an emission score matrix where is the size of tag set. It also maintains a trainable transition matrix . We can consider is the score of labeling the tag with id for word in the input sequence , and means the transition score from tag to .

The CRF further computes the score for a predicted tag sequence as


and then tag sequence follows the conditional distribution
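As a minimal sketch of the standard linear-chain CRF quantities described in this subsection (not the authors' implementation), the path score and the conditional log-probability can be computed as follows. The names `U` (emission scores, one row per token), `M` (transition scores) and all function names are assumptions for illustration, and the first token is scored without a start-tag transition, which is a simplification.

```python
import itertools
import numpy as np

def sequence_score(U, M, tags):
    """Sum of emission and transition scores along one tag path.

    U: (T, K) emission scores; M: (K, K) transition scores, M[i, j] for tag i -> tag j.
    Assumption: the first token contributes only its emission score.
    """
    score = U[0, tags[0]]
    for t in range(1, len(tags)):
        score += M[tags[t - 1], tags[t]] + U[t, tags[t]]
    return float(score)

def log_partition(U, M):
    """Log-sum-exp of the scores of all K**T tag paths (forward algorithm)."""
    alpha = U[0].copy()  # log-scores of all length-1 prefixes, one per tag
    for t in range(1, U.shape[0]):
        # scores[i, j]: best-prefix mass ending in tag i, extended with tag j at step t
        scores = alpha[:, None] + M + U[t][None, :]
        m = scores.max(axis=0)
        alpha = m + np.log(np.exp(scores - m).sum(axis=0))
    m = alpha.max()
    return float(m + np.log(np.exp(alpha - m).sum()))

def log_prob(U, M, tags):
    """Conditional log-probability of a tag path: score minus log-partition."""
    return sequence_score(U, M, tags) - log_partition(U, M)

def brute_force_log_partition(U, M):
    """Reference implementation by explicit enumeration; only feasible for tiny inputs."""
    T, K = U.shape
    scores = [sequence_score(U, M, p) for p in itertools.product(range(K), repeat=T)]
    return float(np.log(np.sum(np.exp(scores))))
```

The forward recursion keeps the cost linear in sentence length, whereas the brute-force version enumerates every path and is included only as a correctness check.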


4.2 The Decoupling Phase: Learning annotator representations

For decoupling annotator-specific biases in annotations, we represent them as a transformation on emission scores and transition scores respectively. Specifically, we learn a matrix for each imperfect annotator and apply this matrix as transformation on and as follows:


From this transformation, we can see that the original score function in Eq. 1 becomes an annotator-specific computation. The original emission and transition score matrices and are still shared by all the annotators, while they both are transformed by the matrix for the -th annotator. While training the model parameters in this phase, we follow a multi-task learning schema. That is, we share the model parameters for BLSTM and CRF (including , , ), while updating only by examples in .
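The annotator-specific transformation can be illustrated with a short sketch. It assumes that the annotator matrix, called `A_k` here, is square with one row and column per tag and is applied by right-multiplication to both the shared emission scores `U` and the shared transition scores `M`. These names and the multiplication order are assumptions for illustration.

```python
import numpy as np

def annotator_scores(U, M, A_k):
    """Transform shared scores with one annotator's (K, K) matrix.

    U: (T, K) shared emission scores; M: (K, K) shared transition scores.
    Assumption: the matrix is applied on the right, so it mixes tag columns.
    """
    return U @ A_k, M @ A_k

def path_score(U, M, tags):
    """Score of one tag path under given emission and transition scores."""
    score = U[0, tags[0]]
    for t in range(1, len(tags)):
        score += M[tags[t - 1], tags[t]] + U[t, tags[t]]
    return float(score)
```

With an identity matrix the annotator's scores equal the shared ones, so an annotator whose matrix stays close to the identity behaves like the shared model, while off-diagonal mass redistributes score between tags.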

The learning objective is to minimize the negative log-likelihood of all source annotations:
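One common way to write such an objective is sketched below. The notation is assumed for illustration rather than taken from the text: $S$ is the number of sources, $(x^{(k)}_i, y^{(k)}_i)$ is the $i$-th annotated sentence of source $k$, $N_k$ is the number of sentences annotated by source $k$, $\theta$ collects the shared BLSTM and CRF parameters, and $A^{(k)}$ is the matrix of source $k$. Whether the original objective aggregates the per-sentence likelihoods in exactly this way is not confirmed here.

```latex
\mathcal{L}(\theta, A^{(1)}, \dots, A^{(S)})
  \;=\; -\sum_{k=1}^{S} \sum_{i=1}^{N_k}
  \log p\!\left(y^{(k)}_i \,\middle|\, x^{(k)}_i;\ \theta, A^{(k)}\right)
```

Under this form, each term touches the shared parameters and only the matrix of the source that produced the annotation, which matches the multi-task sharing described above.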


The assumption on the annotation representation is that it can model the pattern of annotation bias. Each annotator can be seen as a noisy version of the shared model. For the -th annotator, models noise from labeling the current word and transferring from the previous label. Specifically, each entry captures the probability of mistakenly labeling -th tag to -th tag. In other words, the base sequence labeling model in Sec. 4.1 learns the basic consensus knowledge while annotator-specific components add their understanding to predictions.

4.3 The Aggregation Phase: Dynamically Reaching Consensus

In the second phase, our proposed network learns a context-aware attention module for a consensus representation supervised by combined predictions on the target data. For each sentence in target data , these predictions are combined by weighted voting. The weight of each source is its normalized score on the training set. Through weighted voting on such augmented labels over all source sentences , we can find a good approximation of underlying truth labels.
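The weighted voting step can be sketched as follows, assuming token-level voting in which each source's vote counts in proportion to its normalized training score. The function name and the token-level granularity are assumptions; the text does not state which score is normalized, and F1 would be a natural choice.

```python
from collections import defaultdict

def weighted_vote(predictions, source_scores):
    """Combine per-source tag sequences for one sentence by weighted voting.

    predictions: list of tag sequences, one per source, all of equal length.
    source_scores: non-negative training score of each source (hypothetical input).
    """
    total = sum(source_scores)
    weights = [s / total for s in source_scores]  # normalize so the weights sum to 1
    voted = []
    for position in zip(*predictions):  # all sources' tags for one token
        tally = defaultdict(float)
        for tag, weight in zip(position, weights):
            tally[tag] += weight
        voted.append(max(tally, key=tally.get))
    return voted
```

For example, with three sources scored 0.9, 0.3 and 0.2, a tag proposed only by the first source still wins a token because its normalized weight exceeds the other two combined.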

For better generalization, an attention module is trained to estimate the relevance of each source to the target under the supervision of generated labels. Specifically, source-specific matrices are aggregated into a consensus representation for sentence by


The attention vector is calculated from the sentence embedding with respect to . We use the head and tail hidden states as the sentence embedding i.e.  .



is the size of each hidden state. In this way, the consensus representation contains more information about sources which are more related to the current sentence. It also alleviates the contradiction problem among sources, because it could consider multiple sources of different emphasis. Since only an attention model with weight matrix is required to be trained, the amount of computation is relatively small. We assume the base model and annotator representations are well-trained in the previous phase. The main objective in this phase is to learn how to select the most suitable annotators for the current sentence.
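A minimal sketch of this aggregation step is given below, assuming a softmax attention over sources computed from a single linear map of the sentence embedding. The names `H`, `Q` and `A`, the shapes, and the use of a plain softmax are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def consensus_matrix(H, Q, A):
    """Aggregate source-specific matrices into one consensus matrix for a sentence.

    H: (T, d) BLSTM hidden states of the sentence.
    Q: (S, 2d) attention weight matrix, the only parameter trained in this phase.
    A: (S, K, K) stacked source-specific matrices, kept fixed here.
    """
    sentence = np.concatenate([H[0], H[-1]])  # head and tail hidden states
    q = softmax(Q @ sentence)                 # one attention weight per source
    A_star = np.tensordot(q, A, axes=1)       # weighted sum of the S matrices
    return q, A_star
```

Because the attention weights are non-negative and sum to one, the consensus matrix is a convex combination of the source matrices; with an all-zero `Q` it reduces to their plain average.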

4.4 Parameter Learning and Inference

ConNet learns parameters through two phases described above. In the decoupling phase, each instance from source is used for training the base sequence labeling model and its representation . In the aggregation phase, we use aggregated predictions from the first phase to learn a lightweight attention module. For each instance in the target corpus , we calculate its embedding from BLSTM hidden states. With these sentence embeddings, the context-aware attention module assigns weight to each source and dynamically aggregates source-specific representations for inferring . In the inference process, only the consolidated consensus matrix is applied to the base sequence learning model. In this way, more specialist knowledge helps to deal with more complex instances.

5 Experiments

We evaluate ConNet in two practical settings of multi-source learning: learning with crowd annotations and unsupervised cross-domain model adaptation. In the crowd annotation learning setting, the training data of the same domain is annotated by multiple noisy annotators. In the decoupling phase, the model is trained on noisy annotations, and in the aggregation phase, it is trained with combined predictions on the training set. In the cross-domain setting, the model has access to unlabeled training data of the target domain and clean labeled data of multiple source domains. In the decoupling phase, the model is trained on source domains, and in the aggregation phase, the model is trained on combined predictions on the training data of the target domain.

In addition to BLSTM-CRF, our framework can generalize to other encoders, for example, MLP encoder for text classification. In this setting, hidden representations of MLP are transformed by multiplying crowd/consensus matrices and the transformed representation is then used by a classification layer to make predictions.
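The text classification variant can be sketched in the same spirit. The single hidden layer, the ReLU activation, the square shape of the crowd or consensus matrix `A`, and all parameter names are assumptions for illustration.

```python
import numpy as np

def class_logits(x, W_enc, b_enc, A, W_out, b_out):
    """Class scores for one feature vector with a transformed hidden representation.

    x: (n,) input features; W_enc: (n, h); b_enc: (h,);
    A: (h, h) crowd or consensus matrix; W_out: (h, c); b_out: (c,).
    """
    hidden = np.maximum(0.0, x @ W_enc + b_enc)  # MLP encoder (ReLU assumed)
    transformed = hidden @ A                     # source-specific or consensus transform
    return transformed @ W_out + b_out           # linear classification layer
```

Only the matrix `A` changes between sources, so the encoder and the classification layer stay shared, mirroring the sequence labeling setup.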

| Methods | AMTC Precision(%) | AMTC Recall(%) | AMTC F1-score(%) | AMT Precision(%) | AMT Recall(%) | AMT F1-score(%) |
|---|---|---|---|---|---|---|
| CONCAT-SLM | 85.95(1.00) | 57.96(0.26) | 69.23(0.13) | 91.12(0.57) | 55.41(2.66) | 68.89(1.92) |
| MVT-SLM | 84.78(0.66) | 62.50(1.36) | 71.94(0.66) | 86.96(1.22) | 58.07(0.11) | 69.64(0.31) |
| MVS-SLM | 84.76(0.50) | 61.95(0.32) | 71.57(0.04) | 86.95(1.12) | 56.23(0.01) | 68.30(0.33) |
| DS-SLM (Nguyen2017ACL) | 72.30 | 61.17 | 66.27 | - | - | - |
| HMM-SLM (Nguyen2017ACL) | 76.19 | 66.24 | 70.87 | - | - | - |
| MTL-MVT (Wang2018BioNER) | 81.81(2.34) | 62.51(0.28) | 70.87(1.06) | 88.88(0.25) | 65.04(0.80) | 75.10(0.44) |
| MTL-BEA (rahimi2019massively) | 85.72(0.66) | 58.28(0.43) | 69.39(0.52) | 77.56(2.23) | 67.23(0.72) | 72.01(0.85) |
| CRF-MA (Rodrigues2014ML) | - | - | - | 49.40 | 85.60 | 62.60 |
| Crowd-Add (Nguyen2017ACL) | 85.81(1.53) | 62.15(0.18) | 72.09(0.42) | 89.74(0.10) | 64.50(1.48) | 75.03(1.02) |
| Crowd-Cat (Nguyen2017ACL) | 85.02(0.98) | 62.73(1.10) | 72.19(0.37) | 89.72(0.47) | 63.55(1.20) | 74.39(0.98) |
| CL-MW (Rodrigues2018AAAI) | - | - | - | 66.00 | 59.30 | 62.40 |
| ConNet (Ours) | 84.11(0.71) | 68.61(0.03) | 75.57(0.27) | 88.77(0.25) | 72.79(0.04) | 79.99(0.08) |
| Gold (Upper Bound) | 89.48(0.32) | 89.55(0.06) | 89.51(0.21) | 92.12(0.31) | 91.73(0.09) | 91.92(0.21) |

Table 1: Performance on real-world crowd-sourced NER datasets. The best score in each column excepting Gold is marked bold. * indicates number reported by the paper.

5.1 Datasets

Crowd-Annotation Datasets. We use crowd-annotation datasets based on the 2003 CoNLL shared NER task  (sang2003introduction). The real-world datasets, denoted as AMT, are collected by Rodrigues2014ML using Amazon’s Mechanical Turk where F1 scores of annotators against the ground truth vary from 17.60% to 89.11%. Since there is no development set in AMT, we also follow Nguyen2017ACL to use the AMT training set and CoNLL 2003 development and test sets, denoted as AMTC. Overlapping sentences are removed in the training set, which is ignored in that work. Additionally, we construct two sets of simulated datasets to investigate the quality and quantity of annotators. To simulate the behavior of a non-expert annotator, a CRF model is trained on a small subset of training data and generates predictions on the whole set. Because of the limited size of training data, each model would have a bias to certain patterns.

Cross-Domain Datasets. In this setting, we investigate three NLP tasks: POS tagging, NER and text classification. For POS tagging task, we use the GUM portion (zeldes2017gum) of Universal Dependencies (UD) v2.3 corpus with tags and domains: academic, bio, fiction, news, voyage, wiki, and interview. For NER task, we select the English portion of the OntoNotes v5 corpus (hovy2006ontonotes). The corpus is annotated with named entities with data from domains: broadcast conversation (bc), broadcast news (bn), magazine (mz), newswire (nw), pivot text (pt), telephone conversation (tc), and web (web). Multi-Domain Sentiment Dataset (MDS) v2.0 (blitzer2007biographies) is used for text classification, which is built on Amazon reviews from domains: books, dvd, electronics, and kitchen. Since the dataset only contains word frequencies for each review without raw texts, we follow the setting in chen2018multinomial considering 5,000 most frequent words and use the raw counts as the feature vector for each review.

5.2 Experiment Setup

For sequence labeling tasks, we follow liu2018efficient to build the BLSTM-CRF architecture as the base model. The dimension of character-level, word-level embeddings and BLSTM hidden layer are set as , and respectively. For text classification, each review in the MDS dataset is represented as a -d feature vector. We use an MLP with a hidden size of for encoding such features and a linear classification layer for predicting labels. The dropout with a probability of is applied to the non-recurrent connections for regularization. The network parameters are randomly initialized and updated by stochastic gradient descent (SGD). The learning rate is initialized as and decayed by for each epoch. The training process stops early if there is no improvement in continuous epochs and selects the best model on the development set. For the dataset without a development set, we report the performance on the -th epoch. For each experiment, we report the average performance and standard deviation of runs with different random initialization.

5.3 Compared Methods

Figure 3: Visualizations of (a) the expertise of annotators; (b) attention weights for sample sentences. More cases and details are described in Appendix A.1.

We compare our models with multiple baselines, which can be categorized in two groups: wrapper methods and joint models. To demonstrate the theoretical upper bound of performance, we also train the base model using ground-truth annotations in the target domain (Gold).

A wrapper method consists of a label aggregator and a deep learning model. These two components could be combined in two ways: (1) aggregating labels on the crowd-sourced training set and then feeding the generated labels to a Sequence Labeling Model (SLM) (Liu2017EMNLP); (2) feeding multi-source data to a Multi-Task Learning (MTL) (Wang2018BioNER) model and then aggregating the multiple predicted labels. We investigate multiple label aggregation strategies. CONCAT considers all crowd annotations as gold labels. MVT does majority voting on the token level, i.e., the majority of labels is selected as the gold label for each token . MVS is conducted on the sequence level, addressing the problem of violating Begin/In/Out (BIO) rules. DS (Dawid1979EM), HMM (Nguyen2017ACL) and BEA (rahimi2019massively) induce consensus labels with probability models.
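Token-level majority voting (MVT) and the BIO violation it can produce are easy to illustrate. The sketch below is a plain implementation with hypothetical function names; ties are broken in favor of the tag seen first, which is a choice made here and not one stated in the text.

```python
from collections import Counter

def majority_vote_tokens(annotations):
    """Pick the most frequent tag at each token position.

    annotations: list of tag sequences for the same sentence, one per annotator.
    """
    return [Counter(column).most_common(1)[0][0] for column in zip(*annotations)]

def violates_bio(tags):
    """Return True if some I-X tag does not continue a chunk of the same type X."""
    previous = "O"
    for tag in tags:
        if tag.startswith("I-") and previous[2:] != tag[2:]:
            return True
        previous = tag
    return False

# Three annotators disagree on where a person mention starts.
votes = majority_vote_tokens([
    ["B-PER", "O"],
    ["O", "I-PER"],
    ["O", "I-PER"],
])
```

In this example every token gets its majority tag, yet the resulting sequence starts a chunk with `I-PER`, which is the kind of inconsistency that sequence-level voting avoids.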

In contrast with wrapper methods, joint models incorporate multi-source data within the structure of sequential taggers and jointly model all individual annotators. CRF-MA models CRFs with Multiple Annotators by EM algorithm (Rodrigues2014ML). Nguyen2017ACL augments the LSTM architecture with crowd vectors. These crowd components are element-wise added to tags scores (Crowd-Add) or concatenated to the output of hidden layer (Crowd-Cat). These two methods are the most similar to our decoupling phase. We implemented them and got better results than reported. CL-MW applies a crowd layer to a CNN-based deep learning framework (Rodrigues2018AAAI). Tri-Training uses bootstrapping with multi-task Tri-Training approach for unsupervised one-to-one domain adaptation (saito2017asymmetric; ruder2018strong).

5.4 Learning with Crowd Annotations

Performance on real-world datasets. Tab. 1 shows the performance of the aforementioned methods and our ConNet on two real-world datasets, i.e. AMT and AMTC. (We tried our best to re-implement the baseline methods for all datasets, and left the results blank when the re-implementation did not show results consistent with the original papers.) We can see that ConNet significantly outperforms all other methods on both datasets in F1 score, which shows its effectiveness in dealing with noisy annotations. Although CONCAT-SLM achieves the highest precision, it suffers from low recall. Most existing methods share this high-precision but low-recall problem. One possible reason is that they try to find the latent ground truth and throw away illuminating annotator-specific information. As a result, only simple mentions can be classified with great certainty, while difficult mentions fail to be identified without sufficient knowledge. In comparison, ConNet pools information from all annotations and focuses on matching knowledge to make predictions. This enables the tagging model to identify more mentions and achieve a higher recall.

Figure 4: Performance of ConNet variants of decoupling phase (DP) and aggregation phase (AP).

Case study on real-world datasets. It is enlightening to analyze whether the model decides the importance of annotators given a sentence. Fig. 3 visualizes normalized expertise over all annotators, and attention weights in Eq. 7 for sampled sentences containing different entity types. Obviously, the nd sample sentence with ORG has higher attention weights on st, th and rd annotator who are best at labeling ORG. More details and cases are shown in Appendix A.1.

Ablation study on real-world datasets. We also investigate multiple variants of the decoupling phase and aggregation phase on the AMT dataset, shown in Fig. 4. We tried approaches to incorporate source-specific representation in the decoupling phase (DP). CRF means the traditional approach as Eq. 1 while DP(1+2) is for our approach as Eq. 3. DP(1) only applies source representations to the emission score while DP(2) only transforms the transition matrix . We can observe that both variants improve the result. The underlying model keeps more consensus knowledge if we extract annotator-specific bias on sentence encoding and label transition. We also compare methods of generating supervision targets in the aggregation phase (AP). OMV uses majority voting of original annotations, while PMV substitutes them with model predictions learned from DP. AMV extends the model by using all predictions, while AWV uses majority voting weighted by each annotator’s training score. The results show the effectiveness of AWV, which can augment the training data and approximate the ground truth well enough to supervise the attention module in estimating the expertise of each annotator on the current sentence. We can also infer labels on the test set by conducting AWV on predictions of the underlying model with each annotator-specific component. However, this is computationally expensive and yields unsatisfying performance, with a test score of . We can also train a traditional BLSTM-CRF model with the same AMV labels. Its result is , which is lower than that of ConNet and shows the importance of the extracted source-specific components.

Figure 5: Performance on simulated crowd-sourced NER data with (a) annotators with different reliability levels; (b) different numbers of annotators with reliability .

Performance on simulated datasets. To analyze the impact of annotator quality, we split the original training set into folds, and each fold is used to train a CRF model whose reliability can be represented as , because a model with less training data has a stronger bias and generalizes less well. We tried settings where and randomly selected models for each setting. When the reliability level of all annotators is too low, i.e. , only the base model is used for prediction without annotator representations. As shown in Fig. 5(a), ConNet achieves significant improvements over MVT-SLM and performance competitive with Crowd-Cat. Our model remains effective when annotators are less reliable.

Regarding the annotator quantity, we split the train set into subsets () and randomly select models as simulation. Fig. 5(b) shows ConNet is superior to baselines and able to well deal with many annotators while there is no obvious relationship between the performance and annotator quantity in baselines.

| Methods | POS Tagging | NER | Text Classification |
|---|---|---|---|
| CONCAT-SLM | 92.11(0.07) | 61.24(0.92) | 79.41(0.02) |
| MTL-MVT | 90.73(0.29) | 60.44(0.45) | 77.54(0.06) |
| MTL-BEA | 91.71(0.06) | 52.15(0.58) | 78.01(0.28) |
| Crowd-Add | 91.36(0.14) | 39.30(4.44) | 79.30(9.21) |
| Crowd-Cat | 91.94(0.08) | 62.14(0.89) | 79.54(0.25) |
| Tri-Training | 91.93(0.01) | 61.67(0.31) | 80.58(0.02) |
| ConNet (Ours) | 92.33(0.17) | 63.32(0.81) | 81.55(0.04) |
| Gold | 92.88(0.14) | 68.61(0.64) | 83.22(0.19) |

Table 2: Performance on cross-domain adaptation. The average score for all domains is reported for each task. The best score in each column that is significantly () better than the second-best is marked bold, while those that are better but not significantly so are underlined. Detailed results can be found in Appendix A.2.

5.5 Cross-Domain Adaptation Performance

The average performance of each method on each task is shown in Tab. 2. More detailed results can be found in Appendix A.2. We report the accuracy for POS tagging and text classification, and the chunk-level F1 score for NER. We can see that ConNet achieves the highest average score on all tasks. MTL-MVT is similar to our decoupling phase and performs much worse, which shows that naively doing unweighted voting does not work well. The attention can be viewed as implicitly doing weighted voting on the feature level. MTL-BEA relies on a probabilistic model to conduct weighted voting over predictions, but unlike our approach, its voting process is independent of the input context. This is probably why our model achieves higher scores, and it demonstrates the importance of having such a module to assign weights to domains based on the input sentence. We also analyze attention scores generated by the model in Appendix A.3 to show that the attention is meaningful. Tri-Training trained on the concatenated data from all sources also performs worse than ConNet, which suggests the importance of a multi-task structure to model the difference among domains. The performance of Crowd-Add is unstable (high standard deviation) and very low on the NER task, because the size of the crowd vectors is not controllable and thus may be too large. On the other hand, the size of the crowd vectors in Crowd-Cat can be controlled and tuned to improve overall performance and stability. These two methods leverage only domain-invariant knowledge, not domain-specific knowledge, and thus do not achieve performance comparable to our model.
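For readers unfamiliar with chunk-level evaluation of NER, the sketch below shows one way to compute precision, recall and F1 over entity spans for a single sentence. It is an illustration and not the evaluation script used in the experiments; treating a stray `I-X` tag as the start of a new chunk is a choice made here, and a corpus-level score would accumulate the counts over all sentences before dividing.

```python
def extract_chunks(tags):
    """Return the set of (type, start, end) spans in a BIO sequence, end exclusive."""
    chunks, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing chunk
        continues = tag.startswith("I-") and kind == tag[2:]
        if not continues:
            if kind is not None:
                chunks.add((kind, start, i))      # close the chunk that just ended
            if tag == "O":
                start, kind = None, None
            else:
                start, kind = i, tag[2:]          # B-X, or a stray I-X opening a chunk
    return chunks

def chunk_f1(gold, pred):
    """Precision, recall and F1 over exactly matching spans for one sentence."""
    gold_chunks, pred_chunks = extract_chunks(gold), extract_chunks(pred)
    correct = len(gold_chunks & pred_chunks)
    precision = correct / len(pred_chunks) if pred_chunks else 0.0
    recall = correct / len(gold_chunks) if gold_chunks else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1
```

A predicted span counts as correct only if both its boundaries and its type match the gold span, so a partially recognized mention lowers precision and recall at the same time.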

6 Conclusion

In this paper, we present ConNet for learning a sequence tagger from multi-source supervision. It could be applied in two practical scenarios: learning with crowd annotations and cross-domain adaptation. In contrast to prior works, ConNet learns fine-grained representations of each source which are further dynamically aggregated for every unseen sentence in the target data. Experiments show that our model is superior to previous crowd-sourcing and unsupervised domain adaptation sequence labeling models. The proposed learning framework also shows promising results on other NLP tasks like text classification.


Appendix A

A.1 Case study on learning with crowd annotations

Figure 6: Visualizations of (a) the expertise of annotators; (b) attention weights for additional sample sentences to Fig. 3. Details of samples are described in Tab. 3.
1 Defender [PER  Hassan Abbas] rose to intercept a long ball into the area in the 84th minute but only managed to divert it into the top corner of [PER  Bitar] ’s goal .
2 [ORG  Plymouth] 4 [ORG  Exeter] 1
3 Hosts [LOC  UAE] play [LOC  Kuwait] and [LOC  South Korea] take on [LOC  Indonesia] on Saturday in Group A matches .
4 The former [MISC  Soviet] republic was playing in an [MISC  Asian Cup] finals tie for the first time .
5 [PER  Bitar] pulled off fine saves whenever they did .
6 [PER  Coste] said he had approached the player two months ago about a comeback .
7 [ORG  Goias] 1 [ORG  Gremio] 3
8 [ORG  Portuguesa] 1 [ORG  Atletico Mineiro] 0
9 [LOC  Melbourne] 1996-12-06
10 On Friday for their friendly against [LOC  Scotland] at [LOC  Murrayfield] more than a year after the 30-year-old wing announced he was retiring following differences over selection .
11 Scoreboard in the [MISC  World Series]
12 Cricket - [MISC  Sheffield Shield] score .
13 “ He ended the [MISC  World Cup] on the wrong note , ” [PER  Coste] said .
14 Soccer - [ORG  Leeds][PER  Bowyer] fined for part in fast-food fracas .
15 [ORG  Rugby Union] - [PER  Cuttitta] back for [LOC  Italy] after a year .
16 [LOC  Australia] gave [PER  Brian Lara] another reason to be miserable when they beat [LOC  West Indies] by five wickets in the opening [MISC  World Series] limited overs match on Friday .
Table 3: Sample instances in Fig. 3 and Fig. 6 with NER annotations including PER (red), ORG (blue), LOC (violet) and MISC (orange).

To better understand the effect and benefit of ConNet, we conduct a case study on the real-world AMTC dataset and its crowd annotators. In Fig. 6 and Tab. 3, we examine additional instances to investigate the ability of the attention module to find the right annotators. Sentences 1-12 each contain a single entity type, while sentences 13-16 contain multiple entity types. Comparing the attention weights with the expertise of the annotators, we can see that the attention module gives more weight to annotators who perform competitively on, and show a preference for, the entity types present in the sentence. Although the top annotators selected for ORG have lower expertise on ORG than on PER and LOC, they are in fact the three annotators with the highest expertise on ORG.

Figure 7: Heatmap of averaged attention scores from each source domain to each target domain.

A.2 Detailed results for Cross-Domain Adaptation

In addition to cross-domain adaptation, we evaluate our model on cross-lingual adaptation as well. We use the Wikiann corpus (pan2017cross) for cross-lingual NER. The dataset contains text annotated with three entity types (person, location, and organization) for 282 languages. For simplicity, we randomly chose five languages from the corpus: no (Norwegian), et (Estonian), es (Spanish), sv (Swedish), and en (English).

The results of each task on each domain/language are shown in Tab. 4. With the exception of the multi-lingual task, ConNet performs best on most domains and achieves the highest average score on every task. The results on cross-lingual NER suggest that our intuition does not transfer well to this kind of problem. MTL-BEA (rahimi2019massively) was proposed for multi-lingual problems and indeed works well in this setting.

A.3 Multi-domain: Analysis

We analyzed the attention scores generated by the attention module on the OntoNotes dataset. For each sentence in the target domain, we collected the attention score of each source domain, and then averaged the scores for each source-target pair. Fig. 7 shows all the source-to-target average attention scores. We can see that some domains contribute to other related domains. For example, bn (broadcast news) and nw (newswire) are both about news and contribute to each other; bn and bc (broadcast conversation) are both broadcast, and bn contributes to bc; bn and nw both contribute to mz (magazine), probably because all three are about news; wb (web) and tc (telephone conversation) make almost no positive contribution to any other domain, which is reasonable because they are informal texts compared to the others and are not necessarily related to them. Overall, the attention scores are interpretable, which suggests that the attention is aware of the relations between different domains and contributes to the model.
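The averaging procedure described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used to produce Fig. 7: the function name `average_attention`, the record layout, and the toy scores are hypothetical.

```python
from collections import defaultdict


def average_attention(records):
    """Average per-sentence attention scores for each (source, target) pair.

    `records` is an iterable of (target_domain, scores) tuples, one per
    target-domain sentence, where `scores` maps each source domain to the
    attention score it received for that sentence (assumed layout).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for target, scores in records:
        for source, score in scores.items():
            sums[(source, target)] += score
            counts[(source, target)] += 1
    return {pair: sums[pair] / counts[pair] for pair in sums}


# Toy, made-up scores for two sentences in target domain "bc"
# and one sentence in target domain "mz".
toy_records = [
    ("bc", {"bn": 0.6, "nw": 0.4}),
    ("bc", {"bn": 0.8, "nw": 0.2}),
    ("mz", {"bn": 0.5, "nw": 0.5}),
]

heatmap = average_attention(toy_records)
for (source, target), value in sorted(heatmap.items()):
    print(f"{source} -> {target}: {value:.2f}")
```

Each entry of the resulting dictionary corresponds to one cell of a source-by-target heatmap such as the one in Fig. 7.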

Task & Corpus Multi-Domain POS Tagging: Universal Dependencies v2.3 - GUM
Target Domain academic bio fiction news voyage wiki interview AVG (%)
CONCAT 92.68 92.12 93.05 90.79 92.38 92.32 91.44 92.11(0.07)
MTL-MVT (Wang2018BioNER) 92.42 90.59 91.16 89.69 90.75 90.29 90.21 90.73(0.29)
MTL-BEA (rahimi2019massively) 92.87 91.88 91.90 91.03 91.67 91.31 91.29 91.71(0.06)
Crowd-Add (Nguyen2017ACL) 92.58 91.91 91.50 90.73 91.74 90.47 90.61 91.36(0.14)
Crowd-Cat (Nguyen2017ACL) 92.71 91.71 92.48 91.15 92.35 91.97 91.22 91.94(0.08)
Tri-Training (ruder2018strong) 92.84 92.15 92.51 91.40 92.35 91.29 91.00 91.93(0.01)
ConNet 92.97 92.25 93.15 91.06 92.52 92.74 91.66 92.33(0.17)
Gold (Upper Bound) 92.64 93.10 93.15 91.33 93.09 94.67 92.20 92.88(0.14)
Task & Corpus Multi-Domain NER: OntoNotes v5.0 - English
Target Domain nw wb bn tc bc mz AVG (%)
CONCAT 68.23 32.96 77.25 53.66 72.74 62.61 61.24(0.92)
MTL-MVT (Wang2018BioNER) 65.74 33.25 76.80 53.16 69.77 63.91 60.44(0.45)
MTL-BEA (rahimi2019massively) 58.33 32.62 72.47 47.83 48.99 52.68 52.15(0.58)
Crowd-Add (Nguyen2017ACL) 45.76 32.51 50.01 26.47 52.94 28.12 39.30(4.44)
Crowd-Cat (Nguyen2017ACL) 68.95 32.61 78.07 53.41 74.22 65.55 62.14(0.89)
Tri-Training (ruder2018strong) 69.68 33.41 79.62 47.91 70.85 68.53 61.67(0.31)
ConNet 71.31 34.06 79.66 52.72 71.47 70.71 63.32(0.81)
Gold (Upper Bound) 84.70 46.98 83.77 52.57 73.05 70.58 68.61(0.64)
Task & Corpus Multi-Domain Text Classification: MDS
Target Domain books dvd electronics kitchen AVG (%)
CONCAT 75.68 77.02 81.87 83.07 79.41(0.02)
MTL-MVT (Wang2018BioNER) 74.92 74.43 79.33 81.47 77.54(0.06)
MTL-BEA (rahimi2019massively) 74.88 74.60 79.73 82.82 78.01(0.28)
Crowd-Add (Nguyen2017ACL) 75.72 77.35 81.25 82.90 79.30(9.21)
Crowd-Cat (Nguyen2017ACL) 76.45 77.37 81.22 83.12 79.54(0.25)
Tri-Training (ruder2018strong) 77.58 78.45 81.95 83.17 80.29(0.02)
ConNet 78.75 81.06 84.12 83.45 81.85(0.04)
Gold (Upper Bound) 78.78 82.11 86.21 85.76 83.22(0.19)
Task & Corpus Multi-Lingual NER: Wikiann
Target Lang no et es sv en AVG (%)
CONCAT 47.17 34.01 53.27 60.06 42.13 47.33(2.49)
MTL-MVT (Wang2018BioNER) 37.96 37.28 48.69 53.80 40.07 43.56(1.23)
MTL-BEA (rahimi2019massively) 49.60 36.90 52.21 63.40 43.05 49.03(0.74)
Crowd-Add (Nguyen2017ACL) 35.00 21.39 32.55 44.51 29.36 32.56(9.21)
Crowd-Cat (Nguyen2017ACL) 49.17 35.52 50.95 62.23 42.78 48.13(0.47)
Tri-Training (ruder2018strong) 49.39 36.21 51.58 62.82 42.92 48.58(0.25)
ConNet 48.53 35.61 50.78 63.03 43.04 48.20(0.55)
Gold (Upper Bound) 56.39 42.71 56.58 71.32 48.42 55.08(0.74)
Table 4: Performance on cross-domain and cross-lingual adaptation. The best score (excluding Gold) in each column is marked in bold if it is significantly better than the second best, and underlined if it is better but not significantly so. Values in parentheses in the AVG column are standard deviations.