Adapting to the Long Tail: A Meta-Analysis of Transfer Learning Research for Language Understanding Tasks

11/02/2021
by   Aakanksha Naik, et al.
Carnegie Mellon University

Natural language understanding (NLU) has made massive progress driven by large benchmarks, paired with research on transfer learning to broaden its impact. Benchmarks are dominated by a small set of frequent phenomena, leaving a long tail of infrequent phenomena underrepresented. In this work, we reflect on the question: have transfer learning methods sufficiently addressed performance of benchmark-trained models on the long tail? Since benchmarks do not list included/excluded phenomena, we conceptualize the long tail using macro-level dimensions such as underrepresented genres, topics, etc. We assess trends in transfer learning research through a qualitative meta-analysis of 100 representative papers on transfer learning for NLU. Our analysis asks three questions: (i) Which long tail dimensions do transfer learning studies target? (ii) Which properties help adaptation methods improve performance on the long tail? (iii) Which methodological gaps have greatest negative impact on long tail performance? Our answers to these questions highlight major avenues for future research in transfer learning for the long tail. Lastly, we present a case study comparing the performance of various adaptation methods on clinical narratives to show how systematically conducted meta-experiments can provide insights that enable us to make progress along these future avenues.


1 Introduction

“There is a growing consensus that significant, rapid progress can be made in both text understanding and spoken language understanding by investigating those phenomena that occur most centrally in naturally occurring unconstrained materials and by attempting to automatically extract information about language from very large corpora.” Marcus et al. (1993)

Since the creation of the Penn Treebank, using shared benchmarks to measure and drive progress in model development has been instrumental for the accumulation of knowledge in natural language processing, and has become a dominant practice. Ideally, we would like shared benchmark corpora to be diverse and comprehensive, which can be addressed at two levels: (i) macro-level dimensions such as language, genre, topic, etc., and (ii) micro-level dimensions such as specific language phenomena. However, diversity and comprehensiveness are not straightforward to achieve.

According to Zipf’s law, many micro-level language phenomena naturally occur infrequently and will be relegated to the long tail, except in cases of intentional over-sampling. Moreover, the advantages of restricting community focus to a specific set of benchmark corpora and limitations in resources lead to portions of the macro-level space being under-explored, which can further cause certain micro-level phenomena to be under-represented. For example, since most popular coreference benchmarks focus on English narratives, they do not contain many instances of zero anaphora, a phenomenon quite common in other languages (e.g., Japanese, Chinese). In such situations, model performance on benchmark corpora may not be truly reflective of expected performance on micro-level long tail phenomena, raising questions about the ability of state-of-the-art models to generalize to the long tail.

Most benchmarks do not explicitly catalogue which micro-level language phenomena are included or excluded in the sample, which makes it non-trivial to construct a list of long tail micro-level language phenomena. Hence, we formalize an alternate conceptualization of the long tail: undersampled portions of the macro-level space that can be treated as proxies for long tail micro-level phenomena. These undersampled long tail macro-level dimensions highlight gaps and present potential new challenging directions for the field. Therefore, periodically taking stock of research to identify long tail macro-level dimensions can help highlight opportunities for progress that have not yet been tackled. This idea has been gaining prominence recently; for example, Joshi et al. (2020) survey the languages studied by NLP papers, providing statistical support for the existence of a macro-level long tail of low-resource languages.

Figure 1: Distribution of meta-analysis sample papers across years

In this work, our goal is to characterize the macro-level long tail in NLU and the efforts from transfer learning research that have tried to address it. Large benchmarks have driven much of the recent methodological progress on NLU Bowman et al. (2015); Rajpurkar et al. (2016); McCann et al. (2018); Talmor et al. (2019); Wang et al. (2019b, a), but the generalization abilities of benchmark-trained models to the long tail have remained unclear. In tandem, the NLP community has been successfully developing transfer learning methods to improve the generalization of models trained on NLU benchmarks Ruder et al. (2019). Tackling the macro-level long tail in NLU is a central goal of transfer learning research, which leads to the question: how far has transfer learning addressed the performance of benchmark-trained models on the NLU long tail, and where do we still fall behind?

Probing further, we perform a qualitative meta-analysis of a representative sample of 100 papers on domain adaptation and transfer learning in NLU. We sample these papers based on citation counts and publication venues (§2.1), and document 7 facets for each paper, such as the tasks and domains studied, adaptation settings evaluated, etc. (§2.2). Adaptation methods proposed (or applied) are documented using a hierarchical categorization described in §2.3, which we develop by extending the hierarchy from Ramponi and Plank (2020). With this information, our analysis focuses on three questions:

  • Q1: What long tail macro-level dimensions do transfer learning studies target? Here dimensions include tasks, domains, languages and adaptation settings covered in transfer learning research.

  • Q2: Which properties help adaptation methods improve performance on long tail dimensions?

  • Q3: Which methodological gaps have greatest negative impact on long tail performance?

The rest of the paper presents thorough answers to these questions, laying out avenues for future research on transfer learning that more effectively addresses the macro-level long tail in NLU. We also present a case study to demonstrate that our meta-analysis framework can be used to systematically design and conduct experiments that provide insights that enable us to make progress along these avenues.

Figure 2: PRISMA diagram explaining our sample curation process

2 Meta-Analysis Framework

2.1 Sample Curation

For our meta-analysis, we gather a representative sample of work on domain adaptation or transfer learning in NLU from the December 2020 dump of the Semantic Scholar Open Research Corpus (S2ORC) Lo et al. (2020). First, we extract all abstracts published at 9 prestigious *CL venues: ACL, EMNLP, NAACL, EACL, COLING, CoNLL, SemEval, TACL, and CL. This results in 25,141 abstracts, which are filtered to retain those containing the terms “domain adaptation” or “transfer learning” in the title or abstract (the search scope is limited to the title and abstract in order to prefer papers that focus on transfer learning over ones that include only a brief discussion or experiment on transfer learning as part of another investigation), producing a set of 382 abstracts after duplicate removal (the distribution of abstracts across venues is given in the appendix).
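To make the curation step concrete, the snippet below sketches the venue and keyphrase filter described above. It is an illustrative reconstruction, not the authors' actual script; the record fields ("title", "abstract", "venue") are assumed names for the corresponding S2ORC metadata after loading, not the exact schema.

```python
# Hedged sketch of the abstract-filtering step from §2.1.
TARGET_VENUES = {"ACL", "EMNLP", "NAACL", "EACL", "COLING",
                 "CoNLL", "SemEval", "TACL", "CL"}
KEYPHRASES = ("domain adaptation", "transfer learning")

def keep_abstract(record: dict) -> bool:
    """Keep a paper if it appeared at a target venue and mentions a keyphrase
    in its title or abstract (mirroring the curation criteria in §2.1)."""
    if record.get("venue") not in TARGET_VENUES:
        return False
    text = (record.get("title", "") + " " + record.get("abstract", "")).lower()
    return any(phrase in text for phrase in KEYPHRASES)
```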

Cat Tasks Included
TC Text classification tasks like sentiment analysis, hate speech detection, propaganda detection, etc.
NER Semantic sequence labeling tasks like NER, event extraction, etc.
POS Syntactic sequence labeling tasks like POS tagging, chunking, etc.
NLI Natural language inference, NLU Tasks recast as NLI (e.g., GLUE)
SP Structured prediction tasks such as entity and event coreference
WSD Word sense disambiguation
TRN Text ranking tasks (e.g., search)
TRG Text regression tasks
RC Reading comprehension
MF Matrix factorization
LI Lexicon induction
SLU Spoken language understanding
Table 1: Categorization of tasks studied

We manually screen this subset and remove abstracts that are not eligible for our NLU-focused analysis (e.g., papers on generation-focused tasks like machine translation), leaving us with a set of 266 abstracts. From this, we construct a final meta-analysis sample of 100 abstracts by applying two inclusion criteria. Per the first criterion, all abstracts with 100 or more citations are included since they are likely to describe landmark advances. The remaining abstracts (to bring our meta-analysis sample to 100) are then chosen randomly, after discarding ones with no citations (the randomly sampled set has a mean citation count of 28.4, according to citation data from Semantic Scholar). The random sampling criterion ensures that we do not neglect work on less mainstream topics by focusing solely on highly-cited papers. This produces a final representative sample of transfer learning work for our meta-analysis. Figure 2 gives an overview of our sample curation process via a PRISMA diagram Page et al. (2021), while figure 1 shows the year-wise distribution of the sample. A complete list of all papers included in our sample is provided in appendix B.

Figure 3: Categorization of adaptation methods proposed, extended or used in all studies

2.2 Meta-Analysis Facets

For every paper from our meta-analysis sample, we document the following key facets:
Task(s): NLP task(s) studied in the work. Tasks are grouped into 12 categories based on task formalization and linguistic level (e.g., lexical, syntactic, etc.), as shown in table 1.
Domain(s): Source and target domains and/or languages studied, along with datasets used for each.
Task Model: Base model used for the task, to which domain adaptation algorithms are applied.
Adaptation Method(s): Domain adaptation method(s) proposed or used in the work. Adaptation methods are grouped according to the categorization shown in figure 3 (details in §2.3).
Adaptation Baseline(s): Baseline domain adaptation method(s) to compare new methods against.
Adaptation Settings: Source-target transfer settings explored in the work (e.g., unsupervised adaptation, multi-source adaptation, etc.).
Result Summary: Performance improvements (if any), performance differences across multiple source-target pairs or methods, etc.

2.3 Adaptation Method Categorization

For adaptation methods proposed or used in each study, we assign type labels according to the categorization presented in figure 3. This categorization is an extension of the one proposed by Ramponi and Plank (2020); since our meta-analysis is not limited to neural unsupervised domain adaptation, we extend their categorization with additional classes. Broadly, methods are divided into three coarse categories: (i) model-centric, (ii) data-centric, and (iii) hybrid approaches. Model-centric approaches perform adaptation by modifying the structure of the model, which may include editing the feature representation, loss function or parameters. Data-centric approaches perform adaptation by modifying or leveraging labeled/unlabeled data from the source and target domains to bridge the domain gap. Finally, hybrid approaches are ones that cannot be clearly classified as model-centric or data-centric. Each coarse category is divided into fine subcategories.

Category Example Methods Example Studies
Feat Aug (FA) Structural correspondence learning, Frustratingly easy domain adaptation Blitzer et al. (2006); Daumé III (2007)
Feat Gen (FG) Marginalized stacked denoising autoencoders, Deep belief networks Jochim and Schütze (2014); Ji et al. (2015); Yang et al. (2015)
Loss Aug (LA) Multi-task learning, Adversarial learning, Regularization-based methods Zhang et al. (2017); Liu et al. (2019); Chen et al. (2020)
Init (PI) Prior estimation, Parameter matrix initialization Chan and Ng (2006); Al Boni et al. (2015)
Add (PA) Adapter networks Lin and Lu (2018)
Freeze (FR) Embedding freezing, Layerwise freezing Yin et al. (2015); Tourille et al. (2017)
Ensemble (EN) Mixture of experts, Weighted averaging McClosky et al. (2010); Nguyen et al. (2014)
Instance Weighting (IW) Classifier-based weighting Jiang and Zhai (2007); Jeong et al. (2009)
Data Selection (DS) Confidence-based sample selection Scheible and Schütze (2013); Braud and Denis (2014)
Pseudo-Labeling (PL) Semi-supervised learning, Self-training Umansky-Pesin et al. (2010); Lison et al. (2020)
Noising/Denoising (NO) Token dropout Pilán et al. (2016)
Active Learning (AL) Sample selection via active learning Rai et al. (2010); Wu et al. (2017)
Pretraining (PT) Language model pretraining, Supervised pretraining Conneau et al. (2017); Howard and Ruder (2018)
Instance Learning (IL) Nearest neighbor learning Gong et al. (2016)
Table 2: Examples of methods from each category, and papers studying these methods. These lists are non-exhaustive. In the interest of replicability, the coding for all papers will be released on publication.

Model-centric approaches are divided into four categories, based on which portion of the model they modify: (i) feature-centric, (ii) loss-centric, (iii) parameter-centric, and (iv) ensemble. Feature-centric approaches are further divided into two fine subcategories: (i) feature augmentation, and (ii) feature generalization. Feature augmentation includes techniques that learn an alignment between source and target feature spaces using shared features called pivots Blitzer et al. (2006). Feature generalization includes methods that learn a joint representation space using autoencoders, motivated by Glorot et al. (2011) and Chen et al. (2012). Loss-centric approaches contain one fine subcategory: loss augmentation. This includes techniques which augment the task loss with an adversarial loss Ganin and Lempitsky (2015); Ganin et al. (2016), a multi-task loss Liu et al. (2019) or regularization terms. Parameter-centric approaches include three fine subcategories: (i) parameter initialization, (ii) new parameter addition, and (iii) parameter freezing. Finally, ensemble, used in settings with multiple source domains, includes techniques that learn to combine the predictions of multiple models trained on source and target domains.

Data-centric approaches are divided into five fine subcategories. Pseudo-labeling approaches train classifiers that then produce “gold” labels for unlabeled target data. This includes semi-supervised learning methods such as bootstrapping, co-training, self-training, etc. (e.g., McClosky et al. (2006)). Active learning approaches use a human-in-the-loop setting to annotate a select subset of target data that the model can learn most from Settles (2009). Instance learning approaches leverage neighborhood structure in joint source-target feature spaces to make predictions on target data (e.g., nearest neighbor learning). Noising/denoising approaches include data corruption or pre-processing steps which increase surface similarity between source and target examples. Finally, pretraining includes approaches that train large-scale language models on unlabeled data to learn better source and target representations, a strategy that has gained popularity in recent years Gururangan et al. (2020).

Hybrid approaches contain two fine subcategories that cannot be classified as purely model-centric or data-centric: they involve manipulation of the data distribution, but can also be viewed as loss-centric approaches that edit the training loss. Instance weighting approaches assign weights to source examples based on their similarity to the target domain. Conversely, data selection approaches filter the training data based on source-target similarity. Table 2 lists example adaptation methods for each fine category and example studies from our meta-analysis subset that use these methods.
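Since figure 3 is described here only in prose, the sketch below restates the hierarchy of §2.3 as a simple data structure (abbreviations follow table 2). It is a convenience summary for the reader, not an artifact released with the paper.

```python
# The adaptation-method hierarchy of §2.3 as a nested dictionary.
ADAPTATION_TAXONOMY = {
    "model-centric": {
        "feature-centric": ["FA (feature augmentation)", "FG (feature generalization)"],
        "loss-centric": ["LA (loss augmentation)"],
        "parameter-centric": ["PI (initialization)", "PA (parameter addition)", "FR (freezing)"],
        "ensemble": ["EN (ensemble)"],
    },
    "data-centric": ["PL (pseudo-labeling)", "AL (active learning)", "IL (instance learning)",
                     "NO (noising/denoising)", "PT (pretraining)"],
    "hybrid": ["IW (instance weighting)", "DS (data selection)"],
}
```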

3 Which Long Tail Macro-Level Dimensions Do Transfer Learning Studies Target?

The first goal of our meta-analysis is to document the long tail macro-level dimensions that transfer learning studies have tested their methods on. We look at the distributions of tasks, domains, languages and adaptation settings studied across all papers in our sample. 10 studies are surveys, position papers or meta-experiments, and are therefore excluded from these statistics. Studies can cover multiple tasks, domains, languages or settings, so counts may sum to more than 90.

Figure 4: Distribution of papers according to tasks studied

Task distribution: Figure 4 gives a brief overview of the distribution of tasks studied across papers. Text classification tasks clearly dominate, followed by semantic and syntactic tagging. Text classification covers a variety of tasks, but sentiment analysis is the most well-studied, with research driven by the multi-domain sentiment detection (MDSD) dataset Blitzer et al. (2007). Conversely, structured prediction is under-studied, despite covering a variety of tasks such as coreference resolution, syntactic parsing, dependency parsing, semantic parsing, etc. This indicates that tasks with complex formulations/objectives are under-explored. We speculate that there may be two reasons for this: (i) difficulty of collecting annotated data in multiple domains/languages for such tasks, and (ii) shift in output structures (e.g., different named entity types in source and target domains) making adaptation harder.

Figure 5: Distribution of multi-lingual studies according to languages included

Languages studied: Despite a focus on generalization, studies in our sample rarely evaluate on languages other than English. As stated by Bender (2011), this is problematic because the ability to apply a technique to other languages does not necessarily guarantee comparable performance. Some studies do cover multi-lingual evaluation or focus on cross-linguality. Figure 5 shows the distribution of languages included in these studies, which cover only a limited subset of languages. For a more comprehensive discussion of linguistic diversity in NLP research, not limited to transfer learning, we refer interested readers to Joshi et al. (2020).

Figure 6: Distribution of papers according to adaptation settings studied
HE #P NN #P
Clinical 10 Twitter 12
Biomedical 9 Conversations 10
Science 3 Forums 8
Finance 3 Emails 6
Literature 3
DefSec 1
Table 3: Counts of papers (#P) studying high-expertise (HE) and non-narrative (NN) domains

Domains studied: Many popular transfer benchmarks Blitzer et al. (2007); Wang et al. (2019b, a) are homogeneous. They focus on expository English, drawn from plentiful sources such as news articles, reviews, blogs, essays and Wikipedia. This sidelines some categories of domains (“domain” is an overloaded term covering genres, styles, registers, etc., but we use it for consistency with prior work) that fall into the long tail: (i) non-expository text (e.g., social media, conversations, etc.), and (ii) texts from high-expertise domains that use specialized vocabulary and knowledge (e.g., clinical text). Table 3 shows the number of papers focusing on high-expertise and non-expository domains, highlighting the lack of focus on these areas.

Adaptation settings studied: Most studies evaluate methods in a supervised adaptation setting, i.e., labeled data is available from both the source and target domains. This assumption may not always hold: often, adaptation must be performed in harder settings such as unsupervised adaptation (no labeled data from the target domain), adaptation from multiple source domains, online adaptation, etc. Figure 6 shows the distribution of unconventional adaptation settings across papers, indicating that these settings are understudied in the literature.

Figure 7: Distribution of transfer learning studies according to coarse method categories

Open Issues: There is clearly much ground to cover in testing adaptation methods on the long tail. Two research directions may be key to achieving this: (i) development of and evaluation on diverse benchmarks, and (ii) incentivizing publication of research on long tail domains at NLP venues. Diverse benchmark development has gained momentum, with the creation of benchmarks such as BLUE Peng et al. (2019) and BLURB Gu et al. (2020) for biomedical and clinical NLP, XTREME Hu et al. (2020) for cross-lingual NLP and GLUECoS Khanuja et al. (2020) for code-switched NLP. However, newly proposed adaptation methods are often not evaluated on them, even though such evaluation is imperative to test their limitations and generalization abilities. On the other hand, application-specific or domain-specific evaluations of adaptation methods are sidelined at NLP venues and may be viewed as limited in terms of bringing broader insights. But applied research can unearth significant opportunities for advances in transfer learning, and should be viewed from a translational perspective Newman-Griffis et al. (2021). For example, source-free domain adaptation, in which only a trained source model is available with no access to source data Liang et al. (2020), was conceptualized partly due to data sharing restrictions on Twitter and clinical data. Though this issue is limited to certain domains, source-free adaptation may be of broader interest since it has implications for reducing models’ reliance on large amounts of data. Therefore, encouraging closer ties with applied transfer learning research can help us gain more insight into the limitations of existing techniques on the long tail.

Figure 8: Distribution of transfer learning studies according to fine method categories
(a) High-expertise domains
(b) Non-narrative domains
(c) Both domain types
Figure 9: Distribution of fine method categories from studies evaluating on long tail domains
Study Method Performance
Tsuboi et al. (2008) Conditional random field (CRF) model trained on partially annotated sequences of OOV tokens (LA) Positive transfer from conversations to medical manuals
Arnold et al. (2008) Manually constructed feature hierarchy across domains, allowing back off to more general features (FA) Positive transfer from 5 corpora (biomedical, news, email) to email
McClosky et al. (2010) Mixture of domain-specific models chosen via source-target similarity features (e.g., cosine similarity) (EN) Positive transfer to biomedical, literature and conversation domains
Yang and Eisenstein (2015) Dense embeddings induced from template features and manually defined domain attribute embeddings (FA) Positive transfer to 4/5 web domains and 10/11 literary periods
Hangya et al. (2018) Monolingual joint training on generic+domain text, then cross-lingual projection (PT+FA), using cycle consistency loss Haeusser et al. (2017) (LA) Positive transfer to medical and Twitter data using both methods
Rodriguez et al. (2018) Training source domain classifier and using its predictions as target classifier inputs (FA), initializing target classifier with source classifier weights (PI) No clear winner across medical data, security and defense reports, conversations, Twitter
Xing et al. (2018) Multi-task learning method with source-target distance minimization as additional loss term (LA) Positive transfer on 4/6 intra-medical settings (EHRs, forums) and 5/9 narrative to medical settings
Wang et al. (2018) Source-target distance minimized using two loss penalties (LA) Positive transfer to medical and Twitter data
Kamath et al. (2019) Adversarial domain adaptation with additional domain-specific feature space (LA) Positive transfer to web forums and financial text
Lison et al. (2020) Weakly supervised data creation by aggregating labels from rule-based or trained labeling functions (PL) Positive transfer to financial text and Twitter
Table 4: Model and performance details for studies testing on high-expertise and non-narrative domains

4 Which Properties Help Adaptation Methods Improve Performance On Long Tail Dimensions?

The second goal of our meta-analysis is to identify which categories of adaptation methods have been tested extensively and have exhibited good performance on various long tail macro-level dimensions. Figures 7 and 8 provide an overview of the categories of methods tested across all papers in our subset. We can see that studies overwhelmingly develop or use model-centric methods. Within this coarse category, feature augmentation (FA) and loss augmentation (LA) are the top two categories, followed by pretraining (PT), which is data-centric. Parameter initialization (PI) and pseudo-labeling (PL) round out the top five. Feature augmentation being the most explored category is no surprise, given that much of the pioneering early domain adaptation work in NLP Blitzer et al. (2006, 2007); Daumé III (2007) developed methods to learn shared feature spaces between source and target domains. Loss augmentation methods have gained prominence recently, with multi-task learning providing large improvements Liu et al. (2015, 2019). Pretraining methods, both unsupervised Howard and Ruder (2018) and supervised Conneau et al. (2017), have also gained popularity, with large pretrained language models like ELMo Peters et al. (2018) and BERT Devlin et al. (2019) achieving huge gains across a variety of tasks.

To specifically identify techniques that work on long tail domains, we look at the categories of methods evaluated on high-expertise domains or non-narrative domains (or both). Figures 9(a), 9(b) and 9(c) present the distributions of fine method categories tested on high-expertise domains, non-narrative domains and both domain types respectively. While feature augmentation techniques remain the most explored category for high-expertise domains, we see a change in trend for non-narrative domains, where loss augmentation and pretraining are the more commonly explored categories. The difference in dominant method categories can be partly attributed to the easy availability of large-scale unlabeled data and weak signals (e.g., likes, shares, etc.), particularly for social media. Such user-generated content (called “fortuitous data” by Plank (2016)) is leveraged well by pretraining or multi-task learning techniques, making them popular choices for non-narrative domains. In contrast, high-expertise domains (e.g., literature, security and defense reports, finance, etc.) often lack fortuitous data, and methods developed for them focus on learning shared feature spaces. 10 studies in our meta-analysis sample evaluate on both domain types. Table 4 gives a brief description of the methods explored in these studies and their performance (more details are provided in appendix C). From these studies, we identify two interesting properties that seem to improve adaptation performance but remain relatively under-explored in the context of recent methods such as pretraining:

  • Incorporating source-target distance: Several methods explicitly incorporate the distance between source and target domains (e.g., Xing et al. (2018); Wang et al. (2018)). Aside from allowing flexible adaptation based on the specific domain pair being considered, adding source-target distance provides two benefits. It offers an additional avenue to analyze generalizability by monitoring source-target distance during adaptation. It also allows performance to be estimated in advance using source-target distance, which can be helpful when choosing an adaptation technique for a new target domain. Kashyap et al. (2020) provide a comprehensive overview of source-target distance metrics and discuss their utility in analysis and performance prediction. Despite these benefits, very little work has tried to incorporate source-target distance into newer adaptation methods such as pretraining (a minimal sketch of one such distance measure follows this list).

  • Incorporating nuanced domain variation: Although NLP typically treats domain variation as a dichotomy (source vs. target), domains vary from each other along a multitude of dimensions (e.g., topic, genre, medium or purpose of communication) Plank (2016). Some methods acknowledge this nuanced view and treat domain variation as multi-dimensional, either in a discrete feature space Arnold et al. (2008) or in a continuous embedding space Yang and Eisenstein (2015). This allows knowledge sharing across dimensions common to both source and target, improving transfer performance. This idea has also remained under-explored, though recent work such as the development of domain expert mixture (DEMix) layers Gururangan et al. (2021) has attempted to incorporate nuanced domain variation into pretraining.
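As a concrete illustration of the first property above, the sketch below computes one simple source-target distance from encoder representations: cosine distance between the two domain centroids. It is a minimal, assumed example; metrics surveyed by Kashyap et al. (2020), such as proxy A-distance or MMD, could be substituted.

```python
import numpy as np

def domain_centroid_distance(src_feats: np.ndarray, tgt_feats: np.ndarray) -> float:
    """Cosine distance between the mean feature vectors of two domains.

    src_feats, tgt_feats: (num_examples, hidden_dim) arrays, e.g. mean-pooled
    encoder representations of unlabeled source and target text.
    """
    src_mu, tgt_mu = src_feats.mean(axis=0), tgt_feats.mean(axis=0)
    cos = np.dot(src_mu, tgt_mu) / (np.linalg.norm(src_mu) * np.linalg.norm(tgt_mu) + 1e-12)
    return 1.0 - float(cos)

# Monitored during adaptation, such a score gives a rough signal of whether the
# two domains are converging; computed up front, it can inform the choice of
# adaptation technique for a new target domain.
```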

Open Issues: Interestingly, many studies from our sample do not analyze failures, i.e., source-target pairs on which adaptation methods do not improve performance. For some studies in table 4, adaptation methods do not improve performance on all source-target pairs. But these failures are not investigated, raising the question: do we know the blind spots of current adaptation methods? Answering this is essential to develop a complete picture of the generalization capabilities of adaptation methods. Studies that present negative transfer results (e.g., Plank et al. (2014)) are rare, but should be encouraged to develop a sound understanding of adaptation techniques. Analyses should also study ties between the datasets used and the methods applied, highlighting dimensions of variation between source and target domains and how adaptation methods bridge these variations Kashyap et al. (2020); Naik et al. (2021). Such analyses can uncover important lessons about the generalizability of adaptation methods and the kinds of source-target settings in which they can be expected to improve performance.

5 Which Methodological Gaps Have Greatest Negative Impact On Long Tail Performance?

The final goal of our meta-analysis is to identify methodological gaps in developing adaptation methods for long tail domains, which provide avenues for future research. Our observations highlight three areas: (i) combining adaptation methods, (ii) incorporating extra-linguistic knowledge, and (iii) application to data-scarce settings.

5.1 Combining adaptation methods

The potential of combining multiple adaptation methods has not been systematically and extensively studied. Combining methods may be useful in two scenarios. The first is when source and target domains differ along multiple dimensions (e.g., topic, language, etc.) and different methods are known to work well for each. The second is when methods focus on resolving issues in specific portions of the model, such as feature space misalignment, task-level differences, etc. Combining model-centric adaptation methods (as per our categorization presented in §2.3) that tackle each issue separately may improve performance over individual approaches. Despite its utility, method combination has only been systematically explored by one meta-study from 2010. On the other hand, 23 studies apply a particular combination of methods to their tasks/domains, but do not analyze when these combinations do or do not work. We summarize both sources of evidence and highlight open questions.

Method combination meta-study: Chang et al. (2010) observe that most adaptation methods tackle either a shift in the feature distribution, p(x), or a shift in how features are linked to labels, p(y|x). They call the former category “unlabeled adaptation methods”, since feature space alignment can be done using unlabeled data alone. The latter category requires some labeled target data; these are called “labeled adaptation methods” (note that these categories do not map cleanly onto our hierarchy).

Through theoretical analysis, simulated experiments and experiments with real-world data on two tasks (named entity recognition and preposition sense disambiguation), they observe: (i) combination generally improves performance, (ii) combining best-performing individual methods may not provide best combination performance, and (iii) simpler labeled adaptation algorithms (e.g., jointly training on source and target data) interface better with strong unlabeled adaptation algorithms.

Study Method LT
Different Coarse Categories
Jeong et al. (2009) IW+PL
Hangya et al. (2018) PT+FA
Cer et al. (2018) PT+LA
Dereli and Saraclar (2019) FA+PT
Ji et al. (2015) FG+IW
Huang et al. (2019) PI+PL
Li et al. (2012) LA+PL+IW
Chan and Ng (2007) AL+PI+IW
Nguyen et al. (2014) PL+EN
Yu and Kübler (2011) PL+IW
Scheible and Schütze (2013) FA+PL+DS
Tan and Cheng (2009) FA+IW
Mohit et al. (2012) LA+PL
Rai et al. (2010) AL+LA
Wu et al. (2017) AL+LA
Same Coarse Categories
Lin and Lu (2018) PA+FA
Zhang et al. (2017) FA+LA
Yan et al. (2020) FA+LA
Yang et al. (2017) LA+PL+FA
Gong et al. (2016) LA+PI
Same Fine Categories
Alam et al. (2018) LA+LA
Lee et al. (2020) PL+PL
Kim et al. (2017) LA+LA
Table 5: Category combinations explored by studies that combine multiple methods. LT indicates whether long tail domains were evaluated on.

Applying particular combinations: Table 5 lists all studies that apply method combinations, along with fine-grained category labels from our hierarchy for the methods used. Combining methods from different coarse categories is the most popular strategy, employed by 15 out of 23 studies. 5 studies combine methods from the same coarse category, but different fine categories; they combine model-centric methods that edit different parts of the model (e.g., a feature-centric and a loss-centric method). The last 3 studies combine methods from the same fine category. Only 7 studies evaluate on at least one long tail domain.

Several studies observe performance improvements Yu and Kübler (2011); Mohit et al. (2012); Scheible and Schütze (2013); Kim et al. (2017); Yang et al. (2017); Alam et al. (2018), mirroring the observation by Chang et al. (2010) that method combination helps. However, this is not consistent across all studies. For example, Jochim and Schütze (2014) find that combining marginalized stacked denoising autoencoders (mSDA) Chen et al. (2012) and frustratingly easy domain adaptation (FEDA) Daumé III (2007) performs worse than the individual methods in preliminary experiments on citation polarity classification. Both methods are feature-centric, though mSDA is a generalization method (FG) while FEDA is an augmentation method (FA). Additionally, mSDA is an unlabeled adaptation method while FEDA is a labeled adaptation method. Owing to the negative results, Jochim and Schütze (2014) do not experiment further to find a combination that might have worked. Wright and Augenstein (2020) show that combining adversarial domain adaptation (ADA) Ganin and Lempitsky (2015) with pretraining does not improve performance, but combining mixture of experts (MoE) with pretraining does. This indicates that methods from the same coarse category (model-centric) may react differently in combination settings. Similarly, studies achieving positive results do not analyze which properties of the chosen methods allow them to combine well, and whether this success extends to other methods with similar properties.

Open questions: To understand method combination, we must examine the following questions:

  • Is it possible to draw general conclusions about the potential of combining methods from various coarse or fine categories?

  • Which properties of adaptation methods are indicative of their ability to interface well with other methods?

  • Do task and/or domain of interest influence the abilities of methods to combine successfully?

5.2 Incorporating extra-linguistic knowledge

Most methods leverage labeled/unlabeled text to learn generalizable representations. However, knowledge from sources beyond text such as ontologies, human understanding of domain/task variation, etc., can also improve adaptation performance. This is especially true for domains with expert-curated ontologies (e.g., UMLS for biomedical/clinical text Bodenreider (2004)). From our study sample, we observe some exploration of the following knowledge sources:

Ontological knowledge: Romanov and Shivade (2018) employ UMLS for clinical natural language inference via two techniques: (i) retrofitting word vectors as per UMLS Faruqui et al. (2015), and (ii) using UMLS concept distance-based attention. Retrofitting hurts performance, while concept distance provides modest improvements.

Domain Variation: Arnold et al. (2008) and Yang and Eisenstein (2015) incorporate human understanding of domain variation in discrete and continuous feature spaces respectively, with some success (table 4). Structural correspondence learning Blitzer et al. (2006) relies on manually defined pivot features common to source and target domains, and shows performance improvements.

Task Variation: Zarrella and Marsh (2016) incorporate human understanding of knowledge required for stance detection to define an auxiliary hashtag prediction task, which improves target task performance.

Manual Adaptation: Chiticariu et al. (2010) manually customize rule-based NER models, matching scores achieved by supervised models.

Another source that is not explored by studies in our sample but has gained popularity recently is providing task descriptions for sample-efficient transfer learning Schick and Schütze (2021). Despite these initial explorations, the potential of extra-linguistic knowledge sources remains largely under-explored.

Open questions: Given varying availability of knowledge sources across tasks/domains, comparing their performance across domains may be impractical. But studies experimenting with a specific source can still probe the following questions:

  • Can reliance on labeled/unlabeled data be reduced while maintaining the same performance?

  • Does incorporating the knowledge source improve interpretability of the adaptation method?

  • Can we preemptively identify a subset of samples which may benefit from the knowledge?

5.3 Application to data-scarce adaptation settings

§3 shows that most studies test methods in a supervised setting in which labeled and/or unlabeled data is available from both source and target domains. But availability of labeled or unlabeled data is often limited for long tail domains and languages. Hence, methods should also be developed for and applied to settings that reflect real-world criteria like data availability. Data-scarce adaptation settings might be harder, but are extremely important since they closely resemble contexts in which transfer learning is likely to be used. In particular, more evaluation should be carried out in the following data-scarce settings:

Unsupervised Adaptation: No labeled target data is available. Methods can use unlabeled target data or obtain distantly supervised target data from auxiliary resources (e.g., gazetteers) and user-generated signals (e.g., likes, shares, etc.).

Multi-source Adaptation: Instead of a single large-scale source dataset, smaller datasets from several source domains are available.

Online Adaptation: In this setting, which is especially pertinent for productionizing models, adaptation methods must adapt to new domains on the fly; often, information about the target domain beyond the current sample is not available.

Source-free Adaptation: A trained model must be adapted to a target domain without source domain data, either labeled or unlabeled. This setting is especially useful for domains that have strong data-sharing restrictions such as clinical data.

Some of these settings have attracted attention in recent years. Ramponi and Plank (2020) comprehensively survey neural methods for unsupervised adaptation. In their survey on low-resource NLP, Hedderich et al. (2020) cover transfer learning techniques that reduce the need for supervised target data. Wang et al. (2021) list human-in-the-loop data augmentation and model updating techniques that can be used for data-scarce adaptation. However, there is room to further study the application of adaptation methods in data-scarce settings.

Open questions: Broadly, two main questions in this area still remain unanswered:

  • At different levels of data scarcity (e.g., no labeled target data, no unlabeled target data, etc.), which adaptation methods perform best?

  • Can we correlate source-target domain distance and data-reliance of adaptation methods?

i2b22006 i2b22010 i2b22014
Model P R F1 P R F1 P R F1
ZS 18.7 21.8 20.1 35.2 10.1 15.7 21.2 32.8 25.7
LA 16.1 21.2 18.3 36.6 15.4 21.7 27.5 28.6 28.0
PL 23.2 22.0 22.6 23.3 5.0 8.3 47.4 23.6 31.5
PT 19.5 22.1 20.7 38.1 12.8 19.1 27.3 27.4 27.3
IW 21.0 19.5 20.2 34.3 12.1 17.9 21.0 29.2 24.4
Table 6: Results of all adaptation methods on NER in the coarse setting
i2b22006 i2b22014
Model P R F1 P R F1
ZS 12.6 14.1 13.3 24.0 28.3 25.9
LA 16.1 15.8 16.0 22.8 25.7 24.2
PL 17.5 11.4 13.8 39.5 21.4 27.7
PT 10.0 12.3 11.1 17.1 22.3 19.4
IW 14.4 14.1 14.2 21.8 25.6 23.6
Table 7: Results of all adaptation methods on NER in the fine setting

6 Case Study: Evaluating Adaptation Methods on Clinical Narratives

Finally, we demonstrate how our meta-analysis framework and observations can be leveraged to systematically design case studies that answer the open questions laid out in the previous sections. As an example, we conduct a case study evaluating the effectiveness of popular adaptation methods on high-expertise domains in an unsupervised adaptation setting, a burgeoning area of interest Ramponi and Plank (2020). Specifically, our study focuses on the question: which method categories perform best for semantic sequence labeling tasks when transferring from news to clinical narratives in an unsupervised setting (i.e., no labeled clinical data available)? We focus on two semantic sequence labeling tasks: entity extraction and event extraction.

6.1 Datasets

We use the following entity extraction datasets:

  • CoNLL 2003 Tjong Kim Sang and De Meulder (2003): Reuters news stories annotated with four types of entities: persons (PER), organizations (ORG), locations (LOC), and miscellaneous names (MISC).

  • i2b2 2006 Uzuner et al. (2007): Medical discharge summaries from Partners Healthcare annotated with PHI (private health information) entities of eight types: patients, doctors, locations, hospitals, dates, IDs, phone numbers, and ages.

  • i2b2 2010 Uzuner et al. (2011): Discharge summaries from Partners Healthcare, Beth Israel Deaconess Medical Center, and the University of Pittsburgh Medical Center annotated with three entity types: medical problems, tests and treatments.

  • i2b2 2014 Stubbs and Uzuner (2015): Longitudinal medical records from Partners Healthcare annotated with PHI (private health information) entities of eight broad types: name, profession, location, age, date, contact, IDs, and other.

All entities are annotated in IOB format. For event extraction, we use the following datasets:

  • TimeBank Pustejovsky et al. (2003): News articles from various sources annotated with events, time expressions and temporal relations between events.

  • i2b2 2012 Sun et al. (2013): Discharge summaries from Partners Healthcare and Beth Israel Deaconess Medical Center annotated with events, time expressions and temporal relations.

  • MTSamples Naik et al. (2021): Medical records from the MTSamples website annotated with events. This dataset is test-only.

CoNLL 2003 and TimeBank are the source datasets for all entity and event extraction experiments respectively, while the remaining datasets are targets. We focus on English narratives only. Among the NER datasets, the label sets for i2b22006 and i2b22014 can be mapped to the label set for CoNLL 2003; however, the label set for i2b22010 is quite distinct and cannot be mapped. Therefore, we evaluate NER in two settings: coarse and fine. In the coarse setting, the model only detects entities but does not predict entity type, whereas in the fine setting, the model detects entities and predicts their types. The coarse setting evaluation covers all target NER datasets, while the fine setting only covers datasets that can be label-mapped (i.e., i2b22006 and i2b22014).
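As an illustration of the coarse setting, the helper below collapses typed IOB tags to untyped ones so that entity detection can be scored even when source and target type inventories differ. It is a minimal sketch, not the evaluation code used in our experiments.

```python
def to_coarse(tags):
    """Map typed IOB tags to untyped ones for the coarse setting,
    e.g. ['B-PER', 'I-PER', 'O', 'B-LOC'] -> ['B', 'I', 'O', 'B']."""
    return [tag.split("-")[0] for tag in tags]
```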

i2b22012 MTSamples
Model P R F1 P R F1
ZS 48.8 15.3 23.3 91.4 48.0 63.0
LA 51.7 19.0 27.8 88.1 58.5 70.3
PL 44.1 11.4 18.2 91.8 39.3 55.1
PT 41.5 10.4 16.6 90.2 46.3 61.2
IW 50.5 18.1 26.6 90.6 48.4 63.1
Table 8: Results of all adaptation methods on event extraction

6.2 Adaptation Methods

As the baseline model for NER and event extraction, we use a BERT-based sequence labeling model that computes token-level representations using a BERT encoder, followed by a linear layer that predicts entity/event labels per token. On top of this BERT baseline, we compare the performance of adaptation methods from the five fine categories most frequently applied to high-expertise domains in our analysis (figure 9(a)). Since feature augmentation (FA) methods require some labeled target data to train target-specific weights and our focus is on an unsupervised setting, our study tests the remaining four categories:

  • PL: From the pseudo-labeling category, we test the self-training method (a minimal self-training loop is sketched after this list). Self-training works by first training a sequence labeling model on the source dataset of news narratives, then using the source-trained model to generate labels for unlabeled sentences from the target domain (clinical narratives). A subset of high-confidence predictions from this set of “pseudo-labeled” clinical data is then combined with the source dataset to train a sequence labeling model. This process can be repeated iteratively until all the unlabeled data is exhausted.

  • LA: From the loss augmentation category, we test adversarial domain adaptation Ganin and Lempitsky (2015) (the gradient-reversal idea at its core is sketched after this list). This method tries to learn domain-invariant representations by adding an adversary that tries to predict an example’s domain and subtracting this adversary’s loss from the overall model loss. The setup is trained in a two-stage process: the adversary is first trained for domain prediction, and the sequence labeling model is then trained to do well on sequence labeling while suppressing domain-specific information.

  • PT: From the pretraining category, we test domain-adaptive pretraining as described by Gururangan et al. (2020). This method tries to improve the target domain performance of BERT-based models by continual masked language modeling pretraining on unlabeled text from the target domain.

  • IW: From the instance weighting category, we test classifier-based instance weighting as described by Jiang and Zhai (2007). This method trains a classifier on the task of predicting an example’s domain, then runs the trained classifier on all source domain examples and uses the target domain probabilities as weights for each example. This technique thus assigns higher weights to source domain examples that “look” more like the target domain according to the domain classifier, hopefully improving performance on the target datasets. In our setup, we perform interleaved training: we retrain the domain classifier after each model training pass and update the weights assigned to source dataset examples.
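The listing below sketches the self-training (PL) loop described above. The train_model and predict_with_confidence callables stand in for the BERT sequence labeler and its inference routine; they, along with the round count and confidence threshold, are illustrative assumptions rather than the exact configuration used in our experiments.

```python
def self_train(source_data, unlabeled_target, train_model, predict_with_confidence,
               rounds=3, threshold=0.9):
    """Iterative self-training: train on source, pseudo-label confident target
    sentences, retrain on the union, and repeat."""
    labeled = list(source_data)                 # (sentence, tags) pairs from news
    remaining = list(unlabeled_target)          # unlabeled clinical sentences
    model = train_model(labeled)
    for _ in range(rounds):
        if not remaining:
            break
        predictions = [(sent, *predict_with_confidence(model, sent)) for sent in remaining]
        confident = [(sent, tags) for sent, tags, score in predictions if score >= threshold]
        remaining = [sent for sent, tags, score in predictions if score < threshold]
        labeled.extend(confident)               # add pseudo-labeled clinical data
        model = train_model(labeled)            # retrain on source + pseudo-labels
    return model
```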
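For the adversarial (LA) category, the PyTorch sketch below shows the core gradient-reversal mechanism behind Ganin and Lempitsky (2015): the domain discriminator is trained normally, while the encoder receives the negated gradient and is thus pushed toward domain-invariant features. This is a generic sketch of the idea, not the exact two-stage training procedure used in our case study.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda on the way back."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def domain_adversarial_loss(feats, domain_labels, domain_head, lambd=0.1):
    """Cross-entropy of a domain classifier applied to gradient-reversed features.

    feats: encoder output, e.g. (batch, hidden); domain_labels: 0 = source, 1 = target.
    Adding this term to the task loss encourages domain-invariant representations.
    """
    reversed_feats = GradReverse.apply(feats, lambd)
    logits = domain_head(reversed_feats)
    return F.cross_entropy(logits, domain_labels)
```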

6.3 Results

Tables 6 and 7 show the results of all adaptation methods on coarse and fine entity extraction respectively, while table 8 shows the results of all adaptation methods on event extraction. Note that ZS indicates baseline model scores in a zero-shot setting, i.e., training on the source dataset (CoNLL 2003/TimeBank) and testing on the target dataset without any adaptation. From these tables, we can see that the best-performing method categories across settings are loss augmentation and pseudo-labeling. In particular, loss augmentation methods work best for event extraction. For entity extraction in the coarse setting, pseudo-labeling methods work better on target datasets whose labels can be mapped to the source dataset, which can be considered closer transfer tasks. For i2b22010, the more distant transfer task, loss augmentation works best in the coarse setting. The effectiveness of pseudo-labeling methods here is interesting because they can suffer from the pitfall of propagating errors made by the source-trained model, which may also explain their poor performance on i2b22010. Indeed, early work on applying these methods to parsing showed negative results or only minor improvements Charniak (1997); Steedman et al. (2003), but these methods have shown more promise in recent years with advances in embedding representations. Finally, for entity extraction in the fine setting, loss augmentation does better on i2b22006, while pseudo-labeling does better on i2b22014. Another interesting observation is that pretraining is not the best-performing method in any setting. This may be a side effect of the continual pretraining process leading to some level of forgetting, which can have a negative impact in an unsupervised adaptation setting. This further highlights the need for systematic studies comparing adaptation methods under data-scarce settings, because the ranking of methods can change based on the availability and quality of domain-specific data.

7 Conclusion

This work presents a qualitative meta-analysis of 100 representative papers on domain adaptation and transfer learning in NLU, with the aim of understanding the performance of adaptation methods on the long tail. Through this analysis, we assess current trends and highlight methodological gaps that we consider to be major avenues for future research in transfer learning for the long tail. We observe that current research has a tendency to sideline certain types of tasks, languages, domains, and adaptation settings, indicating that long tail coverage is far from comprehensive. We also identify two properties that help long tail performance: (i) incorporating source-target domain distance, and (ii) incorporating a nuanced view of domain variation. Additionally, we identify three major gaps that must be addressed to improve long tail performance: (i) combining adaptation methods, (ii) incorporating extra-linguistic knowledge, and (iii) application to data-scarce adaptation settings. Finally, we demonstrate the utility of our meta-analysis framework and observations in guiding the design of meta-experiments that address prevailing open questions, by conducting a systematic evaluation of popular adaptation methods for high-expertise domains in a data-scarce setting. This case study reveals interesting insights about the adaptation methods evaluated and shows that significant progress can be made towards a better understanding of adaptation for the long tail by conducting such experiments.

Acknowledgements

This research was supported in part by the Intramural Research Program of the National Institutes of Health, Clinical Research Center and through an Inter-Agency Agreement with the US Social Security Administration. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the NIH, or the US Government.

References

  • M. Al Boni, K. Zhou, H. Wang, and M. S. Gerber (2015) Model adaptation for personalized opinion analysis. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Beijing, China, pp. 769–774. External Links: Link, Document Cited by: Table 2.
  • F. Alam, S. Joty, and M. Imran (2018) Domain adaptation with adversarial training and graph embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1077–1087. External Links: Link, Document Cited by: §5.1, Table 5.
  • A. Arnold, R. Nallapati, and W. W. Cohen (2008) Exploiting feature hierarchy for transfer learning in named entity recognition. In Proceedings of ACL-08: HLT, Columbus, Ohio, pp. 245–253. External Links: Link Cited by: Table 10, Table 4, 2nd item, §5.2.
  • J. Blitzer, M. Dredze, and F. Pereira (2007) Biographies, Bollywood, boom-boxes and blenders: domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic, pp. 440–447. External Links: Link Cited by: §3, §3, §4.
  • J. Blitzer, R. McDonald, and F. Pereira (2006) Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, Sydney, Australia, pp. 120–128. External Links: Link Cited by: §2.3, Table 2, §4, §5.2.
  • O. Bodenreider (2004) The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Research 32 (Database issue), pp. D267–D270. Cited by: §5.2.
  • S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning (2015) A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 632–642. External Links: Link, Document Cited by: §1.
  • C. Braud and P. Denis (2014) Combining natural and artificial examples to improve implicit discourse relation identification. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Dublin, Ireland, pp. 1694–1705. External Links: Link Cited by: Table 2.
  • D. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, B. Strope, and R. Kurzweil (2018) Universal sentence encoder for English. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Brussels, Belgium, pp. 169–174. External Links: Link, Document Cited by: Table 5.
  • Y. S. Chan and H. T. Ng (2006) Estimating class priors in domain adaptation for word sense disambiguation. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney, Australia, pp. 89–96. External Links: Link, Document Cited by: Table 2.
  • Y. S. Chan and H. T. Ng (2007) Domain adaptation with active learning for word sense disambiguation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic, pp. 49–56. External Links: Link Cited by: Table 5.
  • E. Charniak (1997) Statistical parsing with a context-free grammar and word statistics. In Proceedings of AAAI/IAAI, pp. 598–603. Cited by: §6.3.
  • M. Chen, Z. E. Xu, K. Q. Weinberger, and F. Sha (2012) Marginalized denoising autoencoders for domain adaptation. In ICML, Cited by: §5.1.
  • S. Chen, Y. Hou, Y. Cui, W. Che, T. Liu, and X. Yu (2020) Recall and learn: fine-tuning deep pretrained language models with less forgetting. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, pp. 7870–7881. External Links: Link, Document Cited by: Table 2.
  • L. Chiticariu, R. Krishnamurthy, Y. Li, F. Reiss, and S. Vaithyanathan (2010) Domain adaptation of rule-based annotators for named-entity recognition tasks. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, Cambridge, MA, pp. 1002–1012. External Links: Link Cited by: §5.2.
  • A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes (2017) Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 670–680. External Links: Link, Document Cited by: Table 2, §4.
  • H. Daumé III (2007) Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic, pp. 256–263. External Links: Link Cited by: Table 2, §4, §5.1.
  • N. Dereli and M. Saraclar (2019) Convolutional neural networks for financial text regression. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Florence, Italy, pp. 331–337. External Links: Link, Document Cited by: Table 5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §4.
  • M. Faruqui, J. Dodge, S. K. Jauhar, C. Dyer, E. Hovy, and N. A. Smith (2015) Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 1606–1615. External Links: Link, Document Cited by: §5.2.
  • Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pp. 1180–1189. Cited by: §2.3, §5.1, 2nd item.
  • Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §2.3.
  • L. Gong, M. Al Boni, and H. Wang (2016) Modeling social norms evolution for personalized sentiment classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 855–865. External Links: Link, Document Cited by: Table 2, Table 5.
  • Y. Gu, R. Tinn, H. Cheng, M. Lucas, N. Usuyama, X. Liu, T. Naumann, J. Gao, and H. Poon (2020) Domain-specific language model pretraining for biomedical natural language processing. arXiv preprint arXiv:2007.15779. Cited by: §3.
  • S. Gururangan, M. Lewis, A. Holtzman, N. A. Smith, and L. Zettlemoyer (2021) DEMix layers: disentangling domains for modular language modeling. arXiv preprint arXiv:2108.05036. Cited by: 2nd item.
  • S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith (2020) Don’t stop pretraining: adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 8342–8360. External Links: Link, Document Cited by: §2.3.
  • P. Haeusser, A. Mordvintsev, and D. Cremers (2017) Learning by association – a versatile semi-supervised training method for neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 89–98. Cited by: Table 10, Table 4.
  • V. Hangya, F. Braune, A. Fraser, and H. Schütze (2018) Two methods for domain adaptation of bilingual tasks: delightfully simple and broadly applicable. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 810–820. External Links: Link, Document Cited by: Table 10, Table 4, Table 5.
  • J. Howard and S. Ruder (2018) Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 328–339. External Links: Link, Document Cited by: Table 2, §4.
  • J. Hu, S. Ruder, A. Siddhant, G. Neubig, O. Firat, and M. Johnson (2020) XTREME: a massively multilingual multi-task benchmark for evaluating cross-lingual generalisation. In International Conference on Machine Learning, pp. 4411–4421. Cited by: §3.
  • Y. J. Huang, J. Lu, S. Kurohashi, and V. Ng (2019) Improving event coreference resolution by learning argument compatibility from unlabeled data. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 785–795. External Links: Link, Document Cited by: Table 5.
  • M. Jeong, C. Lin, and G. G. Lee (2009) Semi-supervised speech act recognition in emails and forums. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, pp. 1250–1259. External Links: Link Cited by: Table 2, Table 5.
  • Y. Ji, G. Zhang, and J. Eisenstein (2015) Closing the gap: domain adaptation from explicit to implicit discourse relations. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 2219–2224. External Links: Link, Document Cited by: Table 2, Table 5.
  • J. Jiang and C. Zhai (2007) Instance weighting for domain adaptation in NLP. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, Prague, Czech Republic, pp. 264–271. External Links: Link Cited by: Table 2.
  • C. Jochim and H. Schütze (2014) Improving citation polarity classification with product reviews. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Baltimore, Maryland, pp. 42–48. External Links: Link, Document Cited by: Table 2.
  • A. Kamath, S. Gupta, and V. Carvalho (2019) Reversing gradients in adversarial domain adaptation for question deduplication and textual entailment tasks. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 5545–5550. External Links: Link, Document Cited by: Table 10, Table 4.
  • A. R. Kashyap, D. Hazarika, M. Kan, and R. Zimmermann (2020) Domain divergences: a survey and empirical analysis. arXiv preprint arXiv:2010.12198. Cited by: §4.
  • S. Khanuja, S. Dandapat, A. Srinivasan, S. Sitaram, and M. Choudhury (2020) GLUECoS: an evaluation benchmark for code-switched NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 3575–3585. External Links: Link, Document Cited by: §3.
  • J. Kim, Y. Kim, R. Sarikaya, and E. Fosler-Lussier (2017) Cross-lingual transfer learning for POS tagging without cross-lingual resources. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, pp. 2832–2838. External Links: Link, Document Cited by: §5.1, Table 5.
  • Y. Lee, R. Fernandez Astudillo, T. Naseem, R. Gangi Reddy, R. Florian, and S. Roukos (2020) Pushing the limits of AMR parsing with self-learning. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online, pp. 3208–3214. External Links: Link, Document Cited by: Table 5.
  • F. Li, S. J. Pan, O. Jin, Q. Yang, and X. Zhu (2012) Cross-domain co-extraction of sentiment and topic lexicons. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jeju Island, Korea, pp. 410–419. External Links: Link Cited by: Table 5.
  • J. Liang, D. Hu, and J. Feng (2020) Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In International Conference on Machine Learning, pp. 6028–6039. Cited by: §3.
  • B. Y. Lin and W. Lu (2018) Neural adaptation layers for cross-domain named entity recognition. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 2012–2022. External Links: Link, Document Cited by: Table 2, Table 5.
  • P. Lison, J. Barnes, A. Hubin, and S. Touileb (2020) Named entity recognition without labelled data: a weak supervision approach. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 1518–1533. External Links: Link, Document Cited by: Table 10, Table 2, Table 4.
  • X. Liu, J. Gao, X. He, L. Deng, K. Duh, and Y. Wang (2015) Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 912–921. External Links: Link, Document Cited by: §4.
  • X. Liu, P. He, W. Chen, and J. Gao (2019) Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 4487–4496. External Links: Link, Document Cited by: §2.3, Table 2, §4.
  • K. Lo, L. L. Wang, M. Neumann, R. Kinney, and D. Weld (2020) S2ORC: the semantic scholar open research corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 4969–4983. External Links: Link, Document Cited by: §2.1.
  • M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz (1993) Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics 19 (2), pp. 313–330. External Links: Link Cited by: §1.
  • B. McCann, N. S. Keskar, C. Xiong, and R. Socher (2018) The natural language decathlon: multitask learning as question answering. arXiv preprint arXiv:1806.08730. Cited by: §1.
  • D. McClosky, E. Charniak, and M. Johnson (2010) Automatic domain adaptation for parsing. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, California, pp. 28–36. External Links: Link Cited by: Table 10, Table 2, Table 4.
  • B. Mohit, N. Schneider, R. Bhowmick, K. Oflazer, and N. A. Smith (2012) Recall-oriented learning of named entities in Arabic Wikipedia. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, pp. 162–173. External Links: Link Cited by: §5.1, Table 5.
  • A. Naik, J. F. Lehman, and C. Rose (2021) Adapting event extractors to medical data: bridging the covariate shift. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, pp. 2963–2975. External Links: Link Cited by: §4, 3rd item.
  • D. Newman-Griffis, J. F. Lehman, C. Rosé, and H. Hochheiser (2021) Translational nlp: a new paradigm and general principles for natural language processing research. arXiv preprint arXiv:2104.07874. Cited by: §3.
  • M. L. Nguyen, I. W. Tsang, K. M. A. Chai, and H. L. Chieu (2014) Robust domain adaptation for relation extraction via clustering consistency. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Baltimore, Maryland, pp. 807–817. External Links: Link, Document Cited by: Table 2, Table 5.
  • M. J. Page, J. E. McKenzie, P. M. Bossuyt, I. Boutron, T. C. Hoffmann, C. D. Mulrow, L. Shamseer, J. M. Tetzlaff, E. A. Akl, S. E. Brennan, et al. (2021) The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 372. Cited by: §2.1.
  • Y. Peng, S. Yan, and Z. Lu (2019) Transfer learning in biomedical natural language processing: an evaluation of BERT and ELMo on ten benchmarking datasets. In Proceedings of the 18th BioNLP Workshop and Shared Task, Florence, Italy, pp. 58–65. External Links: Link, Document Cited by: §3.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237. External Links: Link, Document Cited by: §4.
  • I. Pilán, E. Volodina, and T. Zesch (2016) Predicting proficiency levels in learner writings by transferring a linguistic complexity model from expert-written coursebooks. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, pp. 2101–2111. External Links: Link Cited by: Table 2.
  • B. Plank, A. Johannsen, and A. Søgaard (2014) Importance weighting and unsupervised domain adaptation of POS taggers: a negative result. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 968–973. External Links: Link, Document Cited by: §4.
  • B. Plank (2016) What to do about non-standard (or non-canonical) language in NLP. In Proceedings of the 13th Conference on Natural Language Processing, KONVENS 2016, Bochum, Germany, September 19-21, 2016, S. Dipper, F. Neubarth, and H. Zinsmeister (Eds.), Bochumer Linguistische Arbeitsberichte, Vol. 16. External Links: Link Cited by: 2nd item.
  • J. Pustejovsky, P. Hanks, R. Sauri, A. See, R. Gaizauskas, A. Setzer, D. Radev, B. Sundheim, D. Day, L. Ferro, et al. (2003) The TimeBank corpus. In Corpus Linguistics, Vol. 2003, pp. 40. Cited by: 1st item.
  • P. Rai, A. Saha, H. Daumé, and S. Venkatasubramanian (2010) Domain adaptation meets active learning. In Proceedings of the NAACL HLT 2010 Workshop on Active Learning for Natural Language Processing, Los Angeles, California, pp. 27–32. External Links: Link Cited by: Table 2, Table 5.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 2383–2392. External Links: Link, Document Cited by: §1.
  • A. Ramponi and B. Plank (2020) Neural unsupervised domain adaptation in NLP—A survey. In Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), pp. 6838–6855. External Links: Link, Document Cited by: §6.
  • J. D. Rodriguez, A. Caldwell, and A. Liu (2018) Transfer learning for entity recognition of novel classes. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 1974–1985. External Links: Link Cited by: Table 10, Table 4.
  • S. Ruder, M. E. Peters, S. Swayamdipta, and T. Wolf (2019) Transfer learning in natural language processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, Minneapolis, Minnesota, pp. 15–18. External Links: Link, Document Cited by: §1.
  • C. Scheible and H. Schütze (2013) Sentiment relevance. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Sofia, Bulgaria, pp. 954–963. External Links: Link Cited by: Table 2, §5.1, Table 5.
  • T. Schick and H. Schütze (2021) Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Online, pp. 255–269. External Links: Link Cited by: §5.2.
  • B. Settles (2009) Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison. Cited by: §2.3.
  • M. Steedman, M. Osborne, A. Sarkar, S. Clark, R. Hwa, J. Hockenmaier, P. Ruhlen, S. Baker, and J. Crim (2003) Bootstrapping statistical parsers from small datasets. In 10th Conference of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary. External Links: Link Cited by: §6.3.
  • A. Stubbs and Ö. Uzuner (2015) Annotating longitudinal clinical narratives for de-identification: the 2014 i2b2/UTHealth corpus. Journal of Biomedical Informatics 58, pp. S20–S29. Cited by: 4th item.
  • W. Sun, A. Rumshisky, and O. Uzuner (2013) Evaluating temporal relations in clinical text: 2012 i2b2 challenge. Journal of the American Medical Informatics Association 20 (5), pp. 806–813. Cited by: 2nd item.
  • A. Talmor, J. Herzig, N. Lourie, and J. Berant (2019) CommonsenseQA: a question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4149–4158. External Links: Link, Document Cited by: §1.
  • S. Tan and X. Cheng (2009) Improving SCL model for sentiment-transfer learning. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, Boulder, Colorado, pp. 181–184. External Links: Link Cited by: Table 5.
  • E. F. Tjong Kim Sang and F. De Meulder (2003) Introduction to the CoNLL-2003 shared task: language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 142–147. External Links: Link Cited by: 1st item.
  • J. Tourille, O. Ferret, X. Tannier, and A. Névéol (2017) LIMSI-COT at SemEval-2017 task 12: neural architecture for temporal information extraction from clinical narratives. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), Vancouver, Canada, pp. 597–602. External Links: Link, Document Cited by: Table 2.
  • Y. Tsuboi, H. Kashima, S. Mori, H. Oda, and Y. Matsumoto (2008) Training conditional random fields using incomplete annotations. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), Manchester, UK, pp. 897–904. External Links: Link Cited by: Table 10, Table 4.
  • S. Umansky-Pesin, R. Reichart, and A. Rappoport (2010) A multi-domain web-based algorithm for POS tagging of unknown words. In Coling 2010: Posters, Beijing, China, pp. 1274–1282. External Links: Link Cited by: Table 2.
  • Ö. Uzuner, Y. Luo, and P. Szolovits (2007) Evaluating the state-of-the-art in automatic de-identification. Journal of the American Medical Informatics Association 14 (5), pp. 550–563. Cited by: 2nd item.
  • Ö. Uzuner, B. R. South, S. Shen, and S. L. DuVall (2011) 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association 18 (5), pp. 552–556. Cited by: 3rd item.
  • A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2019a) SuperGLUE: a stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pp. 3266–3280. Cited by: §1, §3.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2019b) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In 7th International Conference on Learning Representations, ICLR 2019. Cited by: §1, §3.
  • Z. Wang, Y. Qu, L. Chen, J. Shen, W. Zhang, S. Zhang, Y. Gao, G. Gu, K. Chen, and Y. Yu (2018) Label-aware double transfer learning for cross-specialty medical named entity recognition. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1–15. External Links: Link, Document Cited by: Table 10, Table 4, 1st item.
  • F. Wu, Y. Huang, and J. Yan (2017) Active sentiment domain adaptation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1701–1711. External Links: Link, Document Cited by: Table 2, Table 5.
  • J. Xing, K. Zhu, and S. Zhang (2018) Adaptive multi-task transfer learning for Chinese word segmentation in medical text. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 3619–3630. External Links: Link Cited by: Table 10, Table 4, 1st item.
  • M. Yan, H. Zhang, D. Jin, and J. T. Zhou (2020) Multi-source meta transfer for low resource multiple-choice question answering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7331–7341. External Links: Link, Document Cited by: Table 5.
  • H. Yang, T. Zhuang, and C. Zong (2015) Domain adaptation for syntactic and semantic dependency parsing using deep belief networks. Transactions of the Association for Computational Linguistics 3, pp. 271–282. External Links: Link, Document Cited by: Table 2.
  • Y. Yang and J. Eisenstein (2015) Unsupervised multi-domain adaptation with feature embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 672–682. External Links: Link, Document Cited by: Table 10, Table 4, 2nd item, §5.2.
  • Z. Yang, J. Hu, R. Salakhutdinov, and W. Cohen (2017) Semi-supervised QA with generative domain-adaptive nets. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1040–1050. External Links: Link, Document Cited by: §5.1, Table 5.
  • W. Yin, T. Schnabel, and H. Schütze (2015) Online updating of word representations for part-of-speech tagging. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 1329–1334. External Links: Link, Document Cited by: Table 2.
  • N. Yu and S. Kübler (2011) Filling the gap: semi-supervised learning for opinion detection across domains. In Proceedings of the Fifteenth Conference on Computational Natural Language Learning, Portland, Oregon, USA, pp. 200–209. External Links: Link Cited by: §5.1, Table 5.
  • G. Zarrella and A. Marsh (2016) MITRE at SemEval-2016 task 6: transfer learning for stance detection. In Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016), San Diego, California, pp. 458–463. External Links: Link, Document Cited by: §5.2.
  • Y. Zhang, R. Barzilay, and T. Jaakkola (2017) Aspect-augmented adversarial networks for domain adaptation. Transactions of the Association for Computational Linguistics 5, pp. 515–528. External Links: Link, Document Cited by: Table 2, Table 5.

Appendix A Venue Statistics

Table 9 presents the distribution of papers across venues for the complete corpus as well as the transfer learning (TL) subset of 435 papers.

Venue #Papers #TL Papers
ACL 7200 149
EMNLP 4160 127
NAACL 2943 52
EACL 1290 11
COLING 4965 39
CoNLL 778 10
SemEval 1632 33
TACL 397 10
CL 1776 4
Table 9: Distribution of papers across venues in the complete corpus and the transfer learning subset

Appendix B Meta-Analysis Papers

Following is the complete list of papers included in our final meta-analysis sample:
Papers with >= 100 citations: blitzer-etal-2006-domain, daume-iii-2007-frustratingly, jiang-zhai-2007-instance, blitzer-etal-2007-biographies, chan-ng-2007-domain, finkel-manning-2009-hierarchical, mcclosky-etal-2010-automatic, chiticariu-etal-2010-domain, subramanya-etal-2010-efficient, prettenhofer-stein-2010-cross, li-etal-2012-cross, plank-moschitti-2013-embedding, eisenstein-2013-bad, liu-etal-2015-representation, nguyen-grishman-2015-event, zarrella-marsh-2016-mitre, sogaard-goldberg-2016-deep, mou-etal-2016-transferable, conneau-etal-2017-supervised, yang-etal-2017-semi, cer-etal-2018-universal, howard-ruder-2018-universal, liu-etal-2019-multi

Remaining papers: chan-ng-2006-estimating, tsuboi-etal-2008-training, arnold-etal-2008-exploiting, agirre-lopez-de-lacalle-2008-robustness, jeong-etal-2009-semi, agirre-lopez-de-lacalle-2009-supervised, tan-cheng-2009-improving, umansky-pesin-etal-2010-multi, rai-etal-2010-domain, chang-etal-2010-necessity, yu-kubler-2011-filling, szarvas-etal-2012-cross, dhillon-etal-2012-metric, mohit-etal-2012-recall, heilman-madnani-2013-ets, scheible-schutze-2013-sentiment, plank-etal-2014-importance, monroe-etal-2014-word, jochim-schutze-2014-improving, nguyen-etal-2014-robust, braud-denis-2014-combining, passonneau-etal-2014-biber, yang-eisenstein-2015-unsupervised, yin-etal-2015-online, ji-etal-2015-closing, yang-etal-2015-domain, al-boni-etal-2015-model, kim-etal-2016-frustratingly, abdelwahab-elmaghraby-2016-uofl, huang-lin-2016-transferring, sapkota-etal-2016-domain, gong-etal-2016-modeling, pilan-etal-2016-predicting, duong-etal-2017-multilingual, tourille-etal-2017-limsi, zhang-etal-2017-aspect, kim-etal-2017-cross, gimenez-perez-etal-2017-single, wu-etal-2017-active, chen-etal-2018-xl, hangya-etal-2018-two, rodriguez-etal-2018-transfer, xing-etal-2018-adaptive, wang-etal-2018-label-aware, lin-lu-2018-neural, gee-wang-2018-psyml, alam-etal-2018-domain, romanov-shivade-2018-lessons, fares-etal-2018-transfer, yang-etal-2018-interpretable, jiao-etal-2018-convolutional, huang-etal-2018-zero, vlad-etal-2019-sentence, kamath-etal-2019-reversing, li-etal-2019-semi-supervised, chen-qian-2019-transfer, wiedemann-etal-2019-uhh, beryozkin-etal-2019-joint, dereli-saraclar-2019-convolutional, aggarwal-sadana-2019-nsit, li-etal-2019-semi-supervised-domain, huang-etal-2019-improving, johnson-etal-2019-cross, wang-etal-2019-tell, lison-etal-2020-named, akdemir-2020-research, chalkidis-etal-2020-empirical, naik-rose-2020-towards, lee-etal-2020-pushing, tamkin-etal-2020-investigating, chen-etal-2020-recall, yan-etal-2020-multi, wright-augenstein-2020-transformer, vu-etal-2020-exploring, schroder-biemann-2020-estimating, keung-etal-2020-dont

Study Method Performance
Tsuboi et al. (2008) Conditional random field (CRF) model trained on partially annotated sequences of OOV tokens (LA) Model evaluated on transfer between two long tail domains (conversations to medical manuals) shows improvements
Arnold et al. (2008) A feature hierarchy is manually constructed across domains, allowing the model to back off to more general features as needed (FA) Evaluation on transfer from five corpora (biomedical, news, emails) to emails shows improvements
McClosky et al. (2010) Multi-source adaptation method comprising a mixture of domain-specific models, with source-target similarity features (e.g., vocabulary overlap, cosine similarity, etc.) used to choose the right mixture (EN) Performance improvement on all long tail domains tested (biomedical, literature, conversations)
Yang and Eisenstein (2015) Unsupervised multi-source adaptation method in which dense embeddings are induced using template features and concatenated with manually defined domain attribute embeddings (FA) Performance improvements on 4 out of 5 web domains and 10 out of 11 literary periods
Hangya et al. (2018) Method 1 trains monolingual embeddings on generic+domain text, followed by cross-lingual projection via a seed lexicon (PT+FA), while Method 2 uses the cycle consistency loss of Haeusser et al. (2017) on labeled+unlabeled data to learn better representations (LA) Performance improvements on both medical data and Twitter data using both methods
Rodriguez et al. (2018) Label-varying setup, with source domain labels being approximate hypernyms of target labels, handled by training a source domain classifier and using its predictions as inputs to the target classifier (FA), or by initializing the target classifier with source classifier weights (PI) Evaluation carried out on a broad range of long tail domains (medical, security and defense reports, conversations, Twitter), but no consensus on which method performs best across all
Xing et al. (2018) Adaptive multi-task learning method with an additional loss term to minimize source-target distance measured via different metrics (e.g., KL divergence, maximum mean discrepancy, central moment discrepancy, etc.) (LA) Performance improvements on 4 out of 6 intra-medical transfer settings (EHRs and medical forums) and 5 out of 9 narrative to medical transfer settings
Wang et al. (2018) Source-target distance minimized using two loss penalties: (i) minimizing maximum mean discrepancy between tokens with same labels from different domains, and (ii) minimizing KL divergence between source and target CRF emission and transition distributions (LA) Performance improvements on various medical and Twitter data sources
Kamath et al. (2019) Adversarial domain adaptation which uses a domain discriminator adversary to learn more general feature representations, augmented with a domain-specific feature space (LA) Performance improvements on web forum text (Quora, StackExchange, etc.) as well as financial text
Lison et al. (2020) Weakly supervised training data created using multiple rule-based or trained labeling functions for each instance, multiple labels aggregated using a Hidden Markov Model (PL) Performance improvements on both financial text and Twitter
Table 10: Model and performance details for studies that test on both high-expertise and non-narrative domains

Appendix C Methods Evaluated On Long Tail Domains

Table 10 lists studies that evaluate on multiple long tail domains, detailing the methods they explore and their performance. The main text contains an abridged version (Table 4). Two minimal code sketches follow, illustrating the source-target distance penalties and the gradient-reversal adversarial training that recur in these studies.
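The rows for Xing et al. (2018) and Wang et al. (2018) add loss terms that penalize source-target distance, for example via maximum mean discrepancy (MMD). The following is a minimal sketch of such a penalty, assuming PyTorch; the function names and the RBF-kernel choice are illustrative and not taken from the cited implementations.

import torch

def mmd_penalty(source_feats, target_feats, sigma=1.0):
    """Biased estimate of squared MMD between source and target feature batches."""
    def rbf(a, b):
        # RBF kernel matrix between two sets of feature vectors.
        sq_dists = torch.cdist(a, b) ** 2
        return torch.exp(-sq_dists / (2 * sigma ** 2))
    k_ss = rbf(source_feats, source_feats).mean()
    k_tt = rbf(target_feats, target_feats).mean()
    k_st = rbf(source_feats, target_feats).mean()
    return k_ss + k_tt - 2 * k_st

# Hypothetical usage: total_loss = task_loss + alpha * mmd_penalty(enc(src_x), enc(tgt_x))

In this setup the penalty is simply added to the supervised task loss, so the encoder is pushed to produce feature distributions that look similar across source and target batches.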
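The adversarial entries, which build on Ganin and Lempitsky (2015) (e.g., Kamath et al. (2019)), instead train a domain discriminator through a gradient-reversal layer so that the shared encoder learns domain-invariant features. Below is a minimal sketch of that idea, again assuming PyTorch; class and variable names are illustrative and do not reproduce the cited implementations.

import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity on the forward pass; multiplies gradients by -lambda on the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Gradient w.r.t. x is reversed and scaled; lambd itself gets no gradient.
        return -ctx.lambd * grad_output, None

class DomainAdversarialModel(nn.Module):
    """Shared encoder feeding a task head and a domain discriminator head."""
    def __init__(self, input_dim, hidden_dim, num_labels, num_domains, lambd=1.0):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.task_head = nn.Linear(hidden_dim, num_labels)
        self.domain_head = nn.Linear(hidden_dim, num_domains)
        self.lambd = lambd

    def forward(self, x):
        features = self.encoder(x)
        task_logits = self.task_head(features)
        # Reversed gradients push the encoder toward features that fool the domain discriminator.
        domain_logits = self.domain_head(GradientReversal.apply(features, self.lambd))
        return task_logits, domain_logits

During training, the task loss on labeled source data and the domain classification loss on source and target data are summed; because the domain gradient is reversed before reaching the encoder, the shared features become less predictive of domain while remaining useful for the task.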