
Temporal Effects on Pre-trained Models for Language Processing Tasks

by   Oshin Agarwal, et al.
University of Pennsylvania

Keeping the performance of language technologies optimal as time passes is of great practical interest. Here we survey prior work concerned with the effect of time on system performance, establishing more nuanced terminology for discussing the topic and proper experimental design to support solid conclusions about the observed phenomena. We present a set of experiments with systems powered by large neural pretrained representations for English to demonstrate that temporal model deterioration is not as big a concern, with some models in fact improving when tested on data drawn from a later time period. It is however the case that temporal domain adaptation is beneficial, with better performance for a given time period possible when the system is trained on temporally more recent data. Our experiments reveal that the distinction between temporal model deterioration and temporal domain adaptation becomes salient for systems built upon pretrained representations. Finally, we examine the efficacy of two approaches for temporal domain adaptation without human annotations on new data, with self-labeling proving to be superior to continual pre-training. Notably, for named entity recognition, self-labeling leads to better temporal adaptation than human annotation.



1 Introduction

Language changes constantly, even for small time scales such as the span of a year. How this change impacts the performance of language technologies is a question of great practical interest. In some scenarios, language will change as a result of deploying a system that uses the language to make a prediction, as in spam detection Fawcett (2003). But most change is not driven by such adversarial adaptations: the language expressing sentiment in product reviews Lukes and Søgaard (2018), the named entities and the contexts in which they are discussed on social media Fromreide et al. (2014); Rijhwani and Preotiuc-Pietro (2020) and language markers of political ideology Huang and Paul (2018) all change over time.

Yet research on quantifying how model performance changes has been sporadic. Moreover, approaches to solving language tasks have evolved rapidly, from bag of words models which rely only on a small number of fixed words represented as strings, without underlying meaning, to fixed dense word representations such as GloVe Pennington et al. (2014) and word2vec Mikolov et al. (2013) that are trained on task-independent text to provide a backbone representation for word meaning, to large contextualized representations of language Peters et al. (2018); Devlin et al. (2019) finetuned further for specific tasks. The swift change in approaches has made it hard to fully understand how representations and the data used to train them modulate the changes in system performance over time. Unlike in any prior work, we study several representations and show that the potential loss in performance and potential gain from updating the system over time is heavily representation-dependent. [1]

[1] The question of how the data used for pretraining influences changes in system performance is much harder to study because it would require substantial computational resources.

Here we present experimental evidence designed to disentangle conclusions about worsening model performance due to temporal language change (temporal model deterioration) and about the benefit from retraining systems on temporally more recent data in order to obtain optimal performance (temporal domain adaptation). We present experiments on three tasks for English: named entity recognition, sentiment classification and truecasing (restoring the conventional capitalization for English in lowercase text). For each, we analyze how the performance of approaches built on different language representations changes over consecutive time periods and how retraining on more recent periods influences it. We find that models built on pre-trained representations do not experience temporal deterioration and in fact the performance of many of the models improves with time. [2] Temporal domain adaptation is still possible, i.e. performance can be further improved by retraining on more recent human labeled data. However, with stronger pre-trained representations, the benefits from temporal adaptation diminish.

[2] For at least one of the tasks, representations are trained before the time at which any data for experiments was drawn; in others, pretraining data and task-specific data overlap in time, so some confounding is possible though unlikely.

We further find that neural models trained (fine-tuned) on the same data but initialized with random vectors for word representation exhibit dramatic temporal deterioration on the same datasets. Models powered by pre-trained language models, however, are not impacted in the same way. Correlation analysis reveals that the overlap between training and testing vocabulary is larger when test data is drawn from closer time periods (confirming vocabulary change over time) and that it is strongly positively correlated with model performance over time for models that do not use pre-training. These results provide strong evidence for model deterioration without pre-training. They also raise questions for future work on how the (mis)match between task-relevant data and pre-training data influences performance, with greater mismatch likely to be more similar to random initialization, resulting in a system more vulnerable to temporal deterioration. In contrast, for models using pretrained representations, especially for named entity recognition, overlap between training and testing data is barely correlated with performance. For truecasing and sentiment, performance is moderately to strongly correlated with train/test word and domain overlap respectively. Named entity recognition thus has a special status among the examined tasks with respect to temporal effects.

The central insight from our work is that performance of pre-trained models does not deteriorate over time but that the best performance at a given time can be obtained by retraining the system on more recent data, though the benefits vary considerably across tasks and representations. We only work with three tasks where the correctness of the label is not influenced by time, unlike other tasks such as open domain question answering where the answer to some questions (e.g. who is the CEO of X or what is the latest model of Y) depends on the time when they are posed.

Finally, we present two methods for temporal adaptation that do not require manual labeling over time. One of the approaches is based on continuous pretraining, i.e. using contemporaneous unlabeled data to update the pre-trained representations. We modify the typical domain-adaptive pretraining with an additional step to make it work for small corpora. The second adaptation method relies on self-labeling, i.e. a system labels data from later time periods with its own predictions, and that data is used to augment the older data for fine-tuning the system. The self-labeling approach is effective and superior to continuous pretraining. On one of the datasets, self-labeling is even superior to finetuning on new human-annotated data and is thus a promising method for temporal adaptation.
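The self-labeling idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: `train_fn` is a hypothetical callable that takes a list of `(input, label)` pairs and returns a model usable as a function. Details the text does not specify here, such as confidence filtering or whether the model is re-initialized before the second round, are left out.

```python
def self_label_adapt(train_fn, old_labeled, new_unlabeled):
    """Sketch of self-labeling for temporal adaptation.

    train_fn      -- hypothetical: list of (x, y) pairs -> callable model
    old_labeled   -- human-labeled (x, y) pairs from an earlier time period
    new_unlabeled -- unlabeled inputs from a later time period
    """
    # 1. Train on the older, human-labeled data.
    base_model = train_fn(old_labeled)
    # 2. Label the newer data with the model's own predictions.
    pseudo_labeled = [(x, base_model(x)) for x in new_unlabeled]
    # 3. Retrain on the older data augmented with the self-labeled data.
    adapted_model = train_fn(old_labeled + pseudo_labeled)
    return adapted_model, pseudo_labeled
```

No human annotation of the newer data is needed at any step; the only new input is unlabeled text from the later period.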

2 Background and Related Work

Language changes over time Weinreich et al. (1968); Eisenstein (2019); McCulloch (2020). For longer time periods, a robust body of computational work has proposed methods for modeling the changes in active vocabulary Dury and Drouin (2011); Danescu-Niculescu-Mizil et al. (2013) and in meaning of words Wijaya and Yeniterzi (2011); Hamilton et al. (2016); Rosenfeld and Erk (2018); Brandl and Lassner (2019). Changes in vocabulary and possibly syntax, as approximated by bi-grams in Eisenstein (2013), also occur on smaller time scales, such as days and weeks, and occur more in certain domains of discourse than in others, e.g. change is faster in social media than in printed news. Such language changes over time can also be approximated by the change in language model perplexity. Lazaridou et al. (2021) find that language model perplexity changes faster for politics and sports than for other domains, suggesting that these domains evolve faster than others. Lazaridou et al. (2021) also demonstrate that across all domains, language models do not represent well language drawn from sources published after the language model was trained: perplexity for text samples drawn from increasingly temporally distant sources steadily increases with time. Their qualitative analysis shows that the changes are not only a matter of new vocabulary being introduced; even the context in which words are used changes over time.

The global changes captured with language model perplexity and analysis of individual words cannot indicate how these changes impact the performance of a model trained for a given task. Röttger and Pierrehumbert (2021) present a meticulously executed study of how domain change (topic of discussion) influences both language models and a downstream classification task. They show that even big changes in language model perplexity may lead to small changes in downstream task performance. They also show that domain adaptation and temporal adaptation are both helpful for the downstream classification task they study, with domain adaptation providing the larger benefit.

In this paper, we also focus on the question of how temporal changes impact system performance. Unlike in prior work, we examine the performance of several architectures and representations on three different tasks. Studying temporal change in model performance requires extra care in experimental design, to tease apart the temporal aspect from all other changes that may occur between two samples of testing data Søgaard et al. (2021). Teasing apart temporal change from domain change is hardly possible. Even data drawn from the same source may include different domains over time. [3] There are however several clear desiderata for a meaningful approach to studying the most salient effects of time on performance.

[3] Huang and Paul (2018) find the topics in their data and observe that the top 20 topics change over time.

  • Does performance deteriorate over time?

To study this question of temporal model deterioration, we need to measure performance over several time periods. Let $d_1$, $d_t$ and $d_n$ denote the first, $t$-th and last temporal split in any dataset respectively. To guard against spurious conclusions that reflect specifics of data collected in time period $d_1$, the starting point for the analysis should also vary. Huang and Paul (2018) use such a setup, performing an extensive evaluation by training models on each split from $d_1$ to $d_n$ and then evaluating them on all remaining splits, both on past and future time periods. The resulting tables are cumbersome to analyze but give a realistic impression of the trends. We adopt a similar setup for our work, reporting results for a number of basic language-related tasks, trained on data from different time periods and tested on data from all subsequent time periods available. In addition, we introduce a summary statistic that captures changes across all studied time periods and present methods to visualize the temporal trends more easily, the details for which can be found in §4 and §8.

Most prior work uses a reduced evaluation setup with a fixed test time period and measures the performance of models trained on different time periods on this fixed test set. This typical setup of evaluation on one future temporal split does not measure the change in model performance over time and cannot support any conclusions about temporal deterioration. [4] This setup from prior work supports conclusions only about temporal domain adaptation, i.e. whether retraining on temporally new data helps improve performance on future data, with a single point estimate for the improvement.

[4] Lazaridou et al. (2021) omit such an evaluation since they measure language model perplexity, which is sensitive to document length, which they found differed across months. Röttger and Pierrehumbert (2021) evaluate over multiple test sets on a classification task but also omit such an evaluation by reporting the change in the metrics of models w.r.t. a control model without adaptation.

  • Can performance at time $t$ be improved?

As described above, most prior work chose $d_n$, the data from the latest available time period, as test data to evaluate models trained on earlier data. Lukes and Søgaard (2018) train a model for sentiment analysis of reviews of Amazon products in 2001-2004 and 2008-2011 and test them on reviews from 2012-2015. Rijhwani and Preotiuc-Pietro (2020) train models for NER on tweets from each year from 2014 to 2018 and test them on tweets from 2019. Søgaard et al. (2020) work with the tasks of headline generation and emoji prediction. For headline generation, they successively train models on data from 1993 to 2003 and test them on 2004. For emoji prediction, the training data comes from different days and the last one is used as the test set. Lazaridou et al. (2021) train a language model on various corpora with test data from 2018-2019 and training years that either overlap with the test years or precede them. Such results allow us to draw conclusions about the potential for temporal domain adaptation, revealing that models trained on data closer to the test year perform better on that test year. The only problem is that a single test year is chosen, and any anomaly in that test year may lead to misleading results. An evaluation setup much like the one in Huang and Paul (2018) or the much more recent work in Röttger and Pierrehumbert (2021) is needed to draw robust conclusions. As noted above, we adopt their setup with some changes and additionally provide visualization plots to easily interpret trends.

  • Is the experimental set-up realistic?

The model checkpoint that performs best on the development set is typically chosen as the final model to be tested. Yet much of the prior work either does not report details of what data was used for choosing parameters, or uses default hyperparameters, or selects them randomly, or draws the development set from the same year as the test set year, which is in the future at the time of model development. We choose development data from the time period of the training data since data from a future time period will not be available to use as the development set during training.

  • Are findings specific to a model?

Given the variety of language representations, it is tempting to choose one for experimentation and assume that findings carry over to other representations. To test whether they do, we present results using four different representations. The results reveal that findings on both temporal model deterioration and temporal domain adaptation vary vastly across representations. Powerful neural pretrained representations barely experience temporal model deterioration and afford less room for temporal domain adaptation than found in prior work. At first glance, our findings may appear to contradict findings from prior work. They appear more compatible though when we note that most of the foundational work discussing temporal effects on model performance studied bag-of-words models Fromreide et al. (2014); Lukes and Søgaard (2018); Huang and Paul (2018). Given that bag-of-words models are rarely used now, we do not perform experiments with them. Instead, we provide results with biLSTM representations initialized with random vectors for word representations. These learn only from the training data and their performance mirrors many of the trends reported in older work.

3 Experimental Resources

Here we describe the datasets used, the different models and the temporal overlap between them.

3.1 Tasks and Datasets

We use three English datasets for fundamental tasks, two for sequence labeling and one for text classification.

Named Entity Recognition with Temporal Twitter Corpus

TTC Rijhwani and Preotiuc-Pietro (2020) consists of tweets annotated for PER, LOC and ORG entities. There are 2,000 tweets for each year from the period 2014–2019. TTC is the only corpus with human annotations specifically collected in order to study temporal effects on performance. Other datasets, including the two we describe next, in fact have derived annotations that do not require annotation effort. TTC is a highly valuable contribution, without which our work would have been impossible.

Sentiment Classification with Amazon Reviews

AR Ni et al. (2019) consists of 233M product reviews rated on a scale of 1 to 5. Following prior work Lukes and Søgaard (2018), we model this task as binary classification, treating ratings greater than 3 as positive and the rest as negative. We randomly sample 40,000 reviews per year from the period 2001–2018 and organize the data with three consecutive years per split. Only the first 50 words of each review are used.
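The preprocessing described above (binary labels from ratings, three-year splits, truncation to 50 words) can be sketched as follows. The `(year, rating, text)` tuple schema and the whitespace tokenization are assumptions made for illustration, not details given in the text.

```python
from collections import defaultdict


def binarize(rating):
    """Ratings above 3 are positive (1); ratings of 3 or below are negative (0)."""
    return 1 if rating > 3 else 0


def split_label(year, first_year=2001, width=3):
    """Map a year to its three-year temporal split, e.g. 2005 -> '2004-2006'."""
    start = first_year + ((year - first_year) // width) * width
    return f"{start}-{start + width - 1}"


def prepare(reviews, max_words=50):
    """Group reviews by temporal split, keeping only the first max_words words.

    reviews -- iterable of (year, rating, text) tuples (hypothetical schema)
    """
    splits = defaultdict(list)
    for year, rating, text in reviews:
        truncated = " ".join(text.split()[:max_words])  # whitespace tokens (assumption)
        splits[split_label(year)].append((truncated, binarize(rating)))
    return dict(splits)
```

With 2001 as the first year and a width of three, the 18 years map onto the six splits 2001-2003 through 2016-2018.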

Truecasing with New York Times

Truecasing Gale et al. (1995); Lita et al. (2003) is the task of case restoration in text. We sample a dataset from the NYT Annotated Corpus Sandhaus (2008), which has sentences that follow English orthographic conventions. We perform a constrained random sampling of 10,000 sentences per year from 1987–2004 and organize the data with three consecutive years per split. To maintain diversity of text, we select an approximately equal number of sentences from each domain (indicated by metadata tags) and only two sentences per article. Sentences must have at least one capitalized word other than the first word and must not be completely in uppercase (headlines appear in all uppercase). We model the task as sequence labeling with binary word labels of fully lowercase or not.
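As a rough illustration of the sentence filter and the binary labeling scheme described above, the sketch below assumes sentences are already tokenized into word lists; the function names are hypothetical and the exact tokenization used for the experiments is not specified here.

```python
def keep_sentence(tokens):
    """Keep a sentence only if some word after the first is capitalized
    and the sentence is not written entirely in uppercase."""
    has_capitalized = any(tok[:1].isupper() for tok in tokens[1:])
    alphabetic = [tok for tok in tokens if any(ch.isalpha() for ch in tok)]
    all_uppercase = all(tok.isupper() for tok in alphabetic)
    return has_capitalized and not all_uppercase


def truecasing_example(tokens):
    """Return (lowercased inputs, labels): label 0 if the original word is
    fully lowercase, 1 otherwise (the 'cased' class)."""
    inputs = [tok.lower() for tok in tokens]
    labels = [0 if tok == tok.lower() else 1 for tok in tokens]
    return inputs, labels
```

Punctuation tokens are unchanged by lowercasing, so they fall into the lowercase class under this labeling.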

3.2 Models

We use two architectures (biLSTM-CRF and Transformers) and four representations (GloVe, ELMo, BERT, RoBERTa) for the experiments.


GloVe+char

BiLSTM Hochreiter and Schmidhuber (1997) with 840B-300d-cased GloVe Pennington et al. (2014) and character-based word representation Ma and Hovy (2016) as input. For sequence labeling, a CRF Lafferty et al. (2001) layer is added and a prediction is made for each word. For text classification, the representation of the first word is used to make the prediction.

ELMo+GloVe+char [5]

Same as GloVe+char but the original ELMo Peters et al. (2018) embeddings are added to the input as well.

[5] This combination yields better results than ELMo alone.

BERT

Devlin et al. (2019). We use the large model for sequence labeling and the base model for text classification, both cased. Text classification has more training examples, and the base model is much faster there with only minimally lower performance than the large one.

RoBERTa

Liu et al. (2019). We use the large model for sequence labeling and the base model for text classification.

The first two models are trained using the code from Lample et al. (2016), modified to add ELMo and to perform sentence classification. For NER, we use a batch size of 20 and train for a maximum of 50 epochs with a minimum of 2,500 steps. For sentiment classification and truecasing, we use a batch size of 100 and train for a maximum of 10 epochs with a minimum of 1,500 steps. BERT and RoBERTa are finetuned using the implementation in HuggingFace Wolf et al. (2019) with a learning rate of 5e-06 and a maximum sequence length of 128. For NER, we finetune for a maximum of 6 epochs with a batch size of 2 (smaller batch sizes resulted in better performance). For sentiment classification and truecasing, we finetune for a maximum of 3 epochs. Batch size is 32 for sentiment and 16 for truecasing (the maximum value given the resources available to us). For the remaining hyperparameters we use the default values in the respective repositories. All model finetuning uses the AdamW optimizer with epsilon 1e-8.

| Model/Task | Corpus | Time Span |
|---|---|---|
| Task Dataset | | |
| NER | TTC | 2014-2019 |
| Sentiment | Amazon Reviews | 2001-2018 |
| Truecasing | NYT | 1987-2004 |
| Pretraining Data | | |
| GloVe | Common Crawl | till 2014 |
| ELMo | 1B Benchmark | till 2011 |
| BERT | Wikipedia | Jan 2001-2018 |
| | BookCorpus | till 2015 |
| RoBERTa | Wikipedia | Jan 2001-2018 |
| | BookCorpus | till 2015 |
| | CC-News | Sept 2016-Feb 2019 |
| | OpenWebText | till 2019 |
| | Stories | till 2018 |

Table 1: Time span for all datasets and corpora. All corpora only include English data. Where the actual time span is unknown, we note the publication date of the dataset/paper instead.

3.3 Task and Pre-training Data Time Span

To perform clear experiments, one would need to control the time period of not only the task dataset but also the pre-training corpora. We report the time span for each dataset and the pre-training corpus of each model in Table 1. This table makes it easy to spot where cleaner experimental design may be needed for future studies.

For several corpora, the actual time span is unknown so we report the time of dataset/paper publication instead. Most pre-training corpora overlap with the task dataset time spans, making it hard to isolate the impact of temporal changes. BERT is trained on Wikipedia, containing data from its launch in 2001 till 2018, and on 11k books spanning an unknown time period. RoBERTa uses all of the data used by BERT, in addition to several other corpora. The pre-training data of BERT and RoBERTa overlaps with all training and evaluation periods of the datasets used.

We also use GloVe and ELMo representations, which do not overlap with the TTC dataset. Yet, as we report in later sections, model performance over time increases when ELMo is used, and the potential for temporal domain adaptation is compressed compared to GloVe. GloVe was released in 2014, hence is trained on data prior to 2014. ELMo uses the 1B benchmark for pretraining, which has data from WMT 2011. Neither overlaps with the TTC data (2014-2019). They also do not overlap with a portion of the Amazon Reviews dataset (2016-2018). Therefore, we have at least a subset of experiments free of confounds due to pre-training time overlap. If the observed trends hold across both sets of experiments, with and without overlap (which they do, as we report later), it will point to a lower impact of the pre-training time period on the downstream task. Instead, any improved performance is the result of either more data or a change in architecture.

Regardless of all this, an important set of experiments for future work would involve pre-training the best performing model on different corpora controlled for time and comparing their performance. Such an extensive set of experiments would require significant computational resources as well as time. Because of this, prior work has, like us, worked with off-the-shelf pretrained models. For instance, Röttger and Pierrehumbert (2021) control the time period for the data used for intermediate pre-training in their experiments, but they start their experiments with BERT, which is pre-trained on corpora that overlap temporally with their downstream task dataset.

For future work, we emphasize the need to report the time period of any data used to support research on temporal model deterioration and temporal domain adaptation. There are additional confounds such as the time span of the topics being discussed vs the time span when the text was written. From a model deployment perspective, the latter is more relevant.

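| Model | Test Year | Train 2014 | Train 2015 | Train 2016 | Train 2017 | Train 2018 |
|---|---|---|---|---|---|---|
| GloVe+char biLSTM-CRF | 2015 | 55.18 | - | - | - | - |
| | 2016 | 56.22 | 57.13 | - | - | - |
| | 2017 | 55.09 | 53.95 | 59.43 | - | - |
| | 2018 | 51.06 | 53.12 | 57.75 | 57.82 | - |
| | 2019 | 54.10 | 54.56 | 59.48 | 60.41 | 62.99 |
| ELMo+GloVe+char biLSTM-CRF | 2015 | 59.58 | - | - | - | - |
| | 2016 | 61.08 | 63.36 | - | - | - |
| | 2017 | 59.66 | 60.35 | 60.84 | - | - |
| | 2018 | 60.51 | 60.30 | 61.77 | 61.74 | - |
| | 2019 | 63.09 | 62.97 | 64.62 | 64.50 | 68.73 |
| BERT | 2015 | 64.69 | - | - | - | - |
| | 2016 | 65.60 | 66.89 | - | - | - |
| | 2017 | 65.90 | 65.62 | 66.68 | - | - |
| | 2018 | 64.08 | 64.59 | 65.76 | 65.17 | - |
| | 2019 | 71.70 | 73.68 | 73.06 | 74.05 | 76.20 |
| RoBERTa | 2015 | 67.48 | - | - | - | - |
| | 2016 | 69.41 | 72.02 | - | - | - |
| | 2017 | 68.30 | 70.53 | 70.29 | - | - |
| | 2018 | 67.82 | 68.33 | 69.29 | 68.60 | - |
| | 2019 | 77.79 | 78.33 | 78.89 | 78.28 | 79.99 |

Table 2: F1 for NER on TTC. Training is on gold-standard data.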
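
| Model | Test Year | Train 01-03 | Train 04-06 | Train 07-09 | Train 10-12 | Train 13-15 |
|---|---|---|---|---|---|---|
| GloVe+char biLSTM | 04-06 | 44.88 | - | - | - | - |
| | 07-09 | 44.67 | 44.18 | - | - | - |
| | 10-12 | 44.79 | 44.79 | 56.72 | - | - |
| | 13-15 | 44.65 | 45.97 | 58.48 | 60.27 | - |
| | 16-18 | 42.75 | 45.36 | 59.52 | 62.24 | 64.69 |
| ELMo+GloVe+char biLSTM | 04-06 | 55.32 | - | - | - | - |
| | 07-09 | 56.59 | 59.01 | - | - | - |
| | 10-12 | 57.83 | 60.70 | 61.24 | - | - |
| | 13-15 | 58.14 | 61.53 | 63.76 | 63.09 | - |
| | 16-18 | 57.50 | 62.50 | 65.89 | 64.91 | 69.06 |
| BERT | 04-06 | 63.09 | - | - | - | - |
| | 07-09 | 63.58 | 65.62 | - | - | - |
| | 10-12 | 64.64 | 67.02 | 68.16 | - | - |
| | 13-15 | 65.82 | 68.62 | 70.34 | 71.19 | - |
| | 16-18 | 65.88 | 69.62 | 72.07 | 72.98 | 75.16 |
| RoBERTa | 04-06 | 69.91 | - | - | - | - |
| | 07-09 | 71.01 | 72.29 | - | - | - |
| | 10-12 | 72.26 | 73.77 | 73.73 | - | - |
| | 13-15 | 72.97 | 74.67 | 75.32 | 75.77 | - |
| | 16-18 | 73.90 | 76.26 | 76.89 | 77.28 | 78.91 |

Table 3: F1 of the negative class on Amazon Reviews. Training is on gold-standard data.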

4 Experimental Setup

We divide each dataset into temporal splits with an equal number of sentences for sequence labeling and an equal number of documents for text classification, to minimize any performance difference due to the size of the split. We randomly downsample to the size of the smallest temporal split whenever necessary. Let $d_1$, $d_t$ and $d_n$ denote the first, $t$-th and last temporal split in the dataset respectively.
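A minimal sketch of the downsampling step, assuming each temporal split is held in memory as a list of examples keyed by a split name (the dictionary layout and the function name are illustrative choices, not part of the original setup):

```python
import random


def downsample_splits(splits, seed=0):
    """Randomly downsample every temporal split to the size of the smallest one.

    splits -- dict mapping a split name to a list of examples (hypothetical layout)
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    target = min(len(examples) for examples in splits.values())
    return {name: rng.sample(examples, target) for name, examples in splits.items()}
```

Sampling without replacement keeps every split the same size, so differences in scores across splits cannot be attributed to the amount of data.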

Train and Test Set

We largely follow Huang and Paul (2018), with minor clarifications on certain aspects as well as additional constraints due to the difference in dataset size across tasks, ensuring consistency in setup. First, we vary both training and test year but limit the evaluation to future years, since we want to mimic the practical setup of model deployment. We train $n-1$ models, one on each split from $d_1$ to $d_{n-1}$, and evaluate the model trained on split $d_t$ on test sets from $d_{t+1}$ to $d_n$. With these results, a lower triangular matrix can be created with the training years as the columns and the test years as the rows. A sample can be seen in Table 2.

Next, we need to further divide each temporal split into three sub-splits for training, development and testing. We are limited by our smallest dataset, on NER, which is by far the hardest to label and is the only task that requires actual data annotation. It has 2,000 sentences in each year, and splitting it into three parts will not provide us with enough data to train a large neural model or reliably evaluate it. Hence, we do not evaluate on the current year but only on the future ones. When training a model on $d_t$, the split is divided 80-20 into a training set and a development set. Both these sets combined, i.e. the full $d_t$, serve as the test set when a model trained on an earlier split is evaluated on it.
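The train/test protocol above can be sketched as follows: every split except the last is used for training, each trained model is evaluated only on strictly later splits, and the training split itself is divided 80-20 into training and development data. The function names and the shuffling seed are assumptions for illustration.

```python
import random


def evaluation_pairs(split_names):
    """All (train_split, test_split) pairs where the test split is strictly later.

    split_names -- temporal splits in chronological order
    """
    n = len(split_names)
    return [(split_names[i], split_names[j])
            for i in range(n - 1)
            for j in range(i + 1, n)]


def train_dev_split(examples, dev_fraction=0.2, seed=0):
    """Shuffle one temporal split and divide it 80-20 into train and dev."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = round(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]
```

For six yearly splits this yields 15 train/test pairs, which are exactly the filled cells of a lower triangular results matrix with five training columns and five test rows.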

Development Set

We selected the development set from the training year as described above, reserving 20% of the data in each temporal split. Selecting the development set from the test year is less feasible in practice, and through experiments (shown in the appendix) we found that it may also affect observed performance trends and even lead to exaggerated improvement for temporal domain adaptation.


We report all metrics by averaging them over three runs with different seeds. For NER, we report the span-level micro-F1 over all entity classes. For sentiment classification, we report F1 for the negative sentiment class and for truecasing, we report F1 for the cased class. The positive sentiment class and uncased words form about 80% of the data in their respective tasks, and performance on them is largely (but not completely) unaffected over time.
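For the two tasks scored on a single class (negative sentiment, cased words), the per-class F1 can be computed as in the sketch below. It is a plain reference computation over parallel label lists; it is not the evaluation script used for the reported numbers, and span-level micro-F1 for NER would need span matching that is not shown here.

```python
def class_f1(gold, pred, target):
    """F1 for one target class, given parallel lists of gold and predicted labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == target and p == target)
    fp = sum(1 for g, p in zip(gold, pred) if g != target and p == target)
    fn = sum(1 for g, p in zip(gold, pred) if g == target and p != target)
    if tp == 0:
        return 0.0  # no correct predictions for the target class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_over_runs(scores):
    """Average a metric over runs with different seeds."""
    return sum(scores) / len(scores)
```

Scoring only the minority class keeps the roughly 80% majority class from masking changes over time.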

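| Model | Test Year | Train 87-89 | Train 90-92 | Train 93-95 | Train 96-98 | Train 99-01 |
|---|---|---|---|---|---|---|
| GloVe+char biLSTM-CRF | 90-92 | 93.76 | - | - | - | - |
| | 93-95 | 93.08 | 93.79 | - | - | - |
| | 96-98 | 93.34 | 93.35 | 93.40 | - | - |
| | 99-01 | 93.00 | 92.98 | 92.92 | 93.60 | - |
| | 02-04 | 93.01 | 93.01 | 92.89 | 93.51 | 94.59 |
| ELMo+GloVe+char biLSTM-CRF | 90-92 | 94.41 | - | - | - | - |
| | 93-95 | 93.82 | 94.46 | - | - | - |
| | 96-98 | 93.87 | 94.06 | 94.19 | - | - |
| | 99-01 | 93.42 | 93.75 | 93.86 | 93.98 | - |
| | 02-04 | 93.35 | 93.52 | 93.73 | 93.89 | 95.08 |
| BERT | 90-92 | 97.17 | - | - | - | - |
| | 93-95 | 95.45 | 95.76 | - | - | - |
| | 96-98 | 94.54 | 94.70 | 94.72 | - | - |
| | 99-01 | 96.41 | 96.57 | 96.56 | 97.04 | - |
| | 02-04 | 94.01 | 94.11 | 94.12 | 94.56 | 94.57 |
| RoBERTa | 90-92 | 97.50 | - | - | - | - |
| | 93-95 | 95.83 | 96.11 | - | - | - |
| | 96-98 | 94.92 | 95.00 | 95.08 | - | - |
| | 99-01 | 96.74 | 96.87 | 96.97 | 97.28 | - |
| | 02-04 | 94.36 | 94.48 | 94.57 | 94.88 | 95.59 |

Table 4: F1 of the cased words class for truecasing on NYT. Training is on gold-standard data.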

5 Detailed Results

Results are shown in Tables 2, 3 and 4 for NER, sentiment and truecasing respectively.

Temporal model deterioration

can be tracked over the columns in the tables. Each column represents the performance of a fixed model over time on future data. We do not observe temporal deterioration for any of the datasets, representations or training years, with one minor exception: truecasing shows minimal deterioration, but only after a very large time gap and even then inconsistently. Generally, model performance either fluctuates or improves. This happens regardless of whether the pretraining data time period overlaps with the training and evaluation period.

Temporal Domain Adaptation

can be tracked over the rows in the tables. Each row represents performance on a fixed test set starting with models trained on data farthest away to the temporally nearest data. Performance increases as the models are retrained on data that is temporally closer to the test year. The results are consistent with prior work that uses non-neural models or evaluates on a single test set. However, the full evaluation matrix shows that the extent of improvement by retraining varies considerably by the test year.

Task-based Differences

The trend of no or little model deterioration but possible temporal adaptation is consistent across the three datasets. The amount of improvement via temporal adaptation, however, differs considerably. The largest improvement via retraining is for the sentiment classification task, followed by NER. It is worth noting that the NER dataset spans 6 years whereas the sentiment classification dataset spans 18 years, so more improvement may be observed for NER over a similarly large time gap. The change in performance on truecasing is almost non-existent.

Figure 1 (panels: (a) BERT, (b) RoBERTa): Model variability across different test set samples as average, minimum and maximum F1 of negative class for sentiment. The model is trained on 2001-2003 and evaluated on multiple samples for each future time period. No model degradation is observed on any sample.

Model-based Differences

There is a consistent trend across the tasks of diminishing gains from temporal domain adaptation as more powerful pre-trained representations are used. Consider the 2019 test set of TTC for NER and the first and second to last temporal splits in the dataset. The gain in F1 by training on 2018 over training on 2014 is 8.89, 5.64, 4.50 and 2.20 for GloVe, GloVe+ELMo, BERT and RoBERTa respectively. Similarly, for sentiment classification using Amazon Reviews on the last temporal split of 2016-2018, the gain in F1 by retraining on 2013-2015 over training on 2001-2003 is 21.94, 11.56, 9.28 and 5.01 for GloVe, GloVe+ELMo, BERT and RoBERTa respectively.
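The NER gains quoted above follow directly from the 2019 test row of Table 2. The short check below recomputes them; the dictionary layout and the function name are illustrative, while the scores are copied from the table.

```python
# F1 on the 2019 TTC test set (Table 2), for models trained on 2014 and on 2018.
ner_2019 = {
    "GloVe": {"2014": 54.10, "2018": 62.99},
    "GloVe+ELMo": {"2014": 63.09, "2018": 68.73},
    "BERT": {"2014": 71.70, "2018": 76.20},
    "RoBERTa": {"2014": 77.79, "2018": 79.99},
}


def adaptation_gain(scores, old, new):
    """Gain in F1 from retraining on the newer split instead of the older one."""
    return round(scores[new] - scores[old], 2)


gains = {model: adaptation_gain(scores, "2014", "2018")
         for model, scores in ner_2019.items()}
for model, gain in gains.items():
    print(f"{model}: +{gain:.2f} F1")
```

The gain shrinks monotonically from GloVe to RoBERTa, which is the diminishing-returns pattern the paragraph describes.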

Test Year    Train Year

NER-TTC
             2014     2015     2016     2017     2018
2015         21.82    -        -        -        -
2016         17.56    22.56    -        -        -
2017         14.41    16.55    24.48    -        -
2018         12.82    15.22    20.12    17.49    -
2019         10.10    15.17    17.28    16.58    23.94

Sentiment-AR
             01-03    04-06    07-09    10-12    13-15
04-06        41.61    -        -        -        -
07-09        42.24    44.54    -        -        -
10-12        42.01    45.35    47.44    -        -
13-15        40.10    43.31    48.92    52.30    -
16-18        37.74    42.19    49.60    53.89    59.73

Truecasing-NYT
             87-89    90-92    93-95    96-98    99-01
90-92        88.96    -        -        -        -
93-95        87.35    88.62    -        -        -
96-98        86.93    87.16    86.85    -        -
99-01        87.30    87.44    86.73    88.99    -
02-04        86.03    86.03    85.30    88.79    88.16

Table 5: F1 for NER, F1 of the negative class for sentiment and F1 of the cased class for truecasing, using biLSTM-CRF with randomly initialized word embeddings.
Time-Diff Random ELMo+GloVe BERT RoBERTa
% test-words-in-train+dev
% test-entity-spans-in-train+dev
% test words in train+dev
% test reviews with domain in train
% test-words-in-train+dev
% test-cased-words-in-train+dev
Table 6: Spearman correlation of dataset properties with train-test time difference and model performance. * denotes significance at the 0.05 level without Bonferroni correction and ** denotes significance with correction.

New Pretrained Representation vs New Labeled Data

We compare the gain from annotating new data to the gain from using a newer pre-trained representation (this includes changes to both the pre-training data and the model architecture). Consider again the 2019 test set in TTC. We get an F1 of 63.09 with a model trained on 2014 data with ELMo+GloVe. It increases to 68.73 by retraining on 2018 data. But by switching the representation to BERT and training on 2014 data, the F1 is 71.70, higher than that obtained by updating the training data. Similarly, the F1 using BERT increases from 71.70 to 76.20 by retraining on 2018 data. With RoBERTa, the F1 is already 77.79 using 2014 training data.

The trend is not the same for Amazon Reviews. Consider the last temporal split of 2016-2018. With ELMo+GloVe finetuned on 2001-03, the F1 is 57.50 and increases to 69.06 by retraining on 2013-2015. Changing the representation to BERT instead of retraining on new data gives an F1 of 65.88, which is lower than 69.06, unlike TTC. Using RoBERTa though, the F1 is 73.90, an improvement over retraining. So for this task, the benefit of retraining on new data versus using a more recent state-of-the-art representation varies by representation, unlike TTC, where a more powerful representation trained on old data always yields better results than retraining a model with a weaker representation. An ideal study would properly control the pretraining, training and evaluation time periods, and their overlap. Such a study would require considerable resources for pre-training many models.
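
The comparison in this section amounts to two subtractions per case. The sketch below is only an illustration of that arithmetic using F1 values quoted in the text; the function name `compare` and the dictionary layout are assumptions made for this example, not part of the original study.

```python
# Each entry: (F1 with old representation + old data,
#              F1 with old representation + new data,
#              F1 with new representation + old data).
# All numbers are the F1 values quoted in the text above.
CASES = {
    "TTC 2019: BERT -> RoBERTa": (71.70, 76.20, 77.79),
    "Amazon 16-18: ELMo+GloVe -> BERT": (57.50, 69.06, 65.88),
    "Amazon 16-18: ELMo+GloVe -> RoBERTa": (57.50, 69.06, 73.90),
}


def compare(baseline, retrained, new_repr):
    """Return (gain from new data, gain from new representation, winner)."""
    gain_data = retrained - baseline
    gain_repr = new_repr - baseline
    winner = "new representation" if gain_repr > gain_data else "new labeled data"
    return gain_data, gain_repr, winner


if __name__ == "__main__":
    for name, scores in CASES.items():
        gain_data, gain_repr, winner = compare(*scores)
        print(f"{name}: data +{gain_data:.2f}, representation +{gain_repr:.2f} -> {winner}")
```

Only the Amazon ELMo+GloVe to BERT case favors new labeled data, which matches the observation that the outcome for sentiment depends on the representation.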

6 Data Sample Variability

Since the above section reaches an unexpected conclusion about the lack of model deterioration, we confirm our findings across several samples of test data to account for variability in sampling. We perform this experiment for the sentiment analysis dataset, where much more labeled data is available, but not for NER, where there is no extra labeled data, or truecasing, where there isn’t any drastic change. Similar to the above experiments, we sample 40,000 random reviews in each year, resulting in 120,000 reviews in each time period of three consecutive years. In addition to the sample used above, we select five more random samples from each temporal bucket, for a total of six samples. The samples are selected with replacement and may have overlapping instances. We then train a model on the oldest time period and evaluate it on all remaining time periods as described above. Again, we train a total of three times with different seeds and take the average for each sample for a time period. The results are shown in Figure 1. We plot three lines for the average, maximum and minimum F1 over the six test samples. All three metrics follow the same trend for all representations, i.e., no model deterioration. With this, we conclude that we can indeed compare the performance metrics of a model on a specific task on different test sets. As in all other experiments, we did control the test set size, i.e., the total number of instances in each sample of a time period.
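
The aggregation described above (average over seeds per sample, then average, minimum and maximum over samples) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and drawing each sample independently with `random.Random.sample` is an assumption about how "samples may have overlapping instances" was realized.

```python
import random
from statistics import mean


def f1_for_class(gold, pred, target):
    """F1 of one class (e.g. the negative sentiment class)."""
    tp = sum(g == target and p == target for g, p in zip(gold, pred))
    fp = sum(g != target and p == target for g, p in zip(gold, pred))
    fn = sum(g == target and p != target for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def draw_test_samples(pool, sample_size, num_samples, seed=0):
    """Draw several equally sized test samples from one temporal bucket.

    Each sample is drawn independently, so different samples may share
    instances (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    return [rng.sample(pool, sample_size) for _ in range(num_samples)]


def summarize(f1_by_sample):
    """f1_by_sample[s] holds one F1 per training seed for test sample s.

    Returns (average, minimum, maximum) of the seed-averaged F1 over samples.
    """
    per_sample = [mean(seed_scores) for seed_scores in f1_by_sample]
    return mean(per_sample), min(per_sample), max(per_sample)
```

Keeping `sample_size` fixed across time periods mirrors the control on test set size mentioned at the end of the paragraph.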

(a) NER-TTC
(b) Sentiment-AR
(c) Truecasing-NYT
Figure 2: Average temporal change in span-level F1 for NER, F1 of negative class for sentiment and F1 of the cased class for truecasing using RoBERTa. The left side of the plot shows the change in model performance as it is evaluated on future years. Only truecasing experiences temporal deterioration whereas performance unexpectedly improves for the other two tasks. The right side of the plot shows the change in performance on a fixed future test set due to retraining on more recent data. The gain is higher as the original model becomes older and thus more stale.

7 Pre-training and Model Deterioration

Above we found that models powered by pre-trained representations do not manifest temporal deterioration. We seek to confirm that pre-training is the cause for this unexpected finding. We experiment with models that do not make use of pre-trained representations and measure whether they exhibit temporal deterioration. We use biLSTM(-CRF) initialized with random word embeddings instead of pretrained ones. The results are shown in Table 5. As suspected, these models show temporal deterioration in performance.

We also measure the correlation between different dataset properties of training and testing data overlap with the time difference between the training and test year and with the performance of each of the models. We report the Spearman correlations in Table 6 and indicate statistical significance both with and without Bonferroni correction for multiple tests. Values significant at 0.05 level without and with multi-test correction are marked with one and two asterisks respectively.
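
As an illustration of the statistic used here, the sketch below computes a Spearman correlation with the standard library only (rank both variables with average ranks for ties, then take the Pearson correlation of the ranks). The overlap properties of Table 6 are not reproduced in this text, so the example instead correlates the train-test gap with the F1 of the randomly initialized NER model from Table 5; that pairing, the function names, and the number of tests passed to the Bonferroni helper are assumptions made purely for illustration.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # mean of ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


def bonferroni_alpha(alpha, num_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / num_tests


# Train-test gap in years and F1 for every filled cell of the NER block of Table 5.
GAPS = [1, 2, 1, 3, 2, 1, 4, 3, 2, 1, 5, 4, 3, 2, 1]
F1 = [21.82, 17.56, 22.56, 14.41, 16.55, 24.48, 12.82, 15.22,
      20.12, 17.49, 10.10, 15.17, 17.28, 16.58, 23.94]

if __name__ == "__main__":
    print(f"Spearman rho (gap vs. F1): {spearman(GAPS, F1):.2f}")
    print(f"Bonferroni threshold for 10 tests: {bonferroni_alpha(0.05, 10):.4f}")
```

For this randomly initialized model the correlation is strongly negative (roughly -0.9), in line with the deterioration visible in Table 5. Significance testing, which the paper reports with and without correction, would additionally require a p-value for each correlation and is not sketched here.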

For NER, the degree of train-test overlap has a high correlation with train-test time difference and with the performance of the randomly initialized model; it does not correlate well with the performance of pretrained models. For truecasing, the correlations are higher but with a weaker significance. For sentiment, the train-test time difference does not correlate with the vocabulary overlap but has a high negative correlation with the number of test reviews with a domain (e.g. Books, Hardware) seen in the training data. The introduction of new products/domains over time impacts performance rather than the change in vocabulary. For the domain overlap property that is correlated with the train-test time difference, we observe the same trend as for NER. This property has a high significant correlation with the performance of the model with the random initialization but not the ones with the pretrained embeddings.

The temporal deterioration of the model with randomly initialized embeddings, together with the correlation analysis, supports the conclusion that pre-training is the reason for the absence of temporal deterioration. Pre-training on unlabeled data injects background knowledge into models beyond the training data and has led to significant improvement on many NLP tasks. It also helps avoid temporal deterioration in models, making deployed models more (though not completely) robust to changes over time.

                  Average Deterioration Score                   Average Adaptation Score
                  Random   GloVe   ELMo+Gl   BERT    RoBERTa    Random   GloVe   ELMo+Gl   BERT    RoBERTa

W.r.t. previous time period
NER-TTC           -2.72    -0.10   0.97      2.91    3.49       3.36     2.09    1.03      0.67    0.76
Sentiment-AR      -0.25    0.38    1.21      1.25    1.26       4.19     4.90    2.23      2.02    1.06
Truecasing-NYT    -0.73    -0.21   -0.26     -0.79   -0.77      0.50     0.29    0.33      0.17    0.22

W.r.t. anchor time period
NER-TTC           -6.56    -1.31   0.70      2.71    3.18       6.41     4.07    1.49      1.14    1.39
Sentiment-AR      -0.19    0.75    2.55      2.38    2.46       8.97     10.30   5.50      4.70    2.49
Truecasing-NYT    -1.53    -0.57   -0.61     -1.14   -1.12      0.68     0.32    0.53      0.29    0.35

Table 7: Deterioration and adaptation scores for models fine-tuned on gold-standard data. The first set of rows shows the scores w.r.t. the previous time period and the second set of rows shows the scores w.r.t. the anchor time period.

8 Compact Presentation of Findings

The results in Tables 2, 3 and 4 are comprehensive but overwhelming. In Figure 2, we present a more compact view of the results for RoBERTa, the representation that leads to best overall performance on all three tasks. The goal is to more clearly illustrate that we find little evidence for temporal model deterioration and that there is only limited room for improvement via temporal domain adaptation.

For each task, we plot on the left the average difference in performance of each model when there is a gap of 1, 2, 3 or 4 time periods between two test sets. Specifically, for each model trained on a given time period, we measure the difference in performance on a later test set w.r.t. an earlier test set and plot the average over all available models and test-set pairs for each gap size. This measure captures the overall change in performance for all models reported in the tables. The mean and median (shown by a dot and line respectively) are often far apart, indicating that there are outliers and that using a fixed test sequence, as done in prior work, can be misleading. More importantly, there is a net improvement (instead of deterioration) in performance for NER and sentiment for all three gaps, small for sentiment and big and variable for NER. Truecasing does show a trend of temporal model deterioration, by less than 3 points of F1.


On the right side of the figure for each task, we plot the analogous potential for temporal domain adaptation by retraining on manually labelled, temporally closer data. Specifically, for each test set, we measure the difference in performance using a model trained on a more recent time period w.r.t. a model trained on an older time period and plot the average over all available test sets and training-period pairs for each gap size. These are small for NER and truecasing. Sentiment prediction shows the most potential for adaptation, with potential gains as high as 3 points in F1 for the largest temporal gap.

For an even more compact representation, we also report a summary deterioration score (DS) and adaptation score (AS) in Table 7. The deterioration score measures the average change in the performance of a model over time. A negative score indicates that the performance has deteriorated. Similarly, the adaptation score measures the average improvement in performance by retraining on recent data, labeled or unlabeled (§9). A positive score means performance improves by retraining. For each score, we report two versions, one that measures the change between subsequent time periods and the other that measures the change with respect to an anchor time period. For measuring deterioration, the anchor is the oldest test time period for the given model, i.e., if a model is trained on a given time period, then the time period immediately after it is the anchor (first row in results matrix). For measuring adaptation, the anchor is the oldest train time period, so the anchor is always the first time period in the dataset (first column in results matrix). Let the evaluation metric be indexed by the test split on which it is measured and the data split on which the model is trained. Also let N be the number of elements in the sum. Then, each of the summary scores can be defined as follows.
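
One way to write the four scores that is consistent with the verbal description is given below. The notation is an assumption introduced here: the temporal splits are numbered $t_0, \dots, t_n$, $M_i^j$ is the metric on test split $t_j$ for a model trained on $t_i$ (with $j > i$), and the superscripts "prev" and "anchor" are labels chosen for this sketch. With these index ranges, the formulas reproduce the Random column for NER-TTC in Table 7 when applied to the NER block of Table 5.

```latex
\begin{align*}
DS^{\text{prev}}   &= \frac{1}{N}\sum_{i=0}^{n-2}\;\sum_{j=i+2}^{n}\left(M_i^{j} - M_i^{j-1}\right) \\
DS^{\text{anchor}} &= \frac{1}{N}\sum_{i=0}^{n-2}\;\sum_{j=i+2}^{n}\left(M_i^{j} - M_i^{i+1}\right) \\
AS^{\text{prev}}   &= \frac{1}{N}\sum_{j=2}^{n}\;\sum_{i=1}^{j-1}\left(M_i^{j} - M_{i-1}^{j}\right) \\
AS^{\text{anchor}} &= \frac{1}{N}\sum_{j=2}^{n}\;\sum_{i=1}^{j-1}\left(M_i^{j} - M_0^{j}\right)
\end{align*}
```

In each case $N$ is the number of differences in the double sum, which is $n(n-1)/2$; for the six splits of TTC ($n = 5$) this gives $N = 10$.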


All datasets have negative deterioration scores with the randomly initialized model without pre-training, indicating a drop in performance. For models with pre-trained representations, NER-TTC and Sentiment-AR have positive deterioration scores (except NER with GloVe), indicating there is no model deterioration. However, the adaptation scores are positive, showing an increase in performance by retraining. Truecasing is the only task out of the three that shows deterioration even with pre-trained models, but its adaptation score is very low, showing minimal improvement by retraining.
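
The summary scores can be recomputed from a results matrix. The sketch below is an illustration rather than the authors' implementation: the function names and the `None` placeholder for empty cells are assumptions. It uses the NER block of Table 5 (randomly initialized biLSTM-CRF), for which it yields approximately -2.72 and -6.56 for deterioration and 3.36 and 6.41 for adaptation, matching the Random column for NER-TTC in Table 7 up to rounding.

```python
# Rows are test years 2015-2019, columns are training years 2014-2018
# (NER block of Table 5). None marks cells with no result.
NER_RANDOM = [
    [21.82, None, None, None, None],
    [17.56, 22.56, None, None, None],
    [14.41, 16.55, 24.48, None, None],
    [12.82, 15.22, 20.12, 17.49, None],
    [10.10, 15.17, 17.28, 16.58, 23.94],
]


def _average_change(sequences, anchor):
    """Average difference to the previous element, or to the first if anchor."""
    diffs = []
    for scores in sequences:
        for k in range(1, len(scores)):
            reference = scores[0] if anchor else scores[k - 1]
            diffs.append(scores[k] - reference)
    return sum(diffs) / len(diffs)


def deterioration_score(matrix, anchor=False):
    """Change of one fixed model (column) across successively later test sets."""
    n = len(matrix)
    columns = [[matrix[row][col] for row in range(col, n)] for col in range(n)]
    return _average_change(columns, anchor)


def adaptation_score(matrix, anchor=False):
    """Change on one fixed test set (row) as the training data gets more recent."""
    rows = [[score for score in row if score is not None] for row in matrix]
    return _average_change(rows, anchor)


if __name__ == "__main__":
    print(f"DS (previous): {deterioration_score(NER_RANDOM):.2f}")
    print(f"DS (anchor):   {deterioration_score(NER_RANDOM, anchor=True):.2f}")
    print(f"AS (previous): {adaptation_score(NER_RANDOM):.2f}")
    print(f"AS (anchor):   {adaptation_score(NER_RANDOM, anchor=True):.2f}")
```

Reading columns for deterioration and rows for adaptation mirrors the layout of the results tables, where each column is one trained model and each row is one fixed test set.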

9 Temporal Adaptation without New Human Annotations

In this section, we explore methods for temporal domain adaptation without collecting new gold-standard data. Given human annotations for an older time period and a model trained on them, we want to improve the performance of this model on later time periods without new human annotations on more recent data. For these experiments, we only use NER-TTC and Sentiment-AR since Truecasing-NYT showed little change in performance even when retrained with gold-standard data.

9.1 Continual Pre-training

For the first experiment, we use domain adaptive pretraining (Gururangan et al., 2020) on temporal splits. A pre-trained model undergoes a second pre-training on domain-specific unlabeled data before finetuning on task-specific labeled data. In our case, the new data is a future temporal split. However, unlike in typical domain adaptive pre-training, we only have a small amount of in-domain data. In practice, the amount of this data would depend upon how frequently one wants to retrain the model. For the experiments, we use the data from a more recent temporal split, throwing away the gold annotations. We take a pretrained model, continue pretraining it on this unlabeled recent data, then finetune it on the older gold-standard data. This is done with three random seeds and the performance is averaged over these runs. With this setup, we observe a drastic drop in performance (numbers omitted for space). We hypothesized this is because the amount of in-domain data is insufficient for stable domain adaptation. However, recent work (Röttger and Pierrehumbert, 2021) has shown that temporal adaptation through continual pre-training even on millions of examples has limited benefit. It should be noted that Röttger and Pierrehumbert (2021) adapt a pretrained BERT which was pre-trained on recent data overlapping temporally with the data used for the continued pre-training. To completely disentangle the temporal effects of pre-training and assess the effectiveness of continual pre-training, one would also need to pretrain BERT from scratch on older data.

Test Year    Train Year
             2014     2014+15  2014+16  2014+17  2014+18

BERT
2015         64.69    -        -        -        -
2016         65.60    66.06    -        -        -
2017         65.90    65.48    65.55    -        -
2018         64.08    64.78    64.11    64.62    -
2019         71.70    73.81    75.00    74.26    75.01

RoBERTa
2015         67.48    -        -        -        -
2016         69.41    67.34    -        -        -
2017         68.30    67.12    66.64    -        -
2018         67.82    66.98    67.96    67.95    -
2019         77.79    77.34    76.90    76.65    77.28

Table 8: Span-level F1 for NER on TTC. The columns show training on gold-standard data from 2014 with continual pretraining on unlabeled data from the later year indicated.
Test Year    Train Year
             2014     2014+15  2014+16  2014+17  2014+18

BERT
2015         64.69    -        -        -        -
2016         65.60    67.34    -        -        -
2017         65.90    67.40    67.28    -        -
2018         64.08    65.26    65.72    65.34    -
2019         71.70    75.68    76.53    75.94    76.50

RoBERTa
2015         67.48    -        -        -        -
2016         69.41    71.03    -        -        -
2017         68.30    70.41    70.53    -        -
2018         67.82    69.84    69.80    68.70    -
2019         77.79    80.12    79.79    78.61    79.70

Table 9: Span-level F1 for NER on TTC. The columns represent training on gold-standard data from 2014 and self-labelled data from the later year indicated.
Test Year    Train Year
             01-03    04-06    07-09    10-12    13-15

BERT
04-06        63.09    -        -        -        -
07-09        63.58    64.76    -        -        -
10-12        64.64    65.53    65.62    -        -
13-15        65.82    66.82    66.86    67.14    -
16-18        65.88    67.49    67.73    67.92    68.29

RoBERTa
04-06        69.91    -        -        -        -
07-09        71.01    71.24    -        -        -
10-12        72.26    72.49    72.35    -        -
13-15        72.97    73.16    72.93    73.21    -
16-18        73.90    73.99    74.10    74.25    74.75

Table 10: F1 of the negative sentiment class. The columns show training on gold-standard data from 2001-03 with continual pretraining on unlabeled data from the later time period indicated.
Test Year    Train Year
             01-03    04-06    07-09    10-12    13-15

BERT
04-06        63.09    -        -        -        -
07-09        63.58    64.58    -        -        -
10-12        64.64    65.54    65.53    -        -
13-15        65.82    67.13    67.29    67.49    -
16-18        65.88    67.71    67.96    68.37    67.83

RoBERTa
04-06        69.91    -        -        -        -
07-09        71.01    71.88    -        -        -
10-12        72.26    73.08    72.99    -        -
13-15        72.97    74.31    74.18    74.22    -
16-18        73.90    75.76    75.99    75.97    75.65

Table 11: F1 of the negative sentiment class. The columns represent training on gold-standard data from 2001-03 and self-labelled data from the later time period indicated.

Next, we modify the domain adaptive pretraining by adding an extra finetuning step. This method first performs task adaptation, followed by temporal adaptation and then task adaptation again. We take a pretrained model, finetune it on the older gold-standard data, then pretrain it on the unlabeled recent data and then finetune it again on the older gold-standard data. Overall, this method does not improve performance, but does not decrease it either (see Tables 8 and 10 for the full results and Table 12 for summary adaptation scores). As highlighted in the evaluation setup, multi-test set evaluation is essential for reliable results. In this experiment, if we had evaluated only on 2019 for TTC, we would have concluded that this method works well, but looking at the full table with the different test years, one can see that the change in performance is inconsistent.

9.2 Self-labeling

Self-labelling has been shown to be an effective technique for detecting data drift (Elsahar and Gallé, 2019). Here, we explore its use in temporal domain adaptation. We finetune a model on the older gold-standard data, use this model to label the newer data, and then use the gold-standard older data and the self-labelled newer data to finetune another model. The new model is trained on the training portion of the older data and the full newer split, with the development portion of the older data as the development set. The self-labelled data is weakly labelled and thus noisy, so for reliable evaluation we do not extract a development set from it. The results are shown in Tables 9 and 11, with summary adaptation scores in Table 12. Self-labeling works consistently well across test years, representations and tasks. Though adding self-labelled data does not give the highest reported performance on a given test set, it improves performance over using just the older gold data. For NER, performance improves even over using the new gold-standard data (but not over the combined older and new gold-standard data). For sentiment, performance improves over using just the older gold data but not to the same level as using new gold-standard data for finetuning. Adding new data is computationally expensive though. For NER, since the amount of data is small because it required actual annotation, we could continue using the same GPU; only the run time increased. With reviews, we had to upgrade our usage from one to two GPUs in parallel.
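
The self-labeling procedure can be sketched end to end with a toy classifier standing in for the fine-tuned model. Everything in this example is hypothetical: the token-voting classifier, the function names and the four example reviews are illustrative assumptions and are not the models or data used in the experiments. The sketch only shows the order of the steps: train a teacher on older gold data, label newer unlabeled data with it, and train a new model on the union.

```python
from collections import Counter, defaultdict


def train(examples):
    """Toy stand-in for fine-tuning: count token/label co-occurrences."""
    counts = defaultdict(Counter)
    for text, label in examples:
        for token in text.lower().split():
            counts[token][label] += 1
    prior = Counter(label for _, label in examples)
    return counts, prior


def predict(model, text):
    """Label with the most votes over known tokens; fall back to the prior."""
    counts, prior = model
    votes = Counter()
    for token in text.lower().split():
        votes.update(counts.get(token, {}))
    if not votes:
        return prior.most_common(1)[0][0]
    return votes.most_common(1)[0][0]


def self_label(gold_old, unlabeled_new):
    """Train on old gold data, pseudo-label new data, retrain on both."""
    teacher = train(gold_old)
    pseudo = [(text, predict(teacher, text)) for text in unlabeled_new]
    student = train(gold_old + pseudo)
    return student, pseudo


# Hypothetical data: older gold-standard reviews and newer unlabeled reviews.
GOLD_OLD = [
    ("great book loved it", "pos"),
    ("terrible book waste", "neg"),
    ("loved the story", "pos"),
    ("waste of money", "neg"),
]
UNLABELED_NEW = ["great tablet loved it", "terrible tablet battery waste"]

if __name__ == "__main__":
    student, pseudo = self_label(GOLD_OLD, UNLABELED_NEW)
    print(pseudo)
    print(predict(student, "battery"))
```

In this toy setting the retrained model has seen the newer vocabulary ("battery") with a pseudo-label, which the teacher never did. In the experiments above, model selection still uses the development portion of the older gold data, since the pseudo-labels are noisy.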

                  NER-TTC              Sentiment-AR
                  BERT     RoBERTa     BERT     RoBERTa
Gold-Finetune     1.14     1.39        4.70     2.49
Adapt-Pretrain    0.84     -0.84       1.43     0.25
Self-label        2.27     1.79        1.56     1.40

Table 12: Adaptation scores w.r.t. anchor time period for different adaptation methods.

Lastly, we explore whether continuously adding new self-labeled data further improves performance. All data from the time periods after the gold-standard period, up to the most recent one, is self-labelled and added to the older gold data. We were able to perform this experiment only for NER since the cumulative data for reviews becomes too large to run experiments with our resources. Adding more data does not improve performance but it does not decrease performance either, despite the fact that the training data now comprises mainly noisy self-labelled data. More research on optimal data selection with self-labelling is needed. The right data selection may improve performance further.

10 Conclusion

In this paper, we presented exhaustive experiments to quantify the temporal effects on model performance. We outline an experimental design that allows us to draw conclusions about both temporal deterioration and the potential for temporal domain adaptation. Our experiments confirm the need to study performance on the full grid of possible training and testing time periods. Variation across time-periods is considerable and choosing only one can lead to misleading conclusions about changes in performance and utility of methods.

We find that with pretrained embeddings, there is no temporal model deterioration. This finding holds true regardless of whether pretraining data time period overlaps with the evaluation time period or not. Despite this, temporal domain adaptation via retraining on new gold-standard data is still possible, though the expected improvements are smaller than prior work may have observed.

We implemented two methods for temporal domain adaptation without labeling new data. We find that intermediate pretraining is not suitable when the amount of unlabelled data is small. Self-labeling works well across tasks and representations. This finding motivates future work on how to select data to be labelled and how to maintain a reasonable size for the training data as the continual learning progresses over time.


References

  • S. Brandl and D. Lassner (2019) Times are changing: investigating the pace of language change in diachronic word embeddings. In Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change, Florence, Italy, pp. 146–150.
  • C. Danescu-Niculescu-Mizil, R. West, D. Jurafsky, J. Leskovec, and C. Potts (2013) No country for old members: user lifecycle and linguistic change in online communities. In 22nd International World Wide Web Conference, WWW ’13, Rio de Janeiro, Brazil, May 13-17, 2013, pp. 307–318.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • P. Dury and P. Drouin (2011) When terms disappear from a specialized lexicon: a semi-automatic investigation into necrology. ICAME Journal, pp. 19–33.
  • J. Eisenstein (2013) What to do about bad language on the internet. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Atlanta, Georgia, pp. 359–369.
  • J. Eisenstein (2019) Measuring and modeling language change. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, Minneapolis, Minnesota, pp. 9–14.
  • H. Elsahar and M. Gallé (2019) To annotate or not? predicting performance drop under domain shift. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 2163–2173.
  • T. Fawcett (2003) "In vivo" spam filtering: a challenge problem for KDD. ACM SIGKDD Explorations Newsletter 5 (2), pp. 140–148.
  • H. Fromreide, D. Hovy, and A. Søgaard (2014) Crowdsourcing and annotating NER for Twitter #drift. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, pp. 2544–2547.
  • W. A. Gale, K. W. Church, and D. Yarowsky (1995) Discrimination decisions for 100,000-dimensional spaces. Annals of Operations Research 55, pp. 429–450.
  • S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith (2020) Don’t stop pretraining: adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 8342–8360.
  • W. L. Hamilton, J. Leskovec, and D. Jurafsky (2016) Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1489–1501.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
  • X. Huang and M. J. Paul (2018) Examining temporality in document classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Melbourne, Australia, pp. 694–699.
  • J. D. Lafferty, A. McCallum, and F. C. N. Pereira (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, San Francisco, CA, USA, pp. 282–289. ISBN 1558607781.
  • G. Lample, M. Ballesteros, S. Subramanian, K. Kawakami, and C. Dyer (2016) Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 260–270.
  • A. Lazaridou, A. Kuncoro, E. Gribovskaya, D. Agrawal, A. Liska, T. Terzi, M. Gimenez, C. de Masson d’Autume, S. Ruder, D. Yogatama, K. Cao, T. Kociský, S. Young, and P. Blunsom (2021) Pitfalls of static language modelling. ArXiv abs/2102.01951.
  • L. V. Lita, A. Ittycheriah, S. Roukos, and N. Kambhatla (2003) TRuEcasIng. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, Sapporo, Japan, pp. 152–159.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. ArXiv abs/1907.11692.
  • J. Lukes and A. Søgaard (2018) Sentiment analysis under temporal shift. In Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Brussels, Belgium, pp. 65–71.
  • X. Ma and E. Hovy (2016) End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1064–1074.
  • G. McCulloch (2020) Because internet: understanding the new rules of language. Riverhead Books.
  • T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In NIPS.
  • J. Ni, J. Li, and J. McAuley (2019) Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 188–197.
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1532–1543.
  • M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 2227–2237.
  • S. Rijhwani and D. Preotiuc-Pietro (2020) Temporally-informed analysis of named entity recognition. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 7605–7617.
  • A. Rosenfeld and K. Erk (2018) Deep neural models of semantic shift. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 474–484.
  • P. Röttger and J. B. Pierrehumbert (2021) Temporal adaptation of BERT and performance on downstream document classification: insights from social media. In EMNLP.
  • E. Sandhaus (2008) The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia 6 (12), pp. e26752.
  • A. Søgaard, S. Ebert, J. Bastings, and K. Filippova (2020) We need to talk about random splits. ArXiv abs/2005.00636.
  • A. Søgaard, S. Ebert, J. Bastings, and K. Filippova (2021) We need to talk about random splits. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 1823–1832.
  • U. Weinreich, W. Labov, and M. Herzog (1968) Empirical foundations for a theory of language change.
  • D. T. Wijaya and R. Yeniterzi (2011) Understanding semantic change of words over centuries. In Proceedings of the 2011 International Workshop on DETecting and Exploiting Cultural DiversiTy on the Social Web, DETECT ’11, New York, NY, USA, pp. 35–40. ISBN 9781450309622.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. ArXiv abs/1910.03771.

Appendix A Development Set Selection and Model Performance

(a) NER-TTC
(b) Sentiment-AR
(c) Truecasing-NYT
Figure 3: Evaluation metrics for the ELMo+GloVe model with the development set chosen from the training vs. the test time period.

Here we show that the selection of the development set not only affects task setup feasibility but may also lead to exaggerated claims of improvement by retraining. We perform two experiments by varying the training year with a fixed test year. In one, we select the development set from the training year (dev-train) and in the other from the test year (dev-test). The results are shown in Figure 3. We selected the last temporal bucket as the test bucket to get the maximum number of training points for observing the difference. We observe a monotonically increasing trend in F1 when the development set is chosen from the test year. On the other hand, the trend isn’t as smooth when choosing the development set from the training year. Moreover, the amount of improvement observed by retraining is greater when the development set is drawn from the test year. For TTC, the improvement is from 63.31 to 70.02 when the dev set is drawn from the test year but only from 63.23 to 68.82 when the dev set is drawn from the train year. Similarly, for Review, the improvement is from 52.58 to 69.25 when the dev set is drawn from the test year but only from 57.49 to 69.25 when the dev set is drawn from the train year.