Words change their meaning over time in all natural languages. The emergence of large representative historical corpora and powerful data-driven semantic models has allowed researchers to track meaning change. SemEval-2020 Shared Task 1 challenges its participants to classify a list of target words into stable or changed (Subtask 1) and/or to rank these words by the degree of their semantic change (Subtask 2). The task is multilingual: it includes four lists of target words, respectively for English, German, Latin, and Swedish. Each word list is accompanied by two historical corpora of varying size, consisting of texts created in two different time periods.
We participated in Subtask 2 as the UiO-UvA team and investigated the potential of contextualised embeddings for the detection of lexical semantic change. Our systems are based on the ELMo and BERT language models, and employ three different algorithms to compare contextualised embeddings diachronically. Our ‘official’ submission to the shared task ranked 10th out of 34 participating teams. In this paper, we extensively evaluate all combinations of architectures, training corpora and change detection algorithms, using 5 test sets in 4 languages. Our main findings are twofold: 1) on 3 out of 5 test sets, ELMo consistently outperforms BERT, while having far fewer parameters and being much faster to train and run; 2) cosine similarity of averaged contextualised embeddings and average pairwise distance between these embeddings are the two best-performing change detection algorithms, but different test sets show a strong preference for either the former or the latter. This preference correlates strongly with the distribution of gold scores in a test set. While this may indicate a bias in the available test sets, the finding remains unexplained.
Lexical semantic change detection (LSCD) consists in automatically determining whether and/or to what extent the meaning of a set of target words has changed over time, with the help of time-annotated corpora [14, 22]. LSCD is often addressed using distributional semantic models: time-sensitive word representations (so-called ‘diachronic word embeddings’) are learned and then compared across time periods. This is a fast-developing NLP sub-field with a number of influential papers ([9, 12, 13, 10, 1, 2, 4], and many others).
Most previous work used some variation of ‘static’ word embeddings, where each occurrence of a word form is assigned the same vector representation independently of its context. Recent contextualised architectures overcome this limitation by taking sentential context into account when inferring word token representations. However, the application of such architectures to diachronic semantic change detection has so far been rather limited [11, 8, 16]. While all these studies use BERT as their contextualising architecture, we extend our analysis to ELMo and perform a systematic evaluation of various contextualising approaches for both language models.
ELMo was arguably the first contextualised word embedding architecture to attract wide attention of the NLP community. Its network architecture consists of a two-layer bidirectional LSTM on top of a convolutional layer. BERT is a Transformer with self-attention, trained on masked language modelling and next-sentence prediction. While Transformer architectures have been shown to outperform recurrent ones in many NLP tasks, ELMo allows faster training and inference than BERT, making it more convenient to experiment with different training corpora and hyperparameters.
3 System overview
Given two time periods $t_1$ and $t_2$, two corresponding corpora $C_1$ and $C_2$, and a set of target words, we use a neural language model to obtain contextualised embeddings of each occurrence of the target words in $C_1$ and $C_2$, and use them to compute a continuous change score. This score indicates the degree of semantic change undergone by a word between $t_1$ and $t_2$, and the target words are ranked by its value.
More precisely, given a target word $w$ and its sentential context $s = (v_1, \ldots, v_i, \ldots, v_m)$ with $v_i = w$, we extract the activations of a language model’s hidden layers at sentence position $i$. The contextualised embeddings collected for $w$ are represented as the usage matrix $\mathbf{U}_w$, with one row per occurrence. The time-specific usage matrices $\mathbf{U}_w^1$ and $\mathbf{U}_w^2$ for time periods $t_1$ and $t_2$ are used as input to all the tested metrics of semantic change. We use three change detection algorithms:
1. Cosine similarity (COS)
Given two usage matrices $\mathbf{U}_w^1$ and $\mathbf{U}_w^2$, the degree of change of $w$ is calculated as the inverted similarity between the average embeddings of all occurrences of $w$ in the two time periods:

$$COS(w, t_1, t_2) = 1 - d\left(\frac{1}{N_w^1} \sum_{\mathbf{x} \in \mathbf{U}_w^1} \mathbf{x},\ \frac{1}{N_w^2} \sum_{\mathbf{x} \in \mathbf{U}_w^2} \mathbf{x}\right)$$

where $N_w^1$ and $N_w^2$ are the numbers of occurrences of $w$ in time periods $t_1$ and $t_2$, and $d$ is a similarity metric, for which we use cosine similarity. This method corresponds to the standard LSCD workflow based on static embeddings produced by Procrustes-aligned time-specific distributional models, with the only additional step of averaging token embeddings to create a single static representation.
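A minimal sketch of this metric, assuming usage matrices with one contextualised embedding per row:

```python
import numpy as np

def cos_change(U1, U2):
    """Inverted cosine similarity between the average embeddings of a
    word's occurrences in two time periods (higher = more change)."""
    mu1, mu2 = U1.mean(axis=0), U2.mean(axis=0)
    sim = mu1 @ mu2 / (np.linalg.norm(mu1) * np.linalg.norm(mu2))
    return 1.0 - sim

rng = np.random.default_rng(0)
U1 = rng.normal(size=(50, 8))   # toy usage matrix: 50 occurrences, dim 8
stable = cos_change(U1, U1)     # identical usage: score close to 0
flipped = cos_change(U1, -U1)   # opposite usage: score close to 2
```

Since the score is one minus a cosine, it ranges from 0 (identical average usage) to 2 (opposite average usage).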
2. Average pairwise distance (APD)
Here, the degree of change of $w$ is measured as the average distance between any two embeddings from different time periods:

$$APD(w, t_1, t_2) = \frac{1}{N_w^1 \cdot N_w^2} \sum_{\mathbf{x}_i \in \mathbf{U}_w^1,\ \mathbf{x}_j \in \mathbf{U}_w^2} d(\mathbf{x}_i, \mathbf{x}_j)$$

where $d$ is the cosine distance. High APD values indicate stronger semantic change.
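A sketch under the same assumptions (one embedding per row); after row-normalisation, the double sum over cosine similarities collapses to a matrix product:

```python
import numpy as np

def apd_change(U1, U2):
    """Mean cosine distance over all cross-period embedding pairs."""
    A = U1 / np.linalg.norm(U1, axis=1, keepdims=True)
    B = U2 / np.linalg.norm(U2, axis=1, keepdims=True)
    # (A @ B.T)[i, j] is the cosine similarity of the pair (i, j)
    return 1.0 - (A @ B.T).mean()

same = np.tile([1.0, 0.0], (3, 1))   # three identical usages
orth = np.tile([0.0, 1.0], (4, 1))   # four orthogonal usages
low = apd_change(same, same)   # identical usages: APD = 0
high = apd_change(same, orth)  # orthogonal usages: APD = 1
```

Unlike COS, APD does not average embeddings first, so it stays sensitive to the spread of usages within each period.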
3. Jensen-Shannon divergence (JSD)
This measure relies on partitioning embeddings into clusters of similar word usages. Following Giulianelli et al. (2020), we create a single usage matrix $\mathbf{U}_w$ with occurrences from both corpora, standardise it, and cluster its entries using Affinity Propagation, which automatically selects the number of clusters for each word. Finally, we define probability distributions $p_1$ and $p_2$ based on the normalised counts of word occurrences from each cluster in the two time periods, and compute the score as the Jensen-Shannon divergence between them:

$$JSD(w, t_1, t_2) = \mathrm{JSD}(p_1 \parallel p_2)$$
JSD scores measure the amount of change in the proportions of word usage clusters across time periods.
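The clustering-based metric can be sketched with scikit-learn's Affinity Propagation on toy data; the two Gaussian blobs, their sizes and all dimensions below are illustrative assumptions, not the shared-task data:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

rng = np.random.default_rng(42)
U1 = rng.normal(loc=0.0, size=(30, 4))   # occurrences from the first period
U2 = rng.normal(loc=3.0, size=(30, 4))   # occurrences from the second period
U = np.vstack([U1, U2])
U = (U - U.mean(axis=0)) / U.std(axis=0)  # standardise the joint matrix

labels = AffinityPropagation(damping=0.9, random_state=0).fit_predict(U)
k = labels.max() + 1

# Normalised cluster counts per period yield two probability distributions
p1 = np.bincount(labels[:30], minlength=k) / 30
p2 = np.bincount(labels[30:], minlength=k) / 30

def jsd(p, q):
    """Jensen-Shannon divergence (base-2 log, so the score lies in [0, 1])."""
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a[a > 0] * np.log2(a[a > 0] / b[a > 0]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

score = jsd(p1, p2)  # high when the two periods occupy different clusters
```

With base-2 logarithms the score is bounded by 1, reached when the two periods use entirely disjoint clusters.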
4 Experimental setup
For each of the 4 languages of the shared task, we train 4 ELMo model variants: 1) Pre-trained, an ELMo model trained on the respective Wikipedia corpus (English, German, Latin or Swedish); the Wikipedia corpora were lemmatised using UDPipe prior to training; 2) Fine-tuned, the same as Pre-trained but further fine-tuned on the union of the two test corpora; 3) Trained on test, trained only on the union of the two test corpora; 4) Incremental, two models: the first is trained on the first test corpus, and the second is the same model further trained on the second test corpus. The ELMo models are trained for 3 epochs (except the English and Latin Trained on test and Incremental models, for which we use 5 epochs due to the test corpora sizes), with an LSTM dimensionality of 2048, a batch size of 192 and 4096 negative samples per batch. All other hyperparameters are left at their default values. To train and fine-tune the ELMo models, we use the code from https://github.com/ltgoslo/simple_elmo_training, which is essentially the reference ELMo implementation updated to recent TensorFlow versions.
For BERT, we use the base version, with 12 layers and 768 hidden dimensions. We rely on Hugging Face’s implementation of BERT (https://github.com/huggingface/transformers, version 2.5.0) and follow their model naming conventions (https://huggingface.co/models). For English, German and Swedish, we employ language-specific models: bert-base-uncased, bert-base-german-cased, and af-ai-center/bert-base-swedish-uncased. For Latin, we resort to bert-base-multilingual-cased, since no Latin-specific BERT is available yet. Given the limited size of the test corpora, we do not train BERT from scratch and only test the Pre-trained and Fine-tuned BERT variants. Fine-tuning is done with BERT’s standard objective for 2 epochs (5 epochs for English). We configure BERT’s WordPiece tokeniser to never split any occurrences of the target words (some target words are otherwise split into character sequences) and add unknown target words to BERT’s vocabulary. We perform this step both before fine-tuning and before the extraction of contextualised representations.
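The vocabulary-protection step can be illustrated with Hugging Face's tokenizer API; the toy vocabulary file and the target word "discman" below are invented for the example, and `add_tokens` stands in for the general mechanism of keeping target words atomic:

```python
from pathlib import Path
from transformers import BertTokenizer

# A tiny toy WordPiece vocabulary, so the sketch runs without downloads
vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]",
         "the", "walk", "##man", "walkman"]
Path("toy_vocab.txt").write_text("\n".join(vocab))
tok = BertTokenizer("toy_vocab.txt", do_lower_case=True)

before = tok.tokenize("discman")  # out-of-vocabulary word
tok.add_tokens(["discman"])       # protect the target word
after = tok.tokenize("discman")   # now kept as a single token
```

After extending the tokeniser, the model's embedding matrix must be resized accordingly (`model.resize_token_embeddings(len(tok))` in the same library).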
At inference time, we use all ELMo and BERT variants to produce contextualised representations of all occurrences of each target word in the test corpora. For the Incremental variant, the representations for the occurrences in each of the two test corpora are produced using the respective model trained on that corpus. The resulting embeddings have the dimensionality of the respective model’s hidden layers. We employ three strategies to reduce their dimensionality to that of a single layer: 1) using only the top layer, 2) averaging all layers, 3) averaging the last four layers (BERT only). Finally, we feed each word’s contextualised embeddings into the three algorithms of semantic change estimation described in Section 3, and compute the Spearman correlation of the estimated change scores with the gold answers. This is the evaluation metric of Subtask 2, and we use it throughout our experiments.
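The three layer-reduction strategies and the evaluation metric can be sketched as follows; the array shapes and the gold/predicted scores are illustrative:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
# Hypothetical activations for one word's occurrences: (layers, occurrences, dim)
acts = rng.normal(size=(12, 5, 8))

top = acts[-1]                     # 1) top layer only
avg_all = acts.mean(axis=0)        # 2) average over all layers
avg_top4 = acts[-4:].mean(axis=0)  # 3) average over the last four (BERT)

# Subtask 2 metric: Spearman correlation of predicted vs. gold change scores
gold = [0.1, 0.9, 0.3, 0.7]
pred = [0.2, 0.8, 0.1, 0.9]
rho, _ = spearmanr(gold, pred)  # rho = 0.6 for these toy rankings
```

Spearman correlation only compares rankings, which matches the Subtask 2 setting of ranking words by degree of change.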
In our ‘official’ shared task submission we used top-layer ELMo embeddings with the cosine similarity change detection algorithm for all languages. The English and German ELMo models were trained on the respective Wikipedia corpora. For Swedish and Latin, pre-trained ELMo models were not available, so we trained our own models on the union of the test corpora. This combination of architectures and algorithms was chosen based on our preliminary experiments with the available human-annotated semantic change datasets for English, German and Russian. With an average Spearman correlation of 0.37, this submission ranked 10th out of 34 teams. We were aware that the submitted setup was likely sub-optimal, as it did not include the Fine-tuned model variant. After the official submission deadline, we finished training and fine-tuning all of our language models. Their systematic evaluation is the main contribution of this paper.
Table 1: average Spearman correlation across the 4 languages. Baselines: frequency difference (FD); count vectors with column intersection and cosine distance (CNT+CI+CD); word2vec CBOW cosine similarity. The excerpt below shows the Trained on test ELMo variant (‘–’: averaging the top 4 layers applies to BERT only).

| Algorithm | Top layer | Average all layers | Average top 4 layers |
|---|---|---|---|
| Cosine similarity (COS) | 0.370** | 0.342** | – |
| Average pairwise distance (APD) | 0.338** | 0.295*** | – |
| Jensen-Shannon divergence (JSD) | 0.225* | 0.163* | – |
The average scores of all the tested configurations across the 4 languages are given in Table 1. We compare our scores to the organisers’ baselines and to the classical approach of calculating cosine similarity between CBOW word embeddings. The CBOW models were used in two different flavours: 1) ‘incremental’, where the model for the second time period is initialised with the weights of the first-period model, and 2) ‘Procrustes’, where the two models are trained independently on the two corpora and then aligned using the orthogonal Procrustes transformation. Table 1 shows that no method achieves statistically significant correlation on all 4 languages, which attests both to the difficulty of the task and to the diversity of the test sets. CBOW Procrustes is a surprisingly strong approach, consistently outperforming the organisers’ baselines. Only COS and APD obtain higher average scores, with fine-tuned ELMo models performing better than fine-tuned BERT.
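The Procrustes alignment of two independently trained embedding spaces can be sketched with SciPy; the two rotated toy matrices below stand in for the time-specific CBOW models:

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                 # embeddings from the first period
Q, _ = np.linalg.qr(rng.normal(size=(10, 10))) # an unknown orthogonal rotation
Y = X @ Q                                      # second period: same space, rotated

# Recover the orthogonal map that best aligns X to Y (least squares)
R, _ = orthogonal_procrustes(X, Y)
residual = np.linalg.norm(X @ R - Y)  # near zero: the rotation is recovered
```

After alignment, a word's change score is simply the cosine distance between its two aligned vectors.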
Judging only from the average correlation scores, contextualised embeddings do not seem to outshine their static counterparts, especially considering that both ELMo and BERT are more computationally demanding than CBOW. However, a closer analysis of per-language results shows that the contextualised approaches in fact outperform the CBOW Procrustes baseline by a large margin on each of the shared task test sets. Table 2 features the scores obtained by our best-performing methods (COS and APD with top-layer embeddings from fine-tuned ELMo and BERT) on the individual languages of the shared task. We also report performance on the GEMS test set. The discrepancy between the averaged and the per-language results can be explained by properties of the test sets: APD works best on the English and Swedish sets, while COS yields the best scores for German and Latin.
With the right choice between APD and COS, contextualised embeddings can improve Spearman correlation coefficients by up to 50%. This is not a language-specific property: the English GEMS test set does not behave like the English test set from the shared task. In fact, one can observe 3 groups of test sets with regard to their statistical properties and the method they favour: group 1 (Latin and German) exhibits rather uniform gold score distributions and prefers COS; group 2 (English and Swedish) is characterised by more skewed gold score distributions and prefers APD; group 3 (GEMS) is in between, with no clear preference. Interestingly, the method which produces a more uniform predicted score distribution (APD) works better for the test sets with skewed gold distributions, and the method which produces a more skewed predicted score distribution (COS) works better for the uniformly distributed test sets (as can be seen in the Appendix). Furthermore, there is a perfect negative correlation (Spearman $\rho = -1$) between the median gold score of a test set and the performance of the APD algorithm with fine-tuned ELMo models on this test set. We currently do not have a plausible explanation for this behaviour.
Table 2 also supports the previous observation that ELMo models perform better than BERT in the LSCD task. The only test set for which this is not the case is Latin, while on GEMS, ELMo and BERT are on par. One possible explanation is that our ELMo models were pre-trained on lemmatised Wikipedia corpora and thus better fit the test corpora, provided in lemmatised form by the organisers. The BERT models were pre-trained on raw corpora, and fine-tuning them on lemmatised data proves less successful.
Table 2: Spearman correlation with the gold scores on each test set for fine-tuned contextualised embeddings (top layer).

| Method | English | German | Latin | Swedish | GEMS |
|---|---|---|---|---|---|
| ELMo, cosine similarity (COS) | 0.254 | 0.740 | 0.360 | 0.252 | 0.323 |
| ELMo, average pairwise distance (APD) | 0.605 | 0.560 | -0.113 | 0.569 | 0.323 |
| BERT, cosine similarity (COS) | 0.225 | 0.590 | 0.561 | 0.185 | 0.394 |
| BERT, average pairwise distance (APD) | 0.546 | 0.427 | 0.372 | 0.254 | 0.243 |
In the post-evaluation phase of the shared task, we submitted predictions obtained with the optimal system configurations: fine-tuned ELMo + APD for English and Swedish, fine-tuned ELMo + COS for German, and fine-tuned BERT + COS for Latin. This submission reached an average Spearman correlation of 0.618 and, at the time of writing, is the best Subtask 2 submission for SemEval-2020 Task 1.
Our experiments for the SemEval-2020 Shared Task 1 (Subtask 2) show that using contextualised embeddings to rank words by the degree of their semantic change produces strong correlation with human judgements, far outperforming static embeddings. Models pre-trained on large external corpora and fine-tuned on the historical test corpora produce the highest correlation results, with ELMo slightly but consistently outperforming BERT as a contextualiser.
The inverted cosine similarity between averaged contextualised embeddings and the average pairwise cosine distance between contextualised embeddings turned out to be the best semantic change detection algorithms. An interesting finding is that the former favours test sets with a uniform gold score distribution, while the latter works best on test sets where the gold score distribution is skewed towards low values. This distinction is not related to the language of the test set. We believe this dependency between the statistical properties of gold scores and the performance of semantic change detection systems deserves further investigation in future work.
References

- (2017) Dynamic word embeddings. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 380–389.
- (2019) Short-term meaning shift: a distributional exploration. In Proceedings of NAACL-HLT 2019 (Annual Conference of the North American Chapter of the Association for Computational Linguistics).
- (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
- (2019) Time-Out: temporal referencing for robust modeling of lexical semantic change. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Florence, Italy.
- (2017) Word vectors, reuse, and replicability: towards a community repository of large-text resources. In Proceedings of the 21st Nordic Conference on Computational Linguistics, pp. 271–276.
- (2019) Tracing cultural diachronic semantic shifts in Russian using word embeddings: test sets and baselines. Komp’yuternaya Lingvistika i Intellektual’nye Tekhnologii: Dialog conference, pp. 203–218.
- (2007) Clustering by passing messages between data points. Science 315 (5814), pp. 972–976.
- (2020) Analysing lexical semantic change with contextualised word representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Forthcoming.
- (2011) A distributional similarity approach to the detection of semantic change in the Google Books Ngram corpus. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, Edinburgh, UK, pp. 67–71.
- (2016) Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1489–1501.
- (2019) Diachronic sense modeling with deep contextualized word embeddings: an ecological view. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3899–3908.
- (2014) Temporal analysis of language through neural language models. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pp. 61–65.
- (2015) Statistically significant detection of linguistic change. In Proceedings of the 24th International Conference on World Wide Web, pp. 625–635.
- (2018) Diachronic word embeddings and semantic shifts: a survey. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1384–1397.
- (1991) Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37 (1), pp. 145–151.
- (2020) Capturing evolution in word usage: just add more clusters? In Companion Proceedings of the International World Wide Web Conference, pp. 20–24.
- (2013) Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26, pp. 3111–3119.
- (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237.
- (2020) SemEval-2020 Task 1: unsupervised lexical semantic change detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. To appear.
- (2018) Diachronic usage relatedness (DURel): a framework for the annotation of lexical semantic change. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 169–174.
- (2017) Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 88–99.
- (2018) Survey of computational approaches to diachronic conceptual change. arXiv preprint arXiv:1811.06278.
Appendix A. Score distributions
In the left part of Figure 1, we show how different the 5 test sets are in terms of how their gold scores are distributed. The plot clearly shows that in some test sets the gold scores are skewed to the left, while others have a more uniform distribution. The central and right parts of Figure 1 show the distributions of the predicted scores produced by the APD and COS algorithms (with fine-tuned ELMo embeddings). COS tends to squeeze the majority of predictions near the lower boundary (no semantic change), with a low median score. Conversely, APD distributes its predictions much more uniformly, with a higher median score. Counter-intuitively, skewed gold distributions favour uniform predictions and vice versa.
The grouping differences can be quantified with respect to the median gold score (after unit-normalisation). Figure 2 shows the dependence of COS and APD performance on the median score of the gold test set. Each dot is the performance of the COS or APD algorithm on one test set. The English and Swedish test sets are in the left part of the plot, with median gold scores of 0.200 and 0.203 respectively. German, GEMS and Latin are on the right, with 0.266, 0.267 and 0.364 respectively. There is a perfect negative Spearman correlation between the median gold scores of these 5 test sets and the performance of the APD semantic change detection algorithm (with fine-tuned ELMo embeddings) on each of them.