UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection

30 April 2020 · Andrey Kutuzov et al. · University of Oslo and University of Amsterdam

We apply contextualised word embeddings to lexical semantic change detection in the SemEval-2020 Shared Task 1. This paper focuses on Subtask 2, ranking words by the degree of their semantic drift over time. We analyse the performance of two contextualising architectures (BERT and ELMo) and three change detection algorithms. We find that the most effective algorithms rely on the cosine similarity between averaged token embeddings and on the pairwise distances between token embeddings. They outperform strong baselines by a large margin, but, interestingly, the choice of a particular algorithm depends on the distribution of gold scores in the test set.


1 Introduction

Words change their meaning over time in all natural languages. The emergence of large representative historical corpora and of powerful data-driven semantic models has allowed researchers to track meaning change. SemEval-2020 Shared Task 1 challenges its participants to classify a list of target words into stable or changed (Subtask 1) and/or to rank these words by the degree of their semantic change (Subtask 2) [19]. The task is multilingual: it includes four lists of target words, for English, German, Latin, and Swedish respectively. Each word list is accompanied by two historical corpora of varying size, consisting of texts created in two different time periods.

We participated in Subtask 2 as the UiO-UvA team and investigated the potential of contextualised embeddings for the detection of lexical semantic change. Our systems are based on the ELMo [18] and BERT [3] language models and employ three different algorithms to compare contextualised embeddings diachronically. Our ‘official’ submission to the shared task ranked 10th out of 34 participating teams. In this paper, we extensively evaluate all combinations of architectures, training corpora and change detection algorithms, using 5 test sets in 4 languages. Our main findings are twofold: 1) on 3 out of 5 test sets, ELMo consistently outperforms BERT, while having far fewer parameters and being much faster to train and run; 2) cosine similarity of averaged contextualised embeddings and average pairwise distance between these embeddings are the two best-performing change detection algorithms, but different test sets show a strong preference for either the former or the latter. This preference correlates strongly with the distribution of gold scores in a test set. While this may indicate a bias in the available test sets, the finding remains unexplained.

Our implementations of all the evaluated algorithms are available at https://github.com/akutuzov/semeval2020, and the ELMo models we trained can be downloaded from the NLPL vector repository (http://vectors.nlpl.eu/repository/) [5].

2 Background

Lexical semantic change detection (LSCD) consists in automatically determining whether and/or to what extent the meaning of a set of target words has changed over time, with the help of time-annotated corpora [14, 22]. LSCD is often addressed using distributional semantic models: time-sensitive word representations (so-called ‘diachronic word embeddings’) are learned and then compared across time periods. This is a fast-developing NLP sub-field with a number of influential papers ([9, 12, 13, 10, 1, 2, 4], and many others).

Most previous work used some variation of ‘static’ word embeddings, where each occurrence of a word form is assigned the same vector representation independently of its context. Recent contextualised architectures make it possible to overcome this limitation by taking sentential context into account when inferring word token representations. However, applications of such architectures to diachronic semantic change detection have so far been rather limited [11, 8, 16]. While all these studies use BERT as their contextualising architecture, we extend our analysis to ELMo and perform a systematic evaluation of various contextualising approaches for both language models.

ELMo [18] was arguably the first contextualised word embedding architecture to attract wide attention in the NLP community. Its network architecture consists of a two-layer bidirectional LSTM on top of a convolutional layer. BERT [3] is a Transformer with self-attention, trained on masked language modelling and next sentence prediction. While Transformer architectures have been shown to outperform recurrent ones in many NLP tasks, ELMo allows faster training and inference than BERT, making it more convenient to experiment with different training corpora and hyperparameters.

3 System overview

Given two time periods $t_1$ and $t_2$, two corpora $C_1$ and $C_2$ of texts created in these periods, and a set of target words, we use a neural language model to obtain contextualised embeddings of each occurrence of the target words in $C_1$ and $C_2$, and use them to compute a continuous change score. This score indicates the degree of semantic change undergone by a word between $t_1$ and $t_2$, and the target words are ranked by its value.

More precisely, given a target word $w$ and a sentence $s = (v_1, \dots, v_i, \dots, v_m)$ containing it, with $v_i = w$, we extract the activations of the language model's hidden layers at sentence position $i$. The $N_w$ contextualised embeddings collected for $w$ form the usage matrix $\mathbf{U}_w$. The time-specific usage matrices $\mathbf{U}_w^{t_1}$ and $\mathbf{U}_w^{t_2}$ for time periods $t_1$ and $t_2$ are used as input to all the tested metrics of semantic change. We use three change detection algorithms:

1. Cosine similarity (COS)

Given two usage matrices $\mathbf{U}_w^{t_1}$ and $\mathbf{U}_w^{t_2}$, the degree of change of $w$ is calculated as the inverted similarity between the average embeddings of all occurrences of $w$ in the two time periods:

$$\mathrm{COS}(w, t_1, t_2) = 1 - \mathrm{sim}\left(\frac{1}{N_w^{t_1}}\sum_{\mathbf{x} \in \mathbf{U}_w^{t_1}} \mathbf{x},\ \frac{1}{N_w^{t_2}}\sum_{\mathbf{x} \in \mathbf{U}_w^{t_2}} \mathbf{x}\right) \qquad (1)$$

where $N_w^{t_1}$ and $N_w^{t_2}$ are the number of occurrences of $w$ in time periods $t_1$ and $t_2$, and $\mathrm{sim}$ is a similarity metric, for which we use cosine similarity. This method corresponds to the standard LSCD workflow based on static embeddings produced by Procrustes-aligned time-specific distributional models [10], with the only additional step of averaging token embeddings to create a single static representation.
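For illustration, here is a minimal NumPy sketch of this metric; the function name and the assumption that usage matrices arrive as arrays of shape (number of occurrences, embedding dimensionality) are ours, not taken from the official implementation:

```python
import numpy as np

def cos_change(usage_t1: np.ndarray, usage_t2: np.ndarray) -> float:
    """Inverted cosine similarity between averaged usage matrices.

    usage_t1, usage_t2: arrays of shape (N, dim), one contextualised
    embedding per occurrence of the target word in each time period.
    """
    mean_t1 = usage_t1.mean(axis=0)
    mean_t2 = usage_t2.mean(axis=0)
    sim = np.dot(mean_t1, mean_t2) / (
        np.linalg.norm(mean_t1) * np.linalg.norm(mean_t2))
    return 1.0 - sim  # higher value = stronger estimated change
```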

2. Average pairwise distance (APD)

Here, the degree of change of $w$ is measured as the average distance between any two embeddings from different time periods [8]:

$$\mathrm{APD}(w, t_1, t_2) = \frac{1}{N_w^{t_1} \cdot N_w^{t_2}} \sum_{\mathbf{x}_i \in \mathbf{U}_w^{t_1},\ \mathbf{x}_j \in \mathbf{U}_w^{t_2}} d(\mathbf{x}_i, \mathbf{x}_j) \qquad (2)$$

where $d$ is the cosine distance. High APD values indicate stronger semantic change.
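A corresponding sketch of APD, under the same assumed data layout (again, not the exact official code):

```python
import numpy as np

def apd_change(usage_t1: np.ndarray, usage_t2: np.ndarray) -> float:
    """Average pairwise cosine distance between embeddings across periods."""
    # Row-normalise so that dot products become cosine similarities.
    u1 = usage_t1 / np.linalg.norm(usage_t1, axis=1, keepdims=True)
    u2 = usage_t2 / np.linalg.norm(usage_t2, axis=1, keepdims=True)
    pairwise_sims = u1 @ u2.T          # shape: (N_t1, N_t2)
    return float(np.mean(1.0 - pairwise_sims))
```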

3. Jensen-Shannon divergence (JSD)

This measure relies on the partitioning of embeddings into clusters of similar word usages. We follow Giulianelli et al. [8] and create a single usage matrix $\mathbf{U}_w$ with occurrences from both corpora $C_1$ and $C_2$. We then standardise it and cluster its entries using Affinity Propagation [7], which automatically selects the number of clusters for each word [16]. Finally, we define probability distributions $u^{t_1}$ and $u^{t_2}$ based on the normalised counts of word occurrences from each cluster in the two time periods [11] and compute a JSD score [15]:

$$\mathrm{JSD}(w, t_1, t_2) = \mathrm{JSD}\left(u^{t_1} \parallel u^{t_2}\right) \qquad (3)$$

JSD scores measure the amount of change in the proportions of word usage clusters across time periods.
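A sketch of this pipeline using scikit-learn's AffinityPropagation and an explicit Jensen-Shannon divergence; the helper name and default hyperparameters are our assumptions rather than the shared-task configuration:

```python
import numpy as np
from scipy.stats import entropy
from sklearn.cluster import AffinityPropagation
from sklearn.preprocessing import StandardScaler

def jsd_change(usage_t1: np.ndarray, usage_t2: np.ndarray) -> float:
    """JSD between the usage-cluster distributions of the two periods."""
    joint = StandardScaler().fit_transform(np.vstack([usage_t1, usage_t2]))
    labels = AffinityPropagation().fit_predict(joint)  # picks cluster count
    n_clusters = labels.max() + 1
    n1 = len(usage_t1)
    # Normalised per-cluster counts of occurrences in each time period.
    p = np.bincount(labels[:n1], minlength=n_clusters) / n1
    q = np.bincount(labels[n1:], minlength=n_clusters) / len(usage_t2)
    m = 0.5 * (p + q)
    # JSD as the average KL divergence of p and q from their mixture m.
    return 0.5 * entropy(p, m) + 0.5 * entropy(q, m)
```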

4 Experimental setup

For each of the 4 languages of the shared task, we train 4 ELMo model variants: 1) Pre-trained, an ELMo model trained on the respective Wikipedia corpus (English, German, Latin or Swedish), lemmatised with UDPipe [21] prior to training; 2) Fine-tuned, the same as Pre-trained but further fine-tuned on the union of the two test corpora; 3) Trained on test, trained only on the union of the two test corpora; 4) Incremental, two models: the first is trained on the first test corpus, and the second is the same model further trained on the second test corpus. The ELMo models are trained for 3 epochs (except the English and Latin Trained on test and Incremental models, for which we use 5 epochs because of the test corpora sizes), with an LSTM dimensionality of 2048, a batch size of 192 and 4096 negative samples per batch. All other hyperparameters are left at their default values. To train and fine-tune the ELMo models, we use the code from https://github.com/ltgoslo/simple_elmo_training, which is essentially the reference ELMo implementation updated for recent TensorFlow versions.

For BERT, we use the base version, with 12 layers and 768 hidden dimensions. We rely on Hugging Face's implementation of BERT (available at https://github.com/huggingface/transformers, version 2.5.0) and follow their model naming conventions (https://huggingface.co/models). For English, German and Swedish, we employ language-specific models: bert-base-uncased, bert-base-german-cased, and af-ai-center/bert-base-swedish-uncased. For Latin, we resort to bert-base-multilingual-cased, since no Latin-specific BERT is available yet. Given the limited size of the test corpora, we do not train BERT from scratch and only test the Pre-trained and Fine-tuned BERT variants. Fine-tuning is done with BERT's standard objective for 2 epochs (5 epochs for English). We configure BERT's WordPiece tokeniser to never split any occurrences of the target words (some target words are split into character sequences by default) and add unknown target words to BERT's vocabulary. We perform this step both before fine-tuning and before the extraction of contextualised representations.
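As a rough sketch of this tokeniser adjustment with the Hugging Face library (the exact calls below are our reconstruction for transformers 2.x, not a snippet from the official repository):

```python
from transformers import BertForMaskedLM, BertTokenizer

target_words = ["plane", "graft"]  # illustrative target words

# Prevent the basic tokeniser from splitting target words, and register
# out-of-vocabulary targets as new, never-split vocabulary items.
tokenizer = BertTokenizer.from_pretrained(
    "bert-base-uncased", never_split=target_words)
n_added = tokenizer.add_tokens(
    [w for w in target_words if w not in tokenizer.vocab])

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
if n_added:
    # Grow the embedding matrix to cover the newly added tokens.
    model.resize_token_embeddings(len(tokenizer))
```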

At inference time, we use all ELMo and BERT variants to produce contextualised representations of all the occurrences of each target word in the test corpora. For the Incremental variant, the representations for the occurrences in each of the two test corpora are produced using the respective model trained on that corpus. The resulting per-layer representations are 768-dimensional for BERT and 1024-dimensional for ELMo. We employ three strategies to reduce them to the dimensionality of a single layer: 1) using only the top layer, 2) averaging all layers, 3) averaging the last four layers (BERT only). Finally, we feed the word's contextualised embeddings into the three algorithms of semantic change estimation described in Section 3. We then compute the Spearman correlation of the estimated change scores with the gold answers. This is the evaluation metric of Subtask 2, and we use it throughout our experiments.
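The three layer-reduction strategies and the evaluation metric can be sketched as follows (array shapes and variable names are illustrative assumptions):

```python
import numpy as np
from scipy.stats import spearmanr

def reduce_layers(layer_vectors: np.ndarray, strategy: str) -> np.ndarray:
    """layer_vectors: shape (n_layers, dim), one vector per hidden layer
    for a single occurrence of a target word."""
    if strategy == "top":
        return layer_vectors[-1]                # top layer only
    if strategy == "average_all":
        return layer_vectors.mean(axis=0)       # average of all layers
    if strategy == "average_top4":
        return layer_vectors[-4:].mean(axis=0)  # last four layers (BERT)
    raise ValueError(f"unknown strategy: {strategy}")

# Subtask 2 evaluation: Spearman correlation with the gold ranking.
predicted_scores = [0.41, 0.12, 0.77]  # toy change scores for 3 words
gold_scores = [0.35, 0.05, 0.90]       # toy gold annotations
rho, p_value = spearmanr(predicted_scores, gold_scores)
```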

5 Results

Our submission.

In our ‘official’ shared task submission we used top-layer ELMo embeddings with the cosine similarity change detection algorithm for all languages. English and German ELMo models were trained on the respective Wikipedia corpora. For Swedish and Latin, pre-trained ELMo models were not available, so we trained our own models on the union of the test corpora. This combination of architectures and algorithms was chosen based on our preliminary experiments with the available human-annotated semantic change datasets for English [9], German [20] and Russian [6]. With an average Spearman correlation of 0.37, this submission ranked 10th out of 34 teams. We were aware that the submitted setup was likely sub-optimal as it did not include the Fine-tuned model variant. After the official submission deadline, we finished training and fine-tuning all of our language models. Their systematic evaluation is the main contribution of this paper.

Organisers’ baselines
  Frequency difference (FD)                              -0.083
  Count vect., column inters., cosine (CNT+CI+CD)         0.144*

Word2vec CBOW cosine similarity
  Incremental                                             0.140
  Procrustes                                              0.392***

Contextualised embeddings             Top layer   Avg. all layers   Avg. top 4 layers

Cosine similarity (COS)
  BERT Pre-trained                    0.278**     0.233             0.229
  BERT Fine-tuned                     0.373**     0.320**           0.338**
  ELMo Pre-trained                    0.375**     0.344**           n/a
  ELMo Fine-tuned                     0.402**     0.389**           n/a
  ELMo Trained on test                0.370**     0.342**           n/a
  ELMo Incremental                    0.114*      0.127             n/a

Average pairwise distance (APD)
  BERT Pre-trained                    0.237**     0.163*            0.203*
  BERT Fine-tuned                     0.363***    0.241**           0.297*
  ELMo Pre-trained                    0.296**     0.172*            n/a
  ELMo Fine-tuned                     0.405***    0.406***          n/a
  ELMo Trained on test                0.338**     0.295***          n/a
  ELMo Incremental                    0.126**     -0.001*           n/a

Jensen-Shannon divergence (JSD)
  BERT Pre-trained                    0.181*      0.125             0.203*
  BERT Fine-tuned                     0.176*      0.223**           0.186**
  ELMo Pre-trained                    0.251*      0.196*            n/a
  ELMo Fine-tuned                     0.197*      0.156*            n/a
  ELMo Trained on test                0.225*      0.163*            n/a
  ELMo Incremental                    -0.037      -0.009            n/a

Table 1: Spearman correlation coefficients for Subtask 2, averaged over the four languages. The number of asterisks denotes the number of languages for which the correlation was statistically significant; averaging the top 4 layers was tested for BERT only.

Current results.

The average scores of all the tested configurations across the 4 languages are given in Table 1. We compare our scores to the organisers' baselines [19] and to the classical approach of calculating cosine similarity between CBOW word embeddings [17]. The CBOW models were used in two flavours: 1) ‘incremental’, where the model for the second time period was initialised with the weights of the model trained on the first period [12], and 2) ‘Procrustes’, where the two models were trained independently on the two corpora and then aligned using the orthogonal Procrustes transformation [10] (sketched below). Table 1 shows that no method achieves statistically significant correlation on all 4 languages, which attests both to the difficulty of the task and to the diversity of the test sets. CBOW Procrustes is a surprisingly strong approach, consistently outperforming the organisers' baselines. Only COS and APD obtain higher average scores, with fine-tuned ELMo models performing better than fine-tuned BERT.
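A minimal sketch of the alignment step behind the CBOW Procrustes baseline, assuming both embedding matrices are row-indexed by the same shared vocabulary (our illustration, not the exact baseline code):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def procrustes_change_scores(emb_t1: np.ndarray,
                             emb_t2: np.ndarray) -> np.ndarray:
    """Align emb_t1 to emb_t2, then score each shared word.

    emb_t1, emb_t2: shape (vocab_size, dim), rows index a shared vocabulary.
    """
    # Orthogonal map minimising the Frobenius norm of (emb_t1 @ R - emb_t2).
    rotation, _ = orthogonal_procrustes(emb_t1, emb_t2)
    aligned = emb_t1 @ rotation
    # Cosine distance between each word's aligned vectors = change score.
    sims = np.sum(aligned * emb_t2, axis=1) / (
        np.linalg.norm(aligned, axis=1) * np.linalg.norm(emb_t2, axis=1))
    return 1.0 - sims
```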

Judging only from the average correlation scores, contextualised embeddings do not seem to outshine their static counterparts, especially considering that both ELMo and BERT are more computationally demanding than CBOW. However, closer analysis of per-language results shows that in fact the contextualised approaches outperform the CBOW Procrustes baseline by a large margin for each of the shared task test sets. Table 2 features the scores obtained by our best-performing methods (COS and APD with top layer embeddings from fine-tuned ELMo and BERT) on the individual languages of the shared task. We also report performance on the GEMS test set [9]. The discrepancy between the averaged and the per-language results can be explained by properties of the test sets: APD works best on the English and Swedish sets, while COS yields the best scores for German and Latin.

With the right choice between APD and COS, contextualised embeddings can improve Spearman correlation coefficients by up to 50%. This is not a language-specific property: the English GEMS test set does not behave like the English test set from the shared task. In fact, one can observe 3 groups of test sets with regard to their statistical properties and the method they favour: group 1 (Latin and German) exhibits rather uniform gold score distributions and prefers COS; group 2 (English and Swedish) is characterised by more skewed gold score distributions and prefers APD; group 3 (GEMS) is in between, with no clear preference. Interestingly, the method which produces a more uniform predicted score distribution (APD) works better for the test sets with skewed gold distributions, and the method which produces a more skewed predicted score distribution (COS) works better for the uniformly distributed test sets (as can be seen in the Appendix). Furthermore, there is a perfect negative correlation (Spearman ρ = -1) between the median gold score of a test set and the performance of the APD algorithm with fine-tuned ELMo models on this test set. We currently do not have a plausible explanation for this behaviour.
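This relationship can be checked directly against the numbers reported in Table 2 and the Appendix; a small worked example with those values hard-coded:

```python
from scipy.stats import spearmanr

# Median gold scores (Appendix) and fine-tuned ELMo + APD results (Table 2)
# for English, Swedish, German, GEMS and Latin, in that order.
median_gold = [0.200, 0.203, 0.266, 0.267, 0.364]
apd_results = [0.605, 0.569, 0.560, 0.323, -0.113]

rho, _ = spearmanr(median_gold, apd_results)
print(rho)  # -1.0: a perfect negative rank correlation
```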

Table 2 also supports the earlier observation that ELMo models perform better than BERT in the LSCD task. The only test set for which this is not the case is Latin, while on GEMS, ELMo and BERT are on par. One possible explanation is that our ELMo models were pre-trained on lemmatised Wikipedia corpora and thus better fit the test corpora, which the organisers provided in lemmatised form. The BERT models were pre-trained on raw corpora, and fine-tuning them on lemmatised data proves less successful.

Algorithm                             English   German   Latin    Swedish   GEMS

Word2vec CBOW cosine similarity
  Incremental                         0.210     0.145    0.217    -0.012    0.424
  Procrustes                          0.285     0.439    0.387    0.458     0.235

Fine-tuned contextualised embeddings (top layer)
  ELMo, cosine similarity             0.254     0.740    0.360    0.252     0.323
  ELMo, average pairwise distance     0.605     0.560    -0.113   0.569     0.323
  BERT, cosine similarity             0.225     0.590    0.561    0.185     0.394
  BERT, average pairwise distance     0.546     0.427    0.372    0.254     0.243

Table 2: Spearman correlation per test set for our best methods.

In the post-evaluation phase of the shared task, we submitted predictions obtained with the optimal system configurations: fine-tuned ELMo + APD for English and Swedish, fine-tuned ELMo + COS for German, and fine-tuned BERT + COS for Latin. This submission reached an average Spearman correlation of 0.618 and, at the time of writing, is the best Subtask 2 result for SemEval-2020 Task 1.

6 Conclusion

Our experiments for the SemEval-2020 Shared Task 1 (Subtask 2) show that using contextualised embeddings to rank words by the degree of their semantic change produces strong correlation with human judgements, far outperforming static embeddings. Models pre-trained on large external corpora and fine-tuned on the historical test corpora produce the highest correlation results, with ELMo slightly but consistently outperforming BERT as a contextualiser.

Inverted cosine similarity between averaged contextualised embeddings and average pairwise cosine distance between contextualised embeddings turned out to be the best semantic change detection algorithms. An interesting finding is that the former performs best on test sets with a uniform gold score distribution, while the latter works best on test sets where the gold score distribution is skewed towards low values. This distinction is not related to the language of the test set. We believe this dependency between the statistical properties of gold scores and the performance of semantic change detection systems deserves further investigation in future work.

References

  • [1] R. Bamler and S. Mandt (2017) Dynamic word embeddings. In Proceedings of the 34th International Conference on Machine Learning (Volume 70), pp. 380–389.
  • [2] M. Del Tredici, R. Fernández, and G. Boleda (2019) Short-term meaning shift: a distributional exploration. In Proceedings of NAACL-HLT 2019 (Annual Conference of the North American Chapter of the Association for Computational Linguistics).
  • [3] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
  • [4] H. Dubossarsky, S. Hengchen, N. Tahmasebi, and D. Schlechtweg (2019) Time-Out: temporal referencing for robust modeling of lexical semantic change. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Florence, Italy.
  • [5] M. Fares, A. Kutuzov, S. Oepen, and E. Velldal (2017) Word vectors, reuse, and replicability: towards a community repository of large-text resources. In Proceedings of the 21st Nordic Conference on Computational Linguistics, pp. 271–276.
  • [6] V. Fomin, D. Bakshandaeva, J. Rodina, and A. Kutuzov (2019) Tracing cultural diachronic semantic shifts in Russian using word embeddings: test sets and baselines. Komp'yuternaya Lingvistika i Intellektual'nye Tekhnologii: Dialog conference, pp. 203–218.
  • [7] B. Frey and D. Dueck (2007) Clustering by passing messages between data points. Science 315 (5814), pp. 972–976.
  • [8] M. Giulianelli, M. Del Tredici, and R. Fernández (2020) Analysing lexical semantic change with contextualised word representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Forthcoming.
  • [9] K. Gulordava and M. Baroni (2011) A distributional similarity approach to the detection of semantic change in the Google Books Ngram corpus. In Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics, Edinburgh, UK, pp. 67–71.
  • [10] W. Hamilton, J. Leskovec, and D. Jurafsky (2016) Diachronic word embeddings reveal statistical laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1489–1501.
  • [11] R. Hu, S. Li, and S. Liang (2019) Diachronic sense modeling with deep contextualized word embeddings: an ecological view. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 3899–3908.
  • [12] Y. Kim, Y. Chiu, K. Hanaki, D. Hegde, and S. Petrov (2014) Temporal analysis of language through neural language models. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pp. 61–65.
  • [13] V. Kulkarni, R. Al-Rfou, B. Perozzi, and S. Skiena (2015) Statistically significant detection of linguistic change. In Proceedings of the 24th International Conference on World Wide Web, pp. 625–635.
  • [14] A. Kutuzov, L. Øvrelid, T. Szymanski, and E. Velldal (2018) Diachronic word embeddings and semantic shifts: a survey. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1384–1397.
  • [15] J. Lin (1991) Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37 (1), pp. 145–151.
  • [16] M. Martinc, S. Montariol, E. Zosa, and L. Pivovarova (2020) Capturing evolution in word usage: just add more clusters? In Companion Proceedings of the International World Wide Web Conference, pp. 20–24.
  • [17] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems 26, pp. 3111–3119.
  • [18] M. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer (2018) Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237.
  • [19] D. Schlechtweg, B. McGillivray, S. Hengchen, H. Dubossarsky, and N. Tahmasebi (2020) SemEval-2020 Task 1: unsupervised lexical semantic change detection. In Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. To appear.
  • [20] D. Schlechtweg, S. Schulte im Walde, and S. Eckmann (2018) Diachronic usage relatedness (DURel): a framework for the annotation of lexical semantic change. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 169–174.
  • [21] M. Straka and J. Straková (2017) Tokenizing, POS tagging, lemmatizing and parsing UD 2.0 with UDPipe. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 88–99.
  • [22] N. Tahmasebi, L. Borin, and A. Jatowt (2018) Survey of computational approaches to diachronic conceptual change. arXiv preprint arXiv:1811.06278.

Appendix A. Score distributions

In the left part of Figure 1, we show how different the 5 test sets are in terms of the distribution of their gold scores. The plot makes it clear that in some test sets the gold scores are skewed to the left, while others have a more uniform distribution. The central and right parts of Figure 1 show the distributions of the predicted scores produced by the APD and COS algorithms (with fine-tuned ELMo embeddings). COS tends to squeeze the majority of predictions near the lower boundary (no semantic change), resulting in a low median score. By contrast, APD distributes its predictions much more uniformly, with a higher median score. Counter-intuitively, skewed gold distributions favour uniform predictions, and vice versa.

Figure 1: Left: distribution of semantic change degree in the gold data; centre: distribution of scores predicted by the APD algorithm; right: distribution of scores predicted by the COS algorithm.

The grouping differences can be quantified with respect to the median gold score (after unit-normalisation). Figure 2 shows how the performance of COS and APD depends on the median gold score of a test set. Each dot is the performance of the COS or APD algorithm on one test set. The English and Swedish test sets are in the left part of the plot, with median gold scores of 0.200 and 0.203 respectively. German, GEMS and Latin are on the right, with median gold scores of 0.266, 0.267 and 0.364 respectively. There is a perfect negative Spearman correlation between the median gold scores of these 5 test sets and the performance of the APD semantic change detection algorithm (with fine-tuned ELMo embeddings) on each of them.

Figure 2: Performance of the COS and APD algorithms depending on the median gold score.