Recent years have seen an exponentially rising interest in computational Lexical Semantic Change (LSC) detection [ArXiV-Tahmasebi18, kutuzov-etal-2018-diachronic]. However, the field is lacking standard evaluation tasks and data. Almost all papers differ in how the evaluation is performed and what factors are considered in the evaluation. Very few are evaluated on a manually annotated diachronic corpus [mcgillivray_2019, perrone_2019, Schlechtwegetal19, e.g.]. This puts a damper on the development of computational models for LSC, and is a barrier for high-quality, comparable results that can be used in follow-up tasks.
We report the results of the first SemEval shared task on Unsupervised LSC detection.111https://languagechange.org/semeval. In the remainder of this paper, “CodaLab” refers to this URL.
We introduce two related subtasks for computational LSC detection, which aim to identify the change in meaning of words over time using corpus data. We provide a high-quality multilingual (English, German, Latin, Swedish) LSC gold standard relying on approximately 100,000 instances of human judgment. For the first time, it is possible to compare the variety of proposed models on relatively solid grounds and across languages, and to put previously reached conclusions on trial. We may now provide answers to questions concerning the performance of different types of semantic representations (such as token embeddings vs. type embeddings, and topic models vs. vector space models), alignment methods and change measures. We provide a thorough analysis of the submitted results uncovering trends for models and opening perspectives for further improvements. In addition to this, the CodaLab website will remain open to allow any reader to directly and easily compare their results to the participating systems. We expect the long-term impact of the task to be significant, and hope to encourage the study of LSC in more languages than are currently studied, in particular less-resourced languages.
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.
For the proposed tasks we rely on the comparison of two time-specific corpora and . While this simplifies the LSC detection problem, it has two main advantages: (i) it reduces the number of time periods for which data has to be annotated, so we can annotate larger corpus samples and hence more reliably represent the sense distributions of target words; (ii) it reduces the task complexity, allowing different model architectures to be applied to it, widening the range of possible participants. Participants were asked to solve two subtasks:
- Subtask 1
Binary classification: for a set of target words, decide which words lost or gained sense(s) between and , and which ones did not.
- Subtask 2
Ranking: rank a set of target words according to their degree of LSC between and .
For Subtask 1, consider the example of cell in Table 1, where the sense ‘phone’ is newly acquired from to because its frequency is 0 in and in . Subtask 2, instead, captures fine-grained changes in the two sense frequency distributions. For example, Table 1 shows that the frequency of the sense ‘chamber’ drops from to
, although it is not totally lost. Such a change will increase the degree of LSC for Subtask 2, but will not count as change in Subtask 1. The notion of LSC underlying Subtask 1 is most relevant to historical linguistics and lexicography, while the majority of LSC detection models are rather designed to solve Subtask 2. Hence, we expected Subtask 1 to be a challenge for most models. Knowing whether, and to what degree a word has changed is crucial in other tasks, e.g. aiding in understanding historical documents, searching for relevant content, or historical sentiment analysis. The full LSC problem can be seen as a generalization of these two tasks into multiple time points where also the type of change needs to be identified.
The task took place in a realistic unsupervised learning scenario. Participants were provided with trial and test data, but no training data. The public trial and test data consisted of a diachronic corpus pair and a set of target words for each language. Participants’ predictions were evaluated against a set of hidden gold labels. The trial data consisted of small samples from the test corpora (see below) and four target words per language to which we assigned binary and graded gold labels randomly. Participants could not use this data to develop their models, but only to test the data input format and the online submission format. For development data participants were referred to three pre-existing diachronic data sets: DURel[Schlechtwegetal18], SemCor LSC [SchlechtwegWalde20] and WSC [Tahmasebi17]. In the evaluation phase participants were provided with the test corpora and a set of target words for each language.222https://www.ims.uni-stuttgart.de/data/sem-eval-ulscd Participants were asked to train their models only on the corpora described in Table 2, though the use of pre-trained embeddings was allowed as long as they were trained in a completely unsupervised way, i.e., not on manually annotated data.
For English, we used the Clean Corpus of Historical American English (CCOHA) [davies2002corpus, Alatrashetal20], which spans 1810s–2000s. For German, we used the DTA corpus [dta2017] and a combination of the BZ and ND corpora [BZ2018, ND2018]. DTA contains texts from different genres spanning the 16th–20th centuries. BZ and ND are newspaper corpora jointly spanning 1945–1993. For Latin, we used the LatinISE corpus [mcgillivray-kilgarriff] spanning from the 2nd century B.C. to the 21st century A.D. For Swedish, we used the Kubhist corpus [KubHist], a newspaper corpus containing texts from 18th–20th century. The corpora are lemmatised and POS-tagged. CCOHA and DTA are spelling-normalized. BZ, ND and Kubhist contain frequent OCR errors [adesam2019exploring, hengchen2020vocab].
From each corpus we extracted two time-specific subcorpora , , as defined in Table 2. The division was driven by considerations of data size and availability of target words (see below). From these two subcorpora we then sampled the released test corpora in the following way: Sentences with tokens ( for Latin) were removed. German was downsampled to fit the size of by sampling all sentences containing target lemmas and combining them with a random sample of sentences not containing target lemmas of suited size. An equal procedure was applied to downsample English and . For Latin and Swedish the full amount of sentences was used. Finally, all tokens were replaced by their lemma, punctuation was removed and sentences were randomly shuffled within each of , .333Sentence shuffling and lemmatization were done for copyright reasons. Participants were provided with start and end positions of sentences. Where Kubhist did not provide lemmatization (through KORP [borin-etal-2012-korp]) we left tokens unlemmatized. Additional pre-processing steps were needed for English: for copyright reasons CCOHA contains frequent replacement tokens (10 x ‘@’). We split sentences around replacement tokens and removed them as a first step in the pre-processing pipeline. Further, because English frequently combines various POS in one lemma and many of our target words underwent POS-specific semantic changes, we concatenated targets in the English corpus with their broad POS tag (‘target_pos’). Also, the joint size of the CCOHA subcorpora had to be limited to 10M tokens because of copyright issues. Find a summary of the released test corpora in Table 2.
3.2 Target words
Target words are either: (i) words that changed their meaning(s) (lost or gained a sense) between and ; or (ii) stable words that did not change their meaning during that time.444A target word is represented by its lemma form. A large list of 100–200 changing words was selected by scanning etymological and historical dictionaries [Paul02XXI, svenska, oxford2009oxford] for changes within the time periods of the respective corpora. This list was then further reduced by one annotator who checked whether there were meaning differences in samples of 50 uses from and per target word. Stable words were then chosen by sampling a control counterpart for each of the changing words with the same POS and comparable frequency development between and , and manually verifying their diachronic stability as described above. Both types of words were annotated to obtain their sense frequency distributions as described below, which allowed us to verify the a-priori choice of changing and stable words. By balancing the target words for POS and frequency we aim to minimize the possibility that model biases towards these factors lead to artificially high performance [dubossarsky2017, SchlechtwegWalde20].
3.3 Hidden/True Labels
For Subtask 1 (binary classification) each target word was assigned a binary label () via manual annotation ( for stable, for change). For Subtask 2 each target word was assigned a graded label () according to their degree of LSC derived from the annotation ( means no change, means total change). The hidden labels were published in the post-evaluation phase.555https://www.ims.uni-stuttgart.de/data/sem-eval-ulscd-post Both types of labels (binary and graded) were derived from the sense frequency distributions of target words in and as obtained from the annotation process. For this, we adopt change notions similar to SchlechtwegWalde20 as described below.
We focused our efforts on annotating large and more representative samples for a limited number of words rather than annotating many words.666An indication that random samples with the chosen sizes can indeed be expected to be representative of the population is given by the results of the simulation study described in Appendix A: We were able to nearly fully recover the population clustering structure from the samples (average of adjusted mean rand index). In this section we describe the setup of the annotation for the modern languages (English, German, and Swedish) first. The setup for Latin is slightly different and we describe it later in this section.
We started with four annotators per language, but had to add additional annotators later because of a high annotation load and dropouts. The total number of annotators for English/German/Swedish was 9/8/5. All annotators were native speakers and present or former university students. For German we had two annotators with a background in historical linguistics, while for English and Swedish we had one such annotator. For each target word we randomly sampled 100 uses from each of and for annotation (total of 200 uses per target word).777We refer to an occurrence of a word in a sentence by ‘use of ’. If a target word had less than 100 uses, we annotated the full sample. We then mixed the use samples of a target word into a joint set and annotated using an extension of the DURel framework [Schlechtwegetal18, haettySurel:2019, Erk13]. DURel produces high inter-annotator agreement even between non-expert annotators relying on the simple notion of semantic relatedness. Pairs of word uses from and are annotated on a four-point scale from unrelated meanings (1) to identical meanings (4) (see Table 3). Our extension consisted in the sampling procedure of use pairs: instead of annotating a random sample of pairs and using comparison of their mean relatedness over time as a measure of LSC [Schlechtwegetal18], we aimed to sample pairs such that after annotation they span a sparsely connected usage graph combining the uses from , , where nodes represent uses and edges represent (the median of) annotator judgments (see Figure 1). This usage graph was then clustered into sets of uses expressing the same sense [Schutze1998]. By further distinguishing two subgraphs for , we got two clusterings with a shared set of clusters, because they were obtained on the same total graph [palla2007quantifying]. We then equated the two clusterings obtained for , with their respective sense frequency distributions , . The change scores followed immediately (see below). Note that this extension remained hidden from the annotators: as with DURel their only task was to judge the relatedness of use pairs. These were presented to annotators in randomized order.
4.1 Edge sampling
Retrieving the full usage graph is not feasible even for a small set of uses as this implies annotating edges. Hence, the main challenge with our annotation approach was to reduce the number of edges to annotate as much as possible, while keeping the necessary information needed to infer a meaningful clustering on the graph. We did this by annotating the data in several rounds. After each round the usage graph of a target word was updated with the new annotations and a new clustering was obtained.888If an edge was annotated by several annotators we took the median as an edge weight.
Based on this clustering we sampled the edges for the next round applying simple heuristics similar to Biemann-2013-system, a detailed description of which can be found in AppendixA. We spread the annotation load randomly over annotators making sure that roughly half of the use pairs were annotated by more than one annotator.
|3: Closely Related|
|2: Distantly Related|
4.2 Special Treatment of Latin
Latin poses a special case due to the lack of native speakers. We recruited 10 annotators with a high-level knowledge of Latin, and ranging from undergraduate students to PhD students, post-doctoral researchers, and more senior researchers. We selected a range of target words whose meaning had changed between the pre-Christian and the Christian era according to the literature [clackson] and in the pre-annotation trial we checked that both meanings were present in the corpus data. For each changed word, we selected a control word whose meaning did not change from the pre-Christian era and the Christian era, whose PoS was the same as the changed word, and whose frequency values in each of the two subcorpora ( and ) were in the following intervals: and , respectively, where ranged between 0.03 and 0.15 and and are the frequency of the changed word in , .999We experimented with increasing values of and chose the minimum for which a control word could be found for each changed word. In a trial annotation task our annotators reported difficulties and that they had to translate to their native language when comparing two excerpts of text. Hence, we decided to use a variation of the procedure described above which was introduced by Erk13. Instead of use pairs, annotators judged the relatedness between a use and a sense definition from a dictionary, on the DURel scale. The sense definitions were taken from the Latin portion of the Logeion online dictionary.101010https://logeion.uchicago.edu/. We selected 30 sample sentences for each of , . Due to the challenge of finding qualified annotators, each word was assigned only to one annotator. We treated sense definitions as additional nodes in a usage graph connected to uses by edges representing annotator judgments. Clustering was then performed as for the other languages.
|General||Subtask 1||Subtask 2|
The usage graphs we obtain from the annotation are weighted, undirected, sparsely observed and noisy. This poses a very specific problem that calls for a robust clustering algorithm. For this, we rely on a variation of correlation clustering [Bansal04] by minimizing the sum of cluster disagreements, i.e., the sum of negative edge weights within a cluster plus the sum of positive edge weights across clusters. To see this, consider Blank97XVI’s continuum of semantic proximity and the DURel relatedness scale derived from it, as illustrated in Table 3. In line with Blank, we assume that use pairs with judgments of 3 and 4 are more likely to belong to the same sense, while judgments of 1 and 2 are more likely to belong to different senses. Consequently, we shift the weight of all edges in a usage graph by . We refer to those edges with a weight as positive edges and edges with weights as negative edges . Let further be some clustering on , be the set of positive edges across any of the clusters in clustering and the set of negative edges within any of the clusters. We then search for a clustering that minimizes :
That is, we try to minimize the sum of positive edge weights between clusters and (absolute) negative edge weights within clusters. Minimizing is a discrete optimization problem which is NP-hard [Bansal04]. However, we have a relatively low number of nodes (), and hence, the global optimum can be approximated sufficiently with a standard optimization algorithm. We choose Simulated Annealing [Pincus1970] as we do not have strong efficiency constraints and the algorithm showed superior performance in a simulation study. More details on the procedure can be found in Appendix A. In order to reduce the search space, we iterate over different values for the maximum number of clusters. We also iterate over randomly as well as heuristically chosen initial clustering states.111111We used mlrose to perform the clustering [Hayes19].
This way of clustering usage graphs has several advantages: (i) It finds the optimal number of clusters on its own. (ii) It easily handles missing information (non-observed edges). (iii) It is robust to errors by using the global information on the graph. That is, a wrong judgment can be outweighed by correct ones. (iv) It directly optimizes an intuitive quality criterion on usage graphs. Many other clustering algorithms such as Chinese Whispers [Biemann2006] make local decisions, so that the final solution is not guaranteed to optimize a global criterion such as . (v) By weighing each edge with its (shifted) weight, respects the gradedness of word meaning. That is, edges with have less influence on than edges with . Finally, it showed superior performance to all other clustering algorithms we tested in a simulation study. (See Appendix A.)
4.4 Change scores
A sense frequency distribution (SFD) encodes how often a word occurs in each of its senses [mccarthy2004, lau-EtAl:2014, e.g.]. From the clustering we obtain two SFDs , for a word in the two corpora , , where each cluster corresponds to one sense.121212The frequency for sense in corpus is given by the number of uses from in the cluster corresponding to sense . Binary LSC for Subtask 1 of the word is then defined as
where and are the frequencies of sense in , and , are lower frequency thresholds aimed to avoid that small random fluctuations in sense frequencies caused by sampling variability or annotation error are misclassified as change [SchlechtwegWalde20]. According to Definition 2
, a word is classified as gaining a sense, if the sense is attested at mosttimes in the annotation sample from , but attested at least times in the sample from . (Similarly for words that lose a sense.) We set , for the smaller samples () in Latin and , for the larger samples () in English, German, Swedish. We make no distinction between words that gain vs. words that lose senses, both fall into the change class. Equally, we make no distinction between words that gain/lose one sense vs. words that gain/lose several senses.
For graded LSC in Subtask 2 we first normalize andand by dividing each element by the total sum of the frequencies of all senses in the respective distribution. The degree of LSC of the word is then defined as the Jensen-Shannon distance between the two normalized frequency distributions:
where the Jensen-Shannon distance is the symmetrized square root of the Kullback-Leibler divergence[Lin1991, DonosoS17]. is symmetric, ranges between and and is high if and assign very different probabilities to the same senses. Note that and not necessarily correspond to each other: a word may show no binary change but high graded change, or vice versa.
Figure 1 and Figure 2 show the annotated and clustered usage graphs for Swedish target ledning and German target Eintagsfliege. Nodes represent uses of the target word. Edges represent the median of relatedness judgments between uses (black/gray lines for positive/negative edges). Colors make clusters (senses) inferred on the full graph. After splitting the full graph into the two time-specific subgraphs for , we obtain the two sense frequency distributions , . From these we inferred the binary and the graded change value. The two words represent semantic changes indicative of Subtask 1 and 2 respectively: ledning gains a sense with rather low frequency in . Hence, it has binary change, but low graded change. For Eintagsfliege, however, its two main senses exist in both and , while their frequencies change dramatically. Hence, it has no binary change, but high graded change.
Find a summary of the annotation outcome for all languages and target words in Table 4. The final test sets contain between 31 (Swedish) and 48 (German) target words. Throughout the annotation we excluded several targets if they had a high number of ‘0’ judgments or needed a high number of further edges to be annotated. As previous studies, we report the mean of Spearman correlations between annotator judgments as agreement measure. Erk13 and Schlechtwegetal18 report agreement scores between 0.55 and 0.68, which is comparable to our scores.131313Note that because we spread disagreements from previous rounds in each round to further annotators, on average uses in later rounds become much harder to judge, which has a negative effect on agreement. Hence, for comparability reasons we report the agreement in the first round where no disagreement detection has taken place. The agreement across all rounds, calculated as weighted mean of agreements is 0.52/0.60/-/0.58. The clustering loss is the value of (Definition 1) divided by the maximum possible loss on the respective graph. It gives a measure of how well the graphs could be partitioned into clusters by the criterion.
The class distribution (column ‘LSC’) for Subtask 1 differs per language as a result of several target words being dropped during the annotation. In Latin the majority of target words have binary change, while in Swedish the majority has no binary change. This is also reflected in the mean scores for graded LSC in Subtask 2. Despite the excluded target words the frequency statistics are roughly balanced (FRQd, FRQm). However, we did not control the test sets for polysemy and there are strong correlations for English, German and Swedish between graded change and polysemy in Subtask 2 (PLYm). This correlation reduces for binary change in Subtask 1 but is still moderate for English and Swedish and remains high for German.
In total, roughly 100,000 judgments were made by annotators. For English/German/Swedish 50% of the use pairs were annotated by more than one annotator. In total, the annotation cost roughly € 20,000 for 1,000 hours – twice as much as originally budgeted.
All teams were allowed a total of 10 submissions, the best of which was kept for the final ranking in the competition. Participants had to submit predictions for both subtasks and all languages. A submission’s final score for each subtask was computed as the average performance across all four languages. During the evaluation phase, the leaderboard was hidden, as per SemEval recommendation.
5.1 Scoring Measures
For Subtask 1 submitted predictions were evaluated against the hidden labels via accuracy, given that we anticipated the class distribution for target words to be approximately balanced before the annotation. Scores are bounded between and . As the distribution turned out to be imbalanced for some languages, we also report F1-score in Appendix C. For Subtask 2, we used Spearman’s rank-order correlation coefficient with the gold rank. Spearman’s only considers the order of the words, the actual predicted change values were not taken into account. Ties are corrected by assigning the average of the ranks that would have been assigned to all the tied values to each value (e.g. two words sharing rank 1 both get assigned rank 1.5). Scores are bounded between (completely opposite to true ranking) and (exact match).
For both subtasks, we have two baselines: (i) Normalized frequency difference (Freq. Baseline) first calculates the frequency for each target word in each of the two corpora, normalizes it by the total corpus frequency and then calculates the absolute difference between these values as a measure of change. (ii) Count vectors with column intersection and cosine distance (Count Baseline) first learns vector representations for each of the two corpora, then aligns them by intersecting their columns and measures change by cosine distance between the two vectors for a target word. A Python implementation of both these baselines was provided in the starting kit. A third baseline, for Subtask 1, is the majority class prediction (Maj. Baseline), i.e., always predicting the ‘0’ class (no change).
6 Participating Systems
Thirty-three teams participated in the task, totaling 53 members. The teams submitted a total of 186 submissions. Given the large number of teams, we provide a summary of the systems in the body of this paper. A more detailed description of each participating system for which a paper was submitted is available in Appendix B. We also encourage the reader to read the full system description papers.
Participating models can be described as a combination of (i) a semantic representation, (ii) an alignment technique and (iii) a change measure. Semantic representations are mainly average embeddings (type embeddings) and contextualized embeddings (token
embeddings). Token embeddings are often combined with a clustering algorithm such as K-means, affinity propagation[frey2007clustering], (H)DBSCAN, GMM, or agglomerative clustering. One team uses a graph-based semantic network, one a topic model and several teams also propose ensemble models. Alignment techniques include Orthogonal Procrustes [Hamilton:2016, OP], Vector Initialization [Kim14, VI], versions of Temporal Referencing [Dubossarskyetal19, TR]
, and Canonical Correlation Analysis (CCA). A variety of change measures are applied, including Cosine Distance (CD), Euclidean Distance (ED), Local Neighborhood Distance (LND), Kullback-Leibler Divergence (KLD), mean/standard deviation of co-occurrence vectors, or cluster frequency. Table5 shows the type of system for every team’s best submission for both subtasks.
|Jiaxin & Jinan||.665||.649||.729||.700||.581||type|
|Jiaxin & Jinan||.518||.325||.717||.440||.588||type|
As illustrated by Table 5, UWB has the best performance in Subtask 1 for the average over all languages, closely followed by Life-Language, RPI-Trust141414The team submits an ensemble model. As all of the features are derived from the type vectors, we classify it as “type” in this section. and Jiaxin & Jinan.151515The team is named “LYNX” on the competition CodaLab. For Subtask 2, UG_Student_Intern performs best, followed by Jiaxin & Jinan and cs2020.161616The team is named “cs2020” and “cs2021” on the competition CodaLab. The combined number of submissions made by the two teams did not exceed the limit of 10. Across all systems, good performance in Subtask 1 does not indicate good performance in Subtask 2 (correlation between the system ranks is 0.22). However, and with the exception of Life-Language and cs2020, most top performing systems in Subtask 1 also excel in Subtask 2, albeit with a slight change of ranking.
Remarkably, all the top performing systems use static-type embedding models, and differ only in terms of their solutions to the alignment problem (Canonical Correlation Analysis, Orthogonal Procrustes, or Temporal Referencing). Interestingly, the top systems refine their models using one or more of the following steps: a) computing additional features from the embedding space; b) combining scores from different models (or extracted features) using ensemble models; c) choosing a threshold for changed words based on a distribution of change scores. We conjecture that these additional (and sometimes very original) post-processing steps are crucial for these systems’ success. We now briefly describe the top performing systems in terms of these three steps (for further details please see Appendix B). UWB (SGNS+CCA+CD) sets the average change score as the threshold (c). Life-Language (SGNS) represents words according to their distances to a set of stable pivot words in two unaligned spaces, and compares their divergence relative to a distribution of change scores obtained from unstable pivot words (a+c). RPI-Trust (SGNS+OP) extract features (a word’s cosine distance, change of distances to its nearest-neighbours and change in frequency), transform each word’s feature to a CDF score, and averages these probabilities (a+b+c). Jiaxin & Jinan
(SGNS+TR+CD) fits the empirical cosine distance change scores to a Gamma Quantile Threshold, and sets the 75% quantile as the threshold (c).UG_Student_Intern (SGNS+OP) measures change using Euclidean distance instead of cosine distance. cs2020 uses SGNS+OP+CD only as baseline method.
An important finding common to most systems is the difference between their performances across the four languages – systems that excel in one language do not necessarily perform well in another. This discrepancy may be due to a range of factors, including the difference in corpus size and the nature of the corpus data, as well as the relative availability of resources in some languages such as English over others. The Latin corpus, for example, covers a very long time span, and the lower performance of the systems on this language may be explained by the fact that the techniques employed, especially word token/type embeddings, have been developed for living languages and little research is available on their adaptation to dead and ancient languages. In general, dead languages tend to pose additional challenges compared to living languages [piotrowski], due to a variety of factors, including their less-resourced status, lack of native speakers, high linguistic variation and non-standardized spelling, and errors in Optical Character Recognition (OCR). Other factors that should be investigated are data quality [hill2019quantifying, van2020assessing]
: while English and Latin are clean data, German and Swedish present notorious OCR errors. The availability of tuned hyperparameters might have played a role as well: for German, some teams report following prior work such as Schlechtwegetal19. Finally, another factor for the discrepancy in performance between languages for any given system is not related to the nature of the systems nor of the data, but due to the fact that some teams focused on some languages, submitting dummy results for the others.
Type versus token embeddings
Tables 5 and 6 illustrate the gap in performance between type-based embedding models and the token-based ones. Out of the best 10 systems in Subtask 1/Subtask 2, 7/8 systems are based on type embeddings compared to only 2/2 systems that are based on token embeddings (same holds for each language individually). Contrary to the recent success of token embeddings [peters-etal-2018-deep] and to commonly held view that contextual embeddings “do everything better”, they are overwhelmingly outperformed by type embeddings in our task. This is most surprising for Subtask 1, because type embeddings do not distinguish between different senses, while token embeddings do. We suggest several possible reasons for these surprising results. The first is the fact that contextual embedding is a recent technology, and as such lacks proper usage conventions. For example, it is not clear whether a model should create an average token representation based on individual instances (and if so, which layers should be averaged), or if it should use clustering of individual instances instead (and if so, what type of clustering algorithm etc.). A second reason may be related to the fact that contextual models are pretrained and cannot exclusively be trained on the relevant historical resources (in contrast to type embeddings). As such, they carry additional, and possibly irrelevant, information that may mask true diachronic changes. The results may also be related to the specific preprocessing we applied to the corpora: (i) Only restricted context is available to the models as a result of the sentence shuffling. Usually, token-based models take more context into account than just the immediate sentence [martinc-etal-2020-context]. (ii) The corpora were lemmatized, while token-based models usually take the raw sentence as input. In order to make the input more suitable for token-based models, we also provide the raw corpora after the evaluation phase and will publish the annotated uses of the target words with additional context.171717https://www.ims.uni-stuttgart.de/data/sem-eval-ulscd
|System||Subtask 1||Subtask 2|
The influence of frequency
In prior work, the predictions of many systems have been shown to be inherently biased towards word frequency, either as a consequence of an increasing sampling error with lower frequency [dubossarsky2017] or by directly relying on frequency-related variables [schlechtweg-EtAl:2017:CoNLL, Schlechtwegetal19]. We have controlled for frequency when selecting target words (recall Table 4) in order to test model performance when frequency is not an indicating factor. Despite the controlled test sets we observe strong frequency biases for the individual models as illustrated for Swedish in Figure 3.181818Find the full set of analysis plots at https://www.ims.uni-stuttgart.de/data/sem-eval-ulscd-post. Models rather correlate negatively with the minimum frequency of target words between corpora (FRQm), and positively with the change in their frequency across corpora (FRQd). This means that models predict higher change for low-frequency words and higher change for words with strong changes in frequency. Despite their superior performance, type embeddings are more strongly influenced by frequency than token embeddings, probably because the latter are not trained on the test corpora limiting the influence of frequency. Similar tendencies can be seen for the other languages. For a range of models correlations reach values .
The influence of polysemy
We did not control the test sets for polysemy. As shown in Table 4, the change scores for both subtasks are moderately to highly correlated with polysemy (PLYm). Hence, it is expected that model predictions would be positively correlated with polysemy. However, these are in almost all cases lower than for the change scores and in some cases even negative (Latin and partly English). We conclude that model predictions are only moderately biased towards polysemy on our data.
Prediction difficulty of words
In order to quantify how difficult a target word is to predict we compute the mean error of all participants’ predictions.191919Because Subtask 2 is a ranking task, we divide the mean error by the expected error: since words in the middle have a lower expected error than words in the top or bottom. In Subtask 1, we find that words with higher rank tend to have higher error, in particular for English, see Figure 4 (left) where words with the gold class 1 have almost twice as high average error than words with gold class 0, and Latin. This is likely due to the tendency for systems to provide zero-predictions following the published baselines. For Subtask 2 (right), we find that the opposite holds; stable words are harder to predict for all languages but Swedish, where instead, it seems that the words in the middle of the rank are the hardest to classify. For English, the top three hardest to predict words are for Subtask 1 vs. Subtask 2 are land, head, edge vs. word, head, multitude. For German, they are packen, überspannen, abgebrüht vs. packen, Seminar, vorliegen. For Latin, they are cohors, credo, virtus vs. virtus, fidelis, itero. For Swedish, they are kemisk, central, bearbeta vs. central, färg, blockera. We could not identify a general pattern with regards to these words’ frequency or polysemy properties.
We presented the results of the first shared task on Unsupervised Lexical Semantic Change Detection. A wide range of systems were evaluated on two subtasks in four languages relying on a thoroughly annotated data set based on 100,000 human judgments. The task setup (unsupervised, no genuine development data, different corpora from different languages with very different sizes, varying class distributions) provided an opportunity to test models in heterogeneous learning scenarios, but was also very challenging. Hence, both subtasks remain far from solved. However, several teams reach high performances on both subtasks. Surprisingly, type embeddings outperformed token embeddings on both subtasks. We suspect that the potential of token embeddings could not yet fully unfold, as no canonical application concept is available and preprocessing was not optimal for token embeddings. We found that type embeddings are strongly influenced by frequency. Hence, one important challenge for future type-based models will be to avoid the frequency bias stemming from the corpus on which the type-based models are trained. An important challenge for token-based models will be to understand the reasons for their current low performance and to develop robust ways for their application. We found that change scores in our test sets strongly correlate with polysemy, while model predictions do not seem to be strongly influenced by polysemy. We believe that this question should be pursued in the future by controlling test sets for polysemy.
We hope that SemEval-2020 Task 1 makes a lasting contribution to the field of Unsupervised Lexical Semantic Change Detection by providing researchers with a standard evaluation framework and high-quality data sets. Despite the limited size of the test sets, many previously reached conclusions can now be tested more thoroughly and future models can be compared on a shared benchmark. Future tasks could focus on even more realistic setups involving more than two time periods or large-scale predictions on the vocabulary of the whole corpus.
The authors would like to thank Dr. Diana McCarthy for her valuable input to the genesis of this task. DS was supported by the Konrad Adenauer Foundation and the CRETA center funded by the German Ministry for Education and Research (BMBF) during the conduct of this study. This task has been funded in part by the project Towards Computational Lexical Semantic Change Detection supported by the Swedish Research Council (2019–2022; dnr 2018-01184), and Nationella språkbanken (the Swedish National Language Bank) – jointly funded by (2018–2024; dnr 2017-00626) and its 10 partner institutions, to NT. The list of potential change words in Swedish was provided by the research group at the Department of Swedish, University of Gothenburg that works with the Contemporary Dictionary of the Swedish Academy. This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1, to BMcG. Additional thanks go to the annotators of our datasets, and an anonymous donor.
Appendix A Annotation Details
a.1 Edge sampling
Retrieving the full usage graph is not feasible even for a small set of uses as this implies annotating edges. Hence, the main challenge with our annotation approach was to reduce the number of edges to annotate as few as possible, while keeping the necessary information needed to infer a meaningful clustering on the graph. We did this by annotating the data in several rounds. After each round the usage graph of a target word was updated with the new annotations and a new clustering was obtained.202020If an edge was annotated by several annotators we took the median as an edge weight. Based on this clustering we sampled the edges for the next round applying simple heuristics similar to Biemann-2013-system. We spread the annotation load randomly over annotators making sure that roughly half of the use pairs is annotated by more than one annotator.
In the first round we aimed to obtain a small but good reference set of uses which would serve to compare the rest of uses in the second round. Hence, we sampled 10% of the uses from and 30% of the edges from this sample by exploration, i.e., by a random walk through the sample graph guaranteeing that all nodes are connected by some path. Hence, the first clustering was obtained on a small but richly connected subgraph guaranteeing that we did not infer a larger number of clusters than present in the data in the first round, which would lead to a strong increase in annotation instances in the subsequent rounds. In all subsequent rounds we combined a combination step with an exploration step. A multi-cluster is a cluster with uses. The combination step combined each single use which is not yet member of a multi-cluster with a random use from each of the multi-clusters to which had not yet been compared. The exploration step consisted of a random walk on 30% of the edges from the non-assignable uses, i.e., uses which had already been compared to each of the multi-clusters but were not assigned to any of these by the clustering algorithm. This procedure slowly populated the graph while minimizing the annotation of redundant information. The procedure stopped when each cluster had been compared to each other cluster. We validated the procedure in a simulation study (see below).
Find an example of our annotation pipeline in Figure 5. As the annotation proceeds through the rounds the graph becomes more populated and the true cluster structure is found. In round 1 one multi-cluster is found. Hence, all remaining uses are compared with this cluster in round 2 by the combination step. In rounds 3 and 4 the exploration step discovers more clusters not found in the rounds before.
We validated the annotation procedure and the clustering algorithm described in Section 4 in a simulation study by simulating 40 ground truth usage graphs with zipfian sense frequency distributions covering roughly the frequency range of the majority our target words (50–1000). We introduced change to half of the target words by setting some of its senses’ frequencies to 0 in either of ,
. We then sampled from these graphs in several rounds as described above, simulated an annotation in each round with a normally distributed error added to judgments and compared the resulting clustering to the clustering of the true graph. The true clustering could be recovered with high accuracy (average ofadjusted mean rand index). We also used the simulation to predict the feasibility of the study and to tune parameters of the annotation such as sample sizes for nodes and edges. With the finally chosen parameters described in Section 4.1 the algorithm converged on average after 5 rounds and judgments per annotator. This was within the bounds of our time limits and financial budget.
We also tested the clustering algorithm against several standard techniques [Biemann2006, Blondel2008] and varied the optimization algorithm for . None of these variations performed compatible with our approach.
Appendix B Systems description
The team obtains contextual embeddings using BERT [devlin-etal-2019-bert], and extracts for every target word usage a word embedding using bert-as-service [xiao2018bertservice]. The team uses the difference of mean value of all cosine distances [salton1986introduction] of a target word between two corpora to detect change.
cs2020212121The team is named “cs2020” and “cs2021” on the competition CodaLab. The combined number of submissions made by the two teams did not exceed the limit of 10. [arefyev-zhikov-2020-cmce]
The team submits systems of two types: SGNS with an Orthogonal Procrustes alignment and cosine distance as a change measure, and a variation of a word-sense induction method by amrami2018word. For the latter, the team replaces BERT by a fine-tuned version of XLM-R [conneau2019unsupervised] and for every target word generates lexical substitutes following amrami2019towards, the vectors of the most probable of which are then clustered using agglomerative clustering, with cosine distance.
Discovery Team [martinc-etal-2020-context]
The team uses two types of word representations: average embeddings from SGNS [Mikolov13a, Mikolov13b] with an Orthogonal Procrustes alignment and contextual embeddings using language-specific BERT [devlin-etal-2019-bert]. For SGNS+OP the team compares vectors using cosine [salton1986introduction], while contextual embeddings see two different strategies: averaging of target-word embeddings, and clustering using k-means and affinity propagation [frey2007clustering]. Clusters are then compared across time using Jensen-Shannon divergence. The team also submits an ensemble model that uses all four strategies (SGNS+OP, averaging, k-means, AP).
This team’s system was designed for Subtask 1 and is based on Temporal Referencing [Dubossarskyetal19]
. They concatenated the corpora of different periods and treated the occurrences of target words in different periods as two independent tokens. Afterwards, they trained word embeddings on the joint corpus and compare the referenced vectors of each target word using cosine similarity. They used the Gensim package for training word2vec embeddings (Continuous Bag of Words with Negative Sampling). The number of negative samples was set to 5.
The team obtains GloVe average embeddings [pennington2014glove], then applies vector initialization (VI) alignment [Kim14] and measures change with cosine distance [salton1986introduction].
Ims222222The team is named “in vain” on the competition CodaLab. [kaiser-etal-2020-IMS]
The team obtains average embeddings from Skip-Gram with Negative Sampling [Mikolov13a, Mikolov13b, SGNS], then applies vector initialization (VI) alignment [Kim14] and measures change with cosine distance [salton1986introduction]. (SGNS+VI+CD) VI is an alignment strategy where the vector space learning model for is initialized with the vectors from . They take a cutoff approach to Subtask 1.
Jiaxin & Jinan232323The team is named “Lynx” on the competition CodaLab. [zhou-etal-2020-temporalteller]
The team uses two static word embedding models, SGNS and PPMI, in addition to averaged BERT representation, with different methods for alignment (Orthogonal Procrustes, Temporal Referencing, and Column Intersection). They compute the cosine distance between these representations and use it standardly for Subtask-2. For Subtask-1 they propose the Gamma Quantile Threshold as a novel approach too choose the change cutoff based on cosine-distance distributions. The team also provides a large scale comparison of two hyper-parameters, embedding dimensionality and context size window.
The team systematically combined existing models for lexical semantic change detection,242424 They considered various word representations (raw count vectors, positive Pointwise Mutual Information, Singular Value Decomposition, Random Indexing, and Skip-Gram with Negative Sampling), alignment methods (Column Intersection, Shared Random Vectors, Orthogonal Procrustes, Vector Initialization), and similarity and dispersion measures (cosine distance, local neighbourhood distance, frequency difference, type difference, and entropy difference).
They considered various word representations (raw count vectors, positive Pointwise Mutual Information, Singular Value Decomposition, Random Indexing, and Skip-Gram with Negative Sampling), alignment methods (Column Intersection, Shared Random Vectors, Orthogonal Procrustes, Vector Initialization), and similarity and dispersion measures (cosine distance, local neighbourhood distance, frequency difference, type difference, and entropy difference).and analyzed their score distribution. After defining a general threshold in an unsupervised way, they adjusted it to each of the models. They measured the models’ decision certainty and used it to filter the best models; they then applied the majority rule to select the binary class for Subtask 1 and used a weighted average of the scores of the best models to obtain the ranking for Subtask 2.
The team trains static embedding (fasttext) on each corpus separately. Target words are represented as probability distributions (after softmax) by their distance relative to fixed pivot words that are considered to be highly stable (according to some frequency heuristics). Semantic change is defined as the KL-divergence between two distributions of the target word. An empirical semantic change distribution is computed by randomly choosing pivot words. This enables to choose a threshold for the change score (Subtask 1). For Subtask 2, the change score is straightforwardly used for ranking.
Nlp@idsia252525The team is named “Vani Kanjirangat” on the competition CodaLab. [kanjirangat-etal-2020-sst]
The team obtains contextual embeddings using BERT [devlin-etal-2019-bert]. The team uses Euclidean distance between vectors of word uses to create clusters using k-means, and the silhouette method to define the number of clusters. They cluster word vectors using two strategies: joined corpora, and separately (within-corpus clusters).
The team uses multilingual contextualized word embeddings [devlin-etal-2019-bert]
to represent a word’s meaning. They then reduce the embedding dimensionality with either autoencoder or UMAP, and cluster the resulting representation with either GMM or HDBSCAN. For Subtask 1 they use the task’s specification directly on the cluster assignments, and for Subtask 2 they use the Jensen-Shannon Divergence for ranking.
This team focused on the problem of identifying when a target word has gained or lost senses. They train dynamic word embeddings using methods based on both explicit alignment such as Dynamic Word2Vec [yaoetal2018], and implicit alignment, like Temporal Random Indexing [basileetal2015] and Temporal Referencing [Dubossarskyetal19]. They also use different similarity measures to determine the extent of a word semantic change and compare the cosine similarity with Pearson Correlation and the neighborhood similarity [shoemarketal2019]
. They introduce a new method to classify changing vs. stable words by clustering the target similarity distributions via Gaussian Mixture Models.
The team uses Gaussian embedding [vilnis2014word] to represent words distributions instead of a points as in standard embedding models. Mean vectors are learned with word2vec using [Kim14] method. Covariance matrices are not trained, but encode the words’ frequency changes between two time points. Kullback-Leibler (KL) divergence is then applied to measure changes to a word’s distribution.
The team uses static word embedding aligned using Orthogonal Procrustes. They extract three features from these representations, cosine-distance between words in two time points, change to their nearest-neighbours and frequency change, which they use in an ensemble model. They adopt an anomaly detection approach to find the threshold for change words (Subtask1), and directly computing the rank on the ensemble score (Subtask 2).
The team uses pretrained cross-lingual contextualized embedding model, XLM-R [conneau2019unsupervised], which enables them to use the same model for all languages. For each target word they generated contextual representations (token representations) from the two corpora, and cluster them using K-Means++ with a fixed number of clusters (8). They use these cluster assignments as a proxy for the words’ senses, and compare their between the two corpora. For Subtask 1, they directly implement the criterion provided in the task reference for LSC, and for Subtask 2 they use the Jensen-Shannon Divergence as a score for the degree of change in these senses.
Tue262626The team is named “Schwaebischschwaetza_tue” on the competition CodaLab. [parnysheva-schwarz-2020-tue]
The team uses a cross-lingual pretrained contextual word embedding model [che2018towards] to represent a word’s meaning across the two corpora. They then cluster these representations using either K-Means or DBSCAN, and compare the words’ cluster assignments between the two time points. These cluster assignments allows that to tackle Subtask 1 directly (using the criterion defined by the organizers), and to compute Jensen-Shannon Divergence for Subtask 2.
UG Student Intern [pomsl-lyapin-2020-circe]
The team submits three types of model: average embeddings from SGNS [Mikolov13a, Mikolov13b] with an Orthogonal Procrustes alignment with vector comparison through Euclidean distance, contextual embeddings using BERT [devlin-etal-2019-bert] and a sentence time classification objective, and finally their ensemble model CIRCE.
The team obtains contextual embeddings of two types: ELMo [peters-etal-2018-deep] and BERT [devlin-etal-2019-bert]. The team measures change in three different ways: cosine distance [salton1986introduction] on averaged vectors, average pairwise distance between all contextual vectors of a same target, and Jensen-Shannon divergence applied on clusters created with affinity propagation [frey2007clustering].
University College Dublin [nulty-lillis-2020-ucdnet]
The team represents words as nodes in a weighted undirected graph that represents their word associations in the two time points. Similar to the Temporal-Referencing model [Dubossarskyetal19], the nodes represent all words in both time points, and only target words have two nodes. The edges’ weights are determined by the words’ ppmi scores that surpass a threshold. Resistance distance metric is used to evaluate the degree of change of the target words. For Subtask 1 a threshold is manually set to determine the change and stable words, and for Subtask 2 the distance metric is used straightforwardly in the ranking.
UoB272727The team is named “Eleri Sarsfield” on the competition CodaLab. [sarsfield-madabushi-2020-uob]
The team uses a topic model (Hierarchical Dirichlet processes [teh2004sharing, HDP]) to model senses for each word, and mimics cook2014novel in utilising the novelty score as a similarity measure.
The team obtains average embeddings from SGNS [Mikolov13a, Mikolov13b], and aligns models from the two periods using Canonical Correlation Analysis (CCA), and Orthogonal Procrustes (using VecMap [artetxe-etal-2018-robust]). The team measures change using cosine distance [salton1986introduction].
Appendix C Results with F1, Precision and Recall
|Jiaxin & Jinan||.583||.789||.646||.6||.562||.58||.583||.824||.683||.769||.769||.769||.381||1.0||.552|