Autoencoding Word Representations through Time for Semantic Change Detection

by Adam Tsakalidis, et al.

Semantic change detection concerns the task of identifying words whose meaning has changed over time. The current state-of-the-art detects the level of semantic change in a word by comparing its vector representation in two distinct time periods, without considering its evolution through time. In this work, we propose three variants of sequential models for detecting semantically shifted words, effectively accounting for the changes in the word representations over time, in a temporally sensitive manner. Through extensive experimentation under various settings with both synthetic and real data we showcase the importance of sequential modelling of word vectors through time for detecting the words whose semantics have changed the most. Finally, we take a step towards comparing different approaches in a quantitative manner, demonstrating that the temporal modelling of word representations yields a clear-cut advantage in performance.




1 Introduction

Identifying words whose lexical meaning has changed over time is a primary area of research at the intersection of natural language processing and historical linguistics. Through the evolution of language, the task of “semantic change detection” Tang (2018) can provide valuable insights on cultural evolution over time Michel et al. (2011). Measuring linguistic change more broadly is also relevant to understanding the dynamics in online communities Danescu-Niculescu-Mizil et al. (2013) and the evolution of individuals, e.g. in terms of their expertise McAuley and Leskovec (2013). Recent years have seen a surge in interest in this area since researchers are now able to leverage the increasing availability of historical corpora in digital form and develop algorithms that can detect the shift in a word’s meaning through time.

However, two key challenges in the field still remain. (a) Firstly, there is little work in existing literature on model comparison Schlechtweg et al. (2019); Dubossarsky et al. (2019); Shoemark et al. (2019). Partially due to the lack of labelled datasets, existing work assesses model performance primarily in a qualitative manner, without comparing results against prior work in a quantitative fashion. Therefore, it becomes impossible to assess what constitutes an appropriate approach for semantic change detection. (b) Secondly, on a methodological front, a large body of related work detects semantically shifted words by pairwise comparisons of their representations in distinct periods in time, ignoring the sequential modelling aspect of the task Hamilton et al. (2016); Tsakalidis et al. (2019). Since semantic change is a time-sensitive process Tsakalidis et al. (2019), considering intermediate vector representations in consecutive time periods can be crucial to improving model performance Shoemark et al. (2019). This type of modelling approach is very different from considering changes between two distinct bins of word representations Schlechtweg et al. (2018, 2020).

Here we tackle both of the above challenges by approaching semantic change detection as an anomaly identification task. We propose an encoder-decoder architecture for learning word representations across time. We hypothesize that once such a model has been successfully trained on temporally sensitive word sequences it will be able to accurately predict the evolution of the semantic representation of any word through time. Words that have undergone semantic change will be exactly those that yield the highest errors by the prediction model. Specifically we make the following contributions:

  • we develop three variants of an LSTM-based neural architecture which enable us to measure the level of semantic change of a word by tracking its evolution through time in a sequential manner. These are: (a) a current word representation autoencoder, (b) a future word representation decoder and (c) a hybrid approach combining (a) and (b);

  • we showcase the effectiveness of the proposed models under thorough experimentation with synthetic data;

  • we compare our models against current practices and competitive baselines using real-world data, demonstrating important gains in performance and highlighting the importance of sequential modelling of word vectors across time.

2 Related Work

One can distinguish two directions within the literature on semantic change Tang (2018); Kutuzov et al. (2018): (a) learning word representations over discrete time intervals and comparing the resulting vectors and (b) jointly learning the (diachronic) word representations across time Bamler and Mandt (2017); Rosenfeld and Erk (2018); Yao et al. (2018); Rudolph and Blei (2018). In this work, we focus on (a) due to scalability issues in (b) associated with learning diachronic representations from very large corpora. Our methods, presented in Section 3, are applicable to any type of pre-trained word vectors across time.

Related work in (a) derives word representations $W_i$, $i \in \{0, \dots, |T|-1\}$, across different time intervals and performs pairwise comparisons for different values of $i$. Early work used frequency- and co-occurrence-based representations Sagi et al. (2009); Cook and Stevenson (2010); Gulordava and Baroni (2011); Mihalcea and Nastase (2012); however, word2vec-based representations Mikolov et al. (2013) have been the standard practice in recent years. Due to the stochastic nature of word2vec, Orthogonal Procrustes (OP) is often first applied to the resulting vectors, aiming at aligning the pairwise representations Kulkarni et al. (2015); Hamilton et al. (2016); Del Tredici et al. (2019); Shoemark et al. (2019); Tsakalidis et al. (2019); Schlechtweg et al. (2019). Given two word matrices $W_k$, $W_j$ at times $k$ and $j$ respectively, OP finds the optimal transformation matrix $R = \arg\min_{\Omega:\, \Omega^\top \Omega = I} \lVert W_k \Omega - W_j \rVert_F$, and the semantic shift level of a word $w$ during the time interval $[k, j]$ is defined as the cosine distance between $W_k^w R$ and $W_j^w$ Hamilton et al. (2016). To tackle the drawback of basing the alignment of the matrices on the whole vocabulary, which assumes that the vast majority of the words remain stable across time, Tsakalidis et al. (2019) learn the alignment based only on a few semantically stable words across time. However, both approaches operate in a linear pairwise fashion, thus ignoring the time-sensitive, sequential and possibly non-linear nature of semantic change.
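The pairwise alignment step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in the cited work: the function names, the toy data and the choice to align the earlier matrix onto the later one are all hypothetical.

```python
import numpy as np

def procrustes_rotation(W_src, W_tgt):
    """Orthogonal R minimising ||W_src @ R - W_tgt||_F (closed form via SVD)."""
    U, _, Vt = np.linalg.svd(W_src.T @ W_tgt)
    return U @ Vt

def rowwise_cosine_distance(A, B):
    """Cosine distance between matching rows of two (|V|, d) matrices."""
    num = np.sum(A * B, axis=1)
    den = np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1)
    return 1.0 - num / den

# Toy data (hypothetical): 200 words, 8 dimensions.
rng = np.random.default_rng(0)
W_k = rng.normal(size=(200, 8))
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal map
W_j = W_k @ Q                                   # same meanings, rotated space
W_j[0] = -W_j[0]                                # word 0 flips direction

R = procrustes_rotation(W_k, W_j)
shift = rowwise_cosine_distance(W_k @ R, W_j)   # per-word semantic shift level
most_shifted = int(np.argmax(shift))            # index of the largest shift
```

Because the alignment is learned on the whole vocabulary, the single altered word barely affects the rotation, and it stands out with a cosine distance close to 2 while all other words stay near 0.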

By contrast, Kim et al. (2014), Kulkarni et al. (2015) and Shoemark et al. (2019) derive time series of a word’s level of semantic change and use those to detect semantically shifted words. Even though these methods incorporate temporal modelling, they still rely heavily either on the linear transformation Kulkarni et al. (2015); Shoemark et al. (2019) or on the similarity of a word with itself across time via continuous representation learning Kim et al. (2014). The latter has recently been demonstrated to lead to worse performance Shoemark et al. (2019).

Finally, the comparative evaluation of semantic change detection models is still in its infancy. Most related work assesses model performance based either on an artificial task Rosenfeld and Erk (2018); Shoemark et al. (2019) or on a few hand-picked examples Del Tredici et al. (2019), without cross-model comparison. Setting a benchmark for model comparison with real-world and sequential word representations would be of great importance to the field.

3 Methods

We formulate semantic change detection as an anomaly detection task. We hypothesize that the pre-trained word vectors $W_0$, …, $W_{|T|-1}$, where $W_t \in \mathbb{R}^{|V| \times d}$ ($|V|$: vocabulary size; $d$: word representation size), in a historical corpus over $|T|$ time periods evolve according to a non-linear function $f$. By providing an approximation for $f$, we obtain the level of semantic shift of a word $w$ at time $t$ by measuring the distance between its word representation $W_t^w$ and the representation predicted for it by the approximation of $f$. A key novelty of our work is that we approximate $f$ via a temporally sensitive model using a deep neural network architecture. [Footnote 1: Note that $t$ in $W_t$ represents the time period from which the associated word vectors are taken (e.g., the year 2000) and not the position of a word in a sentence.]

Shoemark et al. (2019) showed that accounting for the full sequence of word vectors is more appropriate for detecting semantically shifted words, compared to accounting only for the first and the last representations $W_0$ and $W_{|T|-1}$, as is the practice in most earlier work. Following Shoemark et al. (2019) we model word evolution by accounting for all intermediate representations across time.

Our modelling of the semantic change function is based on two components: (a) an autoencoder, which aims to reconstruct a word’s trajectory up to a given point in time (section 3.1); and (b) a future predictor, which aims to predict future representations of the word (section 3.2). The two models can be trained either individually or (c) in combination, in a multi-task setting (section 3.3).

Figure 1: Overview of our proposed model: the sequence of the representation of a set of word vectors (Vocabulary) over different time steps is encoded through two LSTM layers and then passed over to a reconstruction (3.1) and a future prediction decoder (3.2). The model is trained by utilising either decoder in isolation, or both of them in parallel (3.3).

3.1 Reconstructing Word Representations

Given an input sequence of vectors $[W_0, \dots, W_i]$ representing the vocabulary across the first $i+1$ points in time, the goal of the autoencoder is to reconstruct the input sequence as $[W_0^r, \dots, W_i^r]$ by minimising some loss function. Since the task of semantic change includes a natural temporal dimension, we model our autoencoder via an RNN architecture (see Figure 1). The encoder is composed of two LSTM layers Hochreiter and Schmidhuber (1997) with Dropout layers operating on their outputs, for regularisation Srivastava et al. (2014). The first layer encodes the input sequence $[W_0, \dots, W_i]$ and returns the hidden states to be fed as input to the second layer. The output of the second layer is the final encoded state, which is then copied $i+1$ times (once per time step to be reconstructed) and fed as input to the decoder. The decoder has the exact same architecture as the encoder, albeit with additional dense layers on top of the second LSTM layer, fed with the hidden states of the latter, to make the final reconstruction on the $i+1$ time steps. The model is trained by minimising the mean squared error (MSE) loss function, where $|V|$ is the vocabulary size and $d$ the word representation size:

$$\mathcal{L}_r = \frac{1}{(i+1)\,|V|\,d} \sum_{t=0}^{i} \lVert W_t^r - W_t \rVert_F^2 \qquad (1)$$

After training, the words that yield the highest error rates in a given test set of word representations through time are considered to be the ones whose semantics have changed the most during the given time period. This assumption is in line with prior work based on word alignment Hamilton et al. (2016); Tsakalidis et al. (2019), where the alignment error of a word indicates its level of semantic change.

3.2 Predicting Future Word Representations

Reconstructing the input sequence of word vectors can reveal which words have changed their semantics in the past (i.e., up to time $i$, see section 3.1). If we are interested in predicting changes in the semantics of future word representations (i.e., word vectors after time $i$), then we can set up a future word representation prediction task, based on a sequence-to-sequence architecture. Formally, given the sequence of past word representations $[W_0, \dots, W_i]$ over the first $i+1$ time points, we want to predict the future representations of the words in the vocabulary $[W_{i+1}^f, \dots, W_{|T|-1}^f]$, for a sequence of overall length $|T|$ (see Figure 1). We follow the same model architecture as described in section 3.1, with the only difference being the number of time steps ($|T|-i-1$) that are used in the decoder in order to make predictions. The model is trained using the MSE loss function $\mathcal{L}_f$, where $|V|$ is the vocabulary size and $d$ the word representation size:

$$\mathcal{L}_f = \frac{1}{(|T|-i-1)\,|V|\,d} \sum_{t=i+1}^{|T|-1} \lVert W_t^f - W_t \rVert_F^2 \qquad (2)$$

3.3 Joint Model

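The two models can be combined into a joint one, where, given an input sequence of representations of the vocabulary $[W_0, \dots, W_i]$ over the first $i+1$ points in time, the goal is both to (a) reconstruct the input sequence as $[W_0^r, \dots, W_i^r]$ and (b) predict the future word representations $[W_{i+1}^f, \dots, W_{|T|-1}^f]$. The complete model architecture is provided in Figure 1: the encoder is identical to the one used in 3.1 and 3.2. However, the bottleneck is now copied $|T|$ times and passed to the decoders of the reconstruction ($i+1$ times) and future prediction ($|T|-i-1$ times) components. The loss function used to tune the model parameters is the summation of Eq. 1 and 2, where $\mathcal{L}_r$ denotes the reconstruction loss (Eq. 1) and $\mathcal{L}_f$ the future prediction loss (Eq. 2):

$$\mathcal{L}_{rf} = \mathcal{L}_r + \mathcal{L}_f \qquad (3)$$
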
There are two main reasons for modelling semantic change in this multi-task setting. Firstly, each of the two decoders handles only part of the sequence and can therefore model it in a more fine-grained manner than the individual task models. Secondly, the joint model is insensitive to the value of $i$ (the last input time step) in Eq. 3 compared to Eq. 1 and 2. We provide more details on this aspect in 3.4.
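As a rough illustration of how the three losses relate, the sketch below computes a reconstruction loss, a future prediction loss and their sum with NumPy. It assumes that the input covers time steps 0..i and that the decoder outputs are already available as arrays; `mse`, `joint_loss` and the toy arrays are hypothetical names introduced here, and no network is actually trained.

```python
import numpy as np

def mse(pred, true):
    """Mean squared error over time steps, words and dimensions."""
    return float(np.mean((pred - true) ** 2))

def joint_loss(W, W_rec, W_fut, i):
    """W: true sequence (T, V, d); W_rec: (i + 1, V, d); W_fut: (T - i - 1, V, d)."""
    loss_r = mse(W_rec, W[: i + 1])     # reconstruction of steps 0..i
    loss_f = mse(W_fut, W[i + 1:])      # prediction of steps i+1..T-1
    return loss_r, loss_f, loss_r + loss_f

# Toy check with hypothetical decoder outputs that are off by a constant.
rng = np.random.default_rng(0)
W = rng.normal(size=(14, 5, 4))          # 14 time steps, 5 words, 4 dimensions
i = 6
loss_r, loss_f, loss_rf = joint_loss(W, W[: i + 1] + 0.1, W[i + 1:] + 0.2, i)
```

With constant offsets of 0.1 and 0.2 the two parts contribute 0.01 and 0.04 respectively, so the joint loss is 0.05 regardless of where the sequence is split.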

3.4 Model Equivalence

The three models perform different operations; however, setting the operational time periods appropriately in Eq. 1-3 can result in model equivalence. Specifically, to detect the words whose semantics have changed during $[0, |T|-1]$, the autoencoder in Eq. 1 needs to be fed and reconstruct the full sequence across $[0, |T|-1]$ (i.e., $i=|T|-1$). Reducing this interval (reducing $i$) would limit the autoencoder’s operational time period. On the other hand, an increase in the value of $i$ in Eq. 2 of the future prediction component shortens the time period during which it can detect the words whose semantics have changed the most: to account for the whole sequence (i.e., $[1, |T|-1]$), the future prediction model requires only the word representations in the first time interval ($W_0$) to then detect the words whose semantics have changed within $[1, |T|-1]$. Therefore, setting the parameter $i$ can be crucial for the performance of the two individual models. By contrast, the joint model in section 3.3 is able to detect the words that have undergone semantic change regardless of the value of $i$ (see Eq. 3), since it is still able to operate on the full sequence; we showcase these effects in section 5.2.
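The effect of the split point can be made concrete with a small helper that lists which time steps each decoder outputs. This is only a sketch: `decoder_coverage` and its dictionary keys are hypothetical, and it assumes the input consists of time steps 0..i out of T.

```python
def decoder_coverage(i, T):
    """Time steps each model's decoder outputs when fed steps 0..i out of T."""
    past = list(range(0, i + 1))
    future = list(range(i + 1, T))
    return {
        "reconstruction": past,           # autoencoder (section 3.1)
        "future": future,                 # future predictor (section 3.2)
        "joint": past + future,           # joint model (section 3.3)
    }

T = 14  # e.g. yearly steps for 2000-2013
full_autoencoder = decoder_coverage(T - 1, T)["reconstruction"]
full_future = decoder_coverage(0, T)["future"]
joint_any_i = [decoder_coverage(i, T)["joint"] for i in range(T - 1)]
```

The autoencoder only covers the whole period when it is fed the full sequence, the future predictor only when it is fed the first step, while the joint output spans all T steps for every split.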

4 Experiments with Synthetic Data

In this section we explore the three proposed models and their ability to detect words that have undergone semantic change on an artificial dataset. Tasks run on artificial data have been used in recent work for evaluation purposes Shoemark et al. (2019). We work with artificial data in the current section as a proof-of-concept of our proposed models; we compare against state-of-the-art models and other baseline methods with real-world data in the following sections. In particular, here we employ a longitudinal dataset of word representations (4.1) and artificially alter the representations of a small set of words across time (4.2). We then train (4.3) our models and evaluate them on the basis of their ability to identify those words that have undergone (artificial) semantic change (4.4).

4.1 Dataset

We make use of the UK Web Archive dataset introduced by Tsakalidis et al. (2019), which contains 100-dimensional representations of 47.8K words for each year in the period 2000-2013. These were generated by employing word2vec (i.e., skip-gram with negative sampling) Mikolov et al. (2013) on the documents published in each year independently. Each year corresponds to a time step in our modelling. The dataset contains 65 words whose meaning is known to have changed during the same time period, as indicated by the Oxford English Dictionary. These are removed for the purposes of this section, to avoid interference with the artificial data modelling. We use one subset (80%) of the remaining word representations across time for training our models and the rest (20%) for evaluation purposes.

4.2 Artificial Examples of Semantic Change

We generate artificial examples of words with changing semantics, by following a paradigm inspired by Rosenfeld and Erk (2018). We uniformly at random select 5% of the words in the test set to alter their semantics. For every selected “source” word $\alpha$, we select a “target” word $\beta$. Details about the selection process of the target words are provided below. We then alter the representation $w_\alpha^{(t)}$ of the source word at each point in time $t$ so that it shifts towards the representation $w_\beta^{(t)}$ of the target word at this point in time as:

$$\tilde{w}_\alpha^{(t)} = \lambda_t\, w_\alpha^{(t)} + (1 - \lambda_t)\, w_\beta^{(t)} \qquad (4)$$

In our modelling, $\lambda_t$ receives values between 0 and 1 and acts as a decay function that controls the speed of the change in the source word’s semantics towards the target. As in Rosenfeld and Erk (2018), we model $\lambda_t$ via a sigmoid function. Thus, the semantic representation of the word $\alpha$ is not altered during the first time points and then it gradually shifts towards the representation of word $\beta$ (for middle values of $t$), where it stabilizes towards the last time points. Since the duration of the semantic shift of a word may vary, we experiment under different scenarios, as presented below.

Different modelling approaches of (artificial) semantic change have been presented in Shoemark et al. (2019) – e.g., forcing a word to acquire a new sense while also retaining its original meaning. Here we opted for the “stronger” case of semantic shift in Eq. 4 as a proof of concept for our models. In the next section we experiment with uncontrolled (real-world) examples of semantic change, without the need for any hypothesis on the underlying function.

Conditioning on Target Words    The selection of the target words should be such that they allow the representation of the source word to change through time. This will not be the case if we select a pair of {source, target} words whose representations are very similar (e.g., synonyms). Thus, for each source word $\alpha$ we select uniformly at random a target word $\beta$ s.t. the cosine similarity of their representations at the initial time point (i.e., in year 2000) falls within a certain range determined by a threshold $c$. Higher values of $c$ enforce a lower semantic change level for $\alpha$ through time, since its representation will be shifted towards a similar word $\beta$, and vice versa. To assess the performance of our models across different semantic change levels, we experiment with varying values for $c$: {0.0, 0.1, …, 0.5}.
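A possible way to implement this conditioning is sketched below. The exact similarity range is not recoverable from the text, so the half-open band `(c - width, c]` with `width=0.1` is purely an assumption, as are the function name `pick_target` and the toy embedding matrix.

```python
import numpy as np

def pick_target(W0, source, c, rng, width=0.1):
    """Random target index whose initial cosine similarity to `source` lies in (c - width, c]."""
    unit = W0 / np.linalg.norm(W0, axis=1, keepdims=True)
    sims = unit @ unit[source]                       # similarity to every word
    candidates = np.flatnonzero((sims > c - width) & (sims <= c))
    candidates = candidates[candidates != source]    # never pick the source itself
    if candidates.size == 0:
        return None                                  # no word falls in the band
    return int(rng.choice(candidates))

# Toy usage (hypothetical data): 500 words, 20 dimensions, threshold c = 0.2.
rng = np.random.default_rng(0)
W0 = rng.normal(size=(500, 20))
c = 0.2
target = pick_target(W0, source=0, c=c, rng=rng)
```

Raising `c` restricts the candidates to words that are already close to the source, which is what makes the resulting artificial change harder to detect.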

Conditioning on Duration of Change    The duration of the semantic change affects the value of $\lambda_t$ in Eq. 4. We conventionally set $\lambda_{2007}=0.5$, s.t. the artificial word representation of a source word in the year 2007 (i.e., the middle of 2001-2013) is equal to $0.5\,(w_\alpha^{(2007)} + w_\beta^{(2007)})$. We then experiment with four different duration [start, end] ranges for the semantic change: (a) “Full” [2001-13], (b) “Half” [2005-10], (c) “OT” (One-Third) [2006-09] and (d) “Quarter” [2007-08]. A longer lasting semantic change duration implies a smoother transition of word $\alpha$ towards the meaning of word $\beta$, and vice versa (see Figure 2). By generating synthetic examples of varying lengths of semantic change duration we are able to measure the performance of the models under different conditions.
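The generation of a synthetic trajectory can be sketched as follows. This is an illustration under assumptions rather than the exact procedure: the mixing weight is taken to multiply the source representation and decay from 1 to 0, the sigmoid steepness is derived from a hypothetical `duration` parameter, and the trajectories are random toy data.

```python
import numpy as np

def decay_schedule(years, duration, midpoint=2007.0):
    """Sigmoid-shaped weight: near 1 early on, 0.5 at `midpoint`, near 0 late."""
    # Assumption: steepness is set so that the weight moves from about 0.99 to
    # about 0.01 over `duration` years centred on `midpoint`.
    k = 2.0 * np.log(99.0) / duration
    return 1.0 / (1.0 + np.exp(k * (np.asarray(years, dtype=float) - midpoint)))

def shift_towards_target(w_source, w_target, lam):
    """Mix source and target trajectories, both (T, d), with weights lam of shape (T,)."""
    lam = lam[:, None]
    return lam * w_source + (1.0 - lam) * w_target

years = np.arange(2000, 2014)
lam_full = decay_schedule(years, duration=13)      # slow, smooth change
lam_quarter = decay_schedule(years, duration=2)    # abrupt change

rng = np.random.default_rng(0)
w_a = rng.normal(size=(len(years), 5))   # toy source-word trajectory
w_b = rng.normal(size=(len(years), 5))   # toy target-word trajectory
w_shifted = shift_towards_target(w_a, w_b, lam_full)
```

A shorter `duration` gives a steeper curve around 2007, so the synthetic word keeps its original representation longer and then moves to the target within a few time steps.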

Figure 2: The different functions used to model $\lambda_t$ in Eq. 4, indicating the speed and duration of the semantic change of our synthetic examples (see section 4.2).

4.3 Artificial Data Experiment

Our task is to rank the words in the test set by means of their level of semantic change. We first train our three models on the training set and then we apply them on the test set. Finally, we measure the semantic change level of a word by means of the average cosine similarity between the predicted and actual word representations at each time step of the decoder. Model performance is assessed via rank-based metrics Basile and McGillivray (2018); Tsakalidis et al. (2019); Shoemark et al. (2019).

Model Training

The following is applicable to training of models for both the artificial and real-world data experiments. We define and train our models as follows:

  • seq2seq_r: the autoencoder (section 3.1) receives and reconstructs the full sequence of the word representations in the training set: $[W_{2000}, \dots, W_{2013}] \rightarrow [W_{2000}^r, \dots, W_{2013}^r]$.

  • seq2seq_f: the future prediction model (section 3.2) receives the representation of the words in the training set in the year 2000 and learns to predict the rest of the sequence: $[W_{2000}] \rightarrow [W_{2001}^f, \dots, W_{2013}^f]$.

  • seq2seq_rf: the multi-task model (section 3.3) is fed with the first half of the sequence of the word representations in the training set and jointly learns to (a) reconstruct the input sequence and (b) predict the word representations in the future: $[W_{2000}, \dots, W_{2006}] \rightarrow [W_{2000}^r, \dots, W_{2006}^r], [W_{2007}^f, \dots, W_{2013}^f]$.

We vary the input in terms of number of time steps for seq2seq_r and seq2seq_f so that the decoder in each model operates on the maximum possible output sequence, thus exploiting the semantic change of the words over the whole time period (see section 3.4). seq2seq_rf is expected to be insensitive to the number of input time steps, therefore we conventionally set it to half of the overall sequence. We keep 25% of our training set for validation purposes and train our models using the Adam optimiser Kingma and Ba (2015). Parameter selection is performed based on 25 trials using the Tree of Parzen Estimators algorithm of the hyperopt module Bergstra et al. (2013), by means of the maximum average (i.e., per time step) cosine similarity in the validation set. [Footnote 2: For the complete list of parameters tested, refer to Appendix A.]

Figure 3: Average rank $\mu_r$ of our models on the synthetic dataset for different values of the threshold $c$ and the four different periods of duration of semantic change: (a) Full, (b) Half, (c) OT, (d) Quarter (see 4.2). Lower values of $\mu_r$ indicate a better performance.

Testing and Evaluation    The following applies to experiments with both artificial and real-world data. After training, each model is applied to the test set, yielding its predictions for every word across time. [Footnote 3: Note that the future prediction model does not make a prediction for the first time step (year 2000).] The level of semantic change of a word in the test set is then calculated as the average cosine similarity between the actual and the predicted word representations through time Hamilton et al. (2016); Tsakalidis et al. (2019), with higher values indicating a better model prediction and thus a lower level of semantic change. The words are ranked in descending order of their level of semantic change, so the lowest rank indicates a word whose vector representation has changed the most (i.e., the most semantically shifted word). For evaluation purposes, similarly to Tsakalidis et al. (2019), we employ the average rank across all of the semantically changed words (in %, denoted as $\mu_r$), with lower scores indicating a better model. We prefer $\mu_r$ to the mean reciprocal rank, because the latter puts more weight on the first rankings. Since semantic change detection is an under-explored task in quantitative terms, we aim at getting better insights on model performance by working with an averaging metric such as $\mu_r$. For the same reason, in the current section we avoid using classification-based metrics that are based on a cut-off point (e.g., recall at $k$ Basile and McGillivray (2018)). We do make use of such metrics in the cross-model comparison in section 5.2.
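The scoring and ranking procedure can be sketched with NumPy as below. The function names and the ten-word toy example are hypothetical; ranks are expressed as a percentage of the test vocabulary, and the recall function anticipates the cut-off metric used later in the cross-model comparison.

```python
import numpy as np

def average_cosine_similarity(pred, true):
    """pred, true: (T, V, d). Per-word cosine similarity averaged over time, shape (V,)."""
    num = np.sum(pred * true, axis=2)
    den = np.linalg.norm(pred, axis=2) * np.linalg.norm(true, axis=2)
    return np.mean(num / den, axis=0)

def rank_percent(similarity):
    """Rank 1 goes to the lowest similarity (most changed word); returned in % of |V|."""
    order = np.argsort(similarity)                 # most changed first
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(order) + 1)
    return 100.0 * ranks / len(order)

def mu_r(rank_pct, changed):
    """Average rank (in %) of the words known to have changed."""
    return float(np.mean(rank_pct[changed]))

def recall_at_k(rank_pct, changed, k):
    """Share (in %) of changed words ranked within the top k% of the list."""
    return 100.0 * float(np.mean(rank_pct[changed] <= k))

# Toy example: 10 words, words 1 and 3 are the semantically changed ones.
similarity = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.95, 0.6, 0.85, 0.75, 0.65])
changed = np.array([1, 3])
rank_pct = rank_percent(similarity)
score = mu_r(rank_pct, changed)
rec10 = recall_at_k(rank_pct, changed, 10)
rec50 = recall_at_k(rank_pct, changed, 50)
```

In this toy case the two changed words occupy ranks 1 and 2 out of 10, giving an average rank of 15%, a recall of 50% at the 10% cut-off and 100% at the 50% cut-off.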

4.4 Results

Model Comparison    Figure 3 presents the results of the three models on our synthetic data across all ($c$, duration) combinations. seq2seq_rf performs consistently better than the individual reconstruction (seq2seq_r) and future prediction (seq2seq_f) models across all experimental settings, showcasing that combining the two models under a multi-task setting benefits from the joint and finer-grained parameter tuning of the two components. The autoencoder performs slightly better than seq2seq_f, a difference partially attributed to the fact that the autoencoder has a longer sequence to output (the full sequence rather than all but the first time step), which helps explore the temporal variation of the words more effectively.

Figure 4 shows the cosine similarity between the predicted and actual representation of each synthetic word per time step for the “Full” case when $c$=0.0 (highest level of change, see section 4.2). A darker colour indicates a better model prediction and thus a lower level of semantic change. seq2seq_r reconstructs the input sequence of the synthetic examples more accurately than the future prediction component (average cosine similarity per year ($\overline{\cos}$): .65 vs .50). It particularly manages to reconstruct the synthetic word representations during the years 2006-2008 ($\overline{\cos}$=.75), which are the points when $\lambda_t$ varies more rapidly (see Figure 2); however, it fails to reconstruct equally well their representations before ($\overline{\cos}$=.65) and after ($\overline{\cos}$=.59) this sharp change. On the contrary, seq2seq_f predicts more accurately the synthetic word representations during the first years ($\overline{\cos}$=.74), when the change in their semantics is minor, but completely fails after the semantic change is almost complete (i.e., when $\lambda_t \approx 0$; $\overline{\cos}$=.24). seq2seq_rf benefits from the individual components’ advantage: it appropriately reconstructs the artificial examples in the first years ($\overline{\cos}$=.85) so that their semantic shift is highlighted more clearly during ($\overline{\cos}$=.62) and after the process is almost complete ($\overline{\cos}$=.26). Finally, $\overline{\cos}$ in seq2seq_rf highly correlates with $\lambda_t$ (correlation of .987), potentially providing insights on how to measure the speed of semantic change of a word.

Figure 4: Cosine similarity between the actual and the predicted word vectors of the synthetic words that have undergone artificial semantic change (rows), per year (columns), for (a) seq2seq_r, (b) seq2seq_f and (c) seq2seq_rf. Lighter colours indicate poorer model performance, thus indicating that the corresponding words have undergone semantic change. Note that seq2seq_f does not make a prediction for the first time step (i.e., year 2000).

Effect of Conditioning Parameters    Regardless of the duration of the semantic change process and the model under consideration, an increase in the value of $c$ results in performance degradation. This is expected, since the increase of $c$ implies that the level of semantic change of the source words is lower, as discussed in 4.2, thus making the task of detecting them more difficult. Nevertheless, our worst performing model in the most challenging setting ($c$=0.5, Full, seq2seq_f) achieves $\mu_r$=28.17, which is clearly better than the average $\mu_r$ expected by a random baseline ($\mu_r$=50.00).

The decrease of the duration of semantic change has a positive effect on our models (see Figure 3). This is more evident in the cases of high values of $c$, where seq2seq_r ($\mu_r$: 26.09-18.21 in the Full-to-Quarter cases), seq2seq_f ($\mu_r$: 28.17-22.48) and seq2seq_rf ($\mu_r$: 20.38-13.09) all show important gains in performance. This indicates that the models can capture the semantic change in small sub-sequences of the time-series. Studying this effect in datasets with a longer time span is an important future direction.

5 Model Comparison with Real-World Data

Model        $\mu_r$               Rec@5                 Rec@10                Rec@50
             ’00-’13  avg±std      ’00-’13  avg±std      ’00-’13  avg±std      ’00-’13  avg±std

Past Work/Baselines
RAND         49.97    50.01±0.04   5.00     4.99±0.03    10.01    9.98±0.04    50.02    49.97±0.08
PROCR        30.63    28.51±2.68   18.46    14.32±5.00   27.69    29.94±4.64   78.46    80.47±3.79
PROCR_k      31.47    28.71±2.65   20.00    14.67±3.85   29.23    28.76±4.32   72.31    79.64±4.49
PROCR_kt     31.91    28.47±2.85   20.00    14.32±4.23   27.69    28.88±4.45   70.77    80.00±4.53
RF           30.01    30.45±4.15   10.77    15.62±4.30   21.54    27.46±7.16   78.46    77.63±6.42
LSTM_r       27.87    27.83±2.65   12.31    15.98±5.94   29.23    30.30±6.39   80.00    80.12±4.72
LSTM_f       28.62    28.61±3.47   16.92    17.40±5.60   32.31    31.83±6.07   76.92    78.82±4.83
GT_c         47.87    44.04±1.54   7.69     7.41±2.26    16.92    14.13±3.76   52.31    57.90±2.94
GT_β         38.09    36.16±1.74   13.85    14.83±4.14   24.62    23.36±3.94   66.15    69.37±3.26
PROCR_t      25.01    27.99±3.03   21.54    15.15±4.52   32.31    28.40±3.75   81.54    80.24±3.49

Our Models
seq2seq_r    24.75    28.36±3.38   21.54    19.05±4.47   38.46    29.94±6.64   84.62    81.42±4.64
seq2seq_f    23.86    27.17±4.16   26.15    22.01±6.72   46.15    34.32±10.13  84.62    81.18±5.07
seq2seq_rf   24.28    24.29±0.67   29.23    25.77±2.28   36.92    39.49±2.11   84.62    85.00±1.16

Table 1: Performance of our models and the baselines when operating on the entire time sequence (2000-2013; ’00-’13 columns) and averaged across time (2000-01, …, 2000-13; avg±std columns). PROCR and PROCR_k/PROCR_kt are based on the methods employed in Hamilton et al. (2016) and Tsakalidis et al. (2019), respectively; GT models are based on the work by Shoemark et al. (2019). The complete results in $\mu_r$ across all runs are provided in Appendix B.

5.1 Experimental Setting

We approach the task in a rank-based manner, as in section 4. However, here we are interested in (a) detecting uncontrolled real-world examples of semantic change in words and (b) comparing our models against strong baselines and current practices.

Data and Task    We make use of the UK Web Archive dataset (see section 4.1). We keep the same 80/20 train/test split as in section 4 and incorporate in the test set the 65 words with known changes in meaning according to the Oxford English Dictionary. We train our models as in section 4.3, aiming at detecting (i.e., ranking lower) the 65 words in the test set. We use $\mu_r$ (as in section 4) and additionally recall at $k$ (Rec@k, k=5%, 10%, 50%) as our evaluation metrics. Lower $\mu_r$ and higher Rec@k scores indicate better models.

Models    We compare the three variants from section 3 against four types of baselines:

– A random word rank generator (RAND). We report average metrics after 1K runs on the test set.

– Variants of Procrustes Alignment Schönemann (1966), as the standard practice in past work Hamilton et al. (2016); Shoemark et al. (2019); Tsakalidis et al. (2019): Given the word representations in two different years $W_0$, $W_i$, PROCR transforms $W_i$ into $W_i^*$ s.t. the squared differences between $W_0$ and $W_i^*$ are minimised. We also use the PROCR_k and PROCR_kt variants (Tsakalidis et al., 2019), which first detect the most stable words across either $W_0$, $W_i$ (PROCR_k) or $W_0$, …, $W_i$ (PROCR_kt) to learn the alignment on and then transform $W_i$ into $W_i^*$. Words are ranked based on the cosine distance between $W_0$, $W_i^*$.

– Models leveraging the first and last word representations only ($W_0$, $W_i$). We use a Random Forest Breiman (2001) regression model (RF) that predicts $W_i$, given $W_0$. We also use the same architectures presented in sections 3.1-3.2, trained on $W_0$, $W_i$ (ignoring the full sequence): LSTM_r reconstructs the sequence $[W_0, W_i]$; LSTM_f predicts $W_i$, given $W_0$, similarly to RF. Words are ranked in inverse order of the (average, for LSTM_r) cosine similarity between their predicted and actual representations.

– Models operating on the time series of distances. Given a sequence of vectors $W_0$, …, $W_i$, we construct the time series of cosine distances that result by PROCR Kulkarni et al. (2015); Shoemark et al. (2019). Then, we use two global trend models as in Shoemark et al. (2019): GT_c ranks the words by means of the absolute value of the Pearson correlation of their time series; GT_β fits instead a linear regression model for every word and ranks the words by the absolute value of the slope. Finally, we employ PROCR_t, ranking words based on the average cosine distance within $[0, i]$. [Footnote 4: We refrain from evaluating the GT models on the shortest intervals (a threshold of 2 time steps), due to the very short time interval that does not allow for correlations to appear in the data, leading to very poor performance.]
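The three distance-based scores can be sketched as follows. This is a hedged illustration, not the baselines' actual code: it assumes the time series of cosine distances has already been computed, that the correlation and the regression are taken against the time index, and the function name and toy series are hypothetical.

```python
import numpy as np

def global_trend_scores(distances):
    """distances: (V, n) cosine distances of each word over n time steps."""
    t = np.arange(distances.shape[1], dtype=float)
    t_c = t - t.mean()
    d_c = distances - distances.mean(axis=1, keepdims=True)
    slope = d_c @ t_c / np.sum(t_c ** 2)            # least-squares slope per word
    # Pearson correlation with time; undefined for a perfectly flat series.
    corr = d_c @ t_c / (np.linalg.norm(d_c, axis=1) * np.linalg.norm(t_c))
    return np.abs(corr), np.abs(slope), distances.mean(axis=1)

# Toy series: word 0 drifts steadily, word 1 is stable, word 2 oscillates.
distances = np.array([
    [0.10, 0.20, 0.30, 0.40],
    [0.30, 0.30, 0.31, 0.29],
    [0.50, 0.10, 0.50, 0.10],
])
abs_corr, abs_slope, avg_dist = global_trend_scores(distances)
top_by_corr = int(np.argmax(abs_corr))
```

Words would then be ranked by any one of the three returned scores; with only a handful of time steps, the correlation and slope estimates rest on very few points, which is consistent with the caveat in the footnote above.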

We report the performance of our models and baselines (a) when they operate on the full interval [2000-2013] and (b) averaged across all intermediate intervals [2000-2001, …, 2000-2013]. [Footnote 5: All parameters tested during the training process of our baselines are provided in Appendix A.] In the latter case, our models use additional (future) information compared to our baselines (e.g., when a seq2seq model with a future prediction decoder is fed with the word sequences of [2000, 2001], it makes a prediction for the years [2002, …, 2013]; such information cannot be leveraged by the baselines). Thus, for (b), we only perform intra-model (and intra-baseline) comparisons.

5.2 Results

Our models vs baselines    The results are shown in Table 1. The three models proposed in this work consistently achieve the lowest μ_r and highest Rec@k when working on the whole time sequence (’00-’13 columns in Table 1). The comparison between {seq2seq_r, LSTM_r} and {seq2seq_f, LSTM_f} in the years 2000-13 showcases the benefit of modelling the full sequence of the word representations across time, compared to using the first and last representations only. Overall, our models provide a relative boost of 4.6% in μ_r and [35.7%, 42.8%, 5.8%] in Rec@k (for k=[5, 10, 50]) compared to the best performing baseline. The seq2seq_f and seq2seq_rf models outperform the autoencoder (seq2seq_r) in most metrics, while seq2seq_rf yields the most stable results across all experiments. We explore these differences in detail in the last paragraph of this section.
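To make the two evaluation measures concrete, the sketch below ranks words by how poorly their predicted vectors match the actual ones and then computes an average rank and a recall at k. It is an illustration only: the function names and toy scores are hypothetical, and we assume that both the average rank and k are expressed as percentages of the vocabulary, which the magnitudes of the reported values suggest but this section does not restate.

```python
import math


def change_score(predicted, actual):
    """1 - cosine similarity: higher means the prediction fits worse."""
    dot = sum(p * a for p, a in zip(predicted, actual))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_a = math.sqrt(sum(a * a for a in actual))
    return 1.0 - dot / (norm_p * norm_a)


def rank_words(scores):
    """Order words from most to least changed."""
    return sorted(scores, key=scores.get, reverse=True)


def mean_rank_percent(ranking, shifted):
    """Average rank of the shifted words, as a percentage of the vocabulary."""
    position = {word: i + 1 for i, word in enumerate(ranking)}
    ranks = [position[w] for w in shifted]
    return 100.0 * sum(ranks) / (len(ranks) * len(ranking))


def recall_at_k_percent(ranking, shifted, k):
    """Share (in %) of shifted words found in the top k% of the ranking."""
    cutoff = int(len(ranking) * k / 100)
    top = set(ranking[:cutoff])
    return 100.0 * sum(w in top for w in shifted) / len(shifted)


# Toy example (hypothetical): 20 words with decreasing change scores,
# two of which are known to have shifted in meaning.
scores = {f"w{i}": 1 - i / 20 for i in range(20)}
shifted = ["w0", "w3"]
ranking = rank_words(scores)
print(mean_rank_percent(ranking, shifted))
print(recall_at_k_percent(ranking, shifted, 10))
```

A lower average rank and a higher recall both indicate that the shifted words sit nearer the top of the ranking.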

Intra-baseline comparison    Models operating only on the first and last word representations fail to outperform the simplistic Procrustes-based baselines in Rec@k, demonstrating again the weakness of operating in a non-sequential manner. The LSTM models achieve low μ_r on the 2000-13 experiments; however, their difference from the rest of the baselines in μ_r across all years is negligible. The intra-Procrustes model comparison shows that the benefit of selecting a few anchor words to learn a better alignment (PROCR_k, PROCR_kt), shown in Tsakalidis et al. (2019) when examining semantic change over two consecutive years, does not apply when examining a longer time period. Finally, contrary to Shoemark et al. (2019), we find that time-sensitive models operating on the word distances across time (GT_c, GT_β) perform worse than the baselines that leverage only the first and last word representations. We attribute this difference to the low number of time steps in our dataset, which does not allow the GT models to exploit long-term correlations (indeed, considering the average distance across time (PROCR_*) performs better); it also highlights the importance of leveraging the full word sequence across time.

Figure 5: μ_r of our models for varying values of the input cut-off year (Eq. 13). (Footnote 7: Example: For the year 2005 (x-axis), all models receive the word representations until 2005 as their input. Then, seq2seq_r reconstructs the word representations up to 2005, seq2seq_f predicts the future representations (2006, …, 2013) and seq2seq_rf performs both tasks jointly.)

Effect of input/output lengths   Figure 5 shows the μ_r of our three variants when we alter the length of the input and, therefore, also the length of the output (see section 3.4). The performance of seq2seq_r increases with the input size since by definition the decoder is able to detect words whose semantics have changed over a longer period of time (i.e., within the input interval, which grows with the cut-off year), while also modelling a longer sequence of a word’s representation through time. On the contrary, the performance of seq2seq_f increases as the number of input time steps decreases. This is expected since, as the cut-off year moves earlier, seq2seq_f encodes a shorter input sequence and the decoding (and hence the semantic change detection) is applied on the remaining (and larger number of) time steps. These findings provide empirical evidence that both models can achieve better performance if trained over longer sequences of time steps. Finally, the stability of seq2seq_rf showcases its input length-invariant nature, which is also clearly evident in all of the averaged results (standard deviation in the avg±std columns) in Table 1: in its worst performing setting, seq2seq_rf still manages to achieve results that are close to the best performing model (μ_r=25.17, Rec@k=[21.54, 36.92, 83.08] for the three thresholds) and always better than (or equal to) the best performing baseline shown in Table 1 in Rec@k. This is a very attractive aspect of the model as it removes the need to manually define the number of time steps to be fed to the encoder.
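The relationship between the input cut-off and what each variant decodes can be sketched as below. This is only an illustration of the year windows described in this section (the function and key names are hypothetical); it says nothing about the network architecture itself.

```python
def decoder_targets(years, cutoff):
    """Years decoded by each variant when the input ends at `cutoff`."""
    idx = years.index(cutoff)
    seen, unseen = years[: idx + 1], years[idx + 1:]
    return {
        "reconstruction": seen,    # autoencoder: reconstruct the input years
        "future": unseen,          # future prediction: the remaining years
        "joint": (seen, unseen),   # both tasks, one decoder output each
    }


YEARS = list(range(2000, 2014))  # 2000, ..., 2013
targets = decoder_targets(YEARS, 2006)
print(targets["reconstruction"])
print(targets["future"])
```

Moving the cut-off later lengthens the reconstruction window and shortens the future window, while the joint variant always covers all fourteen years between its two outputs.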

6 Conclusion and Future Work

We have proposed three variants of sequential models for semantic change detection that effectively exploit the full sequence of a word’s representation through time to determine its level of semantic change. Through extensive experimentation based on synthetic and real-world data, we have demonstrated that the proposed models can surpass state-of-the-art results on the UK Web Archive Dataset. Importantly, their performance increases alongside the duration of the time period under study, confidently outperforming competitive baselines and common practices in the literature on semantic change.

In future work we plan to incorporate anomaly detection approaches operating on the model’s predicted word vectors instead of considering the average similarity between the predicted and the actual representations as the level of semantic change of a word. Employing contextual word representations Devlin et al. (2019); Hu et al. (2019) can also be of high importance in detecting new senses of the words across time. Finally, we plan to investigate different architectures, such as Variational Autoencoders Kingma and Welling (2014), and test our models in datasets of different duration and in different languages to provide clearer evidence on their effectiveness.


Acknowledgements

This work was supported by The Alan Turing Institute (grant EP/N510129/1) and by a Turing AI Fellowship to Maria Liakata, funded by the Department of Business, Energy & Industrial Strategy.


References

  • R. Bamler and S. Mandt (2017) Dynamic Word Embeddings. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 380–389. Cited by: §2.
  • P. Basile and B. McGillivray (2018) Exploiting the Web for Semantic Change Detection. In International Conference on Discovery Science, pp. 194–208. Cited by: §4.3, §4.3.
  • J. Bergstra, D. Yamins, and D. D. Cox (2013) Making a Science of Model Search: Hyperparameter Optimization in Hundreds of Dimensions for Vision Architectures. In Proceedings of the 30th International Conference on Machine Learning, pp. 115–123. Cited by: §4.3.
  • L. Breiman (2001) Random Forests. Machine Learning 45 (1), pp. 5–32. Cited by: §5.1.
  • P. Cook and S. Stevenson (2010) Automatically Identifying Changes in the Semantic Orientation of Words. In Proceedings of the Seventh conference on International Language Resources and Evaluation, Cited by: §2.
  • C. Danescu-Niculescu-Mizil, R. West, D. Jurafsky, J. Leskovec, and C. Potts (2013) No Country for Old Members: User Lifecycle and Linguistic Change in Online Communities. In Proceedings of the 22nd International Conference on World Wide Web, pp. 307–318. Cited by: §1.
  • M. Del Tredici, R. Fernández, and G. Boleda (2019) Short-Term Meaning Shift: A Distributional Exploration. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2069–2075. Cited by: §2, §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. Cited by: §6.
  • H. Dubossarsky, S. Hengchen, N. Tahmasebi, and D. Schlechtweg (2019) Time-Out: Temporal Referencing for Robust Modeling of Lexical Semantic Change. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 457–470. Cited by: §1.
  • K. Gulordava and M. Baroni (2011) A Distributional Similarity Approach to the Detection of Semantic Change in the Google Books Ngram Corpus. In Proceedings of the GEMS 2011 Workshop on Geometrical Models of Natural Language Semantics, pp. 67–71. Cited by: §2.
  • W. L. Hamilton, J. Leskovec, and D. Jurafsky (2016) Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1489–1501. Cited by: §1, §2, §3.1, §4.3, §5.1, Table 1.
  • S. Hochreiter and J. Schmidhuber (1997) Long Short-Term Memory. Neural Computation 9 (8), pp. 1735–1780. Cited by: §3.1.
  • R. Hu, S. Li, and S. Liang (2019) Diachronic Sense Modeling with Deep Contextualized Word Embeddings: An Ecological View. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 3899–3908. Cited by: §6.
  • Y. Kim, Y. Chiu, K. Hanaki, D. Hegde, and S. Petrov (2014) Temporal Analysis of Language through Neural Language Models. In Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, pp. 61–65. Cited by: §2.
  • D. P. Kingma and J. Ba (2015) Adam: A Method for Stochastic Optimization. In 3rd International Conference on Learning Representations, ICLR 2015, Conference Track Proceedings, Cited by: §4.3.
  • D. P. Kingma and M. Welling (2014) Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Conference Track Proceedings, Cited by: §6.
  • V. Kulkarni, R. Al-Rfou, B. Perozzi, and S. Skiena (2015) Statistically Significant Detection of Linguistic Change. In Proceedings of the 24th International Conference on World Wide Web, pp. 625–635. Cited by: §2, §2, §5.1.
  • A. Kutuzov, L. Øvrelid, T. Szymanski, and E. Velldal (2018) Diachronic Word Embeddings and Semantic Shifts: A Survey. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1384–1397. Cited by: §2.
  • J. J. McAuley and J. Leskovec (2013) From Amateurs to Connoisseurs: Modeling the Evolution of User Expertise through Online Reviews. In Proceedings of the 22nd International Conference on World Wide Web, pp. 897–908. Cited by: §1.
  • J. Michel, Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, J. P. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, et al. (2011) Quantitative Analysis of Culture Using Millions of Digitized Books. Science 331 (6014), pp. 176–182. Cited by: §1.
  • R. Mihalcea and V. Nastase (2012) Word Epoch Disambiguation: Finding how Words Change over Time. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 259–263. Cited by: §2.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed Representations of Words and Phrases and their Compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119. Cited by: §2, §4.1.
  • A. Rosenfeld and K. Erk (2018) Deep Neural Models of Semantic Shift. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 474–484. Cited by: §2, §2, §4.2.
  • M. Rudolph and D. Blei (2018) Dynamic Embeddings for Language Evolution. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pp. 1003–1011. Cited by: §2.
  • E. Sagi, S. Kaufmann, and B. Clark (2009) Semantic Density Analysis: Comparing Word Meaning across Time and Phonetic Space. In Proceedings of the Workshop on Geometrical Models of Natural Language Semantics, pp. 104–111. Cited by: §2.
  • D. Schlechtweg, A. Hätty, M. Del Tredici, and S. Schulte im Walde (2019) A Wind of Change: Detecting and Evaluating Lexical Semantic Change across Times and Domains. arXiv preprint arXiv:1906.02979. Cited by: §1, §2.
  • D. Schlechtweg, S. Schulte im Walde, and S. Eckmann (2018) Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 169–174. Cited by: §1.
  • D. Schlechtweg, B. McGillivray, S. Hengchen, H. Dubossarsky, and N. Tahmasebi (2020) SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. To appear in Proceedings of the 14th International Workshop on Semantic Evaluation, Barcelona, Spain. Cited by: §1.
  • P. H. Schönemann (1966) A Generalized Solution of the Orthogonal Procrustes Problem. Psychometrika 31 (1), pp. 1–10. Cited by: §5.1.
  • P. Shoemark, F. Ferdousi Liza, D. Nguyen, S. A. Hale, and B. McGillivray (2019) Room to Glo: A Systematic Comparison of Semantic Change Detection Approaches with Word Embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 66–76. Cited by: §1, §2, §2, §2, §3, §4.2, §4.3, §4, §5.1, §5.1, §5.2, Table 1.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: A Simple Way to Prevent Neural Networks from Overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §3.1.
  • X. Tang (2018) A State-of-the-Art of Semantic Change Computation. Natural Language Engineering 24 (5), pp. 649–676. Cited by: §1, §2.
  • A. Tsakalidis, M. Bazzi, M. Cucuringu, P. Basile, and B. McGillivray (2019) Mining the UK Web Archive for Semantic Change Detection. In Recent Advances in Natural Language Processing, Cited by: §1, §2, §3.1, §4.1, §4.3, §4.3, §5.1, §5.2, Table 1.
  • Z. Yao, Y. Sun, W. Ding, N. Rao, and H. Xiong (2018) Dynamic Word Embeddings for Evolving Semantic Discovery. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pp. 673–681. Cited by: §2.

Appendix A List of Hyperparameters

Our models

We test the following hyper-parameters for our seq2seq models:

  • encoder_LSTM_1, number of units: [32, 64, 128, 256, 512]

  • encoder_LSTM_2, number of units: [32, 64]

  • decoder_LSTM_1, number of units: [32, 64] (x2, for the case of seq2seq_rf: for (a) the autoencoding and (b) the future prediction component)

  • decoder_LSTM_2, number of units: [32, 64, 128, 256, 512] (x2, for the case of seq2seq_rf)

  • dropout rate in dropout layers: [.1, .25, .5]

  • batch size: [32, 64, 128, 256, 512, 1024]

  • number of epochs: [10, 20, 30, 40, 50]

We optimise our parameters using the Adam optimiser in Keras, with the default learning rate (0.001).
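For orientation, the search space listed above can be written down as a plain dictionary. The sketch below is an assumption-laden illustration: the key names are hypothetical, the two encoder and two decoder entries are assumed to refer to stacked layers, and the section does not state whether the space was searched exhaustively or sampled, so the random sampling shown here is illustrative only.

```python
import itertools
import random

# Hypothetical key names; values copied from the list above.
SEARCH_SPACE = {
    "encoder_lstm_1_units": [32, 64, 128, 256, 512],
    "encoder_lstm_2_units": [32, 64],
    "decoder_lstm_1_units": [32, 64],
    "decoder_lstm_2_units": [32, 64, 128, 256, 512],
    "dropout_rate": [0.1, 0.25, 0.5],
    "batch_size": [32, 64, 128, 256, 512, 1024],
    "epochs": [10, 20, 30, 40, 50],
}


def all_configs(space):
    """Yield every combination of values in the search space."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def sample_configs(space, n, seed=0):
    """Draw n configurations uniformly at random (illustrative strategy)."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in space.items()} for _ in range(n)]


n_configs = sum(1 for _ in all_configs(SEARCH_SPACE))
trials = sample_configs(SEARCH_SPACE, 3)
print(n_configs)
```

Counting the combinations makes the size of the grid explicit, which is why a sampled or guided search is a common choice for spaces of this kind.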


For our baselines, we experiment with the following hyper-parameters per model:

  • LSTM: we follow the exact same settings as in our models.

  • RF: we experiment with the number of trees ([50, 100, 150, 200]) and select the best model based on the maximum average cosine similarity across all predictions, as in our models.

  • PROCR_k, PROCR_kt: we experiment with different rates k ∈ [.001, .01, .05, .1, .2, …, .9] of anchor (or diachronic anchor) words on the basis of the size of the test set. We display in our results the best model based on its average performance on the test set (k=.9 for PROCR_k, k=.5 for PROCR_kt).

  • GT_c: we explore different correlation metrics (Spearman Rank, Pearson Correlation, Kendall Tau) and display the best one (Pearson Correlation) on the basis of its average performance on the test set across all experiments. Due to the very poor performance of all metrics when operating on a small number of time steps (the intervals ending in 2001 and 2002), we only provide the results in Table 1 (avg±std columns) when these models operate on longer sequences.

  • PROCR, PROCR_*, GT_β, RAND: there are no hyper-parameters to tune in these models.

Appendix B Complete Results on Real Data

The complete results (μ_r) that were summarised in Table 1 are provided in Table 2. The interpretation of the “year” for each model is provided in Table 3.

year     PROCR  PROCR_k  PROCR_kt  RF     LSTM_r  LSTM_f  GT_c   GT_β   PROCR_*  seq2seq_r  seq2seq_f  seq2seq_rf
2001     34.26  34.11    34.43     37.35  33.67   36.43   -      -      34.26    33.66      23.86      23.67
2002     32.70  32.66    32.41     34.94  31.20   32.98   -      -      32.98    34.06      23.52      23.42
2003     29.24  29.51    29.41     36.94  30.32   32.57   37.59  43.34  31.02    32.44      23.39      23.47
2004     25.46  25.45    25.03     27.25  24.66   26.08   35.43  42.98  28.68    30.01      23.84      23.50
2005     29.04  29.10    28.65     31.43  28.98   29.17   38.47  44.47  28.23    29.05      24.21      23.93
2006     27.73  28.36    27.38     28.86  26.61   26.55   38.74  44.45  27.71    28.58      24.77      24.28
2007     26.70  26.95    26.64     30.16  25.45   26.39   34.16  41.93  26.98    28.09      25.62      25.17
2008     28.30  28.23    27.87     32.77  26.25   27.86   35.02  42.86  26.72    27.38      26.53      24.44
2009     26.10  26.22    25.81     23.27  24.97   23.73   34.23  43.24  26.15    25.71      27.30      24.72
2010     27.95  28.09    27.38     28.25  28.18   28.19   36.04  44.77  25.81    25.84      29.50      24.83
2011     25.71  25.91    25.74     28.15  26.07   26.24   34.78  43.99  25.31    24.65      30.91      25.14
2012     26.77  27.12    27.44     26.51  27.52   27.12   35.18  44.53  24.94    24.42      33.65      24.93
2013     30.63  31.47    31.91     30.01  27.87   28.62   38.09  47.87  25.01    24.75      36.09      -
AVERAGE  28.51  28.71    28.47     30.45  27.83   28.61   36.16  44.04  27.99    28.36      27.17      24.29
Table 2: Complete μ_r scores across all runs.
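As a quick sanity check on Table 2, the per-year scores of the three seq2seq variants (the last three columns) can be averaged directly. The sketch below assumes the column order reconstruction, future prediction, joint, skips the missing 2013 entry of the last column, and uses the population standard deviation; whether the avg±std columns of Table 1 use the population or the sample formula is not stated, so the spread is indicative only.

```python
from statistics import mean, pstdev

# Last three columns of Table 2, years 2001-2013 (None marks the "-" entry).
TABLE_2 = {
    "reconstruction": [33.66, 34.06, 32.44, 30.01, 29.05, 28.58, 28.09,
                       27.38, 25.71, 25.84, 24.65, 24.42, 24.75],
    "future": [23.86, 23.52, 23.39, 23.84, 24.21, 24.77, 25.62,
               26.53, 27.30, 29.50, 30.91, 33.65, 36.09],
    "joint": [23.67, 23.42, 23.47, 23.50, 23.93, 24.28, 25.17,
              24.44, 24.72, 24.83, 25.14, 24.93, None],
}

summary = {}
for name, column in TABLE_2.items():
    values = [v for v in column if v is not None]
    summary[name] = (round(mean(values), 2), round(pstdev(values), 2))

for name, (avg, std) in summary.items():
    print(f"{name}: {avg} ± {std}")
```

The computed means match the AVERAGE row of Table 2, and the joint variant shows the smallest spread across cut-off years.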
Model: Procrustes alignment baselines (PROCR variants)
Explanation: Date to use for aligning the word vectors with their corresponding ones in the year 2000.
Example (year=2006): The model aligns the word vectors in the year 2006 with the word vectors in the year 2000.

Model: LSTM_r
Explanation: The date indicating the word vectors to reconstruct, along with those in the first year.
Example (year=2006): LSTM_r receives as input the word vectors in the years 2000 and 2006 and reconstructs them.

Model: LSTM_f, RF
Explanation: The date indicating the word vectors to predict.
Example (year=2006): LSTM_f/RF receives the word vectors in the year 2000 and predicts the word vectors in the year 2006.

Model: Baselines operating on the time series of cosine distances (GT variants)
Explanation: Cut-off date to use for constructing the time series of the cosine distances.
Example (year=2006): The time series of cosine distances of every word are constructed based on the years [2000-2006].

Model: seq2seq_r
Explanation: Cut-off date in the input, indicating the range of years to reconstruct.
Example (year=2006): seq2seq_r is fed with the word representations in the years [2000-2006] and reconstructs them.

Model: seq2seq_f
Explanation: Cut-off date in the input, affecting the range of years to predict.
Example (year=2006): seq2seq_f predicts the word vectors during the years [2007-2013], given the vectors during the years [2000-2006] as input.

Model: seq2seq_rf
Explanation: Cut-off date in the input, indicating the range of years to reconstruct and affecting the range of dates to predict.
Example (year=2006): seq2seq_rf receives the word vectors during the years [2000-2006] and (a) reconstructs them and (b) predicts their representations in [2007-2013].

Table 3: Explanation of the variable “year” in Table 2.