Word meanings drift over time: new words emerge, existing words adopt new senses, and the frequency of word usage varies. Vocabulary and usage patterns in social media evolve rapidly (Hamoodat_VC2020), and people’s views change over time (kelman1961processes). This can impact stance classification in social media, as the data used for training may not generalise well to future data with different patterns. Previous research has either assumed that a classifier trained on static, temporally-restricted data suffices to track public opinion over time (deng2013tracking), or has focused on short time periods, analysing stance on trending topics such as Brexit, the death penalty or climate change (Simaki2017a; Mohammad2016b). Our work contributes to research in stance classification by focusing on the impact of a hitherto overlooked aspect: time.
A recent study by Florio et al. (florio2020time) demonstrated that social media hate speech detection models do not perform well on newer data when simply trained on older data. Despite highlighting the existence of this problem, their work did not propose any solutions. Here we show that this problem is not exclusive to hate speech detection and that it also affects the performance of social media stance classifiers (alkhalifa2020qmul). We collect two longitudinal stance detection datasets that we use to evaluate classifier performance over time (Section 4). In our experiments we reproduce a real-world scenario in which training data remains unchanged while new testing data is generated over the years. Our findings indicate that a regular stance classifier can drop by up to 18% in relative performance in only five years (Sections 6, 7). We then propose a novel methodology that makes a social media stance classifier more robust when applied to data that is temporally distant from the training data, which in turn enables improved tracking of public opinion.
While one can choose the costly option of annotating new stance data regularly to re-train a classifier, here we investigate the scenario where one needs to make the most of the originally labelled data, e.g. due to limited resources. Hence we propose to use temporally adapted word embeddings to re-train the classifier on the unchanged training data. This approach adapts the model to the vocabulary changes that happened over time while making use of readily available unlabelled data. We compare two types of approaches to update the word embeddings: (1) incrementally updating the same embedding model with new unlabelled data over time, and (2) creating a temporally contextualised embedding for the testing year by incrementally aligning a new embedding with preceding embedding models over time. We find that the second approach is more successful at mitigating the performance drop over time: we obtain improved performance with a substantially reduced performance drop of at most 5%.
2. Related Work
Stance classification. There is a body of work on target-specific stance classification (Mohammad2017a; Somasundaran2009) aiming to determine a user’s supporting or opposing view towards a target. This research has generally focused on a specific target (Kucuk2019) and investigated stance as a static problem, looking at datasets that cover limited periods of time without paying attention to the impact of time. Others have looked at the problem of dealing with new targets through cross-target stance classification (xu2018cross), i.e. having training data associated with a particular target (e.g. Donald Trump) and exploring the possibility of adapting the classifier to new targets (e.g. Joe Biden). Our research differs from this body of work in that we aim to (1) investigate the impact of time in stance classification for a particular target, and (2) propose a model that makes this longitudinal tracking of stance more robust to changes in opinion (sayeed2013opinion; Graells2020) and language (croft2013evolution), and hence more stable in performance. A line of research in stance identification has looked at the evolving nature of stance in rumour conversations (lukasik2016hawkes; zubiaga2018discourse); however, this work focuses on stance exchanges in temporally brief conversations, rather than on the longitudinal persistence of models.
Temporal persistence of classifiers. Previous research has shown that classifiers trained on old data can drop in performance when tested on new data, as is the case with Amazon reviews (Lukes2019) or hate speech detection (florio2020time). Works by Rocha2008 and Nishida2012 also find that the temporal gap between training and test data has a considerable impact on the performance of a classification model. Preotiuc-Pietro2013 model periodic distributions of words over time for the hashtag prediction task using Gaussian Processes. Previous work, however, assumes that new labelled data is available over time, and that the classifier can therefore be adapted using new labelled instances progressively; in our work we assume the realistic scenario where we have a labelled dataset pertaining to one period of time, and access to new labelled data for subsequent periods is not affordable. We tackle this problem by using word embeddings, which have been used in previous work for capturing semantic shift (Zhang2016; Tan; kim2014temporal) (i.e. determining whether a word has changed its meaning over time), but there is a dearth of research exploring the use of embeddings to achieve persistence of classifiers.
3. Task Definition
We define the stance classification task as identifying the attitude of the author of a post towards a certain topic as either supporting or opposing. Our task in this paper is to maximise the performance of a stance classifier when tested on new data that is several years apart from the training data, i.e. to make the classifier persistent in time. Our proposed approach is based on adapting the word embeddings used to train the classifier, and thus we refer to it as adaptive stance classification. To study this we use (1) a longitudinal, unlabelled dataset $U$, divided into equally sized temporal slices $U = \{U_1, \dots, U_T\}$, and (2) a longitudinal, labelled dataset $D$ of annotated stance tweets representing temporal utterances from a particular domain (e.g., gender equality, healthcare) with a corresponding set of binary stance labels spanning $T$ years, $D = \{D_1, \dots, D_T\}$, where $D_i$ is the set of tweets from year $i$. We use the unlabelled data to generate a sequence of temporal embeddings $W = \{W_1, \dots, W_T\}$, where each $W_t$ contains vector representations of words generated using the temporal slice $U_t$, representing the ground truth of the temporal representation at time $t$. We assess the persistence of a classifier's performance by training it on data from one of the years $y_s$, where $1 \le y_s \le T$, and testing it on each of the subsequent years $y_t$, where $y_s \le y_t \le T$. Our objective is to update the representation $W$ so as to adapt it to vocabulary change and to maximise persistence in stance classification for any pair $y_s$ and $y_t$.
4. Datasets
We use two types of datasets for our work: (1) labelled datasets, to assess stance detection models, and (2) larger unlabelled datasets, for building temporal word embedding models. Both types of datasets cover the same time period, enabling experiments on stance over time (labelled) by incrementally adapting word embeddings (unlabelled).
4.1. Labelled Datasets
Due to the lack of large-scale temporally annotated datasets for stance classification, we collected new datasets. To enable collection of labelled datasets with sufficient data for each of the years under study, we opted for retrieving distantly supervised datasets, in this case for a six-year time period from 2014 to 2019. The data collection is based on predominantly supporting or opposing hashtags.
Distant supervision became popular for the collection of social media datasets labelled for sentiment (go2009twitter) and has more recently been extended to other tasks, including stance classification (Kumar2018; Mohammad2017a). Distant supervision consists of defining a set of keywords (e.g. hashtags) which serve as a proxy for data labels, subsequently removing these keywords from the resulting dataset and leaving the rest of the text of the posts. We collected two Twitter stance datasets by using hashtags (https://github.com/OpinionChange2021/opinion_are_made_to_be_changed.git) spanning the same time period (2014-2019): (1) with hashtags supporting and opposing Gender Equality, involving issues such as feminism and the gender pay gap, and (2) with hashtags supporting and opposing Healthcare, involving issues such as dieting and medical care. To assess the quality of the distantly supervised labels, we manually inspected a subset of 225 random tweets from the resulting datasets. We observed that only 11% of the instances are noisy, i.e. carry the opposite stance. This is in line with previous work on distant supervision (cf. (purver2012experimenting)).
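The labelling step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the hashtag lists are invented placeholders, and real collection would also handle retweets, casing and deduplication.

```python
# Hypothetical stance hashtag lists (illustrative only, not the paper's).
SUPPORT_TAGS = {"#equalpay", "#heforshe"}
OPPOSE_TAGS = {"#feminismisawful", "#meninist"}

def distant_label(tweet: str):
    """Label a tweet by its stance hashtags, then strip those hashtags
    so the classifier cannot rely on them directly. Returns None when
    the tweet carries no stance hashtag or hashtags from both sides."""
    tokens = tweet.lower().split()
    tags = {t for t in tokens if t.startswith("#")}
    support = bool(tags & SUPPORT_TAGS)
    oppose = bool(tags & OPPOSE_TAGS)
    if support == oppose:  # ambiguous or unlabelled: discard
        return None
    cleaned = " ".join(t for t in tokens if t not in SUPPORT_TAGS | OPPOSE_TAGS)
    return ("support" if support else "oppose", cleaned)
```

For example, `distant_label("Proud day #HeForShe everyone")` yields a supporting instance with the proxy hashtag removed from the text.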
We randomly selected a stratified sample from each year, which is split into train, evaluation and test data. Table 1 shows the per-year statistics of the resulting datasets.
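A per-year stratified split of the kind described above can be sketched with the standard library alone. The split ratios and the `(year, label, text)` tuple format are assumptions for illustration, not the paper's exact procedure.

```python
import random
from collections import defaultdict

def stratified_split(examples, train=0.8, dev=0.1, seed=0):
    """Split (year, label, text) examples into train/dev/test while
    preserving the year x label proportions (illustrative helper)."""
    buckets = defaultdict(list)
    for ex in examples:
        buckets[(ex[0], ex[1])].append(ex)  # stratify by (year, label)
    rng = random.Random(seed)
    splits = {"train": [], "dev": [], "test": []}
    for bucket in buckets.values():
        rng.shuffle(bucket)
        n_train = int(len(bucket) * train)
        n_dev = int(len(bucket) * dev)
        splits["train"] += bucket[:n_train]
        splits["dev"] += bucket[n_train:n_train + n_dev]
        splits["test"] += bucket[n_train + n_dev:]
    return splits
```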
To measure the temporal evolution of the datasets, we compute the Jaccard similarities between the vocabulary observed for each year. Figure 1 shows the pairwise Jaccard similarity scores for the two datasets. We can observe that these similarity scores consistently decrease as the distance between the years increases, indicating an increasing variation of vocabulary over time.
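The pairwise vocabulary comparison above uses the standard Jaccard similarity, which can be computed as:

```python
def jaccard(vocab_a: set, vocab_b: set) -> float:
    """Jaccard similarity between two yearly vocabularies:
    |intersection| / |union|, in [0, 1]."""
    if not vocab_a and not vocab_b:
        return 1.0  # two empty vocabularies are identical by convention
    return len(vocab_a & vocab_b) / len(vocab_a | vocab_b)
```

Computing this for every pair of years yields the similarity matrix shown in Figure 1; a decreasing score for more distant year pairs indicates growing vocabulary divergence.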
4.2. Unlabelled Datasets
We also collected larger domain-specific Twitter datasets linked to the same two topics and using the same hashtags, but disregarding the labels to avoid supervision when training the word embedding models. This resulted in 578K and 343K aggregated tweets for Gender Equality and Healthcare, respectively.
5. Methods for Incorporating Temporal Knowledge into Word Embeddings
We assess the potential of word embeddings (mikolov2013distributed) to aid classifiers to have a temporally persistent performance, and propose novel methods to further their temporal persistence (see Figure 2). We use the CBOW model (Mikolov2013a), which outperformed skip-gram (Mikolov2013a) for linguistic change detection (kulkarni2015statistically). We control for other variables (e.g. prediction models, label distributions) by keeping them stable across experiments.
Method 1. Discrete Temporal Embedding (DTE), a baseline method that lacks awareness of temporal evolution. DTE learns CBOW word vector representations given a collection of tweets pertaining to a particular time frame as input. For example, where our classifier needs to train on data pertaining to year $y_s$ and be tested on $y_t$, a DTE embedding is generated from the unlabelled data pertaining to $y_s$. We can formally represent it as $W_{y_s} = \mathrm{CBOW}(U_{y_s})$, where $y_s$ represents the time frame of the source set.
In this work we propose four models to incorporate knowledge over time by leveraging unlabelled data.
Method 2. Incremental Temporal Embedding (ITE). New embeddings are trained using the unlabelled data incrementally aggregated from all years preceding and including the target year, i.e. $\bigcup_{i=1}^{y_t} U_i$, where $y_t$ is the target year. The stance classifier is then retrained using the labelled data from the source year $y_s$, represented using the new, up-to-date embeddings $W_{y_t}$.
Method 3. Source-Target Temporal Embedding (2TE). New embeddings are generated using the unlabelled data aggregated from the source and the target years only, while ignoring all years in between. These embeddings are then used to represent source year training data for the stance classifier.
While ITE and 2TE incorporate temporal knowledge, they do not explicitly handle other phenomena such as the semantic shift of vocabulary (kim2014temporal), which we anticipate may limit performance. To address this, we propose alternative methods that perform temporal word alignment, using the compass method (Valerio2019Compass). With compass, each temporal embedding becomes temporally contextualised to the semantics of the testing year. This method assumes that words' contextual usage fluctuates over time along with social associations, creating subtle meaning drift. For example, the word ‘Clinton’ shifted from an administration-related context to a presidential context over time (Valerio2019Compass). The compass aligns the embeddings of different years using a pivot of non-shifting vocabulary: it constructs a dynamic temporal context embedding matrix that changes over time, allowing the context embeddings to remain time-relevant. This setting allows a natural selection of vocabulary reflecting the temporal context of the target year, and a time-aware representation of the target year in general. We show that this approach is, in some cases, more useful than model updating, as the model is trained to reflect the semantic meaning of the target year without additional contexts.
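The compass implementation aligns slices against a shared, frozen context matrix. As a self-contained illustration of the underlying diachronic-alignment idea, the sketch below instead uses orthogonal Procrustes over assumed-stable pivot words, a related alignment technique from the semantic-shift literature, not the compass method itself.

```python
import numpy as np

def align_to_target(source_vecs, target_vecs, pivots):
    """Rotate source-year embeddings into the target year's space using
    shared pivot words assumed not to have shifted in meaning.
    Solves the orthogonal Procrustes problem min_R ||A R - B||_F."""
    A = np.stack([source_vecs[w] for w in pivots])
    B = np.stack([target_vecs[w] for w in pivots])
    u, _, vt = np.linalg.svd(A.T @ B)
    R = u @ vt  # optimal orthogonal rotation
    return {w: v @ R for w, v in source_vecs.items()}
```

After alignment, vectors from the two years live in a comparable space, so a classifier trained on source-year representations can be applied to target-year text.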
Method 4. Incremental Temporal Alignment (ITA). $W_{y_t}$ is incrementally aligned using compass with all preceding embeddings $W_i$, where $i < y_t$.
Method 5. Source-Target Temporal Alignment (2TA) performs temporal alignment using compass of $W_{y_s}$ and $W_{y_t}$ only.
We summarise all five models in Table 2 and Figure 2, which enable us to test the impact of three different parameters: (i) the use of discrete vs incremental embedding models, (ii) the use of different learning strategies (none, model update, diachronic alignment), and (iii) the use of different data sources for building embedding models (source, target and preceding years).
6. Experimental Setup
To control for the impact of model choice (Lukes2019; Rocha2008), we consistently use a Convolutional Neural Network (CNN) model with 32 filters, 5-gram region sizes and 10 training epochs. The sentence length for vector representations is also fixed across all experiments. While keeping the CNN model intact, our aim is to assess the effectiveness of the proposed embedding-based representations for the task.
We experiment with all 21 possible combinations of $y_s$ and $y_t$ for training and testing between 2014 and 2019. In each case, we are interested in the temporal gap between train and test data, measured in number of years. For brevity and clarity, we report the mean performance of models with the same gap (e.g. the 4-year gap performance averages 2014-2018 and 2015-2019). In each case, we report the macro-averaged F1 score, as well as the Relative Performance Drop (RPD), which measures the sharpness of the drop in our model and is defined as:
$\mathrm{RPD} = \frac{P_0 - P_g}{P_0}$

where $P_0$ represents performance when the temporal gap is 0, and $P_g$ represents performance when the temporal gap is $g \in \{1, \dots, 5\}$.
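The metric is straightforward to compute; a minimal helper (names are ours, not the paper's code):

```python
def relative_performance_drop(f1_same_period: float, f1_gap: float) -> float:
    """RPD = (P0 - Pg) / P0: the fraction of same-period performance
    lost when testing on temporally distant data."""
    return (f1_same_period - f1_gap) / f1_same_period
```

For instance, a classifier scoring 0.80 macro-F1 on same-period data and 0.656 five years later has an RPD of 0.18, matching the 18% relative drop reported above.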
7. Results and Discussion
Table 3 and Figure 5 show the results of our experiments, aggregated by temporal gap. We observe that the best results are obtained for same-period experiments (i.e. a temporal gap of 0) and that performance decreases as the temporal gap increases, confirming our hypothesis that model persistence drops as training data gets older. Furthermore, the performance drops (percentages, shown in brackets) show an upward tendency for larger temporal gaps, demonstrating that the older the training data, the less accurate the model becomes when dealing with new data. Temporal dynamics in the stance datasets can indeed lead to a deterioration in model persistence.
When we look at the methods separately, we observe that ITA achieves the best overall performance. This is especially true for the healthcare dataset, where ITA is the best method for all temporal gaps under consideration; for the gender equality dataset, ITA is the best method for small temporal gaps (0-1), and while it achieves competitive performance for larger gaps, 2TA and ITE occasionally perform better.
We observe interesting trends when we look at performance scores and performance drops in conjunction. The baseline DTE, relying solely on source-year embeddings, leads to the lowest performance and also the largest performance drops. This reinforces that embeddings from a particular time period gradually become less useful for subsequent periods, more so when the target period is more distant in time. Among the four proposed methods, ITA yields the best same-period performance; however, it is also the method experiencing the highest performance drop for larger temporal gaps. This demonstrates ITA’s competitive performance for shorter temporal gaps, but its performance on longer temporal gaps is more uncertain. For the methods relying on source and target years, 2TE and 2TA, we observe modest performance in same-period experiments (equivalent to DTE), but a substantially smaller performance drop for larger temporal gaps. While their performance is not as strong for small temporal gaps, they show a good capacity to persist over time for larger gaps. A look at longer temporal gaps, beyond the 5-year gap considered in our experiments, would be an interesting avenue for future work, e.g. to assess the capacity of 2TA and 2TE to persist further.
In addition, our experiments help us assess the impact of three parameters (see Table 2):
Embedding type: we show that the use of an incremental aggregation of embeddings (ITE, 2TE, ITA, 2TA) improves over the use of discrete embeddings (DTE). This is consistent across datasets and temporal gaps.
Learning strategy: our results indicate that the best learning strategy is the use of diachronic alignment (ITA, 2TA), in our case tested using compass. With a few exceptions, we observe that these methods generally outperform methods that perform incremental model updates (2TE, ITE), and consistently outperform the lack of a learning strategy by relying on discrete embeddings (DTE).
Data source: the worst performance is obtained by the embedding method solely using the source year (DTE), the baseline one would naturally use with static classifiers. Methods considering additional years lead to improved performance. We observe two main patterns: (1) use of all years preceding the target year (ITE, ITA) leads to improved performance over the use of source and target years only (2TE, 2TA), albeit with a larger performance drop for longer temporal gaps, and (2) use of the source and target years only leads to lower performance for short temporal gaps, but with a substantially lower performance drop, showing a promising trend towards achieving model persistence.
8. Conclusions
Our work demonstrates the substantial impact of temporal evolution on stance classification in social media, with drops of up to 18% in relative macro-F1 scores in only five years. We investigate temporal adaptation of the word embeddings used to train the classifier to mitigate this drop in performance, showing that incrementally aligning embedding data for all years (ITA) leads to the best performance. However, we also find that considering only the source and target years in the alignment leads to the smallest performance drop, with promising trends towards longer-term persistence.
Furthering this research, we aim to investigate the extent to which different factors (e.g. opinion change, social media use) impact performance drop, as well as to explore the potential of using few-shot learning to quantify the benefits of labelling small amounts of target data.
Acknowledgements
The first author would like to thank Ahmad Alkhalifah, Prof. Maria Liakata and Prof. Massimo Poesio for their constructive feedback throughout the initial stages of this research. This research utilised Queen Mary’s Apocrita HPC facility, supported by QMUL Research-IT.