The COVID-19 pandemic has affected our society in many ways: travel was limited and restricted, supply chains were disrupted, companies experienced contractions in production, financial markets were unstable, schools were temporarily closed, and in-person learning was replaced by remote and online learning. Last but not least, the pandemic has taken a toll on physical health and, as multiple studies have found, has also had a huge impact on mental health. xiong2020impact provided a systematic review on how the pandemic has led to increased symptoms of anxiety, depression, and post-traumatic stress disorder in the general population. bo2021posttraumatic observed that COVID-19 patients suffered from post-traumatic stress symptoms before discharge. zhang2020differential detected an increased prevalence of depression predominantly in COVID-19 patients. chen2020prevalence observed that the prevalence of self-reported depression and anxiety among pediatric medical staff members was significantly high during the pandemic, especially for workers who had COVID-19 exposure experience. Using online surveys, sonderskov2020depressive claimed that the psychological well-being of the general Danish population is negatively affected by the COVID-19 pandemic. wu2020analysis surveyed college students from the top 20 universities in China and found the main causes of anxiety among college students included online learning and epidemic diseases. sharma2020assessing collected text data via an app and learned that there was a high rate of negative sentiments among students and that they were more interested in health-related topics and less interested in education during the pandemic.
The work listed above uses surveys, patient data from hospitals and health care providers, or data collected through specifically designed apps to study the impact of the pandemic on the mental states of various demographic groups. Social media data, on the other hand, provide another rich source of real-world data and contain a vast amount of information that can be leveraged to study the effects of the pandemic on people’s lives, behaviors, and health, among others. Due to the unstructured nature of the text and semantic data collected from social media, text mining and machine learning (ML) techniques in natural language processing (NLP) are often used to make sense of the data. For example, low2020natural applied sentiment analysis, classification and clustering, and topic modeling to uncover concerns throughout Reddit before and during the pandemic. jelodar2020deepjia2020emotional collected and analyzed data from Weibo (a Chinese social media platform) and concluded that students’ attitude toward returning to school was positive. pandey2021redbert applied the Bidirectional Encoder Representations from Transformers (BERT) to Reddit data to assimilate meaningful latent topics on COVID-19 and perform sentiment classification.
Besides the rich text and semantic information, social media data are also known for their vast amount of relational information, which is equally important for training effective ML procedures. Relational information is often formulated as networks or graphs. Graph neural networks (GNNs), a type of neural network (NN) that takes graphs as input for various learning tasks such as node classification and graph embedding, can be used for learning from both the semantic and relational information in social media data. Since scarselli2008graph proposed the first GNN, many GNN extensions and variants have been proposed. li2015gated incorporated gated recurrent units into GNNs. defferrard2016convolutional generalized convolutional NNs (CNNs) to graphs using spectral graph theory. kipf2016semi proposed a semi-supervised layer-wise spectral model to effectively utilize graph structures. atwood2016diffusion introduced the diffusion-convolution operation to CNNs to learn representations of graphs. monti2017geometric generalized CNNs to non-Euclidean structured data including graphs and manifolds. hamilton2017inductive extended semi-supervised learning to a general spatial CNN by sampling from local neighborhoods, aggregating features, and exploring different aggregation functions. velivckovic2017graph proposed graph attention networks (GAT), a self-attention-based GNN that assigns different weights to different neighbors. GAT is computationally efficient because of parallelization and does not require knowing the overall graph structure upfront, but it works with homogeneous graphs only. wang2019heterogeneous developed the heterogeneous attention network (HAN), which extends GAT to heterogeneous graphs with multiple types of nodes and edges. HAN employs attention mechanisms to aggregate different types of neighbors along different types of meta-paths, which represent different types of relations among nodes of different or the same types. HAN is capable of paying attention to “important” neighbors of a given node and capturing meaningful semantics efficiently. hu2020heterogeneous extended the HAN framework to learn from dynamic heterogeneous graphs. chen2020simple improved previous GNN approaches by using initial residual connections and identity mapping to mitigate over-smoothing.
In this work, we leverage the advances in NLP and GNNs to learn from social media data and study whether the pandemic has negatively affected the emotions and psychological states of people who are connected with a higher-education institute (HEI). “Connected with an HEI” in this context is defined in a very broad sense, described as follows. The data we collected are from university subreddit communities and include basically everyone who contributed to the subreddits from August to November 2019 and from August to November 2020 (the period from August to November is when most schools in the US are in session). Therefore, an individual who contributes to our data might be a student in an HEI, a staff or faculty member there, or someone who does not have a direct connection with the HEI but is interested in it or is involved with it in some way. It is not our goal to generalize the conclusion from this study to the general population, as the studied population is clearly not a representative sample of the whole population. However, it represents a subgroup of the general population to which the study conclusions can be generalized, and the study contributes to the body of literature and research on the impact of the pandemic on mental health in various sub-populations using real-world data.
We collected the social media data in 2019 (pre-pandemic) and 2020 (pandemic) from the subreddit communities associated with 8 universities chosen per a full-factorial design according to three factors, as detailed in Section 2.1. We employ the Robustly Optimized BERT pre-training approach (RoBERTa) and GNNs with the attention mechanism, which take both semantic and relational information as input, to perform sentiment analysis in a semi-supervised manner to save manual labeling costs. The predicted sentiment labels are then analyzed by a generalized linear mixed model to examine the effects of the pandemic and online learning on the mental state of the communities represented by the collected data. Our contributions and the main results are summarized as follows.
We adopt a full-factorial design when choosing schools for Reddit data collection. This not only makes the collected data more representative by covering different types of schools compared to using data from a single HEI, but also helps with controlling for confounders for the subsequent statistical inference on the effect of the pandemic on the outcome of interest.
We use both the semantic and graph information from the collected Reddit data as inputs, leverage the state-of-the-art NLP techniques and GNNs with the attention mechanism, and apply model stacking to combine the prediction powers from two ML techniques and improve the sentiment classification accuracy.
Our results suggest that the odds of having negative sentiments increase by 14.65% during the pandemic (65,329 cases in 2020) compared to the pre-pandemic period (55,409 cases in 2019), and the increase is statistically significant (p-value ). During the pandemic, the odds of having negative sentiments in schools that opted for in-person learning (38,132 cases from 4 schools) are 1.416 times those in schools that chose remote learning (27,197 cases from 4 schools), and the increase is also statistically significant (p-value = 0.037).
The rest of the paper is organized as follows. Section 2 describes the research method, including data collection and processing, the ML procedures we adopt for sentiment classification, and the model used for hypothesis testing and statistical inference on the effects of the pandemic on sentiment. The main results are presented in Section 3. The final discussions and remarks are offered in Section 4.
Figure 1 depicts the process and main steps of the research method we take in this study. We feed the collected text data from Reddit to a pre-trained RoBERTa, which produces embeddings that contain the meaning of each word in the input text data. The embeddings are used for two downstream tasks. First, they are sent to a classification layer (softmax function) to predict the probabilities of negative and non-negative sentiments; second, they are used as part of the input to a GNN, along with adjacency matrices formulated from the relational data among the messages collected from Reddit. The GNN outputs another set of predicted probabilities of negative and non-negative sentiments, which are combined with the set of probabilities from RoBERTa to train a meta-model (the ensemble and model stacking step) to produce a final classifier. The classifier is then used to generate a sentiment label for each unlabeled message in our collected data. Finally, we combine the data with the observed or learned sentiment labels across the 8 schools prior to and during the pandemic, and employ regression (a generalized linear mixed-effects model, or GLMM) to examine the effect of the pandemic on sentiment, after adjusting for school characteristics.
In what follows, we illustrate each step in detail. Sections 2.1 and 2.2 present the data collection and the manual labeling steps that create a set of labeled data to train the ML procedures, respectively; Sections 2.3, 2.4, and 2.5 present the application of RoBERTa, the formulation and training of GAT, and the model stacking and ensemble of RoBERTa and GAT, respectively; Section 2.6 lists the GLMM used for statistical testing and inference.
2.1 Data Collection
In the current study, we focus on schools that are “R1: Doctoral Universities – Very high research activity” per the Carnegie Classification of Institutions of Higher Education (CC), with the plan to examine more universities in the future. A university can be characterized in many different ways. When choosing which schools to include in our study, we focus on three factors that can potentially affect students’ sentiment: private vs. public schools, school location (small vs. large cities), and whether a school opted for in-person learning during the pandemic vs. taking a fully online learning approach from August to November in 2020. To balance potential confounders for the subsequent statistical analysis, we adopt a full-factorial design on the above three factors and select one school for each of the eight cells in the full-factorial design (Table 1). Each of the eight schools has a subreddit on Reddit. We downloaded the post and comment data from each subreddit from August to November in 2019 and 2020, respectively representing the pre-pandemic and pandemic periods, using the Pushshift API (https://github.com/pushshift/api). This results in 120,738 messages in total.
|# of messages in 2019||11,910||12,066||12,083||10,362||2,366||4,248||1,066||1,308|
|Data period||Aug to Nov 2019; Aug to Nov 2020|
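The full-factorial design described above can be sketched as follows. This is a minimal illustration, not the authors' code: the factor and level names are labels chosen here for readability, and no school is assigned to any cell because the school names are not given in this section.

```python
from itertools import product

# The three two-level factors named in the text (labels are illustrative).
factors = {
    "school_type": ("private", "public"),
    "location": ("small city", "large city"),
    "learning_mode_2020": ("in-person", "remote"),
}

# A full-factorial design crosses every level of every factor: 2 x 2 x 2 cells.
cells = [dict(zip(factors, levels)) for levels in product(*factors.values())]

for number, cell in enumerate(cells, start=1):
    print(number, cell)
```

Because every level of one factor appears equally often with every level of the others, each main effect is estimated on a balanced set of schools, which is what helps with confounder control in the later regression.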
2.2 Manual Labeling
For the downloaded data, we manually labeled a sizable number of messages in each school-year so that there are sufficient labeled cases in each sentiment category for training and testing of our ML procedures. The labeled messages are summarized in Table 2.
We focus on binary sentiment classification in this study. Rather than labeling the messages as Negative and Positive, we use the Negative and non-Negative classification because some messages are neutral rather than either positive or negative. For example, a message could just be a question on the number of students in a class, and the reply to that message is just a number. There is also subjectivity among manual labelers on what Neutral means. For example, some labelers labeled “OK” as Positive but others labeled it as Neutral. We had also examined the possibility of adding the Very Positive and Very Negative categories, and again there was substantial disagreement among the labelers on what can be described as Very Positive vs. Positive and Very Negative vs. Negative. We had also examined the classification accuracy on sentiments when using 3 categories (Positive, Negative, Neutral) vs. using 5 categories (Very Positive, Positive, Negative, Very Negative, Neutral); the prediction accuracy was for the former and for the latter. (Our results are consistent with the findings in the literature – for in the -category classification, and in the -category classification (socher2013recursive; mccann2017learned).) Given the objective of our study, namely whether the pandemic has negatively affected the emotions and psychological states of people, and given that the 2-class classification is commonly accepted in sentiment analysis, we expect that the Negative versus non-Negative classification is sufficient to address our study goal, along with less discrepancy when labeling and higher accuracy when training the ML procedures and predicting the messages in our collected data.
2.3 Embedding Learning via RoBERTa
RoBERTa (liu2019roberta) is an improved version of the Bidirectional Encoder Representations from Transformers (BERT) framework (devlin2018bert), owing to several modifications of BERT (e.g., dynamic masking, input changes without the next-sentence-prediction loss, large mini-batches, etc.). The BERT framework itself is based on transformers (vaswani2017attention), a deep learning model built upon the attention mechanism. The BERT model is pre-trained on text from BooksCorpus and English Wikipedia and can be fine-tuned for various types of downstream learning tasks.
The Reddit data in our study, like any social media data, contain a large number of emoticons, non-standard spellings, and internet slang, causing difficulty for traditional sentence embedding learning methods designed for text with standard grammar and spelling. The RoBERTa model that we employ is capable of generating embeddings from internet slang and other non-standard spellings by using a special tokenizer. Specifically, we applied the RoBERTa model trained on million messages from Twitter and fine-tuned for sentiment analysis (barbieri2020tweeteval; modelcodes) to obtain the embeddings of the semantic information in our collected Reddit text data.
Besides generating embeddings from the text data via RoBERTa, which are fed to the GAT NN in the next step, we also use the RoBERTa framework to predict the labels of the sentiments in our collected Reddit data, as part of the subgroup-adaptive model stacking introduced in Section 2.5.
The Python code for the RoBERTa framework that we applied is adapted from barbieri2020tweeteval and is available at https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment.
2.4 Training of GAT
We employ GAT to incorporate the relational information in the collected data as a graph input to train a sentiment classifier. GAT is a GNN that employs the attention mechanism to incorporate neighbors’ information in graph embedding. In our problem setting, each message is regarded as a node in the graph, and the reply relation between two messages is coded as an edge in the graph and in the corresponding adjacency matrix.
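A minimal sketch of how such a graph could be coded is given below. The message identifiers and reply pairs are hypothetical, and two choices are assumptions not stated in the text: edges are treated as undirected, and each node gets a self-loop so that a message also attends to itself.

```python
def build_adjacency(message_ids, replies, self_loops=True):
    """Build a dense 0/1 adjacency matrix from (child, parent) reply pairs."""
    index = {message_id: k for k, message_id in enumerate(message_ids)}
    n = len(message_ids)
    adj = [[0] * n for _ in range(n)]
    for child, parent in replies:
        # Skip replies that point at messages outside the collected data.
        if child in index and parent in index:
            i, j = index[child], index[parent]
            adj[i][j] = 1
            adj[j][i] = 1  # assumption: the reply relation is stored symmetrically
    if self_loops:
        for k in range(n):
            adj[k][k] = 1
    return adj


# Hypothetical thread: one post (p1), two direct comments, one nested comment.
message_ids = ["p1", "c1", "c2", "c3"]
replies = [("c1", "p1"), ("c2", "p1"), ("c3", "c1")]
adjacency = build_adjacency(message_ids, replies)

for row in adjacency:
    print(row)
```

For the data sizes reported in this study a sparse edge list would be the more practical storage format; the dense matrix is used here only to make the structure visible.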
We randomly selected 601 non-Negative and 94 Negative messages from the merged 2019 and 2020 Dartmouth data in Table 2, along with the adjacency matrix in the merged Dartmouth data set, as the training set for the GAT NN. There are a few reasons why we only used the data from one school to train GAT. First, the graph on the merged data across different schools leads to a high-dimensional block-diagonal adjacency matrix because the comments from different schools are barely linked. In other words, merging the data from different schools does not provide additional benefits from the perspective of leveraging relational data to train GAT; instead, it potentially increases computational costs and, in our experiments, also leads to worse prediction accuracy than using the data from a single school. Second, laborious manual labeling confined us to obtaining sufficient labeled data in only one school to train the GAT with meaningful relational information. Third, what makes a sentiment negative vs. positive, and how sentiments relate to the relational information among the messages, is rather independent of a particular school; we therefore expect little sacrifice in prediction accuracy from a GAT NN trained with the relational data from one school.
The steps in training of the GAT NN are listed below and a schematic illustration of the steps is given in Figure 2. Before the implementation, we first train RoBERTa to learn a set of representative features (embeddings) in message . We use to denote the neighbors of message , that is, the set of messages that reply to message .
Define for each and calculate per the multi-head attention mechanism, where is number of attention heads, and is an activation function. The multi-head attention mechanism allows the GNN to capture different types of relations among words within each message as well as across messages. We use in our application.
from step 2) through a dense single layer perceptron, parameterized by and train by minimizing the -regularized cross-entropy loss in Eq (1), for , where the set contains the labeled messages and .
denotes the observed label of message in the training data. classifies the messages into either Negative or non-Negative and is parameterized by , contains the parameters from the graph embedding and attention mechanism in steps 1) and 2) above, and . is the weight parameter and is set at 1 in a general case. Our training data are imbalanced in terms of labeled non-Negative cases vs labeled Negative cases (the former is 6 folds higher). To achieve better classification results, we “oversampled” the Negative cases, setting . The penalty parameter is , chosen via the 4-fold cross-validation.
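For reference, the standard multi-head GAT formulation of velivckovic2017graph, together with a weighted and regularized cross-entropy loss of the kind described above, can be written as below. All symbols are assumed notation introduced here, not the authors' own: $\mathbf{h}_i$ is the RoBERTa embedding of message $i$, $\mathcal{N}_i$ its neighbor set, $K$ the number of attention heads, $\sigma$ an activation function, $f$ the final dense classifier, $\mathcal{Y}_L$ the set of labeled messages, $w_i$ the class weight, and $\lambda R(\Theta)$ a generic penalty term.

```latex
\begin{align}
e_{ij}^{(k)} &= \mathrm{LeakyReLU}\!\left( \mathbf{a}^{(k)\top}
   \left[ \mathbf{W}^{(k)} \mathbf{h}_i \,\Vert\, \mathbf{W}^{(k)} \mathbf{h}_j \right] \right),
   \qquad j \in \mathcal{N}_i, \\
\alpha_{ij}^{(k)} &= \frac{\exp\!\big(e_{ij}^{(k)}\big)}
   {\sum_{l \in \mathcal{N}_i} \exp\!\big(e_{il}^{(k)}\big)}, \\
\mathbf{h}_i' &= \Big\Vert_{k=1}^{K} \,
   \sigma\!\Big( \sum_{j \in \mathcal{N}_i} \alpha_{ij}^{(k)} \mathbf{W}^{(k)} \mathbf{h}_j \Big), \\
\mathcal{L}(\Theta) &= -\sum_{i \in \mathcal{Y}_L} w_i
   \Big[ y_i \log f(\mathbf{h}_i') + (1 - y_i) \log\!\big(1 - f(\mathbf{h}_i')\big) \Big]
   + \lambda\, R(\Theta).
\end{align}
```

Here $\Vert$ denotes concatenation across heads; setting $w_i > 1$ for the minority class plays the role of the oversampling weight mentioned in the text.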
The loss function in Eqn (1) is formulated with the labeled data only. Though some nodes in the input graph are unlabeled (given the high cost of manual labeling) and are not part of the loss function, their text information and their relational information with the labeled nodes are still used in the graph embedding in steps 1) and 2) above. In this sense, the training of the GAT NN can be regarded as semi-supervised learning.
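The semi-supervised aspect can be illustrated with a small pure-Python sketch: every node, labeled or not, takes part in the attention-weighted aggregation, while the loss is summed over labeled nodes only. The toy embeddings, weights, and graph below are hypothetical; a single attention head is shown, the penalty term is omitted, and averaging the loss over labeled nodes is an assumption.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def matvec(W, h):
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def attention_layer(H, neighbors, W, a):
    """One attention head: returns (aggregated embeddings, attention weights)."""
    Z = [matvec(W, h) for h in H]  # linear transform of every node embedding
    out, alphas = [], []
    for i, nbrs in enumerate(neighbors):
        # Unnormalized score for each neighbor j of node i: a^T [Z_i || Z_j].
        scores = [leaky_relu(sum(p * q for p, q in zip(a, Z[i] + Z[j]))) for j in nbrs]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        alpha = [e / total for e in exps]
        agg = [sum(al * Z[j][d] for al, j in zip(alpha, nbrs)) for d in range(len(Z[i]))]
        out.append(agg)
        alphas.append(alpha)
    return out, alphas


def weighted_cross_entropy(probs, labels, weights, labeled):
    """Class-weighted cross-entropy averaged over the labeled nodes only."""
    loss = 0.0
    for i in labeled:
        p = min(max(probs[i], 1e-12), 1 - 1e-12)
        loss -= weights[i] * (labels[i] * math.log(p) + (1 - labels[i]) * math.log(1 - p))
    return loss / len(labeled)


# Toy graph: 3 messages with 2-dimensional embeddings; neighbor lists include self.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = [[0, 1, 2], [1, 0], [2, 0]]
W = [[1.0, 0.0], [0.0, 1.0]]
a = [0.5, -0.5, 0.5, 0.5]

out, alphas = attention_layer(H, neighbors, W, a)
print(out)
```

Node 2 could be left unlabeled in this toy graph and would still influence the embedding of node 0 through its attention weight, which is the sense in which unlabeled messages contribute to training.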
2.5 Model Stacking
Besides the classification via the GAT NN in Section 2.4, we also performed sentiment classification via RoBERTa. Comparing the classification results by GAT and RoBERTa on the testing data side by side (Figure 3), we found some inconsistency. (The test data set contains 375 Negative and 375 non-Negative cases, made of 25 Negative and 25 non-Negative cases from Dartmouth and each of the remaining 14 school-years.) The left plot shows the sensitivity, or the probability of correctly predicting non-Negative sentiments, and the right plot depicts the specificity, defined as the probability of correctly predicting Negative sentiments. The cutoff for labeling a message Negative or non-Negative based on its predicted Prob(non-Negative) is optimized by maximizing the geometric mean of sensitivity (recall) and positive predictive value (precision) (fowlkes1983method) in the ROC curves for GAT and RoBERTa and is 0.2438 and 0.6923, respectively. If the predicted Prob(non-Negative) is below the cutoff, the message is labeled Negative; otherwise, it is labeled non-Negative. In both plots, there is some deviation from the identity line, indicating the inconsistency in the classification between GAT and RoBERTa. GAT performs better than RoBERTa in terms of predicting true negative sentiments, whereas RoBERTa performs better than GAT in terms of predicting true non-negative sentiments. In addition, both classifiers output some low probabilities; ideally, if both classifiers had high accuracy, most of the points would be clustered in the upper right corner around (1, 1).
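The cutoff selection described above can be sketched as a simple search over candidate thresholds. This is an illustration under stated assumptions: non-Negative is coded as 1 and treated as the positive class, a message is predicted non-Negative when its probability is at or above the cutoff, the candidates are the observed probabilities, and the toy data are hypothetical.

```python
import math


def geometric_mean_score(probs, labels, cutoff):
    """Geometric mean of recall and precision at a given cutoff."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= cutoff and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < cutoff and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return math.sqrt(precision * recall)


def best_cutoff(probs, labels):
    """Return the observed probability that maximizes the score."""
    candidates = sorted(set(probs))
    return max(candidates, key=lambda c: geometric_mean_score(probs, labels, c))


# Hypothetical predicted Prob(non-Negative) and true labels (1 = non-Negative).
probs = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
labels = [0, 0, 1, 0, 1, 1]

cutoff = best_cutoff(probs, labels)
print(cutoff, geometric_mean_score(probs, labels, cutoff))
```

Running the search separately on each classifier's predictions is what yields classifier-specific cutoffs such as the two values reported in the text.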
The observations in Figure 3 suggest that both GAT and RoBERTa have their respective strengths but also suffer some degree of inaccuracy when predicting a certain sentiment category. To leverage the advantages of both methods, we develop a subgroup-adaptive ensemble method to aim for better prediction results. Ensemble learning is an ML technique that combines multiple base models (weak learners) to form a strong learner. Bagging, boosting, and model stacking are all well-known and popular ensemble methods and concepts. The specific technique we use here is model stacking (wolpert1992stacked), implemented by fitting logistic regression models, a.k.a. meta-models, on the predicted classification probabilities from GAT and RoBERTa. The fitted logistic models are adaptive to each population subgroup, i.e., a specific school-year combination in our study.
We first scale the raw prediction probability of being non-Negative from GAT and RoBERTa to obtain . Specifically,
where represents the RoBERTa and GAT, respectively, is the predicted probability that message from school in year has a non-negative sentiment per the prediction by . The reason behind the scaling is that the optimal cutoff on the raw predicted probability is different for RoBERTa and GAT and neither equals 0.5, leading to ineffective downstream training of the logistic regression meta-model (Eqn (3)) and difficulty in choosing a cutoff on the final probability from the meta-model. After the scaling, the cutoff on the scaled probability is 0.5 for both GAT and RoBERTa, eliminating both concerns.
Specifically, the logistic meta-model for school and year uses and as input to generate the final prediction probability
where are the estimated regression coefficients from the logistic model for school and year . The cutoff on from the meta-model prediction is 0.5; that is, if , then message in school and year is non-Negative; otherwise, it is labeled as Negative.
In terms of the training data for the model stacking step, we used 25 Negative cases and 25 non-Negative cases from the Dartmouth data and from each of the 14 school-years, leading to 375 Negative and 375 non-Negative cases overall. This very set of data (375 Negative and 375 non-Negative cases) was also used to find the optimal cutoff on the probability of Negative sentiments for the binary classification for RoBERTa and GAT, respectively. The testing data set for the logistic meta-model is the same as the one used for testing the prediction accuracy of RoBERTa and GAT, with 25 Negative and 25 non-Negative cases per school-year.
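The stacking step can be sketched as follows. Two things here are assumptions rather than the authors' method: the scaling is a piecewise-linear map that sends each classifier's cutoff to 0.5 (the exact scaling formula is not recoverable from this section), and the logistic meta-model is fitted by plain batch gradient descent. The raw probabilities and labels are hypothetical and stand for one school-year subgroup; only the two cutoffs are taken from the text.

```python
import math


def rescale(p, cutoff):
    """Assumed piecewise-linear map on [0, 1] sending `cutoff` (in (0, 1)) to 0.5."""
    if p <= cutoff:
        return 0.5 * p / cutoff
    return 0.5 + 0.5 * (p - cutoff) / (1.0 - cutoff)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_meta_model(X, y, lr=0.5, epochs=5000):
    """Logistic regression on two scaled probabilities, batch gradient descent."""
    beta = [0.0, 0.0, 0.0]
    n = len(X)
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for (x1, x2), yi in zip(X, y):
            err = sigmoid(beta[0] + beta[1] * x1 + beta[2] * x2) - yi
            grad[0] += err
            grad[1] += err * x1
            grad[2] += err * x2
        beta = [b - lr * g / n for b, g in zip(beta, grad)]
    return beta


def predict(beta, x1, x2):
    return sigmoid(beta[0] + beta[1] * x1 + beta[2] * x2)


ROBERTA_CUTOFF, GAT_CUTOFF = 0.6923, 0.2438  # cutoffs reported in the text

# Hypothetical (raw RoBERTa prob, raw GAT prob, label); label 1 = non-Negative.
raw = [(0.95, 0.80, 1), (0.85, 0.30, 1), (0.75, 0.60, 1), (0.90, 0.50, 1),
       (0.40, 0.10, 0), (0.55, 0.20, 0), (0.30, 0.05, 0), (0.60, 0.15, 0)]

X = [(rescale(r, ROBERTA_CUTOFF), rescale(g, GAT_CUTOFF)) for r, g, _ in raw]
y = [label for _, _, label in raw]
beta = fit_meta_model(X, y)
final_labels = [1 if predict(beta, x1, x2) >= 0.5 else 0 for x1, x2 in X]
print(beta, final_labels)
```

Fitting one such model per school-year is what makes the ensemble subgroup-adaptive: the relative weight given to GAT and RoBERTa can differ across subgroups.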
2.6 Statistical Modelling and Inference
With the learned sentiment labels (Negative vs. non-Negative) for the messages in the data set, we apply GLMM to examine the effects of the pandemic on emotional state. The GLMM uses a logit link function with a binary outcome (non-Negative vs. Negative). Specifically, we run two GLMMs. Model 1 is fitted to the whole collected data of 120,738 cases and compares the sentiment between 2019 (pre-pandemic) and 2020 (pandemic), after controlling for school location and type. Model 2 is fitted to the 2020 subset data with 65,329 cases and examines how in-person learning during the pandemic affects the sentiment compared to remote learning, after controlling for school location and type. Since the messages are clustered by school and the messages from the same school are not independent, we include in both models a random effect of school, hence the “mixed-effects” model. The formulations of the two models are given below.
where and , if private and 0 if public, if small city and 0 if large city, if 2020 and 0 if 2019, if it is in-person and 0 if it is remote, and and are the corresponding fixed-effects for and represents the log odds ratio of being non-Negative when vs. in the models.
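A generic random-intercept logistic GLMM matching this description can be written as below. The notation is assumed for illustration and is not recovered from the original equations: $y_{ij}$ is the binary sentiment label of message $i$ in school $j$, $x_{1j}$, $x_{2j}$, $x_{3ij}$, $x_{4j}$ are the indicators for private school, small city, year 2020, and in-person learning, and $u_j$ is the school-level random effect.

```latex
\begin{align}
\text{Model 1:}\quad
\operatorname{logit} \Pr(y_{ij} = 1 \mid u_j) &=
   \beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \beta_3 x_{3ij} + u_j, \\
\text{Model 2:}\quad
\operatorname{logit} \Pr(y_{ij} = 1 \mid u_j) &=
   \beta_0 + \beta_1 x_{1j} + \beta_2 x_{2j} + \beta_4 x_{4j} + u_j, \\
u_j &\sim \mathcal{N}(0, \sigma_u^2).
\end{align}
```

Under this form, $\exp(\beta)$ is the odds ratio for the category coded as 1 when the corresponding indicator is 1 vs. 0, and $\exp(-\beta)$ is the odds ratio for the other category.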
3.1 Sentiment Classification
The classification results from the meta-model from the model stacking step on the testing data are presented in Table 3. We examine multiple metrics on the prediction accuracy, including the overall accuracy rate, F1 score, recall, precision, and specificity. Though there is some variation across the schools and years, we have achieved satisfactory accuracy for each subgroup by all the metrics. Compared to GAT and RoBERTa, model stacking leads to better or similar classification results across all the examined metrics (Figure 4).
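For concreteness, the five metrics named above can be computed from the confusion counts as in the sketch below. The labels are hypothetical, and non-Negative is coded as 1 and treated as the positive class, following the definition of sensitivity used earlier in the paper.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, recall, precision, specificity, and F1 from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Return 0.0 when a metric is undefined (empty denominator).
        return num / den if den else 0.0

    recall = ratio(tp, tp + fn)
    precision = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "recall": recall,
        "precision": precision,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical test labels (1 = non-Negative, 0 = Negative).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With a balanced test set such as the 25 Negative and 25 non-Negative cases per school-year, accuracy is simply the average of recall and specificity.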
3.2 Statistical Inference for the Pandemic Effect on Sentiment
The GLMM results are presented in Table 4. “Year 2020” shows a statistically significant effect on sentiment with p-value ; the odds of Negative in year 2020 are 1.1465 times the odds of Negative in year 2019, after adjusting for school type and location. “In-person” learning in 2020 also affects sentiments in a statistically significant manner, and the odds of negative sentiments increase by 41.6% compared to online learning. Whether the school is located in a small city and whether the school is private do not seem to influence the odds of Negative in a statistically significant manner in either analysis.
|Effect of pandemic||Effect of in-person learning in 2020|
|Factor||Odds ratio||p-value||Factor||Odds ratio||p-value|
|Small City||1.241||0.126||Small City||1.316||0.110|
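The relation between an odds ratio, the percentage change in odds, and the implied change in probability can be made explicit with a short calculation. The two odds ratios are those reported in this paper (a 14.65% increase and 1.416 times); the baseline probability of 0.20 is purely hypothetical and is used only to show how an odds ratio translates to the probability scale.

```python
import math


def percent_change_in_odds(odds_ratio):
    """An odds ratio of 1.25 corresponds to a 25% increase in the odds."""
    return (odds_ratio - 1.0) * 100.0


def apply_odds_ratio(baseline_prob, odds_ratio):
    """Probability obtained after multiplying the baseline odds by the ratio."""
    odds = baseline_prob / (1.0 - baseline_prob) * odds_ratio
    return odds / (1.0 + odds)


OR_PANDEMIC = 1.1465   # 2020 vs. 2019, from the reported 14.65% increase
OR_IN_PERSON = 1.416   # in-person vs. remote learning in 2020

for name, odds_ratio in [("pandemic", OR_PANDEMIC), ("in-person", OR_IN_PERSON)]:
    print(name,
          "log odds ratio:", round(math.log(odds_ratio), 4),
          "percent change in odds:", round(percent_change_in_odds(odds_ratio), 2),
          "prob. from hypothetical 0.20 baseline:",
          round(apply_odds_ratio(0.20, odds_ratio), 4))
```

An odds ratio is not a ratio of probabilities: a 41.6% increase in odds from a 0.20 baseline raises the probability by far less than 41.6% in relative terms.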
In this study, we collected social media data from Reddit and applied state-of-the-art ML techniques to study whether the pandemic has negatively affected the emotional and psychological states of a sub-population. The ML techniques we employed achieved greater than 80% prediction accuracy on sentiment overall by various metrics. Our results suggest that the pandemic has a negative impact on the group’s emotional and psychological states in a statistically significant manner, and that in-person learning during the pandemic also increases the odds of negative sentiment in a statistically significant manner compared to online learning.
In the future, we plan to keep collecting Reddit data from the same period every year (2021 and beyond) and examine the long-term effects of the pandemic on the emotional and psychological states, including whether and when the states will return to the pre-pandemic baseline. In addition, we plan to apply and evaluate more ML techniques, such as those developed for heterogeneous graphs, to further improve the prediction accuracy in sentiment analysis based on semantic and relational information.