1 Introduction
The COVID-19 pandemic has affected our society in many ways: traveling was limited and restricted, supply chains were disrupted, companies experienced contractions in production, financial markets were unstable, schools were temporarily closed, and in-person learning was replaced by remote and online learning. Last but not least, the pandemic has taken a toll on our physical health and, as multiple studies have found, has also had a huge impact on mental health. xiong2020impact provided a systematic review on how the pandemic has led to increased symptoms of anxiety, depression, and post-traumatic stress disorder in the general population. bo2021posttraumatic observed that COVID-19 patients suffered from post-traumatic stress symptoms before discharge. zhang2020differential detected an increased prevalence of depression predominately in COVID-19 patients. chen2020prevalence observed that the prevalence of self-reported depression and anxiety among pediatric medical staff members was significantly high during the pandemic, especially for workers with COVID-19 exposure experience. Using online surveys, sonderskov2020depressive claimed that the psychological well-being of the general Danish population was negatively affected by the COVID-19 pandemic. wu2020analysis surveyed college students from the top 20 universities in China and found that the main causes of anxiety among college students included online learning and epidemic diseases. sharma2020assessing collected text data via an app and learned that there was a high rate of negative sentiment among students, who were more interested in health-related topics and less interested in education during the pandemic.
The work listed above uses surveys, patient data from hospitals and health care providers, or data collected through specifically designed apps to study the impact of the pandemic on the mental states of various demographic groups. Social media data, on the other hand, provide another great source of real-world data and contain a vast amount of information that can be leveraged to study the effects of the pandemic on people's lives, behaviors, and health, among others. Due to the unstructured nature of the text and semantic data collected from social media, text mining and machine learning (ML) techniques in natural language processing (NLP) are often used to make sense of the data. For example,
low2020natural applied sentiment analysis, classification, clustering, and topic modeling to uncover concerns throughout Reddit before and during the pandemic. jelodar2020deep applied long short-term memory (LSTM) recurrent neural networks (RNNs) to Reddit data and achieved an 81.15% sentiment classification accuracy on COVID-19 related comments.
jia2020emotional collected and analyzed data from Weibo (a Chinese social media platform) and concluded that students' attitude toward returning to school was positive. pandey2021redbert applied the Bidirectional Encoder Representations from Transformers (BERT) to Reddit data to assimilate meaningful latent topics on COVID-19 and perform sentiment classification. Besides the rich text and semantic information in social media data, the data are also known for their vast amount of relational information, which is also important for training effective ML procedures. Relational information is often formulated as networks or graphs. Graph neural networks (GNNs), a type of NN that takes graphs as input for various learning tasks such as node classification and graph embedding, can be used for learning from both the semantic and relational information in social media data. Since scarselli2008graph proposed the first GNN, many GNN extensions and variants have been developed. li2015gated incorporated gated recurrent units into GNNs. defferrard2016convolutional generalized convolutional NNs (CNNs) to graphs using spectral graph theory. kipf2016semi proposed a semi-supervised layer-wise spectral model to effectively utilize graph structures. atwood2016diffusion introduced the diffusion-convolution operation to CNNs to learn representations of graphs. monti2017geometric generalized CNNs to non-Euclidean structured data, including graphs and manifolds. hamilton2017inductive extended semi-supervised learning to a general spatial CNN by sampling from local neighborhoods, aggregating features, and exploring different aggregation functions.
velivckovic2017graph proposed graph attention networks (GAT), a self-attention-based GNN that assigns different weights to different neighbors. It is computationally efficient because of parallelization and does not require knowing the overall graph structure upfront, but it works with homogeneous graphs only. wang2019heterogeneous developed the heterogeneous attention network (HAN), which extends GAT to heterogeneous graphs with multiple types of nodes and edges. HAN employs attention mechanisms to aggregate different types of neighbors along different types of meta-paths, which represent different types of relations among nodes of the same or different types. HAN is capable of paying attention to "important" neighbors for a given node and capturing meaningful semantics efficiently. hu2020heterogeneous extended the HAN framework to learn from dynamic heterogeneous graphs. chen2020simple improved previous GNN approaches by using initial residual connections and identity mapping and by mitigating over-smoothing.
In this work, we leverage the advances in NLP and GNNs to learn from social media data and study whether the pandemic has negatively affected the emotions and psychological states of people who are connected with a higher-education institute (HEI). "Connected with an HEI" is defined in a very broad sense in this context. The data we collected come from university subreddit communities and include essentially everyone who contributed to those subreddits during August to November 2019 and August to November 2020 (the period from August to November is when most schools in the US are in session). Therefore, an individual who contributes to our data might be a student at an HEI, a staff or faculty member there, or someone who does not have a direct connection with the HEI but is interested in it or involved with it in some way. It is not our goal to generalize the conclusions from this study to the general population, as the studied population is clearly not a representative sample of the whole population. However, it represents a subgroup of the general population to which the study conclusions can be generalized, and the study contributes to the body of literature and research on the impact of the pandemic on mental health in various subpopulations using real-world data.
We collected the social media data in 2019 (pre-pandemic) and 2020 (pandemic) from the subreddit communities associated with 8 universities chosen per a full-factorial design according to three factors, as detailed in Section 2.1. We employ the Robustly Optimized BERT pretraining approach (RoBERTa) and GNNs with the attention mechanism, which take both semantic and relational information as input, to perform sentiment analysis in a semi-supervised manner and save manual labeling costs. The predicted sentiment labels are then analyzed by a generalized linear mixed model to examine the effects of the pandemic and online learning on the mental state of the communities represented by the collected data. Our contributions and the main results are summarized as follows.


We adopt a full-factorial design when choosing schools for Reddit data collection. This not only makes the collected data more representative by covering different types of schools, compared to using data from a single HEI, but also helps control for confounders in the subsequent statistical inference on the effect of the pandemic on the outcome of interest.

We use both the semantic and graph information from the collected Reddit data as inputs, leverage state-of-the-art NLP techniques and GNNs with the attention mechanism, and apply model stacking to combine the predictive power of the two ML techniques and improve the sentiment classification accuracy.

Our results suggest that the odds of having negative sentiments increased by 14.65% during the pandemic (65,329 cases in 2020) compared to the pre-pandemic period (55,409 cases in 2019), and the increase is statistically significant. During the pandemic, the odds of having negative sentiments in schools that opted for in-person learning (38,132 cases from 4 schools) were 1.416 times those in schools that chose remote learning (27,197 cases from 4 schools), and this increase is also statistically significant (p-value = 0.037).
The rest of the paper is organized as follows. Section 2 describes the research method, including data collection and processing, the ML procedures we adopt for sentiment classification, and the model used for hypothesis testing and statistical inference on the effects of the pandemic on sentiment. The main results are presented in Section 3. The final discussions and remarks are offered in Section 4.
2 Method
Figure 1
depicts the process and main steps of the research method we take in this study. We feed the collected text data from Reddit to a pretrained RoBERTa, which produces embeddings that capture the meaning of each word in the input text. The embeddings are used for two downstream tasks. First, they are sent to a classification layer (softmax function) to predict the probabilities of negative and non-negative sentiments; second, they are used as part of the input, along with adjacency matrices formulated from the relational data among the messages collected from Reddit, to a GNN. The GNN outputs another set of predicted probabilities of negative and non-negative sentiments, which are combined with the set of probabilities from RoBERTa to train a meta-model (the ensemble and model stacking step) and produce a final classifier. The classifier is then used to generate a sentiment label for each unlabeled message in our collected data. Finally, we combine the data with the observed or learned sentiment labels across the 8 schools prior to and during the pandemic, and employ regression (a generalized linear mixed-effects model, or GLMM) to examine the effect of the pandemic on sentiment, after adjusting for school characteristics.
In what follows, we describe each step in detail. Sections 2.1 and 2.2 present the data collection and manual labeling steps, respectively, which together create a set of labeled data to train the ML procedures; Sections 2.3, 2.4, and 2.5 present the application of RoBERTa, the formulation and training of GAT, and the model stacking and ensemble of RoBERTa and GAT, respectively; Section 2.6 presents the GLMM used for statistical testing and inference.
2.1 Data Collection
In the current study, we focus on schools that are "R1: Doctoral Universities – Very high research activity" per the Carnegie Classification of Institutions of Higher Education (CC), with the plan to examine more universities in the future. A university can be characterized in many different ways. When choosing which schools to include in our study, we focus on three factors that can potentially affect students' sentiment: private vs. public schools, school location (small vs. large cities), and whether a school opted for in-person learning during the pandemic vs. taking a fully online learning approach from August to November 2020. To balance potential confounders for the subsequent statistical analysis, we adopt a full-factorial design on the above three factors and select one school for each of the eight cells in the design (Table 1). Each of the eight schools has a subreddit on Reddit. We downloaded the post and comment data from each subreddit from August to November in 2019 and 2020, respectively, representing the pre-pandemic and pandemic periods, using the Pushshift API (https://github.com/pushshift/api). This results in 120,738 messages in total.
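Messages can be pulled from the Pushshift comment-search endpoint. The following is a minimal sketch of such a download, not the paper's actual collection script; the helper names and the epoch-timestamp parameters are our own assumptions:

```python
import json
import urllib.parse
import urllib.request

BASE = "https://api.pushshift.io/reddit/search/comment/"

def build_url(subreddit, after, before, size=100):
    # Assemble a Pushshift comment-search URL; `after`/`before` are epoch seconds.
    query = urllib.parse.urlencode(
        {"subreddit": subreddit, "after": after, "before": before, "size": size}
    )
    return BASE + "?" + query

def fetch_comments(subreddit, after, before, size=100):
    # Download one page of comments; Pushshift returns results under the "data" key.
    with urllib.request.urlopen(build_url(subreddit, after, before, size), timeout=30) as resp:
        return json.load(resp).get("data", [])
```

Covering a whole August-to-November window would repeat `fetch_comments`, advancing `after` to the `created_utc` of the last comment returned on each page.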
School  UCLA  UCSD  UC Berkeley  U of Michigan  Harvard  Columbia  Dartmouth  Notre Dame
Funding  Public  Public  Public  Public  Private  Private  Private  Private
Location  Large  Large  Small  Small  Large  Large  Small  Small
In-person learning, Aug-Nov 2020  No  Yes  No  Yes  No  Yes  No  Yes
# of comments, 2019  11,910  12,066  12,083  10,362  2,366  4,248  1,066  1,308
# of comments, 2020  11,626  11,756  11,704  11,619  2,924  7,693  943  7,064
Data period  Aug to Nov 2019; Aug to Nov 2020
2.2 Manual Labeling
For the downloaded data, we manually labeled a sizable number of messages in each school-year so that there are sufficient labeled cases in each sentiment category for training and testing our ML procedures. The labeled messages are summarized in Table 2.
School  Year  non-Negative  Negative  Total

UCLA  2019  178  53  231 
2020  170  52  222  
UCSD  2019  145  50  195 
2020  130  66  196  
UC Berkeley  2019  146  63  209
2020  161  52  213  
Michigan  2019  120  80  200 
2020  117  81  198  
Harvard  2019  164  50  214 
2020  150  58  208  
Columbia  2019  136  37  173 
2020  135  38  173  
Dartmouth  2019  324  78  402 
2020  327  66  393  
Notre Dame  2019  186  34  220 
2020  112  103  215  
Total  2019  1,399  445  1,844
2020  1,302  516  1,818
We focus on binary sentiment classification in this study, and rather than labeling the messages as Negative and Positive, we use the Negative and non-Negative classification because some messages are neutral rather than either positive or negative. For example, a message could simply be a question about the number of students in a class, and the reply to that message just a number. There is also subjectivity among manual labelers on what Neutral means. For example, some labelers labeled "OK" as Positive while others labeled it as Neutral. We also examined the possibility of adding Very Positive and Very Negative categories, and again there was substantial disagreement among the labelers on what can be described as Very Positive vs. Positive and Very Negative vs. Negative. We also examined the classification accuracy when using 3 categories (Positive, Negative, Neutral) vs. 5 categories (Very Positive, Positive, Negative, Very Negative, Neutral), and the prediction accuracy was noticeably lower in both cases, consistent with the findings in the literature (socher2013recursive; mccann2017learned). Given the objective of our study (whether the pandemic has negatively affected the emotions and psychological states of people) and the fact that 2-class classification is commonly accepted in sentiment analysis, we expect the Negative versus non-Negative classification to be sufficient to address our study goal, with less discrepancy in labeling and higher accuracy when training the ML procedures and predicting the labels of the messages in our collected data.
2.3 Embedding Learning via RoBERTa
RoBERTa (liu2019roberta) is an improved version of the Bidirectional Encoder Representations from Transformers (BERT) framework (devlin2018bert), owing to several modifications of BERT (e.g., dynamic masking, input changes without the next-sentence-prediction loss, and large mini-batches). The BERT framework itself is based on transformers (vaswani2017attention), a deep learning architecture built upon the attention mechanism. The BERT model is pretrained using text from Wikipedia and can be fine-tuned for various types of downstream learning tasks.
The Reddit data in our study, like most social media data, contain a large number of emoticons, non-standard spellings, and internet slang, causing difficulty for traditional sentence embedding methods designed for text with standard grammar and spelling. The RoBERTa model that we employ is capable of generating embeddings from internet slang and other non-standard spellings by using a special tokenizer. Specifically, we applied a RoBERTa model trained on a large corpus of Twitter messages and fine-tuned for sentiment analysis (barbieri2020tweeteval; modelcodes) to obtain the embeddings of the semantic information in our collected Reddit text data.
Besides generating embeddings from the text data via RoBERTa, which are fed to the GAT NN in the next step, we also use the RoBERTa framework to predict the sentiment labels in our collected Reddit data, as part of the subgroup-adaptive model stacking introduced in Section 2.5.
The Python code for the RoBERTa framework that we applied is adapted from barbieri2020tweeteval and is available at https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment.
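As a sketch of this step, scoring a message with the cited model and collapsing the Neutral and Positive classes into non-Negative could look like the following. The `transformers` import is kept inside the function so the snippet reads without the heavy dependency installed, and the three-class label order is per the model card (an assumption worth verifying):

```python
import math

def softmax(logits):
    # Convert raw model logits to probabilities (numerically stable form).
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def predict_sentiment(texts):
    """Score texts with the fine-tuned Twitter RoBERTa sentiment model.
    Assumes the `transformers` library and the model id cited in the text."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    name = "cardiffnlp/twitter-roberta-base-sentiment"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name)
    out = []
    for t in texts:
        enc = tok(t, return_tensors="pt", truncation=True)
        logits = model(**enc).logits[0].tolist()
        # Label order per the model card: 0 = negative, 1 = neutral, 2 = positive.
        p = softmax(logits)
        out.append({"negative": p[0], "non_negative": p[1] + p[2]})
    return out
```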
2.4 Training of GAT
We employ GAT to incorporate the relational information in the collected data as a graph input to train a sentiment classifier. GAT is a GNN that employs the attention mechanism to incorporate neighbors' information in graph embedding. In our problem setting, each message is regarded as a node in the graph, and the relation "message j replies to message i" is coded as an edge in the graph and in the corresponding adjacency matrix.
We randomly selected 601 non-Negative and 94 Negative messages from the merged 2019 and 2020 Dartmouth data in Table 2, along with the adjacency matrix of the merged Dartmouth data set, as the training set for the GAT NN. There are a few reasons why we only used the data from one school to train GAT. First, the graph on the merged data across different schools leads to a high-dimensional block-diagonal adjacency matrix because the comments from different schools are barely linked. In other words, merging the data from different schools does not provide additional benefits from the perspective of leveraging relational data to train GAT, while potentially increasing computational costs and, in our experiments, also leading to worse prediction accuracy than using the data from a single school. Second, laborious manual labeling confined us to obtaining sufficient labeled data with meaningful relational information in only one school. Third, what makes sentiments negative vs. positive, and how sentiments relate to the relational information among the messages, is rather independent of a particular school; there would thus not be much sacrifice in prediction accuracy from training the GAT NN with the relational data from one school.
The steps in training the GAT NN are listed below, and a schematic illustration of the steps is given in Figure 2. Before the implementation, we first train RoBERTa to learn a set of representative features (embeddings) h_i for each message i. We use N_i to denote the neighbors of message i, that is, the set of messages that reply to message i.


Define the "closeness" of message j to message i as e_ij = σ(a^T [W h_i ‖ W h_j]), where j ∈ N_i, W is a linear transformation matrix, [W h_i ‖ W h_j] denotes the row concatenation of the column vectors W h_i and W h_j, a is the attention vector that measures the importance of the elements in [W h_i ‖ W h_j], and σ is an activation function.

Define α_ij = exp(e_ij) / Σ_{k∈N_i} exp(e_ik) for each j ∈ N_i and calculate h'_i = ‖_{k=1}^{K} σ(Σ_{j∈N_i} α_ij^(k) W^(k) h_j) per the multi-head attention mechanism, where K is the number of attention heads and σ is an activation function. The multi-head attention mechanism allows the GNN to capture different types of relations among words within each message as well as across messages. We use multiple attention heads in our application.

Feed the embedding h'_i from step 2) through a dense single-layer perceptron f, parameterized by θ_f, and train by minimizing the regularized cross-entropy loss in Eq (1):

L(Θ) = − Σ_{i∈S} { w y_i log f(h'_i) + (1 − y_i) log(1 − f(h'_i)) } + λ ‖Θ‖²,  (1)

where the set S contains the labeled messages and y_i denotes the observed label of message i in the training data (y_i = 1 if Negative and 0 if non-Negative). f classifies the messages into either Negative or non-Negative and is parameterized by Θ = (θ_f, θ_g), where θ_g contains the parameters from the graph embedding and attention mechanism in steps 1) and 2) above. w is the weight parameter and is set at 1 in the general case. Our training data are imbalanced in terms of labeled non-Negative vs. labeled Negative cases (the former is 6 times higher). To achieve better classification results, we "oversampled" the Negative cases by setting w to a value greater than 1. The penalty parameter λ is chosen via 4-fold cross-validation.
The loss function in Eqn (1) is formulated with the labeled data only. Though some nodes in the input graph are unlabeled (given the high cost of manual labeling) and are not part of the loss function, their text information and their relational information with the labeled nodes are still used in the graph embedding in steps 1) and 2) above. In this sense, the training of the GAT NN can be regarded as semi-supervised learning.
The Python code for training the GAT NN is adapted from wang2019heterogeneous and can be found at https://github.com/Jhy1993/HAN. The training was performed in TensorFlow with randomly initialized model parameters and the Adam optimizer (kingma2014adam).
2.5 Model Stacking
Besides the classification via the GAT NN in Section 2.4, we also performed sentiment classification via RoBERTa. Comparing the classification results of GAT and RoBERTa on the testing data (which contains 375 Negative and 375 non-Negative cases, made of 25 Negative and 25 non-Negative cases from Dartmouth and from each of the remaining 14 school-years) side by side (Figure 3), we found some inconsistency. The left plot shows the sensitivity, or the probability of correctly predicting non-Negative sentiments, and the right plot depicts the specificity, defined as the probability of correctly predicting Negative sentiments. The cutoff for labeling a message Negative or non-Negative based on its predicted Prob(nonNegative) is optimized by maximizing the geometric mean of sensitivity (recall) and positive predictive value (precision) (fowlkes1983method) in the ROC curves for GAT and RoBERTa, yielding 0.2438 and 0.6923, respectively. If the predicted Prob(nonNegative) falls below the cutoff, the message is labeled Negative; otherwise, it is labeled non-Negative. In both plots, there is some deviation from the identity line, indicating inconsistency in the classification between GAT and RoBERTa. GAT performs better than RoBERTa in terms of predicting true Negative sentiments, whereas RoBERTa performs better than GAT in terms of predicting true non-Negative sentiments. In addition, both classifiers output some low probabilities; ideally, if both classifiers had high accuracy, most of the points would be clustered in the upper right corner around (1, 1). The observations in Figure 3 suggest that both GAT and RoBERTa have their respective strengths but also suffer some degree of inaccuracy when predicting a certain sentiment category. To leverage the advantages of both methods, we develop a subgroup-adaptive ensemble method to aim for better prediction results. Ensemble learning is an ML technique that combines multiple base models (weak learners) to form a strong learner. Bagging, boosting, and model stacking are all well-known and popular ensemble methods. The specific technique we use here is model stacking (wolpert1992stacked), fitting logistic regression models, a.k.a. meta-models, on the predicted classification probabilities from GAT and RoBERTa. The fitted logistic models are adaptive to each population subgroup, i.e., a specific school-year combination in our study.
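The cutoff optimization described above can be sketched as a scan over candidate thresholds, scoring each by the Fowlkes-Mallows geometric mean of recall and precision with non-Negative treated as the positive class; this is our own minimal implementation of the criterion, not the paper's code:

```python
import numpy as np

def best_cutoff(p_nonneg, y_nonneg):
    """Return the cutoff on Prob(nonNegative) maximizing sqrt(recall * precision).
    p_nonneg: predicted probabilities; y_nonneg: 1 for non-Negative, 0 for Negative."""
    best_c, best_g = 0.5, -1.0
    for c in np.unique(p_nonneg):
        pred = p_nonneg >= c
        tp = np.sum(pred & (y_nonneg == 1))
        fp = np.sum(pred & (y_nonneg == 0))
        fn = np.sum(~pred & (y_nonneg == 1))
        if tp == 0:
            continue                      # degenerate cutoff: nothing recovered
        recall = tp / (tp + fn)
        precision = tp / (tp + fp)
        g = (recall * precision) ** 0.5   # Fowlkes-Mallows geometric mean
        if g > best_g:
            best_c, best_g = c, g
    return best_c
```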
We first scale the raw prediction probability p_{m,ijt} of being non-Negative from GAT and RoBERTa to obtain p̃_{m,ijt}. Specifically,

p̃_{m,ijt} = p_{m,ijt} / (2 c_m) if p_{m,ijt} ≤ c_m;  p̃_{m,ijt} = 1 − (1 − p_{m,ijt}) / (2 (1 − c_m)) if p_{m,ijt} > c_m,  (2)

where m indexes RoBERTa and GAT, c_m is the method-specific optimal cutoff (0.6923 for RoBERTa and 0.2438 for GAT), and p_{m,ijt} is the predicted probability that message i from school j in year t has a non-negative sentiment per method m. The reason behind the scaling is that the optimal cutoff on the raw p_{m,ijt} differs between RoBERTa and GAT and neither equals 0.5, leading to ineffective downstream training of the logistic regression meta-model (Eqn (3)) and difficulty in choosing a cutoff on the final probability from the meta-model. After the scaling, the cutoff on p̃_{m,ijt} is 0.5 for both GAT and RoBERTa, eliminating both concerns.
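The exact scaling formula is not fully recoverable from the text; a natural choice that satisfies the stated property (the method-specific cutoff maps to exactly 0.5) is a piecewise-linear map, which we offer as an assumption rather than the paper's formula:

```python
def rescale(p, cutoff):
    # Map [0, cutoff] linearly onto [0, 0.5] and [cutoff, 1] onto [0.5, 1],
    # so a raw probability at the optimal cutoff rescales to exactly 0.5.
    if p <= cutoff:
        return 0.5 * p / cutoff
    return 0.5 + 0.5 * (p - cutoff) / (1.0 - cutoff)
```

Under this sketch, `rescale(p, 0.6923)` would be used for RoBERTa and `rescale(p, 0.2438)` for GAT, the two cutoffs reported earlier in this section.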
Specifically, the logistic meta-model for school j and year t uses p̃_{RoBERTa,ijt} and p̃_{GAT,ijt} as input to generate the final prediction probability

p̂_{ijt} = [1 + exp{−(β_{0jt} + β_{1jt} p̃_{RoBERTa,ijt} + β_{2jt} p̃_{GAT,ijt})}]^{−1},  (3)

where β_{0jt}, β_{1jt}, and β_{2jt} are the estimated regression coefficients from the logistic model for school j and year t. The cutoff on p̂_{ijt} from the meta-model prediction is 0.5; that is, if p̂_{ijt} ≥ 0.5, then message i in school j and year t is labeled non-Negative; otherwise, it is labeled Negative.
In terms of the training data for the model stacking step, we used 25 Negative and 25 non-Negative cases from the Dartmouth data and from each of the 14 other school-years, leading to 375 Negative and 375 non-Negative cases overall. This very set of data was also used to find the optimal cutoff on the probability of Negative sentiments for the binary classification for RoBERTa and GAT, respectively. The testing data set for the logistic meta-model is the same as the one used for testing the prediction accuracy of RoBERTa and GAT, with 25 Negative and 25 non-Negative cases per school-year.
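The per-school-year meta-model is an ordinary logistic regression on the two rescaled probabilities. A dependency-free sketch fitted by plain gradient descent (the learning rate and step count are arbitrary choices of ours, not the paper's):

```python
import numpy as np

def fit_meta(p_roberta, p_gat, y, lr=0.5, steps=2000):
    """Fit a logistic meta-model of Eq (3)'s form for one school-year.
    Inputs are the rescaled base-classifier probabilities and 0/1 labels;
    returns coefficients (b0, b1, b2)."""
    X = np.column_stack([np.ones_like(p_roberta), p_roberta, p_gat])
    beta = np.zeros(3)
    for _ in range(steps):
        pred = 1.0 / (1.0 + np.exp(-X @ beta))
        beta -= lr * X.T @ (pred - y) / len(y)   # gradient of the log-loss
    return beta

def predict_meta(beta, p_roberta, p_gat):
    # Final stacked probability of non-Negative; label non-Negative if >= 0.5.
    X = np.column_stack([np.ones_like(p_roberta), p_roberta, p_gat])
    return 1.0 / (1.0 + np.exp(-X @ beta))
```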
2.6 Statistical Modelling and Inference
With the learned sentiment labels (Negative vs. non-Negative) for the messages in the data set, we apply GLMM to examine the effects of the pandemic on emotional state. The GLMM uses a logit link function with a binary outcome (non-Negative vs. Negative). Specifically, we run two GLMMs. Model 1 is fitted to the whole collected data set of 120,738 cases and compares the sentiment between 2019 (pre-pandemic) and 2020 (pandemic), after controlling for school location and type. Model 2 is fitted to the 2020 subset of 65,329 cases and examines how in-person learning during the pandemic affects sentiment compared to remote learning, after controlling for school location and type. Since the messages are clustered by school and the messages from the same school are not independent, we include in both models a random effect of school, hence the "mixed-effects" model. The formulations of the two models are given below:

logit{Pr(y_ij = 1)} = β0 + β1 x1_j + β2 x2_j + β3 x3_ij + u_j,  (4)
logit{Pr(y_ij = 1)} = γ0 + γ1 x1_j + γ2 x2_j + γ3 x4_ij + v_j,  (5)

where y_ij = 1 if message i from school j is labeled Negative and 0 otherwise; x1_j = 1 if the school is private and 0 if public; x2_j = 1 if it is located in a small city and 0 if in a large city; x3_ij = 1 if the message is from 2020 and 0 if from 2019; x4_ij = 1 if the school is in-person and 0 if remote; u_j and v_j are the school-level random effects; and the βs and γs are the corresponding fixed effects, each representing the log odds ratio of Negative sentiment when its covariate equals 1 vs. 0 in the respective model.
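The paper does not state which software was used to fit these models; as one sketch, Model 1 can be fitted in Python with statsmodels' variational approximation for binomial mixed models (the column names here are hypothetical). The small helper shows how a fitted log-odds coefficient is read off as an odds ratio, which is how Table 4 is interpreted:

```python
import math

def odds_ratio(beta):
    # A fitted log-odds coefficient beta corresponds to an odds ratio exp(beta).
    return math.exp(beta)

def fit_model1(df):
    """Fit a mixed-effects logistic model of Eq (4)'s form: a Negative indicator
    on private, small_city, and year2020 with a random intercept per school.
    Assumes a pandas DataFrame with those 0/1 columns plus `school`; statsmodels
    is imported locally so the sketch reads without the dependency installed."""
    from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM
    model = BinomialBayesMixedGLM.from_formula(
        "negative ~ private + small_city + year2020",
        {"school": "0 + C(school)"},  # variance component: school intercepts
        df,
    )
    return model.fit_vb()  # fast variational Bayes fit
```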
3 Results
3.1 Sentiment Classification
The classification results of the meta-model from the model stacking step on the testing data are presented in Table 3. We examine multiple metrics of prediction accuracy, including the overall accuracy rate, F1 score, recall, precision, and specificity. Though there is some variation across the schools and years, we achieved satisfactory accuracy for each subgroup by all the metrics. Compared to GAT and RoBERTa, model stacking leads to better or similar classification results across all the examined metrics (Figure 4).
School  Year  classification  F1  positive predictive  sensitivity  specificity 

accuracy  score  value (precision)  (recall)  
UCLA  2019  0.88  0.87  0.95  0.80  0.96 
2020  0.80  0.80  0.80  0.80  0.80  
UCSD  2019  0.82  0.79  0.94  0.68  0.96 
2020  0.82  0.82  0.81  0.84  0.8  
Berkeley  2019  0.76  0.78  0.72  0.84  0.68 
2020  0.78  0.76  0.85  0.68  0.88  
Michigan  2019  0.72  0.71  0.74  0.68  0.76 
2020  0.84  0.85  0.81  0.88  0.8  
Harvard  2019  0.94  0.94  0.96  0.92  0.96 
2020  0.78  0.78  0.77  0.8  0.76  
Columbia  2019  0.82  0.82  0.83  0.8  0.84 
2020  0.92  0.92  0.92  0.92  0.92  
Dartmouth  2019 & 2020  0.84  0.85  0.81  0.88  0.80
Notre Dame  2019  0.82  0.82  0.83  0.8  0.84 
2020  0.86  0.86  0.85  0.88  0.84  
Overall  0.82  0.82  0.81  0.83  0.81 
F1 = 2 × (precision × recall) / (precision + recall)
3.2 Statistical inference for the pandemic effect on sentiment
The GLMM results are presented in Table 4. "Year 2020" shows a statistically significant effect on sentiment; the odds of Negative sentiment in 2020 are 1.146 times the odds in 2019, after adjusting for school type and location. "In-person" learning in 2020 also affects sentiment in a statistically significant manner (p-value = 0.037): the odds of negative sentiment increase by 41.6% compared to online learning. Whether the school is located in a small city and whether the school is private do not influence the odds of Negative sentiment in a statistically significant manner in either analysis.
Effect of pandemic  Effect of inperson learning in 2020  

Factor  odds ratio  p-value  Factor  odds ratio  p-value
Year 2020  1.146    In-Person  1.416  0.037
Small City  1.241  0.126  Small City  1.316  0.110 
Private  0.978  0.877  Private  1.031  0.860 
4 Discussion
In this study, we collected social media data from Reddit and applied state-of-the-art ML techniques to study whether the pandemic has negatively affected the emotional and psychological states of a subpopulation. The ML techniques we employed achieved greater than 80% prediction accuracy on sentiment overall by various metrics. Our results suggest that the pandemic had a statistically significant negative impact on the group's emotional and psychological states, and that in-person learning, compared to remote learning, also increased the odds of negative sentiment in a statistically significant manner.
In the future, we plan to keep collecting Reddit data from the same period every year (2021 and beyond) and examine the long-term effects of the pandemic on emotional and psychological states, including whether and when those states return to the pre-pandemic baseline. In addition, we plan to apply and evaluate more ML techniques, such as those developed for heterogeneous graphs, to further improve the prediction accuracy of sentiment analysis based on semantic and relational information.