Mental disorders are a prevalent problem worldwide: 10.7% of people live with a mental health disorder (https://ourworldindata.org/mental-health). There are various mental illnesses, such as depression, anxiety, bipolar disorder, eating disorders, and addictions (including alcohol, drugs and gambling). For many people living with these problems, treatment is hard to access because of social stigma, and many mental illnesses remain undiagnosed.
With the rise in social media popularity, more and more researchers are using it as a source for gathering data because users tend to talk about their emotions and mental health problems online and to seek support.
According to the DSM-5, pathological gambling is a mental disorder in which a person shows persistent and recurrent problematic gambling behaviour: they are constantly preoccupied with gambling, feel the need to spend increasing amounts of money, want to control or quit their gambling but have made several unsuccessful attempts, try to hide their addiction, and rely on others to regain a stable financial status or to cope with other consequences [american2013diagnostic]. For pathological gambling screening, several instruments have been developed, such as the Lie/Bet Questionnaire [johnson1997lie], the Gambling Urge Scale [raylu2004gambling] and the Gambling Craving Scale [young2009gambling]. Besides the classical gambling venues, the rise of the Internet has increased the availability of gambling opportunities. With the high accessibility of Internet gambling, the incidence and prevalence of pathological gambling have also increased [gainsbury2011internet, gainsbury2015online].
Self-harm is a serious mental health problem; people suffering from this condition are often also diagnosed with other mental health disorders: borderline personality disorder, depression, anxiety, eating disorders, post-traumatic stress disorder (PTSD), and addictions (alcohol and drugs) [fliege2009risk]. Deliberate self-harm refers to the act of intentionally inflicting harm on oneself, resulting in tissue damage. It does not include other types of damage to the body (not exercising, unhealthy eating patterns, or deliberately engaging in actions resulting in psychological self-harm). Among adolescents, 17% have engaged in non-suicidal self-injury at least once. However, self-harm is less common in adults, with only 5% of them engaging in this behaviour (https://www.apa.org/monitor/2015/07-08/who-self-injures).
Depression is the leading cause of disability, with more than 264 million people from all age groups suffering from this disorder, and with women more affected than men (https://www.who.int/news-room/fact-sheets/detail/depression). For measuring the severity of depression, one of the most widely used instruments is the Beck Depression Inventory (BDI) [Upton2013], a 21-question multiple-choice questionnaire that assesses depression symptoms from the patient’s own experience. This questionnaire is a powerful tool that allows a rapid assessment of one’s mental state. Moreover, through computational methods, the answers to its questions can be estimated from a patient’s social media interactions and writings.
The early risk detection task has become a popular topic in the interdisciplinary research area combining natural language processing and psychology. The Early Risk Prediction on the Internet (eRisk) Workshop (https://erisk.irlab.org/), part of the Conference and Labs of the Evaluation Forum (CLEF), focuses on the effectiveness and practical applications of early risk detection of different pathologies or safety threats. In addition, the workshop explores the technologies that can be used for people’s health and safety, and the strategies related to building test collections [losada2020overview].
Throughout the five editions of the eRisk workshop, the tasks have addressed pathological gambling, depression, self-harm and anorexia. Another workshop tackling the early detection task is the Workshop on Computational Linguistics and Clinical Psychology (CLPsych, https://clpsych.org/), with tasks such as predicting current and future psychological health from childhood essays [lynn2018clpsych] and early detection of suicide risk [zirikly2019clpsych]. Other efforts in early risk detection target the early detection of cognitive decline [beltrami2018speech] or antisocial behaviour [singh2019framework].
In this edition of the eRisk workshop, 17 teams developed models to solve the proposed tasks, the highest number of participating teams in the workshop’s history [parapar2021overview]. The tasks from the current edition of the workshop are:
Task 1: Early Detection of Signs of Pathological Gambling
Task 2: Early Detection of Signs of Self-Harm
Task 3: Measuring the Severity of the Signs of Depression
With the recent advancements in deep learning and natural language processing, we base our solutions on state-of-the-art pre-trained transformer models [devlin-etal-2019-bert] (i.e. BERT) and on publicly available data scraped from social media websites. Further, we provide an overview of existing methods from previous editions of the eRisk workshop and present our solutions for Tasks 1, 2 and 3.
2 Related Work
Automated methods for compulsive gambling detection focus on analyzing behavioural player data (frequency of betting, number of different games played, etc.) [adami2013markers, peres2021time], transactional data [ladouceur2017responsible] or texts from the correspondence between customers and customer service employees from online gambling operators [haefeli2011early]. To the best of our knowledge, the eRisk workshop is the first to tackle pathological gambling detection from social media text.
The task on early detection of signs of self-harm is a continuation of eRisk 2019’s T2 and 2020’s T1 tasks. The best performing runs from the previous workshop edition in terms of Precision and F1 were developed by [martinez2020early]. The team used sentence classification models based on XLM-RoBERTa [conneau2019unsupervised] trained on different datasets. They avoided using the training dataset given by the eRisk organizers and instead gathered their own data from the Pushshift Reddit Dataset [baumgartner2020pushshift]. For the positive group, all posts from users who posted or commented in the r/selfharm subreddit were extracted, including their submissions to other subreddits. A small proportion of the user posts was manually annotated by the team, and the rest were used as the positive class without further labeling. For the control group, they extracted random users who had never posted on a self-harm-related subreddit. Following this approach, the team obtained a small, manually labeled dataset and three other, larger, automatically generated datasets. In their experiments, they used only the last three collections. Other approaches to self-harm detection used bag-of-words [oliveira2020bioinfo, achilles2020using], bags of sub-emotions [aragon2020inaoe], LIWC, emotion and sentiment features [uban2020deep], or deep learning models [losada2020overview]. The results from this task suggest that such systems can obtain good performance on very little data and can be used to support human experts in the early detection of the signs of self-harm [losada2020overview].
The task of measuring the severity of the signs of depression is a continuation of eRisk 2019’s T3 and 2020’s T2 tasks. The best performing run from the last edition in terms of the Average Hit Rate (AHR) metric was that of [oliveira2020bioinfo]. The team used a machine learning model for depression detection from their previous participation in the eRisk workshop, trained on the RSDD dataset [DBLP:journals/corr/abs-1709-01848]. They predicted with this model whether a user was depressed and combined the resulting score with several psycholinguistic and behavioural features, such as polarity, use of self-related words, use of absolutist words, mentions of mental-disorder-related terms, and other terms regarding the symptoms of depression from the BDI questionnaire. Regarding psycholinguistic features, [bucur2021exploratory] showed through cross-dataset analysis the connection between offensive language and signs of depression. Other approaches were based on transformer models [martinez2020early], SVMs and logistic regression [uban2020deep], LDA [maupome2020early], writing style [maupome2020early], etc.
The results from this task show that a depression screening tool that predicts the answers to the BDI questionnaire from social media posts has not yet reached a performance that would allow it to be used in a real scenario [losada2020overview], as opposed to the previous task on early detection of signs of depression, which obtained good results when evaluated with the error measure for early risk detection [losada2018overview, funez2018unsl, bucur2020detecting].
3 Task 1: Early Detection of Signs of Pathological Gambling
The first task of this edition of the eRisk workshop is detecting early signs of pathological gambling. This is a novel task, not appearing in previous workshop editions, which focuses on data mining from social media. As such, the organizers do not provide a training set and only offer a testing platform. The data for this task was collected from Reddit, following the methodology described by [losada2016test], as the social media platform allows the use of its public data for research purposes [losada-crestani2016]. The data is a collection of users’ posts from different subreddits.
In 2019, the workshop switched from a chunk-based to a sequence-based release of the test data. Before 2019, participants tested their runs on 10 chunks, each containing several submissions from each user, which were released weekly. In the current workshop, each team processes one post at a time from each user, in chronological order.
Our approach is to construct a large, loosely annotated dataset of user posts on which to train our model. We follow a methodology similar to [martinez2020early] and use the posts from the r/GamblingAddiction (https://www.reddit.com/r/GamblingAddiction/) and r/problemgambling (https://www.reddit.com/r/problemgambling/) subreddits as positive training data for pathological gambling detection. Posts in these subreddits mostly discuss gambling addiction, so we assume that most of their regular users are pathological gamblers. While evidently not all users of these subreddits have a gambling addiction, we consider the data sufficiently reliable to train our model.
As such, we crawled the gambling-related subreddits and obtained 2111 unique users with a total of 13377 posts only in these two subreddits. To avoid having only posts containing topics related to gambling, from the 2111 users we also crawled their posts from other unrelated subreddits, which increased the posts from the positive class to 80998.
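The paper does not include the crawling code itself; since the related data originates from the Pushshift Reddit Dataset, a crawl like this one would plausibly paginate through the public Pushshift API. A minimal sketch of the query construction, where `build_query` and its parameters are our own illustrative names:

```python
from urllib.parse import urlencode

# Public Pushshift submission-search endpoint (assumed here for illustration).
PUSHSHIFT = "https://api.pushshift.io/reddit/search/submission/"

def build_query(subreddit, before=None, size=100):
    """Build one page of a Pushshift submission query. To paginate,
    pass the epoch timestamp of the oldest post already seen as `before`."""
    params = {"subreddit": subreddit, "size": size, "sort": "desc"}
    if before is not None:
        params["before"] = before
    return PUSHSHIFT + "?" + urlencode(params)
```

Each returned page would then be fetched (e.g. with `urllib.request`) and its `data` field appended to the per-user post collection until no more results are returned.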
For the collection of control users, we make use of the control split of the Reddit Self-reported Depression Diagnosis (RSDD) [DBLP:journals/corr/abs-1709-01848] and Early Risk Prediction on the Internet (eRisk) 2018 [losada-crestani2016]
depression datasets. RSDD and eRisk are two publicly available datasets whose posts were also gathered from Reddit. Both datasets contain users labeled as depressed based on a mention of a diagnosis in their posts, alongside control users. In RSDD, each user in the depression group was matched with 12 control users who had the most similar post probability distributions. The control group contains users who do not suffer from depression: there is no mention of a diagnosis in their posts, their posts contain no depression-related terms, and they had never posted in a mental health-related subreddit. For eRisk, the control group is composed of random Reddit users, some of them active in mental health-related subreddits, but without any mention of a diagnosis.
The control splits used in our experiments contain random posts from users from various subreddits and the number of posts was chosen to match the number of posts in the positive class. We chose two variants, one that uses the control split from RSDD and one that uses the control split from eRisk 2018. We wanted to see the impact of the two different control datasets, as RSDD does not contain posts related to mental health, but eRisk does include posts from users active in mental health subreddits.
Further, we fine-tuned a pre-trained BERT classifier using the AdamW optimizer with a learning rate of 0.0002 for 2 epochs. For prediction, we chose aggressive thresholds of 0.99 / 0.98 on the output probability to decide whether or not a user presents signs of gambling addiction. Such a high threshold minimizes false positives and ensures that a decision is taken only if a post clearly denotes signs of addiction.
For this task, evaluation is performed similarly to previous eRisk workshops, namely with Precision, Recall, F1, the ERDE metric [Trotzek_2020] as an early risk detection measure, and the latency-weighted F measure (Flatency) [10.1145/3159652.3159725]. Flatency was deemed more interpretable and informative for this task, as it combines the effectiveness of the decision (estimated with the F measure) and the delay in taking it. Testing data for Task 1 was released on an item-by-item basis through an API that processed the current predictions for all runs and progressively released the next sequence of data. We opted for 5 runs and processed 1828 user writings, in a total of 1 day, 23:43:28 hours.
The five runs and the corresponding datasets used for the classification of pathological gamblers are described below.
RUN 0 BERT model trained on all the posts (including those from unrelated subreddits) from users who posted in the gambling-related subreddits, with control data from eRisk, with a 0.99 threshold
RUN 1 BERT model trained on all the posts (including those from unrelated subreddits) from users who posted in the gambling-related subreddits, with control data from RSDD, with a 0.99 threshold
RUN 2 BERT model trained on all the posts from the gambling subreddits with control data from eRisk, with a 0.99 threshold for classification
RUN 3 BERT model trained on all the posts (including those from unrelated subreddits) from users who posted in the gambling-related subreddits, with control data from eRisk, with a 0.98 threshold
RUN 4 BERT model trained on all the posts from the gambling subreddits with control data from eRisk, with a 0.98 threshold for classification
(Table 1: decision-based evaluation per run — team name, run id, P, R, F1, ERDE5, ERDE50, latency, speed, latency-weighted F1. Table 2: ranking-based evaluation after 1, 100, 500 and 1000 user writings.)
Our results, compared with the best results from the early detection of pathological gambling task, are presented in Table 1 for the decision-based evaluation and Table 2 for the ranking-based evaluation. Our runs obtained high Recall, but high Precision and F1 are also needed in a real-life scenario of problem gambling detection. Our ERDE scores were relatively close to the best result from this task. Regarding the ranking-based evaluation, our team’s rankings improved as our models analyzed more data from each user. Even after a single writing from each user, the model was quite effective at predicting users’ risk and prioritizing them.
4 Task 2: Early Detection of Signs of Self-Harm
For the second task, on early self-harm detection, the organizers provided a training dataset composed of all 2020’s T1 users (both training and test) plus the test users of 2019’s T2 task. This dataset contains 144 users in the self-harm class and 608 users in the control class, with a total of 120,898 and 1,241,271 posts, respectively. However, we increased the size of this dataset by crawling the r/selfharm subreddit. This subreddit contains various posts from users seeking help and stories related to their own experiences. We deemed the users from this subreddit as being at risk of self-harm. We chose two variants for control users, one that uses the control split from RSDD and one that contains the control split from eRisk 2018, as in the previous task on pathological gambling detection.
We followed the same methodology as in the previous task. We also use a BERT model trained with the same parameters. For prediction, we chose aggressive thresholds of 0.99 / 0.98 / 0.95 on the output probability to decide whether or not the user presents signs of self-harm. This high threshold minimizes false positives and ensures that a decision is taken only if a post denotes signs of self-harm.
Similarly to Task 1, testing data was released progressively after all runs offered predictions for the current batch of posts. We opted for 5 runs, but processed only 156 user writings in 1 day and 4:57:23 hours, due to the lack of computational resources and server latency.
(Table 3: decision-based evaluation per run — team name, run id, P, R, F1, ERDE5, ERDE50, latency, speed, latency-weighted F1. Table 4: ranking-based evaluation after 1, 100, 500 and 1000 user writings.)
The five runs and the corresponding datasets used for the classification of self-harmers are described below.
RUN 0 BERT model trained only on the training data provided by the organizers, with a 0.95 threshold
RUN 1 BERT model trained on all the posts (including those from unrelated subreddits) from users who posted in the self-harm subreddit, with control data from eRisk, with a 0.99 threshold
RUN 2 BERT model trained on the training data provided by the organizers and posts from the self-harm subreddit, with control data from eRisk, with a 0.99 threshold
RUN 3 BERT model trained on the training data provided by the organizers and posts from the self-harm subreddit, with control data from eRisk, with a 0.98 threshold
RUN 4 BERT model trained on the training data provided by the organizers and posts from the self-harm subreddit, with control data from eRisk, with a 0.95 threshold
Our results, together with the best results, for the early detection of signs of self-harm are presented in Table 3 for the decision-based evaluation and Table 4 for the ranking-based evaluation. Our team obtained good results in terms of the F1 and ERDE measures. For the ranking-based evaluation, we obtained the best P@10 and NDCG@10 scores among all teams after 100 writings, which means that these runs were effective at detecting users’ risk of self-harm.
5 Task 3: Measuring the severity of the signs of depression
For this task, participants are required to estimate the level of depression of a user, given their history of postings. Users have filled in the Beck Depression Inventory (BDI) [Upton2013], which assesses the presence of feelings such as sadness, pessimism and loss of energy. The questionnaire comprises 21 questions with 4 answers each, denoting the severity of the respective depression symptom, except for questions 16 and 18, which have seven answer options each (0, 1a, 1b, 2a, 2b, 3a, 3b). Questions 16 and 18 refer to changes in sleeping patterns and appetite, for which both extremes (undersleeping or oversleeping, for instance) are indicative of one’s mental health condition.
For model training, data from the 2019 and 2020 challenges was provided, which contains a total of 89 users with ground truth questionnaire answers and a total of 46502 posts from Reddit.
We are provided with 80 users and their posts for evaluation, totalling 32237 posts and comments from Reddit. The evaluation for this task is threefold: the number of correct responses, the absolute difference between the depression levels obtained from the ground truth responses to the questionnaire and from the estimated answers, and the depression level itself, which is used to categorize users into minimal (0-9), mild (10-18), moderate (19-29) and severe (30-63) depression.
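The mapping from a total BDI score to these categories is a simple lookup; for concreteness (the function name is ours):

```python
def bdi_category(score):
    """Map a total BDI score (0-63) to the depression category
    used in the eRisk evaluation."""
    if not 0 <= score <= 63:
        raise ValueError("BDI score must be in [0, 63]")
    if score <= 9:
        return "minimal"
    if score <= 18:
        return "mild"
    if score <= 29:
        return "moderate"
    return "severe"
```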
The number of correct responses is measured by the Average Hit Rate (AHR). The Average Closeness Rate (ACR) takes into account that the responses to the depression questionnaire represent an ordinal scale, not merely separate options. The Average Difference between Overall Depression Levels (ADODL) measure compares the overall depression level computed from the ground truth questionnaire with that of the automated questionnaire. Finally, the Depression Category Hit Rate (DCHR) calculates the fraction of cases in which the automated questionnaire led to a depression category equivalent to the one obtained from the ground truth questionnaire [parapar2021overview].
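Under these definitions, the three score-based metrics can be sketched for a single user as follows (following the formulas given in the eRisk overview papers; the averaging over all users is omitted for brevity):

```python
def ahr(pred, truth):
    """Hit rate: fraction of questions answered exactly right."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def acr(pred, truth, mad=3):
    """Closeness rate: credit answers by ordinal distance, where
    mad is the maximum possible absolute difference (3 for 4-option items)."""
    return sum((mad - abs(p - t)) / mad for p, t in zip(pred, truth)) / len(truth)

def adodl(pred, truth):
    """Difference between overall depression levels (questionnaire sums),
    normalised by the maximum BDI score of 63."""
    return (63 - abs(sum(pred) - sum(truth))) / 63
```

DCHR then simply checks whether the two overall scores fall into the same severity category.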
To determine the answers to the questionnaire, we trained several BERT models. Firstly, we fine-tuned a BERT classifier with 21 x 4 outputs, corresponding to all 84 possible responses to the questionnaire. Each question head outputs a probability distribution across its possible answers. For questions 16 and 18, we merged the sub-answers (e.g. 1a and 1b) into a single option. The model was trained using cross-entropy loss and the AdamW optimizer, with a learning rate of 0.0002, for 2 epochs.
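The 21-head output can be realized by reshaping a flat 84-logit vector and applying a softmax per question; a framework-free sketch of that step (our own illustrative code, not the paper's implementation):

```python
import math

N_QUESTIONS, N_OPTIONS = 21, 4

def per_question_probs(logits):
    """Split a flat vector of 21 x 4 logits into one softmax
    distribution per BDI question."""
    assert len(logits) == N_QUESTIONS * N_OPTIONS
    probs = []
    for q in range(N_QUESTIONS):
        z = logits[q * N_OPTIONS:(q + 1) * N_OPTIONS]
        m = max(z)                      # subtract max to stabilise exp()
        e = [math.exp(v - m) for v in z]
        s = sum(e)
        probs.append([v / s for v in e])
    return probs
```

The cross-entropy loss is then the sum over the 21 questions of the negative log-probability assigned to each ground-truth answer.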
For our second run, we added another classification head, which outputs a probability distribution over the severity score (minimal/mild/moderate/severe). The training regime was identical to our first scenario.
Finally, we constructed an architecture that takes into account each answer’s natural-language formulation. At each training iteration, a user post and all the answers are fed into the BERT model, and the output is given by the dot product between the embedding vectors of the post text and the answer text. Both the answer embeddings and the user text embedding are trained jointly. Figure 1 shows the diagram of our architecture. This approach is similar to recent proposals that leverage contrastive learning to train neural networks, especially in noisy or scarce data regimes. By computing the similarity between the answers and the user posts, we make use of the semantic meaning of the answer text, which provides more informed predictions than plain classification. The network was trained using the AdamW optimizer for 2 epochs with a learning rate of 0.00002.
Since the dataset is noisy and not all posts contain relevant information for every question, at test time we select, for each question, the answer with the maximum probability across all of a user’s posts.
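A simplified sketch of this scoring-and-aggregation step, using raw dot products in place of the trained BERT embeddings (function names are ours):

```python
def dot(u, v):
    """Dot product of two equal-length embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def pick_answer(post_embs, answer_embs):
    """For one question: score every (post, answer) pair by the dot
    product of their embeddings, take the maximum over posts for each
    answer, and return the index of the best-scoring answer."""
    best_idx, best_score = 0, float("-inf")
    for a_idx, a_emb in enumerate(answer_embs):
        score = max(dot(p, a_emb) for p in post_embs)
        if score > best_score:
            best_idx, best_score = a_idx, score
    return best_idx
```

Taking the maximum over posts means a single highly relevant post can determine the answer, which matches the intuition that most posts are uninformative for any given BDI question.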
Table 5 showcases our results on Task 3, compared to the best results from other teams. The five runs are described below.
RUN 0 corresponds to the plain classification of answers
RUN 1 is the joint prediction of responses and depression severity
RUN 2 is the similarity learning architecture
RUN 3 is an ensemble of the outputs from runs 0, 1 and 2
RUN 4 is an ensemble of the outputs from runs 0 and 2
Our best run (run 2) obtained the best overall results, showing that leveraging the question context is an important step towards accurate depression screening. Better results could likely be obtained by considering only the posts that are relevant to a particular question.
6 Conclusions
In this paper, we presented the contributions of the BLUE team to the eRisk 2021 workshop on early risk detection on the internet. Our approaches tackled early risk detection of pathological gambling and self-harm, as well as estimating depression severity, from a user’s social media posts (i.e. Reddit). For early risk detection of problem gambling, we opted to crawl and construct a dataset from popular subreddits that address gambling addiction, and we trained a BERT classifier on user posts, with control users from RSDD and eRisk 2018, two popular datasets for depression detection. Similarly, we crawled the subreddit associated with self-harm behaviour to augment the training dataset provided by the competition organizers for the detection of self-harm. Finally, for estimating depression severity, we constructed an architecture that combines the questionnaire answers with a user’s posts to estimate the probability of each answer being correct. We obtained reasonable results through our methods. However, these problems are still far from being solved, and further research is needed to make computational methods feasible for use in the real world. The positive impact of early risk detection for various mental health disorders cannot be overstated, as such methods have the potential to save many human lives.