Adapting Deep Learning Methods for Mental Health Prediction on Social Media

by Ivan Sekulić, et al.

Mental health poses a significant challenge for an individual's well-being. Text analysis of rich resources, like social media, can contribute to deeper understanding of illnesses and provide means for their early detection. We tackle a challenge of detecting social media users' mental status through deep learning-based models, moving away from traditional approaches to the task. In a binary classification task on predicting if a user suffers from one of nine different disorders, a hierarchical attention network outperforms previously set benchmarks for four of the disorders. Furthermore, we explore the limitations of our model and analyze phrases relevant for classification by inspecting the model's word-level attention weights.





1 Introduction

Mental health is a serious issue of the modern-day world. According to the World Health Organization’s 2017 report and Wykes et al. (2015), more than a quarter of Europe’s adult population suffers an episode of a mental disorder in their life. The problem is compounded by the fact that as much as 35–50% of those affected go undiagnosed and receive no treatment for their illness. In line with WHO’s Mental Health Action Plan Saxena et al. (2013), the natural language processing community supports the gathering of information and evidence on mental conditions, focusing on text analysis of authors affected by mental illnesses.

Researchers can utilize large amounts of text on social media sites to gain a deeper understanding of mental health and develop models for early detection of various mental disorders De Choudhury et al. (2013a); Coppersmith et al. (2014); Gkotsis et al. (2016); Benton et al. (2017); Sekulić et al. (2018); Zomick et al. (2019). In this work, we experiment with the Self-reported Mental Health Diagnoses (SMHD) dataset Cohan et al. (2018), consisting of thousands of Reddit users diagnosed with one or more mental illnesses. The contribution of our work is threefold. First, we adapt a deep neural model, proven successful in large-scale document classification, to user classification on social media, outperforming previously set benchmarks for four out of nine disorders. In contrast to the majority of preceding studies on mental health prediction in social media, which relied mostly on traditional classifiers, we employ a Hierarchical Attention Network (HAN) Yang et al. (2016). Second, we explore the limitations of the model in terms of the data needed for successful classification, specifically the number of users and the number of posts per user. Third, through the attention mechanism of the model, we analyze the phrases most relevant for classification and compare them to previous work in the field. We find similarities between lexical features and n-grams identified by the attention mechanism, supporting previous analyses.

2 Dataset and the Model

2.1 Self-reported Mental Health Diagnoses Dataset

The SMHD dataset Cohan et al. (2018) is a large-scale dataset of Reddit posts from users with one or multiple mental health conditions. The users were identified by constructing patterns for discovering self-reported diagnoses of nine different mental disorders. For example, if a user writes “I was officially diagnosed with depression last year”, they would be considered to suffer from depression.

Nine or more control users, which are meant to represent general population, are selected for each diagnosed user by their similarity, i.e., by their number of posts and the subreddits (sub-forums on Reddit) they post in. Diagnosed users’ language is normalized by removing posts with specific mental health signals and discussions, in order to analyze the language of general discussions and to be more comparable to the control groups. The nine disorders and the number of users per disorder, as well as average number of posts per user, are shown in Table 1.

Disorder        # users   # posts per user
Depression       14,139     162.2 (84.2)
ADHD             10,098     164.7 (83.6)
Anxiety           8,783     159.7 (83.0)
Bipolar           6,434     157.6 (82.4)
PTSD              2,894     160.7 (84.7)
Autism            2,911     168.3 (84.5)
OCD               2,336     158.8 (81.4)
Schizophrenia     1,331     157.3 (80.5)
Eating              598     161.4 (81.0)
Table 1: Number of users in the SMHD dataset per condition and the average number of posts per user (with std.).

For each disorder, Cohan et al. (2018) analyze the differences in language use between diagnosed users and their respective control groups. They also provide benchmark results for the binary classification task of predicting whether a user belongs to the diagnosed or the control group. We reproduce their baseline models for each disorder and compare them to our deep learning-based model, explained in Section 2.3.

2.2 Selecting the Control Group

Cohan et al. (2018) select nine or more control users for each diagnosed user and run their experiments with these mappings. Since this exact mapping is not available, for each of the nine conditions we had to select the control group ourselves. For each diagnosed user, we draw exactly nine control users from the pool of control users present in SMHD and proceed to train and test our binary classifiers on the newly created sub-datasets.

In order to create a statistically fair comparison, we run the selection process multiple times, and we reimplement the benchmark models used in Cohan et al. (2018). Multiple sub-datasets with different control groups not only provide us with unbiased results, but also show how the results of a binary classification can differ depending on the control group.
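The selection procedure can be sketched as follows. This is a simplified sketch: controls are drawn uniformly at random from the SMHD control pool, whereas the original dataset matches controls to diagnosed users by post count and subreddit similarity; the function name and data layout are illustrative.

```python
import random

def sample_control_groups(diagnosed, control_pool, n_controls=9, n_runs=5, seed=0):
    """For each run, draw n_controls control users per diagnosed user
    from the pool (without replacement within a run)."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        controls = rng.sample(control_pool, n_controls * len(diagnosed))
        runs.append({"diagnosed": diagnosed, "control": controls})
    return runs

# Toy usage: 2 diagnosed users, a pool of 100 candidate controls, 5 runs.
runs = sample_control_groups(["u1", "u2"], [f"c{i}" for i in range(100)])
```

Each run then yields its own sub-dataset on which the classifiers are trained and tested, and scores are averaged over runs.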

2.3 Hierarchical Attention Network

We adapt a Hierarchical Attention Network (HAN) Yang et al. (2016), originally used for document classification, to user classification on social media. A HAN consists of a word sequence encoder, a word-level attention layer, a sentence encoder, and a sentence-level attention layer. It employs GRU-based sequence encoders Cho et al. (2014) on the sentence and document levels, yielding a document representation in the end. The word sequence encoder produces a representation of a given sentence, which is then forwarded to a sentence sequence encoder that, given a sequence of encoded sentences, returns a document representation. Both the word and sentence sequence encoders apply attention mechanisms on top to help the encoder more accurately aggregate the representation of a given sequence. For details of the architecture we refer the interested reader to Yang et al. (2016).

In this work, we model a user as a document, enabling an intuitive adaptation of the HAN. Just as a document is a sequence of sentences, we propose to model a social media user as a sequence of posts. Similarly, we identify posts as sentences, both being a sequence of tokens. This interpretation enables us to apply the HAN, which had great success in document classification, to user classification on social media.
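The two-level attention pooling at the core of this adaptation can be sketched in NumPy as follows. This is a minimal sketch: the GRU encoders of the full HAN are omitted, posts are represented directly as token-embedding matrices, and all parameter shapes and names are illustrative rather than those of the original implementation.

```python
import numpy as np

def attention_pool(H, w, b, u):
    """Additive attention as in Yang et al. (2016): score each row of H
    with tanh(H w + b) . u, softmax the scores, return the weighted sum."""
    scores = np.tanh(H @ w + b) @ u          # one score per sequence element
    alpha = np.exp(scores - scores.max())    # numerically stable softmax
    alpha /= alpha.sum()
    return alpha @ H, alpha                  # pooled vector, attention weights

def encode_user(posts, params):
    """posts: list of (n_tokens x dim) arrays, one per post.
    Word-level attention pools each post into a vector; post-level
    attention pools the post vectors into one user representation."""
    post_vecs = np.stack([attention_pool(p, *params["word"])[0] for p in posts])
    user_vec, post_alpha = attention_pool(post_vecs, *params["post"])
    return user_vec, post_alpha

# Toy usage: a user with two posts of 5 and 3 tokens, embedding dim 8.
rng = np.random.default_rng(0)
d = 8
params = {"word": (rng.normal(size=(d, d)), np.zeros(d), rng.normal(size=d)),
          "post": (rng.normal(size=(d, d)), np.zeros(d), rng.normal(size=d))}
posts = [rng.normal(size=(5, d)), rng.normal(size=(3, d))]
user_vec, alpha = encode_user(posts, params)
```

The post-level weights `alpha` are what Section 5 proposes to analyze further, while the word-level weights drive the analysis in Section 3.3.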

3 Results

                     Depression   ADHD   Anxiety  Bipolar   PTSD   Autism    OCD   Schizo  Eating
Logistic Regression     59.00    51.02    62.34    61.87   69.34    55.57  59.49   56.31    70.71
Linear SVM              58.64    50.08    61.69    61.30   69.91    55.35  58.56   57.43    70.91
Supervised FastText     58.38    48.80    60.17    56.53   61.08    49.52  54.16   46.73    63.73
HAN                     68.28    64.27    69.24    67.42   68.59    53.09  58.51   53.68    63.94
Table 2: F1 measure averaged over five runs with different control groups.

3.1 Experimental Setup

The HAN uses two layers of bidirectional GRU units with a hidden size of 150, each followed by a 100-dimensional attention mechanism. The first layer encodes posts, while the second encodes a user as a sequence of encoded posts. The output layer is a 50-dimensional fully-connected network, with binary cross-entropy as the loss function. We initialize the input layer with 300-dimensional GloVe word embeddings Pennington et al. (2014). We train the model with Adam Kingma and Ba (2014), with an initial learning rate of … and a batch size of 32, for 50 epochs. The model that performs best on the development set is selected.

We implement the baselines as in Cohan et al. (2018). Logistic regression and the linear SVM were trained on tf-idf weighted bag-of-words features, where each user’s posts are concatenated and all tokens lower-cased. Optimal parameters were found on the development set, and models were evaluated on the test set. FastText Joulin et al. (2016) was trained for 100 epochs, using character n-grams of size 3 to 6, with a 100-dimensional hidden layer. We take diagnosed users from the predefined train-dev-test split and select the control group as described in Section 2.2. To ensure unbiased results and a fair comparison to the baselines, we repeat the process of selecting the control group five times for each disorder and report the average over the runs.
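The feature extraction behind the linear baselines can be sketched as follows. `tfidf_features` is a hypothetical helper using a plain log idf without smoothing; the exact weighting variant and tokenization of the original baselines are not specified here.

```python
import math
from collections import Counter

def tfidf_features(user_posts):
    """Concatenate each user's posts, lower-case, and build tf-idf
    weighted bag-of-words vectors (sparse, as token -> weight dicts)."""
    docs = [" ".join(posts).lower().split() for posts in user_posts]
    df = Counter()                      # document frequency per token
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    vocab = sorted(df)
    idf = {t: math.log(n / df[t]) for t in vocab}
    feats = []
    for doc in docs:
        tf = Counter(doc)
        feats.append({t: tf[t] * idf[t] for t in tf})
    return feats, vocab

# Toy usage: two users, one post each; shared tokens get idf 0.
feats, vocab = tfidf_features([["i feel fine today"], ["i feel sad today"]])
```

These per-user vectors would then be fed to logistic regression or a linear SVM, with hyperparameters tuned on the development set.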

3.2 Binary Classification per Disorder

We report the F1 measures per disorder in Table 2, in the task of binary classification of users, with the diagnosed class as the positive one. Our model outperforms the baseline models for Depression, ADHD, Anxiety, and Bipolar disorder, while it proves insufficient for PTSD, Autism, OCD, Schizophrenia, and Eating disorder. We hypothesize that the reason for this is the size of the particular sub-datasets, which can be seen in Table 1. We observe higher scores for the HAN in disorders with sufficient data, suggesting once again that deep neural models are data-hungry Sun et al. (2017). Logistic regression and the linear SVM achieve higher scores where there is a smaller number of diagnosed users. In contrast to Cohan et al. (2018), supervised FastText yields worse results than tuned linear models.

Figure 1: F1 scores for different numbers of posts per user made available to the HAN, averaged over three runs with different control groups.

We further investigate the impact of dataset size on the final classification results. We limit the number of posts per user available to the model to examine the amount needed for reasonable performance. The results for different numbers of available posts per user are presented in Figure 1. Experiments were run three times for each disorder and each number of available posts, each time with a different control group selected. We observe a positive correlation between the data provided to the model and its performance, although we find an upper bound to this tendency. As the average number of posts per user is roughly 160 (Table 1), it is reasonable to expect a model to perform well with similar amounts of data available. However, further analysis is required to see whether the model reaches a plateau because a large amount of data is not needed for the task, or because the model is not expressive enough.
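The ablation protocol above can be sketched as follows. The values in `budgets` are placeholders: the exact post budgets shown in Figure 1 are not recoverable from this text.

```python
def truncate_posts(users, max_posts):
    """Keep at most max_posts posts per user for the data-size ablation."""
    return {u: posts[:max_posts] for u, posts in users.items()}

# Hypothetical post budgets; the real x-axis values appear only in Figure 1.
budgets = [10, 50, 100, 150, 200]

# Toy usage: one user with 120 posts, truncated at each budget.
users = {"u1": [f"post{i}" for i in range(120)]}
subsets = {b: truncate_posts(users, b) for b in budgets}
```

For each budget and each of the three control-group draws, the HAN is retrained on the truncated sub-dataset and the F1 scores are averaged.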

3.3 Attention Weights Analysis

The HAN, through its attention mechanism, provides a clear way to identify posts, and words or phrases within those posts, relevant for classification. We examine attention weights at the word level and compare the most attended words to prior research on depression. Depression is selected as the most prevalent disorder in the SMHD dataset, with a number of studies in the field Rude et al. (2004); Chung and Pennebaker (2007); De Choudhury et al. (2013b); Park et al. (2012). For each post, we extract the two words with the highest attention weights as the most relevant for classification. If the two words appear next to each other in a post, we consider them a bigram. Some of the most common unigrams and bigrams are presented in Table 3, aggregated under the most common LIWC categories.
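The extraction step described above can be sketched as follows, assuming per-post token lists and word-level attention weights are available from the model; the function name and signature are illustrative.

```python
from collections import Counter

def top_attended(posts, weights, top_k=2):
    """For each post, take the top_k words by attention weight; if the two
    top words are adjacent in the post, count them as one bigram."""
    counts = Counter()
    for tokens, alpha in zip(posts, weights):
        idx = sorted(range(len(tokens)), key=lambda i: alpha[i], reverse=True)[:top_k]
        if len(idx) == 2:
            i, j = sorted(idx)
            if j == i + 1:                     # adjacent -> bigram
                counts[f"{tokens[i]} {tokens[j]}"] += 1
                continue
        for k in idx:                          # otherwise count as unigrams
            counts[tokens[k]] += 1
    return counts

# Toy usage: top-2 words are adjacent, so they form the bigram "love my".
counts = top_attended([["i", "love", "my", "dad"]], [[0.10, 0.40, 0.35, 0.15]])
```

Aggregating these counts over all of a user group's posts yields tables like Table 3.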

We observe similar patterns between the features shown relevant by the HAN and previous research on signals of depression in language. The importance of personal pronouns in distinguishing depressed authors from the control group is supported by multiple studies Rude et al. (2004); Chung and Pennebaker (2007); De Choudhury et al. (2013b); Cohan et al. (2018). In the categories Affective processes, Social processes, and Biological processes, Cohan et al. (2018) report significant differences between the depressed and control groups, as for some other disorders. Besides the above-mentioned words and their abbreviations, among the most commonly attended are swear words, as well as other forms of informal language. The attention mechanism’s weighting suggests that words and phrases shown to be important in previous studies, using lexical features and linear models, are relevant for the HAN as well.

Category        unigrams                        bigrams
Pers. pronouns  I, my, her, your, they          I’ve never, your thoughts
Affective       like, nice, love, bad           I love
Social          friend, boyfriend, girl, guy    my dad, my girlfriend, my ex
Biological      pain, sex, skin, sleep, porn    your pain, a doctor, a therapist
Informal        omg, lol, shit, fuck, cool      tl dr, holy shit
Other           advice, please, reddit          thank you, your advice
Table 3: Unigrams and bigrams most often given the highest weight by the attention mechanism in depression classification.

4 Related Work

In recent years, social media has been a valuable source for psychological research. While most studies use Twitter data Coppersmith et al. (2015b, 2014); Benton et al. (2017); Coppersmith et al. (2015a), a recent stream turns to Reddit as a richer source of high-volume data De Choudhury and De (2014); Shen and Rudzicz (2017); Gjurković and Šnajder (2018); Cohan et al. (2018); Sekulić et al. (2018); Zirikly et al. (2019). Previous approaches to predicting authors’ mental health usually relied on linguistic and stylistic features, e.g., Linguistic Inquiry and Word Count (LIWC) (Pennebaker et al., 2001) – a widely used feature extractor in various studies regarding mental health Rude et al. (2004); Coppersmith et al. (2014); Sekulić et al. (2018); Zomick et al. (2019).

Recently, Song et al. (2018) built a feature attention network for depression detection on Reddit, showing high interpretability but little improvement in accuracy. Orabi et al. (2018) concatenate all the tweets of a Twitter user into a single document and experiment with various deep neural models for depression detection. Some of the previous studies use deep learning methods on the post level to infer general information about a user Kshirsagar et al. (2017); Ive et al. (2018); Ruder et al. (2016), or detect different mental health concepts in the posts themselves Rojas-Barahona et al. (2018), while we focus on utilizing all of a user’s text. Yates et al. (2017) use a CNN on the post level to extract features, which are then concatenated into a user representation used for self-harm and depression assessment. A CNN requires a fixed post length, putting constraints on the data available to the model, while a HAN utilizes all of the data from posts of arbitrary lengths.

A social media user can be modeled as a collection of their posts, so we look at neural models for large-scale text classification. Liu et al. (2018) split a document into chunks and use a combination of CNNs and RNNs for document classification. While this approach proves successful for scientific paper categorization, it is unintuitive for social media text because there is no clear way to split a user’s data into equally sized chunks. Yang et al. (2016) use a hierarchical attention network for document classification, an approach that we adapt for Reddit. A step further would be adding another level of hierarchy, similar to Jiang et al. (2019), who use a multi-depth attention-based hierarchical RNN to tackle the problem of long-length document semantic matching.

4.1 Ethical considerations

Acknowledging the social impact of NLP research Hovy and Spruit (2016), mental health analysis must be approached carefully, as it is an extremely sensitive matter Šuster et al. (2017). In order to acquire the SMHD dataset, we comply with the Data Usage Agreement, made to protect the users’ privacy. We do not attempt to contact the users in the dataset, nor to identify or link them with other user information.

5 Conclusion

In this study, we experimented with hierarchical attention networks for the task of predicting the mental health status of Reddit users. For the disorders with a fair amount of diagnosed users, a HAN proves better than the baselines. However, the results worsen as the available data decreases, suggesting that traditional approaches remain better for smaller datasets. The analysis of word-level attention weights suggested similarities to previous studies of depressed authors. Embedding mental health-specific insights from previous work could benefit the model in general. Future work includes the analysis of post-level attention weights, with the goal of finding patterns in the relevance of particular posts and, through them, time periods when a user is in distress. As some of the disorders share similar symptoms, e.g., depressive episodes in bipolar disorder, exploiting correlations between labels through multi-task or transfer learning techniques might prove useful. In order to improve classification accuracy, a transformer-based model for encoding users’ posts should be tested.

6 Acknowledgments

This work has been funded by the Erasmus+ programme of the European Union and the Klaus Tschira Foundation.


  • A. Benton, M. Mitchell, and D. Hovy (2017) Multi-task learning for mental health using social media text. arXiv preprint arXiv:1712.03538. Cited by: §1, §4.
  • K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio (2014) On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259. Cited by: §2.3.
  • C. Chung and J. W. Pennebaker (2007) The psychological functions of function words. Social communication 1, pp. 343–359. Cited by: §3.3, §3.3.
  • A. Cohan, B. Desmet, A. Yates, L. Soldaini, S. MacAvaney, and N. Goharian (2018) SMHD: a large-scale resource for exploring online language usage for multiple mental health conditions. In 27th International Conference on Computational Linguistics, pp. 1485–1497. Cited by: §1, §2.1, §3.3, §4.
  • G. Coppersmith, M. Dredze, C. Harman, K. Hollingshead, and M. Mitchell (2015a) CLPsych 2015 shared task: depression and ptsd on twitter. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pp. 31–39. Cited by: §4.
  • G. Coppersmith, M. Dredze, C. Harman, and K. Hollingshead (2015b) From adhd to sad: analyzing the language of mental health on twitter through self-reported diagnoses. In Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pp. 1–10. Cited by: §4.
  • G. Coppersmith, M. Dredze, and C. Harman (2014) Quantifying mental health signals in twitter. In Proceedings of the workshop on computational linguistics and clinical psychology: From linguistic signal to clinical reality, pp. 51–60. Cited by: §1, §4.
  • M. De Choudhury, S. Counts, and E. Horvitz (2013a) Social media as a measurement tool of depression in populations. In Proceedings of the 5th Annual ACM Web Science Conference, pp. 47–56. Cited by: §1.
  • M. De Choudhury and S. De (2014) Mental health discourse on reddit: self-disclosure, social support, and anonymity. In Eighth International AAAI Conference on Weblogs and Social Media, Cited by: §4.
  • M. De Choudhury, M. Gamon, S. Counts, and E. Horvitz (2013b) Predicting depression via social media. In Seventh international AAAI conference on weblogs and social media, Cited by: §3.3, §3.3.
  • M. Gjurković and J. Šnajder (2018) Reddit: a gold mine for personality prediction. In Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media, pp. 87–97. Cited by: §4.
  • G. Gkotsis, A. Oellrich, T. Hubbard, R. Dobson, M. Liakata, S. Velupillai, and R. Dutta (2016) The language of mental health problems in social media. In Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology, pp. 63–73. Cited by: §1.
  • D. Hovy and S. L. Spruit (2016) The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 591–598. Cited by: §4.1.
  • J. Ive, G. Gkotsis, R. Dutta, R. Stewart, and S. Velupillai (2018) Hierarchical neural model with attention mechanisms for the classification of social media text related to mental health. In Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, pp. 69–77. Cited by: §4.
  • A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, and T. Mikolov (2016) Compressing text classification models. arXiv preprint arXiv:1612.03651. Cited by: §3.1.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §3.1.
  • R. Kshirsagar, R. Morris, and S. Bowman (2017) Detecting and explaining crisis. arXiv preprint arXiv:1705.09585. Cited by: §4.
  • M. Park, C. Cha, and M. Cha (2012) Depressive moods of users portrayed in twitter. In Proceedings of the ACM SIGKDD Workshop on healthcare informatics (HI-KDD), Vol. 2012, pp. 1–8. Cited by: §3.3.
  • J. W. Pennebaker, M. E. Francis, and R. J. Booth (2001) Linguistic inquiry and word count: liwc 2001. Mahway: Lawrence Erlbaum Associates 71 (2001), pp. 2001. Cited by: §4.
  • J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543. Cited by: §3.1.
  • L. Rojas-Barahona, B. Tseng, Y. Dai, C. Mansfield, O. Ramadan, S. Ultes, M. Crawford, and M. Gasic (2018) Deep learning for language understanding of mental health concepts derived from cognitive behavioural therapy. arXiv preprint arXiv:1809.00640. Cited by: §4.
  • S. Rude, E. Gortner, and J. Pennebaker (2004) Language use of depressed and depression-vulnerable college students. Cognition & Emotion 18 (8), pp. 1121–1133. Cited by: §3.3, §3.3, §4.
  • S. Ruder, P. Ghaffari, and J. G. Breslin (2016) Character-level and multi-channel convolutional neural networks for large-scale authorship attribution. arXiv preprint arXiv:1609.06686. Cited by: §4.
  • S. Saxena, M. Funk, and D. Chisholm (2013) World health assembly adopts comprehensive mental health action plan 2013–2020. The Lancet 381 (9882), pp. 1970–1971. Cited by: §1.
  • I. Sekulić, M. Gjurković, and J. Šnajder (2018) Not just depressed: bipolar disorder prediction on reddit. In 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, Cited by: §1, §4.
  • J. H. Shen and F. Rudzicz (2017) Detecting anxiety through reddit. In Proceedings of the Fourth Workshop on Computational Linguistics and Clinical Psychology—From Linguistic Signal to Clinical Reality, pp. 58–65. Cited by: §4.
  • C. Sun, A. Shrivastava, S. Singh, and A. Gupta (2017) Revisiting unreasonable effectiveness of data in deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pp. 843–852. Cited by: §3.2.
  • S. Šuster, S. Tulkens, and W. Daelemans (2017) A short review of ethical challenges in clinical natural language processing. In Proceedings of the First ACL Workshop on Ethics in Natural Language Processing/Hovy, Dirk [edit.]; et al., pp. 80–87. Cited by: §4.1.
  • Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy (2016) Hierarchical attention networks for document classification. In Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies, pp. 1480–1489. Cited by: §1, §2.3.
  • A. Zirikly, P. Resnik, Ö. Uzuner, and K. Hollingshead (2019) CLPsych 2019 shared task: predicting the degree of suicide risk in Reddit posts. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, Minneapolis, Minnesota, pp. 24–33. External Links: Link Cited by: §4.
  • J. Zomick, S. I. Levitan, and M. Serper (2019) Linguistic analysis of schizophrenia in reddit posts. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, pp. 74–83. Cited by: §1, §4.