In 2015, 4.9 million Canadians aged 15 and over experienced a need for mental health care; 1.6 million felt their needs were partially met or unmet. In 2017, over a third of Ontario students in grades 7 to 12 reported having wanted to talk to someone about their mental health concerns but not knowing whom to turn to. These numbers highlight a concerning but all too familiar notion: although highly prevalent, mental health concerns often go unheard. This is all the more worrying given that mental disorders can shorten life expectancy by 7 to 24 years.
In particular, depression is a major cause of morbidity worldwide. Although prevalence varies widely, in most countries the proportion of people who will suffer from depression in their lifetime falls between 8 and 12%. Access to proper diagnosis and care is lacking for a variety of reasons, from the stigma surrounding seeking treatment to a high rate of misdiagnosis. These obstacles could be mitigated to some extent among social media users by analyzing their output on these platforms to assess their risk of depression or other mental health afflictions. The analysis of user-generated content could give valuable insights into users' mental health, identify those at risk, and help provide them with better support [3, 11]. To promote such analyses, which could lead to tools supporting mental health practitioners and forum moderators, the research community has put forward shared tasks like CLPsych and the CLEF eRisk pilot task [1, 18]. Participants must identify users at risk of mental health issues, such as imminent risk of depression, post-traumatic stress disorder, or anorexia. These tasks provide participants with annotated data and a framework for testing the performance of their approaches.
In this paper, we present a neural approach to identifying social media users at risk of depression from their writings in a subreddit forum, in the context of the eRisk 2018 pilot task. From a technical standpoint, the principal interest of this investigation is the use of different aggregation methods for predictions on groups of documents. Using the power of RNNs for the sequential treatment of documents, we explore several ways of combining predictions on documents to make a prediction on their author.
The dataset was built using the writings of 887 users, and was provided in whole at the beginning of the task. Users in the RISK class have admitted to having been diagnosed with depression; NO_RISK users have not. It should be noted that a user's writings, or posts, may originate from separate discussions on the website. The individual writings, however, are not labelled: only the user as a whole is labelled as RISK or NO_RISK. The two classes of users are highly imbalanced in the training set, with only 135 users in the positive class against 752 in the negative class. Table 1 presents some statistics on the task dataset.
We use this dataset but consider a simple classification task, as opposed to the early-risk detection that was the object of the shared task.
| submissions / subject | 367.1 | 640.7 | 514.7 | 680.9 |
| words / submission    | 27.4  | 21.8  | 27.6  | 23.7  |
We represent users as sets of writings rather than sequences of writings. This is partly due to the intuition that, generally speaking, the order of writings is not significant in the context of forums. It is also due to the fact that treating writings sequentially would be cumbersome, especially if we consider training on all ten chunks. However, we do consider writings as sequences of words, as this is the main strength of RNNs. We therefore write a user as the set of their writings, $u = \{d_1, \ldots, d_n\}$. A given writing $d_j$ is then a sequence of words, $(w_{j,1}, \ldots, w_{j,T_j})$, with $T_j$ being the index of the last word. Thus, $w_{j,t}$ is the $t$-th word of the $j$-th post for a given user.
3.1 Aggregating predictions on writings
3.1.1 Late Inter-Document Averaging
We set out to put together an approach that aggregates predictions made individually and sequentially on the writings of a user. That is, we read the different writings of a user in parallel and average the predictions made on them. This is our first model, LIDA. Using the RNN architecture of our choice, we read each word of a post and update its hidden state:
$$h_{j,t} = f(h_{j,t-1}, x_{j,t}; \theta) \qquad (1)$$
Here, $f$ is the transition function of the chosen RNN architecture, $\theta$ is the set of parameters of our particular RNN model, $x_{j,t}$ is the embedded representation of word $w_{j,t}$, and the initial state is set to zero, $h_{j,0} = \mathbf{0}$.
In practice, however, we take but a sample of each user's writings and trim overlong writings (see Sec. 5). LIDA averages the final state of the RNN, $h_{j,T_j}$, across writings:
$$\bar{h} = \frac{1}{n} \sum_{j=1}^{n} h_{j,T_j} \qquad (2)$$
This average is then projected into a binary prediction for the user:
$$\hat{y} = \sigma(w^\top \bar{h}) \qquad (3)$$
with $\sigma$, the standard logistic sigmoid function, normalizing the output, and $w$ a vector of parameters. By averaging over all writings, rather than taking the sum, we ensure that the number of writings does not influence the decision. However, we suspect that regularizing on the hidden state alone will not suffice, as the problem remains essentially the same: gradient correction information will have to travel the entire length of the writings regardless of the corrections made as a result of other writings.
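As a rough illustration, the late-averaging scheme can be sketched with a plain tanh RNN standing in for the mLSTM used later in the paper; all names and dimensions below are illustrative:

```python
import numpy as np

def rnn_read(words, Wx, Wh):
    """Run a plain tanh RNN over one writing (a sequence of word vectors),
    starting from a zero state as in Eq. 1, and return the final hidden state."""
    h = np.zeros(Wh.shape[0])
    for x in words:
        h = np.tanh(Wx @ x + Wh @ h)
    return h

def lida_predict(writings, Wx, Wh, w):
    """Late Inter-Document Averaging: average the final RNN states of all
    writings, then project the mean into a user-level probability
    with a sigmoid (Eq. 3)."""
    finals = np.stack([rnn_read(d, Wx, Wh) for d in writings])
    h_bar = finals.mean(axis=0)          # mean, not sum, over writings
    return 1.0 / (1.0 + np.exp(-w @ h_bar))
```

Because the mean replaces the sum, a prolific user and a sparse one present inputs of comparable scale to the sigmoid, which is the invariance argued for above.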
3.1.2 Continual Inter-Document Averaging
Our second model, CIDA, therefore aggregates the hidden state across writings at every time step, as opposed to only the final one. A first RNN, represented by its hidden state $h_{j,t}$, reads the writings as in Eq. 1. The resulting hidden states are averaged across writings,
$$\bar{h}_t = \frac{1}{n} \sum_{j=1}^{n} h_{j,t} \qquad (4)$$
and then fed as the input to a second RNN, represented by $g_t$:
$$g_t = f(g_{t-1}, \bar{h}_t; \theta') \qquad (5)$$
The final user-level state $g_T$ is used to make a prediction similarly to Eq. 3.
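Under the same illustrative assumptions (a plain tanh RNN in place of the mLSTM, and writings padded to a common length so they can be read in parallel), the continual averaging scheme might look like:

```python
import numpy as np

def cida_summary(writings, Wx, Wh, Ux, Uh):
    """Continual Inter-Document Averaging: read all writings in parallel with
    the post-level RNN (Eq. 1), average the post-level states at every time
    step (Eq. 4), and feed each mean to the user-level RNN (Eq. 5).
    `writings` is an (n, T, d) array of n writings padded to length T."""
    n, T, _ = writings.shape
    h = np.zeros((n, Wh.shape[0]))   # one post-level state per writing
    g = np.zeros(Uh.shape[0])        # user-level state
    for t in range(T):
        h = np.tanh(writings[:, t] @ Wx.T + h @ Wh.T)  # post-level step
        g = np.tanh(Ux @ h.mean(axis=0) + Uh @ g)      # average, then user step
    return g
```

The returned user-level state plays the role of $g_T$ and would be projected into a prediction as in Eq. 3.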
3.2 Inter-document attention
It stands to reason that averaging over the ongoing summary of each document helps in classifying a group of documents. Nonetheless, one would suspect that some documents are more relevant than others to our task. Even if all documents were equally relevant, their interesting parts might not align well. Because we are reading them in parallel, we should try to prioritize the documents that are interesting at the current time step.
CIDA does not offer this possibility, as no weighting of terms is put in place in Eq. 4. Consequently, we turn to the attention mechanism to provide this information. While several manners of both applying and computing the attention mechanism exist [19, 8, 26], we compute the variant known as general attention, which is both learned and content-dependent. In applying it, we introduce IDA, which provides a weighted average to our previous model.
The computation of $h_{j,t}$, the post-level hidden state, remains the same, i.e. Eq. 1. However, these values are compared against the previous user-level hidden state, $g_{t-1}$, to compute the relevant energy between them:
$$e_{j,t} = g_{t-1}^\top W_a \, h_{j,t} \qquad (6)$$
where $W_a$ is a matrix of parameters that learns the compatibility between the hidden states of the two RNNs. The resulting energy scalars, $e_{j,t}$,
are mapped to probabilities by way of softmax normalization:
$$\alpha_{j,t} = \frac{\exp(e_{j,t})}{\sum_{j'=1}^{n} \exp(e_{j',t})} \qquad (7)$$
These probabilities are then used to weight the appropriate $h_{j,t}$:
$$\tilde{h}_t = \sum_{j=1}^{n} \alpha_{j,t} \, h_{j,t} \qquad (8)$$
$\tilde{h}_t$ then replaces $\bar{h}_t$ in the user-level update given by Eq. 5. Through the use of this probability weighting, we can understand $\tilde{h}_t$ as an expected document summary at position $t$ when grouping documents together. As in the previous model, a prediction on the user is made from the final user-level state $g_T$.
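One user-level step under general attention can be sketched as follows, again with a plain tanh transition standing in for the mLSTM and illustrative names:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a vector of energies."""
    e = np.exp(z - z.max())
    return e / e.sum()

def ida_step(h_posts, g_prev, Wa, Ux, Uh):
    """One user-level step of Inter-Document Attention: score each post-level
    state against the previous user-level state with the learned matrix Wa
    (Eq. 6), normalize the energies with a softmax over the writings, and feed
    the weighted mean to the user-level RNN (Eq. 5)."""
    energies = h_posts @ Wa @ g_prev   # one energy per writing
    alphas = softmax(energies)
    h_tilde = alphas @ h_posts         # expected document summary
    g = np.tanh(Ux @ h_tilde + Uh @ g_prev)
    return g, alphas
```

Note that when the previous user-level state carries no information (e.g. the zero initial state), the weights degenerate to the uniform average of CIDA.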
3.3 Intra-document Attention
We extend our use of the attention mechanism from the aggregation of documents to the parsing of individual documents. Similarly to our weighting of documents during aggregation, which depends on the current aggregation state, we compare the current input to past inputs to evince a context for it. This is known in the literature as self-attention. We therefore modify the computation of $h_{j,t}$ from Eq. 1 by adding a context vector, $c_{j,t}$, corresponding to the ongoing context in document $j$ at time $t$:
$$h_{j,t} = f(h_{j,t-1}, [x_{j,t}; c_{j,t}]; \theta) \qquad (9)$$
This context vector is computed by comparing past inputs to the present document-level hidden state:
$$\beta_{j,t'} = x_{j,t'}^\top W_b \, h_{j,t-1}, \quad t' < t \qquad (10)$$
These energies are normalized by softmax and used to sum the previous inputs into the context vector, $c_{j,t} = \sum_{t' < t} \operatorname{softmax}(\beta_j)_{t'} \, x_{j,t'}$. We refer to this model as IIDA.
This last attention mechanism arises from practical difficulties in learning long-range dependencies with RNNs. While RNNs are theoretically capable of summarizing sequences of arbitrary complexity in their hidden state, numerical considerations make learning this process through gradient descent difficult when the sequences are long or the state is too small. This can be addressed in different manners, such as gating mechanisms [13, 10] and the introduction of multiplicative interactions. Self-attention is one such mechanism, where the context vector acts as a reminder of past inputs in the form of a learned expected context. It can be combined with other mechanisms with minimal parameter load.
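The context-vector computation can be sketched as below, assuming a learned bilinear compatibility of the same form as the general attention of Eq. 6; the matrix name `Wb` is illustrative:

```python
import numpy as np

def self_attention_context(past_inputs, h_prev, Wb):
    """Intra-document attention: score each past input of one document against
    the current document-level hidden state, softmax-normalize the scores, and
    sum the past inputs into a context vector (a learned expected context)."""
    if len(past_inputs) == 0:
        return np.zeros(Wb.shape[0])     # no past yet: empty context
    X = np.stack(past_inputs)            # (t-1, input_dim)
    scores = X @ Wb @ h_prev             # one energy per past input
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ X                   # expected past input
```

The resulting vector is concatenated with the current input before the RNN step, so the only added parameters are those of the compatibility matrix.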
4 Related Work
De Choudhury et al. used a more classical approach to classify Twitter users as being at risk of depression or not. They first manually crafted features that describe users' online behavior and characterize their speech. The measures were computed daily, so a user is represented as a time series of features. Training and prediction were then done by an SVM, with PCA for dimensionality reduction.
More similarly to our approach, Ive et al. used Hierarchical Attention Networks to represent user-generated documents. Sentence representations are learned using an RNN with an attention mechanism and are then used to learn the document's representation using the same network architecture. The computation of the attention weights they use differs from ours in that it is non-parametric. Their equivalent of Eq. 6 would be
$$e_{j,t} = g_{t-1}^\top h_{j,t}.$$
This means that the RNNs learn the attention weights along with the representations of the sequences themselves. This attention function was introduced under the name dot.
The location-based attention function is a simpler version of the general attention that we used; it takes only the target hidden state into account. It is stated as
$$e_t = W_a \, g_{t-1},$$
where $e_t$ holds one energy per writing.
As previously mentioned, documents are broken into words. The representation of these words is learned from the entirety of the training documents, all chunks included, using the skip-gram algorithm. All words were lowercased and only the 40k most frequent words were kept. The learned embedded representation is of size 40, using a window of size five. The embeddings are shared by all models.
Documents are trimmed at the end to a length of 66 words, which is longer than 90% of the posts in the dataset. The number of documents varies greatly across user classes. We take small random samples without replacement of 30 documents per user at every iteration (epoch). We contend that resampling each user at every iteration allows us to train for longer, as it is harder for the models to overfit when the components that make up each instance keep changing.
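The per-epoch sampling scheme can be sketched as follows; the constants mirror the thresholds stated in the text, and the function name is illustrative:

```python
import random

MAX_WORDS, DOCS_PER_USER = 66, 30  # trimming and sampling thresholds

def sample_user(writings, rng=random):
    """Resample a user at every epoch: draw up to 30 writings without
    replacement and trim each writing to its first 66 words."""
    docs = rng.sample(writings, min(DOCS_PER_USER, len(writings)))
    return [doc[:MAX_WORDS] for doc in docs]
```

Calling this once per epoch means each training instance for a given user is a different random subset, which is the source of the regularizing effect claimed above.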
5.2 Model configurations
We use the mLSTM architecture as the post-level and user-level RNN, where applicable. The flexibility of the transition function in the mLSTM has been shown to yield highly abstract features on its own and to achieve competitive results in sentiment analysis. Due to the limited number of examples, smaller models are required to avoid overfitting. We therefore set the size of the embedded representation to 20 and the size of the hidden state of both RNNs to 80. Parameter counts are shown in Table 2.
For our experiments, we reshuffle the original eRisk 2018 dataset, as the training and test sets do not have the same label proportions. To provide our models with more training examples, we split the dataset 9:1, stratifying by label. We use 10% of the training set for validation.
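The stratified 9:1 split can be sketched in a few lines of pure Python; the function and argument names are illustrative:

```python
import random

def stratified_split(users, labels, test_frac=0.1, seed=0):
    """Split users while keeping the RISK / NO_RISK proportions the same in
    both parts, as a simple stand-in for a stratified 9:1 split."""
    rng = random.Random(seed)
    groups = {}
    for user, label in zip(users, labels):
        groups.setdefault(label, []).append(user)
    train, test = [], []
    for label, members in groups.items():
        rng.shuffle(members)          # randomize within each class
        k = round(len(members) * test_frac)
        test += [(u, label) for u in members[:k]]
        train += [(u, label) for u in members[k:]]
    return train, test
```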
We train the models using the Adam optimizer. Having posited random intra-user sampling as a means of training longer, we set the training time to 30 epochs, keeping the best model on validation over all epochs. As noted, the two classes are highly imbalanced; we use inverse class weighting to counteract this.
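Inverse class weighting can be sketched as below; applied to the dataset's 135/752 split, it gives the RISK class about 5.6 times the weight of the NO_RISK class, so both classes contribute equally to the loss in expectation:

```python
def inverse_class_weights(counts):
    """Weight each class inversely to its frequency so that the minority
    RISK class contributes as much to the loss as the majority class."""
    total = sum(counts.values())
    return {label: total / (len(counts) * n) for label, n in counts.items()}

# the paper's training-set counts
weights = inverse_class_weights({"RISK": 135, "NO_RISK": 752})
```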
Given the nature of the task, which is to prioritize finding positive users, and the class imbalance in the dataset, we use the F1-score as the first metric in validation and in the final testing phase. The F1-score is useful to assess the quality of classification between two unbalanced classes, one of which is designated as the positive class. It is defined as the harmonic mean between precision (out of all examples classified as positive, how many are indeed positive) and recall (out of all positive examples, how many are correctly classified as positive). Using True Positive (TP) for the number of positive examples correctly classified, False Positive (FP) for the number of negative examples incorrectly classified as positive, and False Negative (FN) for the positive examples incorrectly classified as negative, we have:
$$\mathrm{precision} = \frac{TP}{TP + FP}, \qquad \mathrm{recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}.$$
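For concreteness, the metric can be computed directly from the confusion counts:

```python
def f1_score(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # of the predicted positives, how many are right
    recall = tp / (tp + fn)     # of the true positives, how many were found
    return 2 * precision * recall / (precision + recall)
```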
We evaluate our models on the best result obtained on a validation set composed of 10% of the training data, selected over 30 epochs.
Our preliminary results in validation are in accordance with our hypotheses. That is, continual aggregation surpasses late aggregation but falls short of the more sophisticated attention model. Moreover, the noticeable difference in performance comes at little to no cost in terms of parameter count.
In this paper, we have put forward three RNN-based models that aggregate documents to make a prediction on their author. We applied these models to the eRisk 2018 dataset, which associates a user, represented as a sequence of online forum posts, with a binary label identifying them as being at risk of depression or not.
With the goal of using RNNs to read the individual documents, we tested four methods of combining the resulting predictions: LIDA, CIDA, IDA, and IIDA. We also introduced the inter-document attention mechanism. Our preliminary results show promise and confirm the parameter efficiency of the attention mechanism.
Future work could involve the use of the dot product alone, which, despite adding no parameters, has been found to be more effective for global attention. An investigation into late attention aggregation over all hidden states produced across all documents is also warranted.
-  CLEF eRisk pilot task. http://early.irlab.org/, Accessed July 6, 2018
-  CLPsych Shared Task. http://clpsych.org/shared-task-2017/, Accessed July 6, 2018
-  Ayers, J.W., Althouse, B.M., Allem, J.P., Rosenquist, J.N., Ford, D.E.: Seasonality in seeking mental health information on Google. American Journal of Preventive Medicine (AJPM) 44(5), 520–525 (2013)
-  Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
-  Bengio, Y., Simard, P., Frasconi, P., et al.: Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5(2), 157–166 (1994)
-  Boak, A., Hamilton, H.A., Adlaf, E.M., Henderson, J.L., Mann, R.E.: The mental health and well-being of Ontario students, 1991-2017: Detailed Findings from the Ontario Student Drug Use and Health Survey. CAMH (2016)
-  Statistics Canada: Accessing Mental Health Care in Canada (2017), https://www150.statcan.gc.ca/n1/pub/11-627-m/11-627-m2017019-eng.htm
-  Cheng, J., Dong, L., Lapata, M.: Long short-term memory-networks for machine reading. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. pp. 551–561 (2016)
-  Chesney, E., Goodwin, G.M., Fazel, S.: Risks of all-cause and suicide mortality in mental disorders: a meta-review. World Psychiatry 13(2), 153–160 (2014)
-  Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv preprint arXiv:1412.3555 (2014)
-  De Choudhury, M., Gamon, M., Counts, S., Horvitz, E.: Predicting depression via social media. In: Seventh international AAAI conference on weblogs and social media (2013)
-  Graves, A., Wayne, G., Danihelka, I.: Neural Turing Machines. arXiv preprint arXiv:1410.5401 (2014)
-  Hochreiter, S., Schmidhuber, J.: Long Short-term Memory. Neural computation 9(8), 1735–1780 (1997)
-  Ive, J., Gkotsis, G., Dutta, R., Stewart, R., Velupillai, S.: Hierarchical neural model with attention mechanisms for the classification of social media text related to mental health. In: Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic. pp. 69–77 (2018)
-  Kessler, R., Berglund, P., Demler, O., et al: The epidemiology of major depressive disorder: Results from the national comorbidity survey replication (ncs-r). JAMA 289(23), 3095–3105 (2003)
-  Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
-  Krause, B., Lu, L., Murray, I., Renals, S.: Multiplicative lstm for sequence modelling. arXiv preprint arXiv:1609.07959 (2016)
-  Losada, D.E., Crestani, F., Parapar, J.: Overview of eRisk – Early Risk Prediction on the Internet. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Ninth International Conference of the CLEF Association (CLEF 2018). Avignon, France (2018)
-  Luong, M.T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015)
-  Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems. pp. 3111–3119 (2013)
-  Radford, A., Jozefowicz, R., Sutskever, I.: Learning to generate reviews and discovering sentiment. arXiv preprint arXiv:1704.01444 (2017)
-  Reddit: Reddit. https://www.reddit.com/, Accessed July 6, 2018
-  Rodrigues, S., Bokhour, B., Mueller, N., Dell, N., Osei-Bonsu, P.E., Zhao, S., Glickman, M., Eisen, S.V., Elwy, A.R.: Impact of stigma on veteran treatment seeking for depression. American Journal of Psychiatric Rehabilitation 17(2), 128–146 (2014)
-  Sutskever, I., Martens, J., Hinton, G.E.: Generating text with recurrent neural networks. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11). pp. 1017–1024 (2011)
-  Vermani, M., Marcus, M., Katzman, M.A.: Rates of detection of mood and anxiety disorders in primary care: a descriptive, cross-sectional study. The primary care companion to CNS disorders 13(2) (2011)
-  Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: Bach, F., Blei, D. (eds.) Proceedings of the 32nd International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 37, pp. 2048–2057. PMLR, Lille, France (07–09 Jul 2015), http://proceedings.mlr.press/v37/xuc15.html
-  Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., Hovy, E.: Hierarchical Attention Networks for Document Classification. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 1480–1489 (2016)