Detecting early signs of depression in the conversational domain: The role of transfer learning in low-resource scenarios

The high prevalence of depression in society has given rise to the need for new digital tools to assist in its early detection. To this end, existing research has mainly focused on detecting depression in the domain of social media, where there is a sufficient amount of data. However, with the rise of conversational agents like Siri or Alexa, the conversational domain is becoming more critical. Unfortunately, there is a lack of data in the conversational domain. We perform a study focusing on domain adaptation from social media to the conversational domain. Our approach mainly exploits the linguistic information preserved in the vector representation of text. We describe transfer learning techniques to classify users who suffer from early signs of depression with high recall. We achieve state-of-the-art results on a commonly used conversational dataset, and we highlight how the method can easily be used in conversational agents. We publicly release all source code.



1 Introduction

The World Health Organization reports that over 300 million people suffer from depression. However, there are only approximately 70 mental health professionals available for every 100,000 people in high-income nations, and this number can drop to 2 for every 100,000 in low-income countries [14]. As a result, a high percentage of the world population suffers from depression, yet only a tiny fraction has access to psychiatric care that could detect early signs of any mental illness, including depression. On the other hand, almost everyone has access to smartphones [28], and with the rise of conversational agents like Siri or Alexa [24], people are getting used to communicating with their smartphones or smart speakers on a daily basis [10]. This evolution of conversational agents allows for building mental health applications, like the virtual therapist TalkToPoppy! It also brings new challenges in recognizing early signs of mental illnesses such as depression immediately during the conversation.

Unfortunately, there is a scarcity of conversational data usable for the detection of early signs of depression. The lack of these data is due to several problems. Authors of such datasets need to collect a representative sample of data to balance positive and especially negative examples [3], and typically cross-reference data with medical records, but this process can raise ethical issues. Some of these problems are mitigated in social media, such as Reddit or Twitter. On social media platforms, we usually get access to vast amounts of self-labeled data [16]. In addition, self-stated diagnoses remove some of the overhead of annotating data but increase false-positive noise. Therefore, given the data scarcity, we focus on sequential transfer learning [23], which transfers knowledge from a related domain with a sufficient amount of data and helps mainly when the target domain has limited data.

As shown in [2], there are several indicative symptoms of depression, like a loss of interest in everyday activities, feelings of worthlessness, and also a change in the use of language [33, 17]. Because the change in language use can be detected, there are several lexicon-based approaches [21, 19] to extracting semantic features from text. These approaches suffer from a limited vocabulary size and require human annotation. This paper investigates sentence embeddings [4, 7] and whether they can capture these changes in the use of language without the need for a time-consuming design of lexicons. Furthermore, recent works focus on attention mechanisms [34], showing promising results in the social media domain [30]. We propose a novel model that combines attention mechanisms with sentence embeddings.

Our contributions can be summarized as follows:

  1. We evaluate the usage of sentence embeddings to detect changes in the use of language for the detection of early signs of depression, and their possible combination with attention mechanisms;

  2. We explore sequential transfer learning from social media to the conversational domain;

  3. We achieve state-of-the-art results on retrieving indications of early signs of depression in the conversational domain.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 introduces the data, the methodology for transfer learning, and our novel model. Section 4 describes the experimental setting, and Section 5 presents the results and proposes a possible usage.

2 Related work

Previous works on textual depression detection have mainly focused on detecting early signs of depression in the social media domain [16, 35, 32, 5], in contrast to the conversational domain, which has received less attention [11].

2.1 Early Sign of Depression Detection

Datasets - The most remarkable conversational dataset related to virtual mental health applications is The Distress Analysis Interview Corpus (DAIC) [11]. As for datasets focusing on online forums and social networks, there is a sufficient number of them. The eRisk dataset [16], RSDD dataset [35] or the dataset introduced by [32] are extracted from Reddit. Additionally, there is the CLPsych 2015 Shared Task [5] focusing on depression and PTSD on Twitter.

Depression Detection in the Conversational Domain - Several works [33, 17, 8] have focused on the DAIC-WOZ dataset [11]. In [17], the authors used the Hierarchical Attention Model [34] with a low-level representation of words based on word vector representations (GloVe [22] embeddings). They used word embeddings together with different types of the Hierarchical Attention Model to obtain a high-level representation of participant texts. Similarly, [33] uses the Hierarchical Attention Model in combination with lexical features like LIWC [21], the NRC Emotion Lexicon (Emolex) [19], and many others. In contrast to [33, 17], we perform an extensive study of transfer learning techniques, and similarly to [8], we find that proper hyperparameters are critical for training the model.

Depression Detection in Social Media - The primary studies focusing on depression detection in the social media domain are also based on lexicon-based features, as well as word embeddings [30]. Similarly to [33, 17], [30] uses the Hierarchical Attention Model with additional features. Nevertheless, [13] shows that simpler models with additional emotional and linguistic features can achieve comparable results. Because of that, we include a simple baseline model, logistic regression [20].


2.2 Transfer learning

Transfer learning is used to improve a learner in one domain by transferring information from a related domain [31]. It has been shown that if the two domains are related, transfer learning can potentially improve the results of the target learner [26]. Furthermore, [26] showed that transfer learning can improve the performance in anxiety and depression classification. They examine the performance of deep language models for pre-training the model. Similarly, the authors of [1] demonstrate the usefulness of transfer learning for depression detection from social media postings, applied to the eRisk [16] dataset. In [29], the authors show that transfer learning can be effective for improving prediction performance for disorders where little annotated data is available. They explore different transfer learning strategies for both cross-disorder transfer (across disorders) and cross-platform transfer (across different social media platforms).

3 Methodology

In this section, we introduce the datasets and metrics. We also discuss our novel chunk-based model.

3.1 Data

To study the detection of depression, we use the Distress Analysis Interview Corpus as our target dataset. The source dataset is the eRisk data, which is labeled for early signs of depression; for each user with depression, it may contain texts posted before the diagnosis or the onset of the disease. Additionally, we use the General Psychotherapy Corpus, leaving other datasets for future research. All datasets contain two categories of participants, depressed (positive) and non-depressed (negative). Each participant is linked with a sequence of sentences. The statistics for all mentioned datasets can be seen in Table 1, and a description of the datasets follows.

The Distress Analysis Interview Corpus - Wizard-of-Oz (DAIC-WOZ) is part of a larger corpus, the Distress Analysis Interview Corpus (DAIC) [11]. These interviews were collected as part of a larger effort to create a conversational agent that interviews people and identifies verbal and non-verbal indicators of mental illness [6]. The original data include audio and video recordings and extensive questionnaire responses. DAIC-WOZ includes the Wizard-of-Oz interviews, conducted by a virtual interviewer called Ellie, controlled by a human therapist in another room. The data have been transcribed and annotated for various verbal and non-verbal features.

The eRisk dataset [16] consists of data for the Early Detection of Signs of Depression Task presented at CLEF (specifically 2017). The texts were extracted from the social media platform Reddit, and the dataset uses the format described in [16].

Additionally, we apply our models to the General Psychotherapy Corpus (GPC) collected by the “Alexander Street Press” (ASP). This dataset contains over 4,000 transcribed therapy sessions, covering various clinical approaches and mental health issues. The data collection was compiled according to [33], and we chose only transcripts related to depression. This results in 147 sessions. Additionally, we randomly chose 201 sessions annotated with mental illnesses other than depression in order to use them as a control group.

| Dataset | Subset | # dialogues | # utterances | vocabulary | labels 0/1 | train/valid/test |
|---|---|---|---|---|---|---|
| DAIC-WOZ | A | 189 | 20,857 | 8,272 | 133/56 | 107/35/47 |
| DAIC-WOZ | P | 189 | 10,505 | 8,263 | 133/56 | 107/35/47 |
| eRisk | A | 1,304 | 811,586 | 322,634 | 214/1090 | 387/97/820 |
| GPC | A | 348 | 54,588 | 54,844 | 201/147 | 208/70/70 |
| GPC | P | 348 | 26,860 | 45,205 | 201/147 | 208/70/70 |

Table 1: Data statistics. A stands for all data. P stands for utterances of the participant only.

3.2 Metrics

As suggested by [33, 17], we use the Unweighted Average Recall (UAR) between the ground-truth and the predicted labels associated with each participant (see Equation 1). As shown in [33], the UAR metric remains suitable when the label distribution of the dataset is unbalanced. For completeness, we also report the Unweighted Average Precision (UAP), which is defined like UAR but with recall replaced by precision, and the macro F1 score (macro-F1).

\[
\mathrm{UAR} = \frac{1}{2}\left(\frac{TP_0}{TP_0 + FN_0} + \frac{TP_1}{TP_1 + FN_1}\right) \tag{1}
\]

where $TP_0$, $FP_0$ and $FN_0$ are the true positives, false positives and false negatives for non-depressed participants, respectively, and $TP_1$, $FP_1$ and $FN_1$ are the true positives, false positives and false negatives for depressed participants, respectively. The false-positive counts enter UAP, which replaces each $FN$ in the denominators of Equation 1 with the corresponding $FP$.
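As an illustration of the metrics in this subsection, the following minimal sketch computes UAR, UAP and macro-F1 from per-participant labels. The function names are hypothetical, and macro-F1 is assumed here to be the mean of the per-class F1 scores, which is one common definition; the text does not state which variant is used.

```python
from typing import Dict, Sequence, Tuple


def per_class_counts(y_true: Sequence[int], y_pred: Sequence[int],
                     label: int) -> Tuple[int, int, int]:
    """Return (TP, FP, FN), treating `label` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    return tp, fp, fn


def _safe_div(num: float, den: float) -> float:
    # Convention assumed for this sketch: an undefined ratio counts as 0.
    return num / den if den else 0.0


def unweighted_scores(y_true: Sequence[int],
                      y_pred: Sequence[int]) -> Dict[str, float]:
    """UAR, UAP and macro-F1 over the two classes (0 and 1)."""
    recalls, precisions, f1s = [], [], []
    for label in (0, 1):
        tp, fp, fn = per_class_counts(y_true, y_pred, label)
        recall = _safe_div(tp, tp + fn)
        precision = _safe_div(tp, tp + fp)
        recalls.append(recall)
        precisions.append(precision)
        f1s.append(_safe_div(2 * precision * recall, precision + recall))
    return {
        "uar": sum(recalls) / 2,
        "uap": sum(precisions) / 2,
        "macro_f1": sum(f1s) / 2,
    }
```

Because each class contributes equally to the average regardless of its size, a classifier that always predicts the majority class obtains a UAR of only 0.5.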

3.3 Chunk-based classification

Since a natural conversation is a potentially unbounded sequence of utterances, our proposed chunk-based model operates on a sliding window, as shown in Figure 1. We classify each chunk of the conversation with a binary label (zero for chunks corresponding to non-depressed participants and one for depressed participants) and sum up all the obtained classifications. Then we divide the sum by the number of chunks to normalize for different conversation lengths. To obtain a prediction for an entire conversation, we apply a threshold to the resulting ratio of positive chunk labels. More concretely, a conversation $C$ is built up iteratively, turn by turn. Each conversation is composed of a set of utterances $u_i$, where each utterance is composed of one or more sentences $s_{i,j}$, as shown in Equation 2.

\[
C = \{u_1, u_2, \dots, u_n\}, \qquad u_i = \{s_{i,1}, s_{i,2}, \dots, s_{i,m_i}\} \tag{2}
\]

Further, in order to allow for the iterative evaluation of the conversation in a real-time setting, we perform classification using a sliding window at the chunk level. Each chunk is labeled according to the label of the conversation from which the chunk is derived. These chunks are overlapping, as shown in Figure 1. All chunks of the same length (shown for a length of three in Equation 3) share the label $y_C$ of the conversation $C$.

\[
ch_1 = (u_1, u_2, u_3),\quad ch_2 = (u_2, u_3, u_4),\quad \dots,\quad ch_{n-2} = (u_{n-2}, u_{n-1}, u_n), \qquad y_{ch_k} = y_C \tag{3}
\]

First, we train the model to classify each chunk of the conversation into a binary label $\hat{y}_{ch_k} \in \{0, 1\}$. After training, we classify all chunks obtained from the validation set and search for the best threshold $t$ for distinguishing conversations of depressed participants from those of non-depressed ones. The best value of the threshold is chosen based on the accuracy over the whole validation set. The expression describing the classification of a conversation is shown in Equation 4.

\[
\hat{y}_C =
\begin{cases}
1 & \text{if } \frac{1}{K} \sum_{k=1}^{K} \hat{y}_{ch_k} \ge t, \\
0 & \text{otherwise,}
\end{cases} \tag{4}
\]

where $K$ is the number of chunks in the conversation $C$.
A smaller chunk size allows us to make predictions earlier and to refine them gradually as new participant utterances occur. However, a smaller chunk size also leads to a loss of context information for each individual classification.

Figure 1: The conversation is divided into sliding window chunks. Each chunk is classified independently from the others. The positive/negative labels ratio is used to determine the best threshold.
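The chunk-based procedure described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `classify_chunk` is a hypothetical stand-in for the trained chunk classifier, chunks advance by one utterance at a time, and the candidate thresholds form an assumed grid from 0.00 to 1.00.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Chunk = Tuple[str, ...]


def sliding_chunks(utterances: Sequence[str], size: int) -> List[Chunk]:
    """Overlapping chunks of `size` consecutive utterances (stride of one)."""
    if len(utterances) < size:
        return []
    return [tuple(utterances[i:i + size])
            for i in range(len(utterances) - size + 1)]


def positive_ratio(utterances: Sequence[str], size: int,
                   classify_chunk: Callable[[Chunk], int]) -> float:
    """Fraction of chunks labeled 1, normalizing for conversation length."""
    chunks = sliding_chunks(utterances, size)
    if not chunks:
        return 0.0
    return sum(classify_chunk(chunk) for chunk in chunks) / len(chunks)


def best_threshold(ratios: Sequence[float], labels: Sequence[int],
                   candidates: Optional[Sequence[float]] = None) -> float:
    """Threshold with the highest conversation-level validation accuracy."""
    if candidates is None:
        candidates = [i / 100 for i in range(101)]  # assumed search grid

    def accuracy(threshold: float) -> float:
        hits = sum(int(ratio >= threshold) == label
                   for ratio, label in zip(ratios, labels))
        return hits / len(labels)

    # On ties, max() keeps the first (lowest) candidate threshold.
    return max(candidates, key=accuracy)


def classify_conversation(ratio: float, threshold: float) -> int:
    """Label a whole conversation from its ratio of positive chunks."""
    return int(ratio >= threshold)
```

In this sketch the threshold is fitted once on validation conversations and then reused unchanged at test time, which mirrors the validation-based search described above.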

4 Experimental setting

This section describes the setting of the suggested models and discusses their usability. We measure the performance of our model on the DAIC-WOZ dataset. The first setting is without transfer learning. We then measure the influence of transfer learning as a way to improve the performance of our model on the DAIC-WOZ dataset, using the eRisk and GPC datasets as source domains.

In our experiments, we use a recurrent neural network as the model for each chunk, concretely Long Short-Term Memory (LSTM) [12], which is commonly used for sequence labeling in the conversational domain [24]. As input to the LSTM model, we use sentence embeddings produced either by a multi-head self-attention Transformer architecture [7, 25] or by a Deep Average Architecture [4]. Specifically, the input to the model consists of sentence embeddings obtained from the pooled output of the fine-tuned BERT [25] (sBERT), the output of the Universal Sentence Encoder - Deep Average Network [4] (DAN), or the output of the Universal Sentence Encoder - Transformer based [4] (USE-Transformer). Additionally, as in [33, 17, 30], we evaluate an attention mechanism. We use two settings: the Hierarchical Attention Network (HAN) based on GloVe embeddings as in [30], and pure attention based on a dot-product of the hidden states of the LSTM and learned attention weights.

We follow the evaluation process described in [17]. However, in contrast to [33, 17], we perform Bayesian hyperparameter optimization [27]. The reported results are with the best performing setting of hyperparameters.

We follow the common setup of transfer learning using the fine-tuning approach in a cross-domain setting [26]. More specifically, we train the model on source domain data until convergence. The learning rate is then reduced in order to avoid catastrophic forgetting [18], and the training continues in the target domain. The decrease of the learning rate reduces the overwriting of useful pretrained information and maximizes positive transfer. The weights of the sentence embedding model are frozen.
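The two-stage schedule described above can be illustrated with a deliberately small sketch. It swaps the bidirectional LSTM for a plain logistic classifier trained by stochastic gradient descent so that it runs with the standard library only; the feature vectors stand in for frozen sentence embeddings, a fixed number of epochs stands in for training until convergence, and the learning rates and the reduction factor are assumed values, not the ones used in the experiments.

```python
import math
import random
from typing import List, Sequence, Tuple

# (frozen embedding features, label); the embeddings themselves are never updated.
Example = Tuple[Sequence[float], int]


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(weights: Sequence[float], data: Sequence[Example]) -> float:
    """Mean binary cross-entropy of a linear classifier on `data`."""
    total = 0.0
    for x, y in data:
        p = _sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(data)


def train(weights: Sequence[float], data: Sequence[Example],
          lr: float, epochs: int, seed: int = 0) -> List[float]:
    """Run SGD starting from `weights` and return the updated copy."""
    rng = random.Random(seed)
    w = list(weights)
    examples = list(data)
    for _ in range(epochs):
        rng.shuffle(examples)
        for x, y in examples:
            error = _sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for i, xi in enumerate(x):
                w[i] -= lr * error * xi
    return w


def sequential_transfer(source: Sequence[Example], target: Sequence[Example],
                        dim: int, lr: float = 0.1,
                        lr_reduction: float = 0.1,
                        epochs: int = 20) -> List[float]:
    """Pre-train on the source domain, then fine-tune on the target domain."""
    pretrained = train([0.0] * dim, source, lr, epochs)
    # A smaller step size limits how far fine-tuning can move away from the
    # pretrained weights, which is the motivation given for reducing it.
    return train(pretrained, target, lr * lr_reduction, epochs)
```

Setting `lr_reduction` to 1.0 in this sketch recovers plain continued training, which makes it easy to compare against the reduced-rate schedule on a given pair of datasets.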

5 Results and Ablation Experiments

To demonstrate the importance of the vocabulary size, we focus on the performance of the logistic regression model using different vocabulary sizes. We therefore include logistic regression [20] over bag-of-words vectors [36] as the baseline model. We extracted several vocabularies: one based on the utterances of participants in DAIC-WOZ (3k, 3000 words), one based on the utterances of participants and therapists in DAIC-WOZ (6k, 6000 words), and one based on posts in the eRisk dataset (20k, 20000 words). The best results were achieved with the 3k vocabulary, consisting of the 3000 most used words based only on the participants' utterances: UAR (0.583), UAP (0.603) and macro-F1 (0.593), in contrast to UAR (0.579), UAP (0.561) and macro-F1 (0.570) for the 6k vocabulary, or UAR (0.583), UAP (0.580) and macro-F1 (0.581) for the 20k vocabulary. We infer that a more extensive vocabulary probably introduces additional noise for the classification model. We use the best performing vocabulary for the rest of the experiments when working with logistic regression.
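To make the baseline's input concrete, the sketch below builds a vocabulary of the k most frequent words and turns a text into a bag-of-words count vector. Lowercasing and whitespace tokenization are assumptions made for brevity; the preprocessing actually used for the baseline is not specified here, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def build_vocabulary(utterances: Iterable[str], k: int) -> List[str]:
    """The k most frequent lowercased, whitespace-separated tokens."""
    counts = Counter(token
                     for utterance in utterances
                     for token in utterance.lower().split())
    return [word for word, _ in counts.most_common(k)]


def bag_of_words(text: str, vocabulary: Sequence[str]) -> List[int]:
    """Count vector of `text` over `vocabulary`; unknown words are ignored."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]
```

Restricting `utterances` to the participants' turns corresponds to the 3k setting, while also passing the interviewer's turns corresponds to the 6k setting.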

To confirm the difference between the vocabulary of depressed and non-depressed patients, we then examine the weights learned by the logistic regression model. We look at the logistic regression weights that are largest in absolute value, both positive and negative, and map the weights to the vocabulary words. The results indicate that words like environment (-7.5), open-minded (-6.3), or accomplish (-4.7) correspond to a non-depressed patient. In contrast, insignificant (+5.36), television (+5.66), or pollution (+6.1) relate to a depressed patient. This confirms the results reported in [33, 17], showing the difference between the depressed and non-depressed groups in terms of the use of language, more specifically, at the level of word usage.
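The inspection step described above amounts to pairing each weight with its vocabulary word and sorting by absolute value. A minimal sketch follows, using only the six weights quoted in the text as example data; the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def top_weighted_words(weights: Dict[str, float],
                       k: int) -> List[Tuple[str, float]]:
    """The k words whose weights are largest in absolute value."""
    ranked = sorted(weights.items(), key=lambda item: abs(item[1]), reverse=True)
    return ranked[:k]


# Weights quoted in the text; negative values point to the non-depressed
# class and positive values to the depressed class.
reported = {
    "environment": -7.5,
    "open-minded": -6.3,
    "accomplish": -4.7,
    "insignificant": 5.36,
    "television": 5.66,
    "pollution": 6.1,
}

strongest = top_weighted_words(reported, 3)
non_depressed_cues = [word for word, weight in strongest if weight < 0]
depressed_cues = [word for word, weight in strongest if weight > 0]
```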

Finally, we test the performance of the classification models proposed in Section 3.3. Our results, shown in Table 2, demonstrate the high performance of the transfer learning approach. We achieve a new state-of-the-art result, specifically using the chunk-based model based on a bidirectional LSTM over sentence embeddings. The input to this model was based on the Universal Sentence Encoder - Transformer (USE-Transformer).

Transfer learning achieved a notable outcome when using data from the social media domain. We assume that this was caused by the implicit ability of sentence embeddings to capture different language characteristics [15]. We discuss these results in more depth in the ablation experiments later in this section. According to our results, attention is not beneficial for chunk-based classification. We assume this is caused by the sliding window chunk-based classification, where the attention mechanism is not fully utilized.

| Model | Unweighted Average Recall |
|---|---|
| HCAN [17] | 0.54 |
| HLGAN [17] | 0.60 |
| HAN [33] | 0.54 |
| HAN + L [33] | 0.72 |

| Model | DAIC-WOZ | eRisk/GPC without fine-tuning | eRisk/GPC with fine-tuning |
|---|---|---|---|
| LR + unigrams 3k | 0.553 | 0.559 / 0.547 | 0.613 / 0.553 |
| HAN + GloVe | 0.541 | 0.511 / 0.535 | 0.529 / 0.613 |
| Chunk-biLSTM + DAN | 0.595 | 0.559 / 0.470 | 0.625 / 0.541 |
| Chunk-biLSTM + DAN + att | 0.666 | 0.630 / 0.333 | 0.676 / 0.494 |
| Chunk-biLSTM + USE-Transformer | 0.660 | 0.651 / 0.440 | 0.803 / 0.690 |
| Chunk-biLSTM + USE-Transformer + att | 0.529 | 0.541 / 0.589 | 0.613 / 0.595 |
| Chunk-biLSTM + sBERT | 0.440 | 0.505 / 0.523 | 0.613 / 0.541 |
| Chunk-biLSTM + sBERT + att | 0.442 | 0.5 / 0.523 | 0.636 / 0.577 |

Table 2: Results - LR stands for Logistic Regression, HAN - Hierarchical Attention Model, Chunk-biLSTM - our Chunk-based model based on bidirectional LSTM, DAN - sentence embeddings based on Deep Average Network, USE-Transformer - sentence embeddings based on Transformer trained by [4], sBERT - sentence embeddings based on Transformer trained by [25], att stands for the attention mechanism.

Also, we present various ablation experiments to provide some interpretations of our findings.

Are sentence embeddings able to encode information present in lexicon-based features? We are interested in verifying whether lexicon-based features are helpful to our classifiers or if the sentence embeddings already encode the information provided by the lexicons. To test this assumption, we performed another experiment using the best-performing architecture, where we added linguistic characteristics (emotions and LIWC) as input features along with sentence embeddings. Results, shown in Table 3, indicate that there is no improvement in using linguistic characteristics. Therefore, we conclude that the sentence embeddings already include linguistic characteristics needed for detection.

| Model | DAIC-WOZ | eRisk/GPC without fine-tuning | eRisk/GPC with fine-tuning |
|---|---|---|---|
| Chunk-biLSTM + USE-Transformer | 0.660 | 0.651 / 0.440 | 0.803 / 0.690 |
| Chunk-biLSTM + USE-Transformer + feat | 0.565 | 0.541 / 0.410 | 0.597 / 0.511 |

Table 3: Results - Chunk-biLSTM - our Chunk-based model based on bidirectional LSTM, USE-Transformer - sentence embeddings based on Transformer trained by [4], feat stands for additional features.

Is the size of the source domain dataset more critical than domain relatedness? Results in Table 2 show that transfer learning can help improve the classification performance on the conversational dataset. At the same time, we find a surprising result: using GPC as the source domain underperforms the setting in which eRisk data is used as the source domain, even though GPC is more similar to our target dataset (they are both conversational datasets). We assume that the smaller data size causes the poor performance when using the GPC dataset. To test this hypothesis, we evaluate a smaller version of eRisk (eRisk small), with 66,516 utterances and 388/96/820 participants (train/valid/test). The new smaller dataset is closer to the GPC dataset with respect to size. Results in Table 4 suggest that the size of the source domain dataset is as important as domain closeness.

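| Source dataset | Model | DAIC-WOZ | without fine-tuning | with fine-tuning |
|---|---|---|---|---|
| eRisk | Chunk-biLSTM + USE-Transformer | 0.660 | 0.651 | 0.803 |
| eRisk small | Chunk-biLSTM + USE-Transformer | 0.660 | 0.642 | 0.690 |

Table 4: Results - Chunk-biLSTM - our Chunk-based model based on bidirectional LSTM, USE-Transformer - sentence embeddings based on Transformer trained by [4]. The first row reports results on eRisk and the second row results on eRisk small.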

5.1 Usage

Because our approach is based on sliding window chunks, we are able to perform a real-time evaluation of the conversation as soon as the number of utterances reaches the size of the sliding window chunk. Our experiments are performed with a chunk size of 50. This allows for including the classification model as another part of the Natural Language Understanding (NLU) unit commonly used in conversational agents [24, 9]. As opposed to other models proposed in the literature, such as [33, 17, 30], our suggested model is independent of external lexicon-based features, and therefore it can be run in parallel with other NLU units.
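A minimal sketch of how such a real-time evaluation could be wired into a conversational agent is given below. The class name and the `classify_chunk` callable are hypothetical, and a fixed-size deque is only one possible way to keep the most recent utterances; the default window of 50 follows the chunk size stated above.

```python
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class StreamingChunkMonitor:
    """Scores a conversation incrementally as new utterances arrive."""

    def __init__(self, classify_chunk: Callable[[Tuple[str, ...]], int],
                 chunk_size: int = 50) -> None:
        self._classify = classify_chunk
        self._chunk_size = chunk_size
        # With maxlen set, the deque drops the oldest utterance automatically.
        self._window: Deque[str] = deque(maxlen=chunk_size)
        self._positive_chunks = 0
        self._total_chunks = 0

    def add_utterance(self, utterance: str) -> Optional[float]:
        """Return the running ratio of positive chunks, or None until the
        window holds a full chunk."""
        self._window.append(utterance)
        if len(self._window) < self._chunk_size:
            return None
        self._positive_chunks += self._classify(tuple(self._window))
        self._total_chunks += 1
        return self._positive_chunks / self._total_chunks
```

The returned ratio would then be compared against the threshold selected on validation data, so the agent can update its estimate after every new participant utterance.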

6 Conclusion and Future Work

In this paper, we addressed the problem of detecting early signs of depression in the conversational domain. We achieve state-of-the-art results on the DAIC-WOZ dataset using transfer learning from the social media domain. The proposed model is based on classifying a sequence of chunks and uses a recurrent neural network with sentence embeddings as input features. Additionally, we show that the attention mechanism is not beneficial for our chunk-based model. We show that transfer learning helps improve the performance in a domain with a lack of data by utilizing data from a related domain. We also demonstrate that the size of the source dataset is as important as the domain relatedness between the source and the target. Finally, we suggest a possible usage of our model as a tool for therapists, who may retrieve early signs of depression from a broad range of conversational systems.

6.1 Ethical concern

This paper showed a possible usage of automatic techniques for detecting early signs of depression. Unfortunately, false positive and false negative cases can cause tremendous damage when such techniques are used in a conversational agent. We claim that if our model or proposed techniques are used in real-life scenarios, they have to be supervised by a qualified therapist. Also, as we have shown, domain adaptation plays a crucial role. Our approach, even if quite general, has to be carefully adapted to a specific domain. We also suggest not relying on the classification system only, but using it as another source of information for a qualified therapist. With this setting, we can minimize possible harm and allow the therapist to speed up their work.

6.1.1 Acknowledgments.

The research work of Petr Lorenc and Jan Šedivý was partially supported by the Grant Agency of the Czech Technical University in Prague, grant (SGS22/082/OHK3/1T/37). The research work of Paolo Rosso was partially funded by the Generalitat Valenciana under DeepPattern (PROMETEO/2019/121). The work of Ana Sabina Uban was carried out at the PRHLT Research Center during her postdoc internship. Her work was also partially funded by a grant from Innovation Norway, project "Virtual simulated platform for automated coaching-testing", Ref No 2021/331382.


  • [1] P. Abed-Esfahani, D. Howard, M. Maslej, S. Patel, V. Mann, S. Goegan, and L. French (2019) Transfer learning for depression: early detection and severity prediction from social media postings. In CLEF (Working Notes), Vol. 1, pp. 1–6.
  • [2] American Psychiatric Association (2013) Diagnostic and statistical manual of mental disorders. 5th edition, Author, Washington, DC.
  • [3] G. Batista, R. Prati, and M. Monard (2004) A study of the behavior of several methods for balancing machine learning training data. In SIGKDD Explorations, Vol. 6, pp. 20–29.
  • [4] D. M. Cer, Y. Yang, S. Kong, N. Hua, N. Limtiaco, R. St. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y. Sung, B. Strope, and R. Kurzweil (2018) Universal sentence encoder. In ArXiv.
  • [5] G. Coppersmith, M. Dredze, C. Harman, K. Hollingshead, and M. Mitchell (2015-06) CLPsych 2015 shared task: depression and PTSD on Twitter. In Proceedings of the 2nd CLPsych, Denver, Colorado, pp. 31–39.
  • [6] D. DeVault, K. Georgila, R. Artstein, F. Morbini, D. Traum, S. Scherer, A. S. Rizzo, and L. Morency (2013) Verbal indicators of psychological distress in interactive dialogue with a virtual human. In SIGDIAL 2013 Conference, pp. 193–202.
  • [7] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL.
  • [8] H. Dinkel, M. Wu, and K. Yu (2019) Text-based depression detection: what triggers an alert. In ArXiv.
  • [9] S. E. Finch, J. D. Finch, A. Ahmadvand, J. I. Choi, X. Dong, R. Qi, H. Sahijwani, S. Volokhin, Z. Wang, Z. Wang, and J. D. Choi (2020) Emora: an inquisitive social chatbot who cares for you. In Alexa Prize Proceedings, Vol. 3.
  • [10] R. Gabriel, Y. Liu, A. Gottardi, M. Eric, A. Khatri, A. Chadha, Q. Chen, B. Hedayatnia, P. Rajan, A. Binici, S. Hu, K. Gopalakrishnan, S. Kim, L. Stubel, K. Bland, A. Mandal, and D. Z. Hakkani-Tür (2020) Further advances in open domain dialog systems in the third alexa prize socialbot grand challenge. In Alexa Prize Proceedings, Vol. 3.
  • [11] J. Gratch, R. Artstein, G. Lucas, G. Stratou, S. Scherer, A. Nazarian, R. Wood, J. Boberg, D. DeVault, S. Marsella, D. Traum, S. Rizzo, and L. Morency (2014) The distress analysis interview corpus of human and computer interviews. In LREC 2014, pp. 3123–3128.
  • [12] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. In Neural computation, Vol. 9, pp. 1735–80.
  • [13] M. R. Islam, A. Kabir, A. Ahmed, A. Kamal, H. Wang, and A. Ulhaq (2018) Depression detection from social network data using machine learning techniques. In Health Information Science and Systems, Vol. 6, pp. 8.
  • [14] M. Lee, S.C.A. Ackermans, N. van As, H. Chang, E. Lucas, and W. IJsselsteijn (2019) Caring for vincent: a chatbot for self-compassion. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems.
  • [15] N. F. Liu, M. Gardner, Y. Belinkov, M. E. Peters, and N. A. Smith (2019) Linguistic knowledge and transferability of contextual representations. In NAACL-HLT.
  • [16] D. E. Losada and F. Crestani (2016) A test collection for research on depression and language use. In Conference Labs of the Evaluation Forum.
  • [17] A. Mallol-Ragolta, Z. Zhao, L. Stappen, N. Cummins, and B. Schuller (2019) A hierarchical attention network-based approach for depression detection from transcribed clinical interviews. In Interspeech, pp. 221–225.
  • [18] M. McCloskey and N. J. Cohen (1989) Catastrophic interference in connectionist networks: the sequential learning problem. In Psychology of Learning and Motivation, Vol. 24, pp. 109–165.
  • [19] S. Mohammad and P. Turney (2013) Crowdsourcing a word-emotion association lexicon. In Computational Intelligence, Vol. 29.
  • [20] J. Peng, K. Lee, and G. Ingersoll (2002) An introduction to logistic regression analysis and reporting. In Journal of Educational Research, Vol. 96, pp. 3–14.
  • [21] J. W. Pennebaker, M. E. Francis, and R. J. Booth (2001) Linguistic inquiry and word count. In Lawrence Erlbaum Associates, Vol. 71.
  • [22] J. Pennington, R. Socher, and C. Manning (2014) Glove: global vectors for word representation. In EMNLP, Vol. 14, pp. 1532–1543.
  • [23] J. Phang, T. Févry, and S. R. Bowman (2018) Sentence encoders on stilts: supplementary training on intermediate labeled-data tasks. In arXiv.
  • [24] J. Pichl, P. Marek, J. Konrád, P. Lorenc, V. D. Ta, and J. Sedivý (2020) Alquist 3.0: alexa prize bot using conversational knowledge graph. In Alexa Prize Proceedings, Vol. 3. arXiv:2011.03261.
  • [25] N. Reimers and I. Gurevych (2019-11) Sentence-bert: sentence embeddings using siamese bert-networks. In EMNLP.
  • [26] T. Rutowski, E. Shriberg, A. Harati, Y. Lu, P. Chlebek, and R. Oliveira (2020) Depression and anxiety prediction using deep language models and transfer learning. In 7th BESC, Vol. 1, pp. 1–6.
  • [27] J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, Vol. 25.
  • [28] E. Tsetsi and S. Rains (2017) Smartphone internet access and use: extending the digital divide and usage gap. In Mobile Media and Comm., Vol. 5, pp. 205015791770832.
  • [29] A. S. Uban, B. Chulvi, and P. Rosso (2022, to appear) Multi-aspect transfer learning for detecting low resource mental disorders on social media. In Proceedings of the 13th Language Resources and Evaluation Conference.
  • [30] A. Uban, B. Chulvi, and P. Rosso (2021) An emotion and cognitive based analysis of mental health disorders from social media data. In Future Generation Computer Systems, Vol. 124, pp. 480–494.
  • [31] K. R. Weiss, T. Khoshgoftaar, and D. Wang (2016) A survey of transfer learning. In Journal of Big Data, Vol. 3, pp. 1–40.
  • [32] J. Wolohan, M. Hiraga, A. Mukherjee, Z. A. Sayyed, and M. Millard (2018-08) Detecting linguistic traces of depression in topic-restricted text: attending to self-stigmatized depression with nlp. In Language Cognition and Computational Models, Santa Fe, New Mexico, USA, pp. 11–21.
  • [33] D. Xezonaki, G. Paraskevopoulos, A. Potamianos, and S. Narayanan (2020) Affective conditioning on hierarchical attention networks applied to depression detection from transcribed clinical interviews. In Interspeech, pp. 4556–4560.
  • [34] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy (2016) Hierarchical attention networks for document classification. In Proceedings of the 2016 (NAACL-HLT), pp. 1480–1489.
  • [35] A. Yates, A. Cohan, and N. Goharian (2017-09) Depression and self-harm risk assessment in online forums. In EMNLP, Copenhagen, Denmark, pp. 2968–2978.
  • [36] Y. Zhang, R. Jin, and Z. Zhou (2010) Understanding bag-of-words model: a statistical framework. In International Journal of Machine Learning and Cybernetics, Vol. 1, pp. 43–52.