Weakly-Supervised Methods for Suicide Risk Assessment: Role of Related Domains

by Chenghao Yang, et al.
Columbia University

Social media has become a valuable resource for the study of suicidal ideation and the assessment of suicide risk. Among social media platforms, Reddit has emerged as the most promising one due to its anonymity and its focus on topic-based communities (subreddits) that can be indicative of someone's state of mind or interest regarding mental health disorders such as r/SuicideWatch, r/Anxiety, r/depression. A challenge for previous work on suicide risk assessment has been the small amount of labeled data. We propose an empirical investigation into several classes of weakly-supervised approaches, and show that using pseudo-labeling based on related issues around mental health (e.g., anxiety, depression) helps improve model performance for suicide risk assessment.




1 Introduction

Suicide has been identified as one of the leading causes of death and approximately of people die by suicide every year WHO and others (2016); Fazel and Runeson (2020). Despite years of clinical research on suicide, researchers have concluded that suicide cannot be predicted using the standard clinical practice of asking patients about their suicidal thoughts McHugh et al. (2019). Recently, Coppersmith et al. (2018) and Nock et al. (2019) discussed the opportunities of using social media combined with natural language processing (NLP) techniques to complement traditional clinical records and help in suicide risk analysis and early suicide intervention.

To facilitate further research on automatic suicide risk assessment, Zirikly et al. (2019) proposed a shared task, for which they collected user data from the r/SuicideWatch subreddit and annotated it with user-level suicide risk: no-risk, low-risk, medium-risk and high-risk. By comparing the results of the participating teams in this shared task, Zirikly et al. (2019) conclude that one of the major challenges comes from the insufficient data for intermediate suicide risk levels (i.e., low risk and medium risk) rather than extreme risk levels (i.e., no risk and high risk). Matero et al. (2019) find that a dual BERT-LSTM-Attention model that separately extracts information from SuicideWatch and non-SuicideWatch posts, together with feature engineering that includes emotion features, personality scores, and the user’s anxiety and depression scores, is important for model performance.

In this paper, instead of feature engineering or complex model architectures, we explore whether weakly supervised methods and data augmentation techniques based on clinical psychology research can help improve model performance. We explore several weakly-supervised methods, and show that a simple approach based on insights from clinical psychology research O’Connor and Nock (2014) obtains the best performance. This model uses pseudo-labeling (PL) on data from the subreddits r/Anxiety and r/depression, which cover anxiety and depression, two conditions considered important risk factors for suicide. We also present a potential application of our model for studying the suicide risk among people who use drugs, opening the door to using NLP methods to deepen our understanding of the relationship between opioid use disorder (OUD) and mental health. The code for this paper can be found at https://github.com/yangalan123/WM-SRA.

2 Methods

We focus on Task A from the CLPsych 2019 shared task “Predicting the Degree of Suicide Risk in Reddit Posts” Zirikly et al. (2019). The goal of the task is to predict the user-level suicide risk category based on the user's posts in the r/SuicideWatch subreddit. Specifically, a user is associated with a collection of posts, where each post consists of a sequence of sentences. We need to predict the user's risk label from this collection of posts, where the labels a, b, c and d represent no-risk, low-risk, medium-risk and high-risk, respectively. In the original dataset, there are users in the training set and users in the test sets. We further split users from the training set to create the validation set. The sizes for the train/valid/test sets are 746, 173, and 186 respectively.

Data Pre-processing

Following the advice in Zirikly et al. (2019), we replace all human names and URLs in the Reddit posts with the special tokens ”_PERSON_” and ”_URL_”, respectively. We also lowercase the text and remove punctuation and stop words. Due to the limitation of GPU memory, we split large posts into passages of at most a fixed maximum number of words and make sure that no split point falls in the middle of a sentence. (The maximum passage length is tuned on the validation set for both GPU memory and better computational efficiency for large posts; we do not observe a significant performance drop without a larger passage length.) To do so, we use a limited-size stack and greedily add each sentence to the stack. If adding a new sentence would make the total length of the sentences in the stack exceed the maximum, we pop all sentences, concatenate them into a new passage, and then add the new sentence to the stack. Sentences longer than the maximum are treated as individual posts. The resulting passages are treated as separate posts.
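
The greedy passage-splitting procedure described above can be sketched as follows. This is a minimal illustration and not the authors' released code: the function name `split_into_passages`, the whitespace word count, and the parameter `max_words` (whose tuned value is not given here) are assumptions, as is the choice to flush the buffered sentences before emitting an over-long sentence so that the original order is kept.

```python
from typing import List


def split_into_passages(sentences: List[str], max_words: int) -> List[str]:
    """Greedily pack whole sentences into passages of at most `max_words` words."""
    passages: List[str] = []
    stack: List[str] = []  # sentences buffered for the current passage
    stack_len = 0          # total word count of the buffered sentences

    for sentence in sentences:
        n_words = len(sentence.split())
        if n_words > max_words:
            # Over-long sentence: flush the buffer, then emit the sentence alone.
            if stack:
                passages.append(" ".join(stack))
                stack, stack_len = [], 0
            passages.append(sentence)
            continue
        if stack_len + n_words > max_words:
            # Adding this sentence would overflow: close the current passage.
            passages.append(" ".join(stack))
            stack, stack_len = [], 0
        stack.append(sentence)
        stack_len += n_words

    if stack:
        passages.append(" ".join(stack))
    return passages


post = ["i feel tired", "nothing helps anymore", "i talked to a friend today", "ok"]
print(split_into_passages(post, max_words=8))
# ['i feel tired nothing helps anymore', 'i talked to a friend today ok']
```

Because sentences are only ever appended whole, no passage boundary can fall inside a sentence, which is the property the pre-processing step requires.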

Model Architecture

Our architecture is a BERT Devlin et al. (2019) model. We also experimented with other state-of-the-art pre-trained language models (PLMs), including RoBERTa Liu et al. (2019) and XLNet Yang et al. (2019), but found BERT to work the best and thus consider it as our baseline architecture (more details can be found in Appendix A). Each post is fed into BERT Devlin et al. (2019) to obtain a post embedding. We then apply simple mean-pooling over the post embeddings to obtain the user embedding. Finally, we feed the user embedding to a fully-connected layer and use a softmax layer to predict the risk-level probabilities. The label with the largest probability is picked as the final prediction. For training, the cross-entropy loss is used to optimize our model.
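
The classification head described above (mean-pooling of post embeddings, a fully-connected layer, softmax, and a cross-entropy loss) can be sketched in plain Python. The post embeddings are taken as given (in the paper they come from BERT); the function names, the toy 3-dimensional embeddings, and the weight values are illustrative assumptions, not the trained model.

```python
import math
from typing import List, Sequence, Tuple

LABELS = ["a", "b", "c", "d"]  # no-risk, low-risk, medium-risk, high-risk


def mean_pool(post_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average the post embeddings dimension-wise to get one user embedding."""
    n = len(post_embeddings)
    return [sum(column) / n for column in zip(*post_embeddings)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(post_embeddings, weights, bias) -> Tuple[str, List[float]]:
    """Fully-connected layer + softmax on top of the mean-pooled user embedding."""
    user = mean_pool(post_embeddings)
    logits = [sum(w * x for w, x in zip(row, user)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs


def cross_entropy(probs: Sequence[float], gold_index: int) -> float:
    """Training loss for a single user."""
    return -math.log(probs[gold_index])


# Toy example: two posts with hypothetical 3-dimensional embeddings.
posts = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
weights = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1], [1.0, 1.0, 1.0]]
bias = [0.0, 0.0, 0.0, 0.0]
label, probs = predict(posts, weights, bias)
print(label, round(cross_entropy(probs, LABELS.index("d")), 4))
```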

2.1 Weakly-supervised Methods

Task-Adaptive Pre-training

Recent works Lee et al. (2020); Gururangan et al. (2020) point out that task-adaptive pre-training (TAP) can help pre-trained language models better adapt to the target domains and can bring improvement, especially in data-poor scenarios. Specifically, we continue pre-training (e.g., masked language modeling for BERT) on a task-relevant unlabeled corpus and then do normal fine-tuning on the task. Our unlabeled corpus consists of all r/SuicideWatch posts (aggregated per user) from the training sets of all the tasks (A, B, C) in the shared task Zirikly et al. (2019). There are users and posts in this unlabeled corpus. We do continued pre-training for to epochs and do early stopping.
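
To make the continued pre-training objective concrete, the sketch below shows the input-corruption step of masked language modeling as defined in the original BERT recipe: roughly 15% of positions are selected, of which 80% are replaced by `[MASK]`, 10% by a random token, and 10% are left unchanged. Whether TAP here used exactly these rates is not stated, so they are assumptions; `mlm_corrupt` and `vocab` are illustrative names.

```python
import random
from typing import List, Optional, Tuple


def mlm_corrupt(tokens: List[str], vocab: List[str], mask_prob: float = 0.15,
                rng: Optional[random.Random] = None
                ) -> Tuple[List[str], List[Optional[str]]]:
    """Return (corrupted inputs, targets); targets are None at unselected positions."""
    rng = rng or random.Random()
    inputs = list(tokens)
    targets: List[Optional[str]] = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue                        # position not selected for prediction
        targets[i] = token                  # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = "[MASK]"            # 80%: replace with the mask token
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token
    return inputs, targets


tokens = "i have not slept in days".split()
corrupted, targets = mlm_corrupt(tokens, vocab=tokens, rng=random.Random(0))
print(corrupted)
print(targets)
```

The loss is computed only at positions where the target is not `None`; after this continued pre-training, the model is fine-tuned on the labeled task as usual.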

Multi-view Learning

Multi-view learning (MVL) Xu et al. (2013) is a widely recognized family of semi-supervised methods. Clark et al. (2018) provide a successful example of using MVL in sequence labeling tasks. The idea is to create perturbations by masking words in certain positions and, in addition to the normal classification training, to require the model to predict similar distributions for the complete labeled examples and the corresponding masked examples. However, since ours is a user-level classification task, we cannot directly borrow the strategy of Clark et al. (2018), which mainly targets sequence labeling. We propose to create perturbations based on four strategies (the masking proportions for Word-Mask and Sent-Mask are tuned empirically on the validation set). First, for each sentence, we randomly mask a fixed proportion of tokens (Word-Mask). Second, considering that users may have posts of many words, we also propose a sentence-level masking strategy (Sent-Mask): for each post of a single user in the training set, we randomly mask a fixed proportion of tokens. Third, we only keep the beginning and ending sentences in each passage (BegEd). Usually these sentences convey the main purpose of the posts and should preserve important semantics. Fourth, we use Bert-extractive-summarizer Miller (2019) to extract a summary for each passage (K-Sum). It works by first encoding each sentence with a PLM into a continuous-valued representation and then running K-means clustering over these representations. Finally, it picks the sentences of each passage that are closest to the cluster centers; the number of sentences picked is set empirically.
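
Three of the four perturbation strategies can be sketched with a few lines of Python (K-Sum is omitted because it depends on an external summarizer). This is an illustration under stated assumptions: the `[MASK]` placeholder, the function names, and the ratios in the example are hypothetical, and Sent-Mask is read here as masking whole sentences, which is one interpretation of the description above.

```python
import random
from typing import List

MASK = "[MASK]"
Sentence = List[str]  # a sentence is a list of tokens


def word_mask(sentence: Sentence, ratio: float, rng: random.Random) -> Sentence:
    """Word-Mask: replace a random `ratio` of the tokens in one sentence."""
    k = round(len(sentence) * ratio)
    masked = set(rng.sample(range(len(sentence)), k))
    return [MASK if i in masked else tok for i, tok in enumerate(sentence)]


def sent_mask(passage: List[Sentence], ratio: float, rng: random.Random) -> List[Sentence]:
    """Sent-Mask (as interpreted here): mask every token of a random `ratio` of sentences."""
    k = round(len(passage) * ratio)
    masked = set(rng.sample(range(len(passage)), k))
    return [[MASK] * len(s) if i in masked else list(s) for i, s in enumerate(passage)]


def beg_ed(passage: List[Sentence]) -> List[Sentence]:
    """BegEd: keep only the first and last sentence of a passage."""
    if len(passage) <= 2:
        return [list(s) for s in passage]
    return [list(passage[0]), list(passage[-1])]


rng = random.Random(0)
passage = [s.split() for s in ["i cannot sleep", "work is fine", "i feel empty", "please help"]]
print(word_mask(passage[0], ratio=0.34, rng=rng))
print(sent_mask(passage, ratio=0.5, rng=rng))
print(beg_ed(passage))
```

Each function returns a perturbed view of the same example, which is then paired with the complete example during training.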

In training, we use the KL-divergence to enforce the constraint that the predicted probability distribution on perturbed examples should be close to the one on complete examples. The KL-divergence loss is simply added to the classification loss, and the two losses are optimized together for each training instance.
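
The combined objective can be written down in a few lines. The sketch below adds a KL term between the prediction on the complete example and the prediction on its perturbed view to the cross-entropy loss. The direction of the KL term (complete example as reference distribution) and the unit weight on it are assumptions, since neither is specified above; `mvl_loss` is an illustrative name.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same labels."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def mvl_loss(p_full: Sequence[float], p_perturbed: Sequence[float], gold_index: int) -> float:
    """Classification loss on the complete example plus the consistency (KL) term."""
    classification = -math.log(p_full[gold_index])
    consistency = kl_divergence(p_full, p_perturbed)
    return classification + consistency


p_full = [0.7, 0.1, 0.1, 0.1]       # prediction on the complete example
p_masked = [0.4, 0.3, 0.2, 0.1]     # prediction on a perturbed view
print(round(mvl_loss(p_full, p_masked, gold_index=0), 4))
```

When the two predictions agree, the KL term is zero and the loss reduces to the ordinary cross-entropy.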

Clinical Psychology Inspired Pseudo-labeling

According to the analysis in the shared task report Zirikly et al. (2019), the main challenge for the 4-way classification comes from insufficient data for the intermediate classes (i.e., low-risk and medium-risk). A straightforward solution is to collect data for these two classes. Recent clinical psychology research O’Connor and Nock (2014) points out that mental health issues such as depression and anxiety can be important risk factors for suicide. Inspired by this study, we collect data from the r/Anxiety and r/depression subreddits. The collected data ranges from December 1, 2008 to September 30, 2020. We sample a small proportion of the collected data from both subreddits and, after manual verification, assign low-risk labels to all r/Anxiety users in the sample and medium-risk labels to all r/depression users in the sample. Since we do not have experts to label these posts, adding too much pseudo-labeled data might introduce too much noise. Based on preliminary experiments on the validation set, the amount of added pseudo-labeled data is 8% of the suicide risk assessment training data. The only difference between these preliminary experiments and the main experiments is that we train the model for fewer epochs than in the full training schedule. Table 1 shows results for different sizes of added pseudo-labeled data from r/depression on the validation set. All pseudo-labeling data follows roughly the same pattern, with the best proportion being 8%.
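
The pseudo-labeling step amounts to sampling a fixed budget of users from a related subreddit and giving all of them the same label. A minimal sketch, assuming the budget is computed from the training-set size of 746 reported in Section 2 and the 8% proportion stated in the caption of Table 2; the function name and the placeholder user IDs are hypothetical.

```python
import random
from typing import List, Tuple


def build_pseudo_labeled(users: List[str], label: str, n_train: int,
                         proportion: float, rng: random.Random) -> List[Tuple[str, str]]:
    """Sample `proportion * n_train` users and assign all of them the same label."""
    budget = round(n_train * proportion)
    chosen = rng.sample(users, min(budget, len(users)))
    return [(user, label) for user in chosen]


rng = random.Random(13)
depression_users = [f"user_{i}" for i in range(1000)]  # placeholder IDs, not real data
extra = build_pseudo_labeled(depression_users, label="c", n_train=746,
                             proportion=0.08, rng=rng)
print(len(extra))  # 60
```

The sampled pairs are then appended to the labeled training set; r/Anxiety users would be handled the same way with the label "b".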

Table 1: Results (Macro-F1 on the validation set) for different proportions of added pseudo-labeling data from r/depression.
No. Approach Setup Macro (P/R/F1)
1 Baseline BERT 0.436 / 0.424 / 0.427
2 TAP BERT 0.439 / 0.445 / 0.432
3 MVL Word-Mask 0.464 / 0.466 / 0.463
4 MVL Sent-Mask 0.380 / 0.409 / 0.383
5 MVL BegEd 0.384 / 0.422 / 0.401
6 MVL K-Sum 0.384 / 0.422 / 0.401
7 PL r/depression 0.535 / 0.480 / 0.498
8 PL r/Anxiety 0.495 / 0.469 / 0.478
9 PL r/depression + Anxiety 0.473 / 0.456 / 0.463
10 PL Task C 0.475 / 0.462 / 0.460
11 - Task C 0.418 / 0.406 / 0.408
Table 2: Results on the Task A test set. For each of experiments 7-11, the size of the added data is 8% of the training data. All metrics are macro-averaged.

3 Experiments and Results

We implement our BERT model with the Hugging Face Transformers library Wolf et al. (2020). Due to the limitation of GPU memory, we only use the base version. We hold out part of the original training data as the validation set and fix the split for all models. Model selection is made by early stopping, and all models are trained with the same number of epochs and batch size. For users with too many posts and words, we only sample a limited number of passages. Table 2 shows our results on Macro-F1.
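
For reference, the macro-averaged precision, recall, and F1 reported in Table 2 can be computed as below. Macro-F1 is taken as the unweighted mean of the per-class F1 scores, which is consistent with the class-wise numbers in Table 3. The gold and predicted labels in the example are made up for illustration.

```python
from typing import List, Sequence, Tuple

LABELS = ("a", "b", "c", "d")  # no-risk, low-risk, medium-risk, high-risk


def macro_prf(gold: Sequence[str], pred: Sequence[str],
              labels: Sequence[str] = LABELS) -> Tuple[float, float, float]:
    """Unweighted mean of per-class precision, recall, and F1."""
    precisions: List[float] = []
    recalls: List[float] = []
    f1s: List[float] = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


gold = ["a", "a", "b", "c", "d", "d"]
pred = ["a", "b", "b", "c", "d", "a"]
print(tuple(round(x, 3) for x in macro_prf(gold, pred)))  # (0.75, 0.75, 0.708)
```

Because every class contributes equally, a class with few test users (such as low-risk) has a large influence on the macro scores.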

Task-Adaptive Pre-training

After applying task-adaptive pre-training to BERT, we see only a small gain in Macro-F1 over the BERT baseline (i.e., from 0.427 to 0.432). This might be because, even though we use the whole corpus provided by the shared task, it is still not large enough.

Multi-view Learning

The Word-Mask strategy improves over the BERT baseline. Compared with task-adaptive pre-training on BERT, which also performs word-level masking but trains only on language modeling, MVL provides a more efficient way to use a small training corpus and brings a gain in Macro-F1 (0.463 versus 0.432). However, all the other MVL approaches hurt performance compared to the BERT baseline. This might be because the proposed sentence-level perturbation strategies can seriously break the semantics of each post and thus hurt overall performance, with random sampling over sentences hurting the most.

Approach  Setup                   a      b      c      d      Macro-F1
Baseline  Baseline                0.730  0.077  0.333  0.566  0.427
PL        r/depression            0.764  0.273  0.327  0.627  0.498
PL        r/Anxiety               0.724  0.160  0.415  0.614  0.478
PL        r/depression + Anxiety  0.767  0.143  0.370  0.574  0.463
PL        Task C                  0.762  0.080  0.318  0.678  0.460
-         Task C                  0.667  0      0.357  0.609  0.408
Table 3: Class-wise performance (F1) for PL-based methods (a=no-risk; b=low-risk; c=medium-risk; d=high-risk).

Clinical Psychology Inspired Pseudo-labeling

Exp 7, 8 and 9 in Table 2 achieve the Top-3 Macro-F1 scores. This indicates that although our psychology-inspired pseudo-labeling technique is simpler than other weakly-supervised methods, adding meaningful pseudo-label data from relevant domains helps mitigate the problem of insufficient data in the intermediate classes (b and c). To verify this point, we show the class-wise classification results for PL-based models in Table 3 where we can see improvements on b and c classes. Due to space constraints, we present the class-wise performance for all models in Appendix C.
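
As a quick consistency check, the Macro-F1 values in Table 2 can be recovered as the unweighted mean of the class-wise F1 scores in Table 3. The snippet below only re-uses numbers from those two tables; the tolerance of 0.001 accounts for the three-decimal rounding of the reported values.

```python
# Class-wise F1 (a, b, c, d) from Table 3, keyed by experiment number in Table 2.
class_f1 = {
    1: [0.730, 0.077, 0.333, 0.566],
    7: [0.764, 0.273, 0.327, 0.627],
    8: [0.724, 0.160, 0.415, 0.614],
    9: [0.767, 0.143, 0.370, 0.574],
    10: [0.762, 0.080, 0.318, 0.678],
    11: [0.667, 0.0, 0.357, 0.609],
}
# Macro-F1 as reported in Table 2.
reported = {1: 0.427, 7: 0.498, 8: 0.478, 9: 0.463, 10: 0.460, 11: 0.408}

recomputed = {exp: sum(f1s) / len(f1s) for exp, f1s in class_f1.items()}
for exp, value in recomputed.items():
    print(exp, f"{value:.4f}", reported[exp])

# Gain of the best model (Exp 7) over the baseline (Exp 1) on classes b and c.
gain_b = class_f1[7][1] - class_f1[1][1]
gain_c = class_f1[7][2] - class_f1[1][2]
print(f"low-risk gain: {gain_b:+.3f}, medium-risk gain: {gain_c:+.3f}")
```

For Exp 7 the gain is concentrated on the low-risk class, while Exp 8 is the setting that raises the medium-risk F1 (0.415 versus 0.333 for the baseline).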

The investigation of the confusion matrix of the best model (shown in Section 4) further supports our hypothesis. However, when we combine different pseudo-labeling data (see Exp 9, where we add users from r/depression and r/Anxiety following the proportion of 1:2 while keeping the number of added users the same; see Supplemental material B for detailed experiments over different mixing proportions), we observe a slight performance drop. The reason might be that users in these two PL datasets are at the boundary between low-risk and medium-risk, so simply mixing them makes the model confuse these two classes (see Supplemental material D for all confusion matrices).
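
Mixing two pseudo-labeled sources under a fixed budget reduces to splitting the number of added users according to the mixing ratio. A small sketch, in which `split_budget` is a hypothetical helper and the budget of 60 users is an illustrative value; which subreddit corresponds to the first and second component of the ratio is not specified in this text, so the two components are left unnamed.

```python
from typing import Tuple


def split_budget(total: int, ratio: Tuple[int, int]) -> Tuple[int, int]:
    """Split `total` added users between two sources according to `ratio`."""
    first, second = ratio
    n_first = round(total * first / (first + second))
    return n_first, total - n_first


for ratio in [(1, 5), (1, 2), (1, 1), (2, 1), (5, 1)]:
    print(ratio, split_budget(60, ratio))
# (1, 5) (10, 50)
# (1, 2) (20, 40)
# (1, 1) (30, 30)
# (2, 1) (40, 20)
# (5, 1) (50, 10)
```

The two parts always sum to the total, so the amount of added data stays the same across mixing proportions and only its composition changes.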

Figure 1: Visualization of the confusion matrices for the baseline model (Exp 1) and the best model (Exp 7) .

Furthermore, we wanted to test the role of the clinical psychology aspect of our pseudo-labeling approach. Does the gain come from the meaningful domains (anxiety and depression) or just from adding more data? To answer this, we use additional data provided by Task C of the shared task, which contains posts from random subreddits (e.g., sports). We run two experiments: 1) assign low-risk to all such users and 2) assign the gold labels provided by the task via crowdsourcing. We add the same amount of data as for the other pseudo-labeling experiments (8% of the training data). The results (Exp 10 & 11 in Table 2) show that the clinical psychology inspired PL outperforms these models by meaningfully addressing the problem of insufficient data for the intermediate classes.

4 Error Analysis

In this section, we take a closer look at the prediction results of our best model (clinical psychology inspired pseudo-labeling using r/depression as medium risk) by looking at the confusion matrix and sampled error cases. We plot the confusion matrices for the baseline model (Exp 1 in Table 2) and the best model (Exp 7 in Table 2) in Figure 1. We can see that the best model achieves its improvement mainly by fixing error cases wrongly predicted as no-risk (where the true labels are “b”, “c” and “d”, with greater error reduction for “d”) and as low-risk (where the true labels are “c” and “d”). As O’Connor and Nock (2014) point out, depression is a serious mental health issue and has become one of the most important risk factors for suicide. Adding posts from r/depression can help the model better understand what “medium-risk” and “high-risk” mean and thus raise the alert for signals of similar or related mental health issues.
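
The kind of analysis described above can be reproduced from gold and predicted labels with a small confusion-matrix helper. The sketch below uses made-up labels rather than the model's outputs, and the convention of rows for gold labels and columns for predictions is an assumption about how Figure 1 is laid out.

```python
from typing import List, Sequence

LABELS = ("a", "b", "c", "d")  # ordered from no-risk to high-risk


def confusion_matrix(gold: Sequence[str], pred: Sequence[str],
                     labels: Sequence[str] = LABELS) -> List[List[int]]:
    """matrix[i][j] counts users with gold label i that were predicted as label j."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for g, p in zip(gold, pred):
        matrix[index[g]][index[p]] += 1
    return matrix


def underestimated(matrix: List[List[int]]) -> int:
    """Users whose predicted risk is lower than their gold risk (below the diagonal)."""
    return sum(matrix[i][j] for i in range(len(matrix)) for j in range(i))


gold = ["a", "a", "b", "c", "d", "d"]
pred = ["a", "b", "b", "c", "d", "a"]
matrix = confusion_matrix(gold, pred)
for label, row in zip(LABELS, matrix):
    print(label, row)
print("under-estimated:", underestimated(matrix))  # under-estimated: 1
```

Counting the cells below the diagonal separately is useful here because predicting a lower risk than the true one is the more harmful type of error.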

We can also see that the main problem of our best model is still the confusion between “b” (low-risk) and “c” (medium-risk). In addition, the problem of wrongly predicting examples from the intermediate classes as high-risk still exists. By manual investigation, we find that both problems require expertise in mental health to make the subtle distinctions. For example, the following text comes from a low-risk example (for ethical reasons, we omit much of the sensitive and private content of this example) that is wrongly predicted as high-risk by our best model:

“sadness has taken me … i am sad , lonely , and i have no interest in living anymore … i didnt want to die … my mind is diseased , unable to take happiness … i have no interest in forming any more. i dont think ill do it…”

It can be seen that there are many negative or even desperate expressions (marked in red) in this example, mixed with some short signals (marked in blue) possibly indicating a person considered at low risk. The model can be fooled by the many negative expressions and make the wrong prediction if it is not aware of the true intent of the person. Therefore, reliable intent identification that considers user posts across time and other information would be a powerful tool to help the model prevent mistakes like this.

5 Application: Predicting Suicide Risk of People Who Use Drugs

In order to further verify the effectiveness of our model in real-world applications, we create a simulation scenario: we apply our best model (Exp 7) to data collected for users who post on both r/opiates and r/SuicideWatch. r/opiates is a subreddit where people discuss topics around opioid usage (e.g., drug doses, withdrawal anguish, daily experiences, harm reduction). Members of this community can often be at high suicide risk Aladağ et al. (2018); Yao et al. (2020). We apply our model over their posts on r/SuicideWatch and find that our model predicts that of them are no-risk, while of them are of low-risk, medium-risk and high-risk. The results on sampled r/opiates posts are for no-risk and for at least some risk. The predicted outputs are highly aligned with the results reported by Yao et al. (2020) using crowdsourced annotation of suicidal or not-suicidal, and show the effectiveness of our model in this simulated scenario. (The original MTurk annotation dataset is not open-sourced, and thus we can only do rough trend matching on our own collected data.)

We hope this will open the door to using NLP methods to investigate the link between suicidal ideation and fatal overdoses among people who use drugs.

6 Conclusions

We investigated a series of weakly-supervised methods and found that pseudo-labeling on data related to risk factors for suicide (depression, anxiety) can help improve model performance. This provides an alternative way to use theoretically-grounded models (e.g., compared to feature engineering). We also show a potential use case of this work for understanding suicidal ideation among people who use drugs (e.g., opiates).

Ethical Considerations

The dataset for suicide risk assessment was obtained from the organizers of the 2019 Clinical Psychology Shared Task on Suicide Risk Assessment, by filling in a participant application where we affirmed that we would follow the shared task’s rules. We have obtained IRB approval (exempt) from Columbia University to use the data, as it consists of publicly available and anonymous posts extracted from Reddit. For the application part, we also obtained Columbia IRB approval (exempt) for the publicly available and anonymous data from r/opiates. All data is kept secure, and online user IDs are not associated with the posts.

Our intention in developing and improving suicide risk assessment models is to help health professionals and/or social workers identify people who might be at risk of suicide. We emphasize that suicide risk assessment models such as the ones developed here should be used responsibly, with a human in the loop (for example, a medical professional or a mental health specialist) who can look at the predicted labels, offer explanations, and decide whether or not they seem sensible. We would urge any user of suicide risk assessment technology to carefully control who may use the system. Currently, the presented models may fail in two ways: they may either mislabel an at-risk user as no-risk (our current models are particularly designed to minimize this risk), or classify a no-risk user with some level of risk. Obviously, there is some potential harm to a person who is truly in need if a system based on this work fails to detect their suicidal ideation, and it is possible that a person who is not truly in need may be irritated or offended if someone reaches out to them because of a mistake. That is why this system should be used only as additional help for health professionals.

We note that because most of our data were collected from Reddit, a website with a known overall demographic skew (towards young, white, American men; see https://social.techjunkie.com/demographics-reddit), our conclusions about what expressions of different suicide risk levels look like and how to detect them cannot necessarily be applied to broader groups of people. This might be particularly acute for vulnerable populations such as people with opioid use disorder (OUD). We hope that this research stimulates more work by the research community to consider and model ways in which different groups express suicidal ideation.


  • A. E. Aladağ, S. Muderrisoglu, N. B. Akbas, O. Zahmacioglu, and H. O. Bingol (2018) Detecting suicidal ideation on forums: proof-of-concept study. Journal of medical Internet research 20 (6), pp. e215. Cited by: §5.
  • K. Clark, M. Luong, C. D. Manning, and Q. Le (2018) Semi-supervised sequence modeling with cross-view training. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1914–1925. Cited by: §2.1.
  • G. Coppersmith, R. Leary, P. Crutchley, and A. Fine (2018) Natural language processing of social media as screening for suicide risk. Biomedical informatics insights 10, pp. 1178222618792860. Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, Cited by: §2.
  • S. Fazel and B. Runeson (2020) Suicide. reply. New England journal of medicine 382 (21), pp. e66–e66. Cited by: §1.
  • S. Gururangan, A. Marasović, S. Swayamdipta, K. Lo, I. Beltagy, D. Downey, and N. A. Smith (2020) Don’t stop pretraining: adapt language models to domains and tasks. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 8342–8360. Cited by: §2.1.
  • J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, and J. Kang (2020) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36 (4), pp. 1234–1240. Cited by: §2.1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: Appendix A, §2.
  • M. Matero, A. Idnani, Y. Son, S. Giorgi, H. Vu, M. Zamani, P. Limbachiya, S. C. Guntuku, and H. A. Schwartz (2019) Suicide risk assessment with multi-level dual-context language and bert. In Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology, pp. 39–44. Cited by: §1.
  • C. M. McHugh, A. Corderoy, C. J. Ryan, I. B. Hickie, and M. M. Large (2019) Association between suicidal ideation and suicide: meta-analyses of odds ratios, sensitivity, specificity and positive predictive value. BJPsych open 5 (2). Cited by: §1.
  • D. Miller (2019) Leveraging bert for extractive text summarization on lectures. arXiv preprint arXiv:1906.04165. Cited by: §2.1.
  • M. K. Nock, F. Ramirez, and O. Rankin (2019) Advancing our understanding of the who, when, and why of suicide risk. JAMA psychiatry 76 (1), pp. 11–12. Cited by: §1.
  • R. C. O’Connor and M. K. Nock (2014) The psychology of suicidal behaviour. The Lancet Psychiatry 1 (1), pp. 73–85. Cited by: §1, §2.1, §4.
  • WHO et al. (2016) Suicide across the world. Cited by: §1.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger, M. Drame, Q. Lhoest, and A. M. Rush (2020) Transformers: state-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, pp. 38–45. Cited by: §3.
  • C. Xu, D. Tao, and C. Xu (2013) A survey on multi-view learning. arXiv preprint arXiv:1304.5634. Cited by: §2.1.
  • Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le (2019) Xlnet: generalized autoregressive pretraining for language understanding. In NeurIPS, pp. 5754–5764. Cited by: Appendix A, §2.
  • H. Yao, S. Rashidian, X. Dong, H. Duanmu, R. N. Rosenthal, and F. Wang (2020) Detection of suicidality among opioid users on reddit: machine learning–based approach. Journal of medical internet research 22 (11), pp. e15293. Cited by: §5.
  • A. Zirikly, P. Resnik, O. Uzuner, and K. Hollingshead (2019) CLPsych 2019 shared task: predicting the degree of suicide risk in reddit posts. In Proceedings of the sixth workshop on computational linguistics and clinical psychology, pp. 24–33. Cited by: §1, §2, §2.1, §2.1, §2.

Appendix A Comparison of Different Pre-trained Language Models

Given that there has been significant progress on architecture designs after BERT, we have experimented with different PLMs, such as RoBERTa Liu et al. (2019) and XLNet Yang et al. (2019). From Table 4, we can see that on the test set the Macro-F1 scores of the three PLMs are close (0.427 for BERT, 0.422 for XLNet and 0.408 for RoBERTa), and neither RoBERTa nor XLNet outperforms BERT. Therefore, we hypothesize that the choice of PLM architecture does not substantially influence the results on this task, and we chose the BERT model.

PLM TAP? PL? MVL? Macro-F1
BERT No No No 0.427
XLNET No No No 0.422
RoBERTa No No No 0.408
Table 4: Experiment results for different PLMs. Here we only show the macro-F1 for the baseline model built on different PLMs.

Appendix B Results for Different Mixing Proportions

Table 5 shows the results for different mixing proportions of pseudo-labeling data from r/Anxiety and r/depression. Due to the limitation of space, in the main paper, we only show the results achieved by the best mixing proportions.

Mixing Proportion Macro-F1
1: 5 0.398
1: 2 0.463
1: 1 0.434
2: 1 0.441
5: 1 0.442
Table 5: Experiment results for different mixing proportions. Here the proportion represents the user ratio of .

Appendix C Class-wise Decomposition of Experimental Results

Here we show the class-wise performance for all the models in Table 6.

No.  Approach  Setup                   a                  b                  c                  d                  Macro
1    Baseline  BERT                    0.742/0.719/0.730  0.077/0.077/0.077  0.400/0.286/0.333  0.525/0.615/0.566  0.436/0.424/0.427
2    TAP       BERT                    0.774/0.750/0.762  0.143/0.154/0.148  0.250/0.107/0.150  0.588/0.769/0.667  0.439/0.445/0.432
3    MVL       Word-Mask               0.788/0.812/0.800  0.111/0.077/0.091  0.391/0.321/0.353  0.567/0.654/0.607  0.464/0.466/0.463
4    MVL       Sent-Mask               0.551/0.844/0.667  0.091/0.077/0.083  0.294/0.179/0.222  0.583/0.538/0.560  0.380/0.409/0.383
5    MVL       BegEd                   0.686/0.750/0.716  0/0/0              0.320/0.286/0.302  0.531/0.654/0.586  0.384/0.422/0.401
6    MVL       K-Sum                   0.686/0.750/0.716  0/0/0              0.320/0.286/0.302  0.531/0.654/0.586  0.384/0.422/0.401
7    PL        r/depression            0.913/0.656/0.764  0.333/0.231/0.273  0.333/0.321/0.327  0.561/0.712/0.627  0.535/0.480/0.498
8    PL        r/Anxiety               0.808/0.656/0.724  0.167/0.154/0.160  0.440/0.393/0.415  0.565/0.673/0.614  0.495/0.469/0.478
9    PL        r/depression + Anxiety  0.821/0.719/0.767  0.133/0.154/0.143  0.385/0.357/0.370  0.554/0.596/0.574  0.473/0.456/0.463
10   PL        Task C                  0.774/0.750/0.762  0.083/0.077/0.080  0.438/0.250/0.318  0.606/0.769/0.678  0.475/0.462/0.460
11   -         Task C                  0.760/0.594/0.667  0/0/0              0.357/0.357/0.357  0.556/0.673/0.609  0.418/0.406/0.408
Table 6: Class-wise decomposition results for models considered in this paper. The results under each class are presented following the ”Precision/Recall/F1” format.

Appendix D Additional Error Analysis

Additional confusion matrices for high-performance models (8, 9, 10 in Table 2) are in Figure 3.

Figure 2: Word-Mask Confusion Matrix.
Figure 3: Additional confusion matrices for Exp 8, 9, 10, 3 in Table 2.