Robin: A Novel Online Suicidal Text Corpus of Substantial Breadth and Scale

09/13/2022
by Daniel DiPietro, et al.

Suicide is a major public health crisis. With more than 20,000,000 suicide attempts each year, the early detection of suicidal intent has the potential to save hundreds of thousands of lives. Traditional mental health screening methods are time-consuming, costly, and often inaccessible to disadvantaged populations; online detection of suicidal intent using machine learning offers a viable alternative. Here we present Robin, the largest non-keyword-generated suicidal corpus to date, consisting of over 1.1 million online forum postings. In addition to its unprecedented size, Robin is specially constructed to include various categories of suicidal text, such as suicide bereavement and flippant references, which better enables models trained on Robin to learn the subtle nuances of text expressing suicidal ideation. Our experiments achieve state-of-the-art performance for the classification of suicidal text, both with traditional methods such as logistic regression (F1=0.85) and with large-scale pre-trained language models such as BERT (F1=0.92). Finally, we release the Robin dataset publicly as a machine learning resource with the potential to drive the next generation of suicidal sentiment research.
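
The abstract reports results as F1 scores, the harmonic mean of precision and recall on the positive class. As a reminder of how that metric is computed, here is a minimal sketch in plain Python. The function name, the label encoding (1 = suicidal text, 0 = other text), and the toy labels are assumptions made for illustration; they are not taken from the Robin paper or its evaluation code.

```python
def f1_score(y_true, y_pred, positive=1):
    """Return the F1 score for the `positive` class (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No correct positive predictions: F1 is defined as 0 here.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels, assumed for illustration only: 1 = suicidal text, 0 = other.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.2f}")
```

With these toy labels there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75 and so is the F1 score.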

research
04/04/2020

CG-BERT: Conditional Text Generation with BERT for Generalized Few-shot Intent Detection

In this paper, we formulate a more realistic and difficult problem setup...
research
06/29/2023

Harnessing the Power of Hugging Face Transformers for Predicting Mental Health Disorders in Social Networks

Early diagnosis of mental disorders and intervention can facilitate the ...
research
01/14/2022

A Warm Start and a Clean Crawled Corpus – A Recipe for Good Language Models

We train several language models for Icelandic, including IceBERT, that ...
research
06/24/2023

My Boli: Code-mixed Marathi-English Corpora, Pretrained Language Models and Evaluation Benchmarks

The research on code-mixed data is limited due to the unavailability of ...
research
01/06/2020

Identifying Historical Travelogues in Large Text Corpora Using Machine Learning

Travelogues represent an important and intensively studied source for sc...
research
06/06/2022

Spam Detection Using BERT

Emails and SMSs are the most popular tools in today communications, and ...
research
03/15/2022

Evaluating BERT-based Pre-training Language Models for Detecting Misinformation

It is challenging to control the quality of online information due to th...
