A Survey of Relevant Text Mining Technology

11/28/2022
by Claudia Peersman, et al.

Recent advances in text mining and natural language processing technology have enabled researchers to detect an author's identity or demographic characteristics, such as age and gender, in several text genres by automatically analysing the variation of linguistic characteristics. However, applying such techniques in the wild, i.e., in both cybercriminal and regular online social media, differs from more general applications in that its defining characteristics are both domain and process dependent. This gives rise to a number of challenges of which contemporary research has only scratched the surface. More specifically, a text mining approach applied to social media communications typically has no control over the dataset size, because the number of available communications varies across users. Hence, the system has to be robust to limited data availability. Additionally, the quality of the data cannot be guaranteed. As a result, the approach needs to be tolerant of a certain degree of linguistic noise (for example, abbreviations, non-standard language use, spelling variations and errors). Finally, in the context of cybercriminal fora, it has to be robust to deceptive or adversarial behaviour, i.e., offenders who attempt to hide their criminal intentions (obfuscation) or who assume a false digital persona (imitation), potentially using coded language. In this work, we present a comprehensive survey that discusses the problems that have already been addressed in the current literature and reviews potential solutions. Additionally, we highlight which areas need more attention.

