Discovering and Categorising Language Biases in Reddit

08/06/2020
by   Xavier Ferrer, et al.
0

We present a data-driven approach using word embeddings to discover and categorise language biases on the discussion platform Reddit. As spaces for isolated user communities, platforms such as Reddit are increasingly connected to issues of racism, sexism and other forms of discrimination. Hence, there is a need to monitor the language of these groups. One of the most promising AI approaches to trace linguistic biases in large textual datasets involves word embeddings, which transform text into high-dimensional dense vectors and capture semantic relations between words. Yet, previous studies require predefined sets of potential biases to study, e.g., whether gender is more or less associated with particular types of jobs. This makes these approaches unfit to deal with smaller and community-centric datasets such as those on Reddit, which contain smaller vocabularies and slang, as well as biases that may be particular to that community. This paper proposes a data-driven approach to automatically discover language biases encoded in the vocabulary of online discourse communities on Reddit. In our approach, protected attributes are connected to evaluative words found in the data, which are then categorised through a semantic analysis system. We verify the effectiveness of our method by comparing the biases we discover in the Google News dataset with those found in previous literature. We then successfully discover gender bias, religion bias, and ethnic bias in different Reddit communities. We conclude by discussing potential application scenarios and limitations of this data-driven bias discovery method.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
10/27/2020

Discovering and Interpreting Conceptual Biases in Online Communities

Language carries implicit human biases, functioning both as a reflection...
research
03/09/2020

Joint Multiclass Debiasing of Word Embeddings

Bias in Word Embeddings has been a subject of recent interest, along wit...
research
03/22/2022

Suum Cuique: Studying Bias in Taboo Detection with a Community Perspective

Prior research has discussed and illustrated the need to consider lingui...
research
06/07/2022

Gender Bias in Word Embeddings: A Comprehensive Analysis of Frequency, Syntax, and Semantics

The statistical regularities in language corpora encode well-known socia...
research
09/13/2019

A General Framework for Implicit and Explicit Debiasing of Distributional Word Vector Spaces

Distributional word vectors have recently been shown to encode many of t...
research
05/21/2023

Measuring Intersectional Biases in Historical Documents

Data-driven analyses of biases in historical texts can help illuminate t...
research
02/21/2022

Photometric Redshift Estimation with Convolutional Neural Networks and Galaxy Images: A Case Study of Resolving Biases in Data-Driven Methods

Deep Learning models have been increasingly exploited in astrophysical s...

Please sign up or login with your details

Forgot password? Click here to reset