Speech Toxicity Analysis: A New Spoken Language Processing Task

10/14/2021 ∙ by Sreyan Ghosh, et al. ∙ 27

Toxic speech, also known as hate speech, is regarded as one of the crucial issues plaguing online social media today. Most recent work on toxic speech detection is constrained to the modality of text with no existing work on toxicity detection from spoken utterances. In this paper, we propose a new Spoken Language Processing task of detecting toxicity from spoken speech. We introduce DeToxy, the first publicly available toxicity annotated dataset for English speech, sourced from various openly available speech databases, consisting of over 2 million utterances. Finally, we also provide analysis on how a spoken speech corpus annotated for toxicity can help facilitate the development of E2E models which better capture various prosodic cues in speech, thereby boosting toxicity classification on spoken utterances.




1 Introduction

Social network platforms are generally meant for sharing positive, constructive, and insightful content. However, in recent times, people are often exposed to objectionable content such as threats, identity attacks, hate speech, insults, obscene texts, offensive remarks, or bullying. With the rise of forms of online content beyond written text, i.e., audio and video, it is crucial that we devise efficient content moderation systems for these forms of shared media. Thus, we propose a new Spoken Language Processing (SLP) task of toxicity classification in spoken language, which remains a crucial problem to solve for interactive intelligent systems, with broad applications in content moderation for online audio/video content, gaming, customer service, etc.

The key challenge in toxicity classification in spoken speech is to learn good representations that capture toxicity signals from speech, inherent in a rich set of acoustic and linguistic content, including semantic meaning, speaker characteristics, emotion, tone, and possibly even sentiment information while remaining invariant under different speakers, acoustic conditions, and other natural speech variations. Traditional approaches employ acoustic features, such as band-energies, filter banks, and MFCC features [1], or raw waveform [2] for various speech downstream tasks. We acknowledge the fact that models trained on these low-level features can easily overfit to noise or signals irrelevant to the task. One way to remove variations in speech is to transcribe the audio into text and use text features to predict toxicity as done with speech sentiment classification by [3]. Nonetheless, toxicity signals in the speech, like tone and emotion, can be lost in the transcription. Thus, it is essential for any system to learn speech representations that make high-level information from speech signals available to solve downstream SLP tasks. In this paper, in addition to proposing a new dataset for toxicity detection in spoken language, we also explore several methodologies that can be employed to solve this task including a 2-step system, learned using already available data online, and an End-to-End (E2E) system built solely on our proposed dataset.

2 Related Work

Hate speech or toxic speech detection is a challenging task, with literature spanning dictionary-based techniques, distributional semantics, and, more recently, neural network architectures. However, most of the work done on hate speech detection is constrained to text, in English and other languages [4, 5, 6], with no work on spoken language to the best of our knowledge.

On the other hand, the most commonly explored SLP tasks include Automatic Speech Recognition (ASR), speech emotion analysis, speaker verification, speaker identification, speech separation, speech enhancement, Named Entity Recognition (NER), phoneme recognition, etc., and the more recently explored speech sentiment classification, which is closely related to our task. All of these tasks are well studied, with many open-source datasets available online. With recent advancements in Natural Language Processing (NLP) and SLP, many of the systems achieving state-of-the-art results in these tasks leverage end-to-end multi-layer neural networks like CNNs or Transformers, including self-supervised and semi-supervised techniques, either to learn powerful speech representations [7, 8, 9], to pseudo-label larger corpora using limited supervised data [10], or to leverage labeled data from other modalities, such as text, in SLP tasks [11].

With downstream SLP tasks like speech sentiment classification and NER requiring an understanding of the contents and semantics of the spoken utterance, people have employed both 2-step [12] and E2E methodologies [13, 14]. Some of these self-supervised methodologies, originally invented for pushing the performance of state-of-the-art in ASR systems, have also shown success in other downstream tasks [9].

In this paper, we provide baselines using both a 2-step procedure, trained on existing data available online, and E2E procedures trained on our proposed dataset. We show how prosodic and linguistic cues in natural speech aid our E2E model, which outperforms the 2-step methodology using less than 10% of the total amount of labeled data available to our 2-step system.

3 Dataset

In this paper, we present DeToxy, the first annotated dataset for toxic speech detection in the English language. Our dataset is a subset of various open-source datasets, detailed statistics for which can be found in Table 1. We also present DeToxy-B, a balanced version of the dataset, curated from the original larger version by taking into consideration auxiliary factors like trigger terms and utterance sentiment labels.

We define toxicity as content that is rude, disrespectful, or otherwise likely to make someone leave a discussion. For the initial version of our dataset, we primarily focus on openly available speech databases, with or without text transcripts. For datasets without transcripts, we use pre-trained wav2vec2 to obtain the transcriptions. While consolidating our dataset, an empirical analysis of the transcripts of most of the open-source datasets available online showed that most of them did not contain toxic utterances. Thus, we follow a 2-step procedure to annotate the dataset. First, we train a textual toxicity classifier using BERT and use it to select datasets in which at least 10% of the total number of utterances were flagged as toxic by the model. Some datasets that did not meet this criterion were LibriSpeech, TIMIT, TED-LIUM, and MASS. Next, all the utterances obtained from the datasets selected in the first step were manually annotated by 3 professional annotators using audino (https://github.com/midas-research/audino), taking both the text and the spoken utterance into account. Finally, we applied simple majority voting among the 3 annotations to determine the final class of each utterance (Cohen's Kappa score for inter-annotator agreement is 0.76). Table 1 and Table 2 show detailed descriptions of DeToxy and DeToxy-B, respectively.
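The two annotation steps described above can be sketched as follows. This is a minimal illustration, not the authors' code: the corpus names, the classifier flags, and the annotator labels are all hypothetical, and the helper names `toxic_fraction` and `majority_vote` are introduced here only for the example.

```python
from collections import Counter


def toxic_fraction(flags):
    """Fraction of utterances that the text classifier flagged as toxic."""
    return sum(flags) / len(flags) if flags else 0.0


def majority_vote(labels):
    """Return the label chosen by most annotators.

    With three annotators and two classes, a strict majority always exists.
    """
    label, _count = Counter(labels).most_common(1)[0]
    return label


# Step 1 (hypothetical data): keep corpora where >= 10% of utterances are flagged.
classifier_flags = {
    "corpus_a": [True, False, False, False, True],  # 40% flagged
    "corpus_b": [False] * 10,                       # 0% flagged
}
selected = [name for name, flags in classifier_flags.items()
            if toxic_fraction(flags) >= 0.10]

# Step 2 (hypothetical data): three annotator labels per utterance.
annotations = {
    "utt_001": ["toxic", "toxic", "non-toxic"],
    "utt_002": ["non-toxic", "non-toxic", "non-toxic"],
    "utt_003": ["non-toxic", "toxic", "non-toxic"],
}
final_labels = {utt: majority_vote(votes) for utt, votes in annotations.items()}
```
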

For DeToxy-B, we keep all the toxic utterances and sample non-toxic utterances at a ratio of 2:1 for non-toxic to toxic utterances. We sample the non-toxic utterances from the larger dataset in a balanced manner, with an equal distribution of sentiments. We also collate an explicit test set of non-toxic utterances, each of which contains at least one trigger term (https://hatebase.org/). We do this to evaluate how well both our baselines avoid becoming biased towards trigger terms, a long-standing problem in toxicity classification [15].
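One way to read the sampling rule above is sketched below. It is an assumption-laden illustration rather than the authors' procedure: the record fields (`toxic`, `sentiment`), the function name `build_balanced_subset`, and the choice to split the non-toxic quota evenly across sentiment labels are all hypothetical.

```python
import random
from collections import defaultdict


def build_balanced_subset(utterances, ratio=2, seed=0):
    """Keep every toxic utterance and sample about `ratio` non-toxic ones per
    toxic utterance, drawing the same number from each sentiment class."""
    rng = random.Random(seed)
    toxic = [u for u in utterances if u["toxic"]]

    # Group the non-toxic pool by sentiment label.
    by_sentiment = defaultdict(list)
    for u in utterances:
        if not u["toxic"]:
            by_sentiment[u["sentiment"]].append(u)
    if not by_sentiment:
        return toxic

    per_sentiment = (ratio * len(toxic)) // len(by_sentiment)
    sampled = []
    for sentiment in sorted(by_sentiment):
        pool = by_sentiment[sentiment]
        sampled.extend(rng.sample(pool, min(per_sentiment, len(pool))))
    return toxic + sampled


# Hypothetical corpus: 3 toxic utterances and 10 non-toxic ones per sentiment.
corpus = [{"id": f"t{i}", "toxic": True, "sentiment": "negative"} for i in range(3)]
for s in ("positive", "neutral", "negative"):
    corpus += [{"id": f"{s}{i}", "toxic": False, "sentiment": s} for i in range(10)]

subset = build_balanced_subset(corpus)
```

With this toy corpus the subset holds 3 toxic and 6 non-toxic utterances, 2 per sentiment class.
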

Dataset            # Utterances  # Toxic  # Non-Toxic  # Sp     Toxic  Emo  Sent  TL (hh:mm:ss)
CMU-MOSEI          44,977        216      44,761       1,000                      65:53:36
CMU-MOSI           2,199         68       2,131        98                         02:36:17
Common Voice [16]  1,584,219     2,888    1,581,331    66,173                     2181:00:00*
IEMOCAP            10,087        274      9,813        10                         11:28:12
LJ Speech          13,100        40       13,060       1                          23:55:17
MELD [17]          13,708        141      13,567       304                        12:02:44
MSP-Improv [18]    8,348         131      8,217        12                         8:25:41
MSP-Podcast [19]   73,042        550      72,492       1,273*                     113:41:00
Social-IQ          12,024        123      11,901       -                          20:39:00
Switchboard        259,890       456      259,434      400                        260:00:00*
VCTK               44,583        50       44,533       110                        70:22:28

Table 1: DeToxy Statistics

Dataset       # Utterances  # Toxic  # Non-Toxic  # Sp    # Positive  # Neutral  # Negative  TL (hh:mm:ss)
CMU-MOSEI     640           216      424          -       156         100        384         1:18:00
CMU-MOSI      200           68       132          -       57          34         109         0:14:06
Common Voice  8,595         2,888    5,707        5,049   2,363       1,489      4,743       9:11:42
IEMOCAP       818           274      544          -       239         127        452         1:00:04
LJ Speech     111           40       71           1       18          19         74          0:11:16
MELD          427           141      286          58      103         91         233         0:23:09
MSP-Improv    523           131      392          12      115         91         317         0:36:32
MSP-Podcast   1,644         550      1,094        482*    445         285        914         2:25:58
Social-IQ     485           123      362          -       155         89         241         0:36:34
Switchboard   1,343         456      887          -       346         249        748         1:53:15
VCTK          145           50       95           81      46          26         73          0:06:19
Total         14,931        4,937    9,994        -       4,043       2,600      8,288       17:56:55

Table 2: DeToxy-B Statistics

4 Experiments

The task of toxic speech detection involves assigning a probability score, denoting the toxicity, to each utterance fed to the model. Formally, let $X = (x_1, \dots, x_n)$ be the input utterances; then $Y = (y_1, \dots, y_n)$ are the corresponding probability scores, where $y_i \in [0, 1]$.

This section describes our baselines, their components, and the training procedure for each.

4.1 2-step

Our 2-step approach consists of 2 primary components: an ASR component that produces transcriptions from spoken utterances, and a sequence classification model that classifies the toxicity of the output transcript. For our ASR model, we use a transformer architecture with a Convolutional Neural Network (CNN) feature encoder layer, in a setting similar to wav2vec 2.0 [7]. Similar to [7], we employ a 2-step training procedure. We first pre-train the model on unlabeled audio in a self-supervised fashion by solving a contrastive task, in which the model learns to identify the true quantized representation of the output of the context network, i.e., the transformer architecture, among a set of distractors.

Next, after self-supervised pre-training, we fine-tune the model with CTC on labeled speech data. We do not use a Language Model to decode utterances during inference, both because we did not find significant differences in Word Error Rate (WER) with it in our experiments and to keep the setup simple.
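Decoding a CTC model without a Language Model is commonly done with greedy (best-path) decoding: take the most likely token per frame, collapse consecutive repeats, and drop blanks. The sketch below illustrates that general technique in plain Python; it is not the fairseq implementation, and the toy vocabulary, the blank index, and the frame scores are hypothetical.

```python
def ctc_greedy_decode(frame_scores, vocab, blank_id=0):
    """Best-path CTC decoding: argmax per frame, merge repeats, remove blanks."""
    # Index of the highest-scoring token in each frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]

    decoded = []
    previous = None
    for token_id in best_path:
        # Emit a token only when it differs from the previous frame's token
        # and is not the blank symbol.
        if token_id != previous and token_id != blank_id:
            decoded.append(vocab[token_id])
        previous = token_id
    return "".join(decoded)


# Hypothetical 3-symbol vocabulary; index 0 is the CTC blank.
vocab = ["_", "a", "b"]
frame_scores = [
    [0.10, 0.80, 0.10],  # a
    [0.20, 0.70, 0.10],  # a (repeat, merged)
    [0.90, 0.05, 0.05],  # blank
    [0.10, 0.60, 0.30],  # a (new, because a blank separates it)
    [0.10, 0.20, 0.70],  # b
    [0.20, 0.10, 0.70],  # b (repeat, merged)
    [0.80, 0.10, 0.10],  # blank
]
transcript = ctc_greedy_decode(frame_scores, vocab)
```

The blank between the two runs of "a" is what lets the decoder output a doubled letter.
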

For self-supervised pre-training of our ASR model, we use a combination of Libri-Light, Common Voice [16], SwitchBoard (SB) [20], and Fisher. We then fine-tune the ASR model with CTC on 300 hours of SB. Our choices of data for pre-training and fine-tuning are influenced by [21]: we try to fit our model to the conversational domain, since a large part of DeToxy comes from that domain.

For our sequence classification model, the second step of our 2-step approach, we employ the BertBASE transformer architecture. We fine-tune BertBASE on either a publicly available large-scale hate-speech classification dataset (https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data) or our DeToxy-B annotations. Formally, we tokenize each word in the sentence and feed the tokens through the transformer architecture. We use the hidden state embedding e corresponding to the [CLS] token as the aggregate representation of the words in the transcript of the utterance. This embedding e is fed through a final classification layer, which learns a parameterized function and outputs logits over the toxicity classes. Finally, we pass the logits through a softmax activation function to get the probability distribution of toxicity for each sentence.

We train the model using cross-entropy loss (2) and, during inference, take the argmax of this distribution as the final class for each utterance.
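For reference, the classification layer, softmax, and cross-entropy loss described above take the following standard forms. The notation is an assumption made here for illustration: $W$ and $b$ are the weights and bias of a linear classification layer, $z$ the logits, $p$ the predicted distribution, $y$ the one-hot ground-truth label, and $C = 2$ the number of classes (toxic and non-toxic). The exact parameterization used by the authors may differ.

```latex
\begin{align*}
  z &= W e + b, \qquad z \in \mathbb{R}^{C} \\
  p_c &= \frac{\exp(z_c)}{\sum_{k=1}^{C} \exp(z_k)}, \qquad c \in \{1, \dots, C\} \\
  \mathcal{L} &= -\sum_{c=1}^{C} y_c \log p_c \\
  \hat{y} &= \arg\max_{c} \; p_c
\end{align*}
```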

4.2 End-to-End

4.2.1 F-Bank

For our F-Bank-based E2E approach, given an audio input sequence, we extract log-compressed mel-filterbanks with a window size of 25 ms and a hop size of 10 ms. We then apply GlobalAveragePooling over all T hidden states [h1,…,hT], one per time-step, to obtain a single embedding e for each utterance. This embedding is fed into a toxicity classification decoder consisting of a single fully connected layer, followed by a softmax activation function that yields the probability of toxicity for each utterance.
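To make the framing and pooling arithmetic concrete, the sketch below computes how many 25 ms frames a 10 ms hop yields and averages a sequence of frame vectors into one utterance embedding. The 16 kHz sample rate, the absence of padding, and the helper names are assumptions for illustration; the text above does not state them.

```python
def num_frames(num_samples, sample_rate=16000, window_ms=25, hop_ms=10):
    """Number of full analysis windows, assuming no padding at the edges."""
    window = sample_rate * window_ms // 1000  # 400 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000        # 160 samples at 16 kHz
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // hop


def global_average_pool(frames):
    """Average T frame vectors [h_1, ..., h_T] into a single embedding e."""
    T = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / T for d in range(dim)]


# One second of 16 kHz audio.
frames_per_second = num_frames(16000)

# Hypothetical 2-dimensional hidden states for T = 3 time-steps.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
embedding = global_average_pool(hidden_states)
```

Averaging over time gives a fixed-size embedding regardless of utterance length, which is what allows a single fully connected layer to serve as the classifier.
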

Figure 1: Our Approach: The primitive 2-step approach (left) and the E2E approach (right)

4.2.2 Transformers

For a fair comparison with our 2-step procedure, we employ the same architecture employed in the 2-step procedure for our E2E system. We follow the same pre-training and fine-tuning methodology as our 2-step system but this time we additionally fine-tune all the weights of the transformer architecture directly on the downstream task in an E2E fashion. Next, we repeat this step by freezing all the encoder weights, where the encoder acts only as a feature extractor, to analyze the effect of E2E fine-tuning on our downstream task.

To fine-tune our model on the downstream task, we employ the same pooling strategy and prediction-head as our F-Bank approach, except now we take the embeddings from the 9th layer of our ASR encoder and obtain a single embedding e for each utterance through GlobalAveragePooling over all the time-steps.

Both these models are trained using Cross-Entropy Loss (2) and similar to our 2-step procedure, during inference, we take the argmax of the output of the softmax function as the final class for each utterance.

5 Experimental Setup

We use the PyTorch framework to build our Deep Learning models, along with the Transformer implementations, pre-trained models, and specific tokenizers in the HuggingFace library for our 2-step sequence classification models, and fairseq for training our ASR model.

For our ASR model, we use the same architecture as [7], consisting of 12 transformer blocks, a model dimension of 768, an inner dimension of 3,072, and 8 attention heads. For pre-training and fine-tuning, we use the same hyperparameter setup as [7]. For our Sequence Classification model, we use an architecture similar to BertBASE. We use the pre-trained model from HuggingFace and fine-tune it on our downstream data with the Adam optimizer in batched mode with a batch size of 32 and a learning rate of

for a maximum of 20 epochs with an early-stopping of 5.

We make all our code and data available on GitHub (https://github.com/Sreyan88/Toxicity-Detection-in-Spoken-Utterances). All the datasets, including audio segments and transcripts, will be released according to individual dataset guidelines, more information on which can be found in our GitHub repository.

6 Results Analysis

In this section, we analyze the results obtained from both our 2-step and E2E systems. We use the F1 score to evaluate the performance of both approaches, defined as:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

We report the macro average over both classes to evaluate our systems. As Table 3 shows, our E2E system outperforms our 2-step system while using less than 10% of the total amount of labeled data available for training the latter. For the 2-step approach, we compare BertBASE trained on the Civil Comments corpus and on the DeToxy-B annotations.
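A macro-averaged F1 score computes F1 separately for each class and then takes the unweighted mean, so the minority (toxic) class counts as much as the majority class. The sketch below is a minimal plain-Python illustration with hypothetical labels, where 1 denotes toxic and 0 non-toxic; the function names are introduced here for the example only.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Hypothetical labels: 1 = toxic, 0 = non-toxic.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
score = macro_f1(y_true, y_pred)
```

Here the toxic class scores 0.5 and the non-toxic class 0.75, giving a macro F1 of 0.625.
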

System       Category        Dev   Test  Test-T
2-step       Civil Comments  0.71  0.73  0.33
2-step       DeToxy-B        0.54  0.54  0.45
F-Bank       -               0.61  0.62  0.49
Transformer  Frozen          0.65  0.68  0.51
Transformer  Unfrozen        0.73  0.76  0.57

Table 3: Experimental Results

Beyond the problem that error does not propagate in the 2-step approach, it suffers from two more major bottlenecks. First, ASR models do not generalize well across domains and generally perform poorly when trained on one domain and evaluated on another. This is evident in our case, where we get an average WER of 28%. Second, text-based toxicity classification systems often fail to generalize and learn to incorrectly label as toxic sentences that contain commonly attacked entities or a vocabulary of trigger terms [15]. To analyze this, we curate a subset of our test set containing only utterances with such trigger terms to see how both our systems handle bias in toxicity classification, and report the results in the Test-T column of Table 3. Our E2E model outperforms the 2-step approach by a wide margin on this subset (0.57 for the fine-tuned Transformer versus 0.33 and 0.45 for the 2-step system trained on Civil Comments and DeToxy-B, respectively). We hypothesize that the E2E model might have learned better prosodic and acoustic cues relevant to the task, and leave this analysis to future work.
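The WER figure quoted above is the word-level edit distance between the reference transcript and the ASR output, divided by the number of reference words. The sketch below shows that standard computation in plain Python; the example sentences are hypothetical and the function name is introduced here for illustration.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only one previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: one deleted word ("the"), one substitution ("now" -> "no").
wer = word_error_rate("turn the volume down now", "turn volume down no")
```

Two errors over five reference words give a WER of 0.4. At a 28% average WER, roughly one reference word in four is affected, which can remove or distort the very words a text classifier relies on.
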

7 Conclusion

In this paper, we introduce the first publicly available dataset for the task of toxic speech classification and propose two strong baselines for this task. Future work includes expanding the dataset at least 10-fold using more naturally spoken utterances, exploring different modalities, and developing better neural architectures to solve this problem.