BERT-Beta: A Proactive Probabilistic Approach to Text Moderation

09/18/2021 ∙ by Fei Tan, et al. ∙ 0

Text moderation for user generated content, which helps to promote healthy interaction among users, has been widely studied and many machine learning models have been proposed. In this work, we explore an alternative perspective by augmenting reactive reviews with proactive forecasting. Specifically, we propose a new concept, text toxicity propensity, to characterize the extent to which a text tends to attract toxic comments. Beta regression is then introduced for the probabilistic modeling, and it is shown to work well in comprehensive experiments. We also propose an explanation method to communicate the model decision clearly. Both propensity scoring and interpretation benefit text moderation in a novel manner. Finally, the proposed scaling mechanism for the linear model offers useful insights beyond this work.




1 Introduction

Text moderation is essential for maintaining a non-toxic online community for media platforms nobata2016abusive. Many efforts from both academia and industry have been made to address this critical problem. Recently, the most prototypical thread is to do sophisticated feature engineering or develop powerful learning algorithms nobata2016abusive; badjatiya2017deep; bodapati2019neural; tan2020tnt; tran2020habertor. Automatic comment moderation schemes plus human review are certainly the cornerstone of the fight against toxicity.

These existing works, however, are reactive approaches to handling user generated text in response to the publication of new articles. In this paper, we revisit this challenge from a proactive perspective. Specifically, we introduce a novel concept, text toxicity propensity, to quantify how likely an article is to incur toxic comments. This is a proactive outlook index for news articles prior to publication, which differs radically from the existing reactive approaches to comments.

In this context, reactive describes comment-level moderation algorithms applied after the publication of news articles (e.g., Perspective perspective), which quantify whether comments are toxic and should be taken down or sent for human review. Proactive emphasizes article-level moderation effort before publication (without access to comments), which forecasts how likely articles are to attract toxic comments in the future and gives suggestions (e.g., rephrase news articles properly) in advance. Our work can be viewed as the first machine learning effort for a proactive stance against toxicity.

Formally, we propose a probabilistic approach based on the Beta distribution to regress article toxicity propensity on article text. For previously published news articles with comments, we take the average of the comments' toxicity scores as the ground-truth label for model learning. The effectiveness of this approach is shown on both the test set and human labels. We also develop a scheme that provides a convincing explanation of the decision of the deep learning model.

2 Related Works

To the best of our knowledge, there is no prior research thread on proactive text moderation. Nonetheless, many reactive approaches have been explored, including hand-crafted feature engineering chen2012detecting; warner2012detecting; nobata2016abusive, neural networks badjatiya2017deep; pavlopoulos2017deeper; agrawal2018deep; zhang2018detecting and transformer variants bodapati2019neural; tan2020tnt.

Recently, context, in the form of parent posts, has been studied, but it is only viewed as regular text snippets for lifting the performance of toxicity classifiers pavlopoulos2020toxicity while screening posts. Our work instead focuses on predicting the proactive toxicity propensity of articles before they receive user comments.

The Beta distribution is usually utilized as a prior in Bayesian statistics. The most popular example in natural language processing is the topic model blei2003latent, where the multivariate version of the Beta distribution (a.k.a. the Dirichlet distribution) generates parameters of mixture models. Beta regression was originally proposed for modeling rate and proportion data ferrari2004beta by parameterizing mean and dispersion and regressing the parameters of interest. It has been applied to evaluate grid search parameters in optimization mckinney2019classification, to model emotional dimensions aggarwal2020exploration, and to model statistical processes of child-adult linguistic coordination and alignment misiek2020development.

3 Beta Regression

In this work, both the comment toxicity score and the derived article toxicity propensity score (to be detailed in Section 4.1) range from 0 to 1. Empirically, their distributions are asymmetric and may not be modelled well by a Gaussian distribution (Figs. 2 and 3 of Appendix A). Furthermore, the comment toxicity score distributions of individual articles vary with article content, as shown in Fig. 3 of Appendix A. Modelling the entire distribution of an article's comment toxicity scores is thus a reasonable approach. The Beta distribution is very flexible: in terms of its shape parameters $\alpha$ and $\beta$, it can model a wide range of well-known distribution families, from the symmetric uniform ($\alpha = \beta = 1$) and bell-shaped distributions ($\alpha = \beta > 1$) to asymmetric shapes ($\alpha \neq \beta$).

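In this context, the toxicity propensity score $y$ is assumed to follow the Beta distribution with probability density function (pdf):

$$p(y; \alpha, \beta) = \frac{y^{\alpha - 1} (1 - y)^{\beta - 1}}{B(\alpha, \beta)} \tag{1}$$

where $\alpha$ and $\beta$ are two positive shape parameters that control the distribution, $B(\alpha, \beta)$ is the normalization constant, and the support is $y \in (0, 1)$. Eq. 1 describes the randomness of $y$ given $\alpha$ and $\beta$; we thus impose a regression structure on these two parameters in terms of the text content.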
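To make the shape flexibility of Eq. 1 concrete, the following minimal Python sketch evaluates the Beta density using only the standard library. The function name `beta_pdf` and the example shape parameters are illustrative assumptions and are not taken from the paper.

```python
import math


def beta_pdf(y, alpha, beta):
    """Beta(alpha, beta) density at y, for 0 < y < 1 and alpha, beta > 0."""
    if not 0.0 < y < 1.0:
        raise ValueError("y must lie strictly between 0 and 1")
    if alpha <= 0 or beta <= 0:
        raise ValueError("shape parameters must be positive")
    # log B(alpha, beta), computed through log-gamma for numerical stability
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    log_density = (alpha - 1) * math.log(y) + (beta - 1) * math.log(1 - y) - log_norm
    return math.exp(log_density)


# Illustrative shapes: uniform, symmetric bell, and an asymmetric case.
for a, b in [(1.0, 1.0), (2.0, 2.0), (2.0, 5.0)]:
    print(a, b, [round(beta_pdf(y, a, b), 3) for y in (0.2, 0.5, 0.8)])
```

With equal shape parameters the density is symmetric around 0.5, while unequal parameters shift the mass toward one end of the interval.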

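Formally, given a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$ with raw text feature vector $x_i$ and label $y_i$ for sample $i$, we apply feature engineering or text embedding $\phi$ and then regress $\alpha$ and $\beta$ on $\phi(x_i)$ respectively as

$$\alpha_i = f(\phi(x_i)), \qquad \beta_i = g(\phi(x_i)) \tag{2}$$

where $f$ and $g$ are learned jointly. $\phi$ can be either pre-fixed or learned together with $f$ and $g$, which is detailed in the subsequent section. Specifically, the learning procedure of $f$, $g$ and $\phi$ (if applicable) is to minimize the loss

$$\mathcal{L} = -\sum_{i=1}^{N} \log p(y_i; \alpha_i, \beta_i)$$

Substituting Eqs. 1 and 2 into it gives the final objective function.

In the inference phase, with learned $f$, $g$ and $\phi$, the parameters $\alpha$ and $\beta$ for a new sample can be readily derived from Eq. 2. We take the mean of Eq. 1 as a point estimator:

$$\hat{y} = \frac{\alpha}{\alpha + \beta}$$

because we are predicting the average toxicity.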
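The training objective and the mean point estimator can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the loss is taken to be the negative log-likelihood of the Beta density, the two shape parameters come from hypothetical single-layer linear heads, and an exponential link is assumed to keep them positive. None of the names below come from the paper.

```python
import math


def beta_log_pdf(y, alpha, beta):
    """Log of the Beta(alpha, beta) density at y in (0, 1)."""
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return (alpha - 1) * math.log(y) + (beta - 1) * math.log(1 - y) - log_norm


def shape_params(features, w_alpha, w_beta):
    """Hypothetical linear heads; exp() keeps both parameters positive (assumption)."""
    alpha = math.exp(sum(w * x for w, x in zip(w_alpha, features)))
    beta = math.exp(sum(w * x for w, x in zip(w_beta, features)))
    return alpha, beta


def negative_log_likelihood(batch, w_alpha, w_beta):
    """Sum of -log p(y_i; alpha_i, beta_i) over (features, label) pairs."""
    total = 0.0
    for features, label in batch:
        alpha, beta = shape_params(features, w_alpha, w_beta)
        total -= beta_log_pdf(label, alpha, beta)
    return total


def mean_estimate(alpha, beta):
    """Mean of Beta(alpha, beta), used as the point prediction."""
    return alpha / (alpha + beta)


# Toy batch: labels must lie strictly inside (0, 1) for the log terms to be finite.
batch = [([1.0, 0.5], 0.12), ([1.0, -0.3], 0.31)]
print(negative_log_likelihood(batch, w_alpha=[0.2, 0.1], w_beta=[1.0, -0.4]))
```

In practice the weights would be fitted with a gradient-based optimizer; the sketch only shows how the loss is assembled from the two predicted parameters.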

4 Experiments

4.1 Dataset

We collect a dataset of articles published on Yahoo media outlets, all written in English. We also exclude articles with low comment volume to make the distribution learning reliable. The number of comments for 99% of the analyzed articles lies in [10, 8K], with a 25% quantile of 20, a median of 50 and a mean of 448. The dataset is then split into training, validation and test parts based on the publishing date with a ratio of 8:1:1, as described in Table 1. It is worth noting that the input text is the concatenation of article title and text body. The toxicity propensity score of an article is defined as the average toxicity score of all associated comments. Comments are scored by Google's Perspective perspective, whose scores lie in [0, 1]. Perspective takes user generated text as input and outputs a toxicity probability. It is a convolutional neural net noever2018machine trained on a Wikipedia comments dataset labeled by multiple people per majority rule.

| | Training | Validation | Test |
|---|---|---|---|
| Sample Size | 536,711 | 70,946 | 70,946 |
| Publishing Date | 2004 - 05/2020 | 05/2020 - 06/2020 | 06/2020 - 09/2020 |

Table 1: Basic statistics of dataset breakdown
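The label construction and the date-based split described in Section 4.1 can be sketched as follows. This is a hedged illustration: the minimum comment count of 10 is a hypothetical threshold (the paper only says that low-volume articles are excluded), and the function names and tuple layout are assumptions.

```python
def propensity_label(comment_scores, min_comments=10):
    """Average comment toxicity, or None when the article has too few comments.

    min_comments is a hypothetical cut-off chosen for illustration only.
    """
    if len(comment_scores) < min_comments:
        return None
    return sum(comment_scores) / len(comment_scores)


def chronological_split(articles, ratios=(8, 1, 1)):
    """Split (publish_date, article_id) pairs by date into train/validation/test."""
    ordered = sorted(articles, key=lambda item: item[0])
    total = sum(ratios)
    n = len(ordered)
    train_end = n * ratios[0] // total
    val_end = train_end + n * ratios[1] // total
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# ISO date strings sort chronologically, so plain string comparison is enough here.
articles = [("2020-0%d-01" % month, "article-%d" % month) for month in range(1, 10)]
articles.append(("2019-12-01", "article-0"))
train, val, test = chronological_split(articles)
print(len(train), len(val), len(test))
```

Splitting by date rather than at random keeps every test article later than the training articles, which mirrors the pre-publication use case.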

4.2 Experiment Setup

In Eq. 2, we set both shape-parameter functions $f$ and $g$ to single-layer neural networks. For the embedding $\phi$, we experiment with either Bag of Words (BOW) or BERT embedding (BERT) devlin2019bert. Specifically, we take uni-gram and bi-gram word sequences and compute the corresponding Term Frequency-Inverse Document Frequency (TF-IDF) vectors, which leads to around million tokens for BOW. For BERT, we take the base version and then fine-tune $f$ and $g$ on top of the [CLS] embedding, which ends up with million parameters. If the input text exceeds the maximum length (510, as [CLS] and [SEP] are reserved), we adopt a simple yet effective truncation scheme sun2019fine. Specifically, we empirically select the first 128 and the last 382 tokens for long text. The rationale is that the informative snippets are more likely to reside at the beginning and the end. Batch size is 16 and learning rate is for Adam optimizer kingma2015adam. The resulting models are called BOW-β and BERT-β for short.
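The head-plus-tail truncation described above is easy to express in code. The sketch below assumes the text has already been split into tokens by some tokenizer; the function names are illustrative and the special tokens are written as plain strings.

```python
def head_tail_truncate(tokens, head=128, tail=382):
    """Keep the first `head` and the last `tail` tokens of an over-long sequence."""
    if len(tokens) <= head + tail:
        return list(tokens)
    return list(tokens[:head]) + list(tokens[-tail:])


def build_input(tokens):
    """Add the two reserved special tokens around the (possibly truncated) text."""
    return ["[CLS]"] + head_tail_truncate(tokens) + ["[SEP]"]


long_article = [str(i) for i in range(1000)]
model_input = build_input(long_article)
print(len(model_input))
```

Texts that already fit within 510 tokens pass through unchanged, so the scheme only affects long articles.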

4.3 Baseline Methods and Metrics

We compare with the linear regression method using BOW features, as well as the BERT base model. Both are combined with one of two loss functions, Mean Absolute Error (MAE) or Mean Squared Error (MSE). We call them BOW-MAE, BOW-MSE, BERT-MAE and BERT-MSE, respectively. The experiment settings are the same as for Beta regression.

Since we are interested in identifying articles of high toxicity propensity, we want to make sure that an article with high average toxicity is ranked higher than one with low propensity. Thus in addition to mean absolute error, root mean squared error (RMSE) and AUC@Precision-Recall curves (AUC@PR), we measure performance using two ranking metrics, Kendall coefficient (Kendall) and Spearman’s coefficient (Spearman).
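For readers unfamiliar with the two ranking metrics, the sketch below gives a small standard-library implementation. It assumes the simplest Kendall variant, in which tied pairs count as neither concordant nor discordant, and computes Spearman's coefficient as the Pearson correlation of average ranks. The paper does not state which variants were used, so this is only an illustration.

```python
def kendall_tau(x, y):
    """(concordant - discordant) / number of pairs; ties contribute zero."""
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation between the rank vectors of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


labels = [0.10, 0.20, 0.30, 0.40]
predictions = [0.12, 0.33, 0.25, 0.41]
print(kendall_tau(labels, predictions), spearman_rho(labels, predictions))
```

Both metrics depend only on the ordering of the predictions, which is why they suit the goal of ranking high-propensity articles above low-propensity ones.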

4.4 Results

We perform evaluation on the whole test set and on human labels.

4.4.1 Test Set

Table 2 details the performance comparisons. Overall, Beta regression stands out across different metrics regardless of feature engineering, due to its modeling flexibility. BERT-based methods also outperform BOW ones in terms of feature engineering and representation. This is reasonable, as the former have about 20 times as many parameters as the latter and offer contextual embeddings. Interestingly, the MAE and MSE schemes do not achieve the minimum MAE and RMSE although they are dedicated to this goal, which might result from the limitation of a point estimator.

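| Model | Kendall (val) | Kendall (test) | Spearman (val) | Spearman (test) | MAE (val) | MAE (test) | RMSE (val) | RMSE (test) |
|---|---|---|---|---|---|---|---|---|
| BOW-MAE | 0.332 | 0.314 | 0.488 | 0.464 | 0.076 | 0.081 | 0.095 | 0.100 |
| BOW-MSE | 0.428 | 0.402 | 0.606 | 0.574 | 0.057 | 0.063 | 0.076 | 0.084 |
| BOW-β | 0.437 | 0.413 | 0.617 | 0.589 | 0.056 | 0.061 | 0.075 | 0.081 |
| BERT-MAE | 0.360 | 0.333 | 0.525 | 0.489 | 0.072 | 0.076 | 0.092 | 0.095 |
| BERT-MSE | 0.442 | 0.423 | 0.621 | 0.598 | 0.070 | 0.073 | 0.089 | 0.093 |
| BERT-β | 0.462 | 0.440 | 0.642 | 0.617 | 0.056 | 0.065 | 0.075 | 0.085 |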

Table 2: Performance comparisons on test set

4.4.2 Human Labels

As the labels are derived from a machine, we want a sanity check to ensure that the model decision conforms to human intuition. Namely, when the model classifies an article as having high toxicity propensity, we want to make sure that it correlates well with human judgement. To this end, we divide the test set into 10 equal buckets with an interval of 0.1 and merge the last 4 buckets into [0.6, 1], because far fewer articles have a score above 0.6 (as shown in Fig. 2). There are thus a total of 7 buckets: [0, 0.1), [0.1, 0.2), [0.2, 0.3), [0.3, 0.4), [0.4, 0.5), [0.5, 0.6) and [0.6, 1.0]. We then randomly take samples per bucket, set some aside for human training, and have the remaining ones labelled by the human judges as the benchmark set. We recruit two groups of people for independent annotation, who are required to pick one from five levels (a reasonable balance between smoothness and accuracy for manually labeling toxicity propensity, per the judges' suggestion) to describe the extent to which an article is likely to attract toxic comments: Very Unlikely (VU), Unlikely (U), Neutral (N), Likely (L) and Very Likely (VL). Table 3 is the confusion matrix showing how much the two groups of human judges agree with each other.

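| G1 \ G2 | VU | U | N | L | VL | Total |
|---|---|---|---|---|---|---|
| VU | 89 | 28 | 0 | 23 | 1 | 141 |
| U | 30 | 26 | 0 | 37 | 3 | 96 |
| N | 0 | 1 | 0 | 3 | 1 | 5 |
| L | 31 | 25 | 0 | 34 | 56 | 146 |
| VL | 18 | 21 | 0 | 87 | 124 | 250 |
| Total | 168 | 101 | 0 | 184 | 185 | 638 |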

Table 3: Confusion matrix of two groups of human annotation G1 and G2
Figure 1: PR curves on human labelled data.
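The bucketing used to draw the human evaluation set can be sketched as below. The bucket edges follow the seven intervals listed in the text; the per-bucket sample size, the random seed and the function names are hypothetical placeholders, since the source does not preserve the actual counts.

```python
import bisect
import random
from collections import defaultdict

# Upper edges of the first six buckets; everything from 0.6 upward is merged.
BUCKET_EDGES = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]


def bucket_index(score):
    """Map a score in [0, 1] to one of seven buckets (0 = [0, 0.1), 6 = [0.6, 1])."""
    return bisect.bisect_right(BUCKET_EDGES, score)


def sample_per_bucket(scored_articles, per_bucket, seed=0):
    """Randomly draw up to `per_bucket` article ids from every bucket."""
    buckets = defaultdict(list)
    for article_id, score in scored_articles:
        buckets[bucket_index(score)].append(article_id)
    rng = random.Random(seed)
    return {
        index: rng.sample(ids, min(per_bucket, len(ids)))
        for index, ids in sorted(buckets.items())
    }


scored = [("a%d" % i, i / 100) for i in range(100)]
picked = sample_per_bucket(scored, per_bucket=3)
print({index: len(ids) for index, ids in picked.items()})
```

Sampling the same number of articles from each bucket gives the sparse high-score range the same weight as the crowded low-score range in the benchmark set.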

Interestingly, even humans do not agree with each other on all examples. Roughly, the perception of the two groups is consistent on 474 of the 638 samples, about 74% (top left and bottom right boxes in Table 3). Moreover, Cohen's Kappa is about by taking expected chance agreement into account. In light of this, we jointly score the set by assigning -2, -1, 0, 1 and 2 to VU, U, N, L and VL, respectively. Since each article has two labels, the addition gives an integer score interval [-4, 4]. Table 4 reports the performance with human labels as the ground truth, which confirms the previous finding that BERT-β performs the best. Additionally, we pick scores 2, 3 and 4 as thresholds to monitor precision and recall curves (Fig. 1). Likewise, the proposed schemes achieve compelling performance across these thresholds.

Taken together, our probabilistic methods agree more with both machine and human judgements.

| Model | Kendall | Spearman |
|---|---|---|
| BOW-MAE | 0.402 | 0.562 |
| BOW-MSE | 0.481 | 0.635 |
| BOW-β | 0.491 | 0.649 |
| BERT-MAE | 0.441 | 0.599 |
| BERT-MSE | 0.508 | 0.665 |
| BERT-β | 0.522 | 0.679 |

Table 4: Performance on human labelled set
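The joint human score and the chance-corrected agreement between the two annotation groups can be sketched as follows. Two assumptions are made: the five levels map to -2, -1, 0, 1 and 2, which is consistent with the stated joint range [-4, 4], and Cohen's Kappa is computed in its plain unweighted five-level form. The paper may have grouped or weighted the levels differently, so the printed value should not be read as the paper's figure.

```python
LEVEL_SCORE = {"VU": -2, "U": -1, "N": 0, "L": 1, "VL": 2}


def joint_score(label_g1, label_g2):
    """Sum of the two groups' level scores, an integer in [-4, 4]."""
    return LEVEL_SCORE[label_g1] + LEVEL_SCORE[label_g2]


def cohen_kappa(confusion):
    """Unweighted Cohen's Kappa from a square confusion matrix of counts."""
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Counts from Table 3 (rows: G1, columns: G2, order VU, U, N, L, VL).
TABLE_3 = [
    [89, 28, 0, 23, 1],
    [30, 26, 0, 37, 3],
    [0, 1, 0, 3, 1],
    [31, 25, 0, 34, 56],
    [18, 21, 0, 87, 124],
]

# Agreement on the coarse "unlikely" (VU, U) and "likely" (L, VL) boxes.
coarse_agreement = sum(TABLE_3[i][j] for i in (0, 1) for j in (0, 1)) + sum(
    TABLE_3[i][j] for i in (3, 4) for j in (3, 4)
)
print(coarse_agreement, round(cohen_kappa(TABLE_3), 3))
```

A joint score near 4 means both groups judged the article very likely to attract toxic comments, which is what the thresholded precision-recall curves build on.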

4.5 Explanation

As we focus on pre-publication text moderation, a reasonable explanation is an essential step to convince stakeholders of subsequent operations. For BERT-β explanation, we adopt gradient-based saliency map variants from computer vision simonyan2013deep; shrikumar2017learning. We compute the gradient $\partial \hat{y} / \partial E$ with respect to the input token embeddings $E = [e_1, \dots, e_T]$, where $\hat{y}$ is the mean prediction for the sample (Section 3) and $e_t$ is the embedding of a single token. Each element of $\partial \hat{y} / \partial e_t$ is a partial derivative that measures the token-level contribution to the scoring. The explanation is conducted under the assumption that the article is controversial, and we want to figure out which words cause some comments to be toxic. It therefore also makes sense to consider the maximum toxicity of the comments. We thus experiment with $(\alpha - 1) / (\alpha + \beta - 2)$, which is the mode (corresponding to the peak in the pdf of the Beta distribution) under the reasonable assumption $\alpha, \beta > 1$. We denote the resulting scheme by the subscript "mode".

For the saliency map (SM) simonyan2013deep, the metric is the gradient magnitude $\lVert \partial \hat{y} / \partial e_t \rVert$, where $\hat{y}$ is the prediction and $e_t$ the embedding of token $t$; it carries no direction. A variant is the dot product (DP) $e_t^{\top} \, \partial \hat{y} / \partial e_t$ between the token embedding and the gradient, which does carry direction shrikumar2017learning. We also propose a hybrid (HB) scheme that takes the magnitude of SM and the direction of DP to form a new metric. In addition, we perform an ablation study (AS) that deletes one token at a time and computes the score discrepancy between the original prediction and the prediction without that token. As a reference, we examine the regression coefficients (RC) of linear BOW-MSE, which are easy to inspect for explaining the contribution of the corresponding words.
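The three gradient-based token metrics can be sketched with plain Python lists. The sketch assumes that the per-token gradients have already been obtained from an automatic differentiation framework and that SM uses the Euclidean norm; both points, and all names below, are assumptions made for illustration.

```python
import math


def saliency_magnitude(grad):
    """SM: Euclidean norm of the gradient for one token (no direction)."""
    return math.sqrt(sum(g * g for g in grad))


def dot_product(embedding, grad):
    """DP: signed contribution from the embedding-gradient inner product."""
    return sum(e * g for e, g in zip(embedding, grad))


def hybrid(embedding, grad):
    """HB: magnitude taken from SM, sign taken from DP."""
    sign = 1.0 if dot_product(embedding, grad) >= 0 else -1.0
    return sign * saliency_magnitude(grad)


def token_scores(embeddings, grads):
    """Return (SM, DP, HB) for every token of one article."""
    return [
        (saliency_magnitude(g), dot_product(e, g), hybrid(e, g))
        for e, g in zip(embeddings, grads)
    ]


# Two toy tokens with 2-dimensional embeddings and gradients.
embeddings = [[1.0, 0.0], [-1.0, 0.0]]
grads = [[3.0, 4.0], [3.0, 4.0]]
print(token_scores(embeddings, grads))
```

The toy tokens share the same gradient, so SM cannot tell them apart, whereas DP and HB assign them opposite signs.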

A few well-trained human judges are recruited to tag the $k$ most important words, where $k$ is example-specific and determined by the annotators. We then rank tokens by the different metrics and pick the top $k$ as candidates. Hit rate (the proportion of human annotated tokens covered by a scheme) is used to compare the different tools. We take examples for human review and compute the average hit rate, as compared in Table 5.
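The hit rate can be computed as in the sketch below, which assumes that the number of candidates equals the number of words tagged by the annotators for that example. The function name and the toy sentence are illustrative only.

```python
def hit_rate(tokens, scores, human_tokens):
    """Share of human-tagged tokens found among the top-scored tokens.

    The number of candidates equals the number of human-tagged tokens.
    """
    k = len(human_tokens)
    if k == 0:
        raise ValueError("at least one human-tagged token is required")
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    candidates = {tokens[i] for i in ranked[:k]}
    return len(candidates & set(human_tokens)) / k


tokens = ["the", "riot", "was", "shocking"]
scores = [0.1, 0.9, 0.2, 0.4]
print(hit_rate(tokens, scores, {"riot", "was"}))
```

Averaging this value over all reviewed examples gives one number per explanation scheme, which is how the schemes are compared.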

All schemes for BERT-β are much better than the linear scheme RC, which is consistent with the discrepancy in predictive performance. SM and HB are close and outperform the black-box ablation study, which implies a valuable role for model-aware gradients in the explanation. DP is inferior to AS and is less consistent with human annotation than the other gradient-based methods. In practice, we take SM for the explanation (Appendix B) due to its strong performance and simplicity. As expected, mode (SM) covers more annotated words than mean (SM) on average (more discussion in Appendix C).

| SM | DP | HB | AS | RC | SM (mode) |
|---|---|---|---|---|---|
| 0.549 | 0.430 | 0.543 | 0.467 | 0.382 | 0.553 |
Table 5: Performance (average hit rate) comparison

5 Additional Study

Linear regression (BOW-MSE) is inferior to BERT-β. Nonetheless, it is much faster in training, inference and explanation, as it is about 20 times smaller than BERT-β. Thus, we investigate whether the performance of the linear model can be improved for industrial deployment.

Inspired by NBSVM wang2012baselines, we scale the TF-IDF vectors of BOW-MSE by a weight vector that is computed over the training corpus from the training labels and the TF-IDF matrix, together with their column-wise means. The pre-computed weight vector can be viewed as a surrogate of the regression coefficient for the linear regression problem, and it is used to scale the TF-IDF features of BOW-MSE in both the training and inference phases. We call it Naive Bayes Linear Regression (NBLR) for short.
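The exact definition of the NBLR weight vector is not preserved in this text, so the sketch below should be read as one plausible instantiation rather than the authors' formula. It assumes that each weight is the univariate least-squares slope of the label on a single mean-centered feature, which matches the description of a pre-computed surrogate for the regression coefficient. All function names are hypothetical.

```python
def surrogate_weights(rows, labels):
    """Per-feature slope of the label on one centered feature (assumed form)."""
    n = len(rows)
    dim = len(rows[0])
    label_mean = sum(labels) / n
    col_means = [sum(row[j] for row in rows) / n for j in range(dim)]
    weights = []
    for j in range(dim):
        cov = sum((row[j] - col_means[j]) * (y - label_mean) for row, y in zip(rows, labels))
        var = sum((row[j] - col_means[j]) ** 2 for row in rows)
        # A constant feature carries no signal, so it gets zero weight.
        weights.append(cov / var if var > 0 else 0.0)
    return weights


def scale_features(row, weights):
    """Element-wise scaling applied in both training and inference."""
    return [x * w for x, w in zip(row, weights)]


# Toy TF-IDF-like matrix with two features and three articles.
rows = [[1.0, 0.0], [2.0, 0.0], [3.0, 1.0]]
labels = [0.1, 0.2, 0.3]
weights = surrogate_weights(rows, labels)
print(weights, scale_features(rows[2], weights))
```

Because the weights are computed once from the training data, the scaling adds no cost at inference beyond one element-wise multiplication.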

The scaling benefits the performance, as compared in Table 6. NBLR improves upon BOW-MSE on every reported metric, although it is not as good as BERT-β.

width=1 Test Set Human Label Kendall Spearman Kendall Spearman BOW-MSE 0.402 0.574 0.481 0.635 NBLR 0.413 (+.011) 0.588 (+.014) 0.501 (+.020) 0.656 (+.021) BERT- 0.440 0.617 0.522 0.679 Human Label AUC@PR at 2 AUC@PR at 3 AUC@PR at 4 BOW-MSE 0.77 0.74 0.48 NBLR 0.78 (+.01) 0.76 (+.02) 0.53 (+.05) BERT- 0.80 0.77 0.59

Table 6: Performance on test set and human labels

6 Discussion and Future Work

Our work can benefit text moderation. The proactive propensity offers a toxicity outlook for comments, which could be utilized in multiple ways. For example, stricter moderation rules could be enforced for articles that are predicted to have a high toxicity propensity. Furthermore, the propensity could be used as an additional feature for downstream reactive toxicity recognition models, as well as for the allocation of appropriate human resources.

The explanation tool can also be used to remind editors to rephrase some controversial words to mitigate the odds of attracting toxic comments. Text moderation is an important yet challenging task, and our proactive work attempts to open up a new perspective to augment the traditional reactive procedure. Our current model, however, is not perfect, as shown by article b in Fig. 3 of Appendix A, where the learned distribution does not fit the observed histogram well. Technically, NBLR is an encouraging lightweight extension to linear regression. Likewise, we will continue to work towards the improvement of the non-linear Beta regression.

7 Conclusion

We approach text moderation by developing a well-motivated probabilistic model to learn a proactive toxicity propensity. An explanation scheme is also proposed to visually explain the connection between this new prospective score and the text content. Our experiments show the superior performance of the proposed BERT-β algorithm, compared with a number of baselines, in predicting both the average toxicity score and the human judgement.


Appendix A Toxicity Score and Beta Distribution

The distribution of news articles’ toxicity propensity score is reported in Fig. 2. Comment score distributions of two articles with predictive distribution are given in Fig. 3.

Figure 2: Toxicity propensity score (mean comment toxicity scores) distribution of news articles.
Figure 3: Toxicity score histogram density of comments for articles a (top) and b (bottom). Solid red lines represent predictive beta distribution for individual articles.

Appendix B SM Explanation Examples

We pick two samples from the test set and then apply SM (Section 4.5) to highlight key words for illustration, as shown in Fig. 4. The color intensity is proportional to the normalized saliency map value. The darker the color of a token, the more important it is to the scoring. There is also a positional bias towards the first sentence, as it is the article title.

Figure 4: Model explanation examples.

Appendix C BERT-β Mode

We also explore the mode of BERT-β as a point estimator and compare it with the mean. Table 7 details the performance discrepancy between the test set and human labels. For the toxicity propensity prediction on the test set, it makes sense for the mean to slightly outperform the mode, as the ground-truth labels are the mean scores of comments. When it comes to human labels and explanation, people annotate news articles based on the perceived controversial words most likely to incur toxic comments. The mode is thus able to capture the worst case better and agrees more with human annotations. This finding is in line with the better explanation performance, as compared in Table 5.

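| Estimator | Test Set Kendall | Test Set Spearman | Test Set MAE | Test Set RMSE | Human Label Kendall | Human Label Spearman |
|---|---|---|---|---|---|---|
| mean | 0.440 | 0.617 | 0.065 | 0.085 | 0.522 | 0.679 |
| mode | 0.439 | 0.614 | 0.077 | 0.099 | 0.543 | 0.704 |

| Estimator (Human Label) | AUC@PR at 2 | AUC@PR at 3 | AUC@PR at 4 |
|---|---|---|---|
| mean | 0.80 | 0.77 | 0.59 |
| mode | 0.82 | 0.79 | 0.63 |

Table 7: Performance of BERT-β point estimators on test set and human labels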
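The two point estimators compared in this appendix differ only in how they summarize the predicted Beta distribution. The short sketch below uses the standard closed forms for the mean and the mode; the function names and the example parameters are illustrative assumptions.

```python
def beta_mean(alpha, beta):
    """Mean of Beta(alpha, beta)."""
    return alpha / (alpha + beta)


def beta_mode(alpha, beta):
    """Mode of Beta(alpha, beta); only defined here for alpha, beta > 1."""
    if alpha <= 1 or beta <= 1:
        raise ValueError("the interior mode requires alpha > 1 and beta > 1")
    return (alpha - 1) / (alpha + beta - 2)


# The estimators coincide for symmetric shapes and differ for asymmetric ones.
for alpha, beta in [(3.0, 3.0), (2.0, 6.0)]:
    print(alpha, beta, beta_mean(alpha, beta), beta_mode(alpha, beta))
```

Since both estimators are computed from the same predicted parameters, switching between them requires no retraining.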