Civil Rephrases Of Toxic Texts With Self-Supervised Transformers

02/01/2021 ∙ by Leo Laugier, et al. ∙ Google, Télécom Paris, Athens University of Economics and Business

Platforms that support online commentary, from social networks to news sites, are increasingly leveraging machine learning to assist their moderation efforts. But this process does not typically provide feedback to the author that would help them contribute according to the community guidelines. This is prohibitively time-consuming for human moderators to do, and computational approaches are still nascent. This work focuses on models that can help suggest rephrasings of toxic comments in a more civil manner. Inspired by recent progress in unpaired sequence-to-sequence tasks, a self-supervised learning model is introduced, called CAE-T5. CAE-T5 employs a pre-trained text-to-text transformer, which is fine-tuned with a denoising and cyclic auto-encoder loss. Experimenting with the largest toxicity detection dataset to date (Civil Comments), our model generates sentences that are more fluent and better at preserving the initial content than earlier text style transfer systems, as measured by several scoring systems and human evaluation.




1 Introduction

There are many ways to express our opinions. When we exchange views online, we do not always immediately measure the emotional impact of our message. Even when the opinions expressed are legitimate, well-intentioned and constructive, a poor phrasing may make the conversation go awry Zhang2018

. Recently, Natural Language Processing (NLP) research has tackled the problem of abusive language detection by developing accurate classification models that flag toxic (or abusive, offensive, hateful) comments

(davidson2017automated; Pavlopoulos2017b; Wulczyn2017; Gamback2017; fortuna2018survey; Zhang2018; 10.1371/journal.pone.0203794; Zampieri2019).

Input offensive comment
you now have to defend this clown along with his russian corruption.
Generated civil comment you now have to defend this guy from his russian ties……..

Input offensive comment
blaming trudeau and the government is just stupid.
Generated civil comment blaming trudeau and the liberal government is just wrong.

Input offensive comment
dubya (a nickname for George W. Bush) was a moron.
Generated civil comment dubya was a republican.

Table 1: Examples of offensive sentences from the Civil Comments test set and the more civil rephrasing generated by our model. The third example shows that its strategy may involve shifting the original intent, since “republican” is not a non-offensive synonym of “moron”.

The prospect of healthier conversations, nudged by Machine Learning (ML) systems, motivates the development of Natural Language Understanding and Generation (NLU and NLG) models that could later be integrated in a system suggesting alternatives to vituperative comments before they are posted. A first approach would be to train a text-to-text model (bahdanau2014neural; vaswani2017attention) on a corpus of parallel comments where each offensive comment has a courteous and fluent rephrasing written by a human annotator. However, such a solution requires a large paired labeled dataset, in practice difficult and expensive to collect (see Section 4.5). Consequently, we limit our setting to the unsupervised case where the comments are only annotated in attributes related to toxicity, such as the Civil Comments dataset (DBLP:journals/corr/abs-1903-04561). We summarize our investigations with the following research question:

RQ: Can we fine-tune end-to-end a pre-trained text-to-text transformer to suggest civil rephrasings of rude comments using a dataset solely annotated in toxicity?

Answering this question might provide researchers with an engineering proof-of-concept that would enable further exploration of the many complex questions that arise from such a tool being used in conversations. The main contributions of this work are the following:


  • We addressed the task of unsupervised civil rephrasing of toxic texts, only the second work to do so and the first to rely on the Civil Comments dataset, and achieved results that reflect the effectiveness of our model over baselines.

  • We developed a non-task specific approach (i.e. with no human hand-crafting in its design) that can be generalized and later applied to related and/or unexplored attribute transfer tasks.

While several of the ideas we combine in our model have been studied independently, to the best of our knowledge, no existing unsupervised models combine sequence-to-sequence bi-transformers, transfer learning from large pre-trained models, and self-supervised fine-tuning (denoising auto-encoder and cycle consistency). We discuss the related work introducing these tools and techniques in the following section.

2 Related work

Unsupervised complex text attribute transfer (like civil rephrasing of toxic comments) remains in its early stages, and our particular applied task has only a single antecedent (nogueira-dos-santos-etal-2018-fighting). Nevertheless, a wide variety of prior work is relevant to the task, and this section summarizes the most important of it. We first describe the recent strategies (such as attention mechanisms, bahdanau2014neural) that led to significant progress in supervised NLU and NLG tasks, then present the most closely related lines of work in unsupervised text-to-text tasks.

2.1 Transformers are state-of-the-art architectures in NLP

To avoid confusion, we denote as bi-transformer the original encoder-decoder transformer, whereas encoder-only and decoder-only models are called uni-transformers here.


vaswani2017attention showed that transformer architectures, based on attention mechanisms, achieve state-of-the-art results when applied to supervised Neural Machine Translation (NMT). More generally, transformers have proven capable in various NLP and speech tasks

(8462506; huang2018music; le2019flaubert; li2019neural). Moreover, transformers benefit from pre-training before being fine-tuned on downstream tasks (Devlin2018; dai2019transformer; yang2019xlnet; conneau2019cross; raffel2019exploring). Subsequent research has adopted uni-transformers in many supervised classification and regression tasks (Devlin2018) and in unsupervised language modeling (radford2019language; keskarCTRL2019; Dathathri2020Plug), until raffel2019exploring proposed a unified pre-trained bi-transformer applicable to any text classification, text regression and text-to-text task. Further, recent works tackle the language detoxification of unconditional language models (KrauseGeDi2020; gehman2020realtoxicityprompts).

2.2 Unsupervised losses enable training text-to-text models end-to-end

After the success of unsupervised image-to-image style transfer in computer vision (CV), several approaches have addressed unsupervised text-to-text tasks. Unsupervised Neural Machine Translation (UNMT) is perhaps the most prominent of them.

artetxe2018unsupervised; conneau2017word; lample2017unsupervised; lample2018phrase; conneau2019cross introduced methods based on techniques aligning the embedding spaces of monolingual datasets and tricks such as denoising auto-encoding losses (10.1145/1390156.1390294) and back-translation (sennrich2015improving; edunov2018understanding).

Abstractive summarization (or sentence compression) is also studied in unsupervised settings. baziotis2019seq3 trained a model with a compressor-reconstructor strategy similar to back-translation while liu2019summae trained a denoising auto-encoder that embeds sentences and paragraphs in a common space.

Unsupervised attribute transfer is the task most related to our work. It mainly focuses on sentiment transfer with standard review datasets (maas-etal-2011-learning; he2016ups; shen2017style; li-etal-2018-delete), but also addresses sociolinguistic datasets containing text in various registers (gan2017semantic; rao-tetreault-2018-dear) or with different identity markers (voigt-etal-2018-rtgender; prabhumoye-etal-2018-style; lample2018multipleattribute). When paraphrase generation aims at being explicitly attribute-invariant, it is referred as obfuscation or neutralization (emmery-etal-2018-style; xu-etal-2019-privacy; pryzant2020automatically). Literary style transfer (xu-etal-2012-paraphrasing; pang2019unsupervised) has also been tackled by recent work. Here, we apply attribute transfer to a large dataset annotated in toxicity, but we also use the Yelp review dataset from shen2017style for comparison purposes (see Section 4).

Initial unsupervised attribute transfer approaches sought to build a shared and attribute-agnostic latent representation encoding for the input sentence, with adversarial training. Then, a decoder, aware of the destination attribute, generated a transferred sentence shen2017style; hu2017toward; fu2018style; zhang2018style; xu2018unpaired; john-etal-2019-disentangled.

Unsupervised attribute transfer approaches that do not rely on a latent space are also present in literature. li-etal-2018-delete

assumed that style markers are very local and proposed to delete the tokens most conveying the attribute, before retrieving a second sentence in the destination style. They eventually combined both sentences with a neural network.

lample2018multipleattribute applied UNMT techniques from conneau2019cross to several attribute transfer tasks, including social media datasets. xu2018unpaired; gong2019reinforcement; luo2019dual; wu2019hierarchical

trained models with reinforcement learning.

dai-etal-2019-style introduced unsupervised training of a transformer called StyleTransformer (ST) with a discriminator network. Our approach differs from these unsupervised attribute transfer models in that none of them leverages large pre-trained transformers or trains with a denoising objective.

The most similar work to ours is nogueira-dos-santos-etal-2018-fighting, who trained for the first time an encoder-decoder rewriting offensive sentences in a non-offensive register with non-parallel data from Twitter (ritter-etal-2010-unsupervised) and Reddit (serban2017deep). Our approach differs in the following aspects. First, we use transformers pre-trained on a large corpus instead of randomly initialized RNNs for encoding and decoding. Second, their approach involves collaborative classifiers to penalize generation when the attribute is not transferred, while we train end-to-end with a denoising auto-encoder. Although their model shows high accuracy scores, it suffers from low fluency, with offensive words often replaced by a placeholder (e.g. “big” instead of “f*cking”).

As underlined by lample2018multipleattribute, applying Generative Adversarial Networks (GANs) (zhu2017unpaired) to NLG is not straightforward, because generating text implies a sampling operation that is not differentiable. Consequently, as long as text is represented by discrete tokens, loss gradients computed with a classifier cannot be back-propagated without tricks such as the REINFORCE algorithm (he2016dual) or the Gumbel-Softmax approximation (baziotis2019seq3), which can be slow and unstable. Besides, controlled text generation (ficler-goldberg-2017-controlling; keskarCTRL2019; le2019flaubert; Dathathri2020Plug) is an NLG task in which a language model is conditioned on attributes of the generated text, such as style. A major difference from attribute transfer, however, is the absence of any constraint on preserving the input's content.

3 Method

3.1 Formalization of the attribute text rewriting problem

Let $D_t$ and $D_c$ be our two non-parallel corpora of comments satisfying the respective attributes “toxic” and “civil”. Let $D = D_t \cup D_c$. We aim at learning a parametric function $f_\theta$ mapping a pair of source sentence $x$ and destination attribute $a'$ to a fluent sentence $y$ satisfying $a'$ and preserving the meaning of $x$. In our case, there are two attributes, “toxic” and “civil”, which we assume to be mutually exclusive. We denote $a(x)$ the attribute of $x$ and $\bar{a}(x)$ the other attribute (for instance, when $a(x) =$ “civil”, then $\bar{a}(x) =$ “toxic”). Note that $f_\theta(x, a(x))$ can simply be $x$.

3.2 Our approach is based on bi-conditional encoder-decoder generation

Our approach is to train an autoregressive (AR) language model (LM) conditioned on both the input text $x$ and the destination attribute $a'$.

We compute $p_\theta(y \mid x, a')$ with a LM. As we do not have access to ground-truth targets $y$, we propose in Section 3.3 a training objective that we assume is maximized if and only if $y$ is a fluent sentence with attribute $a'$ that preserves $x$'s content. Additionally, we use an AR generative model, where inference of $y$ is sequential and the token $y_t$ generated at step $t$ depends on the tokens generated at previous steps: $p_\theta(y \mid x, a') = \prod_{t} p_\theta(y_t \mid y_{<t}, x, a')$.

To condition on the input text, we follow the work of bahdanau2014neural; vaswani2017attention; nogueira-dos-santos-etal-2018-fighting; conneau2019cross; lample2018multipleattribute; dai-etal-2019-style; liu2019summae; raffel2019exploring and opt for an encoder-decoder framework. lample2018multipleattribute; dai-etal-2019-style argue that in unsupervised attribute rewriting tasks, encoders do not necessarily output disentangled representations independent of the input's attribute. However, the t-SNE visualization of the latent space in liu2019summae suggests that, with similar training, encoders can output a latent representation $z$ attending to content rather than to an attribute.

The LM is conditioned on the destination attribute with control codes introduced by keskarCTRL2019. A control code $c(a')$ is a fixed sequence of tokens prepended to the decoder's input, supposed to prepare generation in the space of sentences with the destination attribute $a'$.
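The control-code mechanism can be sketched as follows. This is an illustrative example, not the authors' implementation: the `<civil>`/`<toxic>` control-code tokens are hypothetical placeholders for whatever fixed token sequences the model actually uses.

```python
# Hypothetical control-code token sequences, one per attribute.
CONTROL_CODES = {
    "civil": ["<civil>"],
    "toxic": ["<toxic>"],
}

def decoder_input(target_tokens, destination_attribute):
    """Prepend the control code of the destination attribute to the
    decoder's input token sequence, steering generation toward that
    attribute's space of sentences."""
    return CONTROL_CODES[destination_attribute] + list(target_tokens)
```

At inference time the decoder thus always starts from the destination attribute's control code before generating the transferred sentence token by token.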

3.3 Training the encoder-decoder with an unsupervised objective

Denoising objectives to train transformers are an effective self-supervised strategy. Devlin2018; yang2019xlnet pre-trained a uni-transformer encoder as a masked language model (MLM) to teach the system general-purpose representations, before fine-tuning on downstream tasks. conneau2019cross; lample2018multipleattribute; song2019mass; liu2019summae; raffel2019exploring explore various deshuffling and denoising objectives to pre-train or fine-tune bi-transformers.

During training, we corrupt the encoder's input with the noise function $\eta$ from Devlin2018: $\eta$ masks tokens randomly with probability 15%; a masked position is then replaced by a random token from the vocabulary with probability 10%, or by a sentinel (a shared mask token) with probability 90%. We train the model as a denoising auto-encoder (DAE), meaning that we minimize the negative log-likelihood

$\mathcal{L}_{DAE}(\theta) = -\log p_\theta(x \mid \eta(x), a(x)).$
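The corruption function $\eta$ described above can be sketched as follows; this is an illustrative re-implementation of the BERT-style noising procedure, not the authors' code, and the sentinel string and sampling vocabulary are assumptions.

```python
import random

def noise(tokens, mask_prob=0.15, random_prob=0.10,
          sentinel="<mask>", vocab=None, rng=None):
    """Corrupt a token sequence: each token is selected for masking
    with probability 15%; a selected position becomes a random
    vocabulary token with probability 10%, otherwise the shared
    sentinel token."""
    rng = rng or random.Random(0)
    vocab = vocab or list(tokens)  # fallback sampling pool (assumption)
    out = []
    for tok in tokens:
        if rng.random() < mask_prob:
            if rng.random() < random_prob:
                out.append(rng.choice(vocab))
            else:
                out.append(sentinel)
        else:
            out.append(tok)
    return out
```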

The hypothesis is that optimizing the DAE objective teaches the controlled generation to the model.

Inspired by an equivalent approach in unsupervised image-to-image style transfer (zhu2017unpaired), we add a cycle-consistency (CC) objective (nogueira-dos-santos-etal-2018-fighting; edunov2018understanding; prabhumoye-etal-2018-style; lample2018multipleattribute; conneau2019cross; dai-etal-2019-style):

$\mathcal{L}_{CC}(\theta) = -\log p_\theta(x \mid f_\theta(x, \bar{a}(x)), a(x)),$

which enforces content preservation in the generated prediction. As the cycle-consistency objective computes a non-differentiable AR pseudo-prediction $f_\theta(x, \bar{a}(x))$ during stochastic gradient descent training, gradients are not back-propagated through the pseudo-prediction at that training step.

Finally, the loss function sums the DAE and the CC objectives with weighting coefficients:

$\mathcal{L}(\theta) = \lambda_{DAE}\,\mathcal{L}_{DAE}(\theta) + \lambda_{CC}\,\mathcal{L}_{CC}(\theta).$
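One training step under the combined objective can be sketched as follows. The `model.nll`, `model.generate` helpers and the `noise_fn` argument are hypothetical names standing in for the fine-tuned T5's loss and (detached) autoregressive decoding; this is pseudocode-level illustration, not the authors' implementation.

```python
def training_step(model, noise_fn, x, attr, other_attr,
                  lambda_dae=1.0, lambda_cc=1.0):
    """Sketch of one self-supervised step combining the DAE and
    CC objectives (hypothetical helper API)."""
    # DAE term: reconstruct x from its corrupted version, conditioned
    # on x's own attribute.
    dae_loss = model.nll(target=x, source=noise_fn(x), attribute=attr)

    # CC term: first produce the non-differentiable AR pseudo-prediction
    # in the other attribute (treated as a constant, no gradients flow
    # through it), then reconstruct x back from it.
    y_pseudo = model.generate(source=x, attribute=other_attr)
    cc_loss = model.nll(target=x, source=y_pseudo, attribute=attr)

    # Weighted sum of the two objectives.
    return lambda_dae * dae_loss + lambda_cc * cc_loss
```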

3.4 The text-to-text bi-transformer architecture

The architectures for the encoder and decoder are uni-transformers. Contrary to vaswani2017attention; conneau2019cross; raffel2019exploring, we do not keep the decoder layers computing cross-attention between the encoder's outputs and the decoder's hidden variables, because generation then suffers from too much conditioning on the input sentence and we observe no significant change in the output sentence. Rather, we follow liu2019summae and compute the latent representation $z$ with an affine transformation of the encoder's hidden state corresponding to the first token of the input text. Let $x = (x_1, \dots, x_n)$ be the input sequence of tokens. It is embedded, then encoded by the uni-transformer encoder.

$z$ is an aggregate sequence representation for the input. Different heuristics can be used to integrate it into the decoder. We considered summing $z$ with the embedding of each token of the uni-transformer decoder's input, since this balances the back-propagation of the signals coming from the original input and from the output being generated in the destination attribute space, and it worked well in practice in our experiments.
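The latent-summing heuristic amounts to a simple broadcast addition; a shape-level sketch (illustrative only, with assumed `(seq_len, d_model)` / `(d_model,)` shapes):

```python
import numpy as np

def condition_decoder_inputs(token_embeddings, z):
    """Sum the aggregate latent representation z (one vector per
    sequence) with the embedding of every decoder input token.
    token_embeddings: (seq_len, d_model); z: (d_model,)."""
    return token_embeddings + z[None, :]  # broadcast over positions
```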

In addition, the encoder and decoder uni-transformers share the same embedding layer, and the LM head is tied to the embeddings.

Except for the dense layer computing the latent variable $z$, all parameters come from the pre-trained bi-transformer published by raffel2019exploring. Thus, our DAE and CC objectives fine-tune T5's parameters, which is why we call our model a conditional auto-encoder text-to-text transfer transformer (CAE-T5).

4 Experiments

4.1 Datasets

We employed the largest publicly available toxicity detection dataset to date, which was used in the ‘Jigsaw Unintended Bias in Toxicity Classification’ Kaggle challenge. The 2M comments of the Civil Comments dataset stem from a commenting plugin for independent news sites. They were created from 2015 to 2017 and appeared on approximately 50 English-language news sites across the world. Each of these comments was annotated by crowd raters (at least 3 each) for toxicity and toxicity subtypes (DBLP:journals/corr/abs-1903-04561).

Following the work of dai-etal-2019-style for the IMDB Movie Review dataset (positive/negative sentiment labels), we constructed a sentence-level version of the dataset. First, we fine-tuned a pre-trained BERT (Devlin2018) toxicity classifier on the Civil Comments dataset. Then, we split the comments into sentences with NLTK's sentence tokenizer. Finally, we created the toxic (respectively civil) subset from sentences whose system-generated toxicity score (using our BERT classifier) is greater than a high threshold (respectively lower than a low threshold), to increase the dataset's polarity. The test ROC-AUC of the toxicity classifier is with a precision of and a recall of . Even with this low recall, the toxic subset is large enough (approx. 90k sentences, see Table 2).
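The polarization step above can be sketched as follows. The scoring function, tokenizer, and threshold values are passed in as parameters because the paper's exact thresholds are omitted here; in the paper, the scorer is the fine-tuned BERT classifier and the tokenizer is NLTK's sentence tokenizer.

```python
def polarize(comments, toxicity_score, sent_tokenize, hi, lo):
    """Split comments into sentences and keep only confidently toxic
    (score > hi) or confidently civil (score < lo) sentences,
    discarding the ambiguous middle to increase dataset polarity."""
    toxic, civil = [], []
    for comment in comments:
        for sent in sent_tokenize(comment):
            s = toxicity_score(sent)
            if s > hi:
                toxic.append(sent)
            elif s < lo:
                civil.append(sent)
    return toxic, civil
```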

We also conducted a comparison to other style transfer baselines on the Yelp Review Dataset (Yelp), commonly used to compare unsupervised attribute transfer systems. It consists of restaurant and business reviews annotated with a binary positive/negative label. shen2017style processed it and li-etal-2018-delete collected human references for the test set. Table 2 shows statistics for these datasets.

Dataset Yelp Polar Civ. Com.
Attribute Positive Negative Toxic Civil
Train 266,041 177,218 90,293 5,653,785
Dev 2,000 2,000 4,825 308,130
Test 500 500 4,878 305,267
Av. len. 11.0 13.0 19.4 21.9
Table 2: Statistics for the Yelp dataset and the processed version of the Civil Comments dataset. Average lengths are the average numbers of SentencePiece tokens.

4.2 Evaluation

Evaluating a text-to-text task is challenging, especially when no gold pairs are available. Attribute transfer is successful if generated text: 1) has the destination control attribute, 2) is fluent and 3) preserves the content of the input text.

4.2.1 Automatic evaluation

We follow the current approach of the community (NIPS2018_7959; logeswaran2018content; wang2019controllable; xu2019variational; lample2018multipleattribute; dai-etal-2019-style; he2020probabilistic) and approximate the three criteria with the following metrics:


  1. Attribute control: Accuracy (ACC) computes the rate of successful changes in attribute. It measures how well generation is conditioned on the destination attribute. We predict toxic and civil attributes with the same fine-tuned BERT classifier that pre-processed the Civil Comments dataset (with a single classification threshold).

  2. Fluency: Fluency is measured by perplexity (PPL). To measure PPL, we employed GPT2 (radford2019language) LMs fine-tuned on the corresponding datasets (Civil Comments and Yelp).

  3. Content preservation: Content preservation is the most difficult aspect to measure. UNMT (conneau2019cross), summarization (liu2019summae) and sentiment transfer (li-etal-2018-delete) have access to a few hundred samples with at least one human reference of the transferred text, and evaluate content preservation by computing metrics based on matching words (e.g., BLEU, papineni-etal-2002-bleu) between the generated prediction and the reference(s) (ref-metric). However, as we do not have such paired samples, we compute a content preservation score between the input and the generated sentences (self-metric).

    Text BLEU SIM
    Original furthermore, kissing israeli ass doesn’t help things a bit
    Human rephrasing also, supporting the israelis doesn’t help things a bit. 57.6 70.6%
    Original just like the rest of the marxist idiots.
    Human rephrasing it is the same thing with people who follow Karl Marx doctrine 3.4 65.3%
    Original you will go down as being the most incompetent buffoon ever elected, congrats!
    Human rephrasing you could find out more about it. 2.3 16.2%
    Table 3: Evaluation with BLEU and SIM of examples rephrased by human crowdworkers.

    Table 3 shows the BLEU scores (based on exact matches) of three examples rephrased by human annotators (Section 4.5). In the top-most example, the BLEU score is high, which is explained by the fact that only 4 words differ between the two texts. In contrast, the two texts in the second example have only 1 word in common; thus, the BLEU score is low. Despite the low score, however, the candidate text could still be a valid rephrasing of the reference text.

    The high complexity of our task motivates a more general quantitative metric between input and generated text, capturing semantic similarity rather than token overlap. fu2018style; john-etal-2019-disentangled; gong2019reinforcement; pang2019unsupervised proposed to represent sentences as a (weighted) average of their word embeddings before computing the cosine similarity between them. We adopted a similar strategy, but embedded sentences with the pre-trained universal sentence encoder (cer2018universal); we call the resulting metric the sentence similarity score (SIM). The first two sentence pairs of Table 3 have high similarity scores: the rephrasings preserve the original content while not necessarily overlapping much with the original text. However, the last rephrasing does not preserve the initial content and has a low similarity score with its source sentence. As statistical evidence, the self-SIM score comparing each of the 1,000 test Yelp reviews with their human rewriting is 80.2%, whereas the self-SIM score comparing the Yelp review test set to a random derangement of the human references is 36.8%.

    We optimise all three metrics jointly, because optimizing only some of them comes at the expense of the remaining metric(s). We aggregate the three metrics by computing the geometric mean (GM) of ACC, 1/PPL and self-SIM. (The geometric mean is not sensitive to the scale of the individual metrics.)
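As a worked check of the aggregate score, the GM of ACC, 1/PPL and self-SIM can be computed as follows; the example numbers come from the CAE-T5 row of Table 4.

```python
def geometric_mean_score(acc, ppl, self_sim):
    """Aggregate GM of ACC, 1/PPL and self-SIM (Section 4.2.1)."""
    return (acc * (1.0 / ppl) * self_sim) ** (1.0 / 3.0)

# CAE-T5 row of Table 4: ACC = 75.0%, PPL = 5.2, self-SIM = 70.0%
gm = geometric_mean_score(0.75, 5.2, 0.70)  # ≈ 0.466
```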

4.2.2 Human evaluation

Following li-etal-2018-delete; zhang-etal-2018-learning-sentiment; zhang2018style; wu2019hierarchical; ijcai2019-732; wang2019controllable; john-etal-2019-disentangled; liu2019revision; luo2019dual; jin2019imat and to further confirm the performance of CAE-T5, we hired human annotators on Appen to rate in a blind fashion different models’ civil rephrasings of 100 randomly selected test toxic comments, in terms of attribute transfer (Att), fluency (Flu), content preservation (Con) and overall quality (Over) on a Likert scale from 1 to 5. Each rephrasing was annotated by 5 different crowd-workers whose annotation quality is controlled by test questions. If a rephrasing is rated 4 or 5 on Att, Flu and Con then it is “successful” (Suc).

4.3 Baselines

We compare the output text that CAE-T5 generates with a selection of unpaired style-transfer models described in Section 2.2 (shen2017style; li-etal-2018-delete; fu2018style; luo2019dual; dai-etal-2019-style). We also compare with Input Masking, inspired by an interpretability method called Input Erasure (IE) (Li2016), which is used to interpret the decisions of neural models. In IE, words are removed one at a time and each altered text is re-classified (i.e., as many re-classifications as there are words); all words whose removal led to a decreased classification score (based on a threshold) are returned as those most related to the model's decision. Our baseline follows a similar process, but instead of deleting, it masks one word at a time with a pseudo-token (‘[mask]’). Once all the masked texts have been scored by the classifier, the rephrased text is returned, with a mask replacing every token whose masking decreased the re-classification score (threshold set to 20% after preliminary experiments). We employed a pre-trained BERT as our toxicity classifier, fine-tuned on the Civil Comments dataset (see Section 4.1).
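The Input Masking baseline can be sketched as follows; the `toxicity_score` function is a stand-in for the fine-tuned BERT classifier, and this is an illustrative re-implementation rather than the authors' code.

```python
def input_masking(tokens, toxicity_score, drop=0.20, mask="[mask]"):
    """Mask every token whose individual masking lowers the toxicity
    score by more than `drop` (20% in the paper's setting)."""
    base = toxicity_score(tokens)
    out = []
    for i, tok in enumerate(tokens):
        # Re-score the text with only this one token masked.
        masked_once = tokens[:i] + [mask] + tokens[i + 1:]
        if base - toxicity_score(masked_once) > drop:
            out.append(mask)
        else:
            out.append(tok)
    return out
```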

4.4 Results

4.4.1 Quantitative comparison to prior work

Table 4 shows quantitative results on the Civil Comments dataset. Surprisingly, the perplexity (capturing fluency) of text generated by our model is lower than the perplexity computed on human comments. This can be explained by the fact that social media comment authors deviate considerably from formal language rules, a variability only partially replicated by CAE-T5. Other approaches, such as StyleTransformer (ST) and CrossAlignment (CA), have higher accuracy, but at the cost of both higher perplexity and lower content preservation, meaning that they are better at discriminating toxic phrases but struggle to rephrase them coherently.

In Table 5 we compare our model to prior work in attribute transfer by computing evaluation metrics for different systems on the Yelp test dataset. We achieve competitive results, with low perplexity and good sentiment control (above human references). Our similarity, though, is lower, showing that some content is lost when decoding; hence the latent space does not fully capture the semantics. It is fairer to compare our model to other style transfer baselines on the Yelp dataset, since our model is based on sub-word tokenization while the baselines are often based on a limited-size pre-trained word embedding: many more words from the Civil Comments dataset would be mapped to the unknown token if we wanted to keep a reasonably sized vocabulary, resulting in a performance drop.

The human evaluation results shown in Table 6 correlate with the automatic evaluation results.

When considering the aggregated scores (geometric mean, success rate and overall human judgement), our model ranks first on the Civil Comments dataset and second on the Yelp Review dataset, behind DualRL; yet our approach is more stable, and therefore easier to train, than reinforcement learning approaches.

Model ACC PPL self-SIM GM
Copy input 0% 6.8 100% 0.005
Random civil 100% 6.6 20.0% 0.311
Human 82.0% 9.2 73.8% 0.404
CA 94.0% 11.8 38.4% 0.313
IE (BERT) 86.8% 7.5 55.6% 0.401
ST (Cond) 97.8% 47.2 68.3% 0.242
ST (M-C) 98.8% 64.0 67.9% 0.219
CAE-T5 75.0% 5.2 70.0% 0.466
Table 4: Automatic evaluation scores of different models trained and evaluated on the processed Civil Comments dataset. The scores are computed on the toxic test set. “Human” corresponds to 427 human rewritings of randomly sampled toxic comments from the train set. “Random civil” means we randomly sampled 4,878 comments from the civil test set.
Model ACC PPL self-SIM ref-SIM GM self-BLEU ref-BLEU
Copy input 1.3% 11.1 100% 80.2% 0.105 100 32.5
Human references 79.4% 14.0 80.2% 100% 0.357 32.7 100
CrossAlignment (shen2017style) 73.5% 54.4 61.0% 59.0% 0.202 21.5 9.6
RetrieveOnly 99.9% 4.9 47.1% 48.0% 0.213 2.7 1.8
TemplateBased 84.1% 46.0 76.0% 68.2% 0.240 57.0 23.2
DeleteOnly 85.2% 48.7 72.6% 67.7% 0.233 33.9 15.2
D&R 89.8% 35.8 72.0% 67.6% 0.262 36.9 16.9
StyleEmbedding 8.1% 29.8 83.9% 69.8% 0.132 67.5 21.9
MultiDecoder 47.2% 74.2 67.7% 61.4% 0.163 40.4 15.2
DualRL (luo2019dual) 88.1% 20.5 83.6% 77.2% 0.330 58.7 29.0
StyleTransformer (Conditional) 91.7% 44.8 80.3% 74.2% 0.254 53.2 25.6
StyleTransformer (Multi-Class) 85.9% 29.1 84.2% 77.1% 0.292 62.8 29.2
CAE-T5 84.9% 22.9 67.7% 64.4% 0.293 27.3 14.0
Table 5: Automatic evaluation scores of different models trained and evaluated on the Yelp dataset. Accuracy is computed by a BERT classifier fine-tuned on the Yelp train set (accurate at on the test set). Perplexity is measured by a GPT2 language model fine-tuned on the Yelp train set. “self-” refers to a comparison to the input and “ref-” to a human reference.
Model Att Flu Con Suc Over
CA 2.98 2.32 1.89 6 % 1.81
IE (BERT) 2.77 2.39 2.20 6 % 1.89
ST (Cond) 2.91 2.36 2.08 5% 1.87
ST (M-C) 2.93 2.42 2.10 5% 1.93
CAE-T5 2.72 3.06 2.63 13% 2.52
Table 6: Human evaluation of different models trained and evaluated on the Civil Comments dataset.

4.4.2 Qualitative analysis

Table 7 shows examples of rephrasings of toxic comments automatically generated by our system. The first two examples emphasize the ability of the model to perform fluent controlled generation conditioned on both the input sentence and the destination attribute. We present more results showing that we can effectively suggest fluent civil rephrasings of toxic comments in Appendix Table 8. However, we observe more failures than in the sentiment transfer task (see examples in Appendix Table 9). We identify three types of failure:

Supererogation

generation does not stop early enough and produces fluent, transferred, related but unnecessary content.

Hallucination

conditioning on the initial sentence fails and the model generates fluent but unrelated content.

Position reversal

the author’s opinion is shifted.

In order to assess the frequency of hallucination and supererogation, we randomly selected 100 toxic comments from the test set and manually labeled the generated sentences with the non-mutually exclusive labels “contains supererogation” and “contains hallucination”. We counted on average 17% of generated sentences with supererogation and 34% of generated sentences showing hallucination (often local). We observe that the longer the input comment, the more prone the generated text is to hallucination.

While supererogation and hallucination can be explained by the probabilistic nature of generation, we assume that position reversal is due to bias in the dataset, where toxic comments are correlated with negative comments. Thus, offensive comments tend to be transferred to supportive comments even though a human being would rephrase attacks as polite disagreements.

Interestingly, our model is able to add toxicity in civil comments as shown by the examples in the Appendix Table 10. Even if such an application shows limited interest for online platforms, it is worth warning about its potential misuse.

Input stop being ignorant and lazy and try reading a bit about it.
Mitigated try reading and be a little more informed about it before you try to make a comment.
Input this is absolutely the most idiotic post i have ever read on all levels.
Mitigated this is absolutely the most important thing i have read on this thread over the years.
Input trump may be a moron, but clinton is a moron as well.
Mitigated trump may be a clinton supporter, but clinton is a trump supporter as well.
Input shoot me in the head if you didn’t vote for trump.
Mitigated you’re right if you didn’t vote for trump. i’m not sure i’d vote
Input 50% of teachers don’t have any f*cks to give.
Mitigated 50% of teachers don’t have a phd in anything.
Table 7: Examples of test sentences automatically transferred by our system, with valid rewritings and highlighted flaws: failure in attribute transfer or fluency, supererogation, position reversal, and hallucination.

4.5 Discussion

Supervised learning is a natural approach when addressing text-to-text tasks. In our study, we submitted the task of civil rephrasing of toxic comments to human crowd-sourcing. We randomly sampled 500 sentences from the toxic train set. For each sentence, we asked 5 annotators to rephrase it in a civil way, to assess whether the comment was offensive and whether it could be rewritten in a way that is less rude while preserving the content. Of the 2,500 answers, we tally only 427 examples that were not flagged as impossible to rewrite and whose rephrasing differs from the original sentence. This low yield is caused by two main issues. On the one hand, not all toxic comments can be reworded in a civil manner that still expresses a constructive point of view; severely toxic comments made solely of insults, identity attacks, or threats are not “rephrasable”. On the other hand, evaluating crowd-workers with test questions and answers is complex. The fact that perplexity is higher on crowd-workers’ rephrasings than on randomly sampled civil comments raises concerns about producing human references via crowd-sourcing. The nature of large datasets labeled for toxicity and the lack of incentives for crowd-sourced civil rephrasing annotation make it expensive and difficult to train systems in a supervised framework. These limitations motivate unsupervised approaches.
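The 427-out-of-2,500 yield corresponds to a simple filter over the collected answers. A sketch of that filter (the field names and the toy records are hypothetical, not the actual crowd-sourcing schema):

```python
def usable_rephrasings(answers):
    """Keep crowd-sourced answers that are neither flagged as impossible
    to rewrite nor identical to the original toxic sentence."""
    return [
        a for a in answers
        if not a["impossible_to_rewrite"] and a["rephrasing"] != a["original"]
    ]

# Toy records (hypothetical schema and content).
answers = [
    {"original": "you are a fool", "rephrasing": "i disagree with you",
     "impossible_to_rewrite": False},
    {"original": "you are a fool", "rephrasing": "you are a fool",
     "impossible_to_rewrite": False},   # unchanged: discarded
    {"original": "pure insult", "rephrasing": "",
     "impossible_to_rewrite": True},    # flagged impossible: discarded
]
kept = usable_rephrasings(answers)
```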

Lastly, the more complex the unsupervised attribute transfer task, the more difficult its automatic evaluation. In our case, evaluating whether the attribute is actually transferred requires training an accurate toxicity classifier. Furthermore, the language model we use to assess the fluency of the generated sentences has limitations and does not generalize to all varieties of language encountered on social media. Finally, measuring the amount of relevant content preserved between the source and generated texts remains a challenging, open research topic.
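The three automatic metrics discussed above can be aggregated in a small harness. In this sketch the scorers are injected as plain callables standing in for a fine-tuned toxicity classifier (ACC), a language model (PPL) and a sentence encoder (SIM); the toy scorers below are hypothetical, not the models used in the paper:

```python
def evaluate_transfer(pairs, is_toxic, perplexity, similarity):
    """Average the three automatic metrics over (source, generated) pairs.

    is_toxic:   stand-in for a toxicity classifier (attribute transfer).
    perplexity: stand-in for a language model (fluency; lower is better).
    similarity: stand-in for a sentence encoder (content preservation).
    """
    n = len(pairs)
    acc = sum(not is_toxic(gen) for _, gen in pairs) / n       # ACC
    ppl = sum(perplexity(gen) for _, gen in pairs) / n         # PPL
    sim = sum(similarity(src, gen) for src, gen in pairs) / n  # SIM
    return {"ACC": acc, "PPL": ppl, "SIM": sim}

# Toy scorers standing in for the real models.
pairs = [("you idiot", "you are mistaken"), ("this is trash", "this is trash")]
metrics = evaluate_transfer(
    pairs,
    is_toxic=lambda s: "idiot" in s or "trash" in s,
    perplexity=lambda s: float(len(s.split())),
    similarity=lambda a, b: 1.0 if a == b else 0.5,
)
```

Injecting the scorers keeps the harness independent of any particular classifier, LM or encoder implementation.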

5 Conclusion and future work

This work is, to our knowledge, the second to tackle civil rephrasing and the first to address it with a fully end-to-end, discriminator-free, text-to-text self-supervised training. CAE-T5 leverages the NLU / NLG power offered by large pre-trained bi-transformers. Our quantitative and qualitative analysis shows that ML systems could contribute, to some extent, to pacifying online conversations, even though many generated examples still suffer from critical semantic drift.

In the future, we plan to explore whether decoding can benefit from non-autoregressive (NAR) generation (ma2019flowseq; ren2020study). We are also interested in the recent paradigm shift proposed by kumar2018vmf, where the representation of generated tokens is continuous, allowing more flexibility in plugging in attribute classifiers without sampling.


This work was completed in partial fulfillment for the PhD degree of the first author, which was supported by an unrestricted gift from Google. We are also grateful for support from the Google Cloud Platform credits program. We thank Thomas Bonald and Ion Androutsopoulos for their discussion, insight and useful comments.


Appendix A Supplemental Material

A.1 Experimental setup

A.1.1 Architecture details

We fine-tune the pre-trained “large” bi-transformer from raffel2019exploring. Both uni-transformers (encoder and decoder) have blocks each made of a 16-headed self-attention layer and a feed-forward network. The attention, dense and embedding layers have respective dimensions of , and , for a total of around 800 million parameters.

Input sentences are lowercased, then tokenized with SentencePiece (kudo-richardson-2018-sentencepiece) using the vocabulary at gs://t5-data/vocabs/cc_all.32000/sentencepiece.model, and eventually truncated to a maximum sequence length of for the Yelp dataset and for the processed Civil Comments dataset. The control codes are for attributes in the sentiment transfer task and when we apply to the Civil Comments dataset.

A.1.2 Training details

During training, we apply dropout regularization at a rate of . We set . In preliminary experiments, we observed that was preserving little content from the initial sentence and that was weighting preservation too much, at the cost of accuracy. Therefore we focused our experiments on . This is a good default setting, since we have no a priori knowledge about the balance between fluency, accuracy (enforced by the auto-encoder) and content preservation (enforced by cycle consistency). DAE and back-transfer (in the course of the CC computation) are trained with teacher forcing; we do not need AR generation since we have access to a target for the decoder’s output. Each training step computes the loss on a mini-batch of 64 sentences sharing the same attribute, and mini-batches of the two attributes are interleaved. Since the Civil Comments dataset is class-imbalanced, we sample comments from the civil class of the training set at each epoch. The optimizer is AdaFactor (shazeer2018adafactor) and we train for 88,900 steps, taking 19 hours on a TPU v2 chip.
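The interleaving of attribute-homogeneous mini-batches described above can be sketched as follows; the down-sampling of the civil class to the size of the toxic class is one simple rebalancing choice for illustration (the paper's exact sampling ratio is elided in the text):

```python
import random

def interleaved_batches(toxic, civil, batch_size=64, seed=0):
    """Yield mini-batches that each contain sentences of a single
    attribute, alternating toxic and civil batches.

    Because Civil Comments is class-imbalanced, the civil side is
    re-sampled each epoch; here it is down-sampled to the size of the
    toxic side (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    civil_epoch = rng.sample(civil, min(len(toxic), len(civil)))
    for i in range(0, len(toxic) - batch_size + 1, batch_size):
        yield ("toxic", toxic[i:i + batch_size])
        yield ("civil", civil_epoch[i:i + batch_size])

# Toy corpora (hypothetical sentence IDs).
toxic = [f"t{i}" for i in range(128)]
civil = [f"c{i}" for i in range(1024)]
batches = list(interleaved_batches(toxic, civil, batch_size=64))
```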

A.1.3 Evaluation details

Decoding is greedy. The parametric models used to compute ACC and PPL are 12-layer, 12-headed pre-trained and fine-tuned uni-transformers with hidden size . The BERT classifier is an encoder followed by a sequence classification head, and the GPT-2 LM is a decoder with a LM head on top. We use the sacrebleu implementation for BLEU and the universal sentence encoder pre-trained by Google to compute SIM.
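The SIM metric reduces to a cosine similarity between sentence embeddings. A self-contained sketch: the real implementation embeds with the universal sentence encoder, whereas this illustration substitutes simple bag-of-words counts (an assumption made purely to keep the example runnable):

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as dicts."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def sim(a, b):
    """SIM-style score: cosine between sentence representations.
    Bag-of-words counts stand in for universal sentence encoder
    embeddings in this sketch."""
    return cosine(Counter(a.lower().split()), Counter(b.lower().split()))

score = sim("the poll is garbage", "the poll is fake news")
```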

A.2 CAE-T5 learning algorithm

Algorithm 1 and Figure 1 describe the fine-tuning procedure of CAE-T5. CE denotes the cross-entropy loss.

Input : T5’s pre-trained parameters , unpaired dataset labelled in toxicity
Output : CAE-T5’s fine-tuned parameters
for step  do
       if  then
             Sample a mini-batch of sentences in
             Sample a mini-batch of sentences in
       end if
       Back-propagate gradients through the loss
       Update by a gradient descent step
end for
Algorithm 1 CAE-T5 training
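The combination of the DAE and cycle-consistency terms in Algorithm 1 can be sketched with stub model calls. Everything here (the corruption function, the stub model, the method names) is a hypothetical stand-in, not the paper's implementation; only the structure of the loss mirrors the algorithm:

```python
def corrupt(batch):
    """Toy corruption for the DAE term: drop the last token of each
    sentence (real systems use masking / word dropout)."""
    return [" ".join(s.split()[:-1]) for s in batch]

class StubModel:
    """Stand-in for the fine-tuned T5: its loss just counts target
    tokens so the step below is runnable end-to-end."""
    def denoising_loss(self, inputs, target, attribute):
        # Teacher-forced cross-entropy in the real model.
        return float(sum(len(t.split()) for t in target))

    def generate(self, batch, attribute):
        # AR decoding in the real model; gradients do not flow through it.
        return list(batch)

def training_step(model, toxic_batch, civil_batch, lam=1.0):
    """One CAE-T5 step: DAE term (reconstruct under the source attribute,
    teacher-forced) plus a lam-weighted cycle-consistency term
    (pseudo-transfer to the other attribute with detached AR decoding,
    then teacher-forced back-transfer against the original)."""
    loss = 0.0
    for batch, attr, other in ((toxic_batch, "toxic", "civil"),
                               (civil_batch, "civil", "toxic")):
        loss += model.denoising_loss(corrupt(batch), target=batch, attribute=attr)
        pseudo = model.generate(batch, attribute=other)  # detached AR decoding
        loss += lam * model.denoising_loss(pseudo, target=batch, attribute=attr)
    return loss

loss = training_step(StubModel(), ["a b c"], ["d e"])
```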
(a) DAE
(b) CC
Figure 1: Illustration of the training procedure. (a) DAE: The bi-transformer encodes the corrupted input text in a latent variable that is then decoded conditioned on the source attribute with the objective of minimizing the cross entropy between and the generated text . Here, generation is not AR since the DAE is trained with teacher forcing. (b) CC: The input is pseudo-transferred with attribute with AR decoding because we do not know the ground-truth . The generated output is then back-transferred to the original space of sentences with attribute . Back-transfer generation is not AR because we use teacher-forcing here. Thus, we can trivially back-propagate the gradients through (back-transfer) but not through (pseudo-transfer).

Figure 2 illustrates flows through the encoder-decoder model at inference.

Figure 2: Model overview at prediction time. and are the input and generated text, is the destination attribute, is the latent input representation and is the function prepending control codes to the decoder’s input. represents the model’s parameters.
Toxic to Civil
Input and then they need to do what it takes to get rid of this mentally ill bigot!
CAE-T5 and then they need to do what they can to get rid of this kind of man that is causing the whole problem!
Input do you think any of those idiots vote?
CAE-T5 do you think any of those people vote?
Input this is just so stupid.
CAE-T5 this is just so sad.
Input are you nuts?
CAE-T5 are you kidding me?
Input we are all entitled to our own opinions but not the facts that one can’t find in this criminally stupid opinion.
CAE-T5 we are all entitled to our own opinion but not to the kind of facts that can be found in this case.
Input hopefully that will prevent him from making any further idiotic statements.
CAE-T5 hopefully that will prevent him from making any more such statements.
Input the poll is garbage.
CAE-T5 the poll is fake news.
Input it was irresponsible to publish this garbage.
CAE-T5 it was irresponsible to publish this information.
Input just because someone else has a different opinion does not mean that person is stupid, uneducated or foolish.
CAE-T5 just because someone has a different opinion does not mean that someone is less intelligent, less informed, less able to understand, and less capable of making informed decisions………
Input biased leftist trash article.
CAE-T5 biased leftist article.
Input dumb people vote for trump.
CAE-T5 most people vote for trump.
Input what hypocrites your sources are!
CAE-T5 what are your sources of information? ! ! ! !
Input bill o’reilly is a narcissistic sociopath.
CAE-T5 bill o’reilly is a neo-liberal………….
Input f*ck the flag.
CAE-T5 i love the flag.
Input i guess we can kill them.
CAE-T5 but i think we can save them.
Input stupid comment
CAE-T5 interesting comment.
Input try doing a little research before you make a fool of yourself with such blatantly false drivel.
CAE-T5 try doing a little research before you make a blatant mistake in your so-called opinion.
Input hey old fart, you losers have always been fake news.
CAE-T5 hey old soul, you have never been right, the fake news has always been bad.
Input i bet you think you’re being rational and not just quoting another completely idiotic piece of denialist propaganda.
CAE-T5 i bet you think you’re being rational and not just a repeating the same unproven, ill-informed propaganda that you’re referring to. . i’m just a little more
Input the ontario voters were rated the dumbest people in all the whole world.
CAE-T5 the ontario liberals were voted in by the most gullible people in the country……………
Input taxpayer are tired of having our hard-earned money be misspent by idiots.
CAE-T5 taxpayers are tired of having our hard-earned money stolen by uncaring bureaucrats.
Input suggesting that people change their commute times is stupid.
CAE-T5 suggesting that people change their lifestyles is not helpful.
Input this hateful group of cowards are successfully destroying the united states.
CAE-T5 the hateful group of trump supporters are successfully dividing the united states.
Input whatever the idiot in chief touches turns to crap.
CAE-T5 whatever the president does in the room turns to what he wants.
Input either you are too ignorant or stupid to read the first dictionary definition.
CAE-T5 either you are too young to read or are too old to understand the first amendment or the first amendment.
Input trump is doing right things, he is great president, a lot better than 44th one from kenya
lot of good people support trump, only crooked and dumb people against
CAE-T5 trump is doing great, i know, he is better than most of the other candidates - obama, clinton, trump, trump, trump, trump, trump, trump, trump,
Input the dumb become dumber.
CAE-T5 the poor become richer.
Table 8: Examples of toxic test sentences automatically transferred by our system, with valid rewritings and highlighted flaws: failure in attribute transfer or fluency, supererogation, position reversal, and hallucination.
Positive to Negative
Input portions are very generous and food is fantastically flavorful .
DualRL portions are very thin and food is confusing .
ST (Multi) portions are very poorly and food is springs flavorless .
CAE-T5 portions are very small and food is awfully greasy for the price .
Human portions are very small and food is not flavorful .
Input staff : very cute and friendly .
DualRL staff : very awful and rude .
ST (Multi) staff : very nightmare and poor .
CAE-T5 staff : very rude and pushy .
Human staff : very ugly and mean .
Input friendly and welcoming with a fun atmosphere and terrific food .
DualRL rude and unprofessional with a loud atmosphere and awful food .
ST (Multi) poor and fake with a fun atmosphere and mushy food .
CAE-T5 rude and unhelpful service with a forced smile and attitude .
Human unfriendly and unwelcoming with a bad atmosphere and food .
Input i love their star design collection .
DualRL i hate their star design disgrace .
ST (Multi) i do n’t care star bites collection .
CAE-T5 i hate starbucks corporate . the staff is horrible .
Human i ca n’t stand their star design collection .
Input oj and jeremy did a great job !
DualRL oj and jeremy did a great job ! disgrace ! disgrace !
ST (Multi) oj and jeremy did a terrible job !
CAE-T5 oh and jesus christ i did n’t have any change !
Human oj and jeremy did a terrible job !
Negative to Positive
Input the store is dumpy looking and management needs to change .
DualRL the store is perfect looking and management speaks to change perfectly .
ST (Multi) the store is dumpy looking and management moved to change .
CAE-T5 the store is neatly organized and clean and staff is on top of it .
Human managment is top notch , the place looks great .
Input i emailed to let them know but they apparently dont care .
DualRL i loved them know them know but they dont care .
ST (Multi) i emailed to let them know but they honestly played their .
CAE-T5 i emailed to let them know and they happily responded right away . a great service
Human i emailed to let them know they really do care .
Input this place is dirty and run down and the service stinks !
DualRL this place is clean and run perfect and the service helped !
ST (Multi) this place is quick and run down and the service stunning !
CAE-T5 this place is clean and well maintained and the service is great ! ! !
Human this place is clean , not run down , and the service was great .
Input do not go here if you are interested in eating good food .
DualRL definitely go here if you are interested in eating good food .
ST (Multi) do not go here if you are interested in eating good food .
CAE-T5 definitely recommend this place if you are looking for good food at a good price .
Human do not go here if you are interested in eating bad food .
Input my husband had to walk up to the bar to place our wine order .
DualRL my husband had to walk up to the bar to place our drink order perfectly .
ST (Multi) my husband had to walk up to the bar to place our wine order .
CAE-T5 my husband loves to stand in line at the beer stand and watch the game .
Human my husband did n’t have to walk up to the bar to place our wine order .
Table 9: Examples of sentences automatically transferred from the Yelp test dataset, with valid rewritings and highlighted flaws: failure in attribute transfer or fluency, supererogation, and hallucination.
Civil to Toxic
Input great effort and great season.
CAE-T5 stupid and horrible move by this team.
Input canadian teachers are the foundation of this great country.
CAE-T5 canadians are the dumbest people in this country.
Input why isn’t trudeau in jail for gross negligence?
CAE-T5 why isn’t trudeau in jail for grossly irresponsible behavior or just plain stupidity?????
Input there really is no safe place on the planet to live anymore.
CAE-T5 there really is no safe place on the planet for the stupid to live.
Input trudeau will do as he’s told.
CAE-T5 trudeau will do as he’s told, he’s stupid.
Input all of canada is paying for that decision.
CAE-T5 all of canada is paying for that stupid decision.
Input people like you need their talking points.
CAE-T5 people like you need their asses kicked.
Input this is a great article that hits the nail on the head.
CAE-T5 this is a stupid article that hits the nail on the head.
Input and just which money tree is going to pay for this?
CAE-T5 and just where is the money going to come from to pay for this crap? when is this crap going to end? who cares????
Input you need to get a new dictionary.
CAE-T5 you need to get a new f*cking dictionary. get a new f*cking dictionary. get a new f*cking dictionary. get a new
Input the real question is; will the hell’s angels forgo their major source of income?
CAE-T5 the real question is when will the idiocy of these trumpkins end?
Input according to, trump got 46.1% of the popular vote.
CAE-T5 according to trump, pence is the dumbest president ever, and clinton got the second-worst approval rating in history.. .. trump’
Input so it’s okay to sacrifice the environment in the name of the almighty dollar……
CAE-T5 so it’s okay to destroy the world with the actions of one stupid dude in the white house…………
Input the president dismissed the ecological findings of over 87% of scientists who have been studying the effects of global warming, largely caused by the release of carbon from fossil fuel into the atmosphere.
CAE-T5 the president ignored the scientific consensus that over 90% of all climate scientists are complete idiots, reacting to the rash of terrorist attacks that have been taking place around the worldtrump has made it his life’
Input not sure where you got your definition of a good guy.
CAE-T5 not sure where you got your idea that trump is a kinda dumb guy.
Table 10: Examples of civil test sentences automatically transferred by our system, with valid rewritings and highlighted flaws: failure in attribute transfer or fluency, supererogation, position reversal, and hallucination. For the test set of civil sentences, the automatic metrics are ACC, PPL and self-SIM.

A.3 Appen settings

Figure 3 and Figure 4 detail the guidelines we wrote on the crowdsourcing website Appen, used when we asked human crowd-workers to rate automatic rephrasings and to rephrase toxic comments. The contributor level is set to level 3, which corresponds to the highest quality standard.

Figure 3: Guidelines provided to human crowd-workers on Appen, when they were asked to rate automatic civil rephrasings of toxic comments.
Figure 4: Guidelines provided to human crowd-workers on Appen, when they were asked to rewrite toxic comments in a way that is less rude.