Effective writing style imitation via combinatorial paraphrasing

05/31/2019 ∙ by Tommi Gröndahl, et al. ∙ Aalto University ∙ Association for Computing Machinery

Stylometry can be used to profile authors based on their written text. Transforming text to imitate someone else's writing style while retaining meaning constitutes a defence. A variety of deep learning methods for style imitation have been proposed in recent research literature. Via empirical evaluation of three state-of-the-art models on four datasets, we illustrate that none succeed in semantic retainment, often drastically changing the original meaning or removing important parts of the text. To mitigate this problem we present ParChoice: an alternative approach based on the combinatorial application of multiple paraphrasing techniques. ParChoice first produces a large number of possible candidate paraphrases, from which it then chooses the candidate that maximizes proximity to a target corpus. Through systematic automated and manual evaluation as well as a user study, we demonstrate that ParChoice significantly outperforms prior methods in its ability to retain semantic content. Using state-of-the-art deep learning author profiling tools, we additionally show that ParChoice accomplishes better imitation success than A^4NT, the state-of-the-art style imitation technique with the best semantic retainment.




1. Introduction

Freedom of speech and privacy are threatened by advances in the field of artificial intelligence (AI), including natural language processing (NLP). Governments, corporations or other institutions can use NLP techniques to deanonymize whistle-blowers and dissidents (press2018, ). Potential methods include the use of stylometry to identify or profile anonymous authors based on writing style (Stamatatos2009, ; Tempesttetal2017, ). This, in turn, motivates defence measures based on altering writing style while exerting minimal impact on semantic content (Brennanetal2011, ; Narayanetal2012, ; Almisharietal2014, ). Adopting a term from image classification (Gatysetal2016, ), we call this style transfer. The process of style transfer can consist of several transformations, i.e. changes applied to the input text.

While prior attempts have largely focused on mere writing style obfuscation by simple methods like back-and-forth machine translation (Brennanetal2011, ; Caliskan:Greenstadt2012, ; Almisharietal2014, ; Macketal2015, ), recent work has adopted more sophisticated procedures for style imitation, targeting a particular style rather than only concealing the original (Shenetal2017, ; Prabhumoyeetal2018, ; Shettyetal2018, ). These approaches are particularly promising because, unlike many automatic paraphrasing schemes (Xuetal2012, ; Lietal2018, ), they do not require parallel corpora for training. This means that they can be trained on separate corpora representing the source and target style respectively, without the need to match sentences between them.

Reliable semantic retainment is of utmost importance for any style imitation system to be useful in practice. In addition to using textual similarity measures like METEOR (Banerjee:Lavie2005, ), prior work has relied on comparative human evaluations typically obtained via crowd-sourcing (Prabhumoyeetal2018, ; Shettyetal2018, ). However, merely comparing alternatives without measuring semantic retainment is not useful for evaluating real-world applicability, since none of the evaluated alternatives may successfully retain the original meaning, even if participants systematically preferred one alternative to another. By conducting an extensive analysis of imitations by three state-of-the-art imitation techniques (Shenetal2017, ; Prabhumoyeetal2018, ; Shettyetal2018, ) using both automatic and manual measures, we demonstrate that none reliably produce semantically appropriate paraphrases of the original sentence. Based on our manual evaluation of sentences from eight imitation directions altogether, almost all acceptable outputs are exact replicas of the original sentence. When these are discarded, the rate of acceptable paraphrases remains marginal. While avoiding unnecessary changes is desirable, viable style imitation must retain semantics across transformations rather than eschew them altogether. Our results show that prior techniques fall short on this count.

We develop an alternative imitation scheme called ParChoice. As shown in Figure 1, ParChoice first generates a variety of possible paraphrase candidates, from which it then selects the candidate that is closest to the writing style of a target corpus. This selection is based on an independently trained machine learning (ML) model. Importantly, the same paraphrase generation component can be combined with different paraphrase selection models to target distinct writing styles.

While prior work on style transfer via grammatical transformations and paraphrase replacement exists (Khosmood:Levinson2008, ; Khosmood:Levinson2009, ; Khosmood:Levinson2010, ; Khosmood2012, ; Karadzhovetal2017, ), both the rules and corpora used have been highly limited. Declined interest in such approaches plausibly reflects a more general shift away from rule-based NLP methods (Goldberg2016, ; Youngetal2017, ). In spite of this, we show that existing paraphrase resources (Fellbaum1998, ; Ganitkevitchetal2013, ), together with certain grammaticality filters and a small set of additional rules, allow generating large numbers of suitable paraphrase candidates when applied combinatorially. We further integrate this paraphrasing approach with a recent method for producing grammatical transformations, which alter specific grammatical features of the sentence while retaining its other properties (Eat2seq, ).

We measure ParChoice’s semantic retainment ability with three tests: automatic evaluation via the METEOR score (Banerjee:Lavie2005, ), manual evaluation on subsets of the test sets, and a user study eliciting semantic similarity scores from independent human evaluators. ParChoice outperforms the three baseline techniques in all these tests, especially surpassing them by a wide margin in both human evaluations. Only one baseline (A^4NT (Shettyetal2018, )) is able to retain the original meaning to any significant degree, which it mostly achieves simply by reproducing the original sentence without transformations. We then compare the imitation performance of ParChoice and A^4NT in eight imitation directions using two state-of-the-art deep learning author profilers, yielding several test settings altogether. ParChoice achieves better imitation results in most of these settings. These evaluations demonstrate that ParChoice performs best compared to the state-of-the-art in terms of retaining semantics while achieving imitation success.

We summarize our contributions below.

  • We present ParChoice: a general style imitation technique that is built to maximize semantic retainment across transformations (Section 4).

  • Via a systematic empirical comparison of ParChoice with three state-of-the-art baseline style imitation techniques (Shenetal2017, ; Prabhumoyeetal2018, ; Shettyetal2018, ) for gender and author imitation (Section 5), we demonstrate that:

    • ParChoice retains semantic information across stylistic changes far more than any of the baseline techniques (Section 6.1)

    • ParChoice’s imitation performance exceeds that of A^4NT: the best baseline technique in terms of semantic retainment (Section 6.2)

Figure 1. Overview of ParChoice

2. Background

We briefly review prior work on author profiling via stylometry and proposed methods for its mitigation.

2.1. Author profiling as a privacy threat

Stylometry was initiated as a computational task by Mosteller and Wallace (Mosteller:Wallace1964a, ), and has since grown into a major theme within NLP (Stamatatos2009, ; Tempesttetal2017, ). While most studies have focused on traditional machine learning and feature engineering (Zhengetal2006, ; Grieve2007, ; Abbasi:Chen2008, ; Juola2012, ; Potthastetal2016b, ; Tempesttetal2017, ), deep learning methods have recently become popular (Bagnall2015, ; Ge:Sun2016, ; Surendranetal2017, ; Brocardoetal2017, ), in line with broader NLP trends (Goldberg2016, ; Youngetal2017, ).

Deanonymization attacks (Narayanetal2012, ) can be used by governments, corporations, or other institutions to compromise the privacy of anonymous authors (Brennan:Greenstadt2009, ; Brennanetal2011, ; McDonaldetal2012, ; Almisharietal2014, ). That deanonymization attacks are a realistic concern is underscored by the high accuracy reported in recent stylometry research, especially when the number of author candidates is relatively small (Tempesttetal2017, ). Moreover, even modest accuracy with a high author count can significantly increase the probability of identification (Narayanetal2012, ), thereby allowing the adversary to narrow down the pool of likely candidate authors for further, possibly manual, evaluation.

Beyond personal identity, more general properties of the author can also be detected from writing style. These include, among others, gender (Schleretal2006, ; Reddy:Knight2016, ), age (Schleretal2006, ), or political leaning (Prabhumoyeetal2018, ). We pool the detection of such properties under author profiling, and also treat deanonymization as a special type of profiling. We use the term (author) profiler for ML models used for author profiling.

Mitigating author profiling requires style transfer, i.e. transforming the writing style of the original document without changing its semantic interpretation. Prior work has largely focused on what Mack et al. call iterative language translation (ILT) (Macketal2015, ), where the original document is translated across one or more intermediate languages back to the original language (Brennanetal2011, ; Caliskan:Greenstadt2012, ; Almisharietal2014, ; Macketal2015, ; Dayetal2016, ).111Our experiments and all studies reviewed here are conducted on English datasets. Broadening the methods to cover other languages is an important avenue for future research. ILT has the benefit of being easily available via machine translation (MT) interfaces provided by e.g. Google.222https://translate.google.com/ Results on its effectiveness against author identification are mixed, some studies reporting success (Almisharietal2014, ; Macketal2015, ; Dayetal2016, ) but not all (Brennanetal2011, ; Caliskan:Greenstadt2012, ). Semantic retainment is a major challenge, especially across multiple translations (Tempesttetal2017, ).

Importantly, ILT cannot be used to aim at a particular target style; it merely changes the input document to a style represented by the MT system used, at most obfuscating the style present in the original document. As such, ILT does not constitute a style imitation technique. In the rest of this paper, we focus on style imitation.

2.2. Prior style imitation techniques

Unlike mere style obfuscation, imitation involves enacting only those changes that are likely to steer author profiling in the intended direction (Shettyetal2018, ). We summarize recent research on the topic. For further details we refer to the papers cited.

Rule-based methods Manually programmed grammatical transformations and paraphrase replacement from pre-existing knowledge bases (e.g. PPDB (Ganitkevitchetal2013, ), WordNet (Wordnet95, )) provide alternatives to ILT as simple transformation techniques (Khosmood:Levinson2008, ; Khosmood:Levinson2009, ; Khosmood:Levinson2010, ; Khosmood2012, ; Reddy:Knight2016, ; Karadzhovetal2017, ). A considerable advantage over ILT is that such rules can be controlled and hence specified to imitate a particular target style. As an example, Karadzhov et al. (Karadzhovetal2017, ) apply various transformations to the input document in order to steer its stylistic features toward the average values in the author profiler’s training corpus.

Stylistic MT MT can also be used to translate directly between styles in the same language, if parallel corpora are available for training. Previous work on stylistic MT includes paraphrasing Shakespeare as modern English (Xuetal2012, ), and automatic text simplification by translating text to a more easily readable variant (Wubbenetal2012, ). While its use in style transformation has been rare, positive results were reported by Day et al. (Dayetal2016, ). However, its central shortcoming is the reliance on parallel corpora matching the source and target styles, which are rarely available. More recent style imitation research has centered around methods that require no parallel corpora for training, and we focus on these in the remainder of this paper.

Cross-aligned autoencoder (CAE) The main idea behind state-of-the-art approaches to style imitation (Shenetal2017, ; Prabhumoyeetal2018, ; Shettyetal2018, ) has been to generate a style-neutral encoding of the original sentence, which serves as the input to a style-specific decoder. Shen et al. (Shenetal2017, ) trained an autoencoder using a method called cross-alignment for calibrating the distributions between the encodings derived from different source styles. The encoder produces a latent variable from the input sentence, and the decoder generates the target sentence from this content variable together with the target style feature. Shen et al. implemented cross-alignment by adversarial training over generated samples in both styles.

Back-translation (BT) As an alternative to CAE’s alignment method, Prabhumoye et al. (Prabhumoyeetal2018, ) produce the latent content variable via a pre-trained MT system. They first translate the original English sentence to French, and produce the content variable with the French-English encoder. They then train separate decoders that take this encoding as an input and produce sentences in a particular writing style, as evaluated by an independent (pre-trained) author profiler. BT can thus be considered as a more sophisticated variant of ILT, with style-specific decoding allowing imitation of a particular target style. Like ILT, it builds on the notion that MT tends to neutralize stylistic features (Rabinovichetal2016, ).

A^4NT Shetty et al. (Shettyetal2018, ) make use of generative adversarial networks (GANs) in their style imitation system titled Adversarial Author Attribute Anonymity Neural Translation (A^4NT). A GAN consists of two neural networks, where one (the discriminator) is trained to classify the outputs of the other (the generator), which in turn is trained to deceive the classifier (Goodfellowetal2014, ). Initializing these networks with a pre-trained author profiler and autoencoder, Shetty et al. train the generator to produce sentences classified as the target author as well as preserve the semantic content of the original. The latter is regulated by the probability of reconstructing the original sentence via reverse transformation.

3. Adversary model and use scenarios

The entities involved in our use scenario are authors and an adversary. Each author a belongs to some class in a set of classes C. A document d is written by an author, and is thereby also allocated to a class in C. When we talk about d “belonging to” a class, we, strictly speaking, mean the author of d belonging to the class. We refer to the true author of d as a, and define a function c from the author to its true class in C. By definition, the true class of d is c(a). A profiler P maps each document d to a class in C, which we denote as P(d).

When the author a wants to thwart P from correctly profiling d as belonging to its true class c(a), a can apply a series of transformations to d, producing a transformed document d′. Since d′ is produced by a, its true class is also c(a). Style transfer is successful if P maps d′ to a different class despite this. We further distinguish between obfuscation and imitation, where the former succeeds if P profiles d′ incorrectly, and the latter if P maps d′ to a particular target class c* ≠ c(a). In style imitation, sentences in the target class c* are written in the target style. We provide definitions below.

  • d′ succeeds in obfuscating d for profiler P if: P(d′) ≠ c(a)

  • d′ succeeds in imitating target class c* for profiler P if: P(d′) = c* ≠ c(a)
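As a minimal sketch, the two success criteria can be expressed as predicates over a profiler treated as a classifier function (all names here are hypothetical illustrations, not the paper's implementation):

```python
def obfuscation_success(P, d_transformed, true_class):
    """Obfuscation succeeds if the profiler assigns any class
    other than the author's true class."""
    return P(d_transformed) != true_class

def imitation_success(P, d_transformed, true_class, target_class):
    """Imitation succeeds if the profiler assigns the chosen
    target class, which must differ from the true class."""
    assert target_class != true_class
    return P(d_transformed) == target_class

# Toy profiler: classifies by the presence of a marker word.
P = lambda doc: "B" if "indeed" in doc else "A"
```

Note that imitation success implies obfuscation success, matching the observation below that imitation is a variant of obfuscation.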
So far we have considered success in obfuscation or imitation as properties of the transformed document d′. We can broaden these to cover (transfer/)imitation techniques themselves, which succeed in these tasks if they produce transformed documents that succeed in them. This makes the success of an imitation technique probabilistic instead of a binary property: it can be estimated based on the extent to which applying the imitation increases the probability of P(d′) ≠ c(a) (obfuscation) or P(d′) = c* (imitation).

When |C| > 2, the distinction between obfuscation and imitation is unambiguous, the latter being a variant of the former. If |C| = 2 (as in our test cases), we consider obfuscation as successful if the probability of correct profiling is reduced to a random level, and imitation as successful if there is a significantly above-random probability of misprofiling. Since there is no definitive criterion for what should constitute “significant”, the boundary between obfuscation and imitation is somewhat blurred in two-class datasets.

In the absence of style transfer, the adversary has access to the document d written by a, and the profiler P that maps each document to some class in C. He applies P on d, and succeeds if P(d) = c(a). In the imitation scenario, the adversary is only allowed to access d′. In this paper, we only consider profiling by automatic methods, where P is a machine learning (ML) model trained on data derived from texts belonging to the classes in C. We therefore assume that the adversary has access to such training data. We call this the (author) profiler dataset.

We further assume the author a to have access to example documents from two corpora to train the imitation method: one from the true class c(a) and another from the target class c*. These corpora constitute the surrogate dataset. The surrogate dataset is distinct from the author profiler dataset, but we assume them to correspond in labels and distribution. The author a uses the surrogate dataset to train a surrogate profiler, which is required to conduct the imitation.

Figure 2. ParChoice pipeline

4. Design of ParChoice

Figure 2 shows an overview of the ParChoice pipeline. It consists of two stages: (i) paraphrase generation, which takes an input document and generates a set of paraphrase candidates, and (ii) paraphrase selection. To realize the paraphrase selection stage, ParChoice uses a surrogate profiler trained on the surrogate dataset. In this section, we explain each stage in detail, and discuss the parameters that can be used to tune the performance of ParChoice in speed, imitation performance, and semantic retainment.
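The paraphrase selection stage can be illustrated with a minimal sketch: score each candidate with a surrogate profiler and keep the one judged most target-like. The profiler here is a toy word-count stand-in, and all names are hypothetical:

```python
def select_paraphrase(candidates, surrogate_prob, target_class):
    """Pick the candidate the surrogate profiler considers most
    likely to belong to the target class."""
    return max(candidates, key=lambda c: surrogate_prob(c, target_class))

# Toy surrogate: score proportional to the share of target-style words.
STYLE_WORDS = {"B": {"hence", "thus"}}
def surrogate_prob(text, cls):
    toks = text.lower().split()
    return sum(t in STYLE_WORDS.get(cls, ()) for t in toks) / max(len(toks), 1)

best = select_paraphrase(
    ["so it is true", "hence it is true"], surrogate_prob, "B")
```

In ParChoice proper, the surrogate profiler is a trained ML model rather than a word list, but the selection logic is the same argmax over candidates.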

4.1. Paraphrase generation

The paraphrase generation stage contains three modules: grammatical transformations, paraphrase replacement, and adding typos. The modules are always applied in this order, but the paraphrase replacement loop allows multiple iterations, the number of which is specified as a parameter. In the following subsections we motivate and describe each module.

4.1.1. Grammatical transformations

This module consists of a parser that maps input sentences into representations of grammatical and semantic structure, and a pre-trained neural machine translation (NMT) model that generates English sentences from these representations. Grammatical transformations form a challenging task for NLP, and not many prior works have tackled them directly. Outside of highly task-specific rule-based approaches (Ahmed:Lin14, ; Biluetal15, ; Baptista16, ), we are aware of only two general techniques proposed for creating grammatical transformations (Logeswaranetal2018, ; Eat2seq, ), one of which we used in ParChoice.

Similar to the A^4NT method for style transformation (Shettyetal2018, ), Logeswaran et al. (Logeswaranetal2018, ) trained an encoder network to produce a latent content representation of the input sentence, and a decoder network to generate sentences from the content representation together with a set of target attribute features. They enforced content retainment by reconstruction losses from the original sentence’s content representation (autoencoding loss), the decoded sentence (back-translation loss), and their combination (interpolated reconstruction loss). The transformed attributes were represented as Boolean vectors, and included e.g. sentiment (positive/negative), tense (past/present), voice (active/passive), and negation. However, Logeswaran et al. only measured the accuracy of the target class, not semantic retainment. Their method is also a more generic text transformation framework, whereas we wanted to focus on grammatical transformations in particular.

Gröndahl and Asokan (Eat2seq, ) use a similar technique, with the important distinction that instead of an encoded latent representation they use a symbolic format generated from the input sentence’s dependency parse (Tesniere59, ). Since this intermediate representation is human-readable, it can be modified with targeted transformation rules. The transformed sentence is then generated with an NMT network trained to translate the representation back to English. This technique allows producing grammatical transformations between multiple class pairs (declarative-question, present-past, active-passive etc.) in any direction or combination. Gröndahl and Asokan further demonstrate the method to maintain a high level of content preservation as measured by back-transformation overlap with the original sentence as well as manual evaluation. Given the demonstrated success of this method in generating a large variety of grammatical transformations, we adopted it in ParChoice.333 The authors of the original paper (Eat2seq, ) made their model and training data available to us. The model consists of an encoder and a decoder network, both of which are two-layer LSTMs with hidden units in each layer. The intermediate representations are sequences of -component vectors, and use pre-trained GloVe embeddings (Glove, ) for representing words. We refer to the original paper for further details. 
The training corpus consists of million English sentences of the maximum length of words, derived from the following sources: Stanford parallel corpora (English only) (Luongetal2015, ; Luong:Manning2016, ) (https://nlp.stanford.edu/projects/nmt/), OpenSubtitles 2018 (http://opus.nlpl.eu/OpenSubtitles2018.php), Tatoeba (English only) (https://tatoeba.org), SNLI (SNLI2015, ) (https://nlp.stanford.edu/projects/snli/), SICK (SICK2014, ) (http://clic.cimec.unitn.it/composes/sick.html), Aristo-mini (December 2016 release) (https://www.kaggle.com/allenai/aristo-mini-corpus), and example sentences from WordNet (Wordnet95, ) (http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html). We also used this corpus for obtaining inflection information for WordNet lemmas, as explained in Section 4.2.

We restricted grammatical transformations to those having a minimal semantic effect. With additional appended material we were able to utilize transformations between affirmed and negated sentences, as well as between questions and declaratives. We also used voice transformation (active-passive) in both directions whenever it was available. Examples are shown in Table 1.

Voice Transitive verbs444A verb is transitive if it takes both a subject and a direct object in the active voice. can appear either in the active or passive voice, which impacts the semantic interpretation of the grammatical subject. The direct object of an active clause is expressed as the subject in the corresponding passive clause, and the subject of an active clause is optionally expressed in the passive via the preposition by. We produced both active and passive versions of all sentences that contained both required arguments.

Negation In addition to using the negative particle not/n’t, an affirmative sentence can be negated by embedding it within a main clause that states the falsity of its subordinate clause. We make use of this fact by first producing affirmed versions of negated input sentences, and subsequently wrapping them in the context “I do not/n’t think/believe (that)…”. We chose this prefix due to its fluency in comparison to other options, such as “It is not the case that…”. Of course, different alternatives could be chosen depending on the nature of the input data. The first person narrative is appropriate for our test data, as restaurant reviews, blogs, and political speeches are all typically written from a personal standpoint (cf. Section 5.1.1). We applied voice transformation in producing the affirmed variants of the original sentences.
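The wrapping step can be sketched as follows, assuming a helper affirm() that stands in for the grammatical-transformation model producing the affirmed variant (all names and the toy affirmation rule are hypothetical):

```python
NEG_PREFIXES = ["I do not think that", "I don't think that",
                "I do not believe that", "I don't believe that",
                "I do not think", "I don't think",
                "I do not believe", "I don't believe"]

def negation_paraphrases(negated_sentence, affirm):
    """Wrap the affirmed version of a negated sentence in each
    first-person negating prefix."""
    core = affirm(negated_sentence).rstrip(".")
    return [f"{prefix} {core}." for prefix in NEG_PREFIXES]

# Toy stand-in for the grammatical transformation, for this example only:
affirm = lambda s: s.replace("didn't see", "saw")
out = negation_paraphrases("John didn't see Mary.", affirm)
```

Each input thus yields eight negation-wrapped candidates, with and without the complementizer that, which feed into the combinatorial candidate set.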

Questions To paraphrase a polar (yes-no) question, we first transformed it to a declarative variant, which we then appended to the prefix “Is it (true) that…”. For negative questions, we additionally produced the affirmed declarative variant embedded in “Is it not (true) that…” and “Isn’t it (true) that…”. We employed voice transformation when generating the declarative sentences.

Sentence Transformations

John saw Mary.
Mary was seen by John.

John didn’t see Mary.
Mary wasn’t seen by John.
I don’t think that John saw Mary.
I don’t believe that John saw Mary.
I don’t think John saw Mary.
I don’t believe John saw Mary.
I don’t think that Mary was seen by John.
I don’t believe that Mary was seen by John.
I don’t think Mary was seen by John.
I don’t believe Mary was seen by John.

Did John see Mary?
Was Mary seen by John?
Is it that John saw Mary?
Is it true that John saw Mary?
Is it that Mary was seen by John?
Is it true that Mary was seen by John?

Didn’t John see Mary?
Wasn’t Mary seen by John?
Is it that John didn’t see Mary?
Is it true that John didn’t see Mary?
Is it that Mary wasn’t seen by John?
Is it true that Mary wasn’t seen by John?
Is it not that John saw Mary?
Is it not true that John saw Mary?
Isn’t it that John saw Mary?
Isn’t it true that John saw Mary?
Is it not that Mary was seen by John?
Is it not true that Mary was seen by John?
Isn’t it that Mary was seen by John?
Isn’t it true that Mary was seen by John?

Table 1. Examples of grammatical transformations

4.1.2. Paraphrase replacement

After grammatical transformations we applied paraphrase replacement using both simple rules and two external paraphrase corpora: PPDB (Ganitkevitchetal2013, ) and WordNet (Wordnet95, ). Simple rules are applied first, after which the loop PPDB → WordNet → simple rules is iterated for a specified number of times. Each iteration enacts one or more transformations on the input document. To increase the range of paraphrase candidates, we generate their Cartesian product. As a simple example, we can consider the transformations ’m → am and n’t → not in the sentence I’m sure John doesn’t win. The transformations can be applied both individually and together, and all possible combinations yield the following candidates:

  • I’m sure John doesn’t win

  • I am sure John doesn’t win

  • I’m sure John does not win

  • I am sure John does not win

We call this approach combinatorial paraphrasing. It quickly increases the size of the candidate set, allowing the optimization of transformations to best fit the target writing style.
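Assuming tokenized input and a hypothetical map from token positions to their allowed alternatives (including the original token), the Cartesian-product construction is a few lines of Python:

```python
from itertools import product

def combinatorial_paraphrases(tokens, substitutions):
    """substitutions maps a token position to the list of
    alternatives allowed there; unlisted positions keep their token."""
    options = [substitutions.get(i, [tok]) for i, tok in enumerate(tokens)]
    return [" ".join(choice) for choice in product(*options)]

tokens = ["I", "'m", "sure", "John", "does", "n't", "win"]
subs = {1: ["'m", "am"], 5: ["n't", "not"]}
cands = combinatorial_paraphrases(tokens, subs)  # 2 x 2 = 4 candidates
```

With k independent substitution sites of two options each, the candidate set grows as 2^k, which is what lets the selection stage optimize over a large space of paraphrases.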

Simple rules We apply a small set of rules to produce simple alterations with little to no semantic effect. In addition to supplying paraphrases on their own, such rules are useful for broadening the range of inputs for further processing stages, especially paraphrase replacement from PPDB.

Indefinite pronouns with the suffix -body are interchangeable with pronouns with the suffix -one. We allow any combinations of the following synonyms in the original sentence:

{anybody, anyone}, {somebody, someone}, {nobody, no one}

Grammatical elements that modify verbs in terms of possibility or judgement (can, should, must, etc.) constitute the class of modal auxiliaries. Some of these are sufficiently equivalent to allow replacement in most contexts, such as can and could. However, some are only equivalent in either an affirmed or a negated context, but not both. The context is negated if the auxiliary precedes a negation (not/n’t), and affirmed otherwise. For instance, Mary could win is (roughly) equivalent to Mary may win, but Mary could not win differs substantially from Mary may not win. Hence, could and may are (roughly) equivalent in an affirmed context but not in a negated context. The following sets represent the equivalent modal auxiliary groups we use:555We append ought with the preposition to (following the negation in negated contexts), and conversely remove to if ought is replaced with another auxiliary.

  • Affirmed context

    {might, may, could, can}, {should, ought, must}, {will, shall}

  • Negated context

    {can, could}, {should, ought, must}, {will, shall}
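A minimal sketch of context-sensitive modal replacement, using the groups above and checking whether the auxiliary directly precedes a negation (function name and data layout are hypothetical):

```python
AFFIRMED = [{"might", "may", "could", "can"},
            {"should", "ought", "must"},
            {"will", "shall"}]
NEGATED = [{"can", "could"},
           {"should", "ought", "must"},
           {"will", "shall"}]

def modal_alternatives(tokens, i):
    """Alternatives for the modal at position i, depending on
    whether it is directly followed by a negation."""
    negated = i + 1 < len(tokens) and tokens[i + 1] in {"not", "n't"}
    groups = NEGATED if negated else AFFIRMED
    for group in groups:
        if tokens[i] in group:
            return sorted(group - {tokens[i]})
    return []
```

In an affirmed context could can become may, but once the negation follows, may drops out of the replacement set, exactly as in the Mary could (not) win example above.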

English allows the negation, certain inflections of the copula verb (am, are, is), or the auxiliary have to be expressed in a contracted form. Discarding the contraction ’s due to its high ambiguity,666The clitic ’s can replace both is and has, in addition to functioning as a homophonous possessive suffix. There are also fairly complex restrictions regulating its appearance with is or has that make it challenging to use in simple paraphrase replacement. we consider the following as paraphrases:

{not, n’t}, {am, ’m}, {are, ’re}, {have, ’ve}

We produce both contracted and non-contracted variants of each, as allowed by their grammatical context.777We only produced the contraction ’ve after a pronoun in i, you, we, they, and the contracted negation n’t after an auxiliary in is, are, was, were, have, has, had, wo (variant of will), must, should, need, ought, could, can, do, does, did. We only replaced ’ve with the non-contracted version have and not vice versa, in order to avoid the ungrammatical transformation of a non-auxiliary have to ’ve.
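These context restrictions can be sketched as a lookup over tokenized input (a simplified stand-in covering a subset of the rules; names hypothetical):

```python
PRONOUNS_VE = {"i", "you", "we", "they"}
AUX_NT = {"is", "are", "was", "were", "have", "has", "had", "wo",
          "must", "should", "need", "ought", "could", "can",
          "do", "does", "did"}

def contraction_variants(tokens, i):
    """Context-checked contracted/expanded alternatives for token i."""
    tok = tokens[i]
    prev = tokens[i - 1].lower() if i else ""
    if tok == "n't":
        return ["not"]
    if tok == "not" and prev in AUX_NT:
        return ["n't"]
    if tok == "'ve":                 # one direction only: 've -> have
        return ["have"]
    if tok == "have" and prev in PRONOUNS_VE:
        return ["'ve"]
    if tok == "'m":
        return ["am"]
    if tok == "am":
        return ["'m"]
    return []
```

Note the asymmetry: ’ve always expands to have, but have contracts only after the listed pronouns, mirroring footnote 7.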

Finally, we maintain that removing commas only has a minor impact on semantic interpretation,888We are aware of counterexamples like Let’s eat(,) grandma or I love cooking(,) my dog and playing golf. However, we maintain that the intended meaning is not irrecoverable in such cases, but merely less evident due to ambiguity. Removing commas may thus introduce alternative readings, but does not remove the original interpretation. and therefore allow any comma in the sentence to be optionally removed. As a very minor transformation, comma removal is mostly relevant for increasing the range of options for paraphrase replacement from PPDB.

PPDB The Paraphrase Database (PPDB) is automatically collected from parallel bilingual corpora based on the assumption that paraphrases are often translated to the same target sentence (Ganitkevitchetal2013, ). It contains more than million paraphrases, which have subsequently been annotated for semantic relation type (equivalence, entailment etc.) and ranked with a supervised regression model (Pavlicketal2015, ). We restricted replacements to the equivalent class, resulting in phrase pairs altogether.999We derived these from the so-called “PPDB-TLDR” version, which contains lexical and phrasal paraphrases from the XL class, within the ranking scale between S to XXXL from highest to lowest similarity. The corpus can be downloaded from:

Some paraphrases contain additional information about the syntactic context in which they can be placed. Using the Python NLP library Spacy101010https://spacy.io/ (Honnibal:Johnson2015, ) to derive syntactic information from the original sentence, we only allowed replacement if the context matched that specified in the PPDB entry. Some paraphrases lack context information, but still specify the part-of-speech (POS) of the head word of the phrase.111111For instance, the multi-word phrase the brown dog has the head dog, making the POS tag associated with this phrase noun. In such cases, we used the POS tag to restrict paraphrase replacement. Based on manual evaluation on independent validation sets, these grammatical restrictions greatly improved the grammaticality of the resulting transformations.
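A toy single-token version of this POS-gated replacement, with precomputed POS tags standing in for Spacy's parse and a hypothetical ppdb dictionary (the real system matches multi-word phrases and richer context constraints):

```python
def ppdb_replacements(tokens, pos_tags, ppdb):
    """ppdb maps a word to a list of (paraphrase, required_pos)
    pairs; a replacement is allowed only when the word's POS tag
    matches the entry's requirement (None = no constraint)."""
    out = []
    for i, tok in enumerate(tokens):
        for para, req_pos in ppdb.get(tok.lower(), []):
            if req_pos is None or req_pos == pos_tags[i]:
                out.append((i, para))
    return out

# Hypothetical entries: "purchase" as a verb vs. as a noun.
ppdb = {"purchase": [("buy", "VERB"), ("acquisition", "NOUN")]}
hits = ppdb_replacements(["They", "purchase", "food"],
                         ["PRON", "VERB", "NOUN"], ppdb)
```

The POS gate blocks acquisition here because purchase is tagged as a verb, which is the kind of ungrammatical substitution the restrictions are meant to prevent.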

WordNet PPDB operates on the level of raw text, where words appear in inflected form. WordNet (Wordnet95, ), in contrast, is a manually built knowledge base of word senses, which represent possible word meanings and contain various properties, including synonyms, hyper-/hyponyms, definitions, and example sentences. All words occur in an abstract lemma form, and no inflection information is included. While WordNet has often been used in prior rule-based style transfer (Khosmood:Levinson2010, ; Mansoorizadehetal2016, ; Mihaylovaetal2016, ; Karadzhovetal2017, ), this format is a major obstacle to its direct application.

Selecting one word sense to represent a word in the sentence is known as word sense disambiguation (Navigli2009, ). For this task, we used the simple Lesk algorithm from Python’s WSD library,[12] a variant of the Lesk algorithm (Lesk1986, ) based on overlap in word sense properties (definition, example sentences, hypernyms, etc.). To enact inflection, we created a dictionary from lemmas to their surface manifestations with different POS tags, deriving these from a large text corpus.[13] As the inflected variant of a lemma, we chose the most common surface form associated with a lemma-tag combination in this corpus, annotated with NLTK’s POS tagger. To produce a paraphrase, we inflected all synonyms of the word sense (attained via simple Lesk) based on the POS tag of the original word, and replaced the original word with each of its inflected synonyms.

[12] https://github.com/alvations/pywsd
[13] We used the same corpus that was used for training the grammatical transformation method described in Section 4.1.1 (see Footnote 3).
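The replacement procedure can be sketched with toy stand-ins for the WordNet synonym lookup and the corpus-derived inflection dictionary (the real pipeline uses pywsd’s simple Lesk and NLTK; all entries below are invented for illustration):

```python
# Toy stand-ins for the real resources described above.
SYNONYMS = {"buy": ["purchase", "acquire"]}       # lemma -> synonym lemmas
SURFACE = {("purchase", "VBD"): "purchased",      # (lemma, POS tag) -> form
           ("acquire", "VBD"): "acquired"}

def wordnet_candidates(tokens, tags, lemmas):
    """Replace each word with its inflected synonyms, one at a time,
    matching the POS tag of the original word."""
    for i, (lemma, tag) in enumerate(zip(lemmas, tags)):
        for syn in SYNONYMS.get(lemma, []):
            form = SURFACE.get((syn, tag))
            if form is not None:
                yield tokens[:i] + [form] + tokens[i + 1:]

tokens = ["she", "bought", "bread"]
tags = ["PRP", "VBD", "NN"]
lemmas = ["she", "buy", "bread"]
print(list(wordnet_candidates(tokens, tags, lemmas)))
```

In ParChoice, SYNONYMS would be populated from the simple-Lesk-selected WordNet sense and SURFACE from the corpus-derived lemma-to-form dictionary.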

4.1.3. Typos

Typographical errors are an important aspect of stylistic variation (Abbasi:Chen2008, ; Brennanetal2011, ; Ho:Ng2016, ). However, given the vast number of possible misspellings and their varying effects on readability, introducing them randomly is not justifiable. Instead, we introduced typos that appear in the target corpus of the surrogate dataset. To obtain typos, we used the Python port of SymSpell,[14] applying it to the target corpus and storing a dictionary from correct spellings to their possible misspellings. We additionally spell-checked the original sentence and included original typos in this typo dictionary.[15]

[14] https://github.com/wolfgarbe/SymSpell
[15] Since this spell-checking was done at the beginning of paraphrase generation, grammatical transformation and paraphrase replacement were conducted on the corrected sentence, and typos were inserted only at the end. This allowed either retaining original typos or correcting them, depending on their effects on paraphrase selection.
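Building the typo dictionary can be sketched as follows (a toy spellchecker stands in for SymSpell, and the example words are invented):

```python
from collections import defaultdict

def build_typo_dict(target_corpus, spellcheck):
    """Map each correct spelling to the misspellings observed in the
    target corpus. `spellcheck` maps a word to its corrected form
    (identity if already correct)."""
    typos = defaultdict(set)
    for sentence in target_corpus:
        for word in sentence.split():
            correct = spellcheck(word)
            if correct != word:
                typos[correct].add(word)
    return typos

# Toy spellchecker standing in for SymSpell (entries invented).
FIXES = {"teh": "the", "definately": "definitely"}
spellcheck = lambda w: FIXES.get(w, w)

corpus = ["teh food was definately good", "the service was slow"]
print(dict(build_typo_dict(corpus, spellcheck)))
```

At imitation time, each correct word in a candidate sentence whose spelling appears as a key in this dictionary can optionally be replaced by one of its target-corpus misspellings.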

4.2. Paraphrase selection

A selection from the set of paraphrase candidates is made twice in the ParChoice pipeline. Both selections are based on the surrogate profiler, a machine learning (ML) model trained on a surrogate dataset to distinguish between the source and target writing styles. First, among the candidates produced by the paraphrase replacement module, only those that the surrogate profiler maps to the target class are retained for potential further iterations. Second, after paraphrase generation, ParChoice selects the paraphrase candidate that is assigned the highest target-class probability by the surrogate profiler.
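The two-stage selection can be sketched as follows, with a toy unigram log-odds scorer standing in for the surrogate profiler (all names and data below are ours, invented for illustration):

```python
import math

def target_log_odds(tokens, target_freq, source_freq):
    """Toy stand-in for the surrogate profiler: add-one-smoothed unigram
    log-odds of the target class (the real profiler is logistic regression)."""
    return sum(math.log((target_freq.get(t, 0) + 1) /
                        (source_freq.get(t, 0) + 1)) for t in tokens)

def select_paraphrase(candidates, target_freq, source_freq):
    """Stage 1: keep candidates the scorer maps to the target class
    (log-odds > 0). Stage 2: return the highest-scoring candidate."""
    kept = [c for c in candidates
            if target_log_odds(c, target_freq, source_freq) > 0]
    pool = kept or candidates  # fall back if nothing crosses the boundary
    return max(pool, key=lambda c: target_log_odds(c, target_freq, source_freq))

target_freq = {"fabulous": 5, "placement": 2}
source_freq = {"terrific": 5, "place": 4}
candidates = [["this", "place", "is", "terrific"],
              ["this", "placement", "is", "fabulous"]]
print(" ".join(select_paraphrase(candidates, target_freq, source_freq)))
```

The fallback to the full candidate set when nothing crosses the class boundary is our assumption; the paper does not specify this behavior.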

Surrogate profiler

As our surrogate profiler, we used a logistic regression (LR) classifier with word unigrams as input features. Notably, this architecture is much simpler than the deep learning author profilers we used for evaluating imitation performance (cf. Section 6.2). If deep learning profilers can be deceived using only an LR surrogate profiler, this indicates that imitation cannot be thwarted simply by increasing author profiler complexity.
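For concreteness, a unigram LR profiler of this kind can be sketched in pure Python (a toy gradient-descent implementation; the paper presumably uses a standard LR library, and all training data here is invented):

```python
import math
from collections import Counter

def train_unigram_lr(source, target, epochs=200, lr=0.5):
    """Train a minimal logistic-regression classifier over word-unigram
    counts, distinguishing source (class 0) from target (class 1) style."""
    data = ([(Counter(s.split()), 0) for s in source] +
            [(Counter(s.split()), 1) for s in target])
    weights = {w: 0.0 for x, _ in data for w in x}
    bias = 0.0
    for _ in range(epochs):
        for x, y in data:
            z = bias + sum(weights[w] * c for w, c in x.items())
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y  # gradient of the log-loss w.r.t. the logit
            bias -= lr * grad
            for w, c in x.items():
                weights[w] -= lr * grad * c
    return weights, bias

def target_prob(sentence, weights, bias):
    """Probability that a sentence belongs to the target class."""
    x = Counter(sentence.split())
    z = bias + sum(weights.get(w, 0.0) * c for w, c in x.items())
    return 1.0 / (1.0 + math.exp(-z))

w, b = train_unigram_lr(["great food great place"], ["fabulous placement"])
print(target_prob("fabulous placement", w, b), target_prob("great food", w, b))
```

ParChoice would call `target_prob` on every paraphrase candidate and keep or select candidates by this score.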


There are three hyperparameters that we can use to control ParChoice’s paraphrase generation stage. Each has different expected effects on imitation success and semantic retainment. Additionally, increasing the number of candidate paraphrases naturally slows down ParChoice.

The first parameter concerns the truncation of the candidate set after every iteration of the paraphrase replacement loop. We call this maximum #candidates. As the truncated subset is chosen randomly, increasing it has no detrimental effect on semantic retainment. Since it also improves the likelihood of finding an optimal imitation, its only downside is the slowing down of processing speed.

The second parameter is the number of times the paraphrase replacement loop is iterated. We call this #iterations. Unlike maximum #candidates, #iterations has a negative impact on semantic retainment in addition to processing speed. This is because every iteration builds on top of prior paraphrase candidates, and hence exerts more drastic modifications to the original sentence. However, additional iterations can improve imitation success by generating candidates which would otherwise not be produced.

Finally, we also filter paraphrase candidates at the end of each loop iteration based on the edit distance between the candidates and the original document. Since edit distance is likely to correlate negatively with semantic retainment, an upper bound on edit distance reduces the probability of semantically unviable candidates. However, it also rules out more drastic transformations that might be needed for successful style imitation.
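The edit-distance filter can be sketched as follows (a minimal token-level Levenshtein implementation; whether ParChoice measures distance over tokens or characters is our assumption):

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (two-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (x != y)))    # substitution
        prev = cur
    return prev[-1]

def filter_by_edit_distance(original, candidates, max_dist):
    """Drop candidates farther than max_dist edits from the original."""
    return [c for c in candidates if edit_distance(original, c) <= max_dist]

orig = "for fast food , this place is terrific".split()
cands = ["for quick food , this place is terrific".split(),
         "totally different sentence".split()]
print(filter_by_edit_distance(orig, cands, 2))
```

A lower `max_dist` keeps candidates closer to the original (better semantic retainment), at the cost of discarding more drastic style transformations.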

Table 2 reports the expected effects of increasing each of the three hyperparameters, where “↑” marks an increase, “↓” a decrease, and “–” a lack of systematic correlation in either direction.

Parameter Effects of increasing
          Semantic Imitation Speed
Maximum #candidates – ↑ ↓
#iterations ↓ ↑ ↓
Maximum edit distance ↓ ↑ ↓
Table 2. Effects of hyperparameter choice

Data type Class labels Sentences Vocabulary size Sentence length (min. / max. / mean)

Restaurant reviews male / female

Blog posts male / female

Blog posts author 1 (“Alice”) / author 2 (“Bob”)

Political speeches Trump / Obama

Table 3. Dataset properties

5. Experiments

In this section we describe our experiments empirically comparing ParChoice with three state-of-the-art style imitation techniques as baselines. We applied each technique to four two-class datasets in both directions, and tested imitation success against two deep learning author profilers. We review the test settings in Section 5.1 and the evaluation metrics in Section 5.2.

5.1. Test settings

This section introduces the datasets, baseline models, and ParChoice hyperparameters we used in our tests.

5.1.1. Datasets

We used four two-class corpora, two of which are labeled by gender and two by author identity. The gender corpora and one author corpus are used in two of the papers presenting the baseline methods (Prabhumoyeetal2018, ; Shettyetal2018, ). The remaining author corpus is derived from one of the gender corpora (Schleretal2006, ) and has not been used in prior style imitation research. Here we describe each dataset and contrast some of their main characteristics. Size- and length-related information is presented in Table 3. All datasets can be downloaded from the baseline models’ project pages on GitHub.[16]

[16] YG: https://github.com/shrimai/Style-Transfer-Through-Back-Translation
BG and TO: https://github.com/rakshithShetty/A4NT-author-masking/blob/master/README.md
AB is derived from BG.

Yelp gender (YG) Originally collected by Reddy and Knight (Reddy:Knight2016, ), this corpus consists of restaurant reviews labeled by the gender attribute (based on the author’s proper name). It was used by Prabhumoye et al. for their BT method described in Section 2.2 (Prabhumoyeetal2018, ). The datapoints are typically complete and grammatically correct sentences containing mostly common English terminology, and they rarely contain typos.

Blog gender (BG) The blog dataset, collected by Schler et al. (Schleretal2006, ), contains blog posts labeled by authorship, gender, age, and zodiac sign. In contrast to YG, colloquial expressions, abbreviations (e.g. LOL, OMG), and interjections (e.g. ouch, sigh) appear often in this dataset. This makes it more challenging for paraphrasing, as both dependency parsing (required for grammatical transformations) and paraphrase databases (PPDB, WordNet) are built for mostly grammatically correct and typo-free input. Another important contrast to YG is the much larger vocabulary size, which is indicative of the wider range of terminology used in blogs in comparison to restaurant reviews.

Alice-Bob (AB) From the blog dataset, we extracted the two authors with the most sentences, and used them to build a stand-alone dataset for author profiling. While BG only contains sentences below a fixed length threshold, we used all sentences by these authors to increase dataset size. Therefore the maximum sentence length is higher in AB than in BG, as shown in Table 3. One author is female and the other male; for convenience, we call them Alice and Bob, respectively. Of our four datasets, only AB has not been used in prior style imitation research.

Trump-Obama (TO) This dataset includes speeches by the two most recent US presidents, Barack Obama and Donald Trump, collected by Shetty et al. (Shettyetal2018, ). However, the original dataset is highly imbalanced, and we were able to significantly improve profiling accuracy by truncating the larger class (Obama) to the same size as the smaller. We used this balanced version in our experiments. As this dataset consists of prepared speeches, it is generally high in grammatical quality and contains few colloquialisms, abbreviations, or interjections.

5.1.2. Baseline style imitation techniques

We review the technical details involved in each baseline we compared against ParChoice. Some were obtained as pre-trained models from the projects’ GitHub pages, and the rest we trained ourselves based on the descriptions in the original papers. The basic functions of the three baselines were outlined in Section 2.2.

Cross-aligned autoencoder (CAE) The authors of the CAE paper (Shenetal2017, ) did not use any of the datasets we experimented on. However, in Prabhumoye et al.’s paper presenting BT (Prabhumoyeetal2018, ), CAE is used as a baseline against BT. The authors trained both on the YG dataset, and compared their imitation performance as well as human preference on sentiment retainment. In a user study, Prabhumoye et al. asked human annotators the question “Which transferred sentence maintains the same sentiment of the source sentence in the same semantic context?” (Prabhumoyeetal2018, ). Additionally, the authors manually evaluated the fluency of randomly chosen sentences. While CAE performed slightly better on gender imitation than BT, BT outperformed CAE in both sentiment retainment and fluency: in the user study, BT-generated sentences were deemed superior to CAE-generated sentences far more often than the converse, and BT also received a higher average fluency score.

Back-translation (BT) Prabhumoye et al. (Prabhumoyeetal2018, ) provide the BT model trained on YG on the project’s GitHub page.[17] We used this for our YG experiments, and trained BT ourselves for the other datasets. When trained on AB, the resulting model exhibited very poor semantic retainment due to the small size of the dataset, only repeating a few words. To add lexical variation, we initialized the model’s weights with those of the original French-English translator, which is not tailored to any target style. This markedly improved the model’s behavior by making lexical choices more diverse and reducing repetition. We did not initialize weights for the other datasets due to their larger size.

[17] https://github.com/shrimai/Style-Transfer-Through-Back-Translation

A4NT The GitHub page of A4NT[18] includes pre-trained models for BG and TO.[19] We trained A4NT ourselves for YG and AB. While we used the full dataset to train the initial classifier and generator for YG, hardware limitations required us to truncate YG for training the GAN; we used a subset of the sentences in this stage.

[18] https://github.com/rakshithShetty/A4NT-author-masking/blob/master/README.md
[19] The pre-trained A4NT TO model has been trained on the original Trump-Obama dataset, whereas we used a balanced version for training our profilers and the other imitation techniques (CAE, BT), as explained in Section 5.1.1. For comparison we also trained A4NT on our balanced version of TO, but used the pre-trained model in our tests as it performed better in imitation.

In the paper presenting A4NT, Shetty et al. experimented on BG, TO, and blog author age (the same corpus as BG but with different labels). On all corpora, they managed to reduce classification accuracy (combined from both classes) substantially. They further measured semantic retainment via the METEOR score (Banerjee:Lavie2005, ), and conducted a user study in which they compared A4NT to ILT via Google Translate (cf. Section 2). A4NT received a high evaluation score, surpassing that of ILT.

5.1.3. ParChoice

We applied the ParChoice pipeline as described in Section 4 to test sets derived from YG, BG, AB, and TO. As hyperparameters (cf. Section 4.2) we used one paraphrase replacement iteration, together with fixed values for the maximum candidate set size and the maximum edit distance (cf. Section 7 for discussion).

As mentioned in Section 4.2, our surrogate profiler was a word-based LR classifier. We capped the maximum number of input features, and trained the surrogate profiler on a separate surrogate dataset for each test setting. Importantly, this surrogate dataset was always independent of the data used to train the author profiler. In this respect, ParChoice is less restrictive than any of the three baseline methods, where the same data is used to train the style imitation system and the author profiler.

We constructed the surrogate datasets from the original test sets, not used for training the author profiler. For consistency across experiments, we always used a fixed fraction of the author profiler training set size as the surrogate dataset size. We then conducted our tests on a separate test set unseen by both the author profiler and the surrogate profiler. Training and test set sizes are shown in Table 4.

Dataset Training set size (author profiler) Training set size (surrogate profiler) Test set size
Table 4. Training and test set sizes

5.2. Evaluation

We measured the effectiveness of each style imitation technique on two fronts: semantic retainment and imitation success.

5.2.1. Semantic retainment

We evaluated semantic retainment using multiple automatic and manual metrics. The former were measured for the entire test sets and the latter for smaller subsets. For manual evaluations, we used both our own analysis and a user study by independent human evaluators to estimate the imitation techniques’ ability to retain the original meaning.

Automatic evaluation

The most basic estimate of semantic retainment is n-gram overlap with the original sentence, standardly measured via the BLEU score developed for evaluating MT (Papinenietal02, ; Coughlin2003, ). However, BLEU penalizes any divergence from the original text, which is undesirable for our purposes. As a more sophisticated metric, the METEOR score (Banerjee:Lavie2005, ) uses additional paraphrase tables, which makes it more appropriate for our task. We further provide two METEOR scores: one for all imitations, and another for those that are non-identical to the original sentence. The latter is more relevant for evaluating semantic retainment when transformations occur, which is our main interest.
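The n-gram overlap underlying BLEU can be illustrated with a minimal clipped-precision sketch (single reference, no brevity penalty or geometric averaging over n-gram orders; this is not the full BLEU or METEOR implementation):

```python
from collections import Counter

def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of candidate against a single reference:
    the fraction of candidate n-grams that also occur in the reference,
    with counts clipped to the reference's counts."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return overlap / total if total else 0.0

ref = "he really made us feel welcome".split()
cand = "he genuinely made us feel welcome".split()
print(ngram_precision(cand, ref, 1))  # 5 of 6 unigrams overlap
```

A metric built purely on such overlap penalizes the substitution of "genuinely" for "really" even though the meaning is preserved, which is exactly why METEOR’s paraphrase tables are more appropriate here.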

Direction METEOR Imitation technique

CAE BT A4NT ParChoice

female→male all 44.80
non-identical 43.81

male→female all 46.98
non-identical 45.78

female→male all 51.27
non-identical 46.14

male→female all 53.03
non-identical 49.47

Alice→Bob all 44.60
non-identical 43.33

Bob→Alice all 45.71
non-identical 44.22

Trump→Obama all 47.95
non-identical 47.54

Obama→Trump all 47.14
non-identical 46.90

Table 5. METEOR scores (highest results emphasized)

Manual evaluation For the eight imitation directions, we manually compared a subset of the original test sentences to their imitated counterparts, and classified each imitation into one of the following classes:

  • identical to the original sentence

  • paraphrase of the original sentence

  • omission: contains only part of the original sentence

  • other error(s): inappropriate changes or additions

User study In addition to our own manual evaluation, we conducted a user study with participants who were not involved in the development of ParChoice or any of the baselines. The participants varied in native language, gender, age, and education level, with most holding a university degree, typically at the Master’s level.

We administered the study as an online questionnaire. Each user was allocated a unique set of sentences drawn from the YG dataset. They were shown the original sentence along with its imitation into the opposite gender’s writing style by each imitation technique we used (in randomized order): CAE, BT, A4NT, and ParChoice. We only used imitations that were non-identical to the original sentence. Replicating the method used by Shetty et al. (Shettyetal2018, ) for evaluating A4NT, we asked the users to compare all four imitation variants to the original, and rate them on a six-point scale based on how close the meaning of the variant was to that of the original. We gave the users no indication that the test sentences were computer-generated for the purpose of style imitation.

5.2.2. Imitation success

To evaluate imitation success, we trained two author profilers to classify the original and imitation test sets produced with the baselines and ParChoice. We used the following state-of-the-art NLP techniques to construct our profilers: a long short-term memory network (LSTM) derived from Shetty et al. (Shettyetal2018, ), and a convolutional neural network (CNN) derived from Prabhumoye et al. (Prabhumoyeetal2018, ). Both use words as input features. The code we used for training these is available on the respective papers’ GitHub pages.[20]

[20] LSTM: https://github.com/rakshithShetty/A4NT-author-masking/blob/master/README.md
CNN: https://github.com/shrimai/Style-Transfer-Through-Back-Translation

6. Results

This section presents the results on semantic retainment and imitation success. Overall, ParChoice exhibits much higher semantic retainment than the baselines based on both automatic and manual measurements, and superior imitation performance to A4NT, which is the only viable baseline in terms of semantic retainment.

Direction Imitation Transformations: correct Transformations: errors Semantic retainment rate

technique Identical Paraphrase Omission Other All Non-identical

female→male CAE
ParChoice 56% 53%

male→female CAE
ParChoice 42% 37%

female→male CAE
ParChoice 68% 64%

male→female CAE
ParChoice 50% 43%

Alice→Bob CAE
ParChoice 56% 49%

Bob→Alice CAE
ParChoice 62% 57%

Trump→Obama CAE
ParChoice 60% 52%

Obama→Trump CAE
ParChoice 62% 60%

Table 6. Manual semantic retainment measures in each direction (highest retainment emphasized). Note that identical transformations (column 4) increase the semantic retainment rate (column 8) but do not constitute a good style imitation. The last column therefore represents semantic retainment considering only those sentences that underwent at least some paraphrasing.
Imitation Grade
technique Mean Median
ParChoice 2.7 3 41% 24%
Table 7. User study results (grade scale 1–6)

6.1. Semantic retainment

We review the results from the three tests we used to measure semantic retainment: automatic evaluation, manual evaluation, and the user study. METEOR scores are presented in Table 5, manual evaluation results in Table 6, and user study results in Table 7.

Direction Author profiler Profiler accuracy
Original A4NT ParChoice

female→male LSTM 0.46
CNN 0.32
male→female LSTM 0.48
CNN 0.59

female→male LSTM 0.68
CNN 0.62
male→female LSTM 0.36
CNN 0.32

Alice→Bob LSTM 0.62
CNN 0.64
Bob→Alice LSTM 0.71
CNN 0.63

Trump→Obama LSTM 0.35
CNN 0.47
Obama→Trump LSTM 0.23
CNN 0.58

Table 8. Imitation results (lower score is better for imitation, lowest scores emphasized)

Automatic evaluation ParChoice and A4NT always clearly outperform CAE and BT in METEOR. Especially on the smaller datasets (AB, TO), CAE and BT produce very poor scores, which indicates that their outputs bear little to no resemblance to the original sentence. The score of ParChoice exceeds that of A4NT in most cases, and in all cases when identical sentences are discarded. Hence, ParChoice more often produces paraphrases recognized by the METEOR paraphrase table. This indicates that ParChoice is able to retain the original meaning better than any of the baselines. As will be seen, results from our manual evaluations and the user study strongly corroborate this conclusion.

Manual evaluation Table 6 presents results from our manual evaluation on sentences from all four datasets in both directions. The last two columns contain the most important part: the semantic retainment rate, which shows how often applying the imitation technique resulted in an acceptable paraphrase of the original, based on our judgement. As with METEOR, we distinguish between the semantic retainment rate of all evaluated sentences and that of the sentences non-identical to the original. The latter indicates how often the imitation technique is able to produce acceptable paraphrases of the original sentence without simply reproducing it. As can be seen from the table, only ParChoice produced these systematically, with the baselines achieving far lower retainment rates.

Of the baseline techniques, CAE and BT produced practically no paraphrases, and rarely reproduced the original. This is especially true on the small datasets (AB, TO), where imitations bore no resemblance to the original sentence and simply repeated certain words prevalent in the target corpus. With A4NT the reproduction rate was by far the highest, but non-identical paraphrases remained very rare, as seen in the drastic decrease in the semantic retainment rate when only non-identical sentences are considered (last column). ParChoice stands in clear contrast to all baselines in producing many paraphrases, resulting in by far the highest overall semantic retainment rate across all datasets and imitation directions.

Most errors were due to inappropriate word replacements. CAE and BT usually produced novel sentences that bore only a distant lexical and grammatical similarity to the original. On the BG dataset, A4NT also generated a significant number of omissions, where some words were simply left out of the original and nothing else was changed. This likely partly accounts for A4NT’s high METEOR score on BG. ParChoice’s errors were most often caused by inappropriate WordNet synonyms; reducing these would require improving the word sense disambiguation algorithm used to map words to WordNet senses. This is an important consideration in the future development of ParChoice.

Even though all imitation techniques occasionally produced grammatical errors, all techniques trained on the large datasets (YG, BG) typically produced full English sentences. When trained on the small datasets (AB, TO), CAE and BT repeated a small number of words in largely ungrammatical combinations, whereas A4NT produced mostly fluent sentences despite failing in semantic retainment. Uniquely, ParChoice’s semantic retainment did not deteriorate with smaller training corpora. This is due to the distinction between paraphrase generation and paraphrase selection, only the latter being affected by the target corpus.

User study Results from the user study are presented in Table 7. All models received grades across the whole spectrum. However, the highest mean grade is only 2.7 (achieved by ParChoice), which indicates that all imitation techniques had trouble with semantic retainment. Still, ParChoice performed significantly better than any of the baselines; CAE and BT in particular received very poor mean grades.

ParChoice particularly stands out in the percentage of imitations that obtained high grades. CAE, BT, and A4NT received one of the two highest grades (5 or 6) far less often than ParChoice, which achieved these in over 40% of imitations. Around one fourth of all ParChoice imitations received the best grade (6), surpassing the baselines even more drastically. Finally, the majority of ParChoice imitations fell on the upper half of the six-point scale, whereas the majority of all baseline imitations received one of the two lowest grades.

Technique Sentence


Original he really made us feel welcome and i will definitely be back again !
CAE he will definitely be back again and i will definitely be back again .
BT he really did n’t have to say and i would be back again !
ANT he really made it key competition and that will definitely be back again .
ParChoice he genuinely made us feel welcome and i shall decidedly be back again !

Original and another cookie , with raspberry filling then dusted with powdered sugar .
CAE for 0 years , and i gave me with me .
BT . and a few other restaurants , with the rest of meat then dusted with the beef in town !
ANT and another organic , with unusual steak then powdered sugar .
ParChoice and another biscuit with raspberry repleting later dusted with powdered sugar .


Original there is decent value here , but the food is as you should expect .
CAE there is better than the food , but this place was great .
BT there is a good experience here , but the food is like you can get it .
ANT there was beautiful def , but the food is as expensive .
ParChoice there is dignified value here , but the food is as you ought to wait .

Original for fast food , this place is terrific .
CAE for food , this place is amazing .
BT for dessert , this place is amazing .
ANT for fast food , this place is terrific .
ParChoice this placement is fabulous , for quick food .

Table 9. Example imitations from YG

6.2. Imitation

To evaluate imitation success, we measured author profiler accuracy on both classes before and after imitation. However, CAE and BT completely failed in the semantic retainment task. In Table 8 we therefore only present a comparison between ParChoice and A4NT, and provide the full comparison in Appendix A. We consider random-guess accuracy (50% in our two-class settings) as the imitation limit, and hence an accuracy change from above to below 50% as successful imitation.

YG Both the LSTM and the CNN profiler achieved high accuracies in both directions on YG. ParChoice’s performance always exceeded that of A4NT on YG, and ParChoice succeeded in imitation in all cases except male→female measured with the CNN profiler. In contrast, A4NT never crossed the imitation limit. Both A4NT and ParChoice performed better in the female→male direction than conversely.

BG Original profiling accuracy was highly imbalanced on BG, with only the female class achieving a decent score with either author profiler. Male classification accuracy remained low with both profilers. Hence, while both A4NT and ParChoice decreased accuracy in the male→female condition, this does not satisfy our criteria for successful imitation. Furthermore, A4NT did not lower profiling accuracy in the female→male direction at all, instead increasing it by a few percent for both author profilers. ParChoice failed to imitate as well, but still decreased accuracy on both the LSTM and the CNN.

AB Both author profilers achieved high accuracies on the original AB data. Neither A4NT nor ParChoice managed to decrease accuracy to below 50%. ParChoice performed better in the Alice→Bob direction and A4NT in the converse, for both the LSTM and the CNN profiler. Imitation performance was thus highly similar between the two techniques on the AB dataset, but somewhat concentrated in opposite directions.

TO Unlike with the other datasets, on TO we found a clear difference in performance between the LSTM and CNN profilers, the former performing systematically better on the original test set for both classes. Interestingly, however, A4NT was able to lower the LSTM’s accuracy much more than the CNN’s, clearly surpassing the imitation limit with the LSTM in both directions. The CNN was far more resistant to A4NT’s imitations, to the point of its accuracy increasing in the Obama→Trump direction. ParChoice did not surpass the imitation limit on TO, but still systematically lowered accuracy with both profilers in both directions.

7. Discussion

Interpretation of results The empirical results presented in Section 6 can be summarized as three main findings. First, the CAE and BT baselines failed to produce acceptable paraphrases in any dataset, as measured both by our own manual evaluations and the user study involving independent evaluators. Since semantic retainment is mandatory for any real-world style imitation, we assert that CAE and BT are not usable for this purpose.

Second, even though A4NT maintained superior semantic retainment to the other baselines, this was largely due to its tendency to reproduce sentences rather than its ability to enact semantically acceptable transformations. While A4NT received better overall results in the user study than CAE and BT, its low mean and median grades illustrate the poor quality of its imitations in human judgement. Since the user study only included imitations that were non-identical to the original sentence, our results indicate that when A4NT enacts a transformation, it is highly unlikely to retain the original meaning. Manual evaluation results accord with this conclusion. Discarding identical imitations was likely a major reason why we received much lower scores for A4NT than Shetty et al.’s (Shettyetal2018, ) original user study. ParChoice performed significantly better than A4NT in semantic retainment, in both the user study and the manual evaluation.

Third, ParChoice outperformed A4NT in most imitation tasks, and consistently lowered profiling accuracy. A4NT, in contrast, resulted in higher post-imitation accuracy in three test settings. This result is especially important considering that ParChoice was trained on an independent training set and a simple, unrelated surrogate profiler (LR), whereas A4NT used a training set and model architecture overlapping with those of the LSTM author profiler.

Tuning ParChoice We chose the hyperparameter values reported in Section 5.1.3 based on preliminary tests on the YG dataset. Increasing #iterations had a systematic but relatively small positive effect on imitation success. Our maximum #candidates value seemed to be high enough for our purposes, as increasing it had no marked effect on imitation. We did not systematically experiment with different values of the maximum edit distance parameter, but lowering it would likely improve semantic retainment at the expense of imitation performance. More experimentation is needed to find optimal hyperparameter values for ParChoice.

8. Conclusions and future work

We showed that ParChoice significantly improves on the state-of-the-art in style imitation in terms of achieving imitation success while retaining semantic content. However, much room for improvement remains, especially related to improper synonym replacement from WordNet. In future work, we plan on experimenting with more sophisticated word sense disambiguation algorithms to reduce such errors. We will further empirically study the effects of parameter choice in ParChoice. Additionally, applying ParChoice to datasets beyond gender and author profiling, as well as multi-class settings, are important for the future development of this technique.

We thank Sam Spilsbury and Andrei Kazlouski for their help in implementation, and Mika Juuti for valuable discussions related to the project. This work is supported in part by the Helsinki Doctoral Education Network in Information and Communications Technology (HICT).


  • [1] Ahmed Abbasi and Hsinchun Chen. Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace. ACM Transactions on Information Systems, 26(2):1–29, 2008.
  • [2] Afroza Ahmed and King Ip Lin. Negative sentence generation by using semantic heuristics. In The 52nd Annual ACM Southeast Conference (ACMSE 2014), 2014.
  • [3] Mishari Almishari, Ekin Oguz, and Gene Tsudik. Fighting Authorship Linkability with Crowdsourcing. In Proceedings of the second ACM conference on Online social networks, pages 69–82, 2014.
  • [4] Douglas Bagnall. Author identification using multi-headed recurrent neural networks. In Working Notes of CLEF 2015 - Conference and Labs of the Evaluation Forum, 2015.
  • [5] Satanjeev Banerjee and Alon Lavie. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005.
  • [6] Jorge Baptista, Sandra Lourenco, and Nuno Mamede. Automatic generation of exercises on passive transformation in Portuguese. In IEEE Congress on Evolutionary Computation (CEC), pages 4965–4972, 2016.
  • [7] Yonatan Bilu, Daniel Hershcovich, and Noam Slonim. Automatic claim negation: why, how and when. In Proceedings of the 2nd Workshop on Argumentation Mining, pages 84–93, 2015.
  • [8] Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015.
  • [9] Michael Brennan, Sadia Afroz, and Rachel Greenstadt. Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity. ACM Transactions on Information and System Security, 15(3), 2011.
  • [10] Michael Brennan and Rachel Greenstadt. Practical Attacks Against Authorship Recognition Techniques. In Proceedings of the Twenty-First Conference on Innovative Applications of Artificial Intelligence, 2009.
  • [11] Marcelo Luiz Brocardo, Issa Traore, Isaac Woungang, and Mohammad S. Obaidat. Authorship verification using deep belief network systems. International Journal of Communication Systems, 30(12), 2017.
  • [12] Aylin Caliskan and Rachel Greenstadt. Translate once, translate twice, translate thrice and attribute: Identifying authors and machine translation tools in translated text. In Semantic Computing (ICSC), 2012 IEEE Sixth International Conference on, pages 121–125, 2012.
  • [13] Deborah A. Coughlin. Correlating automated and human assessments of machine translation quality. In Proceedings of MT Summit IX, pages 23–27, 2003.
  • [14] Siobahn Day, James Brown, Zachery Thomas, India Gregory, Lowell Bass, and Gerry Dozier. Adversarial Authorship, AuthorWebs, and entropy-based evolutionary clustering.
  • [15] Christiane Fellbaum (ed.). WordNet: An Electronic Lexical Database. MIT Press, Cambridge, 1998.
  • [16] Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. PPDB: The paraphrase database. In Proceedings of NAACL-HLT, pages 758–764, 2013.
  • [17] Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), pages 2414–2423, 2016.
  • [18] Zhenhao Ge and Yufang Sun. Domain specific author attribution based on feedforward neural network language models. In Proceedings of the 5th International Conference on Pattern Recognition Applications and Methods (ICPRAM), pages 597–604, 2016.
  • [19] Yoav Goldberg. A primer on neural network models for natural language processing. Journal of Artificial Intelligence Research, 57(1):345–420, 2016.
  • [20] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In Proceedings of the International Conference on Neural Information Processing Systems (NIPS), pages 2672–2680, 2014.
  • [21] Jack Grieve. Quantitative authorship attribution: An evaluation of techniques. Literary and Linguistic Computing, 22(3), 2007.
  • [22] Tommi Gröndahl and N. Asokan. EAT2seq: A generic framework for controlled sentence transformation without task-specific training. CoRR, abs/1902.09381, 2019.
  • [23] Thanh Nghia Ho and Wee Keong Ng. Application of stylometry to DarkWeb forum user identification. In Information and Communications Security, pages 173–183, 2016.
  • [24] Matthew Honnibal and Mark Johnson. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373–1378, 2015.
  • [25] Patrick Juola. Large-scale experiments in authorship attribution. English Studies, 93(3):275–283, 2012.
  • [26] Georgi Karadzhov, Tsvetomila Mihaylova, Yasen Kiprov, Georgi Georgiev, Ivan Koychev, and Preslav Nakov. The case for being average: A mediocrity approach to style masking and author obfuscation. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 173–185, 2017.
  • [27] Foaad Khosmood. Comparison of sentence-level paraphrasing approaches for statistical style transformation. In Proceedings of the 2012 International Conference on Artificial Intelligence, 2012.
  • [28] Foaad Khosmood and Robert Levinson. Automatic natural language style classification and transformation. In Proceedings of the 2008 BCS-IRSG Conference on Corpus Profiling, page 3, 2008.
  • [29] Foaad Khosmood and Robert Levinson. Toward automated stylistic transformation of natural language text. In Proceedings of the Digital Humanities, pages 177–181, 2009.
  • [30] Foaad Khosmood and Robert Levinson. Automatic synonym and phrase replacement show promise for style transformation. In The Ninth International Conference on Machine Learning and Applications, 2010.
  • [31] Michael Lesk. Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream cone. In Proceedings of the 5th Annual International Conference on Systems Documentation, pages 24–26, 1986.
  • [32] Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. Paraphrase generation with deep reinforcement learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 3865–3878, 2018.
  • [33] Lajanugen Logeswaran, Honglak Lee, and Samy Bengio. Content preserving text generation with attribute controls. In 32nd Conference on Neural Information Processing Systems (NeurIPS), 2018.
  • [34] Minh-Thang Luong and Christopher D. Manning. Achieving open vocabulary neural machine translation with hybrid word-character models. In Association for Computational Linguistics (ACL), Berlin, Germany, August 2016.
  • [35] Minh-Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1412–1421, Lisbon, Portugal, September 2015. Association for Computational Linguistics.
  • [36] Nathan Mack, Jasmine Bowers, Henry Williams, Gerry Dozier, and Joseph Shelton. The best way to a strong defense is a strong offense: Mitigating deanonymization attacks via iterative language translation. International Journal of Machine Learning and Computing, 5(5):409–413, 2015.
  • [37] Muharram Mansoorizadeh, Taher Rahgooy, Mohammad Aminiyan, and Mahdy Eskandari. Author obfuscation using WordNet and language models – notebook for PAN at CLEF 2016. In CLEF 2016 Evaluation Labs and Workshop – Working Notes Papers.
  • [38] Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella Bernardi, and Roberto Zamparelli. A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of LREC, pages 216–223, 2014.
  • [39] Andrew W.E. McDonald, Sadia Afroz, Aylin Caliskan, Ariel Stolerman, and Rachel Greenstadt. Use fewer instances of the letter i: Toward writing style anonymization. In Simone Fischer-Hübner and Matthew Wright, editors, Privacy Enhancing Technologies. Volume 7384 of Lecture Notes in Computer Science, pages 299–318. 2012.
  • [40] Tsvetomila Mihaylova, Georgi Karadjov, Preslav Nakov, Yasen Kiprov, Georgi Georgiev, and Ivan Koychev. SU@PAN’2016: Author obfuscation – notebook for PAN at CLEF 2016. In CLEF 2016 Evaluation Labs and Workshop – Working Notes Papers.
  • [41] George A. Miller. WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41, 1995.
  • [42] Frederick Mosteller and David L. Wallace. Inference and Disputed Authorship: The Federalist. Addison-Wesley, Reading, Massachusetts, 1964.
  • [43] Arvind Narayanan, Hristo Paskov, Neil Zhenqiang Gong, John Bethencourt, Emil Stefanov, Eui Chul Richard Shin, and Dawn Song. On the feasibility of internet-scale author identification. In Proceedings of the IEEE Symposium on Security and Privacy, pages 300–314, 2012.
  • [44] Roberto Navigli. Word sense disambiguation: A survey. ACM Computing Surveys, 41(2):1–69, 2009.
  • [45] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL-2002: 40th Annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.
  • [46] Ellie Pavlick, Pushpendre Rastogi, Juri Ganitkevitch, Benjamin Van Durme, and Chris Callison-Burch. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Short Papers), pages 425–430, 2015.
  • [47] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
  • [48] Martin Potthast, Sarah Braun, Tolga Buz, Fabian Duffhauss, Florian Friedrich, Jörg Marvin Gülzow, Jakob Köhler, Winfried Lötzsch, Fabian Müller, Maike Elisa Müller, Robert Paßmann, Bernhard Reinke, Lucas Rettenmeier, Thomas Rometsch, Timo Sommer, Michael Träger, Sebastian Wilhelm, Benno Stein, Efstathios Stamatatos, and Matthias Hagen. Who wrote the web? Revisiting influential author identification research applicable to information retrieval. In Nicola Ferro, Fabio Crestani, Marie-Francine Moens, Josiane Mothe, Fabrizio Silvestri, Giorgio Maria Di Nunzio, Claudia Hauff, and Gianmaria Silvello, editors, Advances in Information Retrieval, pages 393–407. Springer International Publishing, 2016.
  • [49] Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhutdinov, and Alan W. Black. Style Transfer Through Back-Translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, pages 866–876, 2018.
  • [50] Press Freedom Index. RSF Index 2018: Hatred of journalism threatens democracies. https://rsf.org/en/rsf-index-2018-hatred-journalism-threatens-democracies (May 1st 2018).
  • [51] Ella Rabinovich, Shachar Mirkin, Raj Nath Patel, Lucia Specia, and Shuly Wintner. Personalized machine translation: Preserving original author traits. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 1074–1084, 2016.
  • [52] Sravana Reddy and Kevin Knight. Obfuscating gender in social media writing. In Proceedings of Workshop on Natural Language Processing and Computational Social Science, pages 17–26, 2016.
  • [53] Jonathan Schler, Moshe Koppel, Shlomo Argamon, and James Pennebaker. Effects of age and gender on blogging. In Proceedings of AAAI Spring Symposium on Computational Approaches for Analyzing Weblogs, 2006.
  • [54] Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. Style transfer from non-parallel text by cross-alignment. In Proceedings of Neural Information Processing Systems (NIPS), 2017.
  • [55] Rakshith Shetty, Bernt Schiele, and Mario Fritz. A4NT: Author attribute anonymity by adversarial training of neural machine translation. In 27th USENIX Security Symposium, pages 1633–1650, 2018.
  • [56] Efstathios Stamatatos. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology, 60(3):538–556, 2009.
  • [57] Kl Surendran, O.P. Harilal, Hrudya Poroli, Prabaharan Poornachandran, and N.K. Suchetha. Stylometry detection using deep learning. In Computational Intelligence in Data Mining, pages 749–757, 2017.
  • [58] Neal Tempestt, Kalaivani Sundararajan, Aneez Fatima, Yiming Yan, Yingfei Xiang, and Damon Woodard. Surveying stylometry techniques and applications. ACM Computing Surveys, 50(6), 2017.
  • [59] Louis Tesnière. Éléments de syntaxe structurale. Klincksieck, Paris, 1959.
  • [60] Sander Wubben, Antal van den Bosch, and Emiel Krahmer. Sentence simplification by monolingual machine translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 1015–1024, 2012.
  • [61] Wei Xu, Alan Ritter, Bill Dolan, Ralph Grishman, and Colin Cherry. Paraphrasing for style. In Proceedings of COLING, pages 2899–2914, 2012.
  • [62] Tom Young, Devamanyu Hazarika, Soujanya Poria, and Erik Cambria. Recent trends in deep learning based natural language processing. CoRR, abs/1708.02709, 2017.
  • [63] Rong Zheng, Jiexun Li, Hsinchun Chen, and Zan Huang. A framework of authorship identification for online messages: Writing style features and classification techniques. Journal of the American Society for Information Science and Technology, 57(3):378–393, 2006.

Appendix A All imitation results

Dataset  Direction       Author profiler  Profiler accuracy (Original / CAE / BT / A4NT / ParChoice)

YG       female→male     LSTM
YG       male→female     LSTM

BG       female→male     LSTM
BG       male→female     LSTM

AB       Alice→Bob       LSTM
AB       Bob→Alice       LSTM

TO       Trump→Obama     LSTM
TO       Obama→Trump     LSTM

Table 10. All imitation results

Table 10 presents the imitation results from all style imitation techniques, including CAE and BT. On YG their performance was similar to ParChoice's: somewhat worse in the female→male direction and slightly better in the male→female direction. On BG they failed to bring accuracy down in the female→male direction, behaving similarly to A4NT in this respect. The seemingly best imitation performance was achieved by CAE and BT on the AB dataset, where both were able to push accuracy down in both directions. As our semantic retainment evaluation demonstrated, however, this was due to CAE and BT largely repeating words common in the target corpus but unrelated to the original sentence. Finally, both CAE and BT successfully imitated Trump but not Obama on TO, with CAE actually pushing accuracy higher in the Trump→Obama direction.