
A^4NT: Author Attribute Anonymity by Adversarial Training of Neural Machine Translation

by   Rakshith Shetty, et al.

Text-based analysis methods can reveal privacy-relevant author attributes such as the gender, age and identity of a text's author. Such methods can compromise the privacy of an anonymous author even when the author tries to remove privacy-sensitive content. In this paper, we propose an automatic method, called Adversarial Author Attribute Anonymity Neural Translation (A^4NT), to combat such text-based adversaries. We combine sequence-to-sequence language models used in machine translation and generative adversarial networks to obfuscate author attributes. Unlike machine translation techniques, which need paired data, our method can be trained on unpaired corpora of text containing different authors. Importantly, we propose and evaluate techniques to impose constraints on our A^4NT to preserve the semantics of the input text. A^4NT learns to make minimal changes to the input text to successfully fool author attribute classifiers, while aiming to maintain the meaning of the input. We show through experiments on two different datasets and three settings that our proposed method is effective in fooling the author attribute classifiers and thereby improving the anonymity of authors.





1 Introduction

Natural language processing (NLP) methods, including stylometric tools, enable identification of authors of anonymous texts by analyzing stylistic properties of the text [1, 2, 3]. NLP-based tools have also been applied to profiling users by determining their private attributes like age and gender [4]. These methods have been shown to be effective in various settings like blogs, Reddit comments and Twitter text [5], and in large-scale settings with up to 100,000 possible authors [6]. In a recent famous case, authorship attribution tools were used to help confirm J.K. Rowling as the real author of The Cuckoo's Calling, which Ms. Rowling had written under a pseudonym [7]. This case highlights the privacy risks posed by these tools.

Apart from the threat of identifying an anonymous author, NLP-based tools also make authors susceptible to profiling. Text analysis has been shown to be effective in predicting age group [8], gender [9] and, to an extent, even political preferences [10]. By determining such private attributes, an adversary can build user profiles, which have been used for manipulation through targeted advertising, both for commercial and political goals [11].

Since NLP-based profiling methods utilize the stylistic properties of the text to break the author's anonymity, they are immune to defense measures like pseudonymity, masking IP addresses or obfuscating posting patterns. The only way to combat them is to modify the content of the text to hide stylistic attributes. Prior work has shown that while people are capable of altering their writing styles to hide their identity [12], the success rate depends on the author's skill, and doing so consistently is hard even for skilled authors [13]. Currently available solutions to obfuscate authorship and defend against NLP methods have been largely restricted to semi-automatic solutions suggesting possible changes to the user [14] or hand-crafted transformations to text [15], which need re-engineering on different datasets [15]. This limits the applicability of these defensive measures beyond the specific dataset they were designed on. To the best of our knowledge, text rephrasing using generic machine translation tools [16] is the only prior work offering a fully automatic solution to author obfuscation which can be applied across datasets. But as found in prior work [17] and further demonstrated by our experiments, generic machine translation based obfuscation fails to sufficiently hide the identity and protect against attribute classifiers.

Additionally, the focus in prior research has been on protecting author identity. However, obfuscating identity does not guarantee protection of private attributes like age and gender. Determining attributes is generally easier than predicting the exact identity for NLP-based adversaries, mainly because the former is a small closed-set prediction task whereas the latter is a larger and potentially open-set prediction task. This makes obfuscating attributes a difficult but important problem.

Our work. We propose a unified automatic system (A^4NT) to obfuscate authors' text and defend against NLP adversaries. A^4NT follows the imitation model of defense discussed in [12] and protects against various attribute classifiers by learning to imitate the writing style of a target class. For example, A^4NT learns to hide the gender of a female author by re-synthesizing the text in the style of the male class. This imitation of writing style is learnt by adversarially training [18] our style-transfer network against the attribute classifier. Our A^4NT network learns the target style by learning to fool the authorship classifiers into misclassifying the text it generates as the target class. This style transfer is accomplished while aiming to retain the semantic content of the input text.

Unlike many prior works on authorship obfuscation [14, 15], we propose an end-to-end learnable author anonymization solution, allowing us to apply our method not only to authorship obfuscation but also to the anonymization of different author attributes including identity, gender and age with a unified approach. We illustrate this by successfully applying our model to three different attribute anonymization settings on two different datasets. Through empirical evaluation, we show that the proposed approach is able to fool the author attribute classifiers in all three settings effectively and better than the baselines. While there are still challenges to overcome before applying the system to multiple attributes and to situations with very little data, we believe that A^4NT offers a new data-driven approach to authorship obfuscation which can easily adapt to improving NLP-based adversaries.

Technical challenges: We design our network architecture based on the sequence-to-sequence neural machine translation model [19]. A key challenge in learning to perform style transfer, compared to other sequence-to-sequence mapping tasks like machine translation, is the lack of paired training data. Here, paired data refers to datasets with both the input text and its corresponding ground-truth output text. In the obfuscation setting, this means having a large dataset of semantically identical sentences written in different styles corresponding to the attributes we want to hide. Such paired data is infeasible to obtain, and this has been a key hurdle in developing automatic obfuscation methods. Some prior attempts to perform text style transfer required paired training data [20] and hence were limited in their applicability beyond toy-data settings. We overcome this by training our network within a generative adversarial networks (GAN) [18] framework. The GAN framework enables us to train the network to generate samples that match the target distribution without the need for paired data.

We characterize the performance of our A^4NT network along two axes: privacy effectiveness and semantic similarity. Using automatic metrics and human evaluation to measure the semantic similarity of the generated text to the input, we show that A^4NT offers a better trade-off between privacy effectiveness and semantic similarity than the baselines. We also analyze the effectiveness of A^4NT for protecting anonymity for varying degrees of input text “difficulty”.

Contributions: In summary, the main contributions of our paper are: (1) We propose a novel approach to authorship obfuscation that uses a style-transfer network (A^4NT) to automatically transform the input text to a target style and fool the attribute classifiers. The network is trained without paired data using adversarial training. (2) The proposed obfuscation solution is end-to-end trainable, and hence can be applied to protect different author attributes and on different datasets with no changes to the overall framework. (3) Quantifying the performance of our system on privacy effectiveness and semantic similarity to the input, we show that it offers a better trade-off between the two metrics compared to baselines.

2 Related Work

In this section, we review prior work relating to four different aspects of our work – author attribute detection (our adversaries), authorship obfuscation (prior work), machine translation (basis of our network) and generative adversarial networks (training framework we use).

Authorship and attribute detection: Machine learning approaches, where a set of text features is input to a classifier which learns to predict the author, have been popular in recent author attribution works [2]. These methods have been shown to work well on large datasets [6], on duplicate author detection [21] and even on non-textual data like code [22]. Stylometric models can also be applied to determine private author attributes like age or gender [4].

Classical author attribution methods rely on a predefined set of features extracted from the input text. Recently, deep-learning methods have been applied to learn to extract the features directly from data [24, 3]. [24] uses a multi-headed recurrent neural network (RNN) to train a generative language model on each author’s text and uses the model’s perplexity on the test document to predict the author. Alternatively, [3] uses a convolutional neural network (CNN) to train an author classifier. To show the generality of our A^4NT network, we test it against both RNN and CNN based author attribute classifiers.

Authorship obfuscation: Authorship obfuscation methods are adversarial in nature to stylometric methods of author attribution; they try to change the style of the input text so that the author identity is not discernible. The majority of prior works on authorship obfuscation are semi-automatic [25, 14], where the system analyzes the stylometric features and suggests changes for the author to make to the document. The few automatic obfuscation methods have relied on general rephrasing methods like generic machine translation [16] or on predefined text transformations [26]. Round-trip machine translation, where the input text is translated to multiple languages one after the other until it is translated back to the source language, is proposed as an automatic method of obfuscation in [16]. Recent work [26] obfuscates text by moving the stylometric features towards their average values on the dataset, applying pre-defined transformations to the input text.

We propose the first method to achieve fully automatic obfuscation using text style transfer. This style transfer is not pre-defined but learnt directly from data optimized for fooling attribute classifiers. This allows us to apply our model across datasets without extra engineering effort.

Machine translation: The task of style transfer of text data shares similarities with the machine translation problem. Both involve mapping an input text sequence onto an output text sequence. Style transfer can be thought of as machine translation within the same language.

Large end-to-end trainable neural networks have become a popular choice in machine translation [27, 28]. These methods are generally based on sequence-to-sequence recurrent models [19] consisting of two networks: an encoder, which encodes the input sentence into a fixed size vector, and a decoder, which maps this encoding to a sentence in the target language.

We base our network architecture on the word-level sequence-to-sequence language model [19]. Neural machine translation systems are trained with large amounts of paired training data. However, in our setting, obtaining paired data of the same text in different writing styles is not viable. We overcome the lack of paired data by casting the task as matching style distributions instead of matching individual sentences. Specifically, our network takes an input text from a source distribution and generates text whose style matches the target attribute distribution. This is learnt without paired data using distribution matching methods. This reformulation allows us to demonstrate the first successful application of the machine translation models to the obfuscation task.

Generative adversarial networks: Generative Adversarial Networks (GAN) [18] are a framework for learning a generative model to produce samples from a target distribution. The framework consists of two models, a generator and a discriminator. The discriminator network learns to distinguish between the generated samples and real data samples. Simultaneously, the generator learns to fool this discriminator network, thereby getting closer to the target distribution. In this two-player game, a fully optimized generator perfectly mimics the target distribution [18].
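For reference, the two-player game of [18] is usually written as the minimax objective below, where $G$ is the generator, $D$ the discriminator, $p_{data}$ the target data distribution and $p_z$ the distribution of the generator's input. This is the generic form from the GAN literature and not necessarily the exact loss used in this paper, where the generator's input is a source sentence rather than noise.

```latex
\min_{G} \max_{D} \; V(D, G) =
  \mathbb{E}_{x \sim p_{data}(x)} \big[ \log D(x) \big]
  + \mathbb{E}_{z \sim p_{z}(z)} \big[ \log \big( 1 - D(G(z)) \big) \big]
```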

We train our A^4NT network within the GAN framework, directly optimizing it to fool the attribute classifiers by matching the style distribution of a target class. A recent approach to text style transfer proposed in [29] also utilizes GANs to perform style transfer using unpaired data. However, the solution proposed in [29] changes the meaning of the input text significantly during style transfer and is applied to the sentiment transfer task. In contrast, the authorship obfuscation task requires the generated text to preserve the semantics of the input. We address this problem by proposing two methods to improve semantic consistency between the input and the output.

3 Author Attribute Anonymization

Figure 1: GAN framework to train our A^4NT network. The input sentence is transformed by A^4NT to match the style of the target attribute. This output is evaluated using the attribute classifier and the semantic consistency loss. A^4NT is trained by backpropagating through these losses.

We propose an author adversarial attribute anonymizing neural translation (A^4NT) network to defend against NLP-based adversaries. The proposed solution includes the A^4NT network, the adversarial training scheme, and semantic and language losses to learn to protect private attributes. The A^4NT network transforms the input text from a source attribute class to mimic the style of a different attribute class, and thus fools the attribute classifiers.

Technically, the A^4NT network is essentially solving a sequence-to-sequence mapping problem, from a text sequence in the source domain to text in the target domain, similar to machine translation. Exploiting this similarity, we design our A^4NT network based on the sequence-to-sequence neural language models [19] widely used in neural machine translation [27]. These models have proven effective when trained with large amounts of paired data and are also deployed commercially [28]. If there were paired data in source and target attributes, we could train our A^4NT network exactly like a machine translation model, with standard supervised learning. However, such paired data is infeasible to obtain as it would require the same text written in multiple styles.

To address the lack of paired data, we cast the anonymization task as learning a generative model, $Z_{xy}$, which transforms an input text sample $s_x$ drawn from the source attribute distribution $X$ to look like samples from the target distribution $Y$. This formulation enables us to train the A^4NT network with the GAN framework to produce samples close to the target distribution $Y$, using only unpaired samples from $X$ and $Y$. Figure 1 shows this overall framework.

The GAN framework consists of two models: a generator producing synthetic samples to mimic the target data distribution, and a discriminator which tries to distinguish real data from the synthesized “fake” samples from the generator. The two models are trained adversarially, i.e. the generator tries to fool the discriminator and the discriminator tries to correctly identify the generator samples. We use an attribute classifier as the discriminator and the A^4NT network as the generator. The A^4NT network, in trying to fool the attribute classification network, learns to transform the input text to mimic the style of the target attribute and protect the attribute anonymity.

For our A^4NT network to be a practically useful defensive measure, the text output by this network should fool the attribute classifier while also preserving the meaning of the input sentence. If we could measure the semantic difference between the generated text and the input text, it could be used to penalize deviations from the input sentence semantics. Computing this semantic distance perfectly would need true understanding of the meaning of the input sentence, which is beyond the capabilities of current natural language processing techniques. To address this aspect of style transfer, we experiment with various proxies to measure and penalize changes to input semantics, which are discussed in Section 3.4. The following subsections describe each module in detail.

3.1 Author Attribute Classifiers

Figure 2: Block diagram of the attribute classifier network. The LSTM encoder embeds the input sentence into a vector. The sentence encoding is passed to a linear projection followed by a softmax layer to obtain class probabilities.

We build our attribute classifiers using neural networks that predict the attribute label by directly operating on the text data. This is similar to recent approaches in authorship recognition [24, 3] where, instead of the hand-crafted features used in classical stylometry, neural networks are used to directly predict author identity from raw text data. However, unlike these prior works, our focus is attribute classification and obfuscation. We train our classifiers with recurrent networks operating at word-level, as opposed to the character-level models used in [24, 3], for two reasons. First, we found that the word-level models give good performance on all three attribute-classification tasks we experiment with (see Section 5.1). Second, they are much faster than character-level models, making it feasible to use them in the GAN training described in Section 3.2.

Specifically, our attribute classifier $A_x$ to detect attribute value $x$ is shown in Figure 2. It consists of a Long Short-Term Memory (LSTM) [30] encoder network to compute an embedding of the input sentence into a fixed size vector. It learns to encode the parts of the sentence most relevant to the classification task into the embedding vector, which for attribute prediction is mainly the stylistic properties of the text. This embedding is input to a linear layer and a softmax layer to output the class probabilities.

Given an input sentence $s_x = \{w_0, w_1, \dots, w_{n-1}\}$, the words are one-hot encoded and then embedded into fixed size vectors using the word-embedding layer shown in Figure 2 to obtain vectors $\{v_0, v_1, \dots, v_{n-1}\}$. This word embedding layer encodes similarities between words into the word vectors and can help deal with large vocabulary sizes. The word vectors are randomly initialized and then learned from the data during training of the model. This approach works better than using pre-trained word vectors like word2vec [31] or GloVe [32], since the learned word vectors can encode the similarities most relevant to the attribute classification task at hand.

This sequence of word vectors is recursively passed through an LSTM to obtain a sequence of outputs $\{h_0, h_1, \dots, h_{n-1}\}$. We refer the reader to [30] for the exact computations performed to get the LSTM output.

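The sentence embedding $E(s_x)$ is then obtained by concatenating the final LSTM output with the mean of the LSTM outputs from the other time-steps:

$$E(s_x) = \left[\, h_{n-1} \,;\; \frac{1}{n-1} \sum_{i=0}^{n-2} h_i \,\right]$$

At the last time-step the LSTM network has seen all the words in the sentence and can encode a summary of the sentence in its output. However, using LSTM outputs from all time-steps, instead of just the final one, speeds up training due to improved flow of gradients through the network. Finally, $E(s_x)$ is passed through linear and softmax layers to obtain class probabilities $p_x(c_i \mid s_x)$ for each class $c_i$. The network is then trained using the cross-entropy loss:

$$p_x(c_i \mid s_x) = \mathrm{softmax}\big(W \cdot E(s_x)\big)_i, \qquad L(A_x) = -\sum_i t_i(s_x) \log p_x(c_i \mid s_x)$$

where $W$ is the weight matrix of the linear layer and $t(s_x)$ is the one-hot encoding of the true class of $s_x$.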
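The classifier head described above (concatenate the final LSTM output with the mean of the earlier outputs, project linearly, apply softmax, train with cross-entropy) can be sketched in plain Python. This is a minimal illustration and not the authors' PyTorch code: the LSTM outputs are supplied as ready-made lists, and the function names and the one-weight-row-per-class layout are assumptions made for the example.

```python
import math

def sentence_embedding(outputs):
    """Concatenate the final LSTM output with the mean of the earlier outputs."""
    *earlier, final = outputs
    if not earlier:
        raise ValueError("need at least two time-steps")
    dim = len(final)
    mean = [sum(h[d] for h in earlier) / len(earlier) for d in range(dim)]
    return list(final) + mean

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def class_probabilities(embedding, weights):
    """Linear projection (one hypothetical weight row per class), then softmax."""
    logits = [sum(w * e for w, e in zip(row, embedding)) for row in weights]
    return softmax(logits)

def cross_entropy(probs, true_class):
    """Cross-entropy against a one-hot target: -log of the true-class probability."""
    return -math.log(probs[true_class])
```

With three time-steps of two-dimensional outputs, `sentence_embedding` returns a four-dimensional vector: the last output followed by the average of the first two.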

The same network architecture is applied for all our attribute prediction tasks including identity, age and gender.

3.2 The A^4NT Network

Figure 3: Block diagram of the A^4NT network. First, the LSTM encoder embeds the input sentence into a vector. The decoder maps this sentence encoding to the output sequence. The Gumbel sampler produces “soft” samples from the softmax distribution to allow backpropagation.

A key design goal for the A^4NT network is that it is trainable purely from data to obfuscate the author attributes. This is a significant departure from prior works on author obfuscation [14, 26] that rely on hand-crafted rules for text modification to achieve obfuscation. Methods relying on hand-crafted rules are limited in applicability to the specific datasets they were designed for.

To achieve this goal, we base our A^4NT network $Z_{xy}$, shown in Figure 3, on a recurrent sequence-to-sequence neural translation model [19] (Seq2Seq) popular in many sequence mapping tasks. As seen from the wide range of applications mapping text-to-text [27], speech-to-text [33] and text-to-part-of-speech [34], Seq2Seq models can effectively learn to map input sequences to arbitrary output sequences with appropriate training. They operate on raw text data and alleviate the need for the hand-crafted features or rules to transform the style of input text that are predominantly used in prior works on author obfuscation [14, 26]. Instead, appropriate text transformations can be learnt directly from data. This flexibility allows us to easily apply the same A^4NT network and training scheme to different datasets and settings.

The A^4NT network $Z_{xy}$ consists of two components, an encoder and a decoder module, similar to standard sequence-to-sequence models. The encoder embeds the variable length input sentence into a fixed size vector space. The decoder maps the vectors in this embedding space to output text sequences in the target style. The encoder is an LSTM network, sharing the architecture of the sentence encoder in Section 3.1. The same architecture applies because the task here is also to embed the input sentence $s_x$ into a fixed size vector $E(s_x)$. However, $E(s_x)$ should learn to represent the semantics of the input sentence, allowing the decoder network to generate a sentence with similar meaning but in a different style.

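The sentence embedding from the encoder is the input to the decoder LSTM, which generates the output sentence one word at a time. At each step $t$, the decoder LSTM takes $E(s_x)$ and the previous output word $\tilde{w}_{t-1}$ to produce a probability distribution over the vocabulary. Sampling from this distribution outputs the next word:

$$p(\tilde{w}_t \mid \tilde{w}_{t-1}, s_x) = \mathrm{softmax}\Big(W_{dec} \cdot \mathrm{LSTM}_{dec}\big(M_{emb}(\tilde{w}_{t-1}),\, E(s_x)\big)\Big)$$

$$\tilde{w}_t \sim p(\tilde{w}_t \mid \tilde{w}_{t-1}, s_x), \qquad \tilde{w}_t \in V$$

where $M_{emb}$ is the word embedding, the matrix $W_{dec}$ maps the LSTM output to vocabulary size and $V$ is the vocabulary.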

In most applications of Seq2Seq models, the networks are trained using parallel training data, consisting of input and ground-truth output sentence pairs. A sentence is input to the encoder and propagated through the network and the network is trained to maximize the likelihood of generating the paired ground-truth output sentence. However, in our setting, we do not have access to such parallel training data of text in different styles and the network is trained in an unsupervised setting.

We address the lack of parallel training data by using the GAN framework to train the A^4NT network. In this framework, the A^4NT network learns by generating text samples and improving itself iteratively to produce text that the attribute classifier, $A_y$, classifies as the target attribute. A benefit of GANs is that the A^4NT network is directly optimized to fool the attribute classifiers. It can hence learn to make transformations to the parts of the text which are most revealing of the attribute at hand, and so hide the attribute with minimal changes.

However, to apply the GAN framework, we need to differentiate through the samples generated by $Z_{xy}$. The word samples from $Z_{xy}$ are discrete tokens and are not differentiable. Following [35], we apply the Gumbel-Softmax approximation [36] to obtain differentiable soft samples and enable end-to-end GAN training. See Appendix A for details.
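As a rough illustration of the Gumbel-Softmax idea, the sketch below perturbs the decoder scores with Gumbel noise and applies a temperature-scaled softmax, yielding a "soft" one-hot vector instead of a discrete word index. It is a generic textbook version written in plain Python, so it shows only the sampling step and not automatic differentiation; the function name, the `temperature` parameter and its default are assumptions, and the variant actually used in the paper is described in its Appendix A.

```python
import math
import random

def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return a soft one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for z in logits:
        u = rng.random()
        # Gumbel(0, 1) noise; clamp u away from 0 to avoid log(0).
        g = -math.log(-math.log(max(u, 1e-12)))
        noisy.append((z + g) / temperature)
    m = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]
```

Lower temperatures push the output closer to a discrete one-hot vector, while higher temperatures give smoother vectors.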

Splitting decoder: To transfer styles between attribute pairs, $x$ and $y$, in both directions, we found it ineffective to use the same network $Z$. A single network $Z$ is unable to sufficiently switch its output word distributions solely on a binary condition of the target attribute. Nonetheless, using a separate network for each ordered pair of attributes is prohibitively expensive. A good compromise we found is to share the encoder to embed the input sentence but use different decoders for style transfer between each ordered pair of attributes. Sharing the encoder allows the two networks to share a significant number of parameters and enables the attribute-specific decoders to deal with words found only in the vocabulary of the other attribute group using shared sentence and word embeddings.

3.3 Style Loss with GAN

Figure 4: Illustrating the use of the GAN framework and the cyclic semantic loss to train a pair of A^4NT networks.

We train the two A^4NT networks $Z_{xy}$ and $Z_{yx}$ in the GAN framework to produce samples which are indistinguishable from samples from the distributions of attributes $y$ and $x$ respectively, without having paired sentences from $x$ and $y$. Figure 4 shows this training framework.

Given a sentence $s_x$ written by an author with attribute $x$, the A^4NT network $Z_{xy}$ outputs a sentence $\tilde{s}_y$. This is passed to the attribute classifier for attribute $y$, $A_y$, to obtain the probability $p_y(\tilde{s}_y)$. $Z_{xy}$ tries to fool the classifier $A_y$ into assigning high probability to its output, whereas $A_y$ tries to assign low probability to sentences produced by $Z_{xy}$ while assigning high probability to real sentences $s_y$ written by $y$. The same process is followed to train the A^4NT network from $y$ to $x$, with $x$ and $y$ swapped. The loss functions used to train the A^4NT network and the attribute classifiers in this setting are given by:

$$L_{style}(Z_{xy}) = -\log p_y(\tilde{s}_y)$$

$$L(A_y) = -\log p_y(s_y) - \log\big(1 - p_y(\tilde{s}_y)\big)$$

The two networks $Z_{xy}$ and $A_y$ are adversarially competing with each other when minimizing the above loss functions. At optimality it is guaranteed that the distribution of samples produced by $Z_{xy}$ is identical to the distribution of $y$ [18]. However, we want the A^4NT network to only imitate the style of $y$, while keeping the content from $x$. Thus, we explore methods to enforce semantic consistency between the input sentence and the A^4NT output.
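The opposing objectives described above can be sketched as two scalar losses. This is a minimal illustration assuming the standard GAN log-loss form; the function names and argument names are hypothetical, and the inputs are the probabilities that the target-attribute classifier assigns to a real target-class sentence and to a generated sentence, each strictly between 0 and 1.

```python
import math

def style_loss(p_target_on_fake):
    """Generator loss: small when the classifier assigns the target class to generated text."""
    return -math.log(p_target_on_fake)

def classifier_loss(p_target_on_real, p_target_on_fake):
    """Classifier loss: small when real target-class text scores high and generated text scores low."""
    return -math.log(p_target_on_real) - math.log(1.0 - p_target_on_fake)
```

The two losses pull the probability on generated text in opposite directions: raising it lowers `style_loss` and raises `classifier_loss`, which is what makes the training adversarial.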

3.4 Preserving Semantics

We want the output sentence, $\tilde{s}_y$, produced by $Z_{xy}(s_x)$ to not only fool the attribute classifier, but also to preserve the meaning of the input sentence $s_x$. We propose a semantic loss $L_{sem}(\tilde{s}_y, s_x)$ to quantify the meaning changed during the anonymization by A^4NT. Simple approaches like matching words in $\tilde{s}_y$ and $s_x$ can severely limit the effectiveness of anonymization, as they penalize even synonyms or alternate phrasing. In the following subsections we discuss two approaches to define $L_{sem}$, and later in Section 5 we compare these approaches quantitatively.

3.4.1 Cycle Constraints

Figure 5: Semantic consistency in A^4NT networks is enforced by maximizing the cyclic reconstruction probability.

One could evaluate how semantically close $\tilde{s}_y$ is to $s_x$ by evaluating how easy it is to reconstruct $s_x$ from $\tilde{s}_y$. If $\tilde{s}_y$ means exactly the same as $s_x$, there should be no information loss and we should be able to perfectly reconstruct $s_x$ from $\tilde{s}_y$. We could use the A^4NT network in the reverse direction to obtain a reconstruction, $\hat{s}_x = Z_{yx}(\tilde{s}_y)$, and compare it to the input sentence $s_x$. Such an approach, referred to as cycle constraint, has been used in image style transfer [37], where a distance between the reconstructed image and the original image is used to impose a semantic relatedness penalty. However, in our case such a distance is not meaningful for comparing $\hat{s}_x$ and $s_x$, as they are sequences of possibly different lengths. Even a single word insertion or deletion in $\hat{s}_x$ can cause the entire sequence to mismatch and be penalized by the distance.

A simpler and more stable alternative we use is to forgo the reconstruction and just compute the likelihood of reconstructing $s_x$ when applying reverse style-transfer to $\tilde{s}_y$. This likelihood is simple to obtain from the reverse A^4NT network $Z_{yx}$ using the word distribution probabilities at the output. This cyclic loss computation is illustrated in Figure 5. Accordingly, we compute the reconstruction probability $P_r(s_x \mid \tilde{s}_y)$ and define the semantic loss as:

$$P_r(s_x \mid \tilde{s}_y) = \prod_{t=0}^{n-1} p_{Z_{yx}}(w_t \mid \tilde{s}_y), \qquad L_{sem}(\tilde{s}_y, s_x) = -\log P_r(s_x \mid \tilde{s}_y)$$

The lower the semantic loss $L_{sem}$, the higher the reconstruction probability and thus the more meaning of the input sentence is preserved in the style-transfer output $\tilde{s}_y$.
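The cyclic semantic loss described above reduces to a negative log-likelihood over the words of the original sentence. The sketch below assumes, for illustration only, that the reverse network's per-step output distributions are available as dictionaries from word to probability; the function and argument names are hypothetical.

```python
import math

def cycle_semantic_loss(step_distributions, input_words):
    """Negative log-probability of reconstructing the input words from the reverse network.

    step_distributions: one dict per time-step mapping vocabulary words to probabilities.
    input_words: the words of the original input sentence, in order.
    """
    if len(step_distributions) != len(input_words):
        raise ValueError("need exactly one distribution per input word")
    log_prob = 0.0
    for dist, word in zip(step_distributions, input_words):
        # Summing log-probabilities is the log of the product over time-steps.
        log_prob += math.log(dist[word])
    return -log_prob
```

For a two-word input whose words receive probabilities 0.5 and 0.25 from the reverse network, the reconstruction probability is 0.125 and the loss is log 8.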

3.4.2 Semantic Embedding Loss

An alternative approach to measuring the semantic loss is to embed the two sentences, $\tilde{s}_y$ and $s_x$, into a semantic space and compare the two embedding vectors using a vector distance. The idea is that a semantic embedding method places sentences with similar meaning close to each other in this vector space. This approach is used in many natural language processing tasks, for example in semantic entailment [38].

Since we do not have annotations of semantic relatedness on our datasets, it is not possible to train a semantic embedding model; instead we have to rely on pre-trained models known to have good transfer learning performance. Several such semantic sentence embeddings are available in the literature [39, 38]. We use the universal sentence embedding model from [38], pre-trained on the Stanford natural language inference dataset [40].

We embed the two sentences using this semantic embedding model $F$ and use the distance $d$ between the two embeddings to define the semantic loss as:

$$L_{sem}(\tilde{s}_y, s_x) = d\big(F(s_x),\, F(\tilde{s}_y)\big)$$

3.5 Smoothness with Language Loss

The A^4NT network can minimize the style and the semantic losses while still producing text which is broken and grammatically incorrect. To minimize the style loss, the A^4NT network needs to add words typical of the target attribute style, while to minimize the semantic loss, it needs to retain the semantically relevant words from the input text. However, neither of these two losses explicitly enforces correct grammar and word order of $\tilde{s}_y$.

On the other hand, unconditional neural language models are good at producing grammatically correct text. The likelihood of the sentence $\tilde{s}_y$ produced by our A^4NT model under an unconditional language model, $M_y$, trained on the text by target attribute authors $y$, is a good indicator of the grammatical correctness of $\tilde{s}_y$. The higher the likelihood, the more likely the generated text has syntactic properties seen in the real data. Therefore, we add an additional language smoothness loss on $\tilde{s}_y$ in order to push $Z_{xy}$ to produce syntactically correct text:

$$L_{lang}(\tilde{s}_y) = -\log M_y(\tilde{s}_y)$$

Overall loss function: The A^4NT network is trained with a weighted combination of the three losses: style loss, semantic consistency loss and language smoothing loss.

$$L_{tot}(Z_{xy}) = w_{sty} L_{style} + w_{sem} L_{sem} + w_{lang} L_{lang} \qquad (13)$$

We chose the three weights $w_{sty}$, $w_{sem}$ and $w_{lang}$ so that the magnitudes of the weighted loss terms are approximately equal at the beginning of training. Model training was not sensitive to the exact values of the weights chosen that way.
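One simple way to realize the weighting rule stated above is to scale each term by the inverse of its initial magnitude. The sketch below is an assumption about how such weights could be derived, since the text gives only the criterion and not the procedure; the function names and the example loss values are invented for illustration, and every initial loss is assumed to be non-zero.

```python
def balance_weights(initial_losses):
    """Pick weights so every weighted loss term starts at roughly the same magnitude."""
    reference = max(abs(v) for v in initial_losses.values())
    return {name: reference / abs(value) for name, value in initial_losses.items()}

def total_loss(losses, weights):
    """Weighted sum of the individual loss terms."""
    return sum(weights[name] * losses[name] for name in losses)

# Hypothetical loss values measured on the first minibatch.
initial = {"style": 0.7, "semantic": 35.0, "language": 70.0}
weights = balance_weights(initial)
```

With these made-up numbers, each weighted term starts at about 70, so no single loss dominates the gradient at the beginning of training.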

Implementation details: We implement our model using the PyTorch framework. The networks are trained by optimizing the loss functions described above with stochastic gradient descent using the RMSprop algorithm [42]. The A^4NT network is pre-trained as an autoencoder, i.e. to reconstruct the input sentence, before being trained with the loss function described in (13). During GAN training, the A^4NT network and the attribute classifiers are trained alternately, for one minibatch each. We will open source our code, models and data at the time of publication.

4 Experimental Setup

We test our network on obfuscation of three different attributes of authors on two different datasets. The three attributes we experiment with include author’s age (under 20 vs over 20), gender (male vs female authors), and author identities (setting with two authors).

4.1 Datasets

We use two real world datasets for our experiments: Blog Authorship corpus [43] and Political Speech dataset. The datasets are from very different sources with distinct language styles, the first being from mini blogs written by several anonymous authors, and the second from political speeches of two US presidents Barack Obama and Donald Trump. This allows us to show that our approach works well across very different language corpora.

Blog dataset: The blog dataset is a large collection of micro blogs collected by [43]. The dataset consists of 19,320 “documents” along with annotations of the author’s age, gender, occupation and star-sign. Each document is a collection of all posts by a single author. We utilize this dataset in two different settings: split by gender (referred to as the blog-gender setting) and split by age annotation (the blog-age setting). In the blog-age setting, we group the age annotations into two groups, teenagers (age between 13-18) and adults (age between 23-45), to obtain data with binary age labels. Age-groups 19-22 are missing in the original dataset. Since the dataset consists of free-form text written while blogging, with no proper sentence boundary markers, we use the Stanford CoreNLP tool to segment the documents into sentences. All numbers are replaced with the NUM token.
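The number replacement step can be sketched with a regular expression. The exact rule used for the dataset is not specified, so the pattern below (runs of digits, optionally joined by "." or ",") and the helper name `replace_numbers` are assumptions.

```python
import re

# Assumed definition of a "number": digits, optionally with decimal or thousands separators.
_NUMBER = re.compile(r"\d+(?:[.,]\d+)*")

def replace_numbers(sentence):
    """Replace every number in the sentence with the NUM token."""
    return _NUMBER.sub("NUM", sentence)

cleaned = replace_numbers("i ran 5 miles in 42.5 minutes")
```

Mapping all numbers to one token keeps the vocabulary small and prevents classifiers from keying on specific values such as ages or years.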

Political speech dataset: To test the limits of how far style-imitation-based anonymization can help protect author identity, we also test our model on two well-known political figures with very different verbal styles. We collected the transcriptions of political speeches of Barack Obama and Donald Trump made available by The American Presidency Project [44]. While the two authors talk about similar topics, they have highly distinctive styles and vocabularies, making it a challenging dataset for our network. The dataset consists of 372 speeches, with about 65,000 sentences in total, as shown in Table I. We treat each speech as a separate document when evaluating the classification results at the document level. This dataset contains a significant number of references to named entities like people, organizations, etc. To prevent both the attribute classifiers and the style transfer model from relying on these references to specific people, we use the Stanford Named Entity Recognizer tool [45] to identify these entities and replace them with entity labels.

The comparison of the two datasets can be found in Table I. The blog dataset is much larger and therefore we run most of our evaluation on it. Using these two datasets, we evaluate our model in three different attribute obfuscation settings, namely age (blog-age), gender (blog-gender) and identity obfuscation (speech dataset). Detailed analysis of our model presented in Section 5.2 is done on the validation split of the blog dataset, in the blog-age setting, containing 2,799 documents and 518,268 sentences.

| Dataset | Attributes | # Documents | # Sentences | # Vocabulary |
| --- | --- | --- | --- | --- |
| Speech | Identity | 372 | 65k | 5.6k |
| Blog | Age, Gender | 19320 | 3.38 Mil | 22k |

Table I: Comparing statistics of the two datasets.

4.2 Evaluation Methods

We evaluate our models w.r.t. two different aspects: (a) Privacy effectiveness, to measure how well the network can fool the attribute classifiers and (b) Semantic similarity, quantifying how much of the input sentence semantics is preserved after style transfer.

To quantify privacy effectiveness, we compare the F1-score of the attribute classifiers on the original data and on the generated text. To mimic a real-world setting, in which we do not have access to the parameters of the adversary's attribute classifier when training our network, we test it against attribute classifiers which have been separately trained on real data.

The F1-score for the attribute classifiers is defined as:

$$F1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

With successful style transfer, the attribute classifier will misclassify the generated text and the F1-scores will drop. We consider the obfuscation to be successful if the F1-score drops below that of a random guessing classifier. Note that since the datasets are not perfectly balanced w.r.t. the binary attributes, the F1-score for a random classifier is not 0.5. The expected document-level F1-scores of random guessing are 0.54 for the blog-age setting, 0.5 for the blog-gender setting, and 0.61 for the speech data.
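To make the metric concrete, the sketch below computes the standard F1-score (the harmonic mean of precision and recall) for one class from lists of true and predicted labels. Which class is treated as positive, and whether scores are averaged over classes, is not stated in the text, so the `positive` argument and the toy labels are assumptions.

```python
def f1_score(y_true, y_pred, positive):
    """F1 = 2 * precision * recall / (precision + recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct positive predictions: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy documents: after obfuscation the classifier no longer recognizes any teen document.
truth = ["teen", "teen", "adult", "adult"]
before = f1_score(truth, ["teen", "adult", "adult", "adult"], positive="teen")
after = f1_score(truth, ["adult", "adult", "adult", "adult"], positive="teen")
```

A drop from `before` to `after` is the kind of change the privacy effectiveness measure is designed to capture.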

To quantify semantic similarity, we use the Meteor [46] metric. It is used in machine translation and image captioning to evaluate the similarity of the candidate text using a reference text. Meteor compares the candidate text to one or more references by matching n-grams, allowing for soft matches using synonym and paraphrase tables. We use the Meteor score between the generated and input text as the measure of semantic similarity.

However, the automatic evaluation of semantic similarity is not perfectly correlated with human judgments, especially with few reference sentences. To address this, we additionally conduct a human evaluation study on a subset of the test data of 745 sentences. We ask human annotators on Amazon Mechanical Turk to judge the semantic similarity of the generated text from our models. No other information was collected from the annotators, thereby keeping them anonymous. The annotators were compensated for their work through the AMT system. We manually screened the text shown to the annotators to make sure there was no obvious offensive content in it.

4.3 Baselines

We compare our model against the two baseline methods below. Both chosen baselines are automatic obfuscation methods that do not rely on hand-crafted rules.

Autoencoder: We train our network as an autoencoder, where it takes a sentence as input and tries to reproduce it from the encoding. The autoencoder is trained like a standard neural language model with cross-entropy loss. We train two such autoencoders, one for each of the two attribute classes. Simple style transfer from one attribute class to the other can then be achieved by feeding a sentence from the source class to the autoencoder of the other class. Since that autoencoder is trained to output text in the target domain, its output tends to look similar to sentences from the target class. This model sets the baseline for style transfer that can be achieved without cross-domain training using GANs, with the same network architecture and the same number of parameters.

Google machine translation: A simple and accessible approach to changing the writing style of a piece of text without hand-designed rules is to use generic machine translation software. The input text is translated from a source language through multiple intermediate languages and finally back to the source language. The hope is that through this round-trip the style of the text changes while the meaning is preserved. This approach was recently used in the PAN authorship obfuscation challenge [16].

We use the Google machine translation service to perform round-trip translation on our input sentences. We have tried a varying number of intermediate languages, the results of which are discussed in Section 5. Since Google limits API calls and imposes character limits on manual translation, we use this baseline only on the subset of 745 sentences from the test set used for human evaluation.

5 Experimental Results

We test our model on the three settings discussed in section 4 with the goal to understand if the proposed network can fool the attribute classifiers to protect the anonymity of the author attributes. Through quantitative evaluation done in Section 5.1, we show that this is indeed the case: our network learns to fool the attribute classifiers across all three settings. We compare the two semantic loss functions presented in Section 3.4 and show that the proposed reconstruction likelihood loss does better than pre-trained semantic encoding.

However, this privacy gain comes with a trade-off. The semantics of the input text is sometimes altered. In Section 5.2, using qualitative examples, we analyze the failure modes of our system and identify limits up to which style-transfer can help preserve anonymity.

We use three variants of our model in the following study. The first model uses the semantic encoding loss described in Section 3.4.2 and is referred to as FBsem. The second uses the reconstruction likelihood loss discussed in Section 3.4.1 instead, and is denoted by CycML. Finally, CycML+Lang uses both cyclic maximum likelihood and the language smoothing loss described in Section 3.5.

5.1 Quantitative Evaluation

| Setting | Training Set: Sentence | Training Set: Document | Validation Set: Sentence | Validation Set: Document |
| --- | --- | --- | --- | --- |
| Speechdata | 0.84 | 1.00 | 0.68 | 1.00 |
| Blog-age | 0.76 | 0.92 | 0.74 | 0.88 |
| Blog-gender | 0.64 | 0.93 | 0.52 | 0.75 |

Table II: F1-scores of the attribute classifiers. All of them do well and better than the document-level random chance (0.62 for speech), (0.53 for age), and (0.50 for gender).

Before analyzing the performance of our network, we evaluate the attribute classifiers on the three settings we use. For this, we train the attribute classifier model in Section 3.1 on all three settings. Table II shows the F1-scores of the attribute classifiers on the training and the validation splits of the blog and the speech datasets. Document-level scores are obtained from accumulating the class log-probability scores on each sentence in a document before picking the maximum scoring class as the output label. We also tried hard voting to accumulate sentence level decisions, and observed that the hard voting results follow the same trend across datasets and splits.
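The two document-level aggregation schemes mentioned above can be sketched as follows. The input format (one dictionary of class probabilities per sentence) and the function names are assumptions made for illustration; only the aggregation logic follows the description.

```python
import math
from collections import Counter

def classify_document_soft(sentence_probs):
    """Sum per-class log-probabilities over all sentences and return the best-scoring class."""
    totals = {}
    for probs in sentence_probs:
        for label, p in probs.items():
            # Clamp to avoid log(0) for degenerate classifier outputs.
            totals[label] = totals.get(label, 0.0) + math.log(max(p, 1e-12))
    return max(totals, key=totals.get)

def classify_document_hard(sentence_probs):
    """Hard voting: each sentence votes for its most probable class."""
    votes = Counter(max(probs, key=probs.get) for probs in sentence_probs)
    return votes.most_common(1)[0][0]

# Two weakly "teen" sentences and one strongly "adult" sentence.
document = [
    {"teen": 0.55, "adult": 0.45},
    {"teen": 0.55, "adult": 0.45},
    {"teen": 0.01, "adult": 0.99},
]
soft_label = classify_document_soft(document)
hard_label = classify_document_hard(document)
```

In this toy document the two schemes disagree: accumulating log-probabilities lets one confident sentence outweigh two uncertain ones, whereas hard voting counts every sentence equally.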

On the smaller political speech dataset, the attribute classifier is able to easily discriminate between the two authors, Barack Obama and Donald Trump, achieving a perfect document-level F1-score of 1.00 on both the training and the validation splits. The model also performs well on age-group classification, achieving an F1-score of 0.88 on the validation set at the document level. Gender classification turns out to be the hardest to generalize, with a significant drop in F1-score on the validation set compared to the training set (down to 0.75 from 0.93). In all three tasks, performance at the sentence level is worse than at the document level. Document-level classification also generalizes better, with a smaller difference between training and validation set F1-scores in Table II. Henceforth, we will use the document-level F1-score as our primary metric when evaluating the effectiveness of the A^4NT networks.

| Model | Blog-age: Sent F1 | Blog-age: Doc F1 | Blog-age: Meteor | Blog-gender: Sent F1 | Blog-gender: Doc F1 | Blog-gender: Meteor | Speech: Sent F1 | Speech: Doc F1 | Speech: Meteor |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random classifier | 0.54 | 0.54 | - | 0.53 | 0.5 | - | 0.60 | 0.61 | - |
| Original text | 0.74 | 0.88 | 1.0 | 0.68 | 0.75 | 1.0 | 0.68 | 1.00 | 1.0 |
| Autoencoder | 0.69 | 0.85 | 0.77 | 0.49 | 0.61 | 0.78 | 0.26 | 0.12 | 0.32 |
| FBsem | 0.43 | 0.08 | 0.40 | 0.45 | 0.39 | 0.53 | 0.12 | 0.00 | 0.28 |
| CycML | 0.49 | 0.20 | 0.57 | 0.41 | 0.44 | 0.79 | 0.11 | 0.00 | 0.29 |
| CycML+Lang | 0.54 | 0.32 | 0.69 | 0.44 | 0.39 | 0.79 | 0.12 | 0.00 | 0.29 |

Table III: Performance of the style transfer anonymization in fooling the classifiers, across the three settings. F1 (lower is better) and Meteor (higher is better). F1-scores below chance levels are shown in italics.

5.1.1 Quantifying Privacy Gains from A^4NT

We evaluate the performance of our A^4NT network using the attribute classifiers presented in the previous section. To obfuscate the authorship of a document, the network is run on every sentence in the document to transfer style from the true attribute to the target one. For example, to mask the age of a teenager-written document we use the teen-to-adult network. The style-transferred document is input to the attribute classifier of the source attribute and the F1-score of this classifier is computed. This measures the privacy effectiveness of the style transfer. The Meteor score is computed between the source sentence and the output, to measure the semantic similarity.

Table III shows these results in the three settings. On the small speech dataset all methods, including the autoencoder baseline described in Section 4.3, successfully fool the attribute classifier. They all obtain F1-scores below chance level, with our A^4NT networks doing better. However, the Meteor scores of all models are significantly lower than on the blog dataset, indicating a significant amount of semantic loss in the process of anonymization.

On the larger blog dataset, the autoencoder baseline fails to fool the attribute classifier, with only a small drop in document-level F1-score of 0.03 (from 0.88 to 0.85) in the case of age and 0.14 in the case of gender (from 0.75 to 0.61). Our A^4NT models, however, do much better, with all of them being able to drop the F1-score below random chance.

The FBsem model using the semantic encoder loss achieves the largest privacy gain, decreasing the document-level F1-scores from 0.88 to 0.08 in the case of age and from 0.75 to 0.39 in the case of gender. This model, however, suffers from poor Meteor scores, indicating that the sentences produced after the style transfer are no longer similar to the input.

The model using reconstruction likelihood to enforce semantic consistency, CycML, fares much better in meteor metric in both age and gender style transfer. It is still able to fool the classifier, albeit with smaller drops in F1-scores (still below random chance). Finally, with addition of the language smoothing loss (CycML+Lang), we see a further improvement in meteor in the blog-age setting, while the performance remains similar to CycML on blog-gender setting and the speech dataset. However, the language smoothing model CycML+Lang fares better in human evaluation discussed in Section 5.1.2 and also produces better qualitative samples as will be seen in Section 5.2.

Generalization to other classifiers: An important question to answer, if A^4NT is to be applied to protect the privacy of author attributes, is how well it performs against unseen NLP-based adversaries. To test this, we trained ten different attribute classifier networks on the blog-age setting. These networks vary in architecture (LSTM, CNN and LSTM+CNN) and hyper-parameters (number of layers and number of units), but all of them achieve good performance in predicting the age attribute. The networks were chosen to reflect real-world architecture choices used for text classification. Results from evaluating the text generated by the A^4NT networks using these “holdout” classifiers are shown in Table IV. The column “mean” shows the mean performance of the ten classifiers and “max” shows the score of the best performing classifier.

Holdout classifiers have good performance on the original text, achieving a mean document-level F1-score of 0.85. Table IV shows that all three A^4NT networks generalize well and are able to drop the document-level F1-score of the holdout classifiers to the random chance level (0.54 for the age dataset). They perform slightly worse than on the seen LSTM classifier, but are able to significantly drop the performance of all the holdout classifiers (mean F1-score drops from 0.85 to 0.53 or below). This is strong empirical evidence that the transformations applied by the A^4NT networks are not specific to the classifier they are trained with, but also generalize to other adversaries.

We conclude that the proposed networks are able to fool the attribute classifiers on all three tested tasks and also show generalization ability to fool classifier architectures not seen during training.

Different operating points: Our model offers the ability to obtain multiple different style-transfer outputs by simply sampling from the model's distribution. This is useful as different text samples might have different levels of semantic similarity and privacy effectiveness. Having multiple samples allows users to choose the level of semantic similarity vs. privacy trade-off they prefer.

We illustrate this in Figure 6. Here five samples are obtained from each model for each sentence in the test set. By choosing the sample with the minimum, maximum or a random Meteor score, we can obtain a trade-off between semantic similarity and privacy. We see that while the FBsem model offers limited variability, CycML+Lang offers a wide range of choices of operating points. All operating points of CycML+Lang achieve a Meteor score above 0.5, which indicates that this model preserves the semantic similarity well.
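Selecting an operating point from a set of sampled outputs can be sketched as below. The sample records (text plus a precomputed Meteor score against the input) are hypothetical values, and `pick_operating_point` is an illustrative helper rather than part of the described system.

```python
import random

def pick_operating_point(samples, mode, rng=None):
    """Choose one sampled rewrite by its Meteor score against the input sentence."""
    if mode == "max_meteor":   # favors semantic similarity
        return max(samples, key=lambda s: s["meteor"])
    if mode == "min_meteor":   # favors larger changes to the input
        return min(samples, key=lambda s: s["meteor"])
    if mode == "random":
        return (rng or random).choice(samples)
    raise ValueError(f"unknown mode: {mode}")

# Five hypothetical samples for a single input sentence.
samples = [
    {"text": "sample a", "meteor": 0.91},
    {"text": "sample b", "meteor": 0.74},
    {"text": "sample c", "meteor": 0.62},
    {"text": "sample d", "meteor": 0.83},
    {"text": "sample e", "meteor": 0.55},
]
most_similar = pick_operating_point(samples, "max_meteor")
most_changed = pick_operating_point(samples, "min_meteor")
```

Applying the same selection rule to every sentence in the test set yields one point on the privacy versus semantic similarity curve.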

| Model | Seen Classifier | Holdout Classifiers: Mean | Holdout Classifiers: Max |
| --- | --- | --- | --- |
| Original text | 0.88 | 0.85 | 0.87 |
| Autoencoder | 0.85 | 0.83 | 0.84 |
| FBsem | 0.08 | 0.19 | 0.31 |
| CycML | 0.20 | 0.41 | 0.58 |
| CycML+Lang | 0.32 | 0.53 | 0.62 |

Table IV: anonymization fooling unseen classifiers, on blogdata (age). Columns are doc-level F1 score.
Figure 6: Operating points of models on test set.

5.1.2 Human Judgments for Semantic Consistency

In machine translation and image captioning literature, it is well known that automatic semantic similarity evaluation metrics like meteor are only reliable to a certain extent. Evaluation from human judges is still the gold-standard with which models can be reliably compared.

Accordingly, we conduct human evaluations to judge the semantic similarity preserved by our networks. The evaluations were conducted on a subset of 745 random sentences from the test split of the blog-age dataset. First, output from different models is obtained for the 745 test sentences. If any model generates identical sentences to the input, this model is ranked first automatically without human evaluation. Note that, in some cases, multiple models can achieve rank-1, when they all produce identical outputs. The cases without any identical sentences to the input are evaluated using human annotators on Amazon Mechanical Turk (AMT). An annotator is shown one input sentence and multiple style-transfer outputs and is asked to pick the output sentence which is closest in meaning to the input sentence. Three unique annotators are shown each test sample and majority voting is used to determine the model which ranks first. Cases with no majority from human evaluators are excluded.
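The ranking rule described above (automatic first rank for outputs identical to the input, otherwise a majority vote among annotators, with no-majority cases excluded) can be sketched as follows. The data layout and the helper name `rank_first` are assumptions.

```python
from collections import Counter

def rank_first(input_text, outputs, votes):
    """Return the list of models ranked first for one test sentence (empty if excluded)."""
    # Models that reproduce the input exactly are ranked first without human evaluation.
    identical = sorted(m for m, out in outputs.items() if out == input_text)
    if identical:
        return identical
    if not votes:
        return []
    winner, count = Counter(votes).most_common(1)[0]
    # Require a strict majority of annotators; otherwise the case is excluded.
    return [winner] if count * 2 > len(votes) else []

outputs = {"FBsem": "it is raining", "CycML": "its raining a lot", "CycML+Lang": "it's raining lots now"}
decided = rank_first("and yeh it's raining lots now", outputs, ["CycML+Lang", "CycML+Lang", "FBsem"])
excluded = rank_first("and yeh it's raining lots now", outputs, ["CycML+Lang", "CycML", "FBsem"])
```

Because several models can reproduce the input exactly, more than one model can be ranked first for the same sentence, which is why the percentages can sum to more than 100%.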

The main goal of the study is to identify which of the three networks performs best in terms of semantic similarity according to human judges. We also compare the best of our three systems to the baseline model based on Google machine translation, discussed in Section 4.3.

For the machine translation baseline, we obtain style-transferred texts from four different language round-trips. We started with English→German→French→English, and obtained three more versions by incrementally adding Spanish, Finnish and finally Armenian into the chain before the translation back to English.
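The four round-trip chains can be generated and applied as in the sketch below. The `translate(text, src, dst)` callable is a hypothetical placeholder for whatever translation backend is available; it is not the interface of the Google service, and the fake translator here only records the hops.

```python
def build_chains(base=("German", "French"), extras=("Spanish", "Finnish", "Armenian")):
    """Build the round-trip chains by appending one extra language at a time."""
    chains = []
    for k in range(len(extras) + 1):
        middle = list(base) + list(extras[:k])
        chains.append(["English"] + middle + ["English"])
    return chains

def round_trip(text, chain, translate):
    """Translate text along consecutive language pairs of the chain."""
    for src, dst in zip(chain[:-1], chain[1:]):
        text = translate(text, src, dst)
    return text

# Stand-in translator that only logs each hop (assumption, for illustration).
hops = []
def fake_translate(text, src, dst):
    hops.append((src, dst))
    return text

chains = build_chains()
result = round_trip("i had an interesting day", chains[0], fake_translate)
```

Each additional intermediate language adds one more translation step, and with it more opportunity for both style and meaning to drift.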

To pick the operating points for the human evaluation study, we compare the performance of these four machine translation baselines and our three models on the human-evaluation test set in Figure 7. Note that here we show the sentence-level F1-score on the y-axis, as the human-evaluation test set is too small for document-level evaluation. We see that none of the Google machine translation baselines are able to fool the attribute classifiers. The model with 5-hop translation achieves the best (lowest) F1-score, which is only slightly lower than the F1-score on the input data. This model also achieves a significantly worse Meteor score than any of our A^4NT models.

We conduct human evaluation for our style-transfer models at two operating points, each corresponding to a different F1-score, to obtain human judgments at two different levels of privacy effectiveness, as shown in Table V. We see that the model CycML+Lang outperforms the other two models at both operating points (ignoring ties). These results, combined with the quantitative evaluation discussed in Section 5.1, confirm that the cyclic ML loss combined with the language model loss gives the best trade-off between semantic similarity and privacy effectiveness.

Finally, we conduct human evaluation between the CycML+Lang model and the Google machine translation baseline with 3 hops. The operating point of CycML+Lang is chosen so that the two models are closest to each other in privacy effectiveness and Meteor score. Results in Table VI show that our model wins over the GoogleMT baseline on semantic similarity as per human judges (measured by how often each is ranked first), while still having better privacy effectiveness. This is largely because our model learns not to change the input text if it is already ambiguous for the attribute classifier, and only makes changes when necessary. In contrast, changes made by the GoogleMT round trip are not optimized towards maximizing privacy gain, and can change the input text even when no change is needed.

Table V: Human evaluation to judge semantic similarity. Three variants of our model are compared. Numbers show the % times the model ranked first. Can add to more than 100% as multiple models can have rank-1.
Table VI: Human evaluation of our best model and the Google MT baseline.
Figure 7: Privacy and semantic consistency of A^4NT and the Google MT baseline on the human eval test set

5.2 Qualitative Analysis

In this section we analyze some qualitative examples of anonymized text produced by our model and try to identify strengths and weaknesses of this approach. Then we analyze the performance of the A^4NT network on different levels of input difficulty. We use the attribute classifier's score as a proxy measure of the input text difficulty. If the text is confidently and correctly classified by the attribute classifier, then the network has to make significant changes to fool the classifier. If it is already misclassified, the style-transfer network should ideally not make any changes.

5.2.1 Examples of Style Transfer for anonymization

| # | Input: Teen | A(x) | Output: Adult | A(x) |
| --- | --- | --- | --- | --- |
| 1 | and yeh… it’s raining lots now | 0.97 | and ooh… it’s raining lots now | 0.23 |
| 1 | yeahh… i never let anyone really know how i’m feeling. | 0.94 | anyhow, i never let anyone really know how i’m feeling . | 0.24 |
| 1 | yeh, it’s just goin ok here too! | 0.95 | alas, it’s just goin ok here too! | 0.30 |
| 1 | would i go so far to say that i love her? | 0.52 | will i go so far to say that i love her? | 0.36 |
| 2 | wad a nice day.. spend almost the whole afternoon doing work! | 0.99 | definitely a nice day.. spend almost the whole afternoon doing work! | 0.19 |
| 2 | wadeva told u secrets wad did u do ? | 0.98 | perhaps told u secrets why did u do ? | 0.49 |
| 2 | i don’t know y i even went into dis relationship | 0.92 | i don’t know why i even went into another relationship . | 0.33 |
| 2 | i have nuthing else to say about this horrid day. | 0.79 | i have ofcourse else to say about this accountable day. | 0.08 |
| 3 | after school i got my hair cut so it looks nice again. | 1.0 | after all i have my hair cut so it looks nice again. | 0.42 |
| 3 | i had an interesting day at skool. | 0.97 | i had an interesting day at wedding. | 0.05 |

| # | Input: Adult | A(x) | Output: Teen | A(x) |
| --- | --- | --- | --- | --- |
| 1 | funnily enough , i do n’t care all that much. | 0.58 | haha besides , i do n’t care all that much. | 0.05 |
| 1 | i may go to san francisco state, or i may go back. | 0.54 | i shall go to san francisco state, or i may go back. | 0.09 |
| 1 | i wonder if they ’ll work out… hard to say. | 0.52 | i wonder if they ’ll go out… hard to say. | 0.39 |
| 2 | one is to mix my exercise order a bit more. | 0.97 | one is to mix my diz exercise order a bit more. | 0.08 |
| 2 | ok, think i really will go to bed now. | 0.79 | ok, relized i really will go to bed now. | 0.08 |
| 3 | my first day going out to see clients after vacation. | 0.98 | my first day going out to see some1 after vacation. | 0.04 |
| 3 | i’d tell my wife how much i love her every time i saw her. | 0.96 | i’d tell my crush how much i love her every time i saw her. | 0.06 |
| 3 | i do believe all you need is love. | 0.58 | i dont think all you need is love . | 0.11 |

Table VII: Qualitative examples of anonymization through style transfer in the blog-age setting. Style transfer in both directions is shown along with the attribute classifier score of the source attribute.

| Input: Obama | Output: Trump |
| --- | --- |
| we can do this because we are MISC. | we will do that because we are MISC. |
| we can do better than that. | we will do that better than anybody. |
| it’s not about reverend PERSON. | it’s not about crooked PERSON. |
| but i’m going to need your help. | but i’m going to fight for your country. |
| so that’s my vision. | so that’s my opinion. |
| their situation is getting worse. | their media is getting worse. |
| i’m kind of the term PERSON because i do care. | i’m tired of the system of PERSON PERSON because they don’t care. |
| that’s what we need to change. | that’s what she wanted to change. |
| that’s how our democracy works. | that’s how our horrible horrible trade deals. |

Table VIII: Qualitative examples of style transfer on the speech dataset from Obama to Trump’s style

Table VII shows the results of our model CycML+Lang applied to some example sentences in the blog-age setting. Style transfer in both directions, teenager to adult and adult to teenager, is shown along with the corresponding source attribute classifier scores. The examples illustrate some of the common changes made by the model and are grouped into three categories for analysis (# column in Table VII).

# 1. Using synonyms: The network often uses synonyms to change the style to target attribute. This is seen in style transfers in both directions, teen to adult and adult to teen in category # 1 samples in Table VII. We can see the model replacing “yeh” with “ooh”, “would” with “will”, “…” with “,” and so on when going from teen to adult, and replacing “funnily enough” with “haha besides”, “work out” with “go out” and so on when changing from adult to teen. We can also see that the changes are not static, but depend on the context. For example “yeh” is replaced with “alas” in one instance and with “ooh” in another. These changes do not alter the meaning of the sentence too much, but fool the attribute classifiers thereby providing privacy to the author attribute.

# 2. Replacing slang words: When changing from teen to adult, A^4NT often replaces slang words or incorrectly spelled words with standard English words, as seen in category #2 in Table VII. For example, it replaces “wad” (what) with “definitely”, “wadeva” with “perhaps” and “nuthing” with “ofcourse”. The opposite effect is seen when going from adult to teenager, with the addition of “diz” (this) and the replacement of “think” with “relized” (realized). These changes are learned entirely from the data, and would be very hard to encode explicitly in a rule-based system due to the variety of slang and spelling mistakes.

# 3. Semantic changes: One failure mode of A^4NT occurs when the input sentence has semantic content which is significantly more biased towards the author’s class. These examples are shown in category #3 in Table VII. For example, when an adult author mentions his “wife”, the network replaces it with “crush”, altering the meaning of the input sentence. Some common entity pairs where this behavior is seen are (school↔work), (class↔office), (dad↔husband), (mum↔wife), and so on. Arguably, in such cases, there is no obvious way to mask the identity of the author without altering these obviously biased content words.

On the smaller speech dataset, however, the changes made by the model alter the semantics of the sentences in some cases. A few example style transfers from Obama's to Trump’s style are shown in Table VIII. We see that the model inserts hyperbole (“better than anybody”, “horrible horrible”, “crooked”) and references to “media” and “system”, all salient features of Trump’s style. We see that the style transfer here is quite successful, sufficient to completely fool the identity classifier, as was seen in Table III. However, and somewhat expectedly, the semantics of the input sentence is generally lost. A possible cause is that the attribute classifier is too strong on this data, owing to the small dataset size and the highly distinctive styles of the two authors, and to fool it the network learns to make drastic changes to the input text.

5.2.2 Performance Across Input Difficulty

Figure 8 compares the attribute classifier score on the input sentence and the A^4NT output. Ideally we want all the outputs to score below the decision boundary, while also not increasing the classifier score compared to the input text. This “ideal score” is shown as a grey solid line. We see that for the most part all three models are below or close to this ideal line. As the input text gets more difficult (increasing attribute classifier score), CycML and CycML+Lang slightly cross above the ideal line, but still provide a significant improvement over the input text in terms of classifier score.


Figure 8: Output Privacy vs Privacy on Input.
Figure 9: Meteor score plotted against input difficulty.
Figure 10: Histogram of privacy gain (left side) is shown alongside comparison of meteor score vs privacy gains.

Now, we analyze how much of input semantics is preserved with increasing difficulty. Figure 9 plots the meteor score of the output against the difficulty of input text. We see that the meteor is high for sentences already across the decision boundary. These are easy cases, where the networks need not intervene. As the input gets more difficult, the meteor score of the output drops, as the network needs to do more changes to be able to fool the attribute classifier. The CycML+Lang model fares better than the other two models, with consistently higher meteor across the difficulty spectrum.

Figure 10 shows the histogram of privacy gain across the test set. Privacy gain is the difference between the attribute classifier score on the input and on the A^4NT network output. We see that the majority of transformations by the A^4NT networks lead to positive privacy gains, with only a small fraction leading to negative privacy gains. This is promising given that this histogram is over all the 500k sentences in the test set. The Meteor score plotted against privacy gain, shown in Figure 10, again confirms that large privacy gains come with a trade-off of loss in semantics.
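Privacy gain as defined above can be computed per sentence as in the sketch below. The scores are the classifier scores of the first four teen-to-adult examples in Table VII; the helper names are hypothetical.

```python
def privacy_gains(input_scores, output_scores):
    """Privacy gain per sentence: classifier score on the input minus score on the output."""
    return [inp - out for inp, out in zip(input_scores, output_scores)]

def fraction_positive(gains):
    """Share of sentences whose transformation lowered the classifier score."""
    return sum(1 for g in gains if g > 0) / len(gains)

# Source-attribute classifier scores before and after style transfer (from Table VII).
input_scores = [0.97, 0.94, 0.95, 0.52]
output_scores = [0.23, 0.24, 0.30, 0.36]
gains = privacy_gains(input_scores, output_scores)
share = fraction_positive(gains)
```

A histogram of `gains` over the whole test set corresponds to the left panel described for Figure 10.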

6 Conclusions

We presented a novel, fully automatic method for protecting privacy-sensitive attributes of an author against NLP-based attackers. Our solution, the A^4NT network, is developed using a novel application of adversarial training to machine translation networks to learn to protect private attributes. The A^4NT network achieves this by learning to perform style transfer without paired data.

A^4NT offers a new data-driven approach to authorship obfuscation. The flexibility of this end-to-end trainable model means it can adapt to new attack methods and datasets. Experiments on three different attributes, namely age, gender and identity, showed that the A^4NT network is able to effectively fool the attribute classifiers in all three settings. We also show that the A^4NT network performs well against multiple unseen classifier architectures. This strong empirical evidence suggests that the method is likely to be effective against previously unknown NLP adversaries.

We developed a novel solution to preserve the meaning of the input text using the likelihood of reconstruction. Semantic similarity (quantified by Meteor score) of the A^4NT network remains high for easier sentences, which do not contain obvious give-away words (school, work, husband etc.), but is lower on difficult sentences, indicating that the network effectively learns to identify and apply the right magnitude of change. The A^4NT network can be operated at different points on the privacy-effectiveness and semantic-similarity trade-off curve, and thus offers flexibility to the user. The experiments on the political speech data show the limits to which a style-transfer-based approach can be used to hide attributes. On this challenging data with very distinct styles of the two authors, our method effectively fools the identity classifier but achieves this by altering the semantics of the input text.

In future work we would like to explore generator architectures to extend the A^4NT framework to structured data like code, to protect against code stylometric attacks.


This research was supported in part by the German Research Foundation (DFG CRC 1223). We would also like to thank Yang Zhang, Ben Stock and Sven Bugiel for helpful feedback.


  • [1] P. Juola et al., “Authorship attribution,” Foundations and Trends® in Information Retrieval, vol. 1, no. 3, pp. 233–334, 2008.
  • [2] E. Stamatatos, “A survey of modern authorship attribution methods,” Journal of the Association for Information Science and Technology, vol. 60, no. 3, pp. 538–556, 2009.
  • [3] S. Ruder, P. Ghaffari, and J. G. Breslin, “Character-level and multi-channel convolutional neural networks for large-scale authorship attribution,” arXiv preprint arXiv:1609.06686, 2016.
  • [4] S. Argamon, M. Koppel, J. W. Pennebaker, and J. Schler, “Automatically profiling the author of an anonymous text,” Communications of the ACM, vol. 52, no. 2, pp. 119–123, 2009.
  • [5] R. Overdorf and R. Greenstadt, “Blogs, twitter feeds, and reddit comments: Cross-domain authorship attribution,” Proceedings on Privacy Enhancing Technologies, vol. 2016, no. 3, pp. 155–171, 2016.
  • [6] A. Narayanan, H. Paskov, N. Z. Gong, J. Bethencourt, E. Stefanov, E. C. R. Shin, and D. Song, “On the feasibility of internet-scale author identification,” in Security and Privacy (SP), 2012 IEEE Symposium on.   IEEE, 2012, pp. 300–314.
  • [7] P. Juola. (2013) How a computer program helped show J.K. Rowling write A Cuckoo’s Calling. [Online]. Available:
  • [8] A. A. Morgan-Lopez, A. E. Kim, R. F. Chew, and P. Ruddle, “Predicting age groups of twitter users based on language and metadata features,” PloS one, vol. 12, no. 8, p. e0183537, 2017.
  • [9] K. Ikeda, G. Hattori, C. Ono, H. Asoh, and T. Higashino, “Twitter user profiling based on text and community mining for market analysis,” Know.-Based Syst., vol. 51, 2013.
  • [10] A. Makazhanov, D. Rafiei, and M. Waqar, “Predicting political preference of twitter users,” Social Network Analysis and Mining, vol. 4, no. 1, p. 193, 2014.
  • [11] H. Grassegger and M. Krogerus. (2017) The data that turned the world upside down. [Online]. Available:
  • [12] M. Brennan, S. Afroz, and R. Greenstadt, “Adversarial stylometry: Circumventing authorship recognition to preserve privacy and anonymity,” ACM Transactions on Information and System Security (TISSEC), vol. 15, no. 3, p. 12, 2012.
  • [13] S. Afroz, M. Brennan, and R. Greenstadt, “Detecting hoaxes, frauds, and deception in writing style online,” in Security and Privacy (SP), 2012 IEEE Symposium on.   IEEE, 2012, pp. 461–475.
  • [14] A. W. McDonald, S. Afroz, A. Caliskan, A. Stolerman, and R. Greenstadt, “Use fewer instances of the letter ‘i’: Toward writing style anonymization,” in Privacy Enhancing Technologies, vol. 7384.   Springer, 2012, pp. 299–318.
  • [15] D. Castro, R. Ortega, and R. Muñoz, “Author Masking by Sentence Transformation—Notebook for PAN at CLEF 2017,” in CLEF 2017 Evaluation Labs and Workshop – Working Notes Papers, Sep. 2017.
  • [16] Y. Keswani, H. Trivedi, P. Mehta, and P. Majumder, “Author masking through translation.” in CLEF (Working Notes), 2016, pp. 890–894.
  • [17] A. Caliskan and R. Greenstadt, “Translate once, translate twice, translate thrice and attribute: Identifying authors and machine translation tools in translated text,” in 2012 IEEE Sixth International Conference on Semantic Computing, Sept 2012, pp. 121–125.
  • [18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.
  • [19] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
  • [20] W. Xu, A. Ritter, B. Dolan, R. Grishman, and C. Cherry, “Paraphrasing for style,” Proceedings of COLING 2012, pp. 2899–2914, 2012.
  • [21] S. Afroz, A. C. Islam, A. Stolerman, R. Greenstadt, and D. McCoy, “Doppelgänger finder: Taking stylometry to the underground,” in Security and Privacy (SP), 2014 IEEE Symposium on.   IEEE, 2014, pp. 212–226.
  • [22] A. Caliskan-Islam, R. Harang, A. Liu, A. Narayanan, C. Voss, F. Yamaguchi, and R. Greenstadt, “De-anonymizing programmers via code stylometry,” in USENIX Security Symposium, 2015.
  • [23] A. Abbasi and H. Chen, “Writeprints: A stylometric approach to identity-level identification and similarity detection in cyberspace,” ACM Transactions on Information Systems (TOIS), vol. 26, no. 2, p. 7, 2008.
  • [24] D. Bagnall, “Author identification using multi-headed recurrent neural networks,” arXiv preprint arXiv:1506.04891, 2015.
  • [25] G. Kacmarcik and M. Gamon, “Obfuscating document stylometry to preserve author anonymity,” in Proceedings of the COLING/ACL on Main conference poster sessions.   Association for Computational Linguistics, 2006, pp. 444–451.
  • [26] G. Karadzhov, T. Mihaylova, Y. Kiprov, G. Georgiev, I. Koychev, and P. Nakov, “The case for being average: A mediocrity approach to style masking and author obfuscation,” in International Conference of the Cross-Language Evaluation Forum for European Languages.   Springer, 2017, pp. 173–185.
  • [27] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” Proceedings of the International Conference on Learning Representations (ICLR), 2014.
  • [28] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
  • [29] T. Shen, T. Lei, R. Barzilay, and T. Jaakkola, “Style transfer from non-parallel text by cross-alignment,” To appear in NIPS, 2017.
  • [30] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, 1997.
  • [31] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems (NIPS), 2013.
  • [32] J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.
  • [33] R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu, and Z. Chen, “Sequence-to-sequence models can directly transcribe foreign speech,” arXiv preprint arXiv:1703.08581, 2017.
  • [34] X. Ma and E. Hovy, “End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2016, pp. 1064–1074.
  • [35] R. Shetty, M. Rohrbach, L. A. Hendricks, M. Fritz, and B. Schiele, “Speaking the same language: Matching machine to human captions by adversarial training,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  • [36] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,” Proceedings of the International Conference on Learning Representations (ICLR), 2016.
  • [37] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  • [38] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, “Supervised learning of universal sentence representations from natural language inference data,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017.
  • [39] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler, “Skip-thought vectors,” in Advances in neural information processing systems, 2015, pp. 3294–3302.
  • [40] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, “A large annotated corpus for learning natural language inference,” Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015.
  • [41] Pytorch framework. [Online]. Available:
  • [42] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude,” COURSERA: Neural networks for machine learning, vol. 4, no. 2, pp. 26–31, 2012.
  • [43] J. Schler, M. Koppel, S. Argamon, and J. W. Pennebaker, “Effects of age and gender on blogging.” in AAAI spring symposium: Computational approaches to analyzing weblogs, vol. 6, 2006, pp. 199–205.
  • [44] J. T. Woolley and G. Peters. (1999) The american presidency project. [Online]. Available:
  • [45] J. R. Finkel, T. Grenager, and C. Manning, “Incorporating non-local information into information extraction systems by gibbs sampling,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2005, pp. 363–370.
  • [46] M. Denkowski and A. Lavie, “Meteor universal: Language specific translation evaluation for any target language,” in Proceedings of the Ninth Workshop on Statistical Machine Translation.   ACL, 2014, pp. 376–380.
  • [47] C. J. Maddison, A. Mnih, and Y. W. Teh, “The concrete distribution: A continuous relaxation of discrete random variables,” Proceedings of the International Conference on Learning Representations (ICLR), 2016.

Appendix A - Differentiability of discrete samples

To obtain an output sentence sample from the A^4NT network, we can sample from the distribution shown in (5) repeatedly until a special ‘END’ token is sampled. This naive sampling, however, is not suitable for training within a GAN framework, because sampling from a multinomial distribution is not differentiable.

To make sampling differentiable, we follow the approach used in [35] and use the Gumbel-Softmax approximation [36] to obtain differentiable soft samples from this distribution. The Gumbel-Softmax approximation has two parts. First, the re-parametrization trick using the Gumbel random variable is applied to make the process of sampling from a multinomial distribution differentiable with respect to its probabilities. Next, softmax is used to approximate the arg-max operator to obtain “soft” samples instead of one-hot vectors. This makes the samples themselves differentiable. Thus, the Gumbel-Softmax approximation allows differentiating through sentence samples from the A^4NT network, enabling end-to-end GAN training. Further details on the Gumbel-Softmax approximation can be found in [36, 47].
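The two parts of the approximation can be illustrated with a minimal sketch of the forward computation for a single token. This is not the authors' implementation: the function names, the temperature parameter and the example probabilities are assumptions made for illustration, and only the Python standard library is used, so no gradients are actually computed. In an automatic differentiation framework the same arithmetic would be differentiable with respect to the input probabilities.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample as g = -log(-log(u)), u ~ Uniform(0, 1)."""
    u = rng.random()
    # Clamp u away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax_sample(probs, temperature, rng):
    """Return a "soft" sample over the vocabulary (hypothetical helper).

    Part 1 (re-parametrization): add independent Gumbel noise to log-probabilities.
    Part 2 (relaxation): replace arg-max with a temperature-scaled softmax.
    """
    perturbed = [math.log(max(p, 1e-12)) + sample_gumbel(rng) for p in probs]
    scaled = [x / temperature for x in perturbed]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Example: a toy three-word vocabulary with assumed probabilities.
rng = random.Random(0)
soft_sample = gumbel_softmax_sample([0.7, 0.2, 0.1], temperature=0.5, rng=rng)
```

The result is a vector of non-negative weights that sums to one rather than a one-hot vector. Lower temperatures push it closer to one-hot, and the index of its largest entry is distributed according to the input probabilities, which is why the soft sample can stand in for a discrete word during training.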