Towards Robust and Privacy-preserving Text Representations

by Yitong Li, et al.
The University of Melbourne

Written text often provides sufficient clues to identify the author, their gender, age, and other important attributes. Consequently, the authorship of training and evaluation corpora can have unforeseen impacts, including differing model performance for different user groups, as well as privacy implications. In this paper, we propose an approach to explicitly obscure important author characteristics at training time, such that the representations learned are invariant to these attributes. Evaluating on two tasks, we show that this leads to increased privacy in the learned representations, as well as models that are more robust to varying evaluation conditions, including out-of-domain corpora.



1 Introduction

Language is highly diverse, and differs according to the author, their background, and personal attributes such as gender, age, education and nationality. This variation can have a substantial effect on NLP models learned from text Hovy et al. (2015), leading to significant variation in inferences across corpora that differ in author attributes such as native language, gender and age. Training corpora are never truly representative, and therefore models fit to these datasets are biased in the sense that they are much more effective for texts from certain groups of users, e.g., middle-aged white men, and considerably poorer for other parts of the population Hovy (2015). Moreover, models fit to language corpora often fixate on author attributes which correlate with the target variable, e.g., gender correlating with class skews Zhao et al. (2017), or translation choices Rabinovich et al. (2017). This signal, however, is rarely fundamental to the task of modelling language, and is better considered as a confounding influence. These auxiliary learning signals can mean the models do not adequately capture the core linguistic problem. In such situations, removing these confounds should give better generalisation, especially for out-of-domain evaluation, a similar motivation to research in domain adaptation based on selection biases over text domains Blitzer et al. (2007); Daumé III (2007).

Another related problem is privacy: texts convey information about their author, often inadvertently, and many individuals may wish to keep this information private. Consider the case of the AOL search data leak, in which AOL released detailed search logs of many of their users in August 2006 Pass et al. (2006). Although they de-identified users in the data, the log itself contained sufficient personally identifiable information to allow many of these individuals to be identified Jones et al. (2007). Other sources of user text, such as emails, SMS messages and social media posts, would likely pose similar privacy issues. This raises the question of how the corpora, or models built thereupon, can be distributed without exposing this sensitive data. This is the problem of differential privacy, which is more typically applied to structured data, and often involves data masking, addition of noise, or other forms of corruption, such that formal bounds can be stated in terms of the likelihood of reconstructing the protected components of the dataset Dwork (2008). This often comes at the cost of an accuracy reduction for models trained on the corrupted data Shokri and Shmatikov (2015); Abadi et al. (2016).

Another related setting is where latent representations of the data are shared, rather than the text itself, which might occur when sending data from a phone to the cloud for processing, or trusting a third party with sensitive emails for NLP processing, such as grammar correction or translation. The transferred representations may still contain sensitive information, however, especially if an adversary has preliminary knowledge of the training model, in which case they can readily reverse engineer the input, for example, by a GAN attack algorithm Hitaj et al. (2017). This is true even when differential privacy mechanisms have been applied.

Inspired by the above works, and recent successes of adversarial learning Goodfellow et al. (2014); Ganin et al. (2016), we propose a novel approach for privacy-preserving learning of unbiased representations (our implementation is available). Specifically, we employ Ganin et al.'s approach to training deep models with adversarial learning, to explicitly obscure individuals' private information. Thereby the learned (hidden) representations of the data can be transferred without compromising the authors' privacy, while still supporting high-quality NLP inference. We evaluate on the tasks of POS-tagging and sentiment analysis, protecting several demographic attributes (gender, age, and location), and show empirically that doing so does not hurt accuracy, but instead can lead to substantial gains, most notably in out-of-domain evaluation. Compared to differential privacy, we report gains rather than losses in performance, but note that we provide only empirical improvements in privacy, without any formal guarantees.

2 Methodology

We consider a standard supervised learning situation, in which inputs $x$ are used to compute a representation $h$, which then forms the parameterisation of a generalised linear model, used to predict the target $y$. Training proceeds by minimising a differentiable loss, e.g., cross entropy, between predictions and the ground truth, in order to learn an estimate of the model parameters, denoted $\theta_M$.

Overfitting is a common problem, particularly in deep learning models with large numbers of parameters, whereby the representation $h$ learns to capture specifics of the training instances which do not generalise to unseen data. Some types of overfitting are insidious, and cannot be adequately addressed with standard techniques like dropout or regularisation. Consider, for example, the authorship of each sentence in the training set in a sentiment prediction task. Knowing the author, and their general disposition, will likely provide strong clues about their sentiment with respect to any sentence. Moreover, given the ease of authorship attribution, a powerful learning model might learn to detect the author from their text, and use this to predict the sentiment, rather than basing the decision on the semantics of each sentence. This might be the most efficient use of model capacity if there are many sentences by this individual in the training dataset, yet will lead to poor generalisation to test data authored by unseen individuals.

Moreover, this raises privacy issues when the representation $h$ is known by an attacker or malicious adversary. Traditional privacy-preserving methods, such as added noise or masking, applied to the representation will often incur a cost in terms of a reduction in task performance. Differential privacy methods are unable to protect the user privacy of $h$ under adversarial attacks, as described in Section 1.

Therefore, we consider how to learn unbiased representations of the data with respect to specific attributes which we expect to behave as confounds in a generalisation setting. To do so, we take inspiration from adversarial learning Goodfellow et al. (2014); Ganin et al. (2016). The architecture is illustrated in Figure 1.

Figure 1: Proposed model architectures, showing a single training instance $(x, y)$ with two protected attributes, $b_1$ and $b_2$. $D$ indicates a discriminator, and the red dashed and blue lines denote adversarial and standard loss, respectively.

2.1 Adversarial Learning

A key idea of adversarial learning, following Ganin et al. (2016), is to learn a discriminator model $D$ jointly with the standard supervised model. Using gender as an example, a discriminator will attempt to predict the gender, $b$, of each instance from the representation $h$, such that training involves joint learning of both the model parameters, $\theta_M$, and the discriminator parameters, $\theta_D$. However, the learning aims of these components are in opposition: we seek an $h$ which leads to a good predictor of the target $y$, while being a poor representation for prediction of gender. This leads to the objective (illustrated for a single training instance),

$$\hat{\theta} = \min_{\theta_M} \max_{\theta_D} \; \mathcal{X}(\hat{y}(x; \theta_M), y) - \lambda \cdot \mathcal{X}(\hat{b}(x; \theta_D), b) \qquad (1)$$

where $\mathcal{X}$ denotes the cross entropy function and $\lambda$ weights the adversarial term. The negative sign of the second term, referred to as the adversarial loss, can be implemented by a gradient reversal layer during backpropagation Ganin et al. (2016). To elaborate, training is based on standard gradient backpropagation for learning the main task; for the auxiliary task, we also start with standard loss backpropagation, but the gradients are reversed in sign when they reach $h$. Consequently the linear output components are trained to be good predictors, but $h$ is trained to be maximally good for the main task and maximally poor for the auxiliary task.
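
The gradient reversal idea can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: it assumes a toy linear encoder `W`, binary task and attribute labels, logistic output heads `u` (task) and `v` (discriminator), and hypothetical values for the weight `lam` and the learning rate `lr`. Both heads descend on their own cross entropy, while the discriminator's gradient is flipped in sign before it updates the encoder.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def adversarial_step(W, u, v, x, y, b, lam=0.5, lr=0.1):
    """One SGD step on a toy model (illustrative assumption, not the paper's network).

    W: encoder matrix (list of rows), h = W x is the shared representation
    u: task head, v: discriminator head (both logistic)
    y: binary task label, b: binary protected attribute
    Updates W, u and v in place and returns the three gradients w.r.t. h.
    """
    h = [dot(row, x) for row in W]
    # Gradient of binary cross entropy w.r.t. the logit is (prediction - label).
    err_y = sigmoid(dot(u, h)) - y
    err_b = sigmoid(dot(v, h)) - b
    grad_h_task = [err_y * uj for uj in u]
    grad_h_disc = [err_b * vj for vj in v]
    # Gradient reversal: the discriminator gradient changes sign (scaled by lam)
    # at h, so the encoder is pushed to make the attribute harder to predict.
    grad_h = [gt - lam * gd for gt, gd in zip(grad_h_task, grad_h_disc)]
    # Both output heads follow ordinary gradient descent on their own loss.
    for j in range(len(h)):
        u[j] -= lr * err_y * h[j]
        v[j] -= lr * err_b * h[j]
    # The encoder receives the combined (partly reversed) gradient.
    for j, row in enumerate(W):
        for k in range(len(row)):
            row[k] -= lr * grad_h[j] * x[k]
    return grad_h_task, grad_h_disc, grad_h
```

With `lam=0` the step reduces to ordinary supervised training of the encoder, which is one way to obtain a non-adversarial baseline from the same code.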

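Furthermore, Equation 1 can be expanded to scenarios with several ($N$) protected attributes,

$$\hat{\theta} = \min_{\theta_M} \max_{\{\theta_{D^i}\}_{i=1}^{N}} \; \mathcal{X}(\hat{y}(x; \theta_M), y) - \sum_{i=1}^{N} \lambda_i \cdot \mathcal{X}(\hat{b}_i(x; \theta_{D^i}), b_i) \qquad (2)$$

where $\mathcal{X}$ is the cross entropy, $b_i$ is the $i$-th protected attribute, and $\theta_{D^i}$ and $\lambda_i$ are the parameters and weight of its discriminator.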
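
As a rough illustration of the multi-attribute objective, the sketch below computes the quantity the encoder minimises for one training instance: the task cross entropy minus a weighted sum of the discriminators' cross entropies. The function names and the example probabilities are assumptions made for illustration only.

```python
import math

def cross_entropy(probs, gold):
    """Cross entropy of a predicted distribution against a gold class index."""
    return -math.log(probs[gold])

def encoder_objective(task_probs, y, disc_probs, attrs, lambdas):
    """Task loss minus the weighted adversarial losses, for a single instance.

    task_probs: predicted distribution over task labels
    disc_probs: one predicted distribution per protected attribute
    attrs:      gold value (class index) of each protected attribute
    lambdas:    one weight per protected attribute (values are hypothetical)
    """
    task_loss = cross_entropy(task_probs, y)
    adversarial = sum(
        lam * cross_entropy(p, b) for lam, p, b in zip(lambdas, disc_probs, attrs)
    )
    return task_loss - adversarial

# Example with two protected attributes, e.g. sex and age.
value = encoder_objective(
    task_probs=[0.7, 0.3], y=0,
    disc_probs=[[0.5, 0.5], [0.9, 0.1]], attrs=[0, 0],
    lambdas=[1.0, 1.0],
)
```

A discriminator that is confident and correct (the second one here) contributes little to the subtracted term, so the encoder gains more by confusing it than by further confusing one that is already at chance.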


3 Experiments

In this section, we report experimental results for our method on two very different language tasks.

3.1 POS-tagging

The first task is part-of-speech (POS) tagging, framed as a sequence tagging problem. Recent demographic studies have found that the author's gender, age and race can influence tagger performance Hovy and Søgaard (2015); Jørgensen et al. (2016). Therefore, we use POS tagging to demonstrate that our model is capable of eliminating this type of bias, thereby leading to more robust models of the problem.


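Our model is a bi-directional LSTM for POS tag prediction Hochreiter and Schmidhuber (1997), formulated as:

$$\overrightarrow{h}_i = \mathrm{LSTM}(x_i, \overrightarrow{h}_{i-1}), \qquad \overleftarrow{h}_i = \mathrm{LSTM}(x_i, \overleftarrow{h}_{i+1}), \qquad \hat{y}_i = \mathrm{softmax}\big(\phi([\overrightarrow{h}_i; \overleftarrow{h}_i])\big),$$

for input sequence $x_1, \dots, x_n$ with terminal hidden states $\overrightarrow{h}_0$ and $\overleftarrow{h}_{n+1}$ set to zero, where $\phi$ is a linear transformation, and $[\cdot\,;\cdot]$ denotes vector concatenation.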

For the adversarial learning, we use the training objective from Equation 2 to protect gender and age, both of which are treated as binary values. The adversarial component is parameterised by feedforward nets with a hidden layer, applied to the final hidden representation. For hyperparameters, we use a single fixed size for the word embeddings and the hidden representation $h$, and the same value for all $\lambda$ weights. Dropout is applied to all hidden layers during training.


We use the TrustPilot English POS tagged dataset Hovy and Søgaard (2015), which consists of 600 sentences, each labelled with both the sex and age of the author, and manually POS tagged based on the Google Universal POS tagset Petrov et al. (2012). For the purposes of this paper, we follow Hovy and Søgaard’s setup, categorising sex into female (F) and male (M), and age into over-45 (O45) and under-35 (U35). We train the taggers both with and without the adversarial loss, denoted adv and baseline, respectively.

For evaluation, we perform cross-validation, with a train:dev:test split using ratios of 8:1:1. We also follow the evaluation method of Hovy and Søgaard (2015), reporting the tagging accuracy for sentences over different slices of the data based on sex and age, and the absolute difference between the two groups within each attribute.
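
The stratified evaluation can be sketched as follows. This is an illustrative helper rather than the evaluation script used in the paper: it assumes token-level accuracy, one group label per sentence, and hypothetical function names.

```python
from collections import defaultdict

def accuracy_by_group(gold_tags, pred_tags, groups):
    """Token-level tagging accuracy [%] per group.

    gold_tags, pred_tags: lists of tag sequences, one per sentence
    groups: one group label per sentence, e.g. "F"/"M" or "O45"/"U35"
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold, pred, group in zip(gold_tags, pred_tags, groups):
        for g, p in zip(gold, pred):
            total[group] += 1
            correct[group] += int(g == p)
    return {group: 100.0 * correct[group] / total[group] for group in total}

def absolute_gap(acc, group_a, group_b):
    """Absolute accuracy difference between two groups (smaller is better)."""
    return abs(acc[group_a] - acc[group_b])

gold = [["NOUN", "VERB"], ["DET", "NOUN"]]
pred = [["NOUN", "VERB"], ["DET", "VERB"]]
acc = accuracy_by_group(gold, pred, ["F", "M"])
gap = absolute_gap(acc, "F", "M")
```

Running the same two functions once with sex labels and once with age labels gives the per-group accuracies and the gap for each attribute.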

Considering the tiny quantity of text in the TrustPilot corpus, we use the Web English Treebank (WebEng: Bies et al. (2012)) as a means of pre-training the tagging model. WebEng was chosen to be as similar as possible in domain to the TrustPilot data, in that the corpus includes unedited user-generated internet content.

As a second evaluation set, we use a corpus of African-American Vernacular English (AAVE) from Jørgensen et al. (2016), which is used purely for held-out evaluation. AAVE consists of three very heterogeneous domains: lyrics, subtitles and tweets. Considering the substantial difference between this corpus and WebEng or TrustPilot, and the lack of any domain adaptation, we expect a substantial drop in performance when transferring models, but also expect a larger impact from bias removal using adv training.

Table 1: POS prediction accuracy [%] using the Trustpilot test set, stratified by sex (F, M) and age (O45, U35) (higher is better), and the absolute difference ($\Delta$) within each bias group (smaller is better). The best result is indicated in bold.

Results and analysis

Table 1 shows the results for the TrustPilot dataset. Observe that the disparity in baseline tagger accuracy (the $\Delta$ column) is larger for age than for sex, consistent with the results of Hovy and Søgaard (2015). Our adv method leads to a sizeable reduction in the difference in accuracy across both sex and age, showing that our model captures less of the bias signal and is more robust on the tagging task. Moreover, our method leads to a substantial improvement in accuracy across all the test cases. We speculate that this is a consequence of the regularising effect of the adversarial loss, leading to a better characterisation of the tagging problem.

Table 2 shows the results for the AAVE held-out domain. We do not have annotations for sex or age, and thus we only report the overall accuracy on this dataset. Here, adv also significantly outperforms the baseline across the three held-out domains.

Combined, these results demonstrate that our model can learn representations that are relatively de-biased with respect to gender and age, while simultaneously improving the predictive performance, both for in-domain and out-of-domain evaluation scenarios.

3.2 Sentiment Analysis

The second task we use is sentiment analysis, which also has broad applications to the online community, as well as privacy implications for the authors whose text is used to train our models. Many user attributes have been shown to be easily detectable from online review data, which is used extensively in sentiment analysis Hovy et al. (2015); Potthast et al. (2017). In this paper, we focus on three demographic variables: gender, age, and location.


Sentiment is framed as a multi-class text classification problem, which we model using the convolutional neural net (CNN) architecture of Kim (2014), in which the hidden representation is generated by a series of convolutional filters followed by a max-pooling step, simply denoted as $h$. We follow the hyper-parameter settings of Kim (2014), and initialise the model with word2vec embeddings Mikolov et al. (2013). We use the same value for all $\lambda$ weights and apply dropout to $h$.

As the discriminator, we also use a feed-forward model with one hidden layer, to predict the private attribute(s). We compare models trained with zero, one, or all three private attributes, denoted baseline, adv-*, and adv-all, respectively.

Table 2: POS predictive accuracy [%] over the AAVE dataset, stratified over the three domains (lyrics, subtitles, tweets), alongside the macro-average accuracy. The best result is indicated in bold.


We again use the TrustPilot dataset derived from Hovy et al. (2015), however now we consider the rating score as the target variable, not the POS tag. Each review is associated with three further attributes: gender (sex), age (age), and location (loc). To ensure that loc cannot be trivially predicted based on the script, we discard all non-English reviews based on Lui and Baldwin (2012), by retaining only reviews classified as English with a confidence above a fixed threshold. We then subsample 10k reviews for each location to balance the five location classes (US, UK, Germany, Denmark, and France), which were highly skewed in the original dataset. We use the same binary representation of sex and age as the POS task, following the setup in Hovy et al. (2015).
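
The filtering and balancing step might look like the sketch below. The record fields (`lang`, `lang_conf`, `loc`), the function name, and the threshold passed as `min_conf` are all hypothetical; the language label and confidence are assumed to have been produced beforehand by a language identification tool.

```python
import random
from collections import defaultdict

def balance_by_location(reviews, per_class, min_conf, seed=0):
    """Keep confident English reviews, then sample up to per_class per location.

    reviews: dicts with assumed keys "lang", "lang_conf" and "loc"
    min_conf: confidence threshold for the English label (value is an assumption)
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for review in reviews:
        if review["lang"] == "en" and review["lang_conf"] > min_conf:
            buckets[review["loc"]].append(review)
    sample = []
    for loc in sorted(buckets):      # fixed order keeps the sample reproducible
        items = list(buckets[loc])
        rng.shuffle(items)
        sample.extend(items[:per_class])
    return sample

reviews = (
    [{"lang": "en", "lang_conf": 0.99, "loc": "US"} for _ in range(5)]
    + [{"lang": "en", "lang_conf": 0.99, "loc": "UK"} for _ in range(3)]
    + [{"lang": "en", "lang_conf": 0.50, "loc": "UK"} for _ in range(2)]
    + [{"lang": "da", "lang_conf": 0.99, "loc": "UK"}]
)
balanced = balance_by_location(reviews, per_class=2, min_conf=0.9)
```

Filtering before sampling matters: otherwise low-confidence or non-English reviews could take up part of a location's quota and then be discarded, leaving the classes unbalanced.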

To evaluate the different models, we perform cross-validation and report test performance in terms of the $F_1$ score for the rating task, and the accuracy of each discriminator. Note that the discriminator can be applied to test data, where it plays the role of an adversarial attacker, by trying to determine the private attributes of users based on their hidden representation. That is, lower discriminator performance indicates that the representation conveys better privacy for individuals, and vice versa.
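
One simple way to read the discriminator accuracy is against the majority-class rate for the same attribute, as sketched below. The helper names are illustrative assumptions; an attacker that scores close to the majority-class rate has learned little beyond the label prior.

```python
from collections import Counter

def accuracy(gold, pred):
    """Accuracy [%] of predicted attribute values."""
    return 100.0 * sum(g == p for g, p in zip(gold, pred)) / len(gold)

def majority_class_accuracy(labels):
    """Accuracy [%] of always predicting the most frequent attribute value."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return 100.0 * most_common_count / len(labels)

def leakage_over_majority(gold, attacker_pred):
    """How far the attacker beats the majority baseline, in percentage points."""
    return accuracy(gold, attacker_pred) - majority_class_accuracy(gold)

gold_sex = ["F", "F", "M", "M", "F"]
attacker = ["F", "F", "M", "M", "M"]
margin = leakage_over_majority(gold_sex, attacker)
```

A margin near zero suggests that the hidden representation conveys little about the attribute to this particular attacker; it is an empirical indication only, not a formal privacy guarantee.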

Table 3: Sentiment $F_1$-score [%] over the rating task (dev and test), and accuracy [%] of the discriminators across the three private attributes (age, sex, loc). The best score is indicated in bold. The majority class with respect to each private attribute is also reported.


Table 3 shows the results of the different models. Note that all the private attributes can be easily detected in baseline, with results that are substantially higher than the majority class, although age and sex are less well captured than loc. The adv trained models all maintain the task performance of the baseline method, however they clearly have a substantial effect on the discrimination accuracy. The privacy of sex and loc is substantially improved, leading to discriminators with performance close to that of the majority class (i.e., the representation conveys little information). age proves harder, although our technique still leads to privacy improvements. Note that age appears to be related to the other private attributes, in that its privacy is improved when optimising an adversarial loss for the other attributes (sex and loc).

Overall, these results show that our approach learns hidden representations that hide much of the personal information of users, without affecting the sentiment task performance. This is a surprising finding, which augurs well for the use of deep learning as a privacy preserving mechanism when handling text corpora.

4 Conclusion

We proposed a novel method for removing model biases by explicitly protecting private author attributes as part of model training, which we formulate as deep learning with adversarial learning. We evaluate our method on POS tagging and sentiment classification, demonstrating that it results in increased privacy, while also maintaining, or even improving, task performance, through increased model robustness.


We thank Benjamin Rubinstein and the anonymous reviewers for their helpful feedback and suggestions, and the National Computational Infrastructure Australia for computation resources. We also thank Dirk Hovy for providing the Trustpilot dataset. This work was supported by the Australian Research Council (FT130101105).


  • Abadi et al. (2016) Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318.
  • Bies et al. (2012) Ann Bies, Justin Mott, Colin Warner, and Seth Kulick. 2012. English Web Treebank. Linguistic Data Consortium, Philadelphia, USA.
  • Blitzer et al. (2007) John Blitzer, Mark Dredze, and Fernando Pereira. 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 440–447.
  • Daumé III (2007) Hal Daumé III. 2007. Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 256–263.
  • Dwork (2008) Cynthia Dwork. 2008. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation, pages 1–19. Springer.
  • Ganin et al. (2016) Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research, 17:59:1–59:35.
  • Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, pages 2672–2680.
  • Hitaj et al. (2017) Briland Hitaj, Giuseppe Ateniese, and Fernando Pérez-Cruz. 2017. Deep models under the GAN: information leakage from collaborative deep learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 603–618. ACM.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9(8):1735–1780.
  • Hovy (2015) Dirk Hovy. 2015. Demographic factors improve classification performance. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 752–762.
  • Hovy et al. (2015) Dirk Hovy, Anders Johannsen, and Anders Søgaard. 2015. User review sites as a resource for large-scale sociolinguistic studies. In Proceedings of the 24th International Conference on World Wide Web, pages 452–461.
  • Hovy and Søgaard (2015) Dirk Hovy and Anders Søgaard. 2015. Tagging performance correlates with author age. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), volume 2, pages 483–488.
  • Jones et al. (2007) Rosie Jones, Ravi Kumar, Bo Pang, and Andrew Tomkins. 2007. I know what you did last summer: query logs and user privacy. In Proceedings of the Sixteenth ACM Conference on Conference on Information and Knowledge Management (CIKM 2007), pages 909–914.
  • Jørgensen et al. (2016) Anna Jørgensen, Dirk Hovy, and Anders Søgaard. 2016. Learning a POS tagger for AAVE-like language. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1115–1120.
  • Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1746–1751.
  • Lui and Baldwin (2012) Marco Lui and Timothy Baldwin. 2012. An off-the-shelf language identification tool. In Proceedings of ACL 2012 System Demonstrations, pages 25–30.
  • Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.
  • Pass et al. (2006) Greg Pass, Abdur Chowdhury, and Cayley Torgeson. 2006. A picture of search. In The First International Conference on Scalable Information Systems, volume 152, page 1.
  • Petrov et al. (2012) Slav Petrov, Dipanjan Das, and Ryan McDonald. 2012. A universal part-of-speech tagset. In Proceedings of the Eighth International Conference on Language Resources and Evaluation.
  • Potthast et al. (2017) Martin Potthast, Francisco Rangel, Michael Tschuggnall, Efstathios Stamatatos, Paolo Rosso, and Benno Stein. 2017. Overview of PAN’17: Author identification, author profiling, and author obfuscation. In 8th International Conference of the CLEF on Experimental IR Meets Multilinguality, Multimodality, and Visualization.
  • Rabinovich et al. (2017) Ella Rabinovich, Raj Nath Patel, Shachar Mirkin, Lucia Specia, and Shuly Wintner. 2017. Personalized machine translation: Preserving original author traits. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1074–1084.
  • Shokri and Shmatikov (2015) Reza Shokri and Vitaly Shmatikov. 2015. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, pages 1310–1321.
  • Zhao et al. (2017) Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2979–2989.