(Male, Bachelor) and (Female, Ph.D) have different connotations: Parallelly Annotated Stylistic Language Dataset with Multiple Personas

08/31/2019 ∙ by Dongyeop Kang, et al.

Stylistic variation in text needs to be studied with respect to different aspects, including the writer's personal traits, interpersonal relations, rhetoric, and more. Despite recent attempts at computational modeling of this variation, the lack of parallel corpora of stylistic language makes it difficult to systematically control stylistic change as well as to evaluate such models. We release PASTEL, the parallel and annotated stylistic language dataset, which contains 41K parallel sentences (8.3K parallel stories) annotated across different personas. Each persona has different styles in conjunction: gender, age, country, political view, education, ethnicity, and time-of-writing. The dataset is collected from human annotators with solid control of the input denotation: not only preserving the original meaning between texts, but also promoting stylistic diversity among annotators. We test the dataset on two interesting applications of stylistic language, where PASTEL helps design appropriate experiments and evaluation. First, in predicting a target style (e.g., male or female for gender) given a text, the multiple styles of PASTEL allow the other external style variables to be controlled (or fixed), which is a more accurate experimental design. Second, a simple supervised model trained on our parallel text outperforms unsupervised models using nonparallel text in style transfer. Our dataset is publicly available.




Code Repositories


Data and code for Kang et al., EMNLP 2019's paper titled "(Male, Bachelor) and (Female, Ph.D) have different connotations: Parallelly Annotated Stylistic Language Dataset with Multiple Personas"


1 Introduction

hovy1987generating claims that appropriately varying the style of text often conveys more information than is contained in the literal meaning of the words. He defines the roles of styles in text variation by pragmatic aspects (e.g., the relationship between the participants) and rhetorical goals (e.g., formality), and provides example texts of how they are tightly coupled in practice. Similarly, biber1991variation categorizes components of the conversational situation by participants’ characteristics such as their roles, personal characteristics, and group characteristics (e.g., social class). Despite the broad definition of style, this work mainly focuses on one specific aspect of style, the pragmatic aspect of speakers' group characteristics, which is also called persona. In particular, we look at multiple types of group characteristics in conjunction, such as gender, age, education level, and more.

Stylistic variation in text primarily manifests itself at different levels of textual features: lexical features (e.g., word choice), syntactic features (e.g., preference for the passive voice), and even pragmatics, while preserving the original meaning of the given text dimarco1990accounting. Connecting such textual features to someone’s persona is an important step toward understanding stylistic variation in language. For example, do highly educated people write longer sentences bloomfield1927literate? Are Hispanic and East Asian people more likely to drop pronouns white1985pro? Are older people likely to use fewer anaphora ulatowska1986disruption?

To computationally model meaning-preserving variation of text across styles, many recent works have developed systems that transfer styles reddy2016obfuscating; hu2017toward; prabhumoye2018style or profile authorship from text verhoeven2014clips; koppel2009computational; stamatatos2018overview without a parallel corpus of stylistic text. However, the absence of such a parallel dataset makes it difficult both to systematically learn the textual variation of multiple styles and to properly evaluate the models.

In this paper, we propose a large scale, human-annotated, parallel stylistic dataset called PASTEL, with focus on multiple types of personas in conjunction. Ideally, annotations for a parallel style dataset should preserve the original meaning (i.e., denotation) between reference text and stylistically transformed text, while promoting diversity for annotators to allow their own styles of persona (i.e., connotation). However, if annotators are asked to write their own text given a reference sentence, they may simply produce arbitrarily paraphrased output which does not exhibit a stylistic diversity. To find such a proper input setting for data collection, we conduct a denotation experiment in §3. PASTEL is then collected by crowd workers based on the most effective input setting that balances both meaning preservation and diversity metrics (§4).

PASTEL includes stylistic variation of text at two levels of parallelism: 8.3K annotated, parallel stories and 41K annotated, parallel sentences, where each story has five sentences and has annotators on average. Each sentence or story has the seven types of persona styles in conjunction: gender, age, ethnics, countries to live, education level, political view, and time of the day.

In §5, we introduce two interesting applications of stylistic language using PASTEL: controlled style classification and supervised style transfer. The former application predicts a category (e.g., male or female) of a target style (i.e., gender) given a text. The multiplicity of persona styles in PASTEL allows all style variables other than the target to be controlled (or fixed), which is a more accurate experimental design. In the latter, in contrast to unsupervised style transfer using a non-parallel corpus, simple supervised models trained on the parallel text in PASTEL achieve better performance when evaluated against the parallel, annotated text.

We hope PASTEL sheds light on the study of stylistic language variation in developing a solid model as well as evaluating the system properly.

2 Related Work

Transferring styles between texts has been studied with and without parallel corpora:

Style transfer without parallel corpus: Prior works transfer style between texts on a single type of style aspect such as sentiment fu2017style; shen2017style; hu2017toward, gender reddy2016obfuscating, political orientation prabhumoye2018style, and two conflicting corpora (e.g., paper and news han2017unsupervised, or real and synthetic reviews lipton2015generative). They use different types of generative models in the same way as style transfer in images, where meaning preservation is not controlled systematically. prabhumoye2018style proposes back-translation to get a style-agnostic sentence representation. However, these works lack parallel ground truth for evaluation and present limited evaluation of meaning preservation.

Style transfer with parallel corpus: A few recent works use parallel text for style transfer between modern and Shakespearean text jhamtani2017shakespearizing, sarcastic and literal tweets peled2017sarcasm, and formal and informal text heylighen1999formality; rao2018dear. Compared to these, we aim to understand and demonstrate style variation owing to multiple demographic attributes.

Besides style transfer, other applications using stylistic features have been studied, such as poetry generation ghazvininejad2017hafez, stylometry with demographic information verhoeven2014clips, modeling style bias vogel2012he, and modeling biographic attributes garera2009modeling. A series of works by wild; koppel2009computational; argamon2009automatically; koppel2014determining and their shared tasks stamatatos2018overview show huge progress on author profiling and attribute classification tasks. However, none of the prior works has collected a stylistic language dataset with multiple styles in conjunction, annotated in parallel by humans. The multiple styles in conjunction in PASTEL enable an appropriate experimental setting for the controlled style classification task in Section 5.1.

3 Denotation Experiment

Denotation                Produced sentence
single ref. sentence      the old door with wood was the only direction to the courtyard
story(imgs)               The old wooden door in the stonewall looks like a portal to a fairy tale.
story(imgs.+keywords)     Equally so, he is intrigued by the heavy wooden door in the courtyard.
Reference sentence        the old wooden door was only one way into the courtyard.
Figure 1: Textual variation across different denotation settings. Each sentence is produced by the same annotator. Note that providing a reference sentence increases fidelity to the reference while decreasing diversity.
Figure 2: Denotation experiment finds the best input setting for data collection, that preserves meaning but diversifies styles among annotators with different personas.

We first provide a preliminary study to find the best input setting (or denotation) for data collection to balance between two trade-off metrics: meaning preservation and style diversity.

3.1 Preliminary Study

Figure 1 shows output texts produced by annotators given different input denotation settings. The basic task is to provide an input denotation (e.g., a sentence only, a sequence of images) and then ask annotators to reproduce text that maintains the meaning of the input but reflects their own persona.

For instance, if we provide a single reference sentence, annotators mostly repeat the input text with small changes to the lexical terms. This setup mostly preserves the meaning by simply paraphrasing the sentence, but the variation does not reflect the annotators’ personal style. With a single image, on the other hand, the outputs produced by annotators tend to be diverse. However, the image can be explained with a variety of contents, so the output meaning can drift away from the reference sentence.

If a series of consistent images (i.e., a story) is given, we expect the range of possible outputs to be narrowed down by grounding it to a specific event or story. In addition, keywords added to each image of a story help convey the content more concretely while retaining style diversity.

Denotation setting                                 Style Diversity   Meaning Preservation
                                                   E(GM)             METEOR   VectorExtrema
Sentence-level
single ref. sentence                               2.98              0.37     0.70
story(images)                                      2.86              0.07     0.38
story(images) + global keywords                    2.85              0.07     0.39
story(images + local keywords)                     3.07              0.17     0.53
story(images + local keywords + ref. sentence)     2.91              0.21     0.43
Story-level
story(images)                                      4.43              0.1      0.4
story(images) + global keywords                    4.43              0.1      0.42
story(images + local keywords)                     4.58              0.19     0.55
story(images + local keywords + ref. sentence)     4.48              0.22     0.44
Table 1: Denotation experiment to find the best input setting (i.e., meaning preserved but stylistically diverse). Story-level measures the metrics over five sentences as a story, and sentence-level per individual sentence. Note that the single reference sentence setting only has the sentence level. For every metric in both meaning preservation and style diversity, higher is better.

3.2 Experimental Setup

In order to find the best input setting, one that preserves meaning while promoting stylistic diversity, we conduct a denotation experiment as described in Figure 2. The experiment uses a subset of our original dataset containing only 100 annotated samples.

A basic idea behind this setup is to provide (1) a perceptually common denotation via sentences or images so people share the same context (i.e., denotation) given, (2) a series of them as a “story” to limit them into a specific event context, and (3) two modalities (i.e., text and image) for better disambiguation of the context by grounding them to each other.

We test five different input settings: Single reference sentence, Story (images), Story (images) + global keywords, Story (images + local keywords), and Story (images + local keywords + ref. sentence). (Other settings, such as Single reference image, were tested as well, but they did not preserve the meaning well.)

For keyword selection, we use the RAKE algorithm rose2010automatic to extract keywords and rank them for each sentence by the output score. The top five uni/bigram keywords are chosen for each story; these are called global keywords. In addition, the top three uni/bigram keywords are chosen for each image/sentence in a story; these are called local keywords. Local keywords for each image/sentence help annotators not deviate too much. For example, local keywords look like (restaurant, hearing, friends) (pictures, menu, difficult) (salad, corn, chose) for three sentences/images, while global keywords look like (wait, salad, restaurant) for a story of the three sentences/images.
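The keyword selection above can be illustrated with a simplified RAKE-style scorer. This is a minimal sketch, not the authors' implementation: the tiny stop-word list, the function names, and the tie-breaking rule are all assumptions made for illustration, and a real setup would use a full stop-word list or an existing RAKE library.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption; RAKE normally uses a much larger one).
STOPWORDS = {
    "a", "an", "the", "and", "or", "of", "to", "in", "on", "at", "for", "with",
    "was", "were", "is", "are", "we", "our", "it", "so", "that", "had", "be",
    "but", "i", "my", "they", "this", "by",
}

def candidate_phrases(text):
    """Split text into candidate phrases at punctuation and stop words."""
    phrases = []
    for fragment in re.split(r"[^\w\s']+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                    current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases

def rake_scores(text, max_len=2):
    """Score uni/bigram candidates by the sum of word degree/frequency ratios."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)  # co-occurrence degree, counting the word itself
    scores = {}
    for phrase in phrases:
        if len(phrase) <= max_len:  # keep unigrams and bigrams only
            scores[" ".join(phrase)] = sum(degree[w] / freq[w] for w in phrase)
    return scores

def top_keywords(text, k, max_len=2):
    """Return the k best-scoring candidates (ties broken alphabetically)."""
    scores = rake_scores(text, max_len)
    return sorted(scores, key=lambda p: (-scores[p], p))[:k]

# Hypothetical usage: three local keywords per sentence, five global keywords per story.
story = ["We went to the restaurant with friends.", "The wait was long, so we chose salad."]
local_keywords = [top_keywords(sentence, 3) for sentence in story]
global_keywords = top_keywords(" ".join(story), 5)
```

Scoring the concatenated story for global keywords is one possible reading of the description; the source does not say how per-sentence scores are aggregated.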

We use the Visual Story Telling (ViST) visualstorytelling dataset as our input source. The dataset contains stories, and each story has five pairs of images and sentences. We filter out stories that are not temporally ordered, using the timestamps of the images. The final number of stories after this filtering is 28,130. For the denotation experiment, we use only 100 randomly chosen stories. The detailed pre-processing steps are described in the Appendix.
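The temporal-order filter can be sketched as follows. The record layout (a dict with a `timestamps` list of `datetime` values, one per image) is a hypothetical structure chosen for illustration and is not the ViST file format.

```python
from datetime import datetime

def is_temporally_ordered(timestamps):
    """True when each image timestamp is no earlier than the previous one."""
    return all(a <= b for a, b in zip(timestamps, timestamps[1:]))

def filter_stories(stories):
    """Keep only stories whose images appear in chronological order."""
    return [s for s in stories if is_temporally_ordered(s["timestamps"])]

# Hypothetical usage with two toy stories.
stories = [
    {"id": "s1", "timestamps": [datetime(2019, 1, 1, 10), datetime(2019, 1, 1, 11)]},
    {"id": "s2", "timestamps": [datetime(2019, 1, 1, 12), datetime(2019, 1, 1, 9)]},
]
kept = filter_stories(stories)
```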

3.3 Measuring Meaning Preservation & Style Diversity across Different Denotations

Figure 3: Final denotation setting for data collection: an event that consists of a series of five images with a handful of keywords. We ask annotators to produce text about the event for each image.

For each denotation setting, we conduct a quantitative experiment to measure the two metrics: meaning preservation and style diversity. The two metrics pose a trade-off to each other. The best input setting then is one that can capture both in appropriate amounts. For example, we want meaning of the input preserved, while lexical or syntactic features (e.g., POS tags) can vary depending on annotator’s persona. We use the following automatic measures for the two metrics:

Style Diversity measures how much the produced sentences (or stories) differ among themselves. The higher the diversity, the better the stylistic variation in language it contains. We use an entropy measure to capture the variance of n-gram features between annotated sentences: Entropy (Gaussian-Mixture), which combines the n-gram entropies using a Gaussian mixture model (N=3).
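As a rough illustration of the entropy component, the sketch below computes the Shannon entropy of the n-gram distribution over a set of annotations for n = 1 to 3. It deliberately stops there: how the per-n entropies are combined with a Gaussian mixture model is not specified in enough detail here to reproduce, so that step is omitted. Function names and the whitespace tokenization are assumptions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_entropy(sentences, n):
    """Shannon entropy (in bits) of the n-gram distribution pooled over sentences."""
    counts = Counter(g for s in sentences for g in ngrams(s.lower().split(), n))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def diversity_profile(sentences, max_n=3):
    """Per-n entropies for n = 1..max_n; the GMM combination step is not reproduced."""
    return [ngram_entropy(sentences, n) for n in range(1, max_n + 1)]

# Hypothetical usage: annotations of the same input by different annotators.
annotations = [
    "my friends and i went to an art museum",
    "i went to the museum with a bunch of friends",
]
profile = diversity_profile(annotations)
```

Annotations that merely repeat each other yield lower entropy than annotations with varied wording, which is the behavior the diversity metric is meant to reward.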

Meaning Preservation measures the semantic similarity of the produced sentence (or story) to the reference sentence (or story). The higher the similarity, the better the meaning is preserved. We use a hard measure, METEOR banerjee2005meteor, which calculates the F-score of word overlaps between the output and reference sentences. (Other measures, e.g., BLEU papineni2002bleu and ROUGE lin2003automatic, show relatively similar performance.) Hard measures do not take into account all semantic similarities: METEOR does consider synonymy and paraphrasing, but it is limited by its predefined model, dictionaries, and resources for the respective language, such as WordNet. We therefore also use a soft measure, VectorExtrema (VecExt) liu2016not, which computes the cosine similarity of averaged word embeddings (i.e., GloVe pennington2014glove) between the output and reference sentences.
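The soft measures can be sketched with two sentence-pooling functions. The text above describes a cosine similarity over averaged word embeddings; the metric commonly called vector extrema instead keeps, for each dimension, the value with the largest magnitude across the words of the sentence. Both poolings are shown so the difference is explicit. The embedding table is passed in as a plain dict, and the two-dimensional vectors in the usage lines are made-up values, not GloVe.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def average_vector(tokens, emb):
    """Embedding average: mean of the word vectors found in `emb`."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return None
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(len(vecs[0]))]

def extrema_vector(tokens, emb):
    """Vector extrema: per dimension, the value with the largest absolute magnitude."""
    vecs = [emb[t] for t in tokens if t in emb]
    if not vecs:
        return None
    return [max((v[i] for v in vecs), key=abs) for i in range(len(vecs[0]))]

def sentence_similarity(output, reference, emb, pool=average_vector):
    """Cosine similarity between pooled sentence vectors; 0.0 if nothing is embeddable."""
    u = pool(output.lower().split(), emb)
    v = pool(reference.lower().split(), emb)
    if u is None or v is None:
        return 0.0
    return cosine(u, v)

# Hypothetical usage with toy 2-d embeddings (real use would load pre-trained vectors).
toy_emb = {"old": [0.2, -0.7], "door": [0.9, 0.1], "wooden": [0.4, -0.5], "portal": [0.8, 0.3]}
sim_avg = sentence_similarity("old wooden door", "wooden portal", toy_emb, average_vector)
sim_ext = sentence_similarity("old wooden door", "wooden portal", toy_emb, extrema_vector)
```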

Table 1 shows the results for the two metrics across the input settings we define. At the sentence level, as expected, the single reference sentence setting has the highest meaning preservation across all the metrics, because it is essentially paraphrasing the reference sentence. In general, Story (images + local keywords) performs well, with the highest diversity at both levels as well as the highest preservation on the soft measure at the story level. Thus, we use Story (images + local keywords) as the input setting for our final data collection, since it has the most balanced performance on both metrics. Figure 3 shows an example of our input setting for crowd workers.

4 PASTEL: A Parallelly Annotated Stylistic Language Dataset

We describe how we collect the dataset with human annotations and provide some analysis on it.

4.1 Annotation Schemes

Our crowd workers are recruited from the Amazon Mechanical Turk (AMT) platform. Our annotation scheme consists of two steps: (1) ask for the annotator’s demographic information (e.g., gender, age) and (2) given an input denotation like Figure 3, ask them to produce text about the denotation in their own style of persona (i.e., connotation).

In the first step, we use seven different types of styles: six persona styles (gender, age, ethnicity, country, education level, and political orientation) and one additional context style, time-of-day (tod). For each type of persona, we provide several categories for annotators to choose from. For example, political orientation has three categories: Centrist, Left Wing, and Right Wing. Categories in other styles are described in the next sub-section.

In the second step, we ask annotators to produce text that describes the given input denotation. We again use the pre-processed ViST visualstorytelling data from §3 for our input denotations. We explicitly ask annotators to reflect their own persona in their stylistic writing, instead of pretending to have someone else’s persona. We attach the detailed annotation schemes in Figure 6 in the Appendix.

To amortize both costs and annotators’ effort in answering questions, each HIT requires the participants to annotate three stories after answering the demographic questions. Each annotator was paid $0.11 per HIT. For English proficiency, the annotators were restricted to those from the USA or UK. A total of 501 unique annotators participated in the study. The average number of HITs per annotator was 9.97.

         Number of Sentences   Number of Stories
Train    33,240                6,648
Valid    4,155                 831
Test     4,155                 831
Total    41,550                8,310
Table 2: Data statistics of PASTEL.
Reference Sentence: went to an art museum with a group of friends.
edu:HighSchoolOrNoDiploma My friends and I went to a art museum yesterday .
edu:Bachelor I went to the museum with a bunch of friends.

Reference Sentence: the living room of our new home is nice and bright with natural light.
edu:NoDegree, gender:Male The natural lightning made the apartment look quite nice for the upcoming tour .
edu:Graduate, gender:Female The house tour began in the living room which had a sufficient amount of natural lighting.

Reference Story: Went to an art museum with a group of friends . We were looking for some artwork to purchase, as sometimes artist allow the sales of their items . There were pictures of all sorts , but in front of them were sculptures or arrangements of some sort . Some were far out there or just far fetched . then there were others that were more down to earth and stylish. this set was by far my favorite.very beautiful to me .
edu:HighSchool, ethnic:Caucasian, gender:Female My friends and I went to a art museum yesterday . There were lots of puchases and sales of items going on all day . I loved the way the glass sort of brightened the art so much that I got all sorts of excited . After a few we fetched some grub . My favorite set was all the art that was made out of stylish trash .
edu:Bachelor, ethnic:Caucasian, gender:Female I went to the museum with a bunch of friends . There was some cool art for sale . We spent a lot of time looking at the sculptures . This was one of my favorite pieces that I saw . We looked at some very stylish pieces of artwork .
Table 3: Two sentence-level (top, middle) and one story-level (bottom) annotations in PASTEL. Each text produced by an annotator has their own persona values (underline) for different types of styles (italic). Note that the reference sentence (or story) is given for comparison with the annotated text. Note that misspellings of the text are made by annotators.

Once we complete our annotations, we filter out noisy responses such as stories with missing images and overly short sentences (i.e., the minimum sentence length is 5). The dataset is then randomly split into train, valid, and test sets with ratios of 0.8, 0.1, and 0.1, respectively. Table 2 shows the final number of stories and sentences in our dataset.
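A random 0.8/0.1/0.1 split of this kind can be sketched as below. Splitting at the story level is an assumption, suggested by the fact that each split in Table 2 has exactly five times as many sentences as stories; the seed and function name are also illustrative.

```python
import random

def split_stories(story_ids, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle story ids and cut them into train/valid/test by the given ratios."""
    ids = list(story_ids)
    random.Random(seed).shuffle(ids)  # fixed seed so the split is reproducible
    n_train = round(len(ids) * ratios[0])
    n_valid = round(len(ids) * ratios[1])
    train = ids[:n_train]
    valid = ids[n_train:n_train + n_valid]
    test = ids[n_train + n_valid:]
    return train, valid, test

# With 8,310 story ids, the split sizes match the story counts in Table 2.
train, valid, test = split_stories(range(8310))
```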

4.2 Analysis and Examples

Figure 4: Distribution of annotators for each personal style in PASTEL. Best viewed in color.

Figure 4 shows demographic distributions of the annotators. Education-level of annotators is well-balanced, while gender and political view are somewhat biased (e.g., 68% of annotators are Female, only 18.6% represent themselves as right-wing). Table 7 in Appendix includes the categories in other styles and their distributions.

Table 3 shows a few examples randomly chosen from our dataset: two at the sentence level (top, middle) and one at the story level (bottom). Due to limited space, we only show a few types of persona styles. For example, we observe that education level (e.g., NoDegree vs. Graduate) is reflected in a certain degree of formality in the writing at both the sentence and story levels. In §5.1, we conduct an in-depth analysis of textual variation with respect to the persona styles in PASTEL.

5 Applications with Pastel

PASTEL can be used in many style related applications including style classification, stylometry verhoeven2014clips, style transfer fu2017style, visually-grounded style transfer, and more. Particularly, we chose two applications, where PASTEL helps design appropriate experiment and evaluation: controlled style classification (§5.1) and supervised style transfer (§5.2).

5.1 Controlled Style Classification

A common mistake in style classification datasets is not controlling external style variables when predicting the category of the target style. For example, when predicting the gender of a writer given a text, the training data is labeled only by the target style, gender. However, the text is actually produced by a person who has not only a gender but also other persona styles, such as age=55-74 or education=HighSchool. Without controlling the other external styles, the classifier is easily biased by the training data.

We define a task called controlled style classification, where all style variables are fixed except the one to classify. (The distribution of the number of training instances per variable is given in the Appendix.) Here we evaluate (1) which style variables are relatively difficult or easy to predict from the given text, and (2) what types of textual features are salient for each type of style classification.


Stylistic language has a variety of features at different levels, such as lexical choices, syntactic structure, and more. Thus, we use the following features:

  • lexical features: ngram’s frequency (n=3), number of named entities, number of stop-words

  • syntax features: sentence length, number of each Part-of-Speech (POS) tag, number of out-of-vocabulary words, number of named entities

  • deep features: pre-trained sentence encoder using BERT devlin2018bert

  • semantic feature: sentiment score

where named entities, POS tags, and sentiment scores are obtained using off-the-shelf tools such as the spaCy library (https://spacy.io/). We use n-gram lexical features, dimensional embeddings, and hand-written features.


We train a binary classifier for each personal style with different models: logistic regression, SVM with linear/RBF kernels, Random Forest, Nearest Neighbors, Multi-layer Perceptron, AdaBoost, and Naive Bayes. For each style, we choose the best classifier on the validation set. Their F-scores are reported in Figure 5. We use sklearn’s implementation of all models pedregosa2011scikit (http://scikit-learn.org/stable/). We consider various regularization parameters for SVM and logistic regression (e.g., c=[0.01, 0.1, 0.25, 0.5, 0.75, 1.0]).

We use neural network based baseline models: deep averaging networks (DAN, Iyyer:2015) of GloVe word embeddings pennington2014glove. (Other architectures, such as convolutional neural networks (CNN, Zhang:2015) and recurrent neural networks (LSTM, Hochreiter:1997), show performance comparable to DAN.) We also compare with the non-controlled model (Combined), which uses a combined set of samples across all other variables except for the one to classify, using the same features we used.


We tune hyperparameters using 5-fold cross validation. If a style has more than two categories, we choose the most conflicting two: gender: {Male, Female}, age: {18-24, 35-44}, education: {Bachelor, No Degree}, and politics: {LeftWing, RightWing}. To classify one style, a separate model is trained for each possible combination of the other styles. We report the macro-averaged F-score over the separately trained models, evaluated on the same test set for every model.
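The controlled setup can be sketched as a grouping step followed by macro-averaging. In this sketch, instances are bucketed by the values of every style other than the target, so that one classifier can be trained and scored per bucket. The instance layout, the style names in `STYLES`, and the helper names are assumptions for illustration; with the four binary styles listed above, fixing the three non-target styles would give at most eight buckets. The classifier itself is left out.

```python
from collections import defaultdict

# Assumed style names for illustration; each is restricted to two categories.
STYLES = ("gender", "age", "education", "politics")

def group_by_controlled_styles(instances, target, styles=STYLES):
    """Bucket (text, label) pairs by the values of every style except `target`."""
    others = [s for s in styles if s != target]
    groups = defaultdict(list)
    for inst in instances:
        key = tuple(inst["persona"][s] for s in others)
        groups[key].append((inst["text"], inst["persona"][target]))
    return dict(groups)

def binary_f1(gold, pred, positive):
    """F1 of the `positive` class for parallel gold/predicted label lists."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_average(scores):
    """Unweighted mean of per-group scores."""
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical usage with toy instances.
data = [
    {"text": "we had fun", "persona": {"gender": "Female", "age": "18-24", "education": "Bachelor", "politics": "LeftWing"}},
    {"text": "went to a party", "persona": {"gender": "Male", "age": "18-24", "education": "Bachelor", "politics": "LeftWing"}},
    {"text": "the town was quiet", "persona": {"gender": "Male", "age": "35-44", "education": "Bachelor", "politics": "LeftWing"}},
]
groups = group_by_controlled_styles(data, target="gender")
```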


Figure 5: Controlled style classification: F-scores on (a) different types of styles on sentences and on (b) our best models between sentences and stories. Best viewed in color.

Figure 5 shows F-scores (a) among different styles and (b) between sentences and stories. In most cases, the multilayer perceptron (MLP) outperforms the majority classifier and other models by large margins. Compared to the neural baselines and the combined classifier, our models show better performance. In the comparison between controlled and combined settings, the controlled setting achieves higher improvements, indicating that fixing external variables helps control irrelevant features that come from other variables. Among different styles, gender is easier to predict from the text than age or education level. Interestingly, a longer context (i.e., a story) is helpful in predicting age or education, but not political view or gender.

In our ablation test among the feature types, the combination of different features (e.g., lexical, syntax, deep, semantic) is very complementary and effective. Lexical and deep features are two most significant features across all style classifiers, while syntactic features are not.

Gender:Male           PROPN, ADJ, #_ENTITY, went, party, SENT_LEN
Gender:Female         happy, day, end, group, just, snow, NOUN
Politics:LeftWing     female, time, NOUN, ADP, VERB, porch, day, loved
Politics:RightWing    SENT_LENGTH, PROPN, #_ENTITY, n’t, ADJ, NUM
Education:Bachelor    food, went, #_STOPWORDS, race, ADP
Education:NoDegree    !, just, came, love, lots, male, fun, n’t, friends, happy
Age:18-24             ADP, come, PROPN, day, ride, playing, sunset
Age:35-44             ADV, did, town, went, NOUN, #_STOPWORDS
Table 4: Most salient lexical (lower cased) and syntactic (upper cased) features on story-level classification. Each feature is chosen by the highest coefficients in the logistic regression classifier.

Table 4 shows the most salient features for classification of each style. Since we can’t interpret deep features, we only show lexical and syntactic features. The salience of features is ranked by the coefficients of a logistic regression classifier. Interestingly, female annotators tend to write more nouns and words like ‘happy’, while male annotators tend to use proper nouns, adjectives, and named entities. Left-wing annotators prefer to use ‘female’, nouns, and adpositions, while right-wing annotators prefer shorter sentences and negations like ‘n’t’. Few syntactic features are observed for annotators without degrees compared to those with a bachelor’s degree.

5.2 Supervised Style Transfer

Style transfer is defined as a mapping from a (source sentence, target style) pair to a target sentence: we attempt to alter a given source sentence to a given target style. The model generates a candidate target sentence that preserves the meaning of the source sentence but is more faithful to the target style, and is therefore similar to the annotated target sentence. We evaluate the model by comparing the predicted sentence with the annotated target sentence. The sources are the original reference sentences, while the targets are from our annotations.


We compare five different models:

  • AsItIs: copies over the source sentence to the target, without any alterations.

  • WordDistRetrieve: retrieves a training source-target pair that has the same target style as the test pair and is closest to the test source in terms of word edit distance navarro2001guided. It then returns the target of that pair.

  • EmbDistRetrieve: Similar to WordDistRetrieve, except that a continuous bag-of-words (CBOW) is used to retrieve closest source sentence instead of edit distance.

  • Unsupervised: uses unsupervised style transfer models, one using a Variational Autoencoder shen2017style and one using additional objectives such as cross-domain and adversarial losses lample2017unsupervised. (We can’t directly compare with hu2017toward; prabhumoye2018style, since their performance highly depends on a pre-trained classifier that often shows poor performance.) Since unsupervised models can’t train multiple styles at the same time, we train separate models for each style and macro-average their scores at the end. In order not to use the parallel text in PASTEL, we shuffle the training text of each style.

  • Supervised: uses a simple attentional sequence-to-sequence (S2S) model bahdanau2014neural extracting the parallel text from PASTEL. The model jointly trains different styles in conjunction by concatenating them to the source sentence at the beginning.

We avoid more complex architectural choices for Supervised models like adding a pointer component or an adversarial loss, since we seek to establish a minimum level of performance on this dataset.
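The WordDistRetrieve baseline from the list above can be sketched as a nearest-neighbor lookup under word-level edit distance. This is a minimal sketch: it assumes plain Levenshtein distance over whitespace tokens (the exact variant used is not stated), and the pair layout with `source`, `target`, and `style` keys is hypothetical.

```python
def word_edit_distance(a, b):
    """Levenshtein distance between two token lists (insert/delete/substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, 1):
        cur = [i]
        for j, word_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                       # delete word_a
                cur[j - 1] + 1,                    # insert word_b
                prev[j - 1] + (word_a != word_b),  # substitute or match
            ))
        prev = cur
    return prev[-1]

def word_dist_retrieve(test_source, target_style, train_pairs):
    """Return the target of the closest training source that shares the target style."""
    candidates = [p for p in train_pairs if p["style"] == target_style]
    if not candidates:
        return None
    src = test_source.lower().split()
    best = min(candidates, key=lambda p: word_edit_distance(src, p["source"].lower().split()))
    return best["target"]

# Hypothetical usage with toy training pairs.
train_pairs = [
    {"source": "the flowers were beautiful", "target": "the flowers were in full bloom",
     "style": ("Morning", "HighSchool")},
    {"source": "we went to the museum", "target": "i went to the museum with friends",
     "style": ("Morning", "HighSchool")},
    {"source": "the flowers were beautiful", "target": "tulips are magnificent flowers",
     "style": ("Afternoon", "NoDegree")},
]
retrieved = word_dist_retrieve("the flowers were very beautiful", ("Morning", "HighSchool"), train_pairs)
```

EmbDistRetrieve would follow the same shape, with the edit distance replaced by a distance between bag-of-words embedding vectors.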


We experiment with both Softmax and Sigmoid non-linearities to normalize attention scores in the sequence-to-sequence attention. Adam kingma2014adam is used as the optimizer. Word-level cross entropy of the target is used as the loss. The batch size is set to

. We pick the model with lowest validation loss after 15 training epochs. All models are implemented in PyTorch.
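For the Supervised model described earlier, the styles are concatenated to the beginning of the source sentence. One way this input formatting could look is sketched below; the `<name:value>` token format and the alphabetical ordering of style tokens are assumptions, not the authors' documented format.

```python
def format_source(source, styles):
    """Prepend one token per style to the source sentence (token format is assumed)."""
    prefix = " ".join(f"<{name}:{value}>" for name, value in sorted(styles.items()))
    return f"{prefix} {source}".strip()

# Hypothetical usage: the sequence-to-sequence encoder would read this string.
encoder_input = format_source(
    "i'd never seen so many beautiful flowers .",
    {"tod": "Morning", "education": "HighSchool"},
)
```

Treating each style value as a single vocabulary token lets one model condition on any combination of styles without architectural changes.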


For evaluation, in addition to the hard and soft metrics used for measuring meaning preservation in §3, we also use BLEU papineni2002bleu for unigrams and bigrams and ROUGE lin2003automatic as hard metrics, and Embedding Averaging (EA) similarity liu2016not as a soft metric.
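A BLEU score restricted to unigrams and bigrams can be sketched as follows. This is a sentence-level, single-reference, unsmoothed illustration with whitespace tokenization; the reported scores are presumably computed over the whole test set with a standard toolkit, so the numbers from this sketch are not comparable to Table 5.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Counter of contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(cand, ref, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = ngram_counts(cand, n)
    ref_counts = ngram_counts(ref, n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total

def bleu2(candidate, reference):
    """BLEU with uniform weights over unigrams and bigrams, with brevity penalty."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    precisions = [modified_precision(cand, ref, n) for n in (1, 2)]
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any zero precision gives a zero score
    brevity = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(sum(math.log(p) for p in precisions) / 2)

# Hypothetical usage: compare a predicted sentence with an annotated target.
score = bleu2("the flowers were very beautiful", "the flowers were beautiful")
```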

Model                     Hard                      Soft
                          B       M       R         EA      VE
AsItIs                    35.41   12.38   21.08     0.649   0.393
WordDistRetrieve          30.64   7.27    22.52     0.771   0.433
EmbDistRetrieve           33.00   8.29    24.11     0.792   0.461
shen2017style             23.78   7.23    21.22     0.795   0.353
lample2017unsupervised    24.52   6.27    19.79     0.702   0.369
S2S                       26.78   7.36    25.57     0.773   0.455
S2S+GloVe                 31.80   10.18   29.18     0.797   0.524
S2S+GloVe+PreTr.          31.21   10.29   29.52     0.804   0.529
Table 5: Supervised style transfer. GloVe initializes with pre-trained word embeddings. PreTr. denotes pre-training on YAFC. Hard measures are BLEU (B), METEOR (M), and ROUGE (R), and soft measures are EmbeddingAveraging (EA) and VectorExtrema (VE).
Source: I’d never seen so many beautiful flowers.
  Style: (Morning, HighSchool)
    Predicted: the beautiful flowers were beautiful.
    Annotated: the flowers were in full bloom.
  Style: (Afternoon, NoDegree)
    Predicted: The flowers were very beautiful.
    Annotated: Tulips are one of the magnificent varieties of flowers.

Source: she changed dresses for the reception and shared food with her new husband.
  Style: (Master, Centrist)
    Predicted: The woman had a great time with her husband
    Annotated: Her husband shared a cake with her during reception
  Style: (Vocational, Right)
    Predicted: The food is ready for the reception
    Annotated: The new husband shared the cake at the reception
Table 6: Examples of style-transferred text by our supervised model (S2S+GloVe+PreTr.) on PASTEL. Given a source text and a style, the model predicts a target sentence (Predicted), shown alongside the annotated target sentence (Annotated).


Table 5 shows our results on style transfer. We observe that initializing both the encoder’s and decoder’s word embeddings with GloVe pennington2014glove improves model performance on most metrics. Pretraining (PreTr.) on the formality style transfer data YAFC rao2018dear further helps on all measures except BLEU. The S2S models with GloVe initialization outperform both retrieval-based baselines on all measures except BLEU, which suggests that the scores achieved are not simply a result of memorizing the training set. S2S methods surpass AsItIs on both soft measures and ROUGE. The significant gap on BLEU remains a point of exploration for future work. The significant improvement over the unsupervised methods shen2017style; lample2017unsupervised indicates the usefulness of the parallel text in PASTEL.

Table 6 shows output text produced by our model given a source text and a target style. We observe that the output text changes according to the set of styles.

6 Conclusion and Future Directions

We present PASTEL, a parallelly annotated stylistic language dataset. Our dataset is collected by human annotation using a proper denotation setting that preserves the meaning as well as maximizes the diversity of styles. The multiplicity of persona styles in PASTEL keeps all style variables other than the target style controlled (or fixed) for classification, which is a more accurate experimental design. Our simple supervised models trained on the parallel text in PASTEL outperform the unsupervised style transfer models that use non-parallel corpora. We hope PASTEL can be a useful benchmark to both train and evaluate models for style transfer and other related problems in the text generation field.

We summarize some directions for future style research:

  • In our ablation study, salient features for style classification are not only syntactic or lexical features but also content words (e.g., love, food). This is a counterexample to the hypothesis implicit in much of recent style research: style needs to be separately modeled from content. We also observe that some texts remain similar across different annotator personas or across outputs from our transfer models, indicating that some content is stylistically invariant. Studying these and other aspects of the content-style relationship in PASTEL could be an interesting direction.

  • Does any external variable co-varying with the text qualify as a style variable/facet? What are the categories of style variables/facets? Do architectures that transfer well across one style variable (e.g., gender) generalize to other style variables (e.g., age)? We believe that these questions are largely overlooked by current style transfer work. We hope that our consideration of some of these questions, though admittedly rudimentary, will lead to them being addressed extensively in future style transfer work.


This work would not have been possible without the VIST dataset and helpful suggestions from Ting-Hao Huang. We also thank Alan W Black, Dan Jurafsky, Wei Xu, Taehee Jung, and anonymous reviewers for their helpful comments.


Appendix A Data Collection: Details

Here we describe additional details of our data collection. We faced the following technical difficulties during annotation:

  • Since it is nontrivial to verify whether an annotator has sufficiently understood our instructions, we filter out very short sentence annotations.

  • Some images from VIST were too large to load or were unavailable. Most annotators mark such images with "No Image", "Image not Loaded", etc. We filter out such annotations using hand-coded filtering rules.

  • Some annotators exploit the large number of HITs available and the difficulty of verification to game the study and perform a large number of HITs carelessly. We manually block such annotators after a minimum threshold.

  • We increase pay from to based on annotator feedback.

Finally, we filter out 4.6% of noisy annotations.
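The filtering steps above can be sketched as follows. This is only an illustration under stated assumptions: the minimum token count, the regular expression for "image did not load" responses, the record fields (`text`, `annotator`), and the function names are all hypothetical, since the exact hand-coded rules are not listed here.

```python
import re

# Hypothetical cutoff; the actual length threshold is not stated in the text.
MIN_TOKENS = 3

# Hypothetical pattern for responses reporting a missing or unloaded image.
NO_IMAGE_PATTERN = re.compile(r"\b(no image|image not loaded)\b", re.IGNORECASE)


def is_noisy(text, annotator, blocked_annotators):
    """Return True if an annotation should be dropped."""
    if annotator in blocked_annotators:
        return True  # annotator was manually blocked for careless work
    if len(text.split()) < MIN_TOKENS:
        return True  # very short annotation
    if NO_IMAGE_PATTERN.search(text):
        return True  # annotator reported that the image was missing
    return False


def filter_annotations(annotations, blocked_annotators=frozenset()):
    """Return the kept annotations and the fraction that was removed."""
    kept = [
        a for a in annotations
        if not is_noisy(a["text"], a["annotator"], blocked_annotators)
    ]
    removed = 1 - len(kept) / len(annotations) if annotations else 0.0
    return kept, removed
```

Returning the removed fraction alongside the kept records makes it easy to report a figure comparable to the filtering rate quoted above.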

Figure 6: A part of our annotation schemes for asking annotators to generate sentences given a sequence of stories, keywords, and reference sentences.

Figure 6 shows a part of our annotation schemes for asking annotators to generate sentences given a sequence of stories, keywords, and reference sentences.

Appendix B PASTEL: Details

Style Categories with number of stories annotated
gender Male (1320), Female (2908), Others (45)
age 12 (4), 12-17 (1), 18-24 (1050), 25-34 (1527), 35-44 (720), 45-54 (475), 55-74 (493), 75 (3)
ethnic Hispanic/Latino (81), MiddleEastern (4), SouthAsian (27), NativeAmerican (115), PacificIslander (26), CentralAsian (3), African (270), EastAsian (112), Caucasian (3521), Other (114)
country UK (98), USA (4172), Others (3)
edu Associate (423), TechTraining (488), NoSchool (3), HighSchoolOrNoDiploma (47), Bachelor (1321), Master (440), NoDegree (939), HighSchool (557), Doctorate (55)
politics Centrist (1786), LeftWing (1691), RightWing (796)
time Morning (1103), Evening (392), Midnight (282), Afternoon (1666), Night (830)
Table 7: Distribution of stories.

Table 7 shows the distribution of stories for each style category.

Sent Story Quora
Avg. Preservation 2.69 3.89 3.72
Avg. Fluency 4.02 3.46 4.07
Highly Preserved % 45.2 77.4 83.5
Highly Preserved&Fluent % 44.6 76.3 80.1
Table 8: Quality statistics of PASTEL.

We also conduct a meaning-preservation human study to evaluate the quality of annotations. We recruit annotators different from those in the data collection and ask them to rate two questions on a 5-point scale: (1) Does an annotated sentence from our dataset preserve the meaning of the original reference sentence? and (2) How fluent is the annotated sentence? We sample three annotations each for sentences and stories from our test dataset. For scale calibration, we also ask annotators to evaluate Quora pairs iyer2017first as a control. Table 8 summarizes the results. In detail, the story-level annotations preserve more meaning, while the sentence-level ones are more fluent.
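As a rough illustration of how the four rows of Table 8 could be derived from raw ratings, the sketch below aggregates (preservation, fluency) pairs on the 5-point scale. The cutoff for "highly" (a rating of 4 or above) is an assumption made for this example; the text does not state the threshold, and the function and key names are hypothetical.

```python
# Assumed cutoff for "highly preserved" / "fluent"; not specified in the text.
HIGH_THRESHOLD = 4


def quality_summary(ratings, threshold=HIGH_THRESHOLD):
    """Aggregate (preservation, fluency) rating pairs on a 1-5 scale."""
    n = len(ratings)
    if n == 0:
        raise ValueError("need at least one rating")
    preserved = sum(1 for p, _ in ratings if p >= threshold)
    both = sum(1 for p, f in ratings if p >= threshold and f >= threshold)
    return {
        "avg_preservation": sum(p for p, _ in ratings) / n,
        "avg_fluency": sum(f for _, f in ratings) / n,
        "highly_preserved_pct": 100.0 * preserved / n,
        "highly_preserved_and_fluent_pct": 100.0 * both / n,
    }
```

Running the same function separately on the sentence-level, story-level, and Quora ratings would yield one column of the table each.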

Appendix C Controlled Style Classification: Details

target: politics LeftWing RightWing
 (Male, 18-24, NoDegree) 59 206
 (Female, 18-24, Bachelor) 147 42
 (Female, 18-24, NoDegree) 95 15
 (Male, 35-44, Bachelor) 21 4
 (Male, 18-24, Bachelor) 18 11
 (Female, 35-44, Bachelor) 63 13
 (Female, 35-44, NoDegree) 24 7
 (Male, 35-44, NoDegree) - 12
target: education NoDegree Bachelor
 (Male, 35-44, RightWing) 12 4
 (Female, 18-24, LeftWing) 95 147
 (Female, 35-44, RightWing) 7 13
 (Male, 35-44, LeftWing) - 21
 (Male, 18-24, LeftWing) 59 18
 (Female, 18-24, RightWing) 15 42
 (Female, 35-44, LeftWing) 24 63
 (Male, 18-24, RightWing) 206 11
target: age 18-24 35-44
 (Female, NoDegree, LeftWing) 95 24
 (Male, Bachelor, RightWing) 11 4
 (Female, Bachelor, LeftWing) 147 63
 (Male, NoDegree, LeftWing) 59 -
 (Female, NoDegree, RightWing) 15 7
 (Male, NoDegree, RightWing) 206 12
 (Male, Bachelor, LeftWing) 18 21
 (Female, Bachelor, RightWing) 42 13
target: gender Male Female
 (18-24, Bachelor, LeftWing) 18 147
 (35-44, NoDegree, RightWing) 12 7
 (18-24, NoDegree, LeftWing) 59 95
 (35-44, Bachelor, LeftWing) 21 63
 (35-44, NoDegree, LeftWing) - 24
 (35-44, Bachelor, RightWing) 4 13
 (18-24, Bachelor, RightWing) 11 42
 (18-24, NoDegree, RightWing) 206 15
Table 9: Label distribution for each combination of external variables in Controlled Style classification.

For the controlled setting, Table 9 shows the label distribution for each combination of external variables. We calculate the final F-score of each style by macro-averaging over the combinations. If a combination has zero instances for one of the class values, we ignore it in the calculation.
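The averaging procedure can be sketched as below, using the gender counts from Table 9 to show which combinations survive. Two points are assumptions of this sketch rather than details given in the text: a "-" entry in Table 9 is read as zero instances, and the score within each combination is taken to be the mean of the per-class F1 values. All function names are hypothetical.

```python
# (age, education, politics) -> (Male count, Female count), from Table 9.
# A "-" entry in the table is read here as zero instances (an assumption).
GENDER_COUNTS = {
    ("18-24", "Bachelor", "LeftWing"): (18, 147),
    ("35-44", "NoDegree", "RightWing"): (12, 7),
    ("18-24", "NoDegree", "LeftWing"): (59, 95),
    ("35-44", "Bachelor", "LeftWing"): (21, 63),
    ("35-44", "NoDegree", "LeftWing"): (0, 24),
    ("35-44", "Bachelor", "RightWing"): (4, 13),
    ("18-24", "Bachelor", "RightWing"): (11, 42),
    ("18-24", "NoDegree", "RightWing"): (206, 15),
}


def usable_combinations(counts):
    """Keep only combinations with at least one instance of every class."""
    return [combo for combo, ns in counts.items() if all(n > 0 for n in ns)]


def binary_f1(gold, pred, positive):
    """F1 for one class, treating `positive` as the positive label."""
    tp = sum(1 for g, p in zip(gold, pred) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, pred) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, pred) if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def controlled_macro_f1(groups, labels):
    """groups maps a combination to (gold_labels, predicted_labels) lists."""
    scores = []
    for gold, pred in groups.values():
        if any(gold.count(label) == 0 for label in labels):
            continue  # combination lacks one class value, so it is ignored
        per_class = [binary_f1(gold, pred, label) for label in labels]
        scores.append(sum(per_class) / len(per_class))
    if not scores:
        raise ValueError("no combination contains every class value")
    return sum(scores) / len(scores)
```

Under this reading, seven of the eight gender combinations contribute to the final score; (35-44, NoDegree, LeftWing) is dropped because it has no Male instances.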

Table 10 and Table 11 show randomly chosen examples from the validation split of our dataset. They show the reference sentence and some sentences written by our annotators, along with their associated demographic traits. Table 12 has more model outputs in style transfer.

No Style Text
1 Reference went to an art museum with a group of friends .
HighSchool My friends and I went to a art museum yesterday .
Bachelor I went to the museum with a bunch of friends .
2 Reference dad always said if you misbehave, you will end up in the local jail
British, HighSchool, Female misbehaving people sent to local jail .
American, Graduate, Female They misbehave at the local jail .
3 Reference This one was really cool as it rolled down the hill with the people in it.
American, Graduate, Female Rolling together is a messy, but bonding experience.
American, Graduate, Male He rolled with other people down the hill.
4 Reference in the woods, there was a broken tree .
Female A broken tree stands in the center of a path in the woods .
Male Doing my daily walk in the fores and I decided to go a different path and look what I came across , a tree in the middle of a path , interesting .
5 Reference the vehicles were lined up to marvel at .
Caucasian, Doctoral I went to the car show.
NativeAmerican, Masters The car is the super.
6 Reference we went to a concert last night .
VocationalTraining The concert we went to last night was great .
AssociateDegree I have been looking forward to this concert all year .
Table 10: Sentence examples from the validation split of our dataset showing the reference sentence and some sentences written by our annotators, along with their associated demographic traits.
No Style Text
1 Reference Went to an art museum with a group of friends . We were looking for some artwork to purchase, as sometimes artist allow the sales of their items . There were pictures of all sorts , but in front of them were sculptures or arrangements of some sort . Some were far out there or just far fetched . then there were others that were more down to earth and stylish. this set was by far my favorite.very beautiful to me .
Caucasian, Female, HighSchool My friends and I went to a art museum yesterday . There were lots of puchases and sales of items going on all day . I loved the way the glass sort of brightened the art so much that I got all sorts of excited . After a few we fetched some grub . My favorite set was all the art that was made out of stylish trash .
Caucasian, Female, Bachelor I went to the museum with a bunch of friends . There was some cool art for sale . We spent a lot of time looking at the sculptures . This was one of my favorite pieces that I saw . We looked at some very stylish pieces of artwork .
2 Reference We went on a hiking tour as a family . We took a break so that we could hydrate and get something to eat . We were so glad to see the finish line of the hike . We were heading to the lodging for our overnight stay . The scenary was so beautiful on the tour .
Caucasian, Female, Graduate, Night This weekend , we took a hiking tour as a family . We had to remember to take a break to eat and hydrate . We hiked in a line , and I ’m glad we did . We decided to stay overnight in the provided lodging . The scenery of the tour was beautiful .
Caucasian, Female, Graduate, Evening Our family took a hiking tour along the blue ridge parkway . We took plenty of water and snacks so we could hydrate and eat on breaks . We were glad to hike in a line up the hill . We chose to stay overnight in some local lodging . We loved the tour and its beautiful scenery .
Table 11: Story examples from the validation split of our dataset showing the reference story and some stories written by our annotators, along with their associated demographic traits.
No Type Text
1 Reference the guy had buckets of food
Bachelor, RightWing Before they headed out for Spring Break, Joe went out and brought back food for the gang.
HighSchool, LeftWing In college , Mary knew a guy named Greg who always ate his food from buckets .
M: Bachelor, RightWing The food was filled with food.
M: HighSchool, LeftWing The food was filled with a lot of food .
2 Reference the bride and groom walked down the aisle together .
Midnight, Centrist The bride and groom walked out after the ceremony .
Night, RightWing The bride and groom happily walked down the aisle after being pronounced “Man and Wife ” .
M: Midnight, Centrist The bride and groom were ready to be married .
M: Night, RightWing The bride and groom were ready to be married .
3 Reference we picked up our breakfast from a street vendor selling fresh fruits .
NoDegree Along the way , you could see vendors on the street .
Bachelor We loved the market stands .
M: NoDegree The family went to a restaurant .
M: Bachelor The family went to a restaurant for some delicious food .
4 Reference I’d never seen so many beautiful flowers.
Morning, HighSchool the flowers were in full bloom.
Afternoon, NoDegree Tulips are one of the magnificent varieties of flowers.
M: Morning, HighSchool the beautiful flowers were beautiful.
M: Afternoon, NoDegree The flowers were very beautiful.
Table 12: Examples from PASTEL, along with model outputs of the best performing style transfer model. We show the Reference sentence (Reference) and some sentences written by annotators. We only show the style attributes that differ between the annotators. M: denotes the model output given the corresponding Reference as source sentence along with the same target style.