xSLUE: A Benchmark and Analysis Platform for Cross-Style Language Understanding and Evaluation

11/09/2019 ∙ by Dongyeop Kang, et al.

Every natural text is written in some style. Style is formed by a complex combination of different stylistic factors, including formality markers, emotions, metaphors, etc. Some factors implicitly reflect the author's personality, while others are explicitly controlled by the author's choices in order to achieve a personal or social goal. One cannot form a complete understanding of a text and its author without considering these factors. The factors combine and co-vary in complex ways to form styles. Studying the nature of these co-varying combinations sheds light on stylistic language in general, sometimes called cross-style language understanding. This paper provides a benchmark corpus (xSLUE), with an online platform (http://xslue.com), for cross-style language understanding and evaluation. The benchmark contains text in 15 different styles and 23 classification tasks. For each task, we provide a fine-tuned classifier for further analysis. Our analysis shows that some styles are highly dependent on each other (e.g., impoliteness and offense), and that some domains (e.g., tweets, political debates) are stylistically more diverse than others (e.g., academic manuscripts). We discuss the technical challenges of cross-style understanding and potential directions for future research: cross-style modeling that shares internal representations across low-resource or low-performance styles, and other applications such as cross-style generation.




1 Introduction

People often use style in text as a strategic choice for social interaction hovy2018personal. For example, one may use more polite language with elders than with friends in order to show respect. Text is used strategically in this way mainly because style often conveys more information (e.g., respect) than is contained in the literal meaning of the text hovy1987generating. From a sociolinguistic perspective nguyen2016computational, the role of style can be defined by its pragmatic aspects and rhetorical goals hovy1987generating or by the personal and group characteristics of participants biber1991variation. More recently, such phenomena have been called creative language ji2018creative in a broader context.

Imagine an orchestra performed by a large instrumental ensemble. What we hear at the end is only the harmonized sound of complex, interacting combinations of individual instruments, where the conductor controls their combinatory choices (e.g., score, tempo, correctness). Some instruments belong to the same category, such as violin and cello among bowed strings, or horn and trumpet among brass. Similarly, text as a final output reflects a complex combination of different types of styles, where each style makes its own lexical or syntactic choices. A consistent combination of these choices by the speaker (like the conductor of an orchestra) produces text that is stylistically appropriate for its context.

The stylistic choice for text is often triggered by an implicit reflection of someone's characteristics (e.g., personality, demographic traits kang19emnlp_pastel, the speaker's emotion toward the topic rashkin2019towards) or by explicit control toward social goals (e.g., the relationship with the hearer, the figurative usage of text dobrovol2005figurative; glucksberg2001understanding; loenneker2010computational, pragmatic aspects hovy1987generating). Broadly, we call each such factor one specific style type of language in this work. We computationally compress each style type into a single numerical variable (e.g., 1 for positive and 0 for negative in sentiment analysis) in order to represent how much of it a text contains.

However, style is not a single variable but a combination of multiple variables that co-vary in conjunction. For example, the text “a woman needs a man like a fish needs a bicycle” draws a metaphor between two clauses, yet the latter clause fails to support the former, making the whole utterance ironic. Is irony (or sarcasm) always a subset of metaphor? Are these two styles dependent on each other? Despite recent advances in various applications of style language such as style transfer, only a few works pay attention to how different style types (e.g., formality, politeness) co-vary in textual variation, which styles are interdependent, and how they are systematically composed to produce the final text. We call such studies cross-style language understanding.

Because styles co-vary, some prior works controlled all confounding variables except the target style in order to identify it nguyen2016computational; bamman2014gender; rabinovich2016personalized; kang19emnlp_pastel. On the other side, a few works studied cross-style dependencies, but focused on particular groups of styles such as demographics preoctiuc2018user, emotions warriner2013norms, or the interplay between metaphor and emotion Dankers2019ModellingTI; mohammad-etal-2016-metaphor.

Figurative styles
   humor (ShortHumor, ShortJoke), sarcasm (SarcGhosh, SARC), metaphor (TroFi, VUA)
Affective styles
    emotion (EmoBank, DailyDialog, CrowdFlower), offense (HateOffensive), romance (ShortRomance), sentiment (SentiTreeBank)
Personal styles
    age (PASTEL), ethnicity (PASTEL), gender (PASTEL), education level (PASTEL), political view (PASTEL)
Interpersonal styles
    formality (GYAFC), politeness (StanfordPolite)
Table 1: Our categorization of styles with their benchmark datasets (in parentheses) used in xSLUE.

To accelerate research along this line, we present a benchmark (xSLUE) for understanding and evaluating cross-style language, which makes the following contributions:


  • provide a theoretical categorization of 15 style types (Figure 1) into four groups: figurative, affective, personal and interpersonal styles.

  • build an online platform (http://xslue.com) for comparing systems and easily downloading the dataset. Our benchmark includes 15 different style types (e.g., formality, emotion, humor, politeness, sarcasm, offense, romance, personal traits) and 23 classification tasks.

  • share fine-tuned classifiers for each style using BERT devlin2018bert, showing significant improvements over the baselines.

  • collect an extra diagnostic set (i.e., 600 text samples) annotated by human workers with values for multiple style types in conjunction, for investigating the cross-style behavior of the model.

  • provide interesting observations about cross-style language: correlations between pairs of styles (e.g., impoliteness is related to offense) and a comparison of stylistic diversity across domains (e.g., academic papers are stylistically less diverse than tweets).

We believe our benchmark enables more in-depth study of stylistic language in general, and we suggest interesting directions for future research, such as cross-style transfer and the study of underlying inter-style dependencies in textual variation.

2 Related Work

Cross-style language understanding. Given the broad scope of style language, only a few earlier works attempted to define the general roles of style and to provide theoretical categories for it. For example, hovy1987generating categorized styles by pragmatic aspects (e.g., the relationship between speaker and hearer) and rhetorical goals (e.g., formality, power). biber1991variation defined several components of the conversational situation, such as social roles, personal and group characteristics (e.g., social class), and participants' relations. However, these works are mostly theoretical investigations tested on a few example cases, without empirical evidence or scalable analysis to support them. Our study provides empirical observations supporting some of these theories.

On the other hand, some recent works attempted to provide empirical evidence of style dependencies, but in very limited settings: warriner2013norms conducted an extensive analysis of emotional norms and their correlation with lexical features of text. chhaya-etal-2018-frustrated studied the correlation of formality, frustration, and politeness, but on a small number of samples (i.e., 960 emails). preoctiuc2018user focused on correlations among demographic attributes (e.g., age, gender, race) and with other factors such as emotions. Dankers2019ModellingTI; mohammad-etal-2016-metaphor studied the interplay of metaphor and emotion in text. In sarcasm detection, sentiment has also been used as a sub-problem providing an additional feature liu2010sentiment.

Instead of finding dependencies, some prior works controlled confounding style variables to identify a target style: for example, different demographic attributes (e.g., gender, age) were collected in conjunction and controlled against each other nguyen2016computational. Gender was controlled to study gender-specific social media bamman2014gender or to develop personalized machine translation systems rabinovich2016personalized. Recently, six demographic attributes of text (e.g., age, gender, political view, education level, ethnicity) were collected in parallel and used for controlled experiments in style classification and transfer kang19emnlp_pastel.

Evaluation platforms. In terms of platform contributions (e.g., benchmark, analysis platform), this work is strongly motivated by the recent GLUE benchmark wang2018glue for sentence-level textual inference (e.g., entailment). Similar to GLUE, we provide a benchmark suite of style language, including newly collected datasets for romance and humor classification. In addition, we provide (1) fine-tuned classifiers using state-of-the-art contextualized word embeddings devlin2018bert, (2) a diagnostic set of text annotated by human workers for cross-style classification, and additional applications such as correlation analysis across styles, style diversity with respect to domains, and more.

Comparison with multilingualism. Another motivation is the spirit of studying multiple languages in conjunction, called multilingualism edwards2002multilingualism. One major difference between multi-stylism and multilingualism is that multiple styles compose with one another, whereas multiple languages do not, except for code-switching, where different languages are used in a mixed way. This is why we call our work cross-style rather than simply multi-style.

3 Categorization of Styles

Compared to prior categorizations of style types hovy1987generating; biber1991variation from a sociolinguistic perspective, our study covers broader categories of styles. We describe the types of styles studied in this work and provide a theoretical categorization by clustering them along two dimensions: social participation (from personal to interpersonal) and content coupledness (from loosely to tightly coupled with content).

3.1 Selection of Styles-of-Interest

The term style is often used loosely, and no prior work defines its exact meaning together with an overall categorization. As a bottom-up approach, we survey recent works that describe themselves as studying style language, and collect 14 widely-used unique style types: emotion, sentiment, metaphor, humor, sarcasm, offensiveness, romance, formality, politeness, age, ethnicity, gender, political orientation, and education level. The full list from our survey is described in §4.

3.2 Categorization of Styles

Figure 1: A conceptual grouping of styles: x-axis is the aspect of style’s social participation, while y-axis is the aspect of style’s coupledness on content.

We hypothesize two orthogonal aspects of styles, social participation and content coupledness, and cluster the 14 style types along these two dimensions (Figure 1).

Social participation concerns whether a style relates to the speaker (i.e., personal) or the hearer (i.e., interpersonal) in a conversation. This dimension was studied in biber1991variation: personal style reflects personal characteristics of the speaker (e.g., personality), while interpersonal style reflects the relationship with the hearer (e.g., friendship). In addition to biber1991variation's definition, our view focuses on whether a style affects textual variation implicitly or explicitly. Personal styles (e.g., age, gender) are given attributes of a person, so his or her text implicitly contains their combination kang19emnlp_pastel; interpersonal styles (e.g., friend, enemy, boss) arise from social interactions, so the text can be explicitly controlled by the speaker with respect to the hearer. For instance, one may explicitly control the formality of one's words depending on the interlocutor, while personal characteristics or demographic traits (e.g., ethnicity) are implicitly contained in one's words without any explicit control. Recently, calvo2013emotions distinguished the ascription of emotion from the writer's versus the reader's perspective.

Content coupledness means how tightly or loosely a style is coupled with the content of the text. ficler2017controlling controlled different styles (e.g., descriptive, professional) in text variation regardless of their coupledness with the semantics of the text. However, it is often observed that content words are tightly coupled with styles kang19emnlp_pastel; preoctiuc2018user. For instance, one can increase or decrease the formality of a text regardless of its topic, while one has a specific degree of emotion or offensiveness with respect to a certain topic or person.

We then project the styles onto the two dimensions, stretching each style if it spans a broad range of a dimension (Figure 1). The personal styles (e.g., age, gender, education level, ethnicity) are not biased toward content because they are implicitly reflected in text. On the other hand, formality and politeness are interpersonal but loosely coupled with content, because they are used independently of content. Emotion can be either personal or interpersonal, while offense and romance relate only to the other person, and sentiment is tightly coupled with content. The remaining three styles, metaphor, sarcasm, and humor, are more complex phenomena built on top of the others, so they stretch over the dimensions.

Based on these groupings, we categorize the styles into four groups: figurative styles (i.e., humor, sarcasm, metaphor), affective styles (i.e., emotion, offense, romance, sentiment), personal styles (i.e., age, ethnicity, gender, education level, political view), and interpersonal styles (i.e., formality, politeness). Note that our dimensions are driven by simple conjectures, so better projections and categorizations may exist. Our goal is rather to provide one potential categorization of styles based on our own theory and then compare it with empirically-observed style clusters in §6.

4 xSlue: A Benchmark for Cross-Style Language Understanding

Style Type & Dataset #S Split #L Label(proportion) Balance Domain Public Task
GYAFC rao2018dear 224k given 2 formal (50%), informal (50%) Y web N clsf.
StanfPolite danescu2013computational 10k given 2 polite (49.6%), impolite (50.3%) Y web Y clsf.
ShortHumor 44k random 2 humor (50%), non-humor (50%) Y web Y clsf.
ShortJoke 463k random 2 humor (50%), non-humor (50%) Y web Y clsf.
SarcGhosh ghosh2016fracking 43k given 2 sarcastic (45%), non-sarcastic (55%) Y tweet Y clsf.
SARC khodak2017large 321k given 2 sarcastic (50%), non-sarcastic (50%) Y reddit Y clsf.
SARC_pol khodak2017large 17k given 2 sarcastic (50%), non-sarcastic (50%) Y reddit Y clsf.
VUA steen2010method 23k given 2 metaphor (28.3%), non-metaphor (71.6%) N misc. Y clsf.
TroFi birke2006clustering 3k random 2 metaphor (43.5%), non-metaphor (54.5%) N news Y clsf.
EmoBank buechel2017emobank 10k random 1 negative, positive - misc. Y rgrs.
EmoBank buechel2017emobank 10k random 1 calm, excited - misc. Y rgrs.
EmoBank buechel2017emobank 10k random 1 being_controlled, being_in_control - misc. Y rgrs.
CrowdFlower 40k random 14 neutral (21%), worry (21%), happy (13%), sad (12%), love (9%) .. N tweet Y clsf.
DailyDialog li2017dailydialog 102k given 7 noemotion (83%), happy (12%), surprise (1%), fear (.1%), disgust (.3%), sad (1%), anger(.9%) N dialogue Y clsf.
HateOffensive davidson2017automated 24k given 3 hate (6.8%), offensive (76.3%), neither (16.8%) N tweet Y clsf.
ShortRomance 2k random 2 romantic (50%), non-romantic (50%) Y web Y clsf.
SentiBank socher2013recursive 239k given 2 positive (54.6%), negative (45.4%) Y web Y clsf.
PASTEL_gender kang19emnlp_pastel 41k given 3 Female (61.2%), Male (38.0%), Others (.6%) N caption Y clsf.
PASTEL_age kang19emnlp_pastel 41k given 8 35-44 (15.3%), 25-34 (42.1%), 18-24 (21.9%), 45-54 (9.2%), 55-74 (10.5%) N caption Y clsf.
PASTEL_country kang19emnlp_pastel 41k given 2 USA (97.9%), UK (2.1%) N caption Y clsf.
PASTEL_politics kang19emnlp_pastel 41k given 3 Left (42.7%), Center (41.7%), Right (15.5%) N caption Y clsf.
PASTEL_education kang19emnlp_pastel 41k given 10 Bachelor (30.6%), Master (18.4%), NoDegree (18.2%), HighSchool (11.0%), Associate (9.3%) .. N caption Y clsf.
PASTEL_ethnicity kang19emnlp_pastel 41k given 10 Caucasian(75.6%), NativeAmerican(8.6%), HispanicOrLatino(3.3%), African(5.5%), EastAsian(2.5%).. N caption Y clsf.
Table 2: Style types and datasets in xSLUE. Every label in the datasets ranges in . #S and #L denote the total number of samples and of labels, respectively. ‘_’ marks sub-tasks of a dataset. For datasets with multiple labels, we show only the five most frequent. clsf. denotes a classification task and rgrs. a regression task. We use accuracy and F1 for classification tasks, and Pearson-Spearman correlation for regression tasks.

4.1 Dataset for Individual Style Language

We choose existing datasets of style language, or collect our own when no dataset is available. Our rules of thumb for data collection and preprocessing are:


  • we do not use datasets with only a small number of samples (i.e., ) because training on them is not feasible.

  • we limit our tasks to classifying a single sentence, even though various other task settings exist (e.g., classifying a sentence given its context).

  • if a dataset has its own split, we follow it. Otherwise, we randomly split it 0.9/0.05/0.05 into train/valid/test, respectively.

  • if a dataset has only positive samples (e.g., ShortHumor), we apply negative sampling.

  • due to the label imbalance of some datasets, we measure both F1-score and accuracy for classification tasks, and Pearson-Spearman correlation for regression tasks. For multi-label tasks, all scores are macro-averaged.
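The splitting and negative-sampling rules above can be sketched as follows. This is a minimal illustration, not the released preprocessing code; the function names and the fixed seed are ours:

```python
import random

def split_dataset(samples, seed=0):
    """Randomly split a dataset 0.9/0.05/0.05 into train/valid/test,
    used for datasets without an official split."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_valid = int(0.9 * n), int(0.05 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

def add_negative_samples(positives, negative_pool, seed=0):
    """Pair a positive-only dataset (e.g., ShortHumor) with an equal
    number of negatives drawn with replacement from a pool of literal
    (non-stylistic) sentences."""
    rng = random.Random(seed)
    negatives = [rng.choice(negative_pool) for _ in positives]
    return [(s, 1) for s in positives] + [(s, 0) for s in negatives]
```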

Table 2 summarizes the style types, datasets, and data statistics (e.g., sample size, data split, distribution of labels, label balance, domain of text, public availability, and task type). We describe more details of our data collection and preprocessing below.


Formality. Choosing the appropriate formality for the situation (e.g., the person one is talking to) is a key aspect of effective communication heylighen1999formality. We use the GYAFC dataset rao2018dear, which includes both formal and informal text collected from the web. However, the dataset requires individual authorization from its authors, so our benchmark contains only a preprocessing script that converts it to the same format as the other datasets.


Humor. Humor (or joking) is a social style used to smooth conversation or take a break. Detecting humor rodrigosequential; yang2015humor; chandrasekaran2016we and double entendre kiddon2011s, or generating jokes ritchie2005computational; petrovic2013unsupervised, have been studied broadly using various linguistic features. We use two well-known humor detection datasets: ShortHumor (http://github.com/CrowdTruth/Short-Text-Corpus-For-Humor-Detection), which contains 22K humorous sentences collected from six websites (twitter, textfiles, funnyshortjokes, laughfactory, goodriddlesnow, onelinefun), and ShortJoke, which contains 231K jokes scraped from several websites. (Besides these two, there are many other joke datasets, e.g., pungas; potash2017semeval; rodrigosequential; mihalcea2006learning, but they do not fit our project because of their limited domain or low recall.) To collect negative samples, we randomly sample non-humorous sentences from two sources: random sentences from a Reddit summarization corpus kang19emnlp_biassum and literal sentences from a Reddit corpus khodak2017large. For ShortJoke, we sample negatives with replacement up to the number of positives for label balance.


Politeness. Encoding (im)politeness in conversation plays various roles in social interaction, such as signaling power dynamics at workplaces, acting as a decisive factor, and being used strategically in social contexts (e.g., requests) chilton1990politeness; holmes2015power; clark1980polite. For example, one can say “if you don’t mind” or “I’m sorry, but” to be strategically indirect or to apologize for an imposition, respectively lakoff1973logic. We use Stanford’s politeness dataset (StanfPolite) danescu2013computational, which collects request-type polite and impolite text from the web, such as the Stack Exchange question-answer community (http://stackexchange.com/about).


Sarcasm. Sarcasm uses words that mean something other than what the speaker wants to say, in order to insult someone, show irritation, or simply be funny; it is therefore often used interchangeably with irony. A detailed categorization of its roles is summarized in joshi2017automatic. Detecting sarcastic text is often regarded as a sub-problem of sentiment analysis liu2010sentiment, and the figurative nature of sarcasm makes identifying it in text challenging tepperman2006yeah; wallace2014humans; wallace2015computational. Sarcasm datasets have been collected and annotated in different domains: books joshi2016word, tweets gonzalez2011identifying; ghosh2015semeval; peled2017sarcasm; ghosh2016fracking, reviews filatova2012irony, forums walker2012corpus, and Reddit comments khodak2017large. We choose two of them based on the needs of our project (e.g., data size, public availability, and text length): SarcGhosh ghosh2016fracking and SARC version 2.0 khodak2017large (SARC_pol is a sub-task of SARC restricted to text from the politics subreddit). We apply the same preprocessing scheme as ilic2018deep to SARC.


Metaphor. Metaphor is figurative language that describes an object or action by applying a term to which it is not literally applicable; the term often denotes something representative or symbolic, especially something abstract. Detecting metaphoric text has been studied in different ways: rule-based russell1976computer; martin1992computer, dictionary-based, and more recently computation-based approaches using various factors (e.g., discourse, topic transition, emotion) nissim2003syntactic; jang2017finding; mohler2013semantic. We use two benchmark datasets in which metaphoric text is annotated by human annotators: Trope Finder (TroFi) birke2006clustering and the VU Amsterdam (VUA) corpus steen2010method. (We did not include mohler2016introducing because its labels are not obtained from human annotators.)


Offense. Hate speech targets disadvantaged social groups based on group characteristics (e.g., race, ethnicity, gender, and sexual orientation) in a manner that is potentially harmful to them jacobs1998hate; walker1994hate. More recently, davidson2017automated defined hate speech as language used to express hatred towards a targeted group, or intended to be derogatory, to humiliate, or to insult the members of the group, which is a category of offensive language in general. We use the HateOffensive dataset davidson2017automated, which includes hate text (6.8%), offensive text (76.3%), and text that is neither (16.8%).


Romance. To the best of our knowledge, no existing dataset includes romantic and non-romantic text. We therefore crawl romantic text from eleven different websites (see Appendix), preprocess it by filtering out noisy, overly long, and duplicate text, and build a new dataset called ShortRomance. As with ShortHumor and ShortJoke, we draw the same number of negative samples as romantic ones from the literal Reddit sentences of khodak2017large.


Sentiment. Identifying the sentiment polarity of an opinion is challenging because of its implicit and explicit presence in text kim2004determining; pang2008opinion. We use a large annotated sentiment corpus of movie reviews, the Sentiment Tree Bank socher2013recursive (SentiBank).


Emotion. Emotion is a more fine-grained notion than sentiment, and it can be modeled either categorically or dimensionally. While Ekman’s six basic categories of emotion ekman1992argument conceptualize emotions as discrete states (anger, joy, surprise, disgust, fear, and sadness), the dimensional model warriner2013norms represents states along a small number of independent emotional dimensions: Valence (polarity), Arousal (degree of calmness or excitement), and Dominance (perceived degree of control); the VAD model. We use two datasets: one (DailyDialog li2017dailydialog) based on Ekman’s categorical model and another (EmoBank buechel2017emobank) based on the VAD model. We normalize EmoBank’s original label range in our benchmark. We also include CrowdFlower (http://www.crowdflower.com/wp-content/uploads/2016/07/text_emotion.csv), a large but somewhat noisy emotion-annotated corpus that includes not only Ekman’s categories but also additional ones: enthusiasm, worry, love, fun, hate, relief, and boredom.


Persona. Persona is a pragmatic style tied to group characteristics of the speaker kang19emnlp_pastel. It is often observed that a certain persona group makes specific use of certain textual features. We use PASTEL kang19emnlp_pastel, a stylistic language dataset written in parallel in which multiple types of the author’s persona are given in conjunction. Similar to the emotion datasets, PASTEL has multiple attributes at once (i.e., age, gender, political view, ethnicity, country, education level), most of whose categories are unbalanced.

4.2 Diagnostic Set for Cross-Style Language

With individual style types, we can train a classifier on each and measure its performance on each individual test set. However, without a test set shared across styles, we cannot measure how well different styles are identified at the same time (i.e., cross-style classification), nor whether a model captures the underlying structure of inter-style variation in text (see our experiment in §5). We therefore collect a diagnostic set by annotating labels for multiple styles on the same text. We prepare two diagnostic sets: a cross-test set and a tweet-diverse set.


  • the cross-test set consists of 100 samples randomly chosen from the test sets of the style datasets, in two sampling steps: first, we randomly select 40 test samples from each of the 15 datasets; then, from these 600 samples, we randomly choose the final 100 as our cross-test diagnostic samples. Each sample in the cross-test set retains the ground-truth label of the style it was sampled from, which we use as a sanity check on our annotations.

  • the tweet-diverse set consists of another 100 samples chosen from random tweets. We first collect the 300 top-ranked tweets by stylistic diversity and another 300 bottom-ranked tweets with less stylistic diversity (see §6.2 for our definition and measurement of style diversity), then randomly sample 100 tweets from this collection.

Using these two diagnostic sets (we will collect an additional 400 samples, 200 per set, in the final version), we ask human workers to predict the stylistic attributes of each text for multiple style types, with three different annotators assigned to each sample. The detailed instructions and annotation schemes are in the Appendix. The final label for each style is decided as a discrete label via majority voting over the three annotators, and as a continuous value by averaging the number of votes. For personal styles (e.g., age, gender), we also provide a Don’t Know option in case prediction is too difficult. When the three votes all differ from each other, we do not use the sample in our evaluation (we will release these ambiguous or controversial cases, including the Don’t Know answers, as a separate evaluation set in the future).
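The vote aggregation described above can be sketched as follows (an illustrative sketch; the function name is ours):

```python
from collections import Counter

def aggregate_votes(votes):
    """Reduce one sample's annotator votes to (discrete, continuous):
    the majority label (None on a full three-way disagreement, in which
    case the sample is excluded from evaluation) and per-label vote
    fractions as the continuous value."""
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    discrete = label if top >= 2 else None
    continuous = {lab: c / len(votes) for lab, c in counts.items()}
    return discrete, continuous
```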

5 Single and Cross Style Classification

Single-style classification Cross-style classification
cross-test tweet-diverse


Dataset Majority Original BiLSTM BERT BERT BERT


GYAFC 43.3 (30.2) na 76.5 (76.4) 88.3 (88.3) 64.1 (63.8) 75.0 (55.8)


StanfPolite 56.7 (36.2) 83.2 62.1 (61.8) 66.8 (65.8) 80.7 (80.7) 84.0 (45.6)


ShortHumor 50.0 (33.3) na 88.6 (88.6) 97.0 (97.0) - -
ShortJoke 50.0 (33.3) na 89.1 (89.1) 98.3 (98.3) 52.1 (41.0) 63.0 (54.61)


SarcGhosh 50.0 (33.3) na 73.0 (72.6) 54.4 (42.4) - -
SARC 50.0 (33.3) 75.8 63.0 (63.0) 70.2 (70.1) 57.6 (44.2) 52.0 (41.7)
SARC_pol 50.0 (33.3) 76.0 61.3 (61.3) 71.8 (71.7) - -


VUA 70.0 (41.1) na 77.1 (68.9) 84.5 (89.1) 25.0 (20.0) 50.0 (33.3)
TroFi 57.2 (36.4) 46.3 74.5 (73.9) 75.7 (78.9) - -


EmoBank -/-/- na 78.5/49.4/39.5 81.2/58.7/43.6 64.1/23.9/82.6 78.0/29.0/77.0
CrowdFlower 22.4 (2.8) na 31.1 (12.3) 36.5 (21.9) - -
DailyDialog 81.6 (12.8) na 84.2 (27.61) 84.2 (49.6) 47.8 (19.3) 69.0 (22.3)


HateOffens 75.0 (28.5) 91.0 86.6 (68.2) 96.6 (93.4) 84.7 (47.27) 81.0 (33.6)


ShortRomance 50.0 (33.3) na 90.6 (90.6) 99.0 (98.9) 95.6 (86.3) 75.0 (54.6)


SentiBank 50.0 (33.3) 87.6 82.8 (82.8) 96.6 (96.6) 88.4 (88.0) 85.3 (70.7)


PASTEL_gender 62.8 (25.7) na 73.2 (45.5) 73.0 (48.7) 37.5 (19.6) 38.2 (25.5)
PASTEL_age 41.5 (7.3) na 41.9 (15.2) 46.3 (23.9) 40.9 (23.3) 59.5 (38.1)
PASTEL_country 97.2 (49.2) na 97.2 (49.3) 97.1 (55.2) 97.5 (49.3) 95.9 (48.9)
PASTEL_politics 42.9 (20.0) na 48.5 (33.5) 50.9 (46.1) 9.0 (8.3) 37.5 (29.3)
PASTEL_education 31.4 (4.7) na 42.4 (15.0) 42.5 (25.4) 23.2 (11.7) 25.6 (11.1)
PASTEL_ethnicity 75.4 (8.5) na 82.3 (17.6) 81.1 (25.6) 59.0 (15.7) 34.4 (16.4)
total 55.4(26.8) 69.3(55.7) 73.7(64.3) 57.5(41.2) 61.7(38.8)
Table 3: Single-style and cross-style classification. na means not applicable. We report both accuracy and macro-averaged F1-score (in parentheses) for classification tasks. For cross-style classification, we choose only one dataset when multiple datasets exist for a style.


In single-style classification, we train a classifier (or, for EmoBank, a regression model) individually on each dataset and predict its label. For simplicity, we use the pre-trained uncased BERT language model (Bidirectional Encoder Representations from Transformers) devlin2018bert (we also tried variants such as ‘large-uncased’, with comparable performance) and fine-tune it on each dataset using two layers of perceptrons on top of the pre-trained model. For evaluation, we report both accuracy and macro-averaged F1-score due to label imbalance.

As baselines, we provide a simple majority classifier (i.e., predicting the most frequent label of the training set for every test sample) and a Bidirectional LSTM (BiLSTM) hochreiter1997long with pre-trained word embeddings pennington2014glove. We also report human and model performances from the original papers where given; if an experimental setup is not directly comparable (e.g., due to different evaluation metrics), we mark it as na. The details of the hyper-parameters are in the Appendix.
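The classification head can be pictured as follows: two layers of perceptrons mapping the encoder's sentence vector to label probabilities. This is a minimal numpy sketch with the fine-tuned encoder abstracted away as a 768-dimensional [CLS] embedding; the shapes, activation, and names are our assumptions, not the released code:

```python
import numpy as np

def two_layer_head(cls_emb, W1, b1, W2, b2):
    """Two layers of perceptrons on top of the encoder's [CLS] vector:
    a tanh hidden layer followed by a softmax over the style labels."""
    hidden = np.tanh(cls_emb @ W1 + b1)
    logits = hidden @ W2 + b2
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return probs / probs.sum(axis=-1, keepdims=True)

# Example shapes: BERT-base embeddings (768-dim), a binary style task.
rng = np.random.default_rng(0)
dim, hid, n_labels = 768, 128, 2
W1, b1 = rng.normal(0, 0.02, (dim, hid)), np.zeros(hid)
W2, b2 = rng.normal(0, 0.02, (hid, n_labels)), np.zeros(n_labels)
probs = two_layer_head(rng.normal(size=(4, dim)), W1, b1, W2, b2)  # 4 sentences
```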


Table 3 shows performance on single-style classification (left) and cross-style classification (right). The fine-tuned BERT classifier outperforms the majority and BiLSTM baselines on F1 score by large margins, except on SarcGhosh. In particular, BERT shows significant F1 improvements for personal styles. For the sarcasm and politeness tasks, our classifiers do not outperform the scores in the original papers, which use additional hand-crafted syntactic features.
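The macro-averaged F1 reported in parentheses weighs every label equally, which is why it drops sharply on imbalanced tasks. A minimal reference implementation (equivalent in spirit to scikit-learn's f1_score with average='macro'):

```python
def macro_f1(y_true, y_pred):
    """Average the per-label F1 scores uniformly, so that rare labels
    count as much as frequent ones."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```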

When classifying multiple styles at the same time, called cross-style classification (Table 3, right), single-style classifiers do not reach the performance they achieve in single-style classification. This is mainly because a single-style classifier trained on a specific domain is biased toward that domain, and/or the dataset itself may contain annotation artifacts that do not generalize to held-out samples. More importantly, cross-style classification differs fundamentally from single-style classification: when predicting multiple styles together, one should consider how the styles depend on each other, which calls for a unified model in which multiple styles are jointly trained. Such a joint model needs to capture the underlying dependency structure across styles. The same applies to multi-style generation, much like hearing the harmonized sound of a complex combination of individual instruments.

6 Cross-Style Language Understanding

We provide useful analyses of stylistic language using xSLUE: (1) finding correlations between pairs of styles, (2) measuring the stylistic diversity of text, and (3) finding which domains of text (e.g., academic papers vs. tweets) are stylistically more diverse.

6.1 Cross-Style Correlation

Figure 2: Cross-style correlation. The degree of correlation gradually increases from red (negative) through yellow to blue (positive), where color intensity is proportional to the correlation coefficient. Only correlations with p-value < 0.05 (confidence interval: 0.95) are considered statistically significant; otherwise, cells are crossed out. Age ranges are prefixed with X in the personal styles. IMPORTANT: before interpreting anything from these matrices, please be VERY CAREFUL not to make unethical or misleading claims based on these simple measures. Please read the potential weaknesses of our experiment below. Best viewed in color.

Setup. Using the classifiers pre-trained on the different style attributes (we exclude duplicate style attributes such as SarcGhosh and CrowdFlower in §5), we predict the score of each attribute on 1,000,000 newly crawled tweets (tweets from 2008 to 2013 kang2017detecting, randomly sampled from Twitter's Garden Hose API). We choose tweets as a test bed due to their stylistic diversity compared to other domains such as news articles or academic papers (see §6.3 for stylistic diversity across domains).

We obtain 53 style scores for each of the 1 million tweets, and then compute Pearson correlation coefficients between the style attributes' score columns to produce the final correlation matrix (Figure 2). We split it into three pieces based on our groupings defined in §3: interpersonal and figurative styles (top left), affective styles (bottom left), and personal styles (right). We retain only correlations that are statistically significant with p-value < 0.05.
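The correlation step can be sketched as below, assuming a score matrix with one column per style attribute; in the actual pipeline there are 53 columns over 1M tweets, and statistically insignificant cells are additionally masked (e.g., using p-values from scipy.stats.pearsonr). The toy scores here are synthetic.

```python
import numpy as np

def style_correlation(scores):
    """Pearson correlation between style attributes.
    scores: (n_texts, n_styles) array of predicted style scores;
    returns an (n_styles, n_styles) correlation matrix."""
    return np.corrcoef(scores, rowvar=False)

# toy example: style 1 closely mirrors style 0, style 2 is its negation
rng = np.random.default_rng(1)
s0 = rng.uniform(size=1000)
scores = np.stack([s0, s0 + 0.01 * rng.normal(size=1000), 1.0 - s0], axis=1)
corr = style_correlation(scores)
```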

Motivation. The basic idea behind this analysis is that certain textual features (e.g., lexical choices) that the classifiers can detect co-occur across multiple styles. Compared to theoretical hovy1987generating or empirical preoctiuc2018user; kang19emnlp_pastel analyses of such features, our analysis uses surface-level co-occurrence patterns of features across styles with the help of the classifiers.

Analysis. From the correlation matrix, we observe several interesting correlations: (non-humorous text, text with positive sentiment), (non-humorous text, text by Master / Doctorate education), (polite text, text with no-emotion), (text with dominance:being_in_control, text with positive sentiment), (text with anger emotion, offensive text), (text by Left-Wing political orientation, text by Bachelor / Master education), and more.

However, we should not blindly trust these correlations. For example, there is a highly positive correlation between Age(<12) and Age(>=75), which seems unreasonable. More importantly, we should be VERY CAREFUL not to draw misleading interpretations from them, especially for styles related to personal traits. This is due not only to ethical issues but also to several weaknesses of our experimental design:

IMPORTANT: Weakness of our experiment.


  • Our analysis is neither controlled nor causal. To find causal relations between styles and to control for confounding variables, more sophisticated methods such as analysis of covariance (ANCOVA) keppel1991design or propensity analysis austin2011introduction need to be applied.

  • Do not trust the classifiers. The style classification results above (§5) indicate that certain styles (e.g., sarcasm, persona styles) are very difficult to predict, making parts of our analysis unreliable. To overcome this, a jointly-trained model across different styles is indispensable in order to benefit from cross-style signals.

  • Each dataset has its own issues. Some datasets are collected only from certain domains (e.g., news articles), biasing the classifier toward those domains. Some have very imbalanced label distributions. Each data collection may contain annotation artifacts, and some datasets include noisy text.

6.2 Stylistic Diversity of Text

In style transfer, preserving the original meaning of the text is a challenging issue. Can we change the style of any text regardless of its content? Why is the style of some text easier to change than that of others? Can we predict whether a text's style is changeable?

We propose a simple technique for ranking the stylistic diversity of text using the fine-tuned style classifiers from §5. Given a text, we first calculate the mean and the standard deviation (std) over the style scores predicted by the classifiers. We run predictions on the 1M tweets used in §6.1, obtaining a matrix whose columns are the mean and std over the styles (for simplicity, we remove all literal-type labels such as ‘informal’, ‘non-humorous’, ‘non-sarcastic’, ‘impolite’, ‘non-metaphor’, ‘negative’, ‘neither-hate-or-offensive’, ‘non-romantic’, and ‘noemotion’). We sort samples by the mean and take the top (or bottom) 10%; we then sort those tweets again by the std and take the top (or bottom) 10%. We call the final top-ranked samples stylistically diverse and the bottom-ranked samples stylistically less diverse, indicating that both the total amount of style prediction scores and their variation are high (or low).

Stylistically diverse text
i’m glad i can add hilarity into your life .32 .45 .98 .99 0 .99 .99 0
it was really cool speaking with you today i look forward to working for you .32 .45 .99 .99 .99 .99 0 0
i’m *ucking proud of you baby you’ve come a long way .31 .45 0 .99 .99 0 .99 .99
Stylistically less diverse text
lip/tongue tingling .15 .28 .01 0 .02 0 0 0
satellite server is a mess cleaning up .15 .28 0 0 .04 .01 .68 0
having beer with and some latin americans .14 .28 0 0 .28 0 0 0
Table 4: Stylistic diversity of text: we sample the 10% of tweets with the highest (lowest) average style prediction score, and then sample again the 10% with the highest (lowest) standard deviation; the former are stylistically diverse, the latter less diverse. Some offensive words are masked with *.

Analysis. Table 4 shows the top- and bottom-ranked stylistically diverse and less-diverse tweets. We show only some labels (e.g., formal, humorous) due to space limitations. We observe that stylistically diverse text uses more emotional and social expressions (e.g., complaining, greeting), while stylistically less diverse text is more literal and factual, simply describing a thing. Again, some predicted scores are inaccurate due to the aforementioned issues with the classifiers and datasets. We also find that the classifiers often predict very extreme scores (e.g., 0.99, 0.01), so their posterior probabilities need to be calibrated accordingly. More examples of stylistic diversity and the pair-plot distribution of the prediction scores are in the Appendix.

6.3 Stylistic Diversity across Domains

Different domains or genres of text may have their own patterns of stylistic diversity. For example, text from academic manuscripts may be more literal (stylistically less diverse), while tweets and other social media posts are stylistically diverse.

We sample sentences from different domains (using the collection of summarization corpora over different domains kang19emnlp_biassum): tweets, academic papers, news articles, novels, dialogues, movie scripts, and political debates (the full transcripts of the 2016 US presidential debates between Clinton and Trump: http://www.kaggle.com/mrisdal/2016-us-presidential-debates/). After splitting the text into sentences, we use at most 100,000 sentences per domain. For each domain, we predict the probability score of each style using the fine-tuned classifiers from §5, and then average the scores over sentences for each individual style.
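The per-domain averaging can be sketched as below; `domain_style_profile` is a hypothetical helper mirroring the setup above, and the toy score matrices are synthetic.

```python
import numpy as np

def domain_style_profile(scores_by_domain, cap=100_000):
    """Average predicted style scores per domain, using at most `cap`
    sentences per domain.
    scores_by_domain: dict mapping domain name -> (n_sentences, n_styles)
    array; returns dict mapping domain name -> averaged style vector."""
    return {domain: s[:cap].mean(axis=0)
            for domain, s in scores_by_domain.items()}

rng = np.random.default_rng(2)
profiles = domain_style_profile({
    "tweets": rng.uniform(size=(500, 5)),               # stylistically diverse
    "academic": rng.uniform(high=0.1, size=(500, 5)),   # mostly literal
})
```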

Figure 5: Style diversity on six different domains: tweets, Reddit posts, news articles, academic papers, movie scripts, and political debates. Best viewed in color.

Analysis. Figure 5 shows the absolute proportions of the averaged prediction scores of each style across domains. For the affective styles in Figure 5(a), text in academic papers carries the least affective style, followed by news articles, while text on social media (e.g., tweets, Reddit posts) shows great stylistic diversity, reflecting the freedom of speech in those domains. Interestingly, text in political debates balances two conflicting styles, hate and happy, with less of the offensive and anger styles.

For the interpersonal and figurative styles in Figure 5(b), tweets are the most informal, whereas academic papers are very formal and polite. The analysis of personal styles is included in the Appendix.

7 Conclusion, Challenges, and Future Directions

We build a benchmark for cross-style language understanding and evaluation, which includes different types of styles and their datasets. Using state-of-the-art classifiers trained on each style dataset, we provide several interesting observations (e.g., cross-style classification, cross-style correlation, style diversity across domains) as well as a theoretical grouping of style types. We believe our benchmark will help other researchers develop more solid systems for various applications of stylistic language. We summarize the challenges we faced and potential future directions in cross-style language understanding.

7.1 Challenges

More severe semantic drift. The biggest challenge in collecting cross-style datasets kang19emnlp_pastel or controlling multiple styles in generation ficler2017controlling is to diversify the style of text while preserving its meaning, in order to avoid semantic drift. This can be addressed by collecting parallel text or by preserving meaning through various techniques. In the cross-style setting, multiple styles change at the same time in different parts of the text in complicated ways, leading to more severe semantic drift.

Style drift due to cross-style dependency. We also face a new challenge, style drift: different styles are coupled in text, so changing one type may affect the others. For example, making a text more impolite tends to make it more offensive and negative (see §6.1). For cross-style transfer or multi-style generation, we first need to understand the underlying dependencies across style types and develop generative models that can handle these implicit dependencies.

Stylistic diversity and content coupledness. In §6.2 and §6.3, we measure stylistic diversity by the amount of style (i.e., mean) and its variation (i.e., standard deviation). Some believe that content should be separated from style, but our observations show that they are highly coupled. A more in-depth analysis of the relationship between content and style is needed to better understand stylistic language.

More careful interpretation is required. In cross-style language, some style types (e.g., personal styles) are very sensitive and require more careful interpretation of results. We listed three weaknesses of our analysis in §6.1 so as not to draw misleading conclusions from it. Any follow-up research in this direction should consider such ethical issues and disclose the potential weaknesses of its proposed methods.

7.2 Future Directions

Necessity of cross-style modeling. We have not yet explored models that can learn the internal dependency structure across styles. Studying such cross-style dependencies would help handle the complex combinations of styles in classification as well as generation. For example, rather than developing separate classifiers for each style, a universal classifier over multiple styles is needed, where internal representations are shared across styles but each style has its own predictor. Such cross-style modeling may benefit from shared representations, much like interlingua representations in multilingual settings edwards2002multilingualism.
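One way such a universal classifier could be structured is sketched below: a shared encoder feeding one lightweight predictor per style. This is a hypothetical numpy sketch of the design, not a model from the paper; class and parameter names are our own.

```python
import numpy as np

class CrossStyleClassifier:
    """Sketch of a universal classifier: one shared encoder plus a
    lightweight softmax predictor per style (hypothetical design)."""

    def __init__(self, dim, styles, seed=0):
        rng = np.random.default_rng(seed)
        self.W_shared = rng.normal(scale=0.02, size=(dim, dim))
        # styles: dict mapping style name -> number of labels
        self.heads = {name: rng.normal(scale=0.02, size=(dim, n))
                      for name, n in styles.items()}

    def encode(self, x):
        # shared representation reused by every style predictor
        return np.tanh(x @ self.W_shared)

    def predict(self, x):
        h = self.encode(x)
        out = {}
        for name, W in self.heads.items():
            z = h @ W
            e = np.exp(z - z.max())        # per-style softmax
            out[name] = e / e.sum()
        return out

model = CrossStyleClassifier(dim=16, styles={"formality": 2, "emotion": 9})
preds = model.predict(np.ones(16))
```

Because the encoder is shared, gradients from a high-resource style would also shape the representation used by low-resource styles, which is the benefit argued for above.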

Low-resource and low-performance styles. Cross-style models will also be useful for styles with little annotated data (low-resource styles) or styles with very low performance due to the difficulty of the stylistic language itself (low-performance styles). For example, our study shows that detecting sarcasm and metaphor in text is still very difficult; these tasks might be helped by other style types.

Cross-styling in other applications. Besides style classification, our benchmark can be applied to other applications such as style transfer. However, given the aforementioned issues of semantic and style drift, cross-style transfer and style-controlled generation may be even more challenging without understanding the underlying dependencies across styles in textual variation.


This work would not have been possible without the efforts of the authors who kindly share the style language datasets publicly. We also thank Taehee Jung, Yongjoon Kim, Wei Xu, Shirley A. Hayati, Varun Gangal, Naoki Otani, Dheeraj Rajagopal, Yohan Jo, and Hyeju Jang for their helpful comments.


Appendix A Details on ShortRomance

ShortRomance text is crawled from the following websites. The copyright of the messages is owned by the original writers of the websites.

Appendix B Hyper-Parameters

For our BERT classifier, we use the uncased English BERT model. Both training and testing use a batch size of 8. For the BiLSTM baseline, we use a batch size of 32 for both training and testing (32 performs slightly better than smaller sizes such as 8 or 16), a hidden size of 256 for the LSTM layer, and 300-dimensional pre-trained GloVe word embeddings pennington2014glove. The vocabulary size of the BiLSTM is the same as the maximum vocabulary of the BERT model: 30,522.

For both the BERT and BiLSTM models, we use the same maximum input length of 128. Both are trained with the Adam optimizer, using the same learning rate, epsilon, and maximum gradient-clipping settings, and we apply early stopping within a maximum of 5 training epochs.

Figure 6: Agglomerative clustering of style types based on the correlation matrix. Best viewed in color.

Appendix C Details on Diagnostic Set Collection: Annotation Schemes and Details

Figure 7: Snapshots of our annotation tasks: general instruction (top) and annotation tasks on each style (bottom).

Figure 7 shows snapshots of our annotation platform with detailed instructions. We estimate the execution time of each task at 4 minutes and set the payment per task accordingly. We release batches of 10 tasks multiple times and incrementally increase the batch size up to 600 samples. For each batch, we manually check the quality of the outputs and block users who answer abusively.

Appendix D More Examples on Stylistic diversity

Table 5 includes more examples with our stylistic diversity analysis.

Stylistically diverse text
i’m glad i can add hilarity into your life .32 .45 .98 .99 0 .99 .99 0 0 .99 .99 0 0 0 .99 0 0 .99 .97 0 .99 0 0 .99 .99 0 0
sitting with amazing women talking about what we’re grateful for love them .31 .45 .97 .99 .94 .99 0 0 .99 .99 .99 0 0 0 .97 0 0 .99 .98 0 0 0 .98 0 .98 0 .01
the train is a superb opportunity to fall into someone’s lap and meet the love of one’s life .32 .45 .99 .99 .99 .99 .99 0 0 .99 .99 0 0 0 0 0 0 .99 .96 .01 .99 .9 0 .05 .99 0 0
it was really cool speaking with you today i look forward to working for you .32 .45 .99 .99 .99 .99 0 0 0 .99 .99 0 0 0 .98 0 0 .95 .91 .08 0 0 .94 0 0 0 .99
thank god the ap has posted a video of matt damon’s feelings on sarah palin my life is complete .31 .45 0 .99 .89 .99 .99 0 0 .99 .99 0 0 0 .99 0 0 .99 .99 0 .95 0 0 .99 .99 0 0
i’m *ucking proud of you baby you’ve come a long way .31 .45 0 .99 .99 0 .99 0 .99 .99 .99 0 0 0 .99 0 0 0 .99 0 .37 0 .98 0 0 0 .99
tweeter opens so many new communication channels it’s an absolute pleasure to be so close to the pople that have so much to share .31 .45 .99 .99 .93 .99 .99 0 0 0 .99 0 0 0 .99 0 0 .99 .98 .01 .99 .01 0 .96 0 0 .89
have i mentioned how excited i am to hang out with this weekend no i am so excited .3 .44 0 .99 .99 .99 0 0 .99 .99 .99 0 0 0 .99 0 0 .99 .97 .02 0 0 .99 0 0 0 .99
today i feel thankful for the beautiful things in my life .32 .44 .89 .99 .99 .99 .05 0 0 .99 .99 0 0 0 .98 0 0 .99 .98 .01 .94 .93 .02 .01 .99 0 0
yay my friend just called to tell me everything is set for my birthday this thurs can’t wait at least that’s cheering me up .3 .44 0 .99 .99 .99 .99 0 .99 0 .99 0 0 0 .99 0 0 0 .98 0 0 0 .99 0 0 0 .99
Stylistically less-diverse text
lip/tongue tingling .15 .28 .01 0 .02 0 0 0 0 0 0 0 0 0 0 0 0 0 .15 .01 .27 .01 .01 0 .02 .02 0
satellite server is a mess cleaning up .15 .28 0 0 .04 .01 .68 0 0 0 0 0 0 0 0 0 0 .94 .01 .01 0 0 .03 0 0 .06 .29
having beer with and some latin americans .14 .28 0 0 .28 0 0 0 0 0 .99 0 0 0 0 0 0 .93 .01 .04 0 .01 .39 0 0 .05 0
new blog post and then there was stillness .15 .28 0 0 0 0 0 0 0 0 .01 0 0 0 .01 0 0 .99 .09 .49 0 .02 .21 .56 .1 .73 .05
ahh interview in hours .15 .28 0 0 .56 0 0 0 0 0 0 0 0 0 0 0 0 .99 .74 0 0 .06 .17 0 .41 .28 .03
actually usb 3g card connected via dial-up profile want to share over airport google suggests lots of issues with this config .13 .28 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 .92 .01 0 .02 .14 0 0 .01 .28
is listening to owen play with the university of oregon duck lips riotous .14 .28 0 0 0 0 0 0 0 0 .02 0 0 0 .11 0 0 .06 .09 .8 0 .03 .07 0 0 .01 .94
she may have noticed but you may want to mention it just to make sure and to let her know that others think so .14 .28 .05 0 .02 .01 0 0 0 0 0 0 0 0 0 0 0 .99 .69 .02 .01 .26 .02 .02 .59 0 0
i’m all kinds of culinary .15 .28 0 0 0 0 0 0 0 0 .99 0 0 0 0 0 0 .46 .6 .34 0 .01 .01 0 .91 0 .07
at home with wine and wife finally .14 .28 0 0 .09 .01 0 0 .63 0 .99 0 0 0 0 0 0 .99 .13 .2 0 0 .04 0 .01 .29 0
Table 5: Stylistic diversity of text: we sample the 10% of tweets with the highest (lowest) average style prediction score, and then sample again the 10% with the highest (lowest) standard deviation; the former are stylistically diverse, the latter less diverse. Some offensive words are masked with *.

Appendix E Details on Cross-Style Correlation

In addition, we cluster style types based on the pairwise correlation coefficients. Figure 6 shows agglomerative clusters obtained with the Ward clustering algorithm ward1963hierarchical, with distances measured using Euclidean distance. We observe some reasonable groupings, such as ages (35-44, 45-54, 55-74), negative emotional styles (anger, disgust, sadness, fear), positive affective styles (happiness, dominance, valence, positive, polite), and more.
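This clustering can be reproduced in sketch form with scipy, treating each style's row of correlation coefficients as its feature vector (Euclidean distance, Ward linkage); the toy correlation matrix below is synthetic, with two obvious groups of styles.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_styles(corr, n_clusters):
    """Ward clustering of style types: each style is represented by its
    row of correlation coefficients, compared via Euclidean distance."""
    Z = linkage(corr, method="ward")       # rows of `corr` as observations
    return fcluster(Z, t=n_clusters, criterion="maxclust")

# toy correlation matrix: styles 0/1 pattern together, as do styles 2/3
corr = np.array([[ 1.0,  0.9, -0.8, -0.7],
                 [ 0.9,  1.0, -0.7, -0.8],
                 [-0.8, -0.7,  1.0,  0.9],
                 [-0.7, -0.8,  0.9,  1.0]])
labels = cluster_styles(corr, n_clusters=2)
```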

Figure 8: Pairplot of the pairwise correlation between styles. Best viewed in color.

For more detailed analysis, we also provide the pair-plot distributions in Figure 8. As pointed out earlier, many distributions of single-style prediction scores are heavily skewed toward the two extremes (left for 0 and right for 1), leading to the posterior-calibration issue. Moreover, more causal analyses between styles, controlling for confounding variables, are required.

Appendix F Style Diversity on Personal Styles

Figure 13: Style diversity analysis on personal styles.

Figure 13 shows the style diversity of personal styles (e.g., gender, political view, age, and education level). We find some unreasonable cases in the debate domain (Figure 13(b)), where the predicted proportion of political views is extremely biased toward the left wing, even though the text is written almost equally by both sides (i.e., Clinton and Trump). Again, this shows the limitation of our classifier-based analysis.