This is the Grammarly's Yahoo Answers Formality Corpus
Style transfer is the task of automatically transforming a piece of text in one particular style into another. A major barrier to progress in this field has been a lack of training and evaluation datasets, as well as benchmarks and automatic metrics. In this work, we create the largest corpus for a particular stylistic transfer (formality) and show that techniques from the machine translation community can serve as strong baselines for future work. We also discuss challenges of using automatic metrics.
One key aspect of effective communication is the accurate expression of the style or tone of some content. For example, writing a more persuasive email in a marketing position could lead to increased sales; writing a more formal email when applying for a job could lead to an offer; and writing a more polite note to your future spouse’s parents may put you in a good light. Hovy (1987) argues that by varying the style of a text, people convey more information than is present in the literal meaning of the words. One particularly important dimension of style is formality Heylighen and Dewaele (1999). Automatically changing the style of a given content to make it more formal can be a useful addition to any writing assistance tool.
In the field of style transfer, to date, the only available dataset has been for the transformation of modern English to Shakespeare, and it led to the application of phrase-based machine translation (PBMT) Xu et al. (2012) and neural machine translation (NMT) Jhamtani et al. (2017) models to the task. The lack of an equivalent or larger dataset for any other form of style transfer has blocked progress in this field. Moreover, prior work has mainly borrowed metrics from the machine translation (MT) and paraphrase communities for evaluating style transfer. However, it is not clear if those metrics are the best ones to use for this task. In this work, we address these issues through the following three contributions:
Corpus: We present Grammarly’s Yahoo Answers Formality Corpus (GYAFC), the largest dataset for any style containing a total of 110K informal / formal sentence pairs. Table 1 shows sample sentence pairs.
Benchmarks: We introduce a set of learning models for the task of formality style transfer. Inspired by work in low resource MT, we adapt existing PBMT and NMT approaches for our task and show that they can serve as strong benchmarks for future work.
Metrics: In addition to MT and paraphrase metrics, we evaluate our models along three axes: formality, fluency and meaning preservation using existing automatic metrics. We compare these metrics with their human judgments and show there is much room for further improvement.
|Informal: I’d say it is punk though.|
|Formal: However, I do believe it to be punk.|
|Informal: Gotta see both sides of the story.|
|Formal: You have to consider both sides of the story.|
In this paper, we primarily focus on the informal to formal direction since we collect our dataset for this direction. However, we evaluate our models on the formal to informal direction as well (results are in the supplementary material). All data, model outputs, and evaluation results have been made public (https://github.com/raosudha89/GYAFC-corpus) in the hope that they will encourage more research into style transfer.
In the following two sections we discuss related work and the GYAFC dataset. In §4, we detail our rule-based and MT-based approaches. In §5, we describe our human and automatic metric based evaluation. In §6, we describe the results of our models using both human and automatic evaluation and discuss how well the automatic metrics correlate with human judgments.
Style Transfer with Parallel Data: Sheikha and Inkpen (2011) collect pairs of formal and informal words and phrases from different sources and use a natural language generation system to generate informal and formal texts by replacing lexical items based on user preferences. Xu et al. (2012) (henceforth Xu12) was one of the first works to treat style transfer as a sequence to sequence task. They generate a parallel corpus of 30K sentence pairs by scraping the modern translations of Shakespeare plays and train a PBMT system to translate from modern English to Shakespearean English (https://github.com/cocoxu/Shakespeare). More recently, Jhamtani et al. (2017) show that a copy-mechanism enriched sequence-to-sequence neural model outperforms Xu12 on the same set. In text simplification, the availability of parallel data extracted from English Wikipedia and Simple Wikipedia Zhu et al. (2010) led to the application of PBMT Wubben et al. (2012a) and more recently NMT Wang et al. (2016) models. We take inspiration from both the PBMT and NMT models and apply several modifications to these approaches for our task of transforming the formality style of the text.
control several linguistic style aspects simultaneously by conditioning a recurrent neural network language model on specific style (professional, personal, length) and content (theme, sentiment) parameters. Under NMT models, Sennrich et al. (2016a) control the politeness of the translated text via side constraints, and Niu et al. (2017) control the level of formality of MT output by selecting phrases of a requisite formality level from the k-best list during decoding. In the field of text simplification, more recently, Xu et al. (2016) learn large-scale paraphrase rules using bilingual texts, whereas Kajiwara and Komachi (2016) build a monolingual parallel corpus using sentence similarity based on alignment between word embeddings. Our work differs from these methods in that we mainly address the question of how much leverage we can derive by collecting a large number of informal-formal sentence pairs and building models that learn to transfer style directly using this parallel corpus.
In our work, we reproduce the sentence-level formality classifier introduced in Pavlick and Tetreault (2016) (PT16) to extract informal sentences for GYAFC creation and to automatically evaluate system outputs.
introduce three new automatic style metrics based on cosine similarity, language model and logistic regression that measure the degree to which the output matches the target style. Under human based evaluation, on the other hand, there has been work on a more fine grained evaluation where human judgments were separately collected for adequacy, fluency and style Xu et al. (2012); Niu et al. (2017). In our work, we conduct a more thorough evaluation where we evaluate model outputs on the three criteria of formality, fluency and meaning using both automatic metrics and human judgments.
Yahoo Answers (https://answers.yahoo.com/answer), a question answering forum, contains a large number of informal sentences and allows redistribution of data. Hence, we use the Yahoo Answers L6 corpus (https://webscope.sandbox.yahoo.com/catalog.php?datatype=l) to create our GYAFC dataset of informal and formal sentence pairs. In order to ensure a uniform distribution of data, we remove sentences that are questions, contain URLs, or are shorter than 5 words or longer than 25 words. After these preprocessing steps, 40 million sentences remain. The Yahoo Answers corpus consists of several different domains like Business, Entertainment & Music, Travel, Food, etc. PT16 show that the formality level varies significantly across different genres. In order to control for this variation, we work with two specific domains that contain the most informal sentences and show results on training and testing within those categories. We use the formality classifier from PT16 to identify informal sentences. We train this classifier on the Answers genre of the PT16 corpus, which consists of nearly 5,000 randomly selected sentences from Yahoo Answers manually annotated on a scale of -3 (very informal) to 3 (very formal) (http://www.seas.upenn.edu/~nlp/resources/formality-corpus.tgz). We find that the domains of Entertainment & Music and Family & Relationships contain the most informal sentences and create our GYAFC dataset using these domains. Table 2 shows the number of formal and informal sentences in all of the Yahoo Answers corpus and within the two selected domains. Sentences with a score less than 0 are considered informal and sentences with a score greater than 0 are considered formal.
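The preprocessing filters and the score threshold described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names, the URL pattern, and the heuristic of treating a trailing question mark as the sign of a question are all assumptions made for the example.

```python
import re

# Assumed URL heuristic: anything starting with http(s):// or www.
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)


def keep_sentence(sentence, min_words=5, max_words=25):
    """Return True if the sentence survives the filters described in the text."""
    text = sentence.strip()
    if text.endswith("?"):  # assumed heuristic for detecting questions
        return False
    if URL_PATTERN.search(text):  # drop sentences that contain a URL
        return False
    n_words = len(text.split())  # whitespace tokenization, an assumption
    return min_words <= n_words <= max_words


def formality_label(score):
    """Map a classifier score on the -3..3 scale to a coarse label."""
    if score < 0:
        return "informal"
    if score > 0:
        return "formal"
    return "neutral"  # a score of exactly 0 falls on neither side of the threshold
```

A sentence would then be a candidate for the informal pool only if `keep_sentence` returns `True` and the classifier score maps to `"informal"`.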
|Domain||Total||Informal||Formal|
|All Yahoo Answers||40M||24M||16M|
|Entertainment & Music||3.8M||2.7M||700K|
|Family & Relationships||7.8M||5.6M||1.8M|
Next, we randomly sample a subset of 53,000 informal sentences each from the Entertainment & Music (E&M) and Family & Relationships (F&R) categories and collect one formal rewrite per sentence using Amazon Mechanical Turk. The workers are presented with detailed instructions, as well as examples. To ensure quality control, four experts, two of whom are the authors of this paper, reviewed the rewrites of the workers and rejected those that they felt did not meet the required standards. They also provided the workers with reasons for rejection so that they would not repeat the same mistakes. Any worker who repeatedly performed poorly was eventually blocked from doing the task. We use this train set to train our models for the style transfer tasks in both directions.
|Informal to Formal||Formal to Informal|
Since we want our tune and test sets to be of higher quality compared to the train set, we recruit a set of 85 expert workers for this annotation who had a 100% acceptance rate for our task and who had previously done more than 100 rewrites. Further, we collect multiple references for the tune/test set to adapt PBMT tuning and evaluation techniques to our task. We collect four different rewrites per sentence using our expert workers by randomly assigning sentences to the experts until four rewrites for each sentence are obtained. (Note that the four rewrites are therefore not from the same four workers for each sentence.) To create our tune and test sets for the informal to formal direction, we sample an additional 3,000 informal sentences for our tune set and 1,500 sentences for our test set from each of the two domains.
To create our tune and test sets for the formal to informal direction, we start with the same tune and test split as the first direction. For each formal rewrite from the first direction (out of the four, we pick the one with the largest edit distance from the original informal sentence; the rationale is explained in Section 3.2), we collect three different informal rewrites using our expert workers as before. These three informal rewrites along with the original informal sentence become our set of four references for this direction of the task. Table 3 shows the exact number of sentences in our train, tune and test sets.
The following quantitative and qualitative analyses are aimed at characterizing the changes between the original informal sentence and its formal rewrite in the GYAFC train split (we observe similar patterns on the tune and test sets). We present our analysis here on only the E&M domain data since we observe similar patterns in F&R.
Quantitative Analysis: While rewriting sentences more formally, humans tend to make a wide range of lexical/character-level edits. In Figure 2, we plot the distribution of the character-level Levenshtein edit distance between the original informal and the formal rewrites in the train set and observe a standard deviation of σ with a mean of μ. Next, we look at the difference in the formality level of the original informal and the formal rewrites in GYAFC. We find that the classifier trained on the Answers genre of the PT16 dataset correlates poorly (Spearman ρ = 0.38) with human judgments when tested on our domain specific datasets. Hence, we collect formality judgments on a scale of -3 to +3, similar to PT16, for an additional 5000 sentences each from both domains and obtain a formality classifier with higher correlation (Spearman ρ = 0.56). We use this retrained classifier for our evaluation in §5 as well.
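As a rough illustration of the edit-distance analysis, the sketch below computes the character-level Levenshtein distance for each informal/formal pair and summarizes the distribution by its mean and standard deviation. The helper names are hypothetical, and the choice of the population standard deviation is an assumption; the text does not say which estimator was used.

```python
from statistics import mean, pstdev


def levenshtein(a, b):
    """Character-level Levenshtein distance using a rolling DP row."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner loop over the shorter string
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def edit_distance_stats(pairs):
    """Return (mean, population std) of edit distances over (informal, formal) pairs."""
    distances = [levenshtein(src, tgt) for src, tgt in pairs]
    return mean(distances), pstdev(distances)
```

Calling `edit_distance_stats` on a list of `(informal, formal)` tuples from the train split would give the two summary statistics discussed above.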
In Figure 2, we plot the distribution of the formality scores on the original informal sentences and their formal rewrites in the train set and observe an increase in the mean formality score as we go from the informal sentences to their formal rewrites.
As compared to edit distance and formality, we observe a much lower variation in sentence lengths, with the mean slightly increasing from the informal sentences to their formal rewrites in the train set.
Qualitative Analysis: To understand what stylistic choices differentiate formal from informal text, we perform an analysis similar to PT16 and look at 50 rewrites from both domains and record the frequency of the types of edits that workers made when creating a more formal sentence (examples of edits are in the supplementary material). In contrast to PT16, we observe a higher percentage of phrasal paraphrases (47%), edits to punctuation (40%) and expansion of contractions (12%). This is reflective of our sentences coming from very informal domains of Yahoo Answers. Similar to PT16, we also observe capitalization (46%) and normalization (10%).
We experiment with three main classes of approaches: a rule-based approach, PBMT and NMT. Inspired by work in low resource machine translation, we apply several modifications to the standard PBMT and NMT models and create a set of strong benchmarks for the style transfer community. We apply these models to both directions of style transfer: informal to formal and formal to informal. In our description, we refer to the two styles as source and target. We summarize the models below and direct the reader to supplementary material for further detail.
Corresponding to the categories of edits described in §3.2, we develop a set of rules to automatically make an informal sentence more formal: we capitalize the first word and proper nouns, remove repeated punctuation, expand contractions using a handcrafted list, etc. For the formal to informal direction, we design a similar set of rules in the opposite direction.
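A few of these rules can be sketched in a handful of lines. The example below is only an illustration under stated assumptions: the contraction list is a tiny hypothetical stand-in for the handcrafted one, the function names are invented, and proper-noun capitalization is left out because it would need a name list or tagger that the text does not describe.

```python
import re

# Hypothetical, deliberately small contraction list (the real list is handcrafted).
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "didn't": "did not",
    "haven't": "have not",
    "it's": "it is",
    "i'm": "I am",
}


def expand_contractions(text):
    """Replace known contractions, leaving unknown ones untouched."""
    def replace(match):
        word = match.group(0)
        return CONTRACTIONS.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+'[A-Za-z]+", replace, text)


def collapse_repeated_punctuation(text):
    """Turn runs such as '!!!!' or '....' into a single mark."""
    return re.sub(r"([!?.,])\1+", r"\1", text)


def capitalize_first_word(text):
    return text[:1].upper() + text[1:]


def formalize(text):
    """Apply the three rules in sequence to one sentence."""
    text = text.strip()
    text = collapse_repeated_punctuation(text)
    text = expand_contractions(text)
    return capitalize_first_word(text)
```

For instance, `formalize("i didn't know that!!!!")` yields `"I did not know that!"`. Rules for the opposite direction would invert these mappings.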
Phrase-based machine translation models have had success in the fields of machine translation, style transfer (Xu12) and text simplification Wubben et al. (2012b); Xu et al. (2016). Inspired by work in low resource machine translation, we use a combination of training regimes to develop our model. We train on the output of the rule-based approach when applied to GYAFC. This is meant to force the PBMT model to learn generalizations outside the rules. To increase the data size, we use self-training Ueffing (2006), where we use the PBMT model to translate the large number of in-domain sentences from GYAFC belonging to the source style and use the resultant output to retrain the PBMT model. Using sub-selection, we only select rewrites that have a Levenshtein edit distance of over 10 characters when compared to the source, to encourage the model to be less conservative. Finally, we upweight the rule-based GYAFC data via duplication Sennrich et al. (2016b). For our experiments, we use Moses Koehn et al. (2007). We train a 5-gram language model using KenLM Heafield et al. (2013), and use target style sentences from GYAFC and the sub-sampled target style sentences from out-of-domain Yahoo Answers, as in Moore and Lewis (2010), to create a large language model.
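The sub-selection and upweighting steps can be illustrated with a short sketch. It assumes the self-training output is available as a list of `(source, rewrite)` string pairs; the function names and the `gold_copies` duplication factor are hypothetical, since the text does not state how many times the gold data is duplicated.

```python
def levenshtein(a, b):
    """Character-level Levenshtein distance using a rolling DP row."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def sub_select(pairs, min_distance=10):
    """Keep only rewrites that differ from their source by more than min_distance characters."""
    return [(src, out) for src, out in pairs if levenshtein(src, out) > min_distance]


def build_training_set(gold_pairs, synthetic_pairs, gold_copies=2, min_distance=10):
    """Duplicate the gold pairs (upweighting) and append the sub-selected synthetic pairs."""
    selected = sub_select(synthetic_pairs, min_distance)
    return list(gold_pairs) * gold_copies + selected
```

Filtering on a minimum edit distance discards near-copies of the source, which is what pushes the retrained model away from overly conservative rewrites.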
While encoder-decoder based neural network models have become quite successful for MT Sutskever et al. (2014); Bahdanau et al. (2014); Cho et al. (2014), the field of style transfer has not yet been able to fully take advantage of these advances owing to the lack of availability of large parallel data. With GYAFC we can now show how well NMT techniques fare for style transfer. We experiment with three NMT models:
NMT Baseline: Our baseline model is a bi-directional LSTM Hochreiter and Schmidhuber (1997) encoder-decoder model with attention Bahdanau et al. (2014) (details are in the supplementary material). We pretrain the input word embeddings on Yahoo Answers using GloVe Pennington et al. (2014). As in our PBMT based approach, we train our NMT baseline model on the output of the rule-based approach when applied to GYAFC.
NMT Copy: Jhamtani et al. (2017) introduce a copy-enriched NMT model for style transfer to better handle stretches of text which should not be changed. We incorporate this mechanism into our NMT Baseline.
NMT Combined: The size of our parallel data is smaller than the size typically used to train NMT models. Motivated by this fact, we propose several variants of the baseline models that we find help mitigate this issue. We augment the data used to train NMT Copy via two techniques: 1) we run the PBMT model on additional source data, and 2) we use back-translation Sennrich et al. (2016c) of the PBMT model to translate the large number of in-domain target style sentences from GYAFC. To balance the over one million artificially generated pairs from the respective techniques, we upweight the rule-based GYAFC data via duplication (training data sizes for the different methods are summarized in the supplementary material).
As discussed earlier, there has been very little research into best practices for style transfer evaluation. Only a few works have included a human evaluation Xu et al. (2012); Jhamtani et al. (2017), and automatic evaluations have employed BLEU or PINC Xu et al. (2012); Chen and Dolan (2011), which have been borrowed from other fields and not vetted for this task. In our work, we conduct a more thorough and detailed evaluation using both humans and automatic metrics to assess transformations. Inspired by work in the paraphrase community Callison-Burch (2008), we solicit ratings on how formal, how fluent and how meaning-preserving a rewrite is. Additionally, we look at the correlation between the human judgments and the automatic metrics.
We perform human-based evaluation to assess model outputs on four criteria: formality, fluency, meaning and overall. For a subset of 500 sentences from the test sets of both the Entertainment & Music and Family & Relationships domains, we collect five human judgments per sentence per criterion using Amazon Mechanical Turk as follows:
Formality: Following PT16, workers rate the formality of the source style sentence, the target style reference rewrite and the target style model outputs on a discrete scale of -3 to +3 described as: -3: Very Informal, -2: Informal, -1: Somewhat Informal, 0: Neutral, 1: Somewhat Formal, 2: Formal and 3: Very Formal.
Fluency: Following Heilman et al. (2014), workers rate the fluency of the source style sentence, the target style reference rewrite and the target style model outputs on a discrete scale of 1 to 5 described as: 5: Perfect, 4: Comprehensible, 3: Somewhat Comprehensible, 2: Incomprehensible. We additionally provide an option of 1: Other for sentences that are incomplete or just a fragment.
Meaning Preservation: Following the annotation scheme developed for the Semantic Textual Similarity (STS) dataset Agirre et al. (2016), given two sentences i.e. the source style sentence and the target style reference rewrite or the target style model output, workers rate the meaning similarity of the two sentences on a scale of 1 to 6 described as: 6: Completely equivalent, 5: Mostly equivalent, 4: Roughly equivalent, 3: Not equivalent but share some details, 2: Not equivalent but on same topic, 1: Completely dissimilar.
Overall Ranking: In addition to the fine-grained human judgments, we collect judgments to assess the overall ranking of the systems. Given the original source style sentence, the target style reference rewrite and the target style model outputs, we ask workers to rank the rewrites in the order of their overall formality, taking into account both fluency and meaning preservation. We then rank the model using the equation below:
$$\mathrm{rank}(model) = \frac{1}{|S|} \sum_{s \in S} \frac{1}{|J|} \sum_{j \in J} \mathrm{rank}(s_{model}, j)$$
where $model$ is one of our models, $S$ is a subset of 500 test set sentences, $J$ is the set of five judgments, $s_{model}$ is the model rewrite for sentence $s$, and $\mathrm{rank}(s_{model}, j)$ is the rank of $s_{model}$ in judgment $j$.
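The quantities defined above suggest that a model's score is its rank averaged first over the judgments for a sentence and then over the sentences. A minimal sketch of that computation follows; the nested-list data layout and the function name are assumptions made for the example.

```python
def model_rank(judgments, model):
    """Average rank of `model` over sentences and judgments.

    `judgments` is a list with one entry per sentence; each entry is a list of
    judgments, and each judgment maps a system name to the rank a worker gave
    to that system's rewrite (1 = best).
    """
    total = 0.0
    for sentence_judgments in judgments:
        ranks = [judgment[model] for judgment in sentence_judgments]
        total += sum(ranks) / len(ranks)  # mean over judgments for this sentence
    return total / len(judgments)        # mean over sentences


# Toy data: two sentences, two judgments each, two systems.
toy = [
    [{"PBMT": 1, "Rule-based": 2}, {"PBMT": 2, "Rule-based": 1}],
    [{"PBMT": 1, "Rule-based": 2}, {"PBMT": 1, "Rule-based": 2}],
]
pbmt_score = model_rank(toy, "PBMT")        # 1.25
rule_score = model_rank(toy, "Rule-based")  # 1.75
```

Under this reading a lower score is better, since it means the system's rewrites were ranked closer to first place on average.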
The two authors of the paper reviewed these human judgments and found that in the majority of cases the annotations looked correct. But as is common in any such crowdsourced data collection process, there were some errors, especially in the overall ranking of the systems.
We cover each of the human evaluations with a corresponding automatic metric:
Formality: We use the formality classifier described in PT16. We find that the classifier trained on the answers genre of PT16 dataset does not perform well when tested on our datasets. Hence, we collect formality judgments for an additional 5000 sentences and use the formality classifier re-trained on this in-domain data.
Fluency: We use the reimplementation (https://github.com/cnap/grammaticality-metrics/tree/master/heilman-et-al) of Heilman et al. (2014) (H14 in Table 4), which is a statistical model for predicting the grammaticality of a sentence on a scale of 0 to 4, previously shown to be effective for other generation tasks like grammatical error correction Napoles et al. (2016).
Meaning Preservation: Modeling semantic similarity at a sentence level is a fundamental language processing task, and one that is a wide open field of research. Recently, He et al. (2015) (He15 in Table 4) developed a convolutional neural network based sentence similarity measure. We use their off-the-shelf implementation (https://github.com/castorini/MP-CNN-Torch) to train a model on the STS dataset and use it to measure the meaning similarity between the original source style sentence and its target style rewrite (both reference and model outputs).
In this section, we discuss how well the five models perform in the informal to formal style transfer task using human judgments (§6.1) and automatic metrics (§6.2), examine the correlation of the automatic metrics with human judgments to determine the efficacy of the metrics (§6.3), and present a manual analysis (§6.4). We randomly select 500 sentences from each test set and run all five models. We use the entire train and tune split for training and tuning. We discuss results only on the E&M domain and list results on the F&R domain in the supplementary material.
Table 4 shows the results for human (§6.1) and automatic (§6.2) evaluation of model rewrites. For all metrics except TERp, a higher score is better. For each of the automatic metrics, we evaluate against four human references. The row ‘Original Informal’ contains the scores when the original informal sentence is compared with the four formal reference rewrites. Comparing the model scores to this score helps us understand how much closer the model outputs are to the formal reference rewrites than the original informal sentence is.
The columns marked ‘Human’ in Table 4 show the human judgments for the models on the three separate criteria of formality, fluency and meaning collected using the process described in Section 5.1 (out of the four reference rewrites, we pick one at random to show to Turkers). The NMT Baseline and Copy models beat others on the formality axis by a significant margin. Only the NMT Combined model achieves a statistically higher fluency score when compared to the rule-based baseline model. As expected, the rule-based model is the most meaning preserving since it is the most conservative. Figure 3 shows the trend in the four leading models along formality and meaning for varying lengths of the source sentence. NMT Combined beats PBMT on formality for shorter lengths whereas the trend reverses as the length increases. PBMT generally preserves meaning more than NMT Combined. We find that the fluency scores for all models decrease as the sentence length increases, which is similar to the trend generally observed with machine translation based approaches.
Since a good style transfer model is one that attains a balanced score across all three axes, we evaluate the models on a combination of these metrics (we recalibrate the scores to normalize for different ranges), shown under the column ‘Combined’ in Table 4. NMT Combined is the only model having a combined score statistically greater than the rule-based approach.
Finally, Table 5 shows the overall rankings of the models from best to worst in both domains. PBMT and NMT Combined models beat the rule-based model although not significantly in the E&M domain but significantly in the F&R domain. Interestingly, the rule-based approach attains third place with a score significantly higher than NMT Copy and NMT Baseline models. It is important to note here that while such a rule-based approach is relatively easy to craft for the formality style transfer task, the same may not be true for other styles like politeness or persuasiveness.
|Entertainment & Music||Family & Relationships|
|(2.03*) Reference||(2.13*) Reference|
|(2.47) PBMT||(2.38*) PBMT|
|(2.48) NMT Combined||(2.38*) NMT Combined|
|(2.54) Rule-based||(2.56) Rule-based|
|(3.03*) NMT Copy||(2.72*) NMT Copy|
|(3.03*) NMT Baseline||(2.79*) NMT Baseline|
|Entertainment & Music|
|Original Informal||Wow , I am very dumb in my observation skills ……|
|Reference Formal||I do not have good observation skills .|
|Rule-based||Wow , I am very dumb in my observation skills .|
|PBMT||Wow , I am very dumb in my observation skills .|
|NMT Baseline||I am very foolish in my observation skills .|
|NMT Copy||Wow , I am very foolish in my observation skills .|
|NMT Combined||I am very unintelligent in my observation skills .|
|Family & Relationship|
|Original Informal||i hardly everrr see him in school either usually i see hima t my brothers basketball games .|
|Reference Formal||I hardly ever see him in school . I usually see him with my brothers playing basketball .|
|Rule-based||I hardly everrr see him in school either usually I see hima t my brothers basketball games .|
|PBMT||I hardly see him in school as well, but my brothers basketball games .|
|NMT Baseline||I rarely see him in school , either I see him at my brother ’s basketball games .|
|NMT Copy||I hardly see him in school either , usually I see him at my brother ’s basketball games .|
|NMT Combined||I rarely see him in school either usually I see him at my brothers basketball games .|
Under automatic metrics, the formality and meaning scores align with the human judgments, with the NMT Baseline and NMT Copy winning on formality and rule-based winning on meaning. The fluency score of the NMT Baseline is the highest, in contrast to human judgments where NMT Combined wins. This discrepancy could be due to H14 being trained on essays, which contain sentences of a more formal genre compared to Yahoo Answers. In fact, the fluency classifier scores the formal reference quite low as well. Under overall metrics, PBMT and NMT Combined models beat other models as per BLEU (significantly) and TERp (not significantly). NMT Baseline and NMT Copy win over other models as per PINC, which can be explained by the fact that PINC measures lexical dissimilarity with the source and NMT models tend towards making more changes. Although such an analysis is useful, for a more thorough understanding of these metrics, we next look at their correlation with human judgments.
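To make concrete why PINC rewards models that change more of the input, here is a small sketch following the definition in Chen and Dolan (2011): for each n-gram order, take the fraction of candidate n-grams that do not appear in the source, then average over orders. Whitespace tokenization, the default of four n-gram orders, and the choice to skip orders for which the candidate has no n-grams are assumptions of this sketch.

```python
def ngrams(tokens, n):
    """Set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def pinc(source, candidate, max_n=4):
    """Lexical dissimilarity between source and candidate, in [0, 1]."""
    src_tokens = source.split()
    cand_tokens = candidate.split()
    scores = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand_tokens, n)
        if not cand_ngrams:
            continue  # candidate too short for this order (assumed handling)
        overlap = len(ngrams(src_tokens, n) & cand_ngrams)
        scores.append(1 - overlap / len(cand_ngrams))
    return sum(scores) / len(scores) if scores else 0.0
```

An unchanged copy of the source scores 0 and a rewrite sharing no words with it scores 1, so the metric on its own says nothing about whether meaning was preserved.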
We report the Spearman rank correlation coefficient between automatic metrics and human judgments in Table 6. For formality, fluency and meaning, the correlation is with their respective human judgments, whereas for BLEU, TERp and PINC, the correlation is with the overall ranking.
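For reference, the Spearman coefficient is the Pearson correlation computed on ranks rather than raw values. The dependency-free sketch below assigns tied values the average of their ranks; the function names are illustrative, and in practice a library routine such as SciPy's `spearmanr` would normally be used instead.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores, human_scores):
    """Spearman rank correlation between two equally long score lists."""
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))
```

When the human signal is a rank (where lower is better) and the metric is a score (where higher is better), a useful metric should show a negative coefficient.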
We see that the formality and the fluency metrics correlate moderately well while the meaning metric correlates comparatively poorly. To be fair, the He15 classifier was trained on the STS dataset which contains more formal writing than informal. BLEU correlates moderately well (better than what Xu12 observed for the Shakespeare task) whereas the correlation drops for TERp. PINC, on the other hand, correlates very poorly with a positive correlation with rank when it should have a negative correlation with rank, just like BLEU. This sheds light on the fact that PINC, on its own, is not a good metric for style transfer since it prefers lexical edits at the cost of meaning changes. In the Shakespeare task, Xu12 did observe a higher correlation with PINC (0.41) although the correlation was not with overall system ranking but rather only on the style metric. Moreover, in the Shakespeare task, changing the text is more favorable than in formality.
The prior evaluations reveal the relative performance differences between approaches. Here, we identify trends per and between approaches. We sample 50 informal sentences total from both domains and then analyze the outputs from each model. We present sample sentences in Table 15.
The NMT Baseline and NMT Copy tend to have the most variance in their performance. This is likely due to the fact that they are trained on only 50K sentence pairs, whereas the other models are trained on much more data. For shorter sentences, these models make some nice formal transformations like from ‘very dumb’ to ‘very foolish’. However, for longer sentences, these models make drastic meaning changes and drop some content altogether (see examples in Table 15). On the other hand, the PBMT and NMT Combined models have lower variance in their performance. They make changes more conservatively but when they do, they are usually correct. Thus, most of the outputs from these two models are usually meaning preserving but at the expense of a lower formality score improvement.
In most examples, all models are good at removing very informal words like ‘stupid’, ‘idiot’ and ‘hell’, with PBMT and NMT Combined models doing slightly better. All models struggle when the original sentence is very informal or disfluent. They all also struggle with sentence completions that humans seem to be very good at. This might be because humans assume a context when absent, whereas the models do not. Unknown tokens, either real words or misspelled words, tend to wreak havoc on all approaches. In most cases, the models simply did not transform that section of the sentence, or removed the unknown tokens. Most models are effective at low-level changes such as writing out numbers, inserting commas, and removing common informal phrases.
The goal of this paper was to move the field of style transfer forward by creating a large training and evaluation corpus to be made public, showing that adapting MT techniques to this task can serve as strong baselines for future work, and analyzing the usefulness of existing metrics for overall style transfer as well as three specific criteria of automatic style transfer evaluation. We view this work as rigorously expanding on the foundation set by Xu12 five years earlier. It is our hope that with a common test set, the field can finally benchmark approaches which do not require parallel data.
We found that while the NMT systems perform well given automatic metrics, humans had a slight preference for the PBMT approach. That being said, two of the neural approaches (NMT Baseline and Copy) often made successful changes and larger rewrites that the other models could not. However, this often came at the expense of a meaning change.
We also introduced new metrics and vetted all metrics using comparison with human judgments. We found that previously-used metrics did not correlate well with human judgments, and thus should be avoided in system development or final evaluation. The formality and fluency metrics correlated best and we believe that some combination of these metrics with others would be the best next step in the development of style transfer metrics. Such a metric could then in turn be used to optimize MT models. Finally, in this work we focused on one particular style, formality. The long term goal is to generalize the methods and metrics to any style.
The authors would like to thank Yahoo Research for making their data available. The authors would also like to thank Junchao Zheng and Claudia Leacock for their help in the data creation process, Courtney Napoles for providing the fluency scores, Marcin Junczys-Dowmunt, Rico Sennrich, Ellie Pavlick, Maksym Bezva, Dimitrios Alikaniotis and Kyunghyun Cho for helpful discussion and the three anonymous reviewers for their useful comments and suggestions.
In this supplementary material, we provide additional details on the dataset (§8.1), models (§8.2) and results (§8.3) introduced in the main paper. In §8.3, we also discuss the results of our models on the formal to informal task.
In Section 3.2 of the main paper, we perform a qualitative analysis to understand the types of edits people made to make a sentence more formal. Table 8 shows the frequency of each type of edit, along with example sentence pairs. In addition, for the subset of categories for which we can count the edits automatically, we show the frequency of edits on the entire train split of our GYAFC dataset. There we observe a higher percentage of capitalization and punctuation edits than in the manual count, and a much higher percentage of normalizations.
|Category||Manual||Auto||Original Informal||Formal Rewrite|
|Paraphrase||47%||–||he iss wayyyy hottt||He is very attractive.|
|Capitalization||46%||51%||yes, exept for episode iv.||Yes, but not for episode IV.|
|Punctuation||40%||69%||I’ve watched it and it is AWESOME!!!!||I viewed it and I believe it is a quality|
|Delete fillers||26%||–||Well… Do you talk to that someone much?||do you talk to that person often?|
|Completion||15%||–||Haven’t seen the tv series, but R.O.D.||I have not seen the television series, however I have seen the R.O.D|
|Spelling||14%||–||that page did not give me viroses (i think)||I don’t think that page gave me viruses.|
|Contractions||12%||8%||I didn’t know they had an HBO in the 80’s||I did not know HBO existed in the 1980s.|
|Normalization||10%||61%||my exams r not over yet||My exams are not over yet.|
|Lowercase||7%||8%||But you will DEFINALTELY know when you are in love!||You will definitely know when you are in love.|
|Split Sentences||4%||–||it wouldnt be a word, it would be me singing operah.||It would not be a word. It would be a singing opera.|
|Repetitions||2%||5%||i’d find out what realllllllllllllly happened to marilyn monroe||I would determine what really happened to Marilyn Monroe.|
We use the analysis described in Section 3.2 of the main paper to construct the following set of rules to automatically make an informal sentence more formal:
Capitalization: We capitalize the first letter of a sentence, we capitalize the pronoun ‘I’ and we capitalize proper nouns by identifying words with parts of speech NNP or NNPS.
Lowercase words in all upper case: In informal sentences, words are often written entirely in capitals for emphasis, e.g. “ARE YOU KIDDING ME????” We lowercase such sentences or words.
Expand contractions: Informal sentences contain contractions like ‘wasn’t’, ‘haven’t’, etc. We handcraft a list of expansions for all such contractions.
Replace slang words: Informal sentences contain slang word usage like ‘juz’, ‘wanna’, etc. We handcraft a list of slang replacements.
Replace swear words: Informal sentences frequently contain swear words. We handcraft a list of swear words and replace all but their first character with asterisks. For example, ‘suck’ is replaced with ‘s***’.
Remove character repetition: Informal sentences contain several instances of repeated characters for emphasis. For example, ‘nooooo’, ‘yayyyyy’, ‘!!!!’, ‘???’. We use regular expressions to replace such repeated occurrences with a single occurrence.
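The rules above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lookup tables `CONTRACTIONS`, `SLANG` and `SWEAR` are tiny hypothetical stand-ins for the handcrafted lists, the threshold of three or more repeated characters is an assumption made to avoid damaging ordinary words such as ‘good’, and proper-noun capitalization is omitted because it requires a part-of-speech tagger.

```python
import re

# Hypothetical stand-ins for the handcrafted lists described above.
CONTRACTIONS = {"wasn't": "was not", "haven't": "have not",
                "didn't": "did not", "i'd": "I would"}
SLANG = {"juz": "just", "wanna": "want to", "r": "are", "u": "you"}
SWEAR = {"suck"}

# A token is split into a word part and any trailing punctuation.
TOKEN_RE = re.compile(r"^([A-Za-z'’]+)([^A-Za-z'’]*)$")


def remove_repetition(text):
    """Collapse runs of 3+ identical letters, '!' or '?' to one occurrence.

    The threshold of 3 is an assumption of this sketch.
    """
    return re.sub(r"([A-Za-z!?])\1{2,}", r"\1", text)


def formalize_token(token):
    """Apply the word-level rules to a single whitespace-separated token."""
    match = TOKEN_RE.match(token)
    if not match:
        return token
    word, trailing = match.groups()
    # Lowercase words written entirely in capitals (the pronoun 'I' is kept).
    if len(word) > 1 and word.isupper():
        word = word.lower()
    key = word.lower().replace("’", "'")
    if key in CONTRACTIONS:
        word = CONTRACTIONS[key]
    elif key in SLANG:
        word = SLANG[key]
    elif key in SWEAR:
        # Keep the first character and mask the rest with asterisks.
        word = word[0] + "*" * (len(word) - 1)
    elif key == "i":
        word = "I"
    return word + trailing


def formalize(sentence):
    """Apply all rules, then capitalize the first letter of the sentence."""
    text = remove_repetition(sentence)
    result = " ".join(formalize_token(t) for t in text.split())
    return result[:1].upper() + result[1:]
```

With these toy lists, `formalize("my exams r not over yet")` returns `"My exams are not over yet"`. The order of the rules matters: repetition is removed first so that elongated words can still match the lookup tables.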
The rule-based model for the second direction of style transfer, i.e. from formal to informal, consists of the same rules as above applied in reverse. The capitalization, contraction and slang rules are always applied, whereas the uppercasing and character repetition rules are applied in proportion to their occurrence in the GYAFC train split.
The different ideas combined together to obtain the PBMT model in Section 4.2 of the main paper are described in detail below:
PBMT on rule-based: When we use the 50K sentence pairs in our GYAFC train set to train a baseline PBMT model, we observe that the model mainly learns to replicate the rules we crafted in our rule-based approach. Hence, in order to force the PBMT model to learn generalizations beyond the rules, we train the PBMT model on the output of our rule-based approach such that the source side of the parallel data is now the output of the rule-based approach. For all of our subsequent models we use this parallel data.
Self-training: The amount of parallel data in our formality dataset is orders of magnitude smaller than the amount typically used for training translation models. We therefore increase the size of our train set by way of self-training: we use the PBMT model to translate the large number of in-domain sentences from GYAFC belonging to the source style and use the resulting output to retrain the PBMT model.
Sub-selection using Edit Distance: A large portion of the training data obtained via self-training consists of parallel sentences where the two sides are almost identical. In order to push the PBMT model towards translations that involve a higher number of edits, we sub-select the additional training data generated by self-training to include only those pairs where the edit distance between the two sides is more than 10. Further, to ensure equal proportions of the original parallel data and the additional data, we up-weight the original parallel data via duplication.
Larger Language Model with Data Selection: The Yahoo Answers corpus contains a large number of target style sentences spanning different domains that we could potentially use to train a larger language model, but at the cost of domain mismatch. To sub-sample from this large out-of-domain data, we use an intelligent data selection method and train a language model on sentences that are closer to the target style in-domain data.
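The edit-distance sub-selection and up-weighting steps described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say whether the edit distance is computed over characters or tokens, so `edit_distance` accepts any sequence (pass a string for characters or a list for tokens), and the function names and the rounding used to pick the number of duplicated copies are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def sub_select(pairs, threshold=10):
    """Keep only self-trained pairs whose two sides differ by more than threshold."""
    return [(src, tgt) for src, tgt in pairs if edit_distance(src, tgt) > threshold]


def combine_with_upweighting(original_pairs, extra_pairs):
    """Duplicate the original data so it roughly matches the extra data in size.

    Rounding to a whole number of copies is an assumption of this sketch.
    """
    original_pairs, extra_pairs = list(original_pairs), list(extra_pairs)
    if not original_pairs or not extra_pairs:
        return original_pairs + extra_pairs
    copies = max(1, round(len(extra_pairs) / len(original_pairs)))
    return original_pairs * copies + extra_pairs
```

A pair such as `("i luv u", "i luv u .")` has a character-level distance of 2 and would be dropped by `sub_select`, which is the intended effect of discarding near-identical pairs.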
Table 10 contains the approximate sizes of the training data used in the main models across the two domains. Under the “Combined” models, the first part is duplication of the GYAFC train split, the second part is additional data obtained via self-training with sub-selection and the third part is the additional data obtained via back-translation for the “NMT Combined model”.
We use the OpenNMT-py Klein et al. (2017) toolkit with default parameters, a vocabulary size of 50K and embeddings of size 300. At test time, we replace unknown tokens with the source token that has the highest attention weight. The input word embeddings are pretrained on Yahoo Answers using GloVe Pennington et al. (2014).
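The unknown-token replacement step can be sketched in plain Python, independently of the toolkit's own implementation. The function name, the `<unk>` symbol and the layout of `attention` (one row of source-position weights per output token) are assumptions made for illustration.

```python
def replace_unknowns(output_tokens, source_tokens, attention, unk_token="<unk>"):
    """Replace each unknown output token with the most-attended source token.

    attention[i][j] is assumed to hold the attention weight that output
    position i places on source position j.
    """
    result = []
    for i, token in enumerate(output_tokens):
        if token == unk_token:
            weights = attention[i]
            # Index of the source position with the highest attention weight.
            best = max(range(len(source_tokens)), key=lambda j: weights[j])
            token = source_tokens[best]
        result.append(token)
    return result
```

For example, with source tokens `["i", "luv", "marilyn"]`, output tokens `["I", "love", "<unk>"]` and a final attention row of `[0.1, 0.2, 0.7]`, the unknown token is replaced by `"marilyn"`. Copying the source token verbatim means misspelled source words pass through unchanged, which is consistent with the observation that unknown tokens are often left untransformed.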
|(2.33) PBMT Combined||(2.24) Rule-based|
|(2.40) NMT Combined||(2.28) PBMT Combined|
|(2.44) Rule-based||(2.36) NMT Combined|
|(2.52) Reference||(2.44) Reference|
|(2.69*) NMT Copy||(2.50*) NMT Baseline|
|(2.72*) NMT Baseline||(2.53*) NMT Copy|
In the main paper, we include results for only the E&M domain. Here we discuss the results on the F&R domain. Table 13, similar to Table 4 in the main paper, shows the results of the models on 500 test sentences evaluated using both human judgments and automatic metrics. The main observations regarding model performance across all metrics are similar to those for the E&M domain. One difference is that the formality scores of the original informal sentences are higher in F&R than in E&M, and consequently the formality scores of the formal rewrites from both human references and model outputs are also higher than in E&M.
|Entertainment & Music|
|Family & Relationships|
In the main paper, we focus on the informal to formal direction of style transfer. In this section we discuss the results of our models on the other direction. It should be noted that this experiment differs fundamentally from the first direction: instead of identifying formal sentences from Yahoo Answers and collecting their informal rewrites, we reuse the data created for the first direction.
In Table 14, we show the results of the five main models on the formal to informal task. The main observation is that, in contrast to the first direction, the rule-based model beats all other models across all three criteria of formality, fluency and meaning according to both human judgments and automatic metrics (with the exception of the automatic meaning metric, where PBMT Combined beats rule-based). NMT Combined and PBMT Combined win according to BLEU and TERp, and NMT Baseline wins according to PINC. As in Section 6.3 of the main paper, we report the correlation of these metrics with human judgments in Table 12. In contrast to the first direction, the formality classifier obtains a higher correlation, which might be because the classifier is trained on informal data and is therefore better at assessing informal model outputs than formal ones. The fluency and meaning correlations are about the same. BLEU, TERp and PINC, however, all correlate very poorly with the overall ranking. This difference might be explained by the high variability of the informal reference rewrites: there are many more ways of making a sentence more informal than of making it more formal. Therefore, metrics that rely on references might be ill-suited for this style transfer task.
In Table 15, we show some sample model outputs for the E&M and F&R domain sentences. We can see that the rule-based method uses simple lexical transformations like ‘just’ to ‘juz’, ‘you’ to ‘u’, ‘because’ to ‘cuz’, ‘love’ to ‘luv’, etc., and wins over the other models. The intent of evaluating our models on this direction of the task was to understand how well the same model would do on the reverse task. We find that the second direction has a different set of challenges and requires models that cater to those specifically if we wish to beat simple rule-based methods.
|Entertainment & Music|
|Original Formal||I am just glad they didn ’t show us the toilets .|
|Reference Informal||IM GLAD THEY PASSED THE TOILETS|
|Rule-based||i am juz glad they didn ’t show us the toilets …..|
|PBMT Combined||I am just glad they didnt show us the toilets .|
|NMT Baseline||I ’m just glad they didn ’t show us the restroom .|
|NMT Copy||I ’m just glad they didn ’t show us the brids .|
|NMT Combined||I ’m just glad they didn ’t show us the toilets .|
|Family & Relationships|
|Original Formal||Hopefully , you married your husband because you love him .|
|Reference Informal||you married your hubby hopefully because you love him .|
|Rule-based||hopefully , u MARRIED ur husband coz u luv him …..|
|PBMT Combined||hopefully , you married your husband because you love him .|
|NMT Baseline||you married your husband because you love him .|
|NMT Copy||Hopefully you married your husband because you love him .|
|NMT Combined||Hopefully you married your husband because you love him .|
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Honolulu, Hawaii, pages 196–205. http://www.aclweb.org/anthology/D08-1021.
International Conference on Machine Learning. pages 1587–1596.
Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). pages 1532–1543.
Natural Language Processing and Knowledge Engineering (NLP-KE), 2010 International Conference on. IEEE, pages 1–5.