Adapting Language Models for Non-Parallel Author-Stylized Rewriting

09/22/2019 ∙ by Bakhtiyar Syed, et al. ∙ Adobe ∙ IIIT Hyderabad

Given the recent progress in language modeling using Transformer-based neural models and an active interest in generating stylized text, we present an approach to leverage the generalization capabilities of a language model to rewrite an input text in a target author's style. Our proposed approach adapts a pre-trained language model to generate author-stylized text by fine-tuning on the author-specific corpus using a denoising autoencoder (DAE) loss in a cascaded encoder-decoder framework. Optimizing over DAE loss allows our model to learn the nuances of an author's style without relying on parallel data, which has been a severe limitation of the previous related works in this space. To evaluate the efficacy of our approach, we propose a linguistically-motivated framework to quantify stylistic alignment of the generated text to the target author at lexical, syntactic and surface levels. The evaluation framework is both interpretable as it leads to several insights about the model, and self-contained as it does not rely on external classifiers, e.g. sentiment or formality classifiers. Qualitative and quantitative assessment indicates that the proposed approach rewrites the input text with better alignment to the target style while preserving the original content better than state-of-the-art baselines.


Introduction

There has been a growing interest in studying style in natural language and solving tasks related to it [Hu et al.2017, Shen et al.2017, Subramanian et al.2018, Fu et al.2018, Vadapalli et al.2018, Niu and Bansal2018]. Tasks like genre classification [Kessler, Nunberg, and Schütze1997], author profiling [Garera and Yarowsky2009], sentiment analysis [Wilson, Wiebe, and Hoffmann2005], and social relationship classification [Peterson, Hohensee, and Xia2011] have been of active interest to the community. Recently, stylized text generation [Hovy1990, Inkpen and Hirst2006] and style transfer [Li et al.2018, Prabhumoye et al.2018, Fu et al.2018] have gained traction; both these tasks aim to generate realizations of an input text that align to a target style. A majority of the work here is focused on generating text with different levels of sentiment [Shen et al.2017, Ficler and Goldberg2017] and formality [Jain et al.2019], as well as combinations of these attributes [Subramanian et al.2018]. This interest has given rise to annotated and parallel data comprising paired realizations that lie on opposite ends of the formality and sentiment spectrums [Rao and Tetreault2018, Mathews, Xie, and He2016]. The dimensions of style considered across all these works are psycholinguistic aspects of text, and the aim is to transfer the text across different levels of the chosen aspect.

Figure 1: An overview of generating author-stylized text using StyleLM, our proposed model.

However, few explorations aim to generate text across author styles, wherein the notion of style is not a specific psycholinguistic aspect but an amalgam of the author’s linguistic choices expressed in their writing [Jhamtani et al.2017, Tikhonov and Yamshchikov2018]. The work by Jhamtani et al. (2017), which generates “Shakespearized” text from Modern English, is in a similar vein but relies on the availability of parallel data. Since parallel data is not always available and is arduous to curate, such an approach cannot scale to different authors. We therefore propose a novel framework for author-stylized rewriting that does not rely on parallel data with source-text to target-text mappings. Figure 1 shows a few examples where an input text is rewritten in the style of a chosen author by our model.

Our approach for generating author-stylized text involves leveraging the generalization capabilities of state-of-the-art language models and adapting them to incorporate the stylistic characteristics of a target author without the need for parallel data. We first pre-train a language model on a combination of an author corpus [Lahiri2014] and Wikipedia data using the masked language modeling objective [Devlin et al.2019]. Drawing inspiration from the unsupervised machine translation setup of Lample and Conneau (2019), we cascade two copies of this pre-trained language model into an encoder-decoder framework, where the parameters of the encoder and decoder are initialized with the pre-trained language model. This cascaded framework is fine-tuned on a specific target author’s corpus by reconstructing the original text from its noisy version and optimizing a denoising autoencoder loss. The fine-tuned model thus adapts itself towards the style of the target author, as we show via our experimental analysis.

Author-stylized rewriting takes a text, which may or may not have a distinctive style, and rewrites it in a style that can be attributed to a target author. Since the writing style of authors is determined by several linguistically active elements expressed at lexical, syntactic, and semantic levels, it is challenging to evaluate the stylistic alignment of rewritten text to the target author’s style. To this end, we propose a novel, interpretable, and linguistically motivated framework to quantify the extent of stylistic alignment at multiple levels. As we elaborate in later sections, our evaluation suggests that the proposed approach performs better than three relevant and competitive baselines, showing significant adaptation to the writing style of target authors, both qualitatively and quantitatively. Notably, our approach performs on par with (and better in certain dimensions than) the state-of-the-art method for stylistic rewriting using parallel data, without leveraging the parallel nature of the underlying data.

The key contributions of this work are threefold.

  1. We propose and evaluate an approach to generate author-stylized text without relying on parallel data by adapting state-of-the-art language models.

  2. We propose an evaluation framework to assess the efficacy of stylized text generation that accounts for alignment of lexical and syntactic aspects of style. Contrary to existing evaluation techniques, our evaluation framework is linguistically-aware and easily interpretable.

  3. Our proposed approach shows significant improvement in author-stylized text generation over baselines, both in quantitative and qualitative evaluations.

Related Work

Stylized Text Generation:

In recent times, several explorations that aim to generate stylized text define a psycholinguistic aspect, like, formality or sentiment [Jain et al.2019, Shen et al.2017, Ficler and Goldberg2017] and transfer text along this dimension. The approaches themselves can range from completely supervised, which is contingent on the availability of parallel data [Ficler and Goldberg2017], to unsupervised [Li et al.2018, Shen et al.2017, Jain et al.2019]. Some of the influential unsupervised approaches include (a) using readily available classification-based discriminators to guide the process of generation [Fu et al.2018], (b) using simple linguistic rules to achieve alignment with the target style [Li et al.2018], or (c) using auxiliary modules (called scorers) that score the generation process on aspects like fluency, formality and semantic relatedness while deciding on the learning scheme of the encoder-decoder network [Jain et al.2019]. However, in the context of our setting, it is not possible to build a classification-based discriminator or scorers to generate author-stylized text. Moreover, linguistic-rule based generations are intractable given the large number of rules required to define a target author’s style. To this end, we aim to adapt state-of-the-art language models to generate author-stylized text from non-parallel data. The choice of using language models is motivated by the fact that stylistic rewriting builds on the task of simple text generation (i.e., writing).

There are some works that adapt an input text to the writing style of a specific author [Jhamtani et al.2017, Tikhonov and Yamshchikov2018]. While Jhamtani et al. (2017) aim to generate a “Shakespearized” version of modern English using parallel data, Tikhonov and Yamshchikov (2018) use a multilingual setup to generate author-stylized poetry using paired instances of Russian and English poetry. Both these approaches rely on the availability of parallel data and hence cannot easily scale to new authors. Our proposed approach aims to overcome this shortcoming by relying only on non-parallel data: it requires only the corpus of the target author’s text for stylistic rewriting. As we show later, the proposed framework is comparable to (and in some dimensions better than) the approach of Jhamtani et al. (2017) across content preservation and style transmission metrics, without utilizing the parallel corpus.

Language Models:

Generative pre-training of sentence encoders [Radford et al.2018, Devlin et al.2019, Howard and Ruder2018] has led to strong improvements on several natural language tasks. These approaches are based on learning a Transformer [Vaswani et al.2017] language model on a large unsupervised text corpus and then fine-tuning on classification and inference-based natural language understanding (NLU) tasks. Building on this, Lample and Conneau (2019) extend the approach to learn cross-lingual language models. Taking inspiration from this, we extend generative pre-training to our task of author-stylized rewriting.

The recently proposed language model GPT-2 [Radford et al.2019] is pre-trained on a large and diverse dataset (WebText) and is shown to perform well across several domains and datasets, including natural language generation. The unsupervised pre-training is set up to model the generation probability of the next word given the previous words, i.e., $P(w_t \mid w_1, \ldots, w_{t-1})$, more generally referred to as the causal language modeling (CLM) objective. Specific to the task of text generation, it takes an input prompt and aims to generate text that adheres to the input context. As substantiated in the later sections, GPT-2, when fine-tuned on an author-specific corpus, shows significant stylistic alignment with the writing style of the target author. However, given the inherent differences between stylistic rewriting and stylized text generation, it performs poorly on content preservation. In stylistic rewriting, the objective is to retain the information of the input text in the stylized generation, whereas stylized generation by GPT-2 produces content that is merely related to the input prompt; hence fine-tuned GPT-2 cannot address the task of stylistic rewriting. In the cross-lingual language modeling literature, a recent exploration by Lample and Conneau (2019) learns cross-lingual language models by first pre-training on different language modeling objectives: (i) causal language model (CLM), (ii) masked language model (MLM), similar to BERT [Devlin et al.2019], and (iii) translation language model (TLM), which is a supervised setup leveraging parallel corpora. Following the pre-training, Lample and Conneau (2019) cascade the encoder and decoder to address the tasks of supervised cross-lingual classification and machine translation by fine-tuning on a combination of denoising auto-encoder (DAE) and back-translation losses. Taking inspiration from this work, we pre-train a language model on a large corpus using the MLM objective and then fine-tune it on an author-specific corpus using DAE loss in an encoder-decoder setup. Using DAE loss ensures that we do not rely on the availability of parallel corpora, while the pre-trained language model facilitates the task of rewriting by building a firm substratum.

Evaluating Stylized Generation:

Fu et al. (2018) propose an evaluation framework to assess the efficacy of style transfer models on two axes: (i) content preservation and (ii) transfer strength. While the former caters to the content overlap between input and generated text (quantified using BLEU [Papineni et al.2002]), the latter takes into account the alignment of generated text with the target style. In their setup, as with many others, the notion of target style is a psycholinguistic aspect (formality or sentiment) for which classifiers or scorers are readily available and are hence used to quantify the transfer strength [Jain et al.2019, Li et al.2018, Mir et al.2019]. However, evaluation frameworks for author-stylized text generation are not well established. Jhamtani et al. (2017) and Tikhonov and Yamshchikov (2018) overcome this by using the content preservation metrics as a proxy for transfer strength, leveraging the availability of the ground-truth stylized text. The unavailability of a suitable metric for transfer strength is particularly pronounced in evaluating unsupervised approaches, as there is no target data to compare the generations against. To this end, we propose a linguistically-aware and interpretable evaluation framework which quantifies alignment of multiple lexical and syntactic aspects of style in the generated text with respect to the target author’s style.

Proposed Approach: StyleLM

There are two key aspects to our approach – pre-training a Transformer-based language model on a large dataset that acts as a substratum and fine-tuning on author-specific corpus using DAE loss to enable stylized rewriting. The entire approach is not contingent on the availability of parallel data and the models are learned in a self-supervised manner.

Figure 2: Proposed StyleLM model. We first pre-train a language model on large English corpus (I. Unsupervised Pretraining) and then cascade the pre-trained LMs into an encoder-decoder like framework (as represented by the curved arrows). The encoder-decoder is fine-tuned separately on each of the target author’s corpus using DAE loss (II. Author-specific fine-tuning).

Figure 2 illustrates the proposed framework for stylistic rewriting. We first pre-train the Transformer-based language model on a large unsupervised corpus using the masked language modeling (MLM) objective [Devlin et al.2019]. The choice of using a Transformer-based architecture is based on their recent success in language modeling [Vaswani et al.2017, Devlin et al.2019, Radford et al.2018, Radford et al.2019]. The MLM objective encourages the LM to predict the masked word(s) from the input sequence of words leveraging bidirectional context information of the input.

Given a source sentence $x = (x_1, \ldots, x_n)$, $x_{\setminus u}$ is a modified version of $x$ where its token at position $u$ is masked by replacing it with a mask token [MASK], thus keeping the length of the masked sentence unchanged. The MLM objective pre-trains the language model by predicting the original token $x_u$, taking the masked sequence $x_{\setminus u}$ as input, while learning the parameters $\theta$ of the conditional probability of the language model. We minimize the negative log-likelihood given by

$$\mathcal{L}(\theta; \mathcal{X}) = -\frac{1}{|\mathcal{X}|} \sum_{x \in \mathcal{X}} \log P(x_u \mid x_{\setminus u}; \theta) \qquad (1)$$

where $\mathcal{X}$ denotes the entire training corpus. For pre-training the language model using the MLM objective, following Devlin et al. (2019), we randomly select 15% of the tokens in each input sequence, replace them with the [MASK] token 80% of the time, with a random token 10% of the time, and keep them unchanged 10% of the time. A difference between our model and the MLM proposed by Devlin et al. (2019) is the use of text streams of sentences (truncated at a fixed token length) in contrast to pairs of sentences. This has been shown to give considerable gains for text generation tasks [Lample and Conneau2019]. Also, unlike Devlin et al. (2019), we do not use the Next Sentence Prediction (NSP) objective.
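The masking procedure described above can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the function name `mask_tokens`, the `[MASK]` string, and the default rates (BERT's standard 15% selection with an 80/10/10 split) are assumptions made for the example.

```python
import random

MASK = "[MASK]"  # assumed spelling of the mask token


def mask_tokens(tokens, vocab, mask_rate=0.15, rng=None):
    """Corrupt a token sequence for masked language modeling.

    Returns (corrupted, targets). targets[i] holds the original token at
    every position selected for prediction and None everywhere else.
    The sequence length is never changed.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    targets = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue  # position not selected for prediction
        targets[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, targets


if __name__ == "__main__":
    sentence = "the staff was so polite and catered to our every need".split()
    noisy, labels = mask_tokens(sentence, vocab=sentence, rng=random.Random(0))
    print(noisy)
    print(labels)
```

The loss in the MLM objective is computed only at positions where `targets` is not `None`, which is why unselected positions carry no label.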

The language model (LM) above learns to predict the masked words over a large corpus, but does not incorporate any style-related fine-tuning that facilitates rewriting the input text in a given target author’s style. To achieve this, we cascade two instances of the pre-trained LM in an encoder-decoder setup where one instance acts as the encoder and the other acts as the decoder. In other words, the learnable parameters of both encoder and decoder are initialized using the pre-trained LM. Note that the architecture of Transformer-based language models allows two exact instances of the pre-trained LM to be cascaded without explicitly aligning the encoder’s output and the decoder’s input [Bahdanau, Cho, and Bengio2014], since the attention mechanism is inherent in the design of Transformers [Vaswani et al.2017]. Lample and Conneau (2019) successfully used such cascading to bootstrap the iterative process of model initialization for the unsupervised machine translation task. Taking inspiration from this, we fine-tune the encoder-decoder on the DAE loss, given by

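$$\mathcal{L}_{DAE} = \mathbb{E}_{x \sim S}\left[-\log P(x \mid C(x))\right] \qquad (2)$$

where $C(x)$ is the noisy version of the input sentence $x$ and $S$ is the set of sentences in the target author’s corpus. To obtain a noisy version $C(x)$ of the input text $x$, we drop every word in $x$ with a probability $p_{drop}$.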

When the pre-trained language model is cascaded as the encoder and decoder, and further fine-tuned with a noisy version of the text, the encoder generates the masked words (since that is the original objective of the pre-trained LM). However, since the input to the decoder, which is same as the output of the encoder, has no masked words, it tries to reconstruct the clean version of the noisy input text. In other words, fine-tuning the encoder-decoder on target author’s corpus using the DAE loss (equation 2) pushes the model’s decoder towards inducing target author’s style while rewriting the input text from the encoder.

Implementation Details

During pre-training with MLM, we use the Transformer encoder [Vaswani et al.2017] (-layer) with GELU activations [Hendrycks and Gimpel2017], hidden units, heads, a dropout rate of and learned positional embeddings. We train our models with the Adam optimizer [Kingma and Ba2014], and a learning rate of . We use streams of tokens and a mini-batches of size . We train our model on the MLM objective until the language model’s perplexity shows no improvement over the validation dataset.

For fine-tuning the encoder-decoder on a target author, we obtain a noisy version $C(x)$ of the input text $x$ by dropping every word in $x$ with a probability $p_{drop}$. We also blank the input words (that is, replace the word with a blank token) with a probability $p_{blank}$. This helps in the construction of a noisy input from which we try to reconstruct the whole input passage (unlike the MLM, which predicts only a part of the input). While we also experimented with randomly shuffling the input text by applying a random permutation, similar to Lample and Conneau (2019), we found this to adversely affect the final results, since the word sequence is often part of an author’s linguistic choice. For the fine-tuning, we use the same pre-trained MLM Transformer initialization for both the encoder and decoder, similar to Lample and Conneau (2019), with the same hyperparameters used for pre-training.

and are set to and the model is fine-tuned until convergence.
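The noising step used for DAE fine-tuning (word dropping plus word blanking, with no shuffling) can be illustrated as follows. This is a hedged sketch: the function name `add_noise`, the `<blank>` placeholder string, and the probabilities in the usage example are assumptions for illustration and are not taken from the paper.

```python
import random

BLANK = "<blank>"  # hypothetical placeholder token used when blanking a word


def add_noise(words, p_drop, p_blank, rng=None):
    """Return a noisy copy of `words` for denoising-autoencoder training.

    Each word is dropped with probability p_drop; each surviving word is
    replaced by BLANK with probability p_blank. Word order is preserved,
    since shuffling is deliberately not applied.
    """
    rng = rng or random.Random()
    noisy = []
    for word in words:
        if rng.random() < p_drop:
            continue                  # drop the word entirely
        if rng.random() < p_blank:
            noisy.append(BLANK)       # keep the slot, hide the word
        else:
            noisy.append(word)        # keep the word as-is
    return noisy


if __name__ == "__main__":
    clean = "But I was not hungry any more , and did not care for food .".split()
    # Illustrative probabilities only.
    print(add_noise(clean, p_drop=0.1, p_blank=0.1, rng=random.Random(1)))
```

During fine-tuning, the pair (`add_noise(clean, ...)`, `clean`) would serve as the encoder input and the decoder's reconstruction target, respectively.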

To handle the vocabulary size for such a huge dataset, we use Byte Pair Encoding (BPE) [Sennrich, Haddow, and Birch2015] on the combined training dataset and learn BPE codes on the dataset. Since we use BPE codes on the combination of the training dataset of the authors, we can scale these for any author at will – thus the ability to adapt to any author in the Gutenberg corpus or beyond.
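For readers unfamiliar with BPE, the merge-learning loop can be sketched on a toy word list. This is a simplified illustration of the general algorithm, not the tooling used in the paper; the function name `learn_bpe`, the `</w>` end-of-word marker, and the tie-breaking rule are assumptions of this sketch.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge operations from a list of words.

    Each word starts as a sequence of characters plus an end-of-word marker.
    At every step the most frequent adjacent symbol pair is merged.
    """
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges


if __name__ == "__main__":
    print(learn_bpe(["low", "low", "lower", "lowest"], num_merges=4))
```

Because the merges are learned once on the combined corpus, rare or unseen words from a new author can still be segmented into known subword units.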

Evaluation Framework

Dataset

We collated a subset of the Gutenberg corpus [Lahiri2014] consisting of authors and books written by them. For evaluating on a completely unseen author (a zero-shot setting), as discussed later, we set aside the writings by Mark Twain from the training corpus. The remaining authors are utilized as training corpus during the pre-training stage resulting in a total of million passages. To diversify the pre-training dataset, we also use million passages from Wikipedia [Radford et al.2018] along with the M passages from the Gutenberg corpus – leading to a total of M passages for pre-training the LM. Of these, we set aside passages for validation and for test during the pre-training stage.

To fine-tune the encoder-decoder framework from the pre-trained LM, we pick a subset of authors from the Gutenberg corpus and independently treat them as target authors to generate author-stylized text. The chosen authors are: Sir Arthur Conan Doyle, Charles Dickens, George Alfred Henty, Nathaniel Hawthorne, Robert Louis Stevenson, Rudyard Kipling, Thomas Hardy, William Makepeace Thackeray, and Zane Grey. We fine-tune independently for each of the target authors and evaluate the efficacy of our proposed approach using a novel evaluation framework with roots in linguistic literature, described in a later section.

For inference at test time, we use the following three corpora to obtain our source sentences: (a) texts from books written by Mark Twain, (b) the Opinosis Review dataset [Ganesan, Zhai, and Han2010], and (c) a Wikipedia article on Artificial Intelligence (https://en.wikipedia.org/wiki/Artificial_intelligence), which does not appear in the Wikipedia portion of the training corpus. Texts from these sources span a diverse range of topics and writing styles: Mark Twain’s writings are literary, Opinosis reviews are written in everyday language, and the Wikipedia article on AI presents an interesting scenario where many of the words in the source text are not present in the target author’s corpus, given the different timelines.

We evaluate our performance against four baselines, three of which are trained on non-parallel data, while the fourth uses parallel data.

1. Vanilla GPT-2 based generation: Radford et al. (2019) show that language models present considerable promise as unsupervised multi-task learners. We use their vanilla GPT-2 pre-trained Transformer decoder [Radford et al.2019] as our first baseline. In our experimental setup, we use the publicly released pre-trained model for generation (https://github.com/openai/gpt-2). GPT-2 is fed a prompt directly during inference and the generated outputs are compared against other generations.

2. Author fine-tuned GPT-2: The second baseline is the same GPT-2 model as above, but fine-tuned with the cross-entropy loss on each target author’s corpus separately. We use the stylized text generated by providing a prompt to the fine-tuned model for comparisons.

3. Denoising-LM : no author-specific fine-tuning: This baseline is similar to our StyleLM network, but fine-tuned on the entire corpora using the DAE loss (as opposed to just the author-specific corpus). The purpose of this baseline is to evaluate the content preservation capabilities of our setup.

4. Supervised Stylized Rewriting: Jhamtani et al. (2017) propose an LSTM-based encoder-decoder architecture for generating “Shakespearized” text from text originally written in modern English, by leveraging parallel data. Given that our problem is close to their work, we consider this as our fourth baseline. However, since the model of Jhamtani et al. (2017) requires a parallel corpus, we compare against this baseline only for generating Shakespearized text (using their data [Jhamtani et al.2017]). We train the other three baselines and our StyleLM by treating Shakespeare’s corpus as the target author’s corpus (without using the parallel nature of the data).

Columns: Data Source | Model | Content Preservation (↑): BLEU, ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-L | Stylistic Alignment (↓): Lexical (MSE), Syntactic (JSD), Surface (MSE)
Rows: for each data source (Opinosis, Mark Twain, AI Wiki), the models GPT-2, GPT-2 (FT), LM + DAE, and StyleLM.
Table 1: Evaluating content preservation and stylistic alignment. We evaluate the performance of StyleLM against three baselines and on three test sets across multiple content preservation and stylistic alignment metrics. The reported numbers are means and standard deviations across all the target authors. FT denotes author-specific fine-tuning; ↑ / ↓ indicates that higher / lower is better, respectively.

Columns: Model | Content Preservation (↑): BLEU, ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-L | Stylistic Alignment (↓): Lexical (MSE), Syntactic (JSD), Surface (MSE)
Rows: GPT-2, GPT-2 (FT), LM + DAE, Jhamtani et al. (2017), and StyleLM.
Table 2: Comparison against supervised baseline. Similar to Table 1, we evaluate the performance of all the models against the approach of [Jhamtani et al.2017], which relies on parallel data. For author-specific fine-tuning of StyleLM and GPT-2 (FT), we use Shakespeare’s corpus but without exploiting its parallel nature with the modern English corpus.

Proposed Evaluation Methodology

Following existing literature on style transfer and stylized text generation, we evaluate our proposed frameworks along two axes: content preservation and stylistic alignment.

Content preservation aims to measure the degree to which the generated stylized outputs have the same meaning as the corresponding input sentences. Following existing literature, we use the BLEU metric [Papineni et al.2002] (measured with multi-bleu-detok.perl) and the ROUGE scores (ROUGE-1, ROUGE-2, ROUGE-3 and ROUGE-L) [Lin2004].
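To make the idea of n-gram overlap concrete, the following sketch computes a recall-style ROUGE-N between a source sentence and its rewrite. It is a simplified illustration: the function names are hypothetical, whitespace tokenization is assumed, and the source does not state which ROUGE variant (recall, precision, or F-score) is reported.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(reference, candidate, n):
    """Fraction of reference n-grams that also occur in the candidate.

    Counts are clipped, so a repeated n-gram is matched at most as many
    times as it appears in the candidate.
    """
    ref = ngram_counts(reference, n)
    cand = ngram_counts(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((ref & cand).values())  # multiset intersection
    return overlap / total


if __name__ == "__main__":
    source = "The staff was so polite and catered to our every need .".split()
    rewrite = "The staff was so polite and kind to our every need .".split()
    for n in (1, 2, 3):
        print(n, round(rouge_n_recall(source, rewrite, n), 3))
```

Here the input sentence plays the role of the reference, so a high score indicates that most of the input's content words and phrases survive the rewrite.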

The core contribution of our evaluation framework is the linguistic motivation used to quantify the stylistic alignment of a generated piece of text with the target style we wish to achieve. While there have been several studies on formality and sentiment transfer, the same evaluation criteria do not apply to our setting for two reasons: (a) classifier-based evaluation, which is facilitated by readily available classifiers for aspects like sentiment and formality, cannot be used to evaluate stylistic alignment with respect to an author’s style, and (b) author style is an amalgam of several linguistic aspects which are much more granular than the psycholinguistic concepts. To this end, taking motivation from Verma and Srinivasan (2019), we formulate a multi-level evaluation scheme that identifies and quantifies stylistic expression at the surface, lexical and syntactic levels. Once we quantify the stylistic expression, we use standard distance metrics to measure the stylistic alignment with the target.

Linguists have identified style, especially in the English language, to be expressed at three levels: surface, lexical and syntactic [Strunk2007, DiMarco and Hirst1988, Crystal and Davy2016]. We first discuss the expression of stylistic elements as well as their quantification. After quantifying the stylistic expressions at these levels, we discuss their incorporation into our evaluation framework.

Lexical elements of style are expressed at the word level. For instance, an author’s choice of words may be more subjective than objective (home vs. residence), or more formal than informal (palatable vs. tasty). We found that Rudyard Kipling, known for his classics of children’s literature, had a higher tendency to use concrete words (like gongs, rockets, torch, etc.), unlike Abraham Lincoln, who, being a political writer, used more abstract words (like freedom, patriotism, etc.). Inspired by Brooke and Hirst (2013), we consider four different spectrums to take lexical style into account: (i) subjective-objective, (ii) concrete-abstract, (iii) literary-colloquial, and (iv) formal-informal.

For quantifying these lexical elements, we use a list of seed words for each of the eight categories above, viz. subjective, objective, concrete, abstract, literary, colloquial, formal and informal [Brooke and Hirst2013]. Following Brooke and Hirst (2013), we compute the normalized pointwise mutual information (PMI) index to obtain a raw style score for each dimension, by leveraging co-occurrences of words in the large corpus. The raw scores are normalized to obtain style vectors for every word, followed by a transformation of style vectors into k-Nearest Neighbor (kNN) graphs, where label propagation is applied. Since the eight original dimensions lie on the two extremes of four different spectrums, i.e., subjective-objective, concrete-abstract, literary-colloquial, and formal-informal, we compute four averages across the entire author-specific corpus. The averages denote the tendency of the author to use subjective, concrete, literary, or formal words, in contrast to objective, abstract, colloquial, or informal words, as evidenced in their historical works. (The final output is a four-dimensional vector; one of its elements, for example, denotes the tendency of the author to choose subjective words instead of their objective counterparts.)

Syntactic elements relate to the syntax of the sentence: while some authors construct complex sentences, others construct simple sentences. For instance, as per the writings of Abraham Lincoln available in the Gutenberg corpus, a majority of his sentences can be categorized as compound-complex, while those of Rudyard Kipling are mostly simple sentences (which are better suited to children). Taking inspiration from Feng, Banerjee, and Choi (2012), we categorize syntactic style into five categories: (a) simple, (b) compound, (c) complex, (d) complex-compound sentences, and (e) others. For quantifying these stylistic elements, we compute the fraction of sentences that are categorized into each of the five categories by the algorithm proposed by Feng, Banerjee, and Choi (2012). Since any given sentence lies in exactly one of the five categories, the five-dimensional vector averaged across the sentences in a corpus can be thought of as a probability distribution over the five categories.

Surface elements relate to statistical observations concerning aspects like the average number of (i) commas, (ii) semicolons, and (iii) colons per sentence, (iv) sentences in a paragraph, and (v) words in a sentence. We quantify the surface-level elements into a five-dimensional vector.
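The surface-level statistics listed above are simple enough to compute directly. The sketch below is illustrative only: the function name `surface_vector`, the regex-based sentence splitter, and whitespace word counting are assumptions of this example, and a real pipeline would likely use a proper sentence tokenizer.

```python
import re


def surface_vector(paragraphs):
    """Compute a five-dimensional surface-style vector for a corpus.

    Input is a list of paragraph strings. Output order:
    [commas per sentence, semicolons per sentence, colons per sentence,
     sentences per paragraph, words per sentence].
    """
    sentences = []
    sentences_per_paragraph = []
    for paragraph in paragraphs:
        # Naive split: break after '.', '!' or '?' followed by whitespace.
        parts = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
        sentences_per_paragraph.append(len(parts))
        sentences.extend(parts)
    n = len(sentences)
    if n == 0:
        return [0.0] * 5
    return [
        sum(s.count(",") for s in sentences) / n,
        sum(s.count(";") for s in sentences) / n,
        sum(s.count(":") for s in sentences) / n,
        sum(sentences_per_paragraph) / len(paragraphs),
        sum(len(s.split()) for s in sentences) / n,
    ]


if __name__ == "__main__":
    corpus = ["I came, I saw; I left. It rained: hard.", "Done."]
    print(surface_vector(corpus))
```

Computing this vector once for the generated text and once for the target author's corpus gives the two inputs needed for the surface-level distance.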

Although the above enumerations of stylistic elements within a level, whether lexical, syntactic or surface, are not exhaustive, they are indicative of the stylistic expression at different levels. Computing the above statistics on an author-specific corpus gives an interpretable notion of the concerned author’s writing style. Such a notion of style spans multiple linguistic levels and has considerable granularity. To quantify the stylistic alignment between generated text and the target text, we first compute these statistics for both the generated corpus and the target author’s corpus. Then, we use standard distance metrics to obtain the extent of stylistic alignment at different linguistic levels. For lexical and surface-level alignment, we use mean squared error (MSE). Since the syntactic style vector is a probability distribution over different syntactic categories, we use Jensen-Shannon divergence (a symmetrized and smoothed variant of KL divergence) to measure the alignment.
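The two distance measures can be written out in plain Python. This is a minimal sketch under stated assumptions: the sentence-category labels are assumed to come from some external classifier (not implemented here), the function and constant names are hypothetical, and base-2 logarithms are used so that the divergence lies in [0, 1].

```python
import math
from collections import Counter

# Assumed label names for the five syntactic categories.
CATEGORIES = ["simple", "compound", "complex", "complex-compound", "other"]


def syntactic_distribution(labels):
    """Turn per-sentence category labels into a probability distribution."""
    if not labels:
        raise ValueError("need at least one labeled sentence")
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in CATEGORIES]


def mse(u, v):
    """Mean squared error between two equal-length style vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def kl(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their midpoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


if __name__ == "__main__":
    generated = syntactic_distribution(["simple", "simple", "complex", "other"])
    target = syntactic_distribution(["simple", "compound", "complex", "complex"])
    print("syntactic JSD:", jsd(generated, target))
    print("surface MSE:", mse([0.3, 0.1, 0.0, 4.0, 18.0], [0.5, 0.2, 0.1, 5.0, 21.0]))
```

Unlike plain KL divergence, the midpoint construction keeps the measure finite even when one distribution assigns zero probability to a category that the other uses.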

| Source | Original Text | NH’s Style | CD’s Style | GAH’s Style |
|---|---|---|---|---|
| Opinosis | The staff was so polite and catered to our every need . | The staff was so polite and kind to our every need . | The staff was so polite and obliged to our every need . | The staff was so polite and ready to accept our every need . |
| | One, for what you get it is over priced . | But for what you get it is all over. | And for what you get it is over , price over. | Now for what you get it is over price. |
| | The performance and speed of this laptop is amazing. | The performance and speed of this tablet is amazing . | The performance and speed of this communicate is amazing. | The performance and speed of this believe is wonderful. |
| | Front desk staff were not super easy to work with but … | Western desk , the staff were not abilities easy to work with , but… | front desk and staff were not extra easy to work with , but… | The won desk staff were not force easy to work with , but… |
| Mark Twain | I asked him if he learned to talk out of a book, and if I could borrow it anywhere? | I asked him whether he had learned to talk of a dream , and if I could borrow it. | I asked him if he had learned to talk out of a book ; and if I could borrow it . | I asked him if he learned to talk out of a man’s mind and if I could borrow it |
| | But I was not hungry any more, and did not care for food. | But I was not hungry any longer , and did not account for it . | But I was not hungry any one’s , and did not care for anybody’s. | But I was not hungry any longer , and did not care for . ’ |
| | Meanwhile, if we understand each other now, I will go to work again. | And if we understand each other’s, I go to work. | And if we understand each other , I will go to work. | Then if we understand each other’s words I will go to work. |
| | For of all the strange and frantic and incomprehensible and uninterpretable books which the imagination of man has created, surely this one is the prize sample. | For spite of all the strange and frantic and incomprehensible man’s books, the imagination of man has so surely this one is the prize. | For of all the strange and frantic and incomprehensible’s books which the imagination of man has, it’s surely this one is the prize’s . | For of all the strange and part of man and man’s books which the imagination of man has won’t surely this one is the prize . |
| AI Wiki | If the AI is programmed for “reinforcement learning”, goals can be implicitly induced by rewarding some types of behavior or punishing others. | If the human mind is bosoms for Heaven’s sake , he can be implicitly induced by rewarded some types of behavior or punishment. | If the brain is learn for men’s object can be implicitly induced by gratification some kind of behaviour or punishment’s punish’s | If the round is turn for one’s point he can be implicitly induced by done some type of conduct or punishing. |
| | A typical AI analyzes its environment and takes actions that maximize its chance of success. | A type of human nature in its environment , and takes actions in its chance of life. | A characteristic Sissy’s feelings takes place, and takes actions that sort of people’s chance of to-morrow’s. | A man is in its situation, and takes actions that : and its chance of escape. |
Table 3: Samples of stylized text generated by StyleLM. The target authors are Nathaniel Hawthorne (NH), Charles Dickens (CD) and George Alfred Henty (GAH). The source text has been taken from Opinosis, Mark Twain and AI Wiki, as indicated.

Results and Analysis

Qualitative Evaluation

Table 3 presents samples of author-stylized text generated using StyleLM for some of the authors. Key highlights include the switch between ‘kind’, ‘obliged’ and ‘ready to accept’ for the source word ‘catered’. The modification of the word ‘super’, which is used in a colloquial sense, to ‘extra’ without sacrificing the meaning demonstrates author-specific adaptation across different time frames. A similar observation can be made by noting the adaptation of ‘AI is programmed’ to ‘brain is to learn’ and ‘rewarding’ to ‘gratification’ on fine-tuning for Charles Dickens’ writing style. Qualitative assessment of the generated samples demonstrates the efficacy of our approach by illustrating alignment with the target author’s style as well as significant content preservation. We provide more examples as well as comparisons across all the baselines in the supplementary material.

Quantitative Evaluation

Our evaluation framework assesses the capability of our proposed StyleLM model across both content preservation and stylistic alignment metrics.

The results for stylized rewriting of the test corpus into the various authors’ styles (10 in total) are presented in Table 1. All the fine-tuned StyleLM models are tested on a test set that spans different domains: (a) Opinosis [Ganesan, Zhai, and Han2010], which contains sentences extracted from user reviews on a variety of topics from Tripadvisor (hotels), Edmunds.com (cars) and Amazon.com (various electronics), (b) text from Mark Twain’s books, and (c) a Wikipedia page on Artificial Intelligence. None of these were included in either the pre-training or the fine-tuning stage, so our model has never seen this data. To reiterate, the objective is to rewrite the above test corpora in a style that reflects that of the target author we fine-tuned for. Table 1 reports the values averaged over all 10 authors, along with the standard deviation, for both content preservation and stylistic alignment metrics.

It can be inferred from Table 1 that in terms of stylistic alignment, GPT-2 (FT), i.e., author fine-tuned GPT-2, performs comparably to LM + DAE, i.e., denoising LM with no author-specific fine-tuning, across all three datasets and on each of the three stylistic levels. However, the content preservation for LM + DAE is better than that of GPT-2 (FT). The vanilla GPT-2 is the least impressive in terms of both content preservation and stylistic alignment. Specifically, the poor performance on content preservation can be attributed to the fact that GPT-2 and GPT-2 (FT) are both trained to generate continuations of input prompts and not for the task of stylistic rewriting. It is nonetheless encouraging to see that fine-tuning the GPT-2 language model on an author-specific corpus, i.e., GPT-2 (FT), increases the extent of stylistic alignment with the target author’s style, establishing GPT-2 (FT) as a competitive baseline to compare stylistic alignment against.

While LM + DAE, i.e., denoising LM without author-specific fine-tuning, shows good performance in terms of content preservation and stylistic alignment, our proposed approach, StyleLM, shows considerable gains over LM + DAE across all the metrics. This observation confirms our hypothesis that author-specific fine-tuning using the DAE loss teaches the model to better learn the stylistic characteristics of the target author. The consistency of results across the diverse test sets suggests that these findings apply across domains.

Interestingly, we notice that ROUGE-1 scores for the baseline LM + DAE (without author fine-tuning) are slightly higher than those for StyleLM. A closer inspection of the generated samples from the two models reveals that this is because the stylized generations of the former are not as structurally coherent as those of the latter; i.e., while the predicted words are more accurate, they are not predicted in the correct order. This is further substantiated by StyleLM’s higher ROUGE-2, ROUGE-3 and ROUGE-L scores, which are sensitive to word order.
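The effect of word order on these metrics can be illustrated with a simplified n-gram recall. The sketch below is not the exact ROUGE implementation used for Table 1, which the text does not specify; the function names and the two candidate sentences are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference, candidate, n):
    """Fraction of reference n-grams also found in the candidate (clipped counts)."""
    ref = ngrams(reference.split(), n)
    cand = ngrams(candidate.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    return sum((ref & cand).values()) / total

reference = "the staff was so polite"
ordered = "the staff was very polite"    # one wrong word, correct order
shuffled = "polite so was staff the"     # all words right, wrong order

for name, cand in [("ordered", ordered), ("shuffled", shuffled)]:
    print(name, rouge_n_recall(reference, cand, 1), rouge_n_recall(reference, cand, 2))
```

The shuffled candidate scores higher on unigram recall (1.0 against 0.8) but drops to zero on bigram recall, while the ordered candidate keeps half of the reference bigrams. This mirrors the pattern described above, where unigram overlap alone can favor less coherent output.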

Comparison with Supervised Approach

While StyleLM performs better than the other unsupervised stylized generation models, as shown in Table 1, it is critical to determine its performance with respect to the supervised approach proposed by Jhamtani et al. (2017). We compare their LSTM-based encoder-decoder approach with GPT-2, GPT-2 (FT), LM + DAE and StyleLM after fine-tuning them on Shakespeare’s corpus. As can be inferred from the results presented in Table 2, StyleLM performs better than the supervised approach in terms of BLEU, ROUGE-3, ROUGE-L, and lexical stylistic alignment. The performance, as quantified by the rest of the metrics, is comparable to that of [Jhamtani et al.2017]. Given that StyleLM was trained without leveraging the parallel nature of the data, which Jhamtani et al. (2017) rely on, the results are promising and demonstrate the ability of our proposed model to generate author-stylized text while preserving the original content.

Conclusion & Future Work

In this work, we address the task of author-stylized rewriting by proposing a novel approach that leverages the generalization capabilities of language models. Building on top of language models, we fine-tune on the target author’s corpus using a denoising autoencoder loss to allow for stylistic adaptation in the process of reconstruction, without relying on parallel data. We also propose a new interpretable framework to evaluate stylistic alignment at multiple linguistic levels. We show that our proposed approach is able to capture the stylistic characteristics of target authors while rewriting the input text, and that it not only performs better than other relevant and competitive baselines but is also competitive with an entirely supervised approach that relies on parallel data.

The linguistic understanding of style, on which the proposed evaluation framework is based, can be used to guide the process of generating stylized text. The process of generation can be tuned to comply with attributes of style at different levels by penalizing or rewarding the (mis)alignment with these elemental attributes of style. We plan to explore this in further detail as part of future work.

References