
How Sequence-to-Sequence Models Perceive Language Styles?

Style is ubiquitous in our daily language use, but what is language style to learning machines? In this paper, by exploiting the second-order statistics of semantic vectors of different corpora, we present a novel perspective on this question via the style matrix, i.e. the covariance matrix of semantic vectors, and explain for the first time how Sequence-to-Sequence models innately encode style information in their semantic vectors. As an application, we devise a learning-free text style transfer algorithm, which explicitly constructs a pair of transfer operators from the style matrices. Moreover, our algorithm is observed to be flexible enough to transfer out-of-domain sentences. Extensive experimental evidence justifies the informativeness of the style matrix and shows that our style transfer algorithm performs competitively with state-of-the-art methods.





1 Introduction

Different corpora may present different language styles, featured by variations in attitude, tense, word choice, et cetera. As human beings, we always have an intuitive perception of style differences in texts. The linguistics literature has also developed a number of mature theories for characterizing style phenomena in our daily lives (Bell, 1984; Coupland, 2007; Ray, 2014). However, after decades of advancement of machine learning techniques in Natural Language Processing (NLP), an interesting and fundamental question still remains open:

How is style information encoded by learning models?

In this paper, we share our novel observations on a specific version of this question: how does Sequence-to-Sequence (seq2seq; Sutskever et al., 2014), a well-established neural network architecture widely used in NLP and representation learning (Bahdanau et al., 2014; Li et al., 2015; Kiros et al., 2015), encode language styles?

Figure 1: Visualization of the first two eigenvectors of style matrices on four subcorpora with ratings 1, 2, 4 and 5, collected from the Yelp review dataset. Reviews with a higher rating are more positive in attitude and vice versa.

In our preliminary studies, we applied a typical seq2seq model as an autoencoder to learn semantic vectors for sentences from the Yelp review dataset, and we made the following striking observation. After calculating the covariance matrices of the semantic vectors of reviews with different attitude polarity and intensity, we found that their second eigenvectors, i.e. the eigenvectors with the second largest eigenvalues, were roughly grouped into two parts according to their polarity and meanwhile showed slight differences according to their intensity, as in the colored part of Fig. 1, while the first eigenvectors, illustrated in gray, formed a single cluster, as if they captured certain common attributes of the Yelp corpus, e.g. casual word choices. This phenomenon suggests that the covariance matrices have probably encoded the language style in an informative way. Based on this observation, we introduce the notion of style matrix and investigate a number of its far-reaching implications in the remainder of this paper.

Conjecture 1 (Style Matrix).

The style of a corpus is encoded in the covariance matrix of its semantic vectors, which is called style matrix.

To the best of our knowledge, our research question and conjecture are novel, and there is barely any directly relevant prior work. The most related works are probably the recent studies on text style transfer, a task first investigated by Shen et al. (2017) for converting a given corpus in one style to another. Although most of them do not discuss style at a fundamental level, we identify three related perspectives in existing style transfer methods.

  A. Discrete Label: A major proportion of state-of-the-art style transfer methods simply view the corpus style as discrete labels. For instance, the styles of positive and negative reviews in the Yelp dataset are assigned binary labels (Shen et al., 2017; Hu et al., 2017; Chen et al., 2018).

  B. Style Embedding: Instead of discrete labels, some methods propose to learn semantic-independent embedding vectors as distributed representations of style (Fu et al., 2018).

  C. Lexicon-Based: Other works suggest using the most significant lexical units, i.e. those serving as major factors in the style classification decision, as representatives of text style (Li et al., 2018; Xu et al., 2018).

In principle, our main conjecture is more compatible with the linguistic aspects of style than the aforementioned perspectives, especially in the following respects.

  1. Style is a statistical phenomenon. According to the variationist view in sociolinguistics (Coupland, 2007), style emerges from variation in language usage and is always a global phenomenon rather than a property of a single sentence. In our conjecture, the style matrix by definition reflects the covariance of the corpus, while Perspective C can only characterize sentence-level style.

  2. Style is inherent in semantics. As Ray (2014) suggests, expression often helps to form meaning. In our conjecture, the style matrix is an explicit function of the semantic vectors, while Perspective B improperly assumes that the style embedding is independent of semantics.

  3. Style is multi-modal. Usually, we recognize style in texts from many different aspects (Bell, 1984). For instance, the sentence "I am very unhappy with this place" is negative in attitude and meanwhile in the present tense. Moreover, one can also recognize slight differences in style intensity between negative sentences, e.g. with or without "very" in the example above. With experiments, we show that the style matrix is able to distinguish various intensity levels of style (§4.2.1) and capture multiple styles in one corpus (§4.2.2), while Perspective A is unable to characterize these delicate style differences with discrete labels.

In practice, based on the notion of style matrix, we propose a novel algorithm called Neutralization-Stylization (NS) for unpaired text style transfer. Given the style matrices of the source and target corpora obtained from a pre-trained seq2seq autoencoder, our algorithm works in a fully learning-free manner by first preparing a pair of matrix transform operators from the style matrices. After the preparation, it simply applies these operators to the semantic vectors of the given corpus to accomplish text style transfer on the fly. By introducing additional style information as supervision on the learning process of the seq2seq autoencoder, we observe that the NS algorithm achieves performance comparable with state-of-the-art style transfer methods on each standard metric. Moreover, the flexibility of our method is further demonstrated by its ability to control the style of unlabeled sentences from other domains, i.e. out-of-domain text style transfer, which we propose as a much more challenging task to foster future research.

In summary, our contributions are as follows:


  • We present the notion of style matrix as an informative delegate of language style and explain for the first time how seq2seq models encode language styles (§2).

  • With the aid of style matrix, we devise Neutralization-Stylization as a learning-free algorithm for text style transfer among binary, multiple and mixed styles (§3), which achieves competitive performance compared with the state-of-the-art methods (§4.3).

  • We introduce the challenging out-of-domain text style transfer task to further prove the flexibility of our proposed method (§4.4).

2 Style Matrix

2.1 A General Framework for Style Matrix Extraction

Given a corpus $X = \{x_1, \dots, x_N\}$, where each $x_i$ is a sequence of tokens, we wonder whether there exists an explicit way to extract the global style of $X$ with no other external knowledge. Inspired by the variationist approach to language style in sociolinguistics (Coupland, 2007), we suggest exploiting the second-order statistics of semantics, specifically the covariance matrix, as an informative representation of the corpus style (Conjecture 1). Somewhat coincidentally, a similar viewpoint on visual style has recently been investigated in the computer vision community (Gatys et al., 2015).

Due to the discrete nature of language, the covariance matrix cannot be computed directly on raw representations such as one-hot encodings. To fulfill the statement in our main conjecture, we require the semantic vector to be both distributed and nearly lossless. The former property requires that the semantics of the original sentence can be compressed into a latent vector, while the latter requires that the original sentence can be near-optimally reconstructed from the semantic vector alone.

Formally, we first convert $X$ into distributed representations with a mapping $E$ (i.e. encoder) from $X$ to $\mathbb{R}^d$, a $d$-dimensional real-valued vector space. In order to guarantee that $E$ is lossless, we further require the existence of a reverse mapping $D$ (i.e. decoder) from $\mathbb{R}^d$ which satisfies $D \circ E = \mathrm{id}_X$, the identity mapping on the corpus. Once these conditions are satisfied, we call the distributed representations $Z$, which consists of $z_i$ s.t. $z_i = E(x_i)$, the semantic vectors of corpus $X$. As a slight abuse of notation, we also use $Z$ to represent the semantic vectors in matrix form, i.e. $Z = [z_1, \dots, z_N] \in \mathbb{R}^{d \times N}$. Based on these notations, we provide the formal counterpart to Conjecture 1 as follows.

Definition 1 (Style Matrix).

Given corpus $X$ with a semantic encoder $E$ satisfying the requirements above, we define the style matrix as

$$S_X = \frac{1}{N} \bar{Z} \bar{Z}^\top \tag{1}$$

where $\bar{Z}$ denotes $Z$ after being centered, i.e. $\bar{Z} = Z - \mu \mathbf{1}^\top$ and $\mu = \frac{1}{N} \sum_{i=1}^{N} z_i$.
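As a rough illustration of Definition 1, the following NumPy sketch computes a style matrix from a matrix of semantic vectors. The function name `style_matrix`, the column-per-sentence layout, the 1/N normalization, and the random stand-in data are assumptions made for this example; in practice the vectors would come from a trained encoder.

```python
import numpy as np

def style_matrix(Z: np.ndarray) -> np.ndarray:
    """Covariance of semantic vectors; Z has shape (d, N), one vector per column."""
    mu = Z.mean(axis=1, keepdims=True)      # mean semantic vector, shape (d, 1)
    Z_bar = Z - mu                          # centered semantic vectors
    return Z_bar @ Z_bar.T / Z.shape[1]     # assumed 1/N normalization

# Stand-in data: 1000 random 4-dimensional vectors (not real encoder output).
rng = np.random.default_rng(0)
Z = rng.normal(size=(4, 1000))
S = style_matrix(Z)
print(S.shape)  # (4, 4)
```

The result is a symmetric d-by-d matrix, so its size depends only on the dimension of the semantic space and not on the corpus size.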

In recent studies of sentence embedding (e.g. Le and Mikolov, 2014; Conneau et al., 2017; Pagliardini et al., 2017), there are various choices for implementing the encoder $E$. However, as most of them have no explicit notion of a decoder, the extracted style matrix could not be further utilized in downstream style transfer tasks. Therefore, in the next section, we propose to leverage the seq2seq paradigm (Sutskever et al., 2014) as a practical tool that extracts a highly informative style matrix and meanwhile facilitates style transfer tasks with the simultaneously trained decoder module. A detailed implementation is provided below.

2.2 Case Study: Seq2seq for Style Matrix Extraction

As an overview, we implement the encoder $E$ in Definition 1 with the encoder module of a seq2seq model, while its decoder module learns alongside under the reconstruction loss to guarantee that the original semantics are largely preserved in the obtained semantic vectors $Z$. Given a sentence $x = (w_1, \dots, w_T)$ with each token $w_t$ from a vocabulary $V$, we describe the learning process for style matrix extraction below.

For the encoder module, we use a recurrent neural network with Gated Recurrent Unit (GRU; Cho et al., 2014), i.e. $\mathrm{GRU}_E$, to encode $x$ into hidden state vectors with

$$H = (h_1, \dots, h_T) = \mathrm{GRU}_E(x) \tag{2}$$

where $H$ contains all the hidden states calculated by the GRU encoder.

Next, viewing the last hidden state $h_T$ of the encoder as the semantic vector $z$, we further require that the decoder can reconstruct $x$ based on $z$ token by token with a GRU (denoted as $\mathrm{GRU}_D$). Formally, at each step $t$, $\mathrm{GRU}_D$ takes the generated token $\hat{w}_{t-1}$ and the previous hidden state $s_{t-1}$ as input to calculate the current state $s_t$ by

$$s_t = \mathrm{GRU}_D(\hat{w}_{t-1}, s_{t-1}) \tag{3}$$

where the initial state $s_0$ is set as $z$.

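Subsequently, with a linear projection layer followed by a softmax transformation, the distribution of the next token over the vocabulary is calculated as

$$p(\hat{w}_t \mid \hat{w}_{<t}, z) = \mathrm{softmax}(W s_t) \tag{4}$$

where $W$ is a learnable matrix in $\mathbb{R}^{|V| \times d}$.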

Following the convention of unsupervised learning (LeCun et al., 2015), we set the reconstruction objective as the categorical cross entropy between the input sequence $x$ and the distribution of the reconstructed sequence $\hat{x}$. In practice, we further apply the scheduled sampling technique (Bengio et al., 2015) to accelerate the aforementioned learning process.
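To make the reconstruction objective concrete, here is a minimal NumPy sketch of the projection, softmax, and categorical cross entropy for one sentence. It assumes the decoder states are already available as an array and averages the loss over tokens; the names `softmax` and `reconstruction_loss`, the averaging, and the random inputs are illustrative assumptions, and the GRU itself and scheduled sampling are left out.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def reconstruction_loss(states: np.ndarray, W: np.ndarray, target_ids: np.ndarray) -> float:
    """Cross entropy between target tokens and the projected decoder states.

    states: (T, d) decoder hidden states, W: (|V|, d) projection matrix,
    target_ids: (T,) integer ids of the tokens of the input sentence.
    """
    probs = softmax(states @ W.T)                            # (T, |V|) next-token distributions
    picked = probs[np.arange(len(target_ids)), target_ids]   # probability of each true token
    return float(-np.log(picked).mean())                     # assumed: mean over time steps

# Random stand-ins for decoder states and projection (T=4, d=3, |V|=5).
rng = np.random.default_rng(0)
states = rng.normal(size=(4, 3))
W = rng.normal(size=(5, 3))
targets = np.array([0, 1, 2, 4])
loss = reconstruction_loss(states, W, targets)
```

With an all-zero projection the predicted distribution is uniform, so the loss equals the logarithm of the vocabulary size, which gives a quick sanity check.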

It is worth noting that, in our implementation of seq2seq for autoencoding, we have intentionally avoided the attention mechanism (Luong et al., 2015). The main reason is that, with attention, the information flow from encoder to decoder is not limited to the semantic vector $z$: the reconstruction process then also depends on the context vector. Therefore, although the attention mechanism can achieve a low reconstruction loss even with a small hidden state size, part of the semantics may bypass the semantic vector, which would compromise the quality of the extracted style matrix.

As a final remark, we demonstrate our method above with GRU modules only for the sake of concreteness. Besides GRU, there are various recurrent architectures available for implementing $E$, such as the vanilla recurrent unit (Rumelhart et al., 1985), the Long Short-Term Memory network (LSTM; Hochreiter and Schmidhuber, 1997) and their bidirectional or stacked variants (Jurafsky, 2000). In experiments, we also report results with several typical architectures as a comprehensive self-comparison.

3 Style Transfer with Style Matrix

In this section, we propose a novel algorithm called Neutralization-Stylization (NS) for unpaired text style transfer, which directly aligns the style matrix of one corpus to that of the other with a pair of plug-and-play matrix operations. To achieve performance competitive with state-of-the-art style transfer methods, we further augment the unsupervised style matrix extraction process in Section 2.2 by introducing human-defined style information as external supervision.

3.1 Neutralization-Stylization algorithm

As a covariance matrix in essence, the style matrix $S$ is positive semi-definite and can therefore be factorized into the following form

$$S = U \Lambda U^\top \tag{5}$$

where $\Lambda$ is a diagonal matrix consisting of its eigenvalues and $U$ is an orthogonal matrix formed by its eigenvectors (Meyer, 2000).
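The factorization can be checked numerically. The sketch below uses `numpy.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order, and then reorders them so that the first column holds the eigenvector with the largest eigenvalue. The helper name `eig_decompose` and the random covariance matrix are assumptions for illustration.

```python
import numpy as np

def eig_decompose(S: np.ndarray):
    """Eigendecomposition of a symmetric matrix, sorted by descending eigenvalue."""
    eigvals, U = np.linalg.eigh(S)          # ascending eigenvalues, orthonormal columns
    order = np.argsort(eigvals)[::-1]       # indices from largest to smallest eigenvalue
    return eigvals[order], U[:, order]

# Stand-in covariance matrix built from random 5-dimensional data.
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 200))
S = np.cov(A, bias=True)

eigvals, U = eig_decompose(S)
first, second = U[:, 0], U[:, 1]            # first and second eigenvectors
```

Sorting in descending order matches the way the text speaks of the first and second eigenvectors of a style matrix.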

Given two corpora $X$ and $Y$ and a seq2seq autoencoder pretrained on $X \cup Y$ as a larger corpus, we calculate Eq. 1 respectively on $X$ and $Y$ to obtain the style matrices $S_X$ and $S_Y$. Using the eigenvalue decomposition in Eq. 5, we next introduce a pair of Neutralization and Stylization operators, which can be easily used for on-the-fly text style transfer in a plug-and-play manner. Note that both operators are defined on a set of semantic vectors rather than a single embedding, which corresponds closely to the statistical nature of language style (Coupland, 2007).

3.1.1 Plug-and-Play Style Transfer Operators

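Neutralization. The neutralization operator $\mathcal{N}_X$ is used to remove the style characteristic of corpus $X$ from a set of semantic vectors $Z$. Formally, in the spirit of Zero-phase Component Analysis (ZCA) (Bell and Sejnowski, 1997), the neutralization operator is defined as

$$\mathcal{N}_X(Z) = U_X \Lambda_X^{-1/2} U_X^\top \bar{Z} \tag{6}$$

where $S_X = U_X \Lambda_X U_X^\top$ is the eigenvalue decomposition of the style matrix of $X$ and $\bar{Z}$ denotes $Z$ after being centered.

An intuitive way to understand how it works is to replace $Z$ directly with the semantic vectors of $X$. It is easy to check that $\mathcal{N}_X(Z)$ then has the identity matrix $I$ as its style matrix, which means the dimensions of semantics become uncorrelated after neutralization.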
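The claim that neutralization decorrelates the semantic dimensions can be verified on synthetic data. This sketch assumes the operator is the ZCA whitening map built from the source corpus and that vectors are centered with the source corpus mean; the function name `neutralize`, the small `eps` added for numerical stability, and the synthetic correlated data are all assumptions of the example.

```python
import numpy as np

def neutralize(Z: np.ndarray, Z_src: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """ZCA-whiten the columns of Z using the covariance of the source vectors Z_src."""
    mu = Z_src.mean(axis=1, keepdims=True)
    Zc = Z_src - mu
    S = Zc @ Zc.T / Z_src.shape[1]                          # style matrix of the source corpus
    lam, U = np.linalg.eigh(S)
    whiten = U @ np.diag(1.0 / np.sqrt(lam + eps)) @ U.T    # U Lambda^{-1/2} U^T
    return whiten @ (Z - mu)                                # assumed: center with the source mean

# Synthetic correlated "semantic vectors" with a non-zero mean.
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.5, -1.0, 3.0]])
Z_src = mix @ rng.normal(size=(3, 5000)) + 1.0

Z_neutral = neutralize(Z_src, Z_src)
print(np.round(np.cov(Z_neutral, bias=True), 3))  # close to the 3x3 identity
```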

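Stylization. The stylization operator $\mathcal{S}_Y$ is used to add the style characteristic of corpus $Y$ to a set of semantic vectors $Z$ by reestablishing the correlation among dimensions of semantics, which, inspired by Hossain (2016), is defined as

$$\mathcal{S}_Y(Z) = U_Y \Lambda_Y^{1/2} U_Y^\top Z \tag{7}$$

where $S_Y = U_Y \Lambda_Y U_Y^\top$ is the eigenvalue decomposition of the style matrix of $Y$.

Similarly, by stylizing a neutral set of semantic vectors $Z$ (i.e. one whose style matrix is $I$), we can easily check that $\mathcal{S}_Y(Z)$ has the same style matrix as that of corpus $Y$, which demonstrates the properness of $\mathcal{S}_Y$.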
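The corresponding check for stylization is sketched below as a coloring transform. Re-adding the target corpus mean is an assumption borrowed from whitening-coloring transforms and is not stated in the text; it does not affect the covariance. The function name `stylize` and the synthetic data are likewise illustrative, and the neutral set is constructed with a Cholesky factor purely so that its covariance is exactly the identity.

```python
import numpy as np

def stylize(Z_neutral: np.ndarray, Z_tgt: np.ndarray) -> np.ndarray:
    """Color neutral vectors so that their covariance matches that of Z_tgt."""
    mu = Z_tgt.mean(axis=1, keepdims=True)
    Zc = Z_tgt - mu
    S = Zc @ Zc.T / Z_tgt.shape[1]                  # style matrix of the target corpus
    lam, U = np.linalg.eigh(S)
    lam = np.clip(lam, 0.0, None)                   # guard against tiny negative eigenvalues
    color = U @ np.diag(np.sqrt(lam)) @ U.T         # U Lambda^{1/2} U^T
    return color @ Z_neutral + mu                   # assumption: shift to the target mean

rng = np.random.default_rng(1)
mix = np.array([[1.0, 0.5, 0.0], [0.0, 2.0, 0.3], [0.0, 0.0, 0.7]])
Z_tgt = mix @ rng.normal(size=(3, 4000)) - 2.0

# Build a set with exactly zero mean and identity covariance.
raw = rng.normal(size=(3, 4000))
raw -= raw.mean(axis=1, keepdims=True)
L = np.linalg.cholesky(raw @ raw.T / raw.shape[1])
Z_neutral = np.linalg.solve(L, raw)

Z_styled = stylize(Z_neutral, Z_tgt)
```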

3.1.2 On-the-Fly Text Style Transfer

With the well-defined neutralization and stylization operators, our proposed learning-free NS algorithm works straightforwardly by: (1) encoding $x$ with $E$; (2) applying the prepared operators $\mathcal{N}_X$ and $\mathcal{S}_Y$ successively; (3) decoding the semantic vector with $D$. Formally, the target sentence $\hat{x}$ is calculated as

$$\hat{x} = D\big(\mathcal{S}_Y(\mathcal{N}_X(E(x)))\big) \tag{8}$$

Moreover, thanks to the flexibility of the style matrix perspective and the NS algorithm, we can even conduct out-of-domain style transfer, where the input sentence $x$ does not necessarily come from corpus $X$ or have style labels. For details, we present an interesting case study on out-of-domain style transfer between the Yelp and Amazon datasets in Section 4.4.
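The middle step of the NS algorithm, applying neutralization and then stylization, can be sketched on semantic vectors alone. The encoder and decoder are outside this sketch: the arrays below are synthetic stand-ins for encoder outputs, and the result would still need to be decoded. The names `prepare_operators` and `ns_transfer`, the `eps` stabilizer, and the handling of the means (subtract the source mean, add the target mean) are assumptions of this example.

```python
import numpy as np

def _mean_and_eig(Z):
    """Mean vector and eigendecomposition of the covariance of the columns of Z."""
    mu = Z.mean(axis=1, keepdims=True)
    Zc = Z - mu
    lam, U = np.linalg.eigh(Zc @ Zc.T / Z.shape[1])
    return mu, np.clip(lam, 0.0, None), U

def prepare_operators(Z_src, Z_tgt, eps=1e-8):
    """One-off preparation: whitening matrix for the source, coloring matrix for the target."""
    mu_s, lam_s, U_s = _mean_and_eig(Z_src)
    mu_t, lam_t, U_t = _mean_and_eig(Z_tgt)
    whiten = U_s @ np.diag(1.0 / np.sqrt(lam_s + eps)) @ U_s.T
    color = U_t @ np.diag(np.sqrt(lam_t)) @ U_t.T
    return mu_s, whiten, mu_t, color

def ns_transfer(Z, operators):
    """Neutralize with the source statistics, then stylize with the target statistics."""
    mu_s, whiten, mu_t, color = operators
    return color @ (whiten @ (Z - mu_s)) + mu_t

# Synthetic source and target "semantic vectors" with different covariances.
rng = np.random.default_rng(2)
Z_src = np.diag([3.0, 1.0, 0.5]) @ rng.normal(size=(3, 3000))
Z_tgt = np.array([[1.0, 0.8, 0.0], [0.0, 1.0, 0.4], [0.0, 0.0, 2.0]]) @ rng.normal(size=(3, 3000)) + 5.0

ops = prepare_operators(Z_src, Z_tgt)
Z_out = ns_transfer(Z_src, ops)   # vectors that would be handed to the decoder
```

Because the operators are plain matrices computed once, transferring a new batch of vectors costs only two matrix multiplications, which is what makes the procedure learning-free at transfer time.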

3.2 Incorporate Human-Defined Style Label

In practice, we notice that the performance of the NS algorithm with the raw style matrix is not competitive with state-of-the-art methods specialized for this task. We speculate the main reason is that the style matrix is highly informative and probably incorporates even the most delicate aspects of the style of the underlying corpus. Its unsatisfactory performance on the style transfer task therefore implies that the corpus has other latent style attributes besides the human-defined ones, as we have illustrated with the Yelp example in Section 1 by its clustered first eigenvectors (gray arrows in Fig. 1).

To enhance the quality of style transfer, we suggest augmenting the style matrix extraction process with a human-defined attribute (e.g. attitude). Concretely, we propose to train the encoder $E$ of the seq2seq model in a semi-supervised way by adding a nonlinear binary classifier $C$ on the semantic space, which provides a supervision signal simultaneously with the original unsupervised reconstruction process. Formally, given semantic vector $z$, we define the classifier as

$$C(z) = \sigma(w^\top z) \tag{9}$$

where $w$ is the trainable parameter and $\sigma$ is the sigmoid activation.
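A minimal sketch of such a classifier and of how its loss might be combined with the reconstruction loss is given below. It assumes a single weight vector followed by a sigmoid and a binary cross-entropy loss; the 10:1 weighting is the ratio reported in the implementation details, while the function names, the absence of a bias term, and the additive combination of the two losses are assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def classify(Z: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Z: (d, N) semantic vectors, w: (d,) weights. Returns one probability per column."""
    return sigmoid(w @ Z)

def binary_cross_entropy(p: np.ndarray, y: np.ndarray, eps: float = 1e-12) -> float:
    """Mean binary cross entropy between predicted probabilities p and 0/1 labels y."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def joint_loss(rec_loss: float, cls_loss: float,
               rec_weight: float = 10.0, cls_weight: float = 1.0) -> float:
    """Assumed additive combination of reconstruction and classification losses."""
    return rec_weight * rec_loss + cls_weight * cls_loss

# Toy usage with random stand-in vectors and labels.
rng = np.random.default_rng(3)
Z = rng.normal(size=(3, 6))
w = rng.normal(size=3)
y = np.array([1, 0, 1, 1, 0, 0])
cls_loss = binary_cross_entropy(classify(Z, w), y)
total = joint_loss(rec_loss=2.0, cls_loss=cls_loss)
```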

Notably, under both scenarios, our text style transfer algorithm is learning-free: we only need to pretrain a seq2seq model, either in a fully unsupervised or a semi-supervised way, to obtain a pair of encoder and decoder, and then prepare the operators with several matrix operations. Without time-consuming adversarial training (e.g. Shen et al., 2017), our augmented method achieved competitive transfer performance on each standard metric (§4.3).

4 Experiments

Footnote 2: Code is provided at

4.1 Overall Settings

Datasets. We used the following two standard benchmark datasets for empirical studies.

Yelp: The Yelp dataset collects restaurant reviews on Yelp. Each sentence is associated with an integer rating ranging from 1 to 5, where a higher score implies a more positive attitude of the corresponding review and vice versa. We treated the attitude of reviews with ratings above 3 as positive and those below 3 as negative.

Amazon: The Amazon dataset contains the product reviews on Amazon. Each sentence is originally labeled with positive or negative attitudes (He and McAuley, 2016).

With an automatic tense analysis tool (Ramm et al., 2017), we annotated the tense attribute for sentences in Yelp and Amazon as an additional style factor. We filtered out the sentences which were not in past and present tense and split each processed dataset into train, validation and test sets. For statistics, please refer to Appendix A.

Evaluation Metric. We evaluated the performance of style transfer on the following two standard metrics.

Accuracy (Acc.): In order to evaluate whether the transferred sentences have the desired style, we followed the evaluation method in Shen et al. (2017) by pretraining a style classifier on the training set and utilizing its classification accuracy on the transferred sentences as a metric. Specifically, we used the TextCNN model (Kim, 2014) as a style classifier.

BLEU: In order to evaluate the quality of content preservation, we used the BLEU score (Papineni et al., 2002) between the generated and the source sentences as a measure. Intuitively, a higher BLEU score primarily indicates that the model has a stronger ability to preserve content by copying style-neutral words from the source sentence.

To evaluate the overall style transfer quality, we also calculated the geometric mean (i.e. G-Score) and the arithmetic mean (i.e. Mean) of the Acc. and BLEU metrics.
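The two aggregate metrics are simple to compute. The sketch below reproduces the G-Score and Mean of the NS-GRU row on Yelp in Table 2 from its Acc. and BLEU values; the function names are chosen for this example.

```python
import math

def g_score(acc: float, bleu: float) -> float:
    """Geometric mean of accuracy and BLEU."""
    return math.sqrt(acc * bleu)

def mean_score(acc: float, bleu: float) -> float:
    """Arithmetic mean of accuracy and BLEU."""
    return (acc + bleu) / 2.0

# NS-GRU on Yelp as reported in Table 2: Acc. 80.33, BLEU 13.43.
print(round(g_score(80.33, 13.43), 2))     # 32.85
print(round(mean_score(80.33, 13.43), 2))  # 46.88
```

The geometric mean penalizes an imbalance between the two metrics more strongly than the arithmetic mean does, so a method that copies its input (high BLEU, very low accuracy) scores much lower on G-Score than on Mean.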

NS Operators From      Acc. (Baseline)   BLEU
(R1, R5)               39.48 (2.67)      36.97
(R1 ∪ R2, R4 ∪ R5)     35.84 (2.67)      39.44
(R2, R4)               32.82 (2.67)      41.08

Table 1: Results of style transfer with NS algorithm on corpora pairs with different style contrast levels. The error rate of the original sentences in the validation set is provided as the baseline; the same holds for Table 3.

Implementation Details. We embedded words into distributed representations (with dimension ) using CBOW (Mikolov et al., 2013) and froze the word embeddings during the training process. We implemented the seq2seq model with (1) GRU of hidden units, (2) LSTM of hidden units and (3) bi-directional GRU of both forward and backward hidden units. For (2), we concatenated the final hidden state and cell state to form the -dimensional semantic vector, while for (3), we concatenated the forward and backward final hidden states. We trained each seq2seq model on the training set with the Adam optimizer (Kingma and Ba, 2014) and performed style transfer on the validation set. We set the weights of the reconstruction loss and the classification loss to 10:1. As observed in Section 4.3, the informativeness of the style matrix was insensitive to the choice of recurrent architecture, and hence we only report the results of the GRU implementation in the other parts.

4.2 Explore the Styles of Yelp

As discussed in Section 1, the notion of style matrix conforms to the linguistic view that style is innate in semantics and is multi-modal. To demonstrate that the style matrix can indeed capture these delicate style phenomena, we first mixed up all the reviews on Yelp with different ratings and trained a seq2seq model with the reconstruction loss only. We then divided the corpus into several sub-corpora with well-designed criteria. Finally, we performed text style transfer with operators prepared respectively from these pairs of sub-corpora. Detailed results and analyses follow in each part.

                    Yelp                                Amazon
Model               Acc.    BLEU    G-Score   Mean     Acc.    BLEU    G-Score   Mean
Test Set            97.48   -       -         -        80.97   -       -         -
Cross-Aligned       83.78   12.69   32.37     46.73    60.84   8.56    22.82     34.70
Style-Embedding     6.34    85.14   23.23     45.74    29.21   68.14   44.61     48.68
NS-GRU              80.33   13.43   32.85     46.88    79.50   12.97   32.11     46.23
NS-LSTM             78.07   12.38   31.10     45.23    74.30   15.90   34.37     45.10
NS-BiGRU            72.02   12.60   30.12     42.31    74.04   25.23   43.22     49.64

Table 2: Performance of different style transfer methods on standard benchmarks.

4.2.1 Style Intensity

We collected four corpora which contained sentences with rating 1, 2, 4 and 5 respectively (denoted as R1, R2, R4, R5) and discarded the neutral sentences with rating 3. The former two sub-corpora have the same polarity of attitude (i.e. negative) but different intensity, and likewise for the latter two. For visualization, Fig. 2 plots the first eigenvectors of each style matrix, which shows a recognizable color gradient from the most negative corpus (with rating 1) to the most positive corpus (with rating 5).

Figure 2: Heatmap of the first eigenvectors of style matrices on Yelp subcorpora with different intensity levels of attitude, from R1 the most negative corpus to R5 the most positive one (better viewed in color).

Subsequently, we constructed three sets of operators respectively from the stylistic pairs (R1, R5), (R1 ∪ R2, R4 ∪ R5) and (R2, R4), in decreasing order of style contrast level. We performed style transfer on the same validation set with the three sets of prepared operators. The results are reported in Table 1. As we can see, the style transfer quality of each set of operators was positively related to the degree of style contrast, which we regard as an implicit validation of the informativeness of the style matrix in capturing slight differences in style intensity.

4.2.2 Multiple Styles

Attribute   Acc. (Baseline)   BLEU
Attitude    29.56 (2.67)      62.04
Tense       65.94 (2.73)      52.28

Table 3: Results of style transfer with NS algorithm on corpora pairs with multiple styles.

Based on the attitude and tense annotations on Yelp, we partitioned the original corpus into two pairs of sub-corpora, namely the attitude pair (positive, negative) and the tense pair (present, past). Correspondingly, we calculated attitude (tense) transform operators respectively on each pair and applied the prepared operators to transfer the target attribute with the other style attribute fixed. We report the transfer performance of NS algorithm in Table 3, which empirically proved style matrix can simultaneously capture multiple style attributes.

4.3 Unpaired Text Style Transfer

In this part, we compared the performance of NS algorithm with the state-of-the-art methods on Yelp and Amazon datasets. We trained a seq2seq model in the semi-supervised way as described in Section 3.2 and transferred attitude of sentences with NS algorithm. We chose the following representative state-of-the-art style transfer methods as baselines.

Cross-Aligned: This method assumes a shared latent content distribution across the corpora with different styles and leverages refined alignment of latent representations to perform style transfer (Shen et al., 2017).

Style-Embedding: This method learns separate content representations and style representations using adversarial networks. With the style information embedded into distributed vector representations, one single decoder is trained for different corpora (Fu et al., 2018).

As observed in Sec. 4.2.1, the transform operators have a stronger transfer capability when generated from a pair of corpora with higher style contrast, which inspires us to further enhance the performance of the NS algorithm by removing sentences with low confidence as judged by the simultaneously trained style classifier. Fig. 3 plots the model performance on different metrics over a range of drop rates sampled with a fixed stride.
As we can see, the increase in drop rate caused an increase of Acc. and decrease of BLEU score. We speculate it is inevitable due to the tight interdependence between style and semantics. The result at drop rate provides further evidence on this phenomenon, that is, to change the style of the validation set to a corpus with extreme style feature would largely change their semantics.
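One way to realize the sentence dropping is sketched below: each sentence gets a confidence equal to the classifier probability of its own label, and the lowest-confidence fraction is removed before the operators are prepared. The function name `drop_low_confidence`, this definition of confidence, and the toy numbers are assumptions; the text does not specify how confidence is measured.

```python
import numpy as np

def drop_low_confidence(Z, probs, labels, drop_rate):
    """Keep the columns of Z whose classifier confidence is highest.

    Z: (d, N) semantic vectors, probs: (N,) predicted P(label == 1),
    labels: (N,) 0/1 style labels, drop_rate: fraction of sentences to remove.
    """
    conf = np.where(labels == 1, probs, 1.0 - probs)     # confidence in the true label
    n_keep = int(round(len(conf) * (1.0 - drop_rate)))
    keep = np.sort(np.argsort(conf)[::-1][:n_keep])      # top-confidence indices, original order
    return Z[:, keep]

# Toy example with four "sentences" in a 2-dimensional semantic space.
Z = np.arange(8.0).reshape(2, 4)
probs = np.array([0.9, 0.6, 0.2, 0.45])
labels = np.array([1, 1, 0, 0])
kept = drop_low_confidence(Z, probs, labels, drop_rate=0.5)
print(kept)
```

Here the confidences are 0.9, 0.6, 0.8 and 0.55, so with a drop rate of 0.5 the first and third sentences are kept. The style matrices, and hence the operators, would then be computed from the retained vectors only.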

Figure 3: Performance of NS algorithm with different sentences’ drop rates for preparing transfer operators.

Yelp (Tense)
Source  she did not finish the liver .
Past    she did not finish the liver .
Pres    she does not finish the liver .
Source  however , i think i would try somewhere else to dine .
Past    however , i thought i would try somewhere to dine .
Pres    however , i think i would try somewhere else to dine .

Amazon (Attitude)
Source  the edger function did not work well for me .
Neg     the edger function did not work well for me .
Pos     the edger function work well for me .
Source  my daughter was just frustrated with this toy .
Neg     my daughter was just upset with this toy .
Pos     my daughter was just happier with this toy .

Table 4: Sampled out-of-domain sentences transferred by NS algorithm.

Since the trade-off between transfer ability and content preservation can be controlled, it is hard to select one balanced point to fully characterize the performance of our method. As a complement, we suggest using Mean as an overall performance measure, which is more stable than G-Score as observed in Fig. 3. Table 2 shows the performance of our method with different recurrent architectures alongside the baselines. As we can see, our method achieved performance comparable with the two baselines while outperforming them on average on Amazon, the benchmark with a larger vocabulary size. For an illustrative comparison, we further provide some generated samples from each method in Appendix A.

4.4 Out-of-Domain Style Transfer

Task                Acc. (Test)     BLEU    G-Score   Mean
Yelp (Tense)        86.35 (94.80)   32.64   53.09     59.49
Amazon (Attitude)   87.94 (97.48)   22.05   44.03     55.00

Table 5: Performance of NS algorithm on out-of-domain style transfer tasks.

In the final part, we propose out-of-domain style transfer as a much more challenging task for text style transfer, where, given a corpus with style labels, the style transfer models are required to control the style of unlabeled sentences coming from out-of-domain corpora. To validate the NS algorithm's performance on this task, we use Yelp with attitude labels only and Amazon with tense labels only, and control the style of each on the attribute it has not observed. In other words, we transfer the tense of sentences in Yelp with the operators prepared from Amazon, and vice versa.

In this scenario, we only need a slight modification of our proposed method in Sec. 3, that is, to train the seq2seq model on Yelp ∪ Amazon with two style classifiers, namely an attitude classifier on Yelp and a tense classifier on Amazon. After preparing the style transform operators on the other domain, out-of-domain style transfer is straightforward. The results are reported in Table 5. It is worth noting that, even though the validation set is unlabeled in the style attribute we want to transfer, our NS algorithm can still achieve superior performance in both cases, which further validates the flexibility of the style matrix perspective and the effectiveness of the NS algorithm. We also provide some illustrative results in Table 4. Notably, the capability of out-of-domain style transfer allows us to leverage several corpora, each annotated with a single style attribute, for controlling multiple styles on each corpus.

5 Related work

Unpaired Text Style Transfer. A major proportion of works proposed to learn style-independent semantic representations of sentences for downstream transfer tasks (Fu et al., 2018; Shen et al., 2017; Hu et al., 2017; Chen et al., 2018). These works minimized the reconstruction loss of a variational autoencoder (Kingma and Welling, 2013) to compress the sentences and aligned the distributions of these vectors by adversarial training. Some other works utilized heuristic transformations to accomplish style transfer by explicitly dividing the sentence into semantic words and style words (Li et al., 2018; Xu et al., 2018). Essentially different from these previous works, our work focuses on studying how seq2seq models perceive language styles, and the competitive performance of our proposed style transfer algorithm is therefore better considered as an implicit justification of our style matrix view on language style.

Style in Other Domains. Style is also studied in other domains, especially in computer vision. The groundbreaking works by Gatys et al. (2015, 2016) showed that the Gram matrices of the feature maps extracted by a pre-trained convolutional neural network are able to capture the visual style of an image. They were immediately followed by numerous works that transfer style by matching the generated Gram matrices (e.g. Ulyanov et al., 2016, 2017; Johnson et al., 2016; Chen et al., 2017), and Li et al. (2017) theoretically proved that this is equivalent to minimizing the maximum mean discrepancy between two distributions.

6 Conclusion

In this paper, we have investigated the style matrix encoded by seq2seq models as an informative delegate to language style. The notion of style matrix conforms well to human experiences and existing linguistic theories on language style. In practice, we have also proposed NS algorithm as a plug-and-play solution to unpaired text style transfer which achieved competitive transfer quality with the state-of-the-art methods and meanwhile showed superior flexibility in various use cases. In the future, we plan to discuss how the quality of semantic vectors impacts the informativeness of style matrix and study what is encoded in higher-order statistics of semantic vectors.


Appendix A Omitted Experimental Details

A.1 Dataset Statistics

The table below gives the statistics of the Yelp and Amazon datasets used in our experiments.

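| Dataset | Attribute | Train | Dev | Test |
|---|---|---|---|---|
| Yelp 9603 | Positive | 173K | 37614 | 76392 |
| | Negative | 263K | 24849 | 50278 |
| | Past | 82K | 10682 | 22499 |
| | Present | 117K | 13455 | 26362 |
| Amazon 33640 | Positive | 100K | 38319 | 957 |
| | Negative | 100K | 35899 | 916 |
| | Past | 63K | 5K | 1K |
| | Present | 129K | 5K | 1K |

Table 6: Dataset statistics.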

A.2 Procedures to Produce Fig. 1

To visualize the stylistic similarities and differences that the style matrix captures, we computed four style matrices for the corpora associated with different ratings, using the style matrix extraction method introduced in Sec. 4.3.1. For each style matrix, we took the first and second eigenvectors, corresponding to the first two eigenvalues, and projected them onto the 2-D plane with Multi-Dimensional Scaling (MDS).
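
The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to produce Fig. 1: the semantic vectors are random placeholders, "first two eigenvalues" is read as the two largest, and classical (Torgerson) MDS on Euclidean distances stands in for whichever MDS variant was actually used. All function and variable names are hypothetical.

```python
import numpy as np


def style_matrix(vectors):
    """Covariance matrix of semantic vectors (one row per sentence)."""
    return np.cov(vectors, rowvar=False)


def top_eigenvectors(matrix, k=2):
    """Eigenvectors of a symmetric matrix for its k largest eigenvalues, as rows."""
    eigvals, eigvecs = np.linalg.eigh(matrix)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, order].T


def classical_mds(points, n_components=2):
    """Classical MDS: embed points so that Euclidean distances are preserved."""
    n = points.shape[0]
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ sq_dist @ centering  # double-centered distances
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Placeholder semantic vectors for four corpora (assumption: 200 sentences, 16 dims).
rng = np.random.default_rng(0)
dim = 16
corpora = {
    f"rating_{r}": rng.normal(scale=1.0 + 0.3 * r, size=(200, dim))
    for r in range(1, 5)
}

# Two leading eigenvectors per corpus, stacked into an (8, dim) array.
points = np.vstack([top_eigenvectors(style_matrix(v)) for v in corpora.values()])

# 2-D coordinates, one row per eigenvector, ready for a scatter plot.
embedding = classical_mds(points)
```

Note that eigenvectors are only defined up to sign, so a real pipeline may need to fix a sign convention before comparing eigenvectors across corpora.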

A.3 Sampled Sentences with Different Style Transfer Methods

From negative to positive (Yelp)

Source: the food tasted awful .
Cross Aligned: the food is amazing .
Style Embedding: the food tasted awful .
Ours: the food tasted amazing .

Source: i love the food … however service here is horrible .
Cross Aligned: i love the food here is great service great .
Style Embedding: i love the food … however service here is horrible .
Ours: i love the food , service here is great .

Source: customer service is horrible , and their prices are well above internet pricing .
Cross Aligned: great service , and prices are great , well quality their people .
Style Embedding: customer service is horrible , and their prices are well above internet pricing .
Ours: customer service is excellent and their prices are nice internet pricing .

From positive to negative (Yelp)

Source: lol , we all love love love this deli .
Cross Aligned: then , we love , but i love this salon .
Style Embedding: lol , we all love love love this deli .
Ours: lol , everyone really do n’t love this deli .

Source: one of the best service experiences i ’ve ever had .
Cross Aligned: one of the time i would i had ever had to .
Style Embedding: one of the best service experiences i ’ve ever had .
Ours: one of the worst service experiences i ’ve ever had .

Source: would definitely recommend this place for anyone looking for a good sandwich .
Cross Aligned: would not recommend this place for a good for _num_ for a food .
Style Embedding: would definitely recommend this place for anyone looking for a good sandwich .
Ours: would not recommend this place for anyone looking for a good sandwich .

From negative to positive (Amazon)

Source: there are so many expensive natural products out there .
Cross Aligned: there are more than other than there are there .
Style Embedding: there are so many expensive natural products out there .
Ours: there are so many natural products out there .

Source: toaster looks better but performs far worse than $ ones i ve had .
Cross Aligned: the price works great as far as far i have ever ordered .
Style Embedding: toaster looks better but performs much similar awesome if i were this phone .
Ours: toaster looks better but far better than $ ones i ve had .

From positive to negative (Amazon)

Source: the product was delivered on the agreed date .
Cross Aligned: the product was was on the same problem .
Style Embedding: the product was delivered on the page said .
Ours: the product was delivered on the expiration date .

Source: i bought a second one with the same wonderful results .
Cross Aligned: i bought a replacement one of the same unk problem .
Style Embedding: i bought a second one with the same wonderful results .
Ours: i bought a second one with the same results .

A.4 More Samples by NS Algorithm on Out-of-Domain Style Transfer

Yelp on Tense

Source: staff are nice and friendly .
Past: staff were nice and friendly .
Pres: staff are nice and friendly .

Source: i never realized the beauty of the desert until i moved here !
Past: i never realized the beauty of the desert until i moved here !
Pres: i never assume the beauty of the desert i ’m coming here !

Amazon on Attitude

Source: yep , i thought $ was a pretty good price .
Neg: yep , i thought $ was not a pretty good price .
Pos: yep , i thought $ was pretty good .

Source: i even tried it once and it is absolutely delicious .
Neg: i even tried it once and it is absolutely just seasoned .
Pos: i even tried it once and it is delicious !