code/data for "Structuring Latent Spaces for Stylized Response Generation" (Gao et al., EMNLP'19)
Generating responses in a targeted style is a useful yet challenging task, especially in the absence of parallel data. With limited data, existing methods tend to generate responses that are either less stylized or less context-relevant. We propose StyleFusion, which bridges conversation modeling and non-parallel style transfer by sharing a structured latent space. This structure allows the system to generate stylized relevant responses by sampling in the neighborhood of the conversation model prediction, and to continuously control the style level. We demonstrate this method using dialogues from Reddit data and two sets of sentences with distinct styles (arXiv and Sherlock Holmes novels). Automatic and human evaluation show that, without sacrificing appropriateness, the system generates responses of the targeted style and outperforms competitive baselines.
A social chatbot designed to establish long-term emotional connections with users must generate responses that not only match the content of user input and context, but also do so in a desired target style Zhou et al. (2018); Li et al. (2016b); Luan et al. (2016); Gao et al. (2019a). A conversational agent that speaks in a polite, professional tone is likely to facilitate service in customer relationship scenarios; likewise, an agent that sounds like a cartoon character or a superhero can be more engaging in a theme park. Mastery of response style is also an important step towards human-like chatbots. As highlighted in social psychology studies Niederhoffer and Pennebaker (2002a, b), when two people are talking, they tend to match each other's linguistic style, sometimes even regardless of their intentions. Achieving this level of performance, however, is challenging. Lacking parallel data in different conversational styles, researchers often resort to what we will term style datasets, which are in non-conversational format (e.g., news, novels, blogs). Since the contents and formats of these are quite different from conversation data, existing approaches tend to generate responses that are either less style-specific Luan et al. (2017) or less context-relevant Niu and Bansal (2018).
We suggest that this trade-off between appropriateness and style stems from profound differences between conversation and style datasets in format, style, and content that impede joint learning. One approach has been to combine the two only during decoding: Niu and Bansal (2018) trained two models separately, a Sequence-to-Sequence (S2S) model Sutskever et al. (2014) on a conversation dataset and a language model (LM) on a style dataset. At inference time, they take a weighted average of the token probability distributions of the two models to predict the next token. This forced bias, however, degrades output relevance. An alternative approach attempts to map the two datasets into the same latent space: Luan et al. (2017) use multi-task learning to train a S2S model on a conversation dataset and an autoencoder (AE) on a style dataset. Gao et al. (2019b) point out that the two datasets still form separate clusters in the latent space; below we observe that this leads to low style intensity in generated responses (Section 5).
We propose to bridge conversation modeling and non-parallel style transfer by structuring a shared latent space using novel regularization techniques that we dub StyleFusion. In contrast to Luan et al. (2017), the two datasets are well aligned in the latent space: we generalize SpaceFusion Gao et al. (2019b) (integrated into the Microsoft Icecaps toolkit Shiv et al. (2019), https://github.com/microsoft/icecaps), an approach that aligns latent spaces for paired samples, to non-parallel datasets. In the structured shared latent space, stylized sentences are nudged adjacent to semantically related conversations, thereby allowing the system to generate style-specific relevant responses by sampling in the neighborhood of the model prediction. Distance and direction from the model prediction roughly match the style intensity and content diversity of generated responses, respectively, as illustrated in Fig. 1.
We demonstrate this method using dialogues from Reddit data and two sets of sentences with distinct styles (arXiv and Sherlock Holmes novels). Automatic and human evaluation show that, without sacrificing appropriateness, our system can generate responses in a targeted style and outperforms competitive baselines.
Our contribution can be summarized thus: 1) We introduce an end-to-end approach that generates style-specific responses from conversational data and non-parallel non-conversation style data. 2) We generalize the SpaceFusion model of Gao et al. (2019b) to non-parallel data by a new regularization method. 3) We present a visualization analysis that provides intuitive insights into the drawbacks of alternative approaches.
Text style transfer is a related but distinct task. It usually preserves the content Yang et al. (2018); Hu et al. (2017); Fu et al. (2018); Shen et al. (2017); Gong et al. (2019). In contrast, the content of conversational responses to a given context can be semantically diverse. Various approaches have been proposed for the non-parallel data setup. Fu et al. (2018) proposed separate decoders for different styles and a classifier to measure style strength. Shen et al. (2017) proposed to map texts of two different styles into a shared latent space where the "content" information is preserved and "style" information is discarded; an adversarial discriminator is used to align the latent spaces of the two styles. However, Yang et al. (2018) point out the difficulty of training an adversarial discriminator and instead proposed using language models as discriminators. Like Shen et al. (2017); Yang et al. (2018), we align latent spaces for different styles. However, we also align latent spaces encoded by different models (S2S and AE).
Stylized response generation is a relatively new task. Akama et al. (2017) use a stylized conversation corpus to fine-tune a conversation model pretrained on a background conversation dataset. However, stylized texts are usually in non-conversational format, as in the present setting. Niu and Bansal (2018) proposed a method that takes, as the token probability, a weighted average of the distribution predicted by a S2S trained on a background conversational dataset and that predicted by a LM trained on a style dataset. They observed reduced relevance and attributed this to the fact that the LM was not trained to attend to the conversation context, and the S2S was not trained to learn style. In contrast, we jointly learn from conversation and style datasets during training. Niu and Bansal (2018) also proposed label-fine-tuning, but this is limited to scenarios where a reasonable portion of the conversational dataset is in the target style, which is not always the case.
Persona-based generation Li et al. (2016b); Luan et al. (2017) aims to generate responses mimicking a speaker. It is closely related to the present task, since persona is, broadly speaking, the manifestation of a type of style. Li et al. (2016b) feed a speaker ID to the decoder to promote generation of responses for that target speaker; however, this approach cannot use non-conversational data. Luan et al. (2017) applied a multi-task learning approach to utilize non-conversational data: a S2S model, taking in conversational data, and an autoencoder (AE), taking in non-conversational data, share the decoder and are trained alternately. However, Gao et al. (2019b) observed that sharing the decoder may not truly allow the S2S and AE to share the latent space, and thus the S2S may not fully utilize what is learned by the AE. Unlike Li et al. (2016b), who use labelled persona IDs, Zhang et al. (2019) proposed a self-supervised method to extract persona features from conversation history. This allows modeling persona dynamically, consistent with the fact that even the same person can speak in different styles in different scenarios.
Multi-task learning McCann et al. (2018); Liu et al. (2019); Luan et al. (2017); Gao et al. (2019b); Zhang et al. (2017) aggregates the strengths of each specific task and induces regularization effects Liu et al. (2019), as the model is trained to learn a more universal representation. However, a simple multi-task approach Luan et al. (2017) may learn separate representations for each dataset Gao et al. (2019b). To address this, in previous work Gao et al. (2019b), we proposed the SpaceFusion model, featuring a regularization technique that explicitly encourages alignment of latent spaces for a universal representation. SpaceFusion, however, is designed only for paired samples. In this paper, we generalize SpaceFusion to non-parallel datasets.
Let D = {(x_i, y_i)} denote a conversation dataset, where x_i and y_i are context sentences and a corresponding response, respectively. A context x consists of one or more utterances, and a response y is only one utterance. S = {s_i} denotes a non-conversational style dataset, where s_i is a sentence sampled from a corpus of the targeted style. Samples from S do not have a labelled correspondence with samples from D (thus "non-parallel"). Our aim is to train a model jointly on D and S to generate, for a given context, appropriate responses in a style similar to the sentences from S. The given context may or may not be in the target style.
In contrast to SpaceFusion Gao et al. (2019b), which only fuses context-response pairs, our goal is to additionally map related stylized sentences to points surrounding the context in the shared latent representation space. The system can then generate relevant stylized responses by sampling in the neighborhood of the prediction based on the context.
As illustrated in Figure 2, the model consists of a sequence-to-sequence (S2S) module and an autoencoder (AE) module that share a decoder. We use the S2S encoder to produce the prediction representation z_{x→y}, or "latent action" Zhao (2019), and the AE encoder to obtain the representations z_y of the corresponding responses and z_s of the stylized sentences. We use generalized regularization terms, fusion and smoothness, to align the three latent spaces Z_{x→y}, Z_y, and Z_s.
The fusion terms encourage different latent spaces to be close to each other. Accordingly, we define the cross-latent-space distances to be minimized. For response appropriateness, as x_i and y_i are paired as context and response, we use their pair-wise dissimilarity, following Gao et al. (2019b):

d_fuse(Z_{x→y}, Z_y) = (1/n) Σ_i d(z_{x→y}^(i), z_y^(i))

where n is the batch size, D is the dimension of the latent space, and we use d, the Euclidean distance in latent space, as the dissimilarity.
For style transfer, however, y_i and s_i are not paired. Thus, we instead minimize the distance between a point and its nearest neighbor from the other dataset, to pull the two datasets close to each other in the shared latent space:

d_fuse(Z_s, Z_y) = (1/n) Σ_i d(z_s^(i), NN(z_s^(i), Z_y))

which is the batch average of the distance between z_s^(i) and NN(z_s^(i), Z_y) – the nearest neighbor (NN) of z_s^(i) from the set Z_y.
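The nearest-neighbor fusion distance can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the authors' released code; the function names (`nn_distance`, `fusion_distance`) are our own.

```python
import numpy as np

def nn_distance(a, b):
    # Average Euclidean distance from each latent point in batch `a`
    # to its nearest neighbor in batch `b`.
    diff = a[:, None, :] - b[None, :, :]      # shape (|a|, |b|, dim)
    d = np.sqrt((diff ** 2).sum(axis=-1))     # pairwise distances
    return d.min(axis=1).mean()

def fusion_distance(z_style, z_resp):
    # Symmetrized version: pull stylized-sentence latents toward nearby
    # response latents and vice versa, so the two non-parallel sets fuse
    # in the shared latent space when this quantity is minimized.
    return nn_distance(z_style, z_resp) + nn_distance(z_resp, z_style)
```

Minimizing this quantity during training draws each stylized sentence toward its semantically closest conversational neighbor without requiring any pairing between the two datasets.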
While minimizing the cross-latent-space distances, we want the samples from the same latent space to spread out, following Gao et al. (2019b). For this purpose, Gao et al. (2019b) maximized the average capped distance between points from the same latent space. However, we found that the results are sensitive to the cap value. Instead, we define the following nearest-neighbor-based characteristic distance:

d_char(Z) = (1/n) Σ_i d(z^(i), NN(z^(i), Z \ {z^(i)}))
Combining these loss terms we have the following two objectives:
The smoothness terms encourage smooth semantic transition in the shared latent space. For response appropriateness, following Gao et al. (2019b), we encourage the interpolation between the prediction z_{x→y} and the target response representation z_y to generate the target response y:

z_interp = u z_{x→y} + (1 − u) z_y + ε,  u ~ U(0, 1)

where ε is Gaussian noise with zero mean and covariance matrix σ² I.
For style transfer, as we move from a non-stylized sentence representation z_y to that of a random stylized sentence z_s, we expect to generate a partially stylized sentence, and we encourage the generated sentence to gradually change from y to s.
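A minimal sketch of this latent interpolation, assuming latent codes are NumPy vectors; the function name and the default `sigma` are our own illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def interpolate_latent(z_a, z_b, u, sigma=0.1):
    # Point on the segment between two latent codes, perturbed by
    # zero-mean Gaussian noise. During training, the decoder is asked
    # to produce an output that shifts from sentence `a` toward
    # sentence `b` as the mixing weight `u` goes from 0 to 1.
    z = (1.0 - u) * z_a + u * z_b
    return z + rng.normal(0.0, sigma, size=z.shape)
```

Feeding such interpolated codes to the shared decoder is what enforces a smooth semantic transition across the fused latent space.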
The overall loss to be minimized is a combination of a vanilla S2S loss and the above regularization terms (more generally, one may use a weighted sum of these terms; we set them equally weighted for simplicity). The style fusion and style smoothness terms are new and do not exist in Gao et al. (2019b). Collecting these terms in a more compact form yields the final objective.
For the case where S is much smaller than D, as in the present work, the model may overfit on S. We propose to first pretrain the model on D only (by setting the style fusion and style smoothness terms to zero), then continue training on both D and S. Furthermore, to reduce overfitting, we applied a data augmentation technique that randomly masks tokens in S with a special out-of-vocabulary token. The masking probability of a token is inversely proportional to its frequency in the training data.
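The frequency-based masking can be sketched as follows. This is our illustrative reading of the scheme; the scaling constant `alpha` and the function name are assumptions, not values given in the paper.

```python
import random
from collections import Counter

def mask_rare_tokens(sentences, rng, oov="<oov>", alpha=1.0):
    # Replace each token with an out-of-vocabulary symbol with
    # probability inversely proportional to its corpus frequency,
    # so rare (easily memorized) tokens are masked most often.
    freq = Counter(tok for sent in sentences for tok in sent)
    out = []
    for sent in sentences:
        out.append([oov if rng.random() < min(1.0, alpha / freq[tok]) else tok
                    for tok in sent])
    return out
```

With `alpha = 0` no token is ever masked; a token that appears only once is always masked, which is the behavior that discourages memorizing the small style corpus.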
Following Gao et al. (2019b), we sample in the neighborhood of the prediction z_{x→y} by adding a noise vector of a given length in a direction randomly drawn from the uniform distribution. As the noise length depends on D, the dimension of the latent space, we define a normalized radius r.
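A sketch of sampling at a normalized radius, under the assumption that the step length is scaled by sqrt(D) so that `r` is comparable across latent dimensions; the scaling convention and the function name are our own.

```python
import numpy as np

def sample_around(z_pred, r, rng):
    # Take a step of normalized radius `r` from the model prediction
    # in a direction drawn uniformly at random. The actual step length
    # grows with sqrt(dim), keeping `r` dimension-independent.
    direction = rng.uniform(-1.0, 1.0, size=z_pred.shape)
    direction /= np.linalg.norm(direction)
    return z_pred + r * np.sqrt(z_pred.shape[-1]) * direction
```

Decoding from points sampled this way at increasing `r` is what produces the gradually more stylized responses discussed below.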
As stylized texts are usually sparse, it is possible to generate non-stylized hypotheses as we vary the sampled point along some direction. Thus, we rank the hypotheses considering both relevance and style intensity.
The ranking score combines a relevance term with a style term weighted by w, where w = 0.5 unless otherwise specified; the relevance term estimates the relevance of the hypothesis to the context, and the style term is the probability of the hypothesis being in the targeted style, as predicted by pretrained classifiers.
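One plausible realization of such a ranking is a convex combination of the two terms, sketched below; the exact functional form of the paper's Eq. 15 may differ, and the names here are our own.

```python
def rank_hypotheses(hyps, relevance, style_prob, w=0.5):
    # Score each candidate by a convex combination of an estimated
    # relevance score and the classifier-predicted style probability,
    # then return the candidates sorted from best to worst.
    scores = [(1.0 - w) * rel + w * sty
              for rel, sty in zip(relevance, style_prob)]
    order = sorted(range(len(hyps)), key=lambda i: scores[i], reverse=True)
    return [hyps[i] for i in order]
```

Increasing `w` favors stylized but possibly less relevant candidates, matching the trade-off the ranking is designed to balance.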
We considered two style classifiers: a "neural" classifier based on two stacked GRU Cho et al. (2014) cells, and an "ngram" classifier, which is a logistic regressor using n-gram (n = 1, 2, 3, 4) multi-hot features. Both classifiers are trained using responses from D as negative samples and sentences from S as positive samples. The style probability is computed by averaging the predictions of these two classifiers.
We experiment with two tasks: generating arXiv-like and Holmes-like responses, respectively, using the datasets summarized in Table 1.
(i) Reddit is a conversation dataset constructed from posts and comments on Reddit.com during 2011 (using raw data collected by a third party: http://files.pushshift.io/reddit/comments/), consisting of 10M pairs of context and response. (ii) arXiv is a non-conversational dataset extracted from articles on arXiv.org from 1998 to 2002 (from KDD Cup 2003: http://www.cs.cornell.edu/projects/kddcup/datasets.html), consisting of 1M sentences. (iii) Holmes is another non-conversational dataset extracted from the Sherlock Holmes novel series by Arthur Conan Doyle (from https://gutenberg.org), with 38k sentences.
The test set with stylized reference responses is constructed by filtering the Reddit dataset from the year 2013 using the trained neural and ngram classifiers. For each context, there are at least 4 reference responses approximately in the targeted style. The style intensity of the context is not filtered.
We designed the following two tasks.
The appropriateness measurement task presents a context and a set of hypotheses (from the present and baseline systems); for each hypothesis, annotators choose the option that best fits the quality of the response: ok, marginal, or bad (generic or irrelevant), which we map to numerical scores 1, 0.5, and 0, respectively.
The style measurement task presents a hypothesis and two groups of example sentences, one from Reddit and one from the style corpus (Holmes or arXiv). Crowd-sourced annotators then judge whether the hypothesis is more similar to the Reddit group, not sure, or more similar to the style corpus group. We map these to numerical scores 0, 0.5, and 1, respectively.
For all tasks, the hypotheses of the different systems for the same set of 500 randomly selected contexts are presented in random order, and the identity of the system is invisible to annotators. Each sample is judged by 5 annotators individually.
For style intensity evaluation, besides the neural and ngram classifier predictions (Section 3.3), we also use simple word counting (hereafter the "count" metric) to minimize model-specific effects. We first construct a training corpus with balanced positive (from S) and negative (responses sampled from D) samples. Then, for each word that appears in more than 5 sentences in the training corpus, we compute the average style intensity of the sentences that contain this word. The top words of highest style intensity are chosen as the keywords of this style. For a test corpus, we compute the average ratio of words that are keywords of a style as its "count" style metric.
Besides the overall style comparisons (Reddit vs. Holmes, and Reddit vs. arXiv), we also crowd-sourced three sets of sentences with human-labeled levels in three finer styles (how formal, emotional, and technical each sentence is) and built corresponding keyword lists for the count metric.
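The count metric can be sketched as below. The thresholds (`min_sents`, `top_k`) and function names are illustrative assumptions, not the paper's exact settings; sentences are represented as token lists.

```python
from collections import Counter

def style_keywords(pos, neg, min_sents=5, top_k=100):
    # Score each word by the fraction of its containing sentences that
    # are stylized, keeping only words seen in enough sentences, and
    # return the highest-scoring words as the style keyword list.
    labeled = [(s, 1.0) for s in pos] + [(s, 0.0) for s in neg]
    hits, totals = Counter(), Counter()
    for sent, label in labeled:
        for w in set(sent):
            hits[w] += label
            totals[w] += 1
    scored = {w: hits[w] / totals[w] for w in totals if totals[w] >= min_sents}
    return set(sorted(scored, key=scored.get, reverse=True)[:top_k])

def count_style_intensity(corpus, keywords):
    # Average per-sentence ratio of style keywords: the "count" metric.
    ratios = [sum(w in keywords for w in s) / max(1, len(s)) for s in corpus]
    return sum(ratios) / len(ratios)
```

Because it relies only on word counts, this metric provides a sanity check that is independent of the trained classifiers.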
We compare the following baseline systems.
The first category is generative models. (i) MTask refers to the vanilla multi-task learning model proposed in Luan et al. (2017), trained on both D and S. (ii) S2S+LM refers to the method proposed by Niu and Bansal (2018) (referred to as "Fusion" in that paper; to avoid confusion with our StyleFusion method, we call it "S2S+LM"), which uses the weighted average of a S2S model, trained on D, and a LM, trained on S, as the token probability distribution at inference time.
The second category draws a training sample as the hypothesis. (iii) Retrieval refers to a simple retrieval system that returns the sentence from S with the highest generation probability under the MTask model. (iv) Rand is a system that randomly picks a sentence from S. (v) Human refers to a system that randomly picks one of the multiple reference responses for the given context from the test set.
StyleFusion and the trainable baselines, MTask and S2S+LM, use two stacked GRU Cho et al. (2014) cells for encoders and decoders, with 1000-dimensional hidden states. The word embedding is also 1000-dimensional, trained from random initialization. The variance of the noise ε is set to a fixed value. The state of the top layer of the encoder GRU cell is used as the latent vector, which serves as the initial state of all layers of the decoder. All trainable models are trained with the ADAM method Kingma and Ba (2014) with a learning rate of 0.0003. For StyleFusion and MTask, we first train on D for 2 epochs, and then continue training on both D and S until convergence (approximately one pass of arXiv and 5 passes of Holmes). For all systems except Rand and Retrieval, we use the ranking method of Eq. 15 to select the top hypothesis from 100 candidates.
context: "Do you want to play a game?"
towards "The conclusion depends on the scenario.":
  "The answer is yes."
  "The answer depends on the game."
towards "This would be an interesting viewpoint.":
  "This is a good idea."
  "This would be an interesting experience"
towards "This is not a desirable characteristic.":
  "I don't play it."
  "This is not a valid question."
By leveraging the structure of the shared latent space, we can modulate the style intensity by the sampling radius r, as illustrated by the examples in Table 2. For example, given the context "Do you want to play a game?", the hypothesis generated from the prediction itself is "I do", which is non-stylized. Moving towards the representation of the arXiv-style sentence "This would be an interesting viewpoint", the responses gradually change to "This would be an interesting experience", which remains relevant but is more similar to the target style. Similar trends can be observed when moving in the other directions, towards "The conclusion depends on the scenario" and "This is not a desirable characteristic". This also shows that the contents are affected by the direction, a desired property inherited from SpaceFusion models.
The relation between style intensity and r is further confirmed by automatic measurement. As illustrated in Fig. 3, as r increases, responses come to resemble the targeted style within the depicted range. In contrast, the style intensity of MTask outputs rises only slightly as r increases.
The increase in overall style intensity is coupled with changes at finer stylistic granularity, as illustrated in Fig. 4. Compared to Reddit, arXiv is less emotional, and more formal and technical. Consistent with this, StyleFusion outputs exhibit less emotion, but become much more technical and formal as r increases. MTask, however, tends only to show an increased technical level, and fails to become less emotional and more formal, inconsistent with the target style. Where Holmes is the target, the emotional and technical levels do not change significantly compared to Reddit, but Holmes is stylistically more formal. StyleFusion captures these trends, whereas MTask outputs are insufficiently formal, as shown in Fig. 4 (lowest panel).
We also measured the BLEU4 score at different r, as shown in Fig. 5. Besides the stylized references, we also tested on held-out Reddit data (i.e., non-stylized references). We observed that at smaller r, the StyleFusion outputs are relevant to the context and their style is more similar to Reddit, indicated by the relatively high BLEU4 computed using non-stylized references. At greater radius, BLEU4 rises when calculated on arXiv-like and Holmes-like references, indicating that the generated responses remain relevant but are closer to the targeted style. Combined with the small-radius case, the implication is that, although the style gradually changes, the responses generated by StyleFusion remain relevant over a relatively wide range of radii.
context: "Okay, but can we host it in the cloud?"
StyleFusion: "It would be an interesting experiment. It is a possibility."
S2S: "I think it might be a bit of a stretch."
MTask: "Yes, yes you can."
S2S+LM: "The relevant question is, as to this, is you referring to the casimir energy program?"
Retrieval: "Even though this issue can be analyzed in more detail, we shall content ourselves with some simple statements."
Rand: "In particular, it involves the combination equation"
Human: "To the cloud!"

context: "The universe has no edge."
StyleFusion: "Does this mean the universe is expanding?"
S2S: "The universe is infinite."
MTask: "The universe is finite."
S2S+LM: "The universe can be on the infinite loop."
Retrieval: "Is there really any difference in the low energy physics?"
Rand: "In other words, the wave function in equation."
Human: "So… a sphere?"

context: "First step: go check behind the cabinet and see what's back there."
StyleFusion: "I would advise against this."
S2S: "I'm going to try this next time."
MTask: "I'll be back in a few minutes."
S2S+LM: "I've got a little time that i had not been engaged yet."
Retrieval: "i was thinking the same thing."
Rand: "Holmes shrugged his shoulders."
Human: "Then there would be nothing, and that would be the answer."

context: "that's a surprisingly large amount of police coming to see a *** teenager."
StyleFusion: "What is this crime you speak of?"
S2S: "I'm pretty sure that's the only thing that's going to happen."
MTask: "I'm not sure what you mean by that."
S2S+LM: "I should be on the same business."
Rand: "I shall be back in an hour or two."
Human: "Must have feared what he was packin'"
As illustrated by the MDS Borg and Groenen (2003) visualization in Fig. 6, MTask cannot align the different latent spaces: not only those from different models (Z_{x→y} and Z_y), but also those from the same model that have different styles (Z_y and Z_s). SpaceFusion Gao et al. (2019b) aligns Z_{x→y} and Z_y better, but Z_s forms a separate cluster, indicating that the conversation dataset and style dataset remain unaligned in the latent space. This is because SpaceFusion was not designed to align non-parallel samples. The separation between the conversation dataset and style dataset in latent space, as is the case for MTask and SpaceFusion, makes it difficult for the conversation model to use style knowledge. In contrast, StyleFusion aligns all three latent spaces well, as evidenced by Fig. 6.
Human evaluation results are presented in Table 5. As in the automatic evaluation results, StyleFusion and MTask show the highest appropriateness (not statistically different) apart from the Human system. However StyleFusion outputs are much more stylized. Rand, Retrieval and S2S+LM tend to generate stylized but irrelevant responses. To make the overall trends sharper, following Gao et al. (2019b), we compute the harmonic mean of appropriateness and style intensity, in terms of which StyleFusion outperforms all baselines except the Human system. Additional examples of the system outputs and human responses are provided in Table 3 and Table 4
|sampled () or generated||neural||ngram||count||BLEU1||BLEU2||BLEU3||BLEU4||entropy4||distinct1||distinct2|
|target = arXiv|
|target = Holmes|
The automatic evaluation results for the arXiv-like and Holmes-like response generation tasks are presented in Table 6. In both instances, StyleFusion achieved relatively high BLEU and showed high style intensity. The Rand baseline has the highest style intensity but the lowest relevance. S2S+LM has style intensity comparable to StyleFusion, but its BLEU is much lower, consistent with the observation made by Niu and Bansal (2018). MTask shows significantly less style intensity than StyleFusion. Moreover, MTask's diversity, as measured by entropy4 and distinct1,2, is much lower, indicating that outputs of this model tend to be bland. Adding the paired-sample regularization (i.e., SpaceFusion) increases diversity, relevance, and style intensity slightly, consistent with the findings in Gao et al. (2019b). Style intensity is further boosted by the addition of the style fusion term, while relevance and diversity are not significantly affected by this addition.
We propose a regularized multi-task learning approach, StyleFusion, that bridges conversation models and non-parallel style transfer by structuring a shared latent space. This structure allows the system to generate stylized relevant responses by sampling in the neighborhood of the model prediction, and to continuously control style intensity by modulating the sampling radius. We demonstrate this method in two tasks: generating arXiv-like and Holmes-like conversational responses. Automatic and human evaluation show that, without sacrificing relevance, the system generates responses of the targeted style and outperforms competitive baselines. In future work, we will use this technique to distill information from other non-parallel datasets, such as external informative text Qin et al. (2019); Galley et al. (2019).
Generating stylistically consistent dialog responses with transfer learning. In IJCNLP, pp. 408–412.
On the properties of neural machine translation: encoder–decoder approaches. In SSST-8, pp. 103–111.
Thirty-Second AAAI Conference on Artificial Intelligence.
Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504.
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Vol. 1, pp. 605–614.