Style transfer aims at migrating the content of a sample from a source style to a target style. Recently, great progress has been achieved by applying deep neural networks to redraw an image in a particular style (Kulkarni et al., 2015; Liu and Tuzel, 2016; Gatys et al., 2016; Zhu et al., 2017; Luan et al., 2017). However, so far very few approaches have been proposed for style transfer of natural language sentences, i.e., changing the style or genre of a sentence while preserving its semantic content. For example, we would like a system that can convert a given text piece into the language of Shakespeare (Mueller et al., 2017), rewrite product reviews with a favored sentiment (Shen et al., 2017), or generate responses with a consistent persona (Li et al., 2016).
A big challenge faced by language style transfer is that large-scale parallel data are unavailable. However, parallel data are necessary for most text generation frameworks, such as the popular sequence-to-sequence models (Sutskever et al., 2014; Bahdanau et al., 2014; Rush et al., 2015; Nallapati et al., 2016; Paulus et al., 2017). Hence these methods are not applicable to the language style transfer problem. A few approaches have been proposed to deal with non-parallel data (Hu et al., 2017; Shen et al., 2017). Most of these approaches try to learn a latent representation of the content disentangled from the source style, and then recombine it with the target style to generate the corresponding sentence.
All the above approaches assume that data have only two styles, and their task is to transfer sentences from one style to the other. However, in many practical settings, we may deal with sentences of arbitrary unknown styles. Consider, for example, building a chatbot. A good chatbot needs to exhibit a consistent persona so that it can gain the trust of users. However, existing chatbots such as Siri, Cortana, and XiaoIce (Shum et al., 2018) lack the ability to generate responses with a consistent persona throughout a conversation. Table 1 shows some examples: the chatbot responds to the user with varying sentiments (neutral, positive, or negative). One possible solution is to transfer the generated chatbot responses into a target persona before sending them to users. Hence, in this paper, we study the setting of language style transfer in which the source data to be transferred can have arbitrary unknown styles.
Another challenge in language style transfer is that the transferred sentence should preserve the content of the original sentence disentangled from its style. To tackle this problem, Shen et al. (2017) assumed the source and target domain share the same latent content space, and trained their model by aligning these two latent spaces. Hu et al. (2017) constrained that the latent content representation of the original sentence could be inferred from the transferred sentence. However, these attempts considered content modification in the latent content space but not the sentence space.
The contribution of this paper mainly consists of the following three parts:
We address a new style transfer task in which sentences in the source domain can have arbitrary language styles, while those in the target domain have only one language style.
We propose a style discrepancy loss to learn disentangled representations of content and style. This loss enforces that the discrepancy between an arbitrary style representation and the target style representation be consistent with the closeness of the sentence's style to the target style. Additionally, we employ a cycle consistency loss to avoid content change.
We evaluate our model in three tasks: sentiment modification of restaurant reviews, dialog response revision with a romantic style, and sentence rewriting with a Shakespearean style. Experimental results show that our model surpasses the state-of-the-art style transfer model (Shen et al., 2017) in these three tasks.
2 Related Work
Image Style Transfer: Most style transfer approaches in the literature focus on vision data. Kulkarni et al. (2015) proposed to disentangle the content representations from image attributes, and to control image generation by manipulating the graphics code that encodes the attribute information. Gatys et al. (2016) used Convolutional Neural Networks (CNNs) to learn separate representations of the image content and style, and then created a new image from their combination. Some approaches align the two data domains with the idea of generative adversarial networks (GANs) (Goodfellow et al., 2014). Liu and Tuzel (2016) proposed a coupled GAN framework to learn a joint distribution of multi-domain data via a weight-sharing constraint. Zhu et al. (2017) introduced a cycle consistency loss, which minimizes the gap between the transferred images and the original ones. However, due to the discreteness of natural language, this loss function cannot be directly applied to text data. In our work, we show how the idea of cycle consistency can be used on text data.
Text Style Transfer: To handle the non-parallel data problem, Mueller et al. (2017) revised the latent representation of a sentence in a direction guided by a classifier, so that the decoded sentence imitates those favored by the classifier. Ficler and Goldberg (2017) encoded textual property values with embedding vectors, and adopted a conditioned language model to generate sentences satisfying the specified content and style properties. Li et al. (2018) demonstrated that a simple delete-retrieve-generate approach could achieve good performance on sentiment transfer tasks. Hu et al. (2017) used the variational auto-encoder (VAE) (Kingma and Welling, 2013) to encode the sentence into a latent content representation disentangled from the source style, and then recombined it with the target style to generate its counterpart. Shen et al. (2017) considered transferring between two styles simultaneously. They utilized adversarial training to align the generated sentences from one style to the data domain of the other style. We also adopt similar adversarial training in our model. However, since we assume the source domain contains data with various and possibly unknown styles, we cannot apply a discriminator to determine whether a sentence transferred from the target domain is aligned in the source domain as in Shen et al. (2017).
3 Problem Formulation
We now formally present our problem formulation. Suppose there are two data domains: a source domain $\mathcal{X}_1$, in which each sentence may have its own language style, and a target domain $\mathcal{X}_2$ consisting of data with the same language style. During training, we observe $n$ samples from $\mathcal{X}_1$ and $m$ samples from $\mathcal{X}_2$, denoted as $\{x_1^{(i)}\}_{i=1}^{n}$ and $\{x_2^{(j)}\}_{j=1}^{m}$. Note that we can hardly find a sentence pair $(x_1^{(i)}, x_2^{(j)})$ that describes the same content. Our task is to design a model to learn from these non-parallel training data such that for an unseen testing sentence $x_1 \in \mathcal{X}_1$, we can transfer it into its counterpart $x_2^*$, where $x_2^*$ should preserve the content of $x_1$ but have the language style of $\mathcal{X}_2$.
4.1 Encoder-decoder Framework
We assume each sentence $x$ can be decomposed into two representations: one is the style representation $y$, and the other is the content representation $z$, which is disentangled from its style. Each sentence $x_1 \in \mathcal{X}_1$ has its individual style $y_1$, while all the sentences in $\mathcal{X}_2$ share the same style, denoted as $y^*$. Our model is built upon the encoder-decoder framework. In the encoding module, we assume that $y$ and $z$ of a sentence can be obtained through two encoding functions $E_y$ and $E_z$ respectively:
$$y = \mathbb{1}[x \in \mathcal{X}_1]\, E_y(x) + \mathbb{1}[x \in \mathcal{X}_2]\, y^*, \qquad z = E_z(x),$$
where $E_y$ is the style encoder, $E_z$ is the content encoder, $y^*$ is the shared target style representation, and $\mathbb{1}[\cdot]$ is an indicator function. When a sentence comes from the source domain, we use the function $E_y$ to encode its style representation. For $x$ from the target domain, the shared style representation $y^*$ is used. Both $y^*$ and the parameters of $E_y$ are learnt jointly together with the other parameters of our model.
For the decoding module, we first employ a reconstruction loss to encourage that the sentence decoded from $z$ and $y$ of a sentence $x$ can well reconstruct $x$ itself. Here, we use a probabilistic generator $G$ as the decoding function, and the reconstruction loss is:
$$\mathcal{L}_{rec}(\theta_E, \theta_G) = \mathbb{E}_{x}\left[-\log p_G(x \mid z, y)\right],$$
where $\theta_E$ and $\theta_G$ denote the parameters of the corresponding modules.
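As a minimal sketch of this reconstruction objective (the interface is illustrative, not the paper's implementation; in the actual model an RNN generator produces the per-token probabilities):

```python
import math

def reconstruction_loss(step_probs):
    """Negative log-likelihood of reconstructing a sentence.

    step_probs[t] is the probability the generator assigns to the
    reference token at step t, given the content z and style y of the
    original sentence (an illustrative interface)."""
    return -sum(math.log(p) for p in step_probs)

# A faithfully reconstructed sentence (high per-token probabilities)
# incurs a lower loss than a poorly reconstructed one.
good = reconstruction_loss([0.9, 0.8, 0.95])
bad = reconstruction_loss([0.3, 0.2, 0.4])
assert good < bad
```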
To enable style transfer using non-parallel training data, we enforce that for a sample $x_1 \in \mathcal{X}_1$, its decoded sequence $\tilde{x}_1 = G(z_1, y^*)$, given its content representation $z_1$ and the target style $y^*$, should be in the target domain $\mathcal{X}_2$. We use the idea of GANs (Goodfellow et al., 2014) and introduce an adversarial loss to be minimized in decoding. The goal of the discriminator $D_1$ is to distinguish between $\tilde{x}_1$ and $x_2$, while the generator tries to bewilder the discriminator:
$$\mathcal{L}_{adv}(\theta_E, \theta_G, \theta_{D_1}) = \mathbb{E}_{x_2}\left[\log D_1(x_2)\right] + \mathbb{E}_{x_1}\left[\log\left(1 - D_1(G(z_1, y^*))\right)\right].$$
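A hedged sketch of this adversarial objective, with the discriminator and generator losses written out over scalar probabilities (function names are illustrative, not from the paper's code):

```python
import math

def discriminator_loss(p_real, p_fake):
    """Negated GAN objective for the discriminator: it wants a high
    probability on a real target-domain sentence (p_real) and a low
    probability on a transferred sentence (p_fake)."""
    return -(math.log(p_real) + math.log(1.0 - p_fake))

def generator_loss(p_fake):
    """The generator tries to bewilder the discriminator, i.e. make the
    transferred sentence look like it came from the target domain."""
    return -math.log(p_fake)

# The better the generator fools the discriminator, the lower its loss.
assert generator_loss(0.9) < generator_loss(0.1)
```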
Remarks: The above encoder-decoder framework is under-constrained in two aspects:
The style representation $y$ of a source sentence is learnt without any constraint, so there is no guarantee that it captures the actual style of the sentence.
The discriminator $D_1$ can only encourage the generated sentence to be aligned with the target domain $\mathcal{X}_2$, but cannot guarantee that the content of the source sentence is kept intact.
To address the first problem, we propose a style discrepancy loss, which constrains the distance of the learnt $y$ from $y^*$ to be guided by another discriminator $D_s$ that evaluates the closeness of the sentence style to the target style. For the second problem, inspired by the ideas in He et al. (2016) and Zhu et al. (2017), we introduce a cycle consistency loss applicable to word sequences, which requires that the generated sentence $\tilde{x}_1$ can be transferred back to the original sentence $x_1$.
4.2 Style Discrepancy Loss
By using a portion of the training data, we can first train a discriminator $D_s$ to predict whether a given sentence $x$ has the target language style, with an output probability denoted as $p_{D_s}(x)$. When learning the decomposed style representation $y$ for a sample $x_1 \in \mathcal{X}_1$, we enforce that the discrepancy between this style representation and the target style representation $y^*$ should be consistent with the output probability from $D_s$. Specifically, since the styles are represented with embedding vectors, we measure the style discrepancy using the $\ell_2$ norm:
$$d(y, y^*) = \|y - y^*\|_2.$$
Intuitively, if a sentence has a larger probability of being considered to have the target style, its style representation should be closer to the target style representation $y^*$. Thus, we would like $d(y, y^*)$ to be positively correlated with $1 - p_{D_s}(x_1)$. To incorporate this idea in our model, we use a probability density function $f(\cdot)$, and define the style discrepancy loss as:
$$\mathcal{L}_{dis}(\theta_E) = \mathbb{E}_{x_1}\left[-p_{D_s}(x_1)\,\log f\big(d(y, y^*)\big)\right],$$
where $f$ is a valid probability density function and $D_s$ is pre-trained and then fixed. If a sentence has a large $p_{D_s}(x_1)$, incorporating the above loss into the encoder-decoder framework will encourage a large $f(d(y, y^*))$ and hence a small $d(y, y^*)$, which means $y$ will be close to $y^*$. In our experiment, we instantiate $f$ with the standard normal density for simplicity:
$$f(d) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{d^2}{2}\right).$$
However, better probability density functions can be used if we have some prior knowledge of the style distribution. With this choice of $f$, the style discrepancy loss can be equivalently minimized by:
$$\mathcal{L}_{dis}(\theta_E) = \mathbb{E}_{x_1}\left[\frac{1}{2}\,p_{D_s}(x_1)\,\|y - y^*\|_2^2\right].$$
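The following sketch instantiates the style discrepancy loss with $f$ the standard normal density; weighting the negative log-density by the classifier probability is an assumption consistent with the description above, not necessarily the paper's exact form:

```python
import math

def discrepancy(y, y_star):
    """L2 distance d(y, y*) between a style vector and the target style."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_star)))

def style_discrepancy_loss(y, y_star, p_target):
    """Style discrepancy loss for one sentence: p_target stands for the
    pre-trained discriminator's probability that the sentence already has
    the target style, and f is the standard normal density."""
    d = discrepancy(y, y_star)
    f = math.exp(-0.5 * d * d) / math.sqrt(2.0 * math.pi)
    return -p_target * math.log(f)

# For a sentence judged close to the target style (large p_target), the
# loss pulls its style vector toward y*.
y_star = [1.0, 0.0]
near = style_discrepancy_loss([0.9, 0.1], y_star, p_target=0.9)
far = style_discrepancy_loss([-1.0, 2.0], y_star, p_target=0.9)
assert near < far
```

Since $-\log f(d) = d^2/2 + \tfrac{1}{2}\log 2\pi$, minimizing this loss is equivalent, up to a constant, to minimizing the weighted squared distance.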
Note that $D_s$ is not jointly trained in our model. The reason is that, if we integrated it into the end-to-end training, we might start with a $D_s$ of low accuracy, and then our model would be inclined to optimize a wrong style discrepancy loss for many epochs and get stuck in a poor local optimum.
4.3 Cycle Consistency Loss
Inspired by He et al. (2016) and Zhu et al. (2017), we require that a sentence transferred by the generator $G$ should preserve the content of its original sentence, and thus it should have the capacity to recover the original sentence in a cyclic manner. For a sample $x_1 \in \mathcal{X}_1$ with its transferred sentence $\tilde{x}_1$ having the target style $y^*$, we encode $\tilde{x}_1$ and combine its content with the original style $y_1$ for decoding. We should expect that, with a high probability, the original sentence $x_1$ is generated. For a sample $x_2 \in \mathcal{X}_2$, though we do not aim to change its language style in our task, we can still compute its cycle consistency loss for the purpose of additional regularization. We first choose an arbitrary style $y_1$ obtained from a sentence in $\mathcal{X}_1$, and transfer $x_2$ into this style. Next, we feed this generated sentence $\tilde{x}_2$ into the encoder-decoder model with the style $y^*$, and the original sentence $x_2$ should be generated. Formally, the cycle consistency loss is:
$$\mathcal{L}_{cyc}(\theta_E, \theta_G) = \mathbb{E}_{x_1}\left[-\log p_G\big(x_1 \mid E_z(\tilde{x}_1), y_1\big)\right] + \mathbb{E}_{x_2}\left[-\log p_G\big(x_2 \mid E_z(\tilde{x}_2), y^*\big)\right].$$
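As a toy illustration of the round-trip requirement (the actual loss is a sequence negative log-likelihood; this discrete token-overlap proxy is purely illustrative):

```python
def cycle_consistency_penalty(original_tokens, recovered_tokens):
    """Toy surrogate for the cycle loss: fraction of original tokens not
    recovered after the transfer-and-back round trip."""
    missed = sum(1 for a, b in zip(original_tokens, recovered_tokens) if a != b)
    missed += abs(len(original_tokens) - len(recovered_tokens))
    return missed / max(len(original_tokens), 1)

original = "the food was terrible".split()
good_round_trip = "the food was terrible".split()   # content preserved
bad_round_trip = "service is great".split()          # content lost
assert cycle_consistency_penalty(original, good_round_trip) == 0.0
assert cycle_consistency_penalty(original, bad_round_trip) > 0.0
```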
4.4 Full Objective
An illustration of our basic model with the style discrepancy loss is shown in Figure 1, as is the full model combined with the cycle consistency loss. To summarize, the full loss function of our model is:
$$\mathcal{L}(\theta_E, \theta_G, \theta_{D_1}) = \mathcal{L}_{rec} + \lambda_1 \mathcal{L}_{adv} + \lambda_2 \mathcal{L}_{dis} + \lambda_3 \mathcal{L}_{cyc},$$
where $\lambda_1, \lambda_2, \lambda_3$ are parameters balancing the relative importance of the different loss parts. The overall training objective is a minimax game played among the encoders $E_y$ and $E_z$, the generator $G$, and the discriminator $D_1$:
$$\min_{E_y, E_z, G}\; \max_{D_1}\; \mathcal{L}(\theta_E, \theta_G, \theta_{D_1}).$$
We implement the content encoder $E_z$ using an RNN with the last hidden state as the content representation, and the style encoder $E_y$ using a CNN with the output representation of the last layer as the style representation. The generator $G$ is an RNN that takes the concatenation of the content and style representations as its initial hidden state. The discriminator $D_1$ and the pre-trained discriminator $D_s$ used in the style discrepancy loss are CNNs with a network structure similar to that in Kim (2014), followed by a sigmoid output layer.
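A concrete sketch of this wiring; the dimensions, names, and toy encoders below are placeholders, not the paper's configuration:

```python
DIM_Z, DIM_Y = 8, 4              # illustrative sizes, not the paper's
Y_STAR = [0.1] * DIM_Y           # shared target-style embedding (learnt in practice)

def encode_style(x, from_source, style_encoder):
    """Indicator-style selection from Sec 4.1: a source sentence gets its
    own encoded style; a target sentence shares y*."""
    return style_encoder(x) if from_source else Y_STAR

def initial_hidden_state(z, y):
    """The generator RNN starts from the concatenation [z; y]."""
    return z + y

toy_style_encoder = lambda x: [0.0] * DIM_Y   # stand-in for the CNN E_y
z = [0.0] * DIM_Z                              # stand-in for E_z's last RNN state
h0 = initial_hidden_state(z, encode_style("some sentence", True, toy_style_encoder))
assert len(h0) == DIM_Z + DIM_Y
```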
Yelp: Raw data are from the Yelp Dataset Challenge Round 10, which consists of restaurant reviews on Yelp. Generally, reviews rated with 4 or 5 stars are considered positive, 1 or 2 stars negative, and 3 stars neutral. For positive and negative reviews, we use the processed data released by Shen et al. (2017), which contain 250k negative sentences and 350k positive sentences. For neutral reviews, we follow steps similar to those in Shen et al. (2017) to process and select the data. We first filter the neutral reviews (rated with 3 stars and categorized with the keyword 'restaurant'), removing those with length exceeding 15 or below 3. Then, the data selection method of Moore and Lewis (2010) is used to ensure a large enough vocabulary overlap between the neutral data and the data in Shen et al. (2017). Afterwards, we sample 500k sentences from the resulting dataset as the neutral data. We use the positive data as the target style domain. Based on the three classes of data, we construct two source datasets with multiple styles:
Positive+Negative (Pos+Neg): we add different numbers of positive data (50k, 100k, 150k) into the negative data so that the source domain contains data with two sentiments.
Neutral+Negative (Neu+Neg): we combine neutral (50k, 100k, 150k) and negative data together as the source data.
We consider the Neu+Neg dataset harder to learn from than the Pos+Neg dataset: for Pos+Neg, a pre-trained classifier can be used to filter out some positive data so that most of the source data have the same style and the model in Shen et al. (2017) can work. However, the neutral data cannot be removed in this way. Also, much real data may carry a neutral sentiment, and we want to see if such sentences can be transferred well.
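The Moore and Lewis (2010) selection step used above for the neutral data can be sketched with toy unigram language models (the real setup trains proper LMs; the corpora here are illustrative stand-ins):

```python
import math
from collections import Counter

def unigram_lm(corpus):
    """Add-one-smoothed unigram model estimated from a token list."""
    counts = Counter(corpus)
    total = sum(counts.values())
    vocab = len(counts) + 1
    return lambda w: (counts[w] + 1) / (total + vocab)

def moore_lewis_score(sentence, p_in, p_out):
    """Per-word cross-entropy difference from Moore and Lewis (2010):
    higher means the sentence looks more like the in-domain corpus."""
    toks = sentence.split()
    return sum(math.log(p_in(w)) - math.log(p_out(w)) for w in toks) / len(toks)

# Toy corpora standing in for restaurant reviews (in-domain) and
# generic text (out-of-domain).
p_in = unigram_lm("great food great service tasty food".split())
p_out = unigram_lm("stock prices rose sharply today".split())
assert moore_lewis_score("great tasty food", p_in, p_out) > \
       moore_lewis_score("stock prices rose", p_in, p_out)
```

Sentences are ranked by this score and only the highest-scoring ones are kept, which is how a large vocabulary overlap with the in-domain data can be enforced.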
Chat: We use sentences from a real Chinese dialog dataset as the source domain. Users chat with various personalized language styles, which are not easily classified into one of the three sentiments as in Yelp. Romantic sentences are collected from several online novel websites and filtered by human annotators. The dataset has 400k romantic sentences and 800k general sentences. Our task is to transfer the dialog sentences into a romantic style, characterized by the selected romantic sentences.
Shakespeare: We experiment on revising modern text in the Shakespearean style at the sentence-level as in Mueller et al. (2017). Following their experimental setup, we collect 29,388 sentences authored by Shakespeare and 54,800 sentences from non-Shakespeare-authored works. The length of all the sentences ranges from 3 to 15.
Each of the above three datasets is divided into three non-overlapping parts, which are respectively used for the style transfer model, the pre-trained discriminator $D_s$, and the evaluation classifier used in Section 5.3. Each part is further divided into training, testing, and validation sets. The one exception is that $D_s$ for Shakespeare is trained on a subset of the data used for training the style transfer model, because the Shakespeare dataset is small. Statistics of the data for training and evaluating the style transfer model are shown in the supplementary material.
5.2 Compared Methods and Configurations
We compare our method with Shen et al. (2017), the state-of-the-art language style transfer model for non-parallel data, which we name the Style Transfer Baseline (STB). As described in Section 2 and Section 4, STB is built upon an auto-encoder framework. It focuses on transferring sentences from one style to the other, with the source and target language styles represented by two embedding vectors, and relies on adversarial training to align the content spaces of the two domains. For a fair comparison, we keep the configurations of the modules in STB, such as the encoder, decoder, and discriminator, the same as ours.
We implement our model using Tensorflow (Abadi et al., 2016). We use GRUs (Chung et al., 2014) as the encoder and generator cells in our encoder-decoder framework. Dropout (Srivastava et al., 2014) is applied in the GRUs, with the dropout probability set to 0.5. Throughout our experiments, we set the dimension of the word embedding, content representation and style representation as , and respectively. For the style encoder $E_y$, we follow the CNN architecture in Kim (2014), and use filter sizes of with 100 feature maps each, so that the resulting output layer is of size , i.e., the dimension of the style representation. The pre-trained discriminator $D_s$ is implemented similarly to $E_y$ but using filter sizes with 250 feature maps each. The testing accuracy of the pre-trained $D_s$ is 95.23% for Yelp, 87.60% for Chat, and 87.60% for Shakespeare. We further set the balancing parameters , , and train the model using the Adam optimizer (Kingma and Ba, 2015) with the learning rate . All input sentences are padded to the same length: 20 for Yelp and Shakespeare, and 35 for Chat. Furthermore, when training the classifiers, we use the pre-trained Glove word embeddings (Pennington et al., 2014) for Yelp and Shakespeare, and Chinese word embeddings trained on a large amount of Chinese news data for Chat.
5.3 Evaluation Setup
Following Shen et al. (2017), we use a model-based evaluation metric. Specifically, we use a pre-trained evaluation classifier to classify whether the transferred sentence has the correct style. The evaluation classifier is implemented with the same network structure as the discriminator. The testing accuracy of the evaluation classifiers is 95.36% for Yelp, 87.05% for Chat, and 88.70% for Shakespeare. We repeat the training three times for each experiment setting and report the mean accuracy on the testing data along with its standard deviation.
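The model-based evaluation can be sketched as follows, with a toy keyword classifier standing in for the pre-trained CNN evaluation classifier (the names and cue words are illustrative):

```python
import statistics

def transfer_accuracy(transferred, classify):
    """Fraction of transferred sentences the evaluation classifier
    labels as having the target style."""
    labels = [classify(s) for s in transferred]
    return sum(labels) / len(labels)

# Toy classifier: call a sentence "positive" if it contains a cue word.
classify = lambda s: int("great" in s or "delicious" in s)

# One accuracy per training run; the paper reports mean and std over runs.
runs = [
    transfer_accuracy(["great food", "delicious soup", "bad service"], classify),
    transfer_accuracy(["great food", "great soup", "delicious bread"], classify),
    transfer_accuracy(["ok place", "great view", "delicious cake"], classify),
]
mean, std = statistics.mean(runs), statistics.stdev(runs)
assert 0.0 <= mean <= 1.0 and std >= 0.0
```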
5.4 Experiment Roadmap
We perform a set of experiments on Yelp, which is also used in Shen et al. (2017), to systematically investigate the effectiveness of our model:
We validate the usefulness of the proposed style discrepancy loss and the cycle consistency loss in our setting (Sec 5.5.1).
We vary the difficulty level of the source data and verify the robustness of our model (Sec 5.5.2).
We look into some transferred sentences and analyze successful cases and failed cases (Sec 5.5.3).
We perform human evaluations to evaluate the overall quality of the transferred sentences and check if the results are consistent with those using the model-based evaluation (Sec 5.5.4).
After conducting the thorough study of our model on Yelp, we apply our model on Chat and Shakespeare, two real datasets in which the source domain has various language styles.
5.5 Experiments and Analysis on Yelp
5.5.1 Ablation Study
To investigate the effectiveness of the style discrepancy loss, we remove it from our full model and train on the Pos+Neg dataset, whose source domain contains both positive and negative sentences. When the model converges, we find that the cycle consistency loss is much larger than that obtained with the full model in the same setting. We also manually verify that almost all sentences fail to transfer their styles with this ablated model. This indicates that the proposed style discrepancy loss is an indispensable part of our model.
We also validate the effectiveness of the cycle consistency loss (note that our proposed cycle consistency loss can be similarly added to STB). Specifically, we compare two versions of both STB and our model, one with the cycle consistency loss and one without. We vary the number of positive sentences in the source domain, and the results are shown in Table 2. It can be seen that incorporating the cycle consistency loss consistently improves the performance of both STB and our proposed model.
5.5.2 Difficulty Level of Training Data
Pos+Neg as Source Data: We compare STB and our proposed model on the first dataset, Pos+Neg, using the results in Table 2. As the number of positive sentences in the source data increases, the average performance of both versions of STB decreases drastically. This is reasonable because STB introduces a discriminator to align the sentences from the target domain back to the source domain, and when the source domain contains more positive samples, it is hard to find a good alignment to the source domain. Meanwhile, the performance of our model, even the basic version without the cycle consistency loss, does not fluctuate much as the number of positive samples increases, showing that our model is not sensitive to the source data containing more than one sentiment. Overall, our model with the cycle consistency loss performs the best.
Neu+Neg as Source Data: The dataset Pos+Neg is not so challenging, because a pre-trained discriminator, similar to $D_s$ in our model, can be used to remove those samples classified as positive with high probabilities, so that only sentences with a less positive sentiment remain in the source domain. Thus, we test on our second dataset, Neu+Neg. In this setting, in case some positive sentences exist among the neutral reviews, we use the same pre-trained discriminator as in our model to filter out samples classified as positive with probabilities larger than 0.9 when training STB. In comparison, our model uses all the data, since it naturally allows for data with styles similar to the target style. Experimental results in Table 3 show that with the same amount of neutral data mixed into the source domain, our model performs better than STB (with Cyc) and is relatively stable across multiple runs.
Limited Sentences in Target Domain: In real applications, there may be only a small amount of data in the target domain. To simulate this scenario, we limit the amount of the target data (randomly sampled from the positive data) used for training, and evaluate the robustness of the compared methods. Table 4 shows the experimental results. It is surprising to see that both methods obtain relatively steady accuracies with different numbers of target samples. Yet, our model surpasses STB (with Cyc) in all the cases.
5.5.3 Case Study
We manually examine the generated sentences for a detailed study. Overall, our full model can generate grammatically correct positive reviews without changing the original content in more cases than the other methods. In Table 5, we present some example results of the various methods. We can see that when the original sentence is simple, as in the first example, all models can transfer it successfully. However, when the original sentence is complex, both versions of STB and our basic model (without the cycle consistency loss) cannot generate fluent sentences, while our full model still succeeds. One minor problem with our model is that it may use a wrong tense in transferred sentences (e.g., the last transferred sentence by our model), which, however, has little influence on the sentence style and meaning.
5.5.4 Human Evaluation
The model-based evaluation metric is inadequate for measuring whether a transferred sentence preserves the content of the source sentence (Fu et al., 2017). Therefore, we rely on human evaluations to assess model performance in content preservation. We randomly select 200 test samples from Yelp and ask annotators to rate the overall quality of the transferred sentences from 1 (failed), through 2 (tolerable) and 3 (satisfying), to 4 (perfect), jointly considering content preservation, sentiment modification, and fluency.
We hire five annotators to evaluate the results. Since we have 9 settings and 24 different methods in total in Tables 2-4, we select one of the settings on Yelp due to the limited budget. Table 6 shows the average scores of the annotators' evaluations. As can be seen, when all three aspects are considered, our model is better than the other methods. This result is also consistent with those of the automatic evaluation.
5.6 Experiments and Analysis on Chat
As in the Yelp experiment, we vary the number of target sentences to test the robustness of the compared methods. The experimental results are shown in Table 7. Several observations can be made. First, STB (with Cyc) obtains a relatively low performance with only 10k target samples; as more target samples are used, its performance increases. Second, our model achieves a high accuracy even with only 10k target samples, and remains stable in all the cases. Thus, our model achieves better performance as well as stronger robustness on Chat. Due to space limitations, we present a few examples in Table 10 in the supplementary material. We find that our model generally transfers the sentence into a romantic style successfully, with some romantic phrases used.
5.7 Experiments and Analysis on Shakespeare
The testing accuracies are shown in Table 8. We also present some example sentences in Table 9. Compared with STB, our model generates sentences that are more fluent and more likely to carry the correct target style. For example, the sentences transferred by our model use words such as 'sir', 'master', and 'lustre', which are common in Shakespearean works. However, we find that both STB and our model tend to generate short sentences and change the content of source sentences more often in this set of experiments than on the Yelp and Chat datasets. We conjecture this is caused by the scarcity of training data. Sentences in the Shakespearean style form a vocabulary of 8,559 words, but almost 60% of them appear fewer than 10 times. On the other hand, the source domain contains 19,962 words, and there are only 5,211 words common to the two vocabularies. Thus, aligned words/phrases may not exist in the dataset.
We have presented an encoder-decoder framework for language style transfer. It allows for the use of non-parallel data, where the source data have various unknown language styles. Each sentence is encoded into two latent representations, one corresponding to its content disentangled from the style, and the other representing the style only. By recombining the content with the target style, we can decode a sentence aligned in the target domain. Specifically, we propose two loss functions, the style discrepancy loss and the cycle consistency loss, to adequately constrain the encoding and decoding functions. The style discrepancy loss enforces a properly encoded style representation, while the cycle consistency loss ensures that the style-transferred sentences can be transferred back to their original sentences. Experimental results on three tasks demonstrate that our proposed method outperforms previous style transfer methods.
- Abadi et al. (2016) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Chung et al. (2014) Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
- Ficler and Goldberg (2017) Jessica Ficler and Yoav Goldberg. 2017. Controlling linguistic style aspects in neural language generation. In Proceedings of the Workshop on Stylistic Variation, pages 94–104.
- Fu et al. (2017) Zhenxin Fu, Xiaoye Tan, Nanyun Peng, Dongyan Zhao, and Rui Yan. 2017. Style transfer in text: Exploration and evaluation. arXiv preprint arXiv:1711.06861.
- Gatys et al. (2016) Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2414–2423.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.
- He et al. (2016) Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tieyan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828.
- Hu et al. (2017) Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing. 2017. Toward controlled generation of text. In Proceedings of the International Conference on Machine Learning, pages 1587–1596.
- Kim (2014) Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1746–1751.
- Kingma and Ba (2015) Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In Proceedings of the International Conference for Learning Representations.
- Kingma and Welling (2013) Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
- Kulkarni et al. (2015) Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. 2015. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547.
- Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Georgios P Spithourakis, Jianfeng Gao, and Bill Dolan. 2016. A persona-based neural conversation model. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 994–1003.
- Li et al. (2018) Juncen Li, Robin Jia, He He, and Percy Liang. 2018. Delete, retrieve, generate: A simple approach to sentiment and style transfer. arXiv preprint arXiv:1804.06437.
- Liu and Tuzel (2016) Ming-Yu Liu and Oncel Tuzel. 2016. Coupled generative adversarial networks. In Advances in Neural Information Processing Systems, pages 469–477.
- Luan et al. (2017) Fujun Luan, Sylvain Paris, Eli Shechtman, and Kavita Bala. 2017. Deep photo style transfer. arXiv preprint arXiv:1703.07511.
- Moore and Lewis (2010) Robert C Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, pages 220–224.
- Mueller et al. (2017) Jonas Mueller, David Gifford, and Tommi Jaakkola. 2017. Sequence to better sequence: continuous revision of combinatorial structures. In Proceedings of the International Conference on Machine Learning, pages 2536–2544.
- Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence rnns and beyond. arXiv preprint arXiv:1602.06023.
- Paulus et al. (2017) Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1532–1543.
- Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 379–389.
- Shen et al. (2017) Tianxiao Shen, Tao Lei, Regina Barzilay, and Tommi Jaakkola. 2017. Style transfer from non-parallel text by cross-alignment. In Advances in Neural Information Processing Systems.
- Shum et al. (2018) Heung-Yeung Shum, Xiaodong He, and Di Li. 2018. From eliza to xiaoice: Challenges and opportunities with social chatbots. CoRR.
- Srivastava et al. (2014) Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research, 15(1):1929–1958.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104–3112.
- Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the International Conference on Computer Vision.