
c-TextGen: Conditional Text Generation for Harmonious Human-Machine Interaction

by Bin Guo, et al.

In recent years, with the development of deep learning, text generation technology has changed greatly and now supports many kinds of services, such as restaurant reservation and daily communication. As automatically generated text becomes more and more fluent, researchers have begun to consider more anthropomorphic text generation, namely conditional text generation, which includes emotional text generation, personalized text generation, and so on. Conditional text generation (c-TextGen) has thus become a research hotspot, and considerable effort has been devoted to it. We therefore aim to give a comprehensive review of the new research trends of c-TextGen. We first give a brief literature review of text generation technology, based on which we formalize the concept model of c-TextGen. We then investigate several different c-TextGen techniques and illustrate the advantages and disadvantages of commonly used neural network models. Finally, we discuss the open issues and promising research directions of c-TextGen.



1. Introduction

Jorge Luis Borges has described a Library, “The Library of Babel”, where anyone can find any book he wants. Readers cannot help but wonder who wrote these books. Were they all written by human writers? The answer is no. Such a library seems unlikely to exist; however, the development of text generation technology in recent years has made it possible. The writer Philip M. Parker has sold more than 100,000 books on Amazon, which obviously does not mean that he wrote all those books personally. Instead, he used computer programs to collect a large amount of publicly available information on the Internet, which was then used to compile books automatically. Parker’s method belongs to text-to-text generation (Genest and Lapalme, 2011), which takes existing text as input and automatically generates new, coherent text as output.

Text-to-text generation is a typical application of text generation (McKeown, 1992), which builds on the development of natural language processing (NLP) and has great application prospects. Text generation enables computers to learn to write like humans, using diverse types of information, including images, text and so on, to automatically generate high-quality natural language text and thereby complete a variety of tasks in place of humans. According to the data source, text generation can be divided into data-to-text, text-to-text, and image-to-text generation. News generation is a typical application of data-to-text generation. When an earthquake struck Beverly Hills, California on March 17, 2014, the Los Angeles Times was the first to provide detailed information about the time, location and strength of the quake. The news report was automatically generated by a ‘robot reporter’, which converted the automatically registered incoming seismic data into text by filling in the blanks of a predefined template (Oremus, 2014). Data-to-text generation fills an established template with structured data and generates output text containing all the key information, and it has exerted considerable influence on news and media research.
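
The template-filling idea behind such a 'robot reporter' can be illustrated with a minimal sketch. The template wording, the field names and the record values below are hypothetical and chosen only for illustration; they are not taken from the Los Angeles Times system.

```python
# Hypothetical template; a real system would maintain many such templates.
TEMPLATE = (
    "A magnitude {magnitude} earthquake was reported {distance_km} km from "
    "{place} at {time}."
)

def render_report(record):
    """Fill the blanks of the predefined template with structured data."""
    # str.format raises KeyError if a required field is missing from the record.
    return TEMPLATE.format(**record)

# Made-up structured record standing in for automatically registered sensor data.
record = {"magnitude": 4.0, "distance_km": 10, "place": "Exampleville", "time": "06:00"}
report = render_report(record)
print(report)
```

Because every blank must be filled from the record, the output is guaranteed to contain all the key fields, but the wording can never go beyond what the template author wrote.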

Applications of text-to-text generation include machine translation (Cho et al., 2014), text summarization (Rush et al., 2017), reading comprehension (Hermann et al., 2015), etc. By understanding the content of the original text and obtaining its semantic representation, natural language text is generated that summarizes or refines it. Applications of image-to-text generation include image captioning (Mao et al., 2014), visual question answering (QA) (Antol et al., 2015), etc. By processing the image, its contents can be understood in order to generate corresponding natural language descriptions or answers to related questions.

Deep learning has made great achievements in many applications, including computer vision, speech recognition and NLP, and contributes to the most recent advances in text generation. Specifically, with the help of recurrent neural networks (RNN) (Elman, 1990), the attention mechanism, generative adversarial networks (GAN) (Goodfellow et al., 2014), and reinforcement learning (RL) models, generated text has become more coherent, logical and emotionally harmonious, enabling text generation technology to offer assistance in many aspects of people’s lives. For example, dialogue systems such as Microsoft XiaoIce, Cortana and Apple Siri have brought great convenience to people by carrying out conversations. News-writing robots provide creative assistance for journalists, and machine translation technology effectively meets our translation needs.

With the rapid development of general text generation technology, researchers have turned to more anthropomorphic text generation, such as context-based text generation (Jaech and Ostendorf, 2018), personalized text generation (Luo et al., 2019), topic-aware text generation (Wang et al., 2018), and knowledge-enhanced text generation (Young et al., 2018). Applying additional information in text generation makes the generated text more human-like and facilitates harmonious human-machine interaction; however, it also raises new challenges. First, it is worth investigating how to efficiently integrate the conditional information with traditional model structures. Second, because text datasets with specific conditions are scarce, training conditional text generation models becomes more difficult. Last, reasonable evaluation metrics for conditional text generation models should also be taken into consideration.

This paper aims to give an in-depth survey of the development of text generation methods. Specifically, we focus on studies of conditional text generation (c-TextGen), such as context-based text generation, topic-aware text generation, and knowledge-enhanced text generation. Compared with general text generation methods, conditional text generation is more in line with human needs and provides people with more accurate services, which is of great help in constructing more anthropomorphic text generation systems.

We summarize the contributions of our work as follows.

  • Based on a brief literature review of text generation technology, we characterize the concept model of conditional text generation (c-TextGen) and present the major human-centric services.

  • We investigate several different c-TextGen techniques, including context-based text generation, topic-aware text generation, etc. We also elaborate on the models commonly used in c-TextGen, illustrating the advantages and disadvantages of each model and introducing their typical applications.

  • We further discuss some promising research directions of c-TextGen, including the consideration of different types of contexts, the novel text generation models, the multi-modal data translation, and so on.

The remainder of this paper is organized as follows. In Section 2, we review the history of existing text generation studies. In Section 3, we characterize the concept model of c-TextGen. We then summarize the major research areas and key techniques of c-TextGen in Sections 4 and 5, followed by the open issues and future research directions in Section 6. Finally, we sum up our work in Section 7.

2. The Literature History of Text Generation

After decades of development, text generation technology has made great progress, evolving from the original rule-based methods and statistical learning models to the recently surging deep neural network (DNN)-based methods. This section gives a brief literature review of general text generation methods.

2.1. Rule-based and Statistical Methods

Early studies on text generation in NLP mainly use rule-based methods (Mauldin, 1984). However, these methods have obvious disadvantages. First, human language is flexible, and rules can only cover a small part of linguistic knowledge. Second, rule-based methods require developers to be proficient not only in computing but also in linguistics. Therefore, although rule-based methods have solved some simple problems, they cannot be widely used in practice. Consequently, some researchers introduced statistical approaches to text generation. For example, Bahl et al. formulate speech recognition as a problem of maximum likelihood decoding, increasing the speech recognition rate from 70% to 90% (Bahl et al., 1983). During this stage, NLP made substantial breakthroughs and began to move from the laboratory to practical applications. Statistical methods are effective and explainable; however, they require a large number of manually designed rules or templates, which limits how well they work in practice.

The language model is the basis of the text generation task. The n-gram model is one of the classic language models: a typical statistical model that uses the previous n-1 words to predict the next word. It has some advantages. First, its parameters are estimated by maximum likelihood, which makes them easy to train. Second, all the information in the previous n-1 words is used during the calculation. At the same time, the n-gram language model has many shortcomings that prevent large-scale application. For example, it cannot capture long-term dependencies, because the model only takes the previous n-1 words into consideration. If n increases indefinitely, the parameter space grows explosively. Moreover, the probability of each word is determined purely by statistical frequencies, so the n-gram model lacks generalization ability.
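
A minimal sketch of maximum likelihood estimation for an n-gram model is given here for illustration. The function names, the `<s>` padding symbol and the toy corpus are assumptions made for this example; no smoothing is applied, which makes the lack of generalization visible.

```python
from collections import Counter, defaultdict

def train_ngram(tokens, n):
    """Estimate P(word | previous n-1 words) by relative frequency (maximum likelihood)."""
    context_counts = Counter()
    ngram_counts = defaultdict(Counter)
    padded = ["<s>"] * (n - 1) + tokens  # pad so the first word also has a context
    for i in range(n - 1, len(padded)):
        context = tuple(padded[i - n + 1:i])
        ngram_counts[context][padded[i]] += 1
        context_counts[context] += 1
    return {
        ctx: {word: count / context_counts[ctx] for word, count in words.items()}
        for ctx, words in ngram_counts.items()
    }

def next_word_prob(model, context, word):
    # Unseen contexts or words get probability 0: the model only knows observed counts.
    return model.get(tuple(context), {}).get(word, 0.0)

corpus = "the cat sat on the mat the cat ate".split()
bigram = train_ngram(corpus, n=2)
print(next_word_prob(bigram, ["the"], "cat"))  # "the" is followed by "cat" twice and "mat" once
print(next_word_prob(bigram, ["the"], "dog"))  # never observed in the corpus
```

The number of possible contexts grows as the vocabulary size raised to the power n-1, which is the parameter explosion mentioned above.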

2.2. DNN-based Methods

DNN-based methods do not depend on any manually designed rules, and they can automatically learn continuous vector representations of the task-specific knowledge in different tasks (Gao et al., 2019). Neural network-based methods usually perform NLP tasks (such as dialogue and machine translation) in three steps: encoding, reasoning and decoding. The inputs of the neural network model are first encoded into a vector space, where semantically related or similar concepts are close to each other. The network then reasons in the vector space according to the state of the system and the current input to produce the system response. Finally, the system response is decoded to generate natural language text. Different neural network structures are usually adopted for encoding, reasoning and decoding, and the parameters are optimized through back-propagation and gradient descent. Neural networks trained end-to-end can fully mine the correlations in the data and greatly alleviate the need for feature engineering. For example, Bengio et al. (Bengio et al., 2003) introduce a feed-forward neural network-based language model, which transforms words into vectors and learns the relationship between the probability of a word and its preceding words. Mikolov et al. (Mikolov et al., 2010) further introduce RNN into the field of text generation, leveraging its sequential structure to capture sequence information and model the relationship between the previous words and the subsequent word.

The DNN-based methods can be subdivided into three categories: RNN, the attention mechanism, and GAN combined with RL, summarized as follows.


RNN. The natural sequence structure of RNN is well suited to processing natural language, which is itself sequential. Variants of RNN, such as long short-term memory (LSTM) and gated recurrent unit (GRU), address the issue of gradient vanishing or explosion arising in the traditional RNN model, and have gradually become the most popular models in text generation.

Sutskever et al. first propose the Sequence-to-Sequence (Seq2seq) learning model (Sutskever et al., 2014), a general framework for converting one sequence into another. In this framework, one neural network, the encoder, compresses the input sequence into a vector representation. Another neural network, the decoder, then predicts the output words one by one, conditioning on the hidden state and taking the previous output as the input for predicting the next output. Since this framework places no restriction on the lengths of the input and output sequences, it has been widely used in text generation tasks, including machine translation (Cho et al., 2014), text summarization (Rush et al., 2017), dialogue systems (Vinyals and Le, 2015), and so on. Since then, purely data-driven end-to-end training based on the Seq2seq model has become the mainstream method in text generation. Neural networks can automatically learn efficient representations of the input and extract hidden relational features from massive text data in the process of end-to-end training.
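
The encode-then-decode loop can be sketched with a toy, untrained model in pure Python. Everything below is an assumption made for illustration: the four-token vocabulary, the hidden size, the random weights, the plain tanh recurrence (rather than the LSTM used in the cited work), and the use of the `<eos>` id as the start symbol. The decoded tokens are therefore meaningless; only the control flow is the point.

```python
import math
import random

random.seed(0)
VOCAB = ["<eos>", "a", "b", "c"]
H = 4  # hidden size, arbitrary for this toy example

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def rnn_step(w_in, w_h, x, h):
    # Simple recurrent update: h' = tanh(W_in x + W_h h)
    return [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_h, h))]

# Untrained random parameters for the encoder, the decoder and the output layer.
enc_in, enc_h = rand_matrix(H, len(VOCAB)), rand_matrix(H, H)
dec_in, dec_h = rand_matrix(H, len(VOCAB)), rand_matrix(H, H)
out_proj = rand_matrix(len(VOCAB), H)

def encode(token_ids):
    h = [0.0] * H
    for t in token_ids:
        h = rnn_step(enc_in, enc_h, one_hot(t, len(VOCAB)), h)
    return h  # fixed-length vector, whatever the input length

def greedy_decode(context, max_len=5):
    h, prev, output = context, 0, []  # the <eos> id doubles as the start symbol
    for _ in range(max_len):
        # The previous output token is fed back as the next decoder input.
        h = rnn_step(dec_in, dec_h, one_hot(prev, len(VOCAB)), h)
        scores = matvec(out_proj, h)
        prev = max(range(len(VOCAB)), key=scores.__getitem__)
        if prev == 0:  # stop once <eos> is predicted
            break
        output.append(prev)
    return output

decoded = greedy_decode(encode([1, 2, 3]))
print([VOCAB[t] for t in decoded])
```

Note that `encode` returns a vector of the same size for inputs of any length, which is exactly the fixed-length bottleneck that motivates the attention mechanism.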

Attention mechanism. Traditional Seq2seq models generally have two problems. First, all inputs are compressed into a fixed-length vector, which limits how much of the input information the vector can retain. Second, assigning the same weight to all input words cannot effectively capture the key information in the input. To solve these problems, the attention mechanism, which is extensively used in computer vision, is introduced into NLP. By assigning different weights to different parts of the input data, the attention mechanism extracts the most important information from the input, enabling the model to make more accurate judgments and reducing the computational and storage costs. The attention mechanism was first applied to the Seq2seq model for machine translation, where it achieved the best results at the time (Bahdanau et al., 2014). Since then, the attention mechanism has become one of the most popular techniques in many kinds of NLP tasks. Xing et al. (Xing et al., 2018) introduce an attention-based multi-turn response generation model to find the most important information in the dialogue context. The attention mechanism makes multi-turn dialogue more coherent and consistent.
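
The weighting step can be sketched as follows. For brevity this example scores each encoder state with a dot product, which is a simplification; Bahdanau et al. use a learned additive scoring function instead. The function names and the toy vectors are assumptions for illustration.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, encoder_states):
    """Return attention weights and the weighted sum of encoder states."""
    # Dot-product score between the decoder query and each encoder state.
    scores = [sum(q * k for q, k in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one vector per input word
weights, context = attend([2.0, 0.0], states)
print(weights)
print(context)
```

Unlike the single fixed-length vector of a plain Seq2seq model, the context vector is recomputed for every decoder query, so different output words can focus on different input words.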

GAN and RL. GAN (Goodfellow et al., 2014) and RL have made great progress in computer vision, attracting researchers to explore applying them to text generation. Zhang et al. (Zhang et al., 2016) attempt to combine LSTM and convolutional neural network (CNN) to generate realistic text using the idea of adversarial training. However, the original GAN is only suited to generating continuous data and performs poorly on discrete data. Since text is typical discrete data, researchers have modified GAN’s structure to make it possible to generate discrete data, e.g., the Wasserstein GAN model (Arjovsky et al., 2017). Although these improvements to GAN have made some progress in text generation, the performance needs to be further improved. Combining GAN with RL, the SeqGAN model (Yu et al., 2017) regards the discriminator in GAN as the source of reward in RL and leverages the reward mechanism and the policy gradient technique in RL to avoid the problem that gradients cannot be back-propagated when GAN faces discrete data.
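
The policy-gradient idea can be illustrated with a toy REINFORCE loop. This is a sketch under strong assumptions, not the SeqGAN algorithm itself: the "discriminator" is a fixed hand-written scoring function rather than a trained network, the generator is a table of per-position logits rather than an RNN, and a moving-average baseline is added to reduce variance. All names and constants are hypothetical.

```python
import math
import random

VOCAB_SIZE, SEQ_LEN, TARGET = 3, 3, 1

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_discriminator(sequence):
    # Hypothetical stand-in for a trained discriminator: a score in [0, 1]
    # that is higher the more tokens look "real" (here: equal TARGET).
    return sum(1 for tok in sequence if tok == TARGET) / len(sequence)

def train_generator(steps=2000, lr=0.2, seed=0):
    rng = random.Random(seed)
    logits = [[0.0] * VOCAB_SIZE for _ in range(SEQ_LEN)]
    baseline = 0.0
    for _ in range(steps):
        probs = [softmax(row) for row in logits]
        # Sampling discrete tokens is the step gradients cannot pass through.
        sequence = [rng.choices(range(VOCAB_SIZE), weights=p)[0] for p in probs]
        reward = toy_discriminator(sequence)
        advantage = reward - baseline
        baseline += 0.05 * (reward - baseline)  # moving-average baseline
        for pos, tok in enumerate(sequence):
            for k in range(VOCAB_SIZE):
                # Gradient of log softmax probability of the sampled token.
                grad_log_prob = (1.0 if k == tok else 0.0) - probs[pos][k]
                logits[pos][k] += lr * advantage * grad_log_prob
    return [softmax(row) for row in logits]

final_probs = train_generator()
print([round(p[TARGET], 3) for p in final_probs])
```

The update only needs the reward value and the log-probability gradient of the sampled tokens, so no gradient has to flow through the discrete sampling step or through the scoring function.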

3. Conditional Text Generation

The development of deep neural networks has brought unprecedented progress to text generation. However, existing text generation technology still has some problems. For example, many studies train the text generation model based only on the content of the input text, so the text generated by such models is entirely content-based and ignores many other factors. A real person, however, not only considers the context but also adjusts the content according to their own conditions (such as mood and gender) and external factors (such as weather and environment) when speaking or writing. In this paper, we take conditional text generation (c-TextGen) as a future research direction that is key to improving the quality of generated text. Specifically, it includes context-based text generation, personalized text generation, topic-aware text generation, emotional text generation, knowledge-enhanced text generation, diversified text generation and visual text generation. In this section, we formalize the definition of c-TextGen and introduce its wide range of applications.

3.1. The Concept Model

c-TextGen refers to taking certain external conditions into consideration during text generation so that they influence the generated results. These conditions usually include context, topic, emotion, common sense, and so on. General text generation methods only consider the content factor, which makes the generated text less diverse and leaves a large gap between it and human expression. Considering external conditions makes text generation more anthropomorphic and brings better services to human beings in various fields. We define the various kinds of c-TextGen as follows:

Definition 3.1. CONTEXT-BASED TEXT GENERATION: Integrating contextual information during text generation to produce more coherent content, for example, considering the historical dialogue content in multi-round dialogues.

Definition 3.2. PERSONALIZED TEXT GENERATION: Assigning specific personalization characteristics to the agents to produce personalized text content that fits the given characteristics.

Definition 3.3. TOPIC-AWARE TEXT GENERATION: Incorporating a specific topic in the process of text generation to make the whole text suitable for the topic and ensure the coherence and rationality of the generated text.

Definition 3.4. EMOTIONAL TEXT GENERATION: Embodying the emotional expressions of the agents in the process of text generation, such as positive or negative, happy or sad, to adjust the content and expression style of the generated text.

Definition 3.5. KNOWLEDGE-ENHANCED TEXT GENERATION: Drawing on external knowledge, such as a search engine or knowledge base, to provide a factual basis and reference for the generated content in the text generation process.

Definition 3.6. DIVERSIFIED TEXT GENERATION: Utilizing various optimization strategies to generate more diverse text and prevent the text generation system from frequently generating “I don’t know”, “sorry” or other generic text.

Definition 3.7. VISUAL TEXT GENERATION: Integrating the semantic information in images into generated text, such as generating a text description according to image contents, or conducting visual QA.

Figure 1. The overall structure of general and conditional text generation

The difference between general and conditional text generation is shown in Fig. 1. Conditional text generation takes additional conditions, such as context and personalized information, as extra input to augment the generated text, making it more coherent, personalized, emotional, thematic, fact-based, and diverse.
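
One common way to state this difference, assuming an autoregressive generator, is in terms of the conditional probability being modeled. The symbols are notation introduced here for illustration and are not taken from the paper: X is the input text, Y = (y_1, ..., y_T) is the generated text, and C is the additional condition (context, persona, topic, emotion, knowledge, or image).

```latex
% General text generation: the output depends on the input content only
P(Y \mid X) = \prod_{t=1}^{T} P\left(y_t \mid y_{<t},\, X\right)

% Conditional text generation: every step additionally depends on the condition C
P(Y \mid X, C) = \prod_{t=1}^{T} P\left(y_t \mid y_{<t},\, X,\, C\right)
```

Under this view, the different kinds of c-TextGen defined above differ mainly in what C contains and in how it is fed into the model.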

3.2. Text Generation-based Human-Centric Services

Text generation technology has a wide range of application scenarios in daily life. Researchers in academia and industry continue to work on using various text generation technologies to provide human-centric services.

Goal-oriented dialog systems. One of the typical applications of text generation is the dialogue system, which can be goal-oriented or non-goal-oriented. Goal-oriented dialogue systems help us complete various tasks and reduce our operational burden, such as restaurant reservation, travel scheduling and air ticket reservation. Microsoft Cortana, Apple Siri and other intelligent assistants are all typical goal-oriented dialogue systems. Luo et al. (Luo et al., 2019) build a personalized goal-oriented dialog system to complete the restaurant reservation task. This dialogue system uses a personalized profile embedding combined with the global dialogue memory and a knowledge base to generate personalized dialogue content according to different users’ personalities.

Chatbots. Non-goal-oriented dialogue systems, also known as chatbots, can communicate with humans naturally in the open domain. Instead of completing specific tasks, chatbots engage in casual conversations with humans and behave like a real person. Chatbots provide us with a realistic and interactive dialogue experience and establish certain emotional connections with us. Microsoft XiaoIce is a well-known chatbot that has held conversations with hundreds of millions of users and successfully built long-term emotional connections with them. Zhou et al. (Zhou et al., 2018a) describe the development of the Microsoft XiaoIce system to provide some guidance for chatbot researchers.

Question answering. Besides dialogue systems, the question answering (QA) system, which gives corresponding answers to users’ questions, is another typical application of text generation. The QA system needs to find relevant content through search engines or knowledge bases to compose answers to questions that may relate to commonsense facts or details about current events. Dehghani et al. (Dehghani et al., 2019) propose a QA model, called TraCRNet, for open-domain question answering. The TraCRNet model reasons its way to the correct answer by combining information from multiple documents retrieved by a search engine, including not only the high-ranked web pages but also the low-ranked web pages that are not directly related to the question.

Product review and advertisement generators. Given the large number of product reviews on shopping websites, writing a review for a product may puzzle customers and take up their time. Fortunately, based on the information about a given product and the star rating of the review, review generation technology can automatically generate review content related to the product, providing a reference for other customers. Ni et al. (Ni and McAuley, 2018) build an assistive system that helps users write reviews. This model expands the content of input phrases and conforms to users’ personalized aspect preferences to generate diverse and fluent product reviews. At the same time, although product recommendation advertisements are what attract users most, writing specific advertisements for a vast number of products is time-consuming, particularly when we want to generate personalized ads for each user. Personalized advertisement generation technology can automatically generate a well-suited product description according to the different selling points of products and user preferences or traits. It provides great convenience not only for consumers but also for companies. Chen et al. (Chen et al., 2019) realize a personalized product description generation model using neural networks combined with a knowledge base. By leveraging product information, user characteristics and the knowledge base, the proposed model generates a personalized product description as the advertisement for the product.

Image captioning and visual QA. Text is just one way for humans to obtain information, and we are more likely to be exposed to image information in real life. Image captioning technology automatically generates a text description of the content of an image, helping readers better understand it. Feng et al. (Feng et al., 2019) train an unsupervised image captioning model to generate image captions without paired image-sentence datasets. Visual QA is another point of interaction between text generation and image understanding. By understanding the image content and the corresponding question at the same time, visual QA technology can generate answers to questions about images. Li et al. (Li et al., 2019b) convert the visual QA problem into a machine reading comprehension problem and combine it with a large-scale external knowledge base to realize knowledge-based visual QA.

With the rapid development of NLP technology, text generation has become ubiquitous in our daily life. Nevertheless, text generation is still far from mature. For example, it is still easy to tell whether we are talking to a real person or a chatbot, which means that an obvious gap still exists between machines and human beings. C-TextGen is key to closing this gap, as it improves the quality of generated text and makes text generation more anthropomorphic. In this paper, we study different types of conditional text generation methods.

3.3. C-TextGen datasets

Training a conditional text generation model requires a large amount of conditional text data, such as emotional or personalized text data, but this kind of data is relatively scarce. To alleviate this data scarcity, many high-quality conditional text datasets have been released. We give a brief summary of conditional text datasets in Table 1.

| Dataset | Organization | Description |
|---|---|---|
| PERSONA CHAT (Zhang et al., 2018b) | Facebook | A large-scale and high-quality personalized dialogue dataset; every speaker in this dataset is given a set of personalities |
| Humeau et al. (Mazaré et al., 2018) | Facebook | A very large-scale profile-based dialogue dataset; personalized characteristics are extracted from users’ posts in REDDIT |
| Empathetic Dialogues (Rashkin et al., 2019) | Facebook | An emotional dialogue dataset; each conversation in this dataset is produced from a specific situation where the speaker is feeling a given emotion |
| EmotionLines (Chen et al., 2018) | Academia Sinica, Taiwan | An emotional dialogue dataset in which all utterances are labelled with emotions |
| VisDial (Das et al., 2017) | Carnegie Mellon University | A large-scale Visual Dialogue dataset; all queries and answers in this dataset are based on the given image |

Table 1. A review of conditional text datasets

4. Major Research Areas

After introducing the definition and application fields of c-TextGen, in this section we investigate different c-TextGen techniques in detail, including context-based text generation, topic-aware text generation, personalized text generation, etc.

4.1. Context-based Text Generation

In many applications of text generation, context information is the key to making the generated text coherent and fluent. The context means the situation in which natural language is generated. In a dialogue system, the context usually refers to the dialogue history of a multi-round dialogue. The ability to consider previous utterances is central to building active and engaging dialogue systems (Chen et al., 2017). In a review generation system, meanwhile, the context refers to the time, emotions, sentiments and other factors. The context information provides clues for the generation of natural language (Tang et al., 2016). Therefore, in order to generate high-quality text, it is necessary to consider the context information in text generation. We give a brief summary of context-based text generation methods in Table 2.

| Work | Method | Description |
|---|---|---|
| Sordoni et al. (Sordoni et al., 2015) | RNN + Dialog history embedding | Embedding all words and phrases in the dialogue history into continuous representations as additional input of the decoding stage |
| Serban et al. (Serban et al., 2016) | HRED + Hierarchical dialog history embedding | Embedding the word sequence in each context sentence at the low level and the sentence sequence in the historical dialogue at the top level to efficiently capture the context information |
| Tang et al. (Tang et al., 2016) | RNN + Context embedding | Treating the information or situations that may affect product reviews as the context; embedding the context information to generate reviews that fit the specific context |
| Jaech et al. (Jaech and Ostendorf, 2018) | RNN + Context adaption | Using the context information (dialogue history) to transform the weights of recurrent units in RNN to effectively capture high-dimensional context |

Table 2. A summary of context-based text generation methods

Dialog history embedding. Among the numerous tasks of context-based text generation, the use of context in dialogue systems has attracted great attention from researchers. By integrating historical dialogue content in different ways, long-term, multi-round dialogue systems have developed vigorously. For instance, Sordoni et al. (Sordoni et al., 2015) tackle the challenge of context-based dialogue response generation by embedding all words and phrases in the dialogue history into continuous representations. The dialogue history is encoded into vectors, which are decoded by another RNN to produce context-aware responses.

Hierarchical dialog history embedding. Instead of embedding the whole dialogue history directly, the hierarchical embedding method divides dialogue history embedding into word embedding and sentence embedding. Serban et al. (Serban et al., 2016) use the Hierarchical Recurrent Encoder-Decoder (HRED) model to hierarchically encode the dialogue history, capturing the context information and guiding the generation of replies. In particular, the word sequence in each context sentence is encoded at the low level, while the sentence sequence in the historical dialogue is encoded at the top level. Xing et al. (Xing et al., 2018) leverage the attention mechanism to extend the HRED model. By incorporating attention at the word level and the sentence level respectively, the model captures the most important parts of the context. Zhang et al. (Zhang et al., 2018a) observe that in multi-user dialogue, smoother conversations are possible without much context information, and thus propose a tree-based hierarchical multi-user dialogue model, which builds a tree structure consisting of many branches for a multi-user conversation in order to select the exact context sequences. In addition, Tian et al. (Tian et al., 2017) conduct a comprehensive survey of existing context-aware conversational models and find that the hierarchical model is more capable of capturing context information than the non-hierarchical model.
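
The two-level encoding can be sketched with a toy, untrained model. The vocabulary, the dimensions, the random weights and the plain tanh recurrence are all assumptions made for illustration; the HRED model itself uses trained gated recurrent units and is followed by a decoder, which is omitted here.

```python
import math
import random

random.seed(0)
EMB, HID = 3, 4  # embedding and hidden sizes, arbitrary for this toy example

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def run_rnn(w_in, w_h, inputs):
    """Run a simple tanh RNN over a list of vectors and return the last hidden state."""
    h = [0.0] * len(w_h)
    for x in inputs:
        h = [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_h, h))]
    return h

vocab = {"hi": 0, "how": 1, "are": 2, "you": 3, "fine": 4, "thanks": 5}
embeddings = rand_matrix(len(vocab), EMB)
word_in, word_h = rand_matrix(HID, EMB), rand_matrix(HID, HID)  # low level
sent_in, sent_h = rand_matrix(HID, HID), rand_matrix(HID, HID)  # top level

def encode_dialogue(utterances):
    # Low level: encode the word sequence of each utterance separately.
    utterance_vectors = [
        run_rnn(word_in, word_h, [embeddings[vocab[w]] for w in utt.split()])
        for utt in utterances
    ]
    # Top level: encode the sequence of utterance vectors.
    return run_rnn(sent_in, sent_h, utterance_vectors)

history = ["hi", "how are you", "fine thanks"]
context_vector = encode_dialogue(history)
print(context_vector)
```

The top-level RNN takes one step per utterance rather than one step per word, so information from early turns passes through far fewer recurrent updates than in a flat encoding of the concatenated history.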

Context embedding. Beyond dialogue systems, other applications also need to add different types of context information to the text generation procedure. Tang et al. (Tang et al., 2016) redefine context as not the historical corpus of a dialogue system but the information or situations that may influence the output content. They then propose an RNN-based text generation model that generates context-specific reviews by embedding the context information into continuous vector representations. Similarly, Clark et al. (Clark et al., 2018) propose a text generation model for stories that treats entity representations extracted from dialogue history as context. By encoding the historical conversations together with the entity representations as context, the model can better determine which entities or words to mention next.

Context adaptation. The context adaptation method changes the model itself rather than treating context as an extra input of the text generation model. Jaech et al. (Jaech and Ostendorf, 2018) use the context information to transform the weights of the recurrent layer in an RNN. In particular, they use a low-rank decomposition to control the degree of parameter sharing across contexts, which performs well on high-dimensional and sparse context.
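To make the idea of adapting weights (rather than inputs) concrete, the following sketch adds a context-dependent low-rank update to a base weight matrix. It is a simplified illustration and not the exact formulation of Jaech and Ostendorf: the gating scheme, the function name `adapt_weights`, and all matrices are assumptions chosen for readability.

```python
def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def adapt_weights(base, left, right, context_proj, context):
    """Return base + left @ diag(gates) @ right, a rank-r context-specific update.

    base:         d x d shared recurrent weights
    left, right:  d x r and r x d low-rank factors (r much smaller than d)
    context_proj: r x k projection from the context vector to r gates
    context:      length-k context vector
    """
    # One scalar gate per rank component, computed from the context.
    gates = [sum(p * c for p, c in zip(row, context)) for row in context_proj]
    scaled_right = [[g * v for v in row] for g, row in zip(gates, right)]
    update = matmul(left, scaled_right)
    return [[b + u for b, u in zip(base_row, update_row)]
            for base_row, update_row in zip(base, update)]


base = [[1.0, 0.0], [0.0, 1.0]]
left = [[1.0], [0.0]]          # d x r with r = 1
right = [[0.0, 1.0]]           # r x d
context_proj = [[1.0, 0.0]]    # r x k
adapted = adapt_weights(base, left, right, context_proj, [2.0, 5.0])
unchanged = adapt_weights(base, left, right, context_proj, [0.0, 0.0])
```

Only the small factors depend on the context, so the number of context-specific parameters grows with the rank `r` rather than with the full size of the weight matrix.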

4.2. Personalized Text Generation

Various human characteristics significantly influence interpersonal communication and writing style. In other words, personalization plays a key role in enhancing the quality of text generation systems. On the one hand, personalization is vital for creating truly smart dialogue agents that can be seamlessly incorporated into people's lives. On the other hand, it ensures that generated product reviews depend not only on the characteristics of the product but also on the preferences of specific users, lending authenticity to the generated reviews. Several efforts on personalized text generation are summarized in Table 3, and we discuss them in detail below.

| Work | Method | Description |
| --- | --- | --- |
| Li et al. (Li et al., 2016) | RNN + Speaker model | The speaker model encodes each individual speaker into a vector to capture individual characteristics; generating personal responses matching a specific user |
| Kottur et al. (Kottur et al., 2017) | HRED + Speaker model | Combining the HRED and speaker model to better capture context-related information and user personalized features; generating personal and context-relevant responses |
| Luan et al. (Luan et al., 2017) | Autoencoder + Multi-task learning | Training a response generation model on a small personalized dialogue dataset, and then training an autoencoder model with non-conversational data; sharing parameters of the two models to obtain the personalized dialogue model |
| Yang et al. (Yang et al., 2017) | Transfer learning + Pre-training and fine-tuning | Using massive generic dialogue data to pre-train and small-scale personalized dialogue data to fine-tune the dialogue model, in order to generate personalized responses |
| Yang et al. (Yang et al., 2018) | Reinforcement learning + Persona embedding | Embedding user-specific information into a vector representation; the RL mechanism optimizes three rewards (topic coherence, informativeness and grammaticality) to generate more personalized responses |
Table 3. A summary of personalized text generation methods

Personalized feature embedding. The simplest way to achieve personalized text generation is to embed the personalized characteristics of different users. Li et al. (Li et al., 2016) present a speaker model which encodes user profiles (e.g., speaking style and background information) into vectors so as to capture personalized characteristics and guide response generation during the decoding stage. In (Kottur et al., 2017), Kottur et al. propose the CoPerHED model, which combines the speaker model and the HRED model to better capture context-related information and user personalized features and thus produce high-quality dialogue. In (Qian et al., 2017), Qian et al. generate profile-consistent replies by leveraging general dialogue data posted by users on social media. They use a profile detector to determine whether a user profile should be expressed in the response and to decide the word position of the profile value. Instead of encoding personalized features into vector representations directly, Herzig et al. (Herzig et al., 2017) use an additional neural network layer to capture high-level personality-based information. The additional layer implicitly influences the decoding hidden state to ensure that the personalized features are integrated into the generated text.
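One simple way to picture speaker embedding is to append a per-speaker vector to every decoder input, so that the same words are decoded differently for different users. The sketch below illustrates only this input construction; the function name `build_decoder_inputs`, the dimensions, and the hand-set speaker vectors are assumptions for illustration (in a real model the speaker vectors would be learned jointly with the decoder).

```python
import random


def build_decoder_inputs(response_tokens, speaker_id, word_emb, speaker_emb):
    """Concatenate the speaker vector to each word vector fed to the decoder."""
    persona = speaker_emb[speaker_id]
    return [word_emb[token] + persona for token in response_tokens]


rng = random.Random(0)
WORD_DIM = 4  # illustrative size
word_emb = {w: [rng.uniform(-1, 1) for _ in range(WORD_DIM)]
            for w in ["i", "live", "in", "london", "paris"]}
# Hand-set stand-ins for learned speaker embeddings.
speaker_emb = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}

inputs_alice = build_decoder_inputs(["i", "live", "in"], "alice", word_emb, speaker_emb)
inputs_bob = build_decoder_inputs(["i", "live", "in"], "bob", word_emb, speaker_emb)
```

Because the word part of each input is identical for both speakers, any difference in the generated continuation has to come from the speaker vector.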

Multi-task and transfer learning. Personalized text datasets are so scarce that the above models struggle to achieve good results. To address this issue, some researchers attempt to improve personalized text generation with transfer learning and multi-task learning. In (Luan et al., 2017), Luan et al. train a response generation model on a small-scale personalized dialogue dataset and then train an autoencoder model on non-conversational data. The parameters of the two models are then shared through a multi-task learning mechanism to obtain personalized responses. Yang et al. (Yang et al., 2017) propose a personalized dialogue model based on the domain adaptation idea in transfer learning. They use massive generic dialogue data to pre-train the dialogue model and small-scale personalized dialogue data to fine-tune it, and apply the policy gradient algorithm to make the generated responses more personalized and informative. Similarly, Zhang et al. (Zhang et al., 2019) put forward a model named Learning to Start (LTS) to improve the quality of responses, which divides the training process into initialization (modeling the responding style of humans) and adaptation (generating personalized responses) in order to generate relevant and diverse responses.

RL models. RL can control the quality of generated content through different policies or rewards, so researchers have begun to apply it to personalized text generation. Yang et al. (Yang et al., 2018) present an attention-based hierarchical encoder-decoder architecture trained with RL for personalized dialogue generation, which defines three types of reward: topic coherence, mutual information, and language model.

Personalized datasets. To make progress on personalized text generation without being held back by data scarcity, researchers have published several high-quality personalized text datasets. For example, Zhang et al. (Zhang et al., 2018b) present a large-scale and high-quality personalized dialogue dataset named PERSONA-CHAT, containing 164,356 sentences between randomly paired crowd workers. The paired workers are each assigned a set of profile information and asked to chat naturally in order to get to know each other. The conversations are amusing and engaging, which makes them suitable for training personalized dialogue models. In (Mazaré et al., 2018), Mazaré et al. build a profile-based dialogue dataset from conversations collected from REDDIT, including over 700 million multi-turn dialogues and more than 5 million user profiles. The personalized characteristics are extracted from users' past posts, offering later researchers a new approach to personalized text generation.

4.3. Topic-aware Text Generation

Topic information is indispensable in human conversation, and we usually decide the topic of articles before expanding the content when writing. Therefore, topic-aware text generation is a research hotspot. We give a review of topic-aware text generation studies in Table 4.

| Work | Method | Description |
| --- | --- | --- |
| Xing et al. (Xing et al., 2017) | RNN + LDA + Topic embedding | Utilizing the LDA model to get topic information about dialogue contents, and embedding the topic words into vectors; generating more informative and topic-relevant responses |
| Dziri et al. (Dziri et al., 2018) | HRED + LDA + Topic embedding | Combining topic and context information to produce not only contextual but also topic-aware responses |
| Wang et al. (Wang et al., 2018) | CNN + LDA + RL | Using the LDA model to get topic information of the dialogue content, CNN to capture the dialogue information, and the RL mechanism to optimize the model with a specific evaluation metric; generating coherent, diverse, and informative text summaries |
Table 4. A summary of topic-aware text generation methods

It is a common idea to extract topic information from existing text and use it to guide the process of text generation. Xing et al. (Xing et al., 2017) propose a topic-aware Seq2seq (TA-Seq2Seq) model to generate informative and interesting responses for chatbots, which incorporates topic information of the dialogue history extracted by a pre-trained LDA model, followed by a joint attention mechanism for generation guidance. Similarly, in (Xing et al., 2016), Xing et al. capture the topic information in the dialogue by encoding the topics related to the input sentences into vectors, in order to generate high-quality and diversified responses. Choudhary et al. (Choudhary et al., 2017) observe that topic information can be divided into multiple domains (e.g., games, sports, movies), and thus they utilize domain classifiers to capture domain information from the dialogue history for generating domain-relevant responses.

There are also studies that consider both topic information and other types of conditions to improve the quality of text generation. For instance, in (Dziri et al., 2018), Dziri et al. introduce a Topical Hierarchical Recurrent Encoder Decoder (THRED) model to generate contextual and topic-aware responses, which considers topic information and long-range context information simultaneously through a hierarchical attention-based architecture. Wang et al. (Wang et al., 2017) present a response generation model that can generate responses with a specific language style and a topic restriction. Through an additional topic embedding layer and language style fine-tuning methods, the model can effectively restrict the style and the topic information of the generated responses.

The above studies mostly employ RNN models for topic-aware text generation. Several other models have also been applied to this task and achieve remarkable results. For instance, Wang et al. (Wang et al., 2018) propose a topic-aware convolutional Seq2seq (ConvS2S)-based text summarization model, which leverages joint attention and a biased probability generation mechanism to incorporate topic information. By directly optimizing the model for the commonly used evaluation metric ROUGE with the self-critical sequence training method, the model yields high accuracy for text summarization.

4.4. Emotional Text Generation

Natural language is full of emotions, and emotional words are more likely to stimulate the interest of readers. Additionally, in daily communication people adjust what they say and how they say it according to their own and other people's emotional changes. Given this need, researchers have begun to incorporate emotional information into the generated text in order to provide people with a better experience, as summarized in Table 5.

| Work | Method | Description |
| --- | --- | --- |
| Zhou et al. (Zhou et al., 2018b) | GRU + Emotional embedding | Emotion category embedding captures emotional information from the dialogue history and the internal emotion memory balances the grammaticality and the expression degree of emotions; generating context-relevant, grammatically correct and emotionally consistent dialogue content |
| Fu et al. (Fu et al., 2018) | GRU + Multi-task learning | The multi-decoder Seq2seq module generates outputs with different styles and the style embedding module augments the encoded representations; generating outputs in different styles |
| Kong et al. (Kong et al., 2019) | Conditional GAN (CGAN) + Sentiment control | The generator generates sentimental responses based on a sentiment label and the discriminator distinguishes the generated replies from real replies; generating sentimental responses under a specific sentiment label |
| Li et al. (Li et al., 2019a) | RL + Emotional editor | The emotional editor module selects the template sentence based on the topic and emotion, and the RL mechanism forces the model to enhance the coherence and emotion expression of generated responses |
Table 5. A summary of emotional text generation methods

Emotion extraction and embedding. One general way to generate emotional text is to extract the corresponding emotional information from existing text and integrate it into the training process. Asghar et al. (Asghar et al., 2018) propose an LSTM-based emotional dialogue generation model. They use several strategies to generate emotional responses, including embedding emotions based on a cognitively engineered dictionary, leveraging emotionally-minded objective functions, and introducing emotionally diverse decoding strategies. In (Zhou et al., 2018b), Zhou et al. propose the Emotional Chatting Machine (ECM), which aims to generate grammatically correct, context-relevant and emotionally congruent dialogue content. The ECM leverages an internal emotion memory to balance grammaticality and the degree of emotional expression, and applies an external emotion memory to help the decoder produce more reasonable emotional responses. In (Hu et al., 2017), Hu et al. propose a Variational Autoencoder (VAE)-based neural emotional text generation model which generates sentences with specific attributes such as tense or sentiment. The model can be viewed as a VAE enhanced with an extended wake-sleep mechanism that alternately learns the generator and the discriminator in the sleep phase.

Emotion transferring. In addition to embedding emotional information directly, the transfer of text from one emotion to another is another way to generate emotional text. Fu et al. (Fu et al., 2018) achieve the goal of transferring the emotion of reviews from positive to negative or from negative to positive through multi-task learning and adversarial training. The proposed model leverages a style embedding module to augment the language style representations and a multi-decoder Seq2seq model to respectively generate text with different styles.

GAN and RL models. Researchers have found that, in addition to RNN, GAN and RL are effective for emotional text generation. In (Kong et al., 2019), Kong et al. propose a conditional GAN (CGAN)-based sentiment-controlled dialogue generation model. The generator of the CGAN generates sentimental responses given the dialogue history and a sentiment label, while the discriminator judges the quality of a generated response by checking whether the triple (dialogue history, sentiment label, dialogue response) belongs to the real data distribution. Li et al. (Li et al., 2019a) propose an RL-based dialogue model combined with an emotional editor module to generate emotional, topic-relevant and meaningful dialogue responses. The emotional editor module selects template sentences according to the emotion and topic information in the dialogue history, and the RL mechanism constrains the quality of generated responses in three respects: emotion, topic and coherence.

Emotional datasets. To address the scarcity of datasets for emotional text generation, Rashkin et al. (Rashkin et al., 2019) publish a large-scale emotional dialogue dataset called Empathetic Dialogues, consisting of 25k conversations. This dataset covers an extensive set of emotions, and each speaker in it is assigned a given emotion during the conversation. Chen et al. (Chen et al., 2018) publish another high-quality emotional dialogue dataset, named EmotionLines, collected from TV scripts and Facebook dialogues and including 29,245 utterances from 2,000 dialogues. All utterances in it are labelled with specific emotion labels according to their textual content to guide emotional dialogue response generation.

4.5. Knowledge-enhanced Text Generation

Nowadays, most text generation systems take advantage of deep neural network models, such as RNN, GAN and RL, aiming to generate fluent, semantic and consistent text. However, a big difference between such machine-generated text and human language expression is that humans draw on their own knowledge when speaking or writing, while most text generation systems fail to do so. By incorporating sufficient knowledge, such as general knowledge and known information about specific objects or events, a text generation system can generate more logical, credible, and informative text. There are many ways to incorporate external knowledge into text generation systems, such as knowledge graphs and external memory, as summarized in Table 6, and we introduce them in detail below.

| Work | Method | Description |
| --- | --- | --- |
| Ghazvininejad et al. (Ghazvininejad et al., 2018) | GRU + Multi-task learning + Facts embedding | The Facts Encoder module leverages an external memory for embedding the facts related to an entity mentioned in the conversation to generate content-rich responses |
| Zhou et al. (Zhou et al., 2018c) | GRU + Knowledge graph | Graph attention mechanisms integrate commonsense information from the knowledge base according to the dialogue history; generating more appropriate and informative responses |
| Dinan et al. (Dinan et al., 2018) | Transformer + Memory Network | The Memory Network retrieves knowledge about the dialogue from the memory and the Transformer encodes and decodes the text representations to generate responses; conducting knowledgeable discussions on open-domain topics |
| Mazumder et al. (Mazumder et al., 2018) | Lifelong learning + Open-world knowledge base completion | Obtaining new knowledge by asking users for related items when facing unknown concepts and then performing inference to grow knowledge over time |
Table 6. A summary of knowledge-enhanced text generation methods

Usage of external memory. The way to leverage external memory is to extract relevant knowledge from it according to the input and then treat the extracted knowledge as part of the input of the system. Ghazvininejad et al. (Ghazvininejad et al., 2018) present a knowledge-based dialogue model aiming at producing knowledgeable dialogue responses. They use the Facts Encoder module to embed the facts relevant to the specific entities that appear in the dialogue history, providing factual evidence for the text generation procedure. In (Young et al., 2018), Young et al. propose a dialogue model that integrates a large commonsense knowledge base to retrieve commonsense knowledge about concepts that appear in the dialogue. The retrieved knowledge and the dialogue context are encoded separately to guide response generation. Wang et al. (Wang et al., 2019b) build a technical-oriented dialogue system to communicate with people about Ubuntu-related questions, which concatenates a technical knowledge embedding to the traditional word embedding to better understand the technical terms in the dialogue history and generate more professional and specific responses. Moving beyond RNN models, Dinan et al. (Dinan et al., 2018) combine the Transformer model and the Memory Network to build an open-domain knowledge-based dialogue system. The Memory Network module in this system retrieves related knowledge from the Internet and interprets the retrieved knowledge, while the Transformer module encodes and decodes the text representations to generate knowledgeable responses.

Knowledge graph-based models. A knowledge graph is a kind of structured knowledge base that describes physical entities and their connections efficiently, which makes it a useful basis for knowledge-enhanced text generation systems. In (Liu et al., 2019), Liu et al. present a knowledge graph-based chatbot model to generate informative responses. Specifically, a reinforcement learning-based reasoning model selects relevant knowledge from the knowledge graph by multi-hop reasoning, and an encoder-decoder model produces responses conditioned on the selected knowledge. Zhou et al. (Zhou et al., 2018c) propose a knowledge-based dialogue model (CCM) which leverages two graph attention mechanisms to promote dialogue understanding and knowledgeable response generation. The static graph attention module encodes the graphs relevant to the dialogue history, and the dynamic graph attention module then uses the encoded semantic information to generate knowledgeable responses.
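As a rough illustration of what selecting knowledge by multi-hop reasoning involves, the sketch below collects all triples reachable within a fixed number of hops from an entity mentioned in the dialogue. Note the substitution: it uses plain breadth-first search, not the learned RL-based reasoning of Liu et al., and the toy triples and the name `neighbors_within` are hypothetical.

```python
from collections import deque


def neighbors_within(triples, start, max_hops):
    """Return the (head, relation, tail) triples reachable from start in at most max_hops hops."""
    adjacency = {}
    for head, relation, tail in triples:
        adjacency.setdefault(head, []).append((relation, tail))

    found = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        entity, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for relation, tail in adjacency.get(entity, []):
            found.append((entity, relation, tail))
            if tail not in seen:
                seen.add(tail)
                queue.append((tail, hops + 1))
    return found


# Toy knowledge graph, invented for illustration.
triples = [
    ("Paris", "capital_of", "France"),
    ("France", "located_in", "Europe"),
    ("Europe", "is_a", "continent"),
]
one_hop = neighbors_within(triples, "Paris", 1)
two_hop = neighbors_within(triples, "Paris", 2)
```

A learned reasoning model would replace the exhaustive expansion with a policy that chooses which edge to follow at each hop; the retrieved triples would then be encoded and passed to the response decoder.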

Continuous learning models. Although the existing models introduce some real-world knowledge into the text generation procedure, the knowledge is usually fixed and cannot be expanded or updated. Some researchers try to employ large-scale knowledge bases, but their scale and comprehensiveness are still limited. Continuous learning in interactive surroundings is an important capability of human beings. We keep learning and updating our knowledge according to what we encounter in daily life, which should be considered an important factor when building humanoid text generation systems. Mazumder et al. (Mazumder et al., 2018) build a knowledge learning model for chatbots, lifelong interactive learning and inference (LiLi), which enables chatbots to learn new knowledge interactively and continuously while communicating with users. Mimicking the way humans acquire knowledge, LiLi asks users for related items when facing unknown concepts and then performs inference to grow its knowledge over time.

4.6. Diversified Text Generation

The traditional Seq2seq text generation models prefer to generate generic but meaningless text, such as “I don’t know” and “I’m sorry”. The reason for this phenomenon is that there is a large amount of general text in the training datasets, so neural networks increase the probability of such text to optimize the maximum likelihood objective function. To generate diverse texts, researchers have made great efforts, such as changing the objective function and adjusting the model structure, as summarized in Table 7.

| Work | Method | Description |
| --- | --- | --- |
| Li et al. (Li et al., 2015) | LSTM + Maximum Mutual Information | Replacing the original objective function with the Maximum Mutual Information (MMI) to reduce the probabilities of generic responses; increasing the diversity of generated dialogue responses |
| Xu et al. (Xu et al., 2018) | Diversity-Promoting GAN (DP-GAN) + Language-model based discriminator | The discriminator gives rewards according to the novelty of generated text, where novel and fluent text is highly rewarded; improving the diversity and informativeness of the text generated by the generator |
| Shi et al. (Shi et al., 2018) | Inverse Reinforcement Learning (IRL) | Assuming the diversity of the real text is higher, the reward function distinguishes the real text in the training set from the generated text; generating higher-quality and more diverse text |
Table 7. A summary of diversified text generation methods

Optimizing the objective function. Li et al. (Li et al., 2015) replace the Negative Log-likelihood (NLL) objective function of the traditional Seq2seq model with Maximum Mutual Information (MMI) to address the low diversity of generated text. MMI reduces the score of generic responses, which appear most frequently in the training data, and thereby increases the probability of diversified responses.
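The effect of the MMI criterion can be seen in a small reranking example. The sketch below scores each candidate response T for a source S as log p(T|S) minus a weighted log p(T), which is one common form of the MMI objective; the probabilities, the weight `lam`, and the candidate sentences are made-up values for illustration only.

```python
import math


def mmi_score(log_p_response_given_source, log_p_response, lam=0.5):
    """Penalize responses that are likely regardless of the source (generic ones)."""
    return log_p_response_given_source - lam * log_p_response


# Hypothetical model probabilities: "cond" is p(T|S), "prior" is p(T).
candidates = {
    "i don't know": {"cond": math.log(0.30), "prior": math.log(0.20)},
    "the museum opens at nine": {"cond": math.log(0.25), "prior": math.log(0.001)},
}

# Pure likelihood picks the generic response.
best_likelihood = max(candidates, key=lambda r: candidates[r]["cond"])
# MMI reranking prefers the response that is specific to the source.
best_mmi = max(
    candidates,
    key=lambda r: mmi_score(candidates[r]["cond"], candidates[r]["prior"]),
)
```

The generic reply has the higher conditional likelihood, but its high prior probability is subtracted under MMI, so the source-specific reply wins.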

GAN and RL-based models. In addition to replacing the objective function, GAN and RL are also widely applied to generate diverse text, with notable success. Xu et al. (Xu et al., 2018) propose the Diversity-Promoting Generative Adversarial Network (DP-GAN) model, in which the discriminator gives different rewards to the generator according to the novelty of the generated text. Instead of using a classifier, DP-GAN uses a language model-based discriminator and judges the novelty of text by the output of the language model, namely the cross-entropy. By combining GAN with an adjusted objective function, Zhang et al. (Zhang et al., 2018c) propose the Adversarial Information Maximization (AIM) model to produce dialogue responses with high diversity and informativeness. The AIM model leverages adversarial training to improve the diversity of generated text and maximizes the Variational Information Maximization Objective (VIMO) to increase the informativeness of the responses.

In (Shi et al., 2018), Shi et al. use inverse reinforcement learning (IRL) model to generate diverse texts. The reward function of the IRL model distinguishes the real text in the dataset and the generated text to explain how the natural language text is structured, and the generation policy learns to maximize the expected total rewards.

4.7. Visual Text Generation

Since people usually gather information from images, visual text generation is also an important research direction in text generation. Two of the most important applications are image captioning and visual question answering (visual QA). A summary of visual text generation methods is given in Table 8.

| Work | Method | Description |
| --- | --- | --- |
| Vinyals et al. (Vinyals et al., 2015) | RNN + CNN | The encoder CNN captures information in an image, and the decoder RNN generates a natural language description of this image based on the image features |
| Dai et al. (Dai et al., 2017) | Conditional GAN (CGAN) + CNN + LSTM | CNN captures information in an image, LSTM generates the relevant descriptions, and the discriminator evaluates how well a sentence or paragraph describes the image |
| Malinowski et al. (Malinowski et al., 2015) | LSTM + CNN | CNN and LSTM respectively encode the image and the question into vectors to capture the semantic information, and then another LSTM generates corresponding answers |
| Das et al. (Das et al., 2017) | HRED + Memory Network | Embedding the image, the historical dialogue and the given question respectively to consider the image and dialogue context information in conversation |
Table 8. A summary of visual text generation methods

Image caption. Image caption technologies generate a text description of a given image to make information acquisition more efficient for users. Vinyals et al. (Vinyals et al., 2015) automatically view an image and generate a reasonable description using the encoder-decoder structure. The encoder CNN captures information in the image, and the decoder RNN generates the text description. Because of the heavy loss of image information caused by the high-dimensional structure of the CNN, Xu et al. (Xu et al., 2015) propose an image caption model that uses the attention mechanism to extract the most important information in the image and thus generates more accurate and detailed image descriptions. Unlike previous work, Dai et al. (Dai et al., 2017) leverage the CGAN model to generate image descriptions that are of high quality in three respects: naturalness, semantic relevance, and diversity. The discriminator of the CGAN evaluates the quality of the generated image description to offer guidance to the generator.
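The attention step used in attention-based captioning can be sketched as follows: each image region receives a weight based on its relevance to the current decoder state, and the weighted sum of region features forms the context for predicting the next word. This is a minimal sketch; the dot-product scoring function is a simplifying assumption (the published captioning models use a learned scoring network), and the region features and decoder state are toy values.

```python
import math


def soft_attention(region_features, decoder_state):
    """Return softmax attention weights over regions and the attended context vector."""
    # Relevance score of each region to the current decoder state (dot product).
    scores = [sum(f * s for f, s in zip(region, decoder_state))
              for region in region_features]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted sum of the region features.
    dim = len(region_features[0])
    context = [sum(w * region[d] for w, region in zip(weights, region_features))
               for d in range(dim)]
    return weights, context


regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy features for three image regions
state = [2.0, 0.0]                              # toy decoder hidden state
weights, context = soft_attention(regions, state)
```

The decoder recomputes these weights at every time step, so different words in the caption can attend to different parts of the image.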

Visual QA. Besides image caption, visual QA is another important technology in the field of visual text generation. By understanding the questions and the related images, a visual QA system can find the corresponding information in the images and generate correct answers. Malinowski et al. (Malinowski et al., 2015) combine CNN with LSTM to answer questions about a given image, in which the CNN captures the information in the image related to the question, and the LSTM generates answers based on the vector representations of the image and the question. Zhu et al. (Zhu et al., 2016) build a semantic relationship between text descriptions and regions in the image through object-level grounding, in order to generate answers to questions that correspond to specific image regions.

Visual dialogue. Instead of simple single-round visual QA, Das et al. (Das et al., 2017) implement a visual dialogue system to communicate with people in multiple rounds about a given image. They put forward the task of Visual Dialogue and publish a large-scale Visual Dialogue dataset called VisDial. Three novel encoder modules are designed for the visual dialogue task: the Late Fusion module encodes the image, the historical dialogue and the given question separately, the Hierarchical Recurrent Encoder module encodes the dialogue history at a higher level, and the Memory Network module stores the previous QA pairs as “facts” to offer a factual basis for generating later responses.

5. Key Techniques of c-TextGen

Having presented the recent progress and representative studies of c-TextGen, in this section we elaborate the key techniques used in these studies. The advantages and disadvantages of different text generation techniques are summarized in Table 9.

| Technique | Advantages | Disadvantages |
| --- | --- | --- |
| RNN | Natural sequence structure is very suitable for the task of sequence modeling | Very prone to the problem of vanishing gradient or exploding gradient; cannot capture long-distance dependencies well |
| GAN | Unsupervised learning; generating clearer and more realistic samples than other generative models | Unstable training process; not suitable for processing discrete data, such as text |
| Reinforcement learning | Similar to human learning manners; combining with GAN can subtly solve the existing problems in GAN and generate extremely realistic text | Quite complicated training process |
| VAE | Leveraging the latent vectors to increase the diversity of the generated text | Without adversarial learning like GAN, generating less fluent content |
| Transformer | The attention mechanism can efficiently capture the context information and avoid the problem of vanishing gradient or exploding gradient; fast parallel computing speed | Large amount of calculation and slow training speed |
Table 9. A summary of text generation techniques

5.1. RNN

RNN is one of the most commonly used models in text generation; its naturally sequential structure suits sequence modeling. The recurrent structure means that an RNN processes sequence data in order, with the output at each time step conditioned on the current input and the previous outputs. After sequential processing, the semantic information of the given text is compressed into a fixed-length vector, which gives the RNN its memory. However, the vanishing and exploding gradient problems, together with a limited ability to capture long-distance dependencies, restrict the plain RNN in practical applications. Variants such as LSTM and GRU combine short-term and long-term memory through specially designed gating mechanisms, which largely alleviates these problems and has made them the most popular RNN models. Many studies have shown that LSTM can generate natural and realistic text in a range of text generation tasks. In (Sutskever et al., 2014), Sutskever et al. propose an LSTM-based encoder-decoder framework that maps one text sequence to another. In this framework, the “source” sequence is encoded into a fixed-length vector by one LSTM, and another LSTM then generates the natural language text, taking the vector from the encoder stage as its initial state. Since it was put forward, this framework has been widely used in various NLP tasks, including machine translation, dialogue systems, and text summarization.

RNN is widely used in c-TextGen models. In the case of personalized text generation, Luo et al. (Luo et al., 2019) build a personalized goal-oriented dialogue system for the task of restaurant reservation. The Profile Model encodes the users' personalized information, the Preference Model resolves the ambiguity of the same query when it comes from different users, and the Memory Network stores similar users' dialogue history. They combine the dialogue habits of similar users with the user's personalized features to generate personalized responses. As for diversified text generation, Jiang et al. (Jiang et al., 2019) introduce another loss function to improve on the cross-entropy in the traditional RNN model, namely Frequency-Aware Cross-Entropy (FACE). By assigning different weights conditioned on words' frequency, the FACE function reduces the probability of generic words and increases the diversity of generated responses.
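The idea of weighting the cross-entropy by word frequency can be illustrated with a small example. The weighting rule used here (one minus the relative frequency of the token) is an assumption chosen for simplicity and is not necessarily the formula used in FACE; the function names and the toy corpus are likewise hypothetical.

```python
import math
from collections import Counter


def frequency_weights(corpus_tokens):
    """Give frequent tokens a smaller loss weight (assumed rule: 1 - relative frequency)."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {token: 1.0 - count / total for token, count in counts.items()}


def weighted_cross_entropy(target_tokens, predicted_probs, weights):
    """Average of -w(y_t) * log p(y_t) over the target sequence."""
    losses = [-weights[token] * math.log(probs[token])
              for token, probs in zip(target_tokens, predicted_probs)]
    return sum(losses) / len(losses)


corpus = ["i", "i", "i", "know", "museum"]  # toy training corpus
weights = frequency_weights(corpus)

# Same predicted probability, different target tokens.
loss_frequent = weighted_cross_entropy(["i"], [{"i": 0.5}], weights)
loss_rare = weighted_cross_entropy(["museum"], [{"museum": 0.5}], weights)
```

With equal predicted probabilities, an error on the rare token costs more than an error on the frequent token, so training pressure shifts away from generic words.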

5.2. GAN and RL

Although RNN is the most popular model in text generation, it also has some disadvantages. First, most text generation models based on RNN are trained by maximizing the log-likelihood objective function, which may lead to the problem of exposure bias. Second, most loss functions are calculated at the level of words, while most evaluation metrics are based on the level of sentences, which may result in the inconsistency between the optimization direction of the model and the actual requirements. Facing these problems, GAN is introduced into the field of text generation. GAN is made of two parts: the generator and the discriminator. The generator produces false sample distributions similar to the real data, and the discriminator distinguishes generated samples and real samples as accurately as possible.
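For reference, the interplay between the two parts is usually written as a two-player minimax game. The formula below is the standard objective of the original GAN formulation; the symbols $G$ (generator), $D$ (discriminator), $p_{\text{data}}$ (real data distribution) and $p_z$ (noise prior) are conventional notation introduced here and do not appear elsewhere in this survey.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator maximizes $V$ by assigning high scores to real samples and low scores to generated ones, while the generator minimizes $V$ by producing samples that the discriminator scores as real.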

Because text is typical discrete data, it is difficult for the gradient of the discriminator to back-propagate correctly through the discrete outputs of the generator, so applying GAN to text generation is not straightforward. Zhang et al. (Zhang et al., 2016) address this problem with a smooth approximation of the generator’s output. Instead of using the standard GAN objective, they match feature distributions and make the word predictions “soft” in the embedding vector space to generate high-quality sentences.
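One common way to make word predictions “soft” is to replace the hard choice of a single word with a probability-weighted average of word embeddings. The sketch below shows this general idea only and is not claimed to be the exact formulation of Zhang et al.; the function names and the `temperature` parameter are assumptions for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def soft_embedding(logits, embedding_table, temperature=1.0):
    """Probability-weighted average of word embeddings.

    Unlike an argmax over the vocabulary, this output varies smoothly
    with the logits, so gradients can flow back into the generator.
    """
    probs = softmax(logits, temperature)
    dim = len(embedding_table[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding_table))
            for d in range(dim)]
```

A lower temperature pushes the result toward the embedding of the single most likely word, while a higher temperature blends the embeddings more evenly.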

Although direct improvements to GAN have achieved some progress, they are still far from meeting researchers’ requirements. Therefore, the idea of RL has been introduced to text generation. RL is usually formulated as a Markov decision process in which the action taken in each state receives a reward (or a negative reward, i.e., a punishment). To maximize the expected reward, the RL agent tries various possible actions in various states and learns the optimal policy from the rewards provided by the environment. By combining RL and GAN, researchers have obtained excellent results. Yu et al. (Yu et al., 2017) propose the SeqGAN model to solve the problems of GAN in generating discrete text data. SeqGAN regards text generation as a sequential decision process in RL, in which the sequence generated so far represents the current state, the next word to be generated is the action to be taken, and the returned reward is the discriminator’s score of the generated sequence. Through policy gradient algorithms, SeqGAN sidesteps the differentiability problem in the generator and obtains excellent results in generating realistic natural language text.
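The policy gradient idea can be illustrated with a single REINFORCE-style update for a softmax policy over a toy vocabulary. This is a minimal sketch under simplifying assumptions, not the SeqGAN implementation: the reward is passed in as a plain number standing in for the discriminator’s score, the Monte Carlo rollouts that SeqGAN uses to score partial sequences are omitted, and all function names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_gradient(logits, action, reward):
    """Gradient of reward * log pi(action) with respect to the logits.

    For a softmax policy, d log pi(a) / d logit_i = 1[i == a] - pi(i).
    """
    probs = softmax(logits)
    return [reward * ((1.0 if i == action else 0.0) - p)
            for i, p in enumerate(probs)]

def policy_gradient_step(logits, action, reward, lr=0.1):
    """Gradient ascent on the expected reward for one sampled action."""
    grad = reinforce_gradient(logits, action, reward)
    return [l + lr * g for l, g in zip(logits, grad)]
```

The update needs only the sampled word and a scalar reward, so no gradient has to pass through the discrete sampling step.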

To realize personalized text generation, Mo et al. (Mo et al., 2018) present a transfer RL framework based on a Partially Observable Markov Decision Process (POMDP). This framework first extracts common neural language knowledge from the original datasets and then transfers the learned knowledge to the target model using transfer learning. For emotional text generation, Wang and Wan (Wang and Wan, 2018) propose the SentiGAN model to generate natural, diversified sentimental text under specific emotional labels (e.g., positive or negative). SentiGAN includes several generators but only one multi-class discriminator. Each generator produces sentences under a specific sentiment label, and the discriminator pushes each generator to produce sentences that precisely match its sentiment label. Meanwhile, a penalty-based objective function is applied to prompt the generators to generate diversified sentences.

5.3. Variational Autoencoder (VAE)

Although the traditional Seq2seq model has made great progress in text generation, its training objective means that it tends to produce generic, safe sentences with high probability. At the same time, the hidden layer of the encoder tends to remember short-term dependencies rather than global information. To solve these problems, the idea of the Variational Autoencoder (VAE) has been introduced into text generation models. VAE is a regularized variant of the autoencoder; it is a generative model that imposes a prior distribution on the hidden states. Bowman et al. (Bowman et al., 2015) introduce an RNN-based VAE text generation model that represents whole sentences with distributed latent vectors. Imposing a Gaussian prior as a regularizer on the encoder hidden state yields a sequence autoencoder, and sentences are then generated word by word conditioned on the latent vector, which produces coherent and diverse sentences.
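The Gaussian prior regularization corresponds to a KL-divergence term added to the reconstruction loss. The sketch below computes the closed-form KL divergence between a diagonal Gaussian posterior and a standard normal prior, and combines it with a reconstruction term. The function names and the `kl_weight` parameter are assumptions for this example; gradually increasing such a weight during training is a commonly used trick, shown here only as an option.

```python
import math

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))

def negative_elbo(reconstruction_nll, mu, log_var, kl_weight=1.0):
    """Training loss: reconstruction term plus weighted KL regularizer."""
    return reconstruction_nll + kl_weight * gaussian_kl(mu, log_var)
```

When the posterior equals the prior (zero mean, unit variance) the KL term vanishes, and it grows as the encoder’s latent code drifts away from the prior.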

Sequential data usually has hierarchical structure and complicated dependencies between sub-sequences. For example, the sentence sequences and word sequences in a multi-turn conversation have many dependencies. Serban et al. (Serban et al., 2017) attach latent variables to the hierarchical dialogue model, giving the generative model multiple levels of variability so that it generates meaningful and diverse responses. They attach a high-dimensional latent variable to each sentence in the dialogue history and then generate responses conditioned on the latent variable.

The VAE model can also be applied to topic-aware text generation. Wang et al. (Wang et al., 2019a) combine a topic model with a VAE-based sequence generation model to generate topic-aware text. The topic model captures long-range semantic information across the whole document and then parametrizes the prior distribution of the VAE model as a Gaussian mixture model (GMM).

5.4. Transformer

Although RNN is suitable for NLP tasks due to its natural sequential structure, it has some obvious shortcomings. First, RNN processes the input sequence in strict linear order from front to back, which yields a long back-propagation path and results in vanishing or exploding gradients. Second, RNN lacks efficient parallel computing capability because of its linear propagation structure, in which the calculation at a later time step relies on the outputs of the previous time step. Therefore, RNN suffers from low computational efficiency in large-scale application scenarios. To address these problems, Google proposed a new sequence modeling architecture, the Transformer (Vaswani et al., 2017), which abandons the sequential structure of RNN and is built entirely from Attention modules.

More precisely, the Transformer model has an encoder-decoder structure and consists only of Attention modules and feedforward neural networks. The self-attention mechanism is the core of the Transformer model; it captures the dependencies between every pair of words in a sequence to obtain a better semantic representation of each word. The multi-head attention mechanism, which runs several self-attention modules in parallel, is proposed to further improve the ability to capture contextual semantic information. Thanks to the parallelizable Attention module, the Transformer model has powerful parallel computing capacity and broad application prospects.
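The core computation of self-attention can be sketched in a few lines of pure Python. This minimal example implements scaled dot-product attention for small lists of vectors. It omits the learned query, key, and value projections, the multiple heads, and masking, and the function names are chosen for this illustration only.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """For each query, return a weighted average of the value vectors.

    Weights are softmax(q . k / sqrt(d_k)) over all keys, so every
    position can attend to every other position in one step.
    """
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(dim)])
    return outputs
```

In self-attention the queries, keys, and values all come from the same sequence. Each query is processed independently of the others, which is what makes the computation easy to parallelize.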

In the past two years, the pre-training and fine-tuning paradigm has been studied extensively in NLP. The BERT model (Devlin et al., 2018) and the GPT model (Radford et al., 2018), both based on the Transformer, have received the most attention. The BERT model learns bidirectional representations from a large amount of text by conditioning on both the preceding and following contexts of a sentence sequence. By adding only a task-specific output layer rather than modifying the model’s structure, the pre-trained BERT model can be fine-tuned to achieve state-of-the-art performance on many tasks, including question answering, text classification, and so on. The success of these pre-trained models demonstrates the effectiveness of the Transformer for sequence modeling tasks. Meanwhile, the encoder-decoder structure of the Transformer can be applied directly to text generation tasks, which is expected to drive further progress.

6. Open Issues and Future Trends

Although many advanced technologies have been applied to text generation and some remarkable achievements have been made, many serious issues remain to be solved. In this section, we put forward some key issues and point out some future development trends of text generation.

6.1. Different Types of Contexts

Context information is very important for generating fluent and coherent text. In multi-turn conversations, context usually refers to the dialogue history, while in review generation, context refers to information about the item being reviewed, such as user ratings. Existing studies highlight the importance of context in text generation systems and propose numerous context-aware text generation methods. Yan et al. (Yan et al., 2016) directly concatenate the context dialogue sentences with the input, while others leverage hierarchical models that first capture contextual information in each sentence and then integrate it to capture contextual information across the whole dialogue (Serban et al., 2017). These works achieve relatively good results. However, context covers much more than what existing studies consider. For instance, when we talk with others, we may be influenced by the environment: sunny or rainy weather may affect our mood and change what we say. When we write something, external events may prompt us to express different content. In future work, novel models should be investigated to incorporate diverse contexts into text generation.

6.2. Multi-modal Data Translation and Domain Adaptation

In addition to text, there are various other types of data, such as voice, image, and video. Humans can efficiently extract useful information from various types of data and convert it into a corresponding text representation, such as describing the content of a painting or summarizing the content of a movie. Researchers have carried out a lot of work on text generation with multi-modal data as input, such as generating a description or caption for a given image (Lu et al., 2017), conducting QA with images (Song et al., 2018), and conversing about the content of a given image (Das et al., 2017). These studies usually use a CNN model to extract relevant information from images and then generate the corresponding text using common text generation models. Incorporating different types of data and developing a unified model for multi-source processing are two huge challenges. Much more effort is needed to generate informative text from multi-modal data sources.

At the same time, in many c-TextGen tasks, such as personalized text generation and emotional text generation, the available training data is very scarce. Most text data is general and does not contain personalized or emotional characteristics, so it cannot meet the requirements under specific conditions. Transfer learning is a promising way to address this problem. By learning general knowledge of natural language from massive common text data and then transferring it to a specific domain through training on small-scale conditional text data, the model can both master the general knowledge of the source domain and learn the specific needs of the target domain, which makes up for the scarcity of data. Yang et al. (Yang et al., 2017) use the idea of domain adaptation in transfer learning to address the lack of personalized dialogue data. Through fine-tuning the general dialogue model on a small amount of personalized dialogue data, the model generates personalized dialogue responses. Transfer learning is a rapidly developing technology in deep learning, and integrating diverse transfer learning models with scarce usable data for text generation is also a promising research direction.

6.3. Long Text Generation

Long text has a wide range of applications, including writing compositions, translating articles, and writing reports. However, current technology has bottlenecks in processing long text because of the long-distance dependencies in natural language. Humans can extract the key information (contexts, topics) from long text, which, however, is difficult for machines. Researchers have made considerable efforts to improve models’ ability to generate long text. For example, the LSTM and GRU models were proposed to address the issue that the original RNN model cannot capture long-distance dependencies. Guo et al. (Guo et al., 2018) propose the LeakGAN model to generate long texts. The generator in LeakGAN has a hierarchical RL structure consisting of the Manager module and the Worker module. The Manager module receives a feature representation from the discriminator at each time step and then passes a guidance signal to the Worker module. The Worker module encodes the input and combines the encoder output with the signal received from the Manager module to jointly compute the next action, namely generating the next word. For text generation technology to truly behave like humans, it needs the ability to freely generate both long and short texts, which still requires much investigation.

6.4. Evaluation Metrics

Natural language is quite complex, so it is hard to evaluate all text generation tasks with uniform evaluation metrics. In many tasks, the most accurate and reasonable evaluation method is still manual evaluation. However, it is costly and somewhat subjective. Automatic evaluation metrics, such as BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005), perform well in the evaluation of machine translation tasks. They score the generated results according to the word overlap between the generated text and the reference text. In other types of tasks, such as dialogue systems, the performance of the model cannot be measured well by these two metrics. Therefore, designing more standard evaluation metrics for various types of text generation tasks would give great impetus to the development of text generation.
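To make the notion of word overlap concrete, the sketch below computes a clipped (modified) n-gram precision against a single reference, which is the building block of BLEU. It is a simplified illustration rather than a full metric: BLEU additionally combines several n-gram orders and applies a brevity penalty, both omitted here, and the function names are chosen for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each n-gram count is clipped to its count in the reference, so
    repeating a matching word many times is not rewarded.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / total
```

For the candidate "the the the the the the the" and the reference "the cat is on the mat", the unigram precision is 2/7 because "the" occurs only twice in the reference. A perfectly valid dialogue response that shares few words with the reference would also score low, which illustrates why overlap metrics fit open-ended tasks poorly.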

6.5. Lifelong Learning

Lifelong learning is an important ability of human beings. We continuously learn new knowledge, and expand and update our knowledge base through various data sources in the physical world to adapt to the fast-changing pace of society. To make text generation models more anthropomorphic and better meet human needs, they should have the ability of continuous lifelong learning. Incorporating an external knowledge base is an important way to realize lifelong learning. There have been many text generation studies that incorporate knowledge bases, focusing on dialogue systems (Zhou et al., 2018c) (Dinan et al., 2018). However, most of them are based on fixed knowledge bases in which the knowledge is not updated in real time, so the model still lacks the ability of continuous learning. Therefore, the dynamic evolution of the knowledge base is very important. A meaningful exploration in this direction is presented by Mazumder et al. (Mazumder et al., 2018). They propose the lifelong interactive learning and inference (LiLi) model, which actively asks users questions when it encounters unknown concepts and updates its knowledge base once it obtains the corresponding answers. New knowledge is vast and complex. How to find and learn the most useful information from the numerous external inputs and achieve efficient lifelong learning is a very important research direction in text generation.

7. Conclusion

We have made a systematic review of the research trends of conditional text generation (c-TextGen). In this paper, we first briefly summarize the development history of text generation technology and then give formal definitions of the different types of c-TextGen. We further investigate the research status of various c-TextGen methods. Finally, we discuss different text generation technologies and analyze their advantages and disadvantages. Although there has been great research progress in c-TextGen, the field is still at an early stage, and numerous open research issues and promising research directions remain to be studied, such as long text generation, multimodal data translation, and lifelong learning.

This work was supported by the National Key R&D Program of China (2017YFB1001800) and the National Natural Science Foundation of China (No. 61772428, 61725205).