Recent advances in deep learning have inspired many applications of neural dialogue systems Wen et al. (2017); Bordes et al. (2017). A typical dialogue system pipeline can be divided into several components. A speech recognizer transcribes the user’s speech input into text, and a natural language understanding (NLU) module classifies the domain along with domain-specific intents and fills in a set of slots to form a semantic frame Hakkani-Tür et al. (2016). A dialogue state tracking (DST) module then predicts the current dialogue state from the multi-turn conversation, and the dialogue policy determines the system action for the next step given the current dialogue state Su et al. (2018a). Finally, the semantic frame of the system action is fed into a natural language generation (NLG) module to construct a response utterance to the user.
NLG is a key component in a dialogue system, where the goal is to generate natural language sentences conditioned on the given semantics from the dialogue manager. Because NLG is the endpoint of the interaction with users, the quality of the generated sentences is crucial to the user experience. The most widely adopted method is the rule-based (or template-based) method Mirkovic and Cavedon (2011), which can ensure the quality and fluency of the generated language. In spite of the robustness and adequacy of rule-based methods, frequent repetition of identical, tedious output makes talking to a template-based machine unsatisfactory. Furthermore, scalability is an issue, because designing sophisticated rules for a specific domain is time-consuming. Data-driven approaches have therefore been investigated for open-domain NLG tasks. Recurrent neural network-based language models (RNNLMs) have demonstrated the capability of modeling long-term dependencies in sequence prediction by leveraging recurrent structures Mikolov et al. (2010). Previous work proposed an RNNLM-based NLG model that can be trained on any corpus of dialogue act-utterance pairs without hand-crafted features or semantic alignment Wen et al. (2015). Subsequent work based on sequence-to-sequence (seq2seq) models obtained further improvements by employing an encoder-decoder structure with linguistic knowledge such as syntax trees Su et al. (2018b); Su and Chen (2018).
Recently, several virtual intelligent assistants have appeared on the market, such as Apple Siri, Google Assistant, Microsoft Cortana, and Amazon Alexa. However, most of these systems focus on only a single modality, such as textual or vocal interaction. Although the existing systems can help users perform basic information inquiries and simple daily activities, real human-human conversation involves multiple modalities of information. Communication between humans is complex and often requires a mixture of expression modes, for example both gestures and voice, to exchange information easily and precisely. Multimodal dialogues reflect more human-like behavior; however, this research topic is barely explored due to difficulties in data collection, cross-modality reasoning, and various other aspects. Saha et al. (2018) first created a dataset of multimodal conversations in the fashion domain, which explores the direction of cross-modality reasoning within dialogue contexts, and Liao et al. (2018) further presented a knowledge-aware multimodal dialogue model to address the limitations of text-based dialogue systems.
Recent trends towards the intersection of computer vision and natural language processing have opened various research directions. For instance, the goals of visual question answering (VQA) Antol et al. (2015) and video question answering Yang et al. (2003) are to answer questions based on an image or a video, while image captioning Kulkarni et al. (2013) and video captioning Rohrbach et al. (2013) try to generate descriptions of visual content. Beyond independent question-answer pairs, the visual dialog task Das et al. (2017) aggregates multiple question-answer pairs to form a dialog, which requires exploiting visual information and reasoning over dialog contexts. In this paper, we explore a new research direction, which aims to bridge dialogue generation and facial expression synthesis. The goal is to generate dialogue responses and simultaneously synthesize corresponding faces, which is also a step toward more human-like virtual assistants. This direction is inspired by talking head tasks Suwajanakorn et al. (2017), a line of research on synthesizing talking faces from speech, in which the face is animated to mimic the continuous time-varying context (i.e. talking) and the affective states carried in the speech.
The proposed task is to generate natural language utterances and then construct realistic faces based on the generated sequences. In other words, the synthesized facial expression should be relevant to a certain semantic concept in the generated sentences. In this paper, we model the shared semantics of the two modalities by emotion. Since emotions are expressed through a combination of verbal and non-verbal channels, such as gestures, facial expressions, speech, and spoken content, emotion can be viewed as the intersection of various modes. In spoken language, emotion detection has been a widely explored field, with two types of analysis available: sentiment analysis and emotion analysis. In sentiment analysis Socher et al. (2013), the goal is to detect sentiment from the given user input text, which is generally a bipolar or tri-polar (positive, negative, and neutral) feeling, while emotion analysis Chen et al. (2018) detects types of generic feelings such as happiness, sadness, anger, disgust, fear, and surprise from the given input.
For humans, one of the most straightforward ways to express emotions is through facial expressions. In the computer vision research community, face synthesis has also been one of the most popular research fields. Most recent methods are based on deep generative models, especially generative adversarial networks (GANs) Goodfellow et al. (2014). GANs are a powerful class of generative models. A typical GAN optimization pipeline consists of simultaneously training a generator network to produce realistic fake samples and a discriminator network to distinguish between real and fake data. Recent work Arjovsky et al. (2017) has shown improved stability by incorporating the Earth Mover Distance metric (Wasserstein distance), which is also used in this paper to train our model. GANs have been shown to produce very realistic image samples with rich details and have been successfully used for image translation Zhu et al. (2017), face generation Karras et al. (2017), and super-resolution imaging Ledig et al. (2017). Recent work Karras et al. (2018) proposed an alternative generator architecture for GANs inspired by style transfer techniques, enabling intuitive and scale-specific control of the synthesis. In this paper, we introduce a framework based on Pumarola et al. (2018), a GAN conditioning scheme based on Action Unit (AU) annotations that allows controlling the magnitude of activation of each AU and combining several of them.
In this work, we set out to bridge dialogue generation and facial expression synthesis. By combining the two functions, the dialogue agent can automatically generate utterances along with corresponding facial expressions. The proposed concept is a new research direction in multimodal dialogues and also a step towards more human-like virtual assistants. Furthermore, a dataset preparation strategy and a training framework are introduced.
2 Proposed Framework
In this section, we first describe the proposed task, and then introduce the data preparation strategy and a training pipeline.
2.1 Task Description
Consider a dialogue consisting of sentence-level features, sampled from a set of dialogues. The core goal of dialogue generation is to generate proper next responses based on the preceding dialogue context. A typical strategy for this optimization problem is maximum likelihood estimation (MLE) of the conditional distribution parameterized by the learnable parameters.
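As a sketch in assumed notation (the symbols are illustrative and not taken from the text), let a dialogue $D = (x_1, \dots, x_T)$ be a sequence of sentence-level features drawn from a set of dialogues $\mathcal{D}$, let $x_{<t}$ denote the context preceding turn $t$, and let $\theta$ denote the learnable parameters. The MLE objective can then be written as:

```latex
\theta^{*} = \arg\max_{\theta} \sum_{D \in \mathcal{D}} \sum_{t=1}^{T} \log p_{\theta}\left(x_{t} \mid x_{<t}\right)
```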
The proposed task is to generate natural language utterances, predict the emotion information of the spoken content, and then construct realistic faces based on the emotion distribution. In this multimodal scenario, the sentence-level features contain not only utterances but also emotion information and the corresponding facial expressions. In other words, the goal of the generative model is to estimate the joint probability of spoken content, emotion distribution, and facial expression, and the MLE objective is reformulated accordingly.
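One way to write the reformulated objective, again in assumed notation, is to let each sentence-level feature be a triple of utterance $u_t$, emotion distribution $e_t$, and facial expression $f_t$, with $\mathcal{D}$ the set of dialogues, $x_{<t}$ the preceding context, and $\theta$ the learnable parameters. The symbols are illustrative and not taken from the text.

```latex
x_{t} = \left(u_{t}, e_{t}, f_{t}\right), \qquad
\theta^{*} = \arg\max_{\theta} \sum_{D \in \mathcal{D}} \sum_{t=1}^{T} \log p_{\theta}\left(u_{t}, e_{t}, f_{t} \mid x_{<t}\right)
```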
2.2 Data Preparation
In this work, we focus on dialogue generation. However, most textual emotion datasets provide emotion labels only for individual words, sentences, or documents, which makes it challenging to study the contextual flow of emotions. The IEMOCAP database Busso et al. (2008) provides emotion labels for each utterance, but it carries the risk of overacting because it was created by actors performing emotions Chen et al. (2018). Moreover, in the labeling process of IEMOCAP, the annotators only watched the videos instead of reading the transcripts, which means they may have made their decisions based only on facial expressions or prosodic features without considering the actual meaning of the words. Considering such potential bias, we decided to combine separate vision and language datasets according to our needs.
For the language part, we choose EmotionLines Chen et al. (2018), which contains a total of 29245 labeled utterances from 2000 dialogues. Each utterance is labeled with one of seven emotions: Ekman’s six basic emotions plus neutral. Each utterance was labeled by 5 workers, and the emotion category with the most votes was set as the label of the utterance. Utterances that received votes for more than two different emotions were put into the non-neutral category. The dataset therefore has a total of 8 types of emotion labels: anger, disgust, fear, happiness, sadness, surprise, neutral, and non-neutral.
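The voting rule described above can be sketched as follows. This is a minimal illustration rather than the dataset's actual tooling; the function name `aggregate_label` and the list-of-strings vote format are assumptions.

```python
from collections import Counter

NON_NEUTRAL = "non-neutral"


def aggregate_label(votes):
    """Return one emotion label for an utterance from its workers' votes."""
    counts = Counter(votes)
    # Votes spread over more than two different emotions: no clear consensus.
    if len(counts) > 2:
        return NON_NEUTRAL
    # Otherwise the emotion with the most votes becomes the label.
    label, _ = counts.most_common(1)[0]
    return label
```

With five votes spread over at most two emotions, a tie cannot occur, so the majority label is always well defined.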
On the other hand, Radboud Faces Database (RaFD) Langner et al. (2010) has 8 binary labels for facial expressions, namely sad, neutral, angry, contemptuous, disgusted, surprised, fearful and happy. In total, the set contains 67 models: 20 Caucasian male adults, 19 Caucasian female adults, 4 Caucasian male children, 6 Caucasian female children, and 18 Moroccan male adults. All models in the dataset show the above eight facial expressions with three gaze directions, photographed simultaneously from five different camera angles. The photos were taken in a highly controlled environment. All displayed facial expressions were based on prototypes from Facial Action Coding System (FACS) Ekman and Friesen (1976). FACS was developed for describing facial expressions in terms of the so-called Action Units (AUs), which are anatomically related to the contractions of specific facial muscles. We select frontal images, crop the head regions, and use OpenFace 2.1.0 Baltrusaitis et al. (2018) to recognize Action Units from the images.
Note that either the facial expression dataset or the dialogue with emotion label dataset could be chosen or collected at will.
In this section, we design a pipeline composed of two components: (1) a multi-task NLG model based on the gated recurrent unit (GRU) Cho et al. (2014) and (2) a GAN-based model for facial expression generation Pumarola et al. (2018).
2.3.1 Multi-task NLG Model
The framework of the proposed NLG model is illustrated in Figure 1, where the model architecture is based on an encoder-decoder (seq2seq) structure Sutskever et al. (2014). In the encoder-decoder architecture, a typical generation process includes encoding and decoding phases. First, a given word sequence is fed into an RNN-based encoder to capture the temporal dependency and project the input to a latent feature space, where the recurrent unit of the encoder is a bidirectional gated recurrent unit (GRU) Cho et al. (2014).
Second, the encoded semantic vector is fed into an RNN-based decoder as the initial state to decode word sequences. At each decoding step, the decoder produces a hidden state, an output logit, and a predicted word.
To predict the emotion signal, we pass the encoded vector through a fully-connected layer, which gives an output logit from which the predicted emotion tag is obtained.
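The encoding, decoding, and emotion-prediction steps can be summarised as below. This is a sketch in assumed notation: $\mathbf{x}$ is the input word sequence, $h_{\mathrm{enc}}$ the encoded vector, $h_t$, $o_t$, and $y_t$ the $t$-th hidden state, output logit, and predicted word, $o^{e}$ and $\hat{e}$ the emotion logit and predicted emotion tag, and $W$, $b$ the parameters of the linear layers. The symbols and the use of argmax decoding are illustrative and not taken from the text.

```latex
\begin{aligned}
h_{\mathrm{enc}} &= \mathrm{BiGRU}(\mathbf{x}), \\
h_{0} &= h_{\mathrm{enc}}, \qquad h_{t} = \mathrm{GRU}\left(h_{t-1}, y_{t-1}\right), \\
o_{t} &= W_{o} h_{t} + b_{o}, \qquad y_{t} = \arg\max \, \mathrm{softmax}(o_{t}), \\
o^{e} &= W_{e} h_{\mathrm{enc}} + b_{e}, \qquad \hat{e} = \arg\max \, \mathrm{softmax}(o^{e}).
\end{aligned}
```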
To facilitate training and improve performance, scheduled sampling is also utilized. An RNN decoder typically uses its own output from the prior time step as the input to the next step. Teacher forcing Williams and Zipser (1989) is a training strategy that instead feeds the expected (ground-truth) output at the current time step as the input at the next time step, rather than the output generated by the network. Teacher forcing can also be triggered only with a certain probability, which is known as the scheduled sampling approach Bengio et al. (2015). Scheduled sampling is adopted in our experiments.
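The choice between teacher forcing and the model's own output can be sketched as a per-step coin flip. This is a simplified illustration: it assumes both the ground-truth words and the model's previous predictions are already available as lists, whereas a real decoder produces predictions one step at a time. The function name and the `teacher_forcing_prob` parameter are hypothetical.

```python
import random


def next_decoder_inputs(targets, predictions, teacher_forcing_prob, rng=random):
    """Choose the decoder input that follows each time step.

    targets[t] is the ground-truth word at step t and predictions[t] is the
    model's own output at step t. With probability teacher_forcing_prob the
    ground truth is fed to the next step (teacher forcing); otherwise the
    model output is fed.
    """
    inputs = []
    for truth, predicted in zip(targets, predictions):
        use_truth = rng.random() < teacher_forcing_prob
        inputs.append(truth if use_truth else predicted)
    return inputs
```

Setting the probability to 1 recovers pure teacher forcing, and setting it to 0 recovers free-running decoding; scheduled sampling typically changes this probability over the course of training.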
Both sequence prediction and emotion prediction are classification problems. We therefore use the cross-entropy loss as the training objective, which optimizes the conditional probability so that the difference between the predicted distribution and the target distribution is minimized. The loss is computed over all samples and all classes, with the ground-truth labels as targets.
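For one-hot targets, the cross-entropy $-\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C} y_{i,c}\log p_{i,c}$ reduces to the mean negative log-probability of the true class. The sketch below computes it from raw logits in plain Python; the function names and the averaging over samples are assumptions, and how the word-level and emotion-level losses are weighted against each other is not specified here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(batch_logits, batch_labels):
    """Mean negative log-likelihood of the true class indices."""
    losses = []
    for logits, label in zip(batch_logits, batch_labels):
        probs = softmax(logits)
        losses.append(-math.log(probs[label]))
    return sum(losses) / len(losses)
```

The same function applies to both tasks: the classes are vocabulary entries for word prediction and emotion tags for emotion prediction.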
2.3.2 Action Units Mapping
2.3.3 Facial Expression Generator
We utilize the GAN-based model of Pumarola et al. (2018) as our face generator. As illustrated in Figure 1, the face generator is composed of two main modules: (1) a pair of generators trained to change the facial expression in an image according to the given desired attributes, and (2) a pair of WGAN-GP-based discriminators that examine the photo-realism and the desired expression fulfillment of the generated images.
Since our goal is to manipulate the facial expression in the image, the generator should focus only on those regions of the image that are relevant to facial expressions and keep the remaining elements of the image, such as hair, glasses, hats, or background, untouched. For this purpose, instead of directly regressing a full image as in typical GANs, the two generators predict two masks, a color mask and an attention mask, respectively, and the output image is composed from these masks and the original image.
Here the subscripts d and o denote the desired and original attributes of facial expressions, respectively. Note that the desired attributes are mapped from the predicted emotion tag as mentioned in the previous section. Both the color mask and the attention mask are predicted from the original input image and the given desired attributes. The generated attention mask is a single-channel mask indicating the region preserved from the original image, while the color mask is a three-channel RGB mask that determines the actual facial movements. With this composition, the model can focus on the pixels defining the facial movement and preserve the static region from the original image, which leads to sharper and more realistic synthesis.
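The composition of the output image can be sketched as a per-pixel blend. Following the convention of Pumarola et al. (2018), on which this generator is based, an attention value of 1 keeps the original pixel and a value of 0 takes the color-mask pixel. The function name and the nested-list image layout are illustrative assumptions.

```python
def compose_image(original, color_mask, attention_mask):
    """Blend per pixel: A * original + (1 - A) * color.

    original, color_mask: H x W x 3 nested lists of floats (RGB).
    attention_mask: H x W nested list of floats in [0, 1].
    """
    output = []
    for row_o, row_c, row_a in zip(original, color_mask, attention_mask):
        out_row = []
        for pix_o, pix_c, a in zip(row_o, row_c, row_a):
            # The single-channel attention value is shared by all three channels.
            out_row.append([a * o + (1.0 - a) * c for o, c in zip(pix_o, pix_c)])
        output.append(out_row)
    return output
```

Because the attention value is shared across the RGB channels, a pixel is either preserved, replaced, or smoothly mixed as a whole.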
The original GAN training objective corresponds to minimizing the Jensen-Shannon (JS) divergence, with a discriminator trained to maximize the probability of correctly distinguishing between real and generated data. Since this divergence is potentially not continuous, which results in vanishing gradients, we use WGAN-GP Gulrajani et al. (2017) as our adversarial framework. It replaces the JS divergence with the Earth Mover Distance (Wasserstein distance) along with a gradient penalty term. The gradient penalty term is an alternative way to enforce the Lipschitz condition, which directly constrains the gradient norm of the critic’s output with respect to its input.
The adversarial objective involves a penalty coefficient, the original data distribution, and the distribution obtained by sampling uniformly along straight lines between pairs of points sampled from the original data distribution and the generator distribution.
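In the standard WGAN-GP form of Gulrajani et al. (2017), the adversarial objective can be sketched as below. The notation is assumed rather than taken from the text: $I_o$ is an original image drawn from the data distribution $\mathbb{P}_o$, $y_d$ the desired attributes, $G$ the generator, $D_I$ the critic that judges photo-realism, $\tilde{I}$ a random interpolate drawn from $\mathbb{P}_{\tilde{I}}$, and $\lambda_{gp}$ the penalty coefficient. The critic maximizes this quantity and the generator minimizes it.

```latex
\mathcal{L}_{I} =
\mathbb{E}_{I_{o} \sim \mathbb{P}_{o}}\left[ D_{I}(I_{o}) \right]
- \mathbb{E}_{I_{o} \sim \mathbb{P}_{o}}\left[ D_{I}\left(G(I_{o} \mid y_{d})\right) \right]
- \lambda_{gp} \, \mathbb{E}_{\tilde{I} \sim \mathbb{P}_{\tilde{I}}}\left[ \left( \left\lVert \nabla_{\tilde{I}} D_{I}(\tilde{I}) \right\rVert_{2} - 1 \right)^{2} \right]
```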
The attention mask and the color mask are both learned by direct end-to-end training driven by the signals provided by the discriminator. Because the discriminator assesses photo-realism, the attention mask tends to saturate to all 1, leading to a complete copy of the original input image. To circumvent this potential issue and improve the smoothness of the transformation, regularization is performed over the attention mask, with sums running over the row and column indexes of the attention mask matrix. The first term enforces smoothness, while the second term is the standard L2 norm penalty.
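The two regularization terms can be sketched in plain Python as a total-variation term over neighbouring mask entries plus an L2 norm of the whole mask, in the spirit of Pumarola et al. (2018). The function name, the `tv_weight` parameter, and its default value are hypothetical.

```python
import math


def attention_regularizer(mask, tv_weight=1.0):
    """Total-variation smoothness term plus an L2 norm penalty on an H x W mask."""
    height, width = len(mask), len(mask[0])
    tv = 0.0
    for i in range(height):
        for j in range(width):
            # Squared difference to the neighbour below, if it exists.
            if i + 1 < height:
                tv += (mask[i + 1][j] - mask[i][j]) ** 2
            # Squared difference to the neighbour on the right, if it exists.
            if j + 1 < width:
                tv += (mask[i][j + 1] - mask[i][j]) ** 2
    l2 = math.sqrt(sum(a * a for row in mask for a in row))
    return tv_weight * tv + l2
```

A mask saturated at 1 has zero total variation but the largest possible norm, so the norm term is what discourages the generator from simply copying the input image.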
As the training scheme is based on conditional GANs, the generators should learn to synthesize realistic data and simultaneously satisfy the given attributes, which are activations of Action Units. Specifically, the discriminators should recover the desired attributes from generated examples and the original attributes from original data. A condition loss is formulated accordingly.
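A plausible form of this condition loss, following Pumarola et al. (2018), is a pair of regression terms. The notation is assumed rather than taken from the text: $I_o$ is an original image from the data distribution $\mathbb{P}_o$, $y_o$ and $y_d$ are its original and desired Action Unit activations, $G$ is the generator, and $D_y$ is the discriminator head that regresses attributes.

```latex
\mathcal{L}_{y} =
\mathbb{E}_{I_{o} \sim \mathbb{P}_{o}}\left[ \left\lVert D_{y}\left(G(I_{o} \mid y_{d})\right) - y_{d} \right\rVert_{2}^{2} \right]
+ \mathbb{E}_{I_{o} \sim \mathbb{P}_{o}}\left[ \left\lVert D_{y}(I_{o}) - y_{o} \right\rVert_{2}^{2} \right]
```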
The objectives described above encourage the generators to render realistic facial expressions according to the desired attributes. However, the generated face is not guaranteed to correspond to the same person as in the original input image. In this work, the cycle consistency loss Zhu et al. (2017) is utilized to regularize learning so that the identity in the original input is preserved.
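In assumed notation, with $I_o$ an original image from the data distribution $\mathbb{P}_o$, $y_o$ and $y_d$ its original and desired attributes, and $G$ the generator, the cycle consistency loss can be sketched as mapping the generated image back to the original attributes and comparing the result with the input:

```latex
\mathcal{L}_{cyc} =
\mathbb{E}_{I_{o} \sim \mathbb{P}_{o}}\left[ \left\lVert G\left( G(I_{o} \mid y_{d}) \mid y_{o} \right) - I_{o} \right\rVert_{1} \right]
```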
A Markovian discriminator (PatchGAN) Isola et al. (2017) is introduced to restrict attention to the structure in local image patches and thereby model high-frequency regions, while the L1 norm is utilized to capture low-frequency structure. Next, the losses above are combined with their corresponding coefficients into the full objective, where the coefficients are hyperparameters controlling the importance of each loss term. Finally, we aim to solve the resulting minimax problem.
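As a sketch in assumed notation, write $\mathcal{L}_{I}$ for the adversarial loss, $\mathcal{L}_{y}$ for the condition loss, $\mathcal{L}_{A}$ for the attention regularization, and $\mathcal{L}_{cyc}$ for the cycle consistency loss, with $\lambda_{y}$, $\lambda_{A}$, and $\lambda_{cyc}$ as the weighting hyperparameters. The names and the number of coefficients are illustrative and not taken from the text.

```latex
\begin{aligned}
\mathcal{L} &= \mathcal{L}_{I} + \lambda_{y} \mathcal{L}_{y} + \lambda_{A} \mathcal{L}_{A} + \lambda_{cyc} \mathcal{L}_{cyc}, \\
G^{*} &= \arg\min_{G} \max_{D} \mathcal{L}.
\end{aligned}
```

Strictly speaking, only the adversarial term is played as a min-max game between the generator and the critic; in practice the attribute regression on real images is minimized by the discriminator.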
3 Preliminary Results
An example generated by our proposed pipeline is shown in Figure 2, demonstrating the ability to generate sentences and facial expressions grounded in the same emotion signals.
In the experiments, we use mini-batch Adam as the optimizer with a batch size of 256 examples; 100 training epochs were performed without early stopping. For the sentence generator, the hidden size of the network layers is 200, and the word embeddings are of size 50 and trained end-to-end. The implementation details of the facial expression generator are the same as in the original work Pumarola et al. (2018).
In this paper, we introduce a new multimodal task which aims to generate natural language sentences and simultaneously synthesize corresponding facial expressions. To bridge these two problems, we further propose to model the shared semantics of the two modalities by emotion signals. Furthermore, a dataset preparation strategy and a training framework are introduced. The proposed concept is a new research direction in multimodal dialogues and also a step towards more human-like interfaces for virtual assistants.
- Antol et al. (2015) Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433.
- Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein gan. arXiv preprint arXiv:1701.07875.
- Baltrusaitis et al. (2018) Tadas Baltrusaitis, Amir Zadeh, Yao Chong Lim, and Louis-Philippe Morency. 2018. Openface 2.0: Facial behavior analysis toolkit. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 59–66. IEEE.
- Bengio et al. (2015) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179.
- Bordes et al. (2017) Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In Proceedings of ICLR.
- Busso et al. (2008) Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. 2008. Iemocap: Interactive emotional dyadic motion capture database. Language resources and evaluation, 42(4):335.
- Chen et al. (2018) Sheng-Yeh Chen, Chao-Chun Hsu, Chuan-Chun Kuo, Lun-Wei Ku, et al. 2018. Emotionlines: An emotion corpus of multi-party conversations. arXiv preprint arXiv:1802.08379.
- Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP, pages 1724–1734.
- Das et al. (2017) Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José MF Moura, Devi Parikh, and Dhruv Batra. 2017. Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 326–335.
- Ekman and Friesen (1976) Paul Ekman and Wallace V Friesen. 1976. Measuring facial movement. Environmental psychology and nonverbal behavior, 1(1):56–75.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680.
- Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. 2017. Improved training of wasserstein gans. In Advances in Neural Information Processing Systems, pages 5767–5777.
- Hakkani-Tür et al. (2016) Dilek Hakkani-Tür, Gökhan Tür, Asli Celikyilmaz, Yun-Nung Chen, Jianfeng Gao, Li Deng, and Ye-Yi Wang. 2016. Multi-domain joint semantic frame parsing using bi-directional rnn-lstm. In Proceedings of INTERSPEECH, pages 715–719.
- Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134.
- Karras et al. (2017) Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2017. Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
- Karras et al. (2018) Tero Karras, Samuli Laine, and Timo Aila. 2018. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948.
- Kulkarni et al. (2013) Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. 2013. Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2891–2903.
- Langner et al. (2010) Oliver Langner, Ron Dotsch, Gijsbert Bijlstra, Daniel HJ Wigboldus, Skyler T Hawk, and AD Van Knippenberg. 2010. Presentation and validation of the radboud faces database. Cognition and emotion, 24(8):1377–1388.
- Ledig et al. (2017) Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. 2017. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4681–4690.
- Liao et al. (2018) Lizi Liao, Yunshan Ma, Xiangnan He, Richang Hong, and Tat-seng Chua. 2018. Knowledge-aware multimodal dialogue systems. In 2018 ACM Multimedia Conference on Multimedia Conference, pages 801–809. ACM.
- Mikolov et al. (2010) Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association.
- Mirkovic and Cavedon (2011) Danilo Mirkovic and Lawrence Cavedon. 2011. Dialogue management using scripts. US Patent 8,041,570.
- Pumarola et al. (2018) Albert Pumarola, Antonio Agudo, Aleix M Martinez, Alberto Sanfeliu, and Francesc Moreno-Noguer. 2018. Ganimation: Anatomically-aware facial animation from a single image. In Proceedings of the European Conference on Computer Vision (ECCV), pages 818–833.
- Rohrbach et al. (2013) Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, and Bernt Schiele. 2013. Translating video content to natural language descriptions. In Proceedings of the IEEE International Conference on Computer Vision, pages 433–440.
- Saha et al. (2018) Amrita Saha, Mitesh M Khapra, and Karthik Sankaranarayanan. 2018. Towards building large scale multimodal domain-aware conversation systems. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642.
- Su and Chen (2018) Shang-Yu Su and Yun-Nung Chen. 2018. Investigating linguistic pattern ordering in hierarchical natural language generation. In Proceedings of 7th IEEE Workshop on Spoken Language Technology.
- Su et al. (2018a) Shang-Yu Su, Xiujun Li, Jianfeng Gao, Jingjing Liu, and Yun-Nung Chen. 2018a. Discriminative deep dyna-q: Robust planning for dialogue policy learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
- Su et al. (2018b) Shang-Yu Su, Kai-Ling Lo, Yi-Ting Yeh, and Yun-Nung Chen. 2018b. Natural language generation by hierarchical decoding with linguistic patterns. In Proceedings of The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of NIPS, pages 3104–3112.
- Suwajanakorn et al. (2017) Supasorn Suwajanakorn, Steven M Seitz, and Ira Kemelmacher-Shlizerman. 2017. Synthesizing obama: learning lip sync from audio. ACM Transactions on Graphics (TOG), 36(4):95.
- Wen et al. (2015) Tsung-Hsien Wen, Milica Gasic, Dongho Kim, Nikola Mrksic, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Stochastic language generation in dialogue using recurrent neural networks with convolutional sentence reranking. In Proceedings of SIGDIAL, pages 275–284.
- Wen et al. (2017) Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of EACL, pages 438–449.
- Williams and Zipser (1989) Ronald J Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280.
- Yang et al. (2003) Hui Yang, Lekha Chaisorn, Yunlong Zhao, Shi-Yong Neo, and Tat-Seng Chua. 2003. Videoqa: question answering on news video. In Proceedings of the eleventh ACM international conference on Multimedia, pages 632–641. ACM.
- Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2223–2232.