Spoken dialogue systems that can help users solve complex tasks, such as booking a movie ticket, have become an emerging research topic in artificial intelligence and natural language processing. With a well-designed dialogue system as an intelligent personal assistant, people can accomplish certain tasks more easily via natural language interactions. Recent advances in deep learning have inspired many applications of neural dialogue systems Wen et al. (2017); Bordes et al. (2017); Dhingra et al. (2017); Li et al. (2017). A typical dialogue system pipeline can be divided into several parts: 1) a speech recognizer that transcribes a user’s speech input into text, 2) a natural language understanding (NLU) module that classifies the domain and associated intents and fills the slot values to form a semantic frame Chi et al. (2017); Chen et al. (2017); Zhang et al. (2018); Su et al. (2018c, 2019), 3) a dialogue state tracker (DST) that predicts the current dialogue state in multi-turn conversations, 4) a dialogue policy that determines the system action for the next step given the current state Peng et al. (2018); Su et al. (2018a), and 5) a natural language generator (NLG) that outputs a response given the input semantic frame Wen et al. (2015); Su et al. (2018b); Su and Chen (2018).
Many artificial intelligence tasks come with a dual form; that is, we can directly swap the input and target of a task to formulate another task. Machine translation is a classic example Wu et al. (2016): translating from English to Chinese has the dual task of translating from Chinese to English. Automatic speech recognition (ASR) and text-to-speech (TTS) also have structural duality Tjandra et al. (2017). Previous work first exploited the duality of such task pairs and proposed supervised Xia et al. (2017) and unsupervised (reinforcement learning) He et al. (2016) training schemes. These studies highlighted the importance of duality by showing that exploiting it can boost the performance of both tasks.
NLU aims to extract the core semantic concepts from given utterances, while NLG aims to construct corresponding sentences from given semantics. In other words, understanding and generating sentences form a dual problem pair, as shown in Figure 1. In this paper, we introduce a new training framework for NLU and NLG based on dual supervised learning Xia et al. (2017). The experiments show that the proposed approach improves the performance of both tasks.
2 Proposed Framework
This section first describes the problem formulation, and then introduces the core training algorithm along with the proposed methods of estimating data distribution.
Given $n$ data pairs $\{(x_i, y_i)\}_{i=1}^{n}$, where $x$ denotes a semantic representation and $y$ the corresponding utterance, the goal of NLG is to generate corresponding utterances based on given semantics. In other words, the task is to learn a mapping function $f(x; \theta_{x \to y})$ that transforms semantic representations into natural language. On the other hand, NLU aims to capture the core meaning of utterances by finding a function $g(y; \theta_{y \to x})$ that predicts semantic representations given natural language. A typical strategy for these optimization problems is maximum likelihood estimation (MLE) of the conditional distributions parameterized by the learnable parameters $\theta_{x \to y}$ and $\theta_{y \to x}$.
2.1 Dual Supervised Learning
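Considering the duality between the two tasks in the dual problem, it is intuitive to bridge the bidirectional relationship from a probabilistic perspective. If the models of the two tasks are optimal, we have probabilistic duality:
$$P(x)\, P(y \mid x; \theta_{x \to y}) = P(y)\, P(x \mid y; \theta_{y \to x}) = P(x, y) \quad \forall x, y,$$
where $x$ is a semantic representation, $y$ is an utterance, $\theta_{x \to y}$ and $\theta_{y \to x}$ are the parameters of the NLG and NLU models, and $P(x)$ and $P(y)$ are marginal distributions of data. The condition reflects the parallel, bidirectional relationship between the two tasks in the dual problem. Although standard supervised learning with respect to a given loss function is a straightforward approach to MLE, it does not consider the relationship between the two tasks.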
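The duality condition can be checked numerically on a toy example. The sketch below assumes a small, hypothetical joint distribution over two semantic frames and three utterances (the numbers are illustrative and not taken from any dataset) and confirms that both factorizations recover the same joint distribution.

```python
import numpy as np

# Hypothetical toy joint distribution P(x, y): rows index x (two semantic
# frames), columns index y (three utterances). Values are illustrative only.
joint = np.array([[0.10, 0.25, 0.05],
                  [0.20, 0.10, 0.30]])

p_x = joint.sum(axis=1)              # marginal P(x)
p_y = joint.sum(axis=0)              # marginal P(y)
p_y_given_x = joint / p_x[:, None]   # conditional P(y | x)
p_x_given_y = joint / p_y[None, :]   # conditional P(x | y)

lhs = p_x[:, None] * p_y_given_x     # P(x) P(y | x)
rhs = p_y[None, :] * p_x_given_y     # P(y) P(x | y)

# Both factorizations should agree with each other and with the joint.
duality_holds = bool(np.allclose(lhs, rhs) and np.allclose(lhs, joint))
print(duality_holds)
```

With learned models, the two conditionals come from separately trained networks, so the equality generally does not hold exactly; that gap is what a duality-based regularizer penalizes.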
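Xia et al. (2017) exploited the duality of the dual problems to introduce a new learning scheme, which explicitly imposes the empirical probabilistic duality on the objective function. The training strategy is based on standard supervised learning and incorporates the probabilistic duality constraint, so-called dual supervised learning. The training objective is therefore extended to a multi-objective optimization problem:
$$\begin{cases} \min_{\theta_{x \to y}} \mathbb{E}\left[ l_1(f(x; \theta_{x \to y}), y) \right], \\ \min_{\theta_{y \to x}} \mathbb{E}\left[ l_2(g(y; \theta_{y \to x}), x) \right], \end{cases} \quad \text{s.t. } P(x)\, P(y \mid x; \theta_{x \to y}) = P(y)\, P(x \mid y; \theta_{y \to x}),$$
where $l_1$ and $l_2$ are the given loss functions. Such a constrained optimization problem can be solved by introducing Lagrange multipliers to incorporate the constraint:
$$\begin{cases} \min_{\theta_{x \to y}} \left( \mathbb{E}\left[ l_1(f(x; \theta_{x \to y}), y) \right] + \lambda_{x \to y}\, l_{duality} \right), \\ \min_{\theta_{y \to x}} \left( \mathbb{E}\left[ l_2(g(y; \theta_{y \to x}), x) \right] + \lambda_{y \to x}\, l_{duality} \right), \end{cases}$$
where $\lambda_{x \to y}$ and $\lambda_{y \to x}$ are the Lagrange parameters and the constraint is formulated as follows:
$$l_{duality} = \left( \log \hat{P}(x) + \log P(y \mid x; \theta_{x \to y}) - \log \hat{P}(y) - \log P(x \mid y; \theta_{y \to x}) \right)^2.$$
The entire objective can now be viewed as standard supervised learning with an additional regularization term that accounts for the duality between the tasks. The learning scheme therefore learns the models by minimizing a weighted combination of the original loss term and the regularization term. Note that the true marginal distributions of data $P(x)$ and $P(y)$ are often intractable, so we replace them with the approximate empirical marginal distributions $\hat{P}(x)$ and $\hat{P}(y)$.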
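The regularized objective can be illustrated for a single training pair. The following is a minimal sketch, assuming the duality term is the squared gap between the two log-factorizations of the joint probability, as in Xia et al. (2017); the function names, the probability values, and the weight `lam=0.1` are hypothetical choices for illustration only.

```python
import math

def duality_loss(log_p_x, log_p_y_given_x, log_p_y, log_p_x_given_y):
    """Squared gap between the two factorizations of log P(x, y)."""
    gap = (log_p_x + log_p_y_given_x) - (log_p_y + log_p_x_given_y)
    return gap ** 2

def regularized_loss(task_loss, lam, l_duality):
    """Original supervised loss plus the weighted duality regularizer."""
    return task_loss + lam * l_duality

# Illustrative numbers only (not from the experiments).
log_p_x = math.log(0.02)          # estimated marginal of the semantic frame
log_p_y = math.log(0.001)         # estimated marginal of the utterance
log_p_y_given_x = math.log(0.04)  # likelihood under the NLG model
log_p_x_given_y = math.log(0.5)   # likelihood under the NLU model

l_dual = duality_loss(log_p_x, log_p_y_given_x, log_p_y, log_p_x_given_y)

# Each task adds the same duality term to its own negative log-likelihood.
# lam=0.1 is an arbitrary illustrative weight.
nlg_loss = regularized_loss(-log_p_y_given_x, lam=0.1, l_duality=l_dual)
nlu_loss = regularized_loss(-log_p_x_given_y, lam=0.1, l_duality=l_dual)
print(l_dual, nlg_loss, nlu_loss)
```

The term vanishes exactly when the two factorizations agree, so a larger weight pushes the two models harder toward mutual consistency.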
2.2 Distribution Estimation as Autoregression
The remaining problem is how to estimate the empirical marginal distributions $\hat{P}(\cdot)$, which is non-trivial because different data types have different structural natures. For example, natural language has sequential structure and temporal dependencies, while other structured data may not. Therefore, we design an individual distribution estimation method for each data type.
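From the probabilistic perspective, we can decompose any data distribution $p(x)$ into the product of its nested conditional probabilities,
$$p(x) = \prod_{d=1}^{D} p(x_d \mid x_1, \ldots, x_{d-1}), \qquad (1)$$
where $x$ could be any data type with $D$ variable units and $d$ is the index of a variable unit.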
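As a minimal worked instance of the product rule, assume a data point with $D = 3$ variable units. The same joint distribution admits several equally valid factorizations depending on the ordering of the units, two of which are written out here:

```latex
p(x_1, x_2, x_3) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2)
                 = p(x_3)\, p(x_1 \mid x_3)\, p(x_2 \mid x_1, x_3)
```

For sequential data the first (left-to-right) ordering is the natural choice, whereas for unordered data no single ordering is privileged.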
2.2.1 Language Modeling
Natural language has an intrinsic sequential nature; therefore, it is intuitive to leverage the autoregressive property to learn a language model. In this work, we learn the language model based on recurrent neural networks Mikolov et al. (2010); Sundermeyer et al. (2012) with the cross-entropy objective in an unsupervised manner.
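To make the autoregressive scoring concrete, the sketch below computes the log-probability of a sentence as a sum of conditional log-probabilities. It is a simplification: a hypothetical bigram table stands in for the recurrent language model, so each word is conditioned only on the previous word rather than on the whole prefix, and all probabilities are made up for illustration.

```python
import math

# Hypothetical conditional probabilities p(word | previous word); a stand-in
# for the softmax outputs of a trained recurrent language model.
bigram = {
    ("<s>", "the"): 0.5,
    ("the", "restaurant"): 0.2,
    ("restaurant", "is"): 0.6,
    ("is", "cheap"): 0.1,
    ("cheap", "</s>"): 0.4,
}

def sentence_log_prob(words, cond_prob):
    """Sum log p(w_i | w_{i-1}) over the sentence, including the end token."""
    tokens = ["<s>"] + list(words) + ["</s>"]
    return sum(math.log(cond_prob[(prev, cur)])
               for prev, cur in zip(tokens, tokens[1:]))

# The resulting value plays the role of the estimated log-marginal of an utterance.
log_p_y = sentence_log_prob(["the", "restaurant", "is", "cheap"], bigram)
print(log_p_y)
```

Working in log space avoids numerical underflow for long sentences and yields a quantity that can be used directly as a log-marginal estimate.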
2.2.2 Masked Autoencoder
The semantic representation in our work is a discrete semantic frame containing specific slots and corresponding values. Each semantic frame contains the core concept of a certain sentence; for example, the slot-value pairs “name[Bibimbap House], food[English], priceRange[moderate], area [riverside], near[Clare Hall]” correspond to the target sentence “Bibimbap House is a moderately priced restaurant who’s main cuisine is English food. You will find this local gem near Clare Hall in the Riverside area.”. Even though the product rule (1) enables us to decompose any probability distribution into a product of a sequence of conditional probabilities, how we decompose the distribution reflects a specific physical meaning. For example, language modeling outputs the probability distribution over the vocabulary space for the $i$-th word $y_i$ given only the preceding word sequence $y_{<i}$. Natural language has intrinsic sequential structure and temporal dependency, so modeling the joint distribution of words in a sequence with this autoregressive property is logically reasonable. However, slot-value pairs in semantic frames do not have a single directional relationship between them; rather, they describe the same sentence in parallel, so treating a semantic frame as a sequence of slot-value pairs is not suitable. Furthermore, slot-value pairs are not independent, because the pairs in a semantic frame correspond to the same individual utterance. For example, French food would probably cost more. Therefore, the correlation should be taken into account when estimating the joint distribution.
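The flat, unordered nature of a semantic frame can be seen by parsing the example above into a mapping from slots to values. This is a minimal sketch; `parse_frame` is a hypothetical helper, and the bracketed `slot[value]` format is assumed from the example in the text.

```python
import re

def parse_frame(mr):
    """Parse 'slot[value], slot[value]' into a dict of slot -> value."""
    pairs = re.findall(r"([^,\[\]]+)\[([^\]]*)\]", mr)
    return {slot.strip(): value.strip() for slot, value in pairs}

frame = parse_frame(
    "name[Bibimbap House], food[English], priceRange[moderate], "
    "area [riverside], near[Clare Hall]"
)

# The same pairs listed in a different order describe the same frame.
shuffled = parse_frame(
    "near[Clare Hall], area [riverside], name[Bibimbap House], "
    "priceRange[moderate], food[English]"
)

same_frame = frame == shuffled
print(frame, same_frame)
```

Because the two orderings yield an identical frame, any model that imposes a single fixed left-to-right order on the pairs encodes a dependency structure that the data itself does not specify.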
Considering the above issues, various dependencies between slot-value semantics should be leveraged to model the joint distribution of flat semantic frames. In this work, we propose to utilize a masked autoencoder Germain et al. (2015) to estimate the marginal distribution. By zeroing certain connections, we can force the variable unit $x_d$ to depend only on a specific set of variables, not necessarily on $x_{<d}$; we can still obtain the marginal distribution by the product rule:
$$p(x) = \prod_{d=1}^{D} p(x_d \mid S_d), \qquad (2)$$
where $S_d$ is a specific set of variable units.
In practice, we elementwise-multiply each weight matrix by a binary mask matrix $M$ to interrupt some connections, as illustrated in Figure 2. To impose the autoregressive property, we first assign each hidden unit an integer ranging from 1 to $D-1$ inclusive, where $D$ is the dimension of the data; for the input and output layers, we assign each unit a distinct number from 1 to $D$. Writing $m^{l}(i)$ for the number assigned to unit $i$ in layer $l$, the binary mask matrices can be built as follows:
$$M^{l}_{i,j} = \begin{cases} 1 & \text{if } m^{l}(i) \ge m^{l-1}(j), \\ 0 & \text{otherwise}, \end{cases} \qquad M^{L}_{i,j} = \begin{cases} 1 & \text{if } m^{L}(i) > m^{L-1}(j), \\ 0 & \text{otherwise}, \end{cases}$$
where $l$ indicates the index of the hidden layer and $L$ indicates the output layer. With the constructed mask matrices, the masked autoencoder is shown to be able to estimate the joint distribution as autoregression. Because there is no explicit rule specifying the exact dependencies between slot-value pairs in our data, we consider various dependencies through an ensemble of multiple decompositions, that is, by sampling different sets $S_d$.
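The mask construction can be sketched in a few lines of NumPy. This is a simplified illustration in the spirit of Germain et al. (2015), not the exact implementation used in the experiments: the function name `build_masks`, the layer sizes, and the uniform sampling of hidden-unit numbers are assumptions. The check at the end multiplies the masks together to confirm that each output unit can only be reached from input units with a strictly smaller assigned number.

```python
import numpy as np

def build_masks(D, hidden_sizes, rng):
    """Build binary masks for a masked autoencoder with D inputs and outputs."""
    # Input (and output) units receive distinct numbers 1..D in random order.
    degrees = [rng.permutation(D) + 1]
    # Hidden units receive integers in 1..D-1 (upper bound of integers is exclusive).
    for h in hidden_sizes:
        degrees.append(rng.integers(1, D, size=h))
    masks = []
    # Hidden layers: connect unit i to unit j of the previous layer if m(i) >= m(j).
    for prev, cur in zip(degrees[:-1], degrees[1:]):
        masks.append((cur[:, None] >= prev[None, :]).astype(np.float64))
    # Output layer: strict inequality, reusing the input numbering.
    masks.append((degrees[0][:, None] > degrees[-1][None, :]).astype(np.float64))
    return degrees[0], masks

rng = np.random.default_rng(0)
D = 5
order, masks = build_masks(D, hidden_sizes=[16, 16], rng=rng)

# connectivity[i, j] > 0 means there is a path from input j to output i.
connectivity = masks[-1]
for m in reversed(masks[:-1]):
    connectivity = connectivity @ m

allowed = order[:, None] > order[None, :]
is_autoregressive = not np.any((connectivity > 0) & ~allowed)
print(is_autoregressive)
```

Calling `build_masks` with different random generators yields different orderings and connectivity patterns, which is one way to obtain the multiple decompositions used in an ensemble.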
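Table 1: NLU performance (F1) and NLG performance (BLEU, ROUGE-1, ROUGE-2, ROUGE-L) under different learning schemes.
| | Learning scheme | NLU F1 | NLG BLEU | NLG ROUGE-1 | NLG ROUGE-2 | NLG ROUGE-L |
|---|---|---|---|---|---|---|
| (a) | Baseline: Iterative training | 71.14 | 55.05 | 55.37 | 27.95 | 39.90 |
| (b) | Dual supervised learning, | 72.32 | 57.16 | 56.37 | 29.19 | 40.44 |
| (c) | Dual supervised learning, | 72.08 | 55.07 | 55.56 | 28.42 | 40.04 |
| (d) | Dual supervised learning, | 71.71 | 56.17 | 55.90 | 28.44 | 40.08 |
| (e) | Dual supervised learning w/o MADE | 70.97 | 55.96 | 55.99 | 28.74 | 39.98 |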
3 Experiments
To evaluate the effectiveness of the proposed model, we conduct experiments and analyze the results.
The E2E NLG challenge dataset Novikova et al. (2017), a crowd-sourced dataset of 50k instances in the restaurant domain, is used in our experiments. Our models are trained on the official training set and evaluated on the official testing set. Each instance is a pair consisting of a semantic frame, containing specific slots and corresponding values, and an associated natural language utterance expressing the given semantics. The data preprocessing includes trimming punctuation marks, lemmatization, and lowercasing all words.
Although the original dataset is for NLG, whose goal is to generate sentences based on the given slot-value pairs, we further formulate the NLU task as predicting slot-value pairs based on the utterances, which is a multi-label classification problem. Each possible slot-value pair is treated as an individual label, and the total number of labels is 79. For NLG, to evaluate the quality of the generated sequences regarding both precision and recall, the evaluation metrics include BLEU and ROUGE (1, 2, L) scores with multiple references, while the F1 score is measured for NLU results.
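Treating each slot-value pair as a label makes the F1 computation a comparison between predicted and gold label sets. The sketch below assumes micro-averaging over all examples, which the text does not specify; the function name `micro_f1` and the toy predictions are hypothetical.

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over per-example sets of gold and predicted labels."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # correctly predicted labels
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    if tp == 0:
        return 0.0
    precision = tp / n_pred
    recall = tp / n_gold
    return 2 * precision * recall / (precision + recall)

# Two toy utterances with their gold and predicted slot-value labels.
gold = [{"food[English]", "area[riverside]"}, {"priceRange[moderate]"}]
pred = [{"food[English]"}, {"priceRange[moderate]", "area[riverside]"}]

score = micro_f1(gold, pred)
print(score)
```

Here two of the three predicted labels are correct and two of the three gold labels are recovered, so precision and recall are both 2/3.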
The model architectures for NLG and NLU are gated recurrent units (GRU) Cho et al. (2014) with two identical fully-connected layers at the two ends of the GRU. Thus the model is symmetrical and may have semantic frame representation as initial and final hidden states and sentences as the sequential input.
In all experiments, we use mini-batch Adam as the optimizer with a batch size of 64. Training runs for 10 epochs without early stopping, the hidden size of the network layers is 200, and word embeddings are of size 50 and trained in an end-to-end fashion.
3.3 Results and Analysis
The experimental results are shown in Table 1, where each reported number is averaged over the results on the official testing set from three different models. Row (a) is the baseline that trains NLU and NLG separately and independently, and rows (b)-(d) are the results of the proposed approach with different Lagrange parameters.
The proposed approach incorporates probability duality into the objective as the regularization term. To examine its effectiveness, we control the intensity of regularization by adjusting the Lagrange parameters. The results (rows (b)-(d)) show that the proposed method outperforms the baseline on all automatic evaluation metrics. Furthermore, the performance improves more with stronger regularization (row (b)).
In this paper, we design methods for estimating the marginal distribution of data in NLG and NLU tasks: language modeling is used for sequential data (natural language utterances), while the masked autoencoder is used for flat representations (semantic frames). The proposed method for estimating the distribution of semantic frames considers complex implicit dependencies between semantics through an ensemble of multiple decompositions of the joint distribution. In our experiments, the empirical marginal distribution is the average over the results from 10 different masks and orders; in other words, 10 types of dependencies are modeled. Row (e) can be viewed as an ablation test, where the marginal distribution of semantic frames is estimated by treating slot-value pairs as independent of each other and computing their statistics from the training set. The performance is worse than that of the models that capture the dependencies, demonstrating the importance of considering the nature of the input data.
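The two estimation strategies compared here can be contrasted in a small sketch. The first part is one possible reading of the independence ablation: each slot-value pair is treated as an independent Bernoulli variable whose probability is a smoothed relative frequency in the training set (the add-one smoothing and the toy frames are assumptions). The second part averages the probabilities produced by several estimators, as one might do over differently masked autoencoders; averaging in probability space rather than log space is also an assumption.

```python
import math
from collections import Counter

def fit_independent(frames):
    """Smoothed relative frequency of each slot-value pair in the training frames."""
    counts = Counter(pair for frame in frames for pair in frame)
    n = len(frames)
    # Add-one smoothing keeps every probability strictly between 0 and 1.
    return {pair: (c + 1) / (n + 2) for pair, c in counts.items()}

def independent_log_prob(frame, pair_prob):
    """log P(frame), assuming every known pair occurs independently."""
    return sum(math.log(p) if pair in frame else math.log1p(-p)
               for pair, p in pair_prob.items())

def ensemble_log_prob(log_probs):
    """Log of the average probability across several estimators."""
    m = max(log_probs)
    return m + math.log(sum(math.exp(lp - m) for lp in log_probs) / len(log_probs))

# Hypothetical training frames, each a set of slot-value labels.
train = [
    {"food[English]", "priceRange[moderate]"},
    {"food[French]", "priceRange[high]"},
    {"food[French]", "priceRange[high]"},
    {"food[English]", "priceRange[high]"},
]
pair_prob = fit_independent(train)
lp_independent = independent_log_prob({"food[French]", "priceRange[high]"}, pair_prob)

# Averaging two hypothetical estimates of the same frame probability.
lp_ensemble = ensemble_log_prob([math.log(0.2), math.log(0.4)])
print(lp_independent, lp_ensemble)
```

The independent estimator cannot express that French food and a high price range tend to co-occur in the toy data, which is the kind of correlation a dependency-aware estimator is meant to capture.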
4 Conclusion
This paper proposes a new training framework for natural language understanding and generation based on dual supervised learning, which exploits the duality between NLU and NLG and introduces it into the learning objective as a regularization term. Moreover, domain knowledge is incorporated to design suitable approaches to estimating the data distributions. The proposed methods demonstrate their effectiveness by boosting the performance of both tasks simultaneously in the benchmark experiments.
References
- Bordes et al. (2017) Antoine Bordes, Y-Lan Boureau, and Jason Weston. 2017. Learning end-to-end goal-oriented dialog. In Proceedings of ICLR.
- Chen et al. (2017) Po-Chun Chen, Ta-Chung Chi, Shang-Yu Su, and Yun-Nung Chen. 2017. Dynamic time-aware attention to speaker roles and contexts for spoken language understanding. In Proceedings of ASRU.
- Chi et al. (2017) Ta-Chung Chi, Po-Chun Chen, Shang-Yu Su, and Yun-Nung Chen. 2017. Speaker role contextual modeling for language understanding and dialogue policy learning. In Proceedings of IJCNLP.
- Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP, pages 1724–1734.
- Dhingra et al. (2017) Bhuwan Dhingra, Lihong Li, Xiujun Li, Jianfeng Gao, Yun-Nung Chen, Faisal Ahmed, and Li Deng. 2017. Towards end-to-end reinforcement learning of dialogue agents for information access. In Proceedings of ACL, pages 484–495.
- Germain et al. (2015) Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. 2015. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889.
- He et al. (2016) Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, pages 820–828.
- Li et al. (2017) Xiujun Li, Yun-Nung Chen, Lihong Li, Jianfeng Gao, and Asli Celikyilmaz. 2017. End-to-end task-completion neural dialogue systems. In Proceedings of The 8th International Joint Conference on Natural Language Processing.
- Mikolov et al. (2010) Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. 2010. Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association.
- Novikova et al. (2017) Jekaterina Novikova, Ondrej Dušek, and Verena Rieser. 2017. The E2E dataset: New challenges for end-to-end generation. In Proceedings of SIGDIAL, pages 201–206.
- Peng et al. (2018) Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, Kam-Fai Wong, and Shang-Yu Su. 2018. Deep dyna-q: Integrating planning for task-completion dialogue policy learning. arXiv preprint arXiv:1801.06176.
- Su and Chen (2018) Shang-Yu Su and Yun-Nung Chen. 2018. Investigating linguistic pattern ordering in hierarchical natural language generation. In Proceedings of 7th IEEE Workshop on Spoken Language Technology.
- Su et al. (2018a) Shang-Yu Su, Xiujun Li, Jianfeng Gao, Jingjing Liu, and Yun-Nung Chen. 2018a. Discriminative deep dyna-q: Robust planning for dialogue policy learning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing.
- Su et al. (2018b) Shang-Yu Su, Kai-Ling Lo, Yi-Ting Yeh, and Yun-Nung Chen. 2018b. Natural language generation by hierarchical decoding with linguistic patterns. In Proceedings of The 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Su et al. (2018c) Shang-Yu Su, Pei-Chieh Yuan, and Yun-Nung Chen. 2018c. How time matters: Learning time-decay attention for contextual spoken language understanding in dialogues. In Proceedings of NAACL-HLT.
- Su et al. (2019) Shang-Yu Su, Pei-Chieh Yuan, and Yun-Nung Chen. 2019. Dynamically context-sensitive time-decay attention for dialogue modeling. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7200–7204. IEEE.
- Sundermeyer et al. (2012) Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. 2012. Lstm neural networks for language modeling. In Thirteenth annual conference of the international speech communication association.
- Tjandra et al. (2017) Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. 2017. Listening while speaking: Speech chain by deep learning. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 301–308. IEEE.
- Wen et al. (2017) Tsung-Hsien Wen, Milica Gasic, Nikola Mrksic, Lina M Rojas-Barahona, Pei-Hao Su, Stefan Ultes, David Vandyke, and Steve Young. 2017. A network-based end-to-end trainable task-oriented dialogue system. In Proceedings of EACL, pages 438–449.
- Wen et al. (2015) Tsung-Hsien Wen, Milica Gasic, Nikola Mrkšić, Pei-Hao Su, David Vandyke, and Steve Young. 2015. Semantically conditioned lstm-based natural language generation for spoken dialogue systems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1711–1721.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Xia et al. (2017) Yingce Xia, Tao Qin, Wei Chen, Jiang Bian, Nenghai Yu, and Tie-Yan Liu. 2017. Dual supervised learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 3789–3798. JMLR.org.
- Zhang et al. (2018) Rui Zhang, Honglak Lee, Lazaros Polymenakos, and Dragomir Radev. 2018. Addressee and response selection in multi-party conversations with speaker interaction rnns. In Proceedings of AAAI.