Nearly Zero-Shot Learning for Semantic Decoding in Spoken Dialogue Systems

06/14/2018 ∙ by Lina M. Rojas Barahona, et al. ∙ University of Cambridge

This paper presents two ways of dealing with scarce data in semantic decoding using N-Best speech recognition hypotheses. First, we learn features by using a deep learning architecture in which the weights for the unknown and known categories are jointly optimised. Second, an unsupervised method is used for further tuning the weights. Sharing weights injects prior knowledge to unknown categories. The unsupervised tuning (i.e. the risk minimisation) improves the F-Measure when recognising nearly zero-shot data on the DSTC3 corpus. This unsupervised method can be applied subject to two assumptions: the rank of the class marginal is assumed to be known and the class-conditional scores of the classifier are assumed to follow a Gaussian distribution.




1 Introduction

The semantic decoder in a dialogue system is the component in charge of processing the automatic speech recognition (ASR) output and predicting the semantic representation. In slot-filling dialogue systems, the semantic representation consists of a dialogue act and a set of slot-value pairs. For instance, the utterance "uhm I am looking for a restaurant in the north of town" has the semantic representation inform(type=restaurant,area=north), where inform is the dialogue act, type and area are slots, and restaurant and north are their respective values. Making the semantic decoder robust to rare slots is a crucial step towards open-domain language understanding.
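To make the target representation concrete, the sketch below splits a string such as inform(type=restaurant,area=north) into its dialogue act and slot-value pairs. It is only an illustration of the data structure: the function name `parse_semantics` and the exact string format are assumptions, not part of the decoder described in this paper.

```python
import re
from typing import Dict, Tuple


def parse_semantics(text: str) -> Tuple[str, Dict[str, str]]:
    """Split 'act(slot=value,...)' into the dialogue act and its slot-value pairs."""
    match = re.fullmatch(r"\s*(\w+)\((.*)\)\s*", text)
    if match is None:
        raise ValueError(f"not a dialogue act: {text!r}")
    act, body = match.groups()
    slots: Dict[str, str] = {}
    # Skip empty fragments so that an act without slots, e.g. 'bye()', is accepted.
    for pair in filter(None, (p.strip() for p in body.split(","))):
        slot, _, value = pair.partition("=")
        slots[slot.strip()] = value.strip()
    return act, slots


act, slots = parse_semantics("inform(type=restaurant,area=north)")
print(act, slots)
```

A decoder that is robust to rare slots must still produce such pairs for slots that almost never appear in the training data.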

In this paper, we deal with rarely seen slots in two steps. (i) We jointly optimise, in a deep neural network, the weights that feed multiple binary Softmax units. (ii) We further tune the weights learned in the previous step by minimising the theoretical risk of the binary classifiers, as proposed in Balasubramanian et al. (2011). In order to apply the second step, we rely on two assumptions: the rank of the class marginal is assumed to be known, and the class-conditional linear scores are assumed to follow a Gaussian distribution. In Balasubramanian et al. (2011), this approach was proven to converge towards the true optimal classifier risk. We conducted experiments on the dialogue corpus released for the third dialogue state tracking challenge, namely DSTC3 Henderson et al. (2014), and we show positive results for detecting rare slots as well as zero-shot slot-value pairs.

2 Related Work

Previous work on domain adaptation improved discriminative models by using priors and feature augmentation Daumé III (2009). The former uses the weights of the classifier in the known domain as a prior for the unknown domain. The latter extends the feature space with general features that might be common to both domains.
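As an illustration of feature augmentation, the sketch below uses the common three-copy formulation: every feature vector is expanded into a shared copy, a known-domain copy and an unknown-domain copy. This shows the general idea only; the function name `augment` and the domain labels "source" and "target" are hypothetical, and this scheme is not the method used in this paper.

```python
import numpy as np


def augment(x: np.ndarray, domain: str) -> np.ndarray:
    """Map x to [shared copy, source-only copy, target-only copy]."""
    zeros = np.zeros_like(x)
    if domain == "source":
        # Known domain: the shared block and the source block are active.
        return np.concatenate([x, x, zeros])
    if domain == "target":
        # Unknown domain: the shared block and the target block are active.
        return np.concatenate([x, zeros, x])
    raise ValueError(f"unknown domain: {domain!r}")


x = np.array([1.0, 2.0])
print(augment(x, "source"))
print(augment(x, "target"))
```

The shared block lets a single classifier learn weights common to both domains, while the domain-specific blocks absorb what differs between them.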

Recently, feature-based adaptation has been refined with unsupervised auto-encoders that learn features that can generalise between domains Glorot et al. (2011); Zhou et al. (2016). These models have proven to be successful for sentiment analysis but not for more complex semantic representations. A popular way to support semantic generalisation is to use high-dimensional word vectors trained on a very large amount of data Mikolov et al. (2013); Pennington et al. (2014) or even cross-lingual data Mrksic et al. (2017).

Previous approaches for recognising scarce slots in spoken language understanding relied on the semantic web Tur et al. (2012), linguistic resources Gardent and Rojas Barahona (2013), open domain knowledge bases (e.g., NELL) Pappu and Rudnicky (2013), user feedback Ferreira et al. (2015) or generation of synthetic data by simulating ASR errors Zhu et al. (2014).

Unlike most of the state-of-the-art models Liu and Lane (2016); Mesnil et al. (2013), in this work semantic decoding is not treated as a sequence model because of the lack of word-aligned semantic annotations. In this paper, we inject priors as proposed in Daumé III (2009). However, our work differs from his because the priors are given by the weights trained through a joint optimisation of several binary Softmax units within a deep architecture exploiting word vectors. In this way, the rare slots exploit the embedded information learned about the known slots. Furthermore, we propose an unsupervised method for further tuning the weights by minimising the theoretical risk.

3 Deep Learning Semantic Decoder

The Deep Learning semantic decoder is similar to the one proposed in Rojas Barahona et al. (2016). It has been split into two steps: (i) detecting the slots and (ii) predicting the values per slot. The deep architecture depicted in Figure 1 is used in both steps. It combines sentence and context representations, applying a non-linear function to their weighted sum (Eq. 1), to generate the final hidden unit that feeds various binary Softmax outputs (Eq. 2).



In Eq. 2, each output neuron is identified by an index that represents one class.

The sentence representation is obtained through a convolutional neural network (CNN) that processes the 10 best ASR hypotheses. The context representation is produced by a long short-term memory (LSTM) network that has been trained with the previous system dialogue acts. In the first step, there are as many Softmax units as slots (Figure 1). In the second step, a distinct model is trained for each slot, with as many distinct Softmax units as there are possible values for that slot (i.e. as defined by an ontology). For instance, the model that predicts food has one Softmax unit per food value: one unit predicts the presence or absence of "Italian" food, another predicts "Chinese" food, and so on.

Figure 1: Combination of sentence and context representations for the joint optimisation. In the first step, the binary classifiers are predicting the presence or absence of slots. In the second step, they are predicting the presence or absence of the values for a given slot.
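The combination step and the binary Softmax outputs can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' Theano implementation: the sentence and context vectors are random stand-ins for the CNN and LSTM outputs, tanh is assumed as the non-linear function, and all names (`forward`, `joint_loss`, `W_s`, `W_c`, `W_out`, `b_out`) and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def forward(sent, ctx, params):
    """Shared hidden unit followed by one two-way Softmax per class."""
    # Non-linear function (tanh, an assumption) of the weighted sum of both representations.
    h = np.tanh(params["W_s"] @ sent + params["W_c"] @ ctx)
    # W_out has shape (n_classes, 2, n_hidden): absence/presence logits for each class.
    logits = params["W_out"] @ h + params["b_out"]
    return h, softmax(logits)


def joint_loss(probs, labels):
    """Sum of the binary cross-entropies, so every unit shares one objective."""
    return float(-np.log(probs[np.arange(len(labels)), labels]).sum())


n_hidden, d_sent, d_ctx, n_slots = 8, 16, 4, 4
params = {
    "W_s": rng.normal(scale=0.1, size=(n_hidden, d_sent)),
    "W_c": rng.normal(scale=0.1, size=(n_hidden, d_ctx)),
    "W_out": rng.normal(scale=0.1, size=(n_slots, 2, n_hidden)),
    "b_out": np.zeros((n_slots, 2)),
}
h, probs = forward(rng.normal(size=d_sent), rng.normal(size=d_ctx), params)
loss = joint_loss(probs, np.array([1, 0, 0, 1]))  # 1 = slot present, 0 = absent
print(probs.shape, loss)
```

Because `W_s` and `W_c` receive gradients from every output unit, a rarely seen slot is trained on a hidden representation that was shaped mostly by the frequent slots.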

All the weights in the neural network are optimised jointly. The benefits of joint inference have been reported in the past for different NLP tasks Singh et al. (2013); Liu and Lane (2016). The main advantage of joint inference is that parameters are shared between predictors, so weights can be adjusted based on their mutual influence. For instance, the most frequent slots might influence infrequent slots.

4 Risk Minimisation (RM)

We use the unsupervised approach proposed in Balasubramanian et al. (2011) for risk minimisation (RM). We assume a binary classifier that associates a score $f_\theta^0(h)$ to the first class 0 for the hidden unit $h$ of dimension $N$:

$$f_\theta^0(h) = \sum_{i=1}^{N} \theta_i h_i$$

where the parameter $\theta_i$ represents the weight of the feature indexed by $i$ for class 0.

The objective of training is to minimise the classifier risk:

$$R(\theta) = E_{p(h,Y)}\left[ L(Y, f_\theta(h)) \right]$$

where $Y$ is the true label and $L$ is the loss function. The risk is derived as follows:

$$R(\theta) = \sum_{y \in \{0,1\}} P(y) \int_{-\infty}^{+\infty} P(f_\theta(h) = \alpha \mid y)\, L(y, \alpha)\, d\alpha \qquad (4)$$

We use the following hinge loss:

$$L(y, \alpha) = (1 + \alpha_{1-y} - \alpha_y)_+$$

where $(z)_+ = \max(0, z)$, and $\alpha_y = f_\theta^y(h)$ is the linear score for the correct class $y$. Similarly, $\alpha_{1-y} = f_\theta^{1-y}(h)$ is the linear score for the wrong class.

Given $y$ and $\alpha$, the loss value in the integral (Eq. 4) can be computed easily. Two terms remain: $P(y)$ and $P(f_\theta(h) = \alpha \mid y)$. The former is the class marginal and is assumed to be known. The latter is the class-conditional distribution of the linear scores, which is assumed to be normally distributed. This implies that $P(f_\theta(h))$ is distributed as a mixture of two Gaussians (GMM):

$$P(f_\theta(h) = \alpha) = \sum_{y \in \{0,1\}} P(y)\, \mathcal{N}(\alpha; \mu_y, \sigma_y)$$

where $\mathcal{N}(\alpha; \mu_y, \sigma_y)$ is the normal probability density function. The parameters $(\mu_0, \sigma_0, \mu_1, \sigma_1)$ can be estimated from an unlabeled corpus $U$ using a standard Expectation-Maximization (EM) algorithm for GMM training. Once these parameters are known, it is possible to compute the integral in Eq. 4 and thus an estimate $\hat{R}(\theta)$ of the risk without relying on any labeled corpus. In Balasubramanian et al. (2011), it has been proven that: (i) the Gaussian parameters estimated with EM converge towards their true values, (ii) $\hat{R}(\theta)$ converges towards the true risk and (iii) the estimated optimum converges towards the true optimal parameters, when the size of the unlabeled corpus increases infinitely. This is still true even when the class priors are unknown.
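The risk estimate described above can be sketched numerically. The code below fits a two-component one-dimensional GMM to unlabeled scores with EM, keeping the class priors fixed, and then integrates the hinge loss against each fitted Gaussian (Eq. 4). It is a minimal sketch under assumptions that are not stated in the text: the score of class 1 is taken to be the negative of the score of class 0, the integral is approximated on a grid rather than with the closed form, and the scores are synthetic. All function names are hypothetical.

```python
import numpy as np


def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


def fit_gmm_known_priors(scores, priors, n_iter=100):
    """EM for a two-component 1-D GMM whose mixture weights are fixed to `priors`."""
    scores = np.asarray(scores, dtype=float)
    # Start component 0 on the high scores and component 1 on the low scores.
    mu = np.array([scores.max(), scores.min()])
    sigma = np.full(2, scores.std() + 1e-6)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each class for each score.
        lik = np.stack([priors[y] * normal_pdf(scores, mu[y], sigma[y]) for y in (0, 1)])
        resp = lik / (lik.sum(axis=0, keepdims=True) + 1e-300)
        # M-step: update means and standard deviations only; the priors stay fixed.
        weight = resp.sum(axis=1) + 1e-12
        mu = (resp * scores).sum(axis=1) / weight
        sigma = np.sqrt((resp * (scores - mu[:, None]) ** 2).sum(axis=1) / weight) + 1e-6
    return mu, sigma


def estimated_risk(mu, sigma, priors, n_grid=20001):
    """Grid approximation of Eq. 4 with the hinge loss, assuming f^1 = -f^0."""
    risk = 0.0
    for y, sign in ((0, -1.0), (1, 1.0)):
        grid = np.linspace(mu[y] - 8 * sigma[y], mu[y] + 8 * sigma[y], n_grid)
        dx = grid[1] - grid[0]
        # y = 0: (1 - 2a)_+ ; y = 1: (1 + 2a)_+ where a is the class-0 score.
        hinge = np.maximum(0.0, 1.0 + sign * 2.0 * grid)
        risk += priors[y] * (normal_pdf(grid, mu[y], sigma[y]) * hinge).sum() * dx
    return float(risk)


rng = np.random.default_rng(0)
priors = np.array([0.3, 0.7])
is_class0 = rng.random(5000) < priors[0]
scores = np.where(is_class0, rng.normal(2.0, 0.5, 5000), rng.normal(-2.0, 0.5, 5000))
mu, sigma = fit_gmm_known_priors(scores, priors)
risk = estimated_risk(mu, sigma, priors)
print(mu, sigma, risk)
```

No label is used at any point: the only supervision is the prior over the two classes, which is what makes the estimate usable on an evaluation set.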

The unsupervised algorithm is as follows:

Unsupervised tuning for the binary classifier with weights $\theta$

input: the top hidden layer $h$ and the weights $\theta$, as trained by the deep learning decoder (Section 3)
output: the tuned weights $\theta$
repeat
    for every index $i$ in $\theta$ do
        Change the weight $\theta_i$
        Estimate the Gaussian parameters using EM
        Compute the risk (Eq. 4) on the unlabeled corpus (i.e. the evaluation set); a closed form is used to compute the risk for binary classifiers, Rojas Barahona and Cerisara (2015)
        Compute the gradient using finite differences
        Update the weights accordingly
    end for
until convergence
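The loop structure of this algorithm can be sketched as coordinate-wise gradient descent with forward finite differences. In the sketch below the risk is passed in as a function; in the method described here, that function would re-estimate the Gaussian parameters with EM on the scores of the unlabeled corpus and evaluate Eq. 4. To keep the example self-contained, a simple quadratic stands in for the risk. The function name `tune_weights`, the step size, the perturbation `eps` and the stopping rule are all assumptions.

```python
import numpy as np


def tune_weights(theta, risk_fn, step=0.1, eps=1e-4, max_sweeps=500, tol=1e-10):
    """Coordinate-wise descent on risk_fn using forward finite differences."""
    theta = np.asarray(theta, dtype=float).copy()
    previous = risk_fn(theta)
    for _ in range(max_sweeps):
        for i in range(theta.size):
            base = risk_fn(theta)
            shifted = theta.copy()
            shifted[i] += eps  # change a single weight
            partial = (risk_fn(shifted) - base) / eps  # finite-difference gradient
            theta[i] -= step * partial  # update that weight
        current = risk_fn(theta)
        if abs(previous - current) < tol:  # stop once the risk no longer changes
            break
        previous = current
    return theta


# Stand-in risk with a known minimum; not the risk of Eq. 4.
target = np.array([1.0, -2.0])
tuned = tune_weights(np.zeros(2), lambda t: float(((t - target) ** 2).sum()))
print(tuned)
```

Each sweep calls the risk function twice per weight, so with EM inside the risk function the cost grows linearly with the size of the hidden layer.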

5 Experiments

The supervised and unsupervised models are evaluated on DSTC3 Henderson et al. (2014) using the macro F-Measure. The macro F-score was chosen because we are evaluating the capacity of the classifiers to predict the correct class, and both the positive and negative classes are equally important for our task. Moreover, since these are nearly zero-shot classifiers, it would be unfair to evaluate only their capacity to predict the positive category. We then compare three distinct models: (i) independent neural models for every binary classifier; (ii) neural models optimised jointly; and (iii) further tuning of the weights through RM.
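For reference, the sketch below computes a macro F-Measure for one binary classifier as the unweighted mean of the per-class F-scores. This per-class averaging is an assumption about the exact variant used; the function names and the toy labels are hypothetical.

```python
import numpy as np


def f_measure(y_true, y_pred, positive):
    """F-score of one class, treating `positive` as the class of interest."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    if tp == 0:
        return 0.0  # no correct prediction for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return float(2 * precision * recall / (precision + recall))


def macro_f_measure(y_true, y_pred):
    """Unweighted mean of the F-scores of the negative (0) and positive (1) classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean([f_measure(y_true, y_pred, c) for c in (0, 1)]))


# A classifier that never predicts the rare positive class.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [0] * 10
print(macro_f_measure(y_true, y_pred))
```

In this toy case the negative class is predicted almost perfectly, yet the macro score stays below 0.5 because the positive class is never found.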


As displayed in Table 1, new slots were introduced in DSTC3 relative to DSTC2. The training set contains only a few examples of these slots while the test set contains a large number of them. Interestingly, values that are frequent in the trainset, such as area=north, are entirely absent from the testset. In DSTC3 the dialogues are related to restaurants, pubs and coffee shops. The new slots are: childrenallowed, hastv, hasinternet and near. Known slots, such as food, can have zero-shot values as shown in Table 2. The corpus contains dialogues, turns in the trainset and dialogues, turns in the testset.

Hyperparameters and Training

The neural models were implemented in Theano Bastien et al. (2012). We used filter windows of 3, 4, and 5 with 100 feature maps each for the CNN. Dropout and a batch size of 50 were employed. Training is done through stochastic gradient descent over shuffled mini-batches with the Adadelta update rule. GloVe word vectors Pennington et al. (2014) were used to initialise the models. For the context representation, we use a window of the 4 previous system acts. The risk minimisation gradient descent runs for 2000 iterations for each binary classifier, with fixed class priors for the positive and negative classes.

Slot              Train   Test
hastv                 1    239
childrenallowed       2    119
near                  3     74
hasinternet           4    215
area               3149   5384
food               5744   7809
Table 1: Frequency of slots in DSTC3.

Slot   Value              Train   Test
near   trinity college        0      5
food   american               0     90
food   chinese takeaway       0     87
area   romsey                 0    127
area   girton                 0    118
Table 2: Some zero-shot values per slot in DSTC3.

The Gaussianity Assumption

As explained in Section 4, the risk minimisation tuning assumes that the class-conditional linear scores are normally distributed. We verified this assumption empirically on our unlabeled corpus (i.e. the DSTC3 testset) and found that it holds for the slots childrenallowed, hastv and hasinternet. However, the distribution for near has a negative skew. When verifying the values per slot, this assumption does not hold for area. Therefore, we cannot guarantee that this method will work correctly for area values on this evaluation set.
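One simple way to probe the Gaussianity of a set of scores is the sample skewness, which is close to zero for a symmetric distribution and negative when the left tail is heavier. The text does not say which check was actually used, so the sketch below is only one possible diagnostic; the function name and the toy data are hypothetical.

```python
import numpy as np


def sample_skewness(x):
    """Third standardised moment: about 0 if symmetric, below 0 if left-skewed."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    return float(np.mean(centred ** 3) / np.mean(centred ** 2) ** 1.5)


symmetric = [-2.0, -1.0, 0.0, 1.0, 2.0]
left_skewed = [-10.0, 0.0, 1.0, 1.0, 2.0]
print(sample_skewness(symmetric), sample_skewness(left_skewed))
```

In practice the check would be run separately on the scores assigned to each class, since the assumption concerns the class-conditional distributions rather than the mixture.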

Slot              F-Measure
Deep Learning Independent Models
childrenallowed   %
hastv             %
hasinternet       %
near              %
Deep Learning Joint Optimisation
childrenallowed   %
hastv             %
hasinternet       %
near              %
Risk Minimisation Tuning
childrenallowed   %
hastv             %
hasinternet       %
near              %
Table 3: Results for learning rare slots on DSTC3 evaluation set.

Slot   Value              F-Measure
Deep Learning Independent Models
near   trinity college    %
food   american           %
food   chinese takeaway   %
area   romsey             %
area   girton             %
Deep Learning Joint Optimisation
near   trinity college    %
food   american           %
food   chinese takeaway   %
area   romsey             %
area   girton             %
Risk Minimisation Tuning
near   trinity college    %
food   american           %
food   chinese takeaway   %
area   romsey             %
area   girton             %
Table 4: Results for learning zero-shot slot-value pairs on DSTC3 evaluation set.

6 Results

Tables 3 and 4 display the performance of the models that predict slots and values respectively. The low F-Measure of the independent models shows their inability to predict positive examples. Joint optimisation significantly improves the precision and F-Measure. Applying RM tuning results in the best F-Measure for all the rare slots (Table 3) and for the values of the slots food and near (Table 4). For area, the joint optimisation improves the F-Measure, but the improvement is lower than for other slots. Performance on this slot is affected by its low cardinality, the high variability of new places and the fact that frequent values, such as north and east, are completely absent from the test set. As suspected, the RM tuning degraded the precision and F-Measure for area because the Gaussianity assumption does not hold for it. However, RM should work well on larger evaluation sets, because the Gaussian assumption is expected to hold as the unlabeled corpus tends to infinity (please refer to Balasubramanian et al. (2011) for the theoretical proofs).

7 Conclusion

We presented here two novel methods for zero-shot learning in a deep semantic decoder. First, features and weights were learned through a joint optimisation within a deep learning architecture. Second, the weights were further tuned through risk minimisation. We have shown that the joint optimisation significantly improves the neural models for nearly zero-shot slots. We have also shown that under the Gaussianity assumption, the RM tuning is a promising method for further tuning the weights of zero-shot data in an unsupervised way.