1 Introduction
The task in the Fourth Dialogue Breakdown Detection Challenge (DBDC4) DBDC4 is to build a model that detects whether a system utterance causes a breakdown in a dialogue between a system and a user. A breakdown is defined as a situation where the user cannot proceed with the conversation. Given a system utterance, the model is required to produce two outputs: (1) a single breakdown label chosen from the three breakdown labels (NB: Not a breakdown, PB: Possible breakdown, and B: Breakdown); (2) the probability distribution over the three breakdown labels, which we refer to as P(NB), P(PB), and P(B) hereinafter. For evaluating the model, the organisers adopted classification-related metrics and distribution-related metrics, with an emphasis on the mean squared error (MSE). A complete description of the challenge can be found in the DBDC4 overview paper DBDC4.
RSL19BD (Waseda University Sakai Laboratory) participated in DBDC4 and submitted five runs to both the English and Japanese subtasks. In these runs, we utilise a Decision Tree-based model and a Long Short-Term Memory-based (LSTM-based) model, following the approaches of RSL17BD RSL17BD and KTH KTH at the Third Dialogue Breakdown Detection Challenge (DBDC3) DBDC3 respectively.
2 Prior Art
At DBDC3 DBDC3, RSL17BD RSL17BD and KTH KTH both submitted models which achieved high performance. This section briefly describes their approaches.
2.1 RSL17BD at DBDC3
The top-performing model of RSL17BD utilises ExtraTreesRegressor extratree (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html) and employs the six features shown in Table 1, chosen based on pattern analysis, to predict the mean and variance of the probability distribution of the breakdown labels for each target system utterance. The predicted mean and variance are then converted into the predicted probability distribution over the three breakdown labels. The single breakdown label is determined by choosing the label with the highest probability.
Feature 

turn index of the target utterance 
length of the target utterance (number of characters) 
length of the target utterance (number of terms) 
keyword flags of the target utterance 
term frequency vector similarities among the target system utterance, the immediately preceding user utterance, and the system utterance that immediately precedes that user utterance 
word embedding vector similarities among the target system utterance, the immediately preceding user utterance, and the system utterance that immediately precedes that user utterance 
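As an illustration of the similarity features in Table 1, the term frequency vector similarities can be sketched with scikit-learn as follows. This is a minimal sketch, not RSL17BD's actual code; the function name and the toy utterances are ours.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tf_similarity_features(target_sys_utt, prev_user_utt, prev_sys_utt):
    # Term frequency vectors over the three utterances of interest
    vectors = CountVectorizer().fit_transform(
        [target_sys_utt, prev_user_utt, prev_sys_utt])
    sim = cosine_similarity(vectors)
    # The three pairwise similarities among the utterances
    return [sim[0, 1], sim[0, 2], sim[1, 2]]

# Toy example: target system utterance, preceding user utterance,
# and the system utterance before that
feats = tf_similarity_features(
    "what movie do you like",
    "i like horror movies",
    "do you like movies",
)
```

The word embedding variant in the last row of Table 1 would replace the count vectors with averaged word embedding vectors before taking cosine similarities.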
2.2 KTH at DBDC3
The top-performing model of KTH utilises Long Short-Term Memory (LSTM) LSTM. For the preprocessing of English data, it produces a sequence of 300-dimensional word embedding vectors for each utterance in every dialogue and takes the average of the sequence to produce the final embedding for a single utterance. The number of turns in each dialogue is fixed to 20 by removing the first system utterance (which has no annotations) or removing the last user turn. This produces an embedded dialogue of 20 turns, with each turn represented by a single 300-dimensional utterance embedding. The embedded dialogue is then processed by 4 LSTM layers and a Dense layer to produce 4 outputs for each turn. The 4 outputs are P(NB), P(PB), P(B), and P(U), where P(U) denotes the probability that the turn is a user turn. The reason for adding P(U) is that user turns are included in the embedded dialogue as well and need to be predicted with a label different from NB, PB, and B. The model is trained for 100 epochs using Adadelta adadelta as its optimiser. During training, it targets the single breakdown label and aims to minimise the categorical cross-entropy loss for each target system utterance. For Japanese data, KTH did not submit any runs.
3 Description of DBDC4 Dataset
The development and evaluation datasets given in DBDC4 contain two languages: English and Japanese.
The English data consists of dialogues from a dialogue system named IRIS and six other dialogue systems (anonymised as Bot001 to Bot006) which participated in the Conversational Intelligence Challenge. In this paper, Bot001 to Bot006 are treated as a single system referred to as BOT. Each dialogue is composed of 20 or 21 turns of alternating system and user utterances, with 10 system utterances being labeled. Each labeled system utterance is evaluated by 15 human annotators, each of whom labels it with a breakdown label chosen from NB, PB, and B.
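The gold probability distribution of a labeled utterance follows directly from the annotator votes. A minimal sketch (the function name and the toy votes are ours, for illustration only):

```python
from collections import Counter

def label_distribution(annotator_labels):
    # Fraction of annotators choosing each label, in (NB, PB, B) order
    counts = Counter(annotator_labels)
    n = len(annotator_labels)
    return tuple(counts[label] / n for label in ("NB", "PB", "B"))

# e.g. a labeled system utterance judged by 15 annotators
dist = label_distribution(["NB"] * 6 + ["PB"] * 4 + ["B"] * 5)
```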
The Japanese data consists of two types of dialogues. The first type comes from three dialogue systems named DCM, DIT, and IRS. Each dialogue is composed of 21 turns of alternating system and user utterances, with 11 system utterances being labeled. The second type is located under a folder named dbd_livecompe_eval and comes from five systems (IRS, MMK, MRK, TRF, and ZNK) which participated in a live competition held in Japan. Each dialogue is composed of 31 turns of alternating system and user utterances, with 16 system utterances being labeled. In the development data, all labeled system utterances are evaluated by 30 human annotators. In the evaluation data, labeled system utterances of the first type and second type are evaluated by 15 and 30 human annotators respectively. A complete description of the dataset can be found in the DBDC4 overview paper DBDC4 .
For each dialogue system in the development dataset, we calculated the average probability distribution of the three breakdown labels across all its labeled utterances. We did not do so for the evaluation dataset since it is unlabeled. Tables 4, 3 and 2 show our calculated results along with other statistics.
System name  No. of dialogues  No. of turns  No. of annotators  NB  PB  B 
BOT (dev)  168  20 or 21  15  38.1%  28.7%  33.2% 
IRIS (dev)  43  21  15  30.0%  30.3%  39.6% 
BOT (eval)  173  20 or 21  15       
IRIS (eval)  27  21  15       
System name  No. of dialogues  No. of turns  No. of annotators  NB  PB  B 

DCM (dev)  350  21  30  42.2%  29.9%  27.9% 
DIT (dev)  150  21  30  26.0%  29.6%  44.4% 
IRS (dev)  150  21  30  30.5%  25.8%  43.7% 
DCM (eval)  50  21  15       
DIT (eval)  50  21  15       
IRS (eval)  50  21  15       
System name  No. of dialogues  No. of turns  No. of annotators  NB  PB  B 

IRS (dev)  13  31  30  32.8%  25.4%  41.7% 
MMK (dev)  15  31  30  57.6%  29.4%  13.0% 
MRK (dev)  15  31  30  48.5%  35.5%  16.0% 
TRF (dev)  14  31  30  69.4%  20.0%  10.6% 
ZNK (dev)  16  31  30  47.2%  30.6%  22.2% 
IRS (eval)  15  31  30       
MMK (eval)  14  31  30       
MRK (eval)  14  31  30       
TRF (eval)  16  31  30       
ZNK (eval)  14  31  30       
4 Model Descriptions
4.1 Decision Tree-based model
For the preprocessing of both English and Japanese data, we follow the same approach as RSL17BD RSL17BD at DBDC3 DBDC3. Our model employs the same set of features as RSL17BD's model, but utilises RandomForestRegressor randomforest (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html) instead of ExtraTreesRegressor extratree. In addition, instead of predicting the mean and the variance of the probability distribution over the three breakdown labels and then deriving the probability of each label, it predicts the probability of each label directly. The probability distribution is then calculated by normalising the probabilities of the three labels by their sum.
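A minimal sketch of this direct per-label regression, with random data standing in for the Table 1 features; the data and variable names here are illustrative assumptions, not our actual pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 6)                      # stands in for the six Table 1 features
Y = rng.dirichlet(np.ones(3), size=200)   # gold [P(NB), P(PB), P(B)] rows

# Multi-output regression: one regression target per breakdown label
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, Y)

raw = model.predict(X[:5])                       # raw per-label predictions
dist = raw / raw.sum(axis=1, keepdims=True)      # normalise by their sum
labels = np.array(["NB", "PB", "B"])[dist.argmax(axis=1)]
```

RandomForestRegressor handles the three targets jointly, so no per-label model is needed.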
The modifications above were decided by training and evaluating different model configurations using the English and Japanese data from DBDC3. (When training and evaluating a model using the English data from DBDC3, we used the revised data mentioned in the DBDC3 overview paper DBDC3. Due to the late release of the DBDC4 dataset, we first built our models using the dataset from DBDC3.) The evaluation results are shown in Tables 6 and 5. DT means the model utilises decision trees, EX10 means the model utilises ExtraTreesRegressor with 10 estimators, and RF100 means the model utilises RandomForestRegressor with 100 estimators. AV means the model predicts the mean and the variance of the probability distribution, and NBPBB means the model predicts the probability of each label directly. There are four evaluation metrics. Accuracy denotes the number of correctly predicted breakdown labels divided by the total number of breakdown labels to be predicted (the larger the better); F1 (B) denotes the F1-measure where only the B labels are considered correct (the larger the better); JSD (NB, PB, B) denotes the Jensen-Shannon divergence between the predicted and correct probability distributions (the smaller the better); MSE (NB, PB, B) denotes the mean squared error between the predicted and correct probability distributions (the smaller the better). The results show that DT-RF100-NBPBB outperformed the model submitted by RSL17BD at DBDC3 (DT-EX10-AV) in all evaluation metrics on both English and Japanese data. Thus, we chose the configuration of DT-RF100-NBPBB for the model submitted as Run 1.
Model  Accuracy  F1 (B)  JSD (NB, PB, B)  MSE (NB, PB, B) 
DT-EX10-AV  0.3430  0.2344  0.0594  0.0357 
DT-EX10-NBPBB  0.4065  0.3696  0.0498  0.0291 
DT-RF10-NBPBB  0.3950  0.3542  0.0486  0.0282 
DT-RF100-NBPBB  0.4095  0.3548  0.0466  0.0271 
Model  Accuracy  F1 (B)  JSD (NB, PB, B)  MSE (NB, PB, B) 
DT-EX10-AV  0.3927  0.3225  0.1297  0.0769 
DT-EX10-NBPBB  0.5303  0.6050  0.0920  0.0502 
DT-RF10-NBPBB  0.5455  0.6292  0.0875  0.0481 
DT-RF100-NBPBB  0.5630  0.6511  0.0845  0.0460 
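The two distribution-related metrics can be sketched as follows. This is our own illustration; we assume a base-2 logarithm for JSD here, and the official evaluation scripts may differ in such details.

```python
import numpy as np

def mse(pred, gold):
    # Mean squared error over the three label probabilities
    pred, gold = np.asarray(pred, float), np.asarray(gold, float)
    return float(np.mean((pred - gold) ** 2))

def jsd(pred, gold):
    # Jensen-Shannon divergence between two distributions (base-2 log assumed)
    pred, gold = np.asarray(pred, float), np.asarray(gold, float)
    mid = (pred + gold) / 2
    def kl(a, b):
        mask = a > 0          # 0 * log(0) is taken as 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return (kl(pred, mid) + kl(gold, mid)) / 2
```

Both are averaged over all target system utterances to obtain the table values.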
4.2 LSTM-based model
Following the approach of KTH KTH at DBDC3 DBDC3, we utilise Long Short-Term Memory (LSTM) LSTM. However, instead of taking the average of the word embedding vectors for each utterance, we utilise Convolutional Neural Networks (CNN) to perform text feature extraction and produce the final embedded utterance. In addition, instead of targeting the single breakdown label and minimising the categorical cross-entropy loss for each target system utterance, our model targets the probability distribution of the three breakdown labels and minimises its mean squared error. We chose Adam adam as our optimiser and mean squared error as our loss function.
Fig. 1 shows the architecture diagram of our model. For the preprocessing of both English and Japanese data, we first follow the same approach as RSL17BD at DBDC3 to produce a sequence of 300-dimensional word embedding vectors for each utterance in every dialogue. The number of word vectors in each sequence is fixed to 50: sequences longer than 50 are truncated, and sequences shorter than 50 are padded with zero vectors. The number of turns in each dialogue is also fixed, by either removing the first system utterance (which has no annotations) or removing the last user turn. For the English data and the Japanese data from DCM, DIT, and IRS, the number of turns in each dialogue is fixed to 20. For the data of the five dialogue systems under dbd_livecompe_eval, the number of turns in each dialogue is fixed to 30.
The process above produces a dialogue with a fixed number of turns, with each turn represented by a sequence of 50 word vectors. We apply a One-dimensional Convolutional Neural Network (1D CNN), One-dimensional Global Max Pooling (1D Global Max Pooling), and Dropout Dropout to each sequence to produce an embedded dialogue. The 1D CNN layer uses 150 filters of size 2 with ReLU relu as the activation function. The dropout rate of the Dropout layer is set to 0.4.
The embedded dialogue is then processed by 4 LSTM layers sequentially. Each LSTM layer contains 64 units, with dropout set to 0.1 and recurrent dropout set to 0.1. We used LSTM instead of BiLSTM because the usage of turns after the target system utterance is disallowed. The output sequences from the 4 LSTM layers are concatenated to form a (number of turns, 256)-dimensional matrix, which is processed by a Dense layer with softmax activation and 4 outputs. The 4 outputs represent P(NB), P(PB), P(B), and P(U) respectively. The probability distribution for each target system utterance is calculated by normalising P(NB), P(PB), and P(B) by their sum. The single breakdown label is determined by choosing the label with the highest probability in the distribution.
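A minimal Keras sketch of this architecture, assuming the layer sizes stated above; the variable names and input shapes are our own, and the real model additionally handles the 30-turn dialogues by changing the turn constant.

```python
from tensorflow.keras import layers, models

MAX_WORDS, EMB_DIM, TURNS = 50, 300, 20   # 30 turns for the live-competition data

# Utterance encoder: 1D CNN (150 filters of size 2, ReLU),
# 1D global max pooling, then dropout at rate 0.4
utt_in = layers.Input(shape=(MAX_WORDS, EMB_DIM))
x = layers.Conv1D(150, kernel_size=2, activation="relu")(utt_in)
x = layers.GlobalMaxPooling1D()(x)
x = layers.Dropout(0.4)(x)
encoder = models.Model(utt_in, x)

# Dialogue model: encode every turn, then 4 stacked LSTM layers whose output
# sequences are concatenated and fed to a softmax Dense layer with 4 outputs
dlg_in = layers.Input(shape=(TURNS, MAX_WORDS, EMB_DIM))
h = layers.TimeDistributed(encoder)(dlg_in)
lstm_outs = []
for _ in range(4):
    h = layers.LSTM(64, return_sequences=True,
                    dropout=0.1, recurrent_dropout=0.1)(h)
    lstm_outs.append(h)
concat = layers.Concatenate()(lstm_outs)             # (TURNS, 256)
out = layers.Dense(4, activation="softmax")(concat)  # P(NB), P(PB), P(B), P(U)

model = models.Model(dlg_in, out)
model.compile(optimizer="adam", loss="mse")
```

At prediction time, P(NB), P(PB), and P(B) of each target system utterance are renormalised by their sum as described above.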
The modifications above were decided by training and evaluating different model configurations using the English and Japanese data from DBDC3. The evaluation results are shown in Tables 8 and 7. LSTM means the model utilises LSTM, and LSTM-CNN means the model utilises LSTM and CNN. ADAD-CAT means the model utilises Adadelta as the optimiser and categorical cross-entropy as the loss function, and ADAM-MSE means the model utilises Adam as the optimiser and mean squared error as the loss function.
The results show that for English data, LSTM-CNN-ADAM-MSE outperformed LSTM-ADAD-CAT and LSTM-ADAM-MSE in all evaluation metrics except F1 (B). Although LSTM-ADAD-CAT achieved high performance in F1 (B), its performance in mean squared error (MSE (NB, PB, B)) was poor. Since mean squared error is emphasised in this challenge, we decided to discard LSTM-ADAD-CAT. For Japanese data, LSTM-CNN-ADAM-MSE outperformed LSTM-ADAM-MSE in all evaluation metrics. We did not evaluate LSTM-ADAD-CAT because it had already been discarded after the evaluation on English data. In the end, we chose the configuration of LSTM-CNN-ADAM-MSE for the models submitted as Run 2 and Run 3.
Model  Epochs  Accuracy  F1 (B)  JSD (NB, PB, B)  MSE (NB, PB, B) 

LSTM-ADAD-CAT  100  0.4130  0.4616  0.0928  0.0573 
LSTM-ADAM-MSE  100  0.3940  0.3714  0.0516  0.0300 
LSTM-CNN-ADAM-MSE  50  0.4620  0.4268  0.0474  0.0274 
Model  Epochs  Accuracy  F1 (B)  JSD (NB, PB, B)  MSE (NB, PB, B) 

LSTM-ADAM-MSE  100  0.5448  0.6148  0.0885  0.0497 
LSTM-CNN-ADAM-MSE  50  0.5739  0.6594  0.0826  0.0463 
5 Runs
The descriptions of our runs are shown in Table 9. In Runs 1-3, we used the same strategy in creating the training data from the given development data in DBDC4. For the English submission, we created a single group of training data consisting of the entire English development data, referred to as the English training data hereinafter; the entire English evaluation data is likewise referred to as the English evaluation data. For the Japanese submission, we created two groups of training data. The first group consists of the development data from DCM, DIT, and IRS, and the second group consists of the development data from the five dialogue systems under dbd_livecompe_eval; we refer to them as the first and second Japanese training groups hereinafter. The evaluation data from DCM, DIT, and IRS and the evaluation data from the five dialogue systems under dbd_livecompe_eval are referred to as the first and second Japanese evaluation groups hereinafter.
Run  Description 

1  Decision Tree-based model 
2  LSTM-based model 
3  Ensemble of 5 LSTM-based models 
4  Ensemble of Run 1 and Run 2 
5  Ensemble of Run 1 and Run 3 
5.1 Run 1: Decision Tree-based model
For the English submission, we trained our Decision Tree-based model with the English training data and made predictions on the English evaluation data. For the Japanese submission, we built two models by training one with the first Japanese training group and the other with the second. We made predictions on the first Japanese evaluation group using the former model and on the second using the latter.
5.2 Run 2: LSTM-based model
For the English submission, we pretrained our LSTM-based model for 30 epochs with the entire English development and evaluation data from DBDC3, fine-tuned it by training for 32 epochs with the English training data, and made predictions on the English evaluation data. For the Japanese submission, we built two LSTM-based models. The first model was trained for 30 epochs with the first Japanese training group. The second model was created by loading the weights from the first model and fine-tuning for 25 epochs with the second Japanese training group. We made predictions on the first Japanese evaluation group using the first model and on the second using the second model. Every model was trained using a batch size of 32.
5.3 Run 3: Ensemble of 5 LSTM-based models
An ensemble of 5 LSTM-based models is built as follows. Given training data and evaluation data, we randomly divide the training data into 10 portions and sample 5 of them. We build 5 models, where each model is trained using one of the sampled portions as validation data and the rest of the development data as training data. The batch size is set to 32. Each model is saved when the validation loss is at its minimum and no overfitting has occurred. We make predictions on the evaluation data using each model, and take the mean of the predicted probability distributions for each target system utterance from the 5 models to produce a new probability distribution. The new single breakdown label is determined by choosing the label with the highest probability in the new probability distribution.
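The sampling and averaging steps above can be sketched as follows; the function names and sizes are illustrative assumptions, not our actual code.

```python
import numpy as np

def make_folds(n_dialogues, n_portions=10, n_models=5, seed=0):
    # Randomly divide the training dialogues into 10 portions and sample 5;
    # each sampled portion is one model's validation data, the rest its training data
    rng = np.random.RandomState(seed)
    portions = np.array_split(rng.permutation(n_dialogues), n_portions)
    chosen = rng.choice(n_portions, size=n_models, replace=False)
    folds = []
    for c in chosen:
        val = portions[c]
        train = np.concatenate([p for i, p in enumerate(portions) if i != c])
        folds.append((train, val))
    return folds

def ensemble_predict(per_model_dists):
    # per_model_dists: (n_models, n_utterances, 3);
    # mean distribution per utterance and the argmax breakdown label
    mean = np.asarray(per_model_dists).mean(axis=0)
    labels = np.array(["NB", "PB", "B"])[mean.argmax(axis=1)]
    return mean, labels

folds = make_folds(100)
```

Averaging valid distributions keeps each mean distribution summing to one, so no renormalisation is needed before taking the argmax.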
For the English submission, we pretrained an LSTM-based model for 30 epochs with the entire English development and evaluation data from DBDC3. The ensemble of 5 LSTM-based models was built by fine-tuning the pretrained model with the English training data and making predictions on the English evaluation data. The results of each LSTM-based model on the sampled validation data are shown in Table 10.
For the Japanese submission, we built two ensemble models. The first was built with the first Japanese training group and the first Japanese evaluation group; the results of each LSTM-based model on the sampled validation data are shown in Table 11. The second was built by loading the weights of the first model from the Japanese submission in Run 2 and fine-tuning it with the second Japanese training group, making predictions on the second Japanese evaluation group; the results of each LSTM-based model on the sampled validation data are shown in Table 12.
Model  Accuracy  F1 (B)  JSD (NB, PB, B)  MSE (NB, PB, B) 

1  0.5286  0.5385  0.0649  0.0343 
2  0.5190  0.5318  0.0701  0.0370 
3  0.5524  0.5660  0.0713  0.0375 
4  0.5381  0.6306  0.0706  0.0370 
5  0.5810  0.6635  0.0771  0.0401 
Model  Accuracy  F1 (B)  JSD (NB, PB, B)  MSE (NB, PB, B) 

1  0.5664  0.5782  0.0887  0.0469 
2  0.5804  0.6057  0.0914  0.0477 
3  0.5944  0.6505  0.0786  0.0429 
4  0.5944  0.6402  0.0903  0.0473 
5  0.5846  0.6231  0.0961  0.0495 
Model  Accuracy  F1 (B)  JSD (NB, PB, B)  MSE (NB, PB, B) 

1  0.6313  0.4583  0.0706  0.0371 
2  0.6953  0.4324  0.0591  0.0323 
3  0.6484  0.3750  0.0510  0.0277 
4  0.6797  0.3529  0.0600  0.0319 
5  0.6016  0.0000  0.0678  0.0336 
5.4 Run 4: Ensemble of Run 1 and Run 2
For both English and Japanese submissions, we take the mean of the predicted probability distribution for each target system utterance from Run 1 and Run 2 to produce a new probability distribution. The new single breakdown label is determined by choosing the label with the highest probability in the new probability distribution.
5.5 Run 5: Ensemble of Run 1 and Run 3
This run is identical to Run 4 except that Run 2 is replaced by Run 3.
6 Results
Tables 14 and 13 show the official results of our English and Japanese runs respectively. It can be observed that Run 5 performed well overall. For the English runs, it outperformed all other runs in all evaluation metrics. For the Japanese runs, it outperformed all other runs in JSD (NB, PB, B) and MSE (NB, PB, B).
Tables 16 and 15 show the results of comparing the MSE (NB, PB, B) of Runs 1-5 based on the Randomised Tukey Honestly Significant Differences (HSD) test, conducted with 10,000 replicates. The p-values are shown alongside effect sizes (standardised mean differences) effectsize. Table 15 shows that Run 5 statistically significantly outperformed all other runs in terms of MSE (NB, PB, B) for the English data. Table 16 shows that Run 5 statistically significantly outperformed all other runs except Run 4 in terms of MSE (NB, PB, B) for the Japanese data. Statistical significance is judged at the alpha level of 0.05.
Run  Accuracy  F1 (B)  JSD (NB, PB, B)  MSE (NB, PB, B) 
Run 1  0.4990  0.4411  0.0700  0.0362 
Run 2  0.4730  0.4483  0.0725  0.0374 
Run 3  0.5200  0.4554  0.0675  0.0346 
Run 4  0.5050  0.4650  0.0690  0.0353 
Run 5  0.5255  0.4690  0.0662  0.0336 
Run  Accuracy  F1 (B)  JSD (NB, PB, B)  MSE (NB, PB, B) 
Run 1  0.5390  0.4568  0.0975  0.0492 
Run 2  0.5412  0.4613  0.0989  0.0509 
Run 3  0.5476  0.4589  0.0967  0.0493 
Run 4  0.5412  0.4583  0.0954  0.0480 
Run 5  0.5444  0.4603  0.0947  0.0475 
Run 2  Run 3  Run 4  Run 5  
 
Run 1  
Run 2    
Run 3      
Run 4       
Run 2  Run 3  Run 4  Run 5  
 
Run 1  
Run 2    
Run 3      
Run 4       
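The Randomised Tukey HSD test above can be sketched as follows, applied to a matrix of per-utterance squared errors. This is our simplified illustration of the procedure; the actual tool also reports effect sizes.

```python
import numpy as np

def randomised_tukey_hsd(scores, n_replicates=10000, seed=0):
    # scores: (n_utterances, n_runs) matrix of per-utterance squared errors.
    # Returns a symmetric matrix of p-values for all pairs of runs.
    rng = np.random.RandomState(seed)
    col_means = scores.mean(axis=0)
    observed = np.abs(col_means[:, None] - col_means[None, :])
    count = np.zeros_like(observed)
    for _ in range(n_replicates):
        # Shuffle the run labels within each utterance (row)
        permuted = np.apply_along_axis(rng.permutation, 1, scores)
        means = permuted.mean(axis=0)
        # A pair counts whenever the replicate's largest mean difference
        # reaches that pair's observed difference
        count += (means.max() - means.min()) >= observed
    return count / n_replicates
```

A pair of runs is declared significantly different when its entry in the returned matrix is below the alpha level (0.05 here).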
7 Discussions
7.1 Naive strategy in creating the training data
As described in Section 5, in Runs 1-3 we used the same strategy in creating the training data from the given development data. For the English submission, we created one group of training data and trained a single model with it. The reason is that we wanted sufficient training data, since there are only 211 dialogues in total. For the Japanese submission, we created two groups of training data and trained two models with them respectively. The reason is that the first group consists of dialogues with 21 turns (fixed to 20 turns in preprocessing) while the second group consists of dialogues with 31 turns (fixed to 30 turns in preprocessing). Because our LSTM-based model only accepts a fixed number of turns, we had to build two models to target the two different turn lengths. We used the same strategy for building our Decision Tree-based model so that the ensemble with the LSTM-based model could be done easily.
Nevertheless, the above strategy is rather naive as it does not consider the overall probability distribution of the three breakdown labels for each dialogue system. As shown in Tables 4, 3 and 2, the average probability distributions of the dialogue systems differ from one another. In particular, the system IRS in Table 4 has a considerably higher probability for label B compared to the other four systems. We believe that IRS should not have been combined with the other four systems when creating the second Japanese training group. Furthermore, the model trained with that group should not have been used for predicting the data of IRS in the second Japanese evaluation group.
Table 17 shows the official results of MSE (NB, PB, B) for the second Japanese evaluation group. It can be observed that, due to the naive strategy above, all runs achieved poor performance on IRS. To improve the result, we believe that the development data of IRS should be excluded from the second Japanese training group and combined with the first. When predicting the labels for IRS in the second Japanese evaluation group, we should utilise the model trained with the first group instead of the one trained with the second. This proposed strategy requires us to either fix all training data to 30 turns in the LSTM-based model or develop a new model which accepts a shorter fixed turn length such as 5.
IRS  MMK  MRK  TRF  ZNK  
Run 1  0.0662  0.0243  0.0393  0.0282  0.0418 
Run 2  0.0606  0.0184  0.0328  0.0230  0.0389 
Run 3  0.0602  0.0195  0.0322  0.0231  0.0394 
Run 4  0.0606  0.0197  0.0341  0.0236  0.0378 
Run 5  0.0608  0.0206  0.0340  0.0239  0.0384 
7.2 Ensemble works?
We analysed our runs in terms of MSE (NB, PB, B) (referred to as MSE in this section), which is the emphasised evaluation metric in this challenge. From Tables 14 and 13, it can be observed that Run 4 outperformed Run 1 and Run 2, and Run 5 outperformed Run 1 and Run 3, in terms of MSE for both English and Japanese data. To investigate how well the ensemble actually worked for each utterance, we count the number of target system utterances for which the ensemble model outperformed the original models that were ensembled. In this section, we focus on Run 5, which achieved the best performance in terms of MSE, and compare its results with Run 1 and Run 3. (When comparing the runs in Section 7.2, we remove the first predicted system utterance of every dialogue in the Japanese data. This is because the first system utterances in the Japanese data are all annotated with the same label (NB) and are all predicted correctly by every run.)
Tables 19 and 18 show the number of target system utterances for which each run outperformed the others. Let $U$ denote the set of target system utterances in the evaluation dataset, and let $M_i(u)$ denote the MSE of Run $i$ given a target system utterance $u \in U$. The subsets $U_1$, $U_3$, and $U_5$ are defined by the following equations:
$U_1 = \{u \in U \mid M_1(u) < M_3(u) \text{ and } M_1(u) < M_5(u)\}$ (1) 
$U_3 = \{u \in U \mid M_3(u) < M_1(u) \text{ and } M_3(u) < M_5(u)\}$ (2) 
$U_5 = \{u \in U \mid M_5(u) < M_1(u) \text{ and } M_5(u) < M_3(u)\}$ (3) 
a subset of turns  no. of utterances 
 
$U_1$ (Run 1 best)  866  
$U_3$ (Run 3 best)  958  
$U_5$ (Run 5 best)  176  
others  0 
a subset of turns  no. of utterances 
 
$U_1$ (Run 1 best)  1200  
$U_3$ (Run 3 best)  1233  
$U_5$ (Run 5 best)  162  
others  0 
From Tables 19 and 18, it can be observed that the number of target system utterances for which Run 5 outperformed the other runs is relatively small. We plotted the relationship between the differences in MSE of Run 1, Run 3, and Run 5 in Figs. 3 and 2. The x-axis is $M_1(u) - M_5(u)$, and the y-axis is $M_3(u) - M_5(u)$. The points coloured in blue, orange, and green denote the target system utterances on which Run 1, Run 3, and Run 5 respectively achieved the lowest MSE.
By observing Figs. 3 and 2, it appears that the condition which makes the MSE of Run 5 lower than those of Run 1 and Run 3 is that the target system utterance is located in the first quadrant of Figs. 3 and 2.
Tables 21 and 20 show the mean MSE of Run 1, Run 3, and Run 5 over each of the three subsets of utterances on which Run 1, Run 3, and Run 5 respectively achieved the lowest MSE. From Tables 21 and 20, it can be observed that when Run 5 outperformed Run 1 and Run 3, the MSEs of Run 1 and Run 3 tend to be low. Similarly to Figs. 3 and 2, we plotted the relationship between the MSE of Run 1 and that of Run 3 in Figs. 5 and 4.
a subset of turns  Run 1  Run 3  Run 5 
 
$U_1$ (Run 1 best)  0.0270  0.0451  0.0344  
$U_3$ (Run 3 best)  0.0481  0.0285  0.0367  
$U_5$ (Run 5 best)  0.0159  0.0159  0.0129 
a subset of turns  Run 1  Run 3  Run 5 
 
$U_1$ (Run 1 best)  0.0463  0.0721  0.0573  
$U_3$ (Run 3 best)  0.0649  0.0399  0.0505  
$U_5$ (Run 5 best)  0.0195  0.0194  0.0164 
By observing Figs. 5 and 4, it appears that the green points are concentrated near the origin of both axes. In addition, Run 5 tends to outperform the other two runs when the MSEs of Run 1 and Run 3 are similar. We looked into the system utterances for which the difference between the MSE of Run 1 and that of Run 3 is high, and found that these utterances tend to be labeled with a high probability of NB or B compared to other utterances. We plotted the relationship between the absolute difference between the MSE of Run 1 and that of Run 3 and $p_{\max}$ in Figs. 7 and 6, where $p_{\max}$ denotes the maximum of the labeled probabilities of NB and B. The points coloured in blue are the target system utterances.
From Figs. 7 and 6, it can be observed that the MSEs of Run 1 and Run 3 tend to be similar when $p_{\max}$ is low. This means that the ensemble model tends to perform best on target system utterances which are not labeled with a high probability of NB or B. Therefore, to further improve our ensemble model, we should either develop a new ensemble strategy different from simple averaging or include a third model which focuses on minimising the MSE on target system utterances that are labeled with a high probability of NB or B.
8 Conclusions
We submitted five runs to both the English and Japanese subtasks of DBDC4. Run 1 utilises a Decision Tree-based model; Run 2 utilises an LSTM-based model; Run 3 performs an ensemble of 5 LSTM-based models; Run 4 performs an ensemble of Run 1 and Run 2; Run 5 performs an ensemble of Run 1 and Run 3. Run 5 statistically significantly outperformed all other runs in terms of MSE (NB, PB, B) for the English data, and all other runs except Run 4 for the Japanese data (alpha level = 0.05).
Our future work includes utilising a proposed strategy for creating the training data and improving our ensemble model. The proposed strategy considers the overall probability distribution of the three breakdown labels for each dialogue system and requires us to either fix all training data to 30 turns in the LSTM-based model or develop a new model which accepts a shorter fixed turn length such as 5. To improve our ensemble model, we should either develop a new ensemble strategy different from simple averaging or include a third model which focuses on minimising the MSE on target system utterances that are labeled with a high probability of NB or B.
Bibliography
(1) Breiman, L.: Random Forests. Machine Learning 45(1), 5–32 (2001)
(2) Geurts, P., Ernst, D., Wehenkel, L.: Extremely Randomized Trees. Machine Learning 63(1), 3–42 (2006). DOI 10.1007/s10994-006-6226-1
(3) Higashinaka, R., D'Haro, L.F., Shawar, B.A., Banchs, R., Funakoshi, K., Inaba, M., Tsunomori, Y., Takahashi, T., Sedoc, J.: Overview of the Dialogue Breakdown Detection Challenge 4
(4) Higashinaka, R., Funakoshi, K., Inaba, M., Tsunomori, Y., Takahashi, T., Kaji, N.: Overview of Dialogue Breakdown Detection Challenge 3. In: Proceedings of Dialog System Technology Challenge 6 (DSTC6) Workshop (2017)
(5) Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. Neural Computation 9(8), 1735–1780 (1997)
(6) Kato, S., Sakai, T.: RSL17BD at DBDC3: Computing Utterance Similarities based on Term Frequency and Word Embedding Vectors. In: Proceedings of Dialog System Technology Challenge 6 (DSTC6) Workshop (2017)
(7) Kingma, D.P., Ba, J.: Adam: A Method for Stochastic Optimization (2014)
(8) Lopes, J.: How Generic Can Dialogue Breakdown Detection Be? The KTH Entry to DBDC3. In: Proceedings of Dialog System Technology Challenge 6 (DSTC6) Workshop (2017)
(9) Nair, V., Hinton, G.E.: Rectified Linear Units Improve Restricted Boltzmann Machines. In: Proceedings of the 27th International Conference on Machine Learning, ICML'10, pp. 807–814 (2010)
(10) Sakai, T.: Laboratory Experiments in Information Retrieval: Sample Sizes, Effect Sizes, and Statistical Power. Springer (2018). DOI 10.1007/978-981-13-1199-4
(11) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research 15(1), 1929–1958 (2014)
(12) Zeiler, M.D.: ADADELTA: An Adaptive Learning Rate Method. CoRR abs/1212.5701 (2012). URL http://arxiv.org/abs/1212.5701