Spoken language understanding is a long-term goal of machine learning and could have a large impact on practical applications. However, the difficulty of processing speech signals is itself a bottleneck; for instance, the core part of speech translation has to be carried out in the text domain. In other words, failing to capture the key features of the speech signal can lead downstream applications to unexpected results.
Identifying whether a given utterance is a question can be one of the key features in applications such as speech translation. Unfortunately, a speech recognition system is likely to fail at achieving two goals at the same time: (1) extracting text sequences from the input utterances and (2) detecting questions. We can instead think of a question detection system that works independently and relieves the speech recognition system of this burden [15, 10, 4, 14, 3]. The resulting question annotation can then be paired with the output of the speech recognition system and handed over to the machine translation system.
In this study, we let the model learn the features from the training examples and the objective function. We propose a recurrent neural network (RNN) based system, with various model architectures, that can detect questions using multimodal inputs. At test time, our question detection system runs as fast as other real-time systems; it receives multimodal inputs and returns a scalar score. We evaluate our models on two multimodal datasets, which consist of pairs of text transcripts and audio signals. Our experiments reveal which context functions, regularization methods, RNN state transition functions and data domains are helpful in RNN-based question detection systems.
Table 1: An example of each question type.
|Type||Example|
|Yes-No||Did you attend the meeting?|
|wh-words||Where have you been?|
|Declarative||You are at the meeting?|
2.1 Types of Questions
Questions can take different canonical forms, and these forms are usually not standardized. However, we can divide questions into three groups. Table 1 shows an example from each group. Declarative questions are hard to distinguish from non-question statements by their canonical form alone, because they usually do not contain any wh-words. However, the audio signal may carry features that are useful for predictions on this kind of example, since a question usually ends with a rising pitch.
2.2 Neural Networks
An RNN can process a sequence $x = (x_1, \dots, x_T)$ by recursively applying a transition function to each symbol:

$$h_t = f(x_t, h_{t-1}), \tag{1}$$

where $f$ is usually a deterministic non-linear transition function. $f$ gains extra strength to capture long-term memories when implemented with gated activation functions such as long short-term memory [LSTM, 7] or gated recurrent units [GRU, 5]. We can add more hidden layers before or after the RNN to increase the capacity of the model, such that:

$$h_t = f(\phi(x_t), h_{t-1}), \qquad z_t = \psi(h_t), \tag{2}$$

where the sequence $z = (z_1, \dots, z_T)$ is the transformed feature representation of the input sequence $x$, and $\phi$ and $\psi$ are additional hidden layers. Instead of using the whole $z$, we can apply a context function to reduce the dimensionality and extract only the abstract information from $z$. The context function can be defined either as introduced in [5]:

$$c = z_T, \tag{3}$$

or as introduced in [1]:

$$c = \sum_{t=1}^{T} \alpha_t z_t, \tag{4}$$

where $\alpha_t$ is the weight of each annotation $z_t$.
$c$ can be used as the learned features for the logistic regression classifier:

$$p(y = 1 \mid x) = \sigma(w^\top c + b), \tag{5}$$

where $\sigma$ denotes the sigmoid function, and $w$ and $b$ are the weight vector and bias of the classifier.
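The pipeline described in this section can be sketched in a few lines of plain Python. This is only an illustration under several assumptions that are not taken from the paper: a plain tanh transition stands in for the gated units, the weights of the weighted-sum context function come from a softmax over dot products with a hypothetical scoring vector `v`, and all parameters are random and untrained. The names `rnn_states`, `context_last`, `context_weighted` and `question_score` are made up for this sketch.

```python
import math
import random

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_states(xs, W_in, W_rec):
    """Apply a plain tanh transition recursively; returns one state per input."""
    h = [0.0] * len(W_rec)
    states = []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(W_in, x), matvec(W_rec, h))]
        h = [math.tanh(p) for p in pre]
        states.append(h)
    return states

def context_last(states):
    """Context function that keeps only the final state."""
    return states[-1]

def context_weighted(states, v):
    """Context function that returns a softmax-weighted sum of all states."""
    scores = [sum(a * b for a, b in zip(v, h)) for h in states]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(states[0])
    context = [sum(a * h[i] for a, h in zip(alphas, states)) for i in range(dim)]
    return context, alphas

def question_score(context, w, b):
    """Logistic regression on the context vector; returns a value in (0, 1)."""
    logit = sum(wi * ci for wi, ci in zip(w, context)) + b
    return 1.0 / (1.0 + math.exp(-logit))

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def rand_matrix(rows, cols):
    """Small random matrix used as an untrained stand-in for learned weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

IN_DIM, HID_DIM, T = 3, 4, 6  # toy sizes, chosen arbitrarily
W_in, W_rec = rand_matrix(HID_DIM, IN_DIM), rand_matrix(HID_DIM, HID_DIM)
v, w, b = rand_matrix(1, HID_DIM)[0], rand_matrix(1, HID_DIM)[0], 0.0
xs = rand_matrix(T, IN_DIM)  # random stand-in for an input sequence

states = rnn_states(xs, W_in, W_rec)
score_last = question_score(context_last(states), w, b)
context, alphas = context_weighted(states, v)
score_weighted = question_score(context, w, b)
```

Both context functions reduce a sequence of `T` states to a single vector of fixed size, so the same classifier can be applied regardless of the sequence length.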
3 Proposed Models
We take a neural network based approach where we can stack multiple feedforward and recurrent layers to learn hierarchical features from the training examples and the objective function via stochastic gradient descent.
We consider two types of inputs: text transcripts and audio signals of utterances. Depending on which inputs are used, we can divide the models into three groups: those that receive (1) only text inputs, (2) only audio inputs and (3) both inputs. When a model receives both inputs, as in (3), a simple but naive way of combining the two different features is the one shown as ‘Combinational’ in Fig. 1. Each model in each group can use either Eq. (3) or Eq. (4) as its context function, so the number of combinations is six. In addition, there is a seventh model that receives both inputs and uses Eq. (4) as the context function, but combines the two features in a different way, depicted as ‘Conditional’ in Fig. 1.
We train each model in each group in three different ways: (1) without any regularization method, (2) with dropout [13] and (3) with batch normalization (BN) [8] (note that we are not the first to apply batch normalization to a neural network architecture that contains an RNN [9]). A further source of variation is the state transition function of the RNN hidden state, which can be implemented either as a GRU or as an LSTM. Therefore, for each model, there are six different candidates to compare. Since we have seven different models and each model has six variations, there is a total of 42 candidates to be tested on two datasets, MSR-Skype and CALLHOME.
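The count of candidates can be checked by enumerating the configurations. The sketch below is only a bookkeeping aid; the labels (`"eq3"`, `"eq4"`, `"single"` and so on) are names chosen here for illustration, not identifiers from the paper.

```python
from itertools import product

INPUTS = ("text", "audio", "both")
CONTEXTS = ("eq3", "eq4")  # the two context functions, Eq. (3) and Eq. (4)

# Six models: every input group paired with every context function.
models = [
    (inp, ctx, "combinational" if inp == "both" else "single")
    for inp, ctx in product(INPUTS, CONTEXTS)
]
# The seventh model: both inputs, Eq. (4), 'Conditional' combination.
models.append(("both", "eq4", "conditional"))

REGULARIZERS = ("none", "dropout", "batch_norm")
UNITS = ("GRU", "LSTM")

# Every model is trained with every regularizer and every recurrent unit.
candidates = [
    model + (reg, unit)
    for model, reg, unit in product(models, REGULARIZERS, UNITS)
]

print(len(models), len(candidates))
```

Each of these candidates is then trained and evaluated on both datasets.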
4 Experiment Settings
The MSR-Skype dataset contains examples given as text-audio pairs, and the positive and negative examples are well balanced. Each example is an utterance, which is segmented manually. We only use examples that contain to words to train the models. We use of the examples as a training set and reserve of the examples to validate and evaluate the models.
We use a subset of the original CALLHOME, where the text transcripts are created by human annotators. There are examples given as text-audio pairs. Utterances are segmented manually, and the train/validation/test splits are made in the same way as for the MSR-Skype dataset.
For the text data, we remove punctuation such as commas, question marks and exclamation marks to prevent the model from making decisions based on these special tokens. We do not pretrain word representation vectors on external datasets; instead, they are learned jointly with the other parameters under the training objective. Therefore, the hidden layer applied to the text input in ‘Single’ (only when the input is text data) and in ‘Combinational’ and ‘Conditional’ becomes a continuous vector representation of the words (in this case, we do not apply a non-linearity). We build the dictionary from MSR-Skype and CALLHOME; it contains 13,911 words.
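A minimal sketch of this text preprocessing is given below. It rests on assumptions that the text does not confirm: the exact punctuation set (`",.?!"` here), whitespace tokenization, and the reserved id 0 for out-of-vocabulary words. The function names are hypothetical.

```python
PUNCTUATION = ",.?!"  # assumed set; the text names commas, question marks, exclamation marks
STRIP_TABLE = str.maketrans("", "", PUNCTUATION)
UNKNOWN_ID = 0  # hypothetical id reserved for out-of-vocabulary words

def tokenize(utterance):
    """Remove punctuation characters and split on whitespace."""
    return utterance.translate(STRIP_TABLE).split()

def build_vocabulary(utterances):
    """Assign an integer id to each distinct word, in order of first appearance."""
    vocab = {}
    for utterance in utterances:
        for word in tokenize(utterance):
            if word not in vocab:
                vocab[word] = len(vocab) + 1  # ids start at 1; 0 is reserved
    return vocab

def encode(utterance, vocab):
    """Map an utterance to the word ids that index the embedding table."""
    return [vocab.get(word, UNKNOWN_ID) for word in tokenize(utterance)]

corpus = ["any other questions?", "and your cats?", "oh, the bird?"]
vocab = build_vocabulary(corpus)
ids = encode("oh, the cats?", vocab)
```

After this step a question mark can no longer reveal the label, so the model has to rely on the words themselves and, where available, on the audio.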
We extract MFCC features from the raw audio signals with frame duration, and overlap. The audio sequences (after MFCC extraction) can be significantly longer than the text sequences; therefore, to reduce the number of timesteps, we concatenate four frames into one chunk and treat it as a single frame.
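The frame concatenation can be illustrated as follows. This is a sketch rather than the authors' code: `stack_frames` is a hypothetical helper, the feature dimension is a toy value, and dropping trailing frames that do not fill a complete chunk is an assumption (padding would be an equally plausible choice).

```python
def stack_frames(frames, k=4):
    """Concatenate every k consecutive frames into one longer frame."""
    # Assumption: trailing frames that do not fill a whole chunk are dropped.
    usable = len(frames) - len(frames) % k
    return [
        [value for j in range(k) for value in frames[i + j]]
        for i in range(0, usable, k)
    ]

# Ten toy frames with three features each; frame t is filled with the value t.
frames = [[float(t)] * 3 for t in range(10)]
stacked = stack_frames(frames, k=4)
print(len(stacked), len(stacked[0]))
```

With `k=4`, the sequence becomes four times shorter while each timestep carries four times as many features.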
|GRU||LSTM||GRU, D||LSTM, D||GRU, BN||LSTM, BN|
Table 2 shows the results of the models trained on the MSR-Skype dataset. We can observe a few tendencies in the results depending on which variations are applied to the models (Eq. (3) or Eq. (4) as the context function, GRU or LSTM, dropout or batch normalization, and the types of inputs).
In general, using both input sources is helpful, but the advantage is less pronounced when batch normalization is used for training. The audio sequences are usually longer than the text sequences, and the attention mechanism (Eq. (4)) [1] is known to handle long sequences well. Therefore, when the model can only take audio inputs, Eq. (4) is a better option than Eq. (3).
Dropout helps in most cases; however, when both input sources are used, the performance does not improve much. In fact, it becomes worse than that of the models that do not use dropout. We assume that the optimization problem becomes more difficult with dropout when the models receive both input sources; hence, dropout needs to be applied with more care in this case. Batch normalization improves the performance by a large margin for the models that receive the text source as input. However, batch normalization does not help the models that can only receive audio inputs. The best performance is achieved by a model that receives both input sources (combinational), uses as context function, uses batch normalization for training and uses LSTM as the state transition function of the RNN.
Table 3 shows the results of each model trained on CALLHOME. We can observe that Eq. (4) helps the models that only take audio inputs, and batch normalization improves the performance of the models that include the text source among their inputs. The best performance is achieved by a model that takes both input sources (combinational), uses as context function, uses batch normalization for training and uses GRU as the state transition function of the RNN.
|GRU||LSTM||GRU, D||LSTM, D||GRU, BN||LSTM, BN|
In Table 4, we test our models on sequences of different lengths. We use the same models that were trained on MSR-Skype without any regularization method. The sequences are divided into three groups according to the number of words in each sequence. Short sequences have fewer than words, long sequences have more than words, and intermediate sequences contain to words. We observe that the models achieve the best performance on intermediate sequences, and that they tend to perform better on short and long sequences when the inputs include the text source. The performance degradation on short or long sequences relative to intermediate sequences is smaller when we use both input sources (see ‘combination’ and ‘condition’; in particular, these models lose less performance on long sequences).
|Utterance||Text||Audio||Combination||Condition|
|any other questions?||0.44||0.98||0.72||0.84|
|and your cats?||0.63||0.93||0.97||0.72|
|oh, the bird?||0.42||0.83||0.77||0.72|
Table 5 shows some test examples that neither contain wh-words nor have the canonical form of a question; we introduced these as declarative questions in Sec. 2.1. In questions of this kind, there is usually a rising pitch at the end of the audio signal. Models that receive the audio source as input can benefit from the audio information, as shown in Table 5 (see ‘audio’, ‘combination’ and ‘condition’). Models that receive only the text source as input lack the information needed to decide whether the given utterances are questions.
The predicted scores from the models using both inputs are sometimes lower than the scores from the model using only audio inputs. We assume that these models have to compromise between the text features and the audio features when the two are in conflict. Given the training objective, it is unlikely that the models will completely ignore one of the features; instead, they will tend to learn smoother decision boundaries.
We explore various types of RNN-based architectures for detecting questions in English utterances. We identify several factors that help the models achieve better scores on the question detection task. Different types of inputs can complement each other, and the models benefit from using both text and audio sources as inputs. The attention mechanism (Eq. (4)) helps the models that receive long audio sequences as inputs. Regularization methods can help the models generalize better; however, when the models receive multimodal inputs, these methods need to be applied with more care.
- Bahdanau et al.  D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
- Bastien et al.  F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. J. Goodfellow, A. Bergeron, N. Bouchard, and Y. Bengio. Theano: new features and speed improvements. Deep Learning and Unsupervised Feature Learning NIPS 2012 Workshop, 2012.
- Bazillon et al.  T. Bazillon, B. Maza, M. Rouvier, F. Bechet, and A. Nasr. Speaker role recognition using question detection and characterization. In INTERSPEECH, pages 1333–1336, 2011.
- Boakye et al.  K. Boakye, B. Favre, and D. Hakkani-Tür. Any questions? Automatic question detection in meetings. In IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU), pages 485–489. IEEE, 2009.
- Cho et al.  K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.
- Chung et al.  J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS Workshop on Deep Learning, 2014.
- Hochreiter and Schmidhuber  S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
- Ioffe and Szegedy  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning (ICML), 2015.
- Laurent et al.  C. Laurent, G. Pereyra, P. Brakel, Y. Zhang, and Y. Bengio. Batch normalized recurrent neural networks. arXiv preprint arXiv:1510.01378, 2015.
- Metzler and Croft  D. Metzler and W. B. Croft. Analysis of statistical question classification for fact-based questions. Information Retrieval, 8(3):481–504, 2005.
- Nair and Hinton  V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 807–814, 2010.
- Schuster and Paliwal  M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
- Srivastava et al.  N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Wang and Chua  K. Wang and T.-S. Chua. Exploiting salient patterns for question detection and question retrieval in community-based question answering. In Proceedings of the 23rd International Conference on Computational Linguistics, pages 1155–1163. Association for Computational Linguistics, 2010.
- Yuan and Jurafsky  J. Yuan and D. Jurafsky. Detection of questions in Chinese conversational speech. In IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pages 47–52. IEEE, 2005.