1 Introduction and related work
Artificial neural networks (ANNs) have recently shown state-of-the-art results on various NLP tasks including language modeling [1, 2, 3, 4], text classification [5, 6, 7, 8], question answering [9, 10], and machine translation [11, 12]. Unlike other popular non-ANN-based machine learning algorithms such as support vector machines (SVMs) and conditional random fields (CRFs), ANNs can automatically learn features that are useful for NLP tasks, thereby requiring no manual feature engineering.
However, ANNs have hyperparameters that need to be tuned in order to achieve the best results. The hyperparameters of an ANN may define either its learning process (e.g., learning rate or mini-batch size) or its architecture (e.g., number of hidden units or layers). ANNs commonly contain over ten hyperparameters [13], which makes them challenging to optimize. Therefore, most published ANN-based work on NLP tasks relies on basic heuristics such as manual or random search, and sometimes does not optimize hyperparameters at all.
Although most of these works report state-of-the-art results without optimizing hyperparameters extensively, we argue that the results can be further improved by properly optimizing the hyperparameters. One of the main reasons why most previous NLP work does not thoroughly optimize hyperparameters is that doing so may represent a significant time investment. However, if hyperparameters are optimized efficiently, values that perform well can be found within a reasonable amount of time, as shown in this paper.
Like ANNs, other machine learning algorithms also have hyperparameters. The two most widely used methods for hyperparameter optimization of machine learning algorithms are manual search and grid search. Bergstra and Bengio [14] show that random search is as good as or better than grid search at finding hyperparameters within a small fraction of the computation time, and suggest that random search is a natural baseline for judging the performance of automatic approaches for tuning the hyperparameters of a learning algorithm. However, all of the above-mentioned methods have downsides: manual search requires human experts or relies on arbitrary rules of thumb, while grid and random searches are computationally expensive.
Recently, a more systematic approach based on Bayesian optimization with Gaussian processes (GPs) has been shown to be effective in automatically tuning the hyperparameters of machine learning algorithms such as latent Dirichlet allocation, SVMs, convolutional neural networks, and deep belief networks, as well as tuning the hyperparameters that features may have [18, 19]. In this approach, the model’s performance for each hyperparameter combination is modeled as a sample from a GP, resulting in a tractable posterior distribution given previous experiments. This posterior distribution is then used to choose the most promising hyperparameter combination to try next, based on the observations made so far.
In this work, we demonstrate the application of GPs to optimize ANN hyperparameters on an NLP task, namely dialog act classification [20], whose goal is to assign a dialog act to each utterance. The ANN model in [8] makes a good candidate for hyperparameter optimization since it is a simple model with only a few architectural hyperparameters, and the optimized architectural hyperparameters are interpretable and give some insight into the task at hand. Using this model, we show that optimizing hyperparameters further improves the state-of-the-art results on two datasets, and reduces the computational time by a factor of 4 compared to a random search.
The ANN model for dialog act classification is introduced in [8] and is briefly described in Section 2.1. The GP used to optimize the hyperparameters of the ANN model is presented in Section 2.2. The colon notation $\mathbf{v}_{i:j}$ represents the sequence of vectors $(\mathbf{v}_i, \mathbf{v}_{i+1}, \dots, \mathbf{v}_j)$.
2.1 ANN model
Each utterance of a dialog is mapped to a vector representation via a CNN (Section 2.1.1). Each utterance is then sequentially classified by leveraging preceding utterances (Section 2.1.2). Figure 1 gives an overview of the ANN model.
2.1.1 Utterance representation via CNN
An utterance of length $\ell$ is represented as the sequence of word vectors $\mathbf{x}_{1:\ell}$. Given the word vectors, the CNN model produces the utterance representation $\mathbf{s}$.
Let $h$ be the size of a filter, and the sequence of vectors $\mathbf{w}_{1:h}$ be the corresponding filter matrix. A convolution operation on $h$ consecutive word vectors starting from the $t$-th word outputs the scalar feature $c_t = g\left(\sum_{i=1}^{h} \mathbf{w}_i^{\top} \mathbf{x}_{t+i-1} + b\right)$, where $g$ is the activation function and $b$ is a bias term.
We perform convolution operations with $n$ different filters, and denote the resulting features as $\mathbf{c}_t \in \mathbb{R}^n$, each of whose dimensions comes from a distinct filter. Repeating the convolution operations for each window of $h$ consecutive words in the utterance, we obtain $\mathbf{c}_{1:\ell-h+1}$. The utterance representation $\mathbf{s} \in \mathbb{R}^n$ is computed in the max pooling layer, as the element-wise maximum of $\mathbf{c}_{1:\ell-h+1}$.
During training, dropout with probability $p$ is applied on this utterance representation.
The filter size $h$, the number of filters $n$, and the dropout probability $p$ are the hyperparameters of this section that we optimize using the GP (Section 2.2).
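The convolution, max pooling, and dropout steps described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `utterance_representation`, the ReLU nonlinearity, the inverted-dropout scaling, and the dimensions are all chosen for the example and are not taken from the original model.

```python
import numpy as np

def utterance_representation(words, filters, bias, dropout_p=0.0, rng=None):
    """words: (length, dim) word vectors; filters: (n_filters, filter_size, dim)."""
    n_filters, filter_size, _ = filters.shape
    n_windows = words.shape[0] - filter_size + 1
    # One row per window of consecutive words, one column per filter.
    features = np.empty((n_windows, n_filters))
    for t in range(n_windows):
        window = words[t:t + filter_size]                     # (filter_size, dim)
        pre_activation = np.tensordot(filters, window, axes=2) + bias
        features[t] = np.maximum(pre_activation, 0.0)         # ReLU is an assumption
    pooled = features.max(axis=0)                             # element-wise max over windows
    if rng is not None and dropout_p > 0.0:
        # Training only: zero each unit with probability dropout_p (inverted dropout).
        mask = rng.random(n_filters) >= dropout_p
        pooled = pooled * mask / (1.0 - dropout_p)
    return pooled

rng = np.random.default_rng(0)
words = rng.normal(size=(7, 300))          # utterance of 7 words, 300-dimensional vectors
filters = rng.normal(size=(100, 3, 300))   # 100 filters of size 3
bias = np.zeros(100)
s = utterance_representation(words, filters, bias)
print(s.shape)  # (100,)
```

The representation has one dimension per filter, so the number of filters directly sets the size of the vector passed to the classifier.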
2.1.2 Sequential utterance classification
Let $\mathbf{s}_i$ be the utterance representation given by the CNN architecture for the $i$-th utterance in the sequence of length $m$. The sequence $\mathbf{s}_{1:m}$ is input to a two-layer feedforward neural network that classifies each utterance. The hyperparameters $d_1$ and $d_2$, the history sizes used in the first and second layers respectively, are optimized using the GP (Section 2.2).
The first layer takes as input $\mathbf{s}_{i-d_1+1:i}$ and outputs $\mathbf{y}_i \in \mathbb{R}^k$, where $k$ is the number of classes for the classification task, i.e., the number of dialog acts. It uses a tanh activation function. Similarly, the second layer takes as input $\mathbf{y}_{i-d_2+1:i}$ and outputs $\mathbf{z}_i \in \mathbb{R}^k$ with a softmax activation function.
The final output $\mathbf{z}_i$ represents the probability distribution over the set of $k$ classes for the $i$-th utterance: the $j$-th element of $\mathbf{z}_i$ corresponds to the probability that the $i$-th utterance belongs to the $j$-th class. Each utterance is assigned to the class with the highest probability.
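A minimal NumPy sketch of the two-layer sequential classifier may help make the role of the two history sizes concrete. The names `classify_sequence`, `W1`, `W2`, the zero-padding of missing history at the start of a dialog, and the convention that a history size of 1 means "current utterance only" are assumptions made for this example.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def classify_sequence(s, W1, b1, W2, b2):
    """s: (m, n) utterance representations; returns (m, k) class probabilities.
    W1: (d1, k, n) and W2: (d2, k, k) hold one weight matrix per history position."""
    m, n = s.shape
    d1, k, _ = W1.shape
    d2 = W2.shape[0]
    # Assumption: missing history at the start of the dialog is zero-padded.
    s_pad = np.vstack([np.zeros((d1 - 1, n)), s])
    y = np.empty((m, k))
    for i in range(m):
        history = s_pad[i:i + d1]            # current utterance and d1 - 1 preceding ones
        y[i] = np.tanh(np.einsum("dkn,dn->k", W1, history) + b1)
    y_pad = np.vstack([np.zeros((d2 - 1, k)), y])
    z = np.empty((m, k))
    for i in range(m):
        history = y_pad[i:i + d2]
        z[i] = softmax(np.einsum("dkj,dj->k", W2, history) + b2)
    return z

rng = np.random.default_rng(0)
m, n, k, d1, d2 = 6, 100, 5, 2, 3
s = rng.normal(size=(m, n))
z = classify_sequence(
    s,
    W1=0.1 * rng.normal(size=(d1, k, n)), b1=np.zeros(k),
    W2=0.1 * rng.normal(size=(d2, k, k)), b2=np.zeros(k),
)
predicted_acts = z.argmax(axis=1)   # class with the highest probability per utterance
```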
2.2 Hyperparameter optimization using GP
Let $\mathcal{X}$ be the set of all hyperparameter combinations considered, and let $f : \mathcal{X} \to \mathbb{R}$ be the function mapping from hyperparameter combinations to a real-valued performance metric (such as F1-score on test set) of a learning algorithm using the given hyperparameter combination. Our interest lies in efficiently finding a hyperparameter combination $x \in \mathcal{X}$ that yields a near-optimal performance $f(x)$. In this paper, we use Bayesian optimization of hyperparameters using GP, which we call GP search.
2.2.1 Comparison with other methods
A grid search evaluates $f(x)$ by brute force for each $x$ defined on a grid and then selects the best one. In a random search, one randomly selects an $x \in \mathcal{X}$ and evaluates the performance $f(x)$; this process is repeated until an $x$ with a satisfactory $f(x)$ is found. In a manual search, an expert tries out some hyperparameter combinations based on prior experience until settling on a good one.
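The grid and random searches described above can be sketched as follows. The objective `toy_score` and the two-hyperparameter grid are hypothetical stand-ins chosen so that the example runs instantly; in practice each evaluation would train and score a model.

```python
import itertools
import random

def grid_search(f, grid):
    """Evaluate f on every combination of the grid and return the best one."""
    combos = (dict(zip(grid, values)) for values in itertools.product(*grid.values()))
    return max(combos, key=lambda x: f(**x))

def random_search(f, grid, target, max_evals, seed=0):
    """Sample combinations at random until the score reaches target or the budget runs out."""
    rng = random.Random(seed)
    best_x, best_score = None, float("-inf")
    for n_evals in range(1, max_evals + 1):
        x = {name: rng.choice(values) for name, values in grid.items()}
        score = f(**x)
        if score > best_score:
            best_x, best_score = x, score
        if best_score >= target:
            break
    return best_x, best_score, n_evals

def toy_score(filter_size, dropout):
    # Hypothetical objective with a single optimum at filter_size=4, dropout=0.5.
    return -((filter_size - 4) ** 2) - (dropout - 0.5) ** 2

grid = {"filter_size": [3, 4, 5], "dropout": [0.1, 0.3, 0.5, 0.7, 0.9]}
best = grid_search(toy_score, grid)
x, score, n_evals = random_search(toy_score, grid, target=0.0, max_evals=200)
print(best)  # {'filter_size': 4, 'dropout': 0.5}
```

Neither procedure uses the scores already observed to decide what to evaluate next, which is the gap the GP search is meant to fill.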
In contrast with the other methods mentioned above, a GP search chooses the hyperparameter combination to evaluate next by exploiting all previous evaluations. To achieve this, we assume the prior distribution on the function $f$ to be a Gaussian process, which allows us to construct a probabilistic model for $f$ using all previous evaluations, by calculating the posterior distribution in a tractable manner. Once the model for $f$ is computed, it is used to choose the most promising hyperparameter combination to evaluate next.
2.2.2 GP search
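In our case, $f$ is the F1-score on the test set evaluated for the ANN model using the given hyperparameter combination $x$, which is a 5-dimensional vector consisting of the filter size $h$, the number of filters $n$, the dropout rate $p$, and the history sizes $d_1$ and $d_2$.
Let $X$ and $Y$ be the training inputs and outputs, and $X_*$ and $Y_*$ the test inputs and outputs, respectively, with $Y = f(X)$ and $Y_* = f(X_*)$. Note that $Y$ is known, and $Y_*$ is unknown. The goal is to find the distribution of $Y_*$ given $X_*$, $X$, and $Y$ in order to select among $X_*$ the hyperparameter combination that is the most likely to yield the highest F1-score.
The joint distribution of $Y$ and $Y_*$ according to the prior is
$$\begin{bmatrix} Y \\ Y_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \boldsymbol{\mu} \\ \boldsymbol{\mu}_* \end{bmatrix}, \begin{bmatrix} K(X, X) & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{bmatrix} \right)$$
where $\boldsymbol{\mu}$ and $\boldsymbol{\mu}_*$ are the vectors of the means evaluated at all training and test points respectively, and $K(X, X_*)$ denotes the matrix of the covariances evaluated at all pairs of training and test points, and similarly for $K(X, X)$, $K(X_*, X)$, and $K(X_*, X_*)$.
Conditioning the joint Gaussian prior on the observations yields $Y_* \mid X_*, X, Y \sim \mathcal{N}(\boldsymbol{\mu}_{\text{post}}, \Sigma_{\text{post}})$, where
$$\boldsymbol{\mu}_{\text{post}} = \boldsymbol{\mu}_* + K(X_*, X)\, K(X, X)^{-1} (Y - \boldsymbol{\mu})$$
$$\Sigma_{\text{post}} = K(X_*, X_*) - K(X_*, X)\, K(X, X)^{-1} K(X, X_*)$$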
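The conditioning step is the standard GP regression update and can be sketched in NumPy as below. The function names `gp_posterior` and `se_kernel`, the constant prior mean, the small diagonal jitter added for numerical stability, and the toy observations are all assumptions made for this illustration.

```python
import numpy as np

def se_kernel(A, B, length_scale=1.0):
    """Squared exponential covariance between the rows of A and the rows of B."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)

def gp_posterior(X, Y, X_star, kernel, mean=0.0, jitter=1e-8):
    """Posterior mean and covariance of the outputs at X_star given observations (X, Y)."""
    K = kernel(X, X) + jitter * np.eye(len(X))     # jitter is an added numerical safeguard
    K_s = kernel(X_star, X)
    K_ss = kernel(X_star, X_star)
    # Solve linear systems rather than forming the inverse explicitly.
    post_mean = mean + K_s @ np.linalg.solve(K, Y - mean)
    post_cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    return post_mean, post_cov

X = np.array([[0.0], [1.0], [2.0]])               # evaluated combinations (toy, 1-dimensional)
Y = np.array([0.60, 0.70, 0.65])                  # hypothetical observed scores
X_star = np.array([[0.0], [1.0], [2.0], [1.5]])   # includes one unseen point
post_mean, post_cov = gp_posterior(X, Y, X_star, se_kernel, mean=Y.mean())
```

At points that were already evaluated, the posterior mean reproduces the observed score and the posterior variance collapses to nearly zero; at the unseen point the variance stays positive, reflecting remaining uncertainty.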
The choice of the kernel impacts predictions. We investigate 4 different kernels: linear, absolute exponential, squared exponential, and cubic.
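The exact kernel formulas are not recoverable from the text here, so the sketch below uses one common convention for correlation functions carrying these four names, with a single scale parameter `theta`. Both the functional forms and the parameter name should be read as assumptions rather than as the definitions used in the experiments.

```python
import numpy as np

def _scaled_abs_diff(A, B, theta):
    """theta * |a - b| for every pair of rows, kept per dimension."""
    return theta * np.abs(A[:, None, :] - B[None, :, :])

def absolute_exponential(A, B, theta=1.0):
    return np.exp(-_scaled_abs_diff(A, B, theta).sum(axis=-1))

def squared_exponential(A, B, theta=1.0):
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(-theta * (diff ** 2).sum(axis=-1))

def linear(A, B, theta=1.0):
    td = np.minimum(_scaled_abs_diff(A, B, theta), 1.0)
    return np.prod(1.0 - td, axis=-1)

def cubic(A, B, theta=1.0):
    td = np.minimum(_scaled_abs_diff(A, B, theta), 1.0)
    return np.prod(1.0 - td**2 * (3.0 - 2.0 * td), axis=-1)

kernels = [linear, absolute_exponential, squared_exponential, cubic]
A = np.array([[3.0, 0.1], [5.0, 0.9], [4.0, 0.5]])   # toy hyperparameter combinations
gram_matrices = {k.__name__: k(A, A, theta=0.5) for k in kernels}
```

Under this convention the linear and cubic forms drop to exactly zero beyond a finite distance, while the two exponential forms decay smoothly.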
To initialize the GP search, one needs to compute the F1-score for a certain number of randomly chosen hyperparameter combinations: we investigate what the optimal number is. We then iterate over the following two steps until a specified maximum number of iterations is reached. First, we find the hyperparameter combination in the test set with the highest F1-score predicted by the GP. Second, we compute its actual F1-score and move the combination to the training set. This process is outlined in Algorithm 1.
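The loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `gp_search`, `abs_exp_kernel`, and `toy_f1`, the constant prior mean set to the average observed score, the diagonal jitter, and the toy two-dimensional search space are all assumptions.

```python
import numpy as np

def abs_exp_kernel(A, B, scale=0.5):
    """Absolute exponential covariance between the rows of A and the rows of B."""
    return np.exp(-np.abs(A[:, None, :] - B[None, :, :]).sum(axis=-1) / scale)

def gp_search(f, candidates, n_init, n_iter, kernel, seed=0):
    """candidates: (N, d) array of hyperparameter combinations; f scores one row."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(candidates))
    train = list(order[:n_init])             # indices already evaluated
    test = list(order[n_init:])              # indices not yet evaluated
    scores = [f(candidates[i]) for i in train]
    for _ in range(n_iter):
        if not test:
            break
        X, Y, X_star = candidates[train], np.array(scores), candidates[test]
        mu = Y.mean()                        # constant prior mean (assumption)
        K = kernel(X, X) + 1e-6 * np.eye(len(X))
        predicted = mu + kernel(X_star, X) @ np.linalg.solve(K, Y - mu)
        idx = test.pop(int(np.argmax(predicted)))   # step 1: highest predicted score
        scores.append(f(candidates[idx]))           # step 2: compute the actual score
        train.append(idx)                           # ... and move it to the training set
    top = int(np.argmax(scores))
    return candidates[train[top]], scores[top]

def toy_f1(x):
    # Hypothetical smooth score landscape, standing in for training an ANN.
    return 70.0 - 5.0 * float(((x - 0.5) ** 2).sum())

grid = np.linspace(0.0, 1.0, 11)
candidates = np.array([(a, b) for a in grid for b in grid])   # 121 toy combinations
best_x, best_score = gp_search(toy_f1, candidates, n_init=5, n_iter=20, kernel=abs_exp_kernel)
```

Selecting the candidate with the highest predicted mean follows the description in the text; acquisition functions that also reward uncertainty are a common alternative but are not shown here.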
We evaluate the random and GP searches on the dialog act classification task using the Dialog State Tracking Challenge 4 (DSTC 4) [21, 22], ICSI Meeting Recorder Dialog Act (MRDA) [23, 24], and Switchboard Dialog Act (SwDA) [25] datasets. DSTC 4, MRDA, and SwDA respectively contain 32k, 109k, and 221k utterances, which are labeled with 89, 5, and 43 different dialog acts (we used the 5 coarse-grained dialog acts introduced in [26] for MRDA). The train/test splits are provided along with the datasets, and the validation set was chosen randomly except for MRDA, which specifies a validation set (see https://github.com/Franck-Dernoncourt/slt2016 for the train, validation, and test splits).
For a given hyperparameter combination, the ANN is trained to minimize the negative log-likelihood of assigning the correct dialog acts to the utterances in the training set, using stochastic gradient descent with the Adadelta update rule [27]. At each gradient descent step, weight matrices, bias vectors, and word vectors are updated. For regularization, dropout is applied after the pooling layer, and early stopping is used on the validation set with a patience of 10 epochs. We initialize the word vectors with the 300-dimensional word vectors pretrained with word2vec on Google News [28, 29] for DSTC 4, and the 200-dimensional word vectors pretrained with GloVe on Twitter [30] for SwDA.
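Early stopping with a patience of 10 epochs can be sketched as below. The function name, the callback interface, and the reading of "patience" as "stop once 10 consecutive epochs bring no improvement on the validation set" are assumptions; the validation scores are made up for the demonstration.

```python
def train_with_early_stopping(train_one_epoch, validation_score, patience=10, max_epochs=1000):
    """Run epochs until the validation score has not improved for `patience` epochs."""
    best_score, best_epoch = float("-inf"), -1
    for epoch in range(max_epochs):
        train_one_epoch()
        score = validation_score()
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_score, best_epoch, epoch + 1

# Hypothetical validation curve: improves for three epochs, then plateaus below its peak.
fake_scores = iter([0.5, 0.6, 0.7] + [0.65] * 50)
result = train_with_early_stopping(lambda: None, lambda: next(fake_scores))
print(result)  # (0.7, 2, 13)
```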
For each hyperparameter combination, the reported F1-score is averaged over 5 runs. Table 1 presents the hyperparameter search space.
| Hyperparameter | Values |
|---|---|
| Filter size | 3, 4, 5 |
| Number of filters | 50, 100, 250, 500, 1000 |
| Dropout rate | 0.1, 0.2, …, 0.9 |
| History size (first layer) | 1, 2, 3 |
| History size (second layer) | 1, 2, 3 |
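To give a sense of the size of this search space, the short sketch below enumerates it. The dictionary keys are illustrative names, and the dropout grid assumes a step of 0.1 between 0.1 and 0.9.

```python
import itertools

search_space = {
    "filter_size": [3, 4, 5],
    "n_filters": [50, 100, 250, 500, 1000],
    "dropout": [round(0.1 * i, 1) for i in range(1, 10)],   # 0.1, 0.2, ..., 0.9
    "history_size_1": [1, 2, 3],
    "history_size_2": [1, 2, 3],
}
combinations = list(itertools.product(*search_space.values()))
print(len(combinations))  # 1215
```

With 1215 combinations, an initial sample of 10 random points covers a little under 1% of the space.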
GP search finds near-optimal hyperparameters faster than random search. Figure 5 compares the GP searches with different kernels against the random search, which is a natural baseline for hyperparameter optimization algorithms [14]. On all datasets, the F1-score evaluated using the hyperparameters found by the GP search converges to near-optimal values significantly faster than the random search, regardless of the kernels used. For example, on SwDA, after computing the F1-scores for 100 different hyperparameter combinations, the GP search reaches on average 72.1, whereas the random search only obtains 71.4. The random search requires computing over 400 F1-scores to reach 72.1: the GP search therefore reduces the computational time by a factor of 4. This is a significant improvement considering that computing the average F1-scores over 5 runs for 300 extra hyperparameter combinations takes 60 days on a GeForce GTX Titan X GPU.
Squared exponential kernel converges more slowly than others. Even though the GP search with any kernel choice is faster than the random search, some kernels result in better performance than others. The best kernel choice depends on the dataset, but the squared exponential kernel (a.k.a. radial basis function kernel) consistently converges more slowly, as illustrated by Figure 5. Across the datasets, there were no consistent differences among the linear, absolute exponential, and cubic kernels.
The number of initial random points impacts performance. As mentioned in Section 2.2, the GP search starts with computing the F1-score for a certain number of randomly chosen hyperparameter combinations. Figure 9 shows the impact of this number on all three datasets. The optimal number seems to be around 10 on average, i.e., about 1% of the hyperparameter search space. When the number is very low (e.g., 2), the GP might fail to find the optimal hyperparameter combinations: it performs significantly worse on MRDA and SwDA. Conversely, when the number is very high (e.g., 50), it unnecessarily delays the convergence.
GP search often finds near-optimal hyperparameters quickly. After evaluating the F1-scores with 50 hyperparameter combinations, the GP search finds one of the 5 best hyperparameter combinations almost 80% of the time on SwDA, as shown in Figure 13, and even more frequently on DSTC 4 and MRDA. After computing 100 hyperparameter combinations, the GP search finds the best one over 70% of the time, while the random search stumbles upon it less than 10% of the time.
Simple heuristics may fail to find optimal hyperparameters. Compared to the previous state-of-the-art results that use the same model optimized manually [8], the GP search found better hyperparameters, improving the F1-score by 0.5, 0.1, and 0.7 on DSTC 4, MRDA, and SwDA, respectively. In [8], the hyperparameters were optimized by varying one hyperparameter at a time while keeping the other hyperparameters fixed. Figures 14 and 15 demonstrate that optimizing each hyperparameter independently might result in a suboptimal choice of hyperparameters. Figure 14 illustrates that the optimal choice of a hyperparameter is impacted by the choice of the other hyperparameters. For example, a higher number of filters works better with a smaller dropout probability, and conversely a lower number of filters yields better results when used with a larger dropout probability. Figure 15 shows that, for instance, if one had first fixed the number of filters to be 100 and optimized the dropout rate, one would have found that the optimal dropout rate is 0.5. Then, fixing the dropout rate at 0.5, one would have determined that 500 is the optimal number of filters, thereby obtaining an F1-score of 70.0, which is far from the best F1-score (70.7).
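The failure mode of one-at-a-time optimization can be reproduced on a small toy table. The grid of F1-scores below is hypothetical: it is constructed only to mirror the pattern described above (start at 100 filters, end at 70.0 while the best cell is 70.7), and the individual values are not taken from the experiments.

```python
# Hypothetical F1-scores indexed by (number of filters, dropout rate).
f1 = {
    (100, 0.2): 69.0, (100, 0.5): 69.6, (100, 0.8): 69.2,
    (500, 0.2): 70.2, (500, 0.5): 70.0, (500, 0.8): 69.1,
    (1000, 0.2): 70.7, (1000, 0.5): 69.8, (1000, 0.8): 68.5,
}
filter_counts = [100, 500, 1000]
dropout_rates = [0.2, 0.5, 0.8]

def one_at_a_time(start_filters):
    """Optimize dropout with the filter count fixed, then the filter count with dropout fixed."""
    best_dropout = max(dropout_rates, key=lambda p: f1[(start_filters, p)])
    best_filters = max(filter_counts, key=lambda n: f1[(n, best_dropout)])
    return best_filters, best_dropout, f1[(best_filters, best_dropout)]

print(one_at_a_time(100))                      # (500, 0.5, 70.0)
print(max(f1.items(), key=lambda kv: kv[1]))   # ((1000, 0.2), 70.7)
```

Because the best dropout rate shifts as the number of filters grows, a single pass over each hyperparameter settles on a combination that is not the best cell of the table.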
The faster convergence of the GP search may stem from the capacity of the GP to leverage the patterns in the F1-score landscape such as the one shown in Figure 15. The random search cannot make use of this regularity.
In this paper we addressed the commonly encountered issue of tuning ANN hyperparameters. Towards this purpose, we explored a strategy based on GP to automatically pinpoint optimal or near-optimal ANN hyperparameters. We showed that the GP search requires 4 times less computational time than random search on three datasets, and improves the state-of-the-art results by efficiently finding the optimal hyperparameter combinations. While the choices of the kernels and the number of initial random points impact the performance of the GP search, our findings show that it is more efficient than the random search regardless of these choices. The GP search can be used for any ordinal hyperparameter; it is therefore a useful technique when developing ANN models for NLP tasks.
-  Tomas Mikolov, Martin Karafiát, Lukas Burget, Jan Cernockỳ, and Sanjeev Khudanpur, “Recurrent neural network based language model,” in INTERSPEECH, 2010, vol. 2, p. 3.
-  Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa, “Natural language processing (almost) from scratch,” The Journal of Machine Learning Research, vol. 12, pp. 2493–2537, 2011.
-  Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer, “Neural architectures for named entity recognition,” arXiv preprint arXiv:1603.01360, 2016.
-  Matthieu Labeau, Kevin Löser, and Alexandre Allauzen, “Non-lexical neural architecture for fine-grained POS tagging,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, September 2015, pp. 232–237, Association for Computational Linguistics.
-  Richard Socher, Alex Perelygin, Jean Y Wu, Jason Chuang, Christopher D Manning, Andrew Y Ng, and Christopher Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” in Proceedings of the conference on empirical methods in natural language processing (EMNLP). Citeseer, 2013, vol. 1631, p. 1642.
-  Yoon Kim, “Convolutional neural networks for sentence classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2014, pp. 1746–1751.
-  Phil Blunsom, Edward Grefenstette, Nal Kalchbrenner, et al., “A convolutional neural network for modelling sentences,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014.
-  Ji Young Lee and Franck Dernoncourt, “Sequential short-text classification with recurrent and convolutional neural networks,” in Human Language Technologies 2016: The Conference of the North American Chapter of the Association for Computational Linguistics, NAACL HLT 2016, 2016.
-  Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov, “Towards AI-complete question answering: A set of prerequisite toy tasks,” arXiv preprint arXiv:1502.05698, 2015.
-  Di Wang and Eric Nyberg, “A long short-term memory model for answer sentence selection in question answering,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), Beijing, China, July 2015, pp. 707–712, Association for Computational Linguistics.
-  Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
-  Akihiro Tamura, Taro Watanabe, and Eiichiro Sumita, “Recurrent neural networks for word alignment model.,” in ACL (1), 2014, pp. 1470–1480.
-  Yoshua Bengio, “Practical recommendations for gradient-based training of deep architectures,” in Neural Networks: Tricks of the Trade, pp. 437–478. Springer, 2012.
-  James Bergstra and Yoshua Bengio, “Random search for hyper-parameter optimization,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 281–305, 2012.
-  Jasper Snoek, Hugo Larochelle, and Ryan P Adams, “Practical bayesian optimization of machine learning algorithms,” in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds., pp. 2951–2959. Curran Associates, Inc., 2012.
-  Christopher KI Williams and Carl Edward Rasmussen, “Gaussian processes for machine learning,” the MIT Press, vol. 2, no. 3, pp. 4, 2006.
-  James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl, “Algorithms for hyper-parameter optimization,” in Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, Eds., pp. 2546–2554. Curran Associates, Inc., 2011.
-  Franck Dernoncourt, Kalyan Veeramachaneni, and Una-May O’Reilly, “Gaussian process-based feature selection for wavelet parameters: Predicting acute hypotensive episodes from physiological signals,” in IEEE 28th International Symposium on Computer-Based Medical Systems, 2015.
-  Franck Dernoncourt, Elias Baedorf Kassis, and Mohammad Mahdi Ghassemi, “Hyperparameter selection,” in Secondary Analysis of Electronic Health Records, pp. 419–427. Springer International Publishing, 2016.
-  Andreas Stolcke, Klaus Ries, Noah Coccaro, Elizabeth Shriberg, Rebecca Bates, Daniel Jurafsky, Paul Taylor, Rachel Martin, Carol Van Ess-Dykema, and Marie Meteer, “Dialogue act modeling for automatic tagging and recognition of conversational speech,” Computational linguistics, vol. 26, no. 3, pp. 339–373, 2000.
-  Seokhwan Kim, Luis Fernando D’Haro, Rafael E. Banchs, Jason Williams, and Matthew Henderson, “Dialog State Tracking Challenge 4: Handbook,” 2015.
-  Seokhwan Kim, Luis Fernando D’Haro, Rafael E. Banchs, Jason Williams, and Matthew Henderson, “The Fourth Dialog State Tracking Challenge,” in Proceedings of the 7th International Workshop on Spoken Dialogue Systems (IWSDS), 2016.
-  Adam Janin, Don Baron, Jane Edwards, Dan Ellis, David Gelbart, Nelson Morgan, Barbara Peskin, Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke, et al., “The ICSI meeting corpus,” in Acoustics, Speech, and Signal Processing, 2003. Proceedings.(ICASSP’03). 2003 IEEE International Conference on. IEEE, 2003, vol. 1, pp. I–364.
-  Elizabeth Shriberg, Raj Dhillon, Sonali Bhagat, Jeremy Ang, and Hannah Carvey, “The ICSI meeting recorder dialog act (MRDA) corpus,” Tech. Rep., DTIC Document, 2004.
-  Dan Jurafsky, Elizabeth Shriberg, and Debra Biasca, “Switchboard SWBD-DAMSL shallow-discourse-function annotation coders manual,” Institute of Cognitive Science Technical Report, pp. 97–102, 1997.
-  Jeremy Ang, Yang Liu, and Elizabeth Shriberg, “Automatic dialog act segmentation and classification in multiparty meetings.,” in ICASSP (1), 2005, pp. 1061–1064.
-  Matthew D Zeiler, “Adadelta: An adaptive learning rate method,” arXiv preprint arXiv:1212.5701, 2012.
-  Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
-  Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
-  Jeffrey Pennington, Richard Socher, and Christopher D Manning, “GloVe: global vectors for word representation,” Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), vol. 12, pp. 1532–1543, 2014.