When applying machine learning to problems in NLP, there are many choices to make about how to represent input texts. These choices can have a big effect on performance, but they are often uninteresting to researchers or practitioners who simply need a module that performs well. We propose an approach to optimizing over this space of choices, formulating the problem as global optimization. We apply a sequential model-based optimization technique and show that our method makes standard linear models competitive with more sophisticated, expensive state-of-the-art methods based on latent variable models or neural networks on various topic classification and sentiment analysis problems. Our approach is a first step towards black-box NLP systems that work with raw text and do not require manual tuning.
NLP researchers and practitioners spend a considerable amount of time comparing machine-learned models of text that differ in relatively uninteresting ways. For example, in categorizing texts, should the “bag of words” include bigrams, and is tf-idf weighting a good idea? These choices matter experimentally, often leading to big differences in performance, with little consistency across tasks and datasets in which combination of choices works best. Unfortunately, these differences tell us little about language or the problems that machine learners are supposed to solve.
We propose that these decisions can be automated in a similar way to hyperparameter selection (e.g., choosing the strength of a ridge or lasso regularizer). Given a particular text dataset and classification task, we introduce a technique for optimizing over the space of representational choices, along with other “nuisances” that interact with these decisions, like hyperparameter selection. (Footnote 1: In §5 we argue that the technique is also applicable in unsupervised settings.) For example, using higher-order $n$-grams means more features and a need for stronger regularization and more training iterations. Generally, these decisions about instance representation are made by humans, heuristically; our work is the first to automate them.
Our technique instantiates sequential model-based optimization (SMBO; Hutter et al., 2011). SMBO and other Bayesian optimization approaches have been shown to work well for hyperparameter tuning [Bergstra et al. 2011, Hoffman et al. 2011, Snoek et al. 2012]. Though popular in computer vision [Bergstra et al. 2013], these techniques have received little attention in NLP.

We apply the technique to logistic regression on a range of topic and sentiment classification tasks. Consistently, our method finds representational choices that perform better than linear baselines previously reported in the literature, and that, in some cases, are competitive with more sophisticated nonlinear models trained using neural networks.
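Let the training data consist of a collection of pairs $d_{\mathrm{train}} = \langle \langle d.i_1, d.o_1 \rangle, \ldots, \langle d.i_n, d.o_n \rangle \rangle$, where each input $d.i \in \mathcal{I}$ is a text document and each output $d.o \in \mathcal{O}$, the output space. The overall training goal is to maximize a performance function $f$ (e.g., classification accuracy, log-likelihood, $F_1$ score, etc.) of a machine-learned model, on a held-out dataset, $d_{\mathrm{dev}} \in (\mathcal{I} \times \mathcal{O})^{n'}$.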
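Classification proceeds in three steps: first, $x : \mathcal{I} \rightarrow \mathbb{R}^N$ maps each input to a vector representation. Second, a classifier is learned from the inputs (now transformed into vectors) and outputs: $L : (\mathbb{R}^N \times \mathcal{O})^n \rightarrow (\mathbb{R}^N \rightarrow \mathcal{O})$. Finally, the resulting classifier $c : \mathcal{I} \rightarrow \mathcal{O}$ is fixed as $L(d_{\mathrm{train}}) \circ x$ (i.e., the composition of the representation function with the learned classifier).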
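Here we consider linear classifiers of the form

$$c(d.i) = \arg\max_{o \in \mathcal{O}} \mathbf{w}_o^{\top} x(d.i) \qquad (1)$$

where the coefficients $\mathbf{w}_o \in \mathbb{R}^N$, for each output $o$, are learned using logistic regression on the training data. We let $\mathbf{w}$ denote the concatenation of all $\mathbf{w}_o$. Hence the parameters can be understood as a function of the training data $d_{\mathrm{train}}$ and the representation function $x$. The performance function $f$, in turn, is a function of the held-out data $d_{\mathrm{dev}}$ and $x$, and also of $\mathbf{w}$ and $d_{\mathrm{train}}$, through $x$. For simplicity, we will write “$f(x)$” when the rest are clear from context.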
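As a concrete reading of Eq. 1, the following minimal sketch scores each output with a dot product between its weight vector and the document representation, then returns the highest-scoring output. The unigram-count `represent` function and the toy weights are illustrative assumptions, not the representation or the coefficients used in the experiments.

```python
from collections import Counter

def represent(text):
    """Hypothetical representation function x: unigram counts of downcased text."""
    return Counter(text.lower().split())

def classify(weights, text):
    """Return the output o whose weight vector w_o maximizes w_o . x(text)."""
    features = represent(text)

    def score(label):
        w = weights[label]
        return sum(w.get(token, 0.0) * count for token, count in features.items())

    return max(weights, key=score)

# Toy coefficients standing in for weights learned by logistic regression.
weights = {
    "pos": {"great": 1.5, "boring": -1.0},
    "neg": {"great": -1.2, "boring": 1.3},
}
print(classify(weights, "A great great film"))
```

Changing `represent` while keeping the learner fixed is exactly the kind of choice the rest of the paper optimizes over.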
Typically, $x$ is fixed by the model designer, perhaps after some experimentation, and learning focuses on selecting the parameters $\mathbf{w}$. For logistic regression and many other linear models, this training step reduces to convex optimization in $N|\mathcal{O}|$ dimensions, a solvable problem that is still costly for large datasets and/or large output spaces. In seeking to maximize $f$ with respect to $x$, we do not wish to carry out training any more times than necessary.
Choosing $x$ can be understood as a problem of selecting hyperparameter values. We therefore turn to Bayesian optimization, a family of techniques recently introduced for selecting hyperparameter values intelligently when solving for parameters ($\mathbf{w}$) is costly.
Our approach is based on sequential model-based optimization (SMBO; Hutter et al., 2011). It iteratively chooses representation functions $x$. On each round, it makes this choice through a nonparametrically estimated probabilistic model of $f$, then evaluates $f$ given that choice; we call this a “trial.” As in any iterative search algorithm, the goal is to balance exploration of options for $x$ with exploitation of previously explored options, so that a good choice is found in a small number of trials. See Algorithm 1.

More concretely, in the $t$th trial, $x_t$ is first selected using an acquisition function $A$ and a “surrogate” probabilistic model $p_t$. Second, $f$ is evaluated given $x_t$, an expensive operation which involves training to select parameters $\mathbf{w}$ and assessing performance on the held-out data. Third, the probabilistic model is updated using a nonparametric estimator.
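The three steps of a trial can be sketched as a short loop. This is only an illustration: `smbo`, `propose_random`, and `toy_objective` are hypothetical names, and the proposal step here is plain random sampling standing in for the acquisition function and surrogate model.

```python
import random

def smbo(evaluate, propose, num_trials, seed=0):
    """Run a sequential search and return the best (x, y) pair found."""
    rng = random.Random(seed)
    history = []  # (x, y) pairs from earlier trials
    for _ in range(num_trials):
        x = propose(history, rng)  # step 1: choose x_t from what has been seen so far
        y = evaluate(x)            # step 2: the expensive evaluation of f(x_t)
        history.append((x, y))     # step 3: the surrogate would be re-estimated from history
    return max(history, key=lambda pair: pair[1])

def propose_random(history, rng):
    """Placeholder proposal that ignores history (an assumption, not TPE)."""
    return rng.random()

def toy_objective(x):
    """Cheap stand-in for training plus held-out evaluation; peaks at x = 0.3."""
    return -(x - 0.3) ** 2

best_x, best_y = smbo(toy_objective, propose_random, num_trials=30)
print(best_x, best_y)
```

A model-based method differs only in `propose`, which would use the history to favor promising or uncertain regions instead of sampling blindly.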
We next describe the acquisition function and the surrogate model used in our experiments.
A good acquisition function returns high values for $x$ such that either the value $f(x)$ is predicted to be high, or uncertainty about $f(x)$’s value is high; balancing between these is the classic tradeoff between exploitation and exploration. We use a criterion called Expected Improvement (EI; Jones, 2001), which is the expectation (under the current surrogate model $p_t$) that the choice $y = f(x)$ will exceed $y^*$:

$$A(x; p_t, y^*) = \int_{-\infty}^{\infty} \max(y - y^*, 0)\, p_t(y \mid x)\, dy$$

where $y^*$ is chosen depending on the surrogate model, discussed below. (For now, think of it as a strongly performing “benchmark” value of $f(x)$, discovered in earlier iterations.) Other options for the acquisition function include maximum probability of improvement [Jones 2001], minimum conditional entropy [Villemonteix et al. 2006], Gaussian process upper confidence bound [Srinivas et al. 2010], or a combination of them [Hoffman et al. 2011]. We selected EI because it is the most widely used acquisition function that has been shown to work well on a range of tasks.

As a surrogate model, we use a tree-structured Parzen estimator (TPE; Bergstra et al., 2011). This is a nonparametric approach to density estimation. We seek to estimate $p_t(y \mid x)$ where $y = f(x)$, the performance function that is expensive to compute exactly. The TPE approach is as follows:
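$$p_t(y \mid x) = \frac{p(y)\, p_t(x \mid y)}{p_t(x)}, \qquad p_t(x \mid y) = \begin{cases} p^{<}_t(x) & \text{if } y < y^* \\ p^{\geq}_t(x) & \text{if } y \geq y^* \end{cases}$$

where $p^{<}_t$ and $p^{\geq}_t$ are densities estimated using observations from previous trials whose values are less than, and greater than or equal to, $y^*$, respectively. In TPE, $y^*$ is defined as some quantile of the observed $y$ values; we use 15-quantiles.

As shown by Bergstra et al. (2011), the Expected Improvement in TPE can be written as:

$$A(x; p_t, y^*) \propto \left( \gamma + \frac{p^{<}_t(x)}{p^{\geq}_t(x)} (1 - \gamma) \right)^{-1} \qquad (2)$$

where $\gamma = p_t(y \geq y^*)$ is fixed by the definition of $y^*$ (above). Here, we prefer $x$ with high probability under $p^{\geq}_t(x)$ and low probability under $p^{<}_t(x)$. To maximize this quantity, we draw many candidates according to $p^{\geq}_t(x)$ and evaluate them according to $p^{<}_t(x) / p^{\geq}_t(x)$. Note that $p(y)$ does not need to be given an explicit form.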
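To see where the density ratio in Eq. 2 comes from, the derivation of Bergstra et al. (2011) can be retraced for an objective that is maximized. The sketch below assumes that $p_t(x \mid y)$ equals $p^{<}_t(x)$ for $y < y^*$ and $p^{\geq}_t(x)$ otherwise, and writes $\gamma = p_t(y \geq y^*)$ for the probability mass at or above the threshold; the constant $C$ is introduced here only for illustration.

```latex
\begin{align*}
p_t(x) &= \int p_t(x \mid y)\, p(y)\, dy
        = (1-\gamma)\, p^{<}_t(x) + \gamma\, p^{\geq}_t(x), \\
A(x; p_t, y^*) &= \int_{y^*}^{\infty} (y - y^*)\, \frac{p(y)\, p^{\geq}_t(x)}{p_t(x)}\, dy
        = \frac{C\, p^{\geq}_t(x)}{(1-\gamma)\, p^{<}_t(x) + \gamma\, p^{\geq}_t(x)},
        \qquad C = \int_{y^*}^{\infty} (y - y^*)\, p(y)\, dy, \\
        &\propto \left( \gamma + \frac{p^{<}_t(x)}{p^{\geq}_t(x)}\, (1-\gamma) \right)^{-1}.
\end{align*}
```

Because $C$ does not depend on $x$, ranking candidates only requires the ratio of the two densities, which is why $p(y)$ never needs an explicit form.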
In order to evaluate Eq. 2, we need to compute $p^{<}_t(x)$ and $p^{\geq}_t(x)$. These joint distributions depend on the graphical model of the hyperparameter space, which is allowed to form a tree structure.
We discuss how to compute $p^{<}_t(x)$ in the following. $p^{\geq}_t(x)$ is computed similarly, using trials where $y \geq y^*$. We associate each hyperparameter with a node in the graphical model; consider the $i$th dimension of $x$, denoted by random variable $X^i$.

If $X^i$ ranges over a discrete set $\mathcal{X}$, TPE uses a reweighted categorical distribution, where the probability that $X^i = v$ is proportional to a smoothing parameter plus the count of occurrences of $v$ among the previous trials with $y < y^*$.

When $X^i$ is continuous-valued, TPE constructs a probability distribution by placing a truncated Gaussian distribution centered at each of the values observed in previous trials with $y < y^*$, with standard deviation set to the greater of the distances to the left and right neighbors.
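For the continuous case, a minimal sketch of such a density is an equal-weight mixture of Gaussians truncated to the allowed range. The function names, the use of the range bounds as neighbors for the smallest and largest observations, and the small floor on the standard deviation are assumptions made for this example; the TPE implementation in HPOlib may differ in these details.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def parzen_density(value, observed, low, high):
    """Density at `value` of a mixture of Gaussians truncated to [low, high].

    One component is centered at each observation; its standard deviation is
    the greater of the distances to its left and right neighbors.
    """
    points = sorted(observed)
    padded = [low] + points + [high]  # range bounds act as outer neighbors
    total = 0.0
    for i, mu in enumerate(points, start=1):
        sigma = max(mu - padded[i - 1], padded[i + 1] - mu, 1e-12)
        # Probability mass of the untruncated Gaussian that falls inside the range.
        mass = normal_cdf((high - mu) / sigma) - normal_cdf((low - mu) / sigma)
        total += normal_pdf((value - mu) / sigma) / (sigma * mass)
    return total / len(points)

observed = [0.2, 0.5, 0.6]
print(parzen_density(0.5, observed, 0.0, 1.0))
print(parzen_density(0.95, observed, 0.0, 1.0))
```

Dividing each component by its in-range mass keeps the mixture a proper density on the interval, so values near clusters of earlier observations receive more probability than values far from them.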
In the simplest version, each node is independent, so we can compute $p^{<}_t(x)$ by multiplying individual probabilities at every node. In the tree-structured version, we only multiply probabilities along the relevant path, excluding some nodes.
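For discrete hyperparameters with independent nodes, the two densities and their ratio can be computed directly. The sketch below is illustrative: the trial history and accuracies are made up, the smoothing value of 1.0 is an assumption, and candidates are enumerated exhaustively rather than sampled because the toy space is tiny.

```python
from collections import Counter
from itertools import product

def categorical_density(observed, choices, smoothing=1.0):
    """P(v) proportional to a smoothing constant plus the count of v."""
    counts = Counter(observed)
    total = smoothing * len(choices) + len(observed)
    return {v: (smoothing + counts[v]) / total for v in choices}

def joint_density(config, trials, space, smoothing=1.0):
    """Independent nodes: multiply the per-hyperparameter probabilities."""
    prob = 1.0
    for name, choices in space.items():
        density = categorical_density([t[name] for t in trials], choices, smoothing)
        prob *= density[config[name]]
    return prob

def rank_candidates(history, space, y_star):
    """Sort all configurations by p_below / p_at_or_above (lower is better)."""
    below = [x for x, y in history if y < y_star]
    at_or_above = [x for x, y in history if y >= y_star]
    scored = []
    for values in product(*space.values()):
        config = dict(zip(space.keys(), values))
        ratio = joint_density(config, below, space) / joint_density(config, at_or_above, space)
        scored.append((ratio, config))
    return sorted(scored, key=lambda item: item[0])

space = {"weighting": ["tf", "tf-idf", "binary"], "remove_stop": [True, False]}
history = [  # made-up (configuration, accuracy) pairs
    ({"weighting": "binary", "remove_stop": False}, 0.90),
    ({"weighting": "binary", "remove_stop": True}, 0.88),
    ({"weighting": "tf", "remove_stop": True}, 0.80),
    ({"weighting": "tf-idf", "remove_stop": True}, 0.82),
]
best_ratio, best_config = rank_candidates(history, space, y_star=0.85)[0]
print(best_config, best_ratio)
```

The smoothing term keeps every choice at nonzero probability in both densities, so the ratio is always defined and unseen configurations can still be proposed.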
Another common approach to the surrogate is the Gaussian Process [Rasmussen and Williams 2006, Hoffman et al. 2011, Snoek et al. 2012]. As Bergstra et al. (2011) also found, the TPE performed favorably in our preliminary experiments. Further, TPE’s tree-structured configuration space is advantageous, because it allows nested definitions of hyperparameters, which we exploit in our experiments (e.g., only allowing bigrams to be chosen if unigrams are also chosen).
Because research on SMBO is active, many implementations are publicly available; we use the HPOlib library [Eggensperger et al. 2013] (Footnote 2: http://www.automl.org/hpolib.html). The library takes as input a function $f$, which is treated as a black box (in our case, a logistic regression trainer that wraps the LIBLINEAR library [Fan et al. 2008], based on the trust region Newton method [Lin et al. 2008]), and a specification of hyperparameters.
Our experiments consider representational choices and hyperparameters for several text categorization problems.
We fix our learner to logistic regression. We optimize text representation based on the types of $n$-grams used, the type of weighting scheme, and the removal of stopwords. For $n$-grams, we have two parameters, minimum and maximum lengths ($n_{\min}$ and $n_{\max}$). (All $n$-gram lengths between the minimum and maximum, inclusive, are used.) For weighting scheme, we consider term frequency, tf-idf, and binary schemes. Last, we also choose whether we should remove stopwords before constructing feature vectors for each document.
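These representational choices can be made concrete with a small feature extractor. This is a sketch under stated assumptions: the stopword list is a tiny illustrative one, tokenization is whitespace splitting after downcasing, and tf-idf uses the common $\log(N/\mathrm{df})$ variant, since the text does not say which variant was used.

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "is"}  # tiny illustrative list (assumption)

def extract_ngrams(tokens, n_min, n_max):
    """All n-grams with n_min <= n <= n_max, joined by spaces."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def featurize(docs, n_min, n_max, weighting, remove_stop=False):
    """Map each document to a sparse feature dict under the chosen options."""
    bags = []
    for doc in docs:
        tokens = doc.lower().split()
        if remove_stop:
            tokens = [t for t in tokens if t not in STOPWORDS]
        bags.append(Counter(extract_ngrams(tokens, n_min, n_max)))
    if weighting == "tf":
        return [dict(bag) for bag in bags]
    if weighting == "binary":
        return [{gram: 1 for gram in bag} for bag in bags]
    if weighting == "tf-idf":
        doc_freq = Counter()
        for bag in bags:
            doc_freq.update(bag.keys())
        n_docs = len(bags)
        return [
            {gram: count * math.log(n_docs / doc_freq[gram]) for gram, count in bag.items()}
            for bag in bags
        ]
    raise ValueError("unknown weighting scheme: " + weighting)

docs = ["The cat sat", "the dog sat"]
print(featurize(docs, 1, 2, "binary", remove_stop=True)[0])
```

Each combination of `n_min`, `n_max`, `weighting`, and `remove_stop` yields a different feature space, and therefore a different training problem for the same learner.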
Furthermore, the choice of representation interacts with the regularizer and the training convergence criterion (e.g., more $n$-grams means slower training time). We consider two regularizers, the $\ell_1$ penalty [Tibshirani 1996] or the squared $\ell_2$ penalty [Hoerl and Kennard 1970]. We also have hyperparameters for regularization strength and training convergence tolerance. See Table 1 for a complete list of hyperparameters in our experiments.
Note that even with this limited number of options, the number of possible combinations is huge (it is actually infinite since the regularization strength and convergence tolerance are continuous values, although we can also use sets of possible values), so exhaustive search is computationally expensive. In all our experiments for all datasets, we limit ourselves to 30 trials per dataset. The only preprocessing we applied was downcasing (see §5 for discussion about this).
We always use a development set to evaluate $f(x)$ during learning and report the final result on an unseen test set.
Hyperparameter  Values 

weighting scheme  {tf, tf-idf, binary}
remove stop words?  {True, False}
regularization  {$\ell_1$, $\ell_2$}
regularization strength  
convergence tolerance 
We evaluate our method on five text categorization tasks.
Stanford sentiment treebank [Socher et al. 2013]: a sentence-level sentiment analysis dataset for movie reviews from the rottentomatoes.com website. We use the binary classification task where the goal is to predict whether a review is positive or negative (no neutral reviews). We obtained this dataset from http://nlp.stanford.edu/sentiment.
Electronics product reviews from Amazon [McAuley and Leskovec 2013]: this dataset consists of electronic product reviews, which is a subset of a large Amazon review dataset. Following the setup of Johnson and Zhang (2014), we only use the text section and ignore the summary section. We also only consider positive and negative reviews. We obtained this dataset from http://riejohnson.com/cnn_data.html.
IMDB movie reviews [Maas et al. 2011]: a binary sentiment analysis dataset of highly polar IMDB movie reviews, obtained from http://ai.stanford.edu/~amaas//data/sentiment.
Congressional vote [Thomas et al. 2006]: transcripts from the U.S. Congressional floor debates. The dataset only includes debates for controversial bills (the losing side has at least 20% of the speeches). Similar to previous work [Thomas et al. 2006, Yessenalina et al. 2010], we consider the task to predict the vote (“yea” or “nay”) for the speaker of each speech segment (speaker-based speech-segment classification). We obtained it from http://www.cs.cornell.edu/~ainur/sledata.html.
20 Newsgroups [Lang 1995]: the 20 Newsgroups dataset is a benchmark topic classification dataset; we use the publicly available copy at http://qwone.com/~jason/20Newsgroups. There are 20 topics in this dataset. We derived four topic classification tasks from this dataset. The first task is to classify documents across all 20 topics. The second task is to classify related science documents into four science topics (sci.crypt, sci.electronics, sci.space, sci.med). (Footnote 3: We were not able to find previous results that are comparable to ours on the second task; we include them to enable further comparisons in the future.) The third and fourth tasks are talk.religion.misc vs. alt.atheism and comp.graphics vs. comp.windows.x. To consider a more realistic setting, we removed header information from each article since it often contains label information.
Dataset  Training  Dev.  Test 

Stanford sentiment  6,920  872  1,821 
Amazon electronics  20,000  5,000  25,000 
IMDB reviews  20,000  5,000  25,000 
Congress vote  1,175  113  411 
20N all topics  9,052  2,262  7,532 
20N all science  1,899  474  1,579 
20N atheist.religion  686  171  570 
20N x.graphics  942  235  784 
These are standard datasets for evaluating text categorization models, where benchmark results are available. In total, we have eight tasks, of which four are sentiment analysis tasks and four are topic classification tasks. See Table 2 for descriptive statistics of our datasets.
Dataset | Acc. | $n_{\min}$ | $n_{\max}$ | Weighting | Stop. | Reg. | Strength | Conv.

Stanford sentiment | 82.43 | 1 | 2 | tf-idf | F |  | 10 | 0.098
Amazon electronics | 91.56 | 1 | 3 | binary | F |  | 120 | 0.022
IMDB reviews | 90.85 | 1 | 2 | binary | F |  | 147 | 0.019
Congress vote | 78.59 | 2 | 2 | binary | F |  | 121 | 0.012
20N all topics | 87.84 | 1 | 2 | binary | F |  | 16 | 0.008
20N all science | 95.82 | 1 | 2 | binary | F |  | 142 | 0.007
20N atheist.religion | 86.32 | 1 | 2 | binary | T |  | 41 | 0.011
20N x.graphics | 92.09 | 1 | 1 | binary | T |  | 91 | 0.014
For each dataset, we select supervised, non-ensemble classification methods from previous literature as baselines. In each case, we emphasize comparisons with the best published linear method (often an SVM with a linear kernel with representation selected by experts) and the best published method overall. In the following, “SVM” always means “linear SVM”. All methods were trained and evaluated on the same training/testing data splits; in cases where standard development sets were not available, we used a random 20% of the training data as a development set.
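Holding out a random 20% of the training data can be sketched as follows. The function name, the fixed seed, and rounding the development-set size down are assumptions for illustration; the text does not describe how the split was drawn.

```python
import random

def split_train_dev(examples, dev_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold out a fraction as development data."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)  # rounds down
    return shuffled[n_dev:], shuffled[:n_dev]

train, dev = split_train_dev(range(100))
print(len(train), len(dev))
```

With rounding down, 11,314 labeled examples yield a development set of 2,262, which is consistent with the 20N all topics row of Table 2.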
We summarize the hyperparameters selected by our method, and the accuracies achieved (on test data) in Table 3. We discuss comparisons to baselines for each dataset in turn.
On the Stanford sentiment treebank, our logistic regression model outperforms the baseline SVM reported by Socher et al. (2013), who used only unigrams but did not specify the weighting scheme for their SVM baseline. While our result is still below the state of the art based on recursive neural tensor networks [Socher et al. 2013] and the paragraph vector [Le and Mikolov 2014], we show that logistic regression is comparable with recursive and matrix-vector neural networks [Socher et al. 2011, Socher et al. 2012].

Method  Acc.

Naïve Bayes  81.8 
SVM  79.4 
Vector average  80.1 
Recursive neural networks  82.4 
LR (this work)  82.4 
Matrix-vector RNN  82.9
Recursive neural tensor networks  85.4 
Paragraph vector  87.8 

The best-performing methods on the Amazon electronics dataset are based on convolutional neural networks [Johnson and Zhang 2014]. (Footnote 4: These are fully connected neural networks with a rectifier activation function, trained under regularization with stochastic gradient descent.) Our method is on par with the second-best of these, outperforming all of the reported feed-forward neural networks and SVM variants Johnson and Zhang used as baselines. They varied the representations, and used log term frequency and normalization to unit vectors as the weighting scheme, after finding that this outperformed term frequency. Our method achieved the best performance with binary weighting, which they did not consider.
Method  Acc. 

SVM-unigrams  88.62
SVM-$n$-grams  90.70
SVM-$n$-grams  90.68
NN-unigrams  88.94
NN-$n$-grams  91.10
NN-$n$-grams  91.24
LR (this work)  91.56 
Bag of words CNN  91.58 
Sequential CNN  92.22 
On IMDB reviews, the results parallel those for Amazon electronics; our method comes close to convolutional neural networks [Johnson and Zhang 2014], which are state-of-the-art. (Footnote 5: As noted, semi-supervised and ensemble methods are excluded for a fair comparison.) It outperforms SVMs and feed-forward neural networks, the restricted Boltzmann machine approach presented by Dahl et al. (2012), and compressive feature learning [Paskov et al. 2013]. (Footnote 6: This approach is based on minimum description length, using unlabeled data to select a set of higher-order $n$-grams to use as features. It is technically a semi-supervised method. The results we compare to use logistic regression with elastic net regularization and heuristic normalizations.)

Method  Acc.

SVM-unigrams  88.69
SVM-$n$-grams  89.83
SVM-$n$-grams  89.62
RBM  89.23
NN-unigrams  88.95
NN-$n$-grams  90.08
NN-$n$-grams  90.31
Compressive feature learning  90.40
LR-$n$-grams  90.60
LR (this work)  90.85 
Bag of words CNN  91.03 
Sequential CNN  91.26 
On the Congressional vote dataset, our method outperforms the best reported results of Yessenalina et al. (2010), which use a multi-level structured model based on a latent-variable SVM. We show comparisons to two well-known but weaker baselines, as well.
Method  Acc. 

SVM-link  71.28
Mincut  75.00
SVM-SLE  77.67
LR (this work)  78.59 
On the 20N all topics task, our method outperforms state-of-the-art methods including the distributed structured output model [Srikumar and Manning 2014]. (Footnote 7: This method was designed for structured prediction, but Srikumar and Manning (2014) also applied it to classification. It attempts to learn a distributed representation for features and for labels. The authors used unigrams and did not elaborate the weighting scheme.) The strong logistic regression baseline from Paskov et al. (2013) uses all 5-grams, heuristic normalization, and elastic net regularization; our method found that unigrams and bigrams, with binary weighting and penalty, achieved far better results.

Method  Acc.

Discriminative RBM  76.20 
Compressive feature learning  83.00 
LR-$n$-grams  82.80
Distributed structured output  84.00 
LR (this work)  87.84 
Wang and Manning (2012) report a bigram naïve Bayes model achieving 85.1% and 91.2% on the 20N atheist.religion and x.graphics tasks, respectively. (Footnote 8: They also report a naïve Bayes/SVM ensemble achieving 87.9% and 91.2%.) Our method achieves 86.3% and 92.1% using slightly different setups (see Table 3).
Our results suggest that seemingly mundane representation choices can raise the performance of simple linear models to be comparable with much more sophisticated models. Achieving these results is not a matter of deep expertise about the domain or engineering skill; the choices can be automated. Our experiments only considered logistic regression with downcased text; more choices (stemming, count thresholding, normalization of numbers, etc.) can be offered to the optimizer, as can additional feature options like gappy $n$-grams.
As NLP becomes more widely used in applications, we believe that automating these choices will be very attractive for those who need to train a high-performance model quickly.
For each task, the chosen representation is different. Out of all possible hyperparameter choices in our experiments (Table 1), each of them is used by at least one of the datasets (Table 3). For example, on the Congressional Vote dataset, we only need to use bigrams, whereas on the Amazon electronics dataset we need to use unigrams, bigrams, and trigrams. The binary weighting scheme works well for most of the datasets, except the sentence-level sentiment analysis task, where the tf-idf weighting scheme was selected. The same regularizer was best in all cases but one.
We do not believe that an NLP expert would be likely to make these particular choices, except through the same kind of trial-and-error process our method automates efficiently. Often, we believe, researchers in NLP make initial choices and stick with them through all experiments (as we have admittedly done with logistic regression). Optimizing over more of these choices will give stronger baselines.
We ran 30 trials for each dataset in our experiments. Figure 1 shows each trial accuracy and the best accuracy on development data as we increase the number of trials for three datasets. We can see that 30 trials are generally enough for the model to obtain good results, although the search space is large.
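The best-so-far curve described here is a running maximum over the per-trial development accuracies. A minimal sketch, with made-up accuracies and a hypothetical function name:

```python
from itertools import accumulate

def best_so_far(trial_accuracies):
    """Running maximum: the best development accuracy seen up to each trial."""
    return list(accumulate(trial_accuracies, max))

print(best_so_far([0.80, 0.78, 0.85, 0.83]))
```

A curve that flattens well before the last trial suggests that the trial budget was sufficient for that dataset.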
In the presence of unlimited computational resources, Bayesian optimization is slower than grid search on all hyperparameters, since the latter is easy to parallelize. Unlimited resources are not realistic in most research and development environments, and they are certainly impractical in increasingly widespread instances of personalized machine learning. The Bayesian optimization approach that we use in our experiments is performed sequentially. It attempts to predict what set of hyperparameters we should try next based on information from previous trials. There has been work to parallelize Bayesian optimization, making it possible to leverage the power of multicore architectures [Snoek et al. 2012, Desautels et al. 2012, Hutter et al. 2012].
We treat each dataset independently and create a separate model for each of them. It is also possible to learn from previous datasets (i.e., transfer learning) or to learn from all datasets simultaneously (i.e., multitask learning) to improve performance. This has the potential to reduce the number of trials required even further. See Bardenet et al. (2013), Swersky et al. (2013), and Yogatama and Mann (2014) for how to perform Bayesian optimization in these settings.
We use logistic regression as our classification model, and our experiments show how simple linear models can be competitive with more sophisticated models given the right representation. Other models can be considered, of course, as can ensembles [Yogatama and Mann 2014]. Increasing the number of options may lead to a need for more trials, and evaluating $f(x)$ (e.g., training the neural network) will take longer for more sophisticated models. We have demonstrated, using one of the simplest classification models (logistic regression), that even simple choices about text representation can matter quite a lot.
Our framework could also be applied to structured prediction problems. For example, in part-of-speech tagging, the set of features can include character $n$-grams, word shape features, and word type features. The optimal choice for different languages is not always the same; our approach can automate this process.
Our framework could also be extended to unsupervised and semi-supervised models. For example, in document clustering (e.g., $k$-means), we also need to construct representations for documents. Log-likelihood might serve as a performance function. A range of random initializations might be considered. Investigation of this approach for non-convex problems like clustering is an exciting area for future work.
We used a Bayesian optimization approach to optimize choices about text representations for various categorization problems. Our sequential model-based optimization technique identifies settings for a standard linear model (logistic regression) that are competitive with far more sophisticated state-of-the-art methods on topic classification and sentiment analysis. Every task and dataset has its own optimal choices; though relatively uninteresting to researchers and not directly linked to domain or linguistic expertise, these choices have a big effect on performance. We see our approach as a first step towards black-box NLP systems that work with raw text and do not require manual tuning.
This work was supported by the Defense Advanced Research Projects Agency through grant FA87501420244 and computing resources provided by Amazon.
[Socher et al. 2011] Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proc. of EMNLP.