I Introduction
Hyper parameter tuning is an integral part of building deep learning models. State of the art models are often benchmarked on a small set of datasets, such as Penn Treebank [1], WikiText, GigaWord, MNIST and CIFAR10. The hyper parameter values reported for these datasets are, however, not directly applicable to other, use-case-specific datasets.
Advances in deep learning research, including its applications to Natural Language Processing (NLP), are correlated with the introduction of an increasing number of new strategies for regularization and optimization of neural networks. These strategies more often than not introduce new hyper parameters, compounding the challenge of hyper parameter tuning, even more so if hyper parameter values are overly sensitive to the dataset. As a consequence, reproducing state of the art neural models on a unique dataset would require a significant hyper parameter search, limiting the reach of these models to parties with significant computing resources.
We present work done to understand the effect of the selected set of hyper parameters on the perplexity (the exponential of the average negative log-likelihood of predicting the next word in a sequence [2]) of a Neural Language Model (NLM). Given baseline hyper parameter values for benchmark datasets, we apply hyper parameter search methods to the modeling of code-mixed text. Code-mixed text is text that draws on elements of two or more grammatical systems [3]. It is common in countries in which multiple languages coexist. In this work we assess the performance of the AWD-LSTM model [4] for language modeling, to better understand how relevant the published hyper parameters are for a code-mixed corpus and to isolate which hyper parameters could be further tuned to improve performance.
Our results show that, as a whole, the set of hyper parameters considered the best in [4] is reasonably good; however, there are better sets of hyper parameters for the code-mixed corpora. Moreover, even with the best set of hyper parameters, the perplexity observed for our data is significantly higher (i.e. performance is worse at the task of language modeling) than the performance demonstrated in the literature. Finally, our approach not only enables confirmation of the goodness of the hyper parameter values, but also lets us develop inferences about which hyper parameter values would yield better results.
II Background
Table I: Default values of the individual hyper parameters
emsize  nhid  nlayers  dropout  dropoute  dropouth  dropouti  wdrop  bptt  clip  lr
300     1150  3        0.4      0.1       0.3       0.65      0.5    70    0.25  30
Table II: Potential values for each hyper parameter, listed in search order
Hyper parameter  Potential values
emsize           [300, 350, 400, 450]
nhid             [950, 1050, 1150, 1250]
nlayers          [2, 3, 4, 5]
dropout          [0.3, 0.4, 0.5, 0.6]
dropoute         [0.3, 0.4, 0.5, 0.6]
dropouth         [0.3, 0.4, 0.5, 0.6]
dropouti         [0.3, 0.4, 0.5, 0.6]
wdrop            [0.3, 0.4, 0.5, 0.6]
bptt             [50, 60, 70, 80]
clip             [0.15, 0.25, 0.35, 0.45]
lr               [20, 30, 40, 45]
Deep learning has found success in various applications, including natural language processing tasks such as language modeling, part-of-speech tagging, summarization and many others. The learning performance of deep neural networks, however, depends on systematic tuning of the hyper parameters. As such, finding optimal hyper parameters is an integral part of building neural models, including neural language models.
Recurrent neural networks (RNNs), being well suited to dealing with sequences of vectors, have found much success in NLP by leveraging the sequential structure of language. A variant of RNNs known as Long Short-Term Memory networks (LSTMs) [5] has been particularly widely used and stands as the state of the art technique for language modeling on benchmark datasets such as Penn Treebank (PTB) [1] and One Billion Words [6], among others. Language models (LMs) are valuable by themselves because well trained LMs improve the underlying metrics of downstream tasks, such as word error rate for speech recognition and BLEU score for translation. In addition, LMs compactly extract knowledge encoded in training data [7].
The current state of the art on modeling both the PTB and WikiText 2 [8] datasets, as reported in [4], shows little sensitivity to hyper parameters, sharing almost all hyper parameter values between both datasets. In [9], it is also shown that a deep learning model can jointly learn a number of large-scale tasks from multiple domains by designing a multi-modal architecture in which as many parameters as possible are shared.
Training and evaluating a neural network involves mapping the hyper parameter configuration (the set of values for each hyper parameter) used in training the network to the validation error obtained at the end. Strategies for searching for an optimal configuration that have been applied with considerable success include grid search, random search, Bayesian optimization, Sequential Model-based Bayesian Optimization (SMBO) [10], the deterministic hyper parameter optimization method proposed in [11], which employs radial basis functions as error surrogates, and Gaussian Process Batch Upper Confidence Bound (GP-BUCB) [12], an upper confidence bound-based algorithm that models the reward function as a sample from a Gaussian process. In [13], the authors propose initializing Bayesian hyper parameter optimization using meta-learning. The idea is to initialize the configuration space for a novel dataset based on configurations that are known to perform well on similar, previously evaluated datasets.
Following a meta-learning approach, we apply a genetic algorithm and a sequential search algorithm, described in the next section and initialized using the best configuration reported in [4], to search the space around the optimal hyper parameters for the AWD-LSTM model. Tweets collected using a geolocation filter for Nigeria and Kenya, with the goal of acquiring a code-mixed text corpus, serve as our evaluation datasets. We report the test perplexity distributions of the evaluated configurations and draw inferences on the sensitivity of each hyper parameter to our unique datasets.
III Methodology
We begin our work by establishing the baseline and current state of the art model for a language modeling task [4]. Applying the AWD-LSTM model, based on the open-sourced code and trained on code-mixed Twitter data, we sample 84 different hyper parameter configurations for each dataset. We then evaluate the resulting test perplexity distributions while varying individual hyper parameter values, to understand the effect of the selected set of hyper parameter values on the model perplexity.
III-A Datasets
Two sources of data are collected using the Twitter streaming API with a geolocation filter set to the geo-coordinates of Kenya and Nigeria. The resulting data is code-mixed, with the Kenya corpus (Dataset 1) containing several mixes of English and Swahili, both official languages in Kenya. The Nigeria data (Dataset 2), on the other hand, does not predominantly contain mixes of English with another language in the same sentence. Rather, English is often completely rewritten in a pidgin form. The training data for Kenya and Nigeria contain 13,185 words and 27,855 words respectively. All tweets are stripped of mentions and hashtags and converted to lowercase.
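The preprocessing described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the regular expressions, the function name `clean_tweet` and the sample tweet are assumptions made for the example, since the exact stripping rules are not specified.

```python
import re

# Hypothetical patterns: the exact rules used for the corpora are not given.
MENTION_OR_HASHTAG = re.compile(r"[@#]\w+")
WHITESPACE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Remove @mentions and #hashtags, collapse whitespace, and lowercase."""
    without_tags = MENTION_OR_HASHTAG.sub(" ", text)
    return WHITESPACE.sub(" ", without_tags).strip().lower()


# Made-up sample tweet mixing English and Swahili.
print(clean_tweet("@user Niko sawa, see you kesho #Nairobi"))
```

Removing the tags before collapsing whitespace avoids leaving double spaces where a mention or hashtag used to be.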
The phenomenon of code-mixed language use is common in locales that are surrounded by others speaking different languages, or in locales with a large number of immigrants. In Kenya and Nigeria, the use of English is influenced by the presence of one or more local languages, and this is evident in the corpus.
III-B Model Hyper parameters
We considered 11 hyper parameters for tuning: the size of the word embedding (emsize), the number of hidden units in each LSTM layer (nhid), the number of LSTM layers (nlayers), the initial learning rate of the optimizer (lr), the maximum norm for gradient clipping (clip), the backpropagation through time sequence length (bptt), dropout applied to the layers (dropout), weight dropout applied to the LSTM hidden-to-hidden matrix (wdrop), the input embedding layer dropout (dropouti), dropout for the LSTM layer (dropouth), and dropout to remove words from the embedding layer (dropoute). Table I contains the default values of the individual hyper parameters.
All experiments involved training for 100 epochs, in line with available GPU resources. The training criterion was the cross-entropy loss, which is the average negative log-likelihood of predicting the correct next word by the LM. It took approximately two hours of wall clock time to train the model for each hyper parameter configuration.
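To make the link between the training criterion and the reported metric concrete, the sketch below computes perplexity as the exponential of the average negative log-likelihood. The function name `perplexity` and the toy probabilities are illustrative assumptions; natural logarithms are assumed throughout.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return exp(average negative log-likelihood) over the given tokens.

    Each entry is the natural-log probability the model assigned to the
    correct next word at one position.
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Toy case: a model that always assigns probability 1/4 to the correct word
# is exactly as uncertain as a uniform choice among four words.
uniform_log_probs = [math.log(0.25)] * 10
print(perplexity(uniform_log_probs))
```

Under this definition, a lower cross-entropy loss always corresponds to a lower perplexity, so ranking configurations by either quantity gives the same order.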
III-C Sequential search
The search process begins by setting the value of each hyper parameter (the configuration) to the known best values (see Table I). We then iteratively search for the best value of each hyper parameter. The order used in this search is defined by the rows of Table II. Performance is evaluated based on the test perplexity for the modeling task. Once the best perplexity is identified from the list of possible values for the given hyper parameter, that value is fixed and the space of the next hyper parameter in the sequence is searched. In this manner the configuration space of the model is explored.
This approach shares similarities with the method applied in [14], though it remains an open question what impact the order of the sequence has on the quality of the best configuration produced. In the context of this work, our aim is not to find the best configuration. Instead, it is to better understand the configuration space defined by these hyper parameters, in order to determine the impact of their values on performance when considering a code-mixed corpus.
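The sequential procedure described above can be sketched as a coordinate-wise search. This is a minimal illustration under stated assumptions: `sequential_search` and `toy_perplexity` are hypothetical names, the toy objective stands in for training the model and measuring its test perplexity, and the starting value is kept whenever no candidate improves on it.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def sequential_search(
    start: Config,
    space: Dict[str, List[float]],
    evaluate: Callable[[Config], float],
) -> Tuple[Config, float]:
    """Tune one hyper parameter at a time, fixing each at its best value.

    `space` is iterated in insertion order, which plays the role of the
    row order of the candidate table. Lower scores are better.
    """
    best = dict(start)
    best_score = evaluate(best)
    for name, candidates in space.items():
        incumbent = best[name]
        for value in candidates:
            if value == incumbent:
                continue  # this configuration has already been scored
            trial = {**best, name: value}
            score = evaluate(trial)
            if score < best_score:
                best, best_score = trial, score
    return best, best_score


def toy_perplexity(config: Config) -> float:
    """Hypothetical stand-in for training a model and measuring perplexity."""
    return abs(config["emsize"] - 400) + abs(config["nhid"] - 950) / 10


# Two of the hyper parameters and their candidate values, for brevity.
space = {"emsize": [300, 350, 400, 450], "nhid": [950, 1050, 1150, 1250]}
start = {"emsize": 300, "nhid": 1150}
best_config, best_score = sequential_search(start, space, toy_perplexity)
print(best_config, best_score)
```

With k candidate values per hyper parameter, this strategy scores on the order of k evaluations per hyper parameter rather than every combination, at the cost of depending on the search order.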
III-D Population-based search
We apply a genetic algorithm (GA) to provide a complementary approach to the sequential search for the exploration of hyper parameter configurations. The GA, presented by [15], is a population-based search metaheuristic inspired by biological natural selection. The test perplexity of a particular hyper parameter configuration is the measure of its fitness. Given an evaluated population, i.e. a set of hyper parameter configurations, we derive the next generation of the population by first selecting candidate configurations using roulette wheel selection [16]. This biases the selection towards good configurations, which pass their “genetic material” on to the subsequent generation. The probability $p_i$ of selecting the $i$-th configuration in a generation is defined in (1), where $f_i$ is its fitness, which is a function of the test perplexity, and $N$ is the number of configurations in the generation.
$$p_i = \frac{f_i}{\sum_{j=1}^{N} f_j} \qquad (1)$$
Each hyper parameter configuration in the subsequent generation is derived from two parent configurations selected via this approach. Mimicking the biological crossover of chromosomes [15], the two selected configurations are mixed, and one of the resulting configurations is selected at random. Finally, a random subset of the components of each derived configuration is perturbed by adding noise. This sequence of processes defines how configurations from the current generation are used to derive the next generation.
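One generation step of the procedure above might look like the following sketch. Several details are assumptions made for illustration only: fitness is taken to be the reciprocal of the test perplexity, crossover is uniform per hyper parameter, the perturbation resamples a value from the candidate list rather than adding numeric noise, and the mutation rate of 0.2 and all function names are hypothetical.

```python
import random
from typing import Dict, List

Config = Dict[str, float]
Space = Dict[str, List[float]]


def roulette_select(population: List[Config], fitnesses: List[float],
                    rng: random.Random) -> Config:
    """Pick one configuration with probability f_i / sum_j f_j."""
    threshold = rng.uniform(0, sum(fitnesses))
    running = 0.0
    for config, fitness in zip(population, fitnesses):
        running += fitness
        if running >= threshold:
            return config
    return population[-1]  # guard against floating point round-off


def crossover(parent_a: Config, parent_b: Config, rng: random.Random) -> Config:
    """Mix two parents into two children and return one of them at random."""
    child_1, child_2 = {}, {}
    for name in parent_a:
        if rng.random() < 0.5:
            child_1[name], child_2[name] = parent_a[name], parent_b[name]
        else:
            child_1[name], child_2[name] = parent_b[name], parent_a[name]
    return rng.choice([child_1, child_2])


def mutate(config: Config, space: Space, rng: random.Random,
           rate: float = 0.2) -> Config:
    """Perturb a random subset of components (here: resample a candidate)."""
    mutated = dict(config)
    for name in mutated:
        if rng.random() < rate:
            mutated[name] = rng.choice(space[name])
    return mutated


def next_generation(population: List[Config], perplexities: List[float],
                    space: Space, rng: random.Random) -> List[Config]:
    """Derive a new population of the same size from an evaluated one."""
    fitnesses = [1.0 / p for p in perplexities]  # assumption: lower is fitter
    children = []
    for _ in population:
        parent_a = roulette_select(population, fitnesses, rng)
        parent_b = roulette_select(population, fitnesses, rng)
        children.append(mutate(crossover(parent_a, parent_b, rng), space, rng))
    return children


space = {"emsize": [300, 350, 400, 450], "nhid": [950, 1050, 1150, 1250]}
population = [
    {"emsize": 300, "nhid": 1150},
    {"emsize": 400, "nhid": 950},
    {"emsize": 450, "nhid": 1250},
]
perplexities = [851.0, 840.0, 900.0]  # made-up scores for illustration
children = next_generation(population, perplexities, space, random.Random(0))
print(children)
```

Because the made-up perplexities are close to one another, reciprocal fitness gives nearly uniform selection probabilities; a steeper fitness function would bias selection more strongly towards the best configurations.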
III-E Meta-learning initialization
Both the population-based and the sequential search spaces were manually initialized with four (4) values of each hyper parameter in the neighbourhood of the best values reported in [4], as shown in Table I. It is important to note that the sampled configuration space is very small compared to the overall space, which is of size $4^{11}$ (four candidate values for each of the 11 hyper parameters). The 84 configurations sampled for each dataset are a far cry from the size of this universal set.
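As a rough check on how sparse this sampling is, the sketch below counts the configurations in the full grid and the fraction covered by 84 samples. It assumes that the overall space is the Cartesian product of the four candidate values listed for each of the 11 hyper parameters; the variable names are illustrative.

```python
import math

# Candidate values per hyper parameter, copied from the candidate table.
space = {
    "emsize": [300, 350, 400, 450],
    "nhid": [950, 1050, 1150, 1250],
    "nlayers": [2, 3, 4, 5],
    "dropout": [0.3, 0.4, 0.5, 0.6],
    "dropoute": [0.3, 0.4, 0.5, 0.6],
    "dropouth": [0.3, 0.4, 0.5, 0.6],
    "dropouti": [0.3, 0.4, 0.5, 0.6],
    "wdrop": [0.3, 0.4, 0.5, 0.6],
    "bptt": [50, 60, 70, 80],
    "clip": [0.15, 0.25, 0.35, 0.45],
    "lr": [20, 30, 40, 45],
}

# Size of the full grid: the product of the number of candidates per axis.
total_configs = math.prod(len(values) for values in space.values())
sampled_configs = 84
fraction = sampled_configs / total_configs

print(total_configs)        # 4194304, i.e. 4 ** 11
print(f"{fraction:.6%}")    # share of the grid covered by the samples
```

At roughly two hours of training per configuration, exhaustively evaluating a grid of this size would be impractical, which is why only a small neighbourhood of the published values is sampled.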
IV Results
We use the term default value when referring to an individual hyper parameter value that is part of the configuration with the best result for the AWD-LSTM model as reported in [4]. We evaluate the sensitivity of the hyper parameters by observing the test perplexity distribution for each candidate value and comparing it with that of the default value.
IV-1 Better
For each of these hyper parameters, the default value is correlated with statistically better performance. This is the case for dropouti, dropoute, dropouth, clip, wdrop, and lr.
IV-2 Not better but not worse! (though generally the best)
For each of these hyper parameters, the default value is correlated with neither statistically better nor statistically worse performance. However, the default values are in the close neighbourhood of the best hyper parameter values for both datasets. This is the case for dropout and bptt, as shown in Figures 1 and 2 respectively.
IV-3 Not better but not worse! (though generally not the best)
For each of these hyper parameters, the default value is correlated with neither statistically better nor statistically worse performance. However, the default values are not in the close neighbourhood of the best values. This is the case for the nlayers and nhid hyper parameters, as shown in Figures 3 and 4 respectively.
The data from varying the number of LSTM layers suggest that shallower models yield lower perplexity. This is consistent across both datasets and supported by [17]. The number of hidden units bounds the number of nonlinear transformations in the network, with an increase leading to more calculations between inputs and corresponding outputs. A higher number of hidden units is therefore expected to improve model performance. What is observed on both datasets, however, is that the lowest number of hidden units yields the best result.
IV-4 Worse
The emsize is observed to be the only hyper parameter for which the default value is correlated with statistically significantly worse performance on both Datasets 1 and 2, as shown in Figure 5.
Encoding context, morphology and relationships between words by training the word embedding is directly tied to the vocabulary. As there are varying degrees of similarity between words across corpora, the embedding size is a hyper parameter that is expected to be sensitive to the dataset. Every corpus has a different level of semantic and syntactic context that needs to be encoded as features that affect the NLM. Thus, the sensitivity of the embedding size hyper parameter is not surprising.
We present a comparison of the test perplexity obtained using the default values of the AWD-LSTM model with the best test perplexity from both the sequential and the population-based search in Table III.
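Table III: Test perplexity with the default values compared with the best test perplexity found by the searches
Dataset    Default values perplexity  Best perplexity  Changed parameters
Dataset 1  851.24                     839.56           emsize
Dataset 2  490.28                     482.41           emsize, nhid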
V Conclusion
In this work we set out to characterize the space of hyper parameter values for a neural language model. The performance of such models is sensitive to the selection of the hyper parameters that define their operation. Our work applied language modeling to the domain of code-mixed text, and we found that although the published hyper parameters for the state of the art architecture, the AWD-LSTM model, were largely good, they did not define the best combination of hyper parameters for the task. Our hyper parameter searches indicate that the hyper parameters of the AWD-LSTM model are not, in general, sensitive to novel datasets. Specifically, the size of the word embedding and the number of hidden units in each LSTM layer are the only two of the 11 evaluated hyper parameters whose best values differ from the published ones. This work can thus serve as a solid baseline for deriving better sets of hyper parameters for this type of data.
Of particular interest to us is the performance on a code-mixed corpus of the AWD-LSTM model, which is the current state of the art. The perplexity values are a far cry from what is generally known to be ‘good’ perplexity for NLMs. As such, while the AWD-LSTM model shows promising results on benchmark datasets, evaluating it on a code-mixed corpus with the hyper parameter values found to be the best on such benchmark datasets, as well as with values in the same neighbourhood, results in unacceptable and impractical perplexity values for an NLM. We hope to explore various strategies for developing a better language model of the datasets introduced in this work.
References
[1] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur, “Recurrent neural network based language model,” in Interspeech, vol. 2, 2010, p. 3.
[2] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin, “A neural probabilistic language model,” Journal of machine learning research, vol. 3, no. Feb, pp. 1137–1155, 2003.
[3] Wikipedia, “Code-mixing — wikipedia, the free encyclopedia,” 2017, [Online; accessed 1-November-2017]. Available: https://en.wikipedia.org/w/index.php?title=Codemixing&oldid=805135369
[4] S. Merity, N. S. Keskar, and R. Socher, “Regularizing and optimizing lstm language models,” arXiv preprint arXiv:1708.02182, 2017.
[5] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[6] C. Chelba, T. Mikolov, M. Schuster, Q. Ge, T. Brants, P. Koehn, and T. Robinson, “One billion word benchmark for measuring progress in statistical language modeling,” arXiv preprint arXiv:1312.3005, 2013.
[7] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu, “Exploring the limits of language modeling,” arXiv preprint arXiv:1602.02410, 2016.
[8] S. Merity, C. Xiong, J. Bradbury, and R. Socher, “Pointer sentinel mixture models,” arXiv preprint arXiv:1609.07843, 2016.
[9] L. Kaiser, A. N. Gomez, N. Shazeer, A. Vaswani, N. Parmar, L. Jones, and J. Uszkoreit, “One model to learn them all,” arXiv preprint arXiv:1706.05137, 2017.
[10] E. Brochu, V. M. Cora, and N. De Freitas, “A tutorial on bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning,” arXiv preprint arXiv:1012.2599, 2010.
[11] I. Ilievski, T. Akhtar, J. Feng, and C. A. Shoemaker, “Efficient hyperparameter optimization for deep learning algorithms using deterministic rbf surrogates,” in AAAI, 2017, pp. 822–829.
[12] T. Desautels, A. Krause, and J. W. Burdick, “Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3873–3923, 2014.
[13] P. Brazdil, C. G. Carrier, C. Soares, and R. Vilalta, Metalearning: Applications to data mining. Springer Science & Business Media, 2008.
[14] Z. Wang, H. Mi, and A. Ittycheriah, “Semi-supervised clustering for short text via deep representation learning,” arXiv preprint arXiv:1602.06797, 2016.
[15] J. H. Holland, “Genetic algorithms,” Scientific american, vol. 267, no. 1, pp. 66–73, 1992.
[16] D. E. Goldberg and K. Deb, “A comparative analysis of selection schemes used in genetic algorithms,” Foundations of genetic algorithms, vol. 1, pp. 69–93, 1991.
[17] L. Zhang and B. G. Thomas, “State of the art in evaluation and control of steel cleanliness,” ISIJ international, vol. 43, no. 3, pp. 271–291, 2003.