1 Introduction and Background
Natural Language Processing (NLP) is a field driven by empirical evaluations. Authors are under pressure to demonstrate that their models or methods achieve state-of-the-art performance on a particular task or dataset, which by definition requires reliable model comparison. As models become more numerous, require larger computational resources to train, and the performance of competing models gets closer, the task of reliable model selection has become not only more important but also increasingly difficult. Without full disclosure of model settings and data splits, it is impossible to accurately compare methods and models.
To be able to perform meaningful model comparisons, we need to be able to reliably evaluate models. Unfortunately, evaluating a model is a non-trivial task and the best we can do is to produce noisy estimates of model performance with the following two distinct sources of stochasticity:

We only have access to a finite training dataset; however, evaluating a model on its training data leads to severe overestimates of performance. To evaluate models without overfitting, practitioners typically randomly partition data into independent training and testing sets, producing estimates that are random quantities with often high variability for NLP problems Moss et al. (2018). Although methods like bootstrapping Efron and Tibshirani (1994) and leave-one-out cross validation Kohavi (1995) can provide deterministic estimates of performance, they require the fitting of a large number of models and so are not computationally feasible for the complex models and large data prevalent in NLP. Standard NLP model evaluation strategies range from using a simple (and computationally cheap) single train-test split, to the more sophisticated K-fold cross validation (CV) Kohavi (1995).

The vast majority of recent NLP models are non-deterministic and so their performance has another source of stochasticity, controlled by the choice of random seed during training. Common sources of model instability in modern NLP include weight initialisation, data subsampling for stochastic gradient calculation, negative sampling used to train word embeddings Mikolov et al. (2013) and feature subsampling for ensemble methods. In particular, the often state-of-the-art LSTMs (and their many variants) have been shown to exhibit high sensitivity to random seeds Reimers and Gurevych (2017).
For reliable model selection, it is crucial to take into account both sources of variability when estimating model performance. Observing a higher score for one model could be a consequence of a particularly non-representative train-test split and/or random seed used to evaluate the model rather than a genuine model improvement. This subtlety is ignored by large-scale NLP competitions such as SemEval, whose evaluations are based on a predetermined train-test split.
Although more precise model evaluations can be obtained with more computation, calculating overly precise model evaluations is a huge waste of computational resources. On the other hand, our evaluations need to provide reliable conclusions (with only a small probability of selecting a suboptimal model). It is poorly understood how to choose an appropriate evaluation strategy for a given model selection problem. The right choice is task specific, depending on model stability, the closeness in performance of competing models and subtle properties of the data such as the representativeness of train-test splits.
In contrast to common practice, we consider model selection as a sequential process. Rather than using a fixed evaluation strategy for each model (which we refer to as a non-adaptive approach), we start with a cheap evaluation of each model on just a single train-test split, and then choose where to allocate further computational resources based on the observed evaluations. If we decide to further test a promising model, we calculate an additional evaluation based on another data split and seed, observing both sources of evaluation variability and allowing reliable assessments of performance.
To perform sequential model fitting, we borrow methods from the multi-armed bandit (MAB) statistical literature Lai and Robbins (1985). This field covers problems motivated by designing optimal strategies for pulling the arms of a bandit (also known as a slot machine) in casinos. Each arm produces rewards from a different random distribution, which the user must learn by pulling arms. In particular, model selection is equivalent to the problem of best-arm identification: identifying the arm with the highest mean. Although it appears simple at first glance, this problem is deceptively complex and has provided motivation for efficient algorithms in a wide range of domains, including clinical trials Villar et al. (2015) and recommendation systems Li et al. (2010).
Although we believe that we are the first to use bandits to reduce the cost and improve the reliability of model selection, we are not the first to use them in NLP. Recent work in machine translation makes use of another major part of the MAB literature, seeking to optimise the long-term performance of translation algorithms Nguyen et al. (2017); Sokolov et al. (2016); Lawrence et al. (2017). Within NLP, our work is most similar to Haffari et al. (2017), who use bandits to minimise the number of data queries required to calculate the F-scores of models. However, this work does not consider the stochasticity of the resulting estimates or easily extend to other evaluation metrics.
The main contribution of this paper is the application of three intuitive algorithms to model selection in NLP, alongside a user-friendly Python implementation: FIESTA (Fast IdEntification of State-of-The-Art), available at https://github.com/apmoore1/fiesta. We can automatically identify an optimal model from large collections of candidate models to a user-chosen confidence level in a small number of model evaluations. We focus on three distinct scenarios that are of interest to the NLP community. Firstly, we consider the fixed budget (FB) model selection problem (Section 4.1), a situation common in industry, where a fixed quota of computational resources (or time constraints for real-time decisions) must be appropriately allocated to identify an optimal model with the highest possible confidence. In contrast, we also consider the fixed confidence (FC) problem (Section 4.2), which we expect to be of more use for researchers. Here, we wish to claim with a specified confidence level that our selected model is state-of-the-art against a collection of competing models using the minimal amount of computation. Finally, we also consider an extension to the FC scenario, where a practitioner has the computational capacity to fit multiple models in parallel. We demonstrate the effectiveness of our procedures over current model selection approaches when identifying an optimal target-dependent sentiment analysis model from a set of eight competing candidate models (Section 5).
2 Motivating example
We now provide evidence for the need to vary both data splits and random seeds for reliable model selection. We extend the motivating example used in the work of Reimers and Gurevych (2017), comparing two LSTM-based Named Entity Recognition (NER) models by Ma and Hovy (2016) and Lample et al. (2016), differing only in character representation (via a CNN and an LSTM respectively). We base model training on Ma and Hovy (2016); however, following the settings of Yang et al. (2018), we use a batch size of 64, apply weight decay and remove momentum. We ran each of the NER models five times with a different random seed on 150 different train, validation, and test splits. (The original CoNLL data was split with respect to time rather than random subsampling, explaining the discrepancy with previous scores on this dataset using the same models.) Reimers and Gurevych (2017) showed the effect of model instability between these two models, where changing the model's random seeds can lead to drawing different conclusions about which model performed best. We extend this argument by showing that different conclusions can also be drawn if we instead vary the train-test split used for the model evaluation (Figure 1). We see that while data splits 0 and 2 correctly suggest that the LSTM is optimal, using data split 1 suggests the opposite. Therefore, it is clear that we must vary both the random seeds and train-test splits used to evaluate our models if we want reliable model selection.
3 Problem Statement
Extending notation from Arcuri and Briand (2014), we can precisely state the task of selecting between a collection of $N$ candidate models $\mathcal{S} = \{S_1, \ldots, S_N\}$ as finding
$$S^* = \operatorname{argmax}_{S_i \in \mathcal{S}} \mathcal{M}(S_i), \qquad (1)$$
where $S^*$ is the best model according to some chosen evaluation metric $\mathcal{M}$ that measures the performance of that model, e.g. accuracy, F-score or AUC (for a summary of model evaluation metrics see Friedman et al. (2001)).
As already argued, Equation (1) paints an overly simplistic picture of model selection. In reality we only have access to noisy realisations of the true model score $\mathcal{M}(S_i)$, and direct comparisons of single realisations of random variables are unreliable. Therefore, we follow the arguments of Reimers and Gurevych (2018) and consider a meaningful way of comparing noisy model evaluations: namely, finding the model with the largest expected performance estimate across different train-test splits and random seeds. Defining the mean performance of model $S_i$ as $\mu_i$, we see that the task of model selection is equivalent to the accurate learning and comparison of these unknown means:
$$S^* = \operatorname{argmax}_{S_i \in \mathcal{S}} \mu_i.$$
We can now set up the sequential framework of our model selection procedure and precisely state what we mean by reliable model selection. At each step in our algorithm we choose a model to evaluate and sample a performance estimate by randomly generating a data split and random seed. After collecting evaluations, we can calculate sample means for each model, which we denote as $\hat{\mu}_i$. After running our algorithm for $T$ steps, reliable model selection corresponds to knowing how confident we should be that our chosen model $\hat{S}_T$ is in fact the true optimal model $S^*$, i.e. we wish to make a precise statement of the form
$$\mathbb{P}(\hat{S}_T = S^*) \geq 1 - \delta, \qquad (2)$$
where $1 - \delta$ represents this confidence.
In Section 1 we motivated two distinct goals of a sequential model selection routine, which we can now state as:

Fixed budget model selection (FB): We wish to find the best model using only a fixed budget of model evaluations. The aim is to collect the evaluations that allow us to claim (2) with the largest possible confidence level.

Fixed confidence model selection (FC): We wish to find the best model to a pre-specified confidence level. The aim is to collect the minimal number of model evaluations that allow us to claim (2).
Although an algorithm designed to do well in one of these scenarios will likely also do well in the other, we will see that to achieve the best performance at either FB or FC model selection, we require subtly different algorithms.
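As a minimal sketch of the sequential setup described above (not the FIESTA API), each evaluation below draws a fresh train-test split and a fresh random seed before scoring a model, and the model with the highest sample mean is chosen. The names `toy_train_and_score` and `evaluate_once`, the 80/20 split and the "true" scores are illustrative assumptions; the toy scorer merely stands in for real model training.

```python
import random
import statistics


def toy_train_and_score(skill, train, test, seed):
    """Hypothetical stand-in for fitting a model on `train` and scoring on `test`."""
    # Noise depends on both the seed and the split, mimicking the two sources.
    rng = random.Random(seed * 1_000_003 + sum(test))
    return min(1.0, max(0.0, skill + rng.gauss(0.0, 0.03)))


def evaluate_once(skill, data, rng, test_fraction=0.2):
    """One noisy evaluation: a fresh random split and a fresh random seed."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    test, train = shuffled[:n_test], shuffled[n_test:]
    seed = rng.randrange(2**31)
    return toy_train_and_score(skill, train, test, seed)


rng = random.Random(0)
data = list(range(1000))
skills = {"model_a": 0.70, "model_b": 0.72}  # hypothetical true mean scores
scores = {name: [evaluate_once(s, data, rng) for _ in range(30)] for name, s in skills.items()}
sample_means = {name: statistics.fmean(vals) for name, vals in scores.items()}
chosen = max(sample_means, key=sample_means.get)
```

Because every call to `evaluate_once` changes both the split and the seed, the spread of `scores[name]` reflects both sources of variability rather than only one of them.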
4 Algorithms
We now examine model selection from a bandit viewpoint, summarising three bandit algorithms and relating their use to three distinct model selection scenarios. Although the underpinning theoretical arguments for these algorithms are beyond the scope of this work, we do highlight one point that is relevant for model selection: the scenarios enjoying the largest efficiency gains from moving to adaptive algorithms are those where only a subset of arms have performance close to optimal Jamieson et al. (2013). Model selection in NLP is often in this scenario, with only a small number of considered models being close to state-of-the-art, and so (as we demonstrate in Section 5) NLP has a lot to gain from using our adaptive model selection algorithms.
4.1 Fixed Budget by Sequential Halving
FB best-arm identification algorithms are typically based on successively eliminating arms until just a single (ideally) optimal arm remains Jamieson et al. (2013); Jamieson and Nowak (2014); Audibert and Bubeck (2010). We focus on the sequential halving (SH) algorithm of Karnin et al. (2013) (Algorithm 1). Here we break our model selection routine into a series of rounds, each discarding the least promising half of our candidate model set, eventually resulting in a single remaining model. Our computational budget is split equally among the rounds and, within each round, equally among the models remaining in that round. This allocation strategy ensures an efficient use of resources; for example, the final two surviving models are evaluated more often than the models eliminated in the first round. An example run of the algorithm is summarised in Table 1.
Round  Candidate Models  # Evaluations
1                        2
2                        4
output:
In the bandit literature Karnin et al. (2013), this algorithm is shown to have strong theoretical guarantees of reliably choosing the optimal arm, as long as the reward distributions for each arm are bounded (limited to some finite range). This is not a restrictive assumption for NLP, as the majority of common performance metrics are bounded; for example, accuracy, recall, precision and F-score are all constrained to lie in $[0, 1]$. We will demonstrate the effectiveness of sequential halving for model selection in Section 5.
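The halving procedure of this section can be sketched in a few lines of Python. This is a sketch under stated assumptions rather than the FIESTA implementation: the function name `sequential_halving`, the `evaluate` callback and the toy scores are hypothetical, and ranking survivors by the mean of the evaluations gathered in the current round only is one possible design choice.

```python
import math
import random


def sequential_halving(models, evaluate, budget):
    """Return the model that survives repeated halving of the candidate set.

    models:   iterable of candidate identifiers
    evaluate: callable(model) -> one noisy score (higher is better)
    budget:   total number of evaluations available
    """
    survivors = list(models)
    n_rounds = max(1, math.ceil(math.log2(len(survivors))))
    per_round = budget // n_rounds  # budget split equally among rounds
    for _ in range(n_rounds):
        # Within a round, split the budget equally among surviving models.
        # The floor of one pull means very small budgets can be exceeded.
        pulls = max(1, per_round // len(survivors))
        means = {m: sum(evaluate(m) for _ in range(pulls)) / pulls for m in survivors}
        keep = math.ceil(len(survivors) / 2)
        survivors = sorted(survivors, key=means.get, reverse=True)[:keep]
    return survivors[0]


# Toy usage: eight hypothetical models with noisy scores clipped to [0, 1].
rng = random.Random(1)
true_means = [0.60, 0.62, 0.64, 0.66, 0.68, 0.70, 0.72, 0.80]


def noisy(m):
    return min(1.0, max(0.0, rng.gauss(true_means[m], 0.02)))


best = sequential_halving(range(8), noisy, budget=240)
```

With eight models and a budget of 24, this allocation gives 1, 2 and 4 evaluations per surviving model across the three rounds, so a model reaching the final round is evaluated 7 times in total against a single evaluation for a model eliminated in the first round.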
4.2 Fixed Confidence by TTTS
For fixed confidence model selection, where we wish to guarantee the selection of an optimal arm at a given confidence level, we cannot just discard arms that are likely to be suboptimal without accurately estimating this likelihood of suboptimality. Although approaches that sequentially eliminate arms (like our sequential halving algorithm) do exist for FC best-arm identification Jamieson et al. (2014); Karnin et al. (2013); Audibert and Bubeck (2010); Even-Dar et al. (2002), the best theoretical guarantees for the FC problem come from algorithms that maintain the ability to sample any arm at any point in the selection procedure Garivier and Kaufmann (2016); Jamieson and Nowak (2014). Rather than seeking to eliminate half the considered models at regular intervals of computation, a model is only evaluated until we can be sufficiently confident that it is suboptimal. Unfortunately, the performance guarantees for these methods are asymptotic results (in the number of arms and the number of arm pulls) and have little practical relevance to the (at most) tens of arms in a model selection problem.
Our practical recommendation for FC model selection is a variant of the well-known Bayesian sampling algorithm, Thompson sampling, known as top-two Thompson sampling (TTTS) Russo (2016). We will see that this algorithm can efficiently allocate computational resources to quickly find optimal models. Furthermore, this approach provides full uncertainty estimation over the final choice of model, providing the confidence guarantees required for FC model selection.
Our implementation makes the assumption that the evaluations of each model roughly follow a Gaussian distribution, with different means and variances. Although such assumptions are common in the model evaluation literature Reimers and Gurevych (2018) and for statistical testing in NLP Dror et al. (2018), they could be problematic for the bounded metrics common in NLP. Therefore we also experimented with modelling the logit transformation of our evaluations, mapping our evaluation metric to the whole real line. However, for our examples of Section 5 we found that this mapping provided a negligible improvement in reliability and so was not worth including in our experimental results. This may not be the case for other tasks or less well-behaved evaluation metrics and so we include this functionality in the FIESTA package.
To provide efficient model selection, we use our current believed probability $\pi_i$ that a given model $S_i$ is optimal (producing a distribution $\pi$ over the models) to drive the allocation of computational resources. Standard Thompson sampling is a stochastic algorithm that generates a choice of model by sampling from our current belief $\pi$, i.e. choosing to evaluate a model with the same probability that we believe it is optimal (see Russo et al. (2018) for a concise introduction). Although this strategy allows us to focus computation on promising arms, it actually does so too aggressively. Once we believe that an arm is optimal with reasonably high confidence, computation will be heavily focused on evaluating this arm even though we need to become more confident about the suboptimality of competing models to improve our confidence level. This criticism motivates our chosen algorithm TTTS (summarised in Algorithm 2), where instead of sampling a single model according to $\pi$, we sample two distinct models. We then uniformly choose between these two models for the next evaluation, allowing a greater exploration of the arms and much improved rates of convergence to the desired confidence level Russo (2016). We use this new evaluation to update our belief $\pi$ and continue making evaluations until we believe that a model is optimal with a probability higher than the required confidence level $1 - \delta$, at which point we terminate the algorithm. An example run of TTTS is demonstrated on a synthetic example in Figure 2, where we simulate from Gaussian distributions with means chosen to mimic accuracy measurements for a model selection problem.
We now explain how we calculate $\pi$ (our belief in the location of the optimal model) using well-known results from Bayesian decision theory (see Berger (2013) for a comprehensive coverage). As justified earlier, we assume that the evaluations of model $S_i$ are independently distributed with a Gaussian distribution for some unknown mean $\mu_i$ and variance $\sigma_i^2$. Although we are primarily interested in learning $\mu_i$, we must also learn $\sigma_i^2$ in order to make confidence guarantees about the optimality of our selected model. Therefore, as well as keeping track of the sample means $\hat{\mu}_i$ for the evaluations of each model $S_i$, we also keep track of the sample variances $\hat{\sigma}_i^2$ and counters $T_i$ of the number of times each model has been evaluated. To facilitate inference, we choose a uniform prior for the unknown $\mu_i$ and $\sigma_i^2$. Not only is this a conjugate prior for Gaussian likelihoods, but it is also shown to encourage beneficial exploratory behaviour when using Thompson sampling on Gaussian bandit problems Honda and Takemura (2014) and so allows fast identification of optimal arms (or models). After observing $T_i$ evaluations of each model and producing estimates $\hat{\mu}_i$ and $\hat{\sigma}_i^2$, our posterior belief for each deviation $\mu_i - \hat{\mu}_i$ between the true and observed model means follows a scaled Student's t-distribution (as derived in Honda and Takemura (2014)).
$\pi$ is then defined as the probability vector such that $\pi_i$ is the relative probability that $\mu_i$ is the largest according to this posterior belief. Unfortunately, there is no closed form expression for the maximum of t-distributions and so FIESTA uses a simple Monte-Carlo approximation based on the sample maxima of repeated draws from our posteriors for $\mu_i$. In practice this is very accurate and did not slow down our experiments, especially in comparison to the time saved by reducing the number of model evaluations.
4.3 Batch Fixed Confidence by BTS
NLP practitioners often have the computational capacity to fit models in parallel across multiple workers, evaluating multiple models or the same model across multiple seeds at once. Their model selection routines must therefore provide batches of models to evaluate. Our proposed solution to FB model selection naturally provides such batches, with each successive round of SH producing a collection of model evaluations that can be calculated in parallel. Unfortunately, TTTS for FC model selection successively chooses and then waits for the evaluation of single models and so is not naturally suited to parallelism.
Extending TTTS to batch decision making is an open problem in the MAB literature. Therefore, we instead consider batch Thompson sampling (BTS), an extension of standard Thompson sampling (as described in Section 4.2) to batch sampling from the related field of Bayesian optimisation Kandasamy et al. (2018). At each step in our selection process we take $B$ model draws according to our current belief $\pi$ that each model is optimal, where $B$ represents our computational capacity. This is in contrast to the single draw in standard Thompson sampling and the drawn pair in TTTS. In addition, this approach extends to the asynchronous setting, where rather than waiting for the whole batch of $B$ models to be evaluated before choosing the next batch, each worker can draw a new model to evaluate according to the updated belief $\pi$. This flexibility provides an additional efficiency gain for problems where the different models have a wide range of run times.
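The sampling rules of Sections 4.2 and 4.3 can be illustrated with the following standard-library sketch. It is not the FIESTA code: all function names are hypothetical, and the posterior form is an assumption of this sketch, namely that with $n$ evaluations of a model, sample mean $\hat{\mu}$ and mean squared deviation $\hat{\sigma}^2$ (dividing by $n$), a uniform prior gives $\sqrt{(n-3)/\hat{\sigma}^2}\,(\mu - \hat{\mu}) \sim t_{n-3}$. The exact expression used by FIESTA may be parameterised differently. The cap on re-draws in `ttts_choice` is likewise an assumption added to guarantee termination.

```python
import math
import random
import statistics


def summarise(scores):
    """Return (mean, posterior scale, degrees of freedom) for one model.

    Assumes Gaussian scores with a uniform prior on (mean, variance); at least
    4 scores are needed so that the n - 3 degrees of freedom are positive.
    """
    dof = len(scores) - 3
    scale = math.sqrt(statistics.pvariance(scores) / dof)  # pvariance divides by n
    return statistics.fmean(scores), scale, dof


def student_t(dof, rng):
    """Student's t variate built from a normal and a chi-squared variate."""
    chi2 = rng.gammavariate(dof / 2.0, 2.0)
    return rng.gauss(0.0, 1.0) / math.sqrt(chi2 / dof)


def ranked_draw(stats, rng):
    """Sample each model's posterior mean once; return models best-first."""
    draws = {m: mean + scale * student_t(dof, rng) for m, (mean, scale, dof) in stats.items()}
    return sorted(draws, key=draws.get, reverse=True)


def belief(history, rng, n_draws=500):
    """Monte Carlo estimate of the probability that each model is optimal."""
    stats = {m: summarise(s) for m, s in history.items()}
    wins = dict.fromkeys(history, 0)
    for _ in range(n_draws):
        wins[ranked_draw(stats, rng)[0]] += 1
    return {m: w / n_draws for m, w in wins.items()}


def ttts_choice(history, rng, max_tries=100):
    """Top-two Thompson sampling: two distinct draws, then a fair coin flip."""
    stats = {m: summarise(s) for m, s in history.items()}
    first = ranked_draw(stats, rng)[0]
    for _ in range(max_tries):
        ranking = ranked_draw(stats, rng)
        if ranking[0] != first:
            break
    # If no distinct winner appeared, fall back to the runner-up of the last draw.
    second = ranking[0] if ranking[0] != first else ranking[1]
    return rng.choice([first, second])


def bts_batch(history, rng, batch_size):
    """Batch Thompson sampling: one independent Thompson draw per worker."""
    stats = {m: summarise(s) for m, s in history.items()}
    return [ranked_draw(stats, rng)[0] for _ in range(batch_size)]


def select_fixed_confidence(models, evaluate, delta, rng, warmup=4, max_evals=200):
    """Run TTTS until some model is believed optimal with probability >= 1 - delta."""
    history = {m: [evaluate(m) for _ in range(warmup)] for m in models}
    spent = warmup * len(history)
    while True:
        pi = belief(history, rng)
        leader = max(pi, key=pi.get)
        if pi[leader] >= 1.0 - delta or spent >= max_evals:
            return leader, pi[leader], spent
        chosen = ttts_choice(history, rng)
        history[chosen].append(evaluate(chosen))
        spent += 1


# Toy problem: hypothetical true accuracies with Gaussian evaluation noise.
rng = random.Random(0)
true_means = {"A": 0.60, "B": 0.70, "C": 0.80}
noisy = lambda m: rng.gauss(true_means[m], 0.02)
leader, confidence, spent = select_fixed_confidence(list(true_means), noisy, 0.05, rng)
```

Replacing the call to `ttts_choice` with a loop over `bts_batch(history, rng, batch_size)` would give a batch variant in which all selected models are evaluated before the belief is recomputed.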
5 Experiments
We now test our three algorithms on a challenging model selection task typical of NLP, selecting between eight Target Dependent Sentiment Analysis (TDSA) models based on their macro F1 score. We consider two variants of four re-implementations of well-known TDSA models: ATAE Wang et al. (2016), IAN Ma et al. (2017), TD-LSTM Tang et al. (2016) (without target words in the left and right LSTM), and a non-target-aware LSTM method used as the baseline in Tang et al. (2016).
These methods represent the state-of-the-art within TDSA, with only small differences in performance between TD-LSTM, IAN, and ATAE (see Figure 3). All the models are re-implemented in PyTorch Paszke et al. (2017) using AllenNLP Gardner et al. (2018). To ensure the only difference between the models is their network architecture, the models use the same optimiser settings and the same regularisation. All words are lower cased and we use the same 300-dimensional GloVe word embeddings trained on the 840B-token Common Crawl Pennington et al. (2014). We use variational Gal and Ghahramani (2016) and regular Hinton et al. (2012) dropout for regularisation and an ADAM Kingma and Ba (2014) optimiser with standard settings, a fixed batch size and a capped number of epochs (with early stopping on a validation set). Many of these settings are not the same as originally implemented; however, having the same training setup is required for fair comparison (this explains the differences between our results and the original implementations). To increase the difficulty of our model selection problem, we additionally create four extra models by reducing the dimensions of the GloVe vectors to 50 and removing dropout. Although these models are clearly not state-of-the-art, they increase the size of our candidate model set and so provide a more complicated model selection problem (an intuition discussed in Appendix A).
All of the TDSA experiments are conducted on the well-studied SemEval 2014 task 4 Restaurant dataset Pontiki et al. (2014) and we force train-val-test splits to follow the same ratios as this dataset's official train-test split. Each individual model evaluation is then made on a randomly generated train-test split and random seed to capture both sources of evaluation variability.
5.1 Fixed Budget Model Selection
We use the TDSA model selection problem to test fixed budget model selection. To thoroughly test our algorithm, we consider an additional four models based on 200 dimensional GloVe vectors, bringing the total number of models to 12. We compare our approach of sequential halving to the standard non-adaptive approach of splitting the available computational budget equally between the 12 candidate models. For example, we would allocate a budget of 24 model evaluations by evaluating each model two times and selecting the model with the highest sample mean.
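For concreteness, the non-adaptive baseline described above can be sketched as follows. The function name `non_adaptive_selection`, the `evaluate` callback and the toy scores are hypothetical and are not part of FIESTA.

```python
import random
import statistics


def non_adaptive_selection(models, evaluate, budget):
    """Spend the budget equally on every model and return the best sample mean."""
    per_model = budget // len(models)
    if per_model < 1:
        raise ValueError("budget must cover at least one evaluation per model")
    means = {m: statistics.fmean([evaluate(m) for _ in range(per_model)]) for m in models}
    return max(means, key=means.get), per_model


# Toy usage: 12 hypothetical models and a budget of two evaluations each.
rng = random.Random(0)
true_means = [0.60 + 0.01 * i for i in range(12)]
noisy = lambda m: rng.gauss(true_means[m], 0.02)
picked, per_model = non_adaptive_selection(list(range(12)), noisy, budget=24)
```

Every model receives the same number of evaluations regardless of how poorly it performs, which is the behaviour that sequential halving avoids by concentrating later evaluations on the surviving models.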
Figure 4 compares the proportion of runs of sequential halving that correctly identify the optimal model with the proportion identified by the non-adaptive approach with the same computational budget. Sequential halving identifies the optimal model more reliably than the current approach to FB model selection in NLP. At the larger budgets shown in Figure 4, sequential halving almost always selects the optimal model, whereas the non-adaptive approach is correct less often.
                     # evaluations with Non-Adaptive     # evaluations with TTTS
δ (1 − confidence)   min   mean   max    % correct       min   mean   max    % correct
0.05                 48    281    1552   100             27    130    518    100
0.1                  40    206    1192   99              24    96     460    99
0.2                  32    128    608    96              24    65     274    97
5.2 Fixed Confidence Model Selection
We perform fixed confidence model selection on the eight TDSA candidate models (the full models and those based on 50 dimensional vectors). We compare TTTS to a non-adaptive approach where all models are evaluated at each step, irrespective of the results of earlier evaluations (the standard approach for model selection in NLP). We run this non-adaptive approach until we reach the required confidence level calculated using the same Bayesian framework as in TTTS.
We run each approach repeatedly and note the number of evaluations required to reach a range of confidence levels (Table 2) alongside the proportion of runs that correctly identify the optimal model. TTTS requires substantially fewer model evaluations (in terms of the minimum, mean and maximum across our runs) to reach a given confidence level than the non-adaptive approach, achieving the same reliability at roughly half the cost (on average). TTTS is able to quickly identify suboptimal models and so can avoid wasting resources repeatedly evaluating the whole candidate set.
                     # evaluations with BTS4             # evaluations with BTS8
δ (1 − confidence)   min   mean   max    % correct       min   mean   max    % correct
0.05                 28    282    1392   100             88    315    1128   100
0.1                  24    144    520    100             56    178    784    100
0.2                  24    76     280    98              32    106    352    99
Finally, we test our proposed approach to batch FC model selection by running exactly the same experiment but using BTS to choose collections of four and eight models at a time (Table 3). As expected, performance degrades as we increase batch size, with batches of four allowing more fine-grained control over model evaluations than batches of eight. In particular, due to the exploitative nature of Thompson sampling, we see that selecting models to a very high confidence (95%) requires more computation with BTS than with the standard non-adaptive approach. BTS does, however, reach the other confidence levels faster and correctly identifies the optimal model more often. As TTTS performs significantly better across all confidence levels, we emphasise the need for a less exploitative version of BTS with adjustments similar to those used in TTTS.
6 Conclusions
The aim of this paper has been to propose three algorithms for model selection in NLP, providing efficient and reliable selection for two distinct realistic scenarios: fixed confidence and fixed budget model selection. Crucially, our research further calls into question the current practice in NLP evaluation as used in the literature and international competitions such as SemEval. Our algorithms adaptively allocate resources to evaluate promising models, basing evaluations on multiple random seeds and train-test splits. We demonstrate that this allows significant computational savings and improves reliability over current model selection approaches.
Although we have demonstrated that our algorithms perform well on a complex model selection problem typical of NLP, there is still work to be done to create algorithms more suited to these problems. Future research directions include making selection routines more robust to evaluation outliers, relaxing our Gaussian assumptions and developing more effective batch strategies.
7 Acknowledgements
The authors are grateful to reviewers, whose comments and advice have greatly improved this paper. The research was supported by an EPSRC Doctoral Training Grant and the STOR-i Centre for Doctoral Training. We thank Dr Chris Jewell at the Centre for Health Informatics, Computing, and Statistics, Lancaster University for the loan of an NVIDIA GP100-equipped workstation for this study.
References
 Arcuri and Briand (2014) Andrea Arcuri and Lionel Briand. 2014. A hitchhiker’s guide to statistical tests for assessing randomized algorithms in software engineering. Software Testing, Verification and Reliability, 24(3):219–250.
 Audibert and Bubeck (2010) Jean-Yves Audibert and Sébastien Bubeck. 2010. Best arm identification in multi-armed bandits. In COLT - 23th Conference on Learning Theory - 2010, pages 13–p.
 Berger (2013) James O Berger. 2013. Statistical decision theory and Bayesian analysis. Springer Science & Business Media.
 Dror et al. (2018) Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. The hitchhiker’s guide to testing statistical significance in natural language processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383–1392. Association for Computational Linguistics.
 Efron and Tibshirani (1994) Bradley Efron and Robert J Tibshirani. 1994. An introduction to the bootstrap. CRC press.

 Even-Dar et al. (2002) Eyal Even-Dar, Shie Mannor, and Yishay Mansour. 2002. PAC bounds for multi-armed bandit and Markov decision processes. In International Conference on Computational Learning Theory, pages 255–270. Springer.
 Friedman et al. (2001) Jerome Friedman, Trevor Hastie, and Robert Tibshirani. 2001. The elements of statistical learning. Springer Series in Statistics, New York, NY, USA.
 Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. 2016. A theoretically grounded application of dropout in recurrent neural networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1019–1027. Curran Associates, Inc.
 Gardner et al. (2018) Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke Zettlemoyer. 2018. AllenNLP: A deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), pages 1–6. Association for Computational Linguistics.
 Garivier and Kaufmann (2016) Aurélien Garivier and Emilie Kaufmann. 2016. Optimal best arm identification with fixed confidence. In Conference on Learning Theory, pages 998–1027.
 Haffari et al. (2017) Gholamreza Haffari, Tuan Dung Tran, and Mark Carman. 2017. Efficient benchmarking of NLP APIs using multi-armed bandits. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 408–416. Association for Computational Linguistics.
 Hinton et al. (2012) Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
 Honda and Takemura (2014) Junya Honda and Akimichi Takemura. 2014. Optimality of thompson sampling for gaussian bandits depends on priors. In Artificial Intelligence and Statistics, pages 375–383.
 Jamieson et al. (2013) Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sebastien Bubeck. 2013. On finding the largest mean among many. arXiv preprint arXiv:1306.3917.
 Jamieson et al. (2014) Kevin Jamieson, Matthew Malloy, Robert Nowak, and Sébastien Bubeck. 2014. lil’ UCB: An optimal exploration algorithm for multi-armed bandits. In Conference on Learning Theory, pages 423–439.
 Jamieson and Nowak (2014) Kevin Jamieson and Robert Nowak. 2014. Best-arm identification algorithms for multi-armed bandits in the fixed confidence setting. In 2014 48th Annual Conference on Information Sciences and Systems (CISS), pages 1–6. IEEE.
 Kandasamy et al. (2018) Kirthevasan Kandasamy, Akshay Krishnamurthy, Jeff Schneider, and Barnabás Póczos. 2018. Parallelised bayesian optimisation via thompson sampling. In International Conference on Artificial Intelligence and Statistics.

Karnin et al. (2013)
Zohar Karnin, Tomer Koren, and Oren Somekh. 2013.
Almost optimal
exploration in multiarmed bandits.
In
International Conference on Machine Learning
, pages 1238–1246.  Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 Kohavi (1995) Ron Kohavi. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2, pages 1137–1143. Morgan Kaufmann Publishers Inc.
 Lai and Robbins (1985) Tze Leung Lai and Herbert Robbins. 1985. Asymptotically efficient adaptive allocation rules. Advances in applied mathematics, 6(1):4–22.
 Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 260–270. Association for Computational Linguistics.
 Lawrence et al. (2017) Carolin Lawrence, Artem Sokolov, and Stefan Riezler. 2017. Counterfactual learning from bandit feedback under deterministic logging: A case study in statistical machine translation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2566–2576. Association for Computational Linguistics.
 Li et al. (2010) Lihong Li, Wei Chu, John Langford, and Robert E Schapire. 2010. A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web, pages 661–670. ACM.
 Ma et al. (2017) Dehong Ma, Sujian Li, Xiaodong Zhang, and Houfeng Wang. 2017. Interactive attention networks for aspect-level sentiment classification. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 4068–4074. AAAI Press.
 Ma and Hovy (2016) Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064–1074. Association for Computational Linguistics.
 Mannor and Tsitsiklis (2004) Shie Mannor and John N Tsitsiklis. 2004. The sample complexity of exploration in the multi-armed bandit problem. Journal of Machine Learning Research, 5(Jun):623–648.
 Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
 Moss et al. (2018) Henry Moss, David Leslie, and Paul Rayson. 2018. Using J-K-fold cross validation to reduce variance when tuning NLP models. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2978–2989. Association for Computational Linguistics.
 Nguyen et al. (2017) Khanh Nguyen, Hal Daumé III, and Jordan Boyd-Graber. 2017. Reinforcement learning for bandit neural machine translation with simulated human feedback. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1464–1474. Association for Computational Linguistics.
 Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
 Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. Association for Computational Linguistics.
 Pontiki et al. (2014) Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar. 2014. SemEval-2014 Task 4: Aspect based sentiment analysis. In Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), pages 27–35. Association for Computational Linguistics.
 Reimers and Gurevych (2017) Nils Reimers and Iryna Gurevych. 2017. Reporting score distributions makes a difference: Performance study of LSTM-networks for sequence tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 338–348. Association for Computational Linguistics.
 Reimers and Gurevych (2018) Nils Reimers and Iryna Gurevych. 2018. Why comparing single performance scores does not allow to draw conclusions about machine learning approaches. arXiv preprint arXiv:1803.09578.
 Russo (2016) Daniel Russo. 2016. Simple Bayesian algorithms for best arm identification. In Conference on Learning Theory, pages 1417–1418.
 Russo et al. (2018) Daniel J Russo, Benjamin Van Roy, Abbas Kazerouni, Ian Osband, Zheng Wen, et al. 2018. A tutorial on Thompson sampling. Foundations and Trends in Machine Learning, 11(1):1–96.
 SemEval (2018) SemEval. 2018. Proceedings of The 12th International Workshop on Semantic Evaluation. Association for Computational Linguistics, New Orleans, Louisiana.
 Sokolov et al. (2016) Artem Sokolov, Julia Kreutzer, Christopher Lo, and Stefan Riezler. 2016. Learning structured predictors from bandit feedback for interactive NLP. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1610–1620. Association for Computational Linguistics.
 Tang et al. (2016) Duyu Tang, Bing Qin, Xiaocheng Feng, and Ting Liu. 2016. Effective LSTMs for target-dependent sentiment classification. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 3298–3307. The COLING 2016 Organizing Committee.
 Villar et al. (2015) Sofía S Villar, Jack Bowden, and James Wason. 2015. Multi-armed bandit models for the optimal design of clinical trials: benefits and challenges. Statistical Science: a review journal of the Institute of Mathematical Statistics, 30(2):199.
 Wang et al. (2016) Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. 2016. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606–615. Association for Computational Linguistics.
 Yang et al. (2018) Jie Yang, Shuailong Liang, and Yue Zhang. 2018. Design challenges and misconceptions in neural sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3879–3889. Association for Computational Linguistics.
8 Appendix
Appendix A Characterising the Difficulty of a Model Selection Problem
We briefly summarise a result from the best-arm identification literature that gives a mechanism for characterising the difficulty of a model selection problem, and so provides intuition for our experiment section. Intuitively, model selection becomes harder as the set of candidate models grows and as the performance of suboptimal models approaches that of the optimal model (and so becomes harder to distinguish), i.e. as the gap in expected performance between the optimal arm and some suboptimal arm gets small. In fact, it is well known in the MAB literature that exactly these two properties characterise the complexity of a best-arm identification problem, confirming our intuition for model selection. Mannor and Tsitsiklis (2004) show that the number of arm pulls required to identify a best arm at a confidence level $1-\delta$ has at least a computational complexity of $\mathcal{O}(H\log(1/\delta))$, where