Extreme Algorithm Selection With Dyadic Feature Representation

01/29/2020 ∙ by Alexander Tornede, et al. ∙ Universität Paderborn

Algorithm selection (AS) deals with selecting, from a fixed set of candidate algorithms, the algorithm most suitable for a specific instance of an algorithmic problem, e.g., choosing solvers for SAT problems. Benchmark suites for AS usually comprise candidate sets consisting of at most tens of algorithms, whereas in combined algorithm selection and hyperparameter optimization problems the number of candidates becomes intractable, impeding the learning of effective meta-models and thus requiring costly online performance evaluations. Therefore, here we propose the setting of extreme algorithm selection (XAS), where we consider fixed sets of thousands of candidate algorithms, facilitating meta-learning. We assess the applicability of state-of-the-art AS techniques to the XAS setting and propose approaches leveraging a dyadic feature representation in which both problem instances and algorithms are described. We find the latter to improve significantly over the current state of the art in terms of various metrics.


1 Introduction

Algorithm selection (AS) refers to a specific recommendation task, in which the choice alternatives are algorithms: Given a set of candidate algorithms to choose from, and a specific instance of a problem class, such as SAT or integer optimization, the task is to select or recommend an algorithm that appears to be most suitable for that instance, in the sense of performing best in terms of criteria such as runtime, solution quality, etc. Hitherto, practical applications of AS, such as selecting a SAT solver for a logical formula, typically comprise candidate sets consisting of only a handful or at most tens of algorithms, and this is also the order of magnitude found in standard AS benchmark suites, such as ASlib [2].

On the contrary, especially in the field of automated machine learning (AutoML), combined algorithm selection and hyperparameter optimization problems are considered, where the number of candidates is potentially infinite [22, 6, 17]. However, these methods heavily rely on computationally expensive search procedures combined with costly online evaluations of the performance measure to optimize for, since learning effective meta-models for an instantaneous recommendation becomes infeasible.

To this end, we propose extreme algorithm selection (XAS) as a novel setting, which is characterized by an extremely large though still finite (and predefined) set of candidate algorithms. XAS achieves a compromise between standard AS and AutoML: While facilitating the learning of algorithm selectors, it is still amenable to AS techniques and allows for instantaneous recommendations. The XAS setting is especially motivated by application scenarios such as "On-the-fly computing" [11], in which algorithm selection needs to be supported, but costly search and online evaluations are not affordable.

In a sense, XAS relates to standard AS as the emerging topic of extreme classification (XC) [1] relates to standard multi-class classification. Similar to XC, the problem of learning from sparse data is a major challenge for XAS: For a single algorithm, there are typically only observations for a few instances. In this paper, we propose a benchmark dataset for XAS and investigate the ability of state-of-the-art AS approaches to deal with this sparsity and to scale with the size of candidate sets. Furthermore, to support more effective learning from sparse data, we propose methods based on "dyadic" feature representations, in which both problem instances and algorithms are represented in terms of feature vectors. In an extensive experimental study, we find these methods to yield significant improvements.

2 From Standard To Extreme Algorithm Selection

In the standard (per-instance) algorithm selection setting, first introduced in [19], we are interested in finding a mapping $s: \mathcal{I} \rightarrow \mathcal{A}$, called algorithm selector. Given an instance $i$ from the instance space $\mathcal{I}$, the latter selects the algorithm from a set of candidate algorithms $\mathcal{A} = \{a_1, \ldots, a_K\}$, optimizing a performance measure $m: \mathcal{I} \times \mathcal{A} \rightarrow \mathbb{R}$. Furthermore, $m$ is usually costly to evaluate. The optimal selector is called oracle and is defined as

$s^*(i) \in \arg\max_{a \in \mathcal{A}} \mathbb{E}\left[ m(i, a) \right]$  (1)

for all instances $i \in \mathcal{I}$. The expectation operator $\mathbb{E}$ accounts for any randomness in the application of the algorithm — in that case, the result of applying $a$ to $i$, and hence the values of the performance measure, are random variables.

Most AS approaches leverage machine learning techniques, in one way or another learning a surrogate (regression) model $\hat{m}: \mathcal{I} \times \mathcal{A} \rightarrow \mathbb{R}$, which is fast to evaluate and thus allows one to compute a selector by

$s(i) \in \arg\max_{a \in \mathcal{A}} \hat{m}(i, a)$  (2)

In order to infer such a model, we usually assume the existence of a set of training instances $\mathcal{I}_{\mathrm{train}} \subset \mathcal{I}$ for which we have instantaneous access to the associated performances of some or often all algorithms in $\mathcal{A}$ according to $m$.

The XAS setting dissociates itself from the standard AS setting by two main differences. Firstly, we assume that the set of candidate algorithms $\mathcal{A}$ is extremely large, so approaches need to be able to scale with such a large set. Secondly, due to the size of $\mathcal{A}$, we can no longer reasonably assume to have evaluations for each algorithm on each training instance. Instead, we assume that the training matrix spanned by the training instances and algorithms is only sparsely filled, i.e., we do not have a performance value for each algorithm on each training instance. In fact, we might even have algorithms without any evaluations at all. Hence, suitable approaches need to be able to learn from very little data and to make predictions for algorithms without any associated data at all.

3 Exploiting Instance Features

Instance-specific AS is based on the assumption that instances can be represented in terms of feature information. For this purpose, $f: \mathcal{I} \rightarrow \mathbb{R}^d$ denotes a function representing instances as $d$-dimensional, real-valued feature vectors, which can be exploited to learn a surrogate model (2). As explained in the following, this can be done based on different types of data and using different loss functions.

3.1 Regression

The most common approach is to tackle AS as a regression problem, i.e., to construct a regression dataset for each algorithm, where entries consist of an instance representation and the associated performance of the algorithm in question. Accordingly, the dataset associated with algorithm $a$ is $\mathcal{D}_a = \{ (f(i), m(i, a)) \}$, containing one tuple for every training instance $i$ for which a performance evaluation of $a$ exists. Using this dataset, a standard regression model per algorithm, such as a neural network or a random forest, can be learned under various loss functions, such as the root mean squared error or the absolute error, and used as a surrogate. For an overview of methods using this technique, we refer to Section 6.

This approach has two main disadvantages. Firstly, it is not well suited for the XAS setting, as it requires learning a huge number of surrogate models, one per algorithm. Although these models can usually be trained very quickly, the assumption of sparse training data in the XAS setting requires them to be learned from only a handful of training examples — it is not even uncommon to have algorithms without any performance value at all. Accordingly, the sparser the data, the more drastically this approach drops in performance, as will be seen in the evaluation in Section 5. Secondly, it requires precise real-valued evaluations of the measure $m$ as training information, which might be costly to obtain. In this regard, one may also wonder whether regression is not solving an unnecessarily difficult problem: Eventually, AS is only interested in finding the best algorithm for a given problem instance, or, more generally, in ranking the candidate algorithms in decreasing order of their expected performance. An accurate prediction of absolute performances is a sufficient but not a necessary condition for doing so.
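
To make the per-algorithm regression approach concrete, the following minimal sketch fits one random-forest surrogate per algorithm on its own (possibly very small) dataset and then selects via (2). All data and names here are illustrative and not taken from the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy setup: 20 instances with d = 5 features, 4 algorithms, and a sparse set
# of observed performances m(i, a); all values are synthetic.
rng = np.random.default_rng(0)
F = rng.random((20, 5))                          # instance features f(i)
observed = {(i, a): rng.random()                 # sparse performance observations
            for i in range(20) for a in range(4) if rng.random() < 0.5}

# One regression surrogate per algorithm, trained on its own dataset D_a.
surrogates = {}
for a in range(4):
    rows = [(F[i], v) for (i, b), v in observed.items() if b == a]
    if rows:                                     # algorithms without any data get no model
        X = np.array([x for x, _ in rows])
        y = np.array([v for _, v in rows])
        surrogates[a] = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

def select(f_i):
    """Selector (2): pick the algorithm with the highest predicted performance."""
    scores = {a: model.predict(f_i.reshape(1, -1))[0] for a, model in surrogates.items()}
    return max(scores, key=scores.get)

print(select(rng.random(5)))
```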

3.2 Ranking

As an alternative to regression, one may therefore think of tackling AS as a ranking problem. More specifically, the counterpart of the regression approach outlined above is called label ranking (LR) in the literature [25]. Label ranking deals with learning to rank choice alternatives (referred to as "labels") based on given contexts represented by feature information. In the setting of AS, contexts and labels correspond to instances and algorithms, respectively. The type of training data assumed in LR consists of rankings associated with training instances $i$, that is, order relations of the form $a_{\pi(1)} \succ_i a_{\pi(2)} \succ_i \ldots \succ_i a_{\pi(L)}$, in which $\succ_i$ denotes an underlying preference relation; thus, $a \succ_i a'$ means that, for instance $i$ represented by features $f(i)$, algorithm $a$ is preferred to (better than) algorithm $a'$. If $i$ is clear from the context, we also represent the ranking by $a_{\pi(1)} \succ \ldots \succ a_{\pi(L)}$. Compared to the case of regression, a ranking dataset of this form can be constructed more easily, as it only requires qualitative comparisons between algorithms instead of real-valued performance estimates.

A common approach to label ranking is based on the so-called Plackett-Luce (PL) model [4], which specifies a parameterized probability distribution on rankings over labels (i.e., algorithms in our case). The underlying idea, similar to before, is to associate each algorithm $a_k$ with a latent utility function $v_k$ of a context (i.e., an instance), which estimates how well an algorithm is suited for a given instance. The functions $v_k$ are usually modeled as log-linear functions

$v_k(i) = \exp\left( \theta_k^\top f(i) \right)$  (3)

where $\theta_k$ is a real-valued, $d$-dimensional vector, which has to be fit for each algorithm $a_k$. The PL model offers a natural probabilistic interpretation: given an instance $i$, the probability of a ranking $a_{\pi(1)} \succ \ldots \succ a_{\pi(L)}$ over any subset $\{a_{\pi(1)}, \ldots, a_{\pi(L)}\} \subseteq \mathcal{A}$ is

$\mathbb{P}\left( a_{\pi(1)} \succ \ldots \succ a_{\pi(L)} \mid i \right) = \prod_{l=1}^{L} \frac{v_{\pi(l)}(i)}{v_{\pi(l)}(i) + v_{\pi(l+1)}(i) + \ldots + v_{\pi(L)}(i)}$  (4)

A probabilistic model of that kind suggests learning the parameter matrix $\Theta = (\theta_1, \ldots, \theta_K)$ via maximum likelihood estimation, i.e., by maximizing the likelihood function associated with (4); this approach is explained in detail in [4]. Hence, the associated loss function under which we learn is now of a probabilistic nature (the logarithm of the PL probability). It no longer focuses on the difference between the approximated performance $\hat{m}(i, a)$ and the true performance $m(i, a)$, but on the ranking of the algorithms with respect to $m$ — adopting preference learning terminology, the former is a "pointwise" while the latter is a "listwise" method for learning to rank [3].

This approach potentially overcomes the second problem explained for the case of regression, but not the first one: It still fits one model per algorithm (the parameter vector $\theta_k$), which essentially disqualifies it for the XAS setting.
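
For illustration, the following sketch (toy data; the parameter matrix is random rather than fitted) evaluates the log-linear utilities (3) and the logarithm of the Plackett-Luce probability (4) for a given ranking; maximum likelihood estimation would maximize the sum of such log-probabilities over all training rankings.

```python
import numpy as np

def pl_log_probability(theta, f_i, ranking):
    """Log of the PL probability (4) of `ranking` (algorithm indices, best first)
    for an instance with features f_i, using the log-linear utilities (3)."""
    utilities = np.exp(theta[ranking] @ f_i)     # v_k(i) = exp(theta_k^T f(i)), in ranking order
    log_p = 0.0
    for l in range(len(ranking)):                # product over positions, taken in log space
        log_p += np.log(utilities[l]) - np.log(utilities[l:].sum())
    return log_p

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 5))                  # one d-dimensional parameter vector per algorithm
f_i = rng.random(5)                              # instance features f(i)
print(pl_log_probability(theta, f_i, np.array([2, 0, 3])))   # ranking a_2 > a_0 > a_3
```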

3.3 Collaborative Filtering

This may suggest yet another approach, namely the use of collaborative filtering (CF) [9], originally proposed for the setting of AS by [21]. In CF for AS, we assume a (usually sparse) performance matrix $R \in \mathbb{R}^{N \times K}$ spanned by $N$ training instances and $K$ algorithms, where an entry $R_{n,k}$ corresponds to the performance of algorithm $a_k$ on instance $i_n$ according to $m$ if known, and is missing otherwise. CF methods were originally designed for large-scale settings, where products (e.g., movies) are recommended to users, and data to learn from is sparse. Hence, they appear to fit well for our XAS setting.

Similar to regression and ranking, model-based CF methods also learn a latent utility function. They do so by applying matrix factorization techniques to the performance matrix $R$, trying to decompose it into matrices $U \in \mathbb{R}^{N \times t}$ and $V \in \mathbb{R}^{K \times t}$ w.r.t. some loss function $\mathcal{L}$, such that

$R \approx U V^\top$  (5)

where the rows of $U$ ($V$) can be interpreted as latent features of the instances (algorithms), and $t$ is the number of latent features. Accordingly, the latent utility of a known algorithm $a_k$ for a known instance $i_n$ can be computed as

$\hat{m}(i_n, a_k) = U_n V_k^\top$  (6)

even if the associated entry $R_{n,k}$ is unknown in the performance matrix used for training. The loss function $\mathcal{L}$ depends on the exact approach used — examples include the root mean squared error and the absolute error, restricted by some regularization term to avoid overfitting. [16] suggest a CF approach called Alors, which we will use in our experiments later on. It can deal with unknown instances by learning a feature map from the original instance feature space to the latent instance feature space. Alors makes use of the CF approach CoFiRank [27], using the normalized discounted cumulative gain (NDCG) [26] as loss function $\mathcal{L}$. Since the NDCG is a ranking loss, it focuses on decomposing the matrix so as to produce an accurate ranking of the algorithms. More precisely, it uses an exponentially decaying weight function for ranks, such that more emphasis is put on the top and less on the bottom ranks. Hence, it seems particularly well suited for our use case.
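
As a minimal illustration of the decomposition (5) and the utility estimate (6), the following sketch fits latent factors to the observed entries of a sparse performance matrix by gradient steps on a regularized squared loss. This is not the Alors/CoFiRank implementation (which optimizes NDCG); data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, t = 30, 50, 5                      # instances, algorithms, latent dimension
R = rng.random((N, K))                   # synthetic "true" performance matrix
mask = rng.random((N, K)) < 0.1          # only ~10% of the entries are observed

U = 0.1 * rng.standard_normal((N, t))    # latent instance features
V = 0.1 * rng.standard_normal((K, t))    # latent algorithm features

lr, reg = 0.05, 0.01
for epoch in range(200):
    E = mask * (R - U @ V.T)             # reconstruction error on observed entries only
    U += lr * (E @ V - reg * U)          # gradient steps on the regularized squared loss
    V += lr * (E.T @ U - reg * V)

# Latent utility (6) of algorithm k for instance n, even if R[n, k] was never observed.
n, k = 0, 3
print(U[n] @ V[k])
```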

4 Dyadic Feature Representation

As discussed earlier, by leveraging instance features, or learning such a representation as in the case of Alors, the approaches presented in the previous section can generalize over instances. Yet, none of them scales well to the XAS setting, as they do not generalize over algorithms; instead, the models are algorithm-specific and trained independently of each other. For the approaches presented earlier (except for Alors), this not only results in a large number of models but also requires these models to be trained on very little data. Furthermore, it is not uncommon to have algorithms without any observations. A natural idea, therefore, is to leverage feature information on algorithms as well.

More specifically, we use a feature function $g: \mathcal{A} \rightarrow \mathbb{R}^k$ representing algorithms as $k$-dimensional, real-valued feature vectors. Then, instead of learning one latent utility model per algorithm, the joint feature representation of a "dyad" consisting of an instance $i$ and an algorithm $a$ allows us to learn a single joint model

$\hat{m}: \mathcal{I} \times \mathcal{A} \rightarrow \mathbb{R}$  (7)

and hence to estimate the performance of a given algorithm on a given instance in terms of $\hat{m}(i, a)$.

4.1 Regression

With the additional feature information at hand, instead of constructing one dataset per algorithm, we resort to a single joint dataset $\mathcal{D} = \{ (\phi(f(i), g(a)), m(i, a)) \}$ comprised of examples with dyadic feature information for all instances $i$ and algorithms $a$ for which a performance value is known. Here,

$\phi: \mathbb{R}^d \times \mathbb{R}^k \rightarrow \mathbb{R}^{d'}$  (8)

is a joint feature map that defines how the instance and algorithm features are combined into a single feature representation of a dyad. What is sought, then, is a (parametrized) latent utility function $\hat{v}_\theta: \mathbb{R}^{d'} \rightarrow \mathbb{R}$, such that

$\hat{m}(i, a) = \hat{v}_\theta\left( \phi(f(i), g(a)) \right)$  (9)

is an estimation of the performance of algorithm $a$ on instance $i$. Obviously, the choice of $\phi$ will have an important influence on the difficulty of the regression problem and the quality of the model (9) induced from the data $\mathcal{D}$. The regression task itself comes down to learning the parameter vector $\theta$. In principle, this can be done exactly as in Section 3.1, also using the same loss function.
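
The following sketch shows the dyadic regression idea with concatenation as the joint feature map (8) and a random forest as the utility model, similar in spirit to the DFReg approach evaluated later; the data and all names are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
F = rng.random((20, 5))                  # instance features f(i), d = 5
G = rng.random((8, 3))                   # algorithm features g(a), k = 3

def phi(f_i, g_a):
    """Joint feature map (8): simple concatenation of the two representations."""
    return np.concatenate([f_i, g_a])

# Sparse observations m(i, a) for a subset of dyads (synthetic values).
obs = [(i, a, rng.random()) for i in range(20) for a in range(8) if rng.random() < 0.3]
X = np.array([phi(F[i], G[a]) for i, a, _ in obs])
y = np.array([m for _, _, m in obs])

# A single joint model (9) trained on all dyads at once.
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Performance estimate for any dyad; the model can also score algorithms that
# contributed few or no observations to the training data.
print(model.predict(phi(F[0], G[7]).reshape(1, -1))[0])
```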

4.2 Ranking

A similar adaptation can be made for the (label) ranking approach presented in Section 3.2 [23]. Formally, this corresponds to a transition from the setting of label ranking to the setting of dyad ranking (DR) as recently proposed in [20]. The first major change in comparison to the setting of label ranking concerns the training data, where the rankings over subsets of algorithms for an instance $i$ are now of the form

$\phi(f(i), g(a_{\pi(1)})) \succ \phi(f(i), g(a_{\pi(2)})) \succ \ldots \succ \phi(f(i), g(a_{\pi(L)}))$  (10)

Thus, we no longer represent an algorithm $a$ simply by its label but by features $g(a)$. Furthermore, like in the case of regression, we no longer learn one latent utility function per algorithm, but a single model of the form (9) based on a dyadic feature representation. In particular, we model $\hat{v}_\theta$ as a feed-forward neural network, where $\theta$ represents its weights, which, as shown in [20], can be learned via maximum likelihood estimation on the likelihood function implied by the underlying PL model.

In contrast to the methods presented in the previous section, the methods based on dyadic feature information are capable of assigning a utility to unknown algorithms. Thus, they are well suited for the XAS setting and in principle even applicable when $\mathcal{A}$ is infinite, as long as a suitable feature representation is available. Furthermore, as demonstrated empirically in Section 5, the dyadic feature approaches are very well suited for dealing with the sparse performance matrices that are typical of the XAS setting.
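
A minimal sketch of the dyad ranking idea, simplified in two ways compared to the method described above: the utility model is linear rather than a feed-forward neural network, and training uses the two-item special case of the PL likelihood (pairwise comparisons), matching the pairwise training data used in our experiments. All data and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_phi = 8                                       # dimension of the joint dyad representation phi
theta = np.zeros(d_phi)                         # linear scorer (the paper uses a neural network)

# Training data: (winner, loser) pairs of dyad feature vectors, one pair per comparison.
pairs = [(rng.random(d_phi), rng.random(d_phi)) for _ in range(200)]

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Two-item PL likelihood: P(x_w beats x_l) = sigmoid(theta^T x_w - theta^T x_l).
# Maximize the log-likelihood over all pairs by gradient ascent.
lr = 0.1
for epoch in range(100):
    grad = np.zeros(d_phi)
    for x_w, x_l in pairs:
        p = sigmoid(theta @ (x_w - x_l))
        grad += (1.0 - p) * (x_w - x_l)
    theta += lr * grad / len(pairs)

# Rank candidate dyads for a new instance by their learned utility.
candidates = rng.random((5, d_phi))
print(np.argsort(-candidates @ theta))          # indices from best to worst
```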

5 Experimental Evaluation

In our experiments, we evaluate well-established state-of-the-art approaches to algorithm selection as well as the proposed dyadic approaches in the XAS setting. More specifically, we consider the problem of selecting a machine learning classifier (algorithm) for a new classification dataset (instance). To this end, we first generate a benchmark and then use this benchmark for comparison. The generated benchmark dataset as well as the implementation of the approaches, including detailed documentation, is provided on GitHub: https://github.com/alexandertornede/extreme_algorithm_selection.

5.1 Benchmark Dataset

In order to benchmark the generalization performance of the approaches presented above in the XAS setting, we consider the domain of machine learning. More precisely, the task is to select a classification algorithm for an (unseen) dataset. Therefore, a finite set $\mathcal{A}$ of classification algorithms and a set of instances $\mathcal{I}$ corresponding to classification datasets need to be specified. Furthermore, a performance measure $m$ is to be defined to score the algorithms' performance.

The set of candidate algorithms $\mathcal{A}$ is defined by sampling up to 100 different parameterizations of 18 classification algorithms stemming from the Java machine learning library WEKA [7], ensuring that these parameterizations are not too similar. An overview of the algorithms, their parameters, and the number of instantiations contained in $\mathcal{A}$ is given in Table 1. This yields $|\mathcal{A}| = 1,270$ algorithms in total. The last row of the table sums up the items of the respective column, providing insight into the dimensionality of the space of potential candidate algorithms.

The set of instances $\mathcal{I}$ is taken from the OpenML CC-18 benchmarking suite (https://docs.openml.org/benchmark/#openml-cc18; excluding datasets 554, 40923, 40927, and 40996 due to technical issues) [24], which is a curated collection of various classification datasets that are considered interesting from a model selection resp. hyperparameter optimization point of view. This property makes the datasets particularly appealing for the XAS benchmark dataset, as it ensures more diversity across the algorithms.

In the domain of machine learning, one is usually more interested in the generalization performance of an algorithm than in its runtime. Therefore, $m$ is chosen to assess the solution performance of an algorithm. To this end, we carry out a 5-fold cross validation and measure the mean accuracy across the folds. As the measure of interest, accuracy is a reasonable though to some extent arbitrary choice. Note that in principle any other measure could have been used for generating the benchmark as well.
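
The performance values $m(i, a)$ are thus mean accuracies over a 5-fold cross validation. The snippet below illustrates this recipe with scikit-learn; the actual benchmark evaluates WEKA classifiers, so the classifier and dataset here are only stand-ins.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# m(i, a): mean accuracy of algorithm a on dataset i across a 5-fold cross validation.
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, random_state=0)   # stand-in for a WEKA learner
performance = cross_val_score(clf, X, y, cv=5, scoring="accuracy").mean()
print(round(performance, 4))
```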

The standard deviation of the performance values per dataset (its average, minimum, and maximum across datasets) shows the instances to be quite heterogeneous.

Learner   #num.P   #cat.P   n
0R        0        0        1
1R        1        0        30
BN        0        2        12
DS        0        0        1
DT        1        3        45
IBk       1        3        89
J48       2        6        100
JR        2        2        100
KS        1        2        99
L         1        0        100
LMT       2        5        100
MP        2        6        100
NB        0        2        3
PART      2        2        91
REPT      3        2        100
RF        3        2        99
RT        4        4        100
SMO       1        2        100
Σ         26       43       1270

Table 1: The table shows the types of classifiers used to derive the set $\mathcal{A}$. Additionally, the number of numeric parameters (#num.P), categorical parameters (#cat.P), and instantiations (n) is shown.

Training data for CF and regression-based approaches can then be obtained by using the performance values as labels. In contrast, for training ranking approaches, the data is labeled with rankings derived by ordering the algorithms in descending order with respect to their performance values. Note that, as a consequence, information about the exact performance values themselves is lost in ranking approaches.

Instance Features. For the setting of machine learning, the instances are classification datasets, and associated feature representations are called meta-features [18]. To derive a feature description of the datasets, we make use of a specific subclass of meta-features called landmarkers, which are performance scores of cheap-to-validate algorithms on the respective dataset. More specifically, we use all 45 landmarkers as provided by OpenML [24], for which different configurations of the following learning algorithms are evaluated based on the error rate, area under the (ROC) curve, and Kappa coefficient: Naive Bayes, One-Nearest Neighbour, Decision Stump, Random Tree, REPTree and J48.

Algorithm Features. The presumably most straightforward way of representing an algorithm in terms of a feature vector is to use the values of its hyperparameters. Thus, we can describe each individual algorithm by a vector of its hyperparameter values. Based on this, the general feature description is obtained by concatenating these vectors. Due to the way the set of candidate algorithms $\mathcal{A}$ was generated, we can compress the representation by sharing features among algorithms of the same type. Additionally, we augment the vector by a single categorical feature denoting the type of algorithm. Given any candidate algorithm, its feature representation is obtained by setting the algorithm type indicator feature to its type, setting each element of the vector corresponding to one of its hyperparameters to the respective value, and leaving all other entries at a constant default value.
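
The following sketch illustrates the described algorithm feature construction under a small hypothetical hyperparameter schema (the learner names and parameters are made up, and the type indicator is one-hot encoded here for simplicity): one slot per hyperparameter of each algorithm type, with slots belonging to other types left at a default value.

```python
import numpy as np

# Hypothetical schema: for each algorithm type, the hyperparameter slots it owns.
SCHEMA = {
    "J48": ["C", "M"],           # e.g. confidence factor, min. instances per leaf
    "RF":  ["I", "K", "depth"],  # e.g. number of trees, features per split, max depth
    "SMO": ["C"],                # e.g. complexity constant
}
TYPES = sorted(SCHEMA)
SLOTS = [(t, p) for t in TYPES for p in SCHEMA[t]]   # concatenated, type-specific slots

def algorithm_features(alg_type, params, default=0.0):
    """Type indicator followed by one entry per hyperparameter slot; slots of
    other algorithm types keep the default value."""
    type_indicator = [1.0 if t == alg_type else 0.0 for t in TYPES]
    values = [float(params.get(p, default)) if t == alg_type else default
              for t, p in SLOTS]
    return np.array(type_indicator + values)

print(algorithm_features("RF", {"I": 100, "K": 3, "depth": 10}))
```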

5.2 Baselines

To better relate the performance of the different approaches to each other and to the problem itself, we employ various baselines. While RandomRank assigns ranks to algorithms simply at random, AvgPerformance first averages, for each candidate algorithm, the observed performance values and predicts the ranking according to those average performances. $k$-NN LR retrieves the $k$ nearest neighbors from the training data, averages their performances, and predicts the ranking induced by these average performances. Since AvgRank is commonly used as another baseline in the standard AS setting, we note that we omit this baseline on purpose. This is because meaningful average ranks of algorithms are difficult to compute in the XAS setting, where the number of algorithms evaluated, and hence the length of the rankings of algorithms, varies from dataset to dataset.

5.3 Experimental Setup

In the following experiments, we investigate the performance of the different approaches and baselines in the setting of XAS for the example of the proposed benchmark dataset as described in Section 5.1.

We conduct a 10-fold cross validation to divide the dataset into 9 folds of known and 1 fold of unknown instances. From the resulting set of known performance values, we then draw a sample of 25, 50, or 125 pairs of algorithms for every instance, under the constraint that the performances of the two algorithms in a pair are not identical. Thus, a maximum fill degree of 4%, 8%, or 20% of the performance matrix, respectively, is used for training, as algorithms may occur more than once in the sampled pairs. The small number of training examples is motivated by the large number of algorithms in the XAS setting; the assumption that performance values are only available for a small subset of the algorithms is clearly plausible here. Throughout the experiments, we ensure that all approaches are provided the same instances for training and testing, and that the label information is at least based on the very same performance values.
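
The pair sampling described above can be sketched as follows (illustrative code, not the benchmark implementation): for each training instance, algorithm pairs are drawn until the requested number of pairs with non-identical performances is reached.

```python
import numpy as np

def sample_training_pairs(perf_row, n_pairs, rng):
    """Sample (winner, loser) algorithm pairs for one instance such that the
    two performances in a pair differ. perf_row holds the performances of all
    algorithms on this instance."""
    pairs = []
    n_alg = len(perf_row)
    while len(pairs) < n_pairs:
        a, b = rng.choice(n_alg, size=2, replace=False)
        if perf_row[a] != perf_row[b]:           # constraint from the setup above
            pairs.append((a, b) if perf_row[a] > perf_row[b] else (b, a))
    return pairs

rng = np.random.default_rng(0)
perf_row = rng.random(1270)                      # toy performances of all algorithms on one instance
print(sample_training_pairs(perf_row, 25, rng)[:3])
```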

In the experiments, we compare various models with each other. This includes two versions of Alors, namely Alors (REGR) and Alors (NDCG), optimizing for a regression and a ranking loss, respectively. Furthermore, we consider a state-of-the-art regression approach learning a RandomForest regression model per algorithm (PAReg). Note that for those algorithms for which no training data is available at all, we make PAReg predict a pessimistic default performance, as recommending an algorithm about which nothing is known seems unreasonable. Lastly, we consider two approaches leveraging a dyadic feature representation, internally fitting either a RandomForest for regression (DFReg) or a feed-forward neural network for ranking (DR). For both dyadic approaches, the simple concatenation of instance and algorithm features is used as the joint feature map $\phi$. In contrast to the other methods, the ranking model is only provided the information which algorithm of a sampled pair performs better, as opposed to the exact performance value that is given to the other methods. A summary of which approach/baseline uses which type of features and label information is given in Table 2.

Approaches: Alors (REGR), Alors (NDCG), PAReg, DFReg, DR
Baselines: RandomRank, AvgPrfm, AvgRank, $k$-NN LR

Table 2: Overview of the type of features and label information provided to each approach and baseline, and of their applicability to the considered scenarios.

The test performance of the approaches is evaluated by sampling a subset of algorithms for every (unknown) instance to test for. The comparison is done with respect to different metrics detailed further below, and the outlined sampling evaluation routine is repeated multiple times.

Statistical significance w.r.t. performance differences between the best method and any other method is determined by a Wilcoxon rank-sum test with a threshold of 0.05 for the p-value. Significant improvements of the best method over another one are marked accordingly in the result tables.
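
The significance test can be reproduced with SciPy's implementation of the Wilcoxon rank-sum test; the per-repetition scores below are synthetic and only illustrate the procedure.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
best_method_scores = rng.normal(0.45, 0.02, size=100)    # e.g. Kendall's tau per repetition
other_method_scores = rng.normal(0.38, 0.02, size=100)

statistic, p_value = ranksums(best_method_scores, other_method_scores)
print(round(p_value, 4), p_value < 0.05)                 # significant at the 0.05 threshold
```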

Experiments were run on nodes with two Intel Xeon Gold 6148 "Skylake" CPUs with 20 cores each and 192 GB RAM.

5.4 Performance Metrics

On the test data, we compute the following performance metrics measuring desirable properties of XAS approaches.

regret@$k$ is the difference between the performance value of the actual best algorithm and that of the best algorithm within the predicted top-$k$ algorithms. The domain of regret@$k$ is $[0, 1]$, where 0 is the optimum, meaning no regret.

NDCG@$k$ is a position-dependent ranking measure (normalized discounted cumulative gain) quantifying how well the ranking of the top-$k$ algorithms can be predicted. It is defined as

$\mathrm{NDCG@}k(\pi, i) = \frac{\mathrm{DCG@}k(\pi, i)}{\mathrm{DCG@}k(\pi^*, i)}$, with $\mathrm{DCG@}k(\pi, i) = \sum_{l=1}^{k} \frac{2^{m(i, \pi^{-1}(l))} - 1}{\log_2(l + 1)}$,

where $i$ is a (fixed) instance, $\pi$ is a ranking and $\pi^*$ the optimal ranking, and $\pi^{-1}(l)$ gives the algorithm on rank $l$ in ranking $\pi$. The NDCG emphasizes correctly assigned ranks at the top positions, with the importance of positions decaying towards the bottom ranks. NDCG ranges in $[0, 1]$, where $1$ is the optimal value.

Approach         | 4% fill / 25 pairs                     | 8% fill / 50 pairs                     | 20% fill / 125 pairs
                 | τ       N@3     N@5     R@1     R@3    | τ       N@3     N@5     R@1     R@3    | τ       N@3     N@5     R@1     R@3
PAReg            | 0.1712  0.9352  0.9433  0.0601  0.0185 | 0.2537  0.9453  0.9594  0.0493  0.0136 | 0.3003  0.9525  0.9632  0.0395  0.0107
Alors (NDCG)     | 0.0504  0.9205  0.9223  0.0686  0.0225 | 0.0472  0.9155  0.9164  0.0614  0.0208 | 0.0540  0.9220  0.9242  0.0542  0.0228
Alors (REGR)     | 0.0303  0.9117  0.9191  0.0794  0.0190 | 0.0807  0.9172  0.9304  0.0754  0.0285 | 0.1039  0.9160  0.9329  0.0604  0.0222
DR               | 0.3445  0.9523  0.9604  0.0381  0.0089 | 0.3950  0.9584  0.9685  0.0322  0.0087 | 0.4507  0.9696  0.9715  0.0241  0.0055
DFReg            | 0.3819  0.9564  0.9652  0.0302  0.0079 | 0.3692  0.9573  0.9661  0.0300  0.0123 | 0.4264  0.9629  0.9720  0.0292  0.0071
RandomRank       | -0.0038 0.8933  0.9105  0.0878  0.0272 | -0.0038 0.8933  0.9105  0.0878  0.0272 | -0.0038 0.8933  0.9105  0.0878  0.0272
AvgPerformance   | 0.1384  0.9388  0.9433  0.0337  0.0090 | 0.2083  0.9355  0.9508  0.0493  0.0199 | 0.2541  0.9437  0.9536  0.0523  0.0084
1-NN LR          | 0.1227  0.9290  0.9310  0.0733  0.0230 | 0.1059  0.9246  0.9296  0.0564  0.0209 | 0.1152  0.9245  0.9318  0.0594  0.0249
2-NN LR          | 0.1303  0.9278  0.9310  0.0642  0.0193 | 0.0874  0.9269  0.9343  0.0541  0.0206 | 0.1142  0.9292  0.9350  0.0412  0.0176

Table 3: Averaged results for the performance metrics Kendall's tau (τ), NDCG@k (N@3, N@5), and regret@k (R@1, R@3) for a varying number of performance value pairs used for training (4%/25, 8%/50, and 20%/125 pairs per instance).

Kendall’s is a rank correlation measure. Given two rankings (over the same set of elements) and , it is defined as

(11)

where / is the number of so-called concordant/discordant pairs in the two rankings, and / is the number of ties in /. Two elements are called a concordant/discordant pair if their order within the two rankings is identical/different, and tied in ranking if they are on the same rank. Intuitively, this measure determines on how many pairs of elements the two rankings coincide. It take values in , where means uncorrelated, means inversely, and perfectly correlated.
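
The three metrics can be computed per test instance as sketched below (illustrative code; the NDCG variant uses the gain and discount from the definition above, and Kendall's tau is computed with SciPy's tau-b implementation, which matches (11)).

```python
import numpy as np
from scipy.stats import kendalltau

def regret_at_k(true_perf, predicted_ranking, k):
    """Best overall performance minus the best performance among the top-k
    algorithms of the predicted ranking."""
    return true_perf.max() - true_perf[predicted_ranking[:k]].max()

def ndcg_at_k(true_perf, predicted_ranking, k):
    """NDCG@k with exponential gain and logarithmic position discount."""
    def dcg(ranking):
        gains = 2.0 ** true_perf[ranking[:k]] - 1.0
        discounts = np.log2(np.arange(2, len(gains) + 2))
        return (gains / discounts).sum()
    return dcg(predicted_ranking) / dcg(np.argsort(-true_perf))

def ranking_to_ranks(ranking):
    """Map a ranking (item indices, best first) to the rank position of each item."""
    ranks = np.empty(len(ranking), dtype=int)
    ranks[ranking] = np.arange(len(ranking))
    return ranks

true_perf = np.array([0.81, 0.92, 0.67, 0.88, 0.75])   # toy accuracies of 5 algorithms
predicted = np.array([3, 1, 0, 4, 2])                  # predicted ranking, best first

print(regret_at_k(true_perf, predicted, k=1))
print(ndcg_at_k(true_perf, predicted, k=3))
tau, _ = kendalltau(ranking_to_ranks(predicted), ranking_to_ranks(np.argsort(-true_perf)))
print(tau)
```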

5.5 Results

The results of the experiments are shown in Table 3. It is clear from the table that the methods for standard algorithm selection tend to fail, especially in the scenarios with only few algorithm performance values per instance. This holds for the approach of building a distinct regression model for each algorithm (PAReg) as well as for the collaborative filtering approach Alors, independently of the loss optimized for, even though the NDCG variant has a slight edge over the regression one. Moreover, Alors even fails to improve over simple baselines, such as AvgPerformance and $k$-NN LR. With an increasing number of training examples, PAReg improves over the baselines and also performs better than Alors, but never yields the best performance for any of the considered settings or metrics.

In contrast to this, the proposed dyadic feature approaches clearly improve over both the methods for the standard AS setting and the considered baselines for all the metrics. Interestingly, DFReg performs best for the setting with only 25 performance value pairs, while DR has an edge over DFReg for the other two settings. Still, the differences between the dyadic feature approaches are never significant, whereas significant improvements can be achieved in comparison to the baselines and the other AS approaches.

The results of our study show that models with strong generalization performance can be obtained despite the small number of training examples. Moreover, the results suggest that there is a need for the development of specific methods addressing the characteristics of the XAS setting. This concerns the large number of different candidate algorithms as well as the sparsity of the training data. Future work will include the consideration of other problem domains, such as selecting SAT or TSP solvers, and an analysis of shortcomings compared to the combined algorithm selection and hyperparameter optimization (CASH) setting.

6 Related Work

As most closely related work, we subsequently highlight several AS approaches to learning hidden utility functions. For an up-to-date survey, we refer to [13].

A prominent example of a method learning a regression-based hidden utility function is [28], which features an empirical hardness model per algorithm for estimating the runtime of an algorithm, i.e., its performance, for a given instance based on a ridge regression approach in the setting of SAT solver selection. Similarly, [14] learn per-algorithm hardness models using statistical (non-)linear regression models for algorithms solving the winner determination problem. Depending on whether a given SAT instance is presumably satisfiable or not, conditional runtime prediction models are learned in [10] using ridge linear regression.

In [5], a label-ranking-based AS approach for selecting collaborative filtering algorithms in the context of recommender systems is presented, and various label ranking algorithms are compared, including nearest neighbor and random forest label rankers. Similarly, [12] use a multi-layer perceptron powered label ranker to select meta-heuristics for solving TSP instances.

AS was modeled as a CF problem for the first time in [21], making use of a probabilistic matrix factorization technique to select algorithms for constraint solving problems. Assuming a complete performance matrix for training, low-rank latent factors are learned in [15] using singular value decomposition to obtain a selector on par with the oracle. Lastly, [8] demonstrate how probabilistic matrix factorization can be used to build and select machine learning pipelines.

7 Conclusion

In this paper, we introduced the extreme algorithm selection (XAS) setting and investigated the scalability of various algorithm selection approaches in this setting. To this end, we defined a benchmark based on the OpenML CC-18 benchmark suite for classification and a set of more than 1,200 candidate algorithms. Furthermore, we proposed the use of dyadic approaches, specifically dyad ranking, taking into account feature representations of both problem instances (datasets) and algorithms, which allows them to work with very little training data. In an extensive evaluation, we found that the approaches exploiting dyadic feature representations perform particularly well according to various metrics on the proposed benchmark and outperform other state-of-the-art AS approaches developed for the standard AS setting.

The currently employed algorithm features allow for solving the cold start problem only to a limited extent, i.e., only algorithms whose hyperparameters are already known can be considered as new candidate algorithms. Investigating features to describe completely new algorithms is a key requirement for the approaches considered in this paper, and therefore an important direction for future work.

Acknowledgements

This work was supported by the German Research Foundation (DFG) within the Collaborative Research Center "On-The-Fly Computing" (SFB 901/3, project no. 160364472). The authors gratefully acknowledge the funding of this project by computing time provided by the Paderborn Center for Parallel Computing (PC²).

References

  • [1] S. Bengio, K. Dembczynski, T. Joachims, M. Kloft, and M. Varma (2019) Extreme classification (Dagstuhl Seminar 18291).
  • [2] B. Bischl, P. Kerschke, L. Kotthoff, M. Lindauer, Y. Malitsky, A. Fréchette, H. H. Hoos, F. Hutter, K. Leyton-Brown, K. Tierney, et al. (2016) ASlib: a benchmark library for algorithm selection. Artificial Intelligence 237.
  • [3] Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li (2007) Learning to rank: from pairwise approach to listwise approach. In ICML.
  • [4] W. Cheng, E. Hüllermeier, and K. J. Dembczynski (2010) Label ranking methods based on the Plackett-Luce model. In ICML.
  • [5] T. Cunha, C. Soares, and A. C. de Carvalho (2018) A label ranking approach for selecting rankings of collaborative filtering algorithms. In SAC.
  • [6] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter (2015) Efficient and robust automated machine learning. In NIPS.
  • [7] E. Frank, M. Hall, and I. Witten (2016) The WEKA workbench. Online appendix for "Data Mining", Morgan Kaufmann, fourth edition.
  • [8] N. Fusi, R. Sheth, and M. Elibol (2018) Probabilistic matrix factorization for automated machine learning. In NIPS.
  • [9] D. Goldberg, D. Nichols, B. M. Oki, and D. Terry (1992) Using collaborative filtering to weave an information tapestry. Communications of the ACM 35 (12).
  • [10] S. Haim and T. Walsh (2009) Restart strategy selection using machine learning techniques. In SAT.
  • [11] M. Happe, F. Meyer auf der Heide, P. Kling, M. Platzner, and C. Plessl (2013) On-the-fly computing: a novel paradigm for individualized IT services. In IEEE ISORC.
  • [12] J. Kanda, C. Soares, E. Hruschka, and A. De Carvalho (2012) A meta-learning approach to select meta-heuristics for the traveling salesman problem using MLP-based label ranking. In NIPS.
  • [13] P. Kerschke, H. H. Hoos, F. Neumann, and H. Trautmann (2019) Automated algorithm selection: survey and perspectives. ECJ 27 (1).
  • [14] K. Leyton-Brown, E. Nudelman, and Y. Shoham (2002) Learning the empirical hardness of optimization problems: the case of combinatorial auctions. In CP.
  • [15] Y. Malitsky and B. O'Sullivan (2014) Latent features for algorithm selection. In SoCS.
  • [16] M. Mısır and M. Sebag (2017) Alors: an algorithm recommender system. Artificial Intelligence 244.
  • [17] F. Mohr, M. Wever, and E. Hüllermeier (2018) ML-Plan: automated machine learning via hierarchical planning. Machine Learning 107 (8-10).
  • [18] P. Nguyen, M. Hilario, and A. Kalousis (2014) Using meta-mining to support data mining workflow planning and optimization. Journal of Artificial Intelligence Research 51.
  • [19] J. R. Rice (1976) The algorithm selection problem. In Advances in Computers, Vol. 15.
  • [20] D. Schäfer and E. Hüllermeier (2018) Dyad ranking using Plackett-Luce models based on joint feature representations. Machine Learning 107 (5).
  • [21] D. H. Stern, H. Samulowitz, R. Herbrich, T. Graepel, L. Pulina, and A. Tacchella (2010) Collaborative expert portfolio management. In AAAI.
  • [22] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown (2013) Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In SIGKDD.
  • [23] A. Tornede, M. Wever, and E. Hüllermeier (2019) Algorithm selection as recommendation: from collaborative filtering to dyad ranking. In CI Workshop, Dortmund.
  • [24] J. Vanschoren, J. N. van Rijn, B. Bischl, and L. Torgo (2013) OpenML: networked science in machine learning. SIGKDD Explorations 15 (2).
  • [25] S. Vembu and T. Gärtner (2010) Label ranking algorithms: a survey. In Preference Learning.
  • [26] Y. Wang, L. Wang, Y. Li, D. He, W. Chen, and T. Liu (2013) A theoretical analysis of NDCG ranking measures. In COLT.
  • [27] M. Weimer, A. Karatzoglou, Q. V. Le, and A. J. Smola (2008) CoFiRank: maximum margin matrix factorization for collaborative ranking. In NIPS.
  • [28] L. Xu, F. Hutter, H. H. Hoos, and K. Leyton-Brown (2008) SATzilla: portfolio-based algorithm selection for SAT. Journal of Artificial Intelligence Research 32.