Automated Multi-Label Classification based on ML-Plan

by   Marcel Wever, et al.
Universität Paderborn

Automated machine learning (AutoML) has received increasing attention in the recent past. While the main tools for AutoML, such as Auto-WEKA, TPOT, and auto-sklearn, mainly deal with single-label classification and regression, there is very little work on other types of machine learning tasks. In particular, there is almost no work on automating the engineering of machine learning applications for multi-label classification. This paper makes two contributions. First, it discusses the usefulness and feasibility of an AutoML approach for multi-label classification. Second, we show how the scope of ML-Plan, an AutoML-tool for multi-class classification, can be extended towards multi-label classification using MEKA, which is a multi-label extension of the well-known Java library WEKA. The resulting approach recursively refines MEKA's multi-label classifiers, which sometimes nest another multi-label classifier, up to the selection of a single-label base learner provided by WEKA. In our evaluation, we find that the proposed approach yields superb results and performs significantly better than a set of baselines.



1 Introduction

These days, machine learning functionality is required in more and more application areas, and machine learning applications have already become part of everyday life. Since end users in application domains are normally not machine learning experts, there is an urgent need for suitable support in terms of tools that are easy to use. Ideally, the induction of models from data, including the data preprocessing, the choice of a model class, the training and evaluation of a predictor, the representation and interpretation of results, etc., would be automated to a large extent [12]. This has triggered the field of automated machine learning (AutoML), which has developed into an important branch of machine learning research in the last couple of years.

State-of-the-art AutoML tools [25, 11, 4] have shown impressive results on multi-class classification problems. These approaches are essentially based on a formalization of the AutoML problem in terms of an optimization problem with a fixed number of decision variables, amenable to standard (Bayesian) optimization tools such as SMAC. Typically, there is one variable for the preprocessing algorithm, one variable for the learning algorithm, and one variable for each parameter of each algorithm. While this technique works well for problems with no or only little hierarchical structure, it is less suitable for more complex problems whose solutions are naturally designed in a recursive manner.

An example of such a problem is multi-label classification (MLC), which is the topic of this paper. One reason for the natural appearance of recursion in MLC is the common use of meta-learning techniques for reducing multi-label to binary or multi-class problems. Almost every such learner takes a base learner as input, which, in principle, could itself be an entire machine learning pipeline (ML pipeline). However, there has been very little work on AutoML for MLC, i.e., on finding good multi-label classifiers in an automated fashion. Besides the work of de Sá et al. [22, 21], which is based on genetic algorithms, we are not aware of previous work on automated multi-label classification.

In this paper, we discuss the usefulness and the feasibility of an AutoML approach for MLC. As multi-label classifiers usually reduce the MLC task to several single-label classification tasks, the configuration of a multi-label classifier is of a hierarchical nature. Therefore, we propose to use a hierarchical technique for the configuration of such ML pipelines, which caters more naturally for the hierarchical structure of the problem. Starting with an empty pipeline, the algorithm first selects a multi-label classifier. If the chosen multi-label classifier represents a meta strategy, it is refined with another multi-label classifier. Otherwise, multi-label classifiers usually require a base learner for binary classification, which in turn has to be selected from a portfolio of single-label classifiers.

To configure these multi-label classifiers, we adapt and use an AutoML tool for single-label classification called ML-Plan [15]. ML-Plan leverages a derivative of hierarchical task network (HTN) planning [5], called programmatic task network planning [13], to solve the AutoML task, and thus, it naturally supports recursive structures as they appear in AutoML for multi-label classification. For instance, in [28] it is shown how ML-Plan can produce tree-shaped preprocessing workflows of arbitrary depth. This is also the reason why we prefer ML-Plan to other AutoML tools. In the following, we refer to this adapted version of ML-Plan as ML²-Plan (Multi-Label ML-Plan). We empirically show that this approach performs particularly well and significantly outperforms the baselines. Due to the lack of dedicated AutoML tools for the task of multi-label classification (except the recent work of [22] based on genetic algorithms), a comparison is not straightforward. Nevertheless, we managed to set up meaningful and reasonably strong baselines, including a random search and a reduction to a single-label classification AutoML tool.

2 Multi-Label Classification

In contrast to conventional (single-label) classification, the setting of multi-label classification (MLC) allows an instance to belong to several classes simultaneously, i.e., to be assigned several labels at the same time. For example, a single image could be tagged simultaneously with labels Sun and Beach and Sea.

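More formally, let $\mathcal{X}$ denote an instance space, and let $\mathcal{L} = \{\lambda_1, \ldots, \lambda_m\}$ be a finite set of class labels. We assume that an instance $x \in \mathcal{X}$ is (non-deterministically) associated with a subset of labels $L \in 2^{\mathcal{L}}$; this subset is often called the set of relevant labels, while the complement $\mathcal{L} \setminus L$ is considered as irrelevant for $x$. We identify a set $L$ of relevant labels with a binary vector $y = (y_1, \ldots, y_m)$, in which $y_i = 1$ iff $\lambda_i \in L$. By $\mathcal{Y} = \{0, 1\}^m$ we denote the set of possible labelings.

In general, a multi-label classifier is a mapping $h: \mathcal{X} \to \mathcal{Y}$. For a given instance $x \in \mathcal{X}$, it returns a prediction in the form of a vector

$$h(x) = (h_1(x), h_2(x), \ldots, h_m(x)).$$

The problem of MLC can be stated as follows: Given training data in the form of a finite set of observations

$$(x, y) \in \mathcal{X} \times \mathcal{Y},$$

the goal is to learn a classifier $h: \mathcal{X} \to \mathcal{Y}$ that generalizes well beyond these observations in the sense of minimizing the risk with respect to a specific loss function.

There are various loss functions that are commonly used in MLC, including the subset 0/1 loss (exact match)

$$\ell_{0/1}(y, h(x)) = [\![\, y \neq h(x) \,]\!]$$

(footnote 1: $[\![\cdot]\!]$ is the indicator function), the Hamming loss

$$\ell_{H}(y, h(x)) = \frac{1}{m} \sum_{i=1}^{m} [\![\, y_i \neq h_i(x) \,]\!],$$

and the (instance-wise) F-measure (which is actually a measure of accuracy)

$$F(y, h(x)) = \frac{2 \sum_{i=1}^{m} y_i \, h_i(x)}{\sum_{i=1}^{m} \big( y_i + h_i(x) \big)}. \tag{1}$$

In slightly different tasks like ranking or probability estimation, the prediction of a classifier is not restricted to binary vectors. Instead, a hypothesis $h$ is a mapping $\mathcal{X} \to \mathbb{R}^m$, which assigns scores to labels. Corresponding predictions also require other loss functions. An example is the rank loss, which compares a ground-truth labeling with a predicted ranking of the labels and counts the number of incorrectly ordered label pairs (ties counted as half an error):

$$\ell_{R}(y, h(x)) = \sum_{(i,j):\, y_i > y_j} \Big( [\![\, h_i(x) < h_j(x) \,]\!] + \tfrac{1}{2} [\![\, h_i(x) = h_j(x) \,]\!] \Big).$$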
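
As a small illustration of the loss functions named in this section, the following Python sketch computes them for a single instance. It is not taken from the paper's implementation: the function names are hypothetical, the F-measure is set to 1.0 when both vectors are all-zero, and ties in the rank loss are counted as half an error; both conventions are assumptions made here.

```python
from typing import Sequence


def subset_zero_one_loss(y: Sequence[int], y_hat: Sequence[int]) -> int:
    """Return 1 if the prediction differs from the truth in any label, else 0."""
    return int(list(y) != list(y_hat))


def hamming_loss(y: Sequence[int], y_hat: Sequence[int]) -> float:
    """Fraction of label positions that are predicted incorrectly."""
    return sum(a != b for a, b in zip(y, y_hat)) / len(y)


def instance_f_measure(y: Sequence[int], y_hat: Sequence[int]) -> float:
    """Instance-wise F-measure of one binary prediction vector."""
    denominator = sum(y) + sum(y_hat)
    if denominator == 0:
        return 1.0  # assumed convention: empty truth and empty prediction agree
    return 2 * sum(a * b for a, b in zip(y, y_hat)) / denominator


def rank_loss(y: Sequence[int], scores: Sequence[float]) -> float:
    """Count (relevant, irrelevant) label pairs that the scores order wrongly."""
    loss = 0.0
    for i, y_i in enumerate(y):
        for j, y_j in enumerate(y):
            if y_i > y_j:  # label i is relevant, label j is irrelevant
                if scores[i] < scores[j]:
                    loss += 1.0
                elif scores[i] == scores[j]:
                    loss += 0.5  # assumed convention: a tie is half an error
    return loss
```

For the ground truth `[1, 0, 1, 0]` and the prediction `[1, 1, 0, 0]`, two of four labels are wrong, so the subset 0/1 loss is 1 while Hamming loss and F-measure are both 0.5. This shows how differently the measures penalize the same prediction.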

At first sight, MLC problems can be solved in a quite straightforward way, namely through decomposition into several binary classification problems: One binary classifier is trained for each label and used to predict whether, for a given query instance, this label is present (relevant) or not. This approach is known as binary relevance (BR) learning.
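
The decomposition behind binary relevance can be sketched in a few lines of Python. This is only a minimal sketch: the class names are hypothetical, and the 1-nearest-neighbor base learner is a stand-in chosen to keep the example self-contained; any binary classifier with `fit` and `predict` could be plugged in instead.

```python
from typing import Callable, List, Sequence


class OneNearestNeighbor:
    """Toy binary base learner (assumption): predicts the label of the closest training point."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> "OneNearestNeighbor":
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, x: Sequence[float]) -> int:
        distances = [sum((a - b) ** 2 for a, b in zip(x, x_i)) for x_i in self.X_]
        return self.y_[distances.index(min(distances))]


class BinaryRelevance:
    """Trains one independent binary classifier per label."""

    def __init__(self, make_base_learner: Callable[[], object]):
        self.make_base_learner = make_base_learner

    def fit(self, X: Sequence[Sequence[float]], Y: Sequence[Sequence[int]]) -> "BinaryRelevance":
        n_labels = len(Y[0])
        # Column i of Y is the binary target for the i-th label.
        self.models_ = [
            self.make_base_learner().fit(X, [row[i] for row in Y])
            for i in range(n_labels)
        ]
        return self

    def predict(self, x: Sequence[float]) -> List[int]:
        # Each label is predicted on its own; label dependencies are ignored.
        return [model.predict(x) for model in self.models_]
```

Because every label gets its own model, the base learner is a natural point of configuration, which is what makes the overall classifier hierarchical.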

However, BR has been criticized for ignoring important information hidden in the label space, namely information about the interdependencies between the labels. Since the presence or absence of the different class labels has to be predicted simultaneously, it is arguably important to exploit any such dependencies.

Going beyond BR, a large repertoire of methods for MLC has been proposed in the recent years. Most of these methods seek to improve predictive accuracy by exploiting label dependencies in one way or the other. We refer to [29] for an up-to-date survey on MLC algorithms.

3 AutoML and Hierarchical Planning

AutoML seeks to automatically compose and parametrize machine learning algorithms into ML pipelines, with the goal to optimize a given metric, e.g., minimizing the exact match loss or maximizing the instance-wise F-measure in the case of MLC. The algorithms are typically related either to preprocessing (feature selection, transformation, imputation, etc.) or to the core functionality (classification, regression, ranking, etc.).

While there is no general limitation on the structure of the composition of these algorithms (they are unbounded in length and may contain alternative branches or even loops), the pipelines created by current approaches are usually rather simple and essentially limited to a preprocessing step and a classifier. For ML problems more complex than standard classification, approaches of that kind are not fully suitable.

The decomposition scheme on the left-hand side in Fig. 1 suggests that we can construct a machine learning pipeline in a hierarchical way. In an initial step, we have the complex tasks to choose a preprocessor (possibly an empty one) and a classifier. However, each of these components may need other components and/or parameters in turn. So we need to choose and configure these sub-components and parameters first, which, as illustrated in the figure, may have sub-components and parameters, too. This recursion continues until no more refinement is necessary or possible.

In this paper, to create multi-label classifiers, we make use of hierarchical planning as a formalism that is amenable to this recursive structure. Hierarchical planning is a concept from the field of AI planning [6]. The core idea is to iteratively break down an initially given complex task into new sub-tasks, which may also be complex or simple (no need of further refinement). The complex tasks are recursively decomposed until only simple tasks remain. This is comparable, for example, to deriving a sentence from a context-free grammar, where complex tasks correspond to non-terminals and simple tasks are terminal symbols.

Figure 1: Visualization of the hierarchical structure of a machine learning pipeline (left) and an excerpt of the hierarchical planning search graph (right).

Note that there is not one canonical but many possible hierarchical planning problems that can be used to hierarchically construct pipelines. Just to give an example, we could first choose the algorithms for the multi-label classifier and set their parameters, and then choose the single-label classifier as a base learner, its sub-components, and their parameters. Yet, we may also nest the process of choosing algorithms and setting the parameters, e.g., to first choose the multi-label classifier and the binary classifier algorithms, and then set the parameters of both of them. While this looks like a trivial change that does not affect the set of pipelines that can be constructed, it has dramatic effects on the structure of the search tree.

Algorithmically, a planning problem is solved using graph search algorithms. The (hierarchical) planning problem induces a (possibly infinite) search graph, which is represented by a distinguished root node, a successor generator function, and a goal-test function. The successor generator creates the successor nodes for any node of the graph, and the goal-test decides whether a node is a goal. Most HTN planners perform a forward-decomposition, which means that they create one successor for each possible decomposition of the first unsolved task in the list of remaining tasks. In every child node, the list of remaining tasks is the previous list of tasks where the decomposed task is replaced by the list that represents the respective decomposition. The resulting search graph is sketched on the right-hand side in Fig. 1 where every box shows a list of tasks (green ones are simple, the yellow one is the next complex task to be decomposed, and the red ones are complex to be resolved later). A node is a goal node if all remaining tasks are simple. A standard graph search algorithm can then be used to identify a path from the root to a goal node, which induces a solution plan.
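
The forward decomposition described above can be sketched as follows. The task network is a hypothetical toy example and not the problem definition used by ML-Plan: task names such as `createPipeline` or `choosePreprocessor` are assumptions made for illustration. A node is a list of remaining tasks, the successor generator decomposes the first complex task, and a node is a goal once only simple tasks remain.

```python
from collections import deque

# Hypothetical toy task network: every complex task maps to its alternative
# decompositions; any task that is not a key of METHODS is a simple task.
METHODS = {
    "createPipeline": [["choosePreprocessor", "chooseClassifier"]],
    "choosePreprocessor": [["noPreprocessing"], ["PCA"]],
    "chooseClassifier": [["J48"], ["AdaBoostM1", "chooseBaseClassifier"]],
    "chooseBaseClassifier": [["DecisionStump"], ["J48"]],
}


def is_complex(task):
    return task in METHODS


def is_goal(tasks):
    """A node is a goal node if all remaining tasks are simple."""
    return not any(is_complex(task) for task in tasks)


def successors(tasks):
    """Create one successor per decomposition of the first complex task."""
    index = next(i for i, task in enumerate(tasks) if is_complex(task))
    for decomposition in METHODS[tasks[index]]:
        # The decomposed task is replaced by the tasks of its decomposition.
        yield tasks[:index] + tuple(decomposition) + tasks[index + 1:]


def enumerate_plans(root):
    """Breadth-first traversal of the induced search graph, collecting all goals."""
    queue, plans = deque([root]), []
    while queue:
        node = queue.popleft()
        if is_goal(node):
            plans.append(node)
        else:
            queue.extend(successors(node))
    return plans
```

Calling `enumerate_plans(("createPipeline",))` lists every pipeline of this toy network. An exhaustive traversal like this is only feasible for tiny graphs, which is why an informed search strategy is needed in practice.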

However, it is not easily possible to solve AutoML problems using standard planners such as SHOP2 [1]. The main problem is that, in contrast to the usual assumption of standard planners applying A* search, the cost of a solution (e.g., expected loss of a classifier) cannot be computed from the descriptions of the plan elements.

We are aware of three approaches to AutoML using hierarchical planning or related techniques. The first line of work performs optimization within the RapidMiner framework based on hierarchical task networks (HTN) [18, 10]. These approaches conduct a beam search (hill-climbing in the most extreme case), where the beam is selected based on a ranking of alternative choices obtained from a meta-learning module, which compares the current dataset with previous ones and the choices taken back then. The most recent representative of this line of research is Meta-Miner [18]. These approaches do not execute candidates during search to observe their performance. Second, RECIPE [23] follows an approach of extensive evaluation: it creates pipelines using a grammar-based genetic algorithm, and the pipeline candidates are evaluated in the course of computing their fitness. Third, ML-Plan [15] recognizes the value of executing pipelines during search, but also observes that the extensive evaluation conducted in TPOT and RECIPE is infeasible for larger datasets. It reduces the number of evaluations by only considering candidates obtained from completions of currently best candidates. Like Meta-Miner, it is based on HTN planning.

While none of the above approaches has been used to solve multi-label classification problems, they can be adapted into that direction rather easily. This is precisely thanks to the hierarchical view on the solution candidates, because pipelines for MLC are very similar to pipelines for multi-class classification. The main difference between the pipelines is that there is at least one more layer of recursion in the configuration of its elements, which is easily incorporated within a hierarchical model. The adaptation for the HTN-based approaches is particularly appealing, because a huge part of the search graph definition (the one concerned with the configuration of base classifiers) can be simply adopted without any changes.

Of course, other AutoML solutions such as Auto-WEKA or auto-sklearn can be extended to the multi-label problem as well. One can flatten any hierarchical structure into a vector as long as the allowed structures are bounded in length, which makes those approaches generally applicable. In fact, this was already done in both frameworks to cope with preprocessors and meta classifiers.

In this paper, we solve the multi-label AutoML problem by extending ML-Plan, our approach to automated multi-class classification. The resulting algorithm is called ML²-Plan, which stands for Planning for Multi Label Machine Learning. In the context of ML-Plan, the extension is quite straightforward and essentially comes down to augmenting the already existing hierarchical planning problem definition by the MLC algorithms. In the following section, we give a brief overview of ML-Plan and how it is extended to ML²-Plan.

4 A Multi-Label Version of ML-Plan

4.1 ML-Plan

As briefly sketched above, ML-Plan is a hierarchical planner designed for AutoML problems [15]. Standard hierarchical planners such as SHOP2 [1] lack some fundamental requirements of AutoML, e.g., to evaluate candidate solutions during search, which was a main motivation for developing ML-Plan.

The search technique adopted by ML-Plan is a best-first search. ML-Plan makes no assumption (like monotonicity) about the node evaluations or how they are acquired. Instead, ML-Plan simply requires that the node evaluation function is provided by the user. It is then possible to conduct complex computations in order to obtain node evaluations, a property that is missing in classical planners. The node evaluation in ML-Plan is based on random path completion as also used in Monte Carlo Tree Search [2]. To obtain the evaluation of a node, this strategy draws a fixed number of path completions, builds the corresponding pipelines and evaluates them against a validation set. The score assigned to the node is the minimal score that was observed over these validations in order to estimate the best solution that can be obtained when following paths under the node.
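
The node evaluation by random path completion can be illustrated with the following Python sketch. It is a simplified sketch under stated assumptions, not ML-Plan's actual implementation: the function names, the `max_expansions` budget and the toy search space are hypothetical, and the toy loss stands in for the validation loss of a built pipeline.

```python
import heapq
import random


def random_completion(node, successors, is_goal, rng):
    """Follow uniformly random successors until a goal node is reached."""
    while not is_goal(node):
        node = rng.choice(list(successors(node)))
    return node


def best_first_search(root, successors, is_goal, loss, n_samples=3,
                      max_expansions=50, seed=0):
    """Best-first search where a node's score is the minimal loss over random completions."""
    rng = random.Random(seed)
    best_loss, best_solution = float("inf"), None

    def evaluate(node):
        nonlocal best_loss, best_solution
        score = float("inf")
        for _ in range(n_samples):
            solution = random_completion(node, successors, is_goal, rng)
            value = loss(solution)  # in AutoML: validation loss of the pipeline
            score = min(score, value)
            if value < best_loss:  # remember the best solution seen so far
                best_loss, best_solution = value, solution
        return score

    counter = 0  # tie-breaker so that heap entries never compare nodes
    open_list = [(evaluate(root), counter, root)]
    for _ in range(max_expansions):
        if not open_list:
            break
        _, _, node = heapq.heappop(open_list)
        if is_goal(node):
            continue
        for child in successors(node):
            counter += 1
            heapq.heappush(open_list, (evaluate(child), counter, child))
    return best_loss, best_solution


# Hypothetical toy search space: binary decisions of depth 4, where the loss
# is the fraction of zeros among the decisions taken.
DEPTH = 4


def toy_successors(node):
    return [node + (0,), node + (1,)]


def toy_is_goal(node):
    return len(node) == DEPTH


def toy_loss(solution):
    return solution.count(0) / DEPTH
```

Since every sampled completion is a full solution, the search can return its best candidate at any time, even before the open list is exhausted.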

Intuitively, ML-Plan formalizes the HTN problem in a way that the resulting search graph is split into an algorithm selection region (upper region) and an algorithm configuration region (lower region). This idea is captured in Fig. 2. The main motivation for this strategy lies in the node evaluation we want to apply, which is based on random completions. Since algorithm selections usually constitute a much more significant change to the performance of a pipeline than parameter settings, we consider all solutions under a node that has all algorithms fixed as a kind of neighborhood, and the random samples drawn in that lower region are then more reliable estimates.

Having the idea of a two-phased search graph in mind, the HTN definition of ML-Plan is as follows (footnote 2: Since we have not formally introduced HTN planning, we describe the problem definition in a rather intuitive way. The formal definition can be found in the implementation published with this paper). The initial task createClassifier can be broken down into a chain of the three tasks createRawPP, setupClassifier, and refinePP. The first task is meant to choose the algorithms used for pre-processing without parametrizing them, the second task is meant to choose and configure the multi-label classifier, and the third task parametrizes the previously chosen pre-processors. The second task setupClassifier can, for each classifier, be decomposed into two sub-tasks. First, <classifier>:create is a simple task indicating the creation of a new classifier of the respective class, e.g. J48:create. Second, <classifier>:configure is a complex task meant to configure the parameters of the classifier.

Figure 2: Process of hierarchically refining an ML pipeline

As an additional remark, ML-Plan comes with a built-in strategy to prevent over-fitting. This strategy apportions the assigned timeout for the whole search process among two phases. The first phase covers the actual search in the space. The second phase takes a collection of identified solutions and selects the one that minimizes the estimated generalization error. Roughly speaking, the collection used for selection in phase 2 corresponds to the best candidates and random candidates that are not significantly worse than the best candidate. The time allocated at time step $t$ for the second phase is flexible and corresponds to the accumulated time that was required in phase 1 to evaluate the classifiers that would be chosen at time step $t$ for the selection process.

4.2 Deriving ML²-Plan from ML-Plan

To obtain ML²-Plan from ML-Plan, we need to make two changes. First, we modify the HTN problem definition to support MLC algorithms and omit tasks that are not necessary or reasonable in MLC. Second, we adjust the node evaluation function to be based on multi-label loss functions. We now explain these two aspects in more detail.

ML²-Plan modifies the HTN planning problem of ML-Plan in two ways:

  1. It simplifies the problem by removing preprocessing and by deactivating parameter configuration, so ML²-Plan only conducts algorithm selection. Preprocessing is ignored because the preprocessors of multi-class classification are not directly applicable to the multi-label classification case; there are extensions [7, 20], but no implementations are available in the used libraries. Algorithm configuration is ignored because the evaluations of solution candidates are usually so expensive that, in the current form, even the algorithm selection problem cannot be solved within a reasonable time bound.

  2. It extends the problem by adding new algorithms, the MLC algorithms, and by introducing a dedicated notation for the decisions on dependent sub-classifiers. The latter is to overcome the previous practice that sub-classifiers are seen as parameters. Now, there is a dedicated task for each sub-algorithm of each algorithm. For example, the meta-classifiers (both multi-label and multi-class) need to be refined by choosing their base classifiers. As single-label classifiers are the sub-classifiers of most basic multi-label classifiers, up to three recursions are possible in this way.

Technically, the initial task now becomes createMLClassifier to indicate that a multi-label classifier needs to be constructed. It can be resolved either by a meta multi-label classifier, which results in a new task createMLBaseClassifier, or by a simple multi-label classifier, which already solves the task in the case of the majority classifier and induces a task createWekaClassifier for the configuration of the used base learner in any other case. This task, which roughly corresponds to setupClassifier in ML-Plan, can either be refined to an empty rest problem by choosing a basic learner like a decision tree or a neural network, or it can be refined with a meta learner like AdaBoost, which induces a further task setupBaseClassifier; the latter task can then only be refined with non-meta single-label classifiers.
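
The recursive refinement described above can be made concrete with a short Python sketch that enumerates all nested classifiers for a deliberately tiny portfolio. The four portfolios below are small, hand-picked subsets used for illustration only, and the assumption that createMLBaseClassifier is refined only with non-meta multi-label classifiers is ours; the function names merely mirror the task names in the text.

```python
# Tiny illustrative portfolios (assumption: hand-picked subsets, not the full lists).
ML_META = ["BaggingML", "EnsembleML"]
ML_BASE = ["BR", "CC", "Majority Label Set"]  # the majority classifier needs no base learner
SL_META = ["AdaBoostM1", "Bagging"]
SL_BASE = ["J48", "NaiveBayes", "SMO"]


def setup_base_classifier():
    """Only non-meta single-label classifiers are allowed at this level."""
    return [[name] for name in SL_BASE]


def create_weka_classifier():
    """Either a basic learner, or a meta learner wrapped around a basic one."""
    plain = [[name] for name in SL_BASE]
    wrapped = [[meta] + rest for meta in SL_META for rest in setup_base_classifier()]
    return plain + wrapped


def create_ml_base_classifier():
    """A simple multi-label classifier, refined by a single-label learner if it needs one."""
    candidates = []
    for name in ML_BASE:
        if name == "Majority Label Set":
            candidates.append([name])  # already a complete solution
        else:
            candidates.extend([name] + rest for rest in create_weka_classifier())
    return candidates


def create_ml_classifier():
    """Either a simple multi-label classifier or a meta one around a simple one."""
    simple = create_ml_base_classifier()
    meta = [[name] + rest for name in ML_META for rest in create_ml_base_classifier()]
    return simple + meta
```

Each candidate is a list read from the outermost to the innermost component, e.g. `["BaggingML", "BR", "AdaBoostM1", "J48"]`. Even this toy portfolio yields 57 candidates, which indicates how quickly the space grows with the full lists of learners.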

The set of applied algorithms is a strict superset of the ones adopted in ML-Plan. The number of possible candidates of composed multi-label classifiers is roughly 80,000. Thus, the space of possible candidates is much smaller than in the case of multi-class classification where hyperparameter optimization is performed as well. However, due to the much costlier evaluations per pipeline, the traversal of a bigger search space does not appear reasonable. The learners considered by ML²-Plan are as follows.

  • Multi-label classifiers (meta): BaggingML, BaggingMLdup, CM, DeepML, EM, EnsembleML, FilteredClassifier, MBR, MultiSearch, SubsetMapper

  • Multi-label classifiers (base): Bayesian Classifier Chains (BCC), Back Propagation Neural Network (BPNN), Binary Relevance (BR), Binary Relevance quick (BRq), Classifier Chains (CC), Classifier Chains quick (CCq), Conditional Dependency Networks (CDN), Conditional Dependency Trellis (CDT), Classifier Trellis (CT), Deep Back-Propagation Neural Network (DBPNN), Fourclass Pairwise (FW), Hierarchical Label Sets (HASEL), Label Combination (LC), Majority Label Set, Multi-lAbel classificatioN using AutoenCoders (Maniac), Monte-Carlo Classifier Chains (MCC), Probabilistic Classifier Chains (PCC), PMCC, Pruned Sets (PS), Pruned Sets with Threshold (PSt), RAndom k-labEl pruned sets (RAkEL), RAndom k-labEL Disjoint pruned sets (RAkELd), Ranking+Threshold (RT), Multi-Label Classification using Boolean Matrix Decomposition (MLCBMaD)

  • Single-label classifiers (meta): AdaBoostM1, AdditiveRegression, AttributeSelectedClassifier, Bagging, ClassificationViaRegression, LogitBoost, MultiClassClassifier, RandomCommittee, RandomSubspace, Stacking, Vote

  • Single-label classifiers (base): BayesNet, NaiveBayes, NaiveBayesMultinomial, Logistic, MultilayerPerceptron, SimpleLinearRegression, SimpleLogistic, SMO, VotedPerceptron, IBk, KStar, JRip, M5Rules, OneR, PART, ZeroR, DecisionStump, J48, LMT, M5P, RandomForest, RandomTree, REPTree

ML²-Plan adopts the node evaluation function of ML-Plan. As explained above, this node evaluation function draws random completions and evaluates the corresponding completed pipelines on a validation set. Every pipeline is validated several times using different splits of size 70%/30% of the data available for search to make the estimate more reliable.
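
The repeated 70%/30% validation can be sketched as follows. The number of repetitions is not stated in the text, so `n_splits=5` is an assumption, as are the function name and the `train_and_score` callback, which stands for training a candidate pipeline and scoring it on the validation part.

```python
import random
from statistics import mean


def monte_carlo_validation(data, train_and_score, n_splits=5,
                           train_fraction=0.7, seed=0):
    """Average the score of a candidate over several random train/validation splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        shuffled = list(data)
        rng.shuffle(shuffled)  # a fresh random split in every repetition
        cut = int(round(train_fraction * len(shuffled)))
        train, valid = shuffled[:cut], shuffled[cut:]
        scores.append(train_and_score(train, valid))
    return mean(scores)
```

Averaging over several splits reduces the variance of the estimate, at the price of multiplying the evaluation cost of every candidate by the number of splits.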

While ML-Plan adopts 0/1-loss for computing the loss of a single pipeline, there are different loss measures for multi-label classification (cf. Section 2). In principle, each of the losses could be used to guide the search; we took the F-measure (1), which we consider as one of the most meaningful MLC performance measures. As different losses are known to be potentially competitive [3], it would also be possible to conduct a multi-objective search, which we consider as an interesting idea for future work.

5 Experimental Evaluation

We evaluate ML²-Plan as introduced in the previous section on various datasets and for different timeouts. Since there is no dedicated tool for automated multi-label classification publicly available to compare with, we evaluate ML²-Plan against the baselines introduced in the following. The implementation of ML²-Plan is publicly available.

5.1 Baselines

In our experimental study, we challenge ML²-Plan with two baselines in order to empirically show that it is better than randomly traversing the search space and goes beyond a simple reduction to current AutoML tools for multi-class classification. To this end, we define the baselines as follows.

5.1.1 Random Search (RS)

First, to demonstrate the effectiveness of ML²-Plan’s strategy to search for good candidates, the first baseline is a random search, which samples random candidates from the search space. We let ML²-Plan and the random search operate on the same search space, since a differently structured search space could by itself introduce a bias towards better performing candidates. As a direct consequence, all the combinations possible for ML²-Plan are also available for the random search. Thus, the only difference between ML²-Plan and the random search baseline is the strategy of traversing the search space. The random search baseline is bound to the same timeout specification as ML²-Plan.

5.1.2 Reduction to AutoML for Binary/Multinomial Classification (BR-AW)

Second, we define a baseline as an optimized version of Binary Relevance (BR), which is often used as a baseline in the multi-label classification literature. As BR reduces the MLC problem to a set of binary classification problems and current AutoML tools [4, 25, 19, 23] primarily address binary or multinomial classification or regression problems, we leverage the well-established AutoML tool Auto-WEKA to individually optimize each of the base learners for the induced binary classification problems. Note that Auto-WEKA not only configures a classifier for the binary problems, but also an individualized pre-processing if appropriate. To motivate this baseline, note that it would arguably be the simplest choice of an inexperienced user who is trying to incorporate some automation into the process of solving an MLC problem.

To guarantee a fair comparison in terms of permitted run time, we divided the available resources by the number of labels and, hence, by the number of binary problems to be solved by Auto-WEKA. More precisely, given a timeout $T$ for ML²-Plan, the entire baseline algorithm should also have time $T$ available. Consequently, with $m$ denoting the number of labels, we assigned $T/m$ as a timeout to each Auto-WEKA instance within BR. While it would of course be possible to parallelize the learning process in BR and, hence, relax the timeout for each binary problem, no such parallelization is realized within MEKA. ML²-Plan also adopts the MEKA implementation of BR and would not benefit from any parallelization. In our view, this setup therefore maximizes fairness for the comparison.

Since it might be the case that Auto-WEKA fails to provide a result within the given time bound, two instances of BR using either SMO or Random Trees as base learners are evaluated in parallel for all labels. In these two cases, neither hyperparameter optimization nor preprocessing methods are configured. The baseline then assumes the maximum over the three performances of Auto-WEKA and the backup BR variants.
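
The time budgeting and the fallback logic of this baseline can be summarized in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, `None` is used here to model a variant that returned no result, and the scores in the usage note are made-up numbers, not measurements.

```python
from typing import Optional


def per_label_timeout(total_timeout_s: float, n_labels: int) -> float:
    """Split the overall timeout evenly among the binary problems of BR."""
    return total_timeout_s / n_labels


def baseline_score(auto_weka: Optional[float],
                   br_smo: Optional[float],
                   br_random_tree: Optional[float]) -> Optional[float]:
    """Best F-measure among the Auto-WEKA variant and the two backup BR variants."""
    available = [s for s in (auto_weka, br_smo, br_random_tree) if s is not None]
    return max(available) if available else None
```

For example, with a one-hour timeout and six labels, `per_label_timeout(3600, 6)` grants each Auto-WEKA instance 600 seconds. If Auto-WEKA then fails to return a model, `baseline_score(None, 0.41, 0.47)` falls back to the better backup variant.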

5.2 Experimental Setup

Results were obtained by carrying out 25 runs on 24 datasets with three different timeouts. The datasets stem from the MULAN project website, but an independent copy is available (footnote 4: link hidden during review, MULAN sources are …). The significance of an improvement per dataset was determined using a t-test with a threshold for the p-value of 0.05.

We considered timeouts of one minute, one hour, and 12 hours. Depending on the overall timeout, the timeout for the internal evaluation of a single solution by ML²-Plan was set to 10s for 1 minute runs and 5m for the other cases. Runs that did not adhere to the time or resource limitations (plus a tolerance threshold) were canceled without considering their results. That is, we canceled the algorithms if they did not terminate within 110% of the predefined timeout. Likewise, the algorithms were killed if they consumed more resources (memory or CPU) than allowed, which happens as both implementations fork new processes whose overall CPU and memory consumption is hard to control.

In each run, we used 70% of a randomized split of the data for learning (search) and 30% for testing. We used the same splits for all candidates, i.e., for each split and each timeout, we ran each baseline once and ML²-Plan once. Note that there is no natural way of stratifying splits as in multi-class classification. While there are approaches to obtain such splits [24], we used random splits for implementation-related reasons. The relative performance of the algorithms should not be significantly affected by the split technique, since the same splits were used for all algorithms.

The computations were executed on (up to) 150 Linux machines in parallel, each with a resource limitation of 8 cores (Intel Xeon E5-2670, 2.6 GHz) and 32 GB memory. The total run-time was over 130k CPU hours (more than 14 CPU years).

5.3 Results of Evaluation on Test Data

In Table 1, we report the results on the averaged (instance-wise) F-measure (1), for which ML²-Plan optimized. We compare ML²-Plan to the outlined baselines Random Search (RS) and the reduction to AutoML optimizing each single base learner of binary relevance with the help of Auto-WEKA (BR-AW). Best results are highlighted in bold; significant improvements of ML²-Plan over a baseline according to the t-test and significant degradations are each indicated by a dedicated marker (only comparing for the same timeout). Results for using Auto-WEKA with binary relevance for a timeout of one minute do not exist: since the minimum run-time of Auto-WEKA is one minute, it cannot be used in that scenario.

The overall picture is that ML²-Plan dominates the baselines to a large extent. Compared to BR-AW, ML²-Plan yields a significantly degraded result in 2 of 24 cases for the 1h timeout and in 1 of 24 cases for the 12h timeout. For most of the remaining datasets and both timeouts, the results returned by ML²-Plan are significantly better than the ones returned by BR-AW. On many datasets, even the results obtained within one minute are better than the ones returned by BR-AW for the 12h timeout. Thus, ML²-Plan clearly outperforms a naive reduction of multi-label classification to binary relevance incorporating a single-label classification AutoML tool. The worse performance of BR-AW might be due to the handling of its timeout constraints, as the overall run-time is divided by the number of labels. However, BR-AW manages to perform significantly better than ML²-Plan within a timeout of 1h on the dataset with the second-largest number of labels, which somewhat contradicts this intuition. Another possible reason is that, although binary relevance is often already a very strong baseline for the comparison of single MLC strategies, it is clearly defeated by a portfolio of MLC strategies from which ML²-Plan is allowed to choose. A consequence of this conjecture is that algorithm selection proves beneficial also for MLC.

Considering the random search baseline (RS), we observe that, for a one-minute timeout, there is no clear winner or loser. While ML-Plan achieves significant improvements over RS in 8 cases, RS in turn yields superior results in 5 cases. In a few cases, RS does not manage to return any result within the given timeout, as it is not bound to an internal timeout for evaluating candidates. This picture slowly changes when increasing the timeout to 1h, where ML-Plan achieves 12 significant improvements over RS and only 2 significant degradations. This trend continues for the timeout of 12h, where a significant degradation is obtained on only 2 datasets, while there are 14 significant improvements. Summing up, we notice that for smaller timeouts ML-Plan first behaves similarly to a random search, which is intuitive, since some of the nodes have to be evaluated before the search can identify high-performance regions. Over time, ML-Plan finds better results than the random search. From this, we conclude that the best-first search, which ML-Plan inherits from its multi-class predecessor and uses to find well-performing multi-label classifiers, proves beneficial for the AutoML multi-label classification problem.
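The general idea of a best-first search over hierarchical algorithm choices can be sketched as follows. This is a generic toy illustration and not the implementation used by ML-Plan: the search space, the loss values, and the node estimate (an oracle that returns the best loss reachable below a node) are hypothetical stand-ins chosen so that the example is deterministic.

```python
import heapq
import itertools
from typing import Callable, Iterable, List, Optional, Tuple

Node = Tuple[str, ...]  # a partial or complete sequence of algorithm choices


def best_first_search(
    root: Node,
    expand: Callable[[Node], Iterable[Node]],
    is_complete: Callable[[Node], bool],
    estimate: Callable[[Node], float],
    max_evaluations: int,
) -> Tuple[Optional[Node], float]:
    """Repeatedly expand the open node with the lowest estimated loss."""
    tie_breaker = itertools.count()  # keeps heap entries comparable on equal losses
    open_list: List[Tuple[float, int, Node]] = [(estimate(root), next(tie_breaker), root)]
    evaluations = 1
    best: Optional[Node] = None
    best_loss = float("inf")
    while open_list and evaluations < max_evaluations:
        _, _, node = heapq.heappop(open_list)
        for child in expand(node):
            if evaluations >= max_evaluations:
                break  # evaluation budget exhausted
            loss = estimate(child)
            evaluations += 1
            if is_complete(child):
                if loss < best_loss:
                    best, best_loss = child, loss
            else:
                heapq.heappush(open_list, (loss, next(tie_breaker), child))
    return best, best_loss


# Hypothetical toy space: first a multi-label classifier, then a base learner.
TOY_LOSS = {
    ("BR", "J48"): 0.4,
    ("BR", "NaiveBayes"): 0.5,
    ("CC", "J48"): 0.2,
    ("CC", "NaiveBayes"): 0.3,
}


def expand(node: Node) -> List[Node]:
    if len(node) == 0:
        return [("BR",), ("CC",)]
    if len(node) == 1:
        return [node + ("J48",), node + ("NaiveBayes",)]
    return []


def is_complete(node: Node) -> bool:
    return len(node) == 2


def estimate(node: Node) -> float:
    # Oracle estimate for illustration: best loss among completions of this node.
    return min(loss for key, loss in TOY_LOSS.items() if key[: len(node)] == node)


result = best_first_search((), expand, is_complete, estimate, max_evaluations=10)
```

In a real AutoML setting the estimate of an inner node is not known in advance and has to be obtained by evaluating candidates, which is consistent with the observation that the search resembles a random search while only few nodes have been evaluated.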

Based on these observations, a dedicated tool for multi-label classification can clearly be justified. Investing in a guided search through the ML-Plan search space brings a significant advantage over the proposed baselines. We do not claim that combining reduction techniques with standard AutoML tools is not a meaningful approach; quite the contrary. However, instead of using a predefined reduction, the most appropriate one should be found in a systematic way, as in ML-Plan. The observations in the next section confirm that a single best multi-label classifier that could serve this purpose is unlikely to exist.

In addition, the results after 12 hours are also equal to or better than those reported on automated multi-label classification in [22], despite the fact that the multi-label classifiers were trained on 90% of a dataset and tested on 10%. In their workshop paper, de Sá et al. evaluate their evolutionary approach on the three datasets Birds, Flags, and Scene. We calculate that the considered timeout was roughly 52 CPU days; even when parallelized on a 16-core machine, this would be roughly 6 times as much time as the 12 hours granted to ML-Plan. (A direct experimental comparison was not possible, because de Sá et al. neither published their code nor provided it, or binaries, on request.) As a consequence, there is currently no known tool that achieves better results on automated multi-label classification than ML-Plan.
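The run-time comparison can be checked with a short calculation. The sketch below assumes perfect parallelization across all 16 cores, which is an idealization; the variable names are chosen for illustration only.

```python
# Figures stated in the text.
cpu_days = 52            # estimated total CPU time of the evolutionary approach
cores = 16               # assumed degree of parallelization
ml_plan_budget_hours = 12

# Assumption: perfect parallelization, so CPU time divides evenly across cores.
wall_clock_hours = cpu_days * 24 / cores
ratio = wall_clock_hours / ml_plan_budget_hours
```

This gives 78 hours of wall-clock time and a ratio of 6.5, in line with the factor of roughly 6 stated above.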

Dataset #Inst. #Att. #L 1m 1h 12h
Arts 7484 23146 26 22.4±1.4 20.5±0.0 50.1±8.3 28.0±0.8 0.0±0.0 53.0±1.2 28.0±0.8 19.1±14
Bibtex 7395 1836 159 10.3±2.9 17.1±6.1 33.0±0.5 39.6±0.7 22.6±2.2 35.4±2.0 39.6±0.7 28.8±14
Birds 645 258 21 36.1±2.8 29.5±9.0 37.4±3.2 32.8±2.5 37.5±4.2 39.5±1.9 38.9±1.5 36.6±3.0
Bookmarks 87856 2150 208 10.7±3.5 21.3±1.6 22.1±1.2
Business 11214 21924 30 73.0±0.7 73.9±2.6 75.5±0.9 73.1±0.4 79.7±0.5 75.5±0.9 73.5±0.3
Computers 12444 34096 33 44.0±0.7 51.7±7.4 46.8±5.9 37.3±0.6 44.3±0.0 64.0±0.7 37.3±0.6 42.2±9.6
Education 12030 27534 33 27.7±0.6 51.5±0.0 48.1±10 24.8±0.7 38.4±14 53.1±0.6 24.8±0.7 20.2±14
Enron 1702 1001 53 46.6±3.5 39.6±9.3 54.1±1.0 50.9±1.1 48.5±6.4 56.3±0.7 46.6±9.7 57.5±2.8
Entertainment 12730 32001 21 26.7±1.9 32.5±0.0 40.6±15 32.9±0.6 31.8±16 67.4±0.5 32.9±0.6 37.9±13
Flags 194 14 12 64.3±3.2 66.4±3.2 67.0±1.8 65.2±2.8 68.5±2.9 66.3±1.5 65.6±1.3 68.6±2.1
Health 9205 30605 32 42.8±1.0 41.7±0.9 62.5±14 55.3±0.8 33.4±8.8 72.9±1.1 55.3±0.8 48.1±4.0
LangLog 1460 1004 75 8.8±3.8 6.5±4.2 19.9±1.8 13.8±1.1 11.5±5.1 19.7±0.6 13.8±1.1 12.7±4.2
MEDC 978 1449 45 78.0±1.6 64.9±17 78.8±1.4 77.3±1.3 77.7±3.7 77.5±2.6 72.2±4.1 79.2±2.3
Mediamill 43907 120 101 38.9±12 56.1±4.1 49.4±0.2 44.6±0.0 60.0±1.5 49.4±0.2 48.6±5.1
Musicout 593 72 6 66.2±3.0 62.9±5.1 67.0±3.1 57.8±3.6 66.2±2.7 68.1±1.1 60.2±1.5 67.6±2.1
Protein 662 1186 27 99.0±0.4 73.3±36 99.0±0.5 58.2±0.0 97.9±2.8 98.7±0.4 79.1±18 98.7±0.7
Recreation 12828 30324 22 20.8±3.1 35.5±18 26.9±1.0 30.9±26 63.1±0.6 26.9±1.0 33.9±5.5
Reference 8027 39679 33 44.1±0.7 44.3±0.0 54.2±9.7 49.0±0.4 48.4±5.3 63.5±0.7 49.0±0.4 52.0±0.0
Scene 2407 294 6 62.7±2.4 57.1±19 76.1±1.1 63.0±2.0 72.3±4.4 77.6±0.8 63.3±1.4 77.1±2.1
Science 6428 37187 40 22.0±1.1 21.8±0.0 45.9±16 29.4±0.9 53.6±0.0 56.3±0.5 29.4±0.9 16.3±21
Social 12111 52350 39 41.8±3.5 2.6±0.0 47.4±10 38.9±2.1 0.0±0.0 68.1±0.6 38.9±2.1 55.3±3.4
Society 14512 31802 27 39.7±1.5 43.3±6.5 30.2±0.4 19.9±16 55.6±0.9 30.2±0.4
Tmc 28596 49060 22 20.5±0.4 35.4±1.5 33.7±0.5 34.7±0.0
Yeast 2417 103 14 60.6±1.4 56.8±11 64.4±0.8 60.1±2.1 64.0±2.1 65.2±0.9 61.3±1.2 64.1±1.6
Table 1: Means and standard deviations of the instance-wise F-measure. Each entry gives the mean and standard deviation over 25 runs with different random seeds. Missing values are due to no result being returned within the given timeout or to memory overflows, which occurred in long runs on memory-intensive datasets.
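How a table entry of this kind can be derived from repeated runs is sketched below. The scores in the example are hypothetical, and the use of the sample standard deviation is an assumption, since the section does not state which estimator was used.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return mean and sample standard deviation of the scores of repeated runs."""
    if len(scores) < 2:
        raise ValueError("at least two runs are needed for a standard deviation")
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return statistics.mean(scores), statistics.stdev(scores)


def format_entry(scores: Sequence[float]) -> str:
    """Format the summary as 'mean±std' with one decimal place each."""
    mean, std = summarize_runs(scores)
    return f"{mean:.1f}±{std:.1f}"


# Hypothetical F-measure values (in percent) of three runs with different seeds.
example_entry = format_entry([75.0, 77.0, 76.0])
```

For the three hypothetical scores, the mean is 76.0 and the sample standard deviation is 1.0, so the formatted entry reads 76.0±1.0.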

5.4 Overview of Chosen Classifiers

Figure 3: Heatmap for selected combinations of multi-label classifier and base classifier

The map in Fig. 3 gives an overview of the algorithm combinations chosen by ML-Plan for the scenario of a 1h timeout. The vertical axis iterates over the multi-label classifiers, and the horizontal axis over the WEKA base classifiers plugged into the multi-label classifiers by ML-Plan. Note that the Majority Label Set classifier is not considered in this map, because it is not parametrized with a base learner; still, it was also selected a couple of times. It is not surprising that the map looks rather sparse, because it is unlikely that every combination of multi-label classifier and base classifier occurs.
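The data behind such a heatmap is simply a count of how often each combination was returned. A minimal sketch, assuming the chosen solutions are available as pairs of multi-label classifier and base classifier, could look as follows; the listed selections are hypothetical and do not reproduce the results shown in Fig. 3.

```python
from collections import Counter
from typing import Iterable, List, Sequence, Tuple

Selection = Tuple[str, str]  # (multi-label classifier, base classifier)


def selection_counts(selections: Iterable[Selection]) -> Counter:
    """Count how often each combination was chosen."""
    return Counter(selections)


def to_matrix(
    counts: Counter, mlc_names: Sequence[str], base_names: Sequence[str]
) -> List[List[int]]:
    """Arrange the counts as rows (multi-label classifiers) by columns (base classifiers)."""
    return [[counts.get((mlc, base), 0) for base in base_names] for mlc in mlc_names]


# Hypothetical selections, one per run.
chosen = [
    ("BR", "RandomTree"),
    ("CC", "NaiveBayesMultinomial"),
    ("BR", "RandomTree"),
    ("CC", "J48"),
]
matrix = to_matrix(
    selection_counts(chosen),
    mlc_names=["BR", "CC"],
    base_names=["RandomTree", "NaiveBayesMultinomial", "J48"],
)
```

Combinations that never occur keep a count of zero, which is why a map of this kind tends to look sparse.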

The diversity of choices supports the idea of optimizing multi-label classification pipelines in an instance-specific manner instead of using a single multi-label classifier that is best on average. We observe that each multi-label classifier was selected at least once and that, in general, no single classifier dominates the others. Likewise, quite a variety of base learners was selected by ML-Plan, and there is no clearly dominant base learner. (Random trees and Naive Bayes Multinomial are selected more often than average, but this is mainly due to the priority assigned to them by ML-Plan during search, as we explain in the following paragraph.) In fact, even if we focus on a single multi-label classifier, there are no specifically dominant base classifiers.

The observation that not all base learners were chosen at least once may be explained by the fact that not all of them were even tried. Recalling the procedure of ML-Plan sketched in Fig. 2, we see that the base learner is the third algorithm choice, and that ML-Plan inherits from its multi-class predecessor an ordering in which these learners are analyzed. Clearly, a base classifier with low rank will not be chosen by ML-Plan if it has not even been tried before the timeout triggers.
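The interplay of a fixed ordering and a timeout can be illustrated with a small simulation. The sketch below is not the scheduling logic of ML-Plan: the ordering, the evaluation costs, and the budget are hypothetical, and simulated costs replace real run-times to keep the example deterministic.

```python
from typing import Callable, List, Sequence


def evaluate_in_order(
    candidates: Sequence[str], cost_of: Callable[[str], float], budget: float
) -> List[str]:
    """Evaluate candidates in priority order until the budget is exhausted."""
    tried: List[str] = []
    remaining = budget
    for candidate in candidates:
        cost = cost_of(candidate)
        if cost > remaining:
            break  # timeout: all lower-ranked candidates are never tried
        remaining -= cost
        tried.append(candidate)
    return tried


# Hypothetical priority order of base learners and a uniform evaluation cost.
ordered_learners = ["RandomTree", "NaiveBayesMultinomial", "J48", "SMO"]
tried_learners = evaluate_in_order(ordered_learners, cost_of=lambda _: 20.0, budget=60.0)
never_tried = [name for name in ordered_learners if name not in tried_learners]
```

With a budget of 60 and a cost of 20 per evaluation, only the first three learners are tried, so the fourth cannot appear among the selected solutions regardless of its quality.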

Of course, this last point raises the question of scalability, which affects not only ML-Plan but, more generally, all AutoML tools that evaluate candidates. On the one hand, it can certainly be said that these executing approaches, i.e., the approaches that execute a significant number of candidates during search (including ML-Plan), do not scale with the data size. For automated multi-label classification, this is even more severe, since the search space is even more complex than for the “simple” case of multi-class classification (which itself is already infeasible). On the other hand, there is currently no alternative in sight that would offer a more efficient solution. Indeed, the algorithm selection community is aware of this and has adopted ideas to exploit knowledge gained in the past or during the search in order to avoid less promising evaluations [9, 8]. However, at present, there is no smart solution in sight that relieves us from execution.

6 Conclusion

In this paper, we have presented ML-Plan, an AutoML tool to automatically select and configure a multi-label classifier for a given dataset. On the model level, ML-Plan adopts hierarchical task network (HTN) planning, a technique from AI planning, to recursively build multi-label classifiers. On the code level, ML-Plan binds these multi-label classifiers to implementations in the multi-label framework MEKA in order to execute them and evaluate their performance. We demonstrate the usefulness of a tool dedicated to multi-label classification by comparing it to a random search baseline and a baseline which reduces a multi-label classification problem to binary relevance incorporating Auto-WEKA to tailor each individual base learner. We show that, given a suitable timeout, ML-Plan significantly outperforms both baselines. To the best of our knowledge, apart from [22, 21], this is the first substantially evaluated approach to automated multi-label classification.

Having confirmed the suitability of a dedicated tool for automated multi-label classification, some natural follow-up research questions arise. The most important question concerns scalability, which affects almost all AutoML frameworks but is particularly severe for multi-label classification due to an even more complex search space. One possible way of scaling ML-Plan would be to use it in a service-oriented architecture as proposed in [17, 14]. Another question is what benefit can be expected from MLC pipelines that incorporate preprocessing and are more flexible in both the configuration of parameters and the configuration of base classifiers, decomposing the original problem into sub-problems as proposed in [16] for multi-class classification, such that finally a heuristic strategy for HOMER [26] is obtained. Furthermore, building ensembles of optimized hierarchical decompositions could lead to further improvements of the overall performance, as shown for multi-class classification in [27].

This second question runs counter to the first one in the sense that any step in that direction will enlarge the search space even further. Finally, the existence of different MLC loss measures motivates a multi-objective optimization process that considers not just a single measure but several such losses simultaneously.
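One common building block of multi-objective optimization is Pareto dominance over vectors of losses. The following sketch only illustrates this general concept and is not part of ML-Plan; the candidate names and loss values are hypothetical, and all losses are assumed to be minimized.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b on every loss and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates: Dict[str, Sequence[float]]) -> List[str]:
    """Return the names of all candidates that no other candidate dominates."""
    return [
        name
        for name, losses in candidates.items()
        if not any(
            dominates(other, losses)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


# Hypothetical candidates with two losses each, e.g. two different MLC loss measures.
losses_per_candidate = {
    "A": (0.2, 0.5),
    "B": (0.3, 0.3),
    "C": (0.4, 0.6),
}
front = pareto_front(losses_per_candidate)
```

Here candidate C is dominated by A, whereas A and B are incomparable, so a multi-objective process would return both rather than a single best solution.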


This work was partially supported by the German Research Foundation (DFG) within the Collaborative Research Center “On-The-Fly Computing” (SFB 901).


  • [1] Au, T., Ilghami, O., Kuter, U., Murdock, J.W., Nau, D.S., Wu, D., Yaman, F.: SHOP2: an HTN planning system. CoRR abs/1106.4869 (2011)
  • [2] Browne, C., Powley, E.J., Whitehouse, D., Lucas, S.M., Cowling, P.I., Rohlfshagen, P., Tavener, S., Liebana, D.P., Samothrakis, S., Colton, S.: A survey of Monte Carlo tree search methods. IEEE Trans. Comput. Intellig. and AI in Games 4(1), 1–43 (2012)
  • [3] Dembczynski, K., Waegeman, W., Cheng, W., Hüllermeier, E.: On label dependence and loss minimization in multi-label classification. Machine Learning 88(1-2), 5–45 (2012)
  • [4] Feurer, M., Klein, A., Eggensperger, K., Springenberg, J., Blum, M., Hutter, F.: Efficient and robust automated machine learning. In: Advances in Neural Information Processing Systems. pp. 2962–2970 (2015)
  • [5] Georgievski, I., Aiello, M.: HTN planning: Overview, comparison, and beyond. Artif. Intell. 222, 124–156 (2015)
  • [6] Ghallab, M., Nau, D., Traverso, P.: Automated Planning: theory and practice. Elsevier (2004)
  • [7] Gharroudi, O., Elghazel, H., Aussem, A.: A comparison of multi-label feature selection methods using the random forest paradigm. In: Advances in Artificial Intelligence - 27th Canadian Conference on Artificial Intelligence, Canadian AI 2014, Montréal, QC, Canada, May 6-9, 2014. Proceedings. pp. 95–106 (2014)
  • [8] Hutter, F., Hoos, H.H., Leyton-Brown, K.: Sequential model-based optimization for general algorithm configuration. LION 5, 507–523 (2011)
  • [9] Kadioglu, S., Malitsky, Y., Sellmann, M., Tierney, K.: ISAC - instance-specific algorithm configuration. In: ECAI 2010 - 19th European Conference on Artificial Intelligence, Lisbon, Portugal, August 16-20, 2010, Proceedings. pp. 751–756 (2010)
  • [10] Kietz, J., Serban, F., Bernstein, A., Fischer, S.: Designing KDD-workflows via HTN-planning. In: ECAI 2012 - 20th European Conference on Artificial Intelligence. Including Prestigious Applications of Artificial Intelligence (PAIS-2012) System Demonstrations Track, Montpellier, France, August 27-31, 2012. pp. 1011–1012 (2012)
  • [11] Komer, B., Bergstra, J., Eliasmith, C.: Hyperopt-sklearn: automatic hyperparameter configuration for scikit-learn. In: ICML workshop on AutoML (2014)
  • [12] Lloyd, J.R., Duvenaud, D.K., Grosse, R.B., Tenenbaum, J.B., Ghahramani, Z.: Automatic construction and natural-language description of nonparametric regression models. In: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, Québec City, Québec, Canada. pp. 1242–1250 (2014)
  • [13] Mohr, F., Lettmann, T., Hüllermeier, E., Wever, M.: Programmatic task network planning. In: Proceedings of the 28th International Conference on Automated Planning and Scheduling. AAAI (2018)
  • [14] Mohr, F., Wever, M., Hüllermeier, E.: Automated machine learning service composition. CoRR abs/1809.00486 (2018)
  • [15] Mohr, F., Wever, M., Hüllermeier, E.: ML-Plan: Automated machine learning via hierarchical planning. Machine Learning 107(8-10), 1495–1515 (2018)
  • [16] Mohr, F., Wever, M., Hüllermeier, E.: Reduction stumps for multi-class classification. In: IDA. Lecture Notes in Computer Science, vol. 11191, pp. 225–237. Springer (2018)
  • [17] Mohr, F., Wever, M., Hüllermeier, E., Faez, A.: (WIP) towards the automated composition of machine learning services. In: SCC. pp. 241–244. IEEE (2018)
  • [18] Nguyen, P., Hilario, M., Kalousis, A.: Using meta-mining to support data mining workflow planning and optimization. J. Artif. Intell. Res. 51, 605–644 (2014)
  • [19] Olson, R.S., Moore, J.H.: TPOT: A tree-based pipeline optimization tool for automating machine learning. In: Workshop on Automatic Machine Learning. pp. 66–74 (2016)
  • [20] Pereira, R.B., Plastino, A., Zadrozny, B., Merschmann, L.H.C.: Categorizing feature selection methods for multi-label classification. Artif. Intell. Rev. 49(1), 57–78 (2018)
  • [21] de Sá, A.G.C., Freitas, A.A., Pappa, G.L.: Automated selection and configuration of multi-label classification algorithms with grammar-based genetic programming. In: PPSN (2). Lecture Notes in Computer Science, vol. 11102, pp. 308–320. Springer (2018)
  • [22] de Sá, A.G.C., Pappa, G.L., Freitas, A.A.: Towards a method for automatically selecting and configuring multi-label classification algorithms. In: Genetic and Evolutionary Computation Conference, Berlin, Germany, July 15-19, 2017, Companion Material Proceedings. pp. 1125–1132 (2017)
  • [23] de Sá, A.G.C., Pinto, W.J.G.S., Oliveira, L.O.V.B., Pappa, G.L.: RECIPE: A grammar-based framework for automatically evolving classification pipelines. In: Genetic Programming - 20th European Conference, EuroGP 2017, Amsterdam, The Netherlands, April 19-21, 2017, Proceedings. pp. 246–261 (2017)
  • [24] Sechidis, K., Tsoumakas, G., Vlahavas, I.P.: On the stratification of multi-label data. In: Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011, Proceedings, Part III. pp. 145–158 (2011)
  • [25] Thornton, C., Hutter, F., Hoos, H.H., Leyton-Brown, K.: Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In: The 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2013, Chicago, IL, USA. pp. 847–855 (2013)
  • [26] Tsoumakas, G., Katakis, I., Vlahavas, I.: Effective and efficient multilabel classification in domains with large number of labels (2008)
  • [27] Wever, M., Mohr, F., Hüllermeier, E.: Ensembles of evolved nested dichotomies for classification. In: GECCO. pp. 561–568. ACM (2018)
  • [28] Wever, M., Mohr, F., Hüllermeier, E.: ML-Plan for Unlimited-Length Machine Learning Pipelines (2018)
  • [29] Zhang, M., Zhou, Z.: A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng. 26(8), 1819–1837 (2014)