1. Introduction
We are experiencing the era of data. With its great availability, people in general (e.g., practitioners, data scientists, and researchers) are trying hard to extract the useful information encoded in data (Siegel2013). This has resulted in the ever-growing popularity and indiscriminate use of machine learning (ML) algorithms by many types of users.
The field of Automated Machine Learning (AutoML) (Hutter2019) has emerged to help this wide and heterogeneous public. This field has the purpose of democratizing ML so that it can be used with fewer difficulties by general audiences. In addition, AutoML also aims to assist experienced data scientists. In both scenarios, the field of AutoML has the scope of recommending learning algorithms (and often their hyperparameter settings too) when people face a particular problem that might be (partially or totally) solved with ML. Broadly speaking, AutoML proposes to deal with users’ biases by customizing the solutions (in terms of algorithms and configurations) to ML problems following different approaches.
AutoML has mainly and successfully been employed to solve traditional (single-label) classification and regression problems (Elshawi2019). However, this work is interested in AutoML methods for a different and specific type of task, called Multi-Label Classification (MLC) (Tsoumakas2011; Zhang2014; Herrera2016). The goal of MLC is to learn a model that expresses the relationships between a set of predictive features (attributes) describing the examples and a predefined set of class labels. In MLC, each example can be simultaneously associated with one or more class labels. Each class label is represented by a discrete value.
When compared to single-label classification (SLC), MLC can be considered a more challenging task, mainly for the following reasons. First, an MLC algorithm needs to consider the label correlations (i.e., detecting whether or not they exist) in order to learn a model that produces accurate classification results (Zhang2014). Second, given the usually limited number of examples for each class label in the dataset (Herrera2016), generalization in MLC is considered harder than in SLC, as the MLC algorithm needs more examples to create a good model from such complex data (Domingos2012). Third, MLC classifiers are difficult to evaluate, as several metrics follow contrasting aspects to define what a good MLC prediction is (Pereira2018). Finally, the learning algorithms applied to solve MLC problems need more computational resources than the ones used to solve SLC problems (Herrera2016). This is mainly due to MLC being a generalization of SLC, so that the algorithms need to look at several labels instead of just one.
We claim that these challenges are part of the reason why AutoML for MLC problems (i.e., AutoMLC) has not been sufficiently explored. Taking this into consideration, this work performs an assessment of popular search methods for AutoMLC, including evolutionary methods, a Bayesian optimization method and blind-search methods. For that, we propose two novel AutoMLC search methods. The first is an extension of the work of de Sá et al. (deSa2018) on Grammar-based Genetic Programming (GGP) (McKay2010; Whigham1995) for AutoMLC, named Auto-MEKA_GGP. Our extension adds to the GGP core a speciation approach (Back1999) aiming to improve the diversity of the produced solutions. The second search method is a Bayesian optimization (BO) method, namely Sequential Model-based Algorithm Configuration (SMAC) (Hutter2011) – note that no such method had previously been proposed in the AutoMLC literature. As both proposed methods are based on the well-known MLC MEKA framework (Read2016), we named these search methods Auto-MEKA_spGGP and Auto-MEKA_BO, respectively.
We compare these two proposed search methods with Auto-MEKA_GGP (deSa2018), a random search (namely, Auto-MEKA_RS) and a greedy search (namely, Auto-MEKA_GS) on three designed MLC search spaces (namely, Small, Medium and Large) over 14 benchmark datasets. Finally, in this work, we use five performance measures for evaluating these methods, due to the additional degree of freedom that the MLC algorithms’ setting introduces (Madjarov2012).
The experimental results show that Auto-MEKA_GGP mostly presented the best average results and also the best average ranks for several search spaces and measures. Besides, Auto-MEKA_GGP was the only method to be statistically better than all other evaluated search methods on different occasions (i.e., performance measures versus search spaces), except when compared to Auto-MEKA_GS.
Although this is a positive result for the evolutionary methods, we believe that more robust methods – such as Auto-MEKA_GGP, Auto-MEKA_spGGP and Auto-MEKA_BO – can still improve their predictive performances. With this in mind, we observe that these methods could not achieve a satisfactory trade-off between exploration and exploitation, as they were not statistically and simultaneously better than the pure-exploration and pure-exploitation methods (i.e., Auto-MEKA_RS and Auto-MEKA_GS, respectively).
The results also show that there is a high correlation between the size (and definition) of the search space and the effectiveness of AutoMLC methods to select and configure algorithms. When looking at the predictive accuracy of the AutoMLC methods, we have an indication that as the size of an AutoMLC’s search space decreases, pure-exploration and/or pure-exploitation AutoMLC search methods tend to have similar results to robust AutoMLC methods (such as the ones presented in this work).
The remainder of this paper is organized as follows. Section 2 introduces MLC and Section 3 reviews related work on AutoMLC. Section 4 details AutoMLC in terms of the proposed search spaces and evaluated search methods that are included in the experimental comparison, while Section 5 presents and discusses the obtained results. Finally, Section 6 draws some conclusions and discusses directions for future work.
2. Multi-label Classification
There is a great number of works on traditional single-label classification (SLC) for machine learning (ML) (Zaki2020). In SLC, each example is defined by a tuple $(\mathbf{x}, y)$, where $\mathbf{x}$ is an $m$-dimensional vector representing the feature space (i.e., the categorical and/or numerical characteristics of that example) and $y$ is the class value, where $y \in L$, a set of disjoint class labels. In SLC, each example is strictly associated with a single class label.
Nevertheless, there is an increasing number of applications that require associating an example with more than one class label (Gibaja2015), such as medical diagnosis and protein function prediction. This classification scenario is better known as Multi-Label Classification (MLC). According to (Tsoumakas2010), each example in MLC is represented by a tuple $(\mathbf{x}, Y)$, where $\mathbf{x}$ is the $m$-dimensional feature vector, and $Y \subseteq L$ is a set of non-disjoint class labels. Hence, we would like to find an MLC model $h: \mathbf{x} \mapsto Y$ such that $h$ maximizes a quality criterion.
The literature divides MLC algorithms into three categories (Gibaja2015): problem transformation (PT), algorithm adaptation (AA) and ensemble or meta-algorithm methods (MetaMLC). Whereas PT methods transform the multi-label problem into one or more single-label classification problems, AA methods simply extend single-label classification algorithms so they can directly handle multi-label data. Finally, MetaMLC methods act on top of PT or AA multi-label classifiers, aiming to combine the results of MLC algorithms and produce models with more robust predictive performances.
Among the great number of MLC algorithms (Tsoumakas2011; Zhang2014; Herrera2016), it is important to mention three methods that transform an MLC problem into one or many SLC problems: Label Powerset (LP), Binary Relevance (BR) and Classifier Chain (CC). LP creates a single class for each unique set of labels that is associated with at least one example in a multi-label training set. BR, in turn, learns independent binary classifiers, one for each label in the label set $L$. Finally, CC changes the BR method by chaining the binary classifiers. In this case, the feature space of each link in the chain is increased with the classification outputs of all previous links.
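The three problem transformations can be illustrated on a toy label matrix. The following is a minimal sketch, not MEKA's implementation: the function names and the toy data are hypothetical, and the CC view assumes that the true labels of the previous links are appended to the features during training (at prediction time, the outputs of the previous links would be used instead).

```python
from typing import Dict, List, Optional, Tuple


def binary_relevance_targets(Y: List[List[int]]) -> List[List[int]]:
    """BR: one binary target vector per label (the columns of Y)."""
    return [list(column) for column in zip(*Y)]


def label_powerset_targets(Y: List[List[int]]) -> Tuple[List[int], Dict[Tuple[int, ...], int]]:
    """LP: every distinct label set becomes one class of a single-label problem."""
    classes: Dict[Tuple[int, ...], int] = {}
    targets = []
    for row in Y:
        key = tuple(row)
        if key not in classes:
            classes[key] = len(classes)  # next unused class id
        targets.append(classes[key])
    return targets, classes


def classifier_chain_problems(X: List[List[float]], Y: List[List[int]],
                              order: Optional[List[int]] = None):
    """CC (training view): link j sees the features plus the labels of all previous links."""
    order = list(range(len(Y[0]))) if order is None else order
    problems = []
    for position, label in enumerate(order):
        previous = order[:position]
        inputs = [list(x) + [y[p] for p in previous] for x, y in zip(X, Y)]
        target = [y[label] for y in Y]
        problems.append((inputs, target))
    return problems


# Hypothetical toy data: 4 examples, 2 features, 3 labels.
X = [[0.1, 1.0], [0.4, 0.0], [0.9, 1.0], [0.3, 0.5]]
Y = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 0]]

print(binary_relevance_targets(Y))
print(label_powerset_targets(Y)[0])
print(classifier_chain_problems(X, Y)[2][0][0])
```

In this sketch, BR yields three independent binary problems, LP yields one problem with as many classes as distinct label sets, and each CC link receives a feature vector that grows by one entry per previous link.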
3. Related Work on AutoMLC
Most AutoML methods in the literature were designed to solve the conventional single-label classification and regression tasks (Elshawi2019), and cannot handle multi-label data. As far as we know, there are only a few works related to Automated Multi-Label Classification (AutoMLC).
(Chekina2011) developed a meta-model (i.e., a Nearest Neighbors classifier) for selecting one out of 11 multi-label classification algorithms, taking into account 30 characterizing measures and 36 meta-datasets. Nevertheless, as it is a preliminary work, it only selects the MLC algorithm and does not set the algorithm’s hyperparameters.
Evolutionary Multi-Label Ensemble (EME) (Moyano2019) addresses the problem of selecting MLC algorithms to compose MLC ensembles. The main idea of EME rests on the simplicity of each multi-label classifier in the ensemble, which is focused on a small subset of the labels, while still considering the relationships among them and avoiding the high complexity of the output space. Nevertheless, EME takes into account only one type of model to compose the ensembles (i.e., the model produced by label powerset), so it is not sufficient to deal with all types of MLC problems.
Furthermore, (Wever2019) proposed an extension of a canonical hierarchical planning method (i.e., ML-Plan) to the MLC context. They named this method ML2-Plan (Multi-Label ML-Plan). Basically, the method is implemented as a global best-first search over the graph induced by the planning problem at hand.
Finally, (deSa2017; deSa2018) proposed two AutoML methods for MLC problems: GA-Auto-MLC and Auto-MEKA_GGP. Whereas GA-Auto-MLC employs a real-coded genetic algorithm (Eiben2003) to perform its search, Auto-MEKA_GGP uses a grammar-based genetic programming algorithm (McKay2010; Whigham1995). Auto-MEKA_GGP is a robust enhancement of GA-Auto-MLC to handle huge and, consequently, complex MLC search spaces. Because of that, we decided to include only Auto-MEKA_GGP in our experiments. We will further discuss it in Section 4.
4. AutoML Methods for Multi-label Classification
This section introduces a generic AutoMLC framework followed by all AutoMLC methods evaluated in this paper. As illustrated in Figure 1, the AutoMLC method receives as input a specific multi-label dataset (with its feature space and its class labels). Structurally, the evaluated AutoMLC methods have two main components: the search space and the search method.
The search space consists of the main building blocks (e.g., the prediction threshold values, the hyperparameters and the algorithms at the SLC level) from previously designed MLC algorithms. To explore this search space, the AutoMLC method uses a search method, which finds appropriate MLC algorithms for the dataset at hand. However, the performance of the search method depends on what is specified in the search space.
Once the search is over, the AutoMLC method outputs an MLC algorithm tailored to the input dataset based on that search space. This MLC algorithm is specifically selected and (hyper)parameterized to this dataset, although it could be applied to any multilabel dataset. In the end, the customized MLC algorithm returns an MLC model, and consequently, its classification results.
The evaluation presented in this paper considers three search spaces, namely Small, Medium and Large, which differ from each other in terms of complexity. These three search spaces are explored using five search methods: Auto-MEKA_GGP, Auto-MEKA_spGGP, Auto-MEKA_BO, Auto-MEKA_RS and Auto-MEKA_GS, as detailed in Section 4.2.
4.1. Search Spaces
To design the search spaces for the AutoMLC methods being evaluated, we first performed an in-depth study of multi-label classification in the MEKA software. We analyzed in detail all the algorithms and their hyperparameters, the constraints associated with different hyperparameter settings, the hierarchical nature of operations performed by problem transformation methods and meta-algorithms, among other issues.
Based on that, we designed three search spaces: Small, Medium and Large (for more details about the MLC and SLC algorithms and meta-algorithms that compose the search spaces, see the supplementary material). The reason behind this threefold modeling is that we want to test different levels of search space complexity.
For the search space Small, for instance, we have five MLC algorithms, four of which can be combined with five SLC algorithms, as they are from the PT category. The only algorithm that cannot be combined with SLC algorithms is ML-BPNN, which belongs to the AA category. Therefore, the search space Small consists of 10 learning algorithms – five MLC algorithms and five SLC algorithms – which gives a set of 21 combinations of learning algorithms, where the AA category counts as one combination.
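The count of 21 combinations for the search space Small follows directly from the numbers given above, as this short check shows. The helper name is hypothetical. The same formula cannot be applied to Medium or Large from the totals alone, since the split between PT and AA algorithms (and, for Large, the meta-algorithm levels) is not stated in this section.

```python
def count_combinations(n_pt_mlc: int, n_slc: int, n_aa_mlc: int) -> int:
    """Each PT algorithm pairs with every SLC base learner; each AA algorithm counts once."""
    return n_pt_mlc * n_slc + n_aa_mlc


# Search space Small: four PT algorithms, five SLC algorithms, one AA algorithm.
print(count_combinations(n_pt_mlc=4, n_slc=5, n_aa_mlc=1))  # 21
```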
In contrast, the search space Medium has 30 learning algorithms – 15 MLC algorithms and 15 SLC algorithms, which produces 211 combinations of algorithms. Finally, the search space Large has a total of 54 learning algorithms – 26 MLC algorithms and 28 SLC algorithms, which produces 16,568 possible combinations of learning algorithms.
Although the main difference between the Small and Medium search spaces is the number of learning algorithms, note that, when comparing Medium to Large, we have a change in the structure of the search space. This happens because meta-algorithms at the MLC and SLC levels were added only to Large. Hence, we have more levels in the multi-label hierarchy to consider. For example, when a search method is selecting a new MLC algorithm in this search space, it must decide whether to include or exclude meta-algorithms at the MLC and SLC levels. As a result, this search space is hierarchically more complex than the other two (i.e., Small and Medium).
Taking into account the number of learning algorithms, the number of hyperparameters, and the constraints in the choices of algorithms’ components and (hyper)parameters in MEKA, we estimated the size of the three search spaces (in these estimations, each real-valued hyperparameter always takes 100 different discrete values). The estimated number of possible MLC algorithm configurations (i.e., a given set of learning algorithms with their respective hyperparameters) in the search space Small depends on m, the number of features (or attributes), and q, the number of labels of the dataset. Corresponding estimates were made for the search spaces Medium and Large.
4.2. Search Methods
This section details the five search methods used in our comparison. They all follow the same methodology.
Each search method starts its own iterative process by generating, evaluating and looking for MLC algorithm configurations. To perform the evaluation, the search methods use the average of four well-known measures (Tsoumakas2010; Gibaja2015): Exact Match (EM), Hamming Loss (HL), F1 macro-averaged by label (FM) and Ranking Loss (RL), as indicated in Equation 1. The search method keeps iterating until a maximum time budget is reached. At the end, the best MLC algorithm configuration according to the quality criterion is returned and assessed on the test set.
$$\mathit{Fitness} = \frac{EM + (1 - HL) + FM + (1 - RL)}{4} \qquad (1)$$
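The fitness computation can be sketched in plain Python as the average of EM, 1 − HL, FM and 1 − RL. This is an illustrative sketch rather than the MEKA implementation: the function names are hypothetical, and the conventions for edge cases (a label that never occurs, an example with no relevant or no irrelevant labels, tied scores) are assumptions that may differ from the ones used in the experiments.

```python
def exact_match(Y_true, Y_pred):
    """Fraction of examples whose predicted label set equals the true label set."""
    return sum(t == p for t, p in zip(Y_true, Y_pred)) / len(Y_true)


def hamming_loss(Y_true, Y_pred):
    """Fraction of misclassified example-label pairs."""
    n, q = len(Y_true), len(Y_true[0])
    wrong = sum(t != p for yt, yp in zip(Y_true, Y_pred) for t, p in zip(yt, yp))
    return wrong / (n * q)


def macro_f1(Y_true, Y_pred):
    """F1 computed per label, then averaged across labels."""
    q = len(Y_true[0])
    total = 0.0
    for j in range(q):
        tp = sum(yt[j] == 1 and yp[j] == 1 for yt, yp in zip(Y_true, Y_pred))
        fp = sum(yt[j] == 0 and yp[j] == 1 for yt, yp in zip(Y_true, Y_pred))
        fn = sum(yt[j] == 1 and yp[j] == 0 for yt, yp in zip(Y_true, Y_pred))
        denom = 2 * tp + fp + fn
        total += 2 * tp / denom if denom else 0.0  # assumed convention for empty labels
    return total / q


def ranking_loss(Y_true, scores):
    """Average fraction of (relevant, irrelevant) label pairs that are wrongly ordered."""
    total, counted = 0.0, 0
    for yt, s in zip(Y_true, scores):
        relevant = [s[j] for j, t in enumerate(yt) if t == 1]
        irrelevant = [s[j] for j, t in enumerate(yt) if t == 0]
        if not relevant or not irrelevant:
            continue  # undefined for this example; skipped (assumed convention)
        bad = sum(r <= i for r in relevant for i in irrelevant)  # ties count as errors
        total += bad / (len(relevant) * len(irrelevant))
        counted += 1
    return total / counted if counted else 0.0


def fitness(Y_true, Y_pred, scores):
    """Equation 1: HL and RL are subtracted from one so that every term is maximized."""
    em = exact_match(Y_true, Y_pred)
    hl = hamming_loss(Y_true, Y_pred)
    fm = macro_f1(Y_true, Y_pred)
    rl = ranking_loss(Y_true, scores)
    return (em + (1 - hl) + fm + (1 - rl)) / 4


# Hypothetical toy predictions for 2 examples and 3 labels.
Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 0, 1], [0, 1, 1]]
scores = [[0.9, 0.2, 0.8], [0.1, 0.7, 0.6]]
print(round(fitness(Y_true, Y_pred, scores), 3))
```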
Regarding the MLC evaluation measures, EM is a very strict metric, as it only takes the value one when the predicted label set is an exact match to the true label set for an example, and takes the value zero otherwise. HL, in turn, calculates how many example-label pairs are misclassified. Furthermore, FM is the harmonic mean between precision and recall, and its average is first calculated for each label and, after that, across all the labels. Finally, RL measures the number of times that irrelevant labels are ranked higher than relevant labels, i.e., it penalizes the label pairs that are reversely ordered in the ranking for a given example. All four metrics are within the [0, 1] interval. However, whereas the EM and FM measures should be maximized, the HL and RL measures should be minimized. Hence, HL and RL are subtracted from one in Equation 1 to make the search maximize the fitness function.
4.2.1. Auto-MEKA_GGP
This method was proposed in (deSa2018), and relies on a Grammar-based Genetic Programming (GGP) approach (McKay2010; Whigham1995), which has the advantage of exploring the hierarchical nature of the AutoMLC problem. In this case, the grammar encompasses the search space of MLC algorithms and hyperparameter settings.
In Auto-MEKA_GGP, each individual expresses an MLC algorithm configuration, and is represented by a derivation tree generated from the grammar. Individuals are first generated by choosing at random a valid production rule, and then mapping it into an MLC algorithm (with a specific hyperparameter setting).
Figure 2 details the whole mapping process followed by the evaluation process for each GGP individual. In the example of Figure 2, ellipsoid nodes are the grammar’s nonterminals, whereas the rectangles are the terminals.
The mapping process takes the terminals from the tree and constructs a valid MLC algorithm. The mapping in Figure 2 will produce the following MLC algorithm: a Binary Relevance method combined with a Logistic Regression algorithm (with the hyperparameter ridge set to 0.019), using a threshold of 0.3 to classify the MLC data.
Next, individuals have their fitness calculated as previously explained, and undergo tournament selection. The GGP operators (i.e., Whigham’s crossover and mutation (Whigham1995)) are applied to the selected individuals to create a new population. These operators also respect the grammar constraints, ensuring that the produced individuals represent valid solutions.
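The derivation, mapping and mutation steps can be sketched with a toy grammar. Everything below is hypothetical and heavily simplified: the grammar is far smaller than the ones used for the actual search spaces, the terminal names are illustrative only, and the mutation shown (regrowing one randomly chosen nonterminal subtree from the grammar) only approximates the spirit of Whigham's operator.

```python
import random

# Hypothetical toy grammar: nonterminals map to lists of alternative productions.
GRAMMAR = {
    "<start>": [["<mlc>", "<slc>", "<threshold>"]],
    "<mlc>": [["BR"], ["CC"], ["LP"]],
    "<slc>": [["LogisticRegression", "<ridge>"], ["NaiveBayes"]],
    "<ridge>": [["ridge=0.019"], ["ridge=0.1"]],
    "<threshold>": [["threshold=0.3"], ["threshold=0.5"]],
}


def derive(symbol, rng):
    """Build a derivation tree (symbol, children); terminals have no children."""
    if symbol not in GRAMMAR:
        return (symbol, [])
    production = rng.choice(GRAMMAR[symbol])
    return (symbol, [derive(s, rng) for s in production])


def terminals(tree):
    """Mapping step: read the terminals left to right to obtain the configuration."""
    symbol, children = tree
    if not children:
        return [symbol]
    return [token for child in children for token in terminals(child)]


def mutate(tree, rng):
    """Regrow one randomly chosen nonterminal subtree, so the result stays grammar-valid."""
    paths = []

    def collect(node, path):
        symbol, children = node
        if symbol in GRAMMAR:
            paths.append(path)
        for index, child in enumerate(children):
            collect(child, path + (index,))

    def rebuild(node, path):
        symbol, children = node
        if not path:
            return derive(symbol, rng)  # fresh derivation from the same nonterminal
        return (symbol, [rebuild(c, path[1:]) if i == path[0] else c
                         for i, c in enumerate(children)])

    collect(tree, ())
    return rebuild(tree, rng.choice(paths))


rng = random.Random(0)
individual = derive("<start>", rng)
print(terminals(individual))
print(terminals(mutate(individual, rng)))
```

Because both initialization and mutation only ever expand nonterminals through productions of the grammar, every individual maps to a syntactically valid configuration, which is the property the text attributes to the grammar-constrained operators.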
4.2.2. Auto-MEKA_spGGP
This novel AutoMLC method enhances the search mechanisms of Auto-MEKA_GGP by adding a speciation process (Back1999) into its search method. The general idea is to use Grammar-based Genetic Programming with Speciation (spGGP) to improve the trade-off between exploration and exploitation of the search for MLC algorithms and hyperparameter settings. Because the proposed search spaces have an exponential size and a complex hierarchical nature, it may be crucial to use this approach to deal with these aspects. A species is a set of individuals that resemble each other more closely than they resemble the individuals in another species (Back1999). In speciation-based evolutionary computation, the objective is to restrict mating to similar individuals from the population. In this case, likeness among individuals is identified if they have similar genotypes or phenotypes.
In this work, we defined a set of species based on the types of MLC hyperparameters (i.e., categorical, discrete or continuous) and their interactions. It is worth noting that, whilst the categorical hyperparameters take Boolean and string-based values (e.g., an algorithm name or a procedure name in an algorithm), the discrete hyperparameters only take integer values. Therefore, our speciation-based method focuses not exclusively on the choice of the learning algorithms but primarily on the different types of hyperparameters, where the choice of the learning algorithms is set as a special case of a categorical hyperparameter.
In general, we would like to understand if there is a dependence between the final AutoMLC predictive performance and the types (and the interactions) of hyperparameters for a given dataset. For instance, if we would like to recommend MLC algorithms for two datasets with different characteristics, understanding only the categorical hyperparameters for the first dataset may be more beneficial than understanding discrete and/or continuous hyperparameters. This could be the opposite for the second dataset.
In this context, we design eight species. Different species specialize in optimizing different combinations of hyperparameter types and their interactions. All species have instances of all learning algorithms at both MLC and SLC levels based on the defined search space, but they vary in the types of hyperparameters that are left with their default values during evolution and cannot be updated. In the descriptions below, the settings of the hyperparameters we refer to can be changed by the evolutionary process, while all others are set to their default values. The eight species are:
1. Learning algorithms: Only the categorical hyperparameters referring to the names of the (traditional and meta) learning algorithms at the MLC and SLC levels can be combined and evolved.
2. Learning algorithms and common categorical hyperparameters: Together with the categorical hyperparameters indicating the names of the learning algorithms (species 1), this species also allows the combination and evolution of common categorical hyperparameters (e.g., the name of a metric). In addition, this species also encompasses Boolean hyperparameters.
3. Learning algorithms and discrete hyperparameters: This species considers, alongside the categorical hyperparameters that indicate the names of the learning algorithms, the discrete (integer) hyperparameters.
4. Learning algorithms and continuous hyperparameters: This species allows the modification and combination of the continuous hyperparameters together with the learning algorithms of species 1.
5. Learning algorithms and the combination of common categorical and discrete hyperparameters: In this species, we evolve the individuals considering the learning algorithms themselves (species 1) together with common categorical and discrete hyperparameters.
6. Learning algorithms and the combination of common categorical and continuous hyperparameters: In this case, we make the evolutionary process consider the hyperparameters representing the learning algorithms, the common categorical hyperparameters and the continuous hyperparameters.
7. Learning algorithms and the combination of discrete and continuous hyperparameters: This species allows the combination of the names of the learning algorithms with discrete and continuous hyperparameters.
8. All types of hyperparameters: This species is more general, and all types of hyperparameters (categorical referring to the names of the learning algorithms, common categorical, discrete, continuous hyperparameters) are considered to be explored/exploited.
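One way to read the eight species is as the learning-algorithm choice combined with every subset of the three remaining hyperparameter types (2^3 = 8). The sketch below enumerates them under that reading; the identifiers are hypothetical labels, not names used by the authors.

```python
from itertools import combinations

# The three hyperparameter types that a species may or may not be allowed to evolve.
OPTIONAL_TYPES = ("common_categorical", "discrete", "continuous")


def enumerate_species():
    """Every species evolves the learning algorithms plus one subset of the optional types."""
    species = []
    for size in range(len(OPTIONAL_TYPES) + 1):
        for subset in combinations(OPTIONAL_TYPES, size):
            species.append(("learning_algorithms",) + subset)
    return species


def is_evolvable(species, hyperparameter_type):
    """Types outside the species keep their default values during evolution."""
    return hyperparameter_type in species


for number, species in enumerate(enumerate_species(), start=1):
    print(number, species)
```

Enumerating subsets by increasing size happens to reproduce the order in which the species are described above, from the species that evolves only the learning algorithms to the one that evolves all types.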
The first step of Auto-MEKA_spGGP’s evolutionary process is the initialization procedure, where we generate for each species a population of individuals, which are represented by trees and built based on a specific grammar for that species.
Auto-MEKA_spGGP also differs from Auto-MEKA_GGP in the crossover operator, which can be performed for both intra-species and inter-species individuals. By interchangeably using both types of crossover operations, we have more chances to test unknown regions of the search space (exploration) when using the inter-species crossover, while a more local search over the different types of hyperparameters is performed in each species by the intra-species crossover (exploitation).
It is worth noting that we decided to design the mutation operator as a local operator in each species. By doing that, Whigham’s mutation uses the grow method on the individual’s derivation tree but ensures that the MLC grammar of the current species is applied in the grow method.
4.2.3. Auto-MEKA_BO
This proposed Bayesian Optimization (BO) AutoMLC method employs the Sequential Model-based Algorithm Configuration (SMAC) (Hutter2011) as a procedure to search for suitable MLC configurations. In our generic framework, Auto-MEKA_BO can be categorized as a sophisticated extension of Auto-WEKA (Thornton2013).
Hence, given a dataset and a search space, Auto-MEKA_BO uses a performance model (in our case, a Random Forest) to robustly select the MLC configurations. This model is initialized with a default MLC algorithm with default hyperparameter settings. In the case of Auto-MEKA_BO, we initialize the model with different algorithms, as the search spaces allow different types of learning algorithms. For the search space Small, we run the classifier chain (CC) algorithm using the naïve Bayes (NB) classification algorithm at the single-label base level and include its results in the model.
As the search space Medium is similar to Small in terms of the hierarchical levels, we keep the CC algorithm at the multi-label level. However, we try to improve the single-label classification level by using a stronger algorithm, i.e., a more sophisticated Bayesian network classifier (BNC) algorithm instead of a simple NB classification algorithm. Hence, at this level, the K2 algorithm is employed.
Finally, for the search space Large, we define as the initial configuration to the model the random subspace meta-algorithm for multi-label classification (RSML), using the Bayesian classifier chain (BCC) algorithm at the multi-label base level, the locally weighted learning (LWL) algorithm at the single-label meta level, and the BNC K2 algorithm at the single-label base level. Except for RSML, which is a robust meta-algorithm, the other levels were chosen in an arbitrary fashion, although they are also considered strong algorithms in the machine learning literature.
After this initialization step, we choose the next configuration from the MLC search space in the configuration files, relying on this performance model. To do that, the SMAC method is used to select a better configuration. Next, this MLC configuration is evaluated in the MEKA framework and then compared with the best MLC configuration found so far. If the current configuration has a better score than the current best configuration, it is saved and set as the new best configuration. The process then continues by verifying whether the time budget has been reached. If this criterion is met, Auto-MEKA_BO returns the best configuration found so far. Otherwise, the last evaluated MLC configuration is added to the performance model with its corresponding quality value, updating it. The process continues by following these same steps until the time budget expires.
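The loop described above follows the general pattern of sequential model-based optimization. The sketch below shows only that pattern and is not SMAC: the surrogate is a deliberately simple stand-in (the mean score observed so far for the same algorithm) instead of a Random Forest, candidates are drawn at random, an iteration count replaces the time budget, and all names and the toy search space are hypothetical.

```python
import random
from statistics import mean


def surrogate_predict(history, config):
    """Stand-in performance model: mean score seen so far for the same algorithm."""
    same_algorithm = [score for (algorithm, _), score in history if algorithm == config[0]]
    if same_algorithm:
        return mean(same_algorithm)
    return mean(score for _, score in history)  # unseen algorithm: fall back to global mean


def sample_config(search_space, rng):
    """Draw one (algorithm, hyperparameter value) pair at random."""
    algorithm = rng.choice(sorted(search_space))
    return (algorithm, rng.choice(search_space[algorithm]))


def smbo(search_space, evaluate, initial, n_iterations, n_candidates=5, seed=0):
    """Model-guided search: propose, evaluate, keep the best, update the model."""
    rng = random.Random(seed)
    history = [(initial, evaluate(initial))]  # the model starts from a default configuration
    best_config, best_score = history[0]
    for _ in range(n_iterations):
        candidates = [sample_config(search_space, rng) for _ in range(n_candidates)]
        chosen = max(candidates, key=lambda c: surrogate_predict(history, c))
        score = evaluate(chosen)
        if score > best_score:
            best_config, best_score = chosen, score
        history.append((chosen, score))  # update the performance model
    return best_config, best_score


# Hypothetical toy problem: two algorithms, one numeric hyperparameter each.
space = {"BR": [0.1, 0.5], "CC": [0.1, 0.5]}
toy_score = lambda config: {"BR": 0.5, "CC": 0.6}[config[0]] + 0.1 * config[1]
print(smbo(space, toy_score, initial=("CC", 0.1), n_iterations=10))
```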
4.2.4. Auto-MEKA_RS
The AutoMLC random search method (RS) iterates over the predefined MLC search space at random. First, it creates k MLC algorithm configurations, evaluates them and saves the best configuration in terms of the proposed quality measure (see Equation 1) into a list. Next, it creates k new MLC algorithm configurations, evaluates them and saves the best of this iteration into the same list. RS repeats this procedure until the total time budget is reached. At the end, it returns the best MLC algorithm configuration from the list based on the quality measure.
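The random search procedure can be sketched as follows. This is a minimal illustration under stated assumptions: a fixed number of iterations stands in for the time budget, the batch size `k` is the number of configurations created per iteration, and `sample_config` and `evaluate` are hypothetical callables supplied by the caller.

```python
import random


def random_search(sample_config, evaluate, k, n_iterations, seed=0):
    """Per iteration: draw k random configurations and keep the best of the batch."""
    rng = random.Random(seed)
    best_per_iteration = []
    for _ in range(n_iterations):
        batch = [sample_config(rng) for _ in range(k)]
        scored = [(evaluate(config), config) for config in batch]
        best_per_iteration.append(max(scored, key=lambda pair: pair[0]))
    # Final answer: the best entry of the list of per-iteration winners.
    return max(best_per_iteration, key=lambda pair: pair[0])


# Hypothetical toy problem: configurations are integers, the optimum is 42.
toy_sample = lambda rng: rng.randint(0, 100)
toy_score = lambda config: 1.0 - abs(config - 42) / 100.0
print(random_search(toy_sample, toy_score, k=80, n_iterations=5))
```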
4.2.5. Auto-MEKA_GS
The AutoMLC greedy search method (GS) starts by generating an initial random solution (i.e., an MLC algorithm configuration), which is set as the current best. From this solution, we generate others by performing local changes to its representation. We use the aforementioned grammar-based representation for both random and greedy searches. Thus, from the grammar, we generate a derivation tree and employ Whigham’s mutation (McKay2010; Whigham1995) to perform local operations in the respective tree. We evaluate these solutions (see Equation 1) and check if one of them has a better quality score than the current best MLC configuration. If so, we update the best configuration with the one that has the best score. Otherwise, we maintain the best MLC configuration. Next, from the current best configuration, we continue looking at its neighbors to create, evaluate and possibly find new better solutions. This search process continues until the time budget is reached. At the end, the best MLC algorithm configuration found is returned based on the proposed quality measure.
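The greedy search amounts to a hill-climbing loop over neighbors of the current best solution. The sketch below assumes a fixed number of iterations instead of a time budget and a hypothetical `neighbor` callable that plays the role of the local mutation; the number of neighbors per iteration, `k`, and the toy problem are also illustrative.

```python
import random


def greedy_search(initial, neighbor, evaluate, k, n_iterations, seed=0):
    """Hill climbing: move to the best of k neighbors only if it improves the current best."""
    rng = random.Random(seed)
    best, best_score = initial, evaluate(initial)
    for _ in range(n_iterations):
        candidates = [neighbor(best, rng) for _ in range(k)]
        candidate_score, candidate = max(
            ((evaluate(c), c) for c in candidates), key=lambda pair: pair[0]
        )
        if candidate_score > best_score:  # otherwise the current best is kept
            best, best_score = candidate, candidate_score
    return best, best_score


# Hypothetical toy problem: configurations are integers, a local change is +1 or -1.
toy_neighbor = lambda config, rng: config + rng.choice([-1, 1])
toy_score = lambda config: -abs(config - 10)
print(greedy_search(0, toy_neighbor, toy_score, k=4, n_iterations=30))
```

Unlike the random search, every candidate here is derived from the current best solution, which is why the text treats this method as pure exploitation.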
5. Experimental Analysis
This section presents the experimental results of the AutoMLC methods discussed in the previous section. The experiments involve a set of 14 datasets selected from the KDIS (Knowledge and Discovery Systems) repository (the datasets are also available at: http://www.uco.es/kdis/mllresources/), as described in Table 1. These datasets were chosen based on their differences in application domain, the number of instances (n), the number of features (m), the number of labels (q), the label cardinality – the average number of labels associated with each example in the dataset (Card.), the label density – the average number of labels associated with each example divided by the number of labels (Dens.), and the label diversity – the percentage of labelsets in the dataset divided by the number of possible labelsets (Div.).
Datasets  Acronym  n  m  q  Card.  Dens.  Div. 
Bibtex  BTX  7395  1836  159  2.402  0.015  0.386 
Birds  BRD  645  260  19  1.014  0.053  0.206 
CAL500  CAL  502  68  174  26.044  0.150  1.000 
CHD_49  CHD  555  49  6  2.580  0.430  0.531 
Enron  ENR  1702  1001  53  3.378  0.064  0.442 
Flags  FLG  194  19  7  3.392  0.485  0.422 
Genbase  GBS  662  1186  27  1.252  0.046  0.048 
GpositivePseAAC  GPP  519  440  4  1.008  0.252  0.438 
Medical  MED  978  1449  45  1.245  0.028  0.096 
PlantPseAAC  PPA  978  440  12  1.079  0.090  0.033 
Scene  SCN  2407  294  6  1.074  0.179  0.234 
VirusPseAAC  VPA  207  440  6  1.217  0.203  0.266 
Waterquality  WQT  1060  16  14  5.073  0.362  0.778 
Yeast  YST  2417  103  14  4.237  0.303  0.082 
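The three label statistics reported in Table 1 can be computed from a binary label matrix as sketched below. The function name and the toy data are hypothetical. For the diversity, the sketch assumes that the number of possible labelsets is bounded by min(n, 2^q), since a dataset cannot contain more distinct labelsets than it has examples; this normalization is an assumption, not something stated in the text.

```python
def label_statistics(Y):
    """Return (cardinality, density, diversity) of a binary label matrix Y."""
    n, q = len(Y), len(Y[0])
    cardinality = sum(sum(row) for row in Y) / n     # average number of labels per example
    density = cardinality / q                        # cardinality normalized by q
    distinct_labelsets = len({tuple(row) for row in Y})
    diversity = distinct_labelsets / min(n, 2 ** q)  # assumed normalization
    return cardinality, density, diversity


# Hypothetical toy label matrix: 4 examples, 3 labels.
Y = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [1, 1, 0]]
print(label_statistics(Y))
```

As a consistency check on the table itself, dividing the reported cardinality by q reproduces the reported density, e.g., 2.402 / 159 ≈ 0.015 for Bibtex and 3.392 / 7 ≈ 0.485 for Flags.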
The two evolutionary search methods (i.e., Auto-MEKA_GGP and Auto-MEKA_spGGP) were run with 80 individuals evolved considering a time budget of five hours, tournament selection of size two, elitism of one individual, and crossover and mutation probabilities of 0.8 and 0.2, respectively. For these two methods, the learning and validation sets are also resampled from the training set every five generations in order to avoid overfitting. Additionally, we use time and memory budgets for each MLC algorithm (generated by the search methods) of three minutes and 2GB of RAM, respectively. If any MLC algorithm reaches these budgets, it is assigned the lowest fitness, i.e., zero. Furthermore, the following convergence criterion is considered: at each iteration, we check if the best individual has remained the same for over five generations and the search method has run for at least 20 generations. If this happens, we restart the evolutionary process with another pseudo-random seed.
In the case of Auto-MEKA_spGGP, as we have eight species, we specify 10 individuals for each species. We define Auto-MEKA_spGGP’s convergence criteria for each species individually. We set both the intra-species and inter-species crossover probabilities for Auto-MEKA_spGGP to 0.5. On the other hand, Auto-MEKA_BO has kept only the time and memory budgets – i.e., five hours of run time for its search method, and three minutes and 2GB of time and memory budgets for each produced MLC algorithm, respectively. As in the evolutionary methods, the MLC algorithms that reach the time or memory budgets are assigned a fitness of zero.
In order to be fair in the comparisons with the evolutionary methods, we set the value of k equal to 80 for both Auto-MEKA_RS and Auto-MEKA_GS, i.e., the methods that implement random search and greedy search, respectively.
All experiments were run using a stratified five-fold cross-validation procedure (Sechidis2011). This section shows three measures considered when evaluating the results in terms of classification quality: Hamming loss (HL), ranking loss (RL), and the general measure we defined as the fitness/quality criterion (see Equation 1) (we did not present and analyze the results of exact match and F1 macro-averaged by label due to their similar predictive performances to the HL, RL and/or fitness measures). Given the average values – based on the 14 datasets – for all methods on a particular measure, results are evaluated using an adapted Friedman test followed by a Nemenyi post-hoc test with the usual significance level of 0.05 (Demvsar2006).
5.1. Experimental Results
Table 2 shows the average values, the average ranks and the final statistical analysis for HL, RL and fitness measures, respectively.
| Measure | Search Space | Evaluated Result | spGGP | GGP | BO | RS | GS |
|---|---|---|---|---|---|---|---|
| Hamming Loss (HL) | Small | Avg. Values | 0.135 | 0.134 | 0.208 | 0.137 | 0.135 |
| | | Avg. Ranks | 3.143 | 2.250 | 3.714 | 3.107 | 2.786 |
| | | Stat. Comparison | no differences among all methods | | | | |
| | Medium | Avg. Values | 0.157 | 0.136 | 0.134 | 0.139 | 0.142 |
| | | Avg. Ranks | 4.036 | 2.251 | 2.607 | 3.00 | 3.107 |
| | | Stat. Comparison | {GGP} ≻ {spGGP}, no differences among the others | | | | |
| | Large | Avg. Values | 0.137 | 0.134 | 0.130 | 0.143 | 0.139 |
| | | Avg. Ranks | 3.214 | 2.571 | 2.07 | 4.00 | 3.142 |
| | | Stat. Comparison | {BO} ≻ {RS}, no differences among the others | | | | |
| Ranking Loss (RL) | Small | Avg. Values | 0.153 | 0.161 | 0.210 | 0.167 | 0.167 |
| | | Avg. Ranks | 2.321 | 2.214 | 3.750 | 3.429 | 3.286 |
| | | Stat. Comparison | no differences among all methods | | | | |
| | Medium | Avg. Values | 0.146 | 0.150 | 0.152 | 0.156 | 0.148 |
| | | Avg. Ranks | 2.679 | 2.821 | 3.250 | 3.321 | 2.929 |
| | | Stat. Comparison | no differences among all methods | | | | |
| | Large | Avg. Values | 0.147 | 0.135 | 0.149 | 0.157 | 0.140 |
| | | Avg. Ranks | 2.821 | 2.536 | 2.786 | 4.071 | 2.786 |
| | | Stat. Comparison | no differences among all methods | | | | |
| Fitness | Small | Avg. Values | 0.626 | 0.631 | 0.590 | 0.624 | 0.626 |
| | | Avg. Ranks | 2.679 | 1.964 | 3.857 | 3.535 | 2.964 |
| | | Stat. Comparison | {GGP} ≻ {BO}, no differences among the others | | | | |
| | Medium | Avg. Values | 0.634 | 0.639 | 0.626 | 0.629 | 0.645 |
| | | Avg. Ranks | 3.107 | 2.643 | 2.929 | 3.429 | 2.893 |
| | | Stat. Comparison | no differences among all methods | | | | |
| | Large | Avg. Values | 0.632 | 0.635 | 0.635 | 0.626 | 0.632 |
| | | Avg. Ranks | 2.607 | 2.429 | 2.571 | 4.286 | 3.107 |
| | | Stat. Comparison | {spGGP, GGP, BO} ≻ {RS}, no differences among the others | | | | |
The results for the HL measure with five hours, in Table 2, show that AutoMEKA_GGP had the best average value and rank in the search space Small, but only the best average rank in the search space Medium. In Medium, AutoMEKA_BO obtained the best average value, and AutoMEKA_GGP was statistically better than AutoMEKA_spGGP. Nonetheless, there is no indication of statistical difference for the other pairs of methods in the search space Medium, nor for any pair in Small. On the other hand, in the search space Large, AutoMEKA_BO had the best average value and the best average rank. The statistical results of HL in Table 2 show that only AutoMEKA_BO was able to significantly outperform AutoMEKA_RS in the search space Large. In the other cases, we see no indication that the search methods trade off well between exploration and exploitation. Therefore, based on these HL results, we can conclude that the size of the search space has a strong influence on the performance of the search method.
In addition, Table 2 summarizes the results for a measure from a different evaluation context, i.e., RL. Whilst AutoMEKA_spGGP was the search method with the best average value in the search space Small, AutoMEKA_GGP achieved the best average rank in this scenario. In contrast, AutoMEKA_spGGP selected and configured MLC algorithms in a way that produced the best average value and rank in the search space Medium. For the search space Large, AutoMEKA_GGP was the search method with the best average value and the best average rank.
The results of RL did not present any evidence of statistical significance. Hence, the search methods did not differ from each other. As this metric comes from another context, we would like to understand why it presented such a flat result for all search spaces and search methods. We believe that the RL measure is very conservative and is not sensitive to different choices of multi-label classifiers. It only penalizes reversed pairs of labels in the ranking and does not take into account how deep in the ranking the reversed pair occurs, which may make this measure unsuitable for evaluating MLC algorithms in isolation. In future work, it might be interesting to evaluate whether this measure is appropriate to be part of our study or whether we should consider a rank-based measure that takes into account the depth of the ranking.
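The two loss measures can be sketched as follows, using their standard textbook definitions. The exact variants implemented in the evaluation framework may differ, and counting tied scores as reversed pairs is an assumption. The example at the end illustrates the depth insensitivity discussed above: a reversed pair at the top of the ranking and one at the bottom receive the same penalty.

```python
def hamming_loss(true_labels, predicted_labels):
    """Fraction of individual label assignments that are wrong."""
    total = sum(len(truth) for truth in true_labels)
    wrong = sum(t != p
                for truth, pred in zip(true_labels, predicted_labels)
                for t, p in zip(truth, pred))
    return wrong / total


def ranking_loss(true_labels, scores):
    """Average fraction of (relevant, irrelevant) label pairs ranked in reverse.

    Examples with no relevant or no irrelevant labels are skipped.
    Assumption: a tie between a relevant and an irrelevant label counts as reversed.
    """
    per_example = []
    for truth, score in zip(true_labels, scores):
        relevant = [s for t, s in zip(truth, score) if t == 1]
        irrelevant = [s for t, s in zip(truth, score) if t == 0]
        if not relevant or not irrelevant:
            continue
        reversed_pairs = sum(1 for r in relevant for i in irrelevant if r <= i)
        per_example.append(reversed_pairs / (len(relevant) * len(irrelevant)))
    return sum(per_example) / len(per_example) if per_example else 0.0


# One reversed pair at the top of the ranking (irrelevant label ranked first).
top_error = ranking_loss([[1, 0, 0, 0]], [[0.8, 0.9, 0.2, 0.1]])
# One reversed pair at the bottom of the ranking (positions 3 and 4 swapped).
deep_error = ranking_loss([[1, 1, 1, 0]], [[0.9, 0.8, 0.1, 0.2]])
print(top_error, deep_error)
```

A depth-aware ranking measure would penalize the first case more heavily than the second, since the mistake affects the label a user would see first.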
HL and RL are measures that compose the fitness/quality function. We also analyze the results of the fitness measure to have a better assessment of the results of the search methods. In the last part of Table 2, we show the results for this measure. With respect to the search space Small, AutoMEKA_GGP presented the best average value and also the best average rank. Besides, we found statistical evidence that AutoMEKA_GGP has better results than AutoMEKA_BO in this search space. In the search space Medium, we observe that AutoMEKA_GGP produced the best average rank and AutoMEKA_GS presented the best average value. Finally, in the comparison regarding the search space Large, AutoMEKA_spGGP, AutoMEKA_GGP and AutoMEKA_BO achieved significantly better results than AutoMEKA_RS, showing their capability to handle enormous search spaces when given enough time to proceed with their searches. Apart from the statistical results, we can observe in Table 2 that AutoMEKA_GGP also reached the best average value and rank within five hours of running. Furthermore, AutoMEKA_BO matched AutoMEKA_GGP in terms of the average value.
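The significant pairs reported for the fitness measure in the search space Large can be checked against the Nemenyi critical difference, CD = q_α · sqrt(k(k+1) / (6N)). The sketch below uses the average ranks from Table 2 with k = 5 methods and N = 14 datasets, and assumes the standard two-tailed critical value q_0.05 = 2.728 for five methods. Whether the original analysis used exactly this variant is an assumption. Function names are illustrative.

```python
import math


def nemenyi_critical_difference(k, n_datasets, q_alpha):
    """Minimum difference in average rank that counts as significant."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


def significant_pairs(avg_ranks, n_datasets, q_alpha):
    """Return (better, worse) pairs whose rank gap exceeds the critical difference."""
    cd = nemenyi_critical_difference(len(avg_ranks), n_datasets, q_alpha)
    names = sorted(avg_ranks, key=avg_ranks.get)  # best (lowest) rank first
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if avg_ranks[b] - avg_ranks[a] > cd]


# Average fitness ranks for the search space Large, copied from Table 2.
fitness_large = {"spGGP": 2.607, "GGP": 2.429, "BO": 2.571, "RS": 4.286, "GS": 3.107}
Q_ALPHA_005_K5 = 2.728  # assumed Nemenyi critical value for k = 5, alpha = 0.05

cd = nemenyi_critical_difference(5, 14, Q_ALPHA_005_K5)
pairs = significant_pairs(fitness_large, 14, Q_ALPHA_005_K5)
print(round(cd, 3), pairs)
```

Under these assumptions the critical difference is roughly 1.63, and the only gaps exceeding it are those between RS and each of GGP, BO and spGGP, which agrees with the comparison listed in Table 2.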
We can now provide overall conclusions based on the results of this section, especially on the fitness measure. We believe that the size of the search spaces explored/exploited by the search methods influenced the accuracy of the MLC predictions. We understand that, for smaller search spaces (i.e., Small and Medium), the search methods find it easier to proceed with their searches and, hence, their results become broadly similar. When we increase the search space to Large, only those with robust search mechanisms could deal better with the trade-off between exploration and exploitation, which allowed them to achieve competitive results.
However, we believe the robust methods (i.e., those based on the evolutionary and Bayesian optimization frameworks) can still improve their final predictive performances, as they could not beat the greedy search method, a pure-exploitation method. They also face challenges to beat the random search method in some of the analyzed cases. Thus, these search methods still could not satisfactorily balance exploration and exploitation. In our perspective, a satisfactory balance would mean beating AutoMEKA_RS and AutoMEKA_GS simultaneously. For the search spaces Small and Medium, this result is more understandable. The smaller the search space, the easier it is to perform the search on it. This leads to better results for AutoMEKA_RS and AutoMEKA_GS, to the point that the other search methods were not statistically different from either of them. For bigger search spaces (i.e., Large), the opposite should hold: as the robust methods have more elaborate search mechanisms, they should obtain statistically better results than the pure-exploration and pure-exploitation search methods.
5.2. Analysis of the Diversity of the Selected Algorithms
We also analyzed the diversity of the MLC algorithms and meta-algorithms selected by the five AutoMLC search methods. We focus only on the selected MLC algorithms and meta-algorithms (which are the “macro-components”), and not on their selected SLC algorithms and hyperparameter settings (the “micro-components”), to simplify our analysis.
It is important to emphasize that, by analyzing the MLC and SLC algorithms and meta-algorithms selected by all search methods in each search space, we can better understand the results of Table 2. This would give an idea of how the choice of an MLC algorithm influences the performance of the search methods. However, for the sake of simplicity, we perform this analysis only for the search space Large.
We present in Figures 2(b) through 2(e) the bar plots to analyze the relative frequency of selection of MLC algorithms for the AutoMLC search methods. In these figures we have, for each MLC algorithm, a bar representing the average relative frequency of selection of an algorithm type over all the 70 runs: 14 datasets times five independent runs per dataset (five cross-validation folds times one run per fold). We consider two cases: (i) when the traditional MLC algorithm is selected on its own; (ii) when the traditional MLC algorithm is selected together with an MLC meta-algorithm. To emphasize these two cases, the bar for each traditional MLC algorithm is divided into two parts, with sizes proportional to the relative frequency of selection as a standalone algorithm (in gray) and the relative frequency of selection as part of a meta-algorithm (in white).
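The two-part bars described above correspond to a simple count over the 70 runs. The sketch below shows one way to compute the standalone and meta-algorithm shares for each traditional MLC algorithm. The run records are made up for illustration, and `MetaA` is a hypothetical placeholder for a meta-algorithm name.

```python
from collections import Counter


def selection_frequencies(runs):
    """Relative selection frequency of each MLC algorithm, split into two parts.

    runs is a list of (mlc_algorithm, meta_algorithm) pairs, one per run,
    where meta_algorithm is None when the algorithm was selected on its own.
    Returns {algorithm: (standalone_share, with_meta_share)}.
    """
    n_runs = len(runs)
    standalone = Counter(alg for alg, meta in runs if meta is None)
    with_meta = Counter(alg for alg, meta in runs if meta is not None)
    algorithms = sorted(set(standalone) | set(with_meta))
    return {alg: (standalone[alg] / n_runs, with_meta[alg] / n_runs)
            for alg in algorithms}


# Made-up selections for 70 runs (14 datasets x 5 folds); not the paper's data.
example_runs = ([("BR", None)] * 10
                + [("BR", "MetaA")] * 5
                + [("PSt", None)] * 55)
frequencies = selection_frequencies(example_runs)
for algorithm, (alone, meta) in frequencies.items():
    print(f"{algorithm}: standalone={alone:.2%}, with meta-algorithm={meta:.2%}")
```

The two shares of an algorithm sum to its total bar height, and the heights of all bars sum to one for a given search method.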
Considering this information, BR, PSt and RT were the traditional MLC algorithms most frequently selected by all AutoMLC search methods in the search space Large. BR was chosen, on average, in 21.43% of all runs for all methods. PSt and RT, in turn, were selected on average in 20.66% and 13.43% of all runs for the five evaluated search methods, respectively. Nevertheless, some of these MLC algorithms were not so present in the selections performed by the search methods. For instance, BR and RT were not frequently chosen by AutoMEKA and AutoMEKA. This partially shows the differences in the selection and configuration of the AutoMLC search methods, although most of them had similar algorithms at the top five regarding the ranking of selection.
We can also justify the performance of the methods based on their selection at the MLC meta level. For example, AutoMEKA achieved the best results for the search space Large in terms of the average value and rank based on fitness, which is the measure we use to decide (for all methods) what algorithm is the most appropriate. By looking at AutoMEKA’s selection at the MLC meta level we can understand why this happened. AutoMEKA and AutoMEKA have chosen these MLC metaalgorithms with a low relative frequency (22.86% for both methods). Therefore, the complexity of the final solution turned them into better options for AutoMLC in the MLC context when contrasted to AutoMEKA, which selected metaalgorithms in 50% of the cases. However, their level of selection of MLC metaalgorithms is still high when compared to AutoMEKA and AutoMEKA, which selected MLC metaalgorithms in only 10% of the cases. This might be the reason for the competitiveness of AutoMEKA and AutoMEKA.
One test that might be interesting is to remove the MLC meta-algorithms from the search space and re-execute the search methods. This could show us whether or not the learned model is more likely to overfit the training set when we select very complex combinations of base and meta-algorithms. We did that in the search spaces Small and Medium, but they do not include all traditional MLC algorithms as the search space Large does.
6. Conclusions
This paper presented an overall comparison among five AutoML search methods in the context of multi-label classification, i.e., AutoMEKA_spGGP, AutoMEKA_GGP, AutoMEKA_BO, AutoMEKA_RS and AutoMEKA_GS. To perform this assessment, the search methods were run on 14 MLC datasets with the same execution time budget (i.e., five hours) and in three designed search spaces.
The experimental results indicate that AutoMEKA_GGP is so far the best search method, as it yields the best predictive results. Besides, it is the only method to be statistically better than AutoMEKA_spGGP, AutoMEKA_BO and AutoMEKA_RS in different cases.
However, we expected that methods with robust search mechanisms (e.g., AutoMEKA_spGGP, AutoMEKA_GGP and AutoMEKA_BO) could balance exploration and exploitation better. Because of this limitation, these methods were not able to produce statistically better results than both AutoMEKA_RS and AutoMEKA_GS, which are pure-exploration and pure-exploitation methods, respectively.
We also observed that the size of the search space is a crucial issue for the AutoML methods’ behavior. Thus, as a first future work, we intend to better understand the trade-off between parsimony and sufficiency (Banzhaf1998). In other words, we would like to investigate which algorithms should be included in or excluded from the search spaces, in order to keep good (combinations of) learning algorithms.
In addition, as two out of the five measures (i.e., FM and RL) yielded flat results, we need to understand how neutral the search spaces are in terms of the chosen MLC performance measures (Pitzer2012; Malan2013). This would help us gain insights for proposing efficient methods or enhancements to these search spaces.
Furthermore, we expect to test other quality criteria to discover appropriate MLC algorithm configurations for a given dataset of interest. This may include finding other relevant performance measures or setting new weights for the current measures that compose the proposed fitness metric.
Finally, we also plan to include in the proposed search methods a bi-level optimization approach (Talbi2013) to reduce the hardness of the search in huge search spaces. Fundamentally, this would mean selecting the learning algorithms first and only configuring their hyperparameters in a second step of the search procedure.
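A minimal sketch of the two-step idea is given below, under stated assumptions: the first level picks an algorithm using default hyperparameters, and the second level tunes only the chosen algorithm with an exhaustive grid. The grid search, the toy evaluation function and all names are illustrative placeholders, not the planned method.

```python
import itertools


def bilevel_search(search_space, defaults, evaluate):
    """Select an algorithm first, then configure only its hyperparameters.

    search_space maps algorithm name -> {hyperparameter: list of candidate values}.
    defaults maps algorithm name -> default configuration (dict).
    evaluate(algorithm, config) returns a quality score, higher is better.
    """
    # Level 1: choose the algorithm using default hyperparameters only.
    best_algorithm = max(search_space, key=lambda alg: evaluate(alg, defaults[alg]))

    # Level 2: exhaustive grid over the chosen algorithm's hyperparameters.
    grid = search_space[best_algorithm]
    names = sorted(grid)
    candidates = [dict(zip(names, values))
                  for values in itertools.product(*(grid[name] for name in names))]
    best_config = max(candidates, key=lambda cfg: evaluate(best_algorithm, cfg))
    return best_algorithm, best_config


def toy_evaluate(algorithm, config):
    """Toy stand-in for a cross-validated fitness; not a real learner."""
    if algorithm == "A":
        return 0.5 + 0.1 * config["x"]
    return 0.4 + 0.3 * config["y"]


toy_space = {"A": {"x": [1, 2, 3]}, "B": {"y": [0, 1]}}
toy_defaults = {"A": {"x": 1}, "B": {"y": 0}}
result = bilevel_search(toy_space, toy_defaults, toy_evaluate)
print(result)
```

The split reduces the number of evaluated configurations from the product of all choices to roughly their sum, at the cost of possibly discarding an algorithm that only performs well after tuning.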
Acknowledgements
The authors would like to thank FAPEMIG (through the grant no. CEXPPM0009817), MPMG (through the project Analytical Capabilities), CNPq (through the grant no. 310833/20191), CAPES, MCTIC/RNP (through the grant no. 51119) and H2020 (through the grant no. 777154) for their partial financial support.