1 Introduction
Every day, massive amounts of data are collected and processed by computers and embedded devices. This data, however, is useless to the people and organizations collecting it, unless it can be properly processed and converted into actionable knowledge. Machine Learning (ML)
(Murphy, 2012) techniques are especially useful in such domains, where automatic extraction of knowledge from data is required.
One of the most common and extensively studied knowledge extraction tasks is classification. In traditional classification problems, data samples are associated with a single category, termed class, that may have two or more possible values. For example, the outlook for tomorrow's weather may be 'sunny', 'overcast' or 'rainy'. On the other hand, multi-label classification, which is the focus of our current investigation, involves problems where each sample is associated with one or more binary categories, termed labels. (Multi-label classification can be viewed as a particular case of the multi-dimensional problem (Read et al., 2014), where the goal is to assign each data sample to multiple multi-valued, in contrast to binary, classes.) For example, a newspaper article about climate change can be described by both tags 'environment' and 'politics'; or a patient can simultaneously be diagnosed with 'high blood pressure', 'diabetes' and 'myopia'.
Although single-label classification problems have been thoroughly explored, with the aid of various ML algorithms, literature on multi-label classification is far less abundant. Multi-label classification problems are, however, by no means less natural or intuitive and are, in fact, very common in real life. The fact that, until recently, only a few of the corresponding problems were tackled as multi-label is mainly due to computational limitations. Recent research (see Tsoumakas et al. (2010) and Read (2010) for overviews) and modern hardware, though, have made multi-label classification more affordable. A gradually increasing number of problems are now being tackled as multi-label, allowing for richer and more accurate knowledge mining in real-world domains, such as medical diagnosis, protein function prediction and semantic scene processing.
A careful inspection of the corresponding literature reveals that, although multi-label classification is nowadays a widely popularized task, Evolutionary Computation (EC) approaches to prediction model induction are very sparse. The few approaches that exist
(Vallim et al., 2008, 2009; Ahmadi Abhari et al., 2011) explore the use of Michigan-style Learning Classifier Systems (LCS) (Holland, 1975) – a Genetics-based ML method that combines EC and reinforcement (Wilson, 1995) or supervised learning (Bernadó-Mansilla and Garrell-Guiu, 2003; Orriols-Puig and Bernadó-Mansilla, 2008) – but report promising results only on small artificial and real-world problems. Although these approaches lack an extensive experimental evaluation, both in terms of target multi-label classification problems and rival algorithm variety, they are still based on a valid premise. This premise actually also summarizes the motivation of our current work: LCS, due to their inherent characteristics, are naturally suited to multi-label classification and can provide an effective alternative in problem domains where highly expressive, human-readable knowledge needs to be extracted, while maintaining low inference complexity.
Indeed, in recent years, LCS have been modified for data mining (Bull et al., 2008) and single-step classification problems, notably in the UCS (Orriols-Puig and Bernadó-Mansilla, 2008) and the SS-LCS frameworks (Tzima et al., 2012; Tzima and Mitkas, 2013). Their niche-based update and their overall iterative (rather than batch) learning approach have been shown to be very efficient in domains where different problem niches occur (including multi-class and unbalanced classification problems). Thus, we believe that this approach will also allow them to tackle the multiple and often very specific niches that comprise the search space of multi-label classification problems.
Moreover, LCS may provide a practical alternative to deterministic methods, when exhaustive search is intractable (for example, in multi-label classification problems with large numbers of labels and/or attributes) or, in general, when targeting problems with large, complex and diverse search spaces. In such cases, the global search capability of EC, combined with the local search ability of reinforcement learning, allows LCS to evolve flexible, distributed solutions, wherein discovered patterns are spread over a population of (individual or groups of) rules, each modeling a niche of the problem space
(Urbanowicz and Moore, 2009).
LCS are also model-free and, thus, do not make any assumptions about target data (e.g. number, types and dependencies among attributes, missing data, distribution of training instances in terms of the target categories). This allows them to identify all kinds of relationships – including epistatic ones that are characteristic of multi-label domains – both between the feature and label space and among the various labels.
Finally, as already mentioned, the nature of the knowledge representation evolved by LCS is a great advantage in certain application domains, where rule comprehensibility is an important requirement. At this point it should be noted that Michigan-style LCS, although implicitly geared towards maximally accurate and general rules, tend to evolve rather large populations, mainly due to the distributed nature of the evolved solutions and the retention of inexperienced rules created by the system's exploration component. Ruleset compaction techniques are, though, available to reduce the number of rules in the final models and enhance their comprehensibility.
Overall, the aim of our current work (which builds on previous research presented in Allamanis et al. (2013)) is to develop an effective LCS algorithm for multi-label classification. In this direction, we employ a general supervised learning framework and extend it, to render it directly applicable to the corresponding problems, without the need for any problem transformation. More specifically, we adapt three major components of the traditional LCS architecture: (i) the Rule Representation, to allow for rule consequents that include multiple labels; (ii) the Update Component, to consider multiple correct labels in rule parameter updates; and (iii) the Performance Component, to enable inference in multi-label settings where multiple concurrent decisions are required.
The aforementioned extensions implicitly define the structure and main contributions of the paper, which are detailed after a brief presentation of the relevant background (Section 2). Our current work's main contributions are:

a generalized multi-label rule format (Section 3) that has several distinct advantages over those used in other multi-label classification methods;

a multi-label Learning Classifier System (Section 4), named the Multi-label Supervised Learning Classifier System (MlS-LCS), whose components allow for efficient and accurate multi-label classification through the development of expressive multi-label rulesets; and

an experimental evaluation (Section 5) of our proposed LCS approach, against other state-of-the-art algorithms on widely used datasets, that validates its potential.
Section 6 restates our overall contributions, outlines future research directions and concludes this work with additional insights on the potential of the proposed algorithm.
2 Background
2.1 Multi-label Classification
Multi-label classification is a generalization of traditional classification where each sample is associated with a set of mutually non-exclusive binary categories, or labels, drawn from a label set $L$. Thus, defining the problem from a machine learning point of view, a multi-label classification model approximates a function $h: X \to \mathcal{P}(L)$, where $X$ is the feature space and $\mathcal{P}(L)$ is the powerset of the label space (i.e., the powerset of the set $L$ of all possible labels).
The general multi-label classification framework, by definition, implies the existence of an additional dimension: that of the multiple labels which data samples can be associated with. This additional complexity affects not only the learning processes that can be applied to the corresponding problems, but also the procedures employed during the evaluation of developed models (Tsoumakas et al., 2010).
The basic premise that differentiates learning, with respect to the single-class case, is that, to provide more accurate predictions, label correlations should be factored into multi-label classification models. This need is based on the observation that labels occur together with different frequencies. For example, a newspaper article is far more likely to be assigned the pair of tags 'science' and 'environment' than the pair 'environment' and 'sports'. Of course, in the absence of label correlations, the corresponding multi-label problem is trivial and can be completely broken down (without any loss of useful information) into independent per-label binary decision problems.
There are three main approaches to tackling multi-label classification problems in the literature: problem transformation, algorithm transformation (such as the LCS approach presented in this paper) and ensemble methods.
Problem Transformation methods transform a multi-label classification problem into a set of single-label ones. Various such transformations have been proposed, involving different trade-offs between training time and label correlation representation. The simplest of all transformations is the Binary Relevance (BR) method (Tsoumakas and Katakis, 2007), to which the Classifier Chains (CC) method (Read et al., 2009) is closely related. Other transformations found in the literature are Ranking by Pairwise Comparison (RPC) (Hüllermeier et al., 2008) and the Label Powerset (LP) method that has been the focus of several studies, including the Pruned Problem Transformation (PPT) (Read, 2008) and HOMER (Tsoumakas et al., 2008).
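To make the BR transformation concrete, the following minimal Python sketch (the function and the toy dataset are our own illustrations, not part of any of the cited libraries) derives one binary training set per label:

```python
def binary_relevance_split(instances):
    """Transform a multi-label dataset into one binary dataset per label.

    `instances` is a list of (features, label_set) pairs, where `label_set`
    is the set of labels relevant to that instance (illustrative format).
    """
    all_labels = sorted(set().union(*(y for _, y in instances)))
    # One single-label (binary) training set per label: each instance's
    # target is 1 if the label is relevant to it and 0 otherwise.
    return {
        label: [(x, int(label in y)) for x, y in instances]
        for label in all_labels
    }

data = [((0.2, 0.7), {"environment", "politics"}),
        ((0.9, 0.1), {"sports"})]
per_label = binary_relevance_split(data)
# per_label["politics"] pairs each feature vector with a 0/1 target
```

Any single-label learner can then be trained on each of the resulting binary datasets independently, which is exactly why BR cannot capture label correlations.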
Algorithm Transformation methods adapt learning algorithms to directly handle multi-label data. Such methods include: (a) several multi-label variants (Ml-kNN) of the popular k-Nearest Neighbors lazy learning algorithm (Zhang and Zhou, 2007), as well as hybrid methods combining logistic regression and k-Nearest Neighbors (Cheng and Hüllermeier, 2009); (b) multi-label decision trees, such as ML-C4.5 (Clare and King, 2001) and predictive clustering trees (PCTs) (Vens et al., 2008); (c) Adaboost.MH and Adaboost.MR (Schapire and Singer, 2000), two extensions of AdaBoost for multi-label learning; (d) several neural network approaches (Crammer and Singer, 2003; Zhang and Zhou, 2006); (e) the Bayesian Networks approach by Zhang and Zhang (2010); (f) the SVM-based ranking approach by Elisseeff and Weston (2005); and (g) the associative classification approach of MMAC (Thabtah et al., 2004).
Ensemble methods are developed on top of methods of the two previous categories. The three most well-known ensemble methods employing problem transformations as their base classifiers are RAkEL (Tsoumakas et al., 2011a), ensembles of pruned sets (EPS) (Read et al., 2008) and ensembles of classifier chains (ECC) (Read et al., 2009). On the other hand, an example of an ensemble method where the base classifier is an algorithm adaptation method (i.e., one that provides multi-label predictions) can be found in Kocev (2011), where ensembles of predictive clustering trees (PCTs) are presented.
As far as the evaluation of multi-label classifiers is concerned, several traditional evaluation metrics can be used, provided that they are properly modified. The specific metrics employed in our current study for algorithm comparisons are Accuracy, Exact Match (Subset Accuracy) and Hamming Loss. In what follows, these metrics are defined for a dataset $D$, consisting of $N$ multi-label instances of the form $(x_i, Y_i)$, where $Y_i \subseteq L$ ($i = 1 \ldots N$), $L$ is the set of all possible labels and $h$ is a prediction function.
Accuracy is defined as the mean, over all instances, ratio of the size of the intersection and union sets of actual and predicted labels. It is, thus, a labelset-based metric, defined as:
$$Accuracy = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap h(x_i)|}{|Y_i \cup h(x_i)|} \quad (1)$$
Exact Match (Subset Accuracy) is a simple and relatively strict evaluation metric, calculated as the labelset-based accuracy:
$$Exact\ Match = \frac{|D_c|}{N} \quad (2)$$
where $D_c$ is the set of correctly classified instances, for which $h(x_i) = Y_i$.
Hamming Loss corresponds to the label-based accuracy, taking into account false positive and false negative predictions, and is defined as:
$$Hamming\ Loss = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \,\Delta\, h(x_i)|}{|L|} \quad (3)$$
where $\Delta$ is the symmetric difference (logical XOR) between $Y_i$ and $h(x_i)$.
2.2 Learning Classifier Systems
Learning Classifier Systems (LCS) (Holland, 1975) are an evolutionary approach to supervised and reinforcement learning problems. Several flavors of LCS exist in the literature (Urbanowicz and Moore, 2009), with most of them following the "Michigan approach", such as (a) the strength-based ZCS (Wilson, 1994; Tzima and Mitkas, 2008) and SB-XCS (Kovacs, 2002a, b); and (b) the accuracy-based XCS (Wilson, 1995) and UCS (Bernadó-Mansilla and Garrell-Guiu, 2003; Orriols-Puig and Bernadó-Mansilla, 2008). Accuracy-based systems have been the most popular so far for solving a wide range of problem types (Bull et al., 2008) – such as classification (Butz et al., 2004; Orriols-Puig et al., 2009a; Fernández et al., 2010), regression (Wilson, 2002; Butz et al., 2008; Stalph et al., 2012), sequential decision making (Butz et al., 2005; Lanzi et al., 2006), and sequence labeling (Nakata et al., 2014, 2015) – in a wide range of application domains – such as medical diagnosis (Kharbat et al., 2007), fraud detection (Behdad et al., 2012) and robot arm control (Kneissler et al., 2014).
Given that multi-label classification is a supervised task, we chose to tackle the corresponding problems using supervised (Michigan-style) LCS. Such LCS maintain a cooperative population of condition-decision rules, termed classifiers, and combine supervised learning with a genetic algorithm (GA). The GA works on classifier conditions in an effort to adequately decompose the target problem into a set of subproblems, while supervised learning evaluates classifiers in each of them
(Lanzi, 2008). The most prominent example of this class of systems is the accuracy-based UCS algorithm (Bernadó-Mansilla and Garrell-Guiu, 2003; Orriols-Puig and Bernadó-Mansilla, 2008). Additionally, we have recently introduced SS-LCS, a supervised strength-based LCS that provides an efficient and robust alternative for offline classification tasks (Tzima et al., 2012; Tzima and Mitkas, 2013) by extending previous strength-based frameworks (Wilson, 1994; Kovacs, 2002a, b).
To sum up, our current investigation focuses on developing a supervised accuracy-based Michigan-style LCS for multi-label classification by extending the base architecture of UCS and incorporating the clustering-based initialization component of SS-LCS. It also builds on our research presented in Allamanis et al. (2013), from which the main differences are: (a) the multi-label crossover operator (Section 4.4); (b) the modified deletion scheme and the population control strategy (Section 4.5); (c) the clustering-based initialization process (Section 4.6); and, more importantly, (d) the extensive experimental investigation of the proposed algorithm, both in terms of target problems and rival algorithms (Section 5). The last point also addresses the main shortcoming of existing multi-label LCS approaches (Vallim et al., 2008, 2009; Ahmadi Abhari et al., 2011), namely the absence of empirical evidence on their potential for multi-label classification in real-world settings.
2.3 Rule Representation in LCS
LCS were initially designed with a ternary representation: rules involved conditions represented as fixed-length bitstrings defined over the alphabet {0, 1, #} and numeric actions. To deal with the continuous attributes often present in real-world classification problems, however, interval-based rule representations were later introduced, starting with Wilson's min-max representation that codifies continuous attribute conditions using the lower and upper limit of the acceptable interval of values. When using this representation, invalid intervals (where the lower limit exceeds the upper one) – and, thus, impossible conditions – may be produced by the genetic operators. A simple approach to fixing this problem was proposed in Stone and Bull (2003), which introduced the unordered-bound representation – the most popular representation used for continuous attributes in LCS in the last few years. The unordered-bound representation proposes the use of interval limits without explicitly specifying which is the upper and which the lower bound: the smaller of the two limit values is considered to be the interval's lower bound, while the larger is the upper bound. The unordered-bound representation is our representation of choice for continuous attributes in our current work.
Other than interval-based ones, several other rule representations have been introduced for LCS (mainly XCS and UCS) during the last few years. These representations aim to enable LCS to deal with function approximation (Wilson, 2002) and real-world problems, and include hyperellipsoidal representations (Butz et al., 2008), convex hulls (Lanzi and Wilson, 2006) and tile coding (Lanzi et al., 2006). Other, more general approaches used to codify rules are neural networks (Bull and O'Hara, 2002), messy representations (Lanzi, 1999) and S-expressions (Lanzi and Perrucci, 1999), fuzzy representations (Orriols-Puig et al., 2009b), genetic-programming-like encoding schemes involving code fragments in classifier conditions (Iqbal et al., 2014), and dynamical genetic programming (Preen and Bull, 2013).
3 Rules for Multi-label Classification
To tackle multi-label classification problems with rule-based methods, and thus also with LCS, we need an expressive rule format, able to capture correlations both between the feature and label space and among the various labels. In this Section, we introduce a rule format that possesses these properties and forms the basis of our proposed multi-label LCS, detailed in Section 4. In the last part of the Section, we also describe some "internal" rule representation issues.
3.1 Generalized Multi-label Rule Representation
Single-label classification rules traditionally follow the production system (or "if-then") form $C \Rightarrow P$, where the rule's condition $C$ comprises a conjunction of tests on attribute values and its consequent $P$ contains a single value from the target classification problem's set of possible categories (or classes). It is also worth noting that, for zero-order rules, the condition comprises $n$ ($n \geq 1$) tests of the form $(a_i \; op \; v_i)$,
wherein $a_i$ is one of the problem's attributes, $op$ is an operator, and $v_i$ is a constant set, number or range of numbers.
It is evident that rules following the form described above are not able to readily handle multi-label classifications. To alleviate this shortcoming, we introduce a modification to the rule consequent part, such that, for any given multi-label rule $r$, the consequent part takes the form:
$$(l_{p_1} = v_1) \wedge (l_{p_2} = v_2) \wedge \ldots \wedge (l_{p_k} = v_k) \quad (4)$$
where $l_{p_j}$ is one of the problem's possible labels ($1 \leq j \leq k$, $k \leq |L|$), taking either the value $v_j = 1$ for labels advocated by rule $r$, or the value $v_j = 0$ in the opposite case.
According to Eq. 4, the consequent part of a rule following our proposed Generalized Multi-label Representation includes both the labels that the rule advocates for ($v_j = 1$) and the labels it is opposed to ($v_j = 0$). It should be noted that (i) no label can appear more than once in the rule consequent part ($p_i \neq p_j$ for $i \neq j$) and (ii) rules are allowed to "not care" about certain labels, which are, thus, absent from the rule consequent ($k \leq |L|$). In other words, the proposed rule format has the important property of being able to map rule conditions to arbitrary subspaces of the label space.
An abbreviated notation for rule consequent parts can be derived by using the ternary alphabet {0, 1, #} and substituting "1" for advocated labels, "0" for labels the rule is opposed to and "#" for "don't cares". Thus, in a problem with three labels, a rule advocating the first label, being indifferent about the second and opposed to the third is denoted as: 1#0.
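A minimal helper illustrates how the ternary consequent notation maps to concrete label decisions (the function is our own illustration, not part of the proposed system):

```python
# A rule consequent over three labels in the ternary notation of
# Section 3.1: '1' = advocated, '0' = opposed, '#' = "don't care".
consequent = "1#0"

def decided_labels(consequent):
    """Map a ternary consequent string to {label index: 0/1 decision},
    omitting the "don't care" positions (illustrative helper)."""
    return {i: int(c) for i, c in enumerate(consequent) if c != "#"}

decided_labels("1#0")   # decides labels 0 and 2, says nothing about label 1
```

The "don't care" positions are exactly what lets a single condition be correlated with an arbitrary subset of the label space.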
Rules following our proposed Generalized Multi-label Representation have some unique properties. First, rules are easy to interpret, rendering the discovered knowledge (rulesets) equally usable by both humans and computers. Such a property is important in cases where providing useful insights to domain experts is amongst the modelers' goals.
Furthermore, rules have a flexible label-correlation representation. Algorithms inducing generalized multi-label rules do not require explicit knowledge of which label correlations to search for and can variably correlate the maximum possible number of labels to any given condition. Therefore, in contrast to problem transformation methods that need to explicitly create (at least) one model for each possible label correlation/combination being searched for, algorithms inducing generalized multi-label rules can approach all possible spectra between the BR (not looking into any label correlations) and LP (searching for all possible label combinations) transformations and simultaneously create the most compact rule representation of the problem space, with no redundancy.
Consider, for example, the (artificial) problem toyx with 6 binary attributes and 4 labels, where the first two labels only depend on the first two attributes, according to the following rules (these are actually the rules, without default hierarchies, defining the artificial problem studied in Vallim et al. (2009)):
(5) 
and the last two labels always have exactly the same values as the last two attributes. The shortest complete solution (SCS) (i.e., the solution containing the smallest possible number of rules that allow for specific decisions to be made for all labels of all data samples), given our generalized rule format, involves 7 rules: the 3 rules in Eq. 5, plus one of the following alternative rulesets.
If we do not use the generalized rule format, we are bound to induce rules with all-specific consequents that are not allowed to "don't care" about any of the problem's labels. This would be equivalent to the LP transformation, creating rules for each possible label combination, and would result in (at least) 12 rules for our current example – i.e., the combinations of each of the first 3 rules with each of the 4 rules in set A. (For this particular problem, we need 12 and not $2^4 = 16$ rules, as some label combinations are missing from the training dataset and, thus, no model would need to be built for them.)
3.2 Rule Representation in Chromosomes
Rules in MlS-LCS, not unlike traditional LCS, are mapped to chromosomes – consisting of 1s and 0s – to be used in the GA. Our approach universally employs an activation bit, indicating whether a test for a specific attribute's values is active or inactive (#), irrespective of the attribute type. Thus, binary attributes and labels are represented using two bits: the first bit represents the activation status of the corresponding test and the second bit represents the target (attribute or label) value. Nominal attributes are represented by $n + 1$ bits, where $n$ is the number of the attribute's possible values. For continuous attributes we employ the "unordered-bound representation" (Stone and Bull, 2003), defining an acceptable range of values for an attribute through two bounds $l$ and $u$, such that $l \leq u$. The two threshold values are represented by binary numbers discretized in the range $[min_i, max_i]$, where $min_i$ ($max_i$) is the lowest (highest) possible value for attribute $i$. The number of bits used in this representation is $2m + 1$, where $m$ determines the quantization levels ($2^m$) and the additional bit is the activation bit.
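As an illustration of the continuous-attribute encoding, the following Python sketch quantizes the two bounds of an unordered-bound interval test and prepends the activation bit. The helper name and the rounding choice are our own assumptions, not the exact implementation:

```python
def encode_interval(lo, hi, min_v, max_v, m, active=True):
    """Encode an unordered-bound interval test on one continuous attribute
    as a bitstring: 1 activation bit plus two m-bit quantized bounds
    (a sketch of the Section 3.2 scheme; details are assumptions)."""
    levels = 2 ** m - 1
    def quantize(v):
        # Map v in [min_v, max_v] to an m-bit integer code.
        code = round((v - min_v) / (max_v - min_v) * levels)
        return format(code, f"0{m}b")
    # The smaller decoded bound is treated as the lower limit, so the
    # order of the two encoded values does not matter (unordered bounds).
    return ("1" if active else "0") + quantize(lo) + quantize(hi)

encode_interval(0.25, 0.75, 0.0, 1.0, m=4)   # 1 + 4 + 4 = 9 bits in total
```

With $m = 4$ bits per bound, each continuous attribute thus contributes $2m + 1 = 9$ bits to the chromosome.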
4 LCS for Multi-label Classification
As already mentioned, the scope of our current work comprises offline multi-label classification problems – that is, classification problems that can be described by a collection of data and do not involve online interactions. We tackle these problems using Michigan-style supervised LCS.
Such LCS have been successfully used for evolving rulesets in single-label classification domains. In these cases, evolved rulesets comprise cooperative rules that collectively solve the target problem, while they are also required to be maximally compact, i.e., to contain the minimum number of rules necessary for solving the problem. Equivalently, all rules need to have maximally general conditions, that is, the greatest possible feature space coverage. Additionally, a ruleset is considered an effective solution if it contains rules that are adequately correct, with respect to a specific performance/correctness metric.
While all the aforementioned properties are also desirable in generalized multi-label rulesets (i.e., rulesets comprising generalized multi-label rules, as described in Section 3.1), there is an additional important requirement: these rulesets also need to exhaustively cover the label space. In other words, rules in a multi-label ruleset should collectively be able to decide about all labels for every instance. This latter desirable property, together with the compactness requirement, indicates that multi-label rules should ideally have maximally general conditions and combine them with the corresponding maximally specific consequents.
Consider, for example, the following two rules for the toyx problem:
(6) 
Both rules are perfectly accurate (for the labels for which they provide concrete decisions), but the first rule is clearly preferable, correlating the (common) condition with a larger part of the label space and, thus, promoting solution compactness.
Overall, it is evident that algorithms building rulesets for multi-label classification problems need to consider the trade-off between condition generalization, consequent specialization and rule correctness. In an LCS setting, this means that the core learning and performance procedures need to be appropriately modified to effectively cope with multi-label problems. Thus, translating the aforementioned desirable properties of multi-label rulesets into concrete design choices towards formulating our proposed multi-label LCS algorithm, we derive the following requirements for its components:

the Performance Component, which is responsible for using the developed rules to classify previously unseen samples, needs to be modified to enable effective inference based on (generalized) multi-label rules;

the Update Component, which is responsible for updating rule-specific parameters, such as accuracy and fitness, needs an appropriate metric of rule quality, taking into account that generalized multi-label rules make decisions over subsets of labels that may, additionally, be only partially correct;

the Discovery Component, which explores the search space and produces new rules through a steady-state GA, needs to focus on evolving multi-label rulesets that are accurate, complete and cover both the feature and label space. Subsumption conditions, controlling rule "absorption", also need to be adapted, so as to favor rules with more general conditions and more specific consequents.
These substantial adaptations to the general LCS framework essentially define the proposed MlS-LCS algorithm and are presented in detail in the following Sections.
4.1 The Training Cycle of the Multi-label Supervised Learning Classifier System (MlS-LCS)
MlS-LCS employs a population of gradually evolving, cooperative classifiers (rules) that collectively form the solution to the target classification task, by each encoding a fraction (niche) of the problem domain. Associated with each classifier are a number of parameters:

the numerosity, i.e., the number of classifier copies (or micro-classifiers) currently present in the ruleset;

the correct set size
that estimates the average size of the correct sets the classifier has participated in;

the time step of the last occurrence of a GA in a correct set the classifier has belonged to;

the experience that is measured as the classifier’s number of appearances in match sets (multiplied by the number of labels);

the effective match set appearances that is the classifier’s experience, (possibly) reduced by a certain amount for each label that the classifier did not provide a concrete decision for (see Eq. 7);

the numbers of the classifier's correct and incorrect label decisions, respectively;

the accuracy
that estimates the probability of a classifier predicting the correct label; and

the fitness that is a measure of the classifier’s quality.
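For illustration, the per-classifier parameters listed above can be grouped in a single record. The following Python sketch uses field names of our own choosing; the actual implementation may store and initialize these differently:

```python
from dataclasses import dataclass

@dataclass
class Classifier:
    """Per-rule book-keeping parameters of MlS-LCS (Section 4.1).
    Field names are illustrative shorthand; semantics follow the list above."""
    condition: str           # ternary condition over the attributes
    consequent: str          # ternary string over the labels (Section 3.1)
    numerosity: int = 1      # micro-classifier copies in the population
    cs: float = 1.0          # mean size of correct sets participated in
    ts: int = 0              # time step of last GA in a hosting correct set
    exp: int = 0             # match set appearances times number of labels
    msa: float = 0.0         # effective match set appearances (Eq. 7)
    tp: int = 0              # correct label decisions
    fp: int = 0              # incorrect label decisions
    acc: float = 0.0         # estimated per-label accuracy
    fitness: float = 0.0     # overall quality measure
```

Grouping the parameters this way makes the update rules of Section 4.3 read as simple field assignments.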
At each discrete time step during training, MlS-LCS receives a data instance's attribute values and labels and follows a cycle of performance, update and discovery component activation (Alg. 1). The completion of a full training cycle is followed by a new cycle based on the next available input instance, provided, of course, that the algorithm's termination conditions have not been met.
4.2 The Performance Component of MlS-LCS
Upon receipt of a data instance, the system scans the current population of classifiers for those whose condition matches the input and forms the match set $[M]$. Next, for each label $l$, a correct set $[C_l]$ is formed, containing the rules of $[M]$ that correctly predict label $l$ for the current instance (this is possible in a supervised framework, since the correct labels are directly available to the learning system). The classifiers in $[M]$ incorrectly predicting label $l$ are placed in the incorrect set $[!C_l]$. Finally, if the system is in test mode (under test mode, the population of MlS-LCS does not undergo any changes; that is, the update and discovery components are disabled), a classification decision is produced based on the labels advocated by rules in $[M]$ ($[C_l]$ and $[!C_l]$ cannot be produced, since the actual labels are unknown).
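The set-formation step can be sketched as follows. The ternary condition matcher and the rule encoding are illustrative assumptions, not the system's exact data structures:

```python
from collections import namedtuple

Rule = namedtuple("Rule", "condition consequent")

def matches(condition, x):
    # Ternary condition match: '#' matches any attribute value (illustrative).
    return all(c in ("#", xi) for c, xi in zip(condition, x))

def form_sets(population, x, true_labels):
    """Form the match set [M] and, for each label l, the correct set [C_l]
    and the incorrect set (a sketch of the Section 4.2 procedure;
    `true_labels` maps label index to its true 0/1 value)."""
    M = [r for r in population if matches(r.condition, x)]
    C, notC = {}, {}
    for l, truth in true_labels.items():
        deciders = [r for r in M if r.consequent[l] != "#"]
        C[l] = [r for r in deciders if int(r.consequent[l]) == truth]
        notC[l] = [r for r in deciders if int(r.consequent[l]) != truth]
    return M, C, notC

pop = [Rule("1#", "10"), Rule("0#", "01"), Rule("##", "1#")]
M, C, notC = form_sets(pop, "11", {0: 1, 1: 0})
# M holds the first and third rules; only the first decides label 1
```

Rules with "#" at a label position simply take part in neither set for that label, which is what the effective match set appearances parameter accounts for.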
However, the process of classifying new samples based on models involving multi-label rules is not straightforward. In multi-label classification, a bipartition of relevant and irrelevant labels, rather than a single class, has to be decided upon, based on some threshold. Furthermore, rulesets evolved with LCS may contain contradicting or low-accuracy rules. Therefore, a "vote and threshold" method is required to effectively classify unknown samples (Read, 2010). More specifically, an overall vote $v_l$ for each label $l$
is obtained by allowing each rule to cast a positive (for advocated labels) or negative vote equal to its macro-fitness. Votes are cast only for labels that a rule provides concrete decisions for. The resulting votes vector is normalized, so that all votes lie in $[0, 1]$, and a threshold $t$ is used to select the labels that will be activated (those for which $v_l \geq t$). Assuming that the thresholding method aims at activating at least one label, the range of reasonable thresholds is bounded above by the largest normalized vote.
In our current work, we experimented with two threshold selection methods (Yang, 2001; Read, 2010), namely Internal Validation (Ival) and Proportional Cut (Pcut).
Internal Validation (Ival) selects, given the ruleset, the threshold value that maximizes a performance metric (such as accuracy), based on consecutive internal tests. It can produce good thresholds at a (usually) large computational cost, as the process of validating each threshold value against the training dataset is timeconsuming. Its complexity, however, can be significantly improved by exploiting the fact that most metric functions are convex with respect to the threshold.
Proportional Cut (Pcut) selects the threshold value that minimizes the difference in label cardinality (i.e., the mean number of labels that are activated per sample) between the training data and any other given dataset. This is achieved by minimizing the following error with respect to the threshold $t$:
$$\left| LCard(D_{train}) - LCard(H_t(D)) \right|$$
where $D_{train}$ is the training dataset, $H_t$ is the threshold function and $D$ is the dataset with respect to which we tune the threshold $t$. It is worth noting that, in our case, it always holds that $D = D_{train}$. Tuning the threshold with respect to the test dataset would imply a priori knowledge of the label structure in unlabeled samples and would, thus, result in biased evaluations and, possibly, a wrong choice of models to be used for post-training predictions. The Pcut method, although not tailored to any particular evaluation measure, calibrates thresholds as effectively as Ival, at a fraction of the computational cost, and is, thus, considered a method suitable for general use in experimental evaluations (Read, 2010).
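A minimal Python sketch of the Pcut selection follows; the helper names, the candidate grid and the normalized vote vectors are our own illustrations:

```python
def pcut_threshold(train_card, votes_per_sample, candidates):
    """Proportional cut: pick the threshold whose induced mean label
    cardinality is closest to the training-set label cardinality
    (a sketch; vote vectors are assumed normalized to [0, 1])."""
    def cardinality(t):
        # Mean number of labels activated per sample at threshold t.
        return sum(sum(v >= t for v in votes)
                   for votes in votes_per_sample) / len(votes_per_sample)
    return min(candidates, key=lambda t: abs(train_card - cardinality(t)))

votes = [[0.9, 0.6, 0.1], [0.8, 0.2, 0.2]]
t = pcut_threshold(train_card=1.0, votes_per_sample=votes,
                   candidates=[0.25, 0.5, 0.75])
# t = 0.75: induced cardinalities are 1.5, 1.5 and 1.0 for the candidates,
# and 1.0 exactly matches the training-set cardinality
```

Since only label counts are compared, no ground-truth labels of the thresholded dataset are needed, which is what makes Pcut so cheap relative to Ival.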
Employing each rule’s fitness as its confidence level, it is possible to predict the labels of new (unknown) data samples by using only the fittest rule of those matching each sample’s attribute values. Of course, in case the fittest rule “does not care” for some of the labels, additional rules (sequentially, from a list of matching rules sorted by fitness) can be employed to provide a complete decision vector with specific values for all possible labels. The above described strategy, named Best Rule Selection (Best), has also been included in our experiments, since it is the one yielding the most compact, in terms of number of rules, prediction models.
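The Best Rule Selection strategy can be sketched as follows, assuming the matching rules are available as (fitness, consequent) pairs with consequents over {'0', '1', '#'} (an illustrative representation, not the paper's internal one):

```python
def best_rule_predict(matching_rules, num_labels):
    """Best Rule Selection sketch: walk the matching rules in order of
    decreasing fitness and take each label's decision from the fittest rule
    that is not indifferent ('#') about it."""
    decision = ['#'] * num_labels
    for _, consequent in sorted(matching_rules, key=lambda r: -r[0]):
        for i, d in enumerate(consequent):
            if decision[i] == '#' and d != '#':
                decision[i] = d
        if '#' not in decision:   # all labels decided; stop early
            break
    return decision
```

In the common case only the single fittest rule is consulted, which is what makes the resulting prediction models so compact.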
4.3 The Update Component of MlSLCS
In training or explore mode, each classification of a data instance is associated with an update of the matching classifiers’ parameters. More specifically:

for all classifiers in match set , their experience is increased by one and their value is updated, based on whether they provide a concrete decision;

for all classifiers belonging to at least one correct set , their correct set size is updated, so that it estimates the average size of all correct sets the classifier has participated in so far; and

all classifiers in match set have their fitness updated.
The specific update strategies for and correct set size are presented in Alg. 2.
Fitness calculation in MlSLCS is based on a supervised approach that involves computing the accuracy () of classifiers as the percentage of their correct classifications (line 5 of Alg. 2). Moreover, motivated by the need to distinguish between rules that provide concrete decisions (positive or negative) about labels and those whose decisions are “indifferent”, we introduce the notion of . The correctness value of a rule for a label (with respect to a specific training cycle and, thus, specific and sets) is calculated according to the following equation:
where for rules not deciding on for the current instance (i.e., for matching rules neither in nor in ).
Accordingly, the match set appearances () that a rule obtains for a label , during a specific training cycle, is differentiated depending on whether provides a concrete decision or not, according to Eq. 7, where .
(7) 
In our current work, we explore a version of MlSLCS that slightly penalizes “indifferent” rules by considering #’s as partial (=) matches (=). The reasons that lead us to choose these specific values are detailed in Section 5.1. For now, though, let us again consider the simple example of the toyx problem and the rules of Eq. 6. Supposing that both rules have not encountered any instances so far (==), when the system processes the instance , the rules’ values will become 1 and 0.9, respectively. This means that ’s fitness will be greater than that of ’s when they compete in the GA selection phase for the first label and, thus, the system will have successfully applied the desired pressure towards maximally specific consequents.
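The toyx example above can be reproduced with a small bookkeeping sketch. The partial credit cv = 0.9 for '#' decisions and the unit match-set increment are assumed values, chosen only because they are consistent with the example's 1 vs. 0.9 outcome; the paper tunes these parameters in Section 5.1:

```python
def update_label_stats(rule, label, in_match, in_correct, is_hash,
                       cv=0.9, m=1.0):
    """Per-label bookkeeping sketch for the correctness-based accuracy: a
    concrete correct decision scores 1, a concrete wrong one 0, while an
    indifferent '#' earns only the partial credit cv."""
    if not in_match:
        return
    if is_hash:
        rule['cr'][label] += cv   # partial correctness for an indifferent rule
        rule['ms'][label] += m    # (possibly partial) match-set appearance
    else:
        rule['cr'][label] += 1.0 if in_correct else 0.0
        rule['ms'][label] += 1.0
    rule['acc'][label] = rule['cr'][label] / rule['ms'][label]
```

With these values, a rule deciding correctly on a label reaches accuracy 1 while an equally matched indifferent rule reaches 0.9, reproducing the fitness pressure towards specific consequents described above.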
Finally, as far as the update of rule overall correct-set size is concerned, we have chosen a rather strict estimation, employing the size of the smallest label correct set that the rule participates in. This choice is motivated by the need to exert fitness pressure in the population towards complete label-space coverage. This is, in our case, achieved by rewarding rules that explicitly advocate for or against “unpopular” labels.
4.4 The Discovery Component of MlSLCS
MlSLCS employs two rule discovery mechanisms: a covering operator and a steady-state niche genetic algorithm.
The covering operator is adapted from the one introduced in XCS (Wilson, 1995) and later used in UCS (Bernadó-Mansilla and Garrell-Guiu, 2003; Orriols-Puig and Bernadó-Mansilla, 2008) and most of their derivatives. It is activated only during training and introduces new rules to the population when the system encounters an empty correct set for a label . Covering produces a single random rule with a condition matching the current input instance’s attribute values and generalized with a given probability per attribute. While this process is identical to the one employed in single-class LCS, it is followed by an additional generalization process applied to the rule consequent, which is essential to evolving generalized multilabel rules. All labels in the newly created rule’s consequent are set to 0 or 1 according to the current input and then generalized (converted to #) with probability per label, except for the current label that remains specific.
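The covering step can be sketched as below, assuming binary attributes; the default probabilities and function signature are illustrative:

```python
import random

def cover(instance, labels, current_label, p_hash=0.33, p_label_hash=0.1,
          rng=random):
    """Covering sketch: build one rule matching `instance`, generalizing each
    attribute to '#' with probability p_hash; the consequent copies the
    instance's labels and generalizes each to '#' with probability
    p_label_hash, except the label whose correct set was empty,
    which stays specific."""
    condition = ''.join('#' if rng.random() < p_hash else str(a)
                        for a in instance)
    consequent = ''.join(
        '#' if i != current_label and rng.random() < p_label_hash else str(l)
        for i, l in enumerate(labels))
    return condition, consequent
```

Keeping the current label specific guarantees that the new rule actually fills the empty correct set that triggered covering.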
The genetic algorithm is applied iteratively on all correct sets and invoked at a rate , where is defined as a (minimum) threshold on the average time since the last GA invocation of classifiers in (Bernadó-Mansilla and Garrell-Guiu, 2003). The evolutionary process employs experience-discounted fitness-proportionate parent selection, with the selection probability assigned to each classifier being calculated according to:
(8) 
where
(9) 
and is the experience threshold for fitness discounting. After their selection, the two parent classifiers are copied to form two offspring, on which the multilabel crossover operator and a uniform mutation operator are applied with probabilities and , respectively.
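The selection scheme can be sketched as follows; the linear experience discount below is an assumed stand-in for the paper's Eq. 9, and classifiers are represented as dicts for illustration:

```python
import random

def select_parent(correct_set, theta_exp=10, rng=random):
    """Experience-discounted fitness-proportionate selection (sketch): the
    fitness of classifiers with experience below theta_exp is scaled down
    before the roulette spin, so young, unproven rules are less likely to
    reproduce."""
    weights = [c['fitness'] * min(1.0, c['exp'] / theta_exp)
               for c in correct_set]
    r = rng.random() * sum(weights)
    acc = 0.0
    for c, w in zip(correct_set, weights):
        acc += w
        if r <= acc:
            return c
    return correct_set[-1]
```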
The multilabel crossover operator is introduced in this work and intended for use specifically in multilabel classification settings. Its design was motivated by the fact that for the majority of datasets employed in our current work, the number of attributes is significantly larger than the number of labels (by at least one order of magnitude). This means that, using a single-point crossover operator, the probability that the crossover point would end up in the attribute space is significantly greater than that of it residing in the label space. Therefore, there would be a significantly greater probability of transferring the whole consequent part from the parents to their corresponding offspring than that of transferring the decisions for only a subset of labels .
Actually, allowing the transfer of any set of decisions as a policy for any given crossover occurring on would be a questionable choice: the fact that any two rules, selected to be parents, coincide in does not necessarily mean that they would coincide in where and . Keeping this observation in mind, we designed the multilabel crossover operator, with the aim of exerting more pressure towards accurate decisions per label. The newly proposed operator achieves that by not transferring decisions from the selected parents to their corresponding offspring other than that about the current label, i.e., the label corresponding to the correct set from which the parents were selected.
More specifically, the crossover point is selected pseudo-randomly from the range ], where:
(10) 
and is the classifier size in bits. This means that the multilabel crossover operator takes into account the rule’s condition part and only one (instead of all ) of its labels: the label for which the current correct set (on which the GA is applied) has been formed. If the crossover point happens to be in the range ], that is in the condition part of the rule’s chromosome, the two parent classifiers swap (a) their condition parts beyond the crossover point and (b) their decision for the current label from their consequent parts. Otherwise, that is when the crossover point happens to correspond to (any of the two bits representing) the current label, the two parent classifiers only swap their decision for the label being considered and no part of their conditions.
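The operator's behavior can be sketched on rules represented as (condition, consequent) strings; the positions are simplified relative to the actual bit-level chromosome encoding:

```python
import random

def ml_crossover(p1, p2, current_label, rng=random):
    """Multilabel crossover sketch: the crossover point is drawn over the
    condition positions plus a single slot for the current label (never the
    other labels). A point in the condition swaps the parents' condition
    tails; in both cases the parents swap their decision for the current
    label only."""
    cond1, cons1 = p1
    cond2, cons2 = p2
    n = len(cond1)
    point = rng.randrange(1, n + 2)  # 1..n: condition cut; n+1: label slot
    if point <= n:
        cond1, cond2 = (cond1[:point] + cond2[point:],
                        cond2[:point] + cond1[point:])
    l = current_label
    cons1, cons2 = (cons1[:l] + cons2[l] + cons1[l + 1:],
                    cons2[:l] + cons1[l] + cons2[l + 1:])
    return (cond1, cons1), (cond2, cons2)
```

Note that decisions for all labels other than the current one are never exchanged, which is what exerts the per-label accuracy pressure described above.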
Returning to the GA-based rule generation process, after the crossover and mutation operators have been applied, MlSLCS checks every offspring as per its ability to codify a part of the problem at hand. Given the supervised setting of multilabel classification, this is equivalent to checking that each rule covers at least one instance of the training dataset. The presence of rules in the population that fail to cover at least one instance, termed zero-coverage rules (coverage being defined as the number of data instances a rule matches), is unnecessary to the system. Also, depending on the completeness degree of the problem, it may hinder the system’s performance by lengthening the training time and rendering the production rate of zero-coverage rules through the GA uncontrollable. Therefore, to avoid these problems, MlSLCS removes zero-coverage rules just after their creation by the discovery component, assuring that .
Even after this step, the non-zero-coverage offspring are not directly inserted into the classifier population. They are gathered into a pool, until the GA has been applied to all label correct sets. Once the rule generation process has been completed for the current training cycle, and before their insertion into the classifier population, all rules in the offspring pool are checked for subsumption (a) against each of their parents and (b) in case no parent subsumes them, against the whole rule population. If a classifier (parent or not) is found to subsume the offspring being checked, the latter is not introduced into the population, but the numerosity of the subsuming classifier is incremented by one instead. Subsumption conditions require that the subsuming classifier is sufficiently experienced (), accurate () and more general than the offspring being checked (with and being user-defined parameters of the system). Additionally, the generality condition is extended for the multilabel case, such that a classifier can only subsume a classifier , if ’s condition part is equally or more general and its consequent part is equally or more specific than those, respectively, of the classifier being subsumed.
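The multilabel generality conditions for subsumption can be sketched as below (the experience and accuracy thresholds are checked separately and omitted here; rules are (condition, consequent) strings for illustration):

```python
def can_subsume(subsumer, other):
    """Generality test sketch: the subsumer's condition must be equally or
    more general (every non-'#' attribute agrees) and its consequent equally
    or more specific (it is '#' on a label only if the other rule is too,
    and concrete decisions must not contradict)."""
    cond_a, cons_a = subsumer
    cond_b, cons_b = other
    cond_ok = all(a == '#' or a == b for a, b in zip(cond_a, cond_b))
    cons_ok = all(a == b or b == '#' for a, b in zip(cons_a, cons_b))
    return cond_ok and cons_ok
```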
4.5 Population control strategies employed in MlSLCS
The system maintains an upper bound on the population size (at the micro-classifier level) by employing a deletion mechanism, according to which a rule is selected for deletion with probability :
(11) 
where
and is a user-defined experience threshold.
In addition to the deletion mechanism that is present in most LCS implementations, in MlSLCS we introduce a new population control strategy that aims to increase the mean coverage of instances by the rules in the population. This strategy corresponds to the “control match set ” step (line 7 of Alg. 1) in the overall training cycle of MlSLCS and is based on the following observations:

Given a set of rules, such as the match set , the rules it comprises lie on different coverage levels. This means that rules cover different numbers of dataset instances, depending on the degree of generalization that the LCS has achieved.

A given coverage level in (a subset of rules in whose members cover the same number of instances) holds rules of various fitnesses.

If there are two or more rules in the lowest coverage level in , the rule whose fitness is the lowest among them is not necessary in . That is because there exist more rules that cover the instance from which it was generated and are, in addition, more fit overall, classifying instances more accurately. The rule may still be of use in , if it is the sole rule covering an instance in the population. However, in the general case, it can be removed from the population without any considerable loss of accuracy for the system.
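The observations above translate into a simple control step; in this sketch, `coverage` and `fitness` are the per-rule quantities discussed in the text, and rules are dicts for illustration:

```python
def control_match_set(match_set, population):
    """Match-set control sketch: among the rules in the lowest coverage level
    of the match set, delete the one with the lowest fitness, provided that
    level holds at least two rules (so some rule still covers the instance
    that produced the deleted one)."""
    if len(match_set) < 2:
        return None
    min_cov = min(r['coverage'] for r in match_set)
    lowest = [r for r in match_set if r['coverage'] == min_cov]
    if len(lowest) < 2:
        return None   # sole rule at that level; keep it
    victim = min(lowest, key=lambda r: r['fitness'])
    match_set.remove(victim)
    population.remove(victim)
    return victim
```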
The invocation condition for the match set control strategy (line 6 of Alg. 1) means that the corresponding deletion mechanism will only be activated after the population has reached its upper numerosity boundary for the first time. Thus, “regular” deletions from the population and deletions of low-coverage rules from the match set are two processes (typically) applied simultaneously in the system. Using the above invocation condition accomplishes two objectives: (i) it prevents, during the first training iterations, the deletion of fit and specific rules that could pass their ‘useful’ genes on to the next generation and (ii) it prevents (to a certain degree) the deletion of rules that coexist with others in the lowest coverage level of a specific match set but are unique in another.
Finally, as far as the computational cost of implementing population control is concerned, it is worth noting that it is negligible, as the coverage value for each rule is gradually determined through a full pass of the training dataset and is used only after that point (from which point on it does not change).
4.6 Clustering-based initialization of MlSLCS
MlSLCS also employs a population initialization method that extracts information about the structure of the studied problems through a pre-training clustering phase and exploits this information by transforming it into rules suitable for the initialization of the learning process. The employed method is a generalization for the multilabel case of the clustering-based initialization process presented in Tzima et al. (2012) that has been shown to boost LCS performance, both in terms of predictive accuracy and the final evolved ruleset’s size, in supervised single-label classification problems.
Simply put, the clusteringbased initialization method of MlSLCS detects a representative set of “data points”, termed centroids, from the target multilabel dataset D and transforms them into rules for the initialization of the LCS rule population prior to training.
More specifically,
the dataset D is partitioned into subsets, where is the total number of discrete label combinations present in D. Each subset consists of the instances whose label combination matches the discrete label combination . For each partition , :

The instances belonging to the partition are grouped into clusters, where is the number of instances in the partition and () is a user-defined parameter.

For each cluster , identified in the previous step, its centroid is found employing a clustering algorithm (in our case, the k-means algorithm). Then, a new rule is created whose condition part matches the centroid’s attribute values (more details on this procedure can be found in Tzima et al. (2012)), while the decision part is set to the discrete label combination associated with the current partition. The centroid-to-rule transformation process also includes a generalization step (similar to the one used by the covering operator): some of the newly created rule’s conditions and decisions are generalized (converted to “don’t cares”), taking into account the attribute and label generalization probabilities defined by the user for clustering.
Finally, all rules created by clustering the training dataset are merged to create the ruleset used to initialize the learning process.
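The initialization pipeline can be sketched as follows; a toy Lloyd-iteration k-means stands in for a library implementation, the final generalization step is omitted for brevity, and all parameter names are illustrative:

```python
import numpy as np

def cluster_init(X, Y, per_partition=0.1, seed=0):
    """Clustering-based initialization sketch: partition the dataset by
    distinct label combination, run a few Lloyd iterations per partition,
    and turn each centroid into a rule whose condition matches the centroid
    and whose consequent is the partition's label combination."""
    rng = np.random.default_rng(seed)
    rules = []
    for combo in sorted({tuple(y) for y in Y.tolist()}):
        mask = [tuple(y) == combo for y in Y.tolist()]
        part = X[mask].astype(float)
        k = max(1, int(per_partition * len(part)))
        centroids = part[rng.choice(len(part), size=k, replace=False)]
        for _ in range(5):                       # a few Lloyd iterations
            dist = ((part[:, None, :] - centroids[None]) ** 2).sum(-1)
            assign = dist.argmin(axis=1)
            for j in range(k):
                if (assign == j).any():
                    centroids[j] = part[assign == j].mean(axis=0)
        for c in centroids:                      # centroid -> rule
            rules.append((c.copy(), combo))
    return rules
```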
In our current work, we chose not to experiment with tuning the clusteringbased initialization process parameters and used the following values for all reported experiments: , and .
5 Experimental Validation of MlSLCS
In this Section, we present an experimental evaluation of our proposed multilabel LCS approach (the Java source code of our implementation of MlSLCS used throughout all reported experiments is publicly available at https://github.com/fanioula/mlslcs). We first provide a brief analysis of MlSLCS’s behavior on two artificial datasets and then compare its performance to that of 6 other state-of-the-art methods on 7 real-world datasets.
5.1 Experiments on artificial datasets
Since the focus of our current work is on multilabel classification, we begin our analysis with two artificial problems, named toyx and mlposition respectively, that we consider representative of a wide class of problems from our target real-world domain, in terms of the label correlations involved. The toyx problem has already been described in Section 3.1. The mlpositionN (here, N=4) problem has binary attributes and labels. In every data sample, only one label is active, that is the label corresponding to the most significant bit of the binary number formed by the sample’s attributes. It is evident that, in this case, there is great imbalance among the labels, since label is only activated once, while label is activated in instances. The shortest complete solution of the problem involves exactly + rules, with different degrees of generalization in their condition parts, but no generalizations in their consequent parts. Specifically, for the mlposition problem the shortest complete solution (SCS) includes the following rules:
Overall, one can easily observe that toyx is a problem where two of the labels are only dependent on attribute values and independent of other labels, while mlposition involves labels that are completely dependent on each other (in fact, they are mutually exclusive). Most (non-trivial) real-world problems will be a “mixture” of these two cases (i.e., will involve a mixture of uncorrelated and correlated labels), so our intention is to tune the system to perform as well as possible for both artificial problems. In the current paper, we focus our study on the fitness update process (see Section 4.3) and, more specifically, the choice of the parameter value, given that we consider “don’t cares” as partial matches (, =).
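For reference, the mlpositionN data itself can be generated in a few lines; this sketch assumes N attributes and N + 1 labels, which matches the label-imbalance description above (one label active only for the all-zeros instance, another for half of all instances):

```python
def mlposition(n=4):
    """Generate the mlpositionN dataset (sketch): for each binary instance,
    the single active label is indexed by the position of the most
    significant set bit, with an extra label for the all-zeros instance."""
    data = []
    for x in range(2 ** n):
        attrs = [(x >> (n - 1 - i)) & 1 for i in range(n)]  # MSB first
        labels = [0] * (n + 1)
        labels[x.bit_length()] = 1  # 0 for all-zeros, else MSB position + 1
        data.append((attrs, labels))
    return data
```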
Regarding performance metrics, the percentage of the SCS was selected as an appropriate performance metric, indicative of the progress of genetic search. Along with the %[SCS], we also report the multilabel accuracy (Eq. 1) achieved by the system throughout learning and the average number of rules in the final models evolved. All reported results are averaged over 30 runs (per problem and parameter setting) with different random seeds.
For all experiments and both problems, we kept the majority of parameter settings fixed, using a typical setup, consistent with those reported in the literature of single-label LCS: =, =, =, =, =, =, =, and =. The population size was set to , the number of iterations was , the GA invocation rate was and the generalization probability was . The only parameter varied between the two problems was the label generalization probability which was set to and , respectively, for toyx and mlposition.
The results of our experiments, depicted in Figures 1 and 2, reveal that the value of the parameter affects both the accuracy and the quality (in terms of number of rules) of the evolved solutions. This is especially evident in the toyx problem, where the SCS contains a complex trade-off of feature-space generality and label-space specificity: low values of result in over-penalizing label-space indifferences and exerting pressure for highly specific consequents, thus also adversely affecting the system’s accuracy. The same pressure towards consequent specificity proves beneficial in the mlposition problem, due to the nature of the problem’s SCS that comprises rules providing specific decisions for all labels.
Given our goal to optimize system performance for both problems and the importance of the accuracy metric in real-world applications, choosing the value = is a good trade-off. For this value and the toyx problem, the system discovers % of the SCS on average (% in out of the averaged experiments) and achieves a % accuracy (% in of the averaged experiments). For the mlposition problem (and again =), the system discovers % of the SCS on average (% in experiments) and achieves a % accuracy (% in experiments).
As far as the size of the final rulesets is concerned (again for =), after applying a simple ruleset compaction strategy (i.e., ordering the rules by their macrofitness and keeping only the top rules necessary to fully cover the training dataset and have specific decisions for all labels), we get models with and rules on average for toyx and mlposition, respectively. This points to aspects of the rule evaluation process that need to be further investigated, since for the chosen value of , some of the SCS rules are present, but not prevalent enough in the final rulesets for the compacted solutions to be of the optimal size.
5.2 Experimental setup for realworld problems
The benchmark datasets employed in this set of experiments are listed in Table 1, along with their associated descriptive statistics and application domain. The datasets are ordered by complexity ( ), while Label Cardinality (LCA) is the average number of labels relevant to each instance. We strived to include a considerable variety and scale of multilabel datasets. In total, we used 7 datasets, with dimensions ranging from 6 to 174 labels, and from less than 200 to almost 44,000 examples. All of the datasets are readily available from the Mulan website (http://mulan.sourceforge.net/datasets.html).

dataset  instances  attributes  labels  DIST  DENS  LCA  domain  complexity

flags  194  9C+10N  7  54  0.485  3.39  images  2.58E+04
emotions  593  72N  6  27  0.311  1.87  music  2.56E+05
genbase  662  1186C  27  32  0.046  1.25  biology  2.00E+06
scene  2407  294N  6  14  0.179  1.07  images  4.25E+06
CAL500  502  68N  174  502  0.150  26.04  music  5.94E+06
enron  1702  1001C  53  753  0.064  3.38  text  9.03E+07
mediamill  43907  120N  101  6555  0.043  4.38  video  5.32E+08

(DIST denotes the number of distinct label combinations present in the dataset and DENS the label density, i.e., LCA divided by the number of labels.)
Evaluation is done in the form of ten-fold cross-validation for the four smallest datasets (the specific splits in folds, along with the detailed results of the rival algorithm parameter tuning phase, are available at http://issel.ee.auth.gr/softwarealgorithms/mlslcs/). For the enron, CAL500 and mediamill datasets a train/test split (provided on the Mulan website) is used instead, since cross-validation is too time- and/or computationally intensive for some methods (some of the rival algorithms’ runs could not be completed, even on a machine with 64GB of RAM).
The rival algorithms against which the proposed MlSLCS algorithm is compared are HOMER, RAkEL, ECC, CC, MlkNN and BRJ48. For all algorithms, except ECC and CC, their implementations provided by the Mulan Library for Multilabel Learning (Tsoumakas et al., 2011b) were used, while for ECC and CC we used the MEKA environment (http://meka.sourceforge.net/).
As far as the parameter setup of the algorithms is concerned, in general, we followed the recommendations from the literature, combined with a modest parameter tuning phase, where appropriate. More specifically:

BR refers to a simple binary relevance transformation of each problem using the C4.5 algorithm (WEKA’s (Witten and Frank, 2005) J48 implementation) and serves as our baseline.

For HOMER, Support Vector Machines (SVMs) are used as the internal classifier (WEKA’s SMO implementation). For the number of clusters, five different values (2 to 6) are considered and the best result is reported.

We experiment with three versions of RAkEL and report the best result: (a) the default setup (subset size and models) with C4.5 (WEKA’s J48) as the baseline classifier, (b) the “extended setup”, with a subset size equal to half the number of labels and models, and C4.5 (WEKA’s J48 implementation) as the baseline classifier, and (c) the “extended setup” and SVMs (WEKA’s SMO implementation) as the baseline classifier.

ECC and CC are used with SVMs (WEKA’s SMO implementation) as the baseline classifier, while the number of models for ECC is set to 10, as proposed by the algorithm’s authors in (Read et al., 2009).

Finally, the number of neighbors for the MlkNN method is determined by testing the values 6 through 20 (with step 2) and selecting the best result per dataset.
Where not stated differently, the default parameters were used.
For MlSLCS, we kept the majority of parameters fixed through all experiments, using the typical setup reported for the artificial problem experiments. The parameters varied were the population size , the number of iterations , the GA invocation rate and the generalization probabilities and . The choice of specific parameter values (Table 2) was based on an iterative process that involved starting with default values for all parameters (=, and =*, =, =, =) and tuning one parameter at a time, according to the following steps:

was set to either 0.1 or 0.01, depending on the resulting model’s performance on the train dataset;

for the values 0.33, 0.4, 0.8, 0.9, and 0.99 were iteratively tested and the one leading to the greater coverage of the train dataset’s instances was selected;

was selected between the values 300 and 2000, based on which one of them leads to a faster suppression of the covering process;

the population size was selected among the values 1000, 2000, 9000, 12000, and 25000, based on the resulting model’s performance on the train dataset;

evolved models were evaluated every =* iterations and training stopped when the performance on the test dataset (with respect to the accuracy metric) was greater than that of the baseline BR approach.
During the tuning process, the parameter values selected in each step were used (and kept constant) in all subsequent steps.
Dataset  iterations  population size  GA invocation rate  gen. probability  label gen. probability

flags  500  1000  2000  0.33  0.01
emotions  500  5000  2000  0.8  0.01
genbase  500  12000  2000  0.4  0.10
scene  2500  9000  300  0.99  0.10
CAL500  200  1000  2000  0.9  0.10
enron  600  25000  2000  0.99  0.10
mediamill  10  1000  2000  0.9  0.10
It is also worth noting that, when using Ival for MlSLCS, the corresponding thresholds were calibrated based on the (multilabel) accuracy metric, as in RAkEL.
Regarding the statistical significance of the measured differences in algorithm performance, we employ the procedure suggested in (Demšar, 2006) for robustly comparing classifiers across multiple datasets. This procedure involves the use of the Friedman test to establish the significance of the differences between classifier ranks and, potentially, a post-hoc test to compare classifiers to each other. In our case, where the goal is to compare the performance of all algorithms to each other, the Nemenyi test was selected as the appropriate post-hoc test.
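The rank-comparison procedure can be illustrated with a bare-bones Friedman statistic; this is a sketch without the tie correction that library implementations (e.g., scipy.stats.friedmanchisquare) apply:

```python
def friedman_stat(scores):
    """Friedman statistic sketch: rank the k algorithms on each of the n
    datasets (rank 1 = best, higher score assumed better), then compute
    12n / (k(k+1)) * sum_j (Rbar_j - (k+1)/2)^2 over the mean ranks Rbar_j.
    The statistic is compared against a chi-square critical value with
    k - 1 degrees of freedom."""
    n, k = len(scores), len(scores[0])
    mean_ranks = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        for rank, j in enumerate(order, start=1):
            mean_ranks[j] += rank / n
    return 12.0 * n / (k * (k + 1)) * sum(
        (r - (k + 1) / 2.0) ** 2 for r in mean_ranks)
```

For instance, with four hypothetical datasets on which one of three algorithms always ranks first, the statistic reaches its maximum n(k − 1) = 8; being above the critical value 5.991 (2 degrees of freedom, α = 0.05), the null hypothesis of equal ranks would be rejected and the Nemenyi post-hoc test applied.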
5.3 Comparative Analysis of Results
Table 3 summarizes the results for the MlSLCS algorithm for all inference methods (see Section 4.2), namely Proportional Cut (Pcut), Internal Validation (Ival) and Best Rule Selection (Best), all three evaluation metrics (multilabel accuracy, exact match and Hamming loss) and all datasets used in this study. All values reported are on a percentage (%) scale and the results for the three evaluation metrics for each inference method refer to the same experiment per dataset.
Accuracy  Exact Match  Hamming Loss  

Pcut  Ival  Best  Pcut  Ival  Best  Pcut  Ival  Best  
flags  
emotions  
genbase  
scene  
CAL500  
enron  
mediamill 
Inspecting the obtained results, one can easily conclude that while no inference method is clearly dominant, Ival seems to yield the best results overall. It is also worth noting that the Best method outperforms the other two inference methods for 2 out of the 7 studied datasets, although it involves a considerably smaller number of rules in its final models. Especially in the case of the CAL500 dataset, the use of the full evolved ruleset (thresholded through Pcut or Ival) seems to be particularly harmful for system performance. This indicates a problem with either the evolution of rules or the threshold selection procedures that needs to be further investigated in the future.
In general, results with the Best method are acceptable and close to those of the other inference methods. Thus, the considerably smaller rulesets involved in Best models can be considered an effective summary of the target problem’s solution to be used for descriptive purposes. The need for such a “description” is especially evident in real-world classification problems, where the desired solution must be interpretable by human experts and/or decision makers.
Considering the experiment that corresponds to the inference method with the best accuracy value for our proposed MlSLCS algorithm, Tables 5(a)–5(c) summarize the results of comparing it with its rival learning techniques. Achieved values (%) for the three evaluation metrics (multilabel accuracy, exact match and Hamming loss) and all datasets used in this study are reported. In Table 5(a), along with the accuracy rates, we also report each algorithm’s overall average rank (row labeled “Av. Rank”) and its position in the final ranking (row labeled “Final Pos.”). Accordingly, Tables 5(b) and 5(c), respectively, report the values for the exact match and the Hamming loss metrics, along with the corresponding rankings.



Based on the accuracy results, the average rank provides a clear indication of the studied algorithms’ relative performance: MlSLCS ranks second after RAkEL and outperforms all its rivals in 3 out of the 7 studied problems, including the relatively high-complexity CAL500 problem. The comparison results are less favorable for MlSLCS when based on the exact match and Hamming loss metrics, as it ranks third in both cases. Still, MlSLCS achieves the best exact match value for 3 out of the 7 studied problems, including the CAL500 problem. In the latter case, MlSLCS (with the Best inference strategy) outperforms its rivals by at least 70%. We consider this result indicative of our proposed algorithm’s ability to effectively model label correlations, given the high label cardinality () of the problem.
Regarding the statistical significance of the measured differences in algorithm ranks, the use of the Friedman test does not reject the null hypothesis (at α = 0.05) that all algorithms perform equivalently, when applied to rankings based on the accuracy and exact match metrics. The same null hypothesis is rejected (at α = 0.05) when the studied algorithms are ranked based on Hamming loss, and the Nemenyi post-hoc test detects a significant performance difference between RAkEL and (a) HOMER and CC at α = 0.1, and (b) ECC and BR at α = 0.05. Overall, regardless of the evaluation metric used, MlSLCS outperforms at least 4 of its 6 rivals. In the cases of accuracy and Hamming loss, the outperformed rivals include the state-of-the-art algorithms HOMER and CC that have been recommended as benchmarks by a recent extensive comparative study of multilabel classification algorithms (Madjarov et al., 2012). Additionally, no statistically significant performance differences are detected between MlSLCS and the best-performing RAkEL algorithm, with respect to all evaluation metrics. Thus, we consider the obtained results indicative of (i) the potential of our proposed LCS approach for effective multilabel classification, as well as (ii) the flexibility of the generalized multilabel rule format that can mimic the knowledge representations induced by the studied rule-based, lazy and SVM-based ensemble learners, depending on the problem type.
6 Conclusions and Future Work
In this paper, we presented a generalized rule format suitable for generating compact and accurate rulesets in multilabel settings. The proposed format extends existing rule representations with a flexible mechanism for modeling label correlations without the need to explicitly specify the label combinations to be considered. Thus, algorithms inducing generalized multilabel rules can approach all possible spectra between the BR (no label correlations) and LP (all possible label combinations) transformations, while producing comprehensible knowledge in the form of “ifthen” rules.
In addition to detailing the generalized multilabel rule format, our current work also employed it in the context of a multilabel LCS algorithm, named MlSLCS, that is based on a supervised LCS learning framework, properly modified to meet the new requirements posed by the multilabel classification domain. Its extensive experimental evaluation, missing from previous research in the area, revealed that it is capable of consistently effective classification and highlighted it as the first LCS-based alternative to state-of-the-art multilabel classification methods. Based on the average rank over the three evaluation metrics employed, MlSLCS came second with to RAkEL’s average first place, while it outperformed HOMER (whose average rank is ) that has recently been identified as a top-performing benchmark multilabel classification method (Madjarov et al., 2012).
Regarding the combined potential of MlSLCS and the proposed generalized multilabel rule format, it is also worth noting that they are, with small modifications to the internal representation of rule labels, directly applicable to the relatively new task of multidimensional classification.
The current limitation of our approach, namely the relatively long times required for model training – also a problem for several non-evolutionary multilabel approaches, such as RAkEL and ECC – can be overcome by exploiting the parallelization, and thus scalability, potential of GAs.
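To illustrate the parallelization argument (with a made-up population and fitness function, not MlSLCS itself): since each rule's fitness depends only on that rule and the training data, fitness evaluation – typically the dominant GA cost – can be farmed out to a worker pool:

```python
# Fitness evaluation in a GA is embarrassingly parallel: each individual's
# score depends only on that individual and the (shared, read-only) data.
# Minimal sketch with a thread pool; a process pool would be preferred
# for truly CPU-bound fitness functions.

from concurrent.futures import ThreadPoolExecutor

def fitness(rule):
    # stand-in for matching a rule against the training set
    return sum(rule) / len(rule)

population = [[1, 0, 1], [0, 0, 1], [1, 1, 1]]

with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(fitness, population))

print(scores)  # one fitness value per individual, evaluated concurrently
```

The rest of the GA loop (selection, crossover, mutation) is cheap by comparison, so near-linear speedups in the number of workers are plausible for expensive fitness functions.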
An additional important issue that needs to be addressed in future work concerns the readability of the evolved knowledge representations, both in terms of rule quality (degree of generalization) and quantity. Our first step in this direction will be an experimental investigation of the rule compaction methods available in the literature. Furthermore, based on the encouraging results obtained with our clustering-based initialization procedure, alternative rule initialization methods will be explored, as a means to boost the predictive accuracy and interpretability of the induced knowledge representations.
Acknowledgment
The first author would like to acknowledge that this research has been funded by the Research Committee of Aristotle University of Thessaloniki, through the “Excellence Fellowships for Postdoctoral Studies” program.
References
 Ahmadi Abhari et al. (2011) Ahmadi Abhari, K., Hamzeh, A., and Hashemi, S. (2011). Voting based learning classifier system for multilabel classification. In Proceedings of the 2011 GECCO Conference Companion on Genetic and Evolutionary Computation, pages 355–360, New York, NY, USA. ACM.
 Allamanis et al. (2013) Allamanis, M., Tzima, F. A., and Mitkas, P. A. (2013). Effective rulebased multilabel classification with learning classifier systems. In Tomassini, M., Antonioni, A., Daolio, F., and Buesser, P., editors, ICANNGA, volume 7824 of Lecture Notes in Computer Science, pages 466–476. Springer.
 Behdad et al. (2012) Behdad, M., Barone, L., French, T., and Bennamoun, M. (2012). On XCSR for electronic fraud detection. Evolutionary Intelligence, 5(2):139–150.
 Bernadó-Mansilla and Garrell-Guiu (2003) Bernadó-Mansilla, E. and Garrell-Guiu, J. (2003). Accuracy-based learning classifier systems: models, analysis and applications to classification tasks. Evolutionary Computation, 11(3):209–238.
 Bull et al. (2008) Bull, L., Bernadó-Mansilla, E., and Holmes, J. H., editors (2008). Learning Classifier Systems in Data Mining, volume 125 of Studies in Computational Intelligence. Springer.
 Bull and O’Hara (2002) Bull, L. and O’Hara, T. (2002). Accuracy-based neuro and neuro-fuzzy classifier systems. In Langdon, W. B. et al., editors, GECCO 2002: Proceedings of the Genetic and Evolutionary Computation Conference, New York, USA, 9–13 July 2002, pages 905–911. Morgan Kaufmann.
 Butz et al. (2005) Butz, M., Goldberg, D., and Lanzi, P. (2005). Gradient descent methods in learning classifier systems: improving XCS performance in multi-step problems. Evolutionary Computation, IEEE Transactions on, 9(5):452–473.
 Butz et al. (2004) Butz, M., Kovacs, T., Lanzi, P., and Wilson, S. (2004). Toward a theory of generalization and learning in XCS. Evolutionary Computation, IEEE Transactions on, 8(1):28–46.
 Butz et al. (2008) Butz, M., Lanzi, P., and Wilson, S. (2008). Function approximation with XCS: Hyperellipsoidal conditions, recursive least squares, and compaction. Evolutionary Computation, IEEE Transactions on, 12(3):355–376.
 Cheng and Hüllermeier (2009) Cheng, W. and Hüllermeier, E. (2009). Combining instance-based learning and logistic regression for multilabel classification. Machine Learning, 76(2–3):211–225.
 Clare and King (2001) Clare, A. and King, R. D. (2001). Knowledge discovery in multilabel phenotype data. In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, pages 42–53, London, UK. Springer-Verlag.
 Crammer and Singer (2003) Crammer, K. and Singer, Y. (2003). A family of additive online algorithms for category ranking. J. Mach. Learn. Res., 3:1025–1058.
 Demšar (2006) Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7:1–30.
 Elisseeff and Weston (2005) Elisseeff, A. and Weston, J. (2005). A kernel method for multilabelled classification. In Proceedings of the Annual ACM Conference on Research and Development in Information Retrieval, pages 274–281.
 Fernández et al. (2010) Fernández, A., Garcia, S., Luengo, J., Bernadó-Mansilla, E., and Herrera, F. (2010). Genetics-based machine learning for rule induction: State of the art, taxonomy, and comparative study. Evolutionary Computation, IEEE Transactions on, 14(6):913–941.

 Holland (1975) Holland, J. H. (1975). Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence. University of Michigan Press, Ann Arbor, MI, USA.
 Hüllermeier et al. (2008) Hüllermeier, E., Fürnkranz, J., Cheng, W., and Brinker, K. (2008). Label ranking by learning pairwise preferences. Artificial Intelligence, 172(16–17):1897–1916.
 Iqbal et al. (2014) Iqbal, M., Browne, W., and Zhang, M. (2014). Reusing building blocks of extracted knowledge to solve complex, largescale boolean problems. Evolutionary Computation, IEEE Transactions on, 18(4):465–480.
 Kharbat et al. (2007) Kharbat, F., Bull, L., and Odeh, M. (2007). Mining breast cancer data with XCS. In GECCO ’07: Proceedings of the 9th annual conference on Genetic and evolutionary computation, pages 2066–2073, New York, NY, USA. ACM.
 Kneissler et al. (2014) Kneissler, J., Stalph, P., Drugowitsch, J., and Butz, M. (2014). Filtering Sensory Information with XCSF: Improving Learning Robustness and Robot Arm Control Performance. Evolutionary Computation, 22(1):139–158.
 Kocev (2011) Kocev, D. (2011). Ensembles for predicting structured outputs. PhD thesis, IPS Jožef Stefan, Ljubljana, Slovenia.
 Kovacs (2002a) Kovacs, T. (2002a). XCS’s Strength-Based Twin: Part I. In Lanzi, P. L., Stolzmann, W., and Wilson, S. W., editors, Learning Classifier Systems, 5th International Workshop, IWLCS 2002, Granada, Spain, September 7–8, 2002, Revised Papers, volume 2661 of Lecture Notes in Computer Science, pages 61–80. Springer.
 Kovacs (2002b) Kovacs, T. (2002b). XCS’s Strength-Based Twin: Part II. In Lanzi, P. L., Stolzmann, W., and Wilson, S. W., editors, Learning Classifier Systems, 5th International Workshop, IWLCS 2002, Granada, Spain, September 7–8, 2002, Revised Papers, volume 2661 of Lecture Notes in Computer Science, pages 81–98. Springer.
 Lanzi (2008) Lanzi, P. (2008). Learning Classifier Systems: Then and Now. Evolutionary Intelligence, 1(1):63–82.
 Lanzi (1999) Lanzi, P. L. (1999). Extending the representation of classifier conditions part I: From binary to messy coding. In Banzhaf, W., Daida, J., Eiben, A. E., Garzon, M. H., Honavar, V., Jakiela, M., and Smith, R. E., editors, Proceedings of the GECCO Conference, pages 337–344, Orlando, Florida, USA. Morgan Kaufmann.
 Lanzi et al. (2006) Lanzi, P. L., Loiacono, D., Wilson, S. W., and Goldberg, D. E. (2006). Classifier prediction based on tile coding. In Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pages 1497–1504, New York, NY, USA. ACM.
 Lanzi and Perrucci (1999) Lanzi, P. L. and Perrucci, A. (1999). Extending the representation of classifier conditions part II: From messy coding to S-expressions. In Banzhaf, W., Daida, J., Eiben, A. E., Garzon, M. H., Honavar, V., Jakiela, M., and Smith, R. E., editors, Proceedings of the GECCO Conference, pages 345–352, Orlando, Florida, USA. Morgan Kaufmann.
 Lanzi and Wilson (2006) Lanzi, P. L. and Wilson, S. W. (2006). Using convex hulls to represent classifier conditions. In Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pages 1481–1488, New York, NY, USA. ACM.
 Madjarov et al. (2012) Madjarov, G., Kocev, D., Gjorgjevikj, D., and Džeroski, S. (2012). An extensive experimental comparison of methods for multilabel learning. Pattern Recognition, 45(9):3084–3104.
 Murphy (2012) Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. The MIT Press.
 Nakata et al. (2014) Nakata, M., Kovacs, T., and Takadama, K. (2014). A modified XCS classifier system for sequence labeling. In Proceedings of the 2014 Conference on Genetic and Evolutionary Computation, GECCO ’14, pages 565–572, New York, NY, USA. ACM.
 Nakata et al. (2015) Nakata, M., Kovacs, T., and Takadama, K. (2015). XCS-SL: a rule-based genetic learning system for sequence labeling. Evolutionary Intelligence, pages 1–16.
 Orriols-Puig and Bernadó-Mansilla (2008) Orriols-Puig, A. and Bernadó-Mansilla, E. (2008). Revisiting UCS: Description, Fitness Sharing, and Comparison with XCS. In Bacardit, J., Bernadó-Mansilla, E., Butz, M. V., Kovacs, T., Llorà, X., and Takadama, K., editors, Learning Classifier Systems, pages 96–116. Springer-Verlag, Berlin, Heidelberg.
 Orriols-Puig et al. (2009a) Orriols-Puig, A., Bernadó-Mansilla, E., Goldberg, D., Sastry, K., and Lanzi, P. (2009a). Facetwise analysis of XCS for problems with class imbalances. Evolutionary Computation, IEEE Transactions on, 13(5):1093–1119.
 Orriols-Puig et al. (2009b) Orriols-Puig, A., Casillas, J., and Bernadó-Mansilla, E. (2009b). Fuzzy-UCS: A Michigan-Style Learning Fuzzy-Classifier System for Supervised Learning. Evolutionary Computation, IEEE Transactions on, 13(2):260–283.
 Preen and Bull (2013) Preen, R. and Bull, L. (2013). Dynamical Genetic Programming in XCSF. Evolutionary Computation, 21(3):361–387.
 Read (2008) Read, J. (2008). A Pruned Problem Transformation Method for Multilabel classification. In Proc. 2008 New Zealand Computer Science Research Student Conference (NZCSRS 2008), pages 143–150.
 Read (2010) Read, J. (2010). Scalable Multi-Label Classification. PhD thesis, University of Waikato, Hamilton, New Zealand.
 Read et al. (2014) Read, J., Bielza, C., and Larranaga, P. (2014). Multidimensional classification with superclasses. Knowledge and Data Engineering, IEEE Transactions on, 26(7):1720–1733.
 Read et al. (2008) Read, J., Pfahringer, B., and Holmes, G. (2008). Multilabel classification using ensembles of pruned sets. In 2008 Eighth IEEE International Conference on Data Mining, pages 995–1000. IEEE.
 Read et al. (2009) Read, J., Pfahringer, B., Holmes, G., and Frank, E. (2009). Classifier chains for multilabel classification. Machine Learning and Knowledge Discovery in Databases, pages 254–269.
 Schapire and Singer (2000) Schapire, R. and Singer, Y. (2000). BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2):135–168.
 Stalph et al. (2012) Stalph, P. O., Llorà, X., Goldberg, D. E., and Butz, M. V. (2012). Resource management and scalability of the XCSF learning classifier system. Theoretical Computer Science, 425:126–141.
 Stone and Bull (2003) Stone, C. and Bull, L. (2003). For real! XCS with continuous-valued inputs. Evolutionary Computation, 11(3):299–336.
 Thabtah et al. (2004) Thabtah, F., Cowling, P., and Peng, Y. (2004). MMAC: a new multiclass, multilabel associative classification approach. In Proceedings of the 2004 IEEE International Conference on Data Mining, pages 217–224.
 Tsoumakas and Katakis (2007) Tsoumakas, G. and Katakis, I. (2007). Multilabel classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1–13.
 Tsoumakas et al. (2008) Tsoumakas, G., Katakis, I., and Vlahavas, I. P. (2008). Effective and Efficient Multilabel Classification in Domains with Large Number of Labels. In ECML/PKDD 2008 Workshop on Mining Multidimensional Data.
 Tsoumakas et al. (2010) Tsoumakas, G., Katakis, I., and Vlahavas, I. P. (2010). Mining multilabel data. In Maimon, O. and Rokach, L., editors, Data Mining and Knowledge Discovery Handbook, pages 667–685. Springer.
 Tsoumakas et al. (2011a) Tsoumakas, G., Katakis, I., and Vlahavas, I. P. (2011a). Random k-labelsets for multilabel classification. IEEE Transactions on Knowledge and Data Engineering, 23(7):1079–1089.
 Tsoumakas et al. (2011b) Tsoumakas, G., SpyromitrosXioufis, E., Vilcek, J., and Vlahavas, I. (2011b). Mulan: A java library for multilabel learning. Journal of Machine Learning Research, 12:2411–2414.
 Tzima and Mitkas (2008) Tzima, F. and Mitkas, P. (2008). ZCS revisited: Zeroth-level classifier systems for data mining. In Data Mining Workshops, 2008. ICDMW ’08. IEEE International Conference on, pages 700–709.
 Tzima and Mitkas (2013) Tzima, F. A. and Mitkas, P. A. (2013). Strength-based learning classifier systems revisited: Effective rule evolution in supervised classification tasks. Engineering Applications of Artificial Intelligence, 26(2):818–832.
 Tzima et al. (2012) Tzima, F. A., Mitkas, P. A., and Theocharis, J. B. (2012). Clustering-based initialization of learning classifier systems – effects on model performance, readability and induction time. Soft Computing, 16(7):1267–1286.
 Urbanowicz and Moore (2009) Urbanowicz, R. J. and Moore, J. H. (2009). Learning classifier systems: A complete introduction, review, and roadmap. Journal of Artificial Evolution and Applications, 2009:1:1–1:25.
 Vallim et al. (2009) Vallim, R., Duque, T., Goldberg, D., and Carvalho, A. (2009). The multilabel OCS with a genetic algorithm for rule discovery: implementation and first results. In Proceedings of the 11th Annual conference on Genetic and evolutionary computation, pages 1323–1330. ACM.
 Vallim et al. (2008) Vallim, R., Goldberg, D., Llorà, X., Duque, T., and Carvalho, A. (2008). A new approach for multilabel classification based on default hierarchies and organizational learning. In Proceedings of the 2008 GECCO conference companion on Genetic and evolutionary computation, pages 2017–2022. ACM.
 Vens et al. (2008) Vens, C., Struyf, J., Schietgat, L., Džeroski, S., and Blockeel, H. (2008). Decision trees for hierarchical multilabel classification. Machine Learning, 73(2):185–214.
 Wilson (2002) Wilson, S. (2002). Classifiers that approximate functions. Natural Computing, 1(2–3):211–234.
 Wilson (1994) Wilson, S. W. (1994). ZCS: A Zeroth-level Classifier System. Evolutionary Computation, 2(1):1–18.
 Wilson (1995) Wilson, S. W. (1995). Classifier fitness based on accuracy. Evolutionary Computation, 3(2):149–175.
 Witten and Frank (2005) Witten, I. H. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques (2nd Edition). Morgan Kaufmann, San Francisco, CA, USA.
 Yang (2001) Yang, Y. (2001). A study of thresholding strategies for text categorization. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 137–145, New York, NY, USA. ACM.
 Zhang and Zhang (2010) Zhang, M.L. and Zhang, K. (2010). Multilabel learning by exploiting label dependency. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 999–1008, New York, NY, USA. ACM.
 Zhang and Zhou (2006) Zhang, M.L. and Zhou, Z.H. (2006). Multilabel neural networks with applications to functional genomics and text categorization. Knowledge and Data Engineering, IEEE Transactions on, 18(10):1338–1351.
 Zhang and Zhou (2007) Zhang, M.-L. and Zhou, Z.-H. (2007). ML-KNN: A lazy learning approach to multilabel learning. Pattern Recognition, 40(7):2038–2048.