Inducing Generalized Multi-Label Rules with Learning Classifier Systems

12/25/2015 ∙ by Fani A. Tzima, et al. ∙ Aristotle University of Thessaloniki

In recent years, multi-label classification has attracted a significant body of research, motivated by real-life applications, such as text classification and medical diagnoses. Although sparsely studied in this context, Learning Classifier Systems are naturally well-suited to multi-label classification problems, whose search space typically involves multiple highly specific niches. This is the motivation behind our current work, which introduces a generalized multi-label rule format -- allowing for flexible label-dependency modeling, with no need for explicit knowledge of which correlations to search for -- and uses it as a guide for further adapting the general Michigan-style supervised Learning Classifier System framework. The integration of the aforementioned rule format and framework adaptations results in a novel algorithm for multi-label classification whose behavior is studied through a set of properly defined artificial problems. The proposed algorithm is also thoroughly evaluated on a set of multi-label datasets and found to be competitive with other state-of-the-art multi-label classification methods.


1 Introduction

Every day massive amounts of data are collected and processed by computers and embedded devices. This data, however, is useless to the people and organizations collecting it, unless it can be properly processed and converted into actionable knowledge. Machine Learning (ML) (Murphy, 2012) techniques are especially useful in such domains, where automatic extraction of knowledge from data is required.

One of the most common and extensively studied knowledge extraction tasks is classification. In traditional classification problems, data samples are associated with a single category, termed class, that may have two or more possible values. For example, the outlook for tomorrow’s weather may be ‘sunny’, ‘overcast’ or ‘rainy’. On the other hand, multi-label classification, the focus of our current investigation, involves problems where each sample is associated with one or more binary categories, termed labels. (Multi-label classification can be viewed as a particular case of the multi-dimensional problem (Read et al., 2014), where the goal is to assign each data sample to multiple multi-valued, rather than binary, classes.) For example, a newspaper article about climate change can be described by both tags ‘environment’ and ‘politics’; or a patient can simultaneously be diagnosed with ‘high blood pressure’, ‘diabetes’ and ‘myopia’.

Although single-label classification problems have been thoroughly explored, with the aid of various ML algorithms, literature on multi-label classification is far less abundant. Multi-label classification problems are, however, by no means less natural or intuitive and are, in fact, very common in real-life. The fact that, until recently, only a few of the corresponding problems were tackled as multi-label is mainly due to computational limitations. Recent research (see Tsoumakas et al. (2010) and Read (2010) for overviews) and modern hardware, though, have made multi-label classification more affordable. A gradually increasing number of problems are now being tackled as multi-label, allowing for richer and more accurate knowledge mining in real-world domains, such as medical diagnoses, protein function prediction and semantic scene processing.

A careful inspection of the corresponding literature reveals that, although multi-label classification is nowadays a widely popularized task, Evolutionary Computation (EC) approaches to prediction model induction remain very sparse. The few approaches that exist (Vallim et al., 2008, 2009; Ahmadi Abhari et al., 2011) explore the use of Michigan-style Learning Classifier Systems (LCS) (Holland, 1975) – a Genetics-based ML method that combines EC and reinforcement (Wilson, 1995) or supervised learning (Bernadó-Mansilla and Garrell-Guiu, 2003; Orriols-Puig and Bernadó-Mansilla, 2008) – but report promising results only on small artificial and real-world problems. Although these approaches lack an extensive experimental evaluation, both in terms of target multi-label classification problems and rival algorithm variety, they are based on a valid premise. This premise also summarizes the motivation of our current work: LCS, due to their inherent characteristics, are naturally suited to multi-label classification and can provide an effective alternative in problem domains where highly expressive, human-readable knowledge needs to be extracted, while maintaining low inference complexity.

Indeed, in recent years, LCS have been modified for data mining (Bull et al., 2008) and single-step classification problems, notably in the UCS (Orriols-Puig and Bernadó-Mansilla, 2008) and SS-LCS frameworks (Tzima et al., 2012; Tzima and Mitkas, 2013). Their niche-based update and overall iterative (rather than batch) learning approach have been shown to be very efficient in domains where different problem niches occur (including multi-class and unbalanced classification problems). Thus, we believe that this approach will also allow them to tackle the multiple and often very specific niches that comprise the search space of multi-label classification problems.

Moreover, LCS may provide a practical alternative to deterministic methods, when exhaustive search is intractable (for example, in multi-label classification problems with large numbers of labels and/or attributes) or, in general, when targeting problems with large, complex and diverse search spaces. In such cases, the global search capability of EC, combined with the local search ability of reinforcement learning, allows LCS to evolve flexible, distributed solutions, wherein discovered patterns are spread over a population of (individual or groups of) rules, each modeling a niche of the problem space (Urbanowicz and Moore, 2009).

LCS are also model-free and, thus, do not make any assumptions about target data (e.g. number, types and dependencies among attributes, missing data, distribution of training instances in terms of the target categories). This allows them to identify all kinds of relationships – including epistatic ones that are characteristic of multi-label domains – both between the feature and label space and among the various labels.

Finally, as already mentioned, the nature of the knowledge representation evolved by LCS is a great advantage in certain application domains, where rule comprehensibility is an important requirement. At this point, it should be noted that Michigan-style LCS, although implicitly geared towards maximally accurate and general rules, tend to evolve rather large populations, mainly due to the distributed nature of the evolved solutions and the retention of inexperienced rules created by the system’s exploration component. Ruleset compaction techniques are, though, available to reduce the number of rules in the final models and enhance their comprehensibility.

Overall, the aim of our current work (that builds on previous research presented in Allamanis et al. (2013)) is to develop an effective LCS algorithm for multi-label classification. In this direction, we employ a general supervised learning framework and extend it, to render it directly applicable to the corresponding problems, without the need for any problem transformation. More specifically, we adapt three major components of the traditional LCS architecture: (i)  the Rule Representation, to allow for rule consequents that include multiple labels; (ii) the Update Component, to consider multiple correct labels in rule parameter updates; and (iii) the Performance Component, to enable inference in multi-label settings where multiple concurrent decisions are required.

The aforementioned extensions implicitly define the structure and main contributions of the paper, which are detailed after briefly presenting the relevant background (Section 2). Briefly, our current work’s main contributions are:

  • a generalized multi-label rule format (Section 3) that has several distinct advantages over those used in other multi-label classification methods;

  • a multi-label Learning Classifier System (Section 4), named the Multi-Label Supervised Learning Classifier System (MlS-LCS), whose components allow for efficient and accurate multi-label classification through developing expressive multi-label rulesets; and

  • an experimental evaluation (Section 5) of our proposed LCS approach, against other state-of-the-art algorithms on widely used datasets, that validates its potential.

Section 6 restates our overall contributions, outlines future research directions and concludes this work with additional insights on the potential of the proposed algorithm.

2 Background

2.1 Multi-label Classification

Multi-label classification is a generalization of traditional classification where each sample is associated with a set of mutually non-exclusive binary categories, or labels, $Y \subseteq L$. Thus, defining the problem from a machine learning point of view, a multi-label classification model approximates a function $h: X \rightarrow 2^{L}$, where $X$ is the feature space and $2^{L}$ is the powerset of the label space (i.e., the powerset of the set $L$ of all possible labels).

The general multi-label classification framework, by definition, implies the existence of an additional dimension: that of the multiple labels which data samples can be associated with. This additional complexity affects not only the learning processes that can be applied to the corresponding problems, but also the procedures employed during the evaluation of developed models (Tsoumakas et al., 2010).

The basic premise that differentiates learning, with respect to the single-class case, is that, to provide more accurate predictions, label correlations should be factored into multi-label classification models. This need is based on the observation that labels co-occur with different frequencies. For example, a newspaper article is far more likely to be assigned the pair of tags ‘science’ and ‘environment’ than the pair ‘environment’ and ‘sports’. Of course, in the absence of label correlations, the corresponding multi-label problem is trivial and can be completely broken down (without any loss of useful information) into independent binary decision problems, one per label.

There are three main approaches to tackling multi-label classification problems in the literature: problem transformation, algorithm transformation (such as the LCS approach presented in this paper) and ensemble methods.

Problem Transformation methods transform a multi-label classification problem into a set of single-label ones. Various such transformations have been proposed, involving different trade-offs between training time and label correlation representation. The simplest of all transformations is the Binary Relevance (BR) method (Tsoumakas and Katakis, 2007), to which the Classifier Chains (CC) method (Read et al., 2009) is closely related. Other transformations found in the literature are Ranking by Pairwise Comparison (RPC) (Hüllermeier et al., 2008) and the Label Powerset (LP) method, which has been the focus of several studies, including the Pruned Problem Transformation (PPT) (Read, 2008) and HOMER (Tsoumakas et al., 2008).
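To make the two basic transformations concrete, the sketch below illustrates BR and LP on a dataset given as a feature matrix X and a binary label matrix Y (one column per label); the choice of scikit-learn decision trees as base classifiers is purely illustrative, not prescribed by the methods themselves.

```python
# A minimal sketch of the BR and LP transformations; X is a numpy feature
# matrix, Y a numpy 0/1 label matrix with one column per label.
from sklearn.tree import DecisionTreeClassifier

def binary_relevance_fit(X, Y):
    """BR: one independent binary classifier per label (ignores label correlations)."""
    return [DecisionTreeClassifier().fit(X, Y[:, j]) for j in range(Y.shape[1])]

def label_powerset_fit(X, Y):
    """LP: one multi-class classifier whose classes are the observed label combinations."""
    combos = ["".join(map(str, row)) for row in Y]  # each distinct label set = one class
    return DecisionTreeClassifier().fit(X, combos)
```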

Algorithm Transformation methods adapt learning algorithms to directly handle multi-label data. Such methods include: (a) several multi-label variants of the popular k-Nearest Neighbors lazy learning algorithm, such as ML-kNN (Zhang and Zhou, 2007), as well as hybrid methods combining logistic regression and k-Nearest Neighbors (Cheng and Hüllermeier, 2009); (b) multi-label decision trees, such as ML-C4.5 (Clare and King, 2001) and predictive clustering trees (PCTs) (Vens et al., 2008); (c) AdaBoost.MH and AdaBoost.MR (Schapire and Singer, 2000), two extensions of AdaBoost for multi-label learning; (d) several neural network approaches (Crammer and Singer, 2003; Zhang and Zhou, 2006); (e) the Bayesian Networks approach by Zhang and Zhang (2010); (f) the SVM-based ranking approach by Elisseeff and Weston (2005); and (g) the associative classification approach of MMAC (Thabtah et al., 2004).

Ensemble methods are developed on top of methods of the two previous categories. The three most well-known ensemble methods employing problem transformations as their base classifiers are RAkEL (Tsoumakas et al., 2011a), ensembles of pruned sets (EPS) (Read et al., 2008) and ensembles of classifier chains (ECC) (Read et al., 2009). On the other hand, an example of an ensemble method where the base classifier is an algorithm adaptation method (i.e., provides multi-label predictions) can be found in Kocev (2011) where ensembles of predictive clustering trees (PCTs) are presented.

As far as the evaluation of multi-label classifiers is concerned, several traditional evaluation metrics can be used, provided that they are properly modified. The specific metrics employed in our current study for algorithm comparisons are Accuracy, Exact Match (Subset Accuracy) and Hamming Loss. In what follows, these metrics are defined for a dataset $D$, consisting of $N$ multi-label instances of the form $(x_i, Y_i)$, where $i = 1, \ldots, N$, $Y_i \subseteq L$, $L$ is the set of all possible labels and $h$ is a prediction function.

Accuracy is defined as the mean, over all instances, ratio of the sizes of the intersection and union sets of actual and predicted labels. It is, thus, a label-set-based metric, defined as:

$$\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap h(x_i)|}{|Y_i \cup h(x_i)|} \qquad (1)$$

Exact Match (Subset Accuracy) is a simple and relatively strict evaluation metric, calculated as the label-set-based accuracy:

$$\mathrm{ExactMatch} = \frac{|D_c|}{N} \qquad (2)$$

where $D_c \subseteq D$ is the set of correctly classified instances, for which $h(x_i) = Y_i$.

Hamming Loss corresponds to the label-based accuracy, taking into account false positive and false negative predictions, and is defined as:

$$\mathrm{HammingLoss} = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \,\triangle\, h(x_i)|}{|L|} \qquad (3)$$

where $\triangle$ denotes the symmetrical difference (logical XOR) between $Y_i$ and $h(x_i)$.
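For concreteness, the following are hedged reference implementations of Eqs. 1-3, assuming label sets represented as Python sets over a label universe of size num_labels; function and argument names are ours.

```python
# Reference implementations of the three metrics above (Eqs. 1-3).
def accuracy(true_sets, pred_sets):
    # Eq. 1: mean Jaccard similarity between actual and predicted label sets
    return sum(len(y & p) / (len(y | p) or 1)
               for y, p in zip(true_sets, pred_sets)) / len(true_sets)

def exact_match(true_sets, pred_sets):
    # Eq. 2: fraction of instances whose predicted label set is exactly correct
    return sum(y == p for y, p in zip(true_sets, pred_sets)) / len(true_sets)

def hamming_loss(true_sets, pred_sets, num_labels):
    # Eq. 3: fraction of individual label decisions (over all instances) that are wrong
    return sum(len(y ^ p) for y, p in zip(true_sets, pred_sets)) / (len(true_sets) * num_labels)
```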

The interested reader can find an extensive discussion on the merits and trade-offs of various multi-label classification methods and evaluation measures, along with the latter’s definitions, in Tsoumakas et al. (2010); Read (2010); Madjarov et al. (2012).

2.2 Learning Classifier Systems

Learning Classifier Systems (LCS) (Holland, 1975) are an evolutionary approach to supervised and reinforcement learning problems. Several flavors of LCS exist in the literature (Urbanowicz and Moore, 2009), with most of them following the “Michigan approach”, such as (a) the strength-based ZCS (Wilson, 1994; Tzima and Mitkas, 2008) and SB-XCS (Kovacs, 2002a, b); and (b) the accuracy-based XCS (Wilson, 1995) and UCS (Bernadó-Mansilla and Garrell-Guiu, 2003; Orriols-Puig and Bernadó-Mansilla, 2008). Accuracy-based systems have been the most popular so far for solving a wide range of problem types (Bull et al., 2008) – such as classification (Butz et al., 2004; Orriols-Puig et al., 2009a; Fernández et al., 2010), regression (Wilson, 2002; Butz et al., 2008; Stalph et al., 2012), sequential decision making (Butz et al., 2005; Lanzi et al., 2006), and sequence labeling (Nakata et al., 2014, 2015) – in a wide range of application domains – such as medical diagnoses (Kharbat et al., 2007), fraud detection (Behdad et al., 2012) and robot arm control (Kneissler et al., 2014).

Given that multi-label classification is a supervised task, we chose to tackle the corresponding problems using supervised (Michigan-style) LCS. Such LCS maintain a cooperative population of condition-decision rules, termed classifiers, and combine supervised learning with a genetic algorithm (GA). The GA works on classifier conditions in an effort to adequately decompose the target problem into a set of subproblems, while supervised learning evaluates classifiers in each of them (Lanzi, 2008). The most prominent example of this class of systems is the accuracy-based UCS algorithm (Bernadó-Mansilla and Garrell-Guiu, 2003; Orriols-Puig and Bernadó-Mansilla, 2008). Additionally, we have recently introduced SS-LCS, a supervised strength-based LCS that provides an efficient and robust alternative for offline classification tasks (Tzima et al., 2012; Tzima and Mitkas, 2013) by extending previous strength-based frameworks (Wilson, 1994; Kovacs, 2002a, b).

To sum up, our current investigation focuses on developing a supervised accuracy-based Michigan-style LCS for multi-label classification by extending the base architecture of UCS and incorporating the clustering-based initialization component of SS-LCS. It also builds on our research presented in Allamanis et al. (2013), from which the main differences are: (a) the multi-label crossover operator (Section 4.4); (b) the modified deletion scheme and the population control strategy (Section 4.5); (c) the clustering-based initialization process (Section 4.6); and, more importantly, (d) the extensive experimental investigation of the proposed algorithm, both in terms of target problems and rival algorithms (Section 5). The last point also addresses the main shortcoming of existing multi-label LCS approaches (Vallim et al., 2008, 2009; Ahmadi Abhari et al., 2011), namely the absence of empirical evidence on their potential for multi-label classification in real-world settings.

2.3 Rule Representation in LCS

LCS were initially designed with a ternary representation: rules involved conditions represented as fixed-length bitstrings defined over the alphabet {0, 1, #} and numeric actions. To deal with continuous attributes, often present in real-world classification problems, however, interval-based rule representations were later introduced, starting with Wilson’s min-max representation that codifies continuous attribute conditions using the lower and upper limit of the acceptable interval of values. When using this representation, invalid intervals (where the lower bound exceeds the upper) – and, thus, impossible conditions – may be produced by the genetic operators. A simple approach to fixing this problem was proposed by Stone and Bull (2003), who introduced the unordered-bound representation – the most popular representation used for continuous attributes in LCS in recent years. The unordered-bound representation uses interval limits without explicitly specifying which is the upper and which the lower bound: the smaller of the two limit values is considered the interval’s lower bound, while the larger is the upper bound. The unordered-bound representation is our representation of choice for continuous attributes in our current work.

Other than interval-based ones, several other rule representations have been introduced for LCS (mainly XCS and UCS) during the last few years. These representations aim to enable LCS to deal with function approximation (Wilson, 2002) and real-world problems, and include hyper-ellipsoidal representations (Butz et al., 2008), convex hulls (Lanzi and Wilson, 2006) and tile coding (Lanzi et al., 2006). Other, more general approaches used to codify rules are neural networks (Bull and O’Hara, 2002), messy representations (Lanzi, 1999), S-expressions (Lanzi and Perrucci, 1999), fuzzy representations (Orriols-Puig et al., 2009b), genetic programming-like encoding schemes involving code fragments in classifier conditions (Iqbal et al., 2014), and dynamical genetic programming (Preen and Bull, 2013).

3 Rules for Multi-label Classification

To tackle multi-label classification problems with rule-based methods, and thus also with LCS, we need an expressive rule format, able to capture correlations both between the feature and label space and among the various labels. In this Section, we introduce a rule format that possesses these properties and forms the basis of our proposed multi-label LCS, detailed in Section 4. In the last part of the Section, we also describe some “internal” rule representation issues.

3.1 Generalized Multi-label Rule Representation

Single-label classification rules traditionally follow the production system (or “if-then”) form $condition \Rightarrow consequent$, where the rule’s condition comprises a conjunction of tests on attribute values and its consequent contains a single value from the target classification problem’s set of possible categories (or classes). It is also worth noting that, for zero-order rules, the condition comprises tests of the form $(a_i \; op \; v_i)$, wherein $a_i$ is one of the problem’s attributes, $op$ is an operator, and $v_i$ is a constant set, number or range of numbers.

It is evident that rules following the form described above are not able to readily handle multi-label classifications. To alleviate this shortcoming, we introduce a modification to the rule consequent part, such that, for any given multi-label rule $R$, the consequent part takes the form:

$$consequent_R = (l_{j_1} = v_{j_1}) \wedge (l_{j_2} = v_{j_2}) \wedge \ldots \wedge (l_{j_k} = v_{j_k}) \qquad (4)$$

where $l_{j_m}$ is one of the problem’s possible labels ($l_{j_m} \in L$, $1 \le k \le |L|$), taking either the value $v_{j_m} = 1$ for labels advocated by rule $R$, or the value $v_{j_m} = 0$ in the opposite case.

According to Eq. 4, the consequent part of a rule following our proposed Generalized Multi-label Representation includes both the labels that the rule advocates for ($v = 1$) and the labels it is opposed to ($v = 0$). It should be noted that (i) no label can appear more than once in the rule consequent part and (ii) rules are allowed to “not care” about certain labels, which are, thus, absent from the rule consequent ($k < |L|$). In other words, the proposed rule format has the important property of being able to map rule conditions to arbitrary subspaces of the label-space.

An abbreviated notation for rule consequent parts can be derived by using the ternary alphabet {0, 1, #} and substituting “1” for advocated labels, “0” for labels the rule is opposed to and “#” for “don’t cares”. Thus, in a problem with three labels, a rule advocating the first label, being indifferent about the second and opposed to the third is denoted as: 1#0.
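As a toy illustration of this notation, the sketch below maps a ternary consequent string to the per-label decisions it encodes; the names are ours, not those of the paper’s implementation.

```python
# Decode a ternary consequent string into per-label decisions.
ADVOCATE, OPPOSE, DONT_CARE = "1", "0", "#"

def consequent_votes(consequent):
    """Map a ternary consequent to {label_index: +1/-1}, skipping '#' labels."""
    return {i: (1 if c == ADVOCATE else -1)
            for i, c in enumerate(consequent) if c != DONT_CARE}

print(consequent_votes("1#0"))  # {0: 1, 2: -1} -- the three-label example above
```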

Rules following our proposed Generalized Multi-label Representation have some unique properties. First, rules are easy to interpret, rendering the discovered knowledge (rulesets) equally usable by both humans and computers. Such a property is important in cases where providing useful insights to domain experts is amongst the modelers’ goals.

Furthermore, rules have a flexible label-correlation representation. Algorithms inducing generalized multi-label rules do not require explicit knowledge of which label correlations to search for and can variably correlate up to the maximum possible number of labels with any given condition. Therefore, in contrast to problem transformation methods that need to explicitly create (at least) one model for each possible label correlation/combination being searched for, algorithms inducing generalized multi-label rules can cover the whole spectrum between the BR (not looking into any label correlations) and LP (searching for all possible label combinations) transformations, and simultaneously create the most compact rule representation of the problem-space, with no redundancy.

Consider, for example, the (artificial) problem toyx with 6 binary attributes and 4 labels, where the first two labels only depend on the first two attributes, according to the rules of Eq. 5 (these are actually the rules, without default hierarchies, defining the artificial problem studied in (Vallim et al., 2009))

(5)

and the last two labels always have exactly the same values as the last two attributes. The shortest complete solution (SCS) (i.e., the solution containing the smallest possible number of rules that allow for specific decisions to be made for all labels of all data samples), given our generalized rule format, involves 7 rules: the 3 rules in Eq. 5, plus one of the alternative 4-rule sets (hereafter, set A) that tie each of the last two labels to the corresponding attribute.

If we do not use the generalized rule format, we are bound to induce rules with all-specific consequents that are not allowed to “don’t care” about any of the problem’s labels. This would be equivalent to the LP transformation, creating rules for each possible label combination, and would result in (at least) 12 rules for our current example – i.e., the combinations of each of the first 3 rules with each of the 4 rules in set A. (For this particular problem, we need 12 and not $2^4 = 16$ rules, as some label combinations are missing from the training dataset and, thus, no model would need to be built for them.)

3.2 Rule Representation in Chromosomes

Rules in MlS-LCS, not unlike traditional LCS, are mapped to chromosomes – consisting of 1s and 0s – to be used in the GA. Our approach universally employs an activation bit, indicating whether a test for a specific attribute’s values is active or inactive (#), irrespective of the attribute type. Thus, binary attributes and labels are represented using two bits: the first bit represents the activation status of the corresponding test and the second bit represents the target (attribute or label) value. Nominal attributes are represented by $n + 1$ bits, where $n$ is the number of the attribute’s possible values. For continuous attributes, we employ the “unordered-bound representation” (Stone and Bull, 2003), defining an acceptable range of values for an attribute through two bounds $l$ and $u$, such that $l \le u$. The two threshold values $l$ and $u$ are represented by binary numbers discretized in the range $[min_i, max_i]$, where $min_i$ ($max_i$) is the lowest (highest) possible value for attribute $i$. The number of bits used in this representation is $2b + 1$, where $b$ determines the number of quantization levels ($2^b$) and the additional bit is the activation bit.
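The sketch below decodes one continuous-attribute test from its chromosome segment under the layout just described (one activation bit followed by two b-bit quantized bounds); the value of b and the attribute range are illustrative.

```python
# A sketch of decoding a continuous-attribute test in the unordered-bound
# representation: the smaller decoded value is taken as the lower bound.
def decode_interval(bits, lo, hi, b):
    act, t1, t2 = bits[0], bits[1:1 + b], bits[1 + b:1 + 2 * b]
    if act == "0":
        return None                      # inactive test: attribute is "don't care"
    step = (hi - lo) / (2 ** b - 1)      # 2^b quantization levels across [lo, hi]
    v1, v2 = lo + int(t1, 2) * step, lo + int(t2, 2) * step
    return (min(v1, v2), max(v1, v2))    # order-free bounds

print(decode_interval("1" + "0011" + "1100", 0.0, 1.0, 4))  # (0.2, 0.8)
```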

4 LCS for Multi-label Classification

As already mentioned, the scope of our current work comprises offline multi-label classification problems – that is classification problems that can be described by a collection of data and do not involve online interactions. We tackle these problems using Michigan-style supervised LCS.

Such LCS have been successfully used for evolving rulesets in single-label classification domains. In these cases, evolved rulesets comprise cooperative rules that collectively solve the target problem, while they are also required to be maximally compact, i.e., containing the minimum number of rules that are necessary for solving the problem. Equivalently, all rules need to have maximally general conditions, that is the greatest possible feature space coverage. Additionally, a ruleset is considered an effective solution if it contains rules that are adequately correct, with respect to a specific performance/correctness metric.

While all the aforementioned properties are also desirable in generalized multi-label rulesets (i.e., rulesets comprising generalized multi-label rules, as described in Section 3.1), there is an additional important requirement. These rulesets also need to exhaustively cover the label space. In other words, rules in a multi-label ruleset should collectively be able to decide about all labels for every instance. This latter desirable property, together with the compactness requirement, indicates that multi-label rules should ideally have maximally general conditions and combine them with the corresponding maximally specific consequents.

Consider, for example, the following two rules for the toyx problem:

(6)

Both rules are perfectly accurate (for the labels for which they provide concrete decisions), but the first rule is clearly preferable, correlating the (common) condition with a larger part of the label space and, thus, promoting solution compactness.

Overall, it is evident that algorithms building rulesets for multi-label classification problems need to consider the trade-off between condition generalization, consequent specialization and rule correctness. In an LCS setting, this means that the core learning and performance procedures need to be appropriately modified to effectively cope with multi-label problems. Thus, translating the aforementioned desirable properties of multi-label rulesets into concrete design choices towards formulating our proposed multi-label LCS algorithm, we derive the following requirements for its components:

  • the Performance Component, that is responsible for using the rules developed to classify previously unseen samples, needs to be modified to enable effective inference based on (generalized) multi-label rules;

  • the Update Component, which is responsible for updating rule-specific parameters, such as accuracy and fitness, needs an appropriate metric of rule quality, taking into account that generalized multi-label rules make decisions over subsets of labels that may, additionally, be only partially correct;

  • the Discovery Component, that explores the search space and produces new rules through a steady-state GA, needs to focus on evolving multi-label rulesets that are accurate, complete and cover both the feature and label space. Subsumption conditions, controlling rule “absorption”, also need to be adapted, so as to favor rules with more general conditions and more specific consequents.

These substantial adaptations to the general LCS framework, essentially define the proposed MlS-LCS algorithm and are presented in detail in the following Sections.

4.1 The Training Cycle of the Multi-Label Supervised Learning Classifier System (MlS-LCS)

MlS-LCS employs a population of gradually evolving, cooperative classifiers (rules) that collectively form the solution to the target classification task, by each encoding a fraction (niche) of the problem domain. Associated with each classifier, there are a number of parameters:

  • the numerosity, i.e., the number of classifier copies (or microclassifiers) currently present in the ruleset;

  • the correct set size that estimates the average size of the correct sets the classifier has participated in;

  • the time step of the last occurrence of a GA in a correct set the classifier has belonged to;

  • the experience, measured as the classifier’s number of appearances in match sets (multiplied by the number of labels);

  • the effective match set appearances (msa), that is, the classifier’s experience, (possibly) reduced by a certain amount for each label that the classifier did not provide a concrete decision for (see Eq. 7);

  • the numbers of the classifier’s correct and incorrect label decisions;

  • the accuracy that estimates the probability of the classifier predicting the correct label; and

  • the fitness that is a measure of the classifier’s quality.

At each discrete time-step during training, MlS-LCS receives a data instance’s attribute values and labels $(x \,|\, Y)$ and follows a cycle of performance, update and discovery component activation (Alg. 1). The completion of a full training cycle is followed by a new cycle based on the next available input instance, provided, of course, that the algorithm’s termination conditions have not been met.

RUN_TRAINING_CYCLE

1:  (x, Y) ← read next data instance
2:  initialize empty match set [M]
3:  for each label l ∈ L do
4:     initialize empty sets [C_l] and [IC_l]
5:  [M] ← generate match set out of the population [P] using x
6:  if deletions have commenced then
7:     control match set [M]
8:  for each label l ∈ L do
9:     [C_l] ← generate label correct set out of [M] using Y
10:    [IC_l] ← generate label incorrect set out of [M] using Y
11: for each classifier cl ∈ [M] do
12:    UPDATE_FITNESS (cl)
13:    if ∃ l such that cl ∈ [C_l] then
14:       UPDATE_CS (cl)
15: for each label l ∈ L do
16:    if [C_l] is empty then
17:       cl_cov ← generate covering classifier with x and Y
18:       insert cl_cov into the population
19:    else if the average time since the last GA invocation in [C_l] exceeds the GA rate threshold then
20:       for each classifier cl in [C_l] do
21:          update cl’s GA timestamp to the current step t
22:       (ch_1, ch_2) ← apply GA on [C_l]
23:       ADD_TO_POPULATION (ch_1)
24:       ADD_TO_POPULATION (ch_2)
25: while the population size (in microclassifiers) exceeds the upper bound N do
26:    delete a rule from the population proportionally to its deletion probability (Eq. 11)

ADD_TO_POPULATION (ch)

1:  if ch has non-zero coverage then
2:     if ch is not subsumed by its parents then
3:        if ch is not subsumed by any rule in the population then
4:           insert ch into the population

Algorithm 1: MlS-LCS component activation cycle during training (at step t).

4.2 The Performance Component of MlS-LCS

Upon receipt of a data instance (x, Y), the system scans the current population of classifiers for those whose condition matches the input and forms the match set [M]. Next, for each label l, a correct set [C_l] is formed, containing the rules of [M] that correctly predict label l for the current instance (this is possible in a supervised framework, since the correct labels are directly available to the learning system). The classifiers in [M] incorrectly predicting label l are placed in the incorrect set [IC_l]. Finally, if the system is in test mode (under test mode, the population of MlS-LCS does not undergo any changes; that is, the update and discovery components are disabled), a classification decision is produced based on the labels advocated by rules in [M] ([C_l] and [IC_l] cannot be produced, since the actual labels are unknown).

However, the process of classifying new samples based on models involving multi-label rules is not straightforward. In multi-label classification, a bipartition of relevant and irrelevant labels, rather than a single class, has to be decided upon, based on some threshold. Furthermore, rulesets evolved with LCS may contain contradicting or low-accuracy rules. Therefore, a “vote and threshold” method is required to effectively classify unknown samples (Read, 2010). More specifically, an overall vote for each label is obtained by allowing each rule to cast a positive (for advocated labels) or negative (for opposed labels) vote equal to its macro-fitness. Votes are cast only for labels that a rule provides concrete decisions for. The resulting votes vector is normalized to $[0, 1]$ and a threshold $t$ is used to select the labels that will be activated (those whose normalized vote is at least $t$). Assuming that the thresholding method aims at activating at least one label, the range of reasonable thresholds is $(0, 1]$.
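A minimal sketch of this inference step follows; the min-max normalization of the vote vector and the rule object layout are our assumptions.

```python
# "Vote and threshold" inference: each matching rule casts a signed vote equal
# to its macro-fitness (fitness x numerosity) for every label it decides on.
def predict_labels(match_set, num_labels, t):
    votes = [0.0] * num_labels
    for rule in match_set:
        for lbl, sign in rule.votes.items():        # rule.votes: {label: +1 or -1}
            votes[lbl] += sign * rule.fitness * rule.numerosity
    v_min, v_max = min(votes), max(votes)
    span = (v_max - v_min) or 1.0                   # min-max normalize to [0, 1]
    norm = [(v - v_min) / span for v in votes]
    return {lbl for lbl, v in enumerate(norm) if v >= t}
```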

In our current work, we experimented with two threshold selection methods (Yang, 2001; Read, 2010), namely Internal Validation (Ival) and Proportional Cut (Pcut).

Internal Validation (Ival) selects, given the ruleset, the threshold value that maximizes a performance metric (such as accuracy), based on consecutive internal tests. It can produce good thresholds at a (usually) large computational cost, as the process of validating each threshold value against the training dataset is time-consuming. Its complexity, however, can be significantly improved by exploiting the fact that most metric functions are convex with respect to the threshold.

Proportional Cut (Pcut) selects the threshold value that minimizes the difference in label cardinality (i.e., the mean number of labels that are activated per sample) between the training data and any other given dataset. This is achieved by minimizing the following error with respect to $t$:

$$E(t) = \left| \, LCard(D_{train}) - LCard(f_t(D)) \, \right|$$

where $D_{train}$ is the training dataset, $f_t$ is the threshold function and $D$ is the dataset with respect to which we tune the threshold $t$. It is worth noting that, in our case, it always holds that $D = D_{train}$. Tuning the threshold with respect to the test dataset would imply an a priori knowledge of label structure in unlabeled samples and would, thus, result in biased evaluations and, possibly, a wrong choice of models to be used for post-training predictions. The Pcut method, although not tailored to any particular evaluation measure, calibrates thresholds as effectively as Ival, at a fraction of the computational cost, and is, thus, considered a method suitable for general use in experimental evaluations (Read, 2010).
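A sketch of Pcut under these definitions, with the candidate threshold grid being our choice, might look as follows (vote_vectors holds one normalized vote vector per instance of the tuning dataset):

```python
# Pcut: pick the threshold whose induced label cardinality on the tuning data
# best matches the training set's label cardinality.
def pcut_threshold(train_cardinality, vote_vectors):
    grid = [i / 100 for i in range(1, 101)]          # candidate thresholds 0.01 .. 1.0
    def cardinality(t):
        return sum(sum(v >= t for v in votes) for votes in vote_vectors) / len(vote_vectors)
    return min(grid, key=lambda t: abs(train_cardinality - cardinality(t)))
```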

Employing each rule’s fitness as its confidence level, it is possible to predict the labels of new (unknown) data samples by using only the fittest rule of those matching each sample’s attribute values. Of course, in case the fittest rule “does not care” for some of the labels, additional rules (sequentially, from a list of matching rules sorted by fitness) can be employed to provide a complete decision vector with specific values for all possible labels. The above described strategy, named Best Rule Selection (Best), has also been included in our experiments, since it is the one yielding the most compact, in terms of number of rules, prediction models.

4.3 The Update Component of MlS-LCS

In training or explore mode, each classification of a data instance is associated with an update of the matching classifiers’ parameters. More specifically:

  • for all classifiers in match set [M], their experience is increased by one and their msa value is updated, based on whether they provide a concrete decision;

  • for all classifiers belonging to at least one correct set [C_l], their correct set size is updated, so that it estimates the average size of all correct sets the classifier has participated in so far; and

  • all classifiers in match set [M] have their fitness updated.

The specific update strategies for fitness and correct set size are presented in Alg. 2.

UPDATE_FITNESS (cl)

1:  for each label l ∈ L do
2:     cor_l ← correctness of cl for label l (1 if cl ∈ [C_l], 0 if cl ∈ [IC_l], c_# otherwise)
3:     correct_cl ← correct_cl + cor_l
4:     msa_cl ← msa_cl + msa_l(cl) (Eq. 7)
5:  acc_cl ← correct_cl / msa_cl
6:  F_cl ← fitness computed from acc_cl

UPDATE_CS (cl)

1:  cs_min ← size of the smallest label correct set [C_l] that cl participates in
2:  cs_cl ← updated estimate of the average correct set size, based on cs_min

Algorithm 2: Rule fitness and correct set size (cs) update for MlS-LCS.

Fitness calculation in MlS-LCS is based on a supervised approach that involves computing the accuracy (acc) of classifiers as the percentage of their correct classifications (line 5 of Alg. 2). Moreover, motivated by the need to distinguish between rules that provide concrete decisions (positive or negative) about labels and those whose decisions are “indifferent”, we introduce the notion of label correctness: the correctness value of a rule $R$ for a label $l$ (with respect to a specific training cycle and, thus, specific [C_l] and [IC_l] sets) is 1 if $R \in [C_l]$, 0 if $R \in [IC_l]$, and an intermediate credit $c_\# \in [0, 1]$ for rules not deciding on $l$ for the current instance (i.e., for matching rules neither in [C_l] nor in [IC_l]).

Accordingly, the match set appearance credit (msa) that a rule $R$ obtains for a label $l$, during a specific training cycle, is differentiated depending on whether $R$ provides a concrete decision or not, according to Eq. 7, where $0 \le m \le 1$:

$$msa_l(R) = \begin{cases} 1, & \text{if } R \in [C_l] \cup [IC_l] \\ m, & \text{otherwise} \end{cases} \qquad (7)$$

In our current work, we explore a version of MlS-LCS that slightly penalizes “indifferent” rules by considering #’s as partial matches (i.e., by setting $c_\# < 1$). The reasons that led us to choose the specific parameter values are detailed in Section 5.1. For now, though, let us again consider the simple example of the toyx problem and the rules of Eq. 6. Supposing that both rules have not encountered any instances so far, when the system processes an instance matched (and correctly labeled) by both, their acc values will become 1 and 0.9, respectively. This means that the first rule’s fitness will be greater than the second’s when they compete in the GA selection phase for the first label and, thus, the system will have successfully applied the desired pressure towards maximally specific consequents.

Finally, as far as the update of rule overall correct-set size is concerned, we have chosen a rather strict estimation, employing the size of the smallest label correct set that the rule participates in. This choice is motivated by the need to exert fitness pressure in the population towards complete label-space coverage. This is, in our case, achieved by rewarding rules that explicitly advocate for or against “unpopular” labels.
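A minimal sketch of this per-label update follows, assuming the reconstruction of Eq. 7 above; the rule object layout and the default values of c_hash ($c_\#$) and m are ours, chosen only so that the toyx example yields accuracies 1 and 0.9, and the fitness exponent follows the usual UCS-style power scaling rather than a value stated in the text.

```python
# Per-cycle update of a rule's correct count and effective match set
# appearances (msa); '#' labels earn partial credit c_hash and partial msa m.
def update_rule(rule, instance_labels, c_hash=0.6, m=1.0, nu=10):
    for lbl, is_relevant in enumerate(instance_labels):  # instance_labels: sequence of bools
        if lbl in rule.votes:                            # concrete decision (+1 or -1)
            rule.msa += 1.0
            advocated = rule.votes[lbl] > 0
            rule.correct += 1.0 if advocated == is_relevant else 0.0
        else:                                            # '#': partial match
            rule.msa += m
            rule.correct += c_hash
    rule.acc = rule.correct / rule.msa
    rule.fitness = rule.acc ** nu                        # assumed UCS-style fitness scaling
```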

4.4 The Discovery Component of MlS-LCS

MlS-LCS employs two rule discovery mechanisms: a covering operator and a steady-state niche genetic algorithm.

The covering operator is adapted from the one introduced in XCS (Wilson, 1995) and later used in UCS (Bernadó-Mansilla and Garrell-Guiu, 2003; Orriols-Puig and Bernadó-Mansilla, 2008) and most of their derivatives. It is activated only during training and introduces new rules to the population when the system encounters an empty correct set [C_l] for a label l. Covering produces a single random rule with a condition matching the current input instance’s attribute values, generalized with a given probability per attribute. While this process is identical to the one employed in single-class LCS, it is followed by an additional generalization process applied to the rule consequent, which is essential to evolving generalized multi-label rules. All labels in the newly created rule’s consequent are set to 0 or 1 according to the current input and then generalized (converted to #) with a given probability per label, except for the current label l, which remains specific.

The genetic algorithm is applied iteratively on all correct sets and invoked at a rate controlled by a (minimum) threshold on the average time since the last GA invocation of classifiers in [C_l] (Bernadó-Mansilla and Garrell-Guiu, 2003). The evolutionary process employs experience-discounted fitness-proportionate parent selection, with the selection probability assigned to each classifier cl being calculated according to:

$$p_{sel}(cl) = \frac{F'_{cl}}{\sum_{cl_j \in [C_l]} F'_{cl_j}} \qquad (8)$$

where

$$F'_{cl} = \begin{cases} F_{cl}, & \text{if } exp_{cl} \ge \theta_{exp} \\ F_{cl} \cdot \dfrac{exp_{cl}}{\theta_{exp}}, & \text{otherwise} \end{cases} \qquad (9)$$

and $\theta_{exp}$ is the experience threshold for fitness discounting. After their selection, the two parent classifiers are copied to form two offspring, on which the multi-label crossover operator and a uniform mutation operator are applied with given probabilities.

The multi-label crossover operator is introduced in this work and intended for use specifically in multi-label classification settings. Its design was motivated by the fact that, for the majority of datasets employed in our current work, the number of attributes is significantly larger than the number of labels (by at least one order of magnitude). This means that, using a single-point crossover operator, the probability that the crossover point would end up in the attribute space is significantly greater than that of it residing in the label space. Therefore, there would be a significantly greater probability of transferring the whole consequent part from the parents to their corresponding offspring than that of transferring the decisions for only a subset of labels.

Actually, allowing the transfer of the whole set of decisions, as a policy for any given crossover occurring on a correct set [C_l], would be a questionable choice: the fact that any two rules selected to be parents coincide in label l does not necessarily mean that they would coincide in any other label l' ≠ l. Keeping this observation in mind, we designed the multi-label crossover operator with the aim of exerting more pressure towards accurate decisions per label. The newly proposed operator achieves that by not transferring any decisions from the selected parents to their corresponding offspring other than that about the current label, i.e., the label corresponding to the correct set from which the parents were selected.

More specifically, the crossover point is selected pseudo-randomly from the range $[0, \sigma]$, where:

$$\sigma = s - 2\,(|L| - 1) \qquad (10)$$

and $s$ is the classifier size in bits. This means that the multi-label crossover operator takes into account the rule’s condition part and only one (instead of all $|L|$) of its labels: the label for which the current correct set (on which the GA is applied) has been formed. If the crossover point happens to be in the range $[0, \sigma - 2]$, that is, in the condition part of the rule’s chromosome, the two parent classifiers swap (a) their condition parts beyond the crossover point and (b) their decision for the current label from their consequent parts. Otherwise, that is, when the crossover point happens to correspond to (any of the two bits representing) the current label, the two parent classifiers only swap their decision for the label being considered and no part of their conditions.
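A sketch of the operator under the chromosome layout of Section 3.2 (condition bits followed by 2 bits per label) follows; parents are passed as bitstrings, and the layout details are our assumptions.

```python
# Multi-label crossover: only the current label's 2-bit decision may travel
# between parents, alongside (optionally) the condition tail.
import random

def multilabel_crossover(p1, p2, cond_len, current_label):
    a = cond_len + 2 * current_label                     # start of current label's 2 bits
    point = random.randint(0, cond_len + 2)              # condition part + current label only
    c1, c2 = list(p1), list(p2)
    if point <= cond_len:                                # cut falls inside the condition part
        c1[point:cond_len], c2[point:cond_len] = c2[point:cond_len], c1[point:cond_len]
    # in either case, exchange the decision for the current label only
    c1[a:a + 2], c2[a:a + 2] = c2[a:a + 2], c1[a:a + 2]
    return "".join(c1), "".join(c2)
```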

Returning to the GA-based rule generation process, after the crossover and mutation operators have been applied, MlS-LCS checks every offspring as per its ability to codify a part of the problem at hand. Given the supervised setting of multi-label classification, this is equivalent to checking that each rule covers at least one instance of the training dataset. The presence of rules in the population that fail to cover at least one instance, termed zero-coverage rules (coverage is defined as the number of data instances a rule matches), is unnecessary to the system. Also, depending on the completeness degree of the problem, their presence may hinder performance by lengthening the training time, while the rate at which the GA produces zero-coverage rules is uncontrollable. Therefore, to avoid these problems, MlS-LCS removes zero-coverage rules just after their creation by the discovery component, ensuring that the population contains no such rules.

Even after this step, the non-zero-coverage offspring are not directly inserted into the classifier population. They are gathered into a pool, until the GA has been applied to all label correct sets. Once the rule generation process has been completed for the current training cycle, and before their insertion into the classifier population, all rules in the offspring pool are checked for subsumption (a) against each of their parents and (b) in case no parent subsumes them, against the whole rule population. If a classifier (parent or not) is found to subsume the offspring being checked, the latter is not introduced into the population, but the numerosity of the subsuming classifier is incremented by one instead. Subsumption conditions require that the subsuming classifier is sufficiently experienced, accurate and more general than the offspring being checked (with the experience and accuracy thresholds being user-defined parameters of the system). Additionally, the generality condition is extended for the multi-label case, such that a classifier can only subsume another if its condition part is equally or more general and its consequent part is equally or more specific than those, respectively, of the classifier being subsumed.

4.5 Population control strategies employed in MlS-LCS

The system maintains an upper bound N on the population size (at the microclassifier level) by employing a deletion mechanism, according to which a rule is selected for deletion with probability proportional to its deletion vote $d_{cl}$:

$$p_{del}(cl) = \frac{d_{cl}}{\sum_{cl_j \in [P]} d_{cl_j}} \qquad (11)$$

where

$$d_{cl} = \begin{cases} cs_{cl} \cdot num_{cl} \cdot \dfrac{\bar{F}}{F_{cl}/num_{cl}}, & \text{if } exp_{cl} > \theta_{del} \text{ and } \dfrac{F_{cl}}{num_{cl}} < \delta \bar{F} \\ cs_{cl} \cdot num_{cl}, & \text{otherwise} \end{cases}$$

$\bar{F}$ is the mean fitness of the rules in the population, $\delta$ is a fitness-fraction parameter, and $\theta_{del}$ is a user-defined experience threshold.
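A sketch of the deletion vote behind Eq. 11, as reconstructed above; the default values of the experience threshold and fitness fraction are standard LCS settings assumed for illustration, not values stated in the text.

```python
# Deletion vote: proportional to cs * numerosity, inflated for experienced
# rules whose per-copy fitness falls below a fraction delta of the mean.
def deletion_vote(rule, mean_fitness, theta_del=20, delta=0.1):
    vote = rule.cs * rule.numerosity                     # correct set size x numerosity
    per_copy = rule.fitness / rule.numerosity
    if rule.exp > theta_del and per_copy < delta * mean_fitness:
        vote *= mean_fitness / per_copy                  # penalize weak, experienced rules
    return vote
```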

In addition to the deletion mechanism that is present in most LCS implementations, in MlS-LCS we introduce a new population control strategy that aims to increase the mean coverage of instances by the rules in the population. This strategy corresponds to the “control match set [M]” step (line 7 of Alg. 1) in the overall training cycle of MlS-LCS and is based on the following observations:

  • Given a set of rules, such as the match set [M], the rules it comprises lie on different coverage levels. This means that rules cover different numbers of dataset instances, depending on the degree of generalization that the LCS has achieved.

  • A given coverage level in [M] (a subset of rules in [M] whose members cover the same number of instances) holds rules of various fitnesses.

  • If there are two or more rules in the lowest coverage level in [M], the rule cl whose fitness is the lowest among them is not necessary in [M]. That is because there exist more rules that cover the instance from which cl was generated and are, in addition, more fit overall, classifying instances more accurately. The rule cl may still be of use, if it is the sole rule in the population covering some instance. However, in the general case, cl can be removed from the population without any considerable loss of accuracy for the system.

The invocation condition for the match set control strategy (line 6 of Alg. 1) means that the corresponding deletion mechanism will only be activated after the population has reached its upper numerosity boundary for the first time. Thus, “regular” deletions from the population and deletions of low-coverage rules from the match set [M] are two processes (typically) applied simultaneously in the system. Using the above invocation condition accomplishes two objectives: (i) it prevents, during the first training iterations, the deletion of fit and specific rules that could pass their ‘useful’ genes on to the next generation; and (ii) it prevents (to a certain degree) the deletion of rules that coexist with others in the lowest coverage level of a specific match set but are unique in another.

Finally, as far as the computational cost of implementing population control is concerned, it is worth noting that it is negligible, as the coverage value for each rule is gradually determined through a full pass of the training dataset and is used only after that point (from which point on, it does not change).
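A minimal sketch of the match set control step (line 7 of Alg. 1) under these observations; the safeguard for rules that are the sole cover of some instance is omitted for brevity, and the rule object layout is our assumption.

```python
# Match set control: among the rules at the lowest coverage level of [M],
# delete the least fit one (if that level holds two or more rules).
def control_match_set(match_set, population):
    if not match_set:
        return
    lowest = min(rule.coverage for rule in match_set)
    level = [rule for rule in match_set if rule.coverage == lowest]
    if len(level) >= 2:
        victim = min(level, key=lambda r: r.fitness)
        match_set.remove(victim)
        population.remove(victim)
```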

4.6 Clustering-based initialization of MlS-LCS

MlS-LCS also employs a population initialization method that extracts information about the structure of the studied problems through a pre-training clustering phase and exploits this information by transforming it into rules suitable for the initialization of the learning process. The employed method is a generalization for the multi-label case of the clustering-based initialization process presented in Tzima et al. (2012) that has been shown to boost LCS performance, both in terms of predictive accuracy and the final evolved ruleset’s size, in supervised single-label classification problems.

Simply put, the clustering-based initialization method of MlS-LCS detects a representative set of “data points”, termed centroids, from the target multi-label dataset D and transforms them into rules for the initialization of the LCS rule population prior to training.

More specifically, the dataset D is partitioned into subsets, one per discrete label combination present in D. Each subset consists of the instances whose label combination matches the corresponding discrete label combination. For each partition:

  • The instances belonging to the partition are grouped into a number of clusters that depends on the number of instances in the partition and a user-defined parameter.

  • For each cluster identified in the previous step, its centroid is found employing a clustering algorithm (in our case, the k-means algorithm). Then, a new rule is created whose condition part matches the centroid’s attribute values (more details on this procedure can be found in Tzima et al. (2012)), while the decision part is set to the discrete label combination associated with the current partition. The centroid-to-rule transformation process also includes a generalization step (similar to the one used by the covering operator): some of the newly created rule’s conditions and decisions are generalized (converted to “don’t cares”), taking into account the attribute and label generalization probabilities defined by the user for clustering.

Finally, all rules created by clustering the training dataset are merged to create the ruleset used to initialize the learning process.
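A sketch of this initialization, assuming scikit-learn’s KMeans as the k-means implementation and an instances-per-cluster parameter of our naming; the generalization step is omitted for brevity.

```python
# Clustering-based initialization: partition by label combination, cluster
# each partition, and turn each centroid into a seed rule.
import numpy as np
from sklearn.cluster import KMeans

def clustering_init(X, Y, instances_per_cluster=20):
    seed_rules = []
    for combo in np.unique(Y, axis=0):                   # one partition per label combination
        part = X[(Y == combo).all(axis=1)]
        k = max(1, len(part) // instances_per_cluster)   # clusters for this partition
        centroids = KMeans(n_clusters=k, n_init=10).fit(part).cluster_centers_
        seed_rules += [(c, combo) for c in centroids]    # (condition seed, consequent)
    return seed_rules
```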

In our current work, we chose not to experiment with tuning the clustering-based initialization process parameters and used the same fixed values for all reported experiments.

5 Experimental Validation of MlS-LCS

In this Section, we present an experimental evaluation of our proposed multi-label LCS approach. (The Java source code of our implementation of MlS-LCS, used throughout all reported experiments, is publicly available at: https://github.com/fanioula/mlslcs.) We first provide a brief analysis of MlS-LCS’s behavior on two artificial datasets and then compare its performance to that of 6 other state-of-the-art methods on 7 real-world datasets.

5.1 Experiments on artificial datasets

Since the focus of our current work is on multi-label classification, we begin our analysis with two artificial problems, named toyx and mlposition respectively, that we consider representative of a wide class of problems from our target real-world domain, in terms of the label correlations involved. The toyx problem has already been described in Section 3.1. The mlpositionN (here, N=4) problem has N binary attributes and N labels. In every data sample, only one label is active, that is, the label corresponding to the most significant bit of the binary number formed by the sample’s attributes. It is evident that, in this case, there is great imbalance among the labels, since the label for the least significant position is only activated once, while the label for the most significant position is activated in $2^{N-1}$ (i.e., half of all) instances. The shortest complete solution of the problem involves exactly N+1 rules, with different degrees of generalization in their condition parts, but no generalizations in their consequent parts. Specifically, for the mlposition problem the shortest complete solution (SCS) includes the following rules (reconstructed from the problem definition):

1### ⇒ 1000
01## ⇒ 0100
001# ⇒ 0010
0001 ⇒ 0001
0000 ⇒ 0000
Overall, one can easily observe that toyx is a problem where two of the labels are only dependent on attribute values and independent of other labels, while mlposition involves labels that are completely dependent on each other (in fact, they are mutually exclusive). Most (non-trivial) real-world problems will be a “mixture” of these two cases (i.e., will involve a mixture of uncorrelated and correlated labels), so our intention is to tune the system to perform as well as possible for both artificial problems. In the current paper, we focus our study on the fitness update process (see Section 4.3) and, more specifically, the choice of the partial-match correctness credit $c_\#$, given that we consider “don’t cares” as partial matches.

Regarding performance metrics, the percentage of the shortest complete solution discovered (%[SCS]) was selected as an appropriate performance metric, indicative of the progress of genetic search. Along with the %[SCS], we also report the multi-label accuracy (Eq. 1) achieved by the system throughout learning and the average number of rules in the final models evolved. All reported results are averaged over 30 runs (per problem and parameter setting) with different random seeds.

For all experiments and both problems, we kept the majority of parameter settings fixed, using a typical setup consistent with those reported in the literature of single-label LCS; the population size, the number of training iterations, the GA invocation rate and the attribute generalization probability were likewise held fixed. The only parameter varied between the two problems was the label generalization probability, which was set to a different value for each of toyx and mlposition.

(a) %[SCS] achieved by MlS-LCS in toyx
(b) Accuracy achieved by MlS-LCS in toyx
Figure 1: Percentage of the shortest complete solution (%[SCS]) and multi-label accuracy achieved throughout the learning process for the toyx problem. All curves are averages over thirty runs.
(a) %[SCS] achieved by MlS-LCS in mlposition
(b) Accuracy achieved by MlS-LCS in mlposition
Figure 2: Percentage of the shortest complete solution (%[SCS]) and multi-label accuracy achieved throughout the learning process for the mlposition problem. All curves are averages over thirty runs.

The results of our experiments, depicted in Figures 1 and 2, reveal that the value of the $c_\#$ parameter affects both the accuracy and the quality (in terms of number of rules) of the evolved solutions. This is especially evident in the toyx problem, whose SCS contains a complex trade-off of feature-space generality and label-space specificity: low values of $c_\#$ result in over-penalizing label-space indifferences and exerting pressure for highly specific consequents, thus also adversely affecting the system’s accuracy. The same pressure towards consequent specificity proves beneficial in the mlposition problem, due to the nature of the problem’s SCS that comprises rules providing specific decisions for all labels.

Given our goal to optimize system performance for both problems and the importance of the accuracy metric in real-world applications, choosing an intermediate value of $c_\#$ is a good trade-off. For this value and the toyx problem, the system discovers a large percentage of the SCS on average and achieves a correspondingly high accuracy. For the mlposition problem (and the same $c_\#$ value), the system likewise discovers most of the SCS on average and achieves high accuracy.

As far as the size of the final rulesets is concerned (again for the same $c_\#$ value), after applying a simple ruleset compaction strategy (i.e., ordering the rules by their macro-fitness and keeping only the top rules necessary to fully cover the training dataset and provide specific decisions for all labels), we get models somewhat larger than the corresponding SCSs for toyx and mlposition. This points to aspects of the rule evaluation process that need to be further investigated, since, for the chosen value of $c_\#$, some of the SCS rules are present, but not prevalent enough in the final rulesets for the compacted solutions to be of the optimal size.

5.2 Experimental setup for real-world problems

The benchmark datasets employed in this set of experiments are listed in Table 1, along with their associated descriptive statistics and application domain. The datasets are ordered by complexity (defined in Table 1), while Label Cardinality (LCA) is the average number of labels relevant to each instance. We strived to include a considerable variety and scale of multi-label datasets. In total, we used 7 datasets, with dimensions ranging from 6 to 174 labels, and from less than 200 to almost 44,000 examples. All of the datasets are readily available from the Mulan web-site (http://mulan.sourceforge.net/datasets.html).

dataset    examples  attributes  labels  DIST  DENS   LCA    domain   complexity
flags        194     9C+10N        7       54  0.485   3.39  images   2.58E+04
emotions     593     72N           6       27  0.311   1.87  music    2.56E+05
genbase      662     1186C        27       32  0.046   1.25  biology  2.00E+06
scene       2407     294N          6       14  0.179   1.07  images   4.25E+06
CAL500       502     68N         174      502  0.150  26.04  music    5.94E+06
enron       1702     1001C        53      753  0.064   3.38  text     9.03E+07
mediamill  43907     120N        101     6555  0.043   4.38  video    5.32E+08
Table 1: Benchmark datasets, along with their application domain and statistics: number of examples, number of nominal (C) or numeric (N) attributes, number of labels, number of distinct label combinations (DIST), label density (DENS) and label cardinality (LCA). Datasets are ordered by complexity, defined as the product of the number of examples, attributes and labels.
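For reference, the statistics reported in Table 1 can be computed directly from a binary label matrix. The snippet below (function and variable names are ours, chosen for illustration) derives label cardinality, label density, the number of distinct label sets, and the complexity measure used to order the datasets.

```python
def dataset_stats(Y, num_attributes):
    """Y: list of binary label vectors, one per example."""
    n_examples, n_labels = len(Y), len(Y[0])
    lca = sum(sum(row) for row in Y) / n_examples   # label cardinality
    dens = lca / n_labels                           # label density
    dist = len({tuple(row) for row in Y})           # distinct label sets
    complexity = n_examples * num_attributes * n_labels
    return lca, dens, dist, complexity

# Toy check: two examples, three labels, ten attributes.
print(dataset_stats([[1, 0, 1], [0, 1, 0]], num_attributes=10))
```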

Evaluation is done in the form of ten-fold cross-validation for the four smallest datasets (the specific splits into folds, along with the detailed results of the rival algorithms' parameter tuning phase, are available at http://issel.ee.auth.gr/software-algorithms/mlslcs/). For the enron, CAL500 and mediamill datasets, a train/test split (provided on the Mulan website) is used instead, since cross-validation is too time- and/or memory-intensive for some methods: some of the rival algorithms' runs could not be completed even on a machine with 64GB of RAM.

The rival algorithms against which the proposed MlS-LCS algorithm is compared are HOMER, RAkEL, ECC, CC, MlkNN and BR-J48. For all algorithms, except ECC and CC, their implementations provided by the Mulan Library for Multi-label Learning (Tsoumakas et al., 2011b) were used, while for ECC and CC we used the MEKA environment (http://meka.sourceforge.net/).

As far as the parameter setup of the algorithms is concerned, in general, we followed the recommendations from the literature, combined with a modest parameter tuning phase, where appropriate. More specifically:

  • BR refers to a simple binary relevance transformation of each problem using the C4.5 algorithm (WEKA's (Witten and Frank, 2005) J48 implementation) and serves as our baseline.

  • For HOMER, Support Vector Machines (SVMs) are used as the internal classifier (WEKA’s SMO implementation). For the number of clusters, five different values (2-6) are considered and the best result is reported.

  • We experiment with three versions of RAkEL and report the best result: (a) the default setup (in terms of subset size and number of models) with C4.5 (WEKA's J48) as the baseline classifier, (b) the “extended setup”, with a subset size equal to half the number of labels, and C4.5 (WEKA's J48 implementation) as the baseline classifier, and (c) the “extended setup” with SVMs (WEKA's SMO implementation) as the baseline classifier.

  • ECC and CC are used with SVMs (WEKA's SMO implementation) as the baseline classifier, while the number of models for ECC is set to 10, as proposed by the algorithm's authors (Read et al., 2009).

  • Finally, the number of neighbors for the MlkNN method is determined by testing the values 6 through 20 (with step 2) and selecting the best result per dataset.

Where not stated otherwise, default parameters were used.

For MlS-LCS, we kept the majority of parameters fixed through all experiments, using the typical setup reported for the artificial problem experiments. The parameters varied were the population size, the number of iterations, the GA invocation rate and the two generalization probabilities (attribute- and label-level). The choice of specific parameter values (Table 2) was based on an iterative process that involved starting with default values for all parameters and tuning one parameter at a time, according to the following steps:

  1. the label generalization probability was set to either 0.1 or 0.01, depending on the resulting model's performance on the training dataset;

  2. for the attribute generalization probability, the values 0.33, 0.4, 0.8, 0.9, and 0.99 were iteratively tested and the one leading to the greatest coverage of the training dataset's instances was selected;

  3. the GA invocation rate was selected between the values 300 and 2000, based on which of the two led to a faster suppression of the covering process;

  4. the population size was selected among the values 1000, 2000, 9000, 12000, and 25000, based on the resulting model's performance on the training dataset;

  5. evolved models were evaluated at regular intervals and training stopped when their performance on the test dataset (with respect to the accuracy metric) exceeded that of the baseline BR approach.

During the tuning process, the parameter values selected in each step were used (and kept constant) in all subsequent steps.

Dataset    Iterations  Population  GA invocation rate  Gen. prob. (attr.)  Gen. prob. (labels)
flags           500        1000           2000               0.33                0.01
emotions        500        5000           2000               0.8                 0.01
genbase         500       12000           2000               0.4                 0.10
scene          2500        9000            300               0.99                0.10
CAL500          200        1000           2000               0.9                 0.10
enron           600       25000           2000               0.99                0.10
mediamill        10        1000           2000               0.9                 0.10
Table 2: MlS-LCS parameters for the benchmark datasets: number of iterations, population size, GA invocation rate, and attribute- and label-level generalization probabilities.

It is also worth noting that, when using Ival for MlS-LCS, the corresponding thresholds were calibrated based on the (multi-label) accuracy metric, as in RAkEL.

Regarding the statistical significance of the measured differences in algorithm performance, we employ the procedure suggested by Demšar (2006) for robustly comparing classifiers across multiple datasets. This procedure involves the use of the Friedman test to establish the significance of the differences between classifier ranks and, potentially, a post-hoc test to compare classifiers to each other. In our case, where the goal is to compare the performance of all algorithms to each other, the Nemenyi test was selected as the appropriate post-hoc test.
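As a concrete illustration of this procedure, the snippet below runs the Friedman test on a matrix of per-dataset scores and computes the Nemenyi critical difference CD = q_alpha * sqrt(k(k+1)/(6N)) from Demšar (2006); the score matrix is a placeholder, not our experimental results.

```python
import math
from scipy.stats import friedmanchisquare, rankdata

# Hypothetical accuracy scores: rows = datasets, columns = algorithms.
scores = [
    [0.55, 0.60, 0.58, 0.57],
    [0.70, 0.72, 0.69, 0.71],
    [0.48, 0.50, 0.47, 0.49],
    [0.62, 0.66, 0.61, 0.63],
    [0.30, 0.35, 0.33, 0.31],
    [0.44, 0.47, 0.45, 0.46],
    [0.58, 0.59, 0.57, 0.60],
]
stat, p = friedmanchisquare(*zip(*scores))  # one sample per algorithm
print(f"Friedman chi2 = {stat:.3f}, p = {p:.3f}")

# Nemenyi critical difference (Demšar, 2006).
k, n = len(scores[0]), len(scores)
q_005 = 2.569  # tabulated q value for k = 4 algorithms at alpha = 0.05
cd = q_005 * math.sqrt(k * (k + 1) / (6 * n))

# Average rank per algorithm (rank 1 = best, i.e., highest score).
avg_ranks = [sum(col) / n for col in
             zip(*(rankdata([-s for s in row]) for row in scores))]
print(f"CD = {cd:.3f}, average ranks = {avg_ranks}")
```

Two algorithms are then declared significantly different whenever their average ranks differ by more than the computed CD.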

5.3 Comparative Analysis of Results

Table 3 summarizes the results for the MlS-LCS algorithm for all inference methods (see Section 4.2), namely Proportional Cut (Pcut), Internal Validation (Ival) and Best Classifier Selection (Best), all three evaluation metrics (multi-label accuracy, exact match and Hamming loss) and all datasets used in this study. All values are reported on a % scale, and the results for the three evaluation metrics for each inference method refer to the same experiment per dataset.

Table 3: Evaluation results (%) for all inference methods employed by the MlS-LCS algorithm (Pcut, Ival and Best) and all metrics used in algorithm comparisons (accuracy, exact match and Hamming loss), across the seven benchmark datasets (flags, emotions, genbase, scene, CAL500, enron and mediamill). The best value per problem-metric pair is marked in bold.

Inspecting the obtained results, one can easily conclude that, while no inference method is clearly dominant, Ival seems to yield the best results overall. It is also worth noting that the Best method outperforms the other two inference methods on 2 out of the 7 studied datasets, although it involves a considerably smaller number of rules in its final models. Especially in the case of the CAL500 dataset, the use of the full evolved ruleset (thresholded through Pcut or Ival) seems to be particularly harmful to system performance. This indicates a problem with either the evolution of rules or the threshold selection procedures that needs to be further investigated in the future.
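To ground the thresholding discussion, the sketch below implements a proportional-cut style rule in the spirit of Yang (2001): the decision threshold is chosen so that the average number of predicted labels best matches a target label cardinality. This is a generic reconstruction under that assumption, not necessarily MlS-LCS's exact Pcut procedure.

```python
def proportional_cut(score_rows, target_cardinality, grid=None):
    """Pick the threshold whose induced average number of predicted
    labels per instance is closest to the target label cardinality
    (typically the LCA observed on the training set)."""
    grid = grid or [i / 100 for i in range(1, 100)]
    best_t, best_gap = None, float("inf")
    for t in grid:
        avg = sum(sum(s >= t for s in row)
                  for row in score_rows) / len(score_rows)
        gap = abs(avg - target_cardinality)
        if gap < best_gap:
            best_t, best_gap = t, gap
    return best_t

# Hypothetical per-label confidence scores for three instances.
scores = [[0.9, 0.4, 0.2], [0.7, 0.6, 0.1], [0.3, 0.8, 0.5]]
print(proportional_cut(scores, target_cardinality=1.5))
```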

In general, results with the Best method are acceptable and close to those of the other inference methods. Thus, the considerably smaller rulesets involved in Best models can be considered an effective summary of the target problem's solution, to be used for descriptive purposes. The need for such a “description” is especially evident in real-world classification problems, where the desired solution must be interpretable by human experts and/or decision makers.

Considering the experiment that corresponds to the inference method with the best accuracy value for our proposed MlS-LCS algorithm, Tables 4(a)–4(c) summarize the results of comparing it with its rival learning techniques. Achieved values (%) for the three evaluation metrics (multi-label accuracy, exact match and Hamming loss) and all datasets used in this study are reported. In Table 4(a), along with the accuracy rates, we also report each algorithm's overall average rank (row labeled “Av. Rank”) and its position in the final ranking (row labeled “Final Pos.”). Accordingly, Tables 4(b) and 4(c), respectively, report the values for the exact match and the Hamming loss metrics, along with the corresponding rankings.

Table 4: Algorithm comparison based on multiple evaluation metrics: (a) accuracy, (b) exact match and (c) Hamming loss. Each subtable reports the achieved values (%) of HOMER, RAkEL, ECC, CC, MlkNN, BR and MlS-LCS on the seven benchmark datasets (flags, emotions, genbase, scene, CAL500, enron and mediamill). Superscripts denote algorithm ranks (per problem) according to the Friedman test; the row labeled “Av. Rank” reports each method's average rank, while the row labeled “Final Pos.” holds its position in the (overall) final ranking.

Based on the accuracy results, the average rank provides a clear indication of the studied algorithms' relative performance: MlS-LCS ranks second after RAkEL and outperforms all its rivals in 3 out of the 7 studied problems, including the relatively high-complexity CAL500 problem. The comparison results are less favorable for MlS-LCS when based on the exact match and Hamming loss metrics, as it ranks third in both cases. Still, MlS-LCS achieves the best exact match value for 3 out of the 7 studied problems, including the CAL500 problem. In the latter case, MlS-LCS (with the Best inference strategy) outperforms its rivals by at least 70%. We consider this result indicative of our proposed algorithm's ability to effectively model label correlations, given the high label cardinality (26.04) of the problem.

Regarding the statistical significance of the measured differences in algorithm ranks, the Friedman test does not reject the null hypothesis (at α = 0.05) that all algorithms perform equivalently, when applied to rankings based on the accuracy and exact match metrics. The same null hypothesis is rejected (at α = 0.05) when the studied algorithms are ranked based on Hamming loss, and the Nemenyi post-hoc test detects a significant performance difference between RAkEL and (a) HOMER and CC at α = 0.1, and (b) ECC and BR at α = 0.05.

Overall, regardless of the evaluation metric used, MlS-LCS outperforms at least 4 of its 6 rivals. In the cases of accuracy and Hamming loss, the outperformed rivals include the state-of-the-art algorithms HOMER and CC, which have been recommended as benchmarks by a recent extensive comparative study of multi-label classification algorithms (Madjarov et al., 2012). Additionally, no statistically significant performance differences are detected between MlS-LCS and the best-performing RAkEL algorithm, with respect to any of the evaluation metrics. Thus, we consider the obtained results indicative of (i) the potential of our proposed LCS approach for effective multi-label classification, as well as (ii) the flexibility of the generalized multi-label rule format, which can mimic the knowledge representations induced by the studied rule-based, lazy and SVM-based ensemble learners, depending on the problem type.

6 Conclusions and Future Work

In this paper, we presented a generalized rule format suitable for generating compact and accurate rulesets in multi-label settings. The proposed format extends existing rule representations with a flexible mechanism for modeling label correlations without the need to explicitly specify the label combinations to be considered. Thus, algorithms inducing generalized multi-label rules can approach all possible spectra between the BR (no label correlations) and LP (all possible label combinations) transformations, while producing comprehensible knowledge in the form of “if-then” rules.
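As an illustration of this flexibility, consider a minimal encoding (our own didactic sketch, not the paper's internal representation) in which each rule's consequent assigns every label one of three states: relevant (1), irrelevant (0) or indifferent (#). A consequent that is specific for all labels behaves like an LP-style local decision, while one that commits to a single label degenerates to a BR-style rule.

```python
from dataclasses import dataclass

@dataclass
class MultiLabelRule:
    condition: str   # e.g., a ternary feature string such as "1#0#"
    consequent: str  # per-label decision: '1', '0' or '#' (indifferent)

    def decides(self, label_idx):
        """A rule only votes on labels it is not indifferent about."""
        return self.consequent[label_idx] != "#"

# BR-like rule: commits to a single label, indifferent to the rest.
br_like = MultiLabelRule(condition="1#0#", consequent="1###")
# LP-like rule: commits to a full label combination.
lp_like = MultiLabelRule(condition="1#0#", consequent="1010")
print(br_like.decides(0), br_like.decides(1))  # True False
```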

In addition to detailing the generalized multi-label rule format, our current work also employed it in the context of a multi-label LCS algorithm, named MlS-LCS, that is based on a supervised LCS learning framework, properly modified to meet the new requirements posed by the multi-label classification domain. Its extensive experimental evaluation, missing from previous research in the area, revealed that it is capable of consistently effective classification and highlighted it as the first LCS-based alternative to state-of-the-art multi-label classification methods. Based on the average rank over the three evaluation metrics employed, MlS-LCS came second overall to RAkEL, while it outperformed HOMER, which has recently been identified as a top-performing benchmark multi-label classification method (Madjarov et al., 2012).

Regarding the combined potential of MlS-LCS and the proposed generalized multi-label rule format, it is also worth noting that they are, with small modifications to the internal representation of rule labels, directly applicable to the relatively new task of multi-dimensional classification.

The current limitation of our approach with respect to the arguably long times required for model training – also a problem for several non-evolutionary multi-label approaches, such as RAkEL and ECC – can be overcome by exploiting the parallelization, and thus scalability, potential of GAs.

An additional important issue, that needs to be addressed in future work, concerns the readability of the knowledge representations evolved, both in terms of rule quality (generalization degree) and quantity. Our first step towards this direction will be an experimental investigation of rule compaction methods available in the literature. Furthermore, based on the encouraging results obtained with the use of our clustering-based initialization procedure, alternative rule initialization methods will be explored, as a means to boost the predictive accuracy and interpretability of the induced knowledge representations.

Acknowledgment

The first author would like to acknowledge that this research has been funded by the Research Committee of Aristotle University of Thessaloniki, through the “Excellence Fellowships for Postdoctoral Studies” program.

References

  • Ahmadi Abhari et al. (2011) Ahmadi Abhari, K., Hamzeh, A., and Hashemi, S. (2011). Voting based learning classifier system for multi-label classification. In Proceedings of the 2011 GECCO Conference Companion on Genetic and Evolutionary Computation, pages 355–360, New York, NY, USA. ACM.
  • Allamanis et al. (2013) Allamanis, M., Tzima, F. A., and Mitkas, P. A. (2013). Effective rule-based multi-label classification with learning classifier systems. In Tomassini, M., Antonioni, A., Daolio, F., and Buesser, P., editors, ICANNGA, volume 7824 of Lecture Notes in Computer Science, pages 466–476. Springer.
  • Behdad et al. (2012) Behdad, M., Barone, L., French, T., and Bennamoun, M. (2012). On XCSR for electronic fraud detection. Evolutionary Intelligence, 5(2):139–150.
  • Bernadó-Mansilla and Garrell-Guiu (2003) Bernadó-Mansilla, E. and Garrell-Guiu, J. (2003). Accuracy-based learning classifier systems: models, analysis and applications to classification tasks. Evolutionary Computation, 11(3):209–238.
  • Bull et al. (2008) Bull, L., Bernadó-Mansilla, E., and Holmes, J. H., editors (2008). Learning Classifier Systems in Data Mining, volume 125 of Studies in Computational Intelligence. Springer.
  • Bull and O’Hara (2002) Bull, L. and O’Hara, T. (2002). Accuracy-based neuro and neuro-fuzzy classifier systems. In Langdon, W. B., et al., editors, GECCO 2002: Proceedings of the Genetic and Evolutionary Computation Conference, New York, USA, 9-13 July 2002, pages 905–911. Morgan Kaufmann.
  • Butz et al. (2005) Butz, M., Goldberg, D., and Lanzi, P. (2005). Gradient descent methods in learning classifier systems: Improving XCS performance in multistep problems. Evolutionary Computation, IEEE Transactions on, 9(5):452–473.
  • Butz et al. (2004) Butz, M., Kovacs, T., Lanzi, P., and Wilson, S. (2004). Toward a theory of generalization and learning in XCS. Evolutionary Computation, IEEE Transactions on, 8(1):28–46.
  • Butz et al. (2008) Butz, M., Lanzi, P., and Wilson, S. (2008). Function approximation with XCS: Hyperellipsoidal conditions, recursive least squares, and compaction. Evolutionary Computation, IEEE Transactions on, 12(3):355–376.
  • Cheng and Hüllermeier (2009) Cheng, W. and Hüllermeier, E. (2009). Combining instance-based learning and logistic regression for multilabel classification. Machine Learning, 76(2-3):211–225.
  • Clare and King (2001) Clare, A. and King, R. D. (2001). Knowledge discovery in multi-label phenotype data. In Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, pages 42–53, London, UK, UK. Springer-Verlag.
  • Crammer and Singer (2003) Crammer, K. and Singer, Y. (2003). A family of additive online algorithms for category ranking. J. Mach. Learn. Res., 3:1025–1058.
  • Demšar (2006) Demšar, J. (2006). Statistical Comparisons of Classifiers over Multiple Data Sets. Journal of Machine Learning Research, 7:1–30.
  • Elisseeff and Weston (2005) Elisseeff, A. and Weston, J. (2005). A kernel method for multi-labelled classification. In Proceedings of the Annual ACM Conference on Research and Development in Information Retrieval, pages 274–281.
  • Fernández et al. (2010) Fernández, A., Garcia, S., Luengo, J., Bernadó-Mansilla, E., and Herrera, F. (2010). Genetics-based machine learning for rule induction: State of the art, taxonomy, and comparative study. Evolutionary Computation, IEEE Transactions on, 14(6):913–941.
  • Holland (1975) Holland, J. H. (1975). Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence. University of Michigan Press, Ann Arbor, MI, USA.
  • Hüllermeier et al. (2008) Hüllermeier, E., Fürnkranz, J., Cheng, W., and Brinker, K. (2008). Label ranking by learning pairwise preferences. Artificial Intelligence, 172(16-17):1897–1916.
  • Iqbal et al. (2014) Iqbal, M., Browne, W., and Zhang, M. (2014). Reusing building blocks of extracted knowledge to solve complex, large-scale boolean problems. Evolutionary Computation, IEEE Transactions on, 18(4):465–480.
  • Kharbat et al. (2007) Kharbat, F., Bull, L., and Odeh, M. (2007). Mining breast cancer data with XCS. In GECCO ’07: Proceedings of the 9th annual conference on Genetic and evolutionary computation, pages 2066–2073, New York, NY, USA. ACM.
  • Kneissler et al. (2014) Kneissler, J., Stalph, P., Drugowitsch, J., and Butz, M. (2014). Filtering Sensory Information with XCSF: Improving Learning Robustness and Robot Arm Control Performance. Evolutionary Computation, 22(1):139–158.
  • Kocev (2011) Kocev, D. (2011). Ensembles for predicting structured outputs. PhD thesis, IPS Jožef Stefan, Ljubljana, Slovenia.
  • Kovacs (2002a) Kovacs, T. (2002a). XCS’s Strength-Based Twin: Part I. In Lanzi, P. L., Stolzmann, W., and Wilson, S. W., editors, Learning Classifier Systems, 5th International Workshop, IWLCS 2002, Granada, Spain, September 7-8, 2002, Revised Papers, volume 2661 of Lecture Notes in Computer Science, pages 61–80. Springer.
  • Kovacs (2002b) Kovacs, T. (2002b). XCS’s Strength-Based Twin: Part II. In Lanzi, P. L., Stolzmann, W., and Wilson, S. W., editors, Learning Classifier Systems, 5th International Workshop, IWLCS 2002, Granada, Spain, September 7-8, 2002, Revised Papers, volume 2661 of Lecture Notes in Computer Science, pages 81–98. Springer.
  • Lanzi (2008) Lanzi, P. (2008). Learning Classifier Systems: Then and Now. Evolutionary Intelligence, 1(1):63–82.
  • Lanzi (1999) Lanzi, P. L. (1999). Extending the representation of classifier conditions part I: From binary to messy coding. In Banzhaf, W., Daida, J., Eiben, A. E., Garzon, M. H., Honavar, V., Jakiela, M., and Smith, R. E., editors, Proceedings of the GECCO Conference, pages 337–344, Orlando, Florida, USA. Morgan Kaufmann.
  • Lanzi et al. (2006) Lanzi, P. L., Loiacono, D., Wilson, S. W., and Goldberg, D. E. (2006). Classifier prediction based on tile coding. In Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pages 1497–1504, New York, NY, USA. ACM.
  • Lanzi and Perrucci (1999) Lanzi, P. L. and Perrucci, A. (1999). Extending the representation of classifier conditions part II: From messy coding to S-expressions. In Banzhaf, W., Daida, J., Eiben, A. E., Garzon, M. H., Honavar, V., Jakiela, M., and Smith, R. E., editors, Proceedings of the GECCO Conference, pages 345–352, Orlando, Florida, USA. Morgan Kaufmann.
  • Lanzi and Wilson (2006) Lanzi, P. L. and Wilson, S. W. (2006). Using convex hulls to represent classifier conditions. In Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pages 1481–1488, New York, NY, USA. ACM.
  • Madjarov et al. (2012) Madjarov, G., Kocev, D., Gjorgjevikj, D., and Džeroski, S. (2012). An extensive experimental comparison of methods for multi-label learning. Pattern Recognition, 45(9):3084–3104.
  • Murphy (2012) Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. The MIT Press.
  • Nakata et al. (2014) Nakata, M., Kovacs, T., and Takadama, K. (2014). A modified XCS classifier system for sequence labeling. In Proceedings of the 2014 Conference on Genetic and Evolutionary Computation, GECCO ’14, pages 565–572, New York, NY, USA. ACM.
  • Nakata et al. (2015) Nakata, M., Kovacs, T., and Takadama, K. (2015). XCS-SL: a rule-based genetic learning system for sequence labeling. Evolutionary Intelligence, pages 1–16.
  • Orriols-Puig and Bernadó-Mansilla (2008) Orriols-Puig, A. and Bernadó-Mansilla, E. (2008). Revisiting UCS: Description, Fitness Sharing, and Comparison with XCS. In Bacardit, J., Bernadó-Mansilla, E., Butz, M. V., Kovacs, T., Llorà, X., and Takadama, K., editors, Learning Classifier Systems, pages 96–116. Springer-Verlag, Berlin, Heidelberg.
  • Orriols-Puig et al. (2009a) Orriols-Puig, A., Bernado-Mansilla, E., Goldberg, D., Sastry, K., and Lanzi, P. (2009a). Facetwise analysis of XCS for problems with class imbalances. Evolutionary Computation, IEEE Transactions on, 13(5):1093–1119.
  • Orriols-Puig et al. (2009b) Orriols-Puig, A., Casillas, J., and Bernadó-Mansilla, E. (2009b). Fuzzy-UCS: A Michigan-Style Learning Fuzzy-Classifier System for Supervised Learning. Evolutionary Computation, IEEE Transactions on, 13(2):260–283.
  • Preen and Bull (2013) Preen, R. and Bull, L. (2013). Dynamical Genetic Programming in XCSF. Evolutionary Computation, 21(3):361–387.
  • Read (2008) Read, J. (2008). A Pruned Problem Transformation Method for Multi-label classification. In Proc. 2008 New Zealand Computer Science Research Student Conference (NZCSRS 2008), pages 143–150.
  • Read (2010) Read, J. (2010). Scalable Multi-Label Classification. PhD thesis, University of Waikato, Hamilton, New Zealand.
  • Read et al. (2014) Read, J., Bielza, C., and Larranaga, P. (2014). Multi-dimensional classification with super-classes. Knowledge and Data Engineering, IEEE Transactions on, 26(7):1720–1733.
  • Read et al. (2008) Read, J., Pfahringer, B., and Holmes, G. (2008). Multi-label classification using ensembles of pruned sets. In 2008 Eighth IEEE International Conference on Data Mining, pages 995–1000. IEEE.
  • Read et al. (2009) Read, J., Pfahringer, B., Holmes, G., and Frank, E. (2009). Classifier chains for multi-label classification. Machine Learning and Knowledge Discovery in Databases, pages 254–269.
  • Schapire and Singer (2000) Schapire, R. and Singer, Y. (2000). BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2):135–168.
  • Stalph et al. (2012) Stalph, P. O., Llorà, X., Goldberg, D. E., and Butz, M. V. (2012). Resource management and scalability of the XCSF learning classifier system. Theoretical Computer Science, 425:126–141. Theoretical Foundations of Evolutionary Computation.
  • Stone and Bull (2003) Stone, C. and Bull, L. (2003). For real! XCS with continuous-valued inputs. Evolutionary Computation, 11(3):299–336.
  • Thabtah et al. (2004) Thabtah, F., Cowling, P., and Peng, Y. (2004). MMAC: a new multi-class, multi-label associative classification approach. In Proceedings of the 2004 IEEE International Conference on Data Mining, pages 217–224.
  • Tsoumakas and Katakis (2007) Tsoumakas, G. and Katakis, I. (2007). Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1–13.
  • Tsoumakas et al. (2008) Tsoumakas, G., Katakis, I., and Vlahavas, I. P. (2008). Effective and Efficient Multilabel Classification in Domains with Large Number of Labels. In ECML/PKDD 2008 Workshop on Mining Multidimensional Data.
  • Tsoumakas et al. (2010) Tsoumakas, G., Katakis, I., and Vlahavas, I. P. (2010). Mining multi-label data. In Maimon, O. and Rokach, L., editors, Data Mining and Knowledge Discovery Handbook, pages 667–685. Springer.
  • Tsoumakas et al. (2011a) Tsoumakas, G., Katakis, I., and Vlahavas, I. P. (2011a). Random k-labelsets for multilabel classification. IEEE Transactions on Knowledge and Data Engineering, 23(7):1079–1089.
  • Tsoumakas et al. (2011b) Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J., and Vlahavas, I. (2011b). Mulan: A java library for multi-label learning. Journal of Machine Learning Research, 12:2411–2414.
  • Tzima and Mitkas (2008) Tzima, F. and Mitkas, P. (2008). ZCS revisited: Zeroth-level classifier systems for data mining. In Data Mining Workshops, 2008. ICDMW ’08. IEEE International Conference on, pages 700–709.
  • Tzima and Mitkas (2013) Tzima, F. A. and Mitkas, P. A. (2013). Strength-based learning classifier systems revisited: Effective rule evolution in supervised classification tasks. Eng. Appl. of AI, 26(2):818–832.
  • Tzima et al. (2012) Tzima, F. A., Mitkas, P. A., and Theocharis, J. B. (2012). Clustering-based initialization of learning classifier systems - effects on model performance, readability and induction time. Soft Computing, 16(7):1267–1286.
  • Urbanowicz and Moore (2009) Urbanowicz, R. J. and Moore, J. H. (2009). Learning classifier systems: A complete introduction, review, and roadmap. Journal of Artificial Evolution and Applications, 2009:1:1–1:25.
  • Vallim et al. (2009) Vallim, R., Duque, T., Goldberg, D., and Carvalho, A. (2009). The multi-label OCS with a genetic algorithm for rule discovery: Implementation and first results. In Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation, pages 1323–1330. ACM.
  • Vallim et al. (2008) Vallim, R., Goldberg, D., Llorà, X., Duque, T., and Carvalho, A. (2008). A new approach for multi-label classification based on default hierarchies and organizational learning. In Proceedings of the 2008 GECCO conference companion on Genetic and evolutionary computation, pages 2017–2022. ACM.
  • Vens et al. (2008) Vens, C., Struyf, J., Schietgat, L., Džeroski, S., and Blockeel, H. (2008). Decision trees for hierarchical multi-label classification. Machine Learning, 73(2):185–214.
  • Wilson (2002) Wilson, S. (2002). Classifiers that approximate functions. Natural Computing, 1(2-3):211–234.
  • Wilson (1994) Wilson, S. W. (1994). ZCS: A Zeroth-level Classifier System. Evolutionary Computation, 2(1):1–18.
  • Wilson (1995) Wilson, S. W. (1995). Classifier fitness based on accuracy. Evolutionary Computation, 3(2):149–175.
  • Witten and Frank (2005) Witten, I. H. and Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques (2nd Edition). Morgan Kaufmann, San Francisco, CA, USA.
  • Yang (2001) Yang, Y. (2001). A study of thresholding strategies for text categorization. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 137–145, New York, NY, USA. ACM.
  • Zhang and Zhang (2010) Zhang, M.-L. and Zhang, K. (2010). Multi-label learning by exploiting label dependency. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 999–1008, New York, NY, USA. ACM.
  • Zhang and Zhou (2006) Zhang, M.-L. and Zhou, Z.-H. (2006). Multilabel neural networks with applications to functional genomics and text categorization. Knowledge and Data Engineering, IEEE Transactions on, 18(10):1338–1351.
  • Zhang and Zhou (2007) Zhang, M.-L. and Zhou, Z.-H. (2007). ML-KNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038–2048.