## I Introduction

Multilabel classification (MLC) extends conventional single label classification (SLC) by allowing an instance to be assigned to multiple labels from a label set. It arises naturally in a wide range of practical problems, such as document categorization, image classification, music annotation, webpage classification and bioinformatics applications, where each instance can be simultaneously described by several class labels out of a candidate label set. MLC is also closely related to many other research areas, such as subspace learning [1], nonnegative matrix factorization [2], multi-view learning [3] and multi-task learning [4]. Because of its great generality and wide applications, MLC has received increasing attention in recent years from the machine learning, data mining and computer vision communities, and has developed rapidly with both algorithmic and theoretical achievements [5, 6, 7, 8, 9, 10].

The key feature of MLC that makes it distinct from SLC is label correlation, without which classifiers can be trained independently for each individual label and MLC degenerates to SLC. The correlation between different labels can be verified by calculating statistics of their distributions, e.g., the $\chi^2$ test and Pearson’s correlation coefficient. According to [11], there are two types of label correlations (or dependence), i.e., the conditional correlations and the unconditional correlations, wherein the former describes the label correlations conditioned on a given instance while the latter summarizes the global label correlations of the label distribution alone by marginalizing out the instance. From a classification point of view, modelling of label conditional correlations is preferable since they are directly related to prediction; however, proper utilization of unconditional correlations is also helpful, but in an average sense because of the marginalization. Accordingly, quite a number of MLC algorithms have been proposed in the past few years by exploiting either of the two types of label correlations,^{1} and below we give a brief review of the representative ones. As the literature is very large, we cannot cover all the algorithms. The recent surveys [8, 9] contain many references omitted from this paper.

^{1} Studies on MLC from perspectives other than label correlations also exist in the literature, e.g., by defining different loss functions, dimension reduction and classifier ensemble methods, but they are outside the scope of this paper.
- By exploiting unconditional label correlations: A large class of MLC algorithms that utilize unconditional label correlations are built upon label transformation. The key idea is to find a new representation for the label vector (one dimension corresponds to an individual label), so that the transformed labels or responses are uncorrelated and thus can be predicted independently. The original label vector needs to be recovered after the prediction. MLC algorithms using label transformation include [12], which utilizes low-dimensional embedding, and [7] and [13], which use random projections. Another strategy of using unconditional label correlations, e.g., used in the stacking method [6] and the “Curds” & “Whey” procedure [14], is first to predict each individual label independently and then correct/adjust the prediction by proper post-processing. Algorithms have also been proposed based on co-occurrence or structure information extracted from the label set, which include random $k$-label sets (RAKEL) [15], pruned problem transformation (PPT) [16], hierarchical binary relevance (HBR) [17] and hierarchy of multilabel classifiers (HOMER) [8]. Regression-based models, including reduced-rank regression and multitask learning, can also be used for MLC, with an interpretation of utilizing unconditional label correlations [11].

- By exploiting conditional label correlations: MLC algorithms in this category are diverse and often developed by specific heuristics. For example, multilabel $k$-nearest neighbour (MLkNN) [5] extends kNN to the multilabel situation, applying maximum a posteriori (MAP) label prediction by obtaining the prior label distribution within the $k$ nearest neighbours of an instance. Instance-based logistic regression (IBLR) [6] is also a localized algorithm, which modifies logistic regression by using label information from the neighbourhood as features. Classifier chain (CC) [18], as well as its ensemble and probabilistic variants [19], incorporates label correlations into a chain of binary classifiers, where the prediction of a label uses previous labels as features. Channel coding based MLC techniques such as principal label space transformation (PLST) [20] and maximum margin output coding (MMOC) [21] were proposed to select codes that exploit conditional label correlations. Graphical models, e.g., conditional random fields (CRFs) [22], are also applied to MLC, providing a richer framework to handle conditional label correlations.

### I-A Multilabel Image Classification

Multilabel image classification belongs to the generic scope of MLC, but handles the specific problem of predicting the presence or absence of multiple object categories in an image. Like many related high-level vision tasks such as object recognition [23, 24], visual tracking [25], image annotation [26, 27, 28] [29, 30, 31], multilabel image classification [32, 33, 34, 35, 36, 37] is very challenging due to large intra-class variation. In general, the variation is caused by viewpoint, scale, occlusion, illumination, semantic context, etc.

On the one hand, many effective image representation schemes have been developed to handle this high-level vision task. Most of the classical approaches derive from handcrafted image features, such as GIST [38], dense SIFT [39], VLAD [40] and object bank [41]. In contrast, very recent deep learning techniques have been developed for image feature learning, such as deep CNN features [42, 43]. These techniques are more powerful than classical methods when learning from a very large amount of unlabeled images.

On the other hand, label correlations have also been exploited to significantly improve image classification performance. Most of the current multilabel image classification algorithms are motivated by considering label correlations conditioned on image features, and thus intrinsically fall into the CRFs framework. For example, the probabilistic label enhancement model (PLEM) [44] is designed to exploit image label co-occurrence pairs based on a maximum spanning tree construction, and a piecewise procedure is utilized to train the pairwise CRFs model. More recently, the clique generating machine (CGM) [45] was proposed to learn the image label graph structure and parameters by iteratively activating a set of cliques. It also belongs to the CRFs framework, but the labels are not constrained to be all connected, which may result in isolated cliques.

### I-B Motivation and Organization

Correlated logistic model (CorrLog) provides a more principled way to handle conditional label correlations, and enjoys several favourable properties: 1) built upon independent logistic regressions (ILRs), it offers an explicit way to model the pairwise (second order) label correlations; 2) by using the pseudo likelihood technique, the parameters of CorrLog can be learned approximately with a computational complexity linear with respect to the label number; 3) the learning of CorrLog is stable, and the empirically learned model enjoys a generalization error bound that is independent of the label number. In addition, the results presented in this paper extend our previous study [46] in the following aspects: 1) we introduce elastic net regularization to CorrLog, which facilitates the utilization of sparsity in both feature selection and label correlations; 2) a learning algorithm for CorrLog based on soft thresholding is derived to handle the nonsmoothness of the elastic net regularization; 3) the proof of the generalization bound is also extended for the new regularization; 4) we apply CorrLog to multilabel image classification, and achieve results competitive with the state-of-the-art methods in this area.

To ease the presentation, we first summarize the important notations in Table I. The rest of this paper is organized as follows. Section II introduces the CorrLog model with elastic net regularization. Section III presents algorithms for learning CorrLog by regularized maximum pseudo likelihood estimation, and for prediction with CorrLog by message passing. A generalization analysis of CorrLog based on the concept of algorithmic stability is presented in Section IV. Section V to Section VII report results of empirical evaluations, including experiments on a synthetic dataset and on benchmark multilabel image classification datasets.

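| Notation | Description |
|---|---|
| $\mathcal{D}$ | training dataset with $n$ examples |
| $\mathcal{D}^{k}$ | modified training data set by replacing the $k$-th example of $\mathcal{D}$ with an independent example |
| $\mathcal{D}^{\setminus k}$ | modified training data set by discarding the $k$-th example of $\mathcal{D}$ |
| $\tilde{\mathcal{L}}(\Theta)$ | negative log pseudo likelihood over training dataset $\mathcal{D}$ |
| $\tilde{\mathcal{L}}_r(\Theta)$ | regularized negative log pseudo likelihood over training dataset $\mathcal{D}$ |
| $R_{en}(\Theta; \lambda_1, \lambda_2, \epsilon)$ | elastic net regularization with weights $\lambda_1$, $\lambda_2$ and parameter $\epsilon$ |
| $\Theta$ | model parameters of CorrLog |
| $\hat{\Theta}$ | empirical learned model parameters by maximum pseudo likelihood estimation over $\mathcal{D}$ |
| $\hat{\Theta}^{k}$ | empirical learned model parameters over $\mathcal{D}^{k}$ |
| $\hat{\Theta}^{\setminus k}$ | empirical learned model parameters over $\mathcal{D}^{\setminus k}$ |
| | empirical error of the empirical model over training set $\mathcal{D}$ |
| | generalization error of the empirical model |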

## II Correlated Logistic Model

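We study the problem of learning a joint prediction $\mathbf{y} = d(\mathbf{x}): \mathcal{X} \mapsto \mathcal{Y}$, where the instance space $\mathcal{X} = \{\mathbf{x} : \|\mathbf{x}\| \le 1, \mathbf{x} \in \mathbb{R}^D\}$ and the label space $\mathcal{Y} = \{-1, 1\}^m$. By assuming the conditional independence among labels, we can model MLC by a set of independent logistic regressions (ILRs). Specifically, the conditional probability $p_{lr}(\mathbf{y}|\mathbf{x})$ of ILRs is given by

$$p_{lr}(\mathbf{y}|\mathbf{x}) = \prod_{i=1}^{m} p_{lr}(y_i|\mathbf{x}) = \prod_{i=1}^{m} \frac{\exp\left(y_i \boldsymbol{\beta}_i^T \mathbf{x}\right)}{\exp\left(\boldsymbol{\beta}_i^T \mathbf{x}\right) + \exp\left(-\boldsymbol{\beta}_i^T \mathbf{x}\right)}, \tag{1}$$

where $\boldsymbol{\beta}_i \in \mathbb{R}^D$ is the coefficient vector of the $i$-th logistic regression (LR) in ILRs. For convenience of expression, the bias of the standard LR is omitted here, which is equivalent to augmenting the feature vector $\mathbf{x}$ with a constant.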

Clearly, ILRs (1) enjoy several merits: they can be learned efficiently, in particular with a computational complexity linear in the label number $m$, and their probabilistic formulation inherently helps deal with the imbalance of positive and negative examples for each label, which is a common problem in MLC. However, ILRs entirely ignore the potential correlation among labels and thus tend to under-fit the true posterior $p(\mathbf{y}|\mathbf{x})$, especially when the label number $m$ is large.
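As a minimal sketch of the ILR model (1), the snippet below evaluates the joint log-probability of a label vector. It assumes labels coded in $\{-1, +1\}$ and stacks the coefficient vectors as rows of a matrix `B`; the function name `ilr_log_prob` and the array layout are illustrative choices, not part of the original paper.

```python
import numpy as np

def ilr_log_prob(x, y, B):
    """Joint log-probability of y under independent logistic regressions.

    x: (D,) feature vector; y: (m,) labels in {-1, +1};
    B: (m, D) matrix whose i-th row is the coefficient vector beta_i.
    """
    s = B @ x  # s_i = beta_i^T x
    # exp(y_i s_i) / (exp(s_i) + exp(-s_i)) equals 1 / (1 + exp(-2 y_i s_i)),
    # so each label contributes -log(1 + exp(-2 y_i s_i)).
    return -np.sum(np.logaddexp(0.0, -2.0 * y * s))
```

Because the labels are treated independently, the cost is one inner product per label, which is the linear dependence on the label number mentioned above.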

### II-A Correlated Logistic Regressions

CorrLog tries to extend ILRs with as little effort as possible, so that the correlation among labels is explicitly modelled while the advantages of ILRs are preserved. To achieve this, we propose to augment (1) with a simple function $q(\mathbf{y})$ and reformulate the posterior probability as

$$p(\mathbf{y}|\mathbf{x}) \propto p_{lr}(\mathbf{y}|\mathbf{x})\, q(\mathbf{y}). \tag{2}$$

As long as $q(\mathbf{y})$ cannot be decomposed into independent product terms for individual labels, it introduces label correlations into $p(\mathbf{y}|\mathbf{x})$. It is worth noticing that we assumed $q(\mathbf{y})$ to be independent of $\mathbf{x}$. Therefore, (2) models label correlations in an average sense. This is similar to the concept of “marginal correlations” in MLC [11]. However, they are intrinsically different, because (2) integrates the correlation into the posterior probability, which directly aims at prediction. In addition, the idea used in (2) for correlation modelling is also distinct from the “Curds and Whey” procedure in [14], which corrects outputs of multivariate linear regression by reconsidering their correlations to the true responses.

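In this paper, we choose $q(\mathbf{y})$ to be the following quadratic form,

$$q(\mathbf{y}) = \exp\Big\{\sum_{i<j} \alpha_{ij}\, y_i y_j\Big\}. \tag{3}$$

It means that $y_i$ and $y_j$ are positively correlated given $\alpha_{ij} > 0$ and negatively correlated given $\alpha_{ij} < 0$. It is also possible to define $\alpha_{ij}$ as functions of $\mathbf{x}$, but this will drastically increase the number of model parameters, e.g., to $m(m-1)D/2$ pairwise parameters if linear functions are used.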

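By substituting (3) into (2), we obtain the conditional probability for CorrLog

$$p(\mathbf{y}|\mathbf{x}; \Theta) \propto \exp\Big\{\sum_{i=1}^{m} y_i \boldsymbol{\beta}_i^T \mathbf{x} + \sum_{i<j} \alpha_{ij}\, y_i y_j\Big\}, \tag{4}$$

where the model parameter $\Theta = \{\boldsymbol{\beta}, \boldsymbol{\alpha}\}$ contains $\boldsymbol{\beta} = [\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_m]$ and $\boldsymbol{\alpha} = [\alpha_{12}, \ldots, \alpha_{(m-1)m}]^T$. It can be seen that CorrLog is a simple modification of (1), using a quadratic term to adjust the joint prediction, so that hidden label correlations can be exploited. In addition, CorrLog is closely related to popular statistical models for joint modelling of binary variables. For example, conditional on $\mathbf{x}$, (4) is exactly an Ising model [47] for $\mathbf{y}$. It can also be treated as a special instance of CRFs [22], by defining features $\phi_i(\mathbf{x}, \mathbf{y}) = y_i \mathbf{x}$ and $\psi_{ij}(\mathbf{y}) = y_i y_j$. Moreover, the classical multivariate probit (MP) model [48] also models pairwise correlations in $\mathbf{y}$. However, it utilizes Gaussian latent variables for correlation modelling, which is essentially different from CorrLog.

### II-B Elastic Net Regularization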

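Given a set of training data $\mathcal{D} = \{\mathbf{x}^{(l)}, \mathbf{y}^{(l)} : 1 \le l \le n\}$, CorrLog can be learned by regularized maximum log likelihood estimation (MLE), i.e.,

$$\hat{\Theta} = \arg\min_{\Theta}\; \mathcal{L}(\Theta) + R(\Theta), \tag{5}$$

where $\mathcal{L}(\Theta)$ is the negative log likelihood

$$\mathcal{L}(\Theta) = -\frac{1}{n} \sum_{l=1}^{n} \log p\big(\mathbf{y}^{(l)}|\mathbf{x}^{(l)}; \Theta\big), \tag{6}$$

and $R(\Theta)$ is a properly chosen regularization.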

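A possible choice for $R(\Theta)$ is the $\ell_2$ regularizer,

$$R_2(\Theta; \lambda_1, \lambda_2) = \lambda_1 \sum_{i=1}^{m} \|\boldsymbol{\beta}_i\|_2^2 + \lambda_2 \sum_{i<j} |\alpha_{ij}|^2, \tag{7}$$

with $\lambda_1, \lambda_2 > 0$ being the weighting parameters. The $\ell_2$ regularization enjoys the merits of computational flexibility and learning stability. However, it is unable to exploit any sparsity that may be possessed by the problem at hand. For example, for MLC, it is likely that the prediction of each label only depends on a subset of the features of $\mathbf{x}$, which implies the sparsity of $\boldsymbol{\beta}_i$. Besides, $\boldsymbol{\alpha}$ can also be sparse since not all labels in $\mathbf{y}$ are correlated with each other. The $\ell_1$ regularizer is another choice for $R(\Theta)$, especially regarding model sparsity. Nevertheless, it has been noticed by several studies that $\ell_1$ regularized algorithms are inherently unstable, that is, a slight change of the training data set can lead to substantially different prediction models. Based on the above considerations, we propose to use the elastic net regularizer [49], which is a combination of the $\ell_2$ and $\ell_1$ regularizers and inherits their individual advantages, i.e., learning stability and model sparsity,

$$R_{en}(\Theta; \lambda_1, \lambda_2, \epsilon) = \lambda_1 \sum_{i=1}^{m} \big(\|\boldsymbol{\beta}_i\|_2^2 + \epsilon \|\boldsymbol{\beta}_i\|_1\big) + \lambda_2 \sum_{i<j} \big(|\alpha_{ij}|^2 + \epsilon |\alpha_{ij}|\big), \tag{8}$$

where $\epsilon \ge 0$ controls the trade-off between the $\ell_1$ regularization and the $\ell_2$ regularization, and a large $\epsilon$ encourages a high level of sparsity.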
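To make the role of the trade-off parameter concrete, the following sketch evaluates an elastic net penalty of the kind described above. It assumes the penalty takes the form $\lambda_1 \sum_i (\|\boldsymbol{\beta}_i\|_2^2 + \epsilon\|\boldsymbol{\beta}_i\|_1) + \lambda_2 \sum_{i<j} (|\alpha_{ij}|^2 + \epsilon|\alpha_{ij}|)$, and that the pairwise coefficients are stored in the strict upper triangle of a square matrix `A`; both the function name and the storage layout are assumptions made for illustration.

```python
import numpy as np

def elastic_net(B, A, lam1, lam2, eps):
    """Elastic net penalty on feature weights B and pairwise couplings A.

    B: (m, D) feature coefficients, one row per label.
    A: (m, m) matrix; only the strict upper triangle (i < j) is used.
    lam1, lam2: weights of the two groups; eps: weight of the l1 part.
    """
    a = A[np.triu_indices(A.shape[0], k=1)]  # alpha_ij for i < j
    pen_B = np.sum(B ** 2) + eps * np.sum(np.abs(B))
    pen_A = np.sum(a ** 2) + eps * np.sum(np.abs(a))
    return lam1 * pen_B + lam2 * pen_A
```

Setting `eps = 0` recovers the pure squared-norm penalty, while increasing `eps` puts more weight on the absolute-value terms that drive coefficients to exactly zero.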

## III Algorithms

In this section, we derive algorithms for learning and prediction with CorrLog. The exponentially large size of the label space $\mathcal{Y} = \{-1, 1\}^m$ makes exact algorithms for CorrLog computationally intractable, since the conditional probability (4) needs to be normalized by the partition function

$$A(\Theta) = \sum_{\mathbf{y} \in \mathcal{Y}} \exp\Big\{\sum_{i=1}^{m} y_i \boldsymbol{\beta}_i^T \mathbf{x} + \sum_{i<j} \alpha_{ij}\, y_i y_j\Big\}, \tag{9}$$

which is a summation over an exponential number of terms. Thus, we turn to approximate learning and prediction algorithms, by exploiting the pseudo likelihood and the message passing techniques.
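The cost of exact normalization can be seen in a brute-force sketch that enumerates all $2^m$ label vectors. It assumes labels in $\{-1, +1\}^m$ and an unnormalized log-probability of the form $\sum_i y_i \boldsymbol{\beta}_i^T \mathbf{x} + \sum_{i<j} \alpha_{ij} y_i y_j$; the names `potential` and `log_partition` are hypothetical, and the routine is only practical for a handful of labels.

```python
import itertools
import numpy as np

def potential(x, y, B, A):
    """Unnormalized log-probability of y given x (unary plus pairwise terms)."""
    U = np.triu(A, k=1)  # keep alpha_ij for i < j only
    return y @ (B @ x) + y @ U @ y

def log_partition(x, B, A):
    """Log of the partition function by enumerating all 2^m label vectors."""
    m = B.shape[0]
    scores = [potential(x, np.array(y), B, A)
              for y in itertools.product([-1.0, 1.0], repeat=m)]
    return np.logaddexp.reduce(scores)  # stable log-sum-exp
```

The list `scores` has $2^m$ entries, so doubling the work for every additional label is exactly what the approximate methods in this section avoid.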

### III-A Approximate Learning via Pseudo Likelihood

Maximum pseudo likelihood estimation (MPLE) [50] provides an alternative approach for estimating model parameters, especially when the partition function of the likelihood cannot be evaluated efficiently. It was developed in the field of spatial dependence analysis and has been widely applied to the estimation of various statistical models, from the Ising model [47] to CRFs [51]. Here, we apply MPLE to the learning of the parameter $\Theta$ in CorrLog.

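The pseudo likelihood of a model over $m$ jointly distributed random variables is defined as the product of the conditional probabilities of each individual random variable conditioned on all the rest. For CorrLog (4), the pseudo likelihood is given by

$$\tilde{p}(\mathbf{y}|\mathbf{x}; \Theta) = \prod_{i=1}^{m} p(y_i|\mathbf{y}_{-i}, \mathbf{x}; \Theta), \tag{10}$$

where $\mathbf{y}_{-i} = [y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_m]$ and the conditional probability $p(y_i|\mathbf{y}_{-i}, \mathbf{x}; \Theta)$ can be directly obtained from (4),

$$p(y_i|\mathbf{y}_{-i}, \mathbf{x}; \Theta) = \frac{1}{1 + \exp\Big\{-2 y_i \Big(\boldsymbol{\beta}_i^T \mathbf{x} + \sum_{j=i+1}^{m} \alpha_{ij} y_j + \sum_{j=1}^{i-1} \alpha_{ji} y_j\Big)\Big\}}. \tag{11}$$

Accordingly, the negative log pseudo likelihood over the training data $\mathcal{D}$ is given by

$$\tilde{\mathcal{L}}(\Theta) = -\frac{1}{n} \sum_{l=1}^{n} \sum_{i=1}^{m} \log p\big(y_i^{(l)}|\mathbf{y}_{-i}^{(l)}, \mathbf{x}^{(l)}; \Theta\big). \tag{12}$$

The optimal model parameter $\hat{\Theta}$ of CorrLog can then be learned approximately by the elastic net regularized MPLE,

$$\hat{\Theta} = \arg\min_{\Theta} \tilde{\mathcal{L}}_r(\Theta) = \arg\min_{\Theta}\; \tilde{\mathcal{L}}(\Theta) + R_{en}(\Theta; \lambda_1, \lambda_2, \epsilon), \tag{13}$$

where $\lambda_1$, $\lambda_2$ and $\epsilon$ are tuning parameters.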
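The negative log pseudo likelihood can be evaluated without touching the partition function, as the sketch below illustrates. It assumes labels in $\{-1, +1\}$, a per-label conditional of the logistic form $1/(1+\exp\{-2 y_i f_i\})$ with $f_i = \boldsymbol{\beta}_i^T \mathbf{x} + \sum_{j \ne i} \alpha_{ij} y_j$, and averaging over the $n$ training examples; the function name and the matrix storage of the pairwise coefficients are illustrative assumptions.

```python
import numpy as np

def neg_log_pseudo_likelihood(X, Y, B, A):
    """Average negative log pseudo likelihood over a training set.

    X: (n, D) features; Y: (n, m) labels in {-1, +1};
    B: (m, D) feature coefficients; A: (m, m), strict upper triangle used.
    """
    S = np.triu(A, k=1)
    S = S + S.T                # symmetric couplings with zero diagonal
    F = X @ B.T + Y @ S        # F[l, i] = beta_i^T x_l + sum_{j != i} alpha_ij y_lj
    # -log p(y_i | y_-i, x) = log(1 + exp(-2 y_i F_i)), summed over labels
    return np.mean(np.sum(np.logaddexp(0.0, -2.0 * Y * F), axis=1))
```

Each example costs a few matrix products rather than a sum over $2^m$ configurations, which is why the pseudo likelihood is attractive when the label set is large.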

A First-Order Method by Soft Thresholding: Problem (13) is a convex optimization problem, thanks to the convexity of the logarithmic loss function and the elastic net regularization, and thus has a unique optimal solution. However, the elastic net regularization is non-smooth due to the $\ell_1$ norm regularizer, which makes direct gradient-based algorithms inapplicable. The main idea of our algorithm for solving (13) is to divide the objective function into smooth and non-smooth parts, and then apply the soft thresholding technique to deal with the non-smoothness.

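Denoting by $J_s(\Theta)$ the smooth part of $\tilde{\mathcal{L}}_r(\Theta)$, i.e.,

$$J_s(\Theta) = \tilde{\mathcal{L}}(\Theta) + \lambda_1 \sum_{i=1}^{m} \|\boldsymbol{\beta}_i\|_2^2 + \lambda_2 \sum_{i<j} |\alpha_{ij}|^2, \tag{14}$$

its gradient at the $t$-th iteration $\Theta^{(t)}$ is given by

$$\nabla_{\boldsymbol{\beta}_i} J_s^{(t)} = \frac{1}{n} \sum_{l=1}^{n} \xi_{li}\, \mathbf{x}^{(l)} + 2 \lambda_1 \boldsymbol{\beta}_i^{(t)}, \qquad \nabla_{\alpha_{ij}} J_s^{(t)} = \frac{1}{n} \sum_{l=1}^{n} \big(\xi_{li}\, y_j^{(l)} + \xi_{lj}\, y_i^{(l)}\big) + 2 \lambda_2 \alpha_{ij}^{(t)}, \tag{15}$$

with

$$\xi_{li} = \frac{-2 y_i^{(l)}}{1 + \exp\Big\{2 y_i^{(l)} \Big(\boldsymbol{\beta}_i^{(t)T} \mathbf{x}^{(l)} + \sum_{j=i+1}^{m} \alpha_{ij}^{(t)} y_j^{(l)} + \sum_{j=1}^{i-1} \alpha_{ji}^{(t)} y_j^{(l)}\Big)\Big\}}. \tag{16}$$

Then, a surrogate $J(\Theta; \Theta^{(t)})$ of the objective function in (13) can be obtained by using $\nabla J_s(\Theta^{(t)})$, i.e.,

$$J(\Theta; \Theta^{(t)}) = J_s(\Theta^{(t)}) + \sum_{i=1}^{m} \Big[\big\langle \nabla_{\boldsymbol{\beta}_i} J_s^{(t)}, \boldsymbol{\beta}_i - \boldsymbol{\beta}_i^{(t)} \big\rangle + \frac{\eta}{2} \big\|\boldsymbol{\beta}_i - \boldsymbol{\beta}_i^{(t)}\big\|_2^2 + \lambda_1 \epsilon \|\boldsymbol{\beta}_i\|_1\Big] + \sum_{i<j} \Big[\nabla_{\alpha_{ij}} J_s^{(t)} \big(\alpha_{ij} - \alpha_{ij}^{(t)}\big) + \frac{\eta}{2} \big(\alpha_{ij} - \alpha_{ij}^{(t)}\big)^2 + \lambda_2 \epsilon |\alpha_{ij}|\Big]. \tag{17}$$

The parameter $\eta$ in (17) serves a role similar to that of the step size in gradient descent methods, and it is set such that $\eta$ is larger than the Lipschitz constant of $\nabla J_s(\Theta)$. For such $\eta$, it can be shown that $J(\Theta; \Theta^{(t)}) \ge \tilde{\mathcal{L}}_r(\Theta)$ and $J(\Theta^{(t)}; \Theta^{(t)}) = \tilde{\mathcal{L}}_r(\Theta^{(t)})$. Therefore, the update of $\Theta$ can be realized by the minimization

$$\Theta^{(t+1)} = \arg\min_{\Theta} J(\Theta; \Theta^{(t)}), \tag{18}$$

which is solved by the soft thresholding function $S(\cdot)$, i.e.,

$$\boldsymbol{\beta}_i^{(t+1)} = S\Big(\boldsymbol{\beta}_i^{(t)} - \frac{1}{\eta} \nabla_{\boldsymbol{\beta}_i} J_s^{(t)};\; \frac{\lambda_1 \epsilon}{\eta}\Big), \qquad \alpha_{ij}^{(t+1)} = S\Big(\alpha_{ij}^{(t)} - \frac{1}{\eta} \nabla_{\alpha_{ij}} J_s^{(t)};\; \frac{\lambda_2 \epsilon}{\eta}\Big), \tag{19}$$

where $S$ acts elementwise as

$$S(u; \rho) = \operatorname{sign}(u) \max\big(|u| - \rho,\, 0\big). \tag{20}$$

Iteratively applying (19) until convergence provides a first-order method for solving (13). Algorithm 1 presents the pseudo code for this procedure.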
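The iteration described above (a gradient step on the smooth part followed by soft thresholding) can be sketched in NumPy as follows. This is not Algorithm 1 itself but a minimal illustration under stated assumptions: labels in $\{-1, +1\}$, the logistic pseudo likelihood loss averaged over examples, an elastic net penalty with $\ell_1$ weights $\lambda_1 \epsilon$ and $\lambda_2 \epsilon$, a fixed number of iterations instead of a convergence test, and a user-supplied `eta` that must exceed the Lipschitz constant of the smooth gradient. The names `soft_threshold`, `objective` and `fit_corrlog` are hypothetical.

```python
import numpy as np

def soft_threshold(u, rho):
    """Elementwise soft thresholding: sign(u) * max(|u| - rho, 0)."""
    return np.sign(u) * np.maximum(np.abs(u) - rho, 0.0)

def objective(X, Y, B, A, lam1, lam2, eps):
    """Regularized negative log pseudo likelihood (A strictly upper triangular)."""
    F = X @ B.T + Y @ (A + A.T)
    loss = np.mean(np.sum(np.logaddexp(0.0, -2.0 * Y * F), axis=1))
    pen = (lam1 * (np.sum(B ** 2) + eps * np.abs(B).sum())
           + lam2 * (np.sum(A ** 2) + eps * np.abs(A).sum()))
    return loss + pen

def fit_corrlog(X, Y, lam1, lam2, eps, eta, n_iter=500):
    """Proximal gradient sketch: gradient step, then soft thresholding."""
    n, D = X.shape
    m = Y.shape[1]
    B = np.zeros((m, D))
    A = np.zeros((m, m))  # couplings alpha_ij kept in the strict upper triangle
    for _ in range(n_iter):
        F = X @ B.T + Y @ (A + A.T)
        # Derivative of the averaged loss w.r.t. F:
        # -2y / (1 + exp(2yF)) / n, written stably via tanh.
        G = -Y * (1.0 - np.tanh(Y * F)) / n
        grad_B = G.T @ X + 2.0 * lam1 * B
        M = G.T @ Y
        grad_A = np.triu(M + M.T, k=1) + 2.0 * lam2 * A
        B = soft_threshold(B - grad_B / eta, lam1 * eps / eta)
        A = soft_threshold(A - grad_A / eta, lam2 * eps / eta)
    return B, A
```

A call such as `fit_corrlog(X, Y, lam1=0.01, lam2=0.01, eps=0.1, eta=20.0)` returns the feature weights and the upper-triangular coupling matrix; if `eta` is chosen too small the objective is no longer guaranteed to decrease from one iteration to the next.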

Remark 1 From the above derivation, especially equations (15) and (19), the computational complexity of our learning algorithm is linear with respect to the label number $m$. Therefore, learning CorrLog is no more expensive than learning independent logistic regressions, which makes CorrLog scalable to the case of large label numbers.

Remark 2 It is possible to further speed up the learning algorithm. As presented, Algorithm 1 usually converges as slowly as standard gradient descent methods. It can, however, be modified to attain the optimal convergence rate in the sense of Nemirovsky and Yudin [52], i.e., $O(1/t^2)$, where $t$ is the number of iterations. To do so, we only need to replace the current variable in the surrogate (17) by a weighted combination of the variables from previous iterations. As this modification is a direct application of the fast iterative shrinkage thresholding algorithm (FISTA) [53], we do not present the details here but refer readers to the reference.

### III-B Joint Prediction by Message Passing

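For MLC, as the labels are not independent in general, the prediction task is actually a joint maximum a posteriori (MAP) estimation over $p(\mathbf{y}|\mathbf{x})$. In the case of CorrLog, suppose the model parameter $\hat{\Theta}$ is learned by the regularized MPLE from the last subsection; the prediction $\hat{\mathbf{y}}$ for a new instance $\mathbf{x}$ can then be obtained by

$$\hat{\mathbf{y}} = \arg\max_{\mathbf{y} \in \mathcal{Y}} p(\mathbf{y}|\mathbf{x}; \hat{\Theta}) = \arg\max_{\mathbf{y} \in \mathcal{Y}} \exp\Big\{\sum_{i=1}^{m} y_i \hat{\boldsymbol{\beta}}_i^T \mathbf{x} + \sum_{i<j} \hat{\alpha}_{ij}\, y_i y_j\Big\}. \tag{21}$$

We use belief propagation (BP) to solve (21) [54]. Specifically, we run the max-product algorithm with uniformly initialized messages and an early stopping criterion of 50 iterations. Since the graphical model defined by $\hat{\boldsymbol{\alpha}}$ in (21) has loops, we cannot guarantee the convergence of the algorithm. However, we found that it works well on all experiments in this paper.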
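For small label sets, the MAP problem can be solved exactly by enumeration, which gives a convenient reference against which an approximate max-product implementation can be checked. The sketch below is this exhaustive substitute, not the belief propagation procedure used in the paper; it assumes labels in $\{-1, +1\}^m$ and the score $\sum_i y_i \boldsymbol{\beta}_i^T \mathbf{x} + \sum_{i<j} \alpha_{ij} y_i y_j$, and the function name is hypothetical.

```python
import itertools
import numpy as np

def map_predict_exhaustive(x, B, A):
    """Exact MAP label vector by scoring all 2^m candidates (small m only)."""
    s = B @ x              # unary scores beta_i^T x
    U = np.triu(A, k=1)    # pairwise couplings alpha_ij, i < j
    best_y, best_score = None, -np.inf
    for y in itertools.product([-1.0, 1.0], repeat=len(s)):
        y = np.array(y)
        score = y @ s + y @ U @ y
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```

With all couplings set to zero the result reduces to the sign of each unary score, i.e., the ILR prediction; a sufficiently strong positive coupling can flip a weakly negative label so that it agrees with a confident neighbour.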

## IV Generalization Analysis

An important issue in designing a machine learning algorithm is generalization, i.e., how the algorithm will perform on the test data compared to on the training data. In this section, we present a generalization analysis for CorrLog, by using the concept of algorithmic stability [55]. Our analysis follows two steps. First, we show that the learning of CorrLog by MPLE is stable, i.e., the learned model parameter $\hat{\Theta}$ does not vary much given a slight change of the training data set $\mathcal{D}$. Then, we prove that the generalization error of CorrLog can be bounded by the empirical error, plus a term related to the stability but independent of the label number $m$.

### IV-A The Stability of MPLE

The stability of a learning algorithm indicates how much the learned model changes according to a small change of the training data set. Denote by $\mathcal{D}^{k}$ a modified training data set that is the same as $\mathcal{D}$ except that the $k$-th training example $(\mathbf{x}^{(k)}, \mathbf{y}^{(k)})$ is replaced by another independent example $(\mathbf{x}', \mathbf{y}')$. Suppose $\hat{\Theta}$ and $\hat{\Theta}^{k}$ are the model parameters learned by MPLE (13) on $\mathcal{D}$ and $\mathcal{D}^{k}$, respectively. We intend to show that the difference between these two models, defined as

(22) |

is bounded by an order of $O(1/n)$, so that the learning is stable for large $n$.

First, we need the following auxiliary model $\hat{\Theta}^{\setminus k}$ learned on $\mathcal{D}^{\setminus k}$, which is the same as $\mathcal{D}$ but without the $k$-th example

(23) |

where

(24) |

The following Lemma provides an upper bound of the difference .

###### Proof.

Next, we show a lower bound of the difference .

###### Proof.

In addition, by checking the Lipschitz continuous property of , we have the following Lemma 3.

###### Proof.

First, we have

and

That is, is Lipschitz continuous with respect to and , with constant and , respectively. Therefore, Lemma 3 holds. ∎

By combining the above three Lemmas, we have the following Theorem 1 that shows the stability of CorrLog.

###### Theorem 1.

Given model parameters $\hat{\Theta}$ and $\hat{\Theta}^{k}$ learned on training datasets $\mathcal{D}$ and $\mathcal{D}^{k}$, respectively, both by (13), it holds that

(30) |

### IV-B Generalization Bound

We first define a loss function to measure the generalization error. Considering that CorrLog predicts labels by MAP estimation, we define the loss function by using the log probability

(35) |

where the constant and

(36) |

The loss function (35) is defined analogously to the loss function used in binary classification, where is replaced with the margin if a linear classifier is used. Besides, (35) gives a 0 loss only if all dimensions of are correctly predicted, which emphasizes the joint prediction in MLC. By using this loss function, the generalization error and the empirical error are given by

(37) |

and

(38) |

According to [55], an exponential bound exists for if CorrLog has a uniform stability with respect to the loss function (35). The following Theorem 2 shows this condition holds.

###### Theorem 2.

Given model parameters $\hat{\Theta}$ and $\hat{\Theta}^{k}$ learned on training datasets $\mathcal{D}$ and $\mathcal{D}^{k}$, respectively, both by (13), it holds for ,

(39) |

###### Proof.

First, we have the following inequality from (35)

(40) |

Then, by introducing notation

(41) |

and rewriting

(42) |

we have

(43) |

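Due to the fact that for any functions $f_1$ and $f_2$ it holds^{2}

$$\Big|\max_{u} f_1(u) - \max_{u} f_2(u)\Big| \le \max_{u} \big|f_1(u) - f_2(u)\big|, \tag{44}$$

we have

(45) |

^{2} Suppose $u_1^*$ and $u_2^*$ maximize $f_1$ and $f_2$ respectively, and without loss of generality $f_1(u_1^*) \ge f_2(u_2^*)$; we have $|f_1(u_1^*) - f_2(u_2^*)| = f_1(u_1^*) - f_2(u_2^*) \le f_1(u_1^*) - f_2(u_1^*) \le \max_{u} |f_1(u) - f_2(u)|$.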

Then, the proof is completed by applying Theorem 1. ∎

Now, we are ready to present the main theorem on the generalization ability of CorrLog.

###### Theorem 3.

Given the model parameter $\hat{\Theta}$ learned by (13), with i.i.d. training data $\mathcal{D}$ and regularization parameters $\lambda_1$, $\lambda_2$, it holds with probability at least $1 - \delta$,

(46) |

###### Proof.

Remark 3 A notable observation from Theorem 3 is that the generalization bound (46) of CorrLog is independent of the label number $m$. Therefore, CorrLog is preferable for MLC with a large number of labels, for which the generalization error can still be bounded with high confidence.

Remark 4 While the learning of CorrLog (13) utilizes the elastic net regularization $R_{en}(\Theta; \lambda_1, \lambda_2, \epsilon)$, where $\epsilon$ is the weighting parameter on the $\ell_1$ regularization to encourage sparsity, the generalization bound (46) is independent of the parameter $\epsilon$. The reason is that $\ell_1$ regularization does not lead to stable learning algorithms [56], and only the $\ell_2$ regularization in $R_{en}$ contributes to the stability of CorrLog.

## V Toy Example

We design a simple toy example to illustrate the capacity of CorrLog on label correlation modelling. In particular, we show that when ILRs fail drastically due to ignoring the label correlations (under-fitting), CorrLog performs well. Consider a two-label classification problem on a 2-D plane, where each instance is sampled uniformly from the unit disc and the corresponding labels are defined by

where
