Multi-label learning deals with the problem where one instance is associated with multiple labels; for example, a news document can simultaneously be labeled as sports, Olympics, and ticket sales. Formally, let $\mathcal{X} \subseteq \mathbb{R}^d$ denote the $d$-dimensional feature space and $\mathcal{Y} = \{1, 2, \ldots, K\}$ denote the label space with $K$ class labels. Given the multi-label training set $\mathcal{D} = \{(\mathbf{x}_i, Y_i)\}_{i=1}^{n}$, where $n$ is the number of instances, $\mathbf{x}_i \in \mathcal{X}$ is the feature vector for the $i$-th instance, and $Y_i \subseteq \mathcal{Y}$ is the set of labels associated with the $i$-th instance, the task of multi-label learning is to learn a function $h \colon \mathcal{X} \to 2^{\mathcal{Y}}$ from $\mathcal{D}$ which can assign a set of proper labels to an instance.
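To make the setup concrete, the following minimal sketch shows one way to lay out such a training set in code; the names `X`, `Y`, `n`, `d`, and `K` are illustrative, not taken from the paper.

```python
import numpy as np

n, d, K = 100, 20, 5           # instances, feature dimension, number of labels
rng = np.random.default_rng(0)
X = rng.standard_normal((n, d))                 # instance matrix, one row per instance
Y = (rng.random((n, K)) > 0.7).astype(int)      # binary label matrix: Y[i, k] = 1
                                                # iff label k is relevant to instance i
# The learned function h maps a feature vector to a label set, e.g., the
# indices k with a predicted Y_hat[k] == 1.
```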
One straightforward method to solve the multi-label learning problem is to decompose it into a set of independent binary classification problems. This strategy is easy to implement, and existing single-label classification approaches, e.g., logistic regression and SVM, can be utilized directly. However, as the news document example shows, an instance with the Olympics label is very likely to also carry the sports label. The correlations among the labels may provide useful information for one another and help to improve the performance of multi-label learning [1, 3].
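As a point of reference, such a binary-relevance decomposition can be sketched in a few lines; using scikit-learn's `LogisticRegression` as the per-label base learner is our illustrative choice, not a detail from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def binary_relevance_fit(X, Y):
    """Fit one independent binary classifier per label column of Y (n x K);
    label correlations are ignored entirely, which is the weakness noted above."""
    return [LogisticRegression(max_iter=1000).fit(X, Y[:, k]) for k in range(Y.shape[1])]

def binary_relevance_predict(models, X):
    """Stack the per-label predictions into an n x K binary matrix."""
    return np.column_stack([m.predict(X) for m in models])
```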
Over the past years, many methods have been proposed to improve the performance of multi-label learning by exploiting label correlations. Methods such as classifier chains, calibrated label ranking, and random $k$-labelsets usually have high complexity when the number of class labels is large. The works [7, 8, 9] take the label correlations as prior knowledge and incorporate it into the model training. The works [10, 11] exploit label correlations by learning a latent label representation and optimizing label manifolds. Another line of work explores the correlations by solving an optimization problem that models the contribution of related labels, and then incorporates the learned correlations into the model training. In all of these existing approaches, a well-designed training model is required to achieve notable performance.
Inspired by the merits of first-order methods and taking into account the importance of correlations among the labels, a novel Multi-task Gradient Descent (MGD) algorithm is proposed in this paper to solve the multi-label learning problem. Treating a single-label learning problem as a single task, multi-label learning can be cast as solving multiple related tasks simultaneously. In MGD, each task minimizes its individual cost function using the gradient descent algorithm, and the similarities among the tasks are exploited by transferring model parameter values during the optimization process of each task. We prove the convergence of MGD when the transfer mechanism and the step size of gradient descent satisfy certain easily achievable conditions. Compared with existing approaches, MGD is easy to implement, places fewer requirements on the training model, and achieves seamless asymmetric transfer such that negative transfer is mitigated. In addition, MGD can benefit from parallel computing with only a small amount of information processed centrally when the number of tasks is large.
The rest of the paper is organized as follows. We first review previous work related to multi-label learning. We then introduce the proposed MGD and provide a theoretical analysis, including model convergence and computational complexity. Next, we report extensive experiments on real-world multi-label learning datasets, comparing MGD with strong baselines. Finally, we summarize the proposed approach and the contributions of the paper.
II Related Work
Based on the order of information being considered, existing multi-label learning approaches can be roughly categorized into three major types. First-order methods ignore label correlations and handle the multi-label learning problem in a label-by-label manner, such as BR and LIFT. Second-order methods consider pairwise relations between labels, such as LLSF and JFSC. High-order methods consider relations among label subsets or among all the labels; examples include RAkEL, ECC, LLSF-DL, and CAMEL. Generally, the higher the order of correlations being considered, the stronger the correlation-modeling capability, while on the other hand, the more computationally demanding and less scalable the approach becomes.
Treating a single-label learning problem as one task, the multi-label learning problem can be seen as a special case of the multi-task learning problem in which the feature vectors are the same for all tasks. In the majority of multi-task learning methods, the relations among the tasks are promoted through regularization in an overall objective function composed of all the tasks' parameters, as in feature-based approaches [15, 16, 17, 18, 19] and task-relation-based approaches [20, 21, 22]. Specifically, in the second-order multi-label learning approaches of [7, 8, 9], the label correlation matrix, which is taken as prior knowledge obtained from the similarity between label vectors, is often incorporated as a structured-norm regularization term that regulates the learning hypotheses or performs label-specific feature selection and model training.
In contrast to the existing multi-label and multi-task learning approaches, which incorporate correlation information into model training in the form of regularization, MGD serves as the first attempt to incorporate the correlations by transferring model parameter values during the optimization process of each task, i.e., while each task minimizes its individual cost function.
III The MGD Approach
In this section, we elaborate the proposed MGD algorithm for multi-label learning. We first introduce the mathematical notation used in the manuscript. We then formulate the multi-label learning problem generically and show how MGD solves it via a reformative gradient descent in which correlated parameters are transferred across multiple tasks. Finally, we present a theoretical analysis of MGD, including a convergence proof and the computational complexity.
Throughout this paper, normal-font lowercase letters denote scalars, boldface lowercase letters denote column vectors, and capital letters denote matrices. $\mathbf{0}$ denotes the zero column vector of proper dimension, and $I_n$ denotes the identity matrix of size $n$. $A^{\top}$ denotes the transpose of matrix $A$, and $\otimes$ denotes the Kronecker product. $\operatorname{col}\{\mathbf{a}_1, \ldots, \mathbf{a}_K\}$ denotes the concatenated column vector formed by stacking $\mathbf{a}_1, \ldots, \mathbf{a}_K$ on top of each other, and $\operatorname{diag}\{a_1, \ldots, a_K\}$ denotes the diagonal matrix with $k$-th diagonal element $a_k$. The norm $\|\cdot\|$ without a subscript represents the Euclidean norm by default. Following the notation used in the Introduction, we alternatively represent the training set as $\mathcal{D} = \{X, Y\}$, where $X = [\mathbf{x}_1, \ldots, \mathbf{x}_n]^{\top} \in \mathbb{R}^{n \times d}$ denotes the instance matrix and $Y \in \{0, 1\}^{n \times K}$ denotes the label matrix. In addition, we denote the training set for label $k$ as $\mathcal{D}_k = \{X, \mathbf{y}_k\}$, where $\mathbf{y}_k$ is the $k$-th column vector of the label matrix $Y$.
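As a small sanity check of this notation, the snippet below (with illustrative values) verifies that applying $C \otimes I_d$ to the stacked vector $\operatorname{col}\{\mathbf{w}_1, \ldots, \mathbf{w}_K\}$ is the same as mixing the per-task vectors row-wise by $C$.

```python
import numpy as np

K, d = 3, 2
ws = [np.arange(d, dtype=float) + k for k in range(K)]  # per-task parameter vectors
w = np.concatenate(ws)                                  # col{w_1, ..., w_K}
C = np.full((K, K), 1.0 / K)                            # a row-stochastic example
# (C kron I_d) acting on the stacked vector == mixing the rows of the K x d stack by C
assert np.allclose(np.kron(C, np.eye(d)) @ w, (C @ np.stack(ws)).ravel())
```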
Iii-a Problem Formulation
Treating each single-label learning problem as one task, we have $K$ tasks to be solved simultaneously. Each task aims to minimize its own cost function

$$\min_{\mathbf{w}_k} F_k(\mathbf{w}_k), \qquad k = 1, \ldots, K, \quad (1)$$

where $\mathbf{w}_k \in \mathbb{R}^d$ is the model parameter and $F_k$ is the cost function of the $k$-th task with training dataset $\mathcal{D}_k$. In this paper, we do not restrict the specific form of the cost functions. We only assume that each cost function $F_k$ is strongly convex and twice differentiable, and that the gradient of $F_k$ is Lipschitz continuous with constant $L_k$, i.e.,

$$\|\nabla F_k(\mathbf{w}) - \nabla F_k(\mathbf{w}')\| \le L_k\,\|\mathbf{w} - \mathbf{w}'\|, \qquad \forall\, \mathbf{w}, \mathbf{w}' \in \mathbb{R}^d.$$

Cost functions such as the mean squared error with $\ell_2$-norm regularization and the cross-entropy with $\ell_2$-norm regularization satisfy these assumptions. Non-differentiable cost functions, e.g., those with $\ell_1$-norm regularization, can also be handled approximately through smoothing. Since $F_k$ is strongly convex and twice differentiable, there exists a positive constant $\nu_k$ such that $\nabla^2 F_k(\mathbf{w}) \succeq \nu_k I$. As a result, we have

$$\nu_k I \preceq \nabla^2 F_k(\mathbf{w}) \preceq L_k I, \qquad \forall\, \mathbf{w} \in \mathbb{R}^d.$$
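For concreteness, one standard cost satisfying all of these assumptions is the $\ell_2$-regularized logistic loss; the constants below are the usual ones for this loss, stated for illustration rather than taken from the paper.

```latex
% l2-regularized logistic loss for task k, with labels \tilde{y}_{ik} \in \{-1, +1\}:
F_k(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n}
    \log\bigl(1 + \exp(-\tilde{y}_{ik}\, \mathbf{x}_i^{\top} \mathbf{w})\bigr)
    + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^{2},
\qquad
\nu_k = \lambda, \qquad L_k = \frac{\lVert X \rVert_2^{2}}{4n} + \lambda .
```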
III-B The Proposed Framework
Equation (1) is solved for each task independently using the gradient descent iteration

$$\mathbf{w}_k^{t+1} = \mathbf{w}_k^t - \eta\,\nabla F_k(\mathbf{w}_k^t), \quad (2)$$

where $t$ is the iteration index, $\eta > 0$ is the step size, and $\nabla F_k(\mathbf{w}_k^t)$ is the gradient of $F_k$ at $\mathbf{w}_k^t$. As there are relations among the tasks, we can improve the learning performance by considering the correlation of parameters belonging to different tasks. Based on this idea, we propose a reformative gradient descent iteration which allows the values of the model parameters to be transferred across similar tasks at each iteration. The MGD iteration is designed as follows:

$$\mathbf{w}_k^{t+1} = \sum_{l=1}^{K} c_{kl}\,\mathbf{w}_l^t - \eta\,\nabla F_k(\mathbf{w}_k^t), \quad (3)$$

where $c_{kl}$ is the transfer coefficient describing the information flow from task $l$ to task $k$, which satisfies the following conditions:

$$c_{kl} \ge 0 \ \text{ for all } k, l, \qquad \sum_{l=1}^{K} c_{kl} = 1 \ \text{ for all } k. \quad (4)$$
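A minimal sketch of one MGD step under these conditions, assuming the $K$ parameter vectors are stacked row-wise and a per-task gradient oracle is available (all names are illustrative):

```python
import numpy as np

def mgd_step(W, grads, C, eta):
    """One MGD update, equation (3), for all K tasks at once.

    W     : K x d array, row k is the parameter vector of task k
    grads : K x d array, row k is the gradient of F_k at W[k]
    C     : K x K row-stochastic transfer-coefficient matrix, conditions (4)
    eta   : step size
    """
    # Each task first mixes parameters over related tasks (C @ W),
    # then takes a gradient step on its own cost function.
    return C @ W - eta * grads
```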
III-C Convergence Analysis
In this section, we give the convergence property of the proposed MGD iteration (3). Denote by $\mathbf{w}_k^{\star}$ the best coefficient vector of the label predictor for task $k$, i.e., $\mathbf{w}_k^{\star} = \arg\min_{\mathbf{w}} F_k(\mathbf{w})$, and let $\mathbf{w}^{\star} = \operatorname{col}\{\mathbf{w}_1^{\star}, \ldots, \mathbf{w}_K^{\star}\}$ and $\mathbf{w}^t = \operatorname{col}\{\mathbf{w}_1^t, \ldots, \mathbf{w}_K^t\}$. The following theorem gives the convergence property of the iteration (3) under certain conditions on the step-size parameter $\eta$.
Theorem 1. Under the iteration in (3) with transfer coefficients satisfying (4) and $c_{kk} > 0$ for every $k$, the sequence $\{\mathbf{w}^t\}$ is convergent if the step size is chosen to satisfy

$$0 < \eta < \min_k \frac{2\,c_{kk}}{L_k}. \quad (5)$$
Proof. Denote $\mathbf{w}^t = \operatorname{col}\{\mathbf{w}_1^t, \ldots, \mathbf{w}_K^t\}$, where $\mathbf{w}_k^t$ is the parameter vector of task $k$ at iteration $t$, let $C = [c_{kl}] \in \mathbb{R}^{K \times K}$ collect the transfer coefficients, and define $\mathcal{C} = C \otimes I_d$ and $\mathbf{g}(\mathbf{w}^t) = \operatorname{col}\{\nabla F_1(\mathbf{w}_1^t), \ldots, \nabla F_K(\mathbf{w}_K^t)\}$. Note that we use the calligraphic typeface $\mathcal{C}$ to distinguish the $Kd \times Kd$ block matrix from the $K \times K$ coefficient matrix $C$. Writing (3) in concatenated form gives

$$\mathbf{w}^{t+1} = \mathcal{C}\,\mathbf{w}^t - \eta\,\mathbf{g}(\mathbf{w}^t). \quad (6)$$

Denote $\tilde{\mathbf{w}}^t = \mathbf{w}^t - \mathbf{w}^{\star}$. Since $\nabla F_k(\mathbf{w}_k^{\star}) = \mathbf{0}$, the mean value theorem gives $\nabla F_k(\mathbf{w}_k^t) = H_k^t\,(\mathbf{w}_k^t - \mathbf{w}_k^{\star})$ with $H_k^t = \int_0^1 \nabla^2 F_k\big(\mathbf{w}_k^{\star} + \tau(\mathbf{w}_k^t - \mathbf{w}_k^{\star})\big)\,d\tau$. Subtracting $\mathbf{w}^{\star}$ from both sides of (6) gives

$$\tilde{\mathbf{w}}^{t+1} = \big(\mathcal{C} - \eta\,\mathcal{H}^t\big)\,\tilde{\mathbf{w}}^t + (\mathcal{C} - I)\,\mathbf{w}^{\star}, \quad (7)$$

where $\mathcal{H}^t = \operatorname{diag}\{H_1^t, \ldots, H_K^t\}$.
It can be verified that $\mathcal{H}^t$ is a block diagonal matrix and the block diagonal elements $H_k^t$, $k = 1, \ldots, K$, are Hermitian with $\nu_k I \preceq H_k^t \preceq L_k I$. We use the block maximum norm to show the convergence of the above iteration. The block maximum norm of a vector $\mathbf{x} = \operatorname{col}\{\mathbf{x}_1, \ldots, \mathbf{x}_K\}$ with $\mathbf{x}_k \in \mathbb{R}^d$ is defined as

$$\|\mathbf{x}\|_{b,\infty} = \max_{1 \le k \le K} \|\mathbf{x}_k\|. \quad (8)$$

The induced matrix block maximum norm is therefore defined as

$$\|\mathcal{A}\|_{b,\infty} = \max_{\mathbf{x} \ne \mathbf{0}} \frac{\|\mathcal{A}\,\mathbf{x}\|_{b,\infty}}{\|\mathbf{x}\|_{b,\infty}}. \quad (9)$$
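In code, this norm is simply the largest per-block Euclidean norm (a small illustrative helper):

```python
import numpy as np

def block_max_norm(x, d):
    """Block maximum norm of a stacked vector x = col{x_1, ..., x_K},
    i.e., max_k ||x_k|| over the K consecutive blocks of length d."""
    return np.max(np.linalg.norm(x.reshape(-1, d), axis=1))
```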
From the iteration in (7) we have

$$\|\tilde{\mathbf{w}}^{t+1}\|_{b,\infty} \le \|\mathcal{C} - \eta\,\mathcal{H}^t\|_{b,\infty}\,\|\tilde{\mathbf{w}}^t\|_{b,\infty} + \|(\mathcal{C} - I)\,\mathbf{w}^{\star}\|_{b,\infty}. \quad (10)$$

Splitting the $k$-th block row of $\mathcal{C} - \eta\,\mathcal{H}^t$ into its off-diagonal blocks $c_{kl} I$, $l \ne k$, and its diagonal block $c_{kk} I - \eta H_k^t$, we have

$$\|\mathcal{C} - \eta\,\mathcal{H}^t\|_{b,\infty} \le \max_k \Big[\sum_{l \ne k} c_{kl} + \big\|c_{kk} I - \eta H_k^t\big\|\Big] = \max_k \Big[(1 - c_{kk}) + \max\big\{|c_{kk} - \eta\nu_k|,\, |c_{kk} - \eta L_k|\big\}\Big],$$

where the last equality comes from the fact that $\nu_k I \preceq H_k^t \preceq L_k I$ and the row summation of $C$ is one. Since $0 < \eta < 2c_{kk}/L_k$, we have $\max\{|c_{kk} - \eta\nu_k|, |c_{kk} - \eta L_k|\} < c_{kk}$. Thus, $\|\mathcal{C} - \eta\,\mathcal{H}^t\|_{b,\infty} \le \gamma < 1$, where $\gamma = \max_k\big[(1 - c_{kk}) + \max\{|c_{kk} - \eta\nu_k|, |c_{kk} - \eta L_k|\}\big]$. By the definition of the induced matrix block maximum norm, the map $\mathbf{w} \mapsto \mathcal{C}\mathbf{w} - \eta\,\mathbf{g}(\mathbf{w})$ is a contraction under $\|\cdot\|_{b,\infty}$, so the iteration (6) converges to its unique fixed point. ∎
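The step-size condition (5) and the resulting contraction factor $\gamma$ are easy to check numerically; the constants $\nu_k$, $L_k$, $c_{kk}$, and $\eta$ below are illustrative.

```python
# gamma = max_k [(1 - c_kk) + max(|c_kk - eta*nu_k|, |c_kk - eta*L_k|)] must be < 1
nus, Ls, ckk, eta = [0.01, 0.05], [1.2, 2.0], [0.8, 0.9], 0.5
assert all(eta < 2 * c / L for c, L in zip(ckk, Ls))          # condition (5)
gamma = max((1 - c) + max(abs(c - eta * nu), abs(c - eta * L))
            for c, nu, L in zip(ckk, nus, Ls))
assert gamma < 1.0                                            # contraction holds
```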
In iteration (3), the transfer coefficient between task $k$ and task $l$ is a scalar. In the following, we consider element-wise feature similarities between task $k$ and task $l$. The transfer coefficient between task $k$ and task $l$ is now assumed to be a diagonal matrix $C_{kl} = \operatorname{diag}\{c_{kl}^1, \ldots, c_{kl}^d\}$ whose $j$-th diagonal element $c_{kl}^j$ is the transfer coefficient from the $j$-th element of $\mathbf{w}_l$ to the $j$-th element of $\mathbf{w}_k$. The MGD iteration in (3) then becomes

$$\mathbf{w}_k^{t+1} = \sum_{l=1}^{K} C_{kl}\,\mathbf{w}_l^t - \eta\,\nabla F_k(\mathbf{w}_k^t). \quad (11)$$

Following the same reformulation as before, the concatenated form is

$$\mathbf{w}^{t+1} = \hat{\mathcal{C}}\,\mathbf{w}^t - \eta\,\mathbf{g}(\mathbf{w}^t), \quad (12)$$

where $\hat{\mathcal{C}}$ is the $Kd \times Kd$ block matrix whose $(k, l)$-th block is $C_{kl}$.

Theorem 2. Under (11) with transfer coefficients satisfying

$$c_{kl}^j \ge 0, \qquad \sum_{l=1}^{K} c_{kl}^j = 1, \qquad c_{kk}^j > 0 \quad \text{for all } k \text{ and } j,$$

the sequence $\{\mathbf{w}^t\}$ is convergent if the step size satisfies

$$0 < \eta < \min_k \frac{2\,\min_j c_{kk}^j}{L_k}. \quad (13)$$
Proof. Let the $(k, l)$-th block element of $\hat{\mathcal{C}}$ be $C_{kl}$. Following a procedure similar to the proof of Theorem 1, we obtain

$$\|\tilde{\mathbf{w}}^{t+1}\|_{b,\infty} \le \|\hat{\mathcal{C}} - \eta\,\mathcal{H}^t\|_{b,\infty}\,\|\tilde{\mathbf{w}}^t\|_{b,\infty} + \|(\hat{\mathcal{C}} - I)\,\mathbf{w}^{\star}\|_{b,\infty}.$$

Let $\mathbf{x} = \operatorname{col}\{\mathbf{x}_1, \ldots, \mathbf{x}_K\}$ be a block column vector with $\|\mathbf{x}\|_{b,\infty} = 1$. Recall that each $C_{kl}$ is a diagonal matrix and the elements therein are all no greater than 1; thus, $\|C_{kl}\,\mathbf{x}_l\| \le \big(\max_j c_{kl}^j\big)\,\|\mathbf{x}_l\|$. As a result,

$$\Big\|\sum_{l \ne k} C_{kl}\,\mathbf{x}_l + \big(C_{kk} - \eta H_k^t\big)\,\mathbf{x}_k\Big\| \le \sum_{l \ne k} \max_j c_{kl}^j + \big\|C_{kk} - \eta H_k^t\big\|.$$

By the definition of the matrix block maximum norm, we have

$$\|\hat{\mathcal{C}} - \eta\,\mathcal{H}^t\|_{b,\infty} \le \max_k \Big[\sum_{l \ne k} \max_j c_{kl}^j + \big\|C_{kk} - \eta H_k^t\big\|\Big].$$

The condition to ensure convergence of the iteration in (12) therefore becomes (13), under which the right-hand side above is strictly less than one and the same contraction argument as in Theorem 1 applies. ∎
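The element-wise variant admits an equally compact implementation; here `T` holds the diagonal transfer coefficients as a $K \times K \times d$ array, an illustrative layout rather than one from the paper.

```python
import numpy as np

def mgd_step_elementwise(W, grads, T, eta):
    """Element-wise MGD update, equation (11).

    W, grads : K x d arrays as in mgd_step
    T        : K x K x d array; T[k, l] is the diagonal of C_{kl}, so the
               per-dimension rows must satisfy T[k].sum(axis=0) == 1
    """
    # new W[k, j] = sum_l T[k, l, j] * W[l, j], followed by a gradient step
    return np.einsum('klj,lj->kj', T, W) - eta * grads
```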
III-D Relation with Multi-Task Learning
From the iteration in (3), we have

$$\mathbf{w}_k^{t+1} = \mathbf{w}_k^t - \eta\Big[\nabla F_k(\mathbf{w}_k^t) + \frac{1}{\eta}\Big(\mathbf{w}_k^t - \sum_{l=1}^{K} c_{kl}\,\mathbf{w}_l^t\Big)\Big]. \quad (14)$$

If we fix $\mathbf{w}_l = \mathbf{w}_l^t$ for all $l \ne k$, then the last term in the brackets can be seen as the gradient with respect to $\mathbf{w}_k$ of the following function:

$$G_k(\mathbf{w}_k, \mathbf{w}_{-k}) = F_k(\mathbf{w}_k) + \frac{1}{2\eta}\sum_{l \ne k} c_{kl}\,\|\mathbf{w}_k - \mathbf{w}_l\|^2, \quad (15)$$

where $\mathbf{w}_{-k}$ denotes the collection of the other tasks' variables, i.e., $\mathbf{w}_{-k} = \{\mathbf{w}_l\}_{l \ne k}$; indeed, since $\sum_{l=1}^{K} c_{kl} = 1$, we have $\nabla_{\mathbf{w}_k} G_k = \nabla F_k(\mathbf{w}_k) + \frac{1}{\eta}\big(\mathbf{w}_k - \sum_{l=1}^{K} c_{kl}\,\mathbf{w}_l\big)$. Thus, the iteration in (14) with $\mathbf{w}_{-k}$ fixed can be seen as the gradient descent algorithm which solves the following Nash equilibrium problem:

$$\min_{\mathbf{w}_k}\; G_k(\mathbf{w}_k, \mathbf{w}_{-k}), \qquad k = 1, \ldots, K. \quad (16)$$
In (16), each task's objective function is influenced by the other tasks' decision variables. Since each $G_k$ is continuous in all its arguments, strongly convex with respect to $\mathbf{w}_k$ for fixed $\mathbf{w}_{-k}$, and satisfies $G_k(\mathbf{w}_k, \mathbf{w}_{-k}) \to \infty$ as $\|\mathbf{w}_k\| \to \infty$ for fixed $\mathbf{w}_{-k}$, a Nash equilibrium exists. Furthermore, as a result of the strong convexity, the gradient of $G_k$ with respect to $\mathbf{w}_k$ for fixed $\mathbf{w}_{-k}$ is strongly monotone. Thus, the Nash equilibrium of (16) is unique. Denote the Nash equilibrium of (16) as $\mathbf{w}^{\circ} = \operatorname{col}\{\mathbf{w}_1^{\circ}, \ldots, \mathbf{w}_K^{\circ}\}$. It is known that the Nash equilibrium satisfies the following condition:

$$\nabla F_k(\mathbf{w}_k^{\circ}) + \frac{1}{\eta}\Big(\mathbf{w}_k^{\circ} - \sum_{l=1}^{K} c_{kl}\,\mathbf{w}_l^{\circ}\Big) = \mathbf{0}, \qquad k = 1, \ldots, K. \quad (17)$$

Writing the conditions in (17) in concatenated form gives

$$\mathbf{g}(\mathbf{w}^{\circ}) + \frac{1}{\eta}\,(I - \mathcal{C})\,\mathbf{w}^{\circ} = \mathbf{0}, \quad (18)$$

which is exactly the fixed-point condition of the MGD iteration (6).
It has been pointed out in the literature that regularized multi-task learning algorithms which learn with task relations can be expressed as

$$\min_{W,\,\Omega}\; \sum_{k=1}^{K} L_k(\mathbf{w}_k) + \lambda\,\operatorname{tr}\big(W\,\Omega^{-1}\,W^{\top}\big), \qquad \text{s.t. } \Omega \in \mathcal{S}, \quad (19)$$

where $L_k(\mathbf{w}_k)$ is the training loss of task $k$, $\lambda$ is a positive regularization parameter, $W = [\mathbf{w}_1, \ldots, \mathbf{w}_K]$, $\Omega$ models the task relations, and $\mathcal{S}$ denotes the constraints on $\Omega$. For comparison, we eliminate the constraints on $\Omega$, consider the case that $\Omega$ is fixed, and let $L_k = F_k$. Denote the optimal solution of problem (19) as $\bar{\mathbf{w}} = \operatorname{col}\{\bar{\mathbf{w}}_1, \ldots, \bar{\mathbf{w}}_K\}$ and write $\Omega^{-1} = [\omega_{kl}]$. The optimal solution satisfies the following condition:

$$\nabla F_k(\bar{\mathbf{w}}_k) + 2\lambda \sum_{l=1}^{K} \omega_{kl}\,\bar{\mathbf{w}}_l = \mathbf{0}, \qquad k = 1, \ldots, K. \quad (20)$$

Comparing (20) with (17), if the transfer coefficients are set as $c_{kl} = -2\lambda\eta\,\omega_{kl}$ for $l \ne k$ and $c_{kk} = 1 - 2\lambda\eta\,\omega_{kk}$, the optimal solution $\bar{\mathbf{w}}$ will be the same as $\mathbf{w}^{\circ}$. The only limitation is that $c_{kl} \ge 0$ is required, which cannot cover the situation where there exist positive off-diagonal values in $\Omega^{-1}$. Overall, the regularized multi-task learning problem with task-relation learning can be solved by the MGD algorithm by setting the coefficients between task $k$ and task $l$ properly. In addition, using MGD, we can consider feature-feature relations between different tasks, since we can use a diagonal matrix $C_{kl}$ as the transfer coefficient. Furthermore, in MGD, $c_{kl}$ is not required to be equal to $c_{lk}$. This relaxation allows asymmetric task relations in multi-task learning, which is hard to achieve by most multi-task learning methods since $\Omega$ is always symmetric in (20).
Another category of regularized multi-task learning methods is learning with feature relations. The objective function of this kind of method is

$$\min_{W}\; \sum_{k=1}^{K} L_k(\mathbf{w}_k) + \lambda\,\operatorname{tr}\big(W^{\top}\,\Theta^{-1}\,W\big), \quad (21)$$

where $\Theta$ models the covariance between the features. The term $\operatorname{tr}(W^{\top}\Theta^{-1}W)$ can be decoupled as $\sum_{k=1}^{K} \mathbf{w}_k^{\top}\Theta^{-1}\mathbf{w}_k$, so the $k$-th summand can be incorporated into the cost function $F_k$ of task $k$, and $\Theta$ can be learned from all the tasks' parameters during the optimization process of MGD.
III-E Incorporating Second-Order Label Correlations
The transfer coefficients can be designed or learned in many different ways. In multi-label learning problems, the similarity between task $k$ and task $l$ can be modeled by the correlation between labels $k$ and $l$. In this paper, we use the cosine similarity between the label vectors $\mathbf{y}_k$ and $\mathbf{y}_l$ to calculate the correlation matrix. The proposed MGD is summarized in Algorithm 1.
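One plausible construction of the transfer coefficients from the label matrix is sketched below; the mixing weight `rho`, which caps the total off-diagonal mass of each row while keeping the row sum at one, is our illustrative choice rather than a detail from the paper.

```python
import numpy as np

def transfer_coefficients(Y, rho=0.5):
    """Y: n x K binary label matrix; returns a K x K matrix satisfying (4)."""
    norms = np.linalg.norm(Y, axis=0, keepdims=True)
    S = (Y.T @ Y) / (norms.T @ norms + 1e-12)     # cosine similarity of label vectors
    np.fill_diagonal(S, 0.0)                      # keep only cross-label similarities
    C = rho * S / (S.sum(axis=1, keepdims=True) + 1e-12)
    np.fill_diagonal(C, 1.0 - C.sum(axis=1))      # rows sum to one and c_kk > 0
    return C
```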
After learning the model parameters $\mathbf{w}_k$, we can predict the $k$-th label of a test instance by the corresponding prediction function associated with the cost function, and the final predicted label vector is obtained by concatenating the $K$ per-label predictions.
III-F Complexity Analysis
We mainly analyze the complexity of the iterative part listed in Algorithm 1. In each iteration, the gradient calculation has a complexity of $O(K\,\Phi(d))$, where $\Phi(d)$ denotes the cost of computing one task's gradient with respect to the $d$-dimensional parameter and is determined by the actual cost function, and the update of the model parameters according to (3) needs $O(K^2 d)$. Therefore, the overall complexity of the MGD algorithm is of order $O\big(T\,(K\,\Phi(d) + K^2 d)\big)$, where $T$ is the number of iterations.
IV Experiments

In this section, we extensively compare the proposed MGD algorithm with related approaches on real-world datasets. For the proposed MGD algorithm, we formulate the multi-label learning problem as a set of binary classification tasks. For each classification task, we use the $\ell_2$-norm regularized logistic regression cost. Thus, for any task $k$, the following objective function is optimized by the proposed algorithm:

$$F_k(\mathbf{w}_k) = -\frac{1}{n}\sum_{i=1}^{n}\Big[y_{ik}\,\log \sigma(\hat{\mathbf{x}}_i^{\top}\mathbf{w}_k) + (1 - y_{ik})\,\log\big(1 - \sigma(\hat{\mathbf{x}}_i^{\top}\mathbf{w}_k)\big)\Big] + \frac{\lambda}{2}\,\|\bar{\mathbf{w}}_k\|^2, \quad (22)$$

where $\sigma(z) = 1/(1 + e^{-z})$, $\hat{\mathbf{x}}_i = [1, \mathbf{x}_i^{\top}]^{\top}$ is the feature vector augmented with a bias entry, $\mathbf{w}_k \in \mathbb{R}^{d+1}$ is the model parameter, $\bar{\mathbf{w}}_k$ collects the remaining elements of $\mathbf{w}_k$ except the first (bias) element, and $\lambda$ is the regularization parameter. The gradient of $F_k$ over $\mathbf{w}_k$ is

$$\nabla F_k(\mathbf{w}_k) = \frac{1}{n}\sum_{i=1}^{n}\big(\sigma(\hat{\mathbf{x}}_i^{\top}\mathbf{w}_k) - y_{ik}\big)\,\hat{\mathbf{x}}_i + \lambda\,\big[0,\ \bar{\mathbf{w}}_k^{\top}\big]^{\top}. \quad (23)$$

The MGD iteration is

$$\mathbf{w}_k^{t+1} = \sum_{l=1}^{K} c_{kl}\,\mathbf{w}_l^t - \eta\,\nabla F_k(\mathbf{w}_k^t). \quad (24)$$
The $k$-th label of a test instance $\mathbf{x}$ is predicted as 1 if $\sigma(\hat{\mathbf{x}}^{\top}\mathbf{w}_k) > \tau$ and 0 otherwise, where $\tau$ is the threshold. In the experiments, $\tau$ is chosen from a set of candidate values.
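Putting the pieces together, the following is a compact sketch of the training and prediction procedure described above; the hyper-parameter values (`eta`, `lam`, `tau`, the iteration count) are illustrative, and unlike equation (22) this sketch also regularizes the bias weight for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mgd_logistic(X, Y, C, eta=0.1, lam=0.01, n_iter=500):
    """X: n x d features (append a ones column beforehand for the bias);
    Y: n x K binary labels; C: K x K transfer coefficients satisfying (4)."""
    n = X.shape[0]
    W = np.zeros((Y.shape[1], X.shape[1]))       # row k holds w_k
    for _ in range(n_iter):
        P = sigmoid(X @ W.T)                     # n x K predicted probabilities
        G = (X.T @ (P - Y)).T / n + lam * W      # K x d gradients, cf. eq. (23)
        W = C @ W - eta * G                      # MGD update, eq. (24)
    return W

def predict(W, X, tau=0.5):
    """Label k is predicted relevant when its probability exceeds the threshold tau."""
    return (sigmoid(X @ W.T) > tau).astype(int)
```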
IV-A Experimental Setup
We conduct multi-label classification on six benchmark multi-label datasets, including four regular-scale datasets (emotions, genbase, cal500, and enron) and two large-scale datasets (corel5k and bibtex). The details of the datasets are summarized in Table I.