1 Introduction
In machine learning, multilabel learning refers to the setting where each data item can be associated with multiple classes simultaneously. For example, in image annotation, an image can be annotated with several tags; in document topic analysis, a document can be associated with multiple topics. Although there are several multilabel learning methods in the literature mlknnpr2007 multilabelreviewtkde2014 , most of them require complete labelling of training examples, i.e., for every pair of training example and class label, their association needs to be provided.
However, complete labelling is usually infeasible in practice. Most training instances are only partially labelled, with some or all of the labels missing. Consider the task of large-scale image annotation, where the number of classes/tags is large (e.g., using the labels of ImageNet imagenetcvpr2009 ). In practice, a human annotator can only annotate each training image with a subset of a potentially large and diverse set of tags. Furthermore, in many cases, due to semantic similarities among the tags, some tags are left unchecked, e.g., an image tagged with “German Shepherd” is often not also tagged with “Dog”. Such a learning setting is referred to as the multilabel learning with missing labels (MLML) problem myicpr2014 ; LEMLICML2014 .
As labels are usually related by semantic meaning or co-occurrence, the key to learning from missing labels is a good model of label dependency. One widely used model for label dependency is an undirected graph, through which label information can be propagated among different instances and among different classes. For example, the label dependency between a pair of labels, such as instance similarity and class co-occurrence, can be represented using such a graph (see green and blue edges in Fig. 1). However, as stated in myicpr2014 ; mypr2015 , the class co-occurrence derived from training labels can be inaccurate and biased when many labels are missing. One way to alleviate this is to estimate co-occurrence relations from an auxiliary and possibly more comprehensive source (such as Wikipedia) crbmmlml2015 . Another alternative is to use a class dependency that is independent of the provided labels. One widely used dependency in multilabel learning is the low-rank assumption: the rank of the label matrix, where each row corresponds to one class and each column to one instance, should be smaller than the number of rows (i.e., classes). Although this assumption has been successfully used in many multilabel models multilabelcompressedsensingnips2012 ; LEMLICML2014 , as indicated in multilabellowranksparsekdd2016 , it is difficult to satisfy fully due to the existence of tail labels (i.e., rare labels that occur in very few instances and are therefore difficult to represent as linear combinations of other labels). Instead, the sparse and low-rank decomposition that has been successfully used in other applications such as image alignment imagealignmentpami2012 or visual tracking tianzhusparsecodingeccv2012 can be used in multilabel learning, by assuming that the label matrix can be decomposed into the sum of one sparse matrix and one low-rank matrix. Compared to the pure low-rank assumption, this decomposition is more flexible, which makes the low-rank assumption more plausible in practical multilabel problems. In this work we propose to combine the instance-level similarity with either the class co-occurrence or the sparse and low-rank decomposition.
The semantic dependency between two classes, such as “animal → horse” and “plant → grass” as shown in Fig. 1, can foster further label dependencies and improve label predictions at test time. To exploit it, a new set of constraints is introduced to require that the label score (e.g., the presence probability) of the parent class cannot be lower than that of its child class. This is traditionally referred to as the semantic hierarchical constraint biweiicml2011 ; myiccv2015 . The undirected graph (with instance similarity and class co-occurrence edges or the global sparse and low-rank decomposition) cannot guarantee that the final label predictions will satisfy all semantic hierarchy constraints. To address this problem, we add semantic dependencies into the graph as directed edges, resulting in an overall mixed dependency graph that encourages (or enforces) three types of label dependencies. The graph embedding the class co-occurrence is referred to as the mixed graph with co-occurrence (MGCO), while the one with the sparse and low-rank decomposition is denoted as the mixed graph with sparse and low-rank decomposition (MGSL). Please refer to Fig. 1 for an example of these models.
The goal of this work is to learn from partially labelled training instances and to correctly predict labels of testing instances that satisfy the semantic hierarchical constraints. Motivated by myicpr2014 ; mypr2015 , a discrete objective function is formulated to simultaneously encourage consistency between predicted and ground-truth labels and encode traditional label dependencies (instance similarity with class co-occurrence or with sparse and low-rank decomposition), whereas the semantic hierarchical constraints are incorporated as hard linear constraints in the matrix optimization. The discrete problem is then relaxed to a convex problem, which is solved using ADMM admmboyd2011 .
Contributions:
(1) We address the MLML problem by using a mixed dependency graph to encode a network of label dependencies: instance similarity, class co-occurrence or sparse and low-rank decomposition, as well as the semantic hierarchical constraint. (2) Learning on the mixed dependency graph is formulated as a linearly constrained convex matrix optimization problem that is amenable to efficient solvers. (3) We conduct extensive experiments on the task of image annotation to show the superiority of our method in comparison to the state-of-the-art. (4) We augment the labelling of several widely used datasets, including Corel 5k corel5keccv2002 , ESP Game espgame2004 , IAPRTC12 iaprtc12data2006 and MediaMill mediamilldata2006 , with a semantic hierarchy drawn from WordNet wordnet1998 . This ground-truth augmentation will be made publicly available to enable further research on the MLML problem in computer vision.
Compared to the previous conference version of this work myiccv2015 , the additional novelties in this manuscript are fourfold. (1) We adopt CNN-extracted features on ESP Game and IAPRTC12, for which the original images are available, and the experimental performance is significantly improved compared to that obtained with traditional features. (2) The sparse and low-rank decomposition is utilized as an alternative to the class co-occurrence, leading to further performance improvements. (3) More detailed experimental comparisons are provided to evaluate the influence of different label dependencies. (4) Experimental results on image retrieval are added.
2 Related Work
In the literature of multilabel learning, previous works designed to handle missing labels can be roughly partitioned into four categories. First, the missing labels are directly treated as negative labels, as in semimultilabelsdm2008 ; wellmultilabelweak2010 ; bucakmultiincomplete2011 ; fasttagicml2013 ; Agrawalmlmillionlabelwww2013 ; hashmultilabeleccv2014a ; hashmultilabeleccv2014b ; multilabellinkpredictionaaai2015 . Common to these methods is that label bias is brought into the objective function. As a result, their performance is greatly affected when a large number of ground-truth positive labels are initialized as negative labels. Second, filling in missing labels is treated as a matrix completion (MC) problem, as in MCnips2010 ; MCPosnips2011 ; MCspeednips2013 . The recent LEML method LEMLICML2014 casts the MLML problem into the empirical risk minimization (ERM) framework. Both the MC models and LEML are based on the low-rank assumption on the whole label matrix. In contrast, the sparse and low-rank decomposition was introduced to multilabel learning in a recent work multilabellowranksparsekdd2016 . Third, missing labels are treated as latent variables in probabilistic models, including models based on Bayesian networks multilabelcompressedsensingnips2012 ; bmlcsactivekdd2014 and conditional restricted Boltzmann machines (CRBM). Last, Wu et al. myicpr2014 defined three label states, namely positive labels, negative labels and missing labels, to avoid the label bias. However, the two solutions proposed in myicpr2014 involve matrix inversion, which limits their scalability to larger datasets. Wu et al. mypr2015 proposed an inductive model based on the framework of regularized logistic regression. It also adopts three label states and a hinge loss function to avoid the label bias. However, the classifier parameters corresponding to each class have to be learned sequentially. Furthermore, the computational cost of this method increases significantly with the number of classes, so it becomes prohibitive for very large datasets.
Hierarchical multilabel learning (HML) MLreivew2014 has been applied to problems where a label hierarchy exists, such as image annotation hierarchyimageannotationreviewpr2012 , text classification hmltexticml2005 ; kernelhmltextjmlr2006 and protein function prediction biweiicml2011 ; yuincompletehierarchybmc2015 . Except for a few cases, most existing HML methods only consider the learning problem with complete hierarchical labels. However, in real problems, incomplete hierarchical labels commonly occur, such as in image annotation. Yu et al. yuincompletehierarchybmc2015 recently proposed a method to handle incomplete hierarchical labels. However, the semantic hierarchy and the multilabel learning are used separately, such that the semantic hierarchical constraint cannot be fully satisfied. Deng et al. dengeccv2014 developed a CRF model for object classification. The semantic hierarchical constraint and missing labels are also incorporated into this model. However, a significant difference is that dengeccv2014 focuses on a single object in each instance, while there are multiple objects in each instance in our problem.
In the application of image annotation, both missing labels and semantic hierarchy have been explored in many previous works, such as wellmultilabelweak2010 ; bucakmultiincomplete2011 ; fasttagicml2013 ; tagcompletionpami2013 ; imagetagmissingcvpr2013 ; videoannotationicm2008 ; L1labeldenoisingbmvc2016 ; myaaai2016imbalance ; liaumissingpr2016 (missing labels) and hierarchyimageannotationreviewpr2012 ; mycvpr2017dia ; mycvpr2018d2iagan (semantic hierarchy). However, to the best of our knowledge, no previous work in image annotation has extensively studied missing labels and semantic hierarchy simultaneously. Note that the semantic hierarchical constraint used in our model is similar to the ranking constraint MLcalibratedranking2008 ; bucakmultiincomplete2011 that is widely used in multilabel ranking models, but there are significant differences. First, the ranking constraint used in these models means that the predicted value of a provided positive label should be larger than that of a provided negative label, while the semantic hierarchical constraint involves the ranking of the predicted values between a pair of parent and child classes. Besides, the ranking constraint is always incorporated into the loss function, while the semantic hierarchical constraint is formulated as a linear constraint in our model.
3 Problem and Model
3.1 Problem Definition
Our method takes as input two matrices: a data matrix , which aggregates the dimensional feature vectors of all (training and testing) instances, and a label matrix , which aggregates the dimensional label vectors of all instances. That is to say each instance can take one or more labels from the different classes . Its corresponding label vector determines its membership to each of these classes. For example, if , then is a member of and if , then is not a member of this class. However, if , then the membership of to is considered unknown (i.e., it has a missing label). Correspondingly, all labels of each testing instance are missing, i.e., . The semantic hierarchy is encoded as another matrix: , with being the number of directed edges. denotes the index vector of the th directed edge (see Fig. 1), with and , while all other entries are 0.
Our goal is to obtain a complete label matrix that satisfies the following properties.

(1) The complete label matrix is consistent with the provided (not missing) labels in , i.e., if .
(2) It satisfies the instance-level label similarity: if and have similar features, then their corresponding predicted labels (i.e., the and columns of ) should be similar.
(3) It follows the class-level label similarity: if the co-occurrence between two classes is high, then they are likely to co-exist in many instances, i.e., the corresponding two row vectors of are similar.
(4) It can be decomposed as the sum of a sparse matrix and a low-rank matrix, i.e., with being low rank and being sparse. The rationale of the low-rank assumption is that one class can be represented by its related classes. However, due to the existence of tail labels, the low-rank assumption is unlikely to be satisfied exactly. Thus, the sparse matrix is introduced to absorb the tail labels, so that the remaining label matrix can be low rank.
(5) It is consistent with the semantic hierarchy . If is the parent of , a hard constraint is applied, which guarantees that the score (the presence probability) of is not smaller than the score of . This constraint ensures that the final predicted labels are consistent with the semantic hierarchy.
Note that both criteria (3) and (4) embed class-level label dependencies, with (3) being pairwise and (4) being high-order. We propose two models that combine (1,2,3,5) and (1,2,4,5) respectively. Criteria (3) and (4) could be used together to construct a more general model, but to evaluate their different effects, in this manuscript we evaluate the two models separately. By jointly incorporating all four criteria of model 1 or model 2, the label information is propagated from the provided labels to the missing labels. In what follows, we give a detailed exposition of how these criteria can be mathematically encoded in one unified optimization framework.
3.2 Label Consistency
The label consistency of with is enforced using
(1) 
where , and is defined as , with being a penalty factor for mismatches between and . We set in the following manner. If , then , if , then , and if , then . That is to say, a higher penalty is incurred if a ground-truth label is but is predicted as , as compared to the reverse case. This idea reflects the observation that most entries of in many multilabel datasets (with a relatively large number of classes) are and that labels are rare (see the data statistics in Table 2). Of course, missing labels are not penalized.
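The asymmetric weighting described above can be illustrated with a small NumPy sketch. It is only an illustration under stated assumptions: missing entries are marked with NaN, the weights `w_pos` and `w_neg` are hypothetical values, and a weighted absolute deviation stands in for the loss of Eq. (1), whose exact form may differ.

```python
import numpy as np

def consistency_weights(Y, w_pos=10.0, w_neg=1.0):
    """Per-entry penalty weights: w_pos for positives, w_neg for negatives, 0 for missing (NaN)."""
    W = np.zeros_like(Y, dtype=float)
    W[Y == 1] = w_pos   # hypothetical weight: missing a positive label is costly
    W[Y == 0] = w_neg   # hypothetical weight: a false positive is cheaper
    return W            # NaN entries compare False above and keep weight 0

def weighted_consistency_loss(Y, Z, W):
    """Sum of weighted deviations between provided labels Y and predictions Z in [0, 1]."""
    diff = np.abs(np.nan_to_num(Y, nan=0.0) - Z)
    return float(np.sum(W * diff))

Y = np.array([[1.0, 0.0, np.nan],
              [0.0, np.nan, 1.0]])
Z_miss_pos = np.array([[0.0, 0.0, 0.5], [0.0, 0.5, 1.0]])  # one positive predicted as 0
Z_miss_neg = np.array([[1.0, 1.0, 0.5], [0.0, 0.5, 1.0]])  # one negative predicted as 1
W = consistency_weights(Y)
loss_pos = weighted_consistency_loss(Y, Z_miss_pos, W)
loss_neg = weighted_consistency_loss(Y, Z_miss_neg, W)
```

With these assumed weights, missing a positive label costs more than a false positive, and the predictions at missing entries do not contribute at all.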
3.3 Instance-level Label Dependency
Similar to myicpr2014 ; mypr2015 , we incorporate the instance-level label similarity (i.e., criterion (2)) using the regularization term in Eq. (2).
(2) 
where the instance similarity matrix is defined as: . The kernel size and is the th nearest neighbour of (measured by the Euclidean distance). Similar to myicpr2014 , we set . The normalization term makes the regularization term invariant to different scaling factors of elements in spectraltutorial2007 . The normalized Laplacian matrix is with .
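As a rough illustration of this kind of regularizer, the sketch below builds a Gaussian similarity matrix with a locally scaled kernel and the corresponding normalized Laplacian. The exact kernel of Eq. (2) may differ: the local scale used here (distance to the k-th nearest neighbour), the dense similarity matrix, and the names `normalized_laplacian` and `smoothness` are assumptions made for this example.

```python
import numpy as np

def normalized_laplacian(X, k=3):
    """X is d x n (one column per instance). Returns the n x n normalized Laplacian."""
    n = X.shape[1]
    sq = np.sum(X ** 2, axis=0)
    # Squared Euclidean distances between all pairs of columns.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X.T @ X), 0.0)
    dist = np.sqrt(d2)
    # Local kernel scale: distance from each point to its k-th nearest neighbour
    # (column 0 of the sorted distances is the point itself).
    scale = np.sort(dist, axis=1)[:, k] + 1e-12
    V = np.exp(-d2 / (scale[:, None] * scale[None, :]))
    np.fill_diagonal(V, 0.0)
    deg = V.sum(axis=1)
    inv_sqrt_deg = 1.0 / np.sqrt(deg)
    # L = I - D^{-1/2} V D^{-1/2}
    return np.eye(n) - inv_sqrt_deg[:, None] * V * inv_sqrt_deg[None, :]

def smoothness(Z, L):
    """tr(Z L Z^T): small when similar instances receive similar label columns."""
    return float(np.trace(Z @ L @ Z.T))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))        # 8 toy instances with 4 features each
L_X = normalized_laplacian(X, k=3)
Z = rng.uniform(size=(3, 8))       # 3 classes x 8 instances
value = smoothness(Z, L_X)
```

Because the normalized Laplacian is symmetric positive semidefinite, the trace term is nonnegative and convex in the label matrix, which is the property the later convexity argument relies on.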
3.4 Class-level Label Dependency
Here, we consider three types of class-level label dependencies, namely class co-occurrence, sparse and low-rank decomposition, and semantic hierarchy.
Class co-occurrence: This dependency is encoded using the regularization term in Eq. (3).
(3) 
Here, we define the class similarity matrix as: and . The normalized Laplacian matrix is defined as with .
Sparse and low-rank decomposition: The sparse and low-rank decomposition assumes that the label matrix can be decomposed into the sum of a sparse matrix and a low-rank matrix , as follows,
(4) 
However, it is known that the minimization of is intractable in general nuclearnorm2010 . A widely used solution is to minimize its convex approximation nuclearnormlowrank2002 , i.e., the nuclear norm , with being the th singular value of . Then the approximation of (4) is formulated as
(5) 
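To make the convex surrogate concrete, the following NumPy sketch computes the nuclear norm as the sum of singular values and compares it with the rank of a toy low-rank matrix. The function name `nuclear_norm` and the toy matrix are illustrative assumptions.

```python
import numpy as np

def nuclear_norm(A):
    """Sum of the singular values of A (the convex surrogate of the rank)."""
    return float(np.linalg.svd(A, compute_uv=False).sum())

rng = np.random.default_rng(0)
# A 6 x 10 matrix of rank 2, built as a product of two thin factors.
low_rank = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 10))
rank = int(np.linalg.matrix_rank(low_rank))
nn = nuclear_norm(low_rank)
```

Unlike the rank, the nuclear norm changes continuously with the matrix entries, which is what makes it usable inside a convex objective.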
Semantic hierarchical constraint: To enforce the semantic hierarchical constraint (i.e., criterion (5)), we apply the following constraint: . The resulting constraints can be aggregated in matrix form,
(6) 
where . is the indicator vector of the th directed edge , with and , with all other entries being 0.
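The aggregation of per-edge constraints into one matrix inequality can be sketched as follows. The sign convention (parent entry +1, child entry -1, so that a nonnegative product means the parent score is at least the child score) and the helper names are assumptions chosen for this example.

```python
import numpy as np

def build_edge_matrix(edges, num_classes):
    """edges: list of (parent, child) class indices. Returns an m x n_e indicator matrix."""
    Phi = np.zeros((num_classes, len(edges)))
    for e, (parent, child) in enumerate(edges):
        Phi[parent, e] = 1.0    # assumed convention: +1 at the parent class
        Phi[child, e] = -1.0    # assumed convention: -1 at the child class
    return Phi

def violations(Phi, Z, tol=1e-9):
    """Boolean n_e x n array: True where the child score exceeds the parent score."""
    return (Phi.T @ Z) < -tol

edges = [(0, 1), (1, 2)]   # hypothetical chain: class 0 -> class 1 -> class 2
Phi = build_edge_matrix(edges, num_classes=3)
Z = np.array([[0.9, 0.2],   # scores of class 0 for two instances
              [0.6, 0.5],   # class 1
              [0.1, 0.5]])  # class 2
viol = violations(Phi, Z)
```

In this toy case the second instance violates the first edge, because its child score (0.5) is higher than its parent score (0.2).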
4 MLML using Mixed Dependency Graph with Co-occurrence (MLMGCO)
By combining the four properties formulated in Eqs. (1,2,3,6), we construct a mixed dependency graph to connect all label nodes (i.e., all entries in ), referred to as the mixed dependency graph with co-occurrence (MGCO). Using MGCO, we formulate the MLML problem as a binary matrix optimization problem, where the linear combination of Eqs. (1,2,3) forms the objective and Eq. (6) enforces the semantic hierarchical constraints.
s.t.  (7) 
which is referred to as MLMGCO. The three terms in the objective function correspond to Eqs. (1,2,3) respectively. Due to the binary constraint on , this discrete problem is difficult to solve efficiently. Thus, we use a conventional box relaxation, which relaxes to take on values in . Since both and are positive semidefinite (PSD), it is easy to prove that the relaxed problem in Eq. (8) is a convex quadratic program (QP) with linear matrix constraints (refer to Appendix A for the detailed proof of convexity).
s.t.  (8) 
Due to its convexity and smoothness, the MLMGCO problem can be efficiently solved by many solvers. In this work, we adopt the alternating direction method of multipliers (ADMM) admmboyd2011 , which decomposes the optimization problem into several steps that are easy to implement and intuitive to understand.
4.1 ADMM Algorithm for MLMGCO
Following the conventional ADMM framework admmboyd2011 , we first formulate the augmented Lagrangian function of Problem (8), by introducing a nonnegative slack variable ,
(9) 
where and . Here, is the Lagrange multiplier (dual variable), is a penalty parameter, and denotes the matrix Frobenius norm. Then we want to solve the following problem
(10) 
It can be minimized by alternately solving the following subproblems, with being the iteration index of the ADMM algorithm.
Subproblem with respect to : The update of is obtained by the following subproblem,
(11)  
where , and . Clearly, is positive semidefinite (PSD), so is PSD. Since is also PSD, Problem (11) is a convex quadratic programming (QP) problem with box constraints. It can be efficiently solved using projected gradient descent (PGD) with exact line search boydconvex2004 .
Projected gradient descent. The gradient of the objective function (11) with respect to and the step size are computed as
(12)  
(13)  
where indicates the iteration index of PGD. Then is updated as follows:
(14) 
The result of the final iteration of PGD is used as the solution to Problem (11), i.e., . As Problem (11) is convex, PGD is guaranteed to converge to the global optimal solution. However, to reduce the computational cost, we stop this update step after only a few PGD iterations. This heuristic makes the convergence of the overall ADMM much faster, without any considerable effect on performance.
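A minimal sketch of projected gradient descent with a closed-form step size on a box-constrained convex QP is given below. It works on a generic vector problem rather than the matrix subproblem (11), and the names `pgd_box_qp`, `A` and `b` are assumptions for this example. The step size minimizes the quadratic along the negative gradient while ignoring the box, and the projection is applied afterwards; no additional convergence safeguard is included.

```python
import numpy as np

def pgd_box_qp(A, b, z0, num_iters=50):
    """Minimise 0.5 * z^T A z + b^T z subject to 0 <= z <= 1, with A symmetric PSD."""
    z = z0.copy()
    for _ in range(num_iters):
        g = A @ z + b                   # gradient of the quadratic objective
        curvature = g @ A @ g
        if curvature <= 1e-12:          # zero gradient or flat direction: stop
            break
        eta = (g @ g) / curvature       # exact line search along -g (box ignored)
        z = np.clip(z - eta * g, 0.0, 1.0)  # projection onto the box [0, 1]^n
    return z

A = np.diag([1.0, 2.0])
b = np.array([-2.0, 1.0])
z_star = pgd_box_qp(A, b, z0=np.array([0.5, 0.5]))
```

For this separable toy problem the unconstrained minimizer is (2, -0.5), so the box-constrained solution is its clipped version (1, 0), which the iteration reaches after the first step and then keeps.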
Subproblems with respect to and : The updates for and are available in closed form,
(15)  
(16) 
According to the analysis in admmproof2013 ; admmproof2014 , the above ADMM algorithm is guaranteed to converge to the global minimum of Problem (8). Note that without the semantic hierarchical constraints in (6), Problem (8) can be solved more efficiently by the PGD algorithm alone, rather than by ADMM.
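The overall ADMM pattern (primal update, closed-form slack update, dual update) can be sketched on a much smaller problem: projecting a score vector onto the set where a parent score is at least its child score. This toy objective, the constraint matrix `C`, and the function name `admm_projection` are assumptions for illustration, and the box constraint of the actual subproblem is omitted.

```python
import numpy as np

def admm_projection(a, C, rho=1.0, num_iters=100):
    """Solve min_z 0.5 * ||z - a||^2  s.t.  C z >= 0, using a slack q = C z with q >= 0."""
    n = a.shape[0]
    q = np.zeros(C.shape[0])        # nonnegative slack variable
    lam = np.zeros(C.shape[0])      # Lagrange multiplier (dual variable)
    M = np.eye(n) + rho * C.T @ C   # system matrix of the z-update
    for _ in range(num_iters):
        # z-update: minimise the augmented Lagrangian over z (a linear system here).
        z = np.linalg.solve(M, a - C.T @ lam + rho * C.T @ q)
        # q-update: closed form, projection onto the nonnegative orthant.
        q = np.maximum(C @ z + lam / rho, 0.0)
        # Dual ascent step on the equality C z - q = 0.
        lam = lam + rho * (C @ z - q)
    return z

a = np.array([0.2, 0.8])        # parent score 0.2, child score 0.8 (violates the hierarchy)
C = np.array([[1.0, -1.0]])     # one edge: parent minus child must be >= 0
z_proj = admm_projection(a, C)
```

The closest pair of scores that respects the constraint is (0.5, 0.5), and the iterates approach it geometrically in this example.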
5 MLML using the Mixed Dependency Graph with Sparse and Low Rank Decomposition (MLMGSL)
In this section we propose another formulation of the MLML problem, based on the mixed dependency graph with sparse and low rank decomposition (MGSL) constructed by Eqs. (1,2,5,6), as follows:
s.t.  (17) 
which is referred to as MLMGSL. Similarly, the binary constraint is relaxed to the box constraint , and the relaxed continuous problem becomes
(18)  
s.t. 
Note that we have adopted a new loss term in (18) by introducing a tradeoff parameter . Due to the constraint , this new loss term is equivalent to the old loss term in (17). The benefit is the larger flexibility, leading to a more stable convergence in the optimization process. As demonstrated in imagealignmentpami2012 , both and are convex. Considering the convex smoothness term and the linear constraints, the optimization problem in (18) is also convex. We solve it again using the ADMM algorithm.
5.1 ADMM Algorithm for MLMGSL
The augmented Lagrangian function of Problem (18) is formulated as follows
(19) 
where and are two dual variables, and are penalty parameters. Then we need to solve the following optimization problem
(20) 
which can be solved by alternately optimizing the following subproblems.
Subproblem with respect to :
(21)  
Similar to subproblem (11), it is not hard to see that (21) is convex, and it can also be efficiently solved by the PGD algorithm with line search.
Subproblem with respect to :
(22)  
where we define to save space. denotes the singular value soft-thresholding operator imagealignmentpami2012 , which applies the soft-thresholding operator to the singular values obtained from the SVD .
Subproblem with respect to :
(23)  
where the soft-thresholding operator and are defined as above.
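The two closed-form operators used in these subproblems can be sketched in NumPy as follows. The function names and the threshold argument `tau` are assumptions for this example; the operators themselves are the standard proximal maps of the entrywise l1 norm and of the nuclear norm.

```python
import numpy as np

def soft_threshold(A, tau):
    """Entrywise soft-thresholding: shrink every entry towards zero by tau."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def singular_value_threshold(A, tau):
    """Singular value soft-thresholding: shrink the singular values of A by tau."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt

shrunk = soft_threshold(np.array([3.0, -1.0, 0.5]), 1.0)
low_rank_part = singular_value_threshold(np.diag([3.0, 1.0]), 2.0)
```

Entrywise shrinkage sets small entries exactly to zero (promoting sparsity), while shrinking singular values sets small ones to zero (lowering the rank); here the diagonal matrix with singular values 3 and 1 becomes a rank-one matrix.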
Subproblems with respect to , and :
(24)  
(25)  
(26) 
In terms of convergence, as demonstrated in ADMMmultiblocknotconvergent2016 , the ADMM algorithm for multi-block (more than two blocks) convex optimization is not necessarily convergent. Further assumptions about the objective function or the parameters are needed to guarantee convergence. For example, a recent work ADMMthreeblockconvergence2016 has proved that if the variable sequence generated by the above ADMM algorithm is assumed to satisfy the substrong monotonicity, and the parameters are set in a bounded range, then the algorithm will converge to a KKT solution. Please refer to ADMMthreeblockconvergence2016 for more details.
6 Experiments
In this section, we evaluate the proposed method and the state-of-the-art methods on four benchmark datasets in image annotation and video annotation.
Table 1: Summary of the semantic hierarchies of the four datasets.
dataset                      C1   C2   C3  C4   C5   C6
Corel 5k corel5keccv2002     260  138  37  98   99   5
ESP Game espgame2004         268  129  41  92   120  4
IAPRTC12 iaprtc12data2006    291  179  36  132  98   4
MediaMill mediamilldata2006  101  63   14  52   30   3
6.1 Experimental Setup
Datasets. Four benchmark multilabel datasets are used in our experiments: Corel 5k corel5keccv2002 , ESP Game espgame2004 , IAPRTC12 iaprtc12data2006 , and MediaMill mediamilldata2006 . These datasets are chosen because they are representative and popular benchmarks for comparative analysis among MLML methods. The features and labels of the first three image datasets are downloaded from the seminal work multilabeldatasetimageiccv2009 (http://lear.inrialpes.fr/people/guillaumin/data.php). Each image in these datasets is described by dense SIFT features and is represented by a 1000-dimensional vector. Moreover, the original images of ESP Game and IAPRTC12 are also available, so we can extract other features. It is known that deep features extracted from CNNs show strong performance in many image-based tasks. Thus we also adopt CNN features in our experiments for these two datasets. Specifically, the output of the relu7 layer of the pretrained VGG-F model vggfbmvc2014 (http://www.vlfeat.org/matconvnet/pretrained/) is extracted as a feature vector of 4096 dimensions. The features and labels of the video dataset MediaMill are downloaded from the ‘Mulan’ website (http://mulan.sourceforge.net/datasetsmlc.html).
Semantic hierarchies. We build semantic hierarchies for each dataset based on WordNet wordnet1998 . Specifically, for each dataset, we search for each class in WordNet and extract one or more directed paths (i.e., a long sequence of directed edges from parent class to child class). In each path, we identify the nearest upstream class that is also in the label vocabulary (i.e., the set of all classes of the dataset of interest) as the parent class. This procedure is repeated for all classes in the dataset to form the semantic hierarchy matrix . In the same manner, we build the hierarchy for each of the four datasets. Similar to hierarchyimageannotationreviewpr2012 , we consider two types of semantic dependency: ‘is a’ and ‘is a part of’. For example, a part of the semantic hierarchy of Corel 5k and ESP Game is shown in Fig. 2. Note that not all ‘is a part of’ dependencies are included in the semantic hierarchy, to ensure that the corresponding semantic hierarchical constraint is correct. For example, “tree is a part of forest”, but when ‘tree’ exists in an image, ‘forest’ does not always exist, so we discard this dependency. A summary of these semantic hierarchies is presented in Table 1. The complete semantic hierarchies and the complete label matrices of all four datasets can be downloaded from https://sites.google.com/site/baoyuanwu2015/.
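The parent-selection rule described above (take the nearest upstream class that is also in the label vocabulary) can be sketched in plain Python. The root-to-class paths below are hypothetical stand-ins for paths extracted from WordNet, and the helper names are assumptions for this example.

```python
def nearest_vocab_parent(path, vocab):
    """path: class names from the root down to the class of interest (last element).
    Returns the closest ancestor that is in vocab, or None if there is none."""
    for ancestor in reversed(path[:-1]):
        if ancestor in vocab:
            return ancestor
    return None

def build_edges(paths, vocab):
    """Collect (parent, child) edges between vocabulary classes from all paths."""
    edges = set()
    for path in paths:
        child = path[-1]
        parent = nearest_vocab_parent(path, vocab)
        if child in vocab and parent is not None:
            edges.add((parent, child))
    return sorted(edges)

vocab = {"animal", "dog", "plant", "grass"}          # toy label vocabulary
paths = [                                            # hypothetical root-to-class paths
    ["entity", "animal", "mammal", "canine", "dog"],
    ["entity", "plant", "herb", "grass"],
    ["entity", "animal"],
]
edges = build_edges(paths, vocab)
```

Intermediate classes that are absent from the vocabulary (such as "mammal" here) are skipped, so each resulting edge links two classes of the dataset directly; classes without any in-vocabulary ancestor become roots or singletons.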
Note that in the aforementioned datasets, the provided ground-truth label matrices do not fully satisfy the semantic hierarchical constraints. In other words, some instances are labelled with a child class but not with the corresponding parent class. Therefore, we augment the label matrix according to the semantic hierarchy for each dataset. The semantically enhanced comprehensive ground-truth label matrix is referred to as “complete”, while the originally provided label matrix is referred to as “original”. The basic statistics of both the complete and original label matrices are summarized in Table 2.
Table 2: Basic statistics of the original and complete label matrices.
dataset                      # instances (training, test)  # class  C1    C2    ,            label matrix  C3    C4      C5
Corel 5k corel5keccv2002     4999 = 4500 + 499             260      1000  N/A   20, 10, 100  original      3.40  65.30   1.31%
                                                                                             complete      4.84  93.06   1.86%
MediaMill mediamilldata2006  43907 = 30993 + 12914         101      120   N/A   20, 10, 100  original      4.38  1902    4.33%
                                                                                             complete      6.17  2680    6.10%
ESP Game espgame2004         20770 = 18689 + 2081          268      1000  4096  20, 10, 100  original      4.69  363.2   1.75%
                                                                                             complete      7.27  563.6   2.71%
IAPRTC12 iaprtc12data2006    19627 = 17665 + 1962          291      1000  4096  20, 10, 100  original      5.72  385.71  1.97%
                                                                                             complete      9.88  666.3   3.39%
Methods for comparison. In our methods, semantic hierarchies are used in two places. One is to fill in the original initial label matrix , i.e., if , then is set to . denotes the ancestor classes of class in the semantic hierarchy. If this filling is applied to , it is referred to as filling the initial label matrix, otherwise as not-filling the initial label matrix. The other place is to construct the constraint matrix (see Eq. (6)). To evaluate the influence of these two usages, we compare different variants of our methods, as shown in Table 3. Several state-of-the-art multilabel methods that can also handle missing labels are used for comparison, including MCPos MCPosnips2011 , FastTag fasttagicml2013 , MLMLexact and MLMLapprox myicpr2014 , as well as LEML LEMLICML2014 . FastTag is specially developed for image annotation, while the other methods are general machine learning methods. A state-of-the-art method in hierarchical multilabel learning, called CSSAG biweiicml2011 , is also evaluated. CSSAG is a decoding method that operates on the continuous label matrix predicted by another algorithm, i.e., the kernel dependency estimation (KDE) algorithm kdenips2002 . However, the KDE algorithm does not work in the case of missing labels. To make a fair comparison between CSSAG and our proposed methods, the predicted label matrix of MLMGCO is used as the input of CSSAG. The results are obtained with the publicly available MATLAB source code of these methods provided by their authors. Note that in our previous work myiccv2015 , MLRGL bucakmultiincomplete2011 and the binary SVM were also compared, but here we omit them due to their much higher computation and memory costs than the other compared methods.
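The filling step (every positive label also switches on all ancestor classes of that class) can be sketched as below. Marking missing entries with NaN, representing the hierarchy as (parent, child) index pairs, and the helper names are assumptions for this example.

```python
import numpy as np

def ancestors(edges, num_classes):
    """Map each class index to the set of all its ancestors, given (parent, child) edges."""
    parents = {c: set() for c in range(num_classes)}
    for p, c in edges:
        parents[c].add(p)

    def collect(c, seen):
        for p in parents[c]:
            if p not in seen:
                seen.add(p)
                collect(p, seen)
        return seen

    return {c: collect(c, set()) for c in range(num_classes)}

def fill_ancestors(Y, edges):
    """Return a copy of Y (classes x instances) where ancestors of positive labels are set to 1."""
    filled = Y.copy()
    anc = ancestors(edges, Y.shape[0])
    for c, j in zip(*np.where(Y == 1)):
        for a in anc[int(c)]:
            filled[a, j] = 1.0
    return filled

edges = [(0, 1), (1, 2)]                  # hypothetical chain: class 0 -> class 1 -> class 2
Y = np.array([[np.nan, 0.0],
              [0.0, np.nan],
              [1.0, 0.0]])                # NaN marks a missing label (assumption)
Y_filled = fill_ancestors(Y, edges)
```

In the first instance the positive leaf label propagates to both ancestors, overwriting a missing entry and a negative entry; the second instance has no positive label and is left unchanged.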
Table 3: Variants of the proposed methods.
model                  MLMGCO                                              MLMGSL
constraint , initial   not-filling          filling                        not-filling          filling
without SH constraint  MLMGCO               MLMGCO + filling               MLMGSL               MLMGSL + filling
with SH constraint     MLMGCO + constraint  MLMGCO + filling + constraint  MLMGSL + constraint  MLMGSL + constraint + filling
Evaluation metrics. Average precision (AP) multilabelevaluationtkdd2010 is adopted to measure the ranking performance of the predicted labels of each instance, i.e., the ranking performance of each column vector in the continuous label matrix . Mean average precision (mAP) informationretrieval2008 is also adopted to evaluate the performance of tag-based image retrieval, i.e., the ranking performance of each row vector in . To quantify the degree to which the semantic hierarchical constraints are violated, we adopt a simplified hierarchical Hamming loss, similar to hmltexticml2005 ,
(27)  
where denotes the discrete label matrix generated by setting the top labels in the continuous label vector of each instance as , while all others as . denotes the complete ground-truth label matrix. indicates the logical AND operator. denotes the indicator function: if is true, then , otherwise . The above equation counts the cases in which, with respect to the ground truth , the predicted label of the parent class is correct (i.e., ) but the label of the child class is incorrect (i.e., ). Such a case indicates a violation of the semantic hierarchical constraints. Then we define an average hierarchical loss (AHL) as . In experiments we set on MediaMill, while on other datasets.
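The ranking metric and the top-k discretization used above can be sketched in NumPy as follows. The function names are assumptions for this example; the AP definition is the usual one (mean of the precision values at the ranks of the positive items), and the hierarchical loss of Eq. (27) itself is not reproduced.

```python
import numpy as np

def average_precision(scores, truth):
    """AP of one ranked list: mean of precision@rank over the ranks of the positive items."""
    order = np.argsort(-scores, kind="stable")     # indices sorted by decreasing score
    hits = truth[order] == 1
    if not hits.any():
        return 0.0
    precision_at_rank = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float(precision_at_rank[hits].mean())

def binarize_top_k(Z, k):
    """Set the k highest-scoring classes of each instance (column) to 1, all others to 0."""
    top = np.argsort(-Z, axis=0, kind="stable")[:k]
    B = np.zeros_like(Z, dtype=int)
    np.put_along_axis(B, top, 1, axis=0)
    return B

ap = average_precision(np.array([0.9, 0.8, 0.1]), np.array([1, 0, 1]))
Z = np.array([[0.9, 0.1],
              [0.5, 0.8],
              [0.2, 0.7]])       # 3 classes x 2 instances
B = binarize_top_k(Z, k=2)
```

Applying `average_precision` to each column of the continuous label matrix gives the per-instance AP, while applying it to each row gives the per-class values that are averaged into mAP for retrieval.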
Other settings. To simulate different scenarios with missing labels, we create training datasets with varying proportions of missing labels, ranging from to . Given a missing label proportion , we first randomly sample rounding entries in the training label matrix, with being the number of training instances. Then, for every sampled entry, we check whether it corresponds to a leaf or singleton class in the semantic hierarchies constructed above: if so, it is chosen as a missing label; otherwise its original value in the training label matrix is kept. Consequently, the number of missing labels is smaller than rounding. The reason for this setting is that if missing labels were generated on root and intermediate classes, many of them could be directly inferred as positive labels using the semantic hierarchical constraint. Specifically, given a missing label generated on a root or intermediate class, if any of its descendant classes is positive, then this missing label can easily be corrected to positive. Note that this setting is more favourable to the compared methods that do not utilize the semantic hierarchical constraint. We repeat the above process 5 times to obtain different missing labels. In all cases, the experimental results on testing data are computed based on the complete label matrix. The reported results are summarized as the mean and standard deviation over all runs. The tradeoff parameters of MLMGCO ( and ) and MLMGSL (, , and ) are tuned by cross-validation. Specifically, for MLMGCO, we set the tuning ranges as , and ; for MLMGSL, they are , , and . and are defined as sparse matrices. The numbers of neighbors of each instance/class and are set as and , respectively.
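The missing-label simulation described above can be sketched as follows. The number of sampled entries is assumed here to be the missing proportion times the size of the training label matrix, rounded; NaN is used as the missing marker, and `eligible_classes` stands for the indices of leaf and singleton classes. All of these names are assumptions for this example.

```python
import numpy as np

def simulate_missing(Y, proportion, eligible_classes, rng):
    """Y: classes x instances label matrix. Returns a float copy with NaN at simulated missing entries."""
    m, n = Y.shape
    num_sampled = int(round(proportion * m * n))           # assumed sample count
    flat_idx = rng.choice(m * n, size=num_sampled, replace=False)
    rows, cols = np.unravel_index(flat_idx, (m, n))
    # Only sampled entries on leaf or singleton classes become missing labels.
    keep = np.isin(rows, list(eligible_classes))
    Y_missing = Y.astype(float).copy()
    Y_missing[rows[keep], cols[keep]] = np.nan
    return Y_missing

rng = np.random.default_rng(0)
Y = np.zeros((4, 10))
eligible = {2, 3}                  # hypothetical leaf/singleton class indices
Y_missing = simulate_missing(Y, proportion=0.5, eligible_classes=eligible, rng=rng)
num_missing = int(np.isnan(Y_missing).sum())
```

Because sampled entries on root and intermediate classes keep their original values, the realized number of missing labels never exceeds the number of sampled entries.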
An acceleration heuristic. The computation of the step size (see Eq. (13)) in MLMGCO takes about of the running time in each iteration. However, we observe that the step sizes in consecutive iterations tend to be very close. Thus, we only compute the step size exactly once in every 5 iterations, while each of the other step sizes is derived by multiplying the step size of the previous iteration by a damping factor ( in our experiments). Compared to the case where the step size is computed exactly in each iteration, the runtime is significantly reduced to about (this value depends on ) with a negligible effect on prediction performance.
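The step-size reuse scheme can be sketched in a few lines of Python. The function `exact_step` is a placeholder for the exact computation of Eq. (13), and the default damping value is a hypothetical choice for this example.

```python
def step_size_schedule(exact_step, num_iters, period=5, damping=0.9):
    """exact_step(t) returns the exact line-search step size at iteration t.
    The exact value is computed once every `period` iterations; in between,
    the previous step size is multiplied by `damping` (hypothetical default)."""
    steps = []
    for t in range(num_iters):
        if t % period == 0:
            eta = exact_step(t)
        else:
            eta = damping * steps[-1]
        steps.append(eta)
    return steps

calls = []

def fake_exact_step(t):
    """Stand-in for the expensive exact computation; records when it is called."""
    calls.append(t)
    return 1.0

steps = step_size_schedule(fake_exact_step, num_iters=7, period=5, damping=0.5)
```

In this toy run the expensive computation is invoked only at iterations 0 and 5, and the remaining step sizes are obtained by repeated damping.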
6.2 Results without Semantic Hierarchical Constraints
Figs. 3 and 4 present AP and mAP results when the semantic hierarchy is not used as a constraint, i.e., . In this case, the inequality constraints (see (6)) in MLMG are degenerate. The proposed model MLMGCO is then a convex QP with box constraints that is solvable using the PGD algorithm, which is more efficient than the ADMM algorithm. The semantic hierarchy is only used to fill in the missing ancestor labels in the initial label matrix . We report results using both the original initial label matrix and the semantically filled-in initial label matrix, as shown in Figs. 3 and 4, respectively. Using the same initial label matrix and no constraints ensures a fair comparison among the formulations of the different models for the MLML problem.
As shown in Fig. 3, both MLMGCO and MLMGSL consistently outperform the other MLML methods, even without using the semantic hierarchy information. The improvement margin over the most competitive method on the six datasets is at least (AP) or (mAP). Compared with MLMLexact and MLMLapprox, MLMGCO shows significant improvement, especially when large proportions of labels are missing. There are two main reasons. Firstly, there are many noisy negative labels in the original training label matrix, i.e., some positive labels 1 are incorrectly set to 0. Since a larger penalty is incurred when misclassifying a positive label in MLMGCO, the influence of noisy negative labels can be alleviated. This is not the case for either MLMLexact or MLMLapprox. Secondly, MLMGCO does not impose any bias on missing labels. In contrast, missing labels are encouraged to take intermediate values between negative and positive labels in MLMLexact and MLMLapprox, which introduces label bias. This is why their performance decreases significantly as the missing proportion increases.
In terms of the comparison between MLMGCO and MLMGSL, their performance is similar in most cases. However, we observe that when the missing label proportion is small, MLMGCO is slightly better than MLMGSL; as the missing label proportion increases, MLMGSL shows better performance than MLMGCO. Specifically, in the case of missing labels, the relative improvements in AP of MLMGSL over MLMGCO are , on MediaMill, ESP Game (traditional), ESP Game (CNN), IAPRTC12 (traditional) and IAPRTC12 (CNN), respectively, while the relative improvements in mAP are , , accordingly. This is consistent with what is expected from the different assumptions used in MLMGCO and MLMGSL. As the class-level smoothness used in MLMGCO is derived from the initial label matrix, the obtained smoothness is likely to be inaccurate when many labels are missing; in contrast, the sparse and low-rank decomposition used in MLMGSL is independent of the initial label matrix, so it is not affected by the increase in missing labels. Note that on Corel 5k, due to the extremely sparse positive labels in the label matrix (see the positive proportions in Table 2), the SVD step in the MLMGSL algorithm fails to produce a valid solution, so the results of MLMGSL are not reported. Besides, the high memory requirements of MLMLexact and MLMLapprox preclude running them on the MediaMill data.
The results of using the filled-in initial label matrix are shown in Fig. 4. Similarly, both MLMGCO and MLMGSL show much better performance than the other compared methods. In the case of missing labels, the relative improvements in AP of MLMGSL over MLMGCO are on MediaMill, ESP Game (traditional), ESP Game (CNN), IAPRTC12 (traditional) and IAPRTC12 (CNN), respectively, while the relative improvements in mAP are , accordingly.
Comparing Figs. 3 and 4, it is easy to see that the performance of most methods improves significantly when the filled-in initial label matrix is used instead of the original initial label matrix. The main reason is that the performance of any model is significantly influenced by noisy labels (i.e., ground-truth positive labels that are incorrectly set as negative labels in the original initial label matrix). This verifies the contribution of augmenting the ground-truth label matrix using our constructed semantic hierarchies.
6.3 Results with Semantic Hierarchical Constraints
The results of utilizing the semantic hierarchy are shown in Fig. 5. To highlight the influence of semantic hierarchical constraints, here we again report the results of MLMGCO, MLMGCO + filling, MLMGSL and MLMGSL + filling, which have been presented in Section 6.2.
Comparison among four variants of MLMGCO. In Fig. 5, the results of the four variants of MLMGCO are denoted using the lines with the mark, but with different colors. The results of MLMGCO are much inferior to those of the other three variants, because the semantic hierarchy is used neither in the initial label matrix nor as constraints during the optimization. This demonstrates the importance of the semantic hierarchy. MLMGCO + filling and MLMGCO + constraint show similar performance in terms of AP and mAP. For MLMGCO + constraint, although there are many noisy labels in the initial label matrix, the constraint during the optimization can correct the noisy labels to a large extent, achieving ranking performance (AP and mAP) similar to that of MLMGCO + filling. However, the AHL values of MLMGCO + constraint are always 0, while those of MLMGCO + filling are always positive. This indicates that the tag ranking list for each instance produced by MLMGCO + constraint is semantically consistent, while that produced by MLMGCO + filling is partially inconsistent with the semantic hierarchical constraint, i.e., some child tags are ranked higher than their ancestor tags. These two points verify the efficacy of embedding the semantic hierarchy as a linear constraint.
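The notion of semantic consistency discussed above can be checked with a few lines of code. The sketch below simply counts, over all parent-child edges of the hierarchy, the entries where a child class scores higher than its parent. It is only an illustration of the inconsistency being described and is not the definition of the AHL metric; the function name and the edge-list representation are assumptions.

```python
import numpy as np

def count_hierarchy_violations(Z, edges, tol=0.0):
    """Count entries where a child class scores higher than its parent.

    Z:     (classes x instances) matrix of predicted label scores.
    edges: iterable of (parent, child) class-index pairs (hypothetical
           representation of the semantic hierarchy).
    """
    violations = 0
    for parent, child in edges:
        # The hierarchical constraint requires Z[parent, j] >= Z[child, j] for all j.
        violations += int(np.sum(Z[child] > Z[parent] + tol))
    return violations
```

Checking direct parent-child edges suffices: if every edge is satisfied, each class also scores no higher than any of its ancestors by transitivity, so a count of zero corresponds to a semantically consistent ranking for every instance.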
evaluation  method  MediaMill: provided  MediaMill: val  MediaMill: testing  ESP Game (traditional): provided  ESP Game (traditional): val  ESP Game (traditional): testing  ESP Game (CNN): provided  ESP Game (CNN): val  ESP Game (CNN): testing  IAPRTC12 (traditional): provided  IAPRTC12 (traditional): val  IAPRTC12 (traditional): testing  IAPRTC12 (CNN): provided  IAPRTC12 (CNN): val  IAPRTC12 (CNN): testing
AP  MLMGCO  0.9994  0.745  0.716  0.9994  0.4373  0.4422  0.9994  0.581  0.5887  1.0  0.5487  0.5527  0.9971  0.6455  0.6467
AP  MLMGSL  1.0  0.7694  0.7234  0.9967  0.4435  0.4441  0.9997  0.5965  0.5946  1.0  0.5518  0.5535  0.9997  0.6615  0.6604
mAP  MLMGCO  0.9999  0.5167  0.3344  1.0  0.2573  0.2458  0.9999  0.4544  0.4578  1.0  0.4443  0.4412  0.999  0.5273  0.509
mAP  MLMGSL  1.0  0.5404  0.3344  0.9996  0.2633  0.266  1.0  0.4949  0.4866  1.0  0.4472  0.4413  0.9999  0.5417  0.533
MLMGCO + filling + constraint shows the best results among the four variants in most cases. It not only gives the highest AP and mAP values, but also produces semantically consistent results. This demonstrates that both filling and constraint contribute to the performance. Note that the improvements in AP of the other three variants over MLMGCO are larger than the improvements in mAP. The main reason is that both filling and constraint directly influence the labels in each column, and AP measures the label ranking performance in each column. In contrast, the row ranking, which is measured by mAP, is only indirectly influenced by filling and constraint through the label propagation on the mixed dependency graph.
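To make the column/row distinction concrete, the sketch below computes a standard average precision over each column (ranking of classes for one instance) and over each row (ranking of instances for one class) of a score matrix. The exact AP and mAP definitions used in the experiments are given elsewhere in the paper; the textbook average-precision formula and all function names here are assumptions for illustration.

```python
import numpy as np

def average_precision(scores, labels):
    """Textbook average precision of a ranking induced by `scores`."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = np.asarray(labels)[order] == 1
    if not hits.any():
        return np.nan  # undefined when there is no positive label
    ranks = np.arange(1, hits.size + 1)
    precision_at_k = np.cumsum(hits) / ranks
    return float(precision_at_k[hits].mean())

def columnwise_ap(Z, Y):
    """Mean over instances (columns): how well classes are ranked per instance."""
    return float(np.nanmean([average_precision(Z[:, j], Y[:, j])
                             for j in range(Z.shape[1])]))

def rowwise_map(Z, Y):
    """Mean over classes (rows): how well instances are ranked per class."""
    return float(np.nanmean([average_precision(Z[i], Y[i])
                             for i in range(Z.shape[0])]))
```

Changing the scores within one column, as filling and constraint do, alters that column's ranking directly, whereas the row-wise quantity changes only through how the modified entries compare with the other instances of the same class.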
Comparison among four variants of MLMGSL. In Fig. 5, the results of the four variants of MLMGSL are denoted using the lines with the mark, but with different colors. Similar to the above comparison among the variants of MLMGCO, MLMGSL shows the worst performance among its four variants; MLMGSL + filling and MLMGSL + constraint show similar performance in most cases; MLMGSL + filling + constraint shows the best performance.
Comparison between MLMGCO and MLMGSL. In Fig. 5, the corresponding variants of MLMGCO and MLMGSL are denoted as lines with the same color, but with different marks ( and respectively, see the same column of the legend). As in the comparison between MLMGCO and MLMGSL shown in Section 6.2, on most datasets MLMGCO performs better than MLMGSL when the missing label proportion is small, and worse when the missing label proportion increases.
Comparison between MLMGCO+constraint and CSSAG. Based on the input continuous labels produced by MLMGCO, CSSAG changes continuous labels to binary ones according to the semantic hierarchy and the predefined number of positive labels. Consequently, the AP results of the discrete outputs of CSSAG are similar to the AP values of MLMGCO, but the mAP values of CSSAG are much lower than those of MLMGCO. We believe the reason is that CSSAG focuses on adjusting the columnwise label rankings, while mAP measures the rowwise label ranking performance. Moreover, although CSSAG ensures that there are no inconsistent labels in its binary label matrix, it cannot provide a consistent continuous label ranking. In contrast, MLMG can satisfy these two conditions simultaneously. This comparison demonstrates that using the semantic hierarchy as a constraint during optimization (as done in MLMGCO + constraint) is more effective than using it as a constraint in a postprocessing step (as done in CSSAG).
6.4 Evaluation of Semisupervised Multilabel Learning
In the above experiments, missing labels are randomly generated across different training instances and different classes. A special case is that some training instances are fully annotated, while other training instances are totally unlabelled, referred to as semisupervised multilabel learning (SSML) semimultilabelsdm2008 . Our proposed model can naturally handle SSML. In contrast, not all compared multilabel models that handle missing labels can exploit totally unlabelled images; FastTag fasttagicml2013 , for example, cannot. Here we provide a further evaluation of our proposed methods in the SSML setting. Specifically, we randomly choose a subset of training instances whose size equals that of the testing instance set, then hide their labels from the model (i.e., setting the label value to ). This subset is referred to as the validation set, while the subset of the remaining fully labelled training images is called the provided set. The equal size of the validation set and the testing set ensures a fair comparison of the prediction performance on these two sets. For clarity, we only present the experiment for the case of an unfilled initial label matrix with the SH constraint (see Table 3). Besides, since MLMGSL is inapplicable to Corel 5k with missing labels, we ignore Corel 5k here. The results are shown in Table 4. On both ESP Game and IAPRTC12, the AP and mAP results on the validation set are similar to or slightly higher than those on the testing set. This demonstrates that the joint probability distributions of image features and labels are close on the training set and the testing set of these two datasets, which facilitates determining the model and algorithm parameters of our methods using cross-validation. However, on MediaMill, there are significant gaps between the evaluation results on the validation set and the testing set, especially in terms of mAP. This reveals that the joint probability distributions of instance features and labels differ between the training set and the testing set in MediaMill.
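The SSML split described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, instances are stored as columns of the label matrix, and hidden labels are represented by a boolean observation mask rather than by a specific label value.

```python
import numpy as np

def ssml_split(n_train, n_test, seed=0):
    """Split training indices into a provided set and a validation set.

    The validation set has the same size as the testing set, as in the text.
    """
    if n_test > n_train:
        raise ValueError("validation set cannot be larger than the training set")
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_train)
    val_idx = np.sort(perm[:n_test])        # labels hidden from the model
    provided_idx = np.sort(perm[n_test:])   # labels fully observed
    return provided_idx, val_idx

def observation_mask(n_classes, n_train, val_idx):
    """Boolean (classes x instances) mask: False for every label of a validation instance."""
    observed = np.ones((n_classes, n_train), dtype=bool)
    observed[:, val_idx] = False
    return observed
```

Predictions on the `val_idx` columns can then be scored against their held-back labels and compared with the scores on the testing set.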
6.5 Evaluation of Missing Label Imputations
As transductive models, our proposed methods can not only predict the labels of testing images, but also impute the missing labels of training images. Here we evaluate the imputation performance of our methods on missing labels, and compare it with the prediction performance on testing images. For clarity, we only present the experiment for the case of an unfilled initial label matrix with the SH constraint (see Table 3). In the above experiments, missing labels are generated only on leaf and singleton classes. However, as our method enforces that the label score of a parent class cannot be lower than that of its child classes, the leaf and singleton classes are at a disadvantage in the competition with the root and intermediate classes. Thus, the imputation performance on missing labels corresponding to leaf and singleton classes is very poor, as they have to compete with the provided labels, a large proportion of which correspond to root and intermediate classes. Instead, here we change the setting so that missing labels can be generated on all classes. Then, at the same missing label proportion, there are actually more missing labels in the training label matrix than in the case where missing labels are only generated on leaf and singleton classes. Note that Corel 5k is not evaluated here. As demonstrated in Section 6.2, the SVD step in MLMGSL fails to produce a valid solution on Corel 5k, due to the extremely sparse positive labels in the provided label matrix of Corel 5k, especially when labels are missing on all classes.
We present two evaluations. The first evaluation uses the score, on provided, missing and testing labels. Specifically, we first discretize the predicted continuous label matrix by setting the labels with the top-10 largest scores in the label vector corresponding to each image (i.e., the column vector of the label matrix) to , and all other entries in the same label vector to . The sub-label-vectors of the provided labels and of the missing labels are extracted from each training label vector in the binary label matrix, and each is evaluated separately using the score. This evaluation clearly reveals the imputation performance on missing labels, compared with the prediction performance on provided labels and testing labels. The results are shown in Fig. 6. There are two observations from the results on all datasets. One is that the prediction performance on provided labels (see the green lines in Fig. 6) is always better than that on missing and testing labels, but the performance advantage decreases as the missing label proportion increases. The reason is that the label consistency term in our model (see Eq. (1)) encourages the predicted label scores to be consistent with the ground-truth labels at the provided entries of the label matrix. In contrast, there is no such consistency term for missing and testing labels. When the missing label proportion is small, this consistency term provides a reference for more labels, which explains why the performance advantage on provided labels shrinks as the missing label proportion grows. The other observation is that the imputation performance on missing labels (see the blue lines in Fig. 6) is worse than the prediction performance on testing labels (see the red lines in Fig. 6) at the missing label proportion , but their performance becomes similar when the missing label proportion is large. The missing labels have to compete with the provided labels in the same label column. When a large proportion of labels is provided, the unfairness between missing labels and provided labels may preclude the recovery of the ground-truth positive labels among the missing labels. The degree of this unfairness decreases as the missing label proportion increases. In contrast, there is no such unfairness among the entries in the same label column for testing images, as all entries in the same column are missing. This difference is the main reason for the performance gap between the imputation of missing labels and the prediction of testing labels when the missing label proportion is small.
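The discretization and the extraction of sub-label-vectors used in the first evaluation can be sketched as below. The binary values 1 and 0, the function names, and the boolean missing-entry mask are assumptions for illustration; the scoring metric itself is deliberately left out.

```python
import numpy as np

def binarize_top_k(Z, k=10):
    """Set the k largest scores in each column (one image) to 1 and the rest to 0.

    Assumption: 1/0 are the binary values; Z is (classes x instances).
    """
    n_classes, n_instances = Z.shape
    k = min(k, n_classes)
    B = np.zeros((n_classes, n_instances), dtype=int)
    top = np.argsort(-Z, axis=0, kind="stable")[:k]  # (k x instances) class indices
    B[top, np.arange(n_instances)] = 1
    return B

def split_label_vector(B, missing_mask, j):
    """Return the provided and missing sub-label-vectors of training image j."""
    provided = B[~missing_mask[:, j], j]
    missing = B[missing_mask[:, j], j]
    return provided, missing
```

Each sub-label-vector would then be compared with the corresponding ground-truth entries using the chosen score, separately for provided and missing labels. Because the top-k selection is made over the whole column, missing entries compete with provided entries for the k positive slots, which is the competition discussed above.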
Datasets  MCPos MCPosnips2011  FastTag fasttagicml2013  LEML LEMLICML2014  MLMLexact myicpr2014  MLMLapprox myicpr2014  MLMGCO  MLMGCO + constraint  MLMGSL + constraint
Corel 5k  72.94  55.94  203.4  50.6  2.56  1.86  4.76  21.3 
MediaMill  165.4  56.5  151.5  238.4  4.63  8.75  23.39  44.25 
ESP Game (traditional)  337  124  638.2  3004  239.2  11  28.2  337.3 
ESP Game (CNN)  1772  199.3  2164  
IAPRTC12 (traditional)  326.6  202.2  742.2  3378  238.4  11.6  30.5  158 
IAPRTC12 (CNN)  1709  271.7  2063 
The second evaluation uses the metrics AP and mAP, on both training and testing images, as shown in Fig. 7. It shows how the additional missing labels at the root and intermediate classes influence performance. Compared with the results reported in Fig. 5 (see MLMGCO+constraint and MLMGSL+constraint), the corresponding results of MLMGCO and MLMGSL in Fig. 7 are only slightly lower. The drop is small because the additional missing labels at root and intermediate classes can easily be recovered by our methods if any one of their descendant classes is provided.
6.6 Complexity and Runtime
Complexity. Here we analyze the complexities of our methods. MLMGCO is implemented by the PGD algorithm (see Section 4.1), which can be further accelerated with the following observations. First, both and are sparse, and there are only and nonzero entries, respectively. denotes the number of neighbours at the instancelevel, while is the number of neighbours at the classlevel (their specific values on different datasets are shown in Table 2). Second, there are some shared terms between different steps, such as and . Third, it is known that . Thus we have or . Considering always holds in the datasets in our experiments, the computational cost can be significantly reduced from to , or from to . Utilizing the above three observations, the actual computational complexity of MLMGCO is