Multi-label Learning with Missing Labels using Mixed Dependency Graphs

by   Baoyuan Wu, et al.

This work focuses on the problem of multi-label learning with missing labels (MLML), which aims to label each test instance with multiple class labels given training instances that have an incomplete/partial set of these labels. The key to handling missing labels is to propagate label information from provided labels to missing labels through a dependency graph in which each label of each instance is treated as a node. We build this graph by utilizing different types of label dependencies. Specifically, the instance-level similarity serves as undirected edges connecting the label nodes across different instances, and the semantic label hierarchy is used as directed edges connecting different classes. This base graph is referred to as the mixed dependency graph, as it includes both undirected and directed edges. Furthermore, we present two further types of label dependencies that connect the label nodes across different classes. One is the class co-occurrence, which is also encoded as undirected edges. Combining it with the base graph, we obtain a new mixed graph, called MG-CO (mixed graph with co-occurrence). The other is the sparse and low rank decomposition of the whole label matrix, which embeds high-order dependencies over all labels. Combining it with the base graph, we obtain a new mixed graph called MG-SL (mixed graph with sparse and low rank decomposition). Based on MG-CO and MG-SL, we propose two convex transductive formulations of the MLML problem, denoted as MLMG-CO and MLMG-SL, respectively. Two important applications, image annotation and tag-based image retrieval, can be jointly handled using our proposed methods. Experiments on benchmark datasets show that our methods give significant improvements in performance and robustness to missing labels over the state-of-the-art methods.


1 Introduction

Figure 1: The left column includes two example images from the ESP Game esp-game-2004 dataset; their corresponding features and labels are shown in the other columns. A solid box denotes a provided label, while a dashed box indicates a missing label. The red (semantic hierarchical dependency), green (instance similarity), and blue (class co-occurrence) edges constitute the mixed graph with co-occurrence (MG-CO); the red and green edges, together with the sparse and low rank decomposition of the whole label matrix, constitute the mixed graph with sparse and low rank decomposition (MG-SL).

In machine learning, multi-label learning refers to the setting where each data item can be associated with multiple classes simultaneously. For example, in image annotation, an image can be annotated using several tags; in document topic analysis, a document can be associated with multiple topics. Although there are several multi-label learning methods in the literature mlknn-pr-2007 ; multilabel-review-tkde-2014 , most of these require complete labelling of training examples, i.e., for every pair of training example and class label, their association needs to be provided.

However, complete labelling is usually infeasible in practice. Most training instances are only partially labelled, with some or all of the labels not provided/missing. Consider the task of large-scale image annotation, where the number of classes/tags is large (e.g., using labels of ImageNet imagenet-cvpr-2009 ). In practice, a human annotator can only annotate each training image with a subset of a potentially large and diverse set of tags. Furthermore, in many cases, due to the semantic similarities among tags, some tags are typically left unchecked, e.g., an image tagged with “German Shepherd” is often not also tagged with “Dog”. Such a learning setting is referred to as the multi-label learning with missing labels (MLML) problem my-icpr-2014 ; LEML-ICML-2014 .

As labels are usually related by semantic meanings or co-occurrences, the key to learning from missing labels is a good model to represent label dependency. One widely used model for label dependency is an undirected graph, through which the label information can be propagated among different instances and among different classes. For example, the dependency between a pair of labels, such as instance similarity and class co-occurrence, can be represented using such a graph (see green and blue edges in Fig. 1). However, as stated in my-icpr-2014 ; my-pr-2015 , the class co-occurrence derived from training labels can be inaccurate and biased when many missing labels exist. One remedy is to estimate co-occurrence relations from an auxiliary and possibly more comprehensive source (such as Wikipedia) crbm-mlml-2015 . Another alternative is to utilize a class dependency that is independent of the provided labels. One widely used dependency in multi-label learning is the low rank assumption: the rank of the label matrix, where each row corresponds to one class and each column to one instance, should be smaller than the number of rows (i.e., classes). Although this assumption has been successfully used in many multi-label models multilabel-compressed-sensing-nips-2012 ; LEML-ICML-2014 , as indicated in multilabel-low-rank-sparse-kdd-2016 , it is difficult to satisfy fully due to the existence of tail labels (i.e., rare labels that occur in very few instances and are therefore difficult to represent as linear combinations of other labels). Instead, the sparse and low-rank decomposition, which has been successfully used in other applications such as image alignment image-alignment-pami-2012 or visual tracking tianzhu-sparse-coding-eccv-2012 , can be used in multi-label learning by assuming that the label matrix can be decomposed into the sum of one sparse matrix and one low-rank matrix. Compared to the pure low rank assumption, this decomposition is more flexible, which makes the low rank assumption more plausible in practical multi-label problems. In this work, we propose to combine the instance-level similarity with either the class co-occurrence or the sparse and low rank decomposition.

The semantic dependency between two classes, such as “animal → horse” and “plant → grass” as shown in Fig. 1, can foster further label dependencies and improve label predictions at test time. To exploit this dependency, a new set of constraints is introduced to require that the label score (e.g., the presence probability) of the parent class cannot be lower than that of its child class. This is traditionally referred to as the semantic hierarchical constraint bi-wei-icml-2011 ; my-iccv-2015 . The undirected graph (with instance similarity and class co-occurrence edges or the global sparse and low rank decomposition) cannot guarantee that the final label predictions will satisfy all semantic hierarchy constraints. To address this problem, we add semantic dependencies into the graph as directed edges, resulting in an overall mixed dependency graph that encourages (or enforces) three types of label dependencies. The graph embedding the class co-occurrence is referred to as the mixed graph with co-occurrence (MG-CO), while the one with the sparse and low rank decomposition is denoted as the mixed graph with sparse and low rank decomposition (MG-SL). Please refer to Fig. 1 for an example of these models.

The goal of this work is to learn from partially labeled training instances and to correctly predict the labels of testing instances such that the predictions satisfy the semantic hierarchical constraints. Motivated by my-icpr-2014 ; my-pr-2015 , a discrete objective function is formulated to simultaneously encourage consistency between predicted and ground truth labels and encode traditional label dependencies (instance similarity with class co-occurrence or with sparse and low rank decomposition). The semantic hierarchical constraints, in contrast, are incorporated as hard linear constraints in the matrix optimization. The discrete problem is further relaxed to a convex problem, which is solved using ADMM admm-boyd-2011 .


The contributions of this work are fourfold. (1) We address the MLML problem by using a mixed dependency graph to encode a network of label dependencies: instance similarity, class co-occurrence or sparse and low rank decomposition, as well as the semantic hierarchical constraint. (2) Learning on the mixed dependency graph is formulated as a linearly constrained convex matrix optimization problem that is amenable to efficient solvers. (3) We conduct extensive experiments on the task of image annotation to show the superiority of our method in comparison to the state-of-the-art. (4) We augment the labelling of several widely used datasets, including Corel 5k corel5k-eccv-2002 , ESP Game esp-game-2004 , IAPRTC-12 iaprtc-12-data-2006 and MediaMill mediamill-data-2006 , with a semantic hierarchy drawn from WordNet wordnet-1998 . This ground truth augmentation will be made publicly available to enable further research on the MLML problem in computer vision.

Compared to the previous conference version of this work my-iccv-2015 , the additional novelties in this manuscript are fourfold. (1) We adopt CNN-extracted features on ESP Game and IAPRTC-12, for which the original images are available, and the experimental performance is significantly improved compared to that obtained with traditional features. (2) The sparse and low rank decomposition is utilized to provide an alternative to the class co-occurrence, leading to further performance improvements. (3) More detailed experimental comparisons are provided to evaluate the influences of different label dependencies. (4) The experimental results of image retrieval are added.

2 Related Work

In the literature on multi-label learning, previous works designed to handle missing labels can be broadly partitioned into four categories. First, the missing labels are directly treated as negative labels, as in semi-multi-label-sdm-2008 ; well-multi-label-weak-2010 ; bucak-multi-incomplete-2011 ; fasttag-icml-2013 ; Agrawal-ml-million-label-www-2013 ; hash-multi-label-eccv-2014a ; hash-multi-label-eccv-2014b ; multilabel-link-prediction-aaai2015 . Common to these methods is that a label bias is brought into the objective function. As a result, their performance degrades considerably when many ground-truth positive labels are initialized as negative labels. Second, filling in missing labels is treated as a matrix completion (MC) problem, as in MC-nips-2010 ; MC-Pos-nips-2011 ; MC-speed-nips-2013 . The recent LEML method LEML-ICML-2014 casts the MLML problem into the empirical risk minimization (ERM) framework. Both MC models and LEML are based on the low rank assumption on the whole label matrix. In contrast, the sparse and low rank decomposition is introduced to multi-label learning in a recent work multilabel-low-rank-sparse-kdd-2016 . Third, missing labels are treated as latent variables in probabilistic models, including models based on Bayesian networks multilabel-compressed-sensing-nips-2012 ; bml-cs-active-kdd-2014 and conditional restricted Boltzmann machines (CRBM). Last, Wu et al. my-icpr-2014 defined three label states, namely positive labels, negative labels and missing labels, to avoid the label bias. However, the two solutions proposed in my-icpr-2014 involve matrix inversion, which limits their scalability to larger datasets. Wu et al. my-pr-2015 proposed an inductive model based on the framework of regularized logistic regression. It also adopts three label states and a hinge loss function to avoid the label bias. However, the classifier parameters corresponding to each class have to be learned sequentially. Furthermore, the computational cost of this method increases significantly with the number of classes, so it becomes prohibitive for very large datasets.

Hierarchical multi-label learning (HML) ML-reivew-2014 has been applied to problems where a label hierarchy exists, such as image annotation hierarchy-image-annotation-review-pr-2012 , text classification hml-text-icml-2005 ; kernel-hml-text-jmlr-2006 and protein function prediction bi-wei-icml-2011 ; yu-incomplete-hierarchy-bmc-2015 . Except for a few cases, most existing HML methods only consider learning from complete hierarchical labels. However, in real problems such as image annotation, incomplete hierarchical labels commonly occur. Yu et al. yu-incomplete-hierarchy-bmc-2015 recently proposed a method to handle incomplete hierarchical labels. However, the semantic hierarchy and the multi-label learning are used separately, so the semantic hierarchical constraint cannot be fully satisfied. Deng et al. deng-eccv-2014 developed a CRF model for object classification that also incorporates the semantic hierarchical constraint and missing labels. However, a significant difference is that deng-eccv-2014 focuses on a single object in each instance, while there are multiple objects in each instance in our problem.

In the application of image annotation, both missing labels and semantic hierarchy have been explored in many previous works, such as well-multi-label-weak-2010 ; bucak-multi-incomplete-2011 ; fasttag-icml-2013 ; tag-completion-pami-2013 ; image-tag-missing-cvpr-2013 ; video-annotation-icm-2008 ; L1-label-denoising-bmvc-2016 ; my-aaai-2016-imbalance ; li-au-missing-pr-2016 (missing labels) and hierarchy-image-annotation-review-pr-2012 ; my-cvpr-2017-dia ; my-cvpr-2018-d2ia-gan (semantic hierarchy). However, to the best of our knowledge, no previous work in image annotation has extensively studied missing labels and semantic hierarchy simultaneously. Note that the semantic hierarchical constraint used in our model is similar to the ranking constraint ML-calibrated-ranking-2008 ; bucak-multi-incomplete-2011 that is widely used in multi-label ranking models, but there are significant differences. First, the ranking constraint used in these models requires the predicted value of a provided positive label to be larger than that of a provided negative label, while the semantic hierarchical constraint involves the ranking of the predicted values between a pair of parent and child classes. Second, the ranking constraint is always incorporated into the loss function, while the semantic hierarchical constraint is formulated as a linear constraint in our model.

3 Problem and Model

3.1 Problem Definition

Our method takes as input two matrices: a data matrix , which aggregates the -dimensional feature vectors of all (training and testing) instances, and a label matrix , which aggregates the -dimensional label vectors of all instances. That is to say each instance can take one or more labels from the different classes . Its corresponding label vector determines its membership to each of these classes. For example, if , then is a member of and if , then is not a member of this class. However, if , then the membership of to is considered unknown (i.e., it has a missing label). Correspondingly, all labels of each testing instance are missing, i.e., . The semantic hierarchy is encoded as another matrix: , with being the number of directed edges. denotes the index vector of the -th directed edge (see Fig. 1), with and , while all other entries are 0.

Our goal is to obtain a complete label matrix that satisfies the following properties.

  1. is consistent with the provided (not missing) labels in , i.e., if .

  2. satisfies the instance-level label similarity. It assumes that and have similar features, then their corresponding predicted labels (i.e., the and column of ) should be similar.

  3. follows the class-level label similarity. It assumes that if the co-occurrence between two classes is high, then they will be likely to co-exist at many instances, i.e., the corresponding two row vectors of are similar.

  4. can be decomposed as the sum of a sparse matrix and a low rank matrix, i.e., with being low rank and being sparse. The rationale of the low rank assumption is that one class could be represented by its related classes. However, due to the existence of tailed labels, the low rank assumption is unlikely to be exactly satisfied. Thus, the sparse matrix is introduced to include the tailed labels, then the remaining label matrix could be low rank.

  5. is consistent with the semantic hierarchy . To enforce this, we ensure that if is the parent of , a hard constraint is applied, which guarantees that the score (the presence probability) of should not be smaller than the score of . This constraint ensures that the final predicted labels are consistent with the semantic hierarchical constraint.

Note that both criteria (3) and (4) embed class-level label dependencies, with (3) being pairwise and (4) being high-order. We propose two models that combine criteria (1,2,3,5) and (1,2,4,5), respectively. Criteria (3) and (4) could be used together to construct a more general model, but to assess their individual effects, this manuscript evaluates the two models separately. By jointly incorporating all four criteria of either model, the label information is propagated from provided labels to missing labels. In what follows, we give a detailed exposition of how these criteria can be mathematically encoded in one unified optimization framework.

3.2 Label Consistency

The label consistency of with is enforced using


where , and is defined as , with being a penalty factor for mismatches between and . We set in the following manner. If , then , if , then , and if , then . That is to say a higher penalty is incurred if a ground truth label is but is predicted as , as compared to the reverse case. This idea reflects the observation that most entries of in many multi-label datasets (with a relatively large number of classes) are and that labels are rare (see the data statistics in Table 2). Of course, missing labels are not penalized.
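The asymmetric penalty described above can be sketched as follows. The sketch rests on assumptions: the encoding `POS = 1.0`, `NEG = 0.0`, `MISSING = 0.5`, the weight values `w_pos` and `w_neg`, and the helper names are all hypothetical, and a weighted absolute mismatch stands in for the loss term of Eq. (1), which is not reproduced here.

```python
import numpy as np

# Assumed label encoding (hypothetical; chosen only for this illustration).
POS, NEG, MISSING = 1.0, 0.0, 0.5

def penalty_weights(Y, w_pos=5.0, w_neg=1.0):
    """Per-entry weights: large for provided positives, small for provided
    negatives, zero for missing labels."""
    W = np.zeros_like(Y, dtype=float)
    W[Y == POS] = w_pos   # missing a rare positive label is costly
    W[Y == NEG] = w_neg   # a false positive is penalized less
    return W              # entries with Y == MISSING keep weight 0

def weighted_mismatch(Y, Z, W):
    """Weighted disagreement between provided labels Y and scores Z in [0, 1]."""
    return float(np.sum(W * np.abs(Y - Z)))

# Rows are classes, columns are instances.
Y = np.array([[POS, MISSING, NEG],
              [NEG, POS, MISSING]])
Z = np.array([[0.0, 1.0, 1.0],
              [0.0, 1.0, 0.0]])
W = penalty_weights(Y)
loss = weighted_mismatch(Y, Z, W)
```

In this toy case the missed positive contributes five times as much as the false positive, and the two missing entries contribute nothing regardless of the predicted score.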

3.3 Instance-level Label Dependency

Similar to my-icpr-2014 ; my-pr-2015 , we incorporate the instance-level label similarity (i.e., criteria (2)) using the regularization term in Eq. (2).


where the instance similarity matrix is defined as: . The kernel size and is the -th nearest neighbour of (measured by the Euclidean distance). Similar to my-icpr-2014 , we set . The normalization term makes the regularization term invariant to different scaling factors of elements in spectral-tutorial-2007 . The normalized Laplacian matrix is with .
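A minimal sketch of the instance graph follows, assuming a Gaussian kernel with local scaling (the kernel size of each instance is its distance to its k-th nearest neighbour) and the symmetric normalized Laplacian. The function names, the default `k = 7`, and the smoothness term `tr(Z L Z^T)` are assumptions made for illustration, since the exact expressions are not reproduced above.

```python
import numpy as np

def instance_similarity(X, k=7):
    """X has shape (d, n): one column per instance. Returns an (n, n) similarity."""
    sq = np.sum(X**2, axis=0)
    # Squared Euclidean distances between all pairs of columns.
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X.T @ X, 0.0)
    D = np.sqrt(D2)
    # Local kernel size: distance to the k-th nearest neighbour
    # (index 0 of each sorted row is the instance itself).
    eps = np.sort(D, axis=1)[:, k]
    V = np.exp(-D2 / (np.outer(eps, eps) + 1e-12))
    np.fill_diagonal(V, 0.0)
    return V

def normalized_laplacian(V):
    """L = I - D^{-1/2} V D^{-1/2}, with D the diagonal degree matrix of V."""
    deg = V.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    return np.eye(V.shape[0]) - inv_sqrt[:, None] * V * inv_sqrt[None, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 20))      # 20 instances with 5-dimensional features
Z = rng.uniform(size=(3, 20))     # scores of 3 classes for the 20 instances
L_X = normalized_laplacian(instance_similarity(X))
smoothness = np.trace(Z @ L_X @ Z.T)   # small when similar instances get similar labels
```

Because the normalized Laplacian of a non-negative symmetric similarity matrix is positive semi-definite, the smoothness term is non-negative and convex in `Z`.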

3.4 Class-level Label Dependency

Here, we consider three types of class-level label dependencies, namely class co-occurrence, sparse and low rank decomposition and semantic hierarchy.

Class co-occurrence: This dependency is encoded using the regularization term in Eq. (3).


Here, we define the class similarity matrix as: and . The normalized Laplacian matrix is defined as with .
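One plausible way to derive a class similarity from co-occurrence is sketched below. It assumes cosine similarity between the positive-label indicator rows of the provided label matrix and the encoding `POS = 1.0`; both the measure and the name `class_similarity` are assumptions for illustration rather than the definition used in Eq. (3).

```python
import numpy as np

POS = 1.0  # assumed encoding of a provided positive label (hypothetical)

def class_similarity(Y):
    """Cosine similarity between positive-label indicator rows of Y (m, n)."""
    P = (Y == POS).astype(float)
    norms = np.linalg.norm(P, axis=1)
    norms[norms == 0] = 1.0            # classes without positives get zero similarity
    V = (P @ P.T) / np.outer(norms, norms)
    np.fill_diagonal(V, 0.0)
    return V

# Rows are classes, columns are instances; 0.5 marks a missing label here.
Y = np.array([[1.0, 1.0, 0.0, 0.5],
              [1.0, 1.0, 0.5, 0.0],
              [0.0, 0.5, 1.0, 1.0]])
V_C = class_similarity(Y)
```

Classes 0 and 1 are positive on exactly the same instances, so their similarity is 1, while class 2 never co-occurs with either and gets similarity 0. With many missing labels such counts become unreliable, which is the bias discussed in the introduction.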

Sparse and low rank decomposition: The sparse and low rank decomposition assumes that the label matrix can be decomposed into the sum of a sparse matrix and a low rank matrix , as follows,


However, it is known that minimizing the rank is intractable in general nuclear-norm-2010 . A widely used solution is to minimize its convex approximation nuclear-norm-low-rank-2002 , i.e., the nuclear norm, defined as the sum of the singular values of the matrix. Then the approximation of (4) is formulated as


Semantic hierarchical constraint: To enforce the semantic hierarchical constraint (i.e., criteria (5)), we apply the following constraint: . The resulting constraints can be aggregated in matrix form,


where . is the indicator vector of the -th directed edge , with and , with all other entries being 0.
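The edge-indicator construction can be sketched as below. The sign convention (+1 at the parent, -1 at the child, so that a non-negative product with the score matrix means "parent score at least child score") is an assumption, as are the names `hierarchy_matrix` and `violations`.

```python
import numpy as np

def hierarchy_matrix(edges, m):
    """edges: list of (parent, child) class indices. Returns a matrix of shape (m, n_e)."""
    Phi = np.zeros((m, len(edges)))
    for e, (parent, child) in enumerate(edges):
        Phi[parent, e] = 1.0    # assumed sign convention
        Phi[child, e] = -1.0
    return Phi

def violations(Phi, Z, tol=1e-9):
    """Number of (edge, instance) pairs where the child score exceeds the parent score."""
    return int(np.sum(Phi.T @ Z < -tol))

# Classes: 0 = animal, 1 = horse, 2 = plant, 3 = grass.
Phi = hierarchy_matrix([(0, 1), (2, 3)], m=4)
Z_ok = np.array([[0.9], [0.7], [0.4], [0.4]])    # parents score at least as high
Z_bad = np.array([[0.2], [0.7], [0.4], [0.1]])   # "horse" outscores "animal"
n_ok, n_bad = violations(Phi, Z_ok), violations(Phi, Z_bad)
```

Each column of the matrix encodes one directed edge, so all hierarchy constraints for all instances reduce to a single elementwise inequality on one matrix product.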

4 MLML using Mixed Dependency Graph with Co-occurrence (MLMG-CO)

By combining those four properties formulated in Eqs. (1,2,3, 6), we construct a mixed dependency graph to connect all label nodes (i.e., all entries in ), referred to as mixed dependency graph with co-occurrence (MG-CO). Using MG-CO, we formulate the MLML problem as a binary matrix optimization problem, where the linear combination of Eqs. (1,2,3) forms the objective and Eq. (6) enforces the semantic hierarchical constraints.

s.t. (7)

which is referred to as MLMG-CO. The three terms in the objective function correspond to Eqs. (1,2,3) respectively. Due to the binary constraint on , it is difficult to efficiently solve this discrete problem. Thus, we use a conventional box relaxation, which relaxes to take on values in . Since both and are positive semi-definite (PSD), it is easy to prove that the relaxed problem of Eq. (28) is a convex quadratic program (QP) with linear matrix constraints (refer to Appendix A for the detailed proof of the convexity).

s.t. (8)

Due to its convexity and smoothness, the MLMG-CO problem can be efficiently solved by many solvers. In this work, we adopt the alternating direction method of multipliers (ADMM) admm-boyd-2011 , which decomposes the optimization problem into several steps that are easy to implement and intuitive to understand.

4.1 ADMM Algorithm for MLMG-CO

Following the conventional ADMM framework admm-boyd-2011 , we first formulate the augmented Lagrangian function of Problem (28), by introducing a non-negative slack variable ,


where and . Here, is the Lagrange multiplier (dual variable), is a penalty parameter, and denotes the matrix Frobenius norm. Then we want to solve the following problem


It can be minimized by alternately solving the following sub-problems, with being the iteration index of the ADMM algorithm.

Sub-problem with respect to : The update of is obtained by the following sub-problem,


where , and . Clearly, is positive semi-definite (PSD), so is PSD. Since is also PSD, Problem (11) is a convex quadratic programming (QP) problem with box constraints. It can be efficiently solved using projected gradient descent (PGD) with exact line search boyd-convex-2004 .

Projected gradient descent. The gradient of the objective function (11) with respect to and the step size are computed as


where indicates the iteration index of PGD. Then is updated as follows:


The result of the final iteration of PGD will be used as the solution to Problem (11), i.e., . As Problem (11) is convex, PGD is guaranteed to converge to the global optimal solution. However, to reduce the computational cost, we stop after only a few PGD iterations. This heuristic makes the overall ADMM converge much faster, without any considerable effect on performance.
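The PGD step can be illustrated on a generic box-constrained quadratic of the same shape. The objective below, `tr(Z A Z^T) + tr(Z^T B Z) - tr(C^T Z)` with symmetric PSD `A` and `B`, is an assumed stand-in for Problem (11), and `pgd_box_qp` is a hypothetical name; the step size is the exact minimizer of the quadratic along the negative gradient, followed by projection onto the box.

```python
import numpy as np

def pgd_box_qp(A, B, C, Z0, n_iter=50):
    """Minimize tr(Z A Z^T) + tr(Z^T B Z) - tr(C^T Z) over Z in [0, 1]^(m x n).

    A (n x n) and B (m x m) are assumed symmetric PSD, so the objective is convex.
    """
    Z = Z0.copy()
    for _ in range(n_iter):
        G = 2.0 * Z @ A + 2.0 * B @ Z - C                      # gradient
        curv = np.trace(G @ A @ G.T) + np.trace(G.T @ B @ G)   # curvature along G
        if curv <= 1e-15:       # zero gradient (or flat direction): stop in this sketch
            break
        eta = np.sum(G * G) / (2.0 * curv)    # exact line search along -G
        Z = np.clip(Z - eta * G, 0.0, 1.0)    # projection onto the box [0, 1]
    return Z

# Separable toy problem whose solution is clip(C / 2, 0, 1).
C = np.array([[1.0, 3.0], [-1.0, 0.5]])
Z_star = pgd_box_qp(np.eye(2), np.zeros((2, 2)), C, np.zeros((2, 2)))
```

Capping `n_iter` at a small value mirrors the early-stopping heuristic described above, where the inner solver is deliberately run for only a few iterations per ADMM step.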

Sub-problems with respect to and : The updates for and are closed form,


According to the analysis in admm-proof-2013 ; admm-proof-2014 , the above ADMM algorithm is guaranteed to converge to the global minimum of Problem (28). Note that without the semantic hierarchical constraints shown in (6), Problem (28) can be solved more efficiently by the PGD algorithm alone, rather than by ADMM.
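For intuition, the two closed-form updates can be sketched under one common ADMM convention. The sketch assumes the inequality constraint is rewritten as an equality between the hierarchy product and a non-negative slack, with an augmented Lagrangian of the form `<Lam, Phi^T Z - Q> + rho/2 ||Phi^T Z - Q||_F^2`; the sign convention and all names are assumptions and may differ from the paper's equations.

```python
import numpy as np

def update_slack(Phi, Z, Lam, rho):
    """Minimizer over Q >= 0 of <Lam, Phi^T Z - Q> + rho/2 * ||Phi^T Z - Q||_F^2."""
    return np.maximum(Phi.T @ Z + Lam / rho, 0.0)

def update_dual(Phi, Z, Q, Lam, rho):
    """Dual ascent step on the residual of the equality constraint Phi^T Z = Q."""
    return Lam + rho * (Phi.T @ Z - Q)

Phi = np.array([[1.0], [-1.0]])          # one edge: class 0 is the parent of class 1
Z = np.array([[0.3], [0.8]])             # child outscores parent: constraint violated
Lam, rho = np.zeros((1, 1)), 1.0
Q = update_slack(Phi, Z, Lam, rho)       # slack is clipped at zero
Lam_new = update_dual(Phi, Z, Q, Lam, rho)
```

When the constraint is violated, the slack is clipped to zero and the multiplier moves away from zero, which pushes the next update of the score matrix back toward feasibility.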

5 MLML using the Mixed Dependency Graph with Sparse and Low Rank Decomposition (MLMG-SL)

In this section we propose another formulation of the MLML problem, based on the mixed dependency graph with sparse and low rank decomposition (MG-SL) constructed by Eqs. (1,2,5,6), as follows:

s.t. (17)

which is referred to as MLMG-SL. Similarly, the binary constraint is also relaxed to the box constraint , then the relaxed continuous problem becomes


Note that we have adopted a new loss term in (18) by introducing a trade-off parameter . Due to the constraint , this new loss term is equivalent to the old loss term in (17). The benefit is the larger flexibility, leading to a more stable convergence in the optimization process. As demonstrated in image-alignment-pami-2012 , both and are convex. Considering the convex smoothness term and the linear constraints, the optimization problem in (18) is also convex. We solve it again using the ADMM algorithm.

5.1 ADMM Algorithm for MLMG-SL

The augmented Lagrangian function of Problem (18) is formulated as follows


where and are two dual variables, and are penalty parameters. Then we need to solve the following optimization problem


which can be solved by alternately optimizing the following sub-problems.

Sub-problem with respect to :


Similar to the sub-problem (11), it is not hard to see that (21) is convex, which can also be efficiently solved by the PGD algorithm with line search.

Sub-problem with respect to :


where we define to save space. denotes the singular value soft-thresholding operator image-alignment-pami-2012 , utilizing the soft-thresholding operator and the SVD decomposition .

Sub-problem with respect to :


where the soft-thresholding operator and are defined as above.
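The two proximal operators used in these sub-problems have standard forms, sketched here with NumPy. The function names and the threshold `tau` are illustrative; the arguments each operator receives in the actual updates are not reproduced.

```python
import numpy as np

def soft_threshold(A, tau):
    """Entrywise shrinkage: sign(a) * max(|a| - tau, 0)."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def singular_value_threshold(A, tau):
    """Soft-threshold the singular values of A and rebuild the matrix."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * soft_threshold(s, tau)) @ Vt

A = np.diag([3.0, 1.0, 0.2])
low_rank = singular_value_threshold(A, 0.5)     # singular values become 2.5, 0.5, 0
sparse = soft_threshold(np.array([[1.5, -0.3], [0.4, -2.0]]), 0.5)
nuclear_norm = np.linalg.norm(low_rank, "nuc")  # sum of the remaining singular values
```

Singular value thresholding zeroes out small singular values and therefore lowers the rank, while entrywise soft-thresholding zeroes out small entries and therefore promotes sparsity; these are the proximal operators of the nuclear norm and the l1 norm, respectively.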

Sub-problems with respect to , and :


In terms of the convergence, as demonstrated in ADMM-multiblock-not-convergent-2016 , the ADMM algorithm for multi-block (more than 2 blocks) convex optimization is not necessarily convergent. Some further assumptions about the objective function or the parameters should be added to guarantee the convergence. For example, a recent work ADMM-three-block-convergence-2016 has proved that if the variable sequence generated by the above ADMM algorithm is assumed to satisfy the sub-strong monotonicity, and the parameters are set in a bounded range, then the algorithm will converge to a KKT solution. Please refer to ADMM-three-block-convergence-2016 for more details.

6 Experiments

In this section, we evaluate the proposed method and the state-of-the-art methods on four benchmark datasets in image annotation and video annotation.

Figure 2: A part of semantic hierarchies of Corel 5k and ESP Game, respectively.
Dataset | # nodes | # edges | # root nodes | # leaf nodes | # singleton nodes | depth
Corel 5k corel5k-eccv-2002 | 260 | 138 | 37 | 98 | 99 | 5
ESP Game esp-game-2004 | 268 | 129 | 41 | 92 | 120 | 4
IAPRTC-12 iaprtc-12-data-2006 | 291 | 179 | 36 | 132 | 98 | 4
MediaMill mediamill-data-2006 | 101 | 63 | 14 | 52 | 30 | 3
Table 1: Details of the semantic hierarchies that we built for the four datasets.

6.1 Experimental Setup

Datasets. Four benchmark multi-label datasets are used in our experiments: Corel 5k corel5k-eccv-2002 , ESP Game esp-game-2004 , IAPRTC-12 iaprtc-12-data-2006 , and MediaMill mediamill-data-2006 . These datasets are chosen because they are representative and popular benchmarks for comparative analysis among MLML methods. The features and labels of the first three image datasets are downloaded from the seminal work multilabel-dataset-image-iccv-2009 . Each image in these datasets is described by dense SIFT features and is represented by a 1000-dimensional vector. Moreover, the original images of ESP Game and IAPRTC-12 are also available, so other features can be extracted. Since deep features extracted from CNNs perform strongly in many image-based tasks, we also adopt CNN features for these two datasets. Specifically, the output of the relu7 layer of the pre-trained VGG-F model vggf-bmvc-2014 is extracted as a 4096-dimensional feature vector. The features and labels of the video dataset MediaMill are downloaded from the ‘Mulan’ website.

Semantic hierarchies. We build a semantic hierarchy for each dataset based on WordNet wordnet-1998 . Specifically, for each dataset, we search for each class in WordNet and extract one or more directed paths (i.e., a long sequence of directed edges from parent class to child class). In each path, we identify the nearest upstream class that is also in the label vocabulary (i.e., the set of all classes of the dataset of interest) as the parent class. This procedure is repeated for all classes in the dataset to form the semantic hierarchy matrix . In the same manner, we build the hierarchy for each of the four datasets. Similar to hierarchy-image-annotation-review-pr-2012 , we consider two types of semantic dependency: ‘is a’ and ‘is a part of’. For example, a part of the semantic hierarchy of Corel 5k and ESP Game is shown in Fig. 2. Note that not all ‘is a part of’ dependencies are included in the semantic hierarchy, so that the corresponding semantic hierarchical constraints remain correct. For example, “tree is a part of forest”, but when ‘tree’ appears in an image, ‘forest’ does not always appear, so we discard this dependency. A summary of these semantic hierarchies is presented in Table 1. The complete semantic hierarchies and the complete label matrices of all four datasets can be downloaded from “”.
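The parent-selection rule can be sketched in a few lines. The path below is written by hand for illustration and is not queried from WordNet; the function name `nearest_parent` and the toy vocabulary are likewise hypothetical.

```python
def nearest_parent(path, vocabulary):
    """path: class names ordered from the root down to the class of interest.
    Returns the closest ancestor that is also in the vocabulary, or None."""
    for ancestor in reversed(path[:-1]):
        if ancestor in vocabulary:
            return ancestor
    return None

vocabulary = {"animal", "horse", "plant", "grass"}
# Hand-written example path (an assumption, not actual WordNet output).
path = ["entity", "organism", "animal", "mammal", "equine", "horse"]

edges = []
parent = nearest_parent(path, vocabulary)
if parent is not None:
    edges.append((parent, path[-1]))   # directed edge: parent -> child
```

Intermediate concepts that are not dataset classes ("mammal", "equine") are skipped, so the resulting edge connects "horse" directly to "animal". A class whose paths contain no other vocabulary class yields no edge.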

Note that in the aforementioned datasets, the provided ground-truth label matrices do not fully satisfy the semantic hierarchical constraints. In other words, some instances are labelled with a child class but not with the corresponding parent class. Therefore, we augment the label matrix according to the semantic hierarchy for each dataset. The semantically enhanced ground-truth label matrix is referred to as “complete”, and the originally provided label matrix as “original”. The basic statistics of both the complete and original label matrices are summarized in Table 2.


Dataset | # instances (training + test) | # classes | Traditional feature dim. | CNN feature dim. |  | Label matrix | Avg. positive classes per instance | Avg. positive instances per class | Positive label proportion
Corel 5k corel5k-eccv-2002 | 4999 = 4500 + 499 | 260 | 1000 | N/A | 20, 10, 100 | original | 3.40 | 65.30 | 1.31%
 |  |  |  |  |  | complete | 4.84 | 93.06 | 1.86%
MediaMill mediamill-data-2006 | 43907 = 30993 + 12914 | 101 | 120 | N/A | 20, 10, 100 | original | 4.38 | 1902 | 4.33%
 |  |  |  |  |  | complete | 6.17 | 2680 | 6.10%
ESP Game esp-game-2004 | 20770 = 18689 + 2081 | 268 | 1000 | 4096 | 20, 10, 100 | original | 4.69 | 363.2 | 1.75%
 |  |  |  |  |  | complete | 7.27 | 563.6 | 2.71%
IAPRTC-12 iaprtc-12-data-2006 | 19627 = 17665 + 1962 | 291 | 1000 | 4096 | 20, 10, 100 | original | 5.72 | 385.71 | 1.97%
 |  |  |  |  |  | complete | 9.88 | 666.3 | 3.39%
Table 2: Data statistics of features and label matrices of four benchmark datasets. The positive label proportion is computed over the whole training label matrix.

Methods for comparison. Our methods use the semantic hierarchies in two places. One is to fill in the original initial label matrix , i.e., if , then is set to . denotes the ancestor classes of class in the semantic hierarchy. If we do this filling in , then it is referred to as filling initial label matrix, otherwise not-filling initial label matrix. The other place is to construct the constraint matrix (see Eq. (6)). To evaluate the influence of these two uses, we compare different variants of our methods, as shown in Table 3. Several state-of-the-art multi-label methods that can also handle missing labels are used for comparison, including MC-Pos MC-Pos-nips-2011 , FastTag fasttag-icml-2013 , MLML-exact and MLML-appro my-icpr-2014 , as well as LEML LEML-ICML-2014 . FastTag is specially developed for image annotation, while the other methods are general machine learning methods. A state-of-the-art method in hierarchical multi-label learning, called CSSAG bi-wei-icml-2011 , is also evaluated. CSSAG is a decoding method that operates on the continuous label matrix predicted by another algorithm, i.e., the kernel dependency estimation (KDE) algorithm kde-nips-2002 . However, the KDE algorithm does not work in the case of missing labels. To make a fair comparison between CSSAG and our proposed methods, the predicted label matrix of MLMG-CO is used as the input of CSSAG. The results are obtained with the publicly available MATLAB source code of these methods provided by their authors. Note that in our previous work my-iccv-2015 , MLR-GL bucak-multi-incomplete-2011 and the binary SVM were also compared, but here we omit these comparisons due to their much higher computation and memory costs relative to the other compared methods.
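The filling step can be sketched as follows, assuming provided positives are encoded as `1.0` and that the hierarchy is available as a mapping from each class index to its direct parents. The encoding, the helper names and the toy hierarchy are assumptions made for illustration.

```python
import numpy as np

POS = 1.0  # assumed encoding of a provided positive label (hypothetical)

def ancestors(c, parents):
    """All ancestor classes of c, given a dict: class -> list of direct parents."""
    seen, stack = set(), list(parents.get(c, []))
    while stack:
        p = stack.pop()
        if p not in seen:
            seen.add(p)
            stack.extend(parents.get(p, []))
    return seen

def fill_initial_labels(Y, parents):
    """Mark every ancestor of a provided positive label as positive as well."""
    Y_filled = Y.copy()
    for c in range(Y.shape[0]):
        anc = sorted(ancestors(c, parents))
        cols = np.flatnonzero(Y[c] == POS)   # instances where class c is positive
        if anc and cols.size:
            Y_filled[np.ix_(anc, cols)] = POS
    return Y_filled

# Classes: 0 = animal, 1 = mammal, 2 = horse; 0.5 marks a missing label here.
parents = {2: [1], 1: [0]}
Y = np.array([[0.5, 0.0],
              [0.5, 0.5],
              [1.0, 0.5]])
Y_filled = fill_initial_labels(Y, parents)
```

Only entries implied by a provided positive label change. In the second instance, "animal" stays negative and the other two classes stay missing, because nothing in that column is a provided positive.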

constraint \ initial label matrix | not-filling | filling | not-filling | filling
without SH constraint | MLMG-CO | MLMG-CO + filling | MLMG-SL | MLMG-SL + filling
with SH constraint | MLMG-CO + constraint | MLMG-CO + filling + constraint | MLMG-SL + constraint | MLMG-SL + constraint + filling

Table 3: Different algorithm names of variants of our methods. See “Methods for comparison” in Section 6.1 for details.

Evaluation metrics. Average precision (AP) multilabel-evaluation-tkdd-2010 is adopted to measure the ranking performance of the predicted labels of each instance, i.e., the ranking performance of each column vector in the continuous label matrix . Mean average precision (mAP) information-retrieval-2008 is also adopted to evaluate the performance of the tag-based image retrieval, i.e., the ranking performance of each row vector in . To quantify the degree to which the semantic hierarchical constraints are violated, we adopt a simplified hierarchical Hamming loss, similar to hml-text-icml-2005 ,


where denotes the discrete label matrix generated by setting the top- labels in the continuous label vector of each instance as , while all others as . denotes the complete ground-truth label matrix. indicates the logical AND operator. denotes the indicator function: if is true, then , otherwise . The above equation calculates the case that in the ground-truth , if the predicted label of the parent class is correct (i.e., ) but the label of the child class is incorrect (i.e., ). This case indicates the violation of semantic hierarchical constraints. Then we define an average hierarchical loss (AHL) as . In experiments we set on MediaMill, while on other datasets.
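The two ranking metrics can be illustrated with the following sketch. It uses the standard definition of average precision (the mean of the precision values at the rank of each relevant item) and assumes a classes-by-instances score matrix `Z` and a ground-truth matrix `Y` in which 1 marks a positive label; the function names are hypothetical, and the cited references may differ in minor details such as tie handling.

```python
import numpy as np

def average_precision(scores, relevant):
    """AP of one ranked list: mean precision at the rank of each relevant item."""
    relevant = np.asarray(relevant, dtype=bool)
    if not relevant.any():
        return np.nan  # undefined when there is no positive item
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = relevant[order]
    precision_at_rank = np.cumsum(hits) / np.arange(1, hits.size + 1)
    return precision_at_rank[hits].mean()

def column_ap(Z, Y):
    """Mean AP over instances (columns): ranking of labels for each instance."""
    return np.nanmean([average_precision(Z[:, i], Y[:, i] == 1)
                       for i in range(Z.shape[1])])

def row_map(Z, Y):
    """Mean AP over classes (rows): ranking of instances for each tag."""
    return np.nanmean([average_precision(Z[c, :], Y[c, :] == 1)
                       for c in range(Z.shape[0])])
```

The only difference between the two metrics in this sketch is the axis along which the ranking is formed, which is why an operation that reorders labels within a column can change AP without directly changing mAP.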

Figure 3: Average precision (top) and mAP (bottom) results of four benchmark datasets for methods with the original initial label matrix. The bar on each point indicates the corresponding standard deviation. Figure better viewed on screen.

Figure 4: Average precision (top) and mAP (bottom) results of four benchmark datasets for methods with the semantically filled-in initial label matrix. The bar on each point indicates the corresponding standard deviation. Figure better viewed on screen.

Other settings. To simulate different scenarios with missing labels, we create training datasets with varying proportions of missing labels, ranging from to . Given a missing label proportion , we first randomly sample rounding entries in the training label matrix, with being the number of training instances. Then, for every sampled entry, we check whether it corresponds to a leaf or singleton class in the constructed semantic hierarchies introduced above: if so, it is chosen as a missing label; otherwise its original value in the training label matrix is kept. Consequently, the number of missing labels is smaller than rounding. The reason for this setting is that if missing labels were generated on root and intermediate classes, many of them could be directly inferred as positive labels using the semantic hierarchical constraint. Specifically, given one missing label generated on a root or intermediate class, if any one of its descendant classes is positive, then this missing label could be easily corrected to positive. Note that this setting is more favourable to the compared methods that do not utilize the semantic hierarchical constraint. We repeat the above process 5 times to obtain different missing labels. In all cases, the experimental results on testing data are computed based on the complete label matrix. The reported results are summarized as the mean and standard deviation over all the runs. The trade-off parameters of MLMG-CO ( and ) and MLMG-SL (, , and ) are tuned by cross-validation. Specifically, for MLMG-CO, we set the tuning ranges as , and ; for MLMG-SL, they are , , and . and are defined as sparse matrices. The numbers of neighbors of each instance/class and are set as and , respectively.
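The sampling procedure for missing labels can be sketched as below. This is a hedged illustration: the exact number of sampled entries is not recoverable from the text here, so the sketch assumes it is the rounded product of the proportion and the total number of entries. The function name and the `leaf_or_singleton` argument are hypothetical, and the function returns a boolean mask rather than writing a specific missing-label value.

```python
import numpy as np

def sample_missing_mask(num_classes, num_instances, proportion,
                        leaf_or_singleton, seed=0):
    """Boolean mask of training entries to treat as missing.

    Assumption: round(proportion * num_classes * num_instances) entries are
    drawn without replacement. A drawn entry becomes missing only if its class
    (row) is a leaf or singleton class, so the final number of missing labels
    is at most the number of draws.
    """
    rng = np.random.default_rng(seed)
    total = num_classes * num_instances
    num_draws = int(round(proportion * total))
    flat = rng.choice(total, size=num_draws, replace=False)
    rows, cols = np.unravel_index(flat, (num_classes, num_instances))
    keep = np.isin(rows, list(leaf_or_singleton))
    mask = np.zeros((num_classes, num_instances), dtype=bool)
    mask[rows[keep], cols[keep]] = True
    return mask
```

Repeating the call with different seeds mirrors the repeated runs over which the mean and standard deviation are reported.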

An acceleration heuristic. The computation of the step size (see Eq. (13)) in MLMG-CO takes about of the running time in each iteration. However, we observe that the step sizes in consecutive iterations tend to be very close. Thus, we compute the step size exactly only once in every 5 iterations, while the step sizes in between are obtained by multiplying the step size of the preceding iteration by a damping factor ( in our experiments). Compared to the case where the step size is computed exactly in each iteration, the runtime is significantly reduced to about (this value depends on ) with a negligible effect on prediction performance.
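The heuristic can be sketched on a generic box-constrained quadratic program. The code is an assumption-laden illustration rather than the authors' algorithm: the step-size formula is the textbook exact line-search step for an unconstrained quadratic, used as a stand-in for Eq. (13), and the damping value 0.95 is a placeholder, since the value used in the experiments is not given here.

```python
import numpy as np

def pgd_box_qp(A, b, z0, num_iters=100, period=5, damping=0.95):
    """Projected gradient descent for min 0.5 z'Az + b'z subject to 0 <= z <= 1.

    The step size is recomputed every `period` iterations and multiplied by
    `damping` in the iterations in between (placeholder value, an assumption).
    """
    z = np.clip(np.asarray(z0, dtype=float), 0.0, 1.0)
    step = 0.0
    for t in range(num_iters):
        g = A @ z + b  # gradient of the quadratic objective
        if t % period == 0:
            curvature = g @ (A @ g)
            if curvature <= 0:  # zero gradient: nothing left to do
                break
            # Exact line-search step for the quadratic, ignoring the box.
            step = (g @ g) / curvature
        else:
            # Cheap approximation between two exact computations.
            step *= damping
        z = np.clip(z - step * g, 0.0, 1.0)  # projection onto the box
    return z
```

The saving comes from skipping the matrix-vector product inside the step-size computation in four out of every five iterations; the gradient itself is still computed every time.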

6.2 Results without Semantic Hierarchical Constraints

Figs. 3 and 4 present AP and mAP results when the semantic hierarchy is not used as a constraint, i.e., . In this case, the inequality constraints (see Eq. (6)) in ML-MG are degenerate, and the proposed model MLMG-CO becomes a convex QP with box constraints that can be solved with the PGD algorithm, which is more efficient than the ADMM algorithm. The semantic hierarchy is only used to fill in the missing ancestor labels in the initial label matrix . We report results with both the original initial label matrix and the semantically filled-in initial label matrix, as shown in Figs. 3 and 4, respectively. Using the same initial label matrix and no constraints ensures a fair comparison among the formulations of the different models for the MLML problem.

As shown in Fig. 3, both MLMG-CO and MLMG-SL consistently outperform other MLML methods, even without using the semantic hierarchy information. The improvement margin over the most competitive method on the six datasets is at least (AP) or (mAP). Compared with MLML-exact and MLML-approx, MLMG-CO shows significant improvement, especially when large proportions of missing labels exist. There are two main reasons. Firstly, there are many noisy negative labels in the original training label matrix, i.e., some positive labels 1 are incorrectly set to 0. Since a larger penalty is incurred when misclassifying a positive label in MLMG-CO, the influence of noisy negative labels can be alleviated. However, this is not the case for both MLML-exact and MLML-approx. Secondly, MLMG-CO does not give any bias to missing labels. In contrast, missing labels are encouraged to be intermediate values between negative and positive labels in MLML-exact and MLML-approx, which brings in label bias. This is why their performance decreases significantly as the missing proportion increases.

Figure 5: Average precision (AP) and average hierarchical loss (AHL) results of our methods and CSSAG.

In terms of the comparison between MLMG-CO and MLMG-SL, their performance is similar in most cases. However, we observe that when the missing label proportion is small, MLMG-CO is slightly better than MLMG-SL; as the missing label proportion increases, MLMG-SL shows better performance than MLMG-CO. Specifically, in the case of missing labels, the relative improvements in AP values of MLMG-SL over MLMG-CO are , on MediaMill, ESP Game (traditional), ESP Game (CNN), IAPRTC-12 (traditional) and IAPRTC-12 (CNN), respectively; while the relative improvements in mAP values are , , accordingly. This is consistent with the different assumptions used in MLMG-CO and MLMG-SL. As the class-level smoothness used in MLMG-CO is derived from the initial label matrix, the obtained smoothness is likely to be inaccurate when many labels are missing; in contrast, the sparse and low rank decomposition used in MLMG-SL is independent of the initial label matrix, so it is not affected by the increase in missing labels. Note that on Corel 5k, due to the extremely sparse positive labels in the label matrix (see the positive proportions in Table 2), the SVD step in the MLMG-SL algorithm cannot produce a valid solution, so the results of MLMG-SL are not reported. Besides, the high memory requirements of MLML-exact and MLML-approx preclude running them on MediaMill data.

The results of using the filled-in initial label matrix are shown in Fig. 4. As before, both MLMG-CO and MLMG-SL perform much better than the other compared methods. In the case of missing labels, the relative improvements in AP values of MLMG-SL over MLMG-CO are on MediaMill, ESP Game (traditional), ESP Game (CNN), IAPRTC-12 (traditional) and IAPRTC-12 (CNN), respectively; while the relative improvements in mAP values are , accordingly.

Comparing Figs. 3 and 4, it is easy to see that the performance of most methods improves significantly when the filled-in initial label matrix is used instead of the original initial label matrix. The main reason is that the performance of any model is strongly affected by noisy labels (i.e., ground-truth positive labels that are incorrectly set as negative labels in the original initial label matrix). This verifies the contribution of augmenting the ground-truth label matrix using our constructed semantic hierarchies.

6.3 Results with Semantic Hierarchical Constraints

The results of utilizing the semantic hierarchy are shown in Fig. 5. To highlight the influence of semantic hierarchical constraints, here we again report the results of MLMG-CO, MLMG-CO + filling, MLMG-SL and MLMG-SL + filling, which have been presented in Section 6.2.

Comparison among four variants of MLMG-CO. In Fig. 5, the results of the four variants of MLMG-CO are denoted using the lines with the mark, but with different colors. The results of MLMG-CO are much inferior to those of the other three variants, because the semantic hierarchy is used neither in the initial label matrix nor as constraints during the optimization. This demonstrates the importance of the semantic hierarchy. MLMG-CO + filling and MLMG-CO + constraint show similar performance in terms of AP and mAP. For MLMG-CO + constraint, although there are many noisy labels in the initial label matrix, the constraint during the optimization can correct the noisy labels to a large extent, achieving ranking performance (AP and mAP) similar to that of MLMG-CO + filling. However, the AHL values of MLMG-CO + constraint are always 0, while those of MLMG-CO + filling are always positive. This indicates that the tag ranking list for each instance produced by MLMG-CO + constraint is semantically consistent, while that produced by MLMG-CO + filling is partially inconsistent with the semantic hierarchical constraint, i.e., some child tags are ranked higher than their ancestor tags. These two points verify the efficacy of embedding the semantic hierarchy as a linear constraint.
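The notion of a semantically consistent ranking can be checked directly on a continuous score matrix. The sketch below is a simple diagnostic and not the AHL formula, which is not reproduced in this section: it counts how often a child class receives a strictly higher score than its parent. The `edges` list of (parent, child) index pairs is a hypothetical encoding of the hierarchy.

```python
import numpy as np

def count_rank_violations(Z, edges, tol=1e-9):
    """Count (edge, instance) pairs in which a child scores above its parent.

    Z     : (num_classes, num_instances) continuous label scores.
    edges : list of (parent, child) class-index pairs (hypothetical encoding).
    """
    parents = np.array([p for p, _ in edges], dtype=int)
    children = np.array([c for _, c in edges], dtype=int)
    return int(np.sum(Z[children] > Z[parents] + tol))
```

A count of zero means that no child tag can be ranked above its parent for any instance, which is the property that the constrained variants are reported to satisfy.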

metric | method | MediaMill (provided / val / testing) | ESP Game (traditional) (provided / val / testing) | ESP Game (CNN) (provided / val / testing) | IAPRTC-12 (traditional) (provided / val / testing) | IAPRTC-12 (CNN) (provided / val / testing)
AP | MLMG-CO | 0.9994 / 0.745 / 0.716 | 0.9994 / 0.4373 / 0.4422 | 0.9994 / 0.581 / 0.5887 | 1.0 / 0.5487 / 0.5527 | 0.9971 / 0.6455 / 0.6467
AP | MLMG-SL | 1.0 / 0.7694 / 0.7234 | 0.9967 / 0.4435 / 0.4441 | 0.9997 / 0.5965 / 0.5946 | 1.0 / 0.5518 / 0.5535 | 0.9997 / 0.6615 / 0.6604
mAP | MLMG-CO | 0.9999 / 0.5167 / 0.3344 | 1.0 / 0.2573 / 0.2458 | 0.9999 / 0.4544 / 0.4578 | 1.0 / 0.4443 / 0.4412 | 0.999 / 0.5273 / 0.509
mAP | MLMG-SL | 1.0 / 0.5404 / 0.3344 | 0.9996 / 0.2633 / 0.266 | 1.0 / 0.4949 / 0.4866 | 1.0 / 0.4472 / 0.4413 | 0.9999 / 0.5417 / 0.533
Table 4: Evaluation of our proposed methods in the semi-supervised multi-label setting. ‘provided’ indicates the subset of fully labelled images in the original training set; ‘val’ represents the subset of unlabelled images in the original training set; ‘testing’ denotes the testing image set, where the images are also unlabelled.

MLMG-CO + filling + constraint shows the best results among the four variants in most cases. It gives not only the highest AP and mAP values, but also semantically consistent results. This demonstrates that both filling and constraint contribute to the performance. Note that the improvements in AP values of the other three variants over MLMG-CO are larger than the improvements in mAP values. The main reason is that both filling and constraint directly influence the labels in each column, and AP measures the label ranking performance in each column. In contrast, the row ranking, which is measured by mAP, is only indirectly influenced by filling and constraint through the label propagation on the mixed dependency graph.

Comparison among four variants of MLMG-SL. In Fig. 5, the results of the four variants of MLMG-SL are denoted using the lines with the mark, but with different colors. Similar to the above comparison for MLMG-CO, MLMG-SL shows the worst performance among its four variants; MLMG-SL + filling and MLMG-SL + constraint show similar performance in most cases; MLMG-SL + filling + constraint shows the best performance.

Comparison between MLMG-CO and MLMG-SL. In Fig. 5, the corresponding variants of MLMG-CO and MLMG-SL are denoted as lines with the same color, but with different marks ( and respectively, see the same column of the legend). Similar to the comparison between MLMG-CO and MLMG-SL shown in Section 6.2, on most datasets MLMG-CO performs better than MLMG-SL when the missing label proportion is small, and worse as the missing label proportion increases.

Comparison between MLMG-CO+constraint and CSSAG. Based on the input continuous labels produced by MLMG-CO, CSSAG changes continuous labels to binary ones according to the semantic hierarchy and the predefined number of positive labels. Consequently, the AP results of the discrete outputs of CSSAG are similar to the AP values of MLMG-CO, but the mAP values of CSSAG are much lower than those of MLMG-CO. We think the reason is that CSSAG focuses on adjusting the column-wise label rankings, while mAP measures the row-wise label ranking performance. Moreover, although CSSAG ensures that there are no inconsistent labels in its binary label matrix, it cannot provide a consistent continuous label ranking. In contrast, ML-MG can satisfy these two conditions simultaneously. This comparison demonstrates that using the semantic hierarchy as a constraint during optimization (as done in MLMG-CO + constraint) is more effective than using it as a constraint in a post-processing step (as done in CSSAG).

6.4 Evaluation of Semi-supervised Multi-label Learning

In the above experiments, missing labels are randomly generated across different training instances and different classes. A special case is that some training instances are fully annotated, while other training instances are totally unlabelled, referred to as semi-supervised multi-label learning (SSML) semi-multi-label-sdm-2008. Our proposed model can naturally handle SSML. In contrast, not all compared multi-label models that handle missing labels can exploit totally unlabelled images; FastTag fasttag-icml-2013, for example, cannot. Here we provide a further evaluation of our proposed methods in the SSML setting. Specifically, we randomly choose a subset of training instances whose size equals that of the testing instance set, then hide their labels from the model (i.e., setting the label value to ). This subset is referred to as the validation set, while the subset of the other, fully labelled training images is called the provided set. The equal size of the validation set and the testing set ensures a fair comparison of the prediction performance on these two sets. For clarity, we only present the experiment with the not-filling initial label matrix and with the SH constraint (see Table 3). Besides, since MLMG-SL is inapplicable to Corel 5k with missing labels, we ignore Corel 5k here. The results are shown in Table 4.

On both ESP Game and IAPRTC-12, the results evaluated by AP and mAP on the validation set are similar to or slightly higher than those on the testing set. This demonstrates that the joint probability distributions of image features and labels are close on the training set and the testing set of these two datasets. This property makes it easier to determine the model and algorithm parameters of our methods using cross-validation. However, on MediaMill, there are significant gaps between the evaluation results on the validation set and the testing set, especially in the results evaluated by mAP. This reveals that the joint probability distributions of instance features and labels differ between the training set and the testing set in MediaMill.

Figure 6: F1 scores of the predicted labels of the provided, missing and testing entries in the label matrix. Please see Section 6.5 for details. Figure better viewed on screen.
Figure 7: Average precision (top) and mAP (bottom) results of our proposed methods on evaluation of missing label imputations. Please see Section 6.5 for details. Figure better viewed on screen.

6.5 Evaluation of Missing Label Imputations

As transductive models, our proposed methods can not only predict the labels of testing images, but also impute the missing labels of training images. Here we evaluate the imputation performance of missing labels using our methods, and compare it with the prediction performance on testing images. For clarity, we only present the experiment with the not-filling initial label matrix and with the SH constraint (see Table 3). In the above experiments, missing labels are generated only on leaf and singleton classes. However, as our method enforces that the label score of the parent class cannot be lower than that of its child classes, the leaf and singleton classes are at a disadvantage in the competition with the root and intermediate classes. Thus, the imputation performance of missing labels corresponding to leaf and singleton classes is very poor, as they have to compete with the provided labels, a large proportion of which correspond to root and intermediate classes. Instead, here we change the setting so that missing labels can be generated on all classes. Then, at the same missing label proportion, there are actually more missing labels in the training label matrix than in the case where missing labels are only generated on leaf and singleton classes. Note that Corel 5k is not evaluated here. As discussed in Section 6.2, the SVD step in MLMG-SL cannot give a valid solution on Corel 5k, due to the extremely sparse positive labels in the provided label matrix of Corel 5k, especially when labels are missing on all classes.

We present two evaluations. The first evaluation uses the F1 score on provided, missing and testing labels. Specifically, we first discretize the predicted continuous label matrix by setting the labels of the top-10 largest scores in the label vector corresponding to each image (i.e., the column vector of the label matrix) to , while all other entries in the same label vector as . The sub-label-vectors of both provided labels and missing labels are extracted from each training label vector in the binary label matrix, and are evaluated separately using the F1 score. This evaluation clearly reveals the imputation performance on missing labels, compared with the prediction performance on provided labels and testing labels. The results are shown in Fig. 6. There are two observations from the results on all datasets. One is that the prediction performance on provided labels (see the green lines in Fig. 6) is always better than that on missing and testing labels, but this advantage shrinks as the missing label proportion grows. The reason is that the label consistency term in our model (see Eq. (1)) encourages the predicted label scores to be consistent with the ground-truth labels at the provided entries of the label matrix. In contrast, there is no such consistency term for missing and testing labels. When the missing label proportion is small, this consistency term provides the reference for more labels, which explains why the advantage on provided labels decreases as the missing label proportion increases. The other observation is that the imputation performance on missing labels (see the blue lines in Fig. 6) is worse than that on testing labels (see the red lines in Fig. 6) at the missing label proportion , but their performance becomes similar when the missing label proportion is large. The missing labels have to compete with the provided labels in the same label column. When a large proportion of labels are provided, the unfairness between missing labels and provided labels may preclude the recovery of the ground-truth positive labels among the missing labels. The degree of this unfairness decreases as the missing label proportion increases. In contrast, there is no such unfairness among the entries in the same label column for testing images, as all entries in the same column are missing. This difference is the main reason for the performance gap between the imputation of missing labels and the prediction of testing labels when the missing label proportion is small.
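The first evaluation can be illustrated as follows. This is a sketch under assumptions rather than the authors' protocol: it assumes the binarized values are 1 for the top-k entries and 0 otherwise, and it pools all selected entries into a single F1 score, whereas the paper may instead average a per-image score. The function names and the boolean `mask` argument (marking provided or missing entries) are hypothetical.

```python
import numpy as np

def top_k_binarize(Z, k=10):
    """Set the k largest scores in each column to 1 and the rest to 0 (assumed values)."""
    k = min(k, Z.shape[0])
    top = np.argsort(-Z, axis=0, kind="stable")[:k]
    B = np.zeros(Z.shape, dtype=int)
    np.put_along_axis(B, top, 1, axis=0)
    return B

def masked_f1(B, Y, mask):
    """F1 of binary predictions B against ground truth Y, restricted to mask.

    All masked entries are pooled into one score (an assumption).
    """
    pred, true = B[mask] == 1, Y[mask] == 1
    tp = np.sum(pred & true)
    if tp == 0:
        return 0.0
    precision, recall = tp / pred.sum(), tp / true.sum()
    return 2 * precision * recall / (precision + recall)
```

Calling `masked_f1` once with the mask of provided entries and once with the mask of missing entries gives the two training-side curves; for testing images the mask covers every entry of the column.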

MC-Pos-nips-2011 fasttag-icml-2013 LEML-ICML-2014 -exact my-icpr-2014 -appro my-icpr-2014 constraint constraint
Corel 5k 72.94 55.94 203.4 50.6 2.56 1.86 4.76 21.3
MediaMill 165.4 56.5 151.5 238.4 4.63 8.75 23.39 44.25
ESP Game (traditional) 337 124 638.2 3004 239.2 11 28.2 337.3
ESP Game (CNN) 1772 199.3 2164
IAPRTC-12 (traditional) 326.6 202.2 742.2 3378 238.4 11.6 30.5 158
IAPRTC-12 (CNN) 1709 271.7 2063
Table 5: Runtime in seconds of all compared methods. The smallest runtime of each dataset is highlighted in bold.

The second evaluation uses the metrics AP and mAP on both training and testing images, as shown in Fig. 7. It shows how the additional missing labels at the root and intermediate classes influence performance. Compared with the results reported in Fig. 5 (see MLMG-CO+constraint and MLMG-SL+constraint), the corresponding results of MLMG-CO and MLMG-SL in Fig. 7 are only slightly lower. The reason is that the additional missing labels at root and intermediate classes can be easily recovered by our methods if any one of their descendant classes is provided.

6.6 Complexity and Runtime

Complexity. Here we analyze the complexities of our methods. MLMG-CO is implemented by the PGD algorithm (see Section 4.1), which can be further accelerated with the following observations. First, both and are sparse, and there are only and non-zero entries, respectively. denotes the number of neighbours at the instance-level, while is the number of neighbours at the class-level (their specific values on different datasets are shown in Table 2). Second, there are some shared terms between different steps, such as and . Third, it is known that . Thus we have or . Considering always holds in the datasets in our experiments, the computational cost can be significantly reduced from to , or from to . Utilizing the above three observations, the actual computational complexity of MLMG-CO is