## I Introduction

In Superset Label Learning (SLL), one training example can be ambiguously labeled with multiple candidate labels, among which only one is correct. This differs from conventional supervised classification, in which every training example has exactly one explicit label.

SLL has a variety of applications. For example, an episode of a video or TV serial may contain several characters chatting with each other, and their faces may appear simultaneously in a screenshot. We also have access to the scripts and dialogues indicating the characters’ names. However, this information only reveals who is in the given screenshot, and does not establish the one-to-one correspondence between the characters’ faces and the names that appear. Therefore, each face in the screenshot is ambiguously named, and our target is to determine the groundtruth name of each face in the screenshot (see Fig. 1(a)). A similar application arises in photograph collections such as newsletters or family albums, where each photo may be annotated with a description indicating who is in the photo. However, the identity of each individual person in the photo is not specified, so matching the persons with their real names is useful (see Fig. 1(b)). The SLL problem also arises in crowdsourcing, in which each example (image or text) may be assigned multiple labels by different annotators. Some of these labels may be incorrect or biased because the annotators differ in expertise or cultural background, so it is necessary to find the most suitable label of every example among its candidate labels (see Fig. 1(c)). In the above applications, manually assigning the groundtruth label to each example incurs unaffordable monetary or time cost, so SLL can be an ideal tool for tackling such problems with ambiguously labeled examples.

Superset label learning [liu2012conditional] is also known as “partial label learning” [cour2011learning, zhangsolving2015, yuACML15] and “ambiguously label learning” [hullermeier2006learning, chen2014ambiguously]. For consistency of presentation, we will use the term “superset label learning” throughout this paper. Superset label learning is formally defined as follows. Suppose we have $n$ training examples $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$ with dimensionality $d$, and their candidate labels are recorded by label sets $S_1, S_2, \ldots, S_n$, respectively. Therefore, the entire candidate label space built from $c$ possible class labels has a size that is exponential in $c$. Besides, we assume that the groundtruth labels of these training examples are $y_1, y_2, \ldots, y_n$ with $y_i \in S_i$ ($i = 1, 2, \ldots, n$), whereas they are unknown to the learning algorithms. Therefore, given the output label set denoted by $\mathcal{Y} = \{1, 2, \ldots, c\}$, the target of an SLL algorithm is to build a classifier based on the training examples and their candidate label sets so that it can accurately predict the single unambiguous label of an unseen test example.

### I-A Related Work

To the best of our knowledge, the concept of SLL was first proposed by Grandvalet [grandvallet2002logistic], who elegantly adapted traditional logistic regression to superset label cases. Since then, there have been two main threads for tackling the SLL problem: regularization-based models and instance-based models.

#### I-A1 Regularization-based Models

Regularization-based models try to achieve a maximum margin effect by developing various loss functions. For example, Jin et al. [jin2002learning] first assume that every element in the candidate set $S_i$ ($i = 1, 2, \ldots, n$) has equal probability of being the correct label, and design a “naive” superset label loss. Next, considering that it is inappropriate to treat all the candidate labels equally, they further propose to disambiguate the candidate labels, *i.e.* to directly discover each example’s groundtruth label from its multiple candidate labels, so that a discriminative log-linear model can be built. Besides, Cour et al. [cour2011learning, cour2009learning] hold that the above naive loss is loose compared to the real superset label 0-1 loss (whose indicator operation returns 1 if the argument within the bracket holds true, and 0 otherwise), so they propose a surrogate loss that is a tighter approximation to the real 0-1 loss than the naive loss. To be specific, this loss is built on a base loss that can be the hinge, exponential or logistic loss, and its first term is evaluated on the mean value of the scores of the labels in the candidate set. However, this averaging strategy has a critical shortcoming: its effectiveness can be greatly decreased by the false positive label(s) in the candidate label set. As a result, the training process will be dominated by these false positive labels and the final model output can be biased. Therefore, Nguyen et al. [nguyen2008classification] develop the superset label hinge loss that maximizes the margin between the maximum model output among candidate labels and that among the remaining non-candidate labels. Differently, Hüllermeier et al. [H2015Superset] propose a generalized loss built on the logistic loss. However, the above two formulations do not discriminate the groundtruth label from the other candidate labels. Therefore, Yu et al. [yuACML15] devise a new SLL maximum margin formulation based on Support Vector Machines (SVM), which directly maximizes the margin between the groundtruth label and all other labels. Different from the above methods, which only assume that one example is associated with a set of candidate labels, Luo et al. [luo2010learning] consider a generalized setting in which each training example is a bag containing multiple instances and is associated with a set of candidate label vectors. Each label vector encodes the possible labels for the instances in the bag, and only one of them is fully correct.

For the theoretical aspect, Cid-Sueiro [cid2012proper] studies the general necessary and sufficient condition for designing an SLL loss function, and provides a detailed procedure to construct a proper SLL loss under practical situations. Cid-Sueiro et al. [Cid2014Consistency] also reveal that the consistency of loss functions depends on the mixing matrix, which refers to the transition matrix relating the candidate labels and the groundtruth label. More generally, Liu et al. [liu2014learnability] discuss the learnability of regularization-based SLL approaches, and reveal that the key to achieving learnability is that the expected classification error of any hypothesis in the space can be bounded by the superset label 0-1 loss averaged over the entire training set.

Other representative regularization-based SLL algorithms include [chen2014ambiguously, shrivastava2012learning, zhang2014disambiguation] that utilize coding theory, [liu2012conditional] that employs the conditional multinomial mixture model, and [zeng2013learning] that leverages the low-rank assumption [Xu2016Local, Xu2015MultiTIP] to capture the example-label correspondences.

#### I-A2 Instance-based Models

Instance-based models usually construct a nonparametric classifier on the training set, and the candidate label set of a training example can be either disambiguated or kept ambiguous as originally presented. Hüllermeier et al. [hullermeier2006learning] propose a series of nonparametric models such as the superset label $k$-nearest neighbor classifier and decision tree. The models in [hullermeier2006learning] do not have a disambiguation operation and directly use the ambiguous label sets for training and testing. Differently, Zhang et al. [zhangsolving2015] propose an iterative label propagation scheme to disambiguate the candidate labels of training examples. Furthermore, considering that the disambiguation process in current methods simply focuses on manipulating the label space, Zhang et al. [Zhang2016Partial] advocate making full use of the manifold information [Gong2015Deformed] embedded in the feature space, and propose a feature-aware disambiguation.

### I-B Our Motivation

Although the method proposed in [zhangsolving2015] generally obtains the best performance among all existing SLL algorithms, it still suffers from several drawbacks. Firstly, as an instance-based method, it falls short of discovering the mutually exclusive relationship among different candidate labels, and does not take specific measures to highlight the potential groundtruth label during the disambiguation process. Secondly, as an iterative algorithm, the convergence property of the propagation sequence is only empirically illustrated and does not have a theoretical guarantee.

To address the above two shortcomings, we propose a Regularization approach for Instance-based Superset Label learning, and term it “RegISL”. The advantages of RegISL are twofold. Firstly, to make the disambiguated labels discriminative, we design a proper discrimination regularizer along with the related constraints to increase the gap of scores between possible candidate labels and unlikely candidate labels. As a result, the potential groundtruth labels will become prominent, whereas the unlikely labels will be suppressed. Secondly, to avoid the convergence problem of iterative algorithms like [zhangsolving2015], we solve the designed optimization problem via the Augmented Lagrangian Multiplier (ALM) method [Xu2015MultiPAMI, Gong2016TLLT], which always finds a stationary solution. Besides, since the augmented Lagrangian objective function is nonconvex, we show that it can be decomposed as the difference of two convex components and then minimized by the ConCave Convex Procedure (CCCP) [yuille2003concave].

We empirically test RegISL and other representative SLL methodologies [liu2012conditional, cour2011learning, zhangsolving2015, yuACML15, hullermeier2006learning, zhang2014disambiguation] on various practical applications such as character-name association in TV shows, ambiguous image classification, automatic face naming in news images, and bird sound classification. The experimental results suggest that in most cases the proposed RegISL outperforms the competing baselines in terms of both training accuracy and test accuracy.

## II Model Description

This section introduces our nonparametric instance-based method RegISL. In the training stage (Section II-A), a graph $\mathcal{G} = \langle \mathcal{V}, \mathcal{E} \rangle$ is established on the training set to capture the relationship between pairs of training examples, where $\mathcal{V}$ is the node set representing all $n$ training examples and $\mathcal{E}$ is the edge set encoding the similarities between these nodes (see Fig. 2). In this work, two examples $\mathbf{x}_i$ and $\mathbf{x}_j$ are linked by an edge in $\mathcal{G}$ if one of them belongs to the $K$ nearest neighbors of the other, and the edge weight (*i.e.* the similarity between $\mathbf{x}_i$ and $\mathbf{x}_j$) is computed by the Gaussian kernel function [xiao2015parameterGaussian, Gong2016MultiTLLT]

$$W_{ij} = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right), \tag{1}$$

where $\sigma$ denotes the kernel width. In contrast, $W_{ij}$ is set to 0 if there is no edge between $\mathbf{x}_i$ and $\mathbf{x}_j$. After that, a regularized objective function is built on $\mathcal{G}$, which is able to disambiguate the candidate labels and discover the unique real label of every training example. In the test stage (Section LABEL:sec:test), the test example is assigned a label taking a value from $\{1, 2, \ldots, c\}$, with $c$ being the total number of classes, based on the disambiguated labels of its $K$ nearest neighbors in the training set.
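The graph construction described above can be sketched as follows. This is a minimal NumPy illustration rather than the authors' implementation: the function name `knn_gaussian_graph`, its parameter names, and the exact kernel form $\exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / (2\sigma^2))$ are assumptions made for the example.

```python
import numpy as np

def knn_gaussian_graph(X, k=10, sigma=1.0):
    """Symmetric kNN adjacency matrix with Gaussian-kernel edge weights.

    Hypothetical helper: the kernel form exp(-d^2 / (2 * sigma^2)) is assumed.
    """
    n = X.shape[0]
    k = min(k, n - 1)  # an example cannot have more than n - 1 neighbors
    # Pairwise squared Euclidean distances.
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)   # guard against tiny negatives
    np.fill_diagonal(sq_dists, np.inf)     # exclude self-neighbors
    # Indices of the k nearest neighbors of every example.
    nn_idx = np.argsort(sq_dists, axis=1)[:, :k]
    mask = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), k)
    mask[rows, nn_idx.ravel()] = True
    # Link two examples if either one is among the other's neighbors.
    mask = mask | mask.T
    W = np.zeros((n, n))
    W[mask] = np.exp(-sq_dists[mask] / (2.0 * sigma ** 2))
    return W
```

Pairs that are not linked keep a weight of 0, matching the convention stated in the text. A dense distance matrix is used for brevity; a spatial index would be preferable for large training sets.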

### II-A Training Stage

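For our instance-based RegISL, the main target of the training stage is to pick out the real label of each training example $\mathbf{x}_i$ from its candidate label set $S_i$. The established graph $\mathcal{G}$ can be quantified by the adjacency matrix $\mathbf{W}$, whose $(i,j)$-th element is $W_{ij}$ if $\mathbf{x}_i$ and $\mathbf{x}_j$ are linked by an edge and 0 otherwise [zhu2003semi, wang2016semi].

Similar to [zhangsolving2015], the candidate labels of a training example $\mathbf{x}_i$ ($i$ takes a value from $1, 2, \ldots, n$) are represented by a $c$-dimensional label vector $\mathbf{Y}_i = (Y_{i1}, Y_{i2}, \ldots, Y_{ic})$, which is

$$Y_{ij} = \begin{cases} 1/|S_i|, & j \in S_i \\ 0, & \text{otherwise} \end{cases} \tag{2}$$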

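where $|S_i|$ denotes the size of set $S_i$. Note that the sum of all the elements in every $\mathbf{Y}_i$ is 1 according to the definition in Eq. (2). Furthermore, we use the vectors $\mathbf{F}_1, \mathbf{F}_2, \ldots, \mathbf{F}_n$ to record the obtained labels of training examples $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, respectively, in which $F_{ij}$ can be understood as the probability of $\mathbf{x}_i$ belonging to the class $j$. Our regularization model for RegISL can then be expressed as

$$\begin{aligned} \min_{\mathbf{F}_1, \ldots, \mathbf{F}_n} \;\; & \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} W_{ij}\,\|\mathbf{F}_i - \mathbf{F}_j\|^2 + \alpha \sum_{i=1}^{n}\sum_{j \in \Omega_i} (F_{ij} - Y_{ij})^2 - \beta \sum_{i=1}^{n} \|\mathbf{F}_i\|^2 \\ \text{s.t.} \;\; & \sum_{j=1}^{c} F_{ij} = 1, \quad F_{ij} \geq 0, \quad i = 1, \ldots, n, \;\; j = 1, \ldots, c. \end{aligned} \tag{3}$$

In Eq. (3), the set $\Omega_i$ includes the subscripts of zero elements in $\mathbf{Y}_i$, “$\|\cdot\|$” computes the $\ell_2$ norm of the vector, and $\alpha$ and $\beta$ are nonnegative trade-off parameters controlling the relative weights of the three terms in the objective function.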
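As a concrete illustration of the candidate label vectors and of how the three terms of the objective interact, the following NumPy sketch builds the candidate label matrix and evaluates an objective value. It is only a sketch under stated assumptions: the function names, the 0-based class indices, and the precise form of each term (a pairwise smoothness term with a factor 1/2, a quadratic fidelity term restricted to non-candidate positions, and a negative squared-norm discrimination term) are assumptions made for the example, not a confirmed transcription of the authors' model.

```python
import numpy as np

def candidate_label_matrix(candidate_sets, num_classes):
    """Row i holds 1/|S_i| at the candidate labels of example i and 0 elsewhere.

    Assumption: class labels are 0-based integers in this sketch.
    """
    Y = np.zeros((len(candidate_sets), num_classes))
    for i, labels in enumerate(candidate_sets):
        idx = sorted(set(labels))
        Y[i, idx] = 1.0 / len(idx)
    return Y

def regisl_objective(F, W, Y, alpha, beta):
    """Objective value under the assumed three-term form (hypothetical helper)."""
    # Smoothness: strongly linked examples should receive similar label vectors.
    diffs = F[:, None, :] - F[None, :, :]              # shape (n, n, c)
    smoothness = 0.5 * np.sum(W * np.sum(diffs ** 2, axis=2))
    # Fidelity: penalize mass placed on labels outside the candidate set.
    non_candidate = (Y == 0)
    fidelity = np.sum((F[non_candidate] - Y[non_candidate]) ** 2)
    # Discrimination: reward label vectors with a large squared norm.
    discrimination = -np.sum(F ** 2)
    return smoothness + alpha * fidelity + beta * discrimination
```

For instance, with candidate sets `[[0, 1], [2]]` and three classes, the first row of the matrix is `(0.5, 0.5, 0)` and the second is `(0, 0, 1)`, so every row sums to 1 as noted in the text.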

The first term in the objective function of Eq. (3) is called the *smoothness term*, which requires two examples connected by a strong edge (*i.e.* the edge weight $W_{ij}$ is large) in $\mathcal{G}$ to obtain similar labels [zhu2003semi, gong2014fick, pei2015manifold], so minimizing this smoothness term will force $\mathbf{F}_i$ to get close to $\mathbf{F}_j$ if $W_{ij}$ is large. The second term is called the *fidelity term*, which suggests that if $\mathbf{x}_i$’s candidate label set $S_i$ does not contain the label $j$ (*i.e.* $Y_{ij} = 0$), then the $j$-th element in the finally obtained label vector $\mathbf{F}_i$ should also be zero. Although there are many other ways to characterize the difference between $\mathbf{F}_i$ and $\mathbf{Y}_i$, here we simply adopt the quadratic form as it is perhaps the simplest way to compare $\mathbf{F}_i$ and $\mathbf{Y}_i$. This form has also been widely used by many semi-supervised learning methodologies such as [gong2014fick, Zhou03learningwith, wang2009linear]. The third *discrimination term*, along with the *normalization constraint* $\sum_{j=1}^{c} F_{ij} = 1$ and the *nonnegative constraint* $F_{ij} \geq 0$, critically makes the obtained $\mathbf{F}_i$ discriminative. That is to say, by requiring the elements in $\mathbf{F}_i$ to be nonnegative and to sum to 1, minimizing $-\|\mathbf{F}_i\|^2$ (*i.e.* maximizing $\|\mathbf{F}_i\|^2$) will widen the gap of values between possible labels and unlikely labels of $\mathbf{x}_i$, thus yielding a discriminative and confident label vector $\mathbf{F}_i$. The detailed reasons are explained as follows.

Suppose that we are dealing with a binary classification problem (*i.e.* $c = 2$), and the label vector of example $\mathbf{x}_i$ is $\mathbf{F}_i = (F_{i1}, F_{i2})$. If $\mathbf{x}_i$ is initially associated with the ambiguous candidate labels 1 and 2 (*i.e.* $S_i = \{1, 2\}$), we hope that the finally obtained $\mathbf{F}_i$ can approach $(1, 0)$ or $(0, 1)$, which confidently implies that $\mathbf{x}_i$ belongs to the first or second class. In contrast, an output close to $(0.5, 0.5)$ is not encouraged because such an $\mathbf{F}_i$ does not convey any information for deciding $\mathbf{x}_i$’s real label. To this end, we impose the nonnegative and normalization constraints on $\mathbf{F}_i$ as in Eq. (3), so its elements $F_{i1}$ and $F_{i2}$ can only take values along the red line in Fig. 3. Furthermore, we take the red line as the x-axis and plot the squared $\ell_2$ norm of $\mathbf{F}_i$ under different $F_{i1}$ and $F_{i2}$ (see the blue curve). It can be clearly observed that $\|\mathbf{F}_i\|^2$ hits its lowest value when both $F_{i1}$ and $F_{i2}$ are equal to 0.5, and gradually increases when $\mathbf{F}_i$ approaches $(1, 0)$ or $(0, 1)$. Therefore, a label vector with a large norm is encouraged by the discrimination term in Eq. (3), so that the obtained $\mathbf{F}_i$ prefers the definite results $(1, 0)$ or $(0, 1)$ and avoids ambiguous outputs that are close to $(0.5, 0.5)$.
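The shape of this curve can also be checked by hand. In the short derivation below, the scalar $p$ is introduced only for illustration: it denotes the first element of the two-class label vector, so that the second element equals $1 - p$ under the normalization constraint and $0 \leq p \leq 1$ under the nonnegative constraint.

$$\|(p,\, 1-p)\|^2 = p^2 + (1-p)^2 = 2\left(p - \tfrac{1}{2}\right)^2 + \tfrac{1}{2}, \qquad 0 \leq p \leq 1.$$

The squared norm therefore attains its minimum value $1/2$ at $p = 0.5$ and its maximum value $1$ at the two endpoints $p = 0$ and $p = 1$, which agrees with the behaviour described for the two-class case.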

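For ease of optimizing Eq. (3), we reformulate it into a compact form. Based on $\mathcal{G}$’s adjacency matrix $\mathbf{W}$, we further define a diagonal degree matrix $\mathbf{D}$ with the $i$-th diagonal element representing $\mathbf{x}_i$’s degree computed by $D_{ii} = \sum_{j=1}^{n} W_{ij}$. Therefore, a positive semi-definite graph Laplacian matrix can be calculated as $\mathbf{L} = \mathbf{D} - \mathbf{W}$. Besides, we stack the row vectors $\mathbf{Y}_1, \mathbf{Y}_2, \ldots, \mathbf{Y}_n$ as $\mathbf{Y} = (\mathbf{Y}_1^\top, \mathbf{Y}_2^\top, \ldots, \mathbf{Y}_n^\top)^\top$ to establish a candidate label matrix $\mathbf{Y}$. Similarly, the label matrix to be optimized is established by $\mathbf{F} = (\mathbf{F}_1^\top, \mathbf{F}_2^\top, \ldots, \mathbf{F}_n^\top)^\top$. Furthermore, by defining $\mathbf{1}_n$, $\mathbf{1}_c$ and $\mathbf{O}$ as the $n$-dimensional all-one vector, $c$-dimensional all-one vector, and $n \times c$-dimensional all-zero matrix, respectively, Eq. (3) can be rewritten as

$$\begin{aligned} \min_{\mathbf{F}} \;\; & \mathrm{tr}\left(\mathbf{F}^\top \mathbf{L} \mathbf{F}\right) + \alpha \left\|\mathbf{P} \odot (\mathbf{F} - \mathbf{Y})\right\|_{\mathrm{F}}^2 - \beta \left\|\mathbf{F}\right\|_{\mathrm{F}}^2 \\ \text{s.t.} \;\; & \mathbf{F} \geq \mathbf{O}, \quad \mathbf{F}\mathbf{1}_c = \mathbf{1}_n. \end{aligned} \tag{4}$$

In Eq. (4), “$\|\cdot\|_{\mathrm{F}}$” computes the Frobenius norm of the corresponding matrix, and “$\odot$” refers to the elementwise product. $\mathbf{P}$ is a $\{0,1\}$-binary matrix with the element $P_{ij} = 1$ if $Y_{ij} = 0$ and 0 otherwise.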
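The step from a pairwise smoothness sum to a trace expression with the graph Laplacian rests on a standard identity, which holds for any symmetric weight matrix: one half of the weighted sum of squared differences between rows of the label matrix equals the trace of the quadratic form with the Laplacian. The following NumPy sketch checks this numerically; the function names are hypothetical, and the factor 1/2 on the pairwise sum is an assumption about how the smoothness term is scaled.

```python
import numpy as np

def graph_laplacian(W):
    """Unnormalized graph Laplacian: degree matrix minus adjacency matrix."""
    D = np.diag(W.sum(axis=1))
    return D - W

def smoothness_pairwise(F, W):
    """0.5 * sum_ij W_ij * ||F_i - F_j||^2, computed over all pairs of rows."""
    diffs = F[:, None, :] - F[None, :, :]   # shape (n, n, c)
    return 0.5 * np.sum(W * np.sum(diffs ** 2, axis=2))

def smoothness_trace(F, W):
    """tr(F^T L F), the compact matrix form of the same quantity."""
    return np.trace(F.T @ graph_laplacian(W) @ F)
```

For a symmetric, nonnegative weight matrix the two functions agree up to floating-point error, and the eigenvalues of the Laplacian are nonnegative, which is the positive semi-definiteness mentioned in the text.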

Since Eq. (4) is a constrained optimization problem, we may use the method of Augmented Lagrangian Multiplier (ALM) to find its solution. Compared to the traditional Lagrangian method, ALM adds an additional quadratic penalty function to the objective, which leads to faster convergence rate and lower computational cost [bertsekas2014constrained]. Therefore, by introducing the multipliers and to deal with the nonnegative constraint and normalization constraint, respectively, the augmented Lagrangian function is expressed as

(5) |

where is an auxiliary variable that enforces the obtained optimal (*i.e.* ) to be nonnegative. The operation “” returns a matrix with its -th element being the largest element between and . The variable is the penalty coefficient.

Based on Eq. (5), the optimal solution of Eq. (4) can be obtained by alternately updating , , and , among which , and can be easily updated via the conventional rules of ALM, namely:

(8) |

Table III: Experimental results on the MSRCv2 dataset. Each record represents “mean accuracy ± standard deviation”. The best and second best records are marked in red and blue, respectively. The significance markers indicate that RegISL is significantly better (worse) than the corresponding method.

We use the MSRCv2 dataset for our comparison. This dataset contains 591 natural images from 23 classes in total. Every image is segmented into several compact regions with specific semantic information, and the labels of the segmented regions form the candidate label set for the entire image. Among the segmented regions, the label of the most dominant region is taken as the single groundtruth label for the given image (see Fig. II-A). Similar to the experiment on the Lost dataset, we adopt the 512-dimensional GIST feature to represent the images, and all feature vectors are normalized to unit length for all the competing methodologies.

The parameter settings of CLSL, CLSL_Naive, M3SL, ECOC, and LSB-CMM on MSRCv2 are the same as those on the Lost dataset, because they are directly suggested by the authors. The graph parameters $K$ and $\sigma$ for ISL, SLKNN and RegISL are set to 10 and 0.1, respectively, each being the best value chosen from its candidate set. The experimental results are reported in Table III, which reveals that all the methods obtain relatively low accuracy. This is because the MSRCv2 dataset is quite challenging for SLL. Firstly, this dataset is not large but contains as many as 23 classes (see Table LABEL:tab:Datasets), so the training examples belonging to each class are very sparse. Besides, the number of examples having a certain candidate label ranges from 24 to 184, so such insufficient and skewed training examples pose a great difficulty for training a reliable classifier. Secondly, Fig. II-A reveals that the images in MSRCv2 are very complex, and the dominant foreground is often surrounded by background regions with false positive labels, which will mislead both the training and test stages.

Although this dataset is quite challenging, Table III clearly indicates that the proposed RegISL still outperforms the other methods by a noticeable margin in terms of both training accuracy and test accuracy. Specifically, RegISL leads the second best method ISL by margins of roughly 0.06 on training accuracy and 0.005 on test accuracy, which again demonstrates the superiority of our regularization strategy over the existing non-regularized instance-based model. In contrast, the training accuracy and test accuracy obtained by the remaining approaches, namely CLSL, CLSL_Naive, M3SL, ECOC, LSB-CMM and SLKNN, do not exceed 0.6 and 0.3, respectively, which are much worse than those of RegISL.

### III-C Automatic Face Naming in News Images

Table IV: Experimental results on the Soccer Player dataset. Each record represents “mean accuracy ± standard deviation”. The highest and second highest records are marked in red and blue, respectively. The significance markers indicate that RegISL is significantly better (worse) than the corresponding method.

It is often the case that in a news collection every image is accompanied by a short textual description explaining its content. Such a news image may contain several faces, and the associated description indicates the names of the people appearing in the image. However, which face matches which name is not specified. Therefore, in this section we use the Soccer Player dataset.

Table IV reports the experimental results, which show that ECOC achieves the best results on this dataset. Regarding the training accuracy, RegISL is significantly better than CLSL, CLSL_Naive, M3SL and LSB-CMM, and comparable to ISL and ECOC. For test accuracy, RegISL compares favourably with CLSL, CLSL_Naive, M3SL, LSB-CMM, and SLKNN. However, it is inferior to ECOC. Furthermore, we note that RegISL only falls behind ECOC by 0.003 in training accuracy and 0.009 in test accuracy, and it also achieves top-level performance among the compared instance-based methods SLKNN, ISL and RegISL, so the performance of RegISL is still acceptable on this dataset.

### III-D Bird Sound Classification

Table V: Experimental results on the Bird Song dataset. Each record represents “mean accuracy ± standard deviation”. The best and second best results are marked in red and blue, respectively. The significance markers indicate that RegISL is significantly better (worse) than the corresponding method.

In [briggs2012rank], the authors established the Bird Song dataset, which contains 548 bird sound recordings that each last for ten seconds. Each recording consists of 1 to 40 syllables, leading to 4998 syllables in total. Each syllable is regarded as an example and is described by a 38-dimensional feature vector. Since every recording contains the songs produced by different species of birds, our target is to identify which example (i.e. syllable) corresponds to which kind of bird. In this dataset, the bird species appearing in every recording are manually annotated, so they serve as the candidate labels for all the syllables inside this recording.

The number of neighbors $K$ for ISL, SLKNN and our RegISL is set to 10, and the kernel width $\sigma$ in Eq. (1) is tuned to 1 to achieve the best performance. The trade-off parameters $\alpha$ and $\beta$ are adjusted to 1000 and 0.01, respectively, as mentioned in Section LABEL:sec:TVSerial. We present the training accuracy and test accuracy of all the compared methods in Table V. A notable fact revealed by Table V is that the instance-based methods (e.g. SLKNN, ISL and RegISL) generate better performance than the regularization-based methodologies such as CLSL, CLSL_Naive, M3SL, LSB-CMM and ECOC. Among the three instance-based methods, ISL and SLKNN already achieve very encouraging performance. However, our proposed RegISL still improves on them by a noticeable margin in both training accuracy and test accuracy. Therefore, the effectiveness of RegISL is demonstrated, which again suggests that integrating the regularization technique with the instance-based framework is beneficial for achieving improved performance.

### III-E Illustration of Convergence

In Section II, we explained that the iteration process of ALM in our algorithm will converge to a stationary point. Here we present the convergence curves of RegISL on the four adopted datasets.

### III-F Effect of Tuning Parameters