# Learning with Biased Complementary Labels

In this paper, we study the classification problem in which we have access to an easily obtainable surrogate for the true labels, namely complementary labels, which specify classes that observations do not belong to. For example, if one is familiar with monkeys but not meerkats, a meerkat is easily identified as not a monkey, so "monkey" is annotated to the meerkat as a complementary label. Specifically, let Y and Y̅ be the true and complementary labels, respectively. We first model the annotation of complementary labels via the transition probabilities P(Y̅=i|Y=j), i≠j∈{1,...,c}, where c is the number of classes. All previous methods implicitly assume that the transition probabilities P(Y̅=i|Y=j) are identical, which is far from true in practice because humans are biased toward their own experience. For example, if a person is more familiar with monkeys than with prairie dogs when providing complementary labels for meerkats, he/she is more likely to employ "monkey" as a complementary label. We therefore reason that the transition probabilities will be different. In this paper, we address three fundamental problems raised by learning with biased complementary labels. (1) How to estimate the transition probabilities? (2) How to modify the traditional loss functions and extend standard deep neural network classifiers to learn with biased complementary labels? (3) Does the classifier learned from examples with complementary labels by our proposed method converge to the optimal one learned from examples with true labels? Comprehensive experiments on MNIST, CIFAR10, CIFAR100, and Tiny ImageNet empirically validate the superiority of the proposed method over the current state-of-the-art methods, with accuracy gains of over 10%.


## 1 Introduction

Large-scale training datasets translate supervised learning from theories and algorithms to practice, especially in deep supervised learning. One major assumption that guarantees this successful translation is that data are accurately labeled. However, collecting true labels for large-scale datasets is often expensive, time-consuming, and sometimes impossible. For this reason, some weak but cheap supervision information has been exploited to boost learning performance. Such supervision includes side information [36], privileged information [32], and weakly supervised information [17] based on semi-supervised data [40, 11, 8], positive and unlabeled data [7], or noisy labeled data [23, 33, 12, 13, 5, 10]. In this paper, we study another weak supervision: the complementary label, which specifies a class that an object does not belong to. Complementary labels are sometimes easily obtainable, especially when the class set is relatively large. Given an observation in multi-class classification, identifying a class label that is incorrect for the observation is often much easier than identifying the true label.

Complementary labels carry useful information and are widely used in our daily lives: for example, to identify a language we do not know, we may say “not English”; to categorize a new movie without any fighting, we may say “not action”; and to recognize an image of a previous American president, we may say “not Trump”. Ishida et al. [15] then proposed learning from examples with only complementary labels by assuming that a complementary label is uniformly selected from the classes other than the true label class. Specifically, they designed an unbiased estimator such that learning with complementary labels was asymptotically consistent with learning with true labels.

Sometimes, annotators provide complementary labels based on both the content of observations and their own experience, which introduces biases into the complementary labels.

Thus, complementary labels are mostly non-uniformly selected from the remaining classes, some of which even have no chance of being selected for certain cases. Regarding the bias governed by the observation content, let us take labeling digits 0-9 as an example. Since digit 1 is much more dissimilar to digit 3 than digit 8, the complementary labels of “3” are more likely to be assigned with “1” rather than “8”. Regarding the bias governed by annotators’ experience, taking our example above, we can see that if one is more familiar with monkeys than other animals, she may be more likely to use “monkey” as a complementary label.

Motivated by the cause of biases, we here model the biased procedure of annotating complementary labels via probabilities $P(\bar{Y}=i\mid Y=j)$, $i\neq j\in\{1,\dots,c\}$. Note that the assumption that a complementary label is uniformly selected from the remaining classes implies $P(\bar{Y}=i\mid Y=j)=\frac{1}{c-1}$. However, in real applications, the probabilities should not be $\frac{1}{c-1}$ and can differ vastly. How to estimate the probabilities is a key problem for learning with complementary labels.

We therefore address the problem of learning with biased complementary labels. For effective learning, we propose to estimate the probabilities $P(\bar{Y}=j\mid Y=i)$ without biases. Specifically, we prove that given a clear observation $x$ for the $i$-th class, i.e., the observation satisfying $P(Y=i\mid X=x)=1$, which can be easily identified by the annotator, it holds that $P(\bar{Y}=j\mid Y=i)=P(\bar{Y}=j\mid X=x)$. This implies that the probabilities $P(\bar{Y}=j\mid Y=i)$ can be estimated without biases by learning $P(\bar{Y}=j\mid X)$ from the examples with complementary labels. To obtain these clear observations, we assume that a small set of easily distinguishable instances (e.g., 10 instances per class) is usually not expensive to obtain.

Given the probabilities $P(\bar{Y}=j\mid Y=i)$, we modify traditional loss functions proposed for learning with true labels so that the modifications can be employed to efficiently learn with biased complementary labels. We also prove that by exploiting examples with complementary labels, the learned classifier converges to the optimal one learned with true labels with a guaranteed rate. Moreover, we also empirically show that the convergence of our method benefits more from the biased setting than from the uniform assumption, meaning that we can use a small training sample to achieve a high performance.

Comprehensive experiments are conducted on benchmark datasets including UCI, MNIST, CIFAR, and Tiny ImageNet, which verify that our method significantly outperforms the state-of-the-art methods with accuracy gains of over 10%. We also compare the performance of classifiers learned with complementary labels to those learned with true labels. The results show that our method almost attains the performance of learning with true labels in some situations.

## 2 Related Work

#### Learning with complementary labels.

To the best of our knowledge, Ishida et al. [15] were the first to study learning with complementary labels. They assumed that the transition probabilities are identical and then proposed modifying traditional one-versus-all (OVA) and pairwise-comparison (PC) losses for learning with complementary labels. The main differences between our method and [15] are: (1) Our work is motivated by the fact that annotating complementary labels is often affected by human biases. Thus, we study a different setting in which transition probabilities are different. (2) In [15], modifying OVA and PC losses is naturally suitable for the uniform setting and provides an unbiased estimator for the expected risk of classification with true labels. In this paper, our method can be generalized to many losses such as cross-entropy loss and directly provides an unbiased estimator for the risk minimizer. Due to these differences, [15] often achieves promising performance in the uniform setting, while our method achieves good performance in both the uniform and non-uniform settings.

#### Learning with noisy labels.

In the setting of label noise, transition probabilities are introduced to statistically model the generation of noisy labels. In classification and transfer learning, methods [25, 21, 35, 38] employ transition probabilities to modify loss functions such that they can be robust to noisy labels. Similar strategies to modify deep neural networks by adding a transition layer have been proposed in [29, 26]. However, this is the first time that this idea is applied to the new problem of learning with biased complementary labels. Different from label noise, here all diagonal entries of the transition matrix are zero, and in practice the transition matrix sometimes need not be invertible.

## 3 Problem Setup

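In multi-class classification, let $\mathcal{X}\subseteq\mathbb{R}^d$ be the feature space and $\mathcal{Y}=[c]$ be the label space, where $d$ is the feature space dimension; $[c]=\{1,\dots,c\}$; and $c$ is the number of classes. We assume that variables $(X,Y,\bar{Y})$ are defined on the space $\mathcal{X}\times\mathcal{Y}\times\mathcal{Y}$ with a joint probability measure $P(X,Y,\bar{Y})$ ($P_{XY\bar{Y}}$ for short).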

In practice, true labels are sometimes expensive but complementary labels are cheap. This work thus studies the setting in which we have a large set of training examples with biased complementary labels and a very small set of correctly labeled examples. The latter is only used for estimating transition probabilities. Our aim is to learn the optimal classifier with respect to the examples with true labels by exploiting the examples with complementary labels.

For each example $(x,y)$, a complementary label $\bar{y}$ is selected from the complement set $\mathcal{Y}\setminus\{y\}$. We assign a probability to each $\bar{y}\in\mathcal{Y}\setminus\{y\}$ to indicate how likely it is to be selected, i.e., $P(\bar{Y}=\bar{y}\mid X=x,Y=y)$. In this paper, we assume that $\bar{Y}$ is independent of feature $X$ conditioned on true label $Y$, i.e., $P(\bar{Y}=\bar{y}\mid X=x,Y=y)=P(\bar{Y}=\bar{y}\mid Y=y)$. This assumption considers the bias which depends only on the classes, e.g., if the annotator is not familiar with the features in a specific class, she is likely to assign complementary labels that she is more familiar with. We summarize all the probabilities into a transition matrix $Q\in\mathbb{R}^{c\times c}$, where $Q_{ij}=P(\bar{Y}=j\mid Y=i)$ and $Q_{ii}=0$, $\forall i,j\in[c]$. Here, $Q_{ij}$ denotes the entry value in the $i$-th row and $j$-th column of $Q$. Note that the transition matrix is also widely exploited in Markov chains [9] and has many applications in machine learning, such as learning with label noise [25, 29, 26].

If complementary labels are uniformly selected from the complement set, then $Q_{ij}=\frac{1}{c-1}$, $\forall i\neq j$. Previous work [15] has proven that the optimal classifier can be found under the uniform assumption. Sometimes, this is not true in practice due to human biases. Therefore, we focus on situations in which the entries $Q_{ij}$, $i\neq j$, are different. We mainly study the following problems: how to modify loss functions such that the classifier learned with these biased complementary labels can converge to the optimal one learned with true labels; the speed of the convergence; and how to estimate transition probabilities.

## 4 Methodology

In this section, we study how to learn with biased complementary labels. We first review how to learn optimal classifiers from examples with true labels. Then, we modify loss functions for complementary labels and propose a deep learning based model accordingly. Lastly, we theoretically prove that the classifier learned by our method is consistent with the optimal classifier learned with true labels.

### 4.1 Learning with True Labels

The aim of multi-class classification is to learn a classifier $f(X)$ that predicts a label $Y$ for a given observation $X$. Typically, the classifier is of the following form:

$$f(X)=\arg\max_{i\in[c]} g_i(X), \tag{4.1}$$

where $g:\mathcal{X}\to\mathbb{R}^c$ and $g_i(X)$ is the estimate of $P(Y=i\mid X)$.

Various loss functions $\ell(f(X),Y)$ have been proposed to measure the risk of predicting $f(X)$ for $Y$ [1]. Formally, the expected risk is defined as

$$R(f)=\mathbb{E}_{(X,Y)\sim P_{XY}}[\ell(f(X),Y)]. \tag{4.2}$$

The optimal classifier is the one that minimizes the expected risk; that is,

$$f^*=\arg\min_{f\in\mathcal{F}} R(f), \tag{4.3}$$

where $\mathcal{F}$ is the space of $f$.

However, the distribution $P_{XY}$ is usually unknown. We then approximate $R(f)$ by using its empirical counterpart: $R_n(f)=\frac{1}{n}\sum_{i=1}^{n}\ell(f(x_i),y_i)$, where $\{(x_i,y_i)\}_{1\le i\le n}$ are i.i.d. examples drawn according to $P_{XY}$.

Similarly, the optimal classifier is approximated by $f_n=\arg\min_{f\in\mathcal{F}} R_n(f)$.

### 4.2 Learning with Complementary Labels

True labels, especially for large-scale datasets, are often laborious and expensive to obtain. We thus study an easily obtainable surrogate; that is, complementary labels. However, if we still use traditional loss functions when learning with these complementary labels, similar to Eq. (4.1), we can only learn a mapping $q:\mathcal{X}\to\mathbb{R}^c$ that tries to predict the conditional probabilities $P(\bar{Y}\mid X)$ and the corresponding classifier that predicts a $\bar{y}$ for a given observation $X$.

Therefore, we need to modify these loss functions such that the classifier learned with biased complementary labels can converge to the optimal one learned with true labels. Specifically, let $\bar{\ell}$ be the modified loss function. Then, the expected and empirical risks with respect to complementary labels are defined as $\bar{R}(f)=\mathbb{E}_{(X,\bar{Y})\sim P_{X\bar{Y}}}[\bar{\ell}(f(X),\bar{Y})]$ and $\bar{R}_n(f)=\frac{1}{n}\sum_{i=1}^{n}\bar{\ell}(f(x_i),\bar{y}_i)$, respectively. Here, $\{(x_i,\bar{y}_i)\}_{1\le i\le n}$ are examples with complementary labels.

Denote by $\bar{f}^*$ and $\bar{f}_n$ the optimal solutions obtained by minimizing $\bar{R}(f)$ and $\bar{R}_n(f)$, respectively. They are $\bar{f}^*=\arg\min_{f\in\mathcal{F}}\bar{R}(f)$ and $\bar{f}_n=\arg\min_{f\in\mathcal{F}}\bar{R}_n(f)$.

We hope that the modified loss function $\bar{\ell}$ can ensure that $\bar{f}_n\to f^*$ as $n\to\infty$, which implies that by learning with complementary labels, the classifier we obtain can also approach the optimal one defined in (4.3).

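Recall that in the transition matrix $Q$, $Q_{ij}=P(\bar{Y}=j\mid Y=i)$ and $Q_{ii}=0$. We observe that $P(Y\mid X)$ can be transferred to $P(\bar{Y}\mid X)$ by using the transition matrix $Q$; that is, $\forall j\in[c]$,

$$\begin{aligned}
P(\bar{Y}=j\mid X) &= \sum_{i\neq j} P(\bar{Y}=j, Y=i\mid X)\\
&= \sum_{i\neq j} P(\bar{Y}=j\mid Y=i, X)\,P(Y=i\mid X)\\
&= \sum_{i\neq j} P(\bar{Y}=j\mid Y=i)\,P(Y=i\mid X).
\end{aligned} \tag{4.4}$$

Intuitively, if $g_i(X)$ tries to predict the probability $P(Y=i\mid X)$, $\forall i\in[c]$, then $Q^\top g$ can predict the probability $P(\bar{Y}\mid X)$. To enable end-to-end learning rather than transferring after training, we let

$$q(X)=Q^\top g(X), \tag{4.5}$$

where $g(X)$ is now an intermediate output, and $f(X)=\arg\max_{i\in[c]} g_i(X)$.

Then, the modified loss function is

$$\bar{\ell}(f(X),\bar{Y})=\ell(q(X),\bar{Y}). \tag{4.6}$$

In this way, if we can learn an optimal $q$ such that $q_i(X)=P(\bar{Y}=i\mid X)$, $\forall i\in[c]$, then we also find the optimal $g$ and the classifier $f$.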

This loss modification method can be easily applied to deep learning. As shown in Figure 2, we achieve this simply by adding a linear layer to the deep neural network. This layer outputs $q(X)$ by multiplying the output of the softmax function (i.e., $g(X)$) by the transposed transition matrix $Q^\top$. With sufficient training examples with complementary labels, this deep neural network often simultaneously learns good classifiers for both the complementary labels $\bar{Y}$ and the true labels $Y$.

Note that, in our modification, the forward process does not need to compute $Q^{-1}$. Even though the subsequent analysis for identification requires the transition matrix to be invertible, this requirement may sometimes be unnecessary in practice. We also show an example in our experiments in which, even with a singular transition matrix, high classification performance can be achieved if no column of $Q$ is all-zero.
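The forward computation described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' PyTorch implementation: the 3-class matrix `Q`, the logits `h`, and all function names are hypothetical, and the code simply applies Eqs. (4.5) and (4.6) with the cross-entropy loss.

```python
import math

def softmax(h):
    """Numerically stable softmax; plays the role of g(X)."""
    m = max(h)
    exps = [math.exp(v - m) for v in h]
    s = sum(exps)
    return [e / s for e in exps]

def complementary_posterior(Q, g):
    """q = Q^T g, i.e. q[j] = sum_i Q[i][j] * g[i] (Eq. 4.5)."""
    c = len(g)
    return [sum(Q[i][j] * g[i] for i in range(c)) for j in range(c)]

def modified_cross_entropy(Q, h, y_bar):
    """Cross-entropy evaluated on q(X) against the complementary label (Eq. 4.6).
    Assumes q[y_bar] > 0; a small epsilon would be needed otherwise."""
    q = complementary_posterior(Q, softmax(h))
    return -math.log(q[y_bar])

# Hypothetical transition matrix: row i holds P(Ybar = . | Y = i), zero diagonal.
Q = [[0.0, 0.7, 0.3],
     [0.4, 0.0, 0.6],
     [0.5, 0.5, 0.0]]
h = [2.0, 0.5, -1.0]                 # hypothetical network logits for one input

g = softmax(h)                       # estimate of P(Y | X)
q = complementary_posterior(Q, g)    # estimate of P(Ybar | X)
loss = modified_cross_entropy(Q, h, y_bar=1)
prediction = max(range(len(g)), key=lambda i: g[i])  # f(X) = argmax_i g_i(X)
```

Only a matrix-vector product with $Q^\top$ appears in the forward pass, which matches the remark that $Q^{-1}$ is never computed; the prediction is read off from the intermediate output `g`, not from `q`.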

## 5 Identification of the Optimal Classifier

In this section, we aim to prove that the proposed loss modification method ensures the identifiability of the optimal classifier under a reasonable assumption:

###### Assumption 1.

By minimizing the expected risk $R(f)$, the optimal mapping $g^*$ satisfies $g_i^*(X)=P(Y=i\mid X)$, $\forall i\in[c]$.

Based on Assumption 1, we can prove that $\bar{f}^*=f^*$ by the following theorem:

###### Theorem 1.

Suppose that $Q$ is invertible and Assumption 1 is satisfied. Then the minimizer $\bar{f}^*$ of $\bar{R}(f)$ is also the minimizer $f^*$ of $R(f)$; that is, $\bar{f}^*=f^*$.

Please find the detailed proof in Appendix A. Given sufficient training data with complementary labels, $\bar{f}_n$ converges to $\bar{f}^*$, which is proved in the next section. According to Theorem 1, this implies that $\bar{f}_n$ also converges to the optimal classifier $f^*$.
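To see informally why invertibility matters, the following sketch (not the proof in Appendix A) maps a hypothetical posterior $P(Y\mid X=x)$ to $P(\bar{Y}\mid X=x)$ with $Q^\top$ and then recovers it by solving the linear system $Q^\top p = q$. The matrix `Q`, the vector `p_true`, and the helper `solve` are all assumptions made for illustration.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * p for a, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

# Hypothetical invertible transition matrix (determinant 0.27).
Q = [[0.0, 0.7, 0.3],
     [0.4, 0.0, 0.6],
     [0.5, 0.5, 0.0]]
p_true = [0.6, 0.3, 0.1]             # hypothetical P(Y | X = x)
c = len(p_true)

# Forward map of Eq. (4.4): q = Q^T p.
q = [sum(Q[i][j] * p_true[i] for i in range(c)) for j in range(c)]

# Inverse map: solve Q^T p = q to recover the true-label posterior.
Q_T = [[Q[i][j] for i in range(c)] for j in range(c)]
p_recovered = solve(Q_T, q)
```

When $Q$ is invertible the map $p\mapsto Q^\top p$ is one-to-one, so a perfectly learned $q$ pins down a unique $g$. With a singular $Q$, `solve` raises an error, which corresponds to the case where the theory no longer guarantees identifiability.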

#### Examples of Loss Functions.

The proof of Theorem 1 relies on Assumption 1. However, for many loss functions, Assumption 1 can be provably satisfied. Here, we take the cross-entropy loss as an example to demonstrate this fact. The cross-entropy loss is widely used in deep supervised learning and is defined as

$$\ell(f(X),Y)=-\sum_{i=1}^{c} 1(Y=i)\log(g_i(X)), \tag{5.1}$$

where $1(\cdot)$ is an indicator function; that is, if the input statement is true, it outputs 1; otherwise, 0. For the cross-entropy loss, we have the following lemma:

###### Lemma 1.

Suppose $\ell$ is the cross-entropy loss and $g(X)\in\Delta^{c-1}$, where $\Delta^{c-1}$ refers to a standard simplex in $\mathbb{R}^c$; that is, $\forall x\in\Delta^{c-1}$, $x_i\ge 0$, $\forall i\in[c]$, and $\sum_{i=1}^{c}x_i=1$. By minimizing the expected risk $R(f)$, we have $g_i^*(X)=P(Y=i\mid X)$, $\forall i\in[c]$.

Please see the detailed proof in Appendix B. In fact, other losses such as the square-error loss also satisfy Assumption 1, which readers can prove using a similar strategy. Combined with Theorem 1, this shows that by applying the proposed method to loss functions such as the cross-entropy loss, the optimal classifier can be found even when learning with biased complementary labels.

## 6 Convergence Analysis

In this section, we show an upper bound for the estimation error of our method. This upper bound illustrates a convergence rate for the classifier learned with complementary labels to the optimal one learned with true labels. Moreover, with the derived bound, we can clearly see that the estimation error could further benefit from the setting of biased complementary labels under mild conditions.

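Since $\bar{f}^*=f^*$, we have $|\bar{f}_n-f^*|=|\bar{f}_n-\bar{f}^*|$. We will upper bound the error $|\bar{f}_n-\bar{f}^*|$ via upper bounding $\bar{R}(\bar{f}_n)-\bar{R}(\bar{f}^*)$; that is, when $\bar{R}(\bar{f}_n)\to\bar{R}(\bar{f}^*)$, $|\bar{f}_n-\bar{f}^*|\to 0$. Specifically, it has been proven that

$$\begin{aligned}
\bar{R}(\bar{f}_n)-\bar{R}(\bar{f}^*) &= \bar{R}(\bar{f}_n)-\bar{R}_n(\bar{f}_n)+\bar{R}_n(\bar{f}_n)-\bar{R}_n(\bar{f}^*)+\bar{R}_n(\bar{f}^*)-\bar{R}(\bar{f}^*)\\
&\le \bar{R}(\bar{f}_n)-\bar{R}_n(\bar{f}_n)+\bar{R}_n(\bar{f}^*)-\bar{R}(\bar{f}^*)\\
&\le 2\sup_{f\in\mathcal{F}}|\bar{R}(f)-\bar{R}_n(f)|,
\end{aligned} \tag{6.1}$$

where the first inequality holds because $\bar{R}_n(\bar{f}_n)-\bar{R}_n(\bar{f}^*)\le 0$ and the error in the last line is called the generalization error.

Let $(X_1,\bar{Y}_1),\dots,(X_n,\bar{Y}_n)$ be independent variables. By employing the concentration inequality [4], the generalization error can be upper bounded by using the method of Rademacher complexity [2].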

###### Theorem 2 ([2]).

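Let the loss function be upper bounded by $M$. Then, for any $\delta>0$, with probability at least $1-\delta$, we have

$$\sup_{f\in\mathcal{F}}|\bar{R}(f)-\bar{R}_n(f)|\le 2\mathfrak{R}_n(\bar{\ell}\circ\mathcal{F})+M\sqrt{\frac{\log(1/\delta)}{2n}}, \tag{6.2}$$

where $\mathfrak{R}_n(\bar{\ell}\circ\mathcal{F})=\mathbb{E}\left[\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\sigma_i\bar{\ell}(f(X_i),\bar{Y}_i)\right]$ is the Rademacher complexity, and $\sigma_1,\dots,\sigma_n$ are Rademacher variables uniformly distributed from $\{-1,1\}$.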

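Before upper bounding $\mathfrak{R}_n(\bar{\ell}\circ\mathcal{F})$, we need to discuss the specific form of the employed loss function $\bar{\ell}$. By exploiting the well-defined binary loss functions, one-versus-all and pairwise-comparison loss functions [39] have been proposed for multi-class learning. In this section, we discuss the modified loss function defined by Eqs. (4.6) and (5.1), which can be rewritten as

$$\begin{aligned}
\bar{\ell}(f(X),\bar{Y}) &= -\sum_{i=1}^{c} 1(\bar{Y}=i)\log\big((Q^\top g)_i(X)\big)\\
&= -\sum_{i=1}^{c} 1(\bar{Y}=i)\log\left(\frac{\sum_{j=1}^{c}Q_{ji}\exp(h_j(X))}{\sum_{k=1}^{c}\exp(h_k(X))}\right),
\end{aligned} \tag{6.3}$$

where $(Q^\top g)_i$ denotes the $i$-th entry of $Q^\top g$; $h:\mathcal{X}\to\mathbb{R}^c$, $h_i(X)\in H$, $\forall i\in[c]$; and $g_i(X)=\frac{\exp(h_i(X))}{\sum_{k=1}^{c}\exp(h_k(X))}$.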

Usually, the convergence rates of generalization bounds of multi-class learning are at most $O(c^2/\sqrt{n})$ with respect to $c$ and $n$ [15, 24]. To reduce the dependence on $c$ of our derived convergence rate, we rewrite $\bar{R}(f)$ as follows:

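$$\begin{aligned}
\bar{R}(f) &= \int_X \sum_{i=1}^{c} P(\bar{Y}=i)\,P(X\mid\bar{Y}=i)\,\bar{\ell}(f(X),\bar{Y}=i)\,dX\\
&= \sum_{i=1}^{c} P(\bar{Y}=i)\int_X P(X\mid\bar{Y}=i)\,\bar{\ell}(f(X),\bar{Y}=i)\,dX\\
&= \sum_{i=1}^{c}\bar{\pi}_i\bar{R}_i(f),
\end{aligned} \tag{6.4}$$

where $\bar{\pi}_i=P(\bar{Y}=i)$ and $\bar{R}_i(f)=\int_X P(X\mid\bar{Y}=i)\,\bar{\ell}(f(X),\bar{Y}=i)\,dX$.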

Similar to Theorem 2, we have the following theorem.

###### Theorem 3.

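Suppose $\bar{\pi}_i=P(\bar{Y}=i)$ is given. Let the loss function be upper bounded by $M$. Then, for any $\delta>0$, with probability at least $1-c\delta$, we have

$$\begin{aligned}
\bar{R}(\bar{f}_n)-\bar{R}(\bar{f}^*) &\le 2\sup_{f\in\mathcal{F}}|\bar{R}(f)-\bar{R}_n(f)|\\
&\le 2\sum_{i=1}^{c}\bar{\pi}_i\sup_{f\in\mathcal{F}}|\bar{R}_i(f)-\bar{R}_{i,n_i}(f)|\\
&\le 2\sum_{i=1}^{c}\bar{\pi}_i\left(2\mathfrak{R}_{n_i}(\bar{\ell}\circ\mathcal{F})+M\sqrt{\frac{\log(1/\delta)}{2n_i}}\right)\\
&= \sum_{i=1}^{c}\left(4\bar{\pi}_i\mathfrak{R}_{n_i}(\bar{\ell}\circ\mathcal{F})+2\bar{\pi}_i M\sqrt{\frac{\log(1/\delta)}{2n_i}}\right),
\end{aligned} \tag{6.5}$$

where $\mathfrak{R}_{n_i}(\bar{\ell}\circ\mathcal{F})$ is the Rademacher complexity computed over the $n_i$ examples whose complementary label is $i$; $\bar{R}_{i,n_i}(f)$ is the empirical counterpart of $\bar{R}_i(f)$; and $n_i$, $i\in[c]$, represents the number of observations $X$ whose complementary labels are $\bar{Y}=i$.

Because $\bar{\ell}\circ\mathcal{F}$ is actually defined with respect to $h$ rather than $f$, we would like to bound the error by the Rademacher complexity of $H$. We observe that the relationship between $\mathfrak{R}_{n_i}(\bar{\ell}\circ\mathcal{F})$ and $\mathfrak{R}_{n_i}(H)$ is: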

###### Lemma 2.

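Let $\bar{\ell}(f(X),\bar{Y}=i)=-\log\left(\frac{\sum_{j=1}^{c}Q_{ji}\exp(h_j(X))}{\sum_{k=1}^{c}\exp(h_k(X))}\right)$ and suppose that $h_i(X)\in H$, $\forall i\in[c]$. Then we have $\mathfrak{R}_{n_i}(\bar{\ell}\circ\mathcal{F})\le c\,\mathfrak{R}_{n_i}(H)$.

The detailed proof can be found in Appendix C. Combining Theorem 3 and Lemma 2, we have the final result: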

###### Corollary 1.

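Suppose $\bar{\pi}_i=P(\bar{Y}=i)$ is given. Let the loss function be upper bounded by $M$. Then, for any $\delta>0$, with probability at least $1-c\delta$, we have

$$\bar{R}(\bar{f}_n)-\bar{R}(\bar{f}^*)\le\sum_{i=1}^{c}\left(4c\,\bar{\pi}_i\mathfrak{R}_{n_i}(H)+2\bar{\pi}_i M\sqrt{\frac{\log(1/\delta)}{2n_i}}\right). \tag{6.6}$$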

In current state-of-the-art methods [15], the convergence rate of the estimation error is of order $O(c^2/\sqrt{n})$ with respect to $c$ and $n$, while our derived bound is of order $O\big(c\sum_{i=1}^{c}\bar{\pi}_i/\sqrt{n_i}\big)$. Since our error bound depends on $n_i$, the bound would be loose if $n_i$ (or $\bar{\pi}_i$) is small. However, if $n_i$ is balanced and is about $n/c$, our convergence rate is of order $O(c\sqrt{c}/\sqrt{n})$, which is smaller than the error bounds provided by previous methods if $c$ is very large.
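As a rough check on how the bound in Corollary 1 scales in the balanced case, the worked step below assumes (this is not stated in the text) that $\mathfrak{R}_{n_i}(H)\le C/\sqrt{n_i}$ for some constant $C$ that depends on the hypothesis class, and that $n_i=n/c$ for every $i$. It then uses $\sum_{i}\bar{\pi}_i=1$:

```latex
\begin{aligned}
\sum_{i=1}^{c}\left(4c\,\bar{\pi}_i\,\mathfrak{R}_{n_i}(H)
  + 2\bar{\pi}_i M\sqrt{\frac{\log(1/\delta)}{2n_i}}\right)
&\le \sum_{i=1}^{c}\bar{\pi}_i\left(4cC\sqrt{\frac{c}{n}}
  + 2M\sqrt{\frac{c\,\log(1/\delta)}{2n}}\right)\\
&= \frac{4C\,c\sqrt{c}}{\sqrt{n}}
  + 2M\sqrt{\frac{c\,\log(1/\delta)}{2n}} .
\end{aligned}
```

Under these assumptions the first term dominates in $c$, which gives a dependence of order $c\sqrt{c}/\sqrt{n}$; the second term grows only as $\sqrt{c}/\sqrt{n}$.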

Remark. Theorem 3 and Corollary 1 aim to provide the proof of uniform convergence for general losses and show how the convergence rate can benefit from the biased setting under mild conditions. Thus, assuming the loss is upper-bounded is reasonable for many loss functions such as the square-error loss. Readers who would like to derive a specific error bound for the cross-entropy loss can employ the strategies in [34]. If we assume that the transition matrix is invertible, we can derive results similar to Lemmas 1-3 of [34] for the modified loss function, which can then be used to derive a generalization error bound similar to Corollary 1.

## 7 Estimating Q

In the aforementioned method, the transition matrix $Q$ is assumed to be known, which is not true in practice. Here, we thus provide an efficient method to estimate $Q$.

When learning with complementary labels, we completely lose the information of true labels. Without any auxiliary information, it is impossible to estimate the transition matrix, which is associated with the class priors of true labels. On the other hand, although it is costly to annotate a very large-scale dataset, a small set of easily distinguishable observations is assumed to be available in practice. This assumption is also widely used in estimating transition probabilities in the label noise problem [31] and class priors in semi-supervised learning [37]. Therefore, in order to estimate $Q$, we manually assign true labels to 5 or 10 observations in each class. Since these selected observations are often easy to classify, we further assume that they satisfy the anchor set condition [21]:

###### Assumption 2 (Anchor Set Condition).

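For each class $y$, there exists an anchor set $\mathcal{S}_{x|y}\subset\mathcal{X}$ such that $P(Y=y\mid X=x)=1$ and $P(Y=y'\mid X=x)=0$, $\forall y'\in\mathcal{Y}\setminus\{y\}$, $x\in\mathcal{S}_{x|y}$.

Here, $\mathcal{S}_{x|y}$ is a subset of features in class $y$. Given several observations in $\mathcal{S}_{x|y}$, $y\in[c]$, we are ready to estimate the transition matrix $Q$. According to Eq. (4.4),

$$P(\bar{Y}=\bar{y}\mid X)=\sum_{y'\neq\bar{y}}P(\bar{Y}=\bar{y}\mid Y=y')\,P(Y=y'\mid X). \tag{7.1}$$

Suppose $x\in\mathcal{S}_{x|y}$; then $P(Y=y\mid X=x)=1$ and $P(Y=y'\mid X=x)=0$, $\forall y'\neq y$. We have

$$P(\bar{Y}=\bar{y}\mid X=x)=P(\bar{Y}=\bar{y}\mid Y=y). \tag{7.2}$$

That is, the probabilities in $Q$ can be obtained via $P(\bar{Y}\mid X)$ given the observations in the anchor set of each class. Thus, we need only to estimate this conditional probability, which has been proven to be achievable in Lemma 1. In this paper, with the training sample $\{(x_i,\bar{y}_i)\}_{1\le i\le n}$, we estimate $P(\bar{Y}\mid X)$ by training a deep neural network with the softmax function and cross-entropy loss. After obtaining these conditional probabilities, each probability $P(\bar{Y}=\bar{y}\mid Y=y)$ in the transition matrix can be estimated by averaging the conditional probabilities on the anchor data in class $y$.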
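The averaging step can be sketched as follows. This is a minimal illustration under stated assumptions: the network that predicts $P(\bar{Y}\mid X)$ is replaced by hard-coded, hypothetical softmax outputs for two anchor points per class, and the function name `estimate_transition_matrix` is not from the paper.

```python
def estimate_transition_matrix(anchor_posteriors):
    """Estimate Q from predictions on anchor points.

    anchor_posteriors[y] is a list of vectors, each an estimate of
    P(Ybar | X = x) for one anchor point x of class y. By Eq. (7.2),
    row y of Q is estimated by averaging these vectors.
    """
    c = len(anchor_posteriors)
    Q_hat = []
    for y in range(c):
        preds = anchor_posteriors[y]
        row = [sum(p[j] for p in preds) / len(preds) for j in range(c)]
        Q_hat.append(row)
    return Q_hat

# Hypothetical softmax outputs of a network trained on complementary labels,
# evaluated on two anchor points for each of three classes.
anchor_posteriors = [
    [[0.02, 0.68, 0.30], [0.00, 0.72, 0.28]],   # anchors of class 0
    [[0.41, 0.01, 0.58], [0.39, 0.03, 0.58]],   # anchors of class 1
    [[0.52, 0.46, 0.02], [0.48, 0.52, 0.00]],   # anchors of class 2
]
Q_hat = estimate_transition_matrix(anchor_posteriors)
```

Each row of `Q_hat` sums to one because every averaged vector does. The diagonal entries are small but not forced to zero here, since the text does not describe any post-processing of the estimate.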

## 8 Experiments

We evaluate our algorithm on several benchmark datasets including the UCI datasets, USPS, MNIST [18], CIFAR10, CIFAR100 [16], and Tiny ImageNet (available at http://cs231n.stanford.edu/tiny-imagenet-200.zip). All models in our experiments are neural networks. For USPS and UCI datasets, we employ a one-hidden-layer neural network ($d$-3-$c$) [15]. For MNIST, LeNet-5 [19] is deployed, and ResNet [14] is exploited for the other datasets. All models are implemented in PyTorch.

### 8.1 UCI and USPS

We first evaluate our method on USPS and six UCI datasets: WAVEFORM1, WAVEFORM2, SATIMAGE, PENDIGITS, DRIVE, and LETTER, downloaded from the UCI machine learning repository. We apply the same strategies for annotating complementary labels, standardization, validation, and optimization as those in [15]. The learning rate and weight decay are each chosen from a set of candidate values, and the batch size is 100.

For fair comparison in these experiments, we assume the transition probabilities are identical and known a priori. Thus, no examples with true labels are required here. All results are shown in Table 1. Our loss modification (“LM”) method is compared to a partial label (PL) method [6], a multi-label (ML) method [27], and “PC/S” (the pairwise-comparison formulation with sigmoid loss), which achieved the best performance in [15]. We can see that “PC/S” achieves very good performance. The relatively higher performance of our method may be because it provides an unbiased estimator for the risk minimizer.

### 8.2 MNIST

MNIST is a handwritten digit dataset including 60,000 training images and 10,000 test images from 10 classes. To evaluate the effectiveness of our method, we consider the following three settings: (1) for each image in class $y$, the complementary label is uniformly selected from $\mathcal{Y}\setminus\{y\}$ (“uniform”); (2) the complementary label is non-uniformly selected, but each label in $\mathcal{Y}\setminus\{y\}$ has non-zero probability to be selected (“without0”); (3) the complementary label is non-uniformly selected from a small subset of $\mathcal{Y}\setminus\{y\}$ (“with0”).

To generate complementary labels, we first give the probability of each complementary label to be selected. In the “uniform” setting, $P(\bar{Y}=j\mid Y=i)=\frac{1}{9}$, $\forall i\neq j$. In the “without0” setting, for each class $y$, we first randomly split $\mathcal{Y}\setminus\{y\}$ into three subsets, each containing three elements. Then, the complementary labels in each of these three subsets are assigned a subset-specific probability. In the “with0” setting, for each class $y$, we first randomly select three labels in $\mathcal{Y}\setminus\{y\}$, and then randomly assign them three probabilities whose summation is 1. After $Q$ is given, we assign a complementary label to each image based on these probabilities. Finally, we randomly set aside 10% of the training data as a validation set.
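The label-generation procedure for the “uniform” and “with0” settings can be sketched with the Python standard library. This is an assumption-laden illustration rather than the authors' script: the function names, the random seed, and the way the three probabilities are drawn (normalized uniform random weights) are all hypothetical choices, and the true labels are random stand-ins for MNIST labels.

```python
import random

def uniform_Q(c):
    """'uniform' setting: every wrong label has probability 1 / (c - 1)."""
    return [[0.0 if i == j else 1.0 / (c - 1) for j in range(c)]
            for i in range(c)]

def with0_Q(c, k, rng):
    """'with0' setting: for each class, k other labels share all the mass.
    The weights are drawn at random and normalized (an assumed scheme)."""
    Q = [[0.0] * c for _ in range(c)]
    for y in range(c):
        chosen = rng.sample([j for j in range(c) if j != y], k)
        weights = [rng.random() + 1e-3 for _ in range(k)]  # offset keeps weights positive
        total = sum(weights)
        for j, w in zip(chosen, weights):
            Q[y][j] = w / total
    return Q

def sample_complementary_labels(Q, labels, rng):
    """Draw one complementary label per example from row Q[y]."""
    c = len(Q)
    return [rng.choices(range(c), weights=Q[y], k=1)[0] for y in labels]

rng = random.Random(0)
Q_uniform = uniform_Q(10)
Q_with0 = with0_Q(10, 3, rng)
true_labels = [rng.randrange(10) for _ in range(1000)]   # stand-in for MNIST labels
comp_labels = sample_complementary_labels(Q_with0, true_labels, rng)
```

Because the diagonal of each matrix is zero, a sampled complementary label never equals the true label. The sketch does not check that every class appears as a complementary label for at least one other class, so a generated `Q_with0` may contain an all-zero column.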

In all experiments, the learning rate and weight decay are fixed, the batch size is 128, the maximum number of iterations is 60,000, and stochastic gradient descent (SGD) with momentum [30] is applied to optimize deep models. Note that, as shown in [15] and in the previous experiments, [15] and our method have surpassed baseline methods such as PL and ML. In the following experiments, we therefore do not compare with these baselines again.

The results are shown in Table 2. The means and standard deviations of classification accuracy over five trials are reported. Note that the digit data features are not too entangled, making it easier to learn a good classifier. However, we can still see the differences in the performance caused by the change of settings for annotating complementary labels. According to the results shown in Table 2, “PC/S” [15] works relatively well under the uniform assumption but the accuracy deteriorates in other settings. Our method performs well in all settings. It can also be seen that due to the accurate estimates of these probabilities, “LM/E” with the estimated transition matrix is competitive with “LM/T” which exploits the true one.

### 8.3 CIFAR10

We evaluate our method on the CIFAR10 dataset under the aforementioned three settings. CIFAR10 has 10 classes of tiny images, comprising 50,000 training images and 10,000 test images. We leave out 10% of the training data as a validation set. In these experiments, ResNet-18 [14] is deployed. We start with an initial learning rate of 0.01 and divide it by 10 after 40 and 80 epochs. A fixed weight decay is used, and other settings are the same as those for MNIST. Early stopping is applied to avoid overfitting.

We apply the same process as for MNIST to generate complementary labels. The results in Table 3 verify the effectiveness of our method. “PC/S” achieves promising performance when complementary labels are uniformly selected, and our method outperforms “PC/S” in other settings. In the “uniform” setting, $P(\bar{Y}\mid X)$ is not well estimated. As a result, the transition matrix is also poorly estimated. “LM/E” thus performs relatively badly.

The results of our method under the “uniform” and “without0” settings (shown in Table 3) are usually worse than those under “with0”. For a given number of training images, the empirical results show that in the “uniform” and “without0” settings, the proposed method converges at a slower rate than in the “with0” setting. This phenomenon may be caused by the fact that the uncertainty involved in the transition procedure in the “with0” setting is less than that in the “uniform” and “without0” settings, making it easier to learn in the former setting. This phenomenon also indicates that, for images in each class, annotators need not assign all possible complementary labels, but can provide the labels following one criterion: each label in the label space should be assigned as a complementary label for images in at least one class. In this way, we can reduce the number of training examples needed to achieve high performance.

### 8.4 CIFAR100

CIFAR100 also presents a collection of tiny images including 50,000 training images and 10,000 test images. However, CIFAR100 has 100 classes, each with only 500 training images. Because the label space is very large and the number of training data is limited, in both “uniform” and “without0” settings, few training data are assigned $\bar{Y}=j$ for images in each class $i$, $\forall i\neq j$. Neither the proposed method nor “PC/S” can converge. Here, we only conduct the experiments under the “with0” setting. To generate complementary labels, for each class $y$, we randomly select 5 labels from $\mathcal{Y}\setminus\{y\}$ and assign them non-zero probabilities. The other labels have no chance of being selected.

In these experiments, ResNet-34 is deployed. Other experimental settings are the same as those for CIFAR10. Results are shown in the second column of Table 4. “PC/S” can hardly obtain a good classifier, but our method achieves high accuracies that are comparable to learning with true labels.

### 8.5 Tiny ImageNet

Tiny ImageNet represents 200 classes with 500 images in each class from the ImageNet dataset [28]. Images are cropped to $64\times 64$. Detailed information is lost during the down-sampling process, making it more difficult to learn. ResNet-18 for ImageNet [14] is deployed. Instead of using the original first convolutional layer with a $7\times 7$ kernel and the subsequent max pooling layer, we replace them with a single convolutional layer with a different kernel size, stride 1, and no padding. The initial learning rate is 0.1, divided by 10 after 20,000 and 40,000 iterations. The batch size is 256, and weight decay is applied. Other settings are the same as for CIFAR100. The experimental results are shown in the third column of Table 4. Again, we only test our method under the “with0” setting. “PC/S” cannot converge here, but our method still achieves promising performance.

### 8.6 Discussions

In this section, we aim to verify the following facts about the proposed method: (1) in practice, our proposed method may not require $Q$ to be invertible in some cases; (2) using randomly generated transition matrices and manually designed transition matrices achieves comparable performance; (3) the convergence rate of the proposed method can benefit more from biased complementary labels than from uniform complementary labels.

Non-invertible $Q$. In practice, our proposed method does not require $Q$ to be invertible. To verify this, we test our method on the MNIST dataset under the “with0” setting in which the complementary labels are generated according to a non-invertible transition matrix. This transition matrix is randomly generated and shown as follows:

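$$\begin{bmatrix}
0 & 0.3042 & 0 & 0 & 0.5197 & 0 & 0 & 0.1762 & 0 & 0\\
0 & 0 & 0 & 0.6141 & 0 & 0.0478 & 0 & 0.3381 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0.0213 & 0 & 0.7834 & 0.1953 & 0\\
0 & 0.4489 & 0 & 0 & 0.1034 & 0 & 0 & 0.4477 & 0 & 0\\
0 & 0.2596 & 0 & 0 & 0 & 0.3836 & 0 & 0 & 0.3568 & 0\\
0 & 0.4416 & 0 & 0.5412 & 0 & 0 & 0 & 0 & 0.0172 & 0\\
0 & 0.3049 & 0 & 0 & 0 & 0.5828 & 0 & 0.1123 & 0 & 0\\
0 & 0.1499 & 0 & 0 & 0 & 0.3134 & 0 & 0 & 0 & 0.5367\\
0.1061 & 0.4718 & 0 & 0 & 0 & 0 & 0 & 0.4220 & 0 & 0\\
0 & 0.4261 & 0.0579 & 0 & 0 & 0 & 0.5160 & 0 & 0 & 0
\end{bmatrix}. \tag{8.1}$$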

The results are shown in Table 5. Compared with the results for the invertible transition matrix, the performance does not decrease, which indicates that our proposed method can be applied to very general settings of complementary labels.
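The structural properties of the matrix in Eq. (8.1) can be checked with a short script. The sketch below assumes a row-major reading of Eq. (8.1) in which row $i$ holds $P(\bar{Y}=j \mid Y=i)$; the helper `rank` is a hypothetical pure-Python routine written for this illustration, not part of the proposed method.

```python
# Transition matrix transcribed from Eq. (8.1); row i is P(complementary = j | true = i).
Q = [
    [0, 0.3042, 0, 0, 0.5197, 0, 0, 0.1762, 0, 0],
    [0, 0, 0, 0.6141, 0, 0.0478, 0, 0.3381, 0, 0],
    [0, 0, 0, 0, 0, 0.0213, 0, 0.7834, 0.1953, 0],
    [0, 0.4489, 0, 0, 0.1034, 0, 0, 0.4477, 0, 0],
    [0, 0.2596, 0, 0, 0, 0.3836, 0, 0, 0.3568, 0],
    [0, 0.4416, 0, 0.5412, 0, 0, 0, 0, 0.0172, 0],
    [0, 0.3049, 0, 0, 0, 0.5828, 0, 0.1123, 0, 0],
    [0, 0.1499, 0, 0, 0, 0.3134, 0, 0, 0, 0.5367],
    [0.1061, 0.4718, 0, 0, 0, 0, 0, 0.4220, 0, 0],
    [0, 0.4261, 0.0579, 0, 0, 0, 0.5160, 0, 0, 0],
]


def rank(matrix, tol=1e-9):
    """Matrix rank via Gaussian elimination with partial pivoting."""
    rows = [[float(x) for x in r] for r in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    r = 0  # index of the next pivot row
    for col in range(n_cols):
        if r == n_rows:
            break
        pivot = max(range(r, n_rows), key=lambda i: abs(rows[i][col]))
        if abs(rows[pivot][col]) <= tol:
            continue  # no usable pivot in this column
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, n_rows):
            factor = rows[i][col] / rows[r][col]
            if factor:
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


if __name__ == "__main__":
    c = len(Q)
    print("rows sum to ~1:", all(abs(sum(row) - 1) < 1e-3 for row in Q))
    print("zero diagonal:", all(Q[i][i] == 0 for i in range(c)))
    print("every class used:", all(any(Q[i][j] > 0 for i in range(c)) for j in range(c)))
    print("rank deficient:", rank(Q) < c)
```

Under this reading, the third and seventh columns are nonzero only in the last row, so they are linearly dependent, which is one way to see that the matrix is singular.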

Manually Designed $Q$. In this experiment, rather than randomly selecting complementary labels for each class, we first determine the transition matrix manually. The manually designed transition matrices in the “without0” and “with0” settings and their corresponding estimates are shown in Figures 3 and 4, respectively. The classification accuracies are reported in Tables 6 and 7, respectively. We can see that the performance with randomly selected and manually designed transition matrices is similar.

The Benefit of Biased Complementary Labels.

Here, we aim to show that the convergence rate of the proposed method can benefit from the biased setting. These experiments are conducted on the CIFAR10 dataset. We do not employ the MNIST dataset because it is easy to learn a good classifier from this data, and the proposed method achieves very high performance in both the uniform and biased settings. We randomly select a fraction of the data from CIFAR10, increasing the training sample size from 5,000 to 50,000 in steps of 5,000. The classification accuracies for these training sets in both the “uniform” and “with0” settings are reported. As shown in Figure 5, the proposed method achieves high performance when sample sizes are small and also converges quickly.

Our experimental results provide useful guidance for assigning complementary labels: in practice, for all instances in a certain class, the complementary labels can be selected from a small subset of the label space, and the remaining class labels have no chance of being selected as complementary labels for this class. For example, in the left subfigure of Figure 4, for all digits “0”, we only assign the labels “1”, “5”, and “7” as complementary labels. In this way, we can learn an effective classifier with a relatively small training dataset. The only requirement for selecting the biased complementary labels is that every class label in the label space should be assigned as a complementary label for at least one class. For example, as seen in the left subfigure of Figure 4, digit “6” is assigned as a complementary label only for the digit “1”. This is sufficient for learning a good classifier. However, if “6” is never assigned as a complementary label, we have no information for this class, which makes it impossible for the algorithm to learn an effective classifier.
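The coverage requirement above amounts to checking that no column of the transition matrix is all-zero. A minimal sketch follows; the function names and the 4-class matrices are hypothetical and chosen only for illustration.

```python
import random


def uncovered_classes(Q):
    """Return classes never used as a complementary label (all-zero columns of Q)."""
    c = len(Q)
    return [j for j in range(c) if all(Q[i][j] == 0 for i in range(c))]


def sample_complementary_label(true_label, Q, rng):
    """Draw one complementary label for class `true_label` from row `true_label` of Q."""
    return rng.choices(range(len(Q)), weights=Q[true_label], k=1)[0]


# Hypothetical "with0"-style matrix: each class draws from a small subset of labels.
Q_ok = [
    [0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.7, 0.0, 0.0, 0.3],
    [0.2, 0.8, 0.0, 0.0],
]
# Hypothetical matrix in which class 3 is never offered as a complementary label.
Q_bad = [
    [0.0, 0.6, 0.4, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.2, 0.8, 0.0, 0.0],
]

if __name__ == "__main__":
    rng = random.Random(0)
    drawn = {sample_complementary_label(0, Q_ok, rng) for _ in range(1000)}
    print("labels drawn for class 0:", sorted(drawn))
    print("uncovered in Q_ok:", uncovered_classes(Q_ok))
    print("uncovered in Q_bad:", uncovered_classes(Q_bad))
```

A matrix like `Q_bad` would violate the requirement, since no complementary label ever carries information about class 3.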

## 9 Conclusion

We address the problem of learning with biased complementary labels. Specifically, we consider the setting in which the transition probabilities vary and most of them are zero. We devise an effective method to estimate the transition matrix given a small amount of data in the anchor set. Based on the transition matrix, we propose to modify traditional loss functions such that learning with complementary labels can theoretically converge to the optimal classifier learned from examples with true labels. Comprehensive experiments on a wide range of datasets verify that the proposed method is superior to the current state-of-the-art methods.

Acknowledgement. This work was supported by Australian Research Council Projects FL-170100117, DP-180103424, and LP-150100671. This work was partially supported by SAP SE and a research grant from Pfizer titled “Developing Statistical Method to Jointly Model Genotype and High Dimensional Imaging Endophenotype”. We are also grateful for the computational resources provided by Pittsburgh Super Computing grant number TG-ASC170024.

## Appendix A Proof of Theorem 1

###### Proof.

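According to Assumption 1 and based on the modified loss function, when learning from examples with complementary labels, we also have

$$q^*_i(X) = P(\bar{Y}=i \mid X), \quad \forall i \in [c].$$

Let $v(X) = [P(Y=1 \mid X), \cdots, P(Y=c \mid X)]^\top$ and $\bar{v}(X) = [P(\bar{Y}=1 \mid X), \cdots, P(\bar{Y}=c \mid X)]^\top$. According to Eq. (11), we have

$$\bar{v}(X) = Q^\top v(X), \tag{A.1}$$

which further ensures

$$q^*(X) = Q^\top v(X) = Q^\top g^*(X). \tag{A.2}$$

If the transition matrix $Q$ is invertible, then we find the optimal $g^*(X) = v(X)$, which ensures $g^*_i(X) = P(Y=i \mid X)$ for all $i \in [c]$. The proof is completed. ∎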
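As a small worked illustration of Eqs. (A.1) and (A.2), consider a hypothetical three-class problem with uniform transition probabilities. The numbers are assumed for illustration only and do not come from the experiments.

$$
Q = \begin{bmatrix} 0 & 0.5 & 0.5 \\ 0.5 & 0 & 0.5 \\ 0.5 & 0.5 & 0 \end{bmatrix}, \quad
v(X) = \begin{bmatrix} 0.7 \\ 0.2 \\ 0.1 \end{bmatrix}
\;\Longrightarrow\;
\bar{v}(X) = Q^\top v(X) = \begin{bmatrix} 0.15 \\ 0.40 \\ 0.45 \end{bmatrix}.
$$

Here $Q$ is invertible with $(Q^\top)^{-1} = \mathbf{1}\mathbf{1}^\top - 2I$, and because the entries of $\bar{v}(X)$ sum to one,

$$
(Q^\top)^{-1}\bar{v}(X) = \mathbf{1} - 2\,\bar{v}(X) = \begin{bmatrix} 0.7 \\ 0.2 \\ 0.1 \end{bmatrix} = v(X),
$$

so the class posterior is recovered exactly from the complementary-label posterior. The most probable true class (class 1) is also the least probable complementary label.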

## Appendix B Proof of Lemma 1

###### Proof.

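According to the definition of the cross-entropy loss, the loss is nonnegative. Then minimizing the expected risk can be obtained by minimizing the conditional risk for every $x \in \mathcal{X}$ [22], and the conditional risk is defined as

$$\psi(g) = -\sum_{i=1}^{c} P(Y=i \mid x)\log\big(g_i(x)\big), \tag{B.1}$$

where $g$ is short for $g(x)$.

The problem turns to minimizing $\psi(g)$ subject to $g_i(x) \ge 0$, $\sum_{i=1}^{c} g_i(x) = 1$. Then, by using the Lagrange multiplier method [3], we have

$$L = \psi(g) - \lambda\Big(\sum_{i=1}^{c} g_i(x) - 1\Big).$$

The derivative of $L$ with respect to $g$ is

$$\frac{\partial L}{\partial g} = \Big[-\frac{P(Y=1 \mid x)}{g_1(x)} - \lambda, \cdots, -\frac{P(Y=c \mid x)}{g_c(x)} - \lambda\Big]^\top,$$

which is equal to $\mathbf{0}$ when substituting $g$ with the minimizer $g^*$, i.e., $\partial L / \partial g \,\big|_{g=g^*} = \mathbf{0}$. Then

$$g^*_i(x) = -\frac{1}{\lambda}P(Y=i \mid x), \quad \forall i \in [c] \text{ and } \forall x \in \mathcal{X}.$$

Because $\sum_{i=1}^{c} g^*_i(x) = 1$ and $\sum_{i=1}^{c} P(Y=i \mid x) = 1$, we have

$$\sum_{i=1}^{c} g^*_i(x) = -\frac{1}{\lambda}\sum_{i=1}^{c} P(Y=i \mid x) = 1.$$

Thus we easily get $\lambda = -1$ and $g^*_i(x) = P(Y=i \mid x)$ for all $i \in [c]$ and all $x \in \mathcal{X}$. The proof is completed. ∎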
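The conclusion of this proof can be checked numerically: for a fixed posterior, the conditional risk of Eq. (B.1) should be no larger at $g = P(Y \mid x)$ than at any other point of the simplex. The sketch below uses a hypothetical three-class posterior and randomly drawn candidates; the function names are illustrative.

```python
import math
import random


def conditional_risk(p, g):
    """psi(g) = -sum_i p_i * log(g_i) for a posterior p and a candidate output g."""
    return -sum(pi * math.log(gi) for pi, gi in zip(p, g) if pi > 0)


def random_simplex_point(c, rng):
    """Draw a strictly positive probability vector of length c."""
    raw = [rng.random() + 1e-6 for _ in range(c)]
    total = sum(raw)
    return [r / total for r in raw]


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.7, 0.2, 0.1]  # hypothetical posterior P(Y=i | x)
    best = conditional_risk(p, p)
    gaps = [conditional_risk(p, random_simplex_point(3, rng)) - best
            for _ in range(1000)]
    print("risk at g = p:", best)
    print("smallest gap over random candidates:", min(gaps))
```

The gap is the Kullback-Leibler divergence between the posterior and the candidate, which is why it is never negative.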

## Appendix C Proof of Lemma 2

In order to prove Lemma 2, we need the loss function $\bar{\ell}$ to be Lipschitz continuous with respect to $h(X)$, which can be proved by the following lemma.

###### Lemma 1.

Suppose that no column of the transition matrix $Q$ is all-zero. Then the loss function $\bar{\ell}(f(X), \bar{Y})$ is 1-Lipschitz with respect to $h(X)$.

###### Proof.

Recall that

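$$\bar{\ell}(f(X), \bar{Y}=i) = -\log\left(\frac{\sum_{k=1}^{c} Q_{ki}\exp(h_k(X))}{\sum_{k=1}^{c}\exp(h_k(X))}\right). \tag{C.1}$$

Taking the derivative of $\bar{\ell}(f(X), \bar{Y}=i)$ with respect to $h_j(X)$, we have

$$\frac{\partial \bar{\ell}(f(X), \bar{Y}=i)}{\partial h_j(X)} = -\frac{Q_{ji}\exp(h_j(X))}{\sum_{k=1}^{c} Q_{ki}\exp(h_k(X))} + \frac{\exp(h_j(X))}{\sum_{k=1}^{c}\exp(h_k(X))}. \tag{C.2}$$

According to Eq. (C.2), both fractions lie in $[0, 1]$, so it is easy to conclude that $-1 \le \partial \bar{\ell}(f(X), \bar{Y}=i) / \partial h_j(X) \le 1$, which also indicates that the loss function is 1-Lipschitz with respect to $h(X)$. The proof is completed. ∎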
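The bound on the partial derivatives can be checked numerically. The sketch below implements Eq. (C.1) and Eq. (C.2) in plain Python for a hypothetical three-class transition matrix with no all-zero column, then compares the analytic gradient against central finite differences; all names and values are illustrative.

```python
import math
import random


def comp_loss(h, Q, i):
    """Eq. (C.1): -log( sum_k Q[k][i] e^{h_k} / sum_k e^{h_k} )."""
    m = max(h)  # shift for numerical stability; the ratio is unchanged
    e = [math.exp(v - m) for v in h]
    num = sum(Q[k][i] * e[k] for k in range(len(h)))
    return -math.log(num / sum(e))


def comp_loss_grad(h, Q, i):
    """Eq. (C.2): partial derivative of the loss with respect to each h_j."""
    m = max(h)
    e = [math.exp(v - m) for v in h]
    num = sum(Q[k][i] * e[k] for k in range(len(h)))
    den = sum(e)
    return [-Q[j][i] * e[j] / num + e[j] / den for j in range(len(h))]


# Hypothetical transition matrix: rows sum to one, zero diagonal, no all-zero column.
Q_toy = [
    [0.0, 0.5, 0.5],
    [0.3, 0.0, 0.7],
    [1.0, 0.0, 0.0],
]

if __name__ == "__main__":
    rng = random.Random(0)
    eps = 1e-6
    worst = 0.0
    for _ in range(200):
        h = [rng.uniform(-5, 5) for _ in range(3)]
        for i in range(3):
            grad = comp_loss_grad(h, Q_toy, i)
            worst = max(worst, max(abs(g) for g in grad))
            for j in range(3):
                hp = h[:j] + [h[j] + eps] + h[j + 1:]
                hm = h[:j] + [h[j] - eps] + h[j + 1:]
                fd = (comp_loss(hp, Q_toy, i) - comp_loss(hm, Q_toy, i)) / (2 * eps)
                assert abs(fd - grad[j]) < 1e-5
    print("largest |partial derivative| observed:", worst)
```

The requirement that no column of the matrix is all-zero keeps the numerator inside the logarithm strictly positive for every complementary label.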

Now we are ready to prove Lemma 2.

###### Proof.

Since the softmax function preserves the rank of its inputs, $f(X) = \arg\max_{i \in [c]} g_i(X) = \arg\max_{i \in [c]} h_i(X)$. We thus have

 Rni(¯ℓ∘F) (C.3) =E[supf∈F1nini∑j=1σj¯ℓ(f(Xj),¯Yj=i)] =E[supargmax{h1(X),⋯,hc(X)}1nini∑j=1σj¯ℓ(f(Xj),¯Yj=i)] =E[supmax{h1(X),⋯,hc(X)}1nini∑j=1σj¯ℓ(f(Xj),¯Yj=i)] ≤E[c∑k=1suphk(X)1nini∑j=1σj¯ℓ(f(Xj),¯Yj=i)] =E[c∑k=1suphk(X)1nini∑j=1σjlog(∑cm=1Qmiexp(hm(X))∑cm=