1 Introduction
In supervised classification, a classifier is trained on labeled data points, which are usually collected through human annotation.
While collecting individually labeled data points is the traditional way to apply supervised classification, pairwise comparison is often more appealing for human decision making [Fürnkranz and Hüllermeier2010]: annotators are asked to compare two instances and report the relative relationship between them,
e.g., which instance has a stronger stimulus, or whether the two instances belong to the same category.
This is partly because (1) decision makers tend to be subjective when directly choosing a single hypothesis (Thurstone [Thurstone1927] studied the relationship between relative comparisons and a single hypothesis on stimuli, known as the law of comparative judgment),
and (2) decision makers are often biased when asked to state an opinion directly. The latter tendency is known as social desirability bias [Fisher1993]: respondents are unconsciously led toward a socially desirable opinion when asked to reveal their views directly, especially on sensitive matters such as criminal records.
Thus, pairwise comparison has been exploited in a number of applications such as pairwise ranking [Fürnkranz and Hüllermeier2010, Jamieson and Nowak2011], Bayesian optimization [Eric et al.2008], derivative-free optimization [Jamieson et al.2012], and analytic hierarchy processes [Saaty1990].
One method that incorporates pairwise comparison into the identification of latent categories of data is semi-supervised clustering [Basu et al.2008],
which utilizes pairwise supervision indicating whether two instances belong to the same cluster (known as must-link and cannot-link constraints) to guide clustering as decision makers desire.
Recently, [Bao et al.2018] and [Shimada et al.2019]
gave an empirical risk minimization (ERM) formulation for training an inductive classifier from pairwise comparison, which successfully connects supervised learning with pairwise comparison and avoids the dataset-dependent assumptions used in semi-supervised clustering.
However, one important gap has not been addressed in classification with pairwise supervision: real-world data collection is bound to be noisy. Several previous works address noise in ordinary supervision with class-labeled data [Natarajan et al.2013, Patrini et al.2017, Jiang et al.2017, Han et al.2018]. For pairwise supervision, there are two unique types of error. The first results from pairing corruption: for some pairs of instances, it is hard to identify whether they belong to the same category. The second results from labeling corruption: the labels of some instances are intrinsically ambiguous, so the subsequent pairwise comparison is also affected. Corresponding to these situations, we consider two noise models for pairwise supervision.
In this paper, we investigate classification with noisy pairwise supervision, where noise is present in the pairwise comparisons and follows either pairing corruption or labeling corruption. Our first strategy is inspired by previous work [Patrini et al.2017, Natarajan et al.2013]
that learns a classifier from standard class-labeled data with class-conditional noise. We introduce a corrected loss function that induces an unbiased estimator of the classification risk in the presence of noise in pairwise data; a classifier can then be obtained by minimizing the corrected loss. The second strategy is based on weighted empirical risk minimization, or cost-sensitive classification
[Elkan2001, Scott and others2012]. It is motivated by the insight that the Bayes classifier of the classification risk under the noise-free distribution corresponds to that of a weighted risk under the noisy distribution. To the best of our knowledge, this is the first work to investigate learning from noisy pairwise data. Our main contributions are:

- Defining realistic noise models for the similar-dissimilar (SD) learning problem.
- Providing two algorithms, based on loss correction and weighted classification, that can effectively learn from noisy SD training data under either noise model and still achieve good performance (standard binary classification accuracy) on clean test data.
- Proving that the Bayes classifier for noise-free and symmetric-noise SD learning is identical to the Bayes classifier for standard binary classification.
- Analyzing the generalization bounds of our algorithms and providing two new generalization bounds for clean SD learning [Shimada et al.2019].
- Performing experiments on various datasets to show that the proposed algorithms work well in practice.
2 Problem Setup
Let $\mathcal{X}$ denote the instance space, $\mathcal{Y} = \{+1, -1\}$ the label space, and $p(x, y)$ the underlying distribution over $\mathcal{X} \times \mathcal{Y}$. $p(x, y)$ is the distribution with respect to which we want to perform well: the test data is drawn from this distribution, and the performance metric is classification accuracy $\mathbb{E}_{(x,y) \sim p}[\mathbb{1}[f(x) = y]]$, where $\mathbb{1}[\cdot]$ denotes the indicator function. However, we assume that due to domain constraints we are unable to procure direct class-labeled data from $p$ and only have access to pairwise supervision: whether a pair of instances is from the same class (similar) or from different classes (dissimilar). A latent variable dictates whether a sample pair is drawn from the similar data (S) or the dissimilar data (D). Further, let $\pi = p(y = +1)$ denote the class prior, and let $p_+(x) = p(x \mid y = +1)$ and $p_-(x) = p(x \mid y = -1)$ denote the class-conditional densities. We now present the two noise models that reflect what we may expect in real-world data.
2.1 Noise Model 1: Pairing Corruption
This model is motivated by the following scenario: imagine imperfect crowdworkers who are given a pool of instances and annotate pairs as similar if they believe the two instances share a class, and dissimilar otherwise. Since they are imperfect, they make mistakes in this process, sometimes assigning S when it should have been D and vice versa. Thus, in this model, two samples are drawn from the underlying distribution; if their labels agree, the pair is tagged Similar (S) with probability $1 - \rho_S$, and if they disagree, the pair is tagged Dissimilar (D) with probability $1 - \rho_D$. In other words, the noise rates for similar (S) and dissimilar (D) data are $\rho_S$ and $\rho_D$, respectively. Under this noise model, each of the observed S and D samples is therefore drawn from a mixture of the true S and D distributions. This noise model is instance-independent but label-dependent (different noise rates for S and D).
2.2 Noise Model 2: Labeling Corruption
Consider a privacy-sensitive domain in which responders do not want their individual labels sent to the learning module; further, some of them lie about their labels, so the labels are noisy. A moderator converts the pointwise data to pairwise data to preserve privacy. Mathematically, we first draw two samples from the underlying distribution, and each label is then flipped in a class-conditional manner: a sample with true label $+1$ is flipped to $-1$ with probability $\rho_+$, and a sample with true label $-1$ is flipped to $+1$ with probability $\rho_-$. In the next step, the similar or dissimilar tags are assigned from the corrupted labels (there is assumed to be no noise in this step, since the moderator is an expert).
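As a concrete illustration, the two corruption processes can be simulated as follows. This is a minimal sketch assuming a toy one-dimensional Gaussian class-conditional model; the function names, distributions, and noise rates are illustrative and not taken from the paper.

```python
import numpy as np

def sample_pn(n, pi, rng):
    # Toy 1-D model (illustrative): y=+1 w.p. pi with x ~ N(+2, 1),
    # y=-1 otherwise with x ~ N(-2, 1).
    y = np.where(rng.random(n) < pi, 1, -1)
    x = rng.normal(2.0 * y, 1.0)
    return x, y

def pairing_corruption(n_pairs, pi, rho_s, rho_d, rng):
    # Noise model 1: pairs are formed from clean labels; the S/D tag
    # itself is then flipped (S -> D w.p. rho_s, D -> S w.p. rho_d).
    x1, y1 = sample_pn(n_pairs, pi, rng)
    x2, y2 = sample_pn(n_pairs, pi, rng)
    true_s = y1 == y2
    flip = rng.random(n_pairs) < np.where(true_s, rho_s, rho_d)
    return np.stack([x1, x2], axis=1), np.where(flip, ~true_s, true_s)

def labeling_corruption(n_pairs, pi, rho_p, rho_m, rng):
    # Noise model 2: each point's class label is flipped first
    # (+1 w.p. rho_p, -1 w.p. rho_m); the S/D tag is then computed
    # without error from the corrupted labels.
    def corrupt(y):
        flip = rng.random(y.shape[0]) < np.where(y == 1, rho_p, rho_m)
        return np.where(flip, -y, y)
    x1, y1 = sample_pn(n_pairs, pi, rng)
    x2, y2 = sample_pn(n_pairs, pi, rng)
    noisy_s = corrupt(y1) == corrupt(y2)
    return np.stack([x1, x2], axis=1), noisy_s

rng = np.random.default_rng(0)
pairs, s_tags = pairing_corruption(10_000, 0.5, 0.2, 0.2, rng)
```

Note that with a balanced prior and symmetric flipping, roughly half of the observed pairs carry the S tag, since flips into and out of the S class cancel in expectation.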
If there were no corruption, the S data would consist of two positive instances or two negative instances, and the D data would consist of one positive and one negative instance. Accordingly, in the noise-free pairwise data scenario, the densities $p_S(x, x')$ and $p_D(x, x')$ of the S and D data are, respectively:
$$p_S(x, x') = \frac{\pi^2 p_+(x)\,p_+(x') + (1-\pi)^2 p_-(x)\,p_-(x')}{\pi^2 + (1-\pi)^2}, \qquad p_D(x, x') = \frac{p_+(x)\,p_-(x') + p_-(x)\,p_+(x')}{2} \tag{1}$$
Further, one can marginalize out one element of the pair and obtain the densities in terms of a single instance drawn from either the S or the D distribution. This view is important for our analysis and has been used previously in [Bao et al.2018]. The implication is that we can now treat the SD data as pointwise data from the Similar (S) and Dissimilar (D) classes:
$$p_S(x) = \frac{\pi^2 p_+(x) + (1-\pi)^2 p_-(x)}{\pi^2 + (1-\pi)^2}, \qquad p_D(x) = \frac{p_+(x) + p_-(x)}{2} \tag{2}$$
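The pointwise view can be sanity-checked by simulation: pooling the members of clean S pairs should yield positives with weight $\pi^2$ against $(1-\pi)^2$, while pooled D points contain the two classes in equal proportion. A small Monte Carlo check of these mixture weights (illustrative code, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
pi, n = 0.7, 200_000

# Draw clean label pairs and split them into S (same label) and D (different).
y1 = np.where(rng.random(n) < pi, 1, -1)
y2 = np.where(rng.random(n) < pi, 1, -1)
same = y1 == y2

# Pool both members of each pair, as in the marginalized pointwise view.
s_labels = np.concatenate([y1[same], y2[same]])
d_labels = np.concatenate([y1[~same], y2[~same]])

# Predicted positive-class weight within pooled S points is
# pi^2 / (pi^2 + (1-pi)^2); within pooled D points it is exactly 1/2.
pred_s = pi**2 / (pi**2 + (1 - pi) ** 2)
print((s_labels == 1).mean(), pred_s)  # empirical vs predicted weight
print((d_labels == 1).mean())          # exactly 0.5 by construction
```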
The above expressions are derived in [Bao et al.2018] under the noise-free assumption. Here we present the corresponding pointwise density for the S data under each of the noise models presented above (similar expressions can be obtained for the D data). For the pairing corruption model we get:
(3) 
and for the labeling corruption model we get,
(4) 

[Table 1 — columns: best PN, clean SD, TLoss, SDLoss, Weighted, Unweighted, KM, KMCOP]
[Table 2 — columns: best PN, clean SD, TLoss, SDLoss, Weighted, Unweighted, KM, KMCOP]
3 Loss Correction Approach
In this section, we present the first of our two proposed algorithms for learning from noisy pairwise data. This method uses the parameter values of the noise model to obtain an unbiased estimate of the loss function. In other words, we cast the loss of learning from noisy SD data in terms of learning from standard class-labeled data, which allows loss correction. In deriving the following expressions, we methodically account for all the ways an S (or D) instance can arise under each noise model (Figure 1). For the pairing noise model we obtain:
(5) 
On the other hand, for the labeling corruption noise model we obtain:
(6) 
We obtain similar equations for the D data under each noise model. It is noteworthy that the structure of the posterior probabilities is remarkably similar between the two noise models once we introduce the modified prior defined in (4). In other words, the labeling noise model can be interpreted as first corrupting the ordinary class-labeled P (positive) / N (negative) data to obtain noisy PN data with the modified class prior, and then mimicking the pairing noise model with an induced noise rate. Once we have expressed the posterior probabilities of the noisy SD data in terms of the original class posteriors, we can adopt the backward correction technique of [Patrini et al.2017] to construct corrected losses. Assuming there exist coefficients expressing the clean per-class losses as linear combinations of losses on the noisy classes, we can construct a new loss function whose expected-risk minimizer over the noisy SD data coincides with the expected-risk minimizer of the original loss over the test distribution. Concretely, if $T$ is the matrix whose entries are the probabilities of observing each noisy class given each true class, the backward-corrected loss vector is $\tilde{\ell} = T^{-1}\ell$, and its expectation over the noisy label equals the clean loss. We can use the corrected loss to train our classifier on the noisy SD data directly by ERM. Next, we state the generalization bounds for this approach.
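A minimal sketch of backward correction in the two-class S/D case, assuming (as in [Patrini et al.2017]) a corruption matrix $T$ with $T_{ij} = P(\text{observed class } j \mid \text{true class } i)$. The squared-loss form, the class encoding, and the specific rates here are illustrative:

```python
import numpy as np

# Corruption matrix T: T[i, j] = P(observed class j | true class i),
# rows indexed by the true S/D tag. The rates are illustrative.
rho_s, rho_d = 0.2, 0.3
T = np.array([[1 - rho_s, rho_s],
              [rho_d, 1 - rho_d]])
T_inv = np.linalg.inv(T)

def loss_vector(score):
    # Per-class squared losses for a scalar score, with S encoded as +1
    # and D as -1 (an illustrative choice of loss and encoding).
    return np.array([(1 - score) ** 2, (1 + score) ** 2])

def backward_corrected(score):
    # l_tilde = T^{-1} l: its expectation over the noisy tag recovers the
    # clean per-class loss, since E[l_tilde | true class i] = (T T^{-1} l)_i.
    return T_inv @ loss_vector(score)

# Unbiasedness check at an arbitrary score for a point whose true tag is S:
score = 0.4
clean = loss_vector(score)[0]
corrected = backward_corrected(score)
noisy_expectation = (1 - rho_s) * corrected[0] + rho_s * corrected[1]
print(clean, noisy_expectation)  # equal up to floating-point error
```

The check confirms the defining property of backward correction: averaging the corrected loss over the noisy tag distribution reproduces the clean loss exactly, so ERM on corrected losses targets the clean risk.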
Because the loss is corrected appropriately, the empirical estimate of the risk converges to the clean risk: by performing ERM on the noisy SD data, the empirical risk converges to the true risk on standard PN data drawn from the underlying distribution. Let $L$ denote the Lipschitz constant of the loss, and let $\mathfrak{R}_n$ denote the Rademacher complexity of the function class over $n$ noisy SD instances.
Lemma 1: With probability at least $1 - \delta$,
(7) 
Theorem 1: With probability at least $1 - \delta$,
(8) 
Special Case of Clean SD Learning: When there is no noise ($\rho_S = \rho_D = 0$ or $\rho_+ = \rho_- = 0$), for both noise models,
(9) 
We see that this matches the loss function derived for clean SD learning in [Shimada et al.2019] (up to a change of variables):
(10) 
Our analysis through the lens of loss correction provides the generalization bound for clean SD learning as a special case of noisy SD learning. The Lipschitz constant of the corrected loss in the clean SD case scales with that of the original loss. The generalization bound for noise-free SD learning is: with probability at least $1 - \delta$,
(11) 
Optimization: While we have a generalization guarantee, efficient optimization is a concern, especially because the corrected loss may not be convex. We present a condition that guarantees the corrected loss is convex.
Theorem 2: If the loss is convex and twice differentiable almost everywhere (for each label), and additionally satisfies a curvature condition on the loss together with a condition on the elements of the correction matrix $T^{-1}$, then the corrected loss is convex. The first condition is satisfied by several common losses such as the squared loss and the logistic loss. The second condition depends on the noise rates and the prior. We can simplify it for each noise model to:
(12) 
where either the first or the second bound applies depending on whether the class prior is greater or less than $1/2$, respectively. For all cases of noise-free or symmetric-noise SD learning, any noise rates satisfy this condition, and thus we can always perform efficient optimization.
4 Weighted Classification Approach
We now develop our second algorithm for dealing with noisy S and D data. The key issue we investigate here is how the Bayes classifier learned from noisy SD data relates to the traditional Bayes classifier.
Lemma 2: The Bayes classifier under the noisy SD distribution is given by thresholding the class posterior $\eta(x) = p(y = +1 \mid x)$ at a noise-dependent threshold $\alpha$:
(13) 
For the pairing corruption case the threshold takes one closed form, and for the labeling corruption case it takes another; these expressions can be derived by using (5) and (6) in (13). They give us an important insight: the Bayes classifier for noisy SD learning thresholds the posterior at a value different from $1/2$, while the traditional Bayes classifier thresholds it at $1/2$. Towards designing an algorithm, we note that this Bayes classifier can also be obtained by minimizing the weighted 0-1 risk.
The following lemma from [Scott and others2012] is crucial in connecting the Bayes classifier threshold with the weight $\alpha$ in weighted 0-1 classification.
Lemma 3: Denote the $\alpha$-weighted 0-1 risk under a distribution by $R_\alpha$.
Then the classifier that thresholds the posterior at $\alpha$ minimizes $R_\alpha$.
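Lemma 3 can be verified exhaustively on a toy discrete instance space: among all deterministic decision rules, the minimizer of the $\alpha$-weighted 0-1 risk is exactly the rule that predicts positive when the posterior exceeds $\alpha$. The posteriors and weight below are illustrative:

```python
import itertools
import numpy as np

alpha = 0.3
# Toy discrete instance space with known posteriors eta(x) = P(y=+1 | x)
# and a uniform marginal; all values are illustrative.
eta = np.array([0.10, 0.25, 0.35, 0.60, 0.90])
px = np.full(eta.size, 1.0 / eta.size)

def weighted_risk(decisions):
    # (1 - alpha) cost for missed positives, alpha cost for false positives.
    fn = (decisions == -1) * eta        # positive mass predicted -1
    fp = (decisions == +1) * (1 - eta)  # negative mass predicted +1
    return float(np.sum(px * ((1 - alpha) * fn + alpha * fp)))

# Exhaustive search over all 2^5 deterministic decision rules.
best_rule = min(itertools.product([-1, 1], repeat=eta.size),
                key=lambda d: weighted_risk(np.array(d)))
threshold_rule = tuple(int(v) for v in np.where(eta > alpha, 1, -1))
print(best_rule, threshold_rule)  # identical: predict +1 exactly when eta > alpha
```

The exhaustive minimizer coincides with the thresholding rule, matching the lemma pointwise: at each instance, predicting $+1$ is cheaper exactly when $(1-\alpha)\eta(x) > \alpha(1-\eta(x))$, i.e., $\eta(x) > \alpha$.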
We now show that there exists a choice of weight $\alpha$ such that the weighted risk under the noisy SD distribution is linearly related to the ordinary risk under the clean distribution.
Theorem 3: There exist constants and a function (depending only on the distribution, not on the classifier) such that:
For the pairing corruption case:
(14) 
For the labeling corruption case:
(15) 
Remark 1: The weighted Bayes-optimal classifier under the noisy SD distribution coincides with the Bayes classifier of the 0-1 loss under the clean ordinary distribution.
Generalization Bound: For ease of optimization, we use a surrogate loss instead of the 0-1 loss in weighted ERM. Any surrogate loss can be used as long as it decomposes into per-class margin terms; the squared, hinge, and logistic losses, among other common losses, can be expressed in this form.
Let denote the minimizer of the empirical risk using the weighted surrogate loss.
(16) 
Theorem 4: If the surrogate is a weighted margin loss that is classification-calibrated and Lipschitz, then for the above choices of weights there exists a non-decreasing function $\theta$ with $\theta(0) = 0$ such that the following bound holds with probability at least $1 - \delta$:
(17) 
where the last term denotes the corresponding Bayes risk under the clean distribution.
Remark 2: For a fixed Lipschitz constant, the generalization bound weakens as the linear-relation constant of Theorem 3 decreases. For the pairing noise model, it is easy to see that this constant decreases as the noise rates increase; the relationship is more complicated for the labeling noise model. When the noise is symmetric, we again observe that as the noise rate increases, the constant decreases and the bound weakens.
Remark 3: When the noise is absent or symmetric, the optimal Bayes classifier for the (noisy) SD learning problem is the same as the Bayes classifier for the standard class-labeled binary classification task under the clean distribution.
5 Estimation of Prior and Noise Parameters
We briefly discuss the parameters that must be known or estimated to apply each method under each noise model.
Loss correction approach: The noise rate parameters can be tuned by cross-validation on the noisy SD data.
We also need to estimate the class prior $\pi$ to construct the loss correction matrix $T$ under both noise models. The class prior can be estimated as follows:

For the pairing corruption noise model, the observed fraction of S-tagged pairs equals a known function of $\pi$ and the noise rates; an analogous relation holds for the labeling corruption noise model.
From each of these relations we can solve for the class prior $\pi$. The relations can be derived from (5) and (6) by marginalizing out the instance.
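Since the exact estimating equations are specific to the paper, the sketch below shows one derivation consistent with the pairing corruption model of Section 2.1: the observed fraction of S-tagged pairs is $(1-\rho_S)\pi_S + \rho_D(1-\pi_S)$ with $\pi_S = \pi^2 + (1-\pi)^2$, which can be inverted for $\pi$ (up to the inherent $\pi \leftrightarrow 1-\pi$ ambiguity of pairwise data). All numeric values are illustrative.

```python
import numpy as np

def estimate_prior_pairing(s_frac, rho_s, rho_d):
    # Observed fraction of S-tagged pairs:
    #   s_frac = (1 - rho_s) * pi_S + rho_d * (1 - pi_S),
    # where pi_S = pi^2 + (1 - pi)^2 is the clean P(similar pair).
    pi_s = (s_frac - rho_d) / (1 - rho_s - rho_d)
    pi_s = float(np.clip(pi_s, 0.5, 1.0))  # pi_S >= 1/2 always holds
    # Solve pi^2 + (1 - pi)^2 = pi_S, taking the root with pi >= 1/2
    # (pi and 1 - pi cannot be distinguished from pairwise data alone).
    return (1 + np.sqrt(2 * pi_s - 1)) / 2

# Simulated check with pi = 0.7, rho_s = 0.2, rho_d = 0.1.
rng = np.random.default_rng(3)
pi, rho_s, rho_d, n = 0.7, 0.2, 0.1, 400_000
y1 = np.where(rng.random(n) < pi, 1, -1)
y2 = np.where(rng.random(n) < pi, 1, -1)
true_s = y1 == y2
flip = rng.random(n) < np.where(true_s, rho_s, rho_d)
observed_s = np.where(flip, ~true_s, true_s)
pi_hat = estimate_prior_pairing(observed_s.mean(), rho_s, rho_d)
print(pi_hat)  # close to 0.7
```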
Method of weighted classification: The class prior appears in the weight only under the labeling corruption model; under the pairing corruption model, knowledge of the class prior is not needed to compute the weight. However, since the optimization problem has just this one parameter, in practice we obtain the weight directly by cross-validation under both noise models.
6 Experiments
We empirically verify that the proposed algorithms are able to learn a classifier for the underlying distribution from only noisy similar and dissimilar training data. All experiments are repeated on random train-test splits, and average accuracies are reported. We conduct experiments on the two noise models independently. In the learning phase, the noise parameters and the weight are tuned by cross-validation. Evaluation is done on the clean class-labeled test dataset using standard classification accuracy as the evaluation metric, averaged over the test datasets to reduce variance across the corruption in the training data. We use a multilayer perceptron (MLP) with two hidden layers, ReLU activations, and a single logistic sigmoid output as our model architecture for all experiments except the synthetic data, where we use a linear logistic model. We use the squared loss throughout and optimize with stochastic gradient descent (SGD) with momentum.
6.1 Synthetic Data
We generate a two-dimensional dataset with points sampled from one normal distribution for one class and another for the other class, each with identity covariance (isotropic). Note that a dataset generated in this manner may be non-separable. In Figure 2, we show the data distribution, the test data, the noisy SD data, and the performance of the classifier learned from the noisy SD data on the test data. We also use the non-separable benchmark "banana" dataset, which has two-dimensional attributes and two classes. We perform two kinds of experiments. In the first, for a given noise model and different settings of symmetric noise parameters, we plot the variation of clean test accuracy with the number of noisy SD pairs sampled for training; results for the weighted classification algorithm are shown in Figure 3. Since the Bayes classifier under symmetric noise is identical to that of the noise-free case under both noise models, accuracy improves as we obtain a better approximation of the Bayes classifier with more SD data points in training. Note that the number of original training points in the dataset is fixed; what changes is only the number of SD pairs we sample from them. In the second experiment, for each noise model and a fixed number of pairs, we show the gradual degradation of performance of both proposed algorithms (the loss correction approach and the weighted classification approach) with increasing symmetric noise rates. These experiments confirm that higher noise hurts classification accuracy and that more pairwise samples help it.
6.2 Real-World Datasets
We further conduct experiments on several benchmark datasets from the UCI repository (available at https://archive.ics.uci.edu/ml/datasets.php). All tasks are binary classification tasks of varying dimension, class prior, and sample size. We compare the performance of our proposed approaches against two kinds of baselines.
Supervised baseline: the noise-free versions of the proposed approaches, i.e., for the loss correction approach we assume the clean SD classification risk formulation [Shimada et al.2019], and for the weighted classification approach we assume the clean SD case and threshold at $1/2$. This baseline is simple, but since no existing supervised algorithms handle SD data except [Shimada et al.2019] and [Bao et al.2018], and none handle noisy SD supervision, it is the only baseline that makes use of all the data and supervision provided.
Unsupervised baseline: we also compare against unsupervised and semi-supervised clustering methods. For unsupervised clustering, we ignore all pairwise information and apply k-means [MacQueen and others1967] with two clusters directly on the noisy SD data points, and use the obtained clusters to classify the test data. We further use constrained k-means clustering [Wagstaff et al.2001], where we treat the SD pairs as constraints to supervise the clustering of the pooled SD data. These two baselines are shown in Tables 1 and 2.
From Tables 1 and 2, we observe that for both noise models and every noise rate, our proposed approaches outperform the baselines. Further, the clean SD performance matches the best PN performance, which empirically verifies that the optimal classifiers for learning from noise-free standard PN data and from pairwise SD data coincide.
7 Conclusion and Future Work
In this paper we investigated a novel setting: learning from noisy pairwise labels under two different noise models. We showed the connections of this problem to standard class-labeled binary classification, proposed two algorithms, and analyzed their generalization bounds. We empirically showed that they outperform supervised and unsupervised baselines. For future work, it is worthwhile to investigate more complicated noise models, such as instance-dependent noise [Menon et al.2016], in this setting.
References
 [Bao et al.2018] Han Bao, Gang Niu, and Masashi Sugiyama. Classification from pairwise similarity and unlabeled data. In International Conference on Machine Learning, pages 461–470, 2018.
 [Basu et al.2008] Sugato Basu, Ian Davidson, and Kiri Wagstaff. Constrained clustering: Advances in algorithms, theory, and applications. CRC Press, 2008.

 [Elkan2001] Charles Elkan. The foundations of cost-sensitive learning. In International Joint Conference on Artificial Intelligence, volume 17, pages 973–978. Lawrence Erlbaum Associates Ltd, 2001.
 [Eric et al.2008] Brochu Eric, Nando D Freitas, and Abhijeet Ghosh. Active preference learning with discrete choice data. In Advances in Neural Information Processing Systems, pages 409–416, 2008.
 [Fisher1993] Robert J Fisher. Social desirability bias and the validity of indirect questioning. Journal of consumer research, 20(2):303–315, 1993.
 [Fürnkranz and Hüllermeier2010] Johannes Fürnkranz and Eyke Hüllermeier. Preference learning. Springer, 2010.

[Han et al.2018]
Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and
Masashi Sugiyama.
Coteaching: Robust training of deep neural networks with extremely noisy labels.
In Advances in neural information processing systems, pages 8527–8537, 2018.  [Jamieson and Nowak2011] Kevin G Jamieson and Robert Nowak. Active ranking using pairwise comparisons. In Advances in Neural Information Processing Systems, pages 2240–2248, 2011.
 [Jamieson et al.2012] Kevin G Jamieson, Robert Nowak, and Ben Recht. Query complexity of derivativefree optimization. In Advances in Neural Information Processing Systems, pages 2672–2680, 2012.
 [Jiang et al.2017] Lu Jiang, Zhengyuan Zhou, Thomas Leung, LiJia Li, and Li FeiFei. Mentornet: Regularizing very deep neural networks on corrupted labels. arXiv preprint arXiv:1712.05055, 4, 2017.
 [MacQueen and others1967] James MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, volume 1, pages 281–297. Oakland, CA, USA, 1967.
 [Menon et al.2016] Aditya Krishna Menon, Brendan Van Rooyen, and Nagarajan Natarajan. Learning from binary labels with instancedependent corruption. arXiv preprint arXiv:1605.00751, 2016.
 [Natarajan et al.2013] Nagarajan Natarajan, Inderjit S Dhillon, Pradeep K Ravikumar, and Ambuj Tewari. Learning with noisy labels. In Advances in neural information processing systems, pages 1196–1204, 2013.

 [Patrini et al.2017] Giorgio Patrini, Alessandro Rozza, Aditya Krishna Menon, Richard Nock, and Lizhen Qu. Making deep neural networks robust to label noise: A loss correction approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1944–1952, 2017.
 [Saaty1990] Thomas L Saaty. Decision Making for Leaders: The Analytic Hierarchy Process for Decisions in a Complex World. RWS Publications, 1990.
 [Scott and others2012] Clayton Scott et al. Calibrated asymmetric surrogate losses. Electronic Journal of Statistics, 6:958–992, 2012.
 [Shimada et al.2019] Takuya Shimada, Han Bao, Issei Sato, and Masashi Sugiyama. Classification from pairwise similarities/dissimilarities and unlabeled data via empirical risk minimization. arXiv preprint arXiv:1904.11717, 2019.
 [Thurstone1927] Louis L Thurstone. A law of comparative judgment. Psychological review, 34(4):273, 1927.

 [Wagstaff et al.2001] Kiri Wagstaff, Claire Cardie, Seth Rogers, Stefan Schrödl, et al. Constrained k-means clustering with background knowledge. In ICML, volume 1, pages 577–584, 2001.