We consider multiclass classification with a feature space $\mathcal{X}$ and labels $\mathcal{Y} = \{1, \ldots, k\}$. Given the training data $(X_1, Y_1), \ldots, (X_n, Y_n)$, the usual goal is to find a prediction function $\hat F : \mathcal{X} \to \mathcal{Y}$ with low classification error $P(\hat F(X) \neq Y)$, where $(X, Y)$ is a new observation of an input-output pair. This type of prediction produces a definite prediction even for cases that are hard to classify.
In this paper we use conformal prediction (Vovk et al., 2005), where we estimate a set-valued function $C(x) \subseteq \mathcal{Y}$ with the guarantee that $P(Y \in C(X)) \geq 1 - \alpha$ for all distributions $P$. This is a distribution-free confidence guarantee. Here, $1 - \alpha$ is a user-specified confidence level. We note that “classification with a reject option” (Herbei and Wegkamp, 2006) also allows set-valued predictions but does not give a confidence guarantee.
The function $C$ can sometimes output the null set. That is, $C(x) = \emptyset$ for some values of $x$. This allows us to distinguish two types of uncertainty. When $C(x)$ is a large set, there are many possible labels consistent with $x$. But when $x$ does not resemble the training data, we will get $C(x) = \emptyset$, alerting us that we have not seen examples like this so far.
There are many ways to construct conformal prediction sets. Our construction is based on finding an estimate $\hat p(x|y)$ of $p(x|y)$. We then find an appropriate scalar $\hat t_y$ and we set $C(x) = \{y : \hat p(x|y) \geq \hat t_y\}$. The scalars are chosen so that $P(Y \in C(X)) \geq 1 - \alpha$. This guarantees that $x$'s with low probability, that is, regions where we have not seen training data, get classified as $\emptyset$. We shall see that this construction works well when there is a large number of classes, as is often the case in deep learning classification problems.
An important property of this approach is that $\hat p(x|y)$ can be estimated independently for each class. Therefore, $x$ is predicted to a given class in a standalone fashion, which enables adding or removing classes without the need to retrain the whole classifier. In addition, we empirically demonstrate that the method we propose is applicable to large-scale high-dimensional data by applying it to the ImageNet ILSVRC dataset and the CelebA and IMDB-Wiki facial datasets using features obtained from state-of-the-art convolutional neural networks.
Paper Outline. In Section 2 we discuss the difference between $p(y|x)$ and $p(x|y)$. In Section 3 we provide an example to illustrate our motivation. In Section 4 we present the general framework of conformal prediction and survey relevant works in the field. In Section 5 we formally present our method. In Section 6 we demonstrate the performance of the proposed classifier on the ImageNet challenge dataset using state-of-the-art convolutional neural networks. In Section 7 we consider the problem of gender classification from facial pictures and show that even when current classifiers fail to generalize from the CelebA dataset to the IMDB-Wiki dataset, the proposed classifier still provides sensible results. Section 8 contains our discussion and concluding remarks. The Appendix in the supplementary material contains some technical details.
Related Work. There is an enormous literature on set-valued prediction. Here we only mention some of the most relevant references. The idea of conformal prediction originates from Vovk et al. (2005). There is a large follow-up literature due to Vovk and his colleagues, which we highly recommend to interested readers. Statistical theory for conformal methods was developed in (Lei, 2014; Lei et al., 2013; Lei and Wasserman, 2014), and the multiclass case was studied in (Sadinle et al., 2017), where the goal was to develop small prediction sets based on estimating $p(y|x)$. The authors of that paper, similarly to Vovk et al. (2003), tried to avoid outputting null sets. In this paper, we use null sets as a feature. Finally, we mention a related but different technique called classification with the “reject option” (Herbei and Wegkamp, 2006). This approach permits one to sometimes refrain from providing a classification, but it does not aim to give confidence guarantees.
Recently, Lee et al. (2018) suggested a framework based on $p(x|y)$ to detect out-of-distribution samples and adversarial attacks.
2 $p(y|x)$ Versus $p(x|y)$
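Most classifiers, including most conformal classifiers, are built by estimating $p(y|x)$. Typically one sets the predicted label of a new $x$ to be $\hat y(x) = \arg\max_y \hat p(y|x)$. Since $p(y|x) = p(x|y)\,p(y)/p(x)$, the prediction involves the balance between $p(y)$ and $p(x|y)$. Of course, in the special case $p(y) = 1/k$ for all $y$, we have $\arg\max_y p(y|x) = \arg\max_y p(x|y)$.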
However, for set-valued classification, $p(y|x)$ can be negatively affected by $p(y)$ and $p(x)$. Indeed, in this case there are significant advantages to using $p(x|y)$ to construct the classifier. Taking into account $p(y)$ ties the prediction of an observation $x$ to the likelihood of observing that class. Since there is no restriction on the number of classes, ultimately an observation should be predicted to a class regardless of the class popularity. Normalizing by $p(x)$ makes the classifier oblivious to the probability of actually observing $x$. When $p(x)$ is extremely low (an outlier), $p(y|x)$ still selects the most likely label out of all tail events. In practice this may result in most of the space being classified with high probability to a handful of classes almost arbitrarily, despite the fact that the classifier has been presented with virtually no information in those areas of the space. This approach might be necessary if a single class has to be selected for every $x$. However, if this is not the case, then a reasonable prediction for an $x$ with small $p(x)$ is the null set.
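To make the outlier issue concrete, the following is a minimal sketch under stated assumptions: a hypothetical two-class problem with univariate Gaussian class densities and equal priors. None of the names or numbers come from the paper. It shows that far from both classes the posterior is nearly one for a single class even though both class-conditional densities are negligible.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical two-class problem with equal priors p(y) = 1/2.
CLASSES = {"a": (0.0, 1.0), "b": (4.0, 1.0)}  # label -> (mean, std)

def class_densities(x):
    """p(x|y) for every class y."""
    return {y: gaussian_pdf(x, m, s) for y, (m, s) in CLASSES.items()}

def posteriors(x):
    """p(y|x) by Bayes' rule; equal priors cancel in the ratio."""
    dens = class_densities(x)
    total = sum(dens.values())
    return {y: d / total for y, d in dens.items()}

x_outlier = 12.0
print(posteriors(x_outlier))       # almost all posterior mass on "b"
print(class_densities(x_outlier))  # yet both densities are tiny
```

A rule based on the posterior would label the point "b" with near certainty, whereas any reasonable threshold on the class-conditional density would leave it unlabeled.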
There are also conformal methods utilizing $p(y|x)$ to predict a set of classes (Sadinle et al., 2017; Vovk et al., 2003). These methods do not overcome the inherent weakness of $p(y|x)$. As will be explained later on, the essence of these methods is to classify $x$ to $C(x) = \{y : \hat p(y|x) \geq t\}$ for some threshold $t$. Due to the nature of $p(y|x)$, the points most likely to be predicted as the null set are those where $p(y|x) \approx 1/k$ for all classes $y$. But these are exactly the points in space for which any set-valued prediction should predict all classes as possible.
As we shall see, conformal predictors based on $p(x|y)$ can overcome all these issues.
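The behavior of a thresholded posterior can be illustrated with a short sketch. The posterior values and the threshold below are made up for illustration, and `posterior_threshold_set` is a hypothetical helper, not code from the cited methods.

```python
def posterior_threshold_set(posterior, t):
    """Labels whose estimated p(y|x) is at least t."""
    return {y for y, p in posterior.items() if p >= t}

confident = {"a": 0.90, "b": 0.07, "c": 0.03}  # one class clearly dominates
ambiguous = {"a": 0.34, "b": 0.33, "c": 0.33}  # close to uniform, about 1/k each

print(posterior_threshold_set(confident, 0.5))  # {'a'}
print(posterior_threshold_set(ambiguous, 0.5))  # set()
```

The empty set appears exactly where the labels are hardest to tell apart, which is the case in which one would rather report every class as possible.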
3 Motivating Example - Iris Dataset
The Iris flower data set is a benchmark dataset often used to demonstrate classification methods. It contains four features that were measured from three different Iris species. In this example, for visualization purposes, we only use two features: the sepal and petal lengths in cm.
Figure 1 shows the decision boundaries for this problem, comparing the results of (a) K-nearest neighbors (KNN), (b) support vector machines with the RBF kernel (SVM) and (c) our conformal prediction method using an estimate $\hat p(x|y)$.
Both the KNN and the SVM methods provide sensible boundaries between the classes where there are observations. In areas with low density the decision boundaries are significantly different. The SVM classifies almost all of the space to a single class. The KNN creates an infinite strip bounded between two (almost affine) half spaces. In a hubristic manner, both methods provide very different predictions with probability near one without sound justification.
The third plot shows the conformal set $C(x) = \{y : \hat p(x|y) \geq \hat t_y\}$, where $\hat t_y$ is chosen as described in Section 5. The result is a cautious prediction. If a new $X$ falls into a region with little training data then we output $\emptyset$. In such cases our proposed method modestly avoids making any claim.
4 Conformal Prediction
Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be independent and identically distributed (iid) pairs of observations from a distribution $P$. In set-valued supervised prediction, the goal is to find a set-valued function $C(x)$ such that
$$P(Y \in C(X)) \geq 1 - \alpha,$$
where $(X, Y)$ denotes a new pair of observations.
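Conformal prediction, a method created by Vovk and collaborators (Vovk et al., 2005), provides a general approach to construct prediction sets based on the observed data without any distributional assumptions. The main idea is to construct a conformal score, which is a real-valued, permutation-invariant function $\psi(z, \mathcal{D})$ where $z = (x, y)$ and $\mathcal{D}$ denotes the training data. Next we form an augmented dataset $\mathcal{D}' = \{(X_1, Y_1), \ldots, (X_n, Y_n), (X_{n+1}, Y_{n+1})\}$ where $(X_{n+1}, Y_{n+1})$ is set equal to arbitrary values $(x, y)$. We then define $R_i = \psi((X_i, Y_i), \mathcal{D}')$ for $i = 1, \ldots, n+1$. We test the hypothesis $H_0 : Y = y$ that the new label $Y$ is equal to $y$ using the p-value $\pi(x, y)$, the fraction of the scores $R_1, \ldots, R_{n+1}$ that are at least as extreme as $R_{n+1}$. Then we set $C(x) = \{y : \pi(x, y) \geq \alpha\}$. Vovk et al. (2005) prove that $P(Y \in C(X)) \geq 1 - \alpha$ for all distributions $P$. There is great flexibility in the choice of conformity score, and Section 4.1 discusses important examples.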
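The steps above can be sketched in a few lines of Python. The conformity score used here (minus the distance to the class mean of the augmented data), the convention that larger scores mean better conformity, and the toy data are illustrative assumptions rather than the choices made in this paper; the function names are hypothetical.

```python
def conformity_scores(data):
    """Score each (x, y) by minus its distance to the mean of its own class.

    The score depends on the data only as a set, so it is permutation
    invariant. Larger values mean the point conforms better.
    """
    sums, counts = {}, {}
    for x, y in data:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return [-abs(x - sums[y] / counts[y]) for x, y in data]

def conformal_p_value(train, x_new, y_candidate):
    """p-value for the hypothesis that the label of x_new is y_candidate."""
    augmented = train + [(x_new, y_candidate)]
    scores = conformity_scores(augmented)
    last = scores[-1]
    # Fraction of points that conform no better than the new point.
    return sum(1 for s in scores if s <= last) / len(scores)

def conformal_set(train, x_new, labels, alpha):
    """All candidate labels whose p-value is at least alpha."""
    return {y for y in labels if conformal_p_value(train, x_new, y) >= alpha}

# Toy one-dimensional data: class 0 around 0, class 1 around 5.
train = [(-1.0, 0), (-0.5, 0), (0.0, 0), (0.5, 0), (1.0, 0),
         (4.0, 1), (4.5, 1), (5.0, 1), (5.5, 1), (6.0, 1)]

print(conformal_set(train, 0.2, [0, 1], alpha=0.2))   # {0}
print(conformal_set(train, 20.0, [0, 1], alpha=0.2))  # set()
```

Every candidate label requires the scores to be recomputed on a new augmented dataset, which is the source of the computational cost of the full method.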
As described above, it is computationally expensive to construct $C(x)$, since we must re-compute the entire set of conformal scores for each choice of $(x, y)$. This is especially a problem in deep learning applications, where training is usually expensive. One possibility for overcoming the computational burden is based on data splitting, where the score function is estimated from part of the data and the conformal scores are evaluated on the remaining data; see (Vovk, 2015; Lei and Wasserman, 2014). Another approach is to construct the scores from the original data, without augmentation. In this case, we no longer have the finite sample guarantee $P(Y \in C(X)) \geq 1 - \alpha$ for all distributions $P$, but we do get $P(Y \in C(X)) \geq 1 - \alpha - o_P(1)$ as long as some conditions are satisfied. (A sequence of random variables $X_n$ is $o_P(1)$ if $P(|X_n| > \epsilon) \to 0$ for every $\epsilon > 0$.) See (Sadinle et al., 2017) for further discussion on this point.
Here are several known examples for conformal methods used on different problems.
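Supervised Regression. Suppose we are interested in the supervised regression problem. Let $\hat\mu$ be any regression function learned from training data. Let $\epsilon_i$ denote the residual error of $\hat\mu$ on the observation $i$, that is, $\epsilon_i = |Y_i - \hat\mu(X_i)|$. Now we form the ordered residuals $\epsilon_{(1)} \leq \cdots \leq \epsilon_{(n)}$, and then define
$$C(x) = \left\{ y : |y - \hat\mu(x)| \leq \epsilon_{(\lceil (1-\alpha) n \rceil)} \right\}.$$
If $\hat\mu$ is a consistent estimator of $E[Y \mid X = x]$ then $P(Y \in C(X)) = 1 - \alpha + o_P(1)$. See Lei and Wasserman (2014).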
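As a rough illustration of the residual-quantile idea, the sketch below uses the data-splitting variant: a least-squares line is fit on one part of the data and the band half-width is an order statistic of the absolute residuals on the other part. The quantile index $\lceil (1-\alpha)(m+1) \rceil$, the linear model and the toy data are assumptions made for this example.

```python
import math

def fit_line(points):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def split_conformal_band(train, calib, alpha):
    """Fit on `train`, then size the band from absolute residuals on `calib`."""
    intercept, slope = fit_line(train)

    def predict(x):
        return intercept + slope * x

    residuals = sorted(abs(y - predict(x)) for x, y in calib)
    # Finite-sample index ceil((1 - alpha)(m + 1)); infinite band if it exceeds m.
    k = math.ceil((1 - alpha) * (len(residuals) + 1))
    half_width = residuals[k - 1] if k <= len(residuals) else math.inf

    def band(x):
        return predict(x) - half_width, predict(x) + half_width

    return band

# Hypothetical data: an exact line for fitting, noisy points for calibration.
train = [(float(x), 2.0 * x + 1.0) for x in range(10)]
noise = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7]
calib = [(i + 0.5, 2.0 * (i + 0.5) + 1.0 + e) for i, e in enumerate(noise)]

band = split_conformal_band(train, calib, alpha=0.25)
low, high = band(3.0)
print(low, high)  # roughly 6.4 and 7.6
```

The band has the same width everywhere because a single residual quantile is used; locally adaptive variants exist but are outside the scope of this sketch.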
Unsupervised Prediction. Suppose we observe independent and identically distributed $Y_1, \ldots, Y_n$ from a distribution $P$. The goal is to construct a prediction set $C$ for a new $Y$. Lei et al. (2013) use the level set $C = \{y : \hat p(y) \geq t\}$, where $\hat p$ is a kernel density estimator. They show that if $t$ is chosen carefully then $P(Y \in C) \geq 1 - \alpha$ for all $P$.
Multiclass Classification. There are two notable solutions also using conformal prediction for the multiclass classification problem which are directly relevant to this work.
Least Ambiguous Set-Valued Classifiers with Bounded Error Levels. Sadinle et al. (2017) extended the results of Lei (2014) and defined $C(x) = \{y : \hat p(y|x) \geq t\}$, where $\hat p(y|x)$ is any consistent estimator of $p(y|x)$. They defined the ambiguity as $E|C(X)|$, which is the expected size of the prediction set. They proved that out of all the classifiers achieving the desired $1 - \alpha$ coverage, this solution minimizes the ambiguity. In addition, the paper considers class-specific coverage, controlling $P(Y \in C(X) \mid Y = y) \geq 1 - \alpha_y$ for every class $y$.
Universal Predictor. Vovk et al. (2003) introduce the concept of a universal predictor and provide an explicit way to construct one. A universal predictor is a classifier that produces, asymptotically, no more multiple predictions than any other classifier achieving $1 - \alpha$ level coverage. In addition, within the family of all classifiers that produce the minimal number of multiple predictions, it also asymptotically obtains at least as many null predictions.
5 The Method
5.1 The Classifier
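Let $\hat p(x|y)$ be an estimate of the density $p(x|y)$ for class $y$. Define $\hat t_y$ to be the empirical $\alpha$ quantile of the values $\{\hat p(X_i|y) : Y_i = y\}$. That is,
$$\hat t_y = \sup\left\{ t : \frac{1}{n_y} \sum_{i : Y_i = y} I\big(\hat p(X_i|y) \geq t\big) \geq 1 - \alpha \right\},$$
where $n_y = \sum_{i=1}^{n} I(Y_i = y)$. Assuming that $n_y \to \infty$ and minimal conditions on $p(x|y)$ and $\hat p(x|y)$, it can be shown that $\hat t_y \to t_y$ in probability, where $t_y$ is the largest $t$ such that $\int_{\{x : p(x|y) \geq t\}} p(x|y)\,dx \geq 1 - \alpha$. See Cadre et al. (2009) and Lei et al. (2013). We set $C(x) = \{y : \hat p(x|y) \geq \hat t_y\}$. We then have the following proposition, which is proved in the appendix.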
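Proposition 1. Assume the conditions in Cadre et al. (2009), stated also in the appendix. Let $(X, Y)$ be a new observation. Then $|P(Y \in C(X)) - (1 - \alpha)| \to 0$ in probability as $\min_y n_y \to \infty$.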
An exact, finite sample method can be obtained using data splitting. We split the training data into two parts. Construct $\hat p(x|y)$ from the first part of the data. Now evaluate $\{\hat p(X_i|y)\}$ on the second part of the data and define $\hat t_y$ using these values. We then set $C(x) = \{y : \hat p(x|y) \geq \hat t_y\}$. We then have:
Proposition 2. Let $(X, Y)$ be a new observation. Then, for every distribution and every sample size, $P(Y \in C(X)) \geq 1 - \alpha$.
This follows from the theory in Lei and Wasserman (2014). The advantage of the splitting approach is that there are no conditions on the distribution, and the confidence guarantee is finite sample. There is no large sample approximation. The disadvantage is that the data splitting can lead to larger prediction sets. Algorithm 1 describes the training, and Algorithm 2 describes the prediction.
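The training and prediction steps of the splitting approach can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithms 1 and 2: it uses a hand-written one-dimensional Gaussian kernel density estimate, a fixed bandwidth, an even split of each class, the finite-sample index $\lfloor \alpha (m+1) \rfloor$ for the threshold, and made-up data. All function names are hypothetical.

```python
import math

def make_kde(sample, bandwidth):
    """Gaussian kernel density estimate in one dimension."""
    norm = len(sample) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample) / norm

    return density

def train_class(points, alpha, bandwidth=0.5):
    """Training for one class, using only that class's observations."""
    half = len(points) // 2
    density = make_kde(points[:half], bandwidth)        # first part: density
    calib = sorted(density(x) for x in points[half:])   # second part: scores
    k = math.floor(alpha * (len(calib) + 1))
    threshold = calib[k - 1] if k >= 1 else 0.0          # k = 0 keeps every x
    return density, threshold

def predict_set(models, x):
    """Prediction: every class whose estimated density at x clears its threshold."""
    return {y for y, (density, threshold) in models.items() if density(x) >= threshold}

# Toy one-dimensional data; classes "a" and "b" overlap, "c" is far away.
base = [0.0, 0.4, -0.3, 0.8, -0.6, 0.2, 0.5, -0.4, 0.1, -0.1, 0.6, -0.5, 0.3, -0.2]
data = {"a": base, "b": [v + 0.6 for v in base], "c": [v + 6.0 for v in base]}
models = {y: train_class(points, alpha=0.25) for y, points in data.items()}

print(predict_set(models, -0.3))  # a single label
print(predict_set(models, 0.5))   # ambiguous between the overlapping classes
print(predict_set(models, 3.0))   # far from all training data: empty set
```

Because each entry of `models` is built from one class only, a class can be added or dropped by inserting or deleting a dictionary entry, without touching the others.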
5.2 Density Estimation
The density $p(x|y)$ has to be estimated from data. One possible way is to use the standard kernel density estimation method, which was shown to be optimal in the conformal setting under weak conditions in Lei, Robins, and Wasserman (2013). This is useful for theoretical purposes due to the large literature on the topic. Empirically, it is faster to use the distance to the nearest neighbors.
Density estimation in high dimensions is a difficult problem. Nonetheless, as we will show in the numerical experiments (Section 6), the proposed method works well in these tasks too. An intuitive reason for this could be that the accuracy of the conformal prediction does not actually require $\hat p(x|y)$ to be close to $p(x|y)$ in norm. Rather, all we need is that the ordering imposed by $\hat p(x|y)$ approximates the ordering defined by $p(x|y)$. Specifically, we only need that is approximated by for . We call this “ordering consistency.” This is much weaker than the usual requirement that the estimation error be small. This new definition and its implications for the approximation of $C(x)$ will be further expanded in future work.
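A nearest-neighbor score of the kind mentioned above might look like the sketch below. It assumes the score is minus the Euclidean distance to the $k$-th nearest training point of the class, which orders points in the same way as the classical $k$-nearest-neighbor density estimate (that estimate is decreasing in this distance). The exact score used in the experiments is not specified here, so this is only an illustration; `knn_score` is a hypothetical name.

```python
import math

def knn_score(x, sample, k):
    """Minus the distance from x to its k-th nearest neighbor in `sample`.

    Larger values indicate a denser neighborhood, so the score can stand in
    for a density estimate wherever only the ordering of points matters.
    """
    dists = sorted(math.dist(x, s) for s in sample)
    return -dists[k - 1]

# Hypothetical two-dimensional class sample: the corners of the unit square.
sample = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

print(knn_score((0.0, 0.0), sample, k=2))    # -1.0
print(knn_score((0.5, 0.5), sample, k=2))    # inside the square: higher score
print(knn_score((10.0, 10.0), sample, k=2))  # far away: much lower score
```

The score can be thresholded at an empirical quantile exactly as a density estimate would be.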
5.3 Class Adaptivity
As Algorithms 1 and 2 demonstrate, the training and prediction of each class is independent of all other classes. This makes the method adaptive to the ad hoc addition and removal of classes. Intuitively speaking, if there is a sufficiently high probability that the observation was generated from the class, it will be classified to the class regardless of any other information.
Another desirable property of the method is that it is possible to obtain different coverage levels per class if the task requires that. This is achieved by setting $\hat t_y$ to be the $\alpha_y$ quantile of the values $\{\hat p(X_i|y)\}$.
5.4 Class Interaction
Defining $\hat p(x|y)$ independently for each class has the desired property of class adaptivity, but it also discards relevant information regarding the relations between the classes. Figure 1 (c) demonstrates how the decision boundaries of the different classes are independent.
More complex decision boundaries can be created using correlated estimators of $p(x|y)$. One example of such an estimator is , for some . This estimator penalizes high density regions of the other classes. Figure 2 visualizes the results of such an estimator on the Iris dataset.
6 ImageNet Challenge Example
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) (Deng et al., 2009) is a large visual dataset of more than million images labeled across different classes. It is considered a large scale complex visual dataset that reflects object recognition state-of-the-art through a yearly competition.
In this example we apply our conformal image classification method to the ImageNet dataset. We remove the last layer from the pretrained Xception convolutional neural network (Chollet, 2016) and use it as a feature extractor. Each image is represented as a dimensional feature in We learn for each of the classes a unique kernel density estimator trained only on images within the training set of the given class. When we evaluate results of standard methods we use the Inception-v4 model (Szegedy et al., 2017) to avoid correlation between the feature extractor and the prediction outcome as much as possible.
The Xception model obtains near state-of-the-art results of (top-1) and (top-5) accuracy on the ImageNet validation set. As a sanity check on the performance of our method, selecting for each image the highest (and top 5) prediction of $\hat p(x|y)$ achieves (top-1) and (top-5) on the ImageNet validation set. We were pleasantly surprised by this result. Each of the $\hat p(x|y)$'s was learned independently, possibly discarding relevant information on the relations between the classes. The kernel density estimation is done in and the default bandwidth levels were used to avoid overfitting the training set. Yet the naive performance is roughly on par with GoogLeNet (Szegedy et al., 2015), the winners of the challenge (top-1: , top-5: ).
For conformal methods the confidence level is predefined. The method calibrates the number of classes in the prediction sets to satisfy the desired accuracy level. The main component affecting the results is the $\alpha$ hyperparameter. For small values of $\alpha$ the accuracy will be high, but so will the number of classes predicted for every observation. For large values of $\alpha$ more observations are predicted as the null set and fewer observations are predicted per class. Figure 3 (a) presents the trade-off between the $\alpha$ level, the number of classes and the proportion of null set predictions for this example. For example , accuracy would require on average predictions per observation and null set predictions. The actual selection of the proper $\alpha$ value is highly dependent on the task. As discussed earlier, a separate $\alpha_y$ for each class can also be used to obtain a different accuracy per class.
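The three quantities in this trade-off (accuracy, average number of predicted classes and proportion of null sets) can be computed from any collection of prediction sets. The helper below is a hypothetical sketch with made-up inputs; it is not the evaluation code behind Figure 3.

```python
def summarize(pred_sets, labels):
    """Return (coverage, average set size, null-set rate) for set-valued predictions."""
    n = len(labels)
    coverage = sum(1 for s, y in zip(pred_sets, labels) if y in s) / n
    avg_size = sum(len(s) for s in pred_sets) / n
    null_rate = sum(1 for s in pred_sets if not s) / n
    return coverage, avg_size, null_rate

# Made-up prediction sets and true labels for four observations.
pred_sets = [{"a"}, {"a", "b"}, set(), {"b"}]
labels = ["a", "b", "a", "a"]

print(summarize(pred_sets, labels))  # (0.5, 1.0, 0.25)
```

Repeating this summary over a grid of $\alpha$ values gives the kind of curve described above.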
Figures 3 (b) and (c) show illustrative results from the ImageNet validation set. (b) presents a picture of a "Barber Shop". When the method correctly suggests the right class in addition to several other relevant outcomes such as "Bakery". When only the "Barber Shop" remains. (c) shows a "Brain Coral". For the method still suggests classes which are clearly wrong. As increases the number of classes decreases and for only "Brain Coral" and "Coral Reef" remain, both of which are relevant. At only "Coral Reef" remains, which represents a misclassification following from the fact that the class threshold is lower than that of "Brain Coral". Eventually at the null set is predicted for this picture.
Figure 5 shows a collage of images using . To avoid selection bias we’ve selected the first images in the ImageNet validation set.
6.1 Adversarial Robustness
Adversarial attacks attempt to fool machine learning models through malicious input. The suggested method is designed to be cautious and provide multiple predictions under uncertainty, which results in robust performance under different attacks. In this section we use the foolbox library (Rauber et al., 2017) to generate different attacks on the ImageNet validation set and test the performance of the method on the ResNet50 model (He et al., 2016). We attack the first images that are accurately classified by the model.
Table 1 shows the prediction results of two types of attacks, untargeted (using Deepfool (Rauber et al., 2017) and the FGSM attack (Kurakin et al., 2016)) and targeted (using L-BFGS-B (Tabacof and Valle, 2016) and Projected Gradient Descent (PGD) (Kurakin et al., 2016)). The untargeted attacks perturb the image the least in order to find any misclassification. This yields predictions of both the true class and the adversarial class. While the attack reduces the performance of the model, the model is more robust than standard methods. Targeted attacks attempt to force the prediction of a specific class given a priori (randomly selected). This requires the attack to make larger modifications to the original image, and as a result the model mostly predicts the null set both for the true label and the adversarial label.
Figure 4 (a) shows the outcome when the input is random noise. We set the threshold . This gives a less conservative classifier that should have the largest number of false positives. Even with such a low threshold all random noise images over categories are correctly flagged as the null set. Evaluating the same sample on the Inception-v4 model (Szegedy et al., 2017) results in a top prediction average of (with standard error) to "Kite" and () to "Envelope". The top-5 classes together have a mean probability of , much higher than the uniform distribution expected for the prediction of random noise.
Figure 4 (b) shows results on Jackson Pollock paintings, an abstract yet more structured dataset. Testing different paintings with all result in the null set. When testing the Inception-v4 model output, paintings are classified with probability greater than to one of "Coil", "Ant", "Poncho", "Spider Web" and "Rapeseed", depending on the image.
Figure 4 (c) is the famous picture of Muhammad Ali knocking out Sonny Liston during the first round of the rematch. "Boxing" is not included in the ImageNet challenge. Our method correctly chooses the null set with as low as . Standard methods are forced to associate this image with one of the classes and choose "Volleyball" with probability and the top-5 are all sport-related predictions with probability. This is a good result given the constraint of selecting a single class, but it demonstrates the impossibility of trying to create classes for all topics.
7 Gender Recognition Example
In the next example we study the problem of gender classification from facial pictures. CelebFaces Attributes Dataset (CelebA) (Liu et al., 2015) is a large-scale face attributes dataset with more than celebrity images attributed, each with 40 attribute annotations including the gender (Male/Female). IMDB-Wiki dataset is a similar large scale ( images) dataset (Rothe et al., 2016) with images taken from IMDB and Wikipedia.
We train a standard convolutional neural network ( convolution and dense layers with the corresponding pooling and activation layers) to perform gender classification on CelebA. It converges well, obtaining accuracy on a held-out test set, but fails to generalize to the IMDB-Wiki dataset, achieving accuracy, slightly better than a random guess. The discrepancy between the two datasets follows from the fact that facial images rely on preprocessing to standardize the input. We have used the default preprocessing provided by the datasets, to reflect a scenario in which the distribution of the samples changes between training and testing. Figure 6 (a) and (b) show mean pixel values for female pictures within CelebA vs pictures in the IMDB-Wiki dataset. As seen, IMDB-Wiki is richer and offers a larger variety of human postures.
Although the standard classification method fails in this scenario, the conformal method suggested in this paper still offers valid and sensible results both on CelebA and IMDB-Wiki when using the features extracted from the network trained on CelebA. Figure 6 (c) shows the performance of the method with respect to both datasets. CelebA results are good since they are based on features that perform well for this dataset. The level of accuracy is roughly as expected by the design, while the proportion of null predictions is roughly . Therefore for all there are almost no false positives and all of the errors are the null set.
The IMDB-Wiki results are not as good, but better than naively using a accuracy classifier. Figure 6 (c) shows the classifier performance as a function of . Both the accuracy and the number of false positives are tunable. For high values of the accuracy is much higher than , but would result in a large number of observations predicted as both genders. If cautious and conservative prediction is required, small values of would guarantee a smaller number of false predictions, but a large number of null predictions. The suggested conformal method provides a hyper-parameter controlling which type of errors are created according to the prediction needs, and works even in cases where standard methods fail.
8 Discussion
In this paper we showed that conformal, set-valued predictors based on $\hat p(x|y)$ have very good properties. We obtain a cautious prediction, associating an observation with a class only if there is a high probability that the observation was generated from that class. In most of the space the classifier predicts the null set. This stands in contrast to standard solutions, which provide confident predictions for the entire space based on data observed in a small area. This can be useful when a large number of outliers is expected, or when the distribution of the training data does not fully describe the distribution of the observations at deployment. Adversarial attacks are an important example of such scenarios. We also obtain a large set of labels when the object is ambiguous and is consistent with many different classes. Thus, our method quantifies two types of uncertainty: ambiguity with respect to the given classes and outlyingness with respect to the given classes.
In addition, the conformal framework provides our method with its coverage guarantees and class adaptivity. It is straightforward to add and remove classes at any stage of the process while controlling either the overall or the class-specific coverage level of the method in a highly flexible manner, if desired by the application. These desired properties come at a price. The distribution $p(x|y)$ for each class is learned independently and the decision boundaries are indifferent to data not within the class. In case precise decision boundaries are desired, more complex estimation functions overcome this limitation and provide decision boundaries almost equivalent to those of standard methods.
During the deployment of the method, evaluation of a large number of kernel density estimators is required. This is relatively slow compared to current methods. This issue can be addressed in future research with more efficient ways to learn ordering-consistent approximations of $p(x|y)$ that can be deployed on GPUs.
- Cadre et al. (2009) Benoît Cadre, Bruno Pelletier, and Pierre Pudlo. Clustering by estimation of density level sets at a fixed probability. 2009.
- Chollet (2016) François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint, 2016.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Herbei and Wegkamp (2006) Radu Herbei and Marten H Wegkamp. Classification with reject option. Canadian Journal of Statistics, 34(4):709–721, 2006.
- Kurakin et al. (2016) Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
- Lee et al. (2018) Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pages 7167–7177, 2018.
- Lei (2014) Jing Lei. Classification with confidence. Biometrika, 101(4):755–769, 2014.
- Lei and Wasserman (2014) Jing Lei and Larry Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):71–96, 2014.
- Lei et al. (2013) Jing Lei, James Robins, and Larry Wasserman. Distribution-free prediction sets. Journal of the American Statistical Association, 108(501):278–287, 2013.
- Liu et al. (2015) Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
- Rauber et al. (2017) Jonas Rauber, Wieland Brendel, and Matthias Bethge. Foolbox v0.8.0: A python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131, 2017.
- Rothe et al. (2016) Rasmus Rothe, Radu Timofte, and Luc Van Gool. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision (IJCV), July 2016.
- Sadinle et al. (2017) Mauricio Sadinle, Jing Lei, and Larry Wasserman. Least ambiguous set-valued classifiers with bounded error levels. Journal of the American Statistical Association, (just-accepted), 2017.
- Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. Going deeper with convolutions. CVPR, 2015.
- Szegedy et al. (2017) Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
- Tabacof and Valle (2016) Pedro Tabacof and Eduardo Valle. Exploring the space of adversarial images. In 2016 International Joint Conference on Neural Networks (IJCNN), pages 426–433. IEEE, 2016.
- Vovk (2015) Vladimir Vovk. Cross-conformal predictors. Annals of Mathematics and Artificial Intelligence, 74(1-2):9–28, 2015.
- Vovk et al. (2003) Vladimir Vovk, David Lindsay, Ilia Nouretdinov, and Alex Gammerman. Mondrian confidence machine. Technical Report, 2003.
- Vovk et al. (2005) Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic learning in a random world. Springer Science & Business Media, 2005.
Appendix A Appendix: Details on Proposition 1
Here we provide more details on Proposition 1. We assume that the conditions in Cadre et al. (2009) hold. In particular, we assume that and where is the bandwidth of the density estimator. In addition we assume that is compact and that where .
Let and . Note that, conditional on the training data ,
From Theorem 2.3 of Cadre et al. (2009) we have that where is Lebesgue measure and denotes the set difference. It follows that
since, under the conditions, is consistent in the norm. It follows that as required.
We should remark that, in the above, we assumed that the number of classes is fixed. If we allow to grow the analysis has to change. Summing the errors in the expression above we have that where now the remainder is
We then need to assume that as the number of classes increases, the class sample sizes $n_y$ grow fast enough so that . However, this condition can be weakened by insisting that for all $y$ with $n_y$ small, we force $C(x)$ to omit $y$. If this is done carefully, then the coverage condition can be preserved and we only need the remainder to be small when summing over the larger classes. The details of the theory in this case will be reported in future work.