Prediction Confidence from Neighbors

by Mark Philip Philipsen, et al.
Aalborg University

The inability of Machine Learning (ML) models to extrapolate correct predictions from out-of-distribution (OoD) samples is a major hindrance to the application of ML in critical applications. Until the generalization ability of ML methods is improved, it is necessary to keep humans in the loop. The need for human supervision can only be reduced if it is possible to determine a level of confidence in predictions, which can be used either to ask for human assistance or to abstain from making predictions. We show that feature space distance is a meaningful measure that can provide confidence in predictions. The distance between unseen samples and nearby training samples proves to be correlated with the prediction error of unseen samples. Depending on the acceptable degree of error, predictions can either be trusted or rejected based on the distance to training samples. The distance can also be used to decide whether a sample is worth adding to the training set. This enables earlier and safer deployment of models in critical applications and is vital for deploying models under ever-changing conditions.




I Introduction

Deploying Machine Learning (ML) methods, most notably Deep Neural Networks (DNNs), in a complex and changing world reveals some fundamental shortcomings of these methods. Focusing on DNNs, the shortcomings include failures in classifying known objects when presented in novel poses [2] and a lack of robustness to minor perturbations in the input, where small changes, imperceptible to humans, may result in incorrect predictions [16]. Both problems can be mitigated with training data that densely covers the experience space. Ensuring this often proves challenging, especially in dynamic environments. Such environments are called non-stationary and are likely to result in what is known as concept drift, which occurs when data distributions change such that existing models become outdated and must be retrained. In less severe cases this is called virtual concept drift and requires tuning of representations or supplemental learning [4]. Spam detection [11], price prediction [6], and factory automation [8, 12] are examples of applications that encounter concept drift and where it is necessary to retrain models from time to time.

For the vast majority of ML applications it is assumed that the world is predictable and static. It may seem that way most of the time, but the reality is that things can suddenly change in surprising ways. Data sets used during system development and in benchmarks are usually captured with little variety in hardware and environments. In reality, the sources of variation include differences in calibration, changes to setup and sensors [7], product changes, outliers from unforeseen events, etc. This means that most systems are guaranteed to encounter input that differs from the training distribution, likely resulting in faulty predictions [2].


The idea presented here is analogous to a doctor looking at medical images in order to diagnose a patient. Here, similar images from previously confirmed cases are a big help in guiding a diagnosis. The idea of assisting doctors in this way has actually been implemented using an automatic image similarity search system [3]. Similarly, known samples are consulted when assigning a confidence to a new prediction. In the majority of related work on measuring image similarity, the problem is often that an algorithm's understanding of "similar" differs from what is considered relevant by humans. What is interesting here is what the DNN considers similar. This is best measured using the learned representations from the activations in the layers of the DNN. By considering known samples that are similar in terms of activations across the DNN, we can get an idea of what performance to expect. The proximity to known samples can provide a level of certainty about the expected outcome. This idea relies on the assumption that the network is relatively smooth/stable locally.

Our research is concerned with automating processes found in slaughterhouses. The complexity of these processes and their non-stationary nature mean that models must continuously adapt and improve, or at least be able to predict when a given input is likely to result in a faulty prediction. The disruption and the monetary consequences of mistakes make it important for models to abstain and ask for guidance in cases where the outcome is unlikely to be acceptable. The purpose of this work is to detect out-of-distribution (OoD) samples for the tool pose predictor presented in [13]. This is used to judge whether predictions can be trusted and as a method for identifying the most valuable samples for extending the training set.

I-A Contributions

With this paper, we address the problem of imperfect ML models in critical real world applications by determining whether a given prediction can be trusted. This is done based on the distance between the sample and nearby known samples in the feature space of a Deep Neural Network. The contributions can be summarized as:

  1. Confidence measure for unseen samples based on distance to training set neighbours in feature space.

II Related Work

Trust or confidence in predictions is important when choosing to deploy a new model and when predictions result in critical actions. Local Interpretable Model-agnostic Explanations (LIME) is an explanation technique that learns to provide an interpretable explanation of the predictions of any classifier. LIME enables a human to give feedback on the "reasoning" behind a prediction and to suggest features to be removed from the model, leading to better generalization. The flexibility of the method is shown by applying it to random forests for text classification and neural networks for image classification. The idea is that global trust in the model is built by understanding and gaining trust in individual predictions covering the input space. Although the complete model might be difficult to explain, individual decisions are explainable by relying on a local neighborhood region. The method is time consuming, taking up to 10 minutes to explain predictions for an image classification task [15]. Alternatively, a small "loss prediction module" can be added to a target network. The module learns to predict the target losses of unlabeled inputs and thereby gains the ability to predict whether samples are likely to result in wrong predictions from the target model. The module is connected to several layers of the target model, thereby considering different levels of information for loss prediction. The loss of the target model constitutes the ground truth for the loss prediction module [17]. In another approach, a confidence score is computed from local density estimates using the data embedding from one of the final layers of the network. Although the method can be used for any DNN, it does require changes to the training procedure in order for the scores to be useful [9].


Low-confidence samples can be considered as being OoD. When such samples are detected, a system can either query for help [10] or abstain from predicting, as done for the detection of liver abnormalities using ultrasound images [5].

III Method

The presented method relies on the latent representations found in the layers of DNNs. The latent representations correspond to the network's activations at a given layer and contain some of the information originally expressed in the input. The representations can be considered as coordinates in a high-dimensional space. The deeper in the DNN a representation is found, the higher-level the concepts it describes. Representations can be extracted from any DNN and at any layer, but it is preferable to rely on a bottleneck or one of the final layers in order to lower the dimensionality of the latent space.

Here an autoencoder network architecture (see Figure 1) is used to learn latent representations of point clouds. The autoencoder is based on PointNet [14] and has been used for recognition tasks and shape editing [1]. Principal component analysis is used to reduce the dimensionality of the representations, primarily for visualization purposes. Only the first two principal components are used in the following plots.
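As a point of reference, the projection onto the first two principal components can be written as below. The notation ($z_i$ for the latent vector of sample $i$, $N$ for the number of samples, $W_2$ for the projection matrix) is introduced here for illustration and does not appear in the original text.

```latex
\bar{z} = \frac{1}{N}\sum_{i=1}^{N} z_i, \qquad
C = \frac{1}{N-1}\sum_{i=1}^{N} (z_i - \bar{z})(z_i - \bar{z})^{\top}, \qquad
y_i = W_2^{\top}(z_i - \bar{z}) \in \mathbb{R}^{2}
```

Here the columns of $W_2$ are the two unit eigenvectors of $C$ with the largest eigenvalues, so $y_i$ is the two-dimensional coordinate plotted for sample $i$.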

Fig. 1: Autoencoder architecture with input point cloud, latent space bottleneck, and reconstructed point cloud.
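To make the notion of a latent bottleneck concrete, the following is a minimal sketch of a PointNet-style encoder: one layer shared across all points, followed by max pooling over the points, which yields a fixed-length vector per point cloud regardless of point order. The function name `encode` and the hand-picked weights are hypothetical; the actual autoencoder uses learned multi-layer weights and a decoder, neither of which is reproduced here.

```python
from typing import List, Sequence

Point = Sequence[float]


def encode(
    points: Sequence[Point],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> List[float]:
    """Toy PointNet-style encoder (illustrative assumption, not the paper's model).

    Each latent feature is computed by applying the same linear map plus ReLU
    to every point, then taking the maximum over all points.
    """
    if not points:
        raise ValueError("point cloud must contain at least one point")
    latent = []
    for w_row, b in zip(weights, biases):
        # Shared per-point layer: identical weights are applied to every point.
        activations = [
            max(0.0, sum(w * x for w, x in zip(w_row, p)) + b) for p in points
        ]
        # Symmetric pooling: the result does not depend on point order.
        latent.append(max(activations))
    return latent


# Hypothetical weights for a 3-dimensional latent vector.
WEIGHTS = [[1, 0, 0], [0, 1, 0], [0, 0, -1]]
BIASES = [0, 0, 0.5]
cloud = [(1, 0, 0), (0, 2, 0), (0, 0, 3)]
latent_vector = encode(cloud, WEIGHTS, BIASES)
```

Because the pooling is a maximum over points, reordering `cloud` leaves `latent_vector` unchanged, which is the property that makes such a vector usable as a coordinate in feature space.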

III-A Measure of Similarity

The underlying intuition for comparing samples using their latent representations is that similar inputs elicit similar activations throughout the DNN. Figure 2 (a) shows the distribution of known samples, i.e., training samples (blue), and new unknown samples (green) based on their latent representations. This reveals that the two data sets are very similar.
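A minimal sketch of how such a similarity measure might be computed is given below. It assumes latent vectors are plain sequences of floats and that similarity is measured with the Euclidean distance; the text does not name the metric, so that choice, as well as the function names and the k-neighbor variant, are assumptions made for illustration.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def nearest_neighbor_distance(query: Vector, known: Sequence[Vector]) -> float:
    """Euclidean distance from `query` to its closest vector in `known`."""
    if not known:
        raise ValueError("need at least one known sample")
    return min(math.dist(query, z) for z in known)


def mean_knn_distance(query: Vector, known: Sequence[Vector], k: int = 3) -> float:
    """Mean Euclidean distance to the k closest known vectors.

    A smoother variant (an assumption, not from the paper) that is less
    sensitive to a single stray training sample.
    """
    if not known or k < 1:
        raise ValueError("need at least one known sample and k >= 1")
    closest = sorted(math.dist(query, z) for z in known)[:k]
    return sum(closest) / len(closest)


# Hypothetical 2-D latent vectors of training samples.
training_latents = [(0.0, 0.0), (3.0, 4.0), (10.0, 10.0)]
d_nn = nearest_neighbor_distance((3.0, 0.0), training_latents)
d_knn = mean_knn_distance((3.0, 0.0), training_latents, k=2)
```

A brute-force scan like this is linear in the size of the training set; for large training sets a spatial index would likely be preferable.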

III-B Distribution of Errors

It is interesting to investigate whether samples with large errors are found in specific areas of the latent space. In this case the error is the reconstruction loss of the autoencoder; for other tasks and network types, the error can be any other kind of loss or prediction error that can be quantified. Figure 2 (b) shows the reconstruction error for new samples and their placement in the feature space.
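The text does not state which reconstruction loss is used. As one common choice for point-cloud autoencoders, the sketch below computes a Chamfer distance between an input cloud and its reconstruction; treating this as the error measure is an assumption made for illustration, and the function names are hypothetical.

```python
import math
from typing import Sequence

Point = Sequence[float]


def _mean_min_sq_dist(source: Sequence[Point], target: Sequence[Point]) -> float:
    """Average, over `source`, of the squared distance to the closest `target` point."""
    return sum(min(math.dist(s, t) ** 2 for t in target) for s in source) / len(source)


def chamfer_distance(cloud_a: Sequence[Point], cloud_b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance between two non-empty point clouds."""
    if not cloud_a or not cloud_b:
        raise ValueError("both point clouds must be non-empty")
    return _mean_min_sq_dist(cloud_a, cloud_b) + _mean_min_sq_dist(cloud_b, cloud_a)


original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
perfect = chamfer_distance(original, original)
shifted = chamfer_distance([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)])
```

Any other quantifiable loss can be substituted for `chamfer_distance` without changing the rest of the approach, since only a scalar error per sample is needed.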

Fig. 2: (a) Known samples (blue) and new samples (green) based on their feature space representations. (b) Distribution of error rates for new samples.

III-C Error as a Function of Distance

It is practically impossible to make guarantees about the performance of DNNs on unseen samples. This is especially the case with limited amounts of training data or with OoD samples. Figure 3 shows the reconstruction error for new samples in relation to their distance from known samples. There is a clear trend: the error grows with the distance to the nearest neighbor. This means that a distance threshold can be used to flag samples as OoD.

Fig. 3: The reconstruction error in relation to the distance to nearest neighbor in the training set.
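The relationship in Figure 3 suggests a simple recipe, sketched below under stated assumptions: given a hypothetical calibration set of (distance, error) pairs for samples with known error, measure the correlation and pick the largest distance up to which every calibration sample stays within an acceptable error. The function names, the calibration values, and this particular thresholding rule are illustrative and are not taken from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    if spread == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / spread


def distance_threshold(
    distances: Sequence[float], errors: Sequence[float], max_error: float
) -> float:
    """Largest calibration distance below which no sample exceeds `max_error`.

    Walks the samples by increasing distance and stops at the first one whose
    error is too large. Returns 0.0 if the closest sample already fails.
    """
    threshold = 0.0
    # On equal distances, visit the larger error first so a tie cannot hide it.
    for d, e in sorted(zip(distances, errors), key=lambda p: (p[0], -p[1])):
        if e > max_error:
            break
        threshold = d
    return threshold


def is_out_of_distribution(distance: float, threshold: float) -> bool:
    """Flag a sample as OoD when it lies farther away than the threshold."""
    return distance > threshold


# Hypothetical calibration data: nearest-neighbor distance and observed error.
calib_distances = [0.8, 0.1, 0.4, 0.2]
calib_errors = [0.30, 0.01, 0.05, 0.02]
r = pearson(calib_distances, calib_errors)
t = distance_threshold(calib_distances, calib_errors, max_error=0.1)
```

With these made-up values the threshold lands on the last calibration distance before the error exceeds the limit; a stricter `max_error` moves the threshold closer to the training data and causes more abstentions.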

III-D Identifying Out-of-distribution Samples

For a system that is being deployed in a critical production environment, the threshold may be based on the severity of error that can be accepted. In a less critical environment or during the introduction of a new system, the threshold may be determined by the amount of human involvement that can be afforded. Figure 4 (a) and (b) show the gradual expansion of a training set using a threshold that is selected based on a given labeling budget.

Fig. 4: Training set distribution (blue), novel samples that should be added to the training distribution (green), samples that contain insufficient novelty to warrant adding to the training distribution (orange), and novel samples that are outside of the confidence threshold and should be abstained from as well as added to the training distribution (red). (a) First batch. (b) Second batch.
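The categories in Figure 4 can be sketched with two distance thresholds: a novelty threshold derived from a labeling budget and a confidence threshold derived from the acceptable error. The code below is one possible reading under those assumptions; the function names, category labels, and example values are hypothetical.

```python
import math
from typing import Sequence


def budget_threshold(distances: Sequence[float], budget: int) -> float:
    """Distance of the `budget`-th most distant sample in a batch.

    Samples at or beyond this distance are the ones selected for labeling.
    Returns infinity when nothing can be labeled.
    """
    if budget <= 0 or not distances:
        return math.inf
    ranked = sorted(distances, reverse=True)
    return ranked[min(budget, len(ranked)) - 1]


def categorize(
    distance: float, novelty_threshold: float, confidence_threshold: float
) -> str:
    """Map a nearest-neighbor distance to one of the Figure 4 categories."""
    if distance > confidence_threshold:
        return "abstain_and_label"  # red: prediction not trusted, sample is added
    if distance >= novelty_threshold:
        return "label"  # green: prediction trusted, sample is added
    return "skip"  # orange: too similar to existing training data


# Hypothetical batch of nearest-neighbor distances and a budget of 3 labels.
batch = [0.1, 0.5, 0.9, 0.3, 1.5]
novelty_t = budget_threshold(batch, budget=3)
labels = [categorize(d, novelty_t, confidence_threshold=1.0) for d in batch]
```

After the selected samples are labeled and added to the training set, the distances of the next batch are computed against the enlarged set, which corresponds to the step from panel (a) to panel (b).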

IV Discussion

By measuring the similarity between an unseen sample and known samples, it is possible to anticipate how the system will perform on the new sample. Knowing whether a given new sample lies in a sparse region of the training distribution, and whether performance in that region is acceptable, may be useful when deciding whether to act on a prediction. This knowledge can also be used to select the most critical new examples to be added to the training set. This tool has the potential to enable cheaper, earlier, and safer deployment of models. It is vital when deploying models in ever-changing environments.


This work was supported by Innovation Fund Denmark and the Danish Pig Levy Fund.


  • [1] P. Achlioptas, O. Diamanti, I. Mitliagkas, and L. J. Guibas (2017) Representation learning and adversarial generation of 3d point clouds. CoRR abs/1707.02392.
  • [2] M. A. Alcorn, Q. Li, Z. Gong, C. Wang, L. Mai, W. Ku, and A. Nguyen (2018) Strike (with) a pose: neural networks are easily fooled by strange poses of familiar objects. CoRR abs/1811.11553.
  • [3] C. J. Cai, E. Reif, N. Hegde, J. Hipp, B. Kim, D. Smilkov, M. Wattenberg, F. Viegas, G. S. Corrado, M. C. Stumpe, et al. (2019) Human-centered tools for coping with imperfect algorithms during medical decision-making. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 4.
  • [4] R. Elwell and R. Polikar (2011) Incremental learning of concept drift in nonstationary environments. IEEE Transactions on Neural Networks 22 (10), pp. 1517–1531.
  • [5] K. Hamid, A. Asif, W. A. Abbasi, D. Sabih, and F. ul Amir Afsar Minhas (2018) Machine learning with abstention for automated liver disease diagnosis. CoRR abs/1811.04463.
  • [6] M. Harries and K. Horn (1995) Detecting concept drift in financial time series prediction using symbolic machine learning. In AI-CONFERENCE-, pp. 91–98.
  • [7] A. Hosny, C. Parmar, J. Quackenbush, L. H. Schwartz, and H. J. Aerts (2018) Artificial intelligence in radiology. Nature Reviews Cancer 18 (8), pp. 500.
  • [8] C. Lin, D. Deng, C. Kuo, and L. Chen (2019) Concept drift detection and adaption in big imbalance industrial iot data using an ensemble learning method of offline classifiers. IEEE Access 7, pp. 56198–56207.
  • [9] A. Mandelbaum and D. Weinshall (2017) Distance-based confidence score for neural network classifiers. CoRR abs/1709.09844.
  • [10] B. J. Meyer and T. Drummond (2019) The importance of metric learning for robotic vision: open set recognition and active learning. CoRR abs/1902.10363.
  • [11] L. Nosrati and A. N. Pour (2011) Dynamic concept drift detection for spam email filtering. In Proceedings of ACEEE 2nd International Conference on Advances Information and Communication Technologies (ICT 2011), Vol. 2, pp. 124–126.
  • [12] M. Pechenizkiy, J. Bakker, I. Žliobaitė, A. Ivannikov, and T. Kärkkäinen (2010) Online mass flow prediction in cfb boilers with explicit detection of sudden concept drift. ACM SIGKDD Explorations Newsletter 11 (2), pp. 109–116.
  • [13] M. P. Philipsen and T. B. Moeslund (2019) Cutting pose prediction from point clouds. preprint.
  • [14] C. R. Qi, H. Su, K. Mo, and L. J. Guibas (2016) PointNet: deep learning on point sets for 3d classification and segmentation. CoRR abs/1612.00593.
  • [15] M. T. Ribeiro, S. Singh, and C. Guestrin (2016) "Why should I trust you?": explaining the predictions of any classifier. CoRR abs/1602.04938.
  • [16] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
  • [17] D. Yoo and I. S. Kweon (2019) Learning loss for active learning. CoRR abs/1905.03677.