Detecting Anomalous Inputs to DNN Classifiers By Joint Statistical Testing at the Layers
Detecting anomalous inputs, such as adversarial and out-of-distribution (OOD) inputs, is critical for classifiers deployed in real-world applications, especially deep neural network (DNN) classifiers that are known to be brittle on such inputs. We propose an unsupervised statistical testing framework for detecting such anomalous inputs to a trained DNN classifier based on its internal layer representations. By calculating test statistics at the input and intermediate-layer representations of the DNN, conditioned individually on the predicted class and on the true class of labeled training data, the method characterizes the class-conditional distributions of these statistics on natural inputs. Given a test input, its extent of non-conformity with respect to the training distribution is captured using p-values of the class-conditional test statistics across the layers, which are then combined using a scoring function designed to score high on anomalous inputs. We focus on adversarial inputs, which are an important class of anomalous inputs, and also demonstrate the effectiveness of our method on general OOD inputs. The proposed framework also provides an alternative class prediction that can be used to correct the DNN's prediction on (detected) adversarial inputs. Experiments on well-known image classification datasets with strong adversarial attacks, including a custom attack method that uses the internal layer representations of the DNN, demonstrate that our method outperforms or performs comparably with five state-of-the-art detection methods.
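To make the p-value combination concrete, the following is a minimal Python sketch (not the paper's implementation) of the general idea: compute empirical p-values for a test input's per-layer test statistics against the class-conditional statistics collected from natural training data, then aggregate them with one possible scoring function (Fisher's method) so that larger scores indicate more anomalous inputs. The per-layer test statistics themselves are assumed to be computed elsewhere; `train_stats` and `test_stats` are hypothetical inputs.

```python
import numpy as np

def empirical_pvalue(train_stat_values: np.ndarray, test_stat: float) -> float:
    """P-value = fraction of natural training statistics at least as extreme
    (here: as large) as the test input's statistic, with add-one smoothing."""
    n = train_stat_values.size
    return (1.0 + np.sum(train_stat_values >= test_stat)) / (n + 1.0)

def anomaly_score(train_stats: dict, test_stats: dict, predicted_class: int) -> float:
    """Combine per-layer p-values (conditioned on the predicted class) using
    Fisher's method: score = -2 * sum_l log p_l. Higher => more anomalous."""
    score = 0.0
    for layer, stat_value in test_stats.items():
        p = empirical_pvalue(train_stats[layer][predicted_class], stat_value)
        score += -2.0 * np.log(p)
    return score

# Toy usage with random "layer statistics" for a 2-class problem.
rng = np.random.default_rng(0)
train_stats = {l: {c: rng.exponential(1.0, size=500) for c in (0, 1)}
               for l in ("layer1", "layer2")}
test_stats = {"layer1": 6.0, "layer2": 5.5}  # unusually large statistics
print(anomaly_score(train_stats, test_stats, predicted_class=0))
```

An input whose layer statistics are unusual for its predicted class yields small p-values at many layers and hence a large combined score, which is the behavior the detection framework relies on; the actual scoring function, test statistics, and class conditioning used in the paper may differ from this sketch.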