1 Introduction
Classification algorithms are often trained with the goal of inferring samples as if they come from the training distribution. However, there is little guarantee that the training set forms an adequate support for the entire distribution. A common approach to addressing this is to augment the training set with artificial, anticipated transformations. However, this approach has many practical and theoretical problems, including increased training time, additional model parameters, and assumptions about the deployment conditions [6, 9, 10]. When faced with samples that are poorly represented in the training set, current state-of-the-art deep neural network architectures either require built-in mechanisms to detect these as out-of-distribution (OOD) samples or suffer erroneous classifications with high confidence [4, 13, 20]. Although lower layers can successfully report the presence of invariant features in a perturbed image, their locations and orientations may not be well-modelled by higher layers. Only the last layer directly contributes to the final prediction in most current architectures, creating a single point of failure. We hypothesize that classifications based on a consensus of predictions made from both lower and higher layers will be more robust to perturbations without the need of augmenting training data.
Convolutional neural networks (CNNs) and residual networks (ResNets) combine features locally using kernels of shared weights per layer and rely on increasingly abstract features with depth. ResNets use skip connections between blocks of convolutional layers to attain better performance with deeper architectures [5]. In both architectures, each layer produces a convolution block consisting of a number of channels. We interpret these blocks from a planar perspective such that each position on the plane contains a vector with one entry per channel. We refer to vectors from shallow layers as ‘lower level’ features and those from deeper layers as ‘higher level’ features.
Our proposed architecture summarizes low- and high-level features across deep networks and uses consensus between these summaries to make classifications. The benefits of DeepConsensus include:

Ease of attachment to a variety of existing architectures (Figure 2).
2 Related work
One way to achieve generalization is to model equivariant properties by representing changes in lower level features with similar changes in higher level features [3]. Another approach is to become invariant to these properties, which is necessary when data is scarce. There exist many architectures that possess equivariance or invariance towards specific, engineered properties. Rotation equivariant vector field networks apply filters oriented across a range of rotations [16]. Steerable filters have been successfully applied to CNNs to enable translational and rotational invariance simultaneously [15]. Spherical CNNs use kernels of shared weights on spherically projected signals and exhibit rotational equivariance [2]. Scale-invariant CNNs use multiple columns [22] or kernels [6] of convolutional layers that specialize at different magnifications. Transformation-invariant pooling uses Siamese CNNs that analyze two different transformations of the same object and selects the maximum outputs as the defining features of that class [10]. Group equivariant CNNs use special convolution filters that are capable of representing the equivariance of combinations of transformations. They perform well on the p4m group transformations, achieving excellent scores on several datasets [3]. DeepConsensus differs from these models in two distinct ways: firstly, it does not require preemptively augmenting the training set and secondly, it is not engineered towards any particular type of perturbation or disturbance.
Another line of related work is OOD detection, where models are trained to produce low confidence scores on samples that are not from the training distribution. DeVries and Taylor proposed using incorrectly predicted training examples to learn a secondary confidence score classifier [4]. Vyas et al. trained an ensemble of classifiers on different partitions of the training set to predict the other partition as OOD and used their agreement of OOD scores during evaluation [20]. Lee et al. used generative adversarial networks to learn edge examples of the training distribution to use for OOD classification [13]. ODIN exploits the observation that small perturbations have a greater effect on temperature-scaled softmax predictions for in-distribution samples than for OOD samples [14]. DeepConsensus does not aim to detect OOD samples explicitly, but seeks to correctly classify samples with similar features to those seen in the training set.
Prototypes are learnable representations in the form of one or more latent vectors per class. Comparing features to prototypes instead of forming predictions directly from the features has shown promise in robust classification in several works [19, 23]. We build on the idea by using prototypes of feature summaries for every layer rather than only the deepest layer.
Zero- and few-shot learning are meta-learning concepts that aim to construct new categories for objects that do not exist in the training set, given meta-information or few examples. A classic zero-shot example is learning to classify zebras given a training set containing horses and semantic descriptions of zebras in relation to horses. DeepConsensus is different in that it seeks to recognize unfamiliar objects as having similarities to classes learned during training without the need of extra information.
3 Architecture
The goal of DeepConsensus is to summarize outputs from each layer of a deep network such as a CNN or ResNet, then compare the summaries to prototype summaries for each class, and finally find a consensus among these results for the prediction. Figure 1 gives a graphical overview of the architecture.
3.1 Summarization
In DeepConsensus, the summary operation of layer $l$ is defined as:
$$s_l = \sum_{i} \sum_{j} f(x_{l,i,j};\, \theta_l) \qquad (1)$$
where $x_{l,i,j}$ represents the channel vector at row $i$ and column $j$ in the convolutional block of layer $l$, and $f$ is a nonlinear function with learnable parameters $\theta_l$.
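As a rough illustration of the summary operation, the sketch below applies a nonlinear function to every channel vector of one layer's block and sums the results over all positions. The function names, the leaky ReLU slope of 0.01, and the toy block are assumptions made for illustration and are not taken from the paper; the choice of a linear layer followed by leaky ReLU follows Section 3.4.

```python
def leaky_relu(v, slope=0.01):
    """Elementwise leaky ReLU; the slope value is an assumption."""
    return [x if x > 0 else slope * x for x in v]


def transform(x, weight, bias):
    """Linear layer with a square weight matrix followed by leaky ReLU."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b
           for row, b in zip(weight, bias)]
    return leaky_relu(pre)


def summarize(block, weight, bias):
    """Sum the transformed channel vectors over all rows and columns.

    block[i][j] is the channel vector at row i, column j of one layer.
    """
    summary = [0.0] * len(weight)
    for row in block:
        for x in row:
            fx = transform(x, weight, bias)
            summary = [s + f for s, f in zip(summary, fx)]
    return summary


# Toy 2x2 block with 2 channels and an identity weight matrix (hypothetical).
block = [[[1.0, -1.0], [2.0, 0.0]],
         [[0.0, 3.0], [1.0, 1.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
summary = summarize(block, identity, [0.0, 0.0])

# Moving the same vectors to other positions leaves the summary unchanged.
shifted = [[block[1][1], block[1][0]],
           [block[0][1], block[0][0]]]
shifted_summary = summarize(shifted, identity, [0.0, 0.0])
```

Because the sum runs over all positions, the summary does not depend on where a feature appears in the block, which is one way to read the translation behaviour reported in the experiments.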
3.2 Prototype alignment
Summary vectors $s_l$ are then compared to learned, layer-specific prototypes $P_l = \{p_{l,1}, \dots, p_{l,K+1}\}$ using distance function $d$:
$$d(s_l, P_l) = \left[\, \delta(s_l, p_{l,1}), \dots, \delta(s_l, p_{l,K+1}) \,\right] \qquad (2)$$
where $K$ denotes the number of classes and $\delta$ denotes some distance metric. The extra prototype $p_{l,K+1}$ allows layers to opt out if the summary vector does not match any class prototype. We choose cosine similarity for the distance metric instead of Euclidean distance [19, 23] because it is the most robust to the perturbations we tested (Figure 15).
3.3 Consensus
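The consensus of predictions made by participating layers forms the final classification. For CNN or ResNet layer outputs $x_1, \dots, x_L$, DeepConsensus can be succinctly represented as the classification function $F$:
$$F(x_1, \dots, x_L) = \arg\max_{k \in \{1, \dots, K\}} \sum_{l=1}^{L} w_l \, d(s_l, P_l)_k \qquad (3)$$
where $s_l$ is the summary of layer $l$, $d(s_l, P_l)_k$ is its comparison score against the prototype of class $k$, and $w_l$ weighs the contribution of layer $l$ to the final prediction. Since conventional architectures use only the highest level features for classification, they can be expressed in the same form with weights:
$$w_l = \begin{cases} 1 & \text{if } l = L \\ 0 & \text{otherwise} \end{cases}$$
and the distance metric is dot product. In DeepConsensus, more than one layer makes a nonzero contribution to the final prediction and the distance metric is cosine similarity.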
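The prototype comparison and the consensus step can be sketched together as follows. This is a minimal sketch under stated assumptions: the function names and toy numbers are hypothetical, the opt-out prototype is omitted for brevity, and the final class is taken as the one with the highest weighted sum of cosine similarities.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (0.0 if either is all zeros)."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def layer_scores(summary, prototypes):
    """Compare one layer summary to each class prototype of that layer."""
    return [cosine_similarity(summary, p) for p in prototypes]


def consensus(summaries, prototypes_per_layer, weights):
    """Weighted sum of per-layer scores, then pick the best class."""
    num_classes = len(prototypes_per_layer[0])
    total = [0.0] * num_classes
    for s, protos, w in zip(summaries, prototypes_per_layer, weights):
        for k, score in enumerate(layer_scores(s, protos)):
            total[k] += w * score
    return max(range(num_classes), key=lambda k: total[k])


# Two layers, two classes, two channels per layer (made-up values).
summaries = [[1.0, 0.1], [0.2, 1.0]]
prototypes = [[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]]

all_layers = consensus(summaries, prototypes, weights=[1.0, 1.0])
last_layer_only = consensus(summaries, prototypes, weights=[0.0, 1.0])
```

With these toy values, weighting both layers equally selects class 0, while putting all weight on the last layer (the conventional setting) selects class 1, which shows how the weights change the outcome.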
3.4 Other architectural and training details
The CNN and ResNet architectures used in the experiments are detailed in Figure 2. These networks are intended to demonstrate that DeepConsensus improves the robustness of a variety of architectures as opposed to being tuned towards any particular one. All weights are initialized randomly. Max-pooling is used instead of convolutions with stride 2 because it improves performance on perturbed test sets after training on unaltered training sets. These networks form the base network for DeepConsensus. We choose the nonlinear function of the summary operation to be a single linear layer with a square weight matrix, followed by leaky ReLU. Therefore, the prototypes of a particular layer have the same length as the number of channels of that layer. With this configuration, since DeepConsensus does not use terminal fully connected layers, the number of parameters of each network and its DeepConsensus version are practically equal.
The training regime is simple and standard, consisting of partitioning part of the training set for validation, using cross entropy loss with Adam optimization [7], and reducing the learning rate upon validation score plateau with factor 0.1 and patience 3. Since validation scores agree with unperturbed test scores and neither improves by more than 1% of the total accuracy after 15 epochs of training, we choose 30 epochs as the termination point at which actual test scores are taken.
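The learning rate schedule described above can be illustrated with a small stand-in scheduler. This is a simplified sketch rather than the scheduler of any particular library: the class name, the initial learning rate of 1e-3, the validation scores, and the rule that the rate is cut once the score has failed to improve for more than `patience` consecutive epochs are all assumptions.

```python
class PlateauScheduler:
    """Cut the learning rate when the validation score stops improving."""

    def __init__(self, lr, factor=0.1, patience=3):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_score):
        """Record one epoch's validation score and return the new rate."""
        if val_score > self.best:
            self.best = val_score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


# Made-up validation scores: no improvement after the second epoch.
scores = [0.90, 0.95, 0.95, 0.94, 0.95, 0.95]
scheduler = PlateauScheduler(lr=1e-3)
rates = [scheduler.step(s) for s in scores]
```

In this toy run the rate stays at its initial value for five epochs and is multiplied by 0.1 on the sixth, after four epochs without improvement.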
4 Experiments
Gaussian noise addition with standard deviation of 30 and Gaussian blur with standard deviation of 1.5.
We train and validate the base CNN, ResNet, and their corresponding DeepConsensus augmentations on the standard training samples of MNIST [12], EMNIST with a balanced split of 47 classes [1], Fashion-MNIST [21] and CIFAR-10 [8]. To accommodate the spatial perturbations on the test set, both training and test samples are placed in the center of a black background. The training set is not perturbed, while the test samples are subject to increasing levels of translation, magnification (using nearest-neighbor interpolation), addition of Gaussian noise, and blurring. Examples of the MNIST training and testing sets are shown in Figure 9 and results are shown in Figure 11 and Supplementary Figure 1. Figure 12 demonstrates similar improvements with the ResNet architecture. DeepConsensus does equally well on the unperturbed test set as its base network, but also develops invariance to translation and moderate resistance against the other conditions even without training set augmentation. Table 2 shows that the high variance in scores on heavily perturbed test sets is mainly due to parameter initialization.
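To make the test-time translation concrete, the following sketch pastes an image at the center of a black canvas and optionally shifts it. The function name, the canvas size, and the choice to drop pixels that leave the canvas are assumptions for illustration; the canvas size used in the experiments is not recoverable from the text.

```python
def place_on_canvas(image, size, dx=0, dy=0):
    """Paste `image` at the center of a size x size black canvas.

    The image is shifted by dx columns and dy rows; pixels that fall
    outside the canvas are dropped.
    """
    h, w = len(image), len(image[0])
    canvas = [[0] * size for _ in range(size)]
    top = (size - h) // 2 + dy
    left = (size - w) // 2 + dx
    for i in range(h):
        for j in range(w):
            r, c = top + i, left + j
            if 0 <= r < size and 0 <= c < size:
                canvas[r][c] = image[i][j]
    return canvas


digit = [[1, 2],
         [3, 4]]                                  # stand-in for an image
centered = place_on_canvas(digit, size=6)         # training-style sample
translated = place_on_canvas(digit, size=6, dx=2) # shifted test sample
clipped = place_on_canvas(digit, size=6, dx=3)    # partly off the canvas
```

A training sample corresponds to `dx = dy = 0`, and increasing the offsets gives increasing levels of translation on the test set.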
Model  Test score 

Base CNN  
DeepConsensus 
Test accuracy on MNIST quadrants, where the spatial position of the digit also determines its class. The hyperparameters of these models are held constant from the perturbation study and each experiment is repeated 3 times with random initializations. Despite being invariant to translation, DeepConsensus can still do well on spatial tasks.
Condition  T-statistic  P-value 

Translation  
Magnification  
Noise  
Blur 
T-statistics and P-values are calculated using two-tailed independent t-tests with unequal variance.
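For reference, the test statistic of an independent t-test with unequal variance (Welch's t-test) can be computed as in the sketch below. The function name and the two groups of scores are made up for illustration and are not results from the paper; the two-tailed P-value additionally requires the Student t distribution, which is omitted here.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = variance(a) / len(a)   # squared standard error of group a
    vb = variance(b) / len(b)   # squared standard error of group b
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up accuracies from repeated runs of two hypothetical models.
group_a = [0.90, 0.92, 0.94]
group_b = [0.60, 0.70, 0.80]
t_stat, dof = welch_t(group_a, group_b)
```

The degrees of freedom fall between the smaller group size minus one and the pooled value, reflecting that the two groups are not assumed to share a variance.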
Model  Random  Fixed  Decrease 

Base CNN  69.3%  
DeepConsensus  89.3% 
Since DeepConsensus appears to be immune to large translation perturbations, we questioned whether it is capable of classification tasks that depend on spatial positioning. We synthesized a 40-class MNIST dataset where we placed the original image randomly in one of the four quadrants of a black image and maintained the same training set (60k) and test set (10k) size. Each digit is mapped to 4 different classes corresponding to its quadrant location (see Figure 10). Table 1 shows that DeepConsensus achieves the same test score as its base network, demonstrating sensitivity to spatial positioning despite being normally invariant to translation. Figure 13 shows the average contribution of each layer on the previous experiments and this dataset.
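The construction of the quadrant dataset can be sketched as follows. The label encoding `quadrant * 10 + digit`, the quadrant numbering, the canvas being exactly twice the image size, and the function names are assumptions for illustration; the text only states that each digit maps to four classes according to its quadrant.

```python
import random


def quadrant_label(digit, quadrant, num_digits=10):
    """Map (digit, quadrant) to one of 4 * num_digits classes."""
    return quadrant * num_digits + digit


def make_quadrant_sample(image, digit, rng):
    """Place `image` in a random quadrant of a black image twice its size."""
    h, w = len(image), len(image[0])
    # 0: top-left, 1: top-right, 2: bottom-left, 3: bottom-right
    quadrant = rng.randrange(4)
    top, left = (quadrant // 2) * h, (quadrant % 2) * w
    canvas = [[0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            canvas[top + i][left + j] = image[i][j]
    return canvas, quadrant_label(digit, quadrant)


rng = random.Random(0)
canvas, label = make_quadrant_sample([[5]], digit=7, rng=rng)
all_labels = {quadrant_label(d, q) for d in range(10) for q in range(4)}
```

Under this encoding the digit is recovered as `label % 10` and the quadrant as `label // 10`, so a classifier must use spatial position to separate the four classes of each digit.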
Conventional CNNs and ResNets are invariant to translation locally and achieve equivariance to global translation only if the training set is perturbed in a similar way to the test set, as demonstrated by the translation condition in Figure 11. As Figure 13 shows, DeepConsensus uses consensus to exploit the soft prediction scores from each layer, such that overall accuracy is greater than that of any one particular layer. On the other hand, conventional networks use higher-level features exclusively, which are spatially-weighted combinations of lower level features and are not well-modelled for perturbed inputs, causing low accuracy on such tasks.
We compare DeepConsensus to the state-of-the-art architecture p4m-CNN [3] on the standard MNIST-rot dataset, where both the 12K training and 50K test samples are rotated randomly about the center axis [11]. Furthermore, we compare both models on variations of MNIST where the training set is unaltered but the test set is perturbed. We demonstrate that DeepConsensus performs similarly to p4m-CNN at the perturbation tasks p4m-CNN is specialized for, and significantly outperforms it in translation (Figure 14).
We perform ablation studies on DeepConsensus (Figure 15). Best results are achieved when using a nonlinear transformation on features before summarization, and using cosine similarity as the metric for comparison with prototypes.
Having observed that DeepConsensus is robust against large perturbations, we subject DeepConsensus-ResNet to adversarial examples to observe its robustness against small perturbations. After training on MNIST and SVHN [18] for 30 epochs, both the base ResNet and the DeepConsensus-ResNet version are analyzed with DeepFool, which attempts to find the smallest perturbation on the input that forces the model to change its classification output [17]. The adversarial robustness metric, or perturbation density, is
$$\hat{\rho}_{adv}(f) = \frac{1}{|\mathcal{D}|} \sum_{x \in \mathcal{D}} \frac{\|\hat{r}(x)\|_2}{\|x\|_2} \qquad (4)$$
where $f$ is the classifier, $\mathcal{D}$ is the test set, and $\hat{r}(x)$ is the minimal perturbation found by DeepFool [17]. Perturbation densities are significantly higher with DeepConsensus (Table 3), showing that DeepConsensus is robust against the small, targeted perturbations of DeepFool. Figure 20 compares sample adversarial examples for DeepConsensus-ResNet versus the base ResNet on MNIST and SVHN.
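Given perturbations that have already been found, the perturbation density reduces to an average of norm ratios, as in the sketch below. The function names and toy vectors are hypothetical, and the search for minimal perturbations (DeepFool itself) is not implemented here.

```python
import math


def l2_norm(v):
    """Euclidean norm of a flat vector."""
    return math.sqrt(sum(x * x for x in v))


def perturbation_density(inputs, perturbations):
    """Mean ratio of perturbation norm to input norm over a test set."""
    ratios = [l2_norm(r) / l2_norm(x) for x, r in zip(inputs, perturbations)]
    return sum(ratios) / len(ratios)


# Two flattened toy inputs and the perturbation assumed to fool the model.
inputs = [[3.0, 4.0], [0.0, 5.0]]
perturbations = [[0.3, 0.4], [0.0, 1.0]]
density = perturbation_density(inputs, perturbations)
```

A higher value means that, on average, a larger change relative to the input is needed before the classification flips.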
Dataset  ResNet  DeepConsensus  Increase 

MNIST  1500%  
SVHN  700% 
5 Discussion
In the experiments, we show that DeepConsensus improves the robustness of different architectures against multiple types of perturbation. Another major strength of DeepConsensus is the ability to become sensitive to spatial properties despite normally being invariant to them (e.g., Table 1). This is because it uses high-level features when they agree with lower level features, or takes a consensus of partially correct predictions when their first choice predictions (FCP) do not align. As Figure 13 shows, when trained on MNIST quadrants, the features of the last three layers are well aligned and correct. Therefore, the FCP of the sum of their predictions is equal to that of any one of their predictions. In contrast, for EMNIST with perturbations on the test set, the higher accuracy exhibited by consensus is similar in mechanism to an ensemble of classifiers: although the FCP for any one particular layer may be incorrect, the FCP of their sum is more likely to be correct. It is precisely the agreement of soft prediction scores that allows DeepConsensus to become sensitive, or invariant, to spatial positioning.
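The ensemble-like effect can be seen in a small worked example. The scores below are made up, with class 0 taken as the true class, and the function names are hypothetical.

```python
def first_choice(scores):
    """Index of the highest score (the first choice prediction)."""
    return max(range(len(scores)), key=lambda k: scores[k])


def summed_first_choice(per_layer_scores):
    """First choice prediction of the elementwise sum of layer scores."""
    totals = [sum(column) for column in zip(*per_layer_scores)]
    return first_choice(totals)


# Soft scores of two layers over three classes; the true class is 0.
layer_a = [0.40, 0.45, 0.15]   # first choice is class 1 (wrong)
layer_b = [0.40, 0.15, 0.45]   # first choice is class 2 (wrong)
combined = summed_first_choice([layer_a, layer_b])
```

Each layer alone picks a wrong class, but both give the true class a high second-place score, so the summed scores favour class 0.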
We designed DeepConsensus with the intention of achieving magnification invariance. Computing the cosine similarity between summaries of features and prototypes is equivalent to checking if the two vectors have some approximate scalar relation. Effectively, this operation finds the class whose prototype contains a similar ratio of features to the input. We originally believed that magnification maintains the ratio of features, but two observations falsify this belief: increasing the image size decreases the border width in a non-proportional way, and features are weighted differently from one another. We tried applying various bounded activation functions to mitigate these effects, but were not successful. The network assigns different weightings to different features regardless of activation expressiveness, preventing ratio-based prototype comparisons from becoming fully invariant to magnification.
The ablation studies (Figure 15) suggest cosine similarity to be the most robust for comparing layer summaries to class prototypes. Using Euclidean distance for the comparison metric does not perform well for magnification or Gaussian noise addition, which have different ratios of features compared to the other perturbation types. This is consistent with the optimal solution of Euclidean-based prototype matching, which is to find a class prototype that matches the sample vector exactly. On the other hand, cosine similarity matches only the ratio of features, so it is a less rigid matching function that is more suitable for robust classification. Using a linear layer to classify the summary vectors leads to poor performance across all four perturbations, suggesting that fully connected layers of classifiers should be replaced with prototype-based comparisons to improve general robustness.
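The difference between the two metrics can be checked numerically. In this sketch, with made-up vectors and hypothetical function names, a summary is scaled by a constant so that its ratio of features is unchanged.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


prototype = [2.0, 1.0, 0.0]
summary = [2.0, 1.0, 0.0]
scaled = [3.0 * s for s in summary]   # same feature ratio, three times larger

cos_scaled = cosine_similarity(scaled, prototype)
dist_exact = euclidean_distance(summary, prototype)
dist_scaled = euclidean_distance(scaled, prototype)
```

The cosine similarity of the scaled summary to the prototype remains at its maximum, whereas the Euclidean distance grows from zero, so a Euclidean match is lost as soon as the overall magnitude changes.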
The perturbation densities of adversarial examples found by DeepFool are higher for models augmented with DeepConsensus (Table 3). Although the intention of this experiment is to demonstrate that DeepConsensus is resistant to small perturbations and we do not claim DeepConsensus to be robust against adversarial attacks, it illuminates a potential direction in adversarial defence. Instead of finding ways of preventing adversarial examples from being found, which may not be possible, we should aim to develop models that can directly consider higher and lower level features. Simply depending on higher level features creates a single point of failure at the deepest layer, whereas fooling all layers of an entire network may result in larger and more obvious adversarial perturbations.
5.1 Future work
Closely related to the idea of few-shot learning, a practical application of DeepConsensus is to take a small, supervised batch of the test distribution to determine the accuracy of each layer. These accuracies can then be used to weight the contribution of each layer, allowing the model to dynamically adapt to different deployment conditions without the need of retraining.
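One possible reading of this idea is sketched below: per-layer accuracies are measured on a small labeled batch and then used directly as layer weights. All names and numbers are hypothetical, and this adaptation scheme is proposed future work rather than something evaluated in the paper.

```python
def layer_accuracies(per_layer_scores, labels):
    """Accuracy of each layer's first choice on a small labeled batch.

    per_layer_scores[l][n] holds the class scores of layer l for sample n.
    """
    accuracies = []
    for layer in per_layer_scores:
        correct = sum(
            max(range(len(s)), key=lambda k: s[k]) == y
            for s, y in zip(layer, labels)
        )
        accuracies.append(correct / len(labels))
    return accuracies


def weighted_prediction(sample_scores, weights):
    """Class with the highest weighted sum of layer scores."""
    num_classes = len(sample_scores[0])
    totals = [sum(w * s[k] for w, s in zip(weights, sample_scores))
              for k in range(num_classes)]
    return max(range(num_classes), key=lambda k: totals[k])


# Labeled batch of two samples, two layers, two classes (made-up scores).
batch_scores = [
    [[0.9, 0.1], [0.2, 0.8]],   # layer 0: both samples correct
    [[0.3, 0.7], [0.1, 0.9]],   # layer 1: first sample wrong
]
batch_labels = [0, 1]
weights = layer_accuracies(batch_scores, batch_labels)

# A new sample on which the two layers disagree.
new_sample = [[0.7, 0.3], [0.2, 0.8]]
adapted = weighted_prediction(new_sample, weights)
unweighted = weighted_prediction(new_sample, [1.0, 1.0])
```

In this toy case the less accurate layer is down-weighted, which changes the prediction for the new sample from class 1 to class 0 without any retraining.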
Follow-up work includes showing changes in performance on more drastic perturbations, such as broad nonlinear rasterizations and obstruction of distinguishing characteristics. Additionally, further research is needed on how DeepConsensus can:

be adapted to work with large, pretrained models, like the Inception versions or deep ResNet networks, where consensus is built along their layers,

augment existing invariant models, such as improving translational invariance of CNNs, and

attain generalization when training data is scarce.
We also wish to find a better summarization approach than simply summing features. We originally tried a more sophisticated approach where the contributions of layers depended on the agreement between their predictions and the predictions of their neighboring layers. This was not successful because deeper layers tended to have strong agreement regardless of their accuracy, producing similar behavior to the base network on perturbed samples.
6 Conclusion
We expose weaknesses of various models on classification of perturbed samples when trained on standard training sets. We propose augmenting existing models with the DeepConsensus architecture to improve their resistance to a variety of perturbations. DeepConsensus is not expensive in terms of parameters and does not require any preemptive training set augmentation. We also show that it is amenable to classification tasks that are sensitive to spatial positioning, and that each of its components is necessary and effective at improving general robustness. In addition, we demonstrate the resistance of DeepConsensus against the small, targeted perturbations of DeepFool.
7 Acknowledgements
Special thanks to Kamyar Ghasemipour, who provided important suggestions and helpful discussions.
References
 [1] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik. Emnist: an extension of mnist to handwritten letters. arXiv:1702.05373, 2017.
 [2] T. S. Cohen, M. Geiger, J. Koehler, and M. Welling. Spherical cnns. ICLR, 2018.
 [3] T. S. Cohen and M. Welling. Group equivariant convolutional networks. ICML, 2016.
 [4] T. DeVries and G. W. Taylor. Learning confidence for out-of-distribution detection in neural networks. arXiv:1802.04865, 2018.
 [5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, 2016.
 [6] A. Kanazawa, A. Sharma, and D. Jacobs. Locally scale-invariant convolutional neural networks. NIPS, 2014.
 [7] D. P. Kingma and J. Ba. Adam: a method for stochastic optimization. ICLR, 2015.
 [8] A. Krizhevsky. Learning multiple layers of features from tiny images. 2009.
 [9] D. Laptev and J. M. Buhmann. Transformation-invariant convolutional jungles. CVPR, 2015.
 [10] D. Laptev, N. Savinov, J. M. Buhmann, and M. Pollefeys. TI-pooling: transformation-invariant pooling for feature learning in convolutional neural networks. arXiv:1604.06318, 2016.
 [11] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. ICML, 2007.
 [12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
 [13] K. Lee, H. Lee, K. Lee, and J. Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. ICLR, 2018.
 [14] S. Liang, Y. Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. ICLR, 2018.
 [15] M. Weiler, F. A. Hamprecht, and M. Storath. Learning steerable filters for rotation equivariant cnns. CVPR, 2018.
 [16] D. Marcos, M. Volpi, N. Komodakis, and D. Tuia. Rotation equivariant vector field networks. ICCV, 2017.
 [17] S. M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. Deepfool: a simple and accurate method to fool deep neural networks. CVPR, 2016.
 [18] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. NIPS, 2011.
 [19] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. arXiv:1703.05175, 2017.
 [20] A. Vyas, N. Jammalamadaka, X. Zhu, D. Das, B. Kaul, and T. L. Willke. Out-of-distribution detection using an ensemble of self supervised leave-out classifiers. ECCV, 2018.

 [21] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. 2017.
 [22] Y. Xu, T. Xiao, J. Zhang, K. Yang, and Z. Zhang. Scale-invariant convolutional neural network. CVPR, 2015.
 [23] H. Yang, X. Zhang, F. Yin, and C. Liu. Robust classification with convolutional prototype learning. CVPR, 2018.