1 Introduction
Neural networks have become more popular than ever in addressing real-world computer vision (CV) [2, 10] and natural language processing (NLP) [16, 1] problems. With supervised training strategies, deep neural networks are able to achieve human-level performance on various tasks like image classification [2, 10], reading comprehension [21] and machine translation [5]. However, deep neural networks are also known to offer little control over their output distribution under unseen scenarios, which can cause very concerning problems when applied to real-world applications. Even worse, such models are prone to adversarial attacks and raise concerns in AI safety [11, 17]. In order to resolve this concerning issue and propose more robust deep learning frameworks, there has recently been rising interest in studying the out-of-distribution detection problem
[7], a high-dimensional and large-scale anomaly/novelty detection problem, which aims at separating in-distribution (IND) real images from out-of-distribution (OOD) real images.
This detection problem has drawn much attention from the community and many research studies have been published [3, 12, 13, 22, 15, 14]. These methodologies are quite versatile in terms of their settings and assumptions, so we first taxonomize the out-of-distribution detection algorithms and position the previous research works in Figure 1 with respect to their complexity and knowledge. The vertical axis considers the complexity at the model level, and the horizontal axis denotes the knowledge required for training the detection algorithm:

Retraining: the method introduces a new loss function and requires the neural network to be retrained.

White-Box: no retraining is needed, but the detailed network architecture is known.

Black-Box: no retraining is needed and the network architecture is hidden.

OOD Val: requires an out-of-distribution validation set.

IND Train: requires the in-distribution training set.

IND Val: requires an in-distribution validation set.
In real-world applications, especially for large-scale and distributed AI systems, the retraining assumption is too strong and hardly practical due to the introduced complexity. The white-box assumption is stronger than the black-box assumption because knowing the network architecture makes protection easier. The black-box model is more practical since many real-world AI systems only provide interfaces to their output while keeping the low-level architecture unexposed. In terms of data knowledge, OOD Val is the strongest assumption in practice because it is unrealistic to know the source of out-of-distribution images in advance. IND Train is still a strong assumption in real-world applications, especially with large-scale distributed AI systems. In comparison, IND Val is much more practical since only a small-scale held-out dataset is required. Overall, with increasing knowledge (horizontal) and complexity (vertical), the approaches become less and less practically useful in real-world applications due to the increasing implementation difficulty and decreasing knowledge availability.
The Baseline [7] has the least application difficulty as it only requires the highest softmax score to represent the model's confidence. Mahalanobis [13] assumes a white-box model and proposes to use its low-level features to compute Mahalanobis distances as evidence to train an ensemble discriminator for out-of-distribution detection. ODIN [14], though it requires no knowledge about the model architecture, uses an out-of-distribution validation set to fine-tune its hyperparameters (temperature and perturbation magnitude). Finally, Confidence Learning [3], Semantic Label [22], Adversarial Training [12] and Deep Prior Network (DPN) [15] have the strongest assumptions: they require retraining the model on the training set with newly proposed loss functions.
In this paper, we propose a very practical method whose application difficulty is only slightly above that of the Baseline [7]. We assume the classification model is a black box and that only small-scale held-out in-domain validation data exists. Such weak assumptions make our method well suited for real-world applications. In a nutshell, our proposed method is built on the Dirichlet prior network [15], which we first propose to degenerate into a softmax-based neural network. Thus, we can directly rely on pretrained neural networks to compute the predictive uncertainty measure. To further tackle the overconfidence issue of pretrained classifiers, which compromises detection accuracy, we next design a concentration perturbation algorithm that enhances the robustness of the established uncertainty measure by learning an adaptive perturbation function to better separate in- and out-of-distribution images. We illustrate our methodology in Figure 2, where we first add an EXP operator to the degenerated Dirichlet prior network (the pretrained classifier) to compute the Dirichlet concentration parameters α, which are fed into a perturbation function to generate a noise ε. We add the noise to the original prior network to increase the robustness of the established uncertainty measure. A threshold-based detector is then used to tell whether an input image is from in- or out-of-distribution. Our main contributions are listed below:

We degenerate the Dirichlet prior network into a softmax-based classification model and directly use a pretrained classifier to estimate the uncertainty measure.

We are the first to propose a concentration perturbation algorithm and use it to greatly enhance the current predictive uncertainty measure.

Our proposed method is able to achieve or approach state-of-the-art results across different datasets and architectures.
2 Background
Here we particularly consider the image classification problem, where x corresponds to the images and y corresponds to the object labels. Given the training data D = {(x_i, y_i)}, a Bayesian framework depicts the predictive uncertainty P(y | x*, D) over an unseen image x* as follows:

P(y \mid x^*, \mathcal{D}) = \int P(y \mid x^*, \theta)\, p(\theta \mid \mathcal{D})\, d\theta \qquad (1)
where data (aleatoric) uncertainty is described by the label-level posterior P(y | x*, θ) and model (epistemic) uncertainty is described by the model-level posterior p(θ | D). The integral in Equation 1 is intractable in deep neural networks, thus a Monte-Carlo sampling algorithm is used to approximate it as follows:

P(y \mid x^*, \mathcal{D}) \approx \frac{1}{M} \sum_{m=1}^{M} P(y \mid x^*, \theta^{(m)}), \qquad \theta^{(m)} \sim p(\theta \mid \mathcal{D})

where each P(y | x*, θ^(m)) is a categorical distribution μ^(m) over the simplex with Σ_k μ_k^(m) = 1. This ensemble {μ^(m)} is a collection of points on the simplex, as depicted in Figure 3, which can be viewed as an implicit distribution induced by the posterior p(θ | D) over the model parameters θ.
Given an ensemble from such an implicit distribution, the entropy of the expected distribution indicates uncertainty in the model prediction, but it is impossible to determine from such a predictive distribution whether the uncertainty stems from heavy class overlap or from the input lying far from the training data space. Though [4] has proposed measures like mutual information to determine the uncertainty source, this is very hard in practice due to the difficulty of selecting an appropriate prior distribution and the expensive computation needed for Monte-Carlo estimation in deep neural networks.
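To make this decomposition concrete, the total uncertainty (entropy of the expected distribution) and the mutual information can be estimated directly from an ensemble of categorical predictions. A minimal numpy sketch (function and variable names are our own, not from the paper):

```python
import numpy as np

def ensemble_uncertainty(probs):
    """Decompose predictive uncertainty for an ensemble of M categorical
    distributions over K classes (probs has shape (M, K)).

    Returns (total, expected_data, mutual_info), where
    total = H[mean distribution] and mutual_info = total - mean member entropy.
    """
    eps = 1e-12                                   # avoids log(0)
    mean = probs.mean(axis=0)                     # expected distribution
    total = -np.sum(mean * np.log(mean + eps))    # entropy of the mean
    member = -np.sum(probs * np.log(probs + eps), axis=1)
    expected_data = member.mean()                 # expected data uncertainty
    return total, expected_data, total - expected_data
```

For an ensemble whose members agree, the mutual information vanishes, while disagreement between members (the out-of-distribution signature) drives it up.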
In order to explicitly separate these two sources of uncertainty, the prior network [15] was proposed to explicitly parameterize model-level uncertainty with a distribution over distributions p(μ | x*, θ) on the simplex. In [15], they propose to reformulate the predictive posterior as follows:

P(y \mid x^*, \mathcal{D}) \approx \int P(y \mid \mu)\, p(\mu \mid x^*, \hat{\theta})\, d\mu \qquad (2)
where the model uncertainty is collapsed into p(μ | x*, θ̂) by using a point estimate p(θ | D) ≈ δ(θ − θ̂). Ideally, the prior network should yield a sharp distribution at a corner of the simplex when the network is confident about its prediction (known-known). For noisy input data with heavy class overlap (data uncertainty), the network should yield a sharp distribution at the center of the simplex (known-unknown). For out-of-domain images, the prior network is supposed to yield a flat distribution over the simplex (unknown-unknown). That is to say, the parameters θ should encapsulate the knowledge about the boundary which separates in-distribution from out-of-distribution data. Such a prior network is realized by a Dirichlet distribution in practice due to its tractable statistical properties, and the probability density function (PDF) of the Dirichlet prior over all possible values of the K-dimensional categorical distribution μ is written as:

\mathrm{Dir}(\mu \mid \alpha) = \frac{\Gamma(\alpha_0)}{\prod_{k=1}^{K}\Gamma(\alpha_k)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1} \qquad (3)
where α = (α_1, …, α_K) with α_k > 0 is the concentration parameter of the Dirichlet distribution, α_0 = Σ_k α_k, and Γ(α_0) / Π_k Γ(α_k) is the normalization factor. In practice, the Dirichlet prior network is realized by a neural network function f with parameters θ, which takes as input the unseen image x* and then generates the K-dimensional vector α:

\alpha = \exp\big(f_\theta(x^*)\big) \qquad (4)
In the prior network [15], the entropy-based uncertainty measure is proven to perfectly separate model-level uncertainty from data-level uncertainty and is computationally efficient thanks to its closed-form solution:

H\big[p(\mu \mid x^*, \theta)\big] = \sum_{k=1}^{K} \ln\Gamma(\alpha_k) - \ln\Gamma(\alpha_0) + (\alpha_0 - K)\,\psi(\alpha_0) - \sum_{k=1}^{K} (\alpha_k - 1)\,\psi(\alpha_k)

where α_0 = Σ_{k=1}^{K} α_k denotes the sum over all K dimensions and ψ is the digamma function. Please note that we refer to the confidence measure C as the negative of the entropy, C = −H.
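This closed-form entropy is cheap to evaluate; a small sketch using scipy's gamma-function utilities (names are illustrative):

```python
import numpy as np
from scipy.special import gammaln, digamma

def dirichlet_entropy(alpha):
    """Closed-form differential entropy of Dir(mu | alpha)."""
    alpha = np.asarray(alpha, dtype=float)
    a0, K = alpha.sum(), alpha.size
    return (gammaln(alpha).sum() - gammaln(a0)   # ln B(alpha)
            + (a0 - K) * digamma(a0)
            - ((alpha - 1.0) * digamma(alpha)).sum())
```

Flat concentrations (high model uncertainty) give higher entropy than sharp ones, which is exactly the signal the threshold-based detector exploits.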
3 Degenerated Prior Network
Unlike the Deep Prior Network [15], which proposes a multi-task training loss function to train the prior network from scratch (refer to the original paper for details), our method degenerates the prior network into a softmax-based neural network to save the retraining effort.
During training, the prior network is optimized to maximize the empirical marginal likelihood on a given dataset as follows:

\max_\theta \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ \log \int P(y \mid \mu)\, \mathrm{Dir}(\mu \mid \alpha)\, d\mu \Big] = \max_\theta \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ \log \frac{\alpha_y}{\alpha_0} \Big] \qquad (5)
Recall that in widely used softmax-based neural networks, the cross-entropy objective is described as follows:

\max_\theta \; \mathbb{E}_{(x, y) \sim \mathcal{D}} \Big[ \log \frac{\exp(z_y)}{\sum_{k=1}^{K} \exp(z_k)} \Big] \qquad (6)

where z is the last-layer output (logits) of the deep neural network. It can easily be observed from Equation 5 and Equation 6 that the Dirichlet objective function is aligned with the softmax-based cross-entropy if the following holds:

\alpha_k = C \cdot \exp(z_k) \qquad (7)

where C is a scale constant (for simplicity, we set C = 1 during our experiments to avoid fine-tuning the scaling hyperparameter). Therefore, if the exponential output of a pretrained DNN is used as the concentration parameters of the prior network, then training a softmax-based neural network is equivalent to training a Dirichlet prior network, and we can directly obtain the predictive uncertainty measure from a pretrained classifier.
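Concretely, under this degeneration the uncertainty measure can be read directly off a pretrained classifier's logits: exponentiate the logits to obtain α, then score the input by the negative Dirichlet entropy. A sketch under the C = 1 convention (scipy-based; function names are illustrative):

```python
import numpy as np
from scipy.special import gammaln, digamma

def confidence_from_logits(z):
    """Confidence C = -H[Dir(mu | exp(z))] computed directly from the
    logits z of a pretrained softmax classifier (scale constant C = 1)."""
    alpha = np.exp(np.asarray(z, dtype=float))    # concentration parameters
    a0, K = alpha.sum(), alpha.size
    entropy = (gammaln(alpha).sum() - gammaln(a0)
               + (a0 - K) * digamma(a0)
               - ((alpha - 1.0) * digamma(alpha)).sum())
    return -entropy                               # higher means more confident
```

Peaked logits yield a sharp Dirichlet and hence high confidence; flat logits yield low confidence.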
While this degenerated prior network is sufficient in relatively simple cases, we observe compromised detection accuracy on the large-scale dataset (CIFAR100). The uncertainty measure becomes so sensitive and erratic under noise that it can hardly provide an accurate estimate of model-level uncertainty, and the overall performance is only slightly better than the Baseline [7]. We visualize the distribution of in- and out-of-distribution data under this confidence measure in Figure 4. We conclude that such compromised performance is caused by the well-known overfitting issue in pretrained neural networks, where the model becomes overconfident about its prediction even when it misclassifies the input. More specifically, the classification model greatly emphasizes certain dimensions of the concentration parameter regardless of the form of the input, which causes both data sources to have indistinguishably high confidence.
4 Concentration Perturbation
In order to enhance the robustness of the established entropy-based uncertainty measure, we take inspiration from fast-sign perturbation [11, 14] to design a concentration perturbation mechanism. Unlike [11, 14], which require gradient back-propagation, our proposed method does not require a backward pass and better fits the black-box assumption. The other difference is that our method operates on top of the concentration parameters α rather than the raw input image x. As illustrated in Figure 5, we first experiment with the fast-sign perturbation algorithm as follows:
\hat{\alpha} = \alpha + \gamma \cdot \mathrm{sign}\big(\nabla_{\alpha}\, \mathcal{C}(\alpha)\big) \qquad (8)
In our experiments, such a gradient-based perturbation algorithm yields only a trivial gain or even damages detection accuracy. We conclude that the assumption made in ODIN [14] no longer holds in our scenario: the in- and out-of-distribution concentration parameters lie in regions of equivalent sharpness or flatness, hence the hill-climbing perturbation has a similar impact on both inputs and fails to better separate the in- from the out-of-distribution images (as depicted in green in Figure 5, both climb an equal height in the contour). Therefore, we set out to find a more sophisticated mechanism to separate these two data sources (as depicted in red in Figure 5, the out-of-distribution confidence drops much faster than the in-distribution confidence after adding the noise ε). Based on this philosophy, we design a parameterized perturbation function with parameter W, which takes as input the concentration parameter α and generates a noise ε in a way that widens the uncertainty gap between in- and out-of-distribution images without affecting the model prediction.
\hat{\alpha} = \alpha + \epsilon_W(\alpha) \qquad (9)
Here, we particularly investigate the simplest linear-transform-based perturbation function ε_W(α) = Wα, with W denoting the learnable perturbation matrix. In order to obtain such a perturbation matrix W, we propose a discriminative loss function L(W), which aims at enlarging the confidence gap between in- and out-of-distribution images.
\mathcal{L}(W) = \mathbb{E}_{x \sim \text{IND}} \Big[ \mathcal{C}\big(\alpha + \epsilon_W(\alpha)\big) \Big] - \mathbb{E}_{\tilde{x} \sim \text{OOD}} \Big[ \mathcal{C}\big(\tilde{\alpha} + \epsilon_W(\tilde{\alpha})\big) \Big] \qquad (10, 11)
On one hand, the magnitude of the perturbation matrix is encouraged to be small so that the generated noise does not affect the model's output landscape. Therefore, we enforce a first constraint on the norm of the perturbation noise, ||ε_W(α)|| ≤ δ||α||, with δ denoting the maximum allowed perturbation ratio. On the other hand, the perturbed concentration should still lie in the support space α̂_k > 0, hence we enforce a second constraint on the positivity of the noise ε: ε_W(α) ≥ 0. Since it is impractical to assume access to an out-of-distribution (OOD) dataset, we propose to use adversarial examples generated by FGSM [11] as synthesized OOD examples. Thus, the optimal perturbation matrix is described as follows:
W^* = \arg\max_{W} \mathcal{L}(W) \quad \text{s.t.} \quad \|\epsilon_W(\alpha)\| \le \delta \|\alpha\|, \qquad \epsilon_W(\alpha) \ge 0 \qquad (12, 13)
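The perturbation function together with its two constraints can be sketched in a few lines of numpy: positivity is enforced with a ReLU on the weight matrix and the norm bound with a rescaling. The value δ = 0.05 and all names here are illustrative choices, not the paper's exact implementation:

```python
import numpy as np

def perturb_concentration(alpha, W, delta=0.05):
    """Apply the linear concentration perturbation alpha_hat = alpha + eps_W(alpha).

    Positivity (second constraint) is enforced by a ReLU on the weight matrix;
    the norm bound (first constraint) rescales eps so ||eps|| <= delta * ||alpha||.
    """
    alpha = np.asarray(alpha, dtype=float)
    eps = np.maximum(W, 0.0) @ alpha              # ReLU(W) @ alpha >= 0
    bound = delta * np.linalg.norm(alpha)
    norm = np.linalg.norm(eps)
    if norm > bound:                              # project onto the norm ball
        eps *= bound / norm
    return alpha + eps
```

With both constraints active, the perturbed concentration stays positive and within a δ-ball of the original, so the model's output landscape is barely disturbed.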
Here we propose to optimize L(W) with a gradient ascent algorithm. The first constraint is realized by rescaling any noise whose norm is larger than δ||α||, while the second constraint is realized by simply adding a ReLU [18] activation to the perturbation weight W.
5 Experiments
We follow the previous papers [7, 14] and replicate their experimental setups. For each sample fed into the neural network, we calculate the uncertainty measure based on the output concentration α, which is then used to predict which distribution the sample comes from. Finally, several different evaluation metrics are used to measure and compare how well the different detection methods separate the two distributions.
Data Source  Dataset  Content (classes)  #Train  #Test
In-Distribution  CIFAR10 [9]  10 classes: Airplane, Truck, Bird, etc.  50,000  10,000
CIFAR100 [9]  100 classes: Mammals, Fish, Flower, etc.  50,000  10,000
Out-of-Distribution  iSUN [24]  908 classes: Airport, Abbey, etc.  –  8,925
LSUN [26]  10 classes: Bedrooms, Churches, etc.  –  10,000
TinyImageNet [2]  1,000 classes: Plant, Natural object, Sports, etc.  –  10,000
SVHN [19]  10 classes: The Street View House Numbers  –  26,032
5.1 Datasets and Implementation
Here we list all the datasets used in Table 1, which are available on GitHub (https://github.com/ShiyuLiang/odinpytorch). In order to make a fair comparison with previous out-of-distribution detection algorithms, we replicate the same setting as [7]. For both the CIFAR10 and CIFAR100 datasets, we pretrain VGG13 [23], ResNet18 [6], ResNet34 [6], WideResNet [27] (depth=28, widening factor=10) and ResNeXt [25] (depth=29, widening factor=8) with publicly available code (https://github.com/bearpaw/pytorchclassification), and then use the converged models as black boxes. We adopt the publicly available implementation at https://github.com/1Konny/FGSM to generate FGSM [11] examples. For the concentration perturbation matrix W, we initialize all weights to zero and then optimize it via the Adam optimizer [8] with learning rate 1e-3 and weight decay 5e-4. We experimented with different setups of the hyperparameter δ and found a setting that yields generally promising results. Our method is implemented in PyTorch [20] based on publicly available code; all the code and trained models will be released on GitHub (https://github.com/wenhuchen/).
5.2 Experimental results
We measure the quality of outofdistribution detection using the established metrics for this task [7].

FPR at 95% TPR (lower is better): Measures the false positive rate (FPR) when the true positive rate (TPR) is equal to 95%. Note that TNR = 1 − FPR.

Detection Error (lower is better): Measures the minimum possible misclassification probability over thresholds τ, defined by min_τ {0.5 P_in(conf ≤ τ) + 0.5 P_out(conf > τ)}. Note that Detection Accuracy = 1 − Detection Error.

AUROC (larger is better): Measures the Area Under the Receiver Operating Characteristic curve. The Receiver Operating Characteristic (ROC) curve plots the relationship between TPR and FPR.

AUPR (larger is better): Measures the Area Under the Precision-Recall (PR) curve, where AUPR-In refers to using in-distribution as the positive class and AUPR-Out refers to using out-of-distribution as the positive class.
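All of these threshold-free metrics can be computed directly from the confidence scores of held-out in- and out-of-distribution samples. A numpy sketch for the first three (names are illustrative; a library such as scikit-learn could equally be used):

```python
import numpy as np

def ood_metrics(conf_ind, conf_ood):
    """Compute FPR@95%TPR, detection error, and AUROC from confidence
    scores, treating in-distribution samples as the positive class."""
    scores = np.concatenate([conf_ind, conf_ood])
    labels = np.concatenate([np.ones(len(conf_ind)), np.zeros(len(conf_ood))])
    order = np.argsort(-scores)                   # sort by descending confidence
    labels = labels[order]
    tpr = np.cumsum(labels) / labels.sum()
    fpr = np.cumsum(1.0 - labels) / (1.0 - labels).sum()
    auroc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0)  # trapezoid rule
    fpr95 = fpr[np.searchsorted(tpr, 0.95)]       # FPR where TPR first hits 95%
    det_err = np.min(0.5 * (1.0 - tpr) + 0.5 * fpr)
    return fpr95, det_err, auroc
```

For a detector that separates the two sources perfectly, AUROC approaches 1 while FPR@95%TPR and detection error approach 0.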
IND/OOD Model  Method  FPR@TPR95  Detection Error  AUROC
CIFAR10/ iSUN VGG13  Baseline  43.8  11.4  94 
ODIN  22.4  10.2  95.8  
Confidence  16.3  8.5  97.5  
Semantic  23.2  10.2  96.4  
Ours  10.7  7.4  97.7  
CIFAR10/ LSUN VGG13  Baseline  41.9  11.5  94 
ODIN  20.2  9.8  95.9  
Confidence  16.4  8.3  97.5  
Semantic  22.9  13.9  96.0  
Ours  10.3  7.4  97.8  
CIFAR10/ TinyImgNet VGG13  Baseline  43.8  12  93.5 
ODIN  24.3  11.3  95.7  
Confidence  18.4  9.4  97  
Semantic  19.8  10.1  96.5  
Ours  13.8  7.9  97.5 
IND/OOD Model  Method  FPR@TPR95  Detection Error  AUROC
CIFAR10/ TinyImgNet ResNet18  Baseline  59.0  15.1  91.1  
ODIN  32.1  11.2  94.9  
DPN      93.0  
Mahalanobis  2.9  0.6  96.3  
Ours  17.1  8.7  96.8  
CIFAR10/ LSUN ResNet18  Baseline  50.2  12.3  93.1  
ODIN  17.9  8.4  96.9  
DPN      90.2  
Mahalanobis  1.2  0.3  97.5  
Ours  7.7  5.9  98.3  
CIFAR10/ SVHN ResNet18  Baseline  49.5  13.3  92.0  
ODIN  29.7  15.1  91.7  
DPN      95.9  
Mahalanobis  12.2  2.3  92.6  
Ours  28.7  13.6  93.2 
CIFAR10 experiments
Here we first present our experimental results on the CIFAR10 dataset, using neural networks pretrained on CIFAR10 to detect whether unseen inputs are in-distribution. We experiment with two neural network architectures, VGG13 [23] (see Table 2) and ResNet18 [6] (see Table 3). In Table 2, we mainly compare against Baseline [7], ODIN [14], Confidence [3] and Semantic [22] under the VGG13 architecture. We can easily observe that our proposed method significantly outperforms these competing algorithms across all metrics. In Table 3, we mainly compare against ODIN [14], DPN [15] and Mahalanobis [13] under the ResNet18 architecture. We observe that the Mahalanobis algorithm performs extremely well on the FPR (TPR=95%) and detection error metrics, but our method is superior in terms of the AUROC metric.
IND/OOD Model  Method  FPR@TPR95  Detection Error  AUROC
CIFAR100/ iSUN ResNet34  ODIN  61.3  23.7  83.6  
Semantic  58.4  21.4  85.2  
Mahalanobis  18.7  11.6  94.1  
Ours  19.8  12.2  94.2  
CIFAR100/ LSUN ResNet34  ODIN  76.8  42.4  78.9  
Semantic  79.5  42.2  79.0  
Mahalanobis  14.9  9.0  95.4  
Ours  14.5  9.6  95.9  
CIFAR100/ TinyImgNet ResNet34  ODIN  63.9  25.2  82.3  
Semantic  62.4  24.4  83.1  
Mahalanobis  12.0  9.1  96.5  
Ours  16.3  10.3  95.3 
Model  OOD  Method  FPR@TPR95  Detection Error  AUROC  AUPR In  AUPR Out 

WideResNet CIFAR100  iSUN  Base/ODIN/Ours  82.7/57.3/18.0  43.9/31.1/11.1  72.8/86.6/95.5  74.2/85.9/95.5  69.2/84.9/95.6 
LSUN  Base/ODIN/Ours  82.2/56.5/13.3  43.6/30.8/8.8  73.9/86.0/97.1  75.7/86.2/97.1  69.2/84.9/97.2  
TinyImgNet  Base/ODIN/Ours  79.2/55.9/16.8  42.1/30.4/10.2  72.2/84.0/96.2  70.4/82.8/95.7  70.8/84.4/96.4  
ResNeXt29 CIFAR100  iSUN  Base/ODIN/Ours  82.2/61.6/18.4  31.0/21.4/11.2  74.5/86.4/94.9  79.8/89.1/95.3  67.7/82.7/94.0 
LSUN  Base/ODIN/Ours  82.2/62.4/13.6  31.8/22.1/8.7  73.6/85.9/96.5  77.4/87.8/96.8  69.5/83.9/95.8  
TinyImgNet  Base/ODIN/Ours  79.6/60.2/16.9  31.0/21.5/9.9  75.1/86.5/96.5  78.4/88.2/96.8  71.6/84.8/95.8 
CIFAR100 experiments
Here we experiment with the large-scale CIFAR100 dataset to further investigate the effectiveness of our proposed algorithm. In Table 4, we mainly compare against ODIN [14], Semantic [22] and Mahalanobis [13] under the ResNet34 architecture. We observe very similar trends as in Table 3, where both our method and Mahalanobis significantly outperform the competing algorithms; though the Mahalanobis method achieves surprisingly good FPR and detection error scores, it lags behind ours in terms of the AUROC measure. We also provide more experimental results in Table 5, where we observe consistently promising empirical results across different model architectures. Our proposed methodology is very simple and easy to implement, yet very effective in defending against out-of-distribution examples.
The advantage of our method against DPN [15] lies in its full exploitation of the pretrained neural network, which makes our model much better suited for large-scale real-world applications. Unlike Mahalanobis [13], which needs to handcraft a large set of low-level ensemble features, our method only needs the last-layer output of the black-box model, which saves feature engineering effort and achieves almost equivalent performance on different datasets. We visualize the perturbation ratio distribution in Figure 7 and observe that the ratio is concentrated at very small values, which confirms our intuition of designing a mild perturbation that does not distort the model's output landscape.
5.3 Ablation Study
In this section, we are particularly interested in understanding the impact of our concentration perturbation algorithm on the final OOD detection metrics. We first visualize our results for CIFAR10 and CIFAR100 in Figure 6. From these two diagrams, we observe a very significant increase across different metrics and network architectures, especially on TNR (TPR=95%) and Detection Accuracy. The other trend we observe is that our perturbation seems to yield a smaller improvement on the CIFAR10 dataset than on CIFAR100, which reflects our assumption that the quality of the Dirichlet uncertainty is highly related to the classification accuracy; that is, when the pretrained model has very weak classification capability, the uncertainty measure is highly inaccurate and very prone to misclassified examples. Another interesting observation is that the uncertainty measure is sensitive to the model architecture, especially the scale of the output layer. With the VGG13 [23] architecture (see Figure 6), the uncertainty measure is able to yield very promising OOD detection accuracy even without concentration perturbation, while ResNet18 [6], though able to achieve better classification accuracy on CIFAR10, has much lower out-of-distribution detection accuracy than VGG13.
5.4 Impact of Concentration Perturbation
In this section, we are interested in studying the linear perturbation matrix W to understand its essence. First of all, we visualize the matrix in Figure 8. As can be seen, the diagonal elements overwhelm the non-diagonal elements in terms of magnitude due to our norm control on the perturbation noise. We then visualize the perturbed concentration in Figure 9, from which we can see that the concentration before perturbation is rather sharp over some dimensions (classes), reflecting the known overconfidence issue in pretrained neural networks. After adding the perturbation noise, the whole spectrum becomes much noisier than before, but the highlighted specks remain unchanged. The insight behind this perturbation noise is to bring the model's unreasonably high confidence down into a more rational range by injecting slight uncertainty into the model's prediction. We then compare the confidence shift caused by concentration perturbation and fast-sign perturbation [14] in Figure 10, from which we find that our perturbation noise can remarkably increase the confidence on in-distribution images while reducing the confidence on out-of-distribution examples, which greatly helps separate in-distribution from out-of-distribution examples. In comparison, the fast-sign perturbation increases the confidence measure for both in- and out-of-distribution inputs equally, which fails to help separate the two image sources.
Besides, we visualize the discriminative training process in Figure 11, where we observe that the training loss approximated with synthesized out-of-domain data is well aligned with the detection metrics computed on real out-of-domain data. We conclude that the synthesized adversarial examples [11] have good generalization ability, which lays the foundation for our methodology.
6 Conclusion
In this paper, we aim at designing a simple yet effective out-of-distribution detection algorithm to increase the robustness of existing neural networks. Our method, though it requires the least knowledge and introduces only minor complexity during training, yields very significant performance gains, especially on the large-scale dataset. However, our method is sensitive to different neural architectures, which can sometimes lead to inferior performance. In future work, we plan to study the architectural differences further and investigate their causes. Besides, since the generalization ability of adversarial examples is the cornerstone of our method, it is interesting to see how adversarial examples generated by different algorithms influence the detection accuracy and why these differences arise.
References

[1] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb):1137–1155, 2003.
[2] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[3] T. DeVries and G. W. Taylor. Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865, 2018.
[4] Y. Gal. Uncertainty in deep learning. University of Cambridge, 2016.
[5] H. Hassan, A. Aue, C. Chen, V. Chowdhary, J. Clark, C. Federmann, X. Huang, M. Junczys-Dowmunt, W. Lewis, M. Li, et al. Achieving human parity on automatic Chinese to English news translation. arXiv preprint arXiv:1803.05567, 2018.
[6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[7] D. Hendrycks and K. Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
[8] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[9] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, 2009.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[11] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.
[12] K. Lee, H. Lee, K. Lee, and J. Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325, 2017.
[13] K. Lee, K. Lee, H. Lee, and J. Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. arXiv preprint arXiv:1807.03888, 2018.
[14] S. Liang, Y. Li, and R. Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690, 2017.
[15] A. Malinin and M. Gales. Predictive uncertainty estimation via prior networks. arXiv preprint arXiv:1802.10501, 2018.
[16] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[17] S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard. DeepFool: A simple and accurate method to fool deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, 2016.
[18] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[19] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[20] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
[21] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250, 2016.
[22] G. Shalev, Y. Adi, and J. Keshet. Out-of-distribution detection using multiple semantic label representations. arXiv preprint arXiv:1808.06664, 2018.
[23] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[24] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3485–3492. IEEE, 2010.
[25] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5987–5995. IEEE, 2017.
[26] F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[27] S. Zagoruyko and N. Komodakis. Wide residual networks. In Proceedings of the British Machine Vision Conference (BMVC), 2016.