1 Introduction
There has been a recent surge in both research and industrial interest in deep learning [4], with deep neural networks demonstrating state-of-the-art performance in recent years across a wide variety of applications. In particular, deep convolutional neural networks [5, 6] have been shown to outperform other machine learning approaches for visual perception tasks ranging from image classification [19] to object detection [22] and segmentation [11]. One of the key driving factors behind the tremendous recent successes of deep neural networks has been the availability of massive computing resources, thanks to advances in and the proliferation of cloud computing and highly parallel computing hardware such as graphics processing units (GPUs). This wealth of computing resources has enabled researchers to explore significantly more complex and increasingly deeper neural networks, resulting in significant performance gains over past machine learning methods. For example, in the realm of visual perception, deep convolutional neural networks with state-of-the-art accuracies have reached hundreds of layers in depth, hundreds of millions of parameters in size, and billions of calculations for inference.
While the ability to build such large and complex deep neural networks has led to a constant increase in accuracy, the primary metric widely leveraged for evaluating networks, it has also created significant barriers to the deployment of such networks on edge devices. The practical deployment bottlenecks associated with the powerful yet highly complex deep neural networks in the research literature have become even more visible in recent years due to the incredible proliferation of mobile devices, consumer devices, and other edge devices, and the increasing demand for machine learning applications on such devices. As a result, the design of deep neural networks that account for more than just accuracy as the sole indicator of network performance, and instead strike a strong balance between accuracy and complexity, has recently become a very active area of research, with a number of deep neural network architectures designed specifically with efficiency in mind [18, 14, 34, 33, 26, 28, 36].
One of the key challenges in designing deep neural networks that strike a strong balance between accuracy and complexity lies in the difficulty of assessing how well a particular network architecture strikes that balance. As previously mentioned, using accuracy as the sole metric of network performance does not provide a proper indication of how efficient a particular network is in practical scenarios such as deployment on mobile devices and other consumer devices. As a result, there has been recent interest in the design of quantitative metrics for evaluating deep neural networks that account for more than just model accuracy. In particular, it is generally desirable to design such metrics in a manner that is as hardware-vendor-agnostic as possible, so that different network architectures can be compared to each other in a consistent manner. One of the most widely cited metrics in the research literature for assessing the performance of deep neural networks that accounts for both accuracy and architectural complexity is the information density metric proposed by Canziani et al. [1], which attempts to measure the relative amount of accuracy captured within one of the most basic building blocks of a deep neural network: a parameter. More specifically, the information density D(N) of a deep neural network N is defined as the accuracy of the network (denoted by a(N)) divided by the number of parameters needed to represent it (denoted by p(N)),
D(N) = a(N) / p(N).    (1)
While highly effective for giving a good general idea of the balance between accuracy and architectural complexity (which also acts as a good indicator of memory requirements), the information density metric does not account for the fact that, depending on the design of the network architecture, architectural complexity does not necessarily reflect the computational requirements of network inference (e.g., MobileNet [14] has more parameters than SqueezeNet [18] but lower computational requirements for network inference). Therefore, the exploration of universal performance metrics that account for accuracy, architectural complexity, and computational complexity is highly desirable, as such metrics have the potential to improve network model search and design.
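To make Eq. (1) and its blind spot concrete, the sketch below computes information density for two efficient networks; the accuracy, parameter, and MAC figures are approximate values taken from the respective papers and are included purely for illustration, not as measurements from this study.

```python
def information_density(accuracy, params_millions):
    """Information density D(N) = a(N) / p(N): percent top-1 accuracy
    per million parameters, as proposed by Canziani et al."""
    return accuracy / params_millions

# Approximate published figures (percent top-1 accuracy, MParams, GMACs);
# illustrative values only.
squeezenet   = {"acc": 57.5, "params": 1.25, "macs": 0.86}
mobilenet_v1 = {"acc": 70.6, "params": 4.2,  "macs": 0.57}

d_sq = information_density(squeezenet["acc"], squeezenet["params"])
d_mn = information_density(mobilenet_v1["acc"], mobilenet_v1["params"])

# SqueezeNet has the higher information density, yet MobileNet-v1 needs
# fewer MAC operations per inference -- a trade-off D(N) cannot capture.
print(f"SqueezeNet:   D = {d_sq:.1f} %/MParam")
print(f"MobileNet-v1: D = {d_mn:.1f} %/MParam")
```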
In this study, we continue the conversation towards universal metrics for evaluating the performance of deep neural networks for practical usage. In particular, we propose a new balanced metric called NetScore, designed specifically to provide a quantitative assessment of the balance between the accuracy, computational complexity, and architectural complexity of a deep neural network. This paper is organized as follows. Section 2 describes the proposed NetScore metric and the design principles behind it. Section 3 presents and discusses experimental results comparing NetScore, information density, and top-1 accuracy across 60 different deep convolutional neural networks for image classification on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC 2012) dataset [23], making this one of the largest comparative studies of deep neural networks.
2 NetScore: Design Principles
The proposed NetScore metric (denoted here as Ω) for assessing the performance of a deep neural network N for practical usage can be defined as:
Ω(N) = 20 log( a(N)^α / ( p(N)^β · m(N)^γ ) ),    (2)
where a(N) is the accuracy of the network, p(N) is the number of parameters in the network, m(N) is the number of multiply–accumulate (MAC) operations performed during network inference, and α, β, and γ are coefficients that control the influence of the accuracy, architectural complexity, and computational complexity of the network, respectively, on Ω(N). A number of design principles were taken into consideration in the design of the proposed NetScore metric, which are described below.
Model accuracy representation: In the NetScore metric, the model accuracy of the network is naturally incorporated in the numerator of the ratio, as an increase in accuracy should lead to an increase in the metric, similar to the information density metric [1]. We further introduce a coefficient α in the proposed NetScore metric to provide better control over the influence of model accuracy on the overall metric. In particular, we set α = 2 to better emphasize the importance of model accuracy in assessing the overall performance of a network in practical usage, as deep convolutional neural networks with unreasonably low model accuracy remain unusable in practical scenarios, regardless of how small or fast the network is. In this study, the unit used for a(N) is percent top-1 accuracy on the ILSVRC 2012 dataset [23].
Model architectural and computational complexity representations: Taking inspiration from the information density metric [1], we represent the architectural complexity of a deep neural network by the number of parameters p(N) in the network and incorporate it in the denominator of the ratio. As such, the architectural complexity of the network is inversely proportional to the metric Ω(N): an increase in architectural complexity results in a decrease in Ω(N). In addition, we incorporate the computational complexity of the deep neural network as an additional factor in the denominator, which is particularly important in operational scenarios such as inference on mobile devices and other consumer devices where computational power is limited. To represent the computational complexity of the network in a manner that is relatively hardware-vendor-agnostic, thus enabling a more consistent comparison between networks, we chose to leverage the number of multiply–accumulate (MAC) operations m(N) necessary for performing network inference. Given that the computational bottleneck of network inference on a deep neural network lies predominantly in the computation of MAC operations, the number of MAC operations is a good proxy for the computational complexity of the network. By incorporating both architectural and computational complexity, the proposed NetScore metric can better quantify the balance between accuracy, memory requirements, and computational requirements in practical usage. Furthermore, we introduce two coefficients (β and γ, respectively) to provide better control over the influence of architectural and computational complexity on the overall metric.
In particular, we set β = 0.5 and γ = 0.5 since, while architectural and computational complexity are both very important factors in assessing the overall performance of a network in practical scenarios, the most important factor remains model accuracy given that, as alluded to before, networks with unreasonably low model accuracy are not useful in practical scenarios regardless of size and speed.
Logarithmic scaling: One of the difficulties in comparing the overall performance of different deep neural networks is the great diversity in their model accuracy, architectural complexity, and computational complexity. This makes the dynamic range of the performance metric quite large and unwieldy for practitioners to compare during model search and design. To account for this large dynamic range, we take inspiration from the field of signal processing; in particular, the decibel scale commonly used to express the ratio between one value of a property and another on a logarithmic scale. In the proposed NetScore metric, we transform the ratio between the model accuracy a(N) and the model architectural and computational complexity (p(N) and m(N)) into the logarithmic decibel scale, reducing the dynamic range to a more readily interpretable range.
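Putting the three design principles together, Eq. (2) can be sketched as a small helper function; the α = 2, β = 0.5, γ = 0.5 defaults follow the settings above, with a(N) in percent top-1 accuracy, p(N) in MParams, and m(N) in GMACs.

```python
import math

def netscore(accuracy, params_millions, macs_billions,
             alpha=2.0, beta=0.5, gamma=0.5):
    """NetScore: Omega(N) = 20 * log10( a(N)**alpha /
    (p(N)**beta * m(N)**gamma) ), on a decibel-style scale."""
    ratio = accuracy ** alpha / (params_millions ** beta * macs_billions ** gamma)
    return 20.0 * math.log10(ratio)

# Higher accuracy raises the score; more parameters or MACs lower it.
print(netscore(70.6, 4.2, 0.57))  # e.g., MobileNet-v1-style figures
```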
3 Experimental Results and Discussion
To get a better sense of the overall performance of the huge wealth of deep convolutional neural networks introduced in the research literature in the context of practical usage, we perform a large-scale comparative analysis across a diverse set of 60 different deep convolutional neural networks designed for image classification, using the following quantitative performance metrics: i) top-1 accuracy, ii) information density, and iii) the proposed NetScore metric. The dataset of choice for the comparative analysis in this study is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC 2012) dataset [23], which consists of 1000 different classes. To the best of the author's knowledge, this comparative analysis is one of the largest in the research literature, and the hope is that the results presented in this study can act as a reference guide for practitioners in the field.
The set of deep convolutional neural networks evaluated in this study is: AlexNet [19], AmoebaNet-A (4, 50) [24], AmoebaNet-A (6, 190) [24], AmoebaNet-A (6, 204) [24], AmoebaNet-B (3, 62) [24], AmoebaNet-B (6, 190) [24], AmoebaNet-C (4, 50) [24], AmoebaNet-C (6, 228) [24], CondenseNet (G=C=4) [16], CondenseNet (G=C=8) [16], DenseNet-121 (k=32) [17], DenseNet-169 (k=32) [17], DenseNet-161 (k=48) [17], DenseNet-201 (k=32) [17], DPN-131 [2], GoogLeNet [31], IGC-L100M2 [35], IGC-L16M16 [35], IGC-L100M2 [35], Inception-ResNet-v2 [30], Inception-v2 [32], Inception-v3 [32], Inception-v4 [30], MobileNet-v1 (1.0-224) [14], MobileNet-v1 (1.0-192) [14], MobileNet-v1 (1.0-160) [14], MobileNet-v1 (1.0-128) [14], MobileNet-v1 (0.75-224) [14], MobileNet-v2 [26], MobileNet-v2 (1.4) [26], NASNet-A (4 @ 1056) [38], NASNet-A (6 @ 4132) [38], NASNet-B (4 @ 1536) [38], NiN [20], OverFeat [27], PNASNet-5 (4, 216) [21], PolyNet [37], PreResNet-152 [13], PreResNet-200 [13], PyramidNet-101 (alpha=250) [9], PyramidNet-200 (alpha=300) [9], PyramidNet-200 (alpha=450) [9], ResNet-152 [12], ResNet-50 [12], ResNet-101 [12], ResNeXt-101, SENet [15], ShuffleNet (1.5) [36], ShuffleNet (x2) [36], SimpleNet [10], SqueezeNet [18], SqueezeNet-v1.1 [18], SqueezeNext (1.0-23v5) [7], SqueezeNext (2.0-23) [7], SqueezeNext (2.0-23v5) [7], Tiny DarkNet [25], VGG-16 [29], Xception [3], ZynqNet [8].
In this study, the units used for p(N) and m(N) in two of the quantitative performance metrics (information density and the proposed NetScore metric) are MParams (millions of parameters) and GMACs (billions of MAC operations), respectively, given that most modern deep convolutional neural networks fall within those architectural and computational complexity ranges.
Top-1 accuracy: The top-1 accuracies of the 60 different deep convolutional neural networks on the ILSVRC 2012 dataset are shown in Fig. 1. It can be clearly observed that significant progress has been made in the design of deep convolutional neural networks for image classification over the past six years, with the difference between the deep convolutional neural network with the highest top-1 accuracy in this study (i.e., AmoebaNet-C (6, 228)) and AlexNet exceeding 25%. It is also interesting to see that more recent efficient deep convolutional neural networks such as MobileNet-v1, MobileNet-v2, and ShuffleNet all have top-1 accuracies exceeding that of VGG-16, the third largest network evaluated in this study and the state of the art just four years ago, further illustrating the improvements in network design over the past few years.
Information density: The information densities of the 60 different deep convolutional neural networks on the ILSVRC 2012 dataset are shown in Fig. 2. It can be clearly observed that the deep convolutional neural networks specifically designed for efficiency (e.g., MobileNet-v1, MobileNet-v2, ShuffleNet, SqueezeNet, Tiny DarkNet, and SqueezeNext) have significantly higher information densities than networks designed purely with accuracy in mind. More specifically, SqueezeNext (1.0-23v5), Tiny DarkNet, and the SqueezeNet family of networks had the highest information densities by a wide margin compared to the other tested deep convolutional neural networks, which can be attributed to their significantly lower architectural complexity in terms of number of network parameters. Another notable observation from the results in Fig. 2 is that the dynamic range of the information density metric is quite large across the diverse set of 60 deep convolutional neural networks evaluated in this study.
NetScore: The NetScores of the 60 different deep convolutional neural networks on the ILSVRC 2012 dataset are shown in Fig. 3. Similar to the trend observed in Fig. 2, many of the deep convolutional neural networks specifically designed for efficiency have significantly higher NetScores than networks designed purely with accuracy in mind. However, what is interesting to observe is that the NetScore ranking amongst these efficient networks is quite different from the ranking under the information density metric. In particular, the top-ranking deep convolutional neural networks with the highest NetScores are SqueezeNext (1.0-23v5), CondenseNet (G=C=8), and MobileNet-v2.
The SqueezeNet family of networks, on the other hand, had much lower relative NetScores than the aforementioned efficient networks despite having the top two highest information densities. This observation illustrates the effect of incorporating computational complexity into the assessment of deep convolutional neural networks for practical usage: while the SqueezeNet family of networks has significantly lower architectural complexity than the other tested networks, this is offset by noticeably higher computational complexity compared to other tested efficient networks such as the MobileNet-v1, MobileNet-v2, SqueezeNext, and ShuffleNet network families.
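The kind of metric reversal described above can be reproduced numerically; the figures below are approximate published values (percent top-1 accuracy, MParams, GMACs) used only to illustrate how the two metrics can disagree, not the exact measurements from this study.

```python
import math

def information_density(acc, params):
    # Eq. (1): a(N) / p(N)
    return acc / params

def netscore(acc, params, macs, alpha=2.0, beta=0.5, gamma=0.5):
    # Eq. (2): 20 * log10( a(N)**alpha / (p(N)**beta * m(N)**gamma) )
    return 20.0 * math.log10(acc ** alpha / (params ** beta * macs ** gamma))

# Approximate published figures; illustrative only.
nets = {
    "SqueezeNet":   (57.5, 1.25, 0.86),
    "MobileNet-v2": (71.8, 3.50, 0.30),
}

for name, (a, p, m) in nets.items():
    print(f"{name:13s} D = {information_density(a, p):5.1f}   "
          f"NetScore = {netscore(a, p, m):5.1f}")
# SqueezeNet wins on information density, but MobileNet-v2 wins on NetScore
# once the MAC count enters the denominator.
```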
The proposed NetScore metric, which is by no means perfect, could be useful for guiding practitioners in model search and design, and will hopefully push the conversation towards better universal metrics for evaluating deep neural networks in practical scenarios. Future work includes incorporating additional or alternative factors that are important for assessing the architectural and computational complexity of deep neural networks beyond those used in the NetScore metric, as well as finding a good balance between these different factors based on their relative importance for the deployment of deep neural networks in practical scenarios such as mobile devices and other edge devices.
Acknowledgment
The author thanks the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada Research Chairs program, Nvidia, and DarwinAI.
References
 [1] A. Canziani, A. Paszke, and E. Culurciello. An analysis of deep neural network models for practical applications. arXiv preprint arXiv:1605.07678, 2017.
 [2] Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. Dual path networks. CoRR, abs/1707.01629, 2017.
 [3] François Chollet. Xception: Deep learning with depthwise separable convolutions. CoRR, abs/1610.02357, 2016.
 [4] Y. Le Cun, Y. Bengio, and G. Hinton. Deep learning. Nature, 2015.
 [5] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
 [6] Y. LeCun, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. Handwritten digit recognition with a back-propagation network. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), 1989.
 [7] Amir Gholami, Kiseok Kwon, Bichen Wu, Zizheng Tai, Xiangyu Yue, Peter H. Jin, Sicheng Zhao, and Kurt Keutzer. Squeezenext: Hardwareaware neural network design. CoRR, abs/1803.10615, 2018.
 [8] D. Gschwend. ZynqNet: An FPGA-accelerated embedded convolutional neural network. https://github.com/dgschwend/zynqnet, 2016.
 [9] Dongyoon Han, Jiwhan Kim, and Junmo Kim. Deep pyramidal residual networks. CoRR, abs/1610.02915, 2016.
 [10] Seyyed Hossein HasanPour, Mohammad Rouhani, Mohsen Fayyaz, and Mohammad Sabokrou. Let's keep it simple, using simple architectures to outperform deeper and more complex architectures. CoRR, abs/1608.06037, 2016.
 [11] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. In ICCV, 2017.
 [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.
 [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. CoRR, abs/1603.05027, 2016.
 [14] A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 [15] Jie Hu, Li Shen, and Gang Sun. Squeezeandexcitation networks. CoRR, abs/1709.01507, 2017.
 [16] Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Condensenet: An efficient densenet using learned group convolutions. CoRR, abs/1711.09224, 2017.
 [17] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. CoRR, abs/1608.06993, 2016.
 [18] F. Iandola, S. Han, M. Moskewicz, K. Ashraf, W. Dally, and K. Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv preprint arXiv:1602.07360, 2016.
 [19] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
 [20] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. CoRR, abs/1312.4400, 2013.
 [21] Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, LiJia Li, Li FeiFei, Alan L. Yuille, Jonathan Huang, and Kevin Murphy. Progressive neural architecture search. CoRR, abs/1712.00559, 2017.
 [22] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. Berg. SSD: Single shot multibox detector. In ECCV, 2016.

 [23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 2015.
 [24] Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V. Le. Regularized evolution for image classifier architecture search. CoRR, abs/1802.01548, 2018.
 [25] J. Redmon. Tiny darknet. https://pjreddie.com/darknet/tinydarknet/, 2016.
 [26] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. arXiv preprint arXiv:1801.04381, 2018.
 [27] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.
 [28] M. Shafiee, F. Li, B. Chwyl, and A. Wong. SquishedNets: Squishing SqueezeNet further for edge device scenarios via deep evolutionary synthesis. In NIPS, 2017.
 [29] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for largescale image recognition. CoRR, abs/1409.1556, 2014.
 [30] Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. Inceptionv4, inceptionresnet and the impact of residual connections on learning. CoRR, abs/1602.07261, 2016.
 [31] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.
 [32] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
 [33] Alexander Wong, Mohammad Javad Shafiee, and Michael St. Jules. muNet: A highly compact deep convolutional neural network architecture for realtime embedded traffic sign classification. CoRR, abs/1804.00497, 2018.
 [34] Alexander Wong, Mohammad Javad Shafiee, Francis Li, and Brendan Chwyl. Tiny SSD: A tiny singleshot detection deep convolutional neural network for realtime embedded object detection. CoRR, abs/1802.06488, 2018.
 [35] Ting Zhang, GuoJun Qi, Bin Xiao, and Jingdong Wang. Interleaved group convolutions for deep neural networks. CoRR, abs/1707.02725, 2017.
 [36] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. CoRR, abs/1707.01083, 2017.
 [37] Xingcheng Zhang, Zhizhong Li, Chen Change Loy, and Dahua Lin. Polynet: A pursuit of structural diversity in very deep networks. CoRR, abs/1611.05725, 2016.
 [38] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition. CoRR, abs/1707.07012, 2017.