It is increasingly important to bring deep neural network processing to mobile or embedded devices (also known as edge devices) (surveyefficientdnns). Deep neural networks such as Convolutional Neural Networks (hereafter referred to as ConvNets) have achieved greater accuracy than humans on a large variety of predictive tasks, for example, image classification in computer vision and text classification in natural language processing (resnet). ConvNets are designed during a training phase, in which machine learning researchers search for the best model using accuracy (suitably defined for the task) as a metric. Once the ConvNet model is trained, the pre-trained model is available for use during the inference or testing phase.
There are numerous benefits to performing inferences locally on edge devices, such as reduced datacenter energy costs, lower (user) latency and a reduced need for constant internet connectivity (lane). However, such devices have a unique set of resource constraints, for example, battery life, that are atypical of the environments in which the models are trained. This has paved the way for the exploration of energy-efficient ConvNet designs through manual and automated searches for low-cost neural network model designs (for example, MobileNet (mobilenets) and MnasNet (tan2018mnasnet)), the exploration of compression, quantization and other software-based acceleration techniques, and the use of application-specific hardware accelerators (surveyefficientdnns). Despite these efforts, there are very few studies that model the energy use of deep learning models in the context of these optimizations. Such modelling approaches are useful in resource-aware ConvNet design, such as automatic Neural Architecture Search (tan2018mnasnet), in energy-aware pruning techniques (yang2016designing) and in neural network accelerator simulators (samajdar2018scale) that focus on designing energy-efficient deep learning hardware.
Previous studies (synergy) on predictive modelling have indicated that relatively simple features, such as the sum of the multiply-accumulate (MAC) counts (we refer to this as layer-type), can be used to estimate hardware performance counters, such as SIMD instruction counts and bus accesses, that are useful for determining an application’s performance. The performance counter information is then used to estimate the energy consumption of the convolutional layers in a ConvNet on real systems. Other works, such as (neuralpower), have relied on a large set of complex features, extracted from each layer’s specification (we refer to this as individual layers), to yield highly complex predictive models. However, none of these works has investigated how the choice of features trades off predictive model accuracy against complexity.
The aim of this work is to perform a thorough analysis of algorithmic features of predictive models based on layer-type and individual layers that can offer the best trade-off in model complexity (defined in Section 5) and predictive accuracy, and compare our results to previous works. We first illustrate the techniques for layer-type versus individual layer features to build predictive models for the convolutional layers on a mobile CPU and based on the results we apply the method to other layers in a ConvNet to get an overall estimate of the ConvNet’s inference phase running on different software and mobile hardware platforms.
Our contributions are as follows:
An extension to the SyNERGY framework (synergy) to tie the energy measurements obtained on the mobile device to the application-level at a per layer granularity and support the building of layer-wise energy predictive models (Step 1 in Figure 1).
In Step 2, we perform an exhaustive search based on standard feature subset selection techniques to evaluate the individual layer features in terms of two metrics, predictive model accuracy and complexity, for building predictive models for the convolutional layers in the context of a given hardware-software combination. We then compare the best predictive model using individual layer features with predictive models based on layer-type features to determine the best features for building predictive models in terms of predictive model accuracy and complexity.
In Step 3, energy predictive models for different layers are built using the best features selected in Step 2. Our predictive models, based on simple algorithmic features such as the summation of multiply-accumulate (MAC) operation counts of all layers, outperform the predictive models using more complex features from the individual layers themselves. We achieve similar accuracy with 4 to 32 times lower model complexity compared to the complex predictive models proposed in previous works. Our results are based on 12 representative ConvNet models chosen from existing deep learning frameworks (Caffe2 (caffe)), which is a larger number than used in previous studies (synergy; neuralpower), and include newer low-cost ConvNet models (for example, SqueezeNet and MobileNet) used in the mobile and embedded space.
We combine predictions from predictive models for different layers to get an overall estimate of the energy consumption of the deep learning model during the inference phase for multiple combinations of hardware and numerical software libraries: Eigen (http://eigen.tuxfamily.org) on a Snapdragon 820 (Eigen-Snapdragon820; https://developer.qualcomm.com/software/snapdragon-neural-processing-engine), Eigen on a Jetson TX1 (Eigen-TX1; http://www.nvidia.com/object/jetson-tx1-module.html) and OpenBLAS (http://www.openblas.net/) on a Jetson TX1 (OpenBLAS-TX1). Our choice includes two platforms with the same library, that is Eigen-Snapdragon820 and Eigen-TX1, and two libraries on a single platform, that is Eigen-TX1 and OpenBLAS-TX1. We also evaluate our methodology on a mobile GPU using CuDNN on the Jetson TX1 (CuDNN-TX1). For all hardware and software combinations, we achieve a predictive test accuracy in the range 76% to 84% compared with empirical measurements on the platforms.
The organisation of the paper is as follows. Section 2 provides details of the ConvNet model specifications for their different layers. Section 3 describes the methodology for energy measurements across the two hardware platforms: Jetson TX1 and Snapdragon 820. Section 4 presents the empirical energy measurements obtained for the overall inference and for the different layer-types found in ConvNets. Section 5 covers an analysis of the features to be used for the predictive models and evaluates the models based on predictive accuracy and model complexity. Section 6 details the final results for the energy-predictive models for the convolutional, pooling and fully-connected layers. Section 7 compares related work in performance and energy measurement and modelling. Finally, Section 9 concludes and highlights possible future directions.
2. ConvNet Model Specifications
This section covers the necessary background to understand the different layers of a ConvNet at the algorithmic level. For each layer, we describe the candidate input features (highlighted in bold) that will be evaluated in the feature selection phase in Section 5.
A ConvNet is an end-to-end pipeline of feature extraction and classification. The feature extractors are arranged into layers that extract high-level representations from image data (goodfellow). As the number of layers increases, the level of abstraction increases, for example, from edges or colour blobs to object shapes. The final layer uses this information to provide a classification output (or a decision). Typical layers found are convolution (Conv), pooling (Pool), normalization (of which there are two types: batch normalization and Local Response Normalization (LRN)), Rectified Linear Unit (ReLU) and fully-connected (Fc), which together transform the input data into a probabilistic output. We provide a description of the main layers targeted in our study: Conv, Pool and Fc.
|Family||Model||Top-5 accuracy (%)||Dataset||# Layers||Parameters||Model Size|
|SqueezeNet||squeezenet (squeezenet)||80.3||ImageNet||26 Conv + 3 MaxPool||1.2 M||5 MB|
|||squeezenetRes (squeezenet)||82.5||ImageNet||26 Conv + 3 MaxPool||1.2 M||6.3 MB|
|ALL-CNN-C||allcnn (allcnn)||90.92||CIFAR 10||9 Conv||1.3 M||5.5 MB|
|GoogleNet||googlenet (szegedy)||90.85||ImageNet||57 Conv + 1 Fc + 13 MaxPool||6.9 M||54 MB|
|DenseNet||densenet (huang2017densely)||92.12||ImageNet||121 Conv + 1 MaxPool||7.8 M||32.3 MB|
|Inception-v3||inceptionv3 (szegedy)||90.92||ImageNet||94 Conv + 1 Fc + 5 MaxPool||23 M||95.5 MB|
|||resnet50 (resnet)||93.29||ImageNet||53 Conv + 1 Fc + 1 MaxPool||25 M||103 MB|
|MobileNet||mobilenet (mobilenets)||70.6||ImageNet||27 Conv||29 M||17 MB|
|Places-CDNS-8s||places (placescdns)||86.8||ImageNet||8 Conv + 3 Fc + 5 MaxPool||60 M||241.6 MB|
|AlexNet||alexnet (imageNet)||80.3||ImageNet||5 Conv + 3 Fc + 3 MaxPool||62 M||244 MB|
|VGG_CNN_S||vggsmall (vgg)||86.9||ImageNet||5 Conv + 3 Fc + 3 MaxPool||102 M||393 MB|
|Inception-BN||inceptionbn (inceptbn)||89.0||ImageNet||69 Conv + 1 Fc + 5 MaxPool||1.4 B||134.6 MB|
The bulk of a ConvNet model is made up of Conv layers. The computational complexity of a standard Conv layer can be represented by the number of multiply-accumulate (MAC) operations performed, which is given by Equation 1:

$$\mathrm{MAC}_{\mathrm{conv}} = O_x \times O_y \times O_z \times K_x \times K_y \times I_z \qquad (1)$$

where $O_x$, $O_y$ and $O_z$ represent the output feature map dimensions, $K_x$, $K_y$ are the kernel filter dimensions and $I_x$, $I_y$ and $I_z$ are the input feature map dimensions in the $x$, $y$ and $z$ dimensions, as shown in Figure 2. The $z$ dimension represents the number of channels in the feature maps. These dimensions are governed by the stride (the step size by which the kernel filter slides across the input in $x$ and $y$), the padding (the number of zeros that need to be padded around the input border to allow whole filters to be applied) and the kernel shape. The storage complexity or data volume (referred to as bandwidth in NeuralPower (neuralpower)) includes the cost of storing the input feature map or input volume to each layer, the corresponding kernel or filter weights ($K_x \times K_y \times I_z \times O_z$) and biases, and the output feature map or output volume. The volume of data (in number of elements) is given by Equation 2:

$$\mathrm{Data}_{\mathrm{conv}} = (I_x \times I_y \times I_z) + (K_x \times K_y \times I_z \times O_z) + O_z + (O_x \times O_y \times O_z) \qquad (2)$$

In Figure 2, $N$ refers to the number of images in the input, commonly known as the batch size. Newer models such as MobileNet (mobilenets) leverage depth-wise separable convolutions and have the following computational complexity:

$$\mathrm{MAC}_{\mathrm{dw}} = (O_x \times O_y \times I_z \times K_x \times K_y) + (O_x \times O_y \times I_z \times O_z) \qquad (3)$$
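The operation counts described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard expressions for a convolution (one kernel-sized dot product per output element) and for a depth-wise separable convolution (a per-channel filter followed by a 1x1 point-wise combination), for a single image. The function names and the example layer are hypothetical and are not taken from the SyNERGY code.

```python
def conv_output_size(in_size, kernel, stride=1, padding=0):
    """Output size of a convolution along one spatial dimension."""
    return (in_size + 2 * padding - kernel) // stride + 1


def conv_mac(o_x, o_y, o_z, k_x, k_y, i_z):
    """MACs of a standard convolution: a k_x*k_y*i_z dot product per output element."""
    return o_x * o_y * o_z * k_x * k_y * i_z


def conv_data_volume(i_x, i_y, i_z, o_x, o_y, o_z, k_x, k_y):
    """Number of stored values: input volume + weights + biases + output volume."""
    weights = k_x * k_y * i_z * o_z
    biases = o_z  # assumption: one bias per output channel
    return i_x * i_y * i_z + weights + biases + o_x * o_y * o_z


def depthwise_separable_mac(o_x, o_y, o_z, k_x, k_y, i_z):
    """MACs of a depth-wise filter per input channel plus a 1x1 point-wise step."""
    depthwise = o_x * o_y * i_z * k_x * k_y
    pointwise = o_x * o_y * i_z * o_z
    return depthwise + pointwise


# Hypothetical layer: 56x56 input, 32 -> 64 channels, 3x3 kernel, stride 1, padding 1.
out = conv_output_size(56, kernel=3, stride=1, padding=1)
standard = conv_mac(out, out, 64, 3, 3, 32)
separable = depthwise_separable_mac(out, out, 64, 3, 3, 32)
volume = conv_data_volume(56, 56, 32, out, out, 64, 3, 3)
print(out, standard, separable, volume)
```

For this hypothetical layer the separable form needs roughly 8 times fewer MACs than the standard convolution, which illustrates why such layers reduce the operation count without necessarily reducing measured energy in the same proportion.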
A pooling and sub-sampling layer aggregates the output from the previous layer using a pooling window ($P_x \times P_y$) in $x$ and $y$. The Max pooling operator computes the maximum over this window and downsamples the output using the max value, while Average pooling finds the average value over the window and downsamples the output using the average value. This results in a pooling output of dimension ($O_x \times O_y \times O_z$). The computation of the max value within a single window involves a comparison operation with each of its elements. For example, a $3 \times 3$ window has 9 elements that require 8 comparison operations. We refer to this as the Op count for the pooling layer.
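The comparison count described above can be written as a small helper. This is a sketch under the assumption that one window is evaluated per output element in every channel and that a window of n elements costs n - 1 comparisons, as in the 9-element example; the function name and the example dimensions are hypothetical.

```python
def maxpool_op_count(o_x, o_y, o_z, p_x, p_y):
    """Comparisons for max pooling: (window elements - 1) per output element."""
    comparisons_per_window = p_x * p_y - 1
    return o_x * o_y * o_z * comparisons_per_window


# A 3x3 window needs 8 comparisons; here applied to a hypothetical 27x27x96 output.
per_window = maxpool_op_count(1, 1, 1, 3, 3)
layer_ops = maxpool_op_count(27, 27, 96, 3, 3)
print(per_window, layer_ops)
```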
Unlike the Conv layer, the inputs to a Fc layer are connected to all the outputs of the previous layer. The Fc layers are similar to Conv layers with the exception that the input feature map and kernel dimensions ($I_x$, $I_y$, $K_x$, $K_y$) are usually greater than 1 for the first Fc layer, after which the data are flattened to a 1-dimensional vector in later Fc layers.
Table 1 gives a list of ConvNet models chosen for this study. Column 5 represents the counts of each of the described layers - Conv, Fc and MaxPool - present in each model. In newer ConvNet models (for example, Inception-BN, Inception-V3, GoogleNet and Residual Nets) the traditional stack of fully-connected layers, seen in older models such as AlexNet and VGG, is replaced with a Global Average Pooling layer (nin) and a single Fc layer. Column 7 shows the size of the model, which is stored in 32-bit floating point precision. We evaluate models ranging from 1.2M to 1.4B parameters (also referred to as weights), as given in Column 6 of Table 1. The top-5 accuracy of the model is a measure based on the Top-5 predictions of the object category in a given image (ImageNetChallenge). In our study, we evaluate the energy use of these pre-trained ConvNets and build layer-wise energy predictive models for inferences executing on a mobile device.
We extend the existing SyNERGY framework to collect power measurements and use it for developing energy predictive models. Figure 1 represents a high-level overview of the extended framework. The input to the deep learning software framework is the ConvNet model specification, the pre-trained weights from Caffe2’s model repository and the input image. Inferences are executed in the deep learning software ecosystem with the necessary back-end acceleration libraries on the chosen hardware. The code annotations supply a layer’s beginning and end times. The target device runs the actual inference, while the host machine remotely collects data such as the annotations and power measurements from the target hardware. The details of the target platforms in this study are provided in Table 2. Next, we describe the steps necessary to set up the software tools on both the host and target systems.
3.1. Tools for remote monitoring on the host and Caffe2 on target mobile devices.
The host system runs ARM Streamline (version 5.28.1, Linux 64-bit), compiled from the sources in the DS-5 Development Studio (ARMstreamline). To facilitate communication between the host machine and the target machine, we need the gator daemon, which communicates with the host’s Streamline, and the gator driver, which is built as a loadable kernel module (ARMstreamline). The Caffe and Caffe2 binaries can be built directly for the Jetson TX1, while for the Snapdragon we cross-compile Caffe2’s Android binary using the android-ndk-r16 toolchain. To integrate ARM Streamline with Caffe and Caffe2 we use the Streamline annotation library. We identify the specific functions that call each layer in the software stack in Caffe and Caffe2 and place the code annotations there. This includes Caffe’s net.cpp and Caffe2’s netsimple.cpp. We use the older Caffe with OpenBLAS support, as done in the SyNERGY framework (synergy).
3.2. Power measurement set up on the target
The Jetson TX1 development board comes with an on-board TI-INA3221x power sensor chip that has to be enabled during kernel source cross-compilation (a list of kernel modifications can be found at https://github.com/ARM-software/gator), along with enabling its entry in the device tree binaries (dtb). This power sensor provides system-level power, CPU-level power and GPU-level power in mW. We use the system-level power to the SoC as this accounts for the power due to the processing core, DRAM memory and peripherals. The power measurements are gathered with the default interactive Linux governor. The Snapdragon 820 development board comes with on-board power pins. To capture energy measurements, we use the ARM energy probe to provide system-level power for the SoC.
In our study, the power values are collected as the inference phase executes on either the target CPU (single-threaded) or the GPU. We report the execution time per image (sec/image) and energy per image (mJ/image) averaged over 5 separate runs for single image inferences. The power sampling rate is fixed to 1 kHz. To extract per layer measurements the execution profile with time-stamped code-annotations are aligned to the time-stamped power profile. We then use the extracted power measurement to calculate the energy consumed over the time duration as per Equation 5.
$$E = \sum_{t=0}^{T} P_t \, \Delta t \qquad (5)$$

where $P_t$ is the power sample over the time duration $\Delta t$ and $T$ is the total execution time of the inference application.
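The alignment of time-stamped annotations with the power profile can be illustrated as follows. This is a minimal sketch, assuming a uniformly sampled power trace in mW and per-layer start and end times in seconds; the trace, the layer names and the helper `energy_mj` are hypothetical and do not reproduce the Streamline data format.

```python
def energy_mj(power_mw, sample_rate_hz, start_s, end_s):
    """Energy (mJ) of the power samples (mW) recorded between start_s and end_s."""
    dt = 1.0 / sample_rate_hz                     # seconds per sample
    first = int(round(start_s * sample_rate_hz))  # index of the first sample in the window
    last = int(round(end_s * sample_rate_hz))     # index one past the last sample
    return sum(power_mw[first:last]) * dt         # mW * s = mJ


# Hypothetical trace: 2 s of samples at 1 kHz with a constant 4000 mW draw.
SAMPLE_RATE_HZ = 1000
trace_mw = [4000.0] * (2 * SAMPLE_RATE_HZ)

# Hypothetical layer annotations: name -> (start time, end time) in seconds.
annotations = {"conv1": (0.0, 0.5), "pool1": (0.5, 0.6), "fc": (0.6, 2.0)}

per_layer_mj = {
    name: energy_mj(trace_mw, SAMPLE_RATE_HZ, start, end)
    for name, (start, end) in annotations.items()
}
total_mj = sum(per_layer_mj.values())
print(per_layer_mj, total_mj)
```

Because the layer windows do not overlap and cover the whole trace, the per-layer energies add up to the energy of the full inference.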
|Layer (GoogleNet, Caffe2)||Eigen-TX1: Energy (mJ)||Eigen-TX1: Time (sec)||Eigen-TX1: Energy (%)||Eigen-TX1: Time (%)||Eigen-Snapdragon820: Energy (mJ)||Eigen-Snapdragon820: Time (sec)||Eigen-Snapdragon820: Energy (%)||Eigen-Snapdragon820: Time (%)|
|Conv||7856.84 ± 457.2||1.7354 ± 0.1||84.44||84.14||842.66 ± 95.86||0.3866 ± 0.03||77.64||77.28|
|Fc||156.35 ± 22.9||0.0344 ± 0.0||1.68||1.66||18.20 ± 6.04||0.0082 ± 0.0||1.67||1.63|
|MaxPool||504.41 ± 46.9||0.1126 ± 0.01||5.42||5.45||59.87 ± 3.5||0.0284 ± 0.0||5.51||5.67|
|LRN||637.29 ± 22.4||0.1468 ± 0.0||6.84||7.11||137.62 ± 10.8||0.0646 ± 0.0||12.68||12.91|
|ReLU||93.37 ± 14.4||0.021 ± 0.0||1.00||1.01||23.14 ± 15.4||0.0106 ± 0.0||2.13||2.11|
|AveragePool||3.59 ± 2.01||0.0008 ± 0.0||0.03||0.03||0.28 ± 0.0||0.0002 ± 0.0||0.02||0.03|
|Concat||52.12 ± 5.4||0.0114 ± 0.0||0.56||0.55||3.00 ± 2.54||0.0014 ± 0.0||0.27||0.27|
|Average Total||9304.00 ± 507.1||2.0624 ± 0.0||100||100||1085.28||0.5002||100||100|
|Layer (GoogleNet, Caffe)||OpenBLAS-TX1: Energy (mJ)||OpenBLAS-TX1: Time (sec)||OpenBLAS-TX1: Energy (%)||OpenBLAS-TX1: Time (%)|
|Conv||4883.26 ± 818.6||1.52 ± 0.2||62.36||62.66|
|InnerProduct||138.96 ± 30.7||0.04 ± 0.01||1.77||1.76|
|Pooling||761.83 ± 92.2||0.23 ± 0.03||9.72||9.82|
|LRN||1716.78 ± 43.1||0.52 ± 0.01||21.92||21.53|
|ReLU||178.86 ± 35.9||0.05 ± 0.01||2.28||2.29|
|Split||9.00 ± 6.0||0.002 ± 0.0||0.11||0.11|
|Concat||128.59 ± 22.5||0.03 ± 0.0||1.64||1.63|
|Dropout||0.6798 ± 1.5||0.0002 ± 0.0||0.008||0.008|
|Softmax||1.85 ± 1.6||0.0006 ± 0.0||0.02||0.02|
4. Layer-type Energy and Performance Measurements
In this section, we use SyNERGY to provide empirical time and energy measurements for the overall inference as well as for the finer-grained layer-types of an example ConvNet model (in our case, GoogleNet). Layer-type for the convolutional layers means that we group individual convolutional layers, regardless of their specifications (for example, kernel shape), into a broader category of Conv.
We summarize our findings for Eigen-TX1 and Eigen-Snapdragon820 in Table 3, and for OpenBLAS-TX1 in Table 4. In the case of Caffe2, the pooling layer is further split into two types, MaxPool and AveragePool, while in the case of the original Caffe framework both versions are grouped under the Pooling category. Certain layers like dropout and softmax execute too quickly and are too small to be captured. We also report the average percentage of energy and time of each layer relative to the total energy and time. If we compare the total inference energy in all three software and hardware cases, the combination with the least amount of energy per inference is Caffe2’s Eigen-Snapdragon820. However, comparing the energy of the Conv layer for the Jetson TX1 with both software backends, we observe that OpenBLAS consumes less energy than Eigen.
If we compare the energy of the different layers, we observe that the Conv layer contributes the most to the total energy consumed in the inference phase (approximately 62% to 84% in Tables 3 and 4) across all three combinations. The pooling (approximately 5% to 10%) and LRN (approximately 7% to 22%) layers rank next, making them good candidates for optimization. In this section, we have used the energy measurements obtained from SyNERGY to evaluate the energy consumption of a ConvNet model, identify energy bottlenecks and perform comparative analysis for different software and hardware systems. In Section 6.2, we show how predictive models based on energy can be used to perform such comparative analysis. Finally, this layer-type abstraction will be useful in the next sections, where we focus on building predictive models at the granularity of layer-types and compare them with previous approaches that use individual layers for predictive models.
5. Feature Selection
Previous work such as NeuralPower (neuralpower) builds layer-wise predictive models using complex features extracted from individual layers, for example, higher order terms for kernel size, input volume and others. Other works (synergy) use simple aggregate algorithmic features (we refer to this as layer-type), for example, an aggregate MAC count, to build predictive models for the convolutional layers. Therefore, in this section, we first evaluate the algorithmic features (highlighted in bold in Section 2) extracted for individual layers in terms of predictive model accuracy and complexity. Our feature selection is based on the standard technique of best subset selection using metrics such as the Bayesian Information Criterion (BIC) (james2013introduction). These methods are typically used to evaluate the trade-off between model complexity and accuracy as features are added to the model. We refer to model complexity as the number of features in the final predictive model.
We demonstrate this feature analysis for all the convolutional layers of all 12 target ConvNets executing on the CPU of the Jetson TX1 with the OpenBLAS backend. The linear features (degree $d=1$) for these layers are: kernel shape, padding, stride, $I_x$ (same as $I_y$), $O_x$ (same as $O_y$), $I_z$, $O_z$, input size, output size, weights, data volume and MAC, where $I$ and $O$ denote the input and output feature map dimensions and $z$ is the channel dimension. In this case, each convolutional layer has a set of 12 features. The target response is the energy of an individual layer. Figure 3a shows that a model with 5 features (indicated by the red circle at the lowest BIC) would be a good choice. To model non-linear features, we extended the linear feature set with higher order polynomial terms and cross terms (of degree $d=2$) of these features (for example, squared terms and pairwise products). For a degree 2 model, the model with the lowest BIC has a model complexity of 62 features, as shown in Figure 3b.
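Best subset selection with BIC can be sketched as an exhaustive search over feature combinations. The example below is an illustration on synthetic data, not the paper's dataset. It assumes an ordinary least-squares fit and the common Gaussian form BIC = n ln(RSS/n) + k ln(n), which may differ by constants from the form used in the cited textbook. The feature names are placeholders, and the synthetic response depends only on the first and third features.

```python
import itertools

import numpy as np


def bic(y, y_hat, n_params):
    """Gaussian BIC up to an additive constant: n*ln(RSS/n) + k*ln(n)."""
    n = len(y)
    rss = float(np.sum((y - y_hat) ** 2))
    return n * np.log(rss / n) + n_params * np.log(n)


def best_subset(X, y, names):
    """Fit every non-empty feature subset by least squares and keep the lowest BIC."""
    n, p = X.shape
    best_score, best_names = None, None
    for k in range(1, p + 1):
        for cols in itertools.combinations(range(p), k):
            A = np.column_stack([np.ones(n), X[:, list(cols)]])  # intercept + subset
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            score = bic(y, A @ coef, n_params=k + 1)
            if best_score is None or score < best_score:
                best_score, best_names = score, [names[c] for c in cols]
    return best_score, best_names


# Synthetic data (assumption): the response depends on features 0 and 2 only.
rng = np.random.default_rng(0)
names = ["mac", "data_volume", "stride", "padding"]
X = rng.normal(size=(200, len(names)))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

score, selected = best_subset(X, y, names)
print(selected)
```

With 12 candidate features the same loop visits 4095 subsets, which is still cheap; degree 2 feature sets grow quickly, so a stepwise search may be a more practical substitute there.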
|1||5||Best linear model||4631||
In order to understand which predictive model should yield greater predictive accuracy, in Table 5 we make a relative comparison of a single-feature model using MAC against the predictive model using the best combination of linear features and against the predictive model with the best combination of linear and non-linear features obtained from the previous step. Based on the relative comparison, the single-feature MAC model is found to be within 3% of the best linear feature model and within 29% of the best non-linear feature model. Therefore, to get a highly accurate predictive model, as indicated by a lower BIC, the number of features extracted for individual layers would be in the order of 62 non-linear features. In the next section, we compare regression-based models based on individual features with predictive models based on layer-type features on the basis of predictive accuracy, to determine whether higher-complexity models indeed offer better accuracy than lower-complexity models.
|MAC model||Energy||67.02 ± 11.91||1||1|
|NeuralPower (neuralpower)||Runtime||77.48 ± 21.21||2||4|
5.1. Analysis of Model Accuracy & Complexity
Based on our analysis of features in the previous section, we build regression-based models trained using the standard supervised learning approach in machine learning (james2013introduction). Cross-validation is performed 10 times, with the convolutional layers split between training and test sets in the ratio 80:20. The regression-based layer-wise predictive model is given by Equation 6. Similar predictive models can be built for other layers to give the overall energy of the inference as the sum of the predictions from all the layer-wise predictive models, as given by Equation 7.
where $d$ represents the degree of the algorithmic feature.
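The training procedure above can be illustrated with a repeated 80:20 hold-out evaluation of a polynomial regression on a single feature. This is a sketch under stated assumptions: the data are synthetic and exactly linear, accuracy is taken as 100 × (1 − mean relative error), and `repeated_holdout` is a hypothetical helper rather than the authors' code.

```python
import numpy as np


def repeated_holdout(x, y, degree=1, repeats=10, test_fraction=0.2, seed=0):
    """Fit a degree-d polynomial on random 80:20 splits; return mean and std accuracy (%)."""
    rng = np.random.default_rng(seed)
    n_test = max(1, int(round(len(x) * test_fraction)))
    scores = []
    for _ in range(repeats):
        order = rng.permutation(len(x))
        test, train = order[:n_test], order[n_test:]
        coef = np.polyfit(x[train], y[train], deg=degree)  # least-squares fit
        pred = np.polyval(coef, x[test])
        rel_err = np.abs(pred - y[test]) / y[test]
        scores.append(100.0 * (1.0 - rel_err.mean()))
    return float(np.mean(scores)), float(np.std(scores))


# Synthetic example (assumption): energy grows linearly with the MAC count.
mac_millions = np.linspace(1.0, 100.0, 50)
energy_mj = 2.0 * mac_millions + 5.0

mean_acc, std_acc = repeated_holdout(mac_millions, energy_mj, degree=1)
print(mean_acc, std_acc)
```

Reporting both the mean and the spread across splits matters here, since a single favourable split can make a model look far more accurate than it is on average.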
As described in the previous sections, we have two types of predictive models based on the type of features: layer-type predictive models use features aggregated across layers, while individual layer models use features from every layer. The individual convolutional layer models fall into four categories, as given in Table 6: a single-feature model (MAC) without summation counts (as done in (synergy)), a model with the best BIC for linear features (best linear), a model with the best BIC for non-linear features (best non-linear) and, finally, a model from previous work, NeuralPower (neuralpower). NeuralPower is based on predictions from run-time and power estimation models, which give an estimate of time and power, and subsequently energy, for individual layers. For this, we use the code provided by NeuralPower (https://github.com/cmu-enyac/NeuralPower).
By comparing the models based on individual layer features, as summarised in Table 6, we find that using a larger set of complex features does not provide a large improvement in accuracy compared with the use of simpler features, as indicated by the results from Table 5; for example, compare the data for complex models such as the Best non-linear model and NeuralPower in Table 6 with that for the simpler models such as the MAC model and Best linear model. Furthermore, for a single hardware and software configuration and a single split of their dataset, NeuralPower reports an overall accuracy of 97.21% (based on the Root-mean-squared-percentage-error (RMSPE)). When using NeuralPower on our dataset, we observe similarly high accuracies for certain splits (for example, considering the upper bound we get 77.48+21.21=98.69%). However, this behaviour is not consistently observed across other splits of train and test sets, as done in our experimental evaluation. Our results indicate a mean and variance of the accuracy of 77.48 ± 21.21% in Table 6, across different splits of training and test sets.
On further analysis of the results of NeuralPower, we observe that certain ConvNet models are over-predicted while others are under-predicted when using two different predictive models - the runtime and power models - leading to a cancellation effect when calculating energy that gives an overall high accuracy, which may be misleading. However, we do not observe such cancellation effects when using a simple model, based on the summed MAC count ($\sum$MAC), trained directly on energy use information. In addition, the $\sum$MAC model offers a higher accuracy and lower variance compared to using models based on individual layers (see Column 4 of Table 6).
Given this behaviour, we conclude that predictive models based on $\sum$MAC offer a good first approximation for estimating the energy consumed in a ConvNet in terms of model complexity and accuracy. The $\sum$MAC model yields a higher and more stable predictive accuracy but with model complexity 4 and 17 times lower compared to previous approaches that use data from individual layers (Column 6 of Table 6). Moreover, MAC (or more generally an operation count, Op) is a universal feature that can be extracted for other layer-types (see Section 6). Therefore, in the next section, we extend the approach to other layer-types to get an overall estimate of the energy consumed by the deep learning model.
6. Overall Inference Energy
In this section, we extend our method to construct per layer energy prediction models for the Conv, Fc and pooling layers using MAC count (or, equivalently, OpCount for pooling layers). To make a prediction for an entire ConvNet model, the training and test sets are split based on the ConvNet models themselves during cross-validation. This ensures that, for a given test ConvNet, all its layers are present only in the test set. The predictive models are evaluated in terms of their relative accuracy, given by Equation 8, which quantifies the relative performance of the predictor with respect to the baseline measured energy value (synergy). We average the relative test errors across all test examples and across all folds of data.
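Splitting by ConvNet rather than by layer can be sketched as below. This is a minimal illustration with made-up records: the field names, the energy values and the helpers `split_by_model` and `relative_accuracy` are hypothetical, and relative accuracy is assumed here to be one minus the absolute error divided by the measured value.

```python
def split_by_model(layers, test_models):
    """Put every layer of a test ConvNet in the test set and all others in training."""
    train = [layer for layer in layers if layer["model"] not in test_models]
    test = [layer for layer in layers if layer["model"] in test_models]
    return train, test


def relative_accuracy(predicted, measured):
    """Assumed form: 1 - |predicted - measured| / measured."""
    return 1.0 - abs(predicted - measured) / measured


# Hypothetical per-layer records (values are illustrative, not measurements).
layers = [
    {"model": "alexnet", "layer": "conv1", "energy_mj": 120.0},
    {"model": "alexnet", "layer": "conv2", "energy_mj": 310.0},
    {"model": "googlenet", "layer": "conv1", "energy_mj": 95.0},
    {"model": "mobilenet", "layer": "conv1", "energy_mj": 40.0},
]

train, test = split_by_model(layers, test_models={"alexnet"})
accuracy = relative_accuracy(predicted=80.0, measured=100.0)
print(len(train), len(test), accuracy)
```

Keeping all layers of a test network out of training avoids the leakage that would occur if near-identical layers of the same network appeared on both sides of the split.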
6.1. Layer-type predictions
In Figure 4, we plot the linear regression model over the data points in our dataset. We observe that the relative positions of each data point in all three cases (that is, Eigen-Snapdragon820, Eigen-TX1 and OpenBLAS-TX1) follow a similar trend. At the bottom left corner, we observe models with low MAC count and low energy use, for example, squeezenet and mobilenet. It is interesting to observe that smaller models, in terms of number of parameters, do not always result in better energy use. For example, resnet50 outperforms inceptionv3 in terms of energy use in all three cases despite being roughly the same size (see Table 1). We also observe that alexnet has lower energy use than squeezenet and mobilenet despite being approximately 3 and 2 times greater in model size (considering only the convolutional layers), respectively. This is because the latter models use smaller kernel shapes, such as $1 \times 1$ and $3 \times 3$, to reduce the number of parameters in the model. However, these require optimized software routines for small kernel shapes to exploit the resources on a system effectively. Our results for the relative accuracy of the Conv predictive models on the test set are tabulated in Table 7. In all three cases, we find that the linear regression model using solely MAC count as an input feature achieves a test accuracy between 75% and 82%.
We aim to model the Fc layers and pooling layers using a feature equivalent to the summed MAC count, as done previously for the Conv layers. However, as observed in Table 1 from Section 2, there are fewer ConvNets with Fc layers, and fewer Fc layers per ConvNet model. Despite the fact that we are using a larger number of ConvNets than previous studies, the data for the Fc layers is limited. As seen in Table 7, the lower accuracy of between 56% and 76% using a linear fit could be a result of insufficient data points. We could address this issue by generating more points for the Fc layers, either by using individual Fc layers, as adopted by previous approaches, or by varying the batch size.
For the pooling layers, we focussed on MaxPool operations as they account for a larger number of layers than average pooling in real ConvNets. We use the OpCount given in Equation 4. The results in Figure 4 show that, using solely the OpCount as an input feature, we can obtain a linear fit with test accuracy between 82% and 90%.
6.2. Overall energy predictions
In this section, we obtain an estimate for the energy of the whole ConvNet by summing the predictions from the Conv, Fc and MaxPool layer-type predictive models. We select GoogleNet, AlexNet and VGG_CNN_S as test data points (see Table 8) because AlexNet and a variant of VGG with 16 layers were evaluated in NeuralPower (neuralpower), and GoogleNet is our running example. We use the remaining model points as the training set to form the linear model.
Table 8 shows the prediction results for each layer-type Conv, Pool and Fc, given by Columns 3, 4 and 5, and the overall predicted results (Column 6: Total predicted) for the inference. The measured energy for the Conv, Pool and Fc layer-types is given in Columns 7, 8 and 9 and the overall measured energy in Column 10 of Table 8. Similar to the results obtained using empirical measurements in Section 4, we find from the predicted energy for the convolutional layers of GoogleNet (given in Column 3) that the OpenBLAS library consumes less energy than the Eigen library on the TX1 platform.
Finally, we also report accuracy using RMSPE (as per the metric used in NeuralPower) and relative test accuracy (see Equation 8). Both metrics provide similar results. We find that across the four software-hardware combinations, including mobile GPUs (CuDNN-TX1 in Table 8), we achieve a relative test accuracy of between 76% and 85% using solely the summation of MAC (or operation) counts as the input feature to a linear model.
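Summing layer-type predictions and scoring them against measurements can be sketched as follows. The numbers are illustrative placeholders, not values from Table 8, and the relative accuracy is again assumed to be one minus the mean relative error; RMSPE is computed with its usual definition.

```python
import math


def rmspe(predicted, measured):
    """Root-mean-squared percentage error, returned as a fraction."""
    terms = [((p - m) / m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(terms) / len(terms))


def mean_relative_accuracy(predicted, measured):
    """Assumed form: 1 - mean(|predicted - measured| / measured)."""
    errors = [abs(p - m) / m for p, m in zip(predicted, measured)]
    return 1.0 - sum(errors) / len(errors)


# Hypothetical layer-type predictions (mJ) for two test ConvNets.
predicted_by_type = {
    "googlenet": {"conv": 900.0, "pool": 60.0, "fc": 30.0},
    "alexnet": {"conv": 500.0, "pool": 40.0, "fc": 120.0},
}
measured_total = {"googlenet": 1100.0, "alexnet": 600.0}

# Overall estimate per ConvNet: sum of its layer-type predictions.
models = sorted(predicted_by_type)
predicted_total = [sum(predicted_by_type[m].values()) for m in models]
measured = [measured_total[m] for m in models]

error = rmspe(predicted_total, measured)
accuracy = mean_relative_accuracy(predicted_total, measured)
print(predicted_total, error, accuracy)
```

In this made-up case one network is over-predicted and the other under-predicted by the same relative amount, and both metrics still report the error because they operate on absolute or squared deviations.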
7. Related Work
Enabling efficiency in deep learning algorithms, software and hardware requires a better understanding of the energy use of deep learning models. This section covers related work in the areas of performance and energy benchmarking, and performance and energy modelling.
Performance and energy benchmarking: Performance or execution time is used as a metric to evaluate deep learning models on existing desktop and server systems, as done in Fathom (fathom). These studies are representative of execution environments with powerful processors and larger memories, which are not typical of resource constrained low-powered devices. Our work instead provides both the performance and energy use of 12 representative ConvNet models when executing on resource constrained mobile systems and identifies energy bottlenecks at a fine-grained level.
Recent energy benchmarking efforts, such as BenchIP (benchip), have emerged to understand the energy use of deep learning applications across different types of hardware systems. The authors develop a benchmark suite of single layers and full ConvNet models that is aimed at evaluating different hardware systems. However, it is unclear how usable this framework would be for the measurement and modelling studies described in this paper as it is yet to be open-sourced.
Performance and energy modelling: To overcome the requirement of having to execute every model to measure its performance, recent studies (modeling) have focussed on modelling the execution time and resource usage of only the convolutional layers in a ConvNet model. They treat matrix multiplication as the major component of a convolutional layer and execute different matrix sizes in isolation to model its performance and resource use. The authors identify that such isolation fails to capture the dependencies between layers during actual inference runs, leading to an over-estimate of the predicted execution time compared to the actual execution time. Our work instead captures the energy use of the layers in the context of the execution environment of an entire inference and uses it to build predictive models.
Early studies (surveyefficientdnns) relied on counting the number of weights of the deep learning model and on energy look-up tables to estimate the energy cost of DRAM memory accesses during the inference phase on specialized hardware. However, such estimation models for deep learning on general-purpose processors such as CPUs and GPUs have only recently emerged, for example for a mobile CPU (synergy) and for desktop GPUs (neuralpower). Our work builds upon the former, which models the energy consumption using platform-specific performance counter information. Specifically, we build predictive models at the application level using platform-agnostic neural network features.
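To make the notion of a platform-agnostic, layer-type feature concrete, the sketch below computes MAC (or operation) counts from layer shapes alone and sums them per layer-type. It uses the standard MAC count formulas for dense convolution and fully connected layers; the pooling operation count (one operation per window element), the toy network and all function names are assumptions for illustration and are not taken from this paper.

```python
from collections import defaultdict


def conv_macs(c_in, c_out, k_h, k_w, out_h, out_w):
    """MACs of a dense (non-grouped) convolution: one MAC per kernel
    weight per output position."""
    return c_in * c_out * k_h * k_w * out_h * out_w


def fc_macs(n_in, n_out):
    """MACs of a fully connected layer: one MAC per weight."""
    return n_in * n_out


def pool_ops(channels, k_h, k_w, out_h, out_w):
    """Operation count of a pooling layer (assumption: one operation
    per window element per output position)."""
    return channels * k_h * k_w * out_h * out_w


def layer_type_features(layers):
    """Sum per-layer counts into one feature per layer-type."""
    totals = defaultdict(int)
    for layer_type, count in layers:
        totals[layer_type] += count
    return dict(totals)


# Hypothetical toy network; shapes are illustrative only.
toy_net = [
    ("Conv", conv_macs(3, 16, 3, 3, 32, 32)),
    ("Pool", pool_ops(16, 2, 2, 16, 16)),
    ("Conv", conv_macs(16, 32, 3, 3, 16, 16)),
    ("Fc", fc_macs(32 * 16 * 16, 10)),
]
features = layer_type_features(toy_net)
print(features)
```

Because these counts depend only on the network definition, they can be computed without executing the model on the target device.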
Our work is also similar to NeuralPower (neuralpower), which builds predictive models for a desktop GPU. However, we differ in three main aspects. First, NeuralPower develops per-layer power and runtime prediction models on desktop GPUs such as the Titan X GPU to predict the energy for 5 ConvNet test models. Our work focusses on empirical power measurements obtained on resource-constrained mobile devices and comprehensively evaluates 12 representative ConvNets. Second, NeuralPower does not provide an analysis of how to select the features used to build its predictive models. We use statistical analysis to select dominant input features extracted from the algorithm. Third, although the average energy prediction accuracy reported by NeuralPower appears high, it cannot be replicated on a different set of ConvNets, as shown in our results in Section 6.
The overhead of the power measurement software on the target device, introduced by the gator daemon, is negligible at approximately 3% (ARMstreamline). Our data collection phase on each platform takes around 5 minutes for all 12 ConvNets. Our feature selection process takes less than a minute. Predictive model training and testing takes approximately 5 ms. This low overhead is beneficial because, for a new software-hardware configuration, we pay this cost only once and use a few ConvNets to approximate the energy use on the platform.
Finally, the predictive models in our work are built at the layer level. Optimizations that accelerate a layer, such as fused-layer implementations (fused), are typically done below this level of abstraction and are thus automatically covered.
9. Conclusions and Future Scope
Deep neural network inference is becoming increasingly popular on low-power mobile devices. In this work, we focus on building energy predictive models by thoroughly investigating the impact of choosing application-level features on the final predictive model accuracy and complexity.
To support the building of predictive models, we extended SyNERGY, a framework for gathering energy measurements on different mobile devices. We compare two types of predictive models found in the literature, based on features selected for layers at different levels: individual layers and layer-type. Our analysis using subset feature selection techniques for individual layer models indicates that highly complex features are required to achieve greater predictive accuracy. However, unlike the results of previous works, we find that predictive models based on layer-type features (for example, summation of operation counts) offer a model complexity 4 to 32 times lower than models using individual layer features for a similar average accuracy (). We further demonstrate that such an approach can be extended to other layer-types with an accuracy of between 76% and 84% using solely the summation of MAC or Op counts as the input feature to a linear model across different mobile hardware and software combinations (specifically, mobile CPUs and GPU): Eigen-Snapdragon820, Eigen-TX1, OpenBLAS-TX1 and CuDNN-TX1.
As future work, we aim to extend our modelling studies to layers found in other types of deep neural networks such as Recurrent Neural Networks, to other devices, and to further explore non-linear modelling strategies.