Energy Predictive Models for Convolutional Neural Networks on Mobile Platforms

04/10/2020, by Crefeda Faviola Rodrigues, et al., The University of Manchester

Energy use is a key concern when deploying deep learning models on mobile and embedded platforms. Current studies develop energy predictive models based on application-level features to give researchers a way to estimate the energy consumption of their deep learning models. This information is useful for building resource-aware models that can make efficient use of the hardware resources. However, previous works on predictive modelling provide little insight into the trade-offs involved in the choice of features on the final predictive model accuracy and model complexity. To address this issue, we provide a comprehensive analysis of building regression-based predictive models for deep learning on mobile devices, based on empirical measurements gathered from the SyNERGY framework. Our predictive modelling strategy is based on two types of predictive models used in the literature: individual layers and layer-type. Our analysis of predictive models shows that simple layer-type features achieve a model complexity of 4 to 32 times less for convolutional layer predictions for a similar accuracy compared to predictive models using more complex features adopted by previous approaches. To obtain an overall energy estimate of the inference phase, we build layer-type predictive models for the fully-connected and pooling layers using 12 representative Convolutional Neural Networks (ConvNets) on the Jetson TX1 and the Snapdragon 820 using software backends such as OpenBLAS, Eigen and CuDNN. We obtain an accuracy between 76% and 84% for the prediction of the test ConvNets across different hardware-software combinations.


1. Introduction

There is growing importance to bringing deep neural network processing to mobile or embedded devices, also known as edge devices (surveyefficientdnns). Deep neural networks such as Convolutional Neural Networks (hereafter referred to as ConvNets) have achieved greater accuracy than humans for a large variety of predictive tasks, for example, image classification in computer vision and text classification in natural language processing (resnet). ConvNets are designed during a training phase where machine learning researchers search for the best model using accuracy (suitably defined for the task) as a metric. Once the ConvNet model is trained, the pre-trained model is available for use during the inference or testing phase.

There are numerous benefits to performing inference locally on edge devices, such as reduced energy costs in datacenters, lower (user) latency and a reduced need for constant internet connectivity (lane). However, such devices have a unique set of constraints in terms of resources, for example, battery life, that are atypical of the environments in which the models are trained. This has paved the way for the exploration of energy-efficient ConvNet designs through manual and automated searches for low-cost neural network model designs - for example, MobileNet (mobilenets) and MnasNet (tan2018mnasnet) - the exploration of compression, quantization and other software-based acceleration techniques, and the use of application-specific hardware accelerators (surveyefficientdnns). Despite these efforts, there are very few studies that model the energy use of deep learning models in the context of these optimizations. Such modelling approaches are useful for resource-aware ConvNet design, such as automatic Neural Architecture Search (tan2018mnasnet), energy-aware pruning techniques (yang2016designing), and neural network accelerator simulators (samajdar2018scale) that focus on designing energy-efficient deep learning hardware.

Previous studies (synergy) on predictive modelling have indicated that relatively simple features, such as the sum of the multiply-accumulate (MAC) counts (we refer to this as layer-type), can be used to estimate hardware performance counters, such as SIMD instruction counts and bus accesses, that are useful for determining an application's performance. The performance counter information is then used to estimate the energy consumption of the convolutional layers in a ConvNet on real systems. Other works, such as (neuralpower), have relied on a large set of complex features, extracted from each layer's specification (we refer to this as individual layers), to yield highly complex predictive models. However, none of these works have investigated the trade-offs of choosing features on predictive model accuracy and complexity.

The aim of this work is to perform a thorough analysis of the algorithmic features used in predictive models based on layer-type and individual layers, to determine which offer the best trade-off between model complexity (defined in Section 5) and predictive accuracy, and to compare our results to previous works. We first illustrate the techniques for layer-type versus individual layer features by building predictive models for the convolutional layers on a mobile CPU. Based on the results, we apply the method to the other layers in a ConvNet to obtain an overall estimate of the ConvNet's inference phase running on different software and mobile hardware platforms.

Figure 1. SyNERGY energy measurement and prediction framework.

Our contributions are as follows:

  • An extension to the SyNERGY framework (synergy) to tie the energy measurements obtained on the mobile device to the application level at a per-layer granularity and to support the building of layer-wise energy predictive models (Step 1 in Figure 1).

  • In Step 2, we perform an exhaustive search based on standard feature subset selection techniques to evaluate the individual layer features in terms of two metrics, predictive model accuracy and complexity, for building predictive models for the convolutional layers in the context of a given hardware-software combination. We then compare the best predictive model using individual layer features with predictive models based on layer-type features, to determine the best features for building predictive models in terms of predictive model accuracy and complexity.

  • In Step 3, energy predictive models for different layers are built using the best features selected in Step 2. Our predictive models, based on simple algorithmic features such as the summation of multiply-accumulate (MAC) operation counts over all layers, outperform the predictive models using more complex features from the individual layers themselves: we achieve similar accuracy with 4 to 32 times lower model complexity than the complex predictive models proposed in previous works. Our results are based on 12 representative ConvNet models chosen from existing deep learning frameworks (Caffe2 (caffe)), a larger number than used in previous studies (synergy; neuralpower), including newer low-cost ConvNet models (for example, SqueezeNet and MobileNet) used in the mobile and embedded space.

  • We combine predictions from predictive models for different layers to get an overall estimate of the energy consumption of the deep learning model during the inference phase for multiple combinations of hardware and numerical software libraries: Eigen (http://eigen.tuxfamily.org) on a Snapdragon 820 (Eigen-Snapdragon820; https://developer.qualcomm.com/software/snapdragon-neural-processing-engine), Eigen on a Jetson TX1 (Eigen-TX1; http://www.nvidia.com/object/jetson-tx1-module.html) and OpenBLAS (http://www.openblas.net/) on a Jetson TX1 (OpenBLAS-TX1). Our choice includes two platforms with the same library, that is Eigen-Snapdragon820 and Eigen-TX1, and two libraries on a single platform, that is Eigen-TX1 and OpenBLAS-TX1. We also evaluate our methodology on a mobile GPU using CuDNN on the Jetson TX1 (CuDNN-TX1). For all hardware and software combinations, we achieve a significant predictive test accuracy in the range 76% to 84% compared with empirical measurements on the platforms.

The organisation of the paper is as follows. Section 2 provides details of the ConvNet model specifications for their different layers. Section 3 details the specific methodology for energy measurements across the two hardware platforms: Jetson TX1 and Snapdragon 820. Section 4 presents the empirical energy measurements obtained for the overall inference and for the different layer-types found in ConvNets. Section 5 covers an analysis of the features to be used for the predictive models and evaluates the models based on predictive accuracy and model complexity. Section 6 details the final results for the energy-predictive models for the convolutional, pooling and fully-connected layers. Section 7 compares related work in performance and energy measurement and modelling. Section 8 discusses the overheads of our approach. Finally, Section 9 concludes and highlights possible future directions.

2. ConvNet Model Specifications

This section covers the necessary background to understand the different layers of a ConvNet at the algorithmic level. For each layer, we describe the candidate input features (highlighted in bold) that will be evaluated in the feature selection phase in Section 5.

A ConvNet is an end-to-end pipeline of feature extraction and classification. The feature extractors are arranged into layers that extract high-level representations from image data (goodfellow). As the number of layers increases, the level of abstraction increases, for example, from edges or colour blobs to object shapes. The final layer uses this information to provide a classification output (or a decision). Typical layers found are convolution (Conv), pooling (Pool), normalization (of two types: batch normalization or Local Response Normalization (LRN)), Rectified Linear Unit (ReLU) and fully-connected (Fc); these transform the input data into a probabilistic output. We provide a description of the main layers targeted in our study: Conv, Pool and Fc.

Figure 2. A simple two-layer ConvNet model
ConvNet | Naming convention in graphs | Top-5 accuracy (%) | Dataset | # Layers | Parameters | Model size
SqueezeNet | squeezenet (squeezenet) | 80.3 | ImageNet | 26 Conv + 3 MaxPool | 1.2 M | 5 MB
SqueezeNet with Residual Connections | squeezenetRes (squeezenet) | 82.5 | ImageNet | 26 Conv + 3 MaxPool | 1.2 M | 6.3 MB
ALL-CNN-C | allcnn (allcnn) | 90.92 | CIFAR-10 | 9 Conv | 1.3 M | 5.5 MB
GoogleNet | googlenet (szegedy) | 90.85 | ImageNet | 57 Conv + 1 Fc + 13 MaxPool | 6.9 M | 54 MB
DenseNet | densenet (huang2017densely) | 92.12 | ImageNet | 121 Conv + 1 MaxPool | 7.8 M | 32.3 MB
Inception-v3 | inceptionv3 (szegedy) | 90.92 | ImageNet | 94 Conv + 1 Fc + 5 MaxPool | 23 M | 95.5 MB
Residual-net 50 layers | resnet50 (resnet) | 93.29 | ImageNet | 53 Conv + 1 Fc + 1 MaxPool | 25 M | 103 MB
MobileNet | mobilenet (mobilenets) | 70.6 | ImageNet | 27 Conv | 29 M | 17 MB
Places-CDNS-8s | places (placescdns) | 86.8 | ImageNet | 8 Conv + 3 Fc + 5 MaxPool | 60 M | 241.6 MB
AlexNet | alexnet (imageNet) | 80.3 | ImageNet | 5 Conv + 3 Fc + 3 MaxPool | 62 M | 244 MB
VGG_CNN_S | vggsmall (vgg) | 86.9 | ImageNet | 5 Conv + 3 Fc + 3 MaxPool | 102 M | 393 MB
Inception-BN | inceptionbn (inceptbn) | 89.0 | ImageNet | 69 Conv + 1 Fc + 5 MaxPool | 1.4 B | 134.6 MB
Table 1. ConvNet models in the literature

The bulk of a ConvNet model is made up of Conv layers. The computational complexity of a standard Conv layer can be represented by the number of multiply-accumulates (MACs) performed, which is given by Equation 1:

$$\mathrm{MACs} = O_x \cdot O_y \cdot O_z \cdot (K_x \cdot K_y \cdot I_z) \qquad (1)$$

where $O_x$, $O_y$ and $O_z$ represent the output feature map dimensions, $K_x$ and $K_y$ are the kernel filter dimensions, and $I_x$, $I_y$ and $I_z$ are the input feature map dimensions in the $x$, $y$ and $z$ dimensions, as shown in Figure 2. The $z$ dimension represents the number of channels in the feature maps. These dimensions are governed by the stride (which governs the step size by which the kernel filter slides across the input in $x$ and $y$), the padding (the number of zeros that need to be padded around the input border to allow whole filters to be applied), and the kernel shape. The storage complexity or data volume (referred to as Bandwidth in NeuralPower (neuralpower)) includes the cost of storing the input feature map or input volume to each layer, the corresponding kernel or filter weights ($K_x \cdot K_y \cdot I_z \cdot O_z$) and biases, and the output feature map or output volume. The volume of data (in number of elements) is given by Equation 2:

$$\mathrm{Data} = (I_x \cdot I_y \cdot I_z) + (K_x \cdot K_y \cdot I_z \cdot O_z + O_z) + (O_x \cdot O_y \cdot O_z) \qquad (2)$$

In Figure 2, $N$ refers to the number of images in the input, commonly known as the batch size. Newer models such as MobileNet (mobilenets) leverage depth-wise separable convolutions, which have the following computational complexity:

$$\mathrm{MACs_{dw}} = (K_x \cdot K_y \cdot I_z \cdot O_x \cdot O_y) + (I_z \cdot O_z \cdot O_x \cdot O_y) \qquad (3)$$

A pooling and sub-sampling layer aggregates the output from the previous layer using a pooling window ($P_x \times P_y$) in $x$ and $y$. The Max pooling operator computes the maximum over this window and downsamples the output using the max value, while Average pooling finds the average value over the window and downsamples the output using the average value. This results in a pooling output of dimension ($O_x \times O_y \times O_z$). The computation of a max value within a single window involves a comparison operation with each of its elements; for example, a $3 \times 3$ window has 9 elements that require 8 comparison operations. We refer to this as the Op count for the pooling layer:

$$\mathrm{Ops} = O_x \cdot O_y \cdot O_z \cdot (P_x \cdot P_y - 1) \qquad (4)$$

Unlike the Conv layer, the inputs to a Fc layer are connected to all the outputs of the previous layer. The Fc layers are similar to Conv layers, with the exception that the input and kernel dimensions ($I_x$, $I_y$, $K_x$, $K_y$) are usually greater than 1 only for the first Fc layer; the input then flattens out in later Fc layers to a 1-dimensional vector.
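To make these feature definitions concrete, the following sketch implements Equations 1 to 4 as plain Python functions. The dataclass and function names are our own illustration, not part of SyNERGY; the example at the end uses the well-known shape of AlexNet's first convolutional layer.

```python
from dataclasses import dataclass

@dataclass
class LayerShape:
    """Dimensions of one layer: input (Ix, Iy, Iz), output (Ox, Oy, Oz)
    and kernel (Kx, Ky), following the notation of Section 2."""
    ix: int
    iy: int
    iz: int
    ox: int
    oy: int
    oz: int
    kx: int = 1
    ky: int = 1

def conv_macs(s: LayerShape) -> int:
    """Equation 1: MAC count of a standard Conv layer."""
    return s.ox * s.oy * s.oz * (s.kx * s.ky * s.iz)

def conv_data_volume(s: LayerShape) -> int:
    """Equation 2: input volume + weights and biases + output volume."""
    weights_and_biases = s.kx * s.ky * s.iz * s.oz + s.oz  # one bias per filter
    return s.ix * s.iy * s.iz + weights_and_biases + s.ox * s.oy * s.oz

def depthwise_separable_macs(s: LayerShape) -> int:
    """Equation 3: depthwise pass plus 1x1 pointwise pass (MobileNet-style)."""
    depthwise = s.kx * s.ky * s.iz * s.ox * s.oy
    pointwise = s.iz * s.oz * s.ox * s.oy
    return depthwise + pointwise

def maxpool_ops(s: LayerShape, px: int, py: int) -> int:
    """Equation 4: (Px * Py - 1) comparisons per output element."""
    return s.ox * s.oy * s.oz * (px * py - 1)

# AlexNet conv1: 227x227x3 input, 96 filters of 11x11, stride 4 -> 55x55x96 output.
conv1 = LayerShape(ix=227, iy=227, iz=3, ox=55, oy=55, oz=96, kx=11, ky=11)
print(conv_macs(conv1))  # 105,415,200 MACs for a single inference
```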

Table 1 gives the list of ConvNet models chosen for this study. Column 5 gives the counts of each of the described layers - Conv, Fc and MaxPool - present in each model. In newer ConvNet models (for example, Inception-BN, Inception-v3, GoogleNet and Residual Nets), the traditional stack of fully-connected layers, seen in older models such as AlexNet and VGG, is replaced with a Global Average Pooling layer, introduced in (nin), and a single Fc layer. Column 7 shows the size of the model, which is stored in 32-bit floating-point precision. We evaluate models ranging from 1.2 M to 1.4 B parameters (note, these are also referred to as weights), as given in Column 6 of Table 1. The top-5 accuracy of a model is a measure based on its top-5 predictions of the object category in a given image (ImageNetChallenge). In our study, we evaluate the energy use of these pre-trained ConvNets and build layer-wise energy predictive models for inferences executing on a mobile device.

System | Operating System | Deep learning framework | Backend acceleration library | Processor | Memory
Jetson TX1 | Ubuntu 16.04 LTS, Linux kernel 4.4.38+ | Caffe2 | Eigen | ARM Cortex A57/A53, quad-core, 64-bit, 1.9 GHz | 4 GB 64-bit LPDDR3, 25.6 GB/s
Jetson TX1 | (as above) | Caffe | OpenBLAS (libopenblas_cortexa57p-r0.3.1.dev.so), CuDNN 6.0.21 | (as above) | (as above)
Open-Q 820 (APQ8096) (Intrinsyc) | Android 7.0, API 24.0 (8096_Open-Q_820_Android_BSP-N-3.3) | Caffe2 | Eigen | Qualcomm Kryo CPU, quad-core, 64-bit, 2.2 GHz | 3 GB 2x32-bit LPDDR4, 29.9 GB/s
Table 2. Platform and software specification

3. Methodology

We extend the existing SyNERGY framework to collect power measurements and use it to develop energy predictive models. Figure 1 presents a high-level overview of the extended framework. The inputs to the deep learning software framework are the ConvNet model specification, the pre-trained weights from Caffe2's model repository and the input image. Inferences are executed in the deep learning software ecosystem with the necessary back-end acceleration libraries on the chosen hardware. Code annotations supply information about each layer's beginning and end times. The host machine remotely collects data such as the annotations and power measurements from the target hardware, while the target device runs the actual inference. The details of the target platforms in this study are provided in Table 2. Next, we describe the steps necessary to set up the software tools on both the host and target systems.

3.1. Tools for remote monitoring on the host and Caffe2 on target mobile devices.

The host system runs ARM Streamline (version 5.28.1, Linux 64-bit), compiled from the sources in the DS-5 Development Studio (ARMstreamline). To facilitate communication between the host machine and the target machine, we need the gator daemon, which communicates with the host's Streamline, and the gator driver, a loadable kernel module (ARMstreamline). The Caffe and Caffe2 binaries can be built directly for the Jetson TX1, while for the Snapdragon we cross-compile Caffe2's Android binary using the android-ndk-r16 toolchain. To integrate ARM Streamline with Caffe and Caffe2 we use the Streamline annotation library: we identify the specific functions that call each layer in the software stack in Caffe and Caffe2 and place the code annotations there. This includes Caffe's net.cpp and Caffe2's netsimple.cpp. We use the older Caffe with OpenBLAS support, as done in the SyNERGY framework (synergy).

3.2. Power measurement set up on the target

The Jetson TX1 development board comes with an on-board TI INA3221x power sensor chip that has to be enabled during kernel source cross-compilation (a list of kernel modifications can be found in https://github.com/ARM-software/gator), along with enabling its entry in the device tree binaries (dtb). This power sensor provides system-level, CPU-level and GPU-level power in mW. We use the system-level power to the SoC, as this accounts for the power due to the processing cores, DRAM memory and peripherals. The power measurements are gathered with the default interactive Linux governor. The Snapdragon 820 development board comes with on-board power pins; to capture energy measurements, we use the ARM energy probe to provide system-level power for the SoC.

In our study, the power values are collected as the inference phase executes on either the target CPU (single-threaded) or the GPU. We report the execution time per image (sec/image) and energy per image (mJ/image) averaged over 5 separate runs of single-image inferences. The power sampling rate is fixed at 1 kHz. To extract per-layer measurements, the execution profile with time-stamped code annotations is aligned to the time-stamped power profile. We then use the extracted power measurements to calculate the energy consumed over the time duration as per Equation 5:

$$E = \sum_{t=1}^{T} P_t \cdot \Delta t \qquad (5)$$

where $P_t$ is the power sample at time step $t$, $\Delta t$ is the sampling interval and $T$ is the total execution time of the inference application.
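As an illustration of this alignment step and of Equation 5, the sketch below integrates a 1 kHz power trace over per-layer begin/end timestamps. The trace file name and the annotation tuples are hypothetical; in our setup the timestamps come from the Streamline code annotations.

```python
import numpy as np

SAMPLE_PERIOD_S = 1e-3  # 1 kHz power sampling rate, so one sample per millisecond

def energy_mj(power_mw: np.ndarray, t_start_s: float, t_end_s: float) -> float:
    """Equation 5: sum the power samples falling within [t_start, t_end],
    each weighted by the sampling interval. mW x s gives mJ directly."""
    i0 = int(round(t_start_s / SAMPLE_PERIOD_S))
    i1 = int(round(t_end_s / SAMPLE_PERIOD_S))
    return float(power_mw[i0:i1].sum()) * SAMPLE_PERIOD_S

# Hypothetical per-layer annotations: (layer name, begin, end) in seconds.
annotations = [("conv1", 0.000, 0.102), ("pool1", 0.102, 0.113)]
power_trace = np.loadtxt("power_trace_mw.txt")  # one system-level sample (mW) per line

per_layer_mj = {name: energy_mj(power_trace, t0, t1) for name, t0, t1 in annotations}
total_mj = sum(per_layer_mj.values())
```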

GoogleNet-Caffe2 | Eigen - Jetson TX1 | | | | Eigen - Snapdragon 820 | | |
Layer | Energy (mJ) | Time (sec) | Avg. energy (%) | Avg. time (%) | Energy (mJ) | Time (sec) | Avg. energy (%) | Avg. time (%)
Conv | 7856.84 ± 457.2 | 1.7354 ± 0.1 | 84.44 | 84.14 | 842.66 ± 95.86 | 0.3866 ± 0.03 | 77.64 | 77.28
Fc | 156.35 ± 22.9 | 0.0344 ± 0.0 | 1.68 | 1.66 | 18.20 ± 6.04 | 0.0082 ± 0.0 | 1.67 | 1.63
MaxPool | 504.41 ± 46.9 | 0.1126 ± 0.01 | 5.42 | 5.45 | 59.87 ± 3.5 | 0.0284 ± 0.0 | 5.51 | 5.67
LRN | 637.29 ± 22.4 | 0.1468 ± 0.0 | 6.84 | 7.11 | 137.62 ± 10.8 | 0.0646 ± 0.0 | 12.68 | 12.91
ReLU | 93.37 ± 14.4 | 0.021 ± 0.0 | 1.00 | 1.01 | 23.14 ± 15.4 | 0.0106 ± 0.0 | 2.13 | 2.11
AveragePool | 3.59 ± 2.01 | 0.0008 ± 0.0 | 0.03 | 0.03 | 0.28 ± 0.0 | 0.0002 ± 0.0 | 0.02 | 0.03
Concat | 52.12 ± 5.4 | 0.0114 ± 0.0 | 0.56 | 0.55 | 3.00 ± 2.54 | 0.0014 ± 0.0 | 0.27 | 0.27
Dropout | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
Softmax | 0 | 0 | 0 | 0 | 0.48 | 0.0002 | 0.04 | 0.03
Average Total | 9304.00 ± 507.1 | 2.0624 ± 0.0 | 100 | 100 | 1085.28 | 0.5002 | 100 | 100
Table 3. GoogleNet per layer-type breakdown of energy and time for the Eigen library on the Cortex-A57 and Kryo CPUs (mean ± standard deviation over 5 runs)
GoogleNet-Caffe, OpenBLAS - Jetson TX1
Layer | Energy (mJ) | Time (sec) | Avg. energy (%) | Avg. time (%)
Conv | 4883.26 ± 818.6 | 1.52 ± 0.2 | 62.36 | 62.66
InnerProduct | 138.96 ± 30.7 | 0.04 ± 0.01 | 1.77 | 1.76
Pooling | 761.83 ± 92.2 | 0.23 ± 0.03 | 9.72 | 9.82
LRN | 1716.78 ± 43.1 | 0.52 ± 0.01 | 21.92 | 21.53
ReLU | 178.86 ± 35.9 | 0.05 ± 0.01 | 2.28 | 2.29
Split | 9.00 ± 6.0 | 0.002 ± 0.0 | 0.11 | 0.11
Concat | 128.59 ± 22.5 | 0.03 ± 0.0 | 1.64 | 1.63
Dropout | 0.6798 ± 1.5 | 0.0002 ± 0.0 | 0.008 | 0.008
Softmax | 1.85 ± 1.6 | 0.0006 ± 0.0 | 0.02 | 0.02
Average Total | 7830.07 | 2.43 | 100 | 100
Table 4. GoogleNet per layer-type breakdown of energy and time for OpenBLAS on the Cortex-A57 (mean ± standard deviation over 5 runs)

4. Layer-type Energy and Performance Measurements

In this section, we use SyNERGY to provide empirical time and energy measurements both for the overall inference and at the finer-grained layer-type level for an example ConvNet model (in our case, GoogleNet). Layer-type for the convolutional layers means that we group individual convolutional layers, regardless of their specifications (for example, kernel shape or stride), into the broader category of Conv.

We summarize our findings for Eigen-TX1 and Eigen-Snapdragon820 in Table 3, and for OpenBLAS-TX1 in Table 4. In the case of Caffe2, the pooling layer is further split into two types, MaxPool and AveragePool, while in the original Caffe framework both versions are grouped under the Pooling category. Certain layers, like dropout and softmax, execute too quickly and consume too little energy to be captured. We also report the average percentage of energy and time of each layer relative to the total energy and time. Comparing the total inference energy across all three software and hardware cases, the combination with the least energy per inference is Caffe2's Eigen-Snapdragon820. However, comparing the energy of the Conv layer on the Jetson TX1 with both software backends, we observe that OpenBLAS consumes less energy than Eigen.

Comparing the energy of the different layers, we observe that the Conv layer contributes the most to the total energy consumed in the inference phase (62% to 84%) across all three combinations. The pooling (5% to 10%) and LRN (7% to 22%) layers rank second, making them good candidates for optimization. In this section, we used the energy measurements obtained from SyNERGY to evaluate the energy consumption of a ConvNet model, to identify energy bottlenecks and to perform a comparative analysis of different software and hardware systems. In Section 6.2, we show how predictive models of energy can be used to perform a similar comparative analysis. Finally, this layer-type abstraction will be useful in the next sections, where we focus on building predictive models at the granularity of layer-types and compare them with previous approaches that use individual layers.

5. Feature Selection

Previous works such as NeuralPower (neuralpower) build layer-wise predictive models using complex features extracted from individual layers, for example, higher-order terms for kernel size, input volume and others, while other works (synergy) use simple aggregate algorithmic features (we refer to this as layer-type), for example, an aggregate MAC count, to build predictive models for the convolutional layers. Therefore, in this section, we first evaluate the algorithmic features (highlighted in bold in Section 2) extracted for individual layers for building predictive models, in terms of predictive model accuracy and complexity. Our feature selection is based on standard techniques of best subset selection using metrics such as the Bayesian Information Criterion (BIC) (james2013introduction). These methods are typically used to evaluate the trade-off between model complexity and accuracy as features are added to the model. We refer to model complexity as the number of features in the final predictive model.

We demonstrate this feature analysis for all the convolutional layers of all 12 target ConvNets executing on the CPU of the Jetson TX1 with the OpenBLAS backend. The linear features (of degree $d=1$) for these layers are: kernel shape, padding, stride, $O_x$ (same as $O_y$), $I_x$ (same as $I_y$), $I_z$, $O_z$, input size, output size, weights, data volume and MAC. In this case, each convolutional layer has a set of 12 features. The target response is the energy of an individual layer. Figure 3a shows that a model with 5 features (indicated by the red circle at the lowest BIC) would be a good choice. To model non-linear features, we extended the linear feature set with higher-order polynomial terms and cross terms (of degree $d=2$) of these features (this includes squares and pairwise products of the linear features). For a degree-2 model, the model with the lowest BIC has a model complexity of 62 features, as shown in Figure 3b. A sketch of this subset search appears after Figure 3.

Figure 3. Subset feature selection: (a) linear features, (b) non-linear features. Lower BIC is better.
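A minimal sketch of this subset search, assuming the per-layer features are rows of a NumPy matrix X (12 columns in the linear case) with the measured per-layer energy as the response y; the file and variable names are illustrative. For the degree-2 feature set an exhaustive search over every subset size becomes infeasible, so in practice a stepwise variant of the same BIC criterion is used there.

```python
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

def bic(y, y_pred, n_features):
    """Gaussian BIC: n*ln(RSS/n) + (k + 1)*ln(n); lower is better."""
    n = len(y)
    rss = float(np.sum((y - y_pred) ** 2))
    return n * np.log(rss / n) + (n_features + 1) * np.log(n)

def best_subset(X, y, max_size):
    """Fit a linear model on every feature subset up to max_size, keep lowest BIC."""
    best_score, best_cols = np.inf, ()
    for k in range(1, max_size + 1):
        for cols in itertools.combinations(range(X.shape[1]), k):
            model = LinearRegression().fit(X[:, list(cols)], y)
            score = bic(y, model.predict(X[:, list(cols)]), k)
            if score < best_score:
                best_score, best_cols = score, cols
    return best_score, best_cols

X = np.load("conv_features.npy")   # one row per Conv layer, 12 linear features
y = np.load("conv_energy_mj.npy")  # measured energy per layer
print(best_subset(X, y, max_size=5))

# Degree-2 case: expand with squares and cross terms before searching.
X2 = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
```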
Polynomial degree (d) | Model complexity (m) | Model | BIC (lower is better) | Features | Relative comparison to MAC model
1 | 1 | MAC model | 4753 | MAC | 1
1 | 5 | Best linear model | 4631 | Kernel, Stride, Ox, Output Volume, MACs | 2.63
2 | 62 | Best non-linear model | 3703 | 62 linear & non-linear features | 28.35
Table 5. BIC subset selection for the Convolutional layer

In order to determine which predictive model should yield the greatest predictive accuracy, in Table 5 we make a relative comparison between a single-feature model using MAC, the predictive model using the best combination of linear features, and the predictive model using the best combination of linear and non-linear features obtained from the previous step. Based on this relative comparison, the single-feature MAC model is found to be within 3% of the best linear-feature model and within 29% of the best non-linear-feature model. Therefore, to get a highly accurate predictive model, as indicated by a lower BIC, the number of features extracted for individual layers would be on the order of 62 non-linear features. In the next section, we compare regression-based models built on individual layer features with predictive models built on layer-type features, on the basis of predictive accuracy, to determine whether higher-complexity models indeed offer better accuracy than lower-complexity models.

Dataset: OpenBLAS-TX1
Model group | Model type | Target | 10-fold cross validation accuracy (%) | Polynomial degree (d) | Model complexity (m)
Overall Conv model | ΣMAC model | Energy | 81.84 ± 7.8 | 1 | 1
Individual convolutional layer model | MAC model | Energy | 67.02 ± 11.91 | 1 | 1
Individual convolutional layer model | Best linear model | Energy | 72.83 ± 10.7 | 1 | 5
Individual convolutional layer model | Best non-linear model | Energy | 79.58 ± 13.03 | 2 | 32
Individual convolutional layer model | NeuralPower (neuralpower) | Runtime | 77.48 ± 21.21 | 2 | 4
Individual convolutional layer model | NeuralPower (neuralpower) | Power | | 2 | 17
Table 6. Comparison of energy predictive accuracy and model complexity

5.1. Analysis of Model Accuracy & Complexity

Based on our analysis of features in the previous section, we build regression-based models trained using the standard supervised learning approach in machine learning (james2013introduction). Cross-validation is performed 10 times, with the convolutional layers used in the train and test sets split in the ratio 80:20. The regression-based layer-wise predictive model is given by Equation 6. Similar predictive models can be built for the other layers to give the overall energy of the inference as the sum of the predictions from all the layer-wise predictive models, as given by Equation 7.

$$\hat{E}_{\text{layer}} = \beta_0 + \sum_{j} \beta_j \, x_j^{\,d} \qquad (6)$$

$$\hat{E}_{\text{total}} = \sum_{\text{layers}} \hat{E}_{\text{layer}} \qquad (7)$$

where the $x_j$ are the algorithmic features, the $\beta_j$ are the fitted regression coefficients and $d$ represents the degree of the algorithmic feature.
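For the layer-type model, Equation 6 with a single feature and d = 1 reduces to an ordinary one-variable linear regression on the summed MAC count; a minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data only: one row per ConvNet, x = sum of MAC counts over all
# Conv layers (in GMACs), y = measured total Conv-layer energy (mJ).
x = np.array([[0.3], [0.7], [1.6], [2.4], [3.9], [5.7]])
y = np.array([210.0, 480.0, 1150.0, 1700.0, 2810.0, 4090.0])

# Equation 6 (single feature, d = 1): E_conv ~= beta0 + beta1 * (sum of MACs).
conv_model = LinearRegression().fit(x, y)

# Equation 7: the whole-ConvNet estimate is the sum of the layer-type
# predictions; fc_model and pool_model would be fitted the same way.
new_convnet_macs = np.array([[2.0]])
e_conv = conv_model.predict(new_convnet_macs)[0]
# e_total = e_conv + fc_model.predict(...)[0] + pool_model.predict(...)[0]
```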

As described in the previous sections, we have two types of predictive models based on the type of features: layer-type predictive models use features aggregated across layers, while individual layer models use features from every layer. The individual convolutional layer models fall into four categories, as given in Table 6: a single-feature model (MAC) without summation counts (as done in (synergy)), a model with the best BIC for linear features (best linear), a model with the best BIC for non-linear features (best non-linear) and, finally, a previous work, NeuralPower (neuralpower). NeuralPower is based on predictions from separate run-time and power estimation models, which give an estimate of time and power, and subsequently energy, for individual layers. For this comparison, we use the code provided by NeuralPower (https://github.com/cmu-enyac/NeuralPower).

Comparing the models based on individual layer features, as summarised in Table 6, we find that using a larger set of complex features does not provide a substantial boost in accuracy over simpler features, as already indicated by the results in Table 5; for example, compare the data for complex models such as the best non-linear model and NeuralPower in Table 6 with that for simpler models such as the MAC model and the best linear model. Furthermore, for a single hardware and software configuration and a single split of their dataset, NeuralPower reports an overall accuracy of 97.21% (based on the root-mean-squared percentage error (RMSPE)). When using NeuralPower on our dataset, we observe similarly high accuracies for certain splits (for example, considering the upper bound we get 77.48 + 21.21 = 98.69%). However, this behaviour is not consistently observed across other splits of the train and test sets in our experimental evaluation: our results in Table 6 indicate a mean and standard deviation of the accuracy of 77.48 ± 21.21% across different splits of the training and test sets.

On further analysis of the results of NeuralPower, we observe that certain ConvNet models are over-predicted while others are under-predicted by the two different predictive models - the runtime and power models - leading to a cancellation effect when calculating energy, which gives an overall high accuracy that may be misleading. We do not observe such cancellation effects when using a simple model trained directly on energy measurements. In addition, the ΣMAC model offers a higher accuracy and lower variance than the models based on individual layers (see Column 4 of Table 6).

Given this behaviour, we conclude that predictive models based on ΣMAC offer a good first approximation for estimating the energy consumed by a ConvNet, in terms of both model complexity and accuracy. The ΣMAC model yields a higher and more stable predictive accuracy, with a model complexity 4 and 17 times lower than the runtime and power models of previous approaches that use data from individual layers (Column 6 of Table 6). Moreover, MAC (or, more generally, an operation count, Op) is a universal feature that can be extracted for other layer-types (see Section 6). Therefore, in the next section, we extend the approach to other layer-types to get an overall estimate of the energy consumed by the deep learning model.

6. Overall Inference Energy

In this section, we extend our method to construct per-layer energy prediction models for the Conv, Fc and pooling layers using the ΣMAC count (or, equivalently, the ΣOp count for pooling layers). To make a prediction for an entire ConvNet model, the training and test sets are split based on the ConvNet models themselves during cross-validation. This ensures that, for a given test ConvNet, all its layers are present only in the test set. The predictive models are evaluated in terms of their relative accuracy, given by Equation 8, which quantifies the relative performance of the predictor with respect to the baseline measured energy value (synergy). We average the relative test accuracies across all test examples and across all folds of data.

$$\text{Relative accuracy} = \left(1 - \frac{\left|E_{\text{predicted}} - E_{\text{measured}}\right|}{E_{\text{measured}}}\right) \times 100\% \qquad (8)$$
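The per-ConvNet split and Equation 8 can be sketched as follows, with GroupKFold keeping every layer of a test ConvNet out of the training set; the toy arrays are illustrative, not our measurement data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold

def relative_accuracy(e_pred, e_meas):
    """Equation 8: 100 * (1 - |predicted - measured| / measured), in percent."""
    return 100.0 * (1.0 - np.abs(e_pred - e_meas) / e_meas)

# One row per layer: feature = MAC (or Op) count; groups identify the parent
# ConvNet so that a test ConvNet never leaks into the training set.
X = np.array([[1.2], [3.4], [0.9], [2.2], [5.0], [0.5]])
y = np.array([10.0, 30.0, 8.0, 20.0, 44.0, 5.0])
groups = np.array([0, 0, 1, 1, 2, 2])

fold_accs = []
for train, test in GroupKFold(n_splits=3).split(X, y, groups):
    model = LinearRegression().fit(X[train], y[train])
    fold_accs.append(relative_accuracy(model.predict(X[test]), y[test]).mean())
print(np.mean(fold_accs))
```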

6.1. Layer-type predictions

In Figure 4, we plot the linear regression model over the data points in our dataset. We observe that the relative positions of the data points in all three cases (that is, Eigen-Snapdragon820, Eigen-TX1 and OpenBLAS-TX1) follow a similar trend. At the bottom left corner, we observe models with low MAC counts and low energy use, for example, squeezenet and mobilenet. It is interesting to observe that smaller models, in terms of number of parameters, do not always result in lower energy use. For example, resnet50 outperforms inceptionv3 in terms of energy use in all three cases despite being roughly the same size (see Table 1). We also observe that alexnet has lower energy use than squeezenet and mobilenet despite being approximately 3 and 2 times greater in model size, respectively (considering only the convolutional layers). This is because the latter models use smaller kernel shapes, such as 1×1 and 3×3, to reduce the number of parameters in the model; however, these require optimized software routines for small kernel shapes to exploit the resources of a system effectively. Our results for the relative accuracy of the Conv predictive models on the test set are tabulated in Table 7. In all three cases, we find that the linear regression model using solely the MAC count as an input feature achieves a test accuracy between 75% and 82%.

Figure 4. Linear regression-based energy prediction models using MAC for the Conv layer: (a) Eigen-Snapdragon820, (b) Eigen-TX1, (c) OpenBLAS-TX1.
Layer | Software-Hardware Combination | Linear regression 10-fold cross validation accuracy (%)
Conv | Eigen-Snapdragon820 | 75.24 ± 12.0
Conv | Eigen-TX1 | 78.16 ± 6.3
Conv | OpenBLAS-TX1 | 82.41 ± 7.4
Fc | Eigen-Snapdragon820 | 76.75 ± 10.3
Fc | Eigen-TX1 | 64.56 ± 9.3
Fc | OpenBLAS-TX1 | 56.72 ± 5.3
Pool | Eigen-Snapdragon820 | 90.01 ± 4.4
Pool | Eigen-TX1 | 82.39 ± 8.1
Pool | OpenBLAS-TX1 | 86.05 ± 7.5
Table 7. Per layer-type energy prediction results for all software-hardware combinations

We aim to model the Fc and pooling layers using a feature equivalent to the ΣMAC count used previously for the Conv layers. However, as observed in Table 1 in Section 2, there are few ConvNets with Fc layers, and few Fc layers per ConvNet model. Despite the fact that we are using a larger number of ConvNets than previous studies, the data for the Fc layers is limited. As seen in Table 7, the lower accuracy of between 56% and 76% using a linear fit could be a result of insufficient data points. We could address this issue by generating more points for the Fc layers, either by using individual Fc layers as adopted by previous approaches or by varying batch sizes.

For the pooling layers, we focussed on MaxPool operations as they account for more layers than average pooling in real ConvNets. We use the Op count given in Equation 4. The results in Table 7 show that, using solely the ΣOp count as an input feature, we can obtain a linear fit with a test accuracy between 82% and 90%.

6.2. Overall energy predictions

In this section, we obtain an estimate of the energy for a whole ConvNet by summing the predictions from the Conv, Fc and MaxPool layer-type predictive models. We select GoogleNet, AlexNet and VGG_CNN_S as test data points (see Table 8) because AlexNet and a variant of VGG with 16 layers were evaluated in NeuralPower (neuralpower), and GoogleNet is our running example. We use the remaining models as the training set to form the linear model.

Table 8 shows the prediction results for each layer-type, Conv, Pool and Fc, given in Columns 3, 4 and 5, and the overall predicted result (Column 6: Total predicted) for the inference. The measured energy for the Conv, Pool and Fc layer-types is given in Columns 7, 8 and 9, and the overall measured energy in Column 10 of Table 8. Similar to the results obtained using empirical measurements in Section 4, we find from the predicted energy of the convolutional layers of GoogleNet (Column 3) that the OpenBLAS library consumes less energy than the Eigen library on the TX1 platform.

Finally, we also report accuracy using RMSPE (the metric used in NeuralPower) and the relative test accuracy (see Equation 8). Both metrics give similar results. We find that, across the four software-hardware combinations, including a mobile GPU (CuDNN-TX1 in Table 8), we achieve a significant relative test accuracy of between 76% and 85% using solely the summation of MAC (or operation) counts as the input feature to a linear model.
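To show how the two reported metrics relate, the sketch below recomputes both from the Eigen-Snapdragon820 rows of Table 8 (shown after this paragraph): averaging Equation 8 over the three test ConvNets reproduces the tabulated 83.11%, and the RMSPE form reproduces the 81.41% figure up to rounding.

```python
import numpy as np

# Predicted and measured [Conv, Pool, Fc] energies (mJ) for the three test
# ConvNets on Eigen-Snapdragon820, taken from Table 8.
predicted = {"GoogleNet": [604.39, 54.96, 9.98],
             "AlexNet": [281.55, 5.16, 599.64],
             "VGG_CNN_S": [1306.34, 7.89, 815.36]}
measured = {"GoogleNet": [842.66, 59.87, 18.2056],
            "AlexNet": [271.842, 5.46, 495.81],
            "VGG_CNN_S": [966.55, 6.77, 985.69]}

totals_p = np.array([sum(v) for v in predicted.values()])
totals_m = np.array([sum(v) for v in measured.values()])

# Relative test accuracy (Equation 8), averaged over the test ConvNets.
rel_acc = np.mean(100.0 * (1.0 - np.abs(totals_p - totals_m) / totals_m))

# "100 - RMSPE", the accuracy metric used by NeuralPower.
rmspe_acc = 100.0 * (1.0 - np.sqrt(np.mean(((totals_p - totals_m) / totals_m) ** 2)))

print(round(rel_acc, 2), round(rmspe_acc, 2))  # 83.11 and ~81.42 (Table 8: 81.41)
```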

Software-Hardware | Test ConvNet | Pred. Conv (mJ) | Pred. Pool (mJ) | Pred. Fc (mJ) | Total predicted (mJ) | Meas. Conv (mJ) | Meas. Pool (mJ) | Meas. Fc (mJ) | Total measured (mJ) | Accuracy (%) "100 - RMSPE" | Rel. test accuracy (%)
Eigen-Snapdragon820 | GoogleNet | 604.39 | 54.96 | 9.98 | 669.33 | 842.66 | 59.87 | 18.2056 | 920.7356 | 81.41 | 83.11 ± 9.4
Eigen-Snapdragon820 | AlexNet | 281.55 | 5.16 | 599.64 | 886.35 | 271.842 | 5.46 | 495.81 | 773.112 | |
Eigen-Snapdragon820 | VGG_CNN_S | 1306.34 | 7.89 | 815.36 | 2129.59 | 966.55 | 6.77 | 985.69 | 1959.01 | |
Eigen-TX1 | GoogleNet | 5783.59 | 1079.29 | 182.73 | 7045.61 | 6325.22 | 1583.87 | 156.35 | 8065.44 | 84.70 | 84.81 ± 2.2
Eigen-TX1 | AlexNet | 2476.42 | 93.24 | 9934.27 | 12503.93 | 2875.25 | 104.94 | 11885.76 | 14865.95 | |
Eigen-TX1 | VGG_CNN_S | 10285.35 | 170.43 | 19537.94 | 29993.72 | 9177.43 | 123.86 | 16331.49 | 25632.78 | |
OpenBLAS-TX1 | GoogleNet | 4534.3 | 1272.75 | 276.89 | 6083.94 | 4883.26 | 765.16 | 171.4 | 5819.82 | 71.61 | 76.24 ± 9.02
OpenBLAS-TX1 | AlexNet | 1908.68 | 67.34 | 17318.97 | 19294.99 | 1562.12 | 86.47 | 11884.87 | 13533.46 | |
OpenBLAS-TX1 | VGG_CNN_S | 7285.57 | 121.1 | 19536.91 | 26943.58 | 6837.69 | 213.83 | 28472.67 | 35524.19 | |
CuDNN-TX1 | GoogleNet | 1471.22 | 484.53 | 52.04 | 2007.79 | 2579.81 | 409.51 | 84.16 | 3073.48 | 77.54 | 82.43 ± 17.0
CuDNN-TX1 | AlexNet | 619.30 | 35.56 | 2959.56 | 3614.42 | 527.27 | 38.26 | 3033.99 | 3599.525 | |
CuDNN-TX1 | VGG_CNN_S | 2363.91 | 64.70 | 4988.58 | 7417.19 | 1362.50 | 78.48 | 4864.99 | 6305.97 | |
Table 8. Aggregate energy prediction results for the Conv, Pool and Fc layers (accuracies averaged over the three test ConvNets per combination)

7. Related Work

To enable efficiency in deep learning algorithms, software and hardware will require better understanding in the energy use of deep learning models. This section covers related work in the areas of performance and energy benchmarking, and performance and energy modelling.

Performance and energy benchmarking: Performance, or execution time, is used as a metric to evaluate deep learning models on existing desktop and server systems, as done in Fathom (fathom). These studies are representative of execution environments with powerful processors and larger memories, which are not typical of resource-constrained low-power devices. Our work instead provides both the performance and the energy use of 12 representative ConvNet models executing on resource-constrained mobile systems and identifies energy bottlenecks at a fine-grained level.

Recent energy benchmarking efforts, such as BenchIP (benchip), have emerged to understand the energy use of deep learning applications across different types of hardware systems. The authors develop a benchmark suite of single layers and full ConvNet models aimed at evaluating different hardware systems. However, it is unclear how usable this framework would be for the measurement and modelling studies described in this paper, as it is yet to be open-sourced.

Performance and energy modelling: To overcome the requirement of having to execute every model to measure its performance, recent studies (modeling) have focussed on modelling the execution time and resource usage of only the convolutional layers in a ConvNet model. They use matrix multiplication as a major component of a convolutional layer and execute different matrix sizes in isolation to model its performance and resource use. The authors identify that such isolation fails to capture the dependencies between layers during actual inference runs, leading to an over-estimate of the predicted execution time compared to the actual execution time. Our work instead captures the energy use of the layers in the execution environment of an entire inference and uses it to build predictive models.

Early studies (surveyefficientdnns) relied on counting the number of weights of the deep learning model and on energy look-up tables for estimating the energy cost of DRAM memory accesses during the inference phase on specialized hardware. Such estimation models for deep learning on general-purpose processors such as CPUs and GPUs have only recently emerged (for example, for a mobile CPU (synergy) and for desktop GPUs (neuralpower)). Our work builds upon the former, which models the energy consumption using platform-specific performance counter information. Specifically, we build predictive models at the application level using platform-agnostic neural network features.

Our work also shares similarities with NeuralPower (neuralpower), which builds predictive models for a desktop GPU. However, we differ in three main aspects. First, NeuralPower develops per-layer power and runtime prediction models on desktop GPUs, such as the Titan X, to predict the energy of 5 test ConvNet models; our work focusses on empirical power measurements obtained on resource-constrained mobile devices and comprehensively evaluates 12 representative ConvNets. Second, NeuralPower does not provide an analysis of how to select the features used to build their predictive models; we use statistical analysis to select dominant input features extracted from the algorithm. Third, although the average energy prediction accuracy reported by NeuralPower appears high, it cannot be replicated on a different set of ConvNets, as shown by our results in Section 6.

8. Discussion

The overhead of the power measuring software (the gator daemon) executing on the target device is negligible, approximately 3% (ARMstreamline). Our data collection phase on each platform takes around 5 minutes for all 12 ConvNets. Our feature selection process takes less than a minute, and predictive model training and testing takes approximately 5 ms. This low overhead is beneficial: for a new software-hardware configuration, we pay this cost only once and can use a few ConvNets to approximate the energy use on the platform.

Finally, the predictive models in our work are built at the layer level; any optimization that accelerates a layer, such as a fused-layer implementation (fused), typically happens below this level of abstraction and is thus automatically covered.

9. Conclusions and Future Scope

Deep neural network inference is becoming increasingly popular on low-power mobile devices. In this work, we focus on building energy predictive models by thoroughly investigating the impact of choosing application-level features on the final predictive model accuracy and complexity.

To support the building of predictive models, we extended SyNERGY, a framework for gathering energy measurements on different mobile devices. We compare two types of predictive models found in the literature, based on features selected for layers at different levels: individual layers and layer-type. Our analysis using subset feature selection techniques for individual layer models indicates that highly complex features are required to achieve greater predictive accuracy. However, unlike the results of previous works, we find that predictive models based on layer-type features (for example, summation of operation counts) offer a model complexity 4 to 32 times lower than models using individual layer features, for a similar average accuracy. We further demonstrate that this approach can be extended to other layer-types, with an accuracy of between 76% and 84% using solely the summation of MAC or Op counts as the input feature to a linear model, across different mobile hardware and software combinations (specifically, mobile CPUs and a GPU): Eigen-Snapdragon820, Eigen-TX1, OpenBLAS-TX1 and CuDNN-TX1.

As future work, we aim to extend our modelling studies to layers found in other types of deep neural networks, such as Recurrent Neural Networks, to other devices, and to further explore non-linear modelling strategies.

Acknowledgements.
This research was conducted with support for C. Rodrigues and G.D. Riley from the IS-ENES2 project, funded under the European FP7-INFRASTRUCTURES-2012-1 call (GA No: 312979). C. Rodrigues is also part-funded by Arm under a PhD Studentship Agreement. M. Luján is supported by a Royal Society University Research Fellowship.

References