Exploring Deep Neural Networks on Edge TPU

This paper explores the performance of Google's Edge TPU on feed forward neural networks. We consider the Edge TPU as a hardware platform and explore different architectures of deep neural network classifiers, which have traditionally been challenging to run on resource constrained edge devices. Based on the use of a joint-time-frequency data representation, also known as a spectrogram, we explore the trade-off between classification performance and the energy consumed for inference. The energy efficiency of the Edge TPU is compared with that of the widely used embedded ARM Cortex-A53 CPU. Our results quantify the impact of neural network architectural specifications on the Edge TPU's performance, guiding decisions on the TPU's optimal operating point, where it can provide high classification accuracy with minimal energy consumption. Our evaluations also highlight the crossover in performance between the Edge TPU and the Cortex-A53, depending on the neural network specifications. Based on our analysis, we provide a decision chart to guide platform selection based on the model parameters and context.


I Introduction

Artificial Intelligence (AI) and more specifically machine learning (ML) have gained significant attention from researchers and industry in recent years for solving complex prediction and classification problems. A sizeable portion of the data that is used as input to ML models originates from the Internet of Things (IoT), which has also seen sizeable growth in the number and types of devices deployed globally. The confluence of the two technologies will enable the Intelligent Edge and will result in a significant transformation of pervasive systems towards more decentralised and in-situ analytics. While in most cases ML models will be trained in the cloud, it will be increasingly necessary to run the inference tasks at the edge, due to bandwidth, latency, cost or privacy constraints that prevent the raw sensor data from being transmitted to the cloud. Running advanced ML algorithms on resource-constrained edge devices has been a major challenge. This is particularly so for deep neural network (DNN) models due to their demanding computational, memory and energy footprint. As a result, Google, NVIDIA and others have recently developed dedicated hardware accelerator platforms to support ML inference at the edge. However, the performance-energy trade-off of deep learning on these platforms is currently not well understood, which is what we investigate in this paper.

We specifically explore Google’s Edge TPU [3], which can perform 4 trillion operations per second (TOPS) while consuming only 2 Watts of power. In particular, we explore the Edge TPU’s ability for deep learning at the edge by varying different feed forward neural network hyper-parameters, including the number of input nodes, the number of hidden layers and the number of nodes at the hidden layers. To explore the impact of the number of input nodes without changing the number of input features, we use a joint-time-frequency data representation, in particular the spectrogram.

We consider cattle activity classification, using a realistic large scale dataset, as a motivating use case. Within this context, we convert the activity time series to spectrogram representation. We demonstrate that this representation has excellent scaling properties, and can significantly compress data, and hence the neural network model, while maintaining a high predictive power and classification performance.

As a key contribution of the paper, we perform a systematic analysis of the trade-off between Edge TPU performance and model architecture. We study different scales of the model, obtained via different spectrogram resolutions and neural network hyper-parameters (i.e. different model sizes and numbers of parameters), on different ML hardware and software platforms, and explore the impact on ML performance (classification F1 score) and energy efficiency. The practicality of the Edge TPU is further studied by running the same experiments on two other compute platforms: the ARM Cortex-A53, a power-efficient 64-bit embedded CPU, and a traditional Intel i7-4790 CPU.

Our experimental evaluations reveal a number of interesting findings. The results show that the Edge TPU can perform exceptionally well compared to traditional CPUs, with comparable classification performance at a significantly reduced energy cost, which is critical for edge ML applications. We also find that achieving the optimal performance from the Edge TPU, both in terms of inference performance and energy efficiency, is very sensitive to the choice of model size and number of parameters. The energy efficiency of the Edge TPU can degrade dramatically if the model size or number of parameters is not within the optimal operating region.

Surprisingly, Cortex-A53 is more energy efficient than Edge TPU for very small model sizes. Our evaluations show that there is a relatively narrow “sweet spot” in terms of model size and number of parameters, within which Edge TPU achieves high classification performance and very high energy efficiency. All configurations outside of this sweet spot lead the Cortex-A53 to again outperform the Edge TPU. To the best of the authors’ knowledge, this paper is the first to report this highly non-linear behaviour of the Edge TPU.

Based on our extensive analysis, we provide a decision chart and associated guidelines for selecting the most suitable platform and for tuning the ML model size and structure to achieve the optimal performance from the selected platform, both in terms of inference performance and energy efficiency.

The rest of the paper is organised as follows. Key related work is briefly summarised in Section II. Section III presents the use case application of cattle activity classification and the dataset, and briefly discusses the joint-time-frequency domain representation (spectrogram). Section IV presents our experimental methodology, including spectrogram scaling and DNN scaling, and briefly provides background on the Edge TPU, Cortex-A53 and Intel i7 platforms. Results are presented in Section V, including the classification performance (Section V-A), the energy efficiency (Section V-B) and the ratio of Cortex-A53 and Edge TPU (Section V-C). Section VI provides a decision chart and guidelines on model design for the Edge TPU, followed by conclusions in Section VII.

II Related Work

Machine learning at the edge has recently attracted significant attention, both in industry-based and academic research [15, 19]. A number of ML hardware accelerator platforms, such as Google’s Edge TPU, have recently been developed, and a number of papers have evaluated them. For example, Wisultschew et al. [27] studied and compared the performance and efficiency of Google’s Edge TPU and Intel’s Movidius Neural Compute Stick (NCS) for 3D object detection. Hui et al. [12] compared the 3D object detection accuracy of different ML hardware acceleration platforms, i.e. Google Edge TPU, NVIDIA Xavier, and NovuTensor. Reuther et al. [24] compared the performance of the Edge TPU to an Intel Core i9-9900k CPU, as well as Intel’s Neural Compute Stick 2 (NCS2). As a key result, the paper reports a similar inference performance of the Edge TPU compared to the i9-9900k CPU, but with, not surprisingly, a significantly lower power consumption. Kljucaric et al. [14] compared the performance and efficiency of NVIDIA Xavier, Edge TPU, and NCS2 for optical character recognition using AlexNet and GoogleNet. The authors reported that, while NCS2 is more efficient for AlexNet, the Edge TPU performs better with GoogleNet.

In [10], the architectural trade-offs between computational and energy efficiency for both feed forward and convolutional neural networks are explored on the Edge TPU and Cortex-A53. The authors proposed a deep learning-based network intrusion detection system at the edge for IoT networks, using a time-series dataset with a fixed number of input features. It is demonstrated that, for both types of neural networks, the model size is the determining parameter for the Edge TPU’s performance.

Later, the performance of the Edge TPU accelerators was evaluated extensively by Google’s research team for 423K different convolutional neural networks using the NASBench dataset [29]. The latency and accuracy of different model structures were studied on 3 different configurations of the Edge TPU. However, the Edge TPU configurations are not yet adjustable at the user level. The authors also proposed a graph-based neural network model to predict the performance of a model from its structure.

Our recent preliminary work [11] studies the performance of feed-forward neural networks with 4 hidden layers for different number of input nodes on Intel i7 and the Edge TPU. The focus was on the exploration of the spectrogram data representation and its scaling properties. We demonstrated that spectrogram data representation allows the compression of the data while maintaining its predictive power.

To the best of our knowledge, none of the related works have explored the details of the Edge TPU’s performance in terms of its sensitivity to the architectural model variations and model specifications, which is the focus of this paper. In order to scale the neural network model size, we use the scalability properties of the joint-time-frequency domain data representation, and present the results of our extensive experimental evaluations. These evaluations were performed in the context of a cattle activity classification application, which is discussed in Section III.

III Use Case: Cattle Activity Classification

As a use case, we consider the problem of automated cattle activity classification, which is of increasing interest to the beef and dairy industry [21, 18]. The quantity and quality of beef and dairy produce are directly related to the animals’ health and welfare, which in turn can be inferred from animal behaviour  [20]. It has been demonstrated that deep neural network classifiers have a great potential for this application [8, 9]. While training of the neural network classifier can easily be performed in the cloud, the actual activity classification (inference) needs to be done at the edge, due to limited, intermittent or expensive network connectivity (satellite link) of the embedded sensor devices that are attached to the animals. An example of such an edge device is the Ceres Tag [2], a small solar powered device attached to the ear of cattle, which is equipped with multiple sensors and satellite communication.

Due to computational and energy requirements, it has traditionally been a challenge to run deep learning algorithms on very resource constrained edge devices. This paper explores the Edge TPU through the motivating application of cattle activity classification. In particular, we explore how scaling of the neural network architecture impacts the performance of the Edge TPU. These explorations are applicable to other fields of study that are interested in exploring DNN hyper-parameters, and in using and evaluating the Edge TPU ASIC [10, 9, 11]. We use the joint-time-frequency domain sensor data representation (spectrogram) to scale the input size of the models, since its potential in this application context has been demonstrated in our earlier work [8, 9, 11] (see Footnote 1).

Footnote 1: We note that, while the power requirements of the Edge TPU are still higher than what an ear-tag sized solar power cell can currently support, the use of the Edge TPU is feasible on larger, collar-sized sensors.

III-A Activity Classification Architecture

Fig. 1: A schematic illustration of the activity classification architecture.

An overview of our activity classification architecture, which is discussed in more detail in [9], is shown in Figure 1. The classification process starts on the left with the collection of time series from a tri-axial IMU (Inertial Measurement Unit) sensor that is attached to the subject. The data is then converted into a joint-time-frequency representation (spectrogram), and fed into a feed-forward Deep Neural Network (DNN) to return one of the 9 considered activity classes, as shown in the figure.

III-B Background: Joint Time-frequency Representation

Fig. 2: A sample spectrogram for a 10 second window of cattle activity acceleration signal, with the signal also shown in their respective time-domain and frequency-domain representations.

Many human and animal related activity monitoring studies have been performed by recording and analysing accelerometry signals. Some of these activity classification studies have shown that human activity acceleration [23], human fetal activity acceleration [16, 17], and cattle activity acceleration signals [9] have a time-varying spectral content.

While such signals are traditionally represented in the time or frequency domain, a joint-time-frequency representation of these signals has much greater potential in the context of activity classification [9, 23]. A commonly used method for joint-time-frequency data representation is the spectrogram. The spectrogram $S[n, k]$ of a discrete-time signal $x[n]$ is computed as shown in Equation 1, with $n$ and $k$ representing discrete time and frequency respectively, $w$ representing the windowing function, and $N$ the number of frequency bins [23].

$$S[n, k] = \left| \sum_{m} x[m]\, w[m - n]\, e^{-j 2 \pi k m / N} \right|^{2} \qquad (1)$$

Figure 2 shows an example spectrogram of an acceleration sensor signal, obtained from a sensor placed on cattle for the purpose of activity classification. Our use case application scenario is discussed in more detail in Section III-A. The horizontal and vertical axes represent time and normalised frequency respectively. In this figure, colour indicates the normalised acceleration energy density, ranging from red (high) to blue (low). For illustration purposes, the corresponding acceleration signals in the time-domain and frequency-domain representations are shown at the top and left respectively.
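As an illustration of how such a representation can be obtained, the following is a minimal NumPy sketch of a squared-magnitude short-time Fourier transform. The window type (Hann), the window length and the hop size are assumptions chosen for the example, and the test signal is synthetic; none of these are the settings or data used in the paper.

```python
import numpy as np

def spectrogram(x, win_len=64, hop=32):
    """Squared-magnitude STFT of a 1-D signal (rows: frequency, columns: time)."""
    window = np.hanning(win_len)                 # assumed window function
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    # rfft keeps the non-negative frequencies of each real-valued frame
    return np.abs(np.fft.rfft(frames, axis=1)).T ** 2

# Synthetic 10 s signal sampled at 50 Hz (not data from the paper):
# a 2 Hz tone for the first half, a 10 Hz tone for the second half.
fs = 50
t = np.arange(10 * fs) / fs
x = np.where(t < 5, np.sin(2 * np.pi * 2 * t), np.sin(2 * np.pi * 10 * t))

S = spectrogram(x)
print(S.shape)                                   # (frequency bins, time frames)
print(S[:, 0].argmax(), S[:, -1].argmax())       # dominant bin moves upwards over time
```

The shift of the dominant frequency bin between the first and last frame is the kind of time-varying spectral content that a pure time-domain or frequency-domain view would not show.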

III-C Dataset

The dataset we used for our experiments was collected by CSIRO in Armidale, Australia, over a one month period in 2018. The data was obtained from tri-axial IMU (Inertial Measurement Unit) sensors that were attached via collars to 10 cattle. The sensors collected accelerometer, magnetometer and gyroscope readings in 3 axes, resulting in 9 streams of time-series sensor data sampled at 50 Hz, with a total of more than 3.5 million samples. The dataset was labelled manually (by human recorders) with the corresponding 9 cattle activity classes. Further details on the dataset are available in [9].

IV Experimental Methodology

In this study, to evaluate the Edge TPU performance, we explore the DNN hyper-parameters, focusing on the number of nodes at the input layer, the number of hidden layers (L), and the number of nodes at the hidden layers. Since the model size is determined by the input layer size (spectrogram resolution), the number of hidden layers and their number of nodes, changing these hyper-parameters scales the model size and the number of parameters per layer. In the following sections, the spectrogram scaling and DNN architecture scaling are explained.

IV-A Spectrogram Scaling

Fig. 3: The different steps of scaled resolution for an example spectrogram.

In order to match the input size to the varied number of nodes at input layer, we use the scaling properties of the spectrogram data representation. The goal is to explore different scales (or resolutions) of the spectrogram and evaluate the impact on the model size, classification performance and energy consumption. The focus is on the model architectural variation and energy efficiency trade-off, which is crucial for resource-constrained edge devices.

In this context, we define spectrogram resolution as the ratio of the one-dimensional pixel count compared to the original (full) resolution. A resolution of 50% refers to the case where the number of pixels in each dimension is halved, which results in a spectrogram image with one quarter of the size, i.e. number of pixels. The lowest considered resolution of 10% has therefore a size of only 1% of the original resolution.
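The relation between resolution and image size described above is simple arithmetic, sketched below. The 100 x 100 full-resolution shape is a hypothetical placeholder used only to make the numbers easy to read; it is not the spectrogram size used in the paper.

```python
def pixel_fraction(resolution):
    """Fraction of the original pixel count left after scaling both dimensions."""
    return resolution ** 2

def scaled_shape(shape, resolution):
    """Image shape after scaling each dimension by `resolution`."""
    rows, cols = shape
    return round(rows * resolution), round(cols * resolution)

full_shape = (100, 100)   # hypothetical size, not the paper's spectrogram
for r in (1.0, 0.5, 0.1):
    print(f"resolution {r:.0%}: shape {scaled_shape(full_shape, r)}, "
          f"{pixel_fraction(r):.0%} of the pixels")
```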

In our experiments, we consider spectrograms for 9 axes collected from three tri-axial sensors. A spectrogram with 100% resolution leads to 3875 nodes per axis and, in total, 34875 nodes at the input layer of the DNN. We consider spectrogram resolutions in steps of 10%, i.e. from 10% to 100%. The scaled versions of the spectrograms are computed via bicubic downsampling [28]. As an illustration, Figure 3 shows the different resolutions of an example spectrogram (10 second window) of the accelerometry signal for the “Ruminating Lying” activity, which is one of the 9 cattle activity classes that we are considering.

IV-B Neural Network Architecture and Scaling

Fig. 4: The neural network architecture.

Figure 4 shows a schematic of the feed forward neural network architecture used in this study. It is built as a stack of fully connected layers comprising the input layer, a varying number of hidden layers and the output layer. The number of nodes at the first (input) layer equals the size of the input spectrogram, which is determined by its resolution. The hidden layers have a configurable number of nodes with the Rectified Linear Unit (ReLU) activation function for all nodes, except for the last hidden layer, which has 32 nodes. The output layer uses the Softmax function to return one of the 9 considered activity classes, as shown in Figure 4. The loss function for this model is Sparse Categorical Cross-entropy.
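To make the link between architecture and model size concrete, the following sketch counts the weights and biases of such a fully connected network. The default of 64 nodes per hidden layer is taken from Table II; the one-byte-per-parameter size estimate is an assumption that ignores any file-format overhead, so it should be read as a rough lower bound rather than the model sizes reported in the paper.

```python
def dense_param_count(n_input, n_hidden_layers, hidden_width=64,
                      last_hidden_width=32, n_output=9):
    """Weights plus biases of a fully connected feed forward network."""
    widths = ([n_input]
              + [hidden_width] * (n_hidden_layers - 1)
              + [last_hidden_width, n_output])
    # each dense layer has fan_in * fan_out weights and fan_out biases
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(widths, widths[1:]))

params = dense_param_count(n_input=377, n_hidden_layers=2)
print(params)                                     # 26569
print(f"{params / 1e6:.3f} MB at one byte per parameter (assumed)")
```

With wide inputs the first layer dominates the count, which is why the spectrogram resolution is such an effective lever on model size.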

| Parameter | Definition |
|---|---|
| Hidden layers (L) | number of hidden layers |
| Input size | number of nodes at input layer |
| Hidden layer width | number of nodes at each hidden layer except the last |
| Last hidden layer width | number of nodes at the last hidden layer |
| Output size | number of nodes at output layer |

TABLE I: The considered hyper-parameters

The design space of the DNN model is explored by varying the number of input nodes, the number of hidden layers (L) and the number of nodes at the hidden layers. Table I lists these hyper-parameters and their definitions.

In the first set of experiments, for each number of input nodes, i.e. each spectrogram resolution/scale, the number of hidden layers (L) is doubled in each step, from 2 up to 64. When increasing the number of hidden layers, the number of nodes at the hidden layers remains fixed. In the next set of experiments, the number of input nodes and the number of hidden layers are fixed at 5400 and 128, respectively (Table III), and the number of nodes at the hidden layers is doubled for each half of the hidden layers at each step. Finally, to investigate the sensitivity of the Edge TPU to the choice of parameters at the hidden layers, we fix the number of hidden layers at 2 and the number of input nodes at 377 (Table IV), and vary the number of nodes at the first hidden layer from 5000 to 10,000 in steps of 100.
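The experiment grid can be enumerated programmatically. The sketch below builds the first and the last set of configurations from the values listed in Tables II and IV; the dictionary keys are illustrative names, not identifiers from the authors' code.

```python
from itertools import product

# Values from Table II
INPUT_NODES = [377, 1350, 3420, 5400, 9072, 12150, 17424, 22500, 28476, 34875]
HIDDEN_LAYERS = [2 ** k for k in range(1, 7)]     # 2, 4, 8, 16, 32, 64

# First set: every input size combined with every depth.
depth_sweep = [{"input_nodes": n, "hidden_layers": depth}
               for n, depth in product(INPUT_NODES, HIDDEN_LAYERS)]

# Last set (Table IV): 2 hidden layers, 377 inputs,
# first hidden layer from 5000 to 10000 nodes in steps of 100.
width_sweep = [{"input_nodes": 377, "hidden_layers": 2, "first_hidden_nodes": w}
               for w in range(5000, 10001, 100)]

print(len(depth_sweep), len(width_sweep))         # 60 51
```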

IV-C Setup: Software and Hardware Platforms

Fig. 5: (a) Coral Dev Board and (b) Coral USB Accelerator with Edge TPU and (c) Raspberry Pi3 model B+ platforms.

While TensorFlow (TF) Lite version 2.5.0 was our main machine learning software platform, different hardware platforms were investigated to study the practicality of the Edge TPU. TF Lite is a deep learning framework aimed at resource constrained embedded and IoT devices, which uses 8-bit unsigned integers instead of the 32-bit floating point numbers used in TF [1, 25]. On the hardware side, while the focus of the paper is the exploration of the Edge TPU, two further platforms are considered for comparison: a common embedded 64-bit CPU and a traditional CPU-based workstation.
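To illustrate what replacing 32-bit floats with 8-bit unsigned integers involves, here is a minimal sketch of generic affine quantisation, where a real value is approximated as `scale * (q - zero_point)`. This is a textbook scheme written for illustration under that assumption; it is not claimed to be the exact procedure applied by the TF Lite converter, and the weight values are made up.

```python
import numpy as np

def quantize_uint8(x):
    """Affine quantisation of a float array to uint8: x ~ scale * (q - zero_point)."""
    lo, hi = min(float(x.min()), 0.0), max(float(x.max()), 0.0)
    scale = (hi - lo) / 255.0 or 1.0              # avoid a zero scale for all-zero input
    zero_point = int(round(-lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map uint8 codes back to approximate real values."""
    return scale * (q.astype(np.float32) - zero_point)

weights = np.linspace(-1.0, 1.0, 11, dtype=np.float32)   # made-up example weights
q, scale, zp = quantize_uint8(weights)
error = float(np.abs(dequantize(q, scale, zp) - weights).max())
print(q.dtype, f"max error {error:.4f} vs step {scale:.4f}")
```

Each parameter then occupies one byte instead of four, at the cost of a rounding error bounded by roughly one quantisation step.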

IV-C1 Google Edge TPU

In 2019, Google launched the Edge TPU, a purpose-built ASIC hardware accelerator for machine learning applications with high performance and a low energy footprint, with the aim of running machine learning inference at the edge [3]. Figure 5-(a) and (b), respectively, show the Coral Dev Board and the USB Accelerator both with Google’s Edge TPU. The TPU is based on a systolic array architecture, which allows performing matrix multiplication highly efficiently and at a massive scale. All operations are limited to 8 bit integers, which increases both the performance and energy efficiency [13]. In our experiments we used both the Coral USB Accelerator [7] and the Coral Dev Board [6].

IV-C2 ARM Cortex-A53

The first comparison platform is the Raspberry Pi 3B+, which is equipped with a 64-bit quad-core Arm Cortex-A53 CPU @1.4GHz and 1GB RAM. It is powered by a 5V-2.5A power supply and runs the Raspbian GNU/Linux OS. Figure 5-(c) shows the credit card sized Raspberry Pi 3B+.

IV-C3 Intel i7

The second comparison platform is a traditional CPU-based workstation using an Intel i7-4790 @3.60 GHz CPU and 64GB of RAM, running the Linux kernel 4.15.

V Results

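| Parameter | Range |
|---|---|
| Number of hidden layers (L) | {2, 4, 8, 16, 32, 64} |
| Number of nodes at input layer | {377, 1350, 3420, 5400, 9072, 12150, 17424, 22500, 28476, 34875} |
| Number of nodes at each hidden layer except the last | 64 |
| Number of nodes at the last hidden layer | 32 |
| Number of nodes at output layer | 9 |

TABLE II: Summary of the considered hyper-parameters to explore the design space of DNN models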

Our results mainly focus on the classification performance, energy efficiency and memory considerations in terms of the architectural variations of the neural network models. We also consider the ratio of energy efficiency between the Edge TPU and Cortex-A53, which is of critical importance for deploying machine learning on resource constrained edge devices and provides an overall view of the performance of each platform.

V-A Classification Performance vs Model Variations

Table II presents the range of hyper-parameters explored in the first set of experiments. The number of hidden layers is varied from 2 to 64, doubling at each step, and the numbers of input nodes are specified in the second row of the table. The rest of the hyper-parameters are fixed as shown in the remaining rows of the table.

Fig. 6: The average F1 score versus number of hidden layers for configurations in Table II implemented on i7, Cortex-A53, and Edge TPU.

Figure 6 shows the F1 score as a function of the number of hidden layers for different numbers of input nodes. Different colours indicate different numbers of input nodes, as shown in the colour bar. The results are shown for the three HW platforms and, as can be seen, the F1 score is very similar across the three platforms. In all cases, except the case with 377 input nodes, the F1 score first rises with an increasing number of hidden layers, up to 88.8%, and then drops. It can also be seen that the maximum F1 score is achieved with 1350 to 9072 input nodes and 16 to 32 hidden layers. The F1 score declines for 64 hidden layers due to over-fitting to the majority class (our dataset is imbalanced).
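Why the F1 score is a useful metric on an imbalanced dataset can be seen in a small example. The sketch below computes a macro-averaged F1 score; the averaging scheme is an assumption made for illustration, since the text only speaks of an average score, and the labels are toy data rather than the cattle dataset.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))

# Toy imbalanced labels: a classifier that always predicts the majority class.
y_true = np.array([0] * 8 + [1, 2])
y_pred = np.zeros(10, dtype=int)

accuracy = float(np.mean(y_true == y_pred))
f1 = macro_f1(y_true, y_pred, n_classes=3)
print(f"accuracy {accuracy:.2f}, macro F1 {f1:.2f}")
```

A model that collapses onto the majority class keeps a high accuracy but a low macro F1, which is the kind of degradation described for the deepest networks.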

Fig. 7: Energy per inference vs model size for configurations in Table II implemented on i7, Cortex-A53, and Edge TPU.

V-B Energy Efficiency vs Model Variations

In order to estimate the energy per inference for the Edge TPU, we measure the inference time and multiply it by the peak power consumption of 2 Watts, as specified in [5]. The same method is used for the Cortex-A53 CPU on a Raspberry Pi 3B+, with an active bare-board average power consumption of 2 Watts. For the measurement of energy per inference for the Intel i7, we used the PyRAPL [26] toolkit, which takes advantage of Intel’s proprietary “Running Average Power Limit” (RAPL) technology [22].
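The time-times-power estimate used for the Edge TPU and the Cortex-A53 can be sketched as follows. The function names, the number of repetitions and the stand-in workload are illustrative assumptions; on a device, `run_inference` would invoke the deployed model.

```python
import time

def energy_mj(latency_s, power_w=2.0):
    """Energy in millijoules: latency (s) times power (W), converted from J to mJ."""
    return latency_s * power_w * 1e3

def mean_latency_s(run_inference, repeats=100):
    """Average wall-clock time of one call to `run_inference`."""
    start = time.perf_counter()
    for _ in range(repeats):
        run_inference()
    return (time.perf_counter() - start) / repeats

# Stand-in workload used only to make the sketch runnable.
latency = mean_latency_s(lambda: sum(range(1000)))
print(f"{energy_mj(latency):.6f} mJ per inference at an assumed constant 2 W")
```

Because the power figure is held constant, differences in estimated energy on these two platforms reflect differences in inference time only.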

V-B1 Exploring model size

Figure 7 shows the inference energy in millijoules (mJ) for a single inference step as a function of model size in MB, which includes the effect of both the input layer size (spectrogram resolution) and the number of hidden layers. Different colours identify the number of input nodes. Results are shown for the set of hyper-parameters in Table II on the three HW platforms.

For the Intel i7, Figure 7 shows a roughly linear increase. The slope and intercept of the regression line are 13.623 mJ/MB and 1.74 mJ, respectively. The reliability of the linear relationship is supported by the correlation coefficient and the coefficient of determination.
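A fit of this kind can be reproduced with a few lines of NumPy. The data points below are synthetic, generated around the reported slope and intercept with an arbitrary noise level and an arbitrary range of model sizes; they are not the paper's measurements and serve only to show how the regression coefficients, the correlation coefficient and the coefficient of determination are obtained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic points scattered around the reported i7 line (not measured data).
size_mb = np.linspace(0.03, 3.0, 20)
energy_mj = 13.623 * size_mb + 1.74 + rng.normal(0.0, 0.5, size_mb.size)

slope, intercept = np.polyfit(size_mb, energy_mj, deg=1)
r = np.corrcoef(size_mb, energy_mj)[0, 1]   # correlation coefficient
r_squared = r ** 2                           # coefficient of determination for a simple linear fit

print(f"slope {slope:.2f} mJ/MB, intercept {intercept:.2f} mJ, R^2 {r_squared:.3f}")
```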

In the case of the Cortex-A53, Figure 7 shows a qualitatively similar linear pattern. The slope of the regression line is significantly lower than that of the Intel i7.

Finally, Figure 7 shows the inference energy for Edge TPU platform. Here, we observe a surprising bimodal behaviour, with an almost constant and very low energy usage of 0.29 mJ to 1.05 mJ per inference for the input nodes of 377 to 5400 and the model sizes of 0.03 MB to 0.93 MB.

The Edge TPU curve consists of one linear section (corresponding to 377 to 5400 input nodes) and multiple almost constant sections (corresponding to 9072 to 34875 input nodes). A regression line is fitted to the linear section. While for the first section (377 to 5400 input nodes) the energy usage increases linearly with the model size, for larger numbers of input nodes (9072 to 34875) it jumps with each increase in the number of input nodes, but then stays almost constant as the model size grows further through additional hidden layers.

Fig. 8: The comparison of energy per inference vs model size on i7, Cortex-A53, and Edge TPU, focusing on the effect of the number of input nodes on Edge TPU performance.

Figure 8 shows the three energy consumption graphs from Figure 7 side by side and compares the range of energy consumption and the behaviour of the three HW platforms. To highlight the bimodal behaviour of the Edge TPU, two different colours are used to indicate its energy consumption for fewer and for more than 5400 input nodes. The zoomed region (lower left) shows that, for models smaller than 0.15 MB, the Cortex-A53 is more efficient than the Edge TPU. This cross-over is also reported in another paper of ours [10] for small models with different neural network configurations in another application. Beyond 0.15 MB, for up to 5400 input nodes, the Edge TPU is more efficient than the Cortex-A53; however, for more than 5400 input nodes, the Cortex-A53 overtakes the Edge TPU again.
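The cross-over observations above can be condensed into a small rule of thumb. The sketch below encodes only the two thresholds mentioned for Figure 8 (0.15 MB and 5400 input nodes) for the configurations of Table II; it is an illustrative simplification, not the decision chart provided later in the paper, and the function name is hypothetical.

```python
def more_efficient_platform(model_size_mb, input_nodes):
    """Platform with the lower energy per inference, per the Figure 8 observations."""
    if model_size_mb < 0.15:
        return "Cortex-A53"      # very small models favour the CPU
    if input_nodes <= 5400:
        return "Edge TPU"        # the Edge TPU's efficient operating region
    return "Cortex-A53"          # beyond 5400 input nodes the CPU overtakes again

print(more_efficient_platform(0.05, 377))
print(more_efficient_platform(0.60, 5400))
print(more_efficient_platform(1.20, 9072))
```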

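| Parameter | Range |
|---|---|
| Number of hidden layers (L) | 128 |
| Number of nodes at input layer | 5400 |
| Number of nodes at hidden layers except the last (first half, second half) | {(64, 64), (128, 64), (128, 128), (256, 64), (300, 64), (305, 64), (310, 64), (256, 128), (256, 256), (512, 64), (512, 128)} |
| Number of nodes at the last hidden layer | 32 |
| Number of nodes at output layer | 9 |

TABLE III: Summary of the considered hyper-parameters to investigate the sensitivity of Edge TPU to the model size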

To further investigate the bimodal behaviour of the Edge TPU, we considered increasing the model size of the largest model before the transition of the Edge TPU, with 128 hidden layers and 5400 input nodes, and studied the energy consumption and memory usage on the Edge TPU and Cortex-A53. In this experiment, the model size is increased by increasing the number of nodes at the hidden layers. The models in this set of experiments have a fixed number of input nodes, while the number of nodes for half of the hidden layers is doubled in each step. Table III summarises the hyper-parameters used in this set of experiments. In this table, each pair gives the number of nodes at the hidden layers for the two halves of the hidden layers, respectively. One pair from the listed range is used in each step of the experiment.

Figure 9 shows the energy usage versus the model size for this experiment. Up to the transition point, the energy usage grows roughly in proportion to the model size, and the comparison with the Cortex-A53 shows that the Edge TPU is more efficient for these models with 5400 input nodes. However, the advantage of the Edge TPU lasts only until the transition happens. Beyond 8 MB, which is the size of the available on-chip memory, the energy usage curve increases sharply.

Fig. 9: Energy per inference vs model size for configurations in Table III on Edge TPU and Cortex-A53.
Fig. 10: On-chip and off-chip memory usage vs model size for configurations in Table III on Edge TPU and Cortex-A53.

In order to understand the reason for the transition and the cause of the bimodal behaviour, we investigated the compilation reports provided by the Edge TPU compiler [4], with a particular focus on the memory usage (see Footnote 2). The Edge TPU uses two main parts of memory for the storage of its model parameters, i.e. on-chip and off-chip memory. Figure 10 shows the memory usage in KB vs the model size, for the on-chip and off-chip memories. As can be seen, for models larger than 8 MB, the Edge TPU starts using off-chip memory to meet its memory requirements. This is consistent with the behaviour shown in Figure 9, where the transition happens for models larger than 8 MB, which is the amount of available SRAM on the Edge TPU.

Footnote 2: The Edge TPU compiler aims to extract the highest level of parallelism for the execution of the operations when it maps the supported operations onto the Edge TPU [29]. All the neural network operations used in this study are supported by the Edge TPU and are mapped to execute on chip.

V-B2 Exploring number of nodes per layer

Fig. 11: Energy per inference vs number of hidden layers for configurations in Table II on Edge TPU.

The previous experiment demonstrated that model size is a determining factor in energy usage; here we show that it is not the only one. In Figure 8, the energy consumption curves of models with different numbers of input nodes and hidden layers overlap in terms of model size, which indicates that model size alone does not determine the bimodal behaviour of the Edge TPU. As mentioned earlier, model size is affected by both the number of hidden layers and the number of nodes per layer. Figure 11 shows the energy consumption of the Edge TPU versus the number of hidden layers for different numbers of input nodes. Energy usage is not sensitive to an increasing number of hidden layers, but it is sensitive to an increasing number of input nodes. For 377 to 5400 input nodes, the range and slope of the energy usage are almost the same, whereas beyond 5400 input nodes the energy consumption increases in a quantised manner. This quantised behaviour of the Edge TPU is attributable to the HW architecture of the ASIC.

Fig. 12: Energy per inference on left (a) and model size on right (b) vs different spectrogram resolutions for Edge TPU. Different line markers show different number of hidden layers and the colour bar shows different number of input parameters which maps to the spectrogram resolutions.

From a different point of view, Figure 12(a) shows the energy consumption of the Edge TPU versus the number of input nodes (spectrogram resolution). Here, we observe the bimodal behaviour of the Edge TPU regardless of the number of hidden layers, with an almost constant and very low energy usage of 0.29 mJ per inference for resolutions of 10-40% and two hidden layers, and an almost linear increase for the higher spectrogram resolutions, i.e. 50-100%. The same behaviour is observed for the other numbers of hidden layers, with a slightly different range of 0.78-1.05 mJ per inference for resolutions of 10-40% and 128 hidden layers; the transition again happens for resolutions beyond 40%. In Figure 12(b), the size of all models remains below 8 MB.

Fig. 13: Used off-chip and on-chip memory vs different spectrogram resolutions for Edge TPU. Different line markers show different number of hidden layers and the colour bar shows different number of input parameters which maps to the spectrogram resolutions.

The results of our investigation of the compilation reports are illustrated in Figure 13. This figure shows the Edge TPU memory usage in KB, separately for on-chip and off-chip memory, over the range of input nodes (spectrogram resolutions) and hidden layers. We observe that, regardless of the number of hidden layers, for 377 up to 5400 input nodes (a resolution of 10% up to 40%) the use of on-chip memory increases linearly, while the off-chip memory is unused. In this range, the entire neural network model and all its parameters fit into the on-chip memory, resulting in highly efficient operation. This is consistent with the energy per inference curve (see Figure 12(a)), which shows a minimal and almost constant energy consumption in that range, from 0.29 mJ to 1.05 mJ per inference. When the number of input nodes increases from 5400 to 9072, the on-chip memory usage drops by 76% on average across the different numbers of hidden layers. The rest of the required memory is provided by a corresponding use of off-chip memory. The amount of off-chip memory is fixed for each number of input nodes, regardless of the number of hidden layers or the model size.

The off-chip memory usage increases in steps, and this pattern matches the quantised pattern of energy usage observed in Figure 14. Here, the model size is far less than 8 MB (see Figure 12(b)). It seems the storage of model parameters in the on-chip memory follows an “all-or-nothing” approach, with storage of the entire model in on-chip memory if it fits, and storage of the entire model off-chip if not.

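Parameter | Range
Number of hidden layers | 2
Number of input nodes | 377
Number of nodes at the first hidden layer | {5000, 5100, 5200, …, 10000}
Number of nodes at the last hidden layer | 32
Number of nodes at the output layer | 9
TABLE IV: Summary of the considered hyper-parameters to investigate the sensitivity of Edge TPU to number of parameters per layer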

So far, we have demonstrated the role of model size and the number of input parameters in energy consumption. To study the effect of the number of parameters in the hidden layers on the performance of the Edge TPU, we investigated the smallest on-chip models according to the results shown in Figure 8, with 2 hidden layers and 377 input nodes. The exploration focuses on the number of nodes at the first hidden layer, ranging from 5000 to 10,000 in steps of 100. The numbers of nodes at the last hidden layer and the output layer are fixed at 32 and 9, respectively. The ranges of the hyper-parameters are summarised in Table IV.
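The sweep described above can be sketched in a few lines of Python. This is only an illustration: it assumes plain fully connected layers with one bias per node, and the helper names `dense_parameter_count` and `sweep_configurations` are hypothetical, not taken from the study's code. The counts it produces are raw parameter counts and are not expected to match the compiler-reported model sizes.

```python
def dense_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network.

    Assumption: every layer is dense and has one bias per node.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def sweep_configurations(n_input=377, n_last_hidden=32, n_output=9,
                         start=5000, stop=10000, step=100):
    """Yield the layer sizes of each configuration in Table IV."""
    for n_first_hidden in range(start, stop + step, step):
        yield [n_input, n_first_hidden, n_last_hidden, n_output]


configs = list(sweep_configurations())
print(len(configs), "configurations")
for layers in (configs[0], configs[-1]):
    print(layers, dense_parameter_count(layers), "parameters")
```

Only the width of the first hidden layer varies, so the parameter count grows linearly across the sweep.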

Fig. 14: Energy per inference vs number of nodes at hidden layer for configurations in Table IV on Edge TPU.
Fig. 15: On-chip and off-chip memory usage and model size vs number of nodes at hidden layer for configurations in Table IV on Edge TPU.

Figure 14 shows the energy per inference versus the number of nodes at the first hidden layer. An almost constant and very low energy usage of 0.39 mJ per inference is observed for 5000 to 5900 nodes. At 6000 nodes a small step raises the energy to 0.63 mJ; the step accounts for 38% of the new level. The reason for this jump is not clear, but we assume it is related to the hardware architecture of the Edge TPU.

Between 6000 and 8000 nodes, the energy usage increases linearly to 0.72 mJ. Then at 8000 nodes a significant step is observed, which raises the energy usage to 1.72 mJ; the step accounts for 58% of the new level. Beyond 8100 nodes, the energy usage increases linearly and reaches 2.13 mJ. The compilation reports for the models described above are presented in Figure 15, which shows the on-chip and off-chip memory usage and the model size. Increasing the number of nodes linearly increases the model size from 1.85 MB to 3.69 MB. Interestingly, at 8100 nodes the transition happens and the Edge TPU starts to use off-chip memory, even though the model size is still smaller than the available on-chip memory. We believe the reason is the parameter memory of the Edge TPU, which is equal to 8192 (the amount of on-chip parameter memory is reported in [29]). This explains the significant gap observed in Figure 11 between 5400 and 9072 input nodes, and also the significant jump in Figure 14 at 8000 nodes, i.e. the bimodal behaviour of the Edge TPU. Hence, the number of nodes per layer is a determining factor for energy usage, as is the model size. The number of nodes per layer should stay below 8000 for the model to fit into the available on-chip memory.
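The percentages quoted for the two steps in Figure 14 appear to be expressed relative to the energy after each step rather than before it. The short check below uses only the energy values quoted above; the helper name `step_fractions` is hypothetical and the reading of the percentages is an assumption.

```python
def step_fractions(before_mj, after_mj):
    """Return the step size relative to the pre-step and post-step energy."""
    step = after_mj - before_mj
    return step / before_mj, step / after_mj


# Energy per inference (mJ) before and after each step, as quoted in the text.
small_step = step_fractions(0.39, 0.63)   # step at 6000 nodes
large_step = step_fractions(0.72, 1.72)   # step at 8000 nodes

for name, (vs_before, vs_after) in (("small", small_step),
                                    ("large", large_step)):
    print(f"{name} step: {vs_before:.0%} of the previous level, "
          f"{vs_after:.0%} of the new level")
```

Under this reading, the 38% and 58% figures correspond to the second value of each pair.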

The Edge TPU compiler determines the mapping of data based on the available on-chip memory (processing engine memory and core memory), the model parameters (input activations, weight parameters), and outputs. The Edge TPU ASIC is built on 2D arrays of processing engines (PEs), which follow Single Instruction Multiple Data (SIMD) logic. Inside a PE, one or more compute lanes and the shared PE memory are engaged in processing the set of activations supplied in each cycle. The scarce on-chip memory, comprising the PE memory and the core memory, is used for storing input activations, partial sums, outputs of computations, and weight parameters. With SIMD, caching model parameters provides efficiency by reducing the parameter transfers needed in each cycle [29]. However, when the model size or the number of parameters exceeds the on-chip memory limit, the transition happens and the off-chip memory is used for streaming model parameters, resulting in the bimodal behaviour of the Edge TPU. The IO bandwidth then becomes the critical factor. The bimodal behaviour of the Edge TPU can be triggered when the number of parameters per layer is greater than 8000 or when the model size is greater than 8 MB. Hence, to stay on-chip, the model specifications should stay below these limits.

V-C Energy Efficiency Ratio of Edge TPU over Cortex-A53

Fig. 16: The ratio for energy efficiency of Edge TPU over Cortex-A53 vs model size; focusing on number of input nodes.

In machine learning (ML) at the edge, both performance, e.g. the classification score in our use case, and energy efficiency are critical. So far, we have considered classification performance and energy efficiency across different variations of the DNN architectural design space, including the number of input nodes, the number of hidden layers and the number of nodes at the hidden layers. We have seen that there is no significant difference in classification performance between the HW platforms. However, the energy efficiency of the Edge TPU is bimodal and depends on both the model size and the number of parameters per layer. Because of this bimodal behaviour, Cortex-A53, with its linear behaviour, outperforms the Edge TPU at some points.

To simplify the comparison of Cortex-A53 and the Edge TPU, we summarise the results by considering the ratio of the energy efficiency of the Edge TPU over that of Cortex-A53. Figure 16 shows this ratio on the y-axis and the model size (MB) on the x-axis. For models with fewer than 5400 input nodes, Cortex-A53 is more efficient than the Edge TPU when the model size is less than 0.15 MB. As the model size increases, the Edge TPU outperforms Cortex-A53. The advantage of the Edge TPU is highest just before the model size exceeds 8 MB, and Cortex-A53 overtakes the Edge TPU at 13.5 MB. However, for more than 5400 input nodes, Cortex-A53 outperforms the Edge TPU regardless of the model size. It is important to note that the Edge TPU is highly sensitive to the choice of model size and number of parameters per layer.
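As a minimal sketch of how such a ratio could be computed from measured energy per inference, assume that energy efficiency is the number of inferences per unit of energy, so that the ratio reduces to the CPU energy divided by the TPU energy. The function name `efficiency_ratio` is hypothetical, and the input values in the example are placeholders, not measurements from this study.

```python
def efficiency_ratio(tpu_mj_per_inference, cpu_mj_per_inference):
    """Energy efficiency of the Edge TPU over the CPU.

    Assumption: efficiency = inferences per mJ = 1 / energy per inference,
    so a ratio above 1 means the Edge TPU is the more efficient platform.
    """
    if tpu_mj_per_inference <= 0 or cpu_mj_per_inference <= 0:
        raise ValueError("energy per inference must be positive")
    return cpu_mj_per_inference / tpu_mj_per_inference


# Placeholder values for illustration only.
ratio = efficiency_ratio(tpu_mj_per_inference=0.5, cpu_mj_per_inference=1.0)
print(f"ratio = {ratio:.2f}",
      "(Edge TPU ahead)" if ratio > 1 else "(CPU ahead)")
```

With this definition, the cross-over points are the model sizes at which the ratio passes through 1.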

VI Design Considerations for Using Edge TPU

Our investigations show that great care needs to be taken when designing neural network models for Edge TPU. If done right, and with a model size that fits on the on-chip memory, an excellent energy performance and efficiency can be achieved. Conversely, if the model is too large and does not fit in the on-chip memory, the energy efficiency of the Edge TPU is not much better than that of traditional CPUs. For practical ML applications at the edge, it is therefore critical to have the ability to scale the model size to the “sweet spot”, which provides both good performance and energy efficiency.

Fig. 17: Suggested co-design for feed forward neural networks implementation on Edge TPU.

The Edge TPU specifications, at the time of this study, are 8 MB of on-chip memory and an on-chip parameter memory of 8192. According to the findings of this study, the Edge TPU outperforms Cortex-A53 between two points, which we refer to as cross-over point 1 at 0.15 MB and cross-over point 2 at 12.5 MB. The key points to consider when designing a feed forward neural network to be implemented on the Edge TPU are listed below:

  • If the model is small with the model size less than the cross-over point 1, Cortex-A53 is more efficient than Edge TPU.

  • If the model size is less than the available on-chip memory (8 MB) and the number of parameters per layer is fewer than the on-chip parameter memory (8192), Edge TPU performs at its sweet spot.

  • If the number of parameters per layer is more than the on-chip parameter memory, Cortex-A53 is more efficient.

  • If the model size is greater than the cross-over point 2, then Cortex-A53 is more efficient.

Figure 17 summarises the above points in a flowchart for feed forward neural networks implementation on Edge TPU.
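The design rules listed above can be written down as a small decision function. This is a sketch based only on the bullet points, not a transcription of Figure 17: the function name, the order of the checks, the handling of values exactly on a threshold, and the treatment of models between 8 MB and cross-over point 2 (taken here as still favouring the Edge TPU, although off-chip) are assumptions. The thresholds are exposed as parameters so that they can be adjusted.

```python
def choose_platform(model_size_mb, max_params_per_layer,
                    crossover_1_mb=0.15, crossover_2_mb=12.5,
                    on_chip_memory_mb=8.0, on_chip_param_memory=8192):
    """Suggest a platform for a feed forward network (hypothetical helper)."""
    if model_size_mb < crossover_1_mb:
        return "Cortex-A53"                      # model too small to benefit
    if max_params_per_layer > on_chip_param_memory:
        return "Cortex-A53"                      # a layer exceeds parameter memory
    if model_size_mb > crossover_2_mb:
        return "Cortex-A53"                      # beyond cross-over point 2
    if model_size_mb < on_chip_memory_mb:
        return "Edge TPU (on-chip sweet spot)"
    return "Edge TPU (off-chip)"                 # assumed: between 8 MB and cross-over 2


for size_mb, params in ((0.1, 100), (2.0, 5000), (2.0, 9000),
                        (10.0, 5000), (20.0, 5000)):
    print(size_mb, params, "->", choose_platform(size_mb, params))
```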

VII Conclusions

In this paper, we have evaluated Google’s Edge TPU for deep feed forward neural networks at the edge, and compared its performance with that of both traditional Intel and embedded 64-bit ARM CPUs. For our exploration, we considered the use case of activity classification, in particular for cattle, which requires inference to run at the edge on highly resource-constrained embedded systems. A key focus of this paper was the exploration of DNN model architectural variations. In our experiments we used the spectrogram data representation to scale the input size of the DNN, as it has excellent scaling properties, i.e. it allows the data to be compressed to a fraction of its original size without sacrificing much of its predictive power, and hence classification accuracy, in the context of our application. Further, we explored the hyper-parameters of the DNN architecture that scale the model size and the number of parameters per layer of the feed forward neural network.

Our experimental results have shown that the Edge TPU can provide excellent classification accuracy and energy efficiency that is significantly higher than that of traditional CPUs, and hence has great potential for application in energy constrained IoT devices. However, we have also shown that the Edge TPU's performance and energy efficiency are highly sensitive to the neural network model size and number of parameters. If the model size or the number of parameters per layer is too large to fit in the on-chip TPU memory, the energy efficiency drops significantly, to a level that is almost on par with the considered CPUs. Surprisingly, for very small neural network models Cortex-A53 is faster and more energy efficient than the Edge TPU.

While our exploration was based on a specific use case application and a specific edge ML hardware platform, we believe the results are applicable more generally in the context of ML at the edge.

Acknowledgments

We acknowledge the support of the following researchers with regard to data collection: Greg J. Bishop-Hurley with CSIRO Agriculture and Food; Paul Greenwood, Alistair Donaldson and Reg Woodgate with NSW Department of Primary Industries; and Jody McNally, Troy Kalinowski and Aaron Ingham with CSIRO Agriculture and Food. We also acknowledge support from Steffen Bollmann with the Centre for Advanced Imaging at the University of Queensland.

References

  • [1] M. Abadi et al. (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Note: Software available from tensorflow.org External Links: Link Cited by: §IV-C.
  • [2] CSIRO (2019) Ceres tag: smart ear tags for livestock. Web Page. External Links: Link Cited by: §III.
  • [3] Google (2020) Advanced neural network processing for low-power devices. Web Page. External Links: Link Cited by: §I, §IV-C1.
  • [4] Google (2020) Edge TPU Compiler. Web Page. External Links: Link Cited by: §V-B1.
  • [5] Google (2020) Edge tpu performance benchmarks. Web Page. External Links: Link Cited by: §V-B.
  • [6] Google (2021) Dev Board. Web Page. External Links: Link Cited by: §IV-C1.
  • [7] Google (2021) USB Accelerator. Web Page. External Links: Link Cited by: §IV-C1.
  • [8] S. Hosseininoorbin (2020) PhD forum abstract: activity classification at the edge. In 2020 19th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN), Vol. , pp. 369–370. External Links: Document Cited by: §III, §III.
  • [9] S. Hosseininoorbin, S. Layeghy, B. Kusy, R. Jurdak, G. J. Bishop-Hurley, P. L. Greenwood, and M. Portmann (2021) Deep learning-based cattle behaviour classification using joint time-frequency data representation. Computers and Electronics in Agriculture 187, pp. 106241. Cited by: §III-A, §III-B, §III-B, §III-C, §III, §III.
  • [10] S. Hosseininoorbin et al. (2021) Exploring edge tpu for network intrusion detection in iot. External Links: 2103.16295, Link Cited by: §II, §III, §V-B1.
  • [11] S. Hosseininoorbin et al. (2021) Scaling spectrogram data representation for deep learning on edge tpu. In 2021 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops), Vol. , pp. 572–578. External Links: Document Cited by: §II, §III.
  • [12] Y. Hui et al. (2020) Early experience in benchmarking edge ai processors with object detection workloads. Journal Article In International Symposium on Benchmarking, Measuring and Optimization, pp. 32–48. External Links: Document Cited by: §II.
  • [13] N. P. Jouppi et al. (2017) In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, ISCA ’17, New York, NY, USA, pp. 1–12. External Links: ISBN 9781450348928, Link, Document Cited by: §IV-C1.
  • [14] L. Kljucaric et al. (2020) Architectural analysis of deep learning on edge accelerators. In 2020 IEEE High Performance Extreme Computing Conference (HPEC), Vol. , pp. 1–7. External Links: Document Cited by: §II.
  • [15] M. Kumar et al. (2020-05) Energy-Efficient Machine Learning on the Edges. In 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 912–921. External Links: Document, ISBN 978-1-7281-7445-7, Link Cited by: §II.
  • [16] S. Layeghy, G. Azemi, P. Colditz, and B. Buashash (2014) Classification of Fetal Movement Accelerometry Through Time-Frequency Features. In 2014 8th International Conference on Signal Processing and Communication Systems (ICSPCS), pp. 1–6. Cited by: §III-B.
  • [17] S. Layeghy et al. (2014) Non-Invasive Monitoring of Fetal Movements Using Time-Frequency Features of Accelerometry. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4379–4383. Cited by: §III-B.
  • [18] Meat & Livestock, Australia (mla) (2019-01) GLOBAL snapshot l beef. Report mla. External Links: Link Cited by: §III.
  • [19] M. Merenda et al. (2020) Edge machine learning for ai-enabled iot devices: A review. Sensors (Switzerland). External Links: Document, ISSN 14248220 Cited by: §II.
  • [20] J. Moran and R. Doyle (2015) Cow talk. Book, CSIRO Publishing. External Links: ISBN 9781486301621, Document Cited by: §III.
  • [21] OECD/FAO (2018) OECD/fao (2018). Book Section In OECD-FAO Agricultural Outlook 2018-2027, pp. 149–174. External Links: ISBN 9789264297210, Document Cited by: §III.
  • [22] S. Pandruvada (2020) RUNNING average power limit – rapl. Web Page. External Links: Link Cited by: §V-B.
  • [23] D. Ravi et al. (2017) A deep learning approach to on-node sensor data analytics for mobile or wearable devices. IEEE Journal of Biomedical and Health Informatics 21 (1), pp. 56–64. External Links: Document Cited by: §III-B, §III-B.
  • [24] A. Reuther et al. (2019) Survey and benchmarking of machine learning accelerators. In 2019 IEEE High Performance Extreme Computing Conference (HPEC), Vol. , pp. 1–9. Cited by: §II.
  • [25] Tensorflow (2020) Tensorflow for Mobile @ IoT. Web Page. External Links: Link Cited by: §IV-C.
  • [26] The Spirals Research Group (University of Lille and Inria) (2020) PyRAPL. Web Page. External Links: Link Cited by: §V-B.
  • [27] C. Wisultschew et al. (2019) Artificial Vision on Edge IoT Devices: A Practical Case for 3D Data Classification. In 2019 34th Conference on Design of Circuits and Integrated Systems, DCIS 2019, External Links: Document, ISBN 9781728154589 Cited by: §II.
  • [28] P. Xia et al. (2013) Performance comparison of bilinear interpolation, bicubic interpolation, and b-spline interpolation in parallel phase-shifting digital holography. Optical review 20 (2), pp. 193–197. Cited by: §IV-A.
  • [29] A. Yazdanbakhsh, K. Seshadri, B. Akin, J. Laudon, and R. Narayanaswami (2021) An evaluation of edge tpu accelerators for convolutional neural networks. External Links: Link Cited by: §II, §V-B2, footnote 2, footnote 3.