Sponge Examples: Energy-Latency Attacks on Neural Networks

06/05/2020 ∙ by Ilia Shumailov, et al. ∙ University of Cambridge

The high energy costs of neural network training and inference led to the use of acceleration hardware such as GPUs and TPUs. While this enabled us to train large-scale neural networks in datacenters and deploy them on edge devices, the focus so far is on average-case performance. In this work, we introduce a novel threat vector against neural networks whose energy consumption or decision latency are critical. We show how adversaries can exploit carefully crafted sponge examples, which are inputs designed to maximise energy consumption and latency. We mount two variants of this attack on established vision and language models, increasing energy consumption by a factor of 10 to 200. Our attacks can also be used to delay decisions where a network has critical real-time performance, such as in perception for autonomous vehicles. We demonstrate the portability of our malicious inputs across CPUs and a variety of hardware accelerator chips including GPUs, and an ASIC simulator. We conclude by proposing a defense strategy which mitigates our attack by shifting the analysis of energy consumption in hardware from an average-case to a worst-case perspective.




1 Introduction

The wide adoption of machine learning has motivated serious study of its security vulnerabilities. Threat vectors such as adversarial examples Biggio et al. (2013); Szegedy et al. (2013), data poisoning Nelson et al. (2008); Jagielski et al. (2018), and membership inference attacks Shokri et al. (2017); Salem et al. (2018) have been extensively explored. These attacks target either the confidentiality or the integrity of machine learning systems Biggio and Roli (2018); Papernot et al. (2016). So what about the third leg of the security triad, their availability? In this paper, we introduce an attack that increases the power drawn by neural networks and the time they take to make decisions. Increased energy consumption on edge devices that are battery-powered (such as smartphones or IoT devices) can make them unavailable Martin et al. (2004). Perhaps even more seriously, an attack that can slow down decisions can subvert safety-critical or mission-critical systems.

Our key observation is that different inputs of the same size can cause a deep neural network (DNN) to draw very different amounts of time and energy: this energy-latency gap is the vulnerability we exploit.

Our attack can be even more effective against the growing number of systems that use GPUs or custom hardware. Machine learning in general, and deep learning in particular, command workloads heavy in matrix algebra. GPUs were fundamental to the AlexNet breakthrough in 2012 Krizhevsky et al. (2012); in response to increasing demand, Google introduced TPUs to facilitate inference – and training – in its datacenters Jouppi et al. (2017); while Apple introduced the Neural Engine to make its smartphones more energy-efficient for on-device deep learning Team (2017).

Our first attack uses a genetic algorithm to craft malicious inputs, which we call sponge examples, designed to soak up energy from a neural network. Time measurements obtained by profiling mutated inputs serve as the fitness function needed for the genetic algorithm to evolve better sponge examples. Our second, enhanced, attack instead optimizes sponge examples with L-BFGS-B and a loss that encourages the optimizer to return inputs with large activation norms across all hidden layers of a neural network. We find that this second attack is even more effective at maximizing energy consumption during inference. (Technically, we might describe this process as pessimization rather than optimization, as we’re finding the inputs that give the worst possible performance.) Quite apart from battery-draining attacks, sponge examples could cause a control system to fail to meet real-time service requirements. To give but two examples, autonomous vehicle designers should consider whether they could cause accidents, while engineers building neural networks into cognitive radar should analyse whether they create a new opportunity for jamming.

Our contributions are the following:

  • We formulate a novel energy-based threat against the availability of ML systems. We instantiate a corresponding threat vector, sponge examples, with two attacks designed to cause the worst possible energy consumption during inference.

  • Our evaluation shows that canonical models for both vision and language tasks are vulnerable to sponge examples, including models specifically compressed to be deployed on edge devices (e.g., MobileNet). Our genetic algorithm attack increases energy consumption by natural language tasks up to 200, latency by up to 70, and our enhanced L-BFGS-B attack increases computer vision task consumption by up to 8


  • We demonstrate the portability of the threat vector instantiated by sponge examples, given that our conclusions are consistent across a variety of chips: CPUs, GPUs, and an ASIC simulator. Sponge examples also transfer across model architectures.

  • We present a simple defense against sponge examples that can also prevent unexpected increases in the energy consumption of ML systems in the absence of adversaries, with direct beneficial consequences for the carbon footprint of models deployed for inference at scale (e.g., machine translation on smartphones), as well as for the worst-case performance of such models.

2 Background on Hardware Acceleration for Deep Learning

Workloads generated by inference for deep neural networks (DNNs) are intensive in both compute and memory. Common hardware products such as CPUs and GPUs are now being adapted to this workload, and provide features dedicated to accelerating DNN inference. Intel’s Knights Mill CPU provides a set of SIMD instructions Cooperation (2016), while NVIDIA’s Volta GPU introduces Tensor Cores to facilitate the low-precision multiplications that underpin much of deep learning Markidis et al. (2018).

Hardware dedicated to deep learning is now pervasive in data centers, with examples including Big Basin at Facebook Hazelwood et al. (2018), BrainWave at Microsoft Chung et al. (2018), and racks of TPUs at Google Jouppi et al. (2017, 2018); the underlying hardware on these systems is either commodity hardware (Big Basin), reconfigurable hardware (FPGAs for BrainWave), or custom silicon (TPUs). The latter two are specifically designed to improve the number of Operations per Watt (OPs/W) of DNN inference. As we discuss later, custom and semi-custom hardware often exploit sparsity in data and the adequacy of low-precision computations to train neural networks, reducing both arithmetic complexity and DRAM accesses, and thus achieving significantly better power efficiency Chen et al. (2019); Han et al. (2016). Our attack undermines the benefits brought by such hardware.

Performance per watt is an important indicator for the efficiency of cloud infrastructure Barroso (2005). Power oversubscription is a popular method for cloud services to handle provisioning. However, this makes data centers vulnerable to power attacks Li et al. (2016); Somani et al. (2016); Xu et al. (2014, 2015). If malicious users can remotely generate power spikes on multiple hosts in the data center at the same time, they might overload the system and cause disruption of service Li et al. (2016); Palmieri et al. (2015). Energy attacks against mobile devices aim to drain the battery more quickly Martin et al. (2004); Fiore et al. (2014). The possible victims of energy attacks on mobile systems range from phones to more constrained sensors Chen et al. (2009). Higher energy consumption also increases hardware temperature, which in turn increases the failure rate. For example, Anderson et al. note that an increase of C causes component failure rates to double. Modern hardware throttles to avoid overheating, but throttling makes tasks consume more energy because they now take longer to run, increasing total static and dynamic power costs.

3 Methodology

3.1 Adversary Model

In this paper we assume an adversary with the ability to supply an input sample to a target system, which then processes the sample using a single CPU, GPU or ASIC. We assume no rate limiting, apart from on-device dynamic power control or thermal throttling. We assume no physical access to the systems, i.e. an attacker cannot reprogram the hardware or change the configuration.

We consider our attack in three threat models. The first threat model is a white-box setup: we assume the attackers know the model architecture and parameters. The second threat model considers an interactive black-box: we assume attackers have no knowledge of the architecture and parameters, but are able to time the operations or measure energy consumption remotely and can query the target as many times as they want. The third threat model is the one of a blind adversary, where no knowledge of the target architecture and parameters is assumed and attackers cannot take any direct measurements. In this setting, the adversary has to resort to a direct transfer of sponge examples found previously to a new target model – without prior interaction.

A simple example of a target system could be a dialogue system. Users interact with it continuously by sending queries and can measure energy consumption or, when that is not possible, use the response time as a timing side channel (see Section 4).

3.2 The Energy Gap

We tested three hardware platforms: CPU, GPU and an ASIC simulator. The amount of energy consumed by one inference pass (i.e.  a forward pass in a neural network) depends primarily on:

  • The overall number of arithmetic operations required to process the inputs;

  • The number of memory accesses e.g. to the GPU DRAM.

The intriguing question now is:

Is there a significant gap in energy consumption for different model inputs of the same dimension?

As well as fixing the dimension of inputs, we also do not consider inputs that would exceed the pre-defined numerical range of each dimension. If models do have a large energy gap between different inputs, we describe two hypotheses that we think attackers can exploit to create sponge examples, that is, inputs with abnormally high energy consumption.

3.2.1 Hypothesis 1: Data Sparsity

The rectified linear unit (ReLU), which computes $\max(0, x)$, is the de facto choice of activation function in neural network architectures. This design introduces sparsity in the activations of hidden layers when the weighted sum of inputs to a neuron is negative. A large number of ASIC neural network accelerators consequently exploit runtime data sparsity to increase efficiency. For instance, ASIC accelerators may employ zero-skipping multiplications or encode DRAM traffic to reduce the off-chip bandwidth requirement. Hence, inputs that lead to less sparse activations will increase the number of operations and the number of memory accesses, and thus energy consumption.
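To make Hypothesis 1 concrete, the following is a minimal sketch of how activation density affects the number of multiplications on hardware that skips zeros. The function names, the toy pre-activation values and the fan-out of 8 are illustrative assumptions, not part of the paper's simulator.

```python
def relu(values):
    """Apply ReLU element-wise: negative pre-activations become zero."""
    return [max(0.0, v) for v in values]


def density(values):
    """Fraction of non-zero entries in a list of activations."""
    return sum(1 for v in values if v != 0.0) / len(values)


def multiplications(activations, fan_out, zero_skipping=True):
    """Count multiplications needed to feed activations into a dense layer.

    With zero skipping (an assumed hardware optimisation), zero-valued
    activations cost nothing; without it, every activation is multiplied.
    """
    if zero_skipping:
        active = sum(1 for a in activations if a != 0.0)
    else:
        active = len(activations)
    return active * fan_out


# Hypothetical pre-activations: one mostly negative, one fully positive.
sparse_act = relu([-1.0, 0.5, -0.2, -3.0])
dense_act = relu([1.0, 0.5, 0.2, 3.0])

print(density(sparse_act), multiplications(sparse_act, fan_out=8))
print(density(dense_act), multiplications(dense_act, fan_out=8))
```

In this toy setting the denser input needs four times as many multiplications as the sparse one, which is the gap a sponge example tries to widen.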

3.2.2 Hypothesis 2: Computation Dimensions

Aside from data sparsity, modern networks also have a computational dimension. Along with variable input and output shapes, the internal representation size often changes as well – for example, in the Transformer-based architectures for machine translation Vaswani et al. (2017). In this case, both the input and output are sequences of words and internal representation depends on both of them. Each word is represented as a number of tokens, whose shape depends on the richness of input and output dictionaries. As computation progresses, internally each inference step depends on all of the inputs and outputs so far.

Consider an input sequence x and an output sequence y. The quantities of interest are the input and output token sizes (i.e. the number of unique words) and the dimensionality of the spaces in which the words of the input and of the output are embedded. Algorithm 1 contains the pseudocode for a Transformer’s principal steps; we annotate the computational complexity of each instruction on the line preceding it. As can be seen, several quantities can be manipulated by an adversary to increase the algorithm’s run time: 1) the token size of the input sentence; 2) the token size of the output sentence; and 3) the size of the input and output embedding spaces. All of the above can cause a non-linear increase in algorithmic complexity and thus heavily increase the amount of energy consumed. Importantly, perturbing these quantities does not require that the adversary modify the dimension of the input sequence x; so with no changes to the input dimensionality, the adversary is able to increase energy consumption non-linearly.

Result: y
O()
x_tokens = Tokenize(x);
O()
x_embedded = Encode(x_tokens);
O()
while y_tokens has no end of sentence token do
      O()
      y_embedded = Encode(y_tokens);
      O()
      y_next_embedded = model.Inference(x_embedded, y_embedded, …);
      O()
      y_next_token = Decode(y_next_embedded);
      y_tokens.add(y_next_token);
end while
O()
y = Detokenize(y_tokens)
Algorithm 1: Translation Transformer NLP pipeline
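The decoding loop in Algorithm 1 can be sketched with a toy cost model to show why longer token sequences raise cost non-linearly. The unit cost per step (one operation per input token and per output token produced so far) is an assumption made for illustration only; it is not the complexity annotation from the paper.

```python
def decode_cost(num_input_tokens, num_output_tokens):
    """Toy operation count for an autoregressive decoding loop.

    Assumption: each decoding step touches every input token and every
    output token produced so far, at one unit of work per token.
    """
    cost = 0
    produced = 0
    while produced < num_output_tokens:
        cost += num_input_tokens + produced
        produced += 1
    return cost


# Doubling both token counts roughly quadruples the work in this model.
print(decode_cost(3, 3))  # 3 + 4 + 5 = 12
print(decode_cost(6, 6))  # 6 + 7 + ... + 11 = 51
print(decode_cost(9, 9))  # 9 + 10 + ... + 17 = 117
```

Under this assumed model an adversary gains on two fronts at once: a longer post-tokenisation input raises the cost of every step, and a longer output raises the number of steps.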

3.3 Latency and Energy Attacks on Neural Networks

Having presented the intuition behind our attacks, we now introduce three strategies for finding sponge examples, corresponding to the threat models described in Section 3.1.

3.3.1 Finding Sponge Examples with Genetic Algorithms in White and Black-box Settings

Genetic algorithms (GA) are a powerful tool for adversaries Xu et al. (2016). They can optimise a diverse set of objectives and require no local gradient information. They are a particularly good fit for adversaries who only have access to the model’s prediction in a black-box setting. We start with a pool of randomly generated samples. These are images for computer vision models, or sentences for NLP tasks. We then iteratively evolve the population pool:

  • For computer vision tasks, we sample two parents from the population pool, and crossover the inputs using a random mask.

  • For NLP tasks, we sample two parents, and crossover by concatenating the left part of the first parent with the right part of the second parent. We then probabilistically invert the two parts.

We explain the reasons for these choices in Appendix D. Next, we randomly mutate (i.e. randomly perturb) a proportion of the input features (i.e. pixels in vision, words in NLP) of the children. To maintain enough diversity in the pool, we preserve the best per-class samples in the pool. We obtain a fitness score for all pool members, namely their energy consumption. We then select the winning top 10% of samples and use them as parents for the next iteration; as the sample pool is large, selecting the top 10% makes the process more tractable. This genetic algorithm is simple but effective in finding sponge examples.
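The NLP variant of the loop described above can be sketched as follows. This is a simplified illustration under several assumptions: the fitness function is a toy stand-in for measured energy, "inverting the two parts" is interpreted as swapping their order with probability 0.5, the selected parents are carried over into the next pool, and the per-class preservation step is omitted. All names and parameter values are hypothetical.

```python
import random

RARE_TOKENS = {"g", "h"}  # hypothetical tokens assumed to be costly


def toy_fitness(sequence):
    """Stand-in for measured energy: count the assumed costly tokens."""
    return sum(1 for token in sequence if token in RARE_TOKENS)


def crossover(parent_a, parent_b, rng):
    """Join the left part of one parent with the right part of the other."""
    cut = rng.randrange(1, len(parent_a))
    left, right = parent_a[:cut], parent_b[cut:]
    # Assumed reading of "probabilistically invert the two parts".
    return right + left if rng.random() < 0.5 else left + right


def mutate(child, alphabet, rate, rng):
    """Randomly replace a proportion of the tokens."""
    return [rng.choice(alphabet) if rng.random() < rate else t for t in child]


def evolve(fitness, alphabet, length=8, pool_size=50, generations=30,
           rate=0.1, seed=0):
    rng = random.Random(seed)
    pool = [[rng.choice(alphabet) for _ in range(length)]
            for _ in range(pool_size)]
    history = []
    for _ in range(generations):
        pool.sort(key=fitness, reverse=True)
        parents = pool[:max(2, pool_size // 10)]  # top 10% become parents
        history.append(fitness(parents[0]))
        children = []
        while len(children) < pool_size - len(parents):
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), alphabet, rate, rng))
        pool = parents + children
    return max(pool, key=fitness), history


best, history = evolve(toy_fitness, list("abcdefgh"))
print("".join(best), history[0], history[-1])
```

In a real attack the fitness call would be replaced by an energy or latency measurement of the target, which is what makes the approach usable without gradient access.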

3.3.2 Sponge Examples in the White-box Setting

We now consider an adversary with access to the model’s parameters. Rather than a genetic algorithm, we use L-BFGS-B Byrd et al. (1995) to optimise the following objective:

$$-\sum_{a_l \in \mathcal{A}} \lVert a_l \rVert$$

where $\mathcal{A}$ is the set of all activation values and $a_l$ the activations of layer $l$. This generates inputs that increase activation values of the model across all of the layers simultaneously. Following Hypothesis 1 outlined above, the resulting increase in density prevents hardware from skipping some of the operations, which in turn increases energy consumption. We only evaluate the performance of sponge examples found by L-BFGS-B on computer vision tasks because of the discrete nature of the NLP tasks.
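As a rough illustration of this objective, the sketch below sums per-layer activation norms for a tiny ReLU network and searches for an input in [0, 1] that increases the sum. Several things here are assumptions: the weights are made up, the L2 norm is one possible choice of norm, and a simple random hill climb with clamping is swapped in for L-BFGS-B to keep the example dependency-free.

```python
import math
import random

# Hypothetical weights for a 2 -> 3 -> 2 ReLU network.
WEIGHTS_1 = [[0.5, -1.0], [-0.3, 0.8], [1.2, 0.4]]
WEIGHTS_2 = [[1.0, -0.5, 0.2], [-0.7, 0.9, 0.3]]


def layer(weights, inputs):
    """Dense layer followed by ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, inputs)))
            for row in weights]


def activation_norm_sum(x):
    """Sum of the (assumed L2) norms of the activations of every layer."""
    a1 = layer(WEIGHTS_1, x)
    a2 = layer(WEIGHTS_2, a1)
    return sum(math.sqrt(sum(v * v for v in a)) for a in (a1, a2))


def hill_climb(x, steps=200, step_size=0.1, seed=0):
    """Bound-constrained random search, a stand-in for L-BFGS-B."""
    rng = random.Random(seed)
    best, best_score = list(x), activation_norm_sum(x)
    for _ in range(steps):
        candidate = [min(1.0, max(0.0, v + rng.uniform(-step_size, step_size)))
                     for v in best]
        score = activation_norm_sum(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


start = [0.5, 0.5]
sponge, score = hill_climb(start)
print(activation_norm_sum(start), score)
```

A gradient-based optimiser would minimise the negated sum instead, with the same box constraints keeping the input inside its valid numerical range.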

3.4 Cross-model and Cross-hardware Transferability for Blind Adversaries

When adversaries are unable to query the model, they cannot directly solve an optimisation problem to find sponge examples, even with the interactive black-box approach, i.e. by running the GA. In this blind adversary setting, we exploit transferability across both models and hardware. Indeed, in Section 4.5 and Appendix F, we show that sponge examples transfer across models. We examine three hardware platforms in our evaluation:

  • CPU: The platform is an Intel(R) Xeon(R) CPU E5-2620 v4 with 2.10GHz clock frequency. We use the Running Average Power Limit (RAPL) to measure energy consumption of the CPU. RAPL has been thoroughly evaluated and found to reflect actual energy consumption, as long as the counters are not sampled too quickly Hähnel et al. (2012); Khan et al. (2018).

  • GPU: We use a GeForce 1080 Ti GPU with a 250.0 Watts power limit, a slowdown temperature and a throttling temperature. We use the NVIDIA Management Library (NVML) to measure energy consumption. NVML was previously found to capture energy quite accurately, with occasional instability for high-low patterns and high sampling rates Sen et al. (2018).

  • ASIC: We also developed a deterministic ASIC simulator, which monitors and records the runtime operations and number of DRAM accesses assuming a conservative memory flushing strategy. We then use measurements by Horowitz to approximate energy consumption Horowitz (2014): at 45nm technology and 0.9V, we assume 1950 pJ to access a 32 bit value in DRAM and 3.7 pJ for a floating-point multiplication.

We show in Section 4.5 that sponge examples transfer across these types of hardware.
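The per-operation costs quoted for the ASIC simulator can be combined into a back-of-the-envelope energy estimate. The sketch below uses only the two figures given above (1950 pJ per 32-bit DRAM access and 3.7 pJ per floating-point multiplication); the function name and the example counts are assumptions, and the real simulator may account for further operations.

```python
DRAM_ACCESS_PJ = 1950.0  # energy per 32-bit DRAM access, from the text
FLOAT_MULT_PJ = 3.7      # energy per floating-point multiplication


def estimate_energy_mj(num_multiplications, num_dram_accesses):
    """Approximate inference energy in millijoules from operation counts."""
    picojoules = (num_multiplications * FLOAT_MULT_PJ
                  + num_dram_accesses * DRAM_ACCESS_PJ)
    return picojoules * 1e-9  # 1 pJ = 1e-9 mJ


# Hypothetical counts: 10^9 multiplications and 10^6 DRAM accesses.
print(estimate_energy_mj(10**9, 10**6))
```

Because a DRAM access costs several hundred times more than a multiplication, inputs that defeat compressed memory traffic can matter as much as inputs that defeat zero skipping.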

4 Evaluation

4.1 Experimental Setup

NLP tasks.

We first evaluate our sponge example attack on a range of NLP models provided by the FairSeq framework Ott et al. (2019). The models we consider have achieved top performance at their respective tasks. We report performance of the RoBERTa Liu et al. (2019) model, an optimised BERT Devlin et al. (2018), on three GLUE benchmarks Wang et al. (2018). The datasets we considered include tasks in the SuperGLUE benchmark plus a number of machine-translation tasks. The SuperGLUE benchmark follows the style of GLUE but includes a wider range of language-understanding tasks including question answering and coreference resolution Wang et al. (2018, 2019). Further, we evaluate the attack on a number of translation tasks (WMT) using Transformer-based models Ott et al. (2018); Edunov et al. (2018); Ng et al. (2019).

Consider the pipeline for handling text. Before getting to the models, text goes through several preprocessing steps. First, words get tokenized in a manner meaningful for the language. We used the tokenizer from the Moses toolkit Koehn et al. (2007), which separates punctuation from words and normalises characters. Next, tokenized blocks get encoded. Until recently, unknown words were simply replaced with an unknown token. Modern encoders improve performance by exploiting the idea that many words are a combination of other words. BPE is a popular approach that breaks unknown words into subwords it knows and uses those as individual tokens Sennrich et al. (2015). In that way, known sentences get encoded very efficiently, mapping every word to a token. (Unknown words lead to a much larger sentence representation.)
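The effect of subword encoding on representation size can be illustrated with a small sketch. Note that this uses greedy longest-match segmentation over a made-up vocabulary, which is a simplification of BPE rather than the BPE merge procedure itself; the vocabulary and function names are hypothetical.

```python
VOCAB = {"sponge", "example", "s", "energy"}  # hypothetical subword vocabulary


def segment(word, vocab):
    """Split a word into the longest known pieces, falling back to characters."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is known or a single character.
        while end > start + 1 and word[start:end] not in vocab:
            end -= 1
        pieces.append(word[start:end])
        start = end
    return pieces


def encode(sentence, vocab):
    """Encode a whitespace-separated sentence into subword tokens."""
    return [piece for word in sentence.split() for piece in segment(word, vocab)]


natural = "sponge examples"
noise = "xqzvbn kjhgfdsw"  # same character length, no known subwords

print(encode(natural, VOCAB), len(encode(natural, VOCAB)))
print(len(encode(noise, VOCAB)))
```

Both inputs have the same character length, yet the unfamiliar one expands to many more tokens, which is the lever the attack uses to enlarge the post-tokenisation representation.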

Vision tasks.

We evaluate the sponge example attack on a range of vision models provided in the TorchVision library. We show the performance of ResNet-18, ResNet-50 and ResNet-101 He et al. (2016), DenseNet-121, DenseNet-161, DenseNet-201 Huang et al. (2017), GoogleNet Szegedy et al. (2015) and MobileNet-V2 Sandler et al. (2018). All of the networks solve a canonical computer vision classification task – ImageNet-2017.

4.2 White-box Sponge Examples against NLP tasks

Task    Input size

Language Understanding: SuperGLUE Benchmark with Liu et al. (2019)
CoLA            15    5829.32       4.30      69.72      83.92      87.11
CoLA            30    9388.40       4.30     138.07     164.07     169.91
CoLA           100   22698.87       4.30     452.49     518.19     530.80
MNLI            15    6126.65      12.88      73.47      86.97      89.96
MNLI            30    9631.68      17.66     142.63     168.96     174.34
MNLI           100   22952.14      34.47     456.11     518.89     531.40
WSC             15   27876.53      14.48     523.28    1300.19    2152.67
WSC             30   82822.58      34.94    1882.63    3927.63    5348.06
WSC            100  662811.96     194.89   16754.13   25367.30   30692.95

Machine Translation: WMT14/16 with Ott et al. (2018)
En→Fr           30   59597.32      31.87     109.80     118.47     141.27
En→Fr           50   93731.34      48.54     166.13     249.89     569.85
En→De           15   18133.66      18.19      35.80     242.39     542.35

Table 1: Energy is reported in millijoules. The GA was run for 100 epochs with a pool size of 100. We report GPU energy readings using NVML and show the increase in energy and time. In addition, we show the performance of samples from the evaluation dataset, of random noise and of sponge examples on the accelerator simulator. For sponge examples, we report the average over the final GA pool samples as well as the average over the top sponge examples. More results are available in the Appendix.

Table 1 shows the energy consumption of different models in the presence of an energy-latency adversary. For all models examined in Table 1, we found effective sponge examples that increased energy consumption considerably. In the best-case scenario, we were able to slow down the task evaluation by a factor of and increase its energy consumption by a factor of . The main reason for performance degradation was increased computation dimension, as described in Algorithm 1. First, for a given input sequence size, the attack maximises the size of the post-tokenisation representation, exploiting the tokeniser and sub-word processing. Second, the attack learns to maximise output sequence length, since this links directly to the computation cost. Third, internal computation coupled with the output sequence length and post-tokenisation length gives a quadratic increase in energy consumption. An interesting observation is that natural text consumes a lot less energy than random and sponge texts. This can be attributed to the fact that natural samples are efficiently encoded, whereas random and attack samples produce an unnecessarily long representation, meaning random noise can be used as a scalable black-box latency and energy attack.

4.3 White-box Sponge Examples against CV Tasks

According to the hypotheses in Section 3.2.1 and Section 3.2.2, we can increase energy consumption by increasing either computation dimension or data density. Although theoretically we can provide larger images to increase the computation dimension for computer vision networks, very few modern networks currently deal with dynamic input or output. Usually preprocessing normalizes variable-sized images to a pre-defined size by either cropping or scaling. Therefore, for computer vision models, we focus on increasing the energy and latency via data density. We calculate the theoretical upper bounds of data density using Interval Bound Propagation (IBP) Gowal et al. (2018). Although originally developed for certifiable robustness, we adopt the technique to look at internal network bounds that only take for the whole natural image range. (Note that we assume full floating point precision here. In practice, emerging hardware often uses much lower quantisation which will result in a lower maximum data density.)
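The idea of bounding density with IBP can be sketched for a single dense layer followed by ReLU. Interval arithmetic gives, for each neuron, the largest pre-activation reachable over a box of inputs; a neuron whose upper bound is not positive can never be non-zero after ReLU. The weights, biases and the input range [0, 1] below are illustrative assumptions.

```python
def linear_bounds(weights, biases, lower, upper):
    """Propagate an input box [lower, upper] through a dense layer."""
    out_lower, out_upper = [], []
    for row, bias in zip(weights, biases):
        # Positive weights take the matching end of the interval,
        # negative weights take the opposite end.
        lo = bias + sum(w * (l if w >= 0 else u)
                        for w, l, u in zip(row, lower, upper))
        hi = bias + sum(w * (u if w >= 0 else l)
                        for w, l, u in zip(row, lower, upper))
        out_lower.append(lo)
        out_upper.append(hi)
    return out_lower, out_upper


def relu_bounds(lower, upper):
    """ReLU is monotone, so it can be applied to both bounds directly."""
    return [max(0.0, l) for l in lower], [max(0.0, u) for u in upper]


def max_density(upper):
    """Upper bound on the fraction of neurons that can ever be non-zero."""
    return sum(1 for u in upper if u > 0.0) / len(upper)


# Hypothetical 2 -> 3 layer with inputs restricted to [0, 1].
weights = [[1.0, -1.0], [-1.0, -1.0], [0.5, 0.5]]
biases = [0.0, -0.5, -2.0]
pre_lower, pre_upper = linear_bounds(weights, biases, [0.0, 0.0], [1.0, 1.0])
post_lower, post_upper = relu_bounds(pre_lower, pre_upper)
print(post_upper, max_density(post_upper))
```

For deeper networks the post-ReLU bounds of one layer become the input box of the next, which is how a whole-network density ceiling can be accumulated.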

Model         Input              [s]      [mJ]    ratio  post-ReLU Density  Density  Max Density
ResNet-50     L-BFGS-B Sponge   0.011   164.727   0.863              0.619    0.885        0.998
ResNet-50     Sponge            0.016   160.887   0.843              0.562    0.868
ResNet-50     Natural           0.017   160.562   0.842              0.572    0.867
ResNet-50     Random            0.017   155.820   0.817              0.483    0.845
DenseNet-121  L-BFGS-B Sponge   0.033   152.595   0.783              0.571    0.826        0.829
DenseNet-121  Sponge            0.029   149.564   0.767              0.540    0.814
DenseNet-121  Natural           0.033   147.227   0.755              0.523    0.804
DenseNet-121  Random            0.030   144.365   0.741              0.487    0.792
MobileNet v2  L-BFGS-B Sponge   0.011    87.511   0.844              0.692    0.890        0.996
MobileNet v2  Sponge            0.010    84.513   0.815              0.645    0.868
MobileNet v2  Natural           0.011    85.075   0.821              0.646    0.873
MobileNet v2  Random            0.011    80.805   0.779              0.567    0.844

Table 2: Energy is reported in millijoules. The GA was run for 100 epochs with a pool size of 100. More results are available in the Appendix.

Table 2 presents the performance of sponge example attacks on selected CV models; we show results for all of them in Appendix C. As the table shows, the inference pass of CV models uses a lot less time and the maximum energy consumption is a lot lower. Since the energy consumption per inference is lower, it is challenging to get a true measurement of energy, given the interference from the GPU’s hardware temperature control and the limited resolution of energy inspection tools. We attempt to statistically measure consumption in Appendix E. In Table 2, we focus on the results of the GA maximising the ASIC simulator cost.

The [mJ] column shows the energy estimated from the accelerator simulator; density is the ratio between the number of non-zero values and total values, post-ReLU density only considers the values after the ReLU non-linearity, and ‘ratio’ refers to the cost on the ASIC with optimisations compared to the cost on an ASIC without any optimisations. In this case, we consider ASIC-level optimisations of zero-skipping multiplications Kim et al. (2017) and compressed DRAM accesses Parashar et al. (2017). ‘Max density’ refers to the largest possible density obtained through IBP. The results for density and energy suggest that both attacks successfully generate sponge examples that are marginally more expensive in terms of energy. In Table 2, the L-BFGS-B sponge examples consume roughly 3% more energy than natural samples (2.6% for ResNet-50, 3.6% for DenseNet-121 and 2.9% for MobileNet v2). To better understand the difference in performance please refer to Appendix H. Finally, we observe that sponge examples are transferable and can be used to launch a blind black-box attack. See Appendix F for details.

Figure 4: Performance of Sponge Examples based on the Energy, Time and Simulator fitness costs.

Setting    From                To                    Input         [mJ]    [s]        [mJ]     [s]        [mJ]
Black-box  Ott et al. (2018)   Ott et al. (2018)     Sponge    3648.219  0.174   17251.000   1.048   51512.966
Black-box  Ott et al. (2018)   Ott et al. (2018)     Natural   1450.403  0.053    6146.550   0.537   23610.145
Black-box  Ott et al. (2018)   Edunov et al. (2018)  Sponge    2909.245  0.414   47723.500   3.199  181936.595
Black-box  Ott et al. (2018)   Edunov et al. (2018)  Natural   1507.364  0.253   27265.250   1.344   71714.201
Black-box  Ott et al. (2018)   Ng et al. (2019)      Sponge    3875.365  0.652   67183.100   4.409  247585.091
Black-box  Ott et al. (2018)   Ng et al. (2019)      Natural   1654.965  0.215   25033.620   2.193  121210.376
White-box  Ott et al. (2018)   Ott et al. (2018)     Sponge   48447.093  2.414  260187.900  13.615  781758.680
White-box  Ott et al. (2018)   Ott et al. (2018)     Natural   1360.118  0.056    6355.620   0.520   23262.311

Table 3: Energy values are reported in millijoules and time is reported in seconds. The GA was run for 100 epochs with a pool size of 1000. More results are available in the Appendix.

4.4 Interactive Black-box Sponge Examples against NLP Tasks

In this section we show the performance of the attack run in an interactive black-box manner against NLP tasks. We compare the sponge examples found by our attacks based on 1) time; 2) energy; and 3) ASIC cost. The last is a white-box baseline, whilst the first two are interactive black-box attacks. This means we use time, energy or ASIC cost as the fitness score for the GA. Figure 4 shows sponge example performance against a WMT14 English-to-French Transformer-based translator with an input of size 15 and pool size of 100. It can be seen that, although both Energy and Time attackers run without any knowledge of network internals, they successfully increase the energy and time costs.

4.5 Blind Black-box Sponge Examples and Hardware Transferability on NLP Tasks

In this section we turn to the question of transferability across hardware and different models in a blind black-box manner. Table 3 shows the results across different models, languages and hardware platforms. We observe a significant increase in energy and time in comparison to natural samples. However, the blind black-box attacks fail to achieve the same level of energy or latency degradation when taking the white-box case as a baseline.

5 Defending against Sponge Examples

Sponge examples can be found by adversaries with limited knowledge and capabilities, making the threat vector realistic. We propose a simple defense to preserve the availability of hardware accelerators in the presence of sponge examples. In Table 1, we observe that there exists a large energy gap between natural examples and random or sponge examples. We propose that prior to the deployment of a model, natural examples get profiled to measure the time or energy cost of inference. The defender can then fix a cut-off threshold. This way, the maximum consumption of energy per inference run is controlled and sponge examples will have a bounded impact on availability.
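A minimal sketch of this cut-off defense is shown below. It profiles hypothetical per-inference energy costs of natural examples, derives a threshold, and aborts any inference whose accumulated cost would exceed it. The safety margin of 1.5 and the use of the maximum natural cost are assumptions; the text only requires that the defender fix some threshold from the profile.

```python
def calibrate_cutoff(natural_costs, margin=1.5):
    """Derive an energy cut-off from profiled natural examples.

    Assumption: the threshold is the largest natural cost times a margin.
    """
    return max(natural_costs) * margin


def run_with_budget(step_costs, cutoff):
    """Run inference step by step and stop before the budget is exceeded."""
    spent = 0.0
    for index, cost in enumerate(step_costs):
        if spent + cost > cutoff:
            return {"completed": False, "steps_run": index, "energy": spent}
        spent += cost
    return {"completed": True, "steps_run": len(step_costs), "energy": spent}


# Hypothetical profile (in mJ) and two inputs with per-step costs of 1 mJ.
cutoff = calibrate_cutoff([4.0, 5.0, 6.0])
print(run_with_budget([1.0] * 5, cutoff))    # natural-like input finishes
print(run_with_budget([1.0] * 100, cutoff))  # sponge-like input is cut off
```

The same structure applies when the budget is expressed in time rather than energy, for example by bounding the number of decoding steps.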

This deals with the case where the threat model is battery drainage. Where the threat is a jamming attack on real-time performance, as for example with the vision system of an autonomous vehicle or a missile targeting system driven by cognitive radar, the system will need to be designed for worst-case performance, and if need be a fallback driving or targeting mechanism should be provided.

6 Conclusion

We introduced energy-latency attacks, which equip an adversary to increase the latency and energy consumption of ML systems thus jeopardizing their availability. Our attacks are effective against deep neural networks in a spectrum of threat models that realistically capture current deployments of ML, whether as a service or on edge devices. They can be mounted by adversaries whose access varies from total access to none at all. Our work demonstrates the need for careful worst-case analysis of the latency and energy consumption of computational systems that use deep learning mechanisms.

Broader Impact: Carbon emission and Machine Learning

Most of the prior research on the carbon footprint of machine learning focuses on the energy required to train large neural network models and its contribution to carbon emissions Henderson et al. (2020); Lacoste et al. (2019); Strubell et al. (2019). This work shows that we need to study carbon emissions at small scales as well as large. As with side-channel attacks on cryptographic systems, the fine-grained energy consumption of neural networks is a function of the inputs processed. In this case, the main consequence is not leakage of confidential information but a denial of service attack.

First, sponge examples can aim to drain a device’s batteries; the operations and memory accesses in inference account for around a third of the work done during a complete backpropagation step (forward-backward pass), but inference happens at a much higher frequency and scale compared to training once a model is deployed. Our research characterizes the worst-case energy consumption of inference. This is particularly pronounced with natural language processing tasks, where the worst case can take over 100 times the time and energy of the average case.

Second, the sponge examples found by our attacks can be used in a targeted way to cause an embedded system to fall short of its performance goal. In the case of a machine-vision system in an autonomous vehicle, this might enable an attacker to crash the vehicle; in the case of a missile guided by a neural network target tracker, a sponge example countermeasure might break the tracking lock. Adversarial worst-case performance must, in such applications, be tested carefully by system engineers.


Partially supported with funds from Bosch-Forschungsstiftung im Stifterverband. Research at the University of Toronto and Vector Institute was supported by a Canada CIFAR AI Chair, the Vector Institute’s industrial sponsors, DARPA, and NSERC.


  • [1] D. Anderson, J. Dykes, and E. Riedel More than an interface-SCSI vs. ATA. Cited by: §2.
  • [2] L. A. Barroso (2005) The price of performance. Queue 3 (7), pp. 48–53. Cited by: §2.
  • [3] B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli (2013) Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pp. 387–402. Cited by: §1.
  • [4] B. Biggio and F. Roli (2018) Wild patterns: ten years after the rise of adversarial machine learning. Pattern Recognition 84, pp. 317–331. Cited by: §1.
  • [5] J. A. Butts and G. S. Sohi (2000) A static power model for architects. In Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000, pp. 191–201. Cited by: Appendix A.
  • [6] R. H. Byrd, P. Lu, J. Nocedal, and C. Zhu (1995) A limited memory algorithm for bound constrained optimization. SIAM Journal on scientific computing 16 (5), pp. 1190–1208. Cited by: §3.3.2.
  • [7] X. Chen, K. Makki, K. Yen, and N. Pissinou (2009) Sensor network security: a survey. IEEE Communications Surveys & Tutorials 11 (2), pp. 52–73. Cited by: §2.
  • [8] Y. Chen, T. Yang, J. Emer, and V. Sze (2019) Eyeriss v2: a flexible accelerator for emerging deep neural networks on mobile devices. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9 (2), pp. 292–308. Cited by: §2.
  • [9] E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield, T. Massengill, M. Liu, D. Lo, S. Alkalay, M. Haselman, et al. (2018) Serving dnns in real time at datacenter scale with project brainwave. IEEE Micro 38 (2), pp. 8–20. Cited by: §2.
  • [10] I. Cooperation (2016) Intel architecture instruction set extensions programming reference. Intel Corp., Mountain View, CA, USA, Tech. Rep, pp. 319433–030. Cited by: §2.
  • [11] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §4.1.
  • [12] S. Edunov, M. Ott, M. Auli, and D. Grangier (2018) Understanding back-translation at scale. arXiv preprint arXiv:1808.09381. Cited by: §4.1, Table 3.
  • [13] U. Fiore, F. Palmieri, A. Castiglione, V. Loia, and A. De Santis (2014) Multimedia-based battery drain attacks for android devices. In 2014 IEEE 11th Consumer Communications and Networking Conference (CCNC), pp. 145–150. Cited by: §2.
  • [14] E. García-Martín, C. F. Rodrigues, G. Riley, and H. Grahn (2019) Estimation of energy consumption in machine learning. Journal of Parallel and Distributed Computing 134, pp. 75–88. Cited by: Appendix A, Appendix A.
  • [15] B. Goel, S. A. McKee, and M. Själander (2012) Techniques to measure, model, and manage power. In Advances in Computers, Vol. 87, pp. 7–54. Cited by: Appendix A.
  • [16] S. Gowal, K. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato, R. Arandjelovic, T. Mann, and P. Kohli (2018) On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715. Cited by: §4.3.
  • [17] M. Hähnel, B. Döbel, M. Völp, and H. Härtig (2012) Measuring energy consumption for short code paths using rapl. ACM SIGMETRICS Performance Evaluation Review 40 (3), pp. 13–17. Cited by: 1st item.
  • [18] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally (2016) EIE: efficient inference engine on compressed deep neural network. ACM SIGARCH Computer Architecture News 44 (3), pp. 243–254. Cited by: §2.
  • [19] K. Hazelwood, S. Bird, D. Brooks, S. Chintala, U. Diril, D. Dzhulgakov, M. Fawzy, B. Jia, Y. Jia, A. Kalro, et al. (2018) Applied machine learning at facebook: a datacenter infrastructure perspective. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 620–629. Cited by: §2.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.
  • [21] P. Henderson, J. Hu, J. Romoff, E. Brunskill, D. Jurafsky, and J. Pineau (2020) Towards the systematic reporting of the energy and carbon footprints of machine learning. arXiv preprint arXiv:2002.05651. Cited by: Broader Impact: Carbon emission and Machine Learning.
  • [22] M. Horowitz (2014) 1.1 computing’s energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14. Cited by: 3rd item.
  • [23] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §4.1.
  • [24] M. Jagielski, A. Oprea, B. Biggio, C. Liu, C. Nita-Rotaru, and B. Li (2018) Manipulating machine learning: poisoning attacks and countermeasures for regression learning. In 2018 IEEE Symposium on Security and Privacy (SP), pp. 19–35. Cited by: §1.
  • [25] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, et al. (2017) In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th Annual International Symposium on Computer Architecture, pp. 1–12. Cited by: §1, §2.
  • [26] N. Jouppi, C. Young, N. Patil, and D. Patterson (2018) Motivation for and evaluation of the first tensor processing unit. IEEE Micro 38 (3), pp. 10–19. Cited by: §2.
  • [27] K. N. Khan, M. Hirki, T. Niemi, J. K. Nurminen, and Z. Ou (2018) RAPL in action: experiences in using rapl for power measurements. ACM Trans. Model. Perform. Eval. Comput. Syst. 3 (2). ISSN 2376-3639. Cited by: 1st item.
  • [28] D. Kim, J. Ahn, and S. Yoo (2017) A novel zero weight/activation-aware hardware architecture of convolutional neural network. In Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, pp. 1462–1467. Cited by: §4.3.
  • [29] P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, et al. (2007) Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the association for computational linguistics companion volume proceedings of the demo and poster sessions, pp. 177–180. Cited by: §4.1.
  • [30] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
  • [31] A. Lacoste, A. Luccioni, V. Schmidt, and T. Dandres (2019) Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700. Cited by: Broader Impact: Carbon emission and Machine Learning.
  • [32] C. Li, Z. Wang, X. Hou, H. Chen, X. Liang, and M. Guo (2016) Power attack defense: securing battery-backed data centers. ACM SIGARCH Computer Architecture News 44 (3), pp. 493–505. Cited by: §2.
  • [33] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: Table 4, §4.1, Table 1.
  • [34] S. Markidis, S. W. Der Chien, E. Laure, I. B. Peng, and J. S. Vetter (2018) Nvidia tensor core programmability, performance & precision. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 522–531. Cited by: §2.
  • [35] T. Martin, M. Hsiao, D. Ha, and J. Krishnaswami (2004) Denial-of-service attacks on battery-powered mobile computers. In Second IEEE Annual Conference on Pervasive Computing and Communications, 2004. Proceedings of the, pp. 309–318. Cited by: §1, §2.
  • [36] B. Nelson, M. Barreno, F. J. Chi, A. D. Joseph, B. I. Rubinstein, U. Saini, C. A. Sutton, J. D. Tygar, and K. Xia (2008) Exploiting machine learning to subvert your spam filter.. LEET 8, pp. 1–9. Cited by: §1.
  • [37] N. Ng, K. Yee, A. Baevski, M. Ott, M. Auli, and S. Edunov (2019) Facebook fair’s WMT19 news translation task submission. CoRR abs/1907.06616. Cited by: §4.1, Table 3.
  • [38] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli (2019) Fairseq: a fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations, Cited by: §4.1.
  • [39] M. Ott, S. Edunov, D. Grangier, and M. Auli (2018) Scaling neural machine translation. arXiv preprint arXiv:1806.00187. Cited by: Table 4, §4.1, Table 1, Table 3.
  • [40] F. Palmieri, S. Ricciardi, U. Fiore, M. Ficco, and A. Castiglione (2015) Energy-oriented denial of service attacks: an emerging menace for large cloud infrastructures. The Journal of Supercomputing 71 (5), pp. 1620–1641. Cited by: §2.
  • [41] N. Papernot, P. McDaniel, A. Sinha, and M. Wellman (2016) Towards the science of security and privacy in machine learning. arXiv preprint arXiv:1611.03814. Cited by: §1.
  • [42] A. Parashar, M. Rhu, A. Mukkara, A. Puglielli, R. Venkatesan, B. Khailany, J. Emer, S. W. Keckler, and W. J. Dally (2017) Scnn: an accelerator for compressed-sparse convolutional neural networks. ACM SIGARCH Computer Architecture News 45 (2), pp. 27–40. Cited by: §4.3.
  • [43] A. Salem, Y. Zhang, M. Humbert, P. Berrang, M. Fritz, and M. Backes (2018) Ml-leaks: model and data independent membership inference attacks and defenses on machine learning models. arXiv preprint arXiv:1806.01246. Cited by: §1.
  • [44] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: §4.1.
  • [45] S. Sen, N. Imam, and C. Hsu (2018) Quality assessment of gpu power profiling mechanisms. In 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 702–711. Cited by: 2nd item.
  • [46] R. Sennrich, B. Haddow, and A. Birch (2015) Neural machine translation of rare words with subword units. CoRR abs/1508.07909. Cited by: §4.1.
  • [47] R. Shokri, M. Stronati, C. Song, and V. Shmatikov (2017) Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 3–18. Cited by: §1.
  • [48] G. Somani, M. S. Gaur, D. Sanghi, and M. Conti (2016) DDoS attacks in cloud computing: collateral damage to non-targets. Computer Networks 109, pp. 157–171. Cited by: §2.
  • [49] E. Strubell, A. Ganesh, and A. McCallum (2019) Energy and policy considerations for deep learning in nlp. arXiv preprint arXiv:1906.02243. Cited by: Broader Impact: Carbon emission and Machine Learning.
  • [50] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §4.1.
  • [51] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1.
  • [52] C. V. M. L. Team (2017) An on-device deep neural network for face detection. In Apple Machine Learning Journal, Cited by: §1.
  • [53] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3.2.2.
  • [54] A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2019) Superglue: a stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pp. 3261–3275. Cited by: §4.1.
  • [55] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) Glue: a multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461. Cited by: §4.1.
  • [56] W. Xu, Y. Qi, and D. Evans (2016) Automatically evading classifiers. In Proceedings of the 2016 network and distributed systems symposium, Vol. 10. Cited by: §3.3.1.
  • [57] Z. Xu, H. Wang, and Z. Wu (2015) A measurement study on co-residence threat inside the cloud. In 24th USENIX Security Symposium (USENIX Security 15), pp. 929–944. Cited by: §2.
  • [58] Z. Xu, H. Wang, Z. Xu, and X. Wang (2014) Power attack: an increasing threat to data centers.. In NDSS, Cited by: §2.

Appendix A Definitions

Energy $E$ is the total consumed static power $P_{static}$ and dynamic power $P_{dynamic}$ for an interval of time $t$:

$$E = (P_{static} + P_{dynamic}) \times t$$

Static power refers to the consumption of the circuitry when it is in idle state [15]. There exist multiple models to estimate the static energy consumption depending on the technology [15, 14, 5]. In this paper we follow a coarse-grained approach to energy estimation:

$$P_{static} = I_s \left(e^{qV_d/kT} - 1\right) \times V_s$$

where $I_s$ is the reverse saturation current; $V_d$ is the diode voltage; $k$ is Boltzmann’s constant; $q$ is the electronic charge; $T$ is temperature and $V_s$ is the supply voltage.

Dynamic power refers to consumption from (dis)charging of the said circuitry [14]:

$$P_{dynamic} = \alpha \times C \times V_s^2 \times f$$

Here, $\alpha$ refers to the activity factor, i.e. components that are currently consuming power; $V_s$ is the source voltage; $C$ is the capacitance; and $f$ is the clock frequency.

Ultimately an attacker attempts to solve an optimisation problem

$$\max \; (P_{static} + P_{dynamic}) \times t$$

For all parameters considered in the equation, only very few could be affected by the adversary described in Section 3.1. In particular, there are only four parameters that an attack might manipulate: $T$, $\alpha$, $f$ and $t$. Although frequency and temperature cannot be controlled directly, they are affected through optimisations performed by the computing hardware. As we assume a single GPU, CPU or ASIC, we focus on the activity ratio $\alpha$, the time $t$ and the switching power from flipping the state of transistors. The execution time and activity ratio link tightly to the number of operations and memory accesses performed. In the temporal dimension, attackers might trigger unnecessary passes of a compute-intensive block; in the spatial domain, attackers can turn sparse operations into dense ones. These temporal and spatial attack opportunities can significantly increase the number of memory and arithmetic operations and thus create an increase in $\alpha$ and $t$ to maximise energy usage.
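The energy model described in this appendix can be illustrated numerically. The following is a minimal sketch, assuming the standard diode-leakage form for static power and the usual activity-capacitance-voltage-frequency form for dynamic power; the function names and every numeric value are hypothetical placeholders chosen for illustration, not measurements from our experiments.

```python
import math

BOLTZMANN = 1.380649e-23           # J/K
ELECTRON_CHARGE = 1.602176634e-19  # C


def static_power(i_s, v_d, temp_k, v_s):
    """Leakage power: diode current (assumed Shockley form) times supply voltage."""
    exponent = ELECTRON_CHARGE * v_d / (BOLTZMANN * temp_k)
    return i_s * (math.exp(exponent) - 1.0) * v_s


def dynamic_power(alpha, capacitance, v_s, freq_hz):
    """Switching power: activity factor * capacitance * voltage^2 * frequency."""
    return alpha * capacitance * v_s ** 2 * freq_hz


def energy(p_static, p_dynamic, seconds):
    """Total energy over an interval: (static + dynamic power) * time."""
    return (p_static + p_dynamic) * seconds


# Hypothetical operating point (illustrative values only).
p_s = static_power(i_s=1e-12, v_d=0.3, temp_k=330.0, v_s=1.0)

# A sparse input: low activity factor, short run time.
baseline = energy(p_s, dynamic_power(0.2, 1e-9, 1.0, 1e9), seconds=0.010)

# A sponge-like input: higher activity factor and longer run time.
attacked = energy(p_s, dynamic_power(0.5, 1e-9, 1.0, 1e9), seconds=0.020)
```

Under these assumptions the attacker-controlled terms are the activity factor and the execution time, so raising either one raises the total energy.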

Appendix B Parameter Choices

We have thoroughly evaluated different parameter choices for the Sponge attack and found that a small pool size and a relatively small number of GA iterations could be sufficient for a large number of tasks.

Figure 5: GA performance with WSC task from GLUE Benchmark. Words of size 29 are evaluated with pool sizes of 100, 300, 500, 700 and 900.

Figure 5 shows the performance of Sponge samples on the RoBERTa model for the Winograd Schema Challenge (WSC) with different pool sizes and varying input sequence length. The horizontal axis shows the number of GA iterations. Although performance increases with larger pool sizes, the increase is marginal, and smaller pool sizes significantly reduce the runtime of the attack. From the hardware perspective, using a large pool size might also trigger GPUs to throttle, so that the runtime will be further increased. For the number of GA iterations, we consistently observed faster convergence for smaller input sequences, mainly because the search is less complex. In practice, we found almost all input sequence lengths we tested plateau within 100 GA iterations; even going to over 1000 iterations gives only a small increase in performance. For these reasons, for the experiments presented below we report the results of the attack with a pool size of 100 for GLUE and Computer Vision benchmarks, and 1000 for translation tasks. We use 100 GA iterations for all benchmarks tested.
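To make the roles of the two parameters concrete, the following is a minimal sketch of a generic genetic search loop with an explicit pool size and iteration count. It is not our attack code: the function names, the 10% survivor fraction and the toy fitness (the sum of a list of digits) are assumptions for illustration. In an actual attack the fitness would be the measured energy or latency of the model on the candidate input.

```python
import random


def genetic_search(fitness, random_individual, crossover, mutate,
                   pool_size=100, iterations=100, keep_frac=0.1, seed=0):
    """Generic GA loop; returns the best individual and the best-fitness history."""
    if pool_size < 2:
        raise ValueError("pool_size must be at least 2")
    rng = random.Random(seed)
    pool = [random_individual(rng) for _ in range(pool_size)]
    history = []
    for _ in range(iterations):
        pool.sort(key=fitness, reverse=True)
        history.append(fitness(pool[0]))
        # Survivors are kept unchanged, so the best fitness never decreases.
        parents = pool[:max(2, int(keep_frac * pool_size))]
        children = []
        while len(parents) + len(children) < pool_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rng))
        pool = parents + children
    pool.sort(key=fitness, reverse=True)
    history.append(fitness(pool[0]))
    return pool[0], history


# Toy problem (hypothetical): maximise the sum of eight digits.
def toy_random(rng):
    return [rng.randint(0, 9) for _ in range(8)]


def toy_crossover(a, b, rng):
    mid = len(a) // 2
    return a[:mid] + b[mid:]


def toy_mutate(x, rng):
    y = list(x)
    y[rng.randrange(len(y))] = rng.randint(0, 9)
    return y


best, history = genetic_search(sum, toy_random, toy_crossover, toy_mutate,
                               pool_size=20, iterations=30)
```

The cost of one iteration grows with the pool size, since every candidate must be evaluated, which is why a small pool is attractive when each evaluation requires running the target model.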

Appendix C Language and Vision Models Results

Table 1 and Table 2 show full sets of Sponge attacks running on NLP and CV tasks. It can be seen that the attacker can exploit the algorithmic complexity of NLP models and cause massive increases in both energy consumption and latency. It is clear that the increase is non-uniform across models, and we hypothesize that this is connected to the task detail and richness of the dictionaries we used.

Task    Input size

SuperGLUE Benchmark with [33]
CoLA     15      5829.32      4.30      69.72      83.92      87.11
CoLA     30      9388.40      4.30     138.07     164.07     169.91
CoLA     50     14030.60      4.30     227.48     267.60     275.27
CoLA     70     17369.33      4.30     318.87     370.34     379.76
CoLA    100     22698.87      4.30     452.49     518.19     530.80
MNLI     15      6126.65     12.88      73.47      86.97      89.96
MNLI     30      9631.68     17.66     142.63     168.96     174.34
MNLI     50     13777.36     23.28     232.22     270.95     278.85
MNLI     70     17696.38     28.52     321.63     370.80     381.21
MNLI    100     22952.14     34.47     456.11     518.89     531.40
WSC      15     27876.53     14.48     523.28    1300.19    2152.67
WSC      30     82822.58     34.94    1882.63    3927.63    5348.06
WSC      40    126009.10     47.34    3151.23    6438.32    9345.28
WSC      70    324814.61    115.17    8444.53   15632.59   20191.31
WSC     100    662811.96    194.89   16754.13   25367.30   30692.95

WMT14/16 with [39]
EnFr     10     23497.33     15.07      40.47      55.36      90.38
EnFr     30     59597.32     31.87     109.80     118.47     141.27
EnFr     50     93731.34     48.54     166.13     249.89     569.85
EnDe     15     18133.66     18.19      35.80     242.39     542.35

Table 4: Energy is reported in millijoules. All numbers are simulated, except for NVML, which is sampled directly from the GPU. The GA was run for 100 epochs with a pool size of 100.

For CV tasks, we can also observe that the attacker is capable of constructing samples that result in an increased internal data density. We observe that, for some network architectures, the effect is more prominent. Interestingly, we find that the majority of the improvement comes not from the density directly after ReLU, but rather from the propagation of non-zero values, as this increases the number of DRAM accesses. For example, on MobileNet, the decrease in sparsity leads to at least 7% degradation on ASIC.

ImageNet
Model           Input            Time     Simulated (absolute)   Simulated (ratio)   Density
ResNet-18       Sponge LBFGSB    0.006     53.359    0.899    0.896
ResNet-18       Sponge           0.006     51.816    0.873    0.869
ResNet-18       Natural          0.007     51.748    0.871    0.869
ResNet-18       Random           0.006     49.685    0.837    0.834
ResNet-50       Sponge LBFGSB    0.011    164.727    0.863    0.885
ResNet-50       Sponge           0.016    160.887    0.843    0.868
ResNet-50       Natural          0.017    160.562    0.842    0.867
ResNet-50       Random           0.017    155.820    0.817    0.845
ResNet-101      Sponge LBFGSB    0.021    258.526    0.857    0.873
ResNet-101      Sponge           0.024    254.182    0.842    0.861
ResNet-101      Natural          0.027    253.042    0.839    0.857
ResNet-101      Random           0.025    249.027    0.825    0.846
DenseNet-121    Sponge LBFGSB    0.033    152.595    0.783    0.826
DenseNet-121    Sponge           0.029    149.564    0.767    0.814
DenseNet-121    Natural          0.033    147.227    0.755    0.804
DenseNet-121    Random           0.030    144.365    0.741    0.792
DenseNet-161    Sponge LBFGSB    0.040    288.427    0.726    0.764
DenseNet-161    Sponge           0.044    287.153    0.723    0.761
DenseNet-161    Natural          0.045    282.273    0.711    0.751
DenseNet-161    Random           0.044    279.270    0.703    0.744
DenseNet-201    Sponge LBFGSB    0.048    237.745    0.756    0.788
DenseNet-201    Sponge           0.046    239.845    0.763    0.794
DenseNet-201    Natural          0.049    234.948    0.747    0.781
DenseNet-201    Random           0.046    233.700    0.743    0.777
GoogleNet       Sponge LBFGSB    0.018     47.454    0.862    0.953
GoogleNet       Sponge           0.015     46.088    0.837    0.951
GoogleNet       Natural          0.016     45.964    0.835    0.955
GoogleNet       Random           0.015     44.164    0.802    0.938
MobileNet v2    Sponge LBFGSB    0.011     87.511    0.844    0.890
MobileNet v2    Sponge           0.010     84.513    0.815    0.868
MobileNet v2    Natural          0.011     85.075    0.821    0.873
MobileNet v2    Random           0.011     80.805    0.779    0.844

Table 5: Energy is reported in millijoules. All numbers are simulated, except for NVML, which is sampled directly from the GPU. The GA was run for 100 epochs with a pool size of 100.
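The internal data density discussed above can be illustrated with a short sketch. It assumes that density is measured as the fraction of non-zero values in a layer's output after ReLU, and that an overall density pools all layers together; both function names and the pooling rule are assumptions for illustration rather than the exact definitions used by our simulator.

```python
def relu(values):
    """Element-wise ReLU on a flat list of activations."""
    return [v if v > 0.0 else 0.0 for v in values]


def density(values):
    """Fraction of entries that are non-zero (1 - sparsity)."""
    if not values:
        raise ValueError("cannot compute the density of an empty tensor")
    return sum(1 for v in values if v != 0.0) / len(values)


def overall_density(layers):
    """Density pooled over several flattened layer outputs (assumed definition)."""
    total = sum(len(layer) for layer in layers)
    if total == 0:
        raise ValueError("no activations given")
    nonzero = sum(1 for layer in layers for v in layer if v != 0.0)
    return nonzero / total


# Two of the four pre-activations survive ReLU, so the density is 0.5.
post_relu = relu([-1.0, 2.0, 0.0, 3.0])
layer_density = density(post_relu)
```

Hardware that skips zero operands does less work when this fraction is low, which is why a sponge input tries to push it up.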

Appendix D Domain Specific Optimisations for Sponge

In Section 3.3 we outlined the genetic algorithm we used to find Sponge samples.

Despite the common structure, we relied heavily on domain-specific optimisations. In this section, we explain the reasoning behind them.

First, for NLP tasks, the greatest impact on performance came from exploiting the encoding schemes used by different tasks. While the genetic algorithm was quick to pick this up, it struggled with efficiency around the mid-point, where the parents were concatenated. For example, when trying to break down individual words into more tokens, we observed the GA inserting backslashes into the samples. When concatenated, we saw cases where two non-backslashes followed each other, meaning the GA was losing out on a couple of characters. As a solution, we probabilistically flipped the halves and saw a slight improvement.
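The mid-point crossover with probabilistic flipping can be sketched as follows. This is one possible reading of the procedure, assuming that "flipping" means swapping the order of the two halves before concatenation; the function name and the default flip probability are hypothetical.

```python
import random


def crossover_with_flip(parent_a, parent_b, rng, flip_prob=0.5):
    """Join the left half of parent_a with the right half of parent_b.

    With probability flip_prob the two halves are swapped, so that a
    different pair of characters ends up adjacent at the junction.
    Works for strings and for lists of tokens.
    """
    mid = len(parent_a) // 2
    left, right = parent_a[:mid], parent_b[mid:]
    if rng.random() < flip_prob:
        left, right = right, left
    return left + right


rng = random.Random(0)
never_flipped = crossover_with_flip("AAAA", "BBBB", rng, flip_prob=0.0)
always_flipped = crossover_with_flip("AAAA", "BBBB", rng, flip_prob=1.0)
```

Swapping the halves changes which characters meet at the junction while leaving the material inherited from each parent untouched.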

For CV tasks, we observed that random samples were always classified as belonging to the same class. Furthermore, random samples had very low internal density. We hypothesize that this is because random samples contain very few class features, as opposed to what is observed in natural samples. As the GA improvement largely depends on randomness, we often observed that after merging two highly dense parents, uniform randomness across all pixels decreased density to the level of random samples. In other words, uniform randomness was diluting class features. To counter this phenomenon, instead of applying uniform randomness across all pixel values, we resorted to diluting only 1% of them. That led to a bigger improvement of the whole population pool. Furthermore, after observing that the density is class-dependent, it became apparent that in order to preserve diversity in the pool it was important to keep samples from multiple classes. For this, we tried to ensure that at least 20 different classes were preserved in the pool.
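The two CV-specific adjustments described above, mutating only 1% of the pixels and keeping several classes in the pool, can be sketched as follows. The function names, the flat list representation of an image and the two-pass selection rule are assumptions for illustration; only the 1% fraction and the target of 20 classes come from the text.

```python
import random


def mutate_sparse(pixels, rng, frac=0.01):
    """Re-randomise only a small fraction of pixel values (values in [0, 1))."""
    out = list(pixels)
    if not out:
        return out
    k = max(1, round(frac * len(out)))
    for i in rng.sample(range(len(out)), k):
        out[i] = rng.random()
    return out


def select_diverse(candidates, keep, min_classes=20):
    """Pick `keep` candidates from (fitness, class_label, sample) tuples.

    First pass: take the fittest candidate of each unseen class until
    min_classes classes are represented. Second pass: fill the remaining
    slots with the fittest candidates not yet chosen.
    """
    ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
    chosen, seen = [], set()
    for idx, cand in enumerate(ranked):
        if len(seen) >= min_classes or len(chosen) >= keep:
            break
        if cand[1] not in seen:
            seen.add(cand[1])
            chosen.append(idx)
    taken = set(chosen)
    for idx in range(len(ranked)):
        if len(chosen) >= keep:
            break
        if idx not in taken:
            chosen.append(idx)
            taken.add(idx)
    return [ranked[i] for i in chosen]


rng = random.Random(0)
image = [0.5] * 1000
mutated = mutate_sparse(image, rng)  # at most 10 of 1000 values change
```

Selecting purely by fitness would let one dense class take over the pool, whereas the first pass of `select_diverse` reserves a slot for each class before the rest is filled by fitness.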

We attempted to use domain knowledge and tried adding operations like rotation, transposition and re-scaling into the mutation process, yet we found that these did not lead to significant improvements.

Appendix E Measuring impact in practice

Although we have shown in Section 4.3 that Sponge attacks cause the ASIC energy consumption to rise for computer vision tasks, it is still unclear how this translates to real life.

If one were to directly measure the CPU or GPU load per adversarial sample, interpreting it would be hard, especially when the energy cost increases are on the order of 5% for ResNet18 and 3% for DenseNet101. As mentioned in Appendix A, the main energy costs include the frequency of switching activities, voltage and clock frequency. Due to the heat impact from voltage and clock frequency, a large number of different optimisations are deployed by the hardware. These optimisations try to balance multiple objectives: they try to be as performant as they can, whilst being as energy efficient as possible and also maintaining reliability. Modern CPUs and GPUs have a number of performance modes between which the hardware can switch. For example, official Nvidia documentation lists 15 different performance modes.

Figure 8: ResNet-18 solving ImageNet-2017 without any rate limiting with increasing internal density.

Figure 8 shows measurements taken during the Sponge GA attack running against ResNet-18. The x-axis shows the number of epochs; with each epoch the internal density increases, from 0.75% to 0.8%. In (a), the right y-axis shows mean energy readings per sample, whereas the left y-axis shows mean power readings per sample. In (b), the left y-axis shows mean latency values per sample.

The amount of power consumed is strongly correlated to the amount of time taken by each sample. When the GPU speeds up, it consumes more energy but requires less time; the resulting rise in temperature then causes the hardware to switch to a more conservative mode to cool down. We observe this heating and cooling cycle with all tasks running on GPUs, making it hard to measure the absolute performance and the attack impact.

We can, however, measure the performance statistically. First, we turn to the following question:

Can we detect energy differences between Natural, Random and sponge samples?

To investigate the relationship between the samples we use the Mann-Whitney-Wilcoxon U test (U-test), a nonparametric test for difference between distributions. With three classes of samples, we need three pairwise comparisons. For each one, the null hypothesis is that the distributions of energy consumed by the samples are identical. The alternative hypothesis is that the distributions differ.

The U-test is based on three main assumptions:

  • Independence between samples;

  • The dependent variable is at least ordinal;

  • The samples are random.

The first assumption is fulfilled since no sample belongs to more than one category i.e. natural, random and sponge. The second assumption is satisfied by the fact that both time and energy are cardinal variables. The third assumption, however, is harder to satisfy.

The cause of this lies in the closed nature of hardware optimisations: although some of the techniques are known, the exact parameters are unknown. Furthermore, it is hard to achieve the same state of the hardware even through power cycling. As was mentioned in Appendix A, temperature affects energy directly, and it is hard to make sure that the hardware always comes back to the same state.

To minimise temperature effects we apply the load of natural, attack and random samples and wait until the temperature stabilises. That takes approximately 30000 samples. The order of the samples is random, and at this point it can be assumed that all of the data and instruction caches are filled. Finally, because the samples are randomly shuffled, all of the predictive optimisations will work with the same probability for each of the classes.

For these reasons, we believe it is safe to assume that the samples themselves are random in that the effect of hardware optimisations is random, so that the last assumption of the Mann-Whitney test is fulfilled.
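The stabilisation procedure described above can be sketched as a small measurement harness. It assumes a caller-supplied `run_and_measure` function that executes one sample and returns its energy reading; that function, the other names and the dictionary layout are hypothetical. The warm-up length is a parameter, which in our setting was roughly 30000 samples.

```python
import random


def build_schedule(samples_by_class, rng):
    """Interleave all classes into one randomly shuffled list of (label, sample)."""
    schedule = [(label, sample)
                for label, samples in samples_by_class.items()
                for sample in samples]
    rng.shuffle(schedule)
    return schedule


def measure(schedule, run_and_measure, warmup):
    """Run every sample in order, discarding readings taken during warm-up."""
    readings = {}
    for i, (label, sample) in enumerate(schedule):
        value = run_and_measure(sample)  # always executed, to keep the load steady
        if i >= warmup:
            readings.setdefault(label, []).append(value)
    return readings


# Toy usage with a stand-in measurement function (hypothetical).
schedule = build_schedule(
    {"natural": [1, 2, 3], "random": [4, 5, 6], "sponge": [7, 8, 9]},
    random.Random(0),
)
readings = measure(schedule, lambda s: float(s), warmup=3)
```

Because the classes are shuffled together, any drift in temperature or cache state affects all three classes with the same probability.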

Using the Mann-Whitney U test we can pairwise compare the natural, random and sponge samples. The test indicates that the three types of samples generate energy consumption distributions which are statistically different (one-sided test, p-value=0.000) for mobilenet executed on a CPU. On a practical level, the amount of energy consumed by sponge samples is 1.5% higher on a CPU and >7% higher on ASIC. We could not evaluate the energy recordings on a GPU, as the standard deviation was in excess of 15%, which worsened as temperature increased.
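For readers who want to reproduce this kind of comparison, the following is a minimal pure-Python sketch of a one-sided Mann-Whitney U test. It uses the normal approximation without a tie correction, which is an assumption suited to moderately large samples with few ties; in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be preferred. The energy readings in the example are synthetic and purely illustrative.

```python
import math
import random


def mann_whitney_u(x, y):
    """One-sided test that values in x tend to be larger than values in y.

    Returns (U statistic for x, approximate p-value). Uses average ranks
    for ties and the normal approximation without tie correction.
    """
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u_x - mean_u) / sd_u
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(Z >= z)
    return u_x, p_value


# Synthetic readings (hypothetical): sponge energy shifted up by 1.5%.
rng = random.Random(0)
natural = [rng.gauss(100.0, 1.0) for _ in range(200)]
sponge = [rng.gauss(101.5, 1.0) for _ in range(200)]
u_stat, p = mann_whitney_u(sponge, natural)
```

The three pairwise comparisons (natural against random, natural against sponge, random against sponge) would each call this function once on the per-class readings.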

Figure 11: Mann-Whitney test on CPU measured Mobilenet execution. Number of observations is shown on x-axis and p-value on the y-axis.

Figure 11 shows the confidence of the Mann-Whitney test with mobilenet measured on the CPU as a function of the number of observations. The number of observations is on the x-axis, and the p-value on the y-axis. As can be seen, in a stable environment, i.e. once the temperature has stabilised, after about 100 observations per class the differences become statistically significant at any reasonable confidence level. A similar trend is observed for an unstable temperature environment, but around three times more data is required. That means that in practice, about 100–300 observations per class are sufficient to differentiate between classes with high confidence.

Appendix F Transferability of Attacks

Figure 14: Sponge example density transferability.

Figure 14 shows the density difference of transferred sponge samples. As can be clearly seen, for all networks but one (mobilenet), the sponge samples increased the internal data density despite not having any knowledge of what natural samples look like. All of the sponge samples outperformed random noise, suggesting that sponge samples target specific features of the data set and can be applied in a blind black-box fashion.

Appendix G Per class density transfer

Figure 18: Per-class density of natural samples from the ImageNet validation dataset.

Figure 18 shows the density of natural samples from the ImageNet dataset. It can be clearly seen that there are per-class similarities between data densities of natural samples. These are particularly pronounced within the ResNet and DenseNet architectures, hinting that similar architectures learn similar class-data representations. Finally, the third subgraph, Figure 18(c), shows the summed per-class densities across all of the tested networks. There are classes that are consistently densely represented across all the tested networks.

Appendix H Understanding Sponge and sponge performance

Figure 21: Per-class mean density of samples evaluated on ResNet18 and DenseNet161. The natural samples are from the validation set and are compared to 50 000 randomly generated samples and 1000 Sponge GA samples. The scales are normalised to form a probability density.

To better understand the results, we present Figure 21 which shows per-class density distributions of natural, random and Sponge samples. There are 50000 random and natural samples respectively and 1000 Sponge samples, with the bars normalised to form a probability density.

The first thing that becomes apparent is that randomly generated samples on CV models cost significantly less energy because many activations are zero. On average, random samples result in a sparser computation than natural samples for ResNet18, and our simulator costs for natural samples are correspondingly higher than the costs of random samples. Second, a surprising finding is that the most and least sparse samples are clustered in a handful of classes. In other words, certain classes have inputs that are more expensive than others in terms of energy. For ResNet-18, the most sparse classes are ‘wing’ and ‘spotlight’ while the least sparse are ‘greenhouse’ and ‘howler monkey’. We observe a similar effect for larger ResNet variants and also DenseNets, although the energy gap is smaller on DenseNets. Interestingly, we see that energy-expensive classes are consistent across different architectures, and we further demonstrate this class-wise transferability in Appendix G. Ultimately, the implication of this phenomenon is that it is possible to burn energy or slow a system without much preparation, by simply bombarding the model with natural samples from energy-consuming classes. Finally, we see that the Sponge samples improve the population performance and tend to outperform natural samples. We observe that it is easier for Sponge to outperform all natural samples for DenseNets of different sizes, yet it struggles to outperform all natural samples on the ResNets. We further measure the energy performance statistically in Appendix E.