The power that machine learning models consume when making predictions can be affected by a model's architecture. This paper presents various estimates of power consumption for a range of different activation functions, a core factor in neural network model architecture design. Substantial differences in hardware performance exist between activation functions. This difference informs how power consumption in machine learning models can be reduced.
The field of deep neural networks has reported strong progress in many problem areas, including natural language processing (NLP), image recognition, and game playing. Many of the advances in these areas have been the fruit of using larger and thus more computationally demanding neural networks. Amodei and Hernandez (2018) find that the cost of training doubled every few months between the releases of AlexNet (Krizhevsky et al., 2012) and AlphaZero (Silver et al., 2018). In NLP, power consumption has also risen: Strubell et al. (2019) determine the carbon footprint of a contemporary machine translation architecture search to be on the order of hundreds of intercontinental flights, for models that offer only marginal performance improvement.
This paper examines activation functions, a core part of neural networks. The activation function is the non-linearity at the core of each network node. It is applied over the input and bias parameters at a given node for each inference that a model makes. This makes for a potentially large number of computations being required to make predictions, predicated on network structure and size. When it comes to individual calculations, there is also broad variance. The complexity of low-level instructions for each of these functions also varies widely, from the simple rectified linear unit to the transcendental hyperbolic tangent. This variance has the potential to lead to differences in power consumption.
The constraints on choice of a neural network activation function are (a) that it must be differentiable, and (b) that it must have a continuous domain. These are required in order to train networks through backpropagation. A large range of functions fit these constraints, and as such, currently popular machine learning frameworks implement a broad range of activation functions.
The machine code for executing these functions can vary significantly in complexity. Figure 1 compares toy x86 code for rectified linear unit activation with code for a hyperbolic tangent (tanh) function. The tanh code is not only more complex, but also requires use of special resources such as an FPU (floating point unit). In practice, these functions are often run in a SIMD/SIMT structure, where a single instruction is performed over many data points at a time. However, x86 and CUDA SIMD instruction sets have similar restrictions: there is no direct tanh function for either, and instead a sequence of secondary operations has to be performed. While inference requires many other operations beyond calculating activation functions, the difference in scale of these functions’ computation still leaves room for optimisation.
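The contrast described above can be illustrated at a higher level than machine code. The following is a minimal Python sketch, not the paper's Figure 1: the function names `relu` and `tanh_via_exp` and the clamping threshold of 20 are illustrative assumptions. It shows that a rectified linear unit needs only a comparison, while tanh, when no direct instruction exists, has to be composed from an exponential and several arithmetic operations.

```python
import math


def relu(x: float) -> float:
    # A single comparison and select; no transcendental operations needed.
    return x if x > 0.0 else 0.0


def tanh_via_exp(x: float) -> float:
    # tanh(x) = (e^(2x) - 1) / (e^(2x) + 1): one exponential plus a
    # multiply, a subtract, an add, and a divide.
    # The clamp (threshold chosen for illustration) avoids overflow in
    # math.exp; beyond |x| = 20 the result rounds to +/-1 in double precision.
    if x > 20.0:
        return 1.0
    if x < -20.0:
        return -1.0
    e2x = math.exp(2.0 * x)
    return (e2x - 1.0) / (e2x + 1.0)
```

The exponential itself is typically expanded further into range reduction and polynomial evaluation by the maths library or hardware, which is where much of the additional cost arises.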
A further bottleneck is presented by hardware structure. The operations needed to compose some activation functions can require special hardware. This hardware can be scarce and therefore highly contended. For example, an NVIDIA V100 card’s streaming multiprocessor (SM) has just one single special function unit (SFU) to sixteen 32-bit floating point cores, eight 64-bit floating point cores, and sixteen 32-bit integer cores (Durant et al., 2017; Markidis et al., 2018). When computing an activation function requires special hardware, such as an SFU, the rest of the hardware unit (e.g. a CUDA warp) may be left idle until computation completes. Due to this bottleneck, use of these activation functions could lead to both increased power consumption (through fixed overheads incurred in the background as threads wait) and also slow models.
Current research increasingly addresses the energy impact of machine learning (Schwartz et al., 2019; Fan et al., 2020). A large amount of work has been done on reducing the training time of machine learning models (Girosi et al., 1995; Prechelt, 1998); on reducing precision to afford bandwidth (Woodland, 1989; Wang et al., 2018); and on increasing efficiency through parameter reduction (Kim and Rush, 2016; Alvarez and Park, 2019). Toolkits for measuring emissions impact have started to appear; these give particularly detailed results for some geographic regions and work on a limited range of hardware (Henderson et al., 2020). However, the specific impact that activation function choice has on power consumption has not been previously investigated.
The experiment goal is to gauge the power consumption impact of varying activation function type in a neural network. Although activation function choice can impact the length of the training phase, predicated on both architecture and training data, the resources needed to label one instance at inference time are predicated only on architecture. Thus, experiments do not need to consider training data in order to estimate impacts on inference-time power consumption.
Implementations of AlphaDropout, CELU, Dropout, Dropout2d, Dropout3d, ELU, GELU, Hardshrink, Hardtanh, Identity, LeakyReLU, LogSigmoid, LogSoftmax, PReLU, ReLU, ReLU6, RReLU, SELU, Sigmoid, Softmax, Softmin, Softplus, Softshrink, Softsign, Tanh, and Tanhshrink in PyTorch 1.5.0 (Paszke et al., 2019) are evaluated.
The test neural network had a 64-unit linear input layer, four hidden layers of 1024 units, and a sixteen unit linear output layer. The activation functions in the 4096 hidden nodes were varied as an experiment parameter. Each model was trained for each activation function for 2000 epochs using random data, randomly initialised weights, and Adam optimisation.
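To give a sense of scale, the sketch below counts the work done per inference instance for this network. It assumes the layer widths are 64, then four layers of 1024, then 16, with dense connections between consecutive layers; that reading of the description, and the names `LAYER_WIDTHS`, `hidden_activation_calls`, and `multiply_accumulates`, are assumptions made for illustration.

```python
# Layer widths as read from the description above (an assumption about how
# the layers connect): 64 -> 1024 -> 1024 -> 1024 -> 1024 -> 16.
LAYER_WIDTHS = [64, 1024, 1024, 1024, 1024, 16]


def hidden_activation_calls(widths):
    # One activation evaluation per hidden unit for each instance.
    return sum(widths[1:-1])


def multiply_accumulates(widths):
    # One multiply-accumulate per weight between consecutive dense layers.
    return sum(a * b for a, b in zip(widths, widths[1:]))
```

Under these assumptions each instance incurs 4096 activation evaluations alongside roughly 3.2 million multiply-accumulates, so any per-call cost difference between activation functions is multiplied several thousand times per prediction.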
The metric is power use, measured for different activation functions. Power consumption was proxied through wall clock time via Python’s time.perf_counter(). This metric has known deficiencies: while a process consumes time it may impose a range of power overheads. Nevertheless, we expect that the specific variation in real power consumption is low enough across these similar experimental workloads, and spurious loads will be cushioned through the use of multiple runs. Experiments were run three times and means taken. Experiments had a one day completion time limit.
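A minimal sketch of this measurement procedure follows. It uses `time.perf_counter()` as stated above and averages over three runs; the helper names `time_once` and `mean_runtime`, and the `dummy_workload` stand-in for a model's forward pass, are hypothetical and not taken from the released code.

```python
import statistics
import time


def time_once(workload) -> float:
    # Wall clock seconds for a single call of the workload.
    start = time.perf_counter()
    workload()
    return time.perf_counter() - start


def mean_runtime(workload, runs: int = 3) -> float:
    # Repeat the measurement and average, to cushion spurious system load.
    return statistics.mean(time_once(workload) for _ in range(runs))


def dummy_workload():
    # Hypothetical stand-in for running inference over a pre-generated set.
    sum(i * i for i in range(10_000))
```

For example, `mean_runtime(dummy_workload)` returns the mean wall clock time in seconds over three runs.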
Test platforms were: (a) CPU on a commodity laptop (2017 MacBook Pro); (b) CPU on a server (Xeon E5-2660, 40 hyperthreaded cores; 256GB RAM) with consumer GPU (NVIDIA GeForce GTX 1080 Ti); (c) datacentre GPU (NVIDIA Tesla P100 16GB). Platforms (b) and (c) both ran CUDA 10.1. SIMD CPU extensions such as AVX and AVX512 were left enabled or disabled according to pip distribution defaults.
Inference workloads are a pre-generated amount of random data. Scales are chosen to resemble the scales of real prediction workloads, especially for on-demand services. Inference set workload sizes range from to instances.
Code is available at https://github.com/leondz/inferencepower (including full graphs).
Figure 2 shows the inference time taken per-instance for a range of activation functions and inference set sizes. There are differences in the time taken by activation functions, and therefore power consumed. Function performance indicates that dropout functions have a lower power consumption, and that the identity function the lowest. The net time per instance is higher for smaller inference sets, which can be explained by the impact of fixed costs.
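The fixed-cost explanation can be made concrete with a simple cost model. The sketch below assumes total runtime is a fixed overhead plus a constant cost per instance; this linear model and the function name `time_per_instance` are illustrative assumptions, not something measured in the experiments.

```python
def time_per_instance(n_instances: int, fixed_cost_s: float,
                      per_instance_s: float) -> float:
    # Total time = fixed overhead + linear cost; divide by the set size.
    total = fixed_cost_s + n_instances * per_instance_s
    return total / n_instances
```

Under this model the per-instance time falls towards `per_instance_s` as the inference set grows, because the fixed overhead is shared across more instances.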
Of the activation functions, tanh, logSigmoid, tanhshrink, and softmin are the slowest to run at scale. This indicates that, due to their increased estimated emissions impact, use of these functions should be considered carefully before models using them are deployed broadly or deployed in an application with a long lifetime.
As the size of the inference set increases, so does the proportion of runtime spent running these functions. While the “v”-shaped part of the curve suggests some caching/batching effects for the GPUs, the relative difference between activation functions is the phenomenon of interest, and that persists. The spread between the fastest and slowest activation functions is roughly a factor of two, and is present across workload sizes and platforms. There is a slight suggestion that the function-based efficiency spread may close gradually on GPUs with even larger inference sets.
The scale of difference between fast and slow activation functions is shown in Figure 3. Dropout functions perform roughly as well as each other. The performance difference between activation functions varies, but the spread persists. CPU activation workload spreads are fairly consistent regardless of the size of the inference set. GPU spread varies depending on inference set size, with some smaller workload sizes presenting high variance, and a generally decreasing spread as inference set sizes rise. This suggests that, for GPUs, activation function choice has more effect in situations where inference is performed over smaller sets at a time.
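One way to express the spread discussed here is as the ratio of the slowest to the fastest mean runtime at a given inference set size. The sketch below assumes that definition; the function name `activation_spread` and the shape of its input (a mapping from function name to mean seconds) are hypothetical.

```python
def activation_spread(mean_times):
    # mean_times maps activation function name -> mean runtime in seconds
    # at one inference set size on one platform.
    fastest = min(mean_times, key=mean_times.get)
    slowest = max(mean_times, key=mean_times.get)
    # Ratio of slowest to fastest: a value of 2.0 means a factor-of-two spread.
    return mean_times[slowest] / mean_times[fastest], fastest, slowest
```

Computing this ratio separately for each inference set size and platform shows where the choice of function matters most.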
There are spikes in GPU spread at certain inference set scales. For example, the difference between activation functions on consumer GPU hardware was a factor of 11 when doing inference on a set of values, i.e. an order of magnitude. The datacentre GPU platform did inference on the workload of values seven times slower on the slowest activation function than on the fastest. The magnitude of the scale of variation indicates: (a) that applications should be analysed and tuned on the target hardware if one is to avoid particularly costly activation functions; (b) applications with high-frequency workloads of smaller inference sets may be particularly prone to raised power consumption and emissions due to activation function choice.
shows the mean and standard deviation in activation function performance across inference set sizes and platforms, normalised relative to the identity function’s performance. Higher spreads and variations indicate greater potential impact from activation function tuning. GPU processing time over different activation functions varies more than for CPUs, depending on inference set size. The size of the spread in absolute GPU timings seen at and (Figure 2(a)) is echoed here. On the other hand, the consumer CPU platform experiences relatively little variation across activation function performances as inference set size increases. This suggests that tuning function choice in larger-scale machine learning environments, e.g. datacentres and GPU hardware, can lead to the greatest relative emission reductions.
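The normalisation described above can be sketched as follows. The code assumes each function has one timing per inference set size, divides it by the identity function's timing at the same size, and then summarises the ratios. The name `normalised_summary` and the use of the population standard deviation are assumptions, since the text does not say which estimator was used.

```python
import statistics


def normalised_summary(times, baseline="Identity"):
    # times maps function name -> list of runtimes in seconds, one entry per
    # inference set size, in the same order for every function.
    base = times[baseline]
    summary = {}
    for name, series in times.items():
        # Ratio to the baseline at each inference set size.
        ratios = [t / b for t, b in zip(series, base)]
        # Population standard deviation is an assumed choice of estimator.
        summary[name] = (statistics.mean(ratios), statistics.pstdev(ratios))
    return summary
```

A mean well above 1 marks a function that is consistently slower than identity; a large standard deviation marks one whose relative cost depends strongly on inference set size.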
Low power activations are less useful if one needs to use them more often to do the same thing. This is especially important when training machine learning models. The number of iterations required is predicated upon not only network architecture, but also the training data, the hyperparameters, and the starting conditions. Further, depending on a model’s usage scenario, the part of its total power consumption represented by training can be between everything (if one never predicts) and asymptotic to zero (if one does many predictions). Nevertheless, it is helpful to estimate the demands that different activations place during this phase.
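The range between "everything" and "asymptotic to zero" follows from a simple ratio, sketched below. The function name `training_share` and the assumption of a constant cost per prediction are illustrative; the units are arbitrary as long as both costs use the same one.

```python
def training_share(training_cost: float, cost_per_prediction: float,
                   n_predictions: int) -> float:
    # Fraction of lifetime cost attributable to training, assuming a
    # positive one-off training cost and a constant cost per prediction.
    total = training_cost + n_predictions * cost_per_prediction
    return training_cost / total
```

With no predictions the share is 1; as the number of predictions grows it tends towards 0, which is why inference-time efficiency dominates for widely deployed models.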
To work out how many iterations are needed, a dummy workload and performance target can be set up. In this case, we used MNIST data (LeCun et al., 1998). The evaluation network was similar to that in Section 3, but instead using the MNIST training data with an input dimension of 784, a final sigmoid output layer with ten nodes (one per digit), and optimised with stochastic gradient descent. The activation function of the middle four layers of 1024 nodes each is varied as the experiment parameter. The hardware is a server CPU, platform (b) from Section 3. Training was stopped after the epoch when validation accuracy exceeded 0.90, or after 100 epochs. Time was only accumulated during training and not evaluation. Note that networks with hidden layers composed of regularising functions (i.e. dropout) and the identity function are still able to learn the target function in this setup due to the presence of the sigmoid output function.
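The stopping rule and timing policy can be sketched as a small loop. The callables `train_one_epoch` and `validation_accuracy` are hypothetical placeholders for the actual training and evaluation steps, and `train_until_target` is an illustrative name; only the 0.90 threshold, the 100-epoch cap, and the exclusion of evaluation time come from the description above.

```python
import time


def train_until_target(train_one_epoch, validation_accuracy,
                       target: float = 0.90, max_epochs: int = 100):
    # Returns (epochs run, training seconds, whether the target was reached).
    training_seconds = 0.0
    for epoch in range(1, max_epochs + 1):
        start = time.perf_counter()
        train_one_epoch()
        # Only the training step is timed; evaluation happens outside it.
        training_seconds += time.perf_counter() - start
        if validation_accuracy() > target:
            return epoch, training_seconds, True
    return max_epochs, training_seconds, False
```

A run that returns `False` in the last position corresponds to a function that hit the epoch cap without reaching the target accuracy.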
Figure 4 shows the total time consumed to reach the target validation accuracy. Not all functions reached the required accuracy within the given number of epochs. If a function reached the maximum epoch count in any of its runs without achieving 90% on the validation set, its bar is marked in grey. Scaled Exponential Linear Units (Klambauer et al., 2017) performed particularly well on this problem over multiple random initialisations. From a learning perspective, the networks have not performed particularly well: identity performed quickest, suggesting it was simpler to have a narrow sigmoid layer learn the problem than multiple broad hidden ones. Of the functions, linear units generally performed well; only one of these did not complete the problem in the required number of epochs. It is also possible that the chosen optimiser is not equally suitable for all activation functions. However, there is an indication that many of the functions that are efficient in earlier experiments evaluating inference-time performance can also perform well during training.
This paper estimated the power consumption of neural network activation functions. The spread between activation functions was often a factor of two or more, with larger spreads for different platform-dependent workloads. This result was consistent across device type (CPU and GPU), on both consumer and datacentre hardware, and for various scales of dataset. The scale of spread indicates that choice of neural network activation function affects machine learning model power consumption.
Thanks to Peter Sestoft for suggestions regarding hardware constraints.
Markidis et al. (2018). NVIDIA tensor core programmability, performance & precision. In Proceedings of the International Parallel and Distributed Processing Symposium Workshops (IPDPS), pp. 522–531.
Paszke et al. (2019). PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
Silver et al. (2018). A general reinforcement learning algorithm that masters chess, shogi, and go through self-play. Science 362 (6419), pp. 1140–1144.
Woodland (1989). Weight limiting, weight quantisation and generalisation in multi-layer perceptrons. In First IEE International Conference on Artificial Neural Networks, (Conf. Publ. No. 313), pp. 297–300.
These are the mean absolute prediction times for various activation functions in seconds, on CUDA GPUs.
These are the mean absolute prediction times for various activation functions in seconds, on CPUs.