Function approximation is a software engineering technique in which we seek to replace complex functions with faster, but less accurate, alternatives. For example, consider the exponential function exp(x). This is a challenging function to implement correctly and much research has been published on algorithms for achieving minimal error with correct rounding de Dinechin et al. (2005); De Dinechin et al. (2004); Defour et al. (2004). However, if the caller is able to tolerate some error we could replace the implementation of exp(x) with some terms from its Taylor Series. This yields faster execution at the cost of some loss of accuracy.
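As a minimal sketch of this trade-off, the truncated Taylor series below is cheap to evaluate but its error grows rapidly away from the expansion point (the function name and term count are illustrative, not from the paper):

```python
def exp_taylor(x, terms=8):
    # Truncated Taylor series about 0: 1 + x + x^2/2! + x^3/3! + ...
    # Cheap relative to a correctly rounded exp(), but the truncation
    # error grows quickly as |x| moves away from 0.
    total, term = 1.0, 1.0
    for i in range(1, terms):
        term *= x / i       # term is now x^i / i!
        total += term
    return total
```

Near zero the eight-term sum agrees with exp(x) to several decimal places; at x = 5 it is off by more than 10%, which is why the worst-case bound of such a naive truncation is too high for general use.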
Much of the research on function approximation incorporates a significant amount of mathematical reasoning to argue that the approximation falls within a worst-case error bound in all cases: the Taylor Series example above would not be used in practice because its worst-case bound is too high.
However, there are many applications where such strict requirements are unnecessary and in this case one can make use of simpler (and considerably faster) approximations that are ‘good enough’ Sampson et al. (2013). In this paper we consider one such application: that of activation functions in neural networks. Not only is a neural network inherently tolerant to error but practitioners will admit that there is a certain art to the selection of activation functions (and other hyper-parameters) and thus a certain leeway in their accuracy.
Function approximation has been applied to activation functions before. The implementation of Word2Vec makes use of a lookup table for approximating the exponential function, and the popular machine learning library Theano includes an approximation of the sigmoid function called ultra_fast_sigmoid. Google also recently published a piecewise approximation of their ‘Swish’ activation function for use in their edge-computing TensorFlow interface, which shows the viability of approximation for limited hardware Howard et al. (2019). However, this paper is the first detailed study of approximation in this area, and shows that approximations can be used as drop-in replacements in current models. The approximations we present outperform all of the alternative approximation approaches mentioned.
In this paper we consider a range of approximations with a trade-off between complexity and accuracy. We study the overall impact on training and prediction time for popular neural network architectures. We identify two approximation options for each function: a safe approximation that works in all cases and a ranged approximation that requires some restrictions on its input domain. For all networks tested we find that our safe approximations outperform the standard implementations with a consistent improvement in training times. Where appropriate, our ranged implementations provide a further improvement. Our training and inference benchmarks are initially run on CPU as a proof of concept.
We believe that our safe approximations are of particular relevance to library developers since they are better-performing drop-in replacements for existing functions. There is also a growing range of specialised hardware designed for accelerating machine learning tasks and in future work we would like to consider how well our approximations might translate into hardware implementations.
2 Activation functions
Activation functions are used in neural networks to introduce non-linearity. For activation functions with bounded outputs, the popular choices are logistic, trigonometric and clamping functions, with the sigmoid and hyperbolic tangent (tanh) functions being used most frequently. For unbounded or partially bounded activations, the ReLU function and its variants are most popular.
ReLU is the dominant choice of activation function, but sigmoid and tanh still have significant use as gating functions in feed-forward and recurrent networks.
We focus on sigmoid and tanh in particular as major components of the LSTM's gating behaviour, which in its original form uses three calls to sigmoid and two calls to tanh Graves (2013).
For cases (such as LSTMs) where ReLU is not appropriate the performance impact can be notable: in micro-benchmarks measuring function execution time we found these two functions to be 3.6x and 7.9x more expensive than ReLU. In this paper we focus on sigmoid and tanh given their popularity and relatively high cost.
A significant proportion of CPU time is spent computing values in activation functions. For example, for a densely connected network with 51,000 trainable parameters and two equal-sized hidden layers using such an activation function, we measure that approximately 29% of the time is spent computing the activation function. For context, in the paper which introduced sequence-to-sequence learning for neural networks the authors use a 5-layer LSTM network which has 380 million trainable parameters Sutskever et al. (2014).
3 Function Approximation
A conventional implementation of a mathematical function will aim to compute results which arise from rounding exact values to the precision of the data-type used. For example, implementations of the C Math library often come with documentation stating how many units-in-last-place of error the implementation of a specific function can return in the worst case. An application might be able to tolerate considerably more error than this but it can be difficult to prove conclusively. In this paper we seek to validate our approximations empirically. We believe this is appropriate since the ‘correct’ choice of activation function is most often demonstrated empirically too.
In addition to tolerating error in the function’s outputs an application might also restrict the function’s input domain. This provides further opportunity for approximation since it is only necessary to mimic the original function within the domain of interest. We therefore develop two classes of approximation. Our ranged approximation produces higher performance under the assumption that the input domain is approximately -5.5 to 5.5. Our safe approximation makes no assumptions about input domain and so is a drop-in replacement for all networks.
We consider two different approximation approaches: 1) taking a mathematically derived approximation of the function and then using the parameters to produce a range of alternative versions at varying precision; 2) performing a numerical regression to fit the given function. The mathematically defined approach produces more stable results but can be more complex. The regression approach often produces faster and simpler functions, but the result is heavily determined by the input parameters and the function being replaced.
We first consider an optimisation of ReLU which transforms the max operator (which may use a branch) into a sum and a division. This may change the returned value slightly under the rules of IEEE floating-point arithmetic IEEE (1985), but it avoids the branch if the max function is implemented with an if-statement. In our tests this resulted in an insignificant change and so we use the standard implementation of ReLU as our baseline.
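One standard branch-free rewrite of this kind (we assume this is the intended form: max(0, x) becomes a sum, an absolute value and a division) can be sketched as:

```python
def relu_branch(x):
    # Conventional max(0, x): the comparison may compile to a branch.
    return x if x > 0.0 else 0.0

def relu_branchless(x):
    # max(0, x) rewritten as a sum and a division. abs() is a single
    # sign-bit clear on IEEE floats, so no branch is required.
    # (Assumed form of the rewrite; IEEE results can differ slightly
    # from max(0, x), e.g. for inputs large enough that x + x overflows.)
    return (x + abs(x)) / 2.0
```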
We consider four forms of approximation of tanh, which we show in Figure 1 and describe below.
Firstly, we implement tanh as Lambert's continued fraction. We limit ourselves to a finite number (n) of iterations depending on the precision we want and then simplify using a symbolic algebra package. We call this approximation tanh_cont_n.
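Lambert's continued fraction for tanh is tanh(x) = x / (1 + x² / (3 + x² / (5 + …))). A sketch of the truncated form, evaluated numerically rather than symbolically simplified (the function name is illustrative):

```python
def tanh_cont(x, n=4):
    # Lambert's continued fraction for tanh, truncated after n partial
    # denominators 1, 3, 5, ..., 2n-1 and evaluated bottom-up. The paper
    # simplifies the truncated fraction symbolically; evaluating it
    # directly like this is mathematically equivalent.
    x2 = x * x
    d = 2.0 * n - 1.0                 # innermost denominator
    for k in range(n - 1, 0, -1):
        d = (2.0 * k - 1.0) + x2 / d  # fold in the next level up
    return x / d
```

Truncating at n = 2 and simplifying gives 3x/(3 + x²), the kind of closed form the symbolic simplification step produces; convergence is rapid for inputs near zero.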
We also consider two forms of polynomial approximation of tanh: 1) a Taylor-style polynomial of the form a_1 x + a_2 x^2 + … + a_n x^n; and 2) Padé approximants, a ratio of two such polynomials. We select a desired number of terms (n), choose a uniform sample of 5000 values across the fitting range and then determine the values of the coefficients (a_i) using a least-squares fitting procedure. We choose this range based on the shape of tanh (and sigmoid), which are very close to their asymptotic values by the ends of the range. Using these two techniques yields the approximations tanh_taylor_n and tanh_pade_n_m. Minor variations in the number of points sampled and the range covered did not have a big impact on our results.
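The fitting procedure can be sketched for the smallest interesting case, a two-term odd polynomial, by solving the 2x2 normal equations directly (the basis size, range and function name are illustrative; the paper fits higher-order Taylor and Padé forms with a standard least-squares solver):

```python
import math

def fit_odd_poly2(f, lo=-5.5, hi=5.5, samples=5000):
    # Least-squares fit of f(x) ~= a1*x + a3*x^3 over a uniform sample.
    # Normal equations: [[S(x^2), S(x^4)], [S(x^4), S(x^6)]] [a1, a3]^T
    #                 = [S(x*f), S(x^3*f)]^T, solved by Cramer's rule.
    sx2 = sx4 = sx6 = b1 = b3 = 0.0
    for i in range(samples):
        x = lo + (hi - lo) * i / (samples - 1)
        y = f(x)
        x2 = x * x
        sx2 += x2
        sx4 += x2 * x2
        sx6 += x2 * x2 * x2
        b1 += x * y
        b3 += x2 * x * y
    det = sx2 * sx6 - sx4 * sx4
    return (b1 * sx6 - b3 * sx4) / det, (b3 * sx2 - b1 * sx4) / det

# A crude two-term fit of tanh; more terms (or a Pade ratio of two
# fitted polynomials) track the target far more closely.
a1, a3 = fit_odd_poly2(math.tanh)
```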
As a final option for tanh we consider the serpentine function (serp), a curve of the form y = abx/(x² + a²) with fitted parameters a and b.
As the serp function does not fit tanh beyond the main gradient, we also introduce a variant serp_clamp which is clamped to the values -1 or 1 for inputs outside the central range.
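A sketch of the pair, with an illustrative parameter choice (a = b = 2 gives slope 1 at the origin and a peak of 1, tanh-like near zero; the paper's fitted parameters may differ):

```python
def serp(x):
    # Serpentine curve a*b*x / (x^2 + a^2) with a = b = 2, i.e.
    # 4x / (x^2 + 4). Matches tanh well near 0 but turns back
    # towards 0 beyond the main gradient.
    return 4.0 * x / (x * x + 4.0)

def serp_clamp(x):
    # Clamp to the tanh asymptotes -1 and 1 outside the region
    # where the serpentine tracks the main gradient.
    if x > 2.0:
        return 1.0
    if x < -2.0:
        return -1.0
    return serp(x)
```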
Figure 2 shows three approximation forms for the sigmoid function. These arise from approximating the exponential function within the sigmoid implementation using the well-known limit exp(x) = lim_{n→∞} (1 + x/n)^n, truncated at a finite n. We call approximations of this form sigmoid_fastexp_n.
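A sketch of the idea, assuming power-of-two n so the power can be computed by repeated squaring (the exact implementation details in the paper may differ):

```python
def sigmoid_fastexp(x, n=512):
    # sigmoid(x) = 1 / (1 + exp(-x)), with exp(-x) replaced by the
    # truncated limit (1 - x/n)^n. For n a power of two the power is
    # computed by repeated squaring (9 multiplications for n = 512)
    # instead of a library exp() call.
    e = 1.0 - x / n
    while n > 1:
        e *= e
        n >>= 1
    return 1.0 / (1.0 + e)
```

For small n (e.g. sigmoid_fastexp_2) the approximation collapses outside a narrow input range, which is why those variants are only suitable as ranged approximations.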
We also generate fits for Taylor and Padé polynomials in the same manner as for tanh. This yields the approximations sigmoid_taylor_n and sigmoid_pade_n_m.
We note that there is an alternative approach to the fast computation of the exponential function based on exploiting the definition of IEEE floating-point numbers Schraudolph (1999). This is no longer as effective as it once was due to its reliance on a union structure to treat the same value as an integer in some steps and as a floating-point number in others. The technique also does not transfer easily to SIMD processing Nassimi and Sahni (1981), where arrays of numbers are operated on in parallel and it is not always trivial to cast between different type representations in memory.
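The bit-level trick can be sketched as follows, using Schraudolph's published constants for double precision; Python's struct round-trip plays the role of the C union (constants and tolerances are from Schraudolph (1999), not from this paper):

```python
import struct

EXP_A = 1048576 / 0.6931471805599453   # 2^20 / ln(2)
EXP_B = 1072693248                     # 1023 * 2^20, the exponent bias shift
EXP_C = 60801                          # Schraudolph's error-centring constant

def fast_exp(x):
    # Write a scaled-and-shifted copy of x into the high 32 bits of an
    # IEEE double, so the exponent field does the work of exp(). The
    # int-to-float reinterpretation below is the 'union' step that makes
    # the technique awkward to port to SIMD lanes.
    hi = int(EXP_A * x) + (EXP_B - EXP_C)
    return struct.unpack("<d", struct.pack("<ii", 0, hi))[0]
```

The result is only accurate to a few percent, which is adequate for some neural applications but far from the correctly rounded libm behaviour discussed above.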
4 Performance results
In practice it is the performance of the whole network which is important rather than a particular activation function. For example, when training a neural network the ideal choice of activation function will result in the lowest loss for the least training time. This means that when selecting an approximation we are looking for a trade-off between the computation cost and the resulting error. Fast approximations with high error might cause a network to take longer to converge than slower approximations with a lower error.
It is currently not possible to analytically determine the best choice of activation function for some network architecture. Similarly we are unable to prove that a particular approximation is a better choice in all cases. Instead we seek to empirically justify our choices through end-to-end measurements on three popular representative machine learning tasks covering three popular network architectures.
We consider the task of classifying images in the MNIST dataset and use a network comprising convolutional and dense layers inspired by the design of successful neural network structures Simard et al. (2003), based on the implementation from a selection of provided models for Flux Innes (2018). While ReLU is sometimes used for convolutional image classification tasks for its speed and simplicity, tanh and sigmoid have been used more commonly in the past Kalchbrenner et al. (2014); Lawrence et al. (1997); Krizhevsky et al. (2012) and have some desirable properties which can reduce training times in some scenarios Ciresan et al. (2011).
MNIST Autoencoder. Autoencoders are common in the area of generative machine learning and are often evaluated using the MNIST dataset LeCun et al. (1998). We implemented a simple autoencoder to work with this dataset. Autoencoders are compatible with many different activations when structured with different configurations Szegedy et al. (2013); Burda et al. (2015); Ng and others (2011); Vincent et al. (2008); Toderici et al. (2015); Chen et al. (2012). Our model is the simplest example of an autoencoder and as such is compatible with all of the activation functions we consider.
Sequence-to-sequence problems are another common task. We selected a common LSTM network layout with 2 hidden LSTM layers for text generation that takes a text dataset and learns to generate similar text. LSTM cells make use of both sigmoid and tanh activation functions and so we considered approximations to both.
We used each approximation in turn to train our three networks for a fixed number of epochs, recording the loss and the total time taken. We then compared these values to those for the non-approximated activation functions to compute the relative loss and relative time taken (Table 1). Relative values are with respect to the standard implementation of the function being replaced, and smaller relative values indicate better performance. We used an Intel Xeon E5-2673 v3 @ 2.40GHz (14GB RAM) for the MNIST workloads and a dual-core Intel Xeon E5-2673 v4 @ 2.30GHz (8GB RAM) for the RNN workload. We ran our benchmarks on Azure cloud machines and we provide virtual machine images for Azure which replicate our results (https://drive.google.com/file/d/1trqpemv9BScwt88Xd69zpZKGWM2RMXs3/view?usp=sharing).
In most cases our replacement activation functions either converged with similar loss to the original function or failed to converge entirely (marked in the table). We found a few instances (such as sigm_fastexp_2 in Convnet) for which the loss was drastically lower (44%). We highlight this case as another example of the difficulty of making definitive statements about the correct choice of activation function. In all cases where the training loss converged our approximations resulted in a reduced overall training time.
We found that two functions (sigm_fastexp_512 and tanh_pade_4_4) produced the best reduction in training time whilst working in all cases. We mark these as our chosen safe approximations. We also found functions (such as sigm_fastexp_2 and serp) which produced even larger reductions in training times but which diverge so significantly from the target functions that we cannot argue they are a suitable replacement in all cases. We mark these as ranged approximations, suitable for networks (such as these) where the activation input values lie roughly within the range -5.5 to 5.5.
We also found performance improvements when using approximations for inference rather than training. For the character-based RNN (LSTM) model our safe approximations offered ~8% savings, whereas our ranged approximations allowed savings of up to 20% when performing 1000 sequential inferences. This is potentially more significant than the reduction in training times since inference is often performed on restricted hardware such as mobile devices.
5 Comparison with ultra_fast_sigmoid
ultra_fast_sigmoid is a fast sigmoid implementation in the popular machine learning library Theano Bergstra et al. (2010). It is implemented as a piecewise approximation and is one of only a few approximations available as standard in popular machine learning libraries.
We compared ultra_fast_sigmoid to sigm_fastexp_512. Figure 3 shows the much greater approximation error for ultra_fast_sigmoid. For the Autoencoder and CharRNN workloads sigm_fastexp_512 results in lower loss in less time. For the Convnet workload sigm_fastexp_512 is slightly slower but with drastically lower loss.
6 Comparison with Word2Vec lookup table
Word2Vec makes use of shallow 2-layer networks to create word embeddings. In these models the authors used an approximated sigmoid implementation based on a 1000-element lookup table. We compared this lookup table to our approximations in our test models.
In our results (Figure 4) we see that in some cases training fails to make progress. We believe this occurs because the lack of interpolation in the table lookup quantises the outputs: small incremental changes to the weight values may not produce any change in output. This could potentially be mitigated by using a different optimiser or by adding interpolation (at the cost of some performance).
Again, sigmoid_fastexp_512 results in less loss in less training time.
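The quantisation issue and the interpolation fix can be sketched as follows (the table size and input range are illustrative constants in the spirit of Word2Vec's 1000-entry table, not taken from its source):

```python
import math

TABLE_SIZE, MAX_X = 1000, 6.0
# Precomputed sigmoid values on a uniform grid over [-MAX_X, MAX_X].
TABLE = [1.0 / (1.0 + math.exp(-(-MAX_X + 2.0 * MAX_X * i / (TABLE_SIZE - 1))))
         for i in range(TABLE_SIZE)]

def sigmoid_lut(x):
    # Single-entry lookup: outputs are quantised to TABLE_SIZE levels,
    # so a small change in x (from a small weight update) may not
    # change the output at all.
    if x <= -MAX_X:
        return TABLE[0]
    if x >= MAX_X:
        return TABLE[-1]
    i = int((x + MAX_X) / (2.0 * MAX_X) * (TABLE_SIZE - 1))
    return TABLE[min(i, TABLE_SIZE - 1)]

def sigmoid_lut_interp(x):
    # Linear interpolation between adjacent entries removes the
    # quantisation, at the cost of a few extra operations per call.
    if x <= -MAX_X:
        return TABLE[0]
    if x >= MAX_X:
        return TABLE[-1]
    t = (x + MAX_X) / (2.0 * MAX_X) * (TABLE_SIZE - 1)
    i = min(int(t), TABLE_SIZE - 2)
    frac = t - i
    return TABLE[i] * (1.0 - frac) + TABLE[i + 1] * frac
```

Two nearby inputs can map to the same table entry under plain lookup but yield distinct interpolated values, which is exactly the behaviour that matters for small gradient updates.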
7 Approximations in TensorFlow
TensorFlow is one of the most commonly used machine learning libraries. Despite being a Python library it achieves high performance through native implementations of its core functions. This makes experimentation with novel activation functions difficult: one must inject a low-level implementation and then provide a mechanism to reference it from the high-level code. Flux does not suffer from this issue because the entire system is implemented in Julia (a relatively high-performance language) and alternative activation functions can be straightforwardly applied on a level playing field with the default options.
Despite being unable to directly evaluate our new activation functions in TensorFlow we were able to identify the optimisation space by measuring the performance of the simplest possible activation function, the identity function.
Figure 5 shows the time taken to train a simple MNIST classifier (one convolutional layer and two dense layers) when using a standard activation function and the identity function in both TensorFlow (left) and Flux (right). The approximations we have discussed in this paper fall within the region between these two lines. Even with this simple model, this demonstrates that there is scope for improving model performance in TensorFlow by optimising activation functions.
8 Threats to validity
Although we have tried to evaluate our activation functions on three representative workloads it is not possible to say for sure how well they will work in the general case. However, the relative errors in our safe approximations are so small that we would expect them to be drop-in replacements.
Neural networks are commonly trained offline on large compute clusters whereas inference happens with interactive latencies and increasingly on limited hardware (such as mobile phone handsets). The majority of our results focus on training times because training loss provides a convenient measurement to check that the activation function is performing well. However, we also found that our approximations improve inference and mention this in Section 4.
Our results only consider the performance on CPUs whereas much training (and inference) happens on GPUs or specialist hardware such as TPUs. We therefore cannot say how our approximations perform in these circumstances. We argue that our safe approximations are useful even if only applied to CPUs since they generally reduce training times with no impact on loss. It would be particularly interesting to consider hardware implementations of these approximations for specialist hardware. We leave this for future work.
9 Conclusion

We have shown that approximation of activation functions in neural networks can improve the training and inference time without a negative effect on the accuracy of the network.
We investigated a range of functions and propose two safe approximations, sigm_fastexp_512 and tanh_pade_4_4, for sigmoid and tanh respectively. These approximations produce faster training and inference times without damaging prediction accuracy. Our ranged approximations produced even larger speedups but will not work for all networks.
As such we think these functions are candidates for inclusion as standard options in machine learning libraries. We also showed that these functions outperform existing approximations in these libraries.
In the future we hope to expand on this work by integrating approximation as a standard option into many machine learning libraries so that it can be used to improve training time on a larger scale. Additionally, we wish to analyse the effect of approximations on larger and more complex networks and hardware, to understand whether there are network structures for which approximation is not beneficial.
-  (2016) Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283. Cited by: §1.
-  (2010) Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for scientific computing conference (SciPy), Vol. 4. Cited by: §5.
-  (2015) Importance weighted autoencoders. arXiv preprint arXiv:1509.00519. Cited by: §4.
-  (2012) Marginalized denoising autoencoders for domain adaptation. arXiv preprint arXiv:1206.4683. Cited by: §4.
-  (2011) Flexible, high performance convolutional neural networks for image classification. In Twenty-Second International Joint Conference on Artificial Intelligence. Cited by: §4.
-  (2004) Fast correct rounding of elementary functions in double precision using double-extended arithmetic. Ph.D. Thesis, INRIA. Cited by: §1.
-  (2005) Towards the post-ultimate libm. In 17th IEEE Symposium on Computer Arithmetic (ARITH’05), pp. 288–295. Cited by: §1.
-  (2004) Proposal for a standardization of mathematical function implementation in floating-point arithmetic. Numerical algorithms 37 (1-4), pp. 367–375. Cited by: §1.
-  (2013) Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850. Cited by: §2.
-  (2019) Searching for mobilenetv3. arXiv preprint arXiv:1905.02244. Cited by: §1.
-  (1985) IEEE standard for binary floating-point arithmetic. Institute of Electrical and Electronics Engineers, New York. Note: Standard 754–1985. Cited by: §3.1.
-  (2018) Flux: elegant machine learning with julia. Journal of Open Source Software 3 (25), pp. 602. Cited by: §4.
-  (2014) A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188. Cited by: §4.
-  (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §4.
-  (1997) Face recognition: a convolutional neural-network approach. IEEE transactions on neural networks 8 (1), pp. 98–113. Cited by: §4.
-  (1998) MNIST dataset. URL http://yann.lecun.com/exdb/mnist. Cited by: §4.
-  (1981) Data broadcasting in SIMD computers. IEEE Transactions on Computers 100 (2), pp. 101–107. Cited by: §3.3.
-  (2011) Sparse autoencoder. CS294A Lecture notes 72 (2011), pp. 1–19. Cited by: §4.
-  (2013) Good-enough computing. IEEE Spectrum 50 (10), pp. 54–59. Cited by: §1.
-  (1999) A fast, compact approximation of the exponential function. Neural Computation 11 (4), pp. 853–862. Cited by: §3.3.
-  (2003) Best practices for convolutional neural networks applied to visual document analysis.. In Icdar, Vol. 3. Cited by: §4.
-  (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §2.
-  (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §4.
-  (2015) Variable rate image compression with recurrent neural networks. arXiv preprint arXiv:1511.06085. Cited by: §4.
-  (2008) Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pp. 1096–1103. Cited by: §4.